Post-Project Details | PenguinPro.ca

Hello again, I have been a little less communicative recently but I wanted to detail some post-project information. For those of who who may not be aware I have been attempting to optimize an OpenSource software project for a course that I have taken called “software portability and optimization.” To see my previous posts about this project you can check the link here:

https://penguinpro.ca/category/school-work/project/

If you have been following the project, in my final post:

https://penguinpro.ca/category/school-work/project/project-finalization

You can see my project wrap up and reflection. For this post however I am going to kind of look back on the work that has been done, detail some of the things that I may have left out; and provide detail for different options that I tried over the course of this project in a more technical way.

The actual optimization itself took some effort and was an excellent learning experience. I found myself in many rabbit holes spending days researching the topics I love, however the part that I loved the most about this project was the time dedicated to creating the “scaffolding” required to make the optimization work on multiple platforms without having to recompile from source.

This is where the project became the most fun.

In attempting to do this I created 3 different methods of accomplishing the tasks, all with their own benefits and difficulties. I will detail each of them here.

Option one:

For this first option, each architectures optimization solution was put into their own function with parameters that are accessed by reference. From there a function jmp table was created and the address of the function was appended. As you will see by doing this we can easily call the function by an index, or offset from the beginning of the array; the benefit is that we are not evaluating conditions at every run. The downside is that we still have the overhead of calling the function In the first place.

This is also the option that I opted to put up publically, It had about 1-2% reduction in optimization, which was the best option in testing. Here is the code bits:

Function declarations:

// optimization options:

void charSearch_C(const uint32_t length, const uint32_t l, uint32_t *j, uint32_t *off,
                uint8_t *found, const unsigned char *bp, const unsigned char *pt);

void charSearch_sse4_2(const uint32_t length, const uint32_t l, uint32_t *j, uint32_t *off,
         uint8_t *found, const unsigned char *bp, const unsigned char *pt);

Function JMP table:

// function table:

void (* const stringSearch[]) (const uint32_t, const uint32_t, uint32_t *,
                    uint32_t *, uint8_t *, const unsigned char *, const unsigned char *)
                                    = {charSearch_C, charSearch_sse4_2};

You can see the parameters and how they relate directly with the function table, which is an array of function pointers; you can learn more about this concept for yourself here:

https://barrgroup.com/Embedded-Systems/How-To/C-Function-Pointers

Function definitions:

Here just for the sake of having them included, (even though they do not really change across calling conventions) you can see the actual definitions:

void charSearch_C(const uint32_t length, const uint32_t l, uint32_t *j, uint32_t *off,
                   uint8_t *found, const unsigned char *bp, const unsigned char *pt)

{

                // Original program code, just moved here.
                *found = 1;
                for (*j = 0; *j < l && *off < length; *j++, *off++) {
                    if (bp[*j] != pt[*j]) {
                                *found = 0;
                                break;
                    }
                }
}

void charSearch_sse4_2(const uint32_t length, const uint32_t l, uint32_t *j, uint32_t *off,
                        uint8_t *found, const unsigned char *bp, const unsigned char *pt)
{
    *found = 0;
    for (*j = 0; *j < length && *j < l ;)
    {
       // make sure that we are not going past either end of any array.
       if (*j >= length - 16 || *j >= l -16)
       {
                  if ( length >= 16 && l >= 16)
                  {
                     *j = (length < l) ? length -16 : l -16;
                     __m128i xbp = _mm_loadu_si128((__m128i*) &bp[*j]);
                     __m128i xpt = _mm_loadu_si128((__m128i*) &pt[*j]);
                     *j += _mm_cmpistri(xbp, xpt,
                                _SIDD_UBYTE_OPS |_SIDD_CMP_EQUAL_EACH | _SIDD_NEGATIVE_POLARITY);
                     break;
                  }
       }

       __m128i xbp = _mm_loadu_si128((__m128i*) &bp[*j]);
       __m128i xpt = _mm_loadu_si128((__m128i*) &pt[*j]);
       uint8_t y = _mm_cmpistri(xbp, xpt,
                    _SIDD_UBYTE_OPS |_SIDD_CMP_EQUAL_EACH | _SIDD_NEGATIVE_POLARITY);
       *j += y;

       // if y is not 16 then either the end of the string is reached or
       // a character miss-match has been detected.
       if ( y != 16 )
                       break;
    }
       // set found and offset based on our results.
       *found = ( length == l && *j == length ) ? 1 : 0;
       *off += *j;
}

Option 2:

For this option the code was modified to still use the functions but instead of a function table being directly referenced with in the calling function, ifunc was used to create a resolver function. This allowed us to have two full identical functions with the optimization modification added to each.

// Use ifunc to seemlessly call appropriate function

cl_error_t cli_bm_scanbuff(const unsigned char *buffer, uint32_t length, 
const char **virname, const struct cli_bm_patt **patt, const struct cli_matcher *root, 
uint32_t offset, const struct cli_target_info *info, struct cli_bm_off *offdata,
 cli_ctx *ctx)
                __attribute__ ((ifunc ("resolve")));

// jmp table for resolver function:

cl_error_t (* const scanBuffs[]) (const unsigned char *buffer, uint32_t length,
 const char **virname, const struct cli_bm_patt **patt, const struct cli_matcher *root,
 uint32_t offset, const struct cli_target_info *info, struct cli_bm_off *offdata,
 cli_ctx *ctx)
                = {cli_bm_scanbuff_sse4_2, cli_bm_scanbuff_C};

In the above you can see that the __attribute__ tag with gcc allows us to keep a seemless programming interface for the function so as we do not have to modify any calls. The function table is used in the resolve function to choose the appropriate logic for the architecture the build may be running on. You can see it working here:

// Used to choose which function version to resolve to at run time.

// implements ifunc.
static void (*resolve (void)) (void)
{
                // return index in jmp table, dependant on hw capabilities.
                return scanBuffs[0];
}

I feel there are many practical use cases outside of hardware detection that ifunc would be useful for outside of this project. This is something I am happy that I was introduced to.

Option 3:

This is where things start to get a bit “strange” and where I started to have a lot of fun. In the code I have bolded the areas of interest, as most of this is just an inline version of the previous code. However instead of function calls we are using goto statements instead. Take a look at it first and then I will try to explain what is going on.

// function table:

   void (* const search[])() = {&&c_search, &&sse4_2};

// UNCOMMENT "if" for invalid option prot:
// if (runtime_detection < OPTIMIZE_MAX_OPTIONS)

   goto *search[runtime_detection]; 

c_search:
   found = 1;
   for (j = 0; j < l && off < length; j++, off++) {
   if (bp[j] != pt[j]) {
      found = 0;
      break;
   }
}

   goto exit;

sse4_2:
   for (j = 0; j < length && j < l && off < length ;)
   {
   // check, if index is within range.
   if (j >= length - 16 || j >= l -16)
   {
      if ( length >= 16 && l >= 16)
      {
         j = (length < l) ? length -16 : l -16;
         __m128i xbp = _mm_loadu_si128((__m128i*) &bp[j]);
         __m128i xpt = _mm_loadu_si128((__m128i*) &pt[j]);
         j += _mm_cmpistri(xbp, xpt,
         _SIDD_UBYTE_OPS |_SIDD_CMP_EQUAL_EACH | _SIDD_NEGATIVE_POLARITY);
         break;
      }

   }
      __m128i xbp = _mm_loadu_si128((__m128i*) &bp[j]);
      __m128i xpt = _mm_loadu_si128((__m128i*) &pt[j]);
      uint8_t y = _mm_cmpistri(xbp, xpt,
         _SIDD_UBYTE_OPS |_SIDD_CMP_EQUAL_EACH | _SIDD_NEGATIVE_POLARITY);
      j += y;

   // if y is not 16 then either the end of the string is reached or
   // a character miss-match has been detected.
      if ( y != 16 )
         break;
   }

   // set found and offset based on our results.
   found = ( length == l && j == length ) ? 1 : 0;
   off += j;

exit:

Okay, to explain this one let me refer you back to a previous post where I was doing some practice/play with the address of a label. Note that I am still playing with this source, and as of currently I have found a way to do nonlocal goto’s using the address of a label and even infinitely loop. It still is not practical as I still have not found a way to predictably grab entities at offsets within the stack (think local variables.) The neatest thing I found about this is that label addresses are address offsets from the instruction pointer, and NOT offset from the entry point of the routine. I will try my best to continue to work with it. Anyways, here’s the link to that post:

https://penguinpro.ca/2019/11/26/project-detour/

Now for the above code:

void (* const search[])() = {&&c_search, &&sse4_2};

This loads the addresses of the labels into an array of function pointers, note however we do not call them with the function operator or with “brackets” doing so causes a call instruction that pushes an address and saved frame pointer to the stack which must be cleaned up. To avoid this we use:

goto *search[runtime_detection];

We need to dereference the address within the pointer to jmp execution to that point. My current “play” situation is attempting to call a function this way and manage the stack myself.

Also note that runtime_detection is a global variable set by the optimization initializer function which you will see later in this post.

Next up, the first label is c_search: which requires that we have an exit: label. The exit label is used as a forward jmp to skip past any code we don’t want to execute. And finally we have sse4_2: This is the label for the start of execution of our SSE optimization code.

This option seems to have minimal overhead, minus the loading of the “label address table” at every function call. I assume this is required because the labels offsetting from the instruction pointer needs a point to catch a reference, else it’s out of scope. I may also be totally wrong, label scope at the lower level is something I will need to do more research on. Either way the reason I did not include this option in the public repository is that it was sometimes unpredictable in which branch it would take? It didn’t crash the program but without further analysis (of which there was not enough time to do within the remaining course time,) it was too unstable for a realistic use at this time. As stated multiple times, I am hopeful to continue my work on this.

Runtime Detection

This is the code that is used to determine the architecture at run time, it uses gcc built in functions to request information from the system. My other thoughts for this would be to write a separate HW_detect module that would be able to query the /proc filesystem for attributes as a good safe portable method for runtime hardware detection, time being a factor again makes me push this to my own stack of personal project ideas. Maybe I should create a “public” type of todo list.

Anyways back to the code:

// Functions for instrinsics jump table:
void optimize_scanbuf_init()
{
                // default to C code.
                uint8_t runtime_detection = OPTIMIZE_USE_DEFAULT;

                // set jmp table index:
                if (__builtin_cpu_supports("sse4.2"))
                                runtime_detection = OPTIMIZE_HAS_SSE4_2;

                // This one is only here for an example:
                if (__builtin_cpu_supports("avx"))
                      runtime_detection = OPTIMIZE_HAS_AVX;
}

Which is set to be called in a high level function in libclamav, specifically when the database is initial ally read, by doing this, the init function is only called once and any error should cause it to fall back to default C.

Final Project Overview:

This project has been one of the most fun and challenging things I have done in quite some time. After many late nights and reading countless hours, all with no idea whether it would ever work, I feel I have accomplished something I wouldn’t otherwise be able to.

Also I know that I have not posted much of the code in previous posts as much of it was still experimental for me; hopefully this post will clear some things up and help anyone who is reading it move forward with their own personal projects.

I have really enjoyed having you as my readers and hope to continue on with this blog into the future. It’s time to shut down the test machine; and say thank you to everyone involved on this journey, I won’t forget it.

With Love, always.

ElliePenguins