Category Archives: School Work

Post-Project Details

Hello again, I have been a little less communicative recently, but I wanted to detail some post-project information. For those of you who may not be aware, I have been attempting to optimize an open-source software project for a course I have taken called "Software Portability and Optimization." To see my previous posts about this project you can check the link here:

https://penguinpro.ca/category/school-work/project/

If you have been following the project, in my final post:

https://penguinpro.ca/category/school-work/project/project-finalization

You can see my project wrap-up and reflection there. In this post, however, I am going to look back on the work that has been done, cover some of the things I may have left out, and describe the different options I tried over the course of this project in a more technical way.

The actual optimization itself took some effort and was an excellent learning experience. I found myself in many rabbit holes, spending days researching topics I love; however, the part I loved the most about this project was the time dedicated to creating the "scaffolding" required to make the optimization work on multiple platforms without having to recompile from source.

This is where the project became the most fun.

In attempting to do this I created three different methods of accomplishing the task, each with its own benefits and difficulties. I will detail each of them here.

 

Option one:

For this first option, each architecture's optimization was put into its own function, with parameters accessed by reference. From there a function jump table was created and the address of each function was appended. As you will see, by doing this we can easily call the function by an index, or offset from the beginning of the array; the benefit is that we are not evaluating conditions at every run. The downside is that we still have the overhead of calling the function in the first place.

This is also the option that I opted to publish publicly. It reduced the optimization by about 1-2%, which was still the best of the options in testing. Here are the code bits:

Function declarations:

// optimization options:

void charSearch_C(const uint32_t length, const uint32_t l, uint32_t *j, uint32_t *off,
                uint8_t *found, const unsigned char *bp, const unsigned char *pt);

void charSearch_sse4_2(const uint32_t length, const uint32_t l, uint32_t *j, uint32_t *off,
         uint8_t *found, const unsigned char *bp, const unsigned char *pt);

 

Function JMP table:

// function table:

void (* const stringSearch[]) (const uint32_t, const uint32_t, uint32_t *,
                    uint32_t *, uint8_t *, const unsigned char *, const unsigned char *)
                                    = {charSearch_C, charSearch_sse4_2};

 

You can see the parameters and how they relate directly with the function table, which is an array of function pointers; you can learn more about this concept for yourself here:

https://barrgroup.com/Embedded-Systems/How-To/C-Function-Pointers
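For completeness, the call site then just indexes the table; something like this (runtime_detection is the global index described further down, and the argument names are the ones from the declarations above):

// Dispatch through the table: index 0 is the portable C version, index 1 the
// SSE4.2 version. runtime_detection is set once by the init function shown
// later in this post, so no per-call condition is evaluated here.
stringSearch[runtime_detection](length, l, &j, &off, &found, bp, pt);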

 

Function definitions:

Here, just for the sake of having them included (even though they do not really change between the dispatch options), you can see the actual definitions:

 

void charSearch_C(const uint32_t length, const uint32_t l, uint32_t *j, uint32_t *off,
                   uint8_t *found, const unsigned char *bp, const unsigned char *pt)

{

                // Original program code, just moved here.
                *found = 1;
                for (*j = 0; *j < l && *off < length; *j++, *off++) {
                    if (bp[*j] != pt[*j]) {
                                *found = 0;
                                break;
                    }
                }
}

void charSearch_sse4_2(const uint32_t length, const uint32_t l, uint32_t *j, uint32_t *off,
                        uint8_t *found, const unsigned char *bp, const unsigned char *pt)
{
    *found = 0;
    for (*j = 0; *j < length && *j < l ;)
    {
       // make sure that we are not going past either end of any array.
       if (*j >= length - 16 || *j >= l -16)
       {
                  if ( length >= 16 && l >= 16)
                  {
                     *j = (length < l) ? length -16 : l -16;
                     __m128i xbp = _mm_loadu_si128((__m128i*) &bp[*j]);
                     __m128i xpt = _mm_loadu_si128((__m128i*) &pt[*j]);
                     *j += _mm_cmpistri(xbp, xpt,
                                _SIDD_UBYTE_OPS |_SIDD_CMP_EQUAL_EACH | _SIDD_NEGATIVE_POLARITY);
                     break;
                  }
       }

       __m128i xbp = _mm_loadu_si128((__m128i*) &bp[*j]);
       __m128i xpt = _mm_loadu_si128((__m128i*) &pt[*j]);
       uint8_t y = _mm_cmpistri(xbp, xpt,
                    _SIDD_UBYTE_OPS |_SIDD_CMP_EQUAL_EACH | _SIDD_NEGATIVE_POLARITY);
       *j += y;

       // if y is not 16 then either the end of the string has been reached or
       // a character mismatch has been detected.
       if ( y != 16 )
                       break;
    }
       // set found and offset based on our results.
       *found = ( length == l && *j == length ) ? 1 : 0;
       *off += *j;
}

Option 2:

For this option the code was modified to still use functions, but instead of a function table being referenced directly within the calling function, ifunc was used to create a resolver function. This allowed us to keep two complete, nearly identical copies of the function, one with the optimization modification applied.

 

// Use ifunc to seamlessly call the appropriate function

cl_error_t cli_bm_scanbuff(const unsigned char *buffer, uint32_t length, 
const char **virname, const struct cli_bm_patt **patt, const struct cli_matcher *root, 
uint32_t offset, const struct cli_target_info *info, struct cli_bm_off *offdata,
 cli_ctx *ctx)
                __attribute__ ((ifunc ("resolve")));

 

// jmp table for resolver function:

cl_error_t (* const scanBuffs[]) (const unsigned char *buffer, uint32_t length,
 const char **virname, const struct cli_bm_patt **patt, const struct cli_matcher *root,
 uint32_t offset, const struct cli_target_info *info, struct cli_bm_off *offdata,
 cli_ctx *ctx)
                = {cli_bm_scanbuff_sse4_2, cli_bm_scanbuff_C};

 

In the above you can see that the __attribute__ tag with gcc allows us to keep a seamless programming interface for the function, so we do not have to modify any calls. The function table is used in the resolver function to choose the appropriate logic for the architecture the build may be running on. You can see it working here:

// Used to choose which function version to resolve to at run time.

// implements ifunc.
static void (*resolve (void)) (void)
{
                // return the appropriate jump-table entry, dependent on hw capabilities.
                return scanBuffs[0];
}
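The resolver above is hard-coded to index 0 for brevity; a fuller version would make the same __builtin_cpu_supports() check as the init function shown later. A sketch of what that might look like (not the exact code I shipped):

// Sketch only: pick the jump-table entry from CPU capabilities.
// Index 0 is the SSE4.2 copy, index 1 the plain C copy (see scanBuffs above).
static void (*resolve(void))(void)
{
    __builtin_cpu_init();   // required before __builtin_cpu_supports in a resolver
    if (__builtin_cpu_supports("sse4.2"))
        return (void (*)(void)) scanBuffs[0];
    return (void (*)(void)) scanBuffs[1];
}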

 

I feel there are many practical use cases for ifunc beyond hardware detection and beyond this project. This is something I am happy to have been introduced to.

 

Option 3:

This is where things start to get a bit "strange" and where I started to have a lot of fun. In the code I have bolded the areas of interest, as most of this is just an inline version of the previous code; however, instead of function calls we are using goto statements. Take a look at it first and then I will try to explain what is going on.

 

// function table:

   void (* const search[])() = {&&c_search, &&sse4_2};

// UNCOMMENT the "if" below for invalid-option protection:
// if (runtime_detection < OPTIMIZE_MAX_OPTIONS)

   goto *search[runtime_detection]; 

c_search:
   found = 1;
   for (j = 0; j < l && off < length; j++, off++) {
   if (bp[j] != pt[j]) {
      found = 0;
      break;
   }
}

   goto exit;

sse4_2:
   for (j = 0; j < length && j < l && off < length ;)
   {
   // check, if index is within range.
   if (j >= length - 16 || j >= l -16)
   {
      if ( length >= 16 && l >= 16)
      {
         j = (length < l) ? length -16 : l -16;
         __m128i xbp = _mm_loadu_si128((__m128i*) &bp[j]);
         __m128i xpt = _mm_loadu_si128((__m128i*) &pt[j]);
         j += _mm_cmpistri(xbp, xpt,
         _SIDD_UBYTE_OPS |_SIDD_CMP_EQUAL_EACH | _SIDD_NEGATIVE_POLARITY);
         break;
      }

   }
      __m128i xbp = _mm_loadu_si128((__m128i*) &bp[j]);
      __m128i xpt = _mm_loadu_si128((__m128i*) &pt[j]);
      uint8_t y = _mm_cmpistri(xbp, xpt,
         _SIDD_UBYTE_OPS |_SIDD_CMP_EQUAL_EACH | _SIDD_NEGATIVE_POLARITY);
      j += y;

   // if y is not 16 then either the end of the string has been reached or
   // a character mismatch has been detected.
      if ( y != 16 )
         break;
   }

   // set found and offset based on our results.
   found = ( length == l && j == length ) ? 1 : 0;
   off += j;

exit:

 

Okay, to explain this one let me refer you back to a previous post where I was doing some practice/play with the address of a label. Note that I am still playing with this source, and I have currently found a way to do non-local gotos using the address of a label, and even to loop infinitely. It still is not practical, as I have not found a way to predictably grab entities at offsets within the stack (think local variables). The neatest thing I found is that label addresses are offsets from the instruction pointer, and NOT offsets from the entry point of the routine. I will try my best to continue working with it. Anyways, here's the link to that post:

https://penguinpro.ca/2019/11/26/project-detour/

Now for the above code:

void (* const search[])() = {&&c_search, &&sse4_2};

This loads the addresses of the labels into an array of function pointers. Note, however, that we do not call them with the function operator or "brackets": doing so generates a call instruction, which pushes a return address (and typically a saved frame pointer) onto the stack that must then be cleaned up. To avoid this we use:

goto *search[runtime_detection];

We need to dereference the address held in the pointer to jump execution to that point. My current "play" situation is attempting to call a function this way and manage the stack myself.
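Stripped of the project code, the mechanism is just GCC's "labels as values" extension; a minimal, self-contained version of the pattern looks like this:

#include <stdio.h>

int main(void)
{
    void *table[] = {&&path_a, &&path_b};   // label addresses (GCC extension)
    int which = 1;                          // pretend this came from hardware detection

    goto *table[which];   // a plain jmp: no call, nothing pushed onto the stack

path_a:
    puts("path a");
    goto done;

path_b:
    puts("path b");

done:
    return 0;
}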

Also note that runtime_detection is a global variable set by the optimization initializer function which you will see later in this post.

Next up, the first label is c_search:, which requires that we also have an exit: label. The exit label is used as a forward jump to skip past any code we don't want to execute. And finally we have sse4_2:, the label for the start of our SSE-optimized code.

This option seems to have minimal overhead, minus the loading of the "label address table" at every function call. I assume this is required because the labels, being offsets from the instruction pointer, need a point to catch a reference, or else they are out of scope. I may also be totally wrong; label scope at the lower level is something I will need to research further. Either way, the reason I did not include this option in the public repository is that it was sometimes unpredictable in which branch it would take. It didn't crash the program, but without further analysis (for which there was not enough time remaining in the course) it was too unstable for realistic use at this time. As stated multiple times, I am hopeful to continue my work on this.
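One small tweak I have not benchmarked yet: the GCC documentation's own labels-as-values example declares the table static, which would avoid rebuilding it on every call (with the caveat the docs give about inlined or cloned functions):

   // Possible tweak (untested here): build the label-address table once.
   // Label addresses have type void *, so a void * array is the safer type.
   static void *const search[] = {&&c_search, &&sse4_2};

   goto *search[runtime_detection];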

 

Runtime Detection

This is the code that is used to determine the architecture at run time; it uses gcc built-in functions to request information from the system. My other thought would be to write a separate HW_detect module that could query the /proc filesystem for attributes, as a safe and portable method for runtime hardware detection; time being a factor again, that gets pushed onto my own stack of personal project ideas. Maybe I should create a "public" type of todo list.
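As a rough idea of what that HW_detect module might look like, here is a sketch that scans /proc/cpuinfo for a feature flag such as "sse4_2" (hypothetical code, not part of the project; a real version would want to match whole flag tokens rather than substrings):

#include <stdio.h>
#include <string.h>

// Sketch: return 1 if /proc/cpuinfo lists the given flag, 0 otherwise.
static int cpu_has_flag(const char *flag)
{
    char line[1024];
    int found = 0;
    FILE *fp = fopen("/proc/cpuinfo", "r");

    if (!fp)
        return 0;                       // unreadable: fall back to the C path

    while (fgets(line, sizeof(line), fp)) {
        if (strncmp(line, "flags", 5) == 0 && strstr(line, flag) != NULL) {
            found = 1;
            break;
        }
    }
    fclose(fp);
    return found;
}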

Anyways back to the code:

// Functions for intrinsics jump table:
void optimize_scanbuf_init()
{
                // default to C code. runtime_detection is the global described
                // above, so assign it here rather than declaring a new local
                // that would shadow it.
                runtime_detection = OPTIMIZE_USE_DEFAULT;

                // set jmp table index:
                if (__builtin_cpu_supports("sse4.2"))
                                runtime_detection = OPTIMIZE_HAS_SSE4_2;

                // This one is only here for an example:
                if (__builtin_cpu_supports("avx"))
                      runtime_detection = OPTIMIZE_HAS_AVX;
}

This is set to be called from a high-level function in libclamav, specifically when the database is initially read. By doing this, the init function is only called once, and any error should cause it to fall back to the default C code.
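For reference, the OPTIMIZE_* names used above are nothing more than jump-table indices plus the runtime_detection global. They might be declared along these lines (illustrative only; the numeric order has to match whichever table is being indexed, and that order actually differs between the options above):

// Illustrative only: indices into the dispatch table in use.
enum {
    OPTIMIZE_USE_DEFAULT,    // portable C fallback
    OPTIMIZE_HAS_SSE4_2,
    OPTIMIZE_HAS_AVX,        // placeholder, no AVX path exists yet
    OPTIMIZE_MAX_OPTIONS     // upper bound for the option check in Option 3
};

extern uint8_t runtime_detection;   // set once by optimize_scanbuf_init()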

 

Final Project Overview:

This project has been one of the most fun and challenging things I have done in quite some time. After many late nights and countless hours of reading, all with no idea whether it would ever work, I feel I have accomplished something I wouldn't otherwise have been able to.

Also, I know that I have not posted much of the code in previous posts, as much of it was still experimental for me; hopefully this post will clear some things up and help anyone reading it move forward with their own personal projects.

I have really enjoyed having you as my readers and hope to continue with this blog into the future. It's time to shut down the test machine and say thank you to everyone involved on this journey; I won't forget it.

 

With Love, always.

ElliePenguins

Project Detour

 

Hello everyone, I have something kind of unrelated to my project, but it is still interesting. Actually, it may be related to the project, but for now it is more of an interesting discovery (for me). After a discussion with my professor about eliminating potential overhead in my program, they brought up the idea of using goto statements to select which piece of code to run. Somewhere in that conversation came the idea of taking the address of a label to use as a function call. This was something I couldn't stop thinking about.

So this article, although somewhat unrelated, will dig into some C-based ways of playing with execution at a lower level. As of starting this write-up I have no idea what will work and what will not, but I intend to piece together different ideas purely to see what is possible. My first stop on this journey was of course Google, where I found these articles about how to extract the address from a label:

https://stackoverflow.com/questions/1777990/is-it-possible-to-store-the-address-of-a-label-in-a-variable-and-use-goto-to-jum

and

https://stackoverflow.com/questions/15702297/dynamic-jump-to-label-in-c

The second link above actually has some interesting ideas, specifically https://stackoverflow.com/a/15736890, that may be very useful in my project; you will absolutely hear about that later. But for now let's see what we can do with goto and function operators.

Here is the example code that I will be starting from:

   #include <stdio.h>

   int main ( void ) {
     void *ptr = NULL;
     label: ptr = &&label;
     printf("label address %p\n", ptr);
     return 0;
   }

 

Which itself compiles and outputs the address of the label. Neat.

Next up, what can we do with these things? Well, first let's try to call that address with the function operator:

       ptr();

Error: called object ‘ptr’ is not a function or function pointer.

Disappointing, but maybe we can try again with an array of pointers to functions:

 

               void (*ptr_array[])() = {ptr};

               ptr_array[0]();

 

This, on the other hand, does run and provides us with somewhere between 30,000 and over 100,000 lines of output before segfaulting. Let's dig into that a little bit with gdb. Start it up with:

               gdb -q <name_of_program>

Note that the ‘q’ flag is used to suppress that large, unsolicited welcome message at EVERY start.

 

Continuing on and running the program within gdb gives us the opportunity to examine the call stack at the time of the segfault; you can do this with "bt", and it seems to be breaking within printf. It is difficult to make assumptions at this time, as the address within ptr, taken from the label, should not have changed; basically we aren't seeing any difference in the data, so we need to investigate further.

Let’s start with break points to see what exactly is happening at each call:

1 – The pointer seems to be getting assigned the address of the label.

2 – The printf function is executed and shows the address of the label.

3 – ptr_array[0] is called as a function.

  • Requesting the address contained at ptr_array[0] from gdb still shows null, interestingly enough.
  • This however is resolved by using ptr_array[0] = &&label;

4 – After the call to ptr_array[0]() we are somehow back at the beginning, at a break point before the label.

  • Because this was a function call and not a goto, we have presumably pushed a return address to the stack. Let's examine that afterwards though.

5 – And the looping continues like this until we reach a segfault.

My next idea is to create a watch point on our pointers and see if the values are ever changed. We can do this in gdb with:

watch <entity_name>

Aside from a slow run, this provides us with nothing of value. Let's remove the print statement that is slowing it down and see if we can run it again to catch the watch point on failure while the earth is still turning. I will note that this requires recompiling the program.

After considerable time, we finally have some output. It seems that our label pointer saved in ptr_array has not been modified at the time of the segfault. However, if we consider our suspicion that this segfault may be related to stack size, which (thanks to this software portability and optimization course) we know is fixed, let's check the stack pointer:

p $esp

               -8392704

This value is very close to 8 MiB, a realistic default stack size. To examine further, let's see what is in that section of memory; in gdb we can use:

               x/32x $esp

0x5555518c        0x00005555        0x5555518c        0x00005555

0x5555518c        0x00005555        0x5555518c        0x00005555

0x5555518c        0x00005555        0x5555518c        0x00005555

0x5555518c        0x00005555        0x5555518c        0x00005555

[…snip…]

This large block of repetitive data on the stack seems to confirm our suspicion that we are calling a function that will never return. When we call the address of the label, the compiler never emits an instruction to read the return address back into $RIP and pop it off the stack. Instead the call is made over and over again, each time pushing an address onto the stack and never cleaning it up. Dumping 1024, 2048 and 4096 bytes from $sp still shows the exact same data, which matches my suspicion exactly. Let's see if we can find a correlation in the data next.

So what exactly do these hex numbers mean? Let's put them together. Running our gdb examine command again, but requesting the 'g' (giant, 8-byte) size instead of the default word size, we get it all put together:

x/32gx $esp

0x000055555555518c 0x000055555555518c

0x000055555555518c 0x000055555555518c

0x000055555555518c 0x000055555555518c

0x000055555555518c 0x000055555555518c

[…snip…]

Now think about what we would be saving if we wanted to return; let's compare this against what is in our instruction pointer register:

i r

That is short for info registers, and from the output we can see the line:

$rip            0x55555555518a      0x55555555518a <main+81>

Which compared against

disass main

 

[…snip…]

   0x0000555555555167 <+46>:     lea    rax,[rip+0xfffffffffffffff9]        # 0x555555555167 <main+46>

   0x000055555555516e <+53>:     mov    QWORD PTR [rbp-0x18],rax

   0x0000555555555172 <+57>:     lea    rax,[rip+0xffffffffffffffee]        # 0x555555555167 <main+46>

   0x0000555555555179 <+64>:     mov    QWORD PTR [rbp-0x10],rax

   0x000055555555517d <+68>:     add    DWORD PTR [rbp-0x1c],0x1

   0x0000555555555181 <+72>:     mov    rdx,QWORD PTR [rbp-0x10]

   0x0000555555555185 <+76>:     mov    eax,0x0

=> 0x000055555555518a <+81>:     call   rdx

   0x000055555555518c <+83>:     mov    eax,0x0

   0x0000555555555191 <+88>:     mov    rcx,QWORD PTR [rbp-0x8]

   0x0000555555555195 <+92>:     xor    rcx,QWORD PTR fs:0x28

   0x000055555555519e <+101>:    je     0x5555555551a5 <main+108>

   0x00005555555551a0 <+103>:    call   0x555555555030 <__stack_chk_fail@plt>

   0x00005555555551a5 <+108>:    leave

   0x00005555555551a6 <+109>:    ret

[…snip…]

 

From what I can see, we are using lea, or "load effective address", to bring in an address at an offset from $rip, which I assume to be the address of our label, and storing it in rax. The second lea loads that same address again, this time into our array. That value is stored at an address indexed from rbp, meaning it is on the stack, and is later loaded from that offset into rdx. Finally we make the call to the address loaded into rdx. Every time that call runs it pushes the address of the next instruction (the 0x...518c seen in the stack dump above) onto the stack; without a return instruction, that explains the mess on our stack.

This creates some kind of weird failed-recursion "hyper drive". However, if there were a means in C to get that address off the stack (or never push one in the first place), this would be a fantastic and fast way to accomplish a low-overhead means of calling my optimization functions. It would be especially useful if it were possible to do a non-local goto and still have all the local variables available.
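Just to illustrate the "never push an address in the first place" idea: with the same labels-as-values extension you can jump to the code with a computed goto and keep your own continuation pointer to jump back through, so nothing ever lands on the stack. A toy sketch of mine (not anything from the project yet; everything stays inside one function, so the locals remain in scope):

#include <stdio.h>

int main(void)
{
    // Manual "return address": a label pointer we control, instead of letting
    // a call instruction push one onto the stack.
    void *come_back = &&after;

    goto *(&&worker);        // plain jmp in, nothing pushed

worker:
    puts("doing the work");
    goto *come_back;         // "return" is just another jmp

after:
    puts("back again");
    return 0;
}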

 

Hope you enjoyed my quick write up, maybe you can find a practical use for it in whatever you’re doing.

 

Cheers,

ElliePenguins

 

 

Project Progress Update

Hey again, I am writing this post to let anyone who is interested know that my project has been progressing well: I have a working optimization and have even implemented features to allow other multi-platform/architecture optimizations in the future. For those of you who may be new to my blog, I am attempting to optimize ClamAV for a course that I am taking in software portability and optimization; you can read previous posts here:

https://penguinpro.ca/category/school-work/project/

My big announcement is that I was able to get the program to loop over my SSE4.2 optimization so that all of the data is checked and all indexes and offsets are properly set. The overall program run time is now about 2 minutes and 41 seconds, for a speed-up of just shy of 5 seconds, which is an improvement of about 3% overall.

I was hoping to keep you all updated sooner on the progress of this project; however, because it has been moving along quite well, it has been somewhat difficult to shift focus from working to writing. I should note that there is more to tell you than just the fact that the optimization is working: I have also been able to implement some features that may allow more architecture-dependent optimizations in the future. What I mean is that, as of now, I have implemented the scaffolding (in two different ways) to allow a single build of the code to run on multiple architectures, as well as the initialization function that needs to be called during program start-up to select the best available optimization.

Now for the downside of that "feature." By making the program able to detect its running environment and call the appropriate function, we have brought the gain down to about 2 or 3 seconds (at best) instead of the above-mentioned 4 or 5. This essentially cuts our optimization in half, to roughly 1.5% to 2% with one implementation, while the other completely negates any benefit. I am still hopeful that there is enough overall benefit in being able to implement hardware-specific optimizations as functions that someone who understands this better than I do may be able to run with my changes and optimize further.

I will note here that both methods include jump tables; however, one contains pointers to a complete copy of the optimized function, using:

__attribute__ ((ifunc ("resolver_name")));

The other takes only the block of optimized code, places it within its own function, and calls that function from the original code where the block would normally be, passing the required information as arguments.

Again both of these methods are implemented with an array of function pointers similar to:

searchOptions[hardware_capability] (arg1, arg2, arg3);

"hardware_capability" is set by the initialization function mentioned earlier. The ifunc version needs no extra arguments, as the address of the function is "seamlessly" returned by the resolver function, which uses the same function API (argument list). Both of these solutions are transparent to developers using the function; however, ifunc caused more performance degradation, and the reason for this will require further research.

I feel that the biggest issue with my optimization, in terms of the actual use of intrinsic functions, is that the strings are not properly aligned. I apologize for my previous post stating that my solution "aligned" the strings; I have since learned that I was really only doing "fancy" index operations. I am hopeful that by implementing proper string alignment we will be able to eliminate the required checks and adjustments that inevitably cost some processing time. My other issue is that, being new to topics such as intrinsics, inline assembler, and SIMD, I chose the SSE4.2 instructions with their 128-bit register width. This was a safe bet while I was learning; however, by taking advantage of the wider registers in AVX and AVX-512 (256 and 512 bits), more performance could be gained. I also feel there may be other areas of potential optimization within this software where I can apply my new knowledge, as well as other open-source projects in the future.
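To show what I mean by proper alignment, here is a small sketch (not project code): if the buffers could be guaranteed 16-byte aligned, the aligned load variant could be used and, more importantly, the end-of-buffer adjustment logic could potentially be dropped:

#include <immintrin.h>
#include <stdint.h>

// Sketch only: use the aligned load when the pointer happens to be 16-byte
// aligned, otherwise fall back to the unaligned load my current code uses.
// The bigger win of guaranteed alignment would be removing the end-of-buffer
// rewind/adjustment logic, not the load itself.
static inline __m128i load16(const unsigned char *p)
{
    if (((uintptr_t) p & 0xF) == 0)
        return _mm_load_si128((const __m128i *) p);    // aligned load
    return _mm_loadu_si128((const __m128i *) p);       // unaligned fallback
}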

My next step will be to get in touch with my instructor to make sure that I have implemented the jump tables and ifunc resolver appropriately. I am really hopeful that a simple mistake is the reason for the performance drop and that a fix can be implemented quickly with some extra direction. When this is resolved I will try to get a more technical post available for my readers who may be attempting to implement similar functionality in their own projects.

Again, thank you for reading; and I hope to have some actual code posted for you all soon.

With love,

ElliePenguins

Project Mile Marker

Hello again, I have another post about my attempt at optimizing ClamAV; if you are new to my blog you can check out previous posts here:

https://penguinpro.ca/category/school-work/project/

 

In my last post I was able to get into the source code for my project, and after what I feel was a considerable amount of analysis of a single function, I have started my optimization attempt. Currently I am getting the correct output on a single directory; add recursion into the mix and you get a segfault.

However, I have noticed a considerable speed-up in the program. Running against a single directory without recursion (because it's broken) we have a run time of:

----------- SCAN SUMMARY -----------
Scanned directories: 1
Scanned files: 8
Infected files: 1
Data scanned: 8.00 MB
Data read: 2089.99 MB (ratio 0.00:1)
Time: 27.139 sec (0 m 27 s)
Start Date: 2019:11:20 16:25:02
End Date: 2019:11:20 16:25:30

In comparison to a previous run time of:

----------- SCAN SUMMARY -----------
Scanned directories: 1
Scanned files: 8
Infected files: 1
Data scanned: 8.00 MB
Data read: 2089.99 MB (ratio 0.00:1)
Time: 47.126 sec (0 m 47 s)
Start Date: 2019:11:20 16:23:36
End Date: 2019:11:20 16:24:23

This is a significant improvement, to the point that I am skeptical, especially given that we cannot run this against a larger data set without a segfault. There are too many discrepancies to say anything for certain, but the numbers are consistent and the output is correct; it at least gives me some hope for this project.

My next task, of course, is to solve the segfault, which I believe relates to aligning the modified block to avoid loading values past the end of an array. But after fighting with the software just to get correct output on character strings, I feel this post is warranted.

 

Modification overview:

The most difficult thing about this optimization was trying to get the character fields to load at least 16 characters at a time without overrunning the buffer. The first pass of our SIMD intrinsic operation went off without any issue and returned the index of the first non-matching character; perfect. The second pass, for any string longer than 16 characters (most of them), required that we "rewind" the current position pointer. The reason we need to do this is to make sure that we are loading 16 valid characters and not garbage values off the end of the buffer, which could produce erroneous results.

That was accomplished with a ternary operator, picking the larger of the two lengths and then subtracting the 16 characters that will be loaded into the vector registers; it looked like this:

                l = p->prefix_length + p->length;

                j = (length > l) ? length - 16 : l - 16;

This allows me to load the string from a safe offset; even though we are rechecking some characters, it is still much faster than checking our loop condition after every single character comparison.

From there we could load the values like:

                xbp = _mm_loadu_si128((__m128i*) &bp[j]);

                xpt = _mm_loadu_si128((__m128i*) &pt[j]);

Here is a graphic representation:

[Image post_one: a 21-character string laid over the 16 lanes of a vector register]

The green and yellow together make up a complete string, in this case 21 characters. The blue represents the vector register lanes that each individual character can be loaded into, and the yellow represents characters that are beyond the width of the register, i.e. past index 16 from the beginning of the array.

You can see that we still need to check the next 5 characters after the first call; however, we cannot just load the vector register with the next set of 16 characters, as we would load data off the end of the array, as demonstrated in this image:

[Image post_two: loading the next 16 characters would run past the end of the array]

Now here the indigo colour (running off the edge of the image) represents the characters in the string that have already been checked. The green represents the characters still to be checked, and the red represents data past the end of our character array, essentially filled with arbitrary data. Attempting to access this data will either cause the program to segfault or, worse, run checks against that data and corrupt our output. This is where the above solution comes into play: rewinding the pointer back to a point where we can check all the new characters, even if it means rechecking some of the previous ones. Because we do this in one instruction, it is much faster than looping over every single character to determine a match. The adjusted load would look something like this:

[Image post_three: the rewound load rechecks a few characters but stays within the array]

Here indigo represents checked characters, yellow represents characters that are being rechecked, and green is the next set of characters to check.

After each check operation, the intrinsic function we are using returns an index that can be used to determine whether a match on the entire string has been found. Any indexes that rely on that information can then be set, and this part seems to be running appropriately: we are getting the proper output in terms of character-string matches.
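Concretely, that determination is just a couple of lines; it is the same logic you can see at the end of the charSearch_sse4_2 snippet in the post-project write-up further up this page:

   // y is the index returned by _mm_cmpistri; 16 means "no mismatch in this block".
   j += y;

   // The whole pattern matched only if both strings are the same length and the
   // accumulated index walked all the way to the end of them.
   found = (length == l && j == length) ? 1 : 0;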

The next part is determining a loop condition that allows this function to iterate through large blocks of non-text data, which I believe is where the segfault, and the unexpected speed-up, are coming from.

 

Summary:

This post was only a quick update on my work so far with my chosen piece of software. Getting past this milestone means that we (maybe) have a handle on how the offsetting works. Our next task is to make sure that this function works with any byte array, not only text. To do this we need to determine what data is causing the crash and in what context, or from which function, it is being called at the time of the segfault. Once that is solved, the next step is cleaning up the code for efficiency and maintainability, but one thing at a time.

Hope to have more for you soon.

ElliePenguins.

Another Project Update

Another quick update; but first, those of you who may not be totally familiar with my attempt at optimizing an open-source project can check out the previous posts here:

                https://penguinpro.ca/category/school-work/project/

The reason I needed to post again today is that last night was somewhat productive; or, well, not really productive, but I did a lot of work anyways and you get to hear about it.

So last night was when I started really digging into the source code of the project and even attempted to make some modifications. If you look back on this post:

                https://penguinpro.ca/2019/11/15/project-update/

You will see that I was attempting to modify some of the loops within matcher-bm.c in libclamav. I thought it would be possible to modify some of the loops so that more than one character could be checked at each iteration. The issue with this thinking was that I was not aware that this function implements a well-known computer science algorithm, the Boyer-Moore string search. For more info you can check out this link:

https://www-igm.univ-mlv.fr/~lecroq/string/node14.html

As mentioned at that link, "The Boyer-Moore algorithm is considered as the most efficient string-matching algorithm".

I (think I) was right about one thing: this function is doing the heavy lifting in terms of comparing character and string values. However, the search algorithm uses C-style linked lists, with individual nodes containing the patterns and offsets the algorithm needs, which means each comparison requires fetching the next node. For our attempted optimization, we needed to get the address of the next data structure from the next pointer within the current node and then access its individual elements.

Attempting this in practice only slowed the software down; the extra loads and compares from memory only got in the way of how the algorithm is supposed to search, though it did not stop the program from making the correct selections.

The parts of the program that were showing up as busy instructions were simply processing an incredible amount of information by looking at a single character at a time. Which is exactly why the movzbl instruction accounted for almost 30% of the total function run time; that specific instruction was responsible for loading the character for comparison. Wouldn't it be great if we could load 2, 4 or maybe 8 characters for comparison at a time? Of course, and I am still hopeful to be able to do that before the time in this class runs out; however, it must be done in a way that does not break the algorithm.

After playing with this block of code until about 3am, and adding debug printf statements to individual sections of interest, I found that this program is unique (to me) in how it compares a massive amount of data in a very short amount of time: by checking individual characters it is able to process and compare almost 8GB of data in about two and a half minutes.

This is a great example of how a community of people can come together to make something about as good as it can get. Whatever happens with this project, it has definitely increased my appreciation for open-source technologies. Not that I ever had a doubt in the "bazaar" style of development.

Now I must finally mention that I am not completely giving up hope on this project yet. There is still one little glimmer of hope, in the form of a character comparison loop within matcher-bm.c at line 345 (as of this writing) that looks like this:

  found = 1;
  for(j = 0; j < p->length + p->prefix_length && off < length; j++, off++) {
     if (bp[j] != pt[j]) {
            found = 0;
            break;
      }
  }

This small loop of about 7 lines of code does not seem to be particularly busy. It is, however, one of the last checks in the control of its parent loop. If the characters bp and pt at index j do not match, the break statement ends the loop and execution jumps back to the enclosing for loop at line 276. There p, which is a pointer into a linked list of type cli_bm_patt, is advanced to get the next pattern structure, and searching continues.

Unfortunately, as each node in the linked list must traverse the above part of the parent loop to see if it is appropriate for a match, attempting to work on multiple nodes at once is not going to work. This is not an optimization that can be handled with vectorization; it instead needs to be tackled with parallelization, a task that is already handled by clamdscan.

My last shred of hope for any kind of optimization within this block of code is that we can somehow make this loop take advantage of vector registers and compare multiple characters per iteration. To do this within the x86 environment we MAY have a couple of instructions available to us, namely:

MPSADBW – Compute Multiple Packed Sums of Absolute Difference

https://www.felixcloutier.com/x86/mpsadbw

&

PCMPESTRM – Packed Compare Explicit Length Strings, Return Mask

   https://www.felixcloutier.com/x86/pcmpestrm

From what I understand, string compares can be difficult to vectorize. This is due to the fact that we could potentially load values past the end of the array if we are not properly aligned, or we could be comparing strings of different sizes. But by using these instructions, the first of which shows more promise, we MAY be able to get a tiny bit more performance out of this function.

Before I get anyone's hopes up, I need to note that the way this must be done will be architecture specific. The best way I can think of is with intrinsics, which will require some kind of pre-processor logic. My overall worry is that this may overly complicate this block of code to the point that it is not worth the minimal performance gain that will be achieved; let's try anyway.
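To give a rough idea of what the second instruction looks like from C, here is a sketch using its intrinsic form, _mm_cmpestri (the explicit-length sibling of the _mm_cmpistri call I ended up using); the function name is just for illustration, and note that the loads still read a full 16 bytes, so the end-of-array problem described above does not go away:

#include <nmmintrin.h>   // SSE4.2 string intrinsics; compile with -msse4.2

// Sketch: compare up to 16 bytes of bp against pt with explicit lengths and
// return the index of the first mismatch (16 if none was found).
int first_mismatch16(const unsigned char *bp, int len_bp,
                     const unsigned char *pt, int len_pt)
{
    __m128i xbp = _mm_loadu_si128((const __m128i *) bp);
    __m128i xpt = _mm_loadu_si128((const __m128i *) pt);

    return _mm_cmpestri(xbp, len_bp, xpt, len_pt,
                        _SIDD_UBYTE_OPS | _SIDD_CMP_EQUAL_EACH |
                        _SIDD_NEGATIVE_POLARITY);
}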

 

Cheers,

ElliePenguins

Project Update

Creating a dedicated testing box & some actual code.

Hey everyone, I have another update on my progress with software portability and optimization. As always, for anyone new to this blog: I am attempting to apply what I have learned so far in a final project, and you can check my previous posts on the topic here:

https://penguinpro.ca/category/school-work/project/

 

New/Old Machine:

As you may be aware, I had attempted to build and benchmark my software on a virtual machine. I thought that taking advantage of the tools within VirtualBox, such as snapshots and virtual networking, would make the process go much smoother. In reality, however, I found that it caused strange fluctuations in run times and even caused attempted "optimization" modifications to slow the run times down, as you will read in a bit.

Luckily for me I was able to attend a used hardware sale and found myself with a "new" dedicated i5 with a fresh Arch install, a process which will make a good blog post for anyone interested in making the jump to command-line junkie. For now, though, this post will just cover new benchmarks on this new hardware, as well as describe some direction as I make a first attempt at optimization.

Also, the networking that I am using for this new machine involves a USB NIC and a virtual installation of pfSense. For anyone who wants to start experimenting with software, it is always a good idea to learn a bit about good VM/lab hygiene. Being aware of which machines can see each other, or have access to different parts of your internal network, is important as you move forward. You can check this post for instructions on how to install pfSense onto a VirtualBox VM:

https://penguinpro.ca/2019/09/14/87/

 

Before we continue I wanted to say to anyone who has old computing hardware laying around: please do not throw it away. Not only is this bad for the environment, but cheap or free hardware is how I was able to learn computers; there are many talented people who cannot afford it.

Please consider donating old computers and hardware, you may just inspire the next generation of tech professionals:

https://www.rebootcanada.ca/

https://www.pcnix.ca/donate-your-old-computer/

(Note: I only found the above examples with a quick Google search, so it may be a good idea to research further.)

Now back to the software.

 

New Build

Just like in earlier posts, the software was obtained from the GitHub repo:

                git clone https://github.com/Cisco-Talos/clamav-devel

However this time instead of creating individual build directories with different configurations, we created:

                git branch static

                git branch vector

And tailored the source in each branch for the type of build that we are trying to create.

When it comes to the actual build, I have decided on this configuration, as I feel there is some opportunity for vectorization within matcher-bm.c, although I am not 100 percent confident in it. This is also the configuration I have decided to settle on while adding my personal modifications to the source code.

CFLAGS="-ggdb -ftree-vectorize -mavx -fopt-info-vec-missed -fopt-info-vec -fopt-info-vec-note -O2" ./configure --prefix='/home/<username>/vector_build' --enable-check --enable-coverage --with-systemdsystemunitdir=no --enable-experimental --enable-clamdtop --enable-pcre --disable-llvm

After we run our configure script, we can use the next lines to have our make command give us output regarding the decisions made about vectorization of each source file's loops. This information is saved to a file that can then be searched for each specific loop:

                make -j<core_count> 2>&1 | tee ~/compiler_output.txt

Or

                make -j<core_count> 2>&1 | grep vector | tee ~/compiler_output_vectorize.txt

 

To get a more detailed explanation on the actual build process see this link:

https://penguinpro.ca/2019/11/07/project-build/

 

As well as the provided documentation:

https://www.clamav.net/documents/clam-antivirus-user-manual

 

The Test:

Our test on this new machine uses the exact same data that was used within the virtual environment. However, we are now consistently achieving a run time of 2 minutes and 45 seconds across 10 runs, with only milliseconds of deviation between them.

This is in contrast to our previous run times of well over 6 minutes, with about 30 seconds of deviation between high and low. You can read about that here: https://penguinpro.ca/2019/11/10/main-project-benchmark/

 

Proper Static build:

To anyone who was following the blog or has read some of my previous posts: you will know that I had a bit of an issue trying to build the software as a static binary. My attempt at doing this was by adding flags to the compiler; after reviewing the ClamAV documentation again, I noticed that we were actually supposed to use the --enable-static and --disable-shared flags when calling the configure script. After doing this we were provided with this large map:

[Image post7: gprof call graph generated from the static build]

This output gives a better representation of the number of calls to each function (because it uses instrumentation instead of sampling). Aside from a few areas we may investigate in the future, it seems our original plan of targeting "cli_bm_scanbuff" in libclamav holds true. Let's look at some code.

 

Some Code:

Here is a direct link to each source file of our matcher-bm module:

Header: https://github.com/Cisco-Talos/clamav-devel/blob/dev/0.103/libclamav/matcher-bm.h

Source: https://github.com/Cisco-Talos/clamav-devel/blob/dev/0.103/libclamav/matcher-bm.c

The lines that stand out when inspected/disassembled with perf are:

 

 

283:  if (p && p->cnt == 1 && p->pattern0 != prefix) 

Which relates to:

movzbl 0x3e(%r13), %edx           ; @ 19.23% usage

This specific statement is within a large for loop with many conditions in between. Not to mention it acts on elements of a C-style linked list that are single entities, in this case one character and one int. Looking specifically at p->pattern0 != prefix, there is another identical comparison approximately 20 lines of code later; however, because the linked list pointer can be advanced in several places, storing this calculated result is of little value, and attempting it actually slowed our code (in the virtual environment) by seconds. My thought, specifically relating to this, is that in the presence of DMA, and if a pattern can be detected, storing these results may be of some worth. We will look into this in the future.

  

312:  if ((off + p->length > length) || (p->prefix_length > off)) {
             p = p->next;
             continue;
       }

                 Which relates to:

movzwl 0x38(%r13),%eax            ; @ 12.12% usage

Again we are looking at values that are resolved using single entities stored as elements of a linked list. The actual reasoning for this, given the lack of comments, will have to be researched further; however, it seems to relate to the comparison of signatures. Again, determining and storing these values earlier on may relieve some computational overhead.

 

Loops:

Loops in this program are handled interestingly: they tend to be for loops with all of the calculation and advancement of pointers done in the condition and increment parts of the loop. Rewriting these to take advantage of SIMD may provide some optimization; however, I worry about the length of the data being compared. If we are advancing a pointer through an array of values, we may need to know the length of those values, as well as the requirements involved in forcing alignment of the data structures themselves. Again, this will require research.

 

                Example:

     for (; offdata->pos < offdata->cnt && off >= offdata->offtab[offdata->pos]; offdata->pos++)
                ;

 

 

Interesting variable:

On line 249 within cli_bm_scanbuff there is a variable of type uint16_t called idx, which I assume is intended to mean index. On line 277 we see:

 idx = HASH(buffer[i], buffer[i + 1], buffer[i + 2]);

 

Two interesting things I have noted: first, HASH seems to report back some kind of index, which makes that function an interesting place to investigate for signature lengths. The second, and more "suspicious", is that while stepping through this block of code with gdb at -O3, I noticed that attempting to print this variable's value reported it as "optimized out." This indicates to me that it may be worthwhile to investigate the intended purpose and overall value of this variable.

 

Summary:

My intention for this post was mostly to give an update on the change of my dev environment, as well as to keep track as I attempt to cut into the more technical elements of this project. My next post will hopefully include any findings from stepping through individual blocks of code, and maybe some information on how to use gdb to get a better understanding of how a program works.

 

Love,

ElliePenguins

Main Project Benchmark

 

Hello again friends, I have finally gotten around to actually digging into the code a bit for my selected project for software portability and optimization. For those of you who may not be familiar, here is a link to previous posts in the project:

https://penguinpro.ca/category/school-work/project/

I think that is enough of an introduction for this post, because if you're as excited as I am to finally dig in and get started, it's been difficult to focus on anything else.

At this point I will list some of the tools that we will be using for profiling. That's not to say we won't use others as we go, but for now this should give you a pretty good idea of the requirements before we start:

We will need:

            perf

            gprof

            gprof2dot

            dot

            flamegraph

I will also mention our "test data." We are testing on about 7.7GB of various file types. This directory includes PDFs, docx, JAR files, MP4 video, many audio formats and JPEGs, as well as some ISO disk images, random data and executables, all ranging in size from a few bytes to a few gigabytes.

Before running:

Before we jump into running our tools, there are some settings we need to configure. For our profiling tools to be able to check the call stack and know where certain routines are located, we need to adjust a couple of kernel parameters. We can do this with:

sysctl kernel.kptr_restrict=0

sysctl kernel.perf_event_paranoid=1

Running:

Let’s run our first clamscan against our test directory with perf, which interrupts the program many times and then records the state of the process during that interrupt.

            perf record --freq=max -g ./clamscan -r ~/clam_test/

In the above command you can see that we run perf record, which tells perf that we would like to analyze the specified process, followed by --freq=max, meaning we want perf to interrupt the program and record a sample as many times as possible; finally we add the -g flag, which tells perf to record stack traces. The reason we run it this way is because of the visual analysis tool that we are going to pipe our data into, called flamegraph.

This was another package that needed to be built from the Arch User Repository which you can get here: https://aur.archlinux.org/flamegraph.git

As mentioned in earlier articles, I have described how to build these packages before, but I will include it again because it is useful to know in the Arch world.

git clone https://aur.archlinux.org/flamegraph.git

                makepkg

                pacman -U flamegraph-[version]-x86_64.pkg.tar.xz

Once we have this built we can do:

            perf script > run_one.perf

            stackcollapse-perf.pl run_one.perf | flamegraph.pl > run_one.svg

Which provides us with this graphic:

[Image post2: flamegraph of the clamscan run, with cli_bm_scanbuff highlighted]

This SVG graphic can be opened in a browser and is interactive! For example, by doing a search for a function we can see every instance of "cli_bm_scanbuff" highlighted in purple. The reason I mention "cli_bm_scanbuff" is that it is the function I feel has the highest potential impact for optimization. Sitting at the top of the stack traces, it appears that actual work is being done in that function rather than being delegated to other routines. We can also see that, although the call path through "cli_chkign" is the largest, it is not the only place the function is called, and those other instances are also at the top of the stack trace.

I should note that there may be some optimization possible in the "insert_list" function as well, but that would be purely code-based and most likely not within scope for this course. Once we complete our work with "cli_bm_scanbuff" we may open up this function too and see what's inside.

Before I continue, here's a general link to flamegraph; it's great:

http://www.brendangregg.com/flamegraphs.html

Analysis:

Opening our perf report, we can see the routines that are frequently on the stack during each sample are:

cli_bm_scanbuff         in libclamav.so  @ 17.39%

pcre2_match_8           in libpcre2-8.so @ 17.1%

cli_ac_scanbuff         in libclamav.so  @ 13.25%

insert_list             in libclamav.so  @  8.11%

Perf Report output:

[Image post1: perf report output]

Digging deeper into cli_bm_scanbuff, we can see that the single most common call path is from cli_loadhash, through cli_chkign, and finally to "cli_bm_scanbuff" at the top of the stack. It reports 59,047 samples, making up 8.23% of the total number of samples captured. This does not include the other calling functions, so the function may account for even more time than originally considered.

The module that this function seems to be a part of is:

https://github.com/Cisco-Talos/clamav-devel/blob/dev/0.103/libclamav/matcher-bm.c

We will dig into the source a bit more in my next post, as for now we just need a general idea of where to focus our effort, but this looks the most promising.

Other Tools:

Let's see what our other methods of profiling/benchmarking report. The other tool we have is gprof, which requires that the software be built with the -pg compiler option. This type of profiling actually embeds instructions into the code, as opposed to the way perf interrupts the program and inspects its state at that time.

This profiling tool can be as simple as just running the program and then running:

 gprof <executable_name>

Like perf and flame graph, this program also has tools that can convert its output to a graphical output for easier analysis, this can be done with:

gprof <executable_name> | gprof2dot | dot -Tx11

and saved to a file with:

gprof <executable_name> | gprof2dot | dot -Tsvg > filename.svg

Which in our case provides us with:

[Image post3: gprof2dot call graph]

This does not exactly look right when compared against our previous run with perf; these are all functions that relate to the program itself, like the scan manager and output routines. With a little research we find this link: http://gernotklingler.com/blog/gprof-valgrind-gperftools-evaluation-tools-application-level-cpu-profiling-linux/

It tells us that gprof does not have the ability to profile functions in shared libraries. That's a pretty big problem considering we are trying to optimize libclamav; nevertheless, we will find a way to make this tool do what we want. One way is to recompile the source as a static executable; however, attempting this required dependencies I was not able to satisfy in a reasonable amount of time, so we'll circle back to this one. For now let's run the program a few times over to get some statistics for the average run time.

Let’s try something like:

            while true; do (time ./clamscan -r ~/clam_test/) 2>&1 | tail -15 >> run_times.txt; echo "run complete"; done

This bash line will run the program the exact same way until we stop it, appending the timing data to a file called run_times.txt and printing "run complete" to the terminal on every iteration. From there we can look at those run times and calculate the highest, lowest and average.

Note that we are combining the output from the time command with the scan summary to create an entry for every run in the file, which looks [kind of] like:

———– SCAN SUMMARY ———–
Scanned directories: 35
Scanned files: 339
Infected files: 50
Data scanned: 660.14 MB
Data read: 7753.45 MB (ratio 0.09:1)
Time: 398.016 sec (6 m 38 s)

real      6m38.032s
user     6m10.382s
sys       0m25.658s

After allowing it to run for some time we are getting outputs between 6 Minutes, 7 Seconds; and 6 Minutes, 45 Seconds. Averaging about 6 Minutes 36 Seconds.

Overview:

Hopefully now we have a general idea of what is going on within this piece of software with regard to optimization, and most importantly a direction for our next steps. In my next post I hope to actually dig into the source code of the function that we looked at earlier. I am also hopeful we can compare different optimization levels to see if there are any tweaks we can make in the build itself.

 

Send more coffee,

ElliePenguins

 

Project Build

Real Project Build

Hey again. As some of you may be aware, my first project selection and build attempt did not exactly take off like I expected it to, but that is totally okay. "For my next trick," as you might be thinking, I am going to discuss the build process of another software project. Now although my first software choice may not have been ideal, I would like to remind you that it was really just "not ideal for the situation," meaning that for the class I am taking that software would present too many roadblocks, which both hinder learning and put the deadlines at risk. I am hopeful to continue with it in the future, but for now I will present my new chosen software: ClamAV.

ClamAV required about a 5-minute build and maybe a 20-minute setup, and I was ready to start profiling the software for optimization opportunities easily within an hour. Compare that to my previous choice, which took more than 10X as long to build and about 30X as long to configure/set up (and is still not complete). Before I start this article I will provide the link to my previous build attempt for context:                https://penguinpro.ca/2019/11/05/software-build/

Now, pending any further catastrophes, let’s start our ClamAV build with some reasoning:

 

Why ClamAV:

It more or less found me: I managed to take a moment to check my Twitter when I noticed that a well-known figure in the tech community had posted about a recent PoC that was released online: https://twitter.com/hackerfantastic/status/1190685521153937408

Further analysis led me to their website and eventually to the ClamAV GitHub source repo: https://github.com/Cisco-Talos/clamav-devel

Here I noticed that within their repository README there was a section labeled “Want to make a contribution,” followed by a “thank you,” which is always a good sign, right? From there I looked at their website to find that their documentation regarding build instructions was really well done, and even included some profiling/benchmarking instructions.

You can check it out here: https://www.clamav.net/documents/clamav-development

With all that (and the fact that I was out one project), there was really only one option: clone the source and attempt the build:

git clone https://github.com/Cisco-Talos/clamav-devel

 

From there it was a simple matter of creating a build directory and running:

CFLAGS="-ggdb -O0" ./configure --prefix=`pwd`/installed --enable-debug --enable-check --enable-coverage --with-systemdsystemunitdir=no --enable-experimental --enable-clamdtop --enable-xml --enable-pcre --disable-llvm

This again was well documented, even detailing the reasoning behind each flag. I also created another directory and followed the instructions a second time, this time including -pg for gprof in the CFLAGS line; more on that in later posts.
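
For reference, that second tree’s configure line looked roughly like the following; it is the same set of flags as above with -pg added, and the paths are illustrative:

    # Same flags as the documented debug build, plus -pg so gprof can
    # instrument the binaries (run in a second copy of the source tree).
    CFLAGS="-ggdb -O0 -pg" ./configure --prefix=`pwd`/installed --enable-debug \
        --enable-check --enable-coverage --with-systemdsystemunitdir=no \
        --enable-experimental --enable-clamdtop --enable-xml --enable-pcre --disable-llvm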

 

Building:

Next it was a simple matter of running:

make -j <number of cores>

make install

 

For which the time command reported:

real    1m2.708s
user    2m12.359s
sys     0m20.945s

 

Configuration:

After this, to make the program run, we just need to create a few directories and copy some configuration files from their “example” state to a “working” state. For this build I am going to use a general configuration, meaning that I am going to leave most of the defaults as they are.

This is done in the etc directory of the directory that you built your software into. Here there are two files: one is clamd.conf.sample, the other is freshclam.conf.sample.

We need to open both of these files with a text editor and modify the line near the top (line 7 in my case) by commenting out the word “Example.” To do this, just start the line off with a ‘#’ symbol.

The other line we need to modify is LogFile (line 14 currently), to set the log location. It may be a good idea to put it in the build directory; in my case the line looked like:

LogFile /home/builder/clamav/var/log/clamd.log

As well as the line labeled:

ExtendedDetectionInfo yes

And finally:

LocalSocket /tmp/clamd.socket

Now save this file as clamd.conf in the same etc directory (that’s the local build etc, not /etc).

Our next step is to modify the config file: freshclam.conf.sample in a similar way.

First, just like above, comment out the line:

                #Example

Then find and uncomment the line:

#Debug yes

Similarly, this file needs to be saved as freshclam.conf, again in the same etc directory.
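
If you would rather script these edits than open an editor, something like the following should work. The patterns assume the sample lines shown above, which may differ between releases, so review the generated files before using them:

    # Rough sketch: derive working configs from the samples; run from the
    # build's etc directory.
    sed -e 's/^Example/#Example/' \
        -e 's|^#LogFile .*|LogFile /home/builder/clamav/var/log/clamd.log|' \
        -e 's/^#ExtendedDetectionInfo yes/ExtendedDetectionInfo yes/' \
        -e 's|^#LocalSocket .*|LocalSocket /tmp/clamd.socket|' \
        clamd.conf.sample > clamd.conf
    sed -e 's/^Example/#Example/' \
        -e 's/^#Debug yes/Debug yes/' \
        freshclam.conf.sample > freshclam.conf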

 

Creating Directories:

Our next step is to create the directories that we defined in the configuration files above. In the main build directory we can run the command:

mkdir -p var/log

That is the directory we told clamd.conf to save log information to; and although we are not using this setup for anything real, generating log files is generally helpful and more realistic.

Next we need another directory for our AV database:

mkdir share/clamav

Then one last directory that we can put some files into to run our scan against:

mkdir ~/clamav_test

Note that our build has provided us with some benign files that should trip clamscan when scanned; they are in our build directory and can be copied over with:

cp -vi clamav-devel/test/* ~/clamav_test

The reason we copy the samples over instead of running a scan against that directory itself is that we can add our own files to extend our test without modifying the contents of the original build directory.

 

First run:

First we need to get the signatures to compare against; to do this we need to run, from our local build bin directory:

./freshclam

You can notice here that, since we enabled debug information, we get some output from the run that may be of interest to us. Next, go into the local sbin directory and run:

./clamd

This starts the ClamAV daemon, and then we are ready to start playing with the tools in the local bin directory, most notably “clamscan,” which we will now run against our test directory with:

./clamscan ~/clamav_test
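
Putting that together, the whole first run looks something like this when written relative to the install prefix (the “installed” directory from the configure line earlier):

    # First-run recap, with paths relative to the install prefix.
    ./bin/freshclam               # fetch the signature databases (debug output enabled)
    ./sbin/clamd                  # start the ClamAV daemon
    ./bin/clamscan ~/clamav_test  # scan the test directory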

  

Output:

Our first run seemed to be a success: the output states that 52 files were scanned, with 48 detections, and the total run time was reported as:

real    2m31.918s
user    2m28.895s
sys     0m1.242s

Noticeably, this process spun at 100% on a single core and held a large memory pool (virtual, resident and shared) before any output was produced. Interestingly, this program also reports its own run time in the final output, which will be helpful when it comes to benchmarking. In this run it reported 151.634 seconds (2m31s), which is directly in line with what was reported by the time command.

 

Summary:

Overall, I believe that I now have an appropriate software selection for my course project: one that does not take long to build (in comparison to my previous software selection) and that also has potential for optimization due to the type of work it does in terms of parsing and hashing. The added bonus of excellent documentation, and it all being written in a language I personally really love, makes me feel like I was able to find a good candidate after all.

Hopefully this information can be helpful to you, and I hope to have my next post up soon, as I am anxiously excited to start digging deeper into this technology.

 

Love as always,

ElliePenguins

Software Build

 

For those of you who are following this blog, you are aware that I am attempting to apply some of the knowledge I have recently gained in a software portability and optimization class. For those of you who are not aware, you can check my previous posts, specifically:

https://penguinpro.ca/category/school-work/project/

And for those interested in reading my entries on some of the material we worked on in the course:

https://penguinpro.ca/category/school-work/

 

Briefing:

Before I get into this, I am going to make a clear note that, as mentioned in previous posts about the course’s time restriction, I am “concerned” about the complexity of the chosen software. This has not been the easiest software to build, in regards to the required runtime software and kernel modules, user permissions, and even source modifications, as you will soon see/read.

What does this mean? Well, I may have to choose another piece of software, one potentially more appropriate for the course. Regardless, this will not stop me from chipping away at this project too, even if just for personal interest.

 

Before Building:

The first thing we do is download the source, which comes as a tar file:

https://download.virtualbox.org/virtualbox/6.0.14/VirtualBox-6.0.14.tar.bz2

This file is then extracted with:

tar -xvf VirtualBox-6.0.14.tar.bz2

Running this with the time command shows that it took about 57.98 seconds to decompress, after which we are given a directory with our source. I am going to provide this specific link just in case you are attempting this build yourself; even just glancing over it will save you some headaches:

https://www.virtualbox.org/wiki/Linux%20build%20instructions

There is even a section showing what arguments to pass the package manager for many Linux distributions specifically. Now let’s move into that directory; it’s time to build.

 

Building:

Within this directory we can see that there is a configure file that needs to be run, but let’s first take a look at some of the files with “config” in the name.

The first is CONFIG.kmk; after some Google searching we can see that kmk is part of “kBuild,” a make-like build system:

https://stackoverflow.com/questions/8016678/understand-what-is-kbuild

From this point you can read some of the READMEs and configuration files yourself, as this post does have to be completed at some point today and the information is pretty straightforward for anyone who is interested enough.

 

Configuration:

At this point we are going to run the configure command:

                ./configure

This will run the script of the same name. The point of it is to make sure that we have all the requirements to properly build this software; let’s run it with the time command. Note that before this write-up I had already run through the dependencies, to make sure there would not be any issues during my explanation.

After we run this (which took about 15 seconds), we get output stating that we have “Successfully generated autoconfig.kmk and env.sh”; more about this in a moment, because we also have some interesting output:

 

  +++ WARNING +++ WARNING +++ WARNING +++ WARNING +++ WARNING +++ WARNING +++

  Hardening is enabled which means that the VBox binaries will not run from

  the binary directory. The binaries have to be installed suid root and some

  more prerequisites have to be fulfilled which is normally done by installing

  the final package. For development, the hardening feature can be disabled

  by specifying the --disable-hardening parameter. Please never disable that

  feature for the final distribution!

  +++ WARNING +++ WARNING +++ WARNING +++ WARNING +++ WARNING +++ WARNING +++

 

It seems that we are required to run configure with “--disable-hardening --disable-docs” to get a build appropriate for our purposes, with “--disable-docs” included to avoid the docbook dependency.
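
So the second pass is simply:

    # Development build only; keep hardening enabled for anything you distribute.
    ./configure --disable-hardening --disable-docs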

Running it again, we are given a visually similar warning, this time stating:

 

  +++ WARNING +++ WARNING +++ WARNING +++ WARNING +++ WARNING +++ WARNING +++

  Hardening is disabled. Please do NOT build packages for distribution with

  disabled hardening!

  +++ WARNING +++ WARNING +++ WARNING +++ WARNING +++ WARNING +++ WARNING +++

 

Compiling:

Looking at the Oracle build instructions, from this point it seems that we are supposed to run env.sh and then a command called “kmk all,” so we will first need to run:

chmod +x env.sh

After running this command and executing ./env.sh, we get no output and there is still no “kmk” command. After further analysis we can see that in the kBuild directory there is another env.sh. I should also note that it is already set to be executable; let’s run it.

We get the output:         “Spawning work shell…”
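
In hindsight, my guess is that the top-level env.sh was meant to be sourced rather than executed, so that the variables it exports, including the PATH entry for kmk, stick in the current shell. Something like the following, though I have not verified it:

    # Guess: source the generated env.sh so its exports persist in this shell,
    # then kmk should be available for the build.
    . ./env.sh
    kmk all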

 

So that’s interesting. Let’s go back to our main working directory, where the source was decompressed to, and try running “kmk all” again, which runs until:

                src/VBox/Devices/PC/vboxssdt-cpuhotplug.dsl

With a repeating error of: “_UID inside processor declaration must be an integer ^ (found a string)”

Using Google to find the error message led me to this fix:

https://forums.virtualbox.org/viewtopic.php?f=10&t=94467

Within the link there is a comment from someone with a similar issue. The solution, as you will see, involves changing the last argument to an incrementing integer constant. Being left with minimal options at this point, I am going to attempt the fix on the source code.

 

Source Modification:

In src/VBox/Devices/PC/vboxssdt-cpuhotplug.dsl, between lines 97 and 130 (as of this writing), is this block of code:

 

        GENERATE_CPU_OBJECT(0x00, SCK0, "SCKCPU0", CPU0, "SCK0-CPU0")
        GENERATE_CPU_OBJECT(0x01, SCK1, "SCKCPU1", CPU1, "SCK1-CPU0")
        GENERATE_CPU_OBJECT(0x02, SCK2, "SCKCPU2", CPU2, "SCK2-CPU0")
                [...snip...]
        GENERATE_CPU_OBJECT(0x1e, SCKU, "SCKCPUU", CPUU, "SCKU-CPU0")
        GENERATE_CPU_OBJECT(0x1f, SCKV, "SCKCPUV", CPUV, "SCKV-CPU0")

 

We need to go through and change the source to be:

        GENERATE_CPU_OBJECT(0x00, SCK0, "SCKCPU0", CPU0, 0)
        GENERATE_CPU_OBJECT(0x01, SCK1, "SCKCPU1", CPU1, 1)
        GENERATE_CPU_OBJECT(0x02, SCK2, "SCKCPU2", CPU2, 2)
                [...SNIP...]

 

All the way through the list, from 0 to 31.
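
I made the change by hand, but if you would rather script it, an awk sketch along these lines should produce the same result (back the file up first and treat this as illustrative only):

    # Replace the trailing "SCKx-CPU0" string of each GENERATE_CPU_OBJECT line
    # with an incrementing integer, 0 through 31.
    cd src/VBox/Devices/PC
    cp vboxssdt-cpuhotplug.dsl vboxssdt-cpuhotplug.dsl.bak
    awk 'BEGIN { n = 0 }
         { if (sub(/"SCK[0-9A-V]-CPU0"\)/, n ")")) n++; print }' \
        vboxssdt-cpuhotplug.dsl.bak > vboxssdt-cpuhotplug.dsl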

The exact reasoning for this I will have to look into more deeply, but in case you are the curious type and want to research it yourself, I think it (might) relate to these links:

https://en.wikipedia.org/wiki/Advanced_Configuration_and_Power_Interface#SSDT

https://en.wikipedia.org/wiki/System_Service_Descriptor_Table

 

JAVA:

Carrying on with the build, the compilation stops again to complain that:

                src/libs/xpcom18a4/java/src/nsAppFileLocProviderProxy.h

cannot find the included jni.h; digging deeper into the error message output from the compiler, it is requesting a directory of:

/usr/lib/jvm/java-6-sun/include

Again another link for the curious:

https://en.wikipedia.org/wiki/Java_Native_Interface

 

At this point I am assuming it is looking for Java version 6, which unfortunately on Arch Linux means we are going to need to build a package. You can use Google to search for “Arch Linux Java 6,” or I can just provide you the link here:

https://aur.archlinux.org/jdk6.git

 

Building Dependencies:

Specifically we can use git like this:

git clone https://aur.archlinux.org/jdk6.git

Of course, be aware of what directory you are in; basically, don’t clone it into the VirtualBox source. It may be a good idea to have a separate directory in your home for this stuff, but let’s not get too far off topic. Once you have it, run the next set of commands; it may warn you about dependencies as you go, and you will be required to install them as well.

                makepkg

                pacman -U jdk6-[version]-pkg.tar.xz

Remember to change [version] to the version info in the built filename.

 

After building and installing this version of the JVM, the build failed at the same spot again; it seems that it may also require some further modifications.

Modifying directories:

Looking at the error message produced during compilation, it seems that the build is looking for /usr/lib/jvm/java-6-sun, whereas we have /usr/lib/jvm/java-6-jdk.

The files are all the same internally so I am going to attempt a symlink:

ln -s /usr/lib/jvm/java-6-jdk /usr/lib/jvm/java-6-sun

I don’t know if this is a good idea but it seems to work.

 

Building Other Requirements:

There we have it: our build is complete (kind of); there is still more that needs to be done. Our next step is to add the kernel modules. To get there, from our VirtualBox source directory, let’s type the command:

                cd out/<your platform>/release/bin

Here we need to go into the directory labeled src, where the source files for creating our kernel modules live. In this directory we can simply type:

make

Once that completes and we have the .ko files, do not attempt to add them yourself with insmod; there is a script in the bin directory we were previously in.

Back in bin there are shell scripts called loadall.sh and load.sh; these scripts will handle adding the built modules in whatever order they are required. A consolidated sketch of these steps follows.
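
Putting the module steps together, it looks roughly like this (the platform directory name is illustrative, and loading modules will need root privileges):

    # Build the kernel modules from the module sources, then load them with the
    # provided script (run from the VirtualBox source root).
    cd out/linux.amd64/release/bin/src
    make
    cd ..
    sudo ./loadall.sh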

Finally we are ready to run:

                ./VirtualBox

Where we are provided with:

Image_1

As you can see, this was not the easiest thing I have ever tried to build from source, and with that noted, I unfortunately do not feel that it is appropriate for the context of this class.

I am, however, hopeful that I can continue on with this technology by myself. Learning more about virtualization and how it all works is a bit of a personal fascination, one that I will most likely pursue on my own. Although I will be building another piece of software for my course, you are more than welcome to stick around and see how “playing” with the VirtualBox source works out.

As for now, how well does this thing work? Well, in user space? It doesn’t. It does start, and I can create VMs, but they crash with permission errors upon loading the .iso file. When running as root, however (which is another thing that complicates attempting optimization), I am able to create VMs, install OSes and even import appliances; check it out:

image2

 

Thanks for reading, and I hope to have my next post, about the build process of my actual chosen software, up soon.

 

With love,

ElliePenguins

Project Selection

As we have progressed in our software portability and optimization course, it is now time to select a project and attempt to use the knowledge we have gained so far; however, we won’t just be doing labs. Our project is to apply this knowledge in a real-world, practical application, in this case attempting optimization on an existing open source software project. This will require not only applying what we learned in terms of technical knowledge, but also learning how contributions work in open source projects.

With open source software, we learned that it is often maintained by individuals paid by large companies, for example Microsoft or Oracle, to add the features that they desire. This not only creates better, faster and more efficient software for the company contributing the resources, but also for the community of users and developers who use the software.

Decisions:

There were a lot of considerations when it came to choosing the software I would most likely dedicate considerable time to over the rest of this course. In selecting the software for this project, I tried to think about software that I use frequently and am comfortable with as a user. The common tools within any Linux distribution were my first thoughts; however, I found that many of these communities, although open source, seemed to operate as small, closed groups that would not be easy to set up communications with. Also, a lot of these pieces of software have existed for a considerable amount of time, and I assume years of improvements have already been made.

One of my considerations was specialized software such as chat clients like irssi. However, I have never personally had to think about the resources such a client needs to run as-is, meaning there was most likely not much to optimize. The other factor to consider with this project was that testing could require a client/server testing model, which would most likely be more complicated than I would be able to handle during the remaining time for this course.

Another option I looked at was the window manager dwm; it is easily my favorite window manager for ANY Linux distribution (if not my favorite software ever; I mean, just look at the source). This project was developed by a team who call themselves “suckless,” and the link is here:

https://dwm.suckless.org/

Not only is this project amazingly well done, it is also easy to compile, written in my favorite language, C, and they seem to maintain a philosophy of simplicity which translates directly into their final design.

It is stated on the above website that:

“dwm is only a single binary, and its source code is intended to never exceed 2000 SLOC.”

“SLOC” meaning “Source Lines Of Code”, here’s a link:

https://en.wikipedia.org/wiki/Source_lines_of_code

However, ultimately I did not choose this project because I was not familiar with their contribution requirements, which might deter me from contributing in the context of a time-constrained course. I am hopeful, however, to potentially pursue contributing source and/or getting to know the community, based entirely on the fact that I love this piece of software.

 

Chosen Software:

Which brings me to the point of this article: the software that I have ultimately chosen to try to optimize for this course is Oracle’s VirtualBox.

VirtualBox is a software project that has been critical to my personal learning about computers. It allows me to create and maintain large networks of (virtual) computers in a safe learning environment.

See my post on using PfSense with VirtualBox:

https://penguinpro.ca/2019/09/14/87/

 

VirtualBox has also allowed me to practice installing and playing with various software and operating systems. It is able to do this without the need for specialized hardware, and has even allowed me access to the development environment I am most comfortable with when working on public machines.

The other reason that I have chosen VirtualBox is that I feel there is a lot of potential for optimization; not that there is much wrong with it, or that I feel it is inefficient to use in its current state, but that there is a larger probability of optimization work to be done. For example, this particular piece of software has many internal libraries that could show potential for optimization upon further analysis, not to mention the fact that it has a GUI that may benefit from optimization, and it even relies on modules that must be loaded into the kernel itself.

Finally, the model for contributions seems to be quite straightforward and active. It is a piece of software that has the backing of a large company (Oracle), and the community around it seems to post in the forums in a professional manner.

https://www.virtualbox.org/wiki/Contributor_information

 

Concerns:

The main concern that I have with attempting optimizations on such a large and complex piece of software is my lack of understanding of how virtualization, and bytecode pass-through itself, takes place.

My other concern relates to building the source code itself. The larger the project, the more dependencies it generally has, as well as greater overall complexity and compile time, which may become a factor (again) in the context of a time-constrained course.

Finally, the hardware that this software needs to run on must be considered: “Do I have access to any machines that can handle virtualization?” as well as “How can I work on this on the go, or without an active internet connection?” I assume “virtualizing virtualization software” is just as strange as it sounds.

My next post will hopefully be about how attempting the build goes. Wish me luck; it’s been an experience so far.

 

Cheers,

ElliePenguins