All posts by elliepenguins

Linux Process

Hello, now that I am back into the swing of learning and writing again, I thought I would share some recent research with you. My hope is that sharing my work will both help me keep some structure in my personal research and perhaps help my readers in theirs as well.

In this post I will try to explain what I have learned about Linux processes: their definition, their creation, and how to find more information about them. This will be a more theoretical post than some of what you might find on this blog, but I will try to offer some application of the concepts as well.

Definition:

What exactly is a process, how is it different from a program, and how do threads “tie” into it? To start, I am going to offer a proper definition from one of the best books I have ever found on the inner workings of Linux.

“The Linux Programming Interface” by Michael Kerrisk:
https://nostarch.com/tlpi

In this book, Kerrisk defines a process as “an abstract entity, defined by the kernel, to which system resources are allocated in order to execute a program” (page 114).

This is something I think is good to keep in mind, especially when it comes to understanding the difference between a thread and a process. Specifically, a process is a collection of resources; think open files, memory blocks, etc. Threads are the flows of execution that act on those resources within the process.

If we think about a program as a set of instructions for a computer to follow, then the act of following those instructions is, at a simple conceptual level, a process. First, though, a computer must take a few actions before it can begin walking through a program and producing its desired effects; specifically, it must load the program into memory in such a way that the machine can interpret it as instructions.

Note: I will not be getting into the Linux boot process, as I believe that would make a good post of its own, and it would overly complicate this one.

Entering a command:

When a command to start an application is entered into a terminal, one of two things may occur. The first is that the shell itself may have the functionality built into its own code and handle the task directly. This is known as a builtin, and from what I understand it is implemented this way to cut back on the overhead of frequently used functionality.

The other way a program becomes a process is through a set of system calls (functions provided by the kernel that let userspace request actions and resources). The two system calls (or syscalls) involved are known as fork() and exec(). There are actually a few different variations on exec, but know that they all carry out the same general task.

Creating an instance:

To carry out this task, the calling program first makes a duplicate of itself in memory. This newly allocated block of memory is identical to the calling process, although its reference ID, known as its process ID or PID, is different; it is a separate running process at this point. To prove this point I found a couple of really interesting little practice programs at: https://ece.uwaterloo.ca/~dwharder/icsrts/Tutorials/fork_exec/

You can play with them yourself, but essentially the idea is to fork the current process and then check the PIDs that are created. The neat thing about it is where execution continues after the fork procedure. Playing with these programs you can see that execution in the child process continues from the next line of code after the call to fork().

Here is a small example:
#include <sys/types.h>
#include <unistd.h>

/* execution continues here in BOTH processes: fork() returns 0
   in the child and the child's PID in the parent */
pid_t pid = fork();

What I find interesting is the use of the pid variable to determine the execution flow, based on whether we are in the child or the parent process: fork() returns 0 in the child, the child's PID (a positive number) in the parent, and -1 on error. This can be applied as such:

if (pid == 0)
    execvp("<command>", <args>);

This could also be handled via a switch statement, where default is the parent's execution flow, 0 is the child's, and -1 is error handling. All of this is described in much better detail in chapters 24-28 of “The Linux Programming Interface” by Michael Kerrisk; again, I really recommend it to anyone who is curious.

New process data:

From this point we know how to create an instance for our new process to exist in. We now need a way to change the newly created process into something more useful than a copy of its parent process. As mentioned earlier, this is done through a family of functions known as exec.

One of the first things you may notice about exec is that there are different versions (execve, execl, execvp, etc.). Most of these are wrappers around the execve system call, a pattern that is common in Linux; it acts a little like function overloading, allowing multiple variations on the same general functionality. You might consider writing your own wrappers if you find yourself frequently using a library function in a specific way; I will try to get into the details of that in a different post.
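As a tiny taste of writing your own wrapper (the name xfork is made up for this post), here is one that folds the error check into the call so the caller no longer has to test for -1 everywhere:

```c
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

/* Same behaviour as fork(), but exits on failure so the
 * call sites stay clean. */
pid_t xfork(void) {
    pid_t pid = fork();
    if (pid == -1) {
        perror("fork");
        exit(EXIT_FAILURE);
    }
    return pid;
}
```

You would then call xfork() anywhere you previously called fork() and checked the return value by hand.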

Now back to exec. This function replaces the memory image of the current child process (remember that a fork call duplicates the parent process in memory) with the new program. The usage is simpler than it seems, but for myself it took some play time to wrap my head around.

Here is an example:

char *args[] = {"sleep", "20", NULL};
execvp("sleep", args);

Walking through this code, we first declare an array of pointers, each of which is initialized in the order of the bracketed code; specifically, "sleep", "20", and a NULL pointer. The first is the name of the program; if this confuses you, think about the line:

int main( int argc, char* argv[])

Focusing on argv, the first element in the array is the name of the program, and this array becomes the new process's argument list. 20 is the first “working” argument, meaning the new program will take an action based on this parameter (and any others, if there are more). Finally, the NULL pointer acts as a stop sign so we do not read beyond allocated memory, just like the terminator when working with C style strings.

The last things you should know before I give you a block of code to play with are the functions known as exit() and wait(). You may have used exit() in your code already, but it has a neat property: through it the child can report an exit status to its parent. The parent can wait for the child process to terminate with the wait() call, which provides some basic synchronization and access to that exit status. You will see this in action in the example program below.

Example code:

Here is some basic code you can play with:

#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void) {
     pid_t pid = fork();

     if ( pid == 0 ) {
          printf("Child Process: %d - %d\n", pid, getpid());
          printf("Starting\n");
          char *argv[] = {"sleep", "20", NULL};
          execvp("sleep", argv);
          /* only reached if execvp fails; on success the
           * process image has been replaced by sleep */
          perror("execvp");
          exit(1);
     } else {
          printf("Parent Process: %d - %d\n", pid, getpid());
          int status;
          pid_t c_pid = wait(&status);
          printf("Parent Process: child process %d returned: %d\n",
               c_pid, WEXITSTATUS(status));
     }
}

Take your time and play with it; try modifying the different arguments and predicting the output. One thing I found helpful during this play session is to make the program sleep() at interesting points, just long enough to copy elements from /proc/<pid>/ and compare the differences before and after the exec call. One “file” of particular interest may be /proc/<pid>/statm, which, according to https://man7.org/linux/man-pages/man5/proc.5.html, “Provides information about memory usage, measured in pages.” I believe this provides a simple way to verify that the process image in memory has changed, without getting into the complexities of extracting and examining the actual bytes.

One interesting thing I did notice when checking /proc/<pid>/statm is that the output for the exec'd sleep matches exactly the output of /proc/<pid>/statm when running sleep from bash. Obviously sleep has a simple function (if not implementation), but I am now curious whether similar memory usage patterns could be used to detect an obfuscated process. That of course is beyond what I intended to get into in this post, but it is something I noticed while writing and it piqued my curiosity.

For now though, I would like to say thank you for reading. I hope this post was able to provide some insight; or at least give some direction for further learning. In regards to the content for my next post, I am still unsure, but don’t doubt I will stumble upon something interesting to share with you soon.

Cheers,
ElliePenguins

RAID Playground

Hello friends, it has been some time since my last post, which I now realize was over a year ago. Maybe current world events have altered my ability to sense time? Anyway, I hope you have all been well.

My thought is that this post will act as a bit of a bump start to get me blogging again; warm the engines a little before diving into something more complex. To begin, I would like to start sharing some of the ways I studied computer concepts before post-secondary education. This first post will detail how to create a lab environment for playing with RAID.

What is RAID:

RAID is an acronym for “Redundant Array Of Independent Disks,” although I have also heard people say “Redundant Array of Inexpensive Disks.” Whatever you choose to call it is up to you.

RAID allows disks to work together as a single storage pool, which can both speed up disk performance, by spreading the I/O across multiple interfaces, and add a bit of protection against data loss. Spreading the data in small chunks across the disks in the array is known as striping; the protection comes from also storing redundancy information (such as parity) across the disks, which allows the data on a failed disk to be reconstructed from the rest of the pool. This is, however, a very high level explanation of the concept; hopefully I will be able to dig deeper into the underlying mechanisms in the future.

Note: There are multiple configurations, such as just striping (RAID 0) and mirroring (RAID 1), and even combined levels such as RAID 10, but we are only setting up a play environment in this post. You can try different configs on your own.

Let's Start:

To begin, open a terminal and create a directory to work in. This is where we will be creating our “disks.” These “disks” will actually be files, just big blocks of zeros on the current file system, used to simulate real disks; they are attached to the system through what are called loop devices.

Quick warning: be careful when working with these tools, as they can cause unintended complications if you accidentally enter the wrong arguments, up to and including data loss and system failure. You should also be aware of your system's current storage configuration; specifically, does it already have any RAID devices running on it?
Here is a link that can give you some more info on how to check:
https://www.cyberciti.biz/faq/how-to-check-raid-configuration-in-linux/

In your working directory, create the storage files with:

 dd if=/dev/zero of=disk1 bs=1024 count=10240

You should now have a file in the current working directory named disk1 that is 10 MB in size.

Because this example uses RAID level 5, do this two more times, naming the files disk2 and disk3.

The next thing we need to do is create device nodes and attach our disk files to them.

To do this use the command:

 losetup -f

If there are no errors the next step is to attach the disk file, as such:

 losetup /dev/loop0 disk1

You will need to iterate over both of those commands two more times, in order. The reason is that losetup -f prints the next unused device node; creating a new node this way means making sure the previously created node is attached to a disk file before requesting another.

Once this is complete, you can verify the config with:

 losetup -a

which should reply with something along the lines of:

/dev/loop1: []: (/home/user/raid/disk2)
/dev/loop2: []: (/home/user/raid/disk3)
/dev/loop0: []: (/home/user/raid/disk1)

If you are wondering why we do this instead of just using a virtual machine with virtual disks, which is common when learning: this method allows you to create and delete “disks” without stepping out of the environment, or, on a physical system, to add disks or change partitioning schemes on the fly when you want to play with a different or more complex config. You could also apply this method in a script to create, test, and destroy the setup entirely within the current environment for learning, play, or testing.

Now you should be able to see the current block devices available to the system with the command:

 lsblk

which should look something like:

NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
loop0 7:0 0 10M 0 loop 
loop1 7:1 0 10M 0 loop 
loop2 7:2 0 10M 0 loop 
sda 8:0 0 128G 0 disk 
├─sda1 8:2 0 80G 0 part /
├─sda2 8:3 0 8G 0 part [SWAP]
└─sda3 8:4 0 40G 0 part /home

If you see the loop devices that we created then it is time to create our array, yay!

To do this, use the command:

mdadm --create --verbose /dev/md0 --level=5 --raid-devices=3 \
/dev/loop0 /dev/loop1 /dev/loop2

Now check that it is created with:

 mdadm --detail /dev/md0 --scan

Or by checking in:

 /proc/mdstat

If all works well you can see the details of the array with the command:

 mdadm --detail /dev/md0

Which should reply with information similar to:

/dev/md0:
Version : 1.2
Raid Level : raid5
Array Size : 16384 (16.00 MiB 16.78 MB)
Used Dev Size : 8192 (8.00 MiB 8.39 MB)
Raid Devices : 3
Total Devices : 3

State : clean 
Active Devices : 3
Working Devices : 3
Failed Devices : 0
Spare Devices : 0

Layout : left-symmetric
Chunk Size : 512K

Consistency Policy : resync

Number Major Minor RaidDevice State
0 7 0 0 active sync /dev/loop0
1 7 1 1 active sync /dev/loop1
3 7 2 2 active sync /dev/loop2

Running lsblk again should return something similar to:

NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
loop0 7:0 0 10M 0 loop
└─md0 9:0 0 16M 0 raid5 
loop1 7:1 0 10M 0 loop 
└─md0 9:0 0 16M 0 raid5 
loop2 7:2 0 10M 0 loop 
└─md0 9:0 0 16M 0 raid5 
sda 8:0 0 128G 0 disk 
├─sda1 8:2 0 80G 0 part /
├─sda2 8:3 0 8G 0 part [SWAP]
└─sda3 8:4 0 40G 0 part /home

Your new, totally software RAID environment should now be ready to play with. You can try creating a new file system on /dev/md0 with:

 mkfs.ext4 /dev/md0

and mount with:

 mount /dev/md0 /mnt

Or if you are curious, layer in LVM, which provides the ability to carve the large pool of disks into smaller partitions for specific usages within a larger system. Using RAID in combination with LVM is an incredibly powerful storage solution that is important to understand.

Some fun ideas might include failing disks, creating and adding new disks, and removing disks and resizing the file system (or, if you chose to use LVM, the volume groups). You could also try moving the array to a new box or virtual machine by transferring the files and reassembling the array on the new machine. Practising concepts like this in advance can allow you to act fast when required.

Here are some commands to get you started:

simulate failed drive:

 mdadm --fail /dev/md0 /dev/loop0

add another disk, remember to create it with dd and losetup:

 mdadm --add /dev/md0 /dev/loop5

remove a disk, don’t forget to modify the file system:

 mdadm --remove /dev/md0 /dev/loop5

Hopefully this will provide you with a good test environment to practice managing storage.

Best of luck
ElliePenguins.

Free Time and My Project(s)

Hello again friends, it has been some time since I have made a blog post, mostly because I have graduated from my program (yay!) and I am getting antsy to dig into SOMETHING productive.

After piecing together the next steps in my career, it is time for me to dive into another project. However, instead of working for grades and credits I will be able to work on something for myself, which sounds wonderful but has actually been difficult, because: what should I work on? Should I apply some of what I learned in school to add features or attempt another optimization for some open source project? Where to start, what to do?

Well, to begin, I have mostly been learning things I did not have time for while I was in school, taking the time to climb further down the rabbit holes I have wanted to explore. This has mostly involved one of the best purchases I have ever made:

https://nostarch.com/tlpi

I absolutely recommend this book.

Other things I have been playing with include an attempt at learning some basic electronics, using this book:

https://nostarch.com/arduino

For those of my readers who may not already be aware, you can get great deals on tech E-books through this site, while also helping the charities of your choice; check it out:

https://www.humblebundle.com/

Now, back to what I have been up to with electronics: the above mentioned book gives a good introduction and eases you into reading electronic schematics. Paired with lots of images of electronic components, it seems to be a good resource for getting started with the principles of electricity and simple circuit design. This has been fun, but it is also a hard reminder of why I dedicated myself to code rather than learning to wield a soldering iron; still, it would be really nice to be able to fix that broken 1541 drive.

Otherwise, most of my time has been dedicated to practicing applied elements of Linux development, such as building kernel modules from scratch; something I was not able to keep up with while in school. I also learned a neat trick for lower-level manipulation during an online course:

A neat trick:

This involves a union of an unsigned type and a bit field, which creates a type that can be both interacted with as an integer (for example) and also have its individual bits manipulated to modify that value. It goes something like this, and could be a good asset in someone’s personal library:

typedef union {
   uint8_t byte;
   struct {
      uint8_t  b0     :1;
      uint8_t  b1     :1;
      uint8_t  b2     :1;
      uint8_t  b3     :1;
      uint8_t  b4     :1;
      uint8_t  b5     :1;
      uint8_t  b6     :1;
      uint8_t  b7     :1;
   } bits;
} ubit8;

// instantiate, manipulate, display:

ubit8 value = { .byte = 255 }; // 255, not 256: a uint8_t tops out at 255
value.bits.b7 = 0;
printf("Value: %d\n", value.byte);

It should also be possible to extract bit values based on the integer that byte is set to. Play with it and see what you can make it do. Also, because it is a union, it should not take up any more space than a standard uint8_t; but I may be wrong, computers/standards get weird at this level (the layout of bit fields is implementation-defined).

Some Other ideas:

Recently I came across an opportunity to acquire a lot of (old) computer memory modules (RAM), about 8 GB worth. Now, you might think this is not much at all, considering most modern laptops ship with 8 to 16 GB give or take, but I can't emphasize enough: “old,” like LGA775 old. To clarify, when I was learning computers some time ago I acquired as many Pentium 4s (actually more Prescott/Presler era) as I could get my hands on, some of which had only 256 MB of RAM (great, if Arch Linux hadn't deprecated 32 bit). But they did, and I have been trying to think of what to do with these machines.

Well, I have a lot of ideas, but nothing that is going to run in that environment. Now, with this resource in hand (and a massive thank you again to this person for not throwing away usable hardware just because of its age), I might be able to bring some of these machines back out of retirement.

My one idea for this is to find a usable 32 bit Linux distro and then attempt some OpenMPI clustering (crazy, I know).

 

Another thing was coming across my old Tandy TRS-102; here's some info:

https://en.wikipedia.org/wiki/TRS-80_Model_100

This thing is brilliant: it runs on AA batteries and has a really impressive keyboard. The best part is that it has an RS-232 serial port via a DB25 connector, meaning it should be able to interface with my modern systems. Other people in the vintage and tech community have made this work effectively, and there are projects for it, such as:

https://www.cinlug.org/files/db/trashtalk/index.html

(I don’t personally know the validity of this yet though.)

This could be a fun project as well, and it is about time I learned more about serial protocols. I love this machine, and it is not like you can just get them off the shelf at RadioShack anymore; it does apparently have an “extended” warranty, though.

 

My Personal Project

Finally, the other reason for this blog post is to say that I am hoping to dive back into my personal UserData library. This library started out as an “I love to code,” “make C style linked lists accessible” type of project, which may be the best way to describe it. Here's a link:

https://github.com/ElliePenguins/UserData

My thought is that this library will hopefully be more than just a linked list library; instead, it is made of linked lists of linked lists, where each node has metadata nodes attached. From these data structures I foresee the ability to add, modify or delete entries, and to have this library be responsible for handling any data within any program a dev may want it for. Basically: create a head node, call an initialization function, and then retrieve and manipulate data via predefined functions. At least, this is the direction I am pushing this block of code; ultimately I just love to code in C, and if someone finds it useful along the way, that is great.
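To illustrate the shape of "linked lists of linked lists with metadata nodes attached," here is a rough sketch; these type and field names are made up for this post and are NOT the actual UserData types:

```c
/* Illustrative only: a metadata chain hanging off each entry... */
struct meta_node {
    const char *key;
    const char *value;
    struct meta_node *next;        /* next piece of metadata */
};

/* ...entries chained into an inner list... */
struct data_node {
    const char *payload;
    struct meta_node *metadata;    /* metadata attached to this entry */
    struct data_node *next;        /* next entry in the same list */
};

/* ...and inner lists chained into the outer list. */
struct data_list {
    const char *name;
    struct data_node *entries;     /* head of the inner list */
    struct data_list *next;        /* next list in the list of lists */
};
```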

Project status: I have (finally) added a Makefile that allows a developer to both create the shared object and optionally build a simple CLI program that uses the library, for development and debugging purposes. You can look at the Makefile for more info; it is pretty straightforward. Otherwise, I am still getting lots of segfaults, some of which can take time to replicate due to the nature of the internal linked lists of linked lists (a lot of strange memory handling).

Hopefully between now and finding employment I should be able to keep marching forward on this. Not just for the experience of maintaining a complex project myself, but also for a reason to get up in the meantime.

Specific next task: re-familiarizing myself with the code and the call stack at different points of execution; then, trying to determine the best way to make this somewhat usable.

 

Wish me luck,

ElliePenguins

Post-Project Details

Hello again, I have been a little less communicative recently, but I wanted to detail some post-project information. For those of you who may not be aware, I have been attempting to optimize an open source software project for a course I took called “Software Portability and Optimization.” To see my previous posts about this project, check the link here:

https://penguinpro.ca/category/school-work/project/

If you have been following the project, in my final post:

https://penguinpro.ca/category/school-work/project/project-finalization

You can see my project wrap up and reflection there. In this post, however, I am going to look back on the work that was done, detail some of the things I may have left out, and describe the different options I tried over the course of this project in a more technical way.

The actual optimization itself took some effort and was an excellent learning experience. I found myself in many rabbit holes, spending days researching the topics I love. However, the part I loved most about this project was the time dedicated to creating the “scaffolding” required to make the optimization work on multiple platforms without having to recompile from source.

This is where the project became the most fun.

In attempting this I created three different methods of accomplishing the task, each with its own benefits and difficulties. I will detail each of them here.

 

Option one:

For this first option, each architecture's optimization was put into its own function, with parameters that are accessed by reference. From there, a function jmp table was created and the address of each function was appended to it. As you will see, by doing this we can easily call the right function by an index, or offset, from the beginning of the array; the benefit is that we are not evaluating conditions on every run. The downside is that we still have the overhead of the function call in the first place.

This is also the option that I opted to publish publicly. It gave about a 1-2% reduction in run time, which made it the best option in testing. Here are the code bits:

Function declarations:

// optimization options:

void charSearch_C(const uint32_t length, const uint32_t l, uint32_t *j, uint32_t *off,
                uint8_t *found, const unsigned char *bp, const unsigned char *pt);

void charSearch_sse4_2(const uint32_t length, const uint32_t l, uint32_t *j, uint32_t *off,
         uint8_t *found, const unsigned char *bp, const unsigned char *pt);

 

Function JMP table:

// function table:

void (* const stringSearch[]) (const uint32_t, const uint32_t, uint32_t *,
                    uint32_t *, uint8_t *, const unsigned char *, const unsigned char *)
                                    = {charSearch_C, charSearch_sse4_2};

 

You can see the parameters, and how they relate directly to the function table, which is an array of function pointers; you can learn more about this concept here:

https://barrgroup.com/Embedded-Systems/How-To/C-Function-Pointers

 

Function definitions:

Here, just for the sake of having them included (even though they do not really change across calling conventions), are the actual definitions:

 

void charSearch_C(const uint32_t length, const uint32_t l, uint32_t *j, uint32_t *off,
                   uint8_t *found, const unsigned char *bp, const unsigned char *pt)

{

                // Original program code, just moved here.
                *found = 1;
                for (*j = 0; *j < l && *off < length; *j++, *off++) {
                    if (bp[*j] != pt[*j]) {
                                *found = 0;
                                break;
                    }
                }
}

void charSearch_sse4_2(const uint32_t length, const uint32_t l, uint32_t *j, uint32_t *off,
                        uint8_t *found, const unsigned char *bp, const unsigned char *pt)
{
    *found = 0;
    for (*j = 0; *j < length && *j < l ;)
    {
       // make sure that we are not going past either end of any array.
       if (*j >= length - 16 || *j >= l -16)
       {
                  if ( length >= 16 && l >= 16)
                  {
                     *j = (length < l) ? length -16 : l -16;
                     __m128i xbp = _mm_loadu_si128((__m128i*) &bp[*j]);
                     __m128i xpt = _mm_loadu_si128((__m128i*) &pt[*j]);
                     *j += _mm_cmpistri(xbp, xpt,
                                _SIDD_UBYTE_OPS |_SIDD_CMP_EQUAL_EACH | _SIDD_NEGATIVE_POLARITY);
                     break;
                  }
       }

       __m128i xbp = _mm_loadu_si128((__m128i*) &bp[*j]);
       __m128i xpt = _mm_loadu_si128((__m128i*) &pt[*j]);
       uint8_t y = _mm_cmpistri(xbp, xpt,
                    _SIDD_UBYTE_OPS |_SIDD_CMP_EQUAL_EACH | _SIDD_NEGATIVE_POLARITY);
       *j += y;

       // if y is not 16 then either the end of the string is reached or
       // a character miss-match has been detected.
       if ( y != 16 )
                       break;
    }
       // set found and offset based on our results.
       *found = ( length == l && *j == length ) ? 1 : 0;
       *off += *j;
}

Option 2:

For this option the code was modified to still use the functions, but instead of a function table being directly referenced within the calling function, ifunc was used to create a resolver function. This allowed us to have two complete functions with identical interfaces, with the optimization modification added to each.

 

// Use ifunc to seamlessly call the appropriate function

cl_error_t cli_bm_scanbuff(const unsigned char *buffer, uint32_t length, 
const char **virname, const struct cli_bm_patt **patt, const struct cli_matcher *root, 
uint32_t offset, const struct cli_target_info *info, struct cli_bm_off *offdata,
 cli_ctx *ctx)
                __attribute__ ((ifunc ("resolve")));

 

// jmp table for resolver function:

cl_error_t (* const scanBuffs[]) (const unsigned char *buffer, uint32_t length,
 const char **virname, const struct cli_bm_patt **patt, const struct cli_matcher *root,
 uint32_t offset, const struct cli_target_info *info, struct cli_bm_off *offdata,
 cli_ctx *ctx)
                = {cli_bm_scanbuff_sse4_2, cli_bm_scanbuff_C};

 

In the above, you can see that GCC's __attribute__ tag allows us to keep a seamless programming interface for the function, so we do not have to modify any call sites. The function table is used in the resolver function to choose the appropriate logic for the architecture the build may be running on. You can see it working here:

// Used to choose which function version to resolve to at load time.

// implements ifunc.
static void (*resolve (void)) (void)
{
                // return an entry from the jmp table, dependent on hw capabilities.
                return (void (*)(void)) scanBuffs[0];
}

 

I feel there are many practical use cases for ifunc beyond hardware detection and this project. It is something I am happy to have been introduced to.

 

Option 3:

This is where things start to get a bit “strange,” and where I started to have a lot of fun. Most of this is just an inline version of the previous code; however, instead of function calls we are using goto statements. Take a look at it first, and then I will try to explain what is going on.

 

// function table:

   void (* const search[])() = {&&c_search, &&sse4_2};

// UNCOMMENT "if" for invalid option prot:
// if (runtime_detection < OPTIMIZE_MAX_OPTIONS)

   goto *search[runtime_detection]; 

c_search:
   found = 1;
   for (j = 0; j < l && off < length; j++, off++) {
      if (bp[j] != pt[j]) {
         found = 0;
         break;
      }
   }

   goto exit;

sse4_2:
   for (j = 0; j < length && j < l && off < length ;)
   {
   // check, if index is within range.
   if (j >= length - 16 || j >= l -16)
   {
      if ( length >= 16 && l >= 16)
      {
         j = (length < l) ? length -16 : l -16;
         __m128i xbp = _mm_loadu_si128((__m128i*) &bp[j]);
         __m128i xpt = _mm_loadu_si128((__m128i*) &pt[j]);
         j += _mm_cmpistri(xbp, xpt,
         _SIDD_UBYTE_OPS |_SIDD_CMP_EQUAL_EACH | _SIDD_NEGATIVE_POLARITY);
         break;
      }

   }
      __m128i xbp = _mm_loadu_si128((__m128i*) &bp[j]);
      __m128i xpt = _mm_loadu_si128((__m128i*) &pt[j]);
      uint8_t y = _mm_cmpistri(xbp, xpt,
         _SIDD_UBYTE_OPS |_SIDD_CMP_EQUAL_EACH | _SIDD_NEGATIVE_POLARITY);
      j += y;

   // if y is not 16 then either the end of the string is reached or
   // a character miss-match has been detected.
      if ( y != 16 )
         break;
   }

   // set found and offset based on our results.
   found = ( length == l && j == length ) ? 1 : 0;
   off += j;

exit:

 

Okay, to explain this one, let me refer you back to a previous post where I was doing some practice/play with the address of a label. Note that I am still playing with this source; currently I have found a way to do nonlocal gotos using the address of a label, and even to loop infinitely. It is still not practical, as I have not found a way to predictably grab entities at offsets within the stack (think local variables). The neatest thing I found is that label addresses are offsets from the instruction pointer, NOT offsets from the entry point of the routine. I will try my best to continue working with it. Anyway, here's the link to that post:

https://penguinpro.ca/2019/11/26/project-detour/

Now for the above code:

void (* const search[])() = {&&c_search, &&sse4_2};

This loads the addresses of the labels into an array of function pointers. Note, however, that we do not call them with the function operator, or “brackets”; doing so emits a call instruction that pushes a return address and saved frame pointer onto the stack, which must then be cleaned up. To avoid this we use:

goto *search[runtime_detection];

We need to dereference the address within the pointer to jump execution to that point. My current “play” project is attempting to call a function this way and manage the stack myself.

Also note that runtime_detection is a global variable set by the optimization initializer function which you will see later in this post.

Next up, the first label is c_search:, which requires that we have an exit: label. The exit label is used as a forward jump to skip past any code we don’t want to execute. And finally we have sse4_2:, the label marking the start of our SSE-optimized code.
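To make the label-dispatch idea above concrete, here is a minimal, self-contained sketch of computed goto (a GCC/Clang extension, not standard C). The function and label names are illustrative, not taken from the ClamAV source:

```c
/* Minimal sketch of label-address dispatch ("computed goto", a GCC/Clang
 * extension). Names here are illustrative stand-ins for the post's
 * c_search/sse4_2 labels. */
#include <stdio.h>

int dispatch(int option)
{
    /* &&label yields the label's address; the table is rebuilt on every
     * call because label addresses are local to this function. */
    void *table[] = {&&c_search, &&sse4_2};
    const char *taken;

    if (option < 0 || option > 1)  /* guard against an invalid index */
        option = 0;

    goto *table[option];           /* jump, with no call/return overhead */

c_search:
    taken = "c_search";
    goto done;

sse4_2:
    taken = "sse4_2";

done:
    printf("took branch: %s\n", taken);
    return option;
}
```

Unlike a real call, nothing is pushed to the stack here, which is exactly the overhead the post is trying to avoid.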

This option seems to have minimal overhead, aside from loading the “label address table” at every function call. I assume this is required because label addresses are offsets from the instruction pointer and need a local point to catch a reference; otherwise they are out of scope. I may also be totally wrong; label scope at the lower level is something I will need to research further. Either way, the reason I did not include this option in the public repository is that it was sometimes unpredictable in which branch it would take. It didn’t crash the program, but without further analysis (for which there was not enough time left in the course) it was too unstable for realistic use at this time. As stated multiple times, I am hopeful to continue my work on this.

 

Runtime Detection

This is the code used to determine the architecture at run time; it uses GCC built-in functions to request information from the system. My other thought was to write a separate HW_detect module that could query the /proc filesystem for attributes, as a safe and portable method of runtime hardware detection; with time being a factor again, that goes onto my own stack of personal project ideas. Maybe I should create a “public” todo list.

Anyways back to the code:

// Functions for intrinsics jump table:
void optimize_scanbuf_init()
{
   // Default to C code. Note: runtime_detection is the global declared
   // elsewhere; re-declaring it locally here would shadow the global
   // and break the jump table.
   runtime_detection = OPTIMIZE_USE_DEFAULT;

   // set jmp table index:
   if (__builtin_cpu_supports("sse4.2"))
      runtime_detection = OPTIMIZE_HAS_SSE4_2;

   // This one is only here for an example:
   if (__builtin_cpu_supports("avx"))
      runtime_detection = OPTIMIZE_HAS_AVX;
}

This is set to be called from a high-level function in libclamav, specifically when the database is initially read. By doing this, the init function is only called once, and any error causes a fall back to the default C code.
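A compilable sketch of this one-shot detection pattern, assuming a GCC build on x86 (the `OPT_*` names are my own stand-ins for the post's OPTIMIZE_* constants):

```c
/* Hedged sketch of the one-shot runtime-detection pattern described above.
 * __builtin_cpu_supports() is a GCC builtin for x86; the enum names below
 * are assumptions standing in for the OPTIMIZE_* constants in the post. */
#include <stdint.h>

enum { OPT_USE_DEFAULT = 0, OPT_HAS_SSE4_2 = 1 };

/* Global, so the search routine's jump table can read it on every call. */
uint8_t runtime_detection = OPT_USE_DEFAULT;

void optimize_scanbuf_init(void)
{
    runtime_detection = OPT_USE_DEFAULT;      /* fall back to plain C */
    if (__builtin_cpu_supports("sse4.2"))
        runtime_detection = OPT_HAS_SSE4_2;
}
```

Calling this once at start-up means the per-call dispatch is a single array index, with no repeated CPUID queries.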

 

Final Project Overview:

This project has been one of the most fun and challenging things I have done in quite some time. After many late nights and countless hours of reading, all with no idea whether it would ever work, I feel I have accomplished something I otherwise would not have been able to.

Also I know that I have not posted much of the code in previous posts as much of it was still experimental for me; hopefully this post will clear some things up and help anyone who is reading it move forward with their own personal projects.

I have really enjoyed having you as my readers and hope to continue this blog into the future. It’s time to shut down the test machine and say thank you to everyone involved in this journey; I won’t forget it.

 

With Love, always.

ElliePenguins

Project Detour

 

Hello everyone, I have something kind of unrelated to my project, but it is still interesting. Actually, it may be related to the project, but for now it is more of an interesting discovery (for me). After discussing with my professor how to eliminate potential overhead in my program, they brought up the idea of using goto statements to select which piece of code to run. From this conversation came the idea of taking the address of a label to use as a function call. This was something I couldn’t stop thinking about.

So this article, although somewhat unrelated, will dig into some C-based ways of playing with execution at a lower level. As of starting this write-up I have no idea what will work and what will not, but I intend to piece together different ideas purely to see what is possible. My first stop on this journey was of course Google, where I found this article about how to extract the address from a label:

https://stackoverflow.com/questions/1777990/is-it-possible-to-store-the-address-of-a-label-in-a-variable-and-use-goto-to-jum

and

https://stackoverflow.com/questions/15702297/dynamic-jump-to-label-in-c

The second link above actually has some interesting ideas, specifically https://stackoverflow.com/a/15736890, that may be very useful in my project; you will absolutely hear about that later, but for now let’s see what we can do with goto and function operators.

Here is the example code that I will be starting from:

   #include <stdio.h>

   int main ( void ) {
     void *ptr = NULL;
     label: ptr = &&label;
     printf("label address %p\n", ptr);
     return 0;
   }

 

Which itself compiles and outputs the address of the label. Neat.

Next up, what can we do with these things, well first let’s try and call that address with the function operator:

       ptr();

Error: called object ‘ptr’ is not a function or function pointer.

Dissapointing, but maybe we can try again with an array of pointers to functions:

 

               void (*ptr_array[])() = {ptr};

               ptr_array[0]();

 

This, on the other hand, does run and provides us with between 30,000 and over 100,000 lines of output before segfaulting. Let’s dig into that a little bit with gdb. Start it up with:

               gdb -q <name_of_program>

Note that the ‘q’ flag is used to suppress that large, unsolicited welcome message at EVERY start.

 

Continuing on, running the program within gdb provides us the opportunity to examine the call stack at the time of the segfault. You can do this with “bt”, and it seems to be breaking within printf. It is difficult to make assumptions at this time, as the address within ptr from label should not have changed; basically we aren’t seeing any difference in the data, so we need to investigate further.

Let’s start with break points to see what exactly is happening at each call:

1 – Pointer seems to be getting assigned to the address of the label.

2 – The printf function is executed and shows the address of the label.

3 – ptr_array[0] is called as a function.

  • Requesting the address contained at ptr_array[0] from gdb still shows NULL, interestingly enough.
  • This however is resolved by using ptr_array[0] = &&label;

4 – After the call to ptr_array[0]() we are somehow back at the beginning, at a break point before the label.

  • Because this was a function call and not a goto, we have presumably pushed a return address onto the stack. Let’s examine that afterwards.

5 – And looping continues like this until we reach a segfault.

My next idea is to create a watch point on our pointers and see if the values are ever changed. We can do this in gdb with:

watch <entity_name>

Aside from a slow run, this provides us with nothing of value. Let’s remove the print statement that is slowing it down and run again, to catch the watchpoint on failure while the earth is still turning. I will note that this requires recompiling the program.

After considerable time, we finally have some output. It seems that the label pointer saved in ptr_array has not been modified at the time of the segfault. However, considering our thought that this segfault may be related to stack size (which, thanks to this software portability and optimization course, we know is fixed), let’s check the stack pointer:

p $esp

               -8392704

This output, 8,392,704 bytes, is just over 8 MiB, the typical default stack size. To examine further, let’s see what data is in that section; in gdb we can use:

               x/32x $esp

0x5555518c        0x00005555        0x5555518c        0x00005555

0x5555518c        0x00005555        0x5555518c        0x00005555

0x5555518c        0x00005555        0x5555518c        0x00005555

0x5555518c        0x00005555        0x5555518c        0x00005555

[…snip…]

This large block of repetitive data on the stack seems to confirm our suspicion that we are calling a function that will never return. When we call the address of the label, the compiler never emits an instruction to pop that return address off the stack and load it back into the instruction pointer. Instead the call is made over and over again, each time pushing an address onto the stack and never cleaning it up. Dumping 1024, 2048 and 4096 bytes from $sp still shows exactly the same data, which matches my suspicions. Let’s see if we can find a correlation in the data next.

So what exactly do these hex numbers mean? Let’s put them together. Running our gdb examine command again with the ‘g’ (giant, 8-byte) size specifier, we get them reassembled:

x/32gx $esp

0x000055555555518c 0x000055555555518c

0x000055555555518c 0x000055555555518c

0x000055555555518c 0x000055555555518c

0x000055555555518c 0x000055555555518c

[…snip…]

Now think about what we would be saving if we wanted to return; let’s compare this against what is in our instruction pointer register:

i r

Meaning info registers; from the output we can see the line:

$rip            0x55555555518a      0x55555555518a <main+81>

Which compared against

disass main

 

[…snip…]

   0x0000555555555167 <+46>:   lea    rax,[rip+0xfffffffffffffff9]        # 0x555555555167 <main+46>

   0x000055555555516e <+53>:   mov    QWORD PTR [rbp-0x18],rax

   0x0000555555555172 <+57>:   lea    rax,[rip+0xffffffffffffffee]        # 0x555555555167 <main+46>

   0x0000555555555179 <+64>:   mov    QWORD PTR [rbp-0x10],rax

   0x000055555555517d <+68>:   add    DWORD PTR [rbp-0x1c],0x1

   0x0000555555555181 <+72>:   mov    rdx,QWORD PTR [rbp-0x10]

   0x0000555555555185 <+76>:   mov    eax,0x0

=> 0x000055555555518a <+81>:   call   rdx

   0x000055555555518c <+83>:   mov    eax,0x0

   0x0000555555555191 <+88>:   mov    rcx,QWORD PTR [rbp-0x8]

   0x0000555555555195 <+92>:   xor    rcx,QWORD PTR fs:0x28

   0x000055555555519e <+101>:  je     0x5555555551a5 <main+108>

   0x00005555555551a0 <+103>:  call   0x555555555030 <__stack_chk_fail@plt>

   0x00005555555551a5 <+108>:  leave

   0x00005555555551a6 <+109>:  ret

[…snip…]

 

From what I can see, we are using lea, or “load effective address”, to bring in the address at an offset from $rip, which I assume to be the address of our label, and storing it in rax. Note that the second lea is loading that same address again, this time into our array. That value is stored at an address indexed from rbp, meaning it is on the stack, from where it is later loaded into rdx. Finally, this is where we make the call to the address in rdx. Every time that call runs it pushes the address of the following instruction (<+83> above) onto the stack; without a matching return instruction, that explains the mess on our stack.

This creates some kind of weird, failed-recursion “hyper drive”. However, if there were a means in C to get that address off the stack (or never push one in the first place), this would be a fantastic, low-overhead way of calling my optimization functions. It would be especially useful if it were possible to do a non-local goto and still have all the local variables available.

 

Hope you enjoyed my quick write up, maybe you can find a practical use for it in whatever you’re doing.

 

Cheers,

ElliePenguins

 

 

Project Progress Update

Hey again, I am writing this post to let anyone who is interested know that my project has been progressing well: I have a working optimization and have even implemented features to allow other multi-platform/architecture optimizations in the future. For those of you who may be new to my blog, I am attempting to optimize ClamAV for a course I am taking in software portability and optimization; you can read previous posts here:

https://penguinpro.ca/category/school-work/project/

My big announcement was that I was able to get the program to loop over my SSE4.2 optimization so that all of the data is checked and all indexes and offsets are properly set. The overall program run time now is about 2 minutes and 41 seconds for a speed up of just shy of 5 seconds, which is an optimization of about 3% overall.

I was hoping to keep you all updated sooner on the progress of this project; however, because it has been moving along quite well, it has been difficult to shift focus from working to writing. There is more to tell you than just the fact that the optimization is working: I have also implemented some features that may allow more architecture-dependent optimizations in the future. What I mean is that I have now implemented the scaffolding (in two different ways) to allow a single version of the code to run on multiple architectures, as well as the initialization function that must be called during program start-up to select the best available optimization.

Now for the downside of that “feature.” Making the program able to detect its running environment and call the appropriate function has brought the gain down to about 2 or 3 seconds (at best) instead of the above-mentioned 4 or 5. This essentially cuts the optimization in half, to roughly 1.5% to 2% with one implementation; the other completely negates any benefit. I am still hopeful that the ability to implement hardware-specific optimizations as functions is worth enough overall that someone who understands this better than I do can run with my changes and optimize further.

I will note here that both methods include jumptables however one contains pointers to a complete copy of the optimized function, using:

__attribute__ ((ifunc ("resolver_name")));

The other takes only the block of optimized code and places it in its own function, which is called from the original function where the block would normally be, passing the required information as arguments.
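For readers unfamiliar with ifunc, here is a minimal sketch of the mechanism (GNU-specific: it needs GCC and an ELF/glibc target). All of the function names are hypothetical; they are not from the project, and the resolver's decision is hardcoded for brevity where a real one would check CPU features:

```c
/* Minimal ifunc sketch (GNU C extension, ELF/glibc only). The names
 * search_c/search_fast/resolve_search are illustrative assumptions. */
static int search_c(int x)    { return x * 2; }   /* portable fallback */
static int search_fast(int x) { return x * 2; }   /* "optimized" twin  */

typedef int (*search_fn)(int);

/* The resolver runs once, at load time, and returns the implementation
 * the dynamic linker should bind "search" to. */
static search_fn resolve_search(void)
{
    int have_fast = 1;  /* a real resolver would check CPU features here */
    return have_fast ? search_fast : search_c;
}

/* Callers just call search(); the indirection is resolved before main. */
int search(int x) __attribute__((ifunc("resolve_search")));
```

Because binding happens once at load time, callers pay no per-call dispatch cost, which is the appeal of this approach over an explicit jump table.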

Again both of these methods are implemented with an array of function pointers similar to:

searchOptions[hardware_capability](arg1, arg2, arg3);

“hardware_capability” is set by the initialization function mentioned earlier. The (ifunc) version needs no index, as the address of the function is “seamlessly” returned by the resolver function, which uses the same function API (argument list). Both of these solutions are transparent to developers using the function; however, ifunc causes more performance degradation, and the reason for this will require further research.
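The jump-table variant can be sketched as follows; the implementation names and the capability indexes are my own stand-ins for the searchOptions table described above:

```c
/* Sketch of the jump-table dispatch described above: an array of function
 * pointers sharing one signature, indexed by a capability value. All names
 * are illustrative, not from the ClamAV source. */
#include <stddef.h>

static int search_plain(const char *buf, size_t len) { return (int)len; }
static int search_simd(const char *buf, size_t len)  { return (int)len; }

/* Table of implementations; index 0 is the portable default. */
static int (*const search_options[])(const char *, size_t) = {
    search_plain,
    search_simd,
};

int run_search(int hardware_capability, const char *buf, size_t len)
{
    /* Clamp unknown capability values back to the safe default. */
    if (hardware_capability < 0 || hardware_capability > 1)
        hardware_capability = 0;
    return search_options[hardware_capability](buf, len);
}
```

Unlike ifunc, this costs one indexed indirect call per invocation, but it is plain portable C apart from the capability probe itself.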

I feel that the biggest issue with my optimization, in terms of the actual use of intrinsic functions, is that the strings are not properly aligned; I apologize for my previous post stating that my solution “aligned” the strings. Instead, I learned that I was actually only attempting “fancy” index operations. I am hopeful that by implementing proper string alignment we will be able to eliminate the required checks and adjustments that inevitably cost some processing time. My other issue is that, being new to topics such as intrinsics, inline assembler and SIMD, I chose to use the SSE4.2 instructions with their 128-bit register width. This may have been a safe bet while I was learning, but more performance could be gained by taking advantage of AVX’s 256-bit (or AVX-512’s 512-bit) registers. I also feel there may be other areas of potential optimization within this software where I can apply my new knowledge, as well as other Open Source projects in the future.

My next steps will be to get in touch with my instructor to make sure that I have implemented the jump tables and ifunc resolver appropriately. I am really hopeful that a simple mistake is the reason for the performance drop and that a fix can be implemented quickly with some extra direction. When this is resolved I will try to get a more technical post available for readers who may be attempting to implement similar functionality in their own projects.

Again, thank you for reading; and I hope to have some actual code posted for you all soon.

With love,

ElliePenguins

Project Mile Marker

Hello again, I have another post about my attempt at optimizing clamav, which if you are new to my blog you can check out previous posts here:

https://penguinpro.ca/category/school-work/project/

 

In my last post I was able to get into the source code for my project, and after what I feel was a considerable amount of analysis of a single function, I have started my optimization attempt. As of now I am getting the correct output on a single directory; add recursion into the mix and you get a segfault.

However, I have noticed a considerable speed up in the program. Running against a single directory without recursion (because it’s broken), we have a run time of:

----------- SCAN SUMMARY -----------
Scanned directories: 1
Scanned files: 8
Infected files: 1
Data scanned: 8.00 MB
Data read: 2089.99 MB (ratio 0.00:1)
Time: 27.139 sec (0 m 27 s)
Start Date: 2019:11:20 16:25:02
End Date: 2019:11:20 16:25:30

In comparison to a previous run time of:

----------- SCAN SUMMARY -----------
Scanned directories: 1
Scanned files: 8
Infected files: 1
Data scanned: 8.00 MB
Data read: 2089.99 MB (ratio 0.00:1)
Time: 47.126 sec (0 m 47 s)
Start Date: 2019:11:20 16:23:36
End Date: 2019:11:20 16:24:23

This is a significant improvement, to the point that I am skeptical, especially since we cannot run it against a larger data set without a segfault. There are too many discrepancies to say anything for certain, but the numbers are consistent and the output is correct; it at least gives me some hope for this project.

My next task, of course, is to solve the segfault, which I believe relates to aligning the modified block to avoid loading values past the end of an array. But after fighting with the software to get the correct output on character strings, I felt that making this post was required.

 

Modification overview:

The most difficult thing about this optimization was getting the character fields to load at least 16 characters at a time without overrunning the buffer. The first pass of our SIMD intrinsic operation went off without issue and returned the index of the first non-matching character; perfect. The second pass, for any string longer than 16 characters (most of them), required that we “rewind” the current position pointer. We need to do this to make sure that we are loading 16 real characters and not garbage values off the end of the buffer, which could cause erroneous results.

That was done with a ternary operator to select the larger length, then subtracting the number of characters to be loaded into the vector registers from that index; it looked like this:

                l = p->prefix_length + p->length;

                j = (length > l) ? length - 16 : l - 16;

This allows me to load the string from a safe offset; even though we are re-checking some characters, it is still much faster than checking our loop condition after every single character comparison.
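To convince ourselves the rewind keeps the final 16-byte load inside both buffers, here is a scalar model of the bounds-safe form of that index math (taking the minimum of the two lengths, as the final source shown earlier on this page does). Variable names follow the post; the function itself is my own sketch:

```c
/* Scalar model of the "rewind" index math: clamp j so the final 16-byte
 * chunk starts at min(length, l) - 16 and cannot run past the end of
 * either buffer. Not the ClamAV code, just a model of the arithmetic. */
#include <assert.h>

/* Start index for the last 16-byte chunk; both lengths must be >= 16. */
unsigned rewind_index(unsigned length, unsigned l)
{
    unsigned j = (length < l) ? length - 16 : l - 16;  /* min(...) - 16 */
    assert(j + 16 <= length && j + 16 <= l);  /* load stays in bounds  */
    return j;
}
```

For the 21-character example in the figures below, this puts the last load at index 5, re-checking 11 already-checked characters but never reading past the array.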

From there we could load the values like:

                xbp = _mm_loadu_si128((__m128i*) &bp[j]);

                xpt = _mm_loadu_si128((__m128i*) &pt[j]);

Here is a graphic representation:

[Image: post_one]

The green and yellow together are the complete string, in this case 21 characters. The blue represents the vector register lanes each individual character can be loaded into, and the yellow represents characters beyond the width of the registers, that is, past index 16 from the beginning of the array.

You can see that we still need to check the next 5 characters after the first call; however, we cannot just load the vector register with the next set of 16 characters, as we would read data off the end of the array, as demonstrated in this image:

[Image: post_two]

Here the indigo colour (running off the edge of the image) represents the characters in the string that have already been checked. The green represents the characters still to be checked, and the red represents data past the end of our character array, essentially arbitrary data. Attempting to access this data will either cause the program to segfault or, worse, run checks against it and corrupt our output. This is where the above solution comes into play, rewinding the pointer back to a point where we can check all the new characters, even if it means re-checking some previous ones; because we do this in one instruction, it is much faster than looping over every single character to determine a match. The alignment looks something like this:

[Image: post_three]

Here we have indigo representing checked characters, yellow representing characters that are being rechecked and green being our next set of characters to check.

After each check operation, the intrinsic function we are using returns an index that can be used to determine whether a match on the entire string has been found. Any indexes that rely on that information can then be set, and this seems to be running appropriately; we are getting the proper output in terms of character string matches, so now we can move on to the next part:

Determining a loop condition that allows this function to iterate through large blocks of non-text data, which I believe is where the segfault, and the unexpected speed up, come from.

 

Summary:

This post was only a quick update on my work so far with my chosen piece of software. Breaking past this milestone means that we (maybe) have a handle on how the offsetting works. Our next task is to make sure that this function works with any byte array, not only text. To do this we need to determine what data is causing the crash and in what context, or which function is being called, at the time of the segfault. Once that is solved it is a matter of cleaning up the code for efficiency and maintainability, but one thing at a time.

Hope to have more for you soon.

ElliePenguins.

Another Project Update

Another quick update; but first, those of you who may not be totally familiar with my attempt at optimizing an open source project can check out the previous posts here:

                https://penguinpro.ca/category/school-work/project/

The reason I needed to post again today was because last night was somewhat productive; or well not really productive, but I did a lot of work anyways and you get to hear about it.

So last night was when I started really digging into the source code of the project and even attempted to make some modifications. If you look back on this post:

                https://penguinpro.ca/2019/11/15/project-update/

You will see that I was attempting to modify some of the loops within matcher-bm.c in libclamav. I thought it would be possible to modify some of the loops so that more than one character could be checked per iteration. The issue with this thinking was that I was not aware that this function implements the Boyer-Moore string search algorithm. For more info you can check out this link:

https://www-igm.univ-mlv.fr/~lecroq/string/node14.html

It is mentioned in this link that “The Boyer-Moore algorithm is considered as the most efficient string-matching algorithm”.

I was right (I think) about one thing: this function is doing the heavy lifting in terms of comparing character and string values. However, the search algorithm uses a C-style linked list, with individual nodes containing the patterns and offsets the algorithm requires, which means each comparison requires fetching the next node. For our optimization attempt, we needed to get the address of the next data structure from the next pointer within the current node and access its individual elements.

Attempting this in practice only slowed the software down, the extra loads and compares from memory only got in the way of how the algorithm was supposed to search, though it did not stop the program from making the correct selections.

The parts of the program that were showing up as busy instructions were simply processing an incredible amount of information by looking at single characters. Which is exactly why the movzbl instruction accounted for almost 30% of the total function run time; that specific instruction was responsible for loading the character for comparison. Wouldn’t it be great if we could load 2, 4 or maybe 8 characters for comparison at a time? Of course, and I am still hopeful to be able to do that before the time in this class runs out; however, it must be done in a way that does not break the algorithm.

After playing with this block of code until about 3am, and adding debug printf statements to individual sections of interest, I found that this program is unique (to me) in how it compares a massive amount of data in a very short time: by checking individual characters it is able to process and compare almost 8 GB of data in about two and a half minutes.

This is a great example of how a community of people can come together to make something about as good as it can get. Whatever happens with this project, it has definitely increased my appreciation for open source technologies. Not that I ever had a doubt in the “bazaar” style of development.

Now I must finally mention that, I am not completely giving up hope on this project yet, there is still one little glimmer of hope in the form of a character comparison loop within matcher-bm.c at line 345 (as of this writing) that looks like this:

  found = 1;
  for(j = 0; j < p->length + p->prefix_length && off < length; j++, off++) {
     if (bp[j] != pt[j]) {
            found = 0;
            break;
      }
  }

This small loop of about 7 lines does not seem to run particularly hot. It is however one of the last checks in the control flow of its parent loop. If bp and pt do not match at index j, the break statement ends the loop and execution jumps back to the outer for loop at line 276. There p, a pointer into a linked list of type cli_bm_patt, is advanced to the next pattern structure and searching continues.

Unfortunately, since each candidate node in the linked list must traverse the above part of the parent loop to see if it is appropriate for a match, working on multiple nodes at once is not going to work. This is not an optimization that can be handled with vectorization; it needs to be tackled with parallelization, a task already handled by clamdscan.
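To picture why the traversal is inherently serial, here is a hypothetical, heavily simplified stand-in for the pattern node; the real cli_bm_patt struct in the ClamAV source has more fields than this:

```c
/* Hypothetical, simplified stand-in for the cli_bm_patt linked-list node
 * described above; the real struct in ClamAV carries more fields. */
#include <stddef.h>
#include <stdint.h>

struct bm_patt {
    const uint8_t *pattern;   /* bytes to match         */
    uint16_t length;          /* pattern length         */
    uint16_t prefix_length;   /* prefix part, if any    */
    struct bm_patt *next;     /* next candidate pattern */
};

/* Walking candidates one node at a time, as the parent loop must:
 * each node's verdict decides whether we even look at the next one. */
int count_candidates(const struct bm_patt *p)
{
    int n = 0;
    for (; p; p = p->next)
        n++;
    return n;
}
```

Each iteration depends on a pointer chase through `next`, so there is no contiguous block of nodes to hand to a vector unit; only the per-node byte comparison is vectorizable.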

My last shred of hope for any kind of optimization within this block of code is that we can somehow try and make this loop take advantage of vector registers and compare multiple characters per iteration. To do this within the X86 environment we MAY have a couple instructions available to us, they are:

MPSADBW – Compute Multiple Packed Sums of Absolute Difference

https://www.felixcloutier.com/x86/mpsadbw

&

PCMPESTRM – Packed Compare Explicit Length Strings, Return Mask

   https://www.felixcloutier.com/x86/pcmpestrm

From what I understand, string compares can be difficult to vectorize. This is due to the fact that we could load values past the end of the array if we are not properly aligned, or we may be comparing strings of different sizes. But by using these instructions, the first of which shows more promise, we MAY be able to get a tiny bit more performance out of this function.
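As a mental model of what the packed-compare instructions report, here is a scalar sketch of a 16-byte "equal each" comparison with negative polarity: it yields the index of the first differing byte, or 16 when the whole chunk matches. This ignores the implicit-length (NUL-terminated) behaviour of the real instruction and is my own illustration, not project code:

```c
/* Scalar model of a 16-byte "equal each, negative polarity" packed
 * compare: index of the first byte where the chunks differ, or 16 when
 * all 16 bytes match. Illustrative only; ignores implicit-length rules. */
#include <stdint.h>

int first_mismatch16(const uint8_t *a, const uint8_t *b)
{
    for (int i = 0; i < 16; i++)
        if (a[i] != b[i])
            return i;     /* mismatch found at index i       */
    return 16;            /* chunks identical: "no mismatch" */
}
```

The SIMD version does the same work in one instruction, which is exactly the per-character load-and-compare cost the movzbl profile above points at.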

Before I get anyone’s hopes up, I need to note that the way this must be done will be architecture specific. The best way I can think of is with intrinsics which will require we use some kind of pre-processor logic. My overall worry is this may overly complicate this block of code to the point it may not be worth the minimal performance gain that will be achieved; let’s try anyway.

 

Cheers,

ElliePenguins

Project Update

Creating a dedicated testing box & some actual code.

Hey everyone, I have another update on my progress with software portability and optimization. As always, for anyone new to this blog: I am attempting to apply what I have learned so far in a final project, and you can check my previous posts on the topic here:

https://penguinpro.ca/category/school-work/project/

 

New/Old Machine:

As you may be aware, I had attempted to build and benchmark my software on a virtual machine, thinking that the tools within VirtualBox, such as snapshots and virtual networking, would make the process go much smoother. In reality, however, I found that it caused strange fluctuations in output times, and attempted “optimization” modifications actually slowed the run times down, as you will read in a bit.

Luckily for me I was able to attend a used hardware sale and found myself with a “new” dedicated i5 with a fresh Arch install, a process which will make a good blog post for anyone interested in making the jmp to command line junkie. But for now this post will just cover new benchmarks on this hardware, as well as describe some direction as I make a first attempt at optimization.

Also, the networking for this new machine involves a USB NIC and a virtual installation of pfSense. For anyone who wants to start experimenting with software, it is always a good idea to learn a bit about good VM/lab hygiene. Being aware of which machines can see each other, or have access to different parts of your internal network, is important as you move forward. You can find instructions on how to install pfSense onto a VirtualBox VM here:

https://penguinpro.ca/2019/09/14/87/

 

Before we continue I wanted to say to anyone who has old computing hardware laying around: please do not throw it away. Not only is this bad for the environment, but cheap or free hardware is how I was able to learn computers; there are many talented people who cannot afford it.

Please consider donating old computers and hardware, you may just inspire the next generation of tech professionals:

https://www.rebootcanada.ca/

https://www.pcnix.ca/donate-your-old-computer/

(Note: I only found the above examples with a quick google search as an example, it may be a good idea to research further.)

Now back to the software.

 

New Build

Just like in earlier posts, the software was cloned from the GitHub repo:

                git clone https://github.com/Cisco-Talos/clamav-devel

However this time instead of creating individual build directories with different configurations, we created:

                git branch static

                git branch vector

And tailored the source in each branch for the type of build that we are trying to create.

When it comes to the actual build I have decided on this configuration, as I feel that there is some opportunity for vectorization within matcher-bm.c, although I am not 100 percent confident in it. This is also the configuration I have decided to settle on while adding my personal modifications to the source code.

CFLAGS="-ggdb -ftree-vectorize -mavx -fopt-info-vec-missed -fopt-info-vec -fopt-info-vec-note -O2" ./configure --prefix='/home/<username>/vector_build' --enable-check --enable-coverage --with-systemdsystemunitdir=no --enable-experimental --enable-clamtop --enable-pcre --disable-llvm

After we run our configure script, we can use the next lines to have our make command give us output regarding decisions made about vectorization in each source file's loops. This information is stored to a file that can be searched for each specific loop with:

                make -j<core_count> 2>&1 | tee ~/compiler_output.txt

Or

                make -j<core_count> 2>&1 | grep vector | tee ~/compiler_output_vectorize.txt

 

For a more detailed explanation of the actual build process, see this link:

https://penguinpro.ca/2019/11/07/project-build/

 

As well as the provided documentation:

https://www.clamav.net/documents/clam-antivirus-user-manual

 

The Test:

Our test on this new machine uses the exact same data that was used within the virtual environment. However, we are now consistently achieving a run time of 2 minutes and 45 seconds across 10 runs, with only milliseconds of deviation between them.

This is in contrast to our previous run times of well over 6 minutes, with about 30 seconds of deviation between high and low. You can read about that here: https://penguinpro.ca/2019/11/10/main-project-benchmark/

 

Proper Static build:

To anyone who was following the blog or has read some of my previous posts: you will know that I had a bit of an issue trying to build the software as a static binary. My attempt at doing this was by adding flags to the compiler; after reviewing the ClamAV documentation again, I noticed that we were actually supposed to use the --enable-static and --disable-shared flags when calling the configure script. After we did this we were provided with this large map:

[Image: call graph from the static, instrumented build ("post7")]

This output shows a better representation of the number of calls to each function (because it uses instrumentation instead of sampling). Aside from a few areas we may investigate in the future, it seems our original plan of targeting "cli_bm_scanbuff" in "libclamav" holds true. Let's look at some code.

 

Some Code:

Here is a direct link to each source file of our matcher-bm module:

Header: https://github.com/Cisco-Talos/clamav-devel/blob/dev/0.103/libclamav/matcher-bm.h

Source: https://github.com/Cisco-Talos/clamav-devel/blob/dev/0.103/libclamav/matcher-bm.c

The lines that stand out when inspected/disassembled with perf are:

 

 

283:  if (p && p->cnt == 1 && p->pattern0 != prefix) 

Which relates to:

movzbl 0x3e(%r13), %edx           ; @ 19.23% usage

This specific statement is within a large for loop with many conditions in between. It also acts on elements of a C-style linked list that are single entities, in this case one character and one int. Looking specifically at p->pattern0 != prefix, there is an identical comparison approximately 20 lines of code later; however, because the linked list pointer can be advanced in several places, storing this calculated result is of little value, and when attempted it actually slowed our code (in the virtual environment) by seconds. My thought here is that if DMA is available and a pattern can be detected, storing these results may be of some worth. We will look into this in the future.

  

312:  if ((off + p->length > length) || (p->prefix_length > off)) {
             p = p->next;
             continue;
       }

                 Which relates to:

movzwl 0x38(%r13),%eax            ; @ 12.12% usage

Again, we are looking at values that are resolved using single entities stored as elements of a linked list. The actual reasoning for this, given the lack of comments, will have to be researched further; however, it seems to relate to the comparison of signatures. Again, determining and storing these values earlier on may relieve some computational overhead.

 

Loops:

Loops in this program are handled interestingly: they seem to be for loops with all of the calculation and advancement of pointers done in the condition and increment parts of the loop. Rewriting these to take advantage of SIMD may provide some optimization; however, I worry about the length of the data being compared. If we are advancing a pointer through an array of values, we may need to know the length of that value, as well as the requirements involved in forcing alignment of the data structures themselves. Again, this will require research.

 

                Example:

     for (; offdata->pos < offdata->cnt && off >= offdata->offtab[offdata->pos]; offdata->pos++)
                ;

 

 

Interesting variable:

On line 249 within cli_bm_scanbuff there is a variable of type uint16_t called idx, which I assume is intended to mean index. On line 277 we see:

 idx = HASH(buffer[i], buffer[i + 1], buffer[i + 2]);

 

Two interesting things I have noted: first, HASH seems to report back some kind of index, which makes that function an interesting place to investigate for signature lengths. The second, more "suspicious" observation is that while stepping through this block of code with gdb at -O3, I noticed that when attempting to print this variable's value it was reported as "optimized out." This indicates to me that it may be worthwhile to investigate the intended purpose and overall value of this variable.

 

Summary:

My intention for this post was mostly to give an update on the change in my dev environment, as well as to keep track of things as I attempt to cut into the more technical elements of this project. My next post will hopefully include any findings from stepping through individual blocks of code, and maybe some information on how to use gdb to get a better understanding of how a program works.

 

Love,

ElliePenguins

Main Project Benchmark

 

Hello again friends, I have finally gotten around to actually digging into the code a bit for my selected project for software portability and optimization. For those of you who may not be familiar, here is a link to previous posts in the project:

https://penguinpro.ca/category/school-work/project/

I think that is enough of an introduction for this post, because if you're as excited as I am to finally dig in and get started on this project, it's going to be difficult to focus on anything else.

At this point I will list some of the tools that we will be using for profiling. That's not to say we won't use others as we go, but for now this should give you a pretty good idea of any requirements before we start:

We will need:

            perf

            gprof

            gprof2dot

            dot

            flamegraph

I will also mention our "test data." We are testing on about 7.7GB of various file types. This directory includes PDFs, docx, jar files, mp4 video, many audio formats and JPEGs, as well as some ISO disk images, random data and executables, all ranging in size from a few bytes to a few gigabytes.

Before running:

Before we jump into running our tools, there are some settings we need to configure. For our profiling tools to be able to check the call stack and know where some routines are located, we need to modify some things. We can do this with:

sysctl kernel.kptr_restrict=0

sysctl kernel.perf_event_paranoid=1

Running:

Let's run our first clamscan against our test directory with perf, which interrupts the program many times and records the state of the process during each interrupt.

            perf record --freq=max -g ./clamscan -r ~/clam_test/

In the above command you can see that we run perf record, which tells perf that we would like to analyze the specified process, followed by --freq=max, meaning we want perf to interrupt the program and record a sample as many times as possible; finally we add the -g flag, which tells perf to record stack traces. The reason we run it this way is because of the visual analysis tool that we are going to pipe our data to, called flamegraph.

This was another package that needed to be built from the Arch User Repository, which you can get here: https://aur.archlinux.org/flamegraph.git

In earlier articles I described how to build these packages, but I will include it again because it is useful to know in the Arch world.

git clone https://aur.archlinux.org/flamegraph.git

                makepkg

                pacman -U flamegraph-[version]-x86_64.pkg.tar.xz

Once we have this built we can do:

            perf script > run_one.perf

            stackcollapse-perf.pl run_one.perf | flamegraph.pl > run_one.svg

Which provides us with this graphic:

[Image: flamegraph SVG of the first perf run ("post2")]

This SVG graphic can be opened in a browser and is interactive! For example, by doing a search for a function, we can see every instance of "cli_bm_scanbuff" highlighted in purple. The reason I mention "cli_bm_scanbuff" is that it is the function I feel has the highest impact potential for optimization. Being a top-level function, we can assume that actual work is being done in it rather than delegated to other routines. We can also see that although the call from "cli_chkign" is the largest, it is not the only place the function is called, and these other instances are also at the top of the stack trace.

I should note that there may be some optimization possible in the "insert_list" function as well, but that would be completely code based and most likely not within scope for this course. Once we complete our work with "cli_bm_scanbuff" we may open up this function too and see what's inside.

Before I continue, here's a general link to flamegraph; it's great:

http://www.brendangregg.com/flamegraphs.html

Analysis:

Opening our perf report, we can see that the routines most frequently on the stack during sampling are:

cli_bm_scanbuff         in libclamav.so  @ 17.39%

pcre2_match_8           in libpcre2-8.so @ 17.1%

cli_ac_scanbuff         in libclamav.so  @ 13.25%

insert_list             in libclamav.so  @  8.11%

Perf Report output:

[Image: perf report output ("post1")]

Digging deeper into cli_bm_scanbuff, we can see that the most frequently observed call path comes from cli_loadhash, through cli_chkign, ending at the top level in "cli_bm_scanbuff," where it accounts for 59,047 samples, or 8.23% of the total number of samples captured. This, however, does not include other calling functions, so the function may carry more overhead than originally considered.

The module that this function seems to be a part of is:

https://github.com/Cisco-Talos/clamav-devel/blob/dev/0.103/libclamav/matcher-bm.c

We will dig into the source a bit more in my next post, as for now we just need a general idea of where to focus our effort, but this looks the most promising.

Other Tools:

Let's see what our other methods of profiling/benchmarking report. The other tool that we have is gprof, which requires that the software be built with the -pg compiler option. This type of profiling actually embeds instrumentation into the code, as opposed to the way perf interrupts the program and inspects its state at that time.

This profiling tool can be as simple as just running the program and then running:

 gprof <executable_name>

Like perf and flame graph, this program also has tools that can convert its output to a graphical output for easier analysis, this can be done with:

gprof <executable_name> | gprof2dot | dot -Tx11

and saved to a file with:

gprof <executable_name> | gprof2dot | dot -Tsvg > filename.svg

Which in our case provides us with:

[Image: gprof2dot call graph ("post3")]

This does not look exactly right when compared against our previous run with perf; these are all functions that relate to the program itself, like the scan manager and output routines. With a little research we find this link: http://gernotklingler.com/blog/gprof-valgrind-gperftools-evaluation-tools-application-level-cpu-profiling-linux/

It tells us that gprof does not have the ability to profile functions in shared libraries. That's a pretty big problem considering we are trying to optimize libclamav; nevertheless, we will find a way to make this tool do what we want. One way would be to recompile the source as a static executable, but attempting this required dependencies I was not able to satisfy in a reasonable amount of time, so we'll circle back to this one. For now, let's run the program a few times over to get some statistics for the average run time.

Let’s try something like:

            while true; do (time ./clamscan -r ~/clam_test/) 2>&1 | tail -15 >> run_times.txt; echo "run complete"; done

This bash line will run the program the exact same way until we stop it, saving the time data to a file called run_times.txt and printing "run complete" to the terminal at every iteration. From there we can look at those run times and calculate the highest, lowest and average run times.

Note that we are putting the output from the time command together with the scan summary to create entries for every run in a file that look [kind of] like:

———– SCAN SUMMARY ———–
Scanned directories: 35
Scanned files: 339
Infected files: 50
Data scanned: 660.14 MB
Data read: 7753.45 MB (ratio 0.09:1)
Time: 398.016 sec (6 m 38 s)

real      6m38.032s
user     6m10.382s
sys       0m25.658s

After allowing it to run for some time, we are getting run times between 6 minutes 7 seconds and 6 minutes 45 seconds, averaging about 6 minutes 36 seconds.

Overview:

Hopefully we now have a general idea of what is going on within this piece of software with regard to optimization, and most importantly a direction for our next steps. In my next post I hope to actually dig into the source code of the function we looked at above. I am also hopeful we can compare different optimization levels to see if there are any tweaks we can make in the build itself.

 

Send more coffee,

ElliePenguins