SPO 600 Project - Stage 1

For the SPO600 course, I need to complete a project: identify a CPU-intensive function or method - say, a checksum or hash algorithm - that is implemented in a language that compiles to machine code (e.g. C, C++, or assembler) in an open-source project, improve its performance, and then get that optimization accepted upstream. For more information, click here.

Find an open source package

For this project, I chose Tengine, a fork of the nginx web server developed by Taobao. While reading its source code on GitHub, I noticed that it uses MurmurHash, just like nginx. Since MurmurHash is a simple hash function, it would be difficult to improve (I still remember my professor once saying that a simple program is the most difficult one to optimize), but I decided to accept the challenge - because why not?

Benchmark the performance on AArch64

As the function is very simple, testing its performance was much simpler than I imagined: I just extracted the function from the file and wrote a small tester around it.
I checked the Tengine repository, and it builds with the -O2 compiler flag for optimization, so my test is also compiled with -O2. Also, to reduce measurement noise (a program can run faster the second time it is executed thanks to warm caches), I run the test 10 times.

Here is the source code:


#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <time.h>

/* u_char is a BSD alias that <sys/types.h> provides on most systems;
   define it explicitly so the tester builds anywhere. */
typedef unsigned char u_char;

#define DATA_LENGTH 500000

uint32_t
ngx_murmur_hash2(u_char *data, size_t len)
{
    uint32_t h, k;

    h = 0 ^ len;

    while (len >= 4)
    {
        k = data[0];
        k |= data[1] << 8;
        k |= data[2] << 16;
        k |= data[3] << 24;

        k *= 0x5bd1e995;
        k ^= k >> 24;
        k *= 0x5bd1e995;

        h *= 0x5bd1e995;
        h ^= k;

        data += 4;
        len -= 4;
    }

    switch (len)
    {
    case 3:
        h ^= data[2] << 16;
        /* fall through */
    case 2:
        h ^= data[1] << 8;
        /* fall through */
    case 1:
        h ^= data[0];
        h *= 0x5bd1e995;
    }

    h ^= h >> 13;
    h *= 0x5bd1e995;
    h ^= h >> 15;

    return h;
}

int main(void)
{
    u_char *data;
    srand(time(NULL));

    data = calloc(DATA_LENGTH, sizeof(u_char));
    if (data == NULL)
    {
        return 1;
    }

    for (int i = 0; i < DATA_LENGTH; i++)
    {
        data[i] = (u_char)((rand() % 26) + 'a');
    }

    for (int j = 0; j < 10; j++)
    {
        clock_t start = clock();
        /* volatile keeps -O2 from optimizing the otherwise-unused call away */
        volatile uint32_t result = ngx_murmur_hash2(data, DATA_LENGTH);
        clock_t end = clock();
        (void) result;
        printf("Run #%d\n", j + 1);
        double sec = (double)(end - start) / CLOCKS_PER_SEC;
        printf("Time taken: %lf seconds\n", sec);
    }

    free(data);
    return 0;
}

What this tester does is very simple: it creates an array of 500,000 random characters, starts the timer, sends the array to the hash function and records the result, stops the timer, prints the elapsed time, and repeats the measurement 10 times.

This is the final result:
Run #1
Time taken: 0.000002 seconds
Run #2
Time taken: 0.000001 seconds
Run #3
Time taken: 0.000001 seconds
Run #4
Time taken: 0.000002 seconds
Run #5
Time taken: 0.000001 seconds
Run #6
Time taken: 0.000001 seconds
Run #7
Time taken: 0.000001 seconds
Run #8
Time taken: 0.000001 seconds
Run #9
Time taken: 0.000002 seconds
Run #10
Time taken: 0.000002 seconds

Impressive, most impressive. The results look quite uniform.

Identifying the strategies

After benchmarking the performance of the hash function, I came up with some strategies:

1. Altered build options - since Tengine builds with -O2 optimization, maybe it would be faster with -O3 instead. It is also the easiest option, so I don't see why I shouldn't give it a try.
2. Code changes to permit better optimization by the compiler - judging from the code itself, I think I can rewrite it in a way that lets the compiler auto-vectorize (SIMD) a few operations.
3. Algorithm improvements - as the algorithm is already quite tight, I doubt I can improve it further.
4. In-line assembler - personally I am very reluctant to use in-line assembler (since it kills portability), but if that is the only way to make it faster, then so be it.

