Posts

SPO 600 Project - Stage 3

In this post, I will discuss my findings during Stage 3 of the project. For my previous stages, click here and here. Before I start, I think I should mention how much the performance improved, in percentage terms. Here are the results:

```
[GeoffWu@aarchie tester]$ ./tester
Run #1
original version took 33339704 nanoseconds
modified version took 33237179 nanoseconds
hash match: y
difference between two versions: 102525 nanoseconds
percentage difference: 0.307516
Run #2
original version took 33302201 nanoseconds
modified version took 33231595 nanoseconds
hash match: y
difference between two versions: 70606 nanoseconds
percentage difference: 0.212016
Run #3
original version took 33366103 nanoseconds
modified version took 33270939 nanoseconds
hash match: y
difference between two versions: 95164 nanoseconds
percentage difference: 0.285212
Run #4
original version took 33349119 nanoseconds
modified version took 33346432 nanoseconds
hash match: y
difference between two version
```
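The percentage figures above are consistent with taking the time saved as a fraction of the original run time. As a rough illustration (not necessarily how my tester computes it internally), here is a minimal sketch in C using the Run #1 numbers:

```c
#include <stdio.h>
#include <stdint.h>

int main(void) {
    /* Example figures taken from Run #1 above (nanoseconds). */
    uint64_t original_ns = 33339704;
    uint64_t modified_ns = 33237179;

    uint64_t diff = original_ns - modified_ns;                    /* 102525 ns */
    double percent = (double)diff / (double)original_ns * 100.0;  /* ~0.3075 % */

    printf("difference between two versions: %llu nanoseconds\n",
           (unsigned long long)diff);
    printf("percentage difference: %f\n", percent);
    return 0;
}
```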

SPO 600 - Lab 6

In this lab, I will be exploring the use of inline assembler and its use in open source software. For more information, click here.

Part A

For comparison, I tested the performance of vol_simd (the version with inline assembly and SIMD) and vol3 (the version with no inline assembly, which I built in Lab 5) on AArch64 (well, the inline assembly version is written in AArch64 assembly), and I noticed vol_simd is a bit faster than its pure-C counterpart.

Inline assembly and SIMD:

```
[GeoffWu@aarchie spo600_20181_inline_assembler_lab]$ time ./vol_simd
Generating sample data.
Scaling samples.
Summing samples.
Result: -454
Time spent: 0.000670

real    0m0.027s
user    0m0.027s
sys     0m0.000s
```

Pure C:

```
[GeoffWu@aarchie spo600_20181_vol_skel]$ time ./vol1
Result: -142
Time spent: 0.001029

real    0m0.028s
user    0m0.028s
sys     0m0.000s
```

I adjusted the number of samples to 500000000, and vol_simd is still faster. Inline assembly and SIMD: [GeoffWu@aarchie spo600_20181
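To give an idea of what the inline assembly path looks like, below is a minimal sketch of scaling 16-bit samples eight at a time with AArch64 NEON instructions wrapped in inline assembler. The function name, sample count, and 0.75 volume factor are my own assumptions for illustration, not necessarily what vol_simd itself uses:

```c
#include <stdint.h>

#define SAMPLES 500000   /* assumed to be a multiple of 8 */

/* Scale int16_t samples by a fixed volume factor using AArch64 NEON
   instructions wrapped in inline assembler, 8 samples per iteration. */
void scale_samples(int16_t *in, int16_t *out) {
    int16_t *in_cursor = in;
    int16_t *out_cursor = out;
    int16_t vol_int = (int16_t)(0.75 * 32767.0);   /* fixed-point volume factor */

    while (in_cursor < in + SAMPLES) {
        __asm__ (
            "ldr q0, [%[in]], #16        \n\t"   /* load 8 samples, advance pointer   */
            "dup v1.8h, %w[vol]          \n\t"   /* broadcast the volume factor       */
            "sqdmulh v0.8h, v0.8h, v1.8h \n\t"   /* saturating doubling multiply high */
            "str q0, [%[out]], #16       \n\t"   /* store 8 scaled samples            */
            : [in] "+r" (in_cursor), [out] "+r" (out_cursor)
            : [vol] "r" (vol_int)
            : "memory", "v0", "v1");
    }
}
```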

SPO 600 Project - Stage 2

In this post, I will discuss my findings during Stage 2 of the project. For more details, click here. Before I talk about how I optimized the MurmurHash2 algorithm, I would like to answer a question that I forgot to address in my Stage 1 post: where is that algorithm actually used? After doing some research, I found out that Tengine, just like nginx, uses MurmurHash2 in the ngx_http_split_clients_module, which is often used for A/B testing. Alright then, let's talk about how I optimized the algorithm. First of all, I think I should mention the optimization strategies I eliminated:

- In-line assembler: Given that Tengine is meant to support a variety of platforms and CPU architectures, I honestly don't think it is a good option - unless Jack Ma would give me a job afterwards :)
- Altered build options: At first I considered this option, since it is the easiest one, and I planned to change the compile options to -O3/-Ofast, but after checking the cc (which listed all

SPO 600 Project - Stage 1

For the SPO600 course, I need to complete a project: identify a CPU-intensive function/method - let's say a checksum or hash algorithm - that is implemented in a language that compiles to machine code (e.g. C, C++, or Assembler) in an open-source project, improve its performance, and then get that optimization accepted upstream. For more information, click here.

Find an open source package

For this project, I chose Tengine, a fork of the web server nginx developed by Taobao. While reading its source code on GitHub, I noticed that it also uses MurmurHash, just like nginx. Since MurmurHash itself is a simple hash function, it would be difficult to improve (I still remember my professor once saying that a simple program is the most difficult one to optimize), but I decided to accept the challenge, because why not?

Benchmark the performance on AArch64

As the function is very simple, testing its performance was much simpler than I imagined: I just extracted the
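For reference, the heart of MurmurHash2 is only a few operations per 4-byte block, which is why there is so little room to improve it. The sketch below follows the widely published public-domain reference implementation; Tengine's copy may differ in details such as naming and byte handling:

```c
#include <stdint.h>
#include <string.h>

uint32_t murmurhash2(const void *key, int len, uint32_t seed) {
    const uint32_t m = 0x5bd1e995;
    const int r = 24;
    uint32_t h = seed ^ (uint32_t)len;
    const unsigned char *data = (const unsigned char *)key;

    /* Mix 4 bytes at a time into the hash. */
    while (len >= 4) {
        uint32_t k;
        memcpy(&k, data, 4);   /* reference code dereferences directly; memcpy avoids unaligned access */
        k *= m;
        k ^= k >> r;
        k *= m;
        h *= m;
        h ^= k;
        data += 4;
        len -= 4;
    }

    /* Handle the last few bytes of the input. */
    switch (len) {
    case 3: h ^= data[2] << 16; /* fall through */
    case 2: h ^= data[1] << 8;  /* fall through */
    case 1: h ^= data[0];
            h *= m;
    }

    /* Final mixing of the hash. */
    h ^= h >> 13;
    h *= m;
    h ^= h >> 15;
    return h;
}
```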

SPO 600 - Lab 5

In Lab 5, I will be looking at Algorithm Selection and how the choice of algorithm affects the system's performance. For more information, click here. I will be doing this lab on AArch64, and I will compile my programs with -O3 optimization.

Original version

First, I will check whether it produces the same result every single time, how long it takes to scale the sound samples, and how long it takes to run. (Note: to view how long it takes for the system to execute a command, type time before it. For more information, consult the man page of time.) In each run, the program always produces the same result, but the time spent scaling the sound samples and the overall run time vary slightly from run to run. Yes, the difference is very small, but in the end the numbers are still different, right?

Version 2 - Lookup table

In this version, I will create a lookup table of all possible sample values multiplied by the volume factor (a sketch of the idea follows below). Like the original, in each run it also always produces the same result.
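The lookup-table idea is straightforward: a signed 16-bit sample has only 65,536 possible values, so every scaled result can be precomputed once and each sample then costs a single table lookup. A minimal sketch, assuming int16_t samples and a 0.75 volume factor (the lab's actual skeleton code may organize this differently):

```c
#include <stdint.h>

#define SAMPLES 500000
#define VOLUME  0.75

/* Scale samples through a precomputed table holding every possible
   int16_t value multiplied by the volume factor. */
void scale_with_table(const int16_t *in, int16_t *out) {
    static int16_t table[65536];

    /* Build the table: index 0..65535 corresponds to sample -32768..32767. */
    for (int i = 0; i < 65536; i++) {
        table[i] = (int16_t)((i - 32768) * VOLUME);
    }

    /* Scaling each sample is now a single lookup. */
    for (int j = 0; j < SAMPLES; j++) {
        out[j] = table[in[j] + 32768];   /* in[j] promotes to int, index 0..65535 */
    }
}
```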

SPO 600 - Lab 4

In Lab 4, I will be looking at Single Instruction Multiple Data (SIMD) and Auto-Vectorization. Simply put, vectorization means performing one operation on multiple pairs of operands at the same time, to make the program faster. For this lab, I will have to write a C program that creates two 1000-element integer arrays, fills them with random numbers in the range -1000 to +1000, sums these two arrays element-by-element into a third array, calculates the sum of the third array, and then displays it. I will test the program in an AArch64 environment. Here is the source code:

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main() {
    const int SIZE = 1000;
    int a[SIZE], b[SIZE], c[SIZE];
    int min = -1000, max = 1000;
    srand(time(NULL));
    for (int i = 0; i < SIZE; i++) {
        /* offset by min so the values fall in [-1000, 1000] */
        a[i] = min + rand() % (max + 1 - min);
        b[i] = min + rand() % (max + 1 - min);
    }
    int sum = 0;
    for (int j = 0; j < SIZE; j++) {
        c[j] = a[j] + b[j];
        sum += c[j];
    }
    printf("Sum: %d\n", sum);
    return 0;
}
```
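One way to see whether the compiler actually auto-vectorized the loops is to ask GCC to report its vectorization decisions, or to look for SIMD instructions in the disassembly. The file name lab4.c below is just a placeholder for wherever the program is saved:

```
gcc -O3 -fopt-info-vec-optimized -o lab4 lab4.c   # reports which loops were vectorized
objdump -d lab4 | less                            # look for vector instructions (e.g. ld1/st1 on AArch64)
```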

SPO 600 - Lab 3

In Lab 3, I'll be looking at Assembler and Assembly Language. For this lab, I will write a program that prints a message and its loop count for each iteration - in Assembly Language. Here's the expected output:

```
Loop: 0
Loop: 1
Loop: 2
Loop: 3
Loop: 4
Loop: 5
Loop: 6
Loop: 7
Loop: 8
Loop: 9
```

Looks quite simple, doesn't it? Well, in C this would get the job done:

```c
#include <stdio.h>

int main() {
    for (int i = 0; i < 10; i++) {
        printf("Loop: %d\n", i);
    }
    return 0;
}
```

But in Assembly Language, it is much more complicated. In fact, you actually have to manually convert the number into ASCII, insert it into the string, and then print it out:

```asm
.text
.globl _start

start = 0
max = 10

_start:
    mov     $start, %r15            # loop counter

loop:
    mov     %r15, %r14
    add     $48, %r14               # convert the digit to ASCII ('0' is 48)
    mov     %r14b, msg+6            # write the digit into the message

    movq    $len, %rdx              # message length
    movq    $msg, %rsi              # message address
    movq    $1, %rdi                # file descriptor 1 (stdout)
    movq    $1, %rax                # syscall 1 = write
    syscall

    inc     %r15
    cmp     $max, %r15
    jne     loop

    mov     $0, %rdi                # exit status
    mov     $60, %rax               # syscall 60 = exit
    syscall

.data
msg:    .ascii "Loop: #\n"
len = . - msg
```