-
Notifications
You must be signed in to change notification settings - Fork 80
performance
Benchmark code is available here
For the following results, the best result within 5% has been selected: more than 10 tries of the bench have been launch, then the selected value is the minimum bound of the longest segments of all values within 5%.
## C libraries
Type | List | Array | Tree | Dict S | Dict B | Sort |
---|---|---|---|---|---|---|
M*LIB | 572 ms | 312 ms | 296 ms | 32 ms | 704 ms | 888 ms |
M*LIB+ | 148 ms | 316 ms | 300 ms | 36 ms | 532 ms | 884 ms |
GLIB | 592 ms | 1356 ms | 2056 ms | 424 ms | 728 ms | 1300 ms |
KLIB | 812 ms | 312 ms | 436 ms | 56 ms | 756 ms | 784 ms |
CollectionC | 720 ms | 1624 ms | 1284 ms | 452 ms | 1200 ms | 2696 ms |
LIBDYNAMIC | NA | 1092 ms | NA | 120 ms | 676 ms | NA |
TOMMY DS | 624 ms | 752 ms | 1380 ms | 184 ms | 584 ms | NA |
UTHASH | 564 ms | 328 ms | NA | 420 ms | 1276 ms | 1276 ms |
QLIBC TO | 1476 ms | 3196 ms | 1872 ms | NA | NA | NA |
LIBSRT | NA | 2192 ms | NA | NA | NA | 1252 ms |
CMC | NA | 872 ms | 1404 ms | 328 ms | NA | NA |
CTL | 592 ms | 312 ms | 1204 ms | 460 ms | NA | 1256 ms |
STC | 584 ms | 324 ms | 908 ms | 104 ms | 640 ms | 1304 ms |
POTTERY | 572 ms | 1680 ms | 488 ms ? | 60 ms | 908 ms | 916 ms |
CC | 542 ms | 720 ms | NA | 113 ms | 1336 ms | NA |
VERSTABLE | NA | NA | NA | 80 ms | 620 ms | NA |
## C++ libraries
Type | List | Array | Tree | Dict S | Dict B | Sort |
---|---|---|---|---|---|---|
STL | 616 ms | 772 ms | 1228 ms | 424 ms | 1028 ms | 740 ms |
QT | 596 ms | 1172 ms | 1300 ms | 456 ms | 840 ms | 816 ms |
DENSE HMAP | NA | NA | NA | 48 ms | 804 ms | NA |
FLAT HMAP | NA | NA | NA | 92 ms | 808 ms | NA |
EMILIB | NA | NA | NA | 44 ms | 644 ms | NA |
EMHASH | NA | NA | NA | 36 ms | 521 ms | NA |
HOPSCOTCHMAP | NA | NA | NA | 80 ms | 844 ms | NA |
RIGTORP HMAP | NA | NA | NA | 40 ms | 960 ms | NA |
BOOST UNORDERED | NA | NA | NA | 38 ms | 426 ms | NA |
For theses results, the best one of 3 tries was kept:
Type | Queue MPMC |
---|---|
M*LIB CONCURRENT+DEQUEUE | 579 ms |
M*LIB BUFFER | 375 ms |
M*LIB MPMC Queue | 36 ms |
LIBLFDS MPMC | 83 ms |
CONCURRENT QUEUE | 279 ms |
BOOST LOCK FREE | 338 ms |
RIGTORP MPMC Queue | 225 ms |
The used versions are:
COMPONENT | VERSION |
---|---|
CPU | Intel(R) Core(TM) i5-4690K CPU @ 3.50GHz |
GCC | 10.1 |
M*LIB | 743959ebde33e9a5452799e02c245cee02ebada2 |
Qt | 5.7.1 |
GLIB | 2.50.3 |
KLIB | 928581a78413bed4efa956731b35b18a638f20f3 |
CollectionsC | 775447bfa379417382b2db8454f19039f4d3deb4 |
ConcurrentQueue | 38e6a6f0185a98c3aaf2a95aa109ba041221d527 |
Emilib | 0d90f4c0cd44813f6898b417bf3f4ef574a5b136 |
EMHASH | 7e51e4cce5ae053c3e000466163aebd620e1004d |
Flast HMAP | 2c4687431f978f02a3780e24b8b701d22aa32d9c |
Rigtorp HMAP | 2f91f0cf97e9a5b6881a8cdf37bf5fbd470b3ee6 |
Rigtorp MPMC Q | b74a9f06469ec268b82eca3b5fe28dd58f312d06 |
Hopscotch HMAP | 59e03db96365fa22de26729a412d632e0b06df3b |
LibCollections | c259a9a781fa03a4a02b6b7d5118fd202574c347 |
LibDynamic | d680409177b23ee649c0521c4a1ef647478c1f2f |
LibSRT | e3c8a7091f42dbb3a055252bee6bccb171a6eddc |
LibLFDS | 7.1.1 |
QLIBC | 25d5f5ce44ec4c863edbeaecdcb4d3c05dcf3aa7 |
RapidString | 2a3ef51e29f37c6dbc62b46c8d39f058689d40c9 |
SDS | fb463145c9c245636feb28b5aac0fc897e16f67e |
Dense HMAP | f93c0c69e959c1c77611d0ba8d107aa971338811 |
SparsePP | 5dc50f795b9783ce5778c03b6cad5d1c7d0d29ea |
Tommy DS | 8a99599168ef7e56276ed3dbdeee40a27b039da7 |
UTHASH | 28334219b5327937cb05a593dc2d6ba622ce5bcb |
BOOST | 1.84 |
C Macro Collections | v0.23.1 |
CTL | b787374d672c97e173164fcb6068dced0e1a3619 |
STC | 9f1a51593ae7dd45db46f39ac18901ad175af18e |
POTTERY | e6ac5f45feb6119521e1b79c65b2062d5ebbceb0 |
CC | 3cbef63a2608e82bdd8c1cde04a7ec62cbd69f7c |
VERSTABLE | afbe27c7914aad2d936c3569128588799b3f0913 |
Notes:
- Tree means "Totally Sorted Tree": Red-Black Tree, AVL Tree, B-Tree or equivalent.
- Dict S: Dictionary with 8 bytes type,
- Dict B: Dictionary with 256 bytes type .
- M*LIB+: M*LIB with mempool option activated (replacing malloc).
- M*LIB's Dict S used Open Addressing
- M*LIB's Dict B used separate chaining.
- Qt5 doesn't have a container that is a set offering total ordering. As such, it was QMap which was tested instead.
- Glib uses its own memory allocator (gslice).
- Tommy DS doesn't have a insert_or_replace function, as such, it is insert alone which is benched (two equal keys can both be appended in the dictionary), which is faster.
- QLibC hash functions support only const char pointer as keys.
- A result marked ? means that the computation of the benchmark was wrong, and so the bench time is unreliable.
Let's compare two equivalent code. One with M*Lib, the other one with standard classic array. The code is inspired by this question.
The classic C version:
const int dimension = 999;
struct Pixel * pixels = malloc(sizeof(struct Pixel) * dimension * dimension);
for(int i = 0 ; i < dimension * dimension; ++i)
{
pixels[i].r = 255;
pixels[i].g = 0;
pixels[i].b = 0;
}
free(pixels);
The equivalent M*lib version:
const int dimension = 999;
array_pixel_t pixels;
array_pixel_init(pixels);
array_pixel_reserve(pixels, dimension * dimension);
for(int i = 0; i < dimension * dimension; ++i)
{
const struct Pixel p = { 255, 0, 0};
array_pixel_push_back (pixels, p);
}
array_pixel_clear(pixels);
Let's compare the timings (gcc -O2 -march=native)
Code | Time |
---|---|
classic C | 856ms |
M*Lib | 604ms |
M*LIB timing is even faster! And M*Lib performs much more work with the data (the push_back operator may realloc the data if needed). How is it possible?
Let's deep into the generated code. For the classic version:
movl $2994003, %edi
call malloc
movq %rax, %rdi
leaq 2994003(%rax), %rdx
.L11:
movb $-1, (%rax)
addq $3, %rax
movb $0, -2(%rax)
movb $0, -1(%rax)
cmpq %rdx, %rax
jne .L11
call free
This is a fair and simple translation of the C code: the inner loop is simple and efficient. Now let's look for the M_*lib version:
movl $2994003, %edi
call malloc
movq %rax, %rdi
testq %rax, %rax
je .L9
xorl %eax, %eax
.L2:
leaq (%rdi,%rax), %rdx
movl $255, %ecx
addq $3, %rax
movw %cx, (%rdx)
movb $0, 2(%rdx)
cmpq $2994003, %rax
jne .L2
call free
Despite having a lot more of codes (due to the push_back operator checking that it doesn't have to realloc the array to be able to push the data), all this code have been optimized out by the compiler! The inner loop have been translated into a simple and efficient version with a nice optimization that is not present in the naive version that explains the timing difference.
There is no doubt however that the naive version can be as fast as the M*LIB one with a little rework of the original C code. However, there is no doubt however than using M*LIB is as fast as plain array.
Benchmark from http://natsys-lab.blogspot.fr/2015/09/fast-memory-pool-allocators-boost-nginx.html with use of M*LIB mempool allocators on a core 2 duo:
small object size: 24
big object size: 240
huge object size: 8305
mallocfree (Small): 1763ms
mallocfree w/ free (Small): 1149ms
mallocfree (Big): 858ms
mallocfree w/ free (Big): 541ms
GLib slices (Small): 2228ms
GLib slices w/ free (Small): 1345ms
GLib slices (Big): 725ms
GLib slices w/ free (Big): 381ms
boost::pool (Small): 965ms
boost::pool w/ free (Small): 562ms
boost::pool (Big): 526ms
boost::pool w/ free (Big): 470ms
boost::pool cr. & destr.: 264ms
ngx_pool (Small): 806ms
ngx_pool (Big): 445ms
ngx_pool w/ free (Mix): 2767ms
ngx_pool cr. & destr.: 216ms
tfw_pool (Small): 858ms
tfw_pool w/ free (Small): 290ms
tfw_pool (Big): 312ms
tfw_pool w/ free (Big): 111ms
tfw_pool w/ free (Mix): 302ms
tfw_pool cr. & destr.: 99ms
M*LIB mempool (Small): 465ms
M*LIB mempool (Big): 353ms
M*LIB mempool w/ free (Mix): 275ms
M*LIB performs very good on this test.
UDB Benchmark is an integer / string benchmark for associated arrays (hash tables, Red/Black Tree, etc.) developed by the author of KLIB. It can be found here. More information about the benchmark is available here.
The results for N=5000000 (default value) for a i5-4690K are:
LIBRARY | INT TIME | STR TIME |
---|---|---|
google_dense | 0.220 | 1.110 |
mlib-oa | 0.150 | 0.870 |
sgi_map | 2.820 | 6.880 |
boost_hash | 1.230 | 1.640 |
uthash | 1.160 | 1.520 |
unordered_map_gcc | 0.700 | 1.300 |
khash | 0.170 | 0.680 |
NP_rbtree | 2.790 | 5.310 |
glib_hash | 0.960 | 1.050 |
mlib | 0.280 | 0.610 |
stx_btree | 1.230 | 4.440 |
kbtree | 1.490 | 4.900 |
glib_tree | 5.280 | 6.240 |
sgi_hash_map | 0.860 | 1.380 |
google_sparse | 0.670 | 2.010 |
This benchmark is taken from Bstring benchmark
This table show the number of operations per second on a core i5 4690K normalized with the STL:
Operations | GCC STL | CBString | LIBSRT | M*LIB | SDS | POTTERY |
---|---|---|---|---|---|---|
empty constructor | 100 | 21 | 12 | 175 | 13 | 19 |
non-empty constructor | 100 | 87 | 30 | 125 | 52 | 23 |
small non-empty constructor | 100 | 12 | 4 | 119 | 7 | 9 |
char * assignment | 100 | 85 | 46 | 264 | 57 | 52 |
char extraction | 100 | 100 | 31 | 88 | 100 | 87 |
scan | 100 | 184 | 259 | 425 | 467 | 476 |
concatenation | 100 | 95 | 18 | 282 | 44 | 8 |
replace | 100 | 72 | NA | 171 | 6 | NA |
And the master benchmark which consists in creating 2000000 strings based on random integers, concatenating them into a single string and replacing 2 sub-strings in it:
Library | Time |
---|---|
M*LIB | 0.09s |
STL | 0.21s |
BSTRLIB | 0.48s |
POTTERY | 4.27s |
SDS | 7.92s |
Notes:
- GCC 10.2
- M*LIB e31a8dc407c3d63181218e693efa08955e1f8924
- Better String f0ff1e808102a42cdc7204a4bb6fe231a24c4546
- LIBSRT eee28e6dfc23f76c7b8f76f32ef68418619064be
- SDS fb463145c9c245636feb28b5aac0fc897e16f67e
- POTTERY 32378c84428625428a392d0e7b4c0d1c888552cb
- The operations-benchmark may be unfair for non-inline libraries (like Better String µ/ LIBSRT / SDS) as operations on constants strings can be optimized very heavily with inline implementation (like STL & M*LIB).