
[wip] BM_TCPEchoServerLatencyNQDRSubprocess benchmark #326

Draft: wants to merge 5 commits into main
Conversation

jiridanek (Contributor)

The first few benchmarks are already in main; the new one is the BM_TCPEchoServerLatencyNQDRSubprocess benchmark.

This shows what adding a router to a long chain does to latency when a small TCP message is sent through it. C is a client that measures the timing; S is an echo server.

C <-> R1 <> R2 <> R3 <> ... <> RN <-> S

(use arguments such as --benchmark_filter=.*BM_TCPEchoServerLatencyN.* to run only the chosen benchmarks, or --benchmark_repetitions to run them multiple times and compute stats)
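For example, an invocation might look like this (the flags are standard Google Benchmark options; the binary path is the one from the output below):

```shell
# Run only the N-router latency benchmarks, repeat each 10 times,
# and report only the mean/median/stddev aggregates.
./tests/c_benchmarks/c-benchmarks \
    --benchmark_filter='.*BM_TCPEchoServerLatencyN.*' \
    --benchmark_repetitions=10 \
    --benchmark_report_aggregates_only=true
```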

What would be really interesting are latency percentiles/distributions; these are not readily available now, but the benchmark can of course be updated to report them.
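As a sketch of what that could look like: once the benchmark records the individual round-trip times, percentiles are easy to derive with the standard library (the helper name and the sample values below are purely illustrative, not part of the benchmark):

```python
import statistics

def latency_percentiles(samples_ms, points=(50, 90, 99)):
    """Return {percentile: value} for the given RTT samples (in ms)."""
    # statistics.quantiles with n=100 yields the 1st..99th percentile
    # cut points, so qs[p - 1] is the p-th percentile.
    qs = statistics.quantiles(samples_ms, n=100)
    return {p: qs[p - 1] for p in points}

# Made-up RTT samples with one outlier, for illustration only.
samples = [0.165, 0.170, 0.162, 0.168, 0.170, 0.175, 0.180, 0.510]
print(latency_percentiles(samples))
```

A tail percentile such as p99 would surface outliers like the 0.510 ms sample above, which the mean mostly hides.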

It looks like adding routers to the chain increases average latency linearly (yes, I am ashamed of using the average). This could be used to measure where the latency is coming from, hopefully, and to track improvements, if improvements are called for.

/home/jdanek/repos/skupper-router/cmake-build-relwithdebinfo/tests/c_benchmarks/c-benchmarks
2022-04-12T21:21:09+02:00
Running /home/jdanek/repos/skupper-router/cmake-build-relwithdebinfo/tests/c_benchmarks/c-benchmarks
Run on (12 X 4300.03 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x6)
  L1 Instruction 32 KiB (x6)
  L2 Unified 256 KiB (x6)
  L3 Unified 12288 KiB (x1)
Load Average: 0.89, 1.32, 1.66
----------------------------------------------------------------------------------
Benchmark                                        Time             CPU   Iterations
----------------------------------------------------------------------------------
BM_RouterInitializeMinimalConfig              58.4 ms        0.049 ms          100
BM_AddRemoveSinglePattern                     1.24 us         1.23 us       562571
BM_AddRemoveMultiplePatterns/1                1.27 us         1.27 us       559319
BM_AddRemoveMultiplePatterns/3                3.06 us         3.05 us       221506
BM_AddRemoveMultiplePatterns/10               9.32 us         9.30 us        76641
BM_AddRemoveMultiplePatterns/30               27.6 us         27.5 us        25368
BM_AddRemoveMultiplePatterns/100              92.8 us         92.6 us         7702
BM_AddRemoveMultiplePatterns/1000             1074 us         1071 us          662
BM_AddRemoveMultiplePatterns/100000         350917 us       349669 us            2
BM_AddRemoveMultiplePatterns_BigO           211.27 NlgN     210.52 NlgN 
BM_AddRemoveMultiplePatterns_RMS                 1 %             1 %    
BM_TCPEchoServerLatencyWithoutQDR            0.014 ms        0.006 ms       120267
BM_TCPEchoServerLatency1QDRThread            0.103 ms        0.008 ms        86610
BM_TCPEchoServerLatency1QDRSubprocess        0.101 ms        0.008 ms        87909
BM_TCPEchoServerLatency2QDRSubprocess        0.164 ms        0.008 ms        92487
BM_TCPEchoServerLatencyNQDRSubprocess/2      0.165 ms        0.008 ms        92226
BM_TCPEchoServerLatencyNQDRSubprocess/3      0.264 ms        0.009 ms        89734
BM_TCPEchoServerLatencyNQDRSubprocess/4      0.308 ms        0.008 ms        10000
BM_TCPEchoServerLatencyNQDRSubprocess/5      0.382 ms        0.008 ms        10000
BM_TCPEchoServerLatencyNQDRSubprocess/6      0.466 ms        0.009 ms        10000
BM_TCPEchoServerLatencyNQDRSubprocess/7      0.534 ms        0.009 ms        10000
BM_TCPEchoServerLatencyNQDRSubprocess/8      0.612 ms        0.009 ms        10000
BM_TCPEchoServerLatencyNQDRSubprocess/9      0.689 ms        0.009 ms        10000

Process finished with exit code 0

@jiridanek (Contributor, Author)

Looking at this, it seems to me that adding a router should (in the ideal case) add 0.014 ms of latency. That is the time the round trip to the echo server takes without any routers in between. Adding a router to the chain adds two hops to the packet's path, which should equal +0.014 ms of latency.

The actual latency added is 0.07 ms per router, on average. That means there is 0.056 ms of overhead caused by each router. Is this a little, is this a lot? Where is this time spent? Is it spent usefully?
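For what it's worth, a least-squares fit over the N=2..9 rows of the output above gives roughly the same figures; the script below just redoes the back-of-envelope arithmetic on the posted numbers:

```python
# Latency data copied from the BM_TCPEchoServerLatencyNQDRSubprocess rows.
routers = [2, 3, 4, 5, 6, 7, 8, 9]
latency_ms = [0.165, 0.264, 0.308, 0.382, 0.466, 0.534, 0.612, 0.689]

# Least-squares slope: average latency added per extra router.
n = len(routers)
mx = sum(routers) / n
my = sum(latency_ms) / n
slope = (sum((x - mx) * (y - my) for x, y in zip(routers, latency_ms))
         / sum((x - mx) ** 2 for x in routers))

baseline = 0.014  # round trip without any routers, from the table above

print(f"latency added per router: {slope:.3f} ms")
print(f"overhead beyond the ideal hop cost: {slope - baseline:.3f} ms")
```

The fit comes out at roughly 0.073 ms per router, i.e. about 0.06 ms of overhead beyond the ideal two-hop cost, consistent with the figures quoted above.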

@jiridanek (Contributor, Author) commented Apr 13, 2022

In these latency tests, there is only ever a single TCP send in flight at a time, so the routers are as lightly loaded as possible. The latency measured should therefore be the lowest achievable.
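The measurement pattern is essentially the following; this is a minimal Python sketch against a local echo server for illustration, whereas in the benchmark the connection would go through the router chain instead:

```python
import socket
import threading
import time

def echo_server(sock):
    """Accept one connection and echo everything back."""
    conn, _ = sock.accept()
    with conn:
        while data := conn.recv(64):
            conn.sendall(data)

# Echo server on an ephemeral localhost port, running in a thread.
server = socket.socket()
server.bind(("127.0.0.1", 0))
server.listen(1)
threading.Thread(target=echo_server, args=(server,), daemon=True).start()

rtts = []
with socket.create_connection(server.getsockname()) as client:
    for _ in range(100):
        t0 = time.perf_counter()
        client.sendall(b"ping")            # the single send in flight
        assert client.recv(64) == b"ping"  # block on the echo before sending again
        rtts.append(time.perf_counter() - t0)

print(f"mean RTT: {sum(rtts) / len(rtts) * 1000:.3f} ms")
```

Because each send waits for its echo before the next one is issued, there is never any queueing in the routers, which is what makes the measured latency a floor rather than a typical value.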

edit: there should be TLS in this

@jiridanek (Contributor, Author)

On the whole, there is absolutely no reason to orchestrate the router subprocesses from a C++ test. It is much nicer to do this in Python and to use existing tooling (an echo server, some TCP ping utilities, iperf3) the way a normal perf test would; the results are much more trustworthy that way, too. Once the thing stops being a microbenchmark, there is no point in trying to treat it as one.

@jiridanek jiridanek marked this pull request as draft April 18, 2022 16:30