
[wip] BM_TCPEchoServerLatencyNQDRSubprocess benchmark #326

Draft: wants to merge 5 commits into main
Conversation

jiridanek (Contributor)

The first few benchmarks are already in main; the new one is the BM_TCPEchoServerLatencyNQDRSubprocess benchmark.

This shows what adding a router to a long chain does to latency when a small TCP message is sent through it. C is a client that measures the timing; S is an echo server.

C <-> R1 <> R2 <> R3 <> ... <> RN <-> S

(use arguments such as --benchmark_filter=.*BM_TCPEchoServerLatencyN.* to run only the chosen benchmarks, or --benchmark_repetitions to run them multiple times and compute stats)
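For example, an invocation might look like this (the flags are standard Google Benchmark options; the binary path is the one from the output below):

```shell
# Run only the N-router latency benchmarks, repeat each 10 times,
# and report only the mean/median/stddev aggregates.
./tests/c_benchmarks/c-benchmarks \
    --benchmark_filter='.*BM_TCPEchoServerLatencyN.*' \
    --benchmark_repetitions=10 \
    --benchmark_report_aggregates_only=true
```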

What would be really interesting are latency percentiles/distributions; these are not readily available now, but the benchmark can of course be updated to report them.
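As a sketch of what that could look like: once the benchmark records the individual round-trip times, percentiles are easy to derive with the standard library (the helper name and the sample values below are purely illustrative, not part of the benchmark):

```python
import statistics

def latency_percentiles(samples_ms, points=(50, 90, 99)):
    """Return {percentile: value} for the given RTT samples (in ms)."""
    # statistics.quantiles with n=100 yields the 1st..99th percentile
    # cut points, so qs[p - 1] is the p-th percentile.
    qs = statistics.quantiles(samples_ms, n=100)
    return {p: qs[p - 1] for p in points}

# Made-up RTT samples with one outlier, for illustration only.
samples = [0.165, 0.170, 0.162, 0.168, 0.170, 0.175, 0.180, 0.510]
print(latency_percentiles(samples))
```

A tail percentile such as p99 would surface outliers like the 0.510 ms sample above, which the mean mostly hides.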

It looks like adding routers to the chain increases average latency linearly (yes, I am ashamed of using the average). This could be used to measure where the latency is coming from, hopefully, and to track improvements, if improvements are called for.

/home/jdanek/repos/skupper-router/cmake-build-relwithdebinfo/tests/c_benchmarks/c-benchmarks
2022-04-12T21:21:09+02:00
Running /home/jdanek/repos/skupper-router/cmake-build-relwithdebinfo/tests/c_benchmarks/c-benchmarks
Run on (12 X 4300.03 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x6)
  L1 Instruction 32 KiB (x6)
  L2 Unified 256 KiB (x6)
  L3 Unified 12288 KiB (x1)
Load Average: 0.89, 1.32, 1.66
----------------------------------------------------------------------------------
Benchmark                                        Time             CPU   Iterations
----------------------------------------------------------------------------------
BM_RouterInitializeMinimalConfig              58.4 ms        0.049 ms          100
BM_AddRemoveSinglePattern                     1.24 us         1.23 us       562571
BM_AddRemoveMultiplePatterns/1                1.27 us         1.27 us       559319
BM_AddRemoveMultiplePatterns/3                3.06 us         3.05 us       221506
BM_AddRemoveMultiplePatterns/10               9.32 us         9.30 us        76641
BM_AddRemoveMultiplePatterns/30               27.6 us         27.5 us        25368
BM_AddRemoveMultiplePatterns/100              92.8 us         92.6 us         7702
BM_AddRemoveMultiplePatterns/1000             1074 us         1071 us          662
BM_AddRemoveMultiplePatterns/100000         350917 us       349669 us            2
BM_AddRemoveMultiplePatterns_BigO           211.27 NlgN     210.52 NlgN 
BM_AddRemoveMultiplePatterns_RMS                 1 %             1 %    
BM_TCPEchoServerLatencyWithoutQDR            0.014 ms        0.006 ms       120267
BM_TCPEchoServerLatency1QDRThread            0.103 ms        0.008 ms        86610
BM_TCPEchoServerLatency1QDRSubprocess        0.101 ms        0.008 ms        87909
BM_TCPEchoServerLatency2QDRSubprocess        0.164 ms        0.008 ms        92487
BM_TCPEchoServerLatencyNQDRSubprocess/2      0.165 ms        0.008 ms        92226
BM_TCPEchoServerLatencyNQDRSubprocess/3      0.264 ms        0.009 ms        89734
BM_TCPEchoServerLatencyNQDRSubprocess/4      0.308 ms        0.008 ms        10000
BM_TCPEchoServerLatencyNQDRSubprocess/5      0.382 ms        0.008 ms        10000
BM_TCPEchoServerLatencyNQDRSubprocess/6      0.466 ms        0.009 ms        10000
BM_TCPEchoServerLatencyNQDRSubprocess/7      0.534 ms        0.009 ms        10000
BM_TCPEchoServerLatencyNQDRSubprocess/8      0.612 ms        0.009 ms        10000
BM_TCPEchoServerLatencyNQDRSubprocess/9      0.689 ms        0.009 ms        10000

Process finished with exit code 0

@jiridanek (Contributor, Author)

Looking at this, it seems to me that adding a router should (in the ideal case) add 0.014 ms of latency. That is the time the round trip to the echo server takes without any routers in between. Adding a router to the chain adds two hops to the packet's path, which should equal +0.014 ms of latency.

The actual latency added is 0.07 ms per router, on average. That means there is 0.056 ms of overhead caused by each router. Is this a little, is this a lot? Where is this time spent? Is it spent usefully?
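For what it's worth, a least-squares fit over the N=2..9 rows of the output above gives roughly the same figures; the script below just redoes the back-of-envelope arithmetic on the posted numbers:

```python
# Latency data copied from the BM_TCPEchoServerLatencyNQDRSubprocess rows.
routers = [2, 3, 4, 5, 6, 7, 8, 9]
latency_ms = [0.165, 0.264, 0.308, 0.382, 0.466, 0.534, 0.612, 0.689]

# Least-squares slope: average latency added per extra router.
n = len(routers)
mx = sum(routers) / n
my = sum(latency_ms) / n
slope = (sum((x - mx) * (y - my) for x, y in zip(routers, latency_ms))
         / sum((x - mx) ** 2 for x in routers))

baseline = 0.014  # round trip without any routers, from the table above

print(f"latency added per router: {slope:.3f} ms")
print(f"overhead beyond the ideal hop cost: {slope - baseline:.3f} ms")
```

The fit comes out at roughly 0.073 ms per router, i.e. about 0.06 ms of overhead beyond the ideal two-hop cost, consistent with the figures quoted above.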

@jiridanek (Contributor, Author) commented Apr 13, 2022

In these latency tests, there is only ever a single TCP send in flight at a time, so the routers are as lightly loaded as possible. The latency measured should therefore be the lowest achievable.
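The measurement pattern is essentially the following; this is a minimal Python sketch against a local echo server for illustration, whereas in the benchmark the connection would go through the router chain instead:

```python
import socket
import threading
import time

def echo_server(sock):
    """Accept one connection and echo everything back."""
    conn, _ = sock.accept()
    with conn:
        while data := conn.recv(64):
            conn.sendall(data)

# Echo server on an ephemeral localhost port, running in a thread.
server = socket.socket()
server.bind(("127.0.0.1", 0))
server.listen(1)
threading.Thread(target=echo_server, args=(server,), daemon=True).start()

rtts = []
with socket.create_connection(server.getsockname()) as client:
    for _ in range(100):
        t0 = time.perf_counter()
        client.sendall(b"ping")            # the single send in flight
        assert client.recv(64) == b"ping"  # block on the echo before sending again
        rtts.append(time.perf_counter() - t0)

print(f"mean RTT: {sum(rtts) / len(rtts) * 1000:.3f} ms")
```

Because each send waits for its echo before the next one is issued, there is never any queueing in the routers, which is what makes the measured latency a floor rather than a typical value.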

edit: there should be TLS in this

@jiridanek (Contributor, Author)

On the whole, there is absolutely no reason to orchestrate the router subprocesses from a C++ test. It is much nicer to do this in Python and to use existing tooling (an echo server, some TCP ping utilities, iperf3) the way a normal perf test would; the results are much more trustworthy that way, too. Once the thing stops being a microbenchmark, there is no point in trying to treat it as one.

@jiridanek jiridanek marked this pull request as draft April 18, 2022 16:30