-
-
Notifications
You must be signed in to change notification settings - Fork 604
Debugging Excessive and Contended Mutex Locks
The purpose of this document is to show tools to help finding unexpected lock contention which is slowing down a workload.
When a thread needs a mutex but it's already locked, this causes a context switch which is orders of magnitude slower than an uncontended lock/unlock (ball-park figures are about 10 times slower on a single CPU, about 50 times slower when the lock is held by another CPU). Lock contention may also increase the idle time, if there are no additional threads wanting to run.
In some workloads, we can discover that one mutex is causing much of this lock contention, and perhaps it can be replaced by a lock-free algorithm or RCU (e.g., when the mutex is protecting a read-mostly data structure) which will avoid this contention. In other cases, we discover that while the mutex is not contended, we lock and unlock it excessively, which takes time. This document explains how to find these problematic mutexes.
The text below follows a real example of running "memcached" with UDP, using the image
make image=memcached,mgmt.shell
For more information, check out [Debugging OSv](Debugging OSv), [Trace analysis using trace.py](Trace analysis using trace.py).
The first step is to verify that we do have a problem with excessive mutex-related context switches, by counting the frequency of context switches, mutex-caused context switches, and mutex use in general. One easy way to do this counting is with the "perf stat" command in our shell.
Our image includes both the shell and memcached, so let's just run it:
sudo scripts/run.py -m4G -c1 -Vnv
The "-nv" is needed to enable networking for the memcached server. You'll now see output like:
OSv v0.08-63-g7166013
eth0: 192.168.122.89
Memcached for OSV, based on Memcached 1.4.15
[/]%
In the shell prompt, the perf stat <tracepoints>
command can be used to periodically show the frequency of the given tracepoints. In the following example, we look at the sched_switch
(number of context switches), sched_wait
(number of times a thread yielded because it started to wait), mutex_lock_wait
(number of times such a wait was caused by a mutex::lock()), and mutex_lock
(the total number of mutex::lock() calls).
[/]% perf stat sched_switch,sched_wait,mutex_lock_wait,mutex_lock
sched_switch sched_wait mutex_lock_wait mutex_lock
42 26 0 6443
157 77 0 366
145 79 0 299
137 75 0 350
140 76 0 257
149 82 1 790
89186 45628 12122 4740993
104427 53004 14461 4714123
91227 46281 12417 4544367
97062 49180 13143 4736282
107685 54459 14739 4651545
94802 48192 12786 4721227
...
TODO: this mutex_lock_wait number is lower than I saw previously... Or below.
First, we need to run the workload with the mutex_lock_wait
and mutex_lock_wake
tracepoints enabled, together with backtraces:
sudo scripts/run.py --trace=mutex_lock_wake,mutex_lock_wait --trace-backtrace -m4G -c1 -Vnv -e "memcached.so -u root -t 1 -m 3000"
And run in the background the "memaslap" benchmark which loads this server.
Now, use scripts/trace.py extract
to extract these tracepoints pairs.
We can use scripts/trace.py summary
to show the time the time period these tracepoints span, and their number in this period:
$ scripts/trace.py summary
Collected 30701 samples spanning 535.26 ms
Tracepoint statistics:
name count
---- -----
mutex_lock_wait 15350
mutex_lock_wake 15351
Now, scripts/trace.py prof-lock -FL
can extract and summarize these tracepoint pairs.
The first block in the summary shows that 76% of the contended locks happened in in_pcblookup_hash():
38.66 ms (100.00%, #15350) All
log /data/nyh/osv/include/osv/trace.hh:363
lockfree::mutex::lock() /data/nyh/osv/core/lfmutex.cc:87
|-- 29.38 ms (76.01%, #10598) mutex_lock /data/nyh/osv/include/osv/mutex.h:33
| |-- 29.04 ms (75.12%, #10409) in_pcblookup_hash /data/nyh/osv/bsd/sys/neti
net/in_pcb.cc:1647
| | udp_input(mbuf*, int) /data/nyh/osv/bsd/sys/netinet/udp_usrreq.cc:546
| | ip_input /data/nyh/osv/bsd/sys/netinet/ip_input.cc:780
| | netisr_dispatch_src /data/nyh/osv/bsd/sys/net/netisr.cc:769
| | ether_input_internal /data/nyh/osv/bsd/sys/net/if_ethersubr.cc:733
| | ether_nh_input /data/nyh/osv/bsd/sys/net/if_ethersubr.cc:746
| | netisr_dispatch_src /data/nyh/osv/bsd/sys/net/netisr.cc:769
| | virtio::net::receiver() /data/nyh/osv/drivers/virtio-net.cc:452
| | sched::thread::main() /data/nyh/osv/core/sched.cc:813
| |
| \-- 344.25 us (0.89%, #189) udp_output /data/nyh/osv/bsd/sys/netinet/udp_u
srreq.cc:940
| udp_send /data/nyh/osv/bsd/sys/netinet/udp_usrreq.cc:1533
| sosend_dgram /data/nyh/osv/bsd/sys/kern/uipc_socket.cc:910
| kern_sendit /data/nyh/osv/bsd/sys/kern/uipc_syscalls.cc:506
| linux_sendit /data/nyh/osv/bsd/sys/compat/linux/linux_socket.cc:499
| sendmsg /data/nyh/osv/bsd/sys/kern/uipc_syscalls_wrap.cc:239
| 0x1000000136f4
| 0x100000014585
| 0x10000001491d
| 0x100000410a43
We see here two paths which do INP_LOCK(inpcb): The first is the input path, run from the virtio receiver thread. The second is the output path, where the application thread uses sendmsg() to send a reply.
The receiver thread runs udp_input() which runs
inp = in_pcblookup_mbuf(&V_udbinfo, ip->ip_src, uh->uh_sport,
ip->ip_dst, uh->uh_dport, INPLOOKUP_WILDCARD |
INPLOOKUP_LOCKPCB, ifp, m);
The INPLOOKUP_LOCKPCB flag is the one which causes the inpcb's locking, and the contention.
TODO: say that netchannels would solve this
Run with sched_wait
tracepoint enabled, and backtraces:
sudo scripts/run.py --trace=sched_wait --trace-backtrace -m4G -c1 -Vnv -e "memcached.so -u root -t 1 -m 3000"
Run the memaslap benchmark on the host, and scripts/trace.py extract
.
$ scripts/trace.py summary
Collected 32755 samples spanning 323.73 ms
Tracepoint statistics:
name count
---- -----
sched_wait 32755
Then run scripts/trace.py prof -FL
to understand where these 32755 waits (in 1/3rd of a second) come from. It turns out that:
- 59% of the waits, 19314 in 1/3rd second, happen where net::receiver (virtio_net.cc) calls
virtio_driver::wait_for_queue(vq, &vring::used_ring_not_empty);
- About 35% of the waits happen in mutex locks. As above about 75% of those is the inpcb lock on input, and most of the rest is in the poll() implementation. The rest is negligable.
While mutex waits are much slower than ordinary lock/unlock, when the number of lock/unlock pairs is huge - almost 5 millions time a second in the example above, this can add up. When the mutex is only used from a single CPU, a lock/unlock pair takes on my machine roughly 25 nanoseconds, so 5 million of them take 0.125 seconds - i.e., 12.5% of the run time is locking and unlocking the mutex!
$ sudo scripts/run.py --trace=mutex_lock --trace-backtrace -m4G -c
1 -Vnv -e "memcached.so -u root -t 1 -m 3000"
$ scripts/trace.py extract
$ scripts/trace.py summary
$ scripts/trace.py summary
Collected 30696 samples spanning 14.71 ms
Tracepoint statistics:
name count
---- -----
mutex_lock 30696
$ scripts/trace.py prof -FL |less
The biggest culprits are:
28% in rtalloc1_fib() in rtalloc_ign_fib() in ip_output() or in_pbcladdr() in udp_output(): 14%: rwlock()::runlock() and rlock() in rtalloc1_fib() 7%: Similarly, a lock RT_LOCK(newrt) in rtalloc1_fib() 7%: Also RT_LOCK(rt) in_matroute() (rn = rnh->rnh_matchaddr(dst, rnh);) in rtalloc1_fib()
14%: rwlock()::runlock() and rlock() in two places in arpresolve() in ether_output()
11.3% from _GLOBAL__sub_I_gdb_trace_function_entry. What is this? A fluke?
10.5% in in_pbclookup_hash() in udp_input(): 7% rwlock()::runlock() and rlock() in in_pbclookup_hash() 3.5% for mutex_lock() in in_pbclookup_hash
10.5% poll related: 7%: mutex_lock() in pipe_buffer::write_events() socket_file::poll(), called from poll_scan() called from epoll_wait(). 3.5% mutex_lock() in sopoll_generic()
7% rwlock()::runlock() and rlock() in udp_output() (INP_HASH_RUNLOCK(&V_udbinfo), RLOCK). 3.5%: RTFREE() in in_pbcladdr() and ip_output(). 3.5%: mutex_lock() in if_transmit() 3.5%: mutex_lock() in sosend_dgram() 3.5%: mutex_lock() in soreceive_dgram()