Debugging Excessive and Contended Mutex Locks
The purpose of this document is to show tools that help find unexpected lock contention which is slowing down a workload.
When a thread needs a mutex that is already locked, it must block, causing a context switch which is orders of magnitude slower than an uncontended lock/unlock (ballpark figures: about 10 times slower on a single CPU, and about 50 times slower when the lock is held by another CPU). Lock contention may also increase idle time if there are no additional threads ready to run.
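To make the contended case concrete, the following is a minimal sketch in standard C++ (not OSv-specific code) where two threads repeatedly take the same mutex: a lock() call frequently finds the mutex already held, so the thread must block and later be woken up, paying for a context switch instead of a cheap atomic operation.

```cpp
// Minimal sketch of a heavily contended mutex, using standard C++ threads.
#include <mutex>
#include <thread>

int main() {
    std::mutex mtx;
    long counter = 0;
    auto worker = [&] {
        for (int i = 0; i < 1000000; ++i) {
            std::lock_guard<std::mutex> guard(mtx);  // often finds mtx already held
            ++counter;                               // very short critical section
        }
    };
    // Two threads fighting over one mutex: most of the cost is blocking and
    // waking, i.e. context switches, not the critical section itself.
    std::thread t1(worker), t2(worker);
    t1.join();
    t2.join();
    return 0;
}
```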
In some workloads, we may discover that a single mutex is responsible for much of this lock contention, and perhaps it can be replaced by a lock-free algorithm or by RCU (e.g., when the mutex protects a read-mostly data structure), avoiding the contention entirely. In other cases, we discover that while the mutex is not contended, we lock and unlock it excessively, which also takes time. This document explains how to find these problematic mutexes.
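As an illustration of the read-mostly case, here is a hedged sketch in standard C++ (not OSv's actual RCU API; the `config`, `locked_config` and `rcu_like_config` names are hypothetical) contrasting a mutex-protected object with an atomic-pointer scheme in which readers never take a lock.

```cpp
// Hypothetical read-mostly data structure, protected two different ways.
#include <atomic>
#include <memory>
#include <mutex>
#include <string>

struct config {
    std::string hostname;
    int port;
};

// Mutex-protected version: every reader pays for a lock()/unlock() pair
// and contends with writers and with readers on other CPUs.
class locked_config {
    std::mutex mtx;
    config current{"localhost", 11211};
public:
    config read() {
        std::lock_guard<std::mutex> guard(mtx);
        return current;
    }
    void update(config c) {
        std::lock_guard<std::mutex> guard(mtx);
        current = std::move(c);
    }
};

// RCU-like version: readers only load an atomic pointer; writers publish a
// new copy. Here shared_ptr reference counting stands in for RCU's
// grace-period reclamation of old copies.
class rcu_like_config {
    std::shared_ptr<const config> current =
        std::make_shared<const config>(config{"localhost", 11211});
public:
    std::shared_ptr<const config> read() const {
        return std::atomic_load(&current);   // no mutex, no context switch
    }
    void update(config c) {
        std::atomic_store(&current,
                          std::make_shared<const config>(std::move(c)));
    }
};
```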
The text below follows a real example of running "memcached" with UDP, using an image built with:

```
make image=memcached,mgmt.shell
```
For more information, check out [Debugging OSv](Debugging OSv) and [Trace analysis using trace.py](Trace analysis using trace.py).
The first step is to verify that we do have a problem with excessive mutex-related context switches, by counting the frequency of context switches, mutex-caused context switches, and mutex use in general. One easy way to do this counting is with the "perf stat" command in our shell.
Our image includes both the shell and memcached, so let's just run it:

```
sudo scripts/run.py -m4G -c1 -Vnv
```
The "-nv" is needed to enable networking for the memcached server. You'll now see output like:
```
OSv v0.08-63-g7166013
eth0: 192.168.122.89
Memcached for OSV, based on Memcached 1.4.15
[/]%
```
At the shell prompt, the `perf stat <tracepoints>` command can be used to periodically show the frequency of the given tracepoints. In the following example, we look at `sched_switch` (the number of context switches), `sched_wait` (the number of times a thread yielded because it started to wait), `mutex_lock_wait` (the number of times such a wait was caused by a `mutex::lock()`), and `mutex_lock` (the total number of `mutex::lock()` calls).
```
[/]% perf stat sched_switch,sched_wait,mutex_lock_wait,mutex_lock
  sched_switch    sched_wait   mutex_lock_wait    mutex_lock
            42            26                 0          6443
           157            77                 0           366
           145            79                 0           299
           137            75                 0           350
           140            76                 0           257
           149            82                 1           790
         89186         45628             12122       4740993
        104427         53004             14461       4714123
         91227         46281             12417       4544367
         97062         49180             13143       4736282
        107685         54459             14739       4651545
         94802         48192             12786       4721227
...
```
While a mutex wait is much slower than an ordinary lock/unlock, when the number of lock/unlock pairs is huge - almost 5 million times a second in the example above - even the uncontended operations add up. When the mutex is only used from a single CPU, a lock/unlock pair takes roughly 25 nanoseconds on my machine, so 5 million of them take 0.125 seconds - i.e., 12.5% of the run time is spent just locking and unlocking the mutex!
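For reference, the arithmetic behind that figure is 5,000,000 × 25 ns = 0.125 s out of every second of runtime, i.e. 12.5%. The ~25 ns cost is machine-dependent; the following self-contained microbenchmark sketch in standard C++ (not one of OSv's tools) estimates the cost of an uncontended lock/unlock pair on whatever machine it runs on.

```cpp
// Rough microbenchmark of an uncontended (single-threaded) lock/unlock pair.
#include <chrono>
#include <cstdio>
#include <mutex>

int main() {
    std::mutex mtx;
    const long iterations = 10000000;
    auto start = std::chrono::steady_clock::now();
    for (long i = 0; i < iterations; ++i) {
        mtx.lock();     // never blocks: no other thread touches mtx
        mtx.unlock();
    }
    auto end = std::chrono::steady_clock::now();
    auto ns = std::chrono::duration_cast<std::chrono::nanoseconds>(end - start).count();
    std::printf("uncontended lock/unlock pair: %.1f ns\n",
                double(ns) / iterations);
    // At ~25 ns per pair, 5 million pairs per second cost
    // 5e6 * 25e-9 = 0.125 s, i.e. 12.5% of each second.
    return 0;
}
```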