Debugging Excessive and Contended Mutex Locks

Nadav Har'El edited this page May 11, 2014 · 8 revisions

The purpose of this document is to show tools that help find unexpected lock contention which is slowing down a workload.

When a thread needs a mutex that is already locked, it must wait, causing a context switch which is orders of magnitude slower than an uncontended lock/unlock (ball-park figures: about 10 times slower on a single CPU, about 50 times slower when the lock is held by another CPU). Lock contention may also increase idle time, if there are no additional threads ready to run.

In some workloads, we may discover that a single mutex causes much of this lock contention, and perhaps it can be replaced by a lock-free algorithm or by RCU (e.g., when the mutex protects a read-mostly data structure), which avoids the contention. In other cases, we discover that while a mutex is not contended, we lock and unlock it excessively, which also takes time. This document explains how to find these problematic mutexes.

The text below follows a real example of running "memcached" with UDP, using an image built with

make image=memcached,mgmt.shell

For more information, check out [Debugging OSv](Debugging OSv), [Trace analysis using trace.py](Trace analysis using trace.py).

Counting mutex locks and waits

The first step is to verify that we do have a problem with excessive mutex-related context switches, by counting the frequency of context switches, mutex-caused context switches, and mutex use in general. One easy way to do this counting is with the "perf stat" command in our shell.

Our image includes both the shell and memcached, so let's just run it:

sudo scripts/run.py -m4G -c1 -Vnv

The "-nv" is needed to enable networking for the memcached server. You'll now see output like:

OSv v0.08-63-g7166013
eth0: 192.168.122.89
Memcached for OSV, based on Memcached 1.4.15

[/]% 

At the shell prompt, the perf stat <tracepoints> command can be used to periodically show the frequency of the given tracepoints. In the following example, we look at sched_switch (the number of context switches), sched_wait (the number of times a thread went to sleep because it started to wait), mutex_lock_wait (the number of those waits caused by a mutex::lock()), and mutex_lock (the total number of mutex::lock() calls).

[/]% perf stat sched_switch,sched_wait,mutex_lock_wait,mutex_lock
 sched_switch sched_wait mutex_lock_wait mutex_lock
           42         26               0       6443
          157         77               0        366
          145         79               0        299
          137         75               0        350
          140         76               0        257
          149         82               1        790
        89186      45628           12122    4740993
       104427      53004           14461    4714123
        91227      46281           12417    4544367
        97062      49180           13143    4736282
       107685      54459           14739    4651545
        94802      48192           12786    4721227
...

Finding where the mutex waits happen

Finding where the mutex locks happen

While mutex waits are much slower than ordinary lock/unlock operations, when the number of lock/unlock pairs is huge - almost 5 million times a second in the example above - even the uncontended operations add up. When a mutex is only used from a single CPU, a lock/unlock pair takes roughly 25 nanoseconds on my machine, so 5 million of them take 0.125 seconds - i.e., 12.5% of the run time is spent just locking and unlocking the mutex!
