membarrier system call performance and the future of Userspace RCU on Linux
A few months ago on the lttng-dev mailing list, Milian Wolff reported that his test application was seeing noticeable startup delays because it was linked against liblttng-ust. He didn't start a tracing session and his test program didn't call any code in liblttng-ust. In fact, the test program contained fewer lines of code than a standard Hello, World! application:
int main()
{
    return 0;
}
Yet linking with the library was enough to make his application around 640 times slower. Things became even worse when running within a tracing session, resulting in a roughly one-second delay when launching the program.
This result was entirely unexpected. Sure, linking against liblttng-ust does incur a bit of run time overhead, but it shouldn't be anywhere near one second.
Tracing the first second of the application's run time with perf trace
revealed that a large percentage of the time was spent executing the membarrier
system call:
$ perf trace --duration 1 ./a.out
6.492 (52.468 ms): a.out/23672 recvmsg(fd: 3<socket:[1178439]>, msg: 0x7fbe2fbb1070) = 1
5.077 (54.271 ms): a.out/23671 futex(uaddr: 0x7fbe30d508a0, op: WAIT_BITSET|PRIV|CLKRT, utime: 0x7ffc474ff5a0, val3: 4294967295) = 0
59.598 (79.379 ms): a.out/23671 membarrier(cmd: 1) = 0
The membarrier system call is a relatively new addition to Linux, so here's a
quick overview of why it was created, how liblttng-ust uses it, and how it
caused Milian's performance issue.
What is the Linux membarrier system call?
The membarrier system call is used for synchronizing access to data structures
shared by multiple threads. Its main use case is implementing synchronization
primitives that can be split into fast and slow paths, for example the
read-copy-update (RCU) algorithm. As the name implies, it's used to implement
memory barrier semantics without actually using memory barrier instructions.
Traditionally, synchronization algorithms are implemented using pairs of memory
barriers, either explicitly for RCU, or implicitly by lock-prefixed atomic
instructions for reader-writer locks on Intel. For cases where explicit memory
barriers are used, and where the algorithm clearly defines fast paths and slow
paths, the membarrier system call can be used as an optimization to speed up
the fast path at the expense of adding overhead to the slow path, which results
in an overall speedup.
Since algorithms such as RCU are designed to be used by code that reads shared
data much more often than it writes to it, this optimization is still a good
trade-off for that specific scenario. When the membarrier system call was
merged into Linux 4.3, improving the speed of Userspace RCU was the rationale
mentioned in the patch message.
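The fast-path/slow-path split described above can be sketched as follows. This is only an illustration, not liburcu's actual code: the helper names `sys_membarrier()`, `fast_path_barrier()` and `slow_path_barrier()` are hypothetical, and the sketch assumes Linux kernel headers that provide `<linux/membarrier.h>`.

```c
#define _GNU_SOURCE
#include <linux/membarrier.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical helper: assumes libc offers no membarrier() wrapper. */
static int sys_membarrier(int cmd, unsigned int flags)
{
    /* The trailing 0 is the cpu_id argument, ignored when flags is 0. */
    return (int) syscall(__NR_membarrier, cmd, flags, 0);
}

/*
 * Fast path (e.g. readers): a compiler-only barrier. It stops the compiler
 * from reordering memory accesses but emits no fence instruction.
 */
static inline void fast_path_barrier(void)
{
    __asm__ __volatile__("" : : : "memory");
}

/*
 * Slow path (e.g. writers): ask the kernel to provide the barrier on behalf
 * of all running threads. Returns 0 on success and -1 if the system call
 * failed; in that case the compiler-only fast path is NOT sufficient and
 * both paths must use real fences instead.
 */
static inline int slow_path_barrier(void)
{
    return sys_membarrier(MEMBARRIER_CMD_SHARED, 0) == 0 ? 0 : -1;
}
```

The important design point is that the two sides are paired: the cheap fast path is only valid as long as every slow path really goes through the system call.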
Internally, the membarrier system call uses the kernel's synchronize_sched()
function to wait for a "grace period" to elapse. In other words, it blocks the
calling thread until all running threads on the system have gone through a
context switch.
And it's this blocking behaviour that led to Milian's application startup performance issue.
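To observe this blocking behaviour without perf, one option is to time a single call directly. The sketch below is an assumption-laden illustration: `sys_membarrier()` and `time_membarrier_ms()` are hypothetical helpers, and the measured duration depends entirely on the kernel version and system load.

```c
#define _GNU_SOURCE
#include <linux/membarrier.h>
#include <sys/syscall.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical helper: assumes libc offers no membarrier() wrapper. */
static int sys_membarrier(int cmd, unsigned int flags)
{
    return (int) syscall(__NR_membarrier, cmd, flags, 0);
}

/*
 * Time one membarrier call with the monotonic clock.
 * Returns the elapsed time in milliseconds, or -1.0 if the call failed
 * (for example because the kernel does not support the command).
 */
static double time_membarrier_ms(int cmd)
{
    struct timespec start, end;
    int rc;

    clock_gettime(CLOCK_MONOTONIC, &start);
    rc = sys_membarrier(cmd, 0);
    clock_gettime(CLOCK_MONOTONIC, &end);

    if (rc != 0)
        return -1.0;
    return (double) (end.tv_sec - start.tv_sec) * 1e3
         + (double) (end.tv_nsec - start.tv_nsec) / 1e6;
}
```

Calling `time_membarrier_ms(MEMBARRIER_CMD_SHARED)` from a small test program gives a rough figure to compare against the durations reported by perf trace.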
How liblttng-ust uses the membarrier system call
liblttng-ust uses the Userspace RCU library, liburcu, to synchronize updates to the tracing state. The RCU read side scales very well on modern multi-core machines, and it keeps LTTng's overhead low so it can be used in production to analyze race conditions and other timing-critical bugs.
Userspace RCU started using membarrier(2) in 0.9.0. At run time, liburcu
detects whether the membarrier system call is available on the running system
and, if available, uses it to implement synchronize_rcu(), which blocks the
calling thread until it's safe to modify shared data. synchronize_rcu() does
not batch memory reclamation, and each call waits for the grace period.
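Run-time detection of this kind can be done with the `MEMBARRIER_CMD_QUERY` command, which returns a bitmask of the commands the running kernel supports. The following is a minimal sketch, not liburcu's implementation; the `barrier_strategy` enum and both function names are hypothetical.

```c
#define _GNU_SOURCE
#include <linux/membarrier.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical helper: assumes libc offers no membarrier() wrapper. */
static int sys_membarrier(int cmd, unsigned int flags)
{
    return (int) syscall(__NR_membarrier, cmd, flags, 0);
}

/* Hypothetical strategy selector, decided once at start-up. */
enum barrier_strategy {
    BARRIER_FULL_FENCE,     /* fall back to explicit fences on both paths */
    BARRIER_SYS_MEMBARRIER  /* use membarrier(2) on the slow path */
};

static enum barrier_strategy detect_barrier_strategy(void)
{
    /* QUERY returns a bitmask of supported commands, or -1 on error. */
    int mask = sys_membarrier(MEMBARRIER_CMD_QUERY, 0);

    if (mask >= 0 && (mask & MEMBARRIER_CMD_SHARED))
        return BARRIER_SYS_MEMBARRIER;
    return BARRIER_FULL_FENCE;
}
```

Deciding once and caching the result keeps the check off the fast path; both the fast and slow paths then branch on the same stored value so they always agree on which kind of barrier is in use.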
Note: liburcu does offer an alternative function that batches memory
reclamation, call_rcu(). However, using it requires a separate thread to
asynchronously handle the callback, and we wanted to keep the number of threads
needed by LTTng to a minimum to reduce the complexity of tracing applications.
liblttng-ust uses GCC's constructor attribute to run an initialization function
(lttng_ust_init()) as soon as the liblttng-ust shared library is loaded and
before the application starts. Code running during this initialization phase
calls synchronize_rcu(). Given that a single call to membarrier(2) can take
tens of milliseconds, and that each enabled event triggers a membarrier(2)
call, it's no wonder that Milian saw delays when starting his application.
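As an illustration of the mechanism, here is a minimal sketch of a constructor function. The names `example_lib_init()` and `example_lib_is_initialized()` are hypothetical stand-ins and do not come from liblttng-ust.

```c
/* Set by the constructor below; zero until the constructor has run. */
static int lib_initialized;

/*
 * GCC and Clang run functions marked with the constructor attribute
 * before main() is entered (or when the shared library is loaded).
 * Any slow work done here delays the start of the whole program.
 */
__attribute__((constructor))
static void example_lib_init(void)
{
    lib_initialized = 1;
}

/* Lets the application check that initialization already happened. */
int example_lib_is_initialized(void)
{
    return lib_initialized;
}
```

By the time `main()` runs, `example_lib_is_initialized()` already returns 1, even if the application never calls into the library explicitly. This is why an otherwise empty program can still pay the cost of the library's start-up work.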
liblttng-ust does not batch memory reclamation, which means that it calls
synchronize_rcu() repeatedly.
Commit 6447802 (Fix: don't use membarrier SHARED syscall command in liburcu-bp)
to liburcu shows how we addressed the problem: avoid the membarrier system call
entirely.
We still think it makes sense to use membarrier(2) to implement Userspace RCU,
though not with the MEMBARRIER_CMD_SHARED command.
The real fix? The MEMBARRIER_CMD_PRIVATE_EXPEDITED command
Fortunately, a new MEMBARRIER_CMD_PRIVATE_EXPEDITED command for membarrier(2)
has been merged into Linux 4.14. It uses inter-processor interrupts (IPI) to
implement the memory barrier semantics, and communicates only with threads in
the calling thread's process. Best of all, it executes much faster than the
MEMBARRIER_CMD_SHARED command and never blocks the calling thread.
Now that MEMBARRIER_CMD_PRIVATE_EXPEDITED has landed in mainline Linux, liburcu
0.11 will include patches that use it if available (it requires at least Linux
4.14 and the CONFIG_MEMBARRIER build option to be enabled). If you compile
liburcu using the --disable-sys-membarrier-fallback configuration option,
liburcu will now abort if the MEMBARRIER_CMD_PRIVATE_EXPEDITED command is not
supported by the running kernel: it's a useful check to ensure you're running
the most optimized version of the code.
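For readers who want to try the private expedited command directly, here is a minimal sketch. It is not taken from the liburcu patches; the helper names are hypothetical, and it assumes kernel headers from Linux 4.14 or later. Note that a process has to register its intent with `MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED` before it may issue `MEMBARRIER_CMD_PRIVATE_EXPEDITED`.

```c
#define _GNU_SOURCE
#include <linux/membarrier.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical helper: assumes libc offers no membarrier() wrapper. */
static int sys_membarrier(int cmd, unsigned int flags)
{
    return (int) syscall(__NR_membarrier, cmd, flags, 0);
}

/*
 * One-time setup: check that the running kernel supports the private
 * expedited command, then register the process as a user of it.
 * Returns 0 on success, -1 if unsupported or if registration failed.
 */
static int private_expedited_init(void)
{
    int mask = sys_membarrier(MEMBARRIER_CMD_QUERY, 0);

    if (mask < 0 || !(mask & MEMBARRIER_CMD_PRIVATE_EXPEDITED))
        return -1;
    return sys_membarrier(MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED, 0);
}

/*
 * Slow-path barrier covering only the threads of the calling process.
 * Returns 0 on success, -1 on failure (e.g. the process never registered).
 */
static int private_expedited_barrier(void)
{
    return sys_membarrier(MEMBARRIER_CMD_PRIVATE_EXPEDITED, 0);
}
```

A caller would run `private_expedited_init()` once at start-up and, if it fails, either fall back to explicit fences on both paths or abort, depending on how strictly the optimized path is required.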