Full Stack System Call Latency Profiling
Finding the exact origin of a kernel-related latency can be very difficult: we usually end up recording too much or not enough information, and it becomes even more complex if the problem is sporadic or hard to reproduce.
In this blog post, we present a demo of a new feature to extract as much relevant information as possible around an unusual system call latency. Depending on the configuration, during a long system call, we can extract:
- multiple kernel-space stacks of the process to identify where we are waiting,
- the user-space stack to identify the origin of the system call,
- an LTTng snapshot of the user and/or kernel-space trace to give some background information about what led us here and what was happening on the system at that exact moment.
All of this information is tracked at run-time with low overhead and recorded only when some pre-defined conditions are matched (PID, current state and latency threshold).
Combining all of this data gives us enough information to accurately determine where the system call was launched in an application, what it was waiting for in the kernel, and what system-wide events (kernel and/or user-space) could explain this unexpected latency.
Let’s look at a few sample use cases for the system call latency tracker. You can find all the scripts and applications presented here on GitHub.
First, we want to get detailed information about the long sleeps made by sleep-demo, an application which calls libc's usleep() function with random durations in a loop.
```
# Generate events for system calls taking more than 2 seconds.
echo 2000000 > /sys/module/latency_tracker_syscalls/parameters/usec_threshold

# Launch the application with liblttng-profile in order to collect userspace stacks.
LD_PRELOAD=liblttng-profile.so src/sleep-demo

# Record a trace.
scripts/trace.sh

# Generate a flame graph.
./syscall_latency_flame_graph.py [path-to-trace] | ./flamegraph.pl > sleep.svg
```
Here is the result.
The flame graph shows the userspace and kernel stacks of system calls that exceeded the specified threshold. The recorded trace is small because it only contains the information we are really interested in, and the LTTng ring buffer is configured to store only 1.5 MB of trace per CPU.
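As a rough sketch of how such a small ring buffer can be configured (the channel name small-chan is made up, and scripts/trace.sh may set things up differently):

```
# Six 256 KiB sub-buffers give 1.5 MB of ring buffer per CPU in the
# kernel domain of the current session.
lttng enable-channel -k --subbuf-size 262144 --num-subbuf 6 small-chan
lttng enable-event -k -a -c small-chan
```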
In order to grasp the true potential of a low-overhead latency tracker, let's have a look at more complex examples. The following flame graph shows system calls issued by MongoDB which took more than 100 ms.
There are a lot of long system calls in MongoDB, but we don’t care about most of them because they just correspond to worker threads waiting for new work to do. Fortunately, we know that external commands are processed by the execCommand() method. We can run grep execCommand on the output to get a flame graph of the latencies that directly affect the database's response time, as shown below.
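A minimal sketch of that filtering step, reusing the scripts from the first example (where exactly the grep applies depends on the folded-stack output format):

```
# Keep only folded stacks that contain execCommand before rendering
# the flame graph (output file name is illustrative).
./syscall_latency_flame_graph.py [path-to-trace] | grep execCommand \
    | ./flamegraph.pl > mongodb-execCommand.svg
```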
Having kernel stacks is also very helpful. In a recent blog post, Fabien Reumont-Locke explained how he dug into the source code of the kernel to find a semaphore that slowed down parallel calls to mmap(). Executing the system call latency tracker while running the same application provides the same information without having to look at the kernel code.
Sometimes, full stacks are not enough to understand why a system call took more time than expected. In such cases, the detailed information provided by an LTTng kernel trace can be the key to understanding what is happening on the system. LTTng's snapshot mode collects the trace in a circular memory buffer and saves it to a file only on demand. The system call latency tracker can be used in combination with that mode to take a snapshot when a long system call occurs within a monitored application. This way, we can obtain extremely detailed information about what is happening on the system when rare latencies occur, without having to manage huge trace files.
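For reference, a minimal snapshot-mode session looks roughly like this (a sketch; the session name demo is made up, and scripts/trace-io-snapshot.sh used below automates a similar setup):

```
# Create a session whose ring buffers are only written out on demand.
lttng create demo --snapshot
lttng enable-event -k -a   # buffer all kernel tracepoints in memory
lttng start

# Later, when the latency tracker fires, dump the in-memory buffers:
lttng snapshot record
```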
To show this feature, we will use the io-demo application, which continuously creates new files on disk.
```
# Generate events for system calls taking more than 2 seconds.
echo 2000000 > /sys/module/latency_tracker_syscalls/parameters/usec_threshold

# Launch io-demo with liblttng-profile in order to collect userspace stacks.
LD_PRELOAD=liblttng-profile.so src/io-demo

# Configure the LTTng session and save snapshots when a high latency occurs.
scripts/trace-io-snapshot.sh
```
Analyzing the snapshot shows that most of the I/O writes on the system come from the io-demo process:

```
Per-process I/O Write
###############################################################################
██████████████████████████████████████████████████  13.50 MB io-demo (5786)          13.50 MB file     0 B net  4.00 KB unknown
                                                     73.83 KB chrome (5434)                0 B file     0 B net 73.83 KB unknown
                                                     60.46 KB gnome-terminal (3720)        0 B file     0 B net 60.46 KB unknown
                                                     50.95 KB Xorg (1980)                  0 B file     0 B net 50.95 KB unknown
                                                     36.88 KB chrome (5713)                0 B file     0 B net 36.88 KB unknown
```
It also shows that this data was actually committed to the disk, which explains the latency of the open() system call:
```
Block I/O Write
###############################################################################
██████████████████████████████████████████████████   8.51 MB swapper (pid=0)
██████████████████████████████████████████████        8.00 MB kworker/u17:0 (pid=68)
██████████████████████████                            4.45 MB io-demo (pid=5786)
                                                     20.00 KB gnome-terminal (pid=3720)
```
The overhead of this tool on real applications has been assessed by running a MongoDB benchmark with 1 thread. When system calls lasting more than 200 ms are tracked, the request throughput is reduced by 15%. If userspace symbols are resolved offline (only addresses are included in the trace), the throughput is reduced by only 6%. During the execution of this benchmark, mongod generated an average of 12885 scheduler events per second. Capturing call stacks for each of these events would have produced a gigantic trace. Instead, our tool only captured the stacks for system calls of more than 200 ms, an event that occurred about 5 times per second.
In terms of scalability, we benchmarked the latency_tracker module analyzing wake-up latencies on a highly concurrent application (hackbench) with 1 to 32 CPU cores enabled, and measured a linear overhead from 0.3% (1 core) to 22% (32 cores). The latency_tracker data structures are currently being profiled to select the best ones to run in production without major impact (we are comparing 3 hashtable implementations, 2 linked lists and a wait-free queue with various locking schemes); more on that later.
An install script is provided with the project. Please use it as an install guide instead of running it blindly.
This solution is composed of 3 individual components:
- the latency_tracker module for the run-time system call latency detection,
- lttng-profile to perform the stack unwinding, with an optimized version of libunwind, when it receives a signal from the latency_tracker,
- and of course LTTng to collect the user- and kernel-space traces in a memory ring buffer. We provide a patch for LTTng-modules to connect to the tracepoints emitted by the latency_tracker, because they are not ready for upstream inclusion yet.
Once everything is installed, configure the LTTng session (for example with scripts/trace-io-snapshot.sh, shown above) and wait for events to trigger the snapshots.
To launch an application for which you want the full user-space stack:

```
$ LD_PRELOAD=liblttng-profile.so ./myapplication
```
Wait for the latency to appear; the snapshot script will inform you when a snapshot is recorded. You can then use your favorite CTF trace parser/analyser tool to process the trace, as in the example below.
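For instance, with the Babeltrace command-line reader (the snapshot path is illustrative and depends on your session configuration):

```
# Pretty-print the events recorded in the snapshot.
babeltrace ~/lttng-traces/demo-*/snapshot-1-*/
```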
The usec_threshold parameter can be configured at run-time in /sys/module/latency_tracker_syscalls/parameters or when loading the module; see modinfo latency_tracker_syscalls.ko for the other tunable parameters, as illustrated below.
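A short sketch of those options (the 100 ms value is arbitrary; passing usec_threshold on the insmod command line assumes standard module-parameter handling):

```
# Change the threshold to 100 ms at run-time through sysfs.
echo 100000 > /sys/module/latency_tracker_syscalls/parameters/usec_threshold

# Or set it when loading the module.
insmod latency_tracker_syscalls.ko usec_threshold=100000

# List the other tunable parameters.
modinfo latency_tracker_syscalls.ko
```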
Under the hood
The latency_tracker is a kernel module designed for efficient and scalable tracking of events in the kernel: it matches an event entry to an event exit based on a shared key. If the delay between these two events is higher than a pre-defined, and dynamically configurable, threshold, it executes a user-provided callback.
If we want the user-space stack of a specific application, it must be started with LD_PRELOAD pointing to a library provided by lttng-profile. This library registers the process (and its threads) with the latency_tracker module to notify it that it can receive profiling signals if needed.
We used the infrastructure of the latency_tracker to build a system-call latency tracking module. It hooks onto the system call entry and exit tracepoints and uses the PID as the key to match the entry and exit of each system call.
In addition, when a system call takes longer than the threshold, if enabled, we extract the kernel stack of the process every time it gets some CPU time (on the sched_switch event), so we get an accurate view of where it is waiting in the kernel. When the system call finally completes, we send the SIGPROF signal to the program (if registered) to ask it to extract its own stack.
All of the stacks and related information are extracted through static tracepoints to which LTTng can attach. When a latency is detected, latency_tracker_syscalls wakes up a user-space process which records the LTTng snapshot, extracting the current buffers to disk or over the network. Of course, LTTng can also be configured to simultaneously record everything else happening on the system.
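In snapshot mode, recording the rest of the system is just a matter of enabling more events in the same session, as in this sketch:

```
# Also buffer all user-space tracepoints system-wide in the snapshot
# session, alongside the kernel events.
lttng enable-event -u -a
```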
The kernel module works independently from the user-space library, so if you only want kernel events and/or the kernel stack of all system calls taking longer than a threshold, you can simply load the module with the kernel-stack extraction parameter set to 1 (see modinfo for its exact name). If you are not interested in the user-space trace, the tracepoints also work fine with other tracers such as ftrace or perf.
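A sketch of that kernel-only workflow; the take_kernel_stacks parameter name is our assumption (check modinfo for the real tunables), and a core latency_tracker module may need to be loaded first:

```
# Load the syscall tracker without any user-space helper; the parameter
# name below is hypothetical.
insmod latency_tracker_syscalls.ko take_kernel_stacks=1

# Track system calls longer than 200 ms.
echo 200000 > /sys/module/latency_tracker_syscalls/parameters/usec_threshold
```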
Finally, if we need to generate an event when a specific function in the kernel is hit during a long system call, it is easy to add a kprobe anywhere and have it check the state of the current syscall to decide what to do. This gives us the opportunity to create "stateful tracing events" with native-code flexibility and performance, which we believe will be really useful in the near future.
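The stateful check itself lives in module code, but to illustrate the probe-placement half, a dynamic kprobe can be created from the shell with ftrace (the probe name iolat and the target function submit_bio are arbitrary examples):

```
# Place a dynamic kprobe on submit_bio and enable it; the check against
# the latency_tracker's syscall state would happen in module code.
echo 'p:iolat submit_bio' >> /sys/kernel/debug/tracing/kprobe_events
echo 1 > /sys/kernel/debug/tracing/events/kprobes/iolat/enable
```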
To summarize, we are using the static kernel tracepoints to dynamically decide what relevant information to record, and when, which helps us find the needle in the haystack. The module can be configured to follow only registered processes or the whole system, and some more advanced filtering options are coming in the near future.
Please remember that this is a work in progress; the overall user experience still needs some polish to make it easier to deploy and use, but we are presenting it today to start gathering early feedback and improvement suggestions. Feel free to let us know what you think, the use-cases you foresee, if you have other ideas, etc.
We hope you enjoy it! Keep in mind that the links presented here might change in the future, so follow the main repos for more updates.