Monitoring real-time latencies
Debugging and monitoring real-time latencies can be a difficult task: monitoring too much adds significant overhead and produces a lot of data, while monitoring only a subset of the interrupt handling process gives a hint but does not necessarily provide enough information to understand the whole problem.
In this post, I present a workflow based on tools we have developed to find the right balance between overhead and quickly identifying latency-related problems. To illustrate the kind of problems we want to solve, we use the JACK Audio Connection Kit sound server, capture relevant data only when it is needed, and then analyse the data to identify the source of a high latency.
The following tools are presented in this post:
- latency_tracker to detect high interrupt-processing latencies on a running system (online),
- LTTng in snapshot (flight-recorder) mode to collect background information efficiently,
- LTTng-analyses to extract relevant metrics from the collected trace (offline post-processing).
Setup
I'm using JACK2 and one of the most recent PREEMPT_RT kernels currently available (4.1.10-rt11). The analysis depends on a Linux kernel with the sched_waking tracepoint (mainline 4.3 or PREEMPT_RT 4.1.10-rt11).
Let's start JACK2 with an RT priority of 1 (a low RT priority) and pin it to CPU 0:
$ taskset -c 0 jackd -P 1 -v -d alsa -H
...
creating alsa driver ... hw:0|hw:0|1024|2|48000|0|0|hwmon|swmeter|-|32bit
configuring for 48000Hz, period = 1024 frames (21.3 ms), buffer = 2 periods
...
As you can see, JACK2 configures the ALSA driver to sample at 48 kHz and accumulate 1024 frames before raising an interrupt, so the whole period is 1024 / 48000 ≈ 21.3 ms.
JACK2 already detects when there is a buffer overrun or underrun, called an Xrun. Because its workload is periodic, these messages are very useful for validating our findings; that said, the tools I am presenting in this post work just as well with aperiodic workloads.
To simulate a priority problem, let's briefly run a CPU-intensive task with a higher priority on the same CPU (any duration longer than 21.3 ms will do):
$ taskset -c 0 chrt -r 60 cpuburn -n 1
Immediately after stopping cpuburn, we see a message similar to this one:
Jack: alsa_pcm: xrun of at least 353.963 msecs
We now know it worked. We ran a higher priority task on the same CPU as our important workload, a problem that can happen for a variety of reasons in real production environments, either due to misconfiguration or more complex problems such as priority inversion. Now is the time to run our tools and try to automatically identify the same latency problem and understand its cause.
Monitoring IRQ handling latencies
When the kernel gets interrupted, it has to call the appropriate interrupt handler, which may raise a software interrupt and/or wake up another process, for example. So to track the state from the interrupt event to the target process, I looked at the already existing tracepoints in the Linux kernel and extracted the call chain for both the mainline and PREEMPT_RT kernels. The main difference between the two is that the PREEMPT_RT kernel uses threaded interrupt handlers, so we observe a process wake up instead of all the work being done in hardware and/or software interrupt handlers. With this information, we can follow interrupts throughout the chain and extract timing information along the way.
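To get a feel for this chain before automating anything, it can be observed manually with ftrace. The following is only a rough sketch; it assumes tracefs is mounted at the usual /sys/kernel/debug/tracing location and that the kernel provides the sched_waking tracepoint (mainline 4.3 or the PREEMPT_RT kernel mentioned above):
# cd /sys/kernel/debug/tracing
# echo 1 > events/irq/irq_handler_entry/enable
# echo 1 > events/sched/sched_waking/enable
# echo 1 > events/sched/sched_switch/enable
# cat trace_pipe
Grepping the trace_pipe output for the interrupt of interest and the woken process already shows the interrupt entry, wake up, and switch-in sequence, but doing this by hand for every interrupt quickly becomes impractical, which is what motivates the tracker below.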
We are looking for an abnormal delay between the time an interrupt is received by the kernel (do_IRQ) and the time the process waiting for it starts to handle it. We are not trying to solve a problem related to the interrupts being masked for too long (ftrace already has some tools to help with that). We only focus on concurrency issues inside the kernel and in user space when the interrupt happens.
Of course, not all the interrupts happening in the kernel are interesting to us, but we only know that after gathering some information in the kernel and eventually in user space, such as the interrupt number or the process being woken up to handle it. If we were to start the accounting only at the end of the interrupt handling chain, we would miss some possible delays.
We developed a module for the latency_tracker to track the interrupts continuously, statefully filter the relevant occurrences, measure the delay of the whole chain, and take an action if it is greater than a specified threshold.
The module is configured in debugfs just like ftrace. Once it is compiled and installed, we can load it with:
# modprobe latency_tracker
# modprobe latency_tracker_rt
Now we are interested in IRQ latencies greater than the period (21.3 ms) related to the jackd process. We can configure the tracker as follows:
# echo 25000000 > /sys/kernel/debug/latency/rt/threshold
# echo 1 > /sys/kernel/debug/latency/rt/irq_tracing
# echo 1 > /sys/kernel/debug/latency/rt/enter_userspace
# echo jackd > /sys/kernel/debug/latency/rt/procname
With this configuration, we track all the interrupts that result in waking up the jackd process. If the delay between the time the interrupt arrived in the kernel and the time jackd starts working (after the wake up) is greater than 25 ms, the callback is called.
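As a quick sanity check, the configuration can be read back from the same debugfs files (assuming the module exposes them as readable, which is typical for debugfs tunables; the values shown below are simply what we would expect given the writes above):
# cat /sys/kernel/debug/latency/rt/threshold
25000000
# cat /sys/kernel/debug/latency/rt/procname
jackd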
The built-in callback emits an event (tracepoint) and wakes up any reader blocked on a wakeup_pipe file, which I'll explain later. The tracepoint can be recorded by a tracer such as LTTng, ftrace, or perf. Here is the ftrace output when running the experiment described earlier:
latency_tracker_rt: comm=jackd, pid=5820, delay=484615261, preempt_count=0,
breakdown={
do_IRQ [001] = 530334994294,
to sched_waking(830) [001] = 29824,
to switch_in(irq/29-snd_hda_-830, -51) [003] = 20181,
to sched_waking(5820) [003] = 26869,
to switch_in(jackd-5820, -2) [000] = 484538387
}
From this output, we know:
- The target process was jackd
- The target PID was 5820
- The latency lasted 484615261 ns (484 ms)
- The process was not preempted (normal for this use case)
- The do_IRQ event occurred at timestamp 530334994294
- It took 29.824 µs to call the wake up for process 830 (the threaded IRQ handler irq/29-snd_hda_)
- It then took 20.181 µs for this process to start running
- After 26.869 µs of running, this process woke up 5820
- It then took 484538387 ns (484.5 ms) for jackd (5820) to actually get scheduled in
The corresponding output in the Jack console is:
Jack: **** alsa_pcm: xrun of at least 484.039 msecs
With this information, we know the exact timestamp for when the IRQ was received in the kernel, as well as the delay between each important step of the chain. This information alone allows us to see that the major delay occurred between the sched_waking event (wake up of the target process for this IRQ) and the corresponding sched_switch event (time the process actually got scheduled in).
Now, let's find out why!
Recording context information
LTTng has a nice feature called the snapshot mode. In this mode, we record trace data in a ring buffer in memory. New data constantly overwrites older data, which means that we only keep a limited time window. The amount of data stored is configured when creating the session: the more memory allocated, the longer the history kept in memory.
When something interesting happens, we can record a snapshot, which extracts the content of the memory ring buffer and sends it to disk or over a network interface. The main benefits of using this mode over recording the whole trace session on disk are that it is really lightweight, since it only records in memory, and that the resulting trace is much smaller.
For our current use case, we use this feature to record the activity of our system in memory and, when an abnormal latency is detected, automatically record a snapshot for a subsequent analysis. By doing this, we limit our overhead on the critical path: only the latency detection happens at run time, while the more advanced analysis is performed offline.
To create an LTTng session in snapshot mode, with all kernel events enabled and 4 sub-buffers of 2 MiB of trace data per CPU, we use the following commands:
# lttng create --snapshot -o /tmp/debugrt
# lttng enable-channel -k bla --subbuf-size 2M --num-subbuf 4
# lttng enable-event -k -a -c bla
# lttng enable-event -u -a
# lttng start
Note that this configuration can be adjusted depending on which events are useful, the event generation rate, and the desired history time window.
Now, all we need to do is wait on the wakeup_pipe file created by our module and record a snapshot when something interesting happens:
# cat /sys/kernel/debug/latency/rt/wakeup_pipe
# lttng stop
# lttng snapshot record
The wakeup_pipe file blocks the read operation, so cat hangs until something interesting happens in the latency_tracker and we unblock it. It is possible to put that in a loop for continuous monitoring; just remember to run lttng start before going back into waiting on the pipe. The pipe is currently rate-limited to one wake up per second.
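For example, a minimal monitoring loop could look like the sketch below; it assumes the snapshot session created earlier (with its output in /tmp/debugrt) has already been started:
while true; do
    # block until the latency_tracker detects a high latency
    cat /sys/kernel/debug/latency/rt/wakeup_pipe
    # stop tracing and write the in-memory ring buffer to the snapshot output
    lttng stop
    lttng snapshot record
    # resume tracing before waiting on the pipe again
    lttng start
done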
A script that handles all the LTTng session setup and the waiting on the pipe is available here. Just run it and watch the snapshots being generated in /tmp/debugrt. Here is what happens when we create a high latency in JACK2 using the experiment setup described above:
Tracing stopped for session auto-20151204-185817
Snapshot recorded successfully for session auto-20151204-185817
In our test environment and with the above configuration of LTTng, the resulting snapshots take around 23 MiB and cover between 4 and 5 seconds of execution before the high latency occurred, which gives us plenty of background information without creating huge traces.
Now that we have more context information around the high latency event, let's look at the trace and find the source of this problem.
Automatically identifying real-time concurrency issues
The analyses repository for LTTng contains various scripts to automatically process a trace and extract relevant metrics. We have already demonstrated a use case with system calls and I/O latencies. Now we use the newly introduced scheduling analyses scripts to get more details.
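The examples below assume the analysis scripts (lttng-schedtop, lttng-cputop, lttng-irqstats) are installed. As a rough sketch, and assuming the repository location and install procedure have not changed since the time of writing (check the project's README for up-to-date instructions), an installation from source typically looks like this:
$ git clone https://github.com/lttng/lttng-analyses.git
$ cd lttng-analyses
$ sudo python3 setup.py install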
The first step is to run the lttng-schedtop analysis on the trace to quickly see a top 10 of the longest wake-up latencies that occurred in the snapshot:
$ lttng-schedtop --cpu 0 /tmp/debugrt/snapshot-1-20151204-185826-0/
Timerange: [2015-12-04 18:58:21.275110855, 2015-12-04 18:58:26.622859794]
Scheduling top
Wakeup                Switch                 Latency (us)  Priority  CPU  Waker                   Wakee
[18:58:26.101002405,  18:58:26.600549510]      499547.105        -2    0  cpuburn (5822)          rcuc/0 (14)
[18:58:26.101020741,  18:58:26.600564667]      499543.926        -2    0  cpuburn (5822)          ksoftirqd/0 (3)
[18:58:26.105607924,  18:58:26.600942937]      495335.013        20    0  cpuburn (5823)          kworker/0:1 (5647)
[18:58:26.116065859,  18:58:26.600612940]      484547.081        -2    0  irq/29-snd_hda_ (830)   jackd (5820)
[18:58:26.145583650,  18:58:26.600437823]      454854.173       -51    0  cpuburn (5823)          irq/28-iwlwifi (758)
[18:58:26.503649875,  18:58:26.600016011]       96366.136       -61    0  kworker/3:5 (1405)      cpuburn (5822)
[18:58:23.863324044,  18:58:23.869553215]        6229.171      -100    0  lttng-sessiond (4879)   lttng-sessiond (4878)
[18:58:25.244714289,  18:58:25.245833154]        1118.865        20    0  ibus-ui-gtk3 (3079)     gdbus (3117)
[18:58:25.246084580,  18:58:25.247036547]         951.967        20    0  gdbus (3117)            ibus-ui-gtk3 (3079)
[18:58:25.236120612,  18:58:25.237044353]         923.741        20    0  gdbus (3117)            ibus-ui-gtk3 (3079)
The latency we're interested in is on the fourth line. We can also confirm that JACK2 (jackd) was woken up by an IRQ thread. Since we know from the previous analysis that it was a scheduling latency and that our process was pinned to a specific CPU, we need to know what was happening on this CPU between the time our process was woken up and the time it actually got to run.
Let's first take a look at the interrupts that were raised on this CPU during this time range:
$ lttng-irqstats --cpu 0 --timerange "[18:58:26.116065859, 18:58:26.600612940]" /tmp/debugrt/snapshot-1-20151204-185826-0/
Timerange: [2015-12-04 18:58:26.116065859, 2015-12-04 18:58:26.600613097]
Hard IRQ                                                    Duration (us)
                       count          min          avg          max        stdev
--------------------------------------------------------------------------------|
28: iwlwifi               11        1.790        2.291        3.230        0.494 |
Nothing alarming here. Let's look at the CPU usage during this time range on CPU 0:
$ lttng-cputop --cpu 0 --timerange "[18:58:26.116065859, 18:58:26.600612940]" /tmp/debugrt/snapshot-1-20151204-185826-0/
Timerange: [2015-12-04 18:58:26.116065859, 2015-12-04 18:58:26.600613097]
Per-TID CPU Usage
###############################################################################
██████████████████████████████████████████████████   99.95 %  cpuburn (5823) (prio: -61)
                                                       0.02 %  bash (5822) (prio: -61)
                                                       0.02 %  irq/28-s-iwlwif (759) (prio: -52)
                                                       0.01 %  ksoftirqd/0 (3) (prio: -2)
                                                       0.01 %  irq/28-iwlwifi (758) (prio: -51)
                                                       0.00 %  rcuc/0 (14) (prio: -2)
                                                       0.00 %  jackd (5820) (prio: -2)
                                                       0.00 %  /usr/bin/x-term (3394) (prio: 20)
                                                       0.00 %  gdbus (3396) (prio: 20)
                                                       0.00 %  rcu_preempt (7) (prio: -2)
Per-CPU Usage
###############################################################################
█████████████████████████████████████████████████████████████████  100.00 %  CPU 0
Here is our culprit: the cpuburn process running with a higher priority on our CPU!
There are many ways to automate this process. Here is an ugly "one-liner" example of what we could run after each snapshot has been recorded to get the same result automatically:
$ lttng-cputop --no-progress --skip-validation --cpu 0 --timerange "[ \
    $(lttng-schedtop --cpu 0 --procname jackd --limit 1 --no-progress --skip-validation /path/to/trace \
    | grep jackd | cut -d '[' -f2 | cut -d ']' -f1) \
    ]" /path/to/trace
Of course, this is just to show that advanced analyses can be automated fairly easily with a nice combination of online and offline processing. If we were to actually use this in production as a monitoring tool, we could write a clean analysis script that combines all those steps. It is also possible to send the snapshot directly to a remote machine running an LTTng relay daemon and launch the offline analysis scripts outside the production server, for example on a monitoring server or a system administrator's machine.
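As a rough sketch of what such a script could look like, a hypothetical analyse-snapshot.sh could simply chain the commands used above (CPU 0 and the jackd process name are hard-coded here; the trace path is passed as an argument):
#!/bin/sh
# Usage: ./analyse-snapshot.sh /path/to/snapshot
TRACE="$1"

# 1) Find the worst jackd wake-up latency on CPU 0 and extract its time range
RANGE=$(lttng-schedtop --cpu 0 --procname jackd --limit 1 \
        --no-progress --skip-validation "$TRACE" \
        | grep jackd | cut -d '[' -f2 | cut -d ']' -f1)

# 2) Show what was using the CPU during that window
lttng-cputop --cpu 0 --timerange "[$RANGE]" --no-progress --skip-validation "$TRACE"

# 3) Show which interrupts fired on that CPU in the same window
lttng-irqstats --cpu 0 --timerange "[$RANGE]" "$TRACE"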
Overhead
The real-time module of the latency_tracker adds work in the critical path of interrupt processing, so it introduces some overhead. We have done a lot of work to keep this overhead as low as possible: no memory allocation while tracking interrupts, a configurable memory footprint, an RCU lockless hashtable, and per-CPU RCU lock-free freelists. From our early measurements, we add on average 740 ns (standard deviation: 396 ns) at each state change. There are between 4 and 6 state changes before we reach user space, so keeping the state in memory adds around 4.4 µs.
In addition to keeping the state, the breakdown of the latency is appended as text during the tracking. This currently adds on average 660 ns (standard deviation: 616 ns) at each state change, so around 3.9 µs for the 6 state changes. Because we consider the high standard deviation a problem and post-processing can extract the same information, we will make this feature optional in the near future.
The total overhead when tracking the state and using the text formatting feature on a 16-core Xeon E5-2630 is around 8.3 µs per interrupt. Filtering on specific interrupts early in the chain avoids adding this delay to all interrupts; the hard interrupt filter is applied at the first state change.
Conclusion
In this post, I have demonstrated a use case for the latency_tracker: automatically detecting a high latency in the processing of a hardware interrupt, launching a user-space task to extract the content of the trace ring buffer, and running some post-processing tools to identify the source of the latency.
We also have a way to take the kernel stacks of the "waker" and "wakee" to quickly know where they are blocked, identify priority inversion patterns, and do lots of other fun stuff. Be sure to check out the different projects presented in this post, try them (on mainline or PREEMPT_RT kernels), and let us know what you think and what you would like to see next!