
Monitoring real-time latencies



Debugging and monitoring real-time latencies can be a difficult task: monitoring too much adds significant overhead and produces a lot of data, while monitoring only a subset of the interrupt handling process gives a hint but does not necessarily provide enough information to understand the whole problem.

In this post, I present a workflow based on tools we have developed to find the right balance between overhead and quickly identifying latency-related problems. To illustrate the kind of problems we want to solve, we use the JACK Audio Connection Kit sound server, capture relevant data only when it is needed, and then analyse the data to identify the source of a high latency.

The following tools are presented in this post:

  • the latency_tracker kernel module and its real-time (rt) module
  • LTTng in snapshot mode, to record context information around a detected latency
  • the LTTng analyses scripts (lttng-schedtop, lttng-irqstats, and lttng-cputop)

Setup

I'm using JACK2 and one of the most recent PREEMPT_RT kernels currently available (4.1.10-rt11). The analysis depends on a Linux kernel with the sched_waking tracepoint (mainline 4.3 or PREEMPT_RT 4.1.10-rt11).
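
A quick way to check whether a given kernel provides this tracepoint is to look it up in ftrace's event list (assuming debugfs is mounted at /sys/kernel/debug); if the kernel supports it, the output should include sched:sched_waking:

# grep sched_waking /sys/kernel/debug/tracing/available_events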

Let's start JACK2 with an RT priority of 1 (low RT priority) and pin it on CPU 0:

$ taskset -c 0 jackd -P 1 -v -d alsa -H
...
creating alsa driver ... hw:0|hw:0|1024|2|48000|0|0|hwmon|swmeter|-|32bit
configuring for 48000Hz, period = 1024 frames (21.3 ms), buffer = 2 periods
...

As you can see, JACK2 configures the ALSA driver to sample at 48 kHz and to accumulate 1024 frames before raising an interrupt, so each period lasts 1024 / 48000 ≈ 21.3 ms.

JACK2 already detects buffer over- and under-runs, called Xruns. This is very useful for us: since JACK2's workload is periodic, Xruns give us a way to validate our findings. The tools presented in this post, however, work just as well with aperiodic workloads.

To simulate a priority problem, let's run a CPU-intensive task with a higher priority on the same CPU for a short period of time (any duration longer than 21.3 ms will do):

$ taskset -c 0 chrt -r 60 cpuburn -n 1
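
If cpuburn is not available, any CPU-bound command run at a higher real-time priority works just as well. Here is a hypothetical substitute using a plain shell busy loop that stops by itself after half a second:

$ taskset -c 0 chrt -r 60 timeout 0.5 sh -c 'while :; do :; done'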

Immediately after stopping cpuburn, we see a message similar to this one:

Jack: alsa_pcm: xrun of at least 353.963 msecs

We now know it worked. We ran a higher priority task on the same CPU as our important workload, a problem that can happen for a variety of reasons in real production environments, either due to misconfiguration or more complex problems such as priority inversion. Now is the time to run our tools and try to automatically identify the same latency problem and understand its cause.

Monitoring IRQ handling latencies

When the kernel gets interrupted, it has to call the appropriate interrupt handler, which may raise a software interrupt and/or wake up another process, for example. So to track the state from the interrupt event to the target process, I looked at the already existing tracepoints in the Linux kernel and extracted the call chain for both the mainline and PREEMPT_RT kernels. The main difference between the two is that the PREEMPT_RT kernel uses threaded interrupt handlers, so we observe a process wake up instead of all the work being done in hardware and/or software interrupt handlers. With this information, we can follow interrupts throughout the chain and extract timing information along the way.

We are looking for an abnormal delay between the time an interrupt is received by the kernel (do_IRQ) and the time the process waiting for it starts to handle it. We are not trying to solve a problem related to the interrupts being masked for too long (ftrace already has some tools to help with that). We only focus on concurrency issues inside the kernel and in user space when the interrupt happens.

Of course, not all the interrupts happening in the kernel are interesting to us, but we only know that after gathering some information in the kernel and possibly in user space, such as the interrupt number or the process being woken up to handle it. If we were to start the accounting only at the end of the interrupt handling chain, we would miss some possible delays.

We developed a module for the latency_tracker to track the interrupts continuously, statefully filter the relevant occurrences, measure the delay of the whole chain, and take an action if it is greater than a specified threshold.

The module is configured in debugfs just like ftrace. Once it is compiled and installed, we can load it with:

# modprobe latency_tracker
# modprobe latency_tracker_rt
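
If the latency directory does not show up under /sys/kernel/debug, debugfs may simply not be mounted yet. Mounting it and listing the module's control files (the same files used in the rest of this post) is a quick sanity check:

# mount -t debugfs none /sys/kernel/debug
# ls /sys/kernel/debug/latency/rt/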

We are now interested in IRQ latencies related to the jackd process that are longer than the period (21.3 ms), so we set the threshold slightly above it, at 25 ms. We can configure the tracker as follows:

# echo 25000000 > /sys/kernel/debug/latency/rt/threshold
# echo 1 > /sys/kernel/debug/latency/rt/irq_tracing
# echo 1 > /sys/kernel/debug/latency/rt/enter_userspace
# echo jackd > /sys/kernel/debug/latency/rt/procname

With this configuration, we track all the interrupts that result in waking up the jackd process. If the delay between the time the interrupt arrives in the kernel and the time jackd starts working (after its wake up) is greater than 25 ms, the callback is called.
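
Since the threshold file takes a value in nanoseconds, it can also be computed from the audio period instead of being hard-coded. Here is a small sketch; the 4 ms margin added on top of the period is an arbitrary choice:

# PERIOD_NS=$(( 1024 * 1000000000 / 48000 ))    # one JACK2 period, about 21.3 ms
# echo $(( PERIOD_NS + 4000000 )) > /sys/kernel/debug/latency/rt/threshold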

The built-in callback emits an event (tracepoint) and wakes up a wakeup_pipe file, which I'll explain later. The tracepoint can be recorded by a tracer such as LTTng, ftrace, or perf. Here is the ftrace output when running the experiment described earlier:

latency_tracker_rt: comm=jackd, pid=5820, delay=484615261, preempt_count=0,
    breakdown={
        do_IRQ [001] = 530334994294,
        to sched_waking(830) [001] = 29824,
        to switch_in(irq/29-snd_hda_-830, -51) [003] = 20181,
        to sched_waking(5820) [003] = 26869,
        to switch_in(jackd-5820, -2) [000] = 484538387
    }

From this output, we know:

  • The target process was jackd
  • The target PID was 5820
  • The latency lasted 484615261 ns (484 ms)
  • The process was not preempted (normal for this use case)
  • The do_IRQ event occurred at the timestamp 530334994294
  • It took 29.824 µs from do_IRQ to the wake up of process 830 (irq/29-snd_hda_, the threaded IRQ handler)
  • It then took 20.181 µs for this process to start running
  • After 26.869 µs of running, this process woke up PID 5820 (jackd)
  • It then took 484538387 ns (about 484 ms) for jackd to actually get scheduled in on CPU 0, which accounts for almost all of the latency

The corresponding output in the Jack console is:

Jack: **** alsa_pcm: xrun of at least 484.039 msecs

With this information, we know the exact timestamp for when the IRQ was received in the kernel, as well as the delay between each important step of the chain. This information alone allows us to see that the major delay occurred between the sched_waking event (wake up of the target process for this IRQ) and the corresponding sched_switch event (time the process actually got scheduled in).

Now, let's find out why!

Recording context information

LTTng has a nice feature called the snapshot mode. In this mode, trace data is recorded in a ring buffer in memory. New data constantly overwrites older data, which means that we only keep a limited time window. The amount of data stored is configured when creating the session: the more memory is allocated, the longer the history kept in memory.

When something interesting happens, we can record a snapshot, which extracts the content of the memory ring buffer and writes it to disk or sends it over a network interface. The main benefits of this mode over recording the whole trace session on disk are that it is really lightweight, since it only records in memory, and that the resulting trace is much smaller.

For our current use case, we use this feature to record the activity of our system in memory and, when an abnormal latency is detected, automatically record a snapshot for a subsequent analysis. By doing this, we limit our overhead on the critical path by only detecting the latencies at run time and performing the more advanced analysis offline.

To create an LTTng session in snapshot mode with all kernel events enabled and 4 sub-buffers of 2 MiB/CPU of trace data, we use the following commands:

# lttng create --snapshot -o /tmp/debugrt
# lttng enable-channel -k bla --subbuf-size 2M --num-subbuf 4
# lttng enable-event -k -a -c bla
# lttng enable-event -u -a
# lttng start

Note that this configuration can be adjusted depending on which events are useful, the event generation rate, and the desired length of the history time window.
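
As a rough idea of the kernel ring buffer memory this represents (a back-of-the-envelope estimate only, since LTTng also allocates metadata and user space buffers), multiply the sub-buffer count and size by the number of CPUs:

# echo "$(( 4 * 2 * $(nproc) )) MiB of kernel ring buffers"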

Now, all we need to do is wait on the wakeup_pipe file created by our module and record a snapshot when something interesting happens:

# cat /sys/kernel/debug/latency/rt/wakeup_pipe
# lttng stop
# lttng snapshot record

The wakeup_pipe file blocks on read, so cat hangs until something interesting happens in the latency_tracker and unblocks it. It is possible to put this in a loop for continuous monitoring; just remember to run lttng start before going back to waiting on the pipe. The pipe is currently rate-limited to one wake up per second.
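
A minimal version of such a loop could look like the following sketch, which assumes the session created above is already started and does no error handling:

while true; do
    # Block until the latency_tracker detects an abnormal latency
    cat /sys/kernel/debug/latency/rt/wakeup_pipe > /dev/null
    lttng stop
    lttng snapshot record
    # Resume tracing before waiting on the pipe again
    lttng start
done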

A script that handles all the LTTng session setup and the waiting on the pipe is available here. Just run it and watch the snapshots being generated in /tmp/debugrt. Here is what happens when we create a high latency in JACK2 using the experiment setup described above:

Tracing stopped for session auto-20151204-185817
Snapshot recorded successfully for session auto-20151204-185817

In our test environment and with the above configuration of LTTng, the resulting snapshots take around 23 MiB and cover around 5 seconds of execution, almost all of it before the high latency occurred, which gives us plenty of background information without creating huge traces.
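
Before running the automated analyses, the snapshot can be quickly sanity-checked with Babeltrace, which converts the binary CTF trace to text; here we simply count the recorded events instead of printing them all:

$ babeltrace /tmp/debugrt/snapshot-1-20151204-185826-0/ | wc -l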

Now that we have more context information around the high latency event, let's look at the trace and find the source of this problem.

Automatically identifying real-time concurrency issues

The analyses repository for LTTng contains various scripts to automatically process a trace and extract relevant metrics. We have already demonstrated a use case with system calls and I/O latencies. Now we use the newly introduced scheduling analyses scripts to get more details.

The first step is to run the lttng-schedtop analysis on the trace to quickly see a top 10 of the longest wake up latencies that occurred in the snapshot:

$ lttng-schedtop --cpu 0 /tmp/debugrt/snapshot-1-20151204-185826-0/
Timerange: [2015-12-04 18:58:21.275110855, 2015-12-04 18:58:26.622859794]

Scheduling top
Wakeup               Switch                 Latency (us)   Priority CPU Waker                      Wakee
[18:58:26.101002405, 18:58:26.600549510]      499547.105         -2   0 cpuburn (5822)             rcuc/0 (14)
[18:58:26.101020741, 18:58:26.600564667]      499543.926         -2   0 cpuburn (5822)             ksoftirqd/0 (3)
[18:58:26.105607924, 18:58:26.600942937]      495335.013         20   0 cpuburn (5823)             kworker/0:1 (5647)
[18:58:26.116065859, 18:58:26.600612940]      484547.081         -2   0 irq/29-snd_hda_ (830)      jackd (5820)
[18:58:26.145583650, 18:58:26.600437823]      454854.173        -51   0 cpuburn (5823)             irq/28-iwlwifi (758)
[18:58:26.503649875, 18:58:26.600016011]       96366.136        -61   0 kworker/3:5 (1405)         cpuburn (5822)
[18:58:23.863324044, 18:58:23.869553215]        6229.171       -100   0 lttng-sessiond (4879)      lttng-sessiond (4878)
[18:58:25.244714289, 18:58:25.245833154]        1118.865         20   0 ibus-ui-gtk3 (3079)        gdbus (3117)
[18:58:25.246084580, 18:58:25.247036547]         951.967         20   0 gdbus (3117)               ibus-ui-gtk3 (3079)
[18:58:25.236120612, 18:58:25.237044353]         923.741         20   0 gdbus (3117)               ibus-ui-gtk3 (3079)

The latency we're interested in is on line 4. We can also confirm that JACK2 (jackd) was woken up by an IRQ thread. Since we know from the previous analysis that it was a scheduling latency and that our process was pinned to a specific CPU, we need to know what was happening on this CPU between the time our process was woken up and the time it actually got to run. Let's first take a look at the interrupts that were raised on this CPU during this time range:

$ lttng-irqstats --cpu 0 --timerange "[18:58:26.116065859, 18:58:26.600612940]" /tmp/debugrt/snapshot-1-20151204-185826-0/
Timerange: [2015-12-04 18:58:26.116065859, 2015-12-04 18:58:26.600613097]
Hard IRQ                                             Duration (us)
                       count          min          avg          max        stdev
--------------------------------------------------------------------------------|
28: iwlwifi             11        1.790        2.291        3.230        0.494  |

Nothing alarming here. Let's look at the CPU usage during this time range on CPU 0:

$ lttng-cputop --cpu 0 --timerange "[18:58:26.116065859, 18:58:26.600612940]" /tmp/debugrt/snapshot-1-20151204-185826-0/
Timerange: [2015-12-04 18:58:26.116065859, 2015-12-04 18:58:26.600613097]
Per-TID CPU Usage
###############################################################################
██████████████████████████████████████████████████                   99.95 %  cpuburn (5823) (prio: -61)
                                                                      0.02 %  bash (5822) (prio: -61)
                                                                      0.02 %  irq/28-s-iwlwif (759) (prio: -52)
                                                                      0.01 %  ksoftirqd/0 (3) (prio: -2)
                                                                      0.01 %  irq/28-iwlwifi (758) (prio: -51)
                                                                      0.00 %  rcuc/0 (14) (prio: -2)
                                                                      0.00 %  jackd (5820) (prio: -2)
                                                                      0.00 %  /usr/bin/x-term (3394) (prio: 20)
                                                                      0.00 %  gdbus (3396) (prio: 20)
                                                                      0.00 %  rcu_preempt (7) (prio: -2)
Per-CPU Usage
###############################################################################
█████████████████████████████████████████████████████████████████  100.00 %  CPU 0

Here is our culprit: the cpuburn process running with a higher priority on our CPU!

There are many ways to automate this process. Here is an ugly "one-liner" example of what we could run after each snapshot has been recorded to get the same result automatically:

$ lttng-cputop --no-progress --skip-validation --cpu 0 --timerange "[ \
    $(lttng-schedtop --cpu 0 --procname jackd --limit 1 --no-progress --skip-validation /path/to/trace \
    | grep jackd | cut -d '[' -f2 | cut -d ']' -f1) \
]" /path/to/trace

Of course, this is just to show that we can automate advanced analyses pretty easily with a combination of online and offline processing. If we were to actually use this in production as a monitoring tool, we could write a clean analysis script that combines all those steps. It is also possible to send the snapshot directly to a remote machine running an LTTng relay daemon and launch the offline analysis scripts outside the production server, for example on a monitoring server or a system administrator's machine.
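
The clean analysis script mentioned above could look roughly like the following sketch, which assumes (as in this post) that the workload is pinned to CPU 0, that jackd is the process of interest, and that the snapshot directory is passed as the first argument:

#!/bin/sh
# Hypothetical wrapper: extract the time range of the worst jackd wake-up
# latency from the snapshot, then run the IRQ and CPU-usage analyses on it.
TRACE="$1"
RANGE=$(lttng-schedtop --cpu 0 --procname jackd --limit 1 --no-progress --skip-validation "$TRACE" \
        | grep jackd | cut -d '[' -f2 | cut -d ']' -f1)
lttng-irqstats --no-progress --skip-validation --cpu 0 --timerange "[ $RANGE ]" "$TRACE"
lttng-cputop --no-progress --skip-validation --cpu 0 --timerange "[ $RANGE ]" "$TRACE"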

Overhead

The real-time module of the latency_tracker adds work to the critical path of interrupt processing, so it adds some overhead. We have done a lot of work to keep this overhead as low as possible: no memory allocation while tracking interrupts, a configurable memory footprint, an RCU lockless hashtable, and per-CPU RCU lock-free freelists. From our early measurements, we add on average 740 ns (standard deviation: 396 ns) at each state change. There are between 4 and 6 state changes before reaching user space, so we add around 4.4 µs to keep the state in memory.

In addition to keeping the state, the breakdown of the latency is currently appended as text during the tracking, which adds on average 660 ns (standard deviation: 616 ns) at each state change, or around 3.9 µs for the 6 state changes. Since this standard deviation is high and post-processing can extract the same information, we will make this feature optional in the near future.

The total overhead when tracking the state and using the text formatting feature on a 16-core Xeon E5-2630 is around 8.3 µs per interrupt. Filtering on specific interrupts early in the chain avoids adding this delay on all interrupts. The hard interrupt filter is applied at the first state change.

Conclusion

In this post, I have demonstrated a use case for the latency_tracker to automatically detect a high latency in the processing of a hardware interrupt, launch a user space task to extract the content of the trace ring buffer, and run some post-processing tools to identify the source of the latency.

We also have a way to take the kernel stacks of the "waker" and "wakee" to quickly know where they are blocked, identify priority inversion patterns, and lots of other fun stuff. Be sure to check out the different projects presented in this post, try them (on mainline/PREEMPT_RT kernels), and let us know what you think and what you would like to see next!