Preventing trace event record loss with LTTng-UST's blocking mode

The EfficiOS team on 22 November 2017

If you've ever suffered from lost trace event records when using LTTng-UST, you might be interested to know that LTTng 2.10 includes a new feature to prevent that from happening: blocking mode support for channels.

This long-awaited feature makes it possible for LTTng-UST to wait until space becomes available in the trace buffers instead of losing event records.

If you've ever seen "records were lost" or "discarded events" messages when tracing an application, blocking mode might be the solution you've been searching for.

Why are trace event records discarded?

When your application generates trace data, it's passed to the consumer daemon through channels. Each channel contains ring buffers and this is where your trace data is stored, as event records, before LTTng saves it somewhere, for example on the disk, or over the network.

Your application and the consumer daemon are a classic producer-consumer model: one puts data into the channel's ring buffers, the other takes it out.

But if the application writes data faster than the consumer can read it, you can quickly run into trouble.

Since channel buffers are only so big (their sizes are statically configured when the channel is created), it's possible for them to run out of space for storing new event records.
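To get a feel for how quickly a buffer can fill up, here is a rough back-of-the-envelope sketch. The capacity uses the sub-buffer size and count that appear in the `lttng status` output later in this post (524288 bytes × 4); the production and consumption rates are made-up numbers chosen only for illustration, and `ms_until_full` is a hypothetical helper, not part of LTTng.

```shell
# ms_until_full CAPACITY PRODUCE_RATE CONSUME_RATE
# CAPACITY is in bytes; both rates are in bytes per second.
# Prints the approximate number of milliseconds before one ring buffer is full,
# or "never" if the consumer keeps up with the producer.
ms_until_full() {
    if [ "$2" -le "$3" ]; then
        echo "never"
        return 0
    fi
    echo $(( $1 * 1000 / ($2 - $3) ))
}

# 4 sub-buffers of 524288 bytes, with assumed rates of 50 MB/s produced
# and 30 MB/s consumed.
ms_until_full $((524288 * 4)) 50000000 30000000   # prints 104
```

With these assumed rates, a 2 MiB ring buffer is full after roughly a tenth of a second, which is why a bigger buffer only delays the problem when the producer is consistently faster than the consumer.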

Once channel buffers are full, LTTng has to decide what to do with any additional records. The choice? Overwrite old data or discard new data.

If you're running in discard mode (the default) and your application emits many events, it's possible for thousands of event records to be dropped in a matter of seconds. This can ruin any attempt to understand rapidly changing workloads or intermittent bugs, because you might never capture the key piece of information you're looking for. This is common when tracing memory allocators, locking behavior, interpreters, and simulations, which typically produce data at a much higher rate than typical I/O can cope with.

And this is why blocking mode is so useful.

Instead of discarding data when the buffers are full, blocking mode stops the application from writing more event records, either indefinitely, or until a timeout expires.

Which means it's possible to never lose another event record again.
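The difference between the two behaviors can be illustrated with a toy model. This is only a sketch under simplifying assumptions: time advances in ticks, the application tries to write a fixed number of records per tick, the consumer removes a fixed number per tick, and blocking has no timeout. None of the names or numbers come from LTTng itself.

```shell
# simulate MODE CAPACITY PRODUCE CONSUME TOTAL
# MODE is "discard" or "block". The application has TOTAL records to write
# and attempts up to PRODUCE per tick; the consumer then drains up to CONSUME.
# Prints how many ticks the application needed and how many records were lost.
simulate() {
    mode=$1; capacity=$2; produce=$3; consume=$4; remaining=$5
    if [ "$capacity" -le 0 ] || [ "$consume" -le 0 ]; then
        echo "capacity and consume must be positive" >&2
        return 1
    fi
    used=0; lost=0; ticks=0
    while [ "$remaining" -gt 0 ]; do
        want=$produce
        if [ "$want" -gt "$remaining" ]; then want=$remaining; fi
        free=$((capacity - used))
        if [ "$want" -le "$free" ]; then fits=$want; else fits=$free; fi
        used=$((used + fits))
        if [ "$mode" = discard ]; then
            # Records that do not fit are dropped; the application moves on.
            lost=$((lost + want - fits))
            remaining=$((remaining - want))
        else
            # Records that do not fit wait for the next tick; the application stalls.
            remaining=$((remaining - fits))
        fi
        # The consumer drains the buffer at the end of the tick.
        if [ "$used" -gt "$consume" ]; then used=$((used - consume)); else used=0; fi
        ticks=$((ticks + 1))
    done
    echo "ticks=$ticks lost=$lost"
}

simulate discard 10 5 3 30   # prints "ticks=6 lost=5"
simulate block   10 5 3 30   # prints "ticks=8 lost=0"
```

In this model, discard mode finishes on time but loses records, while blocking keeps every record at the cost of a slower application.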

Why not block by default?

At this point you may be wondering why LTTng-UST doesn't always block when the buffers are full.

The reason is that a bug or latency spike on the running system could cause the application to block, possibly forever. And that potential for problems goes against one of the design goals of the LTTng project: the default tracing configuration must be safe to use in production.

Plus, we made a conscious decision to prioritize tracing performance over tracing completeness: LTTng-UST needs to be fast because it's often used to debug race conditions and performance issues.

If tracing is slow, or interferes with the normal operation of the system, it could mask any race conditions and make them impossible to debug.

That is the trade-off of blocking mode: with an infinite timeout, you are guaranteed that no data is lost, but application performance may take a hit, and the changed timing may keep you from reproducing the bug you're chasing.

The best way to know whether blocking mode will noticeably impact the timing of your application is to try it and see.

Here's how to use LTTng-UST's blocking mode

Create a tracing session:
```
$ lttng create
```
Create a user space channel in blocking mode:
```
$ lttng enable-channel --userspace --blocking-timeout=100 blocking-channel
```
We use a timeout of 100 µs in this example.

Create an event rule to record all user space events:

```
$ lttng enable-event --userspace --channel=blocking-channel --all
```
Start tracing:
```
$ lttng start
```
Run your application, setting the LTTNG_UST_ALLOW_BLOCKING environment variable to 1:
```
$ LTTNG_UST_ALLOW_BLOCKING=1 my-app
```
When you are done, stop tracing:
```
$ lttng stop
```
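If you repeat these steps often, they can be collected into a small shell function. The following is a sketch rather than an official tool: `run` and `trace_blocking` are hypothetical helper names, the channel name is the one used above, and the `DRY_RUN` variable is an assumption added here so the sequence can be previewed without touching a tracing session.

```shell
# run CMD [ARGS...]
# Prints the command on stderr, then executes it unless DRY_RUN=1.
run() {
    echo "+ $*" >&2
    if [ "${DRY_RUN:-0}" != 1 ]; then
        "$@"
    fi
}

# trace_blocking TIMEOUT APP [ARGS...]
# Runs the steps from this post in order: create a session, create a blocking
# user space channel, enable all user space events, start tracing, run the
# application with blocking allowed, then stop tracing.
trace_blocking() {
    timeout_us=$1
    shift
    run lttng create || return
    run lttng enable-channel --userspace \
        --blocking-timeout="$timeout_us" blocking-channel || return
    run lttng enable-event --userspace --channel=blocking-channel --all || return
    run lttng start || return
    status=0
    run env LTTNG_UST_ALLOW_BLOCKING=1 "$@" || status=$?
    # Stop tracing even if the application exited with an error.
    run lttng stop
    return "$status"
}
```

For example, `DRY_RUN=1 trace_blocking 100 my-app` only prints the six commands, while `trace_blocking 100 my-app` runs them.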

The --blocking-timeout option accepts one of three kinds of value:

0
inf
A positive microseconds value

With 0, the application never blocks. With inf, it blocks for as long as it takes for buffer space to become available.

The third option is to specify a timeout value (in microseconds). Since you can only use a blocking timeout for discard mode channels, any new event record is discarded if the blocking timeout is reached before space is available in the buffer.
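The three cases can be summarized in a small helper that explains what a given value means. This is an illustrative sketch only: `describe_blocking_timeout` is a hypothetical function, and its input checks are an assumption of this example, not a description of how the `lttng` command itself parses the option.

```shell
# describe_blocking_timeout VALUE
# Prints what VALUE means when passed to --blocking-timeout, following the
# three cases described above. Returns 1 for anything else.
describe_blocking_timeout() {
    case $1 in
        inf)
            echo "block until buffer space is available" ;;
        ''|*[!0-9]*)
            echo "invalid value: expected 0, inf, or a positive integer" >&2
            return 1 ;;
        *)
            if [ "$1" -eq 0 ]; then
                echo "never block"
            else
                echo "block for at most $1 microseconds, then discard the record"
            fi ;;
    esac
}

describe_blocking_timeout 100
# prints "block for at most 100 microseconds, then discard the record"
```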

Once you've created a blocking mode channel, it's possible to check the value of the blocking timeout parameter by looking at the current tracing session's status, for example:

lttng status

Tracing session auto-20171017-224928: [active]
  Trace path: /home/test/lttng-traces/auto-20171017-224928

=== Domain: UST global ===

Buffer type: per UID

Channels:
-------------
- blocking-channel: [enabled]

  Attributes:
  Event-loss mode: discard
  Sub-buffer size: 524288 bytes
  Sub-buffer count: 4
  Switch timer: inactive
  Read timer: inactive
  Monitor timer: 1000000 µs
  Blocking timeout: 100 µs
  Trace file count: 1 per stream
  Trace file size: unlimited
  Output mode: mmap

  Statistics:
  Discarded events: 0

  Event rules:
  all (type: tracepoint) [enabled]
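When you only care about one or two of these attributes, for example to check after a run that nothing was discarded, you can filter the output. The sketch below assumes the human-readable layout shown above ("Name: value" on each line), which may differ between LTTng versions; `channel_field` is a hypothetical helper name.

```shell
# channel_field NAME
# Reads `lttng status` output on stdin and prints the value that follows
# "NAME:" on each matching line (one line per channel that has the field).
channel_field() {
    sed -n "s/^[[:space:]]*$1:[[:space:]]*//p"
}

# Demonstration on two lines copied from the output above.
channel_field 'Discarded events' <<'EOF'
  Blocking timeout: 100 µs
  Discarded events: 0
EOF
# prints 0
```

On a live session, the equivalent would be `lttng status | channel_field 'Discarded events'`.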

Never lose trace event records again

Blocking mode is a new feature in LTTng-UST 2.10 that allows applications to block when they run out of space to write trace data. If you've ever been unable to fully analyze your application because of discarded event records, and increasing the channel buffer size didn't help, blocking mode can make sure you capture all trace data.

For some scenarios, the introduction of blocking mode channels means users can now finally diagnose those intermittent bugs.

More information on the blocking mode feature is available in the LTTng Documentation.
