Full Stack System Call Latency Profiling
Finding the exact origin of a kernel-related latency can be very difficult: we usually end up recording too much or not enough information, and it becomes even more complex if the problem is sporadic or hard to reproduce.
In this blog post, we present a demo of a new feature to extract as much relevant information as possible around an unusual system call latency. Depending on the configuration, during a long system call, we can extract:
- multiple kernel-space stacks of the process to identify where we are waiting,
- the user-space stack to identify the origin of the system call,
- an LTTng snapshot of the user and/or kernel-space trace to give some background information about what led us here and what was happening on the system at that exact moment.
All of this information is tracked at run-time with low overhead and recorded only when some pre-defined conditions are matched (PID, current state and latency threshold).
Combining all of this data gives us enough information to accurately determine where the system call was launched in an application, what it was waiting for in the kernel, and what system-wide events (kernel and/or user-space) could explain this unexpected latency.