Bringing .NET application performance analysis to Linux
Both the Windows and Linux ecosystems have a swath of battle-hardened performance analysis and investigation tools. But until recently, developers and platform engineers could use none of these tools with .NET applications on Linux.
Getting them to work with .NET involved collaboration across many open source communities. The .NET team at Microsoft and the LTTng community worked together to bring .NET application performance analysis to Linux. Since one of this project's goals was to avoid reinventing the wheel, and to let existing workflows carry over to .NET applications on Linux, the .NET team chose to build on popular Linux tools such as LTTng and perf for performance analysis of .NET Core applications.
This article covers some of the work involved in enabling performance analysis of .NET Core applications on Linux: what works, what doesn't, and future plans.
Though .NET was originally created to run on Windows, .NET Core makes it possible to run C#, F#, and Visual Basic applications on Windows, Linux, and macOS. Everything required to build and run .NET Core applications is open source (MIT licensed) and available on GitHub. This is no side project: .NET Core is fully featured and supported by Microsoft in production.
.NET applications can consist of both native code and managed code. Managed code is compiled by the .NET toolchain into the Common Intermediate Language (CIL), originally known as Microsoft Intermediate Language (MSIL). At run-time, the intermediate code is compiled into native machine code by the CoreCLR just-in-time (JIT) compiler and executed. Applications with managed code use runtime services such as garbage collection and JIT compilation, and each of these runtime services needs to provide diagnostic data to allow developers to understand the performance of the application and runtime. The majority of performance diagnostic improvements occurred in these runtime services, enabling symbols to be resolved and stack traces to be recorded on Linux.
Finding the right tools
When investigating tools that could be used to analyze the performance of .NET applications on Linux, Event Tracing for Windows (ETW) was used as a starting model, given that it has been the performance data collection mechanism on Windows for many years. Since ETW can collect both kernel and user space data, it offers the ability to collect trace data that spans the entire application (including runtime and libraries) and the Windows kernel. However, nothing like this is available on Linux, so an equivalent had to be chosen that covered the same events on Linux as on Windows. The .NET team picked LTTng and perf because each provides enough of the features from ETW to fill the gap, and both are widely used by the Linux community.
perf is used to collect machine-wide hardware counters (for example, CPU cycles) and kernel events, and LTTng handles user space (runtime services and application-level events) by using tracepoints generated at CoreCLR build time. Specifically, these tools together provide:
- Hardware counters and kernel events with kernel and user space call stacks.
- Runtime and application-level event-driven tracing data for diagnostic and performance analysis.
For every ETW event in CoreCLR, we construct an LTTng-UST tracepoint when running on Linux, which means there's a complete one-to-one mapping between the two—wherever an ETW event is emitted, there's a corresponding LTTng-UST tracepoint. This makes LTTng-UST a drop-in replacement for all of the diagnostic information that the runtime emits.
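To make this concrete, CoreCLR's LTTng-UST tracepoints are emitted under the DotNETRuntime provider, and a tracing session can subscribe to all of them with a wildcard. The session and output names below are illustrative; a rough sketch, assuming LTTng is installed:

```shell
# Create a session (name and output path are illustrative).
lttng create dotnet-session --output=./dotnet-trace

# Enable every CoreCLR user space tracepoint via the
# DotNETRuntime provider's wildcard.
lttng enable-event --userspace "DotNETRuntime:*"

lttng start
# ... run the .NET Core application here, started with
#     COMPlus_EnableEventLog=1 so CoreCLR emits events ...
lttng stop
lttng destroy
```

`lttng list --userspace` shows the tracepoints actually registered on a given machine, which is a good way to confirm the provider is active.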
Getting accurate stack traces and symbols
To get features such as stack walking to work, it was necessary to compile the runtime with frame pointers preserved, using the -fno-omit-frame-pointer GCC compiler option. This option incurs a small performance penalty, but it's a worthwhile trade-off for getting detailed stack traces. The JIT compiler was also modified to preserve frame pointers in generated code. While those steps improved .NET Core, many native libraries still aren't compiled with support for frame pointers, which can result in incomplete or inaccurate stack traces. Hopefully, a standard way to walk stacks on Linux will emerge that works regardless of how a binary was compiled. Meanwhile, even an inaccurate stack trace can still provide meaningful diagnostic information.
Symbol resolution for managed code was the next issue to tackle. CoreCLR writes a symbol file to /tmp containing the details needed to convert addresses to symbols; the Linux perf tool detects this file and uses it to resolve symbols. A tricky side issue is that managed code can be precompiled, that is, not compiled on the fly at run time. Precompilation can speed up application start-up (since code doesn't need to be just-in-time compiled), but it poses a separate challenge because perf cannot parse the binary files containing the platform-specific native code.
The solution to this dilemma is a map file that is consumed by custom .NET tools. Unfortunately, this map file doesn't yet work with standard Linux tools such as perf, but the long-term plan is to add that support.
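For JIT-compiled code, the symbol file in /tmp follows perf's well-known /tmp/perf-&lt;pid&gt;.map convention: one plain-text line per symbol, giving a start address, a code size, and a name. A minimal sketch of the format (addresses and method names below are made up):

```shell
# perf looks for /tmp/perf-<pid>.map when it encounters samples in
# anonymous executable memory. Each line: <start-addr> <size> <name>.
pid=$$   # use this shell's PID purely for illustration
map=/tmp/perf-$pid.map

printf '%s\n' \
  '7f3a12340000 40 instance int ExampleApp.Worker::DoWork()' \
  '7f3a12340040 80 void ExampleApp.Program::Main()' > "$map"

cat "$map"
```

When tracing a real application, CoreCLR generates this file itself if the process is started with the COMPlus_PerfMapEnabled=1 environment variable set.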
An example set of tracepoints implemented by CoreCLR is for JIT compiler functionality. They provide information such as which methods were JIT compiled and how long those compilations took. Understanding the behaviour of the JIT compiler lets you make changes to your code that improve performance; for example, if a lot of time is spent compiling a few functions on the fly, you can precompile them.
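A session can also be narrowed to just the JIT-related tracepoints. The event names below mirror CoreCLR's ETW event names and are assumptions here; check them against `lttng list --userspace` on your machine:

```shell
# Enable only JIT-related tracepoints instead of the whole provider.
lttng create jit-session
lttng enable-event --userspace "DotNETRuntime:MethodJittingStarted*"
lttng enable-event --userspace "DotNETRuntime:MethodLoadVerbose*"
lttng start
# ... run the application ...
lttng stop
lttng destroy
```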
It's important to note that the information that comes from the runtime is the same regardless of whether the code is executed on Windows or Linux. Both versions of tracepoints produce the same data.
A tool to gather .NET performance data
The .NET team wrote perfcollect to collect performance traces of .NET Core applications running on Linux. It's a bash script that takes care of enabling tracepoints and gathering trace data, along with the symbols required to interpret that data, into an archive. perfcollect collects three kinds of trace data: machine-level, runtime (GC, JIT, and the rest), and managed code instrumentation. The CoreCLR project has instructions for getting started with perfcollect.
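Following the CoreCLR instructions, a typical perfcollect session looks roughly like this (the trace name is arbitrary, and the steps are a sketch of the documented flow rather than a substitute for it):

```shell
# One-time setup: fetch the script and install its prerequisites
# (perf and LTTng). Installation requires root.
curl -OL https://aka.ms/perfcollect
chmod +x perfcollect
sudo ./perfcollect install

# Collect a trace while the application runs in another shell,
# started with COMPlus_PerfMapEnabled=1 and COMPlus_EnableEventLog=1
# so that symbols and runtime events are available.
sudo ./perfcollect collect sampleTrace
# Ctrl+C stops collection; the result is an archive such as
# sampleTrace.trace.zip containing the trace data and symbols.
```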
Once all of this data is collected, you can use PerfView or generate your own flame graphs to really dig into application behaviour. Some of the events are very usable in their raw form, and common tools like Babeltrace and Trace Compass can be used to look at runtime events. The ultimate goal is to improve Linux tracing tools to the point where Windows-only tools are no longer required when inspecting trace data.
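As a sketch of the flame graph route, one common approach uses Brendan Gregg's FlameGraph scripts over perf's output (the paths below assume the FlameGraph repository has been cloned into the working directory):

```shell
# Sample on-CPU stacks machine-wide at 99 Hz for 30 seconds,
# recording call graphs (-g) for each sample.
sudo perf record -a -g -F 99 -- sleep 30

# Fold the recorded stacks and render an interactive SVG flame graph.
sudo perf script \
  | ./FlameGraph/stackcollapse-perf.pl \
  | ./FlameGraph/flamegraph.pl > cpu-flamegraph.svg
```

Runtime events recorded with LTTng can likewise be inspected in text form by pointing `babeltrace` at the trace output directory.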
Despite this ongoing large-scale effort, there are still a few unfinished pieces. It's not possible yet to dynamically register user space tracepoints, and that means managed code can't register tracepoints at run time. Instead, all managed code triggers a single tracepoint, negatively impacting visibility of application-level behaviour. And precompiled cross-platform code is still difficult to work with.
But perhaps the most important shortcoming is the lack of user space stack traces. Accurate call stacks are required for developers to fully understand performance. Right now, it's possible to know that a garbage collection event occurred, but not which function triggered it. You can see what kind of object was allocated, but you can't get the call stack to see who requested the allocation. This issue is a pretty major gap in the developer's toolkit. Sasha Goldshtein wrote an article offering a workaround by using dynamic tracing and perf-probe to capture function symbols at run time, but it's a hack that requires too many manual steps. What's really needed is a way to gather this information automatically. For that, we need patches for LTTng.
Fortunately, there is hope. Support for capturing user space stack traces in the kernel tracer will be available in the upcoming LTTng 2.11 release. Merging that work is a critical step in bringing the full advantage of Windows performance tools to Linux.