Ante Miličević
February 23, 2024

Continuous Profiling Using Parca

Today we’re tackling continuous profiling in Parca and how it can help you with resource management and bug fixes.

Resource management in software development is a key aspect that determines the performance and cost of applications. Let’s look at a CPU consumption graph.

Random spikes disrupt an otherwise calm baseline. The graph above has probably brought back memories for many of you; at least once in your career, you have likely encountered one like it. A graph like this always gives rise to plenty of theories: it’s a garbage collection run, or a problematic code path in the admin user’s tasks. But all of these theories remain guesses unless we have data to prove them.

Data can give a detailed overview of processes that are consuming CPU during these spikes. Using data, we can even further go into the minor details, like functions or line numbers in the code where resources are being consumed. This would make it much easier to manage our resources and avoid such graphs. That’s the power of continuous profiling. It makes the debugging smoother and more insightful.

What is Profiling?

Profiling has been around for about as long as modern programming languages have been evolving. It is a technique that lets users analyze a program’s execution, with an emphasis on measuring how the program uses resources, and it produces detailed reports outlining that usage. In sampling profiling, program execution is observed for a fixed amount of time. For instance, the program might be observed for 10 seconds, and during those 10 seconds its call stack is captured at consistent intervals; sampling 10 times per second for 10 seconds yields 100 samples.

This provides a good amount of data to analyze what’s really happening in the program. Because sampling profiling doesn’t continuously trace every change in the program, it has low overhead, which makes it suitable for cloud-based workloads.
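To make the sampling idea concrete, here is a minimal sketch of a sampling profiler in Python. It is a toy, not how Parca works internally: it snapshots every thread’s call stack at a fixed rate using the interpreter’s `sys._current_frames()` and counts how often each stack appears. The `busy` function is an invented workload so the sampler has something to observe.

```python
import collections
import sys
import threading
import time

def sample_stacks(duration_s=1.0, hz=10):
    """Collect stack samples from all running threads at a fixed rate.

    Returns a Counter mapping a stack signature (tuple of function
    names, outermost first) to how many times it was observed.
    """
    samples = collections.Counter()
    interval = 1.0 / hz
    deadline = time.monotonic() + duration_s
    while time.monotonic() < deadline:
        # sys._current_frames() snapshots every thread's current stack.
        for frame in sys._current_frames().values():
            stack = []
            while frame is not None:
                stack.append(frame.f_code.co_name)
                frame = frame.f_back
            samples[tuple(reversed(stack))] += 1
        time.sleep(interval)
    return samples

def busy():
    # A deliberately hot function so the sampler has something to see.
    end = time.monotonic() + 0.5
    while time.monotonic() < end:
        pass

t = threading.Thread(target=busy)
t.start()
profile = sample_stacks(duration_s=0.4, hz=50)
t.join()

# Stacks containing `busy` should dominate the collected samples.
hot = [s for s in profile if "busy" in s]
print(len(hot) > 0)  # → True
```

The same principle, observing stacks at intervals rather than tracing every call, is what keeps the overhead low enough to run all the time.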

Why do we want to profile our application?

The answer is simple: to make our applications faster. Take an e-commerce store as an example. In the e-commerce world, sales numbers are directly related to how fast the website is. Profiling helps identify the bottlenecks so you can address them, increase overall sales, and cut your infrastructure bill.

Usually, when applications run, a significant share of resources (often cited as around 30%) is spent on easily optimizable code, simply because programmers don’t know where resources are being wasted. Continuous profiling provides detailed insights into resource consumption, allowing you to optimize strategically and reduce the cloud bill.

Limitations of Traditional Profiling

Traditional profiling has its own limitations, the main one being that it’s momentary. Users run the profiler, collect some samples, and once it’s stopped, they no longer know what’s happening in the application. It’s a momentary, manual approach: when users face an issue, they set up a profiler, start collecting profiles, and stop it. This cycle has to be repeated every time a performance problem comes up again.

In addition, it’s not easy to get profiles from production. To collect samples there, users need to either SSH into the instance or port-forward to a local machine and extract the profiles from the application. All of this is both time-consuming and error-prone when done against production applications.

How to overcome profiling limitations?

Profiling is powerful, but the developer experience around it is far from ideal. To solve that problem, continuous profiling comes into the picture. Continuous profiling, as the name suggests, is a method of continuously collecting profiles from applications over a set duration, or over the lifetime of the program, so that users have a constant trail of what’s happening inside their applications.

As already discussed, sampling profiling is very low-overhead. And as with any other observability data, users don’t know in advance when they’ll need it, so it’s always good to collect it continuously at a low rate.

How does continuous profiling work? 

Sample profilers continuously collect profiles from all the processes running on a node, and the data is tagged with metadata that later allows us to slice and dice it and pull out profiles for exactly the workload we need to look into.

In addition to solving the developer-experience problem of traditional profiling, continuous profiling brings a bunch of other benefits. One of the biggest is that profiling happens in production. However hard we try to make our development environments resemble production, we simply can’t replicate the same workload locally.

Without production profiles, we miss out on crucial data that only production workloads can give us. Continuous profiling closes that gap by providing data and context over time. Once you employ a continuous profiler, you have a trail of profiles across events like rollouts and production incidents.

Whenever a deployment happens and your performance numbers are not as expected, you can compare the profiles before and after the deployment, see exactly which part of your application has degraded, and fix it immediately, eliminating the regression against your performance goals. With continuous profiling, it is possible to profile production and cloud environments all the time.
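The before/after comparison boils down to diffing sample counts per function. Here is a toy sketch of that idea with invented function names and numbers; real profile data is per stack trace, but the principle is the same:

```python
from collections import Counter

def diff_profiles(before, after):
    """Return the per-function change in sample counts between two profiles.

    `before` and `after` map function names to CPU sample counts.
    Positive values mean the function got more expensive after the rollout.
    """
    delta = Counter(after)
    delta.subtract(before)
    return dict(delta)

# Hypothetical sample counts from before and after a deployment.
before = {"handleRequest": 120, "parseJSON": 40, "gc": 30}
after  = {"handleRequest": 118, "parseJSON": 95, "gc": 31}

regressions = {fn: d for fn, d in diff_profiles(before, after).items() if d > 10}
print(regressions)  # → {'parseJSON': 55}
```

In this made-up example, `parseJSON` gained 55 samples after the rollout, so that is where you would start investigating.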

When continuous profilers run in production environments, they collect not only performance insights but can also surface other anomalies, which helps us resolve bugs.

Continuous Profiling using Parca

Parca is an open-source continuous profiler developed by Polar Signals, and it integrates easily with Kubernetes environments. It runs an agent to collect the profiles, deployed as a DaemonSet. It is an eBPF-based, zero-instrumentation profiler, meaning it doesn’t need any code changes to your application. You just deploy the agent as a DaemonSet and it starts working: it discovers every process running on the node, attaches profilers to all of them, and collects samples.

After that, it indexes the collected profiles with Kubernetes metadata such as label values, so that later, whenever there is an issue, users can query for the exact key-value pairs they’re looking for and extract the matching data.
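Conceptually, label-based indexing means every profile carries a set of key-value labels, and a query is a label selector matched against them. A toy sketch, with made-up label names and sample counts:

```python
def query(profiles, selector):
    """Return profiles whose labels match every key/value pair in `selector`."""
    return [p for p in profiles
            if all(p["labels"].get(k) == v for k, v in selector.items())]

# Hypothetical profiles indexed with Kubernetes metadata.
profiles = [
    {"labels": {"namespace": "payments", "pod": "api-7f9c"}, "samples": 412},
    {"labels": {"namespace": "payments", "pod": "worker-2b1d"}, "samples": 98},
    {"labels": {"namespace": "frontend", "pod": "web-55aa"}, "samples": 230},
]

matches = query(profiles, {"namespace": "payments"})
print(len(matches))  # → 2
```

This is the same selection model Kubernetes users already know from label selectors, applied to profiling data.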

Parca's architecture

Let's have a quick look at Parca's architecture.

Parca has a simple architecture. The eBPF-based agent collects profiles and sends them to the Parca server, where the data is processed (symbolization and other enhancements are added) and then saved to the profile store, which is backed by FrostDB, a custom-built embeddable columnar store developed by Polar Signals. FrostDB in turn persists the data in an object store. A similar flow happens from the UI: whenever users query something, the request goes to the querier, which fetches the necessary data from the profile store and renders reports in the UI.

Parca profile explorer can be divided into three parts:

  • Query selector
  • Metrics graph
  • Visualization section

The query selector is where users write queries to fetch data for a specific workload; Kubernetes label values can be used here to select workloads. The metrics graph shows how much resource each process is utilizing over time. Below the metrics graph, the visualization section breaks down the details within the application.

The visualization used to represent resource consumption is called an icicle graph. This is the most commonly used high-level visualization for performance data. The horizontal space (width) taken by each node in the icicle graph represents how much resource it consumes.

To understand how the icicle graph works, let’s take an example.

The root usually takes 100%, since it combines everything within that process. In this example, the scrape loop has taken 16% of the CPU, the server 11%, the runtime 55%, and so on. You can click on each node to get a more detailed view.
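The widths in an icicle graph are just each node’s share of the root’s total samples. A toy sketch of that computation, using a call tree whose numbers loosely mirror the percentages above (the names and values are illustrative):

```python
def icicle_widths(tree, total=None):
    """Compute each node's width as a fraction of the root's samples.

    `tree` is a nested dict: {"name": ..., "value": samples, "children": [...]}.
    In an icicle graph the root spans 100% and each child's width is its share.
    """
    if total is None:
        total = tree["value"]
    widths = {tree["name"]: tree["value"] / total}
    for child in tree.get("children", []):
        widths.update(icicle_widths(child, total))
    return widths

# Toy profile loosely mirroring the example percentages above.
profile = {
    "name": "root", "value": 100, "children": [
        {"name": "scrapeLoop", "value": 16, "children": []},
        {"name": "server", "value": 11, "children": []},
        {"name": "runtime", "value": 55, "children": []},
    ],
}
print(icicle_widths(profile))
# → {'root': 1.0, 'scrapeLoop': 0.16, 'server': 0.11, 'runtime': 0.55}
```

Note that children need not sum to the root’s value; the remainder is time spent in the node itself.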

Parca also provides a compare feature that lets you put profiles from before and after a deployment side by side, one on the left and one on the right, to see how they performed.

Agents in Parca

The agent in Parca collects metadata for all the binaries that are running. This information includes the name of the process, the name of the pod, the name of the cluster, and the process ID. It discovers all the binaries, collects information about them, and compiles the samples into stack traces; that’s where eBPF is used under the hood, in kernel space. It then compresses them into a very space-optimized format and sends them to the Parca server for visualization.

This process requires no instrumentation; users just need to deploy the agent. Above all, it has very low overhead and does not meaningfully consume CPU or affect anything else running on the machine or cluster. For each binary, the agent discovers its targets and other associated binaries and generates a list of targets. Target discovery in the agent is system-wide: it profiles everything, whether binary or process, and lets users view stack traces for everything running on their system, down to the last system call.

Almost all applications store their code in binary files, alongside other information such as memory addresses, functions, and memory mappings. Linux binaries use the Executable and Linkable Format (ELF), and the debugging information needed to make sense of them is encoded in DWARF, a complex format specification.
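To give a flavor of what reading a binary involves, here is a minimal sketch that parses just the ELF identification header, the first 16 bytes of every ELF file. Real stack unwinding also needs the DWARF sections, but even this tiny header tells you whether a file is a valid ELF binary and whether it is 32- or 64-bit. The header bytes below are synthetic, constructed for the example:

```python
def read_elf_ident(data):
    """Parse the start of an ELF file's e_ident: magic, class, endianness."""
    if data[:4] != b"\x7fELF":
        raise ValueError("not an ELF file")
    ei_class = {1: "ELF32", 2: "ELF64"}[data[4]]  # byte 4: file class
    ei_data = {1: "little-endian", 2: "big-endian"}[data[5]]  # byte 5: data encoding
    return ei_class, ei_data

# A synthetic 64-bit little-endian ELF identification header (16 bytes).
fake_header = b"\x7fELF" + bytes([2, 1, 1, 0]) + b"\x00" * 8
print(read_elf_ident(fake_header))  # → ('ELF64', 'little-endian')
```

On a Linux machine you could feed this the first 16 bytes of any binary, e.g. `open("/bin/ls", "rb").read(16)`.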

Here the Parca Agent plays its part: it extracts information from the binaries, reads and interprets the data, and transforms it into stack traces. The agent extracts the return addresses of functions and reconstructs the order in which functions call one another, along with their memory addresses. This information is compressed and shipped to the server.

How frequently does Parca do profiling? 

The idea behind continuous profiling is to capture dynamic changes and report resource consumption over time, so that users can troubleshoot issues promptly. To do this, the Parca agent collects samples from binaries 19 times per second. This provides a continuous stream of data that is optimized and causes minimal overhead.

The sampling rate is a prime number because prime intervals interact well with periodic work on the CPU. A prime sampling frequency shares no common factor with the periods of other recurring tasks, which reduces the chance of the sampler repeatedly firing at the same conflicting points and biasing the data. The agent looks at every process running on the system, every binary, whether it runs on bare metal or in a container, and gathers information from it using eBPF in kernel space.
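The aliasing argument can be illustrated with a toy calculation using made-up intervals. Suppose some task in the system repeats every 100 ms. A sampler firing every 50 ms (a divisor of the period) only ever sees two phases of that task, while a prime interval of 53 ms (standing in for Parca’s roughly 19 Hz rate) walks through every phase:

```python
def distinct_phases(sample_interval_ms, task_period_ms, n_samples=1000):
    """Count how many distinct phases of a periodic task the sampler observes.

    If the sample interval shares a large common factor with the task
    period, samples keep landing at the same few phases and the profile
    is biased toward whatever runs at those moments.
    """
    phases = {(i * sample_interval_ms) % task_period_ms for i in range(n_samples)}
    return len(phases)

# Sampling every 50 ms against a 100 ms periodic task: only 2 phases seen.
print(distinct_phases(50, 100))  # → 2

# A prime 53 ms interval sweeps through all 100 phases of the period.
print(distinct_phases(53, 100))  # → 100
```

This is why an evenly divisible rate would systematically over- or under-count periodic work, while a prime rate spreads samples across it.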

The agent then compiles the samples into stack traces and sends them over to Parca. Along the way, it collects a lot of metadata, including details like container labels, the compiler or language and its version, and whether the code is JIT-compiled.

On Parca’s side, an icicle graph shows how much CPU is being consumed. All of this is achieved with a couple of lines of configuration and a simple command.

What compilers and runtimes does Parca support?

Parca supports all natively compiled languages, such as C, C++, Rust, and Go. Parca also provides support for just-in-time compiled languages, with some additional setup. This includes:

  • C# (using perfmap or jitdump)
  • Erlang (via the BEAM VM)
  • Java (via the JVM)
  • Clojure
  • Julia
  • Node.js

In addition, it supports Python, Ruby, and AI-related workloads. Parca supports various architectures, such as x86 and ARM64. The Parca Agent needs Linux to operate, but it can also be used on macOS via a Linux VM.

Conclusion

Parca gives developers a deep view of their applications’ resource consumption, helping them troubleshoot performance issues and optimize their code. It is still evolving and is expected to add support for PHP, Perl, and more AI-related workloads in the future. Continuous profiling not only gives developers an exceptional experience but also helps build successful applications.
