Benjamin Kušen
March 31, 2024

Optimizing Kafka Consumers with Kubernetes and KEDA

We'll talk about how to utilize Kubernetes and KEDA to optimize Kafka consumer infrastructure for scalability, availability, and cost-efficiency.

In the world of Kubernetes, scaling a real-time data pipeline built on Kafka comes with unique challenges, like efficiently processing millions of Kafka messages spread across various topics. Today, we'll look at how Grab, the popular Southeast Asian super-app, uses these tools to scale and streamline its data platform.

Grab’s data processing platform runs on Kubernetes and consumes millions of messages streamed across thousands of Kafka topics. The platform operates as deployments in a Kubernetes cluster, and each deployment consumes data from specific Kafka topic partitions. The load on a pod depends on the partitions it consumes. Let’s imagine that a customer orders a meal using Grab.

The data generated by the platform needs to go through certain data engineering steps, like aggregation, before it can be used by other teams. This is where Grab’s generic platform kicks in. Users define their own business-specific data transformation logic on this platform in a generic way and then consume the results. For instance, this order data might be used by analysts to improve the customer experience.

In a nutshell, the Kubernetes cluster runs Grab deployments that consume the Kafka data produced by various service teams. After transformation, this data goes to predefined stores like Scylla, Kairos, and MySQL for real-time as well as offline use cases.

Kafka Consumers on Kubernetes: Optimization Challenges

The idea of running Kafka consumers on Kubernetes comes with challenges.

These challenges include:

  • The goal of Grab’s infrastructure is to ensure scalability and availability. Building a scalable but cost-effective solution is a major challenge.
  • The second challenge is balancing the load across all the pods in a deployment. Kafka consumers consume directly from the partitions of a topic, so the load on each pod of the deployment should be kept equal to avoid problems like noisy neighbors or pod-level throttling.
  • Maintaining data freshness is also a priority. In simple terms, data freshness means that once data has been generated by a producer, it becomes usable soon after passing through the platform. It is generally tracked via the consumer lag of a pipeline and is crucial for the system.
  • Grab’s philosophy is to provide a good user experience to the service teams, and we don't want them to actually do a lot of resource tuning for their particular pipelines. The goal is to abstract that information so that they can just focus on their pipeline logic rather than worrying about infrastructure scalability.

Strategies to Overcome Optimization Challenges

Grab’s current infrastructure was designed by trying various available solutions. Let’s dive deep into the approaches considered before arriving at the current architecture.

The Vertical Pod Autoscaler Approach

The platform optimization journey began with the vertical pod autoscaler, also known as VPA. The VPA scales an application vertically rather than horizontally: instead of changing replica counts, it resizes the pods. Initially, memory and CPU metrics were used for VPA scaling. Each pod is a Kafka consumer connected to a single partition.

To begin with, the number of replicas in the deployment is kept the same as the number of partitions that particular deployment is consuming. After this, VPA takes over to decide the right size of the deployments based on the resource requirements for stability.
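A setup like this can be sketched with a standard VPA object. This is a minimal sketch; the deployment name and resource bounds are illustrative, not Grab's actual configuration:

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: kafka-consumer-vpa          # hypothetical name
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: kafka-consumer            # replicas kept equal to partition count
  updatePolicy:
    updateMode: "Auto"              # VPA evicts and resizes pods automatically
  resourcePolicy:
    containerPolicies:
      - containerName: "*"
        minAllowed:                 # illustrative guardrails
          cpu: 100m
          memory: 128Mi
        maxAllowed:
          cpu: "2"
          memory: 4Gi
```

With `updateMode: "Auto"`, the VPA recommender observes actual usage and resizes pod requests over time, which is what makes the setup hands-off for end users.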

This particular setup helped to provide a good abstraction for the end users because they don't have to configure any resources or even tune when there's a change in the traffic. Over a period of time, the pod would automatically resize with the changing business trends, and there was no requirement from the platform team or from the end user to intervene.

Challenges with the Vertical Pod Autoscaler

This setup performed well, but there were certain challenges associated with it. For example, the CPU utilization or resource utilization was quite good at peak hours, but it was not very good during off-peak hours.

Figure: CPU utilisation of 22% across one day

Theoretically, the load should be equivalent across all the pods, but in practice it didn't even come close, because the load on a partition is determined by the partition key, which is chosen by the producers and tends to be business-oriented. So, in reality, an equivalent load across all the pods was unattainable.

This unbalanced load skewed the average CPU and memory metrics that VPA relies on, and eventually led to consumer lag, affecting data freshness for the pods consuming from heavier partitions.

The deployment was also consistently running the maximum number of pods at all times. This meant running the same number of nodes at peak and off-peak alike, which was not a cost-efficient solution.

The Horizontal Pod Autoscaler Approach

We tried another scaling approach, horizontal scaling, to overcome the challenges faced by VPA. The metrics used were the same average CPU and memory utilization as with VPA, but the number of consumer replicas no longer had to match the number of topic partitions at all times. This means that during off-peak hours, these deployments could scale down.

The deployment used the average CPU and memory utilization metrics to scale with the amount of load processed. Whenever the load on a topic increased during peak hours, it would scale out, improving the distribution of load across pods.
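Such a setup corresponds to a standard `autoscaling/v2` HPA with resource metrics. A minimal sketch, with hypothetical names and replica bounds:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: kafka-consumer-hpa          # hypothetical name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: kafka-consumer
  minReplicas: 2                    # can drop well below partition count off-peak
  maxReplicas: 32                   # sensibly capped at the topic's partition count
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70    # scale out when average CPU exceeds 70%
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 70
```

Capping `maxReplicas` at the partition count matters for Kafka consumers: any replica beyond the number of partitions in the consumer group sits idle.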

When the incoming load on the topic dropped, HPA would trigger a scale-in. By tying these scaling operations to fluctuations in the traffic coming into the topic, CPU utilization improved to about 50%, up from a little over 20%. This was a great achievement in terms of resource efficiency.

During off-peak hours, there was a drastic reduction in the number of pods that were running, which essentially meant that the number of Kubernetes nodes that were required during off-peak hours had decreased. This helped in achieving cost-effectiveness.

Challenges with the Horizontal Pod Autoscaler

The horizontal pod autoscaler came with a new set of challenges. One of them was uneven load distribution across pods. A key driver of per-pod load is which partitions are assigned to each pod. HPA can only scale in or out based on the load the entire deployment is processing; which partition gets assigned to which pod cannot be controlled.

Another issue was unbalanced resource utilization across pods. HPA used the average CPU and memory utilization metrics to perform its scaling operations, which meant there could be scenarios where one or two pods were over-utilizing resources, and this over-utilization is not something HPA can detect.

Figure: CPU usage across pods

This leads to the third challenge, which is higher consumer lag. When resource consumption was higher than the target resource that was being provided for the deployment, the processing slowed down, and hence there was a higher consumer lag for these pipelines, which essentially meant that they were breaching the service level agreements.

The need for continuous resource tuning with HPA was also problematic. Each deployment has a fixed resource allocation, and HPA can only scale out to the maximum number of replicas to process the load. Whenever the organic traffic on a topic changed, manual intervention was required from the user to provision appropriate resources for the deployment.

Using KEDA for Efficient Resource Management

To overcome the challenges of VPA and HPA, KEDA emerged as a compelling alternative. KEDA is an open-source, event-driven autoscaling mechanism that horizontally scales pods based on external triggers or events.

While HPA does provide the possibility to incorporate external metrics, KEDA is better suited for event-driven applications and offers an extensive set of scalers for different event sources. So, using KEDA, custom metrics were incorporated to scale on the Kafka consumer lag that is directly associated with SLAs.

By utilizing these custom metrics, we were able to keep track of the Kafka consumer lag and maintain the service level agreements. Apart from maintaining the service level agreements, KEDA also helped in achieving higher resource efficiency. Let’s explore how KEDA helped in achieving scaling goals.
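Scaling on consumer lag is typically expressed as a KEDA `ScaledObject` with the built-in Kafka trigger. A minimal sketch; broker address, consumer group, topic, and thresholds are placeholders, not Grab's values:

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: kafka-consumer-scaler       # hypothetical name
spec:
  scaleTargetRef:
    name: kafka-consumer            # the Deployment to scale
  minReplicaCount: 2
  maxReplicaCount: 32               # bounded by the topic's partition count
  triggers:
    - type: kafka
      metadata:
        bootstrapServers: kafka.example.com:9092   # placeholder broker
        consumerGroup: order-aggregator            # hypothetical consumer group
        topic: orders                              # hypothetical topic
        lagThreshold: "1000"        # target lag per replica; KEDA adds pods as lag grows
```

Under the hood, KEDA creates and drives an HPA from this object, so lag-based scaling composes with the usual HPA machinery.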

As with the service level agreements, we could now set a more relaxed target for our resource metrics. Instead of setting a resource target of, say, 70% CPU utilization, in KEDA the targets, or guardrails, are assigned to consumer lag. This means CPU utilization can reach 80% or higher, which helps achieve higher resource efficiency. KEDA can also incorporate complex scaling rules, and it integrates seamlessly with the Datadog monitoring stack.

By utilizing KEDA's built-in Datadog scaler, we seamlessly integrated our custom metrics from the Datadog dashboard into the autoscaling process. Combining KEDA with HPA resolved the challenges concerning CPU, memory utilization, and cost-effectiveness. However, in scenarios where KEDA would reach its maximum scaling limits, we would manually resize deployments.
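With KEDA's Datadog scaler, the trigger is a Datadog query instead of a direct Kafka connection. A hedged sketch, assuming a `TriggerAuthentication` named `datadog-auth` holds the Datadog API and app keys; the metric name and tags are hypothetical:

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: kafka-consumer-lag-scaler   # hypothetical name
spec:
  scaleTargetRef:
    name: kafka-consumer
  triggers:
    - type: datadog
      metadata:
        # hypothetical custom metric exported to Datadog
        query: "avg:kafka.consumer_lag{consumer_group:order-aggregator}"
        queryValue: "1000"          # target lag per replica
      authenticationRef:
        name: datadog-auth          # TriggerAuthentication with Datadog credentials
```

This indirection lets any dashboarded custom metric, not just raw Kafka lag, drive the autoscaling decision.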

This means that resource configuration details were not fully abstracted from the end user. To solve this, a custom resource definition (CRD) operator called the resource advisor is used. It addresses the vertical dimension of the deployment strategy: the allocation of resources within a pipeline deployment. The resource advisor runs during a pipeline's off-peak hours, when traffic is low, to avoid conflicting with scaling done by KEDA; this scheduling is handled by a component called the scheduler.

The advisor has three components:

  • Data collector: gathers historical data to understand the application’s processing and resource requirements before its pod size is changed
  • Recommender: makes recommendations based on the data collected by the data collector
  • Updater: applies the recommended resource provisioning to the deployment
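Grab's actual CRD is internal and not public, but a hypothetical custom resource illustrating the shape such an operator might watch could look like this (every field name here is an assumption for illustration only):

```yaml
# Hypothetical custom resource; not Grab's real API.
apiVersion: advisor.example.com/v1alpha1
kind: ResourceAdvisor
metadata:
  name: order-aggregator-advisor
spec:
  targetRef:
    kind: Deployment
    name: kafka-consumer
  schedule: "0 3 * * *"       # run during off-peak hours, when KEDA scaling is quiet
  lookbackDays: 7             # history window for the data collector
  guardrails:                 # hard bounds the recommender may not exceed
    maxCPU: "4"
    maxMemory: 8Gi
```

The key design point the sketch captures is the split of responsibilities: the collector reads history, the recommender computes sizes within guardrails, and the updater applies them on a schedule that avoids fighting KEDA.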

By scaling vertically, the resource advisor ensures optimal allocation of resources for the pipelines, meaning they can smoothly adapt to organic changes in traffic over time. The resource advisor integrates with KEDA through a defined workflow. First, horizontal scaling is still managed by KEDA at all times, ensuring peak resource utilization based on traffic throughput.

Next, the resource advisor is triggered during off-peak hours based on a schedule. Data is then collected from the custom metrics and the deployment's current resource metadata. Certain conditions must be defined for the resource advisor to make a recommendation.

For instance, in use cases where a pipeline was throttled even after KEDA had scaled the deployment to the maximum number of pods, the resulting consumer lag would affect the pipeline’s service level agreement. To stabilize the pipeline, the obvious choice is to allocate more resources, so the resource advisor recommends a higher resource allocation for the deployment. Finally, the recommendation is applied, and the changes are made to the deployment, resizing its pods.

Advantages of Using KEDA

Let’s discuss what KEDA provides:

  • By switching from VPA to HPA with KEDA and a resource advisor (a CRD operator), efficiency improved: the daily average platform-wide CPU utilization increased from 15% to 57%. This means the pipelines perform better even during off-peak hours.
  • Secondly, the new architecture has reduced the daily cost by 55% through efficient CPU utilization.
  • The advisor ensures the continuous delivery of the platform SLA and prevents data loss.
  • KEDA provides the abstraction of resource configuration for platform users. The automatic resource advisor lets platform users focus on their business logic without worrying about tuning resources manually.

Future Improvements

Future developments are going to focus on two key areas:

  • The first area for improvement is uneven load distribution among pods. KEDA and the resource advisor don't guarantee an equal sharing of load among pods; their main job is to act as safeguards, making sure systems don't run into problems like consumer lag that might violate the service level agreement. To address uneven load caused by uneven partition assignment among pods, preventive measures like rate limiting will be implemented at the application level.
  • In order to deal with complex business requirements, advanced scaling based on complex custom metrics will be introduced.


Grab has achieved significant improvements in optimizing Kafka consumers on Kubernetes using a combination of techniques. It became a more optimized solution with resource efficiency, cost-effectiveness, and SLA compliance, thanks to KEDA.
