Benjamin Kušen
March 2, 2024

How to Deal with Memory Pressure in Redis

This post will explore Redis, evaluate its applicability for various uses, and concentrate on the actions required to resolve memory pressure problems.

Redis, the open-source in-memory data structure store, has gained wide popularity. It is used extensively for a variety of use cases, from caching to real-time analytics and pub/sub messaging, because of its fast and efficient performance.

Because Redis manages massive amounts of data in memory, developers and system administrators must take proactive measures against memory problems to maintain seamless operations and peak performance.

We will emphasize workable methods to maximize memory utilization and lessen the effect of memory pressure on Redis instances, taking inspiration from an actual production scenario. We will mainly concentrate on managing the Redis cluster that is offered as a managed service in AWS because of the configuration options and basic tools the cloud offers.

Overview and Advantages of Redis

Redis is short for “Remote Dictionary Server”. It is a data store built for speed, and it is used in many applications today because it can store and retrieve many different forms of data.

The key feature of Redis is that it keeps all data in memory, which makes its operations very fast. Traditional databases keep data on comparatively slow disk drives; because Redis holds data in RAM, applications can retrieve information without the latency of disk I/O.

Redis supports numerous data structures, such as strings, lists, sets, sorted sets, hashes, and more. These data structures are more than simple containers: they come with powerful, specialized operations that let developers perform complex calculations and data manipulations directly in Redis. For example, Redis supports rank-based sorting, union operations, and set intersections, which makes it useful for a wide variety of scenarios.

The simplicity of Redis is one of its main advantages. Because the Redis API is simple and intuitive, developers can quickly understand its principles and take advantage of its features. Redis is easy to integrate into existing applications; it can serve as the main data store for particular use cases or as a cache layer to boost performance.

Redis's flexibility goes beyond data structures and its in-memory architecture. Persistence, replication, and pub/sub messaging are among the built-in capabilities that make it a complete solution for a range of application needs.

Redis replication enables fault tolerance and high availability by letting you create replicas that can take over if the primary node fails. Persistence features like append-only file (AOF) mode and snapshotting make data stored in Redis durable, guaranteeing that it survives restarts and failures.


Redis is also very good at pub/sub messaging, which enables real-time communication between various distributed system components. It is an effective tool for developing scalable and responsive systems because of its publish/subscribe architecture, which permits real-time event processing and message broadcasting.

Overall, Redis provides speed, ease of use, and versatility that make it an appealing option for a variety of use cases. Developers can take full advantage of Redis's in-memory design, rich set of data structures, and built-in capabilities to improve the speed, responsiveness, and scalability of their applications.

Redis Configuration and Hosting Options

On-Premises Redis hosting refers to the process of setting up and managing Redis instances within your infrastructure. This approach grants full control over the Redis environment but demands a significant initial investment in hardware, networking, and ongoing maintenance. You can customize hardware specifications to meet your exact needs, maintain direct oversight of security measures, and ensure compliance with your organization’s standards.

However, scaling Redis within an on-premises setup necessitates careful planning and provisioning of additional hardware resources. Furthermore, on-premises hosting places the responsibility for infrastructure setup, maintenance, backups, and monitoring squarely on your organization’s IT team. This demands expertise and resources for continuous management and support.

On the other hand, when it comes to cloud solutions, the AWS ElastiCache service provides a practical choice for hosting Redis within the AWS environment. This managed service makes Redis cluster deployment and maintenance in the cloud considerably easier.

With ElastiCache, the intricacies of infrastructure setup, configuration, scaling, and maintenance are handled for you, giving you more time and resources to focus on developing applications rather than managing operations. AWS ensures a stable and secure Redis environment by taking care of duties like software upgrades, backups, and patching.

Furthermore, ElastiCache simplifies the process of scaling Redis clusters, whether vertically (by increasing the memory of individual nodes) or horizontally (by adding or removing nodes), to accommodate fluctuating workload requirements. Additionally, ElastiCache offers automatic failover and replication features, guaranteeing the high availability of Redis clusters.

These features include Multi-AZ replication, which replicates data across availability zones to withstand zone failures. Furthermore, ElastiCache has a pay-as-you-go pricing structure, so you only pay for the resources that you utilize.

It's crucial to recognize that while AWS ElastiCache provides convenience and scalability, it does create dependencies on the AWS platform and involves ongoing operational expenses. Organizations need to carefully assess their particular needs, cost implications, and level of expertise when deciding between on-premises hosting or opting for a managed service like AWS ElastiCache for Redis.

Memory Considerations

Because Redis holds its dataset in memory, memory constraints can shape an application's data management policy, so it is important to consider the memory aspect when using Redis.

Storage Capacity Limitation

Redis offers unmatched speed and performance by storing data primarily in memory. But this also means that the amount of RAM available on the hosting machine limits how much data you can store.

Unlike disk-based databases, which can extend storage capacity with relative ease, Redis's memory capacity is tied directly to the physical or virtual machine it runs on. To ensure that you have enough memory to hold your dataset, it is crucial to properly estimate how much data you will need to store.

To maximize memory use, consider techniques such as data compression, data partitioning, or on-the-fly data processing with Redis capabilities like Streams and RedisGears. By reducing memory utilization, these techniques let you store more data in the memory you have available.

Non-Durability

In contrast to conventional databases, Redis does not guarantee data permanence on disk: by default, it prioritizes speed and performance over data durability. Redis does offer persistence options like append-only file (AOF) mode and snapshotting, but they come with additional disk I/O costs that can negatively impact performance.
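For reference, both mechanisms are enabled in redis.conf on a self-managed instance (ElastiCache exposes equivalent settings through parameter groups). A minimal sketch:

```bash
# redis.conf - persistence options (minimal sketch)
appendonly yes          # enable the append-only file (AOF)
appendfsync everysec    # fsync once per second: small data-loss window, lower I/O cost
save 900 1              # RDB snapshot if at least 1 key changed in 900 seconds
save 300 10             # ...or at least 10 keys changed in 300 seconds
```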

Moreover, AOF might not be sufficient to protect against every possible failure scenario. For example, in AWS, ElastiCache replaces a node that dies owing to a hardware issue on the underlying physical server with a new node on a separate server, making the AOF file inaccessible for data recovery. Redis restarts with a cold cache as a result. It is sometimes recommended to have one or more read replicas of Redis spread across many cloud availability zones to reduce such concerns.

Any data that is only kept in memory runs the risk of being lost if Redis restarts or has a failure. To avoid data loss, it is therefore essential to set up efficient backup and recovery solutions. To preserve data redundancy across several Redis instances, this may entail taking regular snapshots, using Redis AOF persistence mode, or putting up replication and high-availability configurations.

Some firms choose a hybrid strategy, combining Redis with other databases, to address the durability issue. Important information, for example, can be kept in a durable database for long-term preservation and in Redis for quick retrieval.

Case Study

Project Setup

Having established a foundational comprehension of Redis and its memory implications, let's explore an actual situation where memory constraints significantly affect the operational efficiency of a business. We'll examine the initial design errors and the measures implemented to address and alleviate these issues.

To set the stage, let's examine a sizable eCommerce enterprise running an online shopping platform, offering a diverse range of consumer products like shoes and home decor. Our scrutiny will center on the product management facet of the operation.

This encompasses a software system tasked with retrieving details about items, such as prices and available configurations (like sizes), from supplier systems. The software applies essential data enrichment, aggregation, and transformation business logic to furnish updated data through an API. This API, in turn, enables the shopping frontend to present these items to end-users.

From a technological perspective, this segment of the system employs an event-driven architecture comprising the following components:

1. Apache Kafka: Functions as an intermediate data repository, recording all state updates of items, including intermediate modifications made by the system.

2. Kafka Streams: Offers capabilities for data aggregation, processing, and transformation within the system.

3. Kafka Connect: Facilitates the seamless transfer of data from Kafka topics to a storage system.

4. AWS ElastiCache Redis: Operates as the storage solution for processed item data, serving as a repository for prepared information.

5. Java Spring Boot application: Functions as an API, enabling the shopping frontend to access and retrieve item data efficiently.

The schematic representation of the system is depicted in the diagram below. 

[Diagram: Redis project setup schema]

Analyzing this scenario enables us to pinpoint the challenges encountered and delve into the subsequent measures taken to tackle memory pressure issues.

In the context of the business's product framework, Redis is expected to function as resilient storage relied upon by the Shopping Frontend. However, no Redis instance can guarantee this on its own, due to its specific memory management characteristics.

Additionally, the Redis configuration in this case comprises a non-cluster instance with a read replica situated in a distinct availability zone. This implies that scaling up the storage can only be achieved through vertical scaling of the nodes, involving the addition of more memory capacity.

In the specific context outlined, it's noteworthy that the data retention period in Kafka was set to 1 month, a substantial duration allowing for data replay to the storage in the event of any unforeseen issues.
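For context, a retention period of roughly one month corresponds to a per-topic setting like the following, sketched with standard Kafka tooling (the broker address and topic name are placeholders):

```bash
# 30 days expressed in milliseconds
./bin/kafka-configs.sh --bootstrap-server <broker-host>:9092 \
  --entity-type topics --entity-name topicName \
  --alter --add-config retention.ms=2592000000
```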

Events that Occurred

The initial setback occurred during the peak season, coinciding with Easter, when all the items in the shop went offline, significantly affecting the business. A subsequent investigation by the tech department unveiled that the Redis storage had become empty, prompting the API to implement a fallback strategy, marking unavailable items as "not in stock."

Upon closer examination, it was discovered that memory usage had surged from 62% to 100% within a few hours, causing Redis to cease accepting write commands. Adding to the complexity, the primary node, responsible for write operations, had restarted during a service disruption, resulting in the complete loss of data.

Despite having a read replica in another availability zone, the situation remained unresolved. When the primary node became unresponsive and initiated a restart, the replica assumed the role of the new primary node. However, continuous data synchronization led to the replica's memory reaching full capacity, rendering it inaccessible.

Consequently, the replica underwent a restart and was replaced with a newly sanitized node. Eventually, both nodes resumed operation, but they were completely devoid of data, disrupting the shop's normal operations.

It's crucial to note that AWS doesn't guarantee the automatic restart of ElastiCache nodes once memory usage capacity is reached. While such events are often related, they don't necessarily coincide. Ordinarily, when memory capacity is reached, Redis halts accepting additional writes to prevent further memory usage. However, in this unfortunate instance, both nodes experienced restarts, with AWS Support not obligated to disclose the exact reasons.

Nevertheless, the consequence was the complete depletion of storage, resulting in significant disruption to shop operations.

To rectify the issue, the tech department had to replay all item data from Kafka topics to Redis, involving the following steps:

1. Halting Kafka Connectors.

2. Deleting Kafka Connectors.

3. Resetting the offsets of corresponding Consumer Groups by executing the command:

```bash
./bin/kafka-consumer-groups.sh --bootstrap-server <broker-host>:9092 \
  --group connectorName --reset-offsets --to-earliest \
  --execute --topic topicName
```

4. Adding the connectors and their configuration back.

5. Restarting the connectors.

This entire process led to several hours of downtime, causing a significant business impact. Following the restoration of the system to its normal state, the team had to ensure that such an incident would not recur.

Alerting and Monitoring

To address the issue and enhance system observability, the team prioritized implementing robust monitoring and alerting mechanisms to detect and preemptively respond to potential issues before they impact business operations.

With Redis managed as a service in AWS, this task was relatively straightforward, leveraging available metrics such as DatabaseMemoryUsagePercentage to track storage memory usage.

Here's how they improved their monitoring setup:

  • Creating Amazon CloudWatch Alarm: CloudWatch, AWS's monitoring service, allowed the team to set up alarms triggered when the DatabaseMemoryUsagePercentage metric surpasses a specified threshold. This was achieved through the CloudWatch service's Alarms screen, where the desired metric, threshold, and actions (e.g., email notifications or Lambda function triggers) were configured; a CLI sketch follows this list.
  • Enabling Enhanced Monitoring: By enabling "Enhanced Monitoring" for the ElastiCache cluster via cluster modification options, the DatabaseMemoryUsagePercentage metric became available for tracking within CloudWatch.
  • Configuring Additional Actions: The team could configure additional actions to be executed when the alarm triggers, providing flexibility in response procedures.
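As a sketch, such an alarm can also be created from the AWS CLI; the cluster ID, threshold, and SNS topic ARN below are placeholders:

```bash
aws cloudwatch put-metric-alarm \
  --alarm-name redis-memory-pressure \
  --namespace AWS/ElastiCache \
  --metric-name DatabaseMemoryUsagePercentage \
  --dimensions Name=CacheClusterId,Value=my-redis-cluster-001 \
  --statistic Average --period 60 --evaluation-periods 3 \
  --threshold 80 --comparison-operator GreaterThanThreshold \
  --alarm-actions arn:aws:sns:eu-west-1:123456789012:redis-alerts
```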

To stay informed about critical events affecting the Redis storage, such as failovers or node restarts, the team set up notifications using AWS Simple Notification Service (SNS).

Here's how they accomplished this:

  • Setting up an SNS Topic: An SNS topic was established to capture events from the ElastiCache service (a CLI sketch follows this list).
  • Enabling SNS Notifications: Within the ElastiCache service settings, SNS notifications were enabled, with the new topic designated as the destination.
  • Configuring Actions for Events: The team specified actions to be taken once an event was published to the SNS topic, similar to configuring CloudWatch alarms (e.g., email notifications or Lambda triggers).
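A rough CLI equivalent of these steps (the topic name, ARN, and replication group ID are placeholders):

```bash
# Create the topic, then point ElastiCache event notifications at it
aws sns create-topic --name elasticache-events
aws elasticache modify-replication-group \
  --replication-group-id my-redis-cluster \
  --notification-topic-arn arn:aws:sns:eu-west-1:123456789012:elasticache-events \
  --apply-immediately
```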

For organizations managing Redis storage on-premises, similar monitoring setups could be achieved using third-party solutions like Datadog to configure monitoring for internal resources in a manner analogous to AWS.

Logic Writing and Data Structure Optimization

To maintain the operational excellence of your Redis cluster, it's crucial to possess a comprehensive understanding of your system's inner workings, including its processes and integrations with the target data store.

Ideally, conceptual views in the form of diagrams or a detailed list should outline which applications interact with your data store, specifying the read-and-write relationships. In the context of a key-value storage system like Redis, knowledge of the key formats used by different systems for reading and writing is invaluable.

By gaining insights into integrated applications and understanding key patterns, a thorough analysis of the Redis cluster becomes possible. The team, for instance, executed scan queries to count records based on key patterns. This led to the identification of an outdated Kafka Streams application designed solely for data analytics, which was no longer relevant to the business.

Despite being obsolete, this application was generating unnecessary data in the order of hundreds of gigabytes. Upon discovery, the team promptly removed the obsolete application, cleaning up the Redis store and eliminating the burden of unused records.
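For reference, counting records by key pattern can be done without blocking the server by using redis-cli's cursor-based scan (the host and pattern below are hypothetical):

```bash
# SCAN iterates incrementally, unlike KEYS, which blocks the server on large datasets
redis-cli -h <redis-host> --scan --pattern 'analytics:*' | wc -l
```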

In addition to reviewing system integrations and identifying redundant applications, optimizing data structures plays a pivotal role in enhancing memory usage efficiency.

For example, employing Redis Hashes instead of individual keys for storing object fields can be more memory-efficient. The following commands and results illustrate the creation of a Hash (note that HMSET has since been deprecated in favor of HSET, which also accepts multiple field-value pairs):

```
redis 127.0.0.1:6379> HMSET tutorial name "redis tutorial" description "redis basic commands for hashing" products 230 stocks 25000
OK
redis 127.0.0.1:6379> HGETALL tutorial
1) "name"
2) "redis tutorial"
3) "description"
4) "redis basic commands for hashing"
5) "products"
6) "230"
7) "stocks"
8) "25000"
```

In essence, utilizing Redis Hashes allows you to consolidate values associated with a single key, forming a collection of field names and corresponding field values for a particular entity. This data structure proves efficient for storing maps.
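When comparing layouts on real data, the memory consumed by a single key, hash or otherwise, can be inspected directly:

```bash
# Returns the number of bytes used by the key, including internal overhead
redis-cli MEMORY USAGE tutorial
```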

Additionally, leveraging Redis' data compression capabilities, such as Redis Modules like RedisGears, enables the compression of data before storage. This contributes to a reduced overall memory footprint in the Redis cluster.

Furthermore, taking advantage of the TTL (Time to Live) attribute for records can be advantageous when perpetual data storage is unnecessary. By scheduling the automatic removal of records after a specified period, you can effectively manage memory consumption, ensuring a lean and optimized Redis cluster.
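For instance (key names and durations are illustrative):

```bash
# Write a value that expires automatically after 24 hours
redis-cli SET item:42:price "19.99" EX 86400
# Add an expiry to an existing key
redis-cli EXPIRE item:42:stock 86400
# Inspect the remaining time to live, in seconds
redis-cli TTL item:42:price
```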

To summarize, ensuring the operational excellence of your Redis cluster requires a deep understanding of its workings, encompassing processes, integrations, and key patterns. Thorough analyses, including identifying unused applications and optimizing data structures, contribute to enhanced memory usage efficiency. Strategies like data compression and TTL attributes further optimize memory usage and overall performance.

Vertical Scaling

As your business grows and data storage requirements increase, scaling up the memory capacity of your Redis cluster becomes essential. Leveraging ElastiCache's vertical scaling capabilities involves modifying the instance type in your AWS environment and facilitating an online upgrade to keep the cluster operational during the process. Alternatively, if Redis is hosted on-premises, migrating to a machine with a larger memory capacity is an option.
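As a sketch, an ElastiCache node type change can be requested from the AWS CLI, and the replication group stays online while nodes are replaced (the group ID and node type are placeholders):

```bash
aws elasticache modify-replication-group \
  --replication-group-id my-redis-cluster \
  --cache-node-type cache.r6g.xlarge \
  --apply-immediately
```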

Vertical scaling enables your Redis deployment to handle larger data volumes, alleviating memory pressure, particularly in rapidly expanding scenarios. However, it's important to note that vertical scaling requires careful planning and time. In instances of sudden memory pressure peaks, the scaling-up process might not provide immediate relief.

Anticipating such scenarios by closely monitoring memory usage and considering alternative strategies, such as horizontal scaling or implementing caching mechanisms, is crucial for maintaining smooth operation during peak periods.

In conclusion, scaling up the memory capacity of your Redis cluster is a significant step in meeting the growing demands of your business. Whether utilizing ElastiCache's vertical scaling or adopting a larger capacity machine, careful planning, ongoing memory monitoring, and preparation for sudden spikes in memory pressure are essential for ensuring optimal performance.

Backups

Redis backups are created using ElastiCache and kept in AWS S3 as snapshots, as illustrated below:

[Diagram: Redis backup schema]
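For reference, a manual snapshot of a replication group can be taken from the AWS CLI (the identifiers are placeholders); automatic snapshots are governed by the cluster's snapshot retention settings:

```bash
aws elasticache create-snapshot \
  --replication-group-id my-redis-cluster \
  --snapshot-name pre-migration-backup
```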

While Redis backups stored in AWS S3 through ElastiCache provide a crucial recovery mechanism and serve as a safety net, it's important to understand that they are not specifically designed to address memory pressure concerns in an active storage environment.

These backups act more as a precautionary measure, offering a recovery point for situations where major business impacts are acceptable, and memory pressure is not the primary concern.

When restoring a Redis cluster from a backup, AWS creates a new cluster using the data from the backup, rather than directly seeding the backup data into the existing cluster. This can lead to additional configuration changes, as the restored cluster may require adjustments for compatibility and consistency with your application.

Although backups are a valuable part of an overall Redis storage strategy, they should not be solely relied upon to tackle memory pressure issues. It's advisable to explore alternative approaches, such as vertical or horizontal scaling, to effectively handle increased data volumes and address memory constraints in real-time.

In summary, ElastiCache's Redis backups in AWS S3 serve as a recovery mechanism and safety net but are not tailored for resolving memory pressure concerns in an active storage environment. Recognizing their limitations and considering alternative strategies, like scaling options, is essential for ensuring the optimal performance and reliability of your Redis cluster.

Data Partitioning and Horizontal Scaling

For workloads exhibiting a skewed data distribution pattern, implementing data partitioning becomes valuable to evenly distribute data across multiple Redis shards. This helps distribute the memory load and prevent excessive memory usage on specific shards.

In AWS, this can be achieved by creating a Redis storage with Cluster Mode Enabled or by migrating existing storage from Cluster Mode Disabled to Cluster Mode Enabled, along with necessary integration preparations, such as changing the SDK.
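For illustration, a cluster-mode-enabled replication group with three shards might be created as follows (all values are placeholders; note that the parameter group must be a cluster-on variant):

```bash
aws elasticache create-replication-group \
  --replication-group-id my-redis-sharded \
  --replication-group-description "Item store, cluster mode enabled" \
  --engine redis \
  --cache-node-type cache.r6g.large \
  --num-node-groups 3 \
  --replicas-per-node-group 1 \
  --cache-parameter-group-name default.redis7.cluster.on \
  --automatic-failover-enabled
```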

Implementing data partitioning and horizontal scaling provides a dynamic solution to manage memory usage efficiently and enhance the overall performance of your Redis cluster, especially when dealing with varying workloads and data distribution patterns.

The following illustration provided by AWS shows the difference between non-clustered and clustered mode: 

[Diagram: AWS illustration of cluster mode disabled vs. cluster mode enabled]

While implementing data partitioning for horizontal scaling in AWS, it's essential to be mindful of the considerations and complexities associated with managing partitioned data.

This involves configuring key slot partitions to specify which shards would host specific ranges of keys, either evenly or in a custom distribution. Additionally, ensuring that integrated applications support the cluster protocol for Redis is crucial.

Once your data is partitioned across shards, you can then consider horizontal scaling by adding more shards to your Redis cluster. Horizontal scaling distributes data across multiple nodes, increasing overall memory capacity and performance.

ElastiCache makes it seamless to add or remove shards without disrupting the application. Horizontal scaling is generally a simpler and quicker process than vertical scaling, and it can occur automatically without service disruptions, although it may lead to some performance degradation during scaling.
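As an example, online resharding of an existing cluster-mode-enabled group can be triggered from the CLI (the group ID is a placeholder):

```bash
aws elasticache modify-replication-group-shard-configuration \
  --replication-group-id my-redis-sharded \
  --node-group-count 5 \
  --apply-immediately
```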

Availability

While availability may not be directly related to memory pressure, it remains a crucial concern for mission-critical storage systems. To enhance the availability of your Redis cluster, consider implementing Read Replicas in an AWS environment. Distributing the overall load across multiple nodes, including the primary node for writing commands and multiple Read Replicas for handling read loads, helps balance the workload and increase fault tolerance.

Enabling data partitioning further distributes the write load across multiple nodes or shards, alleviating potential bottlenecks and improving performance.

To enhance resiliency, expand your Redis cluster across multiple Availability Zones (AZs) within the same AWS region. ElastiCache's Multi-AZ deployments replicate data to standby replicas in different AZs (replication in ElastiCache for Redis is asynchronous), increasing data availability and durability. Distributing the memory load across multiple AZs reduces pressure on individual nodes, leading to improved performance and reliability.

Implementing these availability-enhancing measures also prepares your Redis cluster for failover scenarios. In case of an issue with the primary node, a Read Replica can seamlessly replace it with minimal downtime. AWS handles provisioning another Read Replica to compensate for the replacement, ensuring continuous operations and minimizing the impact on your Redis cluster's availability.
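As an illustration, the replica count of an existing replication group can be adjusted without downtime (the group ID is a placeholder):

```bash
aws elasticache increase-replica-count \
  --replication-group-id my-redis-cluster \
  --new-replica-count 2 \
  --apply-immediately
```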

The difference in fault tolerance and reliability is clearly visible. Below is the most basic configuration: a Redis primary node with a Read Replica deployed in the same AZ.

[Diagram: a primary node with a Read Replica in the same AZ]

And here's a representation of a more sophisticated deployment: a primary node and two read replicas spread across three availability zones:

[Diagram: a primary node and two Read Replicas across three AZs]

Having a read replica in a different AZ significantly increases the cluster's resilience. Because replication across the cluster's nodes is asynchronous, the impact on write speed is modest, although a failover can lose the most recent writes that had not yet been replicated.

Utilizing Technology Based on Your Needs

Understanding your business processes and requirements is essential for effectively evaluating and selecting the right storage solution. By aligning your solution with specific requirements and assessing compatibility, you can ensure a proper fit between your storage solution and the unique needs of your business.

For example, in the eCommerce scenario, the primary requirements for the storage solution included data durability and read performance. While Redis excels in high-performance capabilities, concerns arise regarding data durability and resilience.

Even in a cloud environment, Redis nodes can fail, potentially leading to data loss in the Append-Only File. Although data partitioning and Multi-AZ deployments can mitigate this risk to some extent, there is still a possibility of business impact during failover or the loss of certain data segments.

In cases where data durability and resilience are critical, it's advisable to consider alternative solutions that offer both performance and confirmed data durability. AWS DynamoDB or MongoDB, for instance, are key-value and document-oriented data stores that prioritize data durability and resilience, providing more reliable options for business-critical applications.

By reevaluating your requirements and considering alternative storage solutions, you can ensure a better alignment between your business needs and the capabilities of the chosen technology. This assessment helps mitigate the risk of data loss or service disruption, ultimately supporting the long-term success and reliability of your storage infrastructure.

Lessons Learned

Summarizing the insights gathered, the following key recommendations emerge for effectively managing memory pressure in your Redis cluster:

  • Comprehensive Observability. Ensure thorough observability with monitoring, metrics access, and alerting mechanisms, enabling proactive measures to address memory pressure issues.
  • Understanding Business Processes. Familiarize yourself with the business processes supported by your system, understand the system context, and be aware of integrations with your storage solution to identify anomalies contributing to memory pressure.
  • Informed Technology Selection. Make informed decisions when selecting technologies based on specific business requirements. If risks are identified, consider migrating to alternative solutions that better align with your needs.
  • Cloud-Based Solutions. Explore cloud-based storage solutions if compatible with your business and policies. Leveraging cloud services shares storage management responsibility, provides automated tooling, and often leads to cost savings.
  • Effective Use of Read Replicas. Utilize Read Replicas to offload read traffic from the primary node and to establish a failover path for continuity in the event of a primary node failure.
  • Optimizing Data Structures. Optimize data structures with techniques like data compression, choosing appropriate record types, and utilizing the TTL attribute to manage record expiration when long-term storage is unnecessary.
  • Data Partitioning and Horizontal Scaling. Evaluate data partitioning and automatic horizontal scaling options to remove bottlenecks, facilitate seamless scaling, and distribute the workload. Be prepared for adjustments to integrations and configurations related to sharding.
  • Vertical Scaling Considerations. Consider vertical scaling if necessary, understanding that it takes time and may not provide immediate relief during sudden memory pressure spikes.
  • Multi-AZ Deployments for Resiliency. Implement Multi-AZ deployments to enhance the resiliency of your Redis cluster. Span your deployment across multiple Availability Zones to mitigate zone failures and improve overall system reliability.

By implementing these strategies and leveraging cloud platform capabilities, you can effectively manage memory pressure in your Redis deployment. This approach ensures optimal performance and reliability for your applications while providing scalability and fault tolerance to accommodate growing demands.
