2020 won’t go down in the books as one of the best years of the decade, but we can say for certain that at least one good thing came out of it: this roundup of the four best practices for Kafka optimization. This blog post was originally published in May, and it quickly became a favorite among our readers. Since the best practices still hold today, we wanted to highlight them once again to end the year with a bang. Give it a read and enjoy.
Apache Kafka is great. It allows for the creation of real-time, high-throughput, low-latency data streams that scale easily. Beyond raw performance, Kafka offers other benefits, such as resilience to machine or node failures inside the cluster and persistence of both data and messages on the cluster. Optimizing the performance of your Kafka deployment should be a top priority.
But optimization is a complex exercise. Optimizing an Apache Kafka deployment can be a challenge because its distributed architecture has many layers, and each layer has many parameters that can be tuned.
For example, a high-throughput publish-subscribe (pub/sub) pattern with automated data redundancy is normally a good thing. But when your consumers struggle to keep up with the data stream, or fail to read messages because those messages are deleted by the retention policy long before the consumers get to them, work needs to be done to support the performance needs of the consuming applications.
Best Practices for Kafka Optimization
Kafka optimization is a broad topic that can be very deep and granular, but here are some key best practices to get you started:
1. Upgrade to the latest version of Kafka.
This might sound blindingly obvious, but you’d be surprised how many people run older versions of Kafka. One of the simplest Kafka optimization moves is to upgrade to the latest version of the platform. Check whether your clusters and clients are running older versions of Kafka (0.10 or older). If they are, upgrade them immediately.
Older versions such as 0.8.x rely on Apache ZooKeeper to coordinate consumer groups; since 0.9, consumer groups are coordinated by the Kafka brokers themselves. Running an outdated version of Kafka can lead to long-running rebalances as well as rebalance algorithm failures.
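Checking for "0.10 or older" is easy to get wrong if version strings are compared as text, because "0.10" sorts before "0.8" lexically. Below is a minimal Python sketch that compares versions numerically instead. The helper names are hypothetical, and how you obtain the version string from your brokers or clients is left out.

```python
import re
from typing import Tuple


def parse_version(version: str) -> Tuple[int, ...]:
    """Convert a version string such as "0.10.2.1" into (0, 10, 2, 1).

    Only the leading dotted numeric part is used, so a suffix such as
    "-SNAPSHOT" is ignored.
    """
    match = re.match(r"\d+(?:\.\d+)*", version.strip())
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return tuple(int(part) for part in match.group(0).split("."))


def needs_upgrade(version: str) -> bool:
    """True for Kafka 0.10.x or older, i.e. anything below 0.11."""
    return parse_version(version) < (0, 11)


# Lexical string comparison gets the order wrong: "1" sorts before "8".
print("0.10.2" < "0.8.2")          # True, even though 0.10 is newer
print(needs_upgrade("0.8.2.2"))    # True
print(needs_upgrade("0.10.2.1"))   # True
print(needs_upgrade("2.6.0"))      # False
```

Comparing tuples of integers orders versions the way people read them, so 0.10 correctly sorts after 0.8.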
2. Understand data throughput rates.
Optimizing your Apache Kafka deployment is an exercise in optimizing the layers of the platform stack. Partitions are the storage layer on which throughput performance is based. The data rate per partition is the average message size multiplied by the number of messages per second. Put simply, it is the rate at which data travels through the partition. Desired throughput rates dictate the target architecture of the partitions.
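To make the formula concrete, here is a small Python sketch. The function names and figures are hypothetical, and the partition-count helper applies a common rule of thumb (divide the target throughput by what one partition can sustain) rather than anything prescribed above; measure your own per-partition throughput before relying on it.

```python
import math


def data_rate_per_partition(avg_message_bytes: float, messages_per_second: float) -> float:
    """Bytes per second through one partition: average message size x messages/s."""
    return avg_message_bytes * messages_per_second


def min_partitions(target_bytes_per_second: float, per_partition_capacity: float) -> int:
    """Smallest partition count whose combined capacity covers the target rate.

    Rule-of-thumb assumption: aggregate throughput grows roughly linearly with
    partition count. Real clusters deviate from this, so treat the result as a
    lower bound.
    """
    if per_partition_capacity <= 0:
        raise ValueError("per-partition capacity must be positive")
    return math.ceil(target_bytes_per_second / per_partition_capacity)


# Hypothetical figures: 1 KiB messages at 5,000 messages/s on one partition.
rate = data_rate_per_partition(1024, 5000)
print(rate)  # 5120000 bytes/s, about 5.1 MB/s

# If one partition sustains that rate, a 50 MB/s target needs:
print(min_partitions(50_000_000, rate))  # 10
```

Rounding up matters here: 50 MB/s divided by about 5.1 MB/s is roughly 9.8, and nine partitions would fall short of the target.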
3. Stick to random partitioning when writing to topics, unless architectural demands dictate otherwise.
Solutions architects would prefer each partition to support similar amounts of data and similar throughput rates. In reality, data rates vary over time, as do the raw numbers of producers and consumers.
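To illustrate why an even spread matters, the Python sketch below counts how many messages land on each partition under two hypothetical strategies: hashing a skewed key (half of all messages share one "hot" key) versus assigning each message to a random partition. CRC32 stands in for a key-hash partitioner; Kafka’s default partitioner hashes keys with murmur2, so exact counts differ, but a hot key lands on a single partition either way. The workload and constants are made up for illustration.

```python
import random
import zlib
from collections import Counter
from typing import Iterable, List

NUM_PARTITIONS = 6
NUM_MESSAGES = 60_000


def keyed_partition(key: str, num_partitions: int) -> int:
    # Stand-in for a key-hash partitioner (Kafka's default uses murmur2, not CRC32).
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def skewed_keys(n: int, rng: random.Random) -> List[str]:
    # Hypothetical workload: about half of all messages share one "hot" key.
    return ["hot" if rng.random() < 0.5 else f"user-{rng.randrange(1000)}" for _ in range(n)]


def partition_counts(assignments: Iterable[int], num_partitions: int) -> List[int]:
    # Number of messages assigned to each partition, in partition order.
    counts = Counter(assignments)
    return [counts.get(p, 0) for p in range(num_partitions)]


rng = random.Random(42)  # fixed seed so the run is repeatable
keys = skewed_keys(NUM_MESSAGES, rng)
by_key = partition_counts((keyed_partition(k, NUM_PARTITIONS) for k in keys), NUM_PARTITIONS)
by_random = partition_counts((rng.randrange(NUM_PARTITIONS) for _ in keys), NUM_PARTITIONS)

print("key-hashed:", by_key)
print("random:    ", by_random)
```

With the keyed strategy, one partition carries at least half of the traffic, and whichever consumer reads it is the likeliest to fall behind. Random assignment keeps every partition close to 10,000 messages.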
The performance challenge presented by the variability is the potential for consumer lag, AKA consumer read rates falling behind producer write rates. As Kafka environments scale, random partitioning is an effective way to ensure you don’t introduce artificial bottlenecks unnecessarily attempting to apply s