2020 won’t go down in the books as one of the best years of the decade, but we can say for certain that at least one good thing came out of it: this roundup of the four best practices for Kafka optimization. This blog post was originally published in May, and it has quickly become a favorite among our readers. Seeing as the best practices still stand today, we wanted to highlight them once again to end the year off with a bang. Give it a read and enjoy.

Apache Kafka is great. It allows for the creation of real-time, high-throughput, low-latency data streams that are easily scalable. A well-tuned Kafka cluster also delivers other benefits, such as resistance to machine/node failure inside the cluster and persistence of both data and messages on the cluster. Performance optimization of your Kafka deployment should be a top priority.

But optimization is a complex exercise. Optimizing your Apache Kafka deployment can be a challenge because the distributed architecture has many layers, and each layer has parameters that can be tweaked.

For example: Normally, a high-throughput publish-subscribe (pub/sub) pattern with automated data redundancy is a good thing. But when your consumers struggle to keep up with your data stream, or fail to read messages because those messages are deleted (for example, by retention limits) before the consumers get to them, then work needs to be done to support the performance needs of the consuming applications.

Best Practices for Kafka Optimization

Kafka optimization is a broad topic that can be very deep and granular, but here are some key best practices to get you started:

1. Upgrade to the latest version of Kafka.

This might sound blindingly obvious, but you’d be surprised how many people use older versions of Kafka. A really simple Kafka optimization move is to upgrade and use the latest version of the platform. You have to determine if your customers are using older versions of Kafka (ver. 0.10 or older). If they are, they should upgrade immediately.

Older versions of Kafka (ver. 0.8.x) rely on Apache ZooKeeper to coordinate consumer groups. Using these outdated versions can lead to long-running rebalances as well as rebalance algorithm failures.

2. Understand data throughput rates. 

Optimizing your Apache Kafka deployment is an exercise in optimizing the layers of the platform stack. Partitions are the storage layer upon which throughput performance is based. The data-rate-per-partition is the average size of the message multiplied by the number of messages-per-second. Put simply, it is the rate at which data travels through the partition. Desired throughput rates dictate the target architecture of the partitions.
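The data-rate formula above can be worked through with a few lines of arithmetic. This is a minimal sketch, not a sizing tool: the function names are hypothetical, and the partition-count estimate assumes that every partition sustains the same rate, which real clusters rarely do.

```python
import math


def partition_data_rate(avg_message_bytes: float, messages_per_second: float) -> float:
    """Bytes per second through one partition: average size times message rate."""
    return avg_message_bytes * messages_per_second


def partitions_needed(target_bytes_per_sec: float, per_partition_bytes_per_sec: float) -> int:
    """Smallest partition count whose combined rate meets the target.

    Assumes (for illustration only) that all partitions carry the same rate.
    """
    return math.ceil(target_bytes_per_sec / per_partition_bytes_per_sec)


if __name__ == "__main__":
    # Illustrative numbers: 1 KiB messages arriving at 5,000 messages per second.
    rate = partition_data_rate(1024, 5000)
    print(f"per-partition rate: {rate:,.0f} bytes/s")
    # Illustrative target of 50 MB/s for the whole topic.
    print(f"partitions needed: {partitions_needed(50_000_000, rate)}")
```

With these illustrative numbers, each partition carries 5,120,000 bytes per second, so a 50 MB/s target works out to 10 partitions.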

3. Stick to random partitioning when writing to topics, unless architectural demands call for otherwise.

Solutions architects would prefer each partition to support similar amounts of data and similar throughput rates. In reality, data rates vary over time, as does the raw number of producers and consumers.

The performance challenge presented by this variability is the potential for consumer lag, that is, consumer read rates falling behind producer write rates. As Kafka environments scale, random partitioning is an effective way to ensure you don’t introduce artificial bottlenecks by attempting to apply static definitions to a moving performance target.
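To make consumer lag concrete, here is a rough sketch of how lag accumulates when producers write faster than consumers read. The function name and the constant-rate assumption are ours for illustration; real lag is measured from offsets and fluctuates with load.

```python
def lag_after(seconds: float, produce_rate: float, consume_rate: float,
              initial_lag: float = 0.0) -> float:
    """Messages of lag after `seconds`, assuming constant rates.

    Lag grows by (produce_rate - consume_rate) each second and cannot
    drop below zero, because a consumer cannot read unwritten messages.
    """
    return max(0.0, initial_lag + (produce_rate - consume_rate) * seconds)


if __name__ == "__main__":
    # Producers write 12,000 msg/s while consumers read 10,000 msg/s.
    print(lag_after(60, produce_rate=12_000, consume_rate=10_000))
    # A faster consumer drains an existing backlog of 50,000 messages.
    print(lag_after(60, produce_rate=10_000, consume_rate=12_000, initial_lag=50_000))
```

In the first case the consumer group falls 120,000 messages behind after one minute; in the second the backlog is fully drained within the minute.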

Partition leadership is usually the product of simple elections based on metadata maintained in ZooKeeper. Leadership election does not, however, take into account the performance of the individual partitions. There are proprietary balancers that can be leveraged depending on your Kafka distribution, but short of such tooling, random partitioning provides the most hands-off path to balanced performance.
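The balancing effect of random partitioning can be illustrated with a small simulation. This is a sketch under stated assumptions, not Kafka's partitioner: the keyed path uses CRC32 as a stand-in hash, the "hot" key that produces 80% of traffic is invented, and all function names are hypothetical.

```python
import random
import zlib
from collections import Counter


def keyed_partition(key: str, num_partitions: int) -> int:
    """Map a key to a partition with a stable hash (CRC32 as a stand-in)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def imbalance(counts: Counter, num_partitions: int) -> float:
    """Busiest partition's load divided by the ideal even share (1.0 is perfect)."""
    total = sum(counts.values())
    return max(counts.values()) / (total / num_partitions)


def simulate(num_messages: int = 10_000, num_partitions: int = 6,
             hot_share: float = 0.8, seed: int = 7) -> tuple[float, float]:
    """Return (keyed imbalance, random imbalance) for a skewed key distribution."""
    rng = random.Random(seed)
    keyed, rand = Counter(), Counter()
    for i in range(num_messages):
        # Hypothetical workload: one hot key produces most of the messages.
        key = "hot-customer" if rng.random() < hot_share else f"customer-{i}"
        keyed[keyed_partition(key, num_partitions)] += 1
        rand[rng.randrange(num_partitions)] += 1
    return imbalance(keyed, num_partitions), imbalance(rand, num_partitions)


if __name__ == "__main__":
    keyed_ratio, random_ratio = simulate()
    print(f"keyed partitioning:  busiest partition at {keyed_ratio:.2f}x the even share")
    print(f"random partitioning: busiest partition at {random_ratio:.2f}x the even share")
```

With a skewed key, the keyed strategy piles most of the traffic onto one partition, while random assignment keeps every partition close to its even share. The trade-off is that random assignment gives up per-key ordering, which is one of the architectural demands that can justify keyed partitioning.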

The takeaway? Stick to random partitioning when writing to topics, unless architectural demands call for otherwise.

4. Adjust consumer socket buffers to achieve high-speed ingest.

In newer Kafka versions (ver. 0.10.x), the parameter is receive.buffer.bytes, with 64 kB as its default. In older Kafka versions (ver. 0.8.x), the parameter is socket.receive.buffer.bytes, with 100 kB as the default.

What does this mean for Kafka optimization? For high-throughput environments, these default values are too small. This is especially true when the bandwidth-delay product of the network between the broker and the consumer is larger than that of a local area network (LAN).

If your network is running on 10 Gbps or higher and has latencies of 1 millisecond or more, you’re advised to tune your socket buffers to 8 or 16 MB. If memory is an issue, consider 1 MB.
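The bandwidth-delay product behind this advice is simple to compute. The sketch below treats the quoted latency as the round-trip time, which is our assumption, and the helper names are hypothetical. It prints a consumer properties line using the receive.buffer.bytes parameter mentioned above.

```python
def bandwidth_delay_product_bytes(bandwidth_bits_per_sec: int, round_trip_ms: float) -> float:
    """Bytes that can be in flight on the link: bandwidth times round-trip time."""
    return bandwidth_bits_per_sec * round_trip_ms / 1000 / 8


def consumer_buffer_properties(buffer_bytes: int) -> str:
    """Render the consumer socket buffer setting as a properties-file line."""
    return f"receive.buffer.bytes={buffer_bytes}\n"


if __name__ == "__main__":
    # The scenario from the text: a 10 Gbps link with 1 ms of latency.
    bdp = bandwidth_delay_product_bytes(10_000_000_000, 1)
    print(f"bandwidth-delay product: {bdp:,.0f} bytes")
    # 8 MB, the lower of the two recommended sizes.
    print(consumer_buffer_properties(8 * 1024 * 1024), end="")
```

At 10 Gbps and 1 ms, about 1.25 MB can be in flight at once, far more than a 64 kB or 100 kB default buffer can hold. That is why 1 MB is roughly the floor when memory is tight, and 8 or 16 MB leaves headroom for higher latencies.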

Explore More Ways to Optimize Kafka Performance

Optimizing your Apache Kafka deployment is an ongoing job, but these four best practices should be a solid start. The tips above are just some of the approaches you can use to improve Kafka performance. Kafka is becoming more and more popular among application developers, IT professionals, and data managers, and for good reason. Check out our other resources, which discuss at length the best practices for Kafka when applied to specific areas of application development and data management.

Already using Kafka? Monitor and improve its performance with Pepperdata Streaming Spotlight.

To recap, we recommend that you upgrade to the newest version of Kafka. It’s a minor thing but can make all the difference. Next, make sure you understand your data throughput rates. Unless architectural demands require otherwise, opt for random partitioning when you’re writing to topics. And if you want to achieve high-speed ingest, adjust your consumer socket buffers. We hope you enjoyed this best-of-2020 post highlighting the best practices we recommend for Kafka optimization.
