Data Stream Adventures – Apache Kafka

As GameSparks continues to grow, so too does the volume of events data produced by the platform. This data is at the core of the powerful analytics service which allows customers to track various metrics for their games.

With the increasing volume of data comes fresh challenges in how we ingest, process and present the data through the analytics Portal. In order to meet these challenges, the operations team have been evaluating new technologies to improve the process. In this blog, we will take a look at one of those – Apache Kafka.

What is Apache Kafka?

Apache Kafka is a distributed streaming platform which aims to offer scalable, high-throughput and low-latency handling of data streams. Kafka was originally developed at LinkedIn with the goal of unifying the various data streams they handle. LinkedIn engineers have written at length about its development – this post in particular offers a great deal of insight:

https://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying

Why Kafka?

Kafka’s scalability and performance characteristics could make it the ideal solution for piping the events data in the GameSparks platform. It offers a number of benefits over other events/message processing platforms, but the most attractive to us from an operations perspective is the horizontal capacity and performance scaling allowed by its design.

Kafka achieves this by breaking up a data stream (Topic) into a number of partitions which are distributed over a number of nodes (Brokers). Every message within a partition is identified by an offset, allowing Kafka to maintain the ordering of messages within each partition. In database terms, each partition is a commit log.
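
To make the partition and offset model concrete, here is a minimal sketch of a consumer using the Kafka Java client which prints where each message sits within its partition's log. The broker address, consumer group and Topic name ("events") are placeholders rather than our actual configuration:

    import java.time.Duration;
    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;

    public class EventsConsumer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "broker1:9092"); // hypothetical Broker address
            props.put("group.id", "analytics-ingest");      // hypothetical consumer group
            props.put("key.deserializer",
                      "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("value.deserializer",
                      "org.apache.kafka.common.serialization.StringDeserializer");

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(Collections.singletonList("events")); // hypothetical Topic name
                while (true) {
                    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                    for (ConsumerRecord<String, String> record : records) {
                        // Each record carries the partition it was read from and its offset
                        // within that partition - its position in that partition's commit log.
                        System.out.printf("partition=%d offset=%d value=%s%n",
                                record.partition(), record.offset(), record.value());
                    }
                }
            }
        }
    }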

Kafka allows the number of partitions to be increased on the fly, so more Brokers can be added to handle the extra partitions for a Topic. Each additional Broker and partition brings a performance increase. Kafka also maintains replicas of each partition's data across the nodes.
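
The way partitions and their replicas are spread across the Brokers can be inspected programmatically. The sketch below uses the Java AdminClient (available in recent Kafka releases) to print the leader and replica Brokers for each partition of a Topic; the broker address and Topic name are hypothetical:

    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.admin.AdminClient;
    import org.apache.kafka.clients.admin.TopicDescription;
    import org.apache.kafka.common.TopicPartitionInfo;

    public class DescribeEventsTopic {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put("bootstrap.servers", "broker1:9092"); // hypothetical Broker address

            try (AdminClient admin = AdminClient.create(props)) {
                TopicDescription description =
                        admin.describeTopics(Collections.singletonList("events")) // hypothetical Topic
                             .all().get().get("events");
                for (TopicPartitionInfo partition : description.partitions()) {
                    // Each partition has a leader Broker, plus replica Brokers holding copies of its data.
                    System.out.printf("partition=%d leader=%d replicas=%s%n",
                            partition.partition(), partition.leader().id(), partition.replicas());
                }
            }
        }
    }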

From an infrastructure perspective, this opens up some exciting options for how we scale and deploy the service. At GameSparks, we are continuing to move towards a micro-services architecture and have an existing Mesos infrastructure. The Mesos infrastructure could be used to run Kafka within Docker containers, with Mesos acting as the framework that manages the scaling of the Kafka Brokers.

Challenges of Containerised Deployment

Kafka was developed with traditional virtual machine deployments in mind and has no native provisions for operating on a containerised platform. As such, deploying Kafka on a framework like Mesos or Kubernetes introduces a number of challenges.

The first challenge we identified is maintaining a persistent data store for a Kafka Broker. As we intend to keep data in Kafka for a number of weeks before moving it to long-term storage (S3, in our case), it would be beneficial to preserve a Broker's data if that Broker goes away. If we lose a Broker and another joins the cluster without the data, the new Broker would have to replicate the data from the remaining Brokers – a resource-intensive process given the amount of data we expect to hold. In the case of total failure (losing all Broker instances), we would also be at risk of losing data.

With our Mesos infrastructure, we would need the ability to move a Broker from one Mesos Agent to another and have the persistent data move with the Broker. There are a number of storage management engines for platforms like Mesos which provision and manage disks for containers – this includes the creation and mounting of EBS volumes in AWS and VHDs in Azure. We found the REX-Ray engine works well with our Mesos infrastructure, and provides the ability to mount and move volumes between Agents. It allows for a disk to be provisioned alongside a container and automatically mounted regardless of which Agent is used. In our case, we could create an EBS volume to hold the Kafka data for a Broker.

With REX-Ray, there is a solution to the problem of data persistence; however, another problem remains.

When a Topic is created, the number of partitions and the number of replicas for the Topic are set to fixed values. When a Kafka Broker is started, a unique Broker ID is set. The Broker ID is used to identify the Broker within a cluster and to associate it with partitions and replicas for a Topic. If you create a Topic with 3 partitions and start 3 Brokers, each Broker will pick up one of the partitions.
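
As an illustration, creating a Topic like that with the Java AdminClient might look like the following; the broker address and Topic name are again placeholders:

    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.admin.AdminClient;
    import org.apache.kafka.clients.admin.NewTopic;

    public class CreateEventsTopic {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put("bootstrap.servers", "broker1:9092"); // hypothetical Broker address

            try (AdminClient admin = AdminClient.create(props)) {
                // 3 partitions, each replicated to 3 Brokers - both values are fixed at creation time.
                NewTopic events = new NewTopic("events", 3, (short) 3); // hypothetical Topic name
                admin.createTopics(Collections.singleton(events)).all().get();
            }
        }
    }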

The number of partitions is not dynamic and doesn't scale up with the number of Brokers. For example, if you have a Topic with 3 partitions and 3 Brokers, adding a 4th Broker will not increase the number of partitions for that Topic. This is a sensible default behaviour and may be preferable depending on the number of data streams and Topics handled by your Kafka cluster. Changing the partition count for a Topic requires some manual input via the admin API, as sketched below.
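
As a sketch of what that manual input could look like, recent versions of the Java AdminClient expose a createPartitions call (the kafka-topics command-line tool can achieve the same); the broker address, Topic name and partition counts below are illustrative only:

    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.admin.AdminClient;
    import org.apache.kafka.clients.admin.NewPartitions;

    public class ExpandEventsTopic {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put("bootstrap.servers", "broker1:9092"); // hypothetical Broker address

            try (AdminClient admin = AdminClient.create(props)) {
                // Grow the Topic from 3 to 4 partitions so a 4th Broker has a partition to take on.
                // Note: existing data is not redistributed, and key-to-partition mapping changes.
                admin.createPartitions(
                        Collections.singletonMap("events", NewPartitions.increaseTo(4))).all().get();
            }
        }
    }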

As of writing, we haven't found a solution to Topic management that satisfies our requirements. Some projects aim to provide this kind of operational automation for containerised Kafka deployments, such as the Kafka Mesos framework. We evaluated Kafka Mesos but found it fell short in some areas – a topic for another day. Operational automation is an area we're still exploring.

Closing Points

Kafka offers the scalability and performance needed for the increasing demands of our analytics platform. While some operational challenges remain, it is shaping up to be a good fit for the micro-services architecture we are pursuing. In future blogs we may share updates on our experience with Kafka and discuss the role it plays in our analytics platform.
