At GameSparks, log analytics plays a key part in enabling the DevOps team to provide the uptime and stability we do. We process tens of millions of data points (log entries) per day from every component in our platform. In this post we’ll take a look at how we have managed to process this huge amount of data without disrupting our users experience.
In The Beginning
In the early days, our log aggregation stack consisted of the 4 components you would expect to see in any ELK build:
- Logstash Forwarder (LumberJack) – this ingests the raw log files from disks and pushes them to Logstash.
- Logstash – accepts the messages from Logstash Forwarder, processes them and writes them to ElasticSearch.
- ElasticSearch – storage, query and retrieval engine for log messages.
- Kibana – visual frontend to ElasticSearch, to query and display result sets.
This generic set up served us well to begin with, however as GameSparks grew and we began deploying our Runtime Clusters in more regions, using more cloud providers, the maintenance and management of network access, along with general scalability and resilience of the set up soon became a headache. The key challenges that we faced were:
- Network complexity – securely sending log entries from distributed servers to a central Logstash/ElasticSearch cluster for aggregation required tunnels to exist between each of our locations.
- Multiline Loglines – Logstash Forwarder was not intelligent enough (by design) to group and send multiline log entries in a single transaction. Collating multiline log entries within LogStash was CPU intensive. It disables multi-threading, and (prior to LogStash 1.5) caused a delay in the processing of a multiline message as Logstash waited for the next “start” line to appear before knowing a multiline message was complete.
- Resilience – any downtime on the Logstash server meant we could lose log entries.
- Scale – our single Logstash instance didn’t give us the flexibility to smoothly handle spikes of incoming data.
To combat the challenges faced, we first took the decision to replace Logstash Forwarder with an open-source Python alternative, Beaver. This supports parsing multiline log entries at their source, log file tagging and more significantly for us, transmission via AWS Simple Queue Service (SQS).
Being able to send our log entries from our servers distributed across the globe to an internet accessible service such as SQS kills the proverbial two birds with one stone. It removed our network security complexity, and provided resiliency from downtime in the remainder of the stack as messages will sit on the queue until being successfully ingested.
Once we had messages within SQS waiting to be processed, we started our tests with the official logstash-input-sqs plugin. However, because of the number of log entries we generate, we quickly realised the Ruby AWS SDK was far too slow.* Using the workaround documented in that Jira issue, we began work on packaging our own version of the SQS input plugin using the Java AWS SDK. This resulted in a performance increase of more than twice the number of log entries being processed per minute.
Making these changes also allowed us to tune Logstash’s input and worker threads to give us maximum throughput. As well as allowing us to run multiple Logstash processes, all concurrently consuming from the same queue.
Having all our log data centralised in near-realtime makes trend analysis and troubleshooting problems far less painful. It also gives us the ability to retrospectively analyse data. We use the information we have to pro-actively identify trends and anomalies.
We started off by developing an Elasticsearch River to perform configurable aggregations against new log entries and historical data for trends every minute. The output data from this process is used to generate reports for business insights and immediately alert our engineers to any service issues.
We have also created numerous Kibana dashboards that provide an insight into platform usage that simply can’t be identified by looking at the raw data alone.
Helpfully, Elasticsearch deprecated rivers not long after we made our own! So our next step is to move the river code into a standalone application. For what its worth, we agree with Elasticsearch’s approach. Rivers are a great concept, but we have found a number of issues with them. For example the same river running on multiple nodes, difficulties stopping them if the Elasticsearch hosts become busy, amongst others. So separating this logic out will give us greater stability and flexibility.
Oh and also we’ve got Kibana 4, Elasticsearch 2.0 and Logstash 2.0 to keep us busy as well. There are interesting times ahead of the GameSparks DevOps team!
*We have not tested the recent release of logstash-input-sqs v1.1.0 which has been upgraded to use AWS SDK v2