In recent years the BaaS sector has become a well-established and essential tool, deeply ingrained in the development process. As the marketplace for server-side solutions grows and more contenders emerge, developers are looking for providers that are reliable and transparent, so that they can rest easy knowing their games are in capable hands.
Here at GameSparks, the DevOps team maintains an exceptionally high standard of platform availability and reliability, in turn providing our customers, and their games, with consistent uptime. Recently we have taken further steps to give our users full peace of mind, and so today we are proud to announce http://status.gamesparks.net – a status page that provides a real-time overview of the GameSparks services.
In light of this new introduction, this accompanying post will take you through our in-depth monitoring solutions, including External Monitoring, Internal Monitoring and Automated Functional Tests.
The status page is powered by our comprehensive External Monitoring solution.
External Monitoring
External Monitoring is the view of the GameSparks platform from the Internet. By looking from the outside in, we can understand external connectivity issues, region-specific issues and your players’ view of our services.
Every minute we perform the following health checks from five geographically distributed locations against each of the game regions. These tests interact with GameSparks in the same way that a game client would:
- Create a WebSocket connection to the region, checking for DNS, network timeouts and WebSocket errors
- Game Authentication
- Player Authentication
- Standard API call
- Binary Asset Download
- Callback URL
If any one of these checks fails, or falls outside our acceptable performance tolerances, we alert an engineer to investigate. Additionally, if the issue could impact your game’s availability, the monitoring solution automatically updates the status page.
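The pass/fail rule above can be sketched roughly as follows. This is only an illustration, not our actual monitoring code: the names `CheckResult`, `run_checks` and `needs_alert`, the single `tolerance_ms` threshold, and the idea of modelling each check as a plain callable are all assumptions made for the example.

```python
import time
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class CheckResult:
    name: str
    ok: bool
    elapsed_ms: float
    reason: Optional[str] = None  # set only when the check did not pass


def run_checks(
    checks: List[Tuple[str, Callable[[], None]]],
    tolerance_ms: float,
    clock: Callable[[], float] = time.monotonic,
) -> List[CheckResult]:
    """Run each named check, timing it and recording errors or slow responses.

    A check is a hypothetical callable that raises on failure. `tolerance_ms`
    is an assumed single threshold; a real setup may use one per check.
    """
    results = []
    for name, check in checks:
        start = clock()
        error = None
        try:
            check()
        except Exception as exc:
            error = f"{type(exc).__name__}: {exc}"
        elapsed_ms = (clock() - start) * 1000.0

        if error is not None:
            # The check failed to process at all.
            results.append(CheckResult(name, False, elapsed_ms, error))
        elif elapsed_ms > tolerance_ms:
            # The check succeeded but was too slow to be acceptable.
            results.append(CheckResult(name, False, elapsed_ms, "outside tolerance"))
        else:
            results.append(CheckResult(name, True, elapsed_ms))
    return results


def needs_alert(results: List[CheckResult]) -> bool:
    """An engineer is alerted if any single check did not pass."""
    return any(not r.ok for r in results)
```

In this sketch a slow-but-successful check is treated the same as an outright failure, which mirrors the rule that a check outside tolerance also raises an alert.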
We use the following status messages depending on the errors reported:
- If 1-2 monitors are unable to validate the region, we mark the region as Degraded
- If 3-4 monitors are unable to validate the region, we mark the region as a Partial Outage
- If all 5 monitors are unable to validate the region, we mark the region as a Major Outage
The incident will stay open on the status page until all monitors have reported successfully. Periodic updates will be posted as and when monitoring detects any status changes.
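As a minimal sketch, the mapping from failing monitors to a status message could look like this. The function names and the "Operational" label for zero failures are assumptions made for illustration; only the Degraded, Partial Outage and Major Outage thresholds come from the rules above.

```python
MONITOR_COUNT = 5  # five geographically distributed monitoring locations


def region_status(failing_monitors: int) -> str:
    """Map the number of monitors that cannot validate a region to a status."""
    if not 0 <= failing_monitors <= MONITOR_COUNT:
        raise ValueError(f"expected 0-{MONITOR_COUNT} failing monitors")
    if failing_monitors == 0:
        return "Operational"  # assumed label, not stated in this post
    if failing_monitors <= 2:
        return "Degraded"
    if failing_monitors < MONITOR_COUNT:
        return "Partial Outage"
    return "Major Outage"


def incident_open(failing_monitors: int) -> bool:
    """An incident stays open until every monitor reports successfully."""
    return failing_monitors > 0
```

For example, `region_status(3)` returns `"Partial Outage"`, and `incident_open` only becomes false once the failing count drops back to zero.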
Internal Monitoring
We gather detailed performance statistics for every component within the GameSparks platform, giving us an overview of the platform’s historical and ongoing performance. This wealth of data gives us the information we need to continually tune and improve our platform, keep response times as low as possible and gain early insight into possible customer issues.
We currently capture almost 100,000 distinct data points including:
- Base OS metrics/health – CPU, memory, disk, etc.
- Application container metrics/health – threads, garbage collections, response times, etc.
- Application metrics/health – response times, resource usage, error rates, etc.
- Database metrics/health – queries, response times, IO, etc.
Additionally, we gather and process all of our server logs.
Automated Functional Tests
As an extra safeguard to validate that the GameSparks platform is functioning as expected, we perform full regression testing against each of our regions every two hours. These tests exercise every API call as a game client would.
The results of these tests are in permanent view of our technical and DevOps teams, and in the unlikely event of a test failing, an alert is generated. These tests are also an integral part of our deployment pipeline – a change is never deployed unless they pass. Once any changes are live, these tests continue to provide peace of mind, confirming that our platform is functioning correctly across the globe, for both you and your players.
The functional test suite currently contains over 5,800 tests, and it is growing all the time.