While we run must-run performance tests for every release to make sure that new features do not impact the performance of our platform, we (Engineering @ Glassbeam) wanted the ability to run large scale tests periodically too. However, running large scale performance tests is a time-consuming and expensive affair. Such tests require spinning up tens of machines, and setting up and running the tests can take days. Thus began our new project to automate the entire performance testing cycle, so that even large scale tests could be done in hours, with minimal hosting costs, at the click of a button!
In this two-part series, I’ll introduce you to our automation framework and the results of our scalability test. While this test focuses on logs from medical devices, the framework is independent of the type of logs or the device generating them.
This first part discusses the framework we developed for testing large scale ingestion. In the second part, I’ll show you the results we achieved.
Let’s explore our experience.
“A system is considered scalable if it is capable of increasing its total output under an increased load when resources (typically hardware) are added.”
A scalable platform would handle a linear growth of data volume with a linear growth of computer hardware. To scale a platform effectively, we need to consider some important factors such as:
Time and Cost
We decided on a container-based approach to launch, scale, and measure the scalability of our Glassbeam Analytics platform, as containers address the problems of time and cost very effectively.
A container image is a lightweight, standalone, executable package of a piece of software that includes everything needed to run it: code, runtime, system tools, system libraries, and settings. Available for both Linux- and Windows-based apps, containerized software will always run the same, regardless of the environment. Containers isolate software from its surroundings (for example, differences between development and staging environments) and help reduce conflicts between teams running different software versions on the same infrastructure.
Using containers helps our platform scale quickly: we simply clone the containers and run multiple instances at once. Several tools provide a layer over these containers for easy deployment of any software. We chose the most popular of these, Docker, for our performance platform.
From the official Docker page:
“Docker is the company driving the container movement and the only container platform provider to address every application across the hybrid cloud. Docker enables true independence between applications and infrastructure and developers and IT ops to unlock their potential and creates a model for better collaboration and innovation.”
With the help of Docker, we created images for each module of our platform. These images can be run on any machine, irrespective of the operating system (containers, right!). All of our images are hosted on Docker Hub for easy deployment on any machine.
Glassbeam platform consists of multiple modules:
As we need multiple instances of each of these modules to scale our platform as a whole, running them from bare Docker images would be tedious. To solve this problem, we use Docker Compose files to launch the group of modules and scale them at once.
Having multiple modules makes it difficult to manage scaling the platform as a whole. We need a network on which to deploy all our images and a channel for them to talk to each other. As these images run in individual containers on different nodes, we decided to deploy them in swarm mode, which provides network connectivity between our containers across nodes.
From docker.com - swarm mode
“A swarm is a group of machines that are running Docker and joined into a cluster. After that has happened, you continue to run the Docker commands you’re used to, but now they are executed on a cluster by a swarm manager. The machines in a swarm can be physical or virtual. After joining a swarm, they are referred to as nodes.”
Swarm mode helps us deploy our containers across different nodes and scale them very quickly. Once deployed, all our containers in the swarm network are inter-connected. Docker swarm's built-in orchestration also restarts a container if it stops due to an error.
Now, with a single Docker Compose file, we can launch all our modules (scaled 'N' times) on a Docker swarm cluster, almost in an instant. Once deployed, our Glassbeam Analytics platform starts processing the received log files, storing the relevant data in the databases (Cassandra/Vertica/Kafka) and the search-oriented information in Solr.
Even with the Docker Compose files and the swarm cluster, we still had to spin up nodes manually for every test, which did not fit our one-click automation goal. Enter Docker Cloud.
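As a rough illustration of the single-command launch described above, here is a small Python sketch that builds the Docker CLI invocations for deploying a compose file onto a swarm and scaling each service to 'N' replicas. The stack name, compose file name, and service names are hypothetical placeholders, not the real module names. `docker stack deploy` and `docker service scale` are the standard swarm-mode commands, and the code only constructs the argument lists rather than running them.

```python
from typing import Dict, List


def deploy_command(compose_file: str, stack: str) -> List[str]:
    """Argument list for deploying a compose file as a stack on a swarm manager."""
    return ["docker", "stack", "deploy", "--compose-file", compose_file, stack]


def scale_command(stack: str, replicas: Dict[str, int]) -> List[str]:
    """Argument list for scaling several services of a stack in one call.

    Services created by `docker stack deploy` are named <stack>_<service>.
    """
    targets = [f"{stack}_{name}={count}" for name, count in sorted(replicas.items())]
    return ["docker", "service", "scale", *targets]


if __name__ == "__main__":
    n = 4  # hypothetical scale factor 'N'
    # Hypothetical service names, used only for illustration.
    replicas = {"parser": n, "loader": n}
    print(" ".join(deploy_command("docker-compose.yml", "perf")))
    print(" ".join(scale_command("perf", replicas)))
```

The same replica counts could instead be declared in the compose file itself. Building the commands in code just makes it easy to vary 'N' from one test run to the next.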
From Docker.com: docker cloud
“Docker Cloud allows you to connect to clusters of Docker Engines in swarm mode.
With Swarm Mode in Docker Cloud, you can provision swarms to popular cloud providers, or register existing swarms to Docker Cloud. Use your Docker ID to authenticate and securely access personal or team swarms.”
Docker Cloud works well with many cloud providers, and since we are already hosted on AWS, we integrated Docker Cloud with AWS. With this integration, we are able to spin up multiple nodes from the single interface of Docker Cloud.
With everything automated, we were now able to spin up our entire setup and start parsing for our performance test with just one click.
Time taken for setup:
|Setting up 'N' Nodes|Time taken (minutes)|
|Spinning up the machines|N*1 + 2 (2 minutes to spin up the containers and 1 minute each to join the swarm cluster)|
|Launching core modules|10|
|Total time taken|N*6 + 14|
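To get a feel for these formulas, the short sketch below evaluates them for a few cluster sizes. It takes the spin-up and total expressions from the table at face value, treats all values as minutes (an assumption based on the "2 Mins" note), and uses the stated total directly rather than summing the rows shown. The function names and node counts are illustrative only.

```python
def spin_up_minutes(nodes: int) -> int:
    """Machine spin-up time from the table: N*1 + 2 (assumed to be minutes)."""
    return nodes * 1 + 2


def total_setup_minutes(nodes: int) -> int:
    """Total setup time from the table: N*6 + 14 (assumed to be minutes)."""
    return nodes * 6 + 14


if __name__ == "__main__":
    # Illustrative cluster sizes, not the sizes used in our tests.
    for nodes in (5, 10, 20):
        print(nodes, spin_up_minutes(nodes), total_setup_minutes(nodes))
```

For a 10-node cluster, for instance, the total works out to 10*6 + 14 = 74 minutes.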
As we can deploy and stop the services with single commands, these scalability tests for our Glassbeam Analytics platform are orchestrated to run for only a few hours (typically around two hours), which reduces the cost of a run by a huge margin.
Monitoring an 'N'-node cluster is very difficult, even more so if the cluster runs into an issue. To ease debugging and to measure performance metrics on all the nodes in parallel, we developed a tool (simply called Monitor, duh!) that periodically monitors our core modules on the various nodes and mails us the results. Monitor also measures our platform's overall throughput and shuts down the nodes once the throughput is constant (within 5% variance).
Docker swarm mode (through Docker Cloud) is still in beta, so only the containers running on the manager nodes are accessible. As we need logs from the containers running on other nodes, we deployed our Monitor program on them to periodically upload the logs to Amazon S3.
In our next blog, we will walk you through the results of our test.
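The "constant throughput" check can be pictured with a minimal sketch like the one below. This is not the Monitor tool's actual code: the function name, the five-sample window, and the reading of "5% variance" as "every recent sample lies within 5% of the recent mean" are all assumptions made for illustration.

```python
from typing import Sequence


def throughput_is_steady(samples: Sequence[float], window: int = 5,
                         tolerance: float = 0.05) -> bool:
    """Return True when the last `window` samples all lie within
    `tolerance` (as a fraction) of their own mean."""
    if window <= 0 or len(samples) < window:
        return False  # not enough data to decide yet
    recent = samples[-window:]
    mean = sum(recent) / window
    if mean <= 0:
        return False  # nothing is being processed, so not a real plateau
    return all(abs(s - mean) <= tolerance * mean for s in recent)


if __name__ == "__main__":
    # Hypothetical throughput readings (arbitrary units), ramping up then flattening.
    readings = [10.0, 40.0, 80.0, 100.0, 101.0, 99.0, 100.0, 102.0]
    print(throughput_is_steady(readings[:5]))  # still ramping up
    print(throughput_is_steady(readings))      # last five readings are flat
```

Requiring a full window of samples before deciding keeps a test from being shut down during the initial ramp-up, when throughput is still climbing.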
Aah! Glad you reached this far. We enjoy sharing the work we do, and for the last couple of weeks we have also had an Engineering section under our Blogs. On your way out, do see what my teammates have to say: Saumitra on Creating DSL with Antlr4 and Scala, and Bharadwaj on Integrating Apache Kafka with Glassbeam.
Come back next week for that ‘part 2’ I promised earlier.