Testing 1 TB/day Data Ingestion in a Few Hours and Just One Click! – Part 2

Bharadwaj Narasimha
Thursday, March 1, 2018

In my previous blog, I discussed the framework we built for testing large-scale machine log data. In this post, I will share the results of our tests. Every test run with varying server cluster sizes was executed through our automation framework, and therefore required minimal effort on our side beyond clicking a button after deciding the number of servers we needed.

The scalability of our platform depends broadly on two things:

  • Parsing criteria: Glassbeam's DSL (domain-specific language) for capturing parsing criteria is called SPL (Semiotic Parsing Language). SPL has many features that can be used to give structure to almost any kind of data. The total parsing throughput for a log is a function of the complexity of the SPL it requires. The first of our tests was to see what throughput Glassbeam's product achieves for pure parsing of medical imaging log data, and whether that throughput scales linearly.
  • Data stores: Glassbeam's platform can write to a variety of data stores, including Cassandra, Solr, Kafka, Redis, and Vertica. Our second set of tests was to measure throughput and scalability with data writes going to different data stores.
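The fan-out of parsed records to multiple data stores can be sketched with a pluggable writer interface. This is a minimal illustration, not Glassbeam's actual API; the class and function names are hypothetical, and an in-memory writer stands in for real Kafka/Cassandra/Solr clients:

```python
from abc import ABC, abstractmethod

class StoreWriter(ABC):
    """Hypothetical interface for an egress data-store writer."""
    @abstractmethod
    def write(self, record: dict) -> None: ...

class InMemoryWriter(StoreWriter):
    """Stand-in for a real store client (Kafka, Cassandra, Solr, ...)."""
    def __init__(self, name: str):
        self.name = name
        self.records: list[dict] = []

    def write(self, record: dict) -> None:
        self.records.append(record)

def fan_out(record: dict, writers: list[StoreWriter]) -> None:
    """Send one parsed record to every configured store."""
    for w in writers:
        w.write(record)

# Simulate parsing 1,000 log events and writing each to three stores.
writers = [InMemoryWriter(n) for n in ("kafka", "cassandra", "solr")]
for i in range(1000):
    fan_out({"event_id": i, "severity": "INFO"}, writers)

assert all(len(w.records) == 1000 for w in writers)
```

In a production pipeline each `StoreWriter` would wrap a store-specific client and batch its writes; the interface above only shows the fan-out shape.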

Test Results

1. Scalability and Throughput of Pure Parsing

We did two types of tests:

  • To find the parsing throughput on a single node for long runs.
  • To find the parsing throughput by increasing the number of nodes to discover if it scales.



The results of the two tests are captured in the graphs above. That the platform scales linearly across time and nodes came as no surprise. Glassbeam's parsing platform is compute-intensive and requires a minimum of two nodes (note that it wrote to no egress data stores as part of this test). The nodes used in this test were AWS m5.xlarge machines (4 vCPU/16 GB RAM) in a Docker Swarm cluster.
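A quick way to check a linear-scaling claim from (nodes, throughput) pairs is to compare each measurement against ideal linear extrapolation from a baseline. The helper and the numbers below are purely illustrative (they are not the actual test results from the graphs):

```python
def scaling_efficiency(baseline_nodes: int, baseline_tput: float,
                       nodes: int, tput: float) -> float:
    """Ratio of measured throughput to ideal linear scaling
    extrapolated from the baseline run (1.0 = perfectly linear)."""
    ideal = baseline_tput * (nodes / baseline_nodes)
    return tput / ideal

# Illustrative (made-up) measurements: GB/hour parsed at each cluster size.
runs = [(2, 10.0), (4, 20.0), (8, 39.0)]
base_nodes, base_tput = runs[0]
for nodes, tput in runs[1:]:
    eff = scaling_efficiency(base_nodes, base_tput, nodes, tput)
    print(f"{nodes} nodes: {eff:.0%} of linear")
```

An efficiency close to 1.0 at every cluster size is what "scales linearly" means in practice; a steadily dropping ratio would point to a shared bottleneck.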

2. Scalability and Throughput of Parsing with Data Stores

When the Glassbeam Analytics platform writes to data stores, the following aspects come into play:

  • Throughput and scalability change with each store. For example, writing to Kafka is far more performant than writing to Solr (for the same number of messages/documents with almost the same fields) due to the underlying properties of each store. Kafka is a distributed message bus with no way to consume data except in the order of writes, while Solr is a search store where every incoming write leads to tokenization, inverted-index updates, and so on.
  • Throughput and scalability change as the number of nodes of distributed stores like Cassandra, Solr, and Kafka changes. To ensure that the Glassbeam Analytics platform writes to a store at the maximum possible velocity, we performed multiple tests: starting with a single platform writer node, we increased the number of datastore nodes until the velocity of writes maxed out. Next, we increased the number of writer nodes to two and again added datastore nodes until we reached 2x the parsing velocity achieved with the single writer node. We repeated this process up to 22 platform writer nodes.
  • Throughput and scalability change with replication factor used for these distributed data stores.

Note: the actual tests are a little more complex since Glassbeam's parsing platform uses more than a single type of service to write data.
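The node-scaling procedure described above can be sketched as a simple search loop: for each writer count, keep adding datastore nodes until throughput stops improving. The throughput model here is purely illustrative (the real numbers come from running the platform against the actual stores):

```python
def throughput(writer_nodes: int, store_nodes: int) -> float:
    """Illustrative model: each writer pushes up to 100 units/s,
    each datastore node absorbs up to 40 units/s. The system runs
    at whichever side is the bottleneck."""
    return min(writer_nodes * 100.0, store_nodes * 40.0)

def scale_out(max_writers: int) -> list[tuple[int, int, float]]:
    """Mirror the test procedure: for each writer count, grow the
    datastore cluster until write velocity maxes out, then record
    (writers, store_nodes, throughput) and add another writer."""
    results = []
    store_nodes = 1
    for writers in range(1, max_writers + 1):
        while throughput(writers, store_nodes + 1) > throughput(writers, store_nodes):
            store_nodes += 1
        results.append((writers, store_nodes, throughput(writers, store_nodes)))
    return results

for writers, stores, tput in scale_out(4):
    print(f"{writers} writer(s) -> {stores} store nodes, {tput} units/s")
```

Under this toy model, each additional writer needs a few more datastore nodes before throughput plateaus again, which is the shape of the search the tests automate.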

  • Scalability and throughput for analytics using Cassandra:


  • Scalability and throughput for search using Solr:


  • Scalability and throughput for search & analytics (a combination of the above):


The results were not surprising, as we know the Glassbeam Analytics platform scales linearly. The exciting part of this exercise was the ease with which we were able to run these tests, and at a cost that was a fraction of doing this manually.

Our next step is to automate large scale read performance tests, so that we can also do those tests at a click of a button!

Interested in other blogs from our Engineering group? Check out Mohammed Guller’s blog on machine learning. See how we’re raising the bar by combining the high-throughput computing I discussed here with AI on medical imaging machine data.