An auto-tuning parser for data from the Internet of Things (IoT)

Ashok Agarwal

We have established beyond a reasonable doubt that knowledge comes from structure. Therefore, parsing IoT logs to create structure is a must-do for making sense of this data. Recall the definition of big data: volume, variety and velocity. Combine that with the business requirement of near-real-time analytics, and you are looking at a need for high data-ingestion speeds. The real issue, however, is not ingestion speed itself but the total cost of ownership (TCO) of providing it. Today, with a proper application architecture, you can scale a platform horizontally almost without limit to reach the required ingestion speed, but you can't prevent it from being either underutilized or bottlenecked. The cloud helps, but has its own performance issues.

At a very high level, the process of creating structure consists of three steps: read, parse and load. Disk I/O is slow, but loading data into an indexed (structured) schema is even slower. Both can benefit from spawning multiple threads. For scalability, Glassbeam uses the Akka actor framework, which provides not just a parallel but also an asynchronous processing environment. This gives optimum machine utilization because read, parse and load run asynchronously. But one stage can overwhelm another: if you read and parse too fast, you can push too much data into the load queue and risk an out-of-memory (OOM) error.
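To make the pipeline and its OOM risk concrete, here is a minimal thread-and-queue sketch in Python. This is not Glassbeam's actual Akka code; the stage names and the toy CSV "parse" are illustrative assumptions. The key point it demonstrates is the bounded hand-off queue: a full queue blocks the upstream stage, so a fast reader or parser cannot pile up records faster than the loader can drain them.

```python
import queue
import threading

# Illustrative sketch (not Glassbeam's Akka implementation): a read -> parse
# -> load pipeline where each stage runs on its own thread and hands work to
# the next stage through a BOUNDED queue. The bound is what applies
# backpressure and prevents the OOM scenario described above.

SENTINEL = None  # marks end of stream

def read(lines, out_q):
    for line in lines:
        out_q.put(line)            # blocks when the parse queue is full
    out_q.put(SENTINEL)

def parse(in_q, out_q):
    while (item := in_q.get()) is not SENTINEL:
        out_q.put(item.strip().split(","))  # toy "parse": CSV split
    out_q.put(SENTINEL)

def load(in_q, sink):
    while (item := in_q.get()) is not SENTINEL:
        sink.append(item)          # stand-in for indexing into Solr

def run_pipeline(lines, queue_size=4):
    parse_q = queue.Queue(maxsize=queue_size)  # bounded on purpose
    load_q = queue.Queue(maxsize=queue_size)
    sink = []
    threads = [
        threading.Thread(target=read, args=(lines, parse_q)),
        threading.Thread(target=parse, args=(parse_q, load_q)),
        threading.Thread(target=load, args=(load_q, sink)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sink

print(run_pipeline(["a,1", "b,2", "c,3"]))
# → [['a', '1'], ['b', '2'], ['c', '3']]
```

With an unbounded queue (`maxsize=0`), the reader would run far ahead of the loader and the queue's memory footprint would grow without limit; the bound turns that failure mode into blocking, at the cost of idling the faster stages.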

The solution lies in an auto-tuning application. It is parallel and asynchronous, but in addition it throttles or expands depending on the load in each segment of the asynchronous process. The overall speed is limited by the slowest link in the chain. Typically, the maximum number of threads is defined in the receiver. Glassbeam uses Solr for search, so for us Solr is a receiver (among others such as CSV, Cassandra, etc.). The Solr configuration has an entry for the maximum number of threads, and this value obviously depends on the resources allocated to Solr. Glassbeam's asynchronous processes periodically send PerfMarker messages and measure the response path to determine how far they need to open up or throttle. They tune themselves to the speed and resources allocated to Solr. As a result, we don't have to allocate resources for the maximum expected load, which helps us manage the TCO.
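The tuning loop itself can be sketched as follows. The PerfMarker name comes from the article, but the concrete policy below — comparing the marker's round-trip time against a target and adjusting an in-flight work limit with additive increase and multiplicative decrease — is an assumed illustration, not Glassbeam's actual algorithm.

```python
# Illustrative auto-tuning rule (assumed policy, not Glassbeam's actual code):
# a sender periodically measures the round trip of a PerfMarker message through
# the receiver (e.g. Solr) and adjusts how much work it keeps in flight.

class AutoTuner:
    """Adjusts an in-flight work limit from PerfMarker round-trip times."""

    def __init__(self, min_limit=1, max_limit=64, target_ms=100.0):
        self.limit = min_limit        # current in-flight limit
        self.min_limit = min_limit
        self.max_limit = max_limit
        self.target_ms = target_ms    # desired receiver response time

    def on_perf_marker(self, round_trip_ms):
        if round_trip_ms <= self.target_ms:
            # Receiver is keeping up: open up gradually (additive increase).
            self.limit = min(self.limit + 1, self.max_limit)
        else:
            # Receiver is falling behind: throttle hard (multiplicative decrease).
            self.limit = max(self.limit // 2, self.min_limit)
        return self.limit

tuner = AutoTuner()
for rtt in [40, 50, 60, 250, 45]:   # simulated PerfMarker round trips (ms)
    print(rtt, "->", tuner.on_perf_marker(rtt))
# → 40 -> 2, 50 -> 3, 60 -> 4, 250 -> 2, 45 -> 3
```

Because the limit converges on whatever the receiver can actually sustain, the sender adapts automatically when Solr is given more or fewer resources, which is what lets us avoid provisioning for the maximum expected load.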