Ever wonder how we power those “which controller went down today” queries that span thousands of databases, amounting to hundreds of terabytes of log data every day? How do we handle that much data in a robust and efficient manner? We call it harmonic in-memory query management.
We’ve been working with a distributed Cassandra cluster for almost a year. In that time we have learned a great deal about scalability, and along the way we have collected some insights into optimal query performance.
Data in databases is distributed by partitioning (a.k.a. sharding). The real benefit comes from keeping these partitions (or shards) on separate nodes of a cluster so that they don’t contend for I/O resources.
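To make the idea concrete, here is a minimal sketch of hash-based partition placement. The node names and the key format are made up for illustration; Cassandra itself uses a token ring and its own partitioner, but the principle is the same: the partition key alone determines which node owns the data.

```python
import hashlib

# Hypothetical three-node cluster; real clusters use a token ring.
NODES = ["node-a", "node-b", "node-c"]

def node_for(partition_key):
    # Hash the partition key and map it to a node, so each partition
    # lives on exactly one node and partitions spread across the cluster.
    h = int(hashlib.md5(partition_key.encode()).hexdigest(), 16)
    return NODES[h % len(NODES)]

# The same key always maps to the same node, so reads and writes for
# one partition never contend with another partition's node for I/O.
print(node_for("controller-42/2015-06-01"))
```

Because placement is a pure function of the key, any client can compute where a partition lives without consulting a central directory.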
Very large partitions are bad: many queries can hit the same resource, the memory required to hold it grows, and in the worst case the node runs out of memory. Very small partitions are also bad, because they increase disk reads and the network overhead of moving data around. An ideal partition serves a query with one reasonably sized disk read. Assuming a query that reads six months’ worth of data and generates aggregated values, query performance can be improved as follows:
- split the query into a list of partition keys
- issue all the reads in parallel
- apply the aggregation logic to each returned partition
- save the aggregated value for each partition
- do a final aggregation of all the individual aggregations
- return the result to the user
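The steps above can be sketched in a few lines of Python. The `read_partition` function and the `FAKE_DATA` it reads from are hypothetical stand-ins for a per-partition Cassandra query; the point is the scatter-gather shape: parallel reads, per-partition aggregation, then one final merge.

```python
from concurrent.futures import ThreadPoolExecutor

def read_partition(key):
    # Stand-in for a Cassandra read restricted to one partition key;
    # returns rows of (timestamp, value) for that partition.
    return FAKE_DATA[key]

def aggregate_rows(rows):
    # Per-partition aggregation: here, a simple sum of the values.
    return sum(v for _, v in rows)

def run_query(partition_keys):
    # Split into partition keys, read in parallel, aggregate each
    # partition independently, then do the final aggregation.
    with ThreadPoolExecutor(max_workers=8) as pool:
        partials = list(pool.map(
            lambda k: aggregate_rows(read_partition(k)), partition_keys))
    return sum(partials)

# Simulated data: one partition per day (a real key might combine
# a device id and a time bucket).
FAKE_DATA = {f"day-{i}": [(t, 1) for t in range(10)] for i in range(5)}
print(run_query(list(FAKE_DATA)))  # 5 partitions x 10 rows -> 50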
The flip side is that the entire contents of a partition must be moved from the node to the client. The bottleneck is the partition size, which is not in our control and may be quite large.
Flex your query muscles: Take small bites with pagination
Data access layers can send large amounts of data to the client in several small pages, instead of as one humongous chunk!
With the right ingredients (aggregation logic, client-side compatibility), each page can be processed independently. This way, no large amount of data is retained at any point during processing, and the partition-size limitation goes away. However, moving the data to the client remains the bottleneck.
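A minimal sketch of this streaming style, with `fetch_pages` as a made-up stand-in for a driver's paged read (real drivers expose this via a fetch size on the query). Only one page is ever held in memory, and the aggregate is updated incrementally:

```python
def fetch_pages(rows, page_size=100):
    # Hypothetical paged read: yields the partition's rows one page
    # at a time, the way a driver pages a large result set.
    for i in range(0, len(rows), page_size):
        yield rows[i:i + page_size]

def paged_aggregate(rows, page_size=100):
    # Running aggregate: only the current page plus two counters are
    # ever in memory, so partition size no longer limits the client.
    total, count = 0, 0
    for page in fetch_pages(rows, page_size):
        total += sum(page)
        count += len(page)
    return total / count  # running average over the whole partition

big_partition = list(range(1000))  # simulated oversized partition
print(paged_aggregate(big_partition))  # average of 0..999 -> 499.5
```

Any aggregation that can be expressed as an incremental update over pages (sums, counts, averages, min/max) fits this pattern.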
To address this we leverage Spark, a MapReduce framework and a great way to reduce data movement to the client. By running the business logic on the node that holds the data, aggregation happens in place, and only the small aggregated result needs to move to the coordinator.
Spark, with built-in support for aggregations, reads one partition at a time, making it a great fit for distributed processing. Combine this with the pagination provided by modern data access layers, and we get the best of both worlds.
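The MapReduce pattern Spark applies here can be illustrated in plain Python (no cluster required). The node names and data are invented; the key property is that the map step runs next to each partition and returns only a small summary, and the summaries are combined with an associative reduce, so order and grouping don't matter.

```python
from functools import reduce

# Invented per-node partitions standing in for data-local RDD partitions.
partitions = {
    "node-a": [3, 5, 7],
    "node-b": [2, 8],
    "node-c": [4, 6, 9],
}

def map_phase(rows):
    # Runs where the data lives; returns a tiny (count, sum) summary,
    # which is all that travels over the network.
    return (len(rows), sum(rows))

def reduce_phase(a, b):
    # Combines two partial summaries; associative, so the coordinator
    # can merge them in any order.
    return (a[0] + b[0], a[1] + b[1])

partials = [map_phase(rows) for rows in partitions.values()]
count, total = reduce(reduce_phase, partials)
print(total / count)  # mean over all partitions: 44 / 8 = 5.5
```

Note that the mean itself is not associative, which is why each node ships the (count, sum) pair rather than a per-node average.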
By supporting a rich set of query processing standards and building on powerful distributed frameworks, Glassbeam’s Big Data platform offers optimal partitioning, data distribution, and query processing. In all, the data we can handle with our distributed Cassandra cluster is effectively unbounded, limited only by the laws of physics.