Integrating Apache Kafka with Glassbeam – Behind the scenes: Opening up the Platform for Integrating with other Data Stores


Opening our platform for our customers and letting them build on top of it had been on our minds for some time now. With this capability now built into our platform, it throws open a variety of possibilities for our customers. In a two part series, I will explain the why and how of it.

The problem Statement (Why are we doing what we are doing)

Giving structure to unstructured data is at the core of what we do at Glassbeam. And for this we have engineered 2 DSLs – Scalar DSL for parsing and the Rules DSL. The job of Glassbeam’s platform is to parse data as per this DSLs. The challenge of parsing log data is often described by the 3 V’s – Variety, Velocity and Volume. So, the question is, how do we do justice to cater to the 3 V’s?

To cater to variety, we follow a proactive approach in product development adding features to the DSL and releasing them asap. We use Scala as our lingua franca for programming at Glassbeam and love how it enables faster development with more concise code. To cater to volume and velocity, we do plethora of things including using Akka and pushing the limits of parallel/async processing and careful use of NoSQL + SQL data stores to get us maximum read and write throughputs for the various use cases.

However, despite these practices, the challenge of the 3 V’s is a moving target because of four primary reasons:

  • with every new customer comes new kind of unstructured data: new kinds of unstructured data require us to add features and further enrich our DSL.
  • newer data visualization requirements: new data visualization requirements make us scrutinize our data read/write paths along with engineering
  • new APIs and UI.
  • ever increasing data volumes.
  • like any other SaaS platform, our priority is to pick features that are generic and impacts all the customers than just one or two.

The implication of above is time-to-market for our customers who have requirements that are very specific. So, as Glassbeam continues to win more customers and increase its usage adoption in our current accounts, the questions that we face with were:

was it possible to decouple parts of tech-stack to enable us to use partners in our engineering effort for faster time-to-market? Since Glassbeam’s customers are large enterprises with significant engineering in-house, often the customer themselves is a willing partner – for example a customer might be willing to invest its own engineering efforts for data visualization if the parsed data can be exposed in a consumable fashion. what is the core competency of Glassbeam’s technology offering and how do we focus to build on it rather than the peripherals?

It was this thinking and working with our customers on these lines that made us realize that giving structure to unstructured data is at the heart of what we do. That our parsing platform with its parsing and rules DSL are our core competency. And thus, we need to provide integration points for external consumers to plug into the output of these. A computational engine that consumes unstructured data and outputs structured data per the specifications of the DSL is in itself a product that can be sold. It allows us to work with our customers and partners even closer. It enables sales to large enterprises that need log analytics deployed on-premise instead of on-cloud. And it allows us to iterate and release our product even faster. So, the next question was, what would the cleanest and most robust way to egress the structured parsed data? An external gateway that could scale and be tolerant to failures. We chose Kafka to be that gateway. Kafka with its topics partitioned + replicated across nodes and log structured fast persistent storage fit our needs perfectly. That it is an already popular and active open source project was an added advantage. So we at Glassbeam built capabilities to our platform to store the output of parsing and rules processing onto Kafka. And we provide a variety of ways to configure and tune what gets written and how it gets written.

The Producer (LCP and its Parallel async Parsers)

The Glassbeam platform is a system of ‘n’ identical micro services. They share a single source for configuration and work on data independently. Each service compiles the DSL’s (parsing and rules), does the parsing according to the DSL while maintaining the parsing status of various in-transit files locally and eventually writes to parsed data to an external store like Kafka. Failure of one of the service instances does not affect the others and incoming data is routed to the running instances. Within each service parsing can be in-progress for ‘m’ files (where ‘m’ is configurable). Each of these ‘m’ files are said to being parsed by its ‘parse-instance’. Each ‘parse-instance’ is a tree of Akka actors created when the DSL is compiled. Lines from the file are streamed through the compiled Actor tree where the actual parsing happens. Each ‘parse-instance’ is throttled by a windowing mechanism so that lines aren’t streamed much faster than they can be parsed. Since different log files can be containing different variety of data and be of different sizes, each takes its own time to complete. The idea behind using many ‘parse-instances’ and Akka actors is to maximize the usage of compute available on the server. Each ‘parse-instance’ forwards the normalized parsed data to a many Akka routee’s that act as Kafka producer’s and eventually write the data.

In the next part, I will cover the topic and partition design, message format examples and integration options…stay tuned.