Glassbeam’s business revolves around providing business intelligence on machine data. Intelligence comes from structured data. Machine data is not always structured. So, there is a gap between what is needed and what is produced. As Glassbeam’s head of engineering, I am going to write a two-part blog series about how Glassbeam bridges this gap.
The holy grail of business intelligence on machines is Predictive Analytics (PA) and Anomaly Detection (AD). These require Machine Learning (ML), which in turn requires building models. We are now going way beyond just structuring data for intelligence into the realm of Data Engineering.
Glassbeam’s platform is just that. It is a Data Engineering platform which translates traditional machine data – a combination of structured (XML, JSON) and unstructured (text) components – into more useful structured data.
Wikipedia defines a domain-specific language (DSL) as a computer language specialized to a particular application domain. This is in contrast to a general-purpose language (GPL), which is broadly applicable across domains. For example, HTML is a DSL (Web pages only) and Java is a GPL.
Glassbeam’s DSL is a language providing Data Engineering features for the very specific domain of multi-structured machine logs. This language is called SPL (Semiotic Parsing Language).
Multi-structured Machine Logs:
I define Machine Data as information about the device ― what is the configuration, how is it performing, and how is it being used. Most modern machines produce multiple logs containing this information. This machine data holds tremendous insights into the machine. The trick in obtaining knowledge from these logs lies in the ability to extract the information and then use it meaningfully.
This is not trivial, especially if you consider the fact that each file can have many sections of information in different formats – some unstructured (like text), while others are semi-structured (like XML or JSON). Taken together, the machine data is multi-structured.
A typical log file looks like this:
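Here is a simplified, invented example (real logs vary widely by device), with a plain-text section, an XML section, and an event section:

```
SYSTEM INFO
  Serial Number : ABC123
  Uptime        : 42 days

=== CONFIG (XML) ===
<config>
  <model>X900</model>
  <firmware>2.1.4</firmware>
</config>

=== EVENTS (text) ===
2021-03-01 12:00:01 WARN  fan speed high
2021-03-01 12:00:09 ERROR temperature threshold exceeded
```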
Each section or child section can have data in a different format – and that is what makes the process of structuring this log file an interesting exercise.
SPL is Glassbeam’s technology for structuring machine data. SPL is a structured compilable language for parsing and structuring an arbitrary multi-structured log file.
Specifically built for Data Engineering, SPL is a DSL which, in a single development step, allows the developer to:
- Model the source data -> defines the type of data being processed. The type of data refers to Configuration Data, Statistical Data, Session Data or Event Data.
- Define the schema -> defines the columns and their data types. For example, a string of length 50 bytes, etc.
- Define UI components -> specify the descriptive labels used for cryptic column names defined in the schema.
- Generate Metadata -> this is the cross-reference used by the Glassbeam applications when showing data to users through the UI.
- Define Parsing Rules (through Namespaces and ICONS, defined later) -> see SPL terminology
- Define Indexing Requirements -> defines indexes which are then used for exploring log data through the explorer app (faceted and full-text search)
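As a rough analogy – this is not actual SPL syntax, and all names below are invented – the single development step above bundles what would otherwise be several separate declarations in a general-purpose language:

```python
import re

# Schema: column names, data types, and sizes
schema = {
    "sys_serial": {"type": "string", "length": 50},
    "cpu_util":   {"type": "float"},
}

# UI components: descriptive labels for the cryptic column names
labels = {
    "sys_serial": "System Serial Number",
    "cpu_util": "CPU Utilization (%)",
}

# Source data model: what kind of data this table holds
model = {"table": "system_stats", "kind": "Statistical Data"}

# Parsing rule: a regex that extracts the schema's columns from a log line
RULE = re.compile(r"serial=(?P<sys_serial>\S+)\s+cpu=(?P<cpu_util>[\d.]+)")

def parse(line: str) -> dict:
    """Apply the parsing rule, then cast values per the schema."""
    m = RULE.search(line)
    if not m:
        return {}
    row = m.groupdict()
    if schema["cpu_util"]["type"] == "float":
        row["cpu_util"] = float(row["cpu_util"])
    return row

print(parse("serial=ABC123 cpu=87.5"))
# -> {'sys_serial': 'ABC123', 'cpu_util': 87.5}
```

In SPL these pieces are declared together, so the schema, labels, metadata, and parsing rules cannot drift out of sync the way independent definitions can.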
SPL allows breaking up a huge multi-structured log file logically into smaller components or sections. Parsing happens at section level by defining the data format bounded by the section. Different sections of a log file can have different data formats. Moreover, sections can be hierarchical to ‘n’ levels. Hierarchy is used to select/drop any specific data.
At its core, SPL is a parsing language. The structured parsed output can be saved in different stores such as Cassandra, Solr, Kafka, etc. Data that goes into each store is modeled in that store’s specific schema definition.
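The fan-out to stores can be pictured as follows. The store names come from this post, but the routing code and record shapes are an invented sketch, not Glassbeam’s implementation:

```python
# One parsed record fanned out to per-store writers, each shaping the
# record to that store's schema (shapes below are illustrative only).

record = {"sys_serial": "ABC123", "cpu_util": 87.5}

def to_cassandra(rec: dict) -> dict:
    # Row keyed by a partition key for time-series style queries
    return {"partition_key": rec["sys_serial"], "columns": rec}

def to_solr(rec: dict) -> dict:
    # Flat document suitable for faceted / full-text search
    return {"id": rec["sys_serial"], **rec}

writers = {"cassandra": to_cassandra, "solr": to_solr}
routed = {store: shape(record) for store, shape in writers.items()}
print(sorted(routed))  # -> ['cassandra', 'solr']
```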
In the next section, we will look at specific examples of logs and how SPL helps parse them.