Semiotic parsing language – the language for machine data analytics

Pramod Sridharamurthy
Tags: SPL

The Internet of Things (IoT), an ever-growing number of connected devices, generates vast amounts of data, called “Machine Data”, complex in variety, volume and velocity. Machine data is information about the device, like its configuration, status, performance, usage and more. Manufacturers across data-intensive industries – such as storage, wireless, networking and medical devices – are struggling to make sense of all this data. Analyzing machine data can help organizations with reactive diagnostic activity, predictive problem identification, and business intelligence.

The first step before any meaningful analytics is data preparation. While data preparation and transformation are complex undertakings even for structured data, doing the same for multi-structured data increases the complexity by an order of magnitude. Multi-structured data requires parsing complex logs, handling the volume, velocity and variety in the logs, handling variations in the log format across software versions of the device generating the logs, and many more such complexities. Once the data is parsed, the problem of transformation comes into play.

While data preparation and transformation is the first step, the real value of the prepared data is realized when it is used by different applications. The prepared data could be an input to a search index, act as a source for analytics, or be the data set for machine learning. Ingesting machine data from these devices and converting it into actionable information requires a purpose-built platform.

Glassbeam’s SCALAR and its language for analyzing machine data, SPL (Semiotic Parsing Language), are patent-pending technologies that provide the big data platform for ingesting and making sense of machine data. While this blog does not provide a comprehensive description of all features of SPL, it does give the reader a brief introduction to the power of SPL.

Challenges in Machine Data Analytics

I would like to classify challenges in machine data analytics into three main categories:

  1. Volume and velocity of data
  2. Variety in log data
  3. Time to market and manageability

Volume and Velocity

In 2009, Gartner stated that data would grow by 650% over the following five years and that most of the growth would be the byproduct of machine-generated data. IDC estimated that in 2020, there will be 26 times more connected things than people. Wikibon issued a forecast of $514 billion to be spent on the Industrial Internet in 2020. Given the voluminous nature of machine-generated data, analysis presupposes the scale of a big data project that requires specialized tools and specialized skills to ensure responsive analytics.

Variety in log formats

Variety refers to the variations in the log formats that need to be supported in any machine data analytics project. Variety in formats is not just across logs, but also within a single log file. Data can vary from structured sources like CSV, JSON, and XML, to semi-structured data like tables, name-value pairs, and syslog formats, to unstructured data, which is free-flowing text. Software applications and technology devices produce much more machine data than just log files. They also generate product usage capacity and license information, configurations and settings, and other business data. Like the data in log files, this data is text-based and has structure. However, it does not have consistent formatting. The diagram below illustrates this.

[Figure: Variety in Log Formats]
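
To make the variety problem concrete, here is a minimal Python sketch of what handling several formats within a single log file can look like. It is only an illustration, not how SCALAR or SPL work: the function name `parse_line`, the sample lines, and the detection rules are all assumptions made up for this example.

```python
import csv
import io
import json


def parse_line(line: str) -> tuple[str, object]:
    """Return (format_name, parsed_value) for one log line (hypothetical helper)."""
    text = line.strip()
    # Structured: a JSON object on a single line.
    if text.startswith("{"):
        try:
            return "json", json.loads(text)
        except json.JSONDecodeError:
            pass  # not valid JSON, fall through to the other formats
    # Semi-structured: space-separated name=value pairs.
    tokens = text.split()
    if tokens and all("=" in token for token in tokens):
        return "name_value", dict(token.split("=", 1) for token in tokens)
    # Structured: comma-separated fields (no header is available here).
    if "," in text:
        return "csv", next(csv.reader(io.StringIO(text)))
    # Unstructured: free-flowing text is kept as-is.
    return "text", text


# Made-up sample lines standing in for one mixed-format log file.
sample_log = [
    '{"device": "array-01", "status": "ok"}',
    "cpu=85 mem=62 temp=41",
    "disk0,online,512",
    "Fan speed returned to normal",
]
parsed = [parse_line(line) for line in sample_log]
```

Even this toy version shows why the problem grows quickly: a free-text line that happens to contain a comma would be misread as CSV, and every new format adds another rule to maintain.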

Time to Market and Manageability

While parsing machine data is the first step in analytics, processing machine data into usable information often involves the other activities listed below, which pose serious challenges to the manageability of data and to the time to market.

Model the Source Data – This step looks at the available log sources and the business requirement and maps the two. In this step, one decides which logs to parse and what to parse within those logs and how they should be represented to meet business requirements.

Develop Parsers – Once the requirements have been identified, the next step is to write parsers that extract the required data from the log files.

Design the Target Schema – Depending on the business requirements, the target schema needs to be planned. This is not an easy task as the requirements vary between the groups in an organization and also, the requirements change over a period of time. Planning a generic schema that satisfies all user requirements and yet is flexible enough for change is challenging.

Metadata Tagging – The parsed data need to be viewed through a UI. Therefore, the UI should be capable of representing parsed data. This representation needs to be driven through a metadata layer for flexibility and scalability.

Cube Modelling – As the data in the target database grows, performance will be impacted. Creating data structures similar to OLAP cubes with relevant data becomes mandatory to meet business needs.

Handling Version Changes – Log formats change over time. Every new version of a device brings new data and new formats into the log file. Keeping track of the changes and supporting these changes in the parser can be a very complex task.

Incremental Parsing – With changes in format or a change in requirements, new data needs to be parsed. This data might come from new files or can be parts of existing log files that were not yet parsed. Additional changes to the parsing logic will require changes to the database schema, which in turn might require a reparse and reload of all the log data.

As you can see, converting machine data into meaningful information that can be mined requires a significant investment in time and resources. Lacking a framework to do this quickly, and to stay nimble as things change, lengthens the time to market.
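
As an illustration of the version-change activity described above, the sketch below keeps one hand-written parser per log format and picks the right one from the device's software version. Everything in it is hypothetical: the two log formats, the version numbers, and the names `parse_v1`, `parse_v2`, `PARSERS` and `parser_for` are assumptions made up for this example.

```python
from typing import Callable

Parser = Callable[[str], dict]


def parse_v1(line: str) -> dict:
    # Hypothetical v1 format: "<serial> <status>"
    serial, status = line.split()
    return {"serial": serial, "status": status}


def parse_v2(line: str) -> dict:
    # Hypothetical v2 format adds a temperature: "<serial>|<status>|<temp>"
    serial, status, temp = line.split("|")
    return {"serial": serial, "status": status, "temp_c": int(temp)}


# Each entry: (first version that uses the format, parser), in ascending order.
PARSERS: list[tuple[tuple[int, int], Parser]] = [
    ((1, 0), parse_v1),
    ((2, 0), parse_v2),
]


def parser_for(version: str) -> Parser:
    """Pick the parser for the newest format introduced at or before `version`."""
    wanted = tuple(int(part) for part in version.split("."))
    chosen = None
    for first_version, parser in PARSERS:
        if first_version <= wanted:
            chosen = parser
    if chosen is None:
        raise ValueError(f"no parser for version {version}")
    return chosen


old_record = parser_for("1.4")("SN100 ok")
new_record = parser_for("2.1")("SN200|degraded|47")
```

Note that the two records no longer share the same fields, so the target schema has to change along with the parser. That is the maintenance burden this section describes.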

Glassbeam’s SCALAR and SPL

To solve the problems of variety, volume, velocity, time to market and the overall manageability of machine data analytics, we created a purpose-built hyperscale platform called SCALAR, with a language to parse, tag, model, and store data called Semiotic Parsing Language (SPL). The diagram below represents the components of Glassbeam’s platform.

[Figure: Glassbeam SCALAR Platform]

SPL provides a single-step solution for parsing an arbitrary document and storing it in a structured data store.


In a single development step, SPL allows the developer to:

  1. Filter files that need to be parsed, stored or indexed.
  2. Syntactically parse the document as sections and elements.
  3. Decide which elements within a section should be stored, which should be indexed, and which should be both stored and indexed.
  4. Map document sections and elements to the schema.
  5. Optionally group elements parsed within or across sections as logical objects.
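
To give a feel for what combining these five steps in one place means, here is a small Python sketch. It is not SPL syntax and does not reflect how SPL is implemented. The section markers, the element names, the `*.log` filter and the names `Element`, `SECTIONS` and `process` are all assumptions made up for this example.

```python
from dataclasses import dataclass
from fnmatch import fnmatch


@dataclass
class Element:
    name: str            # name of the value inside the section
    column: str          # step 4: target schema column
    store: bool = True   # step 3: keep in the structured store
    index: bool = False  # step 3: also send to the search index


# Step 2: the document is described as sections, each with its own elements.
SECTIONS = {
    "[system]": [Element("serial", "device_serial", index=True),
                 Element("model", "device_model")],
    "[disk]": [Element("used_gb", "disk_used_gb"),
               Element("note", "disk_note", store=False, index=True)],
}


def process(filename: str, text: str):
    """Return (stored_row, indexed_terms), or None if the file is filtered out."""
    # Step 1: filter files by name.
    if not fnmatch(filename, "*.log"):
        return None
    row, indexed, current = {}, {}, None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("[") and line.endswith("]"):
            # A header starts a new section; unknown sections are skipped.
            current = line if line in SECTIONS else None
            continue
        if current is None or ":" not in line:
            continue
        name, value = (part.strip() for part in line.split(":", 1))
        for element in SECTIONS[current]:
            if element.name == name:
                if element.store:
                    row[element.column] = value      # step 4: map to the schema
                if element.index:
                    indexed[element.column] = value
    # Step 5: values from different sections are grouped into one logical row.
    return row, indexed


doc = """[system]
serial: SN100
model: X9
[disk]
used_gb: 412
note: rebuild in progress
"""
result = process("status.log", doc)
skipped = process("status.txt", doc)
```

In this sketch the description of the document (`SECTIONS`) carries the parsing, the store-or-index decision and the schema mapping together, so a change in requirements touches one definition rather than several separate scripts.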

While most ETL (Extract Transform Load) scripts tend to do just part of the above using regular expressions [extract] combined with column mapping [load], SPL:

  • Allows the log to be defined as a hierarchical set of sections, each with its own elements
  • Avoids (though, of course, allows) the use of regular expressions
  • Yields all metadata for building the user interface/applications
  • Ties different files AND streams together

In summary, SPL is a very powerful yet simple-to-use modelling language for machine data analytics in the Internet of Things world.