Glassbeam for Glassbeam (Part 4) - Helping Smart Engineering Teams be Smarter

Pramod Sridharamurthy
Monday, August 7, 2017

In this post, I explain how Glassbeam’s engineering team uses the Glassbeam Analytics solution to respond quickly and effectively to every bug and to the not-so-nice user experiences that our customers could potentially face.

Here are the guidelines our Engineering team follows when seeking answers from the log data-driven Glassbeam Analytics solution:

  1. understand system stability
  2. resolve problems
  3. understand metrics
  4. optimize hosting for best performance
  5. track API response times
  6. understand feature usage
  7. be aware of top issues being faced by the customer

Glassbeam Engineering teams instrument every feature to collect metrics that help them understand every aspect of the feature based on the above guidelines.


Tracking API response times for better user experience

Glassbeam’s middleware layer is called InfoServer. Every click on the UI results in API calls to the backend layer. The data queries, joins, and other computations are then done by the middleware, so tracking the response time of every API is critical to delivering the best user experience.

Any API with an unacceptable response time is immediately flagged through our Rules Engine. Besides this tactical responsiveness, Engineering also looks at historical data to see which APIs had response times beyond the accepted limit. The team also looks at how often the calls exceeded the limit and which customer was impacted the most.
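The historical analysis described above can be sketched roughly as follows. This is only an illustration in Python, not Glassbeam’s Rules Engine: the record fields (`api`, `customer`, `response_ms`), the sample data, and the 2000 ms limit are all assumptions made up for the example.

```python
from collections import Counter, defaultdict
from typing import NamedTuple


class ApiCall(NamedTuple):
    """One logged API call (hypothetical record layout)."""
    api: str
    customer: str
    response_ms: int


def summarize_slow_calls(calls, limit_ms):
    """For each API, count calls over the limit and find the most affected customer."""
    per_api = defaultdict(Counter)  # api -> Counter of slow calls per customer
    for call in calls:
        if call.response_ms > limit_ms:
            per_api[call.api][call.customer] += 1

    summary = {}
    for api, by_customer in per_api.items():
        worst_customer, _ = by_customer.most_common(1)[0]
        summary[api] = {
            "slow_calls": sum(by_customer.values()),
            "most_impacted_customer": worst_customer,
        }
    return summary


# Made-up sample data; the limit is an assumed value.
calls = [
    ApiCall("search", "acme", 3200),
    ApiCall("search", "acme", 2700),
    ApiCall("search", "globex", 2100),
    ApiCall("search", "globex", 400),
    ApiCall("dashboard", "acme", 150),
]
summary = summarize_slow_calls(calls, limit_ms=2000)
print(summary)
```

APIs that never exceed the limit do not appear in the summary at all, which keeps the result focused on the calls that need attention.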

The following bubble chart shows the API response times of an event tracked by the Engineering team. The bubble size indicates the response time, and the colour indicates the number of times that API was called.

API count - Glassbeam for Glassbeam

Troubleshooting chronic problems

Recently, the Engineering team was facing a time-out issue in our parsing engine, which resulted in some log files skipping the parsing phase. The team set up a rule that detected this condition to avoid data loss. However, since the issue was happening infrequently, it was being resolved tactically: whenever the problem reoccurred, the monitoring team would reparse the file and the problem was solved.

This issue was prioritized for a fix in a patch release. The engineer who took up the bug had a tough time reproducing the scenario because it was intermittent and sporadic. In a distributed computing environment, as you may know, solving a problem that happens on one thread, when thousands of threads (AKKA actors in this case) are spawned across multiple machines, is not easy. The engineer wanted to understand all the historical scenarios in which this had happened: on which parsers the issue occurred, which customers’ logs were affected, and so on.

Glassbeam Analytics solution rescues our engineering team

The engineer used the Glassbeam Workbench app to set up a custom dashboard to track the AKKA actors. The custom dashboard became the starting point for permanently resolving the issue of some logs skipping the parsing engine.

The custom AKKA actor dashboard gave the engineer insight into how often the issue was occurring, which parsers were causing the glitch, and which classes of actors were affected. Drilling down further into the chart, the engineer discovered the affected customer and the log file name and path.

With the custom dashboard in place, the scenarios under which the issue occurred became crystal clear. Further analysis was conducted using the Glassbeam Explorer app, which makes it easy to assess, for every occurrence of the issue, what events were occurring across other components of the system when logs skipped the parsers.

Armed with all this information, the engineer was able to identify the problem and fix it. It doesn’t end there. Once the problem was fixed and the patch deployed, the engineer added himself to the rule that gets triggered when this event occurs. This way, the engineer could track whether the fix had indeed worked and whether there were any other edge cases in which the issue still reoccurred.

Glassbeam for Glassbeam - Timeout count
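The idea of an engineer adding himself to an existing rule can be pictured with a small sketch. The `Rule` class, its methods, the event dictionary shape, and all names below are hypothetical and invented for illustration; they do not describe how Glassbeam’s Rules Engine is actually built.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Rule:
    """A named condition plus the people to notify when it matches (hypothetical)."""
    name: str
    condition: Callable[[dict], bool]
    subscribers: list = field(default_factory=list)

    def subscribe(self, who):
        """Add a subscriber once; repeated calls are ignored."""
        if who not in self.subscribers:
            self.subscribers.append(who)

    def evaluate(self, event):
        """Return one (subscriber, message) pair per subscriber if the event matches."""
        if not self.condition(event):
            return []
        message = f"{self.name}: {event}"
        return [(who, message) for who in self.subscribers]


# Assumed event shape: a dict with a "type" key and free-form details.
parse_timeout = Rule(
    name="parse-timeout",
    condition=lambda event: event.get("type") == "parse_timeout",
    subscribers=["monitoring-team"],
)
parse_timeout.subscribe("engineer")  # added once the patch is deployed

notifications = parse_timeout.evaluate(
    {"type": "parse_timeout", "parser": "parser-a", "file": "example.log"}
)
quiet = parse_timeout.evaluate({"type": "parse_ok", "file": "example.log"})
```

If the fix holds, the rule stays quiet; any notification the engineer receives afterwards points directly at an edge case the patch did not cover.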

Simplifying testing and being notified of issues that are hard to test

Our Semiotic Parsing Language (SPL), like any other language, has data types for strings, integers, and real numbers. Data types in SPL hold the data that is parsed from the log. The type and size declared in SPL are used to design the downstream data store schema.

When parsing log data, it is always a challenge to identify the right data type for each piece of information in the logs, because we have no control over the log and there is no documentation that defines the data type of each element printed in it. In addition, because so many variations of the logs are generated, it is impractical to test against all of them.

For example, one of our customers, a storage vendor, has built 10 models of a storage device over time, all of which are installed across its customer base. Over the last several years, the vendor has released many versions of the software (both major and minor), and its users run everything from the earliest release to the most recent one. This means that the logs we receive at Glassbeam from this customer’s install base span all of those software releases. Imagine this: there are about 1200 major and minor versions out in the field. With 10 models of storage and 1200 software versions, there could theoretically be 12,000 variations of any piece of information that we parse. Getting a sample of every possible log variation is already a challenge, and testing all of them, even if we had those samples, is impractical. So the first challenge is testing, and the second is keeping an eye out for the edge cases that our testing missed among the varied logs coming from so many software releases.

While writing the SPL code, developers use their best judgement to define the right data type based on the sample set of logs available at the time. The developer has to be conservative with the choice of data types, so that we don’t overestimate unnecessarily and impact the downstream data store schema. But being conservative also means that we might receive data that is bigger than what the data type allows, in which case the definition needs to be adjusted.

Thorough testing is the first step in solving this problem. To catch the remaining edge cases, we log every time we truncate data, and the log entry carries all the information needed to identify the SPL and the column that needs the change. Our developers set a rule to be notified whenever this condition occurs, so that those edge cases can be addressed immediately. Our developers use a report like the one shown below to see the changes that need to be made to the defined data types when tested against a large set of log samples.

Glassbeam for Glassbeam
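The truncate-and-log approach described in this section might look something like the sketch below. It is written in Python rather than SPL, and the function names, the log entry fields, and the sample values are all assumptions for illustration, not Glassbeam’s implementation.

```python
from collections import defaultdict


def store_value(spl, column, value, declared_size, truncation_log):
    """Truncate value to the declared size and record the event if data was cut."""
    if len(value) > declared_size:
        truncation_log.append(
            {"spl": spl, "column": column,
             "declared": declared_size, "observed": len(value)}
        )
        return value[:declared_size]
    return value


def sizing_report(truncation_log):
    """Largest observed length per (spl, column): the size the type should grow to."""
    needed = defaultdict(int)
    for entry in truncation_log:
        key = (entry["spl"], entry["column"])
        needed[key] = max(needed[key], entry["observed"])
    return dict(needed)


# Made-up values: the "hostname" column is declared too small, "model" is fine.
log = []
stored = store_value("storage.spl", "hostname", "node-0123456789", 8, log)
store_value("storage.spl", "hostname", "node-01234567890123", 8, log)
store_value("storage.spl", "model", "X100", 8, log)
report = sizing_report(log)
print(report)
```

Because each entry names both the SPL and the column, the report can be read as a list of data type definitions to widen, with the largest observed length as the suggested new size.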

There are many other use cases related to user behaviour that both Product Management and Engineering share, and I will cover those in the next post. Here, I have picked a small sample of the possible use cases to show how we use our Glassbeam Analytics solution in the Engineering team to become responsive and truly agile.
