To Search or Not to Search on Machine Data

Pramod Sridharamurthy

Search is a great starting point for any analytics on machine data. Initially, you don’t know exactly what you are looking for, so finding the “needle in a haystack” is easy: all you need to do is type needle! Yes, you will get a lot of results back, which then need to be filtered, ranked and presented in a meaningful way, but open source search engines that allow full-text search of any document, such as Solr/Lucene, provide a good starting point for a search implementation. Once the data is indexed, a lot of functionality can be built on top of it to enhance search for the user.

While search is good as a starting point, it tends to get geeky for the end user once complex search queries are needed. Even if you manage to write those queries, driving everything from search alone puts a heavy load on the search server. Also, the power of full-text search diminishes when you want to perform the kind of operations normally run on structured data.

Consider a simple example: a device log in which CPU load information is captured as a table by the Linux vmstat command, as shown below. This is just one section among a whole lot of other information in the log.

r b    swpd    free  buff   cache si so bi  bo   in  cs us sy id wa st
8 1 1865144 1018628 56364 1677296 10 16 35 161   97  92  4  0  2  1  0
2 2 1865144 1018628 56364 1677296  0  0  0   4 1007 395  0  0 30  0  0
4 2 1865144 1018628 56376 1677296  0  0  0  92 1016 421  0  0 10  0  0
4 2 1865144 1018628 56376 1677296  0  0  0   0 1008 363  0  0 10  0  0
6 3 1865144 1018628 56376 1677316  0  0  0   0 1003 361  0  0  5  0  0
3 1 1865144 1018628 56384 1677308  0  0  0  20 1006 341  0  0 10  0  0
1 1 1865144 1018628 56384 1677332  0  0  0   8 1009 364  0  0 50  0  0
0 0 1865144 1018628 56396 1677320  0  0  0  92 1019 448  0  0 99  0  0

Let’s say we need to plot r and b over time, and also look at CPU utilization, via the idle percentage (id), whenever r or b is greater than 2.

A requirement like this is complex, if not impossible, to meet through search alone. Now assume that the data above has been parsed and is available in a database-like table. Performing the analysis described above becomes relatively simple. The user can also be given a simple interface for browsing the extracted attributes, with drag-and-drop controls to slice and dice the data. This approach, though, requires upfront data extraction, a data store to persist the results, and a simple framework to do it all efficiently.
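To make the parsed-table idea concrete, here is a minimal sketch in Python. It assumes the vmstat section has already been cut out of the log as plain text; the function names parse_vmstat and busy_samples are hypothetical, and because the sample carries no timestamps, the row position stands in for time.

```python
# Sketch only: turn the vmstat section into rows of named integer fields,
# then answer "what is id whenever r or b is greater than 2?".

VMSTAT_SECTION = """\
r b swpd free buff cache si so bi bo in cs us sy id wa st
8 1 1865144 1018628 56364 1677296 10 16 35 161 97 92 4 0 2 1 0
2 2 1865144 1018628 56364 1677296 0 0 0 4 1007 395 0 0 30 0 0
4 2 1865144 1018628 56376 1677296 0 0 0 92 1016 421 0 0 10 0 0
4 2 1865144 1018628 56376 1677296 0 0 0 0 1008 363 0 0 10 0 0
6 3 1865144 1018628 56376 1677316 0 0 0 0 1003 361 0 0 5 0 0
3 1 1865144 1018628 56384 1677308 0 0 0 20 1006 341 0 0 10 0 0
1 1 1865144 1018628 56384 1677332 0 0 0 8 1009 364 0 0 50 0 0
0 0 1865144 1018628 56396 1677320 0 0 0 92 1019 448 0 0 99 0 0
"""


def parse_vmstat(text):
    """Return one dict per sample row, keyed by the vmstat column names."""
    lines = [ln.split() for ln in text.strip().splitlines() if ln.strip()]
    header, rows = lines[0], lines[1:]
    return [dict(zip(header, map(int, row))) for row in rows]


def busy_samples(rows, threshold=2):
    """Return (sample index, r, b, id) for rows where r or b exceeds threshold."""
    return [
        (i, row["r"], row["b"], row["id"])
        for i, row in enumerate(rows)
        if row["r"] > threshold or row["b"] > threshold
    ]


rows = parse_vmstat(VMSTAT_SECTION)

# Series for plotting r and b over time (sample index is an assumed time axis).
r_series = [row["r"] for row in rows]
b_series = [row["b"] for row in rows]

busy = busy_samples(rows)
for index, r, b, idle in busy:
    print(f"sample {index}: r={r} b={b} id={idle}")
```

With the sample above, the filter keeps samples 0, 2, 3, 4 and 5. Once the rows are structured, the condition is a one-line filter, which is the part that is awkward to express as a full-text query.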

The question, then, is not analytics through search versus analytics through pre-parsed content. It is about using the right method for the type of content, so that effective results can be seen with minimal cost, time and resources. So why not both? When the data is both parsed and indexed appropriately, it opens up avenues that are not possible with either approach alone. For instance, the parsed content can add useful context to what is being searched, while search as an interface gets you to the content in the simplest and quickest way. You are no longer searching for a pattern; you are searching for a pattern within a well-defined context.

For example, consider a search like this: “Error code 5085” where Software version = 5.4 or 5.6, Model = 9877, License = NFS, Language Setting = US-EN. Assume that each context variable passed to the search comes from some section in the log, and that an attribute of that section can have multiple values. Pre-parsed content can also be transformed or pre-aggregated to improve response times for end-user queries. A combined approach can therefore provide a lot more options than an either/or approach.
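The combined approach can be sketched in a few lines of Python. Everything here is illustrative: the records in LOGS are made-up stand-ins for indexed log bundles, the attribute names are hypothetical, and a plain substring match stands in for a real full-text engine. Each context attribute holds a set, since an attribute can have multiple values.

```python
# Sketch only: full-text match on the raw log text, narrowed by context
# attributes that were parsed out of the same log bundle.

# Hypothetical records, invented for illustration.
LOGS = [
    {"id": 1, "text": "Error code 5085 raised during volume mount",
     "context": {"software_version": {"5.4"}, "model": {"9877"},
                 "license": {"NFS", "CIFS"}, "language": {"US-EN"}}},
    {"id": 2, "text": "Error code 5085 raised during volume mount",
     "context": {"software_version": {"5.2"}, "model": {"9877"},
                 "license": {"NFS"}, "language": {"US-EN"}}},
    {"id": 3, "text": "Error code 7001 raised during volume mount",
     "context": {"software_version": {"5.6"}, "model": {"9877"},
                 "license": {"NFS"}, "language": {"US-EN"}}},
    {"id": 4, "text": "error code 5085 seen after failover",
     "context": {"software_version": {"5.6"}, "model": {"9877"},
                 "license": {"NFS"}, "language": {"US-EN"}}},
]


def search(logs, phrase, **filters):
    """Return logs whose text contains phrase and whose context matches.

    Each filter maps an attribute name to a set of allowed values. A log
    passes a filter when it shares at least one value with the allowed set.
    """
    needle = phrase.lower()
    hits = []
    for log in logs:
        if needle not in log["text"].lower():
            continue  # the full-text part of the query failed
        context = log["context"]
        if all(context.get(name, set()) & set(allowed)
               for name, allowed in filters.items()):
            hits.append(log)
    return hits


hits = search(
    LOGS,
    "Error code 5085",
    software_version={"5.4", "5.6"},
    model={"9877"},
    license={"NFS"},
    language={"US-EN"},
)
print([log["id"] for log in hits])
```

Record 2 contains the phrase but is dropped on software version, and record 3 matches the context but not the phrase, so only records 1 and 4 come back. In a real deployment the text match would go to the search index and the context filter to the parsed store, or to filter fields on the index itself.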

At Glassbeam, we are constantly looking to provide practical solutions to Big Data problems. This post describes just one such use case.