Designing for the Internet of Things Using Cassandra

ASHOK AGARWAL
Wednesday, September 25, 2013

I like the word “ontology”. It has a nice ring to it. Wikipedia defines Ontology as “knowledge as a set of concepts within a domain, and the relationships among those concepts”. When applied to machine data analytics (“domain”), we see that unless we isolate concepts and understand the relationships, we cannot obtain “knowledge”.

In the “Internet of Things”, the “things” are intricate machines producing complicated multi-structured logs. Isolating “concepts” is all about parsing data from these multi-structured logs. If you further think about data that you cannot automagically “recognize” (like time series events), parsing becomes a complex iterative process. For example, look at these syslog entries:

Sep 23 10:26:39 gbdev2g1 restorecond: Will not restore a file with more than one hard link (/etc/resolv.conf) No such file or directory

Sep 23 10:28:56 gbdev2g1 kernel: Removing netfilter NETLINK layer.

Sep 23 10:29:01 gbdev2g1 kernel: ip_tables: (C) 2000-2006 Netfilter Core Team

Sep 23 10:29:01 gbdev2g1 kernel: Netfilter messages via NETLINK v0.30.

Sep 23 10:29:01 gbdev2g1 kernel: ip_conntrack version 2.4 (8192 buckets, 65536 max) - 304 bytes per conntrack

Sep 23 10:34:02 gbdev2g1 kernel: Removing netfilter NETLINK layer.

Sep 23 10:34:05 gbdev2g1 kernel: ip_tables: (C) 2000-2006 Netfilter Core Team

Or at these Apache log entries:

10.163.36.23 - - [19/Sep/2013:20:57:23 -0400] "HEAD / HTTP/1.1" 200 -

168.143.88.103 69993 - - [19/Sep/2013:20:57:28 -0400] "POST /cgi-bin/ui/prod/upload.cgi HTTP/1.1" 200 359

168.143.88.103 53193 - - [19/Sep/2013:20:58:07 -0400] "POST /cgi-bin/ui/upload.cgi HTTP/1.1" 200 359

10.163.36.23 - - [19/Sep/2013:20:58:08 -0400] "HEAD / HTTP/1.1" 200 -

10.163.36.23 - - [19/Sep/2013:20:58:53 -0400] "HEAD / HTTP/1.1" 200 -
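
Every line in these samples follows one fixed layout, so a single regular expression can split it into fields. Here is a minimal sketch for the syslog lines, assuming the layout "month day time host program: message". The function name and field names are mine, chosen for illustration.

```python
import re

# Assumed layout: "<Mon> <day> <HH:MM:SS> <host> <program>: <message>"
SYSLOG_LINE = re.compile(
    r"^(?P<timestamp>\w{3}\s+\d{1,2} \d{2}:\d{2}:\d{2}) "
    r"(?P<host>\S+) "
    r"(?P<program>[^:\s]+): "
    r"(?P<message>.*)$"
)


def parse_syslog_line(line):
    """Return a dict of named fields, or None if the line does not match."""
    match = SYSLOG_LINE.match(line.strip())
    return match.groupdict() if match else None


entry = parse_syslog_line(
    "Sep 23 10:28:56 gbdev2g1 kernel: Removing netfilter NETLINK layer."
)
```

Each named group becomes one field of the parsed event, and a line that does not fit the layout comes back as None rather than raising an error.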

While they contain important information, from a parsing perspective they are not very interesting. But now look at this data:

—————— Monthly Data ———————

< getmacid for pnode 1 >
MAC ID is DD-05-EF-6E-1E-8B
< cluster-getmac for pnode 1 >
cluster MAC: 8E-D1-B0-E1-F4-9B
< id for pnode 1 >
Server name: stlscn1prd01
Comment: Sys BLI
Company: Vang Systems
Department:

 

Location: 800 North Lindbergh Ave.
Building J
Saint Louis, Missouri 63167
United States

 

<ver for pnode 1 >
Model: 110
Software: 7.0.2050.18c (built 2011-04-06 17:31:25+01:00)
Hardware: Mercury (M2SEKW1050134)
——————————————————————

Not only can you not recognize it, you need to work hard to parse it.
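
To give a feel for the work involved, here is a rough sketch of one way to attack a block like this. It groups lines under each "< command >" header, splits "Key: value" lines, and treats a line with no key as a continuation of the previous value. These rules are my assumptions about the format, not a description of Glassbeam's parser, and the banner lines are left out of the sample for brevity.

```python
import re

HEADER = re.compile(r"^<\s*(.+?)\s*>$")                # e.g. "< id for pnode 1 >"
KEY_VALUE = re.compile(r"^([A-Za-z][\w ]*):\s*(.*)$")  # e.g. "Model: 110"


def parse_sections(text):
    """Group lines by command header, then split each group into fields."""
    sections = {}
    fields = None    # fields of the section currently being read
    last_key = None  # most recent key, used for continuation lines
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        header = HEADER.match(line)
        if header:
            fields = sections.setdefault(header.group(1), {})
            last_key = None
            continue
        if fields is None:
            continue  # text before the first header is ignored
        pair = KEY_VALUE.match(line)
        if pair:
            last_key = pair.group(1)
            fields[last_key] = pair.group(2)
        elif last_key is not None:
            # No "Key:" prefix: assume it continues the previous value.
            fields[last_key] = (fields[last_key] + " " + line).strip()
        else:
            # Free-form line such as "MAC ID is ..."; keep it unparsed.
            fields.setdefault("_raw", []).append(line)
    return sections


SAMPLE = """\
< getmacid for pnode 1 >
MAC ID is DD-05-EF-6E-1E-8B
< id for pnode 1 >
Server name: stlscn1prd01
Department:
Location: 800 North Lindbergh Ave.
Building J
<ver for pnode 1 >
Model: 110
"""

parsed = parse_sections(SAMPLE)
```

Even this small sketch has to guess: a continuation line that happens to contain a colon would be mistaken for a new key, and the "MAC ID is ..." line is kept raw until someone writes a rule for it. This is why parsing data of this kind tends to be iterative.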

Interesting, deep analytics (or “Knowledge”) happens only as more and more of this information becomes available as parsed data. The requirements for analytics are almost never known upfront, and often even the meaning of the log data is not fully defined upfront. So, what do you do? The obvious choices are to wait until all the information is available, or to make a start and refine it as you go along. I have noticed that as you show things to customers, the juices start flowing, and they start asking for more and more things.

The trick is architecting a backend that can adapt to the uncertainty of the requirements. Relational databases are rigid in their schemas and may need all data to be reloaded when the schema changes. Key-value stores suffer from the need to update existing values when you go back and parse more data. At Glassbeam we use Cassandra, which allows dynamic columns: you start by parsing what is known and simply add more columns as new information becomes available. There is no need to update or rewrite previously written data. This process is called Incremental Parsing.

In Cassandra you can save a dynamic set of columns attached to a primary key. Then, as more data needs to be parsed, simply run a MapReduce process that parses additional columns from the old logs. Since these new columns have the same primary key, they just get added to the previously written data: no fuss, no mess. You can keep enriching the data, producing more “Knowledge”, and refining the ontology.
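
The idea can be illustrated without a cluster. The sketch below uses a plain Python dictionary as a stand-in for a Cassandra table (primary key mapped to a set of columns) and runs two parsing passes over the same raw log. The row key (server name plus month), the helper names, and the choice of fields are hypothetical. A real deployment would write through a Cassandra client instead.

```python
from collections import defaultdict


def parse_fields(text, wanted):
    """Extract only the 'Key: value' lines whose key is in `wanted`."""
    found = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() in wanted:
            found[key.strip()] = value.strip()
    return found


def parse_pass(store, raw_logs, wanted):
    """Parse the wanted fields from every raw log and add them as columns."""
    for row_key, text in raw_logs.items():
        # New column names are added next to the ones already stored.
        store[row_key].update(parse_fields(text, wanted))


# Hypothetical row key: (server name, month). Raw logs are kept for re-parsing.
row_key = ("stlscn1prd01", "2013-09")
raw_logs = {row_key: "Model: 110\nHardware: Mercury (M2SEKW1050134)"}

store = defaultdict(dict)  # in-memory stand-in for the table

parse_pass(store, raw_logs, {"Model"})     # first pass: what is known today
after_first = dict(store[row_key])

parse_pass(store, raw_logs, {"Hardware"})  # later pass: a newly requested field
after_second = dict(store[row_key])
```

After the first pass the row holds only the Model column. After the second it holds Model and Hardware, and the second pass never had to read or rewrite what the first one stored.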
