Scalable Machine Learning with Apache Spark and MLbase

Friday, July 11, 2014

Big Data applications are no longer a nice-to-have but a must-have for many organizations. Many enterprises are already using the massive amounts of data collected in their organizations to understand and serve their customers better. Those that have not yet learned how to use it will, over time, be left behind. Companies are now looking for platforms that not only provide analytical capabilities over their data, but also help them become proactive and predictive. Machine learning capability is therefore becoming an important component of any analytical platform.

At Glassbeam, we are constantly working to enhance our Big Data platform with more functionality. Machine data analytics is our passion, and all our research and development goes toward making machine data meaningful for our customers. One of the features we wanted to add to our platform was Machine Learning (ML) capability. While machine learning can mean a lot of things, our focus was to help our customers predict system/component failure in the field. In this post I will cover my findings from the evaluations we did of open source ML options.

While looking at existing open source libraries, there were clearly two choices for adding scalable ML capabilities:

  1. MLbase
  2. Apache Mahout

I will cover our findings on MLbase in this post and Mahout in a future post. I will also document how Glassbeam’s SCALAR can help with ML capabilities instead of building an ML platform from the ground up.

Current state of ML tools

The biggest problem with current ML tools and libraries is that they are suited for either operational data scientists or investigative ones. Investigative data scientists are mainly focused on experimenting and building models using tools like R, Python, scikit-learn, etc. While these tools provide excellent libraries for statistical analysis, the problem is that they don’t scale well. Operational data scientists are responsible for translating these models into scalable versions using Java, C++, Hadoop, Mahout, etc. But this adds a lot of complexity, because it is not always easy to write an equivalent scalable/distributed version of the final algorithm.


MLbase provides tools and interfaces to bridge the gap between operational and investigative machine learning.

It provides 3 main components:

  1. MLlib: A distributed low-level ML library written against the Spark runtime that can be called from Scala and Java. The library includes common algorithms for classification, regression, clustering and collaborative filtering.
  2. MLI: An API / platform for feature extraction and algorithm development that introduces high-level ML programming abstractions.
  3. ML Optimizer: This layer aims to simplify ML problems for end users by automating the task of model selection.

Clustering with MLlib

We will see an example of how to use Apache Spark and MLlib for clustering and dimensionality reduction. MLlib includes a modified version of k-means called k-means|| which can be used for clustering. The following flow diagram shows the stages in building a typical clustering platform.

Both feature extraction and feature selection are quite complex tasks when you want a reliable model. Feature extraction requires a lot of time and tooling. Feature selection requires domain expertise, and a wrong selection of features means overall bad cluster quality. Glassbeam’s SPL (Semiotic Parsing Language) can make feature engineering much easier and reduce time to market.

Apache Spark with MLlib

Apache Spark is a cluster computing system. Its primary abstraction is a distributed collection of items called a resilient distributed dataset (RDD). RDDs are fault-tolerant collections of elements that can be operated on in parallel and can be persisted in memory for faster access.

Starting a standalone Spark cluster

Spark provides binary packages for Hadoop and CDH4. You can download them or build directly from source.

Now you can use the packaged utilities to start a master and a worker. You don’t need to start master/workers to play with Spark: Spark provides spark-shell to try out examples, and if you are trying examples inside a Java/Scala project, you can use ‘local’ while creating a Spark context, which will start a pseudo master/worker process.
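With the Spark 1.0 binary layout (the paths and master URL below are placeholders for your environment), starting a standalone master and worker looks roughly like this:

```shell
# From the Spark installation directory: start a standalone master
# (its log prints the master URL, e.g. spark://<hostname>:7077)
./sbin/start-master.sh

# Start a worker and register it against that master URL
./bin/spark-class org.apache.spark.deploy.worker.Worker spark://<hostname>:7077
```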

After this you can point your browser to http://localhost:8080 (the standalone master’s default web UI port) to see the Spark admin UI. Once you submit a job, you will be able to see job stats and status through this interface.

MLlib in action

Let’s say we have extracted the features and defined a feature weight matrix for different events. We can now use this to train a clustering model. The feature matrix can be stored either in HDFS or on the local FS. It will contain the features and their weights. Feature extraction and selection is a straightforward task with Glassbeam’s SPL. The sample data will look like:
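A hypothetical feature-weight file, with one event per row and space-separated weights (the features and values here are made up purely for illustration):

```
0.0 0.1 1.0 0.5
0.1 0.2 0.9 0.4
2.0 1.8 0.1 0.3
2.1 1.9 0.0 0.2
```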

Before we can load the file into the Spark cluster, we need to create a Spark context, which basically tells Spark how to find the master node and shared resources.

There are two ways to work with this example:

1. Spark shell: It’s an interactive shell where you can run commands directly. People coming from a Python/R background will find this familiar. To execute the shell, type:
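Assuming the Spark 1.0 binary layout, the shell launches from the root of the Spark installation:

```shell
./bin/spark-shell
```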

2. Sbt project: You can create a Spark project by including the spark-core and spark-mllib dependencies as below:
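A minimal build.sbt sketch; the project name and version numbers are illustrative (Spark 1.0.0 was current at the time of writing):

```scala
name := "clustering-example"

scalaVersion := "2.10.4"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"  % "1.0.0",
  "org.apache.spark" %% "spark-mllib" % "1.0.0"
)
```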

Next you have to create a Spark context. If you are using spark-shell, a Spark context named ‘sc’ is available by default.
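In a standalone project, the context can be created along these lines (the app name is arbitrary; "local[2]" runs an in-process pseudo cluster, or you can point setMaster at a real spark:// URL):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// "local[2]" runs Spark in-process with 2 worker threads; swap in
// spark://<host>:7077 to run against a standalone master instead.
val conf = new SparkConf()
  .setAppName("ClusteringExample")
  .setMaster("local[2]")
val sc = new SparkContext(conf)
```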

Once you have the context, let’s load and parse the data in Spark:
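A sketch, with the file path as a placeholder:

```scala
// Load the feature-weight file; `data` is an RDD with one line per element
val data = sc.textFile("data/feature-weights.txt")
```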

You can also load the file from HDFS. Next we need to create an RDD of data points from this data.
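Assuming the loaded lines are in an RDD named data, as in the previous step, each line can be parsed into an MLlib dense vector:

```scala
import org.apache.spark.mllib.linalg.Vectors

// Split each space-separated line and convert it to a vector of feature
// weights; cache() keeps the parsed RDD in memory, since k-means iterates
// over it repeatedly.
val parsedData = data.map(line => Vectors.dense(line.split(' ').map(_.toDouble))).cache()
```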

Now let’s define some cluster parameters:
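The values here are arbitrary and should be tuned for your data:

```scala
val numClusters = 4     // k: the number of clusters to find
val numIterations = 20  // maximum iterations of the k-means loop
```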

And run k-means on that dataset:
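Using MLlib’s KMeans object with the RDD and parameters defined above:

```scala
import org.apache.spark.mllib.clustering.KMeans

// Train the clustering model (MLlib uses k-means|| initialization by default)
val clusters = KMeans.train(parsedData, numClusters, numIterations)
```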

You now have your KMeans model ready. To see the cluster centers:
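Each center has the same dimensionality as the input vectors:

```scala
// Print one vector per cluster center
clusters.clusterCenters.foreach(println)
```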

Now you can use this model to predict new incoming data:
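A sketch; the event vector below is hypothetical and must have the same number of features as the training data:

```scala
import org.apache.spark.mllib.linalg.Vectors

// predict() returns the index of the closest cluster center
val newEvent = Vectors.dense(0.1, 0.2, 0.9, 0.4)
val clusterId = clusters.predict(newEvent)
println("Event assigned to cluster " + clusterId)
```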

To evaluate cluster quality, you can calculate the sum of squared errors:
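KMeansModel exposes this directly as computeCost, the within-set sum of squared distances from each point to its nearest center (lower is better for a fixed k):

```scala
val wssse = clusters.computeCost(parsedData)
println("Within Set Sum of Squared Errors = " + wssse)
```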


Dimensionality Reduction

One of the problems with high-dimensional data is that, in many cases, not all the measured features/variables are important for building a model to answer the questions of interest. So in many applications we reduce the dimensionality of the original data before creating any model. In the context of clustering, dimensionality reduction lets us remove the features that have the least variance.

For instance, let’s say we are clustering some 4-dimensional “events” data in which two of the features, priority and threshold, are almost the same across events. It would then make sense to reduce this data to a 2-dimensional sample before building a clustering model.

MLlib at present provides two algorithms for dimensionality reduction:

  1. Singular value decomposition (SVD)
  2. Principal component analysis (PCA)

The following example shows how you can transform your data to a lower dimension using PCA and then build a model using k-means:
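A sketch using MLlib 1.0’s distributed RowMatrix, assuming parsedData, numClusters and numIterations are the names used in the earlier snippets:

```scala
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.distributed.RowMatrix

// Wrap the RDD of 4-dimensional vectors in a distributed row matrix
val mat = new RowMatrix(parsedData)

// Compute the top 2 principal components (returned as a small local matrix)
val pc = mat.computePrincipalComponents(2)

// Project every row onto the components: 4-D events become 2-D points
val projected = mat.multiply(pc).rows

// Cluster in the reduced space
val pcaModel = KMeans.train(projected, numClusters, numIterations)
```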


MLlib has the potential to grow into a unified platform for data scientists. It can drastically reduce time to market when integrated with platforms like Glassbeam’s SCALAR, which can do the heavy lifting of feature extraction and selection.

Glassbeam’s SCALAR platform provides the Semiotic Parsing Language (SPL), which can be leveraged to extract data quickly and define complex relationships between different data points in a log, or across logs, for IoT data. Users can use icons (pattern recognizers) to extract features from semi-structured or structured data and then define relationships between them. Glassbeam’s solution engineering team brings domain expertise, which is very important for feature selection.

Stay tuned for the next post on how to use MLlib for prediction.