Wednesday, 3 June 2015

Microsoft Azure ML


Now that Machine Learning (ML) is becoming more and more important, it's a good idea to get a grasp of its capabilities. In this blog post I'll look at machine learning in more detail: what is machine learning, what is the process of creating an ML model, and what are the common algorithms (in Microsoft Azure ML)?

Data mining techniques were already available in SSAS, but they were hardly used by customers or BI consultants. A while ago Microsoft released Azure Machine Learning, a tool that applies historical data to a problem by creating a model and then using that model to predict future behavior or trends.

For this blog post I'll use Microsoft Azure ML Studio as an example. Below is a screenshot of a tryout in Microsoft Azure ML. On the right is the data flow, with a clean-up and a projection step, and on the left is the Linear Regression algorithm that is applied to the data to train the model.

Machine learning process

Everybody knows the kind of guessing game in which one person describes the characteristics of an object and another person has to guess what it is, a bit like Pictionary. For example: Me: “It's round, green, and edible.” You: “It’s an apple!” So, based on what you already know (training the model), you can guess that the answer should be an apple (scoring and testing the model).

You also have to keep training the model by adding new data, such as green apples, red apples and yellow apples, to improve it. So, there are a couple of steps:
  1. Get the data.
  2. Preprocess the data.
  3. Define features.
  4. Choose and apply an algorithm.
  5. Predict new incoming data.
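
The guessing game can be sketched in a few lines of plain Python. This is only a toy illustration of "train, then predict", not something ML Studio does internally; the `TRAINING` list, the `guess` function and all the example objects are invented for this sketch.

```python
# Hypothetical "training data": known objects described by
# (shape, colour, edibility) together with their label.
TRAINING = [
    (("round", "green", "edible"), "apple"),
    (("round", "red", "edible"), "apple"),
    (("round", "yellow", "edible"), "apple"),
    (("long", "yellow", "edible"), "banana"),
    (("round", "white", "not edible"), "golf ball"),
]


def guess(description, training=TRAINING):
    """Return the label of the known example sharing the most characteristics."""
    def overlap(example):
        features, _label = example
        return sum(a == b for a, b in zip(features, description))

    _features, label = max(training, key=overlap)
    return label


print(guess(("round", "green", "edible")))
print(guess(("long", "yellow", "edible")))
```

Adding more described objects to `TRAINING` is the equivalent of continuing to train the model with new data.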

1. Get the data

The first thing to do is to get the data from a source. There are multiple options for loading data into ML Studio.

File Formats
The following file formats are supported in ML:
  • CSV file.
  • TSV file.
  • Plain text.
  • Svmlight file (Support Vector Machine).
  • Attribute-Relation File Format (ARFF).
  • Zip file.
  • RObject or Workspace.
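
If you want to inspect a CSV or TSV file yourself before uploading it, a minimal sketch with Python's standard `csv` module could look like this. The helper name `read_delimited`, the column names and the values are made up for illustration; ML Studio does its own parsing.

```python
import csv
import io


def read_delimited(text, delimiter=","):
    """Parse delimited text with a header row into a list of dicts."""
    return list(csv.DictReader(io.StringIO(text), delimiter=delimiter))


# Invented sample data; the same table once as CSV and once as TSV.
csv_text = "make,horsepower,price\nbrand_a,100,11000\nbrand_b,120,13000\n"
tsv_text = csv_text.replace(",", "\t")

csv_rows = read_delimited(csv_text)
tsv_rows = read_delimited(tsv_text, delimiter="\t")

# All values are read as strings; converting them to numbers is a separate step.
print(csv_rows[0])
print(csv_rows == tsv_rows)
```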

Reader options
The Reader module offers some other input options as well:
  • Web URL via HTTP.
  • Hive Query.
  • Azure SQL Database.
  • Azure Table.
  • Azure Blob storage.
  • Data Feed Provider.

2. Preprocess the data

A dataset usually requires some preprocessing before it can be analyzed. You may notice missing values in some columns for some rows; these missing values need to be cleaned so that the model can analyze the data properly.

In Microsoft Azure ML, all kinds of manipulations (transformations) are possible:
  • Filtering.
  • Manipulation, such as adding columns, adding rows, cleaning missing data, grouping categorical values or projecting columns.
  • Creating samples and splitting the data (into a training set and a test set).
  • etc.
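
To make two of these transformations concrete, here is a rough plain-Python sketch of removing rows with missing values and splitting the result into a training set and a test set. It assumes the simplest cleaning strategy (dropping incomplete rows); the function names, the seed and the sample rows are invented for this example.

```python
import random


def drop_missing(rows):
    """Keep only rows in which no value is None or an empty string."""
    return [r for r in rows if all(v is not None and v != "" for v in r.values())]


def split_rows(rows, train_fraction=0.75, seed=42):
    """Shuffle a copy of the rows and split it into (train, test)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Invented rows; two of them have a missing value.
raw = [(100, 10000), (110, None), (120, 12000), (130, 13000), ("", 9000), (150, 15000)]
rows = [{"horsepower": hp, "price": price} for hp, price in raw]

clean = drop_missing(rows)
train, test = split_rows(clean)
print(len(rows), len(clean), len(train), len(test))
```

Dropping rows is the bluntest option; replacing a missing value with, for example, the column mean keeps more data at the cost of some distortion.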

3. Define features

In machine learning we don't talk about dimensions (or attributes) but about features. These are individual measurable properties of something you’re interested in. Each column in the dataset is a feature, and finding a good set of features is a tedious but important task when creating a predictive model. For instance, some columns can be strongly correlated with each other, in which case including both adds little new information to the model.

We'll select the features (columns) with the Project Columns module. To train the model, the dependent variable (the variable that we are going to predict) must be present in the dataset.
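
One simple way to spot columns that carry the same information is to compute the Pearson correlation between them. The sketch below does this in plain Python; the `pearson` helper and the three example columns are made up for illustration, and a high correlation is only a hint that a column may be redundant, not a rule.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long numeric lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Invented columns: engine size in litres and in cubic centimetres
# describe the same thing, so one of them adds nothing new.
engine_litres = [1.2, 1.6, 2.0, 2.5, 3.0]
engine_cc = [1200, 1600, 2000, 2500, 3000]
doors = [4, 2, 4, 2, 4]

print(pearson(engine_litres, engine_cc))  # very close to 1
print(pearson(engine_litres, doors))      # close to 0
```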

4. Choose and apply an algorithm

Constructing a predictive model consists of selecting an algorithm and then training and testing it in order to get the best result. These are the algorithms that are currently available in Microsoft Azure ML:
  • Anomaly Detection
    • One-Class Support Vector Machine
    • PCA-Based Anomaly Detection
  • Classification
    • Multiclass Decision Forest
    • Multiclass Decision Jungle
    • Multiclass Logistic Regression
    • Multiclass Neural Network
    • One-vs-All Multiclass
    • Two-Class Averaged Perceptron
    • Two-Class Bayes Point Machine
    • Two-Class Boosted Decision Tree
    • Two-Class Decision Forest
    • Two-Class Decision Jungle
    • Two-Class Locally-Deep Support Vector Machine
    • Two-Class Logistic Regression
    • Two-Class Neural Network
    • Two-Class Support Vector Machine
  • Clustering
    • K-Means Clustering
  • Regression
    • Bayesian Linear Regression
    • Boosted Decision Tree Regression
    • Decision Forest Regression
    • Fast Forest Quantile Regression
    • Linear Regression
    • Neural Network Regression
    • Ordinal Regression
    • Poisson Regression
So, most of the algorithms focus on classification and regression. Both are so-called supervised learning algorithms. Classification algorithms are used for predicting responses that can have just a few known values (such as married, single, or divorced) based on the other columns in the dataset. Regression algorithms are used for predicting continuous values, such as an age or a price, based on the other columns in the dataset.

An example of an algorithm:
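
To make the idea of a regression algorithm concrete, here is a minimal sketch of simple linear regression (ordinary least squares with a single feature) in plain Python. It is not how ML Studio implements its Linear Regression module; the `fit_line` and `predict` names and the sample numbers are invented for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one feature: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    denominator = sum((x - mean_x) ** 2 for x in xs)
    slope = numerator / denominator
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(model, x):
    """Apply a fitted (slope, intercept) pair to a new feature value."""
    slope, intercept = model
    return slope * x + intercept


# Invented training data that lies exactly on price = 100 * horsepower + 1000.
horsepower = [70, 90, 110, 130]
price = [8000, 10000, 12000, 14000]

model = fit_line(horsepower, price)
print(model)
print(predict(model, 100))
```

Training means finding the slope and intercept from known examples; predicting means applying them to a value the model has not seen.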

5. Predict new incoming data

Now that we have trained the model, we have to score it. The split put 75% of the data in the training set; to score the model, we let it predict the remaining 25% (the test set) and compare the predictions with the actual values to see how well the model performs.

In this example I've dragged a Score Model module onto the canvas.

Below is an example of the scoring experiment.

Here you can see the price and the predicted price based on the features of the dataset.

When you compare the price with the scored labels in the diagram below, you'll see that the predictions are quite accurate at the low end but not in the upper right (perhaps because there is too little data there?).

Finally, to test the quality of the results, select and drag the Evaluate Model module to the experiment canvas, and connect its left input port to the output of the Score Model module. With this module it's also possible to compare two different algorithms to find the best fit.
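
What such an evaluation boils down to can be sketched in plain Python: compare the actual values with the scored values and summarize the errors. The three metrics below (mean absolute error, root mean squared error and the coefficient of determination) are common choices for regression and are assumed here for illustration; the `evaluate` function and all numbers are invented.

```python
import math


def evaluate(actual, predicted):
    """Return common regression metrics for actual versus predicted values."""
    n = len(actual)
    errors = [a - p for a, p in zip(actual, predicted)]
    mean_actual = sum(actual) / n
    ss_res = sum(e * e for e in errors)                  # unexplained variation
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)  # total variation
    return {
        "mae": sum(abs(e) for e in errors) / n,
        "rmse": math.sqrt(ss_res / n),
        "r2": 1 - ss_res / ss_tot,
    }


# Invented test-set prices and the scores of two hypothetical models.
actual = [10000, 12000, 15000, 30000]
scored_a = [10500, 11500, 15500, 24000]  # fine at the low end, far off at the top
scored_b = [9000, 13000, 14000, 28000]

results_a = evaluate(actual, scored_a)
results_b = evaluate(actual, scored_b)
print(results_a)
print(results_b)
```

Lower error values and a coefficient of determination closer to 1 indicate a better fit, so in this made-up comparison the second model would be preferred.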


I've covered only a small part of Azure Machine Learning, to give an impression of the possibilities. Even the free version offers a lot and is very useful for getting started with machine learning.