Birds of a feather flock together

Unsupervised Clustering

Birds flocking together. Source

In a previous post I gave an overview of 3 methods of Machine Learning - Supervised, Unsupervised and Reinforcement Learning. In this post, I will give more detail on how unsupervised learning works, with examples of 2 Unsupervised Clustering algorithms.

You are the average of the five people you spend the most time with.
— Jim Rohn

Clustering

Clustering works by grouping numerical data into a number of subgroups, or clusters. A simple example of this could be to take the heights and weights of a group of individuals and create clusters to indicate gender. In this case, we would be clustering the data into 2 classes.

In the above example we know that there should be 2 clusters, as there are only 2 logical groups. However, we don't have to give the algorithm this information. We could allow it to determine the number of clusters itself, and then sense check the output against our 2 logical groups. Further, we may not know the number of clusters in our data and want the algorithm to figure this out for us.

These 2 approaches - whether or not we tell the algorithm in advance the number of clusters we want - are called:

  • Flat Clustering - passing the expected cluster count as 'K'
  • Hierarchical Clustering - not passing the expected cluster count as 'K'

Flat Clustering with K-Means

One of the most popular Flat Clustering algorithms is K-Means. This algorithm works on numerical data and assumes that the clusters in your data are groups of relatively similar size. The 'K' in K-Means represents the number of clusters which you expect to find in the data.

This algorithm works by choosing 'K' points in the data as an initial estimate for the centres of our expected clusters. These centres are called Centroids. The next step is to measure the distance (frequently Euclidean) from each remaining feature-set to each Centroid, and assign each feature-set to its nearest Centroid.

Now that we have K Centroids, each with a cluster of feature-sets around it, we take the mean of each cluster and create a new Centroid where that mean is located. This process is repeated until the Centroids stop moving, at which point we're done.

Fig. 1 K-means
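To make the iteration concrete, below is a minimal sketch of that loop in NumPy. The function name and defaults are my own, and for brevity it does not handle the edge case of a cluster losing all of its feature-sets:

import numpy as np

def k_means(X, k, max_iter=100, seed=0):
    #Choose K feature-sets at random as the initial Centroids
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        #Distance (Euclidean) from every feature-set to every Centroid
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        #Assign each feature-set to its nearest Centroid
        labels = distances.argmin(axis=1)
        #Create a new Centroid at the mean of each cluster
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        #Once the Centroids stop moving, we're done
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels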

K-Means with the Iris Dataset

Let's take a look at an example of using K-Means with the Iris Dataset, which contains 50 samples of each of 3 species of Iris:

  • Iris Setosa
  • Iris Virginica
  • Iris Versicolor

These species are the labels by which we want to cluster our data.  Our features are:

  • Sepal Length
  • Sepal Width
  • Petal Length
  • Petal Width

As K-Means is an unsupervised algorithm, we will exclude these labels from our data and use scikit-learn's implementation of K-Means to classify the Iris dataset based on the features above. We will then compare the labels assigned by the algorithm with the true labels, and calculate our accuracy.

One thing to note here is that our accuracy score may come out higher than 50% or lower. This is because K-Means numbers the clusters it finds arbitrarily, so the cluster labels may not line up with the true label numbering. If the accuracy is lower than 50%, then we take the complement of the score for our true accuracy. For example, an accuracy score of 32% is to be read as 100% - 32% = 68%. Let's take a look at the Python code:

import numpy as np
from sklearn.cluster import KMeans
from sklearn import datasets
import matplotlib.pyplot as plt

iris = datasets.load_iris()
X = iris.data
y = iris.target
clf = KMeans(n_clusters=3)
clf.fit(X)
#Array of the Centroids created by the K-Means fit method
centroids = clf.cluster_centers_
#Array of the labels created by the K-Means fit method
labels = clf.labels_
#Marker colours for the feature-sets (one per possible cluster)
colors = ["g.", "r.", "c.", "b.", "k.", "m."]
#Iterate over the predictions and compare to our Y label
correct = 0.0
for i in range(len(X)):
    predict_me = X[i].reshape(1, -1)
    prediction = clf.predict(predict_me)
    #Compare the prediction to our known labels in y
    if prediction[0] == y[i]:
        correct += 1
    #Plot our feature-sets with labels determined by K-Means
    plt.plot(X[i][0], X[i][1], colors[labels[i]], markersize=25, zorder=1)
#Plot the calculated centroids
plt.scatter(centroids[:,0], centroids[:,1], marker='x', s=150, linewidths=5 ,zorder=2)
#Print our accuracy score (remember to invert this if lower than 50%)
print(correct/len(X))

plt.ylabel('Sepal Width')
plt.xlabel('Sepal Length')
plt.show()

Our accuracy here was 0.893, which means that, based solely on the features, the K-Means algorithm was able to correctly classify 89% of the data. Having plotted the data, we can see in Fig. 2 where the algorithm has settled on the location of each Centroid.

Fig. 2 Plotting of the Iris Sepal features with K-Means Centroids plotted as 'X'
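As an aside, because K-Means numbers its clusters arbitrarily, a more robust alternative to taking the complement is to map each cluster to the most common true label among its members before scoring. A minimal sketch, reusing the clf, X and y variables from the code above:

import numpy as np

pred = clf.predict(X)
mapped = np.zeros_like(pred)
for cluster in np.unique(pred):
    members = pred == cluster
    #Find the most frequent true label within this cluster
    values, counts = np.unique(y[members], return_counts=True)
    mapped[members] = values[counts.argmax()]
#Score against the true species labels
print((mapped == y).mean())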

Hierarchical Clustering

In the above example, we had prior knowledge of the number of clusters which we expected the algorithm to classify on. In this example, we will take a look at Hierarchical Clustering, where the number of clusters is not provided to the algorithm. The method we will use is the Mean Shift algorithm.

Mean Shift with the Iris Dataset

Working again with the Iris data-set, we will see if the Mean Shift algorithm can determine the number of classes from the data.

Like K-Means, Mean Shift applies the same principle of moving the Centroids - taking the mean of the cluster and creating a new Centroid there, repeating until the Centroids stop moving. However, as we are not passing the number of clusters, the Mean Shift algorithm takes a slightly different approach.

To start, the Mean Shift algorithm makes every feature-set a Centroid. For our data, this means that we start with 150 Centroids. For each of the Centroids, the algorithm applies a Bandwidth and groups the neighbouring feature-sets that fall within the radius of the Bandwidth.  Then, like in K-means, it takes the mean of this cluster and creates a new Centroid.  The process continues, for each Centroid, as in Fig. 3.

Fig. 3 Meanshift
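To make this concrete, below is a minimal sketch of the Mean Shift update with a flat kernel in NumPy. The function name is my own, and real implementations (including scikit-learn's) add optimisations and Centroid merging that are skipped here:

import numpy as np

def mean_shift(X, bandwidth, max_iter=100):
    #Every feature-set starts as its own Centroid
    centroids = X.copy()
    for _ in range(max_iter):
        new_centroids = np.empty_like(centroids)
        for i, c in enumerate(centroids):
            #Group the feature-sets within the Bandwidth radius...
            in_window = np.linalg.norm(X - c, axis=1) <= bandwidth
            #...and create a new Centroid at their mean
            new_centroids[i] = X[in_window].mean(axis=0)
        #Once the Centroids stop moving, we're done
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    #Centroids that converged to (nearly) the same point form one cluster
    return np.unique(centroids.round(3), axis=0)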

The Bandwidth plays a vital part in the Mean Shift algorithm; if it is not specified, scikit-learn's implementation will attempt to estimate the Bandwidth. Remember that the Bandwidth is the radius within which we cluster neighbouring feature-sets. Later, I will show how changing this value can alter the results of the algorithm. Here is the Python code:

import numpy as np
from itertools import cycle
from sklearn.cluster import MeanShift
from sklearn import datasets
import matplotlib.pyplot as plt

iris = datasets.load_iris()
X = iris.data
y = iris.target
ms = MeanShift()
ms.fit(X)
#Array of the cluster centres created by the Mean Shift fit method
cluster_centers = ms.cluster_centers_
labels = ms.labels_
#Count the unique cluster labels
n_clusters_ = len(np.unique(labels))
colors = cycle('grcbk')


plt.figure(1)
plt.clf()
#Iterate over the clusters, colors and featuresets and plot results
for k, col in zip(range(n_clusters_), colors):
    my_members = labels == k
    cluster_center = cluster_centers[k]
    plt.plot(X[my_members, 0], X[my_members, 1], col + '.', markersize=25, zorder=1)
    plt.scatter(cluster_center[0], cluster_center[1], marker='x', s=150, linewidths=5, zorder=2)

plt.title('Estimated number of clusters: %d' % n_clusters_)
plt.ylabel('Sepal Width')
plt.xlabel('Sepal Length')
plt.show()

If we take a look at the plot in Fig. 4, we can see that the model only found 2 classes of Iris in the dataset.

Fig. 4 Plotting of the Iris Sepal features with Mean Shift Centroids plotted as 'X'

Whilst it has done an okay job of classifying the Iris Setosa (red) feature-sets, it has clearly had trouble distinguishing the Iris Virginica and Iris Versicolor feature-sets.

Let's take a look at how we can tweak the algorithm by altering the Bandwidth value. In the code example below, we use scikit-learn's estimate_bandwidth helper with a quantile of 0.18 to derive the Bandwidth.

import numpy as np
from itertools import cycle
from sklearn.cluster import MeanShift, estimate_bandwidth
from sklearn import datasets
import matplotlib.pyplot as plt

iris = datasets.load_iris()
X = iris.data
y = iris.target

bandwidth = estimate_bandwidth(X, quantile=0.18)

ms = MeanShift(bandwidth=bandwidth)
ms.fit(X)
...

Fig. 5 shows how altering the Bandwidth increases our estimated clusters from 2 to 3.

Fig. 5 Changing the Bandwidth (via a quantile of 0.18) results in 3 clusters being found

The Mean Shift algorithm is very useful in marketing, where we suspect that customers form distinct subgroups. Targeting customers based on their subgroup can increase the efficacy of marketing campaigns or recommendation systems.

Conclusion

In this post I have shown you 2 examples of Clustering with Unsupervised Machine Learning. These examples - K-Means and Mean Shift - are a good representation of the unsupervised approach and should be easy to understand.

Aside from the potential revelation of previously unknown clusters, a huge benefit of Unsupervised Machine Learning is that it requires fewer resources. As the feature-sets do not require labels, the availability of training data is not constrained by the need to classify the data before training the algorithm. Thus, data can be sourced in larger quantities and more quickly.

The Code from this post is available here.

What is Machine Learning?

Artificial Intelligence & Machine Learning

Cover of Isaac Asimov's Robot Visions

The explosion of articles and interest in Artificial Intelligence and Machine Learning over the past few years has been hard to miss. From the realisation of automated cars to the incorporation of AI into translation services and digital assistants such as Siri, most of us will interact with some sort of AI in our daily lives. Machine Learning is an increasingly trending topic.

Whilst AI and Machine Learning seem to be used interchangeably, the distinction can be summarised as:

  • AI - the creation of machines which can interact in an intelligent and human-like way. AI attempts to build machines with both logical prowess and emotional intelligence.
  • Machine Learning - the creation of algorithms which can solve problems for which they have not been explicitly programmed, and learn from new data.

In this post, I will explain what Machine Learning is, and introduce some different methodologies and how they might be used.

How is Machine Learning different from computer programming?

Fig. 1 shows the contrast between conventional programming where the rules are hard coded, and Machine Learning, where the rules are deduced by the algorithm.

Fig. 1 Comparison of conventional programming and machine learning
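To make the contrast concrete, here is a toy sketch; the fever example, the threshold and the data are invented purely for illustration:

from sklearn.linear_model import LogisticRegression

#Conventional programming: the rule is hard coded by the developer
def is_fever_hardcoded(temperature):
    return temperature > 38.0

#Machine Learning: the rule is deduced from example data
temperatures = [[36.5], [37.0], [37.2], [38.5], [39.0], [40.1]]
has_fever = [0, 0, 0, 1, 1, 1]
model = LogisticRegression().fit(temperatures, has_fever)
print(model.predict([[38.7]]))  #Apply the learned rule to new data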

Whilst conventional programming involves solely the application of code, Machine Learning is multidisciplinary, combining Statistics, Maths and Programming.

Fig. 2 Component disciplines of Machine Learning

Methods of Machine Learning

There are 3 main methods of Machine Learning. These are:

  • Supervised Learning
  • Unsupervised Learning
  • Reinforcement Learning

Whilst these methods can be combined, generally, they are each suited to different scenarios, and approach discovery in different ways.

Supervised Learning

In Supervised Learning, the algorithm is given data comprising Features and Labels and works out the relationship between the 2, with the goal of creating a model which can predict the Label for new data, given only the Features.

An example of this is to build a Machine Learning algorithm that takes historic data on share prices, formulates a set of rules on how variables affect the share price, and then predicts a future value given new data. The type of algorithm used to predict continuous data like this is Linear Regression.
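As a hedged illustration of that idea, here is a minimal Linear Regression sketch with scikit-learn; the day numbers and prices are invented, not real market data:

import numpy as np
from sklearn.linear_model import LinearRegression

#Features: day number. Labels: invented share prices
days = np.array([[1], [2], [3], [4], [5]])
prices = np.array([10.0, 10.5, 11.1, 11.4, 12.0])

model = LinearRegression()
#Work out the relationship between the Features and the Labels
model.fit(days, prices)
#Predict a future value, given new data
print(model.predict([[6]]))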

Spam Filtering is an application of Classification.  Given certain features of the email such as:

  • Contains the word 'viagra'
  • Contains the word 'Free'
  • Has a spelling mistake
  • Has no return sender address

the algorithm will learn to classify the email as spam or not spam. Once the model has been optimised on data containing the classification label, we can then test and predict on data with the label excluded.
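As a hedged sketch of such a classifier, here is a Naive Bayes model trained on the 4 binary features listed above; the example emails and their labels are invented:

from sklearn.naive_bayes import BernoulliNB

#Each row: [contains 'viagra', contains 'Free', has spelling mistake,
#           has no return sender address] - invented training examples
emails = [[1, 1, 1, 1],
          [0, 1, 1, 0],
          [0, 0, 0, 0],
          [0, 0, 1, 0]]
is_spam = [1, 1, 0, 0]

clf = BernoulliNB().fit(emails, is_spam)
#Predict on a new email with the label excluded
print(clf.predict([[0, 1, 0, 1]]))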

Unsupervised Learning

Unlike Supervised Learning, this method does not supply the algorithm with the output we want it to predict. The algorithm must determine the correct output by finding relationships or groupings in the data, without any correctly labelled data to learn from.

This method of training can be used to create groups within the dataset, and Clustering is a method of Machine Learning that is useful in doing just this. The algorithm can reveal clusters in the data which were previously unknown. Clustering algorithms are used in marketing for the segmentation of target markets into insightful subgroups. This information then affords the marketer the opportunity to focus advertising on the clustered consumer group.

A second example of clustering is in optimising the location of Distribution Centres. Given data on delivery points and stores, the algorithm can cluster the locations into logical groups and find the central point at which each DC would be best located.
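As a hedged sketch of that idea, the Centroids found by K-Means can double as candidate DC locations; the coordinates below are invented:

import numpy as np
from sklearn.cluster import KMeans

#Invented (x, y) coordinates of delivery points and stores
locations = np.array([[1, 2], [2, 1], [1.5, 1.8],
                      [8, 9], [9, 8], [8.5, 8.2]])

#Cluster the locations into 2 logical groups
kmeans = KMeans(n_clusters=2).fit(locations)
#Each Centroid is the central point for a candidate DC
print(kmeans.cluster_centers_)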

Reinforcement Learning

This method of Machine Learning is common in the automation of robots/vehicles and in training algorithms to play games. It is similar to how children learn about the world around them: putting your hand on a hot stove and learning not to do that again is a powerful form of reinforcement learning.

Algorithms trained through reinforcement learning quickly move through a process of trial and error, and update themselves based on success or failure.
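As a hedged toy sketch of that trial-and-error loop, here is an epsilon-greedy agent learning which of 2 actions pays off; the actions and reward values are invented:

import random

#Invented rewards: touching the stove hurts, keeping away is safe
rewards = {"touch_stove": -1.0, "keep_away": 1.0}
values = {"touch_stove": 0.0, "keep_away": 0.0}  #The agent's estimates
counts = {"touch_stove": 0, "keep_away": 0}

for _ in range(100):
    #Trial: mostly exploit the best-known action, sometimes explore
    if random.random() < 0.1:
        action = random.choice(list(rewards))
    else:
        action = max(values, key=values.get)
    #Success or failure: update the estimate from the reward received
    counts[action] += 1
    values[action] += (rewards[action] - values[action]) / counts[action]

print(max(values, key=values.get))  #The agent learns to keep away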

Conclusion

I hope that this brief introduction to a vast and captivating subject has given you a good idea of what Machine Learning is, and an understanding of the practical applications of 3 methods of Machine Learning.

If this post has whetted your appetite and you would like to learn more, I recommend the following resources:

Books

  • The Master Algorithm, Pedro Domingos
    • A fantastic introduction to machine learning, the evolution of approaches to machine learning including the Five Tribes (Symbolists, Connectionists, Bayesians, Analogizers, Evolutionaries), and ideas on how we can combine these to create The Master Algorithm.
  • Machine Learning for Dummies, John Paul Mueller & Luca Massaron
    • Solid introduction to Machine Learning for the beginner.  Clear and concise.

Programming

There is a wealth of packages and libraries to help get you started with Machine Learning. The programming language Python is arguably becoming the de facto language for exploring Machine Learning, and has 2 excellent libraries available that allow you to quickly start training models, with no need to write the algorithms from scratch. These are:

Additionally, you will want to be using 2 data manipulation libraries:

Learning to code can seem challenging; however, the Python programming language borrows syntax from natural language and, with its simple coding style, it can be a great language to start programming with.

If you are interested in learning Machine Learning with Python, I highly recommend a series of YouTube tutorial videos on Machine Learning with Python, which are incredibly clear and comprehensive. The author also has several tutorial series on learning Python.

If you are coming from an R background or want to start Machine Learning in R, I can recommend:

R also has a very popular Machine Learning package which abstracts away the need for the user to code the algorithms from scratch; this is:

Code free

Finally, if you would like to start building Machine Learning algorithms without coding, Microsoft's Azure Machine Learning cloud software offers an incredibly simple drag-and-drop interface for building models and exposing them as Web Services. You can get a free subscription at:

In future posts I will drill into more detail on the topics and methods covered above, but if you have any questions, comments or concerns, please leave them in the comments section below.