In a previous post I gave an overview of 3 methods of Machine Learning - Supervised, Unsupervised and Reinforcement Learning. In this post, I will give more detail on how unsupervised learning works, by giving examples of 2 Unsupervised Clustering algorithms.
Clustering algorithms work by grouping numerical data into a number of subgroups, or clusters. A simple example of this could be to take the heights and weights of a group of individuals and create clusters that indicate gender. In this case, we would be clustering the data into 2 classes.
In the above example we know that there should be 2 clusters, as there are only 2 logical groups. However, we don't have to give the algorithm this information. We could allow it to determine the number of clusters itself, and then sense check the output against our 2 logical groups. Further, we may not know the number of clusters in our data at all, and want the algorithm to figure this out for us.
These 2 approaches, distinguished by whether or not we tell the algorithm in advance how many clusters we want, are called (a short code sketch of both follows this list):
- Flat Clustering - passing the expected cluster count as 'K'
- Hierarchical Clustering - not passing the expected cluster count as 'K'
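To make the distinction concrete, here is a minimal sketch of how the two approaches are invoked using scikit-learn's KMeans and MeanShift classes. The height/weight numbers are made up purely for illustration:

import numpy as np
from sklearn.cluster import KMeans, MeanShift

# Hypothetical [height (cm), weight (kg)] feature-sets
data = np.array([[178, 81], [182, 88], [169, 74],
                 [158, 55], [163, 59], [171, 63]])

# Flat Clustering: we tell the algorithm we expect K=2 clusters
flat = KMeans(n_clusters=2).fit(data)
print(flat.labels_)

# Hierarchical Clustering (in this post's sense): the algorithm
# determines the number of clusters itself
hierarchical = MeanShift().fit(data)
print(len(np.unique(hierarchical.labels_)))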
Flat Clustering with K-Means
One of the most popular Flat Clustering algorithms is K-Means. This algorithm works on numerical data and assumes that the clusters in your data are of relatively similar size. The 'K' in K-Means represents the number of clusters which you expect to find in the data.
This algorithm works by choosing 'K' points in the data as an initial estimate for the centres of our expected clusters. These centres are called Centroids. The next step is to measure the distance (frequently Euclidean) from each remaining feature-set to each Centroid, and assign each feature-set to its nearest Centroid, which becomes the centre of its cluster.
Now that we have K Centroids, each with a cluster of feature-sets around it, we take the mean of each cluster and move its Centroid to where that mean is located. This process is repeated until our Centroids stop moving, at which point we're done.
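As a rough illustration of that loop, here is a minimal NumPy sketch (not the scikit-learn implementation used below, and ignoring edge cases such as empty clusters), assuming a 2-D array of feature-sets X and a cluster count k:

import numpy as np

def kmeans_sketch(X, k, iterations=100):
    # Pick K feature-sets at random as the initial Centroids
    rng = np.random.default_rng(0)
    centroids = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iterations):
        # Assign each feature-set to its nearest Centroid (Euclidean distance)
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Move each Centroid to the mean of its cluster
        new_centroids = np.array([X[labels == i].mean(axis=0) for i in range(k)])
        # Stop once the Centroids stop moving
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels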
K-Means with the Iris Dataset
Let's take a look at an example of using K-Means with the Iris Dataset, which contains 50 samples of each of 3 species of Iris:
- Iris Setosa
- Iris Virginica
- Iris Versicolor
These species are the labels by which we want to cluster our data. Our features are:
- Sepal Length
- Sepal Width
- Petal Length
- Petal Width
As K-Means is an unsupervised algorithm, we will exclude these labels from our data and use scikit-learn's implementation of K-Means to classify the Iris data-set based on the features above. We will then compare the labels assigned by the algorithm with the true labels, and calculate our accuracy.
One thing to note here is that our accuracy score may come out higher than 50% or lower. K-Means assigns its cluster labels arbitrarily, so they may not line up with the numbering of the true labels. If the accuracy is lower than 50%, we take the complement of the score for our true accuracy. For example, an accuracy score of 32% is to be read as 100% - 32% = 68%. Let's take a look at the Python code:
import numpy as np
from sklearn.cluster import KMeans
from sklearn import datasets
import matplotlib.pyplot as plt

iris = datasets.load_iris()
X = iris.data
y = iris.target

clf = KMeans(n_clusters=3)
clf.fit(X)

# Array of the Centroids created by the K-Means fit method
centroids = clf.cluster_centers_
# Array of the labels created by the K-Means fit method
labels = clf.labels_

# Array of colours for the feature-sets, one per cluster
colors = ["g.", "r.", "c.", "b.", "k.", "m."]

# Iterate over the predictions and compare to our known labels in y
correct = 0.0
for i in range(len(X)):
    predict_me = np.array(X[i].astype(float))
    predict_me = predict_me.reshape(-1, len(predict_me))
    prediction = clf.predict(predict_me)
    # Compare the prediction to our known label in y
    if prediction[0] == y[i]:
        correct += 1
    # Plot Sepal Length against Sepal Width, coloured by the K-Means label
    plt.plot(X[i][0], X[i][1], colors[labels[i]], markersize=25, zorder=1)

# Plot the calculated Centroids
plt.scatter(centroids[:, 0], centroids[:, 1], marker='x', s=150, linewidths=5, zorder=2)

# Print our accuracy score (remember to invert this if lower than 50%)
print(correct / len(X))

plt.ylabel('Sepal Width')
plt.xlabel('Sepal Length')
plt.show()
Our accuracy here was 0.893, which means that, based solely on the features, the K-Means algorithm was able to correctly classify 89% of the data. Having plotted the data, we can see in Fig. 2 where the algorithm has settled on the location of each Centroid.
In the above example, we had prior knowledge of the number of clusters which we expected the algorithm to classify on. In this example, we will take a look at Hierarchical Clustering, where the number of clusters is not provided to the algorithm. The method we will use is the Mean Shift algorithm.
Mean Shift with the Iris Dataset
Working again with the Iris data-set, we will see if the Mean Shift algorithm can determine the number of classes from the data.
Like K-Means, Mean Shift applies the same principle of moving the Centroids: taking the mean of each cluster, creating a new Centroid there, and repeating the process until the Centroids stop moving. However, as we are not passing the number of clusters, the Mean Shift algorithm takes a slightly different approach.
To start, the Mean Shift algorithm makes every feature-set a Centroid. For our data, this means that we start with 150 Centroids. For each of the Centroids, the algorithm applies a Bandwidth and groups the neighbouring feature-sets that fall within the radius of the Bandwidth. Then, like in K-means, it takes the mean of this cluster and creates a new Centroid. The process continues, for each Centroid, as in Fig. 3.
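A crude sketch of that update might look like the following. This is purely illustrative, using a flat kernel and a hypothetical bandwidth argument; scikit-learn's own implementation, used below, is more sophisticated:

import numpy as np

def mean_shift_sketch(X, bandwidth, iterations=100):
    # Every feature-set starts as its own Centroid
    centroids = X.copy()
    for _ in range(iterations):
        new_centroids = []
        for c in centroids:
            # Group the neighbouring feature-sets within the Bandwidth radius
            in_bandwidth = X[np.linalg.norm(X - c, axis=1) <= bandwidth]
            # The new Centroid is the mean of that cluster
            new_centroids.append(in_bandwidth.mean(axis=0))
        new_centroids = np.array(new_centroids)
        # Stop once the Centroids stop moving
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    # Converged Centroids coincide, so the unique ones are our clusters
    return np.unique(np.round(centroids, 3), axis=0)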
The Bandwidth plays a vital part in the Mean Shift algorithm; if not specified, scikit-learn's implementation will attempt to calculate the Bandwidth itself. Remember that the Bandwidth is the radius within which we cluster neighbouring Centroids. Later, I will show how changing this value can alter the results of the algorithm. Here is the Python code:
import numpy as np
from itertools import cycle
from sklearn.cluster import MeanShift
from sklearn import datasets
import matplotlib.pyplot as plt

iris = datasets.load_iris()
X = iris.data
y = iris.target

ms = MeanShift()
ms.fit(X)

# Array of the cluster centres created by the Mean Shift fit method
cluster_centers = ms.cluster_centers_
labels = ms.labels_

# Extract the number of unique cluster labels
n_clusters_ = len(np.unique(labels))

colors = cycle('grcbk')
plt.figure(1)
plt.clf()

# Iterate over the clusters, colours and feature-sets and plot the results
for k, col in zip(range(n_clusters_), colors):
    my_members = labels == k
    cluster_center = cluster_centers[k]
    plt.plot(X[my_members, 0], X[my_members, 1], col + '.', markersize=25, zorder=1)
    plt.scatter(cluster_center[0], cluster_center[1], marker='x', s=150, linewidths=5, zorder=2)

plt.title('Estimated number of clusters: %d' % n_clusters_)
plt.ylabel('Sepal Width')
plt.xlabel('Sepal Length')
plt.show()
If we take a look at the Plot in Fig. 3, we can see that the model only found 2 classes of Iris from the data-set.
Whilst it has done an okay job of classifying the Iris Setosa (RED) feature-sets, it has clearly had trouble distinguishing the Iris Virginica and Iris Versicolor feature-sets.
Let's take a look at how we can tweak the algorithm by altering the Bandwidth value. In the code example below, rather than setting the Bandwidth directly, we use scikit-learn's estimate_bandwidth helper with a quantile of 0.18 to calculate a narrower Bandwidth.
import numpy as np
from itertools import cycle
from sklearn.cluster import MeanShift, estimate_bandwidth
from sklearn import datasets
import matplotlib.pyplot as plt

iris = datasets.load_iris()
X = iris.data
y = iris.target

# Estimate the Bandwidth from the data, using a quantile of 0.18
bandwidth = estimate_bandwidth(X, quantile=0.18)
ms = MeanShift(bandwidth=bandwidth)
ms.fit(X)
...
Fig. 4 shows how altering the Bandwidth increases the estimated number of clusters from 2 to 3.
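If you want to see how sensitive the result is, a quick experiment (illustrative only, with arbitrarily chosen quantile values) is to loop over a few quantiles and print the number of clusters Mean Shift finds for each:

import numpy as np
from sklearn.cluster import MeanShift, estimate_bandwidth
from sklearn import datasets

X = datasets.load_iris().data

# Try a few quantile values and see how the estimated cluster count changes
for quantile in (0.1, 0.18, 0.3, 0.5):
    bandwidth = estimate_bandwidth(X, quantile=quantile)
    labels = MeanShift(bandwidth=bandwidth).fit(X).labels_
    print(quantile, bandwidth, len(np.unique(labels)))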
The Mean Shift algorithm is very useful in areas like marketing, where we suspect that our customers fall into distinct subgroups but don't know how many. Targeting customers based on their subgroup can increase the efficacy of marketing campaigns or recommendation systems.
In this post I have shown you 2 examples of Clustering with Unsupervised Machine Learning. These examples - K-Means & Mean Shift - are a good representation of the unsupervised approach and should be easy to understand.
Aside from the potential revelation of previously unknown clusters, a huge benefit of Unsupervised Machine Learning is that it requires fewer resources. As the feature-sets do not require labels, the availability of training data is not constrained by the need to classify the data before training the algorithm. Thus, data can be sourced in larger quantities and more quickly.
The Code from this post is available here.