How to predict the future

Simple Linear Regression

In this article, I am going to show you how to predict the future.

Okay, so I am being a little cavalier here.  Perhaps I should say -  I am going to show you how to predict the future value of a variable based on its past interactions with one other variable, on the condition that they have a linear relationship. Not so snappy, but a little less misleading.

Predicting the future has captivated mankind since the dawn of time. In fact, you could say that predicting the future is the great difference between mankind and the rest of the animal kingdom. Not only can we extrapolate future events based on those of the past, but we can also feed our predictions of the future back through our brains and extrapolate from those - abstract thought.

In this article, I am going to show you how we can use Simple Linear Regression to make predictions from data.

The best predictor of future behavior is … past behavior

Simple Linear Regression.

Simple Linear Regression is the calculation of a trend from 2 linearly related variables, one of which is hypothesised to explain the behaviour of the other.

As an example, take the hours of study and scores in an exam. One would imagine that the time you put into studying affects exam results. We then say that these variables, x = hours of study and y = exam results, have a linear relationship.

The increase in score can be explained, and therefore predicted, by the hours of study. If I can build a linear model on past data, you should be able to give me the hours of study which a student has put in, and I can run it through my model and make a prediction of the score which they will get.

You will often see this depicted as a scatterplot of the data points, with a nice straight line going through them at an angle.

There are a few different methods for creating a Simple Linear Regression model using machine learning. In this article I will show you 3 which can be used.

I will show you how we can create an SLR model using vanilla Python on some dummy data, how we can use something called Stochastic Gradient Descent to continually update our model until we find ‘the sweet spot’, and how we can stir in some Linear Algebra using the Normal Equation.

Before we dive into the code, let's take a look at the maths behind the curtain.

Maths of the Simple Linear Regression model.

In the example of the student’s hours of study and the grade on his or her exam, we want a model into which we can plug the hours of study and get a prediction for the grade. Our model follows a mathematical formula which goes a little something like this:

\begin{equation} \hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 X \end{equation}

On the left hand side is our predicted grade - Y-hat. On the far right we have our hours of study - X, and we also have B0 and B1; these are our Coefficients. The Coefficients are the magic values which allow us to do our magic. So, all we need to do is find the values of our Coefficients; should be easy, right?

First we need to know B1 - how steep our line should be. Or, put another way, how much it should slope by. To calculate the Slope, we use this formula:

Slope: \begin{equation} \hat{\beta}_1 = \frac{\sum(X_i - \bar{X})(Y_i - \bar{Y})}{\sum(X_i - \bar{X})^2} \end{equation}

Now let's solve for B0, which is our Intercept. Remember when I said that we normally see a nice straight line drawn through our values? Well, the straight line will at some point cross our Y axis, but it could cross high up or low down. The point at which it crosses our Y axis is our Intercept. We can calculate this by:

Intercept: \begin{equation} \hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X} \end{equation}

Plugging these values into our formula, we now have our model for the Simple Linear Regression.
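
As a quick sanity check, suppose we had a toy dataset (made-up numbers, purely for illustration) of hours of study X = (1, 2, 3) and exam scores Y = (2, 4, 6), so that \bar{X} = 2 and \bar{Y} = 4. Plugging into the formulas above:

\begin{equation} \hat{\beta}_1 = \frac{(-1)(-2) + (0)(0) + (1)(2)}{(-1)^2 + (0)^2 + (1)^2} = \frac{4}{2} = 2 \end{equation}

\begin{equation} \hat{\beta}_0 = 4 - 2 \times 2 = 0 \end{equation}

So the fitted model is simply Y-hat = 2X, and a student who studied for 5 hours would be predicted to score 10.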

Okay, so hopefully you are still with me and that all made sense. Now, as we are ardent machine learners, we are going to want to fire up our IDE and see how we can create and run this model in code.

Simple Linear Regression in Vanilla Python

First off, to make explicit what is going on, I’m going to take this to the metal and code this in as transparent a way as possible.

import matplotlib.pyplot as plt

#Create some dummy data
xs = [1, 2, 3, 4, 5]
ys = [1, 4, 9, 16, 25]

#Function to calculate our Coefficients
def basic_linear_regression(x, y):
    length = len(x)
    sum_x = sum(x)
    sum_y = sum(y)
    sum_x_squared = sum(map(lambda a: a ** 2, x))
    sum_of_products = sum([x[i] * y[i] for i in range(length)])
    #Slope: the expanded form of sum((X - Xbar)(Y - Ybar)) / sum((X - Xbar)^2)
    m = (sum_of_products - (sum_x * sum_y) / length) / \
        (sum_x_squared - ((sum_x ** 2) / length))
    #Intercept: Ybar - slope * Xbar
    b = (sum_y - m * sum_x) / length
    return m, b

#Get our Coefficients
m, b = basic_linear_regression(xs, ys)
#Use our Coefficients to create our regression line
regression_line = [(m * x + b) for x in xs]
#Create a new value to test our model
new_x = 8
#Make a prediction
predict_y = m * new_x + b

print("Slope: ", m, " Intercept: ", b, " x: ", new_x, " Yhat: ", predict_y)
#Plot our regression line
plt.plot(xs, regression_line, color="g", lw=2, alpha=0.5)
#Plot our original data
plt.scatter(xs, ys)
#Plot our prediction
plt.scatter(new_x, predict_y, s=20, color='r', alpha=1)
plt.show()

Stochastic Gradient Descent

Now that we have a good intuition of the model and have worked through coding this up in vanilla Python, let's take a look at another method for finding the best Coefficients.

So, Simple Linear Regression is something called a Convex Problem. To understand this, we need to understand one other key point about our model.

On drawing our line through our variables, we want the line to be as close to as many of our points as possible. Imagine that our data already fell so tightly in a line that, when we created our model and drew our regression line on top, it overlapped with the actual data. It would be clear to see that this would be a very precise model, which describes our data perfectly (too perfectly?). It is extremely unlikely that you would see a fit this close, but the idea is that the closer we can get, the better our model is. Conversely, the further away our data points are from our line, the worse our model is.

These distances between each point and our linear regression line are called residuals. As the name implies, they are the bits of noise left over between our nice regression line and our data. They expose to the viewer the imprecision of our model and, if you imagine that we are paid based on the fit of our model, these residuals cost us money. This is the cost that we want to minimise.
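
To make that concrete, here is a minimal sketch - reusing the xs, ys dummy data and the m, b Coefficients from the vanilla Python example above - of computing the residuals and the mean squared error, one common way of measuring this cost:

#Residuals: the vertical distance between each point and our regression line
residuals = [ys[i] - (m * xs[i] + b) for i in range(len(xs))]
#Mean squared error: the average squared residual - the cost we want to minimise
mse = sum(r ** 2 for r in residuals) / len(residuals)
print("Residuals: ", residuals, " Cost (MSE): ", mse)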

Now, I mentioned before that Simple Linear Regression was a Convex problem.  If we keep wiggling our line around, there would eventually be a ‘sweet spot’ at which our cost would be at a minimum. You can imagine trying to tune in an analogue radio or TV. A little to the left, to the right….and okay, “don’t move”.  This is like our cost function, wiggle, wiggle, good! 

So we know that there exists a sweet spot at which our cost will be at a minimum, but how do we find it? This is where the Convexity comes into play (and yes, that is a real word). It turns out that if we plot the cost of all those wiggles on a graph, it traces out a shape like a letter U, or a bowl. At the bottom is the sweet spot, and it gets steeper at the sides. Now you can see why we call it ‘…. Gradient Descent’ - our method is to walk down the sides of the U, or bowl, until we find this minimum cost or sweet spot.

The Gradient Descent part you can read as - “we are going to take tiny, tentative steps downhill. After each step, we are going to take a look around and see if we are at a lower point in the bowl than we were before”. If we are continually lower, we are heading in the right direction; if we start climbing back up, we have gone too far, and we take smaller steps back down in the direction we came. The Stochastic part means that, rather than looking at the whole dataset to decide which way is downhill, we estimate the direction from a randomly chosen sample (or small batch) at each step - cheaper, but a little noisier.
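
In symbols, each step nudges a Coefficient a small amount downhill against the gradient of the cost J, where α (the learning rate, alpha in the code below) controls how big each tentative step is:

\begin{equation} \hat{\beta}_j := \hat{\beta}_j - \alpha \frac{\partial J}{\partial \hat{\beta}_j} \end{equation}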

Back in Python, this is how we can run Gradient Descent to minimise the cost of our troublesome residuals and find our optimal Coefficients. (For simplicity, the example below computes the gradient over the whole dataset at each step - batch rather than strictly stochastic - but the descent idea is the same.)

import numpy as np
from sklearn.datasets import make_regression
import matplotlib.pyplot as plt

def gradient_descent(alpha, x, y, ep=0.0001, max_iter=10000):
    converged = False
    iter = 0
    m = x.shape[0] # number of samples
    # initial theta
    t0 = np.random.random(x.shape[1])
    t1 = np.random.random(x.shape[1])
    # total error, J(theta)
    J = sum([(t0 + t1*x[i] - y[i])**2 for i in range(m)])
    # Iterate Loop
    while not converged:
        # for each training sample, compute the gradient (d/d_theta j(theta))
        grad0 = 1.0/m * sum([(t0 + t1*x[i] - y[i]) for i in range(m)]) 
        grad1 = 1.0/m * sum([(t0 + t1*x[i] - y[i])*x[i] for i in range(m)])
        # update the theta_temp
        temp0 = t0 - alpha * grad0
        temp1 = t1 - alpha * grad1
        # update theta
        t0 = temp0
        t1 = temp1
        # mean squared error
        e = sum( [ (t0 + t1*x[i] - y[i])**2 for i in range(m)] ) 

        if abs(J-e) <= ep:
            print('Converged, iterations: ', iter, '!!!')
            converged = True
        J = e   # update error 
        iter += 1  # update iter
        if iter == max_iter:
            print('Max iterations exceeded!')
            converged = True
    return t0,t1

if __name__ == '__main__':
    x, y = make_regression(n_samples=100, n_features=1, n_informative=1, 
                        random_state=0, noise=35) 
    print('x.shape = %s y.shape = %s' %(x.shape, y.shape))
    alpha = 0.01 # learning rate
    ep = 0.01 # convergence criteria
    # call gradient decent, and get intercept(=theta0) and slope(=theta1)
    theta0, theta1 = gradient_descent(alpha, x, y, ep, max_iter=1000)
    # plot
    # predicted values from our fitted regression line
    y_predict = theta0 + theta1*x

    plt.plot(x,y,'o')
    plt.plot(x,y_predict,'k-')
    plt.show()

Normal Equation

The above code is a little verbose; thankfully, we can use a bit of Linear Algebra to solve this cost minimisation problem directly, using the Normal Equation.
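
For reference, with X as the matrix of our input values (with a column of ones prepended so that the first Coefficient acts as the Intercept) and y as the vector of outputs, the Normal Equation gives the cost-minimising Coefficients in a single step:

\begin{equation} \hat{\theta} = (X^T X)^{-1} X^T y \end{equation}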

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_regression

#Use SKlearn to create some random values for us
xs, ys = make_regression(n_samples=100, n_features=1, n_informative=1, 
                        random_state=0, noise=35)
#Prepend a column of ones so that the first Coefficient acts as our Intercept
X = np.c_[np.ones(xs.shape[0]), xs]
#The Normal Equation: solve for the Coefficients which minimise our cost in one step
theta = np.linalg.inv(X.T.dot(X)).dot(X.T).dot(ys)
print("Intercept: ", theta[0], " Slope: ", theta[1])

#Plot our original data and the fitted regression line
plt.scatter(xs, ys)
plt.plot(xs[:, 0], X.dot(theta), color="g", lw=2)
plt.show()

 

Hopefully, by now you have a better idea of what Simple Linear Regression is and how it works, and have come away with a few methods to apply it in Python.

However, you may be asking: what if my hypothesis involves more than 1 explanatory variable? For example, perhaps we think that, yes, hours of study is a predictor for exam grade, but that the grade is also explained by hours of sleep. Well, that is out of scope for this introduction to Linear Regression; in a future post I will venture into the world of Multivariate Linear Regression, and show you how we can use Python to do that too.

Birds of a feather flock together

Unsupervised Clustering

Birds flocking together.

In a previous post I gave an overview of 3 methods of Machine Learning - Supervised, Unsupervised and Reinforcement Learning.  In this post, I will give more detail on how unsupervised learning works, by giving examples of  2 Unsupervised Clustering algorithms.

You are the average of the five people you spend the most time with.
— Jim Rohn

Clustering

Clustering works by grouping numerical data into a number of subgroups, or clusters. A simple example of this could be to take the heights and weights of a group of individuals and create clusters to indicate gender. In this case, we would be clustering the data on 2 classes.

In the above example we know that there should be 2 clusters, as there are only 2 logical groups. However, we don't have to give the algorithm this information. We could allow it to determine the number of clusters itself, and then sense-check the output against our 2 logical groups. Further, we may not know the number of clusters in our data at all and want the algorithm to figure this out for us.

These 2 approaches - telling the algorithm in advance how many clusters we want, or not - are called:

  • Flat Clustering - passing the expected cluster count as 'K'
  • Hierarchical Clustering - not passing the expected cluster count as 'K'

Flat Clustering with K-Means

One of the most popular Flat Clustering algorithms is K-Means. This algorithm works on numerical data and assumes that the clusters in your data are of relatively similar size. The 'K' in K-Means represents the number of clusters which you expect to find in the data.

This algorithm works by choosing 'K' points in the data as an initial estimate for the centres of our expected clusters. These centres are called Centroids. The next step is to measure the distance (frequently Euclidean) from each remaining feature-set to each Centroid, and assign each feature-set to its nearest Centroid.

Now that we have K Centroids with a cluster of feature sets around them, we take the mean of each cluster, and create a new Centroid where the mean of these points is located.  This process is repeated until our Centroids stop moving, at which point, we're done.

Fig. 1 K-means
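
To make the loop above concrete, here is a minimal sketch of the idea in plain NumPy, using a tiny made-up dataset and K = 2 (Scikit-learn's implementation, used below, is what you would reach for in practice):

import numpy as np

#A tiny made-up dataset with 2 obvious groups
X = np.array([[1.0, 2.0], [1.5, 1.8], [8.0, 8.0], [9.0, 9.5]])
K = 2
#Initial estimate: pick the first K points as our Centroids
centroids = X[:K].copy()

for _ in range(10):
    #Measure the Euclidean distance from every feature-set to every Centroid
    distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    #Assign each feature-set to its nearest Centroid
    assignments = distances.argmin(axis=1)
    #Move each Centroid to the mean of its cluster
    new_centroids = np.array([X[assignments == k].mean(axis=0) for k in range(K)])
    #Stop when the Centroids stop moving
    if np.allclose(new_centroids, centroids):
        break
    centroids = new_centroids

print(centroids)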

K-Means with the Iris Dataset

Let's take a look at an example of using K-Means with the Iris Dataset, which contains 50 samples of each of 3 species of Iris:

  • Iris Setosa
  • Iris Virginica
  • Iris Versicolor

These species are the labels by which we want to cluster our data.  Our features are:

  • Sepal Length
  • Sepal Width
  • Petal Length
  • Petal Width

As K-Means is an unsupervised algorithm, we will be excluding these labels from our data and using Scikit-learn's implementation of K-Means to classify the Iris data-set based on the features above. We will then compare the labels assigned by the algorithm with the true labels, and calculate our accuracy.

One thing to note here is that K-Means assigns its cluster labels arbitrarily, so the cluster numbers it chooses may not line up with the encoding of the true labels, and our raw accuracy score can come out misleadingly low. With 2 clusters, an accuracy below 50% simply means the labels are flipped, and we take the complement of the score as our true accuracy - an accuracy score of 32% is to be read as 100% - 32% = 68%. With 3 clusters, as here, the same idea applies, but we would need to check the label permutations. Let's take a look at the Python code:

import numpy as np
from sklearn.cluster import KMeans
from sklearn import datasets
import matplotlib.pyplot as plt

iris = datasets.load_iris()
X = iris.data
y = iris.target
clf = KMeans(n_clusters=3)
clf.fit(X)
#Array of the Centroids created in the K-Means fit method
centroids = clf.cluster_centers_
#Array of the Labels created in the K-Means fit method
labels = clf.labels_
#Array of colors for the feature-sets
colors = ["g.","r.","c.","b.","k.","m."]
#Iterate over the predictions and compare to our Y label
correct = 0.0
for i in range(len(X)):
    predict_me = np.array(X[i].astype(float))
    predict_me = predict_me.reshape(-1,len(predict_me))
    prediction = clf.predict(predict_me)
    #Compare the prediction to our known labels as Y
    if prediction[0] == y[i]:
        correct += 1
    #Plot our feature-sets with labels determined by K-Means
    plt.plot(X[i][0], X[i][1], colors[labels[i]], markersize=25, zorder=1)
#Plot the calculated centroids
plt.scatter(centroids[:,0], centroids[:,1], marker='x', s=150, linewidths=5 ,zorder=2)
#Print our accuracy score (remember to invert this if lower than 50%)
print(correct/len(X))

plt.ylabel('Sepal Width')
plt.xlabel('Sepal Length')
plt.show()

Our accuracy here was 0.893, which means that based solely on the features, the K-Means algorithm was able to correctly classify 89% of the data. Having plotted the data, we can see in Fig. 2 where the algorithm has settled on the location of each Centroid.

Fig. 2 Plotting of the Iris Sepal features with K-Means Centroids plotted as 'X'

Hierarchical Clustering

In the above example, we had prior knowledge of the number of clusters which we expected the algorithm to classify on. In this example, we will take a look at Hierarchical Clustering, where the number of clusters is not provided to the algorithm. The method we will use is the Mean Shift algorithm.

Mean Shift with the Iris Dataset

Working again with the Iris data-set, we will see if the Mean Shift algorithm can determine the number of classes from the data.

Like the K-Means algorithm, mean shift applies the same principle of moving the Centroids, taking the mean of the cluster and creating a new Centroid, from which the process is repeated until the Centroids stop moving. However, as we are not passing the number of clusters, the Mean Shift algorithm takes a slightly different approach.

To start, the Mean Shift algorithm makes every feature-set a Centroid. For our data, this means that we start with 150 Centroids. For each of the Centroids, the algorithm applies a Bandwidth and groups the neighbouring feature-sets that fall within the radius of the Bandwidth.  Then, like in K-means, it takes the mean of this cluster and creates a new Centroid.  The process continues, for each Centroid, as in Fig. 3.

Fig. 3 Meanshift
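
As a rough sketch of that idea - with made-up toy data, a hand-picked Bandwidth, and none of the optimisations of Scikit-learn's implementation, which we use below:

import numpy as np

#Toy data: every feature-set starts life as its own Centroid
X = np.array([[1.0, 2.0], [1.5, 1.8], [8.0, 8.0], [9.0, 9.5], [8.5, 9.0]])
bandwidth = 2.0
centroids = X.copy()

for _ in range(20):
    new_centroids = []
    for c in centroids:
        #Group the feature-sets that fall within the Bandwidth radius of this Centroid
        in_window = X[np.linalg.norm(X - c, axis=1) <= bandwidth]
        #Move the Centroid to the mean of that group
        new_centroids.append(in_window.mean(axis=0))
    new_centroids = np.array(new_centroids)
    #Stop when the Centroids stop moving
    if np.allclose(new_centroids, centroids):
        break
    centroids = new_centroids

#Centroids that converged to (almost) the same point form one cluster
print(np.unique(np.round(centroids, 3), axis=0))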

The Bandwidth plays a vital part in the Mean Shift algorithm; if not specified, Scikit-learn's implementation will attempt to estimate the Bandwidth itself. Remember that the Bandwidth is the radius within which we cluster the neighbouring feature-sets around a Centroid. Later, I will show how changing this value can alter the results of the algorithm. Here is the Python code:

import numpy as np
from itertools import cycle
from sklearn.cluster import MeanShift
from sklearn import datasets
import matplotlib.pyplot as plt

iris = datasets.load_iris()
X = iris.data
y = iris.target
ms = MeanShift()
ms.fit(X)
#Array of the Cluster Centers created in the Mean Shift fit method
cluster_centers = ms.cluster_centers_
labels = ms.labels_
#Extract the unique cluster labels and count them
n_clusters_ = len(np.unique(labels))
colors = cycle('grcbk')


plt.figure(1)
plt.clf()
#Iterate over the clusters, colors and featuresets and plot results
for k, col in zip(range(n_clusters_), colors):
    my_members = labels == k
    cluster_center = cluster_centers[k]
    plt.plot(X[my_members, 0], X[my_members, 1], col + '.', markersize=25, zorder=1)
    plt.scatter(cluster_center[0], cluster_center[1], marker='x', s=150, linewidths=5, zorder=2)

plt.title('Estimated number of clusters: %d' % n_clusters_)
plt.ylabel('Sepal Width')
plt.xlabel('Sepal Length')
plt.show()

If we take a look at the plot in Fig. 4, we can see that the model only found 2 classes of Iris in the data-set.

Fig. 4 Plotting of the Iris Sepal features with Mean Shift Centroids plotted as 'X'

Whilst it has done an okay job of classifying the Iris Setosa (red) feature-sets, it has clearly had trouble distinguishing the Iris Virginica & Iris Versicolor feature-sets.

Let's take a look at how we can tweak our algorithm by altering the Bandwidth value. In the code example below, rather than letting MeanShift estimate the Bandwidth itself, we calculate it with estimate_bandwidth using a quantile of 0.18.

import numpy as np
from itertools import cycle
from sklearn.cluster import MeanShift, estimate_bandwidth
from sklearn import datasets
import matplotlib.pyplot as plt

iris = datasets.load_iris()
X = iris.data
y = iris.target

bandwidth = estimate_bandwidth(X,quantile=0.18)

ms = MeanShift(bandwidth=bandwidth)
ms.fit(X)
...

Fig. 5 shows how altering the Bandwidth increases our estimated clusters from 2 to 3.

Fig. 5 Changing the Bandwidth (via a quantile of 0.18) results in 3 clusters found

The Meanshift algorithm is very useful in marketing where we suspect that the classes of customers form different subgroups.  Targeting customers based on their subgroup can increase the efficacy of marketing campaigns or recommendation systems. 

Conclusion

In this post I have shown you 2 examples of Clustering with Unsupervised machine learning.  These examples - K-Means & Meanshift - are a good representation of the unsupervised approach and should be easy to understand.

Aside from the potential revelation of previously unknown clusters, a huge benefit of Unsupervised Machine Learning is that it requires fewer resources. As the feature-sets do not require labels, the availability of data for training is not constrained by the need to classify the data before training the algorithm. Thus, data can be sourced in larger quantities and more quickly.

The Code from this post is available here.

What is Machine Learning?

Artificial Intelligence & Machine Learning

Cover of Isaac Asimov's Robot Visions

The explosion of articles and interest in Artificial Intelligence and Machine Learning over the past few years has been hard to miss. From the realisation of automated cars to the incorporation of AI into translation services and digital assistants such as Siri, most of us will interact with some sort of AI in our daily lives. Machine Learning is an increasingly trending topic.

Whilst AI and Machine Learning seem to be used interchangeably, the distinction can be summarised as:

  • AI - the creation of machines which can interact in an intelligent and human-like way. AI attempts to build machines with both logical prowess and emotional intelligence.
  • Machine Learning - the creation of algorithms which can solve problems for which they have not been explicitly programmed, and learn from new data.

In this post, I will explain what Machine Learning is, and introduce some different methodologies and how they might be used.

How is Machine Learning different from computer programming?

Fig. 1 shows the contrast between conventional programming where the rules are hard coded, and Machine Learning, where the rules are deduced by the algorithm.

Fig. 1 Comparison of conventional programming and machine learning

Whilst conventional programming involves solely the application of code, Machine Learning is multidisciplinary, combining Statistics, Maths and Programming.

Fig. 2 Component disciplines of Machine Learning

Methods of Machine Learning

There are 3 main methods of Machine Learning; these are:

  • Supervised Learning
  • Unsupervised Learning
  • Reinforcement Learning

Whilst these methods can be combined, generally, they are each suited to different scenarios, and approach discovery in different ways.

Supervised Learning

In Supervised Learning the algorithm is given data which comprises Features and Labels, and it works out the relationship between the 2, with the goal of creating a model which can predict the Label for new data, given only the Features.

An example of this is to build a Machine Learning algorithm that takes historic data on share prices, formulates a set of rules on how the variables affect the share price, and then predicts a future value, given new data. The type of algorithm used to predict continuous data like this is Linear Regression.

Spam Filtering is an application of Classification.  Given certain features of the email such as:

  • Contains the word 'viagra'
  • Contains the word 'Free'
  • Has spelling mistakes
  • Has no return sender address

the algorithm will learn to classify the email as spam or not spam. Once the model has been optimised on data containing the classification label, we can then test and predict on data with the label excluded.
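
As a toy sketch of the idea - entirely made-up data, with each Feature encoded as 1 for present and 0 for absent, and a Decision Tree standing in for whichever classifier you prefer:

from sklearn.tree import DecisionTreeClassifier

#Each row is one email: [contains 'viagra', contains 'Free', spelling mistakes, no return address]
X = [[1, 1, 1, 1],
     [0, 1, 1, 0],
     [0, 0, 0, 0],
     [1, 0, 1, 1],
     [0, 0, 1, 0]]
#Labels: 1 = spam, 0 = not spam
y = [1, 1, 0, 1, 0]

clf = DecisionTreeClassifier()
clf.fit(X, y)
#Predict for a new email, with the label excluded
print(clf.predict([[0, 1, 0, 1]]))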

Unsupervised Learning

Unlike Supervised Learning, this method does not supply the algorithm with the output which we want it to predict.  The algorithm must determine the correct output by finding relationships, or groupings in the data without any correctly labelled data to learn from.

This method can be used to train a model to create groups within the dataset. Clustering is a method of Machine Learning that is useful in doing just this. The algorithm can reveal clusters in the data which were previously unknown. Clustering algorithms are used in marketing for the segmentation of target markets into insightful subgroups. This information then affords the marketer the opportunity to focus advertising based on the clustered consumer group.

A second example of clustering is in optimising the location of Distribution Centres. Given data on delivery points and stores, the algorithm can cluster the locations into logical groups, and find the central point at which the DC would be best located.

Reinforcement Learning

This method of Machine Learning is common in the automation of robots/vehicles and in training algorithms to play games. It is similar to how children learn about the world around them. Putting your hand on a hot stove and learning not to do that again is a powerful form of reinforcement learning.

Algorithms learning through reinforcement learning quickly move through a process of trial and error, and update themselves based on success or failure.

Conclusion

I hope that this brief introduction to a vast and captivating subject has given you a good idea of what Machine Learning is, and an understanding of the practical applications of 3 methods of Machine Learning.

If this post has whetted your appetite and you would like to learn more, I recommend the following resources:

Books

  • The Master Algorithm, Pedro Domingos
    • A fantastic introduction to machine learning, the evolution of approaches to machine learning including the Five Tribes (Symbolists, Connectionists, Bayesians, Analogizers, Evolutionaries), and ideas on how we can combine these to create The Master Algorithm.
  • Machine Learning for Dummies, John Paul Mueller & Luca Massaron
    • Solid introduction to Machine Learning for the beginner.  Clear and concise.

Programming

There is a wealth of packages and libraries to help get you started with Machine Learning. The programming language Python is arguably becoming the de facto language for exploring Machine Learning, and has 2 excellent libraries available that allow you to quickly start training models, with no need to write the algorithms from scratch. These are:

Additionally, you will want to be using 2 data manipulation libraries:

Learning to code can seem challenging; however, the Python programming language borrows syntax from natural language and, with this and its simple coding style, it can be a great language to start programming with.

If you are interested in learning Machine Learning with Python, I highly recommend a series of  YouTube tutorial videos on Machine Learning with Python which are incredibly clear and comprehensive. The author also has several tutorial series on learning Python.

If you are coming from a R background or want to start Machine Learning in R, I can recommend:

R also has a very popular Machine Learning package which abstracts away the need for the user to code the algorithms from scratch; this is:

Code free

Finally, if you would like to start building Machine Learning algorithms without coding, Microsoft's Azure Machine Learning cloud software offers an incredibly simple drag and drop interface to build models and expose them as Web Services. You can get a free subscription at:

In future posts I will drill into more detail on the topics and methods covered above, but if you have any questions, comments or concerns, please leave them in the comments section below.