Introduction to Supervised Learning
Though there are many definitions of machine learning (“ML”), the principal idea is to give an algorithm the ability to improve without programming it explicitly. It gets around the fact that the set of all possible behaviours given all possible inputs quickly becomes too complex to describe and program in a classical way. In this blog post we will give a quick introduction to the different terms used in ML, explain the process of creating a simple supervised learning algorithm, and illustrate it through a linear regression example.
There are two main phases involved in the creation of an ML algorithm: the learning phase, where the model is created and improved, and the production phase, once the model is finished. Some models can keep learning during the production phase if they have access to feedback on the quality of the results they produce. We will focus here on the learning phase.
Problems are grouped by the type of output: we speak of regression when the target variable is continuous, and of classification when it is discrete (a category).
We also speak of supervised learning for labelled (tagged) data (when we know both the input and the output (x, y) for a sample of data, see below), reinforcement learning when each action receives feedback in the form of a quantitative reward, and unsupervised learning for unlabelled data.
If I want to learn Spanish :
- Supervised learning : I get a series of « Spanish – English » example pairs that I learn and extrapolate from (perro – dog ; perra – female dog ; gato – cat) –> gata – female cat.
- Reinforcement learning : I get a teacher that scores every Spanish sentence I make.
- Unsupervised learning : I go to Spain and try to guess the vocabulary and grammar rules.
It is worth noting that these are only a few of the numerous types of problems you can encounter in ML, and the list can be extended almost indefinitely (supervised, reinforcement and unsupervised learning can be complemented by transfer, semi-supervised and partially supervised learning; regression and classification problems can be complemented by cluster analysis or dimensionality reduction, etc.).
Let’s illustrate supervised learning and its four steps with a linear regression model. You might be thinking « well, I can draw a line through the middle of a cloud of points, I thought ML was things like voice recognition, autonomous driving, ... » But the process used to draw this line is exactly the same as the one used for more complex tasks. Let’s dive in with the following example :
The dataset includes two types of variables: the target variable y, here the price of a flat (what the machine must learn to predict), and the feature variables x1, x2, ..., xn, such as the surface of the flat or its construction year (what influences the value of y).
Ex : We have pairs (x, y) with m = 5 (size of the dataset) and n = 1 (number of features).
Model (blue) :
Function f used to describe the link between x and y, so that f(x) ≈ y.
Ex : f(x) = ax + b with random values for a and b at the beginning.
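As a minimal sketch of this starting point in Python: the function name `predict`, the random range and the input values below are illustrative assumptions, not taken from the original code.

```python
import random

def predict(x, a, b):
    """Linear model f(x) = a*x + b for a single feature x."""
    return a * x + b

# Hypothetical starting point: a and b are drawn at random before any learning.
random.seed(0)
a = random.uniform(-1.0, 1.0)
b = random.uniform(-1.0, 1.0)

# Illustrative inputs only (e.g. flat surfaces in square metres), m = 5 samples.
xs = [30.0, 45.0, 60.0, 75.0, 90.0]
predictions = [predict(x, a, b) for x in xs]
```

At this stage the predictions are meaningless; the following steps are what turn the random (a, b) into useful values.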
Cost function (red) :
It measures the difference between the model and reality, f(x) – y, over the dataset. Any norm can be used; in our case we use the Euclidean distance.
Note : This formula represents the general idea; it can be adjusted in practice.
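One common way to write such a cost, assumed here because the original formula is not reproduced, is the mean squared error J(a, b) = 1/(2m) · Σ (f(xi) – yi)². The sketch below computes it for a linear model; the function name `cost` and the data are hypothetical.

```python
def cost(xs, ys, a, b):
    """Mean squared error J(a, b) = 1/(2m) * sum((a*x + b - y)^2)."""
    m = len(xs)
    return sum((a * x + b - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)

# Hypothetical data lying exactly on the line y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

perfect = cost(xs, ys, 2.0, 1.0)  # the model matches the data exactly
worse = cost(xs, ys, 0.0, 0.0)    # the model predicts 0 everywhere
```

The factor 1/2 is only a convenience that cancels when taking the derivative; the cost is zero for a perfect fit and grows as the line moves away from the points.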
Learning Algorithm :
The algorithm that reduces the cost function, ex : least squares, gradient descent, ...
Ex : Seen as a function of (a, b), J has the shape of a quadratic (square) function and is convex, so we could find the minimum by solving for the (a, b) at which the derivative of J is zero. We will instead use gradient descent here to introduce the method.
Our model starts with a « random » a0. We calculate the next point a1 using :
a1 = a0 – α · dJ/da (a0)
α is the step size. It must be small enough so that we don’t oscillate around the solution, but big enough so that the algorithm doesn’t take forever.
Why a “-” sign ? If a0 < amin, then the derivative of J at a0 is negative (the slope goes down towards the minimum). Therefore a1 > a0, and we get closer to amin (see figure 3). On the other hand, if we step over the minimum because α is too big and a1 > amin, then the derivative will be positive and the next step will bring us back towards the minimum.
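The update rule can be sketched in a few lines of Python. This is a minimal illustration, not the original code: it assumes a mean squared error cost J = 1/(2m) · Σ (f(xi) – yi)², starts from (0, 0) rather than a random point, and the step size, iteration count and data are hypothetical choices.

```python
def gradient_descent(xs, ys, alpha=0.1, iterations=1000):
    """Fit f(x) = a*x + b by gradient descent on J = 1/(2m) * sum((f(x) - y)^2)."""
    m = len(xs)
    a, b = 0.0, 0.0   # arbitrary starting point (a0, b0)
    history = []      # cost at each iteration, useful for plotting
    for _ in range(iterations):
        errors = [a * x + b - y for x, y in zip(xs, ys)]
        history.append(sum(e * e for e in errors) / (2 * m))
        # Partial derivatives of J with respect to a and b.
        grad_a = sum(e * x for e, x in zip(errors, xs)) / m
        grad_b = sum(errors) / m
        # The "-" sign moves each parameter against the slope.
        a -= alpha * grad_a
        b -= alpha * grad_b
    return a, b, history

# Hypothetical data lying exactly on the line y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
a, b, history = gradient_descent(xs, ys)
```

With these illustrative values, a and b end up close to 2 and 1. A much larger `alpha` would make the iterations diverge on this data, and a much smaller one would need many more iterations, which is the trade-off described above.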
Though this example is very simple, the use of matrices allows the algorithm to be generalised. The example below will show that it can easily cover multiple dimensions and polynomial degrees with hardly any change.
There are multiple ways to check that our algorithm works, the first of which is to display the cost function. In theory, it should decrease with each iteration until it reaches a minimum (plotting it also helps us adjust the step size and the number of iterations required). In the case of a linear regression, we can also calculate the coefficient of determination R², which measures how well the predicted values match the measured ones and, for a simple linear regression, equals the square of the Pearson correlation coefficient (Coefficient of determination – Wikipedia ; Pearson correlation coefficient – Wikipedia). It is commonly defined as :
R² = 1 – Σ (yi – f(xi))² / Σ (yi – ȳ)², where ȳ is the mean of the measured values.
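As a minimal sketch, assuming the usual definition R² = 1 – SS_res / SS_tot (residual sum of squares over total sum of squares), the check can be written as follows. The function name `r_squared` and the data are hypothetical.

```python
def r_squared(ys, predictions):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - p) ** 2 for y, p in zip(ys, predictions))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot

# Hypothetical measured values.
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

# A model that reproduces the measurements exactly.
perfect_score = r_squared(ys, ys)

# A baseline model that always predicts the mean of the measurements.
mean_y = sum(ys) / len(ys)
baseline_score = r_squared(ys, [mean_y] * len(ys))
```

A perfect model scores 1, while a model that does no better than always predicting the mean scores 0, so a value moving towards 1 during training is a sign the fit is improving.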
The Python code is available upon request; we show some results here.
We first create a dataset (figure 4 – blue). We then choose a linear model with random (a, b) parameters (figure 4 – red). We create a cost function and reduce it (figure 5) by iterating a learning algorithm (gradient descent), which gives us our final model (figure 6 – green), whose coefficient of determination is much better and whose parameters (a, b) are known.
By using matrices in the code, the algorithm can easily be adapted to multiple dimensions and polynomial degrees (see below).
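To illustrate the idea, here is a minimal NumPy sketch rather than the original code. It assumes a mean squared error cost J(θ) = 1/(2m) · ‖Xθ – y‖², where each column of the matrix X is one feature (or one power of x) and θ collects all the parameters; the function name `fit`, the step size, the iteration count and the data are hypothetical.

```python
import numpy as np

def fit(X, y, alpha=0.5, iterations=2000):
    """Gradient descent on J(theta) = 1/(2m) * ||X @ theta - y||^2."""
    m, n = X.shape
    theta = np.zeros(n)  # arbitrary starting point
    for _ in range(iterations):
        gradient = X.T @ (X @ theta - y) / m
        theta -= alpha * gradient
    return theta

# Hypothetical data lying exactly on y = 1 + 2x + 3x^2.
x = np.linspace(-1.0, 1.0, 21)
y = 1.0 + 2.0 * x + 3.0 * x**2

# Design matrix with columns 1, x, x^2: a degree-2 polynomial model.
X = np.column_stack([np.ones_like(x), x, x**2])
theta = fit(X, y)
```

Moving from a straight line to a polynomial, or to several features, only changes the columns of `X`; the `fit` function itself stays the same. In practice features on very different scales usually need rescaling first, otherwise a single step size suits them poorly.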
Machine learning applied to autonomous vehicles (AV) raises important questions about understanding the underlying algorithms and gaining public confidence, but also about distributing liability. It also represents a change in paradigm, as it is no longer a matter of a driver following clear rules but of influencing them. Whatever your opinion on the matter, we should all keep in mind that machine learning is what makes AV possible.
Written by Geoffroy Heurtier – Project Engineer at Claytex