I’ve started taking a popular free Coursera course on Machine Learning taught by Andrew Ng, one of the experts in the field. Before this, I had learned a little about machine learning from YouTube videos, which mostly covered Genetic Algorithms and Artificial Neural Networks in general. Now I finally understand the foundations slightly better.
Several topics were covered during the first and second weeks. In brief, Ng gives a short history of machine learning, explains the two main types of machine learning (Supervised and Unsupervised Learning), and then introduces Linear Regression and how to use Gradient Descent to find a local or global optimum of the cost function for the Linear Regression hypothesis.
This time, I will go deeper into Linear Regression and how to implement Gradient Descent. First of all, Linear Regression is an approach for modeling the relationship between one or more features (the inputs) and a target value (the output). For example, suppose we have data relating Land Area to House Price. Look at the table below:
|Land Area|House Price|
|100 m²|Rp 1.000.000|
|200 m²|Rp 2.000.000|
|300 m²|Rp 3.000.000|
|400 m²|Rp 4.000.000|
Looking at the data, we can conclude that the function for determining the House Price is:
House Price = Land Area * 10.000
or, in a more “symbolic” way:
Y = 𝒙 * 10.000
Now, for linear regression itself, let us write the function in a more general form. But wait a moment: we cannot call it a function yet, because we are only “guessing” it. So we call it a Hypothesis, and it is:
Y = θ0 + θ1 * 𝒙1
This is a simple hypothesis for linear regression with 1 feature. We can also extend the hypothesis to match the dataset. For example, if the dataset has more features:
Y = θ0 + θ1 * 𝒙1 + … + θn * 𝒙n
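To make the general hypothesis concrete, here is a minimal Python sketch of it. The function name `predict` and the convention of passing θ0 as the first entry of `theta` are my own assumptions for illustration, not something from the course.

```python
from typing import Sequence


def predict(theta: Sequence[float], features: Sequence[float]) -> float:
    """Evaluate Y = theta0 + theta1 * x1 + ... + thetan * xn for one example."""
    if len(theta) != len(features) + 1:
        raise ValueError("theta needs one more entry than features (theta0 is the intercept)")
    intercept, weights = theta[0], theta[1:]
    # Multiply each feature by its matching theta and add the intercept.
    return intercept + sum(w * x for w, x in zip(weights, features))


# Land area example from the table: theta0 = 0, theta1 = 10000 (written 10.000 above).
print(predict([0.0, 10000.0], [100.0]))  # prints 1000000.0
```

With one feature this reduces to Y = θ0 + θ1 * 𝒙1, and with θ0 = 0 it reproduces the House Price rows in the table.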
Now, imagine that we have new data about the price of an application as a function of its version.
Now, using the known hypothesis Y = θ0 + θ1 * 𝒙1, we can start looking for the optimal values of θ0 and θ1.
First, give θ0 and θ1 some random starting values. Make sure they stay within the range of the feature and Y values. This time, for the sake of simplicity, I will set both to 1.
As we can see, the blue line is our hypothesis, and it is still far from the expected values. Now let us see how big the error of our current hypothesis is by using a cost function.
We need a function to measure how big the error of our hypothesis is. One commonly used function is the mean squared error, which measures the average squared difference between the actual values (from the dataset) and the estimated values (the predictions). It looks like this:
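Written out for the one-feature hypothesis, a common textbook form of the mean squared error is the following. The symbols J, m (the number of rows in the dataset) and the superscript (i) (the i-th row) are notation I am assuming here. Ng's course writes the cost with an extra factor of 1/2, that is 1/(2m) instead of 1/m, which only rescales the value and does not move the minimum.

```latex
J(\theta_0, \theta_1) = \frac{1}{m} \sum_{i=1}^{m} \left( \theta_0 + \theta_1 x^{(i)} - y^{(i)} \right)^2
```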
When both θ0 and θ1 are 1, the Mean Squared Error is 68.25. We want to decrease it, so we will use the Gradient Descent method.
Gradient descent is an iterative optimization algorithm for finding the minimum of a function. Put simply, gradient descent helps us reach a fitted result by using the derivative of the Cost Function: at each step it moves θ0 and θ1 a little in the direction that decreases the cost.
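The idea can be sketched in a few lines of Python. This is only a rough sketch under my own assumptions: the data points, the learning rate `alpha`, and the function names are made up for illustration, and the cost is the plain mean squared error (1/m, without the 1/2 factor). It does not use the dataset from this post, so its numbers will not match the 68.25 above.

```python
from typing import List, Tuple


def mse(theta0: float, theta1: float, xs: List[float], ys: List[float]) -> float:
    """Mean squared error of the hypothesis Y = theta0 + theta1 * x."""
    m = len(xs)
    return sum((theta0 + theta1 * x - y) ** 2 for x, y in zip(xs, ys)) / m


def gradient_descent(
    xs: List[float],
    ys: List[float],
    theta0: float = 1.0,
    theta1: float = 1.0,
    alpha: float = 0.05,      # learning rate (assumed value)
    iterations: int = 600,
) -> Tuple[float, float, List[float]]:
    """Batch gradient descent; returns the fitted thetas and the cost per iteration."""
    m = len(xs)
    history = [mse(theta0, theta1, xs, ys)]
    for _ in range(iterations):
        errors = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
        # Partial derivatives of the mean squared error with respect to each theta.
        grad0 = 2.0 / m * sum(errors)
        grad1 = 2.0 / m * sum(e * x for e, x in zip(errors, xs))
        # Update both thetas together, stepping against the gradient.
        theta0 -= alpha * grad0
        theta1 -= alpha * grad1
        history.append(mse(theta0, theta1, xs, ys))
    return theta0, theta1, history


# Hypothetical data generated from Y = 1 + 2 * x, so the fit should approach (1, 2).
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
t0, t1, history = gradient_descent(xs, ys)
print(round(t0, 3), round(t1, 3))
print(history[0], history[-1])
```

Both thetas start at 1, as in the example above, and `history` records the error at every iteration so it can be plotted afterwards. If `alpha` is too large the error grows instead of shrinking, so it usually needs some tuning for each dataset.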
After 600 iterations, we find the hypothesis that minimizes the cost function, shown as the red line below.
And here is the error over the iterations:
Notice that the error decreases steeply at first and then gradually slows down around the 150th iteration.