The Mathematics behind Linear Regression
In this article, I will explain various mathematical concepts related to Linear Regression in the simplest possible way.
Linear Regression is a Machine Learning algorithm that falls under the Supervised Learning category, where historical data is labelled and used to determine the value of the output (dependent) variable based on one or more predictor (independent) variables. As the name suggests, the relationship between the dependent and independent variables is assumed to be linear.
There are two types of Linear Regression algorithms, based on the number of predictor variables:
Simple Linear Regression: Only one predictor variable is used to predict the values of the dependent variable. Equation of the line: y = c + mX, where
y : dependent variable
X: predictor variable
m: slope of the line defining the relationship between X and y, also called the coefficient of X
c: intercept
Multiple Linear Regression: More than one predictor variable is used to predict the values of the dependent variable.
Equation of the line: y = c + m1x1 + m2x2 + m3x3 + … + mixi, for predictor variables x1, x2, …, xi with respective coefficients m1, m2, …, mi.
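As a quick illustration of these equations in code, here is a minimal sketch using scikit-learn on made-up numbers (purely illustrative, not the code from the repository linked at the end):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up data: one predictor, i.e. simple linear regression
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([2.1, 4.3, 6.2, 8.1, 9.8])

model = LinearRegression().fit(X, y)
print("intercept c:", model.intercept_)   # c in y = c + mX
print("slope m:", model.coef_[0])         # m in y = c + mX
```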
If you want to understand how a linear regression model is built from scratch, please refer to my article below:
https://medium.com/mlearning-ai/linear-regression-simple-explanation-with-example-fba51b2c181d
The aim of linear regression is to find the best-fit line for the given X and y variables, such that we get optimal values for c and m in the above equation. To achieve this, we need to understand the above graph properly.
For every X value, say Xk, we have a y value as per our data, say y_true. Our regression line also gives us a y value, say y_pred. For simplicity of understanding, the most intuitive line that we can draw is the one passing through the average of all y values, say y = y_avg.
So now we have y_true, y_pred and y_avg. Let’s see how they are related.
y_true - y_pred gives us the error term associated with Xk, also called the residual. These error terms can be positive or negative depending on the y_true value, so we square the residuals in order to avoid the negative sign. The sum of all such squared residuals is called the Residual Sum of Squares (RSS). This forms our cost function. We need to minimize this cost function in order to get optimal values for the slope (m) and intercept (c) of our linear regression line, which we will see further in this article.
The Explained Sum of Squares (ESS) measures how much variation exists in the values given by our regression line, as compared to the average.
The Total Sum of Squares (TSS) measures how much variation exists in the observed values, as compared to the average.
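Written out, with y_i the observed values, ŷ_i the values given by the regression line, and ȳ the average of the observed values, these are the standard definitions:

```latex
\text{RSS} = \sum_i (y_i - \hat{y}_i)^2, \qquad
\text{ESS} = \sum_i (\hat{y}_i - \bar{y})^2, \qquad
\text{TSS} = \sum_i (y_i - \bar{y})^2
```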
From the above equations and the figure, we can verify that:
TSS = ESS + RSS
Coefficient of Determination (R-Squared)
For the regression line shown in the figure, the coefficient of determination is a measure that tells how much of the variance in the dependent variable is explained by the independent variable. In short, R-squared tells us how good our model fit is for the given data.
The value of R-squared ranges from 0 to 1. A value close to 1 generally means the model is a good fit, or more precisely, that a good amount of the variance in the dependent variable is explained by the independent variable(s).
So intuitively we can write R-squared = ESS/TSS.
Since ESS = TSS - RSS, this gives R-squared = (TSS - RSS)/TSS.
Hence, R-squared = 1 - (RSS/TSS).
This is an important equation when it comes to linear regression.
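As a quick numerical sketch (made-up numbers, purely for illustration), R-squared can be computed directly from RSS and TSS:

```python
import numpy as np

# Made-up observed values and predictions from some fitted line
y_true = np.array([3.0, 4.5, 6.1, 7.9, 10.2])
y_pred = np.array([3.2, 4.4, 6.0, 8.1, 9.9])
y_avg = y_true.mean()

rss = np.sum((y_true - y_pred) ** 2)   # Residual Sum of Squares
tss = np.sum((y_true - y_avg) ** 2)    # Total Sum of Squares
r_squared = 1 - rss / tss              # R-squared = 1 - (RSS/TSS)

print(round(r_squared, 4))
```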
Adjusted R-squared
While building a multiple linear regression model, the R-squared value keeps increasing as we add variables to the model. But what if an added variable is not significant? It may unnecessarily make the model complex and increase the chance of overfitting. To take care of this, there is a measure called Adjusted R-squared.
Adjusted R-squared penalizes the model if an added variable is insignificant.
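Written out, the standard formula is:

```latex
\text{Adjusted } R^2 = 1 - (1 - R^2)\,\frac{N - 1}{N - P - 1}
```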
where N is the number of sample points in our data and P is the number of predictor variables.
This is also a very important concept w.r.t. Linear Regression.
A good linear regression model will always have R-squared and Adjusted R-squared values close to each other. However, the Adjusted R-squared value is always less than or equal to the R-squared value.
Gradient Descent
To get the best-fit line, we have to minimize the underlying cost function, which in our case is RSS, as stated above. Closed-form optimization and iterative optimization are the two popular approaches used to minimize the cost function in question.
In the closed-form solution, we simply take the derivative of the cost function, equate it to 0, and solve to get the minimum. But this becomes complex when the data is multidimensional.
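As a sketch of the closed-form route, here is a minimal NumPy example on made-up data, using the standard normal-equation formula (purely illustrative):

```python
import numpy as np

# Made-up data for illustration
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.3, 6.2, 8.1, 9.8])

# Design matrix with a column of ones for the intercept c
A = np.column_stack([np.ones_like(X), X])

# Closed-form (normal equation) solution: theta = (A^T A)^{-1} A^T y
theta = np.linalg.solve(A.T @ A, A.T @ y)
c, m = theta
print(f"intercept c = {c:.3f}, slope m = {m:.3f}")
```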
While the closed-form solution is preferred when the data set is small, Gradient Descent is useful for larger data, and it is also a less complex and cheaper option. Gradient Descent runs in iterations until the minimum value of the cost function is reached. To understand more, let us consider the figure below.
J(m) is our cost function, and we want to reach the point where it converges, i.e. where its value is minimum. We move toward this point in steps, also called iterations.
The learning rate is the size of the step we take in the direction of minimization. The larger the learning rate, the higher the chance that we overshoot the minimum. Hence it is wise to keep the learning rate small; this may make the process slower, but it is worth it.
m1 = m0 - (learning rate) × (dJ/dm)
where:
m1 : the next point given by the iteration
m0 : the starting point of the current iteration
We know that our cost function is RSS = Σ(y_true - y_pred)², summed over all data points.
Substituting y_pred with ‘c + mx’, we get a function with two unknowns, m and c.
We take partial derivatives of this function, once w.r.t. m and once w.r.t. c. Setting them to 0 gives two equations in two unknowns; solving these equations gives the m and c values.
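Writing these partial derivatives out, with the sum taken over all data points (x_i, y_i):

```latex
\frac{\partial\, \text{RSS}}{\partial m} = -2\sum_i x_i\big(y_i - (c + m x_i)\big), \qquad
\frac{\partial\, \text{RSS}}{\partial c} = -2\sum_i \big(y_i - (c + m x_i)\big)
```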
Thus, in Gradient Descent, when our function reaches the point where it converges, RSS becomes minimum and we get the optimal values of the slope (m) and intercept (c).
Hence we can say that the regression line y = mx + c is the best-fit line.
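To make this concrete, here is a minimal Gradient Descent sketch in NumPy (made-up data and a small learning rate, purely illustrative):

```python
import numpy as np

# Made-up data for illustration
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.3, 6.2, 8.1, 9.8])

m, c = 0.0, 0.0          # initial slope and intercept
learning_rate = 0.01
n_iterations = 5000

for _ in range(n_iterations):
    y_pred = c + m * X
    error = y - y_pred                 # residuals
    # Partial derivatives of RSS w.r.t. m and c
    dJ_dm = -2 * np.sum(X * error)
    dJ_dc = -2 * np.sum(error)
    # Gradient Descent update: step against the gradient
    m -= learning_rate * dJ_dm
    c -= learning_rate * dJ_dc

print(f"slope m = {m:.3f}, intercept c = {c:.3f}")
```

With a learning rate this small, the loop converges to essentially the same m and c that the closed-form solution gives.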
Conclusion: There is a lot of mathematics involved in studying linear regression, but I have tried to explain the main mathematical concepts behind Linear Regression in a simplified way. I hope everyone enjoys reading this article.
Thank You.
You can connect with me on LinkedIn : https://www.linkedin.com/in/pathakpuja/
Please visit my GitHub profile for the python codes: https://github.com/pujappathak
Feel free to comment and give your feedback if you like my articles.
© 2021 — Puja Pathak — All rights reserved. Do not copy, reproduce, or distribute the content of this article without prior permission. Instead, share a link to this article. Thank you for respecting my work!