Simple Linear Regression Theory

Simple Linear Regression

Linear regression is the simplest statistical regression method used for predictive analysis. The most common form is simple linear regression, which involves one independent variable and one dependent variable. It is a simple, easy-to-apply algorithm that models the relationship between continuous variables. The main goal is to find a linear relationship between the independent variable (the predictor, plotted on the X-axis) and the dependent variable (the output, plotted on the Y-axis); hence the name linear regression.

If there is only one input variable (x), the model is called simple linear regression; if there is more than one input variable, it is called multiple linear regression.

How to compute it?

To compute the best-fit line in linear regression, we use the following line function:

$$Y_i = A x_i + B$$

💡
Yᵢ = Dependent variable, B = Intercept, A = Slope, xᵢ = Independent variable
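
As a minimal sketch of the line function, assuming made-up values for the slope A and the intercept B (Python with NumPy):

```python
import numpy as np

# Hypothetical slope (A) and intercept (B), chosen only for illustration.
A, B = 2.0, 1.0

# Independent variable values x_i.
x = np.array([1.0, 2.0, 3.0, 4.0])

# Predicted dependent variable: Y_i = A * x_i + B.
y_pred = A * x + B
print(y_pred)  # [3. 5. 7. 9.]
```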

A linear line showing the relationship between the dependent and the independent variables is called a regression line. A regression line can show two types of relationship:

  1. Positive linear relationship: If the dependent variable increases on the Y-axis as the independent variable increases on the X-axis, the relationship is termed a positive linear relationship.

  2. Negative linear relationship: If the dependent variable decreases on the Y-axis as the independent variable increases on the X-axis, the relationship is called a negative linear relationship.

How to define the best fit?

When working with linear regression, our main goal is to find the best fit, that is, the line for which the error between the predicted values and the actual values is minimized. The best-fit line is defined as the line with the least error.

Random Error (Residuals)

Residuals are defined as the difference between the observed values of the dependent variable and the predicted values.

$$\varepsilon_i = Y_i - \hat{Y}_i$$
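
A minimal sketch of this definition, using made-up observed and predicted values:

```python
import numpy as np

# Observed values Y_i and predicted values (illustrative numbers only).
y_obs = np.array([3.1, 4.9, 7.2, 8.8])
y_pred = np.array([3.0, 5.0, 7.0, 9.0])

# Residual: observed minus predicted.
residuals = y_obs - y_pred
print(residuals)  # approximately [0.1, -0.1, 0.2, -0.2]
```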

How to obtain it mathematically?

We use a cost function to work out the optimal values for A and B. The cost function is what we optimize the regression coefficients (weights) against, and it measures how well a linear regression model is performing.

Mean Squared Error (MSE)

We use the average of the squared errors between the predicted and the observed values.

$$MSE = \frac{1}{N}\sum_{i=1}^{N}\left(Y_i - (A x_i + B)\right)^2$$

💡
N = Total number of observations, Yᵢ = Actual value, Axᵢ + B = Predicted value
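
A short sketch computing the MSE directly from this formula, again with illustrative data and hypothetical values for A and B:

```python
import numpy as np

# Illustrative data and hypothetical coefficients A (slope) and B (intercept).
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.1, 4.9, 7.2, 8.8])
A, B = 2.0, 1.0

# Mean Squared Error: average of the squared differences
# between the actual and the predicted values.
mse = np.mean((y - (A * x + B)) ** 2)
print(mse)
```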

Gradient Descent

Gradient Descent is one of the optimization algorithms that can be used to minimize the cost function. To obtain the optimal solution, we iteratively update A and B so that the MSE over all data points is reduced.
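
A rough sketch of batch gradient descent on the MSE, with illustrative data and an assumed learning rate; the partial derivatives of the MSE with respect to A and B drive the updates:

```python
import numpy as np

# Illustrative data; A and B start at zero and are updated iteratively.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.1, 4.9, 7.2, 8.8])

A, B = 0.0, 0.0   # initial slope and intercept
lr = 0.05         # learning rate (assumed value)
n = len(x)

for _ in range(2000):
    error = y - (A * x + B)
    # Partial derivatives of the MSE with respect to A and B.
    grad_A = -2.0 / n * np.sum(x * error)
    grad_B = -2.0 / n * np.sum(error)
    # Step in the direction that reduces the MSE.
    A -= lr * grad_A
    B -= lr * grad_B

print(A, B)  # approaches the least-squares fit for this data (≈ 1.94, ≈ 1.15)
```

With a sufficiently small learning rate and enough iterations, the updates converge to the same slope and intercept that the closed-form least-squares solution would give.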

Evaluation

The most commonly used metrics are:

  • Coefficient of determination, or R-squared (R²)

  • Root Mean Squared Error (RMSE)
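
Both metrics can be computed directly from their definitions; the sketch below uses the same made-up actual and predicted values as above:

```python
import numpy as np

# Illustrative actual and predicted values.
y = np.array([3.1, 4.9, 7.2, 8.8])
y_pred = np.array([3.0, 5.0, 7.0, 9.0])

# Root Mean Squared Error: square root of the MSE.
rmse = np.sqrt(np.mean((y - y_pred) ** 2))

# R-squared: 1 minus the ratio of the residual sum of squares
# to the total sum of squares around the mean of y.
ss_res = np.sum((y - y_pred) ** 2)
ss_tot = np.sum((y - np.mean(y)) ** 2)
r2 = 1 - ss_res / ss_tot

print(rmse, r2)
```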

Assumptions to apply it

  1. Linearity of the variables: There must be a linear relationship between the dependent and the independent variables.

  2. Independence of residuals: The error terms should not be dependent on one another.

  3. Normal distribution of residuals: The residuals should follow a normal distribution with a mean close to zero.

  4. Equal variance of residuals (homoscedasticity): The error terms must have constant variance (a rough numerical check for assumptions 3 and 4 is sketched after this list).
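
As a rough illustration (with made-up residuals), assumptions 3 and 4 can be checked numerically as sketched below; in practice, a residuals-versus-fitted plot or a formal statistical test is more common:

```python
import numpy as np

# Illustrative residuals from a fitted model.
residuals = np.array([0.1, -0.1, 0.2, -0.2])

# Assumption 3: the residuals should have a mean close to zero.
print("mean of residuals:", residuals.mean())

# Assumption 4: residual spread should be roughly constant; here we crudely
# compare the variance of the first and second halves of the data.
half = len(residuals) // 2
print("variance, first half :", residuals[:half].var())
print("variance, second half:", residuals[half:].var())
```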