Linear Regression [Week 1]

made by https://cneuralnets.netlify.app/

This is the most basic and overlooked in today’s machine learning world, when we have advanced stuff, like transformers, RNNs and so much more. But in reality, if you dive deep into any kind of model, it will have linear regression in some form or the other!

What is it?

The equation for linear regression is :

$$ y=\alpha_0+\alpha_1x_1+\alpha_2x_2+.....+\alpha_nx_n+\epsilon $$

where : y is the dependent variable, the variable we are trying to predict $x_i$ is the independent variable, the features our model is trying to use $\alpha_i$ is the coefficient (or weights) of our linear regression, they are what we are learning essentially $\epsilon$ is the error in our model

What we mainly try to do is try to fit the model. By fitting we mean, we need to find the set of coefficients that will form the best predictions for $y$, closest to the actual values. Finally it will be as easy as just plugging in the values of $x_i$ in the equation below to find your prediction

$$ \hat{y}=\hat{\alpha_0}+\hat{\alpha_1x_1}+\hat{\alpha_2x_2}+....+\hat{\alpha_nx_n} $$

Assumptions of Regression Models

While making a regression model, one must always keep in mind the 4 rules that are assumed to be true before we move on to make the model!

Linear Relationship

The fundamental principle of multiple linear regression is that there is a linear relationship between the dependent (outcome) variable and the independent variables. This linearity can be visually assessed using scatterplots, which should indicate a straight-line relationship rather than a curvilinear one.

Edit: However, it is essential to clarify that while the relationship between the dependent and independent variables must be linear in terms of the coefficients, this does not preclude the need for transformations if the raw data exhibits non-linear patterns. It is specifically the relationship between the dependent variable and the model parameters, and not necessarily to the raw data itself.

Multivariate Normality

The analysis presumes that the residuals (the differences between observed and predicted values) follow a normal distribution. This assumption can be evaluated by inspecting histograms or Q-Q plots of the residuals, or through statistical tests like the Kolmogorov-Smirnov test.

Absence of Multicollinearity

It is crucial that the independent variables are not excessively correlated with one another, a situation referred to as multicollinearity. This can be assessed using: