My Understanding of Linear Regression

ifeelfree
3 min read · Nov 28, 2020


I will use this article to keep a record of my understanding of linear regression.

Part 1: Linear regression model definition

A multidimensional linear model is defined as:

Y = b0 + b1*X1 + b2*X2 + … + bp*Xp + e

where b0 is the intercept, b1, …, bp are the coefficients, and e is the error term. Geometrically, this is akin to fitting a plane to points in three dimensions, or fitting a hyperplane to points in higher dimensions.

Here we show a demo on how to use linear regression with functions provided by sklearn library:

import numpy as np
from sklearn.linear_model import LinearRegression

model = LinearRegression(fit_intercept=True)
rng = np.random.RandomState(1)
X = 10 * rng.rand(100, 3)
y = 0.5 + np.dot(X, [1.5, -2., 1.]) + 0.1 * rng.randn(100)
model.fit(X, y)
print(model.intercept_)
print(model.coef_)

The output of the demo is:

0.5156233346576982
[ 1.49815954 -1.99762243 0.99725804]

Part 2: Theory behind linear regression

(1) correlation vs regression

The difference lies in the fact that correlation measures the strength of an association between two variables while regression quantifies the nature of the relationship.

(2) Key terms

  • response (Y): variable to predict
  • independent variable (X): the variable used to predict the response
  • record (x, y): one observation
  • intercept: the response Y when independent variable X is zero
  • least squares: the method of fitting a regression by minimizing the sum of squared residuals; note that least squares is sensitive to outliers
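To make the least-squares idea concrete, the coefficients that minimize the sum of squared residuals can be computed directly from the normal equations. This is a minimal NumPy sketch on synthetic data (the data and true coefficients here are made up for illustration):

```python
import numpy as np

rng = np.random.RandomState(0)
X = 10 * rng.rand(100, 2)
y = 1.0 + X @ np.array([2.0, -1.0]) + 0.1 * rng.randn(100)

# Augment X with a column of ones so the intercept is estimated too
X1 = np.column_stack([np.ones(len(X)), X])

# Normal equations: beta = (X^T X)^{-1} X^T y
beta = np.linalg.solve(X1.T @ X1, X1.T @ y)
print(beta)  # approximately [1.0, 2.0, -1.0]
```

In practice one would use np.linalg.lstsq or sklearn rather than inverting the normal equations, but the sketch shows what "minimizing the sum of squared residuals" computes.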

(3) prediction versus explanation

Regression is used for both prediction and explanation.

Historically, a primary use of regression was to illuminate a supposed linear relationship between predictor variables and an outcome variable. The goal has been to understand a relationship and explain it using the data that the regression was fit to.

With the advent of big data, regression is widely used to form a model to predict individual outcomes for new data (i.e., a predictive model) rather than explain data in hand. In this instance, the main items of interest are the fitted values.

The interpretation of the coefficients is as follows: the predicted value changes by the coefficient bi for each unit change in Xi, assuming all the other variables Xj remain the same.
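This interpretation can be verified numerically: increasing one predictor by one unit while holding the others fixed changes the prediction by exactly that coefficient. A small sketch reusing the synthetic data from Part 1:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(1)
X = 10 * rng.rand(100, 3)
y = 0.5 + X @ np.array([1.5, -2.0, 1.0]) + 0.1 * rng.randn(100)

model = LinearRegression().fit(X, y)

# Take one observation and bump its first predictor by one unit
x0 = X[:1].copy()
x1 = x0.copy()
x1[0, 0] += 1.0

# The change in the prediction equals the first coefficient
delta = model.predict(x1)[0] - model.predict(x0)[0]
print(delta, model.coef_[0])  # the two values agree
```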

(4) model assessment

  • Root Mean Squared Error (RMSE): the square root of the average squared error of the regression. It has a very similar meaning to the Residual Standard Error (RSE).
import numpy as np
from sklearn.metrics import mean_squared_error
RMSE = np.sqrt(mean_squared_error(gd, predicted))  # gd: ground-truth values
  • R2 (coefficient of determination): a statistic that measures the goodness of fit of a model, i.e., how well the regression predictions approximate the real data points. It is defined as the proportion of variation in the data that is accounted for by the model; an R2 of 1 indicates that the predictions fit the data perfectly.
from sklearn.metrics import r2_score
r2_score(gd, predicted)
  • t-statistic: it moves inversely to the p-value; the larger the t-statistic, the more significant the variable. A high t-statistic suggests the model should keep the variable, while a low t-statistic suggests the variable can be discarded. t-statistics are reported when setting up a linear regression model with statsmodels.
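The R2 definition above (proportion of variation accounted for) can be checked by hand against sklearn. A minimal sketch with made-up ground-truth and predicted values:

```python
import numpy as np
from sklearn.metrics import r2_score

gd = np.array([3.0, -0.5, 2.0, 7.0])        # ground-truth values
predicted = np.array([2.5, 0.0, 2.0, 8.0])  # model predictions

# R2 = 1 - SS_res / SS_tot
ss_res = np.sum((gd - predicted) ** 2)        # residual sum of squares
ss_tot = np.sum((gd - gd.mean()) ** 2)        # total sum of squares
r2_manual = 1 - ss_res / ss_tot

print(r2_manual, r2_score(gd, predicted))  # both ~0.9486
```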

Part 3: Implementation

(1) scikit-learn

It provides the LinearRegression class for handling regression problems, and its methods accept both numpy.array and pandas.DataFrame (or pandas.Series) as inputs.

predictors = ['SqFtTotLiving', 'SqFtLot', 'Bathrooms',
              'Bedrooms', 'BldgGrade']
outcome = 'AdjSalePrice'

house_lm = LinearRegression()
house_lm.fit(house[predictors], house[outcome])

print(f'Intercept: {house_lm.intercept_:.3f}')
print('Coefficients:')
for name, coef in zip(predictors, house_lm.coef_):
    print(f' {name}: {coef}')

(2) statsmodels

We can also use statsmodels to set up a linear regression model. OLS in statsmodels does not include an intercept by default, so in order to estimate one, we must add a constant column of 1s to the input variables. This can be easily done with dataframe.assign(const=1).

import statsmodels.api as sm
model = sm.OLS(house[outcome], house[predictors].assign(const=1))
results = model.fit()
print(results.summary())

Part 4: Reference

Book

  • Practical Statistics for Data Scientists: 50+ Essential Concepts Using R and Python, Chapter 4, "Regression and Prediction"

