# My Understanding of Linear Regression

I will use this article to keep a record of my understanding of linear regression.

**Part 1: Linear regression model definition**

A multidimensional linear model is defined as:

$$y = b_0 + b_1 x_1 + b_2 x_2 + \dots + b_p x_p + \varepsilon$$

where $b_0$ is the intercept, $b_1, \dots, b_p$ are the coefficients, and $\varepsilon$ is the error term.

Geometrically, this is akin to fitting a plane to points in three dimensions, or fitting a hyperplane to points in higher dimensions.

Here is a demo of how to use linear regression with the `LinearRegression` estimator provided by the scikit-learn library:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data: y = 0.5 + 1.5*x1 - 2*x2 + 1*x3 + noise
rng = np.random.RandomState(1)
X = 10 * rng.rand(100, 3)
y = 0.5 + np.dot(X, [1.5, -2., 1.]) + 0.1 * rng.randn(100)

model = LinearRegression(fit_intercept=True)
model.fit(X, y)

print(model.intercept_)
print(model.coef_)
```

The output of the demo is:

```
0.5156233346576982
[ 1.49815954 -1.99762243  0.99725804]
```

**Part 2: Theory behind linear regression**

**(1) correlation vs regression**

The difference is that correlation measures the strength (and direction) of the association between two variables, while regression quantifies the nature of the relationship, i.e., how much the response changes per unit change in the predictor.
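This distinction can be made concrete with a small numerical sketch (the data here is made up for illustration): correlation is unitless and symmetric, while the regression slope is in units of *y* per unit of *x*; the two are linked by slope = *r* · (std of *y* / std of *x*).

```python
import numpy as np

rng = np.random.RandomState(0)
x = rng.rand(50)
y = 2.0 * x + 0.1 * rng.randn(50)  # true slope of 2

# Correlation: strength of association, unitless, symmetric in x and y
r = np.corrcoef(x, y)[0, 1]

# Regression slope: how much y changes per unit change in x
slope = np.polyfit(x, y, 1)[0]

# The two are linked: slope = r * (std_y / std_x)
print(r, slope, r * np.std(y) / np.std(x))
```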

**(2) Key terms**

- response (Y): variable to predict
- independent variable (X): the variable used to predict the response
- record (x, y): one observation
- intercept: the predicted response Y when the independent variable X is zero
- least squares: the method of fitting a regression by minimizing the sum of squared residuals; note that the least-squares fit is sensitive to outliers

**(3) prediction versus explanation**

Regression is used for both prediction and explanation.

Historically, a primary use of regression was to illuminate a supposed linear relationship between predictor variables and an outcome variable. The goal has been to understand a relationship and explain it using the data that the regression was fit to.

With the advent of big data, regression is widely used to form a model to predict individual outcomes for new data (i.e., a predictive model) rather than explain data in hand. In this instance, the main items of interest are the fitted values.

The interpretation of the coefficients is as follows: the predicted value changes by the coefficient *b<sub>i</sub>* for each unit change in *X<sub>i</sub>*, assuming all the other variables *X<sub>j</sub>* remain the same.
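This interpretation can be verified numerically: for a fitted linear model, increasing one predictor by exactly one unit while holding the others fixed changes the prediction by exactly that coefficient. A small sketch reusing the earlier demo data:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(1)
X = 10 * rng.rand(100, 3)
y = 0.5 + np.dot(X, [1.5, -2.0, 1.0]) + 0.1 * rng.randn(100)

model = LinearRegression().fit(X, y)

# Increase only the first predictor by one unit, holding the others fixed
x0 = X[:1].copy()
x1 = x0.copy()
x1[0, 0] += 1.0

# The change in the prediction equals the first coefficient
delta = model.predict(x1)[0] - model.predict(x0)[0]
print(delta, model.coef_[0])
```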

**(4) model assessment**

- Root Mean Squared Error (RMSE): the square root of the average squared error of the regression. It has a very similar meaning to the Residual Standard Error (RSE); the difference is that RSE divides by the degrees of freedom rather than by the number of records.

```python
from sklearn.metrics import r2_score, mean_squared_error

# gd: ground-truth values, predicted: model predictions
RMSE = np.sqrt(mean_squared_error(gd, predicted))
```

- *R*²: a statistic that gives some information about the goodness of fit of a model. In regression, the *R*² coefficient of determination is a statistical measure of how well the regression predictions approximate the real data points. An *R*² of 1 indicates that the regression predictions perfectly fit the data. It is defined as the proportion of variation in the data that is accounted for by the model.

```python
from sklearn.metrics import r2_score, mean_squared_error

r2_score(gd, predicted)
```
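The "proportion of variation accounted for" definition can be checked against `r2_score` directly via *R*² = 1 − SS<sub>res</sub>/SS<sub>tot</sub>; the ground-truth and predicted arrays below are made up for illustration.

```python
import numpy as np
from sklearn.metrics import r2_score

rng = np.random.RandomState(0)
gd = rng.rand(30)                      # illustrative ground-truth values
predicted = gd + 0.05 * rng.randn(30)  # illustrative predictions

# R^2 = 1 - (residual sum of squares) / (total sum of squares)
ss_res = np.sum((gd - predicted) ** 2)
ss_tot = np.sum((gd - np.mean(gd)) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(r2_manual, r2_score(gd, predicted))
```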

- t-statistic: it is inversely related to the p-value: the larger the t-statistic, the smaller the p-value and the more significant the variable. A high t-statistic indicates the model should keep this variable, while a low t-statistic suggests the variable can be discarded. t-statistics are reported automatically when setting up a linear regression model with statsmodels.
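To make the t-statistic concrete, here is a sketch that computes it by hand with numpy (using scipy only for the p-values): *t<sub>i</sub>* = *b<sub>i</sub>* / SE(*b<sub>i</sub>*), where the standard errors come from the diagonal of σ²(XᵀX)⁻¹. The data is synthetic, with a deliberately irrelevant second predictor.

```python
import numpy as np
from scipy import stats

rng = np.random.RandomState(1)
X = 10 * rng.rand(100, 2)
y = 1.0 + np.dot(X, [2.0, 0.0]) + 0.5 * rng.randn(100)  # 2nd predictor irrelevant

X1 = np.column_stack([np.ones(len(X)), X])
beta, *_ = np.linalg.lstsq(X1, y, rcond=None)

# Residual variance with n - p degrees of freedom
resid = y - X1 @ beta
n, p = X1.shape
sigma2 = resid @ resid / (n - p)

# Standard errors and t-statistics: t_i = beta_i / SE(beta_i)
se = np.sqrt(np.diag(sigma2 * np.linalg.inv(X1.T @ X1)))
t = beta / se

# Two-sided p-values: a large |t| maps to a small p-value
pvals = 2 * stats.t.sf(np.abs(t), df=n - p)
print(t)
print(pvals)
```

The relevant predictor gets a large |t| (tiny p-value), while the irrelevant one gets a small |t|, matching the keep/discard heuristic above.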

**Part 3: Implementation**

**(1) scikit-learn**

It provides the `LinearRegression` class for handling regression problems, and the methods of the class accept both `numpy.array` and `pandas.DataFrame` (or `pandas.Series`) objects as inputs.

```python
from sklearn.linear_model import LinearRegression

# house: DataFrame of house sales data (from the reference book's example)
predictors = ['SqFtTotLiving', 'SqFtLot', 'Bathrooms',
              'Bedrooms', 'BldgGrade']
outcome = 'AdjSalePrice'

house_lm = LinearRegression()
house_lm.fit(house[predictors], house[outcome])

print(f'Intercept: {house_lm.intercept_:.3f}')
print('Coefficients:')
for name, coef in zip(predictors, house_lm.coef_):
    print(f' {name}: {coef}')
```

**(2) statsmodels**

We can also use statsmodels to set up a linear regression model. By default, this model does not include an intercept; in order to estimate one, we must add an additional constant column of 1s to the input variables. This can be easily done with the `dataframe.assign(const=1)` command.

```python
import statsmodels.api as sm

model = sm.OLS(house[outcome], house[predictors].assign(const=1))
results = model.fit()
print(results.summary())
```

**Part 4: Reference**

**Book**

- *Practical Statistics for Data Scientists: 50+ Essential Concepts Using R and Python*, Chapter 4, "Regression and Prediction"

**Codes**