My Understanding of Data Normalization

ifeelfree
1 min read · Nov 29, 2020

I use this post to summarize my data normalization experience.

Part 1: What is data normalization?

Consider a data set containing two features, age (x1) and income (x2). Age ranges from 0–100, while income ranges from about 20,000–500,000, so income is roughly 1,000 times larger than age. These two features are in very different ranges. When we do further analysis, like multivariate linear regression, the income attribute will intrinsically influence the result more due to its larger values. But this doesn’t necessarily mean it is more important as a predictor. Therefore it is usually advisable to normalize features in a multivariate linear regression model, particularly when it is fit by gradient descent or uses regularization.
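To make the scale gap concrete, here is a minimal sketch in Python with NumPy. The sample values are hypothetical, picked only to fall inside the age and income ranges mentioned above, and the equal weights are an assumption for illustration, not a fitted model.

```python
import numpy as np

# Hypothetical sample values, chosen only to match the ranges in the text.
age = np.array([25.0, 40.0, 60.0, 80.0])
income = np.array([20_000.0, 75_000.0, 150_000.0, 500_000.0])

# With equal weights, each feature's term in w1*x1 + w2*x2 is its raw value,
# so the feature with the larger scale dominates the sum.
w_age, w_income = 1.0, 1.0
income_share = (w_income * income) / (w_age * age + w_income * income)
print("share of the weighted sum coming from income:", income_share)

# The raw ranges differ by several orders of magnitude.
age_range = age.max() - age.min()
income_range = income.max() - income.min()
print("income range / age range:", income_range / age_range)


def rescale_to_unit_range(x):
    """Linearly map the smallest value to 0 and the largest to 1."""
    return (x - x.min()) / (x.max() - x.min())


# After rescaling, both features span the same range.
age_scaled = rescale_to_unit_range(age)
income_scaled = rescale_to_unit_range(income)
print("scaled ranges:", age_scaled.max() - age_scaled.min(),
      income_scaled.max() - income_scaled.min())
```

In this toy example nearly all of the weighted sum comes from income before rescaling, even though nothing says income matters more as a predictor.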

A commonly cited definition of data normalization is as follows:

The goal of normalization is to change the values of numeric columns in the dataset to a common scale.

The terms normalization and standardization are sometimes used interchangeably, but they usually refer to different things. Normalization usually means scaling a variable to have values between 0 and 1, while standardization transforms data to have a mean of zero and a standard deviation of 1. The standardized value is called a z-score.
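The two definitions above can be sketched in a few lines of Python with NumPy. This is an illustrative sketch, not a reference implementation: the function names and the sample income values are hypothetical, and the z-score here uses the population standard deviation (NumPy's default), which is one of two common conventions.

```python
import numpy as np


def min_max_normalize(x):
    """Normalization: x' = (x - min) / (max - min), giving values in [0, 1]."""
    x = np.asarray(x, dtype=float)
    span = x.max() - x.min()
    if span == 0:
        raise ValueError("min-max scaling is undefined for a constant column")
    return (x - x.min()) / span


def standardize(x):
    """Standardization: z = (x - mean) / std, giving mean 0 and std 1."""
    x = np.asarray(x, dtype=float)
    std = x.std()  # population standard deviation (ddof=0)
    if std == 0:
        raise ValueError("z-scores are undefined for a constant column")
    return (x - x.mean()) / std


# Hypothetical income values, for illustration only.
income = np.array([20_000, 75_000, 150_000, 500_000])

income_01 = min_max_normalize(income)
income_z = standardize(income)

print("min-max:", income_01)
print("z-score:", income_z)
```

Min-max scaling keeps the shape of the distribution but pins it to a fixed interval, so a single extreme value compresses everything else. Z-scores have no fixed bounds, but they express each value in units of standard deviations from the mean.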

Part 2: Normalization method
