This post keeps a running record of my understanding of categorical variables.
Part 1: What is a categorical variable?
The official definition of a categorical variable is as follows:
In statistics, a categorical variable is a variable that can take on one of a limited, and usually fixed, number of possible values, assigning each individual or other unit of observation to a particular group or nominal category on the basis of some qualitative property.
Binary data is an important special case of categorical data, for example 0/1 or yes/no. Another useful kind of categorical variable is ordinal data, where the categories have a natural order.
Categorical variables are often stored in tabular form. Tabular data is data in the form of a table, such as from a spreadsheet, database, or CSV file; a tabular model is a model that tries to predict one column of a table based on the information in the other columns.
Part 2: What are the challenges of categorical variables?
- A categorical variable may have too many levels, which pulls down the performance of the model. For example, a categorical variable like “zip code” can have thousands of levels.
- A categorical variable may have levels that rarely occur. Such levels have minimal chance of making a real impact on model fit. For example, a variable ‘disease’ might have some levels that appear only a handful of times. (A quick way to inspect cardinality and rare levels is sketched after this list.)
- One level may dominate, i.e. for most of the observations in the data set there is only one level. Such variables fail to make a positive impact on model performance due to very low variation.
- If the categorical variable is masked, it becomes a laborious task to decipher its meaning. Such situations are commonly found in data science competitions.
- You can’t fit categorical variables into a regression equation in their raw form; they must be encoded first.
- Most algorithms (or ML libraries) produce better results with numerical variables. In Python, the “sklearn” library requires features to be numerical arrays.
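Before encoding anything, it helps to check how many levels each categorical column has and how often each level occurs. Here is a minimal inspection sketch with pandas; the column names and data are made up for illustration:
import pandas as pd

# Made-up toy data; replace with your own frame and column list
df = pd.DataFrame({
    'zip_code': ['10001', '94105', '10001', '60601', '10001'],
    'disease': ['flu', 'cold', 'flu', 'measles', 'flu'],
})

for col in ['zip_code', 'disease']:
    print(col, 'has', df[col].nunique(), 'levels')
    # Relative frequency of each level; rare levels show up at the bottom
    print(df[col].value_counts(normalize=True))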
Part 3: How to transform a categorical variable into a numerical variable?
1. Ordinal encoding
The basic idea of a label/ordinal encoder is to replace each level of a categorical variable with a unique integer. The demo is as follows:
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({
    'name': ['a', 'b', 'c'],
    'income': [100, 200, 300],
})

# LabelEncoder assigns an integer to each unique label (sorted alphabetically)
number = LabelEncoder()
df['name'] = number.fit_transform(df['name'])
The output is a df whose name column has been replaced by integers: a → 0, b → 1, c → 2.
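If you need to map the integers back to the original labels, the fitted LabelEncoder keeps the learned classes and can invert the transform. A small self-contained sketch:
from sklearn.preprocessing import LabelEncoder

number = LabelEncoder()
codes = number.fit_transform(['a', 'b', 'c'])
print(number.classes_)                  # ['a' 'b' 'c']
print(number.inverse_transform(codes))  # ['a' 'b' 'c']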
Another ordinal-encoding implementation relies on pandas’ built-in map function:
cut_mapping = {'Fair': 0, 'Good': 1, 'Very Good': 2, 'Premium': 3, 'Ideal': 4}
diamond_df['cut'] = diamond_df['cut'].map(cut_mapping)
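Note that diamond_df is not defined above; it presumably refers to the well-known diamonds dataset with a cut column. A self-contained sketch of the same idea, using a made-up stand-in frame:
import pandas as pd

# Made-up stand-in for the diamonds dataset (diamond_df is assumed, not defined above)
diamond_df = pd.DataFrame({'cut': ['Fair', 'Ideal', 'Good', 'Premium', 'Very Good']})

cut_mapping = {'Fair': 0, 'Good': 1, 'Very Good': 2, 'Premium': 3, 'Ideal': 4}
diamond_df['cut'] = diamond_df['cut'].map(cut_mapping)
print(diamond_df)  # the cut column is now 0..4, respecting the quality order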
2. Dummy encoding
Dummy coding is a commonly used method for converting a categorical input variable into numeric indicator variables. A ‘dummy’, as the name suggests, is a derived 0/1 variable that represents one level of the categorical variable: presence of the level is represented by 1 and absence by 0. With dummy encoding, one level is chosen as the reference and dropped, so a variable with k levels produces k−1 dummy columns; the dropped level is implied when all the other dummies are 0.
3. One-hot encoding
One of the main ways of working with categorical variables is to use 0/1 encodings. In this technique, you create a new column for every level of the categorical variable. The advantages of this approach include:
1. The ability to have differing influences of each level on the response.
2. You do not impose a rank on the categories.
3. The ability to interpret the results more easily than other encodings.
The disadvantage of this approach is that you introduce a large number of effects into your model. If you have many categorical variables, or categorical variables with many levels, but not a large sample size, you might not be able to estimate the impact of each of these variables on your response variable. A common rule of thumb suggests 10 data points for each variable you add to your model, that is, 10 rows for each column. This is a reasonable lower bound, but the larger your sample (assuming it is representative), the better.
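A quick way to act on that rule of thumb is to count how many columns one-hot encoding would add and compare against the number of rows. A minimal sketch with made-up column names and data:
import pandas as pd

# Made-up example frame; in practice use your own data and column list
df = pd.DataFrame({
    'zip_code': ['10001', '94105', '60601', '10001'],
    'disease': ['flu', 'cold', 'flu', 'measles'],
})
categorical_cols = ['zip_code', 'disease']

n_new_columns = sum(df[col].nunique() for col in categorical_cols)
print('one-hot would add', n_new_columns, 'columns')
print('rule-of-thumb rows needed:', 10 * n_new_columns, 'vs. available:', len(df))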
One-hot encoding example:
import pandas as pd
import numpy as np
df = pd.DataFrame({
'name':['a','b','c'],
'income':[100, 200, 300],
})
# dummy encoding: drop_first=True drops one reference level
print('dummy encoding')
new_df2 = pd.get_dummies(df, columns=['name'], drop_first=True)
print(new_df2.head())

# one-hot encoding: drop_first=False keeps a column for every level
print('one-hot encoding')
new_df = pd.get_dummies(df, columns=['name'], drop_first=False)
print(new_df.head())
The output is:
dummy encoding
income name_b name_c
0 100 0 0
1 200 1 0
2 300 0 1
one-hot encoding
income name_a name_b name_c
0 100 1 0 0
1 200 0 1 0
2 300 0 0 1
- In pandas, the get_dummies function has a parameter drop_first (default False) that determines the encoding method: drop_first=True gives dummy encoding, drop_first=False gives one-hot encoding.
- The difference between dummy encoding and one-hot encoding is clear from the output above: dummy encoding drops one reference level (name_a here), while one-hot encoding keeps a column for every level.
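If the downstream model lives in sklearn, you may prefer sklearn’s own OneHotEncoder, which can be fitted on training data and then reused on new data. A minimal sketch, assuming scikit-learn >= 1.2 (where the dense-output parameter is named sparse_output):
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({
    'name': ['a', 'b', 'c'],
    'income': [100, 200, 300],
})

# handle_unknown='ignore' encodes unseen levels at prediction time as all zeros
encoder = OneHotEncoder(sparse_output=False, handle_unknown='ignore')
encoded = encoder.fit_transform(df[['name']])

print(encoder.get_feature_names_out(['name']))  # ['name_a' 'name_b' 'name_c']
print(encoded)                                  # 3x3 array of 0/1 indicators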