My Fastai Course Note (9): Tabular Modeling Deep Dive

ifeelfree
10 min read · Nov 20, 2020

This note is based on Fastbook.

1. Revisit categorical embeddings

  • An embedding layer is just another layer.
  • An embedding transforms categorical variables into inputs that are both continuous and meaningful.
  • The model learns an embedding for the entities that defines a continuous notion of distance between them. Because the embedding distances are learned from real patterns in the data, they tend to match up with our intuitions.
  • Another benefit is that we can combine our continuous embedding values with truly continuous input data in a straightforward manner: we just concatenate the variables and feed the concatenation into our first dense layer. In other words, the raw categorical data is transformed by an embedding layer before it interacts with the raw continuous input data. This is how fastai handles tabular models containing both continuous and categorical variables.
  • A date is a very special categorical variable: some dates differ from others in ways (e.g., holidays, weekends) that cannot be captured by treating the date as a single ordinal variable. Instead, we can generate many different categorical features describing the properties of a given date (e.g., is it a weekday? is it the end of the month?), as sketched below.
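
As a sketch of that date idea, fastai's add_datepart expands a single date column into many such categorical/boolean features. The dataframe and column name below are made up purely for illustration:

```python
import pandas as pd
from fastai.tabular.core import add_datepart

# Toy dataframe with a date column (the name 'saledate' is illustrative only)
df = pd.DataFrame({'saledate': pd.to_datetime(['2020-11-20', '2020-12-31'])})

# add_datepart drops the raw date and adds features such as year, month,
# day of week, is-month-end, is-year-start, elapsed time, etc.
df = add_datepart(df, 'saledate')
print(df.columns)
```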

2. Beyond deep learning

The good news is that modern machine learning can be distilled down to a couple of key techniques that are widely applicable. Recent studies have shown that the vast majority of datasets can be best modeled with just two methods:

(1) Ensembles of decision trees (i.e., random forests and gradient boosting machines), mainly for structured data (such as you might find in a database table at most companies)

(2) Multilayered neural networks learned with SGD (i.e., shallow and/or deep learning), mainly for unstructured data (such as audio, images, and natural language). Neural networks can also be the better choice for structured data when:

  • There are some high-cardinality categorical variables that are very important (“cardinality” refers to the number of discrete levels representing categories, so a high-cardinality categorical variable is something like a zip code, which can take on thousands of possible levels).
  • There are some columns that contain data that would be best understood with a neural network, such as plain text data.

3. TabularPandas in Fastai

A TabularPandas behaves like a fastai Datasets object, including providing train and valid attributes.

In fastai, a tabular model is simply a model that takes columns of continuous or categorical data, and predicts a category (a classification model) or a continuous value (a regression model). Categorical independent variables are passed through an embedding, and concatenated, as we saw in the neural net we used for collaborative filtering, and then continuous variables are concatenated as well.
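
A minimal sketch of building a TabularPandas object, assuming a DataFrame df with a 'SalePrice' dependent variable (the column name, cardinality cutoff, and splitter settings here are assumptions, not part of the note):

```python
from fastai.tabular.all import *

procs = [Categorify, FillMissing, Normalize]          # preprocessing steps
cont, cat = cont_cat_split(df, max_card=9000, dep_var='SalePrice')
splits = RandomSplitter(valid_pct=0.2)(range_of(df))

to = TabularPandas(df, procs=procs, cat_names=cat, cont_names=cont,
                   y_names='SalePrice', splits=splits)

len(to.train), len(to.valid)          # behaves like a Datasets object
dls = to.dataloaders(bs=64)           # and can feed a tabular_learner directly
learn = tabular_learner(dls, metrics=rmse)
```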

4. Decision tree

  • A decision tree is overfit if it has more leaf nodes than data items; there is a fundamental compromise between how well it generalizes and how accurate it is on the training set.
  • Decision trees can handle nonlinear relationships and interactions between variables.
  • In a deep neural network, we can use an embedding layer to discover the meaning of the different levels of a categorical variable. A decision tree simply works with categorical variables directly: we convert the categorical variable to integers, where the integers correspond to its discrete levels. Apart from that, nothing special needs to be done to get categorical variables to work with decision trees (unlike neural networks, where we use embedding layers).

The decision tree training algorithm is as follows (a minimal code sketch follows the steps):

(1) Loop through each column of the dataset in turn

(2) For each column, loop through each possible level of that column in turn

(3) Try splitting the data into two groups, based on whether they are greater than or less than that value (or if it is a categorical variable, based on whether they are equal to or not equal to that level of that categorical variable)

(4) Find the average sale price for each of those two groups, and see how close that is to the actual sale price of each of the items of equipment in that group. That is, treat this as a very simple “model” where our predictions are simply the average sale price of the item’s group

(5) After looping through all of the columns and possible levels for each, pick the split point which gave the best predictions using our very simple model

(6) We now have two different groups for our data, based on this selected split. Treat each of these as separate datasets, and find the best split for each, by going back to step one for each group

(7) Continue this process recursively until you have reached some stopping criterion for each group (for instance, stop splitting a group further when it has only 20 items in it).
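
Here is a minimal sketch of steps (1)-(5), scoring each candidate split by the squared error of the "predict the group average" model. Real implementations such as sklearn's DecisionTreeRegressor do the same thing far more efficiently:

```python
import numpy as np

def best_split_for_column(x, y):
    """Steps (2)-(4): try every level of one column and score the split."""
    best_score, best_level = float('inf'), None
    for level in np.unique(x):
        lhs = x <= level                     # or x == level for a categorical column
        rhs = ~lhs
        if lhs.sum() == 0 or rhs.sum() == 0:
            continue
        # very simple "model": predict each group's average; score = squared error
        score = ((y[lhs] - y[lhs].mean())**2).sum() + ((y[rhs] - y[rhs].mean())**2).sum()
        if score < best_score:
            best_score, best_level = score, level
    return best_level, best_score

def best_split(X, y):
    """Steps (1) and (5): loop over all columns and keep the best split point."""
    candidates = [(col, *best_split_for_column(X[:, col], y)) for col in range(X.shape[1])]
    return min(candidates, key=lambda c: c[2])   # (column index, level, score)
```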

5. Random forests

The basic idea is from Leo Breiman’s paper:

Bagging predictors is a method for generating multiple versions of a predictor and using these to get an aggregated predictor. The aggregation averages over the versions… The multiple versions are formed by making bootstrap replicates of the learning set and using these as new learning sets. Tests… show that bagging can give substantial gains in accuracy. The vital element is the instability of the prediction method. If perturbing the learning set can cause significant changes in the predictor constructed, then bagging can improve accuracy.

Why random forests work so well: each tree has errors, but those errors are not correlated with each other, so the average of those errors should tend towards zero once there are enough trees.

(Figure: bagging algorithm scheme)

A random forest goes further than bagging: it not only randomly chooses the rows used to train each tree, but also randomly selects a subset of columns when choosing each split in each decision tree.

The out-of-bag (OOB) error is a way of measuring prediction error on the training set: a row's error is calculated using only the trees in which that row was not included in training. This allows us to see whether the model is overfitting.

Random forests are also good for model interpretation:

  • The confidence of the prediction for a particular row of data (in testing or a real application): use the standard deviation of the predictions across the trees.
  • Feature importance: feature importance gives us insight into which columns of the tabular data are important. The way these importances are calculated is quite simple yet elegant. The feature importance algorithm loops through each tree, and then recursively explores each branch. At each branch, it looks at what feature was used for that split, and how much the model improves as a result of that split. The improvement (weighted by the number of rows in that group) is added to the importance score for that feature. This is summed across all branches of all trees, and finally the scores are normalized so that they add up to 1.
  • We can improve a random forest by removing low-importance variables.
  • We can improve a random forest by removing redundant features.
  • Partial dependence plots try to answer the question: if a row varied on nothing other than the feature in question, how would it impact the dependent variable?
  • For a prediction on a particular row of data, what were the most important factors, and how did they influence that prediction? Use the treeinterpreter package to check how the prediction changes as the row goes through the tree, adding up the contributions from each split/feature, and use a waterfall plot to visualize them (see the sketch below).
(Figure: tree interpreter for a particular row)
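
A hedged sketch of these interpretation tools using sklearn and the treeinterpreter package; X_train, X_valid, and y_train are assumed to come from earlier preprocessing:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rf = RandomForestRegressor(n_estimators=100, min_samples_leaf=4, n_jobs=-1)
rf.fit(X_train, y_train)

# Prediction confidence: standard deviation across the individual trees
tree_preds = np.stack([t.predict(X_valid.values) for t in rf.estimators_])
pred_std = tree_preds.std(axis=0)                 # high std = low confidence

# Feature importance (sklearn sums the weighted improvements described above)
fi = sorted(zip(X_train.columns, rf.feature_importances_),
            key=lambda p: p[1], reverse=True)

# Per-row contributions of each feature, for a single row of interest
from treeinterpreter import treeinterpreter as ti
prediction, bias, contributions = ti.predict(rf, X_valid.values[:1])
```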

Random forests, however, are not good at extrapolating outside the types of data they have seen. That is why we need to make sure our validation set does not contain out-of-domain data.

Some important parameters of a random forest include:

  • n_estimators: a higher n_estimators means more decision trees are being used. However, since the trees are independent of each other, using a higher n_estimators does not lead to overfitting.
  • max_samples and max_features: when training random forests, we train multiple decision trees on random subsets of the data. max_samples defines how many samples, or rows of the tabular dataset, we use for each decision tree. max_features defines how many features, or columns of the tabular dataset, are considered at each split, as shown in the sketch below.
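
A sketch of a forest built with these parameters, similar to the helper used in the fastbook chapter; xs and y are the processed training inputs and targets (assumed from earlier), and max_samples should be lowered if your dataset has fewer rows:

```python
from sklearn.ensemble import RandomForestRegressor

def rf(xs, y, n_estimators=40, max_samples=200_000,
       max_features=0.5, min_samples_leaf=5, **kwargs):
    # each tree sees up to max_samples rows, and max_features of the columns per split
    return RandomForestRegressor(n_jobs=-1, n_estimators=n_estimators,
                                 max_samples=max_samples, max_features=max_features,
                                 min_samples_leaf=min_samples_leaf,
                                 oob_score=True, **kwargs).fit(xs, y)

m = rf(xs, y)
```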

In a random forest, there is a quantity called the out-of-bag (OOB) error: when evaluating each row of the training data, we only use the trees that were not trained on that row, so no validation set is needed. However, the OOB error can be smaller than the error on the model's validation set. One reason could be that the model does not generalize well; related to this is the possibility that the validation data has a slightly different distribution than the data the model was trained on.
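
Continuing the sketch above (which passed oob_score=True), sklearn exposes the OOB predictions directly, so the OOB error can be compared with the validation error; valid_xs and valid_y are assumed to exist:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Each row's OOB prediction uses only the trees that never saw that row
oob_rmse   = np.sqrt(mean_squared_error(y, m.oob_prediction_))
valid_rmse = np.sqrt(mean_squared_error(valid_y, m.predict(valid_xs)))

# If valid_rmse is much worse than oob_rmse, suspect out-of-domain validation
# data (e.g., a later time period) rather than ordinary generalization error.
```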

6. Data leakage

A trivial example of leakage would be a model that uses the target itself as an input, thus concluding for example that ‘it rains on rainy days’. In practice, the introduction of this illegitimate information is unintentional, and facilitated by the data collection, aggregation and preparation process.

Data leakage is subtle and can take many forms. In particular, missing values often represent data leakage.

7. Out-of-Domain data

Sometimes it is hard to know whether your test set is distributed in the same way as your training data, or, if it is different, what columns reflect that difference.

With a random forest, however, this becomes easy.

But in this case we don’t use the random forest to predict our actual dependent variable. Instead, we try to predict whether a row is in the validation set or the training set. To see this in action, let’s combine our training and validation sets together, create a dependent variable that represents which dataset each row comes from, build a random forest using that data, and get its feature importance.
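
A minimal sketch of that idea; train_xs and valid_xs are the processed independent variables of the training and validation sets, assumed from earlier:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Dependent variable: does this row come from the validation set?
df_dom   = pd.concat([train_xs, valid_xs])
is_valid = np.array([0] * len(train_xs) + [1] * len(valid_xs))

m_dom = RandomForestClassifier(n_estimators=40, min_samples_leaf=15, n_jobs=-1)
m_dom.fit(df_dom, is_valid)

# Columns that most easily separate the two sets are the out-of-domain suspects
fi = sorted(zip(df_dom.columns, m_dom.feature_importances_),
            key=lambda p: p[1], reverse=True)[:10]
```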

For these most important features, try removing them from the dataset and see how that affects the model trained on the actual task. If removing a column does not hurt accuracy, we can drop it, which removes that source of domain shift.

8. Ensembling (bagging) with different models

The idea behind random forests extends to other models: train several different models and average their predictions.

For example, we have two very different models, trained using very different algorithms: a random forest, and a neural network. It would be reasonable to expect that the kinds of errors that each one makes would be quite different. Therefore, we might expect that the average of their predictions would be better than either one’s individual predictions. It can certainly add a nice little boost to any models that you have built.
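
A sketch of such an ensemble, assuming m is the random forest and learn the fastai tabular learner from the earlier sections, both predicting the same validation rows on the same target scale:

```python
import numpy as np

rf_preds = m.predict(valid_xs)                                  # random forest
nn_preds = learn.get_preds(dl=dls.valid)[0].squeeze().numpy()   # neural network

ens_preds = (rf_preds + nn_preds) / 2      # simple unweighted average of the two
```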

9. Boosting: another kind of ensembling

This is how boosting works:

  • Train a small model that underfits your dataset.
  • Calculate the predictions in the training set for this model.
  • Subtract the predictions from the targets; these are called the “residuals” and represent the error for each point in the training set.
  • Go back to step 1, but instead of using the original targets, use the residuals as the targets for the training.
  • Continue doing this until you reach some stopping criterion, such as a maximum number of trees, or you observe your validation set error getting worse.

There are many models following this basic approach, such as gradient boosting machines (GBMs) and gradient boosted decision trees (GBDTs); XGBoost is the most popular implementation.
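
A minimal sketch of the residual-fitting loop described above, using shallow sklearn trees; real GBM libraries (XGBoost, LightGBM, sklearn's GradientBoostingRegressor) add a learning rate, subsampling, and regularization on top of this:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def boost(xs, y, n_trees=50, max_depth=3):
    """Each small tree is trained on the residuals left by the trees so far."""
    trees, pred = [], np.zeros(len(y))
    for _ in range(n_trees):
        residuals = y - pred                          # error of the ensemble so far
        t = DecisionTreeRegressor(max_depth=max_depth).fit(xs, residuals)
        trees.append(t)
        pred += t.predict(xs)                         # ensemble = sum of tree outputs
    return trees

def boost_predict(trees, xs):
    return sum(t.predict(xs) for t in trees)
```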

Using more trees in a random forest does not lead to overfitting, because each tree is independent of the others. But in a boosted ensemble, the more trees you have, the better the training error becomes, and eventually you will see overfitting on the validation set. Unlike random forests, gradient boosted trees are extremely sensitive to their hyperparameter choices; in practice, most people use a loop that tries a range of different hyperparameters to find the ones that work best.

10. Combining Embeddings with other methods

The embeddings obtained from the trained neural network boost the performance of all tested machine learning methods considerably when used as the input features instead of the raw categorical columns.

This is a really important result, because it shows that you can get much of the performance improvement of a neural network without actually having to use a neural network at inference time.
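
A hedged sketch of that workflow: embed_features below is a hypothetical helper (not a fastai function) that replaces each categorical column's integer codes with the embedding vectors from a trained tabular learner, so a random forest can be trained on them. The names learn and to, and the column order, are assumed from the earlier sections:

```python
import torch
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

def embed_features(learn, xs, cat_names):
    """Swap each categorical column in `xs` for its learned embedding vectors.
    `cat_names` must be in the same order used to build the TabularPandas;
    assumes the model is on CPU (call learn.model.cpu() first if needed)."""
    xs = xs.copy()
    for i, col in enumerate(cat_names):
        emb = learn.model.embeds[i]                       # nn.Embedding for this column
        vecs = emb(torch.tensor(xs[col].values, dtype=torch.int64)).detach().cpu().numpy()
        emb_df = pd.DataFrame(vecs, index=xs.index,
                              columns=[f'{col}_{j}' for j in range(vecs.shape[1])])
        xs = pd.concat([xs.drop(columns=col), emb_df], axis=1)
    return xs

# Random forest trained on embeddings + continuous columns instead of raw codes
emb_xs = embed_features(learn, to.train.xs, to.cat_names)
rf = RandomForestRegressor(n_estimators=40, n_jobs=-1).fit(emb_xs, to.train.y)
```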

11. General guidelines for tabular modeling

We have discussed two approaches to tabular modeling: decision tree ensembles and neural networks. We've also mentioned two different decision tree ensembles: random forests, and gradient boosting machines. Each is very effective, but each also has compromises:

  • Random forests are the easiest to train, because they are extremely resilient to hyperparameter choices and require very little preprocessing. They are very fast to train, and should not overfit if you have enough trees. But they can be a little less accurate, especially if extrapolation is required, such as predicting future time periods.
  • Gradient boosting machines in theory are just as fast to train as random forests, but in practice you will have to try lots of different hyperparameters. They can overfit, but they are often a little more accurate than random forests.
  • Neural networks take the longest time to train, and require extra preprocessing, such as normalization; this normalization needs to be used at inference time as well. They can provide great results and extrapolate well, but only if you are careful with your hyperparameters and take care to avoid overfitting.

We suggest starting your analysis with a random forest. This will give you a strong baseline, and you can be confident that it’s a reasonable starting point. You can then use that model for feature selection and partial dependence analysis, to get a better understanding of your data.

From that foundation, you can try neural nets and GBMs, and if they give you significantly better results on your validation set in a reasonable amount of time, you can use them. If decision tree ensembles are working well for you, try adding the embeddings for the categorical variables to the data, and see if that helps your decision trees learn better.
