This note is based on Fastbook.
1. What’s the deep learning project strategy?
- Iterate from end to end
- Complete every step in a reasonable amount of time, then refine
2. What are high-cardinality categorical variables?
- Categorical variables take discrete values
- High cardinality means a large number of unique values
- Deep learning is good at analyzing tabular data that includes natural language or high-cardinality categorical columns (columns with a large number of discrete choices, like zip codes)
3. Dataset, DataLoader, DataLoaders and DataBlock
(1) torch.utils.data.Dataset is an abstract class representing a dataset. Your custom dataset should inherit Dataset and override the following methods:
- __len__, so that len(dataset) returns the size of the dataset
- __getitem__, to support indexing so that dataset[i] can be used to get the i-th sample
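A minimal sketch of such a custom Dataset (the class name and the random tensors are made-up placeholders):

```python
import torch
from torch.utils.data import Dataset

class PairDataset(Dataset):
    """A minimal custom Dataset holding paired inputs and targets."""
    def __init__(self, xs, ys):
        self.xs, self.ys = xs, ys

    def __len__(self):
        # len(dataset) returns the number of samples
        return len(self.xs)

    def __getitem__(self, i):
        # dataset[i] returns the i-th (input, target) pair
        return self.xs[i], self.ys[i]

ds = PairDataset(torch.randn(100, 3), torch.randint(0, 2, (100,)))
```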
(2) torch.utils.data.DataLoader is an iterable that provides the following features:
- Batching the data
- Shuffling the data
- Loading the data in parallel using multiprocessing workers
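A sketch of how it wraps the Dataset above (the batch size and worker count are arbitrary):

```python
from torch.utils.data import DataLoader

# Batch, shuffle, and load in parallel with two worker processes
dl = DataLoader(ds, batch_size=16, shuffle=True, num_workers=2)
for xb, yb in dl:
    print(xb.shape, yb.shape)  # each iteration yields one mini-batch
    break
```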
(3) DataLoaders is a thin class in Fastai that just stores whatever DataLoader objects you pass to it, and makes them available as train and valid.
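For example (the train/valid split here is hypothetical; normally each loader would wrap its own split of the data):

```python
from torch.utils.data import DataLoader
from fastai.data.core import DataLoaders

train_dl = DataLoader(ds, batch_size=16, shuffle=True)
valid_dl = DataLoader(ds, batch_size=16)
dls = DataLoaders(train_dl, valid_dl)
dls.train, dls.valid  # the first and second loaders passed in
```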
(4) DataBlock in Fastai
- It is a factory: a template for creating DataLoaders
- From a DataBlock we can get DataLoaders: data_block.dataloaders(path)
- A DataBlock supports replacing parameters via its .new method (see the sketch after this list)
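A sketch along the lines of the bear classifier in Fastbook (path is assumed to point at a folder of images labeled by subfolder):

```python
from fastai.vision.all import *

bears = DataBlock(
    blocks=(ImageBlock, CategoryBlock),               # input and target types
    get_items=get_image_files,                        # how to collect the items
    splitter=RandomSplitter(valid_pct=0.2, seed=42),  # train/valid split
    get_y=parent_label,                               # label = parent folder name
    item_tfms=Resize(128))

dls = bears.dataloaders(path)

# .new returns a copy of the DataBlock with some parameters replaced
bears = bears.new(item_tfms=RandomResizedCrop(128, min_scale=0.3))
dls = bears.dataloaders(path)
```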
4. Data augmentation
- In PyTorch, the confusion may come from the fact that transforms are often used both for data preparation (resizing/cropping to expected dimensions, normalizing values, etc.) and for data augmentation (randomizing the resizing/cropping, randomly flipping the images, etc.).
- In Fastai, these are separated into two parameters: item_tfms and batch_tfms. The first denotes the pre-processing of input data and is applied to each item individually on the CPU; the second is used for data augmentation and is applied to whole batches, typically on the GPU. A sketch follows.
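Continuing the DataBlock sketch from above:

```python
bears = bears.new(
    item_tfms=Resize(128),              # data preparation, applied per item
    batch_tfms=aug_transforms(mult=2))  # data augmentation, applied per batch
dls = bears.dataloaders(path)
dls.train.show_batch(max_n=8, unique=True)  # show augmented versions of one item
```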
5. Training in Fastai
- Training in Fastai has been simplified: it needs three inputs, (1) the DataLoaders, (2) a neural network architecture and a loss function, and (3) the error metrics
- The Learner keeps a handle that points to the DataLoaders
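A minimal sketch, continuing from the dls built above (vision_learner picks a sensible default loss function from the data):

```python
learn = vision_learner(dls, resnet18, metrics=error_rate)
learn.fine_tune(4)
learn.dls  # the Learner keeps a handle to its DataLoaders
```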
6. Network interpretation
- confusion matrix
- plot_top_losses
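Both are available through fastai's ClassificationInterpretation:

```python
interp = ClassificationInterpretation.from_learner(learn)
interp.plot_confusion_matrix()      # where classes get confused with each other
interp.plot_top_losses(5, nrows=1)  # the items the model got most wrong
```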
7. Data cleaning using the trained network
- ImageClassifierCleaner is created from a Learner
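A sketch of the cleaning workflow (the widget runs inside a notebook; the cleanup loop follows Fastbook):

```python
import shutil
from fastai.vision.widgets import ImageClassifierCleaner

cleaner = ImageClassifierCleaner(learn)
cleaner  # review the highest-loss images and mark them for deletion or relabeling

# Afterwards, apply the decisions recorded by the widget
for idx in cleaner.delete():
    cleaner.fns[idx].unlink()
for idx, cat in cleaner.change():
    shutil.move(str(cleaner.fns[idx]), path/cat)
```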
8. Model saving
- .export() saves both the architecture and the trained parameters of the neural network. It also saves how the DataLoaders are defined.
- The saved file uses the .pkl format (Python pickle serialization)
- The .pth format is supported by PyTorch
- The export is created from the Learner; load_learner is used to load the saved model back
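A sketch of the round trip (the filenames are hypothetical):

```python
# Save architecture, trained parameters, and the DataLoaders definition
learn.export('export.pkl')

# Later, e.g. on the inference machine
learn_inf = load_learner('export.pkl')
pred, pred_idx, probs = learn_inf.predict('grizzly.jpg')
```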
9. Model deployment
- A Jupyter Notebook can serve as the UI
- Voila turns a Jupyter Notebook into a real app
- Very often the trained model does not work well in real situations because of out-of-domain data (inputs unlike the training data) and domain shift (the data the model sees changing over time)
- Another common problem is the presence of bias: feedback loops can make the bias worse and worse
- GPUs are best at doing identical work in parallel. If you analyze single pieces of data at a time (a single image or a single sentence), CPUs may be more cost-effective, especially given the stronger market competition for CPU servers versus GPU servers. A GPU makes sense if you collect user requests into batches and run inference on a batch at a time, though this may require users to wait for predictions. GPU inference also brings other complexities, such as memory management and queuing of the batches.
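A sketch of the two inference styles, continuing from the loaded model above (the filenames are hypothetical):

```python
# CPU-friendly: predict on a single item per request
pred, _, probs = learn_inf.predict('upload_001.jpg')

# GPU-friendly: gather several requests, then run inference on the whole batch
batch = ['upload_001.jpg', 'upload_002.jpg', 'upload_003.jpg']
test_dl = learn_inf.dls.test_dl(batch)
preds, _ = learn_inf.get_preds(dl=test_dl)
```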