This note is based on Fastbook.
1. Data processing tips: presizing and augmentation in a single step
Presizing uses the following strategies to deal with the artifacts introduced by data augmentation:
- Resize images to relatively “large” dimensions, that is, dimensions significantly larger than the target training dimensions (via item_tfms in DataBlock; see the sketch after this list). This concept is known as presizing. Data augmentation is often applied to the images, and in fastai it is done on the GPU. However, data augmentation can lead to degradation and artifacts, especially at the edges. Therefore, to minimize this destruction of data, the augmentations are done on a larger image, and then RandomResizedCrop is performed to resize to the final image size.
- Compose all of the common augmentation operations (including a resize to the final target size) into one, and perform the combined operation on the GPU only once at the end of processing, rather than performing the operations individually and interpolating multiple times.
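To make the first strategy concrete, here is a minimal sketch of presizing in a DataBlock, following the Oxford-IIIT Pet example used in Fastbook (the dataset, splitter, and labelling regex are assumptions of this sketch, not requirements of presizing):
from fastai.vision.all import *
path = untar_data(URLs.PETS)                                   # Oxford-IIIT Pet dataset (assumed example)
pets = DataBlock(
    blocks=(ImageBlock, CategoryBlock),
    get_items=get_image_files,
    splitter=RandomSplitter(seed=42),
    get_y=using_attr(RegexLabeller(r'(.+)_\d+.jpg$'), 'name'),
    item_tfms=Resize(460),                                     # presize: read each image at a large size (CPU, per item)
    batch_tfms=aug_transforms(size=224, min_scale=0.75))       # augment and crop to 224 in one step (GPU, per batch)
dls = pets.dataloaders(path/"images")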
The following code illustrates data augmentation in one step, compared with applying each operation separately:
# using the fastai.vision.all import from the sketch above; x is assumed to be a batch of images (a TensorImage) from a DataLoader
# step-by-step approach: every augmentation re-interpolates the image
x1 = TensorImage(x.clone())
x1 = x1.affine_coord(sz=224)
x1 = x1.rotate(draw=30, p=1.)
x1 = x1.zoom(draw=1.2, p=1.)
x1 = x1.warp(draw_x=-0.2, draw_y=0.2, p=1.)
# one-step approach: compose all augmentations (plus the resize to 224) and apply them once
tfms = setup_aug_tfms([Rotate(draw=30, p=1, size=224), Zoom(draw=1.2, p=1., size=224),
                       Warp(draw_x=-0.2, draw_y=0.2, p=1., size=224)])
x = Pipeline(tfms)(x)
2. Fastai DataLoaders
- re-visit show_batch: it visualizes a batch of images (see the sketch below)
- dataloaders can be retrieved from a DataBlock
- In the context of deep learning, data can be kept in a single file, such as a CSV file or a database, or in individual files, such as image files or text files.
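For instance, reusing the dls built in the presizing sketch of section 1 (the arguments below are illustrative):
dls.show_batch(nrows=1, ncols=3)         # display a few training images with their labels
dls.valid.show_batch(max_n=4, nrows=1)   # the validation DataLoader can be inspected the same way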
3. Fastai DataBlock
- it has a connection with DataLoaders: the dataloaders method builds them from a DataBlock
- it provides a summary function to summarize the training and validation data sets (see the sketch below)
- item_tfms is applied to each image when the image is read
- batch_tfms is applied to batches of images, for data augmentation
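A minimal sketch of the summary call, reusing the pets DataBlock and path from the sketch in section 1; it builds one sample and one batch and reports every transform applied along the way (and where things fail, if they do):
pets.summary(path/"images")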
4. Softmax vs Sigmoid
softmax is the multi-category equivalent of sigmoid: we have to use it any time we have more than two categories and the probabilities of the categories must add up to 1, and we often use it even when there are just two categories, just to make things a bit more consistent. We could create other functions that have the property that all activations are between 0 and 1 and sum to 1; however, no other function has the same relationship to the sigmoid function, which we've seen is smooth and symmetric.
Intuitively, the softmax function really wants to pick one class among the others, so it is ideal for training a classifier when we know each picture has a definite label.
Note that it may be less ideal during inference: you might want your model to sometimes tell you it doesn't recognize any of the classes it has seen during training, rather than picking a class just because it has a slightly bigger activation score. In that case it might be better to train the model with multiple binary output columns, each using a sigmoid activation, as in multi-label classification problems.
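A small PyTorch sketch of the difference (the random activations are just for illustration):
import torch
acts = torch.randn(1, 3) * 2              # one sample, three class activations
print(torch.softmax(acts, dim=1))         # values in (0, 1) that sum to 1: softmax picks one class among the others
print(torch.softmax(acts, dim=1).sum())   # sums to 1
print(torch.sigmoid(acts))                # each value squashed to (0, 1) independently; no constraint to sum to 1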
5. Why should we take the Log?
The problem is that we are using probabilities, and probabilities cannot be smaller than 0 or greater than 1. That means our model will not care whether it predicts 0.99 or 0.999: those numbers are very close together, yet in another sense 0.999 is 10 times more confident than 0.99. Taking the logarithm maps values between 0 and 1 to values between negative infinity and 0, which spreads out exactly these differences between high-confidence predictions.
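A quick numerical illustration of what the log buys us:
import torch
p = torch.tensor([0.99, 0.999])
print(-torch.log(p))   # roughly [0.0101, 0.0010]: the negative log for 0.999 is about 10 times smaller than for 0.99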
6. nll_loss and CrossEntropyLoss
The nll in nll_loss stands for "negative log likelihood," but it doesn't actually take the log at all! It assumes you have already taken the log. PyTorch has a function called log_softmax that combines log and softmax in a fast and accurate way. nll_loss is designed to be used after log_softmax.
When we first take the softmax, and then the log likelihood of it, that combination is called cross-entropy loss.
- native implementation: softmax, then log, then negative log likelihood
import torch
import torch.nn.functional as F
from torch import nn
acts = torch.randn((6,2))*2                      # random activations for 6 samples, 2 classes
targ = torch.tensor([0,1,0,1,1,0])               # example targets (assumed for illustration)
sm_acts = torch.softmax(acts, dim=1)
log_sm_acts = torch.log(sm_acts)
F.nll_loss(log_sm_acts, targ, reduction='none')  # picks out -log(probability) of the target class
- pytorch, in a single call
nn.CrossEntropyLoss(reduction='none')(acts, targ)
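For completeness, the same values can be obtained with log_softmax (the fused function mentioned above) or with the functional form of cross-entropy, reusing acts and targ from the snippet above:
F.nll_loss(F.log_softmax(acts, dim=1), targ, reduction='none')  # fused log + softmax, then negative log likelihood
F.cross_entropy(acts, targ, reduction='none')                   # functional equivalent of nn.CrossEntropyLoss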
7. An interesting feature of cross-entropy loss appears when we consider its gradient. The gradient of cross_entropy(a,b) is just softmax(a)-b. Since softmax(a) is just the final activation of the model, the gradient is proportional to the difference between the prediction and the target. This is the same as mean squared error in regression (assuming there's no final activation function such as that added by y_range), since the gradient of (a-b)**2 is 2*(a-b). Because the gradient is linear, we won't see sudden jumps or exponential increases in gradients, which should lead to smoother training of models.
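This claim can be checked with autograd; a minimal sketch with a single sample and three classes, where the one-hot encoded target plays the role of b:
import torch
import torch.nn.functional as F
acts = torch.randn(1, 3, requires_grad=True)          # the activations a
targ = torch.tensor([2])                              # target class
F.cross_entropy(acts, targ).backward()
onehot = F.one_hot(targ, num_classes=3).float()       # the target b
print(acts.grad)                                      # gradient computed by autograd
print(torch.softmax(acts.detach(), dim=1) - onehot)   # softmax(a) - b gives the same values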
8. Model interpretation: in the context of classification, an easy interpretation can be obtained with the ClassificationInterpretation class:
interp = ClassificationInterpretation.from_learner(learn)   # learn is a trained Learner
interp.plot_confusion_matrix(figsize=(12,12), dpi=60)
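When there are many classes the matrix can be hard to read; the same ClassificationInterpretation object also offers most_confused, which lists the most frequent prediction errors (the threshold below is illustrative):
interp.most_confused(min_val=5)   # (actual, predicted, count) tuples with at least 5 occurrences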
9. Learning rate finder
The idea was to start with a very, very small learning rate, something so small that we would never expect it to be too big to handle. We use that for one mini-batch, find what the losses are afterwards, and then increase the learning rate by some percentage (e.g., doubling it each time). Then we do another mini-batch, track the loss, and double the learning rate again. We keep doing this until the loss gets worse, instead of better.
lr_min,lr_steep = learn.lr_find()
Two rules to select the right learning rate:
- one order of magnitude less than where the minimum loss was achieved (i.e. the minimum divided by 10)
- the last point where the loss was clearly decreasing
In transfer learning (the fine_tune method), it makes sense to find two learning rates:
- the first learning rate is used to train the last few layers (the new head) while the rest of the network is frozen
- after that, we can unfreeze all the layer parameters and retrain the neural network, finding a better learning rate with the lr_find method (see the sketch below)
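A minimal sketch of this two-stage procedure, assuming the dls built in section 1 (epoch counts and learning rates are illustrative):
learn = cnn_learner(dls, resnet34, metrics=error_rate)
learn.fit_one_cycle(3, 3e-3)          # stage 1: train only the new head while the body stays frozen
learn.unfreeze()                      # make all parameters trainable
learn.lr_find()                       # look for a suitable, smaller learning rate for the full network
learn.fit_one_cycle(6, lr_max=1e-5)   # stage 2: retrain the whole network at the new rate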
10. Discriminative learning rate
Fastai lets you pass a Python slice object anywhere that a learning rate is expected. The first value passed will be the learning rate in the earliest layer of the neural network, and the second value will be the learning rate in the final layer. The layers in between will have learning rates that are multiplicatively equidistant throughout that range.
Discriminative learning rates refers to the training trick of using different learning rates for different layers of the model. This is commonly used in transfer learning. The idea is that when you train a pretrained model, you don't want to drastically change the earlier layers, as they contain information about simple features like edges and shapes. The later layers, however, may be changed a little more, as they may contain information about facial features or other object features that are not relevant to your task. Therefore, the earlier layers get a lower learning rate and the later layers get higher learning rates.
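Continuing from the unfrozen learner above, a slice can be passed as lr_max (the values are illustrative):
learn.fit_one_cycle(12, lr_max=slice(1e-6, 1e-4))   # earliest layers train at 1e-6, the final layers at 1e-4,
                                                    # and the layers in between get multiplicatively spaced rates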
11. Learning epochs
Remember, it’s not just that we’re looking for the validation loss to get worse, but the actual metrics.
Before the days of 1cycle training it was very common to save the model at the end of each epoch, and then select whichever model had the best accuracy out of all of the models saved in each epoch. This is known as early stopping.
However, using early stopping is a poor choice when using one cycle training. If early stopping is used, the training may not have time to reach lower learning rate values in the learning rate schedule, which could easily continue to improve the model. Therefore, it is recommended to retrain the model from scratch and select the number of epochs based on where the previous best results were found.
12. Deeper architecture
In general, a bigger model has the ability to better capture the real underlying relationships in your data, and also to capture and memorize the specific details of your individual images.
However, using a deeper model is going to require more GPU RAM, so you may need to lower the size of your batches to avoid an out-of-memory error.
Another solution is to use mixed-precision training; in the fastai library, call the to_fp16() method on the Learner:
from fastai.callback.fp16 import *
learn = cnn_learner(dls, resnet50, metrics=error_rate).to_fp16()
learn.fine_tune(6, freeze_epochs=3)