This note is based on Fastbook.
1. Data processing tips: presizing and augmentation in a single step
Presizing uses the following strategies to deal with the artifacts introduced by data augmentation:
- Resize images to relatively “large” dimensions, that is, dimensions significantly larger than the target training dimensions (via item_tfms in DataBlock; see the sketch after this list). This concept is known as presizing. Data augmentation is often applied to the images, and in fastai it is done on the GPU. However, data augmentation can lead to degradation and artifacts, especially at the edges. Therefore, to minimize this destruction of data, the augmentations are done on a larger image, and then RandomResizedCrop is performed to resize to the final image size.
- Compose all of the common augmentation operations (including a resize to the final target size) into one, and perform the combined operation on the GPU only once at the end of processing, rather than performing the operations individually and interpolating multiple times.
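To make the first strategy concrete, here is a minimal sketch of presizing in a DataBlock, following the Oxford-IIIT Pet example used in Fastbook (the dataset, splitter, and labelling regex are assumptions of this sketch, not requirements of presizing):
from fastai.vision.all import *
path = untar_data(URLs.PETS)                                   # Oxford-IIIT Pet dataset (assumed example)
pets = DataBlock(
    blocks=(ImageBlock, CategoryBlock),
    get_items=get_image_files,
    splitter=RandomSplitter(seed=42),
    get_y=using_attr(RegexLabeller(r'(.+)_\d+.jpg$'), 'name'),
    item_tfms=Resize(460),                                     # presize: read each image at a large size (CPU, per item)
    batch_tfms=aug_transforms(size=224, min_scale=0.75))       # augment and crop to 224 in one step (GPU, per batch)
dls = pets.dataloaders(path/"images")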
The following code illustrates data augmentation in one step, compared with applying each operation separately:
# using the fastai.vision.all import from the sketch above; x is assumed to be a batch of images (a TensorImage) from a DataLoader
# step-by-step approach: every augmentation re-interpolates the image
x1 = TensorImage(x.clone())
x1 = x1.affine_coord(sz=224)
x1 = x1.rotate(draw=30, p=1.)
x1 = x1.zoom(draw=1.2, p=1.)
x1 = x1.warp(draw_x=-0.2, draw_y=0.2, p=1.)
# one-step approach: compose all augmentations (plus the resize to 224) and apply them once
tfms = setup_aug_tfms([Rotate(draw=30, p=1, size=224), Zoom(draw=1.2, p=1., size=224),
                       Warp(draw_x=-0.2, draw_y=0.2, p=1., size=224)])
x = Pipeline(tfms)(x)
2. Fastai DataLoaders
- re-visit show_batch: it visualizes a batch of images (see the sketch below)
- dataloaders can be retrieved from a DataBlock
- In the context of deep learning, data can be kept in a single file, such as a CSV file or a database, or in individual files, such as image files or text files.
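For instance, reusing the dls built in the presizing sketch of section 1 (the arguments below are illustrative):
dls.show_batch(nrows=1, ncols=3)         # display a few training images with their labels
dls.valid.show_batch(max_n=4, nrows=1)   # the validation DataLoader can be inspected the same way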
3. Fastai DataBlock
- it has a connection with DataLoaders: the dataloaders method builds them from a DataBlock
- it provides a summary function to summarize the training and validation data sets (see the sketch below)
- item_tfms is applied to each image when the image is read
- batch_tfms is applied to batches of images, for data augmentation
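A minimal sketch of the summary call, reusing the pets DataBlock and path from the sketch in section 1; it builds one sample and one batch and reports every transform applied along the way (and where things fail, if they do):
pets.summary(path/"images")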
4. Softmax vs Sigmoid
softmax is the multi-category equivalent of sigmoid: we have to use it any time we have more than two categories and the probabilities of the categories must add up to 1, and we often use it even when there are just two categories, just to make things a bit more consistent. We could create other functions that have the property that all activations are between 0 and 1 and sum to 1; however, no other function has the same relationship to the sigmoid function, which we've seen is smooth and symmetric.
Intuitively, the softmax function really wants to pick one class among the others, so it is ideal for training a classifier when we know each picture has a definite label.
Note that it may be less ideal during inference: you might want your model to sometimes tell you it doesn't recognize any of the classes it has seen during training, rather than picking a class just because it has a slightly bigger activation score. In that case it might be better to train the model with multiple binary output columns, each using a sigmoid activation, as in multi-label classification problems.
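A small PyTorch sketch of the difference (the random activations are just for illustration):
import torch
acts = torch.randn(1, 3) * 2              # one sample, three class activations
print(torch.softmax(acts, dim=1))         # values in (0, 1) that sum to 1: softmax picks one class among the others
print(torch.softmax(acts, dim=1).sum())   # sums to 1
print(torch.sigmoid(acts))                # each value squashed to (0, 1) independently; no constraint to sum to 1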
5. Why should we take the Log?
The problem is that we are using probabilities, and probabilities cannot be smaller than 0 or greater than 1. That means our model will not care whether it predicts 0.99 or 0.999: those numbers are very close together, yet in another sense 0.999 is 10 times more confident than 0.99. Taking the logarithm maps values between 0 and 1 to values between negative infinity and 0, which spreads out exactly these differences between high-confidence predictions.
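A quick numerical illustration of what the log buys us:
import torch
p = torch.tensor([0.99, 0.999])
print(-torch.log(p))   # roughly [0.0101, 0.0010]: the negative log for 0.999 is about 10 times smaller than for 0.99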
6. nll_loss and CrossEntropyLoss
The nll in nll_loss stands for "negative log likelihood," but it doesn't actually take the log at all! It assumes you have already taken the log. PyTorch has a function called log_softmax that combines log and softmax in a fast and accurate way. nll_loss is designed to be used after log_softmax.
When we first take the softmax, and then the log likelihood of it, that combination is called cross-entropy loss.
- native implementation: softmax, then log, then negative log likelihood
import torch
import torch.nn.functional as F
from torch import nn
acts = torch.randn((6,2))*2                      # random activations for 6 samples, 2 classes
targ = torch.tensor([0,1,0,1,1,0])               # example targets (assumed for illustration)
sm_acts = torch.softmax(acts, dim=1)
log_sm_acts = torch.log(sm_acts)
F.nll_loss(log_sm_acts, targ, reduction='none')  # picks out -log(probability) of the target class
- pytorch, in a single call
nn.CrossEntropyLoss(reduction='none')(acts, targ)
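For completeness, the same values can be obtained with log_softmax (the fused function mentioned above) or with the functional form of cross-entropy, reusing acts and targ from the snippet above:
F.nll_loss(F.log_softmax(acts, dim=1), targ, reduction='none')  # fused log + softmax, then negative log likelihood
F.cross_entropy(acts, targ, reduction='none')                   # functional equivalent of nn.CrossEntropyLoss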
7. An interesting feature of cross-entropy loss appears when we consider its gradient. The gradient of cross_entropy(a,b) is just softmax(a)-b. Since softmax(a) is just the final activation of the model, the gradient is proportional to the difference between the prediction and the target. This is the same as mean squared error in regression (assuming there's no final activation function such as that added by y_range), since the gradient of (a-b)**2 is 2*(a-b). Because the gradient is linear, we won't see sudden jumps or exponential increases in gradients, which should lead to smoother training of models.
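This claim can be checked with autograd; a minimal sketch with a single sample and three classes, where the one-hot encoded target plays the role of b:
import torch
import torch.nn.functional as F
acts = torch.randn(1, 3, requires_grad=True)          # the activations a
targ = torch.tensor([2])                              # target class
F.cross_entropy(acts, targ).backward()
onehot = F.one_hot(targ, num_classes=3).float()       # the target b
print(acts.grad)                                      # gradient computed by autograd
print(torch.softmax(acts.detach(), dim=1) - onehot)   # softmax(a) - b gives the same values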
8. Model interpretation: in the context of classification, an easy interpretation can be obtained with the ClassificationInterpretation class:
interp = ClassificationInterpretation.from_learner(learn)   # learn is a trained Learner
interp.plot_confusion_matrix(figsize=(12,12), dpi=60)
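When there are many classes the matrix can be hard to read; the same ClassificationInterpretation object also offers most_confused, which lists the most frequent prediction errors (the threshold below is illustrative):
interp.most_confused(min_val=5)   # (actual, predicted, count) tuples with at least 5 occurrences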
9. Learning rate finder
The idea was to start with a very, very small learning rate, something so small that we would never expect it to be too big to handle. We use that for one mini-batch, find what the losses are afterwards, and then increase the learning rate by some percentage (e.g., doubling it each time). Then we do another mini-batch, track the loss, and double the learning rate again. We keep doing this until the loss gets worse, instead of better.
lr_min,lr_steep = learn.lr_find()
Two rules to select the right learning rate:
- one order of magnitude less than where the minimum loss was achieved (i.e. the minimum divided by 10)
- the last point where the loss was clearly decreasing
In transfer learning (the fine_tune method), it makes sense to find two learning rates:
- the first learning rate is used to train the last few layers (the new head) while the rest of the network is frozen
- after that, we can unfreeze all the layer parameters and retrain the neural network, finding a better learning rate with the lr_find method (see the sketch below)
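A minimal sketch of this two-stage procedure, assuming the dls built in section 1 (epoch counts and learning rates are illustrative):
learn = cnn_learner(dls, resnet34, metrics=error_rate)
learn.fit_one_cycle(3, 3e-3)          # stage 1: train only the new head while the body stays frozen
learn.unfreeze()                      # make all parameters trainable
learn.lr_find()                       # look for a suitable, smaller learning rate for the full network
learn.fit_one_cycle(6, lr_max=1e-5)   # stage 2: retrain the whole network at the new rate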
10. Discriminative learning rate
Fastai lets you pass a Python slice object anywhere that a learning rate is expected. The first value passed will be the learning rate in the earliest layer of the neural network, and the second value will be the learning rate in the final layer. The layers in between will have learning rates that are multiplicatively equidistant throughout that range.
Discriminative learning rates refers to the training trick of using different learning rates for different layers of the model. This is commonly used in transfer learning. The idea is that when you train a pretrained model, you don't want to drastically change the earlier layers, as they contain information about simple features like edges and shapes. The later layers, however, may be changed a little more, as they may contain information about facial features or other object features that are not relevant to your task. Therefore, the earlier layers get a lower learning rate and the later layers get higher learning rates.
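Continuing from the unfrozen learner above, a slice can be passed as lr_max (the values are illustrative):
learn.fit_one_cycle(12, lr_max=slice(1e-6, 1e-4))   # earliest layers train at 1e-6, the final layers at 1e-4,
                                                    # and the layers in between get multiplicatively spaced rates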
11. Learning epochs
Remember, it’s not just that we’re looking for the validation loss to get worse, but the actual metrics.
Before the days of 1cycle training it was very common to save the model at the end of each epoch, and then select whichever model had the best accuracy out of all of the models saved in each epoch. This is known as early stopping.
However, using early stopping is a poor choice when using one cycle training. If early stopping is used, the training may not have time to reach lower learning rate values in the learning rate schedule, which could easily continue to improve the model. Therefore, it is recommended to retrain the model from scratch and select the number of epochs based on where the previous best results were found.
12. Deeper architecture
In general, a bigger model has the ability to better capture the real underlying relationships in your data, and also to capture and memorize the specific details of your individual images.
However, using a deeper model is going to require more GPU RAM, so you may need to lower the size of your batches to avoid an out-of-memory error.
Another solution is to use mixed-precision training; in the fastai library, call the to_fp16() method on the Learner:
from fastai.callback.fp16 import *
learn = cnn_learner(dls, resnet50, metrics=error_rate).to_fp16()
learn.fine_tune(6, freeze_epochs=3)