This note is based on Fastbook.
1. Fully convolutional neural network
Consider this question: would this approach make sense for an optical character recognition (OCR) problem such as MNIST? The vast majority of practitioners tackling OCR and similar problems tend to use fully convolutional networks, because that’s what nearly everybody learns nowadays. But it really doesn’t make any sense! You can’t decide, for instance, whether a number is a 3 or an 8 by slicing it into small pieces, jumbling them up, and deciding whether on average each piece looks like a 3 or an 8. But that’s what adaptive average pooling effectively does! Fully convolutional networks are only really a good choice for objects that don’t have a single correct orientation or size (like most natural photos).
Adaptive average pooling is different from regular pooling in the sense that regular pooling layers take the average (for average pooling) or the maximum (for max pooling) of a window of a given size, rather than producing a fixed output size. For instance, max pooling layers of size 2, which were very popular in older CNNs, reduce the size of our image by half on each dimension by taking the maximum of each 2×2 window (with a stride of 2).
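To make the distinction concrete, here is a minimal PyTorch sketch (the tensor sizes are arbitrary): a fixed 2×2 max pool halves each spatial dimension, while adaptive average pooling collapses any input resolution to the same fixed output size, which is what lets a fully convolutional network accept images of any size.

```python
import torch
import torch.nn as nn

x = torch.randn(1, 64, 28, 28)                    # batch, channels, height, width

# Regular pooling: a fixed 2x2 window with stride 2 halves each spatial dimension.
print(nn.MaxPool2d(2)(x).shape)                   # torch.Size([1, 64, 14, 14])

# Adaptive average pooling: the *output* size is fixed (here 1x1), whatever the input size.
pool = nn.AdaptiveAvgPool2d(1)
print(pool(x).shape)                              # torch.Size([1, 64, 1, 1])
print(pool(torch.randn(1, 64, 56, 56)).shape)     # torch.Size([1, 64, 1, 1])
```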
2. Skip Connections
The motivation behind skip connections is that simply stacking more layers does not, on its own, improve a network’s performance; in fact, deeper plain networks can end up with worse training error than shallower ones. A skip connection makes a block output x + block(x), so the block only has to learn how to change its input rather than reproduce it.
A ResNet is, therefore, good at learning about slight differences between doing nothing and passing through a block of two convolutional layers (with trainable weights). This is how these models got their name: they’re predicting residuals (reminder: “residual” is prediction minus target).
The original paper didn’t actually do the trick of using zero for the initial value of gamma
in the last batchnorm layer of each block; that came a couple of years later. So, the original version of ResNet didn't quite begin training with a truly identity path through the ResNet blocks, but nonetheless having the ability to "navigate through" the skip connections did indeed make it train better. Adding the batchnorm gamma
init trick made the models train at even higher learning rates.
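As a rough sketch of what that gamma trick amounts to in plain PyTorch (assuming the block ends in a standard nn.BatchNorm2d; fastai’s NormType.BatchZero does this for you), zeroing the batchnorm weight makes the block output zero at initialization, so x + block(x) starts out as the identity:

```python
import torch
import torch.nn as nn

bn = nn.BatchNorm2d(64)          # the last batchnorm layer of a residual block
nn.init.zeros_(bn.weight)        # gamma = 0 (the bias/beta already defaults to 0)

out = bn(torch.randn(2, 64, 8, 8))
print(out.abs().max())           # 0 -- the block contributes nothing at first
```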
A basic implementation of a ResNet block (in fastai style) is as follows:
# assumes fastai is imported, e.g. from fastai.vision.all import *
class ResBlock(Module):
    def __init__(self, ni, nf):
        # two conv layers; the second uses a zero-initialized batchnorm gamma
        # (NormType.BatchZero) so the block starts out contributing nothing
        self.convs = nn.Sequential(ConvLayer(ni, nf),
                                   ConvLayer(nf, nf, norm_type=NormType.BatchZero))

    def forward(self, x):
        return x + self.convs(x)
However, there are two problems with this block: it cannot handle a stride other than 1, and it requires the number of input channels (ni) to equal the number of output channels (nf); otherwise x and self.convs(x) cannot be added.
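For instance (a purely illustrative check in plain PyTorch), the element-wise addition in forward only works when x and convs(x) have the same shape; a stride-2 block that doubles the channels would produce a tensor that cannot be added back to x:

```python
import torch

x = torch.randn(1, 64, 28, 28)
y = torch.randn(1, 128, 14, 14)   # what a stride-2, channel-doubling convs(x) would produce
# x + y                           # RuntimeError: shapes don't match -- the identity path
                                  # must be downsampled and reprojected first
```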
An improved version fixes both problems:
def _conv_block(ni, nf, stride):
    # two conv layers; the second has no activation and a zero-initialized
    # batchnorm gamma, so the whole block outputs zero at initialization
    return nn.Sequential(
        ConvLayer(ni, nf, stride=stride),
        ConvLayer(nf, nf, act_cls=None, norm_type=NormType.BatchZero))

class ResBlock(Module):
    def __init__(self, ni, nf, stride=1):
        self.convs = _conv_block(ni, nf, stride)
        # if the number of channels changes, project the identity path with a 1x1 conv
        self.idconv = noop if ni == nf else ConvLayer(ni, nf, 1, act_cls=None)
        # if the stride is not 1, downsample the identity path with average pooling
        self.pool = noop if stride == 1 else nn.AvgPool2d(2, ceil_mode=True)

    def forward(self, x):
        return F.relu(self.convs(x) + self.idconv(self.pool(x)))
In “Visualizing the Loss Landscape of Neural Nets”, Hao Li et al. show that skip connections help smooth the loss landscape, which makes training easier because the optimizer is less likely to get stuck in a very sharp region.
3. Top-5 accuracy
Top-1 accuracy is the conventional accuracy: the model’s answer (the class with the highest probability) must exactly match the expected answer.
Top-5 accuracy means the prediction counts as correct if the expected answer is among the model’s five highest-probability answers.
Let’s say you’re applying a machine learning algorithm for object recognition using a neural network. A picture of a cat is shown, and these are the outputs of your neural network:
- Tiger: 0.4
- Dog: 0.3
- Cat: 0.1
- Lynx: 0.09
- Lion: 0.08
- Bird: 0.02
- Bear: 0.01
Given these probabilities:
Using top-1 accuracy, you will count this output as wrong, because it predicted a tiger.
Using top-5 accuracy, you count this output as correct, because the cat is among the top-5 guesses.
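A minimal sketch of how the two metrics can be computed with PyTorch, using tensors that mirror the toy example above:

```python
import torch

# Class order: [tiger, dog, cat, lynx, lion, bird, bear]; the true label is "cat" (index 2).
probs  = torch.tensor([[0.40, 0.30, 0.10, 0.09, 0.08, 0.02, 0.01]])
target = torch.tensor([2])

top1 = probs.argmax(dim=1)                       # tensor([0]) -> tiger
top5 = probs.topk(5, dim=1).indices              # tensor([[0, 1, 2, 3, 4]])

top1_acc = (top1 == target).float().mean()                        # 0.0 -- wrong
top5_acc = (top5 == target[:, None]).any(dim=1).float().mean()    # 1.0 -- cat is in the top 5
print(top1_acc.item(), top5_acc.item())
```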
4. Bottleneck layers
The blocks above stack two 3×3 convolutions. A bottleneck block instead stacks three convolutions: a 1×1 convolution that shrinks the number of channels, a 3×3 convolution, and a 1×1 convolution that restores the channels. Because 1×1 convolutions are cheap, the block runs faster than the two-convolution version despite having more layers, which in turn allows more filters to be used. The bottleneck design is typically only used in ResNet-50, -101, and -152 models; ResNet-18 and -34 use the two-convolution block shown earlier.
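Here is a sketch of such a bottleneck block in the same ConvLayer style as above (the function name is mine and the exact arguments are an assumption, not the fastbook code verbatim); it can be dropped into the ResBlock above in place of _conv_block:

```python
def _bottleneck_conv_block(ni, nf, stride):
    # 1x1 conv to shrink channels, 3x3 conv (the expensive one, now on fewer channels),
    # then 1x1 conv to expand back to nf, ending with a zero-gamma batchnorm and no
    # activation so the block still starts out as the identity.
    return nn.Sequential(
        ConvLayer(ni, nf//4, 1),
        ConvLayer(nf//4, nf//4, stride=stride),
        ConvLayer(nf//4, nf, 1, act_cls=None, norm_type=NormType.BatchZero))
```

Bottleneck blocks typically also use about four times as many filters in and out, which the cheaper 1×1 convolutions make affordable.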