My Fastai Course Note (4): Under the Hood: Training a Digit Classifier
This note is based on Fastbook.
1. untar_data is a factory function that manages paths: it handles data kept in the cloud, data kept in storage, and data kept in the local file system.
2. A DataFrame from Pandas can be visualized nicely in a Jupyter Notebook if the data frame holds image pixel values:
im3_t = tensor(im3)   # im3 is a PIL image of a handwritten '3' loaded earlier in the chapter
df = pd.DataFrame(im3_t[4:15, 4:22])
df.style.set_properties(**{'font-size':'6pt'}).background_gradient('Greys')
3. PyTorch provides loss functions, for example F.l1_loss, F.mse_loss, etc.
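A minimal sketch (my own example, not from the book) showing these two loss functions on small tensors:
import torch
import torch.nn.functional as F

preds = torch.tensor([0.2, 0.8, 0.6])
targets = torch.tensor([0.0, 1.0, 1.0])
print(F.l1_loss(preds, targets))    # mean absolute error
print(F.mse_loss(preds, targets))   # mean squared error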
4. The differences between a PyTorch tensor and a NumPy array are as follows (see the sketch after this list):
- A NumPy array supports “jagged” arrays: with an object dtype, its elements can be arrays of different sizes. A PyTorch tensor must be rectangular.
- Operators supported by NumPy arrays are usually also supported by PyTorch tensors.
- A PyTorch tensor can live on the GPU.
- A PyTorch tensor supports automatic differentiation (it can track gradients).
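A small sketch of my own (not from the book) illustrating these points:
import numpy as np
import torch

# Jagged array: possible in NumPy as an object array, not as a PyTorch tensor
jagged = np.array([[1, 2], [1, 2, 3]], dtype=object)
print(jagged)

a = np.array([1., 2., 3.])
t = torch.tensor([1., 2., 3.])
print(a * 2, t * 2)                  # most operators work the same way

t.requires_grad_()                   # a tensor can track gradients
(t ** 2).sum().backward()
print(t.grad)

# if torch.cuda.is_available(): t = t.to('cuda')   # a tensor can live on the GPU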
5. Broadcasting
PyTorch automatically expands the tensor with the smaller rank so that the operation behaves as if both tensors had the shape of the one with the larger rank.
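A minimal example (my own, not from the book): the rank-1 tensor is broadcast across each row of the rank-2 tensor.
import torch

m = torch.tensor([[1., 2., 3.],
                  [4., 5., 6.]])   # rank 2, shape (2, 3)
v = torch.tensor([10., 20., 30.])  # rank 1, shape (3,)
print(m + v)                       # v is expanded to shape (2, 3) without copying data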
6. How to get a variable’s gradient in PyTorch?
def f(x): return (x**2).sum()
xt = tensor([3., 4., 5.]).requires_grad_()   # must be a float tensor to track gradients
print(xt.grad)   # None: backward() has not been called yet
yt = f(xt)
yt.backward()
print(xt.grad)   # tensor([ 6.,  8., 10.])
- define a function
- define a variable at a certain location with gradient calculation enabled
- get the function output
- perform back propagation
- get the gradient
7. An example to show gradient descent algorithm:
time = torch.arange(0, 20).float()
speed = torch.rand(20)*3 + 0.73*(time-9.5)**2 + 1   # noisy quadratic "measurements"

def f(t, params):
    a, b, c = params
    return a*(t**2) + b*t + c

def mse(preds, targets):
    return ((preds - targets)**2).mean()

params = torch.randn(3).requires_grad_()
orig_params = params.clone()
lr = 1e-5

def apply_step(params):
    preds = f(time, params)
    loss = mse(preds, speed)
    loss.backward()                        # compute gradients
    params.data -= lr * params.grad.data   # step the parameters
    params.grad = None                     # reset gradients for the next step
    print(loss.item())

for i in range(10):
    apply_step(params)

print(orig_params.tolist())
print(params.tolist())
(1) Initialize the parameters — Random values often work best.
(2) Calculate the predictions — This is done on the training set, one mini-batch at a time.
(3) Calculate the loss — The average loss over the minibatch is calculated
(4) Calculate the gradients — this is an approximation of how the parameters need to change in order to minimize the loss function
(5) Step the weights — update the parameters based on the calculated gradients and the learning rate
(6) Repeat the process
(7) Stop — In practice, this is either based on time constraints or usually based on when the training/validation losses and metrics stop improving.
8. Why can't we use classification accuracy as the loss function?
In mathematical terms, accuracy is a function that is constant almost everywhere (except at the threshold, 0.5), so its derivative is nil almost everywhere (and infinity at the threshold). This then gives gradients that are 0 or infinite, which are useless for updating the model.
The key difference is that the metric is to drive human understanding and the loss is to drive automated learning.
The following code shows the link between the loss function and classification accuracy:
trgts = tensor([1, 0, 1])
prds = tensor([0.9, 0.4, 0.2])

def mnist_loss(predictions, targets):
    # distance from 1 when the target is 1, distance from 0 otherwise
    return torch.where(targets == 1, 1 - predictions, predictions).mean()

loss = mnist_loss(prds, trgts)
print(loss)

loss = mnist_loss(tensor([0.9, 0.4, 0.8]), trgts)
print(loss)   # lower loss: the last prediction moved closer to its target

def mnist_loss2(predictions, targets):
    predictions = predictions.sigmoid()   # squash raw outputs into (0, 1) first
    return torch.where(targets == 1, 1 - predictions, predictions).mean()
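A quick check of my own (assuming the fastai star import used throughout the note, which provides tensor): sigmoid squashes raw model outputs into (0, 1), so mnist_loss2 works even when activations fall outside that range.
acts = tensor([2.0, -1.5, 0.3])              # raw model outputs, not confined to (0, 1)
print(acts.sigmoid())                        # squashed into (0, 1)
print(mnist_loss2(acts, tensor([1, 0, 1])))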
9. How to choose batch size in Stochastic Gradient Descent (SGD) algorithm?
A larger batch size means that you will get a more accurate and stable estimate of your dataset’s gradients from the loss function, but it will take longer, and you will process fewer mini-batches per epoch. Choosing a good batch size is one of the decisions you need to make as a deep learning practitioner to train your model quickly and accurately.
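A small experiment of my own (not from the book) to make this concrete: estimate the gradient of a simple squared-error loss from mini-batches of two different sizes and compare how much the estimates fluctuate from batch to batch.
import torch

torch.manual_seed(42)
x = torch.randn(10000)               # synthetic inputs
y = 3 * x                            # targets from a known linear relation
w = torch.tensor(2.0)                # current parameter value

def grad_estimate(batch_size):
    # gradient of mean((w*x - y)**2) w.r.t. w, estimated from one mini-batch
    idx = torch.randint(0, len(x), (batch_size,))
    xb, yb = x[idx], y[idx]
    return (2 * xb * (w * xb - yb)).mean()

for bs in (4, 256):
    grads = torch.stack([grad_estimate(bs) for _ in range(1000)])
    print(bs, grads.std().item())    # the larger batch gives a much less noisy estimate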
10. DataLoader in Fastai:
ds = L(enumerate(string.ascii_lowercase))
dl = DataLoader(ds, batch_size=6, shuffle=True)
print(list(dl))
# [(tensor([17, 18, 10, 22, 8, 14]), ('r', 's', 'k', 'w', 'i', 'o')), (tensor([20, 15, 9, 13, 21, 12]), ('u', 'p', 'j', 'n', 'v', 'm')), (tensor([ 7, 25, 6, 5, 11, 23]), ('h', 'z', 'g', 'f', 'l', 'x')), (tensor([ 1, 3, 0, 24, 19, 16]), ('b', 'd', 'a', 'y', 't', 'q')), (tensor([2, 4]), ('c', 'e'))]
11. Why do we need non-linearity between neural network layers?
Without nonlinear functions between the linear layers, a stack of linear layers collapses into a single linear function. Moreover, a neural network with one nonlinear hidden layer can approximate any complex function arbitrarily well:
def simple_net(xb):
    # w1, b1, w2, b2 are randomly initialized parameter tensors created earlier in the chapter
    res = xb@w1 + b1
    res = res.max(tensor(0.0))   # ReLU: replace negative activations with zero
    res = res@w2 + b2
    return res
Or
simple_net = nn.Sequential(
    nn.Linear(28*28, 30),
    nn.ReLU(),
    nn.Linear(30, 1)
)
12. A typical learning procedure in Fastai
learn = Learner(dls, simple_net, opt_func=SGD, loss_func=mnist_loss, metrics=batch_accuracy)
- dls: DataLoaders from Fastai
- simple_net: the nn.Sequential model from PyTorch defined above
- opt_func: SGD, the optimization function
- loss_func: mnist_loss, self-defined above
- metrics: batch_accuracy, an accuracy metric
You can check more details of the Learner class in the Fastai documentation.
One object inside the Learner is the Recorder, which records the training and validation losses and metrics at each epoch.
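A hedged sketch of how the Learner above could be trained and the Recorder inspected (names follow fastbook chapter 4; the learning rate is just an illustrative value):
learn.fit(10, lr=0.1)
# learn.recorder.values holds (train_loss, valid_loss, metric) for each epoch
print(L(learn.recorder.values).itemgot(2))   # the accuracy metric per epoch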
def train_epoch(model, lr, params):
    for xb, yb in dl:
        calc_grad(xb, yb, model)   # computes the predictions and the loss, then calls loss.backward()
        for p in params:
            p.data -= p.grad*lr
            p.grad.zero_()

for i in range(20):
    train_epoch(model, lr, params)
13. Deep network example
The universal approximation theorem shows that any function can be approximated as closely as needed using just one nonlinearity. So why do we normally use more?
There are practical performance benefits to using more than one nonlinearity: a deeper model can reach better performance with fewer parameters, faster training, and lower compute/memory requirements.
dls=ImageDataLoaders.from_folder(path)
learn = cnn_learner(dls, resnet18, pretrained=False, loss_func=F.cross_entropy, metrics=accuracy)
learn.fit_one_cycle(1, 0.1)
14. Activations vs Parameters
A neural network contains a lot of numbers, but they are of only two types: numbers that are calculated (activations) and the numbers they are calculated from (parameters, which are learned).
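A small sketch of my own showing the two types for a single linear layer:
import torch
import torch.nn as nn

layer = nn.Linear(4, 2)                       # its weight and bias are parameters (learned)
x = torch.randn(1, 4)
acts = layer(x)                               # activations: calculated from the input and the parameters
print([p.shape for p in layer.parameters()], acts.shape)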
15. What’s the difference between rank and shape in a tensor?
Rank is the number of axes or dimensions in a tensor (.ndim); shape is the size of each axis of a tensor (.shape).
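For example (my own snippet):
import torch

t = torch.zeros(2, 3, 4)
print(t.ndim)    # rank: 3 (number of axes)
print(t.shape)   # shape: torch.Size([2, 3, 4]) (size of each axis)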
16. What are the “bias” parameters in a neural network? Why do we need them?
Without the bias parameters, if the input is zero, the output will always be zero. Therefore, using bias parameters adds additional flexibility to the model.
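A tiny sketch of my own to illustrate: without a bias, a zero input always produces a zero output, while a layer with a bias can still produce non-zero values.
import torch
import torch.nn as nn

x = torch.zeros(1, 3)
no_bias = nn.Linear(3, 2, bias=False)
with_bias = nn.Linear(3, 2, bias=True)
print(no_bias(x))     # always zeros for a zero input
print(with_bias(x))   # equals the bias vector, so the output is not forced to zero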