My Fastai Course Note (4): Under the Hood: Training a Digit Classifier
This note is based on Fastbook.
1. untar_data is a factory function that manages paths: it handles data kept in the cloud, data kept in storage, and data kept in the local file system.
2. A DataFrame from Pandas can be visualized nicely in a Jupyter Notebook if the data frame holds image pixel values:
im3_t = tensor(im3)   # im3 is a PIL image of a handwritten '3' loaded earlier in the chapter
df = pd.DataFrame(im3_t[4:15, 4:22])
df.style.set_properties(**{'font-size':'6pt'}).background_gradient('Greys')
3. PyTorch provides loss functions, for example F.l1_loss, F.mse_loss, etc.
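A minimal sketch (my own example, not from the book) showing these two loss functions on small tensors:
import torch
import torch.nn.functional as F

preds = torch.tensor([0.2, 0.8, 0.6])
targets = torch.tensor([0.0, 1.0, 1.0])
print(F.l1_loss(preds, targets))    # mean absolute error
print(F.mse_loss(preds, targets))   # mean squared error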
4. The differences between a PyTorch tensor and a NumPy array are as follows (see the sketch after this list):
- A NumPy array supports “jagged” arrays: with an object dtype, its elements can be arrays of different sizes. A PyTorch tensor must be rectangular.
- Operators supported by NumPy arrays are usually also supported by PyTorch tensors.
- A PyTorch tensor can live on the GPU.
- A PyTorch tensor supports automatic differentiation (it can track gradients).
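A small sketch of my own (not from the book) illustrating these points:
import numpy as np
import torch

# Jagged array: possible in NumPy as an object array, not as a PyTorch tensor
jagged = np.array([[1, 2], [1, 2, 3]], dtype=object)
print(jagged)

a = np.array([1., 2., 3.])
t = torch.tensor([1., 2., 3.])
print(a * 2, t * 2)                  # most operators work the same way

t.requires_grad_()                   # a tensor can track gradients
(t ** 2).sum().backward()
print(t.grad)

# if torch.cuda.is_available(): t = t.to('cuda')   # a tensor can live on the GPU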
5. Broadcasting
PyTorch automatically expands the tensor with the smaller rank so that the operation behaves as if both tensors had the shape of the one with the larger rank.
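A minimal example (my own, not from the book): the rank-1 tensor is broadcast across each row of the rank-2 tensor.
import torch

m = torch.tensor([[1., 2., 3.],
                  [4., 5., 6.]])   # rank 2, shape (2, 3)
v = torch.tensor([10., 20., 30.])  # rank 1, shape (3,)
print(m + v)                       # v is expanded to shape (2, 3) without copying data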
6. How to get a variable’s gradient in PyTorch?
def f(x): return (x**2).sum()
xt = tensor([3., 4., 5.]).requires_grad_()   # must be a float tensor to track gradients
print(xt.grad)   # None: backward() has not been called yet
yt = f(xt)
yt.backward()
print(xt.grad)   # tensor([ 6.,  8., 10.])
- define a function
- define a variable at a certain location with gradient calculation enabled
- get the function output
- perform back propagation
- get the gradient
7. An example to show gradient descent algorithm:
time = torch.arange(0, 20).float()
speed = torch.rand(20)*3 + 0.73*(time-9.5)**2 + 1   # noisy quadratic "measurements"

def f(t, params):
    a, b, c = params
    return a*(t**2) + b*t + c

def mse(preds, targets):
    return ((preds - targets)**2).mean()

params = torch.randn(3).requires_grad_()
orig_params = params.clone()
lr = 1e-5

def apply_step(params):
    preds = f(time, params)
    loss = mse(preds, speed)
    loss.backward()                        # compute gradients
    params.data -= lr * params.grad.data   # step the parameters
    params.grad = None                     # reset gradients for the next step
    print(loss.item())

for i in range(10):
    apply_step(params)

print(orig_params.tolist())
print(params.tolist())
(1) Initialize the parameters — Random values often work best.
(2) Calculate the predictions — This is done on the training set, one mini-batch at a time.
(3) Calculate the loss — The average loss over the minibatch is calculated
(4) Calculate the gradients — this is an approximation of how the parameters need to change in order to minimize the loss function
(5) Step the weights — update the parameters based on the calculated gradients and the learning rate
(6) Repeat the process
(7) Stop — In practice, this is either based on time constraints or usually based on when the training/validation losses and metrics stop improving.
8. Why can't we use classification accuracy as the loss function?
In mathematical terms, accuracy is a function that is constant almost everywhere (except at the threshold, 0.5), so its derivative is nil almost everywhere (and infinity at the threshold). This then gives gradients that are 0 or infinite, which are useless for updating the model.
The key difference is that the metric is to drive human understanding and the loss is to drive automated learning.
The following code shows the link between the loss function and classification accuracy:
trgts = tensor([1, 0, 1])
prds = tensor([0.9, 0.4, 0.2])

def mnist_loss(predictions, targets):
    # distance from 1 when the target is 1, distance from 0 otherwise
    return torch.where(targets == 1, 1 - predictions, predictions).mean()

loss = mnist_loss(prds, trgts)
print(loss)

loss = mnist_loss(tensor([0.9, 0.4, 0.8]), trgts)
print(loss)   # lower loss: the last prediction moved closer to its target

def mnist_loss2(predictions, targets):
    predictions = predictions.sigmoid()   # squash raw outputs into (0, 1) first
    return torch.where(targets == 1, 1 - predictions, predictions).mean()
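A quick check of my own (assuming the fastai star import used throughout the note, which provides tensor): sigmoid squashes raw model outputs into (0, 1), so mnist_loss2 works even when activations fall outside that range.
acts = tensor([2.0, -1.5, 0.3])              # raw model outputs, not confined to (0, 1)
print(acts.sigmoid())                        # squashed into (0, 1)
print(mnist_loss2(acts, tensor([1, 0, 1])))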
9. How to choose batch size in Stochastic Gradient Descent (SGD) algorithm?
A larger batch size means that you will get a more accurate and stable estimate of your dataset’s gradients from the loss function, but it will take longer, and you will process fewer mini-batches per epoch. Choosing a good batch size is one of the decisions you need to make as a deep learning practitioner to train your model quickly and accurately.
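A small experiment of my own (not from the book) to make this concrete: estimate the gradient of a simple squared-error loss from mini-batches of two different sizes and compare how much the estimates fluctuate from batch to batch.
import torch

torch.manual_seed(42)
x = torch.randn(10000)               # synthetic inputs
y = 3 * x                            # targets from a known linear relation
w = torch.tensor(2.0)                # current parameter value

def grad_estimate(batch_size):
    # gradient of mean((w*x - y)**2) w.r.t. w, estimated from one mini-batch
    idx = torch.randint(0, len(x), (batch_size,))
    xb, yb = x[idx], y[idx]
    return (2 * xb * (w * xb - yb)).mean()

for bs in (4, 256):
    grads = torch.stack([grad_estimate(bs) for _ in range(1000)])
    print(bs, grads.std().item())    # the larger batch gives a much less noisy estimate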
10. DataLoader in Fastai:
ds = L(enumerate(string.ascii_lowercase))
dl = DataLoader(ds, batch_size=6, shuffle=True)
print(list(dl))
# [(tensor([17, 18, 10, 22, 8, 14]), ('r', 's', 'k', 'w', 'i', 'o')), (tensor([20, 15, 9, 13, 21, 12]), ('u', 'p', 'j', 'n', 'v', 'm')), (tensor([ 7, 25, 6, 5, 11, 23]), ('h', 'z', 'g', 'f', 'l', 'x')), (tensor([ 1, 3, 0, 24, 19, 16]), ('b', 'd', 'a', 'y', 't', 'q')), (tensor([2, 4]), ('c', 'e'))]
11. Why do we need non-linearity between neural network layers?
Without nonlinear functions between the linear layers, a stack of linear layers collapses into a single linear function. Moreover, a neural network with one nonlinear hidden layer can approximate any complex function arbitrarily well:
def simple_net(xb):
    # w1, b1, w2, b2 are randomly initialized parameter tensors created earlier in the chapter
    res = xb@w1 + b1
    res = res.max(tensor(0.0))   # ReLU: replace negative activations with zero
    res = res@w2 + b2
    return res
Or
simple_net = nn.Sequential(
    nn.Linear(28*28, 30),
    nn.ReLU(),
    nn.Linear(30, 1)
)
12. A typical learning procedure in Fastai
learn = Learner(dls, simple_net, opt_func=SGD, loss_func=mnist_loss, metrics=batch_accuracy)
- dls: DataLoaders from Fastai
- simple_net: the nn.Sequential model from PyTorch defined above
- opt_func: SGD, the optimization function
- loss_func: mnist_loss, self-defined above
- metrics: batch_accuracy, an accuracy metric
You can check more details of the Learner class in the Fastai documentation.
One object inside the Learner is the Recorder, which records the training and validation losses and metrics at each epoch.
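A hedged sketch of how the Learner above could be trained and the Recorder inspected (names follow fastbook chapter 4; the learning rate is just an illustrative value):
learn.fit(10, lr=0.1)
# learn.recorder.values holds (train_loss, valid_loss, metric) for each epoch
print(L(learn.recorder.values).itemgot(2))   # the accuracy metric per epoch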
def train_epoch(model, lr, params):
    for xb, yb in dl:
        calc_grad(xb, yb, model)   # computes the predictions and the loss, then calls loss.backward()
        for p in params:
            p.data -= p.grad*lr
            p.grad.zero_()

for i in range(20):
    train_epoch(model, lr, params)
13. Deep network example
The universal approximation theorem shows that any function can be approximated as closely as needed using just one nonlinearity. So why do we normally use more?
There are practical performance benefits to using more than one nonlinearity: a deeper model can reach better performance with fewer parameters, faster training, and lower compute/memory requirements.
dls=ImageDataLoaders.from_folder(path)
learn = cnn_learner(dls, resnet18, pretrained=False, loss_func=F.cross_entropy, metrics=accuracy)
learn.fit_one_cycle(1, 0.1)
14. Activations vs Parameters
A neural network contains a lot of numbers, but they are of only two types: numbers that are calculated (activations) and the numbers they are calculated from (parameters, which are learned).
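A small sketch of my own showing the two types for a single linear layer:
import torch
import torch.nn as nn

layer = nn.Linear(4, 2)                       # its weight and bias are parameters (learned)
x = torch.randn(1, 4)
acts = layer(x)                               # activations: calculated from the input and the parameters
print([p.shape for p in layer.parameters()], acts.shape)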
15. What’s the difference between rank and shape in a tensor?
Rank is the number of axes or dimensions in a tensor (.ndim); shape is the size of each axis of a tensor (.shape).
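For example (my own snippet):
import torch

t = torch.zeros(2, 3, 4)
print(t.ndim)    # rank: 3 (number of axes)
print(t.shape)   # shape: torch.Size([2, 3, 4]) (size of each axis)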
16. What are the “bias” parameters in a neural network? Why do we need them?
Without the bias parameters, if the input is zero, the output will always be zero. Therefore, using bias parameters adds additional flexibility to the model.
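A tiny sketch of my own to illustrate: without a bias, a zero input always produces a zero output, while a layer with a bias can still produce non-zero values.
import torch
import torch.nn as nn

x = torch.zeros(1, 3)
no_bias = nn.Linear(3, 2, bias=False)
with_bias = nn.Linear(3, 2, bias=True)
print(no_bias(x))     # always zeros for a zero input
print(with_bias(x))   # equals the bias vector, so the output is not forced to zero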