This note is based on Fastbook.
1. How to train a simple language model from scratch?
Basic Idea
The first tweak is that the first linear layer will use only the first word’s embedding as activations, the second layer will use the second word’s embedding plus the first layer’s output activations, and the third layer will use the third word’s embedding plus the second layer’s output activations. The key effect of this is that every word is interpreted in the information context of any words preceding it.
The second tweak is that each of these three layers will use the same weight matrix. The way that one word impacts the activations from previous words should not change depending on the position of a word. In other words, activation values will change as data moves through the layers, but the layer weights themselves will not change from layer to layer. So, a layer does not learn one sequence position; it must learn to handle all positions.
Implementation
class LMModel1(Module):
    def __init__(self, vocab_sz, n_hidden):
        self.i_h = nn.Embedding(vocab_sz, n_hidden)  # input to hidden: token embeddings
        self.h_h = nn.Linear(n_hidden, n_hidden)     # hidden to hidden: shared across positions
        self.h_o = nn.Linear(n_hidden, vocab_sz)     # hidden to output: predicts the next token

    def forward(self, x):
        # first word: its embedding only
        h = F.relu(self.h_h(self.i_h(x[:,0])))
        # second word: its embedding plus the previous activations
        h = h + self.i_h(x[:,1])
        h = F.relu(self.h_h(h))
        # third word: its embedding plus the previous activations
        h = h + self.i_h(x[:,2])
        h = F.relu(self.h_h(h))
        return self.h_o(h)
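As a rough sketch of how this model could be trained with fastai (the names dls and vocab are assumed here, not taken from the note: dls would yield batches of three-token inputs paired with the following token as the target, and vocab would be the corpus vocabulary):

# Minimal training sketch, assuming `dls` and `vocab` as described above.
learn = Learner(dls, LMModel1(len(vocab), 64),
                loss_func=F.cross_entropy, metrics=accuracy)
learn.fit_one_cycle(4, 1e-3)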
2. How to train a Recurrent Neural Network as a language model?
A neural network that is defined using a loop is called a recurrent neural network (RNN).
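For example, LMModel1 above can be refactored into a loop over the three positions; the computation is exactly the same, but it is now written as an RNN (a sketch following Fastbook's structure; the class name is illustrative):

# LMModel1 rewritten with a loop: the same layers, applied once per token.
class LMModel2(Module):
    def __init__(self, vocab_sz, n_hidden):
        self.i_h = nn.Embedding(vocab_sz, n_hidden)
        self.h_h = nn.Linear(n_hidden, n_hidden)
        self.h_o = nn.Linear(n_hidden, vocab_sz)

    def forward(self, x):
        h = 0                          # hidden state starts at zero for every sample
        for i in range(3):
            h = h + self.i_h(x[:,i])
            h = F.relu(self.h_h(h))
        return self.h_o(h)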
The model we just set up has a problem: because we initialize its hidden state to zero for each new sample, we throw away all the information we have about the sentences we have seen so far.
A simple solution is to maintain the hidden state across samples instead of resetting it. However, this creates another problem: it makes the network effectively as deep as the number of tokens in the document, which leads to memory and performance problems. Since the hidden state is carried through every call of the model, backpropagation would have to compute gradients through all of those past calls as well, and this quickly exhausts memory. Therefore, after every call the detach method is used to drop the gradient history of previous calls: the hidden state's values are kept, but gradients stop flowing back beyond the current sequence.
class LMModel3(Module):
    def __init__(self, vocab_sz, n_hidden):
        self.i_h = nn.Embedding(vocab_sz, n_hidden)  # input to hidden
        self.h_h = nn.Linear(n_hidden, n_hidden)     # hidden to hidden
        self.h_o = nn.Linear(n_hidden, vocab_sz)     # hidden to output
        self.h = 0                                   # hidden state kept between calls

    def forward(self, x):
        for i in range(3):
            self.h = self.h + self.i_h(x[:,i])
            self.h = F.relu(self.h_h(self.h))
        out = self.h_o(self.h)
        # keep the hidden state's values but drop its gradient history
        self.h = self.h.detach()
        return out

    def reset(self):
        self.h = 0
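Training this model needs two extra pieces of care: the batches must be arranged so that each sequence in a batch continues the corresponding sequence of the previous batch (otherwise the carried hidden state is meaningless), and reset must be called at the start of training and validation so each epoch begins from a zero state. A rough sketch with fastai, where ModelResetter is the callback that calls reset and dls is assumed to be built from such contiguous chunks:

# Sketch only: `dls` is assumed to keep sequences contiguous across batches.
learn = Learner(dls, LMModel3(len(vocab), 64),
                loss_func=F.cross_entropy, metrics=accuracy,
                cbs=ModelResetter)            # calls reset() at the start of fit/validation
learn.fit_one_cycle(10, 3e-3)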
The idea implemented above is called backpropagation through time:
Back propagation through time (BPTT): Treating a neural net with effectively one layer per time step (usually refactored using a loop) as one big model, and calculating gradients on it in the usual way. To avoid running out of memory and time, we usually use truncated BPTT, which “detaches” the history of computation steps in the hidden state every few time steps.
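To make the "detaches every few time steps" part concrete, here is a toy, self-contained PyTorch sketch (the tensors and loss are made up purely for illustration): the hidden state's values are carried from chunk to chunk, but calling detach stops gradients from flowing back past the current chunk.

import torch

W = torch.nn.Linear(8, 8)                 # shared step-to-step weights
opt = torch.optim.SGD(W.parameters(), lr=0.1)
h = torch.zeros(1, 8)                     # hidden state carried across chunks
for chunk in torch.randn(5, 1, 8):        # pretend stream of inputs
    h = torch.relu(W(h + chunk))          # one recurrent step
    loss = h.pow(2).mean()                # dummy loss, just to have gradients
    loss.backward()
    opt.step(); opt.zero_grad()
    h = h.detach()                        # truncate: keep values, drop gradient history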