My Fastai Course Note (10): NLP Deep Dive: RNNs
This note is based on Fastbook.
1. What is self-supervised learning?
Training a model using labels that are embedded in the independent variable, rather than requiring external labels.
Self-supervised learning is particularly useful when a pre-trained model for your domain does not exist: instead of requiring external labels, we train a model using labels that are naturally part of the input data. In Natural Language Processing, a very common self-supervised pretext task is language modeling (predicting the next word). In Computer Vision, self-supervised pretext tasks include in-painting, etc. Models trained on such reconstruction-style pretext tasks are often called auto-encoders.
see Self-supervised learning and computer vision for more details.
2. Why should we know how to train a language model in detail?
A language model is a self-supervised model that tries to predict the next word of a given passage of text.
- to understand the foundations of the models we use
- to fine-tune the language model before training the classifier, which is the Universal Language Model Fine-tuning (ULMFiT) approach.
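The "labels embedded in the independent variable" idea is easy to see for a language model: the target sequence is simply the input sequence shifted by one token. A minimal pure-Python sketch (the function name is illustrative, not fastai API):

```python
def lm_input_target(tokens):
    # For a language model, the "labels" come from the text itself:
    # the target is the input shifted one token to the left.
    return tokens[:-1], tokens[1:]

stream = ["the", "movie", "was", "great"]
x, y = lm_input_target(stream)
# x predicts y: after "the" comes "movie", after "movie" comes "was", ...
```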
3. Text pre-processing
- Tokenization: convert the text into a list of words (or characters, or substrings, depending on the granularity of your model). The tokenization process also adds special tokens and deals with punctuation. (Related pre-processing techniques: stemming, lemmatization, POS tagging, named entity recognition, chunking.)
- Numericalization: make a list of all of the unique words that appear (the vocab), and convert each word into a number by looking up its index in the vocab. If every word in the dataset had its own token, the embedding matrix would be very large, increasing memory usage and slowing down training. Therefore, only words that occur at least `min_freq` times are assigned a token (and finally a number), while the rest are replaced with the "unknown word" token.
- Language model data loader creation: this depends on applications, and fastai provides some functions that facilitate this task.
- Language model creation: use a recurrent neural network (RNN) to train the model.
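The tokenization and numericalization steps above can be sketched in plain Python. This is a toy illustration using whitespace tokenization and made-up special tokens (`xxunk`, `xxpad`) that mimic fastai's naming conventions; it is not the library's actual implementation:

```python
from collections import Counter

def build_vocab(texts, min_freq=2, specials=("xxunk", "xxpad")):
    """Numericalization: keep only tokens appearing at least min_freq
    times; everything else will map to the 'unknown word' token."""
    counts = Counter(tok for t in texts for tok in t.split())
    vocab = list(specials) + [w for w, c in counts.items() if c >= min_freq]
    return {w: i for i, w in enumerate(vocab)}

def numericalize(text, stoi):
    # Unseen or rare words fall back to the xxunk index.
    unk = stoi["xxunk"]
    return [stoi.get(tok, unk) for tok in text.split()]

texts = ["the cat sat", "the dog sat", "a bird flew"]
stoi = build_vocab(texts)                  # only 'the' and 'sat' survive min_freq=2
ids = numericalize("the rhino sat", stoi)  # 'rhino' maps to the xxunk id
```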
4. Natural Language Processing
- sentiment analysis
- speech recognition
- machine translation
5. A full process of training the language model for classification task:
```python
from fastai.text.all import *

path = untar_data(URLs.IMDB)

get_imdb = partial(get_text_files, folders=['train', 'test', 'unsup'])

dls_lm = DataBlock(
    # blocks= and metrics= lines below are restored from the fastbook example
    blocks=TextBlock.from_folder(path, is_lm=True),
    get_items=get_imdb, splitter=RandomSplitter(0.1)
).dataloaders(path, path=path, bs=16, seq_len=80)

learn = language_model_learner(
    dls_lm, AWD_LSTM, drop_mult=0.3,
    metrics=[accuracy, Perplexity()]).to_fp16()
```
- Just like `cnn_learner`, `language_model_learner` automatically calls `freeze` when using a pretrained model (which is the default), so initially this will only train the embeddings.
- We can save the model via `learn.save('1epoch')`.
- We can load the model via `learn.load('1epoch')`, and then continue to train it via `learn.unfreeze()` followed by `learn.fit_one_cycle(...)`.
- When we save the model as an encoder (via `learn.save_encoder`), it means the saved model does not include the task-specific final layers.
- Since the documents have variable sizes, padding is needed to collate the batch. Other approaches, like cropping or squishing, either negatively affect training or do not make sense in this context, so padding is used.
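A minimal sketch of the padding idea (the pad id and helper name are made up for illustration; fastai handles this internally when collating a batch):

```python
def pad_batch(docs, pad_id=1):
    """Pad each document with pad_id so every row in the batch
    has the length of the longest document."""
    max_len = max(len(d) for d in docs)
    return [d + [pad_id] * (max_len - len(d)) for d in docs]

batch = pad_batch([[5, 6, 7], [8, 9], [10]])
# every row now has length 3, so the batch forms a rectangular tensor
```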
6. With a batch size of 64, the first row of the tensor representing the first batch contains the first 64 tokens for the dataset. What does the second row of that tensor contain? What does the first row of the second batch contain? (Careful — students often get this one wrong! Be sure to check your answer against the book website.)
a. The dataset is split into 64 mini-streams (batch size)
b. Each batch has 64 rows (batch size) and 64 columns (sequence length)
c. The first row of the first batch contains the beginning of the first mini-stream (tokens 1–64)
d. The second row of the first batch contains the beginning of the second mini-stream
e. The first row of the second batch contains the second chunk of the first mini-stream (tokens 65–128)
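The mini-stream arrangement described in a–e can be verified with a small sketch (pure Python with illustrative names, using a smaller batch size and sequence length than the question for readability):

```python
def lm_batches(stream, bs, seq_len):
    """Split one long token stream into bs mini-streams, then cut each
    mini-stream into seq_len chunks. Batch b, row r holds chunk b of
    mini-stream r, so consecutive batches continue the same rows."""
    n = len(stream) // bs                  # length of each mini-stream
    ministreams = [stream[i*n:(i+1)*n] for i in range(bs)]
    n_batches = n // seq_len
    return [[ms[b*seq_len:(b+1)*seq_len] for ms in ministreams]
            for b in range(n_batches)]

tokens = list(range(1, 257))               # 256 toy "tokens"
batches = lm_batches(tokens, bs=4, seq_len=8)
# batches[0][0] is the start of mini-stream 0; batches[1][0] continues it,
# while batches[0][1] is the start of mini-stream 1.
```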
7. What does an embedding matrix for NLP contain? What is its shape?
It contains vector representations of all tokens in the vocabulary. The embedding matrix has the size (vocab_size x embedding_size), where vocab_size is the length of the vocabulary, and embedding_size is an arbitrary number defining the number of latent factors of the tokens.
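The lookup can be sketched directly: an embedding layer is just an indexable matrix of shape (vocab_size x embedding_size), where looking up a token id returns that token's row (random initialization stands in for learned latent factors):

```python
import random

vocab_size, emb_size = 5, 3
# the embedding matrix: one learnable row of emb_size factors per token
emb = [[random.random() for _ in range(emb_size)] for _ in range(vocab_size)]

def embed(token_ids, emb):
    """An embedding lookup is simply an indexed row selection."""
    return [emb[i] for i in token_ids]

vectors = embed([0, 4, 2], emb)
# result: 3 tokens, each mapped to its emb_size-dimensional vector
```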