My Fastai Course Note (10): NLP Deep Dive: RNNs

  1. What is self-supervised learning? Training a model using labels that are embedded in the independent variable itself (e.g., predicting the next word of a text), rather than requiring externally provided labels.
  • understand the model foundations
  • fine-tuning the classification model, which is called the Universal Language Model Fine-tuning (ULMFiT) approach.
  • Tokenization: convert the text into a list of words (or characters, or substrings, depending on the granularity of your model). The tokenization process adds special tokens and deals with punctuation. (Related NLP preprocessing: stemming, lemmatization, POS tagging, named entity recognition, chunking.)
  • Numericalization: make a list of all of the unique words that appear (the vocab), and convert each word into a number by looking up its index in the vocab. If every word in the dataset were assigned its own token, the embedding matrix would be very large, increasing memory usage and slowing down training. Therefore, only words with more than min_freq occurrences are assigned a token and finally a number, while the rest are replaced with the "unknown word" token.
  • Language model data loader creation: this depends on applications, and fastai provides some functions that facilitate this task.
  • Language model creation: use an RNN to train the language model
  • sentiment analysis
  • chatbot
  • speech recognition
  • machine translation
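The numericalization step described above can be sketched in plain Python. This is a toy illustration, not fastai's actual implementation; the min_freq cutoff and the xxunk unknown-word token mirror fastai's conventions:

```python
from collections import Counter

def make_vocab(tokens, min_freq=2):
    # Count token occurrences; only frequent tokens get their own vocab entry
    counts = Counter(tokens)
    return ['xxunk'] + sorted(t for t, c in counts.items() if c >= min_freq)

def numericalize(tokens, vocab):
    # Map each token to its index in the vocab; rare words fall back to xxunk (index 0)
    index = {t: i for i, t in enumerate(vocab)}
    return [index.get(t, 0) for t in tokens]

tokens = ['the', 'movie', 'was', 'great', 'the', 'plot', 'was', 'thin',
          'the', 'acting', 'was', 'great']
vocab = make_vocab(tokens, min_freq=2)   # ['xxunk', 'great', 'the', 'was']
ids = numericalize(tokens, vocab)
```

Rare words like 'plot' and 'acting' map to 0 (xxunk), so the embedding matrix only needs a row per frequent word.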
get_imdb = partial(get_text_files, folders=['train', 'test', 'unsup'])

dls_lm = DataBlock(
    blocks=TextBlock.from_folder(path, is_lm=True),
    get_items=get_imdb, splitter=RandomSplitter(0.1)
).dataloaders(path, path=path, bs=16, seq_len=80)

learn = language_model_learner(dls_lm, AWD_LSTM, drop_mult=0.3,
    metrics=[accuracy, Perplexity()]).to_fp16()

learn.fit_one_cycle(1, 2e-2)
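The Perplexity() metric passed to the learner above is simply the exponential of the average per-token cross-entropy loss. A minimal sketch of that relationship:

```python
import math

def perplexity(avg_cross_entropy_loss):
    # Perplexity = exp(mean cross-entropy, natural log base).
    # A perplexity of N roughly means the model is as uncertain as if it were
    # choosing uniformly among N words at each step.
    return math.exp(avg_cross_entropy_loss)

# A model that always assigns probability 1/100 to the correct next word
# has loss ln(100) and perplexity 100.
```

Lower perplexity means the language model assigns higher probability to the actual next words.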
  • Just like cnn_learner, language_model_learner automatically calls freeze when using a pretrained model (which is the default), so this will only train the embeddings.
  • We can save the model via'1epoch').
  • We can load the model via learn.load('1epoch'), and then continue training the model after calling learn.unfreeze().
  • When we save the model as an encoder, the saved model does not include the task-specific final layers: learn.save_encoder('finetuned')
  • Since the documents have variable sizes, padding is needed to collate them into a batch. Other approaches, like cropping or squishing, either negatively affect training or do not make sense in this context; therefore, padding is used.
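The padding idea can be sketched in plain Python. This is a toy collate function, not fastai's actual implementation; the pad index 1 mirrors fastai's xxpad token convention:

```python
def pad_collate(batch, pad_idx=1):
    # Pad every numericalized sequence in the batch to the length of the longest
    # one, so they can be stacked into a single rectangular tensor.
    max_len = max(len(seq) for seq in batch)
    return [seq + [pad_idx] * (max_len - len(seq)) for seq in batch]

batch = [[5, 8, 2], [7, 4], [9]]
padded = pad_collate(batch)  # every row now has length 3
```

The model then ignores (or learns to ignore) the pad positions, while cropping would discard real tokens and squishing would distort the text.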



