My Fastai Course Note (11): Data Munging with Fastai’s Mid-Level API

ifeelfree
3 min read · Nov 23, 2020


This note is based on Fastbook.

  1. Transform in Fastai for natural language processing

There are three ways of implementing a Transform:

Method 1

def f(x:int): return x+1   # the :int annotation restricts the transform to ints
tfm = Transform(f)
tfm(2),tfm(2.0)
(3, 2.0)

Method 2

@Transform
def f(x:int): return x+1
f(2),f(2.0)
(3, 2.0)

Method 3

class NormalizeMean(Transform):
    def setups(self, items): self.mean = sum(items)/len(items)
    def encodes(self, x): return x - self.mean
    def decodes(self, x): return x + self.mean

tfm = NormalizeMean()
tfm.setup([1,2,3,4,5])
start = 2
y = tfm(start)
z = tfm.decode(y)
tfm.mean,y,z
(3.0, -1.0, 2.0)

Method 3 is the recommended way to customize a Transform. When subclassing Transform, we implement the actual encoding in the encodes() method, and we can optionally implement setups() and decodes() for the setup and decoding behavior respectively.

To compose several transforms together, fastai provides the Pipeline class, and TfmdLists combines a collection of raw items with such a pipeline of Transforms.
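
Before looking at TfmdLists, here is a minimal sketch of a Pipeline on its own, using two toy functions of my own (add_one and times_two are not from the book) just to show that the Transforms are run in order:

from fastcore.transform import Transform, Pipeline  # also available through the usual fastai star-imports

def add_one(x:int): return x + 1
def times_two(x:int): return x * 2

pipe = Pipeline([Transform(add_one), Transform(times_two)])
pipe(3)
8

TfmdLists applies such a pipeline to a whole list of items, calling the setup method of each Transform in order: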

tls = TfmdLists(files, [Tokenizer.from_folder(path), Numericalize])
t = tls[0]
tls.show(t)

The TfmdLists is named with an "s" because it can handle a training and a validation set with a splits argument. You just need to pass the indices of which elements are in the training set, and which are in the validation set:

cut = int(len(files)*0.8)
splits = [list(range(cut)), list(range(cut,len(files)))]
tls = TfmdLists(files, [Tokenizer.from_folder(path), Numericalize],
                splits=splits)

Datasets in fastai is similar to TfmdLists, except that it applies two (or more) pipelines in parallel to the same raw items and returns a tuple: typically one pipeline for the inputs and one for the targets.

x_tfms = [Tokenizer.from_folder(path), Numericalize] 
y_tfms = [parent_label, Categorize()]
dsets = Datasets(files, [x_tfms, y_tfms])
x,y = dsets[0]
x[:20],y

The last step is to convert our Datasets object to a DataLoaders, which can be done with the dataloaders method. Here we need to pass along a special argument to take care of the padding problem (as we saw in the last chapter). This needs to happen just before we batch the elements, so we pass it to before_batch:

dls = dsets.dataloaders(bs=64, before_batch=pad_input)

In summary:

tfms = [[Tokenizer.from_folder(path), Numericalize], [parent_label, Categorize]]
files = get_text_files(path, folders = ['train', 'test'])
splits = GrandparentSplitter(valid_name='test')(files)
dsets = Datasets(files, tfms, splits=splits)
dls = dsets.dataloaders(dl_type=SortedDL, before_batch=pad_input)

This is equivalent to the DataBlock version:

path = untar_data(URLs.IMDB)
dls = DataBlock(
    blocks=(TextBlock.from_folder(path), CategoryBlock),
    get_y=parent_label,
    get_items=partial(get_text_files, folders=['train', 'test']),
    splitter=GrandparentSplitter(valid_name='test')
).dataloaders(path)

2. Transform for computer vision

We use the following example to illustrate Transforms in computer vision:

# Here we create a SiameseImage object that subclasses fastuple and is
# intended to contain three things: two images, and a Boolean that's True
# if the images are of the same breed.
class SiameseImage(fastuple):
    def show(self, ctx=None, **kwargs):
        img1,img2,same_breed = self
        if not isinstance(img1, Tensor):
            if img2.size != img1.size:
                img2 = img2.resize(img1.size)
            t1,t2 = tensor(img1),tensor(img2)
            t1,t2 = t1.permute(2,0,1),t2.permute(2,0,1)
        else: t1,t2 = img1,img2
        # Show the two images side by side, separated by a thin black line.
        line = t1.new_zeros(t1.shape[0], t1.shape[1], 10)
        return show_image(torch.cat([t1,line,t2], dim=2),
                          title=same_breed, ctx=ctx)

# The label (breed) is the part of the filename before the trailing number.
def label_func(fname):
    return re.match(r'^(.*)_\d+.jpg$', fname.name).groups()[0]

# For each file, draw a second image of the same breed half the time and of a
# different breed the other half. Validation pairs are drawn once in __init__
# so they stay fixed across epochs.
class SiameseTransform(Transform):
    def __init__(self, files, label_func, splits):
        self.labels = files.map(label_func).unique()
        self.lbl2files = {l: L(f for f in files if label_func(f) == l)
                          for l in self.labels}
        self.label_func = label_func
        self.valid = {f: self._draw(f) for f in files[splits[1]]}

    def encodes(self, f):
        f2,t = self.valid.get(f, self._draw(f))
        img1,img2 = PILImage.create(f),PILImage.create(f2)
        return SiameseImage(img1, img2, t)

    def _draw(self, f):
        same = random.random() < 0.5
        cls = self.label_func(f)
        if not same:
            cls = random.choice(L(l for l in self.labels if l != cls))
        return random.choice(self.lbl2files[cls]),same

splits = RandomSplitter()(files)
tfm = SiameseTransform(files, label_func, splits)
tfm(files[0]).show();
tls = TfmdLists(files, tfm, splits=splits)
show_at(tls.valid, 0);
dls = tls.dataloaders(after_item=[Resize(224), ToTensor],
    after_batch=[IntToFloatTensor, Normalize.from_stats(*imagenet_stats)])
  • after_item Applied on each item after grabbing it inside the dataset. This is the equivalent of item_tfms in DataBlock.
  • before_batch Applied on the list of items before they are collated. This is the ideal place to pad items to the same size.
  • after_batch Applied on the batch as a whole after its construction. This is the equivalent of batch_tfms in DataBlock (a rough mapping is sketched after this list).
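
For comparison, here is a minimal sketch of how those two stages show up in the high-level DataBlock API for a plain single-image classification task (not the Siamese pairs above); reusing path/"images" and label_func for the pet images is an assumption carried over from the book's setup:

dblock = DataBlock(
    blocks=(ImageBlock, CategoryBlock),
    get_items=get_image_files,
    get_y=label_func,
    item_tfms=Resize(224),                              # plays the role of after_item
    batch_tfms=Normalize.from_stats(*imagenet_stats))   # plays the role of after_batch
dls = dblock.dataloaders(path/"images")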

3. Transform questions

(1) Why does a Transform have a decode method? What does it do?

A Transform converts data from one form to another so that the final output can be used for training. But it is just as important to be able to turn that machine-readable representation back into a human-readable form, for example when displaying a batch or interpreting predictions. The decode method does this, reversing encodes where possible.
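
As a minimal sketch (LabelToInt and its two-word vocabulary are made up for illustration, not from the book), decode simply undoes what encodes did:

from fastcore.transform import Transform

class LabelToInt(Transform):
    vocab = ['neg', 'pos']                        # hypothetical vocabulary
    def encodes(self, x:str): return self.vocab.index(x)
    def decodes(self, x:int): return self.vocab[x]

tfm = LabelToInt()
y = tfm('pos')       # 1, the machine-readable form used for training
tfm.decode(y)        # 'pos', human-readable again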

(2) Why does a Transform have a setup method? What does it do?

The setup method initializes the inner state of a Transform from the data it will be applied to, before any encoding happens: for example, Numericalize builds its vocabulary and Categorize collects the classes during setup. The method is optional.

If we are creating our own custom transforms, the actual transformation should go in encodes; setups should only compute the state that the transformation depends on (the mean in the NormalizeMean example above).
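
As a minimal sketch with fastai's Categorize (the toy label list is made up): setup builds the vocabulary as inner state, and only after that can the transform encode a label:

from fastai.data.transforms import Categorize

cat = Categorize()
cat.setup(['dog', 'cat', 'dog'])   # inner state: a sorted vocab ['cat', 'dog']
cat('dog')                         # TensorCategory(1)
cat.decode(cat('dog'))             # 'dog'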

(3) How does a Transform work when called on a tuple?

When a Transform is called on a tuple, such as an (input, target) pair, it is applied to each element of the tuple individually rather than to the tuple as a whole, and only to the elements whose type matches the encodes type annotation (see the sketch below).
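
A minimal sketch (AddOne is a made-up Transform): with an :int annotation, only the integer element of the tuple is changed, while the string target passes through untouched:

from fastcore.transform import Transform

class AddOne(Transform):
    def encodes(self, x:int): return x + 1

tfm = AddOne()
tfm((2, 'label'))    # (3, 'label')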
