If you google the term “data augmentation”, most of the relevant results talk about data augmentation in computer vision, where you deal with images. But that’s not always the case in data science, as you’ll frequently be working with text. In this post I will lay out a technique for data augmentation in the context of text.
But first, what’s data augmentation?
Data augmentation is a technique used when you have either little data or an imbalanced data set. The idea is to slightly modify your data in a relevant way in order to synthetically produce more data that resembles the original.
In the image domain that could mean applying a set of transformations to an image: for example, rotating by varying degrees, flipping, stretching, translating, zooming, cropping and so on. I’m going to use the following script on an image of a cat to demonstrate various geometric transformations.
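import math

import matplotlib.pyplot as plt
from skimage import io
from skimage import transform as tf

# Load the image as a NumPy array (skimage.io.imread replaces the removed data.imread)
img = io.imread("cat.jpg")

# Identity transform: shows the unmodified image
tform0 = tf.SimilarityTransform()
# Rotate by 45 degrees and shift. img.shape is a tuple, so it has to be indexed
# before dividing; using img.shape[0] here is an assumption about the intent.
tform1 = tf.SimilarityTransform(scale=1, rotation=math.pi/4, translation=(img.shape[0]/2, -100))
tform2 = tf.SimilarityTransform(scale=1, rotation=math.pi/4, translation=(img.shape[0]/2, -100))
# Rotate by -30 degrees, with and without scaling
tform3 = tf.SimilarityTransform(scale=2, rotation=-math.pi/6)
tform4 = tf.SimilarityTransform(scale=1, rotation=-math.pi/6)

tforms = [tform0, tform1, tform2, tform3, tform4]
for tform in tforms:
    # Warp the image with each transform and display the result
    plt.imshow(tf.warp(img, tform))
    plt.axis('off')
    plt.show()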
These kinds of transformations help teach neural networks some invariance to the pose, location, and angle of the object.
What about text?
Now let’s switch domains to text and take sequence tagging as our case study. In general, sequence tagging is concerned with building models that take a sequence of tokens as input and produce a sequence of classes as output.
Examples of sequence tagging are Named Entity Recognition (NER) and Part-of-Speech (POS) tagging.
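To make the input and output concrete, here is a minimal sketch of what one tagged NER example might look like. The tag names ("PER" for person names, "O" for everything else) and the list-of-pairs representation are assumptions for illustration, not something fixed by this post.

```python
# A hypothetical tagged sentence: exactly one class per token.
# The tag set ("PER" / "O") is an assumption made for this illustration.
tokens = ["Yesterday", ",", "I", "met", "John", "Doe", "while",
          "crossing", "the", "street", "."]
tags = ["O", "O", "O", "O", "PER", "PER", "O", "O", "O", "O", "O"]

# Sequence tagging needs the two sequences to line up one to one.
assert len(tokens) == len(tags)

# Pair each token with its class, then pull out the tokens tagged as entities.
tagged_sentence = list(zip(tokens, tags))
entities = [token for token, tag in tagged_sentence if tag != "O"]
print(entities)  # ['John', 'Doe']
```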
The way this works is by training a neural network to read text left to right, producing an output class for each token as it reads it. This is unlike seq2seq, which reads the whole text first and then starts generating outputs.
This model works by using a character language model that is pre-trained on huge corpora; the language model is used to create contextual word embeddings. The word embeddings are then fed to a BiLSTM+CRF classifier that classifies the input tokens one by one.
Pre-trained models make it possible to train with way less data
Text doesn’t have many augmentation techniques that are as well known as image transformations, which leads us to search for methods that generate samples that resemble our training data yet carry new information.
A common problem
Debugging NER models showed me that sometimes the model is confused by a token to the extent that it fails to mark it as an entity. For example:
Yesterday, I met John Doe while crossing the street.
In the previous example the hypothetical model would correctly tag John Doe as an entity. Now let’s try some variations.
Yesterday, I met Omar Essam while crossing the street.
Yesterday, I met Li Wang while crossing the street.
In these examples the hypothetical model would fail to tag the names because it has never seen them in a similar context before. A good model is good at capturing context, but when your data set is small it won’t always be able to.
The augmentation technique I’m proposing here is a simple one that creates more examples out of our training data, the same way we would by flipping and rotating images.
For this to work you need an annotation per word, i.e., each word has its own class. That’s not the case with text classification, but it works with sequence tagging tasks like NER and POS tagging.
- Flatten all your text data into one large vector of (word, class) pairs
- Bucket the words according to class, for example: Name => [“John”, “Kevin”, “Adam”]
- For each example in your training data, and for each class C, create N new examples in which you substitute each word W of class C with a word from the bucket of class C. Sampling from the bucket can be random, by inverse frequency (rare words first), or by frequency (common words first)
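The three steps above can be sketched in Python as follows. This is a minimal illustration under my own assumptions, not a reference implementation: the function names `build_buckets` and `augment`, the (token, class) pair representation, and the toy data are all hypothetical. For simplicity it uses plain random sampling and replaces every word of the target classes in one pass.

```python
import random
from collections import defaultdict


def build_buckets(dataset):
    """Flatten the data set and group tokens by class, keeping duplicates."""
    buckets = defaultdict(list)
    for sentence in dataset:
        for token, tag in sentence:
            buckets[tag].append(token)
    return buckets


def augment(sentence, buckets, target_classes, n, rng):
    """Create n copies of a sentence in which every token of a target class
    is replaced by a token drawn at random from that class's bucket."""
    new_sentences = []
    for _ in range(n):
        new_sentence = [
            (rng.choice(buckets[tag]), tag) if tag in target_classes else (token, tag)
            for token, tag in sentence
        ]
        new_sentences.append(new_sentence)
    return new_sentences


# Hypothetical toy data: each sentence is a list of (token, class) pairs.
dataset = [
    [("I", "O"), ("met", "O"), ("John", "Name")],
    [("Kevin", "Name"), ("called", "O"), ("Adam", "Name")],
]

buckets = build_buckets(dataset)
rng = random.Random(0)  # fixed seed so repeated runs give the same samples
augmented = augment(dataset[0], buckets, target_classes={"Name"}, n=3, rng=rng)
for sentence in augmented:
    print(" ".join(token for token, _ in sentence))
```

Two things to keep in mind with a sketch like this: a draw can return the original word, so you may want to drop duplicates afterwards, and multi-token entities are substituted one token at a time, which may or may not suit your annotation scheme.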
So, for example, suppose the bucket for the class Name is [“John”, “John”, “John”, “Kevin”, “Kevin”, “Adam”, “Peter”] and our sampling scheme is inverse frequency. The first sentence below is the original example and the second is a transformation of it:
The CEO of an AI related startup, Peter, stated that they’re in series A.
The CEO of an AI related startup, Adam, stated that they’re in series A.
Further transformations could include:
The CEO of an AI related startup, Kevin, stated that they’re in series A.
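One way to implement the sampling schemes is to turn word counts into sampling weights. The sketch below is one possible reading, not the only one: the function name `sampling_weights` and the scheme names are hypothetical, "random" is taken to mean uniform over unique words, and inverse frequency is taken to mean a weight of 1/count. It uses the Name list from the example above.

```python
import random
from collections import Counter


def sampling_weights(bucket, scheme):
    """Return the unique words of a bucket and one sampling weight per word."""
    counts = Counter(bucket)
    words = list(counts)  # unique words, in order of first appearance
    if scheme == "random":
        weights = [1.0 for _ in words]
    elif scheme == "frequency":
        # common words are drawn more often
        weights = [float(counts[w]) for w in words]
    elif scheme == "inverse_frequency":
        # rare words are drawn more often
        weights = [1.0 / counts[w] for w in words]
    else:
        raise ValueError(f"unknown scheme: {scheme}")
    return words, weights


names = ["John", "John", "John", "Kevin", "Kevin", "Adam", "Peter"]
words, weights = sampling_weights(names, "inverse_frequency")
for word, weight in zip(words, weights):
    print(word, round(weight, 2))

# Draw three replacement names according to the weights.
rng = random.Random(0)  # fixed seed so repeated runs give the same samples
print(rng.choices(words, weights=weights, k=3))
```

With this weighting, Adam and Peter each get weight 1.0, Kevin gets 0.5, and John gets about 0.33, so the rare names are the most likely replacements. If you read "rare words first" as a strict ordering instead, you could sort the words by count and take them in that order rather than sampling.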
The number of produced samples is up to you. But keep in mind the trade-off: with too few samples you don’t gain much new data, while with too many your model may start memorizing templates rather than learning from them.
I understand this technique is very simple, but it should help. I’m open to critique, and if you have further ideas for data augmentation in NLP, feel free to share them with the community. I’d love to hear them.