Building an Arabic Named Entities Tagger


Extracting information from the free text available on the internet is an essential task for many businesses. Its applications include, but are not limited to, social media analytics, advertising, sentiment analysis, and question answering. Identifying certain tags in text, like people’s names or organizations, gives you the power to mine knowledge from the heaps of data available everywhere.

Think of it like this: a tool that monitors Twitter feeds can extract the global sentiment towards a product that your company has recently released. This information is expressive and easy to find, and it can remove the need for tedious surveys and annoying phone calls. So let’s see what Named Entities are and how we can detect them.

What are Named Entities?

Formally, a named entity is a real-world object that can be denoted with a proper name. Examples of named entities are Google, California, Michael Jackson, and UNESCO.

How can we detect Named Entities?

Detecting named entities in free, unstructured text is not a trivial task. There have been many attempts to build rule-based models that detect Named Entities, but these models do not scale well and only work on a limited vocabulary. The modern solution to this problem is to use Neural Networks that classify sequences of text, turning a sentence into a sequence of tags. The set of tags depends on the task, but the most commonly used ones are: O, PERSON, LOCATION, ORGANIZATION, MISCELLANEOUS.

O: Short for outside. This is the class of words that are not part of any entity. These words make up most of the text and include words like: To, I, He, Ate, Spoke, Building, ….

PERSON(PERS): The class for a person’s name. Example: Barack Obama.

LOCATION(LOC): The class for a location. This can be a city/country/town/continent.

ORGANIZATION(ORG): The class for an organization like Google or the UN.

MISCELLANEOUS(MISC): The class for entities that are in none of the prior classes. Example: USD, March, ….

The most famous scheme for representing these classes is the IOB representation. In this scheme each class is prefixed with either B- or I-, which marks the token as the “Beginning” or the “Inside” of a Named Entity. For example, the input sentence “I met Barack Obama at California.” would be tagged as “O O B-PERS I-PERS O B-LOC”.
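To make the scheme concrete, here is a minimal sketch that reads IOB tags back into whole entities. The function name iob_to_entities is my own for illustration, and treating a stray I- tag as the start of a new entity is an assumption of this sketch, not part of the scheme itself.

```python
def iob_to_entities(tokens, tags):
    """Group IOB-tagged tokens into (entity_type, text) pairs."""
    entities = []
    current_type, current_tokens = None, []
    for token, tag in zip(tokens, tags):
        prefix, _, entity_type = tag.partition("-")
        # An I- tag continues the open entity only if the type matches.
        if prefix == "I" and entity_type == current_type:
            current_tokens.append(token)
            continue
        # Any other tag closes the entity that was open, if there was one.
        if current_type is not None:
            entities.append((current_type, " ".join(current_tokens)))
        if prefix in ("B", "I"):
            # Assumption: a stray I- tag is treated like a beginning.
            current_type, current_tokens = entity_type, [token]
        else:
            current_type, current_tokens = None, []
    if current_type is not None:
        entities.append((current_type, " ".join(current_tokens)))
    return entities


tokens = ["I", "met", "Barack", "Obama", "at", "California"]
tags = ["O", "O", "B-PERS", "I-PERS", "O", "B-LOC"]
print(iob_to_entities(tokens, tags))
# [('PERS', 'Barack Obama'), ('LOC', 'California')]
```

The B- prefix is what lets two entities of the same type sit next to each other without being merged into one.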

How can we build a Named Entities Tagger?

A typical Named Entities tagger takes a sentence as input and produces a class label for each word as output. This sequential nature of the task narrows our choice to sequence models such as LSTM, MEMM, CRF, and HMM.

Most of these models have been tested in the literature, and at the time of writing the best-performing architecture is a recurrent neural network with a CRF classification layer on top and word embeddings as input. This is because recurrent neural networks are very good at learning representative features from sequential inputs. Even a strong feature learner still needs a good input representation, which is the job of word embeddings (Word2Vec, Glove, …). We could use a simple softmax layer as the output layer, but a CRF has shown superior results because it can learn dependencies between output classes. This makes it well suited to our case, since the classes depend on each other (for example, an I-PERS tag can only follow a B-PERS or another I-PERS tag).
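The dependency between output classes can be written down as a simple rule. The sketch below is a hand-written check, not how a CRF works internally (a CRF learns transition scores from data); the helper names are made up for illustration. It shows the kind of invalid sequence that a per-token softmax is free to produce.

```python
def is_valid_transition(prev_tag, tag):
    """Return True if `tag` may follow `prev_tag` under strict IOB rules."""
    if not tag.startswith("I-"):
        return True  # O and B- tags may follow anything
    entity_type = tag[2:]
    # I-X must continue an entity of the same type X.
    return prev_tag in ("B-" + entity_type, "I-" + entity_type)


def count_violations(tags):
    """Count invalid transitions; the start of the sequence counts as O."""
    previous = ["O"] + list(tags[:-1])
    return sum(not is_valid_transition(p, t) for p, t in zip(previous, tags))


print(count_violations(["O", "B-PERS", "I-PERS", "O"]))   # 0
print(count_violations(["O", "I-PERS", "B-LOC", "I-ORG"]))  # 2
```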

What infrastructure do we need?

To build our Named Entities tagger we’re going to use Python 3.5. For training a Neural Network, an Nvidia GPU will speed things up (using a CPU is viable, but slower). I’m going to use my laptop, which runs Ubuntu 16.04 on top of an Intel i7-7500 and an Nvidia Geforce 940MX.

(You’ll notice that so far we haven’t really mentioned Arabic. This is because the architecture is the same for any language; only the data changes.)

Fetch a Data Set

Speaking of the devil, we’ll first need an annotated data set for Named Entities Recognition. For Arabic I’m going to use the ANER corpus.

So grab the data set and let’s explore what lies within!

Explore the data

The first step in building any model is to explore your data, so let’s load our data and start exploring it. I will be using Python 3.5 throughout the experiment.

import pandas as pd
import numpy as np

data_file = "ANERCorp"

# Read the dataset: one token per line, "word tag" separated by a space.
# The corpus is Arabic text, so the encoding is given explicitly
# (UTF-8 is assumed here).
with open(data_file, encoding="utf-8") as d:
    data = d.read().split("\n")

# Split each line into the word and a list holding whatever follows it
data = [(line.split(" ")[0], line.split(" ")[1:]) for line in data]
data = pd.DataFrame(data)

# Rename the columns
data.columns = ["Word", "NER"]

# Drop rows that have a word but no class at all
data.drop(data["NER"][data["NER"].map(len) == 0].index, inplace=True)

# Transform class variables from array to string
data["NER"] = data["NER"].apply(lambda x: x[0])
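To see what this parsing does to individual lines, here is a minimal sketch of the same logic on a tiny in-memory sample. The sample lines are made up by me, not taken from the corpus, and parse_lines is a hypothetical helper name.

```python
def parse_lines(text):
    """Split 'word tag' lines into (word, tag) pairs, skipping lines with no tag."""
    pairs = []
    for line in text.split("\n"):
        parts = line.split(" ")
        if len(parts) < 2:
            continue  # blank line, or a word with nothing after it
        pairs.append((parts[0], parts[1]))
    return pairs


sample = "باراك B-PERS\nأوباما I-PERS\n\nزار O\nمصر"
print(parse_lines(sample))

# A trailing space after the word produces an empty-string tag
print(parse_lines("كلمة "))
```

The second call is worth noticing: a line that ends in a space yields an empty string as its class, which is one way odd classes can end up in the data frame.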

Now we have our data frame loaded, but what’s inside?

[Figure: a sample from the data frame, showing a few words and their classes]

Clean the classes

What classes do we have in the data set?

[Figure: unique classes in the data set]

We can see quite a few mis-parsed classes!

I’ve built a dictionary that maps most of the mis-parsed classes to their correct form, like “B-ORF” to “B-ORG”. Here it is:

classes_map = {
    # Incorrect classes
    "B-ERS": "B-PERS",
    "B-MSIC": "B-MISC",
    "B-OEG": "B-ORG",
    "B-ORF": "B-ORG",
    "B-PERs": "B-PERS",
    "B-PRG": "B-ORG",
    "I-PRG": "I-ORG",
    "IPERS": "I-PERS",
    "": "O",
    "o": "O",
    "B-MIS0": "B-MISC",
    "B-MIS1": "B-MISC",
    "B-MIS2": "B-MISC",
    "B-MIS3": "B-MISC",
    "I-MIS0": "I-MISC",
    "I-MIS1": "I-MISC",
    "I-MIS2": "I-MISC",
    "I-MIS3": "I-MISC",
    "B-SPANISH": "O",
    "I-SPANISH": "O",
    "B-MIS": "B-MISC",
    "B-MIS-2": "B-MISC",
    "B-MIS-1": "B-MISC",
    "B-MIS1'": "B-MISC",
    "I-MIS": "I-MISC",
    "OO": "O",
    "I--ORG": "I-ORG",
    "B-MISS1": "B-MISC",
    "IO": "O",
    "B-ENGLISH": "O",
    "B-PER": "B-PERS",
    "I-PER": "I-PERS",
    
    # Correct classes
    "B-LOC": "B-LOC",
    "O": "O",
    "B-ORG": "B-ORG",
    "I-ORG": "I-ORG",
    "B-PERS": "B-PERS",
    "I-PERS": "I-PERS",
    "I-LOC": "I-LOC",
    "I-MISC": "I-MISC",
    "B-MISC": "B-MISC",
}

Let’s apply it to the data frame and then drop the cases that we couldn’t match with this dictionary.

data["NER"] = data["NER"].map(classes_map)

# Drop rows whose class could not be mapped (NaN after the mapping), and
# rows where a class label was mis-parsed into the Word column
bad_rows = data["NER"].isnull() | data["Word"].isin(list(classes_map))
data = data[~bad_rows].copy()
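The two drop conditions are easier to follow on a toy example. This is a pure-Python sketch of the same idea, with a made-up miniature mapping and made-up rows; None plays the role that NaN plays in the data frame after mapping.

```python
toy_map = {"B-ORF": "B-ORG", "IPERS": "I-PERS", "o": "O", "B-ORG": "B-ORG", "O": "O"}

# (word, raw class) pairs; the last row has a class label sitting in the word slot
rows = [("جوجل", "B-ORF"), ("قال", "o"), ("كلمة", "B-???"), ("O", "O")]

cleaned = []
for word, tag in rows:
    mapped = toy_map.get(tag)  # None when the class is not in the mapping
    if mapped is None or word in toy_map:
        continue  # unmapped class, or a class label parsed as a word
    cleaned.append((word, mapped))

print(cleaned)
```

Only the first two rows survive: the third has a class we cannot map, and in the fourth the "word" is itself a class label.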

Our classes should now be cleaner.

[Figure: classes after pre-processing, now mapped correctly]

Back to exploring the data

The classes are now clean, so let’s plot the class counts from the data.

[Figure: classes histogram]

The data is indeed imbalanced because named entities are rare in text.

Plotting the data without the O class, we can see the distribution of the entities.

[Figure: classes histogram without O]

From the histogram, the entity classes have similar counts with no major differences. We can also see that each class has more beginning tags than inside tags, which serves as a sanity check: every entity has exactly one beginning tag, and single-word entities have no inside tags at all.
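The numbers behind both histograms are plain class counts. Here is a minimal sketch using collections.Counter on a made-up list of tags; on the real data frame the equivalent would be data["NER"].value_counts().

```python
from collections import Counter

# Made-up tags for illustration only
tags = ["O", "O", "B-PERS", "I-PERS", "O", "B-LOC", "O", "B-ORG", "I-ORG", "I-ORG", "O"]

counts = Counter(tags)

# Same counts without the dominant O class
entity_counts = {tag: n for tag, n in counts.items() if tag != "O"}

# Share of tokens that are not part of any entity
share_of_o = counts["O"] / len(tags)

print(counts.most_common())
print(entity_counts)
print(round(share_of_o, 2))
```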

Preparing the input for the model

Now that we have our data, let’s organize it to be used in training a model.

At the moment we have a data frame with 2 columns, a word and its tag. We can’t feed raw words to a neural network, so we need a numerical representation for each word. We will be using word embeddings, in the style of Word2Vec, to get a numerical representation for each word in our data.

If you’re interested in knowing more about the applications of Word2Vec take a look at this post: Word2Vec for product recommendations

So go ahead and grab the Arabic Fasttext model from https://github.com/facebookresearch/fastText/blob/master/pretrained-vectors.md

The upside to using Fasttext is its ability to infer vectors for words that are out of vocabulary (words it has never seen before), which it does by composing the vectors of the character n-grams inside the word. Facebook has also released pre-trained models for a large number of languages.
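Fasttext can handle unseen words because it works with pieces of words. The sketch below only illustrates how a word is cut into character n-grams; it is not the library's code, char_ngrams is a name I made up, and the 3 to 6 range is what I assume the default n-gram lengths to be.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Return the character n-grams of a word wrapped in boundary markers."""
    wrapped = "<" + word + ">"
    grams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(wrapped) - n + 1):
            grams.append(wrapped[start:start + n])
    return grams


trigrams = char_ngrams("where", 3, 3)
print(trigrams)

# An unseen word still shares pieces with words that were seen in training
shared = set(trigrams) & set(char_ngrams("wherever", 3, 3))
print(sorted(shared))
```

The vector of an out-of-vocabulary word is then built from the vectors of pieces like these, which is why a sensible vector can still be produced for it.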

I’m going to use the fasttext pip package, which works as a wrapper for Fasttext. You can install it with:

pip install fasttext

Now all we have to do is load the Fasttext model and map every word to its vector representation, quite simple!

import fasttext

word_vecs_model = "wiki.ar.bin"

# Load Word Vectors Model
vecs = fasttext.load_model(word_vecs_model)
data["Word"] = data["Word"].apply(lambda word: np.array(vecs[word]))

[Figure: our data frame after Fasttext]

Preparing output for the model

Features are now ready but classes are not: neural networks expect a one hot encoding of the target classes, while we have a string representation of them. Let’s map these classes to their one hot equivalent using scikit-learn’s MultiLabelBinarizer.

from sklearn.preprocessing import MultiLabelBinarizer

# Instantiate the MLB to turn string classes into one hot
mlb = MultiLabelBinarizer()
classes = [[item] for item in set(classes_map.values())]

mlb.fit(classes)
classes = mlb.classes_

classes = pd.Series(classes)
classes.to_csv("classes_ar.csv")


data["NER"] = data["NER"].apply(lambda x: mlb.transform([[x]])[0])

[Figure: data after one hot encoding]
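If one hot encoding is new to you, this pure-Python sketch shows what ends up in the NER column and how to get back from a vector to a class name. It assumes the classes are kept in sorted order; one_hot and decode are illustrative names of mine.

```python
classes = sorted({"O", "B-PERS", "I-PERS", "B-LOC", "I-LOC",
                  "B-ORG", "I-ORG", "B-MISC", "I-MISC"})


def one_hot(label):
    """Vector with a 1 at the position of `label` and 0 everywhere else."""
    return [1 if c == label else 0 for c in classes]


def decode(vector):
    """Inverse of one_hot: the class at the position of the largest value."""
    return classes[vector.index(max(vector))]


print(classes)
print(one_hot("B-LOC"))
print(decode(one_hot("I-ORG")))
```

The decode step is the same argmax trick used later to turn the model's output back into class names.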

Turning data into sequences

A recurrent neural network works on sequential data, while our data is so far just a flat list of word vectors and their corresponding classes in one hot encoding. In order to train the network we need to transform the data into sequences of word vectors and class labels. I wrote a little helper function that does that and a little more: it builds sequences of a given length, and it can build them with a sliding window, which increases the size of our training data.

def window_stack(X, Y, stride=1, time_steps=3, output_mode=0):
    """Stack consecutive rows into windows of length `time_steps`.

    Returns arrays of shape (n_windows, time_steps, ...). Pass an empty Y
    to window X only.
    """
    # output_mode was meant to choose between a single Y per window (0)
    # and a sequence of Y per window (1). This implementation always
    # returns the full sequence, so the argument is accepted but ignored.

    def windows(A):
        # Row i of the index matrix holds the positions of window i:
        # [i*stride, i*stride + 1, ..., i*stride + time_steps - 1]
        starts = np.arange(0, A.shape[0] - time_steps + 1, stride)
        return A[starts[:, None] + np.arange(time_steps)]

    if len(Y) == 0:
        return windows(X)
    return (windows(X), windows(Y))
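The effect of the stride is easiest to see on a short list. This is a pure-Python sketch of the same windowing idea, using list slices instead of NumPy indexing; the helper name windows is mine.

```python
def windows(items, time_steps, stride):
    """Cut `items` into windows of `time_steps` elements, `stride` apart."""
    return [items[i:i + time_steps]
            for i in range(0, len(items) - time_steps + 1, stride)]


tokens = list(range(7))

# stride == time_steps: windows do not overlap, leftover tokens are dropped
print(windows(tokens, 3, 3))  # [[0, 1, 2], [3, 4, 5]]

# stride == 1: sliding window, every token starts a window while room remains
print(windows(tokens, 3, 1))
```

With a stride of 1 the same 7 tokens give 5 windows instead of 2, which is where the extra training data comes from.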

Let’s apply this function to our data set and split the data into Train, Validation, Test sets.

from sklearn.model_selection import train_test_split
# Assume every 10 consecutive tokens as a sentence
# But we will use a window based approach

SEQ_LENGTH = 10

X = data["Word"]
Y = data["NER"]

X = np.array([np.array(item) for item in X.values])
Y = np.array([np.array(item) for item in Y.values])

# Turn the data into sequences
X, Y = window_stack(X, Y, time_steps=SEQ_LENGTH, output_mode=1, stride=SEQ_LENGTH)

# Shuffle and split the dataset
X_Train, X_Test, Y_Train, Y_Test = train_test_split(X, Y, test_size=0.4)
X_Train, X_Val, Y_Train, Y_Val = train_test_split(X_Train, Y_Train, test_size=0.2)

# Turn the training data back to tokens
X_Train, Y_Train = X_Train.reshape((-1, 300)), Y_Train.reshape((-1, len(classes)))

# Turn again into sequences using a sliding window
X_Train, Y_Train = window_stack(X_Train, Y_Train, time_steps=SEQ_LENGTH, output_mode=1, stride=1)

# X_Test, Y_Test, X_Val and Y_Val keep the shape produced by the first
# window_stack call, (n_sequences, SEQ_LENGTH, n_features), so they need
# no further alignment or reshaping. Trimming them here would only throw
# away whole sequences.

We now have our 3 data sets ready as sequences that can be fed into the RNN.

Get rid of single label samples

One last problem to fix: because of the class imbalance, many sequences contain no entities at all. These samples contribute very little to training, since they only show the model the O class, so let’s get rid of them.

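def drop_single_label(X, Y):
    """Drop every sequence whose tokens all share the same label."""
    drop = []
    for idx, y in enumerate(Y):
        # y holds one one-hot row per token; the set counts distinct labels
        if len({tuple(row) for row in y}) == 1:
            drop.append(idx)
    return (np.delete(X, drop, axis=0), np.delete(Y, drop, axis=0))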

X_Train, Y_Train = drop_single_label(X_Train, Y_Train)
X_Test, Y_Test = drop_single_label(X_Test, Y_Test)
X_Val, Y_Val = drop_single_label(X_Val, Y_Val)
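Here is the same filter on string tags instead of one hot rows, as a small sketch with made-up sequences. Note that the test is "all tokens share one label", so a sequence made entirely of one entity tag would be dropped as well, not only the all-O ones.

```python
def has_single_label(tags):
    """True when every token in the sequence carries the same tag."""
    return len(set(tags)) == 1


sequences = [
    ["O", "O", "O"],
    ["O", "B-LOC", "O"],
    ["I-ORG", "I-ORG", "I-ORG"],
    ["B-PERS", "I-PERS", "O"],
]

kept = [seq for seq in sequences if not has_single_label(seq)]
print(kept)
```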

Prepare the test data for scoring

We’ll be using the metrics module from sklearn-crfsuite, which takes the class labels as strings, so I’m going to clone the test array in order to have a string version of it for evaluation.

# Turn the Y_Test array back into class numbers
Y_Test_OHE = Y_Test.copy()
Y_Test = np.argmax(Y_Test, axis=2)

# Build Y_Test_Named, a copy of Y_Test with class names (strings)
# in place of class numbers
classes_dict = pd.DataFrame(classes).to_dict()[0]

Y_Test_Named = np.copy(Y_Test).astype(str)
for k, v in classes_dict.items():
    Y_Test_Named[Y_Test == k] = v
    
sorted_labels = sorted(
    list(classes_dict.values()),
    key=lambda name: (name[1:], name[0])
)
sorted_labels.remove('O')
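The sort key above can look cryptic at first. This small sketch, on a shortened label list, shows what it does: labels are ordered by entity type first and by the B/I prefix second, so each B- tag sits right before its I- tag in the report.

```python
labels = ["O", "B-PERS", "I-LOC", "B-ORG", "I-PERS", "B-LOC", "I-ORG"]

# name[1:] is the "-TYPE" part, name[0] is the B or I prefix
ordered = sorted(labels, key=lambda name: (name[1:], name[0]))

# O is left out so that the report focuses on the entity classes
ordered.remove("O")
print(ordered)
# ['B-LOC', 'I-LOC', 'B-ORG', 'I-ORG', 'B-PERS', 'I-PERS']
```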

Create the model

To create the model we’re going to use Keras and the CRF layer from Keras-contrib. The model is a Bi-LSTM encoder followed by a CRF classifier. The number of nodes and the dropout rates can be tweaked further, but the values below already perform quite well.

from keras.models import Sequential
from keras.layers import Dense, Dropout, Embedding, LSTM, Bidirectional, Input, SimpleRNN, GRU
from keras.layers.normalization import BatchNormalization
from keras.preprocessing.sequence import pad_sequences
from keras.layers.wrappers import TimeDistributed
from keras.callbacks import EarlyStopping, ReduceLROnPlateau
from keras_contrib.layers import CRF


crf = CRF(len(classes), input_shape=(SEQ_LENGTH, 300))
batch_size = 32
model = Sequential()
model.add(Bidirectional(LSTM(256, return_sequences=True, activation='relu', dropout=0.5, recurrent_dropout=0.3),input_shape=(SEQ_LENGTH, 300)))
model.add(BatchNormalization())
model.add(crf)

early_stopping = EarlyStopping(monitor='val_loss', patience=3)
reduce_lr = ReduceLROnPlateau(patience=2)

model.compile('adam', loss=crf.loss_function, metrics=[crf.accuracy])
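To get a feel for the size of this model, the number of trainable weights in the Bi-LSTM layer can be worked out by hand. This is a minimal sketch assuming the standard LSTM layout (four gates, each with input weights, recurrent weights and a bias); lstm_params is a helper name I made up, and the weights of the BatchNormalization and CRF layers are not counted.

```python
def lstm_params(input_dim, units):
    """Trainable weights of one LSTM layer with the standard four gates."""
    per_gate = (input_dim + units) * units + units  # weights plus bias
    return 4 * per_gate


# 300-dimensional word vectors in, 256 units, one LSTM per direction
bilstm_params = 2 * lstm_params(300, 256)
print(bilstm_params)  # 1140736
```

Roughly 1.1 million weights is modest by today's standards, which fits with training on a laptop GPU.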

Train the model

We’re going to use EarlyStopping to stop training once the model’s performance on the validation set starts degrading. We’ll also use ReduceLROnPlateau to decrease the learning rate if its performance stops improving.

model.fit(X_Train, Y_Train,
          batch_size=batch_size,
          epochs=100,
          validation_data=(X_Val, Y_Val),
          callbacks=[early_stopping, reduce_lr],
          verbose=1
         )

with open("model.json", "w") as output:
    output.write(model.to_json())

model.save_weights("weights.hdf", overwrite=True)
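If the patience setting is unclear, this simplified sketch simulates the early stopping rule on a made-up list of validation losses. It is not the Keras implementation, which has more options; it only shows the idea of waiting a fixed number of epochs for an improvement.

```python
def early_stopping_epoch(val_losses, patience):
    """Return the index of the epoch where training would stop, or None."""
    best = float("inf")
    waited = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, waited = loss, 0  # improvement: reset the counter
        else:
            waited += 1
            if waited >= patience:
                return epoch
    return None


# Made-up losses: the best value is at epoch 2, then three epochs without improvement
losses = [1.0, 0.8, 0.7, 0.75, 0.72, 0.71, 0.6]
print(early_stopping_epoch(losses, patience=3))  # 5
```

Notice that training stops before the final, better value of 0.6 is ever reached, which is the usual trade-off when choosing a small patience.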

Test the model

I trained the model for a few hours, until it either stopped automatically or completed all the epochs. The scores are measured using the sklearn_crfsuite metrics module, which can measure F1-score for sequences.

We use the cloned version of the test targets, which holds class names instead of their one hot encoding, and measure the classification results against it.

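from sklearn_crfsuite import metrics

# The model outputs one vector per token; argmax picks the class number
Y_Pred = np.argmax(model.predict(X_Test, verbose=0), axis=-1)

# Turn class numbers into class names
Y_Pred2 = np.copy(Y_Pred).astype(str)
for k, v in classes_dict.items():
    Y_Pred2[Y_Pred == k] = v

print(metrics.flat_classification_report(Y_Test_Named, Y_Pred2, labels=sorted_labels))

[Figure: results of the model]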

Not bad after all!
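For readers who want to know what the report measures, here is a minimal sketch of per-class precision, recall and F1 computed on flattened token labels. The labels are made up for illustration and prf is a hypothetical helper, not part of sklearn_crfsuite.

```python
def prf(y_true, y_pred, label):
    """Precision, recall and F1 of one class over token-level labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


y_true = ["O", "B-PERS", "I-PERS", "O", "B-LOC", "O"]
y_pred = ["O", "B-PERS", "O", "O", "B-LOC", "B-LOC"]

for label in ["B-PERS", "I-PERS", "B-LOC"]:
    print(label, prf(y_true, y_pred, label))
```

Because O is excluded from the labels passed to the report, a model cannot score well simply by predicting O everywhere.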

We can benchmark the model using KFold cross validation if we’re unsure of its performance or when we’re doing parameter tuning. However, it could take a few days of running to find a good set of parameters.
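As a reminder of how KFold splits the data, here is a small pure-Python sketch that produces the train and validation indices for each fold. kfold_indices is an illustrative name of mine, and the folds are contiguous with no shuffling, which is a simplification.

```python
def kfold_indices(n_samples, k):
    """Yield (train_idx, val_idx) index lists for k contiguous folds."""
    # The first n_samples % k folds get one extra sample
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, val_idx
        start += size


folds = list(kfold_indices(10, 3))
for train_idx, val_idx in folds:
    print(len(train_idx), val_idx)
```

Each fold means training the model from scratch once, which is why cross validation multiplies the training time by k.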

Conclusion

We built an end-to-end Named Entities tagger for Arabic using a Bi-LSTM and a CRF, with word vectors from Fasttext as input. The same approach can be applied to any other language, given a suitable annotated data set.