Introduction
Oh wow, what a long title. I really dislike long titles, but I had to fit in so many keywords and this is what I got. It’s also been quite a while since I last blogged, so this whole blogging thing feels a bit strange now, but here we go again with a new post!
Let’s get this straight: this is an NLP post about a trick I found useful a couple of years ago. I tried to publish it but wasn’t lucky enough, so I figured it’s fair to share it here; maybe it can be of some benefit.
The Algorithm
The idea is simple and straightforward: most NLP systems are made of two components, a feature extraction module (think encoder) and a task-dependent specialized model (think decoder), and each system is trained to minimize a loss over either both components (encoder + decoder) or just the decoder.
For Named Entity Recognition as a case study, let’s look at what typical (maybe old-school now) NER models do:
- Word Embedding
- Context-dependent classifier
At the time of writing this (the paper?), the state of the art was a BiLSTM + CRF model fed with Word2Vec or GloVe vectors (yeah, old school, who uses word2vec in 2020?). A typical word2vec model is trained on a single language; once trained, it has a representation for each word in the vocabulary in a 300-dimensional space.
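To make that encoder/decoder split concrete, here is a minimal sketch of such a tagger, assuming PyTorch. This is my own illustration, not the exact model from the paper: a frozen pretrained embedding table feeds a BiLSTM, and a linear layer produces per-token tag scores. The CRF layer that would normally sit on top of those scores is omitted to keep the sketch short.

```python
import torch
import torch.nn as nn

class BiLSTMTagger(nn.Module):
    """Decoder: only understands whatever space the word2vec 'encoder' lives in."""
    def __init__(self, pretrained_vectors, n_tags, hidden=256):
        super().__init__()
        # Frozen word2vec/GloVe matrix of shape (n_vocabulary, 300).
        self.embedding = nn.Embedding.from_pretrained(
            torch.as_tensor(pretrained_vectors, dtype=torch.float32),
            freeze=True,
        )
        self.lstm = nn.LSTM(
            input_size=pretrained_vectors.shape[1],
            hidden_size=hidden,
            bidirectional=True,
            batch_first=True,
        )
        # Per-token emission scores; a CRF layer would normally sit on top.
        self.classifier = nn.Linear(2 * hidden, n_tags)

    def forward(self, token_ids):
        vectors = self.embedding(token_ids)    # (batch, seq, 300)
        contextual, _ = self.lstm(vectors)     # (batch, seq, 2 * hidden)
        return self.classifier(contextual)     # (batch, seq, n_tags)
```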
You can think of our BiLSTM + CRF decoder as a model that only understands one language, and that language is whatever the word2vec model babbles. If you alter the word2vec model, the BiLSTM + CRF decoder will break. This framing lets us look at the problem from a different perspective: what if we could train a word2vec model for every possible language on earth? Then we’d only have to train one single NER model per task instead of one model per task per language. Cool, right?
However, it’s impossible to train a word2vec model that covers every language, because we don’t have enough data to support that. What we do already have are word2vec models for a wide range of languages. Can we do better?
Apparently, we can! One way to connect those models is to align them to a single reference space, so that all our word2vec models ideally end up in the same space our trained decoders understand. The idea here is that word2vec is, in fact, just a very large matrix of dimensions (n_vocabulary * 300), where n_vocabulary varies from model to model. But it’s a matrix after all, and matrix operations apply to it. With a neat trick from linear algebra (you can read the paper at the end of the post for a more mathematical explanation), we can align multiple models from different languages into the same space, making them behave effectively as one model.
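For the curious, here is a rough sketch of that linear-algebra trick as I would implement it with NumPy: solve the orthogonal Procrustes problem over a set of anchor translation pairs. The anchor-pair setup is my own simplification for illustration, so see the paper for the exact recipe.

```python
import numpy as np

def learn_alignment(source_anchor_vecs, target_anchor_vecs):
    """Orthogonal map from a source word2vec space to the reference space.

    Both inputs have shape (n_pairs, 300): row i of each matrix is the
    vector of the same concept in the two languages (e.g. "chat" / "cat").
    Solves the orthogonal Procrustes problem via SVD.
    """
    m = source_anchor_vecs.T @ target_anchor_vecs   # (300, 300)
    u, _, vt = np.linalg.svd(m)
    return u @ vt                                   # orthogonal W

# Align the whole source vocabulary (n_vocabulary, 300) in one shot:
# aligned_source_matrix = source_matrix @ W
```

Because W is orthogonal, it only rotates or reflects the source space, so distances and angles between word vectors within a language are preserved; the transformation just re-expresses them in the reference space.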
The perk of this is that you can train an NLP model on one language that has a lot of data and predict on languages it has never seen, without even having to retrain or fine-tune your NLP model.
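Roughly, the plug-and-play flow looks like this, building on the two sketches above. The matrices here are random stand-ins and the model names are hypothetical placeholders, not an API from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for real word2vec matrices (rows = words, columns = 300 dims).
english_matrix = rng.normal(size=(5000, 300))   # English word2vec (training language)
french_matrix = rng.normal(size=(4000, 300))    # French word2vec (never seen by the decoder)

# Anchor translation pairs: row i of each matrix is the same concept in both languages.
french_anchor_vecs = french_matrix[:1000]
english_anchor_vecs = english_matrix[:1000]

# Align French into the English space, then swap only the embedding table;
# the decoder trained on English is reused unchanged.
W = learn_alignment(french_anchor_vecs, english_anchor_vecs)
aligned_french_matrix = french_matrix @ W

# french_model = BiLSTMTagger(aligned_french_matrix, n_tags=...)
# french_model.lstm.load_state_dict(english_ner_model.lstm.state_dict())
# french_model.classifier.load_state_dict(english_ner_model.classifier.state_dict())
```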
Critique
Does this work? Yes.
Is this accurate? No.
Does this break the state of the art on every language? No.
Should I use this? Only if you can only have a single model and you need to run different languages through it occasionally.
The real problem here is that there’s no guarantee that the word “cat” in English and “chat” in French appear in similar contexts, and thus no guarantee that their vectors end up in similar locations.
Also, there’s no guarantee that a linear transformation can successfully map one word2vec space onto another. Maybe a non-linear model would do better, if it turns out that the relationship is in fact non-linear.
I have no idea how this would work with modern context-based word embeddings, since they don’t use an embedding matrix but rather an LSTM or a Transformer, which is highly non-linear. But maybe we could have a complex non-linear model do that translation as well.
Conclusion
An orthogonal transformation of the word2vec matrix can align different languages, letting you plug and play different languages into the same NLP pipeline. However, the efficacy of this approach is limited, and there is no guarantee that the alignment is correct in every scenario.
But on the bright side, this may be one of the first steps on a path toward more universal NLP models that really do work on any language, by transforming (not translating) encoders.
Feel free to read the paper, which has the in-depth explanation of this topic. I intended to keep this post short so as not to bore you, and with the hope that the paper is more useful.
Language Independent NER via Orthogonal Transformation of Word Vectors