In the past few years the field of Natural Language Processing has seen a wave of new research ideas and breakthroughs. State-of-the-art scores have been beaten on many tasks, including machine translation, sentiment analysis and dependency parsing. These breakthroughs are due to many factors, one of which is an algorithm called Word2Vec.
-If you’re a beginner and this doesn’t make much sense to you, consider taking a look at this post first–
What is Word2Vec?
Word2Vec is an unsupervised algorithm that learns vector representations of words from a huge corpus of text. It works by exploiting the context in which words appear: for every word, the algorithm learns a vector (think of its components as features) such that words appearing in similar contexts end up close together. In a nutshell, it tries to group similar words in a vector space.
A 2-dimensional PCA projection of a sample of word vectors shows the countries grouped on the left and the capitals grouped on the right. The country-capital relationship also produces similar vector offsets across these countries.
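To make "context" concrete, here is a minimal sketch of how (word, context word) training pairs can be generated with a sliding window. It is a simplification that assumes the skip-gram flavour of Word2Vec; the function name context_pairs and the toy sentence are made up for illustration, and real implementations add details such as subsampling and negative sampling.

```python
def context_pairs(sentence, window):
    """Return (center, context) pairs for words at most `window` positions apart."""
    pairs = []
    for i, center in enumerate(sentence):
        lo = max(0, i - window)
        hi = min(len(sentence), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((center, sentence[j]))
    return pairs


# Hypothetical toy sentence, not taken from any real corpus
sentence = ["the", "cat", "sat", "on", "the", "mat"]
pairs = context_pairs(sentence, window=1)
print(pairs[:4])
```

The model is then trained so that a word's vector is good at predicting the words it is paired with, which is what pulls words with similar contexts together.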
Wait, but what is Word2Vec useful for?
The possibilities of Word2Vec are many. You can transform any word in your vocabulary into a meaningful dense vector instead of the extremely sparse bag-of-words (BOW) representation, which helps mitigate the curse of dimensionality in NLP models. Word2Vec can be used for tasks like analogies: for example, you feed the model a query like (King - Man + Woman) and the model is expected to output Queen, based on the cosine similarities between its vectors. Word vectors are also the input to many modern NLP models: in Neural Machine Translation, an encoder RNN is fed the word vectors of the input words from the source language, and a decoder RNN is responsible for producing the translation in the target language.
Word2Vec can also be used to query the nearest neighbors of a given word, which usually finds words that are similar in meaning. We can exploit this in domains other than NLP, and in this post I’m going to demonstrate one such usage of the algorithm.
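The analogy query can be sketched with a few hand-made 2-dimensional vectors. These numbers are invented purely for illustration (one dimension loosely stands for "royalty", the other for "gender"); a trained model learns such directions on its own, in many more dimensions.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy vectors, NOT learnt by any model
vectors = {
    "king": [0.9, 0.9],
    "queen": [0.9, -0.9],
    "man": [0.1, 0.9],
    "woman": [0.1, -0.9],
    "apple": [-0.8, 0.0],
}


def analogy(a, b, c):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding the inputs."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(target, vectors[w]))


result = analogy("king", "man", "woman")
print(result)
```

Excluding the query words from the candidates matters in practice, because the nearest vector to the target is often one of the inputs themselves.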
The Instacart Challenge
Instacart, an online grocery store, recently open sourced a huge data set of order histories and accompanied it with a Kaggle challenge. The goal of this challenge is to predict which of the products a user has ordered before will be reordered in their next order.
Let’s first explore the data to get some intuition about what’s inside, and then we can take a look at how Word2Vec can be used to model this data and draw useful insights from it.
You can start by downloading the data set from the Kaggle competition. It is composed of a few files:
- aisles.csv => This dataframe has a list of all the aisles on Instacart and their corresponding ids
- departments.csv => This dataframe has a list of all the departments on Instacart and their corresponding ids
- order_products__prior.csv => This dataframe has the order history of every user up to their last order, think of it as our input data
- order_products__train.csv => This dataframe has the last order of every user, think of it as our target data
- orders.csv => This dataframe has all the orders on Instacart and their details, including prior and train orders
- products.csv => This dataframe has all the products and their corresponding ids
- sample_submission.csv => This is a sample submission file
The most interesting parts of this dataset are the orders dataframes (Train and Prior), which hold the full history of orders, so let’s take a look at them using Python.
First of all load Pandas and the orders data
import pandas as pd
import numpy as np

orders = pd.read_csv("../input/orders.csv")
Let’s take a look at the orders
In : orders.shape
Out: (3421083, 7)

In : orders.head()
Out:
   order_id  user_id eval_set  order_number  order_dow  order_hour_of_day  days_since_prior_order
0   2539329        1    prior             1          2                  8                     NaN
1   2398795        1    prior             2          3                  7                    15.0
2    473747        1    prior             3          3                 12                    21.0
3   2254736        1    prior             4          4                  7                    29.0
4    431534        1    prior             5          4                 15                    28.0
Now let’s take a look at the contents of these orders, which are stored in the Prior and Train dataframes.
train_orders = pd.read_csv("../input/order_products__train.csv")
prior_orders = pd.read_csv("../input/order_products__prior.csv")
products = pd.read_csv("../input/products.csv").set_index('product_id')

In : print(train_orders.shape)
(1384617, 4)

In : print(prior_orders.shape)
(32434489, 4)

In : print(train_orders.head())
   order_id  product_id  add_to_cart_order  reordered
0         1       49302                  1          1
1         1       11109                  2          1
2         1       10246                  3          0
3         1       49683                  4          0
4         1       43633                  5          1

In : prior_orders.head()
Out:
   order_id  product_id  add_to_cart_order  reordered
0         2       33120                  1          1
1         2       28985                  2          1
2         2        9327                  3          0
3         2       45918                  4          1
4         2       30035                  5          0
They indeed look identical in structure; the difference is that train_orders holds the last order of every user.
Now that we can see the products in every order, it would be nice if we could find products that are similar to each other or that are usually ordered together. From the shop’s point of view, this could drive promotions on bundles or, in the case of a physical shop, placing these products closer together to encourage customers to buy more.
In a scenario like this one Word2Vec comes in handy. Since it is mostly concerned with words that appear together in the same context, we can use it to find products that are usually bought together or products that are similar to each other. To do this we need to interpret every order as a sentence and every product in an order as a word. Interesting!
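Before doing this on the real data, here is the "order as a sentence" idea sketched in plain Python on a handful of rows. The rows are a tiny sample in the same shape as the order_products files; the variable names are my own and only serve this illustration.

```python
from collections import defaultdict

# Toy rows: (order_id, product_id, add_to_cart_order)
rows = [
    (1, 49302, 1),
    (1, 11109, 2),
    (36, 39612, 1),
    (1, 10246, 3),
    (36, 19660, 2),
]

# Each order becomes a "sentence"; each product ID becomes a "word" (a string)
baskets = defaultdict(list)
for order_id, product_id, _ in sorted(rows, key=lambda r: (r[0], r[2])):
    baskets[order_id].append(str(product_id))

sentences_toy = list(baskets.values())
print(sentences_toy)
```

Product IDs are converted to strings because Word2Vec implementations such as Gensim treat each token as a word in a vocabulary.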
We start by transforming Product ID into a string instead of an integer
train_orders["product_id"] = train_orders["product_id"].astype(str)
prior_orders["product_id"] = prior_orders["product_id"].astype(str)
Now let’s group each order into a list of products
train_products = train_orders.groupby("order_id").apply(lambda order: order['product_id'].tolist())
prior_products = prior_orders.groupby("order_id").apply(lambda order: order['product_id'].tolist())
Each order is now a list of products and each product is represented by its ID string
In : train_products.head()
Out:
order_id
1     [49302, 11109, 10246, 49683, 43633, 13176, 472...
36    [39612, 19660, 49235, 43086, 46620, 34497, 486...
38    [11913, 18159, 4461, 21616, 23622, 32433, 2884...
96    [20574, 30391, 40706, 25610, 27966, 24489, 39275]
98    [8859, 19731, 43654, 13176, 4357, 37664, 34065...
dtype: object
Let’s concatenate the two series and then find the longest order
In : sentences = pd.concat([prior_products, train_products])
...: longest = np.max(sentences.apply(len))
...: print(longest)
...:
145
We transform the sentences into a numpy array
sentences = sentences.values
Finally we train the Word2Vec model. We will be using Gensim’s implementation
import gensim

# vector_size is the gensim >= 4.0 name of this parameter (it was called size before 4.0)
model = gensim.models.Word2Vec(sentences, vector_size=100, window=longest, min_count=2, workers=4)
Notice the usage of window=longest in the training of the model. The products in an order have no meaningful sequence, because each product is independent of the products that were added to the cart before it. We should therefore use a training window large enough to cover all the products in an order; otherwise we would imply that products far apart in the same cart are dissimilar, which is not true.
The model has now learnt a vector representation of each product (except for those below min_count), so let’s see what it has learnt.
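The effect of the window can be checked with a small sketch. It treats the window as a hard upper bound on the distance between two products in a basket and counts how many product pairs fall inside it; the function covered_fraction and the basket size of 20 are illustrative assumptions, not part of the original kernel.

```python
from itertools import combinations


def covered_fraction(basket_size, window):
    """Fraction of product pairs in a basket that are at most `window` positions apart."""
    pairs = list(combinations(range(basket_size), 2))
    covered = sum(1 for i, j in pairs if j - i <= window)
    return covered / len(pairs)


small = covered_fraction(basket_size=20, window=5)
full = covered_fraction(basket_size=20, window=20)
print(round(small, 3), full)
```

With a window of 5, fewer than half of the product pairs in a 20-item basket can ever be trained together, whereas a window at least as long as the basket covers all of them.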
First of all we need to project our vectors onto 2 dimensions so we can visualize them. We do the projection by using PCA.
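from sklearn.decomposition import PCA

pca = PCA(n_components=2)
# model.wv.vectors holds one row per product (it was also exposed as syn0 in older gensim versions)
pca.fit(model.wv.vectors)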
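For intuition about what the projection does, here is a rough NumPy-only sketch of PCA: center the data, find the directions of largest variance with an SVD, and project onto the first two. It is not how scikit-learn must be used, the random matrix is only a stand-in for real product vectors, and the signs of the components may differ from scikit-learn’s output.

```python
import numpy as np


def pca_project(X, n_components=2):
    """Project the rows of X onto the top principal components using an SVD."""
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by decreasing variance
    _, _, Vt = np.linalg.svd(X_centered, full_matrices=False)
    return X_centered @ Vt[:n_components].T


rng = np.random.default_rng(0)
toy_vectors = rng.normal(size=(50, 100))  # stand-in for 50 product vectors of size 100
projected = pca_project(toy_vectors)
print(projected.shape)
```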
Next we will need two helper functions for visualization:
def get_batch(vocab, model, n_batches=3):
    """Sample n_batches random products and return each with its 5 nearest neighbors."""
    output = []
    for _ in range(n_batches):
        # Pick one random product ID from the vocabulary
        rand_int = np.random.randint(len(vocab))
        # most_similar returns (product_id, similarity) pairs; keep only the IDs
        suggestions = model.wv.most_similar(positive=[vocab[rand_int]], topn=5)
        output += [product_id for product_id, _ in suggestions]
        output.append(vocab[rand_int])
    return output

def plot_with_labels(low_dim_embs, labels, filename='tsne.png'):
    """From Tensorflow's tutorial."""
    assert low_dim_embs.shape[0] >= len(labels), "More labels than embeddings"
    plt.figure(figsize=(21, 21))  # in inches
    for i, label in enumerate(labels):
        x, y = low_dim_embs[i, :]
        plt.scatter(x, y)
        plt.annotate(label, xy=(x, y), xytext=(5, 2), textcoords='offset points', ha='right', va='bottom')
    plt.show()
Finally we need to randomly sample our products and visualize them along with their neighbors. This is the job of the get_batch function: it samples n_batches products and extracts the 5 nearest products for each of them. Let’s call these functions and visualize their output using Matplotlib.
from matplotlib import pyplot as plt

# Assumption: vocab is the list of product IDs known to the model
vocab = list(model.wv.index_to_key)  # gensim >= 4.0; older versions: model.wv.index2word

embeds = []
labels = []
for item in get_batch(vocab, model, n_batches=5):
    # pca.transform expects a 2D array, so pass the single vector as one row
    embeds.append(pca.transform(model.wv[item].reshape(1, -1))[0])
    labels.append(products.loc[int(item)]['product_name'])
embeds = np.array(embeds)
plot_with_labels(embeds, labels)
Here are some random results from its output
Note that these images are zoomed in: the distances between products are very small, so I had to zoom in for illustration purposes.
There’s a lot more to find if we keep randomly sampling the products.
Word2Vec is a powerful and very fast algorithm for learning meaningful representations of words from context information. Furthermore, it is not limited to words, as you can extend it to many other domains. We applied it to a shopping problem, where it revealed useful insights from the data, and the vectors it learns can serve as features for models built to solve a specific task.
You can take a look at my Kaggle kernel which has all this code.