Hi!
Previously we discussed chatbots. Briefly, they fall into two categories:
- Retrieval based
- Generative
Retrieval-based chatbots search a database for their answers, whereas generative chatbots rely on a model to generate them.
Generative chatbots need a huge amount of training data and substantial computing resources, which makes them harder to find and to build.
Their outputs are also far from perfect: they sometimes make no sense and contain the silliest grammatical mistakes.
In this tutorial, we are going to train a generative chatbot using a small dataset. We shouldn’t expect much from that bot, but the same process and model can be used to train bigger and better models.
For this task, we are going to use the Frames dataset from Maluuba. This dataset contains almost 1400 conversations between humans who are trying to book travel packages.
The dataset isn’t designed for training chatbots. However, we can extract simple one-liner exchanges from it and train a chatbot on those.
I want to focus on the practical part of this task, so I’m going to leave the theory to articles that explain it better. I’m not going to explain the sequence-to-sequence (seq2seq) algorithm because many resources already do, but there aren’t many practical examples, especially ones that use the dataset API.
You can read TensorFlow’s introduction to machine translation tutorial for an understanding of seq2seq.
Our pipeline is composed of the following stages:
- Data preparation
- Data pre-processing
- Exploratory data analysis (EDA)
- Graph building
- Training/Inference
Data Preparation
Preparing the dataset is a little outside the scope of this tutorial, so I’m only going to briefly describe what we need to accomplish here.
- Extract all short conversations between users and bots
- Every utterance pair will be a training example for our bot
- There’s a lot of information in the dataset that we could benefit from, but mostly we only need the utterances. However, if you want to make your bot more advanced, you might need to extract metadata that helps your bot understand the context and the user’s needs.
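As a rough illustration of the extraction step, here is a minimal sketch. It assumes each dialogue is a dict with a `turns` list whose items carry `author` and `text` fields, and that the bot side is labelled `wizard`; check these names against your copy of the dataset. The `extract_pairs` helper and its `max_words` threshold are hypothetical and not part of the original code.

```python
def extract_pairs(dialogues, max_words=20):
    """Collect (user utterance, bot reply) pairs from consecutive turns."""
    pairs = []
    for dialogue in dialogues:
        turns = dialogue["turns"]
        # Walk over every pair of neighbouring turns.
        for current, following in zip(turns, turns[1:]):
            if current["author"] != "user" or following["author"] != "wizard":
                continue
            user_text = current["text"].strip()
            bot_text = following["text"].strip()
            # Keep only short exchanges (threshold is an assumption).
            if len(user_text.split()) <= max_words and len(bot_text.split()) <= max_words:
                pairs.append((user_text, bot_text))
    return pairs


# Tiny hand-written sample in the assumed structure.
sample = [
    {"turns": [
        {"author": "user", "text": "I want a trip to Paris"},
        {"author": "wizard", "text": "When would you like to leave?"},
        {"author": "user", "text": "Next Friday"},
        {"author": "wizard", "text": "I found 3 packages for you."},
    ]}
]
pairs = extract_pairs(sample)
print(pairs)
```

Each pair then becomes one training example: the first element feeds the encoder and the second is what the decoder should produce.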
Data Pre-processing
Since our data is text, there has to be some pre-processing. I suppose the basics are going to be enough.
Before we start we need to import some basic packages, including “re” for regular expressions, “NLTK” for stemming, and “autocorrect” for spell checking.
import re

from nltk.stem import PorterStemmer
from autocorrect import spell

stemmer = PorterStemmer()  # used below to stem each token
First, we make our text lowercase, since I don’t see any benefit in keeping upper case text for this task. We also strip leading and trailing whitespace from the string.
string = string.strip().lower()
Now our string is ready to be passed through an array of regular expressions that will normalize text. The first regular expression will remove unwanted characters, so we restrict our characters to a closed set.
string = re.sub(r"[^A-Za-z0-9(),!?\'\`:]", " ", string)
We separate “’s” from words to help the tokenizer.
string = re.sub(r"\'s", " \'s", string)
The same happens for a few other contractions.
string = re.sub(r"\'ve", " \'ve", string)
string = re.sub(r"n\'t", " n\'t", string)
string = re.sub(r"\'re", " \'re", string)
string = re.sub(r"\'d", " \'d", string)
string = re.sub(r"\'ll", " \'ll", string)
Punctuation will be separated from text to help the tokenizer.
string = re.sub(r",", " , ", string)
string = re.sub(r"!", " ! ", string)
Normalize spaces to a single space.
string = re.sub(r"\s{2,}", " ", string)
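To see what these string-level steps do together, here is a condensed sketch that applies them in the same order to one sample sentence. The `normalize` function name and the sample text are mine, and the final `strip()` is added only to keep the printed result tidy.

```python
import re

# Contractions that get separated from the preceding word, in the same order as above.
CONTRACTIONS = [r"'s", r"'ve", r"n't", r"'re", r"'d", r"'ll"]


def normalize(text):
    text = text.strip().lower()
    # Replace every character outside the allowed set with a space.
    text = re.sub(r"[^A-Za-z0-9(),!?'`:]", " ", text)
    # Put a space in front of each contraction.
    for pattern in CONTRACTIONS:
        text = re.sub(pattern, lambda match: " " + match.group(0), text)
    # Separate commas and exclamation marks from words.
    text = re.sub(r",", " , ", text)
    text = re.sub(r"!", " ! ", text)
    # Collapse runs of whitespace into a single space.
    return re.sub(r"\s{2,}", " ", text).strip()


cleaned = normalize("  I don't think it's $500, is it?!  ")
print(cleaned)  # i do n't think it 's 500 , is it? !
```

Note that the question mark stays attached to the word, because only commas and exclamation marks are separated in the steps above.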
Now we can tokenize our string to apply token level pre-processing.
string = string.split(" ")
We replace every run of digits with the “NUM” token, because the many different numbers would otherwise fill our vocabulary without benefiting our model.
string = [re.sub(r"[0-9]+", "NUM", token) for token in string]
Some words contain characters that were repeated for exaggeration by the person who typed them. We collapse any run of a repeated character down to two occurrences, so, for example, “wellll” is normalized to “well”.
We also stem each token using NLTK’s Porter stemmer.
string = [stemmer.stem(re.sub(r'(.)\1+', r'\1\1', token)) for token in string]
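The two regular expressions used at the token level are easier to judge in isolation. The sketch below leaves out stemming so that it runs without NLTK; `normalize_token` is a hypothetical helper name.

```python
import re


def normalize_token(token):
    # Replace each run of digits with the NUM placeholder.
    token = re.sub(r"[0-9]+", "NUM", token)
    # Collapse a run of the same character down to exactly two.
    return re.sub(r"(.)\1+", r"\1\1", token)


tokens = ["wellll", "2018", "room4", "hello", "sooo"]
result = [normalize_token(token) for token in tokens]
print(result)  # ['well', 'NUM', 'roomNUM', 'hello', 'soo']
```

Two things stand out: digits inside a longer token are replaced as well ("room4" becomes "roomNUM"), and a word such as "sooo" ends up as "soo" rather than "so", which is where the spell checker in the next step can help.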
Finally, we use the spell checker to try and normalize words that might be mistaken.
string = [spell(token).lower() for token in string]
Because some tokens might end up empty, we repeatedly remove empty tokens until none are left.
while True:
    try:
        string.remove("")
    except ValueError:
        # list.remove raises ValueError once no empty token is left
        break
We’re almost done, but there’s one last step. In seq2seq we need to add special tokens to the text, mainly in the decoder’s data. To the decoder’s input we prepend a start token, which tells the decoder it should start decoding. To the decoder’s output we append an end token, which tells it the work is done.
We also truncate long strings to a maximum length.
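if not bot_input and not bot_output:
    # Encoder input: truncate only
    string = string[0:MAX_LEN]
elif bot_input:
    # Decoder input: leave room for the start token
    string = string[0:MAX_LEN-1]
    string.insert(0, "</start>")
else:
    # Decoder output: leave room for the end token
    string = string[0:MAX_LEN-1]
    string.insert(len(string), "</end>")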
Another token we need to add is the padding token, which makes all our sequences the same fixed length.
We also need to keep the number of tokens in the string before padding, so that we can mask out the padding tokens later.
old_len = len(string)  # number of real tokens, before padding
for i in range((MAX_LEN) - len(string)):
    string.append(" </pad> ")
string = re.sub(r"\s+", " ", " ".join(string)).strip()
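Here is a compact sketch of the truncation, special-token and padding logic on an already tokenized reply. `finalize` is a hypothetical name, and `MAX_LEN` is set to 6 only to keep the output short.

```python
MAX_LEN = 6  # small value for illustration only


def finalize(tokens, bot_input=False, bot_output=False):
    if bot_input:
        # Decoder input: start token first, then at most MAX_LEN - 1 tokens.
        tokens = ["</start>"] + tokens[:MAX_LEN - 1]
    elif bot_output:
        # Decoder output: at most MAX_LEN - 1 tokens, then the end token.
        tokens = tokens[:MAX_LEN - 1] + ["</end>"]
    else:
        # Encoder input: truncate only.
        tokens = tokens[:MAX_LEN]
    length = len(tokens)  # length before padding, needed for masking
    padded = tokens + ["</pad>"] * (MAX_LEN - length)
    return " ".join(padded), length


reply = ["i", "found", "NUM", "packag"]
print(finalize(reply, bot_input=True))   # ('</start> i found NUM packag </pad>', 5)
print(finalize(reply, bot_output=True))  # ('i found NUM packag </end> </pad>', 5)
```

The decoder input and the decoder output are the same reply shifted by one position, which is what lets the decoder learn to predict the next token at every step.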
That’s it for pre-processing. Now we have the new string and its length.
We wrap the previous steps in a function, process_str, which returns the processed string together with its length, and apply it to the full data. We save the results as pickle files to save time, because we will probably have to run the algorithm many times.
import os
import pickle as cPickle  # on Python 2, use "import cPickle" instead

# Reuse cached results when they exist; otherwise process the data and cache it
if os.path.isfile("user_processed.pkl"):
    user = cPickle.load(open("user_processed.pkl", "rb"))
else:
    user = [process_str(item) for item in user]
    cPickle.dump(user, open("user_processed.pkl", "wb"))

if os.path.isfile("bot_in_processed.pkl"):
    bot_inputs = cPickle.load(open("bot_in_processed.pkl", "rb"))
else:
    bot_inputs = [process_str(item, bot_input=True) for item in bot]
    cPickle.dump(bot_inputs, open("bot_in_processed.pkl", "wb"))

if os.path.isfile("bot_out_processed.pkl"):
    bot_outputs = cPickle.load(open("bot_out_processed.pkl", "rb"))
else:
    bot_outputs = [process_str(item, bot_output=True) for item in bot]
    cPickle.dump(bot_outputs, open("bot_out_processed.pkl", "wb"))

# Each processed item is a (string, length) pair; split the pairs apart
user_lens = [message[1] for message in user]
user = [message[0] for message in user]

bot_inp_lens = [message[1] for message in bot_inputs]
bot_out_lens = [message[1] for message in bot_outputs]

bot_inputs = [message[0] for message in bot_inputs]
bot_outputs = [message[0] for message in bot_outputs]
user holds the encoder’s inputs (the users’ messages), bot_inputs holds the decoder’s inputs, and bot_outputs holds the decoder’s target outputs.
Finally, we can grab the vocabulary and add the special tokens to it.
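from sklearn.feature_extraction.text import CountVectorizer

bow = CountVectorizer()
bow.fit(user + bot_inputs)
vocab = list(bow.vocabulary_.keys())
# Insert the special tokens at the front. The final order is
# </pad>=0, </start>=1, </end>=2, UNK=3, NUM=4
vocab.insert(0, "NUM")
vocab.insert(0, "UNK")
vocab.insert(0, "</end>")
vocab.insert(0, "</start>")
vocab.insert(0, "</pad>")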
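Because every special token is inserted at position 0, the last one inserted ends up first. The sketch below replays that with a three-word stand-in for the CountVectorizer vocabulary (the words and the `encode` helper are made up) and shows how a sentence would map to indices.

```python
# Stand-in for the words CountVectorizer would collect.
words = ["found", "packag", "trip"]

vocab = list(words)
for special in ["NUM", "UNK", "</end>", "</start>", "</pad>"]:
    vocab.insert(0, special)

word_to_id = {word: index for index, word in enumerate(vocab)}
UNK_ID = word_to_id["UNK"]


def encode(sentence):
    # Words missing from the vocabulary fall back to the UNK index.
    return [word_to_id.get(token, UNK_ID) for token in sentence.split()]


print(vocab[:5])  # ['</pad>', '</start>', '</end>', 'UNK', 'NUM']
print(encode("</start> found NUM zebra </pad>"))  # [1, 5, 4, 3, 0]
```

The position of UNK (index 3) matters later on, since the lookup table needs a default index for words it has never seen.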
EDA
There’s a lot that we can do to explore the data. However, I was only interested in the number of tokens in either the bot’s or the user’s conversations to find a suitable maximum length.
import numpy as np

print("Average user message: {}, average bot message: {}".format(
    np.mean(user_lens), np.mean(bot_inp_lens)))
print("80th percentile of user lengths: {}, 80th percentile of bot lengths: {}".format(
    np.percentile(user_lens, 80), np.percentile(bot_inp_lens, 80)))
This will help you find a proper maximum length for sequences. I chose 20 tokens because it covers almost 80% of the conversations.
There’s more that we could do in terms of exploratory data analysis, but for now I find this sufficient.
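The same question can be asked the other way round: given a candidate maximum length, what share of the messages fits within it? Below is a small sketch in plain Python; the lengths are invented, the helper names are hypothetical, and the result can differ slightly from np.percentile, which interpolates between values.

```python
def coverage(lengths, max_len):
    """Fraction of sequences with at most max_len tokens."""
    return sum(1 for n in lengths if n <= max_len) / len(lengths)


def smallest_length_covering(lengths, target):
    """Smallest observed length that covers at least `target` of the data."""
    for candidate in sorted(set(lengths)):
        if coverage(lengths, candidate) >= target:
            return candidate


# Made-up token counts for ten messages.
lengths = [4, 6, 7, 9, 11, 13, 16, 20, 27, 35]
print(coverage(lengths, 20))                   # 0.8
print(smallest_length_covering(lengths, 0.8))  # 20
```

Everything longer than the chosen length gets truncated, so this is a trade-off between losing the tail of long messages and wasting computation on padding.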
Graph Building
In order to train a model we first need to define it! In graph building, we’re going to define the components of the graph and the computations between them to make the model’s output.
Data
Using the dataset API we can define our data preparation pipeline, including batching.
We already have our data (user inputs, bot inputs and outputs, and the length of each string), so all we need to do is wrap it using the dataset API.
import tensorflow as tf

# One dataset per text column
tf_user = tf.data.Dataset.from_tensor_slices(user)
tf_bot_inp = tf.data.Dataset.from_tensor_slices(bot_inputs)
tf_bot_out = tf.data.Dataset.from_tensor_slices(bot_outputs)

# One dataset per length column
tf_user_lens = tf.data.Dataset.from_tensor_slices(tf.constant(user_lens))
tf_bot_inp_lens = tf.data.Dataset.from_tensor_slices(tf.constant(bot_inp_lens))
tf_bot_out_lens = tf.data.Dataset.from_tensor_slices(tf.constant(bot_out_lens))
Now, we need to define some data operations on these datasets. These operations are:
- Tokenization
- Word to index
- Shuffling
- Batching
To begin, we need to define a lookup table that maps words to indices and an inverse lookup table for the opposite direction.
with tf.device("/cpu:0"), tf.name_scope("data"):
    # Unknown words map to index 3, the position of "UNK" in vocab
    words = tf.contrib.lookup.index_table_from_tensor(
        mapping=tf.constant(vocab), default_value=3)
    # Unknown indices map back to the "UNK" string
    inverse = tf.contrib.lookup.index_to_string_table_from_tensor(
        mapping=tf.constant(vocab), default_value="UNK", name="inverse_op")
Next, we define the tokenization, word to index, shuffling and batching operations.
with tf.device("/cpu:0"), tf.name_scope("data"):
    # Tokenize on whitespace, then map each token to its index
    tf_user = tf_user.map(lambda string: tf.string_split([string])).map(
        lambda tokens: (words.lookup(tokens)))
    tf_bot_inp = tf_bot_inp.map(lambda string: tf.string_split([string])).map(
        lambda tokens: (words.lookup(tokens)))
    tf_bot_out = tf_bot_out.map(lambda string: tf.string_split([string])).map(
        lambda tokens: (words.lookup(tokens)))

    # Combine all columns so that every element is one full training example
    data = tf.data.Dataset.zip((tf_user, tf_bot_inp, tf_bot_out,
                                tf_user_lens, tf_bot_inp_lens, tf_bot_out_lens))
    data = data.shuffle(buffer_size=256).batch(BATCH_SIZE)
    data = data.prefetch(10)
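If the chained dataset calls are hard to picture, the following plain-Python sketch mimics what shuffling and batching do to a list of already indexed examples. It is only an analogy: `make_batches` is a hypothetical helper, and it shuffles the whole list at once, whereas the dataset API shuffles through a buffer of 256 elements.

```python
import random


def make_batches(examples, batch_size, seed=0):
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    # Cut the shuffled list into consecutive chunks of batch_size.
    return [shuffled[i:i + batch_size] for i in range(0, len(shuffled), batch_size)]


# Each example: (token indices, length before padding). Values are made up.
examples = [([5, 6, 0], 2), ([7, 0, 0], 1), ([5, 7, 6], 3), ([6, 0, 0], 1), ([7, 5, 0], 2)]
batches = make_batches(examples, batch_size=2)
print([len(batch) for batch in batches])  # [2, 2, 1]
```

As in the sketch, the last batch of an epoch can be smaller than the requested batch size when the number of examples is not a multiple of it.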
Finally, we define an iterator that will execute these operations and supply us with a batch.
Iterator
with tf.device("/cpu:0"), tf.name_scope("data"):
    data_iterator = tf.data.Iterator.from_structure(
        data.output_types, data.output_shapes, None, data.output_classes)
    # Running this op (re)starts the iterator at the beginning of the data
    train_init_op = data_iterator.make_initializer(data, name='dataset_init')
    user_doc, bot_inp_doc, bot_out_doc, user_len, bot_inp_len, bot_out_len = \
        data_iterator.get_next()
    # tf.string_split produces sparse tensors; convert them to dense ones
    user_doc = tf.sparse_tensor_to_dense(user_doc)
    bot_inp_doc = tf.sparse_tensor_to_dense(bot_inp_doc)
    bot_out_doc = tf.sparse_tensor_to_dense(bot_out_doc)
Now the tensors: “user_doc, bot_inp_doc, bot_out_doc, user_len, bot_inp_len, bot_out_len” hold the batch we will train on.
Embedding
We will start our network with an embedding layer to learn suitable embeddings for these words.
with tf.name_scope("embedding"):
    # One 200-dimensional vector per vocabulary entry
    embedding = tf.get_variable("embedding", [len(vocab), 200],
                                initializer=tf.glorot_uniform_initializer())

    embedded_user = tf.nn.embedding_lookup(embedding, user_doc)
    embedded_user_dropout = tf.nn.dropout(embedded_user, 0.7)

    embedded_bot_inp = tf.nn.embedding_lookup(embedding, bot_inp_doc)
    embedded_bot_inp_dropout = tf.nn.dropout(embedded_bot_inp, 0.7)

    # Reshape to [batch, time, embedding size]
    embedded_user_dropout = tf.reshape(embedded_user_dropout, [-1, MAX_LEN, 200])
    embedded_bot_inp_dropout = tf.reshape(embedded_bot_inp_dropout, [-1, MAX_LEN, 200])
Seq2Seq Network
Here’s a basic schematic of the Seq2Seq Network/Algorithm.
We have two components: An encoder that turns the input sequence into a thought vector and a decoder that decodes the thought vector into our desired output.
The blue blocks are the encoder and the red blocks are the decoder.
Encoder
Next, we define the encoder. The encoder is a Recurrent Neural Network. It takes as input the embeddings of our words and its output is the thought vector that the decoder will use.
This vector is the encoder’s last hidden state.
with tf.name_scope("encoder"):
    # Build RNN cell
    encoder_GRU = tf.nn.rnn_cell.GRUCell(128)

    # Note: both wrappers share the same GRUCell object; create a second
    # GRUCell if the two directions should not share it
    encoder_cell_fw = tf.nn.rnn_cell.DropoutWrapper(
        encoder_GRU, input_keep_prob=0.7, output_keep_prob=0.7, state_keep_prob=0.9)
    encoder_cell_bw = tf.nn.rnn_cell.DropoutWrapper(
        encoder_GRU, input_keep_prob=0.7, output_keep_prob=0.7, state_keep_prob=0.9)

    encoder_outputs, encoder_state = tf.nn.bidirectional_dynamic_rnn(
        encoder_cell_fw, encoder_cell_bw, embedded_user_dropout,
        sequence_length=user_len, dtype=tf.float32)

    # Concatenate the forward and backward final states: 128 + 128 = 256 units
    encoder_state = tf.concat(encoder_state, -1)
Decoder
Once the decoder receives the last hidden state of the encoder and the start token, it starts generating the reply word by word. To produce the next output, the decoder is fed its own output from the previous time step (during training we feed it the target sequence instead).
The projection layer for the decoder’s output, which produces the logits that the softmax over the full vocabulary is computed from, needs to be defined first.
with tf.name_scope("projection"):
    projection_layer = tf.layers.Dense(len(vocab), use_bias=False)
Then we give it to the decoder:
with tf.name_scope("decoder"):
    # 256 units, matching the size of the concatenated encoder state
    decoder_GRU = tf.nn.rnn_cell.GRUCell(256)
    decoder_cell = tf.nn.rnn_cell.DropoutWrapper(
        decoder_GRU, input_keep_prob=0.7, output_keep_prob=0.7, state_keep_prob=0.9)

    # Helper for use during training
    # During training we feed the decoder
    # the target sequence
    # However, during testing we use the decoder's
    # last output
    helper = tf.contrib.seq2seq.TrainingHelper(
        embedded_bot_inp_dropout, bot_inp_len)

    # Decoder
    decoder = tf.contrib.seq2seq.BasicDecoder(
        decoder_cell, helper, encoder_state,
        output_layer=projection_layer)

    # Dynamic decoding
    outputs, _, _ = tf.contrib.seq2seq.dynamic_decode(decoder)
    logits = outputs.rnn_output
    translations = outputs.sample_id
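The difference between feeding the target sequence and feeding the decoder’s own output is easy to miss in the graph code, so here is a toy sketch of both modes. The "decoder" is just a hypothetical lookup table from the fed token to a predicted token; a real decoder also carries a hidden state, which this sketch ignores.

```python
# Toy stand-in for a trained decoder: fed token -> predicted next token.
NEXT_TOKEN = {"</start>": "hi", "hi": "</end>", "hello": "there", "there": "</end>"}


def step(previous_token):
    return NEXT_TOKEN.get(previous_token, "</end>")


def decode_teacher_forced(decoder_inputs):
    # Training mode: every step sees the ground-truth token, whatever was predicted before.
    return [step(token) for token in decoder_inputs]


def decode_greedy(max_len=10):
    # Inference mode: every step sees the previous prediction.
    outputs, token = [], "</start>"
    while len(outputs) < max_len:
        token = step(token)
        outputs.append(token)
        if token == "</end>":
            break
    return outputs


forced = decode_teacher_forced(["</start>", "hello", "there"])
greedy = decode_greedy()
print(forced)  # ['hi', 'there', '</end>']
print(greedy)  # ['hi', '</end>']
```

In the forced run the first mistake ("hi" instead of "hello") does not affect later steps, while in the greedy run it changes everything that follows. In tf.contrib.seq2seq, the inference-time counterpart of TrainingHelper is GreedyEmbeddingHelper.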
Loss
After the decoder’s done we can compute a loss on its output in order to compute gradients.
To do so we define the cross-entropy loss, but this time we add a mask so that the padding tokens do not contribute to the loss, and we normalize by the batch size.
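with tf.name_scope("loss"):
    labels = tf.reshape(bot_out_doc, [-1, MAX_LEN])
    # dynamic_decode stops at the longest sequence in the batch, so the
    # logits can have fewer than MAX_LEN time steps; trim labels and mask to match
    steps = tf.shape(logits)[1]
    loss = tf.nn.sparse_softmax_cross_entropy_with_logits(
        labels=labels[:, :steps], logits=logits)
    # 1.0 for real tokens, 0.0 for padding
    mask = tf.sequence_mask(bot_out_len, maxlen=steps, dtype=tf.float32)
    train_loss = (tf.reduce_sum(loss * mask) / BATCH_SIZE)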
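To make the masking concrete, here is the same computation written out in plain Python for nested lists. The function name and the toy numbers are mine; the logic is a straightforward softmax cross-entropy in which steps at or beyond the sequence length are skipped.

```python
import math


def masked_cross_entropy(logits, labels, lengths):
    """Sum of per-token cross-entropy over real tokens, divided by the batch size."""
    total = 0.0
    for sequence_logits, sequence_labels, length in zip(logits, labels, lengths):
        for t, (step_logits, label) in enumerate(zip(sequence_logits, sequence_labels)):
            if t >= length:
                continue  # padding position: contributes nothing
            log_norm = math.log(sum(math.exp(value) for value in step_logits))
            total += log_norm - step_logits[label]
    return total / len(logits)


# One sequence, three time steps, vocabulary of four words, two real tokens.
uniform = [0.0, 0.0, 0.0, 0.0]
logits_a = [[uniform, uniform, uniform]]
logits_b = [[uniform, uniform, [9.0, -3.0, 0.5, 2.0]]]  # differs only at the padded step
labels = [[2, 1, 0]]
lengths = [2]

loss_a = masked_cross_entropy(logits_a, labels, lengths)
loss_b = masked_cross_entropy(logits_b, labels, lengths)
print(loss_a, loss_b)
```

Both losses come out equal to 2 * log(4): with uniform logits each real token costs log(4), and whatever the model predicts at the padded third step is ignored.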
Optimizer
Using the loss we can compute gradients, and an optimizer uses those gradients to update the model’s parameters.
We define an optimizer operation that computes the gradients and applies them to train the model.
However, we add two steps to the optimizer:
- Cosine decay + restarts
- Gradient clipping
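Cosine decay with restarts is a learning rate schedule in which the learning rate follows a cosine curve downwards and is periodically reset to a high value. Compared with a learning rate that only ever decays, the restarts are meant to help the model avoid getting stuck in local minima.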
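The schedule is simple enough to write down directly. The sketch below is my reading of how such a schedule is computed, with parameter names borrowed from tf.train.cosine_decay_restarts; treat it as an illustration rather than a copy of the library’s implementation.

```python
import math


def cosine_decay_restarts(base_lr, step, first_decay_steps, t_mul=2.0, m_mul=1.0, alpha=0.0):
    period = float(first_decay_steps)
    remaining = float(step)
    restart_scale = 1.0
    # Find the cycle this step falls into; each cycle is t_mul times longer.
    while remaining >= period:
        remaining -= period
        period *= t_mul
        restart_scale *= m_mul
    fraction = remaining / period  # progress through the current cycle, in [0, 1)
    cosine = 0.5 * restart_scale * (1.0 + math.cos(math.pi * fraction))
    return base_lr * ((1.0 - alpha) * cosine + alpha)


# Same settings as in this tutorial: 0.001 initial rate, 550-step first cycle, t_mul=1.1.
for step in (0, 275, 549, 550):
    print(step, cosine_decay_restarts(0.001, step, 550, t_mul=1.1))
```

The rate starts at 0.001, is down to half of that in the middle of the first cycle, approaches zero just before step 550, and jumps back to 0.001 at step 550, where the second cycle of roughly 605 steps begins.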
Gradient clipping is a simple technique that works well when we need to avoid exploding gradients.
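Clipping by global norm treats all gradients as one long vector: if its norm exceeds the threshold, every gradient is scaled down by the same factor, so the direction of the update is preserved. Here is a minimal sketch on nested lists, with a hypothetical helper named after the TensorFlow op it imitates.

```python
import math


def clip_by_global_norm(gradients, clip_norm):
    # Norm over all gradient values taken together.
    global_norm = math.sqrt(sum(g * g for grad in gradients for g in grad))
    # Scale is 1.0 when the norm is already within the limit.
    scale = clip_norm / max(global_norm, clip_norm)
    clipped = [[g * scale for g in grad] for grad in gradients]
    return clipped, global_norm


grads = [[3.0, 4.0], [12.0]]  # global norm is 13
clipped, norm = clip_by_global_norm(grads, 6.5)
print(norm)     # 13.0
print(clipped)  # [[1.5, 2.0], [6.0]]
```

With a threshold above the current norm, for example 20, the gradients come back unchanged.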
with tf.variable_scope('Adam'):
    global_step = tf.Variable(0, trainable=False)
    inc_gstep = tf.assign(global_step, global_step + 1)
    # Cosine decay with restarts: the first cycle lasts 550 steps,
    # and every following cycle is 1.1 times longer
    learning_rate = tf.train.cosine_decay_restarts(0.001, global_step, 550, t_mul=1.1)
    adam_optimizer = tf.train.AdamOptimizer(learning_rate)
    # Compute the gradients, clip them by global norm, then apply them
    adam_gradients, v = zip(*adam_optimizer.compute_gradients(train_loss))
    adam_gradients, _ = tf.clip_by_global_norm(adam_gradients, 10.0)
    adam_optimize = adam_optimizer.apply_gradients(zip(adam_gradients, v))
Summaries
Now our model and optimization operations are all defined. We can define more operations that will help us monitor the training process, such as the summaries used by TensorBoard.
Think of summaries as variables that record certain values during training so that we can monitor them as the model trains, for example the training/validation loss, gradient norms, and the embeddings.
import pandas as pd
from tensorflow.contrib.tensorboard.plugins import projector

with tf.name_scope('summaries'):
    tf.summary.scalar('Loss', train_loss)
    tf.summary.scalar('LR', learning_rate)
    merged = tf.summary.merge_all()

    # Configure the TensorBoard embedding projector
    config = projector.ProjectorConfig()
    embedding_vis = config.embeddings.add()
    embedding_vis.tensor_name = embedding.name

    # Write one label per vocabulary entry for the projector
    # (the checkpoints/ directory must already exist at this point)
    vocab_str = '\n'.join(vocab)
    metadata = pd.Series(vocab)
    metadata.name = "label"
    metadata.to_csv("checkpoints/metadata.tsv", sep="\t", header=True, index_label="index")
    embedding_vis.metadata_path = 'metadata.tsv'
Training loop
The rest is left to the training loop.
In it, we initialize all variables and start our training process. We also save the model every n epochs and we compute average loss over each epoch.
losses = []
print("Started training")

saver = tf.train.Saver()
save_dir = 'checkpoints/'
if not os.path.exists(save_dir):
    os.makedirs(save_dir)
save_path = os.path.join(save_dir, 'best_validation')

# sconfig is the session configuration, defined elsewhere
sess = tf.InteractiveSession(config=sconfig)
writer = tf.summary.FileWriter('./checkpoints', sess.graph)
projector.visualize_embeddings(writer, config)

# Initialize the lookup tables and all variables
sess.run([words.init, tf.global_variables_initializer(), inverse.init])
step = 0

for i in range(NUM_EPOCHS):
    # Save a checkpoint every 10 epochs
    if i % 10 == 0:
        saver.save(sess=sess, save_path=save_path, write_meta_graph=True)
    # Restart the dataset iterator at the beginning of each epoch
    sess.run(train_init_op)

    while True:
        try:
            _, batch_loss, summary = sess.run([adam_optimize, train_loss, merged])
            writer.add_summary(summary, i)
            losses.append(batch_loss)
        except tf.errors.InvalidArgumentError:
            # Skip batches that fail at run time
            continue
        except tf.errors.OutOfRangeError:
            # The iterator is exhausted, so the epoch is over
            print("Epoch {}: Loss(Mean): {} Loss(Std): {}".format(
                i, np.mean(losses), np.std(losses)))
            losses = []
            break
        sess.run(inc_gstep)
        step += 1
Conclusion
Most of the tutorials available for Seq2Seq do not use the dataset API. My goal in this tutorial was to demonstrate how to use the dataset API to train a Seq2Seq model.
Hopefully, the TensorFlow developers will soon write more tutorials and documentation for the dataset API.