
Building a Seq2Seq Conversational Chat bot using Tensorflow


Previously we have discussed chatbots. Briefly, chatbots can be categorized into 2 branches:

  1. Retrieval based
  2. Generative

Retrieval-based chatbots search a database for their answers, whereas generative chatbots rely on a model to generate them.

Generative chatbots require a huge amount of training data and substantial computing resources, which makes them harder to find and to create.

Their outputs are also far from perfect: they sometimes make no sense, and they make the silliest grammatical mistakes.

In this tutorial, we are going to train a generative chatbot using a small dataset. We shouldn’t really expect much from that bot, but the same process and model can be used to train bigger and better models.

For this task, we are going to use the Frames dataset from Maluuba. This dataset contains almost 1400 conversations between humans who are trying to reserve travel packages.

The dataset isn’t designed for training chatbots. However, we can extract simple one-liner exchanges from it and use them to train a chatbot.

I want to focus on the practical part of this task, so I’m going to leave the theoretical part to articles that better explain it. I’m not going to explain the sequence to sequence algorithm because there are many resources that explain it, but there aren’t many practical examples, especially those that use the dataset API.

You can read Tensorflow’s introduction to machine translation tutorial for an understanding of seq2seq.

Our pipeline is composed of the following stages:

  1. Data preparation
  2. Data pre-processing
  3. EDA
  4. Graph building
  5. Training/Inference

Data Preparation

Preparing the dataset is a little out of the scope of this tutorial so I’m going to briefly discuss what we need to accomplish here.

  1. Extract all short conversations between users and bots
  2. Every utterance pair will be a training example for our bot
  3. There’s a lot of information in the dataset that we can benefit from, but mostly we only need the utterances. However, if you’re looking into making your bot more advanced you might need to extract useful metadata that might aid your bot to understand the context and the user’s needs.
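
As a rough sketch of the extraction step: assuming each conversation is a list of turns with an `author` and a `text` field (these field names, the `wizard` label for the bot side, and the sample conversation are assumptions made for illustration), a user turn followed directly by a bot turn can be paired like this:

```python
def extract_pairs(dialogue, user_name="user", bot_name="wizard"):
    """Return (user_utterance, bot_reply) pairs from consecutive turns."""
    pairs = []
    for prev, curr in zip(dialogue, dialogue[1:]):
        # Keep only a user turn that is immediately followed by a bot turn.
        if prev["author"] == user_name and curr["author"] == bot_name:
            pairs.append((prev["text"], curr["text"]))
    return pairs


# Hypothetical conversation, invented for this example.
sample_dialogue = [
    {"author": "user", "text": "I want to go to Paris"},
    {"author": "wizard", "text": "When would you like to leave?"},
    {"author": "user", "text": "Next Friday"},
    {"author": "wizard", "text": "I found 3 packages"},
]

pairs = extract_pairs(sample_dialogue)
```

Each pair then becomes one training example: the first element feeds the encoder and the second is what the decoder should produce.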

Data Pre-processing

Since our data is text, there has to be some pre-processing. I suppose the basics are going to be enough.

Before we start we need to import some basic packages, including “re” for regular expressions, “NLTK” for stemming, and “autocorrect” for spell checking.

import re
from nltk.stem import PorterStemmer
from autocorrect import spell

stemmer = PorterStemmer()


First, we make our text lowercase, since I don’t see any benefit in upper case text for this task. We also strip leading and trailing whitespace from the string.

string = string.strip().lower()

Now our string is ready to be passed through an array of regular expressions that will normalize text. The first regular expression will remove unwanted characters, so we restrict our characters to a closed set.

string = re.sub(r"[^A-Za-z0-9(),!?\'\`:]", " ", string)

We separate “'s” from words to help the tokenizer.

string = re.sub(r"\'s", " \'s", string)

The same is done for a few other contractions.

string = re.sub(r"\'ve", " \'ve", string)
string = re.sub(r"n\'t", " n\'t", string)
string = re.sub(r"\'re", " \'re", string)
string = re.sub(r"\'d", " \'d", string)
string = re.sub(r"\'ll", " \'ll", string)

Punctuation will be separated from text to help the tokenizer.

string = re.sub(r",", " , ", string)
string = re.sub(r"!", " ! ", string)

Normalize spaces to a single space.

string = re.sub(r"\s{2,}", " ", string)
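
To see how these string-level steps behave together, here is a small sketch that chains them in one function. The function name `normalize` and the sample sentence are made up for this example; the regular expressions are the ones shown above.

```python
import re


def normalize(string):
    """Apply the string-level cleaning steps from above, in order."""
    string = string.strip().lower()
    # Keep only characters from the closed set.
    string = re.sub(r"[^A-Za-z0-9(),!?\'\`:]", " ", string)
    # Split contractions from the words they are attached to.
    string = re.sub(r"\'s", " 's", string)
    string = re.sub(r"\'ve", " 've", string)
    string = re.sub(r"n\'t", " n't", string)
    string = re.sub(r"\'re", " 're", string)
    string = re.sub(r"\'d", " 'd", string)
    string = re.sub(r"\'ll", " 'll", string)
    # Separate punctuation, then collapse repeated whitespace.
    string = re.sub(r",", " , ", string)
    string = re.sub(r"!", " ! ", string)
    string = re.sub(r"\s{2,}", " ", string)
    return string


cleaned = normalize("  It's great, I can't wait!  ")
print(repr(cleaned))
```

Note that the result can end with a single trailing space (the one added after “!”), so splitting on spaces may produce an empty token. That is why empty tokens have to be removed after tokenization.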

Now we can tokenize our string to apply token level pre-processing.

string = string.split(" ")

If a token contains a number, we substitute the digits with the “NUM” token, because the many different numbers would otherwise fill our vocabulary without benefiting our model.

string = [re.sub(r"[0-9]+", "NUM", token) for token in string]

Some words have many repeated characters that are wrong and were probably used as exaggeration by the human who typed it, for example, “wellll” will be normalized to “well”.

We also stem each token using NLTK’s porter stemmer.

string = [stemmer.stem(re.sub(r'(.)\1+', r'\1\1', token)) for token in string]
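
Here is a small sketch of the two regular expressions applied to a list of tokens. Stemming and spell checking are left out so that the example runs with the standard library only, and the token list is invented for illustration.

```python
import re

tokens = ["wellll", "i", "need", "2", "tickets", "for", "500", "dollars"]

# Replace every run of digits with the NUM token.
tokens = [re.sub(r"[0-9]+", "NUM", token) for token in tokens]
# Collapse runs of a repeated character down to two occurrences.
tokens = [re.sub(r"(.)\1+", r"\1\1", token) for token in tokens]

print(tokens)
```

Legitimate double letters such as the “ee” in “need” survive because runs are reduced to two characters, not one. The flip side is that “sooo” becomes “soo” rather than “so”, which is one reason a spell checker is useful afterwards.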

Finally, we use the spell checker to try and normalize words that might be mistaken.

string = [spell(token).lower() for token in string]

Because some tokens might be empty, we iteratively delete all empty tokens.

while "" in string:
    string.remove("")

Now we’re done, but there’s one last step. In seq2seq we need to add special tokens to the text, mainly in the decoder’s data. To the decoder’s input, we prepend a start token, which tells the decoder it should start decoding. To the decoder’s output, we append an end token to tell it the work is done.

We also truncate long strings to a maximum length.

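if(not bot_input and not bot_output):
    # Encoder input: only truncate
    string = string[0:MAX_LEN]
elif(bot_input):
    # Decoder input: leave room for the start token
    string = string[0:MAX_LEN-1]
    string.insert(0, "</start>")
elif(bot_output):
    # Decoder output: leave room for the end token
    string = string[0:MAX_LEN-1]
    string.insert(len(string), "</end>")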

Another token we need to add is the padding token, in order to make our sequences of fixed length.

We also need to return the sequence’s length prior to padding in order to mask the padding tokens later.

old_len = len(string)
for i in range((MAX_LEN) - len(string)):
    string.append(" </pad> ")
string = re.sub(r"\s+", " ", " ".join(string)).strip()
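
The truncation, special-token and padding steps can be summarized in one small sketch. The helper name `add_special_tokens`, the tiny `MAX_LEN` and the sample tokens are assumptions chosen to keep the example short.

```python
MAX_LEN = 6  # small value chosen only for this example


def add_special_tokens(tokens, bot_input=False, bot_output=False):
    """Truncate, add the start/end token, pad, and return (string, length)."""
    if bot_input:
        tokens = ["</start>"] + tokens[:MAX_LEN - 1]
    elif bot_output:
        tokens = tokens[:MAX_LEN - 1] + ["</end>"]
    else:
        tokens = tokens[:MAX_LEN]
    length = len(tokens)  # length before padding, used later for masking
    tokens = tokens + ["</pad>"] * (MAX_LEN - length)
    return " ".join(tokens), length


reply = ["sure", "which", "city"]
encoder_in = add_special_tokens(reply)
decoder_in = add_special_tokens(reply, bot_input=True)
decoder_out = add_special_tokens(reply, bot_output=True)
```

The decoder input and the decoder output are the same reply shifted by one position: the input starts with the start token and the output finishes with the end token.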

That’s it for pre-processing. Now we have the new string and its length.

We wrap the previous steps in a function and apply it to the full data. Saving the results as pickle files saves time, because we will probably have to run the algorithm many times.

# Reuse the cached pickle if it exists; otherwise process the raw data and cache it
if os.path.isfile("user_processed.pkl"):
    user = cPickle.load(open("user_processed.pkl", "rb"))
else:
    user = [process_str(item) for item in user]
    cPickle.dump(user, open("user_processed.pkl", "wb"))

if os.path.isfile("bot_in_processed.pkl"):
    bot_inputs = cPickle.load(open("bot_in_processed.pkl", "rb"))
else:
    bot_inputs = [process_str(item, bot_input=True) for item in bot]
    cPickle.dump(bot_inputs, open("bot_in_processed.pkl", "wb"))

if os.path.isfile("bot_out_processed.pkl"):
    bot_outputs = cPickle.load(open("bot_out_processed.pkl", "rb"))
else:
    bot_outputs = [process_str(item, bot_output=True) for item in bot]
    cPickle.dump(bot_outputs, open("bot_out_processed.pkl", "wb"))

# Each processed item is a (string, length) pair; split them apart
user_lens = [message[1] for message in user]
user = [message[0] for message in user]

bot_inp_lens = [message[1] for message in bot_inputs]
bot_out_lens = [message[1] for message in bot_outputs]

bot_inputs = [message[0] for message in bot_inputs]
bot_outputs = [message[0] for message in bot_outputs]

user is the input of the encoder which is a message by the user. bot_inputs is the input to the decoder and bot_outputs is the output of the decoder.

Finally, we can grab the vocabulary and add the special tokens to it.

from sklearn.feature_extraction.text import CountVectorizer
bow = CountVectorizer()
bow.fit(user + bot_inputs)
vocab = list(bow.vocabulary_.keys())
vocab.insert(0, "NUM")
vocab.insert(0, "UNK")
vocab.insert(0, "</end>")
vocab.insert(0, "</start>")
vocab.insert(0, "</pad>")



EDA

There’s a lot that we can do to explore the data. However, I was only interested in the number of tokens in the bot’s and the user’s messages, in order to find a suitable maximum length.

print("Average user message: {}, average bot message: {}".format(np.mean(user_lens), np.mean(bot_inp_lens)))
print("80th percentile of user lengths: {}, 80th percentile of bot lengths: {}".format(np.percentile(user_lens, 80), np.percentile(bot_inp_lens, 80)))

This will help you find a proper maximum length for sequences. I chose 20 tokens because it covers almost 80% of the messages.
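
To make the percentile idea concrete, here is a minimal standard-library sketch. The `percentile` helper uses linear interpolation between neighbouring values, and the list of lengths is invented for illustration, not taken from the dataset.

```python
def percentile(values, q):
    """q-th percentile (0-100) with linear interpolation between neighbours."""
    ordered = sorted(values)
    position = (len(ordered) - 1) * q / 100.0
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = position - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * fraction


# Hypothetical message lengths in tokens.
lengths = [3, 5, 8, 12, 7, 20, 4, 9, 15, 6, 31]

p80 = percentile(lengths, 80)
covered = sum(1 for length in lengths if length <= p80)
print("80th percentile: {}, messages covered: {}/{}".format(p80, covered, len(lengths)))
```

Picking the maximum length near a high percentile keeps most messages intact while avoiding a lot of padding caused by a few very long ones.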

There’s more that we could do in exploratory data analysis. For now, however, I find this sufficient.

Graph Building

In order to train a model we first need to define it! In graph building, we’re going to define the components of the graph and the computations between them to make the model’s output.


Using the dataset API we can define our data preparation pipeline, including batching.

We already have our data (the user inputs, the bot inputs and outputs, and the length of each string), so all we need to do is wrap it using the dataset API.

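tf_user = tf.data.Dataset.from_tensor_slices(user)
tf_bot_inp = tf.data.Dataset.from_tensor_slices(bot_inputs)
tf_bot_out = tf.data.Dataset.from_tensor_slices(bot_outputs)

tf_user_lens = tf.data.Dataset.from_tensor_slices(user_lens)
tf_bot_inp_lens = tf.data.Dataset.from_tensor_slices(bot_inp_lens)
tf_bot_out_lens = tf.data.Dataset.from_tensor_slices(bot_out_lens)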

Now, we need to define some data operations on these datasets. These operations are:

  1. Tokenization
  2. Word to index
  3. Shuffling
  4. Batching

So to begin we first need to define a lookup table for word to index and an inverse lookup table for the inverse operation.

with tf.device("/cpu:0"), tf.name_scope("data"):
    words = tf.contrib.lookup.index_table_from_tensor(mapping=tf.constant(vocab), default_value=3)
    inverse = tf.contrib.lookup.index_to_string_table_from_tensor(mapping=tf.constant(vocab), default_value="UNK", name="inverse_op")
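
Conceptually, the two tables behave like a dictionary and a list with a fallback for unknown entries. Here is a plain-Python sketch of that behaviour; the tiny vocabulary and the helper names `encode` and `decode` are made up for this example.

```python
# Special tokens first, in the same order as in the vocabulary built earlier.
vocab = ["</pad>", "</start>", "</end>", "UNK", "NUM", "hello", "paris"]

word_to_index = {word: index for index, word in enumerate(vocab)}
UNK_INDEX = word_to_index["UNK"]  # 3, matching default_value=3 above


def encode(sentence):
    """Map each whitespace-separated token to its index, or to UNK."""
    return [word_to_index.get(token, UNK_INDEX) for token in sentence.split()]


def decode(indices):
    """Map each index back to its token, or to UNK if out of range."""
    return [vocab[i] if 0 <= i < len(vocab) else "UNK" for i in indices]


ids = encode("</start> hello tokyo </end> </pad>")
words_back = decode(ids)
```

The word “tokyo” is not in this toy vocabulary, so it is mapped to the index of “UNK” and comes back as “UNK” after decoding.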

Next, we define the tokenization, word to index, shuffling and batching operations.

with tf.device("/cpu:0"), tf.name_scope("data"):
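    # Tokenize each string, then map every token to its vocabulary index
    tf_user = tf_user.map(lambda string: tf.string_split([string])).map(lambda tokens: (words.lookup(tokens)))
    tf_bot_inp = tf_bot_inp.map(lambda string: tf.string_split([string])).map(lambda tokens: (words.lookup(tokens)))
    tf_bot_out = tf_bot_out.map(lambda string: tf.string_split([string])).map(lambda tokens: (words.lookup(tokens)))
    # Combine the six datasets so each element is one full training example
    data = tf.data.Dataset.zip((tf_user, tf_bot_inp, tf_bot_out, tf_user_lens, tf_bot_inp_lens, tf_bot_out_lens))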
    data = data.shuffle(buffer_size=256).batch(BATCH_SIZE)
    data = data.prefetch(10)
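
To give an intuition for what shuffling with a buffer followed by batching does, here is a plain-Python sketch of the idea. It is a simplified illustration, not TensorFlow’s implementation, and the function names are made up for this example.

```python
import random


def buffered_shuffle(items, buffer_size, rng):
    """Yield items in random order using a fixed-size buffer (buffer_size >= 1)."""
    buffer = []
    for item in items:
        if len(buffer) < buffer_size:
            buffer.append(item)  # fill the buffer first
            continue
        index = rng.randrange(buffer_size)
        yield buffer[index]      # emit a random buffered item...
        buffer[index] = item     # ...and put the new item in its place
    rng.shuffle(buffer)
    yield from buffer            # drain what is left at the end


def batch(items, batch_size):
    """Group items into lists of batch_size; the last batch may be smaller."""
    current = []
    for item in items:
        current.append(item)
        if len(current) == batch_size:
            yield current
            current = []
    if current:
        yield current


batches = list(batch(buffered_shuffle(range(10), 4, random.Random(0)), 3))
```

A small buffer only shuffles locally: an item can never be emitted before the buffer has been filled up to it, so a buffer at least as large as the dataset is needed for a full shuffle.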

Finally, we define an iterator that will execute these operations and supply us with a batch.


with tf.device("/cpu:0"), tf.name_scope("data"):
    data_iterator = tf.data.Iterator.from_structure(data.output_types, data.output_shapes,
                                                    None, data.output_classes)
    train_init_op = data_iterator.make_initializer(data, name='dataset_init')
    user_doc, bot_inp_doc, bot_out_doc, user_len, bot_inp_len, bot_out_len = data_iterator.get_next()
    user_doc = tf.sparse_tensor_to_dense(user_doc)
    bot_inp_doc = tf.sparse_tensor_to_dense(bot_inp_doc)
    bot_out_doc = tf.sparse_tensor_to_dense(bot_out_doc)


Now the tensors: “user_doc, bot_inp_doc, bot_out_doc, user_len, bot_inp_len, bot_out_len” hold the batch we will train on.


We will start our network with an embedding layer to learn suitable embeddings for these words.

with tf.name_scope("embedding"):
    embedding = tf.get_variable("embedding", [len(vocab), 200], initializer=tf.glorot_uniform_initializer())
    embedded_user = tf.nn.embedding_lookup(embedding, user_doc)
    embedded_user_dropout = tf.nn.dropout(embedded_user, 0.7)
    embedded_bot_inp = tf.nn.embedding_lookup(embedding, bot_inp_doc)
    embedded_bot_inp_dropout = tf.nn.dropout(embedded_bot_inp, 0.7)
    embedded_user_dropout = tf.reshape(embedded_user_dropout, [-1, MAX_LEN, 200])
    embedded_bot_inp_dropout = tf.reshape(embedded_bot_inp_dropout, [-1, MAX_LEN, 200])
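
For intuition, here is what the lookup and the dropout do, sketched in plain Python. In TensorFlow 1.x the second argument of `tf.nn.dropout` is the probability of keeping a value, so 0.7 above means roughly 30% of the values are zeroed. The toy embedding matrix and the helper names are invented for this example.

```python
import random


def embedding_lookup(embedding, indices):
    """Return the embedding row for each token index."""
    return [embedding[index] for index in indices]


def dropout(vector, keep_prob, rng):
    """Inverted dropout: zero each value with probability 1 - keep_prob
    and scale the surviving values by 1 / keep_prob."""
    return [value / keep_prob if rng.random() < keep_prob else 0.0
            for value in vector]


# Toy 4-word vocabulary with 3-dimensional embeddings (made-up numbers).
embedding = [
    [0.0, 0.0, 0.0],
    [0.1, 0.2, 0.3],
    [0.4, 0.5, 0.6],
    [0.7, 0.8, 0.9],
]

rng = random.Random(0)
embedded = embedding_lookup(embedding, [2, 0, 3])
dropped = [dropout(row, keep_prob=0.7, rng=rng) for row in embedded]
```

Scaling the surviving values keeps the expected magnitude of the activations the same whether dropout is switched on or off.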


Seq2Seq Network

Here’s a basic schematic of the Seq2Seq Network/Algorithm.

We have two components: An encoder that turns the input sequence into a thought vector and a decoder that decodes the thought vector into our desired output.

Figure: encoder-decoder schematic

The blue blocks are the encoder and the red blocks are the decoder.


Next, we define the encoder. The encoder is a Recurrent Neural Network. It takes as input the embeddings of our words and its output is the thought vector that the decoder will use.

This vector is the encoder’s last hidden state.

with tf.name_scope("encoder"):
    # Build RNN cells: one GRU per direction, so that the forward and
    # backward passes do not share weights
    encoder_GRU_fw = tf.nn.rnn_cell.GRUCell(128)
    encoder_GRU_bw = tf.nn.rnn_cell.GRUCell(128)
    encoder_cell_fw = tf.nn.rnn_cell.DropoutWrapper(encoder_GRU_fw, input_keep_prob=0.7,
                                                 output_keep_prob=0.7, state_keep_prob=0.9)
    encoder_cell_bw = tf.nn.rnn_cell.DropoutWrapper(encoder_GRU_bw, input_keep_prob=0.7,
                                                 output_keep_prob=0.7, state_keep_prob=0.9)
    encoder_outputs, encoder_state = tf.nn.bidirectional_dynamic_rnn(
        encoder_cell_fw, encoder_cell_bw, embedded_user_dropout,
        sequence_length=user_len, dtype=tf.float32)
    encoder_state = tf.concat(encoder_state, -1)


Once the decoder receives the last hidden state of the encoder and the start decoding token, it starts decoding the input word by word. The decoder should be fed its output at the previous time step to produce the next output.

The projection layer for the decoder’s output, which maps each decoder output to logits over the full vocabulary, needs to be defined first.

with tf.name_scope("projection"):
    projection_layer = tf.layers.Dense(
        len(vocab), use_bias=False)

Then we give it to the decoder.

with tf.name_scope("decoder"):
    decoder_GRU = tf.nn.rnn_cell.GRUCell(256)
    decoder_cell = tf.nn.rnn_cell.DropoutWrapper(decoder_GRU, input_keep_prob=0.7, 
                                                 output_keep_prob=0.7, state_keep_prob=0.9)
    # Helper for use during training
    # During training we feed the decoder
    # the target sequence
    # However, during testing we use the decoder's
    # last output
    helper = tf.contrib.seq2seq.TrainingHelper(
        embedded_bot_inp_dropout, bot_inp_len)
    # Decoder
    decoder = tf.contrib.seq2seq.BasicDecoder(
        decoder_cell, helper, encoder_state,
        output_layer=projection_layer)
    # Dynamic decoding
    outputs, _, _ = tf.contrib.seq2seq.dynamic_decode(decoder)
    logits = outputs.rnn_output
    translations = outputs.sample_id
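
The difference between feeding the target sequence during training and feeding the decoder’s own last output during testing can be shown with a toy sketch. Here a lookup table stands in for a trained decoder step; a real decoder also carries a hidden state, which this illustration deliberately ignores. All names and the table contents are made up.

```python
# Toy stand-in for a trained decoder step: previous token -> next token.
NEXT_TOKEN = {"</start>": "hello", "hello": "there", "there": "</end>"}


def step(previous_token):
    return NEXT_TOKEN.get(previous_token, "</end>")


def decode_teacher_forced(decoder_inputs):
    """Training: every step sees the ground-truth previous token."""
    return [step(token) for token in decoder_inputs]


def decode_greedy(max_len):
    """Inference: every step sees the decoder's own previous output."""
    token, output = "</start>", []
    for _ in range(max_len):
        token = step(token)
        output.append(token)
        if token == "</end>":
            break
    return output


trained = decode_teacher_forced(["</start>", "hello", "there"])
generated = decode_greedy(max_len=10)
```

With teacher forcing, a mistake at one step does not affect the inputs of later steps. At inference time it does, because each output becomes the next input.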


After the decoder’s done we can compute a loss on its output in order to compute gradients.

To do so we define the cross-entropy loss but this time we add a mask that will not compute the loss for the padding tokens and we normalize by batch size.

with tf.name_scope("loss"):
    loss = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=tf.reshape(bot_out_doc,
                                                                    [-1, MAX_LEN]), logits=logits)
    mask = tf.sequence_mask(bot_out_len, dtype=tf.float32)
    train_loss = (tf.reduce_sum(loss * mask) / BATCH_SIZE)
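
The masking logic is easier to see on a tiny example. Below is a plain-Python sketch of a masked cross-entropy that sums the loss over real tokens only and divides by the batch size. The function names and the numbers are assumptions for illustration.

```python
import math


def cross_entropy(logits, label):
    """Softmax cross-entropy for one time step, computed stably."""
    highest = max(logits)
    log_sum_exp = highest + math.log(sum(math.exp(x - highest) for x in logits))
    return log_sum_exp - logits[label]


def masked_loss(batch_logits, batch_labels, lengths, batch_size):
    """Sum the per-step losses of real tokens and normalize by batch size."""
    total = 0.0
    for seq_logits, seq_labels, length in zip(batch_logits, batch_labels, lengths):
        for t, (logits, label) in enumerate(zip(seq_logits, seq_labels)):
            if t < length:  # mask is 1 for real tokens, 0 for padding
                total += cross_entropy(logits, label)
    return total / batch_size


# One sequence, three time steps, two-word vocabulary; the last step is padding.
logits = [[[0.0, 0.0], [0.0, 0.0], [5.0, -5.0]]]
labels = [[1, 0, 0]]
loss = masked_loss(logits, labels, lengths=[2], batch_size=1)
```

Only the first two steps contribute here, each with a loss of ln 2 because the logits are uniform. Whatever the model predicts at the padded third step has no effect on the result.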


Using the loss we can compute gradients. An optimizer then uses these gradients to update the model’s parameters.

We define an optimizer operation that will compute gradients and use them to train the model.

However, we add two steps to the optimizer.

  1. Cosine decay + restarts
  2. Gradient clipping

Cosine decay + restarts is a learning rate scheduling mechanism: the learning rate follows a cosine curve downwards and is periodically reset to its initial value. Unlike a learning rate that only decays, the restarts aim to help the model escape local minima.

Gradient clipping is a simple approach that works when we need to avoid exploding gradients.

with tf.variable_scope('Adam'):
    global_step = tf.Variable(0, trainable=False)
    inc_gstep = tf.assign(global_step,global_step + 1)
    learning_rate = tf.train.cosine_decay_restarts(0.001, global_step, 550, t_mul=1.1)
    adam_optimizer = tf.train.AdamOptimizer(learning_rate)
    adam_gradients, v = zip(*adam_optimizer.compute_gradients(train_loss))
    adam_gradients, _ = tf.clip_by_global_norm(adam_gradients, 10.0)
    adam_optimize = adam_optimizer.apply_gradients(zip(adam_gradients, v))
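
Both additions can be sketched in a few lines of plain Python. These are simplified re-implementations written for illustration, not TensorFlow’s code: the schedule ignores optional parameters such as a minimum learning rate, and gradients are represented as plain lists of floats.

```python
import math


def cosine_decay_restarts(base_lr, step, first_decay_steps, t_mul=1.0):
    """Cosine-decayed learning rate; each period is t_mul times the previous one."""
    period = float(first_decay_steps)
    while step >= period:      # skip over the periods already completed
        step -= period
        period *= t_mul
    return 0.5 * base_lr * (1 + math.cos(math.pi * step / period))


def clip_by_global_norm(gradients, clip_norm):
    """Rescale all gradients together if their joint norm exceeds clip_norm."""
    global_norm = math.sqrt(sum(g * g for grad in gradients for g in grad))
    if global_norm <= clip_norm:
        return [list(grad) for grad in gradients], global_norm
    scale = clip_norm / global_norm
    return [[g * scale for g in grad] for grad in gradients], global_norm


lr_start = cosine_decay_restarts(0.001, 0, 550, t_mul=1.1)
lr_middle = cosine_decay_restarts(0.001, 275, 550, t_mul=1.1)
lr_restart = cosine_decay_restarts(0.001, 550, 550, t_mul=1.1)

clipped, norm = clip_by_global_norm([[30.0], [40.0]], clip_norm=10.0)
```

At step 550 the schedule jumps back to the initial learning rate and the next period lasts 605 steps. Clipping by the global norm scales every gradient by the same factor, so the direction of the update is preserved.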


Now our model and optimization operations are all defined. We can define more operations that will help us monitor the training process. Among these are the summaries used by Tensorboard, so we define them next.

Think of summaries as variables that hold the values of certain parameters during training, so that we can monitor them as the model trains. Examples include the training/validation loss, gradient norms, and the embeddings.

with tf.name_scope('summaries'):
    tf.summary.scalar('Loss', train_loss)
    tf.summary.scalar('LR', learning_rate)
    merged = tf.summary.merge_all()
    config = projector.ProjectorConfig()
    embedding_vis = config.embeddings.add()
    # Point the projector at the embedding variable defined earlier
    embedding_vis.tensor_name = embedding.name
    vocab_str = '\n'.join(vocab)
    metadata = pd.Series(vocab)
    metadata.name = "label"
    metadata.to_csv("checkpoints/metadata.tsv", sep="\t", header=True, index_label="index")
    embedding_vis.metadata_path = 'metadata.tsv'

Training loop

The rest is left to the training loop.

In it, we initialize all variables and start our training process. We also save the model every n epochs and we compute average loss over each epoch.

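losses = []
print("Started training")

saver = tf.train.Saver()
save_dir = 'checkpoints/'
if not os.path.exists(save_dir):
    os.makedirs(save_dir)
save_path = os.path.join(save_dir, 'best_validation')

# sconfig is the session configuration, defined elsewhere
sess = tf.InteractiveSession(config=sconfig)

writer = tf.summary.FileWriter('./checkpoints', sess.graph)
projector.visualize_embeddings(writer, config)

# Initialize the lookup tables and all variables
sess.run([words.init, tf.global_variables_initializer(), inverse.init])
step = 0

for i in range(NUM_EPOCHS):
    # Save a checkpoint every 10 epochs
    if(i % 10 == 0):
        saver.save(sess=sess, save_path=save_path, write_meta_graph=True)
    # Rewind the dataset iterator at the start of each epoch
    sess.run(train_init_op)

    while True:
        try:
            _, batch_loss, summary = sess.run([adam_optimize, train_loss, merged])
            writer.add_summary(summary, i)
            losses.append(batch_loss)
        except tf.errors.InvalidArgumentError:
            # Assumed handling: skip a batch that fails to run
            continue
        except tf.errors.OutOfRangeError:
            # The iterator is exhausted, so the epoch is over
            print("Epoch {}: Loss(Mean): {} Loss(Std): {}".format(i, np.mean(losses), np.std(losses)))
            losses = []
            break
        step += 1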


Most of the tutorials available for Seq2Seq do not use the dataset API. In this tutorial, it was my goal to demonstrate the usage of the dataset API to train a Seq2Seq model.

Hopefully, soon enough Tensorflow devs will start writing more tutorials and documentation for the dataset API.

Published in Data Science