Deep Learning for NLP

In this assignment, you will implement a character-based language model using RNNs. This will familiarize you with recurrent neural networks, and give you a chance to sample fun random texts from your favorite corpus. For an explanation of the basic idea, along with some very cool visualizations, see Andrej Karpathy’s blog post on this topic.

Obtain a corpus of your choice. This could be one of the corpora provided by Karpathy, or some other corpus. Be aware of character encoding issues if you choose a non-ASCII corpus.

Implement an RNN that predicts the next character in a string when given the previous character as input. For example, on the string “hello”, your RNN will be given (a one-hot encoding of) the character “h” as input in timestep 1, and is supposed to produce (a softmax distribution that assigns the maximum probability to) the character “e” as output. Then your RNN receives “e” as input in timestep 2, and is expected to produce “l” as output; and so on. Feel free to use special symbols for the beginning and end of a sequence, or to work around these in some other way.
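For concreteness, here is a small NumPy sketch of how the input/target pairs for this next-character task might be one-hot encoded. The helper name and the tiny alphabet are illustrative, not part of the assignment:

```python
import numpy as np

def one_hot_pairs(text, alphabet):
    """Build (input, target) one-hot sequences for next-character prediction.

    Illustrative helper: input at timestep t is text[t], target is text[t+1].
    """
    char_to_idx = {c: i for i, c in enumerate(alphabet)}
    eye = np.eye(len(alphabet), dtype=np.float32)
    inputs = eye[[char_to_idx[c] for c in text[:-1]]]
    targets = eye[[char_to_idx[c] for c in text[1:]]]
    return inputs, targets

# For "hello": input "h" should predict "e", input "e" predicts "l", ...
inputs, targets = one_hot_pairs("hello", "ehlo")
```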

Train your RNN on your corpus and save the trained model to disk (e.g. with the `Checkpoint` extension from saveload). Then load the model and use it to sample a new text of a given size from your model. At each timestep t, use the input x_t and the hidden activations h_(t-1) as input. Compute the softmax layer output as a probability distribution for the next character and draw y_t at random from it. Then update the hidden activation to the new value, and feed y_t to the RNN as the input character x_(t+1). Generate a few random texts for models at various stages of training (e.g. after one epoch, ten, fifty, …) and compare them to each other. Prepare a few fun texts so you can show them in class.
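The sampling loop just described can be sketched in plain NumPy with a toy, untrained vanilla RNN; all weights and sizes here are illustrative placeholders for your trained model:

```python
import numpy as np

rng = np.random.default_rng(0)
C, H = 4, 8                     # alphabet size, hidden size (toy values)
W_xh = rng.normal(0, 0.1, (H, C))
W_hh = rng.normal(0, 0.1, (H, H))
W_hy = rng.normal(0, 0.1, (C, H))

def sample_text(length, start_idx=0):
    """Sample `length` character indices from the toy RNN."""
    h = np.zeros(H)
    x = np.eye(C)[start_idx]
    out = []
    for _ in range(length):
        h = np.tanh(W_xh @ x + W_hh @ h)   # update hidden activations
        scores = W_hy @ h
        p = np.exp(scores - scores.max())
        p /= p.sum()                       # softmax -> probability distribution
        y = rng.choice(C, p=p)             # draw y_t at random from it
        out.append(int(y))
        x = np.eye(C)[y]                   # feed y_t back as input x_(t+1)
    return out

indices = sample_text(20)
```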

It is relatively straightforward to define a network containing an RNN layer with Blocks (use the `SimpleRecurrent` brick), but it is important that you provide the training data in the right form.
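A possible skeleton of such a model definition is sketched below; the dimensions `C` and `H`, the brick names, and the initializations are illustrative assumptions, not prescribed by the assignment:

```python
# Sketch of a model definition with Blocks; C = alphabet size, H = hidden size.
from theano import tensor
from blocks.bricks import Linear, Tanh
from blocks.bricks.recurrent import SimpleRecurrent
from blocks.initialization import IsotropicGaussian, Constant

x = tensor.tensor3('x')   # shape (seqlength, instances, dimension)
y = tensor.tensor3('y')

input_to_hidden = Linear(name='in_to_h', input_dim=C, output_dim=H,
                         weights_init=IsotropicGaussian(0.01),
                         biases_init=Constant(0))
rnn = SimpleRecurrent(name='rnn', dim=H, activation=Tanh(),
                      weights_init=IsotropicGaussian(0.01))
hidden_to_output = Linear(name='h_to_out', input_dim=H, output_dim=C,
                          weights_init=IsotropicGaussian(0.01),
                          biases_init=Constant(0))

h = rnn.apply(input_to_hidden.apply(x))
h.name = 'hidden_activations'        # you will need this name at sampling time
linear_output = hidden_to_output.apply(h)

for brick in (input_to_hidden, rnn, hidden_to_output):
    brick.initialize()
```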

Training data for an RNN in Blocks consists of a number of sequences. The RNN is trained to accept each sequence separately, producing the required outputs for that sequence. It is then reset to its (learned) initial state and trained to accept the next sequence correctly, and so on. Thus the training data has a three-dimensional structure. Each source has the Numpy shape `(seqlength, instances, dimension)`, where

- `seqlength` is the maximum length of the training sequences;
- `instances` is the number of sequences that are provided in the training data;
- `dimension` is the dimension of the source, e.g. the length of the one-hot encodings.

If your training sequences have different lengths, you can use the `mask` parameter to tell Blocks about this. This would allow you to train the system to generate complete sentences. Alternatively, you could simply cut the training corpus into sequences of fixed length, which would simplify this step.
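The fixed-length option can be sketched in NumPy as follows; the helper name is illustrative, and leftover characters at the end of the corpus are simply dropped:

```python
import numpy as np

def corpus_to_batches(text, alphabet, seqlength):
    """Cut `text` into fixed-length sequences and one-hot encode them,
    returning an array of shape (seqlength, instances, dimension)."""
    char_to_idx = {c: i for i, c in enumerate(alphabet)}
    n = len(text) // seqlength                       # number of instances
    idx = np.array([char_to_idx[c] for c in text[:n * seqlength]])
    idx = idx.reshape(n, seqlength).T                # (seqlength, instances)
    # Indexing the identity matrix adds the one-hot `dimension` axis.
    return np.eye(len(alphabet), dtype=np.float32)[idx]
```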

Now, because your sources are three-dimensional arrays, you will need three-dimensional Theano tensors for your input and output layers (`tensor3`). This increase in dimensionality propagates to all other layers as well. In particular, you will need the `NDimensionalSoftmax` brick to compute the softmax of a 3D tensor, and your cost will look something like this:
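(A sketch, assuming the `NDimensionalSoftmax` brick is named `softmax` and `linear_output` holds the 3D pre-softmax activations:)

```python
# Sketch; `y` and `linear_output` are 3D tensors as described below.
cost = softmax.categorical_cross_entropy(y, linear_output,
                                         extra_ndim=1).mean()
cost.name = 'cost'
```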

where `y` is the gold-standard output and `linear_output` is the input to the softmax layer. Notice the `extra_ndim` parameter, which makes this work for 3D tensors.

We couldn’t find a nice way to evaluate a trained RNN on your own data (at sampling time). Here is how we tackled this problem:

- If you used `Checkpoint` to pickle your main loop during training, you can load it from the file using `blocks.serialization.load`. This will give you a `main_loop` object, which has a `model` attribute that contains the model. Unfortunately, because your main loop was defined around a `Model(cost)`, the model itself defines a computation graph that maps an input `x` and a target (gold) output `y` to the value of the cost function (feel free to verify this by drawing the Theano computation graph). This is not what you need in sampling.
- You want a Theano function that maps from `x` (your input source) to the values of the softmax layer. To obtain such a function, search through `model.variables` to find the variables for `x` (its name is the same as your input source) and for the outputs of your softmax layer (if you gave the layer the name `softmax` when defining the model, the variable will be called `softmax_log_probabilities_output`). You can then define a Theano function `f` that takes the `x` variable as input and the softmax variable as output. `f` expects a three-dimensional array as input with shape `(seqlength, instances, dimension)` (as described above), and returns a three-dimensional array of the same shape. You only want to feed it one (one-hot-encoded) character at a time and get a (one-dimensional) probability distribution over the next characters back. Thus, the `seqlength` and `instances` dimensions are now both 1. Simply wrap each input into a Numpy array of shape `(1,1,C)` (where `C` is the length of the one-hot encoding) and unwrap the output with `f(x)[0,0]` to get the probability distribution.
- You will also need to update the activation of the hidden layer of your RNN in each step. The activations of the RNN are stored in a shared variable called `initial_state`, which you will find as an element of `model.shared_variables`. You need to use the `updates` parameter of Theano's `function` function to update the value of this shared variable after each timestep. The easiest way is this: when you define your model (before training), set a name for the (intermediate) Theano variable holding the result of the computation of the hidden activations. (This is not a shared variable but a node in the computational graph.) Say your Theano variable is `h` and you assigned `h.name = 'hidden_activations'`. Then you can look for a variable `rv` with that name in `model.variables` at sampling time, and update the shared variable `initial_state` to `rv[0,0]` when you define the `function`.
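Put together, these steps might be wired up like this. This is a sketch: it assumes the names mentioned above (`x`, `softmax`, `hidden_activations`, `initial_state`) were set when the model was built, that they are unique in the graph, and that the checkpoint path is illustrative:

```python
import theano
from blocks.serialization import load

with open('checkpoint.tar', 'rb') as src:   # path is illustrative
    main_loop = load(src)
model = main_loop.model

by_name = {v.name: v for v in model.variables}
x_var = by_name['x']
probs_var = by_name['softmax_log_probabilities_output']
h_var = by_name['hidden_activations']
initial_state = {v.name: v for v in model.shared_variables}['initial_state']

# After each call, the stored initial state becomes the last hidden state,
# so the next call continues the sequence.
f = theano.function([x_var], probs_var,
                    updates=[(initial_state, h_var[0, 0])])
```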

There may be a prettier way to do it. If you find one, we look forward to seeing it.

- When you create a data stream out of your training set, the instances are probably provided in minibatches. This replaces the `instances` dimension of the shape with the minibatch size.
- It may be helpful to preprocess the training corpus and store the data in an HDF5 file. This will speed up your work on the training script.
- At sampling time, make sure that you convert the activations of the softmax layer into an actual probability distribution. Depending on the exact variable that you access, the values may still be the input scores x_i, and you need to calculate exp(x_i)/Z for each x_i, where Z is the normalization constant that ensures everything sums to one. Once you have this vector of probabilities, you can use `numpy.random.choice` to sample a value from this distribution.
- Code that solves this assignment is available on the Internet. However, it is either not written for Blocks (e.g. Karpathy’s original program), or it is rather opaque. Our experience has been that looking at that code is of limited use. In particular, please realize that the purpose of this assignment is to give you some practice in using RNNs, so that you are familiar with them when you do your project. This will not happen if you take too many shortcuts.
- You may be tempted to organize your training data in the shape `(instances, seqlength, dimension)` instead of `(seqlength, instances, dimension)`. If you do this, you can rearrange the shape of your training data stream as follows. Note that the `transpose_stream` function has to be in a separate module, so it can be loaded from your sampling script; otherwise you will get an error when you unpickle the model.

Instead of using training sequences of fixed length, identify natural sequences in your training data (e.g. sentences) and train your RNN to produce those. This will require you to work with variable-length sequences.
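If you go this route, building the padded source together with its mask might look like the following NumPy sketch (helper name and conventions are illustrative; Blocks expects the mask alongside the padded source):

```python
import numpy as np

def pad_with_mask(sequences, alphabet):
    """Pad variable-length character sequences to a common length and build
    the matching mask: 1.0 where a real character is present, 0.0 for padding."""
    char_to_idx = {c: i for i, c in enumerate(alphabet)}
    seqlength = max(len(s) for s in sequences)
    n, C = len(sequences), len(alphabet)
    data = np.zeros((seqlength, n, C), dtype=np.float32)
    mask = np.zeros((seqlength, n), dtype=np.float32)
    for j, s in enumerate(sequences):
        for t, c in enumerate(s):
            data[t, j, char_to_idx[c]] = 1.0
            mask[t, j] = 1.0
    return data, mask
```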

Play around with LSTMs and GRUs – more powerful architectures for recurrent networks – and see how they perform compared to RNNs.

As you may notice, the one-hot encoding of the characters inflates the dataset and results in a substantial `HostToGPU` overhead, i.e. a lot of processing time is spent moving minibatches from RAM onto the GPU. It is not as bad as for the one-hot-encoded word inputs with 10K+ dimensions from the last assignment. Still, there is potential for improvement, which is another nice optional challenge. Note that the `LookupTable` brick will not do the job this time, because you do not want to learn embeddings. Instead, you will need to map integer indices to pre-determined (and constant) one-hot encoding vectors. Test the performance increase of this approach using the profiler. Here are two ways to do this:
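To illustrate the basic mapping (independently of how you implement it in the framework): indexing a constant identity matrix turns an integer array of shape `(seqlength, instances)` into one-hot encodings of shape `(seqlength, instances, dimension)`, and the same indexing works on a constant shared variable living on the GPU:

```python
import numpy as np

C = 5                                     # alphabet size (illustrative)
one_hot_table = np.eye(C, dtype=np.float32)   # row i = one-hot encoding of i

indices = np.array([[3, 1], [0, 4]])      # (seqlength, instances) of ints
encoded = one_hot_table[indices]          # (seqlength, instances, C)
```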