Deep Learning for NLP
In this assignment, you will implement a character-based language model using RNNs. This will familiarize you with recurrent neural networks and give you a chance to sample fun random texts from your favorite corpus. For an explanation of the basic idea, along with some very cool visualizations, see Andrej Karpathy’s blog post on this topic.
Obtain a corpus of your choice. This could be one of the corpora provided by Karpathy, or some other corpus. Be aware of character encoding issues if you choose a non-ASCII corpus.
Implement an RNN that predicts the next character in a string when given the previous character as input. For example, on the string “hello”, your RNN will be given (a one-hot encoding of) the character “h” as input in timestep 1, and is supposed to produce (a softmax distribution that assigns the maximum probability to) the character “e” as output. Then your RNN receives “e” as input in timestep 2, and is expected to produce “l” as output; and so on. Feel free to use special symbols for the beginning and end of a sequence, or to work around these in some other way.
Train your RNN on your corpus and save the trained model to disk (e.g. with the Checkpoint extension from saveload). Then load the model and use it to sample a new text of a given size from your model. At each timestep t, use the input x_t and the hidden activations h_(t-1) as input. Compute the softmax layer output as a probability distribution for the next character and draw y_t at random from it. Then update the hidden activations to their new value, and feed y_t back to the RNN as the input character x_(t+1). Generate a few random texts for models at various stages of training (e.g. after one epoch, ten, fifty, …) and compare them to each other. Prepare a few fun texts so you can show them in class.
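For orientation, the training side with checkpointing might look roughly like the following sketch. It assumes a cost variable and a Fuel data_stream built as described in the sections below; the file name model.pkl, the learning rate and the epoch counts are placeholders, and keyword names can differ slightly between Blocks versions.

```python
from blocks.algorithms import GradientDescent, Scale
from blocks.extensions import FinishAfter, Printing
from blocks.extensions.saveload import Checkpoint
from blocks.graph import ComputationGraph
from blocks.main_loop import MainLoop
from blocks.model import Model

# 'cost' and 'data_stream' are assumed to be defined as discussed below.
cg = ComputationGraph(cost)
main_loop = MainLoop(
    algorithm=GradientDescent(cost=cost, parameters=cg.parameters,
                              step_rule=Scale(learning_rate=0.01)),
    data_stream=data_stream,
    model=Model(cost),
    extensions=[FinishAfter(after_n_epochs=50),
                Checkpoint('model.pkl', every_n_epochs=1),  # pickles the main loop
                Printing()])
main_loop.run()
```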
It is relatively straightforward to define a network containing an RNN layer with Blocks (use the SimpleRecurrent brick; a rough sketch is given below), but it is important that you provide the training data in the right form.
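A minimal sketch of such a network definition, assuming one-hot inputs of dimension C = 128 (an ASCII corpus), a hidden layer of 256 units, and the brick and variable names used in the rest of this description (all of these are placeholder choices):

```python
import theano.tensor as tensor
from blocks.bricks import Linear, Tanh, NDimensionalSoftmax
from blocks.bricks.recurrent import SimpleRecurrent
from blocks.initialization import Constant, IsotropicGaussian

C = 128           # length of the one-hot encoding (assumption: ASCII corpus)
hidden_dim = 256  # placeholder size of the hidden layer

# Inputs and targets are three-dimensional: (seqlength, instances, dimension).
x = tensor.tensor3('x')
y = tensor.tensor3('y')

x_to_h = Linear(name='x_to_h', input_dim=C, output_dim=hidden_dim,
                weights_init=IsotropicGaussian(0.01), biases_init=Constant(0))
rnn = SimpleRecurrent(dim=hidden_dim, activation=Tanh(), name='rnn',
                      weights_init=IsotropicGaussian(0.01))
h_to_o = Linear(name='h_to_o', input_dim=hidden_dim, output_dim=C,
                weights_init=IsotropicGaussian(0.01), biases_init=Constant(0))
softmax = NDimensionalSoftmax(name='softmax')   # used for the cost below

h = rnn.apply(x_to_h.apply(x))
h.name = 'hidden_activations'   # naming this variable pays off at sampling time
linear_output = h_to_o.apply(h)

for brick in (x_to_h, rnn, h_to_o):
    brick.initialize()
```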
Training data for an RNN in Blocks consists of a number of sequences. The RNN is trained to accept each sequence separately, producing the required outputs for that sequence. It is then reset to its (learned) initial state and trained to accept the next sequence correctly, and so on. Thus the training data has a three-dimensional structure. Each source has the Numpy shape (seqlength, instances, dimension), where
- seqlength is the maximum length of the training sequences;
- instances is the number of sequences that are provided in the training data;
- dimension is the dimension of the source, e.g. the length of the one-hot encodings.
If your training sequences have different lengths, you can use the mask parameter to tell that to Blocks. This would allow you to train the system to generate complete sentences. Alternatively, you could simply cut the training corpus into sequences of fixed length, which would simplify this step.
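To make this shape concrete, here is one possible way to build one-hot encoded input and target arrays, assuming an ASCII corpus that is simply cut into fixed-length sequences (the file name corpus.txt and both sizes are placeholders):

```python
import numpy as np

seqlength = 100   # fixed length of each training sequence (placeholder)
C = 128           # dimension of the one-hot encoding (assumption: ASCII)

with open('corpus.txt') as f:          # placeholder corpus file
    text = f.read()

# Cut the corpus into chunks of seqlength + 1 characters; the extra character
# provides the prediction target for the last position of each sequence.
instances = len(text) // (seqlength + 1)
x_data = np.zeros((seqlength, instances, C), dtype='float32')
y_data = np.zeros((seqlength, instances, C), dtype='float32')

for i in range(instances):
    chunk = text[i * (seqlength + 1):(i + 1) * (seqlength + 1)]
    for t in range(seqlength):
        x_data[t, i, ord(chunk[t])] = 1.0       # input character at timestep t
        y_data[t, i, ord(chunk[t + 1])] = 1.0   # character the RNN should predict

# x_data and y_data then become the 'x' and 'y' sources of your Fuel dataset.
```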
Now, because your sources are three-dimensional arrays, you will need three-dimensional Theano tensors (tensor3) for your input and output layers. This increase in dimensionality propagates to all the other layers as well. In particular, you will need the NDimensionalSoftmax brick to compute the softmax of a 3D tensor, and your cost will look something like this:
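```python
# 'softmax' is the NDimensionalSoftmax brick, 'y' the (3D) target source and
# 'linear_output' the pre-softmax activations; names follow the sketch above.
cost = softmax.categorical_cross_entropy(y, linear_output, extra_ndim=1).mean()
cost.name = 'cost'
```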
where y is the gold-standard output and linear_output holds the inputs to the Softmax layer. Notice the extra_ndim parameter, which makes this work for 3D tensors.
We couldn’t find a nice way to run a trained RNN on your own data (at sampling time). Here is how we tackled this problem:
- If you used the Checkpoint extension to pickle your main loop during training, you can load it from the file using blocks.serialization.load. This will give you a main_loop object, which has a model attribute that contains the model. Unfortunately, because your main loop was defined around a Model(cost), the model itself defines a computation graph that maps an input x and a target (gold) output y to the value of the cost function (feel free to verify this by drawing the Theano computation graph). This is not what you need in sampling.
- What you need instead is a function that maps x (your input source) to the values of the softmax layer. To obtain such a function, search through model.variables to find the variables for x (its name is the same as your input source) and the outputs of your softmax layer (if you gave the layer the name softmax when defining the model, the variable will be called softmax_log_probabilities_output). You can then define a Theano function f that takes the x variable as input and the softmax variable as output.
- This function f wants a three-dimensional array as input, with shape (seqlength, instances, dimension) as described above, and returns a three-dimensional array of the same shape. You only want to feed it one (one-hot-encoded) character at a time and get a (one-dimensional) probability distribution for the next character back. Thus the seqlength and instances dimensions are now both 1. Simply wrap each input into a Numpy array of shape (1, 1, C) (where C is the length of the one-hot encoding) and unwrap the output with f(x)[0,0] to obtain the probability distribution.
- Because you feed the RNN only one character per call, you also need to carry the hidden activations over from one call of f to the next. The RNN stores its (learned) initial state in a shared variable initial_state, which you will find as an element of model.shared_variables. Use the updates parameter of theano.function to update the value of this shared variable after each timestep. The easiest way is this: when you define your model (before training), set a name for the (intermediate) Theano variable holding the result of the computation of the hidden activations. (This is not a shared variable but a node in the computation graph.) Say your Theano variable is h and you assigned h.name = 'hidden_activations'. Then you can look for a variable rv with that name in model.variables at sampling time, and update the shared variable initial_state to rv[0,0] when you define the function.

There may be a prettier way to do this. If you find one, we look forward to seeing it.
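Putting these steps together, a sampling script might look roughly like the following sketch. The file name model.pkl, the variable names 'x', 'softmax_log_probabilities_output', 'hidden_activations' and 'initial_state', and C = 128 (an ASCII corpus) follow the suggestions above or are placeholder assumptions; adapt them to your own code.

```python
import numpy as np
import theano
from blocks.serialization import load

# Load the pickled main loop and pull out the model.
with open('model.pkl', 'rb') as src:
    main_loop = load(src)
model = main_loop.model

# Look up the relevant variables in the computation graph by name.
x_var = [v for v in model.variables if v.name == 'x'][0]
softmax_var = [v for v in model.variables
               if v.name == 'softmax_log_probabilities_output'][0]
h_var = [v for v in model.variables if v.name == 'hidden_activations'][0]
initial_state = [v for v in model.shared_variables
                 if v.name == 'initial_state'][0]

# f maps one input character to the log-probabilities of the next character;
# the update carries the hidden activations over to the next call.
f = theano.function([x_var], softmax_var,
                    updates=[(initial_state, h_var[0, 0])])

C = 128                                    # length of the one-hot encoding
inp = np.zeros((1, 1, C), dtype=theano.config.floatX)
inp[0, 0, ord('\n')] = 1.0                 # arbitrary start character

chars = []
for _ in range(500):                       # length of the sampled text
    log_probs = f(inp).reshape(-1).astype(np.float64)
    probs = np.exp(log_probs)              # softmax output is log-probabilities
    probs /= probs.sum()                   # renormalize against rounding errors
    idx = np.random.choice(C, p=probs)     # draw the next character at random
    chars.append(chr(idx))
    inp = np.zeros((1, 1, C), dtype=theano.config.floatX)
    inp[0, 0, idx] = 1.0

print(''.join(chars))
```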
If you train with minibatches, replace the instance dimension of the shape by the minibatch size. You may also find it more convenient to provide your training data with shape (instances, seqlength, dimension) instead of (seqlength, instances, dimension). If you do this, you can rearrange the shape of your training data stream with a small transpose_stream function, sketched below. Note that the transpose_stream function has to be in a separate module, so that it can be loaded from your sampling script; otherwise you will get an error when you unpickle the model.
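A possible transpose_stream, assuming Fuel's Mapping transformer and a hypothetical module name stream_utils.py:

```python
# stream_utils.py -- hypothetical module name; keeping the function here (rather
# than in the training script itself) lets the pickled main loop be unpickled
# from the sampling script without errors.

def transpose_stream(data):
    # Swap the instances and seqlength axes of every (higher-dimensional) source,
    # turning (instances, seqlength, dimension) into (seqlength, instances, dimension).
    return tuple(array.swapaxes(0, 1) if array.ndim > 1 else array
                 for array in data)
```

In the training script you would then import Mapping from fuel.transformers and transpose_stream from stream_utils, and wrap your stream as data_stream = Mapping(data_stream, transpose_stream).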
Instead of using training sequences of fixed length, identify natural sequences in your training data (e.g. sentences) and train your RNN to produce those. This will require you to work with variable-length sequences.
Play around with LSTMs and GRUs – more powerful architectures for recurrent networks – and see how they perform compared to RNNs.
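For example, swapping in the LSTM brick mainly changes the input transform, since the brick expects its input to have dimension 4 * hidden_dim (one slice per gate), and its apply method returns the cells alongside the hidden states. A sketch, reusing the names from the model sketch above:

```python
from blocks.bricks.recurrent import LSTM

# Replace SimpleRecurrent by an LSTM; C, hidden_dim and x as in the sketch above.
x_to_h = Linear(name='x_to_h', input_dim=C, output_dim=4 * hidden_dim,
                weights_init=IsotropicGaussian(0.01), biases_init=Constant(0))
lstm = LSTM(dim=hidden_dim, name='lstm', weights_init=IsotropicGaussian(0.01))
h, cells = lstm.apply(x_to_h.apply(x))
h.name = 'hidden_activations'
# Remember to initialize these bricks as before; at sampling time you would
# also have to carry over the cells between timesteps, analogously to the
# hidden activations.
```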
As you may notice, the one-hot encoding of the characters inflates the dataset and results in a substantial HostToGPU overhead, i.e. a lot of processing time is spent moving minibatches from RAM onto the GPU. It is not as bad as for the one-hot-encoded word inputs with 10K+ dimensions from the last assignment, but there is still potential for improvement, which is another nice optional challenge. Note that the LookupTable brick will not do the job this time because you do not want to learn embeddings. Instead, you will need to map integer indices to pre-determined (and constant) one-hot encoding vectors. Test the performance increase of this approach using the profiler. Here are two ways to do this: