SS 2016

Deep Learning for NLP

 

Assignment 3: Character RNNs

In this assignment, you will implement a character-based language model using RNNs. This will familiarize you with recurrent neural networks, and give you a chance to sample fun random texts from your favorite corpus. For an explanation of the basic idea, along with some very cool visualizations, see Andrej Karpathy’s blog post on this topic.

Your Task

Obtain a corpus of your choice. This could be one of the corpora provided by Karpathy, or some other corpus. Be aware of character encoding issues if you choose a non-ASCII corpus.

Implement an RNN that predicts the next character in a string when given the previous character as input. For example, on the string “hello”, your RNN will be given (a one-hot encoding of) the character “h” as input in timestep 1, and is supposed to produce (a softmax distribution that assigns the maximum probability to) the character “e” as output. Then your RNN receives “e” as input in timestep 2, and is expected to produce “l” as output; and so on. Feel free to use special symbols for the beginning and end of a sequence, or to work around these in some other way.

Train your RNN on your corpus and save the trained model to disk (e.g. with the Checkpoint extension from saveload). Then load the model and use it to sample a new text of a given length. At each timestep t, feed the input x_t and the previous hidden activations h_(t-1) to the RNN. Compute the softmax layer output as a probability distribution over the next character and draw y_t at random from it. Then update the hidden activations to their new value h_t, and feed y_t to the RNN as the next input character x_(t+1). Generate a few random texts for models at various stages of training (e.g. after one epoch, ten, fifty, …) and compare them to each other. Prepare a few fun texts so you can show them in class.
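The sampling loop described above can be sketched in plain NumPy, independent of Blocks. This is only an illustration under assumptions: the weight names (W_xh, W_hh, W_hy, b_h, b_y), the tanh activation and the random, untrained parameters are hypothetical stand-ins for whatever your trained model provides.

```python
import numpy as np

def softmax(z):
    # numerically stable softmax over a 1D vector
    e = np.exp(z - z.max())
    return e / e.sum()

def sample_text(params, h0, first_char, chars, length, rng):
    """Draw `length` characters, feeding each sampled character back in."""
    char_to_ix = {c: i for i, c in enumerate(chars)}
    h = h0
    x_ix = char_to_ix[first_char]
    out = [first_char]
    for _ in range(length - 1):
        # one-hot encoding of the current input character x_t
        x = np.zeros(len(chars))
        x[x_ix] = 1.0
        # new hidden activations h_t from x_t and h_(t-1)
        h = np.tanh(params["W_xh"].dot(x) + params["W_hh"].dot(h) + params["b_h"])
        # distribution over the next character; draw y_t at random from it
        p = softmax(params["W_hy"].dot(h) + params["b_y"])
        x_ix = rng.choice(len(chars), p=p)
        out.append(chars[x_ix])
    return "".join(out)

# Hypothetical, untrained parameters, just to make the sketch runnable.
chars = sorted(set("hello world"))
V, H = len(chars), 16
rng = np.random.RandomState(0)
params = {
    "W_xh": rng.randn(H, V) * 0.1,
    "W_hh": rng.randn(H, H) * 0.1,
    "W_hy": rng.randn(V, H) * 0.1,
    "b_h": np.zeros(H),
    "b_y": np.zeros(V),
}
text = sample_text(params, np.zeros(H), "h", chars, length=20, rng=rng)
print(text)
```

With untrained weights the output is noise; the point is only the order of operations inside the loop.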

Training an RNN

It is relatively straightforward to define a network containing an RNN layer with Blocks (use the SimpleRecurrent brick), but it is important that you provide the training data in the right form.

Training data for an RNN in Blocks consists of a number of sequences. The RNN is trained to accept each sequence separately, producing the required outputs for that sequence. It is then reset to its (learned) initial state and trained to accept the next sequence correctly, and so on. Thus the training data has a three-dimensional structure. Each source has the NumPy shape (seqlength, instances, dimension), where seqlength is the number of timesteps in each sequence, instances is the number of sequences, and dimension is the size of the vector at each timestep (here, the size of the one-hot character encoding).

If your training sequences have different lengths, you can use the mask parameter to tell that to Blocks. This would allow you to train the system to generate complete sentences. Alternatively, you could simply cut the training corpus into sequences of fixed length, which would simplify this step.
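As a rough illustration of the fixed-length option, the following NumPy sketch cuts a text into sequences and produces one-hot inputs and targets in the (seqlength, instances, dimension) layout. The function name make_sequences and the toy corpus are hypothetical; how you wrap the arrays into a dataset is up to you.

```python
import numpy as np

def make_sequences(text, seq_length):
    """Cut `text` into fixed-length sequences of one-hot vectors.

    Returns inputs x and targets y, both of shape
    (seqlength, instances, dimension), plus the character inventory.
    The target at each timestep is the character that follows the input.
    """
    chars = sorted(set(text))
    char_to_ix = {c: i for i, c in enumerate(chars)}
    ids = np.array([char_to_ix[c] for c in text])

    # number of full sequences; the -1 leaves room for the last target
    n = (len(ids) - 1) // seq_length
    x_ids = ids[:n * seq_length].reshape(n, seq_length)
    y_ids = ids[1:n * seq_length + 1].reshape(n, seq_length)

    # rows of the identity matrix are the one-hot vectors
    eye = np.eye(len(chars), dtype="float32")
    # (instances, seqlength, dimension) -> (seqlength, instances, dimension)
    x = eye[x_ids].swapaxes(0, 1)
    y = eye[y_ids].swapaxes(0, 1)
    return x, y, chars

x, y, chars = make_sequences("hello world, hello rnn", seq_length=5)
print(x.shape, y.shape)
```

Leftover characters at the end of the corpus that do not fill a whole sequence are simply dropped in this sketch.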

Because your sources are three-dimensional arrays, you will need three-dimensional Theano tensors for your input and output layers (tensor3). This extra dimension propagates to all other layers as well. In particular, you will need the NDimensionalSoftmax brick to compute the softmax of a 3D tensor, and your cost will look something like this:

cost = softmax.categorical_cross_entropy(y, linear_output, extra_ndim=1).mean()

where y is the gold-standard output and linear_output is the input to the softmax layer. Notice the extra_ndim parameter, which makes this work for 3D tensors.
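To make the role of the extra dimension concrete, here is a NumPy sketch of a cross-entropy over a 3D tensor. It is not Blocks' implementation. It only illustrates the idea of treating the first two axes (timesteps and instances) as one long list of independent predictions, and it assumes that the gold-standard outputs are given as integer character indices.

```python
import numpy as np

def cross_entropy_3d(y_ids, linear_output):
    """Per-position cross-entropy for a 3D tensor of softmax inputs.

    y_ids:         int array of shape (seqlength, instances)
    linear_output: float array of shape (seqlength, instances, dimension)
    Returns an array of shape (seqlength, instances).
    """
    T, N, D = linear_output.shape
    # flatten timesteps and instances into one axis of T*N predictions
    flat = linear_output.reshape(T * N, D)
    # log-softmax, shifted by the row maximum for numerical stability
    flat = flat - flat.max(axis=1, keepdims=True)
    log_probs = flat - np.log(np.exp(flat).sum(axis=1, keepdims=True))
    # negative log-probability assigned to the gold-standard character
    losses = -log_probs[np.arange(T * N), y_ids.reshape(T * N)]
    return losses.reshape(T, N)

T, N, D = 5, 4, 10
y_ids = np.zeros((T, N), dtype=int)
linear_output = np.zeros((T, N, D))  # all-zero inputs give a uniform softmax
cost = cross_entropy_3d(y_ids, linear_output).mean()
print(cost)
```

For a uniform distribution over D characters, the cost equals log(D), which is a useful sanity check for an untrained model.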

Sampling from an RNN

We couldn’t find a nice way to evaluate a trained RNN on your own data (at sampling time). Here is how we tackled this problem:

There may be a prettier way to do it. If you find one, we look forward to seeing it.

Tips

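from fuel.transformers import Mapping

def transpose_stream(data):
    # swap the first two axes of both sources,
    # e.g. (instances, seqlength, ...) -> (seqlength, instances, ...)
    return (data[0].swapaxes(0, 1), data[1].swapaxes(0, 1))

data_stream = Mapping(data_stream, transpose_stream)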
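The effect of this transposition can be checked with NumPy alone. The sketch below assumes, hypothetically, that a batch arrives as a pair of arrays in (instances, seqlength, dimension) order; the concrete sizes are made up.

```python
import numpy as np

def transpose_stream(data):
    # swap the first two axes of both sources
    return (data[0].swapaxes(0, 1), data[1].swapaxes(0, 1))

# hypothetical batch: 32 sequences of 50 timesteps over 65 characters
batch = (np.zeros((32, 50, 65)), np.zeros((32, 50, 65)))
x, y = transpose_stream(batch)
print(x.shape)  # (50, 32, 65)
```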

Optional Tasks

Instead of using training sequences of fixed length, identify natural sequences in your training data (e.g. sentences) and train your RNN to produce those. This will require you to work with variable-length sequences.

Play around with LSTMs and GRUs – more powerful architectures for recurrent networks – and see how they perform compared to RNNs.

As you may notice, the one-hot encoding of the characters inflates the dataset and results in a substantial HostToGPU overhead, i.e. a lot of processing time is spent moving minibatches from RAM onto the GPU. It is not as bad as for the one-hot-encoded word inputs with 10K+ dimensions from the last assignment. Still, there is potential for improvement, which is another nice optional challenge. Note that the LookupTable brick will not do the job this time because you do not want to learn embeddings. Instead, you will need to map integer indices to pre-determined (and constant) one-hot encoding vectors. Test the performance increase of this approach using the profiler. Here are two ways to do this:

import os
import theano

# define when using GPU for better profiling results
# Note: it's generally recommended to use the CPU for profiling
print('using %s' % theano.config.device)
if theano.config.device[:3] == 'gpu':
    os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

# get a batch of data
# depending on your dataset implementation, this may look slightly different
batch_size = 128
a = dataset.get_data(request=range(batch_size)) 

# option #1: use forward propagation (without cost)
from theano import function
func1 = function(inputs, outputs, profile=True) # theano variable names will vary based on your code
o = func1(*a[1:]) # exclude targets (assuming targets is first source), otherwise change this
func1.profile.summary()

# option #2: work with actual gradient computation (backprop)
from blocks.algorithms import GradientDescent
algorithm = GradientDescent(cost=cost, parameters=cg.parameters,   # cg is the ComputationalGraph
                            theano_func_kwargs=dict(profile=True)) # like normal call but with theano_func_kwargs
algorithm.initialize()
func2 = algorithm._function
o = func2(*a) # needs targets as cost input
func2.profile.summary()
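The index-to-one-hot mapping from the last optional task can be sketched in NumPy as a lookup into a constant identity matrix. The helper name one_hot_lookup and the batch sizes are hypothetical. Expressing the same lookup inside your Theano graph, so that only the integer indices are transferred to the GPU, is left to you and is not shown here.

```python
import numpy as np

def one_hot_lookup(indices, vocab_size, dtype="float32"):
    """Map an integer array of any shape to constant one-hot vectors.

    The lookup table is the identity matrix: row i is the one-hot
    vector for character i. Nothing in it is learned.
    """
    table = np.eye(vocab_size, dtype=dtype)
    return table[indices]

# hypothetical minibatch of character indices, shape (seqlength, instances)
rng = np.random.RandomState(0)
indices = rng.randint(0, 65, size=(50, 128)).astype("int32")
one_hot = one_hot_lookup(indices, vocab_size=65)

print(one_hot.shape)
# compare how many bytes each representation occupies
print(indices.nbytes, one_hot.nbytes)
```

The size comparison at the end shows why shipping indices instead of one-hot vectors reduces the amount of data moved per minibatch.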