Deep Learning for NLP
Assignment 4 is Part One of a two-part assignment. Our eventual goal is to reimplement Andrej Karpathy’s Neuraltalk system, which translates images into natural-language descriptions, such as “a little boy on the beach trying to fly a kite”. However, this has turned out to be a bit of a project, and we are therefore splitting it up over two weeks.
In this assignment, you will use a pre-trained convolutional network to find out what objects are in a given image. We are using the MSCOCO dataset, which consists of about 200,000 real-world images, each of which has been labeled with multiple short natural-language descriptions. You can explore the MSCOCO data on the “Explore” tab of the website.
Training a CNN that performs well on this task is expensive. We will therefore use a network that has already been trained (on a different dataset): the 16-layer model from the Visual Geometry Group at Oxford (see here to get a sense of the network’s structure). Unfortunately, this network was created using the Caffe library, whose file format is not compatible with Theano. You can obtain a Blocks model as follows:

- The human-readable names of the 1000 image classes can be read out of the model’s metadata via x["meta"][0]["classes"][0][0][0][1][0] (where x is the loaded model data).
- Create an input variable of type tensor4 of dtype float32 and apply all bricks to it, one by one.
- Create a Theano function from your input variable to the output of the softmax brick (use the parameter allow_input_downcast=True). The function takes 4-dimensional tensors as input (# instances * # color channels * image width * image height). For the MSCOCO data, there are three color channels and the pictures are 256x256 pixels in size. The function returns a vector of length 1000 specifying the softmax distribution over the possible image labels. A rough sketch of this pipeline is given below.
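To make these steps concrete, here is a minimal sketch of the pipeline. It assumes the converted model arrives as a pickled list of Blocks bricks whose last element is the softmax; the file name vgg16_blocks.pkl and the variable names are placeholders, not part of the assignment.

```python
import pickle
import numpy
import theano
import theano.tensor as T

# Load the converted VGG-16 model. This sketch assumes a pickled list of
# Blocks bricks; adapt the loading code to whatever format you actually have.
with open("vgg16_blocks.pkl", "rb") as f:   # placeholder file name
    bricks = pickle.load(f)

# Symbolic input: (# instances, # color channels, image width, image height).
x = T.tensor4("images", dtype="float32")

# Apply all bricks to the input, one by one; the last brick is the softmax.
h = x
for brick in bricks:
    h = brick.apply(h)

# allow_input_downcast lets you pass in arrays that are not float32 already.
classify = theano.function([x], h, allow_input_downcast=True)

# Example call on a single (black) 256x256 RGB image.
image = numpy.zeros((1, 3, 256, 256), dtype="float32")
probs = classify(image)        # one row of 1000 class probabilities per image
print(probs.argmax(axis=1))
```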
Now obtain the MSCOCO data. You can download it from their website. One of the downloads is very large (13 GB), but all the data is also available on medusa in the directory /projects/korpora/mscoco. Feel free to scp it to your laptop or an external drive, ideally from within the university network. However, please avoid unzipping the MSCOCO Zip files in your home directories on medusa, as this would put a considerable strain on the file server.
Follow the instructions on the neuraltalk2 Github page for converting the MSCOCO dataset into a single huge HDF5 file (under “I’d like to train my own network on MS COCO”). Beware of the one broken image that you have to fix by hand. Karpathy’s preprocessing script also creates a JSON file, which maps from image positions in the list to image metadata, and from word IDs (starting at 1) to the actual words. Use an HDF viewer (e.g. this one) to explore the data file. Warning: the “images” dataset is huge.
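If you prefer to peek at the preprocessed files from Python instead of a GUI viewer, something along the following lines should work. The file names cocotalk.h5 and cocotalk.json are the defaults of Karpathy’s preprocessing script, and the key names ("images", "ix_to_word") are assumptions you should double-check against your own files.

```python
import json
import h5py

# Open the preprocessed data files (names are the prepro script's defaults).
h5 = h5py.File("cocotalk.h5", "r")
with open("cocotalk.json") as f:
    meta = json.load(f)

# List the datasets and their shapes without loading anything into memory.
for name, dataset in h5.items():
    print(name, dataset.shape, dataset.dtype)

# The JSON metadata maps word IDs (as strings, starting at "1") to words.
print(meta["ix_to_word"]["1"])

# h5py reads lazily, so indexing only loads the single image you ask for.
first_image = h5["images"][0]      # expected shape: (3, 256, 256)
print(first_image.shape, first_image.dtype)
```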
Finally, write a script that can read images from the HDF5 file, apply the CNN to them, and output the words that have a high probability. Play around with your script a bit and see how well it works. If you want to do this on medusa (where the CNN can run on the GPU), feel free to use the HDF5 and JSON files in /projects/korpora/mscoco/coco. A rough sketch of such a script follows below.
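Here is one way such a script could look. It builds on the hypothetical classify function from the CNN sketch above and assumes a list class_names holding the 1000 label strings read from the model metadata; adjust the names and paths to your own setup.

```python
import h5py
import numpy

# Point this at your local copy, or at the files under
# /projects/korpora/mscoco/coco when running on medusa.
h5 = h5py.File("cocotalk.h5", "r")

def describe(index, top_k=5):
    """Run the CNN on image number `index` and print its most likely labels."""
    image = h5["images"][index]                       # (3, 256, 256) array
    batch = image[numpy.newaxis].astype("float32")    # add the batch dimension
    probs = classify(batch)[0]                        # classify: see sketch above
    for class_id in numpy.argsort(probs)[::-1][:top_k]:
        # class_names is assumed to be the list of 1000 labels read from the
        # model metadata (first bullet point above).
        print(class_names[class_id], probs[class_id])

describe(0)
```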
As a secondary project this week, redo Assignment 3 using sequence generators. A sequence generator combines a recurrent network brick (e.g. an RNN or an LSTM) with infrastructure that emits outputs based on the RNN’s state, such that the output of the previous timestep is fed back into the RNN as input at the next timestep. There is also support for attention bricks, which you can ignore for now.
You will be pleased to see how short your Char-RNN code will become through the use of sequence generators. Furthermore, it will be useful to familiarize yourself with sequence generators for the more complex models we will build in the future, and as a starting point for learning about attention.
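To give you an idea of what this looks like, here is a minimal sketch of a character-level sequence generator. The dimensions, the SimpleRecurrent transition, and the initialization schemes are illustrative choices, not a prescribed architecture.

```python
import theano.tensor as T
from blocks.bricks import Tanh
from blocks.bricks.recurrent import SimpleRecurrent
from blocks.bricks.sequence_generators import (
    SequenceGenerator, Readout, SoftmaxEmitter, LookupFeedback)
from blocks.initialization import IsotropicGaussian, Constant

alphabet_size = 100    # number of distinct characters (illustrative)
hidden_dim = 512       # size of the recurrent state (illustrative)

transition = SimpleRecurrent(dim=hidden_dim, activation=Tanh(),
                             name="transition")
generator = SequenceGenerator(
    Readout(readout_dim=alphabet_size,
            source_names=["states"],
            emitter=SoftmaxEmitter(name="emitter"),
            feedback_brick=LookupFeedback(alphabet_size, hidden_dim,
                                          name="feedback"),
            name="readout"),
    transition,
    weights_init=IsotropicGaussian(0.01), biases_init=Constant(0),
    name="generator")
generator.initialize()

# y holds the character IDs: an lmatrix of shape (sequence length, batch size).
y = T.lmatrix("characters")
cost = generator.cost(y)    # training cost of the minibatch
```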
Here are some important tips for working with sequence generators:
- The training sequences are passed to the sequence generator as a Theano variable of type lmatrix. The first dimension is position in the sequence, and the second dimension is position in the list of training instances (within the minibatch). The variable cannot have more than two dimensions. This means that you must encode each character ID as an int, not as a one-hot vector. Use a LookupFeedback brick as the feedback_brick to get this behavior.
- To reuse a trained model for generation, build the Model in code, exactly as you would for training (except that you can leave bricks out, e.g. for the cost function). You can then use the function blocks.serialization.load_parameters and the method model.set_parameters to set only the parameters of the model (weights and such) from the file. This has the huge advantage that you have all the tensor variables that your model contains in hand, and can define your Theano functions from them. Note that load_parameters is only available in the bleeding-edge version of Blocks, and that you must store your model in a file with a name of the form “*.tar” (e.g., “model.tar”) for this to work.
- To generate characters one at a time, compile a Theano function over generator.generate(n_steps=y.shape[0], batch_size=y.shape[1]), where y is a (1,1) lmatrix. This function maps a (1,1) lmatrix containing the input character to a triple of values: the next internal state of the RNN (as a (1,1,D)-dimensional tensor, where D is the size of the state); a randomly chosen output character (as a (1,1)-dimensional lmatrix); and the transition cost. See the sketch after this list.
- To carry the RNN state over from one generation step to the next, look up the initial state via model.get_parameter_dict()["/generator/with_fake_attention/transition.initial_state"], and call its set_value method to update the state.
- The emit method in the SoftmaxEmitter (which you probably want to use as part of your sequence generator) does choose at random from the multinomial distribution over output values that the softmax brick defines. However, the random number generator is initialized with the same seed every time the program runs; thus any two runs of your program will always yield the exact same “random” output. You can fix this by making your own version of SoftmaxEmitter whose __init__ method executes self.theano_rng = MRG_RandomStreams(seed=random.randint(0,100000)) or something similar.
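A rough sketch that strings these last hints together is given below. The class name SeededSoftmaxEmitter and the loop details are my own guesses at one possible arrangement; generator, model, and y are assumed to be built as in the sketch further above (with the custom emitter used as the emitter), and the trained parameters loaded into model as described in the tips.

```python
import random
import numpy
from theano.sandbox.rng_mrg import MRG_RandomStreams
from blocks.bricks.sequence_generators import SoftmaxEmitter
from blocks.graph import ComputationGraph

class SeededSoftmaxEmitter(SoftmaxEmitter):
    """A SoftmaxEmitter whose random stream is seeded differently on every run."""
    def __init__(self, **kwargs):
        super(SeededSoftmaxEmitter, self).__init__(**kwargs)
        self.theano_rng = MRG_RandomStreams(seed=random.randint(0, 100000))

# generator, model, and y are assumed to exist already (see above).
# One generation step: the compiled function maps a (1,1) lmatrix holding the
# previous character to (next RNN state, sampled character, transition cost).
# Going through ComputationGraph also takes care of the random-stream updates.
cg = ComputationGraph(
    generator.generate(n_steps=y.shape[0], batch_size=y.shape[1]))
sample = cg.get_theano_function()

# The initial-state parameter of the transition; updating it by hand carries
# the RNN state over from one call of sample to the next.
initial_state = model.get_parameter_dict()[
    "/generator/with_fake_attention/transition.initial_state"]

char = numpy.array([[0]], dtype="int64")       # ID of some start character
for _ in range(200):
    state, char, cost = sample(char)
    initial_state.set_value(state[0, 0])       # feed the new state back in
    print(char[0, 0])
```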