In Assignment 4, you imported a CNN for image recognition into Blocks. This CNN maps a 256x256 image to a 1000D vector whose positions correspond to class labels from the ImageNet challenge. In Assignment 5, you will implement a system that learns to map these 1000D vectors into longer natural-language descriptions of images, such as “a man surfing on an ocean wave” or “a little boy on the beach trying to fly a kite”, using a recurrent network. You will then combine the two parts into an overall system that maps images to natural-language descriptions.
First, prepare the training data for the new network. Generate an HDF5 file (we call it “coco.hdf5” below) with two sources: one that contains the 1000D encodings of all MSCOCO images, and one that contains the annotated word sequences. Check the documentation of the cocotalk files on the neuraltalk GitHub page. Both sources need to have the same number of instances, so you will have to repeat the image encoding for every annotation of the same image. Keep in mind that the word sequences will be the input for a recurrent network, so make sure this source has the correct three-dimensional structure that you know from Assignments 3 and 4. Because you will load this HDF5 file into a Fuel dataset and use it for training, you need to include a training/testing split (as in Assignment 1); it can be a trivial split in which you use all instances for training and none for testing. Create this HDF5 file early in the week; it can take an hour or two to generate.
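The alignment of the two sources can be sketched as follows. This is a minimal sketch under stated assumptions, not the reference solution: the function names `align_sources` and `write_hdf5`, the source names `image_vec` and `sequence`, the padding of short sequences with 0, and the (instances, time, 1) layout are all chosen for illustration, so check the layout against what you used in Assignments 3 and 4. `write_hdf5` relies on h5py and on Fuel's `H5PYDataset.create_split_array`, and it declares only a `train` subset that covers all instances.

```python
import numpy as np


def align_sources(encodings, captions, max_len, pad=0):
    """Repeat each image encoding once per caption and pad the captions.

    encodings: array of shape (n_images, dim)
    captions:  captions[i] is a list of word-id lists for image i
    Returns (image_vec, sequence) with the same number of rows.
    """
    n = sum(len(caps) for caps in captions)
    image_vec = np.zeros((n, encodings.shape[1]), dtype='float32')
    # Assumed layout: (instances, time, features) with one feature per step.
    sequence = np.full((n, max_len, 1), pad, dtype='int32')
    row = 0
    for img_idx, caps in enumerate(captions):
        for cap in caps:
            cap = cap[:max_len]                      # truncate long captions
            image_vec[row] = encodings[img_idx]      # repeated encoding
            sequence[row, :len(cap), 0] = cap
            row += 1
    return image_vec, sequence


def write_hdf5(path, image_vec, sequence):
    """Write both sources plus a trivial all-training split (needs h5py, Fuel)."""
    import h5py
    from fuel.datasets.hdf5 import H5PYDataset

    n = image_vec.shape[0]
    with h5py.File(path, mode='w') as f:
        f.create_dataset('image_vec', data=image_vec)
        f.create_dataset('sequence', data=sequence)
        split = {'train': {'image_vec': (0, n), 'sequence': (0, n)}}
        f.attrs['split'] = H5PYDataset.create_split_array(split)
```

With two images that have two and one captions respectively, `align_sources` returns three rows in each source, and the first two rows of `image_vec` are identical.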
Now implement a neural network that maps from 1000D vectors into word sequences. We follow an encoder-decoder architecture, in which the CNN from Assignment 4 serves as the encoder, and its output serves as the initial state of the RNN decoder. Unfortunately, the standard Blocks classes for recurrent networks (such as SimpleRecurrent) assume that the initial state is a parameter whose value should be learned from data, not given to the RNN as input. Therefore you need to implement your own recurrent brick. For instance, you can use a MySimpleRecurrent brick (shown below) as a replacement for a SimpleRecurrent brick. Notice that the methods initial_state and apply are now annotated to say that they want to be evaluated with respect to a context named context. The apply method ignores the value of this context, but initial_state returns it as the value of the RNN’s initial state.
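To make the idea of a context-supplied initial state concrete, here is a framework-independent NumPy sketch. It is not the MySimpleRecurrent brick and uses no Blocks API; the function name `run_recurrent`, the tanh activation, and the (time, batch, dim) layout are assumptions for illustration. The only point it makes is that the first hidden state is taken from the `context` argument instead of from a learned parameter.

```python
import numpy as np


def run_recurrent(inputs, context, W):
    """Simple recurrent transition with an externally supplied initial state.

    inputs:  array of shape (time, batch, dim), already projected to dim
    context: array of shape (batch, dim), used as the initial state
    W:       array of shape (dim, dim), the state-to-state weights
    Returns all hidden states, shape (time, batch, dim).
    """
    states = context                      # initial state comes from the input
    all_states = []
    for x_t in inputs:                    # iterate over the time axis
        states = np.tanh(x_t + states.dot(W))
        all_states.append(states)
    return np.stack(all_states)
```

In the assignment setting, `context` would hold the image encodings, so the state dimension has to match the encoding dimension unless you add a projection in between.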
Train your model on the HDF5 file which you created above, taking care to swap the first and second axes as in Assignments 3 and 4. One challenge is that you need to pass both the image-encoding source and the word-sequence source to the training algorithm. The correct way is to apply the cost function (e.g., the cost method of a SequenceGenerator) to the word-sequence source (because this is the sequence the RNN should learn to produce), and to pass the image-encoding source as a named argument called context (because the image encodings are meant to be the initial states of the RNNs); that is, use a call like cost(sequence, context=image_vec). We have found that the Adam step rule, combined with some gradient clipping, works well. Save the parameters of your model into a file model.tar (remember: it needs to be *.tar so that the parameter values are saved separately and you can load them separately later).
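Two small pieces of this training setup can be illustrated without Blocks. The sketch below assumes that a batch arrives as a tuple (image_vec, sequence) with the sequence shaped (batch, time, features); the source order and the function names `swap_time_and_batch` and `clip_by_norm` are hypothetical. The clipping function only illustrates the general idea of rescaling gradients by their global L2 norm; in practice you would use the step rules that Blocks provides.

```python
import numpy as np


def swap_time_and_batch(batch):
    """Turn the sequence source from (batch, time, ...) into (time, batch, ...).

    The image encodings are two-dimensional (batch, dim) and stay unchanged.
    """
    image_vec, sequence = batch
    return image_vec, np.swapaxes(sequence, 0, 1)


def clip_by_norm(grads, threshold):
    """Rescale a list of gradient arrays if their joint L2 norm is too large."""
    norm = np.sqrt(sum(float((g ** 2).sum()) for g in grads))
    if norm <= threshold:
        return grads
    return [g * (threshold / norm) for g in grads]
```

A function with the shape of `swap_time_and_batch` (tuple in, tuple out) is the kind of callable that a Fuel `Mapping` transformer applies to each batch of a data stream.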
Finally, write code that takes an image as input, applies the CNN to obtain its vector encoding, and then computes the best output sequence using the model you just trained. You can use the BeamSearch class from Blocks to compute the k-best output sequences of a SequenceGenerator. You can tell its search method to generate output words (encoded as ints) until an end-of-sequence value is generated or up to a given maximum length. In our case, the EOS value is 0 (look at the original cocotalk HDF5). Here are some tips:
samples parameter. In our implementation (which is based on sequence generators), the following works for the
BeamSearch does not work with the MySimpleRecurrent class provided below, because its apply method declares a context but does not use it in the computation graph it generates. (Try it and observe the error messages yourself.) You can either try to fix this as described here, or you can use the following hack. The classes SimpleRecurrent and MySimpleRecurrent contain exactly the same parameters, except that SimpleRecurrent also contains a parameter vector for the trained initial state. Because we want to set the value of the initial state ourselves, and are thus not interested in a trained value, you can create a variant of your network with a SimpleRecurrent brick at evaluation time, and load its parameters from model.tar. You can adapt the method set_parameter_values from Model so that it does not check for missing parameters.
set_value on it. In our implementation, the following statement will do it:
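Independently of the tips above, it can help to see what a k-best search with an end-of-sequence value and a maximum length does. The following pure-Python sketch is a generic beam search, not the Blocks BeamSearch class and not its API; the function `beam_search`, the callback `step_logprobs`, and the toy model are all hypothetical. It assumes, as in the text, that word id 0 marks the end of a sequence.

```python
import math


def beam_search(step_logprobs, beam_size, eos=0, max_length=20):
    """Return up to beam_size (log-probability, word-id list) pairs, best first.

    step_logprobs(prefix) must return a dict that maps each possible next
    word id to its log-probability given the prefix generated so far.
    """
    beams = [(0.0, [])]                   # unfinished hypotheses
    finished = []                         # hypotheses that ended with eos
    for _ in range(max_length):
        candidates = []
        for score, prefix in beams:
            for word, logp in step_logprobs(prefix).items():
                candidates.append((score + logp, prefix + [word]))
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = []
        for score, seq in candidates[:beam_size]:
            if seq[-1] == eos:
                finished.append((score, seq))
            else:
                beams.append((score, seq))
        if not beams:                     # every kept hypothesis has ended
            break
    finished.extend(beams)                # hypotheses cut off at max_length
    finished.sort(key=lambda c: c[0], reverse=True)
    return finished[:beam_size]


def toy_model(prefix):
    """Hypothetical three-word model over the ids 0 (EOS), 1 and 2."""
    if not prefix:
        return {1: math.log(0.6), 2: math.log(0.4)}
    if prefix[-1] == 1:
        return {0: math.log(0.7), 2: math.log(0.3)}
    return {0: math.log(0.9), 1: math.log(0.1)}


best = beam_search(toy_model, beam_size=2, max_length=5)
```

In the real system, the role of `step_logprobs` is played by the trained sequence generator, conditioned on the image encoding as its initial state.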
Put all the pieces together and write a nice front-end that will allow you to generate label sequences for any picture in the MSCOCO dataset. Explore the dataset a bit and bring your best (positive and negative) examples to class.
If you would like to improve your system, here are some ideas you could try.