Assignment 5: Image labeling

In Assignment 4, you imported a CNN for image recognition into Blocks. This CNN maps a 256x256 image to a 1000D vector whose positions correspond to class labels from the ImageNet challenge. In Assignment 5, you will implement a system that learns to map these 1000D vectors into longer natural-language descriptions of images, such as “a man surfing on an ocean wave” or “a little boy on the beach trying to fly a kite”, using a recurrent network. You will then combine both parts to obtain an overall system that maps images into natural-language descriptions.

Your Task

First, prepare the training data for the new network. Generate an HDF5 file (we call it “coco.hdf5” below) with two sources: one that contains the 1000D encodings of all MSCOCO images, and one that contains the annotated word sequences. Check the documentation of the cocotalk files on the neuraltalk GitHub page. Both sources need to have the same number of instances, so you will have to repeat the image encoding once for each annotation of the same image. Keep in mind that the word sequences will be the input for a recurrent network, so make sure this source has the correct three-dimensional structure that you know from Assignments 3 and 4. Because you will load this HDF5 file into a Fuel dataset and use it for training, you need to define a training/testing split (as in Assignment 1); it can be a trivial split in which you use all instances for training and none for testing. Create this HDF5 file early in the week; it can take an hour or two to generate.
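
The alignment of the two sources can be sketched with NumPy as follows. This is only a minimal illustration under stated assumptions, not the reference solution: the function name align_sources, the per-image caption counts, and the trailing singleton axis of the word sequences are all hypothetical. Check the cocotalk documentation and your code from Assignments 3 and 4 for the actual layout.

```python
import numpy as np

def align_sources(encodings, captions, captions_per_image):
    """Repeat each image encoding once per caption of that image.

    encodings:          (num_images, enc_dim) float array of CNN outputs
    captions:           (num_captions, max_len) int array, padded with 0 (EOS)
    captions_per_image: (num_images,) ints; captions are ASSUMED to be
                        stored grouped by image, in image order
    """
    captions_per_image = np.asarray(captions_per_image)
    if captions_per_image.sum() != len(captions):
        raise ValueError("caption counts do not match the number of captions")
    # Row i of the result is the encoding of the image that caption i describes.
    image_vecs = np.repeat(encodings, captions_per_image, axis=0)
    # ASSUMED 3D layout: (instance, time step, feature) with a single feature.
    sequences = captions[:, :, np.newaxis]
    return image_vecs, sequences

# Toy data: 2 images with 4-dimensional "encodings", 3 captions in total.
encodings = np.array([[1., 0., 0., 0.], [0., 1., 0., 0.]])
captions = np.array([[5, 7, 0], [5, 8, 9], [3, 0, 0]])
image_vecs, sequences = align_sources(encodings, captions, [2, 1])
```

Both arrays now have the same first dimension, so they could be written to the HDF5 file as two sources with one row per annotation.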

Now implement a neural network that maps from 1000D vectors into word sequences. We follow an encoder-decoder architecture, in which the CNN from Assignment 4 serves as the encoder, and its output serves as the initial state of the RNN decoder. Unfortunately, the standard Blocks classes for recurrent networks (such as SimpleRecurrent) assume that the initial state is a parameter whose value should be learned from data, not one that is given to the RNN as input. Therefore you need to implement your own recurrent brick. For instance, you can use a MySimpleRecurrent brick (shown below) as a replacement for a SimpleRecurrent brick. Notice that the methods initial_states and apply are now annotated to declare that they are evaluated with respect to a context named context. The apply method ignores the value of this context, but initial_states returns it as the value of the RNN’s initial state.
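
To make the idea concrete independently of Blocks, here is a minimal NumPy sketch of a simple recurrent forward pass whose initial state is passed in rather than learned. The function name run_rnn, the tanh activation, and the (batch, dim) layout of the context are assumptions made for illustration; the Blocks brick may arrange its axes differently.

```python
import numpy as np

def run_rnn(inputs, context, W, mask=None):
    """Simple recurrent forward pass with a supplied initial state.

    inputs:  (time, batch, dim) inputs, already projected to the state size
    context: (batch, dim) vectors used as the initial states
    W:       (dim, dim) state-to-state weights
    mask:    optional (time, batch) array, 1 for real steps, 0 for padding
    """
    states = context  # the initial state is given as input, not trained
    history = []
    for t in range(inputs.shape[0]):
        next_states = np.tanh(inputs[t] + states.dot(W))
        if mask is not None:
            m = mask[t][:, None]
            # Where the mask is 0, carry the previous state forward unchanged.
            next_states = m * next_states + (1 - m) * states
        states = next_states
        history.append(states)
    return np.stack(history)
```

Note that in this setup the state dimension has to equal the dimension of the image encoding, because the encoding is used as the state directly.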

Train your model on the HDF5 file that you created above, taking care to swap the first and second axes as in Assignments 3 and 4. One challenge is that you need to pass both the image-encoding source and the word-sequence source to the training algorithm. The correct way is to apply the cost function (e.g., the cost method of a SequenceGenerator) to the word-sequence source (because this is the sequence the RNN should learn to produce), and to pass the image-encoding source as a named argument called context (because the image encodings are meant to be the initial states of the RNNs); that is, use a call like cost(sequence, context=image_vec). We have found that the Adam step rule, combined with some gradient clipping, seems to work well. Save the parameters of your model into a file model.tar (remember: it needs to be *.tar so the parameter values are saved separately and you can load them separately later).
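
Two of the ingredients mentioned above, the axis swap and gradient clipping, can be illustrated in plain NumPy. This is a sketch under assumptions, not the Blocks implementation: the helper names to_time_major and clip_by_global_norm are hypothetical, and the clipping shown here (rescaling all gradients together when their joint L2 norm exceeds a threshold) is only one common strategy.

```python
import numpy as np

def to_time_major(batch):
    """Swap the first two axes: (batch, time, ...) -> (time, batch, ...)."""
    return np.swapaxes(batch, 0, 1)

def clip_by_global_norm(grads, threshold):
    """Rescale all gradients together if their joint L2 norm is too large."""
    norm = np.sqrt(sum(float((g ** 2).sum()) for g in grads))
    if norm <= threshold:
        return list(grads)
    return [g * (threshold / norm) for g in grads]

sequences = np.arange(6).reshape(2, 3)   # 2 captions, 3 time steps each
time_major = to_time_major(sequences)    # 3 time steps, 2 captions each
clipped = clip_by_global_norm([np.array([3.0, 4.0])], threshold=1.0)
```

In the toy call, the gradient has norm 5 and is scaled down to norm 1 while keeping its direction.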

Finally, write code that takes an image as input, applies the CNN to obtain its vector encoding, and then computes the best output sequence using the model you just trained. You can use the BeamSearch class from Blocks to compute the k-best output sequences of a SequenceGenerator. You can tell its search method to generate output words (encoded as ints) until an end-of-sequence value is generated or up to a given maximum length. In our case, the EOS value is 0 (look at the original cocotalk HDF5). Here are some tips:

The BeamSearch class expects you to pass the Theano tensor that holds the outputs of your model in the constructor’s samples parameter. In our implementation (which is based on sequence generators), the following works for the samples argument:

from theano import tensor
from blocks.filter import VariableFilter
from blocks.model import Model

sequence = tensor.lmatrix("sequence")
generated = rnn.generator.generate(n_steps=sequence.shape[0], batch_size=sequence.shape[1])
model = Model(generated)
samples = VariableFilter(bricks=[rnn.generator], name="outputs")(model)[1]

BeamSearch will not work directly with the MySimpleRecurrent class provided below, because its apply method declares a context but does not use it in the computation graph it generates. (Try it and observe the error messages yourself.) You can either try to fix this as described here, or you can use the following hack. The classes SimpleRecurrent and MySimpleRecurrent contain exactly the same parameters, except that SimpleRecurrent also contains a parameter vector for the trained initial state. Because we want to set the value of the initial state ourselves and are not interested in a trained value, you can create a variant of your network with a SimpleRecurrent brick at evaluation time and load its parameters from model.tar. You can adapt the method set_parameter_values from Model so that it does not check for missing parameters.
In order to generate a sequence for a new image, first obtain the 1000D encoding from the CNN. Then initialize the RNN on which you’re going to perform the beam search so its initial state is the value of this 1000D encoding. You will need to find the name of the Theano variable representing the initial state, and then call set_value on it. In our implementation, the following statement will do it: model.get_parameter_dict()['/generator/with_fake_attention/transition.initial_state'].set_value(image_vec).
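
To see what the beam search is doing, here is a simplified pure-Python sketch. It is not the Blocks BeamSearch class: the function beam_search, its parameters, and the three-word toy_model are hypothetical stand-ins for illustration. It expands every unfinished hypothesis by one word per step, keeps the beam_size best ones, and stops a hypothesis when it emits the EOS value 0 or when the maximum length is reached.

```python
import math

def beam_search(step_log_probs, beam_size, eos=0, max_length=20):
    """Return up to beam_size (log_prob, sequence) pairs, best first.

    step_log_probs(prefix) must return a list of log-probabilities, one per
    word id, for the word that follows the tuple `prefix`.
    """
    beams = [(0.0, ())]   # unfinished hypotheses
    finished = []         # hypotheses that ended with EOS
    for _ in range(max_length):
        candidates = []
        for score, prefix in beams:
            for word, logp in enumerate(step_log_probs(prefix)):
                candidates.append((score + logp, prefix + (word,)))
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = []
        for score, seq in candidates[:beam_size]:
            (finished if seq[-1] == eos else beams).append((score, seq))
        if not beams:
            break
    # Hypotheses cut off at max_length are kept as they are.
    results = finished + beams
    results.sort(key=lambda c: c[0], reverse=True)
    return results[:beam_size]

def toy_model(prefix):
    """Hypothetical 3-word model: prefers word 2 first, then EOS (word 0)."""
    probs = [0.1, 0.3, 0.6] if not prefix else [0.7, 0.2, 0.1]
    return [math.log(p) for p in probs]

best = beam_search(toy_model, beam_size=2)
```

With the toy model, the highest-scoring hypothesis is the word 2 followed by EOS. In the real system, the role of step_log_probs is played by the RNN, whose initial state has been set to the image encoding.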

Put all the pieces together and write a nice front-end that will allow you to generate label sequences for any picture in the MSCOCO dataset. Explore the dataset a bit and bring your best (positive and negative) examples to class.

Extensions

If you would like to improve your system, here are some ideas you could try.

Some variants of the neuraltalk model do not set the initial state of the RNN to the image encoding directly, but first apply an MLP (with trained weights) to the image encoding to obtain the initial state. Try this out and see how it performs.
Experiment with other types of recurrent networks (LSTMs, GRUs). Also play around with different numbers of dimensions, gradient clipping strategies, and so on. Inspect both the cost function at training time and the quality of the generated label sequences at evaluation time.
If you feel truly adventurous, allow the CNN weights to be adapted to the MSCOCO sequence labeling task. Thus, instead of applying the (fixed) CNN to obtain fixed 1000D encodings of the images, create a single neural network which feeds the CNN outputs directly into the RNN initial states. Initialize the CNN weights as in Assignment 4, but then perform end-to-end training on the MSCOCO data that can modify all weights in the CNN for optimal performance on the overall task (at the expense of much increased training times).
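
The first idea above can be sketched in NumPy as follows. This is only an illustration under assumptions: the function name mlp_initial_state, the two-layer structure, the tanh activations, and the hidden and state sizes are arbitrary choices, and the weights here are random rather than trained.

```python
import numpy as np

def mlp_initial_state(image_vec, W1, b1, W2, b2):
    """Map image encodings (batch, enc_dim) to initial states (batch, dim)."""
    hidden = np.tanh(image_vec.dot(W1) + b1)
    return np.tanh(hidden.dot(W2) + b2)

rng = np.random.RandomState(0)
enc_dim, hidden_dim, state_dim = 1000, 64, 32   # hidden/state sizes: arbitrary
W1 = rng.normal(scale=0.01, size=(enc_dim, hidden_dim))
b1 = np.zeros(hidden_dim)
W2 = rng.normal(scale=0.01, size=(hidden_dim, state_dim))
b2 = np.zeros(state_dim)
image_vec = rng.rand(4, enc_dim)                # stand-in for 4 CNN outputs
initial_states = mlp_initial_state(image_vec, W1, b1, W2, b2)
```

One practical consequence of this variant is that the RNN state size no longer has to equal the 1000 dimensions of the image encoding.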

The class MySimpleRecurrent

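from theano import tensor
from blocks.bricks import Initializable
from blocks.bricks.base import application, lazy
from blocks.bricks.recurrent import BaseRecurrent, recurrent
from blocks.roles import add_role, WEIGHT
from blocks.utils import shared_floatx_nans

class MySimpleRecurrent(BaseRecurrent, Initializable):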
    @lazy(allocation=['dim'])
    def __init__(self, dim, activation, **kwargs):
        self.dim = dim
        children = [activation]
        kwargs.setdefault('children', []).extend(children)
        super(MySimpleRecurrent, self).__init__(**kwargs)

    @property
    def W(self):
        return self.parameters[0]

    def get_dim(self, name):
        if name == 'mask':
            return 0
        if name in (MySimpleRecurrent.apply.sequences +
                    MySimpleRecurrent.apply.states):
            return self.dim
        return super(MySimpleRecurrent, self).get_dim(name)

    def _allocate(self):
        self.parameters.append(shared_floatx_nans((self.dim, self.dim), name="W"))
        add_role(self.parameters[0], WEIGHT)

        # NB no parameters for initial state

    def _initialize(self):
        self.weights_init.initialize(self.W, self.rng)

    @recurrent(sequences=['inputs', 'mask'], states=['states'],
               outputs=['states'], contexts=['context'])
    def apply(self, inputs, states, mask=None, **kwargs):
        next_states = inputs + tensor.dot(states, self.W)
        next_states = self.children[0].apply(next_states)
        if mask is not None:
            next_states = (mask[:, None] * next_states +
                           (1 - mask[:, None]) * states)
        return next_states

    @application(contexts=["context"])
    def initial_states(self, batch_size, *args, **kwargs):
        init = kwargs["context"]
        return init.T

    @initial_states.property('outputs')
    def initial_states_outputs(self):
        return self.apply.states