SS 2016

Deep Learning for NLP


Assignment 5: Image labeling

In Assignment 4, you imported a CNN for image recognition into Blocks. This CNN maps a 256x256 image to a 1000D vector whose positions correspond to class labels from the Imagenet challenge. In Assignment 5, you will implement a system that learns to map these 1000D vectors into longer natural-language descriptions of images, such as “a man surfing on an ocean wave” or “a little boy on the beach trying to fly a kite”, using a recurrent network. You will then combine the two parts, obtaining an overall system that maps images to natural-language descriptions.

Your Task

First, prepare the training data for the new network. Generate an HDF5 file with two sources (we call it “coco.hdf5” below): one that contains the 1000D encodings of all MSCOCO images, and one that contains the annotated word sequences. Check the documentation of the cocotalk files on the neuraltalk Github page. Both sources need to have the same number of instances, so you will have to repeat the image encoding for all annotations of the same image. Keep in mind that the word sequences will be input to a recurrent network, so make sure this source has the correct three-dimensional structure that you know from Assignments 3 and 4. Because you will load this HDF5 file into a Fuel dataset and use it for training, you need to include a training/testing split (as in Assignment 1); it can be a trivial split in which you use all instances for training and none for testing. Create this HDF5 file early in the week; it can take an hour or two to generate.
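The bookkeeping above can be sketched with h5py and numpy. This is only an illustration with toy sizes; the source names image_vec and sequence and the caption-to-image index array are assumptions for the sketch, not prescribed by the assignment.

```python
import os
import tempfile

import h5py
import numpy as np

# Toy stand-ins for the real data (hypothetical sizes; MSCOCO has ~120k
# images with roughly five captions each).
num_images, enc_dim, seq_len = 3, 1000, 5
encodings = np.random.rand(num_images, enc_dim).astype("float32")  # CNN outputs
captions = np.random.randint(1, 10, size=(7, seq_len)).astype("int64")
image_idx = np.array([0, 0, 0, 1, 1, 2, 2])  # which image each caption annotates

# One training instance per caption: repeat each image encoding for all of
# its annotations, and give the word sequences a trailing feature axis so
# that this source is three-dimensional, as the recurrent network expects.
image_source = encodings[image_idx]        # shape (7, 1000)
word_source = captions[:, :, np.newaxis]   # shape (7, 5, 1)

path = os.path.join(tempfile.mkdtemp(), "coco.hdf5")
with h5py.File(path, "w") as f:
    f.create_dataset("image_vec", data=image_source)
    f.create_dataset("sequence", data=word_source)
    # Trivial split: all instances are training data. With Fuel you would
    # store this via H5PYDataset.create_split_array, e.g.
    # f.attrs["split"] = H5PYDataset.create_split_array(
    #     {"train": {"image_vec": (0, 7), "sequence": (0, 7)}})
```

Note how both sources end up with the same number of instances (seven here), which is what Fuel requires.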

Now implement a neural network that maps from 1000D vectors into word sequences. We follow an encoder-decoder architecture, in which the CNN from Assignment 4 serves as the encoder and its output serves as the initial state of the RNN decoder. Unfortunately, the standard Blocks classes for recurrent networks (such as SimpleRecurrent) assume that the initial state is a parameter whose value should be learned from data – not given to the RNN as input. Therefore you need to implement your own recurrent brick. For instance, you can use the MySimpleRecurrent brick (shown below) as a replacement for a SimpleRecurrent brick. Notice that the methods initial_states and apply are now annotated to say that they want to be evaluated with respect to a context named context. The apply method ignores the value of this context, but initial_states returns it as the value of the RNN’s initial state.

Train your model on the HDF5 file which you created above, taking care to swap the first and second axes as in Assignments 3 and 4. One challenge is that you need to pass both the image-encoding and the word-sequence source to the training algorithm. The correct way is to apply the cost function (e.g., the cost method of a SequenceGenerator) to the word-sequence source (because this is the sequence the RNN should learn to produce), and to pass the image-encoding source as a named argument called context (because the image encodings are meant to be the initial states of the RNNs); that is, use a call like cost(sequence, context=image_vec). We have found that the Adam step rule, plus some gradient clipping, seems to work well. Save the parameters of your model into a file model.tar (remember: it needs to be *.tar so the parameter values are saved separately and you can load them separately later).
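To make the step rule concrete, here is what gradient clipping followed by Adam computes for one parameter, written out in plain numpy. This illustrates the update rule itself, not the Blocks implementation; in Blocks you would combine StepClipping and Adam in a CompositeRule. The hyperparameters shown are Adam's usual defaults.

```python
import numpy as np

def clipped_adam_step(grad, m, v, t, threshold=1.0,
                      lr=0.002, beta1=0.9, beta2=0.999, eps=1e-8):
    """One update: clip the gradient norm, then apply an Adam step."""
    norm = np.sqrt(np.sum(grad ** 2))
    if norm > threshold:                      # gradient clipping
        grad = grad * (threshold / norm)
    m = beta1 * m + (1 - beta1) * grad        # first-moment estimate
    v = beta2 * v + (1 - beta2) * grad ** 2   # second-moment estimate
    m_hat = m / (1 - beta1 ** t)              # bias correction
    v_hat = v / (1 - beta2 ** t)
    step = lr * m_hat / (np.sqrt(v_hat) + eps)
    return step, m, v

# Usage: minimize f(w) = ||w||^2 (gradient 2w) for a few hundred steps.
w = np.array([3.0, -2.0])
m = np.zeros_like(w)
v = np.zeros_like(w)
for t in range(1, 201):
    step, m, v = clipped_adam_step(2 * w, m, v, t)
    w = w - step
```

The clipping bounds the size of any single update, which helps against the exploding gradients that recurrent networks are prone to.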

Finally, write code that takes an image as input, applies the CNN to obtain its vector encoding, and then computes the best output sequence using the model you just trained. You can use the BeamSearch class from Blocks to compute the k-best output sequences of a SequenceGenerator. You can tell its search method to generate output words (encoded as ints) until an end-of-sequence value is generated or up to a given maximum length. In our case, the EOS value is 0 (look at the original cocotalk HDF5). Here are some tips:

sequence = tensor.lmatrix("sequence")
generated = rnn.generator.generate(n_steps=sequence.shape[0], batch_size=sequence.shape[1])
model = Model(generated)
outputs = VariableFilter(bricks=[rnn.generator], name="outputs")(model.variables)[1]
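To see what BeamSearch has to do, here is a minimal beam search over a made-up next-word distribution. Blocks performs the same k-best bookkeeping over the compiled generate computation, retiring a hypothesis as soon as the EOS value 0 is produced.

```python
import numpy as np

def beam_search(next_probs, beam_size, eos=0, max_len=10):
    """k-best sequences under a step function next_probs(prefix) -> P(word)."""
    beams = [([], 0.0)]            # (sequence so far, log-probability)
    finished = []
    for _ in range(max_len):
        candidates = []
        for seq, logp in beams:
            probs = next_probs(seq)
            for w, p in enumerate(probs):
                if p > 0:
                    candidates.append((seq + [w], logp + np.log(p)))
        candidates.sort(key=lambda c: -c[1])
        beams = []
        for seq, logp in candidates[:beam_size]:
            if seq[-1] == eos:
                finished.append((seq, logp))   # EOS generated: hypothesis done
            else:
                beams.append((seq, logp))
        if not beams:
            break
    finished.extend(beams)   # hypotheses cut off at max_len
    finished.sort(key=lambda c: -c[1])
    return finished[:beam_size]

# Toy model: prefers word 2, then word 1; after two words it must emit EOS (0).
table = np.array([0.1, 0.3, 0.6])
best = beam_search(lambda seq: table if len(seq) < 2 else np.array([1.0, 0.0, 0.0]),
                   beam_size=3)
```

With these made-up probabilities the top hypothesis is [2, 2, 0]: the greedy word twice, then EOS.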

Put all the pieces together and write a nice front-end that will allow you to generate label sequences for any picture in the MSCOCO dataset. Explore the dataset a bit and bring your best (positive and negative) examples to class.


If you would like to improve your system, here are some ideas you could try.

The class MySimpleRecurrent

from blocks.bricks import Initializable
from blocks.bricks.base import application
from blocks.bricks.recurrent import BaseRecurrent, recurrent
from blocks.roles import add_role, WEIGHT
from blocks.utils import shared_floatx_nans
from theano import tensor


class MySimpleRecurrent(BaseRecurrent, Initializable):
    def __init__(self, dim, activation, **kwargs):
        self.dim = dim
        children = [activation]
        kwargs.setdefault('children', []).extend(children)
        super(MySimpleRecurrent, self).__init__(**kwargs)

    @property
    def W(self):
        return self.parameters[0]

    def get_dim(self, name):
        if name == 'mask':
            return 0
        if name in (MySimpleRecurrent.apply.sequences +
                    MySimpleRecurrent.apply.states):
            return self.dim
        return super(MySimpleRecurrent, self).get_dim(name)

    def _allocate(self):
        self.parameters.append(shared_floatx_nans((self.dim, self.dim), name="W"))
        add_role(self.parameters[0], WEIGHT)

        # NB no parameters for initial state

    def _initialize(self):
        self.weights_init.initialize(self.W, self.rng)

    @recurrent(sequences=['inputs', 'mask'], states=['states'],
               outputs=['states'], contexts=['context'])
    def apply(self, inputs, states, mask=None, **kwargs):
        next_states = inputs +, self.W)
        next_states = self.children[0].apply(next_states)
        if mask is not None:
            next_states = (mask[:, None] * next_states +
                           (1 - mask[:, None]) * states)
        return next_states

    @application(contexts=['context'])
    def initial_states(self, batch_size, *args, **kwargs):
        init = kwargs["context"]
        return init.T

    def initial_states_outputs(self):
        return self.apply.states