Deep Learning for NLP
Assignment 4 is Part One of a two-part assignment. Our eventual goal is to reimplement Andrej Karpathy’s Neuraltalk system, which translates images into natural-language descriptions, such as “a little boy on the beach trying to fly a kite”. However, this has turned out to be a bit of a project, and we are therefore splitting it up over two weeks.
In this assignment, you will use a pre-trained convolutional network to find out what objects are in a given image. We are using the MSCOCO dataset, which consists of about 200,000 real-world images, each of which has been labeled with multiple short natural-language descriptions. You can explore the MSCOCO data on the “Explore” tab of the website.
Training a CNN that performs well on this task is expensive. We will therefore use a network that has already been trained (on a different dataset): the 16-layer model from the Visual Geometry Group at Oxford (see here to get a sense of the network’s structure). Unfortunately, this network was created using the Caffe library, whose file format is not compatible with Theano. You can obtain a Blocks model as follows:

- The human-readable names of the 1000 image classes can be read out of the model’s metadata via x["meta"][0]["classes"][0][0][0][1][0] (where x is the loaded model data).
- Create an input variable of type tensor4 of dtype float32 and apply all bricks to it, one by one.
- Create a Theano function from your input variable to the output of the softmax brick (use the parameter allow_input_downcast=True). The function takes 4-dimensional tensors as input (# instances * # color channels * image width * image height). For the MSCOCO data, there are three color channels and the pictures are 256x256 pixels in size. The function returns a vector of length 1000 specifying the softmax distribution over the possible image labels. A rough sketch of this pipeline is given below.
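To make these steps concrete, here is a minimal sketch of the pipeline. It assumes the converted model arrives as a pickled list of Blocks bricks whose last element is the softmax; the file name vgg16_blocks.pkl and the variable names are placeholders, not part of the assignment.

```python
import pickle
import numpy
import theano
import theano.tensor as T

# Load the converted VGG-16 model. This sketch assumes a pickled list of
# Blocks bricks; adapt the loading code to whatever format you actually have.
with open("vgg16_blocks.pkl", "rb") as f:   # placeholder file name
    bricks = pickle.load(f)

# Symbolic input: (# instances, # color channels, image width, image height).
x = T.tensor4("images", dtype="float32")

# Apply all bricks to the input, one by one; the last brick is the softmax.
h = x
for brick in bricks:
    h = brick.apply(h)

# allow_input_downcast lets you pass in arrays that are not float32 already.
classify = theano.function([x], h, allow_input_downcast=True)

# Example call on a single (black) 256x256 RGB image.
image = numpy.zeros((1, 3, 256, 256), dtype="float32")
probs = classify(image)        # one row of 1000 class probabilities per image
print(probs.argmax(axis=1))
```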
Now obtain the MSCOCO data. You can download it from their website. One of the downloads is very large (13 GB), but all the data is also available on medusa in the directory /projects/korpora/mscoco. Feel free to scp it to your laptop or an external drive, ideally from within the university network. However, please avoid unzipping the MSCOCO Zip files in your home directories on medusa, as this would put a considerable strain on the file server.
Follow the instructions on the neuraltalk2 Github page for converting the MSCOCO dataset into a single huge HDF5 file (under “I’d like to train my own network on MS COCO”). Beware of the one broken image that you have to fix by hand. Karpathy’s preprocessing script also creates a JSON file, which maps from image positions in the list to image metadata, and from word IDs (starting at 1) to the actual words. Use an HDF viewer (e.g. this one) to explore the data file. Warning: the “images” dataset is huge.
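If you prefer to peek at the preprocessed files from Python instead of a GUI viewer, something along the following lines should work. The file names cocotalk.h5 and cocotalk.json are the defaults of Karpathy’s preprocessing script, and the key names ("images", "ix_to_word") are assumptions you should double-check against your own files.

```python
import json
import h5py

# Open the preprocessed data files (names are the prepro script's defaults).
h5 = h5py.File("cocotalk.h5", "r")
with open("cocotalk.json") as f:
    meta = json.load(f)

# List the datasets and their shapes without loading anything into memory.
for name, dataset in h5.items():
    print(name, dataset.shape, dataset.dtype)

# The JSON metadata maps word IDs (as strings, starting at "1") to words.
print(meta["ix_to_word"]["1"])

# h5py reads lazily, so indexing only loads the single image you ask for.
first_image = h5["images"][0]      # expected shape: (3, 256, 256)
print(first_image.shape, first_image.dtype)
```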
Finally, write a script that can read images from the HDF5 file, apply the CNN to them, and output the words that have a high probability. Play around with your script a bit and see how well it works. If you want to do this on medusa (where the CNN can run on the GPU), feel free to use the HDF5 and JSON files in /projects/korpora/mscoco/coco. A rough sketch of such a script follows below.
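Here is one way such a script could look. It builds on the hypothetical classify function from the CNN sketch above and assumes a list class_names holding the 1000 label strings read from the model metadata; adjust the names and paths to your own setup.

```python
import h5py
import numpy

# Point this at your local copy, or at the files under
# /projects/korpora/mscoco/coco when running on medusa.
h5 = h5py.File("cocotalk.h5", "r")

def describe(index, top_k=5):
    """Run the CNN on image number `index` and print its most likely labels."""
    image = h5["images"][index]                       # (3, 256, 256) array
    batch = image[numpy.newaxis].astype("float32")    # add the batch dimension
    probs = classify(batch)[0]                        # classify: see sketch above
    for class_id in numpy.argsort(probs)[::-1][:top_k]:
        # class_names is assumed to be the list of 1000 labels read from the
        # model metadata (first bullet point above).
        print(class_names[class_id], probs[class_id])

describe(0)
```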
As a secondary project this week, redo Assignment 3 using sequence generators. A sequence generator combines a recurrent network brick (e.g. an RNN or an LSTM) with infrastructure that emits outputs based on the RNN’s state, such that the output of the previous timestep is fed back into the RNN as input at the next timestep. There is also support for attention bricks, which you can ignore for now.
You will be pleased to see how short your Char-RNN code will become through the use of sequence generators. Furthermore, it will be useful to familiarize yourself with sequence generators for the more complex models we will build in the future, and as a starting point for learning about attention.
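To give you an idea of what this looks like, here is a minimal sketch of a character-level sequence generator. The dimensions, the SimpleRecurrent transition, and the initialization schemes are illustrative choices, not a prescribed architecture.

```python
import theano.tensor as T
from blocks.bricks import Tanh
from blocks.bricks.recurrent import SimpleRecurrent
from blocks.bricks.sequence_generators import (
    SequenceGenerator, Readout, SoftmaxEmitter, LookupFeedback)
from blocks.initialization import IsotropicGaussian, Constant

alphabet_size = 100    # number of distinct characters (illustrative)
hidden_dim = 512       # size of the recurrent state (illustrative)

transition = SimpleRecurrent(dim=hidden_dim, activation=Tanh(),
                             name="transition")
generator = SequenceGenerator(
    Readout(readout_dim=alphabet_size,
            source_names=["states"],
            emitter=SoftmaxEmitter(name="emitter"),
            feedback_brick=LookupFeedback(alphabet_size, hidden_dim,
                                          name="feedback"),
            name="readout"),
    transition,
    weights_init=IsotropicGaussian(0.01), biases_init=Constant(0),
    name="generator")
generator.initialize()

# y holds the character IDs: an lmatrix of shape (sequence length, batch size).
y = T.lmatrix("characters")
cost = generator.cost(y)    # training cost of the minibatch
```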
Here are some important tips for working with sequence generators:
- The training sequences are passed to the sequence generator as a Theano variable of type lmatrix. The first dimension is position in the sequence, and the second dimension is position in the list of training instances (within the minibatch). The variable cannot have more than two dimensions. This means that you must encode each character ID as an int, not as a one-hot vector. Use a LookupFeedback brick as the feedback_brick to get this behavior.
- To reuse a trained model for generation, build the Model in code, exactly as you would for training (except that you can leave bricks out, e.g. for the cost function). You can then use the function blocks.serialization.load_parameters and the method model.set_parameters to set only the parameters of the model (weights and such) from the file. This has the huge advantage that you have all the tensor variables that your model contains in hand, and can define your Theano functions from them. Note that load_parameters is only available in the bleeding-edge version of Blocks, and that you must store your model in a file with a name of the form “*.tar” (e.g., “model.tar”) for this to work.
- To generate characters one at a time, compile a Theano function over generator.generate(n_steps=y.shape[0], batch_size=y.shape[1]), where y is a (1,1) lmatrix. This function maps a (1,1) lmatrix containing the input character to a triple of values: the next internal state of the RNN (as a (1,1,D)-dimensional tensor, where D is the size of the state); a randomly chosen output character (as a (1,1)-dimensional lmatrix); and the transition cost. See the sketch after this list.
- To carry the RNN state over from one generation step to the next, look up the initial state via model.get_parameter_dict()["/generator/with_fake_attention/transition.initial_state"], and call its set_value method to update the state.
- The emit method in the SoftmaxEmitter (which you probably want to use as part of your sequence generator) does choose at random from the multinomial distribution over output values that the softmax brick defines. However, the random number generator is initialized with the same seed every time the program runs; thus any two runs of your program will always yield the exact same “random” output. You can fix this by making your own version of SoftmaxEmitter whose __init__ method executes self.theano_rng = MRG_RandomStreams(seed=random.randint(0,100000)) or something similar.
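A rough sketch that strings these last hints together is given below. The class name SeededSoftmaxEmitter and the loop details are my own guesses at one possible arrangement; generator, model, and y are assumed to be built as in the sketch further above (with the custom emitter used as the emitter), and the trained parameters loaded into model as described in the tips.

```python
import random
import numpy
from theano.sandbox.rng_mrg import MRG_RandomStreams
from blocks.bricks.sequence_generators import SoftmaxEmitter
from blocks.graph import ComputationGraph

class SeededSoftmaxEmitter(SoftmaxEmitter):
    """A SoftmaxEmitter whose random stream is seeded differently on every run."""
    def __init__(self, **kwargs):
        super(SeededSoftmaxEmitter, self).__init__(**kwargs)
        self.theano_rng = MRG_RandomStreams(seed=random.randint(0, 100000))

# generator, model, and y are assumed to exist already (see above).
# One generation step: the compiled function maps a (1,1) lmatrix holding the
# previous character to (next RNN state, sampled character, transition cost).
# Going through ComputationGraph also takes care of the random-stream updates.
cg = ComputationGraph(
    generator.generate(n_steps=y.shape[0], batch_size=y.shape[1]))
sample = cg.get_theano_function()

# The initial-state parameter of the transition; updating it by hand carries
# the RNN state over from one call of sample to the next.
initial_state = model.get_parameter_dict()[
    "/generator/with_fake_attention/transition.initial_state"]

char = numpy.array([[0]], dtype="int64")       # ID of some start character
for _ in range(200):
    state, char, cost = sample(char)
    initial_state.set_value(state[0, 0])       # feed the new state back in
    print(char[0, 0])
```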