SS 2016

Deep Learning for NLP


Assignment 4: Image classification

Assignment 4 is Part One of a two-part assignment. Our eventual goal is to reimplement Andrej Karpathy’s Neuraltalk system, which translates images into natural-language descriptions, such as “a little boy on the beach trying to fly a kite”. However, this has turned out to be a bit of a project, and we are therefore splitting it up over two weeks.

Your Task

In this assignment, you will use a pre-trained convolutional network to find out what objects are in a given image. We are using the MSCOCO dataset, which consists of about 200,000 real-world images, each of which has been labeled with multiple short natural-language descriptions. You can explore the MSCOCO data on the “Explore” tab of the website.

Training a CNN that performs well on this task is expensive. We will therefore use a network that has already been trained (on a different dataset): the 16-layer model from the Visual Geometry Group at Oxford (see here to get a sense of the network’s structure). Unfortunately, this network was created using the Caffe library, whose file format is not compatible with Theano. You can obtain a Blocks model as follows:
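How the converted parameters are packaged depends on the conversion route you take; as one hedged sketch, suppose the conversion produces a NumPy `.npz` archive with one array per layer (the file name `vgg16_weights.npz` and key names like `conv1_1_W` are assumptions, not something the assignment specifies). You could then load and sanity-check the arrays before pushing them into the corresponding Blocks bricks:

```python
import numpy as np

def load_vgg_params(path):
    """Load converted VGG-16 parameters from an .npz archive.

    The archive layout (one array per layer, keyed e.g. 'conv1_1_W')
    is an assumption; adapt the keys to whatever your conversion
    script actually emits.
    """
    archive = np.load(path)
    return {name: archive[name] for name in archive.files}

# Hypothetical usage: check the first conv layer's shape
# (64 filters, 3 input channels, 3x3 kernels) before wiring the
# arrays into Blocks bricks.
# params = load_vgg_params("vgg16_weights.npz")
# assert params["conv1_1_W"].shape == (64, 3, 3, 3)
```

Verifying shapes up front catches a mismatched conversion early, which is much cheaper than debugging a silently mis-initialized network later.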

Now obtain the MSCOCO data. You can download it from their website. One of the downloads is very large (13 GB), but all the data is available on medusa in the directory /projects/korpora/mscoco. Feel free to scp it to your laptop or an external drive, ideally from within the university network. However, please avoid unzipping the MSCOCO Zip files in your home directories on medusa, as this would put a considerable strain on the file server.

Follow the instructions on the neuraltalk2 Github page for converting the MSCOCO dataset into a single huge HDF5 file (under “I’d like to train my own network on MS COCO”). Beware the one broken image that you have to fix by hand. Karpathy’s preprocessing script also creates a JSON file, which maps from image positions in the list to image metadata, and from word IDs (starting at 1) to the actual words. Use an HDF5 viewer (e.g. this one) to explore the data file. Warning: The “images” dataset is huge.
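To use that JSON file later, you will need to invert the ID-to-word map. A minimal sketch (the key name `ix_to_word` follows the neuraltalk2 preprocessing output, so double-check it against your own JSON; treating ID 0 as padding is likewise an assumption about the HDF5 encoding):

```python
import json

def build_vocab(json_path):
    """Read Karpathy's preprocessing JSON and return an int -> word map.

    JSON object keys are always strings, so the word IDs (which start
    at 1, not 0) must be converted back to ints.
    """
    with open(json_path) as f:
        meta = json.load(f)
    return {int(ix): word for ix, word in meta["ix_to_word"].items()}

def decode(ids, vocab):
    """Turn a sequence of word IDs into words, skipping 0 (assumed padding)."""
    return [vocab[i] for i in ids if i != 0]
```

An off-by-one here (forgetting that IDs start at 1) shifts every decoded word, so it is worth testing this mapping on a known caption before using it anywhere else.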

Finally, write a script that can read images from the HDF5 file, apply the CNN to them, and output the words that have a high probability. Play around with your script a bit and see how well it works. If you want to do this on medusa (where the CNN can run on the GPU), feel free to use the HDF5 and JSON files in /projects/korpora/mscoco/coco.
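The last step of that script, turning the CNN's softmax output into a list of probable words, can be sketched independently of the network itself. Here `probs` stands in for one row of the softmax output and `vocab` for a class-index-to-word map; the cutoff value `threshold=0.2` is an arbitrary assumption you should tune by inspecting your own outputs:

```python
import numpy as np

def top_words(probs, vocab, k=5, threshold=0.2):
    """Return up to k (word, probability) pairs above the threshold.

    `probs` is a 1-D array of class probabilities (e.g. one softmax
    row from the CNN); `vocab` maps class indices to words. Both are
    hypothetical stand-ins for your own pipeline's objects.
    """
    order = np.argsort(probs)[::-1][:k]  # indices of the k largest entries
    return [(vocab[i], float(probs[i])) for i in order if probs[i] >= threshold]
```

Reading the images themselves would use `h5py` (open the file, slice a batch out of the “images” dataset rather than loading it whole); that part is omitted here since the dataset layout is easiest to confirm in the HDF5 viewer.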

Assignment 3 revisited: Sequence Generators

As a secondary project this week, redo Assignment 3 using sequence generators. A sequence generator combines a recurrent network brick (e.g. an RNN or an LSTM) with infrastructure that emits outputs based on the RNN’s state, such that the output of the previous timestep is fed back into the RNN as input at the next timestep. There is also support for attention bricks, which you can ignore for now.

You will be pleased to see how short your Char-RNN code will become through the use of sequence generators. Furthermore, it will be useful to familiarize yourself with sequence generators for the more complex models we will build in the future, and as a starting point for learning about attention.
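The feedback loop that a sequence generator implements for you can be illustrated with a toy, library-free sketch (this is not the Blocks API; `step_fn`, standing in for the recurrent transition plus readout, is a hypothetical placeholder):

```python
import numpy as np

def greedy_generate(step_fn, start_symbol, steps):
    """Illustrate the sequence-generator feedback loop.

    The symbol emitted at step t is fed back as the input at step t+1.
    `step_fn(prev_symbol, state) -> (probs, state)` is a toy stand-in
    for the RNN transition plus output readout.
    """
    symbol, state, output = start_symbol, None, []
    for _ in range(steps):
        probs, state = step_fn(symbol, state)
        symbol = int(np.argmax(probs))  # greedy readout of the next symbol
        output.append(symbol)
    return output
```

In Blocks, the generator brick manages exactly this loop (plus training-time teacher forcing and, later, attention), which is why your Char-RNN code shrinks so much.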

Here are some important tips for working with sequence generators: