# SS 2016

Deep Learning for NLP


# Assignment 6: Attention

In this final assignment, you will implement an End-to-End Memory Network and train it to answer questions about stories from the Facebook bAbI dataset.

## Before you begin

The main technical challenge in this assignment is that the Blocks library is not sufficient to implement the network. Instead, you will have to use some low-level Theano operations. This can be tricky, because (a) you will have to be very clear on the linear algebra operations you want to apply, and (b) every tensor will have one more dimension than you might expect, because the first dimension is the position within the minibatch.

So before you start with the task itself, we urge you to familiarize yourself with the `batched_dot` and `dimshuffle` operations from Theano (see the documentation). `batched_dot`, in particular, will be your friend, because it applies a matrix multiplication to every element of the minibatch. So if you have one tensor `A` of the shape `(a,b,c)`, where `a` is the batch size, and you have another tensor `B` of the shape `(a,c,d)`, then `batched_dot(A,B)` will return a tensor of the shape `(a,b,d)`. This also works if `A` or `B` is two-dimensional. Try this out, e.g. in a Jupyter notebook: create computation graphs that use `batched_dot`, generate a Python function that evaluates each graph, run this function on a few inputs, and see what it computes. Many Theano operations mirror operations that exist in Numpy, so you may be able to use Numpy to compute the expected outputs.
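To check your understanding of the shapes, a Numpy reference can help. The following is a minimal sketch, not Theano code: `batched_dot_reference` is a hypothetical helper that computes what `batched_dot` is described to compute for two 3D inputs, so you can compare it against the output of your compiled Theano function. The comments on `dimshuffle` describe the usual Numpy counterparts.

```python
import numpy as np

def batched_dot_reference(A, B):
    """Numpy stand-in for batched_dot on 3D inputs: out[i] = dot(A[i], B[i])."""
    return np.einsum("ibc,icd->ibd", A, B)

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 2, 3))  # (a, b, c): a batch of 4 matrices of shape (2, 3)
B = rng.standard_normal((4, 3, 5))  # (a, c, d): a batch of 4 matrices of shape (3, 5)

out = batched_dot_reference(A, B)
print(out.shape)  # (4, 2, 5)

# The same result, computed one minibatch element at a time.
expected = np.stack([A[i] @ B[i] for i in range(A.shape[0])])
assert np.allclose(out, expected)

# Numpy counterparts of typical dimshuffle patterns:
#   x.dimshuffle(0, 2, 1)   permutes axes, like np.transpose(x, (0, 2, 1))
#   x.dimshuffle(0, 'x', 1) inserts a broadcastable axis, like x[:, None, :]
A_t = np.transpose(A, (0, 2, 1))    # shape (4, 3, 2)
```

If your Theano function and this reference disagree on a random input, the shapes or the axis order in your graph are the first thing to check.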

Note also that you can add a softmax operation to the computation graph using `tensor.nnet.softmax`, and that you can apply a brick to an arbitrary Theano tensor (even if it is not the output of another brick) and thereby mix and match Theano and Blocks.
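For comparing against the softmax in your graph, a Numpy reference is handy. This is a small sketch under the assumption that the softmax is applied row-wise to a matrix of shape `(batch, n)`; `softmax_rows` is a hypothetical helper name, not part of Theano or Blocks.

```python
import numpy as np

def softmax_rows(x):
    """Row-wise softmax of a 2D array; each row of the result sums to 1."""
    # Subtracting the row maximum avoids overflow and does not change the result.
    shifted = x - x.max(axis=1, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=1, keepdims=True)

scores = np.array([[1.0, 2.0, 3.0],
                   [0.0, 0.0, 0.0]])
probs = softmax_rows(scores)
print(probs.sum(axis=1))  # each row sums to 1
```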

Implement an End-to-End Memory Network. You can assume that the stories have a fixed length, which is given to you. Start with the simplest configuration, i.e. a single memory layer and a bag-of-words (BOW) model for combining the embeddings of the individual words into embeddings of whole sentences. One way to compute the BOW vector is to multiply the vector for the whole sentence (or the whole story) with a matrix in a way that calculates the appropriate sums.
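To make the target computation concrete, here is a Numpy sketch of the forward pass of a single-layer memory network with BOW sentence embeddings. It is an illustration under stated assumptions, not the Theano/Blocks implementation you are asked to write: the names `memn2n_forward`, `A`, `B`, `C`, `W` and all sizes are chosen for this example, word id 0 is assumed to be padding, and the weights are random rather than trained. The last lines show the summing-matrix trick for one story.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def memn2n_forward(stories, questions, A, B, C, W):
    """stories: (batch, n_sent, n_words) word ids; questions: (batch, q_len) word ids."""
    m = A[stories].sum(axis=2)                  # input memories, (batch, n_sent, d)
    c = C[stories].sum(axis=2)                  # output memories, (batch, n_sent, d)
    u = B[questions].sum(axis=1)                # question embedding, (batch, d)
    p = softmax(np.einsum("bsd,bd->bs", m, u))  # attention over sentences
    o = np.einsum("bs,bsd->bd", p, c)           # attention-weighted sum of output memories
    return softmax((o + u) @ W), p              # answer distribution, attention

rng = np.random.default_rng(0)
vocab, d = 10, 4
batch, n_sent, n_words, q_len = 2, 3, 5, 4

def embedding():
    E = 0.1 * rng.standard_normal((vocab, d))
    E[0] = 0.0                                  # assumption: id 0 is padding and embeds to zero
    return E

A, B, C = embedding(), embedding(), embedding()
W = 0.1 * rng.standard_normal((d, vocab))
stories = rng.integers(0, vocab, size=(batch, n_sent, n_words))
questions = rng.integers(0, vocab, size=(batch, q_len))

answer_probs, attention = memn2n_forward(stories, questions, A, B, C, W)
print(answer_probs.shape, attention.shape)  # (2, 10) (2, 3)

# BOW sums as a matrix product: S has a block of ones per sentence.
flat = A[stories[0].reshape(-1)]                    # (n_sent * n_words, d)
S = np.kron(np.eye(n_sent), np.ones((1, n_words)))  # (n_sent, n_sent * n_words)
assert np.allclose(S @ flat, A[stories[0]].sum(axis=1))
```

In Theano, the two `einsum` calls are the places where `batched_dot` and `dimshuffle` would come in, since each multiplies a per-example matrix with a per-example vector.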

Generate some training and test data for the bAbI dataset, using the bAbI-tasks tool. The most interesting dataset is probably “Factoid QA with two supporting facts”, but feel free to play around with the others as well. Also feel free to generate symbolic outputs, i.e. “S goes K” rather than “Sebastian goes to the kitchen”. Convert this data into training and test data for your network. Note that each story generated by the bAbI script contains multiple questions and answers, so you have to deal with this appropriately, and zero-pad stories to the maximum story length.
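The conversion step can be sketched as follows. This assumes the usual bAbI text layout, in which every line starts with a sentence number that restarts at 1 for each new story, and question lines carry the answer and the supporting fact numbers after tab characters. The helper names `tokenize`, `parse_babi` and `pad_story` and the tiny sample are made up for this illustration; adapt them to the files you actually generate.

```python
import re

def tokenize(text):
    """Lowercase and split into word tokens, dropping punctuation."""
    return re.findall(r"\w+", text.lower())

def parse_babi(lines):
    """Return one (story, question, answer) triple per question line."""
    examples, story = [], []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        idx, text = line.split(" ", 1)
        if int(idx) == 1:          # numbering restarts: a new story begins
            story = []
        if "\t" in text:           # question line: question, answer, supporting facts
            parts = text.split("\t")
            # Copy the story so far, so later sentences do not leak into this example.
            examples.append((list(story), tokenize(parts[0]), parts[1].strip()))
        else:
            story.append(tokenize(text))
    return examples

def pad_story(story, n_sent, n_words, word_to_id):
    """Encode a story as an n_sent x n_words grid of word ids, zero-padded."""
    rows = [[word_to_id[w] for w in sent] + [0] * (n_words - len(sent)) for sent in story]
    return rows + [[0] * n_words for _ in range(n_sent - len(story))]

sample = [
    "1 Mary moved to the bathroom.",
    "2 John went to the hallway.",
    "3 Where is Mary?\tbathroom\t1",
    "4 Daniel went back to the hallway.",
    "5 Where is Daniel?\thallway\t4",
    "1 Sandra moved to the garden.",
    "2 Where is Sandra?\tgarden\t1",
]
examples = parse_babi(sample)

words = sorted({w for s, q, a in examples for sent in s + [q, [a]] for w in sent})
word_to_id = {w: i + 1 for i, w in enumerate(words)}   # id 0 is reserved for padding
max_sent = max(len(s) for s, _, _ in examples)
max_words = max(len(sent) for s, _, _ in examples for sent in s)
padded = [pad_story(s, max_sent, max_words, word_to_id) for s, _, _ in examples]
print(len(examples), max_sent, max_words)  # 3 3 6
```

Each question becomes its own training example whose story consists of all sentences seen so far, which is one reasonable way to deal with multiple questions per story.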

Train and test your network on this data.

## Hints

Using `batched_dot(A,B)` works nicely as long as the batch dimensions (axis 0) of `A` and `B` match. Otherwise, Theano will complain. Because the last batch of an iteration is usually not a full one, you will need a workaround. Suppose you have a 3D tensor `M` that you would like to multiply with an `input_batch` using `batched_dot`. To make sure their batch sizes match, you can use a slice of `M`, for example `M[:input_batch.shape[0]]`.

Note that `input_batch.shape` is a symbolic Theano variable that is evaluated at runtime.
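The effect of this workaround can be tried out in Numpy first. The sketch below uses `np.matmul` as a stand-in for `batched_dot` and made-up shapes; unlike in Theano, the shape here is a concrete tuple rather than a symbolic variable, so this only illustrates the slicing idea. The Theano expression in the comment is the presumed counterpart, not code taken from the assignment.

```python
import numpy as np

batch_size = 8
M = np.ones((batch_size, 3, 4))   # built for full batches of 8
last_batch = np.ones((5, 4, 2))   # the final, incomplete batch holds only 5 examples

# Without slicing, the batch dimensions (8 vs. 5) do not match.
try:
    np.matmul(M, last_batch)
    mismatch_detected = False
except ValueError:
    mismatch_detected = True

# Slicing M along axis 0 to the size of the current batch fixes the mismatch.
# Presumed Theano counterpart: batched_dot(M[:input_batch.shape[0]], input_batch)
M_slice = M[:last_batch.shape[0]]
out = np.matmul(M_slice, last_batch)
print(out.shape)  # (5, 3, 2)
```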

## Ideas for extensions

Once you overcome the `batched_dot` learning curve, this assignment should be pretty easy. Thus you will probably still have time at the end, which you could use for extensions. Here are some ideas:

• Visualize which sentences of the story your network attends to.
• Generalize your network to natural-language inputs (“Sebastian goes to the kitchen”), or at least present the symbolic inputs in natural language when you display the visualization.
• Use a better method than BOW to aggregate word representations into sentence representations (see paper).
• Use a memory network with more than one layer.
• Use a recurrent network to read input stories of variable length.
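For the visualization idea, a plain-text rendering is often enough to start with. The following sketch assumes you have already extracted one attention weight per sentence from your network; `show_attention`, the sentences and the weights are all made up for illustration.

```python
def show_attention(sentences, weights, width=20):
    """Render one line per sentence with a bar proportional to its attention weight."""
    lines = []
    for sentence, w in zip(sentences, weights):
        bar = "#" * round(w * width)
        lines.append(f"{w:5.2f} {bar:<{width}} {sentence}")
    return "\n".join(lines)

sentences = [
    "Mary moved to the bathroom.",
    "John went to the hallway.",
    "Sandra moved to the garden.",
]
weights = [0.1, 0.7, 0.2]   # hypothetical attention weights for one question
rendered = show_attention(sentences, weights)
print(rendered)
```

The same function can display symbolic stories in natural language if you map each symbolic sentence to its text before calling it.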