Deep Learning for NLP
In this assignment, you’re going to implement a simple version of word2vec-style word embeddings. This will give you hands-on experience with word embeddings (which are very hot right now), and give you an opportunity to learn a few new tricks regarding Blocks and Fuel when dealing with data of nontrivial size.
Obtain a corpus of your choice. The Brown corpus, which can be easily downloaded via NLTK, may be a good starting point. If (training!) time permits, you can switch to a more serious corpus later. You will want to implement your own Dataset to make the corpus available to Blocks; see below.
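For example, with NLTK the Brown corpus and a word-to-index mapping can be obtained in a few lines (downloading and lowercasing everything are choices you may handle differently):

    import collections
    import nltk

    nltk.download("brown")                      # fetch the corpus once
    from nltk.corpus import brown

    sentences = [[w.lower() for w in sent] for sent in brown.sents()]

    # Ordered word list and word -> integer mapping; the network and the saved
    # embedding matrix will both refer to these indices.
    counts = collections.Counter(w for sent in sentences for w in sent)
    words = [w for w, _ in counts.most_common()]
    word_to_id = {w: i for i, w in enumerate(words)}
    print(len(sentences), "sentences,", len(words), "distinct word types")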
Implement the CBOW model from Mikolov et al. Notice that the mappings from each context word to the middle layer use the same weight matrix. For simplicity, you can start with a context of one word before and one word after, and then work your way up to bigger contexts. Use a softmax activation function and a categorical cross-entropy cost function. Experiment with different sizes of the middle layer.
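The assignment is framed in terms of Blocks bricks, but as a point of reference, here is a rough Theano-only sketch of the CBOW forward pass and cost. The sizes, the Gaussian initialization and the variable names are arbitrary choices made for this illustration, not part of the assignment:

    import numpy
    import theano
    import theano.tensor as T

    vocab_size, embedding_dim = 10000, 100      # V and the size of the middle layer
    rng = numpy.random.RandomState(1234)

    # One embedding matrix shared by all context positions, plus an output layer.
    W_in = theano.shared(rng.normal(0, 0.01, (vocab_size, embedding_dim))
                         .astype(theano.config.floatX))
    W_out = theano.shared(rng.normal(0, 0.01, (embedding_dim, vocab_size))
                          .astype(theano.config.floatX))
    b_out = theano.shared(numpy.zeros(vocab_size, dtype=theano.config.floatX))

    context = T.imatrix("features")   # (batch_size, context_size) word indices
    target = T.ivector("target")      # (batch_size,) indices of the middle word

    # Indexing the embedding matrix with integer ids replaces the one-hot product;
    # averaging merges the context positions into the middle layer.
    hidden = W_in[context].mean(axis=1)
    probs = T.nnet.softmax(T.dot(hidden, W_out) + b_out)
    cost = T.nnet.categorical_crossentropy(probs, target).mean()

    f_cost = theano.function([context, target], cost)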
The inputs and outputs of your network will be integers indicating the identity of a word, e.g. as an index in a list of all words. In papers, these integers are usually represented as one-hot encodings of length V, where V is the vocabulary size. You will find that actually computing such one-hot encodings is prohibitively slow, especially if you also need to transfer them to a GPU. Use the following tricks to deal with this: on the input side, feed the integer indices themselves and let the embedding layer select rows of its weight matrix directly; on the output side, compute the categorical cross-entropy from the integer target labels rather than from one-hot target vectors.
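To see why integer indices are enough: multiplying a one-hot vector by a weight matrix merely selects one of its rows, so a row lookup computes the same thing without ever building a length-V vector. A tiny numpy illustration (the sizes are made up):

    import numpy

    V, d = 8, 3
    W = numpy.arange(V * d, dtype="float32").reshape(V, d)

    word_id = 5
    one_hot = numpy.zeros(V, dtype="float32")
    one_hot[word_id] = 1.0

    # The dense product and the row lookup give exactly the same vector, but the
    # lookup never materialises the length-V one-hot vector.
    assert numpy.allclose(numpy.dot(one_hot, W), W[word_id])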
At the end, save your ordered word list and the weight matrix of the embedding layer to a file. Use vector arithmetic and cosine similarities to solve a few analogy tasks. Notice that the value of the cosine may range from -1 to +1 because word embedding vectors may contain negative coefficients. A cosine of +1 indicates that the two vectors are parallel and point in the same direction.
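A possible sketch of the analogy arithmetic is given below; it assumes you wrote the vocabulary to words.txt (one word per line, in the same order as the rows of the embedding matrix) and the embedding weights to embeddings.npy. These file names are examples only:

    import numpy

    # After training you could save the two pieces e.g. via
    #   numpy.save("embeddings.npy", W_in.get_value())   (W_in as in the sketch above)
    #   open("words.txt", "w").write("\n".join(words))
    words = open("words.txt").read().split()   # ordered word list, row i of E belongs to words[i]
    E = numpy.load("embeddings.npy")           # shape (V, embedding_dim)
    index = {w: i for i, w in enumerate(words)}

    def analogy(a, b, c, topn=5):
        """Words whose vectors are most cosine-similar to vec(b) - vec(a) + vec(c)."""
        query = E[index[b]] - E[index[a]] + E[index[c]]
        # Cosine similarity of every embedding with the query vector; the values
        # lie between -1 and +1 because the coefficients may be negative.
        sims = E.dot(query) / (numpy.linalg.norm(E, axis=1) * numpy.linalg.norm(query) + 1e-8)
        ranked = [words[i] for i in numpy.argsort(-sims) if words[i] not in (a, b, c)]
        return ranked[:topn]

    print(analogy("man", "king", "woman"))   # ideally something like "queen" ranks high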
In order to make your corpus available to Blocks, you will want to implement your own subclass of Dataset. Because the Fuel documentation is not great, here are some tips.
We assume that you access your dataset through a SequentialScheme in training, i.e. that the second argument to your MainLoop is a DataStream that iterates over your Dataset with a SequentialScheme, roughly as in the following sketch.
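In this sketch, dataset, algorithm and extensions stand for objects you build elsewhere, and the batch size of 50 merely matches the request example further down:

    from fuel.schemes import SequentialScheme
    from fuel.streams import DataStream
    from blocks.main_loop import MainLoop

    # `dataset` is an instance of your own Dataset subclass (a sketch of such a
    # class is given at the end of this section); `algorithm` would be e.g. a
    # GradientDescent object built from your cost, and `extensions` your list
    # of training extensions.
    scheme = SequentialScheme(dataset.num_examples, batch_size=50)
    stream = DataStream(dataset, iteration_scheme=scheme)

    main_loop = MainLoop(algorithm, stream, extensions=extensions)
    main_loop.run()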
The central method that you need to implement is get_data. If you use a SequentialScheme as above, get_data will be called with arguments like request=[0,1,...,49], i.e. the value of the request parameter will be a list that contains the indices of the instances for which you’re supposed to provide data.

Say, for example, that your model has a “features” and a “target” source. Then you would set self.sources = ["features", "target"] in the constructor of your class, and you would return tuples of length 2 from your get_data method. The first element of this tuple is for the “features” source and the second element for the “target” source. These names need to match the names of the vectors in your Blocks model.

Each element of the tuple you return should be a Numpy array with one entry (or row) for every index in request. Note that the dtypes of the entries in this array need to match the dtypes of the Theano vectors/matrices that you use as input or output in your network.

Also set self.axis_labels = None in your constructor. This is needed for technical reasons.

Remember that the arguments to the apply method of a brick are Theano tensors, and notice that you can perform arithmetic operations on Theano tensors.

To save your weight matrices during training, add a weight-saving extension to your MainLoop under the extensions parameter. Pass in layers a list of network layers that each have a weight matrix that you want to save as their first parameter, and pass the filename prefixes for each layer in the list prefixes. You can load a Numpy array from each dumped file using numpy.load.

Use Theano’s profiling support to find out where the computation time goes (note that in some Theano versions, print_summary()
is actually only called summary()
). To this end, turn your symbolic cost function into an actual function that you can call using theano.function(...)
. (You can also use this to just run some examples through your net without learning.) Profiling works best on the CPU. What do you observe?

The paper mentions two ways to speed up the softmax computation: using a hierarchical softmax or negative sampling. How much can you speed up your code using either of these approaches?
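Pulling the Dataset-related tips together, a minimal sketch of such a subclass could look like this. The class name, the two-column context layout and the int32 dtypes are illustrative choices, and depending on your Fuel version you may also need the provides_sources class attribute:

    import numpy
    from fuel.datasets import Dataset

    class CBOWDataset(Dataset):
        """Serves (context, target) pairs of word indices to Blocks."""

        # Some Fuel versions validate self.sources against this class attribute.
        provides_sources = ("features", "target")

        def __init__(self, contexts, targets):
            # contexts: integer array of shape (num_instances, context_size)
            # targets:  integer array of shape (num_instances,)
            self.contexts = contexts
            self.targets = targets
            self.sources = ["features", "target"]   # must match the model's variable names
            self.axis_labels = None                 # needed for technical reasons

        @property
        def num_examples(self):
            # Convenient when constructing the SequentialScheme.
            return len(self.targets)

        def get_data(self, state=None, request=None):
            # With a SequentialScheme, `request` is a list of instance indices,
            # e.g. [0, 1, ..., 49] for the first batch of size 50.
            rows = numpy.asarray(request)
            # The dtypes must match those of the network's input/target variables.
            return (self.contexts[rows].astype("int32"),
                    self.targets[rows].astype("int32"))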