SS 2016

Deep Learning for NLP

 

Assignment 2: Word Embeddings

In this assignment, you’re going to implement a simple version of word2vec-style word embeddings. This will give you hands-on experience with word embeddings (which are very hot right now) and an opportunity to learn a few new tricks for using Blocks and Fuel with data of nontrivial size.

Your Task

Obtain a corpus of your choice. The Brown corpus, which can be easily downloaded via NLTK, may be a good starting point. If (training!) time permits, you can switch to a more serious corpus later. You will want to implement your own Dataset to make the corpus available to Blocks; see below.
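
If you go with the Brown corpus, a minimal way of loading it and turning it into a list of word indices could look as follows. This is only a sketch; the lowercasing, the vocabulary cutoff and the <unk> token for rare words are choices made here, not requirements of the assignment:

import collections

import nltk

nltk.download("brown")                        # fetch the corpus once
from nltk.corpus import brown

words = [w.lower() for w in brown.words()]    # roughly a million tokens
counts = collections.Counter(words)

# Keep the most frequent words; everything else is mapped to <unk> (id 0).
wordlist = ["<unk>"] + [w for w, _ in counts.most_common(10000)]
word_to_id = {w: i for i, w in enumerate(wordlist)}
corpus_ids = [word_to_id.get(w, 0) for w in words]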

Implement the CBOW model from Mikolov et al. Notice that the mappings from each context word to the middle layer use the same weight matrix. For simplicity, you can start with a context of one word before and one word after, and then work your way up to bigger contexts. Use a softmax activation function and a categorical cross-entropy cost function. Experiment with different sizes of the middle layer.
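
A minimal sketch of how such a model could be wired together in Blocks is given below. The hyperparameter values and variable names are placeholders, a context of one word on each side is assumed, and the inputs are integer word indices rather than one-hot vectors (more on that below):

from theano import tensor
from blocks.bricks import Linear, Softmax
from blocks.bricks.cost import CategoricalCrossEntropy
from blocks.bricks.lookup import LookupTable
from blocks.initialization import Constant, IsotropicGaussian

VOCAB_SIZE = 10001    # placeholder: length of your word list
EMBEDDING_DIM = 100   # size of the middle layer -- experiment with this

# A batch of context-word indices, shape (batch_size, context_size), and
# the index of the middle word for each training example.
contexts = tensor.imatrix("contexts")
targets = tensor.ivector("targets")

lookup = LookupTable(VOCAB_SIZE, EMBEDDING_DIM, name="embeddings",
                     weights_init=IsotropicGaussian(0.01))
output = Linear(EMBEDDING_DIM, VOCAB_SIZE, name="output",
                weights_init=IsotropicGaussian(0.01),
                biases_init=Constant(0))
lookup.initialize()
output.initialize()

# CBOW: average the context embeddings (all context positions share the
# same weight matrix), project back to vocabulary size, apply a softmax,
# and compare against the true middle word with categorical cross-entropy.
hidden = lookup.apply(contexts).mean(axis=1)
probs = Softmax().apply(output.apply(hidden))
cost = CategoricalCrossEntropy().apply(targets, probs)
cost.name = "cost"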

The inputs and outputs of your network will be integers indicating the identity of a word, e.g. its index in a list of all words. In the papers you read, these integers are usually represented as one-hot vectors of length V, where V is the vocabulary size. You will find that actually computing such one-hot encodings is prohibitively slow, especially if you also need to transfer them to a GPU. Use the following tricks to deal with this:
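
Two tricks that are commonly used for this (and that the Blocks sketch above already relies on) are, on the input side, to treat the embedding layer as a lookup table that picks rows of the weight matrix by integer index, and, on the output side, to hand the integer target indices directly to the cross-entropy, which then simply selects the corresponding softmax probability instead of multiplying with a one-hot vector. In raw Theano, the same two ideas look roughly like this (shapes and initializations are only illustrative):

import numpy as np
import theano
from theano import tensor

V, d = 10001, 100                      # vocabulary size, embedding size
contexts = tensor.imatrix("contexts")  # (batch, context) word indices
targets = tensor.ivector("targets")    # (batch,) middle-word indices

# Input side: indexing the embedding matrix selects exactly the rows that a
# one-hot multiplication would select, without building any one-hot vectors.
embeddings = theano.shared(
    np.random.normal(0, 0.01, (V, d)).astype(theano.config.floatX))
context_vectors = embeddings[contexts]            # shape (batch, context, d)

# Output side: categorical_crossentropy accepts a vector of integer class
# labels, so the targets never have to be expanded to length-V vectors.
output_W = theano.shared(np.zeros((d, V), dtype=theano.config.floatX))
probs = tensor.nnet.softmax(context_vectors.mean(axis=1).dot(output_W))
cost = tensor.nnet.categorical_crossentropy(probs, targets).mean()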

At the end, save your ordered word list and the weight matrix of the embedding layer to a file. Use vector arithmetic and cosine similarities to solve a few analogy tasks. Notice that the value of the cosine may range from -1 to +1 because word embedding vectors may contain negative coefficients. A cosine of +1 indicates that the two vectors are parallel and point in the same direction.
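
Once the word list and the embedding matrix are on disk, the analogy arithmetic only needs numpy. A small sketch, assuming hypothetical file names wordlist.txt and embeddings_10.npy (the latter as written by the SaveWeights hint below):

import numpy as np

embeddings = np.load("embeddings_10.npy")        # (V, d) embedding matrix
with open("wordlist.txt") as f:
    words = f.read().split()
word_to_id = {w: i for i, w in enumerate(words)}

def cosine(u, v):
    # Ranges from -1 (opposite directions) to +1 (parallel, same direction).
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def analogy(a, b, c, k=5):
    # "a is to b as c is to ?": rank all words by cosine similarity to the
    # vector b - a + c and return the k closest candidates.
    query = (embeddings[word_to_id[b]] - embeddings[word_to_id[a]]
             + embeddings[word_to_id[c]])
    scored = [(cosine(query, embeddings[i]), w) for i, w in enumerate(words)]
    return sorted(scored, reverse=True)[:k]

print(analogy("man", "king", "woman"))   # hopefully ranks "queen" highly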

Implementing your own Dataset

In order to make your corpus available to Blocks, you will want to implement your own subclass of Dataset. Because the Fuel documentation is not great, here are some tips.

We assume that you access your dataset through a SequentialScheme in training, i.e. that the second argument to your MainLoop is something like:

from fuel.schemes import SequentialScheme
from fuel.streams import DataStream

DataStream.default_stream(dataset,
       iteration_scheme=SequentialScheme(dataset.num_instances(), 50))
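
Under that assumption, a bare-bones Dataset subclass could look roughly like the following, assuming all training pairs fit into memory as numpy arrays. The source names, the constructor arguments and the num_instances helper are choices of this sketch; the source names just have to match your model's input variables, and num_instances has to match the call above:

import numpy as np
from fuel.datasets import Dataset

class CBOWDataset(Dataset):
    # Each example is a row of context-word indices plus the index of the
    # middle word.
    provides_sources = ("contexts", "targets")

    def __init__(self, contexts, targets, **kwargs):
        self.contexts = np.asarray(contexts, dtype="int32")
        self.targets = np.asarray(targets, dtype="int32")
        super(CBOWDataset, self).__init__(**kwargs)

    def num_instances(self):
        return len(self.targets)

    def get_data(self, state=None, request=None):
        # request is the list of example indices that the SequentialScheme
        # produces for each batch (50 of them in the call above).
        return (self.contexts[request], self.targets[request])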

Hints

To inspect and save the learned embeddings, you can write a small extension that dumps the weight matrices of selected layers to disk after every epoch:

import numpy as np

from blocks.extensions import SimpleExtension

class SaveWeights(SimpleExtension):
    def __init__(self, layers, prefixes, **kwargs):
        # Trigger after every epoch unless the caller configures otherwise.
        kwargs.setdefault("after_epoch", True)
        super(SaveWeights, self).__init__(**kwargs)
        self.step = 1
        self.layers = layers
        self.prefixes = prefixes

    def do(self, callback_name, *args):
        # Save the first parameter (the weight matrix) of each layer to a
        # numbered .npy file, e.g. embeddings_3.npy after the third epoch.
        for i in range(len(self.layers)):
            filename = "%s_%d.npy" % (self.prefixes[i], self.step)
            np.save(filename, self.layers[i].parameters[0].get_value())
        self.step += 1
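
Hooked into training, the extension might be used like this (algorithm, data_stream and lookup are assumed to be the objects built in the sketches above):

from blocks.extensions import FinishAfter, Printing
from blocks.main_loop import MainLoop

# Writes embeddings_1.npy, embeddings_2.npy, ... as training progresses.
main_loop = MainLoop(algorithm, data_stream,
                     extensions=[SaveWeights([lookup], ["embeddings"]),
                                 FinishAfter(after_n_epochs=10),
                                 Printing()])
main_loop.run()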

Optional Task: Speeding up learning

The paper mentions two ways to speed up the softmax computation: hierarchical softmax and negative sampling. How much can you speed up your code using either of these approaches?
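
As a rough illustration of the negative-sampling idea (a sketch of the objective, not necessarily the exact formulation you will end up with): instead of normalizing over all V words, each training example only scores the true middle word against a handful of sampled noise words. Reusing the averaged context vector and an output embedding matrix as in the sketches above, and assuming the noise-word indices arrive as an extra input:

import numpy as np
import theano
from theano import tensor

V, d, k = 10001, 100, 10               # vocabulary, embedding size, negatives

hidden = tensor.matrix("hidden")       # (batch, d) averaged context vectors
targets = tensor.ivector("targets")    # (batch,) true middle-word indices
noise = tensor.imatrix("noise")        # (batch, k) sampled noise-word indices
out_W = theano.shared(np.zeros((V, d), dtype=theano.config.floatX))

# Score of the true word and of the k noise words for each example. Noise
# words are typically drawn from the unigram distribution raised to the
# 3/4 power, as in the word2vec papers.
pos_scores = tensor.sum(out_W[targets] * hidden, axis=1)            # (batch,)
neg_scores = tensor.sum(out_W[noise] * hidden.dimshuffle(0, "x", 1),
                        axis=2)                                     # (batch, k)

# Maximize log sigmoid(positive score) and log sigmoid(-negative scores).
cost = -(tensor.log(tensor.nnet.sigmoid(pos_scores))
         + tensor.log(tensor.nnet.sigmoid(-neg_scores)).sum(axis=1)).mean()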