Deep Learning for NLP
In this assignment, you’re going to implement a simple version of word2vec-style word embeddings. This will give you hands-on experience with word embeddings (which are very hot right now), and give you an opportunity to learn a few new tricks regarding Blocks and Fuel when dealing with data of nontrivial size.
Obtain a corpus of your choice. The Brown corpus, which can be easily downloaded via NLTK, may be a good starting point. If (training!) time permits, you can switch to a more serious corpus later. You will want to implement your own Dataset to make the corpus available to Blocks; see below.
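For example, with NLTK the Brown corpus and a word-to-index mapping can be obtained in a few lines (downloading and lowercasing everything are choices you may handle differently):

    import collections
    import nltk

    nltk.download("brown")                      # fetch the corpus once
    from nltk.corpus import brown

    sentences = [[w.lower() for w in sent] for sent in brown.sents()]

    # Ordered word list and word -> integer mapping; the network and the saved
    # embedding matrix will both refer to these indices.
    counts = collections.Counter(w for sent in sentences for w in sent)
    words = [w for w, _ in counts.most_common()]
    word_to_id = {w: i for i, w in enumerate(words)}
    print(len(sentences), "sentences,", len(words), "distinct word types")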
Implement the CBOW model from Mikolov et al. Notice that the mappings from each context word to the middle layer use the same weight matrix. For simplicity, you can start with a context of one word before and one word after, and then work your way up to bigger contexts. Use a softmax activation function and a categorical cross-entropy cost function. Experiment with different sizes of the middle layer.
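The assignment is framed in terms of Blocks bricks, but as a point of reference, here is a rough Theano-only sketch of the CBOW forward pass and cost. The sizes, the Gaussian initialization and the variable names are arbitrary choices made for this illustration, not part of the assignment:

    import numpy
    import theano
    import theano.tensor as T

    vocab_size, embedding_dim = 10000, 100      # V and the size of the middle layer
    rng = numpy.random.RandomState(1234)

    # One embedding matrix shared by all context positions, plus an output layer.
    W_in = theano.shared(rng.normal(0, 0.01, (vocab_size, embedding_dim))
                         .astype(theano.config.floatX))
    W_out = theano.shared(rng.normal(0, 0.01, (embedding_dim, vocab_size))
                          .astype(theano.config.floatX))
    b_out = theano.shared(numpy.zeros(vocab_size, dtype=theano.config.floatX))

    context = T.imatrix("features")   # (batch_size, context_size) word indices
    target = T.ivector("target")      # (batch_size,) indices of the middle word

    # Indexing the embedding matrix with integer ids replaces the one-hot product;
    # averaging merges the context positions into the middle layer.
    hidden = W_in[context].mean(axis=1)
    probs = T.nnet.softmax(T.dot(hidden, W_out) + b_out)
    cost = T.nnet.categorical_crossentropy(probs, target).mean()

    f_cost = theano.function([context, target], cost)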
The inputs and outputs of your network will be integers indicating the identity of a word, e.g. as an index in a list of all words. In papers, these integers are usually represented as one-hot encodings of length V, where V is the vocabulary size. You will find that actually computing such one-hot encodings is prohibitively slow, especially if you also need to transfer them to a GPU. Use the following tricks to deal with this: on the input side, feed the integer indices themselves and let the embedding layer select rows of its weight matrix directly; on the output side, compute the categorical cross-entropy from the integer target labels rather than from one-hot target vectors.
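To see why integer indices are enough: multiplying a one-hot vector by a weight matrix merely selects one of its rows, so a row lookup computes the same thing without ever building a length-V vector. A tiny numpy illustration (the sizes are made up):

    import numpy

    V, d = 8, 3
    W = numpy.arange(V * d, dtype="float32").reshape(V, d)

    word_id = 5
    one_hot = numpy.zeros(V, dtype="float32")
    one_hot[word_id] = 1.0

    # The dense product and the row lookup give exactly the same vector, but the
    # lookup never materialises the length-V one-hot vector.
    assert numpy.allclose(numpy.dot(one_hot, W), W[word_id])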
At the end, save your ordered word list and the weight matrix of the embedding layer to a file. Use vector arithmetic and cosine similarities to solve a few analogy tasks. Notice that the value of the cosine may range from -1 to +1 because word embedding vectors may contain negative coefficients. A cosine of +1 indicates that the two vectors are parallel and point in the same direction.
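A possible sketch of the analogy arithmetic is given below; it assumes you wrote the vocabulary to words.txt (one word per line, in the same order as the rows of the embedding matrix) and the embedding weights to embeddings.npy. These file names are examples only:

    import numpy

    # After training you could save the two pieces e.g. via
    #   numpy.save("embeddings.npy", W_in.get_value())   (W_in as in the sketch above)
    #   open("words.txt", "w").write("\n".join(words))
    words = open("words.txt").read().split()   # ordered word list, row i of E belongs to words[i]
    E = numpy.load("embeddings.npy")           # shape (V, embedding_dim)
    index = {w: i for i, w in enumerate(words)}

    def analogy(a, b, c, topn=5):
        """Words whose vectors are most cosine-similar to vec(b) - vec(a) + vec(c)."""
        query = E[index[b]] - E[index[a]] + E[index[c]]
        # Cosine similarity of every embedding with the query vector; the values
        # lie between -1 and +1 because the coefficients may be negative.
        sims = E.dot(query) / (numpy.linalg.norm(E, axis=1) * numpy.linalg.norm(query) + 1e-8)
        ranked = [words[i] for i in numpy.argsort(-sims) if words[i] not in (a, b, c)]
        return ranked[:topn]

    print(analogy("man", "king", "woman"))   # ideally something like "queen" ranks high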
In order to make your corpus available to Blocks, you will want to implement your own subclass of Dataset. Because the Fuel documentation is not great, here are some tips.
We assume that you access your dataset through a SequentialScheme in training, i.e. that the second argument to your MainLoop is a DataStream that iterates over your Dataset with a SequentialScheme, roughly as in the following sketch.
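In this sketch, dataset, algorithm and extensions stand for objects you build elsewhere, and the batch size of 50 merely matches the request example further down:

    from fuel.schemes import SequentialScheme
    from fuel.streams import DataStream
    from blocks.main_loop import MainLoop

    # `dataset` is an instance of your own Dataset subclass (a sketch of such a
    # class is given at the end of this section); `algorithm` would be e.g. a
    # GradientDescent object built from your cost, and `extensions` your list
    # of training extensions.
    scheme = SequentialScheme(dataset.num_examples, batch_size=50)
    stream = DataStream(dataset, iteration_scheme=scheme)

    main_loop = MainLoop(algorithm, stream, extensions=extensions)
    main_loop.run()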
The central method that you need to implement is get_data. If you use a SequentialScheme as above, get_data will be called with arguments like request=[0,1,...,49], i.e. the value of the request parameter will be a list that contains the indices of the instances for which you’re supposed to provide data.

Say, for example, that your model has a “features” and a “target” source. Then you would set self.sources = ["features", "target"] in the constructor of your class, and you would return tuples of length 2 from your get_data method. The first element of this tuple is for the “features” source and the second element for the “target” source. These names need to match the names of the vectors in your Blocks model.

Each element of the tuple you return should be a Numpy array with one entry (or row) for every index in request. Note that the dtypes of the entries in this array need to match the dtypes of the Theano vectors/matrices that you use as input or output in your network.

Also set self.axis_labels = None in your constructor. This is needed for technical reasons.

Remember that the arguments to the apply method of a brick are Theano tensors, and notice that you can perform arithmetic operations on Theano tensors.

To save your weight matrices during training, add a weight-saving extension to your MainLoop under the extensions parameter. Pass in layers a list of network layers that each have a weight matrix that you want to save as their first parameter, and pass the filename prefixes for each layer in the list prefixes. You can load a Numpy array from each dumped file using numpy.load.

Use Theano’s profiling support to find out where the computation time goes (note that in some Theano versions, print_summary()
is actually only called summary()
). To this end, turn your symbolic cost function into an actual function that you can call using theano.function(...)
. (You can also use this to just run some examples through your net without learning.) Profiling works best on the CPU. What do you observe?

The paper mentions two ways to speed up the softmax computation: using a hierarchical softmax or negative sampling. How much can you speed up your code using either of these approaches?
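Pulling the Dataset-related tips together, a minimal sketch of such a subclass could look like this. The class name, the two-column context layout and the int32 dtypes are illustrative choices, and depending on your Fuel version you may also need the provides_sources class attribute:

    import numpy
    from fuel.datasets import Dataset

    class CBOWDataset(Dataset):
        """Serves (context, target) pairs of word indices to Blocks."""

        # Some Fuel versions validate self.sources against this class attribute.
        provides_sources = ("features", "target")

        def __init__(self, contexts, targets):
            # contexts: integer array of shape (num_instances, context_size)
            # targets:  integer array of shape (num_instances,)
            self.contexts = contexts
            self.targets = targets
            self.sources = ["features", "target"]   # must match the model's variable names
            self.axis_labels = None                 # needed for technical reasons

        @property
        def num_examples(self):
            # Convenient when constructing the SequentialScheme.
            return len(self.targets)

        def get_data(self, state=None, request=None):
            # With a SequentialScheme, `request` is a list of instance indices,
            # e.g. [0, 1, ..., 49] for the first batch of size 50.
            rows = numpy.asarray(request)
            # The dtypes must match those of the network's input/target variables.
            return (self.contexts[rows].astype("int32"),
                    self.targets[rows].astype("int32"))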