Deep Learning for NLP

In this assignment, you’re going to implement a very simple feedforward network and use it to classify mushrooms as edible or poisonous. The primary purpose of the assignment is to familiarize you with the Blocks library.

Download the Mushroom dataset from the UCI Machine Learning repository. This is a standard dataset for evaluating classifiers.

Install the Blocks and Fuel libraries. See below for details. Work through the tutorials to make sure everything runs on your computer.

Implement a feedforward network with a single hidden layer (with a logistic, i.e. sigmoid, activation function) and a softmax activation function on the output layer. Split the mushroom data into training and test sets, and train and evaluate your network. Experiment with different cost functions (e.g. for regularization), learning rates, layer sizes, etc.
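Before building this in Blocks, it can help to see the forward pass in plain Numpy. The sketch below (all sizes and parameter values are made up for illustration; the actual model is built with Blocks bricks) shows the shape of the computation: a logistic hidden layer followed by a softmax output layer.

```python
import numpy as np

def sigmoid(z):
    # Logistic activation for the hidden layer.
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    # Row-wise softmax; subtracting the row max keeps exp() numerically stable.
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def forward(x, W1, b1, W2, b2):
    # One hidden layer with a logistic activation, softmax on the output.
    h = sigmoid(x @ W1 + b1)
    return softmax(h @ W2 + b2)

rng = np.random.RandomState(0)
x = rng.rand(4, 117)                               # 4 instances, 117 one-hot input features
W1, b1 = rng.randn(117, 50) * 0.1, np.zeros(50)    # hypothetical hidden layer of size 50
W2, b2 = rng.randn(50, 2) * 0.1, np.zeros(2)       # 2 output classes (edible/poisonous)
probs = forward(x, W1, b1, W2, b2)
print(probs.shape)                                 # (4, 2); each row sums to 1
```

In Blocks, the same architecture is expressed declaratively with bricks, and Theano handles the gradients.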

Install Bokeh, version 0.10, and blocks-extras:

Print the final test performance of the model.
Try out the Blocks Plot extension, and use it to track the cost and the misclassification rates (both training and test) during training.
Also try out the `ProgressBar` extension if you run the training loop from within a terminal.

Finally, visualize the weight matrices that your network learns. You can extract the weight matrix variables from the computation graph, access the values they had after training as a Numpy array using their get_value method, and visualize each array, e.g. with the imshow function from Matplotlib or as Hinton diagrams.
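The visualization step might look roughly like the following. The weight matrix here is random stand-in data; in your network you would instead call `get_value()` on the shared weight variables extracted from the computation graph.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")          # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt

# Stand-in for a trained weight matrix; with Blocks you would obtain it via
# W.get_value() on a shared variable from the computation graph.
W = np.random.RandomState(0).randn(117, 50)

plt.figure(figsize=(6, 4))
plt.imshow(W, aspect="auto", cmap="gray")   # each column is one hidden unit
plt.colorbar()
plt.xlabel("hidden unit")
plt.ylabel("input feature")
plt.savefig("weights.png")
```

With a trained network, structure in the rows often reveals which input features a hidden unit attends to.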

We recommend that you use Anaconda Python, which already comes with many useful modules preinstalled. You will still have to install Blocks and Fuel yourself.

- Install the bleeding-edge version of Theano.
- Install Blocks (stable version is okay).
- Install Fuel (stable version is okay).

Installing Blocks or Fuel may overwrite your installation of numpy with an older version. If you get strange exceptions regarding numpy versions, you can reinstall the current version with `pip install numpy --upgrade`.

If you have a decent (Nvidia) graphics card, you should install the GPU backend for Theano. You can find instructions here. Try the example scripts to ensure you’re actually using the GPU. If you don’t have a fast graphics card, you should still be able to do this first assignment on a CPU.

- Blocks relies on Numpy for data representation and linear algebra operations. It is a very good idea to become comfortable with Numpy; we'll use it a lot. Similarly, it's probably a good idea to familiarize yourself with Matplotlib for data visualization.
- The data is given to you as a CSV file. You need to convert each row of the file into one row of a Numpy array, in which one-hot encodings of the individual feature values have been glued together into a row vector of 117 values. The desired outputs need to be converted into a matrix of Numpy row vectors of width two. Note the Numpy functions `eye`, `concatenate`, and `vstack` for this.
- You have two options for making this Numpy array available as input to the Blocks training algorithm via Fuel. One option is to access the Numpy object directly from Fuel, using an `IndexableDataset` as described here. Our dataset has two sources, ‘features’ and ‘targets’, whose data needs to be stored in two separate Numpy arrays.
- Alternatively, you can store the Numpy array in an HDF5 file and load the file into Fuel with an `H5PYDataset`. Storing a Numpy array in an HDF5 file is explained below.
- Using `CategoricalCrossEntropy` as the cost function requires one-hot encoded targets. Using `MisclassificationRate` (for monitoring), however, requires just the class labels. These can be recovered from the one-hot encoded targets `y` as `y.argmax(axis=1)`. Not matching the proper format will result in a rather cryptic Theano error.
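The encoding and decoding steps above can be sketched with plain Numpy: indexing the rows of an identity matrix (`eye`) turns integer codes into one-hot vectors, and `argmax` inverts the encoding. The feature values and sizes below are made up for illustration.

```python
import numpy as np

# Hypothetical categorical feature with 3 possible values, observed for 4 instances.
labels = np.array([0, 2, 1, 0])          # integer codes of the observed values

# np.eye(k) is the k-by-k identity; indexing its rows yields one-hot vectors.
one_hot = np.eye(3)[labels]              # shape (4, 3)

# One-hot blocks for several features can be glued together into one row vector:
row = np.concatenate([one_hot[0], one_hot[1]])   # shape (6,)

# Class labels are recovered from one-hot targets with argmax:
recovered = one_hot.argmax(axis=1)
print(recovered)                         # [0 2 1 0]
```

Stacking such rows for all instances (e.g. with `vstack`) yields the full data matrix.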

One slightly annoying task is to write the Numpy array into which you have converted the CSV into an HDF5 file. This is easy once you know how to do it, but the Fuel documentation is a bit opaque. Here is the code we came up with. `np_enc_data` is a Numpy array with one row for each training and test instance; each row is the one-hot encoding of a feature vector. `np_enc_y` is a similar array in which each row is the one-hot encoding of the annotated class. `N` is the number of training plus testing instances, and `splitpoint` is the number of training instances.

Note that we are giving the names `x` and `y` to the inputs and outputs in the HDF5 file. These must match the names of the input and output variables that your Blocks model assumes, i.e. they are meant for constructing the input as `tensor.matrix('x')` and the output as `tensor.lmatrix('y')`. If you want to use other names in your model, you should use those names in the code below as well.
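A sketch of such a conversion, using `h5py` directly: the toy arrays below merely stand in for the real `np_enc_data` and `np_enc_y`, and the Fuel-specific split metadata is indicated only in a comment, so check the exact call against the Fuel documentation for `H5PYDataset`.

```python
import numpy as np
import h5py

# Toy stand-ins for the arrays described above (the real ones come from the CSV).
N, splitpoint = 10, 8
rng = np.random.RandomState(0)
np_enc_data = rng.rand(N, 117).astype("float32")               # one-hot feature rows
np_enc_y = np.eye(2)[rng.randint(0, 2, N)].astype("int64")     # one-hot class rows

with h5py.File("mushrooms.hdf5", "w") as f:
    f.create_dataset("x", data=np_enc_data)
    f.create_dataset("y", data=np_enc_y)
    # For Fuel's H5PYDataset you would additionally record the train/test split,
    # roughly along these lines (see the Fuel docs for the exact API):
    #   f.attrs['split'] = H5PYDataset.create_split_array(
    #       {'train': {'x': (0, splitpoint), 'y': (0, splitpoint)},
    #        'test':  {'x': (splitpoint, N), 'y': (splitpoint, N)}})
```

Loading the file back with `H5PYDataset` then gives Fuel access to the `train` and `test` splits by name.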