Deep Learning for NLP
In this assignment, you’re going to implement a very simple feedforward network and use it to classify mushrooms as edible or poisonous. The primary purpose of the assignment is to familiarize you with the Blocks library.
Download the Mushroom dataset from the UCI Machine Learning repository. This is a standard dataset for evaluating classifiers.
Install the Blocks and Fuel libraries. See below for details. Work through the tutorials to make sure everything runs on your computer.
Implement a feedforward network with a single hidden layer (with a logistic, i.e. sigmoid, activation function) and a softmax activation function on the output layer. Split the mushroom data into training and test sets, and train and evaluate your network. Experiment with different cost functions (e.g. with added regularization terms), learning rates, layer sizes, etc.
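Independent of Blocks, the forward pass of such a network is just two affine maps with a sigmoid and a softmax; here is a plain-Numpy sketch (the layer sizes and random weights are illustrative, not part of the assignment data):

```python
import numpy as np

def forward(x, W1, b1, W2, b2):
    # Hidden layer: logistic (sigmoid) activation
    h = 1.0 / (1.0 + np.exp(-(x @ W1 + b1)))
    # Output layer: numerically stable softmax
    logits = h @ W2 + b2
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 117))                    # 4 instances; 117 input features (illustrative)
W1, b1 = 0.1 * rng.standard_normal((117, 50)), np.zeros(50)
W2, b2 = 0.1 * rng.standard_normal((50, 2)), np.zeros(2)
probs = forward(x, W1, b1, W2, b2)                   # one row of class probabilities per instance
```

In Blocks you would express the same structure with bricks rather than raw arrays, but keeping this picture in mind helps when debugging shapes.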
Install Bokeh, version 0.10, and blocks-extras:
Print the final test performance of the model.
Try out the Blocks Plot extension, and use it to track the cost and the misclassification rates (both training and test) during training.
Also try out the ProgressBar extension if you run the training loop from within a terminal.
Finally, visualize the weight matrices that your network learns. You can extract the weight matrix variables from the computation graph, access the values they had after training as a Numpy array using their get_value method, and visualize each array, e.g. with the imshow function from Matplotlib or as Hinton diagrams.
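As a sketch of this visualization step: the matrix below is a random placeholder standing in for the value you would obtain via get_value on a trained weight variable.

```python
import os
import numpy as np
import matplotlib
matplotlib.use('Agg')          # render without a display
import matplotlib.pyplot as plt

# Placeholder for the trained weights; in your code this would be
# something like W.get_value() on a variable from the computation graph.
W_value = np.random.default_rng(0).standard_normal((117, 50))

plt.imshow(W_value, cmap='gray', aspect='auto', interpolation='nearest')
plt.colorbar()
plt.xlabel('hidden unit')
plt.ylabel('input feature')
plt.savefig('hidden_weights.png', bbox_inches='tight')
png_bytes = os.path.getsize('hidden_weights.png')
```

Rows or columns that stand out visually often correspond to features the network relies on heavily.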
We recommend that you use Anaconda Python, which already comes with many useful modules preinstalled. You will still have to install Blocks and Fuel yourself.
Installing Blocks or Fuel may overwrite your installation of numpy with an older version. If you get strange exceptions regarding numpy versions, you can reinstall the current version with pip install numpy --upgrade.
If you have a decent (Nvidia) graphics card, you should install the GPU backend for Theano. You can find instructions here. Try the example scripts to ensure you’re actually using the GPU. If you don’t have a fast graphics card, you should still be able to do this first assignment on a CPU.
Hints:
- The Numpy functions eye, concatenate, and vstack are useful for constructing the one-hot encodings.
- The simplest way to wrap your data for Fuel is an IndexableDataset, as described here. Our dataset has two sources, ‘features’ and ‘targets’, and the data needs to be stored in two separate Numpy arrays, respectively.
- Alternatively, you can use an H5PYDataset. Storing a Numpy array in an HDF5 file is explained below.
- Using CategoricalCrossEntropy as the cost function requires one-hot encoded targets. Using MisclassificationRate (for monitoring), however, requires just the class labels. These can be recovered from the one-hot encoded targets y as y.argmax(axis=1). Not matching the proper format will result in a rather cryptic Theano error.

One slightly annoying task is writing the Numpy arrays into which you have converted the CSV data to an HDF5 file. This is easy once you know how to do it, but the Fuel documentation is a bit opaque. Here is the code we came up with: np_enc_data is a Numpy array with one row for each training and test instance; each row is the one-hot encoding of a feature vector. np_enc_y is a similar array in which each row is the one-hot encoding of the annotated class. N is the number of training plus test instances, and splitpoint is the number of training instances.
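The eye/vstack hint above can be made concrete. This is a minimal sketch for one categorical column; the column values and category inventory are illustrative:

```python
import numpy as np

# One-hot encode a single categorical column using np.eye;
# the column here stands in for e.g. the edible/poisonous labels.
column = ['e', 'p', 'e', 'e', 'p']
categories = sorted(set(column))                 # fixed order: ['e', 'p']
index = {c: i for i, c in enumerate(categories)}
eye = np.eye(len(categories))
np_enc_y = np.vstack([eye[index[c]] for c in column])
# Encode each feature column the same way and join the per-column
# encodings with np.concatenate(..., axis=1) to obtain np_enc_data.
```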
Note that we are giving the names x and y to the inputs and outputs in the HDF5 file. These must match the names of the input and output structures that your Blocks model assumes, i.e. they are meant for constructing the input as tensor.matrix('x') and the output as tensor.lmatrix('y'). If you want to use other names in your model, you should use those names in the code below as well.