SS 2016

Deep Learning for NLP

 

Assignment 1: Mushrooms

In this assignment, you’re going to implement a very simple feedforward network and use it to classify mushrooms as edible or poisonous. The primary purpose of the assignment is to familiarize you with the Blocks library.

Your Task

Download the Mushroom dataset from the UCI Machine Learning repository. This is a standard dataset for evaluating classifiers.

Install the Blocks and Fuel libraries. See below for details. Work through the tutorials to make sure everything runs on your computer.

Implement a feedforward network with a single hidden layer (with a logistic, i.e. sigmoid, activation function) and a softmax activation function on the output layer. Split the mushroom data into training and test sets, and train and evaluate your network. Experiment with different cost functions (e.g. with regularization terms), learning rates, layer sizes, etc.
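As a point of reference, the forward pass of such a network can be sketched in plain NumPy (all names and sizes below are made up for illustration; in the assignment you would build the same structure with Blocks bricks, which also manage the parameters for you):

```python
import numpy as np

rng = np.random.RandomState(0)

n_features, n_hidden, n_classes = 117, 50, 2   # hypothetical sizes

# Randomly initialized parameters (in Blocks, bricks hold these).
W1 = rng.normal(scale=0.01, size=(n_features, n_hidden))
b1 = np.zeros(n_hidden)
W2 = rng.normal(scale=0.01, size=(n_hidden, n_classes))
b2 = np.zeros(n_classes)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    # Subtract the row-wise max for numerical stability.
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def forward(x):
    h = sigmoid(x.dot(W1) + b1)        # hidden layer, logistic activation
    return softmax(h.dot(W2) + b2)     # output layer, class probabilities

x = rng.rand(4, n_features)            # a dummy mini-batch of 4 instances
probs = forward(x)
print(probs.shape)                     # (4, 2)
```

Each row of `probs` sums to 1, so the output can be read as the probability of each class (edible vs. poisonous) for one instance.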

Install Bokeh, version 0.10, and blocks-extras:

pip install bokeh==0.10
git clone https://github.com/mila-udem/blocks-extras.git
cd blocks-extras
python setup.py install

Print the final test performance of the model. Try out the Blocks Plot extension, and use it to track the cost and the misclassification rates (both training and test) during training. Also try out the ProgressBar extension if you run the training loop from within a terminal.

Finally, visualize the weight matrices that your network learns. You can extract the weight matrix variables from the computation graph, access the values they had after training as a Numpy array using their get_value method, and visualize each array, e.g. with the imshow function from Matplotlib or as Hinton diagrams.
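A minimal sketch of the visualization step, assuming you have already pulled a weight matrix out of the trained model as a NumPy array (here we fabricate one; in your code it would come from the parameter's get_value method):

```python
import numpy as np
import matplotlib
matplotlib.use('Agg')          # render off-screen, write to a file
import matplotlib.pyplot as plt

# Stand-in for the trained weight matrix (input features x hidden units);
# in your code: W = <weight variable from the computation graph>.get_value()
W = np.random.RandomState(0).normal(size=(117, 50))

plt.figure(figsize=(6, 4))
plt.imshow(W, aspect='auto', interpolation='nearest', cmap='gray')
plt.colorbar()
plt.xlabel('hidden unit')
plt.ylabel('input feature')
plt.title('First-layer weights')
plt.savefig('weights.png')
```

Light and dark pixels then correspond to large positive and negative weights; rows that stand out indicate input features the network relies on heavily.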

Installing Blocks and Fuel

We recommend that you use Anaconda Python, which already comes with many useful modules preinstalled. You will still have to install Blocks and Fuel yourself.

  1. Install the bleeding-edge version of Theano.
  2. Install Blocks (stable version is okay).
  3. Install Fuel (stable version is okay).

Installing Blocks or Fuel may overwrite your installation of numpy with an older version. If you get strange exceptions regarding numpy versions, you can reinstall the current version with pip install numpy --upgrade.

If you have a decent (Nvidia) graphics card, you should install the GPU backend for Theano. You can find instructions in the Theano documentation. Try the example scripts to ensure you’re actually using the GPU. If you don’t have a fast graphics card, you should still be able to do this first assignment on a CPU.

Hints

Writing Numpy arrays to HDF5

One slightly annoying task is to write the Numpy array into which you have converted the CSV out to an HDF5 file. This is easy once you know how to do it, but the Fuel documentation is a bit opaque on this point. Here is the code we came up with. np_enc_data is a Numpy array with one row for each training and test instance; each row is the one-hot encoding of a feature vector. np_enc_y is a similar array in which each row is the one-hot encoding of the annotated class. N is the number of training plus testing instances, and splitpoint is the number of training instances.
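For context, one-hot encoding the categorical mushroom features might look like this (a sketch in plain NumPy on a toy table; the values here are made up, and the real CSV has 22 feature columns plus the class column):

```python
import numpy as np

# Toy stand-in for the parsed CSV: each row is one instance,
# each column a categorical feature encoded as a single letter.
rows = [['x', 's', 'n'],
        ['b', 's', 'y'],
        ['x', 'y', 'n']]

def one_hot_encode(rows):
    columns = list(zip(*rows))
    # For every column, fix an ordering of its observed values.
    vocab = [sorted(set(col)) for col in columns]
    n_features = sum(len(v) for v in vocab)
    enc = np.zeros((len(rows), n_features), dtype='float32')
    for i, row in enumerate(rows):
        offset = 0
        for value, values in zip(row, vocab):
            # Set the position of this column's value to 1.
            enc[i, offset + values.index(value)] = 1.0
            offset += len(values)
    return enc

np_enc_data = one_hot_encode(rows)
print(np_enc_data.shape)   # (3, 6): each column contributes 2 positions
```

The class column can be encoded the same way to obtain np_enc_y.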

Note that we are giving names x and y to the inputs and outputs in the HDF5 file. These must match the names for the input and output structures that your Blocks model assumes, i.e. they are meant for constructing the input as tensor.matrix('x') and the output as tensor.lmatrix('y'). If you want to use other names in your model, you should use those names in the code below as well.

import h5py
from fuel.datasets.hdf5 import H5PYDataset

hdf5name = 'mushrooms.hdf5'
f = h5py.File(hdf5name, mode='w')

# One HDF5 dataset per source, with fixed shapes and dtypes
# (float32 inputs, int64 targets to match tensor.lmatrix).
fx = f.create_dataset('x', np_enc_data.shape, dtype='float32')
fy = f.create_dataset('y', np_enc_y.shape, dtype='int64')

# Copy the Numpy arrays into the HDF5 datasets.
fx[...] = np_enc_data
fy[...] = np_enc_y

# The first splitpoint rows form the training set, the rest the test set.
split_dict = {
    'train': {'x': (0, splitpoint), 'y': (0, splitpoint)},
    'test': {'x': (splitpoint, N), 'y': (splitpoint, N)}}

f.attrs['split'] = H5PYDataset.create_split_array(split_dict)

f.flush()
f.close()