Deep Learning for NLP

In this assignment, you’re going to implement a very simple feedforward network and use it to classify mushrooms as edible or poisonous. The primary purpose of the assignment is to familiarize you with the Blocks library.

Download the Mushroom dataset from the UCI Machine Learning repository. This is a standard dataset for evaluating classifiers.

Install the Blocks and Fuel libraries. See below for details. Work through the tutorials to make sure everything runs on your computer.

Implement a feedforward network with a single hidden layer (with a logistic, i.e. sigmoid, activation function) and a softmax activation function on the output layer. Split the mushroom data into training and test sets, and train and evaluate your network. Experiment with different cost functions (e.g. for regularization), learning rates, layer sizes, etc.
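Before building this in Blocks, it can help to see the forward pass in plain Numpy. The sketch below (all sizes and parameter values are made up for illustration; the actual model is built with Blocks bricks) shows the shape of the computation: a logistic hidden layer followed by a softmax output layer.

```python
import numpy as np

def sigmoid(z):
    # Logistic activation for the hidden layer.
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    # Row-wise softmax; subtracting the row max keeps exp() numerically stable.
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def forward(x, W1, b1, W2, b2):
    # One hidden layer with a logistic activation, softmax on the output.
    h = sigmoid(x @ W1 + b1)
    return softmax(h @ W2 + b2)

rng = np.random.RandomState(0)
x = rng.rand(4, 117)                               # 4 instances, 117 one-hot input features
W1, b1 = rng.randn(117, 50) * 0.1, np.zeros(50)    # hypothetical hidden layer of size 50
W2, b2 = rng.randn(50, 2) * 0.1, np.zeros(2)       # 2 output classes (edible/poisonous)
probs = forward(x, W1, b1, W2, b2)
print(probs.shape)                                 # (4, 2); each row sums to 1
```

In Blocks, the same architecture is expressed declaratively with bricks, and Theano handles the gradients.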

Install Bokeh, version 0.10, and blocks-extras:

Print the final test performance of the model.
Try out the Blocks Plot extension, and use it to track the cost and the misclassification rates (both training and test) during training.
Also try out the `ProgressBar` extension if you run the training loop from within a terminal.

Finally, visualize the weight matrices that your network learns. You can extract the weight matrix variables from the computation graph, access the values they had after training as a Numpy array using their get_value method, and visualize each array, e.g. with the imshow function from Matplotlib or as Hinton diagrams.
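The visualization step might look roughly like the following. The weight matrix here is random stand-in data; in your network you would instead call `get_value()` on the shared weight variables extracted from the computation graph.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")          # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt

# Stand-in for a trained weight matrix; with Blocks you would obtain it via
# W.get_value() on a shared variable from the computation graph.
W = np.random.RandomState(0).randn(117, 50)

plt.figure(figsize=(6, 4))
plt.imshow(W, aspect="auto", cmap="gray")   # each column is one hidden unit
plt.colorbar()
plt.xlabel("hidden unit")
plt.ylabel("input feature")
plt.savefig("weights.png")
```

With a trained network, structure in the rows often reveals which input features a hidden unit attends to.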

We recommend that you use Anaconda Python, which already comes with many useful modules preinstalled. You will still have to install Blocks and Fuel yourself.

- Install the bleeding-edge version of Theano.
- Install Blocks (stable version is okay).
- Install Fuel (stable version is okay).

Installing Blocks or Fuel may overwrite your installation of numpy with an older version. If you get strange exceptions regarding numpy versions, you can reinstall the current version with `pip install numpy --upgrade`.

If you have a decent (Nvidia) graphics card, you should install the GPU backend for Theano. You can find instructions here. Try the example scripts to ensure you’re actually using the GPU. If you don’t have a fast graphics card, you should still be able to do this first assignment on a CPU.

- Blocks relies on Numpy for data representation and linear algebra operations. It is a very good idea to become comfortable with Numpy; we'll use it a lot. Similarly, it's probably a good idea to familiarize yourself with Matplotlib for data visualization.
- The data is given to you as a CSV file. You need to convert each row of the file into one row of a Numpy array, in which one-hot encodings of the individual feature values have been glued together into a row vector of 117 values. The desired outputs need to be converted into a matrix of Numpy row vectors of width two. Note the Numpy functions `eye`, `concatenate`, and `vstack` for this.
- You have two options for making this Numpy array available as input to the Blocks training algorithm via Fuel. One option is to access the Numpy object directly from Fuel, using an `IndexableDataset` as described here. Our dataset has two sources, ‘features’ and ‘targets’, whose data needs to be stored in two separate Numpy arrays.
- Alternatively, you can store the Numpy array in an HDF5 file and load the file into Fuel with an `H5PYDataset`. Storing a Numpy array in an HDF5 file is explained below.
- Using `CategoricalCrossEntropy` as the cost function requires one-hot encoded targets. Using `MisclassificationRate` (for monitoring), however, requires just the class labels. These can be recovered from the one-hot encoded targets `y` as `y.argmax(axis=1)`. Not matching the proper format will result in a rather cryptic Theano error.
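The encoding and decoding steps above can be sketched with plain Numpy: indexing the rows of an identity matrix (`eye`) turns integer codes into one-hot vectors, and `argmax` inverts the encoding. The feature values and sizes below are made up for illustration.

```python
import numpy as np

# Hypothetical categorical feature with 3 possible values, observed for 4 instances.
labels = np.array([0, 2, 1, 0])          # integer codes of the observed values

# np.eye(k) is the k-by-k identity; indexing its rows yields one-hot vectors.
one_hot = np.eye(3)[labels]              # shape (4, 3)

# One-hot blocks for several features can be glued together into one row vector:
row = np.concatenate([one_hot[0], one_hot[1]])   # shape (6,)

# Class labels are recovered from one-hot targets with argmax:
recovered = one_hot.argmax(axis=1)
print(recovered)                         # [0 2 1 0]
```

Stacking such rows for all instances (e.g. with `vstack`) yields the full data matrix.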

One slightly annoying task is to write the Numpy array into which you have converted the CSV into an HDF5 file. This is easy once you know how to do it, but the Fuel documentation is a bit opaque. Here is the code we came up with. `np_enc_data` is a Numpy array with one row for each training and test instance; each row is the one-hot encoding of a feature vector. `np_enc_y` is a similar array in which each row is the one-hot encoding of the annotated class. `N` is the number of training plus testing instances, and `splitpoint` is the number of training instances.

Note that we are giving the names `x` and `y` to the inputs and outputs in the HDF5 file. These must match the names of the input and output variables that your Blocks model assumes, i.e. they are meant for constructing the input as `tensor.matrix('x')` and the output as `tensor.lmatrix('y')`. If you want to use other names in your model, you should use those names in the code below as well.
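A sketch of such a conversion, using `h5py` directly: the toy arrays below merely stand in for the real `np_enc_data` and `np_enc_y`, and the Fuel-specific split metadata is indicated only in a comment, so check the exact call against the Fuel documentation for `H5PYDataset`.

```python
import numpy as np
import h5py

# Toy stand-ins for the arrays described above (the real ones come from the CSV).
N, splitpoint = 10, 8
rng = np.random.RandomState(0)
np_enc_data = rng.rand(N, 117).astype("float32")               # one-hot feature rows
np_enc_y = np.eye(2)[rng.randint(0, 2, N)].astype("int64")     # one-hot class rows

with h5py.File("mushrooms.hdf5", "w") as f:
    f.create_dataset("x", data=np_enc_data)
    f.create_dataset("y", data=np_enc_y)
    # For Fuel's H5PYDataset you would additionally record the train/test split,
    # roughly along these lines (see the Fuel docs for the exact API):
    #   f.attrs['split'] = H5PYDataset.create_split_array(
    #       {'train': {'x': (0, splitpoint), 'y': (0, splitpoint)},
    #        'test':  {'x': (splitpoint, N), 'y': (splitpoint, N)}})
```

Loading the file back with `H5PYDataset` then gives Fuel access to the `train` and `test` splits by name.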