Report 5: Handwritten character recognition by the method of k neareast neighbors

The method of k means we implemented last week can be a useful one when you want to divide objects into classes of similar objects whose nature is not well defined in advance.

But in the handwritten character recognition problem, we know the character class of each image in the "training" data set, and we should make use of that information.

One very intuitive method of creating a classifier using labelled training data, and one that requires almost no work beyond devising features, is the method of k nearest neighbors. It is simply to guess that a new example belongs to the same class as its nearest neighbors in feature space. First choose a number, k, of neighbors to be consulted. Then, given a new image that you want to classify, (i) compute its feature vector, (ii) find the classes of the k nearest neighbors of this feature vector among the training data, (iii) use the mode of the classes of the nearest neighbors (the mode in case they are not all the same) as your predicted class.

The kd tree that I recommended, and many of you used, for your Star Recognizer provides an easy and efficient way of obtaining k nearest neighbors of a given point among a set of other points. (scipy.spatial.KDTree, using the query() method.) Caution: the k in kd tree is a different k: the dimension of the feature space.

Your project for Report 5, due at 8am Saturday Nov 16, is to develop a classifier of handwritten characters from the set of 10 classes occurring in our training data: devising and including at least one new feature that you think might be useful, using the method of k nearest neighbors to classify this new set of handwritten character images (still welcoming additional contributions to this!), and providing an assessment of how well your classifier performs.

You may restrict your attention to a subset of the 10 classes if you feel you can do well on that subset but not on the whole 10. But include at least 5 classes.


Q: Are we allowed to use scipy's implementation of kNN or are we expected to program it from scratch?

A: You are expected to program it from scratch except that you may use scipy's k-D tree.

Q: Do I need to use the method of k-means?

A: No. Do NOT use the method of k-means.