Saturday, May 12, 2012

Implementation of a supervised discretization algorithm (or, I'm almost certain that I had a reason for this when I started)

Oftentimes, you have a set of data and want to use it to make a prediction about the future (e.g., predicting tomorrow's weather, based on today's weather) or about an unobserved system (e.g., predicting the presence or absence of metal ores underground, based on the local magnetic field).

In order to make that prediction, we need to set up some sort of mathematical framework describing the possible relationships between the data and the prediction.  If we have first-principles knowledge of the system in question, we can build a model of it and use the available data to set any free parameters the model has.  Often, however, we know very little about the system linking the data to the prediction.  In those cases, we need to propose a model flexible and unbiased enough that, once its parameters have been fit to the observed data/prediction relationships, it accurately reflects the unknown system relating the data to the variables to be predicted.

There is an enormous literature describing many different structures for generic predictors.  However, many of these rely on the input data being discretely valued; that is, instead of taking on any value in a continuous range (like a distance: 12m, 5.43m, 1m, 0.333205m), the inputs may only take on a finite set of values (like 'number of fingers', an integer between 0 and 12, inclusive).  In order to leverage these discrete-input predictors, it is necessary to 'discretize' continuous inputs; that is, to map each range of values to a single class (e.g., temperatures 0 to 10 degrees become '0', 10 to 25 become '1', 25 to 40 become '2', etc.).
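To make the idea concrete, here is a toy sketch of that last example (the temperature values are made up, and the bin edges are just the ones quoted above):

temps = [3.2 12.7 27.5 38.9 8.1];   % continuous input values (made up)
edges = [0 10 25 40];               % boundaries of the ranges
[~, bin] = histc(temps, edges);     % index of the range each value falls in
classes = bin - 1                   % discrete classes: [0 1 2 2 0]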

There is a smaller literature describing and analyzing methods for choosing how to perform this discretization.  In the future, I might put together a post reviewing these methods (at least from a layman's perspective, as I am not a member of the machine learning research community).  Here, I will share my MATLAB implementation of the method published by Marc Boulle in 2006; this was the method I found most compelling after reviewing the discretization literature.

The implementation

The algorithm is described (and derived, analyzed, and experimentally compared with other methods) in "Boulle, Marc (2006). MODL: A Bayes optimal discretization method for continuous attributes. Machine Learning, 65:131-165."  The algorithm consists of two parts: a criterion for comparing different candidate partitionings of the data, and a search method for attempting to find the best partitioning under that criterion.
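Roughly, the criterion assigns a cost to a candidate partitioning, and lower is better.  A rough MATLAB sketch of that cost, as I read it from the paper (see the paper for the derivation and exact notation; gammaln keeps the log-factorials from overflowing), looks like this:

function cost = modl_cost(counts)
% counts(i,j) = number of instances of output class j in interval i.
[I, J] = size(counts);
ni = sum(counts, 2);                       % instances per interval
n  = sum(ni);                              % total number of instances
logchoose = @(a, b) gammaln(a + 1) - gammaln(b + 1) - gammaln(a - b + 1);
cost = log(n) ...                                          % prior on the number of intervals
     + logchoose(n + I - 1, I - 1) ...                     % prior on the interval bounds
     + sum(logchoose(ni + J - 1, J - 1)) ...               % prior on the class distribution per interval
     + sum(gammaln(ni + 1) - sum(gammaln(counts + 1), 2)); % likelihood of the observed classes
end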

My implementation uses two 'linked lists' (in quotation marks, because they are implemented with MATLAB's generic array data type rather than a distinct linked-list data type).  The first list holds the information describing each interval of the current partition: the total number of sample instances in the interval, the number of instances from each output class, and the indices of the first and last members of the data set that fall in the interval.  The second list points to adjacent pairs of intervals in the first list, and is sorted according to how much the criterion would improve if the pointed-to pair were merged.  The algorithm proceeds by merging the 'best' pair of adjacent intervals, then updating both lists to reflect that merger.  The algorithm is called 'greedy' because it always chooses the most immediately attractive merger, even though that choice may not lead to a globally optimal solution.
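Stripped of the linked-list bookkeeping, the heart of the procedure is the greedy loop.  A simplified sketch (which re-evaluates every adjacent pair on each pass instead of maintaining the sorted pair list, so it is slower but easier to read) might look like this, where cost_fn is a handle to a criterion like the one sketched above:

function counts = greedy_merge(counts, cost_fn)
% counts(i,j) = number of instances of class j in interval i;
% cost_fn maps such a matrix to a criterion value (lower is better).
while size(counts, 1) > 1
    cur = cost_fn(counts);
    best_gain = 0;
    best_i = 0;
    for i = 1:size(counts, 1) - 1
        merged = counts;
        merged(i, :) = merged(i, :) + merged(i + 1, :);   % try merging intervals i and i+1
        merged(i + 1, :) = [];
        gain = cur - cost_fn(merged);                     % positive if the merge lowers the cost
        if gain > best_gain
            best_gain = gain;
            best_i = i;
        end
    end
    if best_i == 0
        break;                                            % no merge improves the criterion
    end
    counts(best_i, :) = counts(best_i, :) + counts(best_i + 1, :);
    counts(best_i + 1, :) = [];
end
end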

My implementation of this algorithm (including some support functions) is included in this archive.

A quick test

Here's some example data I generated, along with the optimal discretization returned by the algorithm (as implemented in MATLAB).  The 1000 sample data points were drawn from equal-variance Gaussian distributions whose means differed and were determined by the points' output class values.  The data are plotted below; the y-axis is each data point's continuous value, and the point's color corresponds to its output class value.  The horizontal lines show the edges of the intervals determined by the algorithm; as you can see, the classes (colors) are well segregated by the edges.  Intuition suggests that there would be four edges (separating the five output classes); in the case below, there is an additional edge separating the contentious transition region between the dark blue and cyan classes.
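For anyone who wants to reproduce a similar picture, data of this kind can be generated along the following lines (the class means, spread, and per-class counts here are illustrative guesses, not the exact values behind the plot above):

n_per_class = 200;                          % 5 classes x 200 points = 1000 points
means = [0 2 4 6 8];                        % one mean per output class
sigma = 0.8;                                % shared standard deviation
x = []; y = [];
for c = 1:numel(means)
    x = [x; means(c) + sigma * randn(n_per_class, 1)];    % continuous values
    y = [y; c * ones(n_per_class, 1)];                     % output class labels
end
scatter((1:numel(x))', x, 10, y, 'filled'); % continuous value on the y-axis, colored by class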
