Saturday, May 12, 2012

Implementation of a supervised discretization algorithm (or, I'm almost certain that I had a reason for this when I started)

Oftentimes, you have a set of data and want to use it to make a prediction about the future (e.g., predicting tomorrow's weather, based on today's weather) or about an unobserved system (e.g., predicting the presence or absence of metal ores underground, based on the local magnetic field).

In order to make that prediction, we need to set up some sort of mathematical framework describing the possible relationships between the data and the prediction.  If we have first-principles knowledge of the system in question, we can build a model of the system and use the available data to set any free parameters the model has.  Often, however, we have no idea about the system lying between the data and the prediction.  In these cases, we need to propose a model sufficiently flexible and unbiased that, after setting its parameters according to the observed prediction/data relationships, the model accurately reflects the unknown system connecting the data and the variables to be predicted.

There is an enormous literature describing many different structures for generic predictors.  However, many of these rely on the input data being discretely valued; that is, instead of taking on any value in a continuous range (like a distance: 12m, 5.43m, 1m, 0.333205m), the inputs may only take on values from a discrete set (like 'number of fingers', an integer between 0 and 12, inclusive).  In order to leverage these discrete-input predictors, it is necessary to 'discretize' continuous inputs; that is, to assign each range of values to a single class (e.g., temperatures from 0 to 10 degrees become '0', 10 to 25 become '1', 25 to 40 become '2', etc.).
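As a toy MATLAB illustration of that temperature example (the cut points here are fixed by hand; choosing them well from the data is the entire subject of the algorithm below):

temps = [3.2 11.7 8.9 31.0 24.5];  % continuous input values
edges = [0 10 25 40];              % hand-picked interval boundaries
[~, class] = histc(temps, edges);  % class 1 = [0,10), 2 = [10,25), 3 = [25,40)
disp(class)                        % prints: 1 2 1 3 2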

There is a smaller literature describing and analyzing methods for choosing how to perform this discretization.  In the future, I might put together a post reviewing these methods (at least, from a layman's perspective, as I am not a member of the machine learning research community).  Here, I will share my implementation, in MATLAB, of the method created/communicated by Marc Boulle in 2006; this was the method I found the most compelling after performing a review of the discretization literature.

The implementation

The algorithm is described (and derived, and analyzed, and experimentally compared with other methods) in "Boulle, Marc (2006). MODL: A Bayes optimal discretization method for continuous attributes. Machine Learning, 65:131-165."  The algorithm consists of a criterion that allows one to compare different potential partitionings of the data, plus a search method for attempting to find the best partitioning according to that criterion.
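For reference, here is my reading of that criterion as a standalone MATLAB function.  This is a minimal sketch rather than the code from the archive below; the log-factorials and log-binomials are computed via gammaln to avoid overflow, and the paper remains the authoritative statement of the formula.

function cost = modl_cost(counts)
% MODL criterion for one candidate partitioning. counts is an I-by-J
% matrix: counts(i,j) = number of sample instances of output class j
% falling in interval i. A smaller cost indicates a better partitioning.
[I, J] = size(counts);
ni = sum(counts, 2);                 % instances per interval
n  = sum(ni);                        % total instances
logC = @(a, b) gammaln(a+1) - gammaln(b+1) - gammaln(a-b+1);  % log nchoosek
cost = log(n) ...                                % prior on number of intervals
     + logC(n+I-1, I-1) ...                      % prior on interval sizes
     + sum(logC(ni+J-1, J-1)) ...                % prior on per-interval class splits
     + sum(gammaln(ni+1) - sum(gammaln(counts+1), 2));  % likelihood of the labels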

My implementation uses two 'linked lists' (in quotation marks because they are implemented with MATLAB's generic array data type rather than a distinct linked-list data type).  The first list describes the partition intervals: for each interval, it stores the total number of sample instances in the interval, the number of instances belonging to each output class, and the identities of the first and last members of the data set in the interval.  The second list points to adjacent pairs of intervals in the first list, and is sorted according to how much the criterion would improve if the pointed-to intervals were merged.  The algorithm proceeds by merging the 'best' pair of adjacent intervals, then updating both lists to reflect that merger.  The algorithm is called 'greedy' because it always chooses the most immediately obvious 'best' merger, even though that may not lead to a globally optimal solution.
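Stripped of that linked-list bookkeeping, the core greedy loop looks roughly like the sketch below (this is an illustration, not the archive code: it naively re-evaluates every adjacent pair on each pass instead of keeping the sorted second list, so it is considerably slower).

function counts = greedy_merge(counts)
% counts starts with one row per elementary interval (see modl_cost
% above); adjacent rows are merged until no merger improves the criterion.
while size(counts, 1) > 1
    base = modl_cost(counts);
    best_gain = 0;
    best_i = 0;
    for i = 1:size(counts, 1) - 1        % evaluate each adjacent pair
        merged = [counts(1:i-1, :); ...
                  counts(i, :) + counts(i+1, :); ...
                  counts(i+2:end, :)];
        gain = base - modl_cost(merged); % positive if the merger helps
        if gain > best_gain
            best_gain = gain;
            best_i = i;
        end
    end
    if best_gain <= 0, break; end        % no improving merger remains
    counts = [counts(1:best_i-1, :); ... % commit the best merger
              counts(best_i, :) + counts(best_i+1, :); ...
              counts(best_i+2:end, :)];
end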

My implementation of this algorithm (including some support functions) is included in this archive.

A quick test

Here's some example data I generated, along with the optimal discretization returned by the algorithm (as implemented in MATLAB).  The 1000 sample data points were drawn from equal-variance Gaussian distributions whose means differed and were determined by each point's output class value.  The 1000 sample data points are plotted below; the y-axis is an individual data point's continuous value, and the point's color corresponds to its output class value.  The horizontal lines show the edges of the intervals determined by the algorithm; as you can see, the classes (colors) are well-segregated by the edges.  Intuition suggests that there would be four edges (separating the five output classes); in the case below, there is an additional edge separating the contentious transition region between the dark blue and cyan classes.
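For anyone who wants to generate similar test data, a few lines of MATLAB suffice.  The class means and the common standard deviation below are invented for illustration; the archive contains the script I actually used.

n_per = 200;                          % points per class (1000 total)
J = 5;                                % number of output classes
mu = [0 2 4 6 8];                     % one mean per class (placeholder values)
sigma = 1;                            % common standard deviation (placeholder)
y = repmat(1:J, n_per, 1);            % output class of each point...
y = y(:);                             % ...as a single column
x = mu(y)' + sigma * randn(J*n_per, 1);      % continuous value of each point
scatter((1:numel(x))', x, 10, y, 'filled');  % colour each point by its class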

Sunday, May 6, 2012

Project updates (or, why is it that the least interesting parts of a project make up most of the effort?)

In the last few weeks (since I tested out the OLED display), I've been getting all of the last little details in place to move forward with the sleep mask project.  Specifically, since the circuits have been finalized, I have been laying out the printed circuit board and making certain that I have all of the necessary components on hand (and putting together an order for those I do not).

The cheapest service I could find is Seeed Studio's Fusion PCB service.  For a mere $10 you get ten 5x5cm boards; an extra $15 gets you 5x10cm.

After completing the board layout (shown below), I found that it was more than 5x5cm.  It is larger than I had hoped, but still reasonable relative to the size of the sleep mask.  A large part of its... largeness... is due to the fact that I was designing based on the parts I already had in my inventory.  Those parts, in turn, were chosen to be usable across many projects; as a result, they are usually rated for much higher voltages, currents, and dissipated energies than are strictly necessary for this project.  This is okay for a prototype, but any future hardware revisions will involve specifying more appropriately-sized resistors and capacitors.

Since I was going to have to pay for an extra 5cm of board, I decided to make the best of it and add a circuit for a project I've had on the back burner for a while.  Specifically, I want to build some flashing lights into my bicycle helmet for safety; some of the lights will be ordinary LEDs, but eventually I want to build some EL wire into the helmet to give it a real Vegas feel.  To do this, I need a 'high voltage' supply (about 20V) that transformers can then step up to 120V.  While I am still collecting the transformers (I take them out of burned-out CFL lightbulbs, as transformers or even bare magnetics of appropriate size have proven difficult to source), I am going to move forward with getting this board, including the 20V step-up and LED constant-current drivers, laid out and ordered.  It is also shown in the image below.


The REM sleep mask board is on the left; the microcontroller is in the middle, with the micro USB connector above, battery charger above right, buck converter right, REM detector bottom left, headphone amplifier left and OLED display top left.  The helmet flasher/boost board is on the right; boost top left, EL wire switches bottom left, microcontroller bottom right and LED drivers top right.