Suppose you are a medical researcher studying diabetes. Your boss has given you a big chart of data from diabetes patients. Each row of the chart has the information for one patient. Each column of the chart is a health-related statistic, such as height, weight, age, blood pressure, cholesterol level, etc. We say that each patient's set of values across the columns is that patient's profile. The chart looks something like this:
Height  Weight  Age  Blood Pressure  Cholesterol
70      350     48   90              35.5
58      210     53   110             42.1
63      225     44   95              22.2
... and so on
So the profile for the first patient is just the first row, the profile for the second patient is the second row, and so on. Your job is to figure out whether some tight range of profiles is overrepresented in the patient data on the chart, because you want to warn other people who fall into that range to get checked out for diabetes.
As it turns out, there is a really simple way of doing this. If you treat each patient as a point in five-dimensional space, where each dimension corresponds to a column in the chart, all you need to do is find tight spatial clusters of points, and this is a problem which has long been solved. Enter the clustering algorithms.
The module I have contributed contains two major classes: Point and Cluster, which (not surprisingly) allow you to model points and clusters of points. Once you have transformed your data into a list of my generic Point objects, pass it into either the kmeans or agglo functions, which perform k-means clustering and agglomerative clustering, respectively. The output in either case is a list of Cluster objects, which you may then process as you see fit.
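As a rough usage sketch, converting the patient chart into Point objects and clustering them might look like the following. The module name "cluster" and the exact constructor and function signatures are assumptions here, not taken from the module itself; check the actual source for the precise interface.

    # Hypothetical usage sketch -- module name and signatures are assumed.
    from cluster import Point, kmeans, agglo

    # Each row of the chart becomes one five-dimensional Point.
    rows = [
        [70, 350, 48, 90, 35.5],
        [58, 210, 53, 110, 42.1],
        [63, 225, 44, 95, 22.2],
    ]
    points = [Point(row) for row in rows]

    # Ask for two clusters, stopping once no centroid moves more than 0.5.
    clusters = kmeans(points, 2, 0.5)
    for c in clusters:
        print(c)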
A word on the inputs to the algorithms:
K-means should only be used when you have some expectation about the number of clusters you want to get back. This is the "k" input parameter. K-means is an iterative algorithm: it keeps reassigning points and recomputing centroids until the biggest centroid shift between iterations is smaller than your "cutoff" input parameter. In general, if your data is tightly packed and you want to make fine distinctions between clusters, use a smaller cutoff. A sketch of this loop is shown below.
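The following is an illustrative, self-contained version of that loop operating on plain coordinate tuples. It is not the module's implementation, just a sketch of how k and cutoff drive the iteration.

    import math
    import random

    def kmeans_sketch(points, k, cutoff):
        """Illustrative k-means loop over coordinate tuples (not the module's code)."""
        def dist(a, b):
            return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

        centroids = random.sample(points, k)              # pick k starting centroids
        while True:
            # Assignment step: attach each point to its nearest centroid.
            groups = [[] for _ in range(k)]
            for p in points:
                groups[min(range(k), key=lambda i: dist(p, centroids[i]))].append(p)
            # Update step: recompute centroids and track the largest shift.
            biggest_shift = 0.0
            for i, group in enumerate(groups):
                if not group:
                    continue
                new_c = tuple(sum(axis) / len(group) for axis in zip(*group))
                biggest_shift = max(biggest_shift, dist(centroids[i], new_c))
                centroids[i] = new_c
            if biggest_shift < cutoff:                    # converged: biggest shift below cutoff
                return groups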
The agglomerative algorithm can be used even if you have no idea how many clusters you should end up with. It takes two parameters: the linkage type (currently 's' for single linkage, 'c' for complete linkage, or 't' for centroid linkage) and the cutoff. In each iteration, the algorithm finds the pair of clusters with the smallest distance between them and fuses them, until either all the clusters have been fused into one mega-cluster or the smallest distance is bigger than your "cutoff" input parameter. The distance between two clusters depends on the linkage type: single linkage means "the distance between the closest pair of points, one from each cluster," complete linkage means "the distance between the farthest pair of points, one from each cluster," and centroid linkage is simply the distance between the cluster centroids.
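To make those three definitions concrete, here are illustrative versions of the linkage distances written over plain coordinate tuples. These are not the module's functions, only a restatement of the definitions above in code.

    import math

    def _dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    def single_linkage(c1, c2):
        # Distance between the closest pair of points, one from each cluster.
        return min(_dist(p, q) for p in c1 for q in c2)

    def complete_linkage(c1, c2):
        # Distance between the farthest pair of points, one from each cluster.
        return max(_dist(p, q) for p in c1 for q in c2)

    def centroid_linkage(c1, c2):
        # Distance between the two cluster centroids (coordinate-wise means).
        def centroid(c):
            return tuple(sum(axis) / len(c) for axis in zip(*c))
        return _dist(centroid(c1), centroid(c2))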
The main function tests out these methods by generating 10 random points in two dimensions, then passing them into both clustering algorithms and printing the output clusters.
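A test driver along those lines might look like the sketch below. As before, the module name and the kmeans/agglo signatures are assumptions used for illustration, not the module's confirmed interface.

    # Hypothetical test driver -- module name and signatures are assumed.
    import random
    from cluster import Point, kmeans, agglo

    def main():
        # Ten random points in the unit square.
        points = [Point([random.random(), random.random()]) for _ in range(10)]

        print("k-means (k=3, cutoff=0.001):")
        for c in kmeans(points, 3, 0.001):
            print(c)

        print("agglomerative (single linkage, cutoff=0.5):")
        for c in agglo(points, 's', 0.5):
            print(c)

    if __name__ == "__main__":
        main()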
For more information on clustering, see Brian Luke's website: