I am trying to write a pretty straight forward algorithm (agglomerative clustering), but the details are getting pretty hairy. (A cluster is simply a group of points. The idea is to take a list of points and decide which are "grouped" or "clustered" close together.)
Here is the outline:
Input: A vector of N Points (x,y,z coordinates)
Algorithm:
- Create N clusters (each input point is it's own cluster with only the one point)
- Merge the two closest clusters (ie. delete either cluster, move all of the points from the deleted cluster to the other cluster). Set the "cluster center" to be the center of mass of all of the points in the new cluster. These cluster centers are what are compared to decide which clusters are closest to each other.
- rinse and repeat until the two closest clusters are farther apart than a threshold
I got it working with a really naive algorithm, but then I tried it with an input of size 30,000 and of course that didn't fly ! :)
The choices seem to be
1) Do you actually store Points in a vector in a Cluster class and move them from cluster to cluster? Or do you just have a vector of labels that you update when points should be "moved" to a new cluster?
2) I was considering storing objects that contained "Cluster 1 ID", "Cluster 2 ID", "Distance". But then each iteration I would have to go through all the clusters, and invalidate any objects that had the deleted cluster as either cluster 1 or cluster 2.
I realize this is quite a problem, but can anyone recommend at a high level a way to keep track of these points and clusters (ie which data structures would you use? And there are only two questions to answer really - how to pick which clusters to merge and how to merge them.)
I can post my naive solution if that would help, but it's a bit lengthy.
Thanks!
Dave