I am trying to write a pretty straight forward algorithm (agglomerative clustering), but the details are getting pretty hairy. (A cluster is simply a group of points. The idea is to take a list of points and decide which are "grouped" or "clustered" close together.)

Here is the outline:

Input: A vector of N Points (x,y,z coordinates)
Algorithm:
- Create N clusters (each input point is it's own cluster with only the one point)
- Merge the two closest clusters (ie. delete either cluster, move all of the points from the deleted cluster to the other cluster). Set the "cluster center" to be the center of mass of all of the points in the new cluster. These cluster centers are what are compared to decide which clusters are closest to each other.
- rinse and repeat until the two closest clusters are farther apart than a threshold

I got it working with a really naive algorithm, but then I tried it with an input of size 30,000 and of course that didn't fly ! :)

The choices seem to be
1) Do you actually store Points in a vector in a Cluster class and move them from cluster to cluster? Or do you just have a vector of labels that you update when points should be "moved" to a new cluster?
2) I was considering storing objects that contained "Cluster 1 ID", "Cluster 2 ID", "Distance". But then each iteration I would have to go through all the clusters, and invalidate any objects that had the deleted cluster as either cluster 1 or cluster 2.

I realize this is quite a problem, but can anyone recommend at a high level a way to keep track of these points and clusters (ie which data structures would you use? And there are only two questions to answer really - how to pick which clusters to merge and how to merge them.)

I can post my naive solution if that would help, but it's a bit lengthy.

Thanks!

Dave

Just my opinion but the outline of the problem is that you have a Cluster which consists of a set of points and a centre point which you have somehow derived from that set. To me that is best modelled by a Cluster class containing a vector of points, and a centre point.
When you merge clusters, you append one vector to the other, recalculate the centre point and throw away the unwanted Cluster.
That's a good clean design which represents the problem. If you have performance issues with that then start looking at improvements.

Right, that part is pretty straight forward, but how to keep track of these Cluster objects is the first question, and how to keep track of the distances between the clusters is the second.

Why not start with a list of Clusters. Say each Cluster starts as a list of points having size of one. If you wanted you could start with the first Cluster. Find the closest Cluster to it. Merge the two. Repeat until you have come to the end of the list of Clusters, meaning, in this scenario, the number of Clusters has been reduced by half. Since you are doing a relatively large number of deletions and additions to Clusters lists may be more effecient than vectors. You can keep repeating the process until you get down to however many Clusters you want, the lowest number being 1 Cluster, which means you have gone back to where you began.

Alternatively, you could group the Clusters by distance rather than by closest. Take the first Cluster in the lis and group all other Clusters with it if the distance from one Cluster to another is less than x. Then for each successive loop through the list of clusters you increment the distance between the remaining clusters by y until you decrease the number of Clusters, or you increase the distance between Clusters, to a given predetermined value.

class Cluster
   list<Point> points
   Point centerOfMass

list<Cluster> uinverseOfClusters

Right, that part is pretty straight forward, but how to keep track of these Cluster objects is the first question, and how to keep track of the distances between the clusters is the second.

Okay, so initially you'll be allocating a cluster for each point.
You'll want to iterate through the clusters (in no particular order) to measure and compare the distances between them.
You'll want to remove clusters from random positions in the set as they merge. A vector wouldn't be great for that because the elements would all need shuffling as an element is deleted. I think a map container would do the job nicely. As you create each cluster give it an ID which you use as the map key, then the cluster object pointer is the map value. The slow bit is going to be iterating though the map to compare the distances. To compare each cluster with all the others will be painful with a large data set. You could speed that up by getting the size of the map (number of clusters, N), creating an array N distances, then doing a single iteration of the map, copying each distance to the array. Then you only need to iterate the array for comparing each distance with all the others. I imagine that would be a fair bit faster than iterating the map of clusters.
Although the distance scanning is slow I don't think it makes sense to try and keep track of distance data between scans. Clusters are deleted and merged so it becomes out of date immediately. I think I would work out the closest clusters then dump the distance data and scan it all again.
When you've determined the 2 closest clusters, merging them and removing the redundant one from the map is a doddle. Remember to delete the cluster object of course.
My head hurts. I hope that's useful.

You could speed that up by getting the size of the map (number of clusters, N), creating an array N distances, then doing a single iteration of the map, copying each distance to the array. Then you only need to iterate the array for comparing each distance with all the others.

Was getting ahead of myself when I said that. I meant:
You could speed that up by getting the size of the map (number of clusters, N), creating an array N cluster centres, then doing a single iteration of the map, copying each cluster centre to the array. Then you only need to iterate the array for comparing each distance with all the others.

Ok, I implemented it using a distance matrix, and updating the distance matrix row and column that correspond the the cluster that was deleted. It worked pretty well, and I ran it with 7000 points and it took less than an hour. However, it turns out a better cluster distance function for my application is to merge the clusters with the closest distance between any of their points. Ie find the distance from all the points in A to all the points in B. The minimum of those distances is considered the cluster to cluster distance. The clusters with the closest cluster to cluster distances are merged. This makes updating the row and column of the distance matrix horribly horribly slow (ie. its down to 6000 clusters from 7000 points in an hour, and as the size of the clusters grow the speed decreases, so I bet its looking at 20 hours or so.

I don't see any shortcut to computing all of these distances (I already put the points in one cluster in a KDTree and then found the distance from each point in cluster B to the tree, rather than brute force finding the distances between each point in A to all the points in B). That helped a bit, but its still wayyyy too slow.

Sorry, I'm kind of rambling, but I don't know where to go from here!

Dave

Be a part of the DaniWeb community

We're a friendly, industry-focused community of developers, IT pros, digital marketers, and technology enthusiasts meeting, networking, learning, and sharing knowledge.