I'm just starting a search for a library or code sample for grouping numbers together based on their distance from one another. For instance, in the set{1,2,4,25,28,29}, is there an algorithm that would identify two groups of numbers subsetA{1,2,4} and subsetB{25,28,29}?

Nothing really came up in my google search, but I figure there has to be something out there. Is there another name for "number grouping algorithm"?

How do you expect to split up the sets of numbers? subset is a pretty generic topic.

[edit] Sorry, I missed the part where you said 'distance from one another' :)

Still, splitting on a distance is rather trivial: sort and look for gaps greater than a specified distance. You may find that as part of a much larger library but it is unlikely you will have much luck finding a library dedicated directly to that purpose.

Your terms are way to vague. Judging from your example, you can do something like this :

1) Sort the set
2) Define a MIN_DISTANCE_TO_SPLIT
3) Keep a beging / end pointer, both pointing to the start of the sorted list
4) Move the end pointer until a difference of MIN_DISTANCE_TO_SPLIT is observed from element i to i+1 or until you get to the end of the list
5) Now create a subset from [beginPointer,EndPointer]
6) Set begin pointer to endPointer + 1
7) Increment endPointer
8) Repeat from step 4, until an invalid endPointer


What are you trying to solve with this anyways?

Thanks for the advice. The problem I'm solving is somewhat general. When presented with a dataset, some of the numbers appear to belong with each other, and other numbers do not. I'd like to have a better way of splitting the numbers up. In the past I've just simply split the set by range. Numbers that fall into this range get grouped together, the next range another group and so forth. But this doesn't work very well because the range is rather rigid and sometimes a number or many numbers that belong with one group get sorted into another based on the rigid range in which they fall.

I just now wrote a very crude function to split any given set of type int into two or more subsets (I need to rewrite it to accept template type). The function works fine, but I was hoping there was something out there a bit more sophisticated.

As of right now it splits a set of numbers into two subsets based on the following criteria:

A) Sets and subsets must contain at least 3 elements
B) If the largest percent gap between any two numbers in a set is either more than 50% or more than twice as large as the average percent gap between all numbers of the set, then that set is split into two subsets. The process is then repeated on the two new subsets.

I would still like advice on the algorithm if anyone has ideas. My next step will be to handle outliers. If there is a single outlier, or any number of outliers, I'd like to split the set by exactly that number of outliers.

Still, I'd be interested in browsing any set splittling library that you can recommend even if it doesn't do exactly what I am trying to do.

#define min_set_size 3
void FindSubSets(vector<vector<int>>& oldSubSetVec)
{
	bool newSplit(true);
	vector<vector<int>> newSubSetVec;
	
	// Prime the system
	newSubSetVec.assign(oldSubSetVec.begin(),oldSubSetVec.end());
	
	while (newSplit)
	{
		newSplit = false;
		oldSubSetVec.assign(newSubSetVec.begin(), newSubSetVec.end());
		newSubSetVec.clear();
		for (vector<vector<int>>::iterator itSet = oldSubSetVec.begin(); itSet != oldSubSetVec.end(); ++itSet)
		{
			if (itSet->size() >=  2*min_set_size)
			{
				sort(itSet->begin(),itSet->end());
				double avgPercentDiff = 0.0;
				double largestGapPercent = 0.0;
				vector<int>::iterator itLargestGap = itSet->begin();
			
				int count = 0;
				for (vector<int>::iterator itNumber = itSet->begin(); itNumber != (itSet->end()-1); ++itNumber)
				{
					double percentDiff = ((double)(*(itNumber+1) - *itNumber) / (double)*itNumber);
					avgPercentDiff *= (double)count;
					avgPercentDiff += percentDiff;
					++count;
					avgPercentDiff /= (double)count;
					//Check for new largest gap
					if (largestGapPercent < percentDiff && itSet->begin()+2 <= itNumber && itSet->end()-3 >= itNumber)
					{
						largestGapPercent = percentDiff;
						itLargestGap = itNumber;
					}
				}

				if (largestGapPercent > 0.50 || largestGapPercent > 2 * avgPercentDiff)
				{
					newSplit = true;
					vector<int> newSubSet;

					newSubSet.assign(itSet->begin(), itLargestGap+1);
					newSubSetVec.push_back(newSubSet);
					
					newSubSet.assign(itLargestGap+1, itSet->end());
					newSubSetVec.push_back(newSubSet);
				} 
				else
				{
					newSubSetVec.push_back(*itSet);
				}
			}
		}
	}
}

I think the simple approach would produce good results, that is defining some DistanceSplit value. For example, consider the data below.

[ 4 12 13 21 30 38 39 45 60 84 90 91 95 96 97 109 113 114 128 148 149 163 175 180 182 184 192 ]

say the distance split is 10, then we would have the following subset
[ [4,12,13,21,30,38,39,45] , [60] , [84,90,91,95,96,97] , [109,113,114] , [128,148,149,163] [175,180,182,184,192] ]

Obviously the lower the distanceToSplit value the more subsets you will have. You can calculate the distanceToSplit by using factors such as the mean,variance and standard deviation.

Does the given data follow some type of patterns?
What does the data represent?
Given the above sample data, what would be the 'correct' subset to you? Can you give us more information on the problem this is going to solve?

You have indeed a simple problem of terminology.

What you are describing is a classic problem in, at least, the fields of machine learning and data mining. The problem is called "data clustering" (which is part of the field of "unsupervised learning methods" for "data classification").

If you search for these terms, you will have much better luck, I guarantee.

As for the most suited algorithm for your simple problem of clustering simple integer numbers, it is definitely the "K-means clustering" algorithm. The algorithm is very simple:

1) Start with K random elements in your set as the starting means of your clusters,
2) Put all the other number together with whichever mean they are closest to
3) Then recompute the mean of each cluster,
4) and Repeat, until you have no more change in the clusters.

Very simple, guaranteed convergence. Although, there might be a more specialized algorithm (or version of it) for this 1D case (since K-means is more generally applicable to any dimensionality and any distance metric). And, of course, there are refinements to this basic algorithm (and entire fields of research are dedicated to this), but for your simple problem, this simple algorithm is definitely good enough, and can be implemented in a matter of minutes.

commented: lol, I thought so. Thanks for getting me on the right track! +2

The K-means clustering algorithm computes yields K clusters, where K is given by you. You have to "tell" the algorithm beforehand how many clusters the data set should be broken up into.

A hierarchical clustering algorithm on the other hand forms successive groupings, with the number of clusters varying from from 1 to N (where N is the cardinality of the data set). Hierarchical clustering lets you choose which of the N steps gives you the most convenient number of clusters for your analysis. Ward's algorithm is a commonly used agglomerative hierarchical clustering algorithm. http://iv.slis.indiana.edu/sw/ward.html

Different clustering methods have different strengths and weaknesses. Which algorithm to use, choosing the distance metric which is most appropriate, and determining the number of clusters to specify for K-means clustering or select from hierarchical clustering are interesting problems. Discriminant function analysis is typically used for these quality measures.

commented: Interesting... more food for thought to digest. Thank you! +2

To follow up, I'm going to use the k-means algorithm for now. Those four steps Mike outlined are easy enough for me to understand and it's a big improvement over what I've been doing. I'll just use the size of the range to determine the size of k.

Now I see that there is tons of research on data clustering and tons of open-source out there. I'm really curious about Ward's algorithm and other hierarchical approaches. That's the next step. Can't wait! :)

Be a part of the DaniWeb community

We're a friendly, industry-focused community of developers, IT pros, digital marketers, and technology enthusiasts meeting, networking, learning, and sharing knowledge.