Hi,

I have a problem where I need to generate hashes for ~10^9 (1 billion) keys. I tried writing a hash of my own, but after some research I realized I might as well (or even should) use one of the existing functions.
I narrowed it down to SHA-1, which generates a 20-byte (160-bit) hash, i.e. values in the range 0 to roughly 10^48 (2^160 ≈ 1.46e48).
My storage only lets me store a Java long, so I'm taking a simple mod of the SHA-1 hash to truncate it to the range of a long (64 bits, ~9.2e18 for a positive signed long).

Now what I'm worried about is that the mod would skew the distribution of the hash and increase the probability of a collision.

In my use case a collision is basically fatal. There is no way for me to recover from or otherwise handle a collision.

Question:
How do I measure the distribution of a hash function? I have no idea how hash functions are evaluated. Are there any standard/pre-defined techniques or code I can reuse?

Thanks

PS: I could use a hash that generates an 8-byte hash directly, but I'm using SHA-1 as it's supposed to be good.

Here is the implementation:

package com.kash;

import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class Hash_SHA1 implements HashFunction {
	private MessageDigest algo = null;

	public Hash_SHA1() {
		try {
			algo = MessageDigest.getInstance("SHA-1");
		} catch (NoSuchAlgorithmException e) {
			System.out.println("Hash_my.Hash_SHA1(): NoSuchAlgorithmException");
			e.printStackTrace();
		}
	}

	@Override
	public BigInteger hash(String s) {
		// SHA-1 produces a 20-byte digest; we can only store 8 bytes, so reduce it with a mod.
		// Note: the modulus 999999999999999999 is 10^18 - 1, which uses only part of the long range.
		// An explicit charset keeps the hash identical across platforms/default encodings.
		return new BigInteger(1, algo.digest(s.getBytes(StandardCharsets.UTF_8)))
		           .remainder(new BigInteger("999999999999999999"));
	}
}
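(For comparison, a variant I could use instead of the remainder: take 8 of the 20 digest bytes directly as a long, which keeps the full 64-bit range rather than 0..10^18-1. This is only a sketch; the class name and the plain-long return type are my own, not part of the attached project.)

package com.kash;

import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical variant of Hash_SHA1: truncate the digest by taking its first
// 8 bytes as a long instead of using BigInteger.remainder(). Just a sketch of
// the alternative, not part of the attached project.
public class Hash_SHA1_Long {
	private final MessageDigest algo;

	public Hash_SHA1_Long() throws NoSuchAlgorithmException {
		algo = MessageDigest.getInstance("SHA-1");
	}

	public long hash(String s) {
		byte[] digest = algo.digest(s.getBytes(StandardCharsets.UTF_8));
		// First 8 of the 20 digest bytes, interpreted as a (possibly negative) long.
		return ByteBuffer.wrap(digest).getLong();
	}
}

Whether I use the remainder or the raw 8 bytes shouldn't matter much for distribution, but the raw bytes avoid shrinking the space down to 10^18 values.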

Attached: an exported Eclipse project with the JUnit tests, input.txt, the HashFunction interface, etc., for testing.

Attachments..
Unable to upload the full input file, so uploading only half of it..

While taking a modulus of the hash will increase the chance of a collision (you have fewer bits, of course it will), a good hash function will still be fairly 'random'.

As for the chance of a collision, I can't tell you the exact value (I don't have anything that can calculate a 10^18 factorial in my lifetime). You'd think it would be remote, but with that many values you may well run into the Birthday Paradox.
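As a rough sanity check (not an exact calculation), the standard birthday-bound approximation P ≈ 1 - exp(-n(n-1)/(2N)) is easy to evaluate. With the figures from this thread (n = 10^9 keys, N ≈ 10^18 values after your mod) it comes out to roughly 40%, so a collision is closer to a coin flip than to "remote". A minimal sketch (class name is just for the sketch):

// Rough birthday-bound estimate: P(collision) ≈ 1 - exp(-n*(n-1)/(2N)).
// n and N below are the figures from this thread (1e9 keys, mod ~1e18),
// plus the full positive-long range for comparison.
public class BirthdayEstimate {
	public static void main(String[] args) {
		double n = 1e9;                    // number of keys
		double modSpace = 1e18;            // remainder(10^18 - 1) leaves ~1e18 values
		double fullLong = Math.pow(2, 63); // positive long range, ~9.2e18
		System.out.printf("P(collision), mod 10^18 : %.3f%n",
				1 - Math.exp(-n * (n - 1) / (2 * modSpace)));
		System.out.printf("P(collision), full 2^63 : %.3f%n",
				1 - Math.exp(-n * (n - 1) / (2 * fullLong)));
	}
}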

If you're using the word "fatal" with regard to collisions, a hash might not be your best option. If you need to generate a unique ID, generate a unique ID.


Indeed this is where I started, but the problem is that
- my input (1 billion keys) is a string with a max length of 140 chars, each slot having 60 possible chars (40 ignoring case). This translates to ~40^140 (= 2e224) unique combinations, although in practice it will never be beyond 10^9 (1 billion).
- my output (integer) cannot be bigger than 10^38.
Do you have any ideas I can try? We considered Huffman coding, but that doesn't guarantee the output size, and its performance is much worse than hashing.

By the way, after all the googling I could do, the best link I found on testing hash functions is this.
Unfortunately it can't be applied to my problem as-is.
So I finally wrote some code to do a brute-force distribution check. It's attached. The result is that even after doing the mod, the distribution is still pretty impressive (good).
The code is given below; see the new method checkDistribution().
The general idea is:
- write all the longs (hashes) to a flat file,
- divide the range of long into X (configurable) buckets,
- read the longs from the file and distribute them into the buckets,
- print the bucket contents, the average per bucket, and the standard deviation.

package com.kash;

import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.PrintWriter;
import java.io.StringWriter;
import java.util.HashSet;
import java.util.Scanner;
import java.util.Set;
import java.util.logging.FileHandler;
import java.util.logging.Formatter;
import java.util.logging.Handler;
import java.util.logging.Level;
import java.util.logging.LogRecord;
import java.util.logging.Logger;

import org.apache.commons.math.stat.descriptive.moment.StandardDeviation;
import org.junit.After;
import org.junit.Before;
import org.junit.Test;

public class StringHashTest {

	Logger logger = Logger.getLogger("");
	Scanner s = null;

	Set<Long> collisionDetectionSet;

	@Before
	public void setUp() throws Exception {
		init();
		s = new Scanner(new File("D:\\Workspaces\\Helios\\StringHash\\input.txt"));
		collisionDetectionSet = null;
		collisionDetectionSet = new HashSet<Long>();
		// String maxStr = "";
		// while (s.hasNext()) {// && i < 100) {
		// String str = s.nextLine();
		// if (str.length() > maxStr.length()) {
		// maxStr = str;
		// }
		// }
		// System.out.println("maxKey = " + maxStr);
		// System.out.println("maxKeyLen = " + maxStr.length());
	}

	@After
	public void tearDown() throws Exception {
		if (s != null)
			s.close();
		s = null;
	}

	@Test
	public void testHash_SHA1() {
		testHash(new Hash_SHA1());
	}

	@Test
	public void testHash_djb2() {
		testHash(new Hash_djb2());
	}

	@Test
	public void testHash_Jenkins() {
		testHash(new Hash_Jenkins());
	}

	private void checkDistribution(String fileName) {
		long[] hashes = readInputFile(fileName);
		Bucket[] buckets = createBuckets(hashes, 10000);

		logger.fine("Algo: " + fileName);
		logger.fine("adding to buckets..");
		// distribute into buckets..
		// for (long l : hashes) {
		// for (Bucket bucket : buckets)
		// if (bucket.add(l))
		// break;
		// }
		// logger.fine("Done adding to dumb buckets.. calls = " + calls);
		// calls = 0;
		// distribute into buckets..
		for (long l : hashes)
			addToBucket(buckets, l);
		logger.fine("Done adding to binary buckets.. calls = " + calls);

		long bucketSizeSum = 0;
		for (Bucket b : buckets) {
			// System.out.print(b);
			bucketSizeSum += b.size();
		}
		logger.fine("\nNumber of buckets: " + buckets.length);
		logger.fine("Bucket size: " + (buckets[0].max - buckets[0].min));
		logger.fine("Average hits per bucket: " + (bucketSizeSum / buckets.length));
		logger.fine(String.format("Your hashes are one in: %e", (float) (buckets[0].max - buckets[0].min)
				/ (float) (bucketSizeSum / buckets.length)));
		analyze(buckets, hashes);
	}

	private void addToBucket(Bucket[] buckets, long val) {
		// logger.fine("a2b: " + val);
		// use binary search kinda strategy (divide and conquer) to optimize
		int min = 0;
		int max = buckets.length;
		do {
			int mid = (max + min) / 2;
			// System.out.println(mid);
			Bucket buk = buckets[mid];
			if (buk.add(val))
				break;

			if (val > buk.max)
				min = mid + 1; // exclude mid; with min = mid the search can stall when max == min + 1
			else
				max = mid;

		} while (max > min);

	}

	private void analyze(Bucket[] buckets, long[] hashes) {
		StandardDeviation sd = new StandardDeviation();
		for (Bucket b : buckets) {
			sd.increment(b.size);
		}
		logger.fine("\nstandard deviation: " + sd.getResult());
	}

	private static long calls = 0;

	public class Bucket {
		private long min;
		private long max;
		private long size;

		public Bucket(long zmin, long zmax) {
			min = zmin;
			max = zmax;
		}

		public boolean add(long val) {
			if ((++calls % 1000000) == 0)
				logger.fine(".");
			if (val > min && val <= max) {
				size++;
				return true;
			} else
				return false;
		}

		public long size() {
			return size;
		}

		@Override
		public String toString() {
			return "\nBucket [min=" + min + ", max=" + max + ", size=" + size + "]";
		}
	}

	private Bucket[] createBuckets(long[] hashes, long numBuckets) {

		if (numBuckets > Integer.MAX_VALUE)
			throw new IllegalArgumentException("(numBuckets > Integer.MAX_VALUE)");

		long min = Long.MAX_VALUE, max = Long.MIN_VALUE;
		for (long l : hashes) {
			if (l < min)
				min = l;
			if (l > max) // not 'else if': the first value must be able to set both min and max
				max = l;
		}
		logger.fine("min = " + min + ", max = " + max);

		long bucketSize = Long.MAX_VALUE / numBuckets;
		Bucket[] buckets = new Bucket[(int) numBuckets];
		long currMin = 0;
		for (int i = 0; i < buckets.length; i++) {
			buckets[i] = new Bucket(currMin, currMin += bucketSize);
		}

		// System.out.print("createBuckets() returning:");
		// for (Bucket b : buckets)
		// System.out.print(b);

		return buckets;
	}

	private long[] readInputFile(String fileName) {
		long[] hashes = null;
		try {
			File f = new File(fileName);
			int numHashes = (int) (f.length() / 8); // 8 bytes per long
			int i = 0;
			DataInputStream dis = new DataInputStream(new FileInputStream(f));
			hashes = new long[numHashes];

			try {
				while (i < numHashes)
					hashes[i++] = dis.readLong();
			} catch (IOException e) {
				// end of file reached early; keep what was read
			} finally {
				dis.close();
			}
		} catch (IOException e) {
			logger.throwing("StringHashTest", "readInputFile(\"" + fileName + "\")", e);
		}
		return hashes;
	}

	private void testHash(HashFunction hf) {
		String algo = hf.getClass().getName();
		String outFileName = "D:\\Workspaces\\Helios\\StringHash\\" + algo + ".log";
		try {
			DataOutputStream outputFile = new DataOutputStream(new FileOutputStream(outFileName));
			logger.info(algo + " -- start --");
			int collisions = 0;
			while (s.hasNext()) {// && i < 100) {
				long hash = hf.hash(s.nextLine()).longValue(); // HashFunction.hash() returns BigInteger (see Hash_SHA1), so narrow it to a long here
				// if (!collisionDetectionSet.add(hash))
				// ++collisions;
				if (hash == 0)
					collisions++;
				else
					outputFile.writeLong(hash);
			}
			outputFile.close();
			logger.info(algo + " -- Number of keys generated = " + (collisions + collisionDetectionSet.size()));
			logger.info(algo + " -- Number of collisions = " + collisions);
			logger.info(algo + " -- end --");
		} catch (IOException e) {
			logger.throwing("StringHashTest", "testHash()", e);
		}
		checkDistribution(outFileName);
	}

	private static boolean isLoggerInitd = false;

	private static void init() {
		if (isLoggerInitd)
			return;

		Logger.getLogger("").setLevel(Level.ALL);

		Handler l = null;
		Formatter f = new MyFormatter();
		try {
			l = new FileHandler("D:\\Workspaces\\Helios\\StringHash\\Hash.log");
		} catch (Exception e) {
			e.printStackTrace();
		}
		l.setFormatter(f);
		l.setLevel(Level.ALL);

		Logger.getLogger("").addHandler(l);
		isLoggerInitd = true;
	}

	static class MyFormatter extends Formatter {
		public synchronized String format(LogRecord record) {
			StringBuffer sb = new StringBuffer();
			String message = formatMessage(record);
			// sb.append(record.getMillis()).append(' ').append(record.getLevel().getLocalizedName()).append("\t")
			// .append(message).append('\n');
			sb.append(message).append('\n');
			if (record.getThrown() != null) {
				try {
					StringWriter sw = new StringWriter();
					PrintWriter pw = new PrintWriter(sw);
					record.getThrown().printStackTrace(pw);
					pw.close();
					sb.append(sw.toString());
				} catch (Exception ex) {
				}
			}
			return sb.toString();
		}
	}
}
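For interpreting the standard deviation that analyze() prints: if the hash were perfectly uniform, the per-bucket counts would be roughly Poisson-distributed, so the standard deviation of the bucket sizes should land near sqrt(hashes / buckets). This is only a benchmark sketch and assumes the buckets tile the range the hashes actually fall into (note that createBuckets() above spreads them over the whole long range); the class name and counts below are placeholders, not my real run.

public class UniformBenchmark {
	public static void main(String[] args) {
		// For a uniform hash the bucket counts are approximately
		// Poisson(mean = hashes / buckets), so the standard deviation printed by
		// analyze() should come out near sqrt(hashes / buckets), assuming the
		// buckets cover only the range the hashes actually land in.
		long numHashes = 1_000_000_000L; // hashes written to the flat file (placeholder)
		long numBuckets = 10_000L;       // same bucket count as checkDistribution()
		double expectedMean = (double) numHashes / numBuckets; // ~100,000 per bucket
		double expectedStdDev = Math.sqrt(expectedMean);       // ~316
		System.out.printf("expected mean/bucket ~ %.0f, expected std dev ~ %.0f%n",
				expectedMean, expectedStdDev);
	}
}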

Indeed this is where I started, but the problem is that
- my input (1 billion keys) is a string with a max length of 140 chars, each slot having 60 possible chars (40 ignoring case). This translates to ~40^140 (= 2e224) unique combinations, although in practice it will never be beyond 10^9 (1 billion).
- my output (integer) cannot be bigger than 10^38.

I don't understand why this is a problem for a unique ID. The point of a unique ID is that it doesn't depend on the content; it's external to the data, so it doesn't matter how much data there is.


Okay, now I get what you meant. Well, the reason for this whole exercise is the BAD BAD performance of the existing key-generation code (which uses an Oracle DB sequence number generator). The existing code does what you say: generate a unique number for each string.
The problems with the unique number generator are that
1. it has to be synchronized, AND
2. for the algorithm to be repeatable (return the same number for the same string across two calls to the API), the mapping of every input string to its generated number has to be retained.
This kills the performance.

You can generate a unique ID with 48 hex characters or less, effectively. There is no way to generate a guaranteed non-colliding hash value mathematically. It is one of those NP-Hard problems. So, generate a GUID, and forget hash values if a collision will indeed be "fatal".
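For reference, a minimal sketch of what I mean in Java (java.util.UUID; 128 bits, i.e. 32 hex characters, well under 48). Note that a random UUID is not derived from the input string, so by itself it is not repeatable for the same string; the class name here is just for the sketch:

import java.util.UUID;

public class GuidExample {
	public static void main(String[] args) {
		// Version-4 (random) UUID: effectively collision-free in practice,
		// but not a function of the data being keyed.
		UUID id = UUID.randomUUID();
		System.out.println(id.toString());
	}
}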

I thought it sounded tricky. Didn't know it was NP-hard.


You'd have to generate hash values for every possible member of the set, and since the set is effectively infinite...

How stable and random is the distribution of information you want to hash?

Generally it is a better idea to test a random sample of real data and try to find what the worst case in the input could be, and whether the input could change to fit even worse than that. My instinct says the remainder might be better as the biggest prime less than the current modulus, but maybe it does not matter.

The hash should be OK, but you must have a mechanism for a non-unique primary hash, for example a second key using the current slow mechanism that guarantees uniqueness.

Thanks guys..

You can generate a unique ID with 48 hex characters or less,

Practically: yes. Because at best only 10^9 (out of the possible 2e224) input strings would ever be used. In fact, given that, my unique number (say a running number) can easily fit into 34 bits.
Theoretically: no. There are 2e224 possible inputs, and the range of 48 hex chars (192 bits) is only 6.2e57.

But again, the primary problem remains: I don't know how to write an efficient API that generates a unique/running number for a given input string when it is also expected to be repeatable (i.e. it must return the same number if it is called with the same string twice).
It's a J2EE environment (I'm not opposed to any ideas, including JNI/...) with multiple nodes/machines in a cluster calling the API in question in parallel (see the last paragraph as well).

How stable and random is the distribution of information you want to hash?

How would that matter? In fact, everything I've learned from my R&D so far says that the distribution of a good hash function shouldn't depend on the randomness/distribution of the input. That's my understanding of a hash function "achieving avalanche".
Or did you ask the question for some different purpose, and I'm going off on a tangent? :)
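For what it's worth, the avalanche property itself is easy to spot-check: hash two inputs that differ in a single character and count how many bits of the truncated 64-bit value change; a good hash should average around 32. A minimal sketch using SHA-1 directly (the class name and input strings are just placeholders):

import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

public class AvalancheCheck {
	public static void main(String[] args) throws Exception {
		MessageDigest sha1 = MessageDigest.getInstance("SHA-1");
		long a = firstLong(sha1.digest("some key 1".getBytes(StandardCharsets.UTF_8)));
		long b = firstLong(sha1.digest("some key 2".getBytes(StandardCharsets.UTF_8)));
		// XOR the two 64-bit values and count differing bits; ~32 indicates good avalanche.
		System.out.println("bits changed out of 64: " + Long.bitCount(a ^ b));
	}

	private static long firstLong(byte[] digest) {
		// First 8 of the 20 digest bytes as a long.
		return ByteBuffer.wrap(digest).getLong();
	}
}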

Generally it is a better idea to test a random sample of real data and try to find what the worst case in the input could be, and whether the input could change to fit even worse than that. My instinct says the remainder might be better as the biggest prime less than the current modulus, but maybe it does not matter.

I did. I tested with a more than reasonable set of input, and SHA-1 works just fine. Not sure if you had a chance to see the post where I wrote this: "So I finally wrote some code to do a brute-force distribution check. It's attached. The result is that even after doing the mod, the distribution is still pretty impressive (good)."

The hash should be OK, but you must have a mechanism for a non-unique primary hash, for example a second key using the current slow mechanism that guarantees uniqueness.

Could you explain this a bit more? My short take is that any change/addition to the SHA-1 scheme would require us to maintain the mapping of all keys vs. all hashes, which kills the performance.
It's of course easy to resolve a collision (e.g. just do a ++, or return a running number from a reserved range not used by SHA-1) if we can detect it. But to detect it we have to store all keys/hashes, which brings us back to the original problem.

I thought you could cache the colliding IDs and do two lookups: first check that there is no collision, then access the main hash. That is considering that double hashing is so much faster than using the serial scheme, which I think is only possible by storing the complete keys, not only the conflicting keys.

I asked about the distribution of the keys in case there exists a non-random but suitably selected way to cut down the length of the keys so that they remain unique (in practice I have experience with a direct-marketing database, which used a combination of pieces of the address and letters from the name to produce a unique key).

The parallelism at the point of application could cause headaches, I have heard. I have no practical experience in that field.

My ideas could be off topic, but I offered what came to mind, even though I have no practical experience in the topic and no special deep expertise.

My C knowledge is not so deep, but I do understand the general structure of the code you posted; it looks nice. I did not experiment with the posted code as I lack a suitable database to test with. I am more of a Python guy.

I wrote a repeatable 32-bit hash function some years ago (about 20 - when C++ compilers were still very non-standard) for use in a major manufacturing execution system. In all the years since, I have not heard that it caused any collisions. Let me dig up the code and post it here. That is to say, it probably will generate collisions, but extremely rarely. One of its features is that it generates the same hash value no matter the hardware or system architecture (little-endian, big-endian, 32-bit, or 64-bit).

Ok. Here it is:

union CharMask {
  unsigned	in[sizeof(unsigned)];
  char		ch[sizeof(unsigned)*sizeof(unsigned)];
  CharMask();
};

CharMask::CharMask()
{
  for (register unsigned i=0; i<sizeof(unsigned); i++)
    for (register unsigned j=0; j<sizeof(unsigned); j++)
      ch[sizeof(unsigned)*i+j] = j<i ? 0xff : 0;
}

/*
 * Return a case-sensitive hash value.
 * Note: this was originally a member function of a string class; length() below
 * is the length of the string being hashed, and `data` points at its characters.
 */
unsigned hash(unsigned char* data)
{
// This modification is to assure inter-platform consistency of string
// hash values.  IE: str.hash() on Sparc == str.hash() on Alpha
  static CharMask chmask;
  static unsigned shift = (sizeof(unsigned) == 4) ? 2 : 1;
  register unsigned h, i, j;
  register unsigned char* c = data;
  union
    {
      unsigned char cdata[sizeof(unsigned)];
      unsigned udata;
    } pad;

  h = 0;
  // Compute starting bit pattern for h
  j = length();
  if (j > 0)
    {
      for (i = 0; i < sizeof(unsigned); i++)
	{
	  pad.cdata[i] = (unsigned char)(j % 256);
	  j /= 256;
	}
      h = pad.udata;
    }

  // Hash data in string to the initial bit pattern
  for (i = length(); i >= sizeof(unsigned); i -= sizeof(unsigned))
    {
      for (j = 0; j < sizeof(unsigned); j++)
	{
	  pad.cdata[j] = *c++;
	}
      h ^= pad.udata;	// XOR in the characters.
    }

  // If there are any remaining characters, then XOR in the rest, using a mask:
  if (i > 0)
    {
      pad.udata = 0;
      for (j = 0; j < sizeof(unsigned); j++)
	{
	  if (i > 0)
	    {
	      --i;
	      pad.cdata[j] = *c++;
	    }
	  else
	    {
	      pad.cdata[j] = chmask.ch[j << shift];
	    }
	}
      h ^= pad.udata;
    }

#ifdef LITTLE_INDIAN
  // Swap bytes if we are little-indian
  pad.udata = h;
  unsigned char t[2];
  t[0] = pad.cdata[0];
  t[1] = pad.cdata[1];
  pad.cdata[0] = pad.cdata[3];
  pad.cdata[1] = pad.cdata[2];
  pad.cdata[2] = t[1];
  pad.cdata[3] = t[0];
  h = pad.udata;
#endif

  return h;
}

I seem to recall that I did a considerable amount of testing for collisions and distributions using large dictionary tables of strings, and the results were very good.

I thought you could cache the colliding IDs and do two lookups: first check that there is no collision, then access the main hash. That is considering that double hashing is so much faster than using the serial scheme, which I think is only possible by storing the complete keys, not only the conflicting keys.

Even with this, I suppose you see that the original problem remains. To check whether something conflicts or not, we still have to maintain all the conflicting hashes in cluster-wide synchronized storage (this is of course MUCH better than the situation where we use a serial number generator and store ALL generated numbers). But we end up being forced to look up every generated hash in this cache/storage, which in a cluster kills the performance.

I asked about the distribution of the keys in case there exists a non-random but suitably selected way to cut down the length of the keys so that they remain unique (in practice I have experience with a direct-marketing database, which used a combination of pieces of the address and letters from the name to produce a unique key).

I considered this; with it I was able to bring the length of the input keys down from 140 to ~65 chars, but even then we end up with 1.4e104 unique combinations, down from 2e224. So the problem is seemingly less severe, but still severe enough.

I wrote a repeatable 32-bit hash function some years ago

Thanks for this. Some inputs:
1. Perhaps in the LONG mail chain you missed that I'm working with J2EE (Java). Although I'm open to JNI or similar solutions, in this particular case I would rather just re-write the hash() in Java.
2. The maximum number of unique keys this hash can generate (32 bits) is 4.3e9. That leaves almost no headroom over my practical number of keys (~10^9): by the birthday bound, hashing 10^9 keys into a 32-bit space would be expected to produce on the order of 10^8 collisions. So unless the function is enhanced to generate a longer key, it's bound to collide.
3. SHA-1 generates a 20-byte (160-bit) key, and its performance (even including the time taken to strip the key down to a long's 8 bytes) is more than adequate.


I'm marking the thread as solved now as I guess there is nothing further to be added here.
Thanks for all your inputs.
