I'm trying to work out what the best structure is for these requirements.. will happily work on it, just wondering on the best direction to go in / if anyone knows from experience or reason what the best type of storage structure is..
- I 'generate' up to 20,000 integers, and I need to make sure they are unique. I have no desire to keep track of which integers I've seen, as long as I can determine whether I've seen one before, if that makes sense - I don't need to get them back again from the structure.
- I can roughly predict the distribution of the numbers, I know, for example, the possible 'next' numbers after seeing any given number, I also know that the numbers probably expand from a single point, i.e, in 1-D, if the first number is one, the second numbers will be 0 and 2, if the first number is 2, the next ones will be 1 and 3, and so on.. I'm not considering negatives. However, I only have a rough idea that will happen - and, its certainly the case that the next two numbers might be +100 and -100, or any other symetric distance from the first number ( again, negatives can't happen, so -100 would only be possible if the first number is >= 100 ). The numbers don't necessarily start from 0, infact, its very unlikely that the first number will be 0.
At the moment I'm using a set [ red/black tree ], all is well for one test - and 20,000 is a worst case situation, i.e. the algorithm this is for aborts if it reaches 20,000 iterations. 300-1000 numbers is more likely; but, there's a chance that, every half second, a worst case example occurs, and worst cases aren't infrequent. Also, I want to increase the worst case level, because 20,0000 is sometimes not enough! In profiling, the set looks like a speed bottleneck in the system at these extreme cases, and I guess it's a memory bottleneck too, since it must store a whole integer at every 'leaf'.
By the way, a matrix/vector with a flag bit for every possible input isn't gonna be good ( although there is a well defined non-infinite input range ), because I dont want every case to suffer just to make the worst cases more efficient, and in most cases, only a relatively tiny number of integers are ever seen. Anything 'fuzzy' isn't good either, i.e. imperfect hashing -- since I can't forget I've seen a number, or remember I've seen a number that I haven't -- and I can't make a perfect hash that isn't the same size as the inputs.
I'm thinking, of using a dynamic vector or link list of ordered ranges; that is - if I see 0, 1, 2, 3, 4, 5, 6 - i would have one range object ( 0 - 6 ) at the beggining of the list, so anything within that range can be ignored; equally, if I see 20, 21, 28, 29, 30, i'd have two range objects ( 20 - 21 ) and ( 28 - 30 ) somewhere at the front of the list.. Kinda like file compression/allocation in realtime. Obv. that has worst memory case of 2 x the number of possible values, but, that's unlikely... Also, that structure might end up having a long lookup or write time, depending on whether it's a linklist or a vector. Maybe I could make it hierachical, but then it's gonna end up looking alot like a set...
Anyway.. anyone know of a better way to do this, or have any ideas?