Given a unicode string or file, what is the best way to find a short unicode string that is not a substring of the first one ? I want a working algorithm that avoids transforming the first string into a list of characters.

My first idea is to choose a random unicode character, then if it is a substring of string A, choose a random string B of 2 unicode characters. Again if it is in string A, choose a new random B with 3 characters, etc.

If somebody sees a working deterministic algorithm, it would be a great idea.

How short is "short"? If the string/file is less than 2^16 characters long then there will be at least one UniCode character that's not in the string, and which will therefore meet your requirement. ;)

Yes in this case there is a string with length 1, but how would you find it ? Short means for example that if there is a string with length 1 and your algorithm finds a string with length 2 or 3, it does not matter.

I don't know Python, but can't you just create an array of 2^16 booleans, corresponding to the 2^16 basic UniCode characters, do one pass of the string and set the corresponding boolean as you encounter each character.

commented: good suggestion +14

Yes it works, here is the code, using module bitstring from pypi

#!/usr/bin/env python
# -*-coding: utf8-*-
'''find a unicode character not in a string
python 2 and 3
'''
from __future__ import (absolute_import, division,
                        print_function, unicode_literals)
import sys
if sys.version_info < (3,): chr = unichr

from bitstring import BitArray # sudo pip install bistring
UMAX = 2 ** 16

a = BitArray(length=UMAX)
z = BitArray(length=1)

s = "Hel\0lo world"

for c in s:
    a[ord(c)] = 1
t = a.find(z)


if t:
    u = chr(t[0])
    print('found unicode', ord(u), repr(u))
else:
    print('not found')

""" my output -->

found unicode 1 '\x01'
"""

It wouldn't work with a string containing all the 65536 unicode characters, also there are some issues with respect to this: wikipedia says that there are currently more than 100000 unicode characters, and second my python interpreter accepts 1114112 unicode characters.

This is already nice, but I would like an algorithm that works in all cases.

Yes, there are UniCode symbols that need more than 2 bytes, but frankly they are a mystery to me.
For substrings >= 2 chars I can't think of anything that wouldn't require many searches of the entire String, which is horrible. (So maybe it's always worth trying the single letter case first because that's so very quick and would surely work in 100% of all real-life situations.)

I have a robust solution with more than a single letter. It is described in the attached pdf. I'm working on implementing it.

Two points:
1. It still involves guessing random strings until one is found,so not deterministic and in bad cases may run idefinitely long
2. The REGEX hides the fact that mutiple matches are being tried at every character - don't expect fast!

Here's a suggestion that is deterministic and must be damned close to optimum, provided you have enough memory to hold three times the size of the string:
Note: If the string length N is < 2^16 there is a single char solution
1. try the single letter solution first. If that fails:
Note: if N < 2^32/2 there is a 2-character solution
2. make an array of all the 2 consecutive letter sequences (N-1 elements of 4 bytes each)
3. sort the array
4. search the array sequentially until you find a gap in the sequence. Any value in that gap is a solution.
Note: if N > 2^32/2 and cunningly constructed to include every possible 2-char sequence you will have to go to 3 chars (and you will also win the lottery 3 times in succession)
5. make an array of 3 char sequences as above (construct array of N-2 elements of 6 bytes each), sort, and search for gaps
6. etc

How about hashing the letters in file, generate random letters until random letter hits empty slot in hash. If there is enough empty slots in hash, it should not take too many guesses. And guess + hash + check is quite fast operation. This, if you want to find unicode letter not used in file. With some very simple hash function (maybe simple mod n, not, but something slightly more complex) it might be able to limit guesses in subset of unicode letters based on the index of empty hash slot.

For longer n-length string, it should be quite easy to find unicode letter very unlikely to appear more than n times in the file. Then you just make sure that for hit h, letter h in string generated differs from letter h letters after hit h. You could also record one letter after each hit. There should be many holes in unicode letters, as the hits should not be many. And available unicode choices are many for the second letter of not-matching string. If the rare letter is not found in file is of course trivial and should be dominant case for average running time of the algorithm.

The statistical approach (see pdf above) is based on the monumentally unlikely assumption that all the 2^16 chars are equally likely, so given any sensible file size there are very many short sequences not present.
A much more interesting (and realistic) version would just use the ASCII character set, so all the characters are likely to be present, and most short strings will be present in even a fairly small file.

I think you misunderstand the statistical approach. There is no assumption on the distribution of letters in the data string. I'm using an inquality that is true for all sequences, let them have ascii characters or not. The only assumption is that I choose a random regex string with k characters uniformly over all such strings.

I think it is easy to choose subset of unicode letters (one special language specific letter from each letter set from each language specific letter set and special symbol set, for example, upto 256 set), that are unlikely to all occur in same document. And If you have one letter that does not occur in file, all sequences including that letter are not included in file, if you want longer non-occuring string.

commented: Yes but how to write it down ? +14

All I meant was
If there are many letters, then there are many many possible sequences, and a random sequence has a good chance of being a solution. So trying random sequences is a sensible strategy.
If there are fewer letters, there are fewer possible sequences, and there is less chance of a random sequence being a solution, in which case a deterministic complete solution is more favourable.

Here is the statistical approach at work with a alphabet of size 10.

#!/usr/bin/env python3
# -*-coding: utf8-*-
'''find a short string not in a string
'''
from __future__ import (absolute_import, division,
                        print_function, unicode_literals)
from math import log
import random
import re
import sys

def regex_length(L, N, proba):
    L, N, proba = float(L), float(N), float(proba)
    assert L > 1.0 and N > 1.0 and (0.0 < proba <= 1.0)
    u = int(log(L/proba)/log(N))
    return 1 + (u if u >= 0 else 0)

alphabet = 'abcdefghijk'
def get_letter(index):
    return alphabet[index]

def gen_letters(number, alphabet_len):
    for i in range(number):
        index = random.randrange(alphabet_len)
        letter = get_letter(index)
        yield letter

def random_candidate(data, alphabet_len):
    x = regex_length(len(data), alphabet_len, 0.01)
    pat = ''
    s = ''
    for letter in gen_letters(x, alphabet_len):
        if pat:
            pat = '{}(?:{})?'.format(re.escape(letter), pat)
        else:
            pat = re.escape(letter)
        s = letter + s
    pat = '(?=({}))'.format(pat)
    return s, re.compile(pat)

if __name__ == '__main__':
    data = (
        "bbcbcecjjhcajddbbcgbbdjeecijgfeeiaegacgedhehhdbcckfgeakgiha"
        "khekkeeehadebdjdfghbafeakcfaibijafkdkicdcjhckakdiihjadkjdcc"
        "iaikkkhkeeeeehjdabhekaeegbhkfiiiaebdjgjkehjjbcfkjackkidjiad"
        "dkaiddeafcbjhekagbickiagbdbfcbcfaikgeggeddjfachjjadghjekhef"
        "iffjbjhbgfaiggkjkdafbefhaeijcgghhhhdkgddgbhiaaiicfjihibdaag"
        "ahacdiddkaeajhabggcicfcdechjcfdfjkagajkfeeakcgjaaikegdbcdec"
        "jjhejhacfhcakdfhjkadjfkhgicgkcajiagjaiadgeggfkjeddeedbfjbjd"
        "aefdgdikdaecdgfgbgfcjfajdckidhbfifaibeegdiieibcfgedeehjfgha"
        "agckigbfakhcfhcdecdjagddcaafjgefdiahejcjehdebcijjdkjkeacdfd"
        "bhdbbdhbchdbjfdcfbhakjkhdkgjdjjkjghefefbhjcbfkbcigchebbkjdi"
        "gieiibkebikjccfkkdfccchgbgkiaijgfbeghecejdbdjegahbadkgjgcai"
        "jgjaebhbkibcfciaghagjbdekkkkjggdhebjaabbcfbifbjbiagbciibfjc"
        "efcjaaggbjffkbgcbihdikhgiieeffgjecjghajckbiedkfecfaaecijifc"
        "jgkjkdchkichdacjbehjkjjadghkehbdhcbkebbfkbgcccgjkdfhfkiajbf"
        "ieifgbhdcchfabahbeabcibijkkbbefcigbigjkgjjfficgggcckfghbgkk"
        "ickhcifaggcedejfkhgihgifgcfghjeebeiecggigheikedajfcdeajhdgg"
        "gfadfkkfjabibkiedbddakdikijiagcacjfcfkiaecegggcdaegchdba"
        )
    while True:
        niter = 0
        while True:
            niter += 1
            substring, pattern = random_candidate(data, len(alphabet))
            n = max(len(match.group(1)) for match in pattern.finditer(data))
            if n < len(substring):
                result = substring[:n+1]
                break
        assert result not in data
        print("Found non substring:", result, 'in', niter, "iterations")
        break

    """my output -->
    Found non substring: hbfh
    """

I chose a maximum probability 0.01 that the first pass fails. Decreasing this parameter may yield longer substrings.

Many 3 letter strings are found, such as iaf or kde or fei, all in one iteration.

EDIT: see the remark in my post below.

Nice!
I think this one is done now?
cheers
J

Ok, solved! Thank you very much to you and PyTony for your help and support !

ERRATUM: I corrected the first version of the code above, which contained a subtle bug yielding incorrect results. One must consider overlapping matches of the regular expression. This is taken into account at line 38 and 66 of the above code.

The regular expression module normally finds non overlapping matches. In our example, the string kkh was found as a solution, but it does belong to the data string at position 122. It was part of the group kkkh and as kk was a match of the regex, the overlapping match kkh was not seen !

Be a part of the DaniWeb community

We're a friendly, industry-focused community of developers, IT pros, digital marketers, and technology enthusiasts meeting, networking, learning, and sharing knowledge.