How to return all characters that are within the Unicode 'Lu' class

Question

samuel1991 0 Newbie Poster

10 Years Ago

Hi all, I need to remove every non-letter character from a string (numbers, punctuation, symbols, non-printing characters, etc.).

string.punctuation does a good job, but it does NOT remove any non-English punctuation (like '。', which is a full stop in Chinese).
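A quick way to see that limitation, as a minimal sketch (the sample string is made up for illustration): string.punctuation holds ASCII punctuation only, so the ideographic full stop survives the filter.

```python
import string

# Hypothetical sample text mixing ASCII and CJK punctuation.
text = "Hello, 世界。"

# Drop every character listed in string.punctuation (ASCII punctuation only).
stripped = "".join(ch for ch in text if ch not in string.punctuation)

print(stripped)  # the ASCII comma is removed, but '。' (U+3002) is still there
```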

So I came across this code:

import unicodedata

def onlyWord(text):
    # General categories for letters: modifier, other, uppercase, lowercase, titlecase.
    Word = set(['Lm','Lo','Lu','Ll','Lt'])
    return ''.join(x for x in text
                   if unicodedata.category(x) in Word)

print(onlyWord('µ'))

Great! It does what I wanted, and I now realize that ᐒ is a letter (Unicode category Lo).
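To see which categories are involved, here is a small sketch that prints the general category of a few sample characters (the sample list is only an illustration, using characters mentioned in this post plus an ASCII letter and digit):

```python
import unicodedata

# Sample characters; the list is only for illustration.
samples = ["µ", "ᐒ", "中", "。", "A", "1"]

# Map each character to its two-letter Unicode general category.
categories = {ch: unicodedata.category(ch) for ch in samples}

for ch, cat in categories.items():
    print(f"U+{ord(ch):04X} {ch!r}: {cat}")
```

The letter categories all start with 'L', while '。' is reported as 'Po' (other punctuation) and '1' as 'Nd' (decimal digit), which is why the filter drops them.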

The problem is that, as a challenge, I CANNOT use unicodedata, collections, re, or a number of other libraries.

So I want to know how to print the list of code points whose category is 'Lu' (as an example; I can extend it to the other categories listed above) so that I can do this:

def processDisallowedChars(word):
        '''
        This function removes all the non-alphabetical characters within strings, with hyphens and contractions / apostrophes in mind.

        Examples of allowed chars:

        one-north

        tom's

        bill gates

        don't

        中文
        '''
        #Initialize a set of acceptable characters.
        setalphabet = set("abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ")
        setdisallowed = set(word).difference(setalphabet)
        setexceptions = set(\
            tuple(i for i in range(181,195102) \
                if (182 <= i <= 191) == False and (i != 215) and (i != 247) and (706 <= i <= 709) == False \
                  and (722 <= i <= 735) == False and (741 <= i <= 881) == False and (i != 12290)
                  ))

        for x in set(setdisallowed):
            if(ord(x) not in setexceptions):
                #Further check to ensure there is no starting / ending with illegal characters.
                #str.strip(x) removes every leading and trailing occurrence of x.
                word = word.strip(x)

                if (x == '\''):
                    #Removes a disallowed "'" character when it is not followed by "t" or "s" (Example: don'b instead of don't / Tom'b instead of Tom's).
                    #This is to allow contractions and apostrophes.

                    targetchar = max([word.rfind("'t"),word.rfind("'s")])

                    if targetchar > 0:
                        word = word[:targetchar].replace(x,"") + word[targetchar:]#Separate any unallowed use of "'" except 't and 's.
                    else:
                        word = word.replace(x,"")

                elif (x == '-'):
                    #Removes disallowed consecutive "-" characters. Hyphens are meant to be used once.

                    #Allowed example: Twentieth-century (20th century).
                    #Disallowed example: Twentieth--century (Will be replaced as Twentieth-century) or Mid-----Air (Mid-Air).

                    #1st Line: Builds a list of the characters of the string and joins it back into a new string,
                    #keeping only the characters that satisfy the criteria in the 2nd Line.
                    #2nd Line: Keeps any char except a "-" that directly follows another "-", so a run of "-" collapses to a single "-".

                    word = "".join([word[n] for n in range(len(word)) \
                           if (word[n] != '-') or (word[n] != word[n-1])])
                else:
                    word = word.replace(x,"") #Remove all other punctuation, numbers and unprintable characters, except hyphens and contractions (apostrophes).

        return word #Returns a string with only legal characters.


print(processDisallowedChars('中文。'))

What I actually want is to generate 'setexceptions' as a tuple of code points, so that it accepts only letters and no other characters.

References I have used:

http://en.wikibooks.org/wiki/Unicode/Character_reference/0000-0FFF

http://www.fileformat.info/info/unicode/char/372/index.htm

http://stackoverflow.com/questions/11066400/remove-punctuation-from-unicode-formatted-strings

Now I am stuck on this variable:

setexceptions = set(\
                tuple(i for i in range(181,195102) \
                    if (182 <= i <= 191) == False and (i != 215) and (i != 247) and (706 <= i <= 709) == False \
                      and (722 <= i <= 735) == False and (741 <= i <= 881) == False and (i != 12290)
                      ))

Going through the Wikibooks page and comparing it with fileformat.info is really inefficient.

Thank you for the help.


Gribouillis 1,391 Programming Explorer Team Colleague · Answer 1 · 2014-10-25T07:36:03+00:00

If you run >>> help(chr) in Python 3, you get

Help on built-in function chr in module builtins:

chr(...)
    chr(i) -> Unicode character

    Return a Unicode string of one character with ordinal i; 0 <= i <= 0x10ffff.

Now we can get a list of code points with

>>> import unicodedata
>>> s = [i for i in range(0x110000) if unicodedata.category(chr(i)) == 'Lu']
>>> len(s)
1441
>>> s
[65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 192, 193, 194, 195, 196, 197, 198, 199, 200, 201, 202, 203, 204, 205, 206, 207, 208, 209, 210, 211, 212, 213, 214, 216, 217, 218, 219, 220, 221, 222, 256, 258, 260, 262, 264, 266, 268, 270, 272, 274, 276, 278, 280, 282, 284, 286, 288, 290, 292, 294, 296, 298, 300, 302, 304, 306, 308, 310,...]
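Since the challenge forbids unicodedata at run time, one possible approach is to generate the table once, offline, and paste it into the program as a compact list of ranges. The sketch below is only an illustration under that assumption: to_ranges, in_ranges, only_letters and LETTER_RANGES are hypothetical names, and the table contents depend on the Unicode version shipped with your Python.

```python
import unicodedata  # only needed for the offline generation step

LETTER_CATEGORIES = {"Lu", "Ll", "Lt", "Lm", "Lo"}

def to_ranges(codepoints):
    """Collapse a sorted iterable of ints into inclusive (start, end) pairs."""
    ranges = []
    for cp in codepoints:
        if ranges and cp == ranges[-1][1] + 1:
            ranges[-1] = (ranges[-1][0], cp)  # extend the current run
        else:
            ranges.append((cp, cp))           # start a new run
    return ranges

def in_ranges(cp, ranges):
    """Hand-written binary search over sorted, non-overlapping inclusive ranges."""
    lo, hi = 0, len(ranges) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        start, end = ranges[mid]
        if cp < start:
            hi = mid - 1
        elif cp > end:
            lo = mid + 1
        else:
            return True
    return False

# Offline step: run this once where unicodedata is allowed, print the
# result, and paste it into the restricted program as a literal.
letters = [i for i in range(0x110000)
           if unicodedata.category(chr(i)) in LETTER_CATEGORIES]
LETTER_RANGES = to_ranges(letters)
print(len(LETTER_RANGES), "ranges instead of", len(letters), "code points")

def only_letters(text):
    """Keep only characters whose code point falls inside LETTER_RANGES."""
    return "".join(ch for ch in text if in_ranges(ord(ch), LETTER_RANGES))

print(only_letters("中文。"))
```

Once LETTER_RANGES is pasted in as a literal, in_ranges and only_letters need no imports at all, and the range form is far shorter than listing every code point.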