Hi all, I have a requirement to remove all non-letter characters (numbers, punctuation, symbols, non-printing characters, etc.).
string.punctuation does a good job, but it does NOT cover any non-English punctuation (such as '。', which is a full stop in Chinese).
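To show the limitation concretely, here is a minimal sketch assuming Python 3. The helper name strip_ascii_punctuation is made up for this illustration; it deletes only the characters listed in string.punctuation, so the Chinese full stop survives.

```python
import string

def strip_ascii_punctuation(text):
    # Hypothetical helper: delete every character found in string.punctuation.
    # string.punctuation only contains ASCII punctuation such as , ! ? -
    return text.translate(str.maketrans("", "", string.punctuation))

sample = "Hello, world! 中文。"
print(strip_ascii_punctuation(sample))  # ',' and '!' are removed, '。' is kept
print("。" in string.punctuation)        # False
```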
So I came across this code:
import unicodedata

def onlyWord(text):
    # Unicode general categories that count as letters.
    Word = set(['Lm','Lo','Lu','Ll','Lt'])
    return ''.join(x for x in text
                   if unicodedata.category(x) in Word)

print(onlyWord('µ'))
Great! It does what I wanted, and it made me realize that ᐒ is a letter (Unicode category Lo).
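As a quick check of the categories involved, this small sketch (assuming Python 3) looks up a few of the characters mentioned in this post. The dictionary name categories is only for illustration.

```python
import unicodedata

# Map each sample character to its Unicode general category.
categories = {ch: unicodedata.category(ch) for ch in ('µ', 'ᐒ', '。', '-')}

for ch, cat in categories.items():
    # Letters have categories starting with 'L'; punctuation starts with 'P'.
    print(repr(ch), ord(ch), cat)
```

The two letters report Ll and Lo, while '。' and '-' report punctuation categories (Po and Pd), which is why onlyWord drops them.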
The problem is that, as a challenge, I CANNOT use unicodedata, collections, re, or a number of other libraries.
So I want to know how to print out the list of code points that are defined as 'Lu' (as an example; I can extend it to the other categories listed above) so that I can do this:
def processDisallowedChars(word):
    '''
    This function removes all non-alphabetical characters from a string, with hyphens and contractions / apostrophes in mind.
    Examples of allowed strings:
    one-north
    tom's
    bill gates
    don't
    中文
    '''
    # Initialize a set of acceptable characters.
    setalphabet = set("abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ")
    setdisallowed = set(word).difference(setalphabet)
    # Code points above ASCII that are treated as letters (hand-picked exclusions so far).
    setexceptions = set(
        i for i in range(181, 195102)
        if not (182 <= i <= 191) and (i != 215) and (i != 247) and not (706 <= i <= 709)
        and not (722 <= i <= 735) and not (741 <= i <= 881) and (i != 12290)
    )
    for x in setdisallowed:
        if ord(x) not in setexceptions:
            # Further check to ensure the word does not start / end with illegal characters.
            while word.endswith(x) or word.startswith(x):
                word = word[0:1].replace(x, "") + word[1:len(word) - 1] + word[len(word) - 1:].replace(x, "")
            if x == '\'':
                # Removes a disallowed "'" character when it is not followed by "t" or "s" (Example: don'b instead of don't / Tom'b instead of Tom's).
                # This is to allow contractions and apostrophes.
                targetchar = max([word.rfind("'t"), word.rfind("'s")])
                if targetchar > 0:
                    word = word[:targetchar].replace(x, "") + word[targetchar:]  # Strip any disallowed use of "'" except 't and 's.
                else:
                    word = word.replace(x, "")
            elif x == '-':
                # Removes disallowed consecutive "-" characters. Hyphens are meant to be used once.
                # Allowed example: Twentieth-century (20th century).
                # Disallowed example: Twentieth--century (will be replaced with Twentieth-century) or Mid-----Air (Mid-Air).
                # 1st line: builds a new list containing each character of the string,
                # then joins the characters that satisfy the criteria in the 2nd line.
                # 2nd line: accepts any character, except that a run of consecutive "-" is treated as a single "-".
                word = "".join([word[n] for n in range(len(word))
                                if (word[n] != '-') or (word[n] != word[n-1])])
            else:
                word = word.replace(x, "")  # Remove all other punctuation, numbers, unprintable characters except hyphens and contractions (apostrophes).
    return word  # Returns a string with only legal characters.

print(processDisallowedChars('中文。'))
What I actually want is to generate this 'setexceptions' as a tuple of code points, so that it accepts only letters and no other characters.
References I have used:
http://en.wikibooks.org/wiki/Unicode/Character_reference/0000-0FFF
http://www.fileformat.info/info/unicode/char/372/index.htm
http://stackoverflow.com/questions/11066400/remove-punctuation-from-unicode-formatted-strings
Now I am stuck on this variable:
setexceptions = set(
    i for i in range(181, 195102)
    if not (182 <= i <= 191) and (i != 215) and (i != 247) and not (706 <= i <= 709)
    and not (722 <= i <= 735) and not (741 <= i <= 881) and (i != 12290)
)
Going through the WikiBook page and comparing it against fileformat.info by hand is really inefficient.
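To make clear what kind of output I am after, here is a rough sketch (assuming Python 3) of how such a table could be produced once, offline, on a machine where unicodedata is allowed, and then pasted into the restricted code as plain literals. The names letter_ranges and is_letter are made up for this illustration, and I do not know whether this is the best approach.

```python
import unicodedata  # used only offline, to build the table once

LETTER_CATEGORIES = {'Lu', 'Ll', 'Lt', 'Lm', 'Lo'}

def letter_ranges(categories=LETTER_CATEGORIES, limit=0x110000):
    """Return inclusive (start, end) code point ranges whose category is in `categories`."""
    ranges = []
    start = None
    for cp in range(limit):
        wanted = unicodedata.category(chr(cp)) in categories
        if wanted and start is None:
            start = cp                       # a new run of matching code points begins
        elif not wanted and start is not None:
            ranges.append((start, cp - 1))   # the run ended at the previous code point
            start = None
    if start is not None:
        ranges.append((start, limit - 1))    # close a run that reaches the limit
    return ranges

def is_letter(ch, ranges):
    # Run-time check that needs no imports: linear scan over the pasted ranges.
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in ranges)

# Print the first few 'Lu' ranges; the full list could be pasted in as a literal.
print(letter_ranges({'Lu'})[:3])  # [(65, 90), (192, 214), (216, 222)]
```

Storing ranges instead of every single code point keeps the pasted literal much shorter than a tuple of individual numbers, and the exact ranges depend on the Unicode version that the Python build ships with.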
Thank you for the help.