This handy function turns a file into a stream of words stripped of punctuation, whitespace, and digits, but it does not split contractions such as we'd into two words. If you want that, you can further process the yielded words (see the sketch after the code) or change the definition.
lowercase word generator
import string

def get_lower_words(filein):
    for line in filein:
        while line:
            word, match, line = line.partition(' ')
            word = word.lower().strip(string.punctuation +
                                      string.whitespace +
                                      string.digits)
            if word:
                yield word

for word in get_lower_words(open('11.txt')):
    print word
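For example, a minimal sketch of the further processing mentioned above; the helper split_contractions is my own illustration, not part of the original snippet:

def split_contractions(words):
    # split each yielded word at apostrophes, so we'd becomes we, d
    for word in words:
        for part in word.split("'"):
            if part:
                yield part

with open('11.txt') as filein:
    for word in split_contractions(get_lower_words(filein)):
        print word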
griswolf 304 Veteran Poster
Why use partition() instead of split()? What happens with 'malformed' lines such as word1\tword2? Using split() would fix that problem (but it would make the generator hold the whole split line in memory if the lines are long).
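A quick illustration of the tab issue (hypothetical snippet, not from the original post):

line = 'word1\tword2 word3\n'
word, match, rest = line.partition(' ')
print repr(word)    # 'word1\tword2' -- partition(' ') does not break at the tab
print line.split()  # ['word1', 'word2', 'word3'] -- split() breaks at any whitespace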
TrustyTony 888 pyMod Team Colleague Featured Poster
In StackOverflow discussions it has come up that partition is very much faster than split; I have not timed it myself. You can check it by changing the partition call in a copy of the function to

word, line = line.split(None, 1)

and comparing the speed.
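One way to time the two calls directly, a sketch using the standard timeit module (the sample line is my own choice):

import timeit
setup = "line = 'the quick brown fox jumps over the lazy dog ' * 10"
# each statement is run 1,000,000 times by default
print timeit.timeit("line.partition(' ')", setup=setup)
print timeit.timeit("line.split(None, 1)", setup=setup)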
TrustyTony 888 pyMod Team Colleague Featured Poster
Except that with split you should change to a for loop, as the change above does not work for lines with fewer than two words. Also, my simple usage example does not close the file; it is better to use a with statement, as in the sketch below.

I did a few other versions, including a letter-by-letter isalpha scan with groupby, and the timing for a loop over the split line was not bad. I will time a re version for the comparison and post the results.
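A minimal sketch of the with-statement usage, reusing get_lower_words and the same file name as the original example:

# the with statement closes the file even if iteration is abandoned early
with open('11.txt') as filein:
    for word in get_lower_words(filein):
        print word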
TrustyTony 888 pyMod Team Colleague Featured Poster
Here is a do-it-yourself walltime timing of the different versions, run over Alice in Wonderland (Project Gutenberg text 11.txt).
Notice that the results of these versions differ, as some treat every non-letter as a word break.
Now you can start to make plans for how to spend all those saved milliseconds per book (all 78 of them) over the answer in the previous posts.
import string
import time
from collections import defaultdict
import itertools
import re

# translation table to turn everything but letters into ' '
only_letters = ''.join(chr(c) if chr(c).isalpha() else ' ' for c in range(256))
# regular expression capturing words
words = re.compile(r'\w+')

def get_lower_words(filein):
    for line in filein:
        while line:
            line = line.replace('--', ' ')
            word, match, line = line.partition(' ')
            word = word.lower().strip(string.punctuation +
                                      string.whitespace +
                                      string.digits)
            if word:
                yield word

def lower_words_split(filein):
    ''' give lower case words keeping non-letters inside words '''
    for line in filein:
        ## deal with double dashes a la Project Gutenberg
        line = line.replace('--', ' ')
        for word in line.split(None):
            word = word.lower().strip(string.punctuation +
                                      string.digits)
            if word:
                yield word

def lower_words_split_trans(filein):
    ''' Make all non-alpha characters spaces and split with space '''
    for line in filein:
        for word in line.translate(only_letters).split(' '):
            if word:
                yield word.lower()

def lower_generate(filein):
    ''' generate the words like lower_words_split_trans
        by letter-by-letter scan and groupby '''
    return (''.join(c)
            for line in filein
            for islet, c in itertools.groupby(line.lower(),
                                              lambda x: x.isalpha())
            if islet)

def lower_words_re(filein):
    return (w.lower()
            for line in filein
            for w in re.findall(words, line))

for func in (get_lower_words,
             lower_generate,
             lower_words_split,
             lower_words_split_trans,
             lower_words_re):
    with open('11.txt') as alice:
        counts = defaultdict(int)
        t0 = time.time()
        for word in func(alice):
            counts[word] += 1
        t1 = time.time()
        print '%s, %.0f ms' % (func, 1000.0 * (t1 - t0))
        print '\n'.join('%4i:%20s' % pair
                        for pair in sorted(((count, word)
                                            for word, count in counts.items()),
                                           reverse=True)[:10])
        raw_input('Ready')
''' Output on my computer:
Microsoft Windows XP [version 5.1.2600]
(C) Copyright 1985 - 2001 Microsoft Corp.
J:\test>yieldwords.py
<function get_lower_words at 0x00BD3030>, 125 ms
1813: the
934: and
805: to
689: a
628: of
545: it
541: she
462: said
435: you
429: in
Ready
<function lower_generate at 0x00BD31F0>, 266 ms
1818: the
940: and
809: to
690: a
631: of
610: it
553: she
545: i
481: you
462: said
Ready
<function lower_words_split at 0x00BD3170>, 78 ms
1813: the
934: and
805: to
689: a
628: of
545: it
541: she
462: said
435: you
429: in
Ready
<function lower_words_split_trans at 0x00BD31B0>, 47 ms
1818: the
940: and
809: to
690: a
631: of
610: it
553: she
545: i
481: you
462: said
Ready
<function lower_words_re at 0x00BD3230>, 78 ms
1818: the
940: and
809: to
690: a
631: of
610: it
553: she
543: i
481: you
462: said
Ready
'''
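To see why the counts differ, compare how two of the versions tokenize the same line; this is an illustrative snippet of my own, using the functions defined above:

sample = ["Alice's sister--who was reading\n"]
print list(get_lower_words(sample))
# ["alice's", 'sister', 'who', 'was', 'reading'] -- keeps the apostrophe
print list(lower_words_split_trans(sample))
# ['alice', 's', 'sister', 'who', 'was', 'reading'] -- splits at every non-letter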
griswolf 304 Veteran Poster
I rewrote Tony's tests to be more uniform (all of them now use the yield keyword, etc.). I skipped lower_generate, which was slowest in his tests. This was run on my OS X laptop.
Bottom line: using split() beats partition() by a factor of about 3.5 on this data.
from collections import defaultdict
import string
import re
import time

stripA = string.punctuation + string.whitespace + string.digits
stripB = string.punctuation + string.digits
only_letters = ''.join(chr(c) if chr(c).isalpha() else ' ' for c in range(256))
wordRE = re.compile(r'\w+')

def get_lower_words_partition(filein):
    """line.partition(' ')"""
    for line in filein:
        while line:
            word, match, line = line.partition(' ')
            word = word.lower().strip(stripA)
            if word:
                yield word

def get_lower_words_split_one(filein):
    """line.split(None,1)"""
    for line in filein:
        while line:
            try:
                word, line = line.split(None, 1)
            except ValueError:
                word, line = line, None
            word = word.lower().strip(stripA)
            if word:
                yield word

def get_lower_words_split_all(filein):
    """line.split()"""
    for line in filein:
        for word in line.split():
            word = word.lower().strip(stripB)
            if word:
                yield word

def get_lower_words_xlate_split_one(filein):
    """line.translate().split(None,1)"""
    for line in filein:
        line = line.translate(only_letters)
        while line:
            try:
                word, line = line.split(None, 1)
            except ValueError:
                word, line = line, None
            word = word.lower().strip(stripA)
            if word:
                yield word

def get_lower_words_xlate_split_all(filein):
    """line.translate().split()"""
    for line in filein:
        line = line.translate(only_letters)
        for word in line.split():
            word = word.lower().strip(stripB)
            if word:
                yield word

def get_lower_words_xlate_partition(filein):
    """line.translate().partition(' ')"""
    for line in filein:
        line = line.translate(only_letters)
        while line:
            word, match, line = line.partition(' ')
            word = word.lower().strip(stripA)
            if word:
                yield word

def get_lower_words_re(filein):
    """'\w+'.findall(line)"""
    for line in filein:
        for w in wordRE.findall(line):
            if w:
                yield w.lower()

def get_lower_words_xlate_re(filein):
    """'\w+'.findall(line.translate())"""
    for line in filein:
        line = line.translate(only_letters)
        for w in wordRE.findall(line):
            if w:
                yield w.lower()

# collect all get_* generator functions defined above, labeled by docstring
fs = [eval(f) for f in dir() if f.startswith('get_')]
functions = zip(fs, (f.__doc__ for f in fs))
results = []
for func, doc in functions:
    with open('/tmp/big.txt') as rabbit:
        counts = defaultdict(int)
        t0 = time.time()
        for word in func(rabbit):
            counts[word] += 1
        t1 = time.time()
        result = '%5.0f ms (distinct words: %d) -- %s' % (1000.0 * (t1 - t0),
                                                          len(counts), doc)
        print('%5.0f ms -- %s' % (1000.0 * (t1 - t0), doc))
        results.append(result)
print('')
for r in sorted(results):
    print(r)
"""Results using a file 'big.txt' with about 264K lines, some very long (cat of several PHP files, then by hand merged up to 5000 lines into a single line, several places in the file):
% wc /tmp/big.txt
264538 902027 10014551 /tmp/big.txt
% # wc is 'word count' util: 264,538 lines, 902,027 'words' about 1M characters)
%
% python time_wordgen.py
4619 ms -- line.partition(' ')
1497 ms -- '\w+'.findall(line)
1299 ms -- line.split()
3387 ms -- line.split(None,1)
6514 ms -- line.translate().partition(' ')
1625 ms -- '\w+'.findall(line.translate())
1417 ms -- line.translate().split()
3787 ms -- line.translate().split(None,1)
1299 ms (distinct words: 6375) -- line.split()
1417 ms (distinct words: 2577) -- line.translate().split()
1497 ms (distinct words: 3298) -- '\w+'.findall(line)
1625 ms (distinct words: 2577) -- '\w+'.findall(line.translate())
3387 ms (distinct words: 6375) -- line.split(None,1)
3787 ms (distinct words: 2577) -- line.translate().split(None,1)
4619 ms (distinct words: 6375) -- line.partition(' ')
6514 ms (distinct words: 2577) -- line.translate().partition(' ')
"""