Performance

Question

Darek6 0 Newbie Poster

13 Years Ago

Hi everybody,

I've got some code that builds an inverted index for a given text. From a list of tokens, the function produces a list of (token, positions) pairs sorted by frequency.

Example:
inverted_index(['the', 'house', ',', 'the', 'beer'])
[('the', [0, 3]), ('beer', [4]), ('house', [1]), (',', [2])]

Code:

def outp(txt):
    ind = {}
    for word in txt:
        if word not in ind.keys():
            i = txt.index(word)
            ind[word] = [i]
        else:
            i = txt.index(word, ind[word][-1]+1)
            ind[word].append(i)
        sorted_ind = sorted(ind.items(), key=lsort, reverse=True)
    return sorted_ind




def lsort(kv):
    return len(kv[1])

The code works, but it's very slow.
So my question is: how could it be written so that it runs faster?

Thanks for any suggestions, Darek

python

Edited 13 Years Ago by Darek6


HiHe 174 Junior Poster

13 Years Ago

This might be faster:

# create a word:index dictionary and a sorted list of index tuples

def create_index_dict(data_list):
    index_dict = {}
    for ix, word in enumerate(data_list):
        index_dict.setdefault(word, []).append(ix)
    return index_dict


data_list = ['the', 'house', ',', 'the', 'beer']
index_dict = create_index_dict(data_list)
print(index_dict)

'''
{'house': [1], 'the': [0, 3], 'beer': [4], ',': [2]}
'''

index_list = [(k, v) for k, v in index_dict.items()]
# note: this sorts the tuples by word (reverse alphabetical), not by frequency
print(sorted(index_list, reverse=True))

'''
[('the', [0, 3]), ('house', [1]), ('beer', [4]), (',', [2])]
'''
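If the list should be ordered by frequency, as the question asks, a key function can be passed to sorted(). This is only a minimal sketch: the dictionary literal stands in for the result of create_index_dict(), and the name by_frequency is made up for the example.

```python
# stand-in for the dictionary returned by create_index_dict(data_list)
index_dict = {'the': [0, 3], 'house': [1], ',': [2], 'beer': [4]}

def by_frequency(item):
    """Sort key: number of positions recorded for the word."""
    word, positions = item
    return len(positions)

# most frequent words first; words with equal counts keep their dict order
index_list = sorted(index_dict.items(), key=by_frequency, reverse=True)
print(index_list[0])
```

Here the most frequent word, 'the', comes first. How ties are ordered depends on the dictionary's iteration order, so add a second component to the key if a fixed order is needed.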

Gribouillis commented: fast code +13

Gribouillis 1,391 Programming Explorer

13 Years Ago

I tried to do it using itertools.groupby(), but HiHe's method is faster. The reason is probably that groupby() needs an initial sort, while filling a dict doesn't. Here is the comparison:

#!/usr/bin/env python
# -*-coding: utf8-*-
# compare two implementations of creating a sorted index list

data_list = ['the', 'house', ',', 'the', 'beer']

from itertools import count, groupby
from operator import itemgetter
ig0 = itemgetter(0)
ig1 = itemgetter(1)

def score(item):
    return (-len(item[1]), item[0])

def grib_func(data_seq):
    L = sorted(zip(data_seq, count(0)))
    L = ((key, [x[1] for x in group]) for key, group in groupby(L, key = ig0))
    return sorted(L, key = score)

# hihe's code    
def create_index_dict(data_list):
    index_dict = {}
    for ix, word in enumerate(data_list):
        index_dict.setdefault(word, []).append(ix)
    return index_dict

def hihe_func(data_seq):
    return sorted(create_index_dict(data_seq).items(), key = score)

# comparison code

print(grib_func(data_list))
print(hihe_func(data_list))

from timeit import Timer
for name in ("hihe_func", "grib_func"):
    tm = Timer("%s(data_list)"%name, "from __main__ import %s, data_list"%name)
    print("{0}: {1}".format(name, tm.timeit()))

""" my output -->
[('the', [0, 3]), (',', [2]), ('beer', [4]), ('house', [1])]
[('the', [0, 3]), (',', [2]), ('beer', [4]), ('house', [1])]
hihe_func: 11.2509949207
grib_func: 18.3761279583
"""

Edited 13 Years Ago by Gribouillis

HiHe commented: thanks +6

vegaseat commented: thanks for timing this +14


TrustyTony 888 ex-Moderator Team Colleague Featured Poster · Answer 1 · 2012-05-27T10:31:09+00:00

You are doing a linear scan of the whole txt list every time you call index. Read up on dictionaries in Python.
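To illustrate, here is a minimal sketch of one way to avoid the repeated scans (the function name positions_by_word is made up for this example). enumerate() hands out each position as the loop goes, so index() is never called and the list is walked only once.

```python
def positions_by_word(tokens):
    """Map each token to the list of positions where it occurs, in one pass."""
    positions = {}
    for i, word in enumerate(tokens):
        if word not in positions:   # dict membership test, no .keys() needed
            positions[word] = []
        positions[word].append(i)
    return positions

print(positions_by_word(['the', 'house', ',', 'the', 'beer']))
```

The result still has to be sorted by frequency afterwards, but that is a single sort at the end.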

griswolf 304 Veteran Poster · Answer 2 · 2012-05-27T16:10:56+00:00

Be sure to look at the dict method setdefault. Or use the Counter class from the collections module.
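A rough sketch of both suggestions, assuming the same token list as in the question. Note that Counter only gives the frequencies, not the positions. collections.defaultdict is not mentioned above; it is included here only as a common alternative to setdefault.

```python
from collections import Counter, defaultdict

tokens = ['the', 'house', ',', 'the', 'beer']

# Counter: frequencies only, the positions are lost
counts = Counter(tokens)
print(counts.most_common(1))   # [('the', 2)]

# setdefault: creates the empty list the first time a word is seen
index = {}
for i, word in enumerate(tokens):
    index.setdefault(word, []).append(i)

# defaultdict(list): same idea, the empty list is created automatically
index2 = defaultdict(list)
for i, word in enumerate(tokens):
    index2[word].append(i)

print(index == dict(index2))
```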

TrustyTony 888 ex-Moderator Team Colleague Featured Poster · Answer 3 · 2012-05-27T22:14:21+00:00

If you need to keep indexes of the words, itertools.groupby might be more useful than Counter.
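One thing to keep in mind is that groupby() only groups consecutive equal keys, so the data has to be sorted first. A small sketch under that assumption, using the token list from the question:

```python
from itertools import groupby
from operator import itemgetter

tokens = ['the', 'house', ',', 'the', 'beer']

# Without sorting, the two occurrences of 'the' land in separate groups.
unsorted_keys = [key for key, _ in groupby(tokens)]

# Sorting (word, position) pairs first brings equal words together.
pairs = sorted((word, i) for i, word in enumerate(tokens))
grouped = [(word, [i for _, i in group])
           for word, group in groupby(pairs, key=itemgetter(0))]

print(unsorted_keys)
print(grouped)
```

The extra sort is the price paid compared with simply filling a dictionary.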

vegaseat 1,735 DaniWeb's Hypocrite Team Colleague · Answer 4 · 2012-05-29T18:05:50+00:00

You can speed things up by taking line 10 (the sorted_ind = sorted(...) call) out of the for loop, so the sort runs once instead of on every iteration.
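Applied to the code from the question, that change might look like the sketch below. Only the position of the sort changes. The calls to index() are kept as they were, so each lookup is still a linear scan.

```python
def lsort(kv):
    # sort key: how many positions were recorded for the word
    return len(kv[1])

def outp(txt):
    ind = {}
    for word in txt:
        if word not in ind:
            ind[word] = [txt.index(word)]
        else:
            # continue searching after the last known position
            ind[word].append(txt.index(word, ind[word][-1] + 1))
    # sort once, after the loop has finished
    return sorted(ind.items(), key=lsort, reverse=True)

print(outp(['the', 'house', ',', 'the', 'beer']))
```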