kth biggest/smallest without sorting

TrustyTony 0 Tallied Votes 368 Views Share

Inspired by Computer Science forum post, I did this code. I prepared it Shedskin compatible, and it runs quite nicely indeed. Even interpreted the algorithm is faster than full sort.

Here interpreted test for the module, shedskinning this test would show better time for sort solution.

Python version has benefit of being able to take any sequence as input and return.

from __future__ import print_function
from random import randint
import time

from kth_biggest import *


if __name__ == '__main__':
    # second biggest 4 is 5th smallest
    assert kth_biggest([4,3,1,2,3,5], 2) == kth_smallest([4,3,1,2,3,5],5)
    # random
    amount, max_items = 16, 1000000
    print('Testing %i random tests of maximum of %i items and check by sorting' % (amount, max_items))
    for s in ([randint(-100000,100000) for _ in range(randint(10,max_items))]
               for _ in range(amount)) :
        count, t0 = 0, time.clock()
        this_k = randint(1, len(s))
        result = kth_biggest(s, this_k)
        t0 -= time.clock()
        print('%i items %i biggest: time %.3f ms' % (len(s), this_k, -1000 * t0))
        t0 = time.clock()
        check_ok = sorted(s)[-this_k] == result
        t0 -= time.clock()
        print('Sorting check took %.3f ms' % (-1000*t0))         
        assert check_ok, 'Bad result: %s for k: %i, sequence: %s' % (result, this_k, t)
    print('Done')
""" tests with shedskinned module (shedskin -eb):
Testing 16 random tests of maximum of 1000000 items and check by sorting
857690 items 539792 biggest: time 121.498 ms
Sorting check took 1216.890 ms
868304 items 143979 biggest: time 82.965 ms
Sorting check took 1222.044 ms
865958 items 554365 biggest: time 96.862 ms
Sorting check took 1215.802 ms
937524 items 605917 biggest: time 106.586 ms
Sorting check took 1414.248 ms
730108 items 705673 biggest: time 54.905 ms
Sorting check took 997.423 ms
741796 items 335957 biggest: time 85.154 ms
Sorting check took 1013.556 ms
766574 items 37795 biggest: time 67.348 ms
Sorting check took 1111.185 ms
656570 items 655384 biggest: time 60.334 ms
Sorting check took 878.164 ms
779138 items 327158 biggest: time 99.968 ms
Sorting check took 1076.357 ms
503874 items 221915 biggest: time 26.786 ms
Sorting check took 642.609 ms
406703 items 177733 biggest: time 35.485 ms
Sorting check took 495.052 ms
127381 items 20038 biggest: time 8.803 ms
Sorting check took 123.672 ms
659089 items 368176 biggest: time 73.220 ms
Sorting check took 914.019 ms
921276 items 300962 biggest: time 119.500 ms
Sorting check took 1309.654 ms
275044 items 243240 biggest: time 53.022 ms
Sorting check took 323.930 ms
50570 items 46127 biggest: time 3.214 ms
Sorting check took 43.433 ms
Done
"""

Example run the test including the sort shedskinned:

Testing 16 random tests of maximum of 1000000 items and check by sorting
633456 items 165165 biggest: time 219.000 ms
Sorting check took 172.000 ms
273703 items 128897 biggest: time 31.000 ms
Sorting check took 62.000 ms
181203 items 95339 biggest: time 15.000 ms
Sorting check took 32.000 ms
11537 items 4496 biggest: time 15.000 ms
Sorting check took -0.000 ms
862637 items 64172 biggest: time 47.000 ms
Sorting check took 235.000 ms
881542 items 379725 biggest: time 94.000 ms
Sorting check took 265.000 ms
670005 items 154637 biggest: time 62.000 ms
Sorting check took 172.000 ms
530071 items 232143 biggest: time 47.000 ms
Sorting check took 140.000 ms
831663 items 756406 biggest: time 46.000 ms
Sorting check took 219.000 ms
705844 items 420555 biggest: time 47.000 ms
Sorting check took 187.000 ms
429281 items 20335 biggest: time 16.000 ms
Sorting check took 109.000 ms
809376 items 238219 biggest: time 47.000 ms
Sorting check took 203.000 ms
831610 items 547751 biggest: time 62.000 ms
Sorting check took 235.000 ms
120084 items 27530 biggest: time 15.000 ms
Sorting check took 32.000 ms
351194 items 239526 biggest: time 16.000 ms
Sorting check took 78.000 ms
675622 items 233619 biggest: time 47.000 ms
Sorting check took 172.000 ms
Done
from __future__ import print_function

def kth_biggest(seq, k):
    """ kth_biggest without sorting all array or repeating to find maximum k times
        will crash sooner or later becouse of recursion and Python's recursion limit,
        but not so soon because of divide and conquer algorithm inspired by quicksort

    """
    assert k > 0,'k must be positive %i' % k
    ls = len(seq)
    if ls < k:
        raise ValueError('kth %i: only %i elements' % (k, ls))
    if k == 1:
        return max(seq)
    if k == ls:
        return min(seq)

    pivot = seq[0]
    # others except the pivot compared with pivot
    bigger =  [item for item in seq[1:] if item > pivot]
    #debug
    #print(('seq: %s, k: %i,\npivot: %i,\nbigger: %s\n' %
    #       (seq, k, pivot, bigger)))

    if len(bigger) == k - 1:
        return pivot
    elif len(bigger) >= k:
        return kth_biggest(bigger, k)

    not_bigger = [item for item in seq[1:] if item <= pivot]
    #debug
    #print('not_bigger:', not_bigger)
    return kth_biggest(not_bigger, k - len(bigger) - 1)

def kth_smallest(seq, k):
    """ kth_smallest uses kth_biggest to count kth_smallest """
    return kth_biggest(seq, len(seq) - k + 1)
TrustyTony 888 pyMod Team Colleague Featured Poster

Worst case is fully sorted sequence like number range, which is not so ideal, and big k, range(10000) crashes CPython, 10* more does shedskinned module.

Would make sense try the kth last of the sequence as pivot and do iteration of parts before and after pivot with itertools.chain. Then sorted case would succeed fast.

TrustyTony 888 pyMod Team Colleague Featured Poster

Here the chain version over pivot of kth last. Looks hurt by performance in CPython at least.

from __future__ import print_function
from itertools import chain

def kth_biggest(seq, k):
    """ kth_biggest without sorting all array or repeating to find maximum k times
        will crash sooner or later becouse of recursion and Python's recursion limit,
        but not so soon because of divide and conquer algorithm inspired by quicksort

    """
    assert k > 0,'k must be positive %i' % k
    ls = len(seq)
    if ls < k:
        raise ValueError('kth %i: only %i elements' % (k, ls))
    if k == 1:
        return max(seq)
    if k == ls:
        return min(seq)

    pivot = seq[-k]
    # others except the pivot compared with pivot
    bigger =  [item for item in chain(seq[:-k], seq[-k+1:]) if item > pivot]

    if len(bigger) == k - 1:
        return pivot
    elif len(bigger) >= k:
        return kth_biggest(bigger, k)

    not_bigger = [item for item in chain(seq[:-k], seq[-k+1:]) if item <= pivot]
    return kth_biggest(not_bigger, k - len(bigger) - 1)

def kth_smallest(seq, k):
    """ kth_smallest uses kth_biggest to count kth_smallest """
    return kth_biggest(seq, len(seq) - k + 1)

#usage for shedskin by specific type of sequence, pure Python is generic
kth_smallest([1, 2, 3], 2)
TrustyTony 888 pyMod Team Colleague Featured Poster

This looks like beeing known as selection algorithm. At least I did it quite similarly. The selection of pivot in interpreted language causes quite some penalty, so looking for O class is not whole truth.http://en.wikipedia.org/wiki/Selection_algorithm
Probably should prepare shedskinned pivot-choose algorithm for median of medians of five.

Also check skip list algorithm http://code.activestate.com/recipes/576930-efficient-running-median-using-an-indexable-skipli/ see also blist using comments after the post.

Here my speed test comparison for versions of my code, results in my computer with explanation of versions:

kth_biggest: 2.58 s (shedskinned select last)
kth_biggest_p: 6.87 s (python interpreted kth last as pivot)
kth_biggest1: 6.78 s (python version, sorted worst case, select last version)
kth_biggest_r: 7.79 s (python interpreted random selection)
sorting: 12.2 s (full sort and -index)
from __future__ import print_function
from random import randint
import time
import itertools

import kth_biggest as kb2
import kth_biggest1 as kb1
import kth_biggest_p as kb_p
import kth_biggest_r as kb_r


if __name__ == '__main__':
    # random
    amount, max_items = 16, 1000000
    timings = dict()
    print('Testing %i random tests of maximum of %i items and check by sorting\n' % (amount, max_items))
    for s in ([randint(-100000,100000) for _ in range(randint(10,max_items))]
               for _ in range(amount)) :
        this_k = randint(1, len(s))
        print('%i items %i biggest:' % (len(s), this_k))
        for module in (kb_r, kb_p, kb1, kb2):
            t0 = time.clock()
            result = module.kth_biggest(s, this_k)
            t0 -= time.clock()
            timings.setdefault(module.__name__, []).append(-t0)
            print(('Module %s:  time %3.3f ms' % (module.__name__, -1000 * t0)).rjust(50))

        t0 = time.clock()
        check_ok = sorted(s)[-this_k] == result
        t0 -= time.clock()
        timings.setdefault('sorting', []).append(-t0)
        print('Sorting check took %.3f ms' % (-1000*t0))         
        assert check_ok, 'Bad result: %s for k: %i, sequence: %s' % (result, this_k, s)
        print()

    print('\n'.join('{0}: {1:>3.3} s'.format(key, sum(timing)) for key, timing in sorted(timings.items(), key=lambda x: x[1])))
    raw_input('Done.')
Be a part of the DaniWeb community

We're a friendly, industry-focused community of developers, IT pros, digital marketers, and technology enthusiasts meeting, networking, learning, and sharing knowledge.