I'm looking for advice on performance improvements for creating a weak checksum for file segments (as used in the rsync algorithm).

Here's what I have so far:

def blockchecksums(instream, blocksize=4096):
    from hashlib import md5
    weakhashes = []
    stronghashes = []

    # "" as the sentinel assumes a Python 2 binary-mode stream; on Python 3 use b"".
    for chunk in iter(lambda: instream.read(blocksize), ""):
        a = b = 0
        l = len(chunk)
        # Weak checksum: a is the plain byte sum, b weights each byte by its
        # distance from the end of the block.  (With bytes rebound to bytearray
        # on Python 2.7, as noted below, iterating yields ints; on Python 3
        # iterating over bytes yields ints by default.)
        for n, i in enumerate(bytes(chunk)):
            a += i
            b += (l - n) * i
        weakhashes.append((b << 16) | a)
        stronghashes.append(md5(chunk).hexdigest())

    return weakhashes, stronghashes

I haven't had any luck speeding things up using itertools or built-in C functions (like any()).
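
One pure-Python reformulation is to push the plain byte sum into the built-in sum(), which iterates at C speed over a bytearray, and derive b from the identity sum((l - n)*i) == l*sum(i) - sum(n*i). This is only a sketch of the idea (the name blockchecksums_sums is illustrative, and it has not been benchmarked here), not a guaranteed speedup:

def blockchecksums_sums(instream, blocksize=4096):
    from hashlib import md5
    weakhashes = []
    stronghashes = []
    # "" sentinel assumes a Python 2 binary-mode stream; use b"" on Python 3.
    for chunk in iter(lambda: instream.read(blocksize), ""):
        data = bytearray(chunk)
        l = len(data)
        a = sum(data)                                # plain byte sum, done at C speed
        c = sum(n * i for n, i in enumerate(data))   # weighted sum, still a Python-level loop
        b = l * a - c
        weakhashes.append((b << 16) | a)
        stronghashes.append(md5(chunk).hexdigest())
    return weakhashes, stronghashes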

This gives a small (about 10%) performance improvement. Tested on a 30 MB file with bytes = bytearray on Python 2.7.

def blockchecksums2(instream, blocksize=4096):
    from hashlib import md5
    weakhashes = []
    stronghashes = []
    for chunk in iter(lambda: instream.read(blocksize), ""):
        l = len(chunk)
        a = 0
        c = 0
        for n, i in enumerate(bytes(chunk)):
            a += i
            c += n * i
        # Same weak checksum as above, using the identity
        # sum((l - n) * i) == l * sum(i) - sum(n * i).
        b = l * a - c
        weakhashes.append((b << 16) | a)
        stronghashes.append(md5(chunk).hexdigest())
    return weakhashes, stronghashes

There is much more room for optimization. Did you consider using numpy, or implementing the inner loop in C?
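
If numpy is available, both sums vectorize directly. The following is just a sketch of what that might look like (the function name and details are illustrative, and it has not been benchmarked against the versions above):

import numpy as np

def blockchecksums_numpy(instream, blocksize=4096):
    from hashlib import md5
    weakhashes = []
    stronghashes = []
    # "" sentinel assumes a Python 2 binary-mode stream; use b"" on Python 3.
    for chunk in iter(lambda: instream.read(blocksize), ""):
        data = np.frombuffer(chunk, dtype=np.uint8).astype(np.uint64)
        l = len(data)
        a = int(data.sum())                                            # plain byte sum
        b = int((np.arange(l, 0, -1, dtype=np.uint64) * data).sum())   # weights l, l-1, ..., 1 == (l - n)
        weakhashes.append((b << 16) | a)
        stronghashes.append(md5(chunk).hexdigest())
    return weakhashes, stronghashes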

Thanks for the reply/code. I am considering just making a .pyd file to handle the inner loop. Using C for this function tested about 16x faster than Python (probably even more with your contribution).
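
For reference, a lighter-weight alternative to a full .pyd extension module is calling a plain C function through ctypes. The library name weaksum.dll and the exported function weak_checksum below are hypothetical, just to sketch how the hookup could look once the inner loop is compiled:

import ctypes

# Hypothetical: assumes you have compiled a C function
#   unsigned int weak_checksum(const unsigned char *buf, size_t len);
# (returning (b << 16) | a for the block) into weaksum.dll / weaksum.so.
lib = ctypes.CDLL("./weaksum.dll")
lib.weak_checksum.argtypes = [ctypes.c_char_p, ctypes.c_size_t]
lib.weak_checksum.restype = ctypes.c_uint32

def weak_checksum(chunk):
    # chunk is the raw block read from the stream; length is passed explicitly.
    return lib.weak_checksum(chunk, len(chunk))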
