Filesystem Metadata Indexing

amalgorious 0 Newbie Poster

14 Years Ago

Hi. I write really terrible Python, sort of like using crayons in art class.

I have a couple of questions. First: where is the best forum for Python newbies to learn how to write better code (standards) and optimize it (speed)? I just joined this forum, so it may not be the right place, but maybe it is.

Second, I've got some terrible code for you to laugh at, and I have lots of questions about it. The only way to run it is to import it into IDLE and then call main('/name/of/your/directory_to_walk', '/name/of/your/directory_to_save_walk.xml'). The idea of the script is to take a directory, do some maths, and output an XML file with some info about that directory. I have a use case for it, and the reason I need to optimize it is that I have to run it on multiple terabytes.

So you're laughing already, but I have read up on how to convert this into a terminal/command-line tool, and I will do that once I get it working better. On to the questions:

1. Is running in IDLE slower/faster/the exact same as running from the terminal?
2. Is the *main* bottleneck in my program generating MD5 checksums?
3. How could I tack on something to report the files/sec or MB/sec? Also, how about printing an ETA for the script to complete? (I generally start the script and go get a cup of coffee, with no idea when it will complete.)
4. I understand it is better practice to break out my program into functions and classes. Right now I just get everything done under one ugly function. I *think* a good idea is to have a class for my XML structure, and maybe one for the "metadata", or maybe just one for both? What functions need to be broken out?
5. What can I fix in my code to make it run faster? What am I currently doing that could be done more easily? (Notice my wonky method of calculating directory depth, number of files/directories, etc.)
6. These folks do it better: http://www.mail-archive.com/kragen-hacks@canonical.org/msg00088.html and http://afflib.org/software/fiwalk/dfxml_tool. I can get the second to run from the terminal and it is great, and I may just end up using it (or calling it from my program?), but right now I am focusing on my code.
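
For question 3, something along these lines might work. It is only a sketch under one assumption: that an extra os.walk pass to total up the file sizes is affordable, since an ETA needs to know how much work there is in total. The names total_bytes and Progress are made up for illustration and are not part of the script below.

```python
import os
import time


def total_bytes(path):
    """First pass: add up file sizes so the second pass can estimate an ETA."""
    total = 0
    for root, dirs, files in os.walk(path):
        for name in files:
            try:
                total += os.path.getsize(os.path.join(root, name))
            except OSError:
                pass  # file vanished or is unreadable; leave it out of the total
    return total


class Progress(object):
    """Hypothetical helper: tracks files/sec, MB/sec and a rough ETA."""

    def __init__(self, expected_bytes, clock=time.time):
        self.expected_bytes = expected_bytes
        self.clock = clock  # injectable so the maths can be checked without waiting
        self.started = clock()
        self.files_done = 0
        self.bytes_done = 0

    def update(self, size):
        """Call once per file, after it has been hashed."""
        self.files_done += 1
        self.bytes_done += size

    def report(self):
        """Return current rates and the estimated seconds remaining."""
        elapsed = max(self.clock() - self.started, 1e-9)  # avoid dividing by zero
        bytes_per_sec = self.bytes_done / float(elapsed)
        remaining = max(self.expected_bytes - self.bytes_done, 0)
        eta = remaining / bytes_per_sec if bytes_per_sec > 0 else None
        return {
            'files_per_sec': self.files_done / float(elapsed),
            'mb_per_sec': bytes_per_sec / (1024.0 * 1024.0),
            'eta_seconds': eta,
        }
```

In the script, progress = Progress(total_bytes(path)) would be created before the main walk, progress.update(size) called after each file is hashed, and progress.report() printed every few hundred files. The ETA is based on bytes rather than file count because hashing time should scale with file size, though that is a guess until it is measured.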

#!/usr/bin/env python
import os, os.path, re, hashlib, stat, time, optparse
from xml.dom.minidom import Document

start = time.time()
doc = Document()
def audit(path):
    total_size,total_files,total_dirs,total_depth = 0,0,0,0
    droot=doc.createElement('asset_inventory')
    doc.appendChild(droot)
    droot.setAttribute('xmlns:xsi', "http://www.w3.org/2001/XMLSchema-instance")
    droot.setAttribute('xmlns:oai_dc', "http://www.openarchives.org/OAI/2.0/oai_dc/")
    droot.setAttribute('xmlns:dc', "http://purl.org/dc/elements/1.1/")
    droot.setAttribute('xmlns:dcterms', "http://purl.org/dc/terms/")
    droot.setAttribute('xmlns:dcmitype', "http://purl.org/dc/dcmitype/")
    droot.setAttribute('xmlns:premis', "info:lc/xmlns/premis-v2")
    droot.setAttribute('xmlns:cld', "http://purl.org/cld/terms/")
    droot.setAttribute('xmlns:cdtype', "http://purl.org/cld/cdtype/")
    asset = doc.createElement('asset_node')
    asset.setAttribute('relative_path', path)
    collection = doc.createElement('dcmitype:collection')
    asset.appendChild(collection)
    notes = doc.createElement('curators_notes')
    asset.appendChild(notes)
    droot.appendChild(asset)
    
    for root, dirs, files in os.walk(path):
        total_files += len(files)
        total_dirs += len(dirs)
        asset.setAttribute('total_files', str(total_files))            
        asset.setAttribute('total_dirs', str(total_dirs))
        
        if not dirs:
            # Depth of this leaf directory relative to the starting path.
            # os.sep keeps the count correct on Windows as well as Unix.
            token = root.count(os.sep) - path.count(os.sep)
            if token >= total_depth:
                total_depth = token
                asset.setAttribute('total_depth', str(total_depth))  
        
        for f in files:
            fp = os.path.join(root, f)
            
            mode, ino, dev, nlink, uid, gid, size, atime, mtime, ctime = os.stat(fp)
            fileobject = doc.createElement('fileobject')

            # Slice off the starting path; re.sub would treat it as a regex pattern.
            filenameT = doc.createTextNode(fp[len(path):])
            filenameE = doc.createElement('filename')
            
            filenameE.appendChild(filenameT)
            fileobject.appendChild(filenameE)
            
            filesizeE = doc.createElement('filesize')
            filesizeT = doc.createTextNode(str(size))
            filesizeE.appendChild(filesizeT)
            fileobject.appendChild(filesizeE)

            inodeE = doc.createElement('inode')
            inodeT = doc.createTextNode(str(ino))
            inodeE.appendChild(inodeT)
            fileobject.appendChild(inodeE)

            nlinkE = doc.createElement('nlink')
            nlinkT = doc.createTextNode(str(nlink))
            nlinkE.appendChild(nlinkT)
            fileobject.appendChild(nlinkE)

            uidE = doc.createElement('uid')
            uidT = doc.createTextNode(str(uid))
            uidE.appendChild(uidT)
            fileobject.appendChild(uidE)

            gidE = doc.createElement('gid')
            gidT = doc.createTextNode(str(gid))
            gidE.appendChild(gidT)
            fileobject.appendChild(gidE)

            mtimeE = doc.createElement('mtime')
            mtimeT = doc.createTextNode(str(mtime))
            mtimeE.appendChild(mtimeT)
            fileobject.appendChild(mtimeE)

            atimeE = doc.createElement('atime')
            atimeT = doc.createTextNode(str(atime))
            atimeE.appendChild(atimeT)
            fileobject.appendChild(atimeE)

            ctimeE = doc.createElement('ctime')
            ctimeT = doc.createTextNode(str(ctime))
            ctimeE.appendChild(ctimeT)
            fileobject.appendChild(ctimeE)

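            # Hash the file in fixed-size chunks so large files never sit fully in memory.
            # The with block closes the handle, and fh no longer shadows the loop variable f.
            m = hashlib.md5()
            with open(fp, 'rb') as fh:
                while True:
                    data = fh.read(10240)
                    if not data:
                        break
                    m.update(data)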
            hashdigestE = doc.createElement('hashdigest')
            hashdigestT = doc.createTextNode(str(m.hexdigest()))
            hashdigestE.appendChild(hashdigestT)
            fileobject.appendChild(hashdigestE)

            total_size = total_size + size
            asset.setAttribute('total_size', str(total_size))
            if not files:
                total_files = 0

            part = doc.createElement('dcterms:hasPart')
            part.appendChild(fileobject)
            collection.appendChild(part)


def main(dir_path, save_path):
    # Time the run from the call itself, not from when the module was imported.
    start = time.time()
    audit(dir_path)

    # Write the indented XML document to save_path.
    destfile = open(save_path, 'w')
    doc.writexml(destfile, addindent='  ', newl='\n')
    destfile.close()
    elapsed = time.time() - start
    mins = elapsed / 60
    print("I took %s minutes to run" % mins)
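
For question 4, one possible first step is to pull the repeated createElement / createTextNode / appendChild triplets into a small helper. This is only a sketch of that idea: add_text_child and stat_to_fileobject are hypothetical names, and the MD5 digest is left out to keep it short.

```python
import os
from xml.dom.minidom import Document


def add_text_child(doc, parent, tag, value):
    """Create <tag>value</tag> under parent and return the new element."""
    element = doc.createElement(tag)
    element.appendChild(doc.createTextNode(str(value)))
    parent.appendChild(element)
    return element


def stat_to_fileobject(doc, relative_name, st):
    """Build one <fileobject> element from an os.stat() result."""
    fileobject = doc.createElement('fileobject')
    add_text_child(doc, fileobject, 'filename', relative_name)
    # Same tags, in the same order, as the long-hand version above.
    # int() keeps the timestamps as whole seconds, like tuple unpacking does.
    fields = [
        ('filesize', st.st_size),
        ('inode', st.st_ino),
        ('nlink', st.st_nlink),
        ('uid', st.st_uid),
        ('gid', st.st_gid),
        ('mtime', int(st.st_mtime)),
        ('atime', int(st.st_atime)),
        ('ctime', int(st.st_ctime)),
    ]
    for tag, value in fields:
        add_text_child(doc, fileobject, tag, value)
    return fileobject
```

With something like this, the body of the inner loop would shrink to a call such as stat_to_fileobject(doc, name, os.stat(fp)) plus the hashing step, and adding a new field would mean adding one line to the list.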
