Hi. I write really terrible Python, sort of like using crayons in art class.
I have a couple of questions. First: where is the best forum for Python newbies to learn how to write better code (standards) and optimize code (make it faster)? I just joined this forum, so it may not be the right place... but maybe it is.
Second, I've got some terrible code for you to laugh at, and I have lots of questions about it. The only way to run it is to import it into IDLE and then call main('/name/of/your/directory_to_walk', '/name/of/your/directory_to_save_walk.xml'). The idea of the script is to take a directory, do some maths, and output an XML file with some info about your directory. I have a use case for it, and the reason I need to optimize it is that I have to run it on multiple terabytes.
So you're laughing already, but I have read up on how to convert this into a terminal/command-line tool, and I will do that once I get it working better. On to the questions:
1. Is running in IDLE slower/faster/the exact same as running from the terminal?
2. Is the *main* bottleneck in my program generating MD5 checksums?
3. How could I tack on something to report the files/sec or MB/sec? And how about printing an ETA for the script to complete? (I generally start the script and go get a cup of coffee, with no idea when it will complete.)
4. I understand it is better practice to break out my program into functions and classes. Right now I just get everything done under one ugly function. I *think* a good idea is to have a class for my XML structure, and maybe one for the "metadata", or maybe just one for both? What functions need to be broken out?
5. What can I fix in my code to make it run faster? What am I currently doing that could be done more easily? (Notice my wonky method of calculating directory depth, number of files/directories, etc.)
6. These folks do it better: http://www.mail-archive.com/kragen-hacks@canonical.org/msg00088.html
and http://afflib.org/software/fiwalk/dfxml_tool. I can get the second one to run from the terminal and it is great, and I may just end up using it (or calling it from my program?), but right now I am focusing on my code.
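To make question 3 concrete, here is a minimal sketch of the kind of reporting I am picturing. It is not wired into my script yet. The `ProgressReporter` class and its method names are made up for illustration, and it assumes the total byte count is already known (which would probably mean an extra pass over the tree before hashing). The ETA is naive: it just divides the remaining bytes by the average throughput so far.

```python
import time


class ProgressReporter(object):
    """Hypothetical helper: track files/bytes done and estimate time left."""

    def __init__(self, total_bytes, clock=time.time):
        # total_bytes is assumed to come from a cheap first pass over the tree.
        self.total_bytes = total_bytes
        self.clock = clock
        self.start = clock()
        self.files_done = 0
        self.bytes_done = 0

    def update(self, file_size):
        """Call once per file after it has been processed."""
        self.files_done += 1
        self.bytes_done += file_size

    def rates(self):
        """Return (files/sec, MB/sec, ETA in seconds or None if unknown)."""
        elapsed = self.clock() - self.start
        if elapsed <= 0 or self.bytes_done == 0:
            return 0.0, 0.0, None
        files_per_sec = self.files_done / float(elapsed)
        bytes_per_sec = self.bytes_done / float(elapsed)
        remaining = max(self.total_bytes - self.bytes_done, 0)
        eta = remaining / bytes_per_sec
        return files_per_sec, bytes_per_sec / (1024.0 * 1024.0), eta

    def report(self):
        """Format the current rates as a one-line status string."""
        files_per_sec, mb_per_sec, eta = self.rates()
        eta_text = "unknown" if eta is None else "%.0f s" % eta
        return "%.1f files/sec, %.2f MB/sec, ETA %s" % (
            files_per_sec, mb_per_sec, eta_text)
```

The idea would be to call `update(size)` after each file is hashed and print `report()` every few hundred files, so that printing itself does not slow things down.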
#!/usr/bin/env python
# stat and optparse were imported but never used; os.path comes with os.
import os, re, hashlib, time
from xml.dom.minidom import Document
start = time.time()
doc = Document()
def audit(path):
    total_size, total_files, total_dirs, total_depth = 0, 0, 0, 0
    droot = doc.createElement('asset_inventory')
    doc.appendChild(droot)
    droot.setAttribute('xmlns:xsi', "http://www.w3.org/2001/XMLSchema-instance")
    droot.setAttribute('xmlns:oai_dc', "http://www.openarchives.org/OAI/2.0/oai_dc/")
    droot.setAttribute('xmlns:dc', "http://purl.org/dc/elements/1.1/")
    droot.setAttribute('xmlns:dcterms', "http://purl.org/dc/terms/")
    droot.setAttribute('xmlns:dcmitype', "http://purl.org/dc/dcmitype/")
    droot.setAttribute('xmlns:premis', "info:lc/xmlns/premis-v2")
    droot.setAttribute('xmlns:cld', "http://purl.org/cld/terms/")
    droot.setAttribute('xmlns:cdtype', "http://purl.org/cld/cdtype/")
    asset = doc.createElement('asset_node')
    asset.setAttribute('relative_path', path)
    collection = doc.createElement('dcmitype:collection')
    asset.appendChild(collection)
    notes = doc.createElement('curators_notes')
    asset.appendChild(notes)
    droot.appendChild(asset)
    for root, dirs, files in os.walk(path):
        total_files += len(files)
        total_dirs += len(dirs)
        asset.setAttribute('total_files', str(total_files))
        asset.setAttribute('total_dirs', str(total_dirs))
        if not dirs:
            # Leaf directory: depth = extra '/' separators compared to the start path.
            depthCount = lambda path: path.count('/')
            rdepth = depthCount(root)
            pdepth = depthCount(path)
            token = rdepth - pdepth
            if token >= total_depth:
                total_depth = token
            asset.setAttribute('total_depth', str(total_depth))
        for f in files:
            fp = os.path.join(root, f)
            mode, ino, dev, nlink, uid, gid, size, atime, mtime, ctime = os.stat(fp)
            fileobject = doc.createElement('fileobject')
            # Escape the path so it is matched literally, and strip only the leading copy.
            filenameT = doc.createTextNode(re.sub(re.escape(path), '', fp, count=1))
            filenameE = doc.createElement('filename')
            filenameE.appendChild(filenameT)
            fileobject.appendChild(filenameE)
            filesizeE = doc.createElement('filesize')
            filesizeT = doc.createTextNode(str(size))
            filesizeE.appendChild(filesizeT)
            fileobject.appendChild(filesizeE)
            inodeE = doc.createElement('inode')
            inodeT = doc.createTextNode(str(ino))
            inodeE.appendChild(inodeT)
            fileobject.appendChild(inodeE)
            nlinkE = doc.createElement('nlink')
            nlinkT = doc.createTextNode(str(nlink))
            nlinkE.appendChild(nlinkT)
            fileobject.appendChild(nlinkE)
            uidE = doc.createElement('uid')
            uidT = doc.createTextNode(str(uid))
            uidE.appendChild(uidT)
            fileobject.appendChild(uidE)
            gidE = doc.createElement('gid')
            gidT = doc.createTextNode(str(gid))
            gidE.appendChild(gidT)
            fileobject.appendChild(gidE)
            mtimeE = doc.createElement('mtime')
            mtimeT = doc.createTextNode(str(mtime))
            mtimeE.appendChild(mtimeT)
            fileobject.appendChild(mtimeE)
            atimeE = doc.createElement('atime')
            atimeT = doc.createTextNode(str(atime))
            atimeE.appendChild(atimeT)
            fileobject.appendChild(atimeE)
            ctimeE = doc.createElement('ctime')
            ctimeT = doc.createTextNode(str(ctime))
            ctimeE.appendChild(ctimeT)
            fileobject.appendChild(ctimeE)
            # Hash the file in 10 KB chunks; the with-block closes the handle
            # (the original reused the name f and never closed the file).
            m = hashlib.md5()
            with open(fp, 'rb') as fh:
                while True:
                    data = fh.read(10240)
                    if len(data) == 0:
                        break
                    m.update(data)
            hashdigestE = doc.createElement('hashdigest')
            hashdigestT = doc.createTextNode(str(m.hexdigest()))
            hashdigestE.appendChild(hashdigestT)
            fileobject.appendChild(hashdigestE)
            total_size = total_size + size
            asset.setAttribute('total_size', str(total_size))
            # Never true here, since we are inside the loop over files.
            if not files:
                total_files = 0
            part = doc.createElement('dcterms:hasPart')
            part.appendChild(fileobject)
            collection.appendChild(part)
def main(dir_path, save_path):
    # Time this run only; the module-level start also counts time spent idle in IDLE.
    start = time.time()
    audit(dir_path)
    destfile = open(save_path, 'w')
    doc.writexml(destfile, addindent=' ', newl='\n')
    destfile.close()
    end = time.time()
    elapsed = end - start
    mins = elapsed / 60
    print("I took %s minutes to run" % mins)
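For questions 4 and 5, this is a rough sketch of what breaking the big function into smaller pieces might look like. The function names (`md5_of_file`, `relative_depth`, `add_text_child`) are my own invention, and the 1 MB chunk size is just a value to experiment with, not something I have measured to be faster than the 10 KB reads above.

```python
import hashlib
import os
from xml.dom.minidom import Document


def md5_of_file(file_path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of a file, read in fixed-size chunks."""
    digest = hashlib.md5()
    with open(file_path, 'rb') as handle:
        while True:
            chunk = handle.read(chunk_size)
            if not chunk:
                break
            digest.update(chunk)
    return digest.hexdigest()


def relative_depth(root, base):
    """Number of directory levels that root sits below base (0 if equal)."""
    rel = os.path.relpath(root, base)
    if rel == os.curdir:
        return 0
    return rel.count(os.sep) + 1


def add_text_child(doc, parent, tag, value):
    """Append <tag>value</tag> to parent and return the new element."""
    element = doc.createElement(tag)
    element.appendChild(doc.createTextNode(str(value)))
    parent.appendChild(element)
    return element
```

With something like `add_text_child`, each four-line block in the file loop (filesize, inode, nlink, and so on) would shrink to one call, e.g. `add_text_child(doc, fileobject, 'filesize', size)`, and `relative_depth` would replace the separator counting without depending on '/' being the path separator.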