Hello,
I've created a script that parses a website and downloads images from it. What I want to do now is write a script that checks the files I've downloaded against common image file type signatures and then tells me how many of the files aren't those types. I'm still relatively new to Python, but I'm enjoying the challenge and the fact that I'm learning more about it every time I code.
This is the code I use to download the files:
import re
import urllib

def downloads(webpage):
    # webpage is assumed to hold the page's HTML as a string
    images = re.findall(r'([-\w]+\.(?:jpg|gif|png|bmp|docx))', webpage)
    images.sort()
    print '[+]', str(len(images)), 'Files Found:'
    for image in images:
        url = 'http://dogimagesite.com/' + image
        folder = 'C:\\temp\\downloads\\' + image
        print url
        try:
            urllib.urlretrieve(url, folder)
        except Exception:
            # Double quotes, because the message contains an apostrophe
            print "Didn't download properly " + image
        print image
And this is (so far) the code I have to check against common signatures. I'm only checking for jpg to begin with, to keep it (fairly) simple for myself. Any help you could give to expand this would be amazing! :)
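import binascii
import sys

# Signature bytes mapped to (description, extension)
file_sigs = {'\xFF\xD8\xFF': ('JPEG', 'jpg')}

def readFile(filename):
    # Binary mode, so the signature bytes are read exactly as stored
    fh = open(filename, 'rb')
    file_sig = fh.read(4)  # the JPEG signature above is the first 3 of these
    fh.close()
    print '[*] readFile() File:', filename  #, 'Hash Sig:', binascii.hexlify(file_sig)
    return file_sig

def main():
    if len(sys.argv) != 2:
        print 'usage: file_sig filename'
        sys.exit(1)
    else:
        file_hashsig = readFile(sys.argv[1])

if __name__ == '__main__':
    main()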
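The check described above (compare each downloaded file's leading bytes against known signatures, then count the files that match none) might be sketched roughly as follows. This is only a minimal sketch under some assumptions: the names FILE_SIGS, identify and count_unrecognised are made up for illustration, the signature table covers just JPEG, PNG, GIF and BMP, and the code avoids print so it should run on both Python 2.6+ and Python 3.

```python
import os

# Leading bytes ("magic numbers") of a few common image formats,
# mapped to (description, extension). Illustrative, not exhaustive.
FILE_SIGS = {
    b'\xFF\xD8\xFF': ('JPEG', 'jpg'),
    b'\x89PNG\r\n\x1a\n': ('PNG', 'png'),
    b'GIF87a': ('GIF', 'gif'),
    b'GIF89a': ('GIF', 'gif'),
    b'BM': ('BMP', 'bmp'),
}

# Read only as many bytes as the longest signature needs.
MAX_SIG_LEN = max(len(sig) for sig in FILE_SIGS)


def identify(header):
    """Return (description, extension) if header starts with a known
    signature, otherwise None."""
    for sig, info in FILE_SIGS.items():
        if header.startswith(sig):
            return info
    return None


def count_unrecognised(folder):
    """Return (total_files, unrecognised_names) for the files in folder."""
    total = 0
    unrecognised = []
    for name in sorted(os.listdir(folder)):
        path = os.path.join(folder, name)
        if not os.path.isfile(path):
            continue  # skip sub-directories
        total += 1
        # Binary mode, so the bytes are read exactly as stored on disk.
        with open(path, 'rb') as fh:
            header = fh.read(MAX_SIG_LEN)
        if identify(header) is None:
            unrecognised.append(name)
    return total, unrecognised
```

Calling count_unrecognised(r'C:\temp\downloads') would then give the number of files checked plus a list of the ones whose contents match none of the signatures, so len() of that list is the count of files that aren't those types. Note that the check looks at file contents only, so a .docx saved by the download script (or a file renamed to .jpg) would show up as unrecognised regardless of its extension.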
Jade