Hello,

I've created a script that parses and downloads images from a website. Now I want to write a script that checks the downloaded files against common image file type signatures and tells me how many of the files aren't those types. I'm still relatively new to Python, but I'm enjoying the challenge and the fact that I'm learning more about it every time I code.

This is the code I use to download the files;

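import re
import urllib

def downloads(webpage):
    # webpage is assumed to be the page's HTML source, already read into a string
    images = re.findall(r'([-\w]+\.(?:jpg|gif|png|bmp|docx))', webpage)
    images.sort()
    print '[+]', str(len(images)), 'Files Found:'

    for image in images:
        url = 'http://dogimagesite.com/' + image
        folder = 'C:\\temp\\downloads\\' + image
        print url
        try:
            urllib.urlretrieve(url, folder)
        except IOError:
            # double quotes, because the message itself contains an apostrophe
            print "Didn't download properly " + image
        print image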

And this is the code I have so far to check against common signatures. I just want to check for jpg to begin with, to keep it fairly simple for myself. Any help you could give to expand this would be amazing! :)

import sys
import binascii

file_sigs = {'\xFF\xD8\xFF': ('JPEG', 'jpg')}

def readFile(filename):
    fh = open(filename, 'rb')   # binary mode, so Windows doesn't translate any bytes
    file_sig = fh.read(3)       # the JPEG signature above is 3 bytes long
    fh.close()
    print '[*] readFile() File:', filename, 'Hash Sig:', binascii.hexlify(file_sig)
    return file_sig

def main():
    if len(sys.argv) != 2:
        print 'usage: file_sig filename'
        sys.exit(1)
    else:
        file_hashsig = readFile(sys.argv[1])

if __name__ == '__main__':
    main()

Jade

There is no reason to reinvent the wheel. You can use the identify program from ImageMagick to do that for you.


On Linux, the file command can help you too.

Fixing code.

Also note that searching PyPI for the word 'magic' (without a k) returns many results, notably Python implementations of 'file' and Python bindings to the libmagic library. Packages like python-magic apparently even work on Windows with the GnuWin32 package.

Of course, all this may be overkill for your program.
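
If you would rather stay with the standard library, here is a minimal sketch of the signature check itself. It is written for Python 3 (bytes literals, print as a function), and the names IMAGE_SIGNATURES, identify and count_non_images are made up for this example. The signature table is only a small illustrative sample, not a complete list.

```python
import os

# Leading "magic" bytes of a few common image formats (illustrative, not exhaustive).
IMAGE_SIGNATURES = {
    b'\xff\xd8\xff': 'JPEG',
    b'GIF87a': 'GIF',
    b'GIF89a': 'GIF',
    b'\x89PNG\r\n\x1a\n': 'PNG',
    b'BM': 'BMP',
}

def identify(header):
    """Return the format name for the leading bytes, or None if unknown."""
    for signature, name in IMAGE_SIGNATURES.items():
        if header.startswith(signature):
            return name
    return None

def count_non_images(folder):
    """Print one line per file and return how many match no signature."""
    unknown = 0
    for item in sorted(os.listdir(folder)):
        path = os.path.join(folder, item)
        if not os.path.isfile(path):
            continue  # skip sub-folders
        with open(path, 'rb') as fh:   # binary mode matters on Windows
            header = fh.read(8)        # the longest signature above is 8 bytes
        name = identify(header)
        if name is None:
            unknown += 1
            print(item + ' is not a common image file')
        else:
            print(item + ' is a ' + name + ' file')
    return unknown
```

Using startswith on the raw bytes lets signatures of different lengths live in the same table, which a fixed read(3) plus dictionary lookup cannot do. Calling count_non_images on the download folder would give the number of files that aren't one of the listed types.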

import sys, os, binascii

def readfile():
    dictionary = {'474946': ('GIF', 'gif'), 'ffd8ff': ('JPEG', 'jpeg')}
    try:
        files = os.listdir('C:\\Temp\\downloads')
        for item in files:
            # open in binary mode so the signature bytes are read unchanged
            file = open('C:\\Temp\\downloads\\' + item, 'rb')
            file_sig = file.read(3)
            file.close()
            file_sig_hex = binascii.hexlify(file_sig)
            if file_sig_hex in dictionary:
                # print the hex form; the raw bytes are not readable text
                print item + ' is a common image file, it is a ' + file_sig_hex
            else:
                print item + ' is not a common image file, it is ' + file_sig_hex
            print file_sig_hex
    except:
        print 'Error. Try again'

As of right now my script prints out 'Error. Try again', but when I comment out this part of the code:

if file_sig_hex in dictionary:
    print item + ' is a common image file' + file_sig
else:
    print item + ' is not a common image file, it is' + file_sig

print file_sig_hex

outputs;

ffd8ff
ffd8ff
ffd8ff
504b03
3c2144
474946
ffd8ff
ffd8ff
ffd8ff
ffd8ff
ffd8ff
ffd8ff

With the try/except removed, I get

poodle.jpg is a image file, it is a ffd8ff

and I don't receive any errors when I run the program, which tells me that most of my code works.

But with try/except and this code;

if file_sig_hex in dictionary:
    print item + ' is a image file' + file_sig
else:
    print item + ' is not a common image file, it is' + file_sig

I get nothing printed apart from 'Error try again' which comes from here;

except:
    print 'Error. Try again'

It also picks up .bmp files without telling me that they're not common image files.

And with this code, without the try/except, I get:

poodle.jpg is a common image file, it is a ffd8ff 

What I would like my output to be is;

poodle.bmp is not a common image file, it is xxxxxx
poodle.jpg is a common image file, it is a ffd8ff

But I want all of this to work with the try/except in place.

Does anyone know why it isn't currently working with the try/except? I get no errors when I try without it and no errors with it.

The best thing to do is to re-raise the exception to see what happened:

except:
    print 'Error. Try again'
    raise

The drawback of a bare except is that it hides exceptions!
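
Once you know what is actually being raised, one option is to catch only that kind of error and print its message. Below is a minimal sketch, assuming the failures are I/O errors (for example, os.listdir returning a sub-folder that open() cannot read). It is written for Python 3, and read_signature is a hypothetical helper name, not something from your script.

```python
import binascii

def read_signature(path, length=3):
    """Return the first `length` bytes of the file as hex text, or None on an I/O error."""
    try:
        with open(path, 'rb') as fh:
            return binascii.hexlify(fh.read(length)).decode('ascii')
    except (IOError, OSError) as err:
        # Only I/O problems are caught here, and the reason is printed, not hidden.
        print('Could not read ' + path + ': ' + str(err))
        return None
```

Any other exception (a typo in a variable name, say) still produces a normal traceback, so real bugs stay visible.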

Thank you. I was forgetting about 'raise'.
