Good day.
I wanted to share the code for an application I've created. It reads the number of pages of every PDF file in one directory.
My question is this: does it need more optimisation? Can I make it run faster? It is already pretty fast, but I feel it could do better.
Here is the Python version:
"""
This module contains a function to count
the total pages for all PDF files in one directory.
"""
#from time import clock as __c #Used for benchmark.
from glob import glob as __g
from re import search as __s
def count( vPath ):
    """
    Takes one argument: the path where you want to search the files.
    Returns a dictionary with the file name and number of pages for each file.
    """
    #
    #cdef double ti = __c() #Used for benchmark.
    #
    vPDFfiles = __g( vPath + "\\" + '*.pdf' )
    vPages = 0
    vMsg = {}
    #
    for vPDFfile in vPDFfiles:
        vFileOpen = open( vPDFfile, 'rb', 1 )
        for vLine in vFileOpen.readlines():
            if "/Count " in vLine:
                vPages = int( __s("/Count \d*", vLine).group()[7:] )
                vMsg[vPDFfile] = vPages
        vFileOpen.close()
    #
    #cdef double tf = __c() #Used for benchmark.
    #
    #print tf-ti
    return vMsg
#
I also wrote a Cython version:
"""
This module contains a function to count
the total pages for all PDF files in one directory.
"""
#from time import clock as __c
from glob import glob as __g
from re import search as __s
cdef dict __count( char *vPath ):
    #
    cdef list vPDFfiles = __g( vPath + "\\" + '*.pdf' )
    cdef int vPages = 0
    cdef dict vMsg = {}
    #
    for vPDFfile in vPDFfiles:
        vFileOpen = open( vPDFfile, 'rb', 1 )
        for vLine in vFileOpen.readlines():
            if "/Count " in vLine:
                vPages = int( __s("/Count \d*", vLine).group()[7:] )
                vMsg[vPDFfile] = vPages
        vFileOpen.close()
    #
    return vMsg
#
def count( vPath ):
    """
    Takes one argument: the path where you want to search the files.
    Returns a dictionary with the file name and number of pages for each file.
    """
    #cdef double ti = __c()
    cdef dict v = __count( vPath )
    #cdef double tf = __c()
    #print tf-ti #Disabled: ti and tf are only defined when the benchmark lines above are enabled.
    return v
#
Both are called the same way: count( 'C:\\Path_to_your_PDF_files' ). Use double backslashes in the path so that Python does not treat a backslash as the start of an escape sequence. The function returns a dictionary with each PDF's path (as returned by glob) as the key and its number of pages as the value.
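To make the call and the shape of the result concrete, here is a minimal, self-contained sketch. It is a Python 3 port of the same "/Count" scan, not the exact module above. The names `count_pages` and `demo`, the use of `os.path.join`, the stricter `\d+` pattern, and the fake file contents are all assumptions made for illustration.

```python
import os
import re
import tempfile
from glob import glob

# Matches "/Count <digits>" in raw PDF bytes (the same heuristic as above,
# but requiring at least one digit so int() never sees an empty string).
COUNT_RE = re.compile(rb"/Count (\d+)")


def count_pages(path):
    """Return {file path: page count} for every *.pdf directly in `path`."""
    result = {}
    for pdf_file in glob(os.path.join(path, "*.pdf")):
        with open(pdf_file, "rb") as handle:
            for line in handle:
                match = COUNT_RE.search(line)
                if match:
                    # As in the original, the last "/Count" seen in a file wins.
                    result[pdf_file] = int(match.group(1))
    return result


def demo():
    """Create a throwaway directory with one fake PDF-like file and count it."""
    with tempfile.TemporaryDirectory() as tmp:
        fake = os.path.join(tmp, "sample.pdf")
        with open(fake, "wb") as handle:
            # Not a valid PDF, just enough bytes to exercise the "/Count" scan.
            handle.write(b"%PDF-1.4\n<< /Type /Pages /Count 3 >>\n")
        pages = count_pages(tmp)
        # Strip the temporary directory so the result is easy to read.
        return {os.path.basename(name): n for name, n in pages.items()}


if __name__ == "__main__":
    print(demo())
```

In this sketch the keys of `count_pages` are the full paths produced by `glob`, and `demo` reduces them to bare file names only for display. Files in which no "/Count" entry is found are simply absent from the dictionary.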
So... does anyone find something that could be optimised?