Find any of several words in a file?

Question

zoro007 0 Newbie Poster

15 Years Ago

Hello,

I want to know how to find any of several words in a file.
For example, to find one word:

if "word" in open(file).read():
  print file

I want to do the same thing with 5 words [ word1 , word2 , word3 , word4 , word5 ]:

if word1 or word2 or word3 ... is found, do something.

That is, search the file for all the words I want, and if any of them is found in the file, print it or do something.

Should I use a list or a tuple, and how?

Thanks

python

3 Contributors
13 Replies
164 Views
1 Day Discussion Span
Latest Post 15 Years Ago Latest Post by sebcbien

Gribouillis 1,391 Programming Explorer

15 Years Ago

You can use the re module. See http://docs.python.org/library/re.html#module-re . Here is an example

import re

word_set = set(["give", "me", "bacon", "and", "eggs", "said", "the", "other", "man"])
word_re = re.compile(r"\w+")

text = """
Draco chewed furiously at his toast, trying in vain to shut out the strong smell of bacon and eggs, that wafted from Crabbe's and Goyle's overfilled plates. In his view, it was a barbaric way to start the day, and not even the threat of being locked up with wild hippogriffs could make him eat a breakfast that rich and greasy. He glared across the hall, scowling as he watched Harry shovel his eggs into his mouth, his eyes fixed on his plate. """

for match in word_re.finditer(text):
    word = match.group(0)
    if word in word_set:
        print word,

""" my output --->
the bacon and eggs and the and the and the eggs
"""
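For comparison, here is a minimal sketch of the plain substring test from the question, extended to several words with any(). It is written in Python 3 syntax (the code in this thread is Python 2), and the names contains_any and words_found are hypothetical helpers, not part of any library. Unlike the regex version above, which compares whole tokens, this one matches substrings, so "eggs" is also found inside "some_eggs".

```python
def contains_any(text, words):
    """Return True if at least one of the words occurs anywhere in text."""
    return any(word in text for word in words)


def words_found(text, words):
    """Return the words that occur in text, in the order they were given."""
    return [word for word in words if word in text]


sample = "the strong smell of bacon and eggs"
wanted = ("bacon", "toast", "eggs")  # a list or a tuple both work here

print(contains_any(sample, wanted))
print(words_found(sample, wanted))
```

Either a list or a tuple is fine for the words; a set is only an advantage when testing whole tokens for membership, as in the regex example.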

Gribouillis 1,391 Programming Explorer

15 Years Ago

zoro007 wrote:
Thanks for your help, but I see that the result is not accurate: a word is only found when it stands alone. For example, "eggs" on its own is found, but "some_eggs" is not.

This is because \w also matches the underscore, so some_eggs is read as a single word. If your words don't contain underscores '_', you can replace line 4 by

word_re = re.compile(r"[A-Za-z]+")

or

word_re = re.compile(r"[A-Za-z0-9]+")

to include words with digits. You can also post your words in this thread... and also learn more about regular expressions :)
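A small sketch of how the two patterns discussed above split the same text, in Python 3 syntax. The sample string is made up for illustration from words mentioned in this thread. Because \w includes the underscore, r"\w+" keeps "some_eggs" as one token, while r"[A-Za-z]+" breaks it at the underscore.

```python
import re

text = "some_eggs paywith_nochex eggs"

# \w matches letters, digits and the underscore.
tokens_w = re.findall(r"\w+", text)

# Letters only: the underscore now separates tokens.
tokens_alpha = re.findall(r"[A-Za-z]+", text)

print(tokens_w)      # ['some_eggs', 'paywith_nochex', 'eggs']
print(tokens_alpha)  # ['some', 'eggs', 'paywith', 'nochex', 'eggs']
```

With the second pattern, a search set containing "eggs" or "nochex" matches the parts of the compound names as well.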


Gribouillis 1,391 Programming Explorer

15 Years Ago

You could try

Word_set = set(['nochex','lang','test2'])
Word_re = re.compile(r"[A-Za-z0-9]+") # better put this out of the loop
for (path, dirs, files) in os.walk("/home/cache"):
    for file in files:
        filename  = os.path.join(path, file)
        file_size = os.path.getsize(filename)
        READ    = open(filename).read()
        for match in Word_re.finditer(READ):
            word = match.group(0)
            if word in Word_set:
                print filename
                break
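The same loop can be sketched as a reusable generator in Python 3 syntax. This is only one possible arrangement, not the code used in the thread: files_containing is a hypothetical name, the files are opened with errors="ignore" on the assumption that they are mostly text, and any() plays the role of the break by stopping at the first matching token.

```python
import os
import re

WORD_RE = re.compile(r"[A-Za-z0-9]+")


def files_containing(root, words):
    """Yield each file path under root once if it contains any of words."""
    wanted = set(words)
    for path, dirs, files in os.walk(root):
        for name in files:
            filename = os.path.join(path, name)
            try:
                # The with-block closes the file even if reading fails.
                with open(filename, errors="ignore") as handle:
                    text = handle.read()
            except OSError:
                continue  # unreadable file: skip it
            # any() stops at the first token found in the set.
            if any(m.group(0) in wanted for m in WORD_RE.finditer(text)):
                yield filename


if __name__ == "__main__":
    for filename in files_containing("/home/cache", ["nochex", "lang", "test2"]):
        print(filename)
```

Since each file is tested once and yielded at most once, a path cannot be printed twice however many times a word occurs in it.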


Gribouillis 1,391 Programming Explorer

15 Years Ago

zoro007 wrote:
Unfortunately it is still not working, and the result with word_re = re.compile("\w+") is better than with Word_re = re.compile(r"[A-Za-z0-9]+").
Anyway, I will try to resolve that; I don't want to tire you out :)
And really, thanks for your help.
One last thing: I would like your help with unique results, if you have any idea.
When a file contains the same word many times, this is what I get:
/home/cache/lang_subscriptions.php
/home/cache/lang_subscriptions.php
/home/cache/lang_subscriptions.php
I want a way to print each file only once.

I don't understand. The break statement at line 12 above ensures that each filename is printed only once. If it doesn't work, you should be able to show where it fails. Can you write an example file where it fails?
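The effect of the break statement can be reproduced without touching the file system. In this Python 3 sketch, report is a hypothetical helper that collects what the inner loop would print for a single file, with and without stopping at the first match; the sample text is made up.

```python
import re

WORD_RE = re.compile(r"[A-Za-z0-9]+")
WORDS = {"nochex", "lang", "test2"}


def report(filename, text, stop_at_first):
    """Return the lines the inner loop would print for one file."""
    printed = []
    for match in WORD_RE.finditer(text):
        if match.group(0) in WORDS:
            printed.append(filename)
            if stop_at_first:
                break  # leave the token loop after the first hit
    return printed


text = "lang lang lang"
name = "/home/cache/lang_subscriptions.php"
print(len(report(name, text, stop_at_first=False)))  # 3
print(len(report(name, text, stop_at_first=True)))   # 1
```

Without the break, the filename is reported once per matching token, which is the repeated output described in the quoted post.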



zoro007 0 Newbie Poster · Answer 1 · 2010-04-11T22:35:19+00:00

Thanks for your help, but I see that the result is not accurate: a word is only found when it stands alone.

For example, "eggs" on its own is found, but "some_eggs" is not.

zoro007 0 Newbie Poster · Answer 2 · 2010-04-11T23:06:01+00:00

It is still not working after using "word_re = re.compile(r"[A-Za-z]+")" and "word_re = re.compile(r"[A-Za-z0-9]+")".

I put the word "nochex" in my search, and the file is not found, even though it contains paywith_nochex.

If I search for the full word paywith_nochex, the file is found.

Something else: when I print the file path and the file contains the same word several times, for example egg 3 times, the result is repeated 3 times.
I want some way to show each result only once.

Gribouillis 1,391 Programming Explorer Team Colleague · Answer 3 · 2010-04-11T23:29:47+00:00

Well, the paywith_nochex should have worked. Perhaps you could post your code and attach the file where you are searching words ?

zoro007 0 Newbie Poster · Answer 4 · 2010-04-11T23:51:47+00:00

This is my code:

Word_set = set(['nochex','lang','test2'])
for (path, dirs, files) in os.walk("/home/cache"):
        for file in files:
                filename  = os.path.join(path, file)
                file_size = os.path.getsize(filename)
                READ    = open(filename).read()
                Word_re = re.compile("\w+")
                for match in Word_re.finditer(READ):
                       word = match.group(0)
                       if word in Word_set:
                             print filename

And you didn't answer my question about unique results!

And this is what the file lang_subscriptions.php contains:

'paywith_nochex' => "Pay with NOCHEX now",
'paywith_paypal' => "Make payments with PayPal - it's fast, free and secure!",
'paywith_protx' => "Pay Using Protx",
'paywith_gen' => "Purchase",

zoro007 0 Newbie Poster · Answer 5 · 2010-04-12T00:53:54+00:00

Unfortunately it is still not working, and the result with word_re = re.compile("\w+") is better than with Word_re = re.compile(r"[A-Za-z0-9]+").

Anyway, I will try to resolve that; I don't want to tire you out :)
And really, thanks for your help.

One last thing: I would like your help with unique results, if you have any idea.

When a file contains the same word many times, this is what I get:

/home/cache/lang_subscriptions.php
/home/cache/lang_subscriptions.php
/home/cache/lang_subscriptions.php

I want a way to print each file only once.

zoro007 0 Newbie Poster · Answer 6 · 2010-04-12T01:48:40+00:00

Yes, I'm sorry, I forgot to put in the break statement.

Thanks again so much, Gribouillis.

sebcbien 0 Newbie Poster · Answer 7 · 2010-04-12T14:50:54+00:00

You can actually use this program :

# -*- coding: iso-8859-1 -*-

import sys
import re
import os
printcontent = ''
answer = 'y'
creNotAlpha = re.compile(r'\W')
    
def Walk( root, recurse=0, pattern='*', return_folders=0 ): # return a list of javafiles contained in root
    import fnmatch, os, string
    
    # initialize
    result = []

    # must have at least root folder
    #~ try:
        #~ names = os.listdir(root)
    #~ except os.error:
        #~ print "exception"
        #~ print os.error
        #~ return result
    #~ print 'got past the exception'
    names = os.listdir(root)
    print 'got past the exception'
    # expand pattern
    pattern = pattern or '*'
    pat_list = pattern.split(';')
    
    # check each file
    for name in names:
        fullname = os.path.normpath(os.path.join(root, name))

        # grab if it matches our pattern and entry type
        for pat in pat_list:
            pat = pat.strip()
            if fnmatch.fnmatch(name, pat):
                if os.path.isfile(fullname) or (return_folders and os.path.isdir(fullname)):
                    result.append(fullname)
                continue
        print 'line 35'
        # recursively scan other folders, appending results
        if recurse:
            if os.path.isdir(fullname) and not os.path.islink(fullname):
                result = result + Walk( fullname, recurse, pattern, return_folders )
    return result

def upFolder(folderName): 
    newFname = folderName.replace('\\','/')
    stop = newFname.rfind('/')
    return newFname[:stop]
    
def listsummary(fname, flines, dataLog):

    dataLog += [[fname, flines]]
    return
        
def logInFile(dataLog,mots, bool):
    
    for mot in mots:
        mot = creNotAlpha.sub('',mot)
        fname = '%s.txt' % mot    
        f = open(fname, mode = 'w+')    
        dataLog.sort()
        for fn in dataLog:
            f.write('\n')
            f.writelines(fn[0])
            f.write('\n')
            if bool:
                for fls in fn[1:]:
                    for fl in fls:
                        f.writelines(fl)
                        f.write('\n')
        f.close()
    return


def display(dataLog,bool):
    
    dataLog.sort()
    for fn in dataLog:
        print fn[0]
        if bool:
            for fls in fn[1:]:
                for fl in fls:
                    print fl

    
    return
    
def procfile(fname, creMots):
    f = open(fname)
    print fname
    lines = f.readlines()
    f.close()
    flines = []
    lineNb = 0
    fwarnCount = 0
    for line in lines: 
        lineNb +=1
        for creMot in creMots:
            if creMot.match(line.lower()) is not None:
                flines.append("%d: %s" % (lineNb, line.strip()))
                fwarnCount += 1
                
    return (flines, fwarnCount)
    
def getWord(__isreg):
    creMots = []
    motsplit = []
    
    mots = raw_input("Which word are you looking for ? (for all words leave it empty) ")
    if ';' in mots:
        motsplit = mots.split(';')
        for i in range( len(motsplit)):
            motsplit[i] = motsplit[i].strip()
            if __isreg == 0:
                motsplit[i]  = motsplit[i].replace('(', '\(')
                motsplit[i]  = motsplit[i].replace(')', '\)')
                motsplit[i]  = motsplit[i].replace('[', '\[')
                motsplit[i]  = motsplit[i].replace(']', '\]')
                motsplit[i]  = motsplit[i].replace('?', '\?')
                motsplit[i]  = motsplit[i].replace('!', '\!')
                motsplit[i]  = motsplit[i].replace('+', '\+')
                motsplit[i]  = motsplit[i].replace('^', '\^')
                motsplit[i]  = motsplit[i].replace('*', '\*')
                motsplit[i]  = motsplit[i].replace('$', '\$')
                motsplit[i]  = motsplit[i].replace('=', '\=')
                motsplit[i]  = motsplit[i].replace('|', '\|')
                motsplit[i]  = motsplit[i].replace('<', '\<')
            motsplit[i]  = motsplit[i] .lower()
            try:
                creMots.append( re.compile(r'.*%s.*' % motsplit[i]))
            except re.error:
                if __isreg == 0: print 'there is a syntax error in : %s, please remove "\\" replace them by "."' % motsplit[i]
                else: print 'check the syntax of the regular expression'
                return ([],motsplit)
                #~ getWord(__isreg)
    else:
        mots = mots.lower()
        if __isreg == 0:
            mots  = mots.replace('(', '\(')
            mots  = mots.replace(')', '\)')
            mots  = mots.replace('[', '\[')
            mots  = mots.replace(']', '\]')
            mots  = mots.replace('?', '\?')
            mots  = mots.replace('!', '\!')
            mots  = mots.replace('+', '\+')
            mots  = mots.replace('^', '\^')
            mots  = mots.replace('*', '\*')
            mots  = mots.replace('$', '\$')
            mots  = mots.replace('=', '\=')
            mots  = mots.replace('|', '\|')
            mots  = mots.replace('<', '\<')
            
        motsplit.append(mots)
        try:
            creMots.append(re.compile(r'.*%s.*' % motsplit[0]))
        except re.error:
            if __isreg == 0: print 'there is a syntax error in : %s please remove "\\" replace them by "."' % mots
            else: print 'check the syntax of the regular expression'
            return ([],motsplit)
            #~ getWord(__isreg)


    return (creMots,motsplit)
    
def runStaticSeeking(creMots,mots,paths,ftype,bool):
    printcontent = ''
    warningFiles = 0
    dataLog = []
    warnCount = 0
    flines = []
    fnames = []
    paths = paths.split(';')
    
    
    for path in paths:
        path = path.strip()
        fnames = Walk(r'%s' % path,1,'%s' % ftype,0)
        print fnames
        for fname in fnames:
            (flines, fwarnCount) = (procfile(fname, creMots))
            if flines:
                listsummary(fname,flines,dataLog)
    if (len(sys.argv))>=2:
        for args in sys.argv :
            if args == '-log':
                logInFile(dataLog,mots,bool)
    display(dataLog,bool)
    
    return warnCount 


if __name__ == '__main__':
    
    creMots = []
    __isreg = 0
    if len(sys.argv)>=2:
        for arg in sys.argv :
            if arg  == '-help':
                print '''to log results in file launch Search_in_files.py -log
if you want to use regular expressions please add -regular 
if regular is set, please protect .!?*^<() with a backslash if it is not part of the regular expression (i.e. "\^" to find "^" in the text)
if regular expression is not set do not protect them ( already done in the code )!
you can search for several words if they are separated by a ;
you can use several extensions separated by a ; i.e. *.py;*.cpp;*.txt ...
you can look for the word in several paths separated by a ;
You can swap the use of regular expressions if you replace the word by set __isreg 1 or set __isreg 0 (enable/disable use of regular expressions)
'''
                sys.exit(1)
            elif arg  == '-regular':
                __isreg = 1

    bool = 0
    printcontent = ''
    while answer == 'y' :
        while creMots == []:
            (creMots,mots) = getWord(__isreg)
            while mots[0] in ['set __isreg 0','set __isreg 1']:
                if mots[0] == 'set __isreg 0':
                    __isreg=0
                    (creMots,mots) = getWord(__isreg)
                    print
                elif mots[0] == 'set __isreg 1':
                    __isreg=1
                    (creMots,mots) = getWord(__isreg)
                    print
            
        paths = raw_input("enter path(s) : ")
        ftype = raw_input("enter the extension (i.e. *.txt;*.java ) : ")
        printcontent = ''
        while printcontent != 'y' and printcontent != 'n':
            printcontent = raw_input('do you want to display lines in files ? y/n ')
            print
            
            if (printcontent == 'y') :
                bool = 1
        
        runStaticSeeking(creMots,mots,paths, ftype,bool)
        creMots = []
        mots = []
        paths = []
        answer = raw_input('do you want to make an other search? y/n ')
        
        while( not (answer in ['y', 'n'])):
            answer = raw_input()
            
    sys.exit(0)

Part of the code, the "Walk" function, was found on activestate.com.

I hope it helps!
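The long chains of replace() calls in getWord can be sketched more compactly with re.escape from the standard library. The Python 3 version below is an illustration under stated assumptions, not a drop-in replacement: compile_words and matching_lines are hypothetical names, re.escape also escapes "." (which the program above leaves as a wildcard), and each matching line is reported once even when several words match it.

```python
import re


def compile_words(words, is_regex=False):
    """Compile one case-insensitive pattern per word; escape literal words."""
    patterns = []
    for word in words:
        word = word.strip()
        # re.escape backslash-escapes every regex metacharacter at once.
        source = word if is_regex else re.escape(word)
        patterns.append(re.compile(source, re.IGNORECASE))
    return patterns


def matching_lines(lines, patterns):
    """Return 'number: line' strings for lines matched by any pattern."""
    hits = []
    for number, line in enumerate(lines, start=1):
        # search() looks anywhere in the line, so no '.*' padding is needed.
        if any(p.search(line) for p in patterns):
            hits.append("%d: %s" % (number, line.strip()))
    return hits


patterns = compile_words("a+b; (x)".split(";"))
print(matching_lines(["A+B here", "aab", "call (X)"], patterns))
```

Using re.IGNORECASE avoids lowering both the word and every line, and search() replaces the '.*%s.*' wrapper that match() required.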

Gribouillis 1,391 Programming Explorer Team Colleague · Answer 8 · 2010-04-12T14:58:47+00:00

@sebcbien
I think you might be interested in the module grin http://pypi.python.org/pypi/grin which is a 'grep' utility written in python.

sebcbien 0 Newbie Poster · Answer 9 · 2010-04-12T16:06:13+00:00

Gribouillis wrote:
@sebcbien
I think you might be interested in the module grin http://pypi.python.org/pypi/grin which is a 'grep' utility written in python.

I wasn't aware of this package; it seems to fit the purpose of my Walk function. I'll take a closer look at it for my future Python programs.
Thanks for the information!