Help with some text file parsing

Question

klabak85 0 Newbie Poster

15 Years Ago

Hi all,

I'm a completely new user of Python (and by new I mean I started using it yesterday), but I've been programming for a couple of years now in other languages such as Java. I have a question to ask you all.

So what I want to do is read through a text file containing code and extract only the comments from the document (.txt file). I'll then take those comments and put them in another text file. The comments look something like this:

//*************
Some details about the script / method
//*************

I've figured out how to find each line containing //*** (which I've used for my search query) but I'm not sure how to go about getting the text between the two lines. Please see my code below. I don't know if using readlines() along with lists is the proper way to go about doing this.

fileName = raw_input("Enter the file name you want to read (name.txt): ")
file1 = open(fileName, "r")
print "Name of file: ", file1.name
print "Reading file..."

L = file1.readlines()

comments = []
i = -1
for line in L:
    if "//***" in line:
        i = L.index(line, i+1)
        print "Comment line found at: ", i
        comments.append(i)

As you can see I'm trying to keep track of the lines in another list but it's turning out to be kinda difficult to go back and then extract the text itself. Should I keep going with this method or is there an easier way? Any help would be greatly appreciated. Thanks.

file-system python


sneekula 969 Nearly a Posting Maven

15 Years Ago

Some simple changes will do:

comments = []
inComment = False
for i, line in enumerate(open(fileName)):  # fileName as set in the question's code
    if "//***" in line:
        inComment = not inComment
    elif inComment:  # changed to elif
        print( "comment found at line %d" %i )  # test
        # optionally strip trailing '\n' char
        line = line.rstrip()
        comments.append((i, line)) 

print( comments )

for comment in comments:
    print( comment[1] )
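
For anyone reading this with Python 3, here is a minimal sketch of the whole task from the question: collect the text between marker lines, then write it to a second file. The function names extract_comments and save_comments, the default marker, and the sample lines are assumptions made for illustration; they are not code from this thread.

```python
def extract_comments(lines, marker="//***"):
    """Return (line_number, text) pairs for lines between marker lines."""
    comments = []
    in_comment = False
    for i, line in enumerate(lines):  # i is 0-based
        if marker in line:
            # a marker line opens or closes a comment block
            in_comment = not in_comment
        elif in_comment:
            comments.append((i, line.rstrip("\n")))
    return comments


def save_comments(comments, out_path):
    """Write only the comment text (not the line numbers) to out_path."""
    with open(out_path, "w") as out:
        for _, text in comments:
            out.write(text + "\n")


# hypothetical sample input standing in for file1.readlines()
sample = [
    "//*************\n",
    "Some details about the script / method\n",
    "//*************\n",
    "int x = 1;\n",
]
found = extract_comments(sample)
print(found)
```

Because extract_comments accepts any iterable of lines, an open file object can be passed in directly instead of the sample list.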


shadwickman 159 Posting Pro in Training

15 Years Ago

Sneekula's correct. Each item stored in the comments list is a tuple: (lineNumber, lineText). So you just need to access the right index of the tuple, the same way you would access an item in a list (or an array in Java): use square brackets.

# assume comments was built using the code above...
item = comments[0]
# item is now a tuple of (lineNumber, lineText)

text = item[1]
# text is now the string stored at index 1 (the second element) of item

lineNumber, text = item
# demonstrates tuple (or list) unpacking: the first value of item is
# assigned to lineNumber, the second to text, and so on. The number of
# names on the left must match the number of values in item.
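
The same unpacking also works directly in the for statement, which avoids indexing with [1] altogether. A small sketch, assuming a comments list shaped like the one built earlier (the two entries are made-up sample data):

```python
# made-up sample data in the same (lineNumber, lineText) shape
comments = [(1, "first comment line"), (5, "second comment line")]

# unpack each tuple as the loop visits it
for line_number, text in comments:
    print("%d: %s" % (line_number, text))

# keep only the text; "_" is a conventional name for a value we ignore
texts = [text for _, text in comments]
print(texts)
```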

shadwickman 159 Posting Pro in Training

15 Years Ago

Python doesn't need you to explicitly declare variables the way languages like C/C++ and Java do. By writing for comment in comments: you create a new variable called 'comment' that is set to each element of 'comments' in turn as the loop iterates. Lists and tuples in Python also don't need to be declared as holding a single datatype; a list can contain a string at one index, then a number, then a class instance, another list, etc. That means 'comment' in the for loop will refer to each element in turn, whatever datatype that element happens to be (here, a tuple). Sorry for the mildly confusing explanation!

Also, the %d is an example of string formatting, the same as they have in C/C++ (is it in Java too?). Regardless, Python does not convert types for you automatically here, meaning that print "comment found at line " + i will not work because you are joining a string with an integer. You could use print "comment found at line " + str(i), which converts 'i' to a string that you can append to the rest of the line because it's also a string. The %d is just the way to format an integer via string formatting, similar to printf in C. Here's a short explanation of string formatting in Python:
http://diveintopython.org/native_data_types/formatting_strings.html
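
To make the difference concrete, here is a short sketch written for Python 3 (the value of i is an arbitrary example). It shows the failed concatenation next to the two working alternatives described above:

```python
i = 12  # arbitrary example line number

try:
    message = "comment found at line " + i  # str + int is not allowed
except TypeError as err:
    message = None
    print("concatenation failed:", err)

with_str = "comment found at line " + str(i)   # convert explicitly
with_format = "comment found at line %d" % i   # let %d do the formatting

print(with_str)
print(with_format)
```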


jice 53 Posting Whiz in Training · Answer 1 · 2009-09-17T03:26:29+00:00

You can try this:

fileName = raw_input("Enter the file name you want to read (name.txt): ")
comments=[]
inComment=False
for i, line in enumerate(open(fileName)): # You can loop directly on the file lines
    if "//***" in line:
        inComment = not inComment
    if inComment:
        print "comment found at line %d" %i
        comments.append((i, line)) # you store the line and its number
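
One thing to be aware of with the plain if above: the opening //*** line flips inComment to True and is then stored as well. The sketch below (Python 3, with a hypothetical collect helper and made-up sample lines) compares that behaviour with a variant that skips the marker lines themselves:

```python
# made-up sample lines
lines = ["//*****\n", "About the method\n", "//*****\n", "code();\n"]


def collect(lines, skip_markers):
    """Collect (index, text) pairs; optionally leave out the marker lines."""
    comments = []
    in_comment = False
    for i, line in enumerate(lines):
        is_marker = "//***" in line
        if is_marker:
            in_comment = not in_comment
        if in_comment and not (skip_markers and is_marker):
            comments.append((i, line.rstrip()))
    return comments


with_markers = collect(lines, skip_markers=False)
without_markers = collect(lines, skip_markers=True)
print(with_markers)
print(without_markers)
```

Skipping the marker lines gives the same result as using elif inComment: for the second test.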

klabak85 0 Newbie Poster · Answer 2 · 2009-09-17T20:57:33+00:00

This works great! Thanks so much!

One last question:

Let's say I just want to print out the text of each line (and not the line number, though I want to keep that in the list as it's nice to have for other things). How do I access just that value for printing (from each tuple inside the list)? Thanks for the help.

klabak85 0 Newbie Poster · Answer 3 · 2009-09-17T23:39:34+00:00

That works great thanks so much!

But for my own knowledge I have a couple of questions about the code and how python does things.

Question 1:

The for loop at the bottom of the code:

for comment in comments:
      print( comment[1] )

Are you technically telling python to refer to each tuple in the list (comments[]) as a comment (to me this seems like creating an object and not having to define it explicitly)? Then as you step through the list you are just saying get each comment (tuple) and only return the second value of the tuple (by indicating the number 1)?

Question 2:

Could someone explain to me why you use the % characters in the following code and what the %d is actually doing.

print( "comment found at line %d" %i )

The variable i is obviously pointing to the line # the comment was found on. Why not just have something like

print "comment found at line " i

?

Thanks so much for your help

Edit: Seems shadwickman answered the second half of my question 1 (sorry didn't see that). Thanks. But you are defining what a "comment" is within that for loop correct?

klabak85 0 Newbie Poster · Answer 4 · 2009-09-18T00:03:52+00:00

Very helpful explanation thank you! Thanks for the link as well. Time to do some reading :)

I don't believe that type of string formatting is still in Java (it may be but I don't remember seeing it).

I'm definitely liking Python so far. Thank you for your help everyone.

sneekula 969 Nearly a Posting Maven · Answer 5 · 2009-09-18T01:00:42+00:00

The reason I am using

print( "comment found at line %d"  % i )

is that it works with Python2 and Python3.

jice 53 Posting Whiz in Training · Answer 6 · 2009-09-18T03:40:43+00:00

I would just add one last thing about string formatting:
if you write something like:

print ("%s is %d years old and wears a %s suit" % (name, age, color))

it is easier to read, to write and to maintain than

print (name + " is " + str(age) + " years old and wears a " + color + " suit")
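
Both forms build the same string, which a quick check can confirm. The values for name, age and color below are made up for the example:

```python
# made-up example values
name, age, color = "Bob", 42, "blue"

formatted = "%s is %d years old and wears a %s suit" % (name, age, color)
concatenated = name + " is " + str(age) + " years old and wears a " + color + " suit"

print(formatted)
print(formatted == concatenated)
```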
