help with text (beginner)

Question

acehigher 0 Newbie Poster

14 Years Ago

Hello,

I think this is a pretty simple problem but I just don't know where to start. I have a text file:

1
00:00:34,000 --> 00:00:36,135
Thank you, Detective.

2
00:00:42,714 --> 00:00:45,794
- Any change?
- Nothing since you left.

3
00:00:52,988 --> 00:00:55,585
She seems to be looking for something.

4
00:00:55,588 --> 00:00:59,234
Camera?

5
00:01:23,961 --> 00:01:26,662
She has a nice ass.

6
00:01:27,571 --> 00:01:30,407
Stay focused on the mission.

7
00:01:36,600 --> 00:01:40,336
Keep an eye on her,
but don't get too close.

8
00:01:51,605 --> 00:01:53,832
- Good morning.
- Good morning.

Actually, it's a .srt (subtitle) file and I need to extract the text, so ignore the 'timestamps' and 'index number'. Ultimately, I need to create a corpus of subtitle files as part of my linguistics course. Is Python the right tool for this job? Any help would be much appreciated :D

Gribouillis 1,391 Programming Explorer

14 Years Ago

Start with a script that prints the lines one by one:

SOURCE_FILE = "myfile.srt"

def main():
    with open(SOURCE_FILE) as src_file:
        for line in src_file:
            print(repr(line))

if __name__ == "__main__":
    main()
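
Building on the line-by-line loop above, here is a minimal sketch of the filtering step the question asks for: keep the dialogue, drop index numbers, timestamps and blank lines. The names `TIMESTAMP` and `extract_text` are made up for this illustration, and the sketch assumes a well-formed .srt file where each block starts with a numeric index followed by a timing line.

```python
import re

# Matches a timing line such as "00:00:34,000 --> 00:00:36,135".
TIMESTAMP = re.compile(r"^\d{2}:\d{2}:\d{2},\d{3}\s*-->\s*\d{2}:\d{2}:\d{2},\d{3}")


def extract_text(lines):
    """Return the dialogue lines, skipping indexes, timestamps and blanks."""
    text = []
    expect_index = True  # the first non-blank line of a block is its index
    for raw in lines:
        line = raw.strip()
        if not line:
            expect_index = True  # a blank line ends the current block
            continue
        if expect_index and line.isdigit():
            expect_index = False
            continue
        expect_index = False
        if TIMESTAMP.match(line):
            continue
        text.append(line)
    return text


# Two blocks copied from the sample in the question.
SAMPLE = """1
00:00:34,000 --> 00:00:36,135
Thank you, Detective.

2
00:00:42,714 --> 00:00:45,794
- Any change?
- Nothing since you left.
"""

for line in extract_text(SAMPLE.splitlines()):
    print(line)
```

To run it on a real file, pass the open file object instead of `SAMPLE.splitlines()`. If the file starts with a byte order mark, opening it with `encoding="utf-8-sig"` in Python 3 should keep the first index line from being mistaken for text.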

Gribouillis 1,391 Programming Explorer

14 Years Ago

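Assuming you are using Python 3, you should be able to write

f = open('file.txt','w')
print(timestamp, '@', subs, file=f)

or alternatively

f.write(' '.join((timestamp, '@', subs)) + '\n')

Unlike print(), write() accepts a single string, so the pieces have to be joined into one string first.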
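
A small sketch of the difference between the two calls, for anyone who wants to try it without touching a real file. `io.StringIO` stands in for the object returned by `open('file.txt', 'w')`, and the sample values are taken from the subtitle file in the question; the variable names are only for illustration.

```python
import io

timestamp = "00:00:34,000 --> 00:00:36,135"
subs = "Thank you, Detective."

buffer = io.StringIO()  # stand-in for a file opened in text mode

# print() accepts any number of arguments, separates them with spaces
# and appends a newline.
print(timestamp, '@', subs, file=buffer)

# write() accepts exactly one string, so the line is assembled first.
buffer.write(' '.join([timestamp, '@', subs]) + '\n')

# Passing several arguments to write() raises a TypeError.
try:
    buffer.write(timestamp, '@', subs)
except TypeError:
    write_rejected_extra_args = True
else:
    write_rejected_extra_args = False

lines = buffer.getvalue().splitlines()
```

Both calls leave the same line in the buffer, which is why either form works once the arguments to `write()` are joined into one string.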

acehigher 0 Newbie Poster · Answer 1 · 2011-02-22T09:50:37+00:00

Thank you very much. I will try this.

acehigher 0 Newbie Poster · Answer 2 · 2011-02-25T13:49:43+00:00

Yes, I have learnt a little more about Python. I've finally got the code to do what I want it to do.

import re
import sys

text = sys.stdin.read()

# Group 1: the timestamp line. Group 2: one or two lines of subtitle text.
# The pattern expects \r\n line endings and a following index line.
paragraph = re.compile(r"(\d{2}:\d{2}:\d{2},\d{3}\s-->\s\d{2}:\d{2}:\d{2},\d{3})\r\n(.*?\r?\n?.*)\r\n\r\n\d{1,3}\r",re.MULTILINE)
sub_oneline = re.compile(r"\r\n")

for match in paragraph.finditer(text):
    timestamp, subs = match.groups()
    timestamp = timestamp.strip()
    # Join a two-line subtitle into one line, marking the break with "->>-".
    subs = sub_oneline.sub("->>-", subs)
    print(timestamp, '@', subs)

Which gives this:

01:31:41,632 --> 01:31:44,763 @ I love him too, unfortunately.
01:31:48,515 --> 01:31:50,939 @ I may have a solution for you.
01:32:24,031 --> 01:32:25,689 @ Are you with me this time?
01:32:51,829 --> 01:32:52,868 @ C'mon, let him up.
01:32:54,861 --> 01:32:56,322 @ I'm just a tourist.

But why can I write:

print (timestamp, '@', subs)

and not:

f = open('file.txt','w')
f.write(timestamp, '@', subs)

??
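
A side note on the regular expression in this post: as far as I can tell, it only matches when the file uses \r\n line endings and when another index line follows, so the last subtitle in a file would be skipped, as would any subtitle with more than two lines of text. Below is a hedged sketch of an alternative that splits the file into blocks on blank lines instead. The names `TIMING` and `parse_cues` are invented for this example; the "->>-" joiner and the "timestamp @ text" layout are kept from the code above.

```python
import re

# A timing line such as "00:00:34,000 --> 00:00:36,135".
TIMING = re.compile(r"\d{2}:\d{2}:\d{2},\d{3}\s*-->\s*\d{2}:\d{2}:\d{2},\d{3}")


def parse_cues(text, joiner="->>-"):
    """Return a list of (timestamp, subtitle) pairs from .srt text."""
    # Normalise Windows and old Mac line endings to "\n".
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    cues = []
    # Blocks are separated by one or more blank lines.
    for block in re.split(r"\n\s*\n", text.strip()):
        lines = [line.strip() for line in block.split("\n")]
        for i, line in enumerate(lines):
            if TIMING.match(line):
                # Everything after the timing line is subtitle text.
                subs = joiner.join(l for l in lines[i + 1:] if l)
                cues.append((line, subs))
                break
    return cues


sample = (
    "1\r\n00:00:34,000 --> 00:00:36,135\r\nThank you, Detective.\r\n\r\n"
    "2\r\n00:00:42,714 --> 00:00:45,794\r\n- Any change?\r\n"
    "- Nothing since you left.\r\n"
)

for timestamp, subs in parse_cues(sample):
    print("%s @ %s" % (timestamp, subs))
```

Because each output line is built as a single string with `%`, the same string can be handed to `f.write()` (plus a trailing "\n") without running into the multiple-argument problem.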

acehigher 0 Newbie Poster · Answer 3 · 2011-02-25T15:58:33+00:00

Great, I'm still on 2.5; I'll update.

Many thanks

richieking 44 Master Poster · Answer 4 · 2011-02-25T16:14:30+00:00

acehigher, where did you get the above code from?

acehigher 0 Newbie Poster · Answer 5 · 2011-02-25T16:30:36+00:00

I followed the code in the last post of a thread that I found after hours of googling.

http://stackoverflow.com/questions/587345/python-regular-expression-matching-a-multiline-block-of-text

I'm VERY new to programming. It's perhaps not the most elegant code ever written, but I was so happy when it worked.

Gribouillis 1,391 Programming Explorer Team Colleague · Answer 6 · 2011-02-25T17:12:13+00:00

Quoting acehigher: "Great, I'm still on 2.5; I'll update. Many thanks"

Actually you can do the same in 2.6 or 2.7 if you add

from __future__ import print_function

as the first line of your file.