Hello all,
I'm hoping someone here might be able to provide some assistance. Let me see if I can describe the general problem I have and perhaps someone can offer some ideas on the best way to approach solving it.
I've got a text file that is basically a dump of thousands of other text files pasted one after the other. To be clear, I have no choice in how I receive this data; this is how it comes and I have to make it work. What I need to do is separate each individual piece of text into its own file. Here is an example, formatted for your convenience, of the data in the large text file:
AB1234
TITLE: News Headline 1
SOURCE: USA Today
TEXT: News article here
123489
TITLE: News Headline 2
SOURCE: Newsweek
TEXT: News article 2
So that is the general idea, greatly simplified but still conveying the necessary info. Each article is preceded by an identifier that sometimes starts with two letters and sometimes does not, followed by a variable number of digits. After that identifier I believe TITLE always appears next. Knowing this, what I'm trying to do is pull out each article and place it into its own file, along with the preceding identifier if possible.
Currently this is done with a macro in Word, but it takes hours, leaves out data such as the identifier, and does not always account for all of the data or for data integrity problems. I'm thinking a Perl script could do this very easily and much more quickly, though I could easily be wrong about that. I haven't used Perl in a while, so I'm still trying to get this to work.
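To make the goal concrete, here is a rough sketch of the kind of script I have in mind. It is only a sketch under assumptions: that each identifier sits on a line by itself, that the very next line starts with TITLE:, and that the whole dump fits in memory. The names split_articles and write_articles, the .txt extension, and the command-line arguments are all made up for illustration.

```perl
use strict;
use warnings;

# Split the dump into records.
# Assumption: an identifier is a line of zero to two letters followed by
# digits, and the line right after it starts with "TITLE:".
sub split_articles {
    my ($text) = @_;
    my @records;
    # The lookahead splits *before* each identifier line without consuming it.
    for my $chunk (split /^(?=[A-Za-z]{0,2}[0-9]+[ \t]*\r?\nTITLE:)/m, $text) {
        # Skip anything that does not start with an identifier line,
        # such as stray text before the first article.
        next unless $chunk =~ /\A([A-Za-z]{0,2}[0-9]+)[ \t]*\r?\n(.*)\z/s;
        push @records, { id => $1, body => $2 };
    }
    return @records;
}

# Write each record to "<dir>/<id>.txt" (hypothetical naming scheme).
# Note: two articles with the same identifier would overwrite each other.
sub write_articles {
    my ($dir, @records) = @_;
    for my $rec (@records) {
        my $path = "$dir/$rec->{id}.txt";
        open my $out, '>', $path or die "Cannot write $path: $!";
        print {$out} "$rec->{id}\n$rec->{body}";
        close $out or die "Cannot close $path: $!";
    }
    return scalar @records;
}

# Only runs when given two arguments: the dump file and an existing
# output directory (both names are placeholders).
if (@ARGV == 2) {
    my ($dump, $dir) = @ARGV;
    open my $in, '<', $dump or die "Cannot read $dump: $!";
    my $text = do { local $/; <$in> };    # slurp the whole file
    close $in;
    my $count = write_articles($dir, split_articles($text));
    print "Wrote $count files to $dir\n";
}
```

One known weakness of this sketch: if an article body happens to contain a line of digits directly followed by a TITLE: line, it would be split there as well.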
Currently I'm just trying to get the regexp to work. This is what I have so far; it seems like it might be on the right track, but it does not work entirely:
/[a-z]{0,2}[0-9]+\s+TITLE:(?!TITLE)/g
I put each match from this into another file temporarily just to see what I'm getting returned. Currently this returns:
1234TITLE:
123489TITLE:
So not only is it missing the AB from the first identifier, but it's also missing all formatting and, of course, all the rest of the text. Anyway, if anyone has some suggestions I would greatly appreciate it. If someone knows of an even smarter way to do this that would be faster and more efficient, I'm all ears, though there are some limitations on the directions I can take this, and the data itself cannot be changed. It comes in a text file and that's the only option I have.
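For anyone who wants to reproduce the missing AB, the small test below compares the character class from my pattern with one that accepts both cases. The sample string and the variable names are just for illustration. As far as I can tell, [a-z] only matches lowercase letters, so the match cannot start at "AB" and begins at the first digit instead. The rest of the article is missing simply because the pattern stops at TITLE:, and the (?!TITLE) lookahead only checks what follows without adding anything to the match.

```perl
use strict;
use warnings;

# Sample laid out as in the dump: identifier on its own line, then TITLE:.
my $sample = join "\n",
    'AB1234', 'TITLE: News Headline 1', 'SOURCE: USA Today', 'TEXT: News article here',
    '123489', 'TITLE: News Headline 2', 'SOURCE: Newsweek',  'TEXT: News article 2', '';

# Original character class: [a-z] is lowercase only, so "AB" cannot match
# and the match starts at the first digit instead.
my @lowercase_only = $sample =~ /([a-z]{0,2}[0-9]+)\s+TITLE:/g;

# Accepting both cases, and anchoring at the start of a line with /m,
# keeps the letters as part of the identifier.
my @either_case = $sample =~ /^([A-Za-z]{0,2}[0-9]+)\s+TITLE:/mg;

print "lowercase only: @lowercase_only\n";
print "either case:    @either_case\n";
```

On this sample the first list comes back as 1234 and 123489, while the second comes back as AB1234 and 123489.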
Thanks to everyone.