using grep, sed, awk to pull links out of an XML file

Question

brakeb

16 Years Ago

Greetings,

I have reached a point where I need some help. I have a TiVo at home, and I'm trying to script something that will 1.) pull the XML off of the TiVo and save the file, 2.) take the text in the XML and pull out the http:// links, 3.) put all the links into an array so that I can 4.) use cURL to download these files for further processing later.

I've managed to download the tivo.xml file using cURL, but now, I'm stuck with a text file with no spaces, or newlines that I can use to separate out the xml tags:

<?xml version="1.0" encoding="utf-8"?><TiVoContainer xmlns="http://www.tivo.com/developer/calypso-protocol-1.6/"><Details><ContentType>x-tivo-container/tivo-videos</ContentType><SourceFormat>x-tivo-container/tivo-dvr</SourceFormat><Title>Now Playing</Title><LastChangeDate>0x49B7F8FF</LastChangeDate><TotalItems>73</TotalItems><UniqueId>/NowPlaying</UniqueId></Details><SortOrder>Type,CaptureDate</SortOrder><GlobalSort>Yes</GlobalSort><ItemStart>0</ItemStart><ItemCount>73</ItemCount><Item><Details><ContentType>video/x-tivo-raw-pes</ContentType><SourceFormat>video/x-tivo-raw-pes</SourceFormat><Title>The Daily Show With Jon Stewart</Title><SourceSize>783286272</SourceSize><Duration>3600000</Duration><CaptureDate>0x49B7535F</CaptureDate><SourceChannel>48</SourceChannel><SourceStation>COMEDYP</SourceStation><HighDefinition>No</HighDefinition><ProgramId>EP2930531384</ProgramId><SeriesId>SH293053</SeriesId><EpisodeNumber>14034</EpisodeNumber><ByteOffset>0</ByteOffset></Details><Links><Content><Url>http://192.168.1.20:80/download/The%20Daily%20Show%20With%20Jon%20Stewart.TiVo?Container=%2FNowPlaying&amp;id=3207375</Url><ContentType>video/x-tivo-raw-pes</ContentType></Content><TiVoVideoDetails><Url>https://192.168.1.20:443/TiVoVideoDetails?id=3207375</Url><ContentType>text/xml</ContentType><AcceptsParams>No</AcceptsParams></TiVoVideoDetails></Links></Item><Item><Details><ContentType>video/x-tivo-raw-pes</ContentType><SourceFormat>video/x-tivo-raw-pes</SourceFormat><Title>Explorer</Title><SourceSize>780140544</SourceSize><Duration>3600000</Duration><CaptureDate>0x49B71B1F</CaptureDate><EpisodeTitle>T. Rex Walks Again</EpisodeTitle><Description>Dinosaur builder Hall Train and paleoartist Jason Brougham work to build the world's most accurate, fully skinned, mechanical replica of a T. rex. 
Copyright Tribune Media Services, Inc.</Description><SourceChannel>108</SourceChannel><SourceStation>NGC</SourceStation><HighDefinition>No</HighDefinition><ProgramId>EP7231310112</ProgramId><SeriesId>SH723131</SeriesId><ByteOffset>0</ByteOffset></Details><Links><Content><Url>http://192.168.1.20:80/download/Explorer.TiVo?Container=%2FNowPlaying&amp;id=3212467</Url><ContentType>video/x-tivo-raw-pes</ContentType></Content><CustomIcon><Url>urn:tivo:image:expires-soon-recording</Url><ContentType>image/*</ContentType><AcceptsParams>No</AcceptsParams></CustomIcon><TiVoVideoDetails><Url>https://192.168.1.20:443/TiVoVideoDetails?id=3212467</Url><ContentType>text/xml</ContentType><AcceptsParams>No</AcceptsParams></TiVoVideoDetails></Links></Item><Item><Details><ContentType>video/x-tivo-raw-pes</ContentType><SourceFormat>video/x-tivo-raw-pes</SourceFormat><Title>The Daily Show With Jon Stewart</Title><SourceSize>795869184</SourceSize><Duration>3601000</Duration><CaptureDate>0x49B601DE</CaptureDate><SourceChannel>48</SourceChannel><SourceStation>COMEDYP</SourceStation><HighDefinition>No</HighDefinition><ProgramId>EP2930531382</ProgramId><SeriesId>SH293053</SeriesId><EpisodeNumber>14033</EpisodeNumber><ByteOffset>0</ByteOffset></Details><Links><Content><Url>http://192.168.1.20:80/download/The%20Daily%20Show%20With%20Jon%20Stewart.TiVo?Container=%2FNowPlaying&amp;id=3204470</Url><ContentType>video/x-tivo-raw-pes</ContentType></Content><CustomIcon><Url>urn:tivo:image:expires-soon-recording</Url><ContentType>image/*</ContentType><AcceptsParams>No</AcceptsParams></CustomIcon><TiVoVideoDetails><Url>https://192.168.1.20:443/TiVoVideoDetails?id=3204470</Url><ContentType>text/xml</ContentType><AcceptsParams>No</AcceptsParams></TiVoVideoDetails></Links></Item><Item><Details><ContentType>video/x-tivo-raw-pes</ContentType><SourceFormat>video/x-tivo-raw-pes</SourceFormat><Title>The Wonder 
Pets!</Title><SourceSize>398458880</SourceSize><Duration>1800000</Duration><CaptureDate>0x49B5D7AE</CaptureDate><EpisodeTitle>Join the Circus!</EpisodeTitle><Description>After the pets rescue a young circus lion, the ringmaster offers each of the pets a job at the circus. Copyright Tribune Media Services, Inc.</Description><SourceChannel>47</SourceChannel><SourceStation>NIKP</SourceStation>

This may look like there are newlines, but this is because of the copy/paste I did to show an example.

Now, I was looking at trying to put newlines in, but so far, I've been unable to find an example of how to do this. I figured out that if I could find all instances of "><", and replace with ">\n<", then I might be able to separate everything into specific lines. It wouldn't look pretty, but it would allow me to script up a "grep", and "cut" type command to get my links.

I only need the links with "192.168.1.20:80" in them, but when I attempt to grep for them, I just get the entire file. A little more searching found that after doing "vi tivo.xml", there is only one line. This is why I thought up the idea of find/replace and adding newlines.

Now, some caveats (as if this needs to be any more difficult), I don't have xsltproc, which I attempted to get from ports and packages. I did find sablotron, but I'm trying to learn shell scripting and I believe it should be possible to do this without resorting to more programs. I use OpenBSD as an OS, so bash is out. I use mostly pdksh and sh. Perl is available in the base system, and if regex is needed, I would be okay with that... I've been wanting to learn that too...

If I happen upon a solution, I will post it, but I was hoping that I'd come out of lurkerdom, and ask after I've been working this for a week on my own...

Regards,
Bryan

file-system http-protocol os-x perl regex shell-scripting xml


ShawnCplus 456 Code Monkey

16 Years Ago

Replace test.html with the name of your file:

sed -r "s/<Url>([^<]+)<\/Url>/\nURL: \1\n/g" test.html | grep ^URL: | sed -r "s/^URL: //g"

This will give you a newline-delimited list of the URLs.

Also, in the future you can use the following command to clean the XML (requires Tidy):

tidy -xml -i -q <file>

ShawnCplus 456 Code Monkey

16 Years Ago


I just posted the way to do it correctly


brakeb · Answer 1 · 2009-03-14T00:56:33+00:00

Okay, talking through it got me searching different terms. Apparently, sed can do this, but I need to insert a "^J" or "ctrl-J". I've not figured out how to type that on my keyboard without being dropped back to a prompt, though.

It seems like this should work (pipes "|" are the separators):

cat tivo.xml | sed 's|><|>"ctrl-J"<|g'

The "tr" command can do it too, and the other site suggested using octal notation to get that to work, but again, I'm hitting a roadblock...

cat tivo.xml | tr '\076\074' '\076\012\074' > tivo.result

gives me the following:

"blank line"
?xml version="1.0" encoding="utf-8"?>
TiVoContainer xmlns="http://www.tivo.com/developer/calypso-protocol-1.6/">
Details>
ContentType>x-tivo-container/tivo-videos
/ContentType>
SourceFormat>x-tivo-container/tivo-dvr
/SourceFormat>
Title>Now Playing
/Title>
LastChangeDate>0x49B7F8FF
/LastChangeDate>
TotalItems>73
/TotalItems>
UniqueId>/NowPlaying
/UniqueId>
/Details>
SortOrder>Type,CaptureDate
/SortOrder>
GlobalSort>Yes
/GlobalSort>
ItemStart>0
/ItemStart>
ItemCount>73
/ItemCount>
Item>
Details>
ContentType>video/x-tivo-raw-pes
/ContentType>
SourceFormat>video/x-tivo-raw-pes
/SourceFormat>
Title>The Daily Show With Jon Stewart
/Title>
SourceSize>783286272
/SourceSize>
Duration>3600000
/Duration>
CaptureDate>0x49B7535F
/CaptureDate>
SourceChannel>48
/SourceChannel>
SourceStation>COMEDYP
/SourceStation>
HighDefinition>No
/HighDefinition>
ProgramId>EP2930531384
/ProgramId>
SeriesId>SH293053
/SeriesId>
EpisodeNumber>14034
/EpisodeNumber>
ByteOffset>0
/ByteOffset>
/Details>
Links>
Content>
Url>http://192.168.1.20:80/download/The%20Daily%20Show%20With%20Jon%20Stewart.TiVo?Container=%2FNowPlaying&amp;id=3207375
/Url>
ContentType>video/x-tivo-raw-pes
/ContentType>
/Content>
TiVoVideoDetails>
Url>https://192.168.1.20:443/TiVoVideoDetails?id=3207375
/Url>

This works, but the output deletes the "<" off of each tag. I mean, I'm not complaining. Technically, it is what I want, but I guess I just wanted to know why it does this...

Regards,
Bryan
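
On the question of why the "<" disappears: tr translates single characters, not strings, so '\076\074' to '\076\012\074' maps ">" to itself and "<" to a newline, and the extra third character is never used. A minimal sketch comparing that with sed, where a backslash followed by a real newline in the replacement does what "ctrl-J" was meant to do. The one-line sample and the variable names are made up for illustration and stand in for tivo.xml:

```sh
#!/bin/sh
# Hypothetical one-line sample standing in for tivo.xml.
sample='<a><Url>http://192.168.1.20:80/x</Url></a>'

# tr works one character at a time: '>' stays '>', '<' becomes a newline.
# The third character of the second set has no partner and is unused.
tr_out=$(printf '%s\n' "$sample" | tr '\076\074' '\076\012\074')

# sed replaces the two-character string "><". A backslash followed by a
# real newline inside the replacement inserts the line break.
sed_out=$(printf '%s\n' "$sample" | sed 's/></>\
</g')

printf '%s\n---\n%s\n' "$tr_out" "$sed_out"
```

The tr result has lost every "<" and starts with a blank line. The sed result keeps the tags intact and only breaks the line where one tag directly follows another.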

brakeb · Answer 2 · 2009-03-14T00:59:41+00:00

ShawnCplus wrote:
replace test.html with the name of your file
sed -r "s/<Url>([^<]+)<\/Url>/\nURL: \1\n/g" test.html | grep ^URL: | sed -r "s/^URL: //g"
This will give you a newline delimited list of the URLs

Sweet! I'm glad there is more than one way to do it...

ShawnCplus wrote:
Also, in the future you can use the following command to clean the XML (requires Tidy)
tidy -xml -i -q <file>

I'll check to see if we have "Tidy" in our ports/packages... seems like I need to learn something if I plan to use this a lot...

*edit: YES! www/tidy is a port for OpenBSD, nice!*
Thanks,
Bryan

brakeb · Answer 3 · 2009-03-14T01:12:54+00:00

ShawnCplus wrote:
I just posted the way to do it correctly

Sorry Shawn, my version of "sed" doesn't take the "-r". That's a GNU thing...
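
For a sed without "-r", the same idea can probably be written with basic regular expressions. This is only a sketch, not Shawn's command: the sample string and the variable names are invented, "\( \)" replaces the plain parentheses, "[^<][^<]*" stands in for "[^<]+", and a backslash followed by a real newline replaces "\n". It should work with any POSIX sed, though it has not been checked against every variant:

```sh
#!/bin/sh
# Hypothetical one-line sample standing in for the TiVo XML.
sample='<a><Url>http://192.168.1.20:80/x</Url><b/><Url>https://192.168.1.20:443/y</Url></a>'

# Put each URL on its own line with a "URL: " marker, keep only the
# marked lines, then strip the marker again.
urls=$(printf '%s\n' "$sample" |
    sed 's/<Url>\([^<][^<]*\)<\/Url>/\
URL: \1\
/g' |
    grep '^URL: ' |
    sed 's/^URL: //')

printf '%s\n' "$urls"
```

This lists every Url element, http and https alike, so a further grep would be needed to keep only the "192.168.1.20:80" links.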

brakeb · Answer 4 · 2009-03-14T01:48:47+00:00

Okay, I have a quick and dirty way of getting this to work. I use OpenBSD, and the BSD variant doesn't support "sed -r". I'm sure that works as well, but in case you are using a *BSD variant, here is how I did it:

cat $XML | tr '\076\074' '\076\012\074' | grep "Url>http:" | cut -d\> -f2

This makes a list of the http:// links that I needed, without the https:// links that I didn't need.

"tr '\076\074' '\076\012\074'" breaks the XML file into lines at the tag boundaries (i.e. <tag><tag>). Because tr translates single characters rather than strings, every "<" is turned into a newline. This causes a slight hiccup, as the file ends up looking like this:

tag>blah blah blah
/tag>

not:

<tag>blah blah blah</tag>

This isn't a big deal, as you are trying to get rid of all extraneous text anyway...

`grep "Url>http:"` greps for only the http:// links and omits the rest (the "s" in "https:" prevents a match).

And finally, "cut -d\> -f2" uses the "cut" command with ">" set as the delimiter. Since there is only one ">" on each line, "-f2" outputs the second field, the text after the ">", which of course is the URL.

Hope this helps someone.

Thanks to Shawn for another possible alternative.

Bryan
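
The remaining steps from the original question (collect the links, then hand them to cURL) could be sketched roughly as follows. This is an illustration under assumptions, not a tested TiVo downloader: the sample string, the extract_links function and the dry-run echo are all made up here, and whatever options cURL needs to talk to the TiVo are left out. It also decodes "&amp;" back to "&", since the links are still XML-escaped when they come out of the file:

```sh
#!/bin/sh
# Hypothetical one-line sample standing in for the saved tivo.xml.
sample='<Item><Links><Content><Url>http://192.168.1.20:80/download/Explorer.TiVo?Container=%2FNowPlaying&amp;id=3212467</Url></Content><TiVoVideoDetails><Url>https://192.168.1.20:443/TiVoVideoDetails?id=3212467</Url></TiVoVideoDetails></Links></Item>'

# The tr | grep | cut pipeline from this post, reading standard input,
# with one extra step that turns the XML entity &amp; back into &.
extract_links() {
    tr '\076\074' '\076\012\074' |
        grep 'Url>http:' |
        cut -d'>' -f2 |
        sed 's/&amp;/\&/g'
}

links=$(printf '%s' "$sample" | extract_links)

# Plain sh has no arrays, so loop over the list one line at a time.
printf '%s\n' "$links" | while IFS= read -r url; do
    # Dry run: print what would be fetched instead of calling curl.
    echo "would fetch: $url"
done
```

With a real file, "extract_links < tivo.xml" would replace the printf, and the echo would become the actual curl call.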
