My current goal is to scan and parse an HTML page and get the attributes from its tags. Right now, I can take a page, scan everything token by token, and save the result to a String. What I want to do is pick out certain tags and strip them down to get certain attributes. Here's an example (the following code does not work):
<img class="awesome" src="http://www.thebestsiteonthefaceoftheplanet.com/image.jpg">
I know that I have to use something like:
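import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.util.Scanner;

// yahoo is assumed to be a java.net.URL pointing at the page to read
Scanner s = new Scanner(new BufferedReader(new InputStreamReader(yahoo.openStream())));
String stuff;
while (s.hasNextLine())
{
    stuff = s.nextLine(); // read the next line of the page source
    System.out.println(stuff);
}
s.close();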
This will barrel through the source (the HTML file) line by line, saving each line to the string "stuff" and printing it.
Now I need to find a way to pick out a target tag and get certain attributes. I want to detect when an img tag is encountered and harvest the various attributes within it, separating them into different strings. Using the tag demonstrated above, I want to find an <img> tag with the class attribute "awesome" and get its src attribute.
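To make the goal concrete, here is a minimal sketch of that extraction using only java.util.regex. The class name ImgAttributeSketch and the methods parseAttributes and findSrcByClass are hypothetical names made up for illustration. It assumes attribute values are quoted, that no value contains a ">" character, and that the class attribute matches exactly. Regular expressions are fragile on real-world HTML, so a dedicated HTML parser is probably the more robust route.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; the name and methods are illustrative only.
class ImgAttributeSketch {
    // Matches a whole <img ...> tag and captures the text between "<img" and ">".
    // Assumption: no attribute value contains a '>' character.
    private static final Pattern IMG_TAG =
            Pattern.compile("<img\\b([^>]*)>", Pattern.CASE_INSENSITIVE);

    // Matches name="value" or name='value' pairs inside a tag.
    // Assumption: every attribute value is quoted.
    private static final Pattern ATTRIBUTE =
            Pattern.compile("([\\w-]+)\\s*=\\s*(\"([^\"]*)\"|'([^']*)')");

    // Splits the attribute text of one tag into a name -> value map.
    static Map<String, String> parseAttributes(String tagBody) {
        Map<String, String> attrs = new LinkedHashMap<>();
        Matcher m = ATTRIBUTE.matcher(tagBody);
        while (m.find()) {
            // Group 3 holds a double-quoted value, group 4 a single-quoted one.
            String value = m.group(3) != null ? m.group(3) : m.group(4);
            attrs.put(m.group(1).toLowerCase(), value);
        }
        return attrs;
    }

    // Returns the src of every <img> whose class attribute equals cssClass exactly.
    static List<String> findSrcByClass(String html, String cssClass) {
        List<String> result = new ArrayList<>();
        Matcher m = IMG_TAG.matcher(html);
        while (m.find()) {
            Map<String, String> attrs = parseAttributes(m.group(1));
            if (cssClass.equals(attrs.get("class")) && attrs.containsKey("src")) {
                result.add(attrs.get("src"));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String html = "<p>hi</p><img class=\"awesome\" "
                + "src=\"http://www.thebestsiteonthefaceoftheplanet.com/image.jpg\">";
        // prints [http://www.thebestsiteonthefaceoftheplanet.com/image.jpg]
        System.out.println(findSrcByClass(html, "awesome"));
    }
}
```

Because tags can span several lines, this sketch works on the whole page as one String rather than on individual lines, so the lines read by the Scanner would need to be joined first.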
I think I'm making this more difficult than it needs to be. There might be a simpler way of doing this that I'm not seeing. Also, once I'm done, I need to do something else that's a bit more complex, but I'm taking baby steps right now. Does anyone know whether I should keep on this line of thinking, or is there a better way to do this?