I've been experimenting with HTMLEditorKit and HTMLDocument. The parsing I need works, but I also need the complete source of the document to pass along to a WebKit renderer, and Java's document model throws out some tags when it reads the file in.
HTMLEditorKit kit = new HTMLEditorKit();
HTMLDocument doc = (HTMLDocument)kit.createDefaultDocument();
kit.read(myFile, doc, 0); // myFile is a Reader (or InputStream) over the source file
// find the BODY tag and insert the necessary HTML here
// now grab the source text from the document
String sourceText = doc.getText(0, doc.getLength());
The source file it's reading from contains a doctype and link tags to style sheets. Those lines are being thrown out, so when I read back the source, it's not actually the full source I expect.
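One thing worth checking here is the difference between `doc.getText(...)` and `kit.write(...)`. As far as I can tell, `getText` returns only the character content of the document with no markup, while `HTMLEditorKit.write` serializes the parsed model back to HTML. The sketch below compares the two on an in-memory string. The class and method names are made up for illustration, and the `IgnoreCharsetDirective` property is set only so that a `meta` charset tag in the input does not abort the read. Note that `write` regenerates markup from the model, so it should not be expected to reproduce the original file byte for byte.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.swing.text.html.HTMLDocument;
import javax.swing.text.html.HTMLEditorKit;

public class SourceVsText {
    // Hypothetical helper: parses the markup and returns
    // { plain text from getText, regenerated HTML from kit.write }.
    static String[] textAndMarkup(String html) throws Exception {
        HTMLEditorKit kit = new HTMLEditorKit();
        HTMLDocument doc = (HTMLDocument) kit.createDefaultDocument();
        // Avoid a ChangedCharSetException if the input declares a charset.
        doc.putProperty("IgnoreCharsetDirective", Boolean.TRUE);
        kit.read(new StringReader(html), doc, 0);

        // Character content only; tags are not part of this string.
        String text = doc.getText(0, doc.getLength());

        // HTML regenerated from the document model.
        StringWriter out = new StringWriter();
        kit.write(out, doc, 0, doc.getLength());
        return new String[] { text, out.toString() };
    }

    public static void main(String[] args) throws Exception {
        String[] result = textAndMarkup(
            "<html><head><title>t</title></head><body><p>hello</p></body></html>");
        System.out.println("getText:\n" + result[0]);
        System.out.println("kit.write:\n" + result[1]);
    }
}
```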
Now, the API docs for HTMLEditorKit state:
"When inserting into a non-empty document all tags outside of the body (head, title) will be dropped."
Am I not reading my original file into an empty document like I thought?
As new messages come into my program, I need to append several tags to the end of the BODY, and sometimes inside a DIV just before its end. The full source is then passed on to the WebKit renderer.
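For the simpler of the two cases (appending to the end of the BODY), one possible fallback that sidesteps the document model entirely is to keep the original source as a string and splice the new markup in just before the closing body tag. This is only a sketch under the assumption that the source contains a literal `</body>` tag. `BodyAppender` and `appendToBody` are hypothetical names, not part of any library. Because the original text is never re-serialized, the doctype and link tags stay exactly as they were.

```java
public class BodyAppender {
    private static final String CLOSE_BODY = "</body>";

    // Hypothetical helper: inserts fragment immediately before the last
    // closing body tag, matched case-insensitively. If no closing tag is
    // found, the fragment is appended to the end of the source.
    static String appendToBody(String source, String fragment) {
        for (int i = source.length() - CLOSE_BODY.length(); i >= 0; i--) {
            if (source.regionMatches(true, i, CLOSE_BODY, 0, CLOSE_BODY.length())) {
                return source.substring(0, i) + fragment + source.substring(i);
            }
        }
        return source + fragment;
    }

    public static void main(String[] args) {
        String source = "<!DOCTYPE html><html><head>"
            + "<link rel=\"stylesheet\" href=\"style.css\"></head>"
            + "<body><p>first</p></body></html>";
        System.out.println(appendToBody(source, "<p>new message</p>"));
    }
}
```

Inserting inside a specific DIV would need more than a plain string search (for example, tracking nested tags), so this sketch does not attempt it.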
Hopefully this makes sense to someone who can suggest an approach.