The problem isn't actually the parsing; I can't even get to that part yet. The webpage uses a different character set, "windows-1252". But even after setting the reader to use that charset (which exists on the system), I still get a ChangedCharSetException.
String link = "myurl.com";
URL url = new URL(link);
URLConnection conn = url.openConnection();
Reader reader = new InputStreamReader(conn.getInputStream(), Charset.forName("windows-1252"));
EditorKit kit = new HTMLEditorKit();
HTMLDocument doc = (HTMLDocument)kit.createDefaultDocument();
// throws ChangedCharSetException here while reading
kit.read(reader, doc, 0);
Here are the first few lines of the HTML file:
<html>
<head>
<meta http-equiv="Content-Language" content="en-us">
<meta http-equiv="Content-Type" content="text/html; charset=windows-1252">
<meta http-equiv="Pragma" content="no-cache">
Is there perhaps some way of reading the file while ignoring the metadata?
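One thing that may be worth trying, as a minimal sketch rather than a confirmed fix for this page: HTMLEditorKit.read looks up a document property named "IgnoreCharsetDirective", and when it is set to Boolean.TRUE the parser should skip the charset in the meta tag instead of throwing. The class name IgnoreCharsetExample, the parse helper, and the inline sample HTML are made up for illustration; in the real code the reader would still be the InputStreamReader opened with windows-1252.

```java
import java.io.Reader;
import java.io.StringReader;
import javax.swing.text.html.HTMLDocument;
import javax.swing.text.html.HTMLEditorKit;

public class IgnoreCharsetExample {

    // Hypothetical helper: parses HTML from a reader that was already
    // opened with the correct charset by the caller.
    static HTMLDocument parse(Reader reader) throws Exception {
        HTMLEditorKit kit = new HTMLEditorKit();
        HTMLDocument doc = (HTMLDocument) kit.createDefaultDocument();
        // Ask the parser to ignore <meta ... charset=...> directives, so it
        // does not throw ChangedCharSetException when it meets one.
        doc.putProperty("IgnoreCharsetDirective", Boolean.TRUE);
        kit.read(reader, doc, 0);
        return doc;
    }

    public static void main(String[] args) throws Exception {
        // Stand-in for the real page (assumption: same kind of meta tag).
        String html = "<html><head>"
                + "<meta http-equiv=\"Content-Type\" "
                + "content=\"text/html; charset=windows-1252\">"
                + "</head><body><p>Hello</p></body></html>";
        HTMLDocument doc = parse(new StringReader(html));
        System.out.println(doc.getText(0, doc.getLength()).trim());
    }
}
```

Since the reader is already decoding the bytes as windows-1252, ignoring the directive should not lose anything here; it only stops the parser from asking for a charset change.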