Convert HTML file into XML using Java

Question

veerasek 0 Newbie Poster

15 Years Ago

Hi All,

I am trying to convert an HTML file into XML using Java.
If any of you have sample code, please share it with me.
Your suggestions are greatly appreciated.

Thanks,
Veera

java

stultuske 1,116 Posting Maven

15 Years Ago

Please be a bit more specific on, for instance, the following points:
Are we talking about plain HTML here, or XHTML?
What have you done so far?
Where are you stuck? And please do not say that all you've come up with is the assignment as given to you.

What code have you written so far, and what does/should it produce as output?

quuba 81 Posting Pro

15 Years Ago

Hello,

> Hello,
> Please look into the following code for converting HTML into XML.
> My Java Program

// line 44.
     XMLOutputter outputter = new XMLOutputter();

I wrote a piece of code at that point to set the output encoding:

XMLOutputter outputter = new XMLOutputter();
org.jdom.output.Format newFormat = outputter.getFormat();
String encoding = "iso-8859-2";
newFormat.setEncoding(encoding);
outputter.setFormat(newFormat);

Result:
run:

<?xml version="1.0" encoding="iso-8859-2"?>
<html>
<body>
<h1>Second Page</h1>
<input type="text" name="name" value="Veera" />
<input type="text" name="Age" value="30" />
</body>
</html>

BUILD SUCCESSFUL (total time: 2 seconds)

I can't help more than that.
The source packages contain some examples.
Download the sources and docs and read them:
http://saxon.sourceforge.net/
http://sourceforge.net/projects/saxon/
http://home.ccil.org/~cowan/XML/tagsoup/
http://www.jdom.org/downloads/source.html
http://www.java2s.com/Code/Jar/jdom/Catalogjdom.htm ==> jaxen jdom.jar ==> jaxen.jar.zip
http://www.zvon.org/index.php?nav_id=tutorials

veerasek 0 Newbie Poster · Answer 1 · 2008-12-19T15:48:15+00:00

Hello,

Please look into the following code for converting HTML into XML.
My Java program:

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;

import org.jdom.Document;
import org.jdom.JDOMException;
import org.jdom.input.SAXBuilder;
import org.jdom.output.XMLOutputter;

class HTML2XML
{
public static void main(String args[]) throws JDOMException
{
	BufferedReader brInHtml = null;
	BufferedWriter bwOutXml = null;
	try {
		// To read from a URL instead of a file, open a URLConnection
		// and pass its InputStream to saxBuilder.build().
		brInHtml = new BufferedReader(new FileReader("D:\\Second.html"));

		// TagSoup acts as the SAX parser so that HTML which is not well-formed is still accepted.
		SAXBuilder saxBuilder = new SAXBuilder("org.ccil.cowan.tagsoup.Parser", false);
		Document jdomDocument = saxBuilder.build(brInHtml);

		// Print the document to the console, then write the same document to a file.
		XMLOutputter outputter = new XMLOutputter();
		outputter.output(jdomDocument, System.out);
		System.out.flush();
		bwOutXml = new BufferedWriter(new FileWriter("D:\\Second.xml"));
		outputter.output(jdomDocument, bwOutXml);
	}
	catch (IOException e) {
		// Report I/O problems instead of silently ignoring them.
		e.printStackTrace();
	}
	finally {
		// Close only the streams that were actually opened.
		try {
			if (brInHtml != null) brInHtml.close();
			if (bwOutXml != null) bwOutXml.close();
		}
		catch (IOException e) {
			e.printStackTrace();
		}
	}
}
}

Input HTML file: Second.html
------------------------------------
<html>
<body>
<h1>Second Page</h1>
<input type="text" name="name" value="Veera"></input>
<input type="text" name="Age" value="30"></input>
</body>
</html>

Expected Output
----------------------
<?xml version="1.0" encoding="ISO-....">
<root>
    <name>Veera</name>
     <Age>30</Age>
</root>
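
The expected output above is not a plain HTML-to-XML conversion: each input field has to become an element named after its name attribute. As a rough sketch of that mapping, the following uses only the JDK's DOM and transformer classes rather than JDOM. It assumes the markup is already well-formed (as Second.html is, or as TagSoup output would be), that every name attribute is a valid XML element name, and that the elements are not in a namespace. The class name FormFieldsToXml and the method convert are made up for illustration.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class FormFieldsToXml {

    // Turns every <input name="N" value="V"> into <N>V</N> under a <root> element.
    static String convert(String wellFormedHtml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document in = builder.parse(new InputSource(new StringReader(wellFormedHtml)));

            Document out = builder.newDocument();
            Element root = out.createElement("root");
            out.appendChild(root);

            NodeList inputs = in.getElementsByTagName("input");
            for (int i = 0; i < inputs.getLength(); i++) {
                Element input = (Element) inputs.item(i);
                // Assumption: the name attribute is a valid XML element name.
                Element field = out.createElement(input.getAttribute("name"));
                field.setTextContent(input.getAttribute("value"));
                root.appendChild(field);
            }

            // Serialize the new document; the XML declaration is omitted here.
            Transformer serializer = TransformerFactory.newInstance().newTransformer();
            serializer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter text = new StringWriter();
            serializer.transform(new DOMSource(out), new StreamResult(text));
            return text.toString();
        } catch (Exception e) {
            throw new IllegalStateException("Conversion failed", e);
        }
    }

    public static void main(String[] args) {
        String html = "<html><body><h1>Second Page</h1>"
            + "<input type=\"text\" name=\"name\" value=\"Veera\"></input>"
            + "<input type=\"text\" name=\"Age\" value=\"30\"></input>"
            + "</body></html>";
        System.out.println(convert(html));
    }
}
```

If an XML declaration with a specific encoding is wanted, leaving out the OMIT_XML_DECLARATION line and setting OutputKeys.ENCODING on the serializer should produce one.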

MouNed 0 Newbie Poster · Answer 2 · 2009-12-30T02:46:41+00:00

Good evening everybody,

Here is my code to convert an HTML table to XML:

import org.htmlparser.*;
import org.htmlparser.util.*;
import org.htmlparser.filters.*;
 
import java.io.*;
 
import org.jdom.*;
import org.jdom.output.*;
public class testParser5
{
	static String[][] matrice=new String[3][5];
	String[][] matrice2=new String[3][4];
	static int i=0,j=0,ligne,td;
	static Element racine = new Element("Table");
	static org.jdom.Document document = new Document(racine);
  public static void displayTree(Node n)
  {
    if (n instanceof Tag)
    {
      Tag t = (Tag)n;
      if (t.getTagName().equals("TR"))
      {
        System.out.println("\n-------------------------------------------------------");
        ligne++;
      }
      else if (t.getTagName().equals("TD"))
      {
    	  if(j>4){j=0;i++;}
    	  matrice[i][j++]=t.toPlainTextString();
        System.out.print(t.toPlainTextString() + "|");
        td++;
      }
    }
 
    if (n.getChildren() != null)
      for (int i = 0; i < n.getChildren().size(); i++)
        displayTree(n.getChildren().elementAt(i));
    
   
  }
 
  // Prints the document to the console.
  static void affiche()
	{
	   try
	   {
	      // Use the standard pretty-printing format from getPrettyFormat()
	      XMLOutputter sortie = new XMLOutputter(Format.getPrettyFormat());
	      sortie.output(document, System.out);
	   }
	   catch (java.io.IOException e)
	   {
	      e.printStackTrace();
	   }
	}
 
	// Saves the document to the given file.
	static void enregistre(String fichier)
	{
	   // Creating a FileOutputStream with the file name as argument is all that is
	   // needed for serialization; try-with-resources closes the stream afterwards.
	   try (FileOutputStream out = new FileOutputStream(fichier))
	   {
	      XMLOutputter sortie = new XMLOutputter(Format.getPrettyFormat());
	      sortie.output(document, out);
	   }
	   catch (java.io.IOException e)
	   {
	      e.printStackTrace();
	   }
	}
  public static void main(String[] args)
  {
    try
    {
      Parser parser = new Parser ("file:C:\\fichier.html");
      NodeList list = new NodeList ();
      NodeFilter filter = new TagNameFilter ("TABLE");
      for (NodeIterator e = parser.elements (); e.hasMoreNodes ();)
      {
        e.nextNode().collectInto(list, filter);
      }
      
      for (int i = 0; i < list.size(); i++)
         {
        //System.out.println("=============================");
        Node table = list.elementAt(i);
        displayTree(table);  
          }
      int colonne=td/ligne;
     
      for (int i = 0; i < ligne; i++)
      {
    	  for (int z = 0; z < colonne; z++){
    		  //System.out.print("   "+matrice[i][z]);
    	  }
    	  System.out.println();
      }
      
      // One <etudiant> element per data row; the first row supplies the element names.
      for (int j = 0; j < ligne - 1; j++)
      {
        Element etudiant = new Element("etudiant");
        racine.addContent(etudiant);
        for (int z = 0; z < colonne; z++)
        {
          Element nom = new Element(matrice[0][z]);
          nom.setText(matrice[j+1][z]);
          etudiant.addContent(nom);
        }
      }
      affiche();
      enregistre("out.xml");
    }
    catch(ParserException e)
    { 
      e.printStackTrace();
    } 
  }
}

But the problem is that this code works only if the HTML file contains a single table. I want it to work in all cases.

The problem is how to put each table into an HTML template that I can then convert to XML.
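
One way to handle any number of tables is to give each table its own output element instead of sharing one static array. The sketch below is only an illustration with the JDK's DOM classes, not a drop-in replacement for the htmlparser code. It assumes the markup is well-formed with lowercase tag names, that tables are not nested, that the first row of each table holds the column names in td cells, and that those names are valid XML element names. The names TablesToXml, convert, Tables and row are invented for the example.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class TablesToXml {

    // Emits one <Table> per HTML table and one <row> per data row.
    static String convert(String wellFormedHtml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document in = builder.parse(new InputSource(new StringReader(wellFormedHtml)));
            Document out = builder.newDocument();
            Element root = out.createElement("Tables");
            out.appendChild(root);

            NodeList tables = in.getElementsByTagName("table");
            for (int t = 0; t < tables.getLength(); t++) {
                Element tableOut = out.createElement("Table");
                root.appendChild(tableOut);
                NodeList rows = ((Element) tables.item(t)).getElementsByTagName("tr");
                if (rows.getLength() == 0) {
                    continue; // empty table: keep the empty <Table> element
                }
                // Assumption: the first row supplies the element names.
                NodeList header = ((Element) rows.item(0)).getElementsByTagName("td");
                for (int r = 1; r < rows.getLength(); r++) {
                    Element record = out.createElement("row");
                    tableOut.appendChild(record);
                    NodeList cells = ((Element) rows.item(r)).getElementsByTagName("td");
                    // Ignore cells that have no matching header cell.
                    for (int c = 0; c < cells.getLength() && c < header.getLength(); c++) {
                        Element field = out.createElement(header.item(c).getTextContent().trim());
                        field.setTextContent(cells.item(c).getTextContent().trim());
                        record.appendChild(field);
                    }
                }
            }

            Transformer serializer = TransformerFactory.newInstance().newTransformer();
            serializer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter text = new StringWriter();
            serializer.transform(new DOMSource(out), new StreamResult(text));
            return text.toString();
        } catch (Exception e) {
            throw new IllegalStateException("Conversion failed", e);
        }
    }

    public static void main(String[] args) {
        String html = "<html><body>"
            + "<table><tr><td>nom</td><td>age</td></tr><tr><td>Ali</td><td>20</td></tr></table>"
            + "<table><tr><td>ville</td></tr><tr><td>Paris</td></tr><tr><td>Lyon</td></tr></table>"
            + "</body></html>";
        System.out.println(convert(html));
    }
}
```

Because nothing is stored in fixed-size arrays, the number of tables, rows and columns does not need to be known in advance.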

I have another piece of code that displays all the HTML tables but does not convert them into XML:

import org.htmlparser.*;
import org.htmlparser.filters.*;
import org.htmlparser.util.*;
import org.htmlparser.nodes.*;
 
public class SimpleParse4 {
    static NodeList list;

    public static void main (String [] args) {
        try {
            // Collect every TABLE node of the page into the list.
            Parser parser = new Parser ();
            NodeFilter filter = new TagNameFilter ("TABLE");
            parser.setResource ("file:C:\\exemple2.html");
            list = parser.parse(filter);

            System.out.println("There are " + NombreTable() + " tables");
            NodeIterator i = list.elements ();
            while (i.hasMoreNodes ())
            {
                processMyNodes(i.nextNode ());
            }
        }
        catch (EncodingChangeException ece) {
            ece.printStackTrace ();
        }
        catch (ParserException e) {
            e.printStackTrace ();
        }
    }

    // The number of tables is simply the size of the filtered node list.
    static int NombreTable() {
        return list.size();
    }

    // Prints the plain text of one table node followed by a separator.
    static void processMyNodes (Node node) throws ParserException {
        if (node instanceof TagNode)
        {
            TagNode tag = (TagNode)node;
            System.out.print(tag.toPlainTextString()+"-----------------------------");
        }
    }
}

It has the same problem: how can I count the number of rows and columns of each table so that I can build my arrays?
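
Counting per table could look roughly like this. It is a sketch with the JDK's DOM parser, so it assumes the page has already been cleaned into well-formed markup with lowercase tag names and that tables are not nested. TableDimensions and measure are hypothetical names. The column count is taken as the widest row, counting both td and th cells.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class TableDimensions {

    // Returns one {rows, columns} pair per <table>, in document order.
    static int[][] measure(String wellFormedHtml) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(wellFormedHtml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse the input", e);
        }

        NodeList tables = doc.getElementsByTagName("table");
        int[][] dims = new int[tables.getLength()][2];
        for (int t = 0; t < tables.getLength(); t++) {
            NodeList rows = ((Element) tables.item(t)).getElementsByTagName("tr");
            int columns = 0;
            for (int r = 0; r < rows.getLength(); r++) {
                Element row = (Element) rows.item(r);
                // Count data cells and header cells; the widest row decides.
                int cells = row.getElementsByTagName("td").getLength()
                          + row.getElementsByTagName("th").getLength();
                columns = Math.max(columns, cells);
            }
            dims[t][0] = rows.getLength();
            dims[t][1] = columns;
        }
        return dims;
    }

    public static void main(String[] args) {
        String html = "<html><body>"
            + "<table><tr><th>nom</th><th>age</th></tr><tr><td>Ali</td><td>20</td></tr></table>"
            + "<table><tr><td>a</td></tr><tr><td>b</td></tr><tr><td>c</td></tr></table>"
            + "</body></html>";
        for (int[] d : measure(html)) {
            System.out.println(d[0] + " rows x " + d[1] + " columns");
        }
    }
}
```

With the dimensions known up front, each table's array can be allocated as new String[rows][columns] instead of a hard-coded size.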

I know that my message is very long; I just wanted to be clear, because I have been looking for a solution to this problem for 40 days.

Thanks for taking the trouble to read my message.

~s.o.s~ 2,560 Failure as a human Team Colleague Featured Poster · Answer 3 · 2010-01-01T22:09:27+00:00

Using something like JTidy would be much easier IMO.

thekashyap 193 Practically a Posting Shark · Answer 4 · 2010-01-02T20:28:22+00:00

Sounds like a pure transformation job. Are you supposed to use Java only? If not, this is a job for XSLT.
If you're forced to integrate it into Java, you can make the transformation an Ant task and use the Java Ant runner.

~s.o.s~ 2,560 Failure as a human Team Colleague Featured Poster · Answer 5 · 2010-01-02T21:28:12+00:00

> Sounds like a pure transformations' job. Are you supposed to use Java
> If not this is a job for XSLT.

XSLT works on XML; web pages which don't comply with the XHTML specification (almost all of them out there) are not valid XML. XSLT won't help if what you need to do is convert broken markup (HTML) to XML.
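
To make both points concrete, here is a small sketch that runs an XSLT stylesheet from Java through the standard javax.xml.transform API. The stylesheet and the names XsltFormToXml and transform are invented for the example, and the input is assumed to be well-formed and not in a namespace. If the input is not well-formed, the transform call is expected to fail with a TransformerException, which is why broken HTML would first have to be cleaned up with something like TagSoup or JTidy.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

class XsltFormToXml {

    // Hypothetical stylesheet: one element per <input>, named after its name attribute.
    static final String STYLESHEET =
          "<xsl:stylesheet version=\"1.0\" xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">"
        + "<xsl:output method=\"xml\" omit-xml-declaration=\"yes\"/>"
        + "<xsl:template match=\"/\">"
        + "<root><xsl:for-each select=\"//input\">"
        + "<xsl:element name=\"{@name}\"><xsl:value-of select=\"@value\"/></xsl:element>"
        + "</xsl:for-each></root>"
        + "</xsl:template>"
        + "</xsl:stylesheet>";

    // Applies the stylesheet to a well-formed XML (or XHTML) string.
    static String transform(String wellFormedXml) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource(new StringReader(STYLESHEET)));
            StringWriter out = new StringWriter();
            transformer.transform(new StreamSource(new StringReader(wellFormedXml)),
                                  new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            // Also reached when the input is not well-formed XML.
            throw new IllegalArgumentException("Transformation failed", e);
        }
    }

    public static void main(String[] args) {
        String xhtml = "<html><body><h1>Second Page</h1>"
            + "<input type=\"text\" name=\"name\" value=\"Veera\" />"
            + "<input type=\"text\" name=\"Age\" value=\"30\" />"
            + "</body></html>";
        System.out.println(transform(xhtml));
    }
}
```

Note that TagSoup normally places its output in the XHTML namespace, so a stylesheet applied to that output would likely need a namespace prefix in its match and select expressions.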