Member Avatar for mehnihma

Hi,
I have an XML file that has the following HTML appended to the end of it:

    <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

    <html xmlns="http://www.w3.org/1999/xhtml" >
    <head><title>

    </title></head>
    <body>
        <form name="form1" method="post" action="GetProductsXML.aspx?username=UASERNAME&amp;password=PASSWORD" id="form1">
    <div>
    <input type="hidden" name="__VIEWSTATE" id="__VIEWSTATE" value="/wEPDwULLTE2MTY2ODcyMjlkZC/1D4iGqP0urqyxWR+2OEQ90eHf" />
    </div>

        <div>

        </div>
        </form>
    </body>
    </html>

I am using this function:

$xml_url= 'http://b2b.domain.com/GetProductsXML.aspx?username=USERNAME&password=PASSWORD';
        $xml = simplexml_load_file(utf8_encode($xml_url), 'SimpleXMLElement', LIBXML_NOCDATA);

How can I filter the extra content out of this XML? When I open the URL in a web browser, I get an HTML page with text.

Member Avatar for LastMitch

How can I filter the extra content out of this XML? When I open the URL in a web browser, I get an HTML page with text.

Was there an error when you loaded the XML?

Member Avatar for mehnihma

There is no error when I load it in a browser, but I get an HTML document instead of XML because of the code at the end of the file.
The file is generated with that code at the end, and because of that I cannot read it as XML. I need to strip that part somehow so I can read it as XML, if that is possible.

Member Avatar for LastMitch

I think you need to adjust your $xml_url, because it is not letting you read the XML.

Member Avatar for mehnihma

What do you mean?

Member Avatar for mehnihma

As in the article you posted, my syntax is exactly the same. The problem is not in reading the XML; it is that this XML has extra HTML appended, as I posted above.
Example:

</ProductDescription>
<ImageLarge>http://domain.com/images/products/KOMNET201_inf.jpg</ImageLarge>
<ImageSmall>http://domain.com/images/products/KOMNET201_kat.jpg</ImageSmall>
<BarCode>6935364052034</BarCode>
<ProducerWebPage>http://www.tp-link.com/en</ProducerWebPage>
<ProductWebPage>http://www.tp-link.com/en/products/prodetail.aspx?mid=0103030106&amp;id=541</ProductWebPage>
<Warranty>12 mj.</Warranty><CategoryName>Antene i dodatna oprema</CategoryName>
<ParentCategoryName>Mrežna oprema</ParentCategoryName>
<RowNumber>436</RowNumber><NetoPrice>95,93</NetoPrice>
<ProductDescriptionShort>ohms nominal, VSWR: 1.92 max., cable 1m, SMA</ProductDescriptionShort>
<AvailableQuantity>0</AvailableQuantity>
<InfoWindowLink>http://domain.com/ProductInfo.aspx?ProductID=53299</InfoWindowLink>
<Producer>TP-LINK</Producer></Product><Product>
<IsActiveRetail>true</IsActiveRetail>
<SortOrderRetail>16506</SortOrderRetail>
<SortOrderHomePageRetail>100</SortOrderHomePageRetail>
<ProductID>370770</ProductID>
<ProductCode>KOMNET272</ProductCode>
<ProductName>ANTENA TL-ANT2412D</ProductName>
<ProductDescription />



<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

<html xmlns="http://www.w3.org/1999/xhtml" >
<head><title>

</title></head>
<body>
    <form name="form1" method="post" action="GetProductsXML.aspx?username=domain.com&amp;password=89" id="form1">
<div>
<input type="hidden" name="__VIEWSTATE" id="__VIEWSTATE" value="/wEPDwULLTE2MTY2ODcyMjlkZOOJeh0Tms5Udbf1jSVwRpTz4gUg" />
</div>

    <div>

    </div>
    </form>
</body>
</html>

How can I exclude the HTML from the XML?

Member Avatar for LastMitch

The file is generated with that code at the end, and because of that I cannot read it as XML. I need to strip that part somehow so I can read it as XML, if that is possible.

You mentioned you couldn't read the XML, but now you can?

How can I exclude the HTML from the XML?

Do you just not want the HTML tags to appear?

I don't get it.

The XML file is a separate file.
The HTML file reads the XML.
You don't put XML and HTML in one file.

Member Avatar for mehnihma

The problem is that this is the "XML" that was given to me, but it has HTML tags in it, so I cannot read it as XML. I need to find a way to exclude those tags when reading this so-called XML.

Member Avatar for LastMitch

The problem is that this is the "XML" that was given to me, but it has HTML tags in it, so I cannot read it as XML. I need to find a way to exclude those tags when reading this so-called XML.

This:

$xml_url= 'http://b2b.domain.com/GetProductsXML.aspx?username=USERNAME&password=PASSWORD';
$xml = simplexml_load_file(utf8_encode($xml_url), 'SimpleXMLElement', LIBXML_NOCDATA);

Take out everything except:

$xml = simplexml_load_file('GetProductsXML.xml');

I want to know whether you can load GetProductsXML.xml without any issue.

If you can, then there's no issue with reading the file.

Then the issue has something to do with this:

$xml_url= 'http://b2b.domain.com/GetProductsXML.aspx?username=USERNAME&password=PASSWORD';

If there is an issue reading GetProductsXML.xml, that tells you the problem lies in the file itself.
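
To see what the parser actually objects to, something like the following might help. It is only a minimal sketch: `parseOrExplain()` is a made-up helper name, the `<Products>` root element is an assumption (the thread never shows the real root), and the sample strings stand in for the real feed. With a saved copy of the feed you could pass `file_get_contents('GetProductsXML.xml')` instead.

```php
<?php
// Hypothetical helper (not part of SimpleXML): try to parse a string as XML
// and print the libxml errors if parsing fails.
function parseOrExplain(string $source): ?SimpleXMLElement
{
    // Collect libxml errors ourselves instead of letting PHP emit warnings.
    $previous = libxml_use_internal_errors(true);
    libxml_clear_errors();

    $xml = simplexml_load_string($source);

    if ($xml === false) {
        foreach (libxml_get_errors() as $error) {
            // Each LibXMLError carries the line number and a message.
            printf("Line %d: %s\n", $error->line, trim($error->message));
        }
    }

    // Restore the previous error-handling mode.
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $xml === false ? null : $xml;
}

// Assumed sample data: a well-formed document, and the same document
// with an HTML page appended after the root element.
$good = '<Products><Product><ProductID>1</ProductID></Product></Products>';
$bad  = $good . "\n<!DOCTYPE html><html><body></body></html>";

var_dump(parseOrExplain($good) !== null); // the clean document parses
var_dump(parseOrExplain($bad) !== null);  // the one with trailing HTML does not
```

If the saved file fails in the same way as `$bad`, the URL is not the problem; the trailing HTML is.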

Member Avatar for mehnihma

That is the problem: it cannot be read as XML because of the extra HTML data in it.

Member Avatar for diafol

Why is there html in your xml?

Member Avatar for mehnihma

Honestly, I'm not sure. The person who did that said that it is OK and that it should look like that :). For him this is good.
This is what I have, and I have to find a way to deal with it :)

Member Avatar for diafol

XML files should only contain XML.

Member Avatar for mehnihma

That I know, but I cannot do anything about it in this case. Can I just remove it, if possible?

Member Avatar for diafol

That I know, but I cannot do anything about it in this case. Can I just remove it, if possible?

I would, but you could read the file into a string, remove the HTML part, and pass the remainder to simplexml_load_string().

Member Avatar for mehnihma

I have tried to exclude it in a string but with no luck; I always get something left over from the HTML.

Member Avatar for diafol

What have you tried? Show us the code you used. Perhaps we can tweak it.

Member Avatar for mehnihma

return preg_replace('~<(?:!DOCTYPE|/?(?:html|body))[^>]*>\s*~i', '',$retValue);

Also something like this:

$nedozvoljeno1 = array('<!DOCTYPE html PUBLIC ""-//W3C//DTD XHTML 1.0 Transitional//EN"" ""http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"">      <html xmlns=""http://www.w3.org/1999/xhtml"" >     <head><title>      </title></head>     <body>         <form name=""form1"" method=""post"" action=""GetProductsXML.aspx?username=UASERNAME&amp;password=PASSWORD"" id=""form1"">     <div>     <input type=""hidden"" name=""__VIEWSTATE"" id=""__VIEWSTATE"" value=""/wEPDwULLTE2MTY2ODcyMjlkZC/1D4iGqP0urqyxWR+2OEQ90eHf"" />     </div>          <div>          </div>         </form>     </body>     </html>     <!DOCTYPE html PUBLIC ""-//W3C//DTD XHTML 1.0 Transitional//EN"" ""http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"">      <html xmlns=""http://www.w3.org/1999/xhtml"" >     <head><title>      </title></head>     <body>         <form name=""form1"" method=""post"" action=""GetProductsXML.aspx?username=UASERNAME&amp;password=PASSWORD"" id=""form1"">     <div>     <input type=""hidden"" name=""__VIEWSTATE"" id=""__VIEWSTATE"" value=""/wEPDwULLTE2MTY2ODcyMjlkZC/1D4iGqP0urqyxWR+2OEQ90eHf"" />     </div>          <div>          </div>         </form>     </body>     </html>');

                return str_replace($nedozvoljeno1, "", $retValue);

Maybe some new ideas?

Member Avatar for diafol

OK, that looks complicated. How about:

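$fileContent = file_get_contents("my.xml");
// Find where the appended HTML starts; strpos() returns false if it is absent.
$pos = strpos($fileContent, "<!DOC");
// Keep only the XML before the HTML (or the whole file if no HTML was found).
$string = ($pos === false) ? $fileContent : substr($fileContent, 0, $pos);
$xml = simplexml_load_string($string);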

Assuming the whole thing is in my.xml.
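
The same idea could be wrapped in a small function and tried on an inline sample before pointing it at the real feed. This is only a sketch under assumptions: `stripTrailingHtml()` is a made-up name, the `<Products>` root element is a guess (the thread only shows `<Product>` entries), and it assumes the XML itself never contains the text `<!DOCTYPE html` or `<html`.

```php
<?php
// Hypothetical helper: cut the payload at the first sign of appended HTML.
function stripTrailingHtml(string $raw): string
{
    // Prefer the HTML doctype; fall back to the opening <html tag.
    foreach (['<!DOCTYPE html', '<html'] as $marker) {
        $pos = stripos($raw, $marker); // case-insensitive search
        if ($pos !== false) {
            return rtrim(substr($raw, 0, $pos));
        }
    }
    return $raw; // nothing appended, return the input unchanged
}

// Inline sample standing in for the feed. In practice $raw would come from
// file_get_contents($xml_url) or file_get_contents("my.xml").
$raw = <<<XML
<?xml version="1.0" encoding="utf-8"?>
<Products>
  <Product><ProductID>370770</ProductID><ProductCode>KOMNET272</ProductCode></Product>
</Products>

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml"><head><title></title></head><body></body></html>
XML;

$xml = simplexml_load_string(stripTrailingHtml($raw), 'SimpleXMLElement', LIBXML_NOCDATA);
if ($xml === false) {
    exit("Still not well-formed XML after stripping the HTML.\n");
}

// Walk the products that survived the cut.
foreach ($xml->Product as $product) {
    echo $product->ProductID, ' ', $product->ProductCode, "\n";
}
```

Cutting at a fixed marker is simpler than listing every HTML tag in a regular expression, which is probably why the preg_replace attempt kept leaving pieces behind: it only removes the doctype, html and body tags, not head, title, form, div or input.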

Is there no way to get a clean xml file? I'm really confused as to why there should be any regular html in it.
