I'm building a scraper bot that gathers information from other websites, and I am using Simple HTML DOM Parser to do it.
I have found a bug, though: I ran into one website that it doesn't parse.
Here is a sample of the markup that it cannot parse:
<div
class="header"><div
class="container"><ul
id="nav"><li><a
id="home" href="http://thinkclay.com"
class="selected" title="Return to the home page">Return to home</a></li><li><a
id="about" href="http://thinkclay.com/about"
title="Read more about Clay McIlrath">About Clay McIlrath</a></li><li><a
id="design" href="http://thinkclay.com/graphic-design"
title="View my Graphic Design Portfolio">Web Design Portfolio</a></li><li><a
id="development" href="http://thinkclay.com/web-development"
title="View my Web Development Portfolio">Web Development Portfolio</a></li><li><a
id="photography" href="http://thinkclay.com/photography"
title="View my Photography Portfolio">Photography Portfolio</a></li><li><a
id="wallpaper" href="http://thinkclay.com/desktop-wallpapers"
title="Download free desktop wallpapers">Free Desktop Wallpapers</a></li><li><a
id="wordpress" href="http://thinkclay.com/wordpress"
title="Download free wordpress themes">Free Wordpress Themes</a></li></ul><div
style="clear:both;"></div><p>My name is Clayton McIlrath and I am an entrepreneur currently living in CO. I personally enjoy the process of learning, exploring, and doing all things creative as well as sharing my experiences with others. Being an entrepreneur and <a
href="http://bychosen.com">business owner</a>, I hope that my experiences may help someone else start their own venture and find success and freedom as I have! Feel free to <a
href="http://bychosen.com/contact">contact me</a> anytime for questions or opportunities.</p> <a
class="close" href="#close" title="Close the Cloud"><img
src="http://thinkclay.com/wp-content/themes/thinkclay_v2/images/close.png" alt="close" /></a></div></div><div
class="container"> <a
It seems the problem is that the markup has a line break after each tag name, before the first attribute.
I have tried using str_replace and preg_replace to replace whitespace characters with a single space, and that still doesn't seem to work. Does anybody have any idea why this is happening and how I can fix it?
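For anyone trying to reproduce this, the whitespace normalization described above can be sketched roughly as follows. This is a minimal sketch, not the exact code from my scraper: `$html` holds a shortened, hypothetical sample of the markup, and `$normalized` / `$normalizedAlt` are illustrative names.

```php
<?php
// Hypothetical, shortened sample: note the line break between each
// tag name and its first attribute, as in the markup above.
$html = "<div\nclass=\"header\"><ul\nid=\"nav\"><li><a\n"
      . "id=\"home\" href=\"http://thinkclay.com\"\n"
      . "class=\"selected\">Return to home</a></li></ul></div>";

// preg_replace variant: collapse every run of whitespace
// (spaces, tabs, \r, \n) into a single space.
$normalized = preg_replace('/\s+/', ' ', $html);

// str_replace variant: only replaces the characters listed explicitly,
// each occurrence becoming one space (runs are not collapsed).
$normalizedAlt = str_replace(array("\r\n", "\r", "\n", "\t"), ' ', $html);

echo $normalized, PHP_EOL;
```

The normalized string would then be handed to the parser instead of the raw page source.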
Thanks