Web Crawler

Question

rishabh7777 0 Newbie Poster

14 Years Ago

Hi,
I am developing a web crawler in Java. I have implemented it to some extent: I have written a program that parses all the hyperlinks from the entered URL, visits each link one by one, and repeats the process. Now I want to parse all the visible text from a particular web page, and I am having trouble with this. Can anyone suggest how to accomplish it? Any help will be greatly appreciated.
Thanks in advance
Rishabh jha

java

apines 116 Practically a Master Poster

14 Years Ago

Since you are retrieving the HTML code, you can parse the <a href="URL"></a> tags in order to get the hyperlinks. What exactly do you mean by "parsing all the visible text from the particular web page"?
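
A minimal sketch of the anchor-tag approach mentioned above, assuming the page source is already available as a String. The class name LinkExtractor and the method extractLinks are hypothetical names chosen for illustration, not code from either poster. It uses a regular expression from the standard library, which is fine for simple pages but can miss unquoted or malformed attributes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only.
final class LinkExtractor {
    // Matches href="..." or href='...' inside an <a ...> tag, ignoring case.
    // Group 1 is the quote character, group 2 is the link target.
    private static final Pattern HREF = Pattern.compile(
            "<a\\b[^>]*?\\bhref\\s*=\\s*([\"'])(.*?)\\1",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Returns the href values in the order they appear in the HTML.
    static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            links.add(m.group(2));
        }
        return links;
    }

    public static void main(String[] args) {
        String html = "<p><a href = \"http://example.com/a\">A</a> <A HREF='/b'>B</A></p>";
        System.out.println(extractLinks(html)); // prints [http://example.com/a, /b]
    }
}
```

Relative targets such as /b would still need to be resolved against the page URL before the crawler visits them.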

rishabh7777 0 Newbie Poster · Answer 1 · 2010-11-15T13:25:36+00:00

Hi apines,
I have done exactly that for parsing the hyperlinks. By visible text I mean all the text that is visible on the web page. I would be obliged if you could suggest a solution, since visible text appears in various places in a web page (title, body, headings, etc.).

apines 116 Practically a Master Poster Featured Poster · Answer 2 · 2010-11-15T15:01:41+00:00

I am not sure I fully understood. Can you please give me an example of visible text that you cannot parse?

Also, a search in DaniWeb revealed this post, which contains more information and other links to tutorials regarding web crawlers and Java. Might help you as well.

rishabh7777 0 Newbie Poster · Answer 3 · 2010-11-15T15:13:33+00:00

There are many instances, for example text displayed on buttons, on links, and so on. The problem is that every web page has a different layout. I cannot work out which portion of the page source holds the text we see in the browser: if you analyse the source code of any web page, you will find that there is no specific region for the visible text.
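
Because the visible text is spread across the whole document rather than held in one region, one possible approach is to work the other way round: drop everything that is never rendered (script and style blocks, comments, the tags themselves) and keep what is left. The sketch below is a rough illustration of that idea using only regular expressions from the standard library. The class name VisibleText and the method extract are hypothetical, it decodes only a handful of entities, and it does not pick up text stored in attributes such as the value of an input button. A real HTML parser library would likely cope better with malformed pages.

```java
import java.util.regex.Pattern;

// Hypothetical helper for illustration only.
final class VisibleText {
    // Script blocks, style blocks, and comments: content that is never shown as page text.
    private static final Pattern HIDDEN_BLOCKS = Pattern.compile(
            "<(script|style)\\b[^>]*>.*?</\\1\\s*>|<!--.*?-->",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
    // Any remaining tag, opening or closing.
    private static final Pattern TAG = Pattern.compile("<[^>]+>");
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    static String extract(String html) {
        // 1. Remove non-rendered blocks together with their content.
        String text = HIDDEN_BLOCKS.matcher(html).replaceAll(" ");
        // 2. Replace each remaining tag with a space so adjacent words stay separated.
        text = TAG.matcher(text).replaceAll(" ");
        // 3. Decode a few common entities; &amp; goes last to avoid double decoding.
        text = text.replace("&nbsp;", " ")
                   .replace("&lt;", "<")
                   .replace("&gt;", ">")
                   .replace("&quot;", "\"")
                   .replace("&amp;", "&");
        // 4. Collapse runs of whitespace into single spaces.
        return WHITESPACE.matcher(text).replaceAll(" ").trim();
    }

    public static void main(String[] args) {
        String html = "<html><head><title>Demo</title><style>p{color:red}</style></head>"
                + "<body><h1>Heading</h1><p>Some <a href=\"/x\">link text</a></p>"
                + "<button>Click me</button><script>var hidden = 1;</script></body></html>";
        System.out.println(extract(html)); // prints Demo Heading Some link text Click me
    }
}
```

The title, heading, link text, and button label all come out in document order, regardless of where they sit in the layout.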