Experienced Fellow Programmers,
I am asking these questions of those who have experience with web crawlers.
I do not want my web crawler to get trapped on a domain while crawling it, going round in a loop for some reason. So, what should I look out for to prevent loops?
1. I know crawlers should not spider dynamic URLs, as they can send the crawler into a never-ending loop. Apart from that, what other dangers are there?
2. I know I have to program the crawler to avoid crawling pages that are dead, so it has to look out for 404 responses. Which other HTTP status codes should it look out for? I need a list of error codes to feed my crawler.
3. I do not want any hacker, crook, or fraudster submitting URLs to my crawler (pinging it) to get it to crawl malicious pages such as phishing pages. How do I write code for my crawler to identify phishing pages, so that it does not crawl them or index them on my search engine?
4. I do not want any hacker, crook, or fraudster submitting URLs to my crawler (pinging it) to get it to crawl pages infected with viruses, worms, spyware, etc., which would then infect my crawler and be carried to the other domains it crawls afterwards. How do I write code for my crawler to identify infected pages, so that it does not crawl them, index them on my search engine, or carry the infection to third-party domains?
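To make point 1 concrete, this is roughly the kind of loop guard I have in mind. It is only a sketch under my own assumptions: the function names (normalize_url, should_crawl), the three limits, and the list of session parameters are guesses of mine, not taken from any library, and I have not tried them at scale.

```php
<?php
// Assumed limits; the right values depend on the sites being crawled.
const MAX_DEPTH = 5;            // link hops from the seed URL
const MAX_PAGES_PER_HOST = 1000;
const MAX_URL_LENGTH = 200;     // ever-growing URLs are a common trap symptom

// Hypothetical helper: reduce a URL to one canonical spelling so the same
// page is not queued again under a slightly different URL.
function normalize_url(string $url): ?string
{
    $parts = parse_url($url);
    if ($parts === false || empty($parts['host'])) {
        return null; // not an absolute URL
    }
    $scheme = strtolower($parts['scheme'] ?? 'http');
    $host   = strtolower($parts['host']);
    $port   = isset($parts['port']) ? ':' . $parts['port'] : '';
    $path   = $parts['path'] ?? '/';

    $query = '';
    if (isset($parts['query'])) {
        parse_str($parts['query'], $params);
        // Assumed session-style parameters; dropping them stops one page
        // from looking like an endless supply of new URLs.
        unset($params['PHPSESSID'], $params['sid'], $params['sessionid']);
        ksort($params); // parameter order should not create a "new" URL
        $query = http_build_query($params);
    }
    // The #fragment is dropped on purpose: it is never sent to the server.
    return $scheme . '://' . $host . $port . $path
        . ($query !== '' ? '?' . $query : '');
}

// Returns true only the first time a URL is seen and only within the limits.
function should_crawl(string $url, int $depth, array &$visited, array &$perHost): bool
{
    $norm = normalize_url($url);
    if ($norm === null || strlen($norm) > MAX_URL_LENGTH || $depth > MAX_DEPTH) {
        return false;
    }
    if (isset($visited[$norm])) {
        return false; // already crawled or queued
    }
    $host = parse_url($norm, PHP_URL_HOST);
    if (($perHost[$host] ?? 0) >= MAX_PAGES_PER_HOST) {
        return false; // page budget for this host is used up
    }
    $visited[$norm] = true;
    $perHost[$host] = ($perHost[$host] ?? 0) + 1;
    return true;
}
```

The idea is that a visited set alone does not help if a site keeps generating new URLs, which is why I also cap the depth, the URL length, and the page count per host.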
Wherever I asked "How do I code this?" above, I meant, "Which PHP functions should I look into?"
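For point 2, this is a sketch of how I imagine the crawler reacting to status codes. The function name classify_status and the four action labels are hypothetical names of mine. I am assuming the code would come from something like curl_getinfo($ch, CURLINFO_HTTP_CODE), which as far as I know gives 0 when no HTTP response was received at all.

```php
<?php
// Hypothetical helper: map an HTTP status code to what the crawler does next.
// The action labels are my own invention, not a standard.
function classify_status(int $code): string
{
    if ($code === 0) {
        return 'retry_later'; // no HTTP response (timeout, DNS failure, ...)
    }
    if ($code >= 200 && $code < 300) {
        return 'index';       // success: parse and index the page
    }
    if (in_array($code, [301, 302, 303, 307, 308], true)) {
        return 'redirect';    // follow the Location header, with a hop limit
    }
    if ($code === 429 || ($code >= 500 && $code < 600)) {
        return 'retry_later'; // rate limited or server trouble: back off
    }
    if ($code >= 400 && $code < 500) {
        return 'drop';        // 404, 410, 403, ...: dead or off limits
    }
    return 'skip';            // 1xx, 304, or anything unexpected
}
```

So rather than feeding the crawler a list of individual error numbers, I am thinking of treating them by range (4xx means drop, 5xx means try again later) and special-casing only a few codes such as 429. Please correct me if that is the wrong approach.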
Is there anything else I should program my crawler to watch out for?
Good questions, eh?