PHP Pals,
A thought just occurred to me, and before I delve too far into it, I need your advice.
As you know, I have been trying to learn web scraping with cURL and PHP for two projects:
FIRST PROJECT:
Build my own web proxy from scratch, like anonymouse.org. (Thread: cUrl Experiments).
That way, I can add my own custom features that the traditional web proxies don't have.
SECOND PROJECT:
Update a GPL web proxy (Php-Proxy) to add a content filter (check pages for banned words and stop the page from loading if any are found). (Thread: How To Filter Content Before Loading On Screen).
That way, if I fail to build my own proxy from scratch, then at least I manage to add my custom features to an existing web proxy.
Now I am thinking: how about building a Meta Proxy?
You have seen search engines that have their own web crawlers and indexes (Google, Web Crawler, etc.).
You have seen meta engines that do not have their own web crawlers or indexes but send queries to third-party search engines and present you with their results (Mamma, Dog Pile, etc.).
I am now interested in building my own Meta Web Proxy. That way, I:
- DO NOT need to write my own web proxy; (See first project above).
- DO NOT need to write my own content filter; (See 2nd project above).
In fact, my Meta Web Proxy can use third-party web proxies and their content filters (if any are available).
Anyway, with meta engines you can select your chosen search engines and search on more than one, either simultaneously or one after the other.
How about my Meta Web Proxy letting you query one web proxy after another automatically? Would that be good?
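To show what I mean by "one proxy after another", here is a rough sketch. It assumes every proxy accepts the target URL appended to its base address (I have only seen this pattern on anonymouse.org), and the function name and the `$fetch` callback are my own inventions for illustration. The callback would be whatever actually downloads a page (cURL, for example) and returns null on failure.

```php
<?php
// Hypothetical helper: try each proxy in order and stop at the first one
// that returns a page. $fetch is any callable taking a URL and returning
// the page body as a string, or null when the request fails.
function fetch_via_first_working_proxy(array $proxyBases, string $targetUrl, callable $fetch): ?array
{
    foreach ($proxyBases as $base) {
        // Assumption: the proxy takes the target URL appended to its base.
        $body = $fetch(rtrim($base, '/') . '/' . $targetUrl);
        if ($body !== null) {
            return ['proxy' => $base, 'body' => $body];
        }
    }
    return null; // every proxy in the list failed
}
```

Passing the downloader in as a callback means the loop itself does not care how the page is fetched.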
Anyway, the whole purpose of opening this thread is to ask you some technical questions.
First, let me give you the blueprint of how my Meta Web Proxy would work, and then you can give me your verdict on whether all of it is technically possible with cURL or PHP.
My Meta Web Proxy
It would have a URL input text box (labelled: Url).
When you type a URL into it, it will pass the query on to your chosen web proxy.
Let us assume that your chosen web proxy is:
http://anonymouse.org/anonwww.html
Now, let us say, that you want to view http://www.dictionary.com/browse/forum.
Now, when you type http://www.dictionary.com/browse/forum, my Meta Web Proxy would load this url:
http://anonymouse.org/cgi-bin/anon-www.cgi/http://www.dictionary.com/browse/forum
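In PHP, I imagine building that URL would be something like the sketch below. The function name is my own, and the only thing it assumes is the pattern visible in the link above: the target URL appended to the proxy's CGI address.

```php
<?php
// Hypothetical helper: build the proxied URL by appending the target URL
// to the proxy's CGI endpoint, as in the anonymouse.org link above.
function build_proxied_url(string $proxyBase, string $targetUrl): string
{
    return rtrim($proxyBase, '/') . '/' . $targetUrl;
}

$proxyBase = 'http://anonymouse.org/cgi-bin/anon-www.cgi';
echo build_proxied_url($proxyBase, 'http://www.dictionary.com/browse/forum'), "\n";
```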
But wouldn't this forward the user away from my site/domain/Meta Web Proxy?
Anonymouse.org would take over from then on. Right? I mean, as soon as the user clicks a link on the anonymouse.org proxified page, the user is on anonymouse.org for good and my Meta Web Proxy is out of the picture.
Q1. How do I prevent this permanent forwarding, so that the user still remains on my site/domain/Meta Web Proxy?
ISSUE 2
Now, in order for my Meta Web Proxy to track which links you (the user) click on the proxified page (the page fetched by anonymouse.org), I will need to add my tracker links to all the links present on that page.
To do that, I need to proxify the anonymouse.org proxified page itself (so that the proxified page contains my proxy links prepended to the destination links). And to do that, I need to use cURL to fetch the page. Right? That would mean I would have to look for a web host that allows me to run my own web proxy. Right?
In this case, I need to write code for cURL to fetch:
http://anonymouse.org/cgi-bin/anon-www.cgi/http://www.dictionary.com/browse/forum
Correct ?
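Here is my rough idea of that fetch-and-rewrite step, so you can tell me whether I am on the right track. The script name `meta.php` and its `url` parameter are hypothetical names of my own, and the sketch only touches `<a href>` links as they appear in the page (it does not resolve relative links, forms, images or scripts).

```php
<?php
// Fetch a page with cURL; returns the body, or null on failure.
function fetch_page(string $url): ?string
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true, // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true, // follow redirects
        CURLOPT_TIMEOUT        => 15,
    ]);
    $body = curl_exec($ch);
    return $body === false ? null : $body;
}

// Rewrite every <a href> so it points back at my own script
// (hypothetical: meta.php?url=...), which keeps the user on my domain
// and lets me log each click before fetching the next page.
function rewrite_links(string $html, string $myScript): string
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true); // real-world HTML is rarely valid
    $doc->loadHTML($html);
    libxml_clear_errors();
    foreach ($doc->getElementsByTagName('a') as $a) {
        $href = $a->getAttribute('href');
        if ($href !== '') {
            $a->setAttribute('href', $myScript . '?url=' . rawurlencode($href));
        }
    }
    return $doc->saveHTML();
}
```

The intended use would be `echo rewrite_links(fetch_page($proxiedUrl), '/meta.php');` after checking that the fetch did not return null.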
Q2. Isn't there a way to track which links my user clicks without my Meta Web Proxy's cURL code having to fetch the page onto its own servers? (Finding a proxy host is difficult, etc.)
Q3a. If I run my own Meta Web Proxy (as described above), regardless of whether I need my own proxy host or not, I do not have to write code for the content filter (banned words filter, profanity filter, etc.), as I can just get the user's chosen web proxy (e.g. anonymouse.org) to do the filtering. Right?
Q3b. But how do I inject the filter commands into the URL of the user's chosen web proxy (e.g. anonymouse.org) if I directly inject the user's site into that proxy's URL? E.g.
http://anonymouse.org/cgi-bin/anon-www.cgi/http://www.dictionary.com/browse/forum
Looking at the above link, you can see the URL contains no trace of any selected checkbox options (e.g. disable JavaScript, disable cookies, remove ads, etc.). I need to know which words in the URL would trigger which filters. How do I figure this out?
Look at the link from YouTube below. It uses the "Last Hour" filter and the "View Count" filter.
Now, how on earth are you supposed to figure out from all that which filters it is using? I guess you fiddle and experiment at YouTube and figure out their algorithm. Right?
https://www.youtube.com/results?q=cookies&sp=CAMSAggBUBQ%253D
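One experiment I can think of is to pull the `sp` value out of the link and peel off its encodings. The sketch below is only a guess at the approach: it assumes the value is URL-encoded twice (the `%253D` suggests that) and that what remains is Base64. The function name is my own, and what the resulting bytes actually mean would still have to be worked out by changing one filter at a time and comparing.

```php
<?php
// Hypothetical helper: extract the "sp" parameter from a results URL,
// undo the double URL-encoding, Base64-decode it, and return the bytes
// as hex so they can be compared between different filter selections.
function decode_sp_param(string $url): string
{
    // parse_str() already URL-decodes each value once.
    parse_str((string) parse_url($url, PHP_URL_QUERY), $params);
    $once  = $params['sp'] ?? '';
    $twice = rawurldecode($once);          // second layer: %3D becomes =
    $raw   = base64_decode($twice, true);  // strict mode: false if not Base64
    return $raw === false ? '' : bin2hex($raw);
}

echo decode_sp_param('https://www.youtube.com/results?q=cookies&sp=CAMSAggBUBQ%253D'), "\n";
```

Running this on a few links that differ by one filter each should show which bytes change with which filter.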
The other alternative is to get my Meta Web Proxy's cURL to navigate to:
http://anonymouse.org/cgi-bin/anon-www.cgi
Then auto-fill the URL into the text box labelled "Enter Website Address", auto-check any options such as "Remove Javascript", "Disable Cookies", "Remove Ads", etc., and then auto-click the "Surf Anonymously" button.
But is it possible to get cURL to do all this checkbox ticking or not? That is the big question. And if so, care to show an example? Or at least show me a link that teaches this.
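My understanding is that cURL does not "click" anything; it would just send the same form data that a browser sends when the button is pressed. Below is a rough sketch of that idea. Every field name in it (`url`, `remove_js`, `no_cookies`) is made up by me for illustration; the real names would have to be read from the proxy form's HTML source.

```php
<?php
// Build the form body the way a browser would: a ticked checkbox is sent
// as name=value, and an unticked checkbox is simply left out.
function build_form_body(string $targetUrl, array $checkedOptions): string
{
    $fields = ['url' => $targetUrl]; // hypothetical field name
    foreach ($checkedOptions as $name) {
        $fields[$name] = 'on';       // hypothetical checkbox names
    }
    return http_build_query($fields, '', '&');
}

// Submit the form with cURL; returns the response body, or null on failure.
function submit_proxy_form(string $formAction, string $body): ?string
{
    $ch = curl_init($formAction);
    curl_setopt_array($ch, [
        CURLOPT_POST           => true,
        CURLOPT_POSTFIELDS     => $body,
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_FOLLOWLOCATION => true,
    ]);
    $html = curl_exec($ch);
    return $html === false ? null : $html;
}
```

If the proxy's form uses GET instead of POST, the same body string would go after a `?` in the URL instead.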
If you still don't understand what I am blabbering on about then say so and I can show you a free .exe tool (I built in) that does all this so you can get an idea what I want cURL to do.
Don't forget to answer all 4 of my questions.
And subscribe to this thread.
Thanks!