Hi everyone, I'm working on a class project where we spider patents from the USPTO (the U.S. Patent and Trademark Office) and eventually map out our findings. Our teacher provided a script and expects us to plug directly into a modem rather than a router. That is asking a lot, since I need to provide Wi-Fi to many devices.
The first thing I did to get around the modem requirement was to set up port forwarding on my router.
In the code, we have to use port 7890.
I assigned my computer a static IP address and forwarded port 7890 to that address: 7890 --> 192.168.0.165
The first part of the code creates a SOCKS server.
I start it in the (OS X) Terminal using:
perl TaskDist.pl filename.txt
TaskDist.pl code is as follows:
use strict;
use IO::Socket;
my $sock = new IO::Socket::INET(
LocalHost => '192.168.0.165', #change to your pc ip as server ip
LocalPort => 7890,
Proto => 'tcp',
Listen => SOMAXCONN,
Reuse => 1);
$sock or die "no socket :$!";
STDOUT->autoflush(1);
my($new_sock, $buf);
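# Read the task list, one entry per line; stop early if the file is missing.
open(my $fh, '<', $ARGV[0]) or die "cannot open $ARGV[0]: $!";
my @theids = <$fh>;
close($fh);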
my $theid;
foreach $theid (@theids){
$new_sock = $sock->accept();
my $buf = <$new_sock>;
print ($new_sock $theid."\n");
print $buf . " " . $theid."\n";
close $new_sock;
}
This part seems to work fine using my forwarded port. I believe my SOCKS server is set up at this point, but I don't know whether the next script in the series should use my internal or my external IP.
With the next part of the code, I am having a hard time with the Terminal input, and possibly with the code itself. I run it while my SOCKS server is open in another Terminal window.
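To make the exchange concrete, here is a minimal sketch of the handshake that TaskDist.pl performs with each client, run over the loopback interface in a single process. The address 127.0.0.1, the OS-assigned port (LocalPort => 0), the client name worker-1, the task line ids_part1.txt, and the helper names serve_one and demo_exchange are all assumptions made for illustration; none of them come from the assignment. As far as I can tell, the conversation is one line in and one line out over plain TCP.

```perl
use strict;
use warnings;
use IO::Socket::INET;

# Hand one task line to one client, mirroring the loop body in TaskDist.pl.
sub serve_one {
    my ($listener, $task) = @_;
    my $conn = $listener->accept() or die "accept failed: $!";
    my $who = <$conn>;            # the client announces itself first
    chomp $who;
    print {$conn} "$task\n";      # then it receives exactly one task line
    close $conn;
    return $who;
}

# Run one full exchange on 127.0.0.1 (hypothetical stand-in for the LAN IP).
sub demo_exchange {
    my ($task) = @_;
    my $listener = IO::Socket::INET->new(
        LocalAddr => '127.0.0.1',
        LocalPort => 0,           # let the OS pick a free port for the demo
        Proto     => 'tcp',
        Listen    => 5,
        ReuseAddr => 1,
    ) or die "no listener: $!";

    my $client = IO::Socket::INET->new(
        PeerAddr => '127.0.0.1',
        PeerPort => $listener->sockport(),
        Proto    => 'tcp',
    ) or die "no socket: $!";

    print {$client} "worker-1\n"; # what the spider sends as its position
    my $who = serve_one($listener, $task);
    my $got = <$client>;          # what the spider reads as its file name
    chomp $got;

    close $client;
    close $listener;
    return ($who, $got);
}

my ($who, $got) = demo_exchange('ids_part1.txt');
print "server saw '$who', client got '$got'\n";
```

If this runs cleanly on one machine, the same two calls with 192.168.0.165 and port 7890 should behave the same way from inside the local network.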
use IO::Socket;
use HTML::TokeParser;
use LWP;
use URI::Escape;
use Sys::Hostname;
use strict;
my $host = $ARGV[0];
STDOUT->autoflush(1);
my $position=$ARGV[1];
my $browser = LWP::UserAgent->new();
$browser->agent("Mozilla/4.0+(compatible;+MSIE+6.0;+Windows+NT+5.1;+SV1;+.NET+CLR+1.1.4322)");
$browser->proxy('http', $ARGV[1]);
my $response;
while (1){
my $sock = new IO::Socket::INET(
PeerAddr => $host,
PeerPort => 7890,
Proto => 'tcp');
$sock or die "no socket :$!";
if (length($position)==0){
$position=hostname();
}
print ($sock $position."\n");
my $filename= <$sock>;
close $sock;
$filename =~ s/\n//;
open(my $fh, '<', $filename) or die "cannot open task file '$filename': $!";
my @theids = <$fh>;
close($fh);
my $theid;
foreach $theid (@theids){
#http://patft.uspto.gov/netacgi/nph-Parser?Sect1=PTO1&Sect2=HITOFF&d=PALL&p=1&u=%2Fnetahtml%2FPTO%2Fsrchnum.htm&r=1&f=G&l=50&s1=4484233.PN.&OS=PN/4484233&RS=PN/4484233
#http://appft1.uspto.gov/netacgi/nph-Parser?Sect1=PTO1&Sect2=HITOFF&d=PG01&p=1&u=%2Fnetahtml%2FPTO%2Fsrchnum.html&r=1&f=G&l=50&s1=%2220010020884%22.PGNR.&OS=DN/20010020884&RS=DN/20010020884
$theid =~ s/\n//;
my $pat_url;
if (length($theid)<=8){
$pat_url = "http://patft.uspto.gov/netacgi/nph-Parser?Sect1=PTO1&Sect2=HITOFF&d=PALL&p=1&u=%2Fnetahtml%2FPTO%2Fsrchnum.htm&r=1&f=G&l=50&s1=" . $theid . ".PN.&OS=PN/" . $theid . "&RS=PN/" . $theid;
}#151.207.240.23
else{
$pat_url = "http://appft1.uspto.gov/netacgi/nph-Parser?Sect1=PTO1&Sect2=HITOFF&d=PG01&p=1&u=%2Fnetahtml%2FPTO%2Fsrchnum.html&r=1&f=G&l=50&s1=%22" . $theid . "%22.PGNR.&OS=DN/" . $theid . "&RS=DN/" . $theid;
}#151.207.241.118
# print $pat_url;
my $patno = $theid;
if (-e "..\\dw_pat\\$patno.html"){
select(stdout);
print "skip $patno\n";
next;
}
select(stdout);
#print "getting pat $patno: $pat_url\n\n";
print "getting pat $filename:$patno:\n\n";
do {
$response = $browser->get($pat_url);
if (!$response->is_success()){
select(stdout);
print $response->status_line, "\n\n";
}
sleep(rand(7)+1);
}
while (!$response->is_success());
my $pat_desc = $response->content();
open(fpat, "> ..\\dw_pat\\$patno.html");
select(fpat);
print $pat_desc;
close(fpat);
}
}
exit;
We are supposed to use a proxy server to run this task, so I found a few. I don't know whether I am supposed to use a SOCKS proxy or some other kind, or how I should enter the command in the Terminal. I also don't know whether I should use my local IP address again, as I did to get the SOCKS server working on my computer, or my external IP.
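For reference, the LWP::UserAgent documentation describes the second argument of proxy() as a proxy URL, so a bare IP address may need a scheme and a port added. Below is a minimal sketch of that normalization. The helper name proxy_url and the default port 8080 are assumptions for illustration, and I am assuming a plain HTTP proxy rather than a SOCKS one.

```perl
use strict;
use warnings;

# Turn a bare "IP" or "IP:port" into the URL form that proxy() takes.
# The default port is a placeholder, not a value from the assignment.
sub proxy_url {
    my ($arg, $default_port) = @_;
    return $arg if $arg =~ m{^[a-z][a-z0-9+.-]*://}i;  # already a URL
    $arg .= ":$default_port" unless $arg =~ /:\d+$/;    # add a port if missing
    return "http://$arg/";
}

my $proxy = proxy_url('58.86.219.231', 8080);
print "proxy URL: $proxy\n";

# Only touch LWP if it is installed, so the sketch runs either way.
if (eval { require LWP::UserAgent; 1 }) {
    my $browser = LWP::UserAgent->new();
    $browser->proxy('http', $proxy);
    print "http:// requests would be routed through $proxy\n";
}
```

In the spider script this would mean passing something like http://58.86.219.231:8080/ as the second argument, with the real port of whichever proxy is used.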
The instructions say to enter:
perl spider_idThread.pl SERVERIP PROXYIP
Below is one of the commands I have tried, along with the resulting error. I have tried internal and external IPs, SOCKS and regular proxies, and adding ":port#" to the end of each, including the 7890 from above.
$ perl spider_idThread.pl 192.168.0.165 58.86.219.231
no socket :Connection refused at spider_idThread.pl line 23.
Line 23 is this from above:
$sock or die "no socket :$!";
So to me, this seems like a problem of figuring out which IP addresses to use. Any ideas? I appreciate your help.
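In case it helps narrow things down, here is a small probe I can run against each candidate address. "Connection refused" generally means the machine was reached but nothing was accepting connections on that port at that moment, so the probe just reports which addresses accept a TCP connection on port 7890. The script name probe.pl, the helper name can_connect, and the 3-second timeout are my own choices for this sketch. One caveat: TaskDist.pl hands out one line per accepted connection, so every successful probe uses up one entry from its list.

```perl
use strict;
use warnings;
use IO::Socket::INET;

# Try a TCP connection and report (1, '') on success or (0, reason) on failure.
sub can_connect {
    my ($host, $port) = @_;
    my $sock = IO::Socket::INET->new(
        PeerAddr => $host,
        PeerPort => $port,
        Proto    => 'tcp',
        Timeout  => 3,            # assumed timeout, in seconds
    );
    return (0, $@ || "$!") unless $sock;
    close $sock;
    return (1, '');
}

# Probe every address given on the command line on the task port.
my $port = 7890;
print "usage: perl probe.pl ADDRESS [ADDRESS ...]\n" unless @ARGV;
for my $host (@ARGV) {
    my ($ok, $why) = can_connect($host, $port);
    print $ok ? "$host:$port accepts connections\n"
              : "$host:$port failed: $why\n";
}
```

For example, with TaskDist.pl running in another Terminal window:

perl probe.pl 127.0.0.1 192.168.0.165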