I am trying to read a file that contains 100 million URLs. What I need to do is find out which different websites they come from, so I take a chunk of the file into memory and read it line by line. I also need to find out how many URLs each website has in the file and what those URLs are. The way I figured I would do it is to have a domain class:
class domain{
public:
    string domainname;
    int nolinks;    // number of URLs seen for this domain
    string URLs;    // the URLs themselves, newline-separated
    domain()
    {
        nolinks=1;
    }
};
and then declare a hash_set into which I insert domain objects. This is the declaration of the hash_set and its hash/compare functor:
class hash_fnc:public stdext::hash_compare<std::string>{
public:
    /*enum{
        bucket_size=1024,
        min_buckets=8
    };*/
    size_t operator()(const domain& d)const
    {
        size_t h = 0;
        std::string::const_iterator p, p_end;
        for(p = d.domainname.begin(), p_end = d.domainname.end(); p != p_end; ++p)
        {
            h = 31 * h + (*p);
        }
        return h;
    }
    bool operator()(const domain& x, const domain& y) const
    {
        return x.domainname.compare(y.domainname) < 0;
    }
};
hash_set<domain, hash_fnc> _domain;
pair<hash_set<domain, hash_fnc>::iterator, bool> ret;
While reading the file, for each URL, e.g. www.cleaned.be/forum/index.php?showuser=1, I take the website part, www.cleaned.be, create a domain object with domainname = "www.cleaned.be", and insert it into my hash_set. Whenever I encounter another URL from the same website, I check whether it already exists, increase the count nolinks by 1, and append the URL to URLs. This is the code block for that:
domain X;
X.domainname = "www.somedomain.com";
//X.URLs.assign("www.somedomain.com/index/____/x.html");
ret = _domain.insert(X);
if(ret.second == false) // it already exists
{
    (ret.first)->nolinks++;
    (ret.first)->URLs.append("\n ");
    (ret.first)->URLs.append("www.somedomain.com/index/____/x.html");
}
I do this for every line (URL) in the file. Now the problem:
Although this worked really well with an ordinary set<>, it does not with hash_set<>. URLs are getting appended to the wrong domains, and my computer sometimes shuts down while the program is running. The output is also missing almost half the domains and is otherwise messed up. Obviously I am making some big mistake, so please try to help me. I'll be really thankful.