Hi all. I need to write an offline crawler that behaves like a web crawler over a given set of HTML pages. I already know how to read the href attribute of the <a> tags and print those links to the screen. My plan is to extend my code: instead of printing the links, I want to put them in a data store and count how many links point to each page. After that, I want to take the link to the next HTML page from the data store and repeat the process. At the end, I would like to print the five pages with the most links pointing to them. I know that is a lot to ask, but any help would be appreciated. Below is my code so far, commented as best I could.
#include <iostream>
#include <fstream>
#include <string>
using namespace std;
int main()
{
//Declare a file stream that is used to read the HTML file
fstream data_store;
//Declaring two variables of type string to hold the file name and a line of text
string filename, line;
//Declare a variable of type integer to be used as a counter.
int counter1;
//Output a message to the screen
cout << "Enter a file name" << endl;
//Input the filename
cin >> filename;
//Open the named file for reading and stop if it cannot be opened
data_store.open(filename.c_str(), ios::in);
if(!data_store)
{
cerr << "Could not open " << filename << endl;
return 1;
}
//Read one whitespace-separated token at a time; the loop ends when extraction fails (end of file or a read error)
while(data_store >> line)
{
//Declare a temporary string variable and initialize it to blank
string temp = "";
//Initializes counter to zero every loop iteration
counter1 = 0;
//Check whether the token starts with "href=" and is long enough to hold the opening quote
if(line.find("href=") == 0 && line.length() >= 6)
{
//Drop the leading href= and the opening quote (six characters) and keep the rest of the token
line = line.substr(6);
//While the counter is less than the line's length
//and the character at that subscript is not a single or double quote
while(counter1 < static_cast<int>(line.length()) && line[counter1] != '\"' && line[counter1] != '\'')
{
//Put the contents of line into temp
temp += line[counter1++];
}
//Output temp on a new line
cout << temp << endl;
}
}
//Close the file explicitly (the fstream destructor would also close it when main returns)
data_store.close();
}
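
One possible way to get from the code above to the plan described in the post (store the links, count them, follow them, report the top five) is sketched below. It is only a rough outline under some assumptions: every link is a quoted href value that names a local file in the working directory, a std::map from page name to count plays the role of the data store, and the compiler supports C++11. The function names extractLinks, crawl and topPages are made up for this sketch and do not come from the original code.

```cpp
#include <algorithm>
#include <cstddef>
#include <fstream>
#include <map>
#include <queue>
#include <set>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Collect every quoted href value found in a block of HTML text.
// Assumption: unquoted values are simply skipped in this sketch.
std::vector<std::string> extractLinks(const std::string& html)
{
    std::vector<std::string> links;
    const std::string key = "href=";
    std::size_t pos = 0;
    while ((pos = html.find(key, pos)) != std::string::npos)
    {
        pos += key.length();
        if (pos >= html.length())
            break;
        const char quote = html[pos];
        if (quote != '"' && quote != '\'')
            continue; // not a quoted value, keep searching after this match
        const std::size_t end = html.find(quote, pos + 1);
        if (end == std::string::npos)
            break; // no closing quote, nothing more to extract
        links.push_back(html.substr(pos + 1, end - pos - 1));
        pos = end + 1;
    }
    return links;
}

// Breadth-first crawl over local files, counting how often each page is linked to.
// Assumption: each link is a file name that can be opened as written.
std::map<std::string, int> crawl(const std::string& startPage)
{
    std::map<std::string, int> inbound; // page name -> number of links pointing at it
    std::set<std::string> visited;      // pages already queued, so each file is read once
    std::queue<std::string> pending;    // pages still waiting to be read
    pending.push(startPage);
    visited.insert(startPage);
    while (!pending.empty())
    {
        const std::string page = pending.front();
        pending.pop();
        std::ifstream in(page.c_str());
        if (!in)
            continue; // file is missing; it keeps whatever count it already has
        std::stringstream buffer;
        buffer << in.rdbuf(); // read the whole file into memory
        for (const std::string& link : extractLinks(buffer.str()))
        {
            ++inbound[link];
            if (visited.insert(link).second)
                pending.push(link); // first time this page is seen
        }
    }
    return inbound;
}

// Return up to n (page, count) pairs, highest count first; ties are ordered by name.
std::vector<std::pair<std::string, int>> topPages(const std::map<std::string, int>& counts,
                                                  std::size_t n)
{
    std::vector<std::pair<std::string, int>> ranked(counts.begin(), counts.end());
    std::sort(ranked.begin(), ranked.end(),
              [](const std::pair<std::string, int>& a, const std::pair<std::string, int>& b) {
                  if (a.second != b.second)
                      return a.second > b.second;
                  return a.first < b.first;
              });
    if (ranked.size() > n)
        ranked.resize(n);
    return ranked;
}
```

With these pieces, main would only need to read the starting file name, call topPages(crawl(filename), 5), and print the first and second member of each returned pair. Searching the whole file text for href= instead of reading token by token also picks up links written with spaces inside the tag, though it is still plain string matching and not a real HTML parser.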