Hi Guys,

I am having a few issues testing a piece of software for which I have to develop a
front-end GUI. Basically, the previous developer did not document how to run the system
and left all the pleasure to me :D

Anyhow, the system is basically a web crawler and uses SQL Server for both grabbing
and storing data. The main table, where information about a forum is inserted manually,
has two fields, threadURL and forumURL.

I am a bit confused about how to get this information from a forum URL, e.g. the one shown below:

http://www.detailingworld.co.uk/forum/showthread.php?t=153906

Please advise.

Take this thread as an example:

Daniweb Mysql forum URL --- http://www.daniweb.com/web-development/databases/mysql/126

This thread URL --- http://www.daniweb.com/web-development/databases/mysql/threads/368668

Thanks, debasisdas, for your help :)
I tried what you showed above yesterday and failed :(

Still trying to figure out how he inserted it?!?

Also, if the forum URL is .../mysql/126, then why isn't 126 included in the thread URL .../mysql/threads/368668?

mysql ---- forum name
126 ---- forum id

Only the forum name is included in the thread URL, not the forum id.

Hmm, makes sense. Perhaps you can help!
In his report, the author mentions a table 'forums' which contains the fields
forum name, id, URL, threadurl and forumurl.

threadurl - the field contains a regular expression used to determine that
a URL points to a thread on the given forum

forumurl - the field contains a regular expression used to determine that
a URL points to a page on the forum which isn't a thread, but is necessary
to crawl to find links to threads.

So I am not sure if the information needs to be broken down to be put into these
fields?

That sounds appropriate.

Also, not all forums on the web follow the same naming conventions.

So by looking at the report information would you suggest entering the
data for forum and thread url like this:

http://......../mysql/129 - forum url OR just mysql

http://......../mysql/threads/355985 - thread url OR just /threads/355985

Please advise.

Yes, you can go with the above approach.

I am sorry, but I am confused :$

Here is my insert statement. You mean like this?

USE [autocrawldb]
GO
INSERT into dbo.forums([Forum name], url, threadRegex, forumRegex)
VALUES ('daniweb', 'http://www.daniweb.com/', 'http://www.daniweb.com/web-development/databases/126/threads/368668','');

It gives me the following error:

Msg 8152, Level 16, State 14, Line 1
String or binary data would be truncated.
The statement has been terminated.

You need to store only the following:

http://www.daniweb.com/web-development/databases/mysql/threads/368668

As you suggested, I did the following:

USE [autocrawldb]
GO
INSERT into dbo.forums([Forum name], url, threadRegex, forumRegex)
VALUES ('daniweb', 'http://www.daniweb.com/', 'http://www.daniweb.com/web-development/databases/mysql/threads/368668','http://www.daniweb.com/web-development/databases/mysql');

But I get the same error :( sorry

Msg 8152, Level 16, State 14, Line 1
String or binary data would be truncated.
The statement has been terminated.

This is the table design:

SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
SET ANSI_PADDING ON
GO
CREATE TABLE [dbo].[forums](
	[Forum name] [varchar](50) NOT NULL,
	[id] [int] IDENTITY(1,1) NOT NULL,
	[url] [varchar](50) NOT NULL,
	[threadRegex] [varchar](50) NULL,
	[forumRegex] [varchar](50) NULL,
 CONSTRAINT [PK_forums] PRIMARY KEY CLUSTERED 
(
	[id] ASC
)WITH (PAD_INDEX  = OFF, STATISTICS_NORECOMPUTE  = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS  = ON, ALLOW_PAGE_LOCKS  = ON) ON [PRIMARY]
) ON [PRIMARY]
GO
SET ANSI_PADDING OFF
GO

I think the length of the field is the problem, i.e. VARCHAR(50).

Change it to VARCHAR(500) and check.

Thanks for your reply, debasis. I changed the column length and was able to insert the
data, but when I ran the actual VS C++ program it gave me a 'Debug Assertion Failed' dialog box :(
I can see the forum name, id and url being read (in the console box), but the error pops up when it reads threadRegex and forumRegex.

I think it is a lost war :(

Also, this is the C++ code which is used for the execution:

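#include <iostream>
#include <list>
// The OTL header and the project's own crawler/helper headers are assumed to be included here.

using namespace std;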

otl_connect db; // connect object

void insertChs(list<int> chs, int id)
{
	list<int> temp;
	temp.assign(chs.begin(), chs.end());
	int index = 1;

	otl_stream i(200, // buffer size
		"INSERT INTO forumpaths VALUES( "
		":f<int>, :f1<int>, :f2<int>)",
		// INSERT statement
		db // connect object
		); 
	// create insert stream
	int tempint;
	while(temp.size() > 0)
	{
		tempint = temp.front();
		temp.pop_front();
		i<<index<<tempint<<id;
		index++;
	}
}


list<int> performAnalysis(int id)
{
	list<int> retVal;

	otl_stream i(200, // buffer size
		"select threadUrl from trainingthreads "
		"where fk_forumId>=:f<int> ",
		// SELECT statement
		db // connect object
		); 
	// create select stream
	i<<id;
	
	char tmpurl[200];
	list<list<int>> tempchs;
	list<int> tempint;
	crawler crawl(0, "", "", "", retVal);

	while(!i.eof()){ // while not end-of-data
		i >> tmpurl;
		tempint = crawl.analyse(tmpurl);
		if(tempint.size() > 1)
			tempchs.push_back(tempint);
		//retVal.push_back(tmp);
	}

	retVal = helper::subList(tempchs);

	return retVal;
}

list<int> getCh(int id)
{
	list<int> retVal;

	otl_stream i(200, // buffer size
		"select nodeindex from forumpaths "
		"where fk_forumId>=:f<int> "
		"ORDER BY [order] ASC",
		// SELECT statement
		db // connect object
		); 
	// create select stream
	i<<id;

	int tmp;
	while(!i.eof()){ // while not end-of-data
		i >> tmp;
		retVal.push_back(tmp);
	}

	if(retVal.size() == 0)
	{
		retVal = performAnalysis(id);
		insertChs(retVal,id);
	}

	return retVal;
}


void select()
{ 
	otl_stream i(200, // buffer size
		"select * from forums",
		db // connect object
		); 
	char f1[200];
	int f2;
	char f3[200];
	char forumRegex[200];
	char threadRegex[200];
	list<int> ch;

	while(!i.eof()){ // while not end-of-data
		i>>f1>>f2>>f3>>threadRegex>>forumRegex;
		cout<<"f1="<<f1<<", f2="<<f2<<", f3="<<f3<<endl;
		
		ch = getCh(f2);
		
		crawler c(f2, f3, threadRegex, forumRegex, ch);
		c.crawl();
	}
}

Looking at the code, it seems to read only forum name, id and url from the table 'forums', and then threadUrl from trainingthreads, which is a different table.

The problem is that both fields have a column size of varchar(50), and that's the way the author set it up. I mean, he got the system working that way, from what I have been told :$

I also thought I'd add what was written in the report for the threadRegex and forumRegex fields:

threadRegex: this field contains a regular expression used to determine that a URL points to a thread on the given forum.

forumRegex: this field contains a regular expression used to determine that a URL points
to a page on the forum which isn't a thread, but is necessary to crawl to find links to
threads.

The table has five fields:

forum name, id, url, threadregex and forumregex

What does field size have to do with regular expressions and their validity?

Well if I enter the data you suggested in your last reply into the respective fields then I get the following error:

Msg 8152, Level 16, State 14, Line 1
String or binary data would be truncated.
The statement has been terminated.

So what does your insert statement look like?
And if you have a crawler URL database with a limit of 50 characters per URL, you're in trouble anyway. Why can't you change the field size?

I am attaching the db create file to show you how many tables there are and their structure. This is the way the author built the db. I have no contact with him, hence
I am posting here. I just want to run the system so that I can understand it and start the GUI development for it.

I am not sure why he chose 50?

My insert statement is as follows:

USE [autocrawldb]
GO
INSERT into dbo.forums([Forum name], url, threadRegex, forumRegex)
VALUES ('daniweb', 'http://www.daniweb.com/', 'http://www\.daniweb\.com/web-develop...mysql/threads/[0-9]+','http://www\.daniweb\.com/web-develop...tabases/mysql/[0-9]+')

I also posted the C++ file which takes in information from this table on page 2, if you want to have a look.

The column size was changed to 500, but the program was still producing an error.

Okay if I can explain a bit more

For example, this is a thread:

http://www.boards.ie/vbulletin/showthread.php?p=51566909

and this is a forum:

http://www.boards.ie/vbulletin/forumdisplay.php?f=374

Threads contain "showthread.php" in their URL, and forums contain "forumdisplay.php" in their URL. Therefore in the fields:

threadRegex: should contain a regular expression which will match a string with "showthread.php" and

forumRegex: should contain a regular expression which will match a string with "forumdisplay.php"

Please advise, guys :)

You need to understand the concept of regular expressions first.
A RegEx is a pattern which matches a string of characters. You cannot simply convert a string to a regular expression, because there are infinitely many ways to construct a regular expression which matches a given string. For example, showthread.php and forumdisplay.php might be matched by any of these RegExes:

(showthread\.php|forumdisplay\.php)
(showthread|forumdisplay)\.php
[^\.]+\.php
[a-z]+\.php
.*

Which of those RegExes is suitable for your purposes is for you to decide.

For your purpose this might do: http://www\.boards\.ie/vbulletin/(showthread|forumdisplay)\.php\?(p|f)=([0-9]+)

smantscheff, you've been a great help. You are right, I need to understand the concept of regex. I am looking up a few tutorials on Google to grasp the idea. The system was not written by me, so I am trying to test every possible regex :$

In your last reply

http://www\.boards\.ie/vbulletin/(showthread|forumdisplay)\.php\?(p|f)=([0-9]+)

the information has to be entered separately into the forum and thread fields, and http://www.boards.ie/vbulletin doesn't need to be included, since we are entering the URL into the url field.

So would the following be right for the forum and thread regexes:

\forumdisplay\.php\?f=([0-9]+)
\showthread\.php\?p=([0-9]+)

Yes, but you don't need to escape the first character of the RegEx (drop the leading backslash).

Thanks, you mean like this:

forumdisplay\.php\?f=([0-9]+)
showthread\.php\?p=([0-9]+)