Hi Guys,

Need a bit of advice. Basically I am building a webcrawler, and to do so
I have to extract the page source of a webpage, which I can do like this:

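Private Sub Button1_Click(ByVal sender As System.Object, ByVal e As System.EventArgs) Handles Button1.Click

    ' WebRequest.Create returns a WebRequest, so cast it to HttpWebRequest.
    Dim request As System.Net.HttpWebRequest = DirectCast(System.Net.WebRequest.Create(TextBox2.Text), System.Net.HttpWebRequest)

    ' The Using blocks close the response and the reader once the page has been read.
    Using response As System.Net.HttpWebResponse = DirectCast(request.GetResponse(), System.Net.HttpWebResponse)
        Using sr As New System.IO.StreamReader(response.GetResponseStream())
            Dim sourcecode As String = sr.ReadToEnd()
            TextBox1.Text = sourcecode
        End Using
    End Using

End Sub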

The above code works fine on most of the websites I have tried, but for some reason it fails to extract the full HTML of a few websites: the message posted by a user is nowhere to be seen in the returned source. The webpage in question is this: http://www.vbforums.com/showthread.php?t=654378

Is there something I have missed, or is some kind of forum protection preventing the VB application from extracting the whole page source?

Please advise

Mine is also returning a "The remote server returned an error: (404) Not Found." error.

In such cases, a hidden WebBrowser could do the trick by extracting the .OuterText.

Public Class Form1

    Private Sub Form1_Load(sender As System.Object, e As System.EventArgs) Handles MyBase.Load
        Me.Cursor = Cursors.WaitCursor
    End Sub

    Private Sub WebBrowser1_DocumentCompleted(sender As System.Object, e As System.Windows.Forms.WebBrowserDocumentCompletedEventArgs) Handles WebBrowser1.DocumentCompleted
        TextBox1.Text = WebBrowser1.Document.Body.OuterText
        Me.Cursor = Cursors.Default
    End Sub
End Class

Once you locate the "Originally Posted by..." line, you know that the line(s) following it contain the post. The "______" signature line can mark where to stop extracting the content, although in some cases there might not be a signature.
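
A minimal sketch of that idea, assuming the page text uses exactly the "Originally Posted by" marker and a signature line that starts with underscores. The module name and the ExtractPost helper are made up for illustration, and if there is no signature the sketch simply collects to the end of the text:

```vbnet
Imports System
Imports System.Collections.Generic
Imports Microsoft.VisualBasic

Module PostExtractor

    ' Hypothetical helper: returns the lines that follow the first line containing
    ' "Originally Posted by", stopping at a line that starts with underscores.
    Function ExtractPost(ByVal pageText As String) As String
        Dim lines() As String = pageText.Split(New String() {vbCrLf, vbLf}, StringSplitOptions.None)
        Dim collected As New List(Of String)
        Dim inPost As Boolean = False

        For Each line As String In lines
            If Not inPost Then
                ' Collection starts on the line after the marker.
                If line.Contains("Originally Posted by") Then
                    inPost = True
                End If
            ElseIf line.TrimStart().StartsWith("___", StringComparison.Ordinal) Then
                ' Signature separator reached, so stop collecting.
                Exit For
            Else
                collected.Add(line)
            End If
        Next

        Return String.Join(Environment.NewLine, collected.ToArray()).Trim()
    End Function

End Module
```

In the DocumentCompleted handler this could be called as TextBox1.Text = ExtractPost(WebBrowser1.Document.Body.OuterText). It only picks up the first marker on the page, so a thread with several posts would need a loop around it.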

Hope this helps and good luck.

codeorder, thanks, you might have saved me from writing additional code for parsing the data lol

Anyhow I made the following changes according to my needs:

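Private Sub Button1_Click(ByVal sender As System.Object, ByVal e As System.EventArgs) Handles Button1.Click
    ' Assumption: the handler body was lost from the post; loading the URL typed into TextBox2 is the likely intent.
    WebBrowser1.Navigate(TextBox2.Text)
End Sub

Private Sub WebBrowser1_DocumentCompleted(ByVal sender As System.Object, ByVal e As System.Windows.Forms.WebBrowserDocumentCompletedEventArgs) Handles WebBrowser1.DocumentCompleted
    ' Show the plain text of the loaded page.
    TextBox1.Text = WebBrowser1.Document.Body.OuterText
End Sub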

One thing I would like to add: it seems to return only the text of the webpage, without any HTML. Is that how the WebBrowser works? I ask because at the end of the day my application will be a forum webcrawler which is only interested in the forum posts, not the HTML.

Please advise

Glad I could help. :)

You can also extract other stuff from the WebBrowser:

With WebBrowser1.Document.Body
    TextBox1.Text = .OuterText
    'TextBox1.Text = .InnerText
    'TextBox1.Text = .InnerHtml
    'TextBox1.Text = .OuterHtml
End With

I tested the .InnerHtml and it did not return the posts in it. Usually, .InnerHtml is the source code you need when extracting data.

'codeorder', thanks again for your help matey :) .InnerHtml and .OuterHtml seem to extract the same HTML content, unless I was seeing it differently? The same goes for .InnerText and .OuterText.

Also I wanted to ask you this:

Yes, I am able to extract the text without the HTML, which is what I am after, but I was wondering: is there a way of extracting only the user posts, along with the date of each post and so on? Please advise :)

For example, in this forum a user has posted this:

Hi guys, I have tis select statement and I am struggling with it on what I want to display so if anyone could help.

("SELECT MovieDetails.MovieID,MovieDetails.ActorID,
Movies.MovieID, Movies.FilmName,Movies.CatID,Actors.ActorID,
Actors.ActorName, Movies.Poster, Categories.CatName,
Categories.CatID, Movies.FilmDate,
FROM MovieDetails INNER JOIN Actors ON
MovieDetails.ActorID = Actors.ActorID INNER JOIN Movies
ON MovieDetails.MovieID = Movies.MovieID INNER
JOIN Categories on Movies.CatID = Categories.CatID")

This select statement is populated in a dataset and into bindingsource. The actors names are filled in a Listbox so when an Actor is selected, the BindingSource is filtered and display all the movies that Actor is in and display with the details, so the bidingcontext can be move forward and backwards.. Now when the movie is display "only that particular Actor's name is displayed". So how can I display all the Actors names when a movie is selected in this filter. Thanks

But how can I extract this alone and store it, say, in a database?

2. Also, what about a page containing XML?

Well, I can extract the posts now, but only by knowing the id, as shown below:

TextBox1.Text = WebBrowser1.Document.GetElementById("post-1670732").InnerText

But is there any way, using regex or something similar, to find ids of this sort
and output the result?
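
One way this might be done with regex is sketched below. It assumes the post ids always have the form "post-" followed by digits, as in the id above. The module name and the FindPostIds helper are made up for illustration, and the quotes around the attribute value are treated as optional in case the browser returns the HTML without them:

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Text.RegularExpressions

Module PostIdFinder

    ' Hypothetical helper: collects every id of the form "post-<digits>" found in the HTML.
    Function FindPostIds(ByVal html As String) As List(Of String)
        Dim ids As New List(Of String)
        ' Matches id="post-1670732", id='post-1670732' or id=post-1670732.
        ' The lookbehind skips attributes that merely end in "id", such as data-id.
        Dim pattern As String = "(?<![\w-])id\s*=\s*[""']?(post-\d+)[""']?"

        For Each m As Match In Regex.Matches(html, pattern, RegexOptions.IgnoreCase)
            ids.Add(m.Groups(1).Value)
        Next

        Return ids
    End Function

End Module
```

Each id returned by FindPostIds(WebBrowser1.Document.Body.InnerHtml) could then be passed to WebBrowser1.Document.GetElementById(...) to read its .InnerText. An alternative that avoids regex would be to loop over WebBrowser1.Document.All and keep the elements whose Id starts with "post-".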

I have also noticed that forums use links like

<a href="/software-development/vbnet/58">

Is it possible to extract these while visiting a forum and output the data?
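
A rough sketch of pulling the href values out of the page HTML with regex, assuming the href values are quoted as in the link above. The module name and the FindLinks helper are made up for illustration:

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Text.RegularExpressions

Module LinkFinder

    ' Hypothetical helper: returns the href value of every <a> tag found in the HTML.
    Function FindLinks(ByVal html As String) As List(Of String)
        Dim links As New List(Of String)
        ' Captures the quoted href value inside an anchor tag.
        Dim pattern As String = "<a\b[^>]*?\bhref\s*=\s*[""']([^""']*)[""']"

        For Each m As Match In Regex.Matches(html, pattern, RegexOptions.IgnoreCase)
            links.Add(m.Groups(1).Value)
        Next

        Return links
    End Function

End Module
```

The values come back exactly as written in the page, so a relative link such as /software-development/vbnet/58 would still need to be combined with the site address, for example with New Uri(WebBrowser1.Url, link), before the crawler can visit it.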

