Link extraction using HtmlAgilityPack

Question

intes2010 0 Newbie Poster

11 Years Ago

My code works and does extract links, but they are not the links I expected.
The program should extract links from <a href> tags that contain a specific word, which I can set manually.

Here is my complete code:

    Private Sub Button1_Click(sender As Object, e As EventArgs) Handles Button1.Click
        Dim WebSource As String
        ' Dispose the WebClient as soon as the page has been downloaded.
        ' Note: everything after "#" in the URL is a fragment and is not sent to the server.
        Using webClient As New System.Net.WebClient()
            WebSource = webClient.DownloadString("http://www.google.com.ph/search?hl=en&as_q=test&as_epq=&as_oq=&as_eq=&as_nlo=&as_nhi=&lr=&cr=countryCA&as_qdr=all&as_sitesearch=&as_occt=any&safe=images&tbs=ctr%3AcountryCA&as_filetype=&as_rights=#as_qdr=all&cr=countryCA&fp=1&hl=en&lr=&q=test&start=20&tbs=ctr:countryCA")
        End Using

        ' Parse the HTML and list every link whose href contains "test".
        Dim doc = New HtmlAgilityPack.HtmlDocument()
        doc.LoadHtml(WebSource)
        Dim links = GetLinks(doc, "test")
        For Each item In links
            ListBox1.Items.Add(item.ToString())
        Next
    End Sub


    Public Class Link
        Public Sub New(Uri As Uri, Text As String)
            Me.Uri = Uri
            Me.Text = Text
        End Sub

        Public Property Text As String
        Public Property Uri As Uri

        ' Return the URL as-is. Passing it through String.Format can throw
        ' a FormatException when the URL contains "{" or "}".
        Public Overrides Function ToString() As String
            Return If(Uri Is Nothing, "", Uri.ToString())
        End Function
    End Class


    Public Function GetLinks(doc As HtmlAgilityPack.HtmlDocument, linkContains As String) As List(Of Link)
        ' Scratch variable for Uri.TryCreate; named so it does not shadow the Uri type.
        Dim parsedUri As Uri = Nothing

        ' Keep <a> elements whose href contains the search word and parses as an absolute URI.
        Dim linksOnPage = From node In doc.DocumentNode.Descendants("a")
                          Where node.Attributes("href") IsNot Nothing
                          Let text = node.InnerText.Trim()
                          Let url = node.Attributes("href").Value
                          Where url.IndexOf(linkContains, StringComparison.OrdinalIgnoreCase) >= 0 _
                          AndAlso Uri.TryCreate(url, UriKind.Absolute, parsedUri)

        Dim Uris As New List(Of Link)()
        For Each match In linksOnPage
            Uris.Add(New Link(New Uri(match.url, UriKind.Absolute), match.text))
        Next

        Return Uris
    End Function

I want to extract all links that contain the word "test".
Here is the URL I am extracting links from:

http://www.google.com.ph/search?hl=en&as_q=test&as_epq=&as_oq=&as_eq=&as_nlo=&as_nhi=&lr=&cr=countryCA&as_qdr=all&as_sitesearch=&as_occt=any&safe=images&tbs=ctr%3AcountryCA&as_filetype=&as_rights=#as_qdr=all&cr=countryCA&fp=1&hl=en&lr=&q=test&start=20&tbs=ctr:countryCA

My expected output:

www.copetest.com/
www.testofhumanity.com/
www3.algonquincollege.com/testcentre/
www.lpitest.ca/
testtube.nfb.ca/
www.ieltscanada.ca/testdates.jsp
https://www.awinfosys.com/eassessment/fsa_fieldtest.htm

My actual output (very different from what I expected):

http://www.google.com.ph/search?hl=en&lr=&cr=countryCA&safe=images&um=1&ie=UTF-8&tbm=isch&source=og&q=test&sa=N&tab=wi
http://maps.google.com.ph/maps?hl=en&lr=&cr=countryCA&safe=images&um=1&ie=UTF-8&q=test&sa=N&tab=wl
https://play.google.com/?hl=en&lr=&cr=countryCA&safe=images&um=1&ie=UTF-8&q=test&sa=N&tab=w8
http://www.youtube.com/results?hl=en&lr=&cr=countryCA&safe=images&um=1&ie=UTF-8&gl=PH&q=test&sa=N&tab=w1
http://translate.google.com.ph/?hl=en&lr=&cr=countryCA&safe=images&um=1&ie=UTF-8&q=test&sa=N&tab=wT
http://www.google.com.ph/search?hl=en&lr=&cr=countryCA&safe=images&um=1&ie=UTF-8&tbo=u&tbm=bks&source=og&q=test&sa=N&tab=wp
https://plus.google.com/photos?hl=en&lr=&cr=countryCA&safe=images&um=1&ie=UTF-8&q=test&sa=N&tab=wq
https://accounts.google.com/ServiceLogin?hl=en&continue=http://www.google.com.ph/search%3Fhl%3Den%26as_q%3Dtest%26as_epq%3D%26as_oq%3D%26as_eq%3D%26as_nlo%3D%26as_nhi%3D%26lr%3D%26cr%3DcountryCA%26as_qdr%3Dall%26as_sitesearch%3D%26as_occt%3Dany%26safe%3Dimages%26tbs%3Dctr:countryCA%26as_filetype%3D%26as_rights%3D

Yes, it does extract links that contain the word "test", but these are not the links I want.
Where did I go wrong? I am really confused. I have been stuck on this problem for two weeks already.
I really need help.

html-css vb.net

Begginnerdev 256 Junior Poster · Answer 1 · 2013-08-28T12:42:51+00:00

After a quick look at Google's page source, it looks like what you really want is wrapped inside <cite> </cite> tags, not in the href attribute.
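
To illustrate the suggestion: the href filter in the question matches Google's own navigation links because each of them carries q=test in its query string, as the actual output shows. Below is a minimal sketch that reads the text of <cite> elements instead. It assumes the result URLs really do appear inside <cite> tags, which depends on the markup Google serves and may change. The names CiteExtractor and GetCiteTexts and the sample HTML are made up for the example; Descendants, InnerText and HtmlEntity.DeEntitize are HtmlAgilityPack members.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Linq
Imports HtmlAgilityPack

Module CiteExtractor
    ' Hypothetical helper: returns the decoded, trimmed text of every
    ' <cite> element whose text contains the given keyword.
    Public Function GetCiteTexts(doc As HtmlDocument, keyword As String) As List(Of String)
        Return doc.DocumentNode.Descendants("cite").
            Select(Function(node) HtmlEntity.DeEntitize(node.InnerText).Trim()).
            Where(Function(text) text.IndexOf(keyword, StringComparison.OrdinalIgnoreCase) >= 0).
            ToList()
    End Function

    Sub Main()
        ' Made-up sample markup standing in for a downloaded results page.
        Dim doc As New HtmlDocument()
        doc.LoadHtml("<div><a href='/url?q=test'>Result</a>" &
                     "<cite>www.lpitest.ca/</cite>" &
                     "<cite>example.org/</cite></div>")

        ' Only <cite> text containing "test" is listed; the href is ignored.
        For Each text In GetCiteTexts(doc, "test")
            Console.WriteLine(text)
        Next
    End Sub
End Module
```

In the original program, the list returned by GetCiteTexts(doc, "test") could be added to ListBox1 in place of the result of GetLinks. The <cite> text is display text, so it may lack a scheme (for example www.lpitest.ca/) and will not always parse as an absolute Uri.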