My code works in the sense that it extracts links, but they are not the links I expected.
My program is supposed to extract the links found in <a href> tags whose URL contains a specific word, which I set manually.
Here is my complete code:
Private Sub Button1_Click(sender As Object, e As EventArgs) Handles Button1.Click
    Dim webClient As New System.Net.WebClient
    Dim WebSource As String = webClient.DownloadString("http://www.google.com.ph/search?hl=en&as_q=test&as_epq=&as_oq=&as_eq=&as_nlo=&as_nhi=&lr=&cr=countryCA&as_qdr=all&as_sitesearch=&as_occt=any&safe=images&tbs=ctr%3AcountryCA&as_filetype=&as_rights=#as_qdr=all&cr=countryCA&fp=1&hl=en&lr=&q=test&start=20&tbs=ctr:countryCA")
    Dim doc = New HtmlAgilityPack.HtmlDocument()
    doc.LoadHtml(WebSource)
    Dim links = GetLinks(doc, "test")
    For Each Link In links
        ListBox1.Items.Add(Link.ToString())
    Next
End Sub
Public Class Link
    Public Sub New(Uri As Uri, Text As String)
        Me.Uri = Uri
        Me.Text = Text
    End Sub
    Public Property Text As String
    Public Property Uri As Uri
    Public Overrides Function ToString() As String
        ' Return the URL directly; String.Format would treat braces in a URL as format items.
        Return If(Uri Is Nothing, "", Uri.ToString())
    End Function
End Class
Public Function GetLinks(doc As HtmlAgilityPack.HtmlDocument, linkContains As String) As List(Of Link)
    Dim uri As Uri = Nothing
    ' Select every <a> element that has an href attribute.
    Dim linksOnPage = From link In doc.DocumentNode.Descendants()
                      Where link.Name = "a" _
                      AndAlso link.Attributes("href") IsNot Nothing _
                      Let text = link.InnerText.Trim()
                      Let url = link.Attributes("href").Value
                      Where url.IndexOf(linkContains, StringComparison.OrdinalIgnoreCase) >= 0 _
                      AndAlso uri.TryCreate(url, UriKind.Absolute, uri)
    ' The second Where keeps only hrefs that contain the word and parse as absolute URIs.
    Dim Uris As New List(Of Link)()
    For Each link In linksOnPage
        Uris.Add(New Link(New Uri(link.url, UriKind.Absolute), link.text))
    Next
    Return Uris
End Function
I want to extract all links that contain the word "test".
Here is the URL I am extracting links from:
http://www.google.com.ph/search?hl=en&as_q=test&as_epq=&as_oq=&as_eq=&as_nlo=&as_nhi=&lr=&cr=countryCA&as_qdr=all&as_sitesearch=&as_occt=any&safe=images&tbs=ctr%3AcountryCA&as_filetype=&as_rights=#as_qdr=all&cr=countryCA&fp=1&hl=en&lr=&q=test&start=20&tbs=ctr:countryCA
My expected output:
www.copetest.com/
www.testofhumanity.com/
www3.algonquincollege.com/testcentre/
www.lpitest.ca/
testtube.nfb.ca/
www.ieltscanada.ca/testdates.jsp
https://www.awinfosys.com/eassessment/fsa_fieldtest.htm
My actual output (very different from my expected output):
http://www.google.com.ph/search?hl=en&lr=&cr=countryCA&safe=images&um=1&ie=UTF-8&tbm=isch&source=og&q=test&sa=N&tab=wi
http://maps.google.com.ph/maps?hl=en&lr=&cr=countryCA&safe=images&um=1&ie=UTF-8&q=test&sa=N&tab=wl
https://play.google.com/?hl=en&lr=&cr=countryCA&safe=images&um=1&ie=UTF-8&q=test&sa=N&tab=w8
http://www.youtube.com/results?hl=en&lr=&cr=countryCA&safe=images&um=1&ie=UTF-8&gl=PH&q=test&sa=N&tab=w1
http://translate.google.com.ph/?hl=en&lr=&cr=countryCA&safe=images&um=1&ie=UTF-8&q=test&sa=N&tab=wT
http://www.google.com.ph/search?hl=en&lr=&cr=countryCA&safe=images&um=1&ie=UTF-8&tbo=u&tbm=bks&source=og&q=test&sa=N&tab=wp
https://plus.google.com/photos?hl=en&lr=&cr=countryCA&safe=images&um=1&ie=UTF-8&q=test&sa=N&tab=wq
https://accounts.google.com/ServiceLogin?hl=en&continue=http://www.google.com.ph/search%3Fhl%3Den%26as_q%3Dtest%26as_epq%3D%26as_oq%3D%26as_eq%3D%26as_nlo%3D%26as_nhi%3D%26lr%3D%26cr%3DcountryCA%26as_qdr%3Dall%26as_sitesearch%3D%26as_occt%3Dany%26safe%3Dimages%26tbs%3Dctr:countryCA%26as_filetype%3D%26as_rights%3D
Yes, it does extract links that contain the word "test", but these are not the links I want to extract.
Where did I go wrong? I am really confused. I have been stuck on this problem for two weeks.
I really need help.