Hi all, I have created a test Windows Forms application to try out a few of the HAP (Html Agility Pack) features.
My code is shown below.
The following function parses the HTML and is meant to strip unwanted or potentially harmful markup:
' Requires "Imports HtmlAgilityPack" at the top of the file.
Function SanitizeHtml(ByVal html As String) As String
    Dim doc As New HtmlDocument()
    doc.LoadHtml(html)
    ' Remove potentially harmful elements (False = drop their children as well)
    Dim nc As HtmlNodeCollection = doc.DocumentNode.SelectNodes("//script|//link|//iframe|//frameset|//frame|//applet|//object")
    If nc IsNot Nothing Then
        For Each node As HtmlNode In nc
            node.ParentNode.RemoveChild(node, False)
        Next
    End If
    ' Neutralise hrefs that point to javascript/jscript/vbscript URLs
    nc = doc.DocumentNode.SelectNodes("//a[starts-with(@href, 'javascript')]|//a[starts-with(@href, 'jscript')]|//a[starts-with(@href, 'vbscript')]")
    If nc IsNot Nothing Then
        For Each node As HtmlNode In nc
            node.SetAttributeValue("href", "protected")
        Next
    End If
    ' Neutralise img src values that point to javascript/jscript/vbscript URLs
    nc = doc.DocumentNode.SelectNodes("//img[starts-with(@src, 'javascript')]|//img[starts-with(@src, 'jscript')]|//img[starts-with(@src, 'vbscript')]")
    If nc IsNot Nothing Then
        For Each node As HtmlNode In nc
            node.SetAttributeValue("src", "protected")
        Next
    End If
    ' Remove on<Event> handlers from all tags (the double-click attribute is "ondblclick")
    nc = doc.DocumentNode.SelectNodes("//*[@onclick or @onmouseover or @onfocus or @onblur or @onmouseout or @ondblclick or @onload or @onunload]")
    If nc IsNot Nothing Then
        For Each node As HtmlNode In nc
            node.Attributes.Remove("onfocus")
            node.Attributes.Remove("onblur")
            node.Attributes.Remove("onclick")
            node.Attributes.Remove("onmouseover")
            node.Attributes.Remove("onmouseout")
            node.Attributes.Remove("ondblclick")
            node.Attributes.Remove("onload")
            node.Attributes.Remove("onunload")
        Next
    End If
    Return doc.DocumentNode.WriteTo()
End Function
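One gap in the function above is that `starts-with` in XPath is case-sensitive, so a value such as `JavaScript:...` would slip through, and only the eight listed handlers are covered. Below is a rough sketch of a looser pass over the attributes. `SanitizeSketch`, `IsScriptUrl` and `StripScriptAttributes` are made-up names for illustration and are not part of Html Agility Pack. Treating every attribute that starts with "on" as an event handler is an assumption. I have not verified this against every HAP version.

```vbnet
Imports System
Imports System.Linq
Imports HtmlAgilityPack

Module SanitizeSketch
    ' Hypothetical helper (not part of Html Agility Pack): True when the value
    ' starts with a script scheme, ignoring case and leading whitespace.
    Function IsScriptUrl(ByVal value As String) As Boolean
        If value Is Nothing Then Return False
        Dim trimmed As String = value.TrimStart()
        For Each scheme As String In New String() {"javascript:", "jscript:", "vbscript:"}
            If trimmed.StartsWith(scheme, StringComparison.OrdinalIgnoreCase) Then Return True
        Next
        Return False
    End Function

    ' Hypothetical helper: removes every attribute whose name starts with "on"
    ' (assumed to be an event handler) and overwrites script-scheme href/src
    ' values with "protected", as the original function does.
    Sub StripScriptAttributes(ByVal doc As HtmlDocument)
        ' ToList() takes snapshots so the collections can be changed while looping.
        For Each node As HtmlNode In doc.DocumentNode.Descendants().ToList()
            For Each attr As HtmlAttribute In node.Attributes.ToList()
                Dim isUrlAttr As Boolean =
                    String.Equals(attr.Name, "href", StringComparison.OrdinalIgnoreCase) OrElse
                    String.Equals(attr.Name, "src", StringComparison.OrdinalIgnoreCase)
                If attr.Name.StartsWith("on", StringComparison.OrdinalIgnoreCase) Then
                    attr.Remove()
                ElseIf isUrlAttr AndAlso IsScriptUrl(attr.Value) Then
                    attr.Value = "protected"
                End If
            Next
        Next
    End Sub
End Module
```

It would be called between `doc.LoadHtml(html)` and `doc.DocumentNode.WriteTo()`, in place of the three attribute-related XPath passes.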
Here is how I test the function using a WebBrowser control:
Private Sub Button1_Click(ByVal sender As System.Object, ByVal e As System.EventArgs) Handles Button1.Click
    Dim url As String = "http://htmlagilitypack.codeplex.com/discussions/24346"
    WebBrowser1.Navigate(url)
End Sub

Private Sub Button2_Click(ByVal sender As System.Object, ByVal e As System.EventArgs) Handles Button2.Click
    TextBox1.Text = SanitizeHtml(TextBox2.Text)
End Sub

Private Sub WebBrowser1_DocumentCompleted(ByVal sender As System.Object, ByVal e As System.Windows.Forms.WebBrowserDocumentCompletedEventArgs) Handles WebBrowser1.DocumentCompleted
    TextBox2.Text = WebBrowser1.Document.Body.OuterHtml
End Sub
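To rule out the WebBrowser input as the cause, the XPath expressions can also be probed against a small literal string in a console project. This is only a sketch: `XPathProbe`, `CountMatches` and the sample markup are invented for illustration, and the upper-case tags are there because the WebBrowser control may return markup in that form on some IE versions.

```vbnet
Imports System
Imports HtmlAgilityPack

Module XPathProbe
    ' Hypothetical debugging helper: how many nodes does this XPath select?
    Function CountMatches(ByVal html As String, ByVal xpath As String) As Integer
        Dim doc As New HtmlDocument()
        doc.LoadHtml(html)
        ' SelectNodes returns Nothing (not an empty collection) when nothing matches.
        Dim nodes As HtmlNodeCollection = doc.DocumentNode.SelectNodes(xpath)
        Return If(nodes Is Nothing, 0, nodes.Count)
    End Function

    Sub Main()
        ' Made-up sample with mixed-case tags, attributes and URL scheme.
        Dim sample As String =
            "<DIV onClick=""x()""><SCRIPT>alert(1)</SCRIPT>" &
            "<A href=""JavaScript:void(0)"">link</A></DIV>"

        ' Print the match count for each expression used by SanitizeHtml.
        For Each xpath As String In New String() {
                "//script",
                "//a[starts-with(@href, 'javascript')]",
                "//*[@onclick]"}
            Console.WriteLine("{0} -> {1}", xpath, CountMatches(sample, xpath))
        Next
    End Sub
End Module
```

If an expression reports zero matches here, the problem is in the XPath rather than in how the HTML reaches the function.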
Q: The problem is that the function does not remove any of the URLs, tags, scripts, etc., so the output is still the raw HTML.
Please advise.