Extract data from html document

Question

jamesrobb 0 Newbie Poster

15 Years Ago

How can I extract repeated paragraphs of data from an HTML document? Every paragraph is preceded by the line:
Summary as passed House:

Thanks.

html-css vb.net



All 7 Replies

sknake 1,622 Senior Poster

15 Years Ago

Why don't you post the HTML in a .txt file on the thread so we can see what you're looking at? Line breaks, white space, etc. all affect how you scrape the file.

kvprajapati 1,826 Posting Genius

15 Years Ago

You can't really parse HTML with regular expressions. It's too complex. Regular expressions won't handle <![CDATA[ sections and entity references correctly at all.
I recommend the Html Agility Pack.

SUMMARY:
This is an agile HTML parser that builds a read/write DOM and supports plain XPATH or XSLT (you actually don't HAVE to understand XPATH nor XSLT to use it, don't worry...). It is a .NET code library that allows you to parse "out of the web" HTML files. The parser is very tolerant with "real world" malformed HTML. The object model is very similar to what proposes System.Xml, but for HTML documents (or streams).
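
As a rough illustration of the suggestion above, here is a minimal sketch of how the Html Agility Pack might be used to pull out the paragraphs in question. It assumes the HtmlAgilityPack assembly is referenced by the project and that each summary sits in its own <p> element whose first child is an <i> element holding the label, as in the attachment posted later in this thread. The module and function names are hypothetical.

```vbnet
Imports System
Imports System.Collections.Generic
Imports HtmlAgilityPack

Module AgilityPackSketch

    ' Hypothetical helper: returns the text of every <p> element that has an
    ' <i> child containing the label "Summary as passed House:".
    Public Function GetHouseSummaries(ByVal html As String) As List(Of String)
        Dim results As New List(Of String)

        ' Fully qualified to avoid a clash with System.Windows.Forms.HtmlDocument.
        Dim doc As New HtmlAgilityPack.HtmlDocument()
        doc.LoadHtml(html)

        ' SelectNodes returns Nothing (not an empty collection) when nothing matches.
        Dim nodes As HtmlNodeCollection = doc.DocumentNode.SelectNodes( _
            "//p[i[contains(., 'Summary as passed House:')]]")
        If nodes Is Nothing Then Return results

        For Each p As HtmlNode In nodes
            ' InnerText drops the tags; DeEntitize turns &nbsp; and &quot; into characters.
            ' The returned text still starts with the label itself.
            results.Add(HtmlEntity.DeEntitize(p.InnerText).Trim())
        Next
        Return results
    End Function

End Module
```

Each returned string still begins with "Summary as passed House:", so the label has to be trimmed off separately if only the body is wanted. How the parser nests the unclosed <p> tags in real-world files may vary, so the XPath expression is a starting point rather than a guaranteed match.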


sknake commented: Great suggestion +15


jamesrobb 0 Newbie Poster · Answer 1 · 2009-09-18T23:20:21+00:00


I've attached the html file as a txt file as per your request.

Thanks.

PreFile1.txt (156.55 KB)

The attachment preview is chopped off after the first 10 KB. Please download the entire file.

<html><head><title>LIS > Bill Tracking > Reports > 2009 session</title></head><body bgcolor="#f7f7f7">
<table align=center width=755><td>
<h3>                   (PreFile1) Prefiled Bills
</h3><hr size=1>
<h3>
HB 1579 Transportation funding, etc; certain revenues attributable to economic growth in Hampton Roads, etc.</h3>
<p>A BILL to amend and reenact  15.2-4838.1, 15.2-4840,
33.1-230.03, 58.1-811, 58.1-2403, 58.1-2425, and 58.1-3221.3 of the Code of
Virginia and to amend and reenact the fifth and sixteenth enactments of Chapter
896 of the Acts of Assembly of 2007; to amend the Code of Virginia by adding in
Chapter 48.1 of title 15.2 a section numbered 15.2-4841, and by adding in Title
33.1 a chapter numbered 10.3, consisting of sections numbered 33.1-391.17 and
33.1-391.18, a chapter numbered 10.4, consisting of sections numbered
33.1-391.19 and 33.1-391.20, a chapter numbered 10.5, consisting of sections
numbered 33.1-391.21 and 33.1-291.22; a chapter numbered 10.6, consisting of
sections numbered 33.1-391.23 and 33.1-391.24; and to repeal Chapter 10.2 (
33.1-391.6 through 33.1-391.15) of Title 33.1,  46.2-755.1, 46.2-755.2,
46.2-1167.1, 58.1-625.1, 58.1-802.1, 58.1-2402.1, and 58.1-3825.1, Article 4.1
( 58.1-1724.2 through 58.1-1724.7) of Chapter 17 of title 58.1 of the Code of
Virginia, and the sixth, thirteenth, fourteenth, fifteenth, eighteenth, and
nineteenth enactments of Chapter 896 of the Acts of Assembly of 2007, relating
to transportation funding and administration in the Northern Virginia and
Hampton Roads areas, and in the Richmond Highway Construction District, and the
Staunton Highway Construction District.<p>091330668</p>
<p><i>Summary as passed House:</i> <br>
<b>Transportation funding and administration</b>.&nbsp;
Provides for transportation funding and administration in Hampton Roads, Northern Virginia, the Richmond Highway Construction District, the Staunton Highway
Construction District, and the Salem Highway Construction District.&nbsp; The
bill repeals the Hampton Roads Transportation Authority and repeals certain
fees and taxes authorized pursuant to Chapter 896 of the Acts of Assembly of
2007 that are within the ambit of the Supreme Court of Virginia's decision on
February 29, 2008, that they are unconstitutional.&nbsp; This bill incorporates
HB 2622.</p>

<p>
<i>Patrons:</i> Oder, Albo, Athey, Cole, Gear, Hamilton, Hugo, Iaquinto,
Knight, Lingamfelter, Miller, J.H., Pogge and Rust
</p>
<font size=-1>
<p>AMENDMENT(S) PROPOSED BY THE HOUSE </p>

<p>DEL. MARSHALL D.</p>

<ul>1. After line 416, substitute
</ul><ul><ul>insert
</ul></ul><ul><ul><ul><u>B. Fifty percent for construction of I-73 beginning in Henry
County and using the existing Route 58 bypass.</u>
<p>DEL. MARSHALL D.</p>

</ul></ul></ul><ul>2. At the beginning of line 417, substitute
</ul><ul><ul>strike
</ul></ul><ul><ul><ul><u>B</u>
</ul></ul></ul><ul><ul>insert
</ul></ul><ul><ul><ul><u>C</u>
<p>DEL. MARSHALL D.</p>

</ul></ul></ul><ul>3. At the beginning of line 426, substitute
</ul><ul><ul>strike
</ul></ul><ul><ul><ul><u>C</u>
</ul></ul></ul><ul><ul>insert
</ul></ul><ul><ul><ul><u>D</u>
<p>DEL. MARSHALL D.</p>

</ul></ul></ul><ul>4. At the beginning of line 431, substitute
</ul><ul><ul>strike
</ul></ul><ul><ul><ul><u>D</u>
</ul></ul></ul><ul><ul>insert
</ul></ul><ul><ul><ul><u>E</u>
</ul></ul></ul>
</font>
<p>
02/11/09 Senate: Constitutional reading dispensed<br>
02/11/09 Senate: Referred to Committee on Finance<br>
02/17/09 Senate: Failed to report (defeated) in Finance (4-Y 12-N)<br>
<font size=-1>
<p>
YEAS--Stosch, Stolle, Quayle, Norment--4.
</p>
<p>
NAYS--Colgan, Wampler, Houck, Howell, Saslaw, Hanger, Watkins,
Miller, Y.B., Marsh, Lucas, Whipple, Reynolds--12.
</p>
<p>
ABSTENTIONS--0.
</p>
<p>
</p>
</font>
02/18/09 Senate: Reconsidered by Finance<br>
02/18/09 Senate: Failed to report (defeated) in Finance (7-Y 9-N)<br>
<font size=-1>
<p>
YEAS--Stosch, Stolle, Quayle, Norment, Watkins, Lucas, Reynolds--7.
</p>
<p>
NAYS--Colgan, Wampler, Houck, Howell, Saslaw, Hanger, Miller, Y.B.,
Marsh, Whipple--9.
</p>
<p>
ABSTENTIONS--0.
</p>
<p>
</p>
</font>
<h3>
HB 1580 Hampton Roads Transportation Authority; abolished, disposition of revenues, etc.</h3>
<p>An Act to amend and reenact  33.1-23.03, 58.1-811,
58.1-2403, 58.1-2425, and 58.1-3221.3 of the Code of Virginia, to amend and
reenact the fifth and sixteenth enactments of Chapter 896 of the Acts of
Assembly of 2007, and to repeal Chapter 10.2 ( 33.1-391.6 through
33.1-391.15) of Title 33.1 and  46.2-755.1, 46.2-755.2, 46.2-1167.1, 58.1-625.1,
58.1-802.1, 58.1-1724.3, 58.1-1724.5, 58.1-1724.6, 58.1-1724.7, and 58.1-2402.1
of the Code of Virginia and the sixth, fourteenth, fifteenth, and nineteenth
enactments of Chapter 896 of the Acts of Assembly of 2007, relating to the
Hampton Roads Transportation Authority and taxes, fees, and charges dedicated
to financing its operation and programs. <p><i>Summary as enacted with Governor's Recommendations:</i><br>
<b>Hampton Roads Transportation Authority. </b>Abolishes the
Authority and the taxes, fees, and charges dedicated to financing its operation
and programs. The bill also makes several technical changes. This bill is the
same as SB 1018.</p>

<p>
<i>Patrons:</i> Oder, Athey, Gear, Hamilton and Iaquinto
</p>
<font size=-1>
<p>GOVERNOR'S RECOMMENDATION</p>

<p>&nbsp;</p>

<ul>1. Line 28, enrolled, after by the
</ul><ul><ul>strike
</ul></ul><ul><ul><ul>Northern Virginia
</ul></ul></ul><ul><ul>insert
</ul></ul><ul><ul><ul><u>applicable regional</u>
<p>&nbsp;</p>

</ul></ul></ul><ul>2. At the beginning of line 29, enrolled
</ul><ul><ul>strike
</ul></ul><ul><ul><ul>all of line 29
<p>&nbsp;</p>

</ul></ul></ul><ul>3. At the beginning of line 30, enrolled
</ul><ul><ul>strike
</ul></ul><ul><ul><ul><u>organization</u>
</ul></ul></ul><ul><ul>insert
</ul></ul><ul><ul><ul><u>organizations</u>
<p>&nbsp;[Rejected]</p>

</ul></ul></ul><ul>4. Line 356, enrolled, after <b><u>Code.</u></b>
</ul><ul><ul>insert
</ul></ul><ul><ul><ul><b><u>Such tax may be used for transportation safety improvements
as determined by such city or county embraced by the Northern Virginia
Transportation Authority.</u></b>
</ul></ul></ul>
</font>
<p>
04/08/09 House: Signed by Speaker as reenrolled<br>
04/08/09 Senate: Signed by President as reenrolled<br>
04/08/09 House: Communicated to Governor<br>
05/06/09 Governor: Approved by Governor-Chapter 864 (effective 7/1/09)<br>
05/06/09 Governor: Acts of Assembly Chapter text (CHAP0864)<br>
<h3>
HB 1581 Highway logo and tourist-oriented directional sign programs; VDOT & Transportation Bd. to provide.</h3>
<p>A BILL to require the Virginia Department of Transportation
and the Commonwealth Transportation Board to revise the Department's highway
logo sign and tourist-oriented directional sign programs to provide for signs
giving directions to senior centers.<p>089768752</p>
<p><i>Summary as introduced:</i><br>
<b>VDOT highway logo and tourist-oriented directional sign programs.</b>
Requires the Virginia Department of Transportation (VDOT) and the Commonwealth
Transportation Board to revise VDOT's highway logo sign and tourist-oriented
directional sign programs to provide for signs giving directions to senior
centers.</p>

<p>
<i>Patron:</i> Toscano
</p>
<p>
08/04/08 House: Prefiled and ordered printed; offered 01/14/09 089768752<br>
08/04/08 House: Referred to Committee on Transportation<br>
01/07/09 House: Impact statement from VDOT (HB1581)<br>
01/20/09 House: Stricken from docket by Transportation<br>
<h3>
HB 1583 License plates, special; issuance to those celebrating State's tobacco heritage.</h3>
<p>A BILL to authorize the issuance of special truck license plates
celebrating Virginia's tobacco heritage; fee.<p>083567796</p>
<p><i>Summary as introduced:</i><br>
<b>Special truck license plates celebrating Virginia's tobacco
heritage.</b> Authorizes the issuance of &quot;tobacco heritage&quot; special
license plates for trucks. This bill was incorporated into HB 2534.</p>

<p>
<i>Patron:</i> Wright
</p>
<p>
08/26/08 House: Prefiled and ordered printed; offered 01/14/09 083567796<br>
08/26/08 House: Referred to Committee on Transportation<br>
01/16/09 House: Assigned Transportation sub: 3<br>
02/05/09 House: Incorporated by Transportation (HB2534-Scott, E.T.)<br>
<h3>
HB 1586 License plates, special; expired authorizations for certain foundations.</h3>
<p>A BILL to repeal Chapters 432 and 634 of the Acts of Assembly
of 2008, relating to special license plates for supporters of the Lake Taylor
Transitional Care Hospital Foundation and supporters of the National D-Day
Memorial Foundation.<p>083626592</p>
<p><i>Summary as introduced:</i><br>
<b>Special license plates; expired authorizations. </b>Repeals
authorizations for issuance of special license plates for which the required
minimum number of prepaid orders was never received. The affected plates are
those for supporters of the Lake Taylor Transitional Care Hospital Foundation
and those for supporters of the National D-Day Memorial Foundation. This bill
was incorporated into HB 2534.</p>

<p>
<i>Patron:</i> Landes
</p>
<p>
09/09/08 House: Prefiled and ordered printed; offered 01/14/09 083626592<br>
09/09/08 House: Referred to Committee on Transportation<br>
01/16/09 House: Assigned Transportation sub: 3<br>
02/03/09 House: Subcommittee recommends incorporating into HB2534 by voice vote<br>
<font size=-1>
<p>
YEAS--Scott, E.T., Gear, Fralin, Knight, Brink, Ebbin, Toscano--7.
</p>
<p>
NAYS--0.
</p>
<p>
ABSTENTIONS--0.
</p>
<p>
NOT VOTING--0.
</p>
</font>
02/05/09 House: Incorporated by Transportation (HB2534-Scott, E.T.)<br>
<h3>
HB 1587 REAL ID Act; State will not comply with provision thereof that they determine would compromise.</h3>
<p>An Act to authorize the Commonwealth's lack of compliance with
certain provisions of the REAL ID Act.<p><i>Summary as passed House:</i> <br>

<b>REAL ID Act; Commonwealth's participation.</b>
 Provides that, with the ex

sknake 1,622 Senior Poster Featured Poster · Answer 2 · 2009-09-19T05:31:12+00:00

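Imports System.Collections.Generic
Imports System.Text.RegularExpressions

Public Class frmRegexParagraph

	Private Sub Button1_Click(ByVal sender As System.Object, ByVal e As System.EventArgs) Handles Button1.Click
		' Load the saved HTML and split it into <p>...</p> blocks.
		Dim str As String() = GetParagraphs(System.IO.File.ReadAllText("C:\testdata.txt"))
		' Pause here to inspect the result in the debugger.
		System.Diagnostics.Debugger.Break()
	End Sub

	Private Shared Function GetParagraphs(ByVal data As String) As String()
		Dim result As New List(Of String)
		' Singleline lets "." match line breaks; the paragraphs in the file span several lines.
		' IgnoreCase also accepts <P> tags.
		Dim m As Match = Regex.Match(data, "<p>\s*(.+?)\s*</p>", RegexOptions.Singleline Or RegexOptions.IgnoreCase)
		While (m.Success)
			' m.Value is the whole match including the <p> tags; m.Groups(1).Value is the inner part only.
			result.Add(m.Value)
			m = m.NextMatch()
		End While
		Return result.ToArray()
	End Function
End Class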
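
The function above returns every <p> block in the file. To narrow the result to the paragraphs the original question asks about, one option is to anchor the pattern on the label itself. The following is a minimal sketch, not a robust HTML parser: it assumes the label always appears as <i>Summary as passed House:</i> and that the summary ends at the next </p>, as in the attached file. The module and function names are hypothetical, and WebUtility.HtmlDecode requires .NET Framework 4.0 or later.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Net
Imports System.Text.RegularExpressions

Module SummaryExtractor

    ' Hypothetical helper: returns the plain text that follows each
    ' "<i>Summary as passed House:</i>" label, up to the closing </p>.
    Public Function GetHouseSummaries(ByVal html As String) As List(Of String)
        Dim summaries As New List(Of String)
        Dim pattern As String = "<i>\s*Summary as passed House:\s*</i>(.*?)</p>"
        ' Singleline: "." also matches line breaks. IgnoreCase: tolerate <I> and </P>.
        Dim options As RegexOptions = RegexOptions.Singleline Or RegexOptions.IgnoreCase

        For Each m As Match In Regex.Matches(html, pattern, options)
            ' Drop remaining tags such as <br> and <b>. This is a crude strip and
            ' can join words that were separated only by a tag.
            Dim text As String = Regex.Replace(m.Groups(1).Value, "<[^>]+>", "")
            ' Turn entities such as &nbsp; and &quot; into characters.
            text = WebUtility.HtmlDecode(text)
            ' Collapse line breaks and runs of white space into single spaces.
            text = Regex.Replace(text, "\s+", " ").Trim()
            summaries.Add(text)
        Next
        Return summaries
    End Function

End Module
```

It could be called the same way as GetParagraphs, for example GetHouseSummaries(System.IO.File.ReadAllText("C:\testdata.txt")). Paragraphs labelled "Summary as introduced:" or "Summary as enacted with Governor's Recommendations:" are skipped because the pattern matches only the "passed House" label.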

jamesrobb 0 Newbie Poster · Answer 3 · 2009-09-21T21:37:40+00:00


Thanks for your help. I appreciate it.

CodeDoctor 0 Light Poster · Answer 4 · 2009-09-22T08:08:20+00:00

How can I extract repeated paragraphs of data from an HTML document? Every paragraph is preceded by the line:
Summary as passed House:
Thanks.

You can obtain all Paragraph tags using the WebBrowser control using the following technique:

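' Assumes WebBrowser1 has finished loading the page (its DocumentCompleted event has fired).
Dim oElements As HtmlElementCollection = WebBrowser1.Document.GetElementsByTagName("p")
For Each oElement As HtmlElement In oElements
     ' InnerText can be Nothing for empty elements, and checking it avoids
     ' depending on how the browser cases the tags in InnerHtml.
     Dim text As String = oElement.InnerText
     If text IsNot Nothing AndAlso text.Contains("Summary as passed House:") Then
        Debug.Print("FOUND")
     End If
Next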

You can then parse the rest to get the specific data within the tags.
Hope this helps.
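
As one possible way to "parse the rest", here is a minimal sketch that takes the text of a matching paragraph (for example the element's InnerText) and returns only what follows the label, with line breaks collapsed. The module, constant, and function names are hypothetical.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module SummaryText

    Private Const SummaryLabel As String = "Summary as passed House:"

    ' Hypothetical helper: returns the text after the label with white space
    ' collapsed, or Nothing when the label is not present.
    Public Function ExtractSummaryBody(ByVal paragraphText As String) As String
        If paragraphText Is Nothing Then Return Nothing

        Dim index As Integer = paragraphText.IndexOf(SummaryLabel, StringComparison.OrdinalIgnoreCase)
        If index < 0 Then Return Nothing

        ' Keep everything after the label, then flatten line breaks and repeated spaces.
        Dim body As String = paragraphText.Substring(index + SummaryLabel.Length)
        Return Regex.Replace(body, "\s+", " ").Trim()
    End Function

End Module
```

Inside the loop, something like Dim body As String = ExtractSummaryBody(oElement.InnerText) would then give the summary text, and a result of Nothing means the paragraph is not a "passed House" summary.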

sknake 1,622 Senior Poster Featured Poster · Answer 5 · 2009-09-22T13:50:40+00:00

I hope one of these solutions solved your issue.

Please mark this thread as solved if you have found an answer to your question and good luck!
