Regex to Parse HTML

Question

Dani 4,658 The Queen of DaniWeb

13 Years Ago

Soooo ... quick question :)

I need to do some HTML parsing with regex :)

I currently use $output = preg_replace('/>\s+</', "> <", $output) to strip whitespace between any two HTML tags. What can I do to strip whitespace only between paragraph tags, for example only between </p> and <p>?

Thanks!

html-css php regex

diafol

13 Years Ago

Sorry Dani, I'm a bit of a noob at regex...
But my DOMDocument is even worse...

$content = "<p>this is your html

file
with a ]
few things
</p>";

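// Callback: $m[1] is the captured <p>...</p> block; nl2br() inserts <br /> before each newline in it.
function breaker($m){
    return nl2br($m[1]);
}
// Matches only attribute-free <p> tags whose content has no nested tags.
echo preg_replace_callback('/(<p>[^<]+<\/p>)/','breaker',$content);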

Change the regex to include paragraph attributes if they are used.
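
As a rough illustration of that last point, here is a minimal sketch of the same callback approach with the pattern loosened to accept attributes on the opening tag. The sample $content and the $result variable are made up for the example, and the pattern keeps the original limitation that a paragraph containing nested tags (such as <b>) will not match.

```php
<?php
// Hypothetical sample input: a paragraph with an attribute, followed by a non-paragraph element.
$content = "<p class=\"intro\">first line\nsecond line</p>\n<div>left\nalone</div>";

// <p\b[^>]*> matches <p> with or without attributes; the \b stops it from matching <pre>, <param>, etc.
// [^<]+ still requires the paragraph body to contain no nested tags.
$result = preg_replace_callback(
    '/(<p\b[^>]*>[^<]+<\/p>)/',
    function (array $m): string {
        // Only the matched paragraph gets its newlines converted to <br />.
        return nl2br($m[1]);
    },
    $content
);

echo $result;
```

Newlines outside the paragraph, including those inside the <div>, are left as they were. Because nl2br() runs on the whole match, an opening tag whose attributes span several lines would also pick up <br /> tags, which is one more reason a DOM-based approach may be safer.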

Dani 4,658 The Queen of DaniWeb Administrator Featured Poster Premium Member · Answer 1 · 2012-03-28T20:32:52+00:00

OK ... Strike all that. I'm going to use PHP's DOMDocument class instead to manipulate the HTML. Soooo here's what I need ... I need to loop through all HTML, and run nl2br() on everything within a <p> tag. Any takers?

Dani 4,658 The Queen of DaniWeb Administrator Featured Poster Premium Member · Answer 2 · 2012-03-28T22:40:20+00:00

Haha, no worries. You made up for it by figuring out the culprit with the %02 bug! :)

In any case, I just got it. Here's my code:

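$output = '';

// loadHTML() wraps the snippet in a full document (doctype, <html>, <body>).
$dom = new DOMDocument();
$dom->loadHTML($html_snippet);

// Walk only the top-level children of <body>, i.e. the original snippet.
$tags = $dom->getElementsByTagName('body')->item(0);
foreach ($tags->childNodes as $tag)
{
    if ($tag->localName === 'p')
    {
        // Paragraphs: serialize the node, then turn newlines into <br />.
        $output .= nl2br($dom->saveHTML($tag));
    }
    else
    {
        // Everything else is serialized unchanged.
        $output .= $dom->saveHTML($tag);
    }
}

echo $output;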

I did the whole getElementsByTagName('body') thing because it kept turning my little HTML snippets (for individual forum posts) into XHTML-compliant HTML documents complete with doctypes, etc.
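
To make that wrapping behaviour concrete, here is a small sketch that compares serializing the whole document with serializing only the children of <body>. The snippet text and the $full and $fragment variable names are invented for the illustration.

```php
<?php
// Hypothetical forum-post snippet; not taken from the thread.
$html_snippet = "<p>line one\nline two</p><div>untouched</div>";

$dom = new DOMDocument();
$dom->loadHTML($html_snippet);

// Serializing the whole document includes the <html> and <body> wrappers that loadHTML() added.
$full = $dom->saveHTML();

// Serializing each child of <body> in turn gives back only the snippet's own markup.
$fragment = '';
$body = $dom->getElementsByTagName('body')->item(0);
foreach ($body->childNodes as $node) {
    $fragment .= $dom->saveHTML($node);
}

echo $full, "\n---\n", $fragment, "\n";
```

Passing a node to saveHTML() needs PHP 5.3.6 or later. Newer PHP versions also accept libxml option flags on loadHTML() that are meant to suppress the implied wrappers, though the body-children loop avoids depending on those.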

Now, I am pleased to say that the parsing bug for converting BBCode to Markdown is finally fixed! :)
