So I am trying to write a neat little script in a WordPress plugin that replaces matching keywords with links to other articles on the site. I think I'm done, except for one little thing below.

I need to fix the preg_replace() call.

It should match anything that...
* is not in between < and > characters
* is not between any <a> and </a> tags (not strictly a bare <a>, mind you, but a tags with hrefs, targets, and anything else you can put in them)
* will not mess up any existing HTML (if you can think of anything)

If you could lend me a hand with this regex stuff and, if you can, give a short explanation, that would be terrific!

<?php
//$words = array(); // this is full of keywords and urls, loaded earlier in the script

krsort($words); // Sort by key (reverse) with the bigger key word groups first.

// Now go through the entire post, and replace the matching and previously unlinked keywords with the links :)

foreach ($words as $k => $v){
    foreach ($v as $keyword => $url){
      $content = preg_replace("/(?!=(?:<a [^>]*>))({$keyword})(?!(?:<\/a>))/si","<a href=\"{$url}\">{$keyword}</a>",$content,1);
    }
}
?>

As of right now, I get these errors (line 67 is the preg_replace() line above):

Warning: preg_replace() [function.preg-replace.html]: Unknown modifier 'B' in C:\xampp\htdocs-wordpress-dev\wp-content\plugins\Johns-Interlink\index.php on line 67

Warning: preg_replace() [function.preg-replace.html]: Unknown modifier 'a' in C:\xampp\htdocs-wordpress-dev\wp-content\plugins\Johns-Interlink\index.php on line 67

Warning: preg_replace() [function.preg-replace.html]: Unknown modifier 'B' in C:\xampp\htdocs-wordpress-dev\wp-content\plugins\Johns-Interlink\index.php on line 67

Warning: preg_replace() [function.preg-replace.html]: Unknown modifier '/' in C:\xampp\htdocs-wordpress-dev\wp-content\plugins\Johns-Interlink\index.php on line 67

Warning: preg_replace() [function.preg-replace.html]: Unknown modifier 'T' in C:\xampp\htdocs-wordpress-dev\wp-content\plugins\Johns-Interlink\index.php on line 67

Warning: preg_replace() [function.preg-replace.html]: Unknown modifier 'P' in C:\xampp\htdocs-wordpress-dev\wp-content\plugins\Johns-Interlink\index.php on line 67

Thank you!

Shouldn't /(?!= at the beginning be /(?!? I'm not a regular expression expert, but I don't remember seeing != together. I haven't used lookarounds in a while.

The = is treated as the first literal character inside the lookahead. Are you sure your keywords do not contain special characters that are interpreted as regex control characters?
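A small sketch of what that means in practice, using made-up test strings rather than the plugin's data. It suggests that (?!=...) is just a negative lookahead whose body begins with "=", while "not preceded by" is written as a lookbehind, (?<!...). As far as I know, PCRE lookbehinds cannot contain an unbounded quantifier such as [^>]*, which is one reason the original pattern is hard to repair directly.

```php
<?php
// (?!=foo) is a negative lookahead: "the text starting here is not =foo".
// It says nothing about what comes BEFORE the match.
$lookahead = preg_match('/(?!=foo)bar/', 'bar');

// (?<!foo) is a negative lookbehind: "the text just before is not foo".
$blocked = preg_match('/(?<!foo)bar/', 'foobar');
$allowed = preg_match('/(?<!foo)bar/', 'bazbar');

var_dump($lookahead, $blocked, $allowed);
```

The first and third calls should report a match and the second should not.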

Honestly, I don't really understand regex, so I am not sure, kkeith.

Pritaeas, the keywords could contain apostrophes, quotes, spaces, 0-9, a-z and A-Z. Do you think any of those would be a problem?

Could someone break down the regex statement I've got so far? What do the errors I'm getting mean? Is it complaining about my keywords? Do I need to 'escape' them, or add something to the regex to indicate that I'm searching for the keyword string?

Regex is just a little over my head.

Thank you!

If you can get some sample input and output, I can test your cases. I can really recommend RegexBuddy (trial version), which lets you see and test matches easily.

Thanks for the input Pritaeas!

I can supply sample data, but it is for a WordPress plugin, so anything that could be used in a WordPress title may be passed. For now I don't need to worry about much besides plain text with spaces, apostrophes, quotes, and possibly parentheses.

Here is the entire function / plugin at the moment:

function wp_incontent_interlink($content){
global $post, $wpdb; // From WP
$id = $post->ID; // Get the post id
$the_post = get_post($id); // Get the post obj
$title = $the_post->post_title; //Pull the Post Title
$perm = get_permalink($post_id); //Get the posts Permalink
$poststable = $wpdb->prefix."posts"; // Get table prefix

// Loop through all posts
$ck = mysql_query("SELECT ID, post_title FROM $poststable WHERE post_status='publish'");
$cck = mysql_num_rows($ck);
if($cck > 0){
while($ack = mysql_fetch_assoc($ck)){
extract($ack); // ID post_author post_date post_date_gmt post_content post_title post_excerpt post_status comment_status ping_status post_password post_name to_ping pinged post_modified post_modified_gmt post_content_filtered post_parent guid menu_order post_type post_mime_type comment_count
$link = get_permalink($ID); //Get the posts Permalink

$tpcs = explode(",",$post_title); // break title up at every comma

foreach ($tpcs as $k => $v){
$v = trim($v);
$wpcs = explode(" ",$v); // explode at every space
$cwpcs = count($wpcs); // count the pieces 
$words[$cwpcs][$v] = $link; // Load array w/ info

unset($wpcs,$cwpcs,$len,$i,$v,$k);

}

} // while
} // if $cck > 0

krsort($words); // Sort by key (reverse) with the bigger ones at the top.

// Now go through the entire post, and replace the matching and previously unlinked keywords with the links :)

foreach ($words as $k => $v){
    foreach ($v as $keyword => $url){
      $content = preg_replace("/(?!=(?:<a [^>]*>))({$keyword})(?!(?:<\/a>))/si","<a href=\"{$url}\">{$keyword}</a>",$content,1);
    }
}


// Return the content
return($content);

}

As you can see in this line:

($tpcs = explode(",",$post_title);).

I explode the titles at commas because my site is about plants, and I set the titles to the general pattern 'scientific / botanical name, common name, common name, etc.'. Each piece is loaded into an array along with the URL of the post that title belongs to.

After everything else is set up, we order the $words array so the title segments with the most words come first and the shortest come last, i.e.

krsort($words);
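To make the grouping concrete, here is a minimal sketch of how $words ends up ordered. The titles and URLs are invented sample values, not data from the site.

```php
<?php
// Made-up titles and permalinks, standing in for the rows read from the posts table.
$titles = array(
    'Post 1, the first post, 1st post'     => '/post-1/',
    'Post 2, second post, the second post' => '/post-2/',
);

$words = array();
foreach ($titles as $title => $link) {
    foreach (explode(',', $title) as $piece) {
        $piece = trim($piece);
        // First key: number of words in the piece. Second key: the piece itself.
        $words[count(explode(' ', $piece))][$piece] = $link;
    }
}

krsort($words); // highest word count first, so longer phrases are linked before shorter ones

print_r($words);
```

Here the three-word pieces ("the first post", "the second post") come before the two-word ones, so "the second post" gets a chance to be linked before the shorter "second post".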

Finally, we start with the longest title segments and call preg_replace(). If a case-insensitive exact match is found that is not already part of a link, inside an HTML tag, or anywhere else that would break the markup, the matched words are replaced with a hyperlink. The link reads the same as the matched text, but now points to the article the keyword was derived from.

It's as straightforward as you might think. I'd use str_replace() if I could get away without having to bother with existing HTML tags!

Some sample data might be

Title: Post 1, the first post, 1st post
Content: Post 1 is the <a href="otherlink">first post</a>, it precedes the 2nd post (second post).

Title: Post 2, second post, the second post, 2nd post
Content: The second post is after the first post - <a href="post 1" alt="post 1">Back to post 1</a>.

Title: Post 3, the third post, 3rd post
Content: The 3rd post is last, after the second post.

And we'd get some kind of output like this:

Title: Post 1, the first post, 1st post
Content: <a href="linkhere">Post 1</a> is the <a href="otherlink">first post</a>, it precedes the <a href="linkhere">2nd post</a> (<a href="linkhere">second post</a>).

Title: Post 2, second post, the second post, 2nd post
Content: <a href="linkhere">The second post</a> is after <a href="linkhere">the first post</a> - <a href="post 1" alt="post 1">Back to post 1</a>.

Title: Post 3, the third post, 3rd post
Content: The <a href="linkhere">3rd post</a> is last, after <a href="linkhere">the second post</a>.

Notice the second link in the Post 1 content and the last link in Post 2: they are left unmodified. Matching text is skipped if it is inside angle brackets <dont match me> or inside any <a wildcard>dont match me</a> tags.

Please ask any questions, I will give answers! Sorry for the lengthy post, I hope it's clear enough to understand, and thank you for your help!

Just to focus in again and restate the goal here, everything is good to go with the exception of the regex statement:

$content = preg_replace("/(?!=(?:<a [^>]*>))({$keyword})(?!(?:<\/a>))/si","<a href=\"{$url}\">{$keyword}</a>",$content,1);

Thanks :)

How about this:

$content = preg_replace(
    "/(?!(?:<a [^>]*>))({$keyword})(?!(?:<\/a>))/si", 
    "<a href=\"{$url}\">{$keyword}</a>", 
    $content);

Thank you Pritaeas!

I double-checked: my sample data / local WordPress install did have post titles with forward slashes (/) in them. After running str_replace("/","",$title); on the titles, the errors no longer show up and the expression works as hoped.
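The "Unknown modifier" warnings fit that finding: with / as the delimiter, an unescaped / inside a title ends the pattern early, and whatever follows is read as modifiers. An alternative to stripping the slashes is to escape the keyword with preg_quote(), passing the delimiter as the second argument. A minimal sketch, where keyword_pattern() is a hypothetical helper and the plant name is an invented sample title:

```php
<?php
// Hypothetical helper (not part of the plugin): build a safe pattern for one keyword.
function keyword_pattern($keyword) {
    // preg_quote() escapes regex metacharacters such as ( ) . ? and, because "/"
    // is passed as the second argument, the delimiter as well.
    return '/' . preg_quote($keyword, '/') . '/i';
}

$pattern = keyword_pattern('Acer rubrum / Red Maple (tree)');
$result  = preg_replace($pattern, '[link]', 'About acer rubrum / red maple (tree) care.', 1);

echo $result, "\n"; // About [link] care.
```

This only addresses the escaping problem; it does not by itself keep matches out of existing tags or links.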

Thanks for the help. I'll mark this as solved, but if anyone would like to take a moment to explain the expression to me that would be great!

Thanks again!

Thank you pritaeas, I think I understand the lookbehinds and such, but then something throws me off and I feel like it doesn't make sense. This is the hardest thing I've ever had to understand.

I tested and found that I made a mistake, the regex does not work. I got excited when I saw it link some text on my development blog, but I didn't test it fully. (I have marked the thread as unsolved again).

It appears to mark every instance of '$keyword', with a few minor exceptions. I have tried my best for the last week and I cannot seem to figure this out. I've come close, but no cigar.

Here you can see the results of my testing: http://regexr.com?313m8 (match 'layering' keyword in this example)

This one seems to work, (?!<.?)(?!<a)(layering)(?!<\/a>)(?![^<>]?>), except it still matches keywords inside links that have other HTML between the <a> and </a> tags, something like <a href="whatever"><em>layering</em></a>.

Could we take another stab at this for me?

Thank you!

I can try, but the consensus is that regular expressions are not the recommended way to parse HTML. If the page is valid XML, then a DOM parser would work much better.

Yes, I've also read that... but it's not reliably formatted XML, and even if it were, I'm not sure that I could get the desired results without major nested-loop complexity. The regex is complicated, but it's reliable once it's done.

Thank you for taking the time to help me with this :)

So what do you want with the <a><em>keyword</em></a> example? Can it match or not?

The goal of the project is to place links in text, so it should not match anything at all in between <a> and </a>. In the example above, the regex works perfectly except when there are other HTML elements inside of the <a> and </a>.

<a><em>keyword</em></a> should not match.

Thanks!

I got to this:

(?!(?:<a [^>]*>)(?:<\w+[^>]*>)*)(keyword)(?!(</\w+>)(?:<\/a>))

But you will encounter new problems that will (not) match as desired. A parser is much more capable in these kinds of situations.

Wouldn't a simple finite-state machine work here? I will write a class tonight and post the results for you.

They can be slow though. Are you storing the text after parsing or doing this every time the text is needed?

It will be used as a WordPress plugin, so it will run on the fly for every article. When the plugin is disabled, the text shows normally. Some pages may have 10 articles, so the plugin would run 10 times, once for each article. Other pages have just 1 article, and would run the entire plugin once.

One of the aspects of this plugin is the idea that as I add new articles, old articles will automatically link to the new posts. And vice versa, new posts will automatically link to old posts. This is so that when other sites scrape my posts I will get backlinks, and it also gives legitimate users a nudge to check out other articles on the site.

I've been thinking about other ways to accomplish the same result. I imagine that if I used an XPath / DOM class, it would get very complicated. I would have to parse the HTML with XPath, then loop through each keyword and check for word matches in the parsed HTML. If, and only if, the text that matches is not inside <a> and </a> tags would we then place a link. This seems the next likely way to go if regex doesn't work.
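For what it's worth, the XPath route may be shorter than it sounds. The sketch below is one possible shape, not a tested plugin: link_keyword_dom() is a hypothetical helper, it assumes PHP's DOM extension is available, and it assumes ASCII content (stripos() and strlen() count bytes, while splitText() counts characters). The XPath expression selects only text nodes with no <a> ancestor, so tag names, attributes, and existing link text are never touched.

```php
<?php
// Hypothetical helper: link the first occurrence of $keyword that is not inside <a>...</a>.
function link_keyword_dom($html, $keyword, $url) {
    $doc = new DOMDocument();
    libxml_use_internal_errors(true);   // sloppy HTML raises warnings; collect them quietly
    $doc->loadHTML('<div id="wrap">' . $html . '</div>'); // wrapper gives the fragment one root
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    // Text nodes under the wrapper that have no <a> ancestor at any depth.
    $nodes = $xpath->query('//div[@id="wrap"]//text()[not(ancestor::a)]');
    foreach ($nodes as $node) {
        $pos = stripos($node->nodeValue, $keyword);
        if ($pos === false) {
            continue;
        }
        $match = $node->splitText($pos);        // $match now starts at the keyword
        $match->splitText(strlen($keyword));    // cut off whatever follows the keyword
        $a = $doc->createElement('a');
        $a->setAttribute('href', $url);
        $match->parentNode->replaceChild($a, $match);
        $a->appendChild($match);                // keeps the original casing of the text
        break;                                  // link only the first occurrence
    }

    // Serialize the wrapper's children so the wrapper itself is not returned.
    $out = '';
    foreach ($xpath->query('//div[@id="wrap"]')->item(0)->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

echo link_keyword_dom('Try <a href="x"><em>layering</em></a> or layering today.', 'layering', '/layering/');
```

The parser will also normalize malformed markup on the way through, so the returned HTML may differ slightly from the input even where no link was added.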

Alternatively, if there was some way to keep track of where each HTML element existed inside of the text, some logic could be worked out to test for <a></a> tags, but that does not seem like the right way to approach this solution.

Can we brainstorm for a moment and see if there are better ways to approach this?

pritaeas, this didn't work at first in my sample/example tool (http://regexr.com/?313cp), it's gonna take me a little time to go through it slowly and understand it, if you got it to work I'm sure it's all there, I'm gonna see what all it does. Thank you for taking the time to bust that out!

It checks for additional tags right after the opening <a>.

I agree with kkeith29. You should be able to build a simple parser. The results will be much more accurate. For example, you can search the position of the first word, and then search back manually if it is preceded by tags. A second option would be to build an array of the tags and text, so you can easily find your keyword, and then loop back through the array for any tags. Just some thoughts.
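The second option, an array of tags and text, could be sketched roughly like this. link_keyword_tokens() is a hypothetical name, and the sketch assumes attribute values never contain a literal > character. preg_split() with PREG_SPLIT_DELIM_CAPTURE returns text and tags alternately, so the loop only has to remember whether it is currently between <a> and </a>.

```php
<?php
// Hypothetical helper: link the first occurrence of $keyword found in plain text
// that is outside any <a>...</a> pair.
function link_keyword_tokens($html, $keyword, $url) {
    // Even indexes hold text, odd indexes hold the tags that separated it.
    $parts = preg_split('/(<[^>]*>)/', $html, -1, PREG_SPLIT_DELIM_CAPTURE);
    $inAnchor = false;

    foreach ($parts as $i => $part) {
        if ($i % 2 === 1) {                            // a tag: update state, never edit it
            if (preg_match('/^<a[\s>]/i', $part)) {
                $inAnchor = true;
            } elseif (preg_match('/^<\/a\s*>/i', $part)) {
                $inAnchor = false;
            }
            continue;
        }
        if ($inAnchor) {
            continue;                                  // text inside an existing link
        }
        $pos = stripos($part, $keyword);
        if ($pos === false) {
            continue;
        }
        $text = substr($part, $pos, strlen($keyword)); // keep the casing found in the post
        $link = '<a href="' . htmlspecialchars($url) . '">' . $text . '</a>';
        $parts[$i] = substr_replace($part, $link, $pos, strlen($keyword));
        break;                                         // one link per keyword
    }
    return implode('', $parts);
}

echo link_keyword_tokens(
    'Post 1 is the <a href="otherlink">first post</a>, see the first post.',
    'first post',
    'linkhere'
);
```

Because tags live in their own array slots, a keyword inside an attribute such as alt="..." is never a candidate, and nested markup like <a><em>keyword</em></a> is skipped because the state stays "inside anchor" until the closing </a>.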

I didn't have time to write the code last night.

Would something like this work?

$text =<<<TEXT
Blah Blah Blah <a class="keyword" href="">Blah Blah</a> alskdjflakjsdf alsdj alkdsj flaksdjflaksjdflk
TEXT;

$keywords = array();     // plain text of each existing link
$replacements = array(); // full original <a>...</a> markup, indexed by placeholder number

function link_replace( $match ) {
    global $keywords,$replacements;
    $idx = count( $replacements );
    $keywords[$idx] = strip_tags( $match[1] );
    $replacements[$idx] = $match[0]; // keep the whole link so it can be restored later
    return "[[replacement:{$idx}]]";
}

// \b keeps tags like <abbr> from matching; s lets . span newlines, i ignores case
$text = preg_replace_callback('#<a\b[^>]*>(.*?)</a>#is','link_replace',$text);

//text without anchors is now in $text variable, do your keyword replacement here checking the $keywords variable to make sure it wasn't already linked
//after you are done, run a regular expression for the placeholders and put the links back.

I think this would work for you, might be wrong though.
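Here is one way the whole round trip might look, including the restore step the comments describe. It is a sketch under assumptions: link_keyword_protected() is a hypothetical name, closures require PHP 5.3 or later, and it assumes no keyword collides with the placeholder text itself. The lookahead (?![^<>]*>) is borrowed from the pattern posted earlier in the thread to skip matches that sit inside a remaining tag.

```php
<?php
// Hypothetical helper: hide existing links, link the keyword once, then restore the links.
function link_keyword_protected($html, $keyword, $url) {
    $saved = array();

    // 1. Swap every <a>...</a> for a numbered placeholder.
    $text = preg_replace_callback('#<a\b[^>]*>.*?</a>#is', function ($m) use (&$saved) {
        $saved[] = $m[0];
        return '[[replacement:' . (count($saved) - 1) . ']]';
    }, $html);

    // 2. Link the first keyword that is not inside a tag: the lookahead fails
    //    when the next angle bracket after the match is a closing ">".
    $pattern = '/' . preg_quote($keyword, '/') . '(?![^<>]*>)/i';
    $text = preg_replace_callback($pattern, function ($m) use ($url) {
        return '<a href="' . htmlspecialchars($url) . '">' . $m[0] . '</a>';
    }, $text, 1);

    // 3. Put the original links back.
    return preg_replace_callback('#\[\[replacement:(\d+)\]\]#', function ($m) use ($saved) {
        return $saved[(int) $m[1]];
    }, $text);
}

echo link_keyword_protected(
    'Learn <a href="x"><em>layering</em></a> and <img alt="layering"> layering.',
    'layering',
    '/layering/'
);
```

Since the existing links are out of the string during step 2, nested markup inside them no longer matters to the keyword pattern.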

Thanks for the ideas and the code here. I'll give this a try when I get back from work tonight.

pritaeas, you are right, some kind of parser is looking more and more like the way to go with this. It's gonna take some time to think through, though.

I'll be back later tonight guys, I'll let you know how it goes.

Thanks again for the help!

Alright, so I've taken a stab at a system that accomplishes the goal without using regex. It's not a lot of code, but does use a bit of CPU.

Any other suggestions? Any other luck on the regex? Thanks!

<?php
/*
Plugin Name: Johns Interlink
Version: 0.1
Plugin URI: http://wp-plugins.iluvjohn.com/?plugin=johns_interlink
Description: Automatically links to other articles. Each title is broken up at each comma, any of those that match are linked. Be sure not to have duplicate post titles.  
Author: John Minton
Author URI: http://iluvjohn.com/
*/

/*  Copyright 2010 John Minton (cjohnweb@gmail.com)

    This program is free software; you can redistribute it and/or modify
    it under the terms of the GNU General Public License, version 2, as 
    published by the Free Software Foundation.

    This program is distributed in the hope that it will be useful,
    but WITHOUT ANY WARRANTY; without even the implied warranty of
    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
    GNU General Public License for more details.

    You should have received a copy of the GNU General Public License
    along with this program; if not, write to the Free Software
    Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
*/

/*****************************Inter Link**************************************/

// Function links keywords in the post content to other posts with matching titles
function wp_johns_interlink($content){
global $post, $wpdb; // From WP
$id = $post->ID; // Get the post id
$the_post = get_post($id); // Get the post obj
$title = $the_post->post_title; //Pull the Post Title
$perm = get_permalink($id); //Get the post's permalink
$poststable = $wpdb->prefix."posts"; // Get table prefix


// Need to save these settings in the db or something:
$blank_target = false; // Set to true and all generated links will have target='_blank'.


// Loop through all posts
$ck = mysql_query("SELECT ID, post_title FROM $poststable WHERE post_status='publish'");
$cck = mysql_num_rows($ck);
if($cck > 0){
while($ack = mysql_fetch_assoc($ck)){
extract($ack); // ID post_author post_date post_date_gmt post_content post_title post_excerpt post_status comment_status ping_status post_password post_name to_ping pinged post_modified post_modified_gmt post_content_filtered post_parent guid menu_order post_type post_mime_type comment_count
$link = get_permalink($ID); //Get the posts Permalink

$tpcs = explode(",",$post_title); // break title up at every comma

foreach ($tpcs as $k => $v){
$v = str_replace("/","",$v);
$v = trim($v);
$wpcs = explode(" ",$v); // explode at every space
$cwpcs = count($wpcs); // count the pieces 
$words[$cwpcs][$v] = $link; // Load array w/ info

unset($wpcs,$cwpcs,$len,$i,$v,$k);

}

} // while
} // if $cck > 0

krsort($words); // Sort by key (reverse) with the bigger ones at the top.

// Now we match keywords and link them, etc.
foreach ($words as $k => $v){
foreach ($v as $keyword => $url){
$finished = false;
$tracker = 0;

while($finished != true){ // The while loop is for finding multiple keyword matches
//$loops++; // for debugging
$stripos = stripos($content,$keyword,$tracker);
if($stripos !== false){
$pass_check = true; // Default
$tpos = $stripos;

// Check if inside of < & >: walk backwards to the nearest angle bracket
while($tpos >= 0){
if($content[$tpos] == ">"){$pass_check = true; break; }
if($content[$tpos] == "<"){$pass_check = false; break; }
$tpos--;
}


// Check if inside of <a> & </a>: walk backwards to the nearest anchor tag
if($pass_check == true){
$tpos = $stripos;
while($tpos >= 0){
// substr() avoids reading past the end of the string
if(strtolower(substr($content,$tpos,4)) == "</a>"){$pass_check = true; break;}
// Require whitespace or > after "<a" so <abbr>, <article>, etc. are not mistaken for links
if(preg_match('/^<a[\s>]/i',substr($content,$tpos,3))){$pass_check = false; break;}
$tpos--;
}
}

$tracker = $stripos+1; // Default: resume the search just past this match

if($pass_check == true){
$strlen = strlen($keyword);

if($blank_target == true){$target = " target='_blank'";}else{$target = "";}
$replacement = "<a href='$url'".$target.">".ucwords($keyword)."</a>";
$content = substr_replace($content,$replacement,$stripos,$strlen);
$tracker = $stripos+strlen($replacement); // Skip over the link we just inserted
}

}else{$finished = true;}
}
unset($loops);
}
}

// Return the content
return($content);

}

/**********************************HOOKS***************************************/

add_filter("the_content", "wp_johns_interlink");

?>

Marking solved, since I've figured out an alternative method to solve the issue. It would be awesome to find some regex that would also achieve the same effect, so if you happen to come across it some day, please post it :)

Thanks everyone.
