I’m building a tagger that searches through a corpus of IM data and tags any instances of words that occur on a wordlist. I've run into a problem and was hoping to find help. I'd like to try and understand exactly why it's not working, so I've laid out everything I can think of that might be helpful.

data = ["@LINE2@ 04-09-2006/DAT 09:05:30/TIM [Team]/CHT  @NAME@ Digs Phung @NAME@ @CONTENT@ you might be crazy @CONTENT@ @LINE2@\n"]

wordlist = ["a", "an", "aardvark", "aardvarks", "aback", "abacus", "you", "be"]

import re

def tagger(data_string):
	string_copy = data_string
	for entry in wordlist:
		p = re.compile("\s" + entry + "\s", re.IGNORECASE)
		g = p.search(string_copy)
		if g == None:
			pass
		else:
			h = g.group()
			space_copy = string_copy.replace(h, h + "/TAG ")
			string_copy = space_copy.replace(" /TAG", "/TAG")
	return string_copy

tagged = []

for line in data:
	x = tagger(line)
	tagged.append(x)

This works as expected, producing:

tagged = ['.../CHT @NAME@ Digs Phung @NAME@ @CONTENT@ you/TAG might be/TAG crazy @CONTENT@ @LINE3@']

But when I do the same thing to the full wordlist (~40k words) and data (a list with ~1 million strings), I get the following error:

Traceback (most recent call last):
  File "<pyshell#197>", line 2, in <module>
    x = tagger(data_string)
  File "<pyshell#195>", line 4, in tagger
    p = re.compile("\s" + entry + "\s", re.IGNORECASE)
  File "C:\Python25\lib\re.py", line 188, in compile
    return _compile(pattern, flags)
  File "C:\Python25\lib\re.py", line 241, in _compile
    raise error, v # invalid expression
error: unexpected end of regular expression

I've run a similar function over the full dataset with smaller wordlists (~50) and haven't had any problems, so I figured that the issue was with something in this particular wordlist. Here's what I've already done:

1) I've tested the wordlist to make sure it only contains alphanumeric characters, because I thought other characters might be interfering.

2) The function moves through the wordlist once, and then the error message pops up. The second line of data (where it hangs) is above.

3) I swapped out the variable names to words that couldn't appear on the wordlist, in case there was some conflict there.

4) I did some searches for the error message, but the explanations were way over my head.

Any guidance?

I've had this happen to me too, where something works on a small scale, but after I've run it against something huge it bombs out. In my case it was a web spider. I don't know what was causing it.

The reason I'm replying is that I was wondering if it would be easier to use string.find() instead of regular expressions to find your tagged words. Or even split the line into individual words and see if each one is in your wordlist:

data = ["@LINE2@ 04-09-2006/DAT 09:05:30/TIM [Team]/CHT  @NAME@ Digs Phung @NAME@ @CONTENT@ you might be crazy @CONTENT@ @LINE2@\n"]

wordlist = ["a", "an", "aardvark", "aardvarks", "aback", "abacus", "you", "be"]

import re

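def tagger(data_string):
	string_copy = ''
	# Check each whitespace-separated word against the wordlist
	for word in data_string.split():
		if word in wordlist:
			# No space before /TAG, matching the expected output
			string_copy += word + '/TAG '
		else:
			string_copy += word + ' '
	return string_copy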

tagged = []

for line in data:
	x = tagger(line)
	tagged.append(x)

Note that I didn't test this code at all because I was in a hurry. It might be completely wrong, I was just running with an idea. Also, in the time it took me to write this last part I could have just tested the code. I'm also lazy in addition to being in a hurry.
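With ~40k words and ~1 million lines, looking each word up in a list gets slow, because `word in wordlist` scans the list every time. Here is a minimal sketch of the same split-and-check idea using a set instead. The `wordset` parameter and the lowercasing are assumptions added for illustration (the lowercasing mimics what `re.IGNORECASE` did in the regex version); they are not part of the code above.

```python
def tagger(data_string, wordset):
    """Append /TAG to every whitespace-separated token found in wordset."""
    tagged_tokens = []
    for token in data_string.split():
        # Lowercase the token for the lookup so matching ignores case,
        # similar to re.IGNORECASE in the regex version.
        if token.lower() in wordset:
            tagged_tokens.append(token + "/TAG")
        else:
            tagged_tokens.append(token)
    return " ".join(tagged_tokens)


wordlist = ["a", "an", "aardvark", "aardvarks", "aback", "abacus", "you", "be"]
# Build the set once, outside the loop over lines; set lookups do not
# scan the whole collection the way list lookups do.
wordset = set(word.lower() for word in wordlist)

line = "@CONTENT@ You might be crazy @CONTENT@"
result = tagger(line, wordset)
print(result)  # @CONTENT@ You/TAG might be/TAG crazy @CONTENT@
```

Note that `split()` collapses runs of whitespace and drops the trailing newline, so the output spacing will not be identical to the input line. Words with punctuation attached (for example `you,`) will not match unless the punctuation is stripped first.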


mn_kthompson, I've been making a web crawler (web spider) and I think I know what your error was: I think it was the login sites, because you don't have the auth. But I just don't know how to implement mine into a search engine.

Thanks, Thompson. That works perfectly. I've tested it on the full dataset and wordlist with no errors. I guess the lesson here for me is that if you can't figure out why you're getting an error, develop a new strategy that avoids the problematic bit (in this case, re).
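For anyone who wants to stay with regular expressions: the thread never pins down the cause, but one plausible guess is that at least one wordlist entry contains a character the regex engine treats specially, such as an unclosed `[` or `(`. As far as I know, an unclosed `[` is one input that produces "unexpected end of regular expression" on older Python versions. The sketch below tries to compile each entry the same way the original tagger did and reports the ones that fail. The `find_bad_entries` helper and the sample entries are hypothetical, and the `except ... as` syntax assumes Python 2.6 or later.

```python
import re


def find_bad_entries(wordlist):
    """Return (index, entry, message) for entries that fail to compile."""
    bad = []
    for index, entry in enumerate(wordlist):
        try:
            # Same pattern shape as the original tagger.
            re.compile(r"\s" + entry + r"\s", re.IGNORECASE)
        except re.error as exc:
            bad.append((index, entry, str(exc)))
    return bad


# Hypothetical wordlist: "what(" and "back[" stand in for stray entries.
sample = ["you", "be", "what(", "back["]
bad_entries = find_bad_entries(sample)
for index, entry, message in bad_entries:
    print(index, repr(entry), message)

# re.escape() makes the entry match literally, so the pattern compiles
# even when the entry contains regex metacharacters.
safe = re.compile(r"\s" + re.escape("what(") + r"\s", re.IGNORECASE)
print(safe.search("say what( now") is not None)  # True
```

If the check reports nothing for the real wordlist, the cause lies elsewhere and this guess is wrong. Either way, wrapping each entry in `re.escape()` is a cheap safeguard when building patterns from data.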
