Hello.
I have a homework assignment. I have been asked to create a web crawler that can visit a music website and, as a first step, collect the names of singers whose names start with the letter "A".

Now I need a little help with this step. How should my crawler work out which words on the page are singers' names? It should find the names in a particular tag, correct? But what kind of tag? The names could be in any tag: an <h4></h4>, for example, or a single <p></p>, a <b></b>, a <ul></ul>, or any other tag!
So I just need a hint to find the way. Any ideas?

For starters, you might test each word to see if it begins with 'A' ?

If it does ...
then you might look it up in a dictionary of just names to see if it is a name?

Well, @David W, I can check every word on the page, but how do I tell whether a word that starts with the letter "A" is a singer's name or just some other word that starts with "A"?
How should my crawler tell human names apart from any other word that starts with the letter "A"?

What do you mean by looking it up in a dictionary of just names? Which dictionary do you mean?

You may have to hunt the web for a big file of names ...

When you find suitable ones, process and merge them into one big set (of unique names).

(Actually ...
just keep names that begin with 'A' in your set of processed names.)

Now just look up each word that begins with 'A' to see if it is in that set.

(Set look up times are very short.)
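A minimal sketch of the merge-and-look-up idea described above. The two sample lists, and the helper names build_a_names and find_a_names, are made up for illustration; in practice the lists would come from whatever name files you find on the web.

```python
def build_a_names(*name_lists):
    """Merge several name lists into one set, keeping only names that start with 'A'."""
    merged = set()
    for names in name_lists:
        for name in names:
            name = name.strip()
            if name.startswith("A"):
                merged.add(name)  # a set silently drops duplicates
    return merged


def find_a_names(words, a_names):
    """Return the words that start with 'A' and are known names, in page order."""
    return [w for w in words if w.startswith("A") and w in a_names]


# Hypothetical sample data standing in for name files found on the web.
list_one = ["Adele", "Beck", "Aaliyah"]
list_two = ["Adele", "Akon", "Madonna"]
a_names = build_a_names(list_one, list_two)

page_words = "Albums And Songs By Adele Akon Beck".split()
print(find_a_names(page_words, a_names))  # ['Adele', 'Akon']
```

Note that "Albums" and "And" also start with 'A' but are rejected because they are not in the set. Splitting on whitespace only handles one-word names; multi-word names would need the text of a whole tag to be looked up instead.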

Well, OK, I will try it this way, but I think there should be another, simpler way that we have not found yet.

A dictionary in this case would be a list of special names you expect to find. Simply create a text file with each name on its own line, which you can read into Python code to form a list or a set (a set removes any duplicates). You could assume that each name starts with a capital letter, be it the name of the composer, performer, group, title etc.

Start with a small list for testing and add more names as you find them.
You might be able to find these special names on the internet.
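Reading such a file into a set could look roughly like this. The function name load_names and the sample names are assumptions for the sake of the example, and a temporary file is written first only so that the sketch runs on its own.

```python
import os
import tempfile


def load_names(path):
    """Read one name per line into a set; blank lines are skipped."""
    with open(path, encoding="utf-8") as f:
        return {line.strip() for line in f if line.strip()}


# Build a small throwaway file so the example is self-contained.
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, encoding="utf-8"
) as tmp:
    tmp.write("Adele\nAkon\n\nAdele\nBeck\n")
    names_path = tmp.name

names = load_names(names_path)
os.remove(names_path)  # clean up the throwaway file

print(sorted(names))  # ['Adele', 'Akon', 'Beck']
```

The duplicate "Adele" and the blank line both disappear, so you can keep appending names to the file without worrying about repeats.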

You can also use compound regexes and counters to filter the HTML. There is a nice regex feature, the named backreference, that lets you require that an arbitrary string matched earlier in the pattern appears again later.

self.smart_pattern = re.compile(r'(((?P<size>[0-9]+)x(?P=size))/(?P<context>(actions|animations|apps)))')

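matches 24x24/apps, but does not match 24x23/apps, because (?P=size) must repeat whatever the size group captured. It yields 24 in the named group size and apps in the named group context (the full 24x24 is numbered group 2). You can work similar magic with HTML blocks.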
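Here is a small runnable check of that pattern, followed by one way the same trick might be applied to HTML. The tag_pattern and the sample HTML are hypothetical, and a regex like this is only a sketch: it does not cope with attributes or nested tags, so a real crawler would probably want a proper HTML parser.

```python
import re

# Same pattern as above, minus the "self." prefix so it runs outside a class.
smart_pattern = re.compile(
    r'(((?P<size>[0-9]+)x(?P=size))/(?P<context>(actions|animations|apps)))'
)

m = smart_pattern.search("24x24/apps")
print(m.group("size"), m.group("context"))   # 24 apps
print(smart_pattern.search("24x23/apps"))    # None

# Hypothetical HTML variant: the closing tag must repeat the opening tag name.
tag_pattern = re.compile(r'<(?P<tag>h4|p|b|li)>(?P<text>[^<]+)</(?P=tag)>')

html = "<h4>Adele</h4><p>Some text</p><b>Akon</b><li>Beck</b>"
found = [t.group("text") for t in tag_pattern.finditer(html)]
print(found)   # ['Adele', 'Some text', 'Akon']
```

The mismatched <li>Beck</b> is skipped because (?P=tag) insists on </li>. The extracted strings are still only candidates; they would then go through the name look up.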

str.istitle() can also be used to verify that each separate word in the string is capitalized (as a name would be).

@iJunkie22, can you explain your last post, about str.istitle(), please? A little example would be good.

Take a look:

# str.istitle.py #

myStrs = [ "Sam S. Smith", "Ann a. anderson", "Anne Anderson" ]

mySet = set(myStrs) # pretend this is your big dictionary of names

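# istitle() is True when each word starts with an uppercase letter
# and the rest of the word is lowercase
for item in myStrs:
    print( "'" + item +"'.istitle() is", item.istitle() )

print()

# check whether the first character is 'A'
for item in myStrs:
    print( "'" + item +"'[0] == A is", item[0] == 'A' )

print()

# cheap first-letter test first, then the set look up
for item in myStrs:
    if item[0] == 'A':
        if item in mySet:
            print( "'" + item +"'[0] == A is", item[0] == 'A' )
            print( item, "is in", mySet )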

No problem ... :)

There are many ways ...

Sometimes ... simple is easier to see for students.

But yes, I prefer using a format string in many contexts.

Also, the potential scale of the material to crawl makes this a great example of why the order of comparisons is important. It all comes down to what is called "Big O".

Given that you want to check:

  • test1: string is in your list of names
  • test2: string starts with "A"
  • test3: string is titlecase

Think of Big O as describing how the cost of a test grows as its input grows:

The cost of running test1 = # of input strings * # of artist names (a list is searched one entry at a time)

The cost of running test2 = # of input strings * 1 (it compares the first character and stops)

The cost of running test3 = # of input strings * # of words in input string

In theory, this makes test2 the least costly, with test1 and test3 tied for 2nd. In practice, however, names are rarely more than 4 parts, and the list of famous people is much larger than 4, so test1 is the most expensive. This means that you can make your crawler far more efficient by simply reordering the tests into this order:

  1. string starts with "A"
  2. string is titlecase
  3. string is in your list of names
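The reordered tests can be sketched as a single expression, since Python's `and` stops at the first test that fails. The function name is_a_singer and the sample data are made up for illustration. The names are kept in a set here, which makes the last test cheap on average as well, but the ordering costs nothing and still helps if you use a list.

```python
def is_a_singer(text, names):
    """Apply the cheapest checks first; `and` stops at the first failure."""
    return text.startswith("A") and text.istitle() and text in names


names = {"Adele", "Anne Anderson", "Beck"}   # hypothetical name set
candidates = ["Adele", "Albums", "Ann a. anderson", "Anne Anderson", "Beck"]

print([c for c in candidates if is_a_singer(c, names)])
# ['Adele', 'Anne Anderson']
```

One caveat: istitle() rejects names such as "AC/DC", so depending on the site you may prefer to drop the titlecase test and rely on the look up alone.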

For a better explanation of "Big O", read this
