Regex is one of the more complicated modules that you can use in Python. Once you have learnt it, though, you can use it in many different programming languages, so it's a useful tool for working with strings.
So first, to use regex you must import it:
import re
This loads the module for us to use.
The re module is designed to make strings easy to search and manipulate, and it is often used to check that input is in the correct format.
For example:
r = input("Please enter an email address: ")  # on Python 2, use raw_input instead
But how do you know, without complicated checking, that they have entered the right format of something@something.com? Normally we would need to find the index of the '@' symbol, make sure the address ended with the right text (.com), and check that it was all in the right order.
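For comparison, here is a rough sketch of what that manual checking could look like. The function name looks_like_email_manual is made up for this illustration, and the check is deliberately minimal: it does not look at which characters are used, only at the '@' and the '.com' ending.

```python
def looks_like_email_manual(address):
    """Hand-rolled check for something@something.com (illustrative helper)."""
    at_index = address.find("@")       # -1 if there is no '@' at all
    if at_index < 1:                   # no '@', or nothing in front of it
        return False
    domain = address[at_index + 1:]    # everything after the first '@'
    if "@" in domain:                  # a second '@' is not allowed
        return False
    # the domain must end in '.com' and have something before the '.com'
    return domain.endswith(".com") and len(domain) > len(".com")


print(looks_like_email_manual("bogusemail123@sillymail.com"))  # True
print(looks_like_email_manual("@sillymail.com"))               # False
print(looks_like_email_manual("bogusemail123@sillymail"))      # False
```

Even this cut-down version needs several separate checks, and it would need more to rule out unwanted punctuation.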
But with regex we can work this out in one line... that is, after working out the regex string.
So let's start on the email.
First we have to understand what an email needs in it:
- A beginning (xxxx@mail.com)
- The '@' sign
- A domain (mail@xxx.com)
- A .com ending (we are not going to handle .org or anything else)
So let's start (please see below for an explanation of the symbols):
import re
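# Let's make a fake email...
email = 'bogusemail123@sillymail.com'
# To make a re pattern we use a string (a raw string, r"...", keeps the backslash as-is).
# The dot before com is escaped (\.) because a bare dot matches any character.
pattern = r"^[A-Za-z0-9.]+@[A-Za-z0-9.]+\.com$"
# Now we check if it matches
re.findall(pattern, email)
# Yes! It does
# It returns ['bogusemail123@sillymail.com']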
# Let's try some other addresses
re.findall(pattern, "@sillymail.com")
# returns []
re.findall(pattern, "bogusemail123@sillymail")
# returns []
So this is a relatively simple example, but you can easily see how it can save you time when checking that a user has entered the correct things, as well as when searching for things in a string.
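As a small sketch of the "searching for things in a string" case: if we leave off the ^ and $ symbols, the same kind of pattern can pick addresses out of a longer piece of text. The sample sentence and the second address are made up for this illustration, and the dot before com is written as \. so that it only matches a real dot.

```python
import re

# No ^ or $ here, so the pattern may match anywhere inside the text
search_pattern = r"[A-Za-z0-9.]+@[A-Za-z0-9.]+\.com"

text = "Contact bogusemail123@sillymail.com or admin@example.com for help."

# findall returns every non-overlapping match, in the order they appear
found = re.findall(search_pattern, text)
print(found)  # ['bogusemail123@sillymail.com', 'admin@example.com']
```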
Now to explain what "^[A-Za-z0-9.]+@[A-Za-z0-9.]+\.com$" means:
- ^ --> means that the pattern must start at the start of the string, so "Hello bogusmail123@sillymail.com" will not match
- [A-Za-z0-9.] --> This is called a range (or character class). Any single character inside it will match, so any letter A-Z or a-z, any digit 0-9, or a dot. This means that you do not get emails with other forms of punctuation in them. Inside the square brackets a dot just means a dot.
- + --> This does not mean plus, or anything like that. Rather, it means that whatever came before it needs to be in the string one time or more. In this case the thing before it was our range, so we need at least one letter, number or dot for the string to match
- @ --> To match a character exactly, you just put that character in the pattern at the place it is meant to be
- [A-Za-z0-9.]+ --> Just another range like we had before, with a '+' sign to mean it needs one or more things in the range
- \.com$ --> Then we put in exactly what we want at the end of the email address ('.com') and make sure it is at the end of the string with the dollar symbol. Outside square brackets a dot on its own matches any character, so we write \. to match a real dot; with a plain .com the address "bogusemail123@sillymailxcom" would also match.
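To see the symbols doing their jobs one at a time, here is a short sketch that breaks one rule per test string. The test strings are made up for this illustration, and the pattern named loose (with a plain dot before com) is only there to show the difference the backslash makes.

```python
import re

pattern = r"^[A-Za-z0-9.]+@[A-Za-z0-9.]+\.com$"

# ^ : text in front of the address stops the match
leading_text = re.findall(pattern, "Hello bogusmail123@sillymail.com")
print(leading_text)   # []

# + : at least one character is needed before the @
empty_name = re.findall(pattern, "@sillymail.com")
print(empty_name)     # []

# range : '!' is not inside [A-Za-z0-9.], so it is rejected
bad_char = re.findall(pattern, "bogus!email@sillymail.com")
print(bad_char)       # []

# \. versus . : a plain dot matches any character, here the 'x'
loose = r"^[A-Za-z0-9.]+@[A-Za-z0-9.]+.com$"
loose_result = re.findall(loose, "bogusemail123@sillymailxcom")
strict_result = re.findall(pattern, "bogusemail123@sillymailxcom")
print(loose_result)   # ['bogusemail123@sillymailxcom']
print(strict_result)  # []
```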
Then to check that our string matches we use re.findall(pattern, string).
That lists all of the parts of the string that match; in our case it should come back with either a list containing one email address, or an empty list if the input was incorrect.
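Since an empty list counts as False in Python, the result of re.findall can be turned straight into a yes/no answer. The sketch below wraps that in a small function; the names is_valid_email and EMAIL_PATTERN are made up for this illustration and are not part of the re module.

```python
import re

# Same idea as the pattern above, with the dot before com escaped
EMAIL_PATTERN = r"^[A-Za-z0-9.]+@[A-Za-z0-9.]+\.com$"


def is_valid_email(address):
    """Return True if the address fits the simple something@something.com shape."""
    # findall gives [] (which counts as False) when nothing matches
    return bool(re.findall(EMAIL_PATTERN, address))


print(is_valid_email("bogusemail123@sillymail.com"))  # True
print(is_valid_email("bogusemail123@sillymail"))      # False
```

This could then be used in an if statement to ask the user to try again when the check fails.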
This will not catch all email addresses; it's just a simple example designed to show people the possibilities of the regex module.
If you want to extend yourself in this, try making it so that it accepts .org, .net, .com.au, etc.
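If you get stuck, here is one possible approach (there are others). It uses | to mean "or" between the endings, inside a (?: ... ) group. The ?: matters with re.findall: with plain brackets, findall would return only the part inside the brackets instead of the whole address. The example addresses are made up for this illustration.

```python
import re

# | means "or"; (?: ... ) groups the endings without capturing them
extended_pattern = r"^[A-Za-z0-9.]+@[A-Za-z0-9.]+\.(?:com\.au|com|org|net)$"

org_result = re.findall(extended_pattern, "someone@example.org")
au_result = re.findall(extended_pattern, "someone@example.com.au")
other_result = re.findall(extended_pattern, "someone@example.xyz")

print(org_result)    # ['someone@example.org']
print(au_result)     # ['someone@example.com.au']
print(other_result)  # []
```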
Hope you enjoyed the tutorial and learnt something :)