Regular expressions tutorial

Posted on August 15, 2016 in misc

Regular expressions

Intro

When it comes to text mining, you need to understand that a text is nothing but a sequence of characters. A character can be a digit, a letter, a symbol, a space, a newline, etc. In this chapter we are going to learn about regular expressions, a very powerful concept that helps us work with text. They can be used for finding patterns, searching for words or characters, substitution, etc.

Observe the below three examples:

Example1:

  1. abc
  2. abcdef
  3. abcdefg

Example2:

  1. password123
  2. 12353234123
  3. de12de3

All these examples consist of a set of strings, and each set has a common pattern. In Example1 the string 'abc' is a common pattern found in all the options. In Example2 the string '12', made of number characters, is a common pattern found in all its options. Also, the string '123' is a common pattern in option 1 and option 2 of Example2.

In [4]:
# Quiz widget: pick the most common pattern among the Example2 strings.
# Assumes the ipywidgets package, which replaced the deprecated IPython.html.widgets.
import ipywidgets as widgets
from IPython.display import display

radio = widgets.RadioButtons(
    description='Select a pattern that is most common:',
    options=['d12', 'd123', '123'],
)
button = widgets.Button(description="submit!")

def msg(b):
    # Called with the clicked button; compares the selection to the answer.
    selected = radio.value
    if selected == '123':
        print("Exactly: 123 is the most common pattern among the three strings")
    else:
        print("close...But not the right one!")

display(radio)
display(button)
button.on_click(msg)

Patterns need not be within words alone; they can be in sentences too. Try to find the most uncommon pattern in the following:

  1. Tim is a person.
  2. Tim is a good person.
  3. Is Tim a good person?
In [5]:
# Quiz widget: pick the least common pattern among the three sentences.
# Assumes the ipywidgets package, which replaced the deprecated IPython.html.widgets.
import ipywidgets as widgets
from IPython.display import display

radio = widgets.RadioButtons(
    description='Select a pattern that is least common:',
    options=['Is', 'is', '?'],
)
button = widgets.Button(description="submit!")

def msg(b):
    # Called with the clicked button; compares the selection to the answer.
    selected = radio.value
    if selected == '?':
        print("Exactly: '?' is the most uncommon pattern among the three strings.")
        print("Also, if you observe, the three strings are very similar.")
        print("Even 'Is' can be an uncommon pattern if you consider case sensitivity, that is, uppercase versus lowercase characters.")
    else:
        print("close...But not the right one!")

display(radio)
display(button)
button.on_click(msg)

The Real need for regular expressions:

The main idea behind the above examples is to help you understand the idea of patterns in text. It's easy to identify patterns in three or four sentences manually. But can you do it manually for, let's say, ten thousand book reviews (comments)? It would take forever, which is why we need computers to do these things for us. Regular expressions are their way of seeing the patterns in text.

The language of regular expressions:

Regular expressions have a language of their own, one that computers, or rather most programming languages, understand. Computers are a bit dumb by themselves. They don't see the patterns in text on their own. We need to give them instructions describing what a pattern looks like. These instructions are the regular expressions. Regular expressions are fundamentally made of the following:

  1. Metacharacters: Metacharacters are special characters that are reserved for a specific purpose, which we cover later in this chapter. They are not treated as literals.

  2. Literals: Literals are simple characters, comprising numbers, letters and other non-metacharacters.

  3. Escaping: If we want to use a metacharacter as a regular character, we need to escape it using a backslash (\).
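To make the escaping rule concrete, here is a minimal sketch using Python's built-in re module (introduced later in this tutorial). The target string and variable names are made up for illustration only.

```python
import re

# Hypothetical target string, chosen only for this illustration.
target = "The price is 3.50 or 3x50?"

# Unescaped: '.' is a metacharacter and matches any character.
unescaped = re.findall("3.50", target)

# Escaped: '\.' matches only a literal period.
escaped = re.findall(r"3\.50", target)

print(unescaped)  # ['3.50', '3x50']
print(escaped)    # ['3.50']
```

The raw-string prefix r"..." keeps Python itself from interpreting the backslash before the pattern reaches the regular expression engine.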

NOTE: The simplest way to learn regular expressions is to try them out against a target string or strings.

Metacharacters:

Metacharacters are characters that are reserved for specific functionality, and they are the main building blocks of regular expressions. The following are the various metacharacters that we use. They may not make much sense when you read about each of them for the first time. But by combining metacharacters with literals we can create powerful pattern-matching regular expressions that make more sense (and this is something we are going to do in the following sections).

Imagine you have ten or more sentences, and let's call them the target strings. With regular expressions we try to extract, from these target strings, those strings that contain the pattern we are looking for:

Positioning Metacharacters:

^

The caret '^' represents the start or beginning of a string:

The expression '^hello' matches all those lines that start with the word hello, such as "hello world" and "helloooo there". In other words, the regular expression looks for the characters "hello" at the beginning of any target string.
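As a small sketch of the caret in action, the snippet below filters a list of made-up target strings with Python's built-in re module (introduced later in this tutorial). The list and variable names are assumptions for illustration.

```python
import re

# Hypothetical target strings for this illustration.
targets = ["hello world", "helloooo there", "say hello"]

# Keep only the strings where "hello" appears at the very start.
starts_with_hello = [t for t in targets if re.search("^hello", t)]

print(starts_with_hello)  # ['hello world', 'helloooo there']
```

"say hello" is skipped because "hello" is present but not at the beginning.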


$

Similarly, the dollar symbol '$' signifies the end of the target string. The expression 'end$' matches only those target strings that have the characters 'end' at the end of the string (or sentence). For example:

    1. "this is the sentence end"
    2. "thisisthesentenceend" etc ...
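The same check can be sketched in Python with the built-in re module (introduced later in this tutorial). The third target string is an extra, made-up counterexample.

```python
import re

# The two strings from the text plus one hypothetical counterexample.
targets = [
    "this is the sentence end",
    "thisisthesentenceend",
    "end of story",
]

# Keep only the strings where "end" appears at the very end.
ends_with_end = [t for t in targets if re.search("end$", t)]

print(ends_with_end)  # ['this is the sentence end', 'thisisthesentenceend']
```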


.

The '.' or period stands for any character in its position (by default, any character except a newline). For example:

1.  ".en" can represent "ben", "ten", "den", etc. The period can be replaced with any character. It matches all those strings that have that pattern.
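A quick sketch of the period, using Python's built-in re module (introduced later in this tutorial). The target string is invented for illustration and includes "5en" to show that the period is not limited to letters.

```python
import re

# Hypothetical target string for this illustration.
target = "ben ten den 5en"

# '.' stands for any single character, followed by the literal "en".
matches = re.findall(".en", target)

print(matches)  # ['ben', 'ten', 'den', '5en']
```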


Some more Metacharacters:

[ ]

The brackets mean to match any one ( and only one) of the characters that is specified inside them.

An example usage:

[abcd] :- this means match any one character of a, b, c, d
[12av] :- this means match any one character of 1, 2, a, v
[avAV] :- this means match any one character of a, v, A, V [notice the lowercase and uppercase]
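The bracket examples above can be tried out with a short sketch like the following, which uses Python's built-in re module (introduced later in this tutorial). The target string is a made-up example.

```python
import re

# Hypothetical target string for this illustration.
target = "a1 b2 v3 A4 z5"

# Each match is exactly one character taken from the bracketed set.
letters = re.findall("[avAV]", target)
mixed = re.findall("[12av]", target)

print(letters)  # ['a', 'v', 'A']
print(mixed)    # ['a', '1', '2', 'v']
```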



-

The '-' or hyphen lets you specify a range of characters or numbers at once. In the previous examples we listed each character that we wanted to match inside the brackets. But if we need to list all the characters from 'a' to 'z', then using the hyphen we can simply write "[a-z]".

[a-z] Matches any one lowercase character from a to z, including 'a' and 'z'
[0-9] Matches any one digit from 0 to 9, including 0 and 9
[A-Z] Matches any one uppercase character from A to Z
[0-9a-zA-Z] Matches any one alphanumeric character
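Here is a minimal sketch of ranges using Python's built-in re module (introduced later in this tutorial). The target string and variable names are hypothetical.

```python
import re

# Hypothetical target string for this illustration.
target = "Room 42, Block B_7"

lower = re.findall("[a-z]", target)
digits = re.findall("[0-9]", target)
upper = re.findall("[A-Z]", target)

# Joining the alphanumeric matches drops spaces, commas and underscores.
alnum = "".join(re.findall("[0-9a-zA-Z]", target))

print(lower)   # ['o', 'o', 'm', 'l', 'o', 'c', 'k']
print(digits)  # ['4', '2', '7']
print(upper)   # ['R', 'B', 'B']
print(alnum)   # Room42BlockB7
```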




\d

This represents any digit. It's simpler to use than the [0-9] specification:

"password\d\d\d" matches: "password123", "password234", "password000", etc. Any digit can appear in each \d position.


\D

This is the opposite of \d. It means anything other than a digit (non-numeric characters):

"\Da\D" matches: "dad", "mad", "cat", and even words that don't have any meaning, such as "lal" and "fal". "\D" can stand for any non-numeric character. It need not be a letter; it can also match a space, a tab, etc.
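The two digit classes can be compared side by side in a short sketch using Python's built-in re module (introduced later in this tutorial). Both lists of target strings are invented for illustration.

```python
import re

# Hypothetical target strings for \d (any digit).
passwords = ["password123", "password000", "password12", "passwordabc"]
digits_matched = [t for t in passwords if re.search(r"password\d\d\d", t)]

# Hypothetical target strings for \D (anything except a digit).
words = ["dad", "cat", " a ", "1a2", "la5"]
non_digits_matched = [t for t in words if re.search(r"\Da\D", t)]

print(digits_matched)      # ['password123', 'password000']
print(non_digits_matched)  # ['dad', 'cat', ' a ']
```

Note that " a " matches because a space is also a non-digit character.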

\w

This represents any alphanumeric character or the underscore, similar to [0-9a-zA-Z_]

'\w\w\w' matches: "abc", "a12", "123", "z1z", etc. It matches anything that is a number, a letter or an underscore.

\W

This is the opposite of \w. It represents any character that is not alphanumeric or an underscore, similar to [^0-9a-zA-Z_]

'text\Wmining' matches: "text mining", "text$mining", etc. It matches any such non-alphanumeric character between the two words.
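A minimal sketch of \w and \W with Python's built-in re module (introduced later in this tutorial). The target strings are assumptions for illustration; "text_mining" is included because Python's \w also treats the underscore as a word character.

```python
import re

# \w\w\w: three consecutive word characters.
triples = re.findall(r"\w\w\w", "abc a12 z_z")

# \W: exactly one non-word character between "text" and "mining".
targets = ["text mining", "text$mining", "text_mining", "textmining"]
separated = [t for t in targets if re.search(r"text\Wmining", t)]

print(triples)    # ['abc', 'a12', 'z_z']
print(separated)  # ['text mining', 'text$mining']
```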

\s

Any whitespace character, such as a space, a tab or a newline.

\S

Anything other than a whitespace character.
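The whitespace classes can be sketched as follows, using Python's built-in re module (introduced later in this tutorial). Both target strings are made up for illustration.

```python
import re

# \s picks out every whitespace character: tab, spaces and newline here.
whitespace = re.findall(r"\s", "text\tmining is fun\n")

# \S picks out every character that is not whitespace.
non_whitespace = re.findall(r"\S", "a b\tc")

print(whitespace)      # ['\t', ' ', ' ', '\n']
print(non_whitespace)  # ['a', 'b', 'c']
```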

Iteration Metacharacters:

Iteration metacharacters specify the number of times their preceding character can occur in a pattern:

Check the below example:

  1. god
  2. good
  3. goooood

All of the above have one thing in common: the letter 'g' at the start, the letter 'd' at the end, and the letter 'o' in the middle with repetitions. The regular expression 'g[o]d' would represent the first string "god" alone. It doesn't capture "good" and "goooood" because they have more than one "o". The simplest way to specify such repetitions in regular expressions is to use certain iteration metacharacters:

?

The question mark (?) specifies that the preceding character occurs either 0 or 1 times. For example:

 "go?d" matches: "god" and "gd", i.e. "o" can occur once or not at all.


*

The asterisk (*) specifies that the preceding character can occur zero or more times:

 "go*d" matches: "god", "good", "goooood", "gd", "goooooooooooood" (any number of "o"s)

+

The plus (+) sign specifies that its preceding character should occur at least once, i.e. >= 1 times:

 "go+d" matches: "god", "good", "goooood", "gooooooooooood", etc., but not "gd".
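To compare the three iteration metacharacters side by side, here is a minimal sketch using Python's built-in re module (introduced later in this tutorial). The helper full_matches and the target list are hypothetical names for this illustration; re.fullmatch is used so that the whole string, not just a part of it, has to fit the pattern.

```python
import re

# Hypothetical target strings for this illustration.
targets = ["gd", "god", "good", "goooood", "gold"]

def full_matches(pattern):
    # re.fullmatch succeeds only if the entire string fits the pattern.
    return [t for t in targets if re.fullmatch(pattern, t)]

zero_or_one = full_matches("go?d")
zero_or_more = full_matches("go*d")
one_or_more = full_matches("go+d")

print(zero_or_one)   # ['gd', 'god']
print(zero_or_more)  # ['gd', 'god', 'good', 'goooood']
print(one_or_more)   # ['god', 'good', 'goooood']
```

"gold" never matches, because none of the three patterns allows an "l" between the "o" and the "d".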





Some more examples:

Let's try to use a combination of these metacharacters to form complex regular expressions.

Regular expression :- possible matches
goa[tl] :- matches "goat" and "goal"
^good :- matches all the target strings that start with the word "good". It does not match a target string such as "life is good", as "good" there doesn't occur at the start of the sentence
bye$ :- matches all the target strings that end with "bye". It does not match a sentence such as "bye my friend", because the characters "bye" are not at the end of the string
^life is [gf]ood$ :- because it is anchored at both ends, this pattern matches only the strings that are exactly "life is good" or "life is food"
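The combined patterns above can be checked with a short sketch like this one, using Python's built-in re module (introduced in the next section). The list of target strings is invented for illustration.

```python
import re

# Hypothetical target strings for this illustration.
targets = [
    "life is good",
    "life is food",
    "life is good today",
    "my life is good",
    "goat",
    "goal",
    "bye my friend",
    "goodbye",
]

patterns = ["goa[tl]", "^good", "bye$", "^life is [gf]ood$"]

# Map each pattern to the target strings it matches.
results = {p: [t for t in targets if re.search(p, t)] for p in patterns}

for pattern, matched in results.items():
    print(pattern, "->", matched)
# goa[tl] -> ['goat', 'goal']
# ^good -> ['goodbye']
# bye$ -> ['goodbye']
# ^life is [gf]ood$ -> ['life is good', 'life is food']
```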

Pattern finding using Python:

Now that we have an idea of the language of regular expressions, let's see how we can make computers use regular expressions to extract information from text:

Python has its own library for processing regular expressions. It is called "re". Let's import it:

In [2]:
import re

Consider a pattern:

In [11]:
pattern = ".oo."

Let's try to find matches for the pattern in a target string:

In [13]:
target_string = "I love food, food which is good."

# findall returns every non-overlapping match as a list of strings.
matches = re.findall(pattern, target_string)
print("list of matches: ", matches)
list of matches:  ['food', 'food', 'good']

We can also find the position (index) of each match in the target string:

In [23]:
# finditer yields match objects that carry the start and end index of each match.
for match in re.finditer(pattern, target_string):
    start = match.start()
    end = match.end()
    print("positions: ", start, ":", end, target_string[start:end])
positions:  7 : 11 food
positions:  13 : 17 food
positions:  27 : 31 good






Things to try:

  1. Write a regular expression that matches the below strings:

    a. casper
    b. Jasper
  2. Write a regular expression that would match the below strings:

    a. hi
    b. hiii
    c. hiiiiiii
  3. Write a regular expression that would match the below strings:

    a. code
    b. coding
    c. coder
  4. Write a regular expression that would match the below strings:

    a. john is a happy man
    b. john is a good man
    c. john is a great man
  5. Write a regular expression that would match the strings in a, b, c and skip the strings in d, e:

    a. cat
    b. rat
    c. bat
    d. sat
    e. mat
  6. Write a regular expression that would match the below strings:

    a. life is good
    b. life is great
    c. life is awesome
  7. Consider the given string: "JAMES is always Confident, consistent and persistent the way he works."
    a. Write a regular expression that would capture the qualities of JAMES in the sentence
    b. Write a program that would capture and print these matches
    c. The program should also print the positions of the words in the sentence.