Regular Expressions in Python:

Regular Expressions 1 (i2tutorials)

 

 

This article walks through regular expressions, which are popular among programmers and are available in languages such as C++, Java, Python and PHP. They are used heavily in data cleaning and data wrangling, where they are a standard tool for tidying up text.

A regular expression is a sequence of characters that defines a pattern. It is mainly used to find or replace that pattern in a string or file.

A pattern is built from two types of characters:

1. Metacharacters, which have a special meaning (for example . ^ $ * +).

2. Literals, which match themselves (for example 1, 2, a, b).

To work with regular expressions in Python, import the standard library module “re”:

import re
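
As a minimal sketch of the difference between literals and metacharacters (the sample string is made up for illustration), the pattern ab consists only of literals, while the dot in a. is a metacharacter that stands for any single character except a newline:

```python
import re

# Literals match themselves: "ab" only matches the characters a, b in that order.
literal_hits = re.findall(r'ab', 'ab ac a1')

# The metacharacter "." matches any single character except a newline.
dot_hits = re.findall(r'a.', 'ab ac a1')

print(literal_hits)  # ['ab']
print(dot_hits)      # ['ab', 'ac', 'a1']
```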

 

The most common uses of regular expressions are:

1. Searching a string (search and match)

2. Finding all occurrences of a pattern (findall)

3. Breaking a string into substrings (split)

4. Replacing part of a string (sub)

 

The “re” module provides several methods for these tasks. The most common ones are:

– match()

– search()

– findall()

– split()

– sub()

– compile()

Let's discuss them one by one.

 

re.match():

re.match() looks for a match only at the start of the string. For example, take the string “regular expressions in nlp”. Calling match() with the pattern “regular” succeeds, because the string starts with “regular”. Calling it with the pattern “nlp” fails, because “nlp” is not at the start of the string.

import re

result = re.match(r'regular', 'regular expressions in nlp')

print(result.group(0))

output:

regular

If you look for “nlp” in the same string instead, match() finds nothing:

result = re.match(r'nlp', 'regular expressions in nlp')

print(result)

output:

None
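
Because match() returns None when the pattern is not at the start, calling .group() on the result without checking raises an AttributeError. A minimal sketch of a guarded call, reusing the sentence from above (the outcomes dictionary is only an illustrative helper):

```python
import re

text = 'regular expressions in nlp'

outcomes = {}
for pattern in ('regular', 'nlp'):
    result = re.match(pattern, text)
    # Check for None before calling .group() to avoid an AttributeError.
    outcomes[pattern] = result.group(0) if result else None

print(outcomes)  # {'regular': 'regular', 'nlp': None}
```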

 

re.search():

re.search() is similar to match(), but it is not restricted to the start of the string: it finds the first occurrence of the pattern at any position.

For example:

result = re.search(r'nlp', 'regular expressions in nlp')

print(result.group(0))

output:

nlp
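
To make the contrast with match() concrete, the following sketch runs both functions on the same sentence and also reads the position of the hit from the match object returned by search():

```python
import re

text = 'regular expressions in nlp'

match_result = re.match(r'nlp', text)    # anchored at the start of the string
search_result = re.search(r'nlp', text)  # scans the whole string

print(match_result)             # None
print(search_result.group(0))   # nlp
print(search_result.start())    # 23, the index where the match begins
```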

 

re.findall():

re.findall() returns a list of all matches of the pattern. There is no constraint on where in the string the matches start or end.

For example:

result = re.findall(r'nlp', 'nlp regular expressions in nlp')

print(result)

output:

['nlp', 'nlp']
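
Unlike match() and search(), findall() never returns None. A short sketch showing that a pattern with no hits simply gives an empty list (the pattern 'python' is chosen here only because it does not occur in the sentence):

```python
import re

text = 'nlp regular expressions in nlp'

hits = re.findall(r'nlp', text)
no_hits = re.findall(r'python', text)

print(hits)     # ['nlp', 'nlp']
print(no_hits)  # []
```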

 

re.split():

re.split() splits a string at every occurrence of the pattern.

For example:

result = re.split(r'g', 'regular')

print(result)

output:

['re', 'ular']

split() takes one more parameter, maxsplit. Its default is 0, which means there is no limit. A positive value sets the maximum number of splits that are made.

result = re.split(r'r', 'regular expressions')

print(result)

output:

['', 'egula', ' exp', 'essions']

The first element is an empty string because the input starts with the separator.

result = re.split(r'r', 'regular expressions', maxsplit=1)

print(result)

output:

['', 'egular expressions']
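
The separator passed to split() can itself be a pattern rather than a single literal. As a sketch, assuming a made-up sentence with irregular spacing, the pattern \s+ splits on any run of whitespace, and maxsplit=1 stops after the first split:

```python
import re

text = 'regular   expressions in  nlp'

# \s+ matches one or more whitespace characters, so repeated spaces
# are treated as a single separator.
words = re.split(r'\s+', text)

# With maxsplit=1 only the first separator is used; the rest of the
# string is returned untouched as the last element.
first_and_rest = re.split(r'\s+', text, maxsplit=1)

print(words)           # ['regular', 'expressions', 'in', 'nlp']
print(first_and_rest)  # ['regular', 'expressions in  nlp']
```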

 

re.sub():

re.sub() searches for the pattern and replaces each match with a new string. If the pattern is not found, the string is returned unchanged.

For example:

result1 = re.sub(r'python', 'data science', 'i2 tutorials for python')

print(result1)

output:

i2 tutorials for data science
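
The argument order of sub() is pattern, replacement, input string. The sketch below repeats the replacement from above and adds the no-match case, where the input comes back unchanged (the pattern 'java' is used only as an example of something absent from the sentence):

```python
import re

text = 'i2 tutorials for python'

# Arguments: pattern to find, replacement text, string to search in.
replaced = re.sub(r'python', 'data science', text)

# "java" does not occur in the text, so nothing is replaced.
unchanged = re.sub(r'java', 'data science', text)

print(replaced)   # i2 tutorials for data science
print(unchanged)  # i2 tutorials for python
```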

 

re.compile():

re.compile() compiles a regular expression into a pattern object, which can then be reused for pattern matching.

import re

pattern = re.compile(r'i2')

result = pattern.findall('i2 tutorials i2')

print(result)

result2 = pattern.findall('i2 tutorials is largest analytics community of India')

print(result2)

Output:

['i2', 'i2']

['i2']
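
A compiled pattern object offers the same methods as the module-level functions, which is convenient when one pattern is applied to many strings. A minimal sketch, where the third sentence is a made-up example with no match:

```python
import re

pattern = re.compile(r'i2')

sentences = [
    'i2 tutorials i2',
    'i2 tutorials is largest analytics community of India',
    'no match in this one',
]

# Reuse the same compiled pattern for every sentence.
counts = [len(pattern.findall(sentence)) for sentence in sentences]

# The pattern object also provides search(), match(), split() and sub().
first_hit = pattern.search(sentences[0])

print(counts)             # [2, 1, 0]
print(first_hit.group())  # i2
```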

Regular expressions also provide operators (metacharacters) for extracting exactly the characters we want. The most commonly used ones are shown in the following examples:

 

1. Extract each character of a string:

The dot (.) matches any single character except a newline.

import re

result = re.findall(r'.', 'i2 tutorials for datascience')

print(result)

output:

['i', '2', ' ', 't', 'u', 't', 'o', 'r', 'i', 'a', 'l', 's', ' ', 'f', 'o', 'r', ' ', 'd', 'a', 't', 'a', 's', 'c', 'i', 'e', 'n', 'c', 'e']

In the output above the spaces are extracted too. We can avoid that by using \w instead of . because \w matches only letters, digits and underscores.

import re

result = re.findall(r'\w', 'i2 tutorials for datascience')

print(result)

output:

['i', '2', 't', 'u', 't', 'o', 'r', 'i', 'a', 'l', 's', 'f', 'o', 'r', 'd', 'a', 't', 'a', 's', 'c', 'i', 'e', 'n', 'c', 'e']

2. Extract the words:

import re

result = re.findall(r'\w*', 'i2 tutorials for data science')

print(result)

output:

['i2', '', 'tutorials', '', 'for', '', 'data', '', 'science', '']

The empty strings appear because \w* also matches zero characters, for example at each space and at the end of the string. Using \w+ instead of \w* avoids them:

import re

result = re.findall(r'\w+', 'regular expressions in nlp')

print(result)

output:

['regular', 'expressions', 'in', 'nlp']
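
The difference between the two quantifiers can be seen side by side on a shorter input. This is a sketch with a made-up two-word string; the output shown is what Python 3.7 or later produces:

```python
import re

text = 'in nlp'

# * means "zero or more", so \w* also matches the empty string
# at the space and at the end of the input.
star = re.findall(r'\w*', text)

# + means "one or more", so only real words are returned.
plus = re.findall(r'\w+', text)

print(star)  # ['in', '', 'nlp', '']
print(plus)  # ['in', 'nlp']
```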

 

3. Extract the first or last word using ^ and $:

To fetch the word at the start or at the end of the sentence, anchor the pattern. ^ matches at the start of the string:

import re

result = re.findall(r'^\w+', 'regular expressions in nlp')

print(result)

output:

['regular']

The last word is fetched by using \w+$ instead of ^\w+, because $ matches at the end of the string.

import re

result = re.findall(r'\w+$', 'regular expressions in nlp')

print(result)

output:

['nlp']
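
One thing to watch with the $ anchor: it requires the word to sit directly at the end of the string. A small sketch, assuming the same sentence with a trailing period added for illustration, shows that \w+$ then finds nothing because the period is not a word character:

```python
import re

plain = 'regular expressions in nlp'
with_period = 'regular expressions in nlp.'

first_word = re.findall(r'^\w+', plain)
last_word = re.findall(r'\w+$', plain)

# The final character is ".", which \w does not match, so the
# anchored pattern has no word to return.
last_word_with_period = re.findall(r'\w+$', with_period)

print(first_word)             # ['regular']
print(last_word)              # ['nlp']
print(last_word_with_period)  # []
```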

4. First two characters of each word:

\w\w matches consecutive pairs of word characters, so every word is cut into two-character chunks.

import re

result = re.findall(r'\w\w', 'regular expressions in nlp')

print(result)

output:

['re', 'gu', 'la', 'ex', 'pr', 'es', 'si', 'on', 'in', 'nl']

To fetch only the first two characters of each word, use \b\w\w instead of \w\w. \b matches a word boundary, so the pair must begin at the start of a word.

import re

result = re.findall(r'\b\w\w', 'regular expressions in nlp')

print(result)

output:

['re', 'ex', 'in', 'nl']
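
As a further sketch of how \b changes the result, the made-up sentence below includes a one-letter word. Since \b\w\w needs two word characters after the boundary, the single letter is skipped:

```python
import re

text = 'a regular nlp'

# Without \b every pair of adjacent word characters is returned.
pairs = re.findall(r'\w\w', text)

# With \b only pairs that begin a word are returned; "a" has no
# second character, so it produces no match.
word_starts = re.findall(r'\b\w\w', text)

print(pairs)        # ['re', 'gu', 'la', 'nl']
print(word_starts)  # ['re', 'nl']
```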

5. Extract all the characters after @:

import re

result = re.findall(r'@\w+', 'i2@yahoo.com, i2@gmail.com,i2@orkut.com')

print(result)

output:

['@yahoo', '@gmail', '@orkut']

If you want the .com part as well, use @\w+\.\w+ (the backslash makes the dot match a literal period):

import re

result = re.findall(r'@\w+\.\w+', 'i2@yahoo.com, i2@gmail.com,i2@orkut.com')

print(result)

output:

['@yahoo.com', '@gmail.com', '@orkut.com']
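
The reason for escaping the dot is that an unescaped . matches any character, not just a period. A minimal sketch with a deliberately malformed, made-up address shows the difference:

```python
import re

# Hypothetical input where the period has been replaced by a space.
text = 'i2@yahoo com'

# Unescaped dot: the space is accepted as "any character".
loose = re.findall(r'@\w+.\w+', text)

# Escaped dot: only a literal period is accepted, so nothing matches.
strict = re.findall(r'@\w+\.\w+', text)

print(loose)   # ['@yahoo com']
print(strict)  # []
```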

 

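6. Fetch only the domain extension using @\w+\.(\w+):

When the pattern contains a group in parentheses, findall() returns only the text captured by the group.

import re

result = re.findall(r'@\w+\.(\w+)', 'i2@yahoo.com, i2tutorial@gmail.in,i2python@orkut.org')

print(result)

output:

['com', 'in', 'org']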
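
With more than one group, findall() returns a list of tuples, one element per group. The sketch below wraps both the domain name and the extension in parentheses, using the same addresses as above:

```python
import re

text = 'i2@yahoo.com, i2tutorial@gmail.in,i2python@orkut.org'

# Two groups: the name after @ and the extension after the period.
pairs = re.findall(r'@(\w+)\.(\w+)', text)

print(pairs)  # [('yahoo', 'com'), ('gmail', 'in'), ('orkut', 'org')]
```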

 

7. Fetch the words that start with a vowel:

import re

result = re.findall(r'[AEIOUaeiou]\w+', 'regular expressions in nlp')

print(result)

output:

['egular', 'expressions', 'in']

In the output above, 'egular' is meaningless: it is only the tail of “regular”. We can drop it by adding \b, so that the vowel must be at the start of a word.

import re

result = re.findall(r'\b[AEIOUaeiou]\w+', 'regular expressions in nlp')

print(result)

output:

['expressions', 'in']
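
Note that \w+ after the vowel requires at least one more character, so a one-letter word such as “a” is not returned. A sketch under the assumption that such words should be included: swapping \w+ for \w* allows zero following characters (the sentence is a made-up variation of the one above).

```python
import re

text = 'a regular expression in nlp'

# \w+ needs at least one character after the vowel, so "a" is skipped.
two_or_more = re.findall(r'\b[AEIOUaeiou]\w+', text)

# \w* allows zero characters after the vowel, so "a" is included.
one_or_more = re.findall(r'\b[AEIOUaeiou]\w*', text)

print(two_or_more)  # ['expression', 'in']
print(one_or_more)  # ['a', 'expression', 'in']
```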

 

8. Extract information from an HTML file between the <tr> and <td> tags:

Here is the sample HTML file.

<tr align="center"><td>1</td> <td>apple</td> <td>orange</td></tr>

<tr align="center"><td>2</td> <td>banana</td> <td>cucumber</td></tr>

<tr align="center"><td>3</td> <td>mango</td> <td>pomegranate</td></tr>

<tr align="center"><td>4</td> <td>kiwi</td> <td>papaya</td></tr>

<tr align="center"><td>5</td> <td>grapes</td> <td>custardapple</td></tr>

Assuming the HTML above is stored in a string named html, the two groups capture the second and third cell of each row:

import re

result = re.findall(r'<td>\w+</td>\s<td>(\w+)</td>\s<td>(\w+)</td>', html)

print(result)

Output:

[('apple', 'orange'), ('banana', 'cucumber'), ('mango', 'pomegranate'), ('kiwi', 'papaya'), ('grapes', 'custardapple')]
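
For a version that runs on its own, the sketch below defines the input inline. The variable name html and the choice to include only the first two rows are assumptions made to keep the example short:

```python
import re

# Two rows of the sample table, joined into one string.
html = (
    '<tr align="center"><td>1</td> <td>apple</td> <td>orange</td></tr>\n'
    '<tr align="center"><td>2</td> <td>banana</td> <td>cucumber</td></tr>\n'
)

# The first <td> (the row number) is matched but not captured;
# the two groups capture the second and third cells.
rows = re.findall(r'<td>\w+</td>\s<td>(\w+)</td>\s<td>(\w+)</td>', html)

for first, second in rows:
    print(first, second)
```

This works because the cells contain only word characters and are separated by exactly one whitespace character; HTML with a different layout would need a different pattern or a real HTML parser.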

This article explained regular expressions, their metacharacters and the methods of the “re” module. It covered the most commonly used patterns and the most common methods for solving everyday regular expression problems.