
Regular Expressions in Python
In this article I will go through regular expressions, which are very popular among programmers and are available in programming languages such as C++, Java, Python, and PHP. They are mostly used in data cleaning and data wrangling, where they serve as a standard tool.
A regular expression is a sequence of characters that defines a pattern. It is mainly used to find or replace that pattern in a string or file.
A pattern is built from two types of characters:
1. Metacharacters, which have a special meaning (for example . ^ $ * +).
2. Literals, which match themselves (for example 1, 2, a, b).
To work with regular expressions in Python, we import the standard library module "re".
import re
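To make the difference between literals and metacharacters concrete, here is a minimal sketch. The sample string and the variable names are invented for illustration only.

```python
import re

text = "a.b a1b"

# A literal matches itself: the pattern "a1b" matches only the text "a1b".
literal_hits = re.findall(r"a1b", text)

# "." is a metacharacter that matches any character except a newline,
# so "a.b" matches both "a.b" and "a1b".
meta_hits = re.findall(r"a.b", text)

# Escaping the metacharacter with a backslash turns it back into a literal dot.
escaped_hits = re.findall(r"a\.b", text)

print(literal_hits)  # ['a1b']
print(meta_hits)     # ['a.b', 'a1b']
print(escaped_hits)  # ['a.b']
```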
The most common uses of regular expressions are:
1. Searching for a pattern in a string (match and search)
2. Finding all occurrences of a pattern (findall)
3. Breaking a string into substrings (split)
4. Replacing part of a string (sub)
The "re" module provides different methods to perform these tasks. The most common methods are:
– match()
– search()
– findall()
– split()
– sub()
– compile()
Let's discuss them one by one.
re.match():
This method looks for a match only at the start of the string. For example, take the string "regular expressions in machinelearning". Calling match() with the pattern "regular" succeeds because the string starts with "regular". Calling it with the pattern "machinelearning" fails because that word is not at the start of the string.
import re
result = re.match(r'regular', 'regular expressions in nlp')
print(result.group(0))
Output: regular
If we look for "nlp" in the same string, let's see what we get.
result = re.match(r'nlp', 'regular expressions in nlp')
print(result)
Output: None
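Because match() returns None when the pattern is not found at the start, calling .group() directly on the result would raise an AttributeError. A small sketch of one way to guard against that follows. The helper name first_match is hypothetical and not part of the re module.

```python
import re

def first_match(pattern, text):
    """Return the text matched at the start of `text`, or None if there is no match."""
    result = re.match(pattern, text)
    # re.match returns None on failure, so check before calling .group()
    if result is None:
        return None
    return result.group(0)

print(first_match(r"regular", "regular expressions in nlp"))  # regular
print(first_match(r"nlp", "regular expressions in nlp"))      # None
```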
re.search():
This method is similar to match(), but it is not restricted to the start of the string. It can find the pattern at any position in the sentence, and it returns the first occurrence.
For example:
result = re.search(r'nlp', 'regular expressions in nlp')
print(result.group(0))
Output: nlp
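The difference between the two methods is easiest to see side by side. The sketch below runs both on the same sentence and also prints where the match was found. The variable names are illustrative.

```python
import re

sentence = "regular expressions in nlp"

at_start = re.match(r"nlp", sentence)   # only tries position 0
anywhere = re.search(r"nlp", sentence)  # scans the whole string

print(at_start)           # None
print(anywhere.group(0))  # nlp
print(anywhere.span())    # (23, 26)
```

The span is a (start, end) pair of indexes, so sentence[23:26] is the matched text.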
re.findall():
This method returns a list of all matches of the pattern. There is no constraint on position: matches are collected from the start to the end of the string.
For example:
result = re.findall(r'nlp', 'nlp regular expressions in nlp')
print(result)
Output: ['nlp', 'nlp']
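Unlike match() and search(), findall() does not return None when nothing is found. It returns an empty list, which makes it convenient for counting. A short sketch, with variable names chosen only for illustration:

```python
import re

sentence = "nlp regular expressions in nlp"

hits = re.findall(r"nlp", sentence)
misses = re.findall(r"python", sentence)

print(hits)       # ['nlp', 'nlp']
print(len(hits))  # 2
print(misses)     # []
```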
re.split():
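This method splits a string at every occurrence of the pattern.
For example:
result = re.split(r'g', 'regular')
print(result)
Output: ['re', 'ular']
The split method has one more parameter, maxsplit. Its default value is 0, which means there is no limit on the number of splits. If we set it to a positive number, at most that many splits are made and the rest of the string is returned as the last element.
result = re.split(r'r', 'regular expressions')
print(result)
Output: ['', 'egula', ' exp', 'essions']
result = re.split(r'r', 'regular expressions', maxsplit=1)
print(result)
Output: ['', 'egular expressions']
The empty first element appears because the string itself starts with the separator "r".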
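The separator passed to split() can be a full pattern, not just a single character. The sketch below splits on commas, semicolons, or whitespace. The sample record is made up for illustration.

```python
import re

record = "apple, banana;cherry  mango"

# Split on a comma, a semicolon, or whitespace, treating a run of separators as one
parts = re.split(r"[,;\s]+", record)
print(parts)    # ['apple', 'banana', 'cherry', 'mango']

# maxsplit=2 stops after two splits; the remainder is returned as the last item
limited = re.split(r"[,;\s]+", record, maxsplit=2)
print(limited)  # ['apple', 'banana', 'cherry  mango']
```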
re.sub():
This method searches for the pattern and replaces each match with a new string. If the pattern is not found, the sentence is returned unchanged.
For example:
result1 = re.sub(r'python', 'data science', 'i2 tutorials for python')
print(result1)
Output: i2 tutorials for data science
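The sketch below shows three cases of sub(): replacing every match, replacing only the first match with the count parameter, and a pattern that does not occur. The sample sentence is invented for illustration.

```python
import re

sentence = "python is fun, python is popular"

replaced_all = re.sub(r"python", "data science", sentence)
replaced_one = re.sub(r"python", "data science", sentence, count=1)
unchanged = re.sub(r"java", "data science", sentence)

print(replaced_all)           # data science is fun, data science is popular
print(replaced_one)           # data science is fun, python is popular
print(unchanged == sentence)  # True
```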
re.compile():
This method compiles a regular expression into a pattern object, which can then be reused for pattern matching.
import re
pattern = re.compile('i2')
result = pattern.findall('i2 tutorials i2')
print(result)
result2 = pattern.findall('i2 tutorials is largest analytics community of India')
print(result2)
Output:
['i2', 'i2']
['i2']
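A compiled pattern object offers the same methods as the module-level functions (findall, search, sub, and so on), so it is handy when one pattern is applied to many strings. A minimal sketch, with the list of lines made up for illustration:

```python
import re

word = re.compile(r"i2")

lines = [
    "i2 tutorials i2",
    "regular expressions in nlp",
]

# Reuse the same compiled pattern on every line
counts = [len(word.findall(line)) for line in lines]
print(counts)    # [2, 0]

# The pattern object also supports sub()
replaced = word.sub("I2", lines[0])
print(replaced)  # I2 tutorials I2
```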
Regular expressions also provide metacharacters that let us extract characters in whatever way is convenient. The most commonly used patterns are:
1. Extract each character of a string:
import re
result = re.findall(r'.', 'i2 tutorials for datascience')
print(result)
Output: ['i', '2', ' ', 't', 'u', 't', 'o', 'r', 'i', 'a', 'l', 's', ' ', 'f', 'o', 'r', ' ', 'd', 'a', 't', 'a', 's', 'c', 'i', 'e', 'n', 'c', 'e']
In the output above the spaces are also extracted. We can avoid them by using \w instead of . because \w matches only letters, digits, and the underscore.
import re
result = re.findall(r'\w', 'i2 tutorials for datascience')
print(result)
Output: ['i', '2', 't', 'u', 't', 'o', 'r', 'i', 'a', 'l', 's', 'f', 'o', 'r', 'd', 'a', 't', 'a', 's', 'c', 'i', 'e', 'n', 'c', 'e']
2. Extract the words:
import re
result = re.findall(r'\w*', 'i2 tutorials for data science')
print(result)
Output: ['i2', '', 'tutorials', '', 'for', '', 'data', '', 'science', '']
The empty strings appear because \w* also matches zero characters, which happens at each space and at the end of the string. We can avoid them by using \w+ instead of \w*, since + requires at least one character.
import re
result = re.findall(r'\w+', 'regular expressions in nlp')
print(result)
Output: ['regular', 'expressions', 'in', 'nlp']
3. Extract the first and last words using ^ and $:
Use ^ to fetch the word at the start of the sentence.
import re
result = re.findall(r'^\w+', 'regular expressions in nlp')
print(result)
Output: ['regular']
The word at the end is fetched by using \w+$ instead of ^\w+.
import re
result = re.findall(r'\w+$', 'regular expressions in nlp')
print(result)
Output: ['nlp']
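By default ^ and $ anchor only at the very start and very end of the whole string. The sketch below contrasts that with the re.MULTILINE flag, which makes them anchor at each line as well. The two-line sample text is invented for illustration.

```python
import re

text = "regular expressions\nin nlp"

# Without flags, ^ anchors at the start of the string and $ at its end
first_plain = re.findall(r"^\w+", text)
last_plain = re.findall(r"\w+$", text)
print(first_plain)  # ['regular']
print(last_plain)   # ['nlp']

# With re.MULTILINE, ^ and $ also anchor at every line break
first_multi = re.findall(r"^\w+", text, flags=re.MULTILINE)
last_multi = re.findall(r"\w+$", text, flags=re.MULTILINE)
print(first_multi)  # ['regular', 'in']
print(last_multi)   # ['expressions', 'nlp']
```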
4. First two characters of each word:
Using \w\w, every word is divided into pairs of characters.
import re
result = re.findall(r'\w\w', 'regular expressions in nlp')
print(result)
Output: ['re', 'gu', 'la', 'ex', 'pr', 'es', 'si', 'on', 'in', 'nl']
To fetch only the first two characters of each word, use \b\w\w instead of \w\w. The \b matches a word boundary, so each match must begin at the start of a word.
import re
result = re.findall(r'\b\w\w', 'regular expressions in nlp')
print(result)
Output: ['re', 'ex', 'in', 'nl']
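The r prefix on the pattern matters for \b in particular. In an ordinary Python string "\b" is the backspace character, so the regex engine never sees a word boundary. A small sketch of the difference, with illustrative variable names:

```python
import re

sentence = "regular expressions in nlp"

# Raw string: \b reaches the regex engine as a word boundary
raw_hits = re.findall(r"\b\w\w", sentence)
print(raw_hits)    # ['re', 'ex', 'in', 'nl']

# Non-raw string: "\b" is a backspace character, which the sentence does not contain
plain_hits = re.findall("\b\\w\\w", sentence)
print(plain_hits)  # []
```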
5. Extract all the characters after @:
import re
result = re.findall(r'@\w+', 'i2@yahoo.com, i2@gmail.com,i2@orkut.com')
print(result)
Output: ['@yahoo', '@gmail', '@orkut']
If you want the .com part as well, use @\w+.\w+
import re
result = re.findall(r'@\w+.\w+', 'i2@yahoo.com, i2@gmail.com,i2@orkut.com')
print(result)
Output: ['@yahoo.com', '@gmail.com', '@orkut.com']
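In the pattern @\w+.\w+ the dot is a metacharacter, so it matches any character, not only a literal ".". The sketch below compares it with the escaped form \. on a deliberately malformed address. The sample string is invented for illustration.

```python
import re

emails = "i2@yahoo.com, i2@gmail-com, i2@orkut.com"

# Unescaped dot: also accepts "-" between the two parts
loose = re.findall(r"@\w+.\w+", emails)
print(loose)   # ['@yahoo.com', '@gmail-com', '@orkut.com']

# Escaped dot: only a literal "." is accepted
strict = re.findall(r"@\w+\.\w+", emails)
print(strict)  # ['@yahoo.com', '@orkut.com']
```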
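6. Fetching only the domain extension (com, in, org) using @\w+.(\w+):
The parentheses form a group. When the pattern contains a group, findall returns only the text matched by the group.
import re
result = re.findall(r'@\w+.(\w+)', 'i2@yahoo.com, i2tutorial@gmail.in,i2python@orkut.org')
print(result)
Output: ['com', 'in', 'org']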
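What findall() returns depends on the number of groups in the pattern. The sketch below runs three variants on the same addresses: no group, one group, and two groups. The dot is escaped here so that it matches only a literal ".".

```python
import re

emails = "i2@yahoo.com, i2tutorial@gmail.in,i2python@orkut.org"

# No group: findall returns the whole match
whole = re.findall(r"@\w+\.\w+", emails)
print(whole)  # ['@yahoo.com', '@gmail.in', '@orkut.org']

# One group: findall returns only the grouped part
ext = re.findall(r"@\w+\.(\w+)", emails)
print(ext)    # ['com', 'in', 'org']

# Two groups: findall returns one tuple per match
pairs = re.findall(r"@(\w+)\.(\w+)", emails)
print(pairs)  # [('yahoo', 'com'), ('gmail', 'in'), ('orkut', 'org')]
```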
7. Fetching the words that start with a vowel:
import re
result = re.findall(r'[AEIOUaeiou]\w+', 'regular expressions in nlp')
print(result)
Output: ['egular', 'expressions', 'in']
In the output above, 'egular' is meaningless: the match simply started at the first vowel inside "regular". We can drop it by adding \b, so that the vowel must be at the start of a word.
import re
result = re.findall(r'\b[AEIOUaeiou]\w+', 'regular expressions in nlp')
print(result)
Output: ['expressions', 'in']
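Instead of listing both upper-case and lower-case vowels in the character class, the re.IGNORECASE flag can be used. A brief sketch comparing the two spellings on a mixed-case sentence, which is invented for illustration:

```python
import re

sentence = "Regular expressions In NLP are Useful"

# Both cases listed explicitly in the character class
explicit = re.findall(r"\b[AEIOUaeiou]\w+", sentence)

# Lower-case vowels only, with case ignored by the flag
flagged = re.findall(r"\b[aeiou]\w+", sentence, flags=re.IGNORECASE)

# Lower-case vowels only, no flag: capitalised words are missed
lower_only = re.findall(r"\b[aeiou]\w+", sentence)

print(explicit)    # ['expressions', 'In', 'are', 'Useful']
print(flagged)     # ['expressions', 'In', 'are', 'Useful']
print(lower_only)  # ['expressions', 'are']
```

Note that \w+ needs at least one character after the vowel, so a one-letter word such as "a" would not be returned by any of these patterns.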
8. Extract information from the <td> cells inside each <tr> row of an HTML file:
Here is the sample HTML. Assume it is stored in a string variable named html.
<tr align="center"><td>1</td> <td>apple</td> <td>orange</td></tr>
<tr align="center"><td>2</td> <td>banana</td> <td>cucumber</td></tr>
<tr align="center"><td>3</td> <td>mango</td> <td>pomegranate</td></tr>
<tr align="center"><td>4</td> <td>kiwi</td> <td>papaya</td> </tr>
<tr align="center"><td>5</td> <td>grapes</td> <td>custardapple</td></tr>
import re
result = re.findall(r'<td>\w+</td>\s<td>(\w+)</td>\s<td>(\w+)</td>', html)
print(result)
Output: [('apple', 'orange'), ('banana', 'cucumber'), ('mango', 'pomegranate'), ('kiwi', 'papaya'), ('grapes', 'custardapple')]
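The same idea can be taken one step further by also capturing the row number and building a dictionary from the matches. This is only a sketch under assumptions: the two-row html string and the names row and table are illustrative, and the pattern relies on the cells being separated by exactly one whitespace character, as in the sample.

```python
import re

# Shortened sample in the same layout as the table rows above
html = (
    '<tr align="center"><td>1</td> <td>apple</td> <td>orange</td></tr> '
    '<tr align="center"><td>2</td> <td>banana</td> <td>cucumber</td></tr>'
)

# Three groups: the row number and the two cell values
row = re.compile(r"<td>(\d+)</td>\s<td>(\w+)</td>\s<td>(\w+)</td>")

# findall yields one (number, first, second) tuple per row
table = {int(num): (first, second) for num, first, second in row.findall(html)}
print(table)  # {1: ('apple', 'orange'), 2: ('banana', 'cucumber')}
```

For HTML that is nested or irregular, a dedicated parser is usually a more robust choice than a regular expression.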
In this article we explained regular expressions, metacharacters, and the main methods of the "re" module. We covered the most commonly used patterns and the most common methods for solving everyday regular expression problems.