
Match URLs using regular expressions in Python


This article explains how to match URLs in text using regular expressions in Python.

What are regular expressions?

A regular expression, usually shortened to regex or regexp, is a sequence of characters that defines a search pattern. A pattern can describe dates, numbers, lowercase or uppercase text, URLs and so on, and in practice most patterns combine several of these. Regular expressions are used for real-world string operations such as input validation, for example checking email addresses or phone numbers.
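As a small illustration of input validation, the sketch below checks whether a string has the shape of a date. The pattern and the helper name looks_like_date are hypothetical, chosen only for this example. The check covers the format, not whether the date actually exists.

```python
import re

# Hypothetical pattern for illustration: a date written as YYYY-MM-DD.
DATE_PATTERN = re.compile(r"\d{4}-\d{2}-\d{2}")


def looks_like_date(text):
    """Return True if the whole string has the shape YYYY-MM-DD."""
    # fullmatch succeeds only if the pattern covers the entire string.
    return DATE_PATTERN.fullmatch(text) is not None


print(looks_like_date("2024-01-31"))   # True
print(looks_like_date("31 Jan 2024"))  # False
```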

Regular expressions are also widely used to search and extract text, for example when scraping web pages. Almost every programming language supports them, either built in or through an installable library.

Here, we will learn how to read such a pattern and match URLs using the Python library “re”.

“re” is the regular expression library that ships with the Python programming language. It provides functions and methods for working with any kind of pattern.
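To give a feel for what the library offers, here is a minimal sketch of three commonly used functions: re.search, re.findall and re.sub. The sample text and variable names are made up for this example.

```python
import re

# Made-up sample text for illustration.
text = "Order 66 shipped on 2024-01-31, order 67 on 2024-02-02."

# search: the first match as a match object, or None if nothing matches.
first_number = re.search(r"\d+", text)

# findall: every non-overlapping match, returned as a list of strings.
all_dates = re.findall(r"\d{4}-\d{2}-\d{2}", text)

# sub: a new string with every match replaced.
masked = re.sub(r"\d", "#", text)

print(first_number.group(0))  # 66
print(all_dates)              # ['2024-01-31', '2024-02-02']
print(masked)
```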

Our main aim is to extract the URLs from a given piece of text.

First, import the “re” library. It is part of the Python standard library, so nothing needs to be installed.

import re

Next, assign the text to a variable:

var= 'You can understand the regular expressions in this link https://en.wikipedia.org/wiki/Regular_expression and you can get more practice using http://www.i2tutorials.com and you can get the python documentation from http://python.org'

Then use the “findall” function in re to extract the URLs. The pattern matches the protocols http, ftp and https, followed by the colon (:) and forward slashes (/) that separate the protocol from the rest of the URL, then the host name and an optional path:

re.findall(r'(http|ftp|https):\/\/([\w\-_]+(?:(?:\.[\w\-_]+)+))([\w\-\.,@?^=%&:/~\+#]*[\w\-\@?^=%&/~\+#])?', var)
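The pattern is easier to read when it is split into commented parts. The sketch below writes the same pattern with the re.VERBOSE flag, which ignores whitespace and allows comments inside the pattern. The names URL_PATTERN and sample are introduced here only for illustration.

```python
import re

# The same pattern as above, split into commented parts with re.VERBOSE.
URL_PATTERN = re.compile(r"""
    (http|ftp|https)                # group 1: the protocol
    :\/\/                           # the literal :// separator
    ([\w\-_]+(?:(?:\.[\w\-_]+)+))   # group 2: host name, labels joined by dots
    ([\w\-\.,@?^=%&:/~\+#]*         # group 3 (optional): path, query, fragment
     [\w\-\@?^=%&/~\+#])?           #   last character may not be . , or :
""", re.VERBOSE)

sample = "Docs are at http://python.org and https://en.wikipedia.org/wiki/Regular_expression."
print(URL_PATTERN.findall(sample))
```

The final character class leaves out the dot, comma and colon, so punctuation that ends a sentence is not swallowed into the URL.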

Below is the complete program:

import re

var= 'You can understand the regular expressions in this link https://en.wikipedia.org/wiki/Regular_expression and you can get more practice using http://www.i2tutorials.com and you can get the python documentation from http://python.org'

re.findall(r'(http|ftp|https):\/\/([\w\-_]+(?:(?:\.[\w\-_]+)+))([\w\-\.,@?^=%&:/~\+#]*[\w\-\@?^=%&/~\+#])?', var)

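The output is a list of tuples, one per URL found in the text. Each tuple holds the protocol, the host name and the path, because the pattern contains three capturing groups; the path is an empty string when the URL has none. An interactive session displays the list automatically; in a script, wrap the call in print().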

Output:

[('https', 'en.wikipedia.org', '/wiki/Regular_expression'),
 ('http', 'www.i2tutorials.com', ''),
 ('http', 'python.org', '')]
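If you need each URL as one string rather than as separate parts, one option is re.finditer, which yields match objects; group(0) on a match object is the complete matched text. This is a minimal sketch using the same pattern and text; the names URL_PATTERN and urls are assumptions for this example.

```python
import re

URL_PATTERN = re.compile(
    r'(http|ftp|https):\/\/([\w\-_]+(?:(?:\.[\w\-_]+)+))'
    r'([\w\-\.,@?^=%&:/~\+#]*[\w\-\@?^=%&/~\+#])?'
)

var = ('You can understand the regular expressions in this link '
       'https://en.wikipedia.org/wiki/Regular_expression and you can get more '
       'practice using http://www.i2tutorials.com and you can get the python '
       'documentation from http://python.org')

# group(0) is the whole match, regardless of the capturing groups inside it.
urls = [match.group(0) for match in URL_PATTERN.finditer(var)]

for url in urls:
    print(url)
```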
