
Analysis of Data using NLP and Python

When working with data, we often need to analyze it for different purposes. Such analysis includes counting the number of words, counting the occurrences of each word, determining the length of a text, identifying a specific keyword in the text, and so on. Python supports these kinds of analysis through Natural Language Processing (NLP).

Let us take an example dataset whose columns are a serial number, URL, Title, and Description. We will perform operations such as tokenization and stop-word removal, along with some standard Python functions, to carry out the analysis mentioned above. We will walk through each step of the process in detail.

Importing Libraries

In order to perform the above operations, we have to import the required libraries in Python: numpy, pandas, nltk, and seaborn (for visualizations).

Python Code:

import numpy as np 
import pandas as pd
import seaborn as sns
import nltk
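
The stop-word list used later in this article ships as a separate NLTK corpus. If it has never been downloaded on your machine, a one-time download is needed first; a minimal sketch:

nltk.download('stopwords')   # one-time download of the NLTK stop-word corpus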

Extracting Data

To perform the analysis, we need data. We load it from the file system using pandas: pd.read_excel() reads data stored in an Excel file, and head() displays the first 5 rows of the data.

Python Code:

data=pd.read_excel('C:\\Users\\Documents\\data.xlsx')
data.head()

Output:

[Output: first five rows of the data]
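
If the Excel file is not at hand, a small hypothetical DataFrame with the same columns lets you follow along (the values, and the serial-number column name 'S.No', are made up for illustration):

data = pd.DataFrame({
    'S.No': [1, 2],
    'URL': ['https://www.i2tutorials.com/python-tutorial',
            'https://www.i2tutorials.com/nlp-basics'],
    'Title': ['Python Tutorial', 'NLP Basics'],
    'Description': ['Learn the basics of Python programming',
                    'An introduction to natural language processing']
})
data.head()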


Next, we convert the Description column of the data to the string type. We then pad each description on the right so that the text is more convenient to read.

Python Code:

df1= data['Description'].astype(str)
df1.str.pad(150,side='right',fillchar=' ').head()

Output:

[Output: padded Description column]
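
To see what str.pad does in isolation, here is a hypothetical two-element Series, not taken from the dataset:

s = pd.Series(['short', 'a longer description'])
s.str.pad(25, side='right', fillchar='.')
# 0    short....................
# 1    a longer description.....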


Tokenization

Tokenization means splitting the text of the data into tokens, that is, separating each word of the text into its own string. For better analysis, we first convert the entire text to lower case. We apply tokenization to the Description column and store the result in a new column of the data.

Python Code:

import re
def tokenize(text):
    tokens = re.split(r'\W+', text)   # split on any run of non-word characters
    return tokens
data['TokenDescription'] = df1.apply(lambda x: tokenize(x.lower()))
data.head()

Output:

[Output: data with the new TokenDescription column]
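
A quick check of the function on a standalone string (a made-up example, not a row of the dataset) shows what the tokens look like:

tokenize('Natural Language Processing with Python'.lower())
# ['natural', 'language', 'processing', 'with', 'python']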


Removing Stop words

The data will contain stop words such as 'the', 'is', and 'to', which add little to the analysis and also reduce its performance. So, we remove them from our tokenized data using the English stop-word list from the nltk corpus: any token that appears in that list is dropped. We add the result to our data as a new column, nostop_token_Articles. Let us see the data without stop words.

Python Code:

stopword = nltk.corpus.stopwords.words('english')   # note: the corpus name is lower-case 'english'
def remove_stopwords(tokenized_list):
    text = [word for word in tokenized_list if word not in stopword]
    return text
data['nostop_token_Articles'] = data['TokenDescription'].apply(lambda x: remove_stopwords(x))
data.head()

Output:

[Output: data with the new nostop_token_Articles column]
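
A quick check on a hypothetical token list shows the effect ('the' and 'is' are in NLTK's English stop-word list):

remove_stopwords(['the', 'quick', 'brown', 'fox', 'is', 'fast'])
# ['quick', 'brown', 'fox', 'fast']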

Python Code:

data['nostop_token_Articles'].head()

Output:

[Output: first five rows of nostop_token_Articles]

For further analysis, we flatten nostop_token_Articles into a single list named var.

Python Code:

datalist = data['nostop_token_Articles']
var = []
for i in datalist:          # each row holds a list of tokens
    for j in i:
        var.append(j)       # collect every token into one flat list
#print(var)
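
The nested loop above simply flattens a list of token lists; if you prefer, the same result can be obtained with itertools:

import itertools
var = list(itertools.chain.from_iterable(datalist))   # flatten the token lists into one list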

Counting number of words

To count how many times each word is repeated in the data, we import collections; collections.Counter() counts the repeated words in the text. We then create an empty dictionary and update it with each word and its count. Finally, we sort the counts in descending order, which shows the most frequently repeated words first.

Python Code:

import collections
word_counts = collections.Counter(var)   # maps each token to its number of occurrences
var1 = {}
for word, count in sorted(word_counts.items()):
    # print('%s is repeated %d time%s.' % (word, count, "s" if count > 1 else ""))
    var1.update({word: count})
Keyword_Counts = pd.DataFrame({'count': var1})
Keyword_Counts = Keyword_Counts.sort_values('count', ascending=False)   # sort_values returns a new DataFrame
Keyword_Counts.to_csv('C://Users//sriharshithaghali//OneDrive//Documents//sorteddata.csv', encoding='utf-8')
Keyword_Counts.head()

Output:

[Output: words with the highest counts]
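
As a side note, Counter can produce the same ranking directly, without building the intermediate dictionary:

word_counts.most_common(5)   # five most frequent tokens as (word, count) pairs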

Length of URLs and Title

As part of the analysis, we will also find the lengths of the URLs and Titles in the data. The str.len() function determines the length of each URL and Title. To make later steps easier, we add these lengths as new columns to the data.

Python Code:

data['URL Length']=data['URL'].str.len()
data['Title Length']=data['Title'].str.len()
data.head()

Output:

[Output: data with the new URL Length and Title Length columns]

For better analysis, we sort the titles by their lengths in descending order. To do this, we first create an empty list and append the Titles of the data to it. We then build a dictionary mapping each Title to its length, which lets us sort the titles by length in descending order.

Python Code:

titlelist = data['Title']
var2 = []
for i in titlelist:
    var2.append(i)
#print(var2)

var3 = {}
for element in var2:
    # print('%s has %d letter%s.' % (element, len(element), "s" if len(element) > 1 else ""))
    var3.update({element: len(element)})
Title_length = pd.DataFrame({'length': var3})
Title_length = Title_length.sort_values('length', ascending=False)   # sort_values returns a new DataFrame
Title_length.to_csv('C://Users//Documents//sorteddata2.csv', encoding='utf-8')
Title_length.head()

Output:

[Output: titles sorted by length in descending order]
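
Since 'Title Length' was already added as a column earlier, the same ranking could also be read straight off the DataFrame:

data[['Title', 'Title Length']].sort_values('Title Length', ascending=False).head()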

The same process is used for the lengths of the URLs.

Python Code:

urllist = data['URL']
var4 = []
for i in urllist:
    var4.append(i)
#print(var4)

var5 = {}
for element in var4:
    # print('%s has %d letter%s.' % (element, len(element), "s" if len(element) > 1 else ""))
    var5.update({element: len(element)})
url_length = pd.DataFrame({'length': var5})
url_length = url_length.sort_values('length', ascending=False)
url_length.to_csv('C://Users//Documents//sorteddata3.csv', encoding='utf-8')
url_length.head()

Output:

[Output: URLs sorted by length in descending order]

Identifying URLs for a given Keyword

As discussed earlier, we output the URLs that match a given keyword. For this we create an empty list to hold the matching URLs, then loop over the list of URLs and keep each one in which the keyword appears.

Python Code:

ls = var4
matches = []
keyword = input('enter keyword: ')
for match in ls:
    if keyword in match:
        matches.append(match)
print(matches)

Output:

[Output: URLs containing the entered keyword]
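
Note that substring matching with the in operator is case-sensitive; a small variant that ignores case:

keyword = input('enter keyword: ')
matches = [url for url in var4 if keyword.lower() in url.lower()]
print(matches)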
