
A Comprehensive Guide on Text Cleaning Using the nltk Library

NLTK is a library that takes strings as input and returns its results as strings or lists of strings. It offers many algorithms for the same task, which makes it valuable for learning: you can try several variants and compare their outputs. Other NLP libraries include spaCy, CoreNLP, PyNLPI, and Polyglot; NLTK and spaCy are the most widely used, and spaCy works especially well with large data and for advanced NLP.

 

Data scraped from websites generally arrives as raw text. It should be cleaned before you analyze it or fit a model to it, so that your machine learning system can pick up on the attributes that matter. Cleaning usually consists of a number of steps. Let’s begin with the cleaning techniques!

 

Removing extra spaces

 

Text data may contain extra spaces between words or before and after a sentence. We can remove these extra spaces from each sentence using regular expressions.

 

Example:

 

import re
doc = "i2tutorials the best learning site for   Python.  "
new_doc = re.sub(r"\s+", " ", doc)
print(new_doc)

 

Output:

 

i2tutorials the best learning site for Python.

 

Removing punctuations

 

Punctuation in the text makes it harder to differentiate words and does not add value to the data, so it is usually removed.

 

Example:

 

text = "Hello! i2tutorials provides the best Python Course!"
re.sub("[^-9A-Za-z ]", "" , text)

 

Output:

 

'Hello i2tutorials provides the best Python Course'

 

Punctuation can also be removed with the help of the punctuation constant from the string module.

 

Example:

 

import string
text = "Hello! i2tutorials provides the best Python and Machine Learning Course!"
text_clean = "".join([i for i in text if i not in string.punctuation])
text_clean

 

Output:

 

'Hello i2tutorials provides the best Python and Machine Learning Course'

 

Case Normalization

 

Python is a case-sensitive language, so it treats NLP and nlp as different tokens. We can convert a string to a single case using str.lower() or str.upper().

 

Example:

 

Below is an example that converts each character to lower case while checking for punctuation.

 

import string
text = "Hello! i2tutorials provides the best Python and Machine Learning Course!"
text_clean = "".join([i.lower() for i in text if i not in string.punctuation])
text_clean

 

Output:

 

'hello i2tutorials provides the best python and machine learning course'

 

Tokenization

 

It is the process of splitting a sentence into words, so that each sentence becomes a list of words. There are primarily three types of tokenizers available.

 

word_tokenize:

 

This is a generic tokenizer that separates words and punctuation marks into individual tokens. The apostrophe, however, is not split off as punctuation: contractions such as I'm are broken into the word parts I and 'm.

 

Example:

 

#word tokenize
import nltk
text = "Hello!  I'm very excited to share that i2tutorials provides the best Python and Machine Learning Course's!"
nltk.tokenize.word_tokenize(text)

 

Output:

 

['Hello', '!', 'I', "'m", 'very', 'excited', 'to', 'share', 'that', 'i2tutorials', 'provides', 'the', 'best', 'Python', 'and', 'Machine', 'Learning', 'Course', "'s", '!']

 

Notice that in the above output the words are split at the punctuation marks, while the apostrophes stay attached to the pieces 'm and 's.

 

Tweet Tokenizer:

 

This tokenizer is specifically used for social media text containing hashtags (#), mentions (@), and emoticons.

 

Example:

 

#tweet tokenize
text = "Hello! I'm very excited to share that i2tutorials provides the best Python and Machine Learning Course's!"
from nltk.tokenize import TweetTokenizer
tweet = TweetTokenizer()
tweet.tokenize(text)

 

Output:

 

['Hello', '!', "I'm", 'very', 'excited', 'to', 'share', 'that', 'i2tutorials', 'provides', 'the', 'best', 'Python', 'and', 'Machine', 'Learning', "Course's", '!']
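
The sentence above does not actually contain any hashtags or emoticons, so here is an additional minimal sketch (the tweet text is made up purely for illustration) of what TweetTokenizer keeps intact:

#tweet tokenize with a hashtag, a mention and an emoticon
from nltk.tokenize import TweetTokenizer
tweet = TweetTokenizer()
tweet.tokenize("Loving the new #Python course from @i2tutorials :-)")

Here the hashtag #Python, the handle @i2tutorials, and the emoticon :-) each come back as a single token, whereas word_tokenize would split the # and @ off as separate symbols.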

 

regexp_tokenize:

 

It can be used when we want to pull out only the tokens of interest, such as all hashtags or handles in a tweet, or hyperlinks in a text. Here you describe the tokens (or the separators between them) with an ordinary regular expression. The example below does this with the standard re module; a sketch using nltk's own regexp_tokenize follows the output.

 

Example:

 

#regexp_tokenize:
import re
a = 'Visit our site for Python and Machine Learning related concepts @i2tutorials'
re.split(r'\s@', a)

 

Output:

 

['Visit our site for Python and Machine Learning related concepts', 'i2tutorials']
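
The example above uses the standard re.split function. nltk also provides regexp_tokenize, which returns only the pieces of the text that match a pattern; the pattern below is an assumption chosen for illustration and extracts the @handle:

#regexp_tokenize with an explicit pattern (illustrative sketch)
from nltk.tokenize import regexp_tokenize
a = 'Visit our site for Python and Machine Learning related concepts @i2tutorials'
regexp_tokenize(a, r'@\w+')

This returns ['@i2tutorials']; swapping the pattern for r'#\w+' would extract hashtags instead.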

 

Removing Stopwords

 

Stopwords are common words such as I, he, she, and, but, was, were, being, have, etc., which add little meaning to the data. After tokenizing the text, these stopwords can be removed, which helps to reduce the number of features in our data.

 

Example:

 

import nltk
import string
# nltk.download('stopwords')  # uncomment if the stopword list has not been downloaded yet
stopwords = nltk.corpus.stopwords.words('english')
text = "Hello! I'm very excited to share that i2tutorials provides the best Python and Machine Learning Course's!"
# remove punctuation first
text_new = "".join([i for i in text if i not in string.punctuation])
print(text_new)
# tokenize the cleaned text
words = nltk.tokenize.word_tokenize(text_new)
print(words)
# drop the stopwords
words_new = [i for i in words if i not in stopwords]
print(words_new)
# drop the stopwords and very short tokens
words_new = [i for i in words if i not in stopwords and len(i)>2]
print(words_new)

 

Output:

 

Hello Im very excited to share that i2tutorials provides the best Python and Machine Learning Courses
['Hello', 'Im', 'very', 'excited', 'to', 'share', 'that', 'i2tutorials', 'provides', 'the', 'best', 'Python', 'and', 'Machine', 'Learning', 'Courses']
['Hello', 'Im', 'excited', 'share', 'i2tutorials', 'provides', 'best', 'Python', 'Machine', 'Learning', 'Courses']
['Hello', 'excited', 'share', 'i2tutorials', 'provides', 'best', 'Python', 'Machine', 'Learning', 'Courses']

 

Lemmatization & Stemming

 

Stemming:

 

It is a technique that reduces a word to its root form by chopping off suffixes. The stemmed word may not be a meaningful word and may not appear in the dictionary. There are mainly two stemmers: the Porter Stemmer and the Snowball Stemmer, an improved version of the Porter Stemmer.

 

Example:

 

ps = nltk.PorterStemmer()
w = [ps.stem(word) for word in words_new]
print(w)

 

Output:

 

[the stemmed word list, e.g. excited → excit, Machine → machin, Learning → learn, Courses → cours]

 

OR

 

ss = nltk.SnowballStemmer(language = 'english')
w = [ss.stem(word) for word in words_new]
print(w)

 

Output:

 

[the stemmed word list; for this sentence the Snowball output is essentially the same as the Porter output]

 

Lemmatization:

 

It reduces a word to its dictionary root form, called the lemma, and by default it treats every word as a noun. It is more accurate than stemming because it uses a more informed, dictionary-based analysis to group words with comparable meanings depending on the context, so it is also more complex and takes more time.

 

Example:

 

import nltk
nltk.download('wordnet')
wn = nltk.WordNetLemmatizer()
w = [wn.lemmatize(word) for word in words_new]
print(w)

 

Output:

 

[the lemmatized word list; since the default part of speech is noun, most of these tokens come back unchanged]
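
As mentioned above, the lemmatizer treats every word as a noun unless a part of speech is passed. Below is a minimal sketch of supplying the part of speech explicitly; the words here are chosen only for illustration:

#lemmatize with an explicit part of speech
wn = nltk.WordNetLemmatizer()
print(wn.lemmatize("feet"))              # noun by default -> foot
print(wn.lemmatize("running"))           # treated as a noun -> running
print(wn.lemmatize("running", pos='v'))  # treated as a verb -> run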

 

The techniques discussed above are the cleaning methods used to get our text data ready for analysis and model building. It is not required to perform all of these steps for every dataset.

Sometimes you also need to create new features for analysis, such as the percentage of punctuation in each text or the length of each review in a large product dataset. You can then check, for example, whether spam mails contain a higher percentage of punctuation than ham mails, or whether positive-sentiment reviews contain more punctuation than negative-sentiment reviews (or vice versa). A small sketch of such a feature is shown below.
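
As a small sketch of such a feature (punct_percent is a made-up helper name, not part of nltk), the function below computes the percentage of punctuation characters in a piece of text:

#percentage of punctuation in a text (illustrative helper)
import string

def punct_percent(text):
    # count punctuation characters and divide by the number of non-space characters
    count = sum(1 for ch in text if ch in string.punctuation)
    length = len(text) - text.count(" ")
    return round(count / length * 100, 2) if length else 0

punct_percent("Hello! i2tutorials provides the best Python Course!")

A feature like this can then be compared across spam and ham mails, or across positive and negative reviews.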

When the text cleaning is done, we will proceed with text analytics. Before building a model, it is important to convert the text data into a numeric form that the machine understands; this step is called vectorization. A brief preview is sketched below.
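
As a brief, hedged preview of vectorization (this sketch assumes scikit-learn, which is a separate library from nltk), a bag-of-words count matrix is one common way to turn cleaned text into numbers:

#vectorization preview using a bag-of-words count matrix
from sklearn.feature_extraction.text import CountVectorizer

corpus = ["i2tutorials provides the best python course",
          "python and machine learning course from i2tutorials"]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)
print(vectorizer.get_feature_names_out())   # the vocabulary learned from the corpus
print(X.toarray())                          # word counts per document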

 

 
