A Comprehensive Guide on Text Cleaning Using the nltk Library
NLTK is a library that takes string input and outputs results as either a string or a list of strings. It offers many algorithms, which makes it valuable for learning purposes: one can experiment with and compare the outputs of different variants. There are other libraries as well, such as spaCy, CoreNLP, PyNLPI, and Polyglot; NLTK and spaCy are the most broadly used. spaCy works well with large data and advanced NLP tasks.
Data scraped from websites is generally raw text. This data should be cleaned before analyzing it or fitting a model to it, so that your machine learning system can pick up on meaningful attributes. Cleaning the data generally consists of a number of steps. Let’s begin with the cleaning techniques!
Removing extra spaces
Text data may contain extra spaces between words or before and after a sentence. We can remove these extra spaces from each sentence by using regular expressions.
import re

doc = "i2tutorials the best learning site for Python. "
new_doc = re.sub(r"\s+", " ", doc)
print(new_doc)
Removing punctuation
Punctuation present in text creates problems in differentiating between words and also does not add value to the data.
text = "Hello! i2tutorials provides the best Python Course!" re.sub("[^-9A-Za-z ]", "" , text)
Punctuation can also be removed with the help of the punctuation constant from the string module.
import string text = "Hello! i2tutorials provides the best Python and Machine Learning Course!" text_clean = "".join([i for i in text if i not in string.punctuation]) text_clean
Converting to lower case
Python is a case-sensitive language, so it treats NLP and nlp differently. We can easily convert a string to either lower or upper case by using:
str.lower() or str.upper().
Below is an example that converts characters to lower case while checking for punctuation.
import string text = "Hello! i2tutorials provides the best Python and Machine Learning Course!" text_clean = "".join([i.lower() for i in text if i not in string.punctuation]) text_clean
Tokenization
Tokenization is the process of splitting a sentence into words and creating a list, so that each sentence becomes a list of words. There are primarily 3 types of tokenizers available.
Word tokenizer
This is a generic tokenizer that separates words and punctuation; here, the apostrophe is not considered punctuation.
# word tokenize
import nltk
nltk.download('punkt')

text = "Hello! I'm very excited to share that i2tutorials provides the best Python and Machine Learning Course's!"
nltk.tokenize.word_tokenize(text)
Notice that in the above output, words are split based on punctuation.
Tweet tokenizer
This is specifically used when dealing with text data from social media that contains #, @, and emoticons.
# tweet tokenize
from nltk.tokenize import TweetTokenizer

text = "Hello! I'm very excited to share that i2tutorials provides the best Python and Machine Learning Course's!"
tweet = TweetTokenizer()
tweet.tokenize(text)
Regexp tokenizer
This can be used when we want to separate words of interest, such as extracting all hashtags from tweets, addresses from tweets, or hyperlinks from the text. Here, you can use the normal regular expression functions to separate the words.
# regexp_tokenize
import re

a = 'Visit our site for Python and Machine Learning related concepts @i2tutorials'
re.split(r'\s@', a)
Removing stopwords
Stopwords are words such as I, he, she, and, but, was, were, being, have, etc., which do not add meaning to the data. After tokenizing the text, these stopwords can be removed, which helps reduce the number of features in our data.
import nltk
import string

nltk.download('stopwords')
nltk.download('punkt')

stopwords = nltk.corpus.stopwords.words('english')

text = "Hello! I'm very excited to share that i2tutorials provides the best Python and Machine Learning Course's!"

# remove punctuation
text_new = "".join([i for i in text if i not in string.punctuation])
print(text_new)

# tokenize
words = nltk.tokenize.word_tokenize(text_new)
print(words)

# remove stopwords
words_new = [i for i in words if i not in stopwords]
print(words_new)

# remove stopwords and words shorter than 3 characters
words_new = [i for i in words if i not in stopwords and len(i) > 2]
print(words_new)
Lemmatization & Stemming
Stemming
Stemming is a technique that takes a word to its root form by removing suffixes. The stemmed word may not have any meaning and may not be part of the dictionary. There are two main stemmers: the Porter Stemmer and the Snowball Stemmer, an advanced version of the Porter Stemmer.
ps = nltk.PorterStemmer()
w = [ps.stem(word) for word in words_new]
print(w)
ss = nltk.SnowballStemmer(language='english')
w = [ss.stem(word) for word in words_new]
print(w)
Lemmatization
Lemmatization takes a word to its root form, called the lemma, and is applied to nouns by default. It helps bring words to their dictionary form. It is more accurate than stemming because it uses a more informed analysis to group words with comparable meanings depending on the specific context, so it is more complex and requires more time.
import nltk
nltk.download('wordnet')

wn = nltk.WordNetLemmatizer()
w = [wn.lemmatize(word) for word in words_new]
print(w)
The cleaning techniques discussed above prepare our text data for analysis and model building. It is not required to perform all of these steps for every dataset.
Sometimes you need to create new features for analysis, such as the percentage of punctuation in each text, or the length of each product review in a large dataset. For example, you could check whether spam mails contain a higher percentage of punctuation than ham mails, or whether positive-sentiment reviews contain more punctuation than negative-sentiment reviews, or vice versa.
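As a minimal sketch of such a feature, the snippet below computes the percentage of punctuation characters in a text (the function name punct_percent is illustrative, not from any library):

```python
import string

def punct_percent(text):
    # count punctuation characters and divide by the number
    # of non-space characters
    count = sum(1 for ch in text if ch in string.punctuation)
    total = len(text) - text.count(" ")
    return round(count / total * 100, 2) if total else 0.0

print(punct_percent("Hello!!! Great product :)"))  # → 22.73
```

Computed per text in a dataset, such a feature can then be compared across the spam/ham or positive/negative classes.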
Once the text cleaning is done, we can proceed with text analytics. Before building a model, it is important to convert the text data to a numeric form that the machine understands; this is called vectorization.
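A minimal bag-of-words sketch of vectorization is shown below in plain Python (libraries such as scikit-learn provide this ready-made, e.g. as CountVectorizer; this hand-rolled version is only for illustration):

```python
docs = ["i2tutorials provides the best python course",
        "i2tutorials provides machine learning tutorials"]

# build the vocabulary: every distinct word across all documents
vocab = sorted({word for doc in docs for word in doc.split()})

# one count vector per document: how often each vocabulary word appears
vectors = [[doc.split().count(word) for word in vocab] for doc in docs]

print(vocab)
print(vectors)
```

Each document becomes a row of word counts over a shared vocabulary, which is exactly the numeric form a model can consume.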