Hands-On Guide To Different Tokenization Methods In NLP

November 20, 2020

Tokenization is the process of splitting a large body of text into smaller units called tokens, typically words or sentences.

Natural language processing (NLP) is used to build applications such as text classification, intelligent chatbots, sentiment analysis, and language translation. All of these depend on finding patterns in text. Tokens are the units in which such patterns are found, and tokenization is the base step for later preprocessing such as stemming and lemmatization.
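
As a small illustration of why tokens help reveal patterns, the sketch below counts word frequencies in a made-up sentence. It uses only the standard library; the sample text and the plain whitespace tokenization are assumptions chosen for brevity, not a method from the libraries covered later.

```python
from collections import Counter

# Hypothetical sample text, chosen only for illustration
sample = "the cat sat on the mat and the dog sat too"

# Simplest possible tokenization: split on whitespace
tokens = sample.split()

# Counting tokens exposes a simple pattern: which words occur most often
counts = Counter(tokens)
top_two = counts.most_common(2)
print(top_two)
```

Here the two most frequent tokens are "the" and "sat"; the same counting idea underlies bag-of-words features.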

In this article, we begin with the first step of data preprocessing, i.e. tokenization, and implement several techniques in Python for tokenizing text data.

Tokenize Words Using NLTK

Let's begin with word tokenization using the NLTK library. word_tokenize breaks the given string into a list of word and punctuation tokens. Unlike a plain whitespace split, it returns punctuation such as the final full stop as a separate token.

Example 1:

#Tokenize words
# Needs NLTK's Punkt data once: nltk.download('punkt') ('punkt_tab' on recent NLTK releases)
from nltk.tokenize import word_tokenize
text = "i2 Tutorials provides the best Python and Machine Learning Tutorials."
word_tokenize(text)

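Here we tokenize the text into sentences rather than words. sent_tokenize uses NLTK's pre-trained Punkt model to find sentence boundaries, so it does more than split at every full stop (.).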

Example 2:

#Tokenize Sentence
from nltk.tokenize import sent_tokenize
text = "i2tutorials provides the best Data Science training helps you to excel your skill on  of data acquisition, project life cycle, deploying machine learning and statistical methods. Here you will be working on real time projects that have high relevance in the corporate world."
sent_tokenize(text)

Regular Expression

A regular expression (regex) is a pattern of characters used to match or find strings. We will use NLTK's RegexpTokenizer and Python's re library to tokenize the words and sentences of a paragraph.

Example 1:

from nltk.tokenize import RegexpTokenizer
# Raw string so that \w reaches the regex engine unchanged
tokenizer = RegexpTokenizer(r"[\w']+")
text = "i2 Tutorials provides the best Python and Machine Learning Tutorials."
tokenizer.tokenize(text)
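
The same pattern can be applied with Python's built-in re module. The sketch below is a standard-library equivalent for this particular pattern, not NLTK's implementation. Note that the final full stop is dropped, because it matches neither a word character nor an apostrophe.

```python
import re

text = "i2 Tutorials provides the best Python and Machine Learning Tutorials."

# Find every run of word characters or apostrophes
tokens = re.findall(r"[\w']+", text)
print(tokens)
```

Because the apostrophe is part of the character class, a contraction such as "don't" would stay together as one token.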

Example 2:

#Split Sentences
import re
text = """i2tutorials provides the best Data Science training helps you to excel your skill on  of data acquisition, project life cycle, deploying machine learning and statistical methods. Here you will be working on real time projects that have high relevance in the corporate world."""
sentences = re.compile('[.!?] ').split(text)
sentences
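
Splitting on '[.!?] ' consumes the punctuation mark, so every sentence except the last loses its terminator. If the terminators should be kept, one option is a lookbehind that splits on the whitespace following the mark. This is a sketch with a shortened, made-up text, and like the pattern above it would still break on abbreviations such as "Dr.".

```python
import re

# Shortened, hypothetical sample text
text = "Tokenization comes first. Is it hard? Not really! Let us begin."

# Original approach: the matched punctuation and space are discarded
without_marks = re.split(r"[.!?] ", text)

# Lookbehind variant: split on whitespace that follows ., ! or ?
with_marks = re.split(r"(?<=[.!?])\s+", text)

print(without_marks)
print(with_marks)
```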

Split()

The built-in string method split() breaks a string at the given separator and returns a list of substrings. When no separator is given, it splits on runs of whitespace.

Example 1:

text = """i2 Tutorials provides the best Python and Machine Learning Tutorials."""
# Splits at space
text.split()
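
Two details of split() are worth seeing in action. Called without arguments it collapses runs of whitespace, whereas split(" ") produces an empty string for consecutive spaces; in both cases punctuation stays attached to the neighbouring word. The text below is a made-up sample with a deliberate double space.

```python
# Hypothetical sample with a double space after "best"
text = "the best  Python tutorials."

# No argument: split on any run of whitespace
default_split = text.split()

# Explicit single-space separator: the double space yields an empty string
space_split = text.split(" ")

print(default_split)
print(space_split)
```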

Example 2:

#Split Sentences
text = """i2tutorials provides the best Data Science training helps you to excel your skill on  of data acquisition, project life cycle, deploying machine learning and statistical methods. Here you will be working on real time projects that have high relevance in the corporate world."""
# Splits at a full stop followed by a space
sentences = text.split('. ')
sentences
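
A fixed separator is fragile as a sentence boundary. The sketch below, using a made-up sentence, shows str.split(". ") cutting after the abbreviation "Dr." as if a sentence had ended there, which is one reason to prefer a dedicated sentence tokenizer for real text.

```python
# Hypothetical text containing an abbreviation
text = "Dr. Smith teaches NLP. The course starts on Monday."

# Split at every full stop followed by a space
parts = text.split(". ")
print(parts)
```

The text has two sentences, yet the split returns three pieces, and only the last one keeps its full stop.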

spaCy

spaCy is an open-source NLP library. Among other things, it tokenizes text into words and sentences.

Example 1:

# Requires the English model: python -m spacy download en_core_web_sm
import spacy
sp = spacy.load('en_core_web_sm')
sentence = sp('i2 Tutorials provides the best Python and Machine Learning Tutorials.')
print(sentence)
# Collect each token of the processed text
L = []
for word in sentence:
    L.append(word)
L

Example 2:

Here we tokenize sentences.

#Split Sentences
sentence = sp('i2tutorials provides the best Data Science training helps you to excel your skill on  of data acquisition, project life cycle, deploying machine learning and statistical methods. Here you will be working on real time projects that have high relevance in the corporate world.')
print(sentence)
x = []
for sent in sentence.sents:
    x.append(sent.text)
x

Gensim

The last technique covered in this article is Gensim, an open-source Python library for topic modelling and similarity retrieval over large datasets. Its utils module also provides a simple tokenizer.

Example 1:

from gensim.utils import tokenize
text = """i2 tutorials provides the best Data Science Course online training helps you to excel your skill on  of data acquisition, project life cycle, deploying machine learning and statistical methods."""
list(tokenize(text))

Example 2:

#Split Sentence
# Note: gensim.summarization was removed in Gensim 4.0; this import requires gensim < 4.0
from gensim.summarization.textcleaner import split_sentences
text = """i2tutorials provides the best Data Science training helps you to excel your skill on  of data acquisition, project life cycle, deploying machine learning and statistical methods. Here you will be working on real time projects that have high relevance in the corporate world."""
split1 = split_sentences(text)
split1

Tokenization is a vital step in the data cleaning and preprocessing pipeline. In this article, we applied several tokenization methods to sample text.
