Bag of Words using Machine Learning algorithms

Bag of Words is a method of extracting features from text documents for use in machine learning algorithms. These features are used to train our machine learning algorithms. It creates a vocabulary of all the unique words present across the documents in the training set.

[Figure: Bag of Words 2 (i2tutorials)]

In simple words, it is a collection of words used to represent a sentence with word counts. Bag of Words is mainly used in:

1. Natural Language Processing.

2. Document Classification.

3. Information retrieval from documents.

 

This is shown in the figure below.

[Figure: Bag of Words 1 (i2tutorials)]

 

Let's take an example to understand this: we take two sentences and generate the vectors for them.

1. Khirod likes to watch movies. Rosy likes movies too.

2. Khirod also likes to watch indoor games.

These two sentences can be represented as the following collections of words.

1. ['khirod', 'likes', 'to', 'watch', 'movies', 'Rosy', 'likes', 'movies', 'too']

2. ['khirod', 'also', 'likes', 'to', 'watch', 'indoor', 'games']

Next, remove the duplicate words and use the word counts for the representation.

1. {"khirod": 1, "likes": 2, "to": 1, "watch": 1, "movies": 2, "Rosy": 1, "too": 1}

2. {"khirod": 1, "also": 1, "likes": 1, "to": 1, "watch": 1, "indoor": 1, "games": 1}

Now we take both sentences into account and combine their word frequencies for the whole document.

{"khirod": 2, "likes": 3, "to": 2, "watch": 2, "movies": 2, "Rosy": 1, "too": 1, "also": 1, "indoor": 1, "games": 1}

Observe that the above is the vocabulary for our document, and using this vocabulary we create vectors for our sentences.
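The combined word-frequency table above can be reproduced with Python's built-in `Counter`. This is a minimal sketch; the token lists are typed out by hand to match the collections shown earlier.

```python
from collections import Counter

# Tokens from the two example sentences, as listed above
tokens = (["khirod", "likes", "to", "watch", "movies", "Rosy", "likes", "movies", "too"]
          + ["khirod", "also", "likes", "to", "watch", "indoor", "games"])

# Counter builds the {word: frequency} vocabulary in one step
vocab_counts = Counter(tokens)
print(vocab_counts["likes"])   # 3
print(vocab_counts["khirod"])  # 2
```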

The length of each vector must be equal to the vocabulary size; here it is 10. Comparing our sentences against the vocabulary, we get the following vectors:

khirod likes to watch movies. rosy likes movies too.
[1, 2, 1, 1, 2, 1, 1, 0, 0, 0]

khirod also likes to watch indoor games.
[1, 1, 1, 1, 0, 0, 0, 1, 1, 1]
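The two vectors above can be produced with a small sketch: fix the vocabulary order and count each vocabulary word's occurrences in a sentence. The function name `vectorize` is illustrative, not from the original code.

```python
# Vocabulary in the order of the combined word-frequency table above
vocab = ["khirod", "likes", "to", "watch", "movies", "Rosy", "too",
         "also", "indoor", "games"]

def vectorize(tokens, vocab):
    # One slot per vocabulary word, holding that word's count in the sentence
    return [tokens.count(word) for word in vocab]

s1 = ["khirod", "likes", "to", "watch", "movies", "Rosy", "likes", "movies", "too"]
s2 = ["khirod", "also", "likes", "to", "watch", "indoor", "games"]
print(vectorize(s1, vocab))  # [1, 2, 1, 1, 2, 1, 1, 0, 0, 0]
print(vectorize(s2, vocab))  # [1, 1, 1, 1, 0, 0, 0, 1, 1, 1]
```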

The vector length always equals the vocabulary size. The main drawback of this approach: for a large document, the vocabulary is large and the vectors contain mostly zeros. Such a representation is called a sparse matrix, and sparse matrices require more memory and higher computational power.
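One common way to avoid storing all those zeros is to keep only the nonzero entries. As a minimal sketch (the helper `to_sparse` is illustrative), a dense vector can be stored as an index-to-count mapping:

```python
def to_sparse(dense_vector):
    # Keep only the nonzero entries as {index: count};
    # for mostly-zero vectors this uses far less memory
    return {i: v for i, v in enumerate(dense_vector) if v != 0}

dense = [1, 1, 1, 1, 0, 0, 0, 1, 1, 1]
print(to_sparse(dense))  # {0: 1, 1: 1, 2: 1, 3: 1, 7: 1, 8: 1, 9: 1}
```

Libraries such as SciPy provide real sparse-matrix types built on the same idea.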

 

Coding of BOW:

Here the input is a list of sentences, and the output will be vectors. The input is:

["khirod waited for the train",
 "The train was late",
 "Rosy and jessie took the bus",
 "I looked for Rosy and jessie at the bus station",
 "Rosy and jessie arrived at the bus station early but waited until noon for the bus"]

From the above sentences, we can remove any stop words, since stop words do not carry enough significance and removing them avoids wasting space storing these words in our database.

Tokenization is the process of breaking a sentence into a sequence of words, phrases, and symbols called tokens.

 

import re

def word_extraction(sentence):
    # Strip punctuation, split into words, drop stop words, and lowercase
    ignore = ['a', 'the', 'is']
    words = re.sub(r"[^\w]", " ", sentence).split()
    cleaned_text = [w.lower() for w in words if w not in ignore]
    return cleaned_text

Apart from this, we can use the nltk library to remove stop words using its pre-built stop-word list.

import nltk
from nltk.corpus import stopwords
set(stopwords.words('english'))

 

Apply tokenization to all the sentences:

def tokenize(sentences):
    # Collect the unique words from every sentence into a sorted vocabulary
    words = []
    for sentence in sentences:
        w = word_extraction(sentence)
        words.extend(w)
    words = sorted(list(set(words)))
    return words

Iterating over all the sentences gives us a vocabulary like this:

['and', 'arrived', 'at', 'bus', 'but', 'early', 'for', 'i', 'jessie', 'khirod', 'late', 'looked', 'noon', 'rosy', 'station', 'the', 'took', 'train', 'until', 'waited', 'was']

Using the above vocabulary, create the vectors:

import numpy

def generate_bow(allsentences):
    vocab = tokenize(allsentences)
    print("Word List for Document \n{0} \n".format(vocab))

    for sentence in allsentences:
        words = word_extraction(sentence)
        bag_vector = numpy.zeros(len(vocab))
        for w in words:
            for i, word in enumerate(vocab):
                # Increment the count only when the token matches this vocab word
                if word == w:
                    bag_vector[i] += 1

        print("{0}\n{1}\n".format(sentence, numpy.array(bag_vector)))

Now, execute our code:

allsentences = ["khirod waited for the train",
                "The train was late",
                "Rosy and jessie took the bus",
                "I looked for Rosy and jessie at the bus station",
                "Rosy and jessie arrived at the bus station early but waited until noon for the bus"]

generate_bow(allsentences)

The output will look like this:

Output:

khirod waited for the train
[0. 0. 0. 0. 0. 0. 1. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 1. 0. 1. 0.]

The train was late
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 1. 0. 1. 0. 0. 1.]

Rosy and jessie took the bus
[1. 0. 0. 1. 0. 0. 0. 0. 1. 0. 0. 0. 0. 1. 0. 0. 1. 0. 0. 0. 0.]

I looked for Rosy and jessie at the bus station
[1. 0. 1. 1. 0. 0. 1. 1. 1. 0. 0. 1. 0. 1. 1. 0. 0. 0. 0. 0. 0.]

Rosy and jessie arrived at the bus station early but waited until noon for the bus
[1. 1. 1. 2. 1. 1. 1. 0. 1. 0. 0. 0. 1. 1. 1. 0. 0. 0. 1. 1. 0.]

 

In the steps above, each sentence is compared with our vocabulary to generate its vector. These vectors can then be used for document classification and prediction.