Basics of Natural Language Processing
Nowadays, one of the most prominent technological advancements is Machine Learning: the development of techniques that teach machines how to understand human communication.
In this article I am going to explain the basics of Natural Language Processing, but before that we have to understand some fundamentals.
Basically, a computer or machine works with mathematical calculations: it can perform complex computations in seconds, but it is not able to carry out complex interpretation and understanding on its own.
Coming to Natural Language Processing, it is the automatic manipulation of natural language, such as speech and text. As mentioned earlier, the machine converts natural language into a mathematical form. Below are some of the methods used in Natural Language Processing.
Tokenization, Stemming and Lemmatization:
Tokenization: It is the process of breaking the full text into words. Tokenization can split on any character, but the most common delimiter is the space character.
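A minimal sketch of whitespace tokenization in Python (the function name is my own, not a standard API):

```python
def tokenize(text):
    """Split text into word tokens on whitespace, the most common delimiter."""
    return text.split()

print(tokenize("Basics of Natural Language Processing"))
# ['Basics', 'of', 'Natural', 'Language', 'Processing']
```

Real tokenizers also handle punctuation, contractions and casing, but splitting on spaces is the usual starting point.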
Stemming: It is a rule-based process for removing derivational affixes (e.g. “ing”, “es”, “s”), which indicate that a word is formed from another word. The most common choice here is Porter's algorithm.
Lemmatization: It is a step-by-step procedure for removing inflectional endings, i.e. groups of letters added to the end of a word (e.g. the “-s” in “bats”, whose root is “bat”), so that we obtain the root word.
Here we should observe one thing: stemming and lemmatization seem to do the same job! The difference is that the root word produced by stemming is sometimes not a valid English word, whereas lemmatization removes the inflectional endings while ensuring that the resulting word actually belongs to the English language.
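The difference can be seen with a toy sketch. The suffix-stripping stemmer below is only in the spirit of Porter's algorithm (not the real thing), and the lemma table is a tiny hand-written stand-in for a dictionary-backed lemmatizer:

```python
SUFFIXES = ("ing", "es", "s")  # affixes mentioned above

def toy_stem(word):
    """Strip the first matching suffix; crude, and may yield non-words."""
    for suf in SUFFIXES:
        if word.endswith(suf) and len(word) > len(suf) + 2:
            return word[: -len(suf)]
    return word

# Hypothetical lookup table standing in for a real lemmatizer's dictionary.
LEMMAS = {"bats": "bat", "studies": "study", "running": "run"}

def toy_lemmatize(word):
    return LEMMAS.get(word, word)

for w in ["bats", "studies", "running"]:
    print(w, "-> stem:", toy_stem(w), "| lemma:", toy_lemmatize(w))
```

Notice that stemming maps “studies” to “studi”, which is not an English word, while lemmatization returns “study”.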
N-grams:
An N-gram represents a group of N nearby words combined together.
Let’s take the sentence “Basics of Natural Language Processing”.
A unigram model tokenizes the sentence into single words: “Basics”, “of”, “Natural”, “Language”, “Processing”.
A bigram model tokenizes the sentence into combinations of two words: “Basics of”, “of Natural”, “Natural Language”, “Language Processing”.
Similarly, a trigram model tokenizes the sentence into combinations of three words: “Basics of Natural”, “of Natural Language”, “Natural Language Processing”.
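All three cases follow one pattern, sketched here as a small helper (my own naming, not a library function):

```python
def ngrams(text, n):
    """Return the list of n-grams (adjacent word groups) from a sentence."""
    words = text.split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

sentence = "Basics of Natural Language Processing"
print(ngrams(sentence, 1))  # unigrams
print(ngrams(sentence, 2))  # bigrams
print(ngrams(sentence, 3))  # trigrams
```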
Breaking natural language down into N-grams is essential for counting words, and those counts feed the mathematical calculations.
One of the most common methods for this is the bag-of-words representation scored with tf-idf.
tf-idf scores the vocabulary, giving each word a weight in proportion to the impact it has on the meaning of the sentence. It is simply the product of two independent scores: term frequency and inverse document frequency.
Term Frequency (TF):
Term frequency is defined as the frequency of a word in the current document.
Inverse Document Frequency (IDF):
It is the logarithmic ratio of the total number of documents in the corpus to the number of documents containing the word. It is calculated as log(N/d), where N is the total number of documents and d is the number of documents containing the word.
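Putting the two definitions together, a minimal sketch over a tiny made-up corpus (the corpus and function names are illustrative only):

```python
import math

# Three toy "documents", already tokenized.
docs = [
    ["basics", "of", "natural", "language", "processing"],
    ["natural", "language", "is", "processed", "by", "machines"],
    ["machines", "do", "calculations"],
]

def tf(word, doc):
    """Frequency of the word in the current document."""
    return doc.count(word) / len(doc)

def idf(word, docs):
    """log(N / d): N documents in total, d documents containing the word."""
    d = sum(1 for doc in docs if word in doc)
    return math.log(len(docs) / d)

def tfidf(word, doc, docs):
    return tf(word, doc) * idf(word, docs)

# "natural" appears in 2 of 3 documents, so its idf is log(3/2).
print(round(tfidf("natural", docs[0], docs), 4))
```

A word that appears in every document gets idf = log(1) = 0, so its tf-idf weight is zero no matter how frequent it is: this is exactly how common filler words get down-weighted.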
One Hot Encoding:
This is another way of representing words in numeric form. Here the length of each word vector must equal the length of the vocabulary. Each observation is represented by rows and columns, where the rows correspond to the vocabulary and the columns to the length of the observation. If a word of the vocabulary is present, the entry is 1; otherwise it is 0.
Word embeddings are the modern way of representing words as vectors. They redefine high-dimensional word features as lower-dimensional feature vectors while preserving the similarities present in the corpus. Word embeddings are widely used in recurrent neural networks and convolutional neural networks.
Bag of Words:
A bag of words represents the data in a tabular format: each row represents a single observation, the columns represent the total vocabulary of the corpus, and the intersection of a row and a column holds the count of that word in that observation.
This is very helpful because the machine can easily understand the sentences; it enables linear algebraic operations and different algorithms to be applied to the data to build predictive models. For example, a bag of words could be built from medical journals.
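The table described above can be built in a few lines; this sketch uses a toy two-sentence corpus of my own:

```python
from collections import Counter

corpus = [
    "natural language processing basics",
    "machines process natural language",
]

# Columns: the sorted vocabulary of the whole corpus.
vocab = sorted({w for doc in corpus for w in doc.split()})

# Rows: one per observation; each cell is the count of that word in that document.
bow = []
for doc in corpus:
    counts = Counter(doc.split())
    bow.append([counts[w] for w in vocab])

print(vocab)
for row in bow:
    print(row)
```

Each row is now a fixed-length numeric vector, ready for the linear-algebra operations mentioned above.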
The embedding matrix stores the embedding for each word in the vocabulary: the rows represent the dimensions of the word-embedding space, and the columns represent the words in the vocabulary.
If you want to convert a sample into word embeddings, each word is one-hot encoded and multiplied by the embedding matrix. Let’s observe the example below.
One thing to remember here: one-hot encoding refers to an n-dimensional vector with the value 1 at the position of the word in the vocabulary, where n is the length of the vocabulary.
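A worked sketch of this lookup, using a hypothetical 2-dimensional embedding matrix over a toy vocabulary (the numbers are invented for illustration):

```python
vocab = ["basics", "language", "natural", "of", "processing"]

# Rows = embedding dimensions, columns = vocabulary words, as described above.
embedding_matrix = [
    [0.1, 0.4, 0.3, 0.0, 0.2],  # dimension 1
    [0.7, 0.1, 0.5, 0.9, 0.6],  # dimension 2
]

def one_hot(word):
    return [1 if w == word else 0 for w in vocab]

def embed(word):
    """Multiply the embedding matrix by the word's one-hot vector.
    Because the vector is all zeros except one 1, this simply
    selects the matrix column belonging to that word."""
    oh = one_hot(word)
    return [sum(m * x for m, x in zip(row, oh)) for row in embedding_matrix]

print(embed("natural"))
# [0.3, 0.5]
```

In practice the multiplication is never performed explicitly; frameworks treat the embedding matrix as a lookup table indexed by the word's position, which is mathematically the same operation.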