How to solve NLP problems
Suppose you are launching a new company or product that will generate massive amounts of data. By using that data you can improve, validate, and expand the product's functionality. Extracting meaning and learning from such data is an active topic of research called Natural Language Processing (NLP). NLP is a very large field, with new results produced on a daily basis. Some common applications include:
1. Detecting and extracting different categories from feedback (sentiment analysis: positive and negative opinions).
2. Identifying different segments of customers (product preferences, predicting churn).
3. Classifying text according to intent.
Step-by-Step Procedure:
1. Gather your Data:
We generally have a lot of data in the form of chats, emails, tweets, and posts. Every machine learning project starts with data. There are many data sources that contain textual information:
- Product reviews (Flipkart, Amazon)
- User-generated content such as tweets and comments
- Chat logs and customer support requests
Here we consider the "Disasters on Social Media" dataset. Using this dataset, we want to detect which tweets are about a disastrous event, as opposed to an irrelevant topic such as a movie. The challenging part is that both classes contain the same search terms, so we have to rely on subtler differences to tell them apart. Since the data is labeled, we know which class each tweet belongs to.
2. Clean the Data:
This is a very important task for every data scientist. First, take a look at the data and clean it up. A clean dataset lets the model learn meaningful features rather than overfitting to irrelevant noise. To clean the data:
– Remove all insignificant characters, e.g. non-alphanumeric characters.
– Tokenize the text into individual words.
– Remove URLs and @mentions.
– Convert all text to lowercase, so that words such as "nlp", "NLP", and "Nlp" are not treated as different tokens.
– Map combined or misspelled words to a single word representation.
– Remove stop words (e.g. "am", "is", "are", "be") and consider lemmatization.
– After this, cross-check for additional errors; now we can start with clean, labeled data.
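The cleaning steps above can be sketched as a small function. This is a minimal illustration using only the standard library; the regexes and the tiny stop-word list are simplifying assumptions, not a complete pipeline.

```python
import re

def clean_tweet(text):
    """Minimal cleaning sketch: strip URLs and @mentions, drop
    non-alphanumeric characters, lowercase, tokenize, remove stop words."""
    text = re.sub(r"http\S+", "", text)          # remove URLs
    text = re.sub(r"@\w+", "", text)             # remove @mentions
    text = re.sub(r"[^a-zA-Z0-9\s]", " ", text)  # drop non-alphanumeric chars
    text = text.lower()                          # "NLP" and "Nlp" -> "nlp"
    tokens = text.split()                        # whitespace tokenization
    stop_words = {"am", "is", "are", "be", "the", "a", "an"}  # tiny illustrative list
    return [t for t in tokens if t not in stop_words]

print(clean_tweet("Forest FIRE near La Ronge http://t.co/abc @user is scary!"))
# -> ['forest', 'fire', 'near', 'la', 'ronge', 'scary']
```

A real pipeline would use a proper tokenizer and stop-word list (e.g. from NLTK or spaCy), but the order of operations is the same.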
3. Data Representation:
Machine learning models take numerical values as input. For an image, the input is simply its pixel values; for text, we first have to convert words into numbers.
One-Hot Encodings (Bag of Words):
This is one way of representing words in numeric form. Each sentence is represented as a vector whose length equals the size of the vocabulary: if a word of the vocabulary is present in the sentence, the corresponding entry is 1 (or its count), otherwise it is 0. The Bag of Words model is a method of extracting features from text documents for use in machine learning algorithms: it builds a vocabulary of all the distinct words present in the documents of the training set, and those per-word counts become the features used to train the model.
In the social-media disasters dataset, the generated vocabulary contains around 20,000 words, so each sentence is represented as a vector of length 20,000. Most entries of each vector are zeros, because every sentence contains only a very small subset of the vocabulary.
We should check whether our embeddings capture information that is relevant to our problem. One way to do this is to use PCA to project them down to two dimensions and visualize them.
Observe the visualization above: the two classes are not separated well. That does not necessarily mean the features are useless, so the next step is to train a classifier on the bag-of-words vectors.
For binary classification, logistic regression is a good first choice: it trains quickly, its results are interpretable, and you can easily extract the most important coefficients from the model.
We split our data into a training set and a test set. The training set is used to fit the model, and the test set is used to measure how well it generalizes to unseen data.
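The split-then-fit workflow looks like this with scikit-learn (assumed available); the duplicated mini-corpus stands in for the real labeled tweets.

```python
# Train/test split plus logistic regression on bag-of-words features.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

tweets = ["forest fire near town", "flood destroys homes",
          "earthquake hits the city", "this movie is fire",
          "great song on the radio", "loved the new game"] * 5
labels = [1, 1, 1, 0, 0, 0] * 5  # 1 = disaster, 0 = irrelevant

X = CountVectorizer().fit_transform(tweets)
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.2, random_state=40)

clf = LogisticRegression().fit(X_train, y_train)  # fit on the training set
print(clf.score(X_test, y_test))                  # accuracy on held-out data
```

On the real 20,000-word vocabulary the code is identical; only the inputs change.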
After training the model, the next step is to understand the errors it makes and decide which kinds of errors are least desirable. In our example, a false negative classifies a disaster tweet as irrelevant, while a false positive classifies an irrelevant tweet as a disaster. If you need to react to every potential event, you want to lower false negatives first; otherwise you may prioritize lowering false positives.
As the figure above shows, our classifier creates more false negatives than false positives: it most often errs by labeling a disaster as irrelevant. Since false positives represent a high cost for law enforcement, this may be an acceptable bias.
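Counting these two error types is a one-liner with a confusion matrix (scikit-learn assumed; the labels below are illustrative, with 1 = disaster and 0 = irrelevant).

```python
# Separate false positives from false negatives with a confusion matrix.
from sklearn.metrics import confusion_matrix

y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 0, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"false negatives (missed disasters): {fn}")  # -> 2
print(f"false positives (false alarms):     {fp}")  # -> 1
```

For a binary problem, `confusion_matrix(...).ravel()` returns the counts in the order tn, fp, fn, tp.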
Explaining our model:
To validate our model and its predictions, we need to know which words it uses to make decisions. If our data is biased, the model can achieve high accuracy and make accurate predictions on the sample data, but it will not generalize well in the real world. To check for this, we plot the most important words for both the irrelevant and disaster classes. With bag of words and logistic regression this is easy: we simply extract and rank the coefficients the model uses for its predictions.
Looking at those coefficients, our classifier correctly picks up on some patterns such as "massacre" and "hiroshima", but it clearly overfits on some meaningless tokens such as "x1392" and "heyoo". Our bag-of-words model deals with a huge vocabulary and treats all words equally; some of those words are frequent yet contribute only noise to our predictions.
You can help the model focus on meaningful words by using TF-IDF on top of bag of words. TF-IDF weighs words by how rare they are, discounting words that are too frequent and would only add noise. Here is the visualization with the new embeddings.
Word2Vec is a more recent technique for handling every word in the vocabulary, including high-signal words we may encounter at test time that never appeared in our training set; the previous methods cannot classify tweets containing such words. The idea is to use the semantic meaning of words: the model should understand that words like "positive" and "good" are much closer to each other than to words like "continent" and "apricot". This is what Word2Vec provides.
Using pre-trained word embeddings:
Word2Vec is a technique for finding continuous embeddings for words. It learns by reading massive amounts of text and memorizing which words tend to appear in similar contexts. After training, it generates a 300-dimensional vector for each word in the vocabulary.
Sentence level representation:
A quick way to get a sentence embedding for our classifier is to average the Word2Vec scores of all the words in the sentence. This is similar to bag of words, except that we lose the syntax of the sentence while keeping its semantic information.
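The averaging step can be sketched as follows. The 4-dimensional vectors below are made-up stand-ins for real 300-dimensional pre-trained Word2Vec vectors.

```python
# Sentence embedding = mean of the word vectors of in-vocabulary words.
import numpy as np

word_vectors = {               # hypothetical pre-trained embeddings
    "forest": np.array([0.9, 0.1, 0.0, 0.2]),
    "fire":   np.array([0.8, 0.2, 0.1, 0.1]),
    "movie":  np.array([0.1, 0.9, 0.7, 0.0]),
}

def sentence_embedding(tokens, vectors, dim=4):
    """Average the vectors of known words; all-zeros if none are known."""
    vecs = [vectors[t] for t in tokens if t in vectors]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

print(sentence_embedding(["forest", "fire"], word_vectors))
# -> [0.85 0.15 0.05 0.15]
```

With real embeddings (e.g. loaded via gensim) `dim` would be 300, but the function is otherwise unchanged.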
Let's observe the visualization of our new embeddings.
The new embeddings help the classifier separate the two classes. Training logistic regression again on these features gives an accuracy of about 77.5%, which is more efficient than the previous model.
Unlike in our previous models, it is now harder to tell which words are most relevant, because our embeddings no longer dedicate one dimension to one word. For such complex models we can use black-box explainers such as LIME, an open-source package that lets users explain the decisions of any classifier on one particular example and see how the predictions change.
Let's look at a couple of sentences from our social-media disasters dataset.
We cannot afford to explore thousands of examples by hand, so instead we run LIME on a representative sample and see which words keep coming up as strong contributors. This gives us word importance scores that we can compare with those of our previous models.
So far we have covered efficient approaches that generate compact sentence embeddings but discard the syntactic information in the sentences. If these methods do not provide sufficiently accurate results, we can use more complex models that take whole sentences into consideration, syntax included. For this, we represent each sentence as a sequence of individual word vectors, using GloVe, Word2Vec, or CoVe, as shown below.
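The sequence representation can be sketched as follows: instead of averaging, we keep one vector per word and zero-pad each sentence to a fixed length. The 2-dimensional vectors are toy stand-ins for real GloVe/Word2Vec embeddings.

```python
# Sentence as a (max_len, dim) matrix of word vectors, zero-padded.
import numpy as np

def sentence_matrix(tokens, vectors, max_len, dim):
    """Stack word vectors in sentence order; pad (or truncate) to max_len."""
    mat = np.zeros((max_len, dim))
    for i, tok in enumerate(tokens[:max_len]):
        mat[i] = vectors.get(tok, np.zeros(dim))  # zeros for unknown words
    return mat

vectors = {"forest": np.array([0.9, 0.1]), "fire": np.array([0.8, 0.2])}
m = sentence_matrix(["forest", "fire"], vectors, max_len=4, dim=2)
print(m.shape)  # (4, 2): rows in sequence order, trailing zero rows are padding
```

This matrix, rather than a single averaged vector, is what sequence models such as CNNs and LSTMs consume.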
Convolutional neural networks can also be used for text or sentence classification, and they are very quick to train. They give excellent results on text-related tasks, often competitive with more complex NLP approaches such as LSTMs and encoder/decoder architectures, while requiring less training time, and they can give better accuracy than the previous models.