Rajdeep Das

Introduction:

In Artificial Intelligence, the use of Natural Language Processing (NLP) is growing rapidly. Some of the most common NLP applications are:

Text Classification (spam detection, etc.)

Sentiment Analysis

Author Recognition

Machine Translation

Chatbots

What is Sentiment Analysis?

Sentiment analysis is one of the most common applications of Natural Language Processing. It lets us determine the emotion with which a text was written.

Social media platforms grow more popular every day, and as more people join them, the need to analyze the content that people share and post there is increasing rapidly. Given the volume of data coming through social media, it is very difficult to do this with human effort alone. Therefore, there is a growing need for applications that can quickly detect and respond to the positive or negative comments that people write. In this blog, we will develop a simple baseline model for sentiment analysis.

First, let's go through the information about the dataset.

Data Set Name: Sentiment Labelled Sentences Data Set

Data Set Source: UCI Machine Learning Repository

Data Set Info: This dataset was created from user reviews collected from three different websites (Amazon, Yelp and IMDb). The reviews cover products, restaurants and films. Each record in the data set carries one of two sentiment labels: 1 for positive and 0 for negative. We will create a sentiment analysis model using this data set.

Let's build a machine learning model in Python using the sklearn and nltk libraries.

First, let's import the libraries we will use.

import pandas as pd
import numpy as np
import pickle
import sys
import os
import io
import re
from sys import path
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelBinarizer
import matplotlib.pyplot as plt
from string import punctuation, digits
from IPython.display import display, HTML
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.tokenize import RegexpTokenizer

Now we'll load and view our data set.

#Amazon Data
input_file = "../data/amazon_cells_labelled.txt"
amazon = pd.read_csv(input_file, delimiter='\t', header=None)
amazon.columns = ['Sentence', 'Class']

#Yelp Data
input_file = "../data/yelp_labelled.txt"
yelp = pd.read_csv(input_file, delimiter='\t', header=None)
yelp.columns = ['Sentence', 'Class']

#Imdb Data
input_file = "../data/imdb_labelled.txt"
imdb = pd.read_csv(input_file, delimiter='\t', header=None)
imdb.columns = ['Sentence', 'Class']

#combine all data sets; ignore_index=True gives every row its own index value
data = pd.concat([amazon, yelp, imdb], ignore_index=True)
data['index'] = data.index

data
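One thing that may be worth checking when reading tab-separated review text is how the parser treats double quotes inside a sentence. The sketch below uses a made-up three-line sample (not taken from the data set) and compares the default `read_csv` behaviour with `quoting=csv.QUOTE_NONE`; whether the real files need this option is an assumption to verify by comparing the row count with the number of lines in each file.

```python
import csv
import io

import pandas as pd

# Hypothetical sample in the same "sentence<TAB>label" layout.
# Lines 2 and 3 each contain a single double quote.
sample = 'Good phone.\t1\n"Bad movie.\t0\nGreat food."\t1\n'

# Default settings: text between the two quotes is read as one field,
# so lines 2 and 3 are merged into a single row.
default = pd.read_csv(io.StringIO(sample), delimiter='\t', header=None,
                      names=['Sentence', 'Class'])

# QUOTE_NONE: quote characters are kept as ordinary text, one row per line.
raw = pd.read_csv(io.StringIO(sample), delimiter='\t', header=None,
                  names=['Sentence', 'Class'], quoting=csv.QUOTE_NONE)

print(len(default), len(raw))
```

If the two counts differ for a real file, the second form is probably the one that matches the file line by line.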

We have imported the data and viewed it. Now, let's look at some statistics about the data.

#Total Count of Each Category
pd.set_option('display.width', 4000)
pd.set_option('display.max_rows', 1000)
#count the rows in each class
distOfDetails = data.groupby(by='Class', as_index=False).agg({'index': 'count'}).sort_values(by='index', ascending=False)
distOfDetails.columns = ['Class', 'COUNT']
print(distOfDetails)

#Distribution of All Categories
plt.pie(distOfDetails['COUNT'], autopct='%1.0f%%', shadow=True, startangle=360)
plt.show()

As you can see, the data set is well balanced: it contains almost equal numbers of positive and negative examples.
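A shorter way to check the balance, shown here as a sketch on a made-up toy frame rather than the real data, is `value_counts`, which can return either the row count or the share of each class.

```python
import pandas as pd

# Hypothetical toy frame with the same 'Class' column as the article's data.
toy = pd.DataFrame({'Sentence': ['great', 'awful', 'nice', 'bad', 'love it'],
                    'Class': [1, 0, 1, 0, 1]})

counts = toy['Class'].value_counts()                 # rows per class
shares = toy['Class'].value_counts(normalize=True)   # fraction of rows per class

print(counts)
print(shares)
```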

Now, before using the data set in the model, let's do a few things to clean the text.

#Text Preprocessing
#lower string
data['Sentence'] = data['Sentence'].str.lower()

#remove email address
data['Sentence'] = data['Sentence'].replace(r'[a-zA-Z0-9_.-]+@[a-zA-Z0-9_.-]+', '', regex=True)

#remove IP address
data['Sentence'] = data['Sentence'].replace(r'((25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)(\.|$)){4}', '', regex=True)

#remove punctuation and special characters (regex=True is required in recent pandas)
data['Sentence'] = data['Sentence'].str.replace(r'[^\w\s]', '', regex=True)

#remove numbers
data['Sentence'] = data['Sentence'].replace(r'\d', '', regex=True)

#remove stop words (the NLTK stop-word list and tokenizer data must be downloaded once with nltk.download)
stop_words = set(stopwords.words('english'))
data['Sentence'] = data['Sentence'].apply(
    lambda sentence: " ".join(w for w in word_tokenize(sentence) if w not in stop_words))

data = data[['index', 'Class', 'Sentence']].reset_index(drop=True)
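To see what these steps do to a single review, here is a minimal sketch that applies the same kind of cleaning to one string. The helper `clean_sentence`, the tiny `STOP_WORDS` set and the example sentence are all made up for illustration; it splits on whitespace instead of using nltk's tokenizer and leaves out the IP address step, so it is not identical to the pipeline above.

```python
import re

# Hypothetical, tiny stop-word list so the sketch runs without NLTK data;
# the article itself uses nltk's English stop words.
STOP_WORDS = {'the', 'is', 'a', 'at', 'me', 'and', 'it', 'was'}

def clean_sentence(text):
    text = text.lower()
    text = re.sub(r'[a-zA-Z0-9_.-]+@[a-zA-Z0-9_.-]+', '', text)  # e-mail addresses
    text = re.sub(r'[^\w\s]', '', text)                           # punctuation
    text = re.sub(r'\d', '', text)                                # digits
    tokens = [w for w in text.split() if w not in STOP_WORDS]     # stop words
    return ' '.join(tokens)

print(clean_sentence('The battery is GREAT, mail me at joe@example.com, 10/10!'))
# prints: battery great mail
```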

The data is now cleaned and ready to be used in the model. Before we build our model, let's split our dataset into training (90%) and test (10%) sets.

X_train, X_test, y_train, y_test = train_test_split(data['Sentence'].values.astype('U'),data['Class'].values.astype('int32'), test_size=0.10, random_state=0)
classes  = data['Class'].unique()
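The split above is purely random. If you want the training and test sets to keep the same class proportions, `train_test_split` also accepts a `stratify` argument. The sketch below shows this on made-up toy data; the article's own split does not use it.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 20 sentences, 10 per class.
X = np.array(['sentence %d' % i for i in range(20)])
y = np.array([0, 1] * 10)

# stratify=y keeps the class ratio the same in both parts of the split.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.10, random_state=0, stratify=y)

print(len(X_tr), len(X_te))
```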

Now we create our model using the training data. I will use TF-IDF as the vectorizer and a linear classifier trained with Stochastic Gradient Descent (sklearn's SGDClassifier) as the classifier.
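As a rough illustration of what the TF-IDF vectorizer produces, the sketch below fits one on a made-up three-sentence corpus with `ngram_range=(1, 2)`, so both single words and word pairs become features. The corpus is hypothetical and far smaller than the real data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical mini corpus.
docs = ['good phone', 'not good phone', 'bad service']

vectorizer = TfidfVectorizer(analyzer='word', ngram_range=(1, 2), norm='l2')
matrix = vectorizer.fit_transform(docs)

# The vocabulary holds unigrams such as 'good' and bigrams such as 'not good'.
print(sorted(vectorizer.vocabulary_))
# One row per sentence, one column per n-gram.
print(matrix.shape)
```

Bigrams are what let the model tell 'not good' apart from 'good', which single-word features cannot do.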

These methods and their parameters were found using grid search (I will not cover grid search in this article).

from sklearn.metrics import confusion_matrix
from sklearn.model_selection import cross_val_score
from sklearn.metrics import accuracy_score
from sklearn.neural_network import MLPClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier

#grid search result
vectorizer = TfidfVectorizer(analyzer='word', ngram_range=(1,2), max_features=50000, max_df=0.5, use_idf=True, norm='l2')
#learn the vocabulary from the training sentences and turn them into TF-IDF vectors
counts = vectorizer.fit_transform(X_train)
vocab = vectorizer.vocabulary_
classifier = SGDClassifier(alpha=1e-05, max_iter=50, penalty='elasticnet')
targets = y_train
classifier = classifier.fit(counts, targets)
#vectorize the test sentences with the same vocabulary and predict their classes
example_counts = vectorizer.transform(X_test)
predictions = classifier.predict(example_counts)
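The grid search itself is not shown in this article. For readers who are curious, the following is only a sketch of how such a search might look with sklearn's `Pipeline` and `GridSearchCV`. The toy sentences and the parameter grid are made up for illustration and are not the values that were actually searched.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Hypothetical toy data: 6 positive and 6 negative sentences.
texts = ['good phone', 'great food', 'love this film',
         'works great', 'really good', 'great service',
         'bad phone', 'awful food', 'hate this film',
         'stopped working', 'really bad', 'awful service']
labels = [1] * 6 + [0] * 6

# Vectorizer and classifier are tuned together as one estimator.
pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(analyzer='word')),
    ('clf', SGDClassifier(max_iter=50, penalty='elasticnet', random_state=0)),
])

# Illustrative grid only; keys are '<step name>__<parameter name>'.
param_grid = {
    'tfidf__ngram_range': [(1, 1), (1, 2)],
    'clf__alpha': [1e-5, 1e-3],
}

# Every combination is scored with 3-fold cross-validation.
search = GridSearchCV(pipeline, param_grid, cv=3, scoring='accuracy')
search.fit(texts, labels)

print(search.best_params_)
```

On data this small the winning combination means little; the point is only the mechanics of searching vectorizer and classifier settings together.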

Our model is now trained. Let's evaluate it on the test data by examining the accuracy, precision, recall and F1 scores.

from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import classification_report

#Model Evaluation
acc = accuracy_score(y_test, predictions, normalize=True)
hit = precision_score(y_test, predictions, average=None, labels=classes)
capture = recall_score(y_test, predictions, average=None, labels=classes)

print('Model Accuracy:%.2f'%acc)
print(classification_report(y_test, predictions))

Model Accuracy:0.83
             precision    recall  f1-score   support

          0       0.83      0.84      0.84       139
          1       0.84      0.82      0.83       136

avg / total       0.83      0.83      0.83       275
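To make the columns of this report concrete, here is a small worked sketch that computes precision, recall and F1 for the positive class by hand from the four confusion-matrix counts. The labels are made up and are not the article's test set.

```python
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score

# Hypothetical true labels and predictions.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

# For binary labels [0, 1], ravel() returns tn, fp, fn, tp in that order.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()

precision = tp / (tp + fp)   # of everything predicted positive, how much was right
recall = tp / (tp + fn)      # of everything truly positive, how much was found
f1 = 2 * precision * recall / (precision + recall)

print(precision, recall, f1)
# sklearn's own functions give the same values.
print(precision_score(y_true, y_pred), recall_score(y_true, y_pred), f1_score(y_true, y_pred))
```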

As you can see, our model reached an accuracy of 83%. Let's look at the confusion matrix, which shows more clearly where our predictions are right and wrong.

#source: https://www.kaggle.com/grfiv4/plot-a-confusion-matrix
import itertools
def plot_confusion_matrix(cm, classes,
                          normalize=False,
                          title='Confusion matrix',
                          cmap=plt.cm.Blues):
    """
    This function plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    """
    if normalize:
        # turn each row (true label) into fractions that sum to 1
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]

    plt.imshow(cm, interpolation='nearest', cmap=cmap, aspect='auto')
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)

    # write each cell's value on the plot, in white on dark cells
    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j], fmt),
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")

    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')

# Compute confusion matrix (labels must be passed by keyword in recent sklearn versions)
cnf_matrix = confusion_matrix(y_test, predictions, labels=classes)
np.set_printoptions(precision=2)

class_names = range(1, classes.size + 1)

# Plot non-normalized confusion matrix
plt.figure()
plot_confusion_matrix(cnf_matrix, classes=class_names, title='Confusion matrix, without normalization')
plt.show()

# Map each index shown on the plot axes back to its class label
classInfo = pd.DataFrame({'Category': classes, 'Index': list(class_names)})
classInfo
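The `normalize=True` option of `plot_confusion_matrix` is not used above. As a sketch of what it computes, the snippet below row-normalizes a made-up 2x2 matrix (not the article's result), so that each row shows the fraction of a true class that was assigned to each predicted class.

```python
import numpy as np

# Hypothetical confusion matrix: rows are true labels, columns are predictions.
cm = np.array([[5, 1],
               [1, 3]])

# Divide each row by its total so that every row sums to 1.
cm_normalized = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]

print(np.round(cm_normalized, 2))
```

After this normalization the diagonal holds the recall of each class, which makes classes of different sizes easier to compare.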