How to Learn More in Less Time with Natural Language Processing (Part 2)

And how to create your own bag of words classifier

Vedant Gupta
6 min read · Jan 26, 2019

With the nifty extractive text summarizer we created in Part 1, we were able to take news articles and cut them down by half or more! Now it is time to take these articles and classify them by subject. In this part we will go through how to create a bag of words NLP classifier to do exactly that!

How to create a Bag of Words Classifier (Python)

GitHub Repository:

https://github.com/Vedant-Gupta523/text-summarization-project

Overview:

  1. Import libraries and data set
  2. Clean the articles in the data set and store each article as an element in a list
  3. Create our bag of words matrix
  4. Divide the array into a test and training set
  5. Predict test results and evaluate accuracy
  6. Make predictions for our summarized articles

Import libraries and the data set

# Importing the libraries
import numpy as np
import pandas as pd
# Importing the dataset
dataset = pd.read_csv('subject_freq.csv')

To classify the subjects of articles, we first need a data set which contains articles labelled with their subject. In my GitHub repository I included such a data set, as well as a tool to help you quickly create a data set of your own (at the cost of some accuracy). My data set assigns each article one of three labels: “Science”, “Technology”, or “Life”.

We import numpy and pandas and then store the data set “subject_freq.csv” in the variable “dataset”.

Creating a list with cleaned articles from the data set

For our bag of words model we need to create a matrix which tells us what words are in which articles and what label those articles have. To create such a matrix, we need to create a list with the articles from the data set. Similar to the extractive text summarizer, we don’t want to include punctuation, numbers, or stop words.

# Cleaning the texts
import re
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))
corpus = []
for i in range(0, len(dataset)):
    # Keep letters only, lowercase, and drop English stop words
    review = re.sub('[^a-zA-Z]', ' ', dataset['Article'][i])
    review = review.lower()
    review = review.split()
    review = [word for word in review if word not in stop_words]
    review = ' '.join(review)
    corpus.append(review)

As in Part 1, we import the re (regex) module and NLTK, and we download the English stop words needed to clean the text. Each article is stripped of punctuation, numbers, and stop words, and then appended to our list, “corpus”.

Creating the bag of words matrix

It is time to create the matrix our classifier will train on!

# Creating the Bag of Words model
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(max_features = 1500)
X = cv.fit_transform(corpus).toarray()
y = dataset.iloc[:, 1].values

We import CountVectorizer from scikit-learn and set its max_features to 1500, which limits the matrix to the 1500 most frequent unique words. We then create our matrix of features (X) and a separate array for our labels (y). The matrix created by the CountVectorizer looks similar to this:

[Image: the created bag of words matrix (continues on the right)]

Each row represents an article and each column represents a unique word (with a limit of 1500 words/columns). The number in each cell is the amount of times each word appeared in the corresponding article.
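
If you want to see this structure for yourself, here is a minimal sketch using a made-up two-sentence corpus (not the article data set):

# Toy illustration of the bag of words matrix (made-up mini corpus)
from sklearn.feature_extraction.text import CountVectorizer

toy_corpus = ["rockets launch into deep space", "new phones and laptops arrive"]
toy_cv = CountVectorizer()
toy_X = toy_cv.fit_transform(toy_corpus).toarray()

# The unique words (one per column); on older scikit-learn versions use get_feature_names()
print(toy_cv.get_feature_names_out())
# Word counts per sentence (one row per "article")
print(toy_X)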

Creating the training set and test set and training the model

In school, teachers give us homework from which we learn various concepts. After some time, we are given a test to see if we can apply what we learned from the homework to solve similar, but different, problems. When we train our classifier to figure out the subject of each article, we follow a similar process. We divide our data set into a training set and a test set. The model uses the training set to find correlations between the matrix and the labels. Then, we give it the test set (without the labels) and it uses what it learned to predict the labels. Afterwards, we can compare the predictions to the actual labels to see how accurate our model is.

# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state = 0)
# Fitting Naive Bayes to the Training set
from sklearn.naive_bayes import GaussianNB
classifier = GaussianNB()
classifier.fit(X_train, y_train)

We use “train_test_split” from sklearn to divide the articles in the matrix and their labels. Since “test_size” is set to 0.20, 20% of the articles are held out as the test set and the model is trained on the remaining 80%.

We import the Naive Bayes classifier from sklearn and train it on “X_train” and “y_train”.
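
As a side note, GaussianNB is only one option. For word-count features like these, scikit-learn’s MultinomialNB is a common alternative and can be swapped in directly — a minimal sketch using the same training data:

# Optional alternative: MultinomialNB is often used for bag of words count features
from sklearn.naive_bayes import MultinomialNB

mnb_classifier = MultinomialNB()
mnb_classifier.fit(X_train, y_train)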

Evaluating accuracy

It is time to test the classifier and see how it fares!

# Predicting the Test set results
y_pred = classifier.predict(X_test)
# Making the Confusion Matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)

We store the classifier’s predictions on “X_test” in “y_pred” then we use sklearn’s confusion matrix to visualize our results:

[Image: the resulting confusion matrix]

Wow, that’s amazing…? It may seem confusing at first, but it’s actually simple once you understand how to read it. In scikit-learn’s confusion matrix, the rows represent the actual labels (Science, Technology, and Life) and the columns represent the predicted labels. In this case, the first row tells us that 3 articles that were actually Science were predicted as Science, and 0 were predicted as Technology or Life. To get the total number of correct predictions, sum the numbers along the diagonal from the top left to the bottom right. The classifier made 8 correct predictions out of 12 test articles!
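
If you would rather compute this than count cells by hand, the same numbers can be read straight off the matrix. A small sketch using the cm, y_test, and y_pred variables from above:

import numpy as np
from sklearn.metrics import accuracy_score

# Correct predictions sit on the diagonal of the confusion matrix
correct = np.trace(cm)
total = cm.sum()
print(correct / total)

# Or let scikit-learn compute the accuracy directly
print(accuracy_score(y_test, y_pred))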

75% accuracy isn’t bad, but it can definitely be improved. The biggest problem is that the model didn’t have enough data to train on, so it couldn’t make entirely accurate predictions. Having a large data set is key to building highly accurate classifiers!

Making predictions on our summarized articles

The goal of creating the bag of words classifier was to eventually classify our summarized articles from Part 1.

import re
from bagofwords_classifier import corpus, cv, classifier

# article_text holds the article text from Part 1
testData = article_text
testReview = re.sub('[^a-zA-Z]', ' ', testData)
testReview = testReview.lower()
testReview = testReview.split()
testReview = ' '.join(testReview)
corpus.append(testReview)
testX = cv.fit_transform(corpus).toarray()
testX = testX[-1:, :]
testResult = classifier.predict(testX)
corpus = corpus[:-1]
print(testResult)

In the snippet above, we start by importing the corpus, the CountVectorizer, and the classifier from our bag of words script. We take the article from Part 1, clean it the same way as every article in the data set, and append it to the corpus. We then rebuild the bag of words matrix and keep only the last row in “testX” (the article we are trying to classify). Finally, we use the classifier to make a prediction on “testX”, remove the extra entry we added to the corpus, and print the result.
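
One thing worth noting about this approach: calling fit_transform again rebuilds the vocabulary, which is why the new article has to be appended to the corpus first. A simpler variant (a sketch reusing the same cv and classifier objects from above) is to keep the vocabulary learned during training and call transform instead:

# Reuse the training vocabulary instead of refitting the CountVectorizer
testX = cv.transform([testReview]).toarray()
print(classifier.predict(testX))

This avoids refitting and guarantees the columns line up with what the classifier saw during training.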

Example predictions:

  1. Article: https://goo.gl/Qfq8Ah Expected: TECH Result: TECH
  2. Article: https://goo.gl/AfHLL5 Expected: SCIENCE Result: SCIENCE
  3. Article: https://goo.gl/gE8rEX Expected: LIFE Result: LIFE

Key Takeaways

Natural Language Processing classifiers have many applications, from simple subject classification to sentiment analysis. NLP will have a great impact on our lives, and it is important to understand how the technology works. You now know how to make an NLP classifier with the bag of words method!
