How to Learn More in Less Time with Natural Language Processing (Part 1)
And how to create your own extractive text summarizer
Imagine you are given an assignment from school or work that involves A LOT of research. You spend all night grinding it out, so you can acquire the knowledge you need for a high-quality end product.
Now imagine you are given the exact same assignment and you finish with the same high-quality result except this time you finished with lots of time to spare!
For obvious reasons the latter scenario is preferable. Time is a valuable asset, so we need to find a solution to one of the biggest time-wasting problems we face as a society: modern-day data influx.
As technologies advance, the amount of data we collect is increasing at exponential rates and it is becoming increasingly difficult to keep up with new information. Fortunately, we have Natural Language Processing (NLP), a technology we can leverage to help solve this issue.
This article will go through what NLP is and how you can create your own extractive text summarizer! In Part 2 we will look at creating a bag of words model to classify the subject of the article you chose to summarize. Let’s get into it!
What is Natural Language Processing and how can it help?
Natural Language Processing is a subfield of AI that deals with computer understanding of human languages. It includes everything from speech recognition to text understanding and more! Examples where this technology has been implemented include Google Assistant and Amazon's Alexa.
What if you could scrape news articles or textbooks online and summarize them to save lots of reading time? Maybe you would like to classify these articles by subject so you can create your own repository of summarized articles to look back at. This is all possible with NLP and this is exactly what we will go through in the following tutorial!
How to create an Extractive Text Summarizer (Python)
- Scrape article from internet
- Evaluate how often words occur in the text
- Divide article into sentences
- Assign each sentence a value based on the words it has and those words’ frequencies
- Remove sentences whose values don’t reach a certain threshold
Getting your corpus of text
To start summarizing our text, we first need to acquire the article. To scrape it from the internet I used BeautifulSoup:
from bs4 import BeautifulSoup
import requests

# Scrape the URL for all paragraphs and create a collective string
url = input('Enter the article URL: ')
result = requests.get(url)
c = result.content
soup = BeautifulSoup(c, 'html.parser')

article_text = ''
article = soup.findAll('p')
for element in article:
    article_text += '\n' + ''.join(element.findAll(text=True))
We started by importing the BeautifulSoup and requests modules, then let the user input the URL they want to scrape. Requests fetches the page and BeautifulSoup parses its content. Next, we use BeautifulSoup to find all paragraph elements (p) on the website and store them in “article”. Lastly, we use a for loop to go through all the paragraph elements we found and append their text to article_text, leaving us with one long string containing the article.
Something to note about this method is that it won’t work for all websites. Some websites use tags other than paragraph tags (e.g. article tags). You can adjust the code accordingly.
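As a sketch of that adjustment, here is how the same scraping loop might look for a page built with article tags. The HTML snippet below is a made-up stand-in for a real page, so you can see the behavior without an internet connection:

```python
from bs4 import BeautifulSoup

# Hypothetical snippet standing in for a page that wraps its text in <article> tags
html = """
<html><body>
  <article>First block of text.</article>
  <article>Second block of text.</article>
</body></html>
"""

soup = BeautifulSoup(html, 'html.parser')
article_text = ''
for element in soup.find_all('article'):  # swap 'p' for whatever tag the site uses
    article_text += '\n' + element.get_text()

print(article_text)
```

The only change from the paragraph-scraping version is the tag name passed to find_all.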
Cleaning the text
Right now our text has a lot of punctuation marks and numbers, all of which we don’t want included when we calculate word frequencies. We must remove these for the time being!
import re

# Remove special characters and numbers from the text
article_text = re.sub(r'\[[0-9]*\]', ' ', article_text)
article_text = re.sub(r'\s+', ' ', article_text)
formatted_article_text = re.sub('[^a-zA-Z]', ' ', article_text)
formatted_article_text = re.sub(r'\s+', ' ', formatted_article_text)
Above we imported regex (re) to help us format the text. In the four lines after the import, we remove everything we don’t want and store the result as “formatted_article_text”. Be mindful that we still have “article_text”, which is our article with its numbers and punctuation intact.
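To make the two cleaning passes concrete, here is what they do to a short made-up sample. The first pair of substitutions strips bracketed reference numbers and collapses whitespace; the second keeps only letters for the frequency counts:

```python
import re

sample = "NLP is growing fast [3]. By 2025, 80% of firms may use it!"

# Strip bracketed reference numbers like [3] and collapse whitespace
text = re.sub(r'\[[0-9]*\]', ' ', sample)
text = re.sub(r'\s+', ' ', text)

# Keep only letters, then collapse whitespace again
formatted = re.sub('[^a-zA-Z]', ' ', text)
formatted = re.sub(r'\s+', ' ', formatted)

print(text)       # → 'NLP is growing fast . By 2025, 80% of firms may use it!'
print(formatted)  # → 'NLP is growing fast By of firms may use it ' (letters only)
```

Notice that “formatted” has lost its digits and punctuation entirely, which is why we score sentences against it but assemble the final summary from the original text.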
Creating a dictionary of word frequencies
We are ready to evaluate word frequencies. Since the word frequencies will ultimately determine the value of each sentence in the article, we want to make sure that the words themselves have some level of meaning. Hence, we first want to remove stop words.
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, sent_tokenize

nltk.download('stopwords')
nltk.download('punkt')

# Tally word frequencies from the text
words = word_tokenize(formatted_article_text)
stopWords = stopwords.words('english')

freqTable = dict()
for word in words:
    word = word.lower()
    if word in stopWords:
        continue
    if word in freqTable:
        freqTable[word] += 1
    else:
        freqTable[word] = 1
We import the relevant NLTK libraries and download the list of English stop words. We break “formatted_article_text” into individual words and run them through a for loop that counts how often each word appears in the text. The loop also checks whether each word is a stop word, in which case it is skipped and never enters the frequency table.
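The tally logic itself doesn’t depend on NLTK, so here is a minimal sketch of it using a tiny hand-rolled stop-word set (NLTK’s real list is far larger) and a toy sentence:

```python
# Tiny stand-in for NLTK's English stop-word list
stop_words = {'the', 'of', 'a', 'is', 'to', 'and', 'in'}

text = "the summary of the article is a summary of the text"

freq_table = {}
for word in text.split():
    word = word.lower()
    if word in stop_words:
        continue  # stop words never enter the table
    freq_table[word] = freq_table.get(word, 0) + 1

print(freq_table)  # → {'summary': 2, 'article': 1, 'text': 1}
```

Filtering stop words here matters because otherwise words like “the” would dominate the table and inflate every sentence’s score equally, telling us nothing about importance.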
Assigning sentences values
Now we need to determine which sentences are important so we can remove the ones that aren’t. We do this by assigning each sentence a value determined by the words it contains and the frequency of those words.
# Break the text into sentences, then assign values based on word frequencies
sentences = sent_tokenize(article_text)

sentenceValue = dict()
for sentence in sentences:
    for word, freq in freqTable.items():
        if word in sentence.lower():
            if sentence in sentenceValue:
                sentenceValue[sentence] += freq
            else:
                sentenceValue[sentence] = freq

sumValues = 0
for sentence in sentenceValue:
    sumValues += sentenceValue[sentence]
Above, we use sent_tokenize to divide “article_text” (the version with punctuation and numbers) into sentences. Each sentence is given a “sentenceValue” equal to the sum of the frequencies of the table words it contains.
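Here is a toy illustration of that scoring rule with a hand-made frequency table and three hypothetical sentences. Note that the match is a substring check on the lower-cased sentence, as in the loop above:

```python
# Made-up frequency table and sentences, just to show the scoring rule
freq_table = {'nlp': 3, 'summarizer': 2, 'python': 1}
sentences = ["NLP powers the summarizer.", "Python is fun.", "Nothing relevant here."]

sentence_value = {}
for sentence in sentences:
    for word, freq in freq_table.items():
        if word in sentence.lower():
            sentence_value[sentence] = sentence_value.get(sentence, 0) + freq

print(sentence_value)
# → {'NLP powers the summarizer.': 5, 'Python is fun.': 1}
```

The first sentence scores 3 + 2 = 5 because it contains two table words; the last sentence matches nothing and never enters the dictionary at all, which is why the summary loop later checks `sentence in sentenceValue` before comparing scores.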
Removing unimportant sentences
It’s finally time to remove the sentences we don’t need, leaving us with our final summary!
# Average value of a sentence from the original text
average = int(sumValues / len(sentenceValue))

# If a sentence's value exceeds the average * 1.2, include it in the summary
summary = ''
for sentence in sentences:
    if (sentence in sentenceValue) and (sentenceValue[sentence] > (1.2 * average)):
        summary += " " + sentence

# Print summary and analytics (225 words per minute is an average reading speed)
print("Original article URL: " + url + "\n")
print(summary + "\n")
print("Original word count: " + str(len(article_text.split())))
print("Summarized word count: " + str(len(summary.split())))
print("Percent reduction: " + str("%.2f" % (100 - len(summary.split()) * 100 / len(article_text.split()))) + "%")
print("Time reduction: " + str("%.0f" % (len(article_text.split()) / 225)) + " minutes to " + str("%.0f" % (len(summary.split()) / 225)) + " minutes")
We find the average sentence value by dividing the sum of all values by the number of sentences. Next, we create an empty string “summary” and append every sentence whose value is greater than 1.2x the average. I found that 1.2x works well, but feel free to play around with it to see what works for you. We now have our summarized text stored in “summary”!
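If tuning the multiplier feels fiddly, an alternative (not what the code above does, just a variation worth knowing) is to keep a fixed number of top-scoring sentences with heapq.nlargest, using made-up scores here:

```python
import heapq

# Hypothetical sentence scores; keep the top 2 instead of thresholding
sentence_value = {'A key finding.': 9, 'Minor aside.': 2, 'Core result.': 7, 'Filler.': 1}

top_sentences = heapq.nlargest(2, sentence_value, key=sentence_value.get)
summary = ' '.join(top_sentences)
print(summary)  # → 'A key finding. Core result.'
```

The trade-off: a fixed count gives you predictable summary length, while the threshold approach lets the article's own score distribution decide how much survives.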
As a bonus, I had the program output a mini report which prints the following:
- Original article URL
- The summary
- Original word count
- Summarized word count
- Percent reduction
- Reading time reduction
Here are a few example runs:
- Article: https://goo.gl/Qfq8Ah Output: https://goo.gl/7RfXgq
- Article: https://goo.gl/AfHLL5 Output: https://goo.gl/QqeMXA
- Article: https://goo.gl/gE8rEX Output: https://goo.gl/Gfpbjk
Natural Language Processing is adding another tool to the arsenal of computers: understanding human languages. It is crazy to think about all the ways NLP is affecting our lives, from simple text summarizers to Google Duplex booking appointments at hair salons. Having an understanding of this technology is super important. You now know how to make an extractive text summarizer, and you can look forward to creating a bag of words classifier in Part 2!