How to Learn More in Less Time with Natural Language Processing (Part 1)
And how to create your own extractive text summarizer
Imagine you are given an assignment from school or work that involves A LOT of research. You spend all night grinding it out, so you can acquire the knowledge you need for a high-quality end product.
Now imagine you are given the exact same assignment and you finish with the same high-quality result except this time you finished with lots of time to spare!
For obvious reasons the latter scenario is preferable. Time is a valuable asset, so we need to find a solution to one of the biggest time-wasting problems we face as a society: modern-day data influx.
As technologies advance, the amount of data we collect is increasing at exponential rates and it is becoming increasingly difficult to keep up with new information. Fortunately, we have Natural Language Processing (NLP), a technology we can leverage to help solve this issue.
This article will go through what NLP is and how you can create your own extractive text summarizer! In Part 2 we will look at creating a bag of words model to classify the subject of the article you chose to summarize. Let’s get into it!
What is Natural Language Processing and how can it help?
Natural Language Processing is a subfield of AI that deals with computer understanding of human languages. It includes everything from speech recognition to text understanding and more! Examples where this technology has been implemented include Google Assistant and Amazon's Alexa.
What if you could scrape news articles or textbooks online and summarize them to save lots of reading time? Maybe you would like to classify these articles by subject so you can create your own repository of summarized articles to look back at. This is all possible with NLP and this is exactly what we will go through in the following tutorial!
How to create an Extractive Text Summarizer (Python)
- Scrape article from internet
- Evaluate how often words occur in the text
- Divide article into sentences
- Assign each sentence a value based on the words it has and those words’ frequencies
- Remove sentences whose values don’t reach a certain threshold
Getting your corpus of text
To start summarizing our text, we first need to acquire the article. To scrape it from the internet I used BeautifulSoup:
from bs4 import BeautifulSoup
import requests

# Scrape the URL for all paragraphs and create a collective string
url = input('Enter the article URL: ')
result = requests.get(url)
c = result.content
soup = BeautifulSoup(c, 'html.parser')

article_text = ''
article = soup.findAll('p')
for element in article:
    article_text += '\n' + ''.join(element.findAll(text=True))
We started by importing the BeautifulSoup and requests modules, then let the user input the URL they want to scrape. Requests fetches the page and BeautifulSoup parses its content. Next, we use BeautifulSoup to find all paragraph elements (p) on the website and store them in “article”. Lastly, we use a for loop to go through all the paragraph elements we found and append their text to article_text, leaving us with one long string containing the article.
Something to note about this method is that it won’t work for all websites. Some websites use tags other than paragraph tags (e.g. article tags). You can adjust the code accordingly.
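As a sketch of that adjustment, here is how the same scraping loop might look for a page built with article tags. The HTML snippet below is a made-up stand-in for a real page, so you can see the behavior without an internet connection:

```python
from bs4 import BeautifulSoup

# Hypothetical snippet standing in for a page that wraps its text in <article> tags
html = """
<html><body>
  <article>First block of text.</article>
  <article>Second block of text.</article>
</body></html>
"""

soup = BeautifulSoup(html, 'html.parser')
article_text = ''
for element in soup.find_all('article'):  # swap 'p' for whatever tag the site uses
    article_text += '\n' + element.get_text()

print(article_text)
```

The only change from the paragraph-scraping version is the tag name passed to find_all.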
Cleaning the text
Right now our text has a lot of punctuation marks and numbers, all of which we don’t want included when we calculate word frequencies. We must remove these for the time being!
import re

# Remove special characters and numbers from the text
article_text = re.sub(r'\[[0-9]*\]', ' ', article_text)
article_text = re.sub(r'\s+', ' ', article_text)
formatted_article_text = re.sub('[^a-zA-Z]', ' ', article_text)
formatted_article_text = re.sub(r'\s+', ' ', formatted_article_text)
Above we imported regex (re) to help us format the text. In the four lines after the import, we remove everything we don’t want and store the result as “formatted_article_text”. Be mindful that we still have “article_text”, which is our article with its numbers and punctuation intact.
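To make the two cleaning passes concrete, here is what they do to a short made-up sample. The first pair of substitutions strips bracketed reference numbers and collapses whitespace; the second keeps only letters for the frequency counts:

```python
import re

sample = "NLP is growing fast [3]. By 2025, 80% of firms may use it!"

# Strip bracketed reference numbers like [3] and collapse whitespace
text = re.sub(r'\[[0-9]*\]', ' ', sample)
text = re.sub(r'\s+', ' ', text)

# Keep only letters, then collapse whitespace again
formatted = re.sub('[^a-zA-Z]', ' ', text)
formatted = re.sub(r'\s+', ' ', formatted)

print(text)       # → 'NLP is growing fast . By 2025, 80% of firms may use it!'
print(formatted)  # → 'NLP is growing fast By of firms may use it ' (letters only)
```

Notice that “formatted” has lost its digits and punctuation entirely, which is why we score sentences against it but assemble the final summary from the original text.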
Creating a dictionary of word frequencies
We are ready to evaluate word frequencies. Since the word frequencies will ultimately determine the value of each sentence in the article, we want to make sure that the words themselves have some level of meaning. Hence, we first want to remove stop words.
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, sent_tokenize

nltk.download('stopwords')
nltk.download('punkt')

# Tally word frequencies from the text
words = word_tokenize(formatted_article_text)
stopWords = stopwords.words('english')

freqTable = dict()
for word in words:
    word = word.lower()
    if word in stopWords:
        continue
    if word in freqTable:
        freqTable[word] += 1
    else:
        freqTable[word] = 1
We import the relevant NLTK libraries and download the list of English stop words. We break “formatted_article_text” into individual words and run them through a for loop that counts how often each word appears in the text. The loop also checks whether each word is a stop word, in which case it is skipped and never enters the frequency table.
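The tally logic itself doesn’t depend on NLTK, so here is a minimal sketch of it using a tiny hand-rolled stop-word set (NLTK’s real list is far larger) and a toy sentence:

```python
# Tiny stand-in for NLTK's English stop-word list
stop_words = {'the', 'of', 'a', 'is', 'to', 'and', 'in'}

text = "the summary of the article is a summary of the text"

freq_table = {}
for word in text.split():
    word = word.lower()
    if word in stop_words:
        continue  # stop words never enter the table
    freq_table[word] = freq_table.get(word, 0) + 1

print(freq_table)  # → {'summary': 2, 'article': 1, 'text': 1}
```

Filtering stop words here matters because otherwise words like “the” would dominate the table and inflate every sentence’s score equally, telling us nothing about importance.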
Assigning sentences values
Now we need to determine which sentences are important so we can remove the ones that aren’t. We do this by assigning each sentence a value determined by the words it contains and the frequency of those words.
# Break the text into sentences, then assign values based on word frequencies
sentences = sent_tokenize(article_text)

sentenceValue = dict()
for sentence in sentences:
    for word, freq in freqTable.items():
        if word in sentence.lower():
            if sentence in sentenceValue:
                sentenceValue[sentence] += freq
            else:
                sentenceValue[sentence] = freq

sumValues = 0
for sentence in sentenceValue:
    sumValues += sentenceValue[sentence]
Above, we use sent_tokenize to divide “article_text” (the version with punctuation and numbers) into sentences. Each sentence is given a “sentenceValue” equal to the sum of the frequencies of the table words it contains.
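Here is a toy illustration of that scoring rule with a hand-made frequency table and three hypothetical sentences. Note that the match is a substring check on the lower-cased sentence, as in the loop above:

```python
# Made-up frequency table and sentences, just to show the scoring rule
freq_table = {'nlp': 3, 'summarizer': 2, 'python': 1}
sentences = ["NLP powers the summarizer.", "Python is fun.", "Nothing relevant here."]

sentence_value = {}
for sentence in sentences:
    for word, freq in freq_table.items():
        if word in sentence.lower():
            sentence_value[sentence] = sentence_value.get(sentence, 0) + freq

print(sentence_value)
# → {'NLP powers the summarizer.': 5, 'Python is fun.': 1}
```

The first sentence scores 3 + 2 = 5 because it contains two table words; the last sentence matches nothing and never enters the dictionary at all, which is why the summary loop later checks `sentence in sentenceValue` before comparing scores.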
Removing unimportant sentences
It’s finally time to remove the sentences we don’t need, leaving us with our final summary!
# Average value of a sentence from the original text
average = int(sumValues / len(sentenceValue))

# If a sentence's value exceeds the average * 1.2, include it in the summary
summary = ''
for sentence in sentences:
    if (sentence in sentenceValue) and (sentenceValue[sentence] > (1.2 * average)):
        summary += " " + sentence

# Print summary and analytics (225 words per minute is an average reading speed)
print("Original article URL: " + url + "\n")
print(summary + "\n")
print("Original word count: " + str(len(article_text.split())))
print("Summarized word count: " + str(len(summary.split())))
print("Percent reduction: " + str("%.2f" % (100 - len(summary.split()) * 100 / len(article_text.split()))) + "%")
print("Time reduction: " + str("%.0f" % (len(article_text.split()) / 225)) + " minutes to " + str("%.0f" % (len(summary.split()) / 225)) + " minutes")
We find the average sentence value by dividing the sum of all values by the number of sentences. Next, we create an empty string “summary” and append every sentence whose value is greater than 1.2x the average. I found that 1.2x works well, but feel free to play around with it to see what works for you. We now have our summarized text stored in “summary”!
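If tuning the multiplier feels fiddly, an alternative (not what the code above does, just a variation worth knowing) is to keep a fixed number of top-scoring sentences with heapq.nlargest, using made-up scores here:

```python
import heapq

# Hypothetical sentence scores; keep the top 2 instead of thresholding
sentence_value = {'A key finding.': 9, 'Minor aside.': 2, 'Core result.': 7, 'Filler.': 1}

top_sentences = heapq.nlargest(2, sentence_value, key=sentence_value.get)
summary = ' '.join(top_sentences)
print(summary)  # → 'A key finding. Core result.'
```

The trade-off: a fixed count gives you predictable summary length, while the threshold approach lets the article's own score distribution decide how much survives.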
As a bonus, I had the program output a mini report which prints the following:
- Original article URL
- The summary
- Original word count
- Summarized word count
- Percent reduction
- Reading time reduction
Here are a few example runs:
- Article: https://goo.gl/Qfq8Ah Output: https://goo.gl/7RfXgq
- Article: https://goo.gl/AfHLL5 Output: https://goo.gl/QqeMXA
- Article: https://goo.gl/gE8rEX Output: https://goo.gl/Gfpbjk
Natural Language Processing is adding another tool to the arsenal of computers: understanding human languages. It is crazy to think about all the ways NLP is affecting our lives, from simple text summarizers to Google Duplex booking appointments at hair salons. Having an understanding of this technology is super important. You now know how to make an extractive text summarizer, and you can look forward to creating a bag of words classifier in Part 2!