Beyond bag of words: Using PyTextRank to find Phrases and Summarize text

Aneesha Bakharia
5 min read · Oct 11, 2017


The bag of words model has numerous weaknesses, especially when applied to natural language processing tasks, that graph ranking algorithms such as TextRank are able to address. In particular, TextRank is able to incorporate word sequence information. Bag of words simply refers to a matrix in which the rows are documents and the columns are words; the value linking a document to a word can be a count of the word's occurrences within the document or a tf-idf weight. The bag of words matrix is then provided to a machine learning algorithm. Using word counts or tf-idf, we are only able to identify key single-word terms in a document. In this blog post, I will describe the TextRank algorithm, which is able to identify multi-word phrases and summarize text, and introduce the PyTextRank library.
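As a quick illustration of the matrix just described, here is a minimal bag of words sketch in plain Python (the two example documents are made up):

```python
# Bag of words: rows are documents, columns are words, and each cell
# holds the number of times that word occurs in that document.
from collections import Counter

docs = [
    "textrank ranks words in a graph",
    "bag of words counts words",
]

# The shared vocabulary forms the matrix columns.
vocab = sorted({word for doc in docs for word in doc.split()})

# One row of word counts per document.
matrix = [[Counter(doc.split())[word] for word in vocab] for doc in docs]

print(vocab)
print(matrix)
```

A tf-idf variant would rescale each count by how rare the word is across documents; either way, the word order inside each document is lost, which is exactly the weakness TextRank addresses.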

The TextRank algorithm was introduced in 2004 by Rada Mihalcea and Paul Tarau. TextRank is a graph ranking algorithm, which simply means that nodes in a graph are scored using information from the global graph. The best-known graph ranking algorithm is Google’s PageRank. In the context of text, words are the nodes/vertices, and the co-occurrence of two words forms a link/relationship between them (i.e., an edge). Mihalcea and Tarau introduced a vertex ranking algorithm that takes edge weights into consideration. TextRank can be used for both keyword extraction and text summarization.

Multi-word Phrase Extraction

A brief outline of the keyword extraction process using TextRank:

  1. Words are tokenized and annotated with parts-of-speech tags
  2. Words are added to the graph as vertices (but first filtered based on whether they are a noun or adjective)
  3. An edge is added between words that co-occur within N words of each other
  4. The TextRank vertices ranking algorithm is run until convergence
  5. The vertices are ranked by their score and the top T are selected as keywords
  6. If vertices in the top T keywords appear as adjacent vertices in the graph, they are grouped together to form a multi-word expression/phrase.
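The six steps above can be sketched in plain Python. This is a toy illustration, not PyTextRank's implementation: the part-of-speech tags are hard-coded (a real pipeline would run a tagger), and the window size and damping factor are the usual TextRank defaults.

```python
from collections import defaultdict

# Step 1: tokenized words with (hard-coded) part-of-speech tags.
tagged = [("compatibility", "NOUN"), ("of", "ADP"), ("systems", "NOUN"),
          ("of", "ADP"), ("linear", "ADJ"), ("constraints", "NOUN"),
          ("over", "ADP"), ("natural", "ADJ"), ("numbers", "NOUN")]

# Step 2: only nouns and adjectives become graph vertices.
keep = {w for w, pos in tagged if pos in ("NOUN", "ADJ")}

# Step 3: add an edge between kept words co-occurring within N tokens.
N = 2
words = [w for w, _ in tagged]
edges = defaultdict(set)
for i, u in enumerate(words):
    for j in range(i + 1, min(i + N + 1, len(words))):
        v = words[j]
        if u in keep and v in keep and u != v:
            edges[u].add(v)
            edges[v].add(u)

# Step 4: run the PageRank-style vertex ranking until convergence.
d = 0.85  # damping factor
score = {w: 1.0 for w in keep}
for _ in range(50):
    score = {w: (1 - d) + d * sum(score[u] / len(edges[u]) for u in edges[w])
             for w in keep}

# Step 5: rank the vertices; step 6 would then merge top words that are
# adjacent in the text (e.g. "linear" + "constraints") into phrases.
top = sorted(score, key=score.get, reverse=True)
print(top)
```

Note how the filler words ("of", "over") never enter the graph, so they can never pick up rank, while well-connected content words rise to the top.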

Sentence Extraction for Summarization

A brief outline of the sentence extraction process using TextRank:

  1. Each sentence is added as a vertex in the graph
  2. Sentence co-occurrence can’t be used as the relationship between two sentences, because sentences don’t get repeated in text! Instead, in the Mihalcea and Tarau paper, the similarity between two sentences (i.e., their word overlap) is calculated and used as the edge weight, with a normalization factor to penalize longer sentences. It should be noted that any similarity measure can be used here; PyTextRank uses the Jaccard distance.
  3. The TextRank vertices ranking algorithm is run until convergence
  4. The sentences (i.e., vertices) are ranked by their score and the top T can be used to summarize the text.
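Again as a toy illustration (not PyTextRank's actual code), the sentence-ranking steps can be sketched in plain Python, here using Jaccard similarity over word sets as the edge weight:

```python
sentences = [
    "TextRank builds a graph of sentences",
    "Edges between sentences are weighted by word overlap",
    "The cat sat on the mat",
]
word_sets = [set(s.lower().split()) for s in sentences]

def jaccard(a, b):
    """Word-overlap similarity between two sentences."""
    return len(a & b) / len(a | b)

# Edge weights: pairwise sentence similarity (zero on the diagonal).
n = len(sentences)
w = [[jaccard(word_sets[i], word_sets[j]) if i != j else 0.0
      for j in range(n)] for i in range(n)]

# Weighted PageRank-style iteration until convergence.
d = 0.85
score = [1.0] * n
for _ in range(50):
    score = [(1 - d) + d * sum(w[j][i] * score[j] / sum(w[j])
                               for j in range(n) if w[j][i] > 0)
             for i in range(n)]

# The top-ranked sentences become the summary; the unrelated
# "cat sat on the mat" sentence ranks last.
ranked = sorted(range(n), key=lambda i: score[i], reverse=True)
print([sentences[i] for i in ranked])
```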

In-depth Analysis of TextRank

PyTextRank is a great library, but if you want to delve a little deeper into TextRank and learn to program it yourself, I recommend starting with a simpler implementation. NLP for Hackers has a blog post on TextRank that only uses NumPy and NLTK; the graph ranking is implemented in pure NumPy.

Using PyTextRank

It’s great to have an understanding of how TextRank works, but you don’t need to implement the algorithm yourself if you are using Python. PyTextRank is an amazing, robust Python library that uses spaCy, datasketch and NetworkX. PyTextRank is even used by O’Reilly Media in production.

PyTextRank includes a Jupyter Notebook that provides step-by-step instructions for loading a JSON file containing the text and displaying the output of the TextRank algorithm.

Example 1

This is the example paragraph from the original TextRank paper, widely reused in the multi-word expression extraction literature:

“Compatibility of systems of linear constraints over the set of natural numbers. Criteria of compatibility of a system of linear Diophantine equations, strict inequations, and nonstrict inequations are considered. Upper bounds for components of a minimal set of solutions and algorithms of construction of minimal generating sets of solutions for all types of systems are given. These criteria and the corresponding algorithms for constructing a minimal supporting set of solutions can be used in solving all the considered types of systems and systems of mixed types.”

The extracted multi-word expressions generated by PyTextRank:

  • [“types systems”, 0.09189173098990264, [24, 2], “np”, 1]
  • [“minimal set”, 0.07308266940796514, [19, 5], “np”, 1]
  • [“strict inequations”, 0.05343195271225413, [11, 12], “np”, 1]
  • [“mixed types”, 0.04594586549495132, [33, 24], “np”, 1]
  • [“natural numbers”, 0.036876246384868736, [6, 7], “np”, 1]
  • [“minimal generating sets”, 0.03654133470398257, [19, 23, 5], “np”, 1]
  • [“linear constraints”, 0.03206715477373365, [3, 4], “np”, 1]
  • [“linear diophantine equations”, 0.028873346046550823, [3, 9, 10], “np”, 1]
  • [“nonstrict inequations”, 0.026715976356127064, [13, 12], “np”, 1]

The top 2 ranked summary sentences:

“Compatibility of systems of linear constraints over the set of natural numbers. Criteria of compatibility of a system of linear Diophantine equations, strict inequations, and nonstrict inequations are considered.”

Example 2

In Example 2, TextRank is run over the opening paragraph of this blog post. TextRank does a good job on the keywords but misses “bag of words” because only nouns and adjectives are used. The top 2 sentences are the first 2 sentences of the paragraph, which I think provide a good summary.

The TextRank graph for Example 2 displayed using NetworkX

The extracted multi-word expressions generated by PyTextRank:

  • [“words model”, 0.09609026248373426, [37, 38], “np”, 1]
  • [“textrank algorithm”, 0.0634849444362508, [66, 21], “np”, 1]
  • [“word sequence information”, 0.04804513124186713, [37, 56, 57], “np”, 1]
  • [“words matrix”, 0.03930697477654861, [37, 44], “np”, 1]
  • [“natural language processing tasks”, 0.027496394401730084, [6, 40, 41, 42], “np”, 1]
  • [“key single word terms”, 0.026189952086614007, [60, 61, 37, 62], “np”, 1]
  • [“word occurrences”, 0.025826320912548863, [37, 51], “np”, 1]
  • [“numerous weaknesses”, 0.023471643384109276, [34, 35], “np”, 1]
  • [“multi-word phrases”, 0.018348338660060273, [67, 68], “np”, 1]
  • [“blog post”, 0.011574105593003477, [63, 64], “np”, 1]

The top 2 ranked summary sentences:

“There are numerous weaknesses with the bag of words model especially when applied to natural language processing tasks that graph ranking algorithms such as TextRank are able to address. TextRank is able to incorporate word sequence information.”

In a future blog post I will show how TextRank can be used to summarize the output of a topic model. Till then, have fun with algorithms…


Aneesha Bakharia

Data Science, Topic Modelling, Deep Learning, Algorithm Usability and Interpretation, Learning Analytics, Electronics — Brisbane, Australia