Abstract

With the abundance of data now available in digital form, most of it unstructured text, there is a need for automatic text summarization tools that make it easy to draw insights from that data. Text summarization is an old concept, yet it remains a difficult task. It aims to generate a concise and precise summary of voluminous text, focusing on the sections that convey useful information without losing the overall meaning.

There are two kinds of text summarization techniques: extractive and abstractive. This paper covers the different technologies and approaches that can be combined to generate effective and meaningful extractive summaries. We showcase the summarization of financial research reports as a use case.

Introduction

We humans are generally good at summarization, as it is a cognitive activity. It involves first understanding the meaning of the source document and then distilling that meaning and capturing its salient features. We do this implicitly and intuitively, so it appears spontaneous.

Automatic summarization is a problem in the field of Natural Language Processing, or NLP. The goal is to generate a concise and meaningful summary of text from sources such as books, news articles, blog posts, research papers, emails, and tweets. The aim is for a machine to understand the insights of a document and to generate a summary without losing the context or important features. It is not enough to generate words and phrases that capture the gist of the source document; the summary should be accurate and should read fluently as a new standalone document.

There are two general approaches to text summarization[1]:

  • Extractive summarization[1]: Selecting important sentences, paragraphs, etc. from the original document and concatenating them into a shorter form.
  • Abstractive summarization: Developing an understanding of the main concepts in a document and then expressing those concepts in clear natural language[1].

Extractive summarization

The process of extracting content from the source without altering it is called extractive summarization[1]. It involves identifying key sentences that capture the meaning of the document concisely while weeding out statistically less important content.

Bulk of the extractive summarization techniques work on the following structure

  • Cleanse and pre-process the text that needs to be extracted
  • Ranking of sentences in order of importance
  • Creating a summary based on the top ‘n’ most important sentences

Cleansing or pre-processing involves removing noise from the document: headers, footers, author information, disclaimers, etc. These are elements we as humans implicitly discard when reading a document, unless they have a bearing on the summarization we are performing. If such noise is not discarded, it may distort the summarization process.

Text ranking orders the sentences in a text by importance so that the most critical ones can be selected to convey its meaning. Statistical algorithms are used for this purpose, and some of them are detailed below.

Text Rank Algorithm

One of the popular algorithms to rank text is the Text Rank algorithm. It ranks the sentences in a document in order of importance by measuring how similar each sentence is to all other sentences in the text; the most important sentence is the one most similar to all the others. This approach is most suitable for single-domain documents or text that centers on a core topic.
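To make this concrete, the following is a minimal sketch of a TextRank-style ranker, assuming the scikit-learn and networkx libraries; the helper name and sample sentences are illustrative, not the exact implementation used in this paper.

    # Minimal TextRank-style sketch: rank sentences by similarity to the rest.
    import networkx as nx
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    def textrank_summary(sentences, n=2):  # hypothetical helper
        # Vectorize the sentences and compute pairwise cosine similarities.
        tfidf = TfidfVectorizer().fit_transform(sentences)
        sim = cosine_similarity(tfidf)
        # Sentences become graph vertices; similarity scores become edge weights.
        scores = nx.pagerank(nx.from_numpy_array(sim))
        # Keep the top-n sentences, restored to their original document order.
        top = sorted(scores, key=scores.get, reverse=True)[:n]
        return [sentences[i] for i in sorted(top)]

    sentences = [
        "The fund outperformed its benchmark this quarter.",
        "Outperformance was driven mainly by equity holdings.",
        "The report also lists the firm's office locations.",
    ]
    print(textrank_summary(sentences, n=1))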

Bag of Words

This is also a statistical approach, which weighs the sentences of a document as a function of high-frequency words while disregarding very high-frequency common words, or stop words. This technique is referred to as 'bag of words'.

Based on this word-frequency count, sentences are given scores. Sentences whose score crosses a threshold can be considered for the summary.
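As an illustration, here is a minimal bag-of-words scoring sketch in Python; the stop-word list, threshold, and function name are illustrative assumptions.

    # Minimal bag-of-words sketch: score sentences by the frequency of their
    # words (stop words removed) and keep those above a threshold.
    import re
    from collections import Counter

    STOP_WORDS = {"the", "a", "an", "in", "is", "of", "and", "to"}  # toy list

    def bow_summary(text, threshold=0.5):  # hypothetical helper
        sentences = re.split(r"(?<=[.!?])\s+", text)
        words = [w for w in re.findall(r"[a-z']+", text.lower())
                 if w not in STOP_WORDS]
        freq = Counter(words)
        max_freq = max(freq.values())
        picked = []
        for s in sentences:
            tokens = [w for w in re.findall(r"[a-z']+", s.lower())
                      if w not in STOP_WORDS]
            if not tokens:
                continue
            # Average word frequency, normalised by the most frequent word.
            score = sum(freq[w] for w in tokens) / (len(tokens) * max_freq)
            if score >= threshold:
                picked.append(s)  # keep original document order
        return " ".join(picked)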

TF-IDF

However, the words with the highest counts are not always the most important. Sometimes rarely used words provide greater insight into a document, and this is where Term Frequency-Inverse Document Frequency, or TF-IDF, comes into the picture. Term frequency is the number of times a word appears in a document divided by the total number of words in the document, while inverse document frequency weights a word by how rare it is across the document collection. With TF-IDF, rarely used words are given more weight, and so are the sentences that contain them.
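A minimal TF-IDF sketch, assuming scikit-learn, is shown below; each sentence is treated as a "document" so that rare but informative words raise the scores of the sentences containing them. The helper name is illustrative.

    # Minimal TF-IDF sketch: a sentence's score is the sum of the TF-IDF
    # weights of its words, so rare, informative words count for more.
    from sklearn.feature_extraction.text import TfidfVectorizer

    def tfidf_rank(sentences, n=2):  # hypothetical helper
        matrix = TfidfVectorizer(stop_words="english").fit_transform(sentences)
        scores = matrix.sum(axis=1).A1  # one score per sentence
        top = scores.argsort()[::-1][:n]
        return [sentences[i] for i in sorted(top)]  # restore document order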

Topic Modelling

Topic modelling is an unsupervised algorithm for detecting the topics that occur in a document and automatically clustering groups of similar words that reflect each topic. Bag of words along with topic modelling enables clustering of words around topics in a document, lending itself to better summarization.

Formatting, Position of Text, and Keyword-based Approaches to Ranking Text

Text may also be ranked using visual and more practical approaches, such as utilizing the formatting and position of text in a document to gauge its relevance. A practical alternative is to assume that sentences in the initial and final positions of paragraphs have a higher probability of being relevant; collating such sentences may give a reasonable summary.

In some cases the document itself may contain a highlights section or bold-font words that reflect the important aspects. Sentence weight may also be computed as a function of content words and words in subheadings. Summarization could likewise be performed by interpreting the sections of the document that closely match the table of contents, headings, etc. In this technique it is therefore essential for the program to interpret the formatting of the document in addition to the text itself.
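The position heuristic can be sketched in a few lines; the boost values below are illustrative assumptions, not tuned weights.

    # Minimal position-based sketch: sentences that open or close a paragraph
    # get a higher prior weight than those in the middle.
    def position_scores(paragraphs):  # hypothetical helper
        scored = []
        for para in paragraphs:
            for i, sentence in enumerate(para):
                boost = 1.0 if i in (0, len(para) - 1) else 0.5
                scored.append((boost, sentence))
        return scored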

Combination of Approaches

No approach is complete by itself, and combinations of techniques have to be used to arrive at a meaningful summary. A domain-specific lexicon or dictionary may help identify specific keywords. Single-topic documents are easier to summarize than multi-topic documents, with the Text Rank algorithm giving very good results. For multi-topic documents, topic modelling followed by summarization using the above algorithms may give better results.

An area that needs more research is summarizing content from images or tables within documents. This depends on the criticality of such content, and effective summarization techniques for it are yet to be explored. For example, in a financial statement the most important part of the document may be the balance sheet or the income statement rather than the narrative text itself; in this case, summarizing the tabular data, or extracting it as is, may be more useful for the reader.

Abstractive summarization

Abstractive summarization methods build an internal semantic representation of the original document and then use this representation to create a summary that mimics human output. This is more challenging than extraction, as it involves both natural language processing and a deep understanding of the domain of the original text, which is why most summarization systems are extractive. The objective of abstractive summarization is to derive insights or meaning from documents, for example an expert view of a topic. This paper deals with our experiences with extractive summarization.

Use Case: Research Report Summarization

Research groups in asset management firms need to consume lengthy research reports, as shown in Figure 1, and provide insights to fund managers to support portfolio design, performance monitoring, and timely decision-making. Research analysts spend significant time on data collation and analysis from multiple sources, resulting in longer turnaround times and challenges in expanding coverage. These research reports run into multiple pages with a lot of structured and unstructured data in the form of tables, graphs, pictures, etc.

There is a need to summarize and extract relevant information from these reports to make them easier to consume. The use case here is to summarize research reports to bring efficiency to the research process. We used extractive summarization techniques for this purpose.


Figure 1: Sample Research Reports

https://www.blackrock.com/corporate/literature/whitepaper/bii-macro-perspectives-august-2018.pdf

Structure of a Research Report

A research report, typically sent by a financial analyst or research firm, may deal with a single topic (an insight from an analyst) or with multiple topics, for example a roundup of key financial events. In addition to the main content, it also contains details of the firm and the authors publishing the report, headers and footers, disclaimers, fine-print details of information sources, images, and tables. The majority of research reports are in PDF format. For our use case we chose single-topic research reports. The sample reports we chose are:

https://www.blackrock.com/corporate/literature/whitepaper/bii-macro-perspectives-august-2018.pdf

https://www.blackrock.com/corporate/literature/whitepaper/bii-global-equity-outlook-november-2018.pdf

The following steps were adopted to summarize these documents:

  • Cleanse Text
  • Pre-Processing
  • Algorithm-based ranking
  • Final Summary


Figure 2: Extraction Process for sample use case

Text Cleansing

The first step is to extract text from the PDF documents. We used popular libraries such as pdfminer[2] and PyPDF2[3] to extract text from the PDFs. The next step is to remove noise such as headers, footers, author information, disclaimers, etc. These are sections we as humans implicitly discard when reading a document, unless the text has a bearing on the summarization we are performing. This step is critical, as leaving such noise in place will skew the sentence ranks produced by the algorithms that follow.
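A minimal extraction-and-cleansing sketch follows, using pdfminer.six's high-level API (PyPDF2 offers similar functionality); the file name and the noise patterns are illustrative.

    # Minimal sketch: extract raw text, then drop lines that look like noise.
    import re
    from pdfminer.high_level import extract_text

    raw = extract_text("research_report.pdf")  # hypothetical input file

    # Illustrative noise patterns for headers, footers, and disclaimers.
    NOISE = re.compile(r"(page \d+|all rights reserved|disclaimer)",
                       re.IGNORECASE)
    clean = "\n".join(line for line in raw.splitlines()
                      if not NOISE.search(line))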

Pre-Processing

Pre-processing converts the text into a series of keywords or relevant tokens; a short sketch follows the list below.

  • Tokenization: The process of splitting a string or text into a list of tokens. A token can be thought of as a part of a whole: a word is a token in a sentence, and a sentence is a token in a paragraph.
  • Removal of stop words: A stop word is a commonly used word (such as “the”, “a”, “an”, “in”). Removing such words shifts the focus to the more relevant words in a document.
  • Lemmatization: The process of grouping together the different inflected forms of a word so they can be analysed as a single item[12].
  • Applying bi-grams: A bigram is a sequence of two adjacent elements from a string of tokens, typically letters, syllables, or words. Bi-grams help group commonly co-occurring words into a meaningful unit, e.g. “balance sheet”.
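Here is a minimal pre-processing sketch, assuming NLTK for tokenization, stop-word removal, and lemmatization, and gensim's Phrases model for bi-grams; the sample sentences are illustrative.

    # Minimal pre-processing sketch: tokenize, drop stop words, lemmatize,
    # then learn bi-grams such as "balance_sheet" from co-occurrence counts.
    import nltk
    from nltk.corpus import stopwords
    from nltk.stem import WordNetLemmatizer
    from gensim.models.phrases import Phrases, Phraser

    # First run only: fetch the NLTK resources used below.
    for pkg in ("punkt", "stopwords", "wordnet"):
        nltk.download(pkg, quiet=True)

    docs = [
        "The balance sheet shows stronger cash flows.",
        "Analysts reviewed the balance sheet in detail.",
    ]
    stops = set(stopwords.words("english"))
    lemmatizer = WordNetLemmatizer()

    tokenized = [
        [lemmatizer.lemmatize(w) for w in nltk.word_tokenize(d.lower())
         if w.isalpha() and w not in stops]
        for d in docs
    ]
    # Phrases merges frequently co-occurring pairs into single tokens.
    bigram = Phraser(Phrases(tokenized, min_count=1, threshold=1))
    print([bigram[t] for t in tokenized])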

The diagram below shows the extraction of tokens from cleansed input text.


Figure 3: Tokenizing the input text

Algorithm-based Ranking

To rank the text, we applied multiple algorithms to the tokenised text obtained after pre-processing:

  • Bag of Words
  • Text Rank
  • Topic Modelling

The bag-of-words algorithm calculates how frequently each word appears in the document and assigns a weight to it. Sentence scores are then derived from the weighted words. Sentences whose weight crosses a threshold are concatenated in the same order in which they appear in the original document.

Extraction using the Text Rank algorithm involves calculating similarities between sentences and creating a similarity matrix. The similarity matrix is then converted into a graph, with sentences as vertices and similarity scores as edges, for sentence-rank calculation. At the end, a certain number of top-ranked sentences form the final summary.

Topic modelling is a statistical method of identifying the topics that best describe the document. One popular topic modelling technique is Latent Dirichlet Allocation (LDA). We used Gensim LDA[11] for topic modelling. The LDA model is an unsupervised learning algorithm that groups the word tokens obtained after pre-processing into a pre-defined number of topics (Figure 4 below).
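The following is a minimal Gensim LDA sketch along the lines described above; the token lists and topic count are illustrative.

    # Minimal Gensim LDA sketch: build a dictionary and bag-of-words corpus
    # from pre-processed tokens, then fit a fixed number of topics.
    from gensim import corpora
    from gensim.models import LdaModel

    docs = [
        ["equity", "outlook", "overweight", "stocks"],
        ["bond", "yield", "duration", "rates"],
    ]
    dictionary = corpora.Dictionary(docs)
    corpus = [dictionary.doc2bow(doc) for doc in docs]

    lda = LdaModel(corpus=corpus, id2word=dictionary,
                   num_topics=2, passes=10, random_state=0)
    for topic_id, words in lda.print_topics():
        print(topic_id, words)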


Figure 4: Word Cloud generated using Topic Modelling

The model was further tweaked by adjusting different parameters to improve the quality of topic identification.

Final Summary

The summaries from all these methods were merged, redundancies were removed, and the final summary was assembled in order of appearance in the document. The results are displayed in Figure 5 below: the source PDF appears on the left and its corresponding summary on the right, framed inside a simple UI we created. The summary is short and crisp and covers most of the information in the document.
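A minimal sketch of the merge step, assuming each method returns a collection of selected sentences; the function name is illustrative.

    # Minimal merge sketch: union the sentences picked by each method,
    # drop duplicates, and restore the original document order.
    def merge_summaries(document_sentences, *candidate_summaries):  # hypothetical
        picked = set()
        for summary in candidate_summaries:
            picked.update(summary)
        return [s for s in document_sentences if s in picked]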


Figure 5: UI showing the source document and its summary

Technology Stack

The table below lists some of the technologies we used to create the summary.


Conclusion

The extractive summarization use case above involved single-topic documents. As shown in Figure 6, the need for summarization spans multiple information sources.

Summarizing multiple documents on the same topic, for example, may be useful in presenting multiple analyst views on a particular event while highlighting a contrarian view. Another area for exploration is summarizing a multi-topic document, where each topic has to be summarized individually. Cognitive search is also a summarization use case: in contract documents, for example, summarization may involve identifying the contract value, contract terms, key clauses, and contract date by scanning the documents.

There is a real need for summarization, especially in the BFS world, where there are alternative sources of data (Alternate Data[10]) such as audio files and videos that contain a plethora of useful information, e.g. a CEO speaking about a company's future plans and goals, or tabular data in a financial report presenting company performance. There is also a lot of other data available on the internet, such as public records, press releases, and news feeds, which gives more insight into a company. Each of these sources requires different techniques to create a concise summary.


Figure 6: Data sources requiring summarization

The final frontier in summarization is abstractive summarization. It is considered the 'Holy Grail' of this research area: a program able to derive insights from documents and provide an expert opinion. Our future work revolves around these areas.

References

  1. https://en.wikipedia.org/wiki/Automatic_summarization
  2. https://pdfminer-docs.readthedocs.io/pdfminer_index.html
  3. https://pythonhosted.org/PyPDF2/
  4. https://www.analyticsvidhya.com/blog/2018/11/introduction-text-summarization-textrank-python/
  5. https://towardsdatascience.com/gentle-start-to-natural-language-processing-using-python-6e46c07addf3
  6. https://www.textcompactor.com
  7. https://www.tools4noobs.com/summarize/
  8. https://machinelearningmastery.com/what-are-word-embeddings/
  9. https://www.blackrock.com/corporate/insights/blackrock-investment-institute/archives#investment-outlook
  10. https://en.wikipedia.org/wiki/Alternative_data_(finance)
  11. https://towardsdatascience.com/topic-modeling-and-latent-dirichlet-allocation-in-python-9bf156893c24
  12. https://en.wikipedia.org/wiki/Lemmatisation