Keywords

1 Introduction

In this paper, we describe the implementation details of the automatic summarization system reported in [1]. The system mines text data on documents and webpages and uses knowledge base and inference engine to produce an abstractive summary. It generates summaries by composing new sentences based on the semantics derived from the text. The system combines syntactic structures and semantic features to provide summaries that contains information synthesized from various parts of the document. It is built on Cyc development platform that consists of the world’s largest knowledge ontology and one of the most powerful inference engines that allow information comprehension and generalization [2]. In addition, the Cyc knowledge ontology provides the domain knowledge for the subject matter discussed in the documents.

Abstractive document summarization is a task that is still considered complex for a human and especially for a machine. When human experts perform document summarization they tend to use their domain expertise about subject matter to merge information from various parts of the document and synthesize novel information, which was not explicitly mentioned in the text [3]. Our proposed system aims to follow similar approach. It generalizes new abstract concepts based on the knowledge derived from the text. It automatically detects main topics described in the text. Moreover, it composes new English sentences for some of the most significant concepts. The created sentences form an abstractive summary, combining concepts from different parts of the input text.

Our text data mining system is domain independent and unsupervised, being limited only by the common sense ontology provided by the Cyc development platform. The system conducts summarization process in three steps: knowledge acquisition, knowledge discovery, and knowledge representation.

The knowledge acquisition step derives syntactic structure of each sentence of the input document and maps words and their relations into Cyc knowledge base. Next, the knowledge discovery step generalizes concepts upward in the Cyc ontology and detects main topics covered in the text. Finally, the knowledge representation step composes new sentences for some of the most significant concepts defined in main topics. The syntactic structure of the newly created sentences follows an enhanced subject-predicate-object model, where adjective and adverb modifiers are used to produce more complex and informative sentences.

The system was implemented as a pipelined and modular data mining framework. Such system design allows comprehensible data flow, convenient maintenance and implementation of additional functionality as needed. The system was tested on various documents and webpages. The test results show that the system is capable of identifying key concepts and discovering main topics comprised in the original text, generalizing new concept not explicitly mentioned in the text and creating new sentences that contain information synthesized from various parts of the text. The newly created sentences have complex syntactic structures that enhance subject-predicate-object triplets with adjective and adverb modifiers. For example, the sentence “Colored grapefruit being sweet edible fruit” was automatically generated by the system analyzing encyclopedia articles describing grapefruits. Here, the subject concept “grapefruit” is modified by the adjective concept “colored” that was not explicitly mentioned in the text and the object concept “edible fruit” is modified by the adjective concept “sweet”. The modifiers are chosen based on the weight of the syntactic relation.

The rest of the paper is organized as follows. Section 2 outlines related work undertaken in automatic text summarization area. Section 3 gives a brief overview of the summarization process steps performed by the system. Section 4 covers system implementation details. Section 5 provides thorough description of the system modules. Section 6 presents testing results. Section 7 discusses conclusions and directions of future work.

2 Related Work

Automatic text summarization seeks to compose a concise and coherent version of the original text preserving the most important information. Computational community has studied automatic text summarization problem since late 1950s [4]. Studies in this area are generally divided into two main approaches – extractive and abstractive. Extractive text summarization aims to select the most important sentences from original text to form a summary. Such methods vary by different intermediate representations of the candidate sentences and different sentence scoring schemes [5]. Summaries created by extractive approach are highly relevant to the original text, but do not convey any new information. Most prominent methods in extractive text summarization use term frequency versus inverse document frequency (TF-IDF) metric [6, 7] and lexical chains for sentence representation [8, 9]. Statistical methods based on Latent Semantic Analysis (LSA), Bayesian topic modelling, Hidden Markov Model (HMM) and Conditional random field (CRF) derive underlying topics and use them as features for sentence selection [10, 11]. Despite significant advancements in the extractive text summarization, such approaches are not capable of semantic understanding and limited to the shallow knowledge contained in the text.

In contrast, abstractive text summarization aims to incorporate the meaning of the words and phrases and generalize knowledge not explicitly mentioned in the original text to form a summary. Phrase selection and merging methods in abstractive summarization aim to solve the problem of combining information from multiple sentences. Such methods construct clusters of phrases and then merge only informative ones to form summary sentences [12]. Graph transformation approaches convert original text into a form of sematic graph representation and then combine or reduce such representation with an aim of creating an abstractive summary [13, 14]. Summaries constructed by described methods consist of sentences not used in the original text, combining information from different parts, but such sentences do not convey new knowledge.

Several approaches attempt to incorporate semantic knowledge base into automatic text summarization by using WordNet lexical database [8, 15, 16]. Major drawback of WordNet system is the lack of domain-specific and common sense knowledge. Unlike Cyc, WordNet does not have reasoning engine and natural language generation capabilities.

Recent rapid development of deep learning contributes to the automatic text summarization, improving state-of-the-art performance. Deep learning methods applied to both extractive [17] and abstractive [18] summarization show promising results, but such approaches require vast amount of training data and powerful computational resources.

Our system is similar to the one proposed in [19]. In this work, the structure of created sentences has simple subject-predicate-object pattern and new sentences are only created for clusters of compatible sentences found in the original text.

3 Overview of the Summarization Process

Our system conducts summarization process in three steps: knowledge acquisition, knowledge discovery, and knowledge representation. Summarization process workflow is illustrated in Fig. 1.

Fig. 1.
figure 1

System’s workflow diagram.

The knowledge acquisition step consists of two parts. It first takes text documents as an input and derives their syntactic structures. Then it maps each word in the text to its corresponding Cyc concept and assigns word’s weight and derived syntactic relations to that concept. The knowledge discovery step is responsible for abstracting new concepts that are not explicitly mentioned in the text. During this process, system derives ancestor concept for each mapped Cyc concept, assigns ancestor-descendant relation and adds scaled descendant concept weight and descendant concept associations to the ancestor concept. In addition, the system identifies main topics comprised in the text by clustering mapped Cyc concepts. During the knowledge representation step, the system first identifies most informative subject concepts in each of the discovered main topics and then composes English sentences for each identified subject. This process ensures that the summary sentences are composed using information synthesized from different parts of the text while preserving coherence to the main topics.

4 Details of the System’s Implementation

We chose Python as the implementation language to develop our system because of the advanced Natural Language Processing tools and libraries it supplies. Our system uses Cyc knowledge base and inference engine as a backbone for the semantic analysis. Cyc development platform supports communications with the knowledge base and utilization of the inference engine through the application programming interfaces (APIs) implemented in Java. We utilize Java-Python wrapper supported by JPype library to allow our system using Cyc Java API packages. JPype library is essentially an interface at a basic level of virtual machines [20]. It requires starting Java Virtual Machine with a path to the appropriate jar files before Java methods and classes can be accessible within Python code. Communication between our system and Cyc development platform is illustrated in Fig. 2. To the best of our knowledge, our developed system is the first Python-based system that allows communication with Cyc development platform.

Fig. 2.
figure 2

Communication between summarization system and Cyc development platform.

We have designed our system as a modular and pipelined data mining framework. Modularity provides the ability to conveniently maintain parts of the system and to add new functionality as needed. Pipelined design allows comprehensible data flow between different modules.

The system consists of seven modules:

  1. 1.

    Syntactic analysis;

  2. 2.

    Mapping words to Cyc KB;

  3. 3.

    Concepts propagation;

  4. 4.

    Concepts’ weights and relations accumulation;

  5. 5.

    Topics derivation;

  6. 6.

    Subjects identification;

  7. 7.

    New sentences generation.

Modules 1 and 2 together constitute the knowledge acquisition step of the summarization process. Modules 3, 4 and 5 together make up the knowledge discovery step of the summarization process. Modules 6 and 7 together form knowledge representation step of the summarization process. System modules are illustrated in Fig. 3.

Fig. 3.
figure 3

Modular design of the system.

5 Description of the System’s Modules

5.1 “Syntactic Analysis” Module

The first module in the system is the “Syntactic analysis” module. The role of this module is essentially a data preprocessing. The module takes documents as an input and transforms them into syntactic representations. It first performs text normalization by lemmatizing each word in each sentence. Then it derives part of speech tags, parses syntactic dependencies and counts word’s weights. The syntactic dependencies are recorded in the following format: (“word” “type” “head”), where “word” is the dependent element, “type” is the type of the dependency, and “head” is the leading element. For example, applying syntactic parser on the following sentence: “John usually drinks strong coffee” produces the following syntactic dependencies between words: (“John” “nsubj” “drinks”), (“coffee” “dobj” “drinks”), (“usually” “advmod” “drinks”), (“strong” “amod” “coffee”). Syntactic dependencies of the example sentence are illustrated in Fig. 4.

Fig. 4.
figure 4

Illustration of the syntactic dependencies of a sample sentence.

The “Syntactic analysis” module is implemented using SpaCy – Python library for advanced natural language processing. SpaCy library is the fastest in the world with the accuracy within one percent of the current state of the art systems for part of speech tagging and syntactic dependencies analysis [21]. The “Syntactic analysis” module operates outside of the Cyc development platform. The output of the module is a dictionary that contains words, their part of speech tags, weights and syntactic dependencies. This dictionary serves as an input for “Mapping words to Cyc KB” module.

5.2 “Mapping Words to Cyc KB” Module

The “Mapping words to Cyc KB” module takes dictionary of words, derived by the “Syntactic analysis” module, as an input. This module finds an appropriate Cyc concept for each word in the dictionary, and assigns word’s weight and syntactic dependency associations to Cyc concept. It starts by mapping each word to the corresponding Cyc concept (1). Next, it assigns word’s weight to Cyc concept (2). Then it maps the syntactic dependency head to the appropriate Cyc concept. Finally, it assigns the syntactic dependency association and its weight to the Cyc concept (3). Table 1 provides the description of Cyc commands used to implement each step.

Table 1. Description of Cyc commands used by “Mapping words to Cyc KB” module.

This module communicates with Cyc development platform and updates weight and syntactic dependency relations of Cyc concepts. The output of the module are mapped Cyc concepts with assigned weights and syntactic dependency relations. The mapped Cyc concepts serve as an input for “Concepts propagation” module. “Syntactic analysis” and “Mapping words to Cyc KB” modules together constitute the knowledge acquisition step of the summarization process.

5.3 “Concepts Propagation” Module

The “Concepts propagation” module takes Cyc concepts, mapped by “Mapping words to Cyc KB” module, as an input and finds their closest ancestor concepts. This module performs generalization and abstraction of new concepts that have not been mentioned in the text explicitly. It starts by querying Cyc knowledge base for all the concepts that have assigned weight (1). Then it finds an ancestor concept for each concept derived by the query (2). Next, it records the number of ancestor’s descendant concepts and their weight (3). Finally, it assigns ancestor-descendant relation between ancestor and descendant concepts (4). Table 2 provides the description of Cyc commands used to implement each step.

Table 2. Description of Cyc commands used by “Concepts propagation” module.

This module communicates with Cyc development platform to derive all mapped Cyc concepts, find closest ancestor concepts and update ancestor concepts’ relations. The output of the module are ancestor Cyc concepts with assigned descendant concepts’ weights and counts and ancestor-descendant relations. The ancestor Cyc concepts are used by “Concepts’ weights and relations accumulation” module.

5.4 “Concepts’ Weights and Relations Accumulation” Module

The “Concepts’ weights and relations accumulation” module takes ancestor Cyc concepts as an input and adds descendants’ accumulated weight and relations to ancestor concepts if the calculated descendant-ratio is higher than the threshold. The descendant-ratio is the number of mapped descendant concepts divided by the number of all descendant concepts of an ancestor concept. This module starts by querying Cyc knowledge base for all ancestor concepts (1). Then it calculates the descendant ratio for each ancestor concept (2.1, 2.2). Next, it adds propagated descendants’ weight (3) and descendants’ associations with their propagated weights (4) to ancestor concepts if the descendant-ratio is higher than the defined threshold. Table 3 provides the description of Cyc commands used to implement each step.

Table 3. Description of Cyc commands used by “Concepts’ weights and relations accumulation” module.

This module communicates with Cyc development platform to derive all ancestor Cyc concepts, find the number of ancestor’s mapped descendants, find the number of all ancestor’s descendants and update ancestor’s weight and relations. The output of the module are the Cyc concepts with updated weights and syntactic dependency associations. Updated Cyc concepts are used by the “Topics derivation” and the “Subjects identification” modules.

5.5 “Topics Derivation” Module

The “Topics derivation” module takes updated Cyc concepts as an input and derives defining micro theory for each concept. Micro theories with the highest weights represent the main topics of the document. This module first derives defining micro theory for each Cyc concept that have assigned weight (1). Then it counts the weights of derived micro theories based on their frequencies and picks up top-n with the highest weights. Table 4 provides the description of Cyc command used to implement defining micro theory derivation.

Table 4. Description of Cyc command used by “Topics derivation” module

This module communicates with Cyc development platform to derive defining micro theory for each mapped Cyc concept. Calculation of the derived micro theories’ weights is handled outside of the Cyc development platform. The output of the module is the micro theories dictionary that contains top-n micro theories with highest weights. This dictionary serves as an input for the “Subjects identification” module. The “Concepts propagation”, the “Concepts’ weights and relations accumulation” and the “Topics derivation” modules together constitute knowledge discovery step of the summarization process.

5.6 “Subjects Identification” Module

The “Subjects identification” module uses updated Cyc concepts and the dictionary of top-n micro theories as an input to derive most informative subject concepts based on a subjectivity rank. Subjectivity ranks is the product of the concept’s weight and the concept’s subjectivity ratio. Subjectivity ratio is the number of concept’s syntactic dependency associations labelled as “subject” relations divided by the total number of concept’s syntactic dependency associations. Subjectivity rank allows identifying concepts with the strongest subject roles in the documents. The module start by querying Cyc knowledge base for all mapped Cyc concepts for each micro theory in top-n micro theories dictionary (1). Then it calculates subjectivity ratio and subjectivity rank for each derived Cyc concept (2.1, 2.2). Finally, it picks top-n subject concepts with the highest subjectivity rank. Table 5 provides the description of Cyc commands used to implement each step.

Table 5. Description of Cyc commands used by “Subjects identification” module.

This module communicates with Cyc development platform to derive mapped Cyc concepts for each defining micro theory in the input dictionary and to find the number of the concept’s syntactic dependency associations labelled as “subject” relation and the number of all syntactic dependency associations of the concept. Calculations of the subjectivity ratio and the subjectivity rank are handled outside of the Cyc development platform. The output of the module is the dictionary that contains top-n subjects with the highest subjectivity rank. This dictionary serves as an input for the “New sentence generation” module.

5.7 “New Sentences Generation” Module

The “New sentences generation” module takes the dictionary of top-n most informative subjects as an input and produces new sentences for each of the subject to form a summary of the input documents. The module starts by deriving a natural language representation of each subject Cyc concept in the dictionary (1). Then it picks the adjective Cyc concept modifier with the highest subject-adjective syntactic dependency association weight (2) and derives its natural language representation. Next, it picks top-n predicate Cyc concepts with the highest subject-predicate syntactic dependency association weights (3) and derives their natural language representations. Then it picks the adverb Cyc concept modifier with the highest predicate-adverb syntactic dependency association weight (4) and derives its natural language representation. Next, it picks top-n object Cyc concepts with the highest product of subject-object and predicate-object syntactic dependency association weights (5.1, 5.2) and derives their natural language representations. Then, it picks the adjective Cyc concept modifier with the highest object-adjective syntactic dependency association weight and derives its natural language representation. Finally, it composes the new sentence using subject, subject-adjective, predicate, predicate-adverb, object and object-adjective natural language representations. Table 6 provides the description of Cyc commands used to implement each step.

Table 6. Description of Cyc commands used by “New sentence generation” module.

This module communicates with Cyc development platform to derive appropriate Cyc concepts for each sentence element based on the weights of their syntactic dependency associations and derive their natural language representation. New sentences are composed outside of the Cyc development platform and serve as an output for the module and the whole summarization system. The “Subjects identification” and the “New sentences generation” modules together constitute the knowledge representation step of the summarization process.

6 Testing and Results

We have tested our system on various encyclopedia articles describing concepts from different domain. First, we conducted an experiment using multiple articles about grapefruits. In this experiment, we increased the number of analyzed articles on each run of the system, starting with a single article. Figure 5 illustrates new sentences created by the system. These results show the progression of sentence structure from simple subject-predicate-object triplet to more complex structure enhanced by the adjective and adverb modifiers when more articles were processed by the system.

Fig. 5.
figure 5

Test results of new sentences created for multiple articles about grapefruit; (a) – single article, (b) – two articles, (c) – three articles.

Next, we applied our system on five encyclopedia articles describing different types of felines, including cats, tigers, cougars, jaguars and lions. Figure 6 shows main topics and concepts extracted from the text and newly created sentences.

Fig. 6.
figure 6

Test results of new sentences, concepts and main topics for encyclopedia articles about felines.

These results show that the system is able to abstract new concepts and create new sentences that contain information synthesized from different parts of the documents. Concepts like “canis”, “mammal meat” and “felis” were derived by the generalization process and were not explicitly mentioned in the original documents. Our system yields better results compared to the reported in [19]. New sentences created by the system have structure that is more complex and contain information fused from various parts of the text. More testing results are reported in [1].

7 Conclusions and Future Work

In this paper, we described an implementation of the knowledge based automatic summarization system that creates an abstractive summary of the text. This task is still challenging for machines, because in order to create such summary, the information from the input text has to be aggregated and synthesized, drawing knowledge that is more general. This is not feasible without using the semantics and having domain knowledge. To have such capabilities, our implemented system uses Cyc knowledge base and its reasoning engine. Utilizing semantic features and syntactic structure of the text shows great potential in creating abstractive summaries. We have implemented and tested our proposed system. The results show that the system is able to abstract new concepts not mentioned in the text, identify main topics and create new sentences using information from different parts of the text.

We outline several directions for the future improvements of the system. The first direction is to improve the domain knowledge representation, since the semantic knowledge and reasoning are only limited by Cyc knowledge base. Ideally, the system would be able to use the whole World Wide Web as a domain knowledge, but this possesses challenges like information inconsistency and sense disambiguation. The second direction is to improve the structure of the created sentences. We use subject-predicate-object triplets extended by adjective and adverb modifiers. Such structure can be improved by using more advanced syntactic representation of the sentence, e.g. graph representation. Finally, some of the created sentences are not conceptually connected to each other. Analyzing the relations between concepts on the document level will help in creating sentences that will be linked to each other conceptually.