Abstract
Faced with the overwhelming amounts of data in the 24/7 stream of new articles appearing online, it is often helpful to consider only the key entities and concepts and their relationships. This is challenging, as relevant connections may be spread across a number of disparate articles and sources. In this paper, we present a system that extracts salient entities, concepts, and their relationships from a set of related documents, discovers connections within and across them, and presents the resulting information in a graph-based visualization. We rely on a series of natural language processing methods, including open-domain information extraction, a special filtering method to maintain only meaningful relationships, and a heuristic to form graphs with a high coverage rate of topic entities and concepts. Our graph visualization then allows users to explore these connections. In our experiments, we rely on a large collection of news crawled from the Web and show how connections within this data can be explored. Code related to this paper is available at: https://shengyp.github.io/vmse.
1 Introduction
In today’s interconnected world, there is an endless 24/7 stream of new articles appearing online, including news reports, business transactions, digital media, etc. Faced with these overwhelming amounts of information, it is helpful to consider only the key entities and concepts and their relationships. Often, these are spread across a number of disparate articles and sources. Not only do different outlets often cover different aspects of a story, but new information typically becomes available only over time, so new articles in a developing story need to be connected to previous ones, or to historic documents providing relevant background information.
In this paper, we present a system (see footnote 1) that extracts salient entities, concepts, and their relationships from a set of related documents, discovers connections within and across them, and presents the resulting information in a graph-based visualization. Such a system is useful for anyone wishing to drill down into datasets and explore relationships, e.g. analysts and journalists. We rely on a series of natural language processing methods, including open-domain information extraction and coreference resolution, to achieve this while accounting for linguistic phenomena. While previous work on open information extraction has extracted large numbers of subject-predicate-object triples, our method attempts to maintain only those that are most likely to correspond to meaningful relationships. Applying our method within and across multiple documents, we obtain a large conceptual graph. The resulting graph can be filtered such that only the most salient connections are maintained. Our graph visualization then allows users to explore these connections. We show how groups of documents can be selected and showcase interesting new connections that can be explored using our system.
2 Approach and Implementation
2.1 Fact Extraction
The initial phase of extracting facts proceeds as follows:
Document Ranking. The system first selects the words appearing in the document collection with sufficiently high frequency as topic words, and computes standard TF-IDF weights for each word. The topic words are used to induce document representations. Documents under the same topic are ranked according to the TF-IDF weights of the topic words in each document. The user can pick such topics, and by default, the top-k documents for every topic are selected for further processing.
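The ranking step can be illustrated with a minimal sketch. It assumes pre-tokenized documents, a hypothetical frequency threshold `min_freq` for topic words, and that a document's rank is determined by the sum of the TF-IDF weights of its topic words; the text does not specify how the weights are aggregated, so the summation is an assumption.

```python
import math
from collections import Counter

def rank_documents(docs, min_freq=2, top_k=2):
    """Return indices of the top_k documents by summed topic-word TF-IDF.

    docs is a list of token lists; min_freq and top_k are illustrative
    parameters, not values taken from the paper.
    """
    # Topic words: words whose collection-wide frequency reaches min_freq.
    collection_counts = Counter(w for doc in docs for w in doc)
    topic_words = {w for w, c in collection_counts.items() if c >= min_freq}

    # Document frequency of each word, needed for the IDF factor.
    n_docs = len(docs)
    doc_freq = Counter(w for doc in docs for w in set(doc))

    scored = []
    for idx, doc in enumerate(docs):
        tf = Counter(doc)
        # Sum of TF-IDF weights over the topic words present in this document.
        score = sum(
            (tf[w] / len(doc)) * math.log(n_docs / doc_freq[w])
            for w in topic_words if w in tf
        )
        scored.append((score, idx))

    # Highest score first; ties keep the original document order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [idx for _, idx in scored[:top_k]]
```

For example, calling `rank_documents(docs, min_freq=2, top_k=2)` on a small tokenized collection returns the positions of the two documents in which the topic words carry the most weight.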
Coreference Resolution. Pronouns such as “she” are ubiquitous in language, so entity names are often not repeated explicitly when new facts are expressed in a text. To interpret such textual data appropriately, it is therefore necessary to resolve pronouns, for which we rely on the Stanford CoreNLP system [3].
Open-Domain Knowledge Extraction. Different sentences within an article tend to exhibit a high variance with regard to their degree of relevance and contribution towards the core ideas expressed in the article. While some express key notions, others may serve as mere embellishments or anecdotes. Large entity network graphs with countless insignificant edges can be overwhelming for end users. To address this, our system computes document-specific TextRank importance scores for all sentences within a document. It then considers only those sentences with sufficiently high scores. From these, it extracts fact candidates as subject-predicate-object triples. Rather than just focusing on named entities (e.g., “Billionaire Donald Trump”), as some previous approaches do, our system supports an unbounded range of noun phrase concepts (e.g., “the snow storm on the East Coast”) and relationships with explicit relation labels (e.g., “became mayor of”). The latter are extracted from verb phrases as well as from other constructions. For this, we adopt an open information extraction approach, in which the subject, predicate, and object are natural language phrases extracted from the sentence. These often correspond to syntactic subject, predicate, object, respectively.
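The sentence-scoring step can be sketched as follows. This is a simplified TextRank-style power iteration over a sentence graph, assuming pre-tokenized sentences and the length-normalized word-overlap similarity commonly used with TextRank; the damping factor, iteration count, and threshold are illustrative assumptions, and the system's actual configuration is not stated in the text.

```python
import math

def textrank_sentences(sentences, damping=0.85, iterations=50):
    """Score token-list sentences with a TextRank-style power iteration."""
    n = len(sentences)

    def overlap(a, b):
        # Shared words, normalized by the log-lengths of both sentences.
        if len(a) < 2 or len(b) < 2:
            return 0.0
        shared = len(set(a) & set(b))
        return shared / (math.log(len(a)) + math.log(len(b)))

    # Symmetric edge weights between distinct sentences.
    weights = [
        [overlap(sentences[i], sentences[j]) if i != j else 0.0 for j in range(n)]
        for i in range(n)
    ]
    out_sum = [sum(row) for row in weights]

    scores = [1.0] * n
    for _ in range(iterations):
        # Each sentence receives score from its neighbours, proportional
        # to the share of the neighbour's total edge weight.
        scores = [
            (1 - damping) + damping * sum(
                weights[j][i] / out_sum[j] * scores[j]
                for j in range(n) if out_sum[j] > 0
            )
            for i in range(n)
        ]
    return scores

def select_salient(sentences, threshold):
    """Keep only sentences whose score reaches the (assumed) threshold."""
    scores = textrank_sentences(sentences)
    return [s for s, score in zip(sentences, scores) if score >= threshold]
```

Only the sentences returned by `select_salient` would then be passed on to triple extraction; a sentence that shares no words with any other receives the minimum score of `1 - damping`.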
2.2 Fact Filtering
The filtering algorithm aims at hiding less representative facts in the visualization, seeking to retain only the most salient, confident, and compatible facts. This is achieved by optimizing for a high degree of coherence between facts with high confidence. The joint optimization problem can be solved via integer linear programming, as follows:
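The variable definitions in this section suggest a program of roughly the following shape. It is a sketch assembled from those definitions, assuming a maximization objective and binary \(x_{k}\) and \(y_{i}\), and is not necessarily the authors' exact statement:

\[
\begin{aligned}
\max_{\mathbf x, \mathbf y} \quad & \sum_{k=1}^{N} \alpha_{k} x_{k} + \sum_{i=1}^{M} \beta_{i} y_{i} \\
\text{s.t.} \quad & \sum_{i=1}^{M} y_{i} \le n_{\max}, \\
& x_{k} \le y_{i}, \quad x_{k} \le y_{j} \quad \text{for the pair } (t_{i}, t_{j}) \text{ indexed by } k, \\
& x_{k} \in \{0, 1\}, \quad y_{i} \in \{0, 1\}.
\end{aligned}
\]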
Here, \(\mathbf x \in \mathbb {R}^{N}\), \(\mathbf y \in \mathbb {R}^{M}\) with \(N = (M + 1)(M - 2)/2 + 1\). The \(y_{i}\) are indicator variables for facts \(t_i\): If \(y_i\) is true, \(t_{i}\) is selected to be retained. \(x_{k}\) represents the compatibility between two facts \(t_{i}, t_{j} \in T\) (\(i, j \le M\), \(i \ne j\)), where \(T = \{t_{1}, \dots , t_{M}\}\) is a set of fact triples containing M elements. \(\beta _{i}\) denotes the confidence of a fact, and \(n_{\max }\) is the number of representative facts desired by the user. \(\alpha _{k}\) is weighted by similarity scores \(sim(t_{i}, t_{j})\) between two facts \(t_{i}, t_{j}\), defined as \(\alpha _{k} = sim(t_{i}, t_{j}) = \gamma \cdot s_{k} + (1-\gamma ) \cdot l_{k}\). Here, \(s_{k}\), \(l_{k}\) denote the semantic similarity and literal similarity scores between the facts, respectively. We compute \(s_{k}\) using the Align, Disambiguate and Walk algorithm (see footnote 2), while \(l_{k}\) are computed using the Jaccard index. \(\gamma =0.8\) denotes the relative degree to which the semantic similarity contributes to the overall similarity score, as opposed to the literal similarity. The constraints guarantee that the number of results is not larger than \(n_{\max }\). If \(x_k\) is true, the two connected facts \(t_i, t_j\) should be selected, which entails \(y_i=1\), \(y_j=1\).
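For small fact sets, the selection problem described by these definitions can be illustrated by exhaustive search instead of an ILP solver. The sketch below assumes the objective rewards both fact confidences \(\beta_i\) and pairwise similarities \(\alpha_k\), and that the similarities are non-negative, so that setting \(x_k = 1\) whenever both of its facts are selected is optimal. The `semantic_sim` argument is a hypothetical callable standing in for the Align, Disambiguate and Walk scores; the word-level Jaccard index is one possible reading of the literal similarity.

```python
from itertools import combinations

def jaccard(fact_a, fact_b):
    """Literal similarity: Jaccard index over the lower-cased words of two triples."""
    words_a = set(" ".join(fact_a).lower().split())
    words_b = set(" ".join(fact_b).lower().split())
    union = words_a | words_b
    return len(words_a & words_b) / len(union) if union else 0.0

def select_facts(facts, confidences, semantic_sim, n_max, gamma=0.8):
    """Pick at most n_max facts maximizing confidence plus pairwise similarity.

    Exhaustive search, so only suitable for a handful of facts; an ILP
    solver would replace this loop in practice.
    """
    m = len(facts)
    # alpha[(i, j)] = gamma * semantic + (1 - gamma) * literal similarity.
    alpha = {
        (i, j): gamma * semantic_sim(facts[i], facts[j])
        + (1 - gamma) * jaccard(facts[i], facts[j])
        for i, j in combinations(range(m), 2)
    }
    best_score, best_subset = 0.0, ()
    for size in range(1, min(n_max, m) + 1):
        for subset in combinations(range(m), size):
            score = sum(confidences[i] for i in subset)
            # Every pair inside the subset contributes its similarity (x_k = 1).
            score += sum(alpha[pair] for pair in combinations(subset, 2))
            if score > best_score:
                best_score, best_subset = score, subset
    return list(best_subset), best_score
```

With \(\gamma = 0.8\), two facts that are semantically close can outrank a more confident but unrelated fact, which is the coherence effect the filtering step aims for.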
2.3 Conceptual Graph Construction
In order to establish a single connected graph that is more consistent, our system provides an interactive user interface in which expert annotators can merge entities and concepts stemming from the fact filtering process whose labels have equivalent meanings. Annotators can spot obvious lexical cues in entities or concepts, e.g., Billionaire Donald Trump, Donald Trump, Donald John Trump, Trump, etc. all refer to the same person. For named entities, they can draw on the entity linking capabilities of a search engine to decide on coreference. To support the annotators, the Align, Disambiguate and Walk tool (see footnote 2) is once again used to compute semantic similarity between concepts for coreference. After that, on average, no more than five subgraphs remain to be connected for each topic. Hence, users were able to add up to three synthetic relations with freely defined labels to connect these subgraphs into a single connected graph.
The recommended [1] maximum size of a concept graph is 25 concepts, which we use as a constraint. In our evaluation metrics, the coverage rate is the number of topic entities and concepts that are marked as correct divided by the total number of all entities and concepts in the graph. We trained a binary classifier on the high-frequency topic words extracted from different topics to identify the important topic entities and concepts in the set of all potential concepts. We used common features, including frequency, length, language pattern, whether the concept is a named entity, whether it appears in an automatic summary [2], and the ratio of synonyms, with random forests as the model. At inference time for topic concepts, we use the classifier’s confidence for a positive classification as the score. We rely on a heuristic to find a full graph that is connected and satisfies the size limit of 25 concepts: We iteratively remove the weakest concepts, i.e., those with the lowest scores, until only one connected component of 25 entities and concepts or fewer remains, which is used as the final conceptual graph. This approach guarantees that the graph is connected with a high coverage rate of topic concepts, but might not find the subset of concepts that has the highest total importance score. A concrete example is illustrated in Fig. 1.
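The pruning heuristic and the coverage rate can be sketched as follows. This is one literal reading of the description, assuming an undirected graph given as an edge list and a dictionary of classifier scores per concept; the function names and the tie-breaking rule are illustrative assumptions.

```python
from collections import deque

def connected_components(nodes, edges):
    """Return the connected components of the graph induced by nodes."""
    adjacency = {n: set() for n in nodes}
    for u, v in edges:
        if u in adjacency and v in adjacency:
            adjacency[u].add(v)
            adjacency[v].add(u)
    seen, components = set(), []
    for start in nodes:
        if start in seen:
            continue
        # Breadth-first search from an unvisited node.
        component, queue = set(), deque([start])
        seen.add(start)
        while queue:
            node = queue.popleft()
            component.add(node)
            for neighbour in adjacency[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    queue.append(neighbour)
        components.append(component)
    return components

def prune_graph(scores, edges, max_size=25):
    """Remove the lowest-scoring concept until a single connected
    component with at most max_size concepts remains."""
    nodes = set(scores)
    while nodes:
        components = connected_components(nodes, edges)
        if len(components) == 1 and len(nodes) <= max_size:
            break
        # Ties are broken by label so the result is deterministic.
        weakest = min(nodes, key=lambda n: (scores[n], n))
        nodes.remove(weakest)
    kept_edges = [(u, v) for u, v in edges if u in nodes and v in nodes]
    return nodes, kept_edges

def coverage_rate(graph_concepts, correct_topic_concepts):
    """Share of graph concepts that are topic concepts marked as correct."""
    if not graph_concepts:
        return 0.0
    return len(set(graph_concepts) & set(correct_topic_concepts)) / len(graph_concepts)
```

Because removal is greedy, deleting a low-scoring concept that bridges two regions can split the graph and trigger further removals, which is consistent with the caveat that the heuristic may miss the subset with the highest total score.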
Notes
- 1. A video presenting the system is available at https://shengyp.github.io/vmse.
- 2.
References
1. Banko, M., Cafarella, M.J., Soderland, S., Broadhead, M., Etzioni, O.: Open information extraction from the web. In: IJCAI, pp. 2670–2676 (2007)
2. Li, J., Li, L., Li, T.: Multi-document summarization via submodularity. Appl. Intell. 37(3), 420–430 (2012)
3. Manning, C., Surdeanu, M., Bauer, J., Finkel, J., Bethard, S., McClosky, D.: The Stanford CoreNLP natural language processing toolkit. In: ACL System Demonstrations, pp. 55–60 (2014)
Acknowledgments
This paper was partially supported by the National Natural Science Foundation of China (Nos. 61572111 and 61876034) and a Fundamental Research Fund for the Central Universities of China (No. ZYGX2016Z003).
© 2019 Springer Nature Switzerland AG
About this paper
Cite this paper
Sheng, Y. et al. (2019). Visualizing Multi-document Semantics via Open Domain Information Extraction. In: Brefeld, U., et al. Machine Learning and Knowledge Discovery in Databases. ECML PKDD 2018. Lecture Notes in Computer Science(), vol 11053. Springer, Cham. https://doi.org/10.1007/978-3-030-10997-4_54
DOI: https://doi.org/10.1007/978-3-030-10997-4_54
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-10996-7
Online ISBN: 978-3-030-10997-4
eBook Packages: Computer Science (R0)