
1 Introduction

In today’s interconnected world, there is an endless 24/7 stream of new articles appearing online, including news reports, business transactions, digital media, etc. Faced with this overwhelming amount of information, it is helpful to consider only the key entities and concepts and their relationships. Often, these are spread across a number of disparate articles and sources. Not only do different outlets often cover different aspects of a story, but new information typically becomes available only over time, so new articles in a developing story need to be connected to previous ones, or to historic documents providing relevant background information.

In this paper, we present a system (see Footnote 1) that extracts salient entities, concepts, and their relationships from a set of related documents, discovers connections within and across them, and presents the resulting information in a graph-based visualization. Such a system is useful for anyone wishing to drill down into datasets and explore relationships, e.g., analysts and journalists. To achieve this, we rely on a series of natural language processing methods, including open-domain information extraction and coreference resolution, so as to account for linguistic phenomena such as pronominal references. While previous work on open information extraction has extracted large numbers of subject-predicate-object triples, our method retains only those that are most likely to correspond to meaningful relationships. Applying our method within and across multiple documents, we obtain a large conceptual graph. The resulting graph can be filtered such that only the most salient connections are maintained. Our graph visualization then allows users to explore these connections. We show how groups of documents can be selected and showcase interesting new connections that can be explored using our system.

2 Approach and Implementation

2.1 Fact Extraction

The initial phase of extracting facts proceeds as follows:

Document Ranking. The system first selects the words that appear in the document collection with sufficiently high frequency as topic words and computes standard TF-IDF weights for each of them. The topic words are used to induce document representations. Documents under the same topic are ranked according to the TF-IDF weights of the topic words in each document. The user can pick such topics, and by default, the top-k documents for every topic are selected for further processing.
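As an illustration, the following is a minimal sketch of this ranking step using scikit-learn (an assumed library choice; rank_documents and top_k are illustrative names, not part of the system's actual code). It restricts the TF-IDF vocabulary to the topic words and ranks documents by their summed topic-word weights.

```python
# A minimal sketch of the document-ranking step using scikit-learn (an assumed
# library choice; rank_documents and top_k are illustrative names).
from sklearn.feature_extraction.text import TfidfVectorizer

def rank_documents(documents, topic_words, top_k=10):
    """Rank the documents of one topic by the summed TF-IDF weight of its topic words."""
    vectorizer = TfidfVectorizer(vocabulary=sorted(set(topic_words)))
    tfidf = vectorizer.fit_transform(documents)   # shape: (n_documents, n_topic_words)
    scores = tfidf.sum(axis=1).A1                 # one aggregate topic-word score per document
    order = sorted(range(len(documents)), key=lambda i: scores[i], reverse=True)
    return order[:top_k]
```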

Coreference Resolution. Pronouns such as “she” are ubiquitous in language, and entity names are thus often not explicitly repeated when new facts are expressed in a text. To nevertheless interpret such textual data appropriately, it is necessary to resolve pronouns, for which we rely on the Stanford CoreNLP system [3].
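A minimal sketch of how coreference resolution with Stanford CoreNLP might be invoked is given below, here via stanza's CoreNLPClient wrapper (an assumed setup; the example text and the printing of mention chains are purely illustrative).

```python
# A minimal sketch of Stanford CoreNLP coreference resolution through stanza's
# CoreNLPClient wrapper (an assumed setup, shown for illustration only).
from stanza.server import CoreNLPClient

text = "Angela Merkel visited Paris. She met the French president."

with CoreNLPClient(annotators=["tokenize", "ssplit", "pos", "lemma", "ner", "parse", "coref"],
                   timeout=60000, memory="4G") as client:
    ann = client.annotate(text)
    # Each coreference chain groups mentions referring to the same entity, e.g.
    # ["Angela Merkel", "She"]; downstream, pronouns can be replaced by a full mention.
    for chain in ann.corefChain:
        mentions = []
        for m in chain.mention:
            tokens = ann.sentence[m.sentenceIndex].token[m.beginIndex:m.endIndex]
            mentions.append(" ".join(t.word for t in tokens))
        print(mentions)
```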

Open-Domain Knowledge Extraction. Different sentences within an article vary considerably in their relevance and contribution to the core ideas expressed in the article. While some express key notions, others may serve as mere embellishments or anecdotes. Large entity network graphs with countless insignificant edges can be overwhelming for end users. To address this, our system computes document-specific TextRank importance scores for all sentences within a document and considers only those sentences with sufficiently high scores. From these, it extracts fact candidates as subject-predicate-object triples. Rather than focusing only on named entities (e.g., “Billionaire Donald Trump”), as some previous approaches do, our system supports an unbounded range of noun phrase concepts (e.g., “the snow storm on the East Coast”) and relationships with explicit relation labels (e.g., “became mayor of”). The latter are extracted from verb phrases as well as from other constructions. For this, we adopt an open information extraction approach, in which the subject, predicate, and object are natural language phrases extracted from the sentence; these often correspond to the syntactic subject, predicate, and object, respectively.
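One way to obtain such sentence-level importance scores is a TextRank-style PageRank over a sentence-similarity graph. The sketch below uses networkx and a simple word-overlap similarity (both are assumptions made for illustration; the system's exact scoring function and the downstream open information extraction step are not shown here).

```python
# A TextRank-style sentence scoring sketch using networkx and a simple word-overlap
# similarity (illustrative assumptions, not the system's exact implementation).
import itertools
import networkx as nx

def textrank_sentence_scores(sentences, threshold=0.1):
    """sentences: list of token lists; returns {sentence index: importance score}."""
    def overlap(a, b):
        a, b = set(a), set(b)
        return len(a & b) / (1 + len(a | b))  # normalized word overlap

    g = nx.Graph()
    g.add_nodes_from(range(len(sentences)))
    for i, j in itertools.combinations(range(len(sentences)), 2):
        sim = overlap(sentences[i], sentences[j])
        if sim > threshold:
            g.add_edge(i, j, weight=sim)

    # PageRank over the sentence-similarity graph; only high-scoring sentences are
    # passed on to triple extraction.
    return nx.pagerank(g, weight="weight")
```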

2.2 Fact Filtering

The filtering algorithm aims at hiding less representative facts in the visualization, seeking to retain only the most salient, confident, and compatible facts. This is achieved by jointly optimizing for a high degree of coherence among the retained facts and for high fact confidence. The joint optimization problem can be solved via integer linear programming, as follows:

$$\begin{aligned} \max \limits _{\varvec{x}, \varvec{y}}~~~~&\varvec{\alpha }^\intercal \varvec{x} + \varvec{\beta }^\intercal \varvec{y}\end{aligned}$$
(1)
$$\begin{aligned} \text{ s.t. }~~~~&\varvec{1}^\intercal \varvec{y} \le n_{\max }\end{aligned}$$
(2)
$$\begin{aligned}&x_{k} \le \min \{y_{i}, y_{j}\} \quad \forall \, i < j,\ i,j \in \{1,\dots , M\},\ k = (2M - i)(i - 1)/2 + j - i\end{aligned}$$
(3)
$$\begin{aligned}&x_{k}, y_{i} \in \{0, 1\} \quad \forall \, i \in \{1,\dots , M\},\ \forall \, k\end{aligned}$$
(4)

Here, \(\mathbf x \in \mathbb {R}^{N}\), \(\mathbf y \in \mathbb {R}^{M}\) with \(N = (M + 1)(M - 2)/2 + 1 = M(M-1)/2\), i.e., one \(x_k\) per unordered pair of facts. The \(y_{i}\) are indicator variables for facts \(t_i\): if \(y_i = 1\), \(t_{i}\) is selected to be retained. \(x_{k}\) represents the compatibility between two facts \(t_{i}, t_{j} \in T\) (\(i, j \le M\), \(i \ne j\)), where \(T = \{t_{1}, \dots , t_{M}\}\) is the set of M fact triples. \(\beta _{i}\) denotes the confidence of fact \(t_i\), and \(n_{\max }\) is the number of representative facts desired by the user. \(\alpha _{k}\) is the similarity score \(sim(t_{i}, t_{j})\) between the two facts \(t_{i}, t_{j}\), defined as \(\alpha _{k} = sim(t_{i}, t_{j}) = \gamma \, s_{k} + (1-\gamma )\, l_{k}\). Here, \(s_{k}\) and \(l_{k}\) denote the semantic similarity and literal similarity scores between the facts, respectively. We compute \(s_{k}\) using the Align, Disambiguate and Walk algorithm (see Footnote 2), while \(l_{k}\) is computed using the Jaccard index. \(\gamma =0.8\) denotes the relative degree to which the semantic similarity contributes to the overall similarity score, as opposed to the literal similarity. The constraints guarantee that the number of retained facts does not exceed \(n_{\max }\) and that \(x_k\) can only be set to 1 if both connected facts \(t_i, t_j\) are selected, i.e., \(y_i = y_j = 1\).
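As an illustration, the following is a minimal sketch of this integer linear program using the PuLP library (an assumed solver choice; the function and variable names are ours). Per-fact confidences, pairwise similarity components, and the user-chosen limit are passed in as inputs.

```python
# A minimal sketch of the fact-filtering ILP using PuLP (an assumed solver choice,
# shown for illustration only).
import pulp

def filter_facts(confidences, similarities, n_max, gamma=0.8):
    """confidences: list of beta_i; similarities: dict {(i, j): (s_k, l_k)} with i < j."""
    M = len(confidences)
    prob = pulp.LpProblem("fact_filtering", pulp.LpMaximize)

    # y_i = 1 iff fact t_i is retained; x_{ij} = 1 iff the pair (t_i, t_j) counts.
    y = [pulp.LpVariable(f"y_{i}", cat="Binary") for i in range(M)]
    x = {p: pulp.LpVariable(f"x_{p[0]}_{p[1]}", cat="Binary") for p in similarities}

    # alpha_k = gamma * semantic similarity + (1 - gamma) * literal (Jaccard) similarity
    alpha = {p: gamma * s + (1 - gamma) * l for p, (s, l) in similarities.items()}

    # Objective (1): alpha^T x + beta^T y
    prob += pulp.lpSum(alpha[p] * x[p] for p in x) + pulp.lpSum(confidences[i] * y[i] for i in range(M))

    # Constraint (2): at most n_max facts are retained.
    prob += pulp.lpSum(y) <= n_max

    # Constraint (3): x_k <= min{y_i, y_j}, i.e. a pair counts only if both facts are kept.
    for (i, j), var in x.items():
        prob += var <= y[i]
        prob += var <= y[j]

    prob.solve(pulp.PULP_CBC_CMD(msg=False))
    return [i for i in range(M) if y[i].value() > 0.5]
```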

2.3 Conceptual Graph Construction

In order to establish a single, more consistent connected graph, our system provides an interactive user interface in which expert annotators can merge entities and concepts stemming from the fact filtering step whose labels have equivalent meanings. They can exploit obvious cues in the lexical structure of entity or concept labels, e.g., Billionaire Donald Trump, Donald Trump, Donald John Trump, Trump, etc. all refer to the same person. For named entities, they can additionally use the entity linking capabilities of a search engine to decide on coreference. To support the annotators, the Align, Disambiguate and Walk tool (see Footnote 2) is once again used to compute the semantic similarity between concepts as a coreference signal. After this step, on average, no more than 5 subgraphs per topic remain to be connected. Users can then add up to three synthetic relations with freely defined labels to connect these subgraphs into a single connected graph.
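The lexical cue mentioned above can be approximated programmatically. The sketch below is an assumed heuristic (not the annotators' full procedure) that proposes merge candidates whenever the token set of one label is contained in that of another.

```python
# A sketch of the lexical merging cue (an assumed heuristic, for illustration only):
# two labels become merge candidates whenever the tokens of one are a subset of the other's.
def merge_candidates(labels):
    token_sets = {label: set(label.lower().split()) for label in labels}
    proposals = []
    for i, a in enumerate(labels):
        for b in labels[i + 1:]:
            if token_sets[a] <= token_sets[b] or token_sets[b] <= token_sets[a]:
                proposals.append((a, b))
    return proposals

# e.g. merge_candidates(["Billionaire Donald Trump", "Donald Trump", "Trump"]) proposes
# all three pairs, so the three labels collapse into a single graph node.
```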

Fig. 1.

Example of the user interface: In the left panel, when the user selects the entity “Billionaire Donald Trump” within the set of representative facts extracted from the document topics, the system presents the pertinent entities, concepts, and relations associated with it via a graph-based visualization in the right panel, including “Hillary Clinton” as a prominently connected figure.

The recommended [1] maximum size of a concept graph is 25 concepts, which we use as a constraint. In our evaluation metrics, the coverage rate is the number of topic entities and concepts marked as correct divided by the total number of entities and concepts in the graph. To identify the important topic entities and concepts among all candidate concepts, we trained a binary classifier on the high-frequency topic words extracted from different topics. We used common features, including frequency, length, language pattern, whether the candidate is a named entity, whether it appears in an automatic summarization [2], and the ratio of synonyms, with random forests as the model. At inference time, we use the classifier’s confidence for a positive classification as the score of a topic concept. We rely on a heuristic to find a final graph that is connected and satisfies the size limit of 25 concepts: we iteratively remove the weakest concepts, i.e., those with the lowest scores, until only one connected component of at most 25 entities and concepts remains, which is used as the final conceptual graph. This approach guarantees that the graph is connected and has a high coverage rate of topic concepts, but it might not find the subset of concepts with the highest total importance score. A concrete example is illustrated in Fig. 1.
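The pruning heuristic can be sketched as follows using networkx (an assumed library choice; the actual implementation may differ). Node scores come from the classifier described above, and the 25-concept limit is used as the default.

```python
# A minimal sketch of the pruning heuristic using networkx (an assumed library choice).
# Each node carries a "score" attribute produced by the concept classifier described above.
import networkx as nx

def prune_conceptual_graph(graph: nx.Graph, max_size: int = 25) -> nx.Graph:
    g = graph.copy()
    while g.number_of_nodes() > 0:
        components = list(nx.connected_components(g))
        if len(components) == 1 and g.number_of_nodes() <= max_size:
            break  # a single connected component of at most max_size concepts remains
        # Remove the weakest remaining concept (lowest classifier score).
        weakest = min(g.nodes, key=lambda n: g.nodes[n].get("score", 0.0))
        g.remove_node(weakest)
    return g
```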