
1 Introduction

In today’s interconnected world, there is an endless 24/7 stream of new articles appearing online, including news reports, business transactions, digital media, etc. Faced with this overwhelming amount of information, it is helpful to consider only the key entities and concepts and their relationships. Often, these are spread across a number of disparate articles and sources. Not only do different outlets often cover different aspects of a story, but new information typically becomes available only over time, so new articles in a developing story need to be connected to previous ones, or to historic documents providing relevant background information.

In this paper, we present a system (see Footnote 1) that extracts salient entities, concepts, and their relationships from a set of related documents, discovers connections within and across them, and presents the resulting information in a graph-based visualization. Such a system is useful for anyone wishing to drill down into datasets and explore relationships, e.g., analysts and journalists. To achieve this, we rely on a series of natural language processing methods, including open-domain information extraction and coreference resolution, so as to account for linguistic phenomena such as pronominal references. While previous work on open information extraction has extracted large numbers of subject-predicate-object triples, our method retains only those that are most likely to correspond to meaningful relationships. Applying our method within and across multiple documents, we obtain a large conceptual graph. The resulting graph can be filtered such that only the most salient connections are maintained. Our graph visualization then allows users to explore these connections. We show how groups of documents can be selected and showcase interesting new connections that can be explored using our system.

2 Approach and Implementation

2.1 Fact Extraction

The initial phase of extracting facts proceeds as follows:

Document Ranking. The system first selects the words that appear in the document collection with sufficiently high frequency as topic words and computes standard TF-IDF weights for each of them. The topic words are used to induce document representations. Documents under the same topic are ranked according to the TF-IDF weights of the topic words in each document. The user can pick such topics, and by default, the top-k documents for every topic are selected for further processing.
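As an illustration, the following is a minimal sketch of this ranking step using scikit-learn (an assumed library choice; rank_documents and top_k are illustrative names, not part of the system's actual code). It restricts the TF-IDF vocabulary to the topic words and ranks documents by their summed topic-word weights.

```python
# A minimal sketch of the document-ranking step using scikit-learn (an assumed
# library choice; rank_documents and top_k are illustrative names).
from sklearn.feature_extraction.text import TfidfVectorizer

def rank_documents(documents, topic_words, top_k=10):
    """Rank the documents of one topic by the summed TF-IDF weight of its topic words."""
    vectorizer = TfidfVectorizer(vocabulary=sorted(set(topic_words)))
    tfidf = vectorizer.fit_transform(documents)   # shape: (n_documents, n_topic_words)
    scores = tfidf.sum(axis=1).A1                 # one aggregate topic-word score per document
    order = sorted(range(len(documents)), key=lambda i: scores[i], reverse=True)
    return order[:top_k]
```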

Coreference Resolution. Pronouns such as “she” are ubiquitous in language, and entity names are thus often not explicitly repeated when new facts are expressed in a text. To nevertheless interpret such textual data appropriately, it is necessary to resolve pronouns, for which we rely on the Stanford CoreNLP system [3].
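A minimal sketch of how coreference resolution with Stanford CoreNLP might be invoked is given below, here via stanza's CoreNLPClient wrapper (an assumed setup; the example text and the printing of mention chains are purely illustrative).

```python
# A minimal sketch of Stanford CoreNLP coreference resolution through stanza's
# CoreNLPClient wrapper (an assumed setup, shown for illustration only).
from stanza.server import CoreNLPClient

text = "Angela Merkel visited Paris. She met the French president."

with CoreNLPClient(annotators=["tokenize", "ssplit", "pos", "lemma", "ner", "parse", "coref"],
                   timeout=60000, memory="4G") as client:
    ann = client.annotate(text)
    # Each coreference chain groups mentions referring to the same entity, e.g.
    # ["Angela Merkel", "She"]; downstream, pronouns can be replaced by a full mention.
    for chain in ann.corefChain:
        mentions = []
        for m in chain.mention:
            tokens = ann.sentence[m.sentenceIndex].token[m.beginIndex:m.endIndex]
            mentions.append(" ".join(t.word for t in tokens))
        print(mentions)
```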

Open-Domain Knowledge Extraction. Different sentences within an article vary considerably in their relevance and contribution to the core ideas expressed in the article. While some express key notions, others may serve as mere embellishments or anecdotes. Large entity network graphs with countless insignificant edges can be overwhelming for end users. To address this, our system computes document-specific TextRank importance scores for all sentences within a document and considers only those sentences with sufficiently high scores. From these, it extracts fact candidates as subject-predicate-object triples. Rather than focusing only on named entities (e.g., “Billionaire Donald Trump”), as some previous approaches do, our system supports an unbounded range of noun phrase concepts (e.g., “the snow storm on the East Coast”) and relationships with explicit relation labels (e.g., “became mayor of”). The latter are extracted from verb phrases as well as from other constructions. For this, we adopt an open information extraction approach, in which the subject, predicate, and object are natural language phrases extracted from the sentence; these often correspond to the syntactic subject, predicate, and object, respectively.
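One way to obtain such sentence-level importance scores is a TextRank-style PageRank over a sentence-similarity graph. The sketch below uses networkx and a simple word-overlap similarity (both are assumptions made for illustration; the system's exact scoring function and the downstream open information extraction step are not shown here).

```python
# A TextRank-style sentence scoring sketch using networkx and a simple word-overlap
# similarity (illustrative assumptions, not the system's exact implementation).
import itertools
import networkx as nx

def textrank_sentence_scores(sentences, threshold=0.1):
    """sentences: list of token lists; returns {sentence index: importance score}."""
    def overlap(a, b):
        a, b = set(a), set(b)
        return len(a & b) / (1 + len(a | b))  # normalized word overlap

    g = nx.Graph()
    g.add_nodes_from(range(len(sentences)))
    for i, j in itertools.combinations(range(len(sentences)), 2):
        sim = overlap(sentences[i], sentences[j])
        if sim > threshold:
            g.add_edge(i, j, weight=sim)

    # PageRank over the sentence-similarity graph; only high-scoring sentences are
    # passed on to triple extraction.
    return nx.pagerank(g, weight="weight")
```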

2.2 Fact Filtering

The filtering algorithm aims at hiding less representative facts in the visualization, seeking to retain only the most salient, confident, and compatible facts. This is achieved by jointly optimizing for a high degree of coherence among the retained facts and for high fact confidence. The joint optimization problem can be solved via integer linear programming, as follows:

$$\begin{aligned} \max \limits _{\varvec{x}, \varvec{y}}~~~~&\varvec{\alpha }^\intercal \varvec{x} + \varvec{\beta }^\intercal \varvec{y}\end{aligned}$$
(1)
$$\begin{aligned} \text{ s.t. }~~~~&\varvec{1}^\intercal \varvec{y} \le n_{\max }\end{aligned}$$
(2)
$$\begin{aligned}&x_{k} \le \min \{y_{i}, y_{j}\} \quad \forall \, i < j,\ i,j \in \{1,\dots , M\},\ k = (2M - i)(i - 1)/2 + j - i\end{aligned}$$
(3)
$$\begin{aligned}&x_{k}, y_{i} \in \{0, 1\} \quad \forall \, i \in \{1,\dots , M\},\ \forall \, k\end{aligned}$$
(4)

Here, \(\mathbf x \in \mathbb {R}^{N}\), \(\mathbf y \in \mathbb {R}^{M}\) with \(N = (M + 1)(M - 2)/2 + 1 = M(M-1)/2\), i.e., one \(x_k\) per unordered pair of facts. The \(y_{i}\) are indicator variables for facts \(t_i\): if \(y_i = 1\), \(t_{i}\) is selected to be retained. \(x_{k}\) represents the compatibility between two facts \(t_{i}, t_{j} \in T\) (\(i, j \le M\), \(i \ne j\)), where \(T = \{t_{1}, \dots , t_{M}\}\) is the set of M fact triples. \(\beta _{i}\) denotes the confidence of fact \(t_i\), and \(n_{\max }\) is the number of representative facts desired by the user. \(\alpha _{k}\) is the similarity score \(sim(t_{i}, t_{j})\) between the two facts \(t_{i}, t_{j}\), defined as \(\alpha _{k} = sim(t_{i}, t_{j}) = \gamma \, s_{k} + (1-\gamma )\, l_{k}\). Here, \(s_{k}\) and \(l_{k}\) denote the semantic similarity and literal similarity scores between the facts, respectively. We compute \(s_{k}\) using the Align, Disambiguate and Walk algorithm (see Footnote 2), while \(l_{k}\) is computed using the Jaccard index. \(\gamma =0.8\) denotes the relative degree to which the semantic similarity contributes to the overall similarity score, as opposed to the literal similarity. The constraints guarantee that the number of retained facts does not exceed \(n_{\max }\) and that \(x_k\) can only be set to 1 if both connected facts \(t_i, t_j\) are selected, i.e., \(y_i = y_j = 1\).
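As an illustration, the following is a minimal sketch of this integer linear program using the PuLP library (an assumed solver choice; the function and variable names are ours). Per-fact confidences, pairwise similarity components, and the user-chosen limit are passed in as inputs.

```python
# A minimal sketch of the fact-filtering ILP using PuLP (an assumed solver choice,
# shown for illustration only).
import pulp

def filter_facts(confidences, similarities, n_max, gamma=0.8):
    """confidences: list of beta_i; similarities: dict {(i, j): (s_k, l_k)} with i < j."""
    M = len(confidences)
    prob = pulp.LpProblem("fact_filtering", pulp.LpMaximize)

    # y_i = 1 iff fact t_i is retained; x_{ij} = 1 iff the pair (t_i, t_j) counts.
    y = [pulp.LpVariable(f"y_{i}", cat="Binary") for i in range(M)]
    x = {p: pulp.LpVariable(f"x_{p[0]}_{p[1]}", cat="Binary") for p in similarities}

    # alpha_k = gamma * semantic similarity + (1 - gamma) * literal (Jaccard) similarity
    alpha = {p: gamma * s + (1 - gamma) * l for p, (s, l) in similarities.items()}

    # Objective (1): alpha^T x + beta^T y
    prob += pulp.lpSum(alpha[p] * x[p] for p in x) + pulp.lpSum(confidences[i] * y[i] for i in range(M))

    # Constraint (2): at most n_max facts are retained.
    prob += pulp.lpSum(y) <= n_max

    # Constraint (3): x_k <= min{y_i, y_j}, i.e. a pair counts only if both facts are kept.
    for (i, j), var in x.items():
        prob += var <= y[i]
        prob += var <= y[j]

    prob.solve(pulp.PULP_CBC_CMD(msg=False))
    return [i for i in range(M) if y[i].value() > 0.5]
```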

2.3 Conceptual Graph Construction

In order to establish a single, more consistent connected graph, our system provides an interactive user interface in which expert annotators can merge entities and concepts stemming from the fact filtering step whose labels have equivalent meanings. They can exploit obvious cues in the lexical structure of entity or concept labels, e.g., Billionaire Donald Trump, Donald Trump, Donald John Trump, Trump, etc. all refer to the same person. For named entities, they can additionally use the entity linking capabilities of a search engine to decide on coreference. To support the annotators, the Align, Disambiguate and Walk tool (see Footnote 2) is once again used to compute the semantic similarity between concepts as a coreference signal. After this step, on average, no more than 5 subgraphs per topic remain to be connected. Users can then add up to three synthetic relations with freely defined labels to connect these subgraphs into a single connected graph.
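The lexical cue mentioned above can be approximated programmatically. The sketch below is an assumed heuristic (not the annotators' full procedure) that proposes merge candidates whenever the token set of one label is contained in that of another.

```python
# A sketch of the lexical merging cue (an assumed heuristic, for illustration only):
# two labels become merge candidates whenever the tokens of one are a subset of the other's.
def merge_candidates(labels):
    token_sets = {label: set(label.lower().split()) for label in labels}
    proposals = []
    for i, a in enumerate(labels):
        for b in labels[i + 1:]:
            if token_sets[a] <= token_sets[b] or token_sets[b] <= token_sets[a]:
                proposals.append((a, b))
    return proposals

# e.g. merge_candidates(["Billionaire Donald Trump", "Donald Trump", "Trump"]) proposes
# all three pairs, so the three labels collapse into a single graph node.
```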

Fig. 1.

Example of the user interface: In the left panel, when the user selects the entity “Billionaire Donald Trump” within the set of representative facts extracted from the document topics, the system presents the pertinent entities, concepts, and relations associated with it via a graph-based visualization in the right panel, including “Hillary Clinton” as a prominently connected figure.

The recommended [1] maximum size of a concept graph is 25 concepts, which we use as a constraint. In our evaluation metrics, the coverage rate is the number of topic entities and concepts marked as correct divided by the total number of entities and concepts in the graph. To identify the important topic entities and concepts among all candidate concepts, we trained a binary classifier on the high-frequency topic words extracted from different topics. We used common features, including frequency, length, language pattern, whether the candidate is a named entity, whether it appears in an automatic summarization [2], and the ratio of synonyms, with random forests as the model. At inference time, we use the classifier’s confidence for a positive classification as the score of a topic concept. We rely on a heuristic to find a final graph that is connected and satisfies the size limit of 25 concepts: we iteratively remove the weakest concepts, i.e., those with the lowest scores, until only one connected component of at most 25 entities and concepts remains, which is used as the final conceptual graph. This approach guarantees that the graph is connected and has a high coverage rate of topic concepts, but it might not find the subset of concepts with the highest total importance score. A concrete example is illustrated in Fig. 1.
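The pruning heuristic can be sketched as follows using networkx (an assumed library choice; the actual implementation may differ). Node scores come from the classifier described above, and the 25-concept limit is used as the default.

```python
# A minimal sketch of the pruning heuristic using networkx (an assumed library choice).
# Each node carries a "score" attribute produced by the concept classifier described above.
import networkx as nx

def prune_conceptual_graph(graph: nx.Graph, max_size: int = 25) -> nx.Graph:
    g = graph.copy()
    while g.number_of_nodes() > 0:
        components = list(nx.connected_components(g))
        if len(components) == 1 and g.number_of_nodes() <= max_size:
            break  # a single connected component of at most max_size concepts remains
        # Remove the weakest remaining concept (lowest classifier score).
        weakest = min(g.nodes, key=lambda n: g.nodes[n].get("score", 0.0))
        g.remove_node(weakest)
    return g
```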