Multimodal news analytics using measures of cross-modal entity and context consistency

The World Wide Web has become a popular source for gathering information and news. Multimodal information, e.g., text supplemented with photographs, is typically used to convey news more effectively or to attract attention. The photographs can be decorative, depict additional details, but might also contain misleading information. The quantification of the cross-modal consistency of entity representations can assist human assessors in evaluating the overall multimodal message. In some cases, such measures might give hints to detect fake news, which is an increasingly important topic in today's society. In this paper, we present a multimodal approach to quantify the entity coherence between image and text in real-world news. Named entity linking is applied to extract persons, locations, and events from news texts. Several measures are suggested to calculate the cross-modal similarity of the entities in text and photograph by exploiting state-of-the-art computer vision approaches. In contrast to previous work, our system automatically acquires example data from the Web and is applicable to real-world news. Moreover, an approach that quantifies contextual image-text relations is introduced. The feasibility is demonstrated on two datasets that cover different languages, topics, and domains.


Introduction
With the widespread use and availability of digital environments, the World Wide Web plays an essential role in disseminating information and news. In particular, social media platforms such as Twitter allow users to follow worldwide events and have become a popular source of information [6,35,39]. These news articles often leverage different modalities, e.g., texts and images, to convey information more effectively (Fig. 1). Every modality conveys its specific information, and the combination of modalities enables the communication of a coherent multimodal message. In this regard, photograph content can range from decorative (with little or no information about the news event), over depicting rich information enhancements (important or additional details), to even misleading visual information.

[Fig. 1 caption (partially recovered): … [36] and corresponding texts with untampered and tampered entities. Bottom: Two real-world news from BreakingNews [33] and outputs of our system (LOCation, PERson, EVENT). The examples show that real-world news have much longer text and refer to many entities. Images are replaced with similar ones due to license restrictions. Original images and full text are linked on the GitHub page: https://github.com/TIBHannover/cross-modal_entity_consistency (Color figure online)]
According to Bateman [4], the consideration of multimodal relationships such as the semantic coherence and mutual concepts is crucial to understand and evaluate the overall message and meaning. With the rapidly growing amount of news available on the Web, it is becoming an increasingly important task to develop automated systems for information extraction in multimedia content in order to, e.g., evaluate the overall message, facilitate semantic search, or analyze the content with regard to credibility. Measures of cross-modal consistency might also support human assessors and expert-oriented fact-checking efforts such as PolitiFact 1 and Snopes 2 to identify misinformation or fake news.
While part of previous work [16,25,32,44,46] aims at finding measures to model semantic cross-modal relations in order to bridge the semantic gap, approaches on image repurposing detection [21,22,36] check the consistency of named entities mentioned in the text, as illustrated in Fig. 1. Our approach is similar to the task of image repurposing detection since it focuses on the evaluation of cross-modal entity occurrences between image and text. Related approaches [21,22,36] rely on multimodal deep learning techniques that require appropriate datasets of non-manipulated image-text pairs. However, these datasets are hard to collect as they need to be verified for valid cross-modal relations. Besides, the training or reference data provide the source of world knowledge and limit these methods to entities, e.g., persons or locations, that appear in these datasets. Experimental evaluations have been performed on images with short image captions [21,36] or existing metadata [22], which do not reflect real-world characteristics as illustrated in Fig. 1.
In this paper, we present an automatic system that quantifies the cross-modal consistency of entity relations. In contrast to previous work, the system is completely unsupervised and does not rely on any pre-defined reference or training data. To the best of our knowledge, we present a first baseline that is applicable to real-world news articles by tackling several news-specific challenges such as the excessive length of news documents, entity diversity, and misleading reference images. The workflow of our system is as follows: First, we automatically crawl reference images for entities extracted from the text by named entity linking. Then, these images serve as input for the visual verification of the entities against the associated news image. In this respect, appropriate computer vision approaches serve as generalized feature extractors. Unlike the more general model for scene/place classification used in [30], we utilize a novel ontology-driven deep learning approach [31] for event classification. In addition, measures for the individual entity types as well as for a more general news context are introduced to quantify the cross-modal similarity of image and text.
The applications are manifold, ranging from a retrieval system for news with low or high cross-modal semantic correlations to an exploration tool that reveals the relations between image and text as shown in Fig. 1. The feasibility of our approach is demonstrated on a novel large-scale dataset for cross-modal consistency verification that is derived from BreakingNews [33]. It contains real-world news articles in English and covers different topics and domains. In addition, we have collected articles from German news sites to verify the performance in another language. In contrast to previous work, the entities are manipulated with more sophisticated strategies to obtain challenging datasets. Web application, source code, and datasets are publicly available 3 .
The remainder of this paper is organized as follows. Section 2 discusses related work. The framework for the automatic verification of cross-modal entity relations as well as contextual relations between image and text is described in Sects. 3 and 4. Section 5 introduces two benchmark datasets and discusses the experimental results of the proposed approach for document verification and collection retrieval. Section 6 summarizes the paper and outlines potential areas of future work.

Related work
The analysis of multimodal information such as image and text has attracted researchers from linguistics, semiotics, and computational science for many years. Bateman [4] considers multimodal relationships to be crucial for the interpretation of the overall multimodal message. Linguists, semioticians, and communication scientists [3,13,27,28,40] attempted to assign joint placements of image and text to distinct image-text classes in order to define the interrelations using suitable taxonomies. However, only recently have a few works attempted to build computational models to quantify the cross-modal relations between image and text. A few approaches explore more general semantic correlations [16,25,32,44,46] to bridge the gap [38] between both modalities.
Also related to our approach are systems for image repurposing detection [21,22,36] that intend to reveal inconsistencies between image-text pairs with respect to entity representations (persons, locations, organizations, etc.), mainly to identify repurposed multimedia content that might indicate misinformation. In a more general sense, these kinds of approaches quantify the Cross-modal Mutual Information (CMI) [16,17]. Jaiswal et al. [21] assessed a given document's semantic integrity using reference packages, each containing an untampered image and a corresponding caption. Experiments were conducted by replacing one modality, which results in semantically inconsistent image-caption pairs, making them relatively easy to detect. This motivated Sabir et al. [36] to introduce a dataset where specific entities (persons, locations, and organizations) are carefully replaced to generate semantically consistent altered packages. They have also refined the multimodal model using a multitask learning approach that further incorporates geographical information. Jaiswal et al. [22] presented an adversarial neural network that simultaneously trains a bad actor who intentionally counterfeits metadata and a watchdog that verifies multimodal semantic consistency. The system was tested for person verification, location verification, and painter verification of artworks. However, the system is more closely related to approaches for metadata verification [7,8,23,26] as it only verifies the consistency between pairs of images and metadata and does not incorporate any textual information.
Overall, the aforementioned approaches neglect the various challenges of real-world news and applications in terms of the vast amount and variety of entities, incorrect or unrelated reference data, as well as the outputs of named entity linking tools. They instead rely on pre-defined reference datasets consisting of image-text pairs [21,36] or existing metadata [22] that are (1) closely related (Fig. 1 top), (2) hard to collect automatically, and (3) rather limited and static with respect to the covered entities.

Cross-modal entity consistency
In this section, we present a system that automatically verifies the semantic relations in terms of shared entities between pairs of image and text. Verification is realized through measures of cross-modal similarities for different entity types (persons, locations, and events). Based on named entity linking (Sect. 3.1), visual evidence for cross-modal entity occurrences is collected from the Web. Visual features are obtained by appropriate computer vision approaches (Sect. 3.2), which are used in conjunction with measures of cross-modal similarity (Sect. 3.3) to quantify the cross-modal consistency. The workflow is illustrated in Fig. 2.

Extraction of textual entities
In order to quantify cross-modal relations for specific types of entities, namely persons, locations, and events, named entity recognition and disambiguation is applied to extract a set of named entities T from the text. We have tried several frameworks such as AIDA [18], NERD [34], and Kolitsas et al.'s [24] approach. In an initial experiment, we found that combining the output of spaCy [19] for named entity recognition and Wikifier [5] for named entity linking provides the best results for different languages. Given a named entity recognition system for a specific language, Wikifier enables our system to support roughly 100 languages. For every named entity recognized by spaCy, we link the entity candidate with the highest PageRank according to Wikifier to the Wikidata knowledge base. Linked entities with a PageRank below 10^-4 are neglected due to their low confidence. If Wikifier does not provide a linked entity for a given string, the Wikidata API function "wbsearchentities" is used for disambiguation.
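The candidate selection described above can be sketched as follows; the candidate structure and the field names `wikidata_id` and `pagerank` are illustrative placeholders, not Wikifier's actual response schema:

```python
PAGERANK_MIN = 1e-4  # linked entities below this PageRank are neglected

def link_entities(mentions, candidates):
    """Link each recognized mention to the candidate with the highest
    PageRank; mentions whose best candidate falls below the threshold
    are dropped.

    mentions   -- surface strings from the NER step (e.g., spaCy)
    candidates -- mention -> list of {"wikidata_id": ..., "pagerank": ...}
    Returns a dict mapping mentions to Wikidata IDs.
    """
    linked = {}
    for mention in mentions:
        ranked = sorted(candidates.get(mention, []),
                        key=lambda c: c["pagerank"], reverse=True)
        if ranked and ranked[0]["pagerank"] >= PAGERANK_MIN:
            linked[mention] = ranked[0]["wikidata_id"]
    return linked
```

Mentions for which no candidate passes the threshold would then be handed to the Wikidata "wbsearchentities" fallback.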
As shown in Fig. 2, suitable computer vision approaches based on deep learning are applied to extract visual features that are used to quantify the cross-modal entity consistency. The computer vision model is selected based on the type (person, location, or event) of the named entity. Thus, it is necessary to assign each named entity t ∈ T to one of the entity types to create distinct sets of persons P, locations L, and events E. Although some named entity recognition tools such as spaCy [19] automatically predict entity types, they do not make use of the knowledge base information of the linked entities. To handle mistakes of entity type classification by spaCy and to discard irrelevant entities such as given names that cannot be linked to a knowledge base, the entity types are re-evaluated using the Wikidata information of the linked entities based on the following requirements. For persons, only entities that are an instance of (P31) human (Q5) according to Wikidata are considered, while for locations a valid coordinate location (P625) is required. This allows us to extract a variety of locations ranging from continents, countries, and cities to specific landmarks, streets, or buildings. For events, we instead require an entity to be in a verified list of events 4 according to EventKG [10,11]. Entities that do not fulfill any of the aforementioned criteria are neglected. As a result, distinct sets of persons P, locations L, and events E are extracted from the text and used to acquire example images from the Web, as explained in Sect. 3.3.
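The type re-evaluation amounts to a simple rule cascade over Wikidata statements; the dictionary layout below is a hypothetical simplification of the actual Wikidata claim format, and the rule order is an assumption of this sketch:

```python
def classify_entity(entity_id, claims, event_ids):
    """Re-evaluate the type of a linked entity using its Wikidata statements.

    entity_id -- Wikidata ID of the linked entity (e.g., "Q76")
    claims    -- simplified statements: property -> list of values,
                 e.g. {"P31": ["Q5"]} or {"P625": [(52.52, 13.41)]}
    event_ids -- set of IDs from the verified EventKG event list
    Returns "person", "location", "event", or None (entity is neglected).
    """
    if "Q5" in claims.get("P31", []):   # instance of (P31) human (Q5)
        return "person"
    if claims.get("P625"):              # has a coordinate location (P625)
        return "location"
    if entity_id in event_ids:          # verified event according to EventKG
        return "event"
    return None
```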

Extraction of visual features
Our approach is applicable to articles containing multiple images, but for simplicity we assume that only a single image is present. State-of-the-art models are applied to obtain visual image representations.

Person Features: For person verification, we first jointly detect and normalize faces using Multi-task Cascaded Convolutional Networks [45]. An implementation 5 of FaceNet [37] is used to calculate a feature matrix F_V that contains the individual feature vectors f_v of all faces v ∈ V found in the image.

Location Features: We employ the base (M, f*) model 6 for geolocalization [29] to obtain a geospatial representation of the article's image. It provides good results across different environmental settings (indoor, natural, and urban). In contrast to the original method, we treat geolocalization as a verification approach and utilize the feature vector f_L from the penultimate pooling layer of the ResNet-101 model [14,15].
Event Features: In our initial approach [30], we used a more general image descriptor for scene classification to extract features for events since related approaches for event classification [1,2,43] have not considered many event types that are relevant for news. Recently, we have presented a dataset and an ontology-driven deep learning approach for event classification [31]. Unlike previous work, it considers the majority of newsworthy event types such as natural disasters, epidemics, and elections. For this reason, we use this ontology-driven CO_cos^γ model 7 in the approach described in this paper. The visual event features f_E are extracted from the last pooling layer of the ResNet-50 architecture [14,15]. A comparison to the previous approach [30] is conducted in Sect. 5.5.

Verification of shared cross-modal entities
In this section, we present measures of Cross-modal Similarity for different entity types, namely persons, locations, and events. It should be emphasized that we treat each verification task independently. The Cross-modal Similarities for the different entity types are not combined, which allows for a more detailed and realistic analysis. Referring to Fig. 1 (bottom), please imagine a news article where the image depicts one or several person(s) talking at a conference. While there can be multiple events and locations mentioned in the corresponding text, the news image does not provide any visual cues for their verification. This is typical for news articles since the text usually contains more entities and information. In the case of fake news, it is common that only one entity type is manipulated to maintain credibility.

Verification of persons
As illustrated in Fig. 2, we first crawl a maximum of k example images using image search engines such as Google or Bing for each person p ∈ P that was extracted by the named entity linking approach presented in Sect. 3.1. Since these images can depict other or several persons, a filtering step is necessary. As described in Sect. 3.2, feature vectors are extracted for each detected face v ∈ V in the images. These features are compared with each other using the cosine similarity to perform a hierarchical clustering with a minimal similarity threshold τ_P as a termination criterion. Consequently, the mean feature vector of the majority cluster is calculated and serves as the reference vector f̄_p for person p, since it most likely represents the queried person.
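The filtering step can be sketched as a simplified average-linkage clustering on cosine similarity; this is not the authors' exact implementation, and the threshold `tau` (default 0.7, an illustrative value) stands in for τ_P:

```python
import numpy as np

def reference_vector(face_feats, tau=0.7):
    """Average-linkage agglomerative clustering of reference-face features
    on cosine similarity; merging stops once no pair of clusters is more
    similar than tau. Returns the mean feature vector of the largest
    (majority) cluster, which serves as the reference vector."""
    F = face_feats / np.linalg.norm(face_feats, axis=1, keepdims=True)
    clusters = [[i] for i in range(len(F))]
    while len(clusters) > 1:
        best, pair = -1.0, None
        for a in range(len(clusters)):          # find the most similar pair
            for b in range(a + 1, len(clusters)):
                sim = float(np.mean(F[clusters[a]] @ F[clusters[b]].T))
                if sim > best:
                    best, pair = sim, (a, b)
        if best < tau:                          # termination criterion
            break
        a, b = pair
        clusters[a] += clusters.pop(b)          # merge (b > a, so pop is safe)
    majority = max(clusters, key=len)
    return F[majority].mean(axis=0)
```

Faces of the queried person typically collapse into one large cluster, while unrelated faces remain in small clusters that are discarded by the majority vote.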
Finally, the feature vector f_v of each face v ∈ V detected in the document image is compared to the reference vector f̄_p of each person p ∈ P. Several options are available to calculate an overall Cross-modal Person Similarity (CMPS), such as the mean, an n%-quantile, or the maximum of all comparisons. However, as mentioned above, the text usually contains more entities than the image, and already a single correlation can theoretically ensure credibility. Thus, we define the Cross-modal Person Similarity (CMPS) as the maximum similarity among all comparisons according to Eq. (1), since the mean or quantile would require the presence of several or all entities mentioned in the text.

(Footnote 7: CO_cos^γ model for event classification: https://github.com/TIBHannover/VisE)
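Eq. (1) itself is not reproduced in this excerpt; based on the description above (the maximum cosine similarity over all face-reference comparisons), it can be reconstructed along these lines, using the notation introduced in this section:

```latex
\mathrm{CMPS} \;=\; \max_{p \in P} \; \max_{v \in V} \;
    \frac{f_v \cdot \bar{f}_p}{\lVert f_v \rVert \, \lVert \bar{f}_p \rVert}
```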

Verification of locations and events
In general, we follow the pipeline of person entity verification. The feature vectors of a maximum of k reference images for each location and event mentioned in the text are calculated using the deep learning approach of the respective entity type according to Sect. 3.2. However, while some entities are very specific (e.g., landmarks, sport finals), others are more general (e.g., countries, international crises) and can therefore contain diverse example data. This makes a visual filtering based on clustering very complicated since these entities can already contain many visually different subclusters due to high intra-class variations. Thus, the feature vector f_L (for locations) or f_E (for events) of the news photograph (Sect. 3.2) is compared, using the cosine similarity, to the feature matrix F_l (for locations) or F_e (for events) that contains the features of all reference images crawled for a given location l ∈ L or event e ∈ E. To obtain a Cross-modal Similarity value for each entity, an operator function Φ: s → [0, 1] (e.g., the maximum operator) is applied that reduces the resulting similarity vector s, which contains the similarities of all reference images to the news image, to a scalar. In the experiments (Sect. 5.3), we evaluate the maximum and several n%-quantiles as potential operator functions. We believe that an n%-quantile is more robust against incorrect or unrelated entity images in the retrieved reference data. As explained for person verification, we decided to use the maximum Cross-modal Similarity among all entities of a given type for both the Cross-modal Location Similarity (CMLS) and the Cross-modal Event Similarity (CMES) of the document.
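The equations referenced in the text are missing from this excerpt; from the surrounding description, the per-entity similarity vector and the resulting document-level measure can be reconstructed as follows (Φ denotes the operator function, e.g., the maximum or an n%-quantile; the locations case is shown, events are analogous with f_E and F_e):

```latex
s_{l}[i] \;=\; \frac{f_L \cdot F_{l}[i]}{\lVert f_L \rVert \, \lVert F_{l}[i] \rVert},
\qquad
\mathrm{CMLS} \;=\; \max_{l \in L} \; \Phi\!\left(s_{l}\right)
```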

Cross-modal context consistency
In the previous section, we have presented an approach that quantifies the cross-modal consistency for each entity based on reference images crawled from the Web. This approach is not applicable to the quantification of the contextual semantic relation since Web queries are hard to define automatically based on the entire news content. For this reason, we pursued a different direction. We extract word embeddings from the article's text (Sect. 4.1) as well as the visual probabilities of general scene concepts along with their respective word embeddings (Sect. 4.2) to quantify the Cross-modal Context Similarity (CMCS) (Sect. 4.3). An overview is provided in Fig. 3.

Textual scene context
To retrieve suitable candidates representing the textual (scene) context C, the part-of-speech tagging of spaCy [19] is applied to extract all nouns c ∈ C. They can represent general concepts, such as politics or sports, as well as scenes or actions that might correlate to specific classes, e.g., of a place (scene) classification dataset such as Places365 [47]. Subsequently, we calculate the word embedding w_c for each candidate c ∈ C using fastText [12] as a prerequisite for the cross-modal comparison explained in Sect. 4.3.
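The candidate extraction reduces to a part-of-speech filter; the sketch below operates on (token, POS) pairs as produced by a tagger such as spaCy, assuming Universal POS tags where common nouns are tagged "NOUN":

```python
def scene_candidates(tagged_tokens):
    """Collect the textual scene-context candidates C: all tokens tagged
    as nouns. tagged_tokens is a list of (token, pos) pairs from a
    part-of-speech tagger; deduplicated and lower-cased for lookup in
    the word-embedding vocabulary."""
    return sorted({tok.lower() for tok, pos in tagged_tokens if pos == "NOUN"})
```

Each candidate c would then be mapped to its fastText embedding w_c for the comparison in Sect. 4.3.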

Visual scene context
A ResNet-50 model 8 [14,15] for scene (place) classification that is trained on the 365 places of the Places365 dataset [47] is applied to predict the visual scene probabilities ŷ_S. As for the textual scene context (Sect. 4.1), fastText [12] is employed to additionally extract the corresponding word embedding w_s of each scene label s ∈ S. While scene labels such as beach, conference center, or church are rather generic, their word embeddings can also be associated with specific news topics such as holiday, politics, or religion. Both the visual scene probabilities and the scene word embeddings are used as visual scene context. The scene labels were manually translated to German for the experiments on German news articles.

Cross-modal context similarity
Unlike the cross-modal entity verification, the quantification of the Cross-modal Context Similarity (CMCS) does not require any reference images as it is solely based on the textual (Sect. 4.1) and visual scene context (Sect. 4.2) given by the news article. In this regard, we compare the individual word embedding w_c of each noun c ∈ C to the word embeddings w_s of all 365 scene class labels s ∈ S covered by the Places365 dataset [47] using the cosine similarity. Since only certain scenes are depicted in a news image, these similarities are weighted with the respective visual scene probability ŷ_s of a scene class s ∈ S to integrate the image information. Finally, the Cross-modal Context Similarity (CMCS) is defined as the maximum weighted similarity among all comparisons (Fig. 3).
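Putting Sects. 4.1-4.3 together, the CMCS computation can be sketched with NumPy; the matrix names are illustrative, with `W_c` and `W_s` holding the fastText embeddings and `y_s` the predicted Places365 probabilities:

```python
import numpy as np

def cmcs(W_c, W_s, y_s):
    """Cross-modal Context Similarity: cosine similarity between every noun
    embedding w_c and scene-label embedding w_s, weighted by the predicted
    visual scene probability y_s; the maximum over all pairs is returned.

    W_c -- (|C|, d) noun word embeddings
    W_s -- (|S|, d) scene-label word embeddings (|S| = 365 for Places365)
    y_s -- (|S|,) visual scene probabilities
    """
    C = W_c / np.linalg.norm(W_c, axis=1, keepdims=True)
    S = W_s / np.linalg.norm(W_s, axis=1, keepdims=True)
    weighted = (C @ S.T) * y_s   # broadcasts y_s over the scene axis
    return float(weighted.max())
```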

Experimental setup and results
In this section, we introduce two novel datasets for cross-modal consistency verification (Sect. 5.1). Furthermore, the metrics for evaluation (Sect. 5.2) and the parameter selection (Sect. 5.3) are explained in more detail. The performance of the proposed system on real-world news articles is evaluated in Sect. 5.4, and two different deep learning approaches for the quantification of cross-modal event relationships are compared in Sect. 5.5. Finally, the limitations and dependencies of our proposed approach are discussed in Sect. 5.6.

Datasets
Two real-world news datasets that cover different languages, domains, and topics are utilized for the experiments. They were both manipulated to perform experiments for cross-modal consistency verification. Experiments and comparisons to related work [21,36] on datasets such as MEIR [36] are not reasonable since (1) they do not contain public persons or events, and (2) they rely on pre-defined reference or training data for given entities. These restrictions severely limit the application in practice. We propose an automated solution for real-world scenarios that works for public personalities and entities represented in a knowledge base. In the remainder of this section, we introduce the tampering techniques (Sect. 5.1.1) as well as the TamperedNews (Sect. 5.1.2) and News400 (Sect. 5.1.3) datasets, which contain articles written in English and German, respectively.
Dataset Version 2: Please note that we have noticed a minor problem in the first version of our dataset [30] that affected circa 5% of the linked entities. As a consequence, the results slightly differ from the first version. The repository and datasets have been updated accordingly 3 .

Tampering techniques
We have created multiple sets of tampered entities for each document in our datasets. Similar to Sabir et al. [36], we replaced entities extracted from the text at random with another entity of the same type to change semantic relations as little as possible. We also apply more sophisticated tampering techniques as follows. Three additional tampered person sets are created by replacing each untampered person with another person of the same gender (PsG), the same country of citizenship (PsC), or matching both criteria (PsCG). Locations are replaced by other locations that share at least one parent class (e.g., country or city) according to Wikidata and are located within a Great Circle Distance (GCD) of d_min to d_max kilometers (GCD[d_min, d_max]). Three intervals are used to experiment with different spatial resolutions at region-level (GCD[25, 200]), country-level (GCD[200, 750]), and continent-level (GCD[750, 2500]). Similarly, events that share the same parent class (e.g., sports competition or natural disaster) with the untampered event are used for a second set (EsP) of tampered events. In case no valid candidate for a tampering strategy was available, we have used a random candidate that matched most of the other tampering criteria.
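The GCD-based candidate filter can be sketched with the haversine formula; the Earth-radius constant and function names are our own, as the paper does not specify its exact GCD implementation:

```python
from math import radians, sin, cos, asin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an assumption of this sketch

def gcd_km(lat1, lon1, lat2, lon2):
    """Great Circle Distance in kilometers via the haversine formula."""
    p1, p2 = radians(lat1), radians(lat2)
    dp, dl = radians(lat2 - lat1), radians(lon2 - lon1)
    a = sin(dp / 2) ** 2 + cos(p1) * cos(p2) * sin(dl / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))

def in_gcd_interval(coord, candidate, d_min, d_max):
    """True if a tampering candidate lies within [d_min, d_max] km of the
    untampered location, i.e., inside the interval GCD[d_min, d_max]."""
    d = gcd_km(*coord, *candidate)
    return d_min <= d <= d_max
```

For example, Berlin and Hamburg are roughly 255 km apart, so Hamburg would be a valid country-level but not region-level replacement for Berlin.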
The contextual verification is based on the nouns in the text. Thus, textual tampering techniques are not applicable. Instead, we replaced the image with a random image from all other documents for a first tampered set. To create three more sets, we randomly selected similar images (from the top-k% with k ∈ {5, 10, 25}) to maintain semantic relations. The similarity was computed using feature vectors extracted from a ResNet model [14,15] trained on ImageNet [9].
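The similar-image tampering can be sketched as follows, assuming pre-computed ImageNet feature vectors; this is a simplified stand-in for the authors' selection procedure:

```python
import random
import numpy as np

def tampered_image(idx, feats, k_percent, rng=random):
    """Pick a replacement image for document idx: draw at random from the
    top-k% most similar images (cosine similarity of image features)
    among all other documents.

    feats -- (N, d) float matrix of image feature vectors
    """
    F = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    sims = F @ F[idx]
    sims[idx] = -np.inf                    # never pick the document itself
    n = max(1, int(len(feats) * k_percent / 100))
    top = np.argsort(-sims)[:n]            # indices of the top-k% images
    return int(rng.choice(top))
```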

TamperedNews dataset
To the best of our knowledge, BreakingNews [33] is the largest available corpus of news articles that contain both image and text. It originally covered approximately 100,000 English news articles from 2014 across different domains and a huge variety of topics (e.g., sports, politics, healthcare). We created a subset called TamperedNews for cross-modal consistency verification with the 72,561 articles for which the news text and image were still available. The entities in these articles were additionally tampered according to Sect. 5.1.1. To discard most irrelevant entities, only persons and locations mentioned in at least ten documents and events that occur in at least three documents are considered. Detailed dataset statistics are reported in Table 1.

News400 dataset
To show the capability of our approach for another language and time period, we have used the Twitter API to obtain the web links (URLs) of news articles from three popular German news websites (faz.net, haz.de, sueddeutsche.de). The texts and main images of the articles were crawled from the URLs. We have gathered 397 news articles covering four different topics (politics, economy, sports, and travel) in the period from August 2018 to January 2019. The smaller size of the dataset allowed us to conduct a manual annotation with three experts to ensure valid relationships between image and text. For each document, the annotators verified the presence of at least one person, location, or event in both the image and the text, and whether the context was consistent in both modalities. Experiments were conducted exclusively on data with valid relations. Again, the tampering techniques presented in Sect. 5.1.1 are applied to create the test sets. Due to its smaller size, every entity is considered regardless of how often it appears in the entire dataset. The resulting statistics are shown in Table 1.

Evaluation tasks and metrics
The evaluation tasks are motivated by potential real-world applications of our system. We propose to evaluate the system for two tasks: (1) document verification and (2) collection retrieval. The system can also be used as an analytics tool to quickly explore cross-modal relations within a document as illustrated in Fig. 1.

Document verification
Please imagine a set of two or more news articles with similar content and imagery but differences in the mentioned entities that might have been tampered with by an author with harmful intent. The idea behind this task is to decide which joint pair of image and entities extracted from the news text provides the higher cross-modal consistency. Thus, document verification can help users to detect the most or least suitable document. We address this task using the following strategy. For each individual document in the dataset, we compare the cross-modal similarities between the news image and the respective set of untampered entities as well as one set of tampered entities (e.g., PsG) according to the strategies proposed in Sect. 5.1.1. This allows us to evaluate the impact of different tampering strategies. We report the Verification Accuracy (VA) that quantifies how often the untampered entity set achieved the higher cross-modal similarity to the document's image. Some qualitative examples are shown in Fig. 5. Please note that for the context evaluation the image is tampered instead and the nouns in the text are considered as "entities".
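Given per-document similarity scores, the Verification Accuracy reduces to a simple comparison:

```python
def verification_accuracy(pairs):
    """Verification Accuracy (VA): the fraction of documents for which the
    untampered entity set reaches a higher cross-modal similarity to the
    news image than the tampered set.

    pairs -- one (sim_untampered, sim_tampered) tuple per document
    """
    wins = sum(1 for clean, tampered in pairs if clean > tampered)
    return wins / len(pairs)
```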

Collection retrieval
The system can also be leveraged in news collections to retrieve news articles with high or low cross-modal relations to support human assessors in gathering the most credible news or possibly fake news (in extreme cases). We therefore consider all |D| untampered documents as well as |D| tampered documents obtained by applying one tampering strategy. The cross-modal similarities are calculated and used to rank all 2 · |D| documents. As suggested by previous work [21,36], the Area Under the Receiver Operating Characteristic curve (AUC) is used for evaluation. We also propose to calculate the Average Precision (AP) for retrieving untampered (AP-clean) or tampered (AP-tampered) documents at specific recall levels R according to Eq. (4). In this respect, TP_i is the number of relevant documents among the top i positions. For example, AP-tampered@25% describes the average precision when |D_R| = 0.25 · |D| of all tampered documents are retrieved.
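Eq. (4) is not reproduced in this excerpt; from the description (TP_i relevant documents among the top i results, evaluated until the recall level R is reached), a reconstruction consistent with the standard definition of average precision would be:

```latex
\mathrm{AP}@R \;=\; \frac{1}{|D_R|} \sum_{i=1}^{N_R} \frac{\mathrm{TP}_i}{i}\,\mathrm{rel}(i)
```

where rel(i) is 1 if the document at rank i is relevant (0 otherwise) and N_R is the rank at which the |D_R|-th relevant document is retrieved.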

Test document selection for TamperedNews
Although the large size of the TamperedNews dataset allows for a large-scale analysis of the results, a manual verification of cross-modal relations as for News400 is infeasible. Thus, reporting the proposed metrics for the whole dataset can be misleading since it turned out during the annotation of News400 that only a fraction of the documents has cross-modal entity correlations (Table 1). As discussed at the beginning of Sect. 3.3, it is possible that not a single entity mentioned in a news text is depicted in the corresponding image. To address this issue, we suggest measuring the metrics for specific subsets. More specifically, we consider the top-25% and top-50% documents (denoted as TamperedNews (Top-k%)) with respect to their cross-modal similarity of untampered entities since they more likely contain relations between image and text. This selection is also supported by the Cross-modal Person Similarity (CMPS) values for person verification (Fig. 4), which decrease more significantly after 25-50% of all documents and correspond to the percentage of manually verified documents in the News400 dataset. Please note that experiments on top-k% subsets limit the comparability between two approaches to some degree. Depending on the specified parameters (e.g., feature descriptor, operator, etc.), the top-k% subsets comprise different documents. However, in Sect. 5.5 we explain how a meaningful comparison between two different approaches can be conducted.

[Table 2]

Parameter selection
For comparison, we also tested the face verification using the approach applied for event and location entities described in Sect. 3.3.2. Surprisingly, the results for the 90% and 95% quantiles are on par with the proposed person clustering. Also, contrary to our assumption that a quantile is more robust against noise for locations and events, it turned out that the maximum operator provides slightly better results for these entity types. This indicates that incorrect examples in the reference data have no significant impact on the performance. Except for person entities, where reference faces can be very similar, we assume that irrelevant or unrelated reference images are less likely to match the entity depicted in the news image. In the remainder of this paper, results for persons are reported using the clustering strategy because we still believe that this is more robust in many scenarios. For locations and events, the maximum operator is applied.

Amount and Sources of Reference Images: In total, we collected a maximum of k = 20 images from the image search engines of Google and Bing as well as all k_W available images on Wikidata (mostly one Wikimedia image) for each entity recognized in the text. We have used multiple sources to prevent possible selection biases of a specific image source and investigated the performance for different image sources and numbers of images. Since Wikidata usually provides only a single or sometimes no image for the linked entities, we exclude it from the comparison. The results on the respective TamperedNews (Top-50%) subsets for the AUC metric using the hardest tampering strategies are presented in Table 3.
They demonstrate that the performance using a single image source or all of them is very similar. Also, the results using k = 10 reference images are almost identical to those using the maximum of k = 20 images. Hence, for the rest of our experiments, we use all available image sources with a maximum of k = 10 images per source, as this provides a good trade-off between performance and speed and prevents possible selection biases.
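The maximum and quantile operators discussed above fuse the per-reference-image similarities into a single entity score. A minimal sketch, with parameter names of our choosing:

```python
import numpy as np

def aggregate_similarity(sims, operator="max", q=0.9):
    """Fuse the similarities between a news image and an entity's
    reference images into one score (sketch of the operators
    discussed in the text)."""
    sims = np.asarray(sims, dtype=float)
    if operator == "max":        # used for locations and events
        return float(sims.max())
    if operator == "quantile":   # e.g., the 90% or 95% quantile
        return float(np.quantile(sims, q))
    raise ValueError(f"unknown operator: {operator}")
```

The quantile variant discards the very top of the similarity distribution and is therefore, in principle, less sensitive to a few noisy reference images; as reported above, this advantage did not materialize for locations and events.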

Experimental results
In this section, we present the baseline results of the proposed system for cross-modal consistency verification on the TamperedNews (Sect. 5.4.1) and News400 (Sect. 5.4.2) datasets.

Results for TamperedNews
Qualitative and quantitative results are presented in Figs. 4 and 5 and Table 4. Results for all TamperedNews documents as well as for the top-25% subset allow similar conclusions and are reported in the supplemental material.³ Results for Person Entities: As expected, person verification achieves the best performance, since the entities and the retrieved example material are largely unambiguous and neural networks for face recognition, such as FaceNet [37], achieve impressive results. Despite the more challenging tampering techniques, our approach is still able to produce similar results. We only experienced problems if persons were depicted in challenging conditions (e.g., extreme poses, as shown in Fig. 5a for John Kerry) or were rather unknown, which leads to false entity linking results and confusion with other persons (e.g., with a similar name). Results for Location Entities: To evaluate the performance for location entities, we distinguished between images of indoor and outdoor scenes using the scene probabilities ŷ_S extracted according to Sect. 4.3 and the hierarchy provided by the Places365 dataset [47]. Due to the diversity and ambiguity of the data and the unequal distribution of photographs on earth, geolocation estimation is a complex problem that has attracted attention only in recent years [29,41,42]. Therefore, the results were expected to be worse compared to person verification. Despite the complexity, good results were achieved for outdoor images, whereas the detection of modified indoor scenes is more challenging given the low amount of geographical cues and their ambiguity. However, even when entities are tampered with locations of similar appearance and low Great Circle Distance (GCD) (Fig. 5b, d), the system still operates on a good level and shows promising results.
In contrast to person entities, location entities are instances of various parent classes such as countries or cities. For a more in-depth analysis, we have calculated the results for all types of locations separately, using the documents D_s where an instance of a given type has achieved the highest Cross-modal Location Similarity (CMLS) within the untampered set of entities. The results for some location types are presented in Table 5 (top) and show that the performance is best for fine-grained entities such as tourist attractions, buildings, and cities. The performance for coarse location types such as oceans, mountain ranges, and country states is typically worse, since they do not provide sufficient geographical cues or are too broad to retrieve suitable reference images. Although the results for continents or countries are also comparatively high, we believe the reason is that the candidates for tampering are easier to distinguish, since locations of these types exhibit larger geographical and cultural differences. Tampering is much more challenging for fine-grained entities, as illustrated in Fig. 5b, d.
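The per-type document assignment described above can be sketched as follows; `cmls` and `entity_type` are assumed helpers standing in for the CMLS computation and the Wikidata type lookup, and all names are ours:

```python
def group_by_best_type(documents, cmls, entity_type):
    """Group documents by the type of the location entity that
    achieves the highest CMLS among the document's untampered
    locations (illustrative sketch, not the original code)."""
    groups = {}
    for doc in documents:
        best = max(doc["locations"], key=lambda e: cmls(doc, e))
        groups.setdefault(entity_type(best), []).append(doc["id"])
    return groups
```

Each document thus contributes to exactly one location type, namely the type of its best-matching untampered location.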
Results for Event Entities: In general, good results were achieved for event verification. As for locations, we provide results for common event types in Table 5 (bottom). While the results for festivals, holidays, and disasters are promising, event types such as football club competitions, protests, and wars are hard to distinguish. We believe that this is caused by the high visual similarity of events within these types. For example, many news articles on football club cups contain images that depict typical scenes (e.g., players on the pitch) of the same sport, unlike articles on sports competitions that refer to different types of sports. Thus, reference images for the different competitions are very similar. Moreover, the utilized event classification approach [31] distinguishes between event types such as football, elections, or types of natural disasters rather than between sub-types or concrete event instances such as the UEFA Champions League or the 2020 U.S. elections. Despite these limitations, the results are superior to the scene classification approach used in our previous work [30], as discussed in more detail in Sect. 5.5.

Cross-modal Context Similarity:
The results for scene context verification indicate that our system can reliably detect documents with randomly changed images. However, as also stated by [36], this task is rather easy as the semantic relations are not maintained. When similar images are used for tampering, this task becomes much more challenging.
Since networks for object classification (used for tampering) and scene classification (used for verification) can produce comparable results, the performance steadily decreases when increasingly similar images, which might even show the same scene (e.g., sport), are used for tampering. However, our system is still able to hint towards cross-modal consistencies.
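A toy version of such a context score is the cosine similarity between the averaged word embeddings of the news text and of the predicted scene labels. This is our simplification for illustration, not the paper's exact formulation:

```python
import numpy as np

def context_similarity(text_vecs, scene_vecs):
    """Cosine similarity between the mean word embedding of the
    text and the mean embedding of the scene labels
    (simplified sketch of a cross-modal context score)."""
    def mean_unit(vectors):
        v = np.mean(np.asarray(vectors, dtype=float), axis=0)
        return v / np.linalg.norm(v)
    return float(np.dot(mean_unit(text_vecs), mean_unit(scene_vecs)))
```

A tampered image whose scene labels are semantically close to the text (e.g., both about sport) yields a high score and is therefore hard to detect, which matches the behavior described above.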

Results for News400
Since the number of documents is rather limited and the cross-modal mutual presence of entities was manually verified, results for News400 are reported for all documents with verified relations. Based on the results displayed in Table 6, similar conclusions on the overall system performance can be drawn. However, the results for retrieving tampered documents are noticeably worse. This is mainly caused by the fact that some untampered entities with valid cross-modal relations can be either unspecific (e.g., the mention of a country) or the images retrieved for visual verification do not fit the document's image content. For TamperedNews, this problem was bypassed by using the top-k% document subsets to counteract the influence of untampered documents that do not show any cross-modal relations (as discussed in Sect. 5.2.3). We have verified the same behavior for News400 when experimenting on these subsets; for more details, we refer to the supplemental material.³ In addition, the performance for context verification is worse compared to TamperedNews. We assume that this is due to the less powerful word embeddings for the German language. Overall, the system achieves promising performance for cross-modal consistency verification. Since it dynamically gathers example data from the Web, it is robust to changes in topics and entities, even when applied to news articles from another country or publication date.

Comparison of event feature descriptors
As discussed in Sect. 3.2, we decided to use the ontology-driven event classification approach [31] to compute event features for our proposed system. Due to the absence of suitable methods for event classification, a more general scene classification model was applied in our previous approach [30]. It is trained on the 365 places covered by the Places365 dataset [47], and the visual features f_E are obtained from the last pooling layer of a ResNet-50 model⁸ [14,15].
To compare both approaches, we evaluate their performance on the News400 dataset, as it contains documents with verified event relations. As explained in Sect. 5.2.3, we used the TamperedNews (Top-50%) documents as test subsets, since they more likely contain cross-modal relations. However, this complicates the comparison of two approaches, as those subsets can differ depending on the specified parameters (feature descriptor, operator, etc.). Thus, we report results on all documents as well as on the intersection and union of the TamperedNews (Top-50%) document sets of both approaches. In this way, the test sets contain documents that are considered relevant by both or by at least one approach, respectively. The results are presented in Table 7 and demonstrate that the event classification approach achieves superior performance. However, as already discussed in Sect. 5.4.1, the approach is not trained for the classification of concrete event instances and instead focuses on more generic event types. As a consequence, the improvements for the EsP test set containing tampered events are not as significant as for the randomly tampered test set. Further limitations and dependencies are discussed in the next section.
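Building the intersection and union test sets from the two approaches' Top-50% subsets amounts to simple set operations; a minimal sketch with names of our choosing:

```python
def comparison_sets(top_a, top_b):
    """Test sets for a fair comparison of two approaches whose
    Top-50% subsets differ: documents considered relevant by
    both (intersection) or by at least one (union) approach."""
    a, b = set(top_a), set(top_b)
    return {"intersection": a & b, "union": a | b}
```

Evaluating both approaches on these shared sets removes the bias that each approach would otherwise be tested only on the documents it ranked highest itself.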

Limitations and dependencies
News coverage on the World Wide Web is dynamic, and new entities and topics evolve every day. We have deliberately chosen Wikifier for named entity linking, as it dynamically covers Wikipedia entities. However, the proposed system is restricted to entities that exist in a knowledge base. Besides, the system relies on the rankings and response times of image search engines. In this regard, the reference images for coarse entities such as countries or continents crawled from the Web might not match the news image. Some named entities such as "Hanover" (German or U.S. city) or "Tesla" (company or inventor) can also be ambiguous. Referring to Fig. 1, we also noticed that querying entities such as the city "Liverpool" retrieves images that depict another, more popular entity, in this case the football club "Liverpool F.C.", rather than the actual entity. A potential solution to the aforementioned problems is to include knowledge graph information and relations that are already extracted by the system. For example, adding the country (Wikidata property P17) "Germany" to the query "Hanover" (Wikidata item Q1715), or using the entity type (Wikidata property P31) "city" in combination with the query "Liverpool" (Wikidata item Q24826), can prevent potential ambiguities.
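The proposed query refinement can be sketched as follows; the function and parameter names are ours, and the country and type strings would in practice come from the Wikidata properties P17 and P31 named above:

```python
def disambiguated_query(label, country=None, entity_type=None):
    """Append an entity's country (Wikidata P17) or type (P31)
    to its label before querying an image search engine
    (illustrative sketch of the proposed refinement)."""
    parts = [label]
    if country:
        parts.append(country)      # e.g., "Hanover" -> "Hanover Germany"
    if entity_type:
        parts.append(entity_type)  # e.g., "Liverpool" -> "Liverpool city"
    return " ".join(parts)
```

Such refined queries bias the image search towards the intended entity instead of a more popular namesake.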

Conclusions
In this paper, we have presented a novel analytics system and benchmark datasets to measure the cross-modal consistency in real-world news articles. Named entity linking is applied to find persons, locations, and events in the news text. Reference data is automatically gathered from the Web and used in combination with novel measures of cross-modal similarity for the visual verification of entities in the article's photograph. In this regard, state-of-the-art computer vision methods are applied. Furthermore, a more general measure of cross-modal similarity of the textual content to the scene depicted in the image has been introduced. Unlike previous work, our system is completely unsupervised and visual representations of the extracted entities are not derived from similar data sources with additionally available metadata. Experiments were conducted on two datasets that contain real-world news articles across different topics, domains, and languages and have clearly demonstrated the feasibility of the proposed approach.
As mentioned in Sect. 5.6, the system performance for coarse (countries, continents, etc.), ambiguous, or less popular entities can suffer due to the lack of relevant reference images crawled by the unsupervised Web image search. Thus, in the future we aim to refine the image search queries based on the extracted named entities for the visual verification approach by further exploiting knowledge graph information and entity relationships. Furthermore, the event classification approach is only able to distinguish between event types such as types of sports, natural disasters, elections, etc. The system could greatly benefit from an event classification approach that is capable of differentiating between more fine-grained event types and concrete event instances, e.g., the UEFA Champions League or the 2020 U.S. elections. Another interesting direction of research is to investigate the impact of other types of entities, such as time or organizations, of entity relations, as well as of relations between the overall textual and visual sentiment.