1 Introduction

We introduce Police PR Search (Footnote 1), a search system for double-checking online news about police-related events (e.g., serious crimes, but also accidents or demonstrations) against the content of relevant police press releases. Our current prototype indexes press releases of virtually the entire German police force and retrieves them when queried with the URL of a (German) online news article. Readers of online news about police-related events can use the system to retrieve official background information.

An illustrative example is given in Fig. 1. On June 23, 2016, a hostage situation occurred in a cinema in Viernheim, a small town south of Frankfurt and Darmstadt, Germany. The German tabloid BILD quickly picked up the incident (cf. Fig. 1 (left)), reporting a rampage and the involvement of explosives. Soon after, the BBC tweeted about 20 casualties in the cinema (cf. Fig. 1 (right)). The police did indeed shoot and kill the hostage-taker, but no one else was injured and no explosives were involved, as explained in the official police press release published some time after the incident. The sensationally exaggerated and erroneous claims were then removed from the online articles.

Not every piece of (online) news is wrong or sensationally exaggerated. Still, wrong news articles are published frequently [8], and sometimes even intentionally [7]. Readers might thus want to double-check news articles against official statements, which can be a rather time-consuming process of searching for other trustworthy sources. Manually selecting text fragments as queries, for instance, may still yield the same wrong information in the search results.

Fig. 1. Excerpts from the coverage of a hostage taking in a cinema in Viernheim on June 23, 2016, in the German newspaper BILD (left) and a tweet from the BBC (right).

To offer (semi-)automatic support in such situations, we have developed a search engine that can be queried directly with the URL of an online news article. As the documents to be retrieved, we index official press releases from police and fire departments. They offer information on many local events, topics that many readers are interested in anyway [6], and the police are a trusted source of information for many [3]. Since the prototype currently addresses the German market, we have crawled and indexed press releases from the German press portal Blaulicht. In our evaluation, we compare different strategies of formulating search queries given a URL. In a TREC-style setup [5] on 105 topics covering 7 classes of police-related events, we show that even the simplest querying strategy (searching the title of the news article in the titles and bodies of the press releases) substantially outperforms the search facility offered by the press portal itself. The best (and more involved) automatic querying strategy implemented in our system achieves precision@1 and nDCG@5 scores of about 0.9, a performance that clearly indicates practical applicability.

2 Search System and Query-by-Document Strategies

We extract the title, body, date, and police department location from all press releases as fields for retrieval with Elasticsearch’s BM25F implementation. As querying strategies against this index, we follow a query-by-document approach (the news article as the “query”). Loosely following previous work that identifies the most important keyphrases in an input query document to find similar content [2, 9], we compare three query formulation strategies.

Our three “query-by-document” strategies extract information from an input news article and combine it as follows: (1) only the title of the article, (2) the title and main content of the article, and (3) the title, main content, publication date, and locations mentioned in the article (if any). Since publication dates and locations cannot be extracted accurately for all potential input articles, the third strategy falls back to title and body information only in such cases.
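The selection of article parts per strategy, including the fallback of the third strategy, can be sketched as follows. This is only an illustrative sketch: the `Article` type, the function name `select_query_parts`, and the dictionary keys are hypothetical and not taken from the described system; the labels T, TB, and TBPD follow the abbreviations used in Table 1.

```python
from typing import Optional, TypedDict


class Article(TypedDict):
    """Hypothetical container for the parts extracted from a news article."""
    title: str
    body: Optional[str]
    date: Optional[str]   # ISO date string, or None if extraction failed
    locations: list[str]  # may be empty


def select_query_parts(article: Article, strategy: str) -> dict:
    """Return the article parts used by strategy 'T', 'TB', or 'TBPD'."""
    if strategy not in {"T", "TB", "TBPD"}:
        raise ValueError(f"unknown strategy: {strategy}")
    parts: dict = {"title": article["title"]}
    if strategy in {"TB", "TBPD"} and article.get("body"):
        parts["body"] = article["body"]
    if strategy == "TBPD":
        # Date and locations are optional; if neither could be
        # extracted, TBPD uses exactly the same parts as TB.
        if article.get("date"):
            parts["date"] = article["date"]
        if article.get("locations"):
            parts["locations"] = article["locations"]
    return parts
```

For an article without an extractable date or location, `select_query_parts(article, "TBPD")` returns the same parts as `select_query_parts(article, "TB")`.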

The queries against the Elasticsearch index are formulated as follows: The title of a given news article is queried against the title and body of the police press releases. When a news article has a body, it is queried against the body of the indexed press releases. When a publication date for the news article can be extracted, it is queried against the body of the press releases and used as a filter to remove press releases that were published more than two weeks before or more than eight weeks after that date. The potential locations extracted from a news article are used as queries against the police department name field as well as against the title and body of the police press releases.
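One way to express this query formulation in the Elasticsearch query DSL is sketched below. The sketch only builds the request body as a Python dictionary and makes several assumptions that are not stated in the text: the index field names (`title`, `body`, `published`, `department`), the use of `multi_match` and `match` clauses inside a `bool` query, and the German date format used when matching the date against the press release body.

```python
from datetime import date, timedelta
from typing import Optional


def build_query(title: str,
                body: Optional[str] = None,
                pub_date: Optional[date] = None,
                locations: Optional[list[str]] = None) -> dict:
    """Build an Elasticsearch request body from parts of a news article.

    Field names are assumptions made for illustration only.
    """
    # The article title is searched in press release titles and bodies.
    should = [{"multi_match": {"query": title, "fields": ["title", "body"]}}]
    filters = []
    if body:
        # The article body is searched in press release bodies only.
        should.append({"match": {"body": body}})
    if pub_date:
        # Assumed: dates appear in German notation in the release text.
        should.append({"match": {"body": pub_date.strftime("%d.%m.%Y")}})
        # Keep releases from two weeks before to eight weeks after.
        filters.append({"range": {"published": {
            "gte": (pub_date - timedelta(weeks=2)).isoformat(),
            "lte": (pub_date + timedelta(weeks=8)).isoformat(),
        }}})
    for location in locations or []:
        should.append({"multi_match": {
            "query": location,
            "fields": ["department", "title", "body"],
        }})
    return {"query": {"bool": {
        "should": should,
        "filter": filters,
        "minimum_should_match": 1,
    }}}
```

Setting `minimum_should_match` to 1 is a design choice of this sketch: it ensures that a press release cannot be returned merely because it falls inside the date window.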

3 Evaluation

To test our system, we follow the TREC evaluation paradigm [5]: 1,172,703 press releases form the document collection (Footnote 2), covering virtually all German police departments. Topics are formed by news articles about police-related incidents, and relevance judgments are obtained in a depth-5 pooling of different search system rankings.

To create topics representative of the “importance” of police-related incidents, we use the German crime statistics of 2018 [1]. The seven categories we cover are murder, theft, migration, sports-event-related incidents, thunderstorms, traffic accidents, and general capital offenses. Following a power analysis for a t-test with G*Power [4], based on a small pilot experiment, we create 15 topics per category (105 in total).

The individual topics in the form of news articles were compiled as follows: Given a random police press release from one of the aforementioned categories, an expert tried to identify a related news article using various online news search engines. The expert was also instructed to rate each topic’s difficulty during creation. “Level 1” indicates that the news article and the press release use very similar vocabulary, “Level 2” that the titles differ greatly but the bodies are similar, and “Level 3” that both the titles and the bodies differ substantially. If no news article was found for a press release within 5 minutes, the expert continued with another random press release.

The qrels for the 105 topics (i.e., news articles) have been created as follows: The initial press release used to create the topic is judged as highly relevant (score of 2). Then a depth-5 pool of the rankings returned by different querying strategies and retrieval systems is completely judged: the Blaulicht portal’s original search facility using the title or the title and the body as query, and our three strategies detailed above. The top-5 results of each ranking are judged on a graded scale from 0 (irrelevant) to 2 (highly relevant). A press release is judged as “highly relevant” (score 2) if it directly deals with the event described in the news article, and as “relevant” (score 1) if the news article’s event refers to the police press release. Most topics only have one highly relevant press release.
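The pooling step for a single topic can be sketched as follows. The function name `build_pool`, the system labels, and the document identifiers are hypothetical; only the pool depth of 5 and the inclusion of the initial press release follow the description above.

```python
def build_pool(rankings: dict[str, list[str]],
               seed_doc: str,
               depth: int = 5) -> set[str]:
    """Collect the documents to be judged for one topic.

    `rankings` maps a system or strategy label to its ranked list of
    press release IDs; `seed_doc` is the press release from which the
    topic was created (pre-judged as highly relevant).
    """
    pool = {seed_doc}
    for ranked_ids in rankings.values():
        # Only the top `depth` results of each ranking enter the pool.
        pool.update(ranked_ids[:depth])
    return pool


# Hypothetical rankings for one topic from two of the compared systems.
rankings = {
    "ORI-T": ["pr7", "pr2"],
    "PPR-TBPD": ["pr1", "pr2", "pr3", "pr4", "pr5", "pr6"],
}
pool = build_pool(rankings, seed_doc="pr1")
qrels = {"pr1": 2}  # the remaining pooled documents are judged manually
```

In this example, `pr6` is ranked sixth and therefore stays outside the pool, while `pr2` is pooled only once although two systems returned it.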

Table 1. Evaluation of the query strategies: title (T), title and body (TB), and a combination of title, body, place, and date (TBPD) for various difficulty levels. We compare our search engine (PPR) with the search engine of the German police press portal (ORI), reporting nDCG@5 and precision@1 (P@1).

Table 1 shows the aggregated effectiveness of the different systems/strategies in terms of precision@1 and nDCG@5. The performance of the Blaulicht portal’s search facility is rather low: even for easy topics (Level 1), hardly any relevant documents are found using a news article’s title as the query. A reason might be that some exact match retrieval is used since for many topics no result is returned. This trend gets even worse if additional information in the form of the bodies of the news articles is incorporated into the query.
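For reference, the two measures can be computed per topic as in the following sketch. It rests on assumptions the text does not spell out: nDCG is computed with linear gains and a log2 rank discount (an exponential gain is a common alternative), unjudged documents count as irrelevant, and precision@1 treats any judgment above 0 as relevant.

```python
import math


def precision_at_1(ranked_ids: list[str], qrels: dict[str, int]) -> float:
    """1.0 if the top-ranked document is judged relevant (score > 0)."""
    if not ranked_ids:
        return 0.0
    return 1.0 if qrels.get(ranked_ids[0], 0) > 0 else 0.0


def dcg(gains: list[int]) -> float:
    """Discounted cumulative gain with linear gain and log2 discount."""
    return sum(g / math.log2(rank + 2) for rank, g in enumerate(gains))


def ndcg_at_k(ranked_ids: list[str], qrels: dict[str, int], k: int = 5) -> float:
    """nDCG@k of one ranking, normalized by the ideal ordering of the qrels."""
    gains = [qrels.get(doc_id, 0) for doc_id in ranked_ids[:k]]
    ideal = sorted(qrels.values(), reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0
```

An empty result list, which the text reports for many topics with the portal's search facility, yields 0 for both measures under these definitions.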

Our Elasticsearch-based system outperforms the portal’s search facility by far on every category (differences significant) even when only the title is used as the query. Adding more information to the query (body, location, date) results in slower search as the news articles’ body texts produce very long queries. Still, the effectiveness greatly improves by adding body as well as location and date information to the query with precision@1 and nDCG@5 reaching 0.9 on average. However, this gain in effectiveness comes at the cost of an increased average response time of more than 9 s. Testing and implementing strategies selecting the most informative keywords and phrases from the body to reduce query length thus form an interesting direction for future efficiency improvements.

4 Conclusion and Future Work

Our prototype shows that using a news article as the query against an index of police press releases can often deliver accurate background information about police-related incidents. Facts can be double-checked directly against official statements, a source of trust for many. For future work, we envision improvements in a number of directions. The efficiency for long queries involving a news article’s body text can possibly be improved by keyphrase extraction methods that reduce the queries to a few tens of words. It would also be interesting to analyze more closely the cases in which no press release can be identified; in our user study, this was often the case when the vocabulary differed greatly. Broadening this analysis might help to avoid showing only irrelevant results or no results at all.