Introduction

Since its founding in 2001, Wikipedia has grown from a simple, often dismissed, ‘Web 2.0’ based dream of shared knowledge to the Web’s primary authoritative information resource for billions of people. Its emergence has overhauled traditional conceptions of the encyclopaedia to become a rapidly updatable real-time record. In addition, it may act as an audience barometer for both historical knowledge and, crucially, current events. Wikipedia is thus an intriguing site of study for news events—reflective of collective audience trends, yet somewhat divorced from the influence of journalistic and content delivery processes.

Wikipedia is a particularly attractive platform to analyse audience reception of current events. Beyond its position as one of the most popular sites on the World Wide Web, the information on Wikipedia is stored in a hyperlink network of articles. Information representation is then a result of both the knowledge structure on Wikipedia and the way its readers navigate that structure. This is unlike a news website, where each event typically receives a single news article, or occasionally a small series of articles, often disconnected from one another despite sharing similar subjects. Wikipedia data then facilitates connections and comparisons between different events, as well as to longer term historical knowledge structures. Moreover, using Wikipedia as a data source to study current events fulfils calls for the use of “extra-media data”—information about current events not taken directly from news media [1]. Various works have taken advantage of Wikipedia’s online importance, relevance towards current events, and data availability, using different aspects of the real-time digital encyclopaedia in studying areas such as stock prices [2], film box office performance [3], and disease spread [4].

To holistically study the interaction of current events with Wikipedia, we must first identify what events are represented and how they are manifested. Individuals browse groups of articles relating to events in the news, but what are the various groups that are browsed? Are they consistent between events in forming news topics? And how well do access patterns align with the structure of the information in the encyclopaedia? One must answer these questions to study how different kinds of events are integrated into collective memory, encyclopaedic knowledge, and the historical record.

Previous research has made progress in exploring different modes of data, such as page views, hyperlinks, and edits [5, 6], as well as focusing on specific events or event categories, such as natural disasters or sporting events [7, 8]. However, these efforts have often been limited in scope and/or have not been able to connect directly with news records [9, 10]. Therefore, there is a need for comprehensive techniques that can link specific events, integrate page view attention patterns with hyperlink network structure, and extrapolate individual events to broader topics, all simultaneously.

To address this gap, in this work we present a topic detection model for Wikipedia. Our method leverages both Wikipedia hyperlink network structure and correlations between page view time series, combined with a database of events from Wikipedia’s current events portal, to identify groups of articles that are both well connected and exhibit similar patterns of page views around individual events. These groups of articles are then connected through time to identify the recurrent topics attracting attention on Wikipedia. The resulting ‘Event Reactions’ and ‘Topics of Attention’ encompass news topics as well as attention towards background topics on Wikipedia (and can be resolved as such). These objects are revealing of various interesting features on the nature of knowledge and news recorded on Wikipedia and can act as general purpose event and topic summaries for study in future work. Our work is built around the following research questions.

  • RQ: How are current events represented in the knowledge structures and access patterns of Wikipedia and its users?

  • RQa: Can we meaningfully and robustly sample the related groups of Wikipedia articles associated with a given news event?

  • RQb: Are these groups of articles from different events related and do they form coherent topics?

We find that our approach yields communities of articles (‘Event Reactions’) that are strongly related according to both hyperlink network structure and correlated page views. When considering Event Reactions’ relationships to each other we find that they form coherent, human-validated, higher-level ‘Topics of Attention’. These topics may be resolved to volatile news topics and stable background topics, representing both facets of Wikipedia as a stable knowledge base and rapidly updating current events record. The topics themselves also qualitatively exhibit a background historical concept space, strong geographical effects, a focus on individuals, and breakout subtopics.

Related work

Historically, news and encyclopaedism have differed largely in how they distribute information. The Internet has narrowed, even brought about a convergence in, the time, format, and audience of these two initially very different knowledge exchange media. Through Wikipedia, the encyclopaedia as a format has become a highly responsive trove of information on current events, while readily accessible online news archives (occasionally from citizens’ records) comprehensively chart the events of the 21st century. News and encyclopaedic recording, together with the public’s experience of them, have never been more similar. This convergence has been thoroughly explored [11, 12], and is also reflected in reckonings on its significance to “open-source history” [13] and collective memory [14,15,16,17,18,19,20,21]. Issues of contestation and distortion of this prominent digital record of collective memory are also attracting increasing attention, for example, the cases of widespread administrator-driven far-right historical revisionism on Croatian Wikipedia [22] and distortion of Holocaust history on the English edition [23].

Theoretical work on news media has long had to wrestle with the fact that news outlets themselves, in (justifiably) selectively covering events, are not necessarily representative of current events more widely or the audiences’ views on them. Karl Erik Rosengren calls for studies that incorporate “extra-media data”—information about current events not taken directly from news media—to help address this issue [1]. In the Internet age, we may now turn to the large scale tracking of user access patterns and active constructions of repositories of knowledge for a fresh perspective on this extra-media data.

Wikipedia is of course not a news website. It is, however, used as a secondary information resource, with which individuals further research and contribute towards topics they have encountered through other news media. According to respondents to Wikimedia surveys, 13% of readers visit the site directly because of current events, and a further 30% visit due to wider media coverage [24] (Footnote 1). Perceptions of Wikipedia’s reliability have only improved since its early days, and teachers’ and lecturers’ warnings have grown increasingly futile, with various studies confirming its overall reliability on a range of subjects [25,26,27,28,29]. From bar wagers to academic papers [30], its authority as the unofficial arbiter of social facts is undeniable. As such, Wikipedia is an appealing representative site for predicting external collective behaviour, both on and offline, from search trends [31], to stock and cryptocurrency prices [2, 32], film box office performance [3], city tourism numbers [33], disease spread [4], and election results [34, 35].

Moreover, many web companies rely on Wikipedia’s content to power their own services. For example, platforms such as Facebook, Google, YouTube, Twitter, Amazon Alexa, and Apple Siri use Wikipedia in producing their own knowledge graphs, informing automated search results and infoboxes, verifying notable persons, directing their users to authoritative sources on issues of conspiracies and misinformation, and building powerful large language models [36,37,38,39,40,41]. The collective telling of events on Wikipedia, and the aggregate user behaviour in browsing it, is thus emblematic of the kind of audience-centric extra-media data required for studying news media.

Studying the news through Wikipedia article data can be considered a form of content analysis, a common technique in communication and journalism studies. Of particular interest is the task of automatic content analysis, whereby topics, trends, agendas, etc. are analysed across large corpora of news stories where manual coding is not feasible. Three main forms of this are identified in [42] and [43]: rudimentary dictionary based methods (e.g. [44]), supervised machine learning approaches (e.g. [45]), and unsupervised machine learning approaches (e.g. [46]). A novel unsupervised approach is offered by [47], in which individual news stories are clustered into ‘news story chains’ according to textual similarity (particularly with novel words), grouping individual articles and their follow-ups into single entities.

A key feature of Wikipedia that separates it from traditional encyclopaedias and news media (as well as, in truth, much of the modern social web) is that content on Wikipedia exists in a single hyperlinked network of articles. What different groups of editors construct, and how they build it, tells us more about the information itself. Prior literature has explored how these cultural differences manifest on the site [48], how the network itself drives rich collaborative environments [49], as well as how related knowledge graphs can be used for automated fact checking [50]. Previous literature typically concerns itself with events that have dedicated articles [51, 52]. However, news events can also be documented within one already existing article or across several different articles. For example, stories about one public figure making newsworthy comments about another may be separately recorded on their respective articles.

Work on the concept of spillovers is also relevant here. Even in cases where content in linked articles is not directly relevant to the focus of some exogenous shock, attention in page views and edits may still ‘spill over’ to a selection of neighbours. This has been studied in the context of current events [18, 53], but also for the effects of articles being featured on the Wikipedia main page [54], or editing campaigns [55]. Analysing the dynamics of news events in the context of their links to related topics thus clearly necessitates a network based approach.

Several papers cover methods for the detection and summarisation of news events using activity on the platform, frequently relying on bursts in edits [5] or page views [6, 56], or alternatively dedicated databases for particular event categories [8]. Collecting and comparing events from a common category can be informative in answering how event dynamics vary according to category specific parameters. However, further work is needed in comparing across categories. Efforts have also been made to enhance news event detection by combining Wikipedia with other forms of social media acting as a source [51] or filter [57] for events. Bursts of attention towards a topic seem like a natural way of selecting news events but clearly do not cover the full expressible range of event dynamics. Case studies considering edit and page view dynamics [7] or work studying breaking news events more generally [58] look to address how users’ collective attention towards articles situated in a wider topic network evolves, and how it can drive the collaborative editing activities that shape the content and structure of knowledge on the website. News events are rapidly covered on Wikipedia, yet they can have a lasting impact on article content and network structure, incrementally contributing to the collective knowledge base of Wikipedia.

In doing so, individual events shape and integrate into the wider topics represented on Wikipedia. Work that approaches the task of identifying and summarising these topics can use semantic information [59, 60], the article network structure [56, 61], category tags [62, 63], and page view patterns [9, 10, 56]. Several of these approaches are also language agnostic [9, 10, 56, 64], or even multi-lingual [59, 60]. Most notable of these approaches is that of [9, 10, 56] whose language agnostic community detection model incorporates correlated page views together with article network structure. However, in cases where larger datasets are used (such as the topic-level analyses) it is frequently the case that properties such as page views are studied independent of any explanatory description, with any detected interesting features such as peaks later being ascribed meaning by the researcher(s) (likely to some external event) [65]. There is an important distinction between starting from the point of current events and understanding their dynamics, rather than observing particular dynamics (such as bursts and anomalies) and later attempting to relate them to news events. Firstly, sampling-wise we may only select for particular dynamics when adopting the latter approach (as already touched upon). Secondly, immediately linking to news events assists in later stages in the interpretation of results.

Wikipedia data

The three primary classes of Wikipedia data used are information on events from the Wikipedia Current Events Portal [66], data for the article network of Wikipedia (i.e., the article names and what hyperlinks exist between them), and time series data for the daily page views to each article. Supplementary data on Wikipedia redirects is also used. Further detail on the data and how it is obtained may be found in Appendix A. All code and data are available through the WikiNewsTopics GitHub repository (Footnote 2).

Current events portal

The Wikipedia Current Events Portal is a daily archive of events as recorded by Wikipedia editors. Whilst records are in English, coverage is (nominally) of global events of international interest. Events are sorted into 10 categories and are recorded with a summary sentence containing links to relevant articles. A partial snapshot is displayed in Fig. 1. We scrape the page to sample a full year of events from 1st December 2017 to 30th November 2018. Initial data gathered for each event includes the date, category, full text description, and the linked Wikipedia articles (henceforth referred to as “core articles”) in each description. For example, in Fig. 1, the final event on 01/04/2017 in the ‘Disasters and Accidents’ category is described as “Authorities cannot contact the South Korean cargo freighter Stellar Daisy. It is believed that the ship sunk off the coast of Uruguay”. We extract the linked pages South Korea (displayed text does not have to match article title), Stellar Daisy, and Uruguay as the core articles for this event. In total, 7,919 events are gathered from this year-long period.
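As an illustration of this extraction step, the sketch below fetches the wikitext of a daily portal subpage through the MediaWiki API and pulls out the linked ‘core articles’ for each bullet. It assumes the daily subpages follow the ‘Portal:Current events/2018 April 1’ naming convention and that each event is a bullet item; this is a simplified stand-in for the scraper used here, not the exact code.

```python
import re
import requests

API = "https://en.wikipedia.org/w/api.php"

def daily_event_links(date_title):
    """Fetch the wikitext of a daily Current Events subpage and extract the
    [[wikilink]] targets used as 'core articles' for each bullet item."""
    params = {
        "action": "query",
        "prop": "revisions",
        "rvprop": "content",
        "rvslots": "main",
        "titles": date_title,
        "format": "json",
        "formatversion": 2,
    }
    page = requests.get(API, params=params).json()["query"]["pages"][0]
    wikitext = page["revisions"][0]["slots"]["main"]["content"]
    events = []
    for line in wikitext.splitlines():
        if not line.startswith("*"):          # assume one bullet per event line
            continue
        # [[Target|display]] -> Target ; [[Target]] -> Target
        links = [m.split("|")[0] for m in re.findall(r"\[\[([^\]]+)\]\]", line)]
        if links:
            events.append({"description": line.lstrip("* "), "core_articles": links})
    return events

# illustrative usage (subpage name is an assumption about the naming scheme)
# print(daily_event_links("Portal:Current events/2018 April 1")[:2])
```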

Fig. 1
figure 1

A snapshot of the Wikipedia current events portal. Live version available at https://en.wikipedia.org/wiki/Portal:Current_events

Clickstream networks

For the Wikipedia network data, we use dumps from the Wikipedia Clickstream [67]. The Wikipedia Clickstream contains monthly aggregated counts of the number of times links are accessed on Wikipedia, and crucially where from, in (referrer, resource)—equivalently (source, target)—pairs. Only hyperlinks clicked more than 10 times in a month are included in the dataset. We only include hyperlinks between Wikipedia articles, excluding links from sources external to Wikipedia and links from Wikipedia’s Main Page. Essentially, this represents an edge list that forms a directed, weighted network of monthly navigation between Wikipedia articles. Clickstream data for the English Wikipedia was downloaded for November 2017–December 2018, allowing a one month buffer for studying events at the start and end of the time period of study.
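As an illustration, a minimal sketch of this parsing step is given below, assuming the published tab-separated layout (prev, curr, type, n) of the clickstream dumps; the file name is illustrative rather than a reference to the exact files used here.

```python
import gzip
from collections import defaultdict

def load_clickstream(path):
    """Parse a monthly clickstream dump (tab-separated: prev, curr, type, n)
    into a weighted, directed edge list of article-to-article navigation,
    keeping only true links and dropping traffic from the Main Page."""
    edges = defaultdict(int)
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            prev, curr, link_type, n = line.rstrip("\n").split("\t")
            if link_type != "link" or prev == "Main_Page":
                continue
            edges[(prev, curr)] += int(n)
    return edges

# illustrative usage
# edges_2018_03 = load_clickstream("clickstream-enwiki-2018-03.tsv.gz")
```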

Table 1 A summary of the Wikipedia data used. Further detail available in Appendix A

Page view time series

Page view data at hourly granularity for all articles is downloaded from the Wikimedia data dumps [68]. We identify the networks of articles linked to the entries in the Current Events Portal for which page view time series are required (more details in Sect. 4.1), and process the raw compressed time series data into the more accessible HDF5 format. This data was downloaded for the period November 2017–December 2018.

Redirects

Wikipedia article redirects are not resolved in all data sources. However, we must account for their important role in Wikipedia’s structure and in shaping traffic [69]. Wikimedia API calls for redirects [70] were used to create a mapping of (1) the ‘correct’ article title each page redirects to (where applicable), and (2) all other names that redirect to that article. Page views for individual articles were then calculated by summing over their groups of redirects. When mapping the page view data to the articles in the clickstream data, this guarantees correct correspondence. The different forms of data are summarised in Table 1.
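A minimal sketch of this redirect handling is given below, assuming the standard MediaWiki query API; the function names and the dictionary-of-lists page view layout are illustrative rather than the exact pipeline used here (in practice, titles must also be batched to respect API limits).

```python
import requests

API = "https://en.wikipedia.org/w/api.php"

def redirect_map(titles):
    """Map each input title to its canonical target by resolving redirects
    server-side. Titles that are not redirects map to themselves."""
    params = {
        "action": "query",
        "titles": "|".join(titles),
        "redirects": 1,
        "format": "json",
        "formatversion": 2,
    }
    data = requests.get(API, params=params).json()["query"]
    mapping = {t: t for t in titles}
    for r in data.get("redirects", []):
        mapping[r["from"]] = r["to"]
    return mapping

def aggregate_views(view_series, mapping):
    """Sum daily page view series over redirect groups. `view_series` maps a
    title to a list of daily counts of equal length."""
    out = {}
    for title, series in view_series.items():
        target = mapping.get(title, title)
        if target not in out:
            out[target] = [0] * len(series)
        out[target] = [a + b for a, b in zip(out[target], series)]
    return out
```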

Methods

Building Event Networks

Here we detail the exact pipeline by which we generate a network of related articles and associated page view time series related to each event (the ‘Event Networks’), analyse these for communities of articles representing distinct content and dynamic based ‘Event Reactions’, and finally cluster these communities (the ‘Event Reactions’) based on overlapping constituent Wikipedia articles to identify ‘Topics of Attention’. These concepts are the key levels of analysis in this work, and are more clearly defined as follows:

  • Event Network: The hyperlink network of Wikipedia articles and associated page view time series related to a particular news event.

  • Event Reaction: A community of articles within a single Event Network that are relatively strongly linked and receive correlated patterns of page views.

  • Topic of Attention: A cluster of Event Reactions from different events, grouped according to common constituent Wikipedia articles.

Firstly, entries from the Current Events Portal must be related to networks of Wikipedia articles and page view data. The process to generate ‘Event Networks’ runs as follows:

  • For each event:

  • Scrape event data from Current Events Portal.

  • Resolve redirects of ‘core articles’ linked in event description.

  • Use clickstream data to create a network of all articles that link to, or are linked to by, the core articles, as well as all links between these articles, over a window of 61 days centred on the recorded event date. Edge weights are a weighted average of the monthly click totals (weights based on the fraction of the 61 day window in each month). A minimal sketch of this construction is given after this list.

  • Keep all edges with weight \(>100\) (i.e. removing infrequently clicked edges) and remove any isolated nodes. This is done primarily due to computational speed and memory constraints, keeping the graphs sparse for the later community detection stages.

  • Collect all article names in the networks, with all redirects, and the period of time they are in the news and require page view data for.

  • Process the page view data, keeping data for all required article names, with redirects, over the required time periods.

  • Assign 61-day time series (30 days before/after event date) to each event for each article in the respective networks.
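The following is a minimal sketch of the Event Network construction step referenced above, assuming the clickstream has already been parsed into monthly {(source, target): clicks} dictionaries; the function and variable names are illustrative rather than the exact pipeline code.

```python
from datetime import timedelta
import networkx as nx

def event_network(core_articles, event_date, monthly_edges, min_weight=100):
    """Build the 61-day Event Network around an event: the core articles,
    their in- and out-neighbours in the clickstream, and all links among
    these articles. `monthly_edges` maps (year, month) to a dict
    {(source, target): clicks} parsed from the clickstream dumps."""
    window = [event_date + timedelta(days=d) for d in range(-30, 31)]
    month_frac = {}
    for day in window:                       # fraction of the window in each month
        key = (day.year, day.month)
        month_frac[key] = month_frac.get(key, 0.0) + 1.0 / len(window)

    weights = {}                             # window-weighted average clicks per edge
    for key, frac in month_frac.items():
        for edge, clicks in monthly_edges[key].items():
            weights[edge] = weights.get(edge, 0.0) + frac * clicks

    core = set(core_articles)
    nodes = set(core)                        # core articles plus their neighbours
    for src, dst in weights:
        if src in core:
            nodes.add(dst)
        if dst in core:
            nodes.add(src)

    G = nx.DiGraph()
    G.add_nodes_from(nodes)
    for (src, dst), w in weights.items():    # all links among the gathered articles
        if src in nodes and dst in nodes:
            G.add_edge(src, dst, weight=w)

    # drop infrequently clicked edges, then any articles left disconnected
    G.remove_edges_from([(u, v) for u, v, w in G.edges(data="weight")
                         if w <= min_weight])
    G.remove_nodes_from(list(nx.isolates(G)))
    return G
```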

The Event Networks encapsulate the network structure and page view dynamics in both anticipation of and response to current events. However, they are not a wholly satisfactory description when attempting to generate summary statistics for each event. Given the breadth of pages included in each network, it is not the case that all their articles will exhibit the same signals for page views or edits; many are simply unrelated in the context of the news event. As such, simple averaging techniques for network level features will likely wash out any useful information.

From Event Networks to Event Reactions

One might think that the issues with the Event Networks are a result of the sampling strategy. In abstract terms, there may well be one true signal for each news event, obfuscated by the noise of less related pages picked up in the network, or by concurrent news events involving the same pages, such that the solution is simply some filter or averaging process. We argue, on the contrary, that it is the very nature of current events that their underlying constituent concepts may show a variety of responses. These may be due to longer term effects from historical events, structural effects from related information, as well as associations with other current events, leading to a variety of different page view patterns. This may seem trivial, but it often does not explicitly emerge in research where the objects of study are specific individual hashtags, news articles, YouTube videos, etc.

The constituent Wikipedia articles relating to individual events exhibit a variety of different dynamics tied to historical, structural, and concurrent news effects. We thus propose a method to separate responses across both content (structure) and attention (dynamics), to identify which groups of articles are both well connected and exhibit similar page view time series. Simply taking clusters according to network structure ignores short term associations. On the other hand, simply taking pages with correlated responses ignores context of the related content, and could also introduce spurious associations. This approach takes both factors into account.

The chosen two-stage temporal community detection approach disentangles the different response signals across each Event Network into communities of articles termed ‘Event Reactions’. Each news event is partitioned into a handful of ‘Event Reactions’ across the different subjects represented. These ‘Event Reactions’ are then clustered with those from other news events according to common constituent articles (via a weighted Jaccard index), to detect broader topics, termed ‘Topics of Attention’. A schematic of the full process is shown in Fig. 2 (Footnote 3).

Fig. 2
figure 2

A schematic of the processing of Event Networks to Event Reactions to Topics of Attention. For each Event Network, edge weights from rolling page view Pearson correlations (\(\rho\)) are calculated between article time series. Temporal community detection extracts the Event Reactions which (through Jaccard similarity J) form the higher-level network. We then identify the Topics of Attention with another stage of community detection

Temporal community detection of signals on knowledge networks

Together with Sect. 4.1, here we tackle RQa: Can we meaningfully and robustly sample the related groups of Wikipedia articles associated with a given news event? For a given news event i, with associated network of articles \(G_{i}(V, E)\) (nodes representing articles, edges representing hyperlinks between them), there is an associated set of time series for page views towards articles \(p_n(t) \forall n \in V\) with T timesteps. To measure similarity in patterns of collective attention towards articles, we calculate Pearson correlations of these time series for all linked nodes over a rolling 7 day window, yielding \(W_{\text {edges}}\), which is a \(\Vert V\Vert \times \Vert V\Vert \times (T-6)\) dimensional array. The full correlation matrix would be represented by W, so that \(W_{\text {edges}} = W \circ (A + A^T)/2\), where A is the (unweighted) adjacency matrix of \(G_i\) and \(\circ\) denotes the Hadamard product between the matrices. We exploit the information in the sparsity of the hyperlink network to only calculate correlations for a tiny fraction of node combinations, allowing the algorithm to scale well to large networks. The operation also enriches the static network structure with temporal information. Related approaches could find use in other domains, e.g., in neuroscience to combine structural network information (DTI) with time series (fMRI) [71, 72].
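For concreteness, a minimal sketch of this edge-restricted rolling correlation is given below, assuming the page view series are held in a pandas DataFrame with one column per article; the function name and data layout are illustrative.

```python
import numpy as np
import pandas as pd

def edge_correlations(views, edges, window=7):
    """Rolling Pearson correlations between the page view series of linked
    articles only. `views` is a (T x |V|) DataFrame of daily counts and
    `edges` an iterable of (u, v) article pairs. Returns a dict mapping each
    edge to a length T-window+1 array of correlations (NaNs arising from
    constant windows are replaced with 0)."""
    corr = {}
    for u, v in edges:
        x = views[u].rolling(window).corr(views[v])
        corr[(u, v)] = np.nan_to_num(x.to_numpy()[window - 1:])
    return corr
```

Restricting the computation to hyperlinked pairs is what allows the correlation step to scale: the number of evaluated pairs grows with the number of edges rather than quadratically with the number of articles.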

\(W_{\text {edges}}\) represents an undirected (note that Pearson correlation is a symmetric measure, which motivates the removal of directedness), weighted temporal network \(G'_i(t)(V, E(t))\), on which we perform community detection to identify groups of articles that are both well connected by hyperlinks, and exhibit correlated patterns of page views.

The Leiden algorithm [73] is selected for temporal community detection. This is an extension of the popular Louvain algorithm [74] that addresses an issue whereby communities may be arbitrarily badly connected, and it also runs faster than the Louvain algorithm. Rather than standard modularity, the Constant Potts Model [75] is used as the quality function, owing to its ability to handle both positive and negative edge weights (which can in principle be observed), its readily interpretable resolution parameter, and the independence of the detected communities from the observed graph/subgraph (particularly important given the articles present are a sample of the much larger Wikipedia article network). To extract information from the temporal network, we further adapt the method proposed in [76], by considering the \(T-6\) layers of the temporal graph, and connecting the same node in successive layers by an interlayer edge with weight \(\tau =1\). By doing so, the temporal network is represented as a static, weighted network where each node appears \(T-6\) times. We optimise its Constant Potts Model with the Leiden algorithm, thereby uncovering communities that are made of nodes at multiple times. Note that this operation requires the determination of the resolution parameter. A search for this parameter with a robustness test on a 100 event sample is carried out in Appendix B, with the resolution being set to \(r=0.25\).
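A sketch of this supra-graph construction and Leiden/CPM optimisation, using python-igraph and leidenalg, is given below. It assumes edge correlation series in the format of the previous sketch; the layout of node copies per layer and all names are illustrative rather than the exact implementation.

```python
import igraph as ig
import leidenalg as la

def temporal_partition(edge_corr, nodes, n_layers, tau=1.0, resolution=0.25):
    """Build the supra-graph in which each article appears once per time
    layer, intra-layer edges carry the rolling correlation weights, and
    copies of the same article in successive layers are joined by edges of
    weight tau. The Constant Potts Model is then optimised with Leiden."""
    idx = {(n, t): t * len(nodes) + i
           for t in range(n_layers) for i, n in enumerate(nodes)}
    edges, weights = [], []
    # intra-layer edges: one per linked article pair and time step
    for (u, v), series in edge_corr.items():
        for t, w in enumerate(series):
            edges.append((idx[(u, t)], idx[(v, t)]))
            weights.append(float(w))
    # inter-layer edges joining each article to its copy in the next layer
    for n in nodes:
        for t in range(n_layers - 1):
            edges.append((idx[(n, t)], idx[(n, t + 1)]))
            weights.append(tau)
    G = ig.Graph(n=len(nodes) * n_layers, edges=edges)
    G.es["weight"] = weights
    return la.find_partition(G, la.CPMVertexPartition,
                             weights="weight", resolution_parameter=resolution)
```

Communities in the returned partition contain node copies from multiple layers; mapping these copies back to article names, and then checking for overlap with the event day and the core articles, yields the Event Reactions described below.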

For each Event Network, the obtained partition \({\textbf{P}}_i\) is comprised of a handful of communities \(C_{ij}\). Any detected communities which contain at least one of the ‘core articles’ from the descriptive text of the event, and that overlap in time with the day of the event are kept as Event Reactions—\(R_{ij}\). Each of these elements is in effect a building block of wider Topics of Attention. The discrepancy in timescales between fast-paced attention towards news events and the more slowly evolving structure of the Wikipedia article network means these topics are not necessarily reflected in solely the hyperlink structure, or solely through correlated short term page views. In addition, satisfactory temporal community detection on one network for one year over the \(\approx 6\) million English Wikipedia articles is not computationally feasible. From the 7,919 events, we obtain 7,823 Event Networks with more than 1 node and edge (since a small number of event records do not have a popular associated article), and generate 26,579 Event Reactions.

Community detection comparison

Capturing excess page views

Our objective is to collect as good a selection as possible of Wikipedia articles, representative of a particular event. We expect articles related to some current event will exhibit a heightened level of page views around the time of the event. To describe the dynamics of attention around the event, we should select the communities containing core articles that on average exhibit excess page views around the time of the event. Communities that do not contain core articles are deemed not relevant to the event—the constituent articles are not structurally connected well enough to the core articles and/or page views do not follow a similar enough pattern. In cases where a community contains a core article but does not exhibit an increase in page views, we can conclude the core article, and the rest of the community, are background articles not directly relevant to the event, and can put them aside when focusing on event page view dynamics. We can then measure how well we have captured the event with the total excess page views towards articles in the identified relevant communities. We can compare how different community detection approaches perform on this measure on each event i, defined as

$$\begin{aligned} \text {Excess}_i = \sum _{\begin{array}{c} R_{ij} \in \mathbf {P_i},\\ \max _{-1 \le t \le 1}({\widetilde{q}}_{ij}(t)) > 3 \end{array}} \sum _{k \in R_{ij}} \sum _{t'=0}^{t'=6}{p_k(t') - \text {median}(p_k(t))}. \end{aligned}$$
(1)

Here \(R_{ij}\) refers to an Event Reaction (community containing a core article) in partition \(\mathbf {P_i}\), \({\widetilde{q}}_{ij}(t)\) is the total page views towards articles in Event Reaction j, centred on the median value and scaled to the interquartile range, and \(p_k(t)\) is the page view time series of article k in Event Reaction j. Effectively, by the condition on the first summation, we only consider communities that contain a core article, and that have increased overall views around the time of the event. For each of these selected communities, we sum the page views above the median values over the first 7 days for each of the constituent articles.
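To make this computation concrete, a minimal sketch of Eq. (1) is given below, assuming each community’s 61-day article page view series has already been assembled (index 30 corresponding to the event day); the data layout and function name are illustrative rather than the exact code used.

```python
import numpy as np

def excess_views(event_reactions, core_articles):
    """Excess page views captured by an event's partition, following Eq. (1).
    `event_reactions` maps a community id to {article: 61-day np.array of
    daily views, index 30 = event day}. Only communities containing a core
    article and showing a clear rise around the event day are counted."""
    total = 0.0
    for articles in event_reactions.values():
        if not set(articles) & set(core_articles):
            continue                                   # no core article present
        q = sum(articles.values())                     # community-level series
        iqr = np.subtract(*np.percentile(q, [75, 25])) or 1.0
        q_scaled = (q - np.median(q)) / iqr            # centre and scale to IQR
        if q_scaled[29:32].max() <= 3:                 # days -1, 0, +1
            continue                                   # no rise around the event
        for p in articles.values():
            total += (p[30:37] - np.median(p)).sum()   # first 7 days from event
    return total
```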

To make an instructive comparison, we can consider how community detection approaches using just structural hyperlink information, aggregate navigational information, and our method perform on the excess views measure. For each event, we compare the captured excess views in communities from our temporal approach against those from a simple implementation of the Leiden algorithm on both the static, unweighted, “structural” hyperlink network, and a static, weighted, “navigational” network, where edge weights are set to the weighted average link clicks from the clickstream data (as used in Sect. 4.1). In both cases, resolutions were selected in a similar fashion to the temporal approach, as detailed in Appendix B. Comparing the results on the captured excess page views, the temporal approach captures at least as many excess page views as the structural approach in 72.4% of events and at least as many excess page views as the navigational approach in 72.8% of events. Taking the ratios of captured excess views (\(\text {Excess}_i^{\text {Temp}}/\text {Excess}_i^{\text {Struc}}\) and \(\text {Excess}_i^{\text {Temp}}/\text {Excess}_i^{\text {Nav}}\)) and considering the geometric mean across all events, the temporal approach captures 1.13 times the excess views of the structural approach and 1.23 times that of the navigational approach. Taking the median, the temporal approach captures 1.04 times the excess views of both the structural and navigational approaches. Our method thus better captures articles in communities relevant to current events. An example comparing the three approaches on a single event is provided in Fig. 3.

Fig. 3
figure 3

A comparison between the communities obtained through different methods. The event record in question is “2018/11/30. [[2018 Anchorage earthquake]]: A [[magnitude]] 7.0 earthquake hits Alaska, with the epicenter in [[Anchorage]]. Severe damage is reported.” (core articles indicated by square brackets). In all three approaches there is a community centred around each of the core articles ‘2018 Anchorage earthquake’ (a new article dedicated to the event), ‘Anchorage, Alaska’, and ‘Moment magnitude scale’. The structural, navigational, and temporal approaches capture 263,585, 489,538, and 546,978 excess views, respectively. The absolute page views (in terms of total and mean) increase with the temporal approach (f vs d & e), yet the scaled page view patterns remain similar (i vs g & h). This indicates that a number of additional articles with similar spikes in attention relating to the earthquake have been identified. These articles were not captured in any of the previous static communities. With a static approach, in only taking the articles that are structurally/navigationally close to the “core” articles, we may both miss where attention is being directed by this event and markedly underestimate the amount of attention towards it

Structural similarity

A follow-up task is to compare the makeup of the communities obtained from our combined network structure and page view correlation approach against those generated from a solely network structure approach. If the communities from the combined approach are no different from those of the network structure baseline, then this indicates attention dynamics in response to any news event have very little effect, and the community is well represented solely by the network. Any ‘disturbance’ by the news event is either minimal, or closely aligns with how information is already represented on Wikipedia. On the other hand, if the communities obtained from each approach are quite different, then the variation in page view dynamics among the articles in the network is important in producing the Event Reactions. Any ‘disturbance’ by the news event is of sufficient magnitude that the association between concepts related to the event is then not well represented by the relatively static network structure on Wikipedia.

The approach for each event is based on comparing the communities already obtained from the temporal network of correlated time series to those we obtain from community detection on a single-layer, unweighted graph representing solely the structure of the article network, without user attention and navigation patterns (i.e. \(G_i\) from Sect. 4.3). For each event i we run community detection on the graph \(G_i\) over the same logarithmic range of resolutions \(r \in [1.23\times 10^{-4}, 1]\) from the robustness tests (Appendix B), yielding the partitions \({\textbf{P}}'_{ir}\). Each Event Reaction from the temporal approach (\(R_{ij}\)) receives a ‘Structural Similarity’ score \(s_{ij}\). This score is defined as the maximum of the similarities between \(R_{ij}\) and each community obtained from the non-temporal approach across all resolutions, \(C'_{ikr} \in {\textbf{P}}'_{ir},\ 1.23\times 10^{-4} \le r \le 1\). Thus,

$$\begin{aligned} s_{ij} = \max _{\begin{array}{c} C'_{ikr} \in {\textbf{P}}'_{ir},\\ 1.23\times 10^{-4} \le r \le 1 \end{array}} J_w(R_{ij}, C'_{ikr}), \end{aligned}$$
(2)

where \(J_w(x, y)\) is the Jaccard similarity between communities x and y, weighted by the PageRank scores [77] of nodes in the subgraphs x and y [78] (weighting more important articles to the community more highly). This takes into account both the content of each community in terms of Wikipedia articles, and the relative importance of said articles within their respective communities.
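To make the similarity measure concrete, the following is a minimal sketch of the PageRank-weighted Jaccard similarity \(J_w\), assuming each community is available as the subgraph induced by its articles (networkx is used for PageRank); names are illustrative rather than the exact code.

```python
import networkx as nx

def weighted_jaccard(G_x, G_y):
    """Weighted Jaccard similarity between two communities, each given as the
    subgraph its articles induce. Node weights are PageRank scores within the
    respective subgraph, so articles central to a community count for more."""
    wx = nx.pagerank(G_x)
    wy = nx.pagerank(G_y)
    names = set(wx) | set(wy)
    num = sum(min(wx.get(n, 0.0), wy.get(n, 0.0)) for n in names)
    den = sum(max(wx.get(n, 0.0), wy.get(n, 0.0)) for n in names)
    return num / den if den else 0.0
```

In Eq. (2), \(s_{ij}\) is then simply the maximum of this similarity between \(R_{ij}\) and every community from the structural partitions across the resolution range; the same similarity later provides the edge weights of the higher-level network of Event Reactions.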

The Structural Similarity score describes how dependent the observed community \(R_{ij}\) is on variation in short term correlated attention dynamics in an Event Network, compared to the longer term network structure. If all page view time series were uniformly correlated, we would expect \(s_{ij} \approx 1\), i.e., all edge weights would be approximately equal, and the community detection is more reliant on the presence/absence of edges. If on the other hand a subset of articles receive strongly correlated page views, uncorrelated with the page views to other articles, we would expect \(s_{ij} \approx 0\), i.e., community detection is more dependent on edge weight than simply the presence/absence of an edge. The distribution of s across all Event Reactions is shown in Fig. 4. We observe a range of behaviours; a prominent mode with relatively low structural similarity (i.e. page views are important), a broader intermediate mode (page views have some effect), and finally the sharp mode around \(s=1\) (page views have little to no effect).

Fig. 4
figure 4

Distribution for the structural similarity scores of all Event Reactions

Higher-level Topics of Attention

Over all events, we now have a collection of Event Reactions. Many of these will be related through covering different stages of the same continuous event (e.g. different rounds of the FIFA World Cup), or through the re-emergence of events and news topics in time (e.g. updates about the Mueller investigation, or new natural disasters). We now turn to RQb: Are these groups of articles from different events related and do they form coherent topics? We seek to identify the recurring groups of Wikipedia articles associated with news events—the topics that are represented. Event Reactions from different events that are made up of broadly the same collection of Wikipedia articles are representative of a wider concept receiving repeated news exposure. We look to quantify the similarity between Event Reactions and use this to find the more closely related groups that represent Topics of Attention. We construct a higher-level network H(V, E) of all Event Reactions (nodes). Edge weights are set as the weighted Jaccard similarity [78] between the sets of articles of each Event Reaction, weighted by their PageRank centrality in their respective networks [77], indicating similarity in content (and weighting more important articles to the concept more highly). This network contains all recorded instances of Event Reactions in the sample, representing their relation to one another over the course of one year. To identify the Topics of Attention (groups of related Event Reactions) we run a further stage of community detection over this network H(V, E), using the Leiden algorithm with the Constant Potts Model as before. Whilst the nodes in the network represent snapshots of events centred on different points in time, H(V, E) is not a temporal network. The resolution parameter is set at \(r=0.067\), according to the robustness test set out in Appendix C. This process yields a partition of communities that are the Topics of Attention, which we go on to label, validate, and explore.

Topic labelling and validation

In line with literature on news values and newsworthiness [79,80,81,82], the Topics of Attention were sorted by several features detailed in Table 2, with the top topics across each feature manually labelled. Several of these are based on the constructed time series \(W_{ij}(t)\) for each Event Reaction (\(R_{ij}\)). This is a sum of the daily page views to each article k in an Event Reaction (\(p_k(t)\)), weighted by its PageRank centrality (\(w_k\)) in the network they form:

$$\begin{aligned} W_{ij}(t) = \sum _{k\in R_{ij}} w_k p_k(t). \end{aligned}$$
(3)

The time series is then centred in time on the maximum value occurring within \(\pm 1\) day of the recorded date of the event. For a Topic of Attention \(A_\alpha\) with constituent Event Reactions \(R_i\) (i acting as an index for the Event Reactions in \(A_\alpha\) and no longer referring to a specific event) and their associated time series \(W_i\), the average prominence, magnitude, and deviance of a topic are then accordingly

$$\begin{aligned} \text {Prominence}_\alpha = \frac{\sum _{W_i \in A_\alpha }{\text {median}(W_i(-30, -29,\ldots , 0))}}{\Vert A_\alpha \Vert }, \end{aligned}$$
(4)
$$\begin{aligned} \text {Magnitude}_\alpha = \frac{\sum _{W_i \in A_\alpha }{W_i(0) - \text {median}(W_i(-30, -29,\ldots , 0))}}{\Vert A_\alpha \Vert }, \end{aligned}$$
(5)
$$\begin{aligned} \text {Deviance}_\alpha = \frac{\sum _{W_i \in A_\alpha }\frac{W_i(0) - \text {median}(W_i(-30, -29,\ldots , 0))}{\text {median}(W_i(-30, -29,\ldots , 0))}}{\Vert A_\alpha \Vert }. \end{aligned}$$
(6)

Intuitively, prominence corresponds to how popular the subject(s) of an event or topic are before a particular event takes place. Magnitude corresponds to the absolute attention towards a given event when it does occur. Finally, deviance corresponds to unexpected event popularity, or how shocking a given event is, relative to its typically fairly unpopular subject matter.
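For concreteness, a minimal sketch computing these three topic features (Eqs. (4)–(6)) is given below, assuming each constituent Event Reaction’s weighted series \(W_i\) has been aligned as a 61-day array with index 30 corresponding to the event day; the names are illustrative.

```python
import numpy as np

def topic_features(topic_series):
    """Average prominence, magnitude, and deviance for one Topic of Attention.
    `topic_series` is a list of aligned 61-day weighted page view arrays W_i,
    with index 30 corresponding to the event day."""
    prom, mag, dev = [], [], []
    for W in topic_series:
        baseline = np.median(W[:31])        # median over days -30..0
        prom.append(baseline)
        mag.append(W[30] - baseline)
        dev.append((W[30] - baseline) / baseline if baseline else 0.0)
    n = len(topic_series)
    return sum(prom) / n, sum(mag) / n, sum(dev) / n
```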

Table 2 A summary of the features with which we sort and examine the Topics of Attention

Two coders independently manually labelled a subset of 65 of the Topics of Attention by examining the constituent Wikipedia articles for each topic and the news events most associated with them (five events initially, with the option to see more). The labelling interface is shown in Appendix D. This set of topics was selected by taking the top 20 topics across each feature in Table 2 (some topics appear more than once across the four features, hence the total of fewer than 80). Each coder was then presented with the combined list of labels and independently tasked with identifying where there was ‘strong agreement’, ‘partial agreement’, or ‘weak/no agreement’ between labels. For the topics, 72.3% of labels were in unanimous strong agreement, 7.7% in strong-partial agreement (i.e., one coder ranked as strong agreement, and one as partial agreement), 15.4% in unanimous partial agreement, and 4.6% in weak/no agreement. This procedure demonstrates the validity of the interpretable topics. For the purpose of display in figures and tables, in cases where there was not unanimous strong agreement between coders, the first coder’s labels are used.

Table 3 Top topics by certain measures (min 10 events). Colour indicates quartile of structural similarity score, from red=bottom quartile to green=top quartile. Symbol indicates labelling agreement. No symbol: Unanimous Strong, *: Strong-Partial, **: Unanimous Partial, \(\dag\): No agreement

Results and discussion

Studying the contents of the emergent Topics of Attention in Table 3 reveals various interesting details on current events as recorded on Wikipedia. Identified features include: a background concept space, strong geographical effects (including a heavy Anglosphere/US focussed bias), a focus on individuals, and breakout subtopics.

Background concept space

Several of the top Topics of Attention by number of associated events (Countries, Global Cities, Tropical Storms, etc.) are those of lasting historical context. These topics also typically have high structural similarity—attention towards the topic is correlated with its structural composition on Wikipedia. Whilst the Event Networks are sampled from a record of current events, much of the related content builds on, and is widely considered part of, long established knowledge. This supports the case that news events contribute to longer term narratives.

Strong geographical effects

The Topics of Attention are strongly characterised by geography. Many of the labelled topics are specified by the region they are relevant to. It is also clear that, when incorporating the structure of the knowledge graph and attention, many of the most prominent topics are Anglosphere focussed. This is of course partly a consequence of studying the English Wikipedia. However, the current events portal’s nominal aim, and that of English Wikipedia as a whole, is to objectively cover global events and knowledge—something it still falls short on. Topics relating to the US and UK are covered with far higher granularity than those relating to other countries. That is, there are several top topics related to the intricacies of US politics, yet other countries typically have all related news summarised within a single topic. Considering the topic labels with some geographic link, 46% are Anglosphere based (primarily US, with a handful UK and Australia based). This is not entirely surprising, given prior work on Wikipedia biases [83,84,85,86] as well as this work’s focus on the English language Wikipedia. Nevertheless, this further validates assertions that rather than being the “sum of all human knowledge” [87], Wikipedia (in its various languages), through its content, structure, and access patterns, is highly sensitive to its cultural setting.

Focus on individuals

Several of the labelled topics are focussed on, or strongly feature a powerful individual (e.g. Trump family, Putin & Russian Politics, Musk and Tesla). This points to an audience sensitivity towards people that can be related to or reviled and is reflective of findings on the news values of celebrity/power elite [82].

Breakout subtopics

There are several cases where topics may be strongly related, yet one cluster achieves enough breakout popularity to distinguish itself from the original topic. These could correspond to the well studied phenomenon of “media storms” [88], whereby there is intense media focus on a single issue. An example of this is the topic for North Korea–South Korea relations—representing an overview of related articles—and the Korean Conflict—which is the subject of more intense audience focus around events, as indicated by the differing structural similarity scores. On top of this, the Korean Conflict topic has higher prominence, magnitude, and deviance than the North Korea–South Korea relations topic. Another example is the broader US Politics topic compared to the US political houses or current US administration topics. The former represents stable knowledge attracting attention around the topic and the latter represent new, more unusual combinations of articles more closely associated with current events. This may be a consequence of the choice of Jaccard similarity for the higher-level graph edge weights, where news events create strong, synchronous deviations from typical page view behaviour across a very small group of articles that are still related to a wider group. Since the number of deviating articles for the individual event is small, the edge weight through Jaccard similarity to other related events with a larger set of articles is also small, leading to it not being included in the Topic of Attention. An alternative similarity metric such as the overlap (Szymkiewicz–Simpson) coefficient could account for this effect, though using it would likely smooth over any breakout clusters, which are interesting features unto themselves.

Further remarks

The qualitative discussion of results and exploration of content is important in contextualising findings in further work on Wikipedia’s coverage of current events, as well as Wikipedia’s status as an ‘independent’ data source for news media. Beyond simply being indicators of the notable issues in the news over a year, the detected Topics of Attention and their properties are demonstrative of the assertion that there is a disconnect between the ‘editor’s’ Wikipedia and the ‘page viewer’s’ Wikipedia. The central tension of Wikipedia as both a slow moving encyclopaedic knowledge base and a fast moving current events record is displayed in the ways the topics are constructed. In the first mode, the audience’s access patterns align with the established structure of knowledge on Wikipedia. The truly interesting mode occurs when the audience’s attention does not align with the article network and in effect establishes its own communities of related articles. This collective behaviour is what stretches Wikipedia both towards updating its content and remaining a popular information resource, and away from its traditional encyclopaedic grounding.

There are several limitations to the methods proposed here. The first is that this is a single language study. There have been a number of articles on the varying content, coverage, and use of different language Wikipedias based on linguistic, cultural, and national focusses [48, 89,90,91]. One could contend that this yields a single story from one community of people editing and viewing Wikipedia. A strong Anglosphere bias is indeed observed, but we maintain that the English Wikipedia is neither the product of, nor the information tool for, a single, large, homogeneous community. [92,93,94] all observe that certain editors occupy particular roles in lending substantive expertise towards particular categories, whether that be due to identity, education, or other personal interest, and the same would be expected of regular users (to some extent also supported by [24]). In addition, whilst solely applied to English Wikipedia here, the majority of the methods used are language agnostic, and may be swiftly applied to other language Wikipedias, which may be a fruitful avenue to pursue.

The Current Events Portal is clearly not an exhaustive source of news stories, many of which would have no discernible effect on Wikipedia. Explicit editing guidelines state that “Stories added to the main portal page should be of international interest” [95]. Beyond this restriction, there are a very large number of people who regularly access information related to sports, entertainment, and popular culture, whose news stories are rarely featured on the current events portal. Celebrity deaths, for instance, have their own summary article, rather than residing on the current events portal. The topic map thus does not cover the universe of what might be considered news, and is sensitive to the contents of the news story source, the Wikipedia Current Events Portal, raising the issue of endogeneity. Sampling news events by ‘Wikifying’ [96] alternative sources such as news website RSS feeds would indeed yield a different set of events, though due to editorial decisions we arrive at a similar obstacle: there is no objective set of events.

A more thorough comparison between the topic landscapes of several different news outlets would be of interest and an immediate application of the developed methods towards agenda-setting research, yet is outside the scope of this work. An argument in favour of choosing the Wikipedia Current Events Portal is that this collaborative recording of news is representative of the collective received importance of events, incorporating what the news recording and accessing communities consider relevant. This, together with time constraints and the simplicity of selecting descriptions already formatted with Wikipedia links, resulted in the decision to concentrate on the Current Events Portal.

Conclusion

The encyclopaedic origins of Wikipedia mean it is not set up as a ready-made data source for the study of news events. Typically, long term information for each subject, involving many events, is aggregated in each Wikipedia article, as opposed to information about multiple subjects being aggregated at the event level (i.e. as on a news website). Aside from rare cases where a particularly notable event justifies its own article, events and news topics do not have an established natural representation on Wikipedia. Equally, the broadly consistent, commonly structured information available to its huge audience, as well as how this audience accesses content on current events, is too appealing to ignore. To take advantage of this, one must establish a framework for event and topic level study. To this end, we have developed an approach for event sampling and topic detection on Wikipedia, with a focus on news topics, that takes into account article network structure, dynamics, and content.

The graph supported correlation network approach towards temporal community detection successfully detects stable Event Reactions, relating both short-term dynamics of attention through page views as well as long-term knowledge structures, thus addressing RQa. We have demonstrated its utility in identifying and exploring different Event Reactions, and in their aggregate how they represent Topics of Attention, the objective of RQb. These objects of study improve upon those used in prior work for their generality across topics, usage of the knowledge network rather than focus on individual articles, explicit relation to news events, incorporation of short and long term effects, and lack of reliance on detection through particular attention dynamics. The Topics of Attention on Wikipedia exhibit a background historical concept space, strong geographical effects, a focus on individuals, and breakout subtopics. The Topics of Attention (through their constituent Event Reactions) may also be resolved to volatile news topics and stable background topics. More importantly, they represent both facets of Wikipedia as a stable knowledge base and rapidly updating current events record.

Detecting Topics of Attention using Wikipedia has proven to be a non-trivial task. It is important to encapsulate the contrasting timescales of news and existing knowledge, the many to many relationship between events and topics, together with the corresponding dynamics of attention and memory building. Through this process we gain insight into how topics are represented and accessed on Wikipedia, and into which events are considered important enough to make it into the encyclopaedic record. Finally, generating Event Reactions and wider Topics of Attention can enable more detailed event and topic level study in further work. Establishing representations of events and topics on Wikipedia, beyond individual articles, allows us to quantitatively address questions on theories of news media, collective memory, and historical recording in ways not previously possible without this kind of massive audience level data.