1 Introduction

The rapid development and adoption of information and communications technology (ICT) and digital services have major social implications. Policy-makers often struggle to design appropriate regulations, defend the rights of citizens and ensure competition. However, the policy-making process has been gradually transformed by the use of evidence-based practices (Dosso et al. 2018). Evidence-based policy prioritises the use of theoretically informed empirical analysis and data-based decision making, requiring substantial analytical capacity (Howlett 2009). Policy analysts implement various analytical techniques in the policy process, including more advanced mathematical modelling (Howlett and Wellstead 2011).

Increasing the volume and variety of processed information supports policy success (Howlett 2009). While the amount of available information is growing quickly in the Internet age, making relevant observations about future developments is challenging and requires new methods and tools for processing data. In the case of textual data, such as online documents, text mining techniques provide solutions for systematic examination (Kayser and Blind 2017). The importance of text mining for informing policy has been recognised in recent years by various political institutions. A notable example is the European Commission, which has established the Competence Centre on Text Mining and Analysis. As the founders argue, text mining and analysis tools are necessary to address not only the problem of volume, but also of timeliness, in order to provide the right information in the proper format for the decision-making process, in a variety of contexts (European Commission 2016).

The aim of our work is to present a tool that improves policy capacity by harnessing the potential of the continuous flow of online news. As a case study, we focus on social challenges related to the Internet and online services, such as privacy and the spread of fake news. The methodology supports policy formulation with problem recognition, definition and selection. Using text mining techniques, we demonstrate how early signals and trends can be explored in online news articles. The methodology enables policy analysts to identify the most important technological and social issues gaining relevance in online news, without any prior selection of topics for investigation. Furthermore, the tool highlights the connections between the identified trending issues, such as the relationships between technologies and social challenges. Therefore, policy analysts can gain a better understanding of problem causation, an important step in policy formulation. Finally, the tool also enables the analysis of emotions surrounding the identified topics with sentiment analysis, approximating whether public opinion is rather positive or negative regarding the selected issue. The various steps of the methodology support the identification of policy issues, stakeholders and public attitudes, which is a major contribution to policy analytical capacity (Howlett 2009).

Using the proposed tool, policy analysts are able to answer the following questions:

  1. Which are the most trending technologies and social challenges?

  2. What are the connections between social challenges and technologies?

  3. What is the public perception of selected social and technological issues?

Our methodology implements various text mining techniques in a sequential order. The pipeline of the analysis is presented in Fig. 1. The combination of term frequency, co-occurrence and sentiment analysis provides a funnel for information retrieval. In the first stage, policy-makers gain a high-level overview of recent trends. Next, narrower topics can be selected for more focused investigation.

Fig. 1 The combination of text mining methods

We begin the analysis by collecting news articles: using web-scraping tools, 247,500 articles published between January 2016 and December 2019 were compiled. The sources include 14 major English-language technology websites from the US, the EU and Australia.

Next, the frequencies of terms featured in the online documents are examined. Trending terms, i.e. terms with the greatest increase in frequency over time, can be filtered using regression analysis. The identified terms serve as input for further analysis. The connections between trending terms are explored using co-occurrence and sentiment analysis techniques. The co-occurrence analysis highlights pairs of terms that are most frequently mentioned together in the news articles. In order to track the public perception of issues and identify the positive and negative news stories related to a selected topic, sentiment analysis is performed.

The results confirm that the methodology has considerable potential to support informed policy-making and to decrease the lag between technological change and regulatory responses. Such a methodology has not been available to policy analysts before. The literature discussing the implementation of text mining for policy is very limited, with no studies presenting working solutions. Moreover, the methodology has significant advantages over prior efforts in the area of trend identification, a common policy-related task.

First, the method enables the automatic detection of trending technologies and issues. Unlike existing studies (Kim and Ju 2019; Yoon 2012; Bildosola et al. 2017), the process does not require the prior selection of topics or keywords. Because it does not rely on a costly and potentially biased initial filtering process (e.g. expert groups), the methodology can be easily implemented and facilitates the exploration of new and unexpected areas by policy-makers. Second, by combining various text mining methods, the presented tool is not only more robust, but also provides insights beyond highlighting trends.

The presented results are available in the form of interactive visualisations at https://policy.delabapps.eu. The raw results are stored in a Zenodo repository (https://zenodo.org/communities/ngi_forward), while the code to replicate them is available on GitLab (https://gitlab.com/enginehouse). Further analyses of the NGI Forward project are published at https://fwdmain.delabapps.eu.

2 Literature review

The OECD emphasises that the profound impact of digital transformation in the private sector has not been mirrored by equally significant changes to how policy is designed, implemented and evaluated (OECD 2019, p. 3). The authors point out the potential of digital technologies for innovative policy design and impact evaluation. In our work, we focus on a specific area of digital technologies that can support policy-making, which is text mining.

A systematic literature review on text mining in a policy context was conducted by Ngai and Lee (2016). The authors used the framework of Jann and Wegrich (2007) that established different stages of the policy-making cycle: (i) agenda setting, (ii) policy formulation and decision making, (iii) implementation and (iv) evaluation. The authors highlighted the wider adoption of text mining tools in the agenda setting stage. Poel et al. (2018) reviewed 58 big data initiatives that were implemented to shape the policy process. They concluded that while data-driven analysis has the potential to inform about the available policy options, the use of big data is lagging behind, especially in the implementation and evaluation stages of policy-making. According to Berryhill et al. (2019), various AI tools can be integrated into the entire policy-making process. They claim that machine learning tools boost public sector efficiency and help governments to make decisions. In a similar vein, Höchtl et al. (2016) emphasise that big data analysis reduces the time required to produce reports and supports truly evidence-based policy-making. Rubinstein et al. (2016) described how text mining can shed light on blind spots and complement the results obtained from traditional research during the policy-making cycle. Similarly to Poel et al. (2018), the authors stress the importance of data literacy and argue that visualisation techniques can be used to help policy-makers understand the complexity of results. Ceron and Negri (2016) demonstrated that social media data can be used to monitor the preferences of citizens and rate the available policy alternatives.

A prevalent conclusion in the reviewed literature is that the use of text mining tools in policy-making remains at an early stage. However, various case studies showed that text mining could be effectively used by policy analysts in different policy areas. For a wider use of data-driven policy-making, the public sector should adopt tools and methods that are already available.

Focusing on our goal, facilitating problem recognition and agenda setting for policy analysts, we considered tools that are transparent, easy to interpret and do not require more complicated processes, such as the use of training data. Therefore, in this literature review we summarise the text mining methods that have been frequently used for the identification and exploration of trending areas.

A relatively simple but effective method is the analysis of changes in the frequency of documents and terms. Kim and Ju (2019) examined the trajectory of selected technologies in online news and blog posts based on daily frequencies of documents. Yoon (2012) also analysed online news and focused on a set of pre-selected keywords in a defined field (solar energy). The described methodology is based on both term and document frequencies, also including a time dimension to differentiate between strong and weak signals. Albert et al. (2015) analysed blog posts to identify whether a technology is basic (mature) or pacing (emerging) based on the changing frequencies of defined lists of terms.

Several studies provide additional insight by analysing not only individual terms, but rather groups of keywords, e.g. by incorporating co-occurrences, topic modelling and other clustering techniques. Lee and Park (2018) based their study on the work of Yoon (2012), concluding that a single keyword is not sufficient to identify a topic. In order to establish the meaning of word groups, the authors improved the methodology by incorporating co-occurrences. Bildosola et al. (2017) combined the analysis of term frequencies with various forecasting techniques to explore and map emerging technologies in the area of cloud computing. Packalen and Bhattacharya (2015) examined the appearance of new terms and term sequences in patent texts to identify novel ideas and innovations. Similarly, Arts et al. (2019) assessed the novelty of patents based on the number of new words and word combinations. Lee and Jeong (2008) examined emerging areas in information security publications by hierarchical clustering of co-occurring keywords and prepared a technology roadmap for robot technology. Li et al. (2019) combined text mining techniques with expert analysis to examine trends in the field of perovskite solar cells. Kajikawa et al. (2008) tracked emerging energy research fields by clustering publications and analysing the growth in the number of publications by cluster. Niemann et al. (2017) examined patent lanes in selected areas, analysing semantic similarities in patents and applying topic modelling.

Following the analysis of trends and the exploration of their topological features, further dimensions can be examined, such as the related public debate. Sentiment analysis is a field of text mining focused on identifying the emotions expressed in documents, and it is widely used to analyse the public perception of technologies based on news articles or social media. Choi et al. (2010) identified controversial issues and related sub-topics using query generating methods and sentiment models. Ku et al. (2006) presented algorithms for opinion extraction in news articles based on concept keywords and sentiment words. Kim et al. (2006) demonstrated an approach with semantic role labelling to extract opinions in news media. Finally, Ceron and Negri (2016) used sentiment analysis to inform policy based on social media data.

Relevant to our study are the works that explore multiple aspects of emerging technologies by combining various methods. The literature on such combined approaches is very limited: existing papers mostly present a hybrid of topic modelling and sentiment analysis. Xie et al. (2018) identified emerging technologies in a selected area with topic modelling and analysed the changing sentiments in online news. Bian et al. (2016) analysed the public perception of topics related to IoT on Twitter. Similarly, Mejia and Kajikawa (2017) identified topics related to robots in newspaper articles and scientific papers, and examined sentiments. We have also experimented with topic modelling: the results of our analyses with Latent Dirichlet Allocation (Blei et al. 2003) are available in the online appendix.

The presented studies demonstrate that various text mining methods can be used to analyse trends, identify the topologies of emerging technologies, and also extract opinions and sentiments of public debate based on news articles.

Our work addresses a crucial research gap related to supporting policy with text mining. While previous works highlight the potential of text mining tools to facilitate policy agenda setting, there is a lack of studies demonstrating working solutions. The presented methodology contains transparent and clear steps that provide results that are easy to interpret and do not require domain expertise in statistics or text mining.

Moreover, our study also contributes to the literature on the identification of trends and emerging technologies. First, in contrast to the existing literature, our approach does not require prior assumptions about trending areas. Instead of reducing our sample or pre-selecting keywords, the presented approach enables the automatic identification of trending terms. Such a methodology creates the opportunity to explore unexpected but highly relevant trending issues. Second, we demonstrate that the proposed combination of tools provides more insights than the use of a single method. Not only are the trending areas identified (term frequencies), but the wider topics and relationships are also established (co-occurrences), along with the public perception (sentiment analysis).

3 Dataset

Large-scale, automated analysis of news outlets has proven to be successful in predicting political events (e.g. related to the Arab Spring), in certain cases even surpassing the predictive power of traditional models of foreign policy (Leetaru and Schrodt 2013). A major advantage of online news articles is the short lag between a relevant event and the publication of the text. While we have experimented with other types of sources, such as working papers, the results suggested that news articles are more suitable for tracking developments in areas of policy interest. Therefore, similarly to other studies with a focus on forecasting (Kim and Ju 2019; Yoon 2012; Xie et al. 2018), this work is based on the analysis of online news. We have included 14 popular English-language online tech press sources in the analysis. The final list of sources has been selected in a four-stage process.

First, we identified online sources that covered and reported on early signals of technological change in the past. The date-filtering feature of Google Search was used to find articles that covered, at an early stage, technologies and business models that are prominent today. Additionally, the Google Trends tool was used to identify the particular time periods when various technologies first appeared in online news. These periods with early signs are called the innovation trigger stage in the hype cycle literature, i.e. a time when awareness about the technology starts to spread and attracts first media coverage (Dedehayir and Steinert 2016). A commonly used hype cycle model, introduced in the 1990s by the Gartner corporation, explains the evolution of a technology in terms of expectation or public visibility (y-axis) in relation to time (x-axis). The curve of hype around the blockchain technology, proxied by its online search popularity, is presented in Fig. 2. The figure shows that the first mentions are likely to originate from the period 2010–2012. The chosen technologies were under intensive development (e.g. autonomous vehicles) or in the pursuit of practical application (e.g. blockchain) during 2018. The keywords used to identify relevant sources included “IoT”, “virtual reality”, “blockchain”, “bitcoin”, “smartwatch” and “sharing economy” (articles from January 2010 to January 2012), as well as “autonomous cars” and “big data” (news from January 2007 to January 2008). This process helped us to identify around 25 sources that published articles at the beginning of the 2010s on a set of promising technologies.
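For illustration, the query popularity underlying Fig. 2 could also be retrieved programmatically. The following is a minimal sketch, assuming the third-party pytrends library (not part of the study's documented toolchain) and an illustrative keyword and time range:

```python
# Minimal sketch of retrieving query popularity similar to Fig. 2.
# Assumes the third-party pytrends library (pip install pytrends); the original
# source-selection step may have used the Google Trends website directly.
from pytrends.request import TrendReq

pytrends = TrendReq(hl="en-US", tz=0)
# Popularity of the "blockchain" query around its assumed innovation trigger period
pytrends.build_payload(["blockchain"], timeframe="2010-01-01 2015-12-31")
interest = pytrends.interest_over_time()  # relative search interest on a 0-100 scale
print(interest["blockchain"].head())
```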

Fig. 2 Blockchain query popularity. Source: own elaboration using Google Trends data

Second, the initial list was supplemented with a set of other relevant sources reporting on technologies (e.g. IEEE Spectrum) or regulatory issues (e.g. Politico Europe). Additionally, high quality non-US sources (e.g. Euractiv, The Guardian) were included in order to counterbalance the dominant American tech perspective in the analysis.

Third, media outlets that covered tech news also from a regulatory and social perspective were prioritised. Conversely, sources with a greater focus on consumer electronics or enterprise IT were deprioritised.

Finally, some selected sources were excluded from the study for technical reasons, such as a paid subscription business model. The final list of analysed sources, with the number of articles and the location of their headquarters, is presented in Fig. 3.

The articles have been collected for a period of four years, between January 2016 and December 2019. The collection process was conducted with the use of web-scraping tools. Web-scraping scripts are designed to recognise different types of content, and to extract and store only the ones specified by the user (Ignatow and Mihalcea 2017). In this study, a separate script has been written for each individual source in the Python programming language, using the web automation framework Selenium WebDriver. For the collection of the full text of articles, the Newspaper3k tool was also used for a number of sources. In the case of the Guardian, articles were accessed via the Guardian Open Platform API. The categories of scraped news sections are included in the Appendix (Table 3).
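For illustration, the full-text extraction step with Newspaper3k might look like the following minimal sketch; the URL and the record fields are placeholders, and the actual per-source Selenium scripts are more involved:

```python
# Minimal sketch of full-text article extraction with Newspaper3k.
# The URL below is a placeholder; the study's per-source Selenium scripts also
# handle navigation, pagination and the selection of relevant news sections.
from newspaper import Article

url = "https://example.com/some-tech-article"  # placeholder URL
article = Article(url)
article.download()   # fetch the raw HTML
article.parse()      # extract title, text, publication date, authors

record = {
    "source": "example.com",
    "title": article.title,
    "date": article.publish_date,
    "text": article.text,
}
```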

As a result, a unique dataset of almost 250 thousand articles was created. The sources vary greatly in terms of the number of articles. The most abundant sources, including Techcrunch, ZDNet and the Register, together constitute around 50% of the dataset. On the other hand, the three smallest sources (Euractiv, The Conversation and Politico) account for only around 3% of all articles.

In terms of location, US sources are the most prevalent in the dataset, comprising 70% of articles, followed by sources from the UK (27% of articles), Belgium (2% of articles) and one source located in Australia (1% of articles) (see Fig. 3).

Fig. 3 Number of articles per source

4 Methodology

4.1 Term-frequencies and regression analysis

Terms have been transformed to their stemmed form (for example, from "elections" to "elect") using the SnowballStemmer from the Natural Language Toolkit (NLTK) package. For readability purposes, all tables present terms in their human-readable form (e.g. "election" instead of "elect"). The most common bigrams of stemmed unigrams have been identified using Phraser from the Gensim package (Řehůřek and Sojka 2010).
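A minimal sketch of this preprocessing step is given below; the toy corpus and the Phrases parameters are illustrative and not the study's actual settings:

```python
# Minimal sketch of the preprocessing described above: Snowball stemming of
# tokens and detection of frequent bigrams with Gensim's Phrases/Phraser.
from nltk.stem.snowball import SnowballStemmer
from gensim.models.phrases import Phrases, Phraser

stemmer = SnowballStemmer("english")

# Toy corpus: each article is already a list of lower-cased tokens
articles = [
    ["the", "elections", "were", "influenced", "by", "fake", "news"],
    ["fake", "news", "spread", "widely", "before", "the", "elections"],
]

stemmed = [[stemmer.stem(token) for token in doc] for doc in articles]

# Detect frequent bigrams; min_count and threshold are illustrative values
phrases = Phrases(stemmed, min_count=1, threshold=1)
bigram = Phraser(phrases)
docs_with_bigrams = [bigram[doc] for doc in stemmed]  # e.g. "fake_news"
```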

For each term and source, the average monthly term frequency was obtained by dividing the number of occurrences of the term by the number of occurrences of all terms. Afterwards, the average of term frequencies weighted by source has been calculated. Weights have been assigned to ensure that no source has excessive influence on the final results due to its number of articles and to maintain a relative balance between American and other sources.
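A minimal sketch of this frequency computation follows; the counts and the source weights are illustrative and do not reproduce the weighting scheme actually used in the study:

```python
# Minimal sketch of the weighted monthly term frequency described above
# (toy counts and illustrative source weights).
import pandas as pd

counts = pd.DataFrame({
    "month":  ["2019-11", "2019-11", "2019-12", "2019-12"],
    "source": ["source_a", "source_b", "source_a", "source_b"],
    "term":   ["5g", "5g", "5g", "5g"],
    "count":  [120, 40, 150, 60],
})
totals = pd.DataFrame({
    "month":  ["2019-11", "2019-11", "2019-12", "2019-12"],
    "source": ["source_a", "source_b", "source_a", "source_b"],
    "total":  [100000, 20000, 110000, 25000],  # occurrences of all terms
})
weights = {"source_a": 0.5, "source_b": 0.5}   # assumed source weights

df = counts.merge(totals, on=["month", "source"])
df["frequency"] = df["count"] / df["total"]    # per-source term frequency
df["weight"] = df["source"].map(weights)

# Weighted average of per-source frequencies for each term and month
weighted = (
    df.assign(weighted_freq=df["frequency"] * df["weight"])
      .groupby(["term", "month"])["weighted_freq"]
      .sum()
)
print(weighted)
```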

For all terms which occurred at least once in the last two months of the analysis, an ordinary least squares regression has been performed for the entire time period, as well as for the last 3, 6 and 12 months. The dependent variable of the estimation is the weighted frequency, while the number of months since the beginning of the analysed period is the independent variable. The result is a single coefficient \(\beta\) (referred to as coef). The terms with the highest coef values have grown the most. However, the top growing words are always stopwords (the, a, and, were, etc.) due to their sheer number of occurrences. Most lists of stopwords are not domain-specific: NLTK’s list does not include words such as “internet”, which should be regarded as a stopword in modern technological media. Instead of creating a domain-specific stopwords list, we divided coef by the mean weighted frequency in all months of the regression. The resulting normalised coefficient (coef_norm) can be used to winnow out irrelevant terms by setting a threshold a term needs to reach to be included in further analysis. The threshold has been set to 0.0125, a value high enough to remove stopwords (including domain-specific ones), but low enough to allow the capture of early signals of new technologies and quickly growing established topics.
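A minimal sketch of the trend estimation and normalisation is given below; the frequency series is illustrative, while the 0.0125 cutoff is the threshold stated above:

```python
# Minimal sketch of the OLS trend estimation and normalisation described above.
import numpy as np

def trend_coefficients(monthly_freq):
    """monthly_freq: weighted term frequencies ordered by month (oldest first)."""
    y = np.asarray(monthly_freq, dtype=float)
    x = np.arange(len(y))                 # months since the start of the period
    coef = np.polyfit(x, y, deg=1)[0]     # OLS slope of frequency on time
    coef_norm = coef / y.mean()           # slope relative to the mean frequency
    return coef, coef_norm

# Toy series for a growing term
coef, coef_norm = trend_coefficients([0.00010, 0.00012, 0.00015, 0.00020, 0.00026])

THRESHOLD = 0.0125                        # coef_norm cutoff from the text
is_trending = coef_norm > THRESHOLD
```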

The 1000 most significantly growing terms (those with the largest coef values that are above the coef_norm threshold) have been reviewed, and the relevant terms for further analysis have been selected.

4.2 Co-occurrence analysis

For the terms chosen in the previous part (“analysed terms”), the most common “co-occurring” terms out of the top 15,000 most significantly growing terms have been calculated. Terms co-occur if they are present in the same article, not necessarily in the same sentence or paragraph. For a given source, the number of occurrences of the co-occurring term in all articles containing the analysed term has been counted. This number of co-occurrences was then divided by the number of occurrences of the analysed term in all articles in the source. Sources have been aggregated using weights, just as frequencies were. As a simplified example, word A occurs 3500 times in one source, word B occurs 400 times in the same source, and word A occurs 100 times in the source’s articles which contain word B. When analysing the terms that most co-occur with term B, term A has a co-occurrence index for this source of \(\frac{100}{400} = 0.25\).
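The worked example above translates directly into code; a minimal sketch with the example's numbers and illustrative per-source indices and weights:

```python
# Minimal sketch of the per-source co-occurrence index from the example above:
# occurrences of term A in articles containing the analysed term B, divided by
# the total occurrences of B in that source.
def cooccurrence_index(cooccurrences_with_analysed, occurrences_of_analysed):
    return cooccurrences_with_analysed / occurrences_of_analysed

# Numbers from the worked example: A appears 100 times in articles containing B,
# while B appears 400 times in the source overall.
index_a_given_b = cooccurrence_index(100, 400)   # 0.25

# Aggregation across sources mirrors the frequency step: a weighted average of
# per-source indices (illustrative values).
per_source = {"source_a": 0.25, "source_b": 0.10}
weights = {"source_a": 0.5, "source_b": 0.5}
aggregated_index = sum(per_source[s] * weights[s] for s in per_source)
```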

4.3 Sentiment analysis

The same words which have been chosen for the co-occurrence analysis were selected for the sentiment analysis. The sentiment analysis has been prepared using VADER (Hutto and Gilbert 2014), an open-source rule-based sentiment analysis tool. VADER is specifically designed for social media analysis, but can also be applied to other text sources. Its sentiment lexicon was compiled using various sources (other sentiment data sets, Twitter, etc.) and validated by human input. As VADER is more robust in the case of shorter social media texts, the analysed articles have been divided into paragraphs.

All paragraphs in the articles containing the given term were modified to exclude the term and assigned a score between -1 (most extreme negative) and 1 (most extreme positive) by VADER. The removal of the term is necessary, as the term itself may not be emotionally neutral, e.g. when some technologies or companies attempt to solve a negative issue. In such a case, the scores of the surrounding text would be positive, but the negative term would bring the paragraph’s score down. We present two analyses: the average monthly sentiment of paragraphs containing the selected terms, and the co-occurring words with the most extreme average paragraph sentiment scores. In the case of the latter, for each term the 100 most co-occurring terms were selected. The sentiment was computed on paragraphs that were modified once again by removing both the analysed and the frequently co-occurring terms. Based on the average sentiment scores of paragraphs, the co-occurring terms with the most negative and positive sentiment were identified.
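A minimal sketch of the paragraph-level scoring is given below, assuming the vaderSentiment package; term removal is done here with a naive string replacement, whereas the study operates on the preprocessed texts:

```python
# Minimal sketch of the paragraph-level sentiment scoring described above.
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()

def paragraph_sentiment(paragraph, terms_to_remove):
    # Remove the analysed (and, for the second analysis, co-occurring) terms so
    # that their own polarity does not bias the score of the surrounding text
    for term in terms_to_remove:
        paragraph = paragraph.replace(term, "")
    return analyzer.polarity_scores(paragraph)["compound"]  # in [-1, 1]

# Example: score of a paragraph mentioning "facial recognition",
# with the analysed term itself excluded
score = paragraph_sentiment(
    "Civil rights groups warned that facial recognition enables mass surveillance.",
    ["facial recognition"],
)
```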

5 Results

The described methods are implemented step by step to identify and analyse trending news stories on technologies and related regulatory challenges. First, the most trending terms are filtered out based on the term-frequency analysis. Based on the list of terms with the greatest increase in frequency, policy-makers gain a high-level picture of the most important news stories. For the next steps, the user can select relevant terms for further analysis. Second, the relationships between trending terms are established with co-occurrence analysis. For the terms selected by the user, the most frequently co-occurring trending terms are identified. This analysis helps the user to map the connections in news stories between social issues, technologies, institutions and other key actors. Third, policy-makers can examine the public perception of selected topics with sentiment analysis. We showcase two types of sentiment analysis. The first enables the user to track changes in the sentiment of news articles containing selected terms. The second combines the sentiment and co-occurrence analyses, pinpointing the most positive and negative associated terms. This method facilitates a better understanding of topics, highlighting the different sides of news stories.

In order to present the value of the methods in informing policy, three case studies are analysed: privacy, information in social media, and the technology sector in China. These trending areas have been chosen due to their high regulatory and social relevance.

5.1 Identification of emerging terms

Table 1 Term coefficients

We begin the analysis with the identification of emerging topics in the examined sources. The methodology is based on the regression analysis: the regression coefficient (coef) reveals the trend of growing terms. As the value of the coefficient is heavily dependent on the average frequency of the term, it highlights relatively frequent and trending words. However, the aim of the exercise is to capture early signals of technologies and social issues that may still have low frequencies. Therefore, a normalised coefficient (coef_norm) is calculated as the coefficient divided by the average frequency of the term. This normalised coefficient is used to exclude terms that have growing but already large frequencies, such as stopwords.

Table 1 presents the results, sorting the terms by the regression coefficient coef. The results show that various technologies (5G, AI, blockchain) and political issues (China, ban, climate) gained traction in the online tech press. The top 20 words are all closely related to tech topics. This indicates that adequate sources were selected and that the topic identification methodology performs well in finding trending topics.

We have reviewed the top 1000 trending terms and summarised the most relevant ones in Table 2. The results provide an overview of the most important topics the online tech press covered in recent years. The first half of the table includes topics related to computer science and emerging technologies, while the lower section summarises various social and regulatory challenges.

The results demonstrate the importance of such technologies as AI, next generation wireless technologies, quantum computing or blockchain. Moreover, the identified terms reveal various domain-specific terms, such as mmwave from 5G technology, quantum supremacy from quantum computing or kubernetes related to cloud computing.

The identified social issues present recent regulatory challenges: e.g. online privacy, fake news and hate speech, election interference or the growing influence of China. Similarly to technologies, the analysis presents some terms outside of the mainstream: e.g. the Pegasus spyware.

Another desirable attribute of the results is that they lack buzzwords from the past. As an example, big data had been a hot topic in the past; however, this bigram has not been identified in the analysis. Instead, discussions have moved to technologies that exploit big data, such as machine learning algorithms.

In the case of policy-making it is especially crucial to be informed about early signals of social issues. For an even easier filtering of relevant topics, coef is calculated for the period of the last 3, 6 and 12 months. The table is included in the online appendix.

Table 2 Trending topics based on top 1000 terms

5.2 Co-occurrence analysis

The regression analysis served as an automated method to filter out the most relevant terms from the text corpus, providing a list of emerging issues in the tech world. The next step in exploring further details is to establish the relationships between trending terms. The analysis of co-occurrences enables us to find which emerging terms were most often mentioned together in the same article, hence finding the most relevant pairs of expressions. In the case of tech news, such a method can be used to identify the areas where a technology is applied, or connections to regulatory issues.

Terms related to privacy (the General Data Protection Regulation, GDPR; facial recognition), social media (Facebook) and the Chinese telecommunications industry (Chinese Telecom) have been selected for the co-occurrence analysis; the results are presented in Fig. 4. For all the trending terms that were mentioned in articles containing the expression of interest (e.g. GDPR), co-occurrences were calculated. The online appendix contains tables with the calculated indices for the top 30 co-occurring terms. Among these, 10 terms that provide a broad insight into the discussion around the examined issue were selected for presentation.

The co-occurrence analysis helps to unravel the news stories around the selected words, revealing related technologies, actors and institutions. As an example, the Chinese telecommunications industry was mostly mentioned in the context of such words as trade war, 5G, Huawei, Zhengfei or sanctions. These keywords describe well the news about the US administration recommending the avoidance of Huawei networking equipment due to the company's strong connections to the Chinese government (Reichert 2019).

Fig. 4 Co-occurrences for selected terms

The analysis of terms related to Facebook maps out major regulatory problems, such as the issue of privacy, the influence of social media on the democratic process (e.g. the Cambridge Analytica scandal) and the spread of misinformation. The results also highlight novel technologies (deepfakes) and business plans (the Libra project) of high regulatory relevance. The results for GDPR indicate the importance of ethical aspects of online privacy, the regulation’s impact outside the EU (FTC, China) and the role of the tech industry in respecting personal data (tech giants, Huawei). Finally, this method can also be used to gain further insights into technologies, such as facial recognition. The co-occurrences suggest not only its application in consumer products (Face ID), but also its high political and social relevance (protest, China).

5.3 Sentiment analysis

Following the exploration of the co-occurring terms, a new layer can be added to the analysis: sentiment. News stories are often polarising, and public perception evolves over time. Therefore, the changing sentiment of trending topics is examined. Additionally, news stories involve positive and negative actors and relations. By analysing the sentiment of co-occurring words, the different sides of the debate can be identified.

Figures 5, 6, 7 and 8 present the evolution of sentiment for GDPR, facial recognition, Huawei and Facebook, respectively. For each term, the 3-month moving average of sentiment was calculated from the paragraphs containing it. The values can vary between -1 (most negative sentiment) and +1 (most positive). The authors of the VADER tool (Hutto and Gilbert 2014) recommend interpreting the scores as follows (a minimal illustrative sketch of this mapping is given below):

  • positive: \(>0.05\)

  • neutral: \(>-0.05\) and \(<0.05\)

  • negative: \(<-0.05\)

Additionally, the size of the bubbles reveals the number of analysed paragraphs. In each graph, the top 3 paragraph counts are shown.
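The score thresholds above and the 3-month moving average can be applied as in the following minimal sketch; the monthly values are illustrative:

```python
# Minimal sketch: mapping compound scores to the labels above and computing
# the 3-month moving average shown in the figures (illustrative monthly values).
import pandas as pd

def sentiment_label(score):
    if score > 0.05:
        return "positive"
    if score < -0.05:
        return "negative"
    return "neutral"

monthly_sentiment = pd.Series(
    [0.12, 0.08, 0.02, -0.04, -0.09],
    index=pd.period_range("2018-01", periods=5, freq="M"),
)
rolling_average = monthly_sentiment.rolling(window=3).mean()  # 3-month moving average
labels = monthly_sentiment.apply(sentiment_label)
```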

Public perception has been rather volatile for the analysed terms. In the case of GDPR, the overall positive sentiment declined around May 2018, possibly as a reaction to the complications faced by users and businesses during the regulation’s introduction. The increase in negative news stories on Huawei is especially visible: since 2018, public sentiment has significantly declined, most probably due to issues related to cybersecurity and the conflict with the US. Similarly, news stories covering facial recognition became less positive over time. In the case of Facebook, a gradual decrease in sentiment is revealed. The first significant decline happened at the end of 2016, possibly due to the scandals related to misinformation campaigns and the spread of fake news (Goulard 2016). The second rapid decline occurred at the beginning of the Cambridge Analytica scandal in March 2018.

Fig. 5 GDPR: Sentiment over time

Fig. 6 Facial recognition: Sentiment over time

Fig. 7 Huawei: Sentiment over time

Fig. 8 Facebook: Sentiment over time

Besides tracking the positive and negative sentiments for selected topics, their different shades can be further examined by combining the co-occurrence and sentiment analyses. Figures 9, 10, 11 and 12 demonstrate that technologies and social challenges are related to numerous news stories that are described differently by the media. The figures summarise the most positive and negative co-occurrences based on articles containing both expressions, with sentiment computed on paragraphs. The terms have been selected among the 20 most positive and negative words that co-occurred in at least 100 paragraphs. The figures show the calculated sentiments, as well as the number of paragraphs.

For example, news stories on GDPR have been most positive in the context of data management, cloud computing and the California Consumer Privacy Act (CCPA), while neutral when covering the Whois system (Hern 2018), the Marriott data breach (Sweney 2019) or the TikTok social media platform.

In the case of Huawei, media coverage has been positive about smartphone technologies, and negative in the context of the scandals related to the extradition case of Huawei’s CFO in 2019. Similarly, news on the use of facial recognition in consumer products has been rather positive, while the least positive stories highlight crucial ethical and regulatory problems, such as algorithmic bias and the military and law enforcement applications of the technology.

Finally, the news stories on Facebook have been more positive in relation to technologies, while negative or neutral when mentioning the far-right, conspiracy theories and content moderation.

Fig. 9 GDPR: Positive and negative sentiments

Fig. 10 Facial recognition: Positive and negative sentiments

Fig. 11 Huawei: Positive and negative sentiments

Fig. 12 Facebook: Positive and negative sentiments

Various robustness checks were carried out to validate our methods. In order to keep the article concise, the additional analyses and materials are published in the online appendix.

6 Conclusions

This study presented a methodology for identifying trending topics in online news media, enabling a deeper exploration of technologies and related social challenges. Our methodology brings together a set of straightforward text mining methods that are easy to diagnose, tune, evaluate and interpret. The proposed sequence of methods enables the exploration of news stories at different levels of granularity. The term frequency analysis provides a bird’s eye view of emerging technologies and interrelated social issues. The co-occurrence analysis helps build the topologies of the most relevant topics. The changing public perception is tracked by the sentiment analysis. Finally, the combination of the co-occurrence and sentiment analyses is used to unravel the positive and negative stories related to a topic.

The implementation of our methodology is illustrated with the exemplary path a policy analyst can take through the results obtained for the period between January 2016 and December 2019 from 14 popular online news sources. The topic identification exercise revealed that the most trending technologies include AI, 5G, decentralised computing, blockchain and quantum computing. Among the most debated social issues we identified the content crisis and fake news, privacy, election meddling, the rising influence of China, cybersecurity, competition in the digital economy and ethical questions.

Following the presentation of the main trending topics, selected case studies were explored in greater detail, including privacy and GDPR, the Chinese tech sector and the content crisis in social media. The results of the case studies demonstrate how these tools can help policy analysts better understand topics of regulatory interest. As an example, the investigation around facial recognition showed that while its use in consumer products is gaining traction, a wide array of controversies is related to this technology, including its use by law enforcement.

Although text mining has an established literature for the identification of trending topics, we address numerous research gaps. Previous studies are narrowly specialised in terms of the applied methods and the examined technological areas. However, policy analysts need information on the overall technological landscape at the very beginning of a policy cycle. The presented methods give policy analysts tools to quickly process vast amounts of information and discover new knowledge at a low cost. The temporal dimension of the analysis enables the selection of the most relevant issues and the dismissal of overhyped topics, characterised by a sudden increase and immediate drop in public discussion. We have demonstrated that simple and explicable text mining techniques can support policy-making, especially in the agenda setting and policy formulation phases of the policy cycle. By highlighting emerging areas, the methodology has the potential to decrease the policy lag, i.e. the time between the recognition of a policy challenge and the implementation of a solution.

The raw results, documented programming scripts and interactive visualisations available on the paper's accompanying website let users explore the tech landscape from different angles. A basic programming background is sufficient for policy analysts to reproduce the results for a different set of sources and time periods.