The described methods are implemented step by step to identify and analyse trending news stories on technologies and related regulatory challenges. First, the most strongly trending terms are identified with term-frequency analysis. The list of terms with the greatest increase in frequency gives policy-makers a high-level picture of the most important news stories. For the next steps, the user can select relevant terms for further analysis. Second, the relationships between trending terms are established with co-occurrence analysis. For the terms selected by the user, the most frequently co-occurring trending terms are identified. This analysis helps the user to map the connections in news stories between social issues, technologies, institutions and other key actors. Third, policy-makers can examine the public perception of selected topics with sentiment analysis. We showcase two types of sentiment analysis. The first enables the user to track changes in the sentiment of news articles containing selected terms. The second combines sentiment and co-occurrence analysis, pinpointing the terms with the most positive and most negative associations. This method facilitates a better understanding of topics, highlighting the different sides of news stories.
In order to present the value of the methods in informing policy, three case studies are analysed: privacy, information in social media, and the technology sector in China. These trending areas have been chosen due to their high regulatory and social relevance.
Identification of emerging terms
Table 1 Term coefficients
We begin the analysis with the identification of emerging topics in the examined sources. The methodology is based on regression analysis: the regression coefficient (coef) reveals the trend of growing terms. As the value of the coefficient depends heavily on the average frequency of the term, it highlights words that are both relatively frequent and trending. However, the aim of the exercise is to capture early signals of technologies and social issues that may still have low frequencies. Therefore, a normalised coefficient (coef_norm) is calculated by dividing the coefficient by the average frequency of the term. This normalised coefficient is used to exclude terms whose frequency is growing but already large, such as stopwords.
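The two coefficients can be illustrated with a minimal sketch. It assumes that coef is the ordinary least squares slope of a term's frequency regressed on the period index and that frequencies are counted per month; the function name, the toy numbers and these modelling details are assumptions for illustration, not a description of the exact implementation.

```python
from statistics import mean

def trend_coefficients(freqs):
    """Return (coef, coef_norm) for one term's frequency series.

    freqs: frequencies per consecutive period, oldest first (at least two values).
    coef is the OLS slope of frequency on the period index 0..n-1;
    coef_norm divides that slope by the term's average frequency.
    """
    n = len(freqs)
    x_mean = (n - 1) / 2
    y_mean = mean(freqs)
    sxx = sum((x - x_mean) ** 2 for x in range(n))
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(freqs))
    coef = sxy / sxx
    coef_norm = coef / y_mean if y_mean else 0.0
    return coef, coef_norm

# Hypothetical monthly frequencies (illustrative numbers, not taken from the data).
series = {
    "the": [1000, 1010, 1020, 1030],  # very frequent, small relative growth
    "mmwave": [1, 2, 3, 4],           # rare, large relative growth
}
results = {term: trend_coefficients(f) for term, f in series.items()}
```

In this toy example the stopword has the larger coef but a much smaller coef_norm than the rare term, which is the property that makes the normalised coefficient useful as a filter.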
Table 1 presents the results, sorting the terms by the regression coefficient coef. The results show that various technologies (5G, AI, blockchain) and political issues (China, ban, climate) gained traction in the online tech press. The top 20 words are all closely related to tech topics. This suggests that adequate sources were selected and that the topic identification methodology performs well in finding trending topics.
We have reviewed the top 1000 trending terms and summarised the most relevant ones in Table 2. The results provide an overview of the most important topics covered by the online tech press in recent years. The upper section of the table includes topics related to computer science and emerging technologies, while the lower section summarises various social and regulatory challenges.
The results demonstrate the importance of technologies such as AI, next-generation wireless technologies, quantum computing and blockchain. Moreover, the list includes various domain-specific terms, such as mmwave from 5G technology, quantum supremacy from quantum computing or kubernetes related to cloud computing.
The identified social issues present recent regulatory challenges: e.g. online privacy, fake news and hate speech, election interference or the growing influence of China. Similarly to technologies, the analysis presents some terms outside of the mainstream: e.g. the Pegasus spyware.
Another desired attribute of the results is that they lack buzzwords from the past. As an example, big data used to be a hot topic, but this bigram was not identified in the analysis. Instead, discussions have moved to technologies that exploit big data, such as machine learning algorithms.
In policy-making it is especially crucial to be informed about early signals of social issues. To make the filtering of relevant topics even easier, coef is also calculated over the last 3, 6 and 12 months. The table is included in the online appendix.
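A short sketch of the windowed calculation follows, assuming that the coefficient for each window is simply the least squares slope over the last 3, 6 or 12 monthly frequencies. The helper names and the toy series are hypothetical.

```python
def slope(values):
    """OLS slope of values against their index 0..n-1 (needs at least two values)."""
    n = len(values)
    x_mean = (n - 1) / 2
    y_mean = sum(values) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(values))
    sxx = sum((x - x_mean) ** 2 for x in range(n))
    return sxy / sxx

def windowed_coefs(monthly_freqs, windows=(3, 6, 12)):
    """Slope over the last k months for each k; None if the series is too short."""
    return {
        k: slope(monthly_freqs[-k:]) if len(monthly_freqs) >= k else None
        for k in windows
    }

# Hypothetical series: flat for nine months, then rising in the last three.
freqs = [5] * 9 + [6, 8, 10]
coefs = windowed_coefs(freqs)
```

For a term that only started to grow recently, the shorter windows yield the larger slopes, so comparing the three values indicates how recent the signal is.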
Table 2 Trending topics based on top 1000 terms
Co-occurrence analysis
The regression analysis served as an automated method to filter out the most relevant terms from the text corpus, providing a list of emerging issues in the tech world. The next step is to explore the relationships between trending terms. The analysis of co-occurrences enables us to find which emerging terms were most often mentioned together in the same article, and hence to find the most relevant pairs of expressions. In the case of tech news, such a method can be used to identify the areas where a technology is applied, or its connections to regulatory issues.
Terms related to privacy (the General Data Protection Regulation - GDPR, facial recognition), social media (Facebook) and the Chinese telecommunications industry (Chinese Telecom) were selected for the co-occurrence analysis; the results are shown in Fig. 4. Co-occurrences were calculated for all the trending words that were mentioned in the articles containing the expression of interest (e.g. GDPR). The online appendix contains tables with the calculated indices for the top 30 co-occurring terms. Among these terms, 10 that provide a broad insight into the discussion around the examined issue were selected for presentation.
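The counting step can be sketched as follows. The sketch assumes that each article has already been reduced to the set of terms it contains and that the co-occurrence measure is a plain count of articles mentioning both terms; the indices reported in the online appendix may be defined differently. The function name and the toy corpus are hypothetical.

```python
from collections import Counter

def cooccurrence_counts(articles, target, trending_terms):
    """Count, per trending term, the articles that mention it together with target.

    articles: iterable of sets of terms, one set per article.
    """
    counts = Counter()
    for terms in articles:
        if target in terms:
            counts.update(t for t in trending_terms if t in terms and t != target)
    return counts

# Hypothetical toy corpus: each article reduced to the set of terms it contains.
articles = [
    {"gdpr", "privacy", "facebook"},
    {"gdpr", "privacy", "ftc"},
    {"huawei", "5g"},
    {"gdpr", "huawei"},
]
trending = {"privacy", "facebook", "ftc", "huawei", "5g", "gdpr"}
counts = cooccurrence_counts(articles, "gdpr", trending)
top = counts.most_common(3)  # candidates for the list of top co-occurring terms
```

Terms that never appear alongside the target, such as 5g in this toy corpus, simply keep a count of zero.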
The co-occurrence analysis helps to unravel the news stories around the selected words, providing related technologies, actors and institutions. As an example, the Chinese telecommunications industry was mostly mentioned together with words such as trade war, 5G, Huawei, Zhengfei or sanctions. These keywords describe well the news about the US administration recommending that Huawei networking equipment be avoided due to the company's strong connections to the Chinese government (Reichert 2019).
The analysis of terms related to Facebook maps out major regulatory problems, such as the issue of privacy, the influence of social media on the democratic process (such as the Cambridge Analytica scandal) and the spread of misinformation. The results also highlight novel technologies (deepfakes) and business plans (the Libra project) of high regulatory relevance. The results for GDPR indicate the importance of ethical aspects of online privacy, the regulation’s impact outside the EU (FTC, China) and the role of the tech industry in respecting personal data (tech giants, Huawei). Finally, this method can also be used to gain further insights into technologies, such as facial recognition. The co-occurrences suggest not only its application in consumer products (Face ID), but also its high political and social relevance (protest, China).
Sentiment analysis
Following the exploration of the co-occurring terms, a new layer can be added to the analysis: sentiment. News stories are often polarising, and public perception evolves over time. Therefore, the changing sentiment of trending topics is examined. Additionally, news stories involve positive and negative actors and relations. By analysing the sentiment of co-occurring words, the different sides of a debate can be identified.
Figures 5, 7, 6, 8 present the evolution of sentiment for GDPR, facial recognition, Huawei and Facebook. For each term, the 3-month moving average of sentiment was calculated from the paragraphs containing it. The values can vary between -1 (most negative sentiment) and +1 (most positive). The authors of the VADER tool (Hutto and Gilbert 2014) recommend interpreting the compound score as positive at 0.05 or above, negative at -0.05 or below, and neutral in between.
The size of the bubbles reflects the number of analysed paragraphs. In each graph, the top 3 paragraph counts are shown.
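The smoothing step can be sketched as below. The sketch assumes that each paragraph already carries a compound score in [-1, 1] (for example from VADER), that scores are first averaged per month, and that the moving average is a trailing window over consecutive monthly means; these choices and all names are assumptions for illustration. The labels use the thresholds suggested in the VADER documentation.

```python
from collections import defaultdict

def monthly_sentiment(scored_paragraphs):
    """Return {month: (mean_score, paragraph_count)} from (month, score) pairs."""
    buckets = defaultdict(list)
    for month, score in scored_paragraphs:
        buckets[month].append(score)
    return {m: (sum(s) / len(s), len(s)) for m, s in sorted(buckets.items())}

def moving_average(monthly, window=3):
    """Trailing mean of monthly means; assumes consecutive months without gaps."""
    months = list(monthly)
    means = [monthly[m][0] for m in months]
    return {
        months[i]: sum(means[i - window + 1 : i + 1]) / window
        for i in range(window - 1, len(months))
    }

def label(score):
    """Thresholds suggested in the VADER documentation for the compound score."""
    if score >= 0.05:
        return "positive"
    if score <= -0.05:
        return "negative"
    return "neutral"

# Hypothetical (month, compound score) pairs for paragraphs mentioning one term.
scored = [
    ("2018-03", 0.4), ("2018-03", 0.2),
    ("2018-04", 0.1),
    ("2018-05", -0.3),
    ("2018-06", -0.1),
]
monthly = monthly_sentiment(scored)
smoothed = moving_average(monthly)
labels = {month: label(value) for month, value in smoothed.items()}
```

The paragraph count kept alongside each monthly mean is the quantity that the bubble sizes represent.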
Public perception has been rather volatile for the analysed terms. In the case of GDPR, the overall positive sentiment declined around May 2018, possibly as a reaction to the complications faced by users and businesses during the introduction. The increase of negative news stories on Huawei is especially visible. Since 2018, public sentiment has declined significantly, most probably due to the issues related to cybersecurity and the conflict with the US. Similarly, news stories covering facial recognition became less positive over time. In the case of Facebook, a gradual decrease of sentiment is revealed. The first significant decline happened at the end of 2016, possibly due to the scandals related to misinformation campaigns and the spread of fake news (Goulard 2016). The second rapid decline coincides with the beginning of the Cambridge Analytica scandal in March 2018.
Besides tracking the positive and negative sentiments for selected topics, the different shades of selected topics can be further examined by combining the co-occurrence analysis and the sentiment analysis. Figures 9, 10, 11, 12 demonstrate that technologies and social challenges are related to numerous news stories that are described differently by the media. The figures summarise the most positive and negative co-occurrences based on articles containing both expressions, with sentiment computed on paragraphs. The terms have been selected among the 20 most positive and negative words that co-occurred in at least 100 paragraphs. The figures show the calculated sentiments, as well as the number of paragraphs.
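The combination of the two analyses can be sketched as follows. As a simplification, the sketch averages the sentiment of paragraphs that mention both the target and the co-occurring term and keeps only terms that reach a minimum paragraph count; the data layout, the function name and the toy scores are hypothetical.

```python
from collections import defaultdict

def sentiment_by_cooccurrence(paragraphs, target, min_paragraphs=100):
    """Mean sentiment per term over paragraphs that also mention target.

    paragraphs: iterable of (terms, score) pairs, where terms is the set of
    terms in a paragraph and score is its sentiment score in [-1, 1].
    Returns {term: (mean_score, paragraph_count)} for terms meeting the threshold.
    """
    scores = defaultdict(list)
    for terms, score in paragraphs:
        if target in terms:
            for term in terms - {target}:
                scores[term].append(score)
    return {
        term: (sum(s) / len(s), len(s))
        for term, s in scores.items()
        if len(s) >= min_paragraphs
    }

# Hypothetical toy data; the threshold is lowered from 100 to 2 for the example.
paragraphs = [
    ({"huawei", "smartphone"}, 0.6),
    ({"huawei", "smartphone"}, 0.4),
    ({"huawei", "extradition"}, -0.5),
    ({"huawei", "extradition"}, -0.3),
    ({"huawei", "camera"}, 0.7),         # only one paragraph: filtered out
    ({"facebook", "smartphone"}, -0.9),  # does not mention the target
]
result = sentiment_by_cooccurrence(paragraphs, "huawei", min_paragraphs=2)
ranked = sorted(result, key=lambda t: result[t][0], reverse=True)
```

Sorting the surviving terms by mean score gives the most positive associations at the start of the list and the most negative at the end.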
News stories on GDPR have been most positive in the context of data management, cloud computing and the California Consumer Privacy Act (CCPA), and neutral when covering the Whois system (Hern 2018), the Marriott data breach (Sweney 2019) or the TikTok social media platform.
In the case of Huawei, media coverage has been positive about smartphone technologies, and negative in the context of the scandals related to the extradition case of Huawei's CFO in 2019. Similarly, news on the use of facial recognition in consumer products has been rather positive, while the least positive stories highlight crucial ethical and regulatory problems, such as algorithmic bias and the military and law enforcement applications of the technology.
Finally, the news stories on Facebook have been more positive in relation to technologies, while negative or neutral when mentioning the far-right, conspiracy theories and content moderation.
Various robustness checks were carried out to validate our methods. In order to keep the article concise, the additional analyses and materials are published in the online appendix.