1 Introduction

1.1 Motivation

New efficient algorithms [1,2,3,4] and the proliferation of online information sources have boosted applied data analysis research. In this context, Natural Language Processing (nlp) techniques are being successfully applied to unstructured textual data [5,6,7,8], from the simplest approaches, which use morphological information as input [9], to more complex methodologies that take advantage of syntactic patterns and semantic relations [10].

Financial Knowledge Extraction (ke) is of particular interest. nlp techniques have been applied to financial news, economic reports and comments by financial experts in a wealth of market forecasting research [11]. Financial news describes relevant market events, their causes and their possible effects. Transferring human associative discourse capabilities [12] from this type of content is challenging.

1.2 Financial knowledge extraction

Some representative examples of financial ke include information extraction from financial news for firm-based monitoring [13], analysis of financial risk such as volatility [14], and Personal Finance Management applications [15], among other interesting use cases [7]. Most of these ke systems engineer specific features of the content with their target in mind [16].

It is well known that there is a strong relation between mass media news and the state of the stock market [17, 18]. Previous research has shown that financial information published or shared in printed media, radio, television and websites is correlated with future stock market events [19]. Apart from providing valuable objective information, the authors of financial news speculate about market events within political, social and cultural contexts. In these unstructured texts the discourse flows around certain key statements and predictions, and an automatic financial news analysis system should distinguish between less relevant data and predictions to gather knowledge that assists investors in decision making [20].

1.3 Temporality at the discursive level

The representation of temporality in texts, and in speculative statements in particular, is based on semantic combinations of certain linguistic structures and elements [21]. However, the vast majority of works on temporality at the discursive level have simply focused on verb tenses [22], ignoring their semantic context.

1.4 Research goal and main contribution

Our research is a case of financial News Analysis (na) [13,14,15] within the field of Intelligence Amplification, which has lately gained attention [23,24,25] as a means of enhancing the understanding and reasoning capabilities of automatic ke solutions and transferring human associative discourse capabilities to expert systems. Our case contributes to solving the problem of extracting relevant text from financial news and, within that relevant text, identifying forecasts and predictions. Our solution may be valuable in helping inexpert stockholders to process more financial news more efficiently.

To the best of our knowledge, this is the first study to propose an approach for the automatic detection of relevant events in financial na based on the joint consideration of relevance and temporality analysis at the discourse level.

1.5 Approach

Our approach comprises:

  • Multi-paragraph topic segmentation and co-reference resolution to separate author expression patterns.

  • Detection of relevant text through topic modelling with Latent Dirichlet Allocation (lda), outperforming a rule-based system.

  • Identification of forecasts and predictions within relevant text using discursive temporality analysis and Machine Learning (ml).

We demonstrate the performance of these features using an experimental data set composed of news items from widely used financial sources. The final data set contained 2,158 financial news items, manually labelled by nlp researchers, which is similar in size to, or even larger than, the data sets used in other studies in the literature [26,27,28,29,30,31].

1.6 Structure of the paper

The rest of this article is organised as follows. Section 2 reviews related work on ke and na solutions. Section 3 describes our automatic system for detecting relevant financial events based on nlp and ml techniques. Section 4 presents the text corpus and numerical evaluation of our solution. Finally, Section 5 concludes the article.

2 Related work

Stock market research is based on fundamental and technical approaches [32]. Fundamental approaches involve performing stock market forecasts using numerical data such as price variation. Technical approaches, in turn, focus on the temporal dimension of financial events; they apply trend modelling techniques to historical asset data to produce forecasts.

Previous research on Data Mining and ke for stock market screening on textual data has considered financial news [14, 33], stockholder comments in blogs [18] and social networks [34]. These systems apply nlp techniques [11] or ml models [2, 14], which may be supervised [35], relying on automatic or manually annotated data sets, or unsupervised [36], taking into account the peculiarities of input data and descriptive patterns. The simplest approach consists of using a vector representation of the content and weighting the terms once meaningless elements, such as prepositions [37], are removed. More complex approaches, like the one presented by De Arriba-Pérez et al. (2020) [38], seek to identify syntactic and semantic patterns as key descriptors of financial news through lexica, grammar and named entity recognition techniques.

Traditional extraction methods for filtering relevant text in this context comprise manualFootnote 1 and automaticFootnote 2 pattern discovery approaches [40]. The former require large knowledge bases, such as dictionaries and lexica, and rule sets. They tend to be constrained by specific application domains. Automatic approaches include simple statistical and more demanding, complex linguistic approaches, in addition to the previously mentioned ml solutions [39]. tf-idf [41] is remarkably simple, but it has been reported to under-perform on professional texts such as those in our case. Alternative solutions combine the previous techniques with knowledge heuristics such as position, length and text format information. A more competitive solution is fuzzy logic for sentence scoring. However, this lacks adaptability and requires manual rule generation, which directly affects performance [42].

Unsupervised extraction is highly practical because it eliminates the burden of text tagging. Nevertheless, many ke solutions rely on supervised methodologies. Examples include those of Gottipati et al. (2018) [43], who designed an ml course improvement solution based on student feedback and compared its performance to a rule-based method; López-Úbeda et al. (2021) [44], who extracted relevant information from radiological reports; and Verneer et al. (2019) [45], who proposed a relevance detection system for social media messages (although they noted the great potential of lda as an alternative).

Among extraction solutions developed to detect relevant topics from news pieces (setting aside temporality analysis), Jacobs et al. (2018) [46] developed a supervised model for economic event extraction in English news using a sentence-level classification approach, as in our case; Oncharoen et al. (2018) [47] applied the Open Information Extraction system to represent the news data as tuples (actor, action and object); Carta et al. (2021) [48] employed a real-time domain-specific clustering-based approach for event extraction in news; and Harb et al. (2008) [49] presented a linguistic-based opinion extraction system for blogs.

Assuming there is a direct causal relationship between financial news and asset prices [17], some authors have explored both ml and other sophisticated techniques such as deep learning to gather context-dependent information for stock market screening [50]. Worthy of note in this respect is the Naive Bayes model by Atkins et al. (2018) [14] for predicting stock market volatility, which employed as input word-topic correspondence feature vectors obtained with lda. Unlike our proposal, this model considered news content as a whole and did not differentiate non-relevant from relevant parts for their target application. Shilpa & Shambhavi (2021) [50] presented a prediction framework based on sentiment analysis and stock market technical-indicator features. Temporality was not considered.

Prior work has addressed linguistic [51], template-based [52] and statistical news summarisation approaches [53]. State-of-the-art summarisation systems may be extractive [53] or abstractive [54]. Extractive summaries, which are more akin to our goal, extract key sentences directly from the input text. These sentences are ranked by importance and selected if they pass a threshold. Query-focused and update summarisation approaches [55] also deserve consideration as they retrieve information tailored to a specific audience. The summarisation of online financial news in our work focuses on financial investors. The temporal dimension, expressed as discursive temporality, is crucial to us because relevant text in finance-related news may include, in addition to factual information, speculations or predictions, whether quantitative or not. In further relation to summarisation, template-based systems for financial na [56] were limited in early research by their computational load, the laborious task of defining the templates and their lack of flexibility.

ke solutions, and in particular, financial na systems, have not paid sufficient attention to temporal analysis. The vast majority simply use temporal references provided by timestamps or verb tenses. Evers-Vermeul et al. (2017) [22], for example, simply noted that as linguistic markers, verb tense suffixes express temporal order and coherence relations through text. Our work goes a step further by analysing sentence-level temporality through syntax and semantics, and detecting temporal elements, expressions and the patterns in which they are arranged.

Summing up, Table 1 compares the most relevant work related to our proposal. Our main contribution is the detection of relevant statements in financial news, including forecasts and predictions. To do this, our system automatically groups related data and filters out background information. In brief, we present a novel technique combining lda analysis of automatically segmented news with temporality analysis at the discourse level. To our knowledge, this is the first na approach that jointly considers relevance and temporality at the discourse level.

Table 1 Comparison with related works

3 System architecture

In this section, we describe our system for the automatic detection of relevant financial events using nlp techniques and ke algorithms. Figure 1 shows the scheme of the system. First, we segment the input text to group together closely related information. Then, we apply co-reference resolution to discover internal dependencies in the news content among key references to assets. The next stage is tag processing, which consists of the detection, homogenisation and replacement of financial terms. This is followed by relevant topic modelling and temporal analysis to identify predictions and speculative statements, and ultimately provide investors with a synthesised version of the financial news that highlights pertinent information, such as asset performance and forecasts/predictions.

Fig. 1

Proposed system

3.1 Multi-paragraph topic segmentation

The first stage of the multi-paragraph topic segmentation process applies the enhanced version of the TextTiling algorithm [57] to segment news content into subtopic paragraphs.

TextTiling exploits lexical co-occurrence and discourse distribution patterns to identify subtopic paragraphs within a text with reasonable precision from a human’s perspective. The algorithm compares, in sequence, the similarity of adjacent text divisions of similar length. If the vocabulary in the first and second parts of the comparison differs, the division is considered a split point. We set a minimum text length of 500 characters to apply the algorithm. Otherwise, no segmentation is performed.
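For illustration, the minimal sketch below segments a news item with NLTK's TextTilingTokenizer, used here as a stand-in for the enhanced TextTiling variant of our system; the parameter values are NLTK defaults and only the 500-character minimum is taken from the description above.

import nltk
from nltk.tokenize import TextTilingTokenizer

nltk.download("stopwords", quiet=True)  # TextTiling relies on a stop-word list internally

MIN_LENGTH = 500  # minimum text length (in characters) required before segmenting

def segment_news(text):
    # Split a news item into subtopic segments, or return it whole if it is too short.
    if len(text) < MIN_LENGTH:
        return [text]
    tokenizer = TextTilingTokenizer(w=20, k=10)  # pseudo-sentence size and block size (defaults)
    try:
        return [segment.strip() for segment in tokenizer.tokenize(text)]
    except ValueError:
        # TextTiling expects paragraph breaks; fall back to the whole text otherwise.
        return [text]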

The rationale behind this first stage is the assumption that text segments must be coherent, self-contained information units. Explicit paragraphs of financial news are less useful in our case because, quite often, they just break up the text layout to facilitate readability. It is the informal, inner structure based on key statements and predictions about certain assets or stock markets that interests us.

Table 2 shows an example of the original content of a financial news item. Table 3 shows the same item after segmentation with TextTiling (for conciseness, we present just the first two segments). In this example, the first paragraph refers to the current state of the asset and the second describes its past performance. Logically, TextTiling does not always guarantee such a degree of coherence, but in our case, it contributes to the overall efficiency of our system as a building block.

Table 2 Example of news entry
Table 3 Example of news entry after applying multi-paragraph TextTiling segmentation

3.2 Co-reference resolution

The purpose of co-reference resolution is to replace references with meaningful words, which improves the performance of the subsequent lda stage. Specifically, after segmenting the text, we use the Neural Network (nn) by Clark et al. (2016) [58] to generate high-dimensional vector representations for co-reference compatibility of cluster pairs. The nn is composed of task-oriented sub-networks: a mention-pair encoder and a cluster-pair encoder, which create the distributed representations, and cluster-ranking and mention-ranking models, which score the pairs of clusters.
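A minimal sketch of this stage with the NeuralCoref extension of spaCy (the implementation adopted in Section 4.1) follows; the spaCy model name is an illustrative choice, and the stock-to-index replacement anticipates the workaround described in Section 4.1.

import spacy
import neuralcoref  # NeuralCoref works as a pipeline extension of spaCy 2.x

nlp = spacy.load("en_core_web_sm")  # pre-trained English model (illustrative choice)
neuralcoref.add_to_pipe(nlp)

def resolve_coreferences(text):
    # Replace implicit references (pronouns, etc.) with their antecedents.
    text = text.replace("stock", "index")  # workaround discussed in Section 4.1
    doc = nlp(text)
    return doc._.coref_resolved

print(resolve_coreferences("Apple posted record earnings. It expects further growth."))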

Table 4 provides an example of a news entry after this procedure. Note how implicit asset references have been replaced by explicit ones, which is essential for the next stages.

Table 4 Example of news entry before and after applying co-reference resolution

3.3 Tag processing: financial term detection, homogenisation and replacement

After text segmentation and co-reference resolution, the tag processing stage homogenises the input for the subsequent lda stage. First, asset identifiers are detected using our financial lexicaFootnote 3 on stock markets, tickers and currencies. In addition, we search for words such as company, enterprise, manufacturer and shareholder, which may refer to an asset. We then detect dependencies between these key terms. Next, we replace all references to stock markets, assets, asset abbreviations and currencies with the tags stock, ticker, ticker_abr and currency, respectively. Using the same lexica, we also replace financial terms and abbreviations with the tag fin_abr.

A search is also made for capitalised proper names in the news items. Initially, these names are checked against the above lexica and replaced by fin_abr, ticker_abr and ticker tags, as appropriate, when there is a match. In the absence of a match, they are replaced by category tags taken from an entity recognition database.

We homogenise numerical values and dates, and group dates and times under the tag date and quantitative terms under the tag num using a Name Entity Recogniser (ner, see Section 4.1). Table 5 provides a complete example of co-reference resolution and tag processing with detection of financial terms, homogenisation and replacement using the tools specified in Section 4.1.
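The sketch below outlines the tag replacement logic using the spaCy EntityRecognizer categories listed in Section 4.1; the lexicon entries are hypothetical placeholders for our financial lexica, and multi-word entities are handled token by token for brevity.

import spacy

nlp = spacy.load("en_core_web_sm")

# Hypothetical entries standing in for the financial lexica (Footnote 3).
FINANCIAL_LEXICON = {"NASDAQ": "stock", "Apple": "ticker", "AAPL": "ticker_abr",
                     "USD": "currency", "EPS": "fin_abr"}

# Mapping from spaCy entity labels to the homogenised tags of Section 4.1.
ENTITY_TAGS = {"DATE": "date", "TIME": "date",
               "PERCENT": "num", "CARDINAL": "num", "QUANTITY": "num",
               "LOC": "loc", "FAC": "loc", "GPE": "loc",
               "MONEY": "money", "PERSON": "person", "NORP": "norp", "ORG": "org",
               "PRODUCT": "product", "EVENT": "event", "WORK_OF_ART": "work_of_art"}

def tag_text(text):
    # Lexicon matches take precedence; otherwise fall back to named entity categories.
    doc = nlp(text)
    tokens = []
    for token in doc:
        if token.text in FINANCIAL_LEXICON:
            tokens.append(FINANCIAL_LEXICON[token.text])
        elif token.ent_type_ in ENTITY_TAGS:
            tokens.append(ENTITY_TAGS[token.ent_type_])
        else:
            tokens.append(token.text)
    return " ".join(tokens)

print(tag_text("Apple gained 5% on NASDAQ on Monday."))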

Table 5 Example of news content before and after applying financial term detection, homogenisation and replacement procedures

3.4 Relevant text detection with lda topic modelling

lda [59] is an unsupervised algorithm that identifies different topics in a particular document, but it can be generalised to unknown documents if they belong to the same domain and share similar context and structure [60]. We use it to differentiate between relevant and less relevant information in segments (“documents” in this section) produced from financial news content. Our goal is thus to discover relevant information in financial news by separating it from non-relevant information. Note that this type of news has a rather characteristic structure where relevant information is often presented along with precise contextual data and expression patterns, unlike other conventional news items.

Therefore, we employ lda with two topics, as formulated in (1). The training algorithm iterates to minimise the number of topics per word and per document.

$$ \begin{array}{@{}rcl@{}} &&P(Z,W,\theta,\phi;\alpha,\beta) = \prod\limits_{j=1}^{M} P(\theta_{j};\alpha) \times \prod\limits_{i=1}^{2} P(\phi_{i};\beta) \\ &&\quad\times \prod\limits_{t=1}^{N} P(Z_{j,t} \rvert \theta_{j}) P(W_{j,t} \rvert \phi_{Z_{j,t}}) \end{array} $$
(1)
  • Z is the set of target topics, two in this work.

  • W is the set of words (once stop-wordsFootnote 4 are discarded), with size N.

  • M is the size of the document collection.

  • α and β are the symmetric smoothing hyper-parameters to avoid discarding one of the topics due to zero intra-document or intra-corpus topic occurrences. Both hyper-parameters are initialised randomly. Specifically, α and β are the topic-document and word-topic densities, respectively. Lower values of α and β reduce the variability of topic assignment to specific documents and words.

  • P(𝜃j;α) and P(ϕi;β) are the topic-documents and word-topics Dirichlet distributions, respectively.

  • \(P(Z_{j,t} \rvert \theta _{j})\) and \(P(W_{j,t} \rvert \phi _{Z_{j,t}})\) are the topic-documents and word-topics multinomial distributions, respectively.

During lda model training, by modifying α and β, (i) topics are first assigned randomly to the words in each document; then, (ii) the algorithm iterates across the word-topic pairs in the different documents, generating new assignments and accepting them if they decrease the number of topics per word and per document.

The algorithm converges when it finds a solution that minimises the number of intra-document topics and topics per word. Alternatively, an iteration limit can be set. Ultimately, the resulting assignment of words to topics can be used to define a criterion for detecting topics in new text. To this end, when a new sentence is presented to the algorithm, step (ii) is repeated, also taking into account the words in the new sentence. The score of a sentence for a given topic is the number of words of that topic in the sentence divided by the length of the sentence in words. In principle, the sentence is considered to belong to the topic with the highest score. Note that in this estimation, the algorithm is started using the distribution with the best hyper-parameters α and β obtained when the training algorithm terminates.
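A minimal training and inference sketch with gensim's LdaMulticore (the implementation adopted in Section 4.1) is shown below; the toy segments are illustrative only, and the asymmetric word-topic (beta) prior tuned in Listing 2 is omitted for brevity.

from gensim.corpora import Dictionary
from gensim.models import LdaMulticore

# Tagged, stop-word-free segments (illustrative content only).
segments = [["ticker", "rose", "num", "currency"],
            ["analysts", "expect", "ticker_abr", "growth", "num"]]

dictionary = Dictionary(segments)
corpus = [dictionary.doc2bow(segment) for segment in segments]

lda = LdaMulticore(corpus=corpus,
                   id2word=dictionary,
                   num_topics=2,        # relevant vs. non-relevant
                   passes=50,           # training passes, as tuned in Section 4.1
                   alpha="symmetric",   # topic-document density prior
                   random_state=1)      # seed for reproducibility

# Score a new sentence: share of its words assigned to each topic.
bow = dictionary.doc2bow("ticker gained num in currency today".split())
print(lda.get_document_topics(bow))  # e.g. [(0, 0.87), (1, 0.13)]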

The capability of the system to differentiate between relevant and non-relevant information in the resulting topics is related to two combined effects of data conditioning in previous stages. First, TextTiling groups text by the different expression patterns that the authors of financial news tend to use in relevant and non-relevant text. Second, co-reference resolution and tag processing create a higher density of certain tags in relevant text. For this reason, as a practical contribution, we defined a topic score ρ that represents the density of significant tags stock, ticker, currency and fin_abr in financial news content, which is computed as the percentage of significant tags in a topic divided by the total number of tags in the whole data set. The topic with the highest ρ value is considered relevant. Furthermore, to improve the precision of the lda algorithm in detecting relevant text (that is, its ability to avoid false positives) we introduced another practical contribution: an lda score threshold for accepting a sentence as relevant. It is computed as the minimum value of the configurable parameter δ and the mean value of the topic scores of the relevant sentences in the same segment.
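The decision rule described above can be summarised with the following sketch (one possible reading of the ρ score and the δ threshold; variable and function names are ours):

SIGNIFICANT_TAGS = {"stock", "ticker", "currency", "fin_abr"}
DELTA = 0.8  # configurable parameter (value used in Section 4.1)

def topic_density(topic_words, total_tags_in_dataset):
    # rho: share of significant tags assigned to the topic over all tags in the data set.
    significant = sum(1 for word in topic_words if word in SIGNIFICANT_TAGS)
    return significant / total_tags_in_dataset

def is_relevant(sentence_score, segment_scores):
    # Accept a sentence of the relevant topic only if it clears the adaptive threshold.
    threshold = min(DELTA, sum(segment_scores) / len(segment_scores))
    return sentence_score >= threshold

# Toy scores whose mean (0.878) matches the example in Table 6; with delta = 0.8
# the third sentence fails the threshold and is rejected.
print(is_relevant(0.750, [0.950, 0.934, 0.750]))  # False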

Table 6 shows an example of relevant information detection, together with some sentences on the relevant topic from the same segment and the corresponding lda classification scores. The mean value of the scores is 0.878.Footnote 5 Thus, assuming that δ = 0.8, even though the third sentence belongs to the relevant topic, the system would consider it irrelevant.

Table 6 Example of lda analysis of relevant information

3.5 Temporal analysis

Table 7 shows the set of temporal features used to train the ml temporal analysis model. The focal point of this analysis is the use of verbs when referring to stock markets, assets and currencies, but, unlike previous works, we consider them to be part of the semantic context.

Table 7 Temporal features in the ml temporal analysis model

Within each relevant sentence, we perform a dependency analysis to link verbs to stock markets, assets and currencies, and a proximity analysis based on the distance between the verbs and the key elements identified. The system measures the distance of a term to the nearest verb as the number of intermediate words in both directions. For both analyses (dependency and proximity), we consider whether assets are the subjects or objects of their clauses. For each feature, we estimate the verb tense by majority voting among past, present and future tenses (in case of a tie, the future tense prevails).
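As one concrete way to implement the proximity part of this analysis, the sketch below measures the distance from each asset tag to its nearest verb and derives a coarse tense with spaCy; the full system uses Freeling and the subject/object distinction, which are not reproduced here.

import spacy

nlp = spacy.load("en_core_web_sm")
ASSET_TAGS = {"stock", "ticker", "ticker_abr", "currency"}

def verb_tense(token):
    # Coarse tense from Penn Treebank tags; 'will'/'shall' auxiliaries mark the future.
    if token.tag_ in ("VBD", "VBN"):
        return "past"
    if token.tag_ == "MD" and token.lower_ in ("will", "shall"):
        return "future"
    if token.tag_ in ("VBZ", "VBP", "VBG"):
        return "present"
    return None

def proximity_analysis(sentence):
    # For each asset tag, return the word distance to the closest verb and its tense.
    doc = nlp(sentence)
    verbs = [t for t in doc if t.pos_ in ("VERB", "AUX")]
    results = []
    for token in doc:
        if token.lower_ in ASSET_TAGS and verbs:
            closest = min(verbs, key=lambda v: abs(v.i - token.i))
            results.append((token.text, abs(closest.i - token.i), verb_tense(closest)))
    return results

print(proximity_analysis("ticker will climb above num next quarter"))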

Algorithms 1 and 2 describe the generation of the temporal features of a segment based on the dependency (Algorithm 1) and the proximity (Algorithm 2) analysis of the corresponding sentences. Both algorithms have linear time complexity O(n·d), owing to the d independent loops over a limited number n of elements (subject and object sentences in our case). In the particular scenario in which the number of subject sentences equals the number of object sentences, the time complexity of each loop is O(2 ⋅ 3 ⋅ n), for 2 analyses (subject and object) and 3 tenses (past, present, future). Table 8 shows an example of the outcome. Note the financial terms and associated verbs. For this segment, for example, features FutDepSubObj and FutProxSubObj in Table 7 are set to 1.

Algorithm 1

Dependency analysis

Algorithm 2

Proximity analysis

Table 8 Example of news entry after applying dependency and proximity analyses

In addition to the temporal features, we also consider textual and numerical features in the ml temporal analysis model. The textual features are char-grams, word-grams and word tokens (n-grams within word boundaries), whose parameter ranges are selected by combinatorial searching. The two numerical features are the number of numerical values (excluding percentages) and the number of percentages in the news content.
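A sketch of these features with scikit-learn vectorisers and simple regular expressions is given below; the vectoriser parameters are the ones reported in Section 4.1 (min_df is left at its default here), while the regular expressions for counting numbers and percentages are our own approximation.

import re
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import FeatureUnion

# Char-grams, word-grams and word tokens (char n-grams within word boundaries).
textual_features = FeatureUnion([
    ("char_grams", CountVectorizer(analyzer="char", ngram_range=(2, 4),
                                   max_df=0.30, max_features=10000)),
    ("word_grams", CountVectorizer(analyzer="word", ngram_range=(2, 4),
                                   max_df=0.30, max_features=10000)),
    ("word_tokens", CountVectorizer(analyzer="char_wb", ngram_range=(2, 4),
                                    max_df=0.30, max_features=10000)),
])
# X_text = textual_features.fit_transform(sentences)  # sparse textual feature matrix

def numerical_features(text):
    # Counts of percentages and of the remaining numerical values.
    percentages = len(re.findall(r"\d+(?:\.\d+)?\s?%", text))
    numbers = len(re.findall(r"\b\d+(?:\.\d+)?\b", text)) - percentages
    return [numbers, percentages]

print(numerical_features("Shares rose 5.2% to 123.4 dollars."))  # [1, 1]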

After empirical tests with diverse ml algorithms, we chose a Linear Support Vector Classifier (svc) to estimate the temporality (past, present, future) of a segment (see our previous work [61]). Before training the svc, we pre-processed the clauses in the financial news by converting text to lower case, and removing punctuation marks and non-Unicode characters such as accents and symbols.
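The sketch below reproduces this configuration with scikit-learn's LinearSVC and the clean-up described above; the hyper-parameter values are the ones reported in Section 4.1, and the construction of the feature matrix is omitted.

import re
import unicodedata
from sklearn.svm import LinearSVC

def preprocess(text):
    # Lower-case, strip accents/symbols and remove punctuation marks.
    text = text.lower()
    text = unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")
    return re.sub(r"[^\w\s]", " ", text)

svc = LinearSVC(C=0.001,
                class_weight="balanced",
                loss="squared_hinge",
                max_iter=1500,
                multi_class="ovr",
                penalty="l2",
                tol=1e-9)
# svc.fit(X_train, y_train) with features built from the temporal, textual and
# numerical blocks; labels are past, present or future.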

Finally, Algorithm 3 shows the logical flow of the proposed solution.

Algorithm 3

Solution pipeline

4 Experimental results

In this section, we describe the experimental data set and the performance of our system for the detection of relevant financial events. Well-known state-of-the-art metrics were used for the evaluation: Alpha-reliability and accuracy [62], and the Recall-Oriented Understudy for Gisting Evaluation (rouge) metricFootnote 6 [21, 63, 64]. Comparisons of the proposed solution with a rule-based baseline and a supervised extraction approach are also provided.

4.1 Experimental setting

The experiments were performed on a computer with the following specifications:

  • Operating system: Ubuntu 18.04.2 lts 64 bits

  • Processor: IntelCore i9-9900K 3.60 GHz

  • RAM: 32 GB DDR4

  • Disk: 500 GB (7200 rpm SATA) + 256 GB SSD

Regarding the implementation, we took the following decisions:

  • Segmentation (Section 3.1): as previously mentioned, we applied the TextTiling algorithm.Footnote 7 For this first stage, we used the default parameters of the algorithm (see Listing 1). Stop-words were also removed.

  • Co-reference detection (Section 3.2): we used the nn implementation of the NeuralCoref libraryFootnote 8, which is a pipelined extension of the spaCy libraryFootnote 9 to solve co-reference groups in blocks of text. Specifically, we used the pre-trained word embedding statistical model for English with its default features. Surprisingly, in our analysis, we noticed that NeuralCoref did not always detect the indispensable word stock. To circumvent this problem, following a trial and error process, we replaced stock with index before co-reference resolution.

  • Tag processing (Section 3.3): we employ the Freeling libraryFootnote 10 to detect dependencies between terms. Capitalised proper names are detected with the spaCyFootnote 11 library. If the names are not in our lexica, their categories are selected from the spaCy EntityRecognizer tool.Footnote 12 Depending on the category, the names are replaced by the following tags: money, person, norp (nationalities, religions or political groups), org (organisations, companies), product, event, and work_of_art (titles of artworks). Moreover, all elements recognised as loc, fac (buildings) and gpe (countries, cities, states) are grouped under the tag loc (locations). Numerical values and dates are detected using EntityRecognizer. Terms recognised as date and time are grouped under the tag date, while those recognised as percent, cardinal and quantity are grouped under the tag num.

  • Relevant text detection with lda topic modelling (Section 3.4): we employed the LdaMulticore module from the gensim Python library.Footnote 13 Listing 2 shows the configuration parameters used to train the model. Through repeated trials, we tuned the algorithm to 50 training passes, alpha to symmetric and beta to asymmetric. Finally, we set δ = 0.8.

    • Numtopics is the number of latent topics to be extracted.

    • Passes is the number of passes to be applied during training.

    • Random state is a useful seed for reproducibility.

    • Alpha represents an a-priori belief about the document-topic distribution, that is, the prior selection strategy. Its feasible values are: (i) scalar for symmetric document-topic distribution, (ii) symmetric to use a fixed symmetric distribution of 1.0/numtopics, and (iii) asymmetric to use a fixed normalised asymmetric distribution of 1.0/(topicindex + sqrt(numtopics)).

    • Beta is an a-priori belief on topic-word distribution. It has the same feasible values as alpha.

  • Temporal analysis (Section 3.5): Freeling is used to tag assets as subjects or objects of their clauses to obtain the corresponding temporal features. We used the svc implementation from the Scikit-Learn Python library.Footnote 14 Regarding the parameterisation of the char-grams, word-grams and word tokens textual features, we applied a GridSearchCVFootnote 15 combinatorial search from Scikit-Learn within the ranges in our prior related work [61]. The final choices were maxdf = 0.30, mindf = 0, ngramrange = (2,4) and maxfeatures = 10000.

    • Mindf and maxdf are used to ignore terms with a lower (cut-off) and higher (corpus-specific stop words) document frequency than the given threshold, respectively.

    • Ngramrange indicates the lower and upper boundary for the extraction of word n-grams.

    • Maxfeatures limits the vocabulary to the given number of top features, ordered by term frequency across the corpus.

    Listing 1

    Configuration parameter ranges of the TextTiling algorithm

    Listing 2

    Configuration parameter ranges for the lda model

    To select the best features for the temporal analysis across the whole set (temporal, textual and numerical features), we used SelectPercentileFootnote 16 from Scikit-Learn with the χ2 score function and an 80th percentile threshold. The hyper-parameters of the svc were tuned using GridSearchCV with 10-fold cross validation within the ranges in Listing 3. The optimal hyper-parameter values were C = 0.001, classweight = balanced, loss = squared_hinge, maxiter = 1500, multiclass = ovr, penalty = l2, tol = 1e-9. Finally, the svc was evaluated by 10-fold cross validation using 600 sentences from financial news. This auxiliary data set, which is similar in size to other sets used in the literature [26,27,28,29,30,31], was independently annotated and did not belong to the experimental data set used to detect relevant information (described in Section 4.2). The svc attained 80.21% precision and 80.40% recall on the auxiliary set.

    • C is the regularisation parameter.

    • Classweight is used to set the parameter C of the classes. If not given, all classes are assumed to have weight one. The balanced mode automatically adjusts weights in a manner that is inversely proportional to class frequencies in the input data.

    • Loss represents the loss function.

    • Maxiter is the hard limit on the number of iterations.

    • Multiclass determines the multi-class strategy; the ovr value trains one-vs-rest classifiers.

    • Penalty represents the penalty for the model.

    • Tol represents the tolerance for the stopping criterion.

Listing 3

Configuration parameter ranges for svc

4.2 Experimental data set

Our experimental data set was composed of 2,158 news pieces (average length of 27.98 sentences and 537.24 words). As previously mentioned, the pieces were automatically extracted with a script from popular and prestigious financial websites between 1st October 2018 and 1st October 2020. We filtered the news pieces to keep those that mentioned at least one of the stocks in our financial lexica.Footnote 3 For text processing purposes, double spaces, line breaks and tabs were replaced by a single space. Finally, we removed urls, images and graphics, and kept the date and author information.

Each entry in the resulting data set is composed of an identifier, a title, content, author information, source and date of publication. The entries are comparable in size to those described in previous ke [26,27,28,29,30,31] and lda works [65].

A brief descriptive analysis of the data set is given in Table 9.

Table 9 Descriptive analysis of the experimental data set

The texts were annotated by five nlp scientists from the atlanTTic Research Centre for Telecommunication Technologies at the University of Vigo. Manual annotations included relevant texts, asset identifiers and prediction/forecast texts. A number of guidelines were agreed on to enhance consistency (e.g., bold font for relevant text, italics for asset identifiers and underlining for prediction/forecast text). Table 10 shows an example of an annotated news item from the experimental data set. We used the annotated asset identifiers to improve the content of our financial lexica,Footnote 3 which increased by 3.95%.

Table 10 Example of annotated news entry

4.3 Inter-agreement evaluation

We evaluated inter-annotator agreement using two well-known state-of-the-art metrics: Alpha-reliability and accuracy.

Table 11 shows the coincidence matrix of relevance across all annotators. The two components in the diagonal show the number of news sentences on which all the annotators agreed, while the other two components show the cases on which at least one annotator disagreed. Tables 12 and 13 show the Alpha-reliability and accuracy coefficients by pairs of annotators. The mean values were 0.552 and 0.861, respectively. Previous works have considered an Alpha-reliability value above 0.41 to be acceptable [66,67,68,69]. Inter-agreement accuracy was very high, at over 80%.
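For reference, pairwise agreement can be computed as in the sketch below, which assumes the krippendorff Python package and binary relevance labels per sentence (the toy labels shown are not taken from our data set).

import numpy as np
import krippendorff

# Rows: two annotators; columns: sentences; 1 = relevant, 0 = not relevant (toy data).
labels = np.array([[1, 0, 1, 1, 0],
                   [1, 0, 1, 0, 0]])

alpha = krippendorff.alpha(reliability_data=labels, level_of_measurement="nominal")
accuracy = (labels[0] == labels[1]).mean()  # pairwise inter-agreement accuracy
print(f"alpha = {alpha:.3f}, accuracy = {accuracy:.3f}")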

Table 11 Coincidence matrix for relevant text annotation
Table 12 Inter-agreement Alpha-reliability of relevant text annotation by pairs of annotators (An.)
Table 13 Inter-agreement accuracy of relevant text annotation by pairs of annotators (An.)

4.4 Discussion of the results

Before applying our system for the automatic detection of relevant financial events, we first defined a simple rule-based system as a baseline. This system sets a relevance score by counting tickers, numbers and percentages and detecting future tenses using Freeling. As previously mentioned, relevant financial text has characteristic context data and expression patterns, and our goal was to determine whether a sophisticated technique such as lda would perform better than trivial rules.

The rule-based baseline approach is as follows. First, the text is segmented into sentences, and all references to stock markets, assets, asset abbreviations and currencies are replaced by the tags stock, ticker, ticker_abr and currency, respectively. As indicated in Section 3.3, financial terms and abbreviations are also replaced by the tag fin_abr. Freeling is applied to detect percentages and numerical values, which are replaced by the tag num. Sentences containing a future tense as detected by Freeling are considered to refer to the future. In brief, a sentence is considered relevant if it contains at least one financial tag (stock, ticker, ticker_abr, currency or fin_abr) and at least one num tag, and predictive if the main verb is in future tense.
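A compact sketch of this baseline decision rule is shown below; future-tense detection is delegated to a flag, since Freeling is used for that purpose in our implementation.

FINANCIAL_TAGS = {"stock", "ticker", "ticker_abr", "currency", "fin_abr"}

def classify_sentence(tagged_tokens, has_future_tense):
    # Returns (is_relevant, is_predictive) for a tagged sentence.
    has_financial_tag = any(token in FINANCIAL_TAGS for token in tagged_tokens)
    has_num_tag = "num" in tagged_tokens
    is_relevant = has_financial_tag and has_num_tag
    return is_relevant, is_relevant and has_future_tense

print(classify_sentence(["ticker", "will", "rise", "num"], has_future_tense=True))  # (True, True)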

Drawing from related work on more powerful supervised extraction strategies [43,44,45, 49], we also applied an svc modelFootnote 17 as a second comparison reference. The model was trained using manual annotations on relevant text, including predictions. Textual features were generated and hyper-parameter settings optimised as described in Section 4.1.

Next, we evaluated our system by checking it against the annotated segments. To do this, we employed rouge, a widely used set of metrics for evaluating automatic text extraction performance based on overlapping n-grams. We used rouge-l, which measures the longest common sub-sequence between the system output and the annotated news. This rouge variant has been applied as a string matching algorithm to compute the similarity between two texts [70].
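As an example of how these scores can be obtained, the sketch below uses the rouge-score package (an assumption on our side; any rouge-l implementation would serve) to compare a system extract against an annotated reference.

from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)

reference = "ticker will climb above num next quarter"          # annotated relevant text
candidate = "ticker will climb above num in the next quarter"   # system output
score = scorer.score(reference, candidate)["rougeL"]
print(f"precision = {score.precision:.3f}, recall = {score.recall:.3f}, f1 = {score.fmeasure:.3f}")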

Tables 14 and 15 show the results obtained for the baseline systems and the proposed system. Even though the problem case is entirely novel (see Section 1.5), the results show that the application of sophisticated nlp techniques and ke algorithms such as those used in our solution results in improved extraction of relevance and temporality from financial news content.

Table 14 rouge-l measures for the detection of relevant text, by annotator (An.) and average
Table 15 rouge-l measures for the detection of relevant text with predictions/forecasts, by annotator (An.) and average

Table 14 shows the results for the detection of relevant information by the baseline systems and our system for the tags identified by the five annotators. Average values are also provided. In our tests, text was considered relevant when its score for a given topic doubled the score of the other topic.Footnote 18 The remaining text was considered to be less relevant or contextual information. The average rouge-l value across all annotators was 0.662, more than doubling the performance of the rule-based baseline approach. The manually intensive supervised extraction alternative was comparable to our unsupervised approach (in fact, the former was often worse, depending on the annotator). This performance can be considered satisfactory and in line with other works in the literature [63, 64, 71, 72]. Table 15 shows the results for the detection of relevant predictions/forecasts after the last svc classifier stage. The average rouge-l value in this case, 0.982, was excellent, with a significant improvement over the rule-based baseline reference of 0.713.

Counterintuitively, the performance of the rule-based baseline in Table 14 is worse because this approach assigns relevance to text that contains quantitative data even if it corresponds to merely contextual information, such as past states of assets and stock markets. It underperformed our proposed solution by an average of 51%. The level of agreement between our system and the annotators for the detection of predictions/forecasts (Table 15) was near perfect, with rouge values of more than 0.970 for all annotators. As expected, predictions and forecasts within relevant text are easier to detect than relevant text itself, which explains the lower rouge values for the detection of relevant text (Table 14), where the highest coefficient observed was 0.727 (for annotator 1).

4.5 Application use case

Figure 2 shows a news piece highlighted by our system. Relevant sentences are highlighted in blue, asset identifiers in pink and predictions/forecasts in green. Note the differences with the manual highlighting in Table 10. For example, human annotators might mark the sentence “In fact, VZ stock is worth at least 55% more than its price today” as a prediction. With our system, however, both forecasts and relevant, informative sentences are marked.

Fig. 2

Example of financial event detection

The dashboard at the bottom of Fig. 2 summarises the results, showing the proportion of relevant segments and number of predictions and forecasts. The updated value of the financial asset, taken from Yahoo FinanceFootnote 19, is shown on the right.

5 Conclusions

Many valuable online financial news sources, such as economy journals and web pages (Motley Fool, InvestorDaily, etc.), contain opinions from experts describing relevant market events within sociological, political and/or cultural contexts.

The system proposed in this paper is designed to extract this relevant information, and in particular forecasts and predictions. To do this, it employs nlp techniques. It segments the text and applies lda analysis to filter out less relevant sentences, and then applies discursive temporality analysis to identify predictions and forecasts within the remaining relevant text. The result is a summary of relevant, easy-to-read information. We are not aware of any other ke systems that have applied a similar approach to resolve this problem.

To our knowledge and considering related work, our proposal is the first to jointly consider relevance and temporality at the discursive level. It contributes to transferring human associative discourse capabilities to expert systems by combining (i) multi-paragraph topic segmentation and co-reference resolution to separate author expression patterns, (ii) detection of relevant text through topic modelling with lda, and (iii) identification of forecasts and predictions within relevant text using discursive temporality analysis and ml.

We have created an experimental data set composed of 2,158 financial news items to evaluate our proposal. We have validated its annotation capacity by performing an inter-agreement analysis using Alpha-reliability and accuracy measures and evaluated its performance using the state-of-the-art rouge metric. The system attained rouge-l values of 0.662 and 0.982 for the detection of relevant data and predictions/forecasts, respectively. We also compared the performance of our system with a rule-based baseline system and a fully supervised system (which also performs supervised extraction of relevant text) to evaluate its competitiveness. It outperformed the rule-based system and was comparable to the fully supervised system, which unlike our solution requires manual annotation.

In future work, we plan to extend our research to Spanish and other languages to cover a broader community of investors. We will also evaluate the system in composite (multi-disciplinary) research domains.