1 Introduction

With the advent of the Internet and digitalization, most financial services and investment platforms have moved online. Organizations publish their performance reports and brochures digitally. Earnings conference calls of executives are transcribed and digitally preserved. Most investors rely on this information to make investment decisions. Numbers present in such information may be claims or not claims (i.e. facts). Facts are always true, whereas claims may be true or false. Investors are expected to rely only on facts and not be misled by false claims. However, making such a distinction is not easy, especially for novice investors. Thus, we need an automated system that can detect whether numbers in financial texts are claims (in-claim) or not (out-of-claim/facts). Figure 1 presents two instances. The number ‘23’ in the text “For the full year we continue to expect an adjusted effective tax rate of 23–24%” is a claim. The number ‘1.1’ in the text “Free cash flow a really good start to the year at $1.1. billion.” is not a claim.

Fig. 1: Claim detection in financial texts

1.1 Our contributions

  • We developed a system that can detect whether a numeral present in a given financial text is a claim or not. For this, we used the English version of the publicly available FinNum-3 dataset [11]. On the validation set, our system achieved a macro F1 score of 0.8671.

  • We studied how adding handcrafted features and information regarding the category of a target numeral affects the performance of the model.

The remainder of this paper is structured as follows. In Sect. 2 we discuss some existing works. We formally state the problem in Sect. 3 and describe the dataset in Sect. 4. In the subsequent Sects. 5, 6 and 7, we discuss the methodology, the experiments we performed and their results, respectively. Section 8 concludes and mentions some directions for future work.

2 Related works

Detecting claims in text using Natural Language Processing (NLP) has been one of the trending areas of research. It has been applied in various domains such as news [13, 26], Twitter [6] and legal texts [24]. Hassan et al. [16] developed a system, ClaimBuster, to detect claims in the 2016 US presidential primary debates. They evaluated ClaimBuster on statements selected for fact-checking by CNN and PolitiFact and found that their system was able to detect several sentences containing claims which had not been selected for fact-checking by the above-mentioned organizations. The authors of [13] created a new dataset by manually labelling the debates. They also proposed SVM-based and neural systems to rank claims for prioritizing fact-checking. Subsequently, a similar application was presented by Konstantinovskiy et al. [19]. They used universal sentence representations for classification and outperformed the existing claim ranking system [13] and ClaimBuster [16]. Furthermore, they proposed an annotation schema and a crowdsourcing methodology, which enabled them to create a dataset of 5571 sentences labelled as claims or non-claims. Reddy et al. [26] released a new dataset, NewsClaim, which consists of 529 manually annotated claims collected from 103 news articles mostly relating to COVID-19. They showed that zero-shot and prompt-based approaches perform well in detecting claims from news articles.

Aharoni et al. [1] developed a dataset for detecting claims in controversial topics. It consists of 2683 arguments collected from 33 controversial topics. Sundriyal et al. [29] proposed a novel framework called DESYR, which consists of a gradient reversal layer and attentive orthogonal projection over Poincaré embeddings. They evaluated it on informal datasets such as online comments, web disclosures and Twitter. Chakrabarty et al. [6] created a corpus from Reddit consisting of 5.5 million self-labelled claims which contain “IMO/IMHO (in my (humble) opinion)” tags. They fine-tuned ULMFiT [18] on this corpus and further demonstrated how this fine-tuning helped in argument detection tasks. Wright et al. [31] proposed a unified model called Positive Unlabelled Conversion, which consists of a positive unlabelled classifier and a positive-negative classifier. They evaluated their model on three datasets, namely Wikipedia citations, Twitter rumours and political speeches.

Levy et al. [20] trained context-dependent classifiers for detecting claims on a Wikipedia corpus. Their system primarily consists of three components: a Sentence Component, a Boundaries Component and a Ranking Component. Subsequently, Levy et al. [21] proposed an unsupervised framework to detect claims and evaluated its performance on the same corpus. Lippi et al. [23] used Partial Tree Kernels to generate features for detecting claims irrespective of the context. The inner nodes of these trees consist of the POS tags of the words in the leaf nodes. Furthermore, Lippi et al. [24] validated the effectiveness of this approach in the legal domain by manually annotating claims from fifteen decisions of the European Court of Justice. Bar-Haim et al. [4] expanded an initial set of manually curated sentiment lexicons and added contextual features (such as headers, claim sentences, neighbouring sentences and neighbouring claims) to improve existing claim stance classification systems. Botnevik et al. [5] proposed a browser extension, BRENDA, that helps users verify facts within claims present on different web pages.

Recently, with the increase in the availability of financial textual data, researchers have been focusing on detecting claims in financial texts as well [9, 10]. Chen et al. [9] presented a novel dataset, NumClaim, in Chinese, which comprises financial texts, their categories and whether a target number within a text is in-claim or out-of-claim. They further proposed some neural baselines. Their best performing model, CapsNet, achieved a macro F1 score of 82.62% on the NumClaim corpus. Recently, they released a similar dataset in English while organizing the FinNum-3 shared task [11].

3 Problem statement

Given a set \(F = \{(t_1, n_1, s_1, e_1, c_1, m_1), (t_2, n_2, s_2, e_2, c_2, m_2), \ldots, (t_k, n_k, s_k, e_k, c_k, m_k)\}\) of \(k\) elements, the \(i\)-th element of \(F\) consists of a financial text \(t_i\) and a number \(n_i\) present within the text, with starting and ending index positions \(s_i\) and \(e_i\), respectively. Moreover, each element contains \(c_i\), which denotes the category \(n_i\) belongs to, and \(m_i\), which represents whether \(n_i\) is in-claim or out-of-claim: \(m_i \in \{0, 1\}\), with 0 and 1 representing out-of-claim and in-claim, respectively. \(c_i \in\) {‘date’, ‘other’, ‘money’, ‘relative’, ‘quantity absolute’, ‘absolute’, ‘product number’, ‘ranking’, ‘change’, ‘quantity relative’, ‘time’}. Our target is to develop a system that classifies an unseen numeral \(n\) as in-claim or out-of-claim.
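For illustration, one element of F for the first example in Fig. 1 could look as follows; the field names, the character offsets and the assigned category shown here are hypothetical and only meant to show the structure of a record.

record = {
    "text": "For the full year we continue to expect an adjusted effective "
            "tax rate of 23-24%",
    "target_num": "23",       # n: the target numeral
    "start_idx": 75,          # s: hypothetical starting offset
    "end_idx": 77,            # e: hypothetical ending offset
    "category": "relative",   # c: hypothetical category label
    "claim": 1,               # m: 1 = in-claim, 0 = out-of-claim
}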

We evaluate the performance of our models using macro-averaged F1-score.

4 Dataset

Our experimental dataset comprises transcripts of earnings conference calls in English, which are formal financial documents. A similar dataset in Chinese, consisting of reports written by analysts, has been described in more detail in [9]. Recently, a shared task, “NTCIR-16 FinNum-3: Investor’s and Manager’s Fine-grained Claim Detection” [11], was held in which participants were provided with this dataset. We registered for the shared task and obtained the training and validation data. The training data consists of 8337 records, whereas the validation data consists of 1191 records. Of these records, the training and validation sets have 1039 and 114 in-claim instances, respectively. There are 2627 and 409 unique financial texts in the training and validation sets, respectively. This indicates that most of the texts present in the training and validation sets contain multiple numbers. We present the category-wise distribution in Table 1.

Table 1 Category-wise distribution of the training and validation sets

5 Methodology

Fig. 2: Methodology. EF: engineered features, LR: logistic regression

Our final system is an ensemble of three sub-systems. The first two sub-systems fine-tune the pre-trained language model FinBERT [2] and are almost identical. The third is a logistic regression model built using contextual BERT embeddings [12] of the numerals and other engineered features. BERT (Bidirectional Encoder Representations from Transformers) [12] is one of the state-of-the-art language models. It has been pre-trained using masked language modelling (MLM) and next sentence prediction (NSP) objectives. We use its base, uncased version, which consists of 768 hidden units, 12 attention heads and 12 encoder blocks. It has a total of 110 million parameters and can be used to generate contextual embeddings of 768 dimensions. FinBERT [2] is a version of BERT which has been further pre-trained on financial text and fine-tuned for a financial sentiment classification task. We fine-tune the FinBERT model even further for the text classification task of detecting in-claim numerals. Since the given training set has multiple numbers present in the same text, we narrow down the context of the target numeral. For the first sub-system, we define the context as 8 words before and after the numeral. For the second and third sub-systems, we further narrow it down to 6 words around the numeral. The entire process is depicted in Fig. 2.
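As an illustration, the following is a minimal sketch (not our exact implementation) of how such a context window can be extracted, given the character offsets of the target numeral; the function name and the offset convention are assumptions.

def context_window(text, start, end, k=8):
    # Return the target numeral together with up to k words on each side.
    left_words = text[:start].split()[-k:]    # up to k words before the numeral
    right_words = text[end:].split()[:k]      # up to k words after the numeral
    return " ".join(left_words + [text[start:end]] + right_words)

# Example with the first text from Fig. 1 and a window of size 8
text = "For the full year we continue to expect an adjusted effective tax rate of 23-24%"
start = text.index("23")
print(context_window(text, start, start + 2, k=8))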

5.1 Sub-system-1 (S1)

First, we tokenize the financial texts and extract 8 words before and after the target numeral. We follow the standard method of fine-tuning a FinBERT model (768 dimensions) so that its [CLS] token learns to predict whether the target numeral is in-claim or out-of-claim. We train this model with a batch size of 256 for 40 epochs with a learning rate of 0.00002 and consider a maximum of 64 tokens. Finally, we select the model tuned up to the 15th epoch, as it performs best on the validation set (macro F1 score = 0.8585).
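The following is a minimal fine-tuning sketch, assuming the HuggingFace transformers library and the publicly available ProsusAI/finbert checkpoint; only the hyper-parameters stated above (learning rate 0.00002, maximum 64 tokens) are taken from our setup, and the rest of the loop is illustrative rather than our exact training code.

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("ProsusAI/finbert")
model = AutoModelForSequenceClassification.from_pretrained(
    "ProsusAI/finbert", num_labels=2, ignore_mismatched_sizes=True)
optimizer = torch.optim.Adam(model.parameters(), lr=2e-5)

def training_step(contexts, labels):
    # contexts: list of 8-word windows around the target numerals
    # labels: list of 0/1 values (out-of-claim / in-claim)
    batch = tokenizer(contexts, truncation=True, max_length=64,
                      padding=True, return_tensors="pt")
    out = model(**batch, labels=torch.tensor(labels))
    out.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return out.loss.item()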

5.2 Sub-system-2 (S2)

This sub-system is similar to the first one. The only differences are that we narrow the context around the target numeral from 8 words to 6 and consider a maximum of 16 tokens. We do this to focus more specifically on the target numeral. This model performs best just after the 14th epoch (macro F1 score = 0.8439).

5.3 Sub-system-3 (S3)

This sub-system is different from the previous two. Given a context window of 6 words, we first extract the BERT-base-uncased embedding (768 dimensions) of the target numeral. Since we use sub-word tokenization, in many cases the target numeral is split into more than one token. This is one of the drawbacks of transformer-based models, as also noted by Wallace et al. [30]. To deal with such instances, we take the mean of the embeddings of all constituent tokens. Moreover, inspired by [3, 22, 28] and [8], we engineer several features from the target numerals. These features include:

  • number of digits before the decimal

  • number of digits after the decimal

  • one-hot vectors of different categories extracted using Microsoft Recognizers for Text

  • one-hot vectors of the parts of speech of the target numeral as well as of the immediately preceding and succeeding words

Finally, we develop a logistic regression model which takes the embeddings and engineered features as input and predicts whether a given numeral is in-claim or out-of-claim. The hyper-parameters of the logistic regression model are: C = 1.0, fit_intercept = True, intercept_scaling = 1, max_iter = 100, penalty = l2, solver = lbfgs, tolerance = 0.0001. The macro F1 score of this model is 0.8318.
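A simplified sketch of this sub-system is given below, assuming the HuggingFace bert-base-uncased checkpoint and scikit-learn; the token-matching heuristic and the helper names are illustrative, and the one-hot category/POS features are assumed to be computed separately.

import numpy as np
import torch
from transformers import AutoTokenizer, AutoModel
from sklearn.linear_model import LogisticRegression

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased")

def numeral_embedding(context, numeral):
    # Mean of the contextual embeddings of the numeral's sub-word tokens.
    enc = tok(context, return_tensors="pt")
    with torch.no_grad():
        hidden = bert(**enc).last_hidden_state[0]          # (seq_len, 768)
    numeral_ids = set(tok(numeral, add_special_tokens=False)["input_ids"])
    # Simple heuristic: keep positions whose token id belongs to the numeral.
    positions = [i for i, t in enumerate(enc["input_ids"][0].tolist())
                 if t in numeral_ids] or [0]
    return hidden[positions].mean(dim=0).numpy()

def digit_features(numeral):
    before, _, after = numeral.partition(".")
    return [sum(c.isdigit() for c in before),              # digits before '.'
            sum(c.isdigit() for c in after)]               # digits after '.'

def build_features(context, numeral, onehots):
    # onehots: pre-computed one-hot category and POS vectors
    return np.concatenate([numeral_embedding(context, numeral),
                           digit_features(numeral), onehots])

clf = LogisticRegression(C=1.0, penalty="l2", solver="lbfgs",
                         max_iter=100, tol=1e-4)
# clf.fit(np.stack([build_features(c, n, o) for c, n, o in rows]), y_train)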

5.4 Final system

The final system is an ensemble model. It combines the predictions of the three sub-systems (S1, S2 and S3) using majority voting. The macro F1 score of this model is 0.8671.
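For completeness, a minimal sketch of the voting step, assuming each sub-system outputs a 0/1 label per instance:

import numpy as np

def majority_vote(p1, p2, p3):
    votes = np.array([p1, p2, p3])               # shape: (3, n_instances)
    return (votes.sum(axis=0) >= 2).astype(int)  # in-claim if at least 2 of 3 agree

# majority_vote([1, 0, 1], [1, 1, 0], [0, 0, 1]) -> array([1, 0, 1])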

6 Experiments

We performed the experiments in the phases described below.

6.1 Defining the context window

While exploring the data, we noticed that 1867 and 285 financial texts from the training set and the validation set, respectively, contain more than one target numeral. Thus, it was essential to define a context around the target numeral. We first tried to extract the sentences in which the target numerals were present. This did not solve the problem, as more than half of the data had multiple target numerals in a given sentence. We further tried to extract the portion of the text on which the target numeral is dependent using the dependency parser provided by spaCy. However, the performance did not improve. Finally, we performed several experiments varying the context window size from 2 to 10; a context window of size k means we consider k words before and after the target numeral. A context window of size 8 gave us the best results.
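One plausible reading of the dependency-based extraction is sketched below; this is an assumption rather than our exact heuristic, and it takes the subtree of the numeral's syntactic head as the numeral-dependent portion of the text.

import spacy

nlp = spacy.load("en_core_web_sm")   # requires: python -m spacy download en_core_web_sm

def dependent_span(text, start, end):
    doc = nlp(text)
    span = doc.char_span(start, end, alignment_mode="expand")
    if span is None:
        return text
    head = span.root.head             # syntactic head of the numeral
    return " ".join(t.text for t in sorted(head.subtree, key=lambda t: t.i))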

6.2 Exploring various embeddings and classification algorithms

We explored various ways to numerically represent the texts, from TF-IDF to sentence-transformer [27] embeddings generated using BERT [12], RoBERTa [25] and FinBERT [2]. We then trained several classifiers on these representations, including Logistic Regression, Random Forest [17] and XGBoost [7]. The performance of these models was not good enough. Thus, we added several engineered features as mentioned in Sect. 5.3. This improved the performance slightly, but the improvement was not notably high.

6.3 Fine-tuning pre-trained transformer based models

We fine-tuned several variants of the BERT [12] model for the classification task. A FinBERT [2] model trained with a batch size of 256 for 15 epochs with a learning rate of 0.00002 gave the best performance. This model was trained on a context window of size 8.

6.4 Adding information regarding category and handcrafted features

We experimented with adding the category to which the target numeral belongs as a one-hot vector. We further engineered several features as mentioned in Sect. 5.3. These additions improved the macro F1 score to 0.8315 and 0.8318, respectively.

6.5 Ensembling individual models

Finally, we combined the outputs of the individual models using majority voting. On combining the individual models described in Sect. 5, the macro F1 score improved from 0.8585 to 0.8671.

6.6 Implementation details

These experiments were performed on a node with an Nvidia Tesla V100 GPU and 32 GB of RAM. We used Python (3.7) for all computations. The main libraries used include PyTorch, SentenceTransformers, pandas, NumPy, scikit-learn and Microsoft recognizers-text-number.

7 Results and discussion

We present the overall results in Table 2. We observe that the machine learning classifiers built with TF-IDF features (with n-grams ranging from 1 to 4, ignoring terms with document frequency strictly lower than 0.0005) (Sl. No. 1 to 3) did not perform as well as those built with FinBERT embeddings as features (Sl. No. 4 to 6). We also tried extracting the portion of the financial text on which the target numeral is dependent and fine-tuned a FinBERT model using only those words. This did not yield any improvement in model performance (Sl. No. 7, macro F1 score = 0.7250). However, on adding handcrafted engineered features and using context words within a window of 6 for fine-tuning the FinBERT model, the macro F1 score improved to 0.8244 (Sl. No. 8). On adding information relating to categories as one-hot vectors, the F1 score further improved to 0.8315 (Sl. No. 9). Details regarding models S1, S2, S3 and their ensemble have already been given in Sect. 5. The ensemble model (Sl. No. 14) performed the best (macro F1 score = 0.8671 on the validation set, 0.8473 on the test set). This is a significant improvement over the existing baseline CapsNet [9] (Sl. No. 10, macro F1 score = 0.5736 on the test set). The results on the test set were provided by the organizers in [11].
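For reference, the TF-IDF baseline configuration corresponds to something like the following scikit-learn setup; the choice of Logistic Regression as the downstream classifier here is illustrative.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# n-grams of length 1-4; ignore terms with document frequency below 0.0005
tfidf_baseline = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 4), min_df=0.0005),
    LogisticRegression(max_iter=1000),
)
# tfidf_baseline.fit(train_contexts, train_labels)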

Next, we evaluate the performance of the ensemble model across different categories. We present this in Table 3. It is interesting to note that the model performs well for almost all categories except ‘product number’ and ‘date’. This is because the training set did not have a single in-claim instance of the category ‘date’ and only 9 such instances of the category ‘product number’.

Table 2 Overall results
Table 3 Category-wise performance of the ensemble model

7.1 Ablation study

We conduct a detailed ablation study to assess the importance of each component of the ensemble model. The results are presented in Table 4. We observe that the ensemble model performs better than its constituent models. While testing the hypothesis that the ensemble model is better than S1, we obtained a p-value of 0.18. We also modified S3 by removing the engineered features and considering only the largest sub-word token of the target numeral; this reduced the macro F1 score, demonstrating the contribution of every part of the final model. We further tried varying the context window size and again found that a context window of size 8 gives the best performance.

Table 4 Ablation study

7.2 Qualitative error analysis

Subsequently, we performed a qualitative evaluation of the instances where our model made wrong predictions. We present a sample in Table 5. We observe that more than 66% of the misclassified target numerals have a dollar (‘$’) symbol and 17% of them have a percentage (‘%’) symbol associated with them. The Microsoft digit recognizer was able to correctly place these instances into the categories ‘currency’ and ‘percentage’, respectively. Thus, first training a classifier to predict the categories and then training separate classifiers for each category might have helped achieve better performance.

Table 5 Qualitative analysis of misclassified instances

8 Conclusion and future works

In this paper, we introduced an ensemble-based system to detect whether numerals in financial texts are in-claim or out-of-claim. This system consists of three sub-systems. Two of these sub-systems were created by fine-tuning FinBERT [2] on context windows of 8 and 6 words before and after the target numeral. The third sub-system is a logistic regression model trained on BERT-based contextual embeddings of the target numeral and a few engineered features. We conclude that adding handcrafted features and information relating to the category of the target numeral improves the performance slightly. However, a model trained using only the portion of the text on which the target numeral is dependent performs poorly. This is probably because the algorithm used to extract the dependent text does not yield acceptable results. After conducting several experiments, we conclude that our model is more effective than the baseline CapsNet architecture [9].

In the future, we would like to build a custom tokenizer that tokenizes other words into sub-tokens while keeping the target numeral intact. We also want to experiment with changing the ensembling method from majority voting to a meta-classifier. Furthermore, a Convolutional Neural Network (CNN) or a Long Short-Term Memory (LSTM) model trained using the context embeddings may yield better results. Another interesting direction would be to examine whether we could leverage knowledge from similar datasets [9, 10] available in other languages, such as Chinese. Finally, we want to improve the algorithm used to extract, from the given texts, the words on which the target numerals depend.