1 Introduction

Financial distress prediction models are increasingly considered among the most valuable financial models, as they can provide stakeholders with an early warning signal of imminent corporate failure. Past financial crises incurred considerable financial losses, heightening the urgency of alleviating the information asymmetry between firm managers and other stakeholders by providing the latter with powerful financial distress prediction models.

The overall objective of financial distress prediction is to foresee whether a company will run into financial difficulties, using statistical or machine learning models based on quantitative or qualitative financial indicators currently available. Most previous models were designed with the aim of accurate detection one year before failure [3]. Many traditional models fail for longer prediction horizons [5]. However, some models are able to predict financial distress earlier, typically three to five years before failure, with relatively high accuracy [12, 54]. Financial distress prediction plays a fundamental role in a number of decision-making problems faced by corporate stakeholders, including corporate financial managers’ decisions, investors’ investment decisions, lenders’ credit decisions and banks’ assessment of clients’ creditworthiness [56].

Most financial distress prediction models have so far tended to concentrate on financial indicators derived from corporate financial statements [3]. Furthermore, some studies focusing on SMEs in particular have considered external drivers unrelated to managerial skills [17, 19], such as psychological factors (e.g. psychological distress of owners), natural disasters, regulatory inefficiencies or labour market turbulence. However, with the increasing availability of textual data relating to companies, there is an emerging opportunity to examine whether indicators derived from these textual documents can improve the prediction of financial distress. In particular, the attitudes of managers and investors have been found to be significant indicators of firms’ financial performance. Prior research in this respect has concentrated on extracting textual sentiment from managerial discussion included in corporate disclosures [22, 24, 38, 44]. However, linguistic indicators derived from annual reports suffer from low frequency and, therefore, outdatedness, which limits their use in dynamic financial distress prediction. In addition to annual reports, other sources of managerial communication are available. In particular, earnings conference calls provide a more flexible source of information with more forward-looking narratives, as they also include communication between executives and analysts [46]. Indeed, the linguistic tone of conference call transcripts has proved to be a prominent indicator of abnormal returns [15]. Furthermore, previous studies revealed that the effect of emotional linguistic information may outweigh the value of factual accounting information as disclosed by executives, thereby indicating underlying company financial risks [43]. Evidence from empirical studies also suggests that nonverbal managerial communication, when coupled with financial indicators, is incrementally useful in financial performance prediction [40, 47]. This can be attributed to the cognitive conflict (dissonance) between actual financial performance and the managerial emotional state, suggesting potential financial difficulties.

Motivated by the above, this study asserts that sentiments and emotions derived from earnings conference calls can provide important indicators of financial distress. To this end, this study aims to incorporate linguistic indicators and vocal cues extracted from earnings conference calls into models predicting corporate financial distress. To address this issue, a deep learning model is used that incorporates: (1) a set of diverse dictionary-based sentiment and emotion indicators and a domain-specific FinBERT model to represent conference call transcripts; (2) a convolutional neural network (CNN)-based speech emotion recognition model proposed by [29] to capture the managerial emotional states in conference call recordings; and (3) a long short-term memory (LSTM) neural network leveraging the linguistic indicators, emotional states and financial indicators for dynamic financial distress prediction. By fusing verbal and nonverbal managerial communication obtained from the transcripts and audio recordings of earnings conference calls, respectively, with financial ratios from corporate financial statements, we expect the fused model to outperform traditional prediction models that use only financial indicators. Furthermore, by comparing the predictive performance of verbal and nonverbal attributes, this study aims to enhance our understanding of the role of text sentiment and speech emotion in financial distress prediction.

In summary, the contributions of this paper are fourfold:

  • A novel dynamic financial distress prediction model based on LSTM neural networks is proposed. Unlike existing stationary financial distress prediction models based on linguistic attributes, the proposed model utilises a more frequently available source of textual data, namely earnings conference calls.

  • For the first time, a rich word representation combining dictionary-based emotion features and a domain-specific transformer-based sentiment model is used for financial distress prediction. To this end, high lexical coverage is ensured by using a wide range of emotion indicators together with sentiment derived from a pre-trained FinBERT model.

  • To the best of our knowledge, this is the first study using speech emotion recognition for financial distress prediction. Existing research based on vocal features was limited to stock price prediction and financial fraud prediction.

  • To show the predictive power of the verbal and nonverbal components of the model, several sets of experiments are performed to predict the financial distress of the largest US companies. As far as we know, this is the first time that a combination of verbal and nonverbal features is used for financial distress prediction.

This paper is a substantially extended version of [20]. Note that the earlier version was limited to speech emotions, while neglecting the sentiment and emotion information provided in conference call transcripts. Furthermore, an improved deep learning architecture for financial distress prediction is proposed in the current study and the effect of different approaches to sentiment analysis is investigated. This extension also presents an in-depth comparative analysis with existing financial distress prediction models based on textual data.

The remainder of this paper is organised as follows. Section 2 provides a literature review on using linguistic indicators and vocal cues in the field of finance. Section 3 presents the components of the proposed conceptual framework for financial distress prediction. Section 4 describes the data used. Section 5 describes the setting of the proposed deep learning architecture and presents experimental evidence for the excellent performance of the proposed model, especially compared with existing prediction models using only financial indicators. Section 6 concludes the paper and suggests future research directions.

2 Related work

The literature review provided in this section focuses on two main lines of research in the field of sentiment and emotion analysis in finance, namely the use of (1) linguistic sentiment indicators obtained through text analysis and (2) emotional states extracted through speech emotion recognition.

2.1 Financial distress prediction using linguistic indicators

Early works in the area of financial distress prediction focused primarily on quantitative financial indicators extracted from financial statements [3, 56]. The study of Cecchini et al. [11] has made a significant contribution to the field by examining, for the first time, the effect of information obtained from financial texts on the accuracy of predicting financial distress. More precisely, terms with a high frequency of occurrence and high discriminating power were identified in MD&A (management discussion and analysis) sections of corporate annual reports, thus complementing the quantitative financial indicators. As a result, the improvement in prediction accuracy was an impressive 17.1% over the baseline model consisting of the Altman model’s financial ratios. Notably, even using text alone, an accuracy of 80.0% was achieved, outperforming the often-used Altman model. Limitations of the methodology used by Cecchini et al. [11] stem from the bag-of-words approach, leading to sparse and high-dimensional document representation. Moreover, the reliability of the discriminatory power of the words contained in the annual reports of companies with financial distress is limited by the small number of such companies.

To overcome the problems of high dimensionality and sparsity of the bag-of-words document representation, dictionary-based approaches have been preferred in subsequent studies [22, 44]. Among other things, these approaches facilitate model interpretability by condensing the bag-of-words information into typically two linguistic indicators, i.e. positive and negative sentiment. However, sentiment is context-sensitive, and general dictionaries, such as the Harvard General Inquirer, often misclassify words used in financial contexts, so dictionaries developed for finance have proved to be more relevant. In their widely acclaimed work, Loughran and McDonald [36] proposed a set of dictionaries for the finance domain, including those for positive and negative financial sentiment. This has allowed other researchers to more accurately estimate sentiment in annual reports (especially managerial sentiment in the MD&A sections) and to achieve more accurate predictions of credit ratings [22], overall corporate financial performance [24] and corporate bankruptcy [23]. These studies also offered sensitivity analyses comparing general and financial dictionaries, concluding that the latter are more indicative predictors of financial distress.

To achieve a richer linguistic representation of annual reports, Wang et al. [62] combined the dictionary-based approach with a bag-of-words model based on term frequency. Specifically, the model using financial features performed best (AUC=0.912), followed by the bag-of-words model (AUC=0.845), while the financial dictionary-based model performed the worst among these individual models (AUC=0.681), suggesting that dictionary-based linguistic indicators, while easy to interpret, do not provide sufficient predictive power on their own. However, when the features of these approaches were fused, the overall performance of financial distress prediction increased significantly to AUC=0.961, although this was mainly attributable to the bag-of-words linguistic features.

In addition to the problems of high dimensionality and sparsity, bag-of-words models also fail to adequately capture the context of financial documents. To overcome these problems, low-dimensional dense feature representations based on word embeddings were used in more recent studies. Specifically, the word2vec model was used by Mai et al. [38] to effectively capture the word meaning and word context of annual reports. By simply averaging the word embeddings in each document and adding two fully connected layers, a relatively accurate financial distress prediction model was produced that outperformed more complex models using convolutional neural networks. However, Matin et al. [39] and Huang et al. [27] reached the conclusion that word embeddings of auditors’ reports are more informative than those based on managerial statements in annual reports. The main drawback of the word2vec model is that it does not support explicit sentence representation. In contrast, the BERT language model generates contextual sentence and word representations. Li et al. [32] exploited the advantages of the BERT model when constructing a financial domain sentiment dictionary, which outperformed the traditional dictionary of Loughran and McDonald [36]. Most recently, Jiang et al. [30] have shown that semantic features extracted from the BERT model have stronger predictive power than topic and sentiment features.

A list of related studies is presented in Table 1, showing the method used, the data and the resulting classification accuracy.

Table 1 Financial distress prediction using textual data

Overall, earlier research reported that the textual information presented in annual reports is qualitatively complementary to the data obtained from financial statements and can therefore be of great importance to stakeholders. The purpose of the annual report is to provide additional information on the company’s financial situation and to outline its future financial development. In addition, management priorities and financial risks are discussed. However, the above studies using annual reports as a text source have several major shortcomings. The main one is the low frequency of annual reports. Hence, linguistic indicators extracted from annual reports are not ideal for time series prediction. As a result, the above studies represent only stationary financial distress prediction models, neglecting the impact of concept drift in financial distress prediction. In fact, the area of dynamic financial distress prediction is attracting considerable interest due to its capacity to take into account the change in the distribution of financial data over time [1, 57, 58]. To capture the dynamics of this change, we used LSTM neural networks in this study.

The low frequency of disclosure of annual reports is linked to the lack of timeliness of the data contained therein. To provide the financial distress prediction model with more up-to-date dynamic data with higher frequency, we use earnings conference calls as the text source in the current study. Similar to annual reports, conference calls capture insider perspectives and views, reducing the potential information asymmetry between investors and managers. In addition, earnings conference calls are a more flexible source of information than annual reports, and the conversation between executives and analysts regarding corporate performance is a potentially rich source of information [46]. Furthermore, earnings conference calls typically contain more forward-looking narratives [8], and therefore, the linguistic features in these transcripts have more potential to predict future corporate financial performance.

Although BERT represents a state-of-the-art language model, sentiment analysis is a heavily domain-dependent task. Therefore, to leverage the large amount of financial text, including corporate reports, earnings call transcripts and analysts’ reports, FinBERT was pre-trained as a domain-specific BERT model. Results reported by Araci [6] and Yang et al. [64] suggest that FinBERT outperforms the generic-domain BERT model on financial sentiment analysis tasks. Another limitation of previous studies is their tendency to focus on polarity-oriented sentiment, using lexicons of positive and negative words. Not only do these lexicons usually lack sentiment intensity, but narrowing the scope of linguistic analysis to only positive and negative sentiment is inadequate in the case of earnings conference calls, as management typically displays a much broader range of emotions during discussions with analysts, such as anger, surprise and fear, among others [46]. Therefore, to provide the financial distress prediction model with a state-of-the-art finance-specific language model and fine-grained sentiment indicators, we combine FinBERT-based sentiment with emotion features to constitute a rich linguistic representation of earnings conference calls.

2.2 Vocal features in finance

A considerable amount of the literature has been published on the role of vocal cues in finance in the past decade. Earlier research has shown that financial prediction models can benefit from vocal features derived from earnings conference calls, see Table 2. Considering the vocal features used, earlier research tended to apply two fundamental voice analysis tools: LVA (layered voice analysis) [26, 40] and Praat [50, 63].

LVA allowed the researchers to derive affective state levels, including the cognitive, thinking, emotional and stress level. The cognitive level represents cognitive dissonance, while the emotional level stands for the level of excitement. The thinking level denotes mental efforts, and physical arousal is estimated using the stress level. In particular, several studies using LVA reported that extremely high emotional levels are significant indicators of abnormal financial states. Mayew and Venkatachalam [40] and Price et al. [47] reported that excessive affective states indicate future cumulative abnormal returns of stocks, while vocal dissonance can be used to identify financial fraud [26].

Table 2 Summary of prior studies using vocal features in finance

The Praat voice analysis tool offers a different perspective from LVA, enabling users to quantify a variety of acoustic features, such as voice intensity, pitch, shimmer, jitter and excitation. Throckmorton et al. [61] revealed that combining linguistic and vocal features outperformed financial indicators in fraud detection. Similarly, testimonies concealing deception allegedly had a higher pitch and lower voice quality than legitimate testimonies [9]. Vocal cues are also reported to have a good predictive capacity for stock price volatility [63]. In a similar manner, the multi-modal prediction model of Sawhney et al. [50] also worked well in predicting stock price volatility, as did the neural attentive alignment models capturing interdependencies between verbal and vocal modalities [49, 51]. The most recent work found that there is gender bias in multi-modal financial forecasts [52] and that managers modify the way they talk knowing that machines are listening [10].

Summarising the previous findings, vocal features have so far only been used in detecting financial fraud and predicting stock price volatility and abnormal stock returns. In addition to the new application area of vocal cues, this study takes advantage of recent developments in the field of speech emotion recognition. Specifically, deep learning allows us to accurately identify the emotional states of managers based on a wide range of spectral features.

3 Model for financial distress prediction

The conceptual framework outlined in the current study for predicting financial distress is represented by a deep learning model combining three types of information: (1) managers’ emotions obtained from their speech during conference calls; (2) sentiment obtained from the transcription of the conference calls; and (3) conventional financial indicators from corporate financial statements. This section details the model.

3.1 Deep learning model for financial distress prediction

The proposed financial distress prediction model is presented in Fig. 1. The deep learning architecture comprises a pre-trained CNN model that classifies the audio recordings from earnings conference calls into eight emotional states using a wide range of spectral features. The CNN model for speech emotion recognition is derived from the model proposed by Issa et al. [29] and adapted for eight emotional states. Specifically, the first two one-dimensional convolutional layers contained 256 and 128 filters, respectively. As depicted in Fig. 1, the kernel size was 5 and the stride was 1. A max-pooling layer with a window size of 8 followed. Next, another one-dimensional convolutional layer was used; its output was flattened and passed to a fully connected layer followed by a dropout layer to prevent over-fitting (dropout rate of 0.2). Finally, a softmax layer was used to produce the predicted emotional states. The Adam optimizer was used to train the CNN model with the cross-entropy loss function and a learning rate of 0.0001. Note that to obtain the managerial emotional states, the CNN model was trained on a labelled benchmark dataset using fivefold cross-validation. Then, the trained model was used to classify the unlabelled managers’ audio recordings.

Fig. 1 Deep learning model for financial distress prediction
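To make the architecture concrete, a minimal Keras sketch of the CNN described above is given below. The filter count of the third convolutional layer and the width of the fully connected layer are not stated in the text, so the values used for them are illustrative assumptions.

import tensorflow as tf
from tensorflow.keras import layers, models

NUM_FEATURES = 180   # spectral features per recording (see Sect. 3.2)
NUM_EMOTIONS = 8     # emotional states of the RAVDESS dataset

def build_ser_cnn():
    # CNN for speech emotion recognition, following the layer sizes stated above;
    # the third convolutional layer's filter count and the dense layer width are assumptions.
    model = models.Sequential([
        layers.Input(shape=(NUM_FEATURES, 1)),
        layers.Conv1D(256, kernel_size=5, strides=1, padding="same", activation="relu"),
        layers.Conv1D(128, kernel_size=5, strides=1, padding="same", activation="relu"),
        layers.MaxPooling1D(pool_size=8),
        layers.Conv1D(128, kernel_size=5, strides=1, padding="same", activation="relu"),  # assumed filters
        layers.Flatten(),
        layers.Dense(64, activation="relu"),   # assumed width
        layers.Dropout(0.2),
        layers.Dense(NUM_EMOTIONS, activation="softmax"),
    ])
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model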

Next, the emotional state features were fused with the 20 financial features from quarterly financial statements and the FinBERT-based sentiment and emotional states extracted from conference call transcripts. Thus, the inputs for the LSTM layers were produced. The LSTM model was employed to capture high-level temporal features from sequential quarterly corporate data [14, 34]. In contrast to the CNN model, whose structure was taken from a previous study [29], different LSTM structures were examined to achieve the best prediction performance. More precisely, we tested the LSTM model with one or two layers comprising \(2^3, 2^4, \dots, 2^9\) units and one or two fully connected layers with \(2^3, 2^4, \dots, 2^8\) units. The LSTM model was trained to predict the financial distress class (safe/grey/distress) one year ahead. Alternatively, a softmax layer with two units was used to predict the financial performance trend (upward/downward). After testing different model structures, the final LSTM model consisted of two LSTM layers with \(2^7\) and \(2^6\) units and one fully connected layer with \(2^3\) units. The Adam optimizer was used with the cross-entropy loss function, the learning rate was 0.0001, and the number of epochs was 100. Again, this setting of the hyperparameters was the result of experimentation with various values, namely learning rates of 0.01, 0.001 and 0.0001, and 10, 100 and 500 epochs. For the experiments with the proposed deep learning model, the Keras library was used running on a Jetson AGX Xavier (512-core Volta GPU, 32 GB memory).
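For illustration, the final LSTM configuration can be sketched in Keras as follows. The length of the input window (number of quarters) and the exact size of the fused feature vector are assumptions, as they are not stated explicitly above.

from tensorflow.keras import layers, models, optimizers

TIME_STEPS = 4     # quarters per input sequence (assumption)
NUM_FEATURES = 30  # fused speech-emotion, text-sentiment/emotion and financial features (assumption)

def build_lstm(num_classes=3):
    # Final structure reported above: two LSTM layers (128 and 64 units) and one
    # fully connected layer (8 units); num_classes is 3 for the safe/grey/distress
    # task and 2 for the upward/downward trend task.
    model = models.Sequential([
        layers.Input(shape=(TIME_STEPS, NUM_FEATURES)),
        layers.LSTM(128, return_sequences=True),
        layers.LSTM(64),
        layers.Dense(8, activation="relu"),
        layers.Dense(num_classes, activation="softmax"),
    ])
    model.compile(optimizer=optimizers.Adam(learning_rate=1e-4),
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model

# model = build_lstm(num_classes=3)
# model.fit(X_train, y_train, epochs=100)   # 100 epochs, as reported above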

3.2 Speech emotion recognition

As highlighted above, prior research has only looked at the vocal utterances of managers, not directly gauging the emotions managers express in earnings conference calls. To tackle this issue, we first used a CNN model for speech emotion recognition. It is worth noting that various models have been developed and introduced to conduct speech emotion recognition [2]. It was decided that the best model to adopt for this investigation was the deep CNN model proposed by [29] because it delivered state-of-the-art performance for diverse speech emotion datasets, including the RAVDESS (Ryerson Audio-Visual Database of Emotional Speech and Song) dataset [35]. This dataset was chosen to pre-train the CNN model in this study because it is considered a validated and genuine benchmark dataset in the field of speech emotion recognition  [16, 18, 53].

The RAVDESS dataset includes 1,440 audio recordings. When building the dataset, different levels of emotional intensity were considered, and each recording was evaluated in terms of emotional validity, genuineness and intensity to provide researchers with a gold standard for speech emotion recognition [35]. The dataset comprises eight emotional states: calm, neutral, surprised, happy, angry, sad, disgust and fearful. It is worth noting that the dataset is almost balanced in terms of the frequency of individual emotions (only the neutral emotional state includes 96 recordings; each of the other emotional states is represented by 192 audio recordings).

To obtain features for speech emotion recognition, the Librosa audio library [41] was utilised. More precisely, 180 spectral features were extracted, divided into three distinct categories: (1) 40 mel-frequency cepstral coefficients, (2) 12 chromagram features and (3) 128 mel-scaled spectrogram features. This set of features was used to mimic the way humans perceive sound frequencies [29]. Mel-frequency cepstral coefficients form a mel-frequency cepstrum, thus representing the short-term sound power spectrum. The short-time Fourier transform is used to obtain chromagrams, representing harmony and pitch classes [41]. The mel-scaled spectrogram features are obtained using a nonlinear transformation of the spectrogram to simulate how relative frequencies are perceived by the human ear. Altogether, our aim was to obtain a rich representation of the audio recordings for the CNN model.
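A minimal sketch of this feature extraction with the Librosa library is shown below. Averaging each feature matrix over time to obtain a fixed-length 180-dimensional vector is a common convention and is assumed here, rather than being taken from the original study.

import numpy as np
import librosa

def extract_spectral_features(path):
    # Extract the 180 spectral features described above: 40 MFCCs,
    # 12 chromagram bins and 128 mel-spectrogram bands, each averaged over time.
    y, sr = librosa.load(path, sr=None)
    mfcc = np.mean(librosa.feature.mfcc(y=y, sr=sr, n_mfcc=40), axis=1)             # 40 features
    chroma = np.mean(librosa.feature.chroma_stft(y=y, sr=sr), axis=1)               # 12 features
    mel = np.mean(librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128), axis=1)   # 128 features
    return np.concatenate([mfcc, chroma, mel])  # 180-dimensional feature vector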

After pre-training the CNN model on the RAVDESS dataset, we fed the CNN model with the unlabelled audio recordings of earnings conference calls to get the labels of managers’ emotional states.

3.3 Text sentiment analysis

As noted above, this study exploited the combination of domain-specific sentiment analysis and emotional features to produce a rich linguistic representation of conference call transcripts that adequately covers a broad range of managerial sentiments and emotions.

For the domain-specific sentiment analysis, we chose the pre-trained FinBERT model due to its remarkable performance in sentiment analysis of various financial texts [6]. The FinBERT model for sentiment analysis was built in two steps. First, the authors used a large financial corpus, Reuters TRC2-financial, with almost 30 million words to pre-train the BERT model [6]. Second, the model was fine-tuned for sentiment analysis using the Financial PhraseBank dataset, an annotated set of 4,845 sentences from financial news. Specifically, the annotators were requested to assign labels based on how they thought the information in the financial news would be reflected in the stock price of the company mentioned. The fine-tuned pre-trained FinBERT model was used to label each sentence in the conference call transcripts. That is, we first performed sentence segmentation (using the Stanza tools [48]), and then the intensity of conference call sentiment was calculated as the sentiment score averaged over all sentences.
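A minimal sketch of this two-step procedure (Stanza sentence segmentation followed by sentence-level FinBERT scoring) is shown below. The ProsusAI/finbert checkpoint and the signed averaging of sentence scores are assumptions, since the footnoted model and the exact aggregation scheme are not reproduced here.

import stanza
from transformers import pipeline

# stanza.download("en") is required once before building the pipeline.
nlp = stanza.Pipeline(lang="en", processors="tokenize")
finbert = pipeline("text-classification", model="ProsusAI/finbert")  # assumed checkpoint

def transcript_sentiment(text):
    # Segment the transcript into sentences, score each with FinBERT
    # (positive/negative/neutral) and average the signed sentence scores.
    sentences = [s.text for s in nlp(text).sentences]
    scores = []
    for out in finbert(sentences, truncation=True):
        sign = {"positive": 1.0, "negative": -1.0, "neutral": 0.0}[out["label"]]
        scores.append(sign * out["score"])
    return sum(scores) / len(scores) if scores else 0.0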

It was considered that text emotional features would usefully supplement and extend both the speech emotions and the sentiment analysis, as reported in previous business research [25, 31]. To obtain the text emotional features, the following emotional states were considered in agreement with the Plutchik wheel of emotions [42]: anticipation, surprise, joy, trust, sadness, disgust, fear and anger. The text emotional features were calculated as the sum of emotional words matching the respective categories in pre-defined lexicons. The NRC emotion lexicon was used for analysing emotions in conference call transcripts.
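A minimal sketch of this lexicon matching is shown below. It assumes the publicly distributed tab-separated NRC EmoLex word-emotion association file; the simple regular-expression tokenisation is an illustrative simplification.

import re
from collections import Counter

PLUTCHIK_EMOTIONS = ["anticipation", "surprise", "joy", "trust",
                     "sadness", "disgust", "fear", "anger"]

def load_nrc_lexicon(path):
    # Parse lines of the form "word<TAB>emotion<TAB>0/1" into {emotion: set of words}.
    lexicon = {e: set() for e in PLUTCHIK_EMOTIONS}
    with open(path, encoding="utf-8") as f:
        for line in f:
            word, emotion, flag = line.rstrip("\n").split("\t")
            if flag == "1" and emotion in lexicon:
                lexicon[emotion].add(word)
    return lexicon

def emotion_counts(text, lexicon):
    # Sum of emotional words in the transcript matching each Plutchik category.
    tokens = Counter(re.findall(r"[a-z']+", text.lower()))
    return {emotion: sum(n for w, n in tokens.items() if w in words)
            for emotion, words in lexicon.items()}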

3.4 Financial features

The obtained audio and text features were matched with traditional financial features. To calculate the financial features, corporate financial statement data were used. The publicly available Macrotrends database was used as the source of the financial data. Consistent with relevant studies on the prediction of financial distress [21, 55], the financial features comprised the following financial categories: (1) company size (sales, equity, total assets, cash flow); (2) activity ratios (inventory turnover, receivable turnover, asset turnover); (3) leverage ratios (total debt, equity to liabilities); (4) profitability ratios (gross margin, operating margin, return on equity, return on total assets, retained earnings to total assets); (5) liquidity ratios (free cash flow per share, operating cash flow per share, working capital to total assets, cash ratio, current ratio); and (6) a global measure of financial performance (Altman’s Z-score).
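As an illustration of how such features can be derived from quarterly statement data, the sketch below computes a few of the listed ratios with pandas. The column names are hypothetical, and the formulas follow the usual definitions of the respective ratios.

import pandas as pd

def financial_features(df: pd.DataFrame) -> pd.DataFrame:
    # df holds quarterly statement items; column names are hypothetical.
    out = pd.DataFrame(index=df.index)
    out["asset_turnover"] = df["sales"] / df["total_assets"]
    out["equity_to_liabilities"] = df["equity"] / df["total_liabilities"]
    out["return_on_assets"] = df["net_income"] / df["total_assets"]
    out["working_capital_to_assets"] = (df["current_assets"] - df["current_liabilities"]) / df["total_assets"]
    out["current_ratio"] = df["current_assets"] / df["current_liabilities"]
    return out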

Company financial distress is generally regarded as a company’s inability to meet its debt payment obligations. Since Altman’s Z-score is the preferred model for the early detection of financial distress in the literature [5], we opted for this model to obtain the financial distress classes. Specifically, the model categorises companies into three zones: safe, grey and distress. Another advantage of this model is its validity for firms listed on US financial markets [5]. Furthermore, the model was shown to perform well on bankruptcy prediction tasks, accurately predicting bankruptcy up to 2 years ahead. The formula for the Z-score model for publicly traded companies is as follows:

$$Z\text{-}score = 1.2 \times \frac{WC}{TA} + 1.4 \times \frac{RE}{TA} + 3.3 \times \frac{NI}{TA} + 0.6 \times \frac{E}{TL} + 1.0 \times \frac{TS}{TA}, \qquad (1)$$

where TA is total assets, WC is working capital, RE denotes retained earnings, NI is net income, E represents equity, TL denotes total liabilities and TS is total sales. A Z-score of 3 or higher denotes the safe class (healthy companies with a high survival probability), that of 1.80 to 2.99 indicates the grey zone (signalling some financial difficulties), and that of less than 1.80 denotes the distress class (high-risk companies with financial difficulties).
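A small worked sketch of Eq. (1) and the zone assignment described above:

def altman_z_score(wc, re_, ni, e, tl, ts, ta):
    # Z-score for publicly traded companies, Eq. (1).
    return (1.2 * wc / ta + 1.4 * re_ / ta + 3.3 * ni / ta
            + 0.6 * e / tl + 1.0 * ts / ta)

def distress_class(z):
    # Zone boundaries as stated above.
    if z >= 3.0:
        return "safe"
    if z >= 1.80:
        return "grey"
    return "distress"

# Example: WC=50, RE=120, NI=30, E=300, TL=400, TS=500, TA=600
# gives Z = 0.100 + 0.280 + 0.165 + 0.450 + 0.833 = 1.83, i.e. the grey zone.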

4 Data

The database of audio recordings comprised 1,278 earnings conference calls gathered from the publicly available EarningsCast database on a quarterly basis from Q1 2010 to Q3 2021. The data sample included the 40 largest US companies listed on the NYSE (New York Stock Exchange) in terms of market capitalisation. In the pre-processing stage, the collected audio recordings were first converted to .wav files. Note that the audio recordings were supplied without any breakdown or identification of the speakers. For this reason, each whole recording was labelled with one emotional state based on the set of 180 spectral features. To illustrate the features used for speech emotion recognition, Fig. 2 presents the spectral features obtained from an earnings conference call for Adobe. Before we could label these conference calls, however, it was necessary to train the speech emotion recognition model. To this end, the RAVDESS dataset was used to train the CNN model. As a result, 69.8% accuracy was achieved on the RAVDESS dataset using fivefold cross-validation. We adopted this testing scheme in keeping with the original study [29]; we did not observe substantial differences in classification performance for other data split ratios or for tenfold cross-validation. It is worth noting that this performance is close to that reported by Issa et al. [29]. We also examined the performance of other models that have previously shown promising accuracy on the RAVDESS dataset. More precisely, we obtained the following results using alternative approaches: 58.4% accuracy for a bagged ensemble of SVMs [7], 59.8% for an AdaBoost ensemble of SVMs [7] and 64.5% for deep residual networks [65]. Figure 3 shows that the classification performance was reasonable across all eight emotional states.

Fig. 2 Example of spectral features (mel-scaled spectrogram, mel-frequency cepstral coefficients (MFCC) and chromagram) extracted from the earnings conference call of Adobe in Q4 2019

Fig. 3 Confusion matrix of the CNN speech emotion recognition model for the RAVDESS dataset

The trained CNN model was then used to label the conference call recordings with emotional states. As presented in Fig. 4, a calm emotional state dominated the audio recordings (67%), followed by happiness (10%). In contrast, disgust and surprise rarely occurred (< 1%).

Fig. 4 Frequency of emotional states in the conference call dataset

Regarding the text sentiment and emotional features, Table 3 shows that the overall sentiment of the conference call transcripts was predominantly positive and that managers tended to use trustworthy and anticipatory tones while avoiding negative emotions, such as disgust and anger.

Table 3 Descriptive statistics for linguistic features

The overall financial performance of the companies is depicted in Fig. 5, indicating a deterioration in corporate financial performance from 2019 onwards. Indeed, the global economy entered a challenging period of recession, recording the lowest growth of the decade due to decelerating domestic investment and trade disputes.

Fig. 5 Overall financial performance of the sample in terms of Z-score. The median is denoted by the central marker, the box indicates the interquartile range, and the whiskers refer to data instances not regarded as outliers

The firm-quarter observations were categorised into three financial distress classes according to the Z-score, with 59 observations belonging to the safe class (4.6%), 329 falling into the grey zone (25.7%) and 890 classified as distressed (69.7%), suggesting a problem of class imbalance. Interestingly, according to Altman et al. [5], the high percentage of firms in the distress class may be indicative of an impending financial crisis. The dataset was constructed in order to perform a 1-year-ahead prediction of financial distress. To validate the proposed model, an additional run of experiments was considered, categorising the observations into two classes according to whether the overall financial performance (Z-score) increased or decreased in the next year. Regarding the trend of financial performance, 615 observations were classified in the upward class, while 663 observations were put into the downward class.

5 Model validation

To evaluate the effectiveness of the proposed model, two separate experimental runs were performed. In the first run, this study examined the model performance in predicting three financial distress classes (safe companies/grey zone companies/distressed companies). In the second run, the model was trained to forecast the upward or downward trend of financial distress. Both sets of experiments considered 1-year-ahead prediction. Data partitioning took into account the temporal nature of the data, using data for 2010 to 2016 as the training set and data for 2017 to 2020 as the test set. To estimate the performance of the financial distress prediction models, consistent with the previous literature [27, 32], a combination of three measures was considered in the current study. Specifically, accuracy (Acc), the F1 measure and the AUC (area under the receiver operating characteristic curve) were used to evaluate the classification performance of the model. The F1 measure is the weighted harmonic mean of precision and sensitivity, where precision denotes the proportion of companies predicted as distressed that are actually distressed, and sensitivity (recall) indicates the proportion of actually distressed companies that the model correctly identifies. The AUC evaluates the model’s probability of ranking a randomly selected distressed company higher than a randomly selected non-distressed company. The main strength of the AUC measure is its robustness against imbalanced classes.
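For concreteness, the three measures can be computed with Scikit-learn as sketched below; the weighted averaging and the one-vs-rest AUC are assumptions, as the averaging scheme is not stated above.

from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

def evaluate_three_class(y_true, y_pred, y_proba):
    # Acc, F1 and AUC for the safe/grey/distress task. For the binary trend task,
    # pass the positive-class probabilities to roc_auc_score and omit multi_class.
    return {
        "Acc": accuracy_score(y_true, y_pred),
        "F1": f1_score(y_true, y_pred, average="weighted"),
        "AUC": roc_auc_score(y_true, y_proba, multi_class="ovr", average="weighted"),
    }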

To examine the role of text and speech emotions and FinBERT-based sentiment, we started with experiments restricted to these variables only, that is, excluding financial indicators. For this set of experiments, random forest was utilised because it is considered a state-of-the-art benchmark method in financial distress prediction [45, 54]. Existing text sentiment features were used for comparative purposes, including the sentiment score [66] and sentiment indicators from the L&M dictionary [32]. Furthermore, the BERT language model [30] was used as the state-of-the-art linguistic approach. On the one hand, given the low dimensionality of our text representation, one cannot expect to beat BERT. On the other hand, our intention was to create an easily interpretable model, which is difficult in the case of the BERT model (each layer contains vectors of length 768). Figures 6 and 7 show the corresponding prediction performance for the class prediction and trend prediction tasks, respectively. First of all, the BERT-based model excelled in both classification tasks in terms of all evaluation metrics, which is consistent with existing research [30, 32]. However, our model, which combined text and speech emotions with FinBERT-based sentiment, was highly competitive and substantially outperformed the two compared dictionary-based sentiment approaches. Figures 6 and 7 also reveal that the trend prediction task is too challenging for the traditional sentiment features. The results also demonstrate a considerable difference between the performance obtained by sentiment and emotional features, thus confirming the role of managerial emotions in predicting financial distress. The results in Figs. 6 and 7 also show a clear trend of increasing classification performance with richer representations of conference calls, validating the contribution of all proposed feature categories.

Fig. 6 Prediction performance for different speech and linguistic features—financial distress class prediction

Fig. 7 Prediction performance for different speech and linguistic features—financial distress trend prediction

In the next step, to validate the effectiveness of the proposed prediction model, we used the proposed deep learning architecture with the above features fused with the financial indicators and compared the model’s performance with a baseline approach based on financial indicators only. To further validate the prediction performance of the proposed model, a variety of previously proposed financial distress prediction models were considered. Note that, in agreement with the original studies that introduced these models, the compared models were based solely on financial indicators observed one year before the prediction target. The following models were considered for comparative purposes:

  • Multi-layer perceptron (MLP) [4] architecture comprising one hidden layer with 20 rectified linear units. The Adam optimiser was used to train the MLP model with the following parameters: The maximum number of epochs was 200, the learning rate was 0.001, and the L2 penalty parameter was set to 0.0001.

  • XGBoost [28]. Consistent with Huang and Yen [28], the maximum tree depth was 10, the learning rate was 0.1, and the subsample ratio was 0.7.

  • Stacking SVM [33]. As in Liang et al. [33], we used the SVM with linear kernels to generate the base and meta-classification models. We examined different values of the regularisation parameter \(C=\{2^{-1},2^0,2^1,\dots , 2^5\}\) for these classification models.

  • Clustering-based under-sampling (CUS) + gradient boosting decision tree (GBDT) [13] with 3 clusters for CUS and 100 GBDT estimators.

  • Synthetic minority over-sampling technique (SMOTE) + AdaBoost SVM (ADASVM) [58]. Consistent with [58], the total count of new minority class instances generated by SMOTE was equal to the number of majority instances minus the number of minority instances. As the SVM training parameters are not reported in the original study, different kernel functions were examined for the SVM base classifiers, and again, we experimented with the regularisation parameter C as with the stacking SVM.

To implement the above comparative models, we used the Scikit-learn library and the Imbalanced-learn library.
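To make the settings of these baselines concrete, the sketch below instantiates them with the libraries just mentioned (plus XGBoost). The listed hyperparameters follow the bullet points above; the stacking layout (one base SVM per C value) and any parameter not stated there are illustrative assumptions.

from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.ensemble import StackingClassifier, GradientBoostingClassifier
from xgboost import XGBClassifier
from imblearn.over_sampling import SMOTE

mlp = MLPClassifier(hidden_layer_sizes=(20,), activation="relu", solver="adam",
                    max_iter=200, learning_rate_init=0.001, alpha=0.0001)

xgb = XGBClassifier(max_depth=10, learning_rate=0.1, subsample=0.7)

stacking_svm = StackingClassifier(
    estimators=[(f"svm_{i}", SVC(kernel="linear", C=2.0 ** i)) for i in range(-1, 6)],
    final_estimator=SVC(kernel="linear"))

gbdt = GradientBoostingClassifier(n_estimators=100)  # paired with clustering-based under-sampling (3 clusters) in [13]

smote = SMOTE()  # re-balances the training data before fitting the AdaBoost-SVM of [58]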

The results in Table 4 show that the compared prediction models performed well in terms of accuracy in 1-year-ahead financial distress prediction. Among the compared models, the XGBoost model excelled, although, unlike the CUS+GBDT and SMOTE+ADASVM models, it does not tackle the class imbalance problem. The outstanding performance of the baseline model indicated the advantage of a larger window size (multiple time steps) in the LSTM model. We also managed to improve the prediction performance by integrating managers’ sentiment and emotions. The results given in Table 4 show that financial indicators play a key role in financial distress prediction. However, the proposed sentiment and emotional features have valuable predictive power, improving the overall prediction performance in terms of all evaluation criteria. Furthermore, the superior values of AUC suggest that the proposed prediction model performed well despite the class imbalance. Specifically, the accuracy for the individual classes of the imbalanced class prediction problem was as follows: 0.955 for the safe class, 0.915 for the grey zone and 0.969 for the financial distress class. These results are consistent with the findings of recent studies focusing on multi-class financial distress prediction [59], namely that it is more difficult to detect intermediate financial conditions lying between financial health and severe financial distress. Finally, a balanced prediction performance in terms of precision (0.953 for class prediction and 0.741 for trend prediction) and sensitivity (0.954 and 0.741) was obtained for both classification problems.

Table 4 Results of 1-year-ahead financial distress prediction

The integration of the sentiment and emotional features was even more influential in improving prediction performance in the case of trend prediction. Even though the results for both classification problems were generally consistent, the problem of trend prediction appears to be a more challenging classification task. Overall, the results of the comparative models validated the established model for financial distress prediction.

SHapley Additive exPlanations (SHAP) were used to explain the effects of the sentiment and emotional features on the prediction results. SHAP is a game-theoretic method that is widely used to explain deep learning outputs [37]. A major benefit of SHAP is that it provides both local explainability (how decisions are reached for each company) and global explainability (the overall structure of the model decisions). In this study, we explored the global explainability of the prediction model to illustrate the effect of each sentiment and emotional feature on financial distress prediction. The SHAP library was used to obtain the SHAP values. Figure 8 depicts the SHAP values of the proposed features compared with the financial indicator of greatest importance (the Z-score from the previous year). The results show that conference call transcripts were generally more relevant for the prediction model than the speech emotional states. Notably, anticipation was a crucial text emotion, suggesting that a high occurrence of forward-looking statements indicates future financial difficulties (class prediction) and a deterioration in the financial situation (trend prediction). Moreover, exciting statements (joy) were indicators of an improvement in corporate financial performance. In contrast, the sadness and fear text emotions indicated financially safe companies, which can be attributed to managers’ concerns about the company’s financial situation and careful consideration of financial risks, respectively. Positive FinBERT-based sentiment was associated with good financial performance and its improvement in the future. The emotions of sadness and happiness were the most relevant speech emotions for financial distress class prediction, while the emotions of fear and happiness were critical for trend prediction. Speech happiness indicated a safe financial position, whereas sadness suggested financial distress. For the prediction of the financial trend, the impact of happiness was similar, while the fear emotion was indicative of increased financial performance. The anger emotion was more likely to imply deterioration. The impact of the sentiment and emotional features on model outputs is summarised in Table 5.

Fig. 8 SHAP values showing how sentiment and emotional features contribute to financial distress prediction (class prediction is presented on the left, and trend prediction is depicted on the right)
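A minimal sketch of how such global explanations can be obtained for the trained Keras model is given below. The use of GradientExplainer and the flattening of the time dimension are assumptions; the study only states that the SHAP library was used.

import numpy as np
import shap

def explain_with_shap(model, X_background, X_test):
    # Global SHAP explanations for a Keras model on sequential input of shape
    # (samples, time steps, features). shap_values typically returns one array
    # per output class; class 0 is plotted here for illustration.
    explainer = shap.GradientExplainer(model, X_background)
    shap_values = explainer.shap_values(X_test)
    flat = np.asarray(shap_values[0]).reshape(len(X_test), -1)
    shap.summary_plot(flat, X_test.reshape(len(X_test), -1))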

Table 5 Summary of the impact of the top 5 sentiment and emotional features on model output (likelihood of financial distress or safe position)

6 Conclusion

The purpose of the current study was to determine the effect of emotion and sentiment features from earnings conference calls on financial distress prediction. To this end, a dynamical deep learning-based financial distress prediction model was proposed to exploit a rich word representation and speech emotion recognition. In summary, the findings of this study suggest that both the text and speech emotions deliver valid information for predicting financial distress. The results of this study show that the fusion of traditional financial indicators with the proposed sentiment and emotional indicators can improve the prediction performance. In accordance with earlier research [24, 38, 44], this study revealed that negative sentiment and emotions in audio recordings and conference call transcripts may indicate financial distress, whereas positive emotions (e.g. happiness) indicate good financial performance. The most striking result to emerge from our analysis is that managers may try to mask financial difficulties by expressing positive emotions such as anticipation and trust. In contrast, some negative emotions in managers’ statements, such as fear and sadness, show that the company is important to managers and they care about its financial situation.

To make informed financial decisions, financial managers, investors and banks assess the financial performance of companies, which can be complex and time-consuming. Therefore, the use of accurate and interpretable models has the potential to significantly streamline this assessment. Traditional models based on financial ratios may be flawed by accounting practices or even deliberate misrepresentation of financial statements. To mitigate these problems, our model integrates financial, audio and text information, and therefore, it could be applied to reliably guide a range of financial and investment decisions.

Finally, several limitations need to be considered. The most important limitation is that each entire audio recording was represented by a single speech emotion. However, the speakers’ emotional states may change over the course of a conference call. For instance, managerial emotions may change when discussing different areas of the company’s performance. Therefore, further work needs to be done to analyse conference call transcripts at the topic level and identify the topics being discussed. Another limitation is the sharp division of financial distress classes for classification purposes. Fuzzy sets could be utilised to account for the uncertainty in the overall financial performance indicators. Finally, only large companies and a relatively short prediction horizon were considered, which provides opportunities for future research: investigating a larger sample of companies, including small- and medium-sized enterprises, and extending the prediction horizon.