1 Introduction

Liver disease is a major cause of death globally, responsible for approximately 2 million deaths each year [1]. Prevention measures, including reducing alcohol consumption, promoting healthy lifestyles, and vaccinating against viral hepatitis, are important in reducing the incidence of liver disease and related deaths [2]. Efforts are underway to address the global burden of liver disease by increasing awareness, improving access to screening and treatment, and developing new therapies [3, 4]. Breath biomarkers have the potential to revolutionize healthcare by providing a non-invasive and convenient method for diagnosing and monitoring a wide range of health conditions [5]. These biomarkers are identified by analyzing the volatile organic compounds (VOCs) in a person's exhaled breath, which can provide valuable information about their health status. One of the main advantages of using breath biomarkers is that it is a simple and painless procedure that can be easily repeated over time. Breath biomarkers have shown promise in diagnosing and monitoring a range of diseases [6], including cancer [7], diabetes, infectious diseases and COVID-19 [8,9,10,11].

Machine learning has the potential to revolutionize healthcare by improving patient outcomes, reducing costs, and advancing medical research [12]. It is being used in predictive analytics to identify risk factors and estimate the likelihood of disease onset or progression, to improve the accuracy of medical diagnoses, and to support drug discovery and development [13]. It is also being used in personalized medicine to identify the most effective treatments for individual patients, which can improve treatment outcomes and reduce the risk of adverse reactions [14, 15].

Previous research has shown that breath biomarkers can accurately diagnose liver fibrosis using a single VOC or a panel of VOCs [16, 17]. The breath test was found to be comparable in accuracy to traditional blood tests, but less invasive and more convenient for patients [10, 18]. To achieve maximum diagnostic accuracy, multiple biomarkers need to be used, because an individual VOC biomarker linked to the cellular state may not differentiate between causative agents or symptoms. The liver is responsible for metabolism, and liver disease alters various metabolic pathways and therefore affects multiple VOCs. These pathways are associated with VOCs from one or more functional groups [9, 19].

Variability in laboratory test results can lead to confusion in diagnosis, treatment, and disease monitoring. This variability arises from pre-analytical, biological, and analytical variation. To avoid misinterpretation, an individual's overall health status and medical history should be considered when interpreting laboratory test results [20]. The importance of stable biomarkers has also been highlighted in a study on predicting schizophrenia from the human connectome [21].

The objective of this investigation is to identify, using statistical and machine learning techniques, a new panel of stable potential biomarkers that can accurately detect abnormal liver function, with the aim of improving diagnostic and prognostic tools for liver disease. Breath samples were collected from liver patients and healthy volunteers in different physiological states: at rest, after exercise, and during recovery. Analyzing samples from all three states makes it possible to identify biomarkers that remain stable and reliable irrespective of the individual's physiological state, and that can therefore be used consistently for diagnostic or monitoring purposes. The samples were analyzed and quantified to obtain the names of the compounds present and their relative concentrations. Several strategies were combined to obtain a stable panel of potential biomarkers. First, the compounds common to the different physiological states were shortlisted. Next, the common compounds were ranked by their contribution to predicting whether a sample belongs to the patient or the healthy group. A significance test was then conducted to verify the statistical significance of the best-ranked compounds. Finally, the compounds that were consistently ranked among the best and were stable across states were chosen for the panel of potential and stable biomarkers. The selected biomarkers were validated to determine their accuracy and reliability. The development of accurate and reliable diagnostic tests for liver disease could improve patient outcomes and provide clinicians with a powerful tool for monitoring disease progression and developing personalized treatment plans.

2 Method

2.1 Experimental setup

To identify common and consistent biomarkers for detecting abnormal liver function, breath samples were collected three times from each subject in two groups, 30 liver patients and 33 healthy individuals, recruited and tested at National Taiwan University Hospital, Taipei, Taiwan (REC: 201912138RINB). The Research Ethics Committee B of National Taiwan University Hospital, Taipei, Taiwan approved the breath test protocol and research method. All methods were performed in accordance with the relevant guidelines and regulations to ensure ethical conduct of the study. The healthy volunteers were recruited voluntarily, and their participation included liver function tests at NTU Hospital. Healthy participants were identified using preliminary liver donor criteria based on blood test results, age, and BMI. A recent study of 473 liver donors over ten years revealed that healthy participants are typically under 30 years old [22]. This age group was chosen to avoid older individuals with other health conditions that could affect the study's findings. The patients included in the study, on the other hand, were hospitalized at NTU Hospital for the diagnosis and treatment of liver illness. Participants in both the healthy and diseased groups were therefore evaluated and monitored within the controlled hospital setting, minimizing potential confounding effects of environmental exposure on the exhaled breath samples. Figs. S1, S2 and S3 in the supplementary file show the distributions of the study subjects' Child-Pugh score, AST to platelet ratio index (APRI) and model for end-stage liver disease (MELD) score, respectively. More than 20 of the 30 patients had low Child-Pugh, APRI, and MELD scores.
This pattern indicates that a substantial proportion of the patients were in the early stages of their conditions. Exclusion criteria were lung disease, advised bed rest, and age outside the range of 20 to 70 years. Study participants provided informed consent and performed staircase walking for two minutes, reaching a heart rate of at least 100 bpm. Breath samples were collected using the experimental setup shown in Fig. 1: three bags were collected, one before exercise, one after exercise, and one after 15 min of rest. The exhaled breath in each bag was sampled into a desorption tube (Carbotrap, PerkinElmer®). The flow rate of a pump was adjusted to transfer one liter of sample from the bag to the tube. Likewise, the three tubes containing samples from the three bags were fitted to the automatic thermal desorption (ATD) unit (TurboMatrix, PerkinElmer®). The sample from each tube was heated and collected in a trap unit. It was then transferred into the gas chromatography (GC) column (Clarus 680, PerkinElmer®), and the output was detected using the mass spectrometer (MS). The breath collection protocol and GC-MS setup are discussed at length in [17].

Fig. 1

Breath sample collection and sample analysis setup. Three samples are collected in three different states. The samples are transferred to desorption tubes, the tubes are loaded into the ATD unit, and the ATD unit transfers the samples to the GC-MS

2.2 Data collection

The PerkinElmer TurboMass software®, a data acquisition and analysis package for mass spectrometry, was used to process the mass spectral data. The software is designed to work with PerkinElmer's mass spectrometry systems and supports data acquisition, instrument control, and data processing. Its quantitative analysis method can be linked with the National Institute of Standards and Technology (NIST) database, which allows quantitative reports to be generated for each sample by comparing the sample spectrum with the NIST database. Quantification used the external calibration method, in which a calibration curve is generated from one or more standard compounds and the response values of the analytes in the sample are then used to determine their concentrations. A four-point calibration was prepared using acetone as the standard, and the response values of the various compounds were converted to relative concentrations. This approach is useful for identifying and quantifying a variety of compounds simultaneously in a limited number of samples [23, 24]. The compound names and concentrations from the three bags were kept as three separate datasets, each containing data from both groups. Each dataset has a positive class label for patients and a negative class label for healthy samples, which makes this a binary classification problem. Every compound is treated as a feature in the sections that follow.
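The external calibration step can be illustrated with a short sketch. It assumes a linear detector response and fits the calibration line by ordinary least squares. The standard concentrations and peak areas below are hypothetical values chosen for illustration, not the study's calibration data, and the function names `fit_line` and `to_relative_concentration` are illustrative.

```python
def fit_line(x, y):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def to_relative_concentration(response, slope, intercept):
    """Invert the calibration line to convert a detector response
    into a concentration relative to the acetone standard."""
    return (response - intercept) / slope


# Hypothetical four-point acetone calibration (ppb vs. peak area).
standard_ppb = [50.0, 100.0, 200.0, 400.0]
peak_area = [1050.0, 2050.0, 4050.0, 8050.0]

slope, intercept = fit_line(standard_ppb, peak_area)

# Response of some analyte peak, expressed as an acetone-equivalent value.
sample_ppb = to_relative_concentration(3050.0, slope, intercept)
print(f"slope={slope:.2f}, intercept={intercept:.2f}, sample={sample_ppb:.1f} ppb")
```

Because every compound is read against the acetone curve, the resulting values are relative concentrations and are mainly meaningful for comparing samples with one another.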

2.3 Feature selection process

First, the compounds found in at least 50% of the samples were selected. This is a common approach in statistical analysis to ensure that only the most prevalent and consistent features are considered for further analysis. It is a type of frequency-based feature selection, in which features (in this case, compounds) that occur commonly across the samples are retained [25]. This step eliminates noise and irrelevant features, such as background signals and residuals, so that the most informative features can be identified and the reliability of the results increases. A conceptual flowchart (Fig. 2) describes the feature selection process used to select the final set of potential features, or biomarkers, for predicting whether a sample belongs to the liver patient group.
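The prevalence filter, and the later restriction to compounds shared by all three bags, can be sketched as follows. The data layout (one dictionary of detected compounds per sample), the function names, and the toy concentrations are assumptions made for illustration only.

```python
def prevalent_compounds(samples, min_fraction=0.5):
    """Return the compounds detected in at least `min_fraction` of the samples.

    `samples` is a list of dicts mapping compound name -> concentration,
    where a compound is present in the dict only if it was detected.
    """
    counts = {}
    for sample in samples:
        for compound in sample:
            counts[compound] = counts.get(compound, 0) + 1
    threshold = min_fraction * len(samples)
    return sorted(c for c, k in counts.items() if k >= threshold)


def common_across(datasets, min_fraction=0.5):
    """Compounds that pass the prevalence filter in every dataset (bag)."""
    kept = [set(prevalent_compounds(d, min_fraction)) for d in datasets]
    return sorted(set.intersection(*kept))


# Toy data: three bags with four samples each (values are hypothetical).
bag1 = [{"Acetone": 310.0, "Toluene": 4.1}, {"Acetone": 280.0, "Furan": 0.9},
        {"Acetone": 295.0, "Toluene": 3.7}, {"Acetone": 330.0}]
bag2 = [{"Acetone": 300.0, "Isoprene": 12.0}, {"Acetone": 290.0, "Isoprene": 9.5},
        {"Acetone": 305.0, "Toluene": 3.9}, {"Acetone": 315.0}]
bag3 = [{"Acetone": 298.0, "Toluene": 4.0}, {"Acetone": 287.0, "Toluene": 3.5},
        {"Acetone": 301.0}, {"Toluene": 4.4}]

print(prevalent_compounds(bag1))          # compounds kept for Bag-1
print(common_across([bag1, bag2, bag3]))  # compounds kept in all three bags
```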

Fig. 2

A conceptual flowchart for the feature selection process for multiple datasets

The features common to the three datasets were considered for the next steps. These features were subjected to recursive feature elimination (RFE) to rank them and assess their impact on model accuracy [26]. RFE is a powerful method for identifying the most important features in a dataset for a given problem. Using a decision tree (100 estimators chosen), RFE works by repeatedly training a model on subsets of features and eliminating the least important ones until a desired number of features is reached. The algorithm ranks the features based on their contribution to model performance and evaluates various combinations of ranked features to identify the best ones. In addition, fivefold cross-validation with three repeats was conducted to make the analysis robust and reliable. RFE with a decision tree estimator can improve model performance while limiting overfitting. The models and data analysis were implemented in Python, with Jupyter Notebook as the development environment.
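A minimal sketch of this ranking step is given here using scikit-learn. The use of scikit-learn is an assumption, since the text only states that Python was used. Synthetic data from `make_classification` stands in for the VOC concentrations (63 samples and 15 features, mirroring the study's dimensions), and a single `DecisionTreeClassifier` with default settings is used as the estimator, which may differ from the configuration the authors describe.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for one bag: 63 samples x 15 common compounds.
X, y = make_classification(n_samples=63, n_features=15, n_informative=5,
                           n_redundant=0, random_state=0)

# Fivefold cross-validation repeated three times (15 scores per setting).
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=0)


def cv_accuracy(n_features):
    """Cross-validated accuracy when RFE keeps `n_features` features.

    RFE sits inside the pipeline so that feature elimination is refitted
    on each training fold and never sees the held-out fold.
    """
    pipeline = Pipeline([
        ("rfe", RFE(DecisionTreeClassifier(random_state=0),
                    n_features_to_select=n_features)),
        ("tree", DecisionTreeClassifier(random_state=0)),
    ])
    return cross_val_score(pipeline, X, y, scoring="accuracy", cv=cv)


# Accuracy distribution for every feature count from 2 to 15.
scores = {k: cv_accuracy(k) for k in range(2, 16)}
for k, s in scores.items():
    print(f"{k:2d} features: mean accuracy {s.mean():.3f}")

# Full ranking of the features (1 = most important) on the whole dataset.
ranking = RFE(DecisionTreeClassifier(random_state=0),
              n_features_to_select=1).fit(X, y).ranking_
print(ranking)
```

The per-setting score arrays are what a box plot of accuracy against number of features would be drawn from.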

A reasonable number of features was then selected at which the maximum cross-validated accuracy was achieved. The selected features were subjected to a statistical test to verify that each feature differs significantly between the two groups within an individual dataset. The Shapiro-Wilk test was conducted to assess the normality of the data distribution. For group-wise significance analysis, a non-parametric test, the Mann-Whitney U test, was employed. This test was chosen for its robustness against non-normality and its suitability for comparing two independent groups. Furthermore, to examine the features across the different sample collection conditions, the Kruskal-Wallis test was employed. This non-parametric test evaluates potential differences among multiple independent groups and is appropriate for the non-normal distribution of the data [27].
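To make the group-wise test concrete, the sketch below computes the Mann-Whitney U statistic from pooled ranks and a two-sided p-value from the normal approximation. It is a simplified illustration that omits tie and continuity corrections, so its p-values can differ slightly from those of a statistics package; in practice a library routine such as `scipy.stats.mannwhitneyu` would normally be used. The concentrations are hypothetical.

```python
import math


def mann_whitney_u(a, b):
    """Two-sided Mann-Whitney U test (normal approximation, no corrections).

    Returns the U statistic of group `a` and the approximate p-value.
    """
    pooled = sorted([(v, 0) for v in a] + [(v, 1) for v in b])

    # Assign ranks 1..N, giving tied values the average of their ranks.
    ranks = [0.0] * len(pooled)
    i = 0
    while i < len(pooled):
        j = i
        while j + 1 < len(pooled) and pooled[j + 1][0] == pooled[i][0]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[k] = average_rank
        i = j + 1

    n_a, n_b = len(a), len(b)
    rank_sum_a = sum(r for r, (_, group) in zip(ranks, pooled) if group == 0)
    u_a = rank_sum_a - n_a * (n_a + 1) / 2

    mean_u = n_a * n_b / 2
    sd_u = math.sqrt(n_a * n_b * (n_a + n_b + 1) / 12)
    z = (u_a - mean_u) / sd_u
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail probability
    return u_a, p_value


# Hypothetical concentrations of one compound in the two groups.
patients = [1.0, 2.0, 3.0, 4.0, 5.0]
healthy = [6.0, 7.0, 8.0, 9.0, 10.0]
u_stat, p_val = mann_whitney_u(patients, healthy)
print(u_stat, p_val)
```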

Features that were selected using RFE, showed a significant difference between the two groups (p < 0.01), and were present in all the datasets were shortlisted. These common shortlisted features were then examined to find the consistent biomarkers. The Kruskal-Wallis test was applied again to verify that the concentration of a compound does not differ significantly among the three datasets (p > 0.5). The null hypothesis of the Kruskal-Wallis test is that all groups come from the same distribution (informally, that their central values are equal); if the test does not reject the null hypothesis, the compound's values are taken to be similar in all groups. In the end, the features that are ranked as contributing (using RFE), significant between groups (p < 0.01), and common and consistent among datasets (p > 0.5) form the final feature set.
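The three selection conditions can be combined as a simple filter, sketched here. The dictionary layout, the compound names, and all numbers are hypothetical. The reading that a compound counts as "RFE-selected" when it is among the top `top_k` ranks in at least one bag is an assumption, as is requiring the group-wise p-value to be below the threshold in every bag.

```python
def select_stable_biomarkers(stats, top_k=3, p_between=0.01, p_across=0.5):
    """Keep compounds that satisfy all three conditions:

    1. ranked within the top `top_k` by RFE in at least one bag (assumed reading),
    2. significantly different between groups in every bag (p < p_between),
    3. not significantly different across bags (p > p_across).
    """
    selected = []
    for compound, s in stats.items():
        top_ranked = any(rank <= top_k for rank in s["rfe_rank"])
        significant = all(p < p_between for p in s["group_p"])
        stable = s["across_bags_p"] > p_across
        if top_ranked and significant and stable:
            selected.append(compound)
    return sorted(selected)


# Hypothetical per-compound statistics: one RFE rank and one group-wise
# p-value per bag, plus a single across-bag p-value.
stats = {
    "compound_A": {"rfe_rank": [1, 2, 4], "group_p": [0.001, 0.004, 0.002], "across_bags_p": 0.81},
    "compound_B": {"rfe_rank": [2, 1, 1], "group_p": [0.003, 0.030, 0.001], "across_bags_p": 0.66},
    "compound_C": {"rfe_rank": [3, 5, 2], "group_p": [0.002, 0.001, 0.005], "across_bags_p": 0.12},
    "compound_D": {"rfe_rank": [9, 8, 11], "group_p": [0.004, 0.002, 0.001], "across_bags_p": 0.90},
}

panel = select_stable_biomarkers(stats)
print(panel)
```

In this toy input, compound_B fails the group-wise significance condition in one bag, compound_C varies across bags, and compound_D is never highly ranked, so only compound_A is retained.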

All possible combinations of the input features were then enumerated, and a decision tree model was trained on each subset of features. The accuracy of each model was evaluated using repeated stratified cross-validation. This gives a detailed picture of the accuracy achieved by the various combinations of the best features selected using RFE.
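The enumeration itself is straightforward with `itertools.combinations`, as the sketch below shows. The scoring function is a placeholder: in the study each subset would be scored by the cross-validated accuracy of a decision tree, whereas `toy_score` here is a hypothetical stand-in so that the example stays self-contained. The feature names are also placeholders.

```python
from itertools import combinations


def feature_subsets(features, min_size, max_size):
    """Yield every subset of `features` with min_size..max_size members."""
    for size in range(min_size, max_size + 1):
        yield from combinations(features, size)


def rank_subsets(features, score_fn, min_size=1, max_size=None):
    """Score every subset with `score_fn` and sort from best to worst."""
    max_size = len(features) if max_size is None else max_size
    scored = [(score_fn(subset), subset)
              for subset in feature_subsets(features, min_size, max_size)]
    return sorted(scored, key=lambda item: item[0], reverse=True)


def toy_score(subset):
    """Hypothetical placeholder for a cross-validated accuracy."""
    return round(0.70 + 0.04 * len(subset), 2)


features = ["f1", "f2", "f3", "f4", "f5"]
ranked = rank_subsets(features, toy_score, min_size=3, max_size=5)

print(len(ranked))  # number of subsets with three to five features
print(ranked[0])    # best-scoring subset under the placeholder score
```

With five features, subsets of three to five members give 10 + 5 + 1 = 16 combinations, so exhaustive evaluation is cheap at this scale.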

2.4 Machine learning model

Two additional models, a simple Naïve Bayes (NB) classifier and a Random Forest (RF) classifier, were trained and tested to gain insight into the strength of the selected features and to validate the results of the analysis.

2.4.1 Naive bayes classifier

Naive Bayes is a popular machine learning algorithm based on Bayes' theorem. It predicts class labels using probability theory and is efficient in handling high-dimensional data. The algorithm estimates the probability distribution of each input feature given the target class, assuming the features are conditionally independent of one another given the class. When presented with new data, it calculates the posterior probability of each class and predicts the class with the highest probability as the output [28].
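Written out, the decision rule described above takes the following standard form, where C_k denotes a class (patient or healthy) and x_1, ..., x_n are the feature values of a sample. The notation is introduced here for illustration and does not appear in the original text.

```latex
P(C_k \mid x_1, \dots, x_n) \;\propto\; P(C_k) \prod_{i=1}^{n} P(x_i \mid C_k),
\qquad
\hat{y} = \arg\max_{k} \; P(C_k) \prod_{i=1}^{n} P(x_i \mid C_k)
```

The product over features is exactly where the conditional independence assumption enters: each feature contributes its own likelihood term, without any interaction terms between features.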

2.4.2 Random forest classifier

Random Forest is a powerful ensemble machine learning algorithm that builds multiple decision trees from randomly selected subsets of the data and features. The trees are built using recursive partitioning, and each independently predicts the class of a new data point. The final prediction is made by aggregating the predictions of all trees using a majority voting scheme, which reduces the effect of individual incorrect predictions and leads to more accurate overall performance [29, 30]. The hyperparameters were kept at their default values, with the number of estimators set to 500.
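The aggregation step can be illustrated in a few lines. This sketch shows only hard majority voting over per-tree predictions, as described above; it does not build the trees, and the per-tree outputs are hypothetical. Library implementations may aggregate differently, for example by averaging predicted class probabilities.

```python
from collections import Counter


def majority_vote(tree_predictions):
    """Combine per-tree label predictions into one label per sample.

    `tree_predictions` is a list with one entry per tree, each entry being
    the list of labels that tree predicts for the same samples.
    """
    return [Counter(votes).most_common(1)[0][0]
            for votes in zip(*tree_predictions)]


# Hypothetical outputs of three trees for three samples (1 = patient).
tree_predictions = [
    [1, 0, 1],
    [1, 1, 0],
    [0, 0, 1],
]
forest_prediction = majority_vote(tree_predictions)
print(forest_prediction)
```

Each tree is wrong on one of the three samples here, yet the vote recovers a consistent answer, which is the error-averaging effect the ensemble relies on.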

2.5 Performance metrics

Five parameters are important for model evaluation: accuracy, precision, recall (sensitivity), F1-score and specificity [31]. Accuracy is suitable for balanced datasets, but when the classes are imbalanced the F1-score is the better option. Precision is the fraction of samples predicted as a class that truly belong to it, while recall is the fraction of samples of a class that the model correctly identifies. Depending on which type of misclassification matters more, either recall or specificity is the better metric for evaluation.
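All five metrics follow from the four confusion-matrix counts, as the sketch below shows. The labels are hypothetical (1 = patient, 0 = healthy), and the helper returns 0.0 when a denominator is zero, which is a convention chosen for this example.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall, specificity and F1 for a binary problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    def ratio(numerator, denominator):
        return numerator / denominator if denominator else 0.0

    precision = ratio(tp, tp + fp)   # predicted positives that are correct
    recall = ratio(tp, tp + fn)      # actual positives that are found
    return {
        "accuracy": ratio(tp + tn, len(pairs)),
        "precision": precision,
        "recall": recall,
        "specificity": ratio(tn, tn + fp),  # actual negatives that are found
        "f1": ratio(2 * precision * recall, precision + recall),
    }


# Hypothetical test-set labels: 4 patients and 6 healthy subjects.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 0]
metrics = classification_metrics(y_true, y_pred)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```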

To further assess the reliability of the best-performing model, a bootstrap method was applied to find a 95% confidence interval that covers the true skill of the model [32]. Finally, a classification report, the Receiver Operating Characteristic (ROC) curve, the Precision-Recall (PR) curve and their areas under the curve (AUC) provide a comprehensive overview of the model's performance [33]. The three datasets were analyzed in parallel, as described, to verify that the results obtained from samples collected at different times are close to one another.
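One common way to obtain such an interval is the percentile bootstrap, sketched here. The text does not specify which quantities were resampled, so resampling the test-set predictions is an assumption, as are the function name, the number of resamples, and the hypothetical labels.

```python
import random


def bootstrap_accuracy_ci(y_true, y_pred, n_boot=1000, confidence=0.95, seed=0):
    """Percentile bootstrap confidence interval for classification accuracy.

    Resamples (true, predicted) pairs with replacement, recomputes the
    accuracy each time, and returns the lower and upper percentiles.
    """
    rng = random.Random(seed)
    n = len(y_true)
    accuracies = []
    for _ in range(n_boot):
        indices = [rng.randrange(n) for _ in range(n)]
        correct = sum(1 for i in indices if y_true[i] == y_pred[i])
        accuracies.append(correct / n)
    accuracies.sort()
    alpha = (1 - confidence) / 2
    lower = accuracies[int(alpha * n_boot)]
    upper = accuracies[min(n_boot - 1, int((1 - alpha) * n_boot))]
    return lower, upper


# Hypothetical test set of 19 samples with two misclassifications.
y_true = [1] * 9 + [0] * 10
y_pred = [1] * 8 + [0] + [0] * 9 + [1]
lower, upper = bootstrap_accuracy_ci(y_true, y_pred)
print(f"95% CI for accuracy: [{lower:.3f}, {upper:.3f}]")
```

The result is a pair of bounds rather than a single number, and with a test set this small the interval can be fairly wide.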

3 Results

3.1 Feature selection result

The health status of the recruited study subjects was confirmed at the hospital by a blood test. The liver function test parameters are shown in Table 1.

Table 1 Clinical data for liver disease patients and healthy group

The quantified VOC concentrations form the datasets; in total, three datasets were obtained from the three bags collected from all study subjects. A total of 36 compounds in Bag-1, 35 compounds in Bag-2 and 29 compounds in Bag-3 were found in at least half of the samples. The 15 compounds common to the three datasets were considered for further analysis, and three conditions were applied. First, the 15 common compounds were ranked in every dataset using the RFE method. Second, a significance test was performed using the Mann-Whitney test, and compounds with p < 0.01 were marked. Finally, the compounds that contribute according to the RFE analysis, are significant between groups, and have mean values that are close (not significantly different) in all bags were treated as stable VOCs and form the final panel of biomarkers. Supplementary Table S1 lists the names of the common compounds, their ranks at the different physiological states, and the significance test results. Fig. 3 shows the results of the RFE method.

The box and whisker plots in Fig. 3 show the effect of feature elimination on model accuracy. The cross-validation accuracy is shown for each number of features, from a maximum of 15 down to a minimum of 2. At every step, the lowest-ranked feature is removed from the combination. Overall, combinations of two to six features give the best accuracy. Those features differ significantly between the two groups, and their mean values are close in all the datasets. All the features are ranked based on performance. A combination of three to five features was found to give the maximum accuracy (mean and median), to be significant in discriminating the classes in each bag, and to be common and consistent in the three datasets, with no significant difference in mean value.

Fig. 3

Box plot showing the classification accuracy of decision tree models with varying numbers of selected features using RFE on bag 1 data a, bag 2 data b and bag 3 data c. The green diamond indicates the mean value, the red box represents the interquartile range (IQR), the whiskers extend to the lowest and highest values within 1.5 times the IQR, and the circles represent outliers

The box and whisker plots in Fig. 4 show the concentrations of these compounds in liver patients and healthy individuals, as well as the variation in concentration across the different sample bags. The p-values for the difference between the two groups of study subjects are indicated using the standard ‘*’ system [34]. All possible combinations of three to five features were then iterated, and each was trained and tested with a decision tree model with 100 estimators; the accuracies obtained are shown in Table 2.

Fig. 4

The figure presents 15 box and whisker plots showing the concentrations, in parts per billion (ppb), of five compounds in three sample bags (Bag-1, Bag-2, Bag-3) for liver patients (red boxes) and healthy individuals (blue boxes). The compounds are 2-Myristynoyl Pantetheine, N-Acetyl Cystine, Pterin-6 Carboxylic Acid, Butanoic Acid, and Methyl Mercaptan. The p-values are represented by asterisks, with *p < 0.05, **p < 0.01, and ***p < 0.001

Table 2 The results of various iterations of feature combination for five different compounds (1. 2-Myristynoyl Pantetheine, 2. Pterin-6 Carboxylic Acid, 3. Methyl Mercaptan, 4. N-Acetyl Cystine, 5. Butanoic Acid) across three different sample bags (Bag-1, Bag-2, and Bag-3)

Table 2 presents the results of different feature combinations for five compounds across three sample bags. The "Range of accuracy for all combinations" column shows the accuracy range achieved over all possible feature combinations. The "Best accuracy (combinations)" column shows the highest accuracy achieved, along with the corresponding feature combinations. The "Accuracy with selected five features" column displays the accuracy obtained by using all five features together. These findings show how effective the various combinations of the five compounds are at classifying the samples in the different bags. Notably, every combination of three of the selected features achieved an accuracy of at least approximately 0.77 with a decision tree model.

3.2 Classification model result

The five selected features were further used to train and test the RF and NB classifiers. The RF classifier with 500 estimators and the NB classifier were fitted on 70% of the data, and the trained models were then tested on the remaining 30%. The predicted and true classes were used to obtain the classification reports. Table 3 shows the classification reports for the RF and NB classifiers.

Table 3 Classification reports of RF classifier and NB classifier for three different sample bags (Bag 1, Bag 2, and Bag 3) based on the precision, recall, F1 score, accuracy, ROC curve AUC, PR curve AUC, and 95% confidence interval of accuracy

Table 3 shows the performance metrics of the two classifiers on three bags of data (Bag-1, Bag-2, and Bag-3) for a binary classification problem. The metrics are presented separately for the healthy and patient classes. Overall, the RF classifier performed better than the NB classifier in most cases. In Bag-1 and Bag-2, the precision, recall, F1 score, and accuracy of the RF classifier were consistently higher than those of the NB classifier for both the healthy and patient classes. In Bag-3, however, the two classifiers performed similarly in terms of precision, recall, F1 score, and accuracy for both classes.

Bootstrapping was used to compute 95% confidence intervals for the accuracy of the classification models on the three datasets, indicating the accuracy that can be expected when using the models on similar datasets. For the RF model, the value reported for the 95% confidence interval was 94.7% for the first dataset (Bag-1), 89.2% for the second dataset (Bag-2), and 89.5% for the third dataset (Bag-3), which suggests that the model is highly accurate on this type of data. For the NB model, the value reported was 84.2% for all datasets. The ROC and PRC graphs of the three datasets for the RF model and NB model are shown in Figs. 5 and 6.

Fig. 5

Performance evaluation of the RF classifier on three independent test datasets (Bag 1, Bag 2, and Bag 3) using ROC and PR curves. a ROC curves for the RF classifier on Bag 1, Bag 2, and Bag 3 are shown, with the AUC indicated for each dataset. b PR curves for the RF classifier on Bag 1, Bag 2, and Bag 3 are shown, with the AUC indicated for each dataset

Fig. 6

Performance evaluation of the NB classifier on three independent test datasets (Bag 1, Bag 2, and Bag 3) using ROC and PR curves. a ROC curves for the NB classifier on Bag 1, Bag 2, and Bag 3 are shown, with the AUC indicated for each dataset. b PR curves for the NB classifier on Bag 1, Bag 2, and Bag 3 are shown, with the AUC indicated for each dataset

4 Discussion

The concept of stability or consistency is relevant to the study of breath biomarkers in general, as stable biomarkers are necessary to ensure accurate and reliable diagnosis and monitoring of diseases. In this study, breath samples were collected at three different physiological states to identify potential biomarkers for liver dysfunction. The biomarkers common to all three samples were identified, and the RFE algorithm ranked them based on their ability to predict liver dysfunction. The significance of the biomarkers between the healthy and patient classes was determined using the Mann-Whitney test. The stability and consistency of the biomarkers were assessed by checking that their mean values are not significantly different across the sample bags. The findings suggest that this approach provides a unique and effective means of identifying stable biomarkers for liver dysfunction and may have broader implications for the development of biomarker-based diagnostic tools.

The RFE algorithm works by repeatedly training a model using subsets of the 15 common features and eliminating the least important features at each iteration. Figure 3(a–c) shows that a combination of three features gives the highest mean and median accuracy for all datasets. Selecting the top three ranked compounds in each dataset results in a combined list of five compounds (Table 2). The selected compounds differ significantly between the healthy and patient groups. Their mean values are not significantly different among the datasets, as verified with the Kruskal-Wallis test.

Many of the 15 common compounds are significant and have potential as biomarkers, but they are not consistently significant or stable in all cases. Common compounds that passed the Mann-Whitney test include Acetone, Alkane, Toluene, Isopropyl alcohol, Ethyl acetate and Furan. Some abundant biomarkers, present at higher concentrations, fluctuate with heart rate, while biomarkers at lower concentrations are not commonly found in all bags. Some are potential biomarkers not only for liver disease but also for other diseases, as described in other studies, such as Acetone [16], Acetaldehyde, Alkanes [35], Toluene, Furan, Dimethyl Sulphide, and Terpenes [36].

The stable biomarkers that do not change with physical activity or heart rate were chosen as the final panel of biomarkers, and their possible origin, pathway, and relation to body metabolism are discussed below based on the available literature. The compounds in the panel are: 1) N-Acetyl Cystine, 2) 2-Myristynoyl Pantetheine, 3) Pterin-6 Carboxylic Acid, 4) Butanoic Acid, and 5) Methyl Mercaptan.

N-Acetyl cysteine, a precursor for glutathione synthesis [37], can reduce inflammation and lower liver enzymes and the risk of alcohol-induced liver damage [38, 39], and may benefit those with inadequate production or genetic variations affecting its metabolism [40, 41]. The patient group consistently shows lower levels of N-Acetyl Cystine than the healthy group across all bags, possibly because of impaired liver function. The patient group has relatively consistent low mean values of Myristoyl pantetheine across all three bags, while the healthy group displays consistently high levels in all bags. It is a derivative of coenzyme A (CoA) synthesized in the liver from pantothenic acid and cysteine [42, 43], and it plays a role in various metabolic pathways as a component of CoA and in its myristoylated form [44,45,46,47]. The results indicate that liver dysfunction may dysregulate the concentration of Myristoyl pantetheine. The patient group displays higher mean values of Pterin-6 Carboxylic Acid than the healthy group across all bags, while the healthy group shows consistently low levels in all bags. Pterin-6 Carboxylic Acid, a metabolite of tetrahydrobiopterin, which is involved in producing nitric oxide [48], may serve as a biomarker for the presence of tumours and cancer [49,50,51]. Its association with liver disease is not well established, but altered liver function may affect its concentration in breath samples. The patient group has consistently low levels of butyric acid, while the healthy group has consistently high levels in all bags. Disruption of the gut microbiome in liver disease may play a role, leading to a decrease in butyrate-producing bacteria in the gut and hence lower levels of butyric acid. Additionally, impaired liver function may affect the metabolism and absorption of butyric acid in the body, contributing to lower levels in liver disease samples [52, 53].
The patient group has consistently higher levels of Methyl Mercaptan than the healthy group across all three bags. Methyl Mercaptan is produced by gut bacteria when they break down certain amino acids. The liver normally processes it, but if liver function is poor, it can build up and harm health. One study examined the role of mercaptans in inducing coma in liver disease or after methanethiol gas exposure [54]. Cirrhosis patients have more mercaptans in their breath than healthy individuals, suggesting a connection between sulfur-containing amino acids and mercaptan production in liver disease [55].

The results for the three datasets with the selected biomarkers as features are quite similar, with high recall, F1 score, accuracy, and precision obtained for both classes. The balanced result indicates that the model is equally good at predicting the healthy and patient classes. The results obtained from all datasets are similar, and the performance of the two models is also similar. The high ROC AUC (more than 0.9 for all observations) suggests that the models performed well in distinguishing between the positive and negative classes, with Bag-2 having the highest AUC value of 0.989. The PRC AUC is also more than 0.9 for all combinations of dataset and model. These results suggest that the models identified positive examples with high precision, with Bag-2 having the highest AUC value and F1 score. Bootstrapping was used to compute 95% confidence intervals for the classification models on the three datasets, indicating the range of accuracies that can be expected when using the models on similar datasets. For the RF model, the reported 95% confidence interval values range from 89% to 94% across the datasets, and for the NB model the value is 84%. This suggests that the models are likely to perform well on this type of data.

The five biomarkers met the RFE ranking requirement. Three conditions were applied, and statistical tests were run to determine the significance of the biomarkers between the classes. The biomarkers also remained stable under the different sampling conditions, which is important. Their significance was investigated both individually and in all possible combinations. The data analysis procedure and results discussed above show the potential of the model for monitoring the possibility of liver disease. This study identifies potential, stable and consistent biomarkers at different stages. The findings were further supported by using machine learning models with these biomarkers to predict liver function.

5 Conclusion

In conclusion, our study identified 2-Myristynoyl Pantetheine, Pterin-6 Carboxylic Acid, Methyl Mercaptan, N-Acetyl Cystine, and Butanoic Acid as stable breath biomarkers that have the potential to detect abnormal liver function. Our GC-MS data analysis, statistical analysis, and feature selection technique enabled us to rank and select the most significant and consistent biomarkers. We evaluated all combinations of the final selected biomarkers and identified the range of accuracy. The model test accuracy for the various combinations of biomarkers ranged from 0.7 to 0.9 in all conditions. Moreover, the precision, recall, prediction probability, and 95% confidence interval ranged from 0.89 to 0.94 in all conditions. Our findings pave the way for future research in this field and provide a non-invasive approach to detecting potential biomarkers for various diseases.