Introduction

Heart disease, chronic respiratory disease, and diabetes are among the many non-communicable diseases associated with the modern lifestyle. Heart disease causes one of the highest death rates [1]. Heart disease is a term used to describe abnormalities of the heart, and it is regarded as one of the world’s deadliest diseases, claiming more lives than Alzheimer’s disease and cancer. The prevention of heart disease has become a serious issue in today’s world that needs to be addressed. It is estimated that one American dies of heart disease every 30 seconds [2]. Each year, 647,000 Americans die of heart disease [3]. Approximately 17.8 million deaths were caused by heart disease worldwide in 2017, an increase of 21.2% compared to 2007 [4]. In addition, heart disease can increase the need for hospital treatment by acting as a risk factor for other diseases; for example, it has been associated with a poor prognosis in the setting of COVID-19, straining the capacity of healthcare systems around the world [5]. Half of those who have a heart attack were not considered ‘at risk’. These concerns make automatic prediction and early identification of heart disease a critical issue. It is essential to prevent this life-threatening disease before it leads to millions of deaths. Early diagnosis and prevention require identifying various risk factors, such as Coronary Artery Disease (CAD), diabetes, hypertension, hyperlipidemia, smoking, medications, family history of CAD, and obesity [6,7,8,9].

Except for family history of CAD and smoking status, all heart disease risk factors must be identified together with indicators and temporal features. Each indicator characterizes the clinical significance of its risk factor. A significant difficulty in heart disease detection and prevention is identifying the risk factors reported in clinical notes.

Creating a fully automated method to predict heart disease from electronic health records (EHRs) is therefore a difficult problem in clinical data analysis [10, 11]. The natural language of clinical narratives stored in EHRs is sometimes described as idiosyncratic, with considerable variability in format and quality [12]. Structured data in EHRs are commonly created for administrative purposes only, so they are biased toward diagnoses and procedures relevant to billing. Unstructured clinical notes are the most in-depth source of data, but semantic labeling of them is uncommon because it requires advanced planning and analysis [13]. Although unstructured data have numerous uses, there is a growing need to unlock them for primary and secondary purposes [14]. Secondary use of such data can include supporting observational studies, such as cohort, cross-sectional, and case–control research [15]. Developing systems that analyze narrative clinical notes to register patients according to selection criteria could reduce sampling bias [16]. Using natural language processing (NLP) techniques, we can convert the meaning of human language into machine-readable representations suitable for secondary purposes. General-domain NLP models cannot be easily applied to clinical text because of significant linguistic differences: clinical language tends toward terse, abbreviated phrasing, often referred to as the telegraphic style. Developing such systems for the clinical field is challenging because few annotated clinical narrative datasets are publicly accessible. During the big data revolution, neural networks (NNs) were trained to model a variety of human languages with high accuracy thanks to the availability of large amounts of general-domain data, but the same has not held in small-data scenarios, where models are often trained from scratch. Consequently, transfer learning methods have become increasingly popular, allowing previously trained models to be applied in new contexts with minimal annotation and labeling [17].

Transfer learning is a deep learning technique in which a model originally pre-trained for one task is adapted as the basis for training on a different task with a new dataset [18, 19]. Although transfer learning has received much research attention in medical image analysis, its application to clinical text data is still lacking. Therefore, this scoping study aimed to investigate the feasibility of applying transfer learning to non-image clinical text data. Many of the most recent advances in generalizable and adaptable techniques are based on transfer learning: when data are scarce, knowledge from fields, tasks, or languages with large data is applied [20]. Several clinical studies have highlighted the potential of transfer learning to reuse models across a wide range of prediction tasks, data types, and even species. Transfer learning appeared to be particularly effective when applied to smaller datasets, compared with machine learning algorithms trained from scratch [18].

Different methods can be used to transfer knowledge from a large dataset depending on the availability of the data source, task labels, and reused data [21]. Feature representation transfer is one of the most common methods, in which an input representation trained unsupervised on a large dataset is transferred to a smaller annotated sample [21]. However, Goodfellow et al. [22] suggest that the use of this strategy has declined, since deep learning performs well when large labeled datasets are available, while Bayesian methods perform better when only small data are available. Mikolov et al. [23] promoted feature representation transfer in NLP by releasing word2vec embeddings trained on approximately 100 billion words extracted from a Google News corpus. However, this model has low coverage of clinical text due to uncommon words and misspellings, prompting the search for other input representation strategies [24]. Bojanowski et al. [25] suggested including sub-word information in word vectors to accommodate morphology. Although deep learning has become common for text classification, Joulin et al. [26] developed fastText, a fast and accurate implementation of multinomial logistic regression that makes large-scale text classification possible. More recently, the National Center for Biotechnology Information developed BioWordVec, which was trained using fastText on more than 30 million documents from PubMed and the MIMIC-III (Medical Information Mart for Intensive Care) clinical database [27]. Together, these tools can facilitate the transfer of learning from the large-data domain to the clinical field by addressing the unique challenges posed by clinical settings [28, 29].
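To make the feature-representation-transfer idea concrete, the sketch below loads fastText-style embeddings of the kind BioWordVec provides and queries them for clinical tokens, including a misspelling. This is a minimal illustration assuming a locally downloaded model file (the path is hypothetical), not the exact pipeline of any cited system.

```python
# A minimal sketch of feature representation transfer: load pre-trained
# fastText-style vectors (e.g., BioWordVec) and reuse them as input features.
# The file path below is hypothetical; adjust it to the downloaded model.
from gensim.models.fasttext import load_facebook_vectors

vectors = load_facebook_vectors("BioWordVec_PubMed_MIMICIII_d200.bin")

# Sub-word (character n-gram) information lets fastText compose vectors for
# misspellings and rare clinical terms that plain word2vec would miss.
for token in ["hyperlipidemia", "hyperlipidemea"]:  # the second is misspelled
    print(token, vectors[token].shape)
```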

Motivation: Transformer-based models built on transfer learning approaches show promise for detecting heart disease risk factors by learning bidirectional relationships in EHRs. We propose a heart disease risk factor identification model built by comparing five transformer-based models on EHR data. We modeled the task as a named entity recognition (NER) task, following [30, 31]. The study uses several statistical criteria and evaluation measurements to support its findings: precision, recall, and F1 at the micro level, comparing the results of the fine-tuned transformer-based models to the document-level gold standard. The primary contributions of this paper can be summarized as follows:

  1. Developing a model that identifies heart disease risk factors in EHRs using transfer learning models.

  2. Exploring transfer learning using openly available biomedical contextual embeddings.

  3. Implementing a transfer learning technique that effectively uses these embeddings to identify risk factors for heart disease.

  4. Demonstrating that the fine-tuned transformer-based models outperform the 2014 i2b2/UTHealth shared task systems and models.

  5. Applying five transformer-based models: the contextual embeddings of BERT [32], BioBERT [33], BioClinicalBERT [34], RoBERTa [35], and XLNet [36].

  6. Showing that ensembling strategies improve extraction performance across all eight challenging risk factor categories.

The remainder of the paper is structured as follows. Section "Related work" reviews recent work on Track 2 of the 2014 i2b2/UTHealth shared task and the adaptation of transfer learning to clinical EHRs. Section "Materials and methods" presents the objectives of the proposed task, the dataset, the research problem, the pre-processing steps, the transfer learning and transformer-based models, and the pre-training and fine-tuning processes. Section "Experimental results and simulations" reports the evaluation and results of the proposed study. Finally, conclusions and future work are discussed in Section "Conclusion and future work".

Related work

The proposed study is motivated by the 2014 i2b2/UTHealth heart disease risk factor detection task, as well as by previous Information Extraction (IE) research in the clinical domain that adapts transfer learning techniques.

Track 2 of the 2014 i2b2/UTHealth shared task

The National Center for Biomedical Computing has organized the Informatics for Integrating Biology and the Bedside (i2b2) challenges (https://www.i2b2.org/) since 2006 to encourage NLP research in the health domain. Track 2 of the 2014 i2b2/UTHealth shared task posed a clinical text classification challenge with limited data, asking the participating teams to categorize patients based on eight risk factors for heart disease: CAD, diabetes, hypertension, hyperlipidemia, obesity, smoking, medications, and family history.

The teams investigated a wide range of approaches, from rule-based to hybrid, using a wide variety of feature combinations and machine learning methods [30]. Because so many different hybrid systems were proposed, participants could not clearly agree on the optimal approach to the task. Most teams found the pseudo-tables encoded in clinical notes and the heart disease risk indicators challenging, which led to low F1-scores. Using SVM models based on custom-built lexica, the best-performing team in 2014 achieved an F1-score of 0.9276 after reannotating a large portion of the training corpus [37].

Their preprocessing step extracted section headings, negation markers, modalities, and other outputs using the ConText tool [38], but no other syntactic or semantic signals were used. They demonstrated that fine-grained annotations can improve other automated systems.

Kotfila and Uzuner [39] investigated the effectiveness of SVM classifiers trained on the shared dataset by comparing the size of training data, features, weighting schemes, and kernels.

The authors showed that limited feature spaces of lowercased alphabetic tokens performed on par with combinations of lexically normalized tokens and semantic concepts extracted with MetaMap [40], and that linear kernels were not significantly less effective than radial kernels.

Furthermore, they demonstrated that SVM models may not require large corpora to perform well.

Chen et al. [41] developed a hybrid pipeline with three tag extraction modules, covering phrase-, logic-, and discourse-based tags, plus an SVM module that identifies time attributes from temporal indicators.

By treating phrase-based tagging as a NER task and time attribute identification as a temporal relation extraction task, the system performed strongly among IE systems that require no additional annotations.

In addition, Urbain [42] used various techniques: conditional random fields (CRFs) to identify risk factors, regular expressions to identify time attributes, and a semantic distribution model to classify specific risk factors. Torii et al. [43] developed three classifiers: a general classifier, a smoking status classifier, and a sequence labeling classifier, using hot-spot features (phrases annotated as risk factor evidence) in conjunction with several open machine learning tools such as MedEx [44], Weka [45], LibSVM [46], and Stanford NER [47].

Related work in transfer learning and domain adaptation for NLP of EHRs

Several studies have applied text-based transfer learning, including predicting morbidity, mortality, and adverse events from oncological radiation [48,49,50] and assessing the risk of psychological stressors, diseases, and drug abuse [51,52,53,54]. Transfer learning methods have also been applied to the clinical domain by training several tasks sequentially: one study pre-trained a convolutional neural network (CNN) on PubMed-indexed biomedical articles to identify medical subject headings and then transferred the model to predict International Classification of Diseases (ICD) codes in EHRs [55]. A similar approach used unlabeled data from three institutions, applying self-training and transfer learning to classify radiological reports with a small labeled dataset [56]. Pre-trained word embeddings are commonly transferred to downstream tasks, such as applying medical embeddings to NER in the clinical domain [57]. Embeddings have been trained in both the general and clinical domains, and many methods have been developed to adapt them, such as concatenation and fine-tuning [58]. Another method pre-trained embeddings on the relation extraction task of the 2009 i2b2 challenge [59] and then transferred them to NNs for clinical term extraction on the shared dataset [60].

Pre-trained transformer methods: Transfer learning with transformer-based models has become a standard approach in NLP due to its effectiveness and efficiency in leveraging pre-existing knowledge for downstream tasks. Recently, transformer architectures [33] (e.g., BERT) applying self-attention mechanisms [61] have achieved the best results on many NLP tasks [62, 63]. Transformer-based NLP models have achieved significant performance in several areas, such as NER [64, 65], relation extraction [66, 67], sentence similarity [68, 69], natural language inference [69, 70], and question answering [69, 71,72,73]. Transformer training involves two phases: (1) pre-training, where the language model is learned by self-supervised training on a large unlabeled dataset; and (2) fine-tuning, where the pre-trained model is trained on labeled data to address a specific task. Applying a pre-trained language model to new NLP tasks in this way is a form of transfer learning, the transfer of knowledge from one task to another [74]. The sample space for human language is enormous; there is an effectively infinite number of possible permutations of tokens, sentences, and their grammar and meaning. According to recent studies, the emergence and homogenization of large transformer models trained on large text corpora have made them significantly superior to previous NLP models [74].
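As a concrete illustration of the fine-tuning phase, the sketch below adapts a pre-trained checkpoint to token-level classification with the Hugging Face Trainer API. The label set and single-sentence dataset are toy stand-ins for a real BIO-labeled corpus, not the actual training setup of any system discussed here.

```python
# A minimal fine-tuning sketch (phase 2): attach a token-classification head
# to a pre-trained transformer and train it with the Hugging Face Trainer.
import torch
from torch.utils.data import Dataset
from transformers import (AutoModelForTokenClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

LABELS = ["O", "B-RISK", "I-RISK"]  # toy label set; real labels follow the i2b2 guidelines

class ToyDataset(Dataset):
    """One sentence labeled all 'O'; real labels come from gold annotations."""
    def __init__(self, tokenizer):
        enc = tokenizer("Patient has hypertension and diabetes.", truncation=True)
        self.input_ids = torch.tensor(enc["input_ids"])
        self.attention_mask = torch.tensor(enc["attention_mask"])
        # -100 masks special tokens out of the loss computation.
        self.labels = torch.tensor(
            [-100 if t in tokenizer.all_special_ids else 0 for t in enc["input_ids"]])

    def __len__(self):
        return 1

    def __getitem__(self, idx):
        return {"input_ids": self.input_ids,
                "attention_mask": self.attention_mask,
                "labels": self.labels}

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-uncased", num_labels=len(LABELS))  # randomly initialized head

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="out", num_train_epochs=1,
                           per_device_train_batch_size=1),
    train_dataset=ToyDataset(tokenizer),
)
trainer.train()  # supervised fine-tuning on the labeled data
```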

Biomedical models based on BERT include BioBERT [33], BlueBERT [75], and ClinicalBERT [34]. These models use a continued pre-training method, initializing the model weights from BERT pre-trained on BooksCorpus and Wikipedia while keeping the same vocabulary. Pre-training from scratch on domain-specific corpora with a domain-specific vocabulary improves performance, as in SciBERT [76], PubMedBERT [77], and Bio-lm [78].

The BERT model has been applied to the scientific, clinical, and biomedical domains. In BioBERT [33], BERT is further pre-trained on PubMed and PubMed Central (PMC) articles. In BlueBERT [75], BERT is pre-trained on PubMed, PMC, and MIMIC-III data [27]. ClinicalBERT [34] is pre-trained on MIMIC-III data starting from BioBERT weights, while SciBERT [76], PubMedBERT [77], and Bio-lm [78] train BERT from scratch with domain-specific data: SciBERT is pre-trained on Semantic Scholar data, PubMedBERT on PubMed and PMC data, and Bio-lm on PubMed, PMC, and MIMIC-III data [78]. BlueBERT and PubMedBERT have introduced benchmarks for biomedical NLP: BLUE (Biomedical Language Understanding Evaluation) and BLURB (Biomedical Language Understanding & Reasoning Benchmark). Table 1 summarizes the state-of-the-art transformer-based models with their pre-training datasets and training weights.

Table 1 Recent pre-trained transformer models

Materials and methods

Objective task

We propose a high-performance heart disease risk factor identification model, turning to open-source NLP frameworks to bring together clinical decision support functionality and note-taking interfaces. We modeled the task as a NER task, following [30, 31], and explored transfer learning using openly available biomedical contextual embeddings; a minimal BIO-tagging sketch follows the list below. Our main objective was to get a transfer learning process working with these embeddings. The context in which this is performed is as follows:

  1. We examined transfer learning models employing BERT [32], BioBERT [33], BioClinicalBERT [34], RoBERTa [35], and XLNet [36] contextual embeddings, including embeddings pre-trained on PubMed abstracts [79]. These are among the best deep language models; they are built on the transformer architecture [61] and pre-trained on massive unstructured text datasets.

  2. We examine embedding-specific methods for improving performance, including language-model fine-tuning, scalar mixing, and aggregation of subword tokens.

  3. We develop a model for identifying heart disease risk factors based on the performance of the transfer learning models. Sentence enhancement at prediction time extracts risk factors more effectively and allows a better understanding of the embeddings' behavior. Ensembling strategies improve performance across all eight challenging risk factor categories.

Hypothesis

We hypothesized that transfer learning methods, which are deep learning methods using a pre-training/fine-tuning architecture, offer superior performance for heart disease risk factor prediction, and that pre-trained embeddings can enhance classification efficiency in the clinical domain.

We proposed systematically investigating five widely used transformer-based models, BERT, BioBERT, BioClinicalBERT, RoBERTa, and XLNet, to develop a heart disease risk factor detection model that can identify diseases, risk factors, medications, and times of occurrence.

Dataset

The proposed model uses a dataset provided by Partners HealthCare (http://www.partners.org); the shared task data are available at https://www.i2b2.org/NLP/HeartDisease/. The dataset includes clinical notes and discharge summaries: 1304 records of 296 diabetic patients, annotated with heart disease risk factors and temporal attributes relative to the document creation time (DCT). According to the challenge provider, the dataset is divided into a training set with 60% of the records (790 records) and a test set with 40% of the records (514 records). The organizers of the i2b2 NLP shared task provided two annotated datasets, SET1 and SET2, for development and training. SET1 contained 521 de-identified clinical notes, while SET2 consisted of 269 de-identified notes, giving a combined total of 790 training documents. The test set consisted of 514 de-identified clinical notes. The annotation guidelines identify the presence of diseases (including CAD and diabetes), the eight relevant risk factors (including hyperlipidemia, hypertension, obesity, smoking status, and family history), and associated medications. A number of indicators determine whether a disease or risk factor is present in the patient relative to the DCT (before, during, or after). Table 2 provides a summary of the tag types and their corresponding attribute values provided in the challenge data. The challenge organizers released two versions of the data: complete and gold. Each clinical record in the gold version is presented in XML format, and XML tags annotate target concepts mentioned anywhere in the record (such as <DIABETES time="before DCT" indicator="mention">). The complete version additionally includes evidence annotations made by three clinicians on the text segments; in this version, each concept annotated at the document level is linked to the relevant text segment at the record level to provide evidence of heart disease (e.g., <DIABETES start="4401" end="4422" text="HbA1c 03/05/2074 6.6" time="during DCT" indicator="A1C">). Figure 1 shows an example of a tag extracted from the training data (220-05.xml) in both versions, together with its evidence in the text field.
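For illustration, the sketch below reads one gold-version record with Python's standard library. It assumes the tag layout described above (risk factor elements carrying time and indicator attributes); the exact XML nesting of the release is an assumption.

```python
# A minimal sketch of parsing a gold-version record such as 220-05.xml.
import xml.etree.ElementTree as ET

root = ET.parse("220-05.xml").getroot()

# Collect every DIABETES tag wherever it appears in the document tree and
# print its temporal and indicator attributes.
for tag in root.iter("DIABETES"):
    print(tag.get("time"), tag.get("indicator"))
```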

Fig. 1 An example of the heart disease risk factor tags included in the complete and gold versions

Table 2 A summary of the shared task dataset’s risk factor tags

Research problem description

The research problem treats each tag type as follows: first, identify the available evidence by its type and indicator; then, identify the time attribute (if one exists).

Following the terminology used by Chen et al. [30], risk factor tags can be categorized into three groups by analyzing their evidence:

  1. Phrase-based risk factors are identified by detecting relevant phrases in the clinical note, such as ‘diabetes’ or the name of a specific drug.

  2. Logic-based risk factors require analysis of the detected phrase; for example, determining whether or not high blood pressure is a risk factor requires locating a blood pressure measurement and comparing the numbers.

  3. Discourse-based risk factors require sentence parsing because they are embedded in clinical text fragments, such as family history or smoking status.

Table 3 Examples of evidence types for risk factors

After classifying all tags into the three groups shown in Table 3, we adopted a standard organizing principle for each category. Figure 2 illustrates the proposed model modules: pre-processing, tag extraction, time attribute identification, and post-processing. Initially, the pre-processing module detects sentence boundaries and tokenizes the clinical notes in the raw data file. The tag extraction module then identifies the type and indicator of each tag in the three categories of Table 3. Next, the time attribute identification module determines whether or not there is evidence for the time attribute, and the proposed transformer-based models are fine-tuned. For evaluation, the post-processing module transforms the tags from the complete version into gold-version tags.

Preprocessing

Preprocessing first splits full-text clinical records into separate sentences; for this we applied Splitta [80], an open-source machine learning sentence splitter. Each sentence is then tokenized, and context-sensitive features are added to each token and token occurrence. Using MetaMap [40], tokens and sentences from the clinical notes were mapped to concepts. As soon as a token or sentence is mapped to one of the targeted concepts (such as disease or syndrome, family group, or smoking), its sentence is marked as a candidate for further processing. The annotation set is also processed using MetaMap to identify the target concepts.
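The study itself relies on Splitta and MetaMap; as a self-contained stand-in, the sketch below reproduces the same two steps, sentence splitting and token-to-concept matching, with a naive rule-based splitter and a toy concept lexicon.

```python
# A simplified preprocessing sketch: split a note into sentences, tokenize,
# and keep only sentences containing a targeted concept. Splitta and MetaMap
# perform these steps more robustly in the actual pipeline.
import re

TARGET_CONCEPTS = {"hypertension", "diabetes", "smoker"}  # toy MetaMap stand-in

def split_sentences(note):
    # Naive boundary rule; Splitta's ML model handles clinical abbreviations better.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", note) if s.strip()]

def candidate_sentences(note):
    for sentence in split_sentences(note):
        tokens = re.findall(r"\w+", sentence.lower())
        if TARGET_CONCEPTS.intersection(tokens):
            yield sentence, tokens  # forwarded to tag extraction

note = "Pt is a smoker. Denies chest pain. Hx of hypertension."
for sentence, tokens in candidate_sentences(note):
    print(sentence)
```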

The proposed model for identifying heart disease risk factors based on transfer learning models

This study aimed to systematically investigate five widely used transformer-based models, BERT, BioClinicalBERT, RoBERTa, BioBERT, and XLNet (described below; a loading sketch follows the list), to detect risk factors for heart disease by identifying diseases, risk factors, medications, and times of occurrence.

  1. The BERT model was pre-trained with masked language modeling and then optimized with next sentence prediction. The base architecture has 12 transformer layers, a hidden size of 768, 12 attention heads, and 110 million parameters.

  2. The RoBERTa model (A Robustly Optimized BERT Pretraining Approach) is a transformer-based model built on BERT's architecture, but pre-trained with dynamic masked language modeling and optimized by removing next sentence prediction.

  3. The BioClinicalBERT model is based on the BERT architecture but is pre-trained on a large dataset of biomedical and clinical texts, including EHRs. This allows the model to capture the specific language and terminology used in these domains. BioClinicalBERT also incorporates additional inputs, such as segment labels indicating the source of each token (e.g., medication, diagnosis, test result) and position labels indicating the position of each token within a segment.

  4. The BioBERT model was pre-trained on a large corpus of biomedical text, including biomedical research articles, clinical notes, and EHRs. This pre-training allows the model to learn the patterns and structures of biomedical language, which can then be fine-tuned for specific downstream tasks such as NER, text classification, and question answering. One of BioBERT's main advantages is its ability to handle the complex, domain-specific language of the biomedical field, which can be challenging for traditional NLP models. Additionally, BioBERT has achieved state-of-the-art performance on several biomedical language processing benchmarks, demonstrating its effectiveness across tasks.

  5. The XLNet model is based on the transformer architecture used in other successful language models such as BERT. However, XLNet introduces a new training objective called permutation language modeling, which differs from BERT's masked language modeling by predicting tokens over randomly permuted factorization orders of the input.
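The sketch below instantiates the five models from the Hugging Face hub. The checkpoint identifiers are the commonly used public releases; since the paper does not list exact identifiers, they are assumptions.

```python
# Loading the five compared models with a shared token-classification head.
from transformers import AutoModelForTokenClassification, AutoTokenizer

CHECKPOINTS = {
    "BERT": "bert-base-uncased",
    "RoBERTa": "roberta-base",
    "BioBERT": "dmis-lab/biobert-base-cased-v1.1",
    "BioClinicalBERT": "emilyalsentzer/Bio_ClinicalBERT",
    "XLNet": "xlnet-base-cased",
}

def load_model(name, num_labels):
    checkpoint = CHECKPOINTS[name]
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModelForTokenClassification.from_pretrained(
        checkpoint, num_labels=num_labels)
    return tokenizer, model

tokenizer, model = load_model("BioBERT", num_labels=3)  # toy label count
```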

Fig. 2 The proposed model for heart disease risk factor identification by fine-tuning transformer-based models

Fine-tuning: the models above are trained on the pre-processed data to detect heart disease risk factors, as shown in Fig. 2; this involves fine-tuning the models' weights to optimize accuracy. Feature extraction: once trained, the fine-tuned transformer-based models extract meaningful features from the text of new EHRs, allowing the model to predict heart disease risk factors with a high degree of accuracy. Validation and testing: to ensure the model is accurate and generalizable, its performance is evaluated on a validation dataset and then on a separate test dataset. Deployment: once validated and tested, the model can be deployed to identify heart disease risk factors from EHRs in real-world settings, whether integrated into EHR systems or as a standalone application that analyzes EHR data and provides risk assessments to healthcare providers.

Transformer-based models involve two processes: pre-training and fine-tuning. The pre-training process: previous research [33] demonstrated that pre-training on a clinical dataset improves clinical concept extraction; we therefore investigated both general models pre-trained on general English data and clinical models pre-trained on clinical data for each of the five transformer-based models. We employed state-of-the-art transformer-based models pre-trained on general English datasets, such as bert-base-uncased (BERT-general), RoBERTa-base (RoBERTa-general), and XLNet, alongside clinical and biomedical models pre-trained on biomedical literature and the MIMIC-III clinical notes dataset [27], namely BioBERT and BioClinicalBERT.

The fine-tuning process: to predict heart disease risk factors from the clinical concepts annotated in the training dataset, we built upon the transformer-based models pre-trained on the MIMIC data by adding a linear classification layer. Both the classification layer parameters and the transformer weights are optimized to achieve strong clinical concept extraction results.
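A minimal sketch of this architecture is given below: a pre-trained encoder with a single linear classification layer producing per-token logits. The checkpoint and label count are illustrative, not the study's exact configuration.

```python
# A pre-trained encoder plus one linear classification layer, matching the
# fine-tuning architecture described above; both parts are optimized jointly.
import torch.nn as nn
from transformers import AutoModel

class RiskFactorTagger(nn.Module):
    def __init__(self, checkpoint, num_labels):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(checkpoint)
        self.classifier = nn.Linear(self.encoder.config.hidden_size, num_labels)

    def forward(self, input_ids, attention_mask):
        hidden = self.encoder(input_ids=input_ids,
                              attention_mask=attention_mask).last_hidden_state
        return self.classifier(hidden)  # per-token logits over the label set

tagger = RiskFactorTagger("bert-base-uncased", num_labels=17)  # label count illustrative
```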

Experimental results and simulations

In this section, we describe in detail the weighted-averaged results achieved by the fine-tuned transformer-based models, compared with the most recent systems and models from the 2014 i2b2 shared task, as shown in Table 4.

Table 4 compares the fine-tuned transformer-based models' results with the top-ranked systems [37, 41, 81], which use a hybrid of knowledge- and data-driven techniques, and with systems [10, 82, 83] that use only knowledge-driven techniques, such as lexicon- and rule-based classifiers.

Table 4 The weighted-averaged evaluation results of fine-tuned transformer-based models and the most recent models and systems from 2014 i2b2 shared task

Evaluation metrics

The performance of the proposed model using transformer-based models was evaluated with the evaluation script provided by the shared task organizers. We used recall, precision, and the F1-measure as primary measurements. The overall averages include both macro- and micro-averages; micro-averages are provided for each class-indicator pair (official evaluation script: https://github.com/kotfic/i2b2_evaluation_scripts).
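The arithmetic behind the micro-averaged scores is illustrated below; the official evaluation script linked above remains the authoritative implementation, and the counts in the example are illustrative.

```python
# Micro-averaging pools true/false positives and false negatives over all
# class-indicator pairs before computing precision, recall, and F1.
def micro_prf(tp, fp, fn):
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1

print(micro_prf(tp=940, fp=60, fn=55))  # illustrative counts
```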

Results and discussion

We applied transfer learning to develop a model that detects risk factors for heart disease from clinical texts over time using the 2014 i2b2 clinical NLP challenge dataset. The most recent models chosen for fine-tuning in the classification task included five transformer-based models: BERT, BioBERT, RoBERTa, BioClinicalBERT, and XLNet. Our objective was to identify diseases, risk factors, medications, and time factors based on DCT. First, the proposed transformer-based models retrieved these risk indicators and then determined their temporal attributes.

Data augmentation is applied to the i2b2 dataset for heart disease risk factor detection from EHR by generating variations of the existing data to increase the size of the training set. Data augmentation can help prevent overfitting and improve the generalization of the pre-trained models. The augmented data is validated to ensure semantic integrity. Then the original dataset is integrated with the augmented data to generate a new training set. The annotation process for risk factors is applied to the new training set. The training process is performed on the new augmented dataset, then the fine-tuned models are validated using a validation set and finally tested using a test dataset to assess their performance.

The data augmentation is performed to address the under-representation of a specific class and to ensure adequate representation of the minority class, in this case the ‘glucose’ class in the Diabetes indicator classification. When a particular class is under-represented in a dataset, the training data become imbalanced, potentially causing the model to struggle to learn and predict the minority class accurately.

To address this challenge, data augmentation techniques are employed. In this context, data augmentation involves duplicating instances belonging to the minority class within the training set. By creating additional copies of the minority class samples, the augmented dataset now contains more instances representing the under-represented class.

After performing data augmentation and adding the duplicated instances, the next step is shuffling the entire training set. Shuffling ensures that the duplicated examples of the minority class are evenly distributed throughout the training data, preventing the model from encountering batches or patterns biased toward the majority class.

By employing data augmentation and shuffling, the training data are modified to have a more balanced representation of the minority class, such as ‘glucose’ in this case. This approach aims to improve the model's ability to learn and generalize well for all classes, including the under-represented class.
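A minimal sketch of this oversampling-and-shuffling step is given below, assuming a list of labeled examples; the duplication factor and toy records are illustrative.

```python
# Duplicate instances of an under-represented indicator class, then shuffle
# so the copies are spread evenly across training batches.
import random

def oversample(examples, minority_label, copies=2, seed=13):
    minority = [ex for ex in examples if ex["label"] == minority_label]
    augmented = examples + minority * copies
    random.Random(seed).shuffle(augmented)
    return augmented

train = [{"text": "BG 230 this AM", "label": "glucose"},
         {"text": "on metformin", "label": "medication"},
         {"text": "HTN well controlled", "label": "hypertension"}]
print(len(oversample(train, "glucose")))  # 3 originals + 2 duplicates = 5
```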

After developing and fine-tuning the NLP techniques and transformer-based models on the training corpus (SET1 and SET2), we applied them to the testing corpus and compared the results against the gold standard provided by the shared task organizers. The outputs of the transformer-based models were scored with the primary metric of the i2b2 challenge evaluation script: each extracted tag is classified as a true positive (the result matches the gold annotation), a false positive (the result does not match any gold annotation), or a false negative (a gold annotation is missed). The tables below show the results for each class of risk factors and their indicators.

We used the filter option provided by the evaluation script to determine the results for each class of risk factors. Using the Conjunctive option, it is also possible to isolate specific risk factors and their attribute value indicators, such as the CAD tag with indicator="mention".

Following the annotation standard, we give the results for each disease category separately, for general mentions and disease-specific indicators. Results for the SMOKING category are reported as status only, while results for the MEDICATION categories are combined. For each heart disease risk factor class, the results in the tables below were generated for all temporal information tags, categorized as before, during, and after the DCT.

The best-performing model in most cases for risk factor identification was RoBERTa, with an F-measure of 93.94%, a precision of 93.90%, and a recall of 94.27% at the weighted-average level. BERT prevailed in the cases with higher numbers of categories, with micro-precision, micro-recall, and micro-F1 scores of 92.51%, 93.73%, and 92.84%, respectively. BioBERT obtained precision, recall, and F1-scores of 93.37%, 93.99%, and 93.57%, respectively, in identifying risk factors. BioClinicalBERT attained a precision of 93.38%, a recall of 94.03%, and an F1-measure of 93.57% at the weighted-average level. XLNet achieved an F-measure of 93.71%, a recall of 93.97%, and a precision of 93.61% at the weighted level.

Furthermore, the best results at the risk indicator level were achieved by applying the BioBERT model to identify Hypertension, Diabetes, and Smoker, with micro-averaged F1-measures of 0.91, 0.83, and 0.90, respectively. The BERT model performs best on Hypertension (0.90), Smoker (0.88), and FamilyHist (0.88). BioClinicalBERT performs best on Hypertension (0.92), Smoker (0.94), and FamilyHist (0.88). RoBERTa performs best on Hypertension (0.92), Smoker (0.90), and FamilyHist (0.88). XLNet performs best on Hypertension (0.91), Smoker (0.90), and FamilyHist (0.88). FamilyHist is a simple task because few records contain evidence of family members diagnosed with CAD.

Tables 5, 6, 7, and 8 show how BERT, RoBERTa, BioClinicalBERT, and XLNet performed on the i2b2 test data in terms of F1-measure, recall, and precision at the risk indicator level. The overall performance of the transformer-based models at the level of the time attributes associated with each risk indicator, relative to the DCT, is shown in Tables 9, 10, 11, and 12. Table 13 shows the overall performance on the eight risk factor categories based on the i2b2 test data. Figure 3 shows the F1 curves of the five transformer-based models using the final dataset after the augmentation process.

In this study, we evaluated model-ensembling techniques for improving the proposed heart disease risk factor detection model's performance. The reported predictions are based on ensembling the fine-tuned transformer-based models, which achieved an F1-score of 94.26%. The overall F1-scores for five different ensemble models are shown in Table 14. The ensemble model provides the best performance across all risk factor detections, demonstrating the efficiency of the technique. We ensemble the five transformer-based models to combine the benefits of each model's word embedding technique; the rationale is to leverage the strengths of different models. Each model in the ensemble can be particularly good at handling certain kinds of input or capturing different linguistic patterns, depending on its pre-training dataset. Integrating their outputs can improve overall performance and yield more reliable results. Furthermore, ensembles provide a form of model averaging that decreases overfitting and improves generalization. The ensemble approach has therefore yielded strong improvements in many NLP tasks [84].
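The paper does not spell out its exact combination rule, so the sketch below shows one common choice, majority voting over the per-token predictions of the five fine-tuned models; it is an assumption, not the study's verified implementation.

```python
# Majority-vote ensembling over aligned per-token label sequences.
from collections import Counter

def majority_vote(per_model_predictions):
    """per_model_predictions: one label sequence per model, same length."""
    return [Counter(votes).most_common(1)[0][0]
            for votes in zip(*per_model_predictions)]

predictions = [
    ["O", "B-DIABETES", "O"],  # BERT
    ["O", "B-DIABETES", "O"],  # BioBERT
    ["O", "O", "O"],           # BioClinicalBERT
    ["O", "B-DIABETES", "O"],  # RoBERTa
    ["O", "B-DIABETES", "O"],  # XLNet
]
print(majority_vote(predictions))  # ['O', 'B-DIABETES', 'O']
```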

The optimized hyperparameters of the most recent models chosen for fine-tuning are presented in Table 15.

Fig. 3 The F1 plot of the training- and validation-learning curves of five transformer-based models using the final dataset

Table 5 BERT evaluation metrics for each risk factor indicator
Table 6 RoBERTa evaluation metrics for each risk factor indicator
Table 7 BioClinicalBERT evaluation metrics for each risk factor indicator
Table 8 XLNet evaluation metrics for each risk factor indicator
Table 9 BERT evaluation metrics for each risk factor indicator based on time attribute identification
Table 10 RoBERTa evaluation metrics for each risk factor indicator based on time attribute identification
Table 11 BioClinicalBERT evaluation metrics for each risk factor indicator based on time attribute identification
Table 12 XLNet evaluation metrics for each risk factor indicator based on time attribute identification
Table 13 Overall performance of the transformer-based models on the eight risk factor categories
Table 14 Overall F1-scores of the ensemble models
Table 15 Hyperparameters optimized via training

Error analysis

We classified the risk factors into one of three groups, as described in Table 3, according to the type of evidence for each risk factor. Evaluation on the test dataset provided by the shared task organizers showed that the proposed model performed effectively, with the best micro F1-score being 94.27% using the RoBERTa model. Despite using fewer annotations, the fine-tuned transformer-based models performed better than the highest-performing system participating in the shared task. Although their overall efficiency was exceptional, they did not achieve strong results on several tag types, such as obesity, CAD, and smoking status.

The two tag types containing the most discourse-based indicators (CAD and smoking status) had an unusually large number of negative samples (evidence that is not an indicator); the negative samples of the CAD event indicator are one example. This large number of negative samples may explain the poor results. Obesity status tags make up only 2.4% of the test dataset; given the class imbalance issue associated with transfer learning techniques, their low frequency makes them challenging to recognize. The summary report of the 2014 i2b2 challenge reveals that other participating systems had similar results on these three tag types.

In terms of disease indicators, hyperlipidemia had the lowest recall and obesity the lowest precision. Due to inaccurate chunking, some clinical notes expressing hyperlipidemia indicators as ‘high cholesterol’, ‘elevated lipids’, and ‘elevated serum cholesterol’ were not recognized by our proposed model. Furthermore, our dictionary lookup module did not contain the associated ICD codes for some of these items.

In the testing corpus, there were at least two instances in which hyperlipidemia was mentioned directly following another word without a space, such as ‘hemodialysisHyperlipidemia’, which our proposed model failed to recognize. Including the Unified Medical Language System (UMLS) concept ‘overweight’ in the list of ICD codes for obesity led to low precision: although the obesity indicator ‘overweight’ appeared in only one record of the training dataset, it generated a large number of false positives. Additionally, our proposed model generated false positives when ‘obese’ was used descriptively rather than as an ‘obesity’ indicator (e.g., ‘abdomen is moderately obese’ and ‘abdomen is slightly obese’). Regular expressions at the lexical level were not always effective in addressing other disease and risk factor indicators.

The following issues are associated with heart disease indicators:

  • Many different lexical forms and acronyms refer to the same set of laboratory indicators for heart disease. For diabetes and hypertension, regular expressions are applied to determine blood glucose and blood pressure levels. Blood pressure can be stated using BP or b/p, and glucose levels can be described using BG, BS, FS, and FG. This is one of the shortcomings of our proposed model, and an integrated approach would be needed to address it and improve accuracy (see the regex sketch after this list).

  • The numerical results of a laboratory test must be extracted accurately. After the proposed model finds the matching terms for laboratory or test indicators, it must extract the numerical values associated with those terms and compare them with threshold levels that indicate an abnormality. When the numbers follow the term as a single unit, as in ‘GLU 230(H)’, extraction is easy. Some phrases are harder to handle, such as ‘FS in the AM ~90–180; now 80–175, mostly 90–180’. Here ‘–’ indicates value ranges, and several values are qualified by time and frequency terms such as ‘now’ and ‘mostly’.

  • Training data sparsity: sometimes there were not enough training examples for adequate generalization of the proposed model. For the cholesterol indicator of hyperlipidemia, there were only nine annotations in the entire set of 790 training records; for the LDL indicator, there were approximately 33 annotations.

  • Analysis of complex time attributes: indicators for laboratory tests require additional temporal analysis. Various time-attribute values (i.e., during, before, and after the DCT) appear in the annotations of most laboratory tests and vital signs; in contrast, the time-attribute value ‘continuing’ is primarily used in the annotation of chronic disease mentions. For example, glucose and A1c tests were typically performed at a previous visit and are therefore labeled ‘before DCT’, whereas BP is usually taken at the time of the patient's visit and labeled ‘during DCT’.
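As referenced in the first bullet above, the sketch below shows the kind of regular-expression matching involved: recognize the varied glucose abbreviations, extract the adjacent numeric value, and compare it to an abnormality threshold. The pattern and threshold are illustrative, not the paper's exact rules.

```python
# Match common glucose abbreviations, extract the adjacent numeric value,
# and flag readings above an illustrative abnormality threshold.
import re

PATTERN = re.compile(r"(?:GLU|BG|BS|FS|FG)\s*(\d{2,3})", re.IGNORECASE)

def abnormal_glucose(text, threshold=126):
    for match in PATTERN.finditer(text):
        value = int(match.group(1))
        if value > threshold:
            yield match.group(0), value

print(list(abnormal_glucose("GLU 230(H); FS in the AM ~90")))
# -> [('GLU 230', 230)]
```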

Conclusion and future work

In conclusion, we developed a model to identify heart disease risk factors and demonstrated that transfer learning can be effectively applied to detect heart disease risk factors and the times at which they present in EHRs. Our research highlighted that the application of transfer learning has increased dramatically in recent years, and several studies have demonstrated the significant role of transformer-based transfer learning, through fine-tuning, in extracting clinical concepts from clinical notes and in other clinical NLP tasks. On the i2b2 heart disease risk factor shared dataset, transformer-based models outperformed conventional models in precision when predicting the presence of risk factors, and they identified novel risk factors that traditional models did not capture. Our experiments investigated the effectiveness of five models (BERT, RoBERTa, BioBERT, BioClinicalBERT, and XLNet) in extracting risk factors for heart disease. The RoBERTa model achieved state-of-the-art performance with a micro F1-score of 94.27%, while the BERT, BioClinicalBERT, XLNet, and BioBERT models provided significant performance with micro F1-scores of 93.73%, 94.03%, 93.97%, and 93.99%, respectively. The results also showed that a simple ensemble of the five transformer-based models is an effective strategy, improving the overall performance of the proposed heart disease risk factor identification model with a micro F1-score of 94.26%. Using transformer-based models, our study demonstrated the effectiveness of transfer learning for improving heart disease risk prediction.

As part of our future work, we will focus on analyzing embedding-specific issues such as misclassification, as well as incorporating the fine-tuning process into other clinical NLP tasks.