Fully automated quantification of cardiac chamber and function assessment in 2-D echocardiography: clinical feasibility of deep learning-based algorithms

We aimed to compare the segmentation performance of the current prominent deep learning (DL) algorithms with ground-truth segmentations and to validate the reproducibility of the manually created 2D echocardiographic four cardiac chamber ground-truth annotation. Recently emerged DL based fully-automated chamber segmentation and function assessment methods have shown great potential for future application in aiding image acquisition, quantification, and suggestion for diagnosis. However, the performance of current DL algorithms have not previously been compared with each other. In addition, the reproducibility of ground-truth annotations which are the basis of these algorithms have not yet been fully validated. We retrospectively enrolled 500 consecutive patients who underwent transthoracic echocardiogram (TTE) from December 2019 to December 2020. Simple U-net, Res-U-net, and Dense-U-net algorithms were compared for the segmentation performances and clinical indices such as left atrial volume (LAV), left ventricular end diastolic volume (LVEDV), left ventricular end systolic volume (LVESV), LV mass, and ejection fraction (EF) were evaluated. The inter- and intra-observer variability analysis was performed by two expert sonographers for a randomly selected echocardiographic view in 100 patients (apical 2-chamber, apical 4-chamber, and parasternal short axis views). The overall performance of all DL methods was excellent [average dice similarity coefficient (DSC) 0.91 to 0.95 and average Intersection over union (IOU) 0.83 to 0.90], with the exception of LV wall area on PSAX view (average DSC of 0.83, IOU 0.72). In addition, there were no significant difference in clinical indices between ground truth and automated DL measurements. For inter- and intra-observer variability analysis, the overall intra observer reproducibility was excellent: LAV (ICC = 0.995), LVEDV (ICC = 0.996), LVESV (ICC = 0.997), LV mass (ICC = 0.991) and EF (ICC = 0.984). The inter-observer reproducibility was slightly lower as compared to intraobserver agreement: LAV (ICC = 0.976), LVEDV (ICC = 0.982), LVESV (ICC = 0.970), LV mass (ICC = 0.971), and EF (ICC = 0.899). The three current prominent DL-based fully automated methods are able to reliably perform four-chamber segmentation and quantification of clinical indices. Furthermore, we were able to validate the four cardiac chamber ground-truth annotation and demonstrate an overall excellent reproducibility, but still with some degree of inter-observer variability.


Introduction
Echocardiography is a primary cardiac imaging tool which provides noninvasive real-time imaging for identifying cardiac structure along with function in clinical practice. However, unlike other imaging modalities such as computed tomography (CT) or magnetic resonance imaging (MRI), echocardiography is highly operator dependent with various types of artifacts which can occur during image acquisition as well as post-processing [1]. Unavoidably, these limitations can cause large inter and intra observer variability and Sekeun Kim and Hyung-Bok Park have contributed equally to this work as co-first authors. poor reproducibility, resulting in clinical shortcomings, in particular, when it comes to monitoring longitudinal quantitative measurements [2,3]. Even though, semi-automated delineation and quantification of cardiac structures have demonstrated their utility in the quantitative echocardiographic assessment, they still need significant amount of manual modification which is time-consuming and may introduce the risk of the inter-/intra-observer variability [4].
The adoption of artificial intelligence (AI) technology in echocardiographic imaging has recently emerged as a novel alternative solution for these challenges [5,6]. With the advance of deep neural networks, there have been attempts to develop fully-automated algorithms for mainly focusing on the left ventricle (LV), where there is a clinical unmet need for the accuracy and reproducibility of quantitative assessments [7]. Importantly, in order to develop an accurate chamber segmentation deep learning algorithm, which is the basis of the functional assessment, is dependent on the ground-truth image data, being so called 'training data set', being essentially accurate and reproducible [8,9].
The poor repeatability is primarily due to the low signalto-noise ratio, edge dropout, and presence of shadow, there is a high variability in quantitative assessment of echocardiography [2,3]. However, to date, there has been no investigation of the inter-/intra-observer variability of ground-truth of four chamber data itself which is crucial for the accurate and advanced deep-learning algorithms. In addition, there have been no comparisons among currently available deep learning algorithms to suggest the most potent approaches for fully automated quantitative analysis nor have there been clinical performance comparisons made between the algorithms and conventional techniques.
Therefore, firstly we sought to evaluate whether currently prominent deep learning methods could accurately estimate clinical indices including LV mass, LV/LA volume, and EF when compared mutually as well as with conventional methods. Secondly, we validated the reproducibility of our manually created ground-truth segmentations in two-dimensional echocardiography which are imperative for the development of highly advanced deep-learning AI based algorithms.

Study population
In this study, we retrospectively enrolled consecutive patients who had visited our cardiology outpatient clinic and underwent transthoracic echocardiogram (TTE) due to various symptoms from December 2019 to December 2020. The Institutional Review Board at Yonsei University Severance Cardiovascular Hospital approved this study protocol. Inclusion criteria consisted of having had all echocardiography visually correspond to standard views and we excluded patients with diagnosis of heart failure, coronary artery disease, valvular heart disease, known pregnant state, uninterpretable quality of echocardiographic images or images that were acquired by non-standardized scan angles, and those had abnormal echocardiograms (Table 1). All patients were scanned in the left lateral position using grayscale secondharmonic 2D imaging techniques, with the adjustment of image contrast, frequency, depth, and sector size for adequate frame rate and optimal LV border visualization. All patients received a complete quantification report, which was validated by cardiologists. Image quality was defined based on percent endocardial border visualization as good image quality: 67%-100%, fair image quality: 34%-66%, and poor image quality: 0%-33% [10]. Echocardiographic images were acquired by standard ultrasound equipment including Vivid 9 (GE Healthcare, Horton, Norway; n = 259), EPIQ 7C (Phillips Healthcare, Andover, MA, USA; n = 170), Acuson SC2000 (Siemens, Mountain View, CA, USA; n = 42), and Artida (Canon Medical Systems, Tokyo, Japan; n = 29).

Ground-truth generation
Anonymized echocardiographic digital images were analyzed and annotated in a core laboratory, where experienced sonographers (five expert sonographers who all had more than 5 years of echocardiographic experience) manually contoured cardiac structures according to the recommendations of the American Society of Echocardiography [11]. For each patient, we selected a set of B-mode images including apical two chamber (A2C) view, apical four chamber (A4C) view and parasternal short axis (PSAX) view at papillary muscle (PM) level which included at least one cardiac cycle. In a series of multi-video frames, manual annotation at enddiastole (ED) and end-systole (ES) for four chambers were performed using a commercial annotation tool (OsiriX, Pixmeo, Switzerland). Two chambers including LV and left atrium (LA) were delineated when available in A2C, A4C, and PSAX views, respectively. In addition, we further delineated four chamber structures including right ventricle (RV) and right atrium (RA) in A4C view (Fig. 1). During LV endocardial wall segmentation, trabeculations and papillary muscles were excluded in PSAX view at PM level.

Inter-and intra-observer reproducibility analysis
To validate our manually created ground truth annotation, we performed both inter-and intra-observer variability analysis. Two sonographers, who both had more than 5 years of echocardiography experience with more than 5000 echocardiographic examinations, were chosen and the variability of clinical indices were assessed on a randomly selected echocardiographic images of A2C, A4C, and PSAX views in 100 patients. One sonographer manually annotated LV, LA, RV, and RA contours on the three views at ED and ES with an interval of one month with a randomly shuffled analysis preventing the observer from being influenced by previous measurements (intra-observer variability). The other sonographer annotated the same groups without being influenced by the other (inter-observer variability).

Dataset splitting
A total of 500 echocardiograms were allocated for developing automated deep learning methods. Each of the echocardiograms consisted of consecutive frames with dozens of still images. All information was removed from echocardiograms that could identify individual patients. Echocardiographic images were extracted from anonymized DICOM files, and unorganized videos with different views were grouped according to their views. The entire dataset containing 500 patients was divided into training (80%, n = 400), validation (10%, n = 50), and test set (10%, n = 50) for deep learning methods. The fivefold cross validation was employed to analyze the generalization performance of ML methods. The entire dataset containing 500 patients was divided into five groups. In training stages, four subsets were used for training and validation of the network. In test stages, the remainder was employed to evaluate the ML model.

Deep learning-based algorithms
To automatically segment cardiac structures in echocardiography, we employed three deep learning models based on U-net which were used for biomedical image segmentation and have demonstrated high performance on segmentation of organs [12]. The U-net consists of a fully convolutional encoder path, called backbone with a symmetric expanding decoder path for segmentation. We constructed three deep learning models on the basis of encoder-decoder architecture of U-net. In our experiments, we deployed the backbone of U-net architecture with residual and dense blocks, which shows robust performance in the image classification network with a U-net decoder path [13,14]. The flow chart of deep learning methods is shown in Fig. 2.

Training strategy
Given input images and corresponding annotated masks containing different categories with their views, data augmentation was performed with up-down, left-right flips, and rotation. For training deep learning models, a pixel-wise cross-entropy loss function was employed to minimize inaccuracy of prediction. An Adam optimizer with learning rate of 0.0001 was adopted to optimize network parameter. We trained our models from scratch without using any pretrained weight for initialization. We randomly shuffled training dataset and trained deep network for 200 epochs with a minibatch size of 5. All input cases were resized to 512 × 512 due to GPU memory limitation and then the image intensity values were normalized into the range of [0,1].

Performance analysis of deep-learning models
Dice similarity coefficient (DSC) and IOU, and clinical indices such as volume, mass, and EF were included to compare the performance of deep learning methods. The DSC and IOU are both quantifies the pixel-wise degree of similarity between the model-predicted segmentation mask and the ground truth, and ranges from 0 (no similarity) to 1 (identical). We employed segmentation results of the deep learning model to calculate clinical index to compare chamber structure quantification and EF based on standard guidelines. Left and right volume were calculated by the area-length method derived by the long axis and area of A2C and A4C [11]. LVM was calculated as the left ventricular myocardial volume derived by the delineation of its endocardial and epicardial borders and multiplied with the specific gravity of myocardial tissue (assuming a tissue density of 1.05 g/ml). These annotations, which were established and verified by board-certified cardiologists, were used as the 'ground truth' for the deep-learning model.

Statistical analysis
Continuous and normally distributed variables were represented as mean ± SD and median with interquartile range (IQR) for non-normally distributed variables. Comparison between ground truth and prediction results were assessed by the paired t test and Pearson correlation coefficient using two-sided p values. A p value of < 0.05 was considered significant. Bland-Altman plots with 95% confidence intervals for correlation were calculated. Inter-observer and intra-observer reproducibility were assessed by the intraclass correlation coefficients (ICC) for absolute agreement of single measures between two observers. All statistical analyses were performed using commercially available statistics software (MedCalc, version 18.9, MedCalc software Inc., Mariakerke, Belgium).

Performance comparison between deep-learning models
In Table 2, automated segmentation performances between deep neural networks (U-net vs. Res-U-net vs. Dense-U-net) were compared by using DSC and IOU metrics. Except LV wall area on PSAX PM level view (average DSC of 0.83, IOU 0.72), all deep learning methods showed an overall excellent performance (average DSC 0.91 to 0.95 and average IOU 0.83 to 0.90) ( Table 2 and Fig. 3). There were no significant differences observed between the three deep learning methods for all the parameters analyzed at each echocardiographic view (Table 2).
Additionally, we tested the impact of the number of training dataset for robustness of ML algorithm. There were linear association between the performance of automated ML analysis and the increment of dataset until 60% of the training dataset (n = 300; 1,271 frames) achieving already the maximal DSC and IOU of 0.93 and 0.87. After this point, the accuracy was sustained regardless of the number of training dataset (Fig. 6).

Discussion
In this study, we were able to show that the current prominent deep learning-based methods can be reliably utilized for automated quantification of cardiac chambers in 2-D echocardiography with no significant segmentation performance variabilities between these algorithms. Simple U-net, Res-U-net, and Dense-U-net are currently the most prominent deep learning algorithms for image segmentation and have demonstrated their robust performance on organ segmentation in biomedical imaging [12][13][14]. However, we observed that the segmentation performance was substantially influenced by image quality especially involving certain parameters. In addition, we also validated the reproducibility of our manually created four chamber ground-truth annotations which are the fundamental basis of advanced deep learning algorithms, which had not been specifically investigated in prior studies. To our knowledge, this is the first study comparing current prominent deep learning algorithms and validating reproducibility of the four-chamber ground-truth. Furthermore, this is the largest manually annotated datasets acquired in different vendors which was then utilized for the evaluation of automated analysis of echocardiography.
A major well known limitation of quantitative analysis in clinical practice is the significant inter-and intra-observer variability involving tracing of endocardial borders [2,3]. Even though semiautomated software was used, inherent low reproducibility impedes the reliable longitudinal quantitative assessment, particularly in the following situations: poor image quality scans, patients who have arrhythmias, complex abnormality in cardiac chambers such as congenital anomaly, multivalvular heart disease, presence of regional wall motion abnormalities in the LV due to myocardial infarction, and infiltrative myocardial diseases [15,16]. In addition, the relatively time-consuming quantification process prohibits quantification of multiple frames or dynamic frame-by-frame analysis in order to average clinical measurements especially for patients with irregular heartbeats [17].
However, recently introduced fully automated deep learning-based cardiac chamber quantification techniques have no operator variability, providing almost real-time quantitative processing (0.12 s per frame with CPU), as well as enabling dynamic cardiac chamber quantification [15,17]  with disease detection models, which could be a profound pipeline work for the realization of AI-based one button whole echocardiography quantitative analysis in the future [15]. However, in their study, segmentation and quantitative analysis performances were not tested in poor or modest image quality in various disease models which are frequently encountered in clinical practice. In addition, a relatively small number of ground-truth annotations were used at each view for training their deep learning algorithm from 124 to 214, which might not be sufficient for the development of robust segmentation algorithm. In addition, no validation of reproducibility for ground-truth was presented [15]. In our study, on the other hand, 500 ground-truth annotations were used in deep learning training and analyzed algorithm performances according to image quality. We found that in those with image quality, relatively low segmentation performance were observed for LV cavity, LA cavity, LV myocardium area, which is consistent with previously reported studies [15,18,19]. For the reliability of training data sets, we reported the interobserver and intra observer variability analysis from 100 patients (20% of total population), demonstrating that inter-observer variability was higher than intra-observer variability showing still some degree of variability exists between experienced experts. Another interesting finding  5 Representative echocardiography image from good to bad image quality was that high variance was shown in EF due to the accumulated error from EDLV and ESLV which has constantly shown in prior studies [15,19]. Narang et al. introduced a ML based quantification of 3D echocardiography and compared it to cardiac MRI [17]. Although small number of patients (n = 20) were studied, their accurate and instant frame-by-frame LV and LA volume-time curve generation based on 3D segmentation has potential application for searching novel dynamic parameters predicting future outcome [17].
The most significant barrier of deep learning method for echocardiography is the lack of large datasets. Recently, Arafati et al. [18] suggested additional adversarial training to fully convolutional networks to overcome data dependency of network and combined post processing procedure to improve the segmentation performance, but they also reported the variance of segmentation performance depending on age and gender in LV and LA segmentation, respectively. Thus, there are still several unsolved issues that need to be investigated for the full implementation of deep learning-based fully automated quantification into clinical practice. First, further advancement of deep learning based-segmentation algorithms should be made not only for relatively normal heart model but also for various disease models. These algorithms will also need to prove their robustness in studies with modest to poor image qualities (Fig. 7). Furthermore, similar to other imaging modalities, inherent operator variation of ground-truth itself is inevitable and the variability will naturally increase depending on image quality and the complexity of heart disease. Therefore, it is essential to diminish this variability and enhance the accuracy of segmentation algorithm through building up large number of accurately annotated ground-truth by experienced echocardiographic experts, despite is being time-consuming and labor-intensive.  In our study, we represented the relationship between the number of training datasets and segmentation performance showing that maximal performance level was already reached when 60% of the dataset was being utilized. Still, in order to establish that amount of datasets, it took 5 experienced experts working full time for three months. Moreover, to our knowledge, there has been no study up to now that has validated deep learning methods on the patients with different types of disease. Deep learning models can be improved by diverse sampled data without the class imbalance of disease types, which requires large databases to improve their robust accuracy [18,20].
The recently emerged deep active learning (AL) method could be the solution for this unmet need for the clinical annotation process [21][22][23]. AL is a powerful method that enables data-efficient model. It reduces the laborious and expensive annotation tasks by carefully choosing data points for which ground-truth need to be labeled [24]. Specifically, AL selects the set of the most informative data points from a pool of unlabeled dataset, so it enables us to minimize the amount of data points that should be labeled and achieve comparable performance to a model trained with fully annotated dataset [25,26].
Echocardiography has been established as an important primary screening test since the incidence of cardiovascular disease has been on the rise [27]. The non-invasive bedside approach enables clinicians to easily monitor the disease progression as well as the response to treatment [27,28]. Furthermore, by using the serial assessment of quantitative parameters including global or longitudinal LV strain, it is possible to diagnose the disease at its preclinical stage and inhibit its progression [29][30][31]. The development of technology has allowed for the minimalization of echocardiographic systems such as portable handheld transthoracic echocardiography which can further broaden the utility of echocardiography in emergency settings by novice or nonexpert paramedics [32][33][34]. However, the operator dependency of echocardiography particularly in achieving quantitative parameters serves as a significant limitation. Recently emerged deep learning-based AI technologies may allow us to level the playing field between rural and urban areas through AI aid or guided fully automated image acquisition and quantification. These novel methods can also be widely adopted in busy echocardiographic laboratories as well as be deployed in emergency departments for more efficient ways to triage patients.

Study limitations
Our study has several limitations. These data should be interpreted cautiously because of its retrospective single-center study design; hence, further prospective multi-center studies are warranted. In addition, various types of disease models were not included in our analysis. However, for the first time, we evaluated the performance of fully automated deep learning-based methods according to image quality as well as the required number of training datasets to achieve the maximal performance. Therefore, our study results provided important considerations for future studies for developing specific disease model algorithms such as hypertrophic cardiomyopathy, and heart failure with reduced EF or having regional wall motion abnormality. Finally, we only included four chambers on limited echocardiographic views in our analysis. However, to establish a fully automated quantification, beyond the four cardiac chambers, aorta, valves, and pericardial structures should be automatically included into quantification. Also, instant multiple frame-by-frame segmentation method is needed to obtain average quantitative parameters in patients with irregular heartbeats.

Conclusion
We demonstrated that three current prominent deep learningbased fully automated methods are all reliable to perform four-chamber segmentation and quantification of clinical indices without any superiority. However, the performances of these methods were significantly dependent on image quality as well as the number of well annotated training datasets. Additionally, this is the first study to validate the reproducibility of the four cardiac chamber ground-truth annotation itself showing overall excellent reproducibility, but some degree of inter-observer variability still being noted. This study emphasizes that further technical advancement of the fully automated deep learning methods is needed to maintain the clinical performance even in low quality image and specific disease models, not only that, well-designed clinical validation studies are desperately warranted for these technologies to be applied into clinical practice.

Conflict of interest No potential conflict of interest was reported by the authors.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http:// creat iveco mmons. org/ licen ses/ by/4. 0/.