Introduction/Background

Kawasaki disease (KD) is an acute vasculitis associated with coronary artery abnormalities (CAA) and the risk of myocardial ischemia and thrombosis in affected children. In the absence of a diagnostic test, the 2017 American Heart Association (AHA) guidelines for diagnosis, risk assessment, and management of KD rely on categories of CAA severity defined by echocardiographic z-scores during 3 illness phases: acute, subacute and convalescent [1]. Recently, studies of multisystem inflammatory syndrome in children (MIS-C) associated with COVID-19 infection have reported symptoms and coronary artery dilation and aneurysms (defined by z-scores) that mimic KD. Although the association between the two diseases remains unknown and somewhat controversial, it appears that the guidelines for KD are being incorporated into the evaluation and management of MIS-C [2, 3].

The 2017 AHA guidelines reference 7 z-score models for coronary artery dimensions and note that z-scores for an individual patient can vary between models [1]. In the USA, many echocardiography laboratories use Boston Children’s Hospital’s z-score model, derived from coronary artery measurements obtained from healthy children at a single center [4]. Because published nomograms of pediatric echocardiographic z-scores were limited by insufficient sample sizes to account for growth and maturation, sex and race, the Pediatric Heart Network (PHN) published a multicenter z-score model adjusted for body surface area (BSA), age, sex, race and ethnicity in 2017 [5]. As we began developing protocols for MIS-C, we sought to explore the impact of changing our practice from our usual standard of care (Boston z-scores) to the newer PHN model for evaluating coronary artery dimensions. We used coronary artery measurements of patients treated for KD, recognizing that the impact would likely be similar for patients treated for MIS-C. We hypothesized that changing the model for determining coronary artery z-scores in children with KD would alter diagnosis and management and that age, sex and race would affect agreement between models.

Methods

In this single-center retrospective study, we searched the hospital database for all patients < 18 years old who received intravenous immunoglobulin (IVIG) for KD treatment between September 2007 (when Boston z-scores were adopted as the local standard of care) and January 2020. We excluded echocardiograms performed at outside centers and patients treated at an outside institution. Since our objective was to assess the impact of changes in z-scores on real-world treatment decisions, we excluded echocardiograms with no z-scores recorded in the electronic medical record (EMR) and did not review the images or remeasure the coronary artery dimensions. The Institutional Review Boards of the University of Utah and Primary Children’s Hospital approved this study under a waiver of consent.

Demographic and Clinical Data

We collected demographic data from the EMR, including age at diagnosis, sex, and self-reported race/ethnicity. We divided patients into age categories similar to those reported for the PHN Normal Echocardiogram Database but combined the older age groups (since most children with KD are < 5 years old [1]) [5], using age categories of < 1 month, 1 month to 3 years, 3–6 years, and 6–18 years in our analysis. We used the race categories defined in the PHN study: White, Black and Other (including Asian, Pacific Islander, Native American, Middle Eastern, and multiracial, and Hispanic as race or as ethnicity if there was no other racial designation) [5]. We coded patients without race/ethnicity data as “unknown”.

To categorize KD as complete or incomplete, we used AHA standard definitions [1]. Children who did not meet clinical, laboratory, or echocardiographic criteria for either complete or incomplete KD, but had prolonged fever and were treated with IVIG for possible KD were categorized as “other”. We defined antithrombotic strategies by class I and class IIa AHA recommendations based on coronary artery z-scores: < 10.0, low dose aspirin (ASA) alone and ≥ 10, low dose ASA and anticoagulation (warfarin or low molecular weight heparin). In the convalescent phase, continued low-dose ASA would be recommended for z-scores ≥ 2.0 and ASA could be discontinued for z-scores < 2.0 [1]. The authors (DT and DR) reviewed changes in each coronary artery z-score classification within each phase and determined whether z-score changes between the Boston and PHN models altered recommended antithrombotic strategies.

Echocardiographic Data

Echocardiographic reports were reviewed for BSA; left anterior descending (LAD), right coronary artery (RCA), and left main coronary artery (LMCA) dimensions; and Boston z-scores within 3 illness phases, timed from diagnosis: acute, within 6 days; subacute, 1 week to 1 month; and convalescent, 1–3 months. No echocardiograms beyond the third month after diagnosis were included in the analysis. The clinically reported z-scores were obtained using the Boston regression equations [6]. We calculated PHN z-scores from the absolute coronary artery diameters and BSA using the published PHN regression equations [5].
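As a rough illustration of how a z-score can be obtained from a measured diameter and BSA, the sketch below assumes an allometric form in which the diameter is divided by BSA raised to a power and then standardized against a reference mean and standard deviation. The function name and the values of ALPHA, REF_MEAN, and REF_SD are hypothetical placeholders, not the published Boston or PHN coefficients; the actual equations are those given in [5, 6].

```python
def z_score(diameter_mm, bsa_m2, alpha, ref_mean, ref_sd):
    """Standardize a BSA-indexed coronary diameter (illustrative form only)."""
    if bsa_m2 <= 0 or ref_sd <= 0:
        raise ValueError("BSA and reference SD must be positive")
    indexed = diameter_mm / (bsa_m2 ** alpha)  # diameter indexed to body size
    return (indexed - ref_mean) / ref_sd


# Hypothetical placeholder parameters, NOT published coefficients.
ALPHA, REF_MEAN, REF_SD = 0.5, 2.0, 0.25

z = z_score(diameter_mm=2.5, bsa_m2=1.0, alpha=ALPHA, ref_mean=REF_MEAN, ref_sd=REF_SD)
print(f"illustrative z-score: {z:.2f}")
```

Because each model carries its own reference values, the same diameter and BSA can yield different z-scores under different models.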

The echocardiogram with the most coronary artery data for each illness phase was selected for analysis. If several echocardiograms had complete data, the first study during the illness phase was analyzed. CAA were defined using AHA z-score classifications: no involvement (< 2); dilation only (≥ 2 to < 2.5); small aneurysm (≥ 2.5 to < 5); medium aneurysm (≥ 5 to < 10 and absolute dimension < 8 mm); giant aneurysm (≥ 10 or an absolute dimension ≥ 8 mm). For patients not meeting complete KD criteria, a z-score ≥ 2.5 for the LAD or RCA was used as an independent criterion for diagnosing incomplete KD, consistent with AHA guidelines [1].
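The classification thresholds above can be written as a small lookup. This is a minimal sketch of the stated cut-offs only; the function name and the example z-scores are hypothetical and are not taken from the study data.

```python
def classify_caa(z, diameter_mm=None):
    """Return the AHA category for one coronary artery measurement."""
    # Giant aneurysm: z-score of 10 or more, or absolute dimension of 8 mm or more.
    if z >= 10 or (diameter_mm is not None and diameter_mm >= 8):
        return "giant aneurysm"
    if z >= 5:
        return "medium aneurysm"
    if z >= 2.5:
        return "small aneurysm"
    if z >= 2:
        return "dilation only"
    return "no involvement"


# Hypothetical example: one artery scored by two different models.
print(classify_caa(1.6))
print(classify_caa(2.9))
```

A difference of about 1 z-score unit between models is enough to move a measurement across two category boundaries, which is why the choice of model can matter for classification.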

Statistical Analysis

Demographic and clinical data were summarized using counts and percentages for categorical variables and median and interquartile range (IQR) for continuous variables due to distribution skew. Changes in CAA classifications were enumerated at each illness phase. Proportions of change in recommended antithrombotic therapy per change in CAA classification at each phase were calculated.

We assessed the degree of agreement between Boston and PHN z-scores for each coronary artery within each of the 3 illness phases as a continuous variable using Lin’s Concordance Correlation Coefficient (CCC) and 95% confidence intervals (CI) [7, 8]. The interpretation of CCC values followed McBride’s recommendations, in which the lower limit of a CCC 95% CI < 0.90 indicates poor agreement; 0.90 to < 0.95, moderate; 0.95 to 0.99, substantial; and > 0.99 indicates nearly perfect agreement [9]. We performed similar assessments with dichotomized z-score categories of < 2.5 and ≥ 2.5 to distinguish CAA from measurements with lower z-scores using unweighted Cohen’s Kappa (κ) coefficients with 95% CI. The interpretation of κ values followed recommendations from Altman et al., in which the lower limit of a κ 95% CI < 0 indicates no agreement; 0.00–0.20, slight; 0.21–0.40, fair; 0.41–0.60, moderate; 0.61–0.80, substantial; 0.81–0.99, almost perfect; and 1.0 indicates perfect agreement [10]. We further compared the agreement between Boston and PHN z-scores for LAD, RCA, and LMCA dimensions aggregated across all illness phases and obtained estimates of CCC for repeated measures using R package “cccrm” [11]. We constructed Bland–Altman plots to further evaluate the agreement between PHN and Boston z-scores and determine whether the variability of the discrepancy between the 2 models increased with higher z-scores [12].
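For readers less familiar with these agreement measures, the following sketch computes point estimates of Lin's CCC and unweighted Cohen's kappa for paired z-scores. It is written in Python for illustration rather than the R code used in the analysis, it omits the confidence intervals and the repeated-measures extension provided by "cccrm", and the function names and data are invented.

```python
from statistics import fmean


def lins_ccc(x, y):
    """Lin's concordance correlation coefficient for paired values."""
    n = len(x)
    mx, my = fmean(x), fmean(y)
    # Population (1/n) moments, as in Lin's original definition.
    sxx = sum((a - mx) ** 2 for a in x) / n
    syy = sum((b - my) ** 2 for b in y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    return 2 * sxy / (sxx + syy + (mx - my) ** 2)


def cohens_kappa(x, y, cutoff=2.5):
    """Unweighted Cohen's kappa after dichotomizing both series at the cutoff."""
    n = len(x)
    ax = [v >= cutoff for v in x]
    ay = [v >= cutoff for v in y]
    p_obs = sum(a == b for a, b in zip(ax, ay)) / n
    px, py = sum(ax) / n, sum(ay) / n
    p_exp = px * py + (1 - px) * (1 - py)  # agreement expected by chance
    return (p_obs - p_exp) / (1 - p_exp)


# Invented z-scores: model B equals model A plus a constant offset of 1.3.
model_a = [1.0, 1.5, 2.0, 3.0, 4.0]
model_b = [v + 1.3 for v in model_a]
print(round(lins_ccc(model_a, model_b), 3))
print(round(cohens_kappa(model_a, model_b), 3))
```

In this invented example the two series are perfectly correlated, yet the CCC is well below 1 because the squared mean difference enters the denominator. This is the property that makes the CCC sensitive to a systematic offset between models.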

The agreement of z-score models between categories of age, sex, and race was compared using CCC for repeated measures as described above, but with an 83.7% CI. The reason for using an 83.7% CI instead of a 95% CI to assess agreement differences is that there is sampling variability associated with both estimates, and thus 95% CIs can marginally overlap even if the two estimates are significantly different (p ≤ 0.05) [13,14,15]. Thus, a narrower 83.7% CI is needed to equate overlap of CIs with a lack of statistical significance (p > 0.05). Using CIs as opposed to p-values encourages a descriptive interpretation of these comparisons, as determination of an optimal procedure to adjust for multiple comparisons in this case is not straightforward. Differences in levels of agreement between demographic subgroups (e.g., male vs female) were statistically non-significant (p-value > 0.05) if their corresponding 83.7% CI overlapped. Forest plots were constructed to visually depict these data. Statistical analyses were implemented using R v. 3.6.0 [11].
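The reasoning behind a narrower interval can be sketched numerically. The example below assumes two independent, normally distributed estimates with equal standard errors; these simplifying assumptions are ours, and the exact level used in this study follows the cited references [13,14,15], which may rest on different assumptions.

```python
from math import sqrt
from statistics import NormalDist

std_normal = NormalDist()

# Two-sided critical value for a test of the difference at alpha = 0.05.
z_diff = std_normal.inv_cdf(1 - 0.05 / 2)

# With equal standard errors (se), the difference has SE = sqrt(2) * se,
# so two intervals of half-width z_ci * se just touch when
# 2 * z_ci * se = z_diff * sqrt(2) * se.
z_ci = z_diff / sqrt(2)

# Coverage of each individual interval built with that multiplier.
level = 2 * std_normal.cdf(z_ci) - 1
print(f"multiplier {z_ci:.3f}, interval level {level:.1%}")
```

Under these assumptions the level falls between 83% and 84%, close to the 83.7% applied in this analysis.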

Results

Study Population

Of the 357 children who met all inclusion and no exclusion criteria (Table 1), 63.9% were male, 76.8% were White and 56.6% were 1 month to 3 years of age at the time of diagnosis. No child was under 1 month of age. Criteria for complete KD were met in 66.9% of children. A total of 904 echocardiograms had sufficient data for analysis: 345 (96.6%) of children had qualifying studies during the acute phase, 286 (80.1%) during the subacute phase, and 273 (76.5%) during the convalescent phase. Acceptable echocardiograms were available for analysis in all phases for 216 (60.5%) children. Among the 141 children without acceptable echocardiograms in at least one illness phase, 8 had studies from outside centers, 16 had studies with no Boston z-scores recorded, and the remainder had no follow-up at our center at an illness phase during the first 3 months after diagnosis.

Table 1 Summary of demographic data of patients treated for Kawasaki disease

Comparison of Coronary Artery Z-Scores

All Boston LAD z-scores were lower than the corresponding PHN z-scores with a median difference of 1.3 (IQR 1.3, 1.3) across all illness phases (Table 2). Comparison of aggregated data for the LAD across all illness phases showed only moderate agreement between models. In contrast, Boston and PHN z-scores for the RCA and LMCA had substantial agreement at all illness phases. Most Boston RCA z-scores and nearly all LMCA z-scores were within 0.5 of the PHN z-scores (86.5% and 98.8%, respectively).

Table 2 Z-score distribution and percentage of echocardiograms with Z-scores ≥ 2.5 in Boston and PHN models

With conversion from Boston to PHN z-scores, the percentage of LAD z-scores ≥ 2.5 more than doubled in all illness phases (Table 2). In contrast, the percentage of z-scores ≥ 2.5 decreased in all phases when converting from Boston to PHN z-scores for the RCA and LMCA but the difference was ≤ 5.2% in each phase (Table 2). When assessing all 3 coronary artery dimensions across all phases, the number of individual z-scores ≥ 2.5 increased by 39.8% (294 using Boston vs 411 using PHN z-scores).

The consistency of the difference between the Boston and PHN LAD z-scores was only slightly impacted by z-score magnitude when the average of paired z-scores exceeded 5 (Fig. 1). Since Bland–Altman plots for each illness phase were consistent for each coronary artery, only the acute phase plots are shown.

Fig. 1

Bland–Altman plots for each coronary dimension in the acute phase of illness. X-axis represents the subject-specific average of the two z-scores ([PHN z-score + Boston z-score]/2). Y-axis represents the subject-specific difference of the two z-scores (PHN z-score − Boston z-score). The dashed line intersecting zero on the y-axis indicates the reference value of no discrepancy/bias. The solid horizontal line indicates the mean of the differences. The farther the solid line (average difference) departs from zero, the larger the overall discrepancy between the two z-score systems. Stippled (“dotdash”) lines bound the 95% CI of mean differences
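The quantities plotted in a Bland–Altman figure can be sketched as follows. The example uses the conventional limits of agreement (mean difference ± 1.96 standard deviations of the differences), which is an assumption about how the bounding lines were drawn; the function name and the z-scores are invented.

```python
from statistics import fmean, stdev


def bland_altman(phn, boston):
    """Return per-subject averages, differences, bias, and limits of agreement."""
    averages = [(p + b) / 2 for p, b in zip(phn, boston)]  # x-axis values
    differences = [p - b for p, b in zip(phn, boston)]     # y-axis values
    bias = fmean(differences)                              # solid horizontal line
    spread = 1.96 * stdev(differences)                     # uses the sample SD
    return averages, differences, bias, (bias - spread, bias + spread)


# Invented paired z-scores for three subjects.
phn = [2.3, 3.0, 4.1]
boston = [1.0, 1.8, 2.7]
averages, differences, bias, limits = bland_altman(phn, boston)
print(round(bias, 2), [round(v, 2) for v in limits])
```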

Over half (59.7%, 213/357) of the study population had at least one change in coronary artery z-score classification when converting between models. When changes in LAD z-score classification were excluded, this percentage decreased to 21.2% (76/357). The LAD accounted for 72.0% (265/368) of all individual z-score classification changes, with 16.1% (144/896) of all LAD measurements changing by 1 classification level, and 13.5% (121/896) changing by 2 levels. Changes most frequently occurred from the no involvement/dilation only level to the small aneurysm level (Table 3). For the RCA, 8.3% (75/899) changed by 1 level and 0.6% (5/899) changed by 2 levels. All 5 children who had giant RCA aneurysms by the Boston model were reclassified as having medium aneurysms by the PHN model. For the LMCA, only 3.5% (31/897) changed by 1 level of classification, and 4 (0.4%) changed by 2 levels.

Table 3 Summary of changes in Z-score classification from Boston to PHN models for all echocardiograms (N)

Of the 13 children originally classified as “other” KD using the Boston model, 7 children (54%) were reclassified as meeting criteria for incomplete KD with the PHN model, given an increase in their LAD z-score to ≥ 2.5.

Impact of Differences in Z-Score Models on Antithrombotic Recommendations

Of the 213 children with a change in AHA coronary artery z-score classification, recommendations for antithrombotic management changed in at least one illness phase for 22.5% (48/213). Antithrombotic therapy would change for 1.1% (2/175) of all individual z-score reclassifications during the acute phase, 5.4% (6/112) in the subacute phase, and 76.3% (71/93) in the convalescent phase (Table 4). Most of the changes in the convalescent phase (55/71, 77.5%) were related to the recommendation to continue low dose ASA beyond that phase, largely driven by increases in LAD z-scores (54/55, 98%).

Table 4 Summary of changes in recommended antithrombotic strategy due to change in AHA Z-score classification with conversion from Boston to PHN models

Effects of Age, Sex and Race on the Level of Agreement Between Models

There was substantial agreement between Boston and PHN LAD scores in the 3–6 years age category, but only moderate agreement in the 1 month to 3 years category and poor agreement in the 6–18 years category. For the RCA, substantial agreement was found in the 1 month to 3 years category, compared to moderate agreement in the 6–18 years category. All age groups had substantial agreement for the LMCA (Fig. 2a). Males demonstrated substantial agreement between Boston and PHN LAD z-scores while females had poor agreement. There was no difference in agreement based on sex for the RCA or LMCA z-scores (Fig. 2b). White children showed moderate agreement in LAD z-scores between models, while children of non-White races had poor agreement. There was no significant difference in RCA or LMCA z-score agreement by race (Fig. 2c).

Fig. 2

Forest plots depict agreement between the Boston and PHN z-score models compared by: a age, b sex, and c race. Symbols in the keys represent demographic subgroups, with the location of each symbol along the 83.7% CI denoting CCC point estimates. Differences in levels of agreement between demographic subgroups (e.g., male vs female) were statistically non-significant (p-value > 0.05) if their corresponding 83.7% CI overlapped [13,14,15]. Asterisks represent the lower limit of the 95% CI, which corresponds to the level of agreement between the Boston and PHN models: < 0.90 indicates poor agreement; 0.90 to < 0.95, moderate; 0.95 to 0.99, substantial; and > 0.99 indicates nearly perfect agreement [9]

The proportions of changes in AHA z-score classification after conversion from Boston to PHN z-scores varied significantly by age and sex, but not by race. Of 102 children aged 3–6 years, 41.2% (42) had a change in at least one coronary artery z-score categorization in at least one phase of illness, compared to 67.3% (136/202) of those 1 month to 3 years of age, and 66% (35/53) of those 6–18 years of age (p < 0.001). Of 129 female patients, 48.8% (63) had a change in z-score categorization compared to 65.8% (150/228) of males (p = 0.002).
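The comparison by sex can be checked from the counts given above. The test behind the reported p-values is not stated in the text, so the sketch below assumes a Pearson chi-square test without continuity correction; the function name is ours.

```python
from math import erfc, sqrt


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    p_value = erfc(sqrt(chi2 / 2))  # upper tail of chi-square with 1 degree of freedom
    return chi2, p_value


# Counts from the text: females, 63 changed of 129; males, 150 changed of 228.
chi2, p_value = chi_square_2x2(63, 129 - 63, 150, 228 - 150)
print(f"chi-square = {chi2:.2f}, p = {p_value:.4f}")
```

Under this assumed test the p-value rounds to the reported 0.002.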

Discussion

Although previous studies have explored differences between z-score models [16,17,18,19], we investigated the actual impact of these differences on KD diagnosis and management. Our findings are particularly relevant now, given the emergence of the KD-like syndrome known as MIS-C that has been newly described during the COVID-19 pandemic [2, 3]. Since KD and MIS-C share similar features and clinicians are referring to KD guidelines in the management of MIS-C, the methods of determining coronary artery size normalized to body size in growing children have drawn increasing attention. We identified important effects on diagnosis of CAA and recommended antithrombotic therapy when converting from the commonly used Boston z-score model to the newly published PHN model. Conversion from Boston to PHN z-scores yielded at least one change in coronary artery z-score classification for nearly 60% of children and changed recommended antithrombotic strategy in 13.4%. PHN LAD z-scores were always higher than Boston z-scores, showing the weakest correlations between the models (compared to the RCA and LMCA) and accounting for 72.0% of all changes in CAA classification. It is important to note that nearly 14% of all LAD measurements increased by 2 levels of z-score classification. Overall, we found a nearly 40% increase in CAA with z-score ≥ 2.5. These findings illustrate the practical implications of using different models in the evaluation of a disease for which the standard of care is based on z-scores.

Despite some methodological differences, our results are supported by prior reports. When the PHN investigators compared Boston and PHN z-scores of commonly measured cardiac structures (abstract), one of the poorest correlations between the two models was for LAD z-scores, with a median difference of 1.33 (PHN-Boston) [16]. Similarly, Lorenzoni, et al. (abstract) compared Boston and PHN coronary artery z-scores for local KD patients and those enrolled in the PHN Randomized Trial of Pulse Steroid Therapy for Primary Treatment of KD and reported a 50% increase in abnormalities as well as a change in the AHA recommended antithrombotic strategy in 21.5% of patients using PHN z-scores, defining CAA as a z-score of 2.0 or higher [17]. Similar to our study, the higher frequency of CAA using the PHN model was driven by higher LAD z-scores, and AHA antithrombotic treatment category changed for a relatively large number of patients. However, we also found that over 50% of patients treated for KD who did not meet criteria for complete or incomplete KD using Boston z-scores had PHN LAD z-scores ≥ 2.5, satisfying the AHA diagnostic criteria for diagnosing incomplete KD. Unfortunately, we have no data regarding the coronary artery z-scores of children with similar clinical findings who were not treated for KD.

Disagreements between the Boston model and other z-score models have also been reported [5, 18]. In contrast to our findings regarding the PHN model, Ronai et al. found Boston coronary artery LAD z-scores tended to be higher when compared to Washington D.C. and Quebec models. These investigators plotted coronary artery dimensions corresponding with a z-score of 10.0 in each model vs. BSA and found the dimensions of both the LAD and RCA for the Boston model were significantly lower than the Washington D.C. model and similar to the Quebec model. Agreement between models was poorest for measurements with the highest LAD or RCA Boston z-scores (range 7–14): 50% had Boston z-scores ≥ 10 while none of the corresponding D.C. and only 30% of Quebec z-scores were ≥ 10 for the same dimensions [19]. In our study, for larger coronary artery dimensions (when the average of paired z-scores exceeded 1), the Boston RCA z-scores were indeed higher than the PHN, and all 5 patients with RCA giant aneurysms using Boston z-scores had only medium aneurysms using PHN z-scores (despite overall substantial agreement between the Boston and PHN RCA z-scores). Conversely, our study showed that Boston LAD z-scores were consistently lower than the PHN, with increasing variation in the magnitude of that difference when the average of paired z-scores exceeded 5.

The PHN Normal Echocardiogram database study found that age, sex, and race contributed little to the variation in the z-score models for coronary artery dimensions [5]. As the first study to examine the effects of age, sex, and race on agreement between pediatric coronary artery z-score models in a real-world setting, we found that these factors had an impact on the level of agreement between Boston and PHN z-score models, with higher proportions of males and children in the extreme age groups (1 month to 3 years and 6–18 years) having at least one change in classification. While children of non-White races had poorer agreement between LAD z-scores than White children, the proportions of changes in AHA z-score classification did not significantly differ by race. These differences may be due to under-representation of certain groups in single-center z-score models, such as non-White race groups, leading to lower levels of agreement when compared to a system derived from a more evenly diverse population. While the Boston model provided no breakdown of age, sex, or race [4], the PHN model was specifically designed for equal representation of various races (White, Black, and Other), sexes, and ages [5]. Therefore, the demographics of each z-score model population may need to be more carefully examined to determine the generalizability to the local population served by each center.

The reason for these differences in z-scores between the Boston and PHN models is likely multifactorial, including differences in sample size, definitions of normal or healthy children, proportions of ages, sexes, or races of patients comprising the study cohorts, methods of normalizing raw dimensions, and number of observers making measurements, as well as changes in technology and image resolution. For example, Boston z-score regression equations were derived from measurements by multiple observers at a single center obtained between 1987 and 2000 from 221 healthy children who were predominantly White [4]. In communication with the authors, it was confirmed that new patients have not been added to the Boston database for coronary artery dimensions. PHN coronary z-scores were derived from measurements by 2 core lab observers of 2507 healthy children of nearly equal racial distribution from 19 North American centers collected between 2013 and 2015 [5]. The poor correlation between models for LAD z-scores, in particular, may be attributable to greater difficulty in imaging the LAD in comparison to the more proximal LMCA and RCA. Furthermore, spatial resolution for available echocardiographic technology may lead to greater variability and random error in measurement of small structures such as the LAD. Both of these issues could have contributed to differences in the measurements used to derive regression equations for the LAD between the two models. Although we suggest potential reasons for differences among z-score models, it is important to note that our study was not designed to determine the best z-score model and we have no outcome data to evaluate the newly published PHN model. Nevertheless, we did highlight some of the pitfalls in applying two different z-score models derived from healthy children to children with a specific disease. While use of multiple z-score systems is common practice in pediatric cardiology, it is important to recognize that models are not interchangeable and the z-score model for clinical decision-making should be the same model from which outcome data driving the guidelines and recommendations are derived.

Limitations

This retrospective study had several limitations. First, measurements of coronary arteries were obtained from the EMR without reviewing the echocardiograms. Thus, reported measurements could have been influenced by a subjective impression of coronary artery size at the time of image acquisition and may have been adjusted to fit z-scores to the threshold supporting a diagnosis of KD. Regardless, the aim of this study was to understand the impact of changes to z-scores available at the time of clinical decision-making and not to validate measurements. Second, children for whom KD was considered but not ultimately treated were not captured in the database, so the effect of a change in z-score leading to a change in diagnosis of these patients could not be fully explored. Because only one echocardiogram for each phase of illness was analyzed, the largest coronary artery measurements for a given child may not have been captured; thus, the impact of change in z-scores over time (e.g., regression from large to medium coronary artery aneurysms between the acute and convalescent phases) on antithrombotic strategy could not be evaluated. Third, although the impact of age, sex, and race was statistically significant, it is not clear that these differences are clinically significant. Based on published thresholds for larger arteries, the PHN investigators attributed measurement differences of up to 5% to measurement variability and thus, defined clinical significance as a difference of 5% or more between actual and predicted measurements [5]. Although this difference may be useful for larger arteries, there are no similar data for the coronary arteries, for which the variability is likely higher given their far smaller dimensions. Finally, we can make no recommendation regarding the best z-score model since, unlike other models [6, 20, 21], the association of z-scores and KD outcomes has not yet been reported for the PHN model.

Conclusions

Conversion from Boston to PHN z-scores resulted in differences in CAA classification, recommended antithrombotic strategies, and even in KD diagnosis. Differences were most frequently seen for LAD z-scores. These findings have implications for patients with KD and the KD-like disease, MIS-C, associated with COVID-19. Selection of a z-score system for risk stratification of CAA should include consideration of similarity of the model population to the local population being evaluated. The model used for making the outcome-based recommendations contained in the AHA guidelines should be the model used in clinical practice. Consistent application of a z-score model for longitudinal evaluation of CAA in each patient is critical. As much is still unknown about MIS-C, investigators leading research protocols may benefit from making an a priori determination of which z-score model(s) should be used to define inclusion criteria and determine the association between short and long-term outcomes. As more outcome data are collected using the PHN z-scores, we may be better able to understand the differences between models.