Background

The outcome of patients with acute myeloid leukemia (AML) has improved substantially over the past decades, thanks to the development of more aggressive therapies and better supportive care. However, a substantial proportion of patients still do not achieve complete remission (CR), and others eventually relapse after achieving CR [1–3]. In an attempt to stratify subgroups with different survival rates, several prognostic factors have been identified, including age, gender, baseline white blood cell count, serum lactate dehydrogenase level, immunophenotype, karyotypic abnormalities and genetic profiles [4–7].

In addition to baseline variables, early assessment of response to chemotherapy may help to define prognosis. Previous studies have shown an association between the lack of early blast clearance and failure to obtain CR after a first cycle of induction [8, 9]. This early assessment of treatment response is usually performed between the 14th (D14) and 17th day of the first cycle of induction chemotherapy, by analyzing the cellular content of the bone marrow aspirate (BMA) and/or biopsy (BMB). A hypocellular bone marrow sample suggests adequate anti-leukemic activity [8, 10]. However, its interpretation may be inaccurate because of different levels of expertise among pathologists and hematologists, and great variability in BMA and BMB sample quality [11]. Furthermore, the BMA blast count above which a poor response to chemotherapy is predicted has not been clearly defined, with values ranging from 5 % to 40 % [8–19]. By contrast, the BMB provides a better assessment of marrow cellularity [20], but the results are available only a few days after the BMA, delaying the decision to administer a second course of induction chemotherapy for non-responders.

Given these uncertainties, we sought to evaluate the quantitative and qualitative assessment of D14 BMA, to measure the inter-observer agreement, and to compare the results of BMA and BMB. We also assessed the impact of D14 blast clearance on overall survival (OS).

Methods

Study population and treatment

All patients diagnosed with AML at University Hospital Clementino Fraga Filho, Universidade Federal do Rio de Janeiro (UFRJ), Brazil, from January 1979 to December 2008 were retrospectively evaluated. Entry criteria for this study included: a diagnosis of AML other than acute promyelocytic leukemia, no previous treatment at another institution, receipt of standard induction chemotherapy (cytarabine + anthracycline), and performance of BMA on D14 of induction chemotherapy. The study was approved by the local ethics committee (Hospital Clementino Fraga Filho/Universidade Federal do Rio de Janeiro, CAAE n°. 0094.0.197.000-09) and was conducted in accordance with the principles of the Declaration of Helsinki. Informed consent was not obtained because of the retrospective nature of the study, which did not affect the healthcare of the included individuals. Moreover, confidentiality was preserved.

The diagnosis of AML was based on the procedures available at the time, including BMA and BMB, and cytogenetic and immunophenotype analyses. Cases were classified according to the French-American-British (FAB) criteria [21]. The treatment regimens changed over time (Table 1) [22].

Table 1 Different approaches to the treatment of acute myeloid leukemia between time periods

Bone marrow aspirate and biopsy

Routine assessments of BMA and BMB were performed on D14 of remission induction. Aspirate smears were prepared at the bedside and stained with Wright-Giemsa stain, and biopsy samples were fixed in 10 % buffered formalin and stained with hematoxylin and eosin. Patients with persistent disease according to D14 assessment received a second cycle of induction as early as possible [2, 13]. All glass slides were kept in storage units in the hospital archives.

We reviewed all available slides from BMA performed at diagnosis and on D14. The analysis was performed by two independent observers (board-certified hematologists), blinded to patient identification and outcome. The evaluation included confirmation of the initial diagnosis of AML and identification of D14 residual leukemia in a quantitative (percentage) and qualitative (scale) manner. Quantitative evaluation was performed by counting the percentage of blasts in 200 nucleated marrow cells. The qualitative assessment was recorded on a five-category Likert scale [23]: definitely infiltrated, probably infiltrated, doubtful, probably free and definitely free.

The results of D14 BMB were obtained by reviewing patients’ medical records and registries from the Pathology Service of the hospital. The reports were categorized as aplastic (leukemia free) or infiltrated.

Statistical analysis

The qualitative assessment of blasts was first treated as an ordinal categorical variable and later grouped into two categories and treated as a dichotomous categorical variable. Agreement between the two observers was assessed using the kappa coefficient (Cohen’s kappa) and the quadratic weighted kappa coefficient (Kw). The kappa coefficient may range from −1 (complete disagreement) to +1 (complete agreement), and agreement is usually classified as poor (below 0), mild (0 to 0.2), low (0.21 to 0.4), moderate (0.41 to 0.6), substantial (0.61 to 0.8) and almost perfect (0.81 to 1.00) [24]. Further evaluation of the marginal homogeneity of proportions was performed with the McNemar test for dichotomous categorical variables and the modified McNemar test for ordinal categorical variables. In both tests, a significant p value (<0.05) indicates systematic disagreement (bias) between observers [25].
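As an illustration of the agreement statistics described above, the following is a minimal sketch of Cohen's kappa and its quadratic weighted variant. It is not the software used in the study; the function name `weighted_kappa`, the coding of the Likert categories as integers 0 to 4, and the example ratings are all assumptions made for this sketch.

```python
from collections import Counter


def weighted_kappa(ratings_a, ratings_b, n_categories, weights="quadratic"):
    """Cohen's kappa for two raters over ordinal categories 0..n_categories-1.

    weights=None gives the unweighted kappa; weights="quadratic" penalizes
    each disagreement by the squared distance between the two categories.
    """
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("ratings must be non-empty and of equal length")
    if n_categories < 2:
        raise ValueError("at least two categories are required")

    n = len(ratings_a)
    k = n_categories
    observed = Counter(zip(ratings_a, ratings_b))  # joint counts per cell
    marg_a = Counter(ratings_a)                    # marginal counts, rater A
    marg_b = Counter(ratings_b)                    # marginal counts, rater B

    def disagreement(i, j):
        if weights == "quadratic":
            return ((i - j) / (k - 1)) ** 2
        return 0.0 if i == j else 1.0

    # Observed and chance-expected weighted disagreement.
    obs = sum(disagreement(i, j) * c for (i, j), c in observed.items()) / n
    exp = sum(
        disagreement(i, j) * marg_a[i] * marg_b[j]
        for i in range(k)
        for j in range(k)
    ) / (n * n)
    if exp == 0:
        raise ValueError("kappa is undefined when both raters use one category")
    return 1.0 - obs / exp


# Hypothetical ratings (0 = definitely free ... 4 = definitely infiltrated).
observer_1 = [0, 1, 2, 4, 4, 3, 0, 1]
observer_2 = [0, 1, 1, 4, 3, 3, 0, 2]
print(weighted_kappa(observer_1, observer_2, 5))
print(weighted_kappa(observer_1, observer_2, 5, weights=None))
```

With quadratic weights, a disagreement between adjacent categories costs far less than one between opposite ends of the scale, which is why the weighted coefficient suits an ordinal scale better than the unweighted one.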

The quantitative assessment of blasts was treated as a discrete variable with a non-normal distribution; comparisons between observers were performed with Spearman’s correlation coefficient (rs). Measurements between observers were also compared using the intraclass correlation coefficient (ICC) and the Bland and Altman method [26].
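The two paired comparisons named above can be sketched as follows, assuming blast percentages from the two observers are held in equal-length lists. The function names and example values are hypothetical, and the sketch returns the conventional Bland and Altman limits of agreement (bias ± 1.96 SD of the differences), which are not the same quantity as a confidence interval for the mean difference.

```python
from statistics import mean, stdev


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # positions i..j share this 1-based rank
        for pos in range(i, j + 1):
            ranks[order[pos]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman's rs: the Pearson correlation of the tie-corrected ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


def bland_altman(obs1, obs2):
    """Return (bias, lower limit, upper limit) for paired measurements."""
    if len(obs1) != len(obs2) or len(obs1) < 2:
        raise ValueError("need at least two paired measurements")
    diffs = [a - b for a, b in zip(obs1, obs2)]
    bias = mean(diffs)
    sd = stdev(diffs)
    return bias, bias - 1.96 * sd, bias + 1.96 * sd


# Hypothetical D14 blast percentages from two observers.
blasts_1 = [2, 4, 5, 12, 40, 3, 8]
blasts_2 = [3, 6, 5, 15, 55, 4, 10]
print(spearman(blasts_1, blasts_2))
print(bland_altman(blasts_1, blasts_2))
```

A rank correlation can be high even when one observer counts systematically more blasts than the other, which is why the Bland and Altman bias is reported alongside it.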

The D14 BMA evaluation was compared with the BMB (considered as “gold standard”) using receiver operating characteristic (ROC) curves to assess the best cut-off point in terms of sensitivity, specificity and accuracy. The areas under the ROC curves (AUC) were compared using the method of DeLong [27]. OS was defined as the time from diagnosis to death from any cause or last follow-up. Survival curves were estimated with the Kaplan-Meier method and differences were compared with the log-rank test. Multivariate analysis for OS was conducted using a Cox model and hazard ratios (HR) were obtained for each observer. All tests were 2-sided, and p values <0.05 were considered statistically significant. Statistical analyses were performed using SPSS 11.0 (SPSS Inc., 1989–2001), MedCalc 11.3 and MH Program 1.2142.
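One common way to pick a cut-off from a ROC analysis is to scan the candidate thresholds and keep the one that maximizes Youden's J (sensitivity + specificity − 1). The text does not state which criterion was applied, so the criterion, the function name `best_cutoff` and the data below are assumptions for illustration only.

```python
def best_cutoff(blast_percentages, biopsy_infiltrated):
    """Return the threshold maximizing Youden's J against the reference.

    A sample is called positive when its blast percentage is >= threshold;
    biopsy_infiltrated holds the reference (BMB) result as booleans.
    """
    if len(blast_percentages) != len(biopsy_infiltrated):
        raise ValueError("inputs must have equal length")
    n = len(blast_percentages)
    positives = sum(1 for ref in biopsy_infiltrated if ref)
    negatives = n - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need both infiltrated and leukemia-free cases")

    pairs = list(zip(blast_percentages, biopsy_infiltrated))
    best = None
    for threshold in sorted(set(blast_percentages)):
        tp = sum(1 for value, ref in pairs if ref and value >= threshold)
        tn = sum(1 for value, ref in pairs if not ref and value < threshold)
        sensitivity = tp / positives
        specificity = tn / negatives
        youden_j = sensitivity + specificity - 1
        # Keep the first threshold reaching the highest J.
        if best is None or youden_j > best["youden_j"]:
            best = {
                "threshold": threshold,
                "sensitivity": sensitivity,
                "specificity": specificity,
                "accuracy": (tp + tn) / n,
                "youden_j": youden_j,
            }
    return best


# Hypothetical aspirate blast percentages with biopsy results.
blasts = [1, 2, 3, 4, 6, 9, 15, 30, 5, 2]
biopsy = [False, False, False, False, True, True, True, True, False, True]
print(best_cutoff(blasts, biopsy))
```

Other criteria, such as maximizing accuracy or fixing a minimum specificity, can select a different threshold from the same curve.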

Results

Patients

Of 295 patients with AML identified in the hospital records, 119 fulfilled the entry criteria. Among these 119 patients who had a BMA on D14, we could recover 107 sets of BMA smears, each containing the diagnostic and D14 samples. The median age was 38 years (range 12–77), 12 % were older than 60 years and 58 % were male. In addition, we were able to compare D14 BMA and BMB in 82 patients.

Agreement analysis between observers

The comparison between observers of D14 BMA evaluation using the qualitative scale is shown in Table 2. The quadratic weighted kappa coefficient was 0.74 (95 % confidence interval [95 % CI] 0.64 - 0.83, p < 0.001), and no bias was observed (p = 0.8, modified McNemar test). Typical qualitative categories are shown in Fig. 1.

Table 2 Agreement and comparison of frequency between categories of the Likert scale between two observers
Fig. 1 Qualitative assessment of bone marrow aspirates on D14 of induction chemotherapy in AML patients. a and b: definitely free; c and d: doubtful; e and f: definitely infiltrated (Wright-Giemsa, x400 and x1000, respectively)

The median blast count on D14 was 4 % and 6 % for observers 1 and 2, respectively, with a Spearman correlation coefficient of 0.798 (p < 0.001) (Fig. 2), and an ICC between assessments of 0.836 (95 % CI 0.768 - 0.885, p < 0.001). The average difference between measurements of the percentage of blasts among the observers, according to the Bland and Altman method, was 5.01 % (95 % CI 7.63 - 2.39).

Fig. 2 Correlation between the percentage of blasts in D14 bone marrow aspirate by two observers

Comparison of bone marrow aspiration and bone marrow biopsy on D14

The evaluation of BMB on D14 showed 33 patients with bone marrow infiltration and 49 free of leukemia. Table 3 shows the distribution of the categories of the qualitative scale according to the BMB status. We observed an association between the categories of definitely free and probably free and a leukemia-free BMB, and between the categories of definitely infiltrated and probably infiltrated and an infiltrated BMB (85.4 % for observer 1 and 75.6 % for observer 2). Doubtful results of BMA corresponded mainly to a leukemia-free BMB for both observers.

Table 3 Correlation of BMA evaluation by both observers using the Likert scale with the results of the BMB

Figure 3 shows the ROC curves correlating the BMA quantification of blasts and qualitative scale, by both observers, according to BMB results. The AUCs for the quantitative and qualitative assessments were 0.924 and 0.946 for observer 1, and 0.867 and 0.870 for observer 2, respectively. We also compared the ROC curves of the quantitative and qualitative analysis of each observer. The difference in AUCs was 0.025 for observer 1 (p = 0.22) and 0.002 for observer 2 (p = 0.97).

Fig. 3 ROC curves of the quantitative and qualitative evaluations of D14 BMA by two observers

Determining the best cut-off points

The best cut-off points for blast percentage in BMA were 6 % for observer 1 (AUC 0.883, 84.9 % sensitivity, 91.8 % specificity, and 89.9 % accuracy), and 7 % for observer 2 (AUC 0.858, 81.8 % sensitivity, 89.8 % specificity, and 86.6 % accuracy). A similar analysis for the Likert scale showed the best cut-off point to be the 4th item of the scale (probably infiltrated) for both observers: AUC 0.898, 87.9 % sensitivity, 91.8 % specificity, and 90.2 % accuracy for observer 1, and AUC 0.818, 69.7 % sensitivity, 93.9 % specificity, and 84.1 % accuracy for observer 2.
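As a worked check of how such figures follow from a 2 × 2 table, the sketch below recomputes the Likert-scale values for observer 1. The cell counts are not reported in the text; they are inferred here from the stated percentages and the BMB totals (33 infiltrated, 49 leukemia free), so they should be read as an assumption.

```python
def diagnostic_metrics(tp, fn, tn, fp):
    """Sensitivity, specificity and accuracy from 2 x 2 table counts."""
    total = tp + fn + tn + fp
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "accuracy": (tp + tn) / total,
    }


# Inferred counts (assumption): 29 of 33 infiltrated biopsies rated
# "probably infiltrated" or higher, 45 of 49 leukemia-free biopsies rated lower.
metrics = diagnostic_metrics(tp=29, fn=4, tn=45, fp=4)
for name, value in metrics.items():
    print(f"{name}: {100 * value:.1f} %")
# sensitivity: 87.9 %
# specificity: 91.8 %
# accuracy: 90.2 %
```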

Based on the best cut-off point of the qualitative assessment, we divided the five categories of the scale into two: “free” and “infiltrated”. The first comprised the categories definitely free, probably free and doubtful, while the second comprised the categories probably infiltrated and definitely infiltrated. The kappa coefficient for the comparison between observers was 0.66 (95 % CI 0.51 - 0.80, p < 0.001), with no bias per McNemar test (p = 0.1) (Table 4).

Table 4 Agreement and comparison of frequency between grouped categories of the Likert scale between two observers

Impact of D14 blasts on survival

Five-year OS was significantly higher in patients with <5 % blasts on D14 for both observers (Fig. 4). With the Likert scale, a better outcome in patients with lower grades of marrow involvement was also observed (Fig. 5). The same results were obtained among 55 patients in CR who received two or more cycles of intensification (Fig. 6). Other variables detected as prognostic factors by univariate analysis were: age >60 years, year of diagnosis, treatment delay >7 days from diagnosis, presence of comorbidities, previous cardiac disease, hepatomegaly, active bleeding, gastrointestinal infection and FAB subtype M2 (p < 0.05) (Table 5).
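The survival comparisons in this section rest on Kaplan-Meier estimates. A minimal sketch of the product-limit estimator is given below, assuming follow-up times and an event flag per patient; the function name and the example data are hypothetical, and the log-rank and Cox analyses are not reproduced here.

```python
def kaplan_meier(times, events):
    """Return [(time, survival probability)] at each distinct death time.

    times: follow-up duration per patient (any consistent unit).
    events: True if death was observed, False if follow-up was censored.
    """
    if len(times) != len(events):
        raise ValueError("times and events must have equal length")
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        current = data[i][0]
        deaths = 0
        leaving = 0
        # Group all patients whose follow-up ends at the same time.
        while i < len(data) and data[i][0] == current:
            if data[i][1]:
                deaths += 1
            leaving += 1
            i += 1
        if deaths:
            survival *= 1 - deaths / at_risk
            curve.append((current, survival))
        # Deaths and censored patients both leave the risk set.
        at_risk -= leaving
    return curve


# Hypothetical follow-up in months; False marks a censored patient.
months = [3, 5, 5, 8, 12, 20, 24, 36]
died = [True, True, False, True, False, True, False, False]
for time, probability in kaplan_meier(months, died):
    print(time, round(probability, 3))
```

Censored patients contribute to the number at risk until their last follow-up, which is how the estimator uses incomplete observations without treating them as deaths.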

Fig. 4 Overall survival according to the quantitative evaluations of D14 BMA by two observers

Fig. 5 Overall survival according to the qualitative evaluations of D14 BMA by two observers

Fig. 6 Overall survival according to the qualitative evaluations of D14 BMA by two observers in patients (n = 55) treated with two or more cycles of intensification

Table 5 Factors associated with poor outcome (overall survival) in patients with acute myeloid leukemia by univariate analysis

Predictors of poor outcome (lower OS) by multivariate analysis, with HR obtained respectively for observers 1 and 2, were age >60 years [HR = 4.67 (95 % CI 1.91-11.4) and 4.36 (95 % CI 1.79-10.61)], the presence of active bleeding at diagnosis [HR = 2.37 (95 % CI 1.18-4.74) and 2.05 (95 % CI 1.01-4.13)] and residual D14 blasts assessed with the Likert scale [HR = 1.42 (95 % CI 1.11-1.81) and 1.43 (95 % CI 1.11-1.92)] (Table 6).

Table 6 Factors associated with poor outcome (overall survival) in patients with acute myeloid leukemia by multivariate analysis performed with D14 BMA evaluation by both observers

Discussion

In this study we found substantial agreement between observers using two different methods: a quantitative assessment, based on the percentage of bone marrow blasts, and a qualitative assessment, based on the perception of marrow infiltration. In addition, a cut-off value of 6-7 % of blasts in the quantitative assessment and “probably infiltrated” marrow in the qualitative assessment was established, with good discriminatory power to identify patients with infiltrated BMB. Moreover, we observed a higher OS in patients who obtained higher grades of cytoreduction by day 14 marrow evaluation.

While risk assessment in AML relies mainly on age and cytogenetic profile [5], the assessment of in vivo chemosensitivity by determining early response to induction therapy is an additional predictive marker. Indeed, this parameter has been used to guide clinicians in deciding on an early second cycle of chemotherapy [13, 28, 29]. However, the type of D14 bone marrow evaluation (BMA, BMB or both) has varied, with some studies relying on BMA [8–16], others on BMB [18], and some providing no clear information [9, 10, 17, 19].

In our study we observed that the qualitative and the quantitative methods were equally predictive of BMB results, with substantial inter-observer agreement. Bone marrow evaluation by more than one observer has been previously reported [16, 17], but to the best of our knowledge, ours is the first study to report an assessment of inter-observer agreement.

Another point of controversy is the cut-off value of blast cell percentage in the quantitative assessment of BMA. Different studies have established cut-off values of 5 % [9, 10, 30, 31], 10 % [8, 9, 17], 15-22 % [16], and even 40 % [10]. These variations are also present in published guidelines: <5 % [11], <5-10 % [12], <10-15 % [13] and hypoplasia or aplasia (without defining a numerical value) [14]. We established a cut-off value of 6-7 % (inter-observer variation), which is in the range of previous studies, and identified that the qualitative categories of definitely and probably infiltrated were predictive of residual leukemia on BMB.

All analyses of response assessment by D14 BMA, by both methods (qualitative and quantitative) and both observers, resulted in higher specificity than sensitivity. Likewise, the concordance between observers was very good for “definitely/probably infiltrated”, but not so good for “definitely/probably free”. Therefore, there is no debate that a large number of leukemic blasts on day 14 constitutes unequivocal evidence of residual leukemia. However, the presence of a few blasts in a paucicellular or hemodiluted marrow sample cannot be considered definite evidence of residual disease. Indeed, most guidelines recommend a second induction cycle for unequivocal residual disease, and most dilemmas occur in patients with a low blast count (5-15 %) [32].

Few previous studies have shown an association between D14 marrow findings and long-term outcome [8, 9, 10, 17, 30]. In the present study, multivariate analysis showed that bone marrow infiltration assessed with the Likert scale (but not the percentage assessment) was significantly associated with poor outcome.

Our study shares the limitations of all retrospective studies. It was not possible to recover D14 BMA and BMB slides from all cases. In addition, survival analysis was performed without the inclusion of well-known prognostic factors such as chromosomal and molecular abnormalities. Finally, we did not analyze the potential effect of the different induction regimens given throughout the study period or of the number of patients entered over the study period. Despite these limitations, we were able to show that BMA may be considered the procedure of choice to assess treatment response on D14 because it provides results immediately, and exhibits good agreement between observers and good correlation with BMB and OS.

Conclusions

We conclude that the assessment of BMA on day 14 of remission induction chemotherapy in patients with AML is a reproducible test with substantial agreement between observers, both quantitatively and qualitatively, and that it correlates well with BMB and with OS. A cut-off of 6-7 % blasts or a rating of “probably infiltrated” may help to identify early a population of patients with unfavorable prognosis.