Introduction

Acute graft-versus-host disease (aGvHD) is one of the most threatening complications of allogeneic hematopoietic stem cell transplantation (alloHSCT). The gastrointestinal (GI) tract is a major target organ [25]. However, a widely accepted standard for histological GvHD reporting has not yet been established. This is reflected in the numerous different grading systems applied in studies assessing histological findings of aGvHD, with the Lerner grade being one of the most widely used [2, 6, 9, 12, 13, 14, 16, 17, 18, 19, 20, 24, 26, 28, 31, 33, 34, 35]. The lack of universally accepted standards hampers comparability of previous studies of aGvHD. Additionally, there have been reports of discrepancies when correlating histological and clinical findings [1, 28].

Moreover, as interobserver reproducibility is an issue, much effort has been made to standardize histopathological GvHD diagnoses [14, 31]. The latest modification of NIH categories for GvHD grading is strongly simplified including only “no”, “possible”, or “likely” GvHD [25, 31].

The present Round-Robin test aimed to improve reproducibility and standardisation of morphological changes of colorectal aGvHD. Several preexisting grading systems and newly generated sum scores were compared to identify the most robust and reproducible tool reflecting clinical findings.

Material and methods

Selection of patients, biopsies and clinical data

Inclusion criteria were a history of alloHSCT and colon biopsies taken 20 to 180 days after transplantation. Biopsies with signs of infection were excluded. Patients were randomly selected (Erlangen (n = 22), Mainz (n = 38), Regensburg (n = 51), and Würzburg (n = 12)). Age, sex, primary disease, days post transplant, GvHD-stage lower gastrointestinal tract (GI), overall Glucksberg grade (Supplemental Table 1 [11, 27]), response to steroid treatment (not applied/sensitive/refractory/intolerant) and primary cause of death were retrieved from the MAGIC database or clinical files (Supplemental Table 2). The overall Glucksberg grade is a combined value of clinical signs of GvHD in the skin, liver, upper and lower GI giving a grade of the clinical severity of GvHD (Supplemental Table 1). The GvHD stage lower GI is the respective value of the lower GI tract included in the Glucksberg grade, which stratifies the degree of GvHD according to the daily volume (< 500; 500–999; 1000–1500; > 1500 ml/day) and frequency (< 3; 3–4; 5–7; > 7 episodes/day) of diarrhea and additional symptoms such as severe pain or bloody stool [11]. The study was approved by the local Ethics Committee of the University Hospital Regensburg (No. 18–900-101).

Table 1 ≥75% agreement and results of consensus meeting
Table 2 Association of pathological findings and graduation with clinical findings

Histomorphological assessment and consensus meeting

The Round-Robin test was performed in two rounds with a consensus meeting between them (Suppl. Fig. 1). In the 1st round, 27 biopsies (at least 5 stained sections) of 10 patients (= Group1) were assessed by 3 experienced pathologists (S.R.-H., A.K. and M.B.-H.) and a pathology fellow well acquainted with GvHD (K.Hip.). The section with the most severe changes was preselected (by K.Hip.) for analyses. Sections were digitized and made accessible via CaseCentre 2.9 (3DHISTECH, Budapest, Hungary) for online microscopy. Parameters assessed in the first round included: number of apoptoses as defined by Kreft et al. [14] in 10 neighboring crypts in the hot-spot, as suggested previously [10]; presence or absence (yes/no) of crypt destruction [14], architectural distortion, increase of eosinophilic and neutrophilic granulocytes, ulceration and epithelial denudation [14]. Grading was performed according to modified Lerner [14, 16], Sale [24], Melson [19] and NIH categories [31].

After the 1st round a consensus meeting was held (K.Hip., S.R.-H., A.K., A.R., M.B.-H.) for standardization (Fig. 1):

  • Crypt apoptotic bodies (CAB) [14]: number of apoptoses in 10 neighboring crypts in the hot-spot

  • crypt destruction as defined [14]: 0 = none, 1 = individual, non-contiguous crypts, 2 = destruction of ≥ 2 neighboring crypts

  • crypt loss, as defined by missing intact crypts without above-described signs of crypt-destruction as defined [14]: 0 = none, 1 = individual, non-contiguous crypts, 2 = loss of ≥ 2 neighboring crypts

  • increase of eosinophilic or neutrophilic granulocytes (modified after [7]): ≥ 5 granulocytes in one high power field in the hot-spot excluding eschar in ulcer/erosion; 0 = no increase, 1 = increase

  • architectural changes of the mucosa including at least one of the following: shortened crypts not reaching the lamina muscularis mucosae, distorted or branched crypts [23]; 0 = no or mild, 1 = moderate to severe architectural changes

  • denudation/erosion [14] and/or granulation tissue; 0 = absent, 1 = present.

Ulceration was omitted and grading systems were adapted (Suppl. Table 3) to be independent of clinical findings.

In the second round (approximately one year later), Group1 was reassessed plus 96 additional biopsies (Group2). In Group2, which was correlated with clinical findings, only one colon specimen per time-point was included per patient.

Sum scores generated from the histomorphological parameters and CAB count cut-offs

Sum scores from the histological findings (Suppl. Table 3) were generated as follows: Sum score 1 included a score of the mean CAB count of all four observers (CAB score: 0: no; 1: 1- < 5; 2: 5- < 10; 3: ≥ 10 CAB/10 continuous crypts) plus the mean score of crypt loss between the four observers. Sum score 2, in addition to these two parameters, included the mean values of crypt destruction and epithelial denudation. Sum score 3, in addition to the parameters included in sum score 2, also included the mean values of architectural distortion and of the increase of eosinophilic and neutrophilic granulocytes.
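The construction of the three sum scores can be illustrated with a minimal Python sketch. The function and key names are hypothetical and not part of the study's workflow. The sketch also assumes that mean CAB counts below 1 receive a CAB score of 0, which the definition above does not state explicitly.

```python
from statistics import mean


def cab_score(mean_cab):
    """Map a mean CAB count per 10 contiguous crypts to the 0-3 CAB score."""
    if mean_cab >= 10:
        return 3
    if mean_cab >= 5:
        return 2
    if mean_cab >= 1:
        return 1
    return 0  # assumption: mean counts below 1 are scored as "no" CAB


def sum_scores(obs):
    """Return (sum score 1, sum score 2, sum score 3) for one biopsy.

    `obs` maps each parameter name (hypothetical keys) to the list of
    values assigned by the individual observers.
    """
    s1 = cab_score(mean(obs["cab"])) + mean(obs["crypt_loss"])
    s2 = s1 + mean(obs["crypt_destruction"]) + mean(obs["denudation"])
    s3 = s2 + mean(obs["architecture"]) + mean(obs["granulocytes"])
    return s1, s2, s3


# Illustrative values for four observers (not study data)
example = {
    "cab": [6, 7, 5, 6],
    "crypt_loss": [1, 1, 2, 2],
    "crypt_destruction": [0, 1, 1, 2],
    "denudation": [0, 0, 1, 1],
    "architecture": [1, 1, 1, 1],
    "granulocytes": [0, 0, 0, 1],
}
print(sum_scores(example))
```

Because each sum score extends the previous one, the nesting of the three scores is explicit: sum score 1 is contained in sum score 2, which is contained in sum score 3.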

To assess the significance of CAB counts, cut-off values for the mean CAB count were defined (mean CAB count < cut-off versus ≥ cut-off) and the resulting groups were compared with clinical findings.
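As a rough illustration of this dichotomization, the following Python sketch tabulates cases below versus at or above a cut-off against a binary clinical finding. The function name, the input layout and the example values are assumptions made for illustration only; the resulting 2 × 2 counts would then be passed to a test for nominal data.

```python
def cutoff_table(cases, cutoff):
    """Cross-tabulate mean CAB count (< cutoff vs >= cutoff) against a
    binary clinical finding.

    `cases` is an iterable of (mean_cab, has_clinical_finding) pairs;
    this layout is a hypothetical choice for the sketch.
    """
    table = {
        ("below", False): 0,
        ("below", True): 0,
        ("at_or_above", False): 0,
        ("at_or_above", True): 0,
    }
    for mean_cab, finding in cases:
        group = "below" if mean_cab < cutoff else "at_or_above"
        table[(group, bool(finding))] += 1
    return table


# Illustrative values only (not study data)
cases = [(0.25, False), (0.75, False), (1.5, True), (6.0, True), (0.5, True)]
print(cutoff_table(cases, cutoff=1))
```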

Validation cohort

For the validation of sum scores 1 and 2, an independent cohort of 111 patients was analyzed by one pathologist (A.K.) including cases from Mainz (n = 58) and Regensburg (n = 53). For each patient, the colon biopsy with the most severe signs of GvHD at the time-point was evaluated.

Statistical analyses

Statistical analyses were performed using SPSS software (IBM Statistics SPSS 24). To compare the distribution of continuous and ordinal parameters between two or more groups, Mann–Whitney and Kruskal–Wallis tests were chosen, respectively. For nominal parameters, cross-tabulation was applied using chi-squared testing and post-hoc testing as described by Beasley et al. [3]. For correlation analyses, a Spearman test was performed. To assess the reproducibility between observers, inter-rater reliability (IRR) was quantified using Fleiss’ kappa (for nominal parameters and ordinal parameters with no more than 5 possible values) and intra-class correlation (ICC, for all ordinal parameters, two-way model, agreement type, single unit), relying on the R statistical environment v. 4.0.3 (https://www.R-project.org/) and the irr package v. 0.84.1 (https://CRAN.R-project.org/package=irr). P < 0.05 was used to identify statistically significant findings.
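For readers unfamiliar with Fleiss’ kappa, the following minimal Python sketch implements the standard textbook formula for a fixed number of raters per subject. It is an illustration only and not the implementation in the irr package used for the analyses; the function name and input layout are assumptions.

```python
from collections import Counter


def fleiss_kappa(ratings):
    """Fleiss' kappa for `ratings`, a list of subjects, each given as a
    list with one category label per rater (same rater count throughout).
    """
    n_subjects = len(ratings)
    n_raters = len(ratings[0])
    if any(len(r) != n_raters for r in ratings):
        raise ValueError("every subject needs the same number of ratings")

    counts = [Counter(r) for r in ratings]
    categories = set().union(*counts)

    # Observed agreement: mean proportion of agreeing rater pairs per subject
    p_obs = sum(
        (sum(c * c for c in cnt.values()) - n_raters)
        / (n_raters * (n_raters - 1))
        for cnt in counts
    ) / n_subjects

    # Chance agreement from the overall category proportions
    total = n_subjects * n_raters
    p_exp = sum((sum(cnt[cat] for cnt in counts) / total) ** 2
                for cat in categories)

    # Undefined (division by zero) if all ratings fall into one category
    return (p_obs - p_exp) / (1 - p_exp)


# Illustrative ratings of two biopsies by four observers (not study data)
print(fleiss_kappa([[0, 0, 0, 0], [1, 1, 1, 1]]))
```

A value of 1 indicates perfect agreement, values near 0 indicate agreement no better than chance, and negative values indicate agreement below chance.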

Results

Patients’ cohort, 1st and 2nd round of the Round-Robin test

Patients’ characteristics are summarized in Supplemental Table 2. In the first round, separate analysis of the 27 biopsies (Group1) was performed by the four observers without prior discussion (Table 1). Thereafter, a consensus meeting was held to establish more concise definitions of histomorphological parameters (Fig. 1 and Suppl. Fig. 1). As a result, ulceration was omitted, crypt loss added and crypt destruction changed into a semi-quantitative parameter. Moreover, a cut-off of ≥ 5 cells per HPF was defined for the presence of increased eosinophils and neutrophils [7]. Additionally, some of the definitions for assigning a case to the grading systems were specified (Suppl. Table 3). In the second round, Group1 plus 96 newly selected biopsies (Group2) were assessed using the updated criteria. A consensus diagnosis was accepted when at least 3 of 4 observers assigned the same value to a respective biopsy. Results of this “≥ 75% agreement” before and after the consensus meeting are summarized in Table 1. Improvement of “≥ 75% agreement” was mild to moderate looking at the histomorphological parameters, whereas the interobserver reproducibility of the grading systems was at best mildly improved. Best concordance was achieved for the most simplified NIH categories followed by the Lerner grade. Correlation of CAB counts between the observers was high in both rounds with only minimal improvement (Suppl. Table 4). As additional parameters of interrater reliability, Fleiss’ kappa values and intra-class correlation coefficient (ICC) were calculated (Suppl. Table 5). No improvement was seen in CAB counts, epithelial denudation or grading systems, whereas improvement was highest when assessing increased neutrophils and eosinophils. Sum scores generated from morphological parameters appeared to have better reproducibility than the prepublished grades.
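The “≥ 75% agreement” rule (at least 3 of 4 observers assigning the same value) can be sketched in a few lines of Python. The function names are hypothetical, and the example values are illustrative rather than study data.

```python
from collections import Counter


def consensus_value(values, min_fraction=0.75):
    """Return the value shared by at least `min_fraction` of the observers,
    or None if no value reaches that level of agreement."""
    value, n = Counter(values).most_common(1)[0]
    return value if n / len(values) >= min_fraction else None


def agreement_rate(biopsies, min_fraction=0.75):
    """Fraction of biopsies for which a consensus value exists."""
    reached = sum(
        consensus_value(v, min_fraction) is not None for v in biopsies
    )
    return reached / len(biopsies)


# Illustrative grades from four observers for four biopsies
grades = [[2, 2, 2, 1], [0, 0, 0, 0], [1, 1, 2, 2], [1, 2, 3, 3]]
print([consensus_value(g) for g in grades])
print(agreement_rate(grades))
```

With four observers, a 2:2 split yields no consensus, which is why multi-tiered grading systems are more likely to fall below the threshold than dichotomized parameters.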

Histological findings, grading and clinical findings in Group2

The mean values of CAB as assessed by the four observers in Group2 were significantly associated with overall Glucksberg grade (0 vs. 1&2, p = 0.01 and 0 versus 3&4, p < 0.001, Table 2, Fig. 2A), the GvHD-stage lower GI (0 vs. 1&2, p < 0.001 and 0 vs. 3&4, p < 0.001, Table 2, Fig. 2B), and the Lerner grades with ≥ 75% agreement (0 vs 1&2, p < 0.001; 0 vs 3&4, p < 0.001; Fig. 2C). Additionally, CAB counts were significantly higher in patients with non-relapse mortality (NRM, p = 0.021), but also with relapse mortality (RM, p = 0.012) when compared to living patients (Table 2). Other morphological parameters that reflect the clinical findings are summarized in Table 2. Increased eosinophilic and neutrophilic granulocytes and crypt architectural distortion were not associated with clinical findings (data not shown).

Fig. 1

Histomorphological parameters evaluated by the 4 observers. (A) Crypts with several apoptotic bodies (CAB, arrows) with at least two fragments of karyorrhectic debris surrounded by a halo, enlarged in the inlay (H&E, 400 × original magnification (o.m.)). (B) Crypt destruction (arrow) with flattened epithelium of the crypt filled with cell debris. In the surrounding crypts several apoptotic bodies can be seen (arrow heads, H&E, 200 × o.m.). An alternative definition of crypt destruction according to Kreft et al. includes apoptotic destruction of ≥ 1/3 of the crypt epithelium with at least ½ of the diameter of a normal crypt retained [14]. (C) Crypt loss (arrows) indicated by missing or strongly degenerated crypts (H&E, 200 × o.m.) not fulfilling the criteria of crypt destruction. CAB in surrounding crypts indicated by arrow heads. (D) Epithelial denudation with surface deposition of fibrin (arrow) and granulation tissue (asterisks, H&E, 100 × o.m.). (E) Crypt architectural distortion with a conspicuously branched crypt (arrow) next to a distorted crypt (H&E, 200 × o.m.). (F) Increased granulocytic infiltrate as exemplified by neutrophilic granulocytes (arrows) as defined by ≥ 5 granulocytes per 400 × high power field (H&E, 400 × o.m.)

Fig. 2

Association of morphological findings with clinical parameters. (A) Distribution of mean crypt apoptotic body (CAB) count related to clinical overall Glucksberg grades, showing a significant difference between no GvHD (grade 0) and low-grade (1&2) as well as high-grade (3&4) changes, resp. (B) Mean CAB counts related to GvHD-stage lower GI with significant differences between no GvHD (grade 0) and low-grade (1&2) or high-grade (3&4) changes, resp. (C) Comparison of the different Lerner grades (only cases with ≥ 75% agreement were included) in the distribution of mean CAB counts. Significant differences were seen between no signs of GvHD (Grade 0) and grades 1&2 and 3&4, resp. (D) Mean Lerner grades increase with rising overall Glucksberg grades with a significant difference between no signs of GvHD (Grade 0) and low-grade (1&2) or high-grade changes (3&4), resp. (E) Mean Lerner grades increased with GvHD-stage lower GI with a significant difference between stage 0 compared to 1&2 and 3&4, resp. (F) Mean Lerner grades were additionally associated with survival, with higher Lerner grades in the NRM-group compared to living patients. Bars indicate the median. * p < 0.05; ** p < 0.01; *** p < 0.001

Regarding the grading systems (Table 2), Sale, Melson, Lerner (Fig. 2D-E), and NIH grades uniformly showed a significant association with overall Glucksberg grade (0 vs. 1&2: p = 0.044, 0.011, 0.014 and 0.006, resp. and 0 vs 3&4: p = 0.003, < 0.001, < 0.001 and < 0.001, resp.) and GvHD-stage lower GI (0 vs. 1&2: p = 0.047, 0.005, 0.002 and < 0.001, resp. and 0 vs 3&4: p = 0.001, < 0.001, < 0.001 and < 0.001, resp.). Moreover, higher Sale, Melson, Lerner, and NIH grades were associated with steroid refractoriness when compared to cases without application of steroids (p = 0.017, 0.010, 0.035 and 0.041, resp.). Higher Melson, Lerner (Fig. 2F), and NIH grades were also significantly associated with NRM when compared to living patients (p = 0.018, 0.025 and 0.029, resp.). None of the histological grading systems could differentiate between clinical low- and high-grade changes (Table 2).

Sum scores as an alternative measure of grading GvHD and association with clinical findings

As the transfer of histomorphological parameters into qualitative histological grading systems may give rise to misinterpretation or loss of information, we tested whether sum scores of histological parameters might better represent clinical findings (Fig. 3). The most simplified score included a score of mean CAB counts and crypt loss, both strongly associated with clinical findings (Table 2) and frequently observed in histological analysis of the cohort. Sum score 1 was significantly associated with overall Glucksberg grade (0 vs. 1&2, p = 0.024, 0 vs. 3&4, p = 0.002), GvHD-stage lower GI (0 vs. 1&2, p = 0.019, 0 vs. 3&4, p = 0.003), and survival (alive vs. NRM, p = 0.013), but not response to therapy (Table 2). Sum score 2 included only parameters relevant for at least one previously published grading system (Sale, Melson, Lerner). Significant differences were observed for overall Glucksberg grade (0 vs. 1&2, p = 0.025, 0 vs. 3&4, p = 0.001), GvHD-stage lower GI (0 vs. 1&2, p = 0.013, 0 vs. 3&4, p = 0.001), survival (alive vs. NRM, p = 0.021), and steroid response (not applied vs. refractory, p = 0.032). For the most complex sum score 3, including all parameters assessed, no significant association with either steroid responsiveness or survival was seen, in contrast to overall Glucksberg (0 vs. 1&2, p = 0.009, 0 vs. 3&4, p = 0.001) and GvHD-stage lower GI (0 vs. 1&2, p = 0.005, 0 vs. 3&4, p = 0.001).

Fig. 3

Sum scores: association with clinical findings. To test whether sum scores generated from the histological parameters might be useful for the grading of GvHD in colon biopsies, three different scores were generated and analyzed in the light of clinical findings. The most simplified sum score 1 included only a score of mean CAB counts and crypt loss, two parameters that were frequently present in the biopsies and showed good association with clinical findings. Sum score 2 included parameters used in previous grading systems and sum score 3 included all parameters assessed in this study. Association with clinical findings was best in sum score 2 and better in sum score 1 than 3. For Group2, sum scores were generated from the mean values of all 4 observers, with the CAB score derived from the mean CAB count; for the validation cohort, results of a single pathologist (A.K.) are shown. In the lower part of the figure, significant associations with clinical findings for sum scores 1 and 2 are shown for “group2/validation cohort” in comparison.

Correlation between published GvHD grading systems, sum scores and clinical GvHD grading

Correlation analyses of established grading systems and sum scores showed a strong, positive association. The positive correlation with clinical parameters was moderate and within the same range regarding published grading systems and sum scores (Suppl. Table 6).

Association of the validation cohort for sum scores with clinical signs of GvHD

To validate sum scores 1 and 2, an independent cohort of 111 cases was investigated by one pathologist (A.K.). Patients’ characteristics are summarized in Suppl. Table 7. In the validation cohort both analyzed sum scores (sum scores 1 and 2) were associated with clinical GvHD grading (Suppl. Table 8, Fig. 3). Both were able to differentiate between Glucksberg grades 0 vs. 3&4 and 1&2 vs. 3&4 (all p < 0.001) and GvHD-stage lower GI 0 vs. 1&2, 0 vs. 3&4, and 1&2 vs. 3&4 (sum score 1 p = 0.016, < 0.001, 0.009; sum score 2 p = 0.002, < 0.001, 0.002, resp.). Both sum scores were also different in cases, in which steroids were not applied vs. cases refractory to steroids (sum score 1 p = 0.019; sum score 2 p = 0.002). No association with mortality was observed.

Relevance of CAB counts in reflecting clinical signs of GvHD

To analyze whether CAB counts alone could reflect clinical findings, cases were divided according to their CAB counts (Suppl. Table 9). Very low mean CAB counts of < 0.5 and/or < 1 were significantly associated with a lack of clinical signs of GvHD and no application of steroids. 100% of cases with no clinical signs of GvHD in both overall Glucksberg grade and GvHD-stage lower GI showed < 2 CAB. A cut-off of < 3 CAB was significantly associated with the absence of overall Glucksberg grade 3&4. Cut-off values of < 5–7 were significantly associated with patient survival in the follow-up. A cut-off of 6 CAB indicated approximately the median for cases with adverse clinical findings, i.e. high-grade changes for overall Glucksberg and GvHD-stage lower GI, steroid refractoriness, and non-relapse mortality, whereas 80–100% of the biopsies associated with favourable clinical findings had a CAB of < 6.

Discussion

The present study aimed to assess the reproducibility and comparability of biopsy findings and grading of GvHD across pathologists at different HSCT centres. The diagnostic value of histology was determined by correlating histopathological characteristics and grading systems with clinical findings of GvHD. Finally, sum scores and different cut-offs for CAB counts were tested for their relevance in determining aGvHD.

The demographics of our cohort were within the range of previous studies [5, 6, 10, 14, 17, 18, 21, 26, 30, 32, 34]. In a first step morphological parameters and grading systems reported previously as diagnostic tools for GvHD reporting were tested for their reproducibility between pathologists. Before the first round of the Round Robin test, all observers familiarised themselves with histological criteria as defined earlier [14] without previous discussion. Agreement of ≥ 75% was high for dichotomized histomorphological parameters in the first round and further improved after consensus discussion. ≥ 75% agreement was much lower for the 3 to 5 tiered grading systems and improved only for Melson grading. Compared to a previous Round-Robin test [14] and a recent report assessing interrater reproducibility [26], our results were in the same range. Correlation between the observers in CAB counts was already high in the first round and no clear improvement was observed after the second round. These findings indicate that a relatively high comparability between different observers can be achieved just by studying the diagnostic criteria in the literature. A consensus meeting improves reproducibility in recognition and quantification of some morphological parameters, but appears to be less efficient in improving agreement in the application of grading systems.

In a next step, the histological parameters were tested for their relevance as indicators of GvHD by comparing the mean values of all 4 observers with clinical signs of GvHD. Mean values were used to reflect the ambiguities of GvHD reporting. CAB counts reflected overall Glucksberg grade, GvHD-stage lower GI, and survival, whereas they could not predict responsiveness to steroids. Only mean CAB counts and crypt loss were able to differentiate between no signs of GvHD and low-grade changes in overall Glucksberg grading, and only CAB and crypt destruction when looking at GvHD-stage lower GI. None of the parameters was able to discriminate between overall Glucksberg grade or GvHD-stage lower GI 1&2 and 3&4, i.e. to stratify low-grade and high-grade clinical GvHD findings. Myerson et al. proposed to subclassify Lerner grade 1 according to the numbers of CAB, which correlated with increased frequency of treatment [20], also arguing for the importance of apoptosis in the detection of low-grade GvHD. Crypt destruction, epithelial denudation, and crypt loss were all associated with severe clinical signs of aGvHD. Crypt loss and epithelial denudation, additionally, predicted refractoriness to steroids. In line with these observations, an association of severe crypt loss with higher stool volumes [6, 19], longer duration of diarrhea [6] and steroid refractoriness [19] has been reported before. Increased numbers of eosinophilic or neutrophilic granulocytes and architectural distortion were not significantly associated with clinical findings of GvHD in our cohort. Accordingly, eosinophilic counts did not support the diagnosis of colonic GvHD in previous reports [26, 30]. Increased neutrophilic granulocytes have been reported to be associated with inferior survival in GvHD of the upper GI [32], an association which we did not observe in the colon.

After evaluation of single morphological parameters, published grading systems based on these parameters were assessed for their association with clinical GvHD. All previously published [16, 19, 24, 31] histopathological grading systems were associated with clinical findings. Correlation between the grading systems was high, whereas correlation with clinical findings was only moderate. All grading systems could differentiate a group with no clinical signs of GvHD from low-grade or high-grade changes with regard to overall Glucksberg or GvHD-stage lower GI. No grading system could discern low- from high-grade clinical aGvHD. In line with this, a lack of correlation of low versus high histological grades with clinical GvHD grading has been reported [12]. Survival comparing no and mild histological signs of GvHD (4-tiered NIH categories) was the same in an earlier study. However, comparing no/mild and moderate/severe categories showed improved survival in the former [26]. Moreover, reportedly severe histological damage (Lerner grading) was associated with inferior treatment response and survival compared to lower grades [9]. In contrast to our results, a modified Lerner grading system was able to discern GvHD of low and high severity with regard to volume and duration of diarrhea [6], whilst histological findings were unable to predict steroid response [6]. Sale et al. reported an association of high clinical stages and stool volume with positive results for GvHD in rectal biopsies [24]. Taken together, histological grading appears to efficiently reflect clinical GvHD, whereas it was of limited value for stratifying the severity of clinical findings.

Previous findings [13] and our results also justify the widespread use of Lerner grade to report histological findings of GvHD for scientific purposes [9, 12, 14, 17, 18, 20] as it was significantly associated with all assessed clinical parameters and showed good reproducibility, whilst not including clinical parameters in its definition as opposed to NIH categories [31]. Underlining this conclusion, Lerner grading was also associated with GvHD-related death in a recent report [8].

Next, sum scores were tested as an alternative means of grading GvHD, as transfer into qualitatively defined scores carries the risk of misclassification. In line with this, IRR for the sum scores was better than for previously published grading systems. Results of sum score 3, not unexpectedly, indicated that an unselective increase of parameters does not necessarily improve the predictive value. Even sum score 2, including parameters used in previous grading systems, was only slightly superior to the very simple sum score 1, which included only CAB counts and crypt loss. The advantage of sum score 2, however, was its association with steroid refractoriness.

To validate our approach of applying sum scores 1 and 2, we analyzed an independent cohort of colon biopsies evaluated by a single pathologist. This approach better reflects the daily routine in the diagnosis of GvHD than using the mean values of 4 pathologists. Both sum scores were significantly associated with overall Glucksberg, GvHD-stage lower GI, and steroid response, supporting the use of sum scores. In contrast to the Round-Robin test, in the validation cohort a significant difference between low-grade and high-grade clinical findings could be observed for overall Glucksberg and GvHD-stage lower GI, possibly because only the most severely affected biopsies were chosen for analysis. In line with our approach, Farooq et al. tested the use of a sum score to grade colonic GvHD [8] and found an association with GvHD-related death in one of two analyzed cohorts.

Another important issue in the daily routine of diagnostic pathology is the cut-off of CAB counts to diagnose GvHD with certainty. Sauvestre et al. reported that in GvHD the CAB count always exceeded 5 per biopsy [26]. Others [10, 17] suggested a cut-off of ≥ 7 CAB per 10 contiguous crypts. Moreover, it was suggested to classify ≤ 6 CAB/10 crypts as “indeterminate for GvHD” as this group showed heterogeneous clinical findings [17]. As minimal criteria of GvHD, ≥ 1 CAB/biopsy piece [31] or ≥ 0.07 CAB per section [20] have been suggested, whereas others used ≤ 3 CAB/biopsy fragment as a cut-off for a negative histology [12]. In normal colon mucosa specimens any CAB were reported in only 20–25% of cases [5, 15]. In our cohort, < 1 CAB/10 contiguous crypts were significantly associated with negative overall Glucksberg grade, negative GvHD-stage lower GI, and no application of steroids. All cases with negative overall Glucksberg and negative GvHD-stage lower GI were included in the group of biopsies with < 2 CAB/10 crypts. Therefore, < 1 CAB/10 crypts appeared to be a relatively reliable cut-off value to identify cases without GvHD.

Shortcomings of our study are the retrospective nature and the fact that for correlation with clinical data in the Round-Robin test only one paraffin specimen per time-point and patient was investigated, therefore neglecting possible differences between different biopsy sites. The relatively low number of cases included for the clinical correlation may also have obscured the stratification of low- and high-grade findings. Any study based on histology after HSCT may face the problem of differentiating GvHD from mycophenolate mofetil (MMF)-colitis since the latter may mimic intestinal GvHD histologically [22, 33]. However, in the setting of solid organ transplantation, MMF-colitis is associated with GvHD-like histology in only a subset of cases [4, 29]. Moreover, apoptotic microabscesses (classified as crypt destruction by us) were reported to be absent in MMF-colitis [33], so that the differential diagnosis of MMF-colitis would be limited mainly to a subset of cases, which are treated with MMF and have low-grade GvHD. The significant association of histological and clinical signs of GvHD argues against MMF-colitis strongly confounding our results.

Taken together, our data indicate that relatively high concordance of grading aGvHD between pathologists can be achieved when histological parameters are well defined and easily recognized, whereas reproducibility of the more complex and poorly defined grading systems is more difficult to obtain. As it stands, all previously published histopathological grading systems showed high correlations with each other and were able to reflect clinical findings in a significant manner. Histology appears to be helpful in confirming the diagnosis of aGvHD, whereas reliability was much worse in terms of the stratification of GvHD severity. Additionally, more simplified sum scores showed a slightly better reproducibility, retaining a comparable correlation to the clinical findings, a concept that we were able to reproduce in a validation cohort. A definite cut-off in CAB counts for the diagnosis of aGvHD of the colon does not exist; however, cases without clinical signs of GvHD were significantly associated with < 1 CAB/10 crypts.

In conclusion, for the moment a combination of Lerner grading, based on morphology alone, and assignment of the NIH category proposed by the NIH Consensus development project [31], but in part dependent on clinical information, appears to be a pragmatic approach for the reporting of intestinal GvHD. In future, sum scores, after additional validation, might offer a simplified means of grading GvHD as they were slightly more reproducible across our team than previously published histological gradings and more straightforward to use as morphological parameters are simply added up and not transferred into a qualitative new grade. Finally, even if only very few CAB are present in a biopsy the possibility of GvHD should be considered as a diagnosis.