Background

Interstitial lung disease (ILD) refers to a range of diagnostic entities which show varying degrees of inflammation and fibrosis [1]. Idiopathic pulmonary fibrosis (IPF) is the most common idiopathic ILD and the most lethal, with 50% mortality within 3–5 years after diagnosis [2]. IPF was initially considered a chronic inflammatory disease and was therefore commonly treated with immunosuppression. More recently, it has been shown that immunosuppressive therapies have a detrimental effect while antifibrotic drugs are effective in slowing the progression of the disease [3, 4]. Other ILDs such as collagen vascular disease-associated ILD and fibrotic hypersensitivity pneumonitis (HP) are still potentially treated with immunosuppression, even though so-called progressive fibrotic ILD of any association may ultimately benefit from antifibrotics [5]. It is therefore essential that a distinction be made between IPF and these other ILDs.

Current IPF diagnostic guidelines assess criteria across clinical, radiologic and pathologic domains which are then combined to create a probabilistic estimate of the diagnosis [6,7,8]. Modern consensus histopathologic criteria for evaluation of IPF include (1) extent and pattern of fibrosis, (2) extent and pattern of inflammation and (3) presence of other features (e.g. foreign material) that would indicate another diagnosis. An underlying principle of this approach is that the clinical entity of IPF is uniquely associated with the histologic pattern of usual interstitial pneumonia (UIP), which is distinguishable from other patterns in other diseases. However, the utility of these histologic criteria is limited because of significant interobserver variation [9,10,11,12]. We were interested in understanding the underlying basis for this variation, as that has not been well described in previous studies. We further hypothesized that understanding this variation would allow us to improve performance by developing improved criteria and testing them on a subset of previously evaluated cases and on a set of new cases.

Materials and methods

A website was created for this project which displayed both fixed images of selected features relevant to diagnosis of IPF and whole slide images (WSI) of cases of ILD [13]. These images were displayed adjacent to questions concerning presence or absence of specific features and, in cases of WSI, final diagnosis (Table 1, Additional file 2: Figs. S1–S2A–B) (See reference [13] for web address). In those ILD cases which involved multiple slides, selected representative slides were chosen with a range of 1–4 WSI per case. The website collected user specific answers based on unique sign ins. The website is publicly accessible and currently displays all of the images used in this study and a summary of the data generated. No identifiable patient information is available on that website or was used in this project. The pathologists who participated in this study are all senior academic thoracic pathologists, many of whom have published on this and related topics and have recognized expertise by serving as referral specialists in tertiary academic centers [1, 14].

Table 1 Questions used for criteria sets and whole slide images

This study consisted of two rounds of review. In the first round, pathologists were asked to (1) categorize fixed images of lung using standard criteria used for diagnosis of ILD and (2) to categorize cases of ILD using whole slide images. The criteria sets were created for the domains of fibrosis (25 images), inflammation (9 images), granulomas (10 images) and fibroblast foci (7 images). These images were selected by the senior author to include a range of characteristics required to make a diagnosis of UIP and other ILD diagnoses. Eight pathologists provided answers to those images. The whole slide images were created from thirty wedge biopsy cases and included a range of ILD diagnoses. These cases were selected by the senior author to include the range of cases seen in routine practice. Seven pathologists provided answers for these cases.

After the initial round, the data was evaluated and shared with the participants. Multiple conference calls were made among the authors to discuss cases and criteria with discrepancies. A consensus document was circulated among the authors for evaluation and a final version was used for a second round of WSI cases. For the second evaluation round, the senior author selected twenty cases (ten WSI cases from the first round and ten new cases from two of the participants’ routine sign out) for review. Ten pathologists provided answers for those cases. The criteria assessments were not repeated.

Spearman’s rank correlation analysis was performed using Prism v 8.4.3 for MacOS.

Results

Initial evaluation round

In the first round, participants reviewed a set of fixed images (“criteria sets”) from the project website which were designed to clarify the use of specific criteria relevant to fibrotic ILD as well as WSI (Table 1) [13]. Criteria sets were created for the domains of fibrosis (25 images), inflammation (9 images), granulomas (10 images) and fibroblast foci (7 images). Eight pathologists participated in the criteria set portion of the survey. Consensus, as defined by agreement among six or more of eight pathologists, varied among criteria relevant to a diagnosis of UIP (Additional file 1; Additional file 2: Figs. S3–S5). For example, the pattern and extent of fibrosis typically achieved consensus from 68 to 76% of the time (Table 2, Additional file 2: Fig. S3a–c) depending on exactly how this is evaluated. There was less agreement for questions aimed at extent and type of inflammation, including evaluation of dense inflammation away from scar and the distinction between well or poorly formed granuloma vs. scattered giant cells (Table 2, Additional file 2: Figs. S4, S5).

Table 2 Interobserver concordance on various criteria set images

Thirty wedge biopsy cases for evaluation of ILD were also selected by the senior author for the project website including two thought most likely UIP, fourteen thought most likely not UIP and fourteen that were thought ambiguous [13]. Seven pathologists answered nearly all of the WSI cases (Fig. 1, Additional file 1). In order to analyze the data, we grouped together definite and probable UIP vs. possible and not UIP. We considered consensus to be agreement among at least five of seven pathologists. Four cases were identified as probable or definite UIP pattern including the two initially thought to be UIP (cases 5, 6, 21, 30). Fourteen cases were nearly universally thought not, or at most possible, UIP, all of which were among the cases initially not thought to be UIP (cases 4, 7, 10, 12–16, 18, 22, 23, 25–27). Thirteen of those fourteen had areas of hyaline membranes, extensive organizing pneumonitis, marked increase in eosinophils, diffuse hyalinized fibrosis (“smoking related interstitial fibrosis”) and/or irregular fibrosis (see below for definition) with patchy lymphoid infiltrate and/or granulomas. The remaining twelve cases had variable interpretation by the participants with only seven of these cases reaching consensus. Even in cases ultimately reaching consensus, there was marked variation in interpretation across all criteria (Additional file 1). For example, while case 8 met consensus, only four pathologists thought that it showed severe fibrosis, and two of those thought distribution was either airway centered or diffuse rather than patchy and paraseptal/subpleural. Those who thought fibrosis was not severe thought distribution was either uncertain or airway centered. Two thought there was excess inflammation. The performance of pathologists in the initial phase of this project was consistent with the existing literature and reinforces the concept that pathologist confidence is an important component of ILD diagnosis [15].
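The grouping and consensus rule described above can be illustrated with a minimal Python sketch. The function names and the example readings are hypothetical and are not data from this study; the handling of a missing answer (skipping it while keeping the threshold fixed) is likewise an assumption, since the text does not specify it.

```python
from collections import Counter

# Diagnostic categories offered to readers, grouped as described in the text.
UIP_GROUP = {"definite UIP", "probable UIP"}
NON_UIP_GROUP = {"possible UIP", "not UIP"}


def dichotomize(diagnosis):
    """Map a four-level diagnosis onto the two groups used for consensus."""
    if diagnosis in UIP_GROUP:
        return "UIP"
    if diagnosis in NON_UIP_GROUP:
        return "non-UIP"
    raise ValueError(f"unknown diagnosis: {diagnosis!r}")


def consensus(diagnoses, threshold):
    """Return the group chosen by at least `threshold` readers, else None.

    Missing answers (None) are skipped; keeping the threshold fixed when a
    reader is missing is an assumption of this sketch.
    """
    counts = Counter(dichotomize(d) for d in diagnoses if d is not None)
    if not counts:
        return None
    group, votes = counts.most_common(1)[0]
    return group if votes >= threshold else None


# Hypothetical readings from seven pathologists (not data from this study).
case_a = ["definite UIP", "probable UIP", "probable UIP", "definite UIP",
          "probable UIP", "possible UIP", "not UIP"]
case_b = ["probable UIP", "probable UIP", "definite UIP", "probable UIP",
          "possible UIP", "not UIP", "possible UIP"]

print(consensus(case_a, threshold=5))  # five of seven fall in the UIP group
print(consensus(case_b, threshold=5))  # four versus three: no consensus
```

The same function would apply to the other thresholds used in this study (six of eight for the criteria sets, seven of ten in the second round) by changing `threshold`.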

Fig. 1

Variation in overall diagnosis of cases in initial round including definite UIP, probable UIP, possible UIP and not UIP. WSI case numbers are listed on the left. Specific pathologists are listed along the bottom

Discussion phase

We then reviewed a number of the most problematic of the criteria set images from the project website in an effort to derive more specific criteria for severity of fibrosis, distribution of fibrosis, extent of inflammation and nature of granulomas (Fig. 2a–d). We attempted to create these rules by consensus to determine if they would reduce interobserver variability. While these rules were created independent of any attempt to determine if they were clinically predictive of outcome, the criteria developed largely paralleled the clinical practice of the participants. For severe fibrosis, we adopted two independent criteria. (1) We considered that a case displayed severe fibrosis if at least 25% of the slide showed established fibrosis and the fibrotic process was distributed across the entire slide even if that process was patchy. (2) While bronchiolectasis can be seen in honeycombing, we considered that bronchiolectasis reflected severe fibrosis even in the presence of much milder fibrosis since that pattern has been associated with radiologic honeycombing, possibly due to severe fibrosis not seen in the plane of histologic section [16]. On the other hand, while honeycombing is severe fibrosis by definition, we considered honeycombing in lung tips only as non-specific. A review of Fig. 2 shows how this worked in practice. The amount of fibrosis in Fig. 2a is less than 25%, changes are not continuous across the entire slide and there is no honeycombing, so it would fail to meet criteria for severe fibrosis. For Fig. 2b, on the other hand, the presence of bronchiolectasis on a background of patchy mild to moderate fibrosis would support the final interpretation of this image as patchy severe fibrosis.
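The severe fibrosis rule described above can be written out as a short Python sketch. The field names are hypothetical, and treating honeycombing confined to the lung tips as not contributing to the call is our reading of "non-specific". The numeric values entered for Fig. 2a and Fig. 2b are illustrative only, since the text gives no exact percentages.

```python
from dataclasses import dataclass


@dataclass
class SlideFindings:
    fibrosis_fraction: float       # fraction of slide with established fibrosis (0.0-1.0)
    fibrosis_across_slide: bool    # process distributed across the entire slide, even if patchy
    bronchiolectasis: bool
    honeycombing: bool
    honeycombing_lung_tips_only: bool = False


def is_severe_fibrosis(f):
    """Apply the two consensus criteria plus the honeycombing definition."""
    # Criterion 1: at least 25% established fibrosis, distributed across the slide.
    by_extent = f.fibrosis_fraction >= 0.25 and f.fibrosis_across_slide
    # Criterion 2: bronchiolectasis counts even when the fibrosis itself is milder.
    by_bronchiolectasis = f.bronchiolectasis
    # Honeycombing is severe by definition, unless confined to the lung tips
    # (treated here as non-contributory; an assumption of this sketch).
    by_honeycombing = f.honeycombing and not f.honeycombing_lung_tips_only
    return by_extent or by_bronchiolectasis or by_honeycombing


# Illustrative encodings of the two worked examples (values are assumed).
fig_2a = SlideFindings(fibrosis_fraction=0.15, fibrosis_across_slide=False,
                       bronchiolectasis=False, honeycombing=False)
fig_2b = SlideFindings(fibrosis_fraction=0.20, fibrosis_across_slide=False,
                       bronchiolectasis=True, honeycombing=False)

print(is_severe_fibrosis(fig_2a))  # fails both criteria
print(is_severe_fibrosis(fig_2b))  # meets criterion 2
```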

Fig. 2

Selected criteria set images taken from the project website illustrative of problems and potential solutions in diagnoses of chronic fibrotic ILD [13]. a Fibrosis criteria set image 5, also used as Inflammation criterion set image 4, b Fibrosis criteria set image 25, c Fibrosis criteria set image 2, d granuloma criteria image 1. Figure 2a shows fibrosis which is focally severe predominantly around the airways in the center of the image although there is a connection to the adjacent interlobular septa. Figure 2b shows mildly cellular fibrosis without dense (“collagen”) fibrosis. The fibrosis seems to merge continuously into the non-fibrotic region without an easily drawn boundary. There is marked bronchiolectasis (the airway in the right middle is massively enlarged relative to the adjacent artery) while there is minimal fibrosis around that airway itself. This pattern of fibrosis is not clearly defined in established criteria and generates conflicting interpretation in the diagnostic categories. See text for additional discussion

Among fibrosis criteria set images, there was wide variation in evaluation of distribution of fibrosis (Additional file 2: Fig. S3c). We had difficulty categorizing the distribution of fibrosis in some cases since we considered diffuse fibrosis to show expansion of septa without residual normal septa while patchy fibrosis should show severe established fibrosis adjacent to relatively normal lung. However, in some cases (Fig. 2c, which was used as fibrotic criteria set image 2), there is neither uniform involvement nor a strong boundary between normal and fibrotic lung. We proposed creation of a new category we called irregular fibrosis which shows patchy expansion (“thickening”) of alveolar septa with some residual normal alveoli. Some might consider irregular fibrosis a milder form of either diffuse or patchy fibrosis. It would not support a diagnosis of UIP. We thought airway-based fibrosis is only confidently diagnosed when it is either an isolated or strongly dominant finding. While Fig. 2a shows an airway with fibrosis, that process shows bridging to an area of subpleural/paraseptal fibrosis and we therefore did not consider that to represent airway centric fibrosis. Similarly, we considered that honeycombing around airways is also not airway centered fibrosis if it is connected to the periphery. None of the cases in our study had examples of airway-to-airway fibrosis which some have proposed to represent airway centric fibrosis [17] nor did any of our participants consider this a common finding in their practice.

Among inflammation criteria set images, marked disagreement was seen in cases with isolated clusters of lymphoid cells, commonly associated with some degree of fibrosis (Fig. 2a). The conventional criteria require inflammation to be away from scar to be considered significant, but we agreed that too much inflammation in areas of only mild fibrosis would also be inconsistent with UIP. However, there was little agreement on how much of either was acceptable or required for that to be true.

Among granuloma criteria set images, there was striking disagreement on interpretation of cholesterol clefts in the giant cells (Fig. 2d). The presence of cholesterol clefts in scattered macrophages in airspaces may reflect response to degenerating/necrotic material and is typically thought to be of no significance. However, we ultimately recognized that true granulomas, especially those present within the interstitium, might also have such clefts and can be recognized as such if the architecture and background cellular composition are otherwise typical.

Second (final) evaluation round

For a second and final evaluation round, we selected ten WSI cases from the first round as well as ten new cases from two of the participants’ routine sign out practice. Nine of the ten selected cases from the original set were originally considered ambiguous by the senior author (other than case 18). In the second group of ten, four were confidently considered by the senior author as not UIP (cases 31, 33, 34 and 40; Fig. 3) and the remaining six were considered ambiguous. Ten pathologists evaluated all twenty cases. We did not repeat the criteria assessments. The cases were reviewed six months after the initial round, at which point we assumed that the participants would not recall their initial impression of the cases. Questions were similar to those used in the first round, but the first three questions were consolidated into two (Table 1; Additional file 2: Fig. S2b).

Fig. 3

Variation in overall diagnosis of cases in final round including definite UIP, probable UIP, possible UIP and not UIP. WSI case numbers are listed on the left. Cases 1–29 are taken from the initial round while cases 31–40 are new. Specific pathologists are listed along the bottom. Case 34 was not diagnosed by pathologist 6

We again grouped together definite and probable UIP vs. possible and not UIP. Since ten pathologists participated in this round, we considered consensus to be agreement among at least seven of the pathologists. Using the revised diagnostic criteria, twelve of the twenty cases reached consensus (Cases 1–6, 10–11, 13–14, 20, Fig. 3, Additional file 1). The specific new category of irregular fibrosis was used in eleven cases. It did not increase reproducibility, nor was it used consistently to rule in or rule out IPF/UIP. As in the first round, a confident diagnosis of possible or non-UIP was highly reproducible, with all five cases (one from initial set and four from second set) initially thought to not be UIP by the senior author achieving consensus.

If we restrict analysis to the ten cases seen in both rounds, six achieved consensus in both rounds, two cases did not achieve consensus in either round (cases 20 and 28, Fig. 4), one case lost consensus (case 11, Fig. 4) and one case gained it (case 24, Fig. 4). If we restrict analysis only to those pathologists who evaluated these ten cases in both rounds, two cases (cases 20 and 24, Fig. 4) improved agreement with one case now achieving consensus while the rest remained essentially the same (zero or one changed diagnoses) (Fig. 4).

Fig. 4

Comparison of diagnoses for WSI cases evaluated in both rounds including definite UIP, probable UIP, possible UIP and not UIP. Two sets of diagnoses are listed for each case. The upper set of diagnoses are from first round and lower are from second. WSI case numbers are from the first round. Specific pathologists are listed along the bottom. Pathologists 8, 9, and 10 did not make diagnoses in the first round

We were interested to know whether the variability in rounds one and two could be due to specific pathologists who had consistent differences of opinion from the rest, since the rate of diagnosis of UIP varied markedly with the pathologist. For example, in the first round one pathologist diagnosed definite or probable UIP three times, while another two diagnosed UIP eleven times (Fig. 1). There was a positive correlation between each pathologist's rate of diagnosis of definite or probable UIP in the first and final rounds (Spearman’s r = 0.49), although this did not reach statistical significance (p = 0.27). On the other hand, we do note that two of the three pathologists with the lowest rate of UIP diagnosis in the first round remained in the bottom three in the second round, suggesting that this may play a role for some pathologists.
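For readers unfamiliar with the statistic, the following is a minimal pure-Python sketch of Spearman's rank correlation (Pearson correlation computed on average ranks, with ties sharing a rank). It is not the Prism implementation used in this study and does not compute a p value. The per-pathologist counts are hypothetical and are not the study data.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_r(x, y):
    """Spearman's r: Pearson correlation of the two rank vectors.

    Raises ZeroDivisionError if either input is constant.
    """
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sqrt(sum((a - mx) ** 2 for a in rx))
    sy = sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)


# Hypothetical counts of definite/probable UIP diagnoses per pathologist
# in each round (seven readers who took part in both; not study data).
round_one = [3, 11, 11, 6, 8, 5, 9]
round_two = [2, 6, 9, 7, 4, 5, 8]

print(f"Spearman r = {spearman_r(round_one, round_two):.2f}")
```

Because the two rounds contained different numbers of cases, rates rather than raw counts would normally be compared; ranking makes the result identical as long as each round uses a single denominator.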

Discussion

All three of the recent standard criteria for idiopathic pulmonary fibrosis and the new hypersensitivity pneumonitis criteria include histologic criteria [6,7,8, 18]. However, there are no large validated series of images or cases derived from daily practice that serve as reference standards. Possibly as a result, significant interobserver variation exists limiting the utility of this approach [9,10,11,12, 19]. Notably all of the criteria use various quantitative assessments that are not given more specific definitions, nor are rules provided when criteria conflict. For example, lymphoid aggregates have long been noted in conventional histology of IPF and have been documented in more recent molecular characterization [19,20,21,22,23,24]. On the other hand, excess inflammation away from fibrosis is still considered to argue against a UIP diagnosis by raising consideration for HP, other hypersensitivity reaction or occult collagen vascular disease. Perhaps not surprisingly, therefore, and similar to our work, one previous study limited to IPF showed that excess inflammation and/or presence of giant cells were areas of diagnostic difficulty [9]. In comparison to that study, we examined a broader range of diagnoses and involved a larger number of pathologists. The difficulties in this differential have been explored in more detail recently by some of us [14]. Our data here documents that many of the concerns raised in that article are problems in actual practice. On the other hand, we did find that cases with findings that are considered to be inconsistent with UIP such as smoking related fibrosis, some cases with patchy inflammation and/or granulomas and cases with acute lung injury are readily distinguished from UIP, even in a whole slide imaging format. In general, we suspect that cases that have high confidence that they are not UIP are strongly reproducible as such, although we have not formally tested that.

It is important for pathologists to appreciate that there is significant mortality associated with wedge lung biopsy [25, 26]. We also note that, possibly as a result, one current trend in ILD diagnostics is to discard specific histologic categories in favor of a more general progressive fibrotic phenotype [27]. It may also be that there are no fixed borders among fibrotic ILD and that all such efforts at distinction may fail due to lack of underlying discrete categories [28].

The improvement we identified with our revised criteria was modest at best. While the criteria we used were somewhat arbitrary, they reflect the consensus of a group of pathologists who are extremely active in the field. Notably, the existing criteria have also never been subject to clinical validation but were only generated by consensus. Thus our approach is not different from that which is standard in the field. Our approach was also not too dissimilar to that of the commonly used Delphi system of developing expert consensus when data is lacking. Finally, our goal was primarily to determine if improved criteria could be developed rather than to prove that such a system improved clinical prediction. For that reason, we did not attempt to determine who was “right” in this study, e.g. by comparison to outcomes. We note that if the pathologic criteria are not reproducible, they cannot be tested either by comparison to clinical features or in clinical trials. As a result, we have not considered location or number of biopsies, nor radiologic impression in understanding our data. While those are important to consider in making a multidisciplinary diagnosis, they do not explain the problems we have outlined here. It is also common to assess interobserver variation by use of various statistical tools, e.g. Fleiss kappa. However, we note that the extent of disagreement will be heavily influenced by the case mix of patients selected for biopsy. We have not selected sequential cases from our institutions for that reason. While consensus was achieved even in the difficult cases, it is not standard of care for cases to be reviewed by a large panel such as this. The way in which we grouped cases for consensus may also have varying clinical consequences depending on other (clinical and radiologic) factors. Consequently, there will be a significant number of cases for which the consensus opinion will not be the one used clinically. One interpretation therefore is that there is a group of biopsies for which consensus may not be reached even with more precise criteria but that the number of those will vary from institution to institution. One significant limitation of our data is that we did not identify and then systematically re-analyze a large number of discordant cases. We think this is worth exploring going forward.
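For reference, the Fleiss kappa statistic mentioned above can be computed as in the following minimal Python sketch. It was not part of the analysis in this study, and the example table is hypothetical. The sketch assumes that every case is rated by the same number of readers and that more than one category is used overall.

```python
def fleiss_kappa(table):
    """Fleiss kappa for a cases-by-categories table of rating counts.

    table[i][j] is the number of readers who assigned case i to category j.
    Every row must sum to the same number of readers (an assumption here).
    """
    n_cases = len(table)
    n_raters = sum(table[0])
    if any(sum(row) != n_raters for row in table):
        raise ValueError("each case must be rated by the same number of readers")
    n_categories = len(table[0])
    total = n_cases * n_raters

    # Overall proportion of ratings falling in each category.
    p_j = [sum(row[j] for row in table) / total for j in range(n_categories)]
    # Observed agreement within each case.
    p_i = [(sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
           for row in table]

    p_bar = sum(p_i) / n_cases          # mean observed agreement
    p_e = sum(p * p for p in p_j)       # agreement expected by chance
    return (p_bar - p_e) / (1 - p_e)    # undefined if only one category is used


# Hypothetical counts for four cases read by seven pathologists,
# columns = [UIP group, non-UIP group]; not data from this study.
example = [[7, 0], [5, 2], [4, 3], [0, 7]]
print(f"Fleiss kappa = {fleiss_kappa(example):.2f}")
```

Because the chance-agreement term depends on the overall proportion of ratings in each category, the value obtained shifts with the mix of clear-cut and ambiguous cases in the table, which is the case-mix dependence noted above.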

It is possible that intrinsic propensity among pathologists for diagnosing UIP/IPF accounts for some of these discrepancies. We therefore suggest that benchmarking rates of ILD diagnosis among pathologists needs to be further explored to understand this source of diagnostic variation. Other areas of pathology, e.g. evaluation of Barrett’s esophagus, have successfully adopted this concept using web-based approaches [29]. Finally, it may be necessary to combine this kind of analysis with newer technologies including image analysis and biomarkers to create the desired result. Such technologies are in development but are not yet routinely incorporated into clinical practice [30].