Introduction

Currently, the gold standard for the diagnosis of celiac disease (CD) and seronegative villous atrophy (SNVA) is based on duodenal histology [1]. Apart from the invasive nature of a gastroduodenoscopy, histological diagnosis is prone to error. Unless an appropriate number of biopsies is taken (at least four, including one from the duodenal bulb [1, 2]) and the samples are properly oriented [2], a false negative result can occur. This is mainly because the mucosal distribution of CD is patchy [3,4,5]. Undiagnosed CD has long-term implications such as osteoporosis, iron deficiency anaemia, refractory celiac disease and small bowel (SB) malignancy [6].

CD and SNVA are histologically similar [7] and sometimes it is difficult to make a distinction between the two conditions based on other clinical and serological tests. Small bowel capsule endoscopy (SBCE) is carried out in patients with SNVA to assess for features of CD and to rule out other causes of villous atrophy [8]. Differentiating between CD and SNVA is important because of the different management these patients require. Patients with SNVA of unknown cause have been shown to respond to immunosuppressive therapy [9]. The mainstay of management of patients with CD is gluten-free diet (GFD) [10].

Determining severity of CD can be useful in the follow-up of patients as this enables comparison to be made [11]. Improving the diagnostic yield of SBCE through machine learning methods can help overcome some of these pitfalls in the diagnosis of CD and SNVA.

Machine learning for the detection of pathology on SBCEs has been previously explored, including to quantify aspects of macroscopic features of CD [12], for the detection of angioectasias on SBCE [13], in the delineation of small bowel (SB) tumours [14], the recognition of inflammatory changes on SBCE [15] and even in the detection and assessment of colonic polyps on colon capsule endoscopy [16]. These studies have a common aim: to improve the delineation of pathology on SBCE. There is, however, no literature on the use of machine learning methods to improve on the sensitivities currently reported for human readers in the detection of pathology on SBCE.

One aim was to assess whether a probabilistic model could be used to predict severity of duodenal histology in patients with CD and SNVA by considering features on SBCE. Another aim was to assess whether a similar model could be used to predict the type of disease (CD or SNVA) through macroscopic features on SBCE.

Methodology

Study design and patients

Patients with newly diagnosed and established CD and SNVA were included in this study over a one-year period from a tertiary centre for the management of CD. All patients had a confirmed diagnosis of CD or SNVA from serology and histology. Patients with SNVA had negative CD serology and were not on GFD at the time of histological diagnosis. They underwent extensive investigations to rule out other causes of SNVA such as inflammatory or infective conditions [9, 17]. All patients had a gastroduodenoscopy for duodenal histology within two weeks of SBCE, and contemporary CD serology was checked. They underwent a SBCE to assess severity of disease, to rule out complications and to exclude other causes for SNVA such as Crohn’s disease.

SBCEs were de-identified and read by two expert reviewers (more than 300 capsules per year) who were blinded to each other’s findings, the indication for SBCE and the histology result from duodenal biopsies. To increase the size and variation of the dataset, a patient’s SBCE was entered as two separate readings if the two reviewers read it differently. After including these additional readings, the dataset comprised 81 sets of features (i.e. readings), corresponding to 72 original patients.

Duodenal histology

At least two biopsies from the duodenal bulb and four biopsies from the second part of the duodenum were taken during gastroduodenoscopy. The histology was then classified according to the modified Marsh Criteria which reflects severity of changes [18] (Table 1). All histological samples were reviewed by two expert histopathologists. In the case of discrepancy, a third histopathologist was involved.

Table 1 Marsh classification of histological changes of celiac disease; *IEL: intraepithelial lymphocytes

Small bowel capsule endoscopy

All patients underwent SBCE using Pillcam SB3 (Medtronic, Minneapolis, USA) [19].

Features reviewed included: total area affected, patchy/continuous pattern and macroscopic features of CD: mosaic pattern, fissuring of mucosa, scalloping of folds, villous atrophy, nodularity of mucosa and presence of ulcers (Fig. 1).

Fig. 1

(a) fissuring of folds, (b) scalloping of folds, (c) villous atrophy, (d) mosaic pattern, (e) nodularity and (f) ulcers

Ethical considerations

The study protocol was approved by the Yorkshire and the Humber Research Ethics committee (IRAS 232382) and registered with the local research and development department of Sheffield Teaching Hospital NHS Foundation Trust under the registration number STH 19998. All SBCEs used in this study were de-identified. No additional consent was required for the study with the use of de-identified videos as assessed and approved formally by the Research Ethics Committee.

Capsule endoscopy features

CD on SBCE was represented by nine features, {f1, …, f9}.

$$ SBCE=\left\{{f}_1,\dots, {f}_9\right\} $$
(1)

Each feature fi was considered to be a categorical variable, with possible values fi ∈ {1, …, Ki}, where each value corresponded to a condition of that feature (Table 2).

Table 2 SBCE feature descriptions

Target predictions

The data were used to build two predictive models. The class associated with each patient was defined by the value of the variable c. The first model predicted mild (c = 1) or severe Marsh scores (c = 2). The aim of the second model was to differentiate between patients with SNVA (c = 1), and CD (c = 2). Two, two-class classifiers were defined, where c ∈ {1, 2} (Tables 3 and 4).

Table 3 Mild vs Severe Marsh Scores
Table 4 CD (celiac disease) vs SNVA (seronegative villous atrophy)

Probabilistic analysis of features

Following a probabilistic approach, each feature fi was considered to be categorically distributed,

$$ p\left({f}_i=k|c\right)= Cat\left({f}_i|{\lambda}_i\right) $$
(2)

The parameter λi can be considered to be a histogram over the K possible conditions for the feature fi (in the class c); therefore, k ∈ {1, …, K} and λi = {λi1, …, λiK}. Specifically, the probability of condition k was,

$$ p\left({f}_i=k\right)={\lambda}_{ik} $$
(3)

In words, p(fi = k| c) was the likelihood that the condition of feature fi was equal to k, given the class c. The features were considered to be conditionally independent given the class. This assumption is appropriate for smaller datasets, as there are fewer parameters to learn from the data. The assumption of conditional independence (between features) is common for Naïve Bayes classification, which is shown to be appropriate in many applications, even if correlation between the features is expected [20]. The likelihood of a SBCE given the class c was the product of the likelihoods of its corresponding features,

$$ p\left( SBCE|c\right)=\prod \limits_{i=1}^9p\left({f}_i=k|c\right) $$
(4)

such that p(SBCE| c) was the likelihood of observing a SBCE, given the patient was in class c; for example, the likelihood of a given SBCE feature-set, given that the patient was in the severe Marsh score group.
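As an illustration of Eq. 4, the product of feature likelihoods can be sketched in Python. This is a minimal sketch rather than the code used in the study: the function name `log_likelihood`, the three binary features and all probability values are hypothetical, and the sum of logarithms is used only as a numerically convenient equivalent of the product.

```python
import math

def log_likelihood(features, tables):
    """Log of Eq. 4: sum of log p(f_i = k | c) over the features of one reading.

    features: list of condition indices (1-based), one per feature.
    tables:   per-feature probability lists for a single class c,
              where tables[i][k - 1] = p(f_i = k | c).
    """
    return sum(math.log(tables[i][k - 1]) for i, k in enumerate(features))

# Hypothetical probabilities for three binary features in one class (not study data).
tables_severe = [[0.2, 0.8], [0.4, 0.6], [0.5, 0.5]]
reading = [2, 2, 1]

likelihood = math.exp(log_likelihood(reading, tables_severe))
print(round(likelihood, 3))  # 0.8 * 0.6 * 0.5 = 0.24
```

With nine features the product of probabilities can become very small, which is why the sketch accumulates logarithms before exponentiating.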

To inform feature analysis, and to make predictions, the parameters of the distribution (for each feature) were learnt from the available data. In this work, a Bayesian estimate of the parameters was used, to mitigate overtraining, and account for the zero-count problem [20, 21]. Additionally, a Bayesian approach leads to distributions over the predicted values, allowing for the uncertainty associated with predictions to be approximated. To calculate Bayesian estimates, a prior distribution was placed over the parameters of the categorical distribution, λi, for each feature. The parameters were then marginalised out (from the model) by integration. An appropriate prior was the Dirichlet distribution, as it is conjugate to the categorical distribution [21]. Conjugacy is desirable, as it leads to tractable solutions (the functional form of the posterior distribution will be the same as the prior). The distribution is specified by,

$$ p\left({\lambda}_i\right)= Dir\left({\lambda}_i|\alpha \right) $$
(5)
$$ \alpha =\left\{{\alpha}_1,\dots, {\alpha}_K\right\} $$
(6)

The parameters of the Dirichlet distribution, α, were set to 1 (for conditions 1 to K) for each feature. In terms of regularisation, this corresponded to add-one or Laplace smoothing [20]. As discussed, the effects of the parameters were then integrated out, to provide the posterior-predictive distribution. In this case, this was a posterior-predictive likelihood,

$$ p\left({f}_i=k|c\right)=\int p\left({f}_i=k|c,{\lambda}_i\right)p\left({\lambda}_i\right) d{\lambda}_i $$
(7)
$$ p\left({f}_i=k|c\right)=\frac{{N}_k+{\alpha}_k}{N+{\sum}_{k^{\prime }=1}^K{\alpha}_{k^{\prime }}} $$
(8)

N was the total number of patients in class c and Nk was the number of those patients with the value k for the feature fi. The likelihood p(fi = k| c) was the (estimated) proportion of the patients in class c with condition k for feature fi; for example, the proportion of patients in the dataset with a severe Marsh score (c = 2), who showed (k = 2) scalloping (f5).
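The smoothed estimate in Eq. 8 can be sketched for a single feature within a single class as follows. This is a minimal sketch under stated assumptions: the function name `posterior_predictive` and the observations are hypothetical, and a symmetric Dirichlet parameter is assumed (α = 1 for every condition, as described above).

```python
from collections import Counter

def posterior_predictive(values, n_conditions, alpha=1.0):
    """Eq. 8 for one feature in one class.

    values:       observed conditions (1..n_conditions) for the readings in class c.
    n_conditions: K, the number of possible conditions for the feature.
    alpha:        symmetric Dirichlet parameter (1.0 gives add-one smoothing).
    """
    counts = Counter(values)
    total = len(values) + alpha * n_conditions
    return [(counts.get(k, 0) + alpha) / total for k in range(1, n_conditions + 1)]

# Hypothetical observations (not study data): condition 3 never occurs in this class.
probs = posterior_predictive([1, 2, 2, 2, 1, 2], n_conditions=3)
print(probs)  # (2+1)/9, (4+1)/9 and (0+1)/9
```

The unobserved condition still receives a non-zero probability (1/9 here), which is how the prior avoids the zero-count problem when the feature likelihoods are multiplied together.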

Results

Characteristics of the cohort studied

Seventy-two patients (45; 62.5% females, mean age 52.5 ± 16.6 years) were included in this study. Patients had a diagnosis of CD (51, 70.8%) or SNVA (21, 29.2%). Marsh histology is shown in Table 5. A small proportion of patients (n = 14; 19%) had extensive abnormal small bowel mucosa beyond the proximal area. Seventeen patients (24%) had a normal SBCE. Patients had the following features of CD on SBCE: 30 (42%) scalloping, 42 (58%) fissuring, 38 (53%) mosaicism, 14 (19%) villous atrophy, 12 (17%) nodularity, 2 (3%) ulcers (Table 6).

Table 5 Marsh classification of disease
Table 6 Features of celiac disease on SBCE

Feature analysis – Severity of Marsh classification of disease

The posterior-predictive likelihoods for the features associated with mild vs severe Marsh scores are illustrated in Fig. 2. The likelihoods can be interpreted as the histogram over the possible conditions for each feature, given the class of patients, and the information in all the available data. The likelihoods showed that all features (other than f8 for the Marsh score model) were likely to represent more severe disease, i.e. k > 1, for:

  • severe Marsh scores (rather than mild)

  • CD (rather than SNVA).

Fig. 2

Histograms learnt from the data, showing the likelihood of each feature, given the class of patients corresponding to: mild (white), or severe (black) Marsh score

Patients with higher Marsh scores were more likely to have a positive SBCE and a continuous distribution of macroscopic features (Fig. 2). Features including mosaic pattern, fissuring and scalloping of folds were indicative of a more severe Marsh classification. Villous atrophy and nodularity of mucosa were more common in patients with severe Marsh scores but the difference was not distinct. Ulcers were present in a small number of patients, and frequencies were similar in both groups. Overall, most patients had less than 50% of the SB involved, and those with more severe Marsh scores had more extensive SB involvement.

Feature analysis – Type of disease (SNVA or CD)

The posterior-predictive likelihoods for SNVA and CD are shown in Fig. 3. Again, the likelihoods can be interpreted as the histogram over the possible conditions for each feature, given the class of patients, and the information in all the available data. Fig. 3 illustrates that patients with CD were more likely to have a positive SBCE than those with SNVA. Patients with CD were more likely to have a continuous distribution of features. Patients with SNVA were more likely to have patchy disease. Features such as mosaic pattern of the mucosa, fissuring of folds, scalloping, villous atrophy and nodularity of mucosa were more likely to be present in patients with CD than patients with SNVA. Patients with CD were more likely to have extensive SB involvement.

Fig. 3

Histograms learnt from the data, showing the likelihood of each feature, given the class corresponding to: seronegative villous atrophy (white), or celiac disease (black)

Predictive model: Maximum-likelihood

A maximum likelihood approach was used to define a model for prediction. Specifically, the predicted class, \( \hat{c_{\ast }} \), was the class with the maximum posterior-predictive likelihood for the SBCE,

$$ \hat{c_{\ast }}=\underset{c\in \left\{1,2\right\}}{\arg \max}\ p\left( SBC{E}_{\ast }|c\right) $$
(9)

The performance of each classifier was assessed using Leave-One-Out cross-validation (LOOCV) [22]. This involved learning the parameters of the likelihood model (Eq. 8) from all of the available data except one held-out reading (the validation data). The model was then used to predict the class of the held-out reading, and a score was recorded. The score was 1 if the reading was correctly classified and 0 if it was misclassified. This process was repeated, such that all 81 SBCE readings were held out for validation in turn. Finally, the score over the whole data set was presented as a (percentage) accuracy.

Following LOOCV, the validation accuracy was 69.1% for both models: when predicting the severity of the Marsh score (Table 3) and when distinguishing between CD and SNVA (Table 4).
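The LOOCV procedure with the maximum-likelihood rule of Eq. 9 can be sketched as below. This is an illustrative sketch and not the implementation used in the study: the helper names `fit`, `predict` and `loocv_accuracy` are hypothetical, the toy data contain two binary features rather than the nine SBCE features, and ties are resolved in favour of class 1 as an arbitrary choice.

```python
import math
from collections import Counter

def fit(readings, labels, n_conditions, alpha=1.0):
    """Smoothed per-class, per-feature probability tables (Eq. 8)."""
    tables = {}
    for c in (1, 2):
        rows = [r for r, y in zip(readings, labels) if y == c]
        tables[c] = []
        for i, K in enumerate(n_conditions):
            counts = Counter(r[i] for r in rows)
            total = len(rows) + alpha * K
            tables[c].append([(counts.get(k, 0) + alpha) / total
                              for k in range(1, K + 1)])
    return tables

def predict(reading, tables):
    """Eq. 9: return the class with the largest (log) likelihood."""
    def loglik(c):
        return sum(math.log(tables[c][i][k - 1]) for i, k in enumerate(reading))
    return max((1, 2), key=loglik)

def loocv_accuracy(readings, labels, n_conditions):
    """Hold out each reading in turn, refit on the rest, and score the prediction."""
    hits = 0
    for j in range(len(readings)):
        train_x = readings[:j] + readings[j + 1:]
        train_y = labels[:j] + labels[j + 1:]
        tables = fit(train_x, train_y, n_conditions)
        hits += int(predict(readings[j], tables) == labels[j])
    return 100.0 * hits / len(readings)

# Toy, perfectly separable data (not study data): two binary features.
X = [[1, 1], [1, 1], [1, 1], [2, 2], [2, 2], [2, 2]]
y = [1, 1, 1, 2, 2, 2]
accuracy = loocv_accuracy(X, y, n_conditions=[2, 2])
print(accuracy)  # 100.0 for this separable toy example
```

The probability tables are re-estimated inside the loop, so the held-out reading never contributes to the parameters used to classify it.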

Predictive model: Naïve Bayes

To estimate the probability of each class given the SBCE of a patient, the distribution over c could be estimated and included in the model. To include this information, p(c) was estimated from the data. In other words, the probability of observing a class, c, was estimated, given the number of times it occurred in the dataset (in a similar manner to the feature analysis). Therefore,

$$ p(c)= Cat\left(c|\pi \right) $$
(10)

The parameter π was considered to be a histogram over the two possible classes (for each predictor), c ∈ {1, 2}. Thus, π = {π1, π2}, and,

$$ p\left(c=i\right)={\pi}_i $$
(11)

Again, a Dirichlet prior was placed over π, and the α parameter was set to a vector of ones, corresponding to add-one (Laplace) smoothing. The marginal probability over classes could then be estimated by,

$$ p(c)=\frac{\left({N}_c+1\right)}{N+2} $$
(12)

Note, N was the total number of SBCE readings, and Nc was the number of readings in class c. A probabilistic classifier could then be defined using Bayes’ rule [20], which was used to define a posterior-predictive distribution over the class groups, given the features from the SBCE of a patient,

$$ p\left({c}_{\ast }| SBC{E}_{\ast}\right)=\frac{p\left( SBC{E}_{\ast }|\ {c}_{\ast}\right)\ p\left({c}_{\ast}\right)}{\sum_{c=1}^2p\left( SBC{E}_{\ast }|\ c\right)\ p(c)} $$
(13)

Therefore, p(c| SBCE) was a histogram over the two classes, corresponding to the probability of a patient belonging to:

  • mild Marsh scores or severe Marsh scores;

  • SNVA or CD.

Importantly, the success of this approach requires that the proportion of patients in each class is representative of future data.
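The effect of the class prior in Eqs. 12 and 13 can be illustrated with a short sketch. The function names `class_prior` and `posterior`, the labels and the likelihood values are all hypothetical and are chosen only to show that the prior can change the predicted class relative to the maximum-likelihood rule.

```python
import math

def class_prior(labels, alpha=1.0):
    """Eq. 12: smoothed class proportions for c in {1, 2}."""
    n = len(labels)
    return {c: (labels.count(c) + alpha) / (n + 2 * alpha) for c in (1, 2)}

def posterior(log_likelihoods, prior):
    """Eq. 13: normalise likelihood x prior over the two classes.

    log_likelihoods: {c: log p(SBCE* | c)} for c in {1, 2}.
    """
    joint = {c: math.exp(log_likelihoods[c]) * prior[c] for c in (1, 2)}
    evidence = sum(joint.values())
    return {c: joint[c] / evidence for c in (1, 2)}

# Hypothetical labels and likelihoods (not study data).
prior = class_prior([1, 2, 2, 2])  # (1+1)/6 for class 1, (3+1)/6 for class 2
post = posterior({1: math.log(0.03), 2: math.log(0.02)}, prior)
print(post)  # class 1: 3/7, class 2: 4/7
```

In this toy case the likelihood alone favours class 1 (0.03 against 0.02), whereas the posterior favours class 2 because that class is more common in the training labels. This is the mechanism by which including p(c) can raise or lower the validation accuracy, depending on how representative the class proportions are.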

Firstly, the estimate of p(c) was included in the model used to predict Marsh scores. In this case, the LOOCV accuracy decreased from 69.1% to 56.8%. The estimate of p(c) was also included in the model used to differentiate between SNVA and CD. For this predictor, the validation accuracy increased from 69.1% to 75.3%.

Discussion

This is the first study that demonstrates how macroscopic features of CD/SNVA on SBCE can help predict the severity of disease. It is also the first study to show that pattern recognition on SBCE can help distinguish between CD and SNVA.

In patients with newly-diagnosed CD, severity of duodenal histology at the time of diagnosis can be predictive of histological recovery at one year from diagnosis [23]. In established CD, incomplete mucosal recovery can also be predictive of complications [24]. Predicting severity of disease by considering features on SBCE can help physicians assess the risk of complications or predict the time needed for mucosal recovery. Quantifying severity of CD on SBCE is useful in the follow up of patients with CD, as this enables a comparison to be made between SBCEs before and after treatment is commenced [11, 25]. Patients with persistently similar features on follow up SBCE can be managed more aggressively by adding immunosuppressants. In this study, patients with higher Marsh scores were more likely to have macroscopic evidence of CD on SBCE, and this evidence extended over a greater length of the SB.

There are several causes of SNVA such as medications, infections and inflammatory conditions [17]. These can cause histological changes that can mimic CD. CD and SNVA cannot be distinguished by histology alone; other parameters must be taken into consideration, such as human leucocyte antigen genotype, CD serology and exclusion of other causes of villous atrophy. Recognition of differences in patterns of disease in the SB for both conditions can enable SBCE to play an additional role in their distinction. Differentiating between the two conditions is crucial because of different management [26] and also because of the higher mortality associated with SNVA than with CD [27]. In the second model, which distinguished between CD and SNVA, patients with CD were more likely to have macroscopic evidence of disease on SBCE than patients with SNVA.

The sensitivity of SBCE to detect features of CD varies between 71 and 93% [25, 28,29,30,31,32,33]. For both models, a LOOCV validation accuracy of 69.1% was achieved using a maximum (posterior-predictive) likelihood approach. While these accuracies may seem low, the performance is near the quoted sensitivity of the SBCE, which effectively limits the classification accuracy of the model.

The reliability of the probabilistic model relies on the assumption that the proportion of patients in each class is representative of future data. When the class proportions were taken into consideration in the Marsh score severity model, the LOOCV accuracy decreased to 56.8%; therefore, it is assumed that the available data were not representative of the expected distribution over the class labels p(c) for the Marsh score subgroups (mild/severe). For CD or SNVA, however, the validation accuracy increased to 75.3% when an estimate of the marginal distribution over classes was included to define a Naïve Bayes classifier. The performance increase for the CD/SNVA model makes sense, as the estimate of the distribution p(c) (77.8% CD, 22.2% SNVA) was similar to that reported in the literature (77% CD, 29% SNVA) in a group of patients with varying villous atrophy [34].

One limitation of the study was the relatively small dataset; however, this is one of the largest CD/SNVA SBCE datasets in the current literature. Another limitation of this study is the lack of distinction between SNVA-CD, SNVA in patients with an unknown cause and SNVA in those with an identifiable cause other than CD; however, further classification into these subgroups would have rendered the groups even smaller.

Conclusions

Probabilistic analysis of macroscopic features on SBCE suggests that more pronounced and extensive features are present in patients with more severe Marsh histology scores and in patients with CD (as opposed to SNVA). The findings of this work suggest that, from the available data, SBCE features are suitable for predicting whether a patient has CD or SNVA and for determining the severity of disease. With validation on new patients, this implies that the diagnosis of CD or SNVA and the severity of disease may be supported by probabilistic machine learning, applied to SBCE features.