1 Introduction

Glaucoma is a group of eye conditions that damage the optic nerve and impair vision, and it is mainly diagnosed by ophthalmologists examining fundus images. Since glaucoma is an optic nerve-related disease, most existing works diagnose it by automatically computing the optic cup-to-disk ratio with deep neural networks [1]. Researchers have also found that ganglion cells and nerve fibers [2, 3] are strongly related to glaucoma in its early stage, which provides new diagnostic indicators for glaucoma. Figure 1 illustrates a glaucoma fundus image with marked features and a comparison of healthy and glaucomatous vision.

Recently, many glaucoma datasets have been released to the research community [4, 5] to accelerate the development of data-driven deep neural methods for glaucoma detection. However, most available glaucoma datasets are small in volume because collecting and annotating medical images is far more challenging than for natural images. MT-UDA improves model performance by introducing binocular correlation in diabetic severity grading [6]. In addition, some medical images cannot be released at all due to privacy protection. Therefore, some glaucoma datasets contain only tens of glaucoma samples, e.g., 10 and 15 glaucoma samples in the whole DRHAGIS [7] and HRF [8] datasets, respectively. The recently proposed REFUGE [4] dataset also has only 40 glaucoma samples in its training set. Thus, it is not easy to apply traditional machine learning algorithms in these low-resource data scenarios. Some works employ transfer learning to alleviate the low-resource training difficulty of deep models. However, this transfer learning paradigm is still hard-coded finetuning that does not consider the domain discrepancies across glaucoma datasets.

Fig. 1

Glaucoma fundus image and vision comparison. The fundus image is marked with glaucoma-related features: optic disk, optic cup, ganglion cell, and ROI (region of interest). The right sub-figure shows the vision loss of a glaucoma patient

Another challenge is the domain divergence across datasets, since different optometrists collect them with diverse devices, such as REFUGE [4] from a Zeiss Visucam-500, RIM-ONE [9] from a Kowa WX-3D, and ACRIMA [10] from a Topcon TRC. These differences directly result in intrinsic discrepancies of the fundus images across glaucoma datasets, including image quality, lightness variations, resolution discrepancies, viewpoint changes, etc. All these discrepancies enlarge the domain gap between the aforementioned glaucoma datasets. Existing works point out that a stable network input with a similar data distribution helps the convergence of deep models as well as benefits prediction performance [11, 12]. In contrast, the domain divergence of glaucoma datasets increases the instability of deep model inputs and leads to performance declines; this divergence generally exists across glaucoma datasets but is mostly neglected in glaucoma detection research.

Transfer learning is widely used in low-resource learning scenarios and also reduces the domain gap between the source and target domains [13]. Specifically, some domain adaptation works reduce domain discrepancies by learning a shared feature space to represent multiple domains (e.g., feature sharing [14], domain confusion [15, 16]). Similarly, feature disentanglement methods reduce the domain gap from another angle by learning a domain-specific feature representation for each domain [17]. Nevertheless, all these adaptation approaches perform domain adaptation through an implicit learning paradigm, which makes it hard to guarantee the adaptation explicitly. Moreover, the aforementioned learning-based adaptation methods neglect the outlier samples in the adapted domains, which tend to be classified incorrectly. Therefore, effective transfer learning on outliers is urgently needed for glaucoma detection.

To address the aforementioned issues of transfer learning in the glaucoma detection task, we propose a mixup domain adaptation (mixDA) that bridges the domain gaps across glaucoma datasets in an explicit domain adaptation manner. Figure 2 shows the overview of our mixDA, which integrates domain adaptation and domain mixup into one framework with an enhanced outlier-learning capability. Generally, the source domains in transfer learning have significant gaps to the target domain, with discriminated fundus images and divergent data distributions. Figure 2 also presents the domain gap by visualizing the data distributions of glaucoma datasets. To reduce the gap from a source domain to the target domain, the domain adaptation (DA) of our mixDA transforms data from the source domain to the target domain in an explicit manner, which avoids the implicit adaptation of feature-learning-based approaches. Moreover, our mixDA improves the hard-coded transfer learning paradigm by integrating mixup into domain adaptation, which mixes the original data with adapted data into a new mixup sample to enhance the generalized learning capability.

Fig. 2

Overview of mixup domain adaptation (mixDA). Our mixDA consists of two key modules: domain adaptation (DA) and domain mixup (Mixup). The pipeline of distributions illustrates how the data change after different modules. Here, “Source-1” is the data distribution of the ORIGA training set, “Source-2” is the data distribution of the REFUGE training set, and “Target” is the data distribution of the REFUGE validation set

To further address the low-resource data issue of deep models in transfer learning, our mixDA extends the vanilla mixup [18] from an inter-domain to a cross-domain fashion. Moreover, mixDA formulates the mixup of DA and the cross-domain mixup into one uniform domain mixup. This generalized domain mixup enhances our model's learning capability on the outliers (vicinal samples) in the adapted distribution. Correspondingly, Fig. 2 shows that the adapted source domain still contains outliers between the adapted source domain and the target domain, which are also called vicinal samples or adversarial samples in [19]. Thereby, our domain mixup works in a cross-domain fashion, mixing the vicinal samples with other samples to improve the model's generalization capability. Different from the hard-coded finetuning method that directly tunes on the target domain and neglects the domain gap, our domain mixup fills the small discrepancy gaps between the adapted domain and the target domain in a soft-filling manner. Benefiting from the generalized learning capability, our domain mixup not only reduces domain adaptation discrepancies but also bridges the domain gaps of glaucoma datasets by smoothly filling the gaps. Lastly, we also note that our mixDA is a backbone-free approach whose performance can be further enhanced with stronger backbones.

We conclude the main contributions as follows:

  1.

    Mixup domain adaptation (mixDA) replaces the implicit adaptation manner of existing transfer learning paradigms with an explicit domain adaptation manner to improve model performance on diverse glaucoma detection datasets.

  2.

    mixDA unifies the inter-domain and cross-domain mixup into one uniform fashion, which reduces the domain gap across glaucoma datasets and enhances the model's generalization performance by minimizing the vicinal risk of adapted-domain outliers.

  3.

    Extensive experiments show the superiority of our mixDA over state-of-the-art glaucoma detection baselines on several public glaucoma datasets.

2 Related works

2.1 Glaucoma detection

Ophthalmologists generally diagnose glaucoma by manually measuring the optic cup-to-disk ratio on fundus images. Some early glaucoma detection works employ deep neural networks to automatically learn the changes of the optic cup and optic disk to help ophthalmologists diagnose glaucoma [20]. Later works on glaucoma detection focus on the optic disk area (region of interest, ROI) instead of the whole original image. To further exploit this prior knowledge, an intuitive solution is the two-stage detection paradigm, which first segments the ROI from the fundus image and then employs an advanced deep learning backbone to perform glaucoma classification on the cropped fundus images. Concretely, [1] employs DeepLab and MobileNet as its feature encoder and segmentation module, respectively, to perform glaucoma detection. Inspired by attentive networks for diabetic retinopathy grading [21, 22], AMNet [23] similarly utilizes a Faster-RCNN as its segmentation module to crop the optic disk area, then multiplies the segmented mask with the fundus image as an attentive glaucoma detection approach. Therefore, more and more glaucoma detection works follow the two-stage paradigm in their detection pipelines. There are also works making efforts from other aspects: SenBr [24] proposes a multi-branch network to distinguish the difficulty of training data, aiming to pay more attention to learning hard samples. EGDCL learns the hard samples with curriculum learning to relieve the data bias issue in glaucoma datasets [25]. Recently, some ophthalmologists found that, outside of the optic disk area, apoptotic retinal cells [2] and the retinal nerve fiber layer [3] are also strongly related to glaucoma in its early stage. These findings reveal that the ROI area (optic nerve head area) is not the sole indicator for glaucoma detection, which is mostly neglected in previous research. Moreover, the data discrepancy across glaucoma datasets and the small number of training samples further increase the challenge of the glaucoma detection task.

2.2 Transfer learning

Transfer learning is a sub-category of machine learning that transfers knowledge learned from a source domain to a target domain. Pretrained models, which obtain knowledge by training on a source dataset, have greatly improved model performance in various downstream tasks [16, 26,27,28]. Domain gaps typically exist between the source and target domains and significantly impact model performance; fortunately, their influence can be alleviated by domain adaptation in transfer learning. One representative domain adaptation approach is universal feature-based transfer learning, which assumes that different domains can be represented by learning a group of universal features, and that these universal features can bridge the domain gaps between the source and target domains. Specifically, DAN [14] builds a deep adaptive neural network with a domain-shared encoder and a domain alignment decoder in its transfer learning pipeline. AdaBN extends the DAN feature alignment from specific decoder layers to the whole network to further reduce the domain learning gap [29, 30]. Moreover, partial alignment methods further relax the feature alignment and consider the intrinsic domain discrepancy to improve their downstream tasks, e.g., feature partial alignment adaptation [31], memory-assistant discriminative learning [32], and selective feature alignment [33].

Besides, some other works solve the domain gaps of transfer learning from other aspects. Concretely, DANN [34] proposes gradient-reversal learning to reduce the discriminability between the source and target datasets. Adversarial-based transfer learning has achieved great success in domain adaptation [35, 36]: it generates adversarial data samples to confuse the discriminator and thereby helps the deep model align different domains. Some works also try to improve domain adaptation through partial alignment [37] under practical conditions [38]. Similarly, the reconstruction-based method [39] replaces data generation with data reconstruction in its transfer learning pipeline. Unlike the learning-based methods that conduct domain adaptation implicitly, FAD [40] addresses domain adaptation in the frequency spectrum rather than through feature alignment, by swapping the low-frequency spectra of the source and target domains. However, all the aforementioned transfer learning methods neglect the vicinal risk (outliers) after domain adaptation.

3 Methodology

In this section, we present our proposed method, termed mixup domain adaptation (mixDA), which mainly consists of two parts: domain adaptation and domain mixup. For domain adaptation, we employ Fourier domain adaptation or histogram-matching domain adaptation to explicitly adapt the source fundus images to the target domain. Then, our domain adaptation mixes the adapted sample with a sample of the same category into a new sample to improve the hard-coded transfer learning. For domain mixup, we mix the adapted source domains with the target domain, which increases the model's generalization capability on discrepant outlier data points and minimizes the vicinal risk [18]. More importantly, we formulate the inter-domain mixup of domain adaptation and the cross-domain mixup of domain mixup into one uniform framework in our mixDA. In the following parts, we first present the domain gap across glaucoma datasets and then introduce the domain adaptation and domain mixup of our mixDA separately.

3.1 Domain gap

Suppose we are given a source dataset \({\mathcal {D}}^S = \{({\textbf{x}}_i^S, {\textbf{y}}_i^S)\}_{i=1}^{N_S}\), where \({\textbf{x}}^S \in {\mathbb {R}}^{H\times W \times 3}\) is a fundus image from the source dataset and \(y^S \in \{0,1\}\) is the label associated with \({\textbf{x}}^S\). Similarly, we can define the target domain dataset \({\mathcal {D}}^R = \{({\textbf{x}}_i^R, {\textbf{y}}_i^R)\}_{i=1}^{N_R}\). To view these domain gaps straightforwardly, we visualize the data probability distributions across glaucoma datasets with kernel density estimation, computed by:

$$\begin{aligned} \hat{{\mathcal {F}}}_K({\textbf{x}})= \frac{1}{nh} \sum _{i=1}^{n}K\left( \frac{{\textbf{x}}-{\textbf{x}}_i}{h}\right) , \end{aligned}$$
(1)

where h is the smoothing parameter (bandwidth) of the kernel K, and \({\textbf{x}}\) is the point at which the density \(\hat{{\mathcal {F}}}_K\) is estimated. Here, we visualize the data distribution of each glaucoma training dataset by computing the mean lightness of its image samples.
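To make Eq. 1 concrete, the following minimal sketch shows how such density curves could be produced, assuming each fundus image is summarized by its mean lightness and a Gaussian kernel is used; the function names and SciPy's default bandwidth rule are illustrative assumptions rather than the exact implementation used here.

```python
import numpy as np
from scipy.stats import gaussian_kde

def dataset_lightness(images):
    """Summarize each fundus image by its mean lightness (the points x_i in Eq. 1)."""
    # images: iterable of H x W x 3 arrays
    return np.array([np.asarray(img, dtype=float).mean() for img in images])

def density_curve(samples, grid, bandwidth=None):
    """Kernel density estimate of Eq. 1 with a Gaussian kernel K and bandwidth h."""
    kde = gaussian_kde(samples, bw_method=bandwidth)  # Scott's rule if bandwidth is None
    return kde(grid)

# Usage: compare two datasets' lightness distributions on a shared grid, as in Fig. 3.
grid = np.linspace(0, 255, 256)
# curve_src = density_curve(dataset_lightness(source_images), grid)
# curve_tgt = density_curve(dataset_lightness(target_images), grid)
```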

Fig. 3

Domain gaps in glaucoma dataset distributions. The X-axis is the lightness of samples, and the Y-axis is the dataset density

From Fig. 3, we can clearly observe domain gaps among the 12 public glaucoma datasets, which depart from the primary principle of learning theory that all training and prediction samples should follow a consistent distribution. Most datasets follow a Gaussian-like distribution, except IEEE1450 (1450) and DCGAN. These domains are apart from each other, and the gaps hinder the generalization performance and transfer learning of deep models. Our mixup domain adaptation (mixDA) is therefore proposed to address the negative impacts of these domain gaps in the glaucoma detection task.

3.2 Domain adaptation

Domain adaptation (DA) is the first step in the pipeline of our mixDA, which aims to align the source domain with the target domain in terms of data distribution. In other words, all samples from different domains are adapted to the target domain distribution as a stable input for deep model training, in an explicit domain adaptation manner. Concretely, our mixDA employs the off-the-shelf methods Fourier Domain Adaptation (FDA) [40] and Histogram-matching Domain Adaptation (HDA) [41] as its domain adaptation backbones. Besides, our mixup domain adaptation introduces the inter-domain mixup to increase model generalization capability and to reduce the vicinal risk on vicinal samples outside the target domain distribution.

A significant data distribution gap causes the model to perform well on the source dataset while performing poorly on the discriminated target datasets. Domain adaptation is a straightforward solution that keeps the inputs in a consistent distribution. Concretely, our mixDA introduces mixup Fourier domain adaptation (mFDA) and mixup histogram-matching domain adaptation (mHDA) to reduce the domain gaps. We introduce mFDA first.

Fig. 4

The pipeline of mixup Fourier domain adaptation (mFDA). “FFT” and “IFFT” are the fast Fourier transformation and inverse fast Fourier transformation. “\({\textbf{x}}\)”, “\(\check{{\textbf{x}}}\)”, and “\({\hat{{\textbf{x}}}}\)” denote the source, FDA, and HDA images, respectively. “\({\mathcal {M}}_{\alpha }\)” is the mixup parameter of mFDA

Figure 4 shows the pipeline of mFDA from the source domain to the target domain with mixup. In the fast Fourier transformation (FFT), the fundus image is transformed into frequency information \({\mathcal {F}}({\textbf{x}}(m,n))\), which is computed for each color channel as follows,

$$\begin{aligned} {\mathcal {F}}\left({\textbf{x}}\left(m,n\right)\right) = \sum _{h,w} {\textbf{x}}\left(h, w\right)e^{-2\pi \left(\frac{h}{H}m+\frac{w}{W}n\right)i}, \end{aligned}$$
(2)

where \(i^2 = -1\). Then, we can compute the amplitude spectrum (\({\mathcal {F}}_{A}\)) and phase spectrum (\({\mathcal {F}}_{P}\)) from the real and imaginary parts of \({\mathcal {F}}({\textbf{x}})\) by,

$$\begin{aligned} {\mathcal {F}}_A(m,n) & = \sqrt{\text {Re}\left( {\mathcal {F}}({\textbf{x}})(m,n)\right) ^2 + \text {Im}\left( {\mathcal {F}}({\textbf{x}})(m,n)\right) ^2}, \end{aligned}$$
(3)
$$\begin{aligned} {\mathcal {F}}_P(m,n) & = \text {arctan} \frac{\text {Im}\left( {\mathcal {F}}({\textbf{x}})(m,n)\right) }{\text {Re}\left( {\mathcal {F}}({\textbf{x}})(m,n)\right) }. \end{aligned}$$
(4)

The amplitude spectrum and phase spectrum store the features of relative brightness and object boundaries, respectively. Since changing the lightness distorts image information less than changing the object boundaries [42], and fundus images suffer from lightness variation across different glaucoma datasets, our mFDA conducts domain adaptation by transferring amplitude spectrum information from the target domain to the source domain through a low-frequency mask M with a size ratio of 0.1:

$$\begin{aligned} {\mathcal {F}}_A:= M \circ {\mathcal {F}}_A({\textbf{x}}^R) + (1-M)\circ {\mathcal {F}}_A({\textbf{x}}^S). \end{aligned}$$
(5)

Then, we map the adapted amplitude spectrum and original phase spectrum back to the fundus image with inverse Fourier transform (\({\mathcal {F}}^{-1}\)), as follows,

$$\begin{aligned} \check{{\textbf{x}}}(h,w) = {\mathcal {F}}^{-1}\left( {\mathcal {F}}_A e^{i{\mathcal {F}}_P}\right) . \end{aligned}$$
(6)

The last step of mFDA is its inter-domain mixup (Eq. 7), which mixes the adapted sample with a sample from the same category,

$$\begin{aligned} {\tilde{{\textbf{x}}}}_{f}={\mathcal {M}}_{\alpha } \check{{\textbf{x}}} + (1-{\mathcal {M}}_{\alpha }) {\textbf{x}}^{*}, \end{aligned}$$
(7)

where \({\textbf{x}}^{*} \in \{{\textbf{x}}; \check{{\textbf{x}}}; {\hat{{\textbf{x}}}} \}\), and \({\mathcal {M}}_{\alpha } \in (0,1)\).
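To make the mFDA pipeline concrete, the NumPy sketch below mirrors Eqs. 2-7 under a few assumptions: the low-frequency mask M is a centered square covering a 0.1 ratio of each spectral dimension, and the mixup coefficient \({\mathcal {M}}_{\alpha }\) is drawn from a Beta prior as in vanilla mixup [18]; the function names are illustrative.

```python
import numpy as np

def fda_transfer(x_src, x_tgt, mask_ratio=0.1):
    """Swap the low-frequency amplitude of the source image with the target's (Eqs. 2-6)."""
    # 2D FFT per color channel; shift the zero frequency to the center so the
    # low-frequency band becomes a central square.
    f_src = np.fft.fftshift(np.fft.fft2(x_src, axes=(0, 1)), axes=(0, 1))
    f_tgt = np.fft.fftshift(np.fft.fft2(x_tgt, axes=(0, 1)), axes=(0, 1))
    amp_src, pha_src = np.abs(f_src), np.angle(f_src)   # amplitude / phase spectra (Eqs. 3-4)
    amp_tgt = np.abs(f_tgt)

    h, w = x_src.shape[:2]
    bh, bw = int(h * mask_ratio / 2), int(w * mask_ratio / 2)
    ch, cw = h // 2, w // 2
    # Eq. 5: inside the mask M take the target amplitude, elsewhere keep the source amplitude.
    amp_src[ch - bh:ch + bh, cw - bw:cw + bw] = amp_tgt[ch - bh:ch + bh, cw - bw:cw + bw]

    # Eq. 6: recompose the adapted amplitude with the original source phase and invert.
    f_adapted = amp_src * np.exp(1j * pha_src)
    x_adapted = np.fft.ifft2(np.fft.ifftshift(f_adapted, axes=(0, 1)), axes=(0, 1)).real
    return np.clip(x_adapted, 0, 255)

def mfda(x_src, x_tgt, x_same_class, alpha=0.4):
    """Eq. 7: mix the FDA-adapted image with another image of the same category."""
    x_check = fda_transfer(x_src, x_tgt)
    m = np.random.beta(alpha, alpha)                     # M_alpha in (0, 1)
    return m * x_check + (1 - m) * x_same_class
```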

While mFDA conducts adaptation on the amplitude spectrum, mixup histogram-matching domain adaptation (mHDA) conducts adaptation on the image histogram. Let \(P_r\) denote the probability density function of the source domain,

$$\begin{aligned} P_r(r_j)=\frac{N({\textbf{x}}(m,n)=r_j)}{N({\textbf{x}}(m,n))}, \end{aligned}$$
(8)

where \(N({\textbf{x}}(m,n)=r_j)\) denotes the number of pixels with intensity value \(r_j\) and \(N({\textbf{x}}(m,n))\) is the total number of pixels. We can then compute the cumulative distribution function (\({\mathcal {S}}({\textbf{x}}_j)\)) of the source domain as follows,

$$\begin{aligned} {\mathcal {S}}({\textbf{x}}_j) = \sum _{k=0}^{j} P_r(r_k). \end{aligned}$$
(9)

Similarly, the target domain defines its probability density function \(P_z(r_j)\) and cumulative distribution function \({\mathcal {G}}({\textbf{z}}_j)\). The mHDA can then bridge the domain gap between the source and target by equating their cumulative distribution functions:

$$\begin{aligned} {\mathcal {S}}({\textbf{x}}_j) = {\mathcal {G}}({\textbf{z}}_j). \end{aligned}$$
(10)

Thereby, the domain adaptation transformation of our mHDA is computed by,

$$\begin{aligned} {\textbf{z}}= {\mathcal {G}}^{-1}\left( {\mathcal {S}}({\textbf{x}})\right) . \end{aligned}$$
(11)

In this way, all pixels of the fundus images are mapped to the target domain distribution by Eq. 11. For the last step, mHDA follows mFDA and mixes inter-domain samples of the same category:

$$\begin{aligned} {\hat{{\textbf{x}}}} = {\mathcal {M}}_{\beta }\circ {\textbf{x}}+ (1-{\mathcal {M}}_{\beta })\circ {\textbf{z}}. \end{aligned}$$
(12)

3.3 Domain mixup

After the domain adaptation of mFDA and mHDA, the fundus images of the source domain are generally adapted to the target domain. However, most glaucoma datasets have low-resource training data, and some discrepancies still exist outside the target domain distribution; such samples are called vicinal samples or adversarial examples. To solve the low-resource issue and minimize the vicinal risk of these discrepancies in domain adaptation, our mixDA introduces the domain mixup, which improves the learning capability on vicinal samples by mixing different domains.

Fig. 5

Comparison of different mixup schemes. Colored labels (y) denote samples of different categories; uncolored data (x) denotes target domain samples, and colored data (x) denotes adapted samples

From Fig. 5, we can observe that the vanilla mixup method conducts mixup on the target domain images only, which we call inter-domain mixup. Different from the vanilla mixup, mFDA/mHDA and mixDA perform cross-domain mixup that mixes data samples from different domains (i.e., domain 1 mixed with the target domain). Note that the main difference is that mFDA/mHDA only mix samples of the same category (the same colored y). In contrast, our mixDA is a generalized version, which performs cross-domain mixing not only across different domains but also across different categories. Moreover, our domain mixup formulates mFDA and mHDA with cross-domain mixup into one uniform computation, as follows,

$$\begin{aligned} {\tilde{{\textbf{x}}}} & ={\mathcal {M}}_{\lambda } {\textbf{x}}^{*}_i + (1-{\mathcal {M}}_{\lambda }) \circ {\textbf{x}}^{*}_j, \end{aligned}$$
(13)
$$\begin{aligned} {\tilde{{\textbf{y}}}} & = {\mathcal {M}}_{\lambda } {\textbf{y}}^{*}_i + (1-{\mathcal {M}}_{\lambda }) \circ {\textbf{y}}^{*}_j. \end{aligned}$$
(14)

where \({\textbf{x}}^{*} \in \{ \check{{\textbf{x}}}; {\hat{{\textbf{x}}}}; {\textbf{x}}\}\) and \({\textbf{y}}^{*}\) is the label corresponding to \({\textbf{x}}^{*}\). From these equations, the domain mixup of our mixDA extends to different categories and their labels instead of only the same category (i.e., \({\textbf{x}}^{*}_i\) and \({\textbf{x}}^{*}_j\) can be either glaucoma or non-glaucoma samples). More importantly, this mixed data also serves as new augmented data that relieves the low-resource data issue of glaucoma datasets. A short sketch of this generalized mixup is given below.
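The generalized domain mixup of Eqs. 13-14 reduces to a few lines; the sketch below assumes one-hot labels and a Beta-sampled \({\mathcal {M}}_{\lambda }\), and the names are illustrative.

```python
import numpy as np

def domain_mixup(x_i, y_i, x_j, y_j, alpha=0.4):
    """Eqs. 13-14: mix two samples that may come from different domains and categories."""
    lam = np.random.beta(alpha, alpha)        # M_lambda, drawn from an assumed Beta prior
    x_mix = lam * x_i + (1 - lam) * x_j       # x_i, x_j are any of {x, x_check, x_hat}
    y_mix = lam * y_i + (1 - lam) * y_j       # soft label for the mixed (augmented) sample
    return x_mix, y_mix

# Usage: x_i could be a target-domain image and x_j an FDA/HDA-adapted source image
# with one-hot labels y_i, y_j; the mixed pair is used directly as training data.
```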

Fig. 6

Mixup comparison with transfer learning

To illustrate the intuitive differences of our domain mixup from transfer learning and vanilla mixup, Fig. 6 presents a comparison of transfer learning, vanilla domain mixup, and our mixDA. Specifically, transfer learning directly pushes the source-pretrained model to perform hard-coded finetuning on the target domain without considering the existing domain gap. The vanilla mixup tries to reduce the domain gap by mixing the samples of the source and target domains, but the domain gap remains large if the source and target distributions are far apart. Our mixDA first bridges the domain gap by domain adaptation and then conducts the domain mixup to further reduce the vicinal discrepancies between the adapted and target domains.

4 Experiments

4.1 Datasets

We evaluate our mixDA and conduct experiments on 12 public glaucoma datasets. The dataset overview information is summarized in Table 1. Some glaucoma datasets contain multiple resolutions, for which we only list one.

Table 1 Summary of Glaucoma Datasets

From the numbers in Table 1, we can observe the huge discrepancies across glaucoma datasets. In detail, the volume and partition differ: some glaucoma datasets are low-resource ones with only around a hundred samples, e.g., HRF (High Resolution Fundus), DRHAGIS (DRH), ACRIMA, and DRISHTI (DRISHTI-GS). Moreover, the image resolutions vary across datasets. Since the original fundus images have large resolutions, some datasets processed them by cropping the original large images into a smaller resolution with only the ROI areas preserved, e.g., HPD (Harvard Processed Data), RIM (RIM-ONE), and ACRIMA. Last but not least is the intrinsic difference of the fundus images, such as lighting, image processing, and camera hardware, which further increases the domain gaps of glaucoma datasets. All these data discrepancies increase the challenge of glaucoma detection with data-driven deep models.

4.2 Results

This section reports the state-of-the-art performance of our mixup domain adaptation on four public glaucoma datasets (REFUGE, LAG, ORIGA-light, and RIM-ONE). Following previous glaucoma detection works [4, 10], we employ accuracy, sensitivity, specificity, and area under the curve (AUC) as our evaluation metrics in the glaucoma detection task. Note that, due to the imbalanced data distributions of different glaucoma datasets (e.g., REFUGE is imbalanced with 40 glaucoma cases vs 360 healthy cases), most research works [4, 50,51,52] employ AUC as their main evaluation metric instead of the threshold-sensitive metrics (accuracy, specificity, and sensitivity) on such imbalanced datasets. In our experiments, we also provide accuracy, specificity, and sensitivity for the readers' reference.
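For reference, these metrics could be computed as in the sketch below, assuming binary ground-truth labels and predicted glaucoma probabilities; the 0.5 decision threshold for the threshold-dependent metrics is an assumption.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score

def glaucoma_metrics(y_true, y_prob, threshold=0.5):
    """AUC plus threshold-dependent accuracy, sensitivity, and specificity."""
    y_pred = (np.asarray(y_prob) >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    return {
        "auc": roc_auc_score(y_true, y_prob),        # threshold-free, preferred on imbalanced sets
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": tp / (tp + fn),               # recall on the glaucoma (positive) class
        "specificity": tn / (tn + fp),
    }

# Usage: glaucoma_metrics([0, 1, 1, 0], [0.2, 0.9, 0.4, 0.1])
```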

We first evaluate mixDA on REFUGE, which exhibits a pronounced domain gap between its training and validation sets. To improve glaucoma detection performance on REFUGE, some existing works introduce multi-task learning (Masker, AMNet), feature fusion (FusionBr, SenBr), and model ensembling (EnsembleTL, Masker) to improve their learning capability. Meanwhile, the two-stage glaucoma detection methods exploit the prior knowledge that glaucoma is strongly related to the optic nerve areas; thereby, the two-stage methods (SDSIRC, CUHKMED) conduct feature learning on the ROI-cropped fundus images instead of the original ones. Furthermore, the SOTA method VRT employs an attentive neural network for discriminative learning and achieves the best performance on REFUGE. Following the official REFUGE setting, we summarize the experimental results as follows,

Table 2 Comparison with SOTAs on REFUGE

Table 2 provides a performance summary of existing SOTAs on REFUGE. With the help of domain adaptation and mixup, our mixDA achieves the best performance on REFUGE with a 0.9901 AUC score. Unlike the aforementioned methods that improve model performance by feature fusion or ROI cropping, our mixDA focuses on solving the intrinsic dataset issue of domain gaps and on mixup learning. Meanwhile, our mixDA employs extra datasets (i.e., LAG and DCGAN) in domain mixup learning as well as multi-task learning to further improve the AUC score from the previous SOTA of 0.9885 to 0.9901. This performance comes at the cost of a sensitivity decline, since REFUGE is an imbalanced dataset with strong sensitivity fluctuation. Overall, our mixDA is a better solution for domain-discrepant datasets under the official AUC criterion.

The fundus images of the LAG dataset are cropped to a unified resolution around the optic nerve head. Meanwhile, LAG is the largest of the glaucoma datasets listed in Table 1. In the experiments on LAG, our mixDA employs the cropped DCGAN dataset as the extra dataset in domain mixup learning. The detailed results are reported as follows,

Table 3 Comparison with SOTAs on LAG

From the results in Table 3, our mixDA also achieves the best performance among the SOTA methods, with an AUC score of 0.9953 on LAG. Compared with REFUGE, we can observe that the leading SOTAs reach higher AUC scores of around 0.99, versus \(0.97 \sim 0.98\) on REFUGE. Among the baselines, EGDCL introduces adaptive curriculum learning to help unbiased glaucoma diagnosis, while Auxiliary-PSD and Transductive are teacher-student learning models; all of them surpass 0.99 AUC. The intuitive reason is that LAG has many more training samples and similar data distributions in its training and test sets. Compared with the DCGAN method pretrained on the same dataset, our generalized mixDA achieves a \(2\%\) AUC improvement over DCGAN. Moreover, we also evaluate mixDA with different sources, i.e., REFUGE and ORIGA. mixDA achieves a competitive performance with an accuracy of 0.9720 and an AUC of 0.9941, which further verifies the strong learning capability of mixDA, even when trained on different sources.

In the evaluation on the ORIGA dataset, we found different evaluation settings in previous works, which cannot be directly compared. To fill this comparability gap and provide a summarized baseline, we follow previous research on the ORIGA dataset [51, 61, 62] and evaluate ORIGA in three settings: two random partitions and 10-fold cross-validation. Among the baselines, DCNN and ReconstructNN are early works that introduce deep neural networks into glaucoma classification. Holistic+Local, SVM, and SVM+SMOTE are classical machine learning solutions. M-Net, joint U-Net, and M-Net+PT are all based on U-shape-like neural networks for glaucoma classification. All detailed experimental results are reported in Table 4.

Table 4 Comparison with SOTAs on ORIGA

From the results in Table 4, we can observe that the overall AUC scores are below 0.90, which indicates that ORIGA is a more challenging dataset than REFUGE and LAG. With the help of domain mixup, our mixDA achieves consistently superior performance in all experimental settings. Specifically, in the first experimental setting with only 99 training samples, our mixDA, aided by transfer learning on extra sources, achieves a \(5\%\) improvement over the previous works DCNN and ReconstructNN. The second experimental setting allocates more samples for training, which helps M-Net achieve a higher AUC score of 0.8508, but it is still far behind our 0.8857. The last experimental setting is 10-fold cross-validation, where our mixDA is consistently superior to SVM+SMOTE.

Following the setting of EGDCL, we evaluate mixDA on RIM-ONE-R1. Different from LAG, which is cropped around the optic nerve head area, RIM-ONE only crops the areas near the optic disk at a relatively smaller resolution.

Table 5 Comparison with SOTAs on RIM-ONE

Table 5 summarizes the different methods on RIM-ONE. With weakly generalized backbones, DENet and GON are limited to 0.574 and 0.681, respectively. MCL-NET and DCNN improve their performance above 0.8 with the help of advanced models, but are still inferior to the generative model AG-CNN at 0.916. EGDCL introduces adaptive curriculum learning and pushes the AUC score to 0.976. Different from EGDCL, our mixDA considers not only domain adaptation but also the vicinal samples (also called adversarial samples in generative models [18]), further improving the AUC by \(2\%\) to 0.9933. Furthermore, its accuracy, sensitivity, and specificity all surpass the other methods on RIM-ONE.

5 Discussions

In this section, we extend the explorations of mixDA from different aspects: ablation study, transferability, generalization performance, mixDA variant, backbones impact, etc. We also provide more experimental details in our supplementary information.

5.1 Ablation study

We first conduct the ablation study of mixDA on REFUGE and ORIGA datasets. Concretely, there are four settings: the baseline (ResNeST50), domain adaptation (DA), domain mixup (Mixup), and mixDA with both DA and Mixup. The detailed results are reported in Table 6.

Table 6 Ablation Study of mixDA on REFUGE and ORIGA

For the REFUGE dataset, we can clearly observe that the baseline has poor AUC and sensitivity scores, because REFUGE is an unbalanced dataset with only 40 glaucoma samples and 360 healthy samples in its test set. Thereby, the baseline model easily over-fits to the healthy category in the classification task, resulting in a low sensitivity score. As the domain gap exists between the training set and the validation/test sets, the DA setting conducts Fourier domain adaptation from the training set to the validation set, which improves the sensitivity slightly; however, the improvement is not significant because the intrinsic domain gap still exists. After introducing domain mixup with LAG and ORIGA, the sensitivity, specificity, and AUC are all improved. Finally, our mixDA with both DA and Mixup further improves the AUC score to 0.9901.

Different from REFUGE, the ORIGA training and test sets have similar data distributions. Thus, introducing the extra dataset in the DA setting helps the model obtain a \(10\%\) improvement over the baseline setting. Meanwhile, the Mixup setting with the extra dataset also improves by \(5\%\) over the baseline. The last setting yields a result similar to REFUGE: mixDA with both DA and Mixup achieves the best performance with a \(12\%\) improvement over the baseline model.

5.2 Transferability of glaucoma datasets

The transferability is defined as the AUC performance of a pretrained model on unseen glaucoma datasets. To provide an intuitive transferability overview across different glaucoma datasets, we fix all experimental hyperparameters without data augmentation and only adjust the batch size according to the dataset volume. From the results in Table 7, we summarize our findings as follows:

Table 7 Transferability on different glaucoma datasets

The first finding is that datasets with similar distributions can benefit from each other. Specifically, REFUGE has a data distribution similar to DRH and DRISHT, so its transferability scores on DRH and DRISHT are higher than on other datasets (i.e., G1020, RIM). Correspondingly, models pretrained on DRH and DRISHT show high transferability scores on REFUGE in their evaluations. Moreover, the IEEE1450 dataset has a distinct distribution, giving it poor transferability across the evaluation datasets. Besides the data distribution, we also find that whether the fundus images are original (uncropped) is an important factor for transferability. From the transferability comparison, we found that a model pretrained on an original fundus dataset performs well on other original fundus datasets (e.g., ORIGA to REFUGE, G1020 to DRH), and even on cropped ones (e.g., DRISHT to LAG, ORIGA to LAG).

Last but not least, the transferability of pretrained models is strongly related to the source data size. A small data size limits model learning and enlarges the model bias, hurting transferability. Representative low-resource datasets are HRF (22) and DRH (20), which achieve good performance on their own data but perform poorly on unseen datasets. Overall, we found that these three aspects have important impacts on domain mixup learning in mixDA.

5.3 Adaptation comparison

Our mixup domain adaptation has two strategies in the module of domain adaptation (DA): mixup Fourier domain adaptation (mFDA) and mixup histogram domain adaptation (mHDA). We evaluate the performance of those two adaptation strategies on the category-imbalanced REFUGE and category-balanced LAG. Moreover, we further explore those two adaptation strategies on different source domains. The evaluation of those two is reported as follows.

Table 8 shows the comparison of mFDA and mHDA trained on different source domains. From the results, we can observe that the AUC performance of mFDA surpasses mHDA on three source domains, while mHDA surpasses mFDA on two. In contrast, mHDA works more stably than mFDA on the category-imbalanced REFUGE dataset, with more consistent sensitivities across most source domains. The best score is achieved with the source domain LAG, but with a collapsed sensitivity. To avoid the sensitivity collapsing on the low-resource category, our mixDA introduces a non-collapsed source to relieve this issue; thus, our mixDA employs ORIGA instead of DCGAN together with LAG to conduct domain mixup on REFUGE.

Table 8 Domain adaptation comparison on REFUGE

Compared with the unbalanced REFUGE, both adaptation strategies achieve stable performance on the balanced LAG. From the results in Table 9, we can observe that mFDA achieves slightly better AUC scores than mHDA on most source domains. The reason is that the evaluated LAG is a category-balanced dataset with a larger training data volume than most public glaucoma datasets.

Table 9 Domain adaptation comparison on LAG

5.4 Generalization capability on glaucoma detection

In this part, we conduct three experiments to evaluate the generalization performance of mixDA. The first setting evaluates the domain generalization performance on the unseen dataset LAG. Then, we evaluate the model performance on the diverse dataset DCGAN, which consists of six public glaucoma datasets (ACRIMA, Drishti-GS, RIM-ONE, HRF, ORIGA, and sjchoi86-HRF). Finally, we conduct further evaluations on more public glaucoma datasets (1450, G1020, and HPD).

Our mixDA also works well in unsupervised domain generalization, where the model is only trained on the source datasets and tested on the unseen dataset. Compared with the unsupervised method SAIL trained on the private dataset pri-RFG, our unsupervised mixDA (u-mixDA) achieves superior performance by a large margin. The main reason is that u-mixDA benefits from the domain mixup module, which is pretrained on the larger and more diverse glaucoma dataset DCGAN.

To verify the generality of mixDA on diverse datasets, we evaluate it on six different glaucoma datasets, most of which are low-resource with limited data size. We follow the setting of SS-DCGAN [69], which splits the combined DCGAN dataset into \(70\%\) and \(30\%\) for training and testing. From the results in Table 10, we find most baselines achieve an F-score around 0.81 with limited training samples. SS-DCGAN addresses this low-resource issue by introducing semi-supervised learning on a large extra fundus dataset, which improves the AUC and F1 scores to 0.9017 and 0.8429. Different from SS-DCGAN, our mixDA directly performs domain mixup learning on the LAG glaucoma dataset to enhance the model learning capability, with \(3\%\) improvements on both AUC and F1 score over SS-DCGAN. These two settings verify that our mixDA performs well in both the unsupervised learning setting and the diverse dataset setting.

Table 10 Generalization capability evaluation

Besides, we also evaluate mixDA on more public glaucoma datasets: 1450, G1020, and HPD. The dataset partitions follow the comparable baselines for 1450 and G1020 by default, and HPD is by default split half and half. From the results in Table 11, we can observe that mixDA achieves the best performance on all three datasets. Specifically, mixDA and the baseline method both obtain good performance on 1450. On G1020, all baseline performances drop, and our mixDA still maintains superior performance over the baselines in 6-fold cross-validation. On the last dataset, HPD, we chose ResNet50 and ResNeST50 as the baselines, and our mixDA consistently outperforms its backbones on the F1 score. Meanwhile, mixDA also improves the F1 score of the backbone ResNet50 from 0.8550 to 0.8704, which verifies that mixDA is a backbone-free method whose performance can be boosted with stronger backbones.

Table 11 Experiments on glaucoma datasets

5.5 Backbone comparison

As mixDA is a backbone-free approach, we evaluate it with different backbones on the glaucoma datasets REFUGE, LAG, and HPD. Seven different neural networks are employed as the backbones of mixDA, which can be categorized as transformer-based (ViT and CoaT), ResNet-based (ResNet, SEResNeXt, and ResNeSt), and other CNN backbones (Xception and EfficientNet).

From the overview of Fig. 7, the ResNet-based backbones perform better than the others on the LAG and HPD datasets. On REFUGE, however, only ResNeSt achieves the best performance. Meanwhile, ViT also performs well on different glaucoma datasets with the help of its attentive transformer layers, while the other transformer-based backbone, CoaT, does not work well on the glaucoma detection tasks. After analyzing the training process, we found that the backbones ViT and CoaT, with their huge numbers of parameters, over-fit more easily than ResNet-based backbones on the training-size-limited glaucoma datasets, which greatly limits their performance. Xception and EfficientNet can converge to lower losses, but their performance is not competitive with ResNeSt. Thereby, our mixDA selects the ResNet-based ResNeSt and SEResNeXt as the default backbones in all glaucoma detection tasks.

Fig. 7

Different backbones of mixDA on glaucoma datasets

5.6 One-stage and two-stage variants of mixDA

Since the optic cup-to-disk ratio is the key indicator for diagnosing glaucoma, most works train their models on the optic nerve head area or optic disk area as the region of interest (ROI) instead of the whole original fundus image. One straightforward solution is to segment the ROI and conduct model training and prediction on the cropped fundus image (crop). Another solution is attentive domain adaptation on the background area while preserving the ROI area via optic disk/cup segmentation (add). We name these two solutions two-stage solutions, as both need an extra ROI segmentation process. In contrast, the one-stage solution directly conducts training and prediction on the original fundus images, since glaucoma is also related to the ganglion cells and nerve fibers outside the ROI area. A detailed evaluation of these paradigms on the REFUGE dataset is summarized in Table 12, and a rough sketch of the two two-stage variants is given below.
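The sketch below illustrates the “crop” and “add” variants, assuming a segmentation model already provides the ROI bounding box or binary mask; adapt_fn stands for a domain adaptation function such as the FDA/HDA sketches in Sect. 3.2, and all names are illustrative.

```python
import numpy as np

def add_variant(x_src, x_tgt, roi_mask, adapt_fn):
    """'add' variant: adapt only the background and keep the segmented ROI unchanged."""
    adapted = adapt_fn(x_src, x_tgt)                # e.g., fda_transfer or histogram_match
    mask = roi_mask[..., None].astype(float)        # H x W binary mask broadcast over channels
    return mask * x_src + (1 - mask) * adapted      # ROI preserved, background adapted

def crop_variant(x_src, roi_box):
    """'crop' variant: train and predict on the ROI crop only."""
    top, left, height, width = roi_box              # box predicted by a segmentation model
    return x_src[top:top + height, left:left + width]
```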

Table 12 Different variants comparison on REFUGE

From the variant comparisons, we found that both two-stage solutions achieve higher AUC scores than the one-stage solution under the vanilla setting and the “+DA” (domain adaptation) setting. We believe the reason is that the two-stage paradigms with the “crop” and “add” processes preserve glaucoma-related information, and these processes help the deep model learn glaucoma-specific information instead of learning from the original fundus image without prior information. However, the performances of both two-stage solutions drop in the “+Mixup” setting, even below their vanilla settings. The reason is that the “Mixup” module introduces extra data that differ substantially from the processed (cropped or added) fundus images; these differences cause the performance drop of the two-stage solutions. Although the combination of DA and Mixup can relieve this performance drop, the two-stage performances are still inferior to the one-stage solution. This comparison further verifies the importance of domain distribution alignment and the effectiveness of our mixDA in the glaucoma detection task.

5.7 Error analysis

In this part, we conduct an error analysis of our mixDA. Traditionally, the main reasons for prediction error are the vicinal risk and model over-fitting, which mislead the deep model into making faulty predictions. But we want to point out some other sources of prediction error: challenging cases and poor image quality.

From Fig. 8, we can observe some challenging cases where the deep model fails with high confidence scores. On the one hand, these samples are borderline cases for glaucoma diagnosis by the optic cup-to-disk-ratio indicator; on the other hand, multiple deep models make the same faulty predictions on this kind of sample with high confidence. In contrast, poor image quality is an intuitive issue, which provides insufficient information and leads to faulty predictions by deep models. We will devote more effort to improving mixDA's performance on these faulty predictions in our future work.

Fig. 8

Fault prediction samples on REFUGE

6 Conclusion

In this work, we proposed a novel mixup domain adaptation (termed mixDA) for glaucoma detection. Domain mixup and domain adaptation are the two key modules in our mixDA, which help the deep model learn from a consistent data distribution as well as generalize to the vicinal samples (outliers). We conducted extensive experiments showing that mixDA obtains competitive performance on 12 public glaucoma datasets and achieves new SOTA performance on the REFUGE, LAG, ORIGA, and RIM-ONE datasets. Moreover, we also discussed mixDA from different aspects, such as ablation study, domain transferability, generalization performance, mixDA variants, backbone impacts, and error analysis.