Nonparametric Predictive Comparison of Two Diagnostic Tests Based on Total Numbers of Correctly Diagnosed Individuals

In clinical applications, it is important to compare and study the ability of diagnostic tests to discriminate between individuals with and without a disease. In this paper, comparison of two diagnostic tests is presented and discussed using nonparametric predictive inference (NPI). We compare the two tests by considering the total numbers of correct diagnoses for specific numbers of future healthy individuals and future patients. This NPI approach for comparison of diagnostic tests is also generalized by the use of weighted sums for the healthy and patient groups, reflecting possibly different importance of correct diagnoses. Examples are provided to illustrate the new method.


Introduction
Developing and improving diagnostic tests to detect the presence or absence of a particular disease is important in medical applications. Often, researchers are asked to confirm the superiority of a new diagnostic test over an existing test. In practice, diagnostic tests are not perfect: they can make two types of errors, namely false negative (FN) and false positive (FP) errors. This raises the question of how one can best compare the predictive performance of diagnostic tests. The present paper differs from our earlier work in Alabdulhadi et al. [2] in two ways, namely in the event of interest, as we here consider the (possibly weighted) total number of correctly diagnosed individuals, and in the NPI methods used. In Alabdulhadi et al. [2], NPI for future order statistics [3,12] was used, while the problem formulation in this paper requires the use of NPI for Bernoulli quantities [6], for which new results are also presented.
Classical methods often focus on estimation rather than prediction. The end goal of studying the accuracy of diagnostic tests is to apply these tests to future individuals. Thus, it is of interest to consider the use of a frequentist predictive inference method for comparison of diagnostic tests as an alternative to the classical methods that have been presented in the literature. It will be useful to apply the NPI approach together with some other approaches, to see whether they provide similar conclusions about the different tests. If the NPI approach leads to quite a different conclusion than classical methods, then this is likely due to the model assumptions underlying those methods, as only a few assumptions are made in the NPI method.
In this paper, we present NPI for comparing two diagnostic tests, assuming that the tests are applied to the same individuals from two groups, namely healthy and diseased individuals. In Sect. 2, we provide a brief review of NPI for Bernoulli quantities [6]. Section 3 presents the main method for comparison of two diagnostic tests, by considering the total number of correctly classified individuals from both the healthy and disease groups. We also show how this method can be generalized to include weights for the two groups, to express possibly different importance of getting the diagnosis right for either healthy or diseased individuals. This section contains some new results for NPI for Bernoulli quantities which can also be applied to problems other than the comparison of diagnostic tests. Section 4 presents some examples to illustrate and discuss the new method. Finally, some concluding remarks are made in Sect. 5.


NPI for Bernoulli Quantities
Coolen [6] presented NPI for Bernoulli quantities, which is based on Hill's assumption A(n) [23,24], applied sequentially to derive inference for m ≥ 1 future observations given n observed values, together with a latent variable representation of Bernoulli quantities as observations on the real line, with a threshold such that observations to one side are successes and those to the other side are failures. Suppose that there is a sequence of n + m exchangeable Bernoulli trials, each with success and failure as possible outcomes, and data consisting of s successes in n trials. Let Y_1^n denote the random number of successes in trials 1 to n; then, a sufficient representation of the data for NPI is Y_1^n = s, due to the assumed exchangeability of all trials. Let Y_{n+1}^{n+m} denote the random number of successes in trials n + 1 to n + m.

Based on the basic method presented by Coolen [6], Coolen and Coolen-Schrijner [13] introduced the NPI lower and upper probabilities for the events Y_{n+1}^{n+m} ≥ y and Y_{n+1}^{n+m} < y; these are the only lower and upper probabilities needed in this paper. For y ∈ {0, 1, … , m} and 0 < s < n,

$$\overline{P}\left(Y_{n+1}^{n+m} \ge y\right) = \binom{n+m}{n}^{-1} \sum_{l=y}^{m} \binom{s+l}{l} \binom{n-s+m-l-1}{m-l},$$

and for y ∈ {1, … , m + 1} and 0 < s < n,

$$\overline{P}\left(Y_{n+1}^{n+m} < y\right) = \binom{n+m}{n}^{-1} \sum_{l=0}^{y-1} \binom{s+l-1}{l} \binom{n-s+m-l}{m-l}.$$

The corresponding NPI lower probabilities can be derived via the conjugacy property $\underline{P}(A) = 1 - \overline{P}(A^c)$, for any event A and its complementary event A^c, which holds generally in imprecise probability theory and also in NPI for Bernoulli quantities [5,6].
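As a concrete illustration, these NPI lower and upper probabilities can be computed directly from the combinatorial expressions above. The following Python sketch is our own illustration (the function names are not from the cited papers) and assumes 0 < s < n, as required by the formulas:

```python
from math import comb

def npi_upper_geq(n, s, m, y):
    """Upper probability that at least y of m future trials succeed,
    given s successes in n observed trials (requires 0 < s < n)."""
    return sum(comb(s + l, l) * comb(n - s + m - l - 1, m - l)
               for l in range(y, m + 1)) / comb(n + m, n)

def npi_upper_less(n, s, m, y):
    """Upper probability that fewer than y of m future trials succeed."""
    return sum(comb(s + l - 1, l) * comb(n - s + m - l, m - l)
               for l in range(y)) / comb(n + m, n)

def npi_lower_geq(n, s, m, y):
    """Lower probability via conjugacy: lower P(A) = 1 - upper P(A^c)."""
    return 1.0 - npi_upper_less(n, s, m, y)
```

For m = 1 these reduce to the familiar NPI bounds for a single future Bernoulli trial: the upper probability of a success is (s+1)/(n+1) and the lower probability is s/(n+1); for example, with n = 10 and s = 8, these are 9/11 and 8/11.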

Comparison of Tests Using NPI for Bernoulli Quantities
In this section, we compare the accuracy of two diagnostic tests to classify individuals into one of two groups, which we indicate as 'healthy group' X and 'disease group' Y. Throughout this paper, we use either subscripts x and y or superscripts X and Y to refer to groups X and Y, respectively. We explicitly consider multiple future individuals from each group, with the inference based on observed data for individuals known to belong to either the healthy group or the disease group. Throughout this paper, we assume that the two groups are fully independent in the sense that any information about one group does not provide any information about the other group.
We compare the two tests by considering the total number of correct diagnoses for m_x future healthy individuals and m_y future patients for one test with those for the other test, using NPI for Bernoulli quantities for each group separately. In this paper, it does not matter what kind of measurements are actually used in the diagnostic tests; the only relevant aspect is whether or not the diagnosis is correct. However, the model underlying NPI for Bernoulli quantities [6] assumes a latent variable representation for successes and failures using real-valued observations and a threshold, such that an observation to one side of the threshold is a success and to the other side is a failure. This provides a natural link to diagnostic tests which provide real-valued outcomes, with an optimal threshold determined on the basis of the data and some optimality criterion. We have recently presented NPI methods for determination of an optimal diagnostic threshold for such a scenario [1,16], and this motivated us to develop the method presented in this paper. We also considered comparison of two diagnostic tests restricted to real-valued test results, with the criterion of maximizing the NPI lower or upper probability of correctly classifying at least specified proportions of the future individuals from the healthy and diseased groups [2]. That work uses NPI for future order statistics [12] and cannot be used for the criterion on the (possibly weighted) total number of correct future diagnoses considered in this paper. The method presented in the current paper can also be applied in different diagnostic scenarios, as long as one can identify whether or not a diagnosis is correct.
The number of correct diagnoses by test t, for t = 1, 2, in the n_x and n_y data observations from groups X and Y, is denoted by s_t^x and s_t^y, respectively. Let C_t^X denote the random number of successful diagnoses for the m_x future healthy individuals according to test t, and let C_t^Y denote the random number of successful diagnoses for the m_y future diseased individuals for test t. We compare the two tests by considering the random total numbers of correct diagnoses for the m_x + m_y future individuals, when each test would be applied to them. Hence, we consider the event C_1^X + C_1^Y > C_2^X + C_2^Y and develop the NPI lower and upper probabilities for this event. These results have not been presented before for such quantities, and of course these NPI lower and upper probabilities can also be useful for scenarios other than comparison of diagnostic tests. The first equality in this derivation follows from the fact that the upper probability for the event C_2^X + C_2^Y < k is increasing in k. Hence, to derive the NPI upper probability for the event of interest, we put the maximum possible probability mass for C_1^X + C_1^Y at the value m_x + m_y, followed by assigning the maximum possible remaining probability mass at m_x + m_y − 1, and so on [13]. We can interpret Eq. (1) as being optimistic for Test 1, by putting the maximum possible probability masses for this test at the larger values of C_1^X and C_1^Y, while being pessimistic for Test 2, by putting the maximum possible probability masses for this test at the smaller values of C_2^X and C_2^Y. The NPI lower and upper probabilities for the individual sums of Bernoulli quantities in the final formula are as given in Sect. 2.

We also consider the event C_1^X + C_1^Y ≥ C_2^X + C_2^Y, for which the NPI upper probability is derived as above, with the strict inequality in the first term replaced by the corresponding non-strict inequality. The corresponding NPI lower probabilities for these two events can be derived via the conjugacy property $\underline{P}(A) = 1 - \overline{P}(A^c)$, together with the obvious swapping of the Test 1 and Test 2 indicators in the respective formulae.
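The construction described above can be sketched computationally: assign to Test 1 the 'optimistic' probability masses (mass pushed to large counts, so that P(≥ y) attains the NPI upper probability for every y), assign to Test 2 the 'pessimistic' masses (attaining the lower probabilities), combine the X and Y group counts using the assumed independence of the two groups, and sum over the ordered pairs of totals. The following Python sketch is our own illustration of that reasoning, not code from the paper; the function names are hypothetical, and the NPI Bernoulli bounds are those reviewed in Sect. 2:

```python
from math import comb

def upper_geq(n, s, m, y):
    # NPI upper probability of at least y successes in m future trials
    return sum(comb(s + l, l) * comb(n - s + m - l - 1, m - l)
               for l in range(y, m + 1)) / comb(n + m, n)

def lower_geq(n, s, m, y):
    # conjugacy: lower P(>= y) = 1 - upper P(< y)
    return 1.0 - sum(comb(s + l - 1, l) * comb(n - s + m - l, m - l)
                     for l in range(y)) / comb(n + m, n)

def extreme_pmf(n, s, m, optimistic):
    """Probability masses over 0..m whose survival function equals the
    NPI upper (optimistic) or lower (pessimistic) probabilities."""
    f = upper_geq if optimistic else lower_geq
    surv = [f(n, s, m, y) for y in range(m + 2)]  # surv[m+1] = 0
    return [surv[y] - surv[y + 1] for y in range(m + 1)]

def convolve(p, q):
    # distribution of the sum of two independent counts
    r = [0.0] * (len(p) + len(q) - 1)
    for i, pi in enumerate(p):
        for j, qj in enumerate(q):
            r[i + j] += pi * qj
    return r

def upper_prob_test1_beats_test2(nx, ny, s1x, s1y, s2x, s2y,
                                 mx, my, strict=True):
    """Upper probability for T1 > T2 (or T1 >= T2 if strict=False)."""
    t1 = convolve(extreme_pmf(nx, s1x, mx, True),
                  extreme_pmf(ny, s1y, my, True))   # optimistic for Test 1
    t2 = convolve(extreme_pmf(nx, s2x, mx, False),
                  extreme_pmf(ny, s2y, my, False))  # pessimistic for Test 2
    return sum(p1 * p2
               for k1, p1 in enumerate(t1)
               for k2, p2 in enumerate(t2)
               if (k1 > k2 if strict else k1 >= k2))
```

The corresponding lower probability follows by conjugacy, with the roles of the two tests swapped: lower P(T1 > T2) = 1 − upper P(T2 ≥ T1).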
It is important to note that the NPI method presented in this paper, where the predictive inferences are done separately for the future individuals from group X and group Y, after which we consider the sum of the numbers of correct diagnoses, differs from the simpler possible approach of only counting the total number of successful diagnoses, both in the data and for the future individuals, without taking the different groups into account. The latter approach would straightforwardly use the NPI for Bernoulli data method for comparison of different groups [13] and would lead to less imprecision, that is, the corresponding NPI lower and upper probabilities would differ less. In particular, in situations where the sample sizes for the two groups differ substantially, one could get quite different results if one neglects the fact that there are two groups. In addition, our approach can be generalized to reflect that correct diagnoses may be more important for one group than for the other group.
We can take different importance of correct diagnosis for the two groups into account by using weighted totals of correctly diagnosed individuals. As we will consider the same weighted total for both tests, the weights used can be scaled to any total. For ease of presentation, we will use positive integer-valued weights w_x for group X and w_y for group Y. We now compare the two diagnostic tests by considering the event w_x C_1^X + w_y C_1^Y > w_x C_2^X + w_y C_2^Y. The NPI upper probability for this event, which also has not been presented elsewhere and may have applications to a wider range of statistical problems, is derived in Eq. (2), analogously to Eq. (1). The NPI upper probability for the corresponding event with non-strict inequality is again derived by replacing the first term after the final equality in Eq. (2) by its non-strict counterpart, and the corresponding lower probabilities can again be derived via the conjugacy property.
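The weighted comparison only changes the value attached to each pair of group counts; the extreme mass allocations themselves stay the same. A sketch along the same lines as before (again our own illustration with hypothetical function names, not code from the paper):

```python
from math import comb

def upper_geq(n, s, m, y):
    # NPI upper probability of at least y successes in m future trials
    return sum(comb(s + l, l) * comb(n - s + m - l - 1, m - l)
               for l in range(y, m + 1)) / comb(n + m, n)

def lower_geq(n, s, m, y):
    # conjugacy: lower P(>= y) = 1 - upper P(< y)
    return 1.0 - sum(comb(s + l - 1, l) * comb(n - s + m - l, m - l)
                     for l in range(y)) / comb(n + m, n)

def extreme_pmf(n, s, m, optimistic):
    # masses whose survival function equals the upper or lower probabilities
    f = upper_geq if optimistic else lower_geq
    surv = [f(n, s, m, y) for y in range(m + 2)]
    return [surv[y] - surv[y + 1] for y in range(m + 1)]

def weighted_total_dist(px, py, wx, wy):
    """Distribution of wx*Cx + wy*Cy for independent group counts."""
    d = {}
    for kx, p1 in enumerate(px):
        for ky, p2 in enumerate(py):
            v = wx * kx + wy * ky
            d[v] = d.get(v, 0.0) + p1 * p2
    return d

def upper_prob_weighted(nx, ny, s1x, s1y, s2x, s2y, mx, my, wx, wy,
                        strict=True):
    """Upper probability for wx*C1X + wy*C1Y > wx*C2X + wy*C2Y."""
    t1 = weighted_total_dist(extreme_pmf(nx, s1x, mx, True),
                             extreme_pmf(ny, s1y, my, True), wx, wy)
    t2 = weighted_total_dist(extreme_pmf(nx, s2x, mx, False),
                             extreme_pmf(ny, s2y, my, False), wx, wy)
    return sum(p1 * p2 for v1, p1 in t1.items() for v2, p2 in t2.items()
               if (v1 > v2 if strict else v1 >= v2))
```

With w_x = w_y = 1 this reduces to the unweighted comparison, and scaling both weights by a common factor leaves the result unchanged, consistent with the remark that the weights can be scaled to any total.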

Examples
In this section, we illustrate the NPI method for comparison of two diagnostic tests introduced in Sect. 3. A special feature of our method is that the number of future individuals from both the healthy and disease groups must be specified for the event of interest in the comparison. We therefore consider the application of the method for different values of m x and m y , which we mostly assume to be equal, but we also consider what happens when they are not equal. The first two examples use made-up data in order to illustrate the approach and discuss its important features. Example 3 uses data from the literature and is linked to an application of our recently presented NPI method to determine the optimal diagnostic threshold for real-valued data [1,16].
Example 1 Assume that two diagnostic tests have been applied to the same n_x = 10 individuals from healthy group X and n_y = 10 individuals from disease group Y. The numbers of correctly diagnosed individuals when Test 1 is used are s_1^x = s_1^y = 8 for both groups, while for Test 2, these numbers are s_2^x = s_2^y = 6 for both groups. To denote the events of interest concisely, we write T_t for the total number of correct diagnoses by test t for the future individuals, where the values m_x and m_y will be clear from the tables or the context. Tables 1 and 2 present the NPI lower and upper probabilities for the events T_1 > T_2, T_1 ≥ T_2, T_2 > T_1 and T_2 ≥ T_1 for different values of m_x and m_y, which are set equal in Table 1 but differ in Table 2. It is obvious from the data that Test 1 has performed better for the observed individuals than Test 2, for both the healthy and diseased groups. The aim of this example is to show how such better performance is reflected by the predictive inferences to compare the two tests if they are applied to m_x and m_y future individuals from groups X and Y.
The first thing to note from Table 1 is that the entries in the last two columns, that is, the NPI lower and upper probabilities for the events T_2 > T_1 and T_2 ≥ T_1, could have been deleted as they follow from the entries for the events T_1 ≥ T_2 and T_1 > T_2, respectively, by use of the conjugacy property. However, we have included them because it simplifies comparison of the NPI lower and upper probabilities for all these events. The better performance of Test 1 than of Test 2 is reflected by larger values of the lower and upper probabilities for the event T_1 > T_2 than for the event T_2 > T_1, and larger values for T_1 ≥ T_2 than for T_2 ≥ T_1.
Comparing the lower and upper probabilities for the events T_1 > T_2 and T_1 ≥ T_2 for the same value of m = m_x = m_y shows that these differ a lot for small m, yet the differences decrease for increasing m, becoming very small for m = 100. This is of course due to the fact that, for small m, it is quite likely that one gets T_1 = T_2, yet for larger m, this becomes unlikely. Due to this effect, it is easiest to study the effect of different choices for the value of m by looking at the event T_1 ≥ T_2. We note that the lower and upper probabilities for this event vary with m: the upper probability increases, while the lower probability first decreases and then increases slightly. This is not a pattern observed in all such examples; it varies from case to case. But overall the imprecision, that is, the difference between corresponding upper and lower probabilities, tends to increase for larger values of m, unless a lower probability gets close to 1 (or an upper probability close to 0), which forces imprecision to become small as the corresponding upper probability cannot exceed 1 (and the lower probability cannot be less than 0). This example shows that, for the predictive criterion chosen in this paper to compare two diagnostic tests, the actual choice of the numbers of future individuals considered has some influence on the results.
In Table 2, the NPI lower and upper probabilities for the comparison of these two diagnostic tests are given for some cases with m_x ≠ m_y. Of course, due to the data for groups X and Y being the same for both tests, the first two reported cases lead to the same results. We furthermore see similar aspects as discussed above for the situation with equal numbers of future individuals for both groups.
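The qualitative pattern of this example can be reproduced with a short script, using the computational sketch of the comparison from Sect. 3 (our own illustration; the exact table values are not restated here, and the function names are hypothetical):

```python
from math import comb

def upper_geq(n, s, m, y):
    # NPI upper probability of at least y successes in m future trials
    return sum(comb(s + l, l) * comb(n - s + m - l - 1, m - l)
               for l in range(y, m + 1)) / comb(n + m, n)

def lower_geq(n, s, m, y):
    # conjugacy: lower P(>= y) = 1 - upper P(< y)
    return 1.0 - sum(comb(s + l - 1, l) * comb(n - s + m - l, m - l)
                     for l in range(y)) / comb(n + m, n)

def extreme_pmf(n, s, m, optimistic):
    f = upper_geq if optimistic else lower_geq
    surv = [f(n, s, m, y) for y in range(m + 2)]
    return [surv[y] - surv[y + 1] for y in range(m + 1)]

def convolve(p, q):
    r = [0.0] * (len(p) + len(q) - 1)
    for i, pi in enumerate(p):
        for j, qj in enumerate(q):
            r[i + j] += pi * qj
    return r

def upper_prob_gt(sa_x, sa_y, sb_x, sb_y, n, m):
    # upper probability that test a's total beats test b's total,
    # with equal group sizes n and equal numbers m of future individuals
    ta = convolve(extreme_pmf(n, sa_x, m, True),
                  extreme_pmf(n, sa_y, m, True))   # optimistic for test a
    tb = convolve(extreme_pmf(n, sb_x, m, False),
                  extreme_pmf(n, sb_y, m, False))  # pessimistic for test b
    return sum(p1 * p2 for k1, p1 in enumerate(ta)
               for k2, p2 in enumerate(tb) if k1 > k2)

# Example 1 data: n = 10 per group, Test 1 has 8 correct diagnoses in each
# group, Test 2 has 6; consider m_x = m_y = 5 future individuals.
up_12 = upper_prob_gt(8, 8, 6, 6, 10, 5)   # upper P(T1 > T2)
up_21 = upper_prob_gt(6, 6, 8, 8, 10, 5)   # upper P(T2 > T1)
```

The better data performance of Test 1 shows up as upper P(T_1 > T_2) exceeding upper P(T_2 > T_1), in line with the discussion of Table 1; the lower probabilities follow by conjugacy.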

Example 2
In this example, we consider two tests that have similar total numbers of correct diagnoses for the groups X and Y. As in the previous example, we set n_x = n_y = 10; the observed numbers of correct diagnoses for Test 1 are s_1^x = 7 for group X and s_1^y = 9 for group Y, while for Test 2, the numbers are s_2^x = 9 and s_2^y = 6, respectively.
Tables 3 and 4 present the NPI lower and upper probabilities for the same four events for comparison of these two tests as in the previous example, with Table 3 presenting results for m_x = m_y = m and Table 4 presenting some cases with m_x ≠ m_y. The values of the lower and upper probabilities for the event T_1 > T_2 are a bit higher than for the event T_2 > T_1, for the same value of m, and similarly for the events including equality of T_1 and T_2, reflecting the slightly better performance of Test 1 on the 20 data observations than Test 2. Of course, the differences here are much smaller than in Example 1, as the tests have performed very similarly here, whereas Test 1 had performed quite a bit better than Test 2 in Example 1. For larger values of m, where T_1 = T_2 becomes unlikely, all intervals created by the lower and upper probabilities in this example contain the value 0.5, which one could interpret as there not being a strong indication that either test is better than the other. Note that there is substantial imprecision in this example, in particular for the larger values of m. If we had larger data sets with similarly close performance, the imprecision would be less.
The results in Table 4 are quite different from those for the case with unequal values of m_x and m_y in Example 1. Since Test 1 is better for diagnoses for group Y, while Test 2 is better for group X, this is reflected in the predictive inference for the future performance if one considers different numbers of individuals from these groups. For relatively small numbers, with one of m_x and m_y equal to 15 and the other equal to 30, we see that with more future individuals from group Y, Test 1 performs better than Test 2, while with more future individuals from group X, Test 2 performs better. The differences between the entries in the first two rows of this table are large, which shows the influence that different choices of m_x and m_y can have, while there is also much imprecision, due to the small samples. However, once we consider larger numbers of future individuals, namely 50 and 70, Test 1 remains better than Test 2 if there are more future individuals from group Y, but even with more individuals from group X, Test 1 is still marginally better than Test 2. This reflects that Test 1 was overall a little better for the observed data, while the values of m_x and m_y are relatively close. Note that there is again considerable imprecision, so with the NPI lower and upper probabilities as presented here for the case m_x = 70 and m_y = 50, one would reach the conclusion that there is very little evidence that one test is better than the other.
To illustrate the use of weights for the different groups, as presented in Sect. 3, Table 5 presents the NPI lower and upper probabilities for comparison of the two diagnostic tests in this example, using weights to let successful diagnoses for one group be twice as important as for the other group. We restrict attention here to equal numbers of future individuals, m_x = m_y = m, to ensure that the effects illustrated result from the use of the weights. Using weights w_x = 2 and w_y = 1, Test 2 is better than Test 1 for all considered values of m, albeit only marginally so for small m. This reflects that Test 2 had a better performance than Test 1 for individuals from group X in the data. For w_x = 1 and w_y = 2, Test 1 compares favourably to Test 2 for all considered values of m, also reflecting that Test 1 had performed better than Test 2 for group Y in the data. Note that in the latter case, Test 1 is quite a bit stronger than Test 2, while the difference was not so large in the first case with the weights the other way around. This reflects that Test 1 had performed slightly better overall in the observed data. These lower and upper probabilities also have quite some imprecision, which suggests that larger data samples may be needed before a final decision can be made on the choice of the diagnostic test for the future individuals.

Example 3
In this example, we use a data set from a study to develop screening methods to detect carriers of a rare genetic disorder. The data were discussed by Cox et al. [20] (available from http://lib.stat.cmu.edu/datasets/). Four tests are applied to the same blood samples, each taking a real-valued measurement. The tests are indicated by M1, M2, M3 and M4. For some patients, there were several measurements for the same test; in such cases, the average is taken, and five patients with some missing values are excluded from the analysis. The remaining sample, which is used in this example, consists of 120 individuals: 38 carriers of the rare genetic disorder, which we call group X, and 82 non-carriers, group Y. To illustrate our method for comparison of two diagnostic tests, we first decided on the optimal diagnostic threshold for each test. To stay within the NPI framework, we applied the recently presented method [1,16], where we choose the threshold which maximizes the NPI lower probability that at least half of the m_x future individuals from group X will be correctly diagnosed, and also at least half of the m_y future individuals from group Y. Throughout this example, we set m_x = m_y = m. How the specific thresholds are chosen is in itself not important for the illustration of our method for comparison of the tests, but by choosing this NPI method we will see an important feature of such comparisons that might otherwise have gone unnoticed.
First, we applied the above-mentioned method to find the optimal diagnostic thresholds for the four tests and for different values of m. It is important to note here that the threshold, using the NPI method to determine it, can vary for different values of m. We only need the numbers of correctly diagnosed individuals from both groups X and Y for our comparison method; we denote these numbers by s_{Mt}^x and s_{Mt}^y, respectively, for t = 1, 2, 3, 4. We further denote the random number of correctly diagnosed future individuals for Test Mt for group X by C_t^X, and for group Y by C_t^Y. We base our predictive comparison of the tests on the random total numbers T_{Mt} = C_t^X + C_t^Y for the four tests. Table 6 shows the number of successful diagnoses in the data from the healthy and diseased groups for each test, for different values of m. Test M1 performs best overall for the data observations, if we consider the total observed correct diagnoses. Test M4 is second best, and both these tests had the same optimal threshold for all considered values of m. For Tests M2 and M3, the situation is less clear, and the optimal threshold is not the same for all m. For Test M2, the optimal threshold is slightly different for m = 1 than for the larger values of m considered, but for Test M3, the optimal threshold differs much more, leading to substantially different numbers of correctly diagnosed individuals from both groups for small m compared to larger values of m. It should be noted here that this is due to the multi-modal shape of our criterion function, as a function of the threshold, for specific values of m, while the criterion itself also changes with m. This multi-modality also occurs for other methods to determine the optimal threshold, so it is not a peculiarity of the NPI approach, although other methods presented in the literature are not predictive, hence do not depend on m, and hence tend not to show this feature.
The criterion functions have very similar values at several modes, but picking the threshold by overall optimization of the functions, for different m, can lead to quite different thresholds and hence quite different numbers of correctly diagnosed individuals from the two groups in the data set. We will see that this feature can substantially affect the comparison of the diagnostic tests. We present the pairwise comparisons for all pairs of these tests, by considering the NPI lower and upper probabilities, as presented in Sect. 3, for different values of m in Table 7. Test M1 was the best for the data, and this shows in the comparisons of this test with each of the other tests. For small m, there is again a considerable possibility that any two tests considered lead to the same total number of correct future diagnoses, as can be seen from the differences between the first and second, and third and fourth, columns with lower and upper probabilities in this table. This effect decreases for larger m, and Test M1 has high lower and upper probabilities of being better than Tests M2 and M3 for m = 100, while it is also quite likely to be better than Test M4 in this case. Test M4 is also likely to perform better than M2 and M3, so this is all in line with the conclusions drawn from the observed data, although these predictive inferences provide far more detailed information, and they provide much insight into the role of m for the predictions. Note further that imprecision is far smaller here than in Examples 1 and 2, reflecting that there is considerably more information from the data in this example.
The most interesting pairwise comparison here is between Tests M2 and M3, mainly due to the changes of optimal thresholds as discussed above, and the corresponding changes in numbers of correctly diagnosed individuals from groups X and Y. For smaller values of m, here m = 1, 5, 10 , the future performance of Test M3 is likely to be slightly better than that of Test M2, but for larger values of m, here m = 30, 100 , it is the other way around, with only a very small difference. The latter reflects that for these larger m, Test M2 has one more correct diagnosis for group X in the data than Test M3, with the same number of correct diagnoses for group Y. For the smaller values of m, it is quite different as Test M3 then performed considerably better on the data for group X but worse for group Y. It turns out, however, that using these data for predictive inference, with m x = m y = m , indicates a better performance to be likely for Test M3 than for Test M2 for these smaller values of m, something which would have been quite impossible to foresee without this formal predictive inference method being used.
More aspects of this example are considered in the PhD thesis of the first-named author [1], including a comparison of Tests M2 and M3 under the assumption that the thresholds used do not vary with m, but are the ones used for m = 100 in the analysis above. As that led to Test M2 diagnosing one more individual correctly, this test is then of course slightly better than Test M3 for all choices of m in our comparison. Finally, it is worth mentioning that the empirical areas under the ROC curves (AUC) for these four tests are 0.9034 for M1, 0.7526 for M2, 0.8232 for M3 and 0.8798 for M4. The AUC is often considered a useful measure to distinguish between the diagnostic accuracy of tests. An NPI method for diagnostic accuracy leading to lower and upper AUCs, which always bound the empirical AUC, has also been presented in [17]. While these results also indicate that Tests M1 and M4 are the two best tests, they do not show any further aspects of the comparison of Tests M2 and M3, and it is also unclear what these quantities actually mean for future application of the tests. We should emphasize here that we are not advocating the use of our proposed method on its own, as there is certainly value in measures such as the empirical AUC, but considering several methods, including ours, and studying the results carefully can provide interesting insights for important applications.
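For reference, the empirical AUC mentioned here equals the Mann-Whitney statistic: the proportion of (case, control) pairs in which the case has the more disease-indicating measurement, with ties counting one half. A generic sketch, with made-up toy scores since the Cox et al. data are not reproduced here, and assuming higher scores indicate disease:

```python
def empirical_auc(case_scores, control_scores):
    """Empirical AUC as the Mann-Whitney statistic: fraction of
    (case, control) pairs ordered correctly, ties counted as 1/2."""
    total = 0.0
    for c in case_scores:
        for h in control_scores:
            if c > h:
                total += 1.0
            elif c == h:
                total += 0.5
    return total / (len(case_scores) * len(control_scores))

# Toy illustration (not the Cox et al. data): 8 of the 9 pairs are
# ordered correctly, so the empirical AUC is 8/9.
auc = empirical_auc([0.9, 0.8, 0.6], [0.7, 0.3, 0.2])
```

Perfect separation gives AUC 1, reversed separation gives 0, and identical score distributions give 0.5, which is why values such as 0.9034 for M1 indicate strong discrimination.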

Concluding Remarks
This paper introduces a new method for comparison of two diagnostic tests to distinguish between two groups, based on the numbers of correctly diagnosed individuals from both groups in a data set. The method uses NPI for Bernoulli quantities and leads to lower and upper probabilities for the event that the total number of correctly diagnosed future individuals from both groups is greater for one test than for the other, if we consider m_x future individuals from group X and m_y from group Y. We believe that such predictive inferences provide valuable insights and can be used together with more traditional methods for comparison of tests. The explicitly predictive nature is natural when one considers that any decision with regard to the choice of test will be relevant for future individuals.
We have not discussed how to choose m_x and m_y; this is not a trivial issue, and we mainly wish to emphasize in this paper that the actual values of these quantities can make a difference to the overall conclusion on which test is best. If the results clearly indicate that one test is better than another for some values of m_x and m_y, and the test is applied sequentially while one needs to select a single test to be used for multiple future individuals, then one could, for example, safely choose the better test for a number of future diagnoses equal to the minimum of these two numbers, because one would of course not know whether the future individuals are from group X or group Y. Similar reasoning was used by Coolen [9] to determine the maximum group size for simultaneous testing in high potential risk scenarios. It is also possible that a practitioner has a fair idea about the proportion of future individuals from either group; this could be used in our analysis by choosing m_x and m_y in a similar proportion.
Another possible choice for these numbers of future individuals would be m_x = n_x and m_y = n_y. This could be of particular interest for studying reproducibility characteristics of the tests, a topic that has recently received increasing interest as there is much confusion about it, and for which NPI methods have proven attractive due to their explicitly predictive nature [10,11].