Background

Definition of indicators of performance

Biomedical ontologies aim at providing the most exhaustive and rigorous representation of reality as described by biomedical sciences. A large part of medical reasoning deals with diagnosis and is essentially probabilistic. It would be an asset for biomedical ontologies to be able to support such probabilistic reasoning.

Ledley and Lusted’s seminal article [1] on Bayesian reasoning in medicine defines different kinds of probabilistic entities. Consider for example the simple case of an instance of test of type IT (for “index test”, that is, a test whose accuracy is being measured) aiming at detecting whether a patient in a group g has an instance of disease of type M.Footnote 1 The performance of test IT in diagnosing M can be quantified by the positive predictive value of this test, hereafter abbreviated PPV, defined by the Oxford Handbook of Medical Statistics [2] as the “proportion of tested positives who are true positives”, and by the negative predictive value, hereafter abbreviated NPV, defined as the “proportion of tested negatives who are true negatives”. These values provide the probability that a patient has or does not have the disease, depending upon the result (positive or negative) of the test.

However, such values depend on some characteristics of the patient. If a patient received a positive test result, the probability that he has the disease can for example depend upon his sex, his smoking status, and other biological or environmental parameters. In particular, it depends on the prevalence of the disease among the group of persons with those characteristics.

Therefore, the statistical data communicated in the medical literature for a test are generally not the positive and negative predictive values, but the so-called “sensitivity” and “specificity”. The Oxford Handbook of Medical Statistics defines sensitivity as “the proportion of those who have the disease who are correctly identified by the test as positive” ([2], p. 340) and specificity as “the proportion of those who do not have the disease who are correctly identified by the test as negative”. The PPV and NPV can be computed from the prevalence Prev, sensitivity Se and specificity Sp using the following Bayesian equations:

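$$ \mathit{PPV}=\frac{\mathit{Prev}\cdot \mathit{Se}}{\mathit{Prev}\cdot \mathit{Se}+\left(1-\mathit{Prev}\right)\cdot \left(1-\mathit{Sp}\right)} $$
$$ \mathit{NPV}=\frac{\left(1-\mathit{Prev}\right)\cdot \mathit{Sp}}{\mathit{Prev}\cdot \left(1-\mathit{Se}\right)+\left(1-\mathit{Prev}\right)\cdot \mathit{Sp}} $$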
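As an illustration, the two equations can be sketched in a few lines of Python. The function name `predictive_values` and the numerical inputs are assumptions chosen for the example, not values taken from any cited study; the sketch only shows how the same test yields different predictive values in groups with different prevalence.

```python
def predictive_values(prev: float, se: float, sp: float) -> tuple[float, float]:
    """Return (PPV, NPV) computed from prevalence, sensitivity and specificity."""
    for name, value in (("prev", prev), ("se", se), ("sp", sp)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must lie in [0, 1], got {value}")
    # Expected share of positive and negative test results in the group.
    tested_positive = prev * se + (1 - prev) * (1 - sp)
    tested_negative = prev * (1 - se) + (1 - prev) * sp
    if tested_positive == 0 or tested_negative == 0:
        raise ValueError("PPV or NPV is undefined for these inputs")
    ppv = prev * se / tested_positive
    npv = (1 - prev) * sp / tested_negative
    return ppv, npv


# Hypothetical inputs: the same Se and Sp applied to two different prevalences.
for prev in (0.01, 0.30):
    ppv, npv = predictive_values(prev, se=0.90, sp=0.95)
    print(f"Prev={prev:.2f}  PPV={ppv:.3f}  NPV={npv:.3f}")
```

With fixed Se and Sp, the PPV rises and the NPV falls as the prevalence increases, which is why predictive values cannot be transferred from one group to another without adjustment.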

In the remainder of the article, sensitivity, specificity, PPV and NPV will be called “(Bayesian) indicators of performance” and abbreviated “IPs”.

In the wake of Ledley and Lusted [1] the sensitivity and specificity values have often been considered as depending only on the pathophysiological characteristics of the disease and of the test, and thus as being independent of the group of people under consideration. However, sensitivity and specificity values do in fact depend upon the group under consideration: this is the “spectrum effect” [3].

The spectrum effect

If IT is an index test and M is a disease, let’s introduce f1(IT,M) as “the proportion of individuals who get a positive result to IT, among individuals who have M”, which fits with the usual definition of sensitivity (as provided by [2]). The main problem with this definition is that it does not specify the reference population. “The individuals who have M” are part of which population: the population in a given sample? The population of a specific country? The whole human population? Ledley and Lusted [1] considered that sensitivity and specificity depend upon pathophysiological characteristics of the disease, but not upon the population under consideration. If this were the case, the proportion of people tested positive among the diseased would be the same in any group under consideration, abstracting from statistical fluctuations due to randomness. However, as has been recognized in the medical literature but is regularly overlooked, this hypothesis is false for at least two reasons. First, most tests are not inherently dichotomous but rely on a categorization of individuals based on continuous traits [3]. Second, various populations can express various disease characteristics (such as various degrees of severity [4]) that will influence the chance of getting a positive result to a test.

The latter can be illustrated with the following example. Suppose that around 80 % of people having rheumatoid arthritis have a rheumatoid factor (RF), and would with certainty receive a positive result to a test that would perfectlyFootnote 2 detect this factor; and that the remaining 20 % do not have a rheumatoid factor, and would receive a negative result (yet do have the disease). The diseased population is then composed of two subgroups: a subgroup sg 1 whose members would all get for sure a positive result to IT, and a subgroup sg 2 whose members would all get for sure a negative result (see Fig. 1). The sensitivity calculated in this example would be 80 %.

Fig. 1. Variation of sensitivity depending on the group under consideration

Nevertheless, in reality, those proportions vary based upon various characteristics of the patients. For example, RF presence increases with age at onset of disease in juvenile arthritis [5]. As a result, the sensitivity of a test for RF will increase according to the age of the individuals of the population being tested. Its sensitivity will be lower in younger patients and higher in older patients.

Therefore, f1 is not a well-defined function: the value of the proportion does not depend only upon IT and M, but also upon the population g under consideration (which could be, for example, the whole human population, the Canadian smoker population, the female population, etc.). This is the “spectrum effect”, which can also be manifested, for example, as a dependence of sensitivity and specificity on the degree of severity of the disease in the group under consideration [4].
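The dependence on the group can be made concrete with a small Python sketch of the rheumatoid factor example. It assumes, as in the example above, a test that detects RF perfectly, so that its sensitivity in a diseased group is simply the share of RF carriers. The 80 % figure comes from the example; the two age-related groups and their counts are hypothetical and only illustrate the direction of the effect.

```python
def sensitivity_of_perfect_rf_test(rf_carriers: int, non_carriers: int) -> float:
    """Sensitivity of a perfect RF test in a diseased group = share of RF carriers."""
    total = rf_carriers + non_carriers
    if total == 0:
        raise ValueError("the diseased group is empty")
    return rf_carriers / total


# (RF carriers, non-carriers) among diseased people in each group.
diseased_groups = {
    "all patients (80 % as in the example)": (800, 200),
    "younger onset (hypothetical counts)": (60, 40),
    "older onset (hypothetical counts)": (90, 10),
}

for name, (carriers, non_carriers) in diseased_groups.items():
    se = sensitivity_of_perfect_rf_test(carriers, non_carriers)
    print(f"{name}: sensitivity = {se:.0%}")
```

The same index test and the same disease yield three different values, one per group, which is what the two-argument function f1(IT,M) cannot express.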

The sensitivity can therefore depend on the group g under consideration. A better candidate than f1(IT,M) for the definition of the sensitivity value would be the function f2(g,IT,M) defined as “the proportionFootnote 3 among people in g who have M of those who would get a positive result to IT if the test IT was realized on them”. The conditional clause “if the test IT was realized on them” is necessary, as a test IT will not be realized on all individuals who have M, but on a sample only. The next part will distinguish three related entities: the real sensitivityFootnote 4 value, its estimates, and the measurements of proportion in samples. It will also explain how such entities should be distinguished in an ontology of IPs.

Methods

Proportion measurement in a sample

It is impossible to know f2(g,IT,M) with certainty in practice, for two reasons. The first reason is that it is often not possible to determine with certainty, through reasonable means, whether a given person has the disease M or not; in some cases, the only way to be certain would be to perform an autopsy on the deceased patient. Therefore, one needs to use a “reference test”, which is the best diagnostic test that is reasonable to perform in the present context (for more on the distinction between a reference test and the associated disease, see section “The challenge of representing indicators of performance in an ontology” below).

If the patient receives a positive result to this reference test, it will be concluded that he has the disease; if he receives a negative result, it will be concluded that he does not have it. But those inferences can be wrong: the reference test might lead to a positive result for a non-diseased person, or a negative result for a diseased person. If RT is a reference test for M and IT is an index test (of unknown accuracy) for M, then one can define the function f3(g,IT,RT) as “the proportion, among individuals of g who would get a positive result to RT if the test RT had been performed on them, of people who would get a positive result to IT if the test IT was realized on them”. Since RT is a reference test for M, f3(g,IT,RT) approximates f2(g,IT,M). Both values can differ though: this is a first epistemic limit to the knowledge of f2(g,IT,M).

On top of this, f3(g,IT,RT) is not directly measurable. As a matter of fact, a test IT is never realized on a population as large as e.g., the whole population of smokers, or the whole male population. It is however possible to approximate f3(g,IT,RT) by performing both tests IT and RT on individuals in a sample s judged as being representative of the population g. Let’s define f4(s,IT,RT) as “the proportion, among members of s who got a positive result to RT, of those who got a positive result to IT”. If s is a representative sample of g, then f4(s,IT,RT) does approximate f3(g,IT,RT) – and thus, by transitivity, does approximate f2(g,IT,M). Note that as long as the sample s is not perfectly representative of g, f4(s,IT,RT) will differ at least slightly from f3(g,IT,RT) (which also differs from f2(g,IT,M)): this is a second limit to the knowledge of f2(g,IT,M).

Let’s illustrate those two limits of estimations with a study [4] which analyzes the quality of the Neer test (here written IT’) for diagnosing the shoulder impingement syndrome (written M’), a syndrome that is characterized by rotator cuff muscles inflammation near the sub-acromial space. In this study, the Neer test IT’ is realized on a sample (written s’) of 552 patients, judged as representative of the target population (g’). Park et al. [4] take as reference test (RT’) the surgical observation. Here, f4(s’,IT’,RT’) is the proportion of people in the sample who have received a positive result to the Neer test, among those diagnosed as positive by surgical operation. f4(s’,IT’,RT’) approximates f3(g’,IT’,RT’), namely the proportion of individuals in the target population g’ who would get a positive result to the Neer test among those who would get a positive result by surgical observation, if those tests were performed on them. Finally, f3(g’,IT’,RT’) itself approximates f2(g’,IT’,M’), which is the proportion of individuals in g’ who would receive a positive Neer test result among those who have an impingement syndrome. Thus, f4(s’,IT’,RT’) approximates f2(g’,IT’,M’).

Note that similar approximation strategies hold for prevalence, specificity, PPV and NPV. Concerning e.g. specificity, one could thus define f’2(g,IT,M) as “the proportionFootnote 5 among people in g who don’t have M of those who would get a negative result to IT if the test IT was performed on them”; and f’4(s,IT,RT) as “the proportion, among members of s who got a negative result to RT, of those who got a negative result to IT”. Thus, f’4(s,IT,RT) approximates f’2(g,IT,M).
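The sample-level proportions f4 and f’4 are the only quantities in this chain that can be computed directly from recorded results. A minimal Python sketch, assuming each member of the sample is recorded as a pair of boolean results (the type name `PairedResult`, the function names and the ten-patient sample are hypothetical):

```python
from typing import NamedTuple, Sequence


class PairedResult(NamedTuple):
    index_positive: bool      # result of the index test IT
    reference_positive: bool  # result of the reference test RT


def f4(sample: Sequence[PairedResult]) -> float:
    """Proportion of IT-positives among the RT-positive members of the sample."""
    reference_positives = [r for r in sample if r.reference_positive]
    if not reference_positives:
        raise ValueError("no member of the sample tested positive with RT")
    return sum(r.index_positive for r in reference_positives) / len(reference_positives)


def f4_prime(sample: Sequence[PairedResult]) -> float:
    """Proportion of IT-negatives among the RT-negative members of the sample."""
    reference_negatives = [r for r in sample if not r.reference_positive]
    if not reference_negatives:
        raise ValueError("no member of the sample tested negative with RT")
    return sum(not r.index_positive for r in reference_negatives) / len(reference_negatives)


# Hypothetical sample of ten patients.
sample = (
    [PairedResult(True, True)] * 3      # positive by IT and RT
    + [PairedResult(False, True)] * 1   # negative by IT, positive by RT
    + [PairedResult(True, False)] * 2   # positive by IT, negative by RT
    + [PairedResult(False, False)] * 4  # negative by IT and RT
)
print(f4(sample), f4_prime(sample))
```

Nothing in this computation refers to the disease M or to the population g: both enter only through the assumptions that RT is a good reference test for M and that the sample is representative of g.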

Sensitivity value and sensitivity estimates

Now that those definitions have been given, we can determine which entity the word ‘sensitivity’ refers to in the medical literature. At first sight, this term might appear polysemous. To illustrate this, let’s consider a study which evaluates the quality of an exercise test in the diagnosis of coronary artery disease, and claims: “The sensitivity varied substantially according to sex (women 30 % and men 64 %)” [6]. On the one hand, the statement “sensitivity varies substantially according to the sex” suggests that sensitivity depends on the target population g under consideration, and that there is a sensitivity value for the female population, and another one for the male population. This formulation thus suggests that the sensitivity value is given by the function f2(g,IT,M). On the other hand, the value 30 % assigned to the sensitivity of the test for women refers to a proportion which has been measured by the authors in a sample of 37 women, using coronary angiography as a reference test. This might thus suggest that the sensitivity value is in fact given by the function f4(s,IT,RT).

However, two arguments suggest that the sensitivity value should be interpreted as f2(g,IT,M) rather than f4(s,IT,RT). First, the value which is ultimately relevant for medical practice is f2(g,IT,M): if s is a sample of g and RT is a reference test for M, f4(s,IT,RT) is of interest for the medical practitioner only insofar as it provides information on the disease M and the target population g from which the sample is taken, that is, insofar as it provides an estimate of f2(g,IT,M). Indeed, the fact that a few people who got a positive result to RT in a given sample got a positive or negative result to a test IT has medical relevance only insofar as it teaches us something about how diseased people in the target population (not only in the sample) will react to this test IT.

Second, the sensitivity value is usually given with a 95 % confidence interval (see e.g., [7] or [8]), which estimates the likely range of error in determining the sensitivity value. But f4(s,IT,RT) can be measured with certainty,Footnote 6 and thus the confidence interval cannot characterize the uncertainty on our knowledge of f4. On the other hand, there is some uncertainty on the knowledge of f2(g,IT,M) and f3(g,IT,RT), as they are estimated on the basis of f4(s,IT,RT). Therefore, the 95 % confidence interval would characterize the uncertainty on the knowledge of f3(g,IT,RT), which is taken as a proxy for f2(g,IT,M).Footnote 7
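To make the second argument concrete, the following Python sketch computes a 95 % interval from sample counts. The Wilson score method is used here purely as an assumption for illustration: the studies cited above may rely on other interval methods, and the counts in the example are hypothetical.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a proportion (z = 1.96 gives roughly 95 %)."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")
    p = successes / n  # the measured proportion, e.g. f4(s,IT,RT)
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    # Clamp to [0, 1] to absorb floating-point rounding at the boundaries.
    return max(0.0, centre - half_width), min(1.0, centre + half_width)


# Hypothetical counts: 6 IT-positives among 20 RT-positive members of a sample.
low, high = wilson_interval(6, 20)
print(f"measured proportion = {6 / 20:.2f}, interval = [{low:.2f}, {high:.2f}]")
```

The measured proportion itself is exact for the sample; the interval only makes sense as a statement about a quantity defined beyond the sample, which is the point made above.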

Thus, those two arguments suggest that the term “sensitivity” should refer to f2(g,IT,M) – which is relative to a disease and a target population – rather than to f4(s,IT,RT) – which is relative to a reference test and a sample.Footnote 8 As for f4(s,IT,RT), it can be interpreted as the value of a measurement of proportion in a sample, which provides an estimate of the sensitivity value.

Therefore, a sentence such as “The sensitivity varied substantially according to sex (women 30 % and men 64 %)” should, more rigorously, be formulated as: “The sensitivity varies substantially depending on the sex: through measurement of proportions in samples, its value was estimated to be 30 % for the women, and 64 % for the men”. We could prefer the first formulation, more compact, for practical reasons; but it is important to remember that it is only a shortcut for the second formulation.

Accordingly, we will need to dissociate three different kinds of entities. First, the execution of tests on a sample s, that is, the process of performing tests IT and RT and recording the numbers of true positives, false positives, true negatives and false negatives as operationalized by IT and RT; for example, the false positives are the people in the sample s who are tested positive by the index test IT but negative by the reference test RT. Second, the proportion of true positives among positives (as given by the reference test), which is relative to the index test, the reference test and the sample, and whose value is given by the function f4(s,IT,RT); as such, it provides an estimate of the sensitivity value. Third, the “real sensitivity”, which is relative to an index test, a disease and a population g, and whose value f2(g,IT,M) is given by the proportion of people in the group who would have a positive result to the test IT among those who are diseased. The real sensitivity would provide better information than a sensitivity estimate on the probability that a random member of the group g would get a positive test result if he has the disease. However, its value f2(g,IT,M) cannot be known with certainty, contrary to the value of the sensitivity estimate f4(s,IT,RT).

More generally, those considerations can be adapted to other indicators of performance (specificity, PPV and NPV), as well as the prevalence. In particular, f’2(g,IT,M) should refer to the real specificity value, whereas f’4(s,IT,RT) can be interpreted as the value of a measured proportion in a sample that provides an estimate of the real specificity value. Real sensitivity, specificity, PPV and NPV, as we have defined them above, depend neither on the sample nor on the reference test. However, they are estimated on the basis of proportion measurements which depend on both the sample and the reference test. Accordingly, when a study [9] mentions the “cadaveric prevalence” of rotator cuff tears, this expression should be understood as a linguistic shortcut denoting a proportion measurement in a sample when cadaveric analysis is adopted as the reference test; and the “radiological prevalence” should be understood as a proportion measurement when radiological analysis is adopted as the reference test. The real prevalence, however, does not depend on the reference test.

Aggregation of sensitivity estimates

Finally, we need to add a last layer to this model. Approximations of sensitivity taken in different samples, with different index tests, can be combined in order to build a finer estimate of sensitivity for a more encompassing category of index tests. Consider for example the meta-analysis [7] which assesses the quality of peripheral thermometers in detecting fever. Its authors use a pulmonary artery catheter as reference test, and consider 29 studies assessing the sensitivity and specificity of those devices. Combining those values, they come up with an estimate of 0.64 for the sensitivity and of 0.96 for the specificity.

The challenge of representing indicators of performance in an ontology

To the extent that they aim at representing biomedical knowledge and enabling medical reasoning, biomedical ontologies should provide a formalization of IPs as well as the prevalence, by dissociating e.g. the real sensitivity from the sensitivity estimates, and the process leading to those estimates. This article will introduce such a formalization in the context of the OBO Foundry [10], one of the largest sets of interoperable ontologies in the biomedical domain, built on the upper ontology Basic Formal Ontology (BFO) 1.1 [11].

BFO endorses a realist methodology, which carefully dissociates material entities (such as disorders) from informational entities (such as diagnoses). In common medical practice, a disease may be diagnosed in ideal circumstances by a given gold standard test, which can be defined as the most accurate reference test; but the disease, the diagnosis, and the result of a gold standard test are three different entities that should be distinguished. As a matter of fact, many human diseases already existed thousands of years ago, long before they could be diagnosed. Moreover, a diagnosis can be wrong or imprecise. Finally, a given gold standard can later be replaced by a better one: this shows that the disease cannot be defined by a positive result to a gold standard; otherwise, there could not be, by definition, a “better” gold standard. Thus, while a diagnosis of a disease represents the best knowledge by some health or research professional of the presence of the disease in a particular patient, a diagnosis is not equivalent to a disease: it is rather “about” a disease. This formalization is compatible with IAO (Information Artifact Ontology [16]) and OGMS (Ontology for General Medical Sciences).

The question of how probabilistic notions can be represented in ontologies has been tackled from different perspectives in the past. For example, [12] has proposed the alternative PR-OWL format that extends the classical OWL format; we take here a different approach, which does not aim at changing the OWL format. Soldatova and colleagues [13] have described a model in which probabilities can be assigned to research statements. We build here upon an alternative approach [14], in which probabilities can be assigned to dispositions.

Sensitivity and specificity have been recently introduced in the Ontology of Biological and Clinical Statistics (OBCS [15]) as subclasses of Data item. We will partly endorse and refine this classification, by considering estimates of sensitivity and specificity as subclasses of Data Item, and extend this classification to PPV and NPV. A data item, as defined by the Information Artifact Ontology (IAO) [16], is intended to be a truthful statement about something. In order to formalize IPs, one should thus clarify which entities in the real world they are about.

Proportion measurements are data items that are obtained from some processes named "proportion measures", which involve performing two kinds of tests (the index test and the reference test) in a sample. On the other hand, we have defined a real sensitivity value f2(g,IT,M) as the proportion of people who would get a positive result by IT among those who have the disease M. But note here the conditional structure: what is referred to is the proportion of true positives among diseased if IT was performed on them. In realistic situations, however, as explained above, the sensitivity value will be estimated by performing the test on a sample of the population only – not the entire population g; thus, f2(g,IT,M) is the value of a non-actual proportion.Footnote 9 However, possible-but-non-actual situations cannot be straightforwardly represented in a realist ontology like BFO. To solve this problem, we will formalize the real IP value as the probability assigned to a disposition borne by an instance of group of individuals; and estimates of IPs as data items which are about such a disposition. This will provide a formal characterization of IPs and their estimates based on proportion measurements.

Results

The formalization presented here is depicted in Fig. 2 and Fig. 3, in which classes are in rectangles, instances in boxes with rounded edges, and the numerical values assigned by datatype properties in ellipses. Unless specified otherwise, all the relations used here belong to BFO 1.1 [11].

Fig. 2. Real sensitivity and specificity values and their estimates

Fig. 3. Aggregation of several sensitivity estimates

Test results and sensitivity estimate

Let us start with the formalization of test results and the IP estimates they lead to (see Fig. 2).Footnote 10 A Medical_test will here be considered as a subclass of Planned_process (as defined by OBI, the Ontology for Biomedical Investigations [17]) which consists in the observation of a given feature to infer the presence of another feature, in the case of interest a pathological entity such as a disease. Consider a medical testFootnote 11 IT 1 and a disease M:

  • Medical_test is_a Planned_process

  • IT 1 is_a Medical_test

  • M is_a Disease

Suppose that we are interested in the sensitivity and specificity of test IT 1 for diagnosing M in a group g 1 . This group g 1 will be formalized as a collection of humans (for more on collections, see [18]). To estimate this sensitivity and specificity, one can select a sample s 1 considered to be representative of g 1 (which will be called the reference class). Thus:

  • g 1 instance_of Collection_of_humans

  • s 1 instance_of Sample_of_humans

  • Sample_of_humans is_a Collection_of_humans

  • s 1 part_of g 1

Let’s now introduce the class of tests RT 1 which are reference tests for M:

  • RT 1 is_a Medical_test

s 1 is composed of n humans, named p 1 , p 2 ,…,p n . TwoFootnote 12 tests will be performed on each p i : an instance of RT 1 , hereafter named rt 1,i , and an instance of IT 1 , named it 1,i ; thus, for every i between 1 and n:

  • p i instance_of Human

  • p i part_of s 1

  • p i participates_in rt 1,i

  • p i participates_in it 1,i

We introduce tests_execution s1, IT1,RT1 which has as part all the tests rt 1,i and it 1,i for i between 1 and n and the recording of which members of the sample are true positives (those who have been tested positive both by IT 1 and RT 1 ), true negatives (those who have been tested negative both by IT 1 and RT 1 ), false positives (those who have been tested positive by IT 1 but negative by RT 1 ) and false negatives (those who have been tested negative by IT 1 but positive by RT 1 ). This recording leads (OBI:has_specified_output) to the creation of the instance of Data_set named tests_results s1, IT1,RT1 :

  • tests_execution s1, IT1,RT1 instance_of Planned_process

  • rt 1,i part_of tests_execution s1, IT1,RT1

  • it 1,i part_of tests_execution s1, IT1,RT1

  • tests_results s1, IT1,RT1 instance_of Data_set

  • tests_execution s1, IT1,RT1 has_specified_output tests_results s1, IT1,RT1

The tests_results s1,IT1,RT1 will then serve as input (OBI:has_specified_input) to a planned process noted computation Se 1 which computes a sensitivity estimate noted estimate Se 1 , by calculating the proportion of true positives among positives:Footnote 13

  • computation Se 1 instance_of Planned_process

  • estimate Se 1 instance_of Data_item

  • computation Se 1 has_specified_input tests_results s1, IT1,RT1

  • computation Se 1 has_specified_output estimate Se 1

Finally, we can use the datatype property OBI:has_specified_value to relate estimate Se 1 with its numerical value f4(s 1 ,IT 1 ,RT 1 ):

  • estimate Se 1 has_specified_value f4(s 1 ,IT 1 ,RT 1 )
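The chain of assertions above can be sketched as a small set of subject-predicate-object triples in plain Python. This is only an illustration of the structure, not an OWL serialization: the identifiers with underscores stand in for the subscripted names of the text, the helper functions are hypothetical, and the numerical value 0.75 is an arbitrary placeholder for f4(s 1 ,IT 1 ,RT 1 ).

```python
# Identifiers follow the text; underscores replace the subscripts.
TRIPLES = {
    ("Medical_test", "is_a", "Planned_process"),
    ("IT1", "is_a", "Medical_test"),
    ("RT1", "is_a", "Medical_test"),
    ("M", "is_a", "Disease"),
    ("Sample_of_humans", "is_a", "Collection_of_humans"),
    ("g1", "instance_of", "Collection_of_humans"),
    ("s1", "instance_of", "Sample_of_humans"),
    ("s1", "part_of", "g1"),
    ("tests_execution_s1_IT1_RT1", "instance_of", "Planned_process"),
    ("tests_results_s1_IT1_RT1", "instance_of", "Data_set"),
    ("tests_execution_s1_IT1_RT1", "has_specified_output", "tests_results_s1_IT1_RT1"),
    ("computation_Se1", "has_specified_input", "tests_results_s1_IT1_RT1"),
    ("computation_Se1", "has_specified_output", "estimate_Se1"),
    ("estimate_Se1", "has_specified_value", 0.75),  # placeholder for f4(s1, IT1, RT1)
}


def objects(subject, predicate):
    """All o such that (subject, predicate, o) is asserted."""
    return {o for s, p, o in TRIPLES if s == subject and p == predicate}


def subjects(predicate, obj):
    """All s such that (s, predicate, obj) is asserted."""
    return {s for s, p, o in TRIPLES if p == predicate and o == obj}


def provenance(estimate):
    """Trace an estimate back to its computation, data set and test execution."""
    chain = []
    for computation in sorted(subjects("has_specified_output", estimate)):
        for data_set in sorted(objects(computation, "has_specified_input")):
            for execution in sorted(subjects("has_specified_output", data_set)):
                chain.append((execution, data_set, computation))
    return chain


print(provenance("estimate_Se1"))
```

Following has_specified_output and has_specified_input backwards from the estimate recovers the sample-level process it came from, which is the information needed to judge what the estimate can legitimately be taken to approximate.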

Similar strategies can hold for representing Specificity, PPV and NPV and their estimates.Footnote 14

Aggregation of sensitivity estimates

We will now show how various sensitivity estimates can be aggregated for a finer sensitivity estimate (cf. Fig. 3). Suppose that we have another sample s 2 (also a part_of g), composed of n’ humans named q 1 , q 2 , ..., q n' . We can perform another measure of sensitivity for a related (possibly identical to IT 1 ) index test IT 2 for M in g on this sample, using a related (possibly identical to RT 1 ) reference test RT 2 , by performing instances of RT 2 named rt 2,j (for j between 1 and n’) and instances of IT 2 named it 2,j on each member q j of s 2 . One can then define the entity tests_execution s2, IT2,RT2 as a planned process which has as part those tests rt 2,j and it 2,j , and which has as output tests_results s2, IT2,RT2 ; the latter serves as input to another computation of sensitivity computation Se 2 , which has as output another estimate of sensitivity estimate Se 2 , to which the value f4(s 2 ,IT 2 ,RT 2 ) can be assigned (the latter being the proportion, among people who have been tested positive by RT 2 in s 2 , of people who had a positive result to IT 2 ).

As explained earlier, various sensitivity estimates can be combined to estimate the value of the sensitivity of a test for M in g. If IT 1 and IT 2 on one hand, and RT 1 and RT 2 on the other hand, are similar enough (in particular, if they are identical), those results might be gathered to come up with a finer estimate of the sensitivity value. More specifically, if IT 1 and IT 2 can be subsumed under a common index test class IT 0 , and RT 1 and RT 2 can also be subsumed under a common reference test class RT 0 , then their values can be compiled mathematically (for example by meta-analysis methods) to come up with the value of a (hopefully finer) estimate named estimate Se 1,2 , whose value is given by a function h(s 1 ,IT 1 ,RT 1 ,s 2 ,IT 2 ,RT 2 ). This can be generalized to the aggregation of more than two former estimates.

We can here introduce a planned process of computation of sensitivity named computation Se 1,2 , which takes as input both estimate Se 1 and estimate Se 2 , and the output of such a process, a data item named estimate Se 1,2 :

  • computation Se 1,2 instance_of Planned_process

  • estimate Se 1,2 instance_of Data_item

  • computation Se 1,2 has_specified_input estimate Se 1

  • computation Se 1,2 has_specified_input estimate Se 2

  • computation Se 1,2 has_specified_output estimate Se 1,2

  • estimate Se 1,2 has_specified_value h(s 1 ,IT 1 ,RT 1 , s 2 ,IT 2 ,RT 2 )

We will not aim at giving the details of this function h, which is the responsibility of the statistician, not the ontologist – who focuses on how to represent such values.
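Purely for illustration, one very simple candidate for h can be sketched in Python: pooling the counts of the individual samples, which amounts to weighting each estimate by the number of reference-test positives behind it. This is an assumption made for the example, not the method of any cited meta-analysis, which would typically rely on more elaborate statistical models. The class name, function name and counts are hypothetical.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class SensitivityEstimate:
    """A sample-level estimate, kept together with the counts it was computed from."""
    true_positives: int       # positive by both IT and RT
    reference_positives: int  # positive by RT

    @property
    def value(self) -> float:
        return self.true_positives / self.reference_positives


def pooled_estimate(estimates: Sequence[SensitivityEstimate]) -> float:
    """Hypothetical h: pool counts across samples (a weighted mean of the estimates)."""
    total_reference_positives = sum(e.reference_positives for e in estimates)
    if total_reference_positives == 0:
        raise ValueError("no reference-test positives in any sample")
    return sum(e.true_positives for e in estimates) / total_reference_positives


# Hypothetical counts for two samples s1 and s2.
estimate_se_1 = SensitivityEstimate(true_positives=8, reference_positives=10)
estimate_se_2 = SensitivityEstimate(true_positives=30, reference_positives=50)
estimate_se_1_2 = pooled_estimate([estimate_se_1, estimate_se_2])
print(estimate_se_1.value, estimate_se_2.value, estimate_se_1_2)
```

Whatever function is chosen, the ontological structure is unchanged: the aggregated value is the specified value of a data item output by a computation that takes the earlier estimates as input.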

Finally, since estimate Se 1 and estimate Se 1,2 are informational entities, they must be about some entities. To determine what these estimates are about, we will need to formalize the entity to which the “real sensitivity value” is assigned.

Real sensitivity value

As said earlier, estimates of sensitivity of IT for M in g aim at estimating the real sensitivity value, which is given by the proportion of members of g who would get a positive result to IT among those who have M. However, the condition of performing the test IT on the members of g is never realized, because the test is performed (at best) on one or several samples of the population, not on the whole population g: the performance of test IT on the members of g is a possible (leaving aside practical difficulties), non-actual condition. Interpreting specificity, PPV, and NPV along the former lines would also imply such possible, non-actual conditions.

BFO’s realist methodology [19] implies that all instances should be actual entities. Thus, one cannot directly represent such a possible-but-not-actual condition in an ontology based on BFO. In order to solve this difficulty, we will introduce a strategy named “randomization”, which will clarify the nature of the real sensitivity value as a probability assigned to an actual entity, namely a disposition. This will also clarify what an estimate of sensitivity is about, namely this disposition. Thus, it will make it possible to represent IPs in a realist fashion, compliant with BFO’s methodology.

From proportions to objective probabilities: the randomization strategy

We will first explain how the proportion of a subgroup in a group can be formalized as a probability value assigned to a disposition; this will help explain later how the proportion of a subgroup in a group undergoing a possible, non-actual condition can be formalized along similar lines.

Dispositions are entities that can exist without being manifested; an example of disposition is the fragility of a glass, which can exist even when the glass does not break. We will use Röhl & Jansen's model of disposition [20] in BFO, which associates to every instance of disposition one or several instances of realizations, and one or several instances of triggers (a trigger is the specific process that can lead to a realization occurring). In this model, the fragility of a glass is a disposition of the glass to break (the breaking process is the realization) when it undergoes some kind of stress (the process of undergoing such a stress is the trigger); this disposition inheres in the glass. Starting with the definition of these entities and their relations at the instance level, Röhl & Jansen proceed to formalize them at the universal level. Previous work [14] has shown how to adapt this model to probabilistic dispositions. Thus, an instance of balanced coin is the bearer of an instance of disposition to fall on heads (the realization process) when it is tossed (the trigger process), to which an objective probability 1/2 can be assigned.

We will now extend the scope of this model to the situation at hand. Consider the prevalence Prev(g,M), which was defined above as the proportion of persons having M in the actual population g. We can define the disposition d Prev g,M , borne by the group g, that a person randomly drawn in g has M. More specifically, let’s write T g the process “randomly drawing a person in g”, and R g ,M the process “drawing by T g someone who has M”: the triggers of d Prev g,M are instances of T g and its realizations are instances of R g ,M . Following the lines of previous work [14], one can thus define the probability assigned to the dispositionFootnote 15 d Prev g,M , which is the probability of randomly drawing someone who has M in g. This probability is equal to the proportion of individuals who have M in g, that is, to Prev(g,M): if there are e.g., 10 % diseased people in g, then the probability of randomly drawing a diseased person in g is 10 %. Thus, the prevalence value can be identified with the objective probability assigned to the disposition d Prev g,M . We name this strategy the “randomization” of the proportion of persons having M in g.
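The identity between the proportion and the probability of a uniform random draw can be illustrated with a short Python sketch. The group of 100 people with 10 % diseased mirrors the example above; the function names, the number of simulated draws and the seed are arbitrary choices made for the illustration.

```python
import random
from fractions import Fraction
from typing import Sequence


def prevalence(group: Sequence[bool]) -> Fraction:
    """Proportion of members who have M (True = has M)."""
    if not group:
        raise ValueError("the group is empty")
    return Fraction(sum(group), len(group))


def simulated_draw_frequency(group: Sequence[bool], n_draws: int, seed: int = 0) -> float:
    """Relative frequency of realizations among repeated triggers (uniform draws)."""
    rng = random.Random(seed)
    hits = sum(rng.choice(group) for _ in range(n_draws))
    return hits / n_draws


# A group g of 100 people, 10 of whom have M.
g = [True] * 10 + [False] * 90

print("proportion in g:", prevalence(g))
print("frequency over 20000 simulated draws:", simulated_draw_frequency(g, 20_000))
```

The exact probability of a single draw hitting a diseased person equals the proportion; the simulated frequency only approaches it, in the same way that a measured proportion in a sample only approaches the real value.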

The randomization strategy may not be necessary to formalize a proportion in an actual group, such as the prevalence. But this strategy can also be applied to proportions of people in groups which are subject to a possible, non-actual condition, and can thus be relevant to formalize sensitivity and other IPs, and their estimates. As a matter of fact, the real sensitivity value f2(g,IT,M) was defined as the proportion of people who would get a positive result to IT among M’s bearers in g. This value can be “randomized” as follows. We can define d Se g,IT,M as the disposition (Footnote 16) to draw randomly, among the individuals of g who have M, someone who is tested positive by IT. More specifically, let us define the process T Se g,IT,M as the “performance of test IT on the individuals in g, and random draw of an individual among those who have the disease M” (Footnote 17), and the process R Se g,IT,M as the “drawing by T Se g,IT,M of someone who got a positive result to IT”. The triggers of d Se g,IT,M are instances of T Se g,IT,M, and its realizations are instances of R Se g,IT,M. As it happens, the real sensitivity value f2(g,IT,M) is the objective probability assigned to this disposition d Se g,IT,M: indeed, if, for example, 15 % of the diseased people in g would get a positive result by IT, then the probability of randomly drawing someone who got a positive test result by IT among diseased people in g, if test IT were performed on them, is equal to 15 %.

Specificity values can be defined along similar lines, as probabilities assigned to actual dispositions borne by the group g, noted d Sp g,IT,M (and similarly for the PPV and NPV). Although both d Se g,IT,M and d Sp g,IT,M are dispositions inhering in g, they have different triggers and different realizations; the process T Sp g,IT,M is the “performance of test IT on the individuals in g, and random draw of an individual among those who do not have the disease M” and the process R Sp g,IT,M is the “drawing by T Sp g,IT,M of someone who got a negative result to IT”.
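The two conditional proportions described above can be illustrated with a short Python sketch. It assumes a toy group in which each person is recorded as a pair (has M, would test positive by IT); the data and the function names `real_sensitivity` and `real_specificity` are hypothetical and serve only to make the definitions concrete.

```python
from fractions import Fraction

def real_sensitivity(group):
    """Proportion of M's bearers in the group who would test positive by IT."""
    diseased = [positive for has_m, positive in group if has_m]
    return Fraction(sum(diseased), len(diseased))

def real_specificity(group):
    """Proportion of persons without M in the group who would test negative."""
    healthy = [positive for has_m, positive in group if not has_m]
    return Fraction(len(healthy) - sum(healthy), len(healthy))

# Hypothetical group g: 20 bearers of M, of whom 3 would test positive
# (15 %, as in the example of the text), and 80 persons without M,
# of whom 72 would test negative.
g = ([(True, True)] * 3 + [(True, False)] * 17
     + [(False, False)] * 72 + [(False, True)] * 8)

print(real_sensitivity(g))  # 3/20
print(real_specificity(g))  # 9/10
```

Each function restricts the group to the subpopulation named in the corresponding trigger process before computing the proportion, which is why the two dispositions have different triggers and realizations although both inhere in g.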

Assignment of real sensitivity values to dispositions

Let us now consider how to formalize these probability values in ontologies. d Se g,IT,M is a disposition individual inhering in the group g; and a probability value can be assigned to this disposition using a datatype property has_probability_value [15]. This probability value is what we called the real sensitivity value (Footnote 18):

  • d Se g,IT,M has_probability_value f2(g,IT,M)

Thanks to our analysis above, we can now answer our original question, and state what sensitivity estimates such as estimate Se 1 or estimate Se 2 are about (Footnote 19), namely this disposition:

  • estimate Se 1 is_about d Se g1,IT1,M

  • estimate Se 2 is_about d Se g2,IT2,M

Also, if the samples s 1 and s 2 are considered by the statistician as representative enough of a general population g 0 encompassing g 1 and g 2 , if RT 1 and RT 2 are considered as similar enough to be representative in the same way of the disease M, and if IT 1 and IT 2 are considered as similar enough to be representative of a more general index test IT 0 , then:

  • estimate Se 1,2 is_about d Se g0,IT0,M

As d Se g,IT,M is an individual, it cannot be related directly to the classes IT and M, but only indirectly, through the following formalization. First, d Se g,IT,M can be seen as an instance of a disposition class written D Se IT,M, which has as trigger the process class T Se IT,M: “performance of test IT on the members of a group, and random draw of a person among those who have the disease M”; and as realization the process class R Se IT,M defined as “drawing by T Se IT,M of someone who got a positive result to IT”. We can then introduce two new relations sensitivity_disposition_of_test and sensitivity_disposition_for (abbreviated as se_of_test and se_for_disease) relating D Se IT,M with IT and M:

  • d Se g,IT,M instance_of D Se IT,M

  • D Se IT,M is_a Disposition

  • D Se IT,M se_of_test IT

  • D Se IT,M se_for_disease M

These two relations se_of_test and se_for_disease are introduced for pragmatic reasons of ease of use: on a foundational level, D Se IT,M and M (resp. IT) could be related through a complex array of relations and entities that involve the relation has_trigger between D Se IT,M and T Se IT,M, as well as a sequence of relations between T Se IT,M and M (resp. IT). Such an analysis would raise interesting theoretical questions, as instances of D Se IT,M can exist even if no instance of M or IT exists; we therefore face here issues similar to the ones addressed by [20] and [21].

Figure 2 represents classes and particulars involved in formalizing tests execution and results, sensitivity estimates, the disposition this estimate is about, and the real sensitivity value. Figure 3 represents the classes and particulars involved in formalizing aggregation of sensitivity estimates into a finer estimate. Specificity, PPV and NPV can be formalized along similar lines, as data items about dispositions related to tests and diseases through relations that could be labeled sp_of_test, sp_for_disease, ppv_of_test, ppv_for_disease, npv_of_test, and npv_for_disease.

Example of application

An example will now illustrate this formalization. McTaggart and colleagues [8] have performed a meta-analysis to determine the accuracy of point-of-care tests for detecting albuminuria (let’s call IT 0 the class of such index tests), using as reference test a laboratory test albumin-creatinine ratio-ACR (let’s call RT 0 the class of such reference tests).

They take into account ten studies in their article. Consider for example Lloyd et al. [22], which measures the accuracy of the semiquantitative Clinitek® microalbumin urine dipstick with a cutoff value indicating albuminuria at 3.4 mg/mmol (let’s call IT 1 the class of such index tests), with a laboratory ACR test with the same cutoff value as a reference (let’s call RT 1 the class of such reference tests). A sample s 1 of 204 diabetic patients (labelled here p 1,1, p 1,2, …, p 1,204) was considered. On each of those patients, one measurement of IT 1 called a 1,i,1 and one of RT 1 called rt 1,i,1 is performed. The 2 × 204 = 408 processual entities are all part of a general tests execution process labelled tests_execution s1,IT1,RT1, which leads after computation to the informational entity estimate Se 1, giving the proportion of measure pairs in which IT 1 led to a positive result among those in which RT 1 led to a positive result. This proportion is 83.8 %, and therefore, the value f4(s 1,IT 1,RT 1) of the informational entity estimate Se 1 is 0.838.
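The computation that leads from measure pairs to a sensitivity estimate can be sketched as follows. This is a minimal Python illustration assuming each patient contributes one pair (index test positive, reference test positive); the six pairs are hypothetical toy data and are not the counts reported by Lloyd et al. [22], and the function name `sensitivity_estimate` is an assumption.

```python
def sensitivity_estimate(pairs):
    """Proportion of pairs with a positive index test result among the
    pairs with a positive reference test result."""
    index_results = [index_pos for index_pos, ref_pos in pairs if ref_pos]
    if not index_results:
        raise ValueError("no reference-positive pair: the estimate is undefined")
    return sum(index_results) / len(index_results)

# Hypothetical measure pairs (index test positive, reference test positive).
pairs = [
    (True, True), (True, True), (False, True),
    (True, False), (False, False), (True, True),
]

print(sensitivity_estimate(pairs))  # 0.75
```

Only the pairs in which the reference test is positive enter the denominator, which mirrors the definition of the proportion given in the text.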

Writing g for the human population, we have s 1 part_of g; also, RT 1 is_a RT 0 and IT 1 is_a IT 0. Therefore, f4(s 1,IT 1,RT 1) provides an estimate of f2(g,IT 0,RT 0), which is the sensitivity value of a point-of-care test in detecting albuminuria in the general population. However, other studies are pooled with this one by McTaggart and colleagues [8] to provide a better estimate of f2(g,IT 0,RT 0). Taken together, they lead to the value h(s 1,IT 1,RT 1,…,s 10,IT 10,RT 10), which provides an estimate of the value of f2(g,IT 0,RT 0).

Note that the ten studies taken into account in this meta-analysis include different kinds of patients. Seven studies each involve a different sample of patients (let’s call them s 1, s 2, …, s 7) with diabetes mellitus, one of them (s 7) involving young patients with type 1 diabetes. Two studies consider samples of patients (s 8 and s 9) with kidney disease, diabetes mellitus, or both. Finally, one study includes a sample (s 10) of patients treated for advanced chronic kidney disease in a renal outpatient clinic. Let’s call g the human population, g 1 the members of g who have diabetes mellitus, g 2 the members of g who have a kidney disease and g 0 the members of g who have either diabetes mellitus or a kidney disease (that is, g 0 is the mereological sum of g 1 and g 2). All s i are part of g, the human population. Thus, the meta-analysis made by McTaggart and colleagues [8] provides an estimation of f2(g,IT 0,RT 0) or f2(g 0,IT 0,RT 0). If the meta-analysis had been performed on s 1 to s 7 only, then it would have provided an estimation of f2(g 1,IT 0,RT 0); and if it had been performed on samples of patients with kidney disease only, then it would have provided an estimation of f2(g 2,IT 0,RT 0).

Note also that various cutoff values can be used to define the presence of albuminuria, ranging from 2.65 mg/mmol to 3.4 mg/mmol, and those values are chosen by the medical sub-community that is conducting the study (the same cutoff value is taken for both IT 0 and RT 0 in each study). Therefore, the classes IT 0 and RT 0, which mention ‘detecting albuminuria’ without specifying a cutoff value, are not scientifically defined: those classes are not universals, but rather collections of particulars [19] whose nature is partly social (McTaggart and colleagues [8] acknowledge this limitation in their meta-analysis).

Alternative meta-analyses could use a subset of those studies to estimate various sensitivities, for example the sensitivity f2(g 1,IT 1,RT 1) of a point-of-care test with a reference of laboratory ACR test, with albuminuria defined as ACR greater than 3.4 mg/mmol, in the reference class of patients with diabetes mellitus; or the sensitivity f2(g 2,IT 2,RT 2) of a point-of-care test, with a reference of laboratory ACR test, with albuminuria defined as ACR greater than 2.65 mg/mmol, in the reference class of patients with kidney disease; etc. A well-founded semantic representation of sensitivity should thus make clear what the reference class is, as well as the class of index test and the class of reference test.

Discussion and conclusions

We have thus provided a practically tractable formalization of IPs in a realist ontology, which clearly dissociates IPs’ real values, their estimates and the related proportion measurements. This formalization defines the central entities that are concerned by an IP estimation in a way that is compliant with the OBO Foundry. In particular, it addresses the difficulty of considering possible, non-actual conditions in a realist ontology based on BFO by introducing dispositions.

This model could then be extended in three directions. A first step would be to clarify the ontological status of the two following entities: sample sizes on the one hand, and 95 % confidence intervals for sensitivity and specificity values on the other hand. A second step would be to clarify the relations se_of_test and se_for_disease, which could be reduced to basic relations and entities already accepted in the OBO Foundry. A third step would be to use this model in an ontology-based diagnostic system that would compute positive predictive values or negative predictive values from the prevalence, sensitivity and specificity values. More generally, it could be articulated with medical Bayesian networks. As a matter of fact, the notion of medical test used here could be generalized to a very general notion of test consisting in inferring the presence of an entity on the basis of the knowledge of the presence of another entity; as such, it could serve as a foundation for the integration of Bayesian reasoning into ontologies.
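The computation envisaged in the third step can be sketched in a few lines of Python. The sketch assumes the standard form of Bayes' theorem for predictive values, PPV = Se·Prev / (Se·Prev + (1 − Sp)·(1 − Prev)) and NPV = Sp·(1 − Prev) / (Sp·(1 − Prev) + (1 − Se)·Prev); the function name `predictive_values` and the input values are hypothetical, and exact fractions are used only to keep the arithmetic transparent.

```python
from fractions import Fraction

def predictive_values(prev, se, sp):
    """Return (PPV, NPV) computed from prevalence, sensitivity and specificity."""
    true_pos = se * prev               # diseased and tested positive
    false_pos = (1 - sp) * (1 - prev)  # not diseased but tested positive
    true_neg = sp * (1 - prev)         # not diseased and tested negative
    false_neg = (1 - se) * prev        # diseased but tested negative
    ppv = true_pos / (true_pos + false_pos)
    npv = true_neg / (true_neg + false_neg)
    return ppv, npv

# Hypothetical values: prevalence 10 %, sensitivity 90 %, specificity 80 %.
ppv, npv = predictive_values(Fraction(1, 10), Fraction(9, 10), Fraction(4, 5))
print(ppv)  # 1/3
print(npv)  # 72/73
```

With these illustrative inputs, a positive result raises the probability of disease from 10 % to about 33 % only, which shows why the reference class, and hence the prevalence, must be made explicit before IP values are applied to a patient. The sketch does not guard against degenerate inputs in which no result of a given kind is possible (a zero denominator).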

This model could be used in two kinds of computer applications targeted at two different kinds of audiences. First, clinicians could determine more easily which kind of sensitivity and specificity (or PPV and NPV) estimates they could use when diagnosing a disease for a given patient, by having a clearer view of the subjects’ characteristics in each sample on which those IP estimates are based. As a matter of fact, section 3.4 illustrates how an ontological analysis can make explicit which index test, reference test and sample are associated with a sensitivity estimation. Universal qualities that are instantiated by all members of the sample (such as having diabetes mellitus, being a man, being more than 65 years old, etc.) would make it possible to determine the reference class g associated with a sensitivity estimate. This in turn makes it possible to determine, when applying some given IP values to a specific patient with given characteristics, whether this application is warranted or not.

Second, statisticians could determine more easily which kind of sensitivity estimates they could aggregate together. If several estimations of IPs are represented ontologically according to the structure shown above, one could use this ontological structure to determine which estimations of IPs could be combined to obtain a finer estimate. First, one would have to find a group g 0 that would encompass the reference classes (such as g 1 and g 2) associated with those studies. Second, one would have to analyze whether there exists some general index test class such as IT 0 (resp. some general reference test class such as RT 0) which would subsume the various index test classes such as IT 1 and IT 2 (resp. reference test classes such as RT 1 and RT 2) that are used in those studies. Once those are found, one could use meta-analytic methods to derive a value for f2(g 0,IT 0,RT 0) from the other studies. Future work will aim at building an ontology of medical tests to facilitate finding such encompassing index and reference test classes.
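The search for an encompassing test class can be illustrated with a minimal Python sketch. It assumes a toy is_a hierarchy with single inheritance, stored as a child-to-parent mapping; the mapping `IS_A` and the functions `ancestors` and `common_subsumer` are hypothetical and stand in for a query against a real ontology of medical tests.

```python
# Hypothetical is_a hierarchy (child -> parent), single inheritance assumed.
IS_A = {"IT1": "IT0", "IT2": "IT0", "RT1": "RT0", "RT2": "RT0"}

def ancestors(cls):
    """Return cls followed by its superclasses, most specific first."""
    chain = [cls]
    while chain[-1] in IS_A:
        chain.append(IS_A[chain[-1]])
    return chain

def common_subsumer(classes):
    """Most specific class that subsumes all the given classes, or None."""
    classes = list(classes)
    for candidate in ancestors(classes[0]):
        if all(candidate in ancestors(c) for c in classes[1:]):
            return candidate
    return None

print(common_subsumer(["IT1", "IT2"]))  # IT0
print(common_subsumer(["RT1", "RT2"]))  # RT0
print(common_subsumer(["IT1", "RT1"]))  # None
```

Under these assumptions, two estimates would be candidates for aggregation only when both their index test classes and their reference test classes have a common subsumer; the choice of the meta-analytic method itself is left open, as in the text.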

As this model takes into account the dependence of IPs upon the group of people considered, it has the potential to contribute to the development of precision medicine [23] in the context of learning health systems [24, 25], an emerging approach that takes into consideration patients’ characteristics and dispositions, including individual variability in genes, to offer more personalized preventive, diagnostic and therapeutic strategies.