11.1 Introducing Quality-Assurance of Measurement in Person-Centered Care

In person-centered care (PCC), the patient is first and foremost regarded as a person with reason, will, feeling and needs [24], with unique and holistic properties which need to be brought into a partnership with health care professionals (e.g., [24, 30, 44, 47, 59]). When quantifying symptoms and experiences in PCC, patient centered outcomes (PCOs) pertain to a patient’s beliefs, opinions and needs in conjunction with a clinician’s medical expertise and assessment. Capturing these special aspects faithfully requires a special “person-centered” measurement process – typically with tests of performance, questionnaires and rating scales – alongside more traditional kinds of measurement in healthcare.

PCC is one of several applications where increased attention to quality assurance in healthcare is driving the discipline of quality-assured measurement – that is, metrology – as a topic of burgeoning and increasingly multidisciplinary interest. Quality assurance in a field such as PCC needs conformity assessment [35] in order to provide

  A. Confidence for the consumer (the patient) that requirements on products and services (care) are met

  B. The producer and supplier (the health care organization) with useful tools to ensure product and service (care) quality

  C. Help to regulators when ensuring that health, safety or environmental conditions are met.

Quality-assured measurement as a discipline has several centuries of history behind it, responding to quality-assurance needs in mainly technical domains such as trade and industrial production, where physicists and engineers have led the field. At the start of the twenty-first century, there is now a need and a challenge to formulate a unified view of metrology to address contemporary concerns and grand challenges, not only in physics and engineering but also in the social sciences [90]. PCC is one particular area of focus [16], where some essential measurement aspects have been emphasized, for instance, by Walton et al. [94] and Cano et al. [17] and are considered further in the section 11.1.2 Quality assurance in person-centered care. Design of experiments.

Because metrology is not an end in itself, the start of this chapter – as well as its end – will not deal directly with measurement, but rather with the objects – products, services, concepts… and their characteristics in PCC, particularly the patient centered outcomes (PCOs). It is the latter which are the concern of the vast majority of people working in PCC, who then ask metrologists to measure them. Presenting the assurance of measurement quality in person-centered care takes the approach advocated more generally in another book, Quality Assured Measurement [75], in this Springer Series. The start and end of the chapter provide object-related ‘bookends’ – supporting a description of Quality-Assured Measurement, which is the central issue and is dealt with in the middle sections of this chapter.

Presenting measurement in relation to assessed “objects” (tasks, cohort individuals, clinicians), rather than as an end in itself, will allow measurements to be anchored in relevance and interest for third parties, which in the context of PCC can be all stakeholders: from patient, health care professionals, relations, health care organizations, to regulators, politicians, and decision-makers.

As it turns out, the approach also provides the key to a unified presentation of quality-assured measurement across the Social and Physical Sciences, where objects are probed by a Human as a Measurement Instrument in “person-centered” measurement processes and in Measurement System Analysis (MSA) [1, 7], as explained further in Pendrill [75] and in the section 11.2.3 A way forward for measuring PCOs: Human as a B: Measurement Instrument.

11.1.1 Opening the Quality-Assurance Loop

Since neither production nor measurement processes are perfect, assuring the quality of the product or other entity (process, phenomenon, etc.) – in the present case, PCC as assessed in terms of PCOs – will involve efforts to keep within tolerable limits the unavoidable, real or apparent dispersion in the entity value. (Wherever possible, use will be made of the international vocabulary for conformity assessment in choice of terminology [38].)

Production errors (addressed in the section 11.1.3 A: Entity attribute description and specification) and apparent errors arising from an imperfect measurement system (addressed in the section 11.2 Benefits of combining Rasch Measurement Theory (RMT) and quality assurance) can both be revealed by measurement, either of a series of items or repeated observations of a single item.

“Item” in the context of PCC refers to a particular example of product – for example, an instance of delivering a care service to an individual patient (see further in the section 11.1.2 Quality assurance in person-centered care. Design of experiments).

Limited knowledge – about production or about measurement – will lead, respectively, to uncertainties in both product error and measurement error (as described in the section 11.3.2 Errors and uncertainties in PCOs); each having associated risks of incorrect decisions of conformity (as described in the section 11.4 Decision risks and uncertainty).

Conformity assessment is a formal process with the specific aim of keeping product “on target”. In many cases, both product and measurement processes will be subject to conformity assessment.

The overall concept of a quality-assurance loop was developed over 60 years ago in the context of industrial production, emphasizing that product quality is not only determined by actions at the point of production, but importantly at every step of the loop – from initially defining, designing to finally delivering product – in what should be an on-going ‘dialogue’ between consumer and supplier [21].

The rules and regulations embodying the demands of decision-makers, regulators and authorities, together with expert and end-user opinion and advice for each sector – such as healthcare (section 11.1.2 Quality assurance in person-centered care. Design of experiments) – can be formulated in clear, unambiguous and harmonized ways in written standards and norms suitable for implementation when assessing quality. The quality-assurance loop has provided a framework for major series of international standards for quality assurance, particularly the famous ISO 9000 series, which over the years have been adapted not only for industrial production but also for healthcare, such as the European norm EN 15224:2017, Quality management systems – EN ISO 9001:2015 for healthcare, where clinical processes in health care services are the main focus. At the same time, quality-assured measurement informs the written standards to ensure the viability of the regulatory demands in terms of what is practically and realistically possible to measure as part of an overall quality infrastructure (figure 6.1 in [75]).

11.1.2 Quality Assurance in Person-Centered Care. Design of Experiments

The first step in the quality-assurance loop is to define “product” (in the widest sense). A valid and relevant formulation of the construct of interest is a key step.

Conformity assessment occurs in the framework of regulation, where a clear definition of “product” is of course necessary. As an example, according to the current Swedish Health and Medical Services Act:

The goal of health care is good health and care on equal terms for the entire population… This implies that health care activities must uphold a good hygienic standard; meet the patient's need for security, continuity and safety; build on respect for the patient's self-determination and integrity; promote good contacts between the patient and health care professionals, and be easily accessible. Moreover, the quality of the healthcare shall be systematically and continuously developed and assured. (HSL, 2017:30, chap. 3, General § 1 and chap. 5 Activities General Sections §§1–4) [31]

Several of these quality characteristics can also be found in the “ISO 9000-like” European norm for healthcare: EN 15224:2017 [25].

While not explicit in the regulations, PCC is implicitly included. For our purposes when specifying PCC requirements, a useful definition can be found in a new standard, EN 17398:2020, Patient involvement in health care – Minimum requirements for person-centered care, where a shift is attempted, away from a conventional medical approach in which the patient is a passive target of medical and/or care intervention, to an explicitly PCC approach in which:

  • every patient’s resources, interests, needs and responsibilities are acknowledged and endorsed in situations of concern to them, and

  • the patient takes an active part in their care, decision-making processes and self-care.

PCC emphasizes co-production of care. Particularly in times of stress such as the current COVID pandemic, it is important not to set more conventional medical health care and PCC in opposition to one another – “competing” for limited resources – but rather to see them as complementary and of mutual benefit [11, 45]. For instance, an enduring personal relation between patient and health staff will provide support not only at a critical moment of intervention, but also before and after, whatever the outcome for that patient. A key aspect is that the relative importance a patient attaches to various outcomes and processes has a large, if not determining, influence on decision-making in patient-centered care. Berwick [11] and Kaltoft et al. [45] also emphasize that the relevant preferences are those of the individual patient facing a decision, as opposed to the average preferences of a group of patients with the same condition or the preferences of the health professional involved in the decision.

Obviously, the umbrella term “patient centered care including physical, psychological and social integrity” (this phrase appears in EN 15224 [25], which uses the term “patient centered care” without further definition) needs to be broken down into more specific quality characteristics (task difficulty, person ability and the like) for PCC, as will be exemplified in the later section PCOs: Example neuropsychological cases. A further key aspect of PCC, not clearly covered in EN 15224 but addressed in EN 17398, is the partnership guided by the patient’s narrative and characterized by dignity, compassion and respect between the patient and health care professionals [24], which can also be resolved into a number of specific quality characteristics (quality of care, person leniency), as exemplified in the later section PCOs: Example patient participation.

Apart from and together with legislation and standards, expert and end-user opinion and advice are of course valuable when defining health care. For example, the OMERACT collaboration has formulated concepts, core areas and domains for outcome measurement in health intervention studies [14]. A person-centered approach has even recently entered the long-established field of laboratory medicine, where the following has been highlighted:

Effective collaboration with clinicians, and a determination to accept patient outcome and patient experience as the primary measure of laboratory effectiveness

in the recommendations of the IFCC Task Force on the Impact of Laboratory Medicine on Clinical Management and Outcomes (TF-ICO) [29].

11.1.3 A: Entity Attribute Description and Specification

In what follows, we will examine in turn each of the three main stages in the measurement process (Fig. 11.1a,b below) – from (A) an object’s entity, through (B) measurement with an instrument to (C) response as registered by an operator.

Fig. 11.1
figure 1

(a, b) Measuring humans in MSA. (Adapted from [69, 72])

The various quality characteristics of the entity, A, (attributes such as the quality characteristics for care services mentioned in legislation, various standards and in expert and end-user advice, first section), need to be identified, described, measurable, predictable and prioritized when formulating the overall entitic construct (“dedicated quantity” [22] or “quantity of an entity” [27]) for quality assurance. While the first section also gave some examples of typical constructs for PCC, this section indicates a number of tools and methodologies to be used when making valid construct identification.

A ‘quality characteristic’ attributed to any entity intended to be assessed may be – as in statistics – either a measure of:

  • ‘location’, e.g., a direct quantity of the entity, such as the mass of a single object or the ability of an individual patient; an error in mass or ability (deviation from a nominal value); or an average mass or ability of a batch of objects or cohort of patients

  • ‘dispersion’, e.g., the standard deviation in mass or ability amongst a batch of objects or cohort of patients

The all-important definition of what is actually intended to be quality-assured will in most cases be a defining moment, literally speaking. Among the requirements of validity and reliability in both the physical and social sciences [84], of particular interest at this stage are: content validity – the degree to which all necessary items are included in the test; and construct validity – convergence with related measures and discrimination from unrelated ones.

In most cases PCOs can be resolved into pairs of attributes, sometimes referred to as coupling attributes or item:probe pairs: (i) a care service attribute – such as the difficulty of a task or the quality of providing care – and, respectively, (ii) an attribute of the person receiving care – such as the ability of each care recipient or their leniency (how easily satisfied a person is). As will be described further in the section PCC including physical, psychological and social integrity, task difficulty or quality of care, δ, and person ability or person leniency, θ, are determined together in a psychometric logistic regression to cohort performance response data obtained typically with tests of performance, questionnaires and rating scales, where the measurement object is an “item” and the measurement instrument (typically here a person) is the “probe” or “agent” in a measurement system model to be described in Fig. 11.1a, b.
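The joint determination of task difficulties δ and person abilities θ from a cohort’s dichotomous response data can be sketched as follows – a crude gradient-ascent illustration of the underlying logistic regression, not the estimation machinery of a dedicated Rasch package; all names and data are ours:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def rasch_fit(X, n_iter=500, lr=0.1):
    """Jointly estimate person abilities (theta) and item difficulties (delta),
    in logits, from a persons x items matrix X of 0/1 responses, by gradient
    ascent on the Rasch likelihood."""
    n_persons, n_items = len(X), len(X[0])
    theta = [0.0] * n_persons
    delta = [0.0] * n_items
    for _ in range(n_iter):
        for i in range(n_persons):
            grad = sum(X[i][j] - sigmoid(theta[i] - delta[j]) for j in range(n_items))
            theta[i] += lr * grad
        for j in range(n_items):
            grad = sum(sigmoid(theta[i] - delta[j]) - X[i][j] for i in range(n_persons))
            delta[j] += lr * grad
        mean_delta = sum(delta) / n_items          # anchor the scale origin:
        delta = [d - mean_delta for d in delta]    # mean item difficulty = 0
    return theta, delta

# Five persons attempting four tasks (1 = success); no person or item has an
# extreme (all-0 or all-1) score, so the estimates stay finite.
X = [[1, 1, 1, 0],
     [1, 1, 0, 0],
     [1, 0, 0, 0],
     [1, 1, 0, 1],
     [0, 1, 0, 0]]
theta, delta = rasch_fit(X)
# The rarely solved third task comes out harder than the first, and the
# highest-scoring person more able than the lowest-scoring one.
print(delta[2] > delta[0], theta[0] > theta[4])
```

Mean-centering the difficulties each iteration fixes the indeterminacy of the logit scale’s origin; real analyses would also report standard errors and fit statistics for each estimate.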

At this first step, there should be an ambition to capture unconditionally as many relevant and valid product and process aspects as possible – no amount of sophisticated analyses subsequently will be able to compensate for a component missed here. When identifying the processes essential for quality assurance in production, design of experiments is a tool of traditional statistics that has been employed for decades [65]. Design of experiments means the process of systematically varying controllable input factors to a “manufacturing” process (in the broadest sense) so as to demonstrate the effects of these factors on the output of production not only in manufacturing but also more widely throughout the physical and social sciences.

In line with PCC, i.e., the aim to include the patient’s narrative, and in order to develop meaningful PCOs for the patient, an example of a useful method for capturing a broad spectrum of experiences of the target group is the critical incident technique (CIT). A critical incident is an event that is particularly satisfying or dissatisfying, typically identified using content analysis of stories or vignettes [12, 28] rather than quantitative methods in the data analysis.

Structuring, prioritizing and choosing among the many entity characteristics the end-user might mention may be done in different ways. In an ideal world, the construct should be formulated in relation to an ordinal theory, i.e., a prior understanding of what it means for the target group to progress from low to high ability, with items correspondingly representing a harmonized hierarchy of task difficulty [17]. Some criteria for prioritizing amongst identified entity characteristics are needed when bringing structure, be it ‘tailor-making’ manufacturing or ‘production’ anywhere in the social and physical sciences [75]. An example of this in PCC is to ask patients to rate the relative importance of each task as an aid to ranking and prioritization. Tools for this include Importance-Performance Analysis [19], where a simple experiment might be to ask a person to place task or situation cards on a table, the position of each card on the vertical axis indicating the perceived difficulty of performing that specific task and its location on the horizontal axis indicating how important performance of the task is rated. The results of this simple investigation can be analyzed further using logistic regression.

The Activity Inventory tool [54] is another approach to structuring constructs, where a bank with items describing everyday activities is arranged within a hierarchical framework. At the top level of the hierarchy, activities are organized by Massof et al. [54] according to the objective they serve (Daily Living, Social Interactions, or Recreation). A selection among the (several hundred) identified items is made; a few items are classified as “Goals”, which describe what the person is intending to accomplish (e.g., prepare daily meals, manage personal finances, entertain guests). The remaining items in the bank, grouped under the Goals, are classified as “Tasks”. Tasks in the vision-related studies of Massof et al. [54] describe specific cognitive and motor activities that must be performed to accomplish their parent Goal (e.g., cut food, read recipes, measure ingredients, read bills, write checks, sign name). Massof and Bradley [55] report the estimation of the utility (i.e., a variable representing value, benefit, satisfaction, happiness, etc.) assigned by each patient to a hypothetical risk-free intervention that would facilitate each of the identified important and difficult Goals in the Activity Inventory for that patient.
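The hierarchical organization of such an item bank can be sketched as a simple nested mapping. The Objective, Goal and Task labels below follow the examples quoted from [54]; the data structure itself is our illustration, not the actual instrument:

```python
# Objective -> Goal -> Tasks, following the Activity Inventory hierarchy [54].
activity_inventory = {
    "Daily Living": {                                 # top-level Objective
        "prepare daily meals": [                      # Goal
            "cut food", "read recipes", "measure ingredients",   # Tasks
        ],
        "manage personal finances": [
            "read bills", "write checks", "sign name",
        ],
    },
}

# Tasks are grouped under the parent Goal they serve:
for objective, goals in activity_inventory.items():
    for goal, tasks in goals.items():
        print(objective, "->", goal, "->", len(tasks), "tasks")
```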

Whatever technique is used, capturing every essential aspect to be included in the construct of interest will enable a valid and reliable formulation of a construct specification equation (CSE) to be used throughout a measurement task, as described further in the section 11.3.1 Metrological references for comparability via traceability and by Melin & Pendrill [61].

11.1.4 Quality Assurance in Healthcare Service Provision

Following definition of the entity and its quality characteristics to be assessed for conformity with specified requirements, the quality loop progresses, via stages such as product design, procurement, through actual production to delivery of care and final use. The healthcare service quality assurance norm EN 15224 [25] stipulates – in 7.5.1 Control of production and service provision – that:

The organization shall plan and carry out production and service provision under controlled conditions.

Such conditions shall include availability of information describing health care service characteristics, of necessary work instructions and of monitoring and measuring equipment as well as implementation of monitoring and measurement and of health care service release, delivery and post-delivery activities.

The specific aim of keeping product on target in conformity assessment involves setting specification limits on the quality characteristics. Here we also see how, as in all ISO 9000-like standards, measurement is included explicitly as part of the “production” process, as will be dealt with in more detail in the later sections of this chapter. Typical decision rules in conformity assessment are of the form: “Conformity to requirements is assured if, and only if, the uncertainty interval of the measurement result is inside the region of permissible values RPV” (ISO 10576-1 [37]), where that region is bounded, above and below the specification limits, by regions of non-permissible values, RNV. Note that entity specifications are normally set on the basis not only of what can be practically and economically produced, but also ultimately on what the consumer or other end-user requires in terms of product characteristics, as described above in the section 11.1.2 Quality assurance in person-centered care. Design of experiments. We return to decisions of conformity at the end of this chapter when we close the quality loop.
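A decision rule in the spirit of ISO 10576-1 can be sketched as a simple interval check; the numerical limits and uncertainties below are purely illustrative:

```python
def conforms(result: float, uncertainty: float, lower: float, upper: float) -> bool:
    """Conformity is assured if, and only if, the whole uncertainty interval
    [result - uncertainty, result + uncertainty] lies inside the region of
    permissible values (RPV) bounded by the specification limits."""
    return lower <= result - uncertainty and result + uncertainty <= upper

# A measured ability of 0.8 logits with expanded uncertainty 0.3 logits,
# assessed against an illustrative RPV of [0.41, 3.0] logits:
print(conforms(0.8, 0.3, 0.41, 3.0))   # True: [0.5, 1.1] lies inside the RPV
print(conforms(0.5, 0.3, 0.41, 3.0))   # False: [0.2, 0.8] crosses the lower limit
```

Note that a result whose uncertainty interval straddles a specification limit is not assured to conform, even when the result itself lies inside the RPV – the second case above.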

The quality characteristics and their entities (the health care services) have a different character to those associated with the more technical ‘production’ of care – the EN15224 standard [25] mentions several: the handling of material products such as tissue, blood products, pharmaceuticals, cell culture products and medical devices – which are not in focus in the health care service standard as they are regulated elsewhere. Alongside traditional quality-assurance of products and processes of this more “objective and quantitative” kind, many measurements in healthcare (e.g., of care services) of interest here are made with questionnaires and categorical observations.

11.1.5 Examples of Quality Characteristics and Specification Limits for Person-Centered Care

PCC, with its emphasis on the patient’s unique and holistic properties brought into a partnership with health care professionals, contrasts with more traditional disease control programs. In practice, this has the consequence that the quality characteristics of typical PCC attributes are often less quantitative and more subjective – typically the response of a patient to a task – than the more familiar technocratic indicators of medical signs and healthcare production measures. This can make the identification of quality characteristics for PCC, such as “equity” and “patient involvement”, a challenge for health care professionals. Despite this, it may well turn out that PCC in the long run will be more effective in providing health proactively than in merely retroactively “fighting the fire”.

PCC Including Physical, Psychological and Social Integrity. Specification Limits, Counted Fractions, Ability, Difficulty and the Psychometric Rasch Measurement Theory (RMT) Approach

Despite their apparent subjective character, PCC attributes can in many cases nevertheless be quantified and assessed objectively for conformity. Examples of specification limits on PCOs could be: the patient should score at least 60% on a memory test (section PCOs: Example neuropsychological cases) or a patient participation survey (section PCOs: Example patient participation). A couple of tools are needed to handle such data in PCC:

Firstly, it has been known since the time of Pearson (1897, cited in [2] and in [91]) that raw data such as the response score, Psuccess, of a person to a test (perhaps administered with a survey questionnaire with a finite number of response categories) lie on an ordinal scale (a so-called ‘counted fraction’, bounded by 0% and 100%) which is not linear and on which most of the regular tools of statistics are not directly applicable. An example will be given in Fig. 11.4, which shows the increasing non-linearity of raw data at the scale extremes which, if not recognized and corrected for, will lead to incorrect decisions about healthcare. It is like making a measurement with a bent ruler ([75], pp. 88 ff). As recommended by Aitchison in the 1980s and even earlier by Rasch [83], such counted-fraction data need first to be transformed onto a linear, quantitative scale – using log-ratios for instance – before attempting any statistical analysis. Thereafter one can transform back to the observation space. It is straightforward to convert a raw-score specification limit Psuccess = 60% to a corresponding log-ratio specification limit z = 0.41 logits using the equation: \( z=\ln \left(\frac{P_{success}}{1-{P}_{success}}\right) \). The centre of the logistic scale, z = 0, clearly lies at the raw data point Psuccess = 50%.
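The transformation between the bounded counted-fraction scale and the linear logit scale can be sketched in a few lines (a minimal illustration; the function names are ours):

```python
import math

def to_logits(p_success: float) -> float:
    """Transform a counted-fraction score (0 < p < 1) onto the linear logit scale."""
    return math.log(p_success / (1.0 - p_success))

def to_fraction(z: float) -> float:
    """Inverse transform: map a logit value back to the bounded observation space."""
    return 1.0 / (1.0 + math.exp(-z))

# A raw-score specification limit of 60% corresponds to about 0.41 logits:
print(round(to_logits(0.60), 2))   # 0.41

# The centre of the logistic scale, z = 0, maps back to 50%:
print(to_fraction(0.0))            # 0.5
```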

Secondly, it is well known that the success rate Psuccess in the response to a test or classification depends not only on the test person’s ability θ but also on how difficult, δ, the test is. The same probability of success can be obtained with an able person performing a difficult task as with a less able person tackling an easier task. Any measurement is indirect; that is, there is always the need for the intervention of a measurement system in order to observe any object of interest, as modelled with MSA.

To the extent that care, including PCC, makes both diagnoses and attempts to cure illness, then relevant quality characteristics are, respectively, the actual level of patient ability θ at any one diagnosis, as well as the change in patient ability following an intervention, Δθ. Further discussion can be found for example in the standard ISO 18104 [40] (and in the section Rater constructs), which aims at providing a comprehensive categorical structure or domain concept model for nursing diagnoses and nursing interventions.

The same approach can be applied when considering responses to a care participation test, as will be described further below, where the corresponding care service and care recipient attributes are, respectively, care quality and recipient leniency (how easily satisfied a person is).

How to formulate an MSA for categorical observations needs to be a key focus in PCC. Although Rasch [83] did not use those terms, an early form of his model, in response to contemporary demands for individual measures (section 11.2.3 A way forward for measuring PCOs: Human as a B: Measurement Instrument), is very much an MSA formulation [75] which posits that the odds of successfully performing a task equal the ratio of an ability, h, to a difficulty, k:

$$ \frac{P_{success}}{1-{P}_{success}}=\frac{h}{k} $$

(Rasch [83] used the person attribute “inability” instead, given by h−1.) Taking logarithms, the model can be written:

$$ \kern0.5em \theta -\delta =\log \left(\frac{P_{success}}{1-{P}_{success}}\right) $$

where the test person (“agent”) ability, θ =  log (h), and task (“object”) difficulty, δ =  log (k) (or other item:probe pairs), can be evaluated by logistic regression to the score data in terms of the probabilities Psuccess which quantify the response of the measurement system to a given stimulus (object difficulty). The original Rasch [83] formulation of Eq. 11.1 refers to a Poisson probability distribution \( p(x)=\frac{e^{-\lambda}\cdot {\lambda}^x}{x!};x=0,1,\dots \), well known from quality control as a model of the number of defects or nonconformities that occur in a unit of product when classifying it [65]. The parameter λ is directly equal both to the mean and to the variance of the Poisson distribution, and in Rasch’s [83] model λ = h−1 ∙ k. In accord with MSA, the object of measurement is, simultaneously, the object itself as well as an essential element of the measurement process. So, while here the discussion concerns mainly the object of interest (e.g., PCOs as in the next sections), the rest of this chapter will deal more with the quantities as measured. Both the mean and variance of this Rasch distribution will be considered when dealing, respectively, with the measured person abilities and task difficulties on the one hand, and the uncertainties in each of these on the other. RMT, as will be seen, is not only a statistical approach but, importantly, also a metrological approach to person-centered measurement ([69, 72] and Fig. 11.1a, b).

PCOs: Example Neuropsychological Cases

A quality characteristic in a neuropsychological memory test regularly used in clinics monitoring neurodegeneration is the ability, θ, of a patient to remember a particular sequence, for instance of blocks tapped, digits or words in a list. In later sections of this chapter we will give examples of actual data and analyses, with more detailed accounts given in the accompanying chapter by Melin & Pendrill [61].
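For a test such as this, the Rasch model of Eq. 11.1 links the patient’s ability and the task’s difficulty to the expected success probability; a minimal sketch (the numerical values are illustrative only):

```python
import math

def p_success(theta: float, delta: float) -> float:
    """Rasch model: probability that a person of ability theta (logits)
    succeeds on a task of difficulty delta (logits)."""
    return 1.0 / (1.0 + math.exp(-(theta - delta)))

# Only the difference theta - delta matters: an able patient on a hard
# sequence-recall task and a less able patient on an easier one can have
# the same probability of success.
print(round(p_success(2.0, 1.0), 3))    # 0.731
print(round(p_success(1.5, 0.5), 3))    # 0.731

# Rasch's multiplicative form, odds = h / k, with theta = log h, delta = log k:
h, k = math.exp(2.0), math.exp(1.0)
print(round((h / k) / (1 + h / k), 3))  # 0.731
```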

A first question is: Is patient ability θ a quality characteristic of person-centered healthcare? We would argue that the characteristic is certainly person-centered, perhaps more so than characteristics – such as brain volume, protein concentration, etc. – of the traditional technocratic biomarker approach, where an impressive array of sophisticated instruments and theories of brain-functioning has had decades of research dedicated to it. At least a patient’s performance is directly relevant and certainly not a “surrogate” for biomarker levels. Secondly, since the aim of healthcare is to maintain and preferably improve health, then any change in patient ability Δθ is a pertinent quality characteristic. Our view can be compared with that of Walton et al. [94] who write:

clinical outcome assessments (COAs) … include any assessment that may be influenced by human choices, judgment, or motivation. COAs must be well-defined and possess adequate measurement properties to demonstrate (directly or indirectly) the benefits of a treatment. In contrast, a biomarker assessment is one that is subject to little, if any, patient motivational or rater judgmental influence.

PCOs: Example Patient Participation

A couple of years ago, a Health Foundation report [20] pointed out that there is neither a universally agreed definition of PCC nor a gold standard that can be used for measuring all aspects of PCC. Likewise, earlier critique emphasized that few existing scales were based on a solid PCC theoretical framework [23]. There is, however, much ongoing research, and many proposals, about how to define and measure experiences of PCC, e.g., to assess a subdomain of PCC such as patient participation.

In an ideal world, the ordinality of quantity scales (as described above in the section PCC including physical, psychological and social integrity) should be recognized at an early phase of construct specification. This was, however, not the case with the Patient Participation in Rehabilitation Questionnaire (PPRQ) [51]. A subsequent analysis according to RMT [62] has identified a reasonable item hierarchy of what it means to progress from lower to higher levels of patient participation: the lower levels include being respected as a person and being given information, the middle levels include shared decision-making and care planning, and the higher levels include being empowered and motivated.

There are striking similarities between the items and the hierarchical levels identified at about the same time in two other independently developed questionnaires, the Patient Preference for Patient Participation tool (The 4Ps) [52] and the Person-Centered Care in outpatient care in rheumatology (PCCoc/rheum) [8]. These similarities appear promising for understanding the construct of PCC/patient participation from a patient perspective across these scales.

11.2 Benefits of Combining Rasch Measurement Theory (RMT) and Quality Assurance

After having decided on which clinical processes are to be quality-assured in person-centered healthcare and what the principal quality characteristics associated with the processes are, the next essential steps in our quality loop are monitoring and measurement of these characteristics during the actual provision of healthcare services.

As we have seen above, clear demands about measurement can be found in quality assurance and conformity assessment, such as in legislation: ‘the quality of the healthcare shall be systematically and continuously developed and assured’ [31]. The healthcare service quality assurance norm EN 15224 [25], although perhaps not explicitly covering PCC in detail, nevertheless usefully stipulates (as other ISO 9000 norms) metrological requirements in general (section Requirements for traceable measurement, below).

When making measurements at this “production” stage in the quality loop, it is important, and one of the most challenging tasks, to distinguish between measurement dispersion (due to limited measurement quality) and actual product (health care service in the present case) variation. The two types of scatter appear together in the displayed results of measurement and are easily confounded, even conceptually [75]. For instance, as a patient’s health condition progresses or as a result of care interventions, the main concern in care provision will be variations – such as changes in ability – actually caused by the illness or the intervention. It is obviously essential and beneficial in providing good care to distinguish these actual changes from apparent changes registered with a less-than-perfect measurement system. Indeed, most measurements are made through the intermediary of a measurement instrument, so some account of, and compensation for, possible imperfections and distortions in the measurement process is necessary if one wants a true picture of the object of interest – in the present case, the clinical processes. This in turn is key if measurements are ultimately to provide a reliable and valid basis for clinical decisions (see further in the last section 11.4 Decision risks and uncertainty).
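The distinction between real variation and measurement dispersion can be sketched with the usual variance decomposition, assuming independent measurement error; the numbers below are illustrative only:

```python
def true_variation(observed_sd: float, measurement_sd: float) -> float:
    """Estimate the real dispersion of the entity (e.g., patient abilities in a
    cohort) by removing, in quadrature, the apparent dispersion contributed by
    the measurement system (assumes independent measurement error)."""
    return (observed_sd**2 - measurement_sd**2) ** 0.5

# An observed spread of 0.9 logits, with a typical measurement uncertainty of
# 0.4 logits per estimate, leaves a real spread of about 0.81 logits:
print(round(true_variation(0.9, 0.4), 2))   # 0.81

# The corresponding (separation-style) reliability: the share of the observed
# variance that reflects real variation rather than measurement scatter.
print(round(1 - (0.4 / 0.9) ** 2, 2))       # 0.8
```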

A unique aspect when assuring measurement quality in PCC turns out to be that the many measurements made, for instance with questionnaires, tests and surveys – often made on categorical scales – are best analyzed with a measurement model where the human respondent is herself acting as the measurement instrument, as a metrological interpretation of RMT [69, 70, 73, 75]. A measurement system analysis (MSA) with a human measurement instrument is as “person-centered” as PCC, where one is interested not only in measuring the status of a person, but also in understanding what and how the person perceives and experiences. At the same time, this approach enables the benefits of drawing analogies with the well-established MSA framework from measurement and quality engineering [65] to be fully exploited, where the instrument is at the heart of any measurement system.

11.2.1 Measurement Quality-Assurance Loop

Confidence in the measurements performed in the conformity assessment of any entity (product, process, system, person, body or phenomenon) can be considered sufficiently important – as in the present case of care – that the measurements themselves will be subject to a “product” quality loop, as a kind of metrological conformity assessment ‘embedded’ in any product conformity assessment.

A measurement quality-assurance loop, formed by the steps (a)–(f) below (also illustrated in figure 2.1 of [75]), is of course analogous to a product quality-assurance loop, where the “product” is a measurement result:

  (a) Define the entity and its quality characteristics to be assessed for conformity with specified requirements (section 11.1.3 A: Entity attribute description and specification).

  (b) Set corresponding specifications on the measurement methods and their quality characteristics (such as maximum permissible uncertainty and minimum measurement capability) required by the entity assessment at hand (section 11.2.2 Measurement specifications).

  (c) Produce test results by performing measurements of the quality characteristics together with expressions of measurement uncertainty (section 11.3 Metrological references for comparability via traceability and reliable estimates of uncertainty).

  (d) Decide if test results indicate that the entity, as well as the measurements themselves, is within specified requirements or not (section 11.4 Decision risks and uncertainty).

  (e) Assess risks of incorrect decisions of conformity.

  (f) Final assessment of conformity of the entity to specified requirements in terms of impact.

11.2.2 Measurement Specifications

In order to assess an entity according to product specifications, it will often be necessary to set corresponding measurement specifications (step b, above). Much can be gained if one can plan measurements proactively for “fitness for purpose”, encompassing considerations – prior to measurement – of aspects such as the measurement capability needed when about to test a product or process of a certain production capability, in the same way as design of experiments was used to find the principal input factors determining the output of “production” [73].

Models of measurement will need to be able to deal with:

  • The actual result of measurement, that is, estimating the measurand or quality characteristic of the entity of interest

  • Propagation of measurement bias and other errors through the measurement system

  • Propagation of variances through the measurement system

The two principal hallmarks of metrology (quality-assured measurement) – namely, traceability and measurement uncertainty, respectively – are the two aspects which ensure that the product (in the broadest sense) itself is quality-assured, so that it will be comparable and decision risks will be quantified. Assuring the quality of measurement in terms of the respective measures of “location” and “dispersion” will in turn support corresponding quality assurance of product – in the present case the requirements on the attributes of PCC.

As in all measurement, there are requirements when assuring quality on both metrological traceability and general risk management. These two metrological aspects are of course key to ensuring quality in PCC for several of the PCC quality characteristics listed in the section above 11.1.2 Quality assurance in person-centered care. Design of experiments, such as “equity” (i.e., a patient can expect the same quality of care wherever it is provided) and “patient safety” (i.e., risks of incorrect decisions about care are proactively assessed). When assuring quality in PCC, there are some special challenges to be met when fulfilling both these sets of requirements.

Requirements for Traceable Measurement

The healthcare service quality assurance standard EN 15224 [25] (as in all ISO 9000 like norms) stipulates – in 7.6 Control of monitoring and measuring equipment – that:

The organization shall establish processes to ensure that monitoring and measurement can be carried out and are carried out in a manner that is consistent with the monitoring and measurement requirements.

Where necessary to ensure valid results, measuring equipment shall:

a) be calibrated or verified, or both, at specified intervals, or prior to use, against measurement standards traceable to international or national measurement standards; where no such standards exist, the basis used for calibration or verification shall be recorded

NOTE 2 For further information see EN ISO 10012:2003 Measurement management systems – Requirements for measurement processes and measuring equipment [36].

It can be noted that, to date, there are few established measurement standards for attributes typical of PCC.

11.2.3 A Way Forward for Measuring PCOs: Human as a B: Measurement Instrument

Metrological traceability and uncertainty are equally pertinent to assuring measurement quality in PCC, as they are to more traditional quality assurance contexts. While superficially similar, there are however a number of caveats and challenges to be overcome in person-centered care. An early mention of the topic was given by Maxwell [57]:

We may observe the conduct of individual men and compare it with that conduct which their previous character and their present circumstances, according to the best existing theory, would lead us to expect. Those who practice this method endeavour to improve their knowledge of the elements of human nature, in much the same way as an astronomer corrects the elements of a planet by comparing its actual position with that deduced from the received elements.

Rasch [83], in motivating what was to become RMT, quotes [100]:

Recourse must be had to individual statistics, treating each patient as a separate universe. Unfortunately, present day statistical methods are entirely group-centered, so there is a real need for developing individual-centered statistics.

RMT is, importantly, a metrological approach to person-centered measurement [69, 72]. This contrasts with current thinking in some quarters (stemming from attitudes prior to Rasch’s approach of the 1960s) that ‘secondary to the scientific task is the instrumental task of quantification’ [64]. Even today, accounts [53] seem still to follow this thinking, and essentially omit, to their detriment, any detailed description of a measurement system. Drawing simple analogies between “instruments” in the social sciences – questionnaires, ability tests, etc. – and engineering instruments such as thermometers unfortunately does not go far enough. Not only does the counted-fraction nature of human responses need to be compensated for, but the task of separating the instrument factor from the sought-after object factor also has to be achieved, even in the qualitative, categorical responses of the measurement system. The importance and benefits of regarding the human responder as an instrument in MSA are explained at the beginning of this section.

Attempts have been made, for instance by analytical chemists [22, 67], who have proffered the suggestion that ‘examination’ could replace ‘measurement’ in the case of nominal properties (a specific kind of categorical observation). In Example 1 of VIN §3.9 Examination uncertainty given by Nordin et al. [67]:

The reference nominal property value is “B”. The nominal property value set … of all possible nominal property values is {A, B}. For one of 10 examinations … the examined value differs from “B”. The examination uncertainty is therefore 0.1 (10 %).

Texts such as this raise a couple of questions:

  • Firstly, the term “examination” in English usually means either a “detailed study” or a “test”. For categorical observations, we prefer the term “classification” which is commonly used (for example in taxonomy) and can cover less detailed studies (which are nevertheless important to make) as well as not obliging us to make tests.

  • Secondly, performance metrics such as Psuccess and the misclassification probabilities α or β – considered by Nordin et al. [67], Bashkansky et al. [9] and others as accuracy measures – often belong to the ‘counted fraction’ kind of data (section PCC including physical, psychological and social integrity) and are in general of ordinal, rather than fully quantitative, nature, and so are not directly amenable to regular statistics. An uncertainty of “10%” has little meaning. There is also, again, the requirement that the distinct factors of task difficulty and person ability need to be determined separately from the aggregate raw score Psuccess.

Not accounting for the counted-fraction nature of human responses can have serious consequences in healthcare. An example is given by Pendrill [76] who demonstrated significant errors in previous studies of the correlation of the cognitive ability of Alzheimer disease sufferers with corresponding biomarkers when not compensating for known counted-fraction scale distortion. Another example is by Kersten et al. [46] which showed that raw data are invalid for decisions of Minimum Clinically Important Differences as these either under- or overestimate true changes.

In summary, the status quo concerning a metrological vocabulary which would adequately cover ordinal and nominal properties satisfactorily would need three additions to that available internationally today:

  1. clear definitions of ordinal and nominal properties (which in most cases are not quantities: a “quantity” is a measurable property)

  2. a new chapter in the vocabulary defining classification systems, analogous to measurement systems, but where the response is the probability of a “correct” assignment of an observation to a class or category

  3. clearer definitions of what a measurement system is, including the quantities associated with each element of the system, and the process of restitution [85]

With the explicit aim of developing a common language and complementary methodologies for cooperation about measurement and decision-making between sociologists, physicists and others, we have recently proposed [69, 71, 72, 75] an approach that seems to be equally applicable to both physical and social measurements and combines both aspects. While it is difficult to find a single definition of “measurement” which would apply to all scales – from nominal through to ratio – [82], it does appear to be feasible to unite about a definition of “classification”, including qualitative observations typical of PCC as well as vacillation when making decisions based on quantitative measurement results in the presence of measurement uncertainty (“unknown measurement errors”).

Few measurements are direct: in most cases human beings need the help of instruments to translate measurement information from a measurement object into a human-understandable signal. Thus, a measurement system (depicted in Figs. 11.1a, b) is necessary but, because it is not perfect, at the same time brings with it measurement errors (noise, distortion) that must be corrected for [10, 75].

In Fig. 11.1 (a) a conventional measurement system ([1, 7], MSA Measurement System Analysis), perhaps used in traditional healthcare for measurement of a person’s mass, length or temperature – challenging owing to the complexity of a human being – is contrasted with Fig. 11.1 (b) of a “person-centered” measurement system suitable for PCOs, where a human being acts as a measurement instrument at the heart of a measurement system [71, 72].

Our approach [71, 72, 74, 75] identifies the test person (e.g., the patient in the present context) as the instrument at the center of the measurement system (Fig. 11.1b), in accord with Rasch’s [83] intention to ‘treat each patient as a separate universe’ and with the view that PCC needs person-centered metrology. Note however that, as reported already in the EU MINET project [69, 70]:

Care has to be exercised in such studies and an overall aim is to bridge the gap between:

  • Engineering tradition criticized for a far too instrumental view of operators.

  • Humanistic and behavioral science tradition all too preoccupied with issues centered on human operators.

The MSA approach is applicable to, in principle, all scales of measurement – as measurement systems for the more quantitative ratio and interval scales, and as classification systems for the less quantitative ordinal or nominal scales, as summarized in Fig. 11.2.

Fig. 11.2
figure 2

Descriptions of measurement scales and various metrological parameters (S = stimulus; R = response of measurement system; b = bias in response; K = sensitivity; y = indication). Light blue cells: interval or ratio quantities; dark blue cells: nominal or ordinal properties (not quantities)

11.2.4 Benefits of Analysing Response Data with RMT

Classification, where responses are put into a number of categories (class labels) – and visualized typically with a PMF (probability mass function, histogram of occupancy in a finite number of categories) – is analogous to measurement, where responses are put on a continuous scale of the measurand (quantity intended to be measured) and visualized typically with a PDF (probability density function).

Parameters used to characterize measurement systems (in MSA, including object, instrument, method, environment and operator) often have analogous meanings when characterizing classification systems. For example, a “classification error” is analogous to a “measurement error” (see Fig. 11.2). Similarly, concepts such as “sensitivity”, “specificity”, “selectivity”, “uncertainty”, etc. are analogous between classification systems and measurement systems. However, as illustrated in Fig. 11.1b, the “instrument” in a classification system can be a classifier such as a person.

These analogies reflect the fact that the more quantitative properties (ratio, interval) include at the same time the more basal characteristics of the less quantitative properties (ordinal, nominal). But the usual tools of statistics cannot be directly used for classification system response since distances on ordinal and nominal scales are not known. Ratio and interval scales belong to quantities (which can be measured) while ordinal and nominal properties are not quantities, but are based on categorical observations (i.e., classifications). However, as will be described below and in Melin & Pendrill [61], through measurand restitution, ordinal and nominal properties can be transformed into quantities.

For most people, a measurement does not stop with a “measurement result”, that is, a value of the measurand (after measurand restitution); rather, measurement is done for some further reason, such as conformity assessment of products (entities).

In fact, continuing a measurement to include a decision – such as the entity is approved or rejected with respect to a specification limit – turns the measurement system into a classification system! Thus, risks of decision-errors (vacillation – arguably closer to everyday “uncertainty”) arising from finite measurement uncertainty will belong to ordinal properties.
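The point that a conformity decision turns a measurement system into a classification system can be made concrete with a toy calculation: given a measured value near a specification limit and a Gaussian measurement uncertainty, the probability of a wrong binary decision (false accept or false reject) follows directly. This is a minimal sketch under a Normal error model; the function names and numbers are ours:

```python
import math

def norm_cdf(x: float) -> float:
    """Standard Normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def risk_of_false_decision(measured: float, spec_limit: float, u: float) -> float:
    """Probability that a binary conformity decision (pass if the
    measured value <= upper spec limit) is wrong, given a Gaussian
    measurement uncertainty u around the measured value."""
    # Probability that the true value lies above the limit
    p_true_above = 1.0 - norm_cdf((spec_limit - measured) / u)
    if measured <= spec_limit:        # decision: accept
        return p_true_above           # risk: false accept
    return 1.0 - p_true_above         # decision: reject; risk: false reject

# A result exactly on the limit is a coin toss, however small u is:
print(risk_of_false_decision(10.0, 10.0, 0.5))  # 0.5
```

The binary pass/fail output is an ordinal (two-category) classification, and the decision risk quantifies the “vacillation” described in the text.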

In this case, MSA is very useful for describing classification system performance in terms of the ability of the classifier (instrument) to make a rating; the level of difficulty of the task (object); and the ability of the operator – the rater – to rate. This is discussed further below.

It is fitting that MSA is the chosen technique to interpret RMT [83] metrologically, developed as it has been for quality assurance of measurements in the “field” or on the “workshop floor” [1, 7]. In contrast to the well-controlled conditions of the measurement laboratory, the majority of instruments (both the traditional engineering kinds and the humans considered here for PCOs in PCC) are used to make measurements where the surroundings may have considerable influence on every element of the measurement system. A patient undergoing a psychometric test, for instance, might easily be perturbed in her response by some disturbance from the environment in the “field” of the clinic, as modelled with MSA.

Other factors determining measurement outcome to consider are (i) possible interactions between the instrument (person) and the object (task performed) – similar to instrumental ‘loading’ in engineering measurement systems ([10, 69, 70, 73], Chapter 2 in [75]) and (ii) the particular measurement method chosen. All these factors are included in our MSA approach (despite statements to the contrary in the recent literature [93]).

The measurement system approach (Fig. 11.1) is essential in all sciences in obtaining both aspects of metrology – traceability and uncertainty:

  • Metrological standards, reflecting fundamental symmetries which provide minimum entropy and constants for measurement units (Chapter 3 in [75, 77]), traceability to which can enable the comparability of measurements (and by extension even the intercomparability of products and services of all kinds), can only be established if one can separate out – with a process called measurand restitution – the limiting factors of the measurement system used to make measurements from the response of the system to a given stimulus from the measurement object (section 11.3 Metrological references for comparability via traceability and reliable estimates of uncertainty).

  • Less than complete information about a system leads to uncertainties, which in turn can lead to incorrect decision-making, for example approval of an incorrect product. Formulation of a performance measure, i.e., how well a measurement system performs an assessment – in effect, measurement uncertainty – seems to be treated with similar (categorical) methods, whether for “instruments” – questionnaires, examinations, etc. – in social science or when assessing how well a measuring instrument shows whether an item or product is within or outside a specification limit (section 11.4 Decision risks and uncertainty).

Using MSA to interpret metrologically the RMT [83] adds extra insight into the meaning and impact of the various terms in Eq. (11.1). For instance, while mathematically the pair of attributes (δ, θ) appear symmetrically (apart from a difference in sign) in the logistic function, metrologically the role played by the object attribute and the instrument attribute, respectively, are very different, as will be illustrated in the rest of this chapter.

11.3 Metrological References for Comparability via Traceability and Reliable Estimates of Uncertainty

Much of the established metrological terminology of physical measurement carries well over into measurement in the human sciences (chapter 4 of [75], and Fig. 11.2), such as PCOs. There are however caveats, as mentioned (e.g., in the above section PCC including physical, psychological and social integrity):

  • Firstly, that data obtained from the response of a measurement system where a human is the instrument are often not themselves directly amenable to the usual statistical tools (e.g., calculating a mean or standard deviation) owing to the ordinal or nominal character of the raw data. Rather, the raw data are classifications.

  • Secondly, and often forgotten, as with any data obtained via an instrument, metrological traceability and reliable assessment of uncertainties and decision-risks with human science data require a proper measurement system analysis, where there are preferably clear and separate estimates of the contributions of each element of the measurement system – instrument, operator, environment and measurement method (Fig. 11.1) – to the overall response when measuring a certain object.

The recommendation to deal with these caveats is that, instead of attempting to treat raw data with invalid statistical tools, one transforms the classification system response Psuccess (e.g., the probability of making a correct decision or performing a task of a certain difficulty) lying on a less quantitative ordinal or nominal scale, onto a more quantitative interval or ratio scale by applying the RMT formula (Eq. 11.1). (Chapter 5 of [75] presents appropriate tests of the validity of the transformation.)
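The recommended transformation can be sketched as follows. Eq. 11.1 in its logistic form gives Psuccess from the difference θ − δ, and inverting it (the logit) restitutes a raw success probability onto an interval scale. This is a minimal illustration; the function names are ours:

```python
import math

def p_success(theta: float, delta: float) -> float:
    """Eq. 11.1: probability of success for a person of ability theta
    attempting a task of difficulty delta (logistic form)."""
    return 1.0 / (1.0 + math.exp(-(theta - delta)))

def logit(p: float) -> float:
    """Rasch restitution of a success probability onto an interval
    (logit) scale: theta - delta = ln(P / (1 - P))."""
    return math.log(p / (1.0 - p))

# Equal ability and difficulty give a 50% success rate ...
assert p_success(1.3, 1.3) == 0.5
# ... and the logit transform recovers the ability-difficulty difference:
print(round(logit(p_success(2.0, 0.5)), 6))  # 1.5
```

Note how the transform ‘stretches out’ the bounded 0–100% raw scale onto an unbounded linear scale on which ordinary statistics become legitimate.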

11.3.1 Metrological References for Comparability via Traceability

The use of a measuring stick, such as a ruler marked with a scale for length, is familiar to most when determining how many times a unit fits into the quantity to be measured. Maxwell wrote [58]:

Preliminary on the Measurement of Quantities.

1) EVERY expression of a Quantity consists of two factors or components. One of these is the name of a certain known quantity of the same kind as the quantity to be expressed, which is taken as a standard of reference. The other component is the number of times the standard is to be taken in order to make up the required quantity. The standard quantity is technically called the Unit, and the number is called the Numerical Value of the quantity. There must be as many different units as there are different kinds of quantities to be measured… .

In its logistic regression form, the ‘straight ruler’ aspect of the RMT formula, i.e., Eq. (11.1), has been described by Linacre and Wright [50] in the following terms:

The mathematical unit of Rasch measurement, the log-odds unit or ‘logit’, is defined prior to the experiment. All logits are the same length with respect to this change in the odds of observing the indicative event.

The RMT approach goes further in defining measurement units [33] since it uniquely yields estimates ‘not affected by the abilities or attitudes of the particular persons measured, or by the difficulties of the particular survey or test items used to measure,’ i.e., specific objectivity [34]. The RMT [83] approach is thus not simply mathematical or statistical, but instead a specifically metrological approach to human-based measurement.

Note that the same probability of success can be obtained with an able person performing a difficult task as with a less able person tackling an easier task. The separation of attributes of the measured item from those of the person measuring them – visualized with MSA (Fig. 11.1) – brings invariant measurement theory to psychometrics. Fisher’s [26] work on the metrology of instruments for physical disability was one of the first to demonstrate the concepts of intercomparability through common units, commonly referred to today as “item banking” [80]. Stone [90] for instance has written:

Item and person separation statistics in Rasch measurement provide analytic and quality control tools by which to evaluate the successful development of a variable and by which to monitor its continuing utility. … The resulting map is no less a ruler than the ones constructed to measure length.

Having enabled with RMT a set of metrological references, e.g. for task difficulty, one can then proceed to set up a scale (analogous to conventional measurement etalons (Fig. 11.3)) which is delineated by measurement units where any measured quantity, δj = {δj} ∙ [δ], is the product of a number {} and a unit denoted in square brackets [ ], according to Maxwell’s [58] text quoted above.

Fig. 11.3
figure 3

A set of tasks of increasing difficulty (recalling a series of digits in the memory test Digit Span Test, DST) as metrological references analogous to a set of increasingly heavy mass standards

This step of establishing metrological references is enabled by combining a procedure to transform qualitative data (i.e., classifications) to a new ‘space’ (in the present case, through restitution, to the space of the measurand), together with the ability of RMT to provide separate estimates of measurement and object dispersions in the results when a human acts as a measurement instrument.

This new approach to the metrological treatment of qualitative data differs from others in that the special character of the qualitative data is assigned principally not to the measurand (object entity characteristic) but to the response of the classification system. (A person-centered measurement process, where the human responder is the instrument at the heart of the measurement system, places the focus clearly on what and how the person perceives and experiences, analogous to the common expression: ‘Beauty is in the eye of the beholder’.) Using RMT in the restitution process establishes a linear, quantitative scale for the measurand (e.g., for a property such as task difficulty) where metrological quality assurance – in terms of traceability and uncertainty – can be performed.

Reference Measurement Procedures: Construct Specification Equations as Recipes for Traceability

“Recipes” to define measurement units in the social sciences, analogous to reference measurement procedures in analytical chemistry and materials science, appear promising as a viable procedure to establish metrological references in fields such as PCC where one does not enjoy access to universal units of measurement as in Physics (section 4.4.3 in [61, 75]). A condition is of course that the principal caveats of such data have been dealt with adequately, as described above.

What we earlier referred to, literally speaking, as a defining moment, was the all-important definition of which construct is actually intended to be quality-assured (section 11.1.3 A: Entity attribute description and specification). After identifying in that process all significant variables which can explain a particular construct, a construct specification equation (CSE) \( \delta =f\left[{x}_1\cdots {x}_m\right]={\sum}_{k=1}^m{\beta}_k\cdot {x}_k \) can be developed. This procedure can be applied to both the item (object) attribute δ and the person (instrument) attribute θ. A step-by-step description of ways of formulating CSEs, with examples, is given for the case of memory in the accompanying chapter by Melin & Pendrill [61] in this book.
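As a minimal illustration of fitting a CSE, an ordinary least-squares estimate of the coefficients βk can be obtained from item attribute values δ and candidate explanatory variables. The data and variables below are entirely synthetic (two hypothetical explanatory variables, no noise), not taken from Melin & Pendrill [61]:

```python
def fit_cse(X, delta):
    """Least-squares fit of a construct specification equation
    delta = sum_k beta_k * x_k for two explanatory variables,
    solving the 2x2 normal equations directly."""
    s11 = sum(x[0] * x[0] for x in X)
    s12 = sum(x[0] * x[1] for x in X)
    s22 = sum(x[1] * x[1] for x in X)
    t1 = sum(x[0] * d for x, d in zip(X, delta))
    t2 = sum(x[1] * d for x, d in zip(X, delta))
    det = s11 * s22 - s12 * s12
    return ((s22 * t1 - s12 * t2) / det,   # beta_1
            (s11 * t2 - s12 * t1) / det)   # beta_2

# Synthetic example: task difficulties built from two hypothetical
# explanatory variables with beta = (0.8, -0.3):
X = [(1.0, 2.0), (2.0, 1.0), (3.0, 3.0), (4.0, 1.0)]
delta = [0.8 * x1 - 0.3 * x2 for x1, x2 in X]
print(fit_cse(X, delta))  # recovers approximately (0.8, -0.3)
```

With real data the δ values would first be obtained by Rasch restitution of raw scores, and the candidate explanatory variables chosen from theory about the construct.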

11.3.2 Errors and Uncertainties in PCOs

Some typical data shown in Fig. 11.4 help describe errors and uncertainties in PCOs, where the example data set is described by Linacre (WINSTEPS ® [96]) as follows:

35 arthritis patients have been through rehabilitation therapy. Their admission to therapy and discharge from therapy measures are to be compared. They have been rated (1–7, although 6 or 7 are not used in the current example) on the 13 mobility items (e.g., the task of ‘eating’) of the Functional Independence Measure (FIM™):

  1. 0% Independent

  2. 25% Independent

  3. 50% Independent

  4. 75% Independent

  5. 100% Independent


Such a plot illustrates many of the essential aspects to consider when dealing with data based on classifications. The solid red line, with the characteristic ogive form, shows the result of a least-squares regression fit of the logistic formula (Eq. 11.1) to the raw data for one particular item (the task of ‘eating’) across the cohort of patients. How well this fit is made largely determines the uncertainties in the patient ability estimates.

Ogive curves of the kind exemplified in Fig. 11.4 show how the success rate Psuccess varies from 0% to 100% across the range of abilities of individuals in the test cohort. With a simple binary – “yes-no” – score without uncertainty, the curve would instead be a sharp step function, where 0% to the left of the zero of the logistic ability scale switches immediately to 100% to the right of zero. The example of Fig. 11.4 is the case where raw data scoring is instead divided into a finite number (five in the present case) of response options (categories) for classifications, which when fitted to the logistic Eq. 11.1 leads to a smooth ogive curve instead of a step.
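The least-squares fit of the logistic ogive described above can be sketched as follows, using synthetic abilities and scores and a simple grid search for the item difficulty δ, as a stand-in for the estimation performed by dedicated software such as WINSTEPS®:

```python
import math

def ogive(theta: float, delta: float) -> float:
    """Logistic ogive (Eq. 11.1): expected score vs person ability."""
    return 1.0 / (1.0 + math.exp(-(theta - delta)))

def fit_item_difficulty(thetas, scores):
    """Least-squares fit of the item difficulty delta to observed
    scores (scaled 0..1) via a coarse grid search."""
    grid = [d / 100.0 for d in range(-400, 401)]
    return min(grid, key=lambda d: sum((s - ogive(t, d)) ** 2
                                       for t, s in zip(thetas, scores)))

# Synthetic cohort whose scores follow an ogive with delta = 0.5 exactly:
thetas = [-2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
scores = [ogive(t, 0.5) for t in thetas]
print(fit_item_difficulty(thetas, scores))  # 0.5
```

With real polytomous data the scores would be the observed category fractions, and the scatter of residuals about the fitted ogive is what determines the uncertainty of the estimate.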

Fig. 11.4
figure 4

Residuals of logistic regression as differences \( {\boldsymbol{y}}_{\boldsymbol{i},\boldsymbol{j}}={\boldsymbol{x}}_{\boldsymbol{i},\boldsymbol{j}}-{\mathbbm{E}}_{\boldsymbol{i},\boldsymbol{j}} \) between each observed score xi,j (i.e., classification) and the expected score \( {\mathbbm{E}}_{\boldsymbol{i},\boldsymbol{j}} \) (i.e., the estimated measurand) for one item (task of ‘eating’) and across the cohort of test persons, i, in the example data set ‘EXAM12’ provided with the WINSTEPS® program

The Peculiar Sensitivity of a Human as a Measurement Instrument

Apart from the reservations already expressed about considering a human being as an “instrument” (section 11.2.3 A way forward for measuring PCOs: Human as a B: Measurement Instrument), with all the complexities that human behavior implies, there are some more specifically technical issues to deal with. (Discrimination will be tackled later.)

As with all counted-fraction data (bounded by 0% and 100%), the raw data scale (y-axis of Fig. 11.4) becomes strongly non-linear when approaching the lower and upper ends of the scale. This is simply stating that, at either scale extreme, the response has to be one response or the other – either “yes” or “no” – in this binary, two-category case. Don’t expect a Normal distribution of responses [95]. Attempts to use raw data in correlation studies (e.g., against biomarker concentration) or fit residual studies will certainly not be valid for test persons who find this item (task of eating) either too easy (those to the right of the plot) or too difficult (to the left of the plot) [76].

The sensitivity of the instrument (person) to changes, in contrast, varies strongly across the scale, and is greatest at the “sweet spot” at mid-range where Rasch’s models for measurement are most revealing ([75], p. 178). This spot, at the zero of the horizontal x-axis of person ability in Fig. 11.4, is where the ability equals the task difficulty, that is, where the probability of succeeding with the task is Psuccess = 1 − Psuccess = 50%. An “ordinary” measurement instrument in physics and engineering has a sensitivity (i.e., the relation between the output response and a given input) that is more or less constant across a range of measured values. The sensitivity of the instrument in PCOs (the person) is very different: sensitivity is greatest at mid-range, while at either end of the scale, well away from mid-range, the Rasch response model indicates that large excursions in test person ability have negligibly small effects on the raw score.

This means, for instance, that studies of healthy cohort members (such as when researching early detection of disease degeneration) will be especially challenging. There seems to be no alternative to transforming the raw data – where the Rasch transformation ‘stretches out’ the non-linearity – if one is to have any chance of reliably resolving person ability differences. Apart from the logit approach of RMT, there are other possible transformations: Tukey [91] mentions, for example, arc-sine “anglits” and Normal ruling “normits” or “probits”. However, it seems that the peculiar sensitivity makes the choice of restitution transformation less critical.
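The peculiar sensitivity described above follows directly from Eq. 11.1: differentiating the logistic response gives dP/dθ = P(1 − P), which peaks at the mid-range “sweet spot” and vanishes at the scale extremes. A minimal numerical check (function names are ours):

```python
import math

def p_success(theta: float, delta: float) -> float:
    """Eq. 11.1 logistic response."""
    return 1.0 / (1.0 + math.exp(-(theta - delta)))

def sensitivity(theta: float, delta: float) -> float:
    """dP/dtheta = P * (1 - P): sensitivity of the person-as-instrument."""
    p = p_success(theta, delta)
    return p * (1.0 - p)

# Greatest at the mid-range "sweet spot" (theta == delta) ...
print(sensitivity(0.0, 0.0))         # 0.25
# ... and negligible far from it:
print(sensitivity(6.0, 0.0) < 0.01)  # True
```

This is why raw-score differences among, say, healthy cohort members far from an item's difficulty carry almost no information about their true ability differences.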

This peculiar sensitivity will also provide a special “flavor” to how measurement uncertainties propagate through the measurement system, from object through instrument to response, as discussed next. Errors and Uncertainty. Reliability

In general, on inspection it will be found that the estimated person attribute θ (e.g., the level of ability of a particular person or instrument) differs, because of limited measurement reliability, from the ‘true’ value θ′ by an error εθ:

$$ \theta ={\theta}^{\prime }+{\varepsilon}_{\theta } $$

with limited reliability similarly evident in estimates of task difficulty:

$$ \delta ={\delta}^{\prime }+{\varepsilon}_{\delta }. $$

Such deviations arise, at least in part, because the measurement instrument (the person) used to ‘probe’ the object or item is not perfect. While every effort should be expended by the metrologist to evaluate (e.g., through calibration), correct for and reduce these measurement errors, there will always be – because resources are limited – some measurement errors which remain unknown. Measurement uncertainty is an estimate of these unknown measurement errors.

As in all statistics, the more data points (degrees of freedom) one has, the better the reliability in cases where statistical noise is a dominant source of uncertainty. In logistic regression, as in a simple calculation of a mean value, the ratio of true to error variance increases linearly in proportion to the number of data points. This is captured in the Spearman-Brown formula (Eq. 11.2, [88, 89]), which relates the reliability coefficients, RC and RT, of the current (C) and target (T) tests to the corresponding test lengths, LC and LT, that is, the number of instruments (persons) or objects (items):

$$ {L}_T={L}_C\cdot \frac{R_T\cdot \left(1-{R}_C\right)}{R_C\cdot \left(1-{R}_T\right)} $$

where a reliability coefficient (Rz) for an attribute, z, is calculated as:

$$ Reliability,\ {R}_z=\frac{True\ variance}{Observed\ variance}=\frac{Var(z)}{Var\left({z}^{\prime}\right)}=\frac{Var\left({z}^{\prime}\right)- Var\left({\varepsilon}_z\right)}{Var\left({z}^{\prime}\right)} $$

In psychometrics a measurement reliability coefficient (calculated with Eq. (11.2)) of 0.8 – corresponding to a measurement uncertainty of about one-half of the measured dispersion – is considered acceptable for so-called high-stakes testing [48]. A factor of one-half is also a pragmatic limit [73] for containing the impact of decision risks. Such reliability limits will of course set minimum requirements on sample size (number of test persons) and test length in randomized controlled trials, often formulated in regulations (e.g., by the regulators FDA and EMA).
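The Spearman-Brown relation of Eq. (11.2) can be applied directly. A sketch (the numerical values are hypothetical, chosen only to illustrate the arithmetic):

```python
def target_length(L_C: float, R_C: float, R_T: float) -> float:
    """Spearman-Brown (Eq. 11.2): the test length L_T needed to move
    from current reliability R_C (at length L_C) to target R_T."""
    return L_C * (R_T * (1.0 - R_C)) / (R_C * (1.0 - R_T))

def lengthened_reliability(L_C: float, L_T: float, R_C: float) -> float:
    """Inverse 'prophecy' form: reliability after re-scaling test length."""
    n = L_T / L_C
    return n * R_C / (1.0 + (n - 1.0) * R_C)

# Hypothetical example: a 10-item test with reliability 0.7 -
# how many items are needed to reach the 0.8 'high stakes' limit?
L_T = target_length(10, 0.7, 0.8)
print(L_T)                                    # just over 17 items
print(lengthened_reliability(10, L_T, 0.7))   # back to 0.8
```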

Although often used when estimating measurement uncertainties, a Bayesian approach has been criticized by Cramér [18], who wrote:

In a case where there are definite reasons to regard ɛ (i.e., measurement error) as a random variable with a known probability distribution, the application of the preceding method [Bayes theorem] is perfectly legitimate and leads to explicit probability statements about the value of ɛ corresponding to a given sample. However, in the majority of cases occurring in practice, these conditions will not be satisfied. As a rule, ɛ is simply an unknown constant and there is no evidence that the actual value of this constant has been determined by some method resembling a random experiment. Often there will even be evidence in the opposite direction, as for example in cases where the ɛ values of various populations are subject to systematic variation in space or time. Moreover, even when ɛ may be legitimately regarded as a random variable we usually lack sufficient information about its a priori distribution.

Cramér continues by recommending the use of confidence intervals as estimates of measurement uncertainty:

$$ P\left({c}_1<\varepsilon <{c}_2;\varepsilon \right)=1-\alpha $$

The probability that such and such limits (c1, c2, which may vary from sample to sample) include between them the parameter value ɛ (an unknown “measurement error”) corresponding to the actual sample, is equal to 1 − α, where α is the “risk of error”.

Apart from limited-number statistics, uncertainties will also be determined by how well the ability of a particular cohort member matches the item task difficulty. This is because, as described in relation to Fig. 11.4, counted-fraction data are only sensitive at mid-range, so that cohort members who are under-challenged or over-challenged by the task will have larger uncertainties compared with cohort members whose abilities match the task difficulty.

To provide a representative sampling, adding analyses with additional items can be expected to improve reliability, not only by increasing the number of degrees of freedom, but also by varying the level of difficulty by choosing a number of different items which together span the range of interest. As stressed in the section above, 11.1.3 A: Entity attribute description and specification, items can be arranged in a hierarchy of item difficulties: choosing items of greater difficulty will challenge the healthier cohort members, while easier tasks enable a fair assessment of the less healthy.

Wright [97] described how RMT makes separate estimates of attributes of each test person (TP) i with attribute (e.g., ability) θi and of each item j with attribute (e.g., difficulty) δj. These two parameters are adjusted in a logistic regression of Eq. (11.1): \( \theta -\delta =\log \left(\frac{P_{success}}{1-{P}_{success}}\right) \) to the score response data yi, j on an ordered category scale by minimising the sum of the squared differences:

$$ \sum \limits_{i=1}^{N_{TP}}\sum \limits_{j=1}^L{\left({y}_{i,j}-{P}_{success,i,j}\right)}^2 $$

The goodness of fit can be judged by examining how closely the overall fitted ogive item response curve of Eq. (11.1) matches individual average scores at different locations across the scale [6].
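The adjustment just described – minimising the sum of squared differences between scores and modelled success probabilities – can be sketched as a toy implementation (an illustration only, not the estimation machinery of dedicated Rasch software; the synthetic ‘true’ parameter values are assumptions):

```python
import math

def p(theta: float, delta: float) -> float:
    """Rasch dichotomous success probability, Eq. (11.1)."""
    return 1.0 / (1.0 + math.exp(-(theta - delta)))

def fit_rasch(y, n_iter=5000, lr=0.5):
    """Fit person abilities theta_i and item difficulties delta_j by
    minimising sum_ij (y_ij - P_ij)^2 with plain gradient descent.
    delta is re-centred on zero to fix the scale origin."""
    n, m = len(y), len(y[0])
    theta, delta = [0.0] * n, [0.0] * m
    for _ in range(n_iter):
        P = [[p(theta[i], delta[j]) for j in range(m)] for i in range(n)]
        for i in range(n):
            g = sum(-2.0 * (y[i][j] - P[i][j]) * P[i][j] * (1.0 - P[i][j])
                    for j in range(m))
            theta[i] -= lr * g
        for j in range(m):
            g = sum(2.0 * (y[i][j] - P[i][j]) * P[i][j] * (1.0 - P[i][j])
                    for i in range(n))
            delta[j] -= lr * g
        shift = sum(delta) / m            # re-centre; theta - delta is unchanged
        delta = [d - shift for d in delta]
        theta = [t - shift for t in theta]
    return theta, delta

# Noiseless synthetic scores generated from 'true' theta = (-1, 0, 1), delta = (-0.5, 0.5):
true_theta, true_delta = [-1.0, 0.0, 1.0], [-0.5, 0.5]
y = [[p(t, d) for d in true_delta] for t in true_theta]
theta_hat, delta_hat = fit_rasch(y)
print(theta_hat, delta_hat)
```

With noiseless synthetic scores the least-squares minimum lies at the generating parameters, so the fitted values recover both the item-difficulty hierarchy and the person-ability ordering.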

Continuing the discussion of the distribution variance in the Rasch [83] model, estimates of measurement uncertainty, u, for person (i, NTP) and item (j, L) attributes and categories k, k’, derived from the Rasch expression Eq. (11.1) are made, respectively with the following expressions ([97] and Fig. 11.2):

$$ \left\{\begin{array}{c}u\left(\theta \right)={\left(\sum \limits_{j=1}^L{P}_{success,i,j,k}\cdot {P}_{success,i,j,{k}^{\prime}}\right)}^{-\frac{1}{2}}\\ {}u\left(\delta \right)={\left(\sum \limits_{i=1}^{N_{TP}}{P}_{success,i,j,k}\cdot {P}_{success,i,j,{k}^{\prime}}\right)}^{-\frac{1}{2}}\end{array}\right. $$
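A sketch of the person-side expression (assuming, as is usual for Wright’s standard errors, that the summed binomial term Psuccess · (1 − Psuccess) over the L items is inverted as a whole):

```python
import math

def p(theta: float, delta: float) -> float:
    """Rasch dichotomous success probability, Eq. (11.1)."""
    return 1.0 / (1.0 + math.exp(-(theta - delta)))

def u_theta(theta: float, deltas) -> float:
    """Standard uncertainty of a person-ability estimate:
    u(theta) = (sum_j P_j * (1 - P_j)) ** -0.5 over the L items taken."""
    info = sum(p(theta, d) * (1.0 - p(theta, d)) for d in deltas)
    return info ** -0.5

# Ten items all matched to the person (theta = delta = 0): information = 10 * 0.25
print(u_theta(0.0, [0.0] * 10))   # ~0.632 logits
# Same person, ten far-too-easy items: a much larger uncertainty
print(u_theta(0.0, [-4.0] * 10))
```

Items matched to the person give the smallest uncertainty; far-too-easy (or far-too-hard) items inflate it, echoing the mid-range sensitivity argument made in relation to Fig. 11.4.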

The dichotomous relations of the basic Rasch model can be extended to the polytomous case, according to the expression:

$$ {q}_{i,j,c}=P\left({y}_{i,j}=c\right)=\frac{e^{\left[c\cdot \left({\theta}_i-{\delta}_j\right)-\sum \limits_{k=1}^c{\tau}_{k,j}\right]}}{\sum_{c=0}^{K_j}{e}^{\left[c\cdot \left({\theta}_i-{\delta}_j\right)-\sum \limits_{k=1}^c{\tau}_{k,j}\right]}} $$

The polytomous Rasch variant of GLM then models the response Y at any one point on the scale as a sum of dichotomous functions expressed as the log-odds ratio z = θ − δ − τ for each threshold τ, where a threshold is the point at which there is a 50% probability that the response scores in either of the two adjacent categories. This polytomous Rasch variant is referred to in the literature as the Andrich [5] “rating scale” or the Masters [56] “partial credit” approach, amongst others; in R it is implemented in the eRm package (Extended Rasch Modeling). Programs such as WINSTEPS [96] and RUMM make a logistic regression of the polytomous Rasch formula to the response data Y = Psuccess, using the “Joint Maximum Likelihood Estimation” method to estimate values of the ‘latent’ (explanatory or covariate) variables Z: θ, δ and the thresholds τ.
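The polytomous expression above can be evaluated directly (a minimal sketch; the threshold values are hypothetical):

```python
import math

def category_probs(theta: float, delta: float, taus):
    """Polytomous Rasch (rating-scale form): probability of scoring in each
    category c = 0..K, with thresholds taus = [tau_1, ..., tau_K]."""
    K = len(taus)
    logits = [c * (theta - delta) - sum(taus[:c]) for c in range(K + 1)]
    m = max(logits)
    w = [math.exp(z - m) for z in logits]   # numerically stabilised softmax
    s = sum(w)
    return [x / s for x in w]

# Three categories (K = 2), person matched to the item:
q = category_probs(theta=0.0, delta=0.0, taus=[-1.0, 1.0])
print(q)   # probabilities over categories 0, 1, 2; they sum to 1
```

At a threshold (θ − δ = τ) the two adjacent categories are equally probable, as the text states.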

Naturally, the requirements of measurement – such as about invariance and dependency – need to be tested quantitatively in any specific application of a model (Eq. 11.1) to a set of data. Linacre & Fisher [49] write:

An advantage of Rasch methodology is that detailed analysis of Rasch residuals provides a means whereby subtle inter-item dependencies can be investigated. If inter-item dependencies are so strong that they are noticeably biasing the measures, then Rasch methodology supports various remedies.

The statistical uncertainties (evaluated with Type A methods [42]) associated with the rate of success (Eq. 11.4) can be described with the Poisson distribution in Rasch’s original [83] model (and associated with the response of the measurement system (Fig. 11.1b)). Over and above Type A evaluations, a complete expression of measurement uncertainties will also include accounts of a lack of knowledge about other elements of the measurement system, particularly the measured object (e.g., task) and the measurement instrument (person). In the same way as a Rasch approach insists on separate analysis of task difficulty and person ability, measurement uncertainties, too, are best evaluated in terms of these elements separately. Examples of such uncertainties arise, as mentioned earlier, for instance when either the task differs – for one reason or another – from the “standard” recipe, perhaps arising from varying ways of administering the test (section above Reference measurement procedures: Construct specification equations as recipes for traceability), or when an individual rates his response on another scale than the cohort average, perhaps because of special sensitivities arising from emotion associated with stigma or illness (section Discrimination).

Once again, when regarding measurement uncertainties not merely as standard deviations, but rather as estimates of “unknown measurement errors” [73], an MSA approach is appropriate for capturing the causes of uncertainty arising from less-than-perfect measurement systems. There are a number of similarities with the corresponding historic lack of attention to the “instrumental task of quantification” as mentioned in the section 11.2.3 A way forward for measuring PCOs: B: Human as a Measurement Instrument. Both measurement uncertainty analyses in general as well as psychometric studies benefit from an MSA-based methodology.

Uncertainties in both the measured object (task difficulty) and measurement instrument (person ability) will propagate through the measurement system as corresponding uncertainties in the raw response data (and by extension to measurand estimation on restitution). An important part of a detailed analysis of Rasch residuals is thus a proper account of the peculiar sensitivity of the instrument (person) in a person-centered measurement system. Recall that “sensitivity” is the response of an instrument to a given input stimulus. Because of the strongly “resonant-like” variation of sensitivity as a function of attribute level, uncertainties in object or instrument at either extreme end of the scale will contribute essentially nothing to response uncertainties, while uncertainties at attribute levels closer to mid-scale will result in positive and negative swings in fit residuals on either side (as explained in relation to Fig. 11.4). A full account, including analytical expressions for construct alleys for various scale distortions, as well as the effects of discrimination can be found in section 5.7 of [75].

Discrimination

Referring again to Fig. 11.3, showing hierarchies of metrological standards, a requirement of the basic RMT is that person ability can be determined irrespective of task difficulty (and vice versa). That requirement is never fully satisfied, and we can envisage special cases, particularly in PCOs, where we might fail to meet it quite dramatically. When regarding a human being as an “instrument”, again, with all the complexities that human behavior implies, one can consider cases where an ‘irrational’ response of a test person – caused by emotions or illness, for instance – could occur for one or more tasks, in which the scaling of person ability for a given task difficulty deviates from the responses of the cohort overall. It is, so to say, a differential item and person response which leads to scale distortions which are not automatically compensated for with the basic RMT, but require an additional parameter, such as person discrimination (i.e., instrument sensitivity). Examples include [63]:

  • Acquiescence: “Agreement regardless of item content”

  • Disacquiescence: “Disagreement regardless of item content”

  • Extreme response bias: “Use scale endpoints regardless of item content”

  • Middle response bias: “Use scale midpoint regardless of item content”

  • Social desirability bias: “Present oneself in a positive way, regardless of item content”

Scale distortions associated with any of these effects would be additional to the counted-fraction non-linearity modelled with RMT, which has to be compensated for when making a proper analysis of raw-score ordinal data. Evidence for such effects can be sought in the residuals of fit and various plots, such as the so-called construct alleys ([54], section 5.7 of [75]). A construct alley, i.e., a plot of task difficulty values, δ (or person ability values, θ), against the residuals of the logistic regression, such as INFIT ZSTD, is a sparsely used but potentially powerful tool to diagnose response patterns and further enhance the understanding of fit statistics. Apart from random noise, Massof et al. [54] have reported systematic distortions in construct alleys for different vision traits (for tasks of mobility, reading, visual information and visual motor). Another tool for analyzing goodness of fit of the logistic Eq. 11.1 found to be useful in detecting scale distortions is the principal component loading plot [78].

The form of the ogive response curve exemplified in Fig. 11.4, particularly how fast the ogive curve rises through the mid-point, is in general determined by measurement uncertainty associated with how sensitive each instrument (test person) is to the task at hand [60].

A basic requirement of Rasch’s models for measurement is that every test person lies on the same ogive curve, irrespective of who they are. Apart from task difficulty and instrument ability, in some cases an additional factor – the discrimination of the person responding – may vary (for instance, because of illness or emotion [86]). This can be captured in related Item Response Theory (IRT) with a discrimination parameter: the finite resolution, ρ, of the instrument (a patient in the present case) modelled as the entropy \( H\left(Z,Y\right)\sim \ln \left(\rho \right)=\ln \left(\sqrt{3}\cdot 2\cdot u\right) \) of a uniform distribution associated with limitations in measurement quality as measurement information is transmitted from the object attribute, Z, to the response, Y, of the instrument, where u is the standard measurement uncertainty [77].

There are, of course, different conceivable models of a modified response associated with effects such as acquiescence and response bias at various locations of the scale (section 5.7 of [75]). A simple rescaling, centered on the logistic scale and varying linearly, would be ∂δj = s ∙ δj, where s is a re-scaling factor. For some reason or other, rating of item j is made so that the item attribute (such as task difficulty) lies on a different scale: a positive value of s indicates an extended scale (test persons (instruments) rate this item more strongly than others, perhaps to indicate an increased importance or weight), while a negative value of s corresponds to the case where the rating does not recognize a reversed scale. An example of the latter is where a survey designer has deliberately included alternately positive (true key) and negative (false key) items, perhaps to reveal evidence of acquiescence in raters, that is, responses which tend to agree with questions (e.g., personality scales [87]) without due regard for the content of the item.

Measurement Uncertainty and Measurement System Analysis

Summarizing the steps to be taken overall when expressing measurement uncertainty [42]:

  1. Analyze the measurement system. Set up an error budget.

  2. Correct for known measurement errors, such as subtracting a known bias, b, and correcting for a sensitivity K differing from 1 (unity) (Fig. 11.3).

  3. Evaluate (standard) measurement uncertainties with methods of Type A (i.e., statistically) or Type B.

  4. Combine standard measurement uncertainties by quadratic addition ⇒ uc.

  5. Expand the measurement uncertainty ⇒ U = k·uc.
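The steps above can be sketched numerically (all values are hypothetical, chosen only to show the arithmetic of steps 2–5):

```python
import math

# Step 2: correct a single raw indication for known errors (hypothetical numbers)
indication = 12.70      # raw instrument response
bias = 0.20             # known bias b, from calibration
K = 1.05                # known sensitivity, differing from 1 (unity)
corrected = (indication - bias) / K

# Step 3: standard uncertainties
u_A = 0.15              # Type A: statistical, e.g. s / sqrt(n)
u_B = 0.10              # Type B: e.g. from a calibration certificate

# Step 4: combine by quadratic addition
u_c = math.sqrt(u_A**2 + u_B**2)

# Step 5: expand with coverage factor k = 2
U = 2.0 * u_c

print(f"result = {corrected:.2f} ± {U:.2f} (k = 2)")
```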

The first (and perhaps most essential) step in any evaluation of measurement uncertainty is to make as complete, valid and reliable a description as possible of the measurement or classification system at hand. Analogously to the corresponding first step of construct description, if one misses a key contribution when formulating one’s measurement model, then no amount of statistics or other actions in the later steps of the list above will compensate for that omission.

The techniques of MSA are recommended not only when ensuring metrological traceability, but also here, where clear and separate estimates are preferably made – using methods such as ANOVA – of the contributions of each element of the measurement system (instrument, operator, environment and measurement method, Fig. 11.1) to the overall uncertainties in the response when measuring a certain object.

Further discussion can be found in Chapter 5 of [75].

11.4 Decision Risks and Uncertainty

In this final section, we close the quality loop: we return to where we started and complete the quality-assurance of a measurement task by considering how best to make a final assessment of conformity of the entity to specified requirements in terms of impact. As said at the outset, most measurements are not ends in themselves, but are usually made with the aim of assessing a product – in the present case, the health care service provided in terms of quality of PCC. When closing the loop and providing the final “bookend”, several of the quality characteristics of health care services identified at the start of this chapter will be key to assuring quality in PCC.

At this final stage, use can be made of all the previous material presented in the preceding sections of this chapter.

11.4.1 Comparing Test Result with Product Requirement

A test result for the cognitive ability, θ, of a test person, with its measurement uncertainty interval, compared with examples of the lower specification limits LSL,AD and LSL,MCI for diagnosis by a clinician of Alzheimer’s disease (AD) and mild cognitive impairment (MCI), respectively, is shown in Fig. 11.5 as a typical case of PCC (data taken from Melin & Pendrill [61]). A lower specification limit means that measurement values less than the limit are judged to be a positive indication of disease, so that, for instance, test results of ability, θ, less than LSL,AD are considered to lie probably in the Alzheimer’s disease region of permissible values RAD.

Fig. 11.5
figure 5

CBT and DST memory test results [62], with measurement uncertainty intervals (double-ended arrow, k = 2), together with specification limits for diagnosis of Alzheimer’s disease (AD) and mild cognitive impairment (MCI) Hughes et al. [32]. The histogram columns indicate the distribution of person ability across the cohort

The specification limits and corresponding regions of permissible values in the case shown in Fig. 11.5 have been set according to the 50% and 35% limits for ‘probable’ and ‘possible’ AD set by Hughes et al. [32]. It should however be noted that such specification limits vary between clinics, and assessment of memory ability is not the only input to the clinical examination when setting a diagnosis.

Requirements for Decision Risk Management

Measurement uncertainty in a test result – an apparent product dispersion arising from limited measurement quality (section Errors and uncertainty. Reliability) – can be a concern in conformity assessment by inspection since, if not accounted for, uncertainty can both lead to incorrect estimates of the consequences of entity error and to an increase in the risk of making incorrect decisions, such as failing a conforming entity or passing a non-conforming entity when the test result is close to a tolerance limit (section Consumer and Provider Risks).

Requirements for appropriately accounting for the consequences of measurement uncertainty when making decisions of conformity have recently entered more prominently in the main written standards for conformity assessment, such as the latest version of ISO 17025:2017 [39], which states:

7.7.1 Evaluation of conformance

When statement of conformity to a specification or standard for test or calibration is requested, the laboratory shall:

document the decision rules employed taking into account the level of risk associated with the decision rule employed (false accept and false reject and statistical assumptions associated with the decision rule employed);

apply the decision rule.

NOTE For further information see ISO/IEC Guide 98-4. [43]

For PCOs addressing cognition, for example (section PCOs: Example neuropsychological cases), the “laboratory” referred to above could be a memory clinic. For patient participation (section PCOs: Example patient participation), the “laboratory” could correspond to the survey administrator, such as an authority, clinic lead or research group.

Because of measurement uncertainty (as evaluated in the section 11.3.2 Errors and uncertainties in PCOs, or as otherwise arising when making a clinical decision), there are risks of making incorrect decisions of conformity: a test result can appear to lie in one region while there is a finite probability that it lies in another, particularly where the test result is close to (within uncertainty of) the decision limit (section Consumer and Provider Risks). A decision of conformity of a quality characteristic with respect to a specification limit – normally made in the present case by a clinician, but in some cases as a self-assessment by the patient herself – depends on two factors: the measurement uncertainty and the distance between a test result and the specification limit. These two factors together determine whether an entity is approved or not and what the risks of an incorrect decision are (section Consumer and Provider Risks).

A definition of decision-making accuracy – in terms of Psuccess, that is the probability of making a correct classification – comes from the expression (Chapter 2 in [75], Fig. 11.2):

$$ Accuracy\left( decision\ making\right)= response\ categorization- input\ (true)\ categorization $$

RMT is very broad in its application and is not restricted to measurement systems with human intervention but should also apply to other ‘probe: task’ systems. Examples include [69, 70, 73, 92] the performance of a system (characterized by the ability to provide healthcare, such as waiting times for surgery, separately from the levels of challenge associated with each task), or material testing (e.g., the ability of an indenter, separately from the hardness of each material test block).

Apart from illustrating product conformity, plots such as Fig. 11.5 can also be used when considering measurement conformity. To make decisions about conformity to “product” specifications in a reliable manner will require measurements which themselves have been shown to satisfy conformity to corresponding measurement specifications. A measurement conformity assessment version of Fig. 11.5 could show for example how the actual instrument error (with its uncertainty interval) lies with respect to the maximum permissible (measurement) error (MPE) and maximum permissible (measurement) uncertainty (MPU) specifications.

11.4.2 C: Man as an Operator: Rating the Rater

In terms of measurement system analysis where a human enters into different parts of a measurement system (Fig. 11.1), the case of a clinician making a diagnosis on the basis of the data exemplified in Fig. 11.5, from the chapter by Melin and Pendrill [61], can be interpreted as a human acting as the Operator of the measurement system, as shown in Fig. 11.6 (C at the third main stage in the measurement process (Figure 1 of [77])). That is, the operator (clinician) makes a diagnosis, based on the response of the instrument (each test person), about whether that person’s ability lies within or outside the specification limits in Fig. 11.5 (corresponding to whether a product is “approved” or not in regular conformity assessment).

Fig. 11.6
figure 6

A human as an operator in a measurement system

In recent work, Andrich [6] has studied Gaussian and Rasch distributions in instrument response, while van der Bles, et al. [13] considered how epistemic uncertainty is interpreted by the rater, as well as communication about uncertainty to a third-party audience. Our approach is instead to propose that RMT can usefully be applied to extend the ‘rating the raters’ approach of Akkerhuis et al. [3, 4] who have modelled decisions about manufactured product made in an industrial context. With our extension (chapter 6 of [75]), separate estimates of the ability, θi, of each rater, i, to classify product can be made, alongside estimates of the level of difficulty, δj, of each product classification task.

But first, in the next section, we need to identify the main constructs with which a rater – such as a health care professional – can be characterized for conformity assessment in quality assurance, for instance in PCC.

Rater Constructs

A comprehensive description of the constructs of interest when making clinical diagnoses can be based on the preceding techniques and construct specification equations (CSE) can be formed, in analogous fashion (section Reference measurement procedures: Construct specification equations as recipes for traceability). Apart from how well trained, how experienced, and how alert a health care professional is, their ability θ to diagnose may also reflect their attitudes to desiring to give high quality care and their ability to “leave prejudices at the door” when meeting a new cohort individual [66]. Similarly, explaining what makes the clinical classification of different cohort individuals more or less difficult, level δ, probably depends in a complex way on how ill each assessed person is. It is unclear whether it is easier or more difficult to diagnose a healthy or sick person.

With the overall aim of supporting interoperability in the exchange of meaningful information between information systems in respect of nursing diagnoses and nursing actions, the standard ISO 18104 specifies healthcare entities for nursing diagnoses. Motivation for the development of terminological systems to support nursing in ISO 18104 [40] lies in multiple factors, including the need to describe nursing in order to educate and inform students and others; to represent nursing concepts in electronic systems and communications, including systems that support multiprofessional team communications and personal health records; and to analyse data about the nursing contribution to patient care and outcomes – for quality improvement, research, management, reimbursement, policy and other purposes.

Connecting the terminology of ISO 18104 [40] to PCC (“nurse” below can be any clinical profession):

  • The probability Psuccess, i of a successful clinical diagnosis by an <individual> nurse on a <<recipient of care>> would correspond to a measure of how well a <judgement> is performed (section Consumer and Provider Risks).

  • The <<recipient of care>> is the person, family, group, or other aggregate to whom the action is delivered.

  • The entity (object) is the <<target>> of the diagnosis which is the entity that is affected by the nursing action or that provides the content of the nursing action. Semantic categories in the <<target>> domain include but are not limited to: <body component>, <sign>, <device>, <substance>, <physical environment>, <resource>, <process>, <dimension>, <individual>, <group>, and the categories that have the role of <<focus>> in nursing diagnoses. Nursing diagnosis can also be a <target>.

  • A construct is a <focus> which is an “area of attention”, such as “tissue integrity, body temperature, activity of daily living”. <<Focus>> may be qualified by <timing>.

  • A <judgement> (opinion or discernment related to a <focus>) may be characterized as being “Impaired, reduced, ineffective”.

  • Judgement categories that are valid for representation of a nursing diagnosis include, but are not limited to, alteration, adequacy, and effectiveness.

  • Attributes associated with nursing performance (such as knowledge, motivation, and ability) belong to <dimension>.

Consumer and Provider Risks

Four alternative outcomes – illustrated in Fig. 11.7 – can be encountered when making decisions of conformity in the presence of measurement uncertainty. In addition to a pair of correct decisions of conformity, measurement uncertainty can lead to a second pair:

  • non-conforming entities being incorrectly passed on inspection – consumer risk, α, where the consumer in PCC is the person receiving care

  • conforming entities being incorrectly failed on inspection – provider risk, β, where the provider in PCC is the care provider

These incorrect decisions arise particularly when a test result is close (within the uncertainty interval) to a specification limit [73, 75]. Making decisions in the presence of measurement uncertainty is an example of the wider concept of classification.

Fig. 11.7
figure 7

Two correct & two incorrect decisions of compliance. TPR = True Positive Rate, FPR = False Positive Rate, FNR = False Negative Rate and TNR = True Negative Rate

The risks of incorrect decisions on identification [41] – both by variable and by attribute – are simply connected to the logistic regression formula (Eq. (11.1)) by the relation:

$$ 1-\alpha ={P}_{success} $$

When making decisions for a single item – in the present case, an individual patient (“consumer”) receiving PCC from a care provider – the specific risks can be readily calculated in terms of the area under that portion of the measurement PDF gtest which in each case lies, so to say, “beyond – on the other side of” the relevant specification limit, LSL, relative to the mean \( \overline{z} \), and are expressed mathematically as:

  • Consumer specific risk (by variable)

$$ {\alpha}_{specific}\left(\overline{z}\right)={\int}_{\eta <{L}_{SL}}{g}_{test}\left(\eta |\overline{z}\right)\cdot d\eta \left(\overline{z}\ge {L}_{SL}\right) $$
  • Provider specific risk (by variable)

    $$ {\beta}_{specific}\left(\overline{z}\right)={\int}_{\eta \ge {L}_{SL}}{g}_{test}\left(\eta |\overline{z}\right)\cdot d\eta \left(\overline{z}<{L}_{SL}\right) $$

– for a test result mean value, \( \overline{z} \), (distribution gtest) and a lower specification limit, LSL.
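Under the common assumption that gtest is a Normal distribution with mean z̄ and standard uncertainty u, these integrals reduce to evaluations of the Normal CDF at the specification limit (a sketch, not the source’s calculation):

```python
from statistics import NormalDist

def consumer_specific_risk(z_bar: float, u: float, LSL: float) -> float:
    """alpha_specific: probability that the true value lies below LSL,
    given an accepted test result z_bar >= LSL with standard uncertainty u."""
    assert z_bar >= LSL
    return NormalDist(z_bar, u).cdf(LSL)

def provider_specific_risk(z_bar: float, u: float, LSL: float) -> float:
    """beta_specific: probability that the true value lies at or above LSL,
    given a rejected test result z_bar < LSL."""
    assert z_bar < LSL
    return 1.0 - NormalDist(z_bar, u).cdf(LSL)

# A test result sitting exactly on the limit: 50% risk.
print(consumer_specific_risk(z_bar=0.0, u=1.0, LSL=0.0))   # 0.5
# Two standard uncertainties above the limit: risk falls to about 2.3%.
print(consumer_specific_risk(z_bar=2.0, u=1.0, LSL=0.0))
```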

These decision risks can be summarized in a so-called confusion matrix:

$$ \left(\begin{array}{cc}1-\alpha & \alpha \\ {}\beta & 1-\beta \end{array}\right) $$

which corresponds to the four quadrants shown in Fig. 11.7. Apart from specific risks when assessing conformity to specification of a single item (measurement distribution gtest), in general different items (occasions of PCC) will have different quantity values – for instance, due to actual variations (distribution gentity) in the health of different patients (see section above 11.2 Benefits of combining Rasch Measurement Theory (RMT) and quality assurance) – and the commensurate global risks of incorrect decisions also need to be estimated. Extending the formulae above by convolving the variables Zglobal and Yglobal leads to expressions such as:

  • Consumer global risk (by variable)

    $$ {\alpha}_{global}\left(\overline{z}\right)=\int {\int}_{\eta <{L}_{SL}}{g}_{entity}\left(\xi |\overline{z}\right)\cdot {g}_{test}\left(\eta |\overline{z}\right)\cdot d\eta \cdot d\xi \left(\overline{z}\ge {L}_{SL}\right) $$

Risks and the consequences of incorrect decision-making in conformity assessment should always be evaluated. Beyond the percentage probabilities discussed in this section, ultimately risks can be minimized by proactively setting limits on maximum permissible measurement uncertainties and on maximum permissible consequence costs [73].

11.4.3 Receiver Operating Characteristics: A Human as an Operator & Rating the Rater

For historical reasons, a number of related plots go under the name “operating characteristic”, all intended to convey an indication of the “power” of a set of measurements:

  • In statistical acceptance sampling, the “operating characteristic” (or “discriminatory power”) curve is a plot of the probability of accepting a lot as a function of an explanatory variable [68]

  • A “receiver (or “relative”) operating characteristic” (ROC) can be a plot of the true positive rate (TPR, or “sensitivity”) [81, 98]:

    $$ TPR=\frac{TP}{TP+FN}=\frac{1-\alpha }{1-\alpha +\beta } $$

    against the false positive rate (FPR), where β is the “supplier” risk in a dichotomous case, i.e., the probability that a product is incorrectly rejected (“false negative”, FN).
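For a dichotomous decision, both rates follow directly from the four confusion-matrix counts. A minimal sketch, with purely hypothetical counts for illustration:

```python
def rates(tp, fn, fp, tn):
    """True and false positive rates from dichotomous decision counts."""
    tpr = tp / (tp + fn)   # sensitivity
    fpr = fp / (fp + tn)   # 1 - specificity
    return tpr, fpr

# Hypothetical counts: 100 truly sick and 100 truly healthy individuals
tpr, fpr = rates(tp=90, fn=10, fp=20, tn=80)
print(tpr, fpr)  # → 0.9 0.2
```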

There are two possible methods of plotting ROC curves, i.e., of calculating the probabilities of incorrect decisions α and β:

  • (i) using Eq. (11.8) based on uncertainties, e.g., from the Rasch analyses; or

  • (ii) simply counting the rate of correct clinical diagnoses at each signal level.

Potential differences between these approaches might arise if the clinician uses other techniques in addition to inspection of plots such as Fig. 11.5. That is the most likely case, since the clinician will typically weigh together several factors (including his or her professional experience) when reaching a final decision. The clinical diagnosis uncertainties in method (ii) are probably greater than the signal level uncertainties of method (i), because the clinical judgment includes a number of criteria in addition to signal levels. That this is so in the present case is evident from the data shown in Fig. 11.8, since cohort individuals are diagnosed as sick across practically the whole range of signal levels. Further consideration of possible explanations of clinician ability is given in the section Explaining clinical diagnoses.

Fig. 11.8

Example of the basis for calculating ROC curves in a biomarker case. Histograms of the occupancy (number of cohort members, \( N_{biomarker} = 80 \)) as a function of measured biomarker concentration (arbitrary units). Clinical (binary) classification H = healthy and S = sick = MCI + AD indicated by different colored columns: Blue, \( N_{total} = N_H + N_S \); Orange, \( N_H \), number of healthy cohort individuals; and Grey, \( N_S \), number of unwell cohort individuals

Fig. 11.9

Two ROC curves of Sensitivity TPR versus False positive rate FPR based on clinical classification over the range of each signal level (C) of interest: Orange dots, Rasch cognitive ability θ; Grey dots, biomarker concentration

Calculating ROC curves with method (ii) on the basis indicated in Fig. 11.8 (using data from the example provided by Melin & Pendrill [61]) involves evaluation of expressions such as [81, 98]:

$$ \mathrm{Sensitivity}= TPR=\frac{\sum_{C> SL}{N}_S}{\sum_C{N}_S} $$
$$ \mathrm{False\ positive\ rate}= FPR=\frac{\sum_{C> SL}{N}_H}{\sum_C{N}_H} $$

where the clinical classification assigns the occupancy, the number N of cohort individuals, to one of two categories, H = healthy and S = sick = MCI + AD, at each signal level, C (in the present case, the concentration of a selected biomarker). Both the true positive rate TPR and the false positive rate FPR (as well as the respective expressions for negatives) are calculated in terms of the number of cohort individuals having signals, C, exceeding a signal specification limit, SL, which is varied successively over the signal range of interest, assuming that higher signal levels indicate a greater probability of sickness.

The relative performance (“discriminatory power”) of different signals in providing a basis for clinical diagnosis can be gauged by comparing ROC curves for the different signals. In addition to biomarker concentration, one can also evaluate Eq. 11.10 using signals from the Rasch cognitive ability θ, for instance.
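As a worked sketch of the counting method (ii), the following evaluates Eq. 11.10 for hypothetical occupancies (invented for illustration, not the Melin & Pendrill data) by sweeping the specification limit SL across the signal levels, together with the trapezoidal Area Under the Curve (AUC) figure of merit:

```python
# Hypothetical occupancies N_S (sick) and N_H (healthy) at each signal level C
levels    = [1, 2, 3, 4, 5]       # signal levels C (arbitrary units)
n_sick    = [1, 3, 8, 12, 16]     # N_S at each level
n_healthy = [14, 12, 8, 4, 2]     # N_H at each level

def roc_points(levels, n_sick, n_healthy):
    """Sweep SL over the signal levels and count, per Eq. 11.10, the
    fraction of individuals in each class with C > SL."""
    total_s, total_h = sum(n_sick), sum(n_healthy)
    points = [(1.0, 1.0)]          # SL below all levels: everyone positive
    for sl in levels:
        tpr = sum(ns for c, ns in zip(levels, n_sick) if c > sl) / total_s
        fpr = sum(nh for c, nh in zip(levels, n_healthy) if c > sl) / total_h
        points.append((fpr, tpr))
    return sorted(points)          # ascending FPR, ready for integration

def auc(points):
    """Trapezoidal area under the ROC curve."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))

pts = roc_points(levels, n_sick, n_healthy)
print(round(auc(pts), 3))  # → 0.852
```

With these invented occupancies the sick group clusters at high signal levels, so the signal discriminates fairly well (AUC ≈ 0.85; 0.5 would mean no discrimination).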

The resulting ROC curves – illustrated for two different signals in Fig. 11.9 (example provided by Melin & Pendrill [61]) – are obtained by plotting Sensitivity TPR versus False positive rate FPR over the range of each signal level (C) of interest. The ideal ROC curve, in terms of greatest discriminatory power, is the one lying furthest towards the upper left-hand corner [81].

Each ROC curve is obtained by evaluating Eq. 11.10 while varying the signal specification limit, SL, from low signal levels (C) – at the top-right of each ROC curve, where all decisions are positive – to increasingly higher signal values, causing both the sensitivity and the false positive rate to decrease, reaching minimum values at the bottom-left of each curve. The discriminatory power of a signal is revealed by how well it maintains high sensitivity as signal levels increase. A traditional figure of merit is the Area Under the Curve (AUC), found by integrating the ROC curve across the range of FPR [81]. (Such a figure of merit may, however, suffer from the uncorrected effects of counted-fraction ordinality – see the section PCC including physical, psychological and social integrity.)

Explaining Clinical Diagnoses

To approach as closely as possible the diagnostic criteria actually used by clinicians, the signal chosen for ROC curve analysis should be the most representative, comprehensive and valid one available.

In reflecting on what makes certain signals better than others for correct diagnosis, as revealed in Fig. 11.9, one can conjecture that a composite memory measure is arguably more comprehensive, and perhaps “closer” to the clinician’s assessment method, than, say, biomarker concentration when making a clinical classification between healthy and unwell cohort individuals. Indeed, a diagnosis of dementia due to AD is not based solely on either memory ability or biomarker concentrations; rather, it comprises criteria such as the ability to function at work or in usual activities, together with mental status examination or neuropsychological testing and biomarkers.

Arguably, a composite memory measure (naturally Rasch-transformed, section PCC including physical, psychological and social integrity) – preferably with a fully developed construct specification including all significant explanatory variables [61] – is likely to be a better choice than, say, an individual biomarker, particularly in the context of PCC.

11.4.4 Many-Body Modelling and Conclusions

In this chapter we have considered measurement models where either the patient (Fig. 11.1b) or the clinician (Fig. 11.6) acts alone when assuring the quality of PCOs, from start to finish of the quality loop.

Bringing the patient “into a partnership with health care professionals” as a key aspect of PCC (section 11.1.2 Quality assurance in person-centered care. Design of experiments) opens up a new field of quality-assured measurement suitable for future studies. To model the patient–clinician partnership, one could envisage extending the “single-body” basic Rasch model to a “many-body” version. Such modelling has indeed already been done for situations where two people of different abilities jointly tackle a task of a given difficulty, such as in the game of chess; Brinkhuis and Maris [15] consider how the famous Elo chess rating system can be applied in education to student monitoring. Such models can be extended to any number of mutually interacting people, for instance if one wishes to study social interactions in an area called “social physics” [79]. Such work would extend our study of how informational entropy can be used to model the ability (or, more generally, attitude) of an individual [61] or the ability of an organization [99] to many-body situations, such as the patient–clinician partnership in PCC.
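The Elo system mentioned above updates the two interacting ratings after each pairwise encounter. A minimal sketch of such a “two-body” update, with an illustrative K-factor:

```python
def elo_update(r_a, r_b, score_a, k=32):
    """One Elo update for a pairwise encounter between ratings r_a and r_b:
    score_a is 1 if A 'wins' (e.g., solves the item), 0 if not, 0.5 for a draw."""
    expected_a = 1 / (1 + 10 ** ((r_b - r_a) / 400))  # A's expected score
    delta = k * (score_a - expected_a)
    return r_a + delta, r_b - delta   # ratings move in opposite directions

# A lower-rated player beating a higher-rated one gains the larger share of points:
new_a, new_b = elo_update(1400, 1600, score_a=1)
```

Because the update is zero-sum (the two ratings change by equal and opposite amounts), the system tracks the relative abilities of the interacting parties over repeated encounters – the feature that makes it attractive for monitoring two-body situations such as a person repeatedly attempting tasks of estimated difficulty.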