1.1 Introduction

The assessment of the value of scientific evidence involves subtle forensic, statistical, and computational aspects that can represent an obstacle in practical applications. The purpose of this book is to provide theory, examples, and elements of R code to illustrate a variety of topics pertaining to value of evidence assessments using Bayes factors in a decision-theoretic perspective.

The structure of this book is as follows. This chapter starts by presenting an overview of the role of statistics in forensic science, with an emphasis on the Bayesian perspective and the role of the Bayes factor for logical inference and decision. Next, the chapter addresses three general topics that forensic scientists commonly encounter: model choice, evaluation, and investigation. For each of these themes, Bayes factors will be developed and discussed using practical examples. Particular attention will be devoted to the distinction between feature- and score-based Bayes factors, typically used in evaluative settings. This chapter also provides theoretical background analysts might need during data analysis, including elements of forensic interpretation, computational methods, decision theory, prior elicitation, and sensitivity analysis.

Chapter 2 addresses the problem of discrimination between competing propositions regarding target features of a population of interest (i.e., parameters). Examples include applications involving counting processes and propositions referring to the proportion of items of forensic interest (e.g., items with illegal content) or an unknown quantity. Attention will be drawn to background elements that may affect counting processes or continuous measurements and a decisional approach to this problem.

Chapter 3 addresses the problem of evaluation of scientific evidence in the form of discrete, continuous, and continuous multivariate data. The latter may present a complex dependence structure that will be handled by means of multilevel models.

Chapter 4 focuses on the problem of investigation, using examples involving either univariate or multivariate data.

For each topic covered in the book, examples are accompanied by R code, allowing readers to reproduce the computations and adapt the sample code to their own problems. The end of each chapter presents an outline of the principal R functions used throughout the respective chapter. While some functions can be easily reproduced, others are more elaborate, and copying their R code would be tedious. These functions, as well as datasets, are available as supplementary materials on the book’s website (http://link.springer.com/).

1.2 Statistics in Forensic Science

Forensic science uses scientific principles and technical methods to help with the use of evidence in legal proceedings of criminal, civil, or administrative nature. To assist members of the judiciary in their inquiries regarding the existence or past occurrence of events of legal interest, forensic scientists examine recovered traces, objects, and materials related to persons of interest. This may involve, for example, the analysis of the nature of body fluids and various other items such as textile fibers, glass and paint fragments, handwriting, digital device data, as well as the classification of such items and data into various categories.

More generally, forensic science takes a major interest in both investigative proceedings and evaluative processes at trial. This involves the examination of persons and objects, as well as the vestiges of actions. Forensic scientists also help with reconstructing past events. Thus, incomplete knowledge and, hence, uncertainty are key challenges that all participants in the legal process must deal with. The standard approach to cope with uncertainty is the structured collection and sound use of data. Typically, data result from the analysis and comparative examination of evidential material (i.e., biological traces, toxic substances, documents, crime scene findings, imaging data, etc.), followed by an assessment of the probative value of scientific results within the context of the event under investigation and in the light of the task-relevant information.

However, despite its potential to support legal evidence and proof processes, forensic science has also been found to be a contributing factor to miscarriages of justice (Cole, 2014). Furthermore, over the last decade, reviews by expert panels have exposed several areas of forensic science practice as insufficiently reliable (e.g., PCAST, 2016), and courts across many jurisdictions have insisted on the need to probe and demonstrate the empirical foundations of forensic science disciplines.

Scientists currently address these challenges by directing research not only toward more studies involving experiments under controlled conditions but also toward formal frameworks for value of evidence assessment that can cope with scientific evidence independent of its nature and type. Central to this development is a convergence to the Bayesian perspective, which is well suited to help forensic scientists assess the probative value of observations that, typically, do not arise under only one given hypothesis or proposition. Bayesian thinking can cope with situations in which one holds varying degrees of belief about competing hypotheses and one considers that those hypotheses may differ in their capacity to account for one’s observations and findings. As noted by Cornfield (1967, p. 34),

Bayes’ theorem is important because it provides an explication for this process of consistent choice between hypotheses on the basis of observations and for quantitative characterization of their respective uncertainties.

In forensic science, the Bayes factor (BF)—a central element in Bayesian analysis—has come to play an extremely important role. It represents a key statistic for assessing the value of scientific findings and is therefore widely covered in the forensic literature (e.g., Aitken et al., 2021; Buckleton et al., 2016). It allows scientists to assess case-related observations or measurements in the light of competing propositions presented by parties at trial. In essence, the Bayes factor provides a measure of the degree to which a scientific finding is capable of discriminating between the competing propositions of interest.

The choice of the Bayes factor to assess the value of outcomes of laboratory examinations and analyses results from the requirement to comply with several practical precepts of coherent thinking and decision-making. The Bayes factor exhibits the desirable properties of balance, transparency, robustness, and logic. In addition, it is a flexible measure, acknowledged throughout forensic science, law, and statistics, because it can deal with any type of evidence (e.g., Evett, 1996; Jackson, 2000; Robertson & Vignaux, 1993; Robertson et al., 2016; Good, 1950; Kass & Raftery, 1995; Lindley, 1977; Taroni et al., 2010).

In forensic science, the Bayes factor is more commonly called a likelihood ratio, even though this may create confusion because the two terms represent distinct concepts, and the Bayes factor does not always simplify to a likelihood ratio. This will be explained later in Sect. 1.4. Generally, the use of the Bayes factor is now well established in both theory and practice, though some branches of forensic science are more advanced in Bayes factor analyses than others. A general overview is presented by the Royal Statistical Society’s Section Committee on Statistics and Law (e.g., Aitken et al., 2010) in a series of practitioner guides for judges, forensic scientists, and expert witnesses.

While the Bayes factor represents a coherent metric for value of evidence assessment in evaluative reporting (i.e., when a person of interest is available for comparison purposes), it is important to mention that it can also be used in investigative contexts. A case is investigative when there is no person or object available for comparison, and examinations concentrate primarily on helping to draw inferences about general features (e.g., sex, right-/left-handedness, etc.) related to the source of a recovered stain, mark, or trace. More generally, the Bayes factor can be used for two main purposes in forensic science:

  • The first purpose is to assign a value to the result of a comparison between an item of unknown source and an item from a known source. This refers to the evaluative mode in which forensic scientists operate. Evaluating a scientific finding thus means that the scientist provides an expression of the value of the observation in support—which may be positive, negative, or neutral—of a proposition of interest in legal proceedings, compared to a relevant alternative proposition.

  • The second purpose is to provide information in investigative proceedings. Here, scientists operate in what is called investigative mode. They try to help answer questions such as “what happened?” and “what (material) is this?” (Jackson et al., 2006). The scientist is said to be “event focused” and uses the findings to generate hypotheses and suggestions for explanations of observations, in order to give guidance to investigators or litigants.

To illustrate these concepts, imagine a case involving a questioned document and handwriting. In cases of anonymous letter-writing, it regularly occurs that, at least initially, no suspected writer is available. In such a case, there will be no possibility for jointly evaluating characteristics observed on a questioned document and features on reference (known or control) material from a person of interest, as would be the case in an evaluative context. However, this does not mean that measurements made only on the questioned document, without comparison to reference material, could not be informative for investigative purposes. For example, features extracted from the handwriting of unknown source may be evaluated with respect to more general propositions such as “the questioned document (e.g., a ransom note) has been written by a man (woman)” or “the questioned document has been written by a right- (left)-handed person.” Helping to discriminate between such propositions contributes to reducing the pool of potential writers in an investigation.

As a metric to assess the value of findings in a forensic context, the Bayes factor allows practitioners to offer a quantitative expression that they can convey within a more general reasoning framework that conforms to the logic of Bayesian thinking. From the scientist’s point of view, the contribution to inference is perfectly symmetric. That is, the findings may support either of the two competing propositions, each relative to the respective alternative. This strengthens the scientist’s role as a balanced expert in the legal process.

1.3 Bayesian Thinking and the Value of Evidence

Bayesian philosophy is named after Reverend Thomas Bayes and is based on an interpretation of probability as personal degree of belief (de Finetti, 1989). In Bayesian theory, all uncertainties in a problem must necessarily be described by probabilities. Probability is intended as one’s conditional measure of uncertainty associated with the evidence, the available information, and all the underlying assumptions. In this book, we will use the term evidence in the general sense of a given piece of information or data. This includes, but is not restricted to, the idea of evidence used in legal proceedings. The term evidence is used here in a broad sense as a synonym for other terms such as “finding” or “outcome.” According to Good (1988), evidence may be defined as data that makes one alter one’s beliefs about how the world is working. The word finding, in turn, is used in this book to designate the result of a forensic examination or analysis. Findings are measurements in a quantitative form, discrete or continuous. Examples of discrete quantitative results are counts of glass fragments or gunshot residues. Examples of continuous results are measurements of physical quantities such as length, weight, refractive index, and summaries of complex comparisons in the form of similarity scores. For a formal definition of the term findings, see also the ENFSI Guideline for Evaluative Reporting in Forensic Science (Willis et al., 2015).

Starting from prior probabilities , representing subjective degrees of belief about propositions of interest, the Bayesian paradigm allows one to rationally revise such beliefs and compute posterior probabilities, draw inferences about propositions, and make decisions (Sprenger, 2016). For example, when new information becomes available, it may be necessary to assess how this information ought to affect propositions regarding the involvement of a person of interest in particular alleged activities. Likewise, physicians need to structure their thought processes when performing medical diagnosis. In general, the question is how to update one’s personal beliefs regarding uncertain events when one receives new information.

Suppose that the events H 1, …, H n form a partition, and denote by \(\Pr (H_i\mid I)\) the probability that is associated with H i, i = 1, …, n, given relevant background information I. This probability is called a prior probability. Furthermore, consider an event or quantity E, whose probability can be expressed by means of the law of total probability as

$$\displaystyle \begin{aligned} \begin{array}{rcl}{} \Pr(E\mid I)=\sum_{j}\Pr(E \mid H_j,I)\Pr(H_j\mid I). \end{array} \end{aligned} $$
(1.1)

The ENFSI Guideline for Evaluative Reporting in Forensic Science (Willis et al., 2015, at p. 21) regards conditioning information as the essential ingredient of probability assignment, since all probabilities are conditional. In forensic evaluation, it is important not to focus on all possible information, but only on the information that is relevant to the forensic task at hand. Disciplined forensic reporting requires scientists to make clear their perception of the conditioning information at the time they conduct their evaluation. Conditioning information is sometimes known as the framework of circumstances (or background information). Much of the non-scientific information will not have a bearing on the value of scientific findings, but it is essential to recognize those aspects that do. Examples of relevant information may include the ethnic origin of the perpetrator (but not that of the suspect) and the nature of garments and surfaces involved in alleged transfer events. More generally, conditioning information may also include data and domain knowledge that the expert uses to assign probabilities. The conditioning on (task-) relevant information I is important because it clarifies that probability assignments are personal and depend on the knowledge of the person conducting the evaluation.

Bayes rule (or theorem) is a straightforward application of the conditionalization principle and the partition formula (1.1). It allows one to compute the so-called posterior probability \(\Pr (H_i\mid E,I)\) as

$$\displaystyle \begin{aligned} \begin{array}{rcl}{} \Pr(H_i \mid E,I)=\frac{\Pr(E \mid H_i,I)\Pr(H_i\mid I)}{\Pr(E\mid I)}=\frac{\Pr(E \mid H_i,I)\Pr(H_i\mid I)}{\sum_j \Pr(E \mid H_j,I)\Pr(H_j\mid I)}, \end{array} \end{aligned} $$

which emphasizes that certain knowledge of E modifies the probability of H i.Footnote 4 Note that prior and posterior probabilities are only relative to the new finding E. The posterior probability will become again a prior probability when additional findings become available. Lindley (2000, p. 301) expressed this as follows: “Today’s posterior is tomorrow’s prior.” Bayesian statistics is the sequential application of Bayes rule to all situations that involve observed and missing data, unknown quantities (e.g., events, propositions, population parameters), or unobserved data (e.g., future observations).
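
For illustration, Bayes rule can be applied in R to a two-element partition; the numerical values below are purely hypothetical.

# Bayes rule on a partition (hypothetical numbers)
prior <- c(H1 = 0.25, H2 = 0.75)        # Pr(H_i | I)
lik   <- c(H1 = 0.80, H2 = 0.10)        # Pr(E | H_i, I)
pr_E  <- sum(lik * prior)               # law of total probability, Eq. (1.1)
posterior <- lik * prior / pr_E         # Bayes rule
posterior
#        H1        H2
# 0.7272727 0.2727273

Applied sequentially, the posterior obtained here would serve as the prior for the next piece of information, in line with Lindley’s remark above.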

Participants in the legal process are typically concerned with the problem of comparing competing propositions about a contested event. A typical example involving trace evidence is “the recovered glass fragments come from the broken window” versus “the recovered glass fragments come from an unknown source.” When measurements on various items (i.e., glass fragments) are available, it may be necessary to quantitatively evaluate these findings with respect to selected propositions of interest. According to the Bayesian methodology developed by Jeffreys (1961), this involves the introduction of a statistical model to describe the probability of the available measurements according to different hypotheses (propositions or models). The posterior probability of each hypothesis is then computed via a direct application of Bayes theorem. Following Jeffreys’ criterion for comparing hypotheses, a hypothesis is accepted or rejected on the basis of its posterior probability being greater or smaller than that of the alternative proposition. Note that the acceptance or rejection of a proposition is not meant as an assertion of its truth or falsity, only that its probability is greater or smaller than that of the respective alternative proposition (Press, 2003).

The primary element in Bayesian methodology for comparing propositions is the Bayes factor (BF for short). It provides a numerical representation of the impact of findings on propositions of interest. In other words, the Bayes factor quantifies the degree to which observed measurements discriminate between competing propositions. The Bayes factor is the ingredient by which the prior odds in favor of a proposition are multiplied in virtue of the knowledge of the findings (Good, 1958):

$$\displaystyle \begin{aligned} \begin{array}{rcl} \mbox{Posterior odds}=\mbox{BF}\times\mbox{Prior odds}. \end{array} \end{aligned} $$

Broadly speaking, prior and posterior odds are the ratios of probabilities of the hypotheses of interest before and after acquiring new findings, respectively. The value of experimental outcomes is measured by how much more probable they make one hypothesis relative to the respective alternative hypothesis, compared to the situation before considering the experimental findings.

A formal definition of the Bayes factor is given in Sect. 1.4, along with a discussion about its interpretation as measure of the value of the evidence. Practical examples in Sects. 1.5 and 1.6 and further developments in Chaps. 3 and 4 will illustrate the use of the Bayes factor for evaluative and investigative purposes.

1.4 Bayes Factor for Model Choice

Consider an unknown quantity X, referring to a quantity or measurement of interest such as the number of ecstasy pills in a sample drawn from a large seizure of pills, the elemental chemical composition of glass fragments, or a feature (e.g., the length) of a handwritten character. Furthermore, suppose that f(x∣θ) is a suitable probability model for X, where the unknown parameter θ belongs to the parameter space Θ. Suppose also that the parameter space consists of two non-overlapping sets Θ 1 and Θ 2 such that Θ = Θ 1 ∪ Θ 2. A question that may be of interest is whether the parameter θ belongs to Θ 1 or to Θ 2, that is, to compare the hypothesis

$$\displaystyle \begin{aligned} \begin{array}{rcl} H_1: \theta\in\varTheta_1, \end{array} \end{aligned} $$

against the alternative hypothesis

$$\displaystyle \begin{aligned} \begin{array}{rcl} H_2: \theta\in\varTheta_2. \end{array} \end{aligned} $$

Note that H 1 is usually called the null hypothesis. Under a classical (frequentist) approach, the distinction between null and alternative hypotheses is very important. Users must be aware that when performing significance testing, competing hypotheses are not equivalent and there is, in fact, an asymmetry associated with them. One collects data (or evidence) against the null hypothesis before it is rejected, but the acceptance of the null hypothesis is not an assertion about its truthfulness. It merely means that there is little evidence against it. As will be shown, under the Bayesian paradigm, this does not represent an issue.

A hypothesis H i is called simple if there is only one possible value for θ, say Θ i = {θ i}. A hypothesis is called composite (see, e.g., Example 1.1) if there is more than one possible value.

Let \(\pi _1=\Pr (H_1)=\Pr (\theta \in \varTheta _1)\) and \(\pi _2=\Pr (H_2)=\Pr (\theta \in \varTheta _2)\) denote the prior probabilities for the competing composite hypotheses H 1 and H 2. Note that, for the sake of simplicity, the letter I denoting background information is omitted here. The ratio of the prior probabilities π 1/π 2 is called the prior odds of H 1 to H 2. The prior odds indicate whether hypothesis H 1 is more or less probable than hypothesis H 2 (prior odds being greater or smaller than 1) or whether the hypotheses are (almost) equally probable, i.e., the prior odds are (close to) 1. Suppose observational data x are available that do not provide conclusive evidence about the propositions of interest but will allow one to update prior beliefs using Bayes theorem. Let us denote by \(f_{H_i}(x)\) the marginal probability of the data under proposition H i, that is,

$$\displaystyle \begin{aligned} \begin{array}{rcl}{} f_{H_i}(x)= \int_{\varTheta_i}f(x\mid\theta)\pi_{H_i}(\theta) d\theta, \end{array} \end{aligned} $$
(1.2)

where \(\pi _{H_i}(\theta )\) denotes the prior probability density of θ for θ ∈ Θ i. The marginal probability is also called the predictive probability, that is, the probability of observing the actual data, assessed before they become available. Kass and Raftery (1995) refer to it as the marginal likelihood: the probability of the observations averaged across the prior distribution over the parameter space Θ. Note that the parameter space Θ can be either continuous or discrete. In the latter case, the integral in (1.2) must be replaced by a sum, and the marginal probability of the evidence (i.e., data x) becomes

$$\displaystyle \begin{aligned} \begin{array}{rcl}{} f_{H_i}(x)= \sum_{\theta\in\varTheta_i}f(x\mid\theta)\Pr(\theta\mid H_i) . \end{array} \end{aligned} $$

The Bayes factor for comparing H 1 and H 2 is defined as the ratio of the marginal probabilities \(f_{H_i}(x)\) under the competing hypotheses, that is,

$$\displaystyle \begin{aligned} \begin{array}{rcl}{} \mathrm{BF }=\frac{f_{H_1}(x)}{f_{H_2}(x)} . \end{array} \end{aligned} $$
(1.3)

Let \(\alpha _1=\Pr (H_1\mid x)=\Pr (\theta \in \varTheta _1\mid x)\) and \(\alpha _2=\Pr (H_2\mid x)=\Pr (\theta \in \varTheta _2\mid x)\) denote the posterior probabilities for the competing hypotheses. The ratio of the posterior probabilities α 1/α 2 is called the posterior odds of H 1 to H 2. Recalling the odds form of Bayes theorem, one can express the Bayes factor for comparing hypothesis H 1 against hypothesis H 2 as the factor by which the prior odds of H 1 to H 2 are multiplied in virtue of the knowledge of the data to obtain the posterior odds, that is,

$$\displaystyle \begin{aligned} \begin{array}{rcl} \alpha_1/\alpha_2=\mathrm{BF }\times\pi_1/\pi_2. \end{array} \end{aligned} $$

The Bayes factor measures the change produced by the new information (or data) in the odds when going from the prior to the posterior distribution in favor of one proposition as opposed to a given alternative. For this reason, it is not uncommon to find the BF defined as the ratio of the posterior odds in favor of H 1 to the prior odds in favor of H 1, that is,

$$\displaystyle \begin{aligned} \begin{array}{rcl}{} \mathrm{BF }=\frac{\alpha_1/\alpha_2}{\pi_1/\pi_2}. \end{array} \end{aligned} $$
(1.4)

One of the attractive features of using a Bayes factor to quantify the value of the acquired information is that it does not depend on the prior probabilities of the competing hypotheses. However, this bears the potential for misunderstanding. The Bayes factor is sometimes interpreted as, for example, the odds provided by the data alone, for H 1 to H 2: this is conceptually incorrect. Though cases may be found where the Bayes factor can be expressed as a ratio of likelihoods and correctly be interpreted as the “summary of the evidence provided by the data in favor of one scientific theory (…) as opposed to another” (Kass & Raftery, 1995, at p. 777), this does not hold in general. The Bayes factor will generally depend on prior assumptions. It is necessary, thus, to clarify the meaning of “prior assumptions” because confusion may arise between, on the one hand, the notion of prior probability about model parameters (θ ∈ Θ i) and, on the other hand, prior probabilities of propositions (H i).

To clarify this distinction, consider the comparison of a simple hypothesis H 1 : θ = θ 1 against a simple alternative hypothesis H 2 : θ = θ 2. The prior probabilities of these hypotheses are expressed as \(\pi _1=\Pr (\theta =\theta _1)\) and \(\pi _2=\Pr (\theta =\theta _2)\). The posterior probabilities α i in the light of prior probabilities π i (i = 1, 2) and observed data x can be easily computed by means of a direct application of Bayes theorem:

$$\displaystyle \begin{aligned} \begin{array}{rcl}{} \alpha_i=\Pr(H_i\mid x)=\Pr(\theta=\theta_i\mid x)=\frac{f(x\mid\theta_i)\pi_i}{\sum_{j=1,2} f(x\mid\theta_j)\pi_j}. \end{array} \end{aligned} $$
(1.5)

The ratio of the posterior probabilities α 1/α 2 obtained from computing (1.5) for i = 1, 2 simplifies to the product of the likelihood ratio times the ratio of the prior probabilities, that is,

$$\displaystyle \begin{aligned} \begin{array}{rcl} \frac{\alpha_1}{\alpha_2}=\frac{f(x\mid \theta_1)}{f(x\mid \theta_2)}\times \frac{\pi_1}{\pi_2}. \end{array} \end{aligned} $$

Recalling (1.4), it is readily seen that the Bayes factor in this simple case is the likelihood ratio of H 1 to H 2,

$$\displaystyle \begin{aligned} \begin{array}{rcl}{} \mathrm{BF }=\frac{f(x\mid \theta_1)}{f(x\mid \theta_2)}\times \frac{\pi_1}{\pi_2}\times\frac{\pi_2}{\pi_1}=\frac{f(x\mid \theta_1)}{f(x\mid \theta_2)}, \end{array} \end{aligned} $$
(1.6)

and it is then correct to interpret this as “the odds provided by the data alone for H 1 to H 2.”
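
As a minimal illustration in R, suppose (purely hypothetically) that x is a count modeled by a Poisson distribution with known mean θ 1 under H 1 and θ 2 under H 2; the Bayes factor is then the likelihood ratio of (1.6).

# Likelihood ratio for simple vs. simple hypotheses, Eq. (1.6)
x       <- 5                            # hypothetical observed count
theta_1 <- 4                            # mean under H1 (hypothetical)
theta_2 <- 1                            # mean under H2 (hypothetical)
BF <- dpois(x, theta_1) / dpois(x, theta_2)
BF
# [1] 50.98196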

However, the comparison of simple versus simple hypotheses is a particular case among many others. Practitioners may face the more general situation where at least one of the hypotheses is composite, that is, the parameter of interest may take one of a range of different values (e.g., Θ i = {θ 1, …, θ k}), or infinitely many, as is the case when θ is continuous. In the case of composite hypotheses, the prior probabilities π i for i = 1, 2 will take the following form:

$$\displaystyle \begin{aligned} \begin{array}{rcl}{} \pi_i=\Pr(\theta\in\varTheta_i)=\left\{ \begin{array}{lll} \sum_{\theta\in\varTheta_i}\Pr(\theta) && \mbox{for}\;\theta\;\mbox{discrete}\\ &\\ \int_{\varTheta_i}\pi(\theta)d\theta && \mbox{for}\;\theta\;\mbox{continuous}, \end{array} \right. \end{array} \end{aligned} $$
(1.7)

where π(θ) is the prior probability density for θ ∈ Θ. The posterior probabilities α i are therefore computed as

$$\displaystyle \begin{aligned} \begin{array}{rcl}{} \alpha_i=\Pr(\theta\in\varTheta_i\mid x)=\left\{ \begin{array}{lll} \frac{\sum_{\theta\in\varTheta_i}f(x\mid\theta)\Pr(\theta)}{\sum_{\theta\in\varTheta}f(x\mid\theta)\Pr(\theta) } && \mbox{for}\;\theta\;\mbox{discrete}\\ &\\ \frac{\int_{\varTheta_i}f(x\mid\theta)\pi(\theta)d\theta}{\int_{\varTheta}f(x\mid\theta)\pi(\theta)d\theta } && \mbox{for}\;\theta\;\mbox{continuous}, \end{array} \right. \end{array} \end{aligned} $$
(1.8)

and the posterior odds will be

$$\displaystyle \begin{aligned} \begin{array}{rcl}{} \frac{\alpha_1}{\alpha_2}=\left\{ \begin{array}{ll} \frac{\sum_{\theta\in\varTheta_1}f(x\mid\theta)\Pr(\theta)}{\sum_{\theta\in\varTheta_2}f(x\mid\theta)\Pr(\theta)} & \mbox{for}\;\theta\;\mbox{discrete}\\ &\\ \frac{\int_{\varTheta_1}f(x\mid\theta)\pi(\theta)d\theta}{\int_{\varTheta_2}f(x\mid\theta)\pi(\theta)d\theta} & \mbox{for}\;\theta\;\mbox{continuous}. \end{array} \right. \end{array} \end{aligned} $$
(1.9)

Following (1.4), the Bayes factor can be reconstructed as follows:

$$\displaystyle \begin{aligned} \begin{array}{rcl}{} \mathrm{BF }=\left\{ \begin{array}{ll} \frac{\sum_{\theta\in\varTheta_1}f(x\mid\theta)\Pr(\theta)}{\sum_{\theta\in\varTheta_2}f(x\mid\theta)\Pr(\theta)}/ \frac{\pi_1}{\pi_2} & \mbox{for}\;\theta\;\mbox{discrete}\\ &\\ \frac{\int_{\varTheta_1}f(x\mid\theta)\pi(\theta)d\theta}{\int_{\varTheta_2}f(x\mid\theta)\pi(\theta)d\theta}/ \frac{\pi_1}{\pi_2} & \mbox{for}\;\theta\;\mbox{continuous}, \end{array} \right. \end{array} \end{aligned} $$
(1.10)

where the π i are computed as in (1.7). It is seen that the Bayes factor can no longer be expressed as a likelihood ratio as in the case of comparing simple versus simple hypotheses. We will show this for the case where θ is continuous.

Start with the prior probability density π(θ) on Θ, and divide it by the probability π i of the hypothesis H i to obtain the restriction of the prior probability density π(θ) on Θ i, that is,

$$\displaystyle \begin{aligned} \begin{array}{rcl} \pi_{H_i}(\theta)=\frac{\pi(\theta)}{\pi_i} \quad \mbox{for}\; \theta\in\varTheta_i. \end{array} \end{aligned} $$

The probability density \(\pi _{H_i}(\theta )\) simply describes how the prior probability spreads over the hypothesis H i. The prior probability density π(θ) can thus be rewritten in the following form:

$$\displaystyle \begin{aligned} \begin{array}{rcl} \pi(\theta)=\left\{ \begin{array}{ll} \pi_1\pi_{H_1}(\theta) & \mbox{for}\; \theta\in\varTheta_1,\\ & \\ \pi_2\pi_{H_2}(\theta) & \mbox{for}\; \theta\in\varTheta_2. \end{array}\right. \end{array} \end{aligned} $$

Therefore, the posterior odds in (1.9) for the continuous case can be rewritten as

$$\displaystyle \begin{aligned} \begin{array}{rcl} \frac{\alpha_1}{\alpha_2}= \frac{\pi_1\int_{\varTheta_1}f(x\mid\theta)\pi_{H_1}(\theta)d\theta}{\pi_2\int_{\varTheta_2}f(x\mid\theta)\pi_{H_2}(\theta)d\theta}. \end{array} \end{aligned} $$
(1.11)

Recalling (1.4), the Bayes factor in (1.10) will take the form of integrated likelihoods under the hypotheses of interest, that is,

$$\displaystyle \begin{aligned} \begin{array}{rcl}{} \mathrm{BF }= \frac{\int_{\varTheta_1}f(x\mid\theta)\pi_{H_1}(\theta)d\theta}{\int_{\varTheta_2}f(x\mid\theta)\pi_{H_2}(\theta)d\theta}. \end{array} \end{aligned} $$
(1.12)

The reader can verify that the two expressions in (1.3) and (1.12) are equivalent. Prior evaluations enter the Bayes factor through the weights \(\pi _{H_1}(\theta )\) and \(\pi _{H_2}(\theta )\). The Bayes factor depends on how the prior mass is spread over the two hypotheses (Berger, 1985). It is also worth noting that whenever hypotheses are unidirectional (e.g., when comparing H 1 : θ ≤ θ 0 against H 2 : θ > θ 0), the choice of a prior probability density π(θ) over Θ = Θ 1 ∪ Θ 2 (with Θ 1 = [0, θ 0] and Θ 2 = (θ 0, 1]) is equivalent to the expression of a prior probability for the competing hypotheses. Conversely, whenever hypotheses are bidirectional (e.g., when comparing H 1 : θ = θ 0 against H 2 : θ ≠ θ 0), one cannot choose a prior probability density π(θ) over the entire parameter space Θ, as this would amount to placing a probability equal to 0 on the hypothesis H 1 : θ = θ 0. The prior probability distribution over θ must, in this case, be a mixture of a discrete component that assigns a positive mass \(\pi _1=\Pr (\theta =\theta _0)\) to H 1 and a continuous component that spreads the remaining mass π 2 = 1 − π 1 over Θ 2 according to the probability density \(\pi _{H_2}(\theta )\). The posterior probability α 1 can then be computed as in (1.8), where Θ 1 = {θ 0},

$$\displaystyle \begin{aligned} \begin{array}{rcl}{} \alpha_1=\Pr(H_1\mid x)=\frac{\pi_1 f(x\mid\theta_0)}{\pi_1 f(x\mid\theta_0)+\pi_2\int_{\varTheta_2}f(x\mid\theta)\pi_{H_2}(\theta)d\theta}. \end{array} \end{aligned} $$
(1.13)

Analogously, the posterior probability α 2 may be computed, and the Bayes factor is

$$\displaystyle \begin{aligned} \begin{array}{rcl}{} \mathrm{BF }=\frac{f(x\mid\theta_0)}{\int_{\varTheta_2}f(x\mid\theta)\pi_{H_2}(\theta)d\theta}. \end{array} \end{aligned} $$
(1.14)

It can be observed that the Bayes factor in (1.14) does not depend on the prior probabilities of the competing hypotheses, which can vary considerably among recipients of expert information. Any such recipient can, starting from their own prior probabilities, use the Bayes factor to obtain posterior probabilities in a straightforward manner. Consider, for the sake of illustration, the posterior probability of hypothesis H 1 in (1.13). A simple manipulation allows one to obtain

$$\displaystyle \begin{aligned} \begin{array}{rcl} \alpha_1=\left[1+\frac{\pi_2}{\pi_1}\frac{1}{\mathrm{BF }}\right]^{-1}=\frac{\mathrm{BF }}{\mathrm{BF }+\pi_2/\pi_1}. \end{array} \end{aligned} $$

In summary, the Bayes factor thus measures the change in the odds in favor of one hypothesis, as compared to a given alternative hypothesis, when going from the prior to the posterior distribution. This means that a Bayes factor larger than 1 indicates that the data support hypothesis H 1 compared to H 2. However, the Bayes factor does not indicate whether H 1 is more probable than the opposing hypothesis H 2; it only indicates that H 1 has become more probable than it was before observing the data (Lavine & Schervish, 1999).
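
The following R sketch illustrates (1.7), (1.10), and (1.12) by numerical integration for a purely hypothetical setting: x successes in n binomial trials, a uniform Beta(1, 1) prior on θ, and unidirectional hypotheses H 1 : θ > 0.5 versus H 2 : θ ≤ 0.5.

# Composite-hypothesis BF by numerical integration (hypothetical example)
x <- 8; n <- 10                          # hypothetical binomial data
a <- 1; b <- 1                           # hypothetical Beta prior hyperparameters
t0 <- 0.5                                # threshold separating Theta_1 and Theta_2

# prior probabilities of the hypotheses, Eq. (1.7)
pi1 <- integrate(function(t) dbeta(t, a, b), t0, 1)$value
pi2 <- 1 - pi1

# integrated likelihoods with the restricted priors pi_Hi(theta), Eq. (1.12)
num <- integrate(function(t) dbinom(x, n, t) * dbeta(t, a, b) / pi1, t0, 1)$value
den <- integrate(function(t) dbinom(x, n, t) * dbeta(t, a, b) / pi2, 0, t0)$value
BF <- num / den
BF
# approximately 29.57

Because the Beta(1, 1) prior places equal mass on both sides of 0.5, the prior odds equal 1 and the BF here coincides with the posterior odds.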

Example 1.1 (Alcohol Concentration in Blood)

A person is stopped because of suspicion of driving under the influence of alcohol. Blood taken from that person is submitted to a forensic laboratory to investigate whether the quantity of alcohol in blood, θ, is greater than a legal threshold of, say, 0.5 g/kg. Thus, the hypotheses of interest can be defined as H 1 : θ > 0.5 versus H 2 : θ ≤ 0.5. Suppose that a prior probability density π(θ) is given for θ and that the prior probabilities of H 1 and H 2 in (1.7) are π 1 = 0.05 and π 2 = 0.95, corresponding to prior odds approximately equal to 0.0526. These values suggest that, based on the circumstances, and before considering the results of blood analyses, the hypothesis H 1 is believed to be much less probable than the alternative hypothesis. Suppose next that the posterior probabilities, after taking into account the laboratory measurements, are computed as in (1.8). The results are α 1 = 0.24 and α 2 = 0.76. Thus, the posterior odds are approximately equal to 0.3158. The ratio of the posterior odds to the prior odds gives a BF equal to 6. This result represents limited evidence in support of the hypothesis that the alcohol level in blood is greater than the legal threshold, compared to the alternative hypothesis. Still, the posterior probability of hypothesis H 1 is low: the BF only renders the hypothesis H 1 slightly more probable than it was before observing the measurements made in the laboratory. This example will be further developed in Chap. 2.
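
The arithmetic of this example can be reproduced in a few lines of R, using the BF as the ratio of posterior to prior odds, Eq. (1.4), and recovering the posterior probability of H 1 from the BF and the prior odds.

prior_odds     <- 0.05 / 0.95           # approx. 0.0526
posterior_odds <- 0.24 / 0.76           # approx. 0.3158
BF <- posterior_odds / prior_odds       # Eq. (1.4)
BF
# [1] 6
BF / (BF + 1 / prior_odds)              # posterior probability of H1
# [1] 0.24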

1.5 Bayes Factor in the Evaluative Setting

Consider the general situation where evidentiary material is collected and control items from a person or object of interest are available for comparative purposes. The following measurements of a particular characteristic are available: measurements y on a questioned item (e.g., a glass fragment found on the clothing of a person of interest) and measurements x on a control item (e.g., fragments from a broken window). In this evaluative setting, so-called source level propositions could be defined as follows:

H 1: The recovered (i.e., questioned) item comes from the same source as the control item.

H 2: The recovered (i.e., questioned) item comes from an unknown source (i.e., different from the control item).

This setting is called evaluative because it involves the comparison between control and recovered items and the use of the results of this comparison for discriminating between the competing propositions. Models for comparison can be either feature-based or score-based. Feature-based models (Sect. 1.5.1) focus on the probability of measurements made directly on evidentiary and reference items. Conversely, score-based models (Sect. 1.5.2) focus on the probability of observing a pairwise similarity (or distance), i.e., a score, between compared materials.

1.5.1 Feature-Based Models

If one assumes that y and x are realizations of random variables Y and X with a given probability distribution f(⋅), the Bayes factor is

$$\displaystyle \begin{aligned} \begin{array}{rcl}{} \mathrm{BF }=\frac{f(y,x\mid H_1,I)}{f(y,x\mid H_2,I)}, \end{array} \end{aligned} $$
(1.15)

where I represents the available background information. Application of the rules of conditional probability allows one to rewrite the Bayes factor as follows:

$$\displaystyle \begin{aligned} \begin{array}{rcl} \mathrm{BF }=\frac{f(y\mid x,H_1,I)}{f(y\mid x,H_2,I)}\times\frac{f(x\mid H_1,I)}{f(x\mid H_2,I)}. \end{array} \end{aligned} $$

This expression can be further simplified by considering the fact that (i) the distribution of measurements x on the control item does not depend on whether H 1 or H 2 is true (and hence f(x∣H 1, I) = f(x∣H 2, I) holds) and (ii) the distribution of the measurement y on the questioned item does not depend on the measurement x on the control item if H 2 is true, so that f(y∣x, H 2, I) = f(y∣H 2, I). The Bayes factor can therefore be written as

$$\displaystyle \begin{aligned} \begin{array}{rcl}{} \mathrm{BF }=\frac{f(y\mid x,H_1,I)}{f(y\mid H_2,I)}. \end{array} \end{aligned} $$
(1.16)

The numerator is the probability of observing the measurements on the recovered item under the assumption that it comes from the known source, given the information I and knowledge of x, the features of the known source. The denominator is the probability of observing the measurements y on the recovered item, assuming that it comes from an unknown source, usually selected at random from a relevant population, and assuming again the relevant information I. Note that, for the sake of simplicity, the conditioning information I will be omitted in the arguments hereafter.

For many types of forensic evidence, it can be reasonable to assume a parametric model {f(⋅∣θ), θ ∈ Θ}. In this way, the probability distribution characterizing the available data is of a known form, with the only unknown element being the parameter θ, which may vary between sources. Consider, for example, the probability distribution f(⋅∣θ) with unknown parameter θ = θ y for the measurements y on the recovered item and the same probability distribution with unknown parameter θ = θ x for the measurements x on the control item. In practice, the parameter θ is unknown, and a prior probability distribution π(θ∣H i), representing personal beliefs about θ under each hypothesis H i, is introduced. The marginal distribution f(y∣x, H 1) in the numerator of (1.16) may be rewritten as follows:

$$\displaystyle \begin{aligned} \begin{array}{rcl}{} f(y\mid x,H_1)& =&\displaystyle \int f(y\mid \theta)\pi(\theta\mid x,H_1)d\theta\\ & =&\displaystyle \int f(y\mid \theta)f(x\mid \theta)\pi(\theta\mid H_1)d\theta/f(x\mid H_1), \end{array} \end{aligned} $$
(1.17)

where the posterior density π(θ∣x, H 1) in the first line is rewritten in extended form using Bayes theorem. The distribution f(y∣x, H 1) is also called a posterior predictive distribution.

The marginal distribution f(y∣H 2) in the denominator of (1.16) can be rewritten as follows:

$$\displaystyle \begin{aligned} \begin{array}{rcl}{} f(y\mid H_2)=\int f(y\mid \theta)\pi(\theta\mid H_2)d\theta. \end{array} \end{aligned} $$
(1.18)

This is also called a predictive distribution.

Example 1.2 (Toner on Printed Documents)

Suppose experimental findings are available in the form of measurements of magnetism of toner on printed documents of known origin (x) and questioned origin (y), for which a Normal distribution is considered suitable. Thus, X ∼ N(θ x, σ 2) and Y ∼ N(θ y, σ 2), where the variance σ 2 of both distributions is assumed known and equal (Biedermann et al., 2016a). A Normal distribution with mean μ and variance τ 2 is taken to model our prior uncertainty about the means θ x and θ y, that is, θ ∼ N(μ, τ 2) for θ ∈ {θ x, θ y}. The integrals in (1.17) and (1.18) have an analytical solution, and the marginals can be obtained in closed form. See Aitken et al. (2021, pp. 815–817) for more details.

Here, H 1 and H 2 denote the propositions according to which the items of toner come from, respectively, the same and different printing machines. Consider, first, the numerator of the BF in (1.17), where the posterior density π(θ∣x, H 1) is still a Normal distribution with mean μ x and variance \(\tau ^2_x\), computed according to well-known updating rules (see, e.g., Lee, 2012),

$$\displaystyle \begin{aligned} \begin{array}{rcl}{} \mu_x = \frac{\sigma^2}{\sigma^2 +\tau^2} \mu+ \frac{\tau^2}{\sigma^2 +\tau^2}x \end{array} \end{aligned} $$
(1.19)

and

$$\displaystyle \begin{aligned} \begin{array}{rcl}{} \tau^2_x=\frac{\sigma^2\tau^2}{\sigma^2+\tau^2}. \end{array} \end{aligned} $$
(1.20)

The posterior mean, μ x, is a weighted average of the prior mean μ and the observation x. The weights are given by the population variance σ 2 and the variance τ 2 of the prior probability distribution, respectively, such that the component (observation or prior mean) which has the smaller variance has the greater contribution to the posterior mean. This result can be generalized to consider the distribution of the mean of a set of n observations x 1, …, x n from the same Normal distribution (see Sect. 2.3.1).

The marginal or posterior predictive distribution f(y∣x, H 1) is also a Normal distribution with mean equal to the posterior mean μ x and variance equal to the sum of the posterior variance \(\tau _x^2\) and the population variance σ 2, that is,

$$\displaystyle \begin{aligned} \begin{array}{rcl}{} (Y\mid x,H_1)\sim \mbox{N}(\mu_x,\tau_x^2+\sigma^2). \end{array} \end{aligned} $$
(1.21)

The same arguments apply to the marginal or predictive distribution f(y∣H 2) in the denominator, which is a Normal distribution with mean equal to the prior mean μ and variance equal to the sum of the prior variance τ 2 and the population variance σ 2, that is,

$$\displaystyle \begin{aligned} \begin{array}{rcl}{} (Y\mid H_2)\sim \mbox{N}(\mu,\tau^2+\sigma^2). \end{array} \end{aligned} $$
(1.22)

The Bayes factor can then be obtained as follows:

$$\displaystyle \begin{aligned} \begin{array}{rcl} \mathrm{BF }& =&\displaystyle \frac{\mbox{N}(y\mid\mu_x,\tau_x^2+\sigma^2)}{\mbox{N}(y\mid\mu,\tau^2+\sigma^2)}\\ & =&\displaystyle \frac{(\tau_x^2+\sigma^2)^{-1/2}\exp\left\{-\frac 1 2 \frac{(y-\mu_x)^2}{\tau_x^2+\sigma^2}\right\}}{(\tau^2+\sigma^2)^{-1/2}\exp\left\{-\frac 1 2 \frac{(y-\mu)^2}{\tau^2+\sigma^2}\right\}}. \end{array} \end{aligned} $$

Note that this can be easily extended to cases with multiple measurements y = (y 1, …, y n) (see Sect. 3.3.1).

Note that the value of the measurements y and x may be expressed as a ratio of the marginal likelihoods in (1.17) and (1.18), that is,

$$\displaystyle \begin{aligned} \begin{array}{rcl}{} \mathrm{BF }& =&\displaystyle \frac{\int f(y\mid\theta)f(x\mid\theta)\pi(\theta\mid H_1)d\theta}{f(x\mid H_1)} \times \frac{1}{f(y\mid H_2)} \\ & =&\displaystyle \frac{\int f(y\mid\theta)f(x\mid\theta)\pi(\theta\mid H_1)d\theta}{\int f(x\mid\theta)\pi(\theta\mid H_2)d\theta\int f(y\mid\theta)\pi(\theta\mid H_2)d\theta}, \end{array} \end{aligned} $$
(1.23)

as f(x∣H 1) = f(x∣H 2). If the recovered item and the control item come from the same source (i.e., hypothesis H 1 holds), then θ y = θ x; otherwise θ y ≠ θ x (i.e., hypothesis H 2 holds). If H 2 is true and hence the examined items come from different sources, the measurements can be considered independent. Note, however, that this is not necessarily the case. There are instances where the assumption of independence among measurements on control and recovered material under H 2 does not hold, and the BF will not simplify as in (1.23). See Linden et al. (2021) for a discussion of this issue in the context of questioned signatures.

The expression of the Bayes factor in (1.23) involves prior assessments about the unknown parameter θ, in terms of π(θ∣H i), as well as the likelihood function f(⋅∣θ). Thus, the Bayes factor cannot generally be regarded as a measure of the relative support provided by the data alone to the competing propositions.
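
As a check, the same Bayes factor can be computed directly from (1.23) by numerical integration, a pattern that remains applicable when no closed form is available; the numbers reuse the hypothetical values of the sketch above.

# BF from Eq. (1.23) by numerical integration (same hypothetical values)
y <- 10.2; x <- 10.5; mu <- 9.0; sigma2 <- 0.25; tau2 <- 1.0
prior <- function(t) dnorm(t, mu, sqrt(tau2))
num <- integrate(function(t) dnorm(y, t, sqrt(sigma2)) * dnorm(x, t, sqrt(sigma2)) *
                   prior(t), -Inf, Inf)$value
den <- integrate(function(t) dnorm(x, t, sqrt(sigma2)) * prior(t), -Inf, Inf)$value *
       integrate(function(t) dnorm(y, t, sqrt(sigma2)) * prior(t), -Inf, Inf)$value
num / den
# approximately 2.96, matching the closed-form result above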

1.5.2 Score-Based Models

For some types of forensic evidence, the specification of a probability model for the available data may be difficult. This is the case, for example, when the measurements are obtained using high-dimensional quantification techniques, e.g., for fingermarks or toolmarks (using complex sets of variables), in speaker recognition, or for traces such as glass, drugs, or toxic substances that may be described by several chemical components. In such applications, a feature-based Bayes factor (Sect. 1.5.1) may not be feasible, and a score-based approach may represent a practicable (or even the only) available alternative. Broadly speaking, a score is a metric that summarizes the result of a forensic comparison of two items or traces in terms of a single variable, representing a measure of similarity or difference (e.g., distance). Various distance measures can be used, such as the Euclidean or Manhattan distance; see, e.g., Bolck et al. (2015). One of the first proposals of score-based approaches in forensic science was presented in the context of forensic speaker recognition by Meuwly (2001).

Let Δ(⋅) denote the function which assesses the degree of similarity between feature vectors x and y. The similarity score Δ(x, y) represents the evidence for which a Bayes factor is to be computed. The introduction of a score function for quantifying the similarities/dissimilarities between compared items allows one to reduce the dimensionality of the problem, while retaining as much of the discriminative information as possible. For a score given by a distance, for example, one will expect a value close to zero if the features x and y relate to items from the same source. Conversely, if the features x and y relate to items from different sources, one will expect a larger score, provided that there are differences between members of the population. The score-based Bayes factor (sBF) is

$$\displaystyle \begin{aligned} \begin{array}{rcl}{} \mathrm{sBF }=\frac{g(\varDelta(x,y)\mid H_1,I)}{g(\varDelta(x,y)\mid H_2,I)}, \end{array} \end{aligned} $$
(1.24)

where g(⋅) denotes the probability distribution associated with Δ(X, Y ). For the sake of simplicity, the conditioning information I will be omitted hereafter.

For the Bayes factor in (1.24), one cannot reproduce the simplified expression that was derived in (1.16) for the feature-based Bayes factor. The score-based Bayes factor must be computed as the ratio of two probability density functions evaluated at the evidence score Δ(x, y), given the competing propositions H 1 and H 2. Since these two distributions are not generally available, the forensic examiner will typically derive an sBF from sampling distributions based on many scores produced under each of the two competing propositions. One way to compute the density of the score Δ(x, y) in the numerator is to generate many scores for comparisons between the known features x and the features y of other items known to come from the potential source assumed under H 1. The numerator can therefore be written as \(\hat g(\varDelta (x,y)\mid x,H_1)\), where \(\hat g (\cdot )\) indicates that the distribution is constructed on the basis of relevant data (scores) produced for the case of interest.

In the denominator, it is assumed that the proposition H 2 is true, and x and y denote features of items that come from different sources. The challenge for the forensic examiner is that of selecting the most appropriate data for obtaining the distribution in the denominator. Note that there are different ways to address this question because, depending on the case at hand, it might be appropriate to condition on (i) the known source (i.e., pursuing a so-called source-anchored approach) , (ii) the trace (i.e., trace-anchored approach) , or (iii) none of these (i.e., non-anchored approach). This amounts to evaluating the score using the probability density distribution that is obtained by producing scores for comparisons between (i) the features x of the control item from the known source and features of items taken from randomly selected sources of the relevant population, (ii) the features y of the trace item and features of items taken from sources selected randomly in the relevant population, (iii) features of pairs of items taken from sources selected randomly in the relevant population (i.e., without using x and y). Formally, this amounts to defining the distribution in the denominator as follows:

$$\displaystyle \begin{aligned} \begin{array}{ll} \mbox{(i)} & \hat g(\varDelta(x,y)\mid x,H_2),\\ \mbox{(ii)} & \hat g(\varDelta(x,y)\mid y,H_2),\\ \mbox{(iii)} & \hat g(\varDelta(x,y)\mid H_2). \end{array} \end{aligned}$$

See, e.g., Hepler et al. (2012) for a discussion of this topic.
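
The following R sketch outlines the general structure of a score-based Bayes factor computed via kernel density estimates of the score distributions; the simulated score samples and the observed score are hypothetical placeholders for the scores an examiner would produce in the case at hand.

# Score-based BF, Eq. (1.24), with kernel density estimates (hypothetical data)
set.seed(123)
scores_H1 <- rnorm(1000, mean = 0.5, sd = 0.3)  # e.g., distances for same-source comparisons
scores_H2 <- rnorm(1000, mean = 2.0, sd = 0.6)  # e.g., distances for different-source comparisons
delta_xy  <- 0.8                                # observed score Delta(x, y)

g1 <- approxfun(density(scores_H1))             # estimate of g(. | H1)
g2 <- approxfun(density(scores_H2))             # estimate of g(. | H2)
sBF <- g1(delta_xy) / g2(delta_xy)
sBF

Which comparisons feed into scores_H2 depends on the anchoring choice (i)–(iii) above; the kernel density step itself is unchanged.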

Example 1.3 (Image Comparison)

Consider a hypothetical case where the face of an individual is captured by surveillance cameras during the commission of a crime. Available screenshots are compared with the reference image(s) of a person of interest. For image comparison purposes, the evidence to be considered is a score given by the distance between the feature vectors x of the known reference and the evidential recording y (see Jacquet and Champod (2020) for a review). Consider the following competing propositions. H 1: The person of interest is the individual shown in the images of the surveillance camera, versus H 2: An unknown person is depicted in the image of the surveillance camera. To help specify the probability distribution of the score in the numerator, one can take several pairs of images from the person of interest to serve as pairs of questioned and reference items. To inform the probability distribution for the score in the denominator, conditioning on the reference item x (i.e., the images depicting the person of interest) can be justified as it may contain information that is relevant to the case and may be helpful for generating scores (Jacquet & Champod, 2020; Hepler et al., 2012). The distribution in the denominator can thus be computed using a source-anchored approach as in (i). The sBF can therefore be obtained as

$$\displaystyle \begin{aligned} \begin{array}{rcl} \mathrm{sBF } = \frac{\hat g(\varDelta(x,y)\mid x,H_1)}{\hat g(\varDelta(x,y)\mid x,H_2)}. \end{array} \end{aligned} $$

In other types of forensic cases, conditioning on y in the denominator, case (ii), may be more appropriate. This represents an asymmetric approach to defining the distribution in the numerator and in the denominator.

Example 1.4 (Handwritten Documents)

Consider a case involving handwriting on a questioned document. Handwriting features y on the questioned document are compared to the handwriting features x of a person of interest. The similarities and differences between x and y are measured by a suitable metric (score). To inform about the probability distribution of the scores in the numerator, one can take several draws of pairs of handwritten characters originating from the known source to serve as recovered and control items and to obtain scores from the selected draws. Under H 2, consideration of x is not relevant for the assessment. Note that here H 2 is the proposition according to which the person of interest is not the source of the handwriting on the questioned document, but someone else from the relevant population. It would then seem reasonable to construct the distribution for the denominator by comparing the features y of the questioned document with features x from items of handwriting of persons randomly selected from the relevant population of potential writers. This amounts to a trace-anchored approach as in situation (ii) defined above. In fact, for handwriting, the approach (i) would amount to discarding relevant information related to the questioned document. The sBF can therefore be obtained as

$$\displaystyle \begin{aligned} \begin{array}{rcl} \mathrm{sBF }= \frac{\hat g(\varDelta(x,y)\mid x,H_1)}{\hat g(\varDelta(x,y)\mid y,H_2)}. \end{array} \end{aligned} $$

In yet other cases, the distribution in the denominator may be obtained by comparing pairs of items drawn randomly from the relevant population, without conditioning on either x or y. In such cases, the alternative proposition H 2 is that the two compared items come from different sources.

Example 1.5 (Firearm Examination)

Consider a case in which a bullet is found at a crime scene and a person carrying a gun is arrested. The extent of agreement between marks left by firearms on bullets can be summarized by a score or multiple scores. An example of a simple score is the concept of consecutive matching striations. To inform the distribution in the numerator, the scientist fires multiple bullets using the seized firearm. To inform the distribution in the denominator, the scientist fires and compares many bullets known to come from different guns (i.e., different relevant models). The distribution in the denominator can thus be computed using a non-anchored approach. The sBF can therefore be obtained as

$$\displaystyle \begin{aligned} \begin{array}{rcl} \mathrm{sBF}= \frac{\hat g(\varDelta(x,y)\mid x,H_1)}{\hat g(\varDelta(x,y)\mid H_2)}. \end{array} \end{aligned} $$

Note that this is a coarse approach in the sense that no consideration is given to general manufacturing features. Indeed, the amount and quality of striation on a bullet may depend on aspects such as the caliber and the composition (e.g., jacketed/non-jacketed bullets, etc.), hence a conditioning on y may be considered.

Another example for a non-anchored approach, in the context of fingermark comparison, can be found in Leegwater et al. (2017). An example will be presented in Sect. 3.3.4.

Note that the above considerations refer to so-called specific-source cases. In such cases, recovered material is compared to material from a known source. However, there are also other situations where the competing propositions are as follows:

H 1: The recovered and the control material originate from the same source.

H 2: The recovered and the control material originate from different sources.

For such common-source propositions, the sampling distributions under the competing propositions can be learned, under H 1, from many scores for known same-source pairs (with each pair drawn from a distinct source) and, under H 2, from many scores for pairs known to come from different sources. The score-based BF in this case will account for the occurrence of the observed score under the competing propositions, but it does not account for the rarity of the characteristics of the trace.
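
A sketch of how such score samples might be assembled in R is given below, using a simulated reference collection and a Euclidean distance as score function; the collection, the score function, and the observed score are all hypothetical choices.

# Common-source score samples and sBF (hypothetical reference collection)
set.seed(42)
Delta <- function(a, b) sqrt(sum((a - b)^2))    # hypothetical score: Euclidean distance
items <- lapply(1:50, function(s) {             # 50 sources, 3 specimens each, 4 features
  mu_s <- rnorm(4, mean = 0, sd = 2)            # source-specific mean profile
  lapply(1:3, function(k) rnorm(4, mean = mu_s, sd = 0.5))
})
# scores for known same-source pairs (each pair drawn from within one source)
same_source <- unlist(lapply(items, function(src)
  apply(combn(3, 2), 2, function(p) Delta(src[[p[1]]], src[[p[2]]]))))
# scores for pairs known to come from different sources
diff_source <- apply(combn(50, 2), 2, function(p)
  Delta(items[[p[1]]][[1]], items[[p[2]]][[1]]))
delta_xy <- 2.0                                 # hypothetical observed score
sBF <- approxfun(density(same_source), rule = 2)(delta_xy) /
       approxfun(density(diff_source), rule = 2)(delta_xy)
sBF

Here rule = 2 simply keeps the density estimates constant beyond the range of the simulated scores.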

While a score-based approach has the potential to reduce the dimensionality of the problem, the use of scores implies a loss of information because the features y and x are replaced by a single score. Therefore, there is a trade-off to be found between the complexity of the original configuration of features and the performance of the score metric, the choice of which requires justification.

For a critical discussion of score-based evaluative metrics, see Neumann (2020) and Neumann and Ausdemore (2020). See also Bolck et al. (2015) for a discussion of feature- and score-based approaches for multivariate data.

1.6 Bayes Factor in the Investigative Setting

While the use of the Bayes factor for evaluative purposes is rather well established in both theory and practice, the focus on investigative settings still offers much room for original developments. In many forensic settings, especially in early stages of an investigation, it may be that no potential source is available for comparison. In such situations, it will not be possible to compare characteristics observed on recovered and reference materials, as would be the case in an evaluative setting (Sect. 1.5). Nevertheless, one can derive valuable information from the recovered material alone. Consider, for example, two populations denoted p 1 and p 2, respectively, and the following two propositions:

H 1: The recovered item comes from population p 1 (e.g., a population of females).

H 2: The recovered item comes from population p 2 (e.g., a population of males).

Denote by y the measurements on the recovered material, which is known to belong to one of the two populations specified by the competing hypotheses, though it is not known which one. For such a situation, the Bayes factor measures the change produced by the measurements y on the recovered item in the odds in favor of H 1, as compared to H 2, when going from the prior to the posterior distribution.

Assume that a parametric statistical model {f(⋅∣θ), θ ∈ Θ} is suitable for the data at hand. The problem of discriminating between two populations can then be treated as a problem of comparing statistical hypotheses, assuming that the probability distribution for the measurements on the recovered material (under each hypothesis) is of a given form. Consider, first, the situation where the parameters characterizing the two populations are known, that is, θ = θ 1 if the recovered item comes from population p 1 and θ = θ 2 if the recovered item comes from population p 2. Formally, this amounts to specifying the probability distributions f(y∣θ 1) and f(y∣θ 2), respectively. The posterior probability of the competing propositions can be computed as in (1.5) and the Bayes factor simplifies to a ratio of likelihoods as in (1.6).

Example 1.6 (Fingermark Examination)

Consider a case involving a single fingermark of unknown source. The fingerprint examiner seeks to help with the question of whether the mark comes from a man or woman. Thus, for investigative purposes, the following two propositions are of interest:

H 1: The fingermark comes from a man.

H 2: The fingermark comes from a woman.

A type of data that can be acquired from fingermarks is ridge width, summarized in terms of the ridge count per unit surface area (in mm2). See, for example, Appendix A of Champod et al. (2016) for a summary of different data collections. Consider ridge density, which was found to vary as a function of sex (i.e., women tend to have narrower ridges, and hence a higher ridge density, than men), but also between populations. Suppose that normality represents a reasonable assumption for ridge density so that the probability distribution for the available measurements can be considered Normal \(\mathrm {N}(\theta _i,\sigma ^{2}_{i})\) , with the unknown mean θ being equal to θ i and the variance σ 2 being equal to \(\sigma ^2_i\) if H i is true. Given H 1, the measurements y thus have a probability distribution \(\mathrm {N}(\theta _1,\sigma ^{2}_{1})\) and given H 2 a probability distribution \(\mathrm {N}(\theta _2,\sigma ^{2}_{2})\).

The posterior probability of the competing propositions can be computed as in (1.5), and the Bayes factor simplifies to a likelihood ratio as in (1.6), that is,

$$\displaystyle \begin{aligned} \begin{array}{rcl} \mathrm{BF}=\frac{\mbox{N}(y\mid \theta_1,\sigma^{2}_{1})}{\mbox{N}(y\mid \theta_2,\sigma^{2}_{2})}. \end{array} \end{aligned} $$
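In R, this is a one-line ratio of Normal densities; the parameter values and the measurement y below are purely illustrative and are not reference data for ridge density.

# Illustrative values only (not reference data)
theta1 <- 13; sigma1 <- 1.5    # population of men (H1)
theta2 <- 16; sigma2 <- 1.8    # population of women (H2)
y <- 14                        # ridge density measured on the fingermark

BF <- dnorm(y, mean = theta1, sd = sigma1) / dnorm(y, mean = theta2, sd = sigma2)
BF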

Generally, however, the parameters, or some of the parameters, characterizing the two distributions are unknown, and a prior probability distribution will be introduced under each proposition to model this uncertainty. As a consequence, the Bayes factor will also depend on prior assumptions and will not simplify to a likelihood ratio. Consider the case where parameters θ i are continuous and take values in the parameter space Θ i. A prior distribution π(θ i∣p i) must be specified, with θ i ∈ Θ i and p i representing the population of interest. A marginal distribution for each population p i can be computed as in (1.2),

$$\displaystyle \begin{aligned} \begin{array}{rcl}{} f_{H_i}(y)=\int_{\varTheta_i}f(y\mid\theta_i)\pi(\theta_i\mid p_i)d\theta_i \end{array} \end{aligned} $$
(1.25)

and the Bayes factor will take the form of a ratio of marginal likelihoods as in (1.3), that is,

$$\displaystyle \begin{aligned} \begin{array}{rcl}{} \mathrm{BF }=\frac{f_{H_1}(y)}{f_{H_2}(y)}. \end{array} \end{aligned} $$
(1.26)

Example 1.7 (Fingermark Examination—Continued)

Recall Example 1.6 where a Normal probability distribution was assumed for the measured ridge density on a fingermark, with variance known and equal to \(\sigma ^{2}_{i}\). A conjugate prior distribution may be introduced for the population mean θ i, say \(\theta _i\sim \mathrm {N}(\mu _i,\tau ^2_i)\). The marginal likelihoods are still Normal with mean equal to the prior mean μ i and variance equal to the sum of the prior variance \(\tau _i^2\) and the population variance \(\sigma ^{2}_{i}\). The Bayes factor therefore is

$$\displaystyle \begin{aligned} \begin{array}{rcl} \mathrm{BF }=\frac{\mbox{N}(y\mid \mu_1,\tau^{2}_{1}+\sigma^{2}_{1})}{\mbox{N}(y\mid \mu_2,\tau^{2}_{2}+\sigma^{2}_{2})}. \end{array} \end{aligned} $$

The same idea can be extended to the case where both the mean and the variance are unknown. This will be addressed in Sect. 4.3.2.
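A sketch of this computation with hypothetical prior means and standard deviations for the two populations (again, the numbers are illustrative only):

# Illustrative prior assessments for the two population means (H1: men, H2: women)
mu1 <- 13; tau1 <- 0.8
mu2 <- 16; tau2 <- 0.8
sigma1 <- 1.5; sigma2 <- 1.8   # known population standard deviations
y <- 14                        # measured ridge density

BF <- dnorm(y, mean = mu1, sd = sqrt(tau1^2 + sigma1^2)) /
  dnorm(y, mean = mu2, sd = sqrt(tau2^2 + sigma2^2))
BF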

The Bayes factor thus depends on the prior assumptions about parameters characterizing each population. This must not be confused, as noted earlier, with prior probabilities for competing propositions. The latter will form the prior odds which will be multiplied by the Bayes factor to compute the posterior odds

$$\displaystyle \begin{aligned} \begin{array}{rcl} \frac{\Pr(H_1\mid y)}{\Pr(H_2\mid y)}=\frac{f_{H_1}(y)}{f_{H_2}(y)}\times \frac{\Pr(H_1)}{\Pr(H_2)}. \end{array} \end{aligned} $$

The Bayesian approach for discriminating between two propositions regarding population membership can be easily generalized to the case where there are any number k (>2) of competing mutually exclusive propositions. Let H 1, …, H k denote k propositions and denote by y the observation to be evaluated. The propositions of interest can be defined as follows:

H 1: The recovered item comes from population 1 (p 1).

H 2: The recovered item comes from population 2 (p 2).

…

H k: The recovered item comes from population k (p k).

Example 1.8 (Questioned Documents)

Consider a case involving questioned documents where the issue of interest is which of three printing machines has been used to print a questioned document. Propositions of interest are:

H 1: The questioned documents have been printed with printer 1.

H 2: The questioned documents have been printed with printer 2.

H 3: The questioned documents have been printed with printer 3.

After having specified a Bayesian statistical model for each proposition (i.e., a probability distribution for the available measurements and a prior distribution for the unknown parameters), the marginal likelihoods \(f_{H_i}(y)\), i = 1, 2, 3, characterizing propositions H 1, H 2, and H 3, can be obtained as in (1.25).

Occasionally, cases involve multiple propositions. Imagine a case involving DNA findings, such as bloodstains recovered at a crime scene, with the reported profile being compared against the profile of a person of interest. The defense argues that the bloodstain does not come from the person of interest but from either a relative (e.g., a brother) or an unknown person. A question that may arise in such a case is how to evaluate and report results, because the Bayes factor involves pairwise comparisons. One option is to report only the marginal likelihoods \(f_{H_i}(y)\), even if they may not be easy to interpret. Alternatively, one may report a scaled version \(f_{H_i}^*(y)\) as suggested by Berger and Pericchi (2015), that is,

$$\displaystyle \begin{aligned} \begin{array}{rcl}{} f_{H_i}^*(y)=\frac{f_{H_i}(y)}{\sum_{j=1}^{k}f_{H_j}(y)}. \end{array} \end{aligned} $$
(1.27)

This expression will be much easier to interpret, because the scaled likelihoods \(f_{H_i}^*(y)\) sum up to 1. Generally, prior probabilities \(\Pr (H_i)\) may vary between recipients of such reports, but the posterior probability can be easily computed as

$$\displaystyle \begin{aligned} \begin{array}{rcl} \Pr(H_i\mid y)=\frac{\Pr(H_i)f_{H_i}^*(y)}{\sum_{j=1}^{k} \Pr(H_j)f_{H_j}^*(y)}, \qquad \qquad i=1,\dots,k \end{array} \end{aligned} $$

followed, if required, by classification of the recovered material in the population with the highest posterior probability. Note that reporting the scaled version in (1.27) is equivalent to assuming equal prior probabilities for competing propositions. In fact, if \(\Pr (H_i)=\frac 1 k\), i = 1, …, k, then it can easily be shown that

$$\displaystyle \begin{aligned}\Pr(H_i\mid y)=\frac{f^*_{H_i}(y)}{\sum_{j=1}^{k}f^*_{H_j}(y)}=f^*_{H_i}(y), \qquad i=1,\dots,k,\end{aligned}$$

as \(\sum _{j=1}^{k}f^*_{H_j}(y)=1\).
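A short numerical sketch of (1.27) and of the resulting posterior probabilities, assuming hypothetical marginal likelihood values for k = 3 propositions and recipient-specific prior probabilities:

# Hypothetical marginal likelihoods f_{H_i}(y), i = 1, 2, 3
fy <- c(0.062, 0.015, 0.0004)

# Scaled marginal likelihoods (1.27): they sum to 1
fy.star <- fy / sum(fy)
fy.star

# Posterior probabilities for recipient-specific prior probabilities
prior <- c(0.5, 0.3, 0.2)
post <- prior * fy.star / sum(prior * fy.star)
post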

The analyst may also consider the possibility of summarizing several propositions into one, in order to produce a comparison between two propositions regarding population membership. One of these propositions will be composite. Let \(\bar {H}_1=H_2\cup \cdots \cup H_k\). Starting from the k possible populations from which the recovered material may come, a pair of competing propositions of interest may thus be formulated as follows:

H 1: The recovered item comes from population 1 (p 1).

\(\bar H_1\): The recovered item comes from one of the other populations (p 2, …, p k).

The marginal likelihood \(f_{H_1}(y)\) characterizing proposition H 1 is obtained as in (1.25), while the marginal likelihood under \(\bar {H}_1\) is

$$\displaystyle \begin{aligned} \begin{array}{rcl} f_{\bar{H}_1}(y)=\sum_{i=2}^{k}\Pr(p_i)\int_{\varTheta_i}f(y\mid\theta_i)\pi(\theta_i\mid p_i)d\theta_i, \end{array} \end{aligned} $$

with \(\sum _{i=1}^k \Pr (p_i)=1\). The Bayes factor expressing the value of y for comparing H 1 against \(\bar {H}_1\) becomes

$$\displaystyle \begin{aligned} \begin{array}{rcl}{} \mathrm{BF }=\frac{f_{H_1}(y)\sum_{i=2}^{k}\Pr(p_i)}{f_{\bar{H}_1}(y)}. \end{array} \end{aligned} $$
(1.28)

The posterior odds become

$$\displaystyle \begin{aligned} \begin{array}{rcl} \frac{\Pr(H_1\mid y)}{\Pr(\bar{H}_1\mid y)}=\frac{f_{H_1}(y)\Pr(p_1)}{f_{\bar{H}_1}(y)}, \end{array} \end{aligned} $$

(Aitken et al., 2021, p. 643).

Example 1.9 (Questioned Documents—Continued)

Consider the following propositions:

H 1: The questioned documents have been printed with printer 1.

\(\bar {H}_1\): The questioned documents have been printed with printer 2 or with printer 3.

The marginal likelihood characterizing proposition H 1 is

$$\displaystyle \begin{aligned} \begin{array}{rcl} f_{H_1}(y)=\int_{\varTheta_1}f(y\mid\theta_1)\pi(\theta_1\mid p_1)d\theta_1. \end{array} \end{aligned} $$

The marginal likelihood characterizing proposition \(\bar {H}_1\) will become

$$\displaystyle \begin{aligned} \begin{array}{rcl} f_{\bar{H}_1}(y)& =&\displaystyle \Pr(p_2)\int_{\varTheta_2}f(y\mid\theta_2)\pi(\theta_2\mid p_2)d\theta_2 \\ & &\displaystyle +\Pr(p_3)\int_{\varTheta_3}f(y\mid\theta_3)\pi(\theta_3\mid p_3)d\theta_3, \end{array} \end{aligned} $$

and the Bayes factor can be obtained as in (1.28).
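A numerical sketch of (1.28) for this example, assuming hypothetical marginal likelihoods for the three printers and hypothetical prior probabilities Pr(p i) over them:

# Hypothetical marginal likelihoods f_{p_i}(y) for printers 1, 2, 3
f.p <- c(0.041, 0.008, 0.002)
# Prior probabilities over the three printers (summing to 1)
pr.p <- c(0.4, 0.35, 0.25)

# Marginal likelihood under the composite proposition (unnormalized, as in the text)
f.H1bar <- sum(pr.p[2:3] * f.p[2:3])

# Bayes factor (1.28): H1 against the composite alternative
BF <- f.p[1] * sum(pr.p[2:3]) / f.H1bar
BF

# Posterior odds
post.odds <- f.p[1] * pr.p[1] / f.H1bar
post.odds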

1.7 Bayes Factor Interpretation

The Bayes factor is a coherent measure of the change in support that the findings provide for one hypothesis against a given alternative (Jeffrey, 1975). Table 1.1 shows a guide for expressing Bayes factors verbally, following Jeffreys (1961). A historical review is presented in Aitken and Taroni (2021).

Table 1.1 Scale for verbally expressing support provided by the observations for one hypothesis over an alternative adapted from Jeffreys (1961)

The verbal equivalent must express a degree of support for one of the propositions relative to an alternative and is defined from ranges of Bayes factor values. Qualitative interpretations of the Bayes factor have also been proposed in the context of forensic science (Evett, 1987, 1990; Evett et al., 2000; Nordgaard et al., 2012; Willis et al., 2015). Table 1.2 summarizes an example of a scale given in the ENFSI Guideline for Evaluative Reporting in Forensic Science (Willis et al., 2015), inspired by the scale proposed by Nordgaard et al. (2012). Users of these scales must be aware that attaching verbal labels to ranges of Bayes factor values offers a broad descriptive statement about standards of evidence in scientific investigation and not a calibration of the Bayes factor (Kass, 1993). See, e.g., Ramos and Gonzalez-Rodriguez (2013), van Leeuwen and Brümmer (2013) and Aitken et al. (2021) for an account of calibration as a measure of performance of BF computation methods.

Table 1.2 Verbal scale for expressing evidential value, in terms of the Bayes factor, in support of the prosecution’s proposition over the alternative (defense) proposition (Willis et al., 2015)

Moreover, it is important to note that the choice of a reported verbal equivalent is based on the magnitude of the Bayes factor and not the reverse. Marquis et al. (2016) present a discussion on how to implement a verbal scale in a forensic laboratory, considering benefits, pitfalls, and suggestions to avoid misunderstandings.

It is worth reiterating that a Bayes factor represents a measure of change in support rather than a measure of support, though the two expressions may be perceived as equivalent. In fact, the Bayes factor can be shown to be a non-coherent measure of support: a small Bayes factor means that the data will lower the probability of the hypothesis of interest relative to its value prior to considering the evidence, but it does not imply that the probability of this hypothesis is low. The Bayes factor measures the change produced in the odds, thus providing a measure of whether the available findings have increased or decreased the odds in favor of one proposition compared to the alternative (Bernardo & Smith, 2000).

1.8 Computational Aspects

The computation of Bayes factors can be challenging, especially when the marginal likelihoods in the numerator and in the denominator (1.2) involve integrals that do not have an analytical solution. Several methods have been proposed to address this complication. See Kass and Raftery (1995) and Han and Carlin (2001) for a review.

Consider the following general expression for the marginal likelihood:

$$\displaystyle \begin{aligned} \begin{array}{rcl}{} f(x)=\int f(x\mid\theta)\pi(\theta)d\theta. \end{array} \end{aligned} $$
(1.29)

If the likelihood f(x∣θ) and the prior π(θ) are not conjugate, then an analytical solution may not be available. But suppose that it is possible to draw values from the prior distribution π(⋅). The integral in (1.29) can then be approximated by Monte Carlo methods as

$$\displaystyle \begin{aligned} \begin{array}{rcl}{} \hat f_1(x)=\sum_{i=1}^N f(x\mid\theta^{(i)})/N, \end{array} \end{aligned} $$
(1.30)

where θ (i), i = 1, …, N, are N independent draws from π(⋅). This is the average of the likelihood of the sampled values. An example will be provided in Sect. 2.2.2 (Example 2.3).
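A minimal sketch of (1.30) in a setting where the exact answer is known, so that the approximation can be checked: a binomial likelihood with a Be(a, b) prior, whose marginal likelihood is beta-binomial. The data and prior values are illustrative.

set.seed(123)
# Illustrative data: x successes in n binomial trials, Be(a, b) prior on theta
n <- 20; x <- 7
a <- 2; b <- 3

# Monte Carlo estimate (1.30): average the likelihood over prior draws
N <- 100000
theta <- rbeta(N, a, b)
f1.hat <- mean(dbinom(x, size = n, prob = theta))
f1.hat

# Exact marginal likelihood (beta-binomial) for comparison
f.exact <- choose(n, x) * beta(x + a, n - x + b) / beta(a, b)
f.exact

With this number of draws, the two quantities agree to within Monte Carlo error.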

This simulation process can be rather inefficient when the posterior distribution is concentrated relative to the prior, as most of the θ (i) will have a small likelihood and the estimate \(\hat f_1(x)\) in (1.30) may be dominated by a few values with large likelihood. The precision of the Monte Carlo integration can be improved by importance sampling (Kass & Raftery, 1995). Moreover, statistical software (e.g., R) allows one to sample from a wide range of standard distributions.

Importance sampling, as well as other Monte Carlo tools, may help to overcome such difficulties, as there is no need to sample directly from the prior distribution π(θ). Consider any manageable density π ∗(θ) from which it is feasible to sample. The integral in (1.29) can then be approximated by importance sampling as

$$\displaystyle \begin{aligned} \begin{array}{rcl}{} \hat f_2(x)=\frac{\sum_{i=1}^N w_i f(x\mid\theta^{(i)})}{\sum_{i=1}^{N}w_i}, \end{array} \end{aligned} $$
(1.31)

where θ (i) are independent draws from π ∗(θ) and are weighted by importance weights w i = π(θ (i))∕π ∗(θ (i)). The function π ∗(θ) is known as the importance sampling function (e.g., Geweke, 1989). An example will be provided in Sect. 2.2.2 (Example 2.3).
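Continuing the same illustrative binomial setting, a sketch of the importance sampling estimate (1.31), with a Be(1 + x, 1 + n − x) density as an arbitrary but convenient choice of importance sampling function concentrated near the likelihood:

set.seed(123)
n <- 20; x <- 7; a <- 2; b <- 3
N <- 100000

# Importance sampling function: a beta density roughly matching the likelihood
a.star <- 1 + x; b.star <- 1 + n - x
theta <- rbeta(N, a.star, b.star)

# Importance weights w_i = pi(theta_i) / pi*(theta_i)
w <- dbeta(theta, a, b) / dbeta(theta, a.star, b.star)

# Importance sampling estimate (1.31)
f2.hat <- sum(w * dbinom(x, size = n, prob = theta)) / sum(w)
f2.hat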

In the case where π ∗(θ) is taken to be the posterior density π(θ∣x) = π(θ)f(x∣θ)∕f(x), the use of this expression in (1.31) yields the harmonic mean of the sampled likelihood values as an estimate for the marginal likelihood f(x):

$$\displaystyle \begin{aligned} \begin{array}{rcl} \hat f_3(x)=\left[\frac 1 N \sum_{i=1}^{N}\frac{1}{f(x\mid\theta^{(i)})}\right]^{-1}. \end{array} \end{aligned} $$
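For completeness, a sketch of the harmonic mean estimate in the same illustrative setting, using draws from the posterior Be(a + x, b + n − x); the estimator is simple but can be numerically unstable, so it is shown only for illustration.

set.seed(123)
n <- 20; x <- 7; a <- 2; b <- 3
N <- 100000

# Draws from the posterior Be(a + x, b + n - x)
theta.post <- rbeta(N, a + x, b + n - x)

# Harmonic mean of the sampled likelihood values
f3.hat <- 1 / mean(1 / dbinom(x, size = n, prob = theta.post))
f3.hat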

Note that, whatever method is used, the output of such a simulation procedure is an approximation that must be handled carefully. Notwithstanding, it is worth pointing out that while the Monte Carlo estimate is not exact, the Monte Carlo error (e.g., \(f(x)-\hat f_1(x)\)) can be very small if a sufficiently large number of draws are generated. A study of Monte Carlo errors for the quantification of the value of forensic evidence is provided by Ommen et al. (2017).

Many practical problems require more advanced techniques based on Markov chain Monte Carlo (MCMC) methods to overcome computational hurdles. The general idea behind these methods is to sample recursively values θ (i) from some transition distribution that depends on the previous draw θ (i−1), in such a way that at each step of the iteration process the sampling distribution becomes closer (i.e., converges) to the target posterior distribution π(θ∣x). This means that, after a sufficiently large number of iterations, θ (i) is approximately distributed according to π(θ∣x) and can be used like the output of a Monte Carlo simulation algorithm. To avoid the effect of the starting values, the first set of iterations is generally discarded (the so-called burn-in period), and the simulated values beyond the first n b iterations

$$\displaystyle \begin{aligned} \theta^{(n_b+1)},\dots,\theta^{(N)} \end{aligned}$$

are taken as draws from the target posterior distribution. The Gibbs sampling algorithm is a well-known method to construct a chain with these features. Suppose that the parameter vector can be decomposed into several components, say θ = (θ 1, …, θ p), and let \(\pi (\theta _j\mid \theta _{-j}^{(i-1)})\) denote the so-called full conditional distribution, that is, the conditional distribution of θ j at step (i) given all the other components θ −j at the previous step (i − 1)

$$\displaystyle \begin{aligned} \theta_{-j}^{(i-1)}=(\theta_1^{(i-1)},\dots,\theta_{j-1}^{(i-1)},\theta_{j+1}^{(i-1)},\dots,\theta_{p}^{(i-1)}). \end{aligned}$$

For many problems, it is possible to sample easily from the conditional distributions, as is the case when distributions are conjugate. The Gibbs sampling algorithm works as follows: start with an arbitrary value \(\theta ^{(0)}=(\theta _1^{(0)},\dots ,\theta _p^{(0)})\) and generate \(\theta _j^{(i)}\) at each iteration according to the conditional distribution given the current values \(\theta _{-j}^{(i-1)}\). Examples will be given in Sects. 3.4.1.3 (Example 3.14) and 3.4.3 (Example 3.16).
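A minimal sketch of a Gibbs sampler is given below, for Normal data with unknown mean θ and variance σ2, a Normal prior on θ, and an Inverse-Gamma prior on σ2, so that both full conditionals are available in closed form. The data and all prior values are illustrative and not taken from any forensic dataset.

set.seed(123)
# Illustrative data and prior values
x <- rnorm(30, mean = 5, sd = 2)
n <- length(x)
mu0 <- 0; tau0 <- 10          # prior on theta: N(mu0, tau0^2)
a0 <- 2; b0 <- 2              # prior on sigma^2: Inverse-Gamma(a0, b0)

n.iter <- 5000; n.burn <- 1000
theta <- sigma2 <- numeric(n.iter)
theta[1] <- mean(x); sigma2[1] <- var(x)      # arbitrary starting values

for (i in 2:n.iter) {
  # Full conditional of theta given sigma2: Normal
  prec <- 1 / tau0^2 + n / sigma2[i - 1]
  m <- (mu0 / tau0^2 + sum(x) / sigma2[i - 1]) / prec
  theta[i] <- rnorm(1, mean = m, sd = sqrt(1 / prec))
  # Full conditional of sigma2 given theta: Inverse-Gamma
  sigma2[i] <- 1 / rgamma(1, shape = a0 + n / 2,
                          rate = b0 + sum((x - theta[i])^2) / 2)
}

# Discard the burn-in period and summarize the draws
mean(theta[-(1:n.burn)]); mean(sigma2[-(1:n.burn)])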

Whenever it is not possible to decompose the joint distribution into manageable conditional distributions, one can implement an alternative approach, the Metropolis–Hastings (M–H) algorithm (e.g. Gelman et al., 2014). This algorithm can be summarized as follows. Start with an arbitrary value \(\theta ^{(0)}=(\theta _1^{(0)},\dots ,\theta _p^{(0)})\) and generate \(\theta _j^{(i)}\) at each iteration, as follows:

  1. Draw a proposal value \(\theta _j^{\text{prop}}\) from a density \(q(\theta _j^{(i-1)},\theta _j^{\text{prop}})\), called the candidate generating density.

  2. Compute the probability of acceptance as follows:

     $$\displaystyle \begin{aligned} \begin{array}{rcl}{} \alpha\left(\theta_j^{(i-1)},\theta_j^{\text{prop}}\right)=\min\left\{1,\frac{\pi\left(\theta_j^{\text{prop}}\right)q\left(\theta_j^{\text{prop}},\theta_j^{(i-1)}\right)}{\pi\left(\theta_j^{(i-1)}\right)q\left(\theta_j^{(i-1)},\theta_j^{\text{prop}}\right)}\right\}. \end{array} \end{aligned} $$
     (1.32)

  3. Accept the proposed value \(\theta _j^{\text{prop}}\) with probability \(\alpha \left (\theta _j^{(i-1)},\theta _j^{\text{prop}}\right )\), and set \(\theta _j^{(i)}=\theta _j^{\text{prop}}\); otherwise, reject the proposed value and set \(\theta _j^{(i)}=\theta _j^{(i-1)}\).

Note that if the candidate generating function is symmetric (e.g., a Normal probability density), the acceptance probability in (1.32) simplifies to

$$\displaystyle \begin{aligned} \begin{array}{rcl} \alpha\left(\theta_j^{(i-1)},\theta_j^{\text{prop}}\right)=\min\left\{1,\frac{\pi\left(\theta_j^{\text{prop}}\right)}{\pi\left(\theta_j^{(i-1)}\right)}\right\}. \end{array} \end{aligned} $$
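As an illustration, a minimal random-walk Metropolis sketch for the posterior of a binomial proportion θ with a Be(a, b) prior; the symmetric Normal proposal means the simplified acceptance probability above applies. The data, prior values, and tuning constant sd.prop are illustrative, and the target is evaluated on the log scale for numerical stability.

set.seed(123)
n <- 20; x <- 7; a <- 2; b <- 3          # illustrative data and prior

# Log of the (unnormalized) target posterior density
log.post <- function(theta) {
  if (theta <= 0 || theta >= 1) return(-Inf)
  dbinom(x, n, theta, log = TRUE) + dbeta(theta, a, b, log = TRUE)
}

n.iter <- 10000; n.burn <- 2000
theta <- numeric(n.iter)
theta[1] <- 0.5                          # arbitrary starting value
sd.prop <- 0.1                           # tuning parameter of the proposal

for (i in 2:n.iter) {
  theta.prop <- rnorm(1, mean = theta[i - 1], sd = sd.prop)   # symmetric proposal
  log.alpha <- log.post(theta.prop) - log.post(theta[i - 1])
  if (log(runif(1)) < log.alpha) {
    theta[i] <- theta.prop               # accept
  } else {
    theta[i] <- theta[i - 1]             # reject and keep the current value
  }
}

mean(theta[-(1:n.burn)])                 # posterior mean estimate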

The performance of an MCMC algorithm can be monitored by inspecting graphs and computing diagnostic statistics. Such exploratory analysis is fundamental for assessing convergence to the posterior distribution. An example will be given in Sect. 2.2.2 (Example 2.6).

The output of the MCMC algorithm can be used to provide the marginal likelihood that is needed for the numerator and the denominator of the Bayes factor, as proposed by Chib (1995) for a Gibbs sampling algorithm and by Chib and Jeliazkov (2001) for an M–H algorithm. The key idea is to obtain the marginal likelihood f(x) by a direct application of Bayes theorem since it can be seen as the normalizing constant of the posterior density

$$\displaystyle \begin{aligned} \begin{array}{rcl}{} f(x)=\frac{f(x\mid\theta^*)\pi(\theta^*)}{\pi(\theta^*\mid x)}, \end{array} \end{aligned} $$
(1.33)

where θ ∗ is a parameter value with high posterior density. Note that (1.33) is valid for any parameter value θ ∗∈ Θ. The likelihood f(x∣θ ∗) and the prior density π(θ ∗) can be directly computed at a given parameter point θ ∗. The posterior density π(θ ∗∣x) is unavailable in closed form, but it can be approximated by using the output of the Gibbs sampler. Consequently, the marginal likelihood can be approximated as

$$\displaystyle \begin{aligned} \begin{array}{rcl}{} \hat f(x)=\frac{f(x\mid\theta^*)\pi(\theta^*)}{\hat\pi(\theta^*\mid x)}. \end{array} \end{aligned} $$
(1.34)

Examples will be given in Sects. 3.4.1.3 (Example 3.14) and 3.4.3 (Example 3.16).
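As a rough illustration of (1.34), the sketch below reuses the objects theta, n.burn, x, n, a, and b from the random-walk Metropolis sketch above. The posterior density at the chosen point is approximated here by a kernel density estimate of the MCMC output, a cruder device than the estimates of Chib (1995) and Chib and Jeliazkov (2001), but sufficient to show the idea.

# Continuing the random-walk Metropolis sketch above
theta.draws <- theta[-(1:n.burn)]
theta.star <- mean(theta.draws)                 # a point of high posterior density

# Approximate the posterior density at theta.star from the MCMC output
post.dens <- approxfun(density(theta.draws))

# Estimate (1.34) of the marginal likelihood
f.hat <- dbinom(x, n, theta.star) * dbeta(theta.star, a, b) / post.dens(theta.star)
f.hat

# Exact beta-binomial marginal likelihood, for comparison
choose(n, x) * beta(x + a, n - x + b) / beta(a, b)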

This short overview of computational tools is not intended to be exhaustive. There are instances, for example when dealing with high-dimensional distributions, where the simulation process is very slow, giving rise to inefficiencies in the behavior of the Gibbs sampler or the Metropolis algorithm. An alternative solution is given by the Hamiltonian Monte Carlo (HMC) method, where the proposal is not a random walk centered on the current position of the chain but instead exploits information about the local shape of the target distribution at the current position. This allows one to obtain more promising candidate values, avoiding getting stuck in a very slow exploration of the target distribution, and therefore to move much more rapidly (Neal, 1996). As in any Metropolis algorithm, HMC proceeds by a series of iterations, though it requires more effort in terms of programming and tuning. The user can refer to the computer program Stan (sampling through adaptive neighborhoods) to directly apply the Hamiltonian Monte Carlo method. The reader can refer to Gelman et al. (2014) and Stan Development Team (2021) for instructions and examples. A complete picture of basic and more advanced methods of Bayesian computation can be found, e.g., in Gelman et al. (2014), Marin and Robert (2014), and Robert and Casella (2010). The reader can also refer to Han and Carlin (2001) and to Friel and Pettitt (2008) for a review of methods to compute BFs.

In all examples in this book dealing with the Gibbs sampler and the Metropolis–Hastings algorithm, we will directly program the computations in R. Other open-source programs exist, however, that can be used to build Markov chain Monte Carlo samplers, such as Stan or JAGS (Just Another Gibbs Sampler, https://mcmc-jags.sourceforge.io/). Both can interact with R (see the libraries RStan, rjags, and runjags). Further examples can be found in Albert (2009) and Kruschke (2015).

1.9 Bayes Factor and Decision Analysis

The Bayes factor provides a coherent and quantitative way for relating probabilities for states of nature, before information is obtained, to probabilities given information that has become available. A subsequent step, the choice between different hypotheses, represents a problem of decision-making (Lindley, 1985). For the purpose of illustration, consider the simple and regularly encountered case where only two hypotheses are of interest, say H 1 and H 2. The two hypotheses form a list of, more generally, n mutually exclusive and exhaustive uncertain events (also called states of nature). The decision space is the set of all possible actions, here decisions d 1 and d 2, where decision d i can be formalized as the acceptance of hypothesis H i. The decision problem can be expressed in terms of a decision matrix (see Table 1.3) with C ij denoting the consequence of deciding d i when hypothesis H j is true. Decision d i is called “correct” if hypothesis H j is true and i = j. Conversely, decision d i is incorrect if hypothesis H j is true and i ≠ j, i.e., if H ¬i is true. When choosing between competing hypotheses, one takes preferences among decision consequences into account, in particular among adverse outcomes. This aspect is formalized by introducing a measure for expressing the decision maker’s relative desirability, or undesirability, of the various decision consequences. To measure the undesirability of consequences on a numerical scale, one can introduce a loss function L(⋅), where L(C ij) denotes the loss that one assigns to the outcome of deciding d i when hypothesis H j is true.

Table 1.3 Decision matrix with d 1 and d 2 denoting the possible actions, H 1 and H 2 denoting the states of nature, and C ij denoting the consequence of deciding d i when hypothesis H j is true

If it can be agreed that a correct decision represents neither a loss nor a gain, the loss function for a two-action problem can be described with a two-way table that contains zeros for the losses L(C ij), i = j, and the value l i for L(C ij), i ≠ j. Such a “0 − l i” loss function is shown in Table 1.4, where l i = L(d i, H ¬i) denotes the loss one incurs whenever decision d i is a wrong decision.

Table 1.4 The “0 − l i” loss function for a decision problem with d 1 and d 2 denoting the possible actions, H 1 and H 2 denoting the states of nature, and l i denoting the loss associated with adverse decision consequences

The relative (un-)desirability of available decisions can be expressed by their expected loss EL(⋅), computed as

$$\displaystyle \begin{aligned} \begin{array}{rcl} \mathrm{EL}(d_i\mid x)& =&\displaystyle \underbrace{\mathrm{L}(d_i,H_i)}_{0}\underbrace{\Pr(H_i\mid x)}_{\alpha_i}+\underbrace{\mathrm{L}(d_i,H_{\neg i})}_{l_i}\underbrace{\Pr(H_{\neg i}\mid x)}_{\alpha_{\neg i}}\\ & =&\displaystyle l_i\alpha_{\neg i}, \end{array} \end{aligned} $$

where x denotes the observation or a series of measurements and α ¬i denotes the (posterior) probability of the event H ¬i given x. The formal Bayesian decision criterion is to accept hypothesis H 1 if the expected loss of the decision to accept H 1 is smaller than the expected loss of rejecting it, that is, if the (posterior) expected loss of decision d 1 is smaller than the (posterior) expected loss of decision d 2:

$$\displaystyle \begin{aligned} \begin{array}{rcl}{} \mathrm{EL}(d_1\mid x)& <&\displaystyle \mathrm{EL}(d_2\mid x)\\ l_1\alpha_2& <&\displaystyle l_2\alpha_1 . \end{array} \end{aligned} $$
(1.35)

When rearranging the terms in (1.35) to α 1∕α 2 > l 1∕l 2 and dividing both sides by the prior odds π 1∕π 2, the Bayes decision criterion states that accepting H 1 is the optimal decision whenever

$$\displaystyle \begin{aligned} \begin{array}{rcl} \frac{\alpha_1/\alpha_2}{\pi_1/\pi_2}>\frac{l_1/l_2}{\pi_1/\pi_2}=c. \end{array} \end{aligned} $$

This is equivalent to accepting H 1 whenever the Bayes factor in favor of this proposition is larger than a constant c determined by the prior odds and the loss ratio. Given a set of observations, the Bayes factor is computed and, depending on whether or not it exceeds a given threshold, the decision maker chooses between the members in the list of states of nature (here H 1 and H 2). Examples will be given in Chap. 3 in the context of inference of source (Sect. 3.3.3) and in Chap. 4 in the context of classification (Sects. 4.2.2 and 4.4.1.2). An extended review of elements of decision analysis in forensic science can be found in Taroni et al. (2021b).
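A small numerical sketch of this decision criterion, with a hypothetical Bayes factor, prior odds, and losses:

# Hypothetical inputs
BF <- 120                  # Bayes factor in favour of H1
prior.odds <- 0.25         # pi_1 / pi_2
l1 <- 10; l2 <- 1          # losses for wrongly deciding d1 and d2

# Threshold c = (l1/l2) / (pi_1/pi_2); accept H1 if BF > c
c.threshold <- (l1 / l2) / prior.odds
BF > c.threshold

# Equivalent check via posterior expected losses
post.odds <- BF * prior.odds
alpha1 <- post.odds / (1 + post.odds)   # Pr(H1 | x)
alpha2 <- 1 - alpha1                    # Pr(H2 | x)
c(EL.d1 = l1 * alpha2, EL.d2 = l2 * alpha1)

With these inputs, the threshold is c = 40; the Bayes factor of 120 exceeds it, and accordingly decision d 1 has the smaller expected loss.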

This decision criterion is simple and intuitive, yet it poses challenges. For example, the requirement to choose a prior probability for the two hypotheses may be discomforting, because there is no ready-made recipe for this purpose. In principle, probabilities are personal, since they depend on one’s knowledge (Lindley, 2014). They may change as information changes and may vary among individuals. For example, a given hypothesis may be considered almost certainly true by one individual, but far less probable by someone else. The fact that different individuals with different knowledge bases may specify different probabilities for the same event, provided that they are accompanied by a justification, is not a problem in principle (Lindley, 2000). The only strict requirement to which probability assignments ought to conform is coherence (de Finetti, 2017). Coherence has the normative role of encouraging people to make careful assignments based on their personal knowledge. This can be operationally supported by the concept of scoring rules. See, for example, Biedermann et al. (2013, 2017a) for a discussion of scoring rules in the context of forensic science.

The same viewpoint applies to utility and loss functions, which may be difficult to specify. A “correct” utility (or loss) function does not exist, because preference structures are personal. Adverse decision consequences may be considered more or less undesirable, depending on the background, the context and the decision maker’s objectives (e.g., Taroni et al., 2010). Moreover, the loss function does not need to have constant values, such as the “0 − l i” loss function introduced above. More general loss functions treat the loss as a function of the severity of the consequences. Examples will be given in Chap. 2 regarding inference and decision about a proportion (Sect. 2.2.3) and about a mean (Sect. 2.3.3).

Note that, in the context here, the terms “personal” and “subjective” do not mean that the theory is arbitrary, unjustified or groundless (Biedermann et al., 2017b; Taroni et al., 2018). There are various devices for the sound elicitation of probabilities and the measurement of the value of decision consequences (Lindley, 1985). What matters in a situation in which a decision maker is asked to make a choice among alternative courses of action that have uncertain consequences is that the behavior is one that can be qualified as rational. This includes, in particular, a coherent specification of the loss function, reflecting personal preferences among consequences in terms of desirability or undesirability.

This formal decision-analytic approach provides decision criteria that (i) are based on clearly defined concepts, (ii) promote rational decision-making under uncertainty, and (iii) make a clear distinction between the evaluation of the strength of evidence (as given by the Bayes factor), which is the domain of the forensic scientist, and the specification of the threshold with which the Bayes factor is compared, i.e., the ratio between the loss ratio and the prior odds. The latter lies in the domain of the recipient of expert information, such as investigative authorities and members of the judiciary.

1.10 Choice of the Prior Distribution

Bayesian model builders may encounter various difficulties. One of them is the choice of the prior distribution. Bayes theorem does not specify how one ought to define the prior distribution. The chosen prior distribution should, however, suitably reflect one’s prior beliefs. In this context, so-called vague or non-informative prior distributions may help to find a broad consensus. However, it is important to keep in mind that even a “non-informative” prior distribution effectively conveys a well-defined opinion, i.e., that probabilities spread uniformly over the parameter space (de Finetti, 1993a). In contrast to this, personal or so-called informative priors aim at encoding available prior knowledge. Whenever feasible, it is advantageous to choose a member of the class of conjugate distributions, that is, a family of prior distributions such that for any prior in this family and a particular probability distribution, the corresponding posterior distribution will be in the same family. For example, the beta distribution and the binomial distribution are said to be conjugate in this sense. Several examples will be provided throughout this book. Table 1.5 provides a list of some common families of conjugate distributions. A more extensive list can be found in Bernardo and Smith (2000). Despite such smooth technical options, eliciting a prior distribution may not be easy.

Table 1.5 Some common conjugate prior distribution families

First, it may be that none of the standard parametric families mentioned above is suitable to describe one’s prior degree of belief. There may be circumstances where multimodal priors may better reflect the available knowledge, and mixture priors would be more convenient (see e.g. Taroni et al., 2010). Another option is to specify prior beliefs over a selection of points and then interpolate between them (Bolstad & Curran, 2017). More generally, there may be cases where the choice of a conjugate prior is not appropriate as it does not properly reflect available knowledge. If this is the case, the application of Bayes theorem may lead to a posterior distribution that is analytically intractable. Such situations require the implementation of computational tools as described in Sect. 1.8.

Second, practitioners will immediately realize that even if the choice of a given standard parametric family may appear justifiable, they will still need to choose a member from the selected family. Stated otherwise, they will need to fix the hyperparameters of the prior distribution in a way that the resulting shape will reasonably reflect their knowledge. Assume that practitioners are in a situation where, based on their experience in the field, they can summarize and translate their prior beliefs into a numerical value for the prior mean, say m, and into a numerical value for the prior standard deviation, say s. They can then find the values of the parameters that specify a prior distribution that reflects the assessed prior location and prior dispersion, respectively. For example, suppose that the parameter of interest, θ, is a proportion and that a beta prior distribution is chosen to model prior uncertainty, i.e., θ ∼Be(α, β). The problem then is how to choose α and β. If one can specify a value m for the prior mean and a value s for the prior standard deviation, that is the two values describing the location and the shape of the prior distribution, one can elicit the hyperparameters α and β by relating the assessed prior mean and prior variance to the prior moments of a beta distributed random variable, that is,

$$\displaystyle \begin{aligned} \begin{array}{rcl} m& =&\displaystyle \frac{\alpha}{\alpha+\beta}{} \end{array} \end{aligned} $$
(1.36)
$$\displaystyle \begin{aligned} \begin{array}{rcl} s^2& =&\displaystyle \frac{\alpha\beta}{(\alpha+\beta+1)(\alpha+\beta)^2}.{} \end{array} \end{aligned} $$
(1.37)

The hyperparameters of the beta prior can then be obtained by solving the two equations in (1.36) and (1.37) for α and β

$$\displaystyle \begin{aligned} \begin{array}{rcl} \alpha& =&\displaystyle m\left[\frac{m(1-m)}{s^2}-1\right]{} \end{array} \end{aligned} $$
(1.38)
$$\displaystyle \begin{aligned} \begin{array}{rcl} \beta& =&\displaystyle (1-m)\left[\frac{m(1-m)}{s^2}-1\right]{}. \end{array} \end{aligned} $$
(1.39)

It is advisable to inspect the prior distribution thus elicited. Producing a graphical representation can help examine whether the shape of the distribution reasonably reflects one’s prior beliefs. Moreover, the so-called equivalent sample size n e should be calculated in order to examine the reasonableness of the amount of information that underlies the proposed prior; one should make sure that it is not unrealistically high (Bolstad & Curran, 2017). Stated otherwise, one should examine whether the information that is conveyed by the prior is equivalent, at least roughly, to the information that would be obtained by collecting a sample of equivalent size n e. For example, consider a random sample \((X_1,\dots ,X_{n_e})\) of size n e, providing the same information that is conveyed by the prior. The sample mean \(\bar X=\frac {1}{n_e} \sum _{i=1}^{n_e}X_i\) should have, at least roughly, the same location and the same dispersion as the prior.

For the beta-binomial case, the equivalent sample size n e can be obtained by relating the moments of the beta prior to the corresponding moments characterizing a random sample of size n e from a Bernoullian population with probability of success θ:

$$\displaystyle \begin{aligned} \begin{array}{rcl} \frac{\alpha}{\alpha+\beta}& =&\displaystyle \theta{} \end{array} \end{aligned} $$
(1.40)
$$\displaystyle \begin{aligned} \begin{array}{rcl} \frac{\alpha\beta}{(\alpha+\beta+1)(\alpha+\beta)^2}& =&\displaystyle \frac{\theta(1-\theta)}{n_e}{}. \end{array} \end{aligned} $$
(1.41)

Solving for n e, one obtains n e = α + β + 1. If this value is felt to be unrealistically high, one should revise the prior assessments, increase the dispersion, and recalculate the prior; otherwise, one would be specifying more prior information about the proportion θ than would realistically be provided by a sample of size n e.

Example 1.10 (Elicitation of a Beta Prior)

Suppose that a prior distribution needs to be elicited for the proportion θ of non-counterfeit merchandise (e.g., medicines) in a target population. It is thought that the distribution is centered around 0.8 with a standard deviation equal to 0.1. Parameters α and β can be elicited as in (1.38) and (1.39): with m = 0.8 and s = 0.1, one obtains α = 0.8 × (0.8 × 0.2∕0.01 − 1) = 12 and β = 0.2 × (0.8 × 0.2∕0.01 − 1) = 3.

Figure 1.1 shows the elicited beta prior Be(12, 3).

Fig. 1.1 Prior distribution Be(12, 3) over θ in Example 1.10

The equivalent sample size is 12 + 3 + 1 = 16. That is, the information conveyed by the elicited prior is roughly equivalent to the information that would be provided by a sample of size 16.
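The elicitation in this example can be reproduced in R as follows (m = 0.8 and s = 0.1 are the values given above; the call to curve produces the graphical check recommended earlier):

m <- 0.8; s <- 0.1

# Hyperparameters from (1.38) and (1.39)
a <- m * (m * (1 - m) / s^2 - 1)
b <- (1 - m) * (m * (1 - m) / s^2 - 1)
c(alpha = a, beta = b)                 # 12 and 3

# Inspect the elicited prior graphically
curve(dbeta(x, a, b), from = 0, to = 1,
      xlab = expression(theta), ylab = "Prior density")

# Equivalent sample size
a + b + 1                              # 16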

An objection to this procedure might be that while specifying a value for the location of the prior may be feasible, this may not necessarily be so for the dispersion. In many cases, the available prior knowledge takes the form of a realization (x 1, …, x n) of a random sample of size n from a previous experiment. In this case, it is sufficient to set n e = n and solve (1.40) and (1.41) with respect to α and β:

$$\displaystyle \begin{aligned} \begin{array}{rcl} \alpha& =&\displaystyle p(n-1){}, \end{array} \end{aligned} $$
(1.42)
$$\displaystyle \begin{aligned} \begin{array}{rcl} \beta& =&\displaystyle (1-p)(n-1){}, \end{array} \end{aligned} $$
(1.43)

where θ has been estimated by the sample proportion \(\hat \theta =p=\sum _{i=1}^{n}x_i/n\). One can immediately verify that whenever the hyperparameters α and β are elicited as in (1.42) and (1.43), then α + β + 1 = n. The elicited parameters reflect the amount of information provided by a sample of size n.
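A brief sketch, assuming a hypothetical previous sample of size n = 50 with 40 items found to be non-counterfeit:

# Hypothetical previous sample: 40 out of 50 items non-counterfeit
n <- 50; p <- 40 / 50

# Hyperparameters from (1.42) and (1.43)
a <- p * (n - 1)
b <- (1 - p) * (n - 1)
c(alpha = a, beta = b, equivalent.sample.size = a + b + 1)   # a + b + 1 equals n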

Some further practical examples will be provided throughout the book. For an extended discussion of prior elicitation, the reader can refer to Garthwaite et al. (2005) and O’Hagan et al. (2006).

1.11 Sensitivity Analysis

In Sect. 1.4, it has been emphasized that the Bayes factor is not a measure of the relative support for the competing propositions provided by the data alone. The Bayes factor is influenced by the choice and the elicitation of the subjective prior densities (probabilities) for model parameters under propositions H 1 and H 2. This reflects background knowledge that may be available to analysts. For this reason, prior elicitation of model parameters must not be confused with prior probabilities of the propositions of interest.

While the computation of the Bayes factor requires prior assessments about unknown quantities, a main objection to the choice of such prior distributions is that they may be hard to define, in particular when the available information is limited. Situations characterized by an abundance of relevant data that can be used to construct a prior distribution may be rare. Generally, the choice of a prior is the result of a subtle combination of relevant information, published data, and explainable personal knowledge of the expert. The specification of the prior must be taken seriously, because it can be shown that even when a large amount of evidence is available, the marginal likelihood is highly sensitive to the choice of the prior distribution, and so is the Bayes factor (Gelman et al., 2014). This is different for the posterior distribution that is dominated by the likelihood.

Sensitivity analyses allow one to explore how results may be affected by changes in the priors (e.g. Kass & Raftery, 1995; Kass, 1993; Liu & Aitkin, 2008). This, however, may turn out to be computationally intensive and time consuming. An alternative approach has been proposed by Sinharay and Stern (2002) for comparing nested models, though it can be extended to non-nested models. The general idea is to assess the sensitivity of the Bayes factor to the prior distribution for a given parameter θ by computing the Bayes factor for a vector of parameter values (or a grid of parameter values in the case of a two-dimensional vector parameter θ). The result is a graphical representation of the Bayes factor (i.e., a sensitivity curve) as a function of θ, say BFθ. In this way, one can get an idea about the Bayes factor one could obtain for different values of θ, and thus about the sensitivity of the Bayes factor to various prior distributions. These prior distributions have their mass concentrated on different apportionments of the parameter space. For one or two-dimensional problems, the inspection of a sensitivity curve represents a straightforward and effective approach to study the impact of varying parameter values on the BF under consideration. An example is given in Sect. 2.3.1 for the choice of the prior distribution about a Normal mean. A sensitivity analysis with respect to the prior probability assessments of competing propositions is provided in Sect. 3.2.3.
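As a minimal sketch of such a sensitivity curve, the Bayes factor of the earlier fingermark sketch (Example 1.7) can be computed over a grid of values of the prior mean μ 1, holding the other inputs fixed; all numerical values are illustrative.

# Illustrative fixed inputs (as in the earlier fingermark sketch)
y <- 14
sigma1 <- 1.5; sigma2 <- 1.8
tau1 <- 0.8; tau2 <- 0.8
mu2 <- 16

# Bayes factor over a grid of values for the prior mean mu1
mu1.grid <- seq(11, 15, length.out = 200)
BF.grid <- dnorm(y, mean = mu1.grid, sd = sqrt(tau1^2 + sigma1^2)) /
  dnorm(y, mean = mu2, sd = sqrt(tau2^2 + sigma2^2))

# Sensitivity curve: BF as a function of mu1
plot(mu1.grid, BF.grid, type = "l", log = "y",
     xlab = expression(mu[1]), ylab = "BF")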

A further layer of sensitivity analyses relates to the choice of the utility/loss function. An example is presented in Sect. 2.2.3 for the choice of the loss function in the context of inference and decision about a population proportion. Section 4.4.1.2 gives an example for the investigation of the effect of different prior probabilities and loss values in the context of classification of skeletal remains.

A sensitivity analysis for Monte Carlo and Markov chain Monte Carlo procedures is presented in Sects. 2.2.2 and 3.4.1.3. In Sect. 4.3.3, a sensitivity analysis is developed for the choice of a smoothing parameter in a kernel density estimation.

1.12 Using R

R is a rich environment for data analysis and statistical computing. In its base package, it contains a large collection of functions for exploring, summarizing, and representing data graphically, handling many standard probability distributions and more. R includes a simple programming language that users can extend with new functions. Some basic instructions on the use of R or of particular functions are available from the R Help menu, or by using the command help.start(). The reader can refer to, for example, Verzani (2014) for a detailed introduction to the use of R for descriptive and inferential statistics, to Albert (2009) for an overview of elements of Bayesian computation with R, and to the R project home page (https://www.r-project.org) for more references. Datasets and routines used in the examples throughout this book are available on the website of this book (on http://link.springer.com/).

Generally, we will give results of R computations as produced directly by R. We do not make any recommendations as to the level of precision that scientists should use when reporting numerical results.

Published with the support of the Swiss National Science Foundation (Grant no. 10BP12_208532/1).