Abstract
This study asked whether latent class modeling methods and multiple ratings of the same cases might permit quantification of the accuracy of forensic assessments. Five evaluators examined 156 redacted court reports concerning criminal defendants who had undergone hospitalization for evaluation or restoration of their adjudicative competence. Evaluators rated each defendant’s Dusky-defined competence to stand trial on a five-point scale as well as each defendant’s understanding of, appreciation of, and reasoning about criminal proceedings. Having multiple ratings per defendant made it possible to estimate accuracy parameters using maximum likelihood and Bayesian approaches, despite the absence of any “gold standard” for the defendants’ true competence status. Evaluators appeared to be very accurate, though this finding should be viewed with caution.
References
Akaike, H. (1974). A new look at the statistical model identification. IEEE Transactions on Automatic Control, 19, 716–723.
Akinkunmi, A. A. (2002). The MacArthur Competence Assessment Tool—Fitness to Plead: A preliminary evaluation of a research instrument for assessing fitness to plead in England and Wales. Journal of the American Academy of Psychiatry and the Law, 30, 476–482.
Albert, P. S. (2007). Random effects modeling approaches for estimating ROC curves from repeated ordinal tests without a gold standard. Biometrics, 63, 593–602.
American Academy of Psychiatry and the Law. (May 2005). Ethics guidelines for the practice of forensic psychiatry. http://www.aapl.org/ethics.htm. Accessed 19 Sept 2008.
Bennett, G. (1985). A guided tour through selected ABA standards relating to incompetence to stand trial: Incompetence to stand trial. George Washington Law Review, 53, 375–413.
Berg, W. A., Blume, J. D., Cormack, J. B., Mendelson, E. B., Lehrer, D., Böhm-Vélez, M., et al. (2008). Combined screening with ultrasound and mammography vs mammography alone in women at elevated risk of breast cancer. Journal of the American Medical Association, 299, 2151–2163.
Boccaccini, M. T., Turner, D., & Murrie, D. C. (2008). Do some evaluators report consistently higher or lower psychopathy scores than others? Findings from a statewide sample of sexually violent predator evaluations. Psychology, Public Policy, and Law, 14, 262–283.
Bonnie, R. J. (1990). The competence of criminal defendants with mental retardation to participate in their own defense. Journal of Criminal Law and Criminology, 81, 419–446.
Brooks, S. P., & Gelman, A. (1998). Alternative methods for monitoring convergence of iterative simulations. Journal of Computational and Graphical Statistics, 7, 434–455.
Buchanan, A. (2006). Competency to stand trial and the seriousness of the charge. Journal of the American Academy of Psychiatry and the Law, 34, 458–465.
Cain, D. M., & Detsky, A. S. (2008). Everyone’s a little bit biased (even physicians). Journal of the American Medical Association, 299, 2893–2895.
Cain, D. M., Loewenstein, G., & Moore, D. A. (2005). The dirt on coming clean: Perverse effects of disclosing conflicts of interest. Journal of Legal Studies, 34, 1–25.
Carlin, B. P., & Louis, T. A. (2000). Bayes and empirical Bayes methods for data analysis (2nd ed.). London: Chapman & Hall.
Choi, Y. K., Johnson, W. O., Collins, M. T., & Gardner, I. A. (2006). Bayesian inferences for receiver operating characteristic curves in the absence of a gold standard. Journal of Agricultural, Biological, and Environmental Statistics, 11, 210–229.
Committee on the Revision of the Specialty Guidelines for Forensic Psychology. (11 January 2006). Specialty guidelines for forensic psychology, second official draft. http://www.ap-ls.org/links/. Accessed 19 Sept 2008.
Cooper, V. G., & Zapf, P. A. (2003). Predictor variables in competency to stand trial decisions. Law and Human Behavior, 27, 423–436.
Cronbach, L. J., & Meehl, P. E. (1955). Construct validity in psychological tests. Psychological Bulletin, 52, 281–302.
Daubert v. Merrell Dow Pharmaceuticals, 509 U.S. 579 (1993).
Dawes, R. M., Faust, D., & Meehl, P. E. (1989). Clinical versus actuarial judgment. Science, 243, 1668–1674.
Douglas, K. S., Ogloff, J. R., Nicholls, T. L., & Grant, I. (1999). Assessing risk for violence among psychiatric patients: The HCR-20 violence risk assessment scheme and the Psychopathy Checklist: Screening Version. Journal of Consulting and Clinical Psychology, 67, 917–930.
Dusky v. United States, 362 U.S. 402 (1960).
Faigman, D. L., Saks, M. J., Sanders, J., & Cheng, E. K. (2008). Modern scientific evidence: Standards, statistics, and research methods (student ed.). Eagan, MN: Thomson West.
Faraone, S. V., & Tsuang, M. T. (1994). Measuring diagnostic accuracy in the absence of a “gold standard.” American Journal of Psychiatry, 151, 650–657.
Gardner, W., Lidz, C. W., Mulvey, E. P., & Shaw, E. C. (1996). Clinical versus actuarial predictions of violence of patients with mental illnesses. Journal of Consulting and Clinical Psychology, 64, 602–609.
Gelfand, A. E., & Smith, A. F. M. (1990). Sampling-based approaches to calculating marginal densities. Journal of the American Statistical Association, 85, 389–409.
Geman, S., & Geman, D. (1984). Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 6, 721–741.
Golding, S. L., Roesch, R., & Schreiber, J. (1984). Assessment and conceptualization of competency to stand trial: Preliminary data on the Interdisciplinary Fitness Interview. Law and Human Behavior, 8, 321–334.
Grisso, T. (2003). Legally relevant assessments for legal competencies. In T. Grisso (Ed.), Evaluating competencies: Forensic assessments and instruments (2nd ed., pp. 21–40). New York: Kluwer Academic/Plenum Publishers.
Gutheil, T. G. (2004). The expert witness. In R. I. Simon & L. H. Gold (Eds.), The American Psychiatric Publishing textbook of forensic psychiatry (pp. 75–89). Arlington, VA: American Psychiatric Publishing.
Hagen, M. A. (1997). Whores of the court: The fraud of psychiatric testimony and the rape of American justice. New York: ReganBooks.
Hanley, J. A., & McNeil, B. J. (1982). The meaning and use of the area under the receiver operating characteristic (ROC) curve. Radiology, 143, 29–36.
Harris, G. T., Rice, M. E., & Cormier, C. A. (2002). Prospective replication of the Violence Risk Appraisal Guide in predicting violent recidivism among forensic patients. Law and Human Behavior, 26, 377–394.
Henkelman, R. M., Kay, I., & Bronskill, M. J. (1990). Receiver operator characteristic (ROC) analysis without truth. Medical Decision Making, 10, 24–29.
Jackson v. Indiana, 406 U.S. 715 (1972).
Jacobs, M. S., Ryba, N. L., & Zapf, P. A. (2008). Competence-related abilities and psychiatric symptoms: An analysis of the underlying structure and correlates of the MacCAT-CA and the BPRS. Law and Human Behavior, 32, 64–77.
Kim, S. Y. H., Appelbaum, P. S., Swan, J., Stroup, T. S., McEvoy, J. P., Goff, D. C., et al. (2007). Determining when impairment constitutes incapacity for informed consent in schizophrenia research. British Journal of Psychiatry, 191, 38–43.
Kumho Tire Co. v. Carmichael, 526 U.S. 137 (1999).
Lehman, C. D., Gatsonis, C., Kuhl, C. K., Hendrick, R. E., Pisano, E. D., Hanna, L., et al. (2007). MRI evaluation of the contralateral breast in women with recently diagnosed breast cancer. New England Journal of Medicine, 356, 1295–1303.
Lunn, D. J., Thomas, A., Best, N., & Spiegelhalter, D. (2000). WinBUGS—A Bayesian modeling framework: Concepts, structure, and extensibility. Statistics and Computing, 10, 325–337.
McGraw, K. O., & Wong, S. P. (1996). Forming inferences about some intraclass correlations. Psychological Methods, 1, 30–46.
Melton, G. B., Petrila, J., Poythress, N., Slobogin, C., Lyons, P., & Otto, R. K. (2007). Psychological evaluations for the courts: A handbook for mental health professionals and lawyers (3rd ed.). New York: Guilford.
Metropolis, N., Rosenbluth, A., Rosenbluth, M., Teller, A., & Teller, E. (1953). Equation of state calculations by fast computing machines. Journal of Chemical Physics, 21, 1087–1091.
Miller, H. A. (2005). The Miller Forensic Assessment of Symptoms Test (M-FAST): Test generalizability and utility across race, literacy, and clinical opinion. Criminal Justice and Behavior, 32, 591–611.
Monahan, J., Steadman, H. J., Appelbaum, P. S., Grisso, T., Mulvey, E. P., Roth, L. H., et al. (2006). The classification of violence risk. Behavioral Sciences & the Law, 24, 721–730.
Mossman, D. (1999). “Hired guns”, “whores”, and “prostitutes”: Case law references to clinicians of ill repute. Journal of the American Academy of Psychiatry and the Law, 27, 414–425.
Mossman, D. (2005). Is prosecution “medically appropriate”? New England Journal on Criminal and Civil Confinement, 31, 15–80.
Mossman, D. (2007). Predicting restorability of incompetent criminal defendants. Journal of the American Academy of Psychiatry and the Law, 35, 34–43.
Mossman, D. (2008). Conceptualizing and characterizing accuracy in assessments of competence to stand trial. Journal of the American Academy of Psychiatry and the Law, 36, 340–351.
Mossman, D., Noffsinger, S. G., Ash, P., Frierson, R. L., Gerbasi, J., Hackett, M., et al. (2007). AAPL practice guideline for the forensic psychiatric evaluation of competence to stand trial. Journal of the American Academy of Psychiatry and the Law, 35(Suppl 4), S3–S72.
Mossman, D., & Somoza, E. (1991). ROC curves, test accuracy, and the description of diagnostic tests. Journal of Neuropsychiatry and Clinical Neurosciences, 3, 330–333.
Murrie, D. C., Boccaccini, M. T., Turner, D., Meeks, M., Woods, C., & Tussey, C. (2009). Rater (dis)agreement on risk assessment measures in sexually violent predator proceedings: Evidence of adversarial allegiance in forensic evaluation? Psychology, Public Policy, and Law, 15, 19–53.
Murrie, D. C., Boccaccini, M. T., Zapf, P. A., Warren, J. I., & Henderson, C. E. (2008). Clinician variation in findings of competence to stand trial. Psychology, Public Policy, and Law, 14, 177–193.
Obuchowski, N. A. (2003). Receiver operating characteristic curves and their use in radiology. Radiology, 229, 3–8.
Parry, J., & Drogin, E. Y. (2007). Mental disability law, evidence and testimony: A comprehensive reference manual for lawyers, judges, and mental disability professionals. Washington, DC: American Bar Association.
Pate v. Robinson, 383 U.S. 375 (1966).
Poythress, N., Monahan, J., Bonnie, R., Otto, R. K., & Hoge, S. K. (2002). Adjudicative competence: The MacArthur studies. New York: Kluwer/Plenum.
Qu, Y., Tan, M., & Kutner, M. H. (1996). Random effects models in latent class analysis for evaluating accuracy of diagnostic tests. Biometrics, 52, 797–810.
Rice, M. E., & Harris, G. T. (1995). Violent recidivism: Assessing predictive validity. Journal of Consulting and Clinical Psychology, 63, 737–748.
Rosenfeld, B., & Ritchie, K. (1998). Competence to stand trial: Clinician reliability and the role of offense severity. Journal of Forensic Sciences, 43, 151–159.
Skeem, J., Golding, S., Cohn, N., & Berge, G. (1998). The logic and reliability of expert opinion on competence to stand trial. Law and Human Behavior, 22, 519–547.
Small, G. W., Kepe, V., Ercoli, L. M., Siddarth, P., Bookheimer, S. Y., Miller, K. J., et al. (2006). PET of brain amyloid and tau in mild cognitive impairment. New England Journal of Medicine, 355, 2652–2663.
Somoza, E., & Mossman, D. (1991). ROC curves and the binormal assumption. Journal of Neuropsychiatry and Clinical Neurosciences, 3, 436–439.
Spencer, B. D. (2008). When do latent class models overstate accuracy for binary classifiers? With applications to jury accuracy, survey response error, and diagnostic error. Institute for Policy Research, Northwestern University, Working Paper Series WP-08-10.
State v. Sullivan, 739 N.E.2d 788 (Ohio 2001).
Steadman, H. J., Mulvey, E. P., Monahan, J., Robbins, P. C., Appelbaum, P. S., Grisso, T., et al. (1998). Violence by people discharged from acute psychiatric inpatient facilities and by others in the same neighborhoods. Archives of General Psychiatry, 55, 393–401.
Swets, J. A. (1995). Signal detection theory and ROC analysis in psychology and diagnostics: Collected papers. Mahwah, NJ: Lawrence Erlbaum Associates.
Uebersax, J. S. (1988). Validity inferences from interobserver agreement. Psychological Bulletin, 104, 405–416.
Uebersax, J. S., & Grove, W. M. (1990). Latent class analysis of diagnostic agreement. Statistics in Medicine, 9, 559–572.
Weissman, H. N., & DeBow, D. M. (2003). Ethical principles and professional competencies. In I. B. Weiner (Series Ed.) & A. M. Goldstein (Vol. Ed.), Handbook of psychology: Vol. 11. Forensic psychology (pp. 33–53). New York: Wiley.
Zapf, P. A., Hubbard, K. L., Cooper, V. G., Wheeles, M. C., & Ronan, K. A. (2004). Have the courts abdicated their responsibility for determination of competency to stand trial to clinicians? Journal of Forensic Psychology Practice, 4, 27–44.
Zhou, X. H., Castelluccio, P., & Zhou, C. (2005). Nonparametric estimation of ROC curves in the absence of a gold standard. Biometrics, 61, 600–609.
Zweig, M. H., & Campbell, G. (1993). Receiver operating characteristic (ROC) plots: A fundamental evaluation tool in clinical medicine. Clinical Chemistry, 39, 561–577.
Appendix
Adopting the notation used by Albert (2007), suppose I subjects (i = 1, 2, …, I) undergo assessment by J raters (j = 1, 2, …, J), who assign ordinal ratings k = 1, 2, …, K to each subject. Without loss of generality, let rating k = 1 indicate lowest confidence and k = K indicate highest confidence that a subject has the condition or disorder D of interest (here, incompetence to stand trial). Let \( Y_i = (Y_{i1}, Y_{i2}, \ldots, Y_{iJ})' \) be a vector representing the ratings made by the J raters for the ith subject. Because each of the J raters could assign any one of K ratings to each subject, each \( Y_i \) has \( K^J \) possible combinations of elements. The joint distribution of \( Y_i \), expressed as \( P(Y_i) \), the probability of \( Y_i \), is

$$ P(Y_i) = P(Y_i \mid d_i = 1)\,P(d_i = 1) + P(Y_i \mid d_i = 0)\,P(d_i = 0), \qquad (1) $$

where \( d_i = 1 \) means the ith subject has condition D, \( d_i = 0 \) means the ith subject does not have D, \( P(d_i = 1) \) is the probability or prevalence of D, and \( P(d_i = 0) = 1 - P(d_i = 1) \).
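To make Eq. 1 concrete, here is a minimal Python sketch (not the authors' GAUSS code); the rating-category probabilities are hypothetical numbers chosen purely for illustration, and a simple conditional-independence model stands in for \( P(Y_i \mid d_i) \):

```python
import numpy as np

# Hypothetical per-class rating distributions (illustration only):
# P(rating = k | d) for k = 1..5, for the non-D (d = 0) and D (d = 1) classes.
P_K = {0: np.array([0.40, 0.30, 0.15, 0.10, 0.05]),
       1: np.array([0.05, 0.10, 0.15, 0.30, 0.40])}

def cond_prob(y, d):
    """P(Y_i = y | d_i = d) under conditional independence:
    a product of per-rater category probabilities."""
    return float(np.prod(P_K[d][np.asarray(y) - 1]))

def joint_prob(y, prev):
    """Eq. 1: P(Y_i) as a two-class mixture with prevalence P(d_i = 1)."""
    return cond_prob(y, 1) * prev + cond_prob(y, 0) * (1.0 - prev)

# Five raters all rate one subject near the top of the scale:
print(joint_prob([5, 4, 5, 4, 5], prev=0.3))
```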
We would like to model \( P(Y_i \mid d_i) \) so as to include possible conditional dependence (CD) of ratings, i.e., similarity in raters' responses attributable to specific characteristics of subjects, besides their membership in the D or non-D subgroups, that affect how easy or hard their particular cases are. Following Albert, we use a probit link function for the parameterization,

$$ \Phi^{-1}\!\left[ P(Y_{ij} \le k \mid d_i, b_{d_i,i}) \right] = C_{d_i,k,j} + b_{d_i,i}, \qquad k = 1, \ldots, K - 1, \qquad (2) $$

where \( \Phi \) is the cumulative standard normal distribution function and \( \Phi^{-1} \) is its inverse, \( C_{d_i,k,j} \) are monotonically increasing cut-offs for the jth rater, and \( b_{d_i,i} \) is a random effect attributable to each subject that characterizes conditional dependence in multiple ratings of that subject.
Notice that \( b_{d_i,i} \) depends on the latent class of each subject—i.e., whether the individual does or does not have D. Following Albert (2007) and Qu et al. (1996), we used the random effect model \( b_{d_i,i} = \sigma_{d_i} b_i \), where \( b_i \) has a standard normal distribution. Equation 2 thus says that the cut-off points demarcating each rater's classification thresholds reflect the presence (\( d_i = 1 \)) or absence (\( d_i = 0 \)) of D. However, the probability that the jth rater will assign rating k to the ith subject reflects the ith subject's state (\( d_i = 0 \) or \( d_i = 1 \)), the locations of the rater's particular cut-offs, and peculiarities of the ith subject (which act in common across all raters). We characterize the random effects of the D and non-D populations separately because their cut-offs are not linked (as they would be under the "binormal" ROC model; see Somoza & Mossman, 1991).
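The likelihood this model implies can be sketched in Python (again, an illustration rather than the authors' implementation); the cut-offs, σ value, and quadrature order below are assumptions for the example. The N(0,1) random effect \( b_i \) is integrated out by Gauss–Hermite quadrature:

```python
import numpy as np
from scipy.stats import norm

def category_probs(cuts_dj, sigma_d, b):
    """P(Y_ij = k | d_i, b) for k = 1..K under Eq. 2:
    P(Y_ij <= k | d_i, b) = Phi(C_{d_i,k,j} + sigma_{d_i} * b)."""
    edges = np.concatenate(([-np.inf], cuts_dj, [np.inf]))
    return np.diff(norm.cdf(edges + sigma_d * b))

def cond_lik(y, cuts_d, sigma_d, n_quad=30):
    """P(Y_i | d_i): product over raters, with the N(0,1) random
    effect b_i removed by Gauss-Hermite quadrature."""
    x, w = np.polynomial.hermite.hermgauss(n_quad)
    b_nodes, weights = np.sqrt(2.0) * x, w / np.sqrt(np.pi)
    total = 0.0
    for b, wt in zip(b_nodes, weights):
        p = 1.0
        for j, k in enumerate(y):
            p *= category_probs(cuts_d[j], sigma_d, b)[k - 1]
        total += wt * p
    return total

# Illustrative cut-offs for J = 5 raters (K - 1 = 4 each), D class:
cuts_1 = [np.array([-1.5, -0.5, 0.5, 1.5])] * 5
print(cond_lik([5, 4, 5, 4, 5], cuts_1, sigma_d=0.8))
```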
In our data set, I = 156, J = 5, and K = 5. Thus, in our CD model, the I × J ratings (J raters evaluating I subjects) arise from 2(K − 1)J + 3 = 43 parameters: K − 1 cut-offs for the D subgroup and K − 1 cut-offs for the non-D subgroup for each rater, plus the random-effect scale parameters \( \sigma_0 \) and \( \sigma_1 \) for the two subgroups, plus the prevalence \( P(d_i = 1) \) in the rating set. We sought values for the 43 parameters that would, in combination, be most likely to have generated the 5 × 156 rating matrix. We could then construct individual raters' ROC graphs using (fpr, tpr) coordinates. Marginalizing over the random effect gives \( P(Y_{ij} \le k \mid d_i) = \Phi\big( C_{d_i,k,j} / \sqrt{1 + \sigma_{d_i}^2} \big) \), so treating "rating ≥ k" as a positive finding yields, for k = 2, …, K,

$$ \mathrm{fpr}_{j,k} = 1 - \Phi\!\left( \frac{C_{0,k-1,j}}{\sqrt{1 + \sigma_0^2}} \right), \qquad \mathrm{tpr}_{j,k} = 1 - \Phi\!\left( \frac{C_{1,k-1,j}}{\sqrt{1 + \sigma_1^2}} \right). $$
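A Python sketch of this computation (cut-off and σ values again hypothetical):

```python
import numpy as np
from scipy.stats import norm

def roc_points(cuts_0j, cuts_1j, sigma_0, sigma_1):
    """(fpr, tpr) operating points for one rater, treating 'rating >= k'
    as a positive finding for k = 2..K; endpoints (0,0) and (1,1) added."""
    fpr = 1.0 - norm.cdf(np.asarray(cuts_0j) / np.hypot(1.0, sigma_0))
    tpr = 1.0 - norm.cdf(np.asarray(cuts_1j) / np.hypot(1.0, sigma_1))
    # cut-offs increase with k, so reverse to sort points by ascending fpr
    return (np.concatenate(([0.0], fpr[::-1], [1.0])),
            np.concatenate(([0.0], tpr[::-1], [1.0])))

fpr, tpr = roc_points(np.array([0.0, 0.8, 1.6, 2.4]),
                      np.array([-2.4, -1.6, -0.8, 0.0]),
                      sigma_0=0.6, sigma_1=0.8)
print(list(zip(fpr.round(3), tpr.round(3))))
```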
Under a conditional independence (CI) assumption (equivalent to setting \( b_{d_i,i} = 0 \), i.e., \( \sigma_0 = \sigma_1 = 0 \)), ROC graphs for the five raters could be constructed from estimates of 2(K − 1)J + 1 = 41 parameters.
We estimated the model's accuracy parameters in two ways. The first approach, standard maximum likelihood estimation (MLE), used GAUSS 3.6 code kindly furnished by Albert and modified for our data. (The modified code, which calls the GAUSS 4.0 maxlik library's quasi-Newton BFGS optimization algorithm, is available from the first author.) The natural logarithm of the likelihood function is (slightly modifying Albert's notation)

$$ \ln L = \sum_{i=1}^{I} \ln L_i = \sum_{i=1}^{I} \ln P(Y_i), $$
with \( P(Y_i) \) given by Eq. 1. Knowing (fpr, tpr) coordinates for each rater permitted computation of "trapezoidal" AUCs as overall measures of rater accuracy, with standard errors computed using the method of Hanley and McNeil (1982).
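A sketch of both steps in Python, continuing the roc_points example above; the class counts n_pos and n_neg are hypothetical, since the study had no gold-standard assignment of cases to the D and non-D groups:

```python
import numpy as np

def trapezoidal_auc(fpr, tpr):
    """Area under the ROC curve by the trapezoidal rule; the point
    lists should include the endpoints (0,0) and (1,1)."""
    order = np.argsort(fpr)
    return float(np.trapz(np.asarray(tpr)[order], np.asarray(fpr)[order]))

def hanley_mcneil_se(auc, n_pos, n_neg):
    """Standard error of a trapezoidal AUC per Hanley & McNeil (1982)."""
    q1 = auc / (2.0 - auc)
    q2 = 2.0 * auc**2 / (1.0 + auc)
    var = (auc * (1.0 - auc)
           + (n_pos - 1) * (q1 - auc**2)
           + (n_neg - 1) * (q2 - auc**2)) / (n_pos * n_neg)
    return float(np.sqrt(var))

auc = trapezoidal_auc(fpr, tpr)   # fpr, tpr from the roc_points sketch above
print(auc, hanley_mcneil_se(auc, n_pos=50, n_neg=100))  # hypothetical counts
```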
In contrast to MLE, which provides point estimates of the parameter values most likely to have generated the observed data, Bayesian estimation summarizes knowledge of unknown parameters using “posterior” distributions representing the probability that a parameter has a particular value, given the observed data. According to Bayes’ Rule, the posterior probability of a parameter’s value is proportional to the likelihood of observing the data given that parameter value, multiplied by a “prior” probability of the parameter’s value. The likelihood function is dictated by statistical model choice, and is the same construct as in MLE. When a prior is “non-informative” (e.g., P(θ) = c for all θ ∈ [a,b], where [a,b] is an arbitrarily large bounded interval, c is a constant, and θ is a parameter), Bayesian and MLE methods yield similar inferences (Carlin & Louis, 2000). However, in Bayesian estimation, inference is conducted directly on the unknown parameters (or functions thereof, such as AUC), while in MLE, inference is conducted on the data. Hence, only Bayesian estimation allows direct probability statements such as “the probability that the AUC for rater j is between .955 and .973 is 95%.”
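Stated symbolically, with \( \theta \) denoting the full parameter vector and Y the observed ratings, the rule just described is

$$ p(\theta \mid Y) = \frac{L(Y \mid \theta)\, p(\theta)}{\int L(Y \mid \theta')\, p(\theta')\, d\theta'} \;\propto\; L(Y \mid \theta)\, p(\theta). $$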
Markov chain Monte Carlo (MCMC) methods (Gelfand & Smith, 1990; Geman & Geman, 1984; Metropolis, Rosenbluth, Rosenbluth, Teller, & Teller, 1953) are used to make inferences about posterior distributions for which the absence of analytic methods would otherwise make Bayes' Rule intractable. Under mild regularity conditions, a Markov chain converges to a unique invariant or "target" distribution. To use MCMC methods for Bayesian analysis, one constructs the transition kernel so that the target distribution of the resulting Markov chain is the joint posterior distribution of interest. After discarding the draws from initial "burn-in" iterations, one can use the remaining draws to make inferences about model parameters. WinBUGS is a free software package that allows specification of a Bayesian model, determines the transition kernel for the Markov chain, and produces draws from the joint posterior distribution of the unknown parameters (Lunn et al., 2000).
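WinBUGS constructs the transition kernel automatically; the following Python sketch of a random-walk Metropolis sampler illustrates only the mechanics, with log_post standing in for the log posterior of whatever model is being fit:

```python
import numpy as np

rng = np.random.default_rng(0)

def metropolis(log_post, theta0, n_iter, step):
    """Random-walk Metropolis: the chain's invariant (target)
    distribution is the posterior proportional to exp(log_post)."""
    theta = np.asarray(theta0, dtype=float)
    lp = log_post(theta)
    draws = np.empty((n_iter, theta.size))
    for t in range(n_iter):
        proposal = theta + step * rng.standard_normal(theta.size)
        lp_prop = log_post(proposal)
        if np.log(rng.uniform()) < lp_prop - lp:  # accept w.p. min(1, ratio)
            theta, lp = proposal, lp_prop
        draws[t] = theta
    return draws

# Toy check: target N(0, 1); discard burn-in before making inferences.
draws = metropolis(lambda th: -0.5 * float(th @ th), [5.0], 6000, 1.0)
print(draws[1000:].mean(), draws[1000:].std())
```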
For our Bayesian analyses, we used minimally informative priors and the same statistical models as in our MLE approach. WinBUGS 1.4.3 ran five parallel MCMC chains; the Brooks–Gelman–Rubin diagnostic (Brooks & Gelman, 1998) indicated convergence after 2000–5000 iterations. We ran each chain for 15,000 iterations and treated each chain’s first 10,000 iterations as “burn-in” values to be discarded, leaving 5 × 5000 = 25,000 draws for inference. Because our WinBUGS code (available from the first author upon request) calculated (fpr, tpr) coordinates and trapezoidal AUCs directly from the MCMC parameter draws, we obtained samples of and made inferences about our accuracy statistics directly from the posterior distributions.
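The inferential steps just described can be sketched as follows, building on the metropolis function above; log_post, theta0, and auc_from_params are hypothetical stand-ins (in a real analysis, the last would compute a trapezoidal AUC from one draw of the cut-off and σ parameters, as in the earlier sketches):

```python
import numpy as np

# Placeholder stand-ins so the sketch runs -- not the authors' WinBUGS code:
theta0 = np.zeros(3)                                 # initial parameter vector
log_post = lambda th: -0.5 * float(th @ th)          # placeholder log posterior
auc_from_params = lambda th: 1.0 / (1.0 + np.exp(-th[0]))  # placeholder AUC map

chains = [metropolis(log_post, theta0, n_iter=15_000, step=0.5)
          for _ in range(5)]                       # five parallel chains
kept = np.vstack([c[10_000:] for c in chains])     # 5 x 5,000 = 25,000 draws
auc_draws = np.array([auc_from_params(th) for th in kept])
lo, hi = np.percentile(auc_draws, [2.5, 97.5])
print(f"posterior mean AUC = {auc_draws.mean():.3f}; "
      f"95% credible interval = ({lo:.3f}, {hi:.3f})")
```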
Cite this article
Mossman, D., Bowen, M.D., Vanness, D.J. et al. Quantifying the Accuracy of Forensic Examiners in the Absence of a “Gold Standard”. Law Hum Behav 34, 402–417 (2010). https://doi.org/10.1007/s10979-009-9197-5