Abstract
This study asked whether latent class modeling methods and multiple ratings of the same cases might permit quantification of the accuracy of forensic assessments. Five evaluators examined 156 redacted court reports concerning criminal defendants who had undergone hospitalization for evaluation or restoration of their adjudicative competence. Evaluators rated each defendant’s Dusky-defined competence to stand trial on a five-point scale as well as each defendant’s understanding of, appreciation of, and reasoning about criminal proceedings. Having multiple ratings per defendant made it possible to estimate accuracy parameters using maximum likelihood and Bayesian approaches, despite the absence of any “gold standard” for the defendants’ true competence status. Evaluators appeared to be very accurate, though this finding should be viewed with caution.
References
Akaike, H. (1974). A new look at the statistical model identification. IEEE Transactions on Automatic Control, 19, 716–723.
Akinkunmi, A. A. (2002). The MacArthur Competence Assessment Tool—Fitness to Plead: A preliminary evaluation of a research instrument for assessing fitness to plead in England and Wales. Journal of the American Academy of Psychiatry and the Law, 30, 476–482.
Albert, P. S. (2007). Random effects modeling approaches for estimating ROC curves from repeated ordinal tests without a gold standard. Biometrics, 63, 593–602.
American Academy of Psychiatry and the Law. (May 2005). Ethics guidelines for the practice of forensic psychiatry. http://www.aapl.org/ethics.htm. Accessed 19 Sept 2008.
Bennett, G. (1985). A guided tour through selected ABA standards relating to incompetence to stand trial: Incompetence to stand trial. George Washington Law Review, 53, 375–413.
Berg, W. A., Blume, J. D., Cormack, J. B., Mendelson, E. B., Lehrer, D., Böhm-Vélez, M., et al. (2008). Combined screening with ultrasound and mammography vs mammography alone in women at elevated risk of breast cancer. Journal of the American Medical Association, 299, 2151–2163.
Boccaccini, M. T., Turner, D., & Murrie, D. C. (2008). Do some evaluators report consistently higher or lower psychopathy scores than others? Findings from a statewide sample of sexually violent predator evaluations. Psychology, Public Policy, and Law, 14, 262–283.
Bonnie, R. J. (1990). The competence of criminal defendants with mental retardation to participate in their own defense. Journal of Criminal Law and Criminology, 81, 419–446.
Brooks, S. P., & Gelman, A. (1998). Alternative methods for monitoring convergence of iterative simulations. Journal of Computational and Graphical Statistics, 7, 434–455.
Buchanan, A. (2006). Competency to stand trial and the seriousness of the charge. Journal of the American Academy of Psychiatry and the Law, 34, 458–465.
Cain, D. M., & Detsky, A. S. (2008). Everyone’s a little bit biased (even physicians). Journal of the American Medical Association, 299, 2893–2895.
Cain, D. M., Loewenstein, G., & Moore, D. A. (2005). The dirt on coming clean: Perverse effects of disclosing conflicts of interest. Journal of Legal Studies, 34, 1–25.
Carlin, B. P., & Louis, T. A. (2000). Bayes and empirical Bayes methods for data analysis (2nd ed.). London: Chapman & Hall.
Choi, Y. K., Johnson, W. O., Collins, M. T., & Gardner, I. A. (2006). Bayesian inferences for receiver operating characteristic curves in the absence of a gold standard. Journal of Agricultural, Biological, and Environmental Statistics, 11, 210–229.
Committee on the Revision of the Specialty Guidelines for Forensic Psychology. (11 January 2006). Specialty guidelines for forensic psychology, second official draft. http://www.ap-ls.org/links/. Accessed 19 Sept 2008.
Cooper, V. G., & Zapf, P. A. (2003). Predictor variables in competency to stand trial decisions. Law and Human Behavior, 27, 423–436.
Cronbach, L. J., & Meehl, P. E. (1955). Construct validity in psychological tests. Psychological Bulletin, 52, 281–302.
Daubert v. Merrell Dow Pharmaceuticals, 509 U.S. 579 (1993).
Dawes, R. M., Faust, D., & Meehl, P. E. (1989). Clinical versus actuarial judgment. Science, 243, 1668–1674.
Douglas, K. S., Ogloff, J. R., Nicholls, T. L., & Grant, I. (1999). Assessing risk for violence among psychiatric patients: The HCR-20 violence risk assessment scheme and the Psychopathy Checklist: Screening Version. Journal of Consulting and Clinical Psychology, 67, 917–930.
Dusky v. United States, 362 U.S. 402 (1960).
Faigman, D. L., Saks, M. J., Sanders, J., & Cheng, E. K. (2008). Modern scientific evidence: Standards, statistics, and research methods (student ed.). Eagan, MN: Thomson West.
Faraone, S. V., & Tsuang, M. T. (1994). Measuring diagnostic accuracy in the absence of a “gold standard.” American Journal of Psychiatry, 151, 650–657.
Gardner, W., Lidz, C. W., Mulvey, E. P., & Shaw, E. C. (1996). Clinical versus actuarial predictions of violence of patients with mental illnesses. Journal of Consulting and Clinical Psychology, 64, 602–609.
Gelfand, A. E., & Smith, A. F. M. (1990). Sampling-based approaches to calculating marginal densities. Journal of the American Statistical Association, 85, 389–409.
Geman, S., & Geman, D. (1984). Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 6, 721–741.
Golding, S. L., Roesch, R., & Schreiber, J. (1984). Assessment and conceptualization of competency to stand trial: Preliminary data on the Interdisciplinary Fitness Interview. Law and Human Behavior, 8, 321–334.
Grisso, T. (2003). Legally relevant assessments for legal competencies. In T. Grisso (Ed.), Evaluating competencies: Forensic assessments and instruments (2nd ed., pp. 21–40). New York: Kluwer Academic/Plenum Publishers.
Gutheil, T. G. (2004). The expert witness. In R. I. Simon & L. H. Gold (Eds.), The American Psychiatric Publishing textbook of forensic psychiatry (pp. 75–89). Arlington, VA: American Psychiatric Publishing.
Hagen, M. A. (1997). Whores of the court: The fraud of psychiatric testimony and the rape of American justice. New York: ReganBooks.
Hanley, J. A., & McNeil, B. J. (1982). The meaning and use of the area under the receiver operating characteristic (ROC) curve. Radiology, 143, 29–36.
Harris, G. T., Rice, M. E., & Cormier, C. A. (2002). Prospective replication of the Violence Risk Appraisal Guide in predicting violent recidivism among forensic patients. Law and Human Behavior, 26, 377–394.
Henkelman, R. M., Kay, I., & Bronskill, M. J. (1990). Receiver operator characteristic (ROC) analysis without truth. Medical Decision Making, 10, 24–29.
Jackson v. Indiana, 406 U.S. 715 (1972).
Jacobs, M. S., Ryba, N. L., & Zapf, P. A. (2008). Competence-related abilities and psychiatric symptoms: An analysis of the underlying structure and correlates of the MacCAT-CA and the BPRS. Law and Human Behavior, 32, 64–77.
Kim, S. Y. H., Appelbaum, P. S., Swan, J., Stroup, T. S., McEvoy, J. P., Goff, D. C., et al. (2007). Determining when impairment constitutes incapacity for informed consent in schizophrenia research. British Journal of Psychiatry, 191, 38–43.
Kumho Tire Co. v. Carmichael, 526 U.S. 137 (1999).
Lehman, C. D., Gatsonis, C., Kuhl, C. K., Hendrick, R. E., Pisano, E. D., Hanna, L., et al. (2007). MRI evaluation of the contralateral breast in women with recently diagnosed breast cancer. New England Journal of Medicine, 356, 1295–1303.
Lunn, D. J., Thomas, A., Best, N., & Spiegelhalter, D. (2000). WinBUGS—A Bayesian modeling framework: Concepts, structure, and extensibility. Statistics and Computing, 10, 325–337.
McGraw, K. O., & Wong, S. P. (1996). Forming inferences about some intraclass correlations. Psychological Methods, 1, 30–46.
Melton, G. B., Petrila, J., Poythress, N., Slobogin, C., Lyons, P., & Otto, R. K. (2007). Psychological evaluations for the courts: A handbook for mental health professionals and lawyers (3rd ed.). New York: Guilford.
Metropolis, N., Rosenbluth, A., Rosenbluth, M., Teller, A., & Teller, E. (1953). Equation of state calculations by fast computing machines. Journal of Chemical Physics, 21, 1087–1091.
Miller, H. A. (2005). The Miller Forensic Assessment of Symptoms Test (M-FAST): Test generalizability and utility across race, literacy, and clinical opinion. Criminal Justice and Behavior, 32, 591–611.
Monahan, J., Steadman, H. J., Appelbaum, P. S., Grisso, T., Mulvey, E. P., Roth, L. H., et al. (2006). The classification of violence risk. Behavioral Sciences & the Law, 24, 721–730.
Mossman, D. (1999). “Hired guns”, “whores”, and “prostitutes”: Case law references to clinicians of ill repute. Journal of the American Academy of Psychiatry and the Law, 27, 414–425.
Mossman, D. (2005). Is prosecution “medically appropriate”? New England Journal on Criminal and Civil Confinement, 31, 15–80.
Mossman, D. (2007). Predicting restorability of incompetent criminal defendants. Journal of the American Academy of Psychiatry and the Law, 35, 34–43.
Mossman, D. (2008). Conceptualizing and characterizing accuracy in assessments of competence to stand trial. Journal of the American Academy of Psychiatry and the Law, 36, 340–351.
Mossman, D., Noffsinger, S. G., Ash, P., Frierson, R. L., Gerbasi, J., Hackett, M., et al. (2007). AAPL practice guideline for the forensic psychiatric evaluation of competence to stand trial. Journal of the American Academy of Psychiatry and the Law, 35(Suppl 4), S3–S72.
Mossman, D., & Somoza, E. (1991). ROC curves, test accuracy, and the description of diagnostic tests. Journal of Neuropsychiatry and Clinical Neurosciences, 3, 330–333.
Murrie, D. C., Boccaccini, M. T., Turner, D., Meeks, M., Woods, C., & Tussey, C. (2009). Rater (dis)agreement on risk assessment measures in sexually violent predator proceedings: Evidence of adversarial allegiance in forensic evaluation? Psychology, Public Policy, and Law, 15, 19–53.
Murrie, D. C., Boccaccini, M. T., Zapf, P. A., Warren, J. I., & Henderson, C. E. (2008). Clinician variation in findings of competence to stand trial. Psychology, Public Policy, and Law, 14, 177–193.
Obuchowski, N. A. (2003). Receiver operating characteristic curves and their use in radiology. Radiology, 229, 3–8.
Parry, J., & Drogin, E. Y. (2007). Mental disability law, evidence and testimony: A comprehensive reference manual for lawyers, judges, and mental disability professionals. Washington, DC: American Bar Association.
Pate v. Robinson, 383 U.S. 375 (1966).
Poythress, N., Monahan, J., Bonnie, R., Otto, R. K., & Hoge, S. K. (2002). Adjudicative competence: The MacArthur studies. New York: Kluwer/Plenum.
Qu, Y., Tan, M., & Kutner, M. H. (1996). Random effects models in latent class analysis for evaluating accuracy of diagnostic tests. Biometrics, 52, 797–810.
Rice, M. E., & Harris, G. T. (1995). Violent recidivism: Assessing predictive validity. Journal of Consulting and Clinical Psychology, 63, 737–748.
Rosenfeld, B., & Ritchie, K. (1998). Competence to stand trial: Clinician reliability and the role of offense severity. Journal of Forensic Sciences, 43, 151–159.
Skeem, J., Golding, S., Cohn, N., & Berge, G. (1998). The logic and reliability of expert opinion on competence to stand trial. Law and Human Behavior, 22, 519–547.
Small, G. W., Kepe, V., Ercoli, L. M., Siddarth, P., Bookheimer, S. Y., Miller, K. J., et al. (2006). PET of brain amyloid and tau in mild cognitive impairment. New England Journal of Medicine, 355, 2652–2663.
Somoza, E., & Mossman, D. (1991). ROC curves and the binormal assumption. Journal of Neuropsychiatry and Clinical Neurosciences, 3, 436–439.
Spencer, B. D. (2008). When do latent class models overstate accuracy for binary classifiers? With applications to jury accuracy, survey response error, and diagnostic error. Institute for Policy Research, Northwestern University, Working Paper Series WP-08-10.
State v. Sullivan, 739 N.E.2d 788 (Ohio 2001).
Steadman, H. J., Mulvey, E. P., Monahan, J., Robbins, P. C., Appelbaum, P. S., Grisso, T., et al. (1998). Violence by people discharged from acute psychiatric inpatient facilities and by others in the same neighborhoods. Archives of General Psychiatry, 55, 393–401.
Swets, J. A. (1995). Signal detection theory and ROC analysis in psychology and diagnostics: Collected papers. Mahwah, NJ: Lawrence Erlbaum Associates.
Uebersax, J. S. (1988). Validity inferences from interobserver agreement. Psychological Bulletin, 104, 405–416.
Uebersax, J. S., & Grove, W. M. (1990). Latent class analysis of diagnostic agreement. Statistics in Medicine, 9, 559–572.
Weissman, H. N., & DeBow, D. M. (2003). Ethical principles and professional competencies. In I. B. Weiner (Series Ed.) & A. M. Goldstein (Vol. Ed.), Handbook of psychology: Vol. 11. Forensic psychology (pp. 33–53). New York: Wiley.
Zapf, P. A., Hubbard, K. L., Cooper, V. G., Wheeles, M. C., & Ronan, K. A. (2004). Have the courts abdicated their responsibility for determination of competency to stand trial to clinicians? Journal of Forensic Psychology Practice, 4, 27–44.
Zhou, X. H., Castelluccio, P., & Zhou, C. (2005). Nonparametric estimation of ROC curves in the absence of a gold standard. Biometrics, 61, 600–609.
Zweig, M. H., & Campbell, G. (1993). Receiver operating characteristic (ROC) plots: A fundamental evaluation tool in clinical medicine. Clinical Chemistry, 39, 561–577.
Appendix
Adopting the notation used by Albert (2007), suppose I subjects (i = 1, 2, …, I) undergo assessment by J raters (j = 1, 2, …, J), who assign ordinal ratings k = 1, 2, …, K to each subject. Without loss of generality, let rating k = 1 indicate lowest confidence and k = K indicate highest confidence that a subject has the condition or disorder D of interest (here, incompetence to stand trial). Let \( Y_i = (Y_{i1}, Y_{i2}, \ldots, Y_{iJ})' \) be a vector representing the ratings made by the J raters for the ith subject. Because each of the J raters could assign any one of K ratings to each subject, each \( Y_i \) has \( K^J \) possible combinations of elements. The joint distribution of \( Y_i \), expressed as \( P(Y_i) \), the probability of \( Y_i \), is

$$ P(Y_i) = P(Y_i \mid d_i = 1)\,P(d_i = 1) + P(Y_i \mid d_i = 0)\,P(d_i = 0), \qquad (1) $$

where \( d_i = 1 \) means the ith subject has condition D, \( d_i = 0 \) means the ith subject does not have D, \( P(d_i = 1) \) is the probability or prevalence of D, and \( P(d_i = 0) = 1 - P(d_i = 1) \).
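To make Eq. 1 concrete, here is a minimal Python sketch (not the authors' GAUSS code); the rating-category probabilities are hypothetical numbers chosen purely for illustration, and a simple conditional-independence model stands in for \( P(Y_i \mid d_i) \):

```python
import numpy as np

# Hypothetical per-class rating distributions (illustration only):
# P(rating = k | d) for k = 1..5, for the non-D (d = 0) and D (d = 1) classes.
P_K = {0: np.array([0.40, 0.30, 0.15, 0.10, 0.05]),
       1: np.array([0.05, 0.10, 0.15, 0.30, 0.40])}

def cond_prob(y, d):
    """P(Y_i = y | d_i = d) under conditional independence:
    a product of per-rater category probabilities."""
    return float(np.prod(P_K[d][np.asarray(y) - 1]))

def joint_prob(y, prev):
    """Eq. 1: P(Y_i) as a two-class mixture with prevalence P(d_i = 1)."""
    return cond_prob(y, 1) * prev + cond_prob(y, 0) * (1.0 - prev)

# Five raters all rate one subject near the top of the scale:
print(joint_prob([5, 4, 5, 4, 5], prev=0.3))
```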
We would like to model \( P(Y_i \mid d_i) \) so as to include possible conditional dependence (CD) of ratings, i.e., similarity in raters' responses attributable to specific characteristics of subjects, besides their membership in the D or non-D subgroups, that affect how easy or hard their particular cases are. Following Albert, we use a probit link function for the parameterization,

$$ \Phi^{-1}\!\left[ P(Y_{ij} \le k \mid d_i, b_{d_i,i}) \right] = C_{d_i,k,j} + b_{d_i,i}, \qquad k = 1, \ldots, K - 1, \qquad (2) $$

where \( \Phi \) is the cumulative standard normal distribution function and \( \Phi^{-1} \) is its inverse, \( C_{d_i,k,j} \) are monotonically increasing cut-offs for the jth rater, and \( b_{d_i,i} \) is a random effect attributable to each subject that characterizes conditional dependence in multiple ratings of that subject.
Notice that \( b_{d_i,i} \) depends on the latent class of each subject—i.e., whether the individual does or does not have D. Following Albert (2007) and Qu et al. (1996), we used the random effect model \( b_{d_i,i} = \sigma_{d_i} b_i \), where \( b_i \) has a standard normal distribution. Equation 2 thus says that the cut-off points demarcating each rater's classification thresholds reflect the presence (\( d_i = 1 \)) or absence (\( d_i = 0 \)) of D. However, the probability that the jth rater will assign rating k to the ith subject reflects the ith subject's state (\( d_i = 0 \) or \( d_i = 1 \)), the locations of the rater's particular cut-offs, and peculiarities of the ith subject (which act in common across all raters). We characterize the random effects of the D and non-D populations separately because their cut-offs are not linked (as they would be under the "binormal" ROC model; see Somoza & Mossman, 1991).
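The likelihood this model implies can be sketched in Python (again, an illustration rather than the authors' implementation); the cut-offs, σ value, and quadrature order below are assumptions for the example. The N(0,1) random effect \( b_i \) is integrated out by Gauss–Hermite quadrature:

```python
import numpy as np
from scipy.stats import norm

def category_probs(cuts_dj, sigma_d, b):
    """P(Y_ij = k | d_i, b) for k = 1..K under Eq. 2:
    P(Y_ij <= k | d_i, b) = Phi(C_{d_i,k,j} + sigma_{d_i} * b)."""
    edges = np.concatenate(([-np.inf], cuts_dj, [np.inf]))
    return np.diff(norm.cdf(edges + sigma_d * b))

def cond_lik(y, cuts_d, sigma_d, n_quad=30):
    """P(Y_i | d_i): product over raters, with the N(0,1) random
    effect b_i removed by Gauss-Hermite quadrature."""
    x, w = np.polynomial.hermite.hermgauss(n_quad)
    b_nodes, weights = np.sqrt(2.0) * x, w / np.sqrt(np.pi)
    total = 0.0
    for b, wt in zip(b_nodes, weights):
        p = 1.0
        for j, k in enumerate(y):
            p *= category_probs(cuts_d[j], sigma_d, b)[k - 1]
        total += wt * p
    return total

# Illustrative cut-offs for J = 5 raters (K - 1 = 4 each), D class:
cuts_1 = [np.array([-1.5, -0.5, 0.5, 1.5])] * 5
print(cond_lik([5, 4, 5, 4, 5], cuts_1, sigma_d=0.8))
```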
In our data set, I = 156, J = 5, and K = 5. Thus, in our CD model, the I × J ratings (J raters evaluating I subjects) arise from 2(K − 1)J + 3 = 43 parameters: K − 1 cut-offs for the D subgroup and K − 1 cut-offs for the non-D subgroup for each rater, plus the random-effect scale parameters \( \sigma_0 \) and \( \sigma_1 \) for the two subgroups, plus the prevalence \( P(d_i = 1) \) in the rating set. We sought values for the 43 parameters that would, in combination, be most likely to have generated the 5 × 156 rating matrix. We could then construct individual raters' ROC graphs using (fpr, tpr) coordinates. Marginalizing over the random effect gives \( P(Y_{ij} \le k \mid d_i) = \Phi\big( C_{d_i,k,j} / \sqrt{1 + \sigma_{d_i}^2} \big) \), so treating "rating ≥ k" as a positive finding yields, for k = 2, …, K,

$$ \mathrm{fpr}_{j,k} = 1 - \Phi\!\left( \frac{C_{0,k-1,j}}{\sqrt{1 + \sigma_0^2}} \right), \qquad \mathrm{tpr}_{j,k} = 1 - \Phi\!\left( \frac{C_{1,k-1,j}}{\sqrt{1 + \sigma_1^2}} \right). $$
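A Python sketch of this computation (cut-off and σ values again hypothetical):

```python
import numpy as np
from scipy.stats import norm

def roc_points(cuts_0j, cuts_1j, sigma_0, sigma_1):
    """(fpr, tpr) operating points for one rater, treating 'rating >= k'
    as a positive finding for k = 2..K; endpoints (0,0) and (1,1) added."""
    fpr = 1.0 - norm.cdf(np.asarray(cuts_0j) / np.hypot(1.0, sigma_0))
    tpr = 1.0 - norm.cdf(np.asarray(cuts_1j) / np.hypot(1.0, sigma_1))
    # cut-offs increase with k, so reverse to sort points by ascending fpr
    return (np.concatenate(([0.0], fpr[::-1], [1.0])),
            np.concatenate(([0.0], tpr[::-1], [1.0])))

fpr, tpr = roc_points(np.array([0.0, 0.8, 1.6, 2.4]),
                      np.array([-2.4, -1.6, -0.8, 0.0]),
                      sigma_0=0.6, sigma_1=0.8)
print(list(zip(fpr.round(3), tpr.round(3))))
```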
Under a conditional independence (CI) assumption (equivalent to setting \( b_{d_i,i} = 0 \), i.e., \( \sigma_0 = \sigma_1 = 0 \)), ROC graphs for the five raters could be constructed from estimates of 2(K − 1)J + 1 = 41 parameters.
We estimated the model's accuracy parameters in two ways. The first approach, standard maximum likelihood estimation (MLE), used GAUSS 3.6 code kindly furnished by Albert and modified for our data. (The modified code, which calls the GAUSS 4.0 maxlik library's quasi-Newton BFGS optimization algorithm, is available from the first author.) The natural logarithm of the likelihood function is (slightly modifying Albert's notation)

$$ \ln L = \sum_{i=1}^{I} \ln L_i = \sum_{i=1}^{I} \ln P(Y_i), $$
with \( P(Y_i) \) given by Eq. 1. Knowing (fpr, tpr) coordinates for each rater permitted computation of "trapezoidal" AUCs as overall measures of rater accuracy, with standard errors computed using the method of Hanley and McNeil (1982).
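A sketch of both steps in Python, continuing the roc_points example above; the class counts n_pos and n_neg are hypothetical, since the study had no gold-standard assignment of cases to the D and non-D groups:

```python
import numpy as np

def trapezoidal_auc(fpr, tpr):
    """Area under the ROC curve by the trapezoidal rule; the point
    lists should include the endpoints (0,0) and (1,1)."""
    order = np.argsort(fpr)
    return float(np.trapz(np.asarray(tpr)[order], np.asarray(fpr)[order]))

def hanley_mcneil_se(auc, n_pos, n_neg):
    """Standard error of a trapezoidal AUC per Hanley & McNeil (1982)."""
    q1 = auc / (2.0 - auc)
    q2 = 2.0 * auc**2 / (1.0 + auc)
    var = (auc * (1.0 - auc)
           + (n_pos - 1) * (q1 - auc**2)
           + (n_neg - 1) * (q2 - auc**2)) / (n_pos * n_neg)
    return float(np.sqrt(var))

auc = trapezoidal_auc(fpr, tpr)   # fpr, tpr from the roc_points sketch above
print(auc, hanley_mcneil_se(auc, n_pos=50, n_neg=100))  # hypothetical counts
```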
In contrast to MLE, which provides point estimates of the parameter values most likely to have generated the observed data, Bayesian estimation summarizes knowledge of unknown parameters using “posterior” distributions representing the probability that a parameter has a particular value, given the observed data. According to Bayes’ Rule, the posterior probability of a parameter’s value is proportional to the likelihood of observing the data given that parameter value, multiplied by a “prior” probability of the parameter’s value. The likelihood function is dictated by statistical model choice, and is the same construct as in MLE. When a prior is “non-informative” (e.g., P(θ) = c for all θ ∈ [a,b], where [a,b] is an arbitrarily large bounded interval, c is a constant, and θ is a parameter), Bayesian and MLE methods yield similar inferences (Carlin & Louis, 2000). However, in Bayesian estimation, inference is conducted directly on the unknown parameters (or functions thereof, such as AUC), while in MLE, inference is conducted on the data. Hence, only Bayesian estimation allows direct probability statements such as “the probability that the AUC for rater j is between .955 and .973 is 95%.”
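Stated symbolically, with \( \theta \) denoting the full parameter vector and Y the observed ratings, the rule just described is

$$ p(\theta \mid Y) = \frac{L(Y \mid \theta)\, p(\theta)}{\int L(Y \mid \theta')\, p(\theta')\, d\theta'} \;\propto\; L(Y \mid \theta)\, p(\theta). $$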
Markov chain Monte Carlo (MCMC) methods (Gelfand & Smith, 1990; Geman & Geman, 1984; Metropolis, Rosenbluth, Rosenbluth, Teller, & Teller, 1953) are used to make inferences about posterior distributions for which the absence of analytic methods would otherwise make Bayes' Rule intractable. Under mild regularity conditions, a Markov chain converges to a unique invariant or "target" distribution. To use MCMC methods for Bayesian analysis, one constructs the transition kernel so that the target distribution of the resulting Markov chain is the joint posterior distribution of interest. After discarding the draws from initial "burn-in" iterations, one can use the remaining draws to make inferences about model parameters. WinBUGS is a free software package that allows specification of a Bayesian model, determines the transition kernel for the Markov chain, and produces draws from the joint posterior distribution of the unknown parameters (Lunn et al., 2000).
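WinBUGS constructs the transition kernel automatically; the following Python sketch of a random-walk Metropolis sampler illustrates only the mechanics, with log_post standing in for the log posterior of whatever model is being fit:

```python
import numpy as np

rng = np.random.default_rng(0)

def metropolis(log_post, theta0, n_iter, step):
    """Random-walk Metropolis: the chain's invariant (target)
    distribution is the posterior proportional to exp(log_post)."""
    theta = np.asarray(theta0, dtype=float)
    lp = log_post(theta)
    draws = np.empty((n_iter, theta.size))
    for t in range(n_iter):
        proposal = theta + step * rng.standard_normal(theta.size)
        lp_prop = log_post(proposal)
        if np.log(rng.uniform()) < lp_prop - lp:  # accept w.p. min(1, ratio)
            theta, lp = proposal, lp_prop
        draws[t] = theta
    return draws

# Toy check: target N(0, 1); discard burn-in before making inferences.
draws = metropolis(lambda th: -0.5 * float(th @ th), [5.0], 6000, 1.0)
print(draws[1000:].mean(), draws[1000:].std())
```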
For our Bayesian analyses, we used minimally informative priors and the same statistical models as in our MLE approach. WinBUGS 1.4.3 ran five parallel MCMC chains; the Brooks–Gelman–Rubin diagnostic (Brooks & Gelman, 1998) indicated convergence after 2000–5000 iterations. We ran each chain for 15,000 iterations and treated each chain’s first 10,000 iterations as “burn-in” values to be discarded, leaving 5 × 5000 = 25,000 draws for inference. Because our WinBUGS code (available from the first author upon request) calculated (fpr, tpr) coordinates and trapezoidal AUCs directly from the MCMC parameter draws, we obtained samples of and made inferences about our accuracy statistics directly from the posterior distributions.
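The inferential steps just described can be sketched as follows, building on the metropolis function above; log_post, theta0, and auc_from_params are hypothetical stand-ins (in a real analysis, the last would compute a trapezoidal AUC from one draw of the cut-off and σ parameters, as in the earlier sketches):

```python
import numpy as np

# Placeholder stand-ins so the sketch runs -- not the authors' WinBUGS code:
theta0 = np.zeros(3)                                 # initial parameter vector
log_post = lambda th: -0.5 * float(th @ th)          # placeholder log posterior
auc_from_params = lambda th: 1.0 / (1.0 + np.exp(-th[0]))  # placeholder AUC map

chains = [metropolis(log_post, theta0, n_iter=15_000, step=0.5)
          for _ in range(5)]                       # five parallel chains
kept = np.vstack([c[10_000:] for c in chains])     # 5 x 5,000 = 25,000 draws
auc_draws = np.array([auc_from_params(th) for th in kept])
lo, hi = np.percentile(auc_draws, [2.5, 97.5])
print(f"posterior mean AUC = {auc_draws.mean():.3f}; "
      f"95% credible interval = ({lo:.3f}, {hi:.3f})")
```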
Cite this article
Mossman, D., Bowen, M.D., Vanness, D.J. et al. Quantifying the Accuracy of Forensic Examiners in the Absence of a “Gold Standard”. Law Hum Behav 34, 402–417 (2010). https://doi.org/10.1007/s10979-009-9197-5