Evaluation of Biomedical and Health Information Resources

Abstract

This chapter introduces evaluation as an important aspect of the field of biomedical informatics. The text emphasizes how one goes about studying the need for, design of, performance of, and impact of the information resources that support individuals and groups in the pursuit of better health. The chapter begins by introducing the rationale for undertaking these studies, and continues by describing a general structure that all evaluation studies share and the importance of asking good questions as a prerequisite to obtaining useful answers. The chapter then introduces a nine-level classification of evaluation studies and describes the purpose served by each type of study in relation to the lifecycle of an information resource. This discussion sets the stage for the introduction of specific study methods: objectivist/quantitative studies, and subjectivist/qualitative studies. We describe important considerations in the design of quantitative and qualitative studies, along with the collection and analysis of study data. These discussions emphasize the special challenges that arise when health information resources are the focus of study. The chapter concludes with a discussion of the importance of employing effective methods to report study results.

Notes

  1. In this chapter, we will use the terms “information resource” and “information system” generally as synonyms. However, “information system” applies more specifically to applications of digital technology, whereas a “resource” is a broad term that could, for example, include informal collegial consultations.

  2. https://www.bbc.co.uk/news/technology-45328965 (Accessed 11.20.19).

  3. https://www.nngroup.com/articles/ (Accessed 11.20.19).

  4. Examples include: https://www.gem-beta.org/public/home.aspx and https://healthit.ahrq.gov/health-it-tools-and-resources/evaluation-resources/health-it-survey-compendium-search (Both accessed 11.20.18).

  5. See https://www.nngroup.com/articles/why-you-only-need-to-test-with-5-users/ (Accessed 11.20.18).

  6. Examples include Atlas.ti (https://atlasti.com/) and NVivo (https://www.qsrinternational.com/nvivo/home) (Both accessed November 18, 2019).

References

  • Ammenwerth, E. (2015). Evidence-based health informatics: How do we know what we know? Methods of Information in Medicine, 54(4), 298–307.
  • Anderson, J. G., Aydin, C. E., & Jay, S. J. (Eds.). (1994). Evaluating health care information systems. Thousand Oaks: Sage Publications Inc.
  • Beaudoin, M., Kabanza, F., Nault, V., & Valiquette, L. (2016). Evaluation of a machine learning capability for a clinical decision support system to enhance antimicrobial stewardship programs. Artificial Intelligence in Medicine, 68, 29–36.
  • Black, A. D., Car, J., Pagliari, C., Anandan, C., Cresswell, K., Bokun, T., et al. (2011). The impact of eHealth on the quality and safety of health care: A systematic overview. PLoS Medicine, 8(1), e1000387.
  • Brender, J. (2005). Handbook of evaluation methods for health informatics. Burlington: Academic Press.
  • Campbell, D. T., & Stanley, J. C. (1963). Experimental and quasi-experimental designs for research. Boston: Houghton Mifflin (reprinted often since).
  • Campbell, M., Fitzpatrick, R., Haines, A., Kinmonth, A. L., Sandercock, P., Spiegelhalter, D., & Tyrer, P. (2000). Framework for design and evaluation of complex interventions to improve health. BMJ, 321(7262), 694–696.
  • Friedman, C. P., & Abbas, U. L. (2003). Is medical informatics a mature science? A review of measurement practice in outcome studies of clinical systems. International Journal of Medical Informatics, 69(2–3), 261–272. https://doi.org/10.1016/S1386-5056(02)00109-0.
  • Davey Smith, G. (2007). Capitalizing on Mendelian randomization to assess the effects of treatments. Journal of the Royal Society of Medicine, 100(9), 432–435. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1963388/.
  • Demiris, G., Speedie, S., & Finkelstein, S. (2000). A questionnaire for the assessment of patients’ impressions of the risks and benefits of home telecare. Journal of Telemedicine and Telecare, 6(5), 278–284.
  • Murray, E., Hekler, E. B., Andersson, G., Collins, L. M., Doherty, A., Hollis, C., Rivera, D. E., West, R., & Wyatt, J. C. (2016). Evaluating digital health interventions. American Journal of Preventive Medicine, 51(5), 843–851.
  • Eminovic, N., Wyatt, J. C., Tarpey, A. M., Murray, G., & Ingrams, G. J. (2004). First evaluation of the NHS direct online clinical enquiry service: A nurse-led Web chat triage service for the public. Journal of Medical Internet Research, 6(2), E17.
  • European Union Medical Devices Regulatory Framework. (2018). https://ec.europa.eu/growth/sectors/medical-devices/regulatory-framework_en. Accessed 24 Oct 2018.
  • Forsythe, D. E. (1992). Using ethnography to build a working system: Rethinking basic design assumptions. In Proceedings annual symposium computer applications in medical care (pp. 505–509).
  • Forsythe, D. E., Buchanan, B. G., Osheroff, J. A., & Miller, R. A. (1992). Expanding the concept of medical information: An observational study of physicians’ information needs. Computers and Biomedical Research, 25, 181–200.
  • Fox, J. (1993). Decision support systems as safety-critical components: Towards a safety culture for medical informatics. Methods of Information in Medicine, 32, 345–348.
  • Friedman, C. P., & Wyatt, J. C. (2005). Evaluation methods in biomedical informatics (2nd ed., p. 386). New York: Springer. ISBN 0-387-25889-2.
  • Gaschnig, J., Klahr, P., Pople, H., Shortliffe, E., & Terry, A. (1983). Evaluation of expert systems: Issues and case studies. In F. Hayes-Roth, D. A. Waterman, & D. Lenat (Eds.), Building expert systems. Reading: Addison Wesley.
  • Goddard, K., Roudsari, A., & Wyatt, J. C. (2012). Automation bias: A systematic review of frequency, effect mediators, and mitigators. Journal of the American Medical Informatics Association: JAMIA, 19, 121–127.
  • Gray, E., Marti, J., Brewster, D. H., Wyatt, J. C., Piaguet-Rossel, R., & Hall, P. S. (2019). Feasibility and results of four real-world evidence methods for estimating the effectiveness of adjuvant chemotherapy in early stage breast cancer. Journal of Clinical Epidemiology.
  • Haddow, G., Bruce, A., Sathanandam, S., & Wyatt, J. C. (2011). ‘Nothing is really safe’: A focus group study on the processes of anonymizing and sharing of health data for research purposes. Journal of Evaluation in Clinical Practice, 17, 1140–1146.
  • Herasevich, V., & Pickering, B. W. (2017). Health information technology evaluation handbook: From meaningful use to meaningful outcome. Boca Raton: CRC Press.
  • House, E. (1980). Evaluating with validity. San Francisco: Sage.
  • Kern, L. M., Edwards, A. M., Pichardo, M., & Kaushal, R. (2015). Electronic health records and health care quality over time in a federally qualified health center. Journal of the American Medical Informatics Association, 22(2), 453–458.
  • Koppel, R., Metlay, J. P., Cohen, A., Abaluck, B., Localio, A. R., Kimmel, S. E., & Strom, B. L. (2005). Role of computerized physician order entry systems in facilitating medication errors. JAMA: The Journal of the American Medical Association, 293(10), 1197–1203.
  • Lester, R. T., Ritvo, P., Mills, E. J., Kariri, A., Karanja, S., Chung, M. H., Jack, W., Habyarimana, J., Sadatsafavi, M., Najafzadeh, M., Marra, C. A., Estambale, B., Ngugi, E., Ball, T. B., Thabane, L., Gelmon, L. J., Kimani, J., Ackers, M., & Plummer, F. A. (2010). Effects of a mobile phone short message service on antiretroviral treatment adherence in Kenya (WelTel Kenya1): A randomised trial. The Lancet, 376(9755), 1838–1845.
  • Littlejohns, P., Wyatt, J. C., & Garvican, L. (2003). Evaluating computerised health information systems: Hard lessons still to be learnt. BMJ, 326(7394), 860–863.
  • Liu, J. L. Y., & Wyatt, J. C. (2011). The case for randomized controlled trials to assess the impact of clinical information systems. Journal of the American Medical Informatics Association: JAMIA, 18(2), 173–180.
  • Liu, Y. I., Kamaya, A., et al. (2011). A Bayesian network for differentiating benign from malignant thyroid nodules using sonographic and demographic features. AJR: American Journal of Roentgenology, 196(5), W598–W605.
  • Lundsgaarde, H. P. (1987). Evaluating medical expert systems. Social Science & Medicine, 24, 805–819.
  • Lunenburg, F. C. (2010). Managing change: The role of the change agent. International Journal of Management, Business and Administration, 13(1), 1–6.
  • Mant, J., & Hicks, N. (1995). Detecting differences in quality of care: The sensitivity of measures of process and outcome in treating acute myocardial infarction. BMJ, 311, 793–796.
  • McDonald, C. J., Hui, S. L., Smith, D. M., Tierney, W. M., Cohen, S. J., Weinberger, M., & McCabe, G. P. (1984). Reminders to physicians from an introspective computer medical record: A two-year randomized trial. Annals of Internal Medicine, 100(1), 130–138.
  • McMurry, T. L., Hu, Y., Blackstone, E. H., & Kozower, B. D. (2015). Propensity scores: Methods, considerations, and applications in the Journal of Thoracic and Cardiovascular Surgery. The Journal of Thoracic and Cardiovascular Surgery, 150(1), 14–19. https://doi.org/10.1016/j.jtcvs.2015.03.057.
  • Michaelis, J., Wellek, S., & Willems, J. L. (1990). Reference standards for software evaluation. Methods of Information in Medicine, 29, 289–297.
  • Murray, M. D., Harris, L. E., Overhage, J. M., Zhou, X. H., Eckert, G. J., Smith, F. E., Buchanan, N. N., Wolinsky, F. D., McDonald, C. J., & Tierney, W. M. (2004). Failure of computerized treatment suggestions to improve health outcomes of outpatients with uncomplicated hypertension: Results of a randomized controlled trial. Pharmacotherapy, 24(3), 324–337.
  • Nielsen, J. (1994). Usability inspection methods. Paper presented at the conference companion on human factors in computing systems, Boston.
  • Office of the National Coordinator for Health Information Technology (ONC) website. (2014). FDASIA Committee Report. https://www.healthit.gov/sites/default/files/fdasiahealthitreport_final.pdf. Accessed 25 Oct 2018.
  • Ong, M. S., & Coiera, E. (2011). A systematic review of failures in handoff communication during intrahospital transfers. Joint Commission Journal on Quality and Patient Safety, 37(6), 274–284.
  • Pinsky, P. F., Miller, A., Kramer, B. S., Church, T., Reding, D., Prorok, P., Gelmann, E., Schoen, R. E., Buys, S., Hayes, R. B., & Berg, C. D. (2007). Evidence of a healthy volunteer effect in the prostate, lung, colorectal, and ovarian cancer screening trial. American Journal of Epidemiology, 165(8), 874–881.
  • Pope, C., Halford, S., Turnbull, J., Prichard, J., Calestani, M., & May, C. (2013). Using computer decision support systems in NHS emergency and urgent care: Ethnographic study using normalisation process theory. BMC Health Services Research, 13(1), 111.
  • Ramnarayan, P., Kapoor, R. R., Coren, M., Nanduri, V., Tomlinson, A. L., Taylor, P. M., Wyatt, J. C., & Britto, J. F. (2003). Measuring the impact of diagnostic decision support on the quality of clinical decision making: Development of a reliable and valid composite score. Journal of the American Medical Informatics Association: JAMIA, 10(6), 563–572.
  • Ravi, D., Wong, C., Deligianni, F., Berthelot, M., Andreu-Perez, J., Lo, B., & Yang, G. Z. (2017). Deep learning for health informatics. IEEE Journal of Biomedical and Health Informatics, 21(1), 4–21.
  • Rigby, M., Forsström, J., Ruth, R., & Wyatt, J. (2001). Verifying quality and safety in health informatics services. BMJ, 323, 552–556.
  • Rothman, K. J., Greenland, S., & Lash, T. L. (2008). Modern epidemiology (3rd ed.). Philadelphia: Lippincott Williams & Wilkins.
  • Russ, A. L., Zillich, A. J., Melton, B. L., Russell, S. A., Chen, S., Spina, J. R., et al. (2014). Applying human factors principles to alert design increases efficiency and reduces prescribing errors in a scenario-based simulation. Journal of the American Medical Informatics Association, 21(e2), e287–e296.
  • Saitwal, H., Feng, X., Walji, M., Patel, V., & Zhang, J. (2010). Assessing performance of an electronic health record (EHR) using cognitive task analysis. International Journal of Medical Informatics, 79(7), 501–506.
  • Scott, G. P., Shah, P., Wyatt, J. C., Makubate, B., & Cross, F. W. (2011). Making electronic prescribing alerts more effective: Scenario-based experimental study in junior doctors. Journal of the American Medical Informatics Association: JAMIA, 18(6), 789–798.
  • Scott, P. J., Brown, A. W., Adedeji, T., Wyatt, J. C., Georgiou, A., Eisenstein, E. L., & Friedman, C. P. (2019). A review of measurement practice in studies of clinical decision support systems 1998–2017. Journal of the American Medical Informatics Association, 26(10), 1120–1128.
  • Sheikh, A., Cornford, T., Barber, N., Avery, A., Takian, A., Lichtner, V., et al. (2011). Implementation and adoption of nationwide electronic health records in secondary care in England: Final qualitative results from prospective national evaluation in “early adopter” hospitals. BMJ, 343, d6054.
  • Sherman, R. E., Anderson, S. A., Dal Pan, G. J., Gray, G. W., Gross, T., Hunter, N. L., LaVange, L., Marinac-Dabic, D., Marks, P. W., Robb, M. A., Shuren, J., Temple, R., Woodcock, J., Yue, L. Q., & Califf, R. M. (2016). Real-world evidence – what is it and what can it tell us? The New England Journal of Medicine, 375(23), 2293–2297.
  • Slight, S. P., & Bates, D. W. (2014). A risk-based regulatory framework for health IT: Recommendations of the FDASIA working group. Journal of the American Medical Informatics Association, 21(e2), e181–e184.
  • Sommerville, I. (2015). Software engineering (10th ed.). Pearson. ISBN-10: 0133943038.
  • Spiegelhalter, D. J. (1983). Evaluation of medical decision-aids, with an application to a system for dyspepsia. Statistics in Medicine, 2, 207–216.
  • Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., & Salakhutdinov, R. (2014). Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1), 1929–1958.
  • Stead, W., Haynes, R. B., Fuller, S., et al. (1994). Designing medical informatics research and library projects to increase what is learned. Journal of the American Medical Informatics Association, 1, 28–34.
  • Streeter, A. J., Lin, N. X., Crathorne, L., Haasova, M., Hyde, C., Melzer, D., & Henley, W. E. (2017). Adjusting for unmeasured confounding in nonrandomized longitudinal studies: A methodological review. Journal of Clinical Epidemiology, 87, 23–34. https://doi.org/10.1016/j.jclinepi.2017.04.022.
  • Szczepura, A., & Kankaanpaa, J. (1996). Assessment of health care technologies. London: Wiley.
  • Talmon, J., Ammenwerth, E., Brender, J., de Keizer, N., Nykänen, P., & Rigby, M. (2009). STARE-HI—statement on reporting of evaluation studies in health informatics. International Journal of Medical Informatics, 78(1), 1–9.
  • van Gennip, E. M., & Talmon, J. L. (Eds.). (1995). Assessment and evaluation of information technologies in medicine. Amsterdam: IOS Press.
  • Van Way, C. W., Murphy, J. R., Dunn, E. L., & Elerding, S. C. (1982). A feasibility study of computer-aided diagnosis in appendicitis. Surgery, Gynecology & Obstetrics, 155, 685–688.
  • Ventres, W., Kooienga, S., Vuckovic, N., Marlin, R., Nygren, P., & Stewart, V. (2006). Physicians, patients, and the electronic health record: An ethnographic analysis. The Annals of Family Medicine, 4(2), 124–131. https://doi.org/10.1370/afm.425.
  • Wasson, J. H., Sox, H. C., Neff, R. K., & Goldman, L. (1985). Clinical prediction rules: Applications and methodological standards. The New England Journal of Medicine, 313, 793–799.
  • Wolf, J. A., Moreau, J. F., Akilov, O., Patton, T., English, J. C., Ho, J., & Ferris, L. K. (2013). Diagnostic inaccuracy of smartphone applications for melanoma detection. JAMA Dermatology, 149(4), 422–426.
  • Wright, A., Sittig, D. F., Ash, J. S., Erickson, J. L., Hickman, T. T., Paterno, M., et al. (2015). Lessons learned from implementing service-oriented clinical decision support at four sites: A qualitative study. International Journal of Medical Informatics, 84(11), 901–911.
  • Wyatt, J., & Spiegelhalter, D. (1990). Evaluating medical expert systems: What to test and how? Medical Informatics (London), 15, 205–217.
  • Wyatt, J., & Wyatt, S. (2003). When and how to evaluate clinical information systems? International Journal of Medical Informatics, 69, 251–259.
  • Wyatt, J. C., Batley, R. P., & Keen, J. (2010). GP preferences for information systems: Conjoint analysis of speed, reliability, access and users. Journal of Evaluation in Clinical Practice, 16(5), 911–915.
  • Zhang, J., Johnson, T. R., Patel, V. L., Paige, D. L., & Kubose, T. (2003). Using usability heuristics to evaluate patient safety of medical devices. Journal of Biomedical Informatics, 36(1–2), 23–30.

Acknowledgment

The authors wish to acknowledge Nikolas Koscielniak for his multiple important contributions to this chapter. This chapter is adapted from material in an earlier edition of the textbook that was also co-authored by Douglas K. Owens.

Corresponding author

Correspondence to Charles P. Friedman.

Appendices

1.1 Appendix A: Two Evaluation Scenarios

Here we introduce two scenarios that collectively capture many of the dilemmas facing those planning and conducting evaluations in biomedical informatics:

  1. A prototype information resource has been developed, but its usability and potential for benefit need to be assessed prior to deployment;

  2. A commercial resource has been deployed across a large enterprise, and there is a need to understand its impact on users as well as on the organization.

These scenarios do not address the full scope of evaluations in biomedical informatics, but they cover a lot of what people do. For each, we introduce sets of evaluation questions that frequently arise and examine the dilemmas that investigators face in the design and execution of evaluation studies.

Scenario 1: A Prototype Information Resource Has Been Developed, but Its Usability and Potential for Benefit Need to Be Assessed Prior to Deployment

The primary evaluation issue here is the upcoming decision about whether to continue developing the prototype information resource. Validation of the design and structure of the resource will have been conducted, either formally or informally, but a usability study will not yet have been carried out. If usability testing looks promising, a laboratory evaluation of key functions is also advised before making the substantial investment required to turn a promising prototype into a system that is stable and likely to bring more benefits than problems to users in the field. Here, typical questions will include:

  • Who are the target users, and what are their background skills and knowledge?

  • Does the resource make sense to target users?

  • Following a brief introduction, can target users navigate themselves around important parts of the resource?

  • Can target users carry out a selection of relevant tasks using the resource, in reasonable time and with reasonable accuracy?

  • What user characteristics correlate with the ability to use the resource and achieve fast, accurate performance with it?

  • What other kinds of people can use it safely?

  • How could the layout, design, wording, menus, etc. be improved?

  • Is there a long learning curve? What user training needs are there?

  • How much on-going help will users require once they are initially trained?

  • What concerns do users have about the system (e.g., accuracy, privacy, effects on their jobs, other side effects)?

  • Based on the performance of prototypes in users’ hands, does the resource have the potential to meet user needs?

These questions fall within the scope of the usability and laboratory function testing approaches listed in Table 15.1. A wide range of techniques, borrowed from the human-computer interaction field and employing both objectivist and subjectivist approaches, can be used, including:

  • Seeking the views of potential users after both a demonstration of the resource and a hands-on exploration. Methods such as focus groups can be very useful for identifying not only immediate problems with the software and how it might be improved, but also broader concerns and unexpected issues, such as user privacy and longer-term questions about user training and working relationships.

  • Studying users while they carry out a list of pre-designed tasks using the information resource. Methods for studying users include watching over their shoulders, video observation (sometimes with several video cameras per user), think-aloud protocols (asking users to verbalize their impressions as they navigate and use the system), and automatic logging of keystrokes, navigation paths, and time to complete tasks.

  • Use of validated questionnaires to capture user impressions, often before and after an experience with the system, one example being the Telemedicine Preparedness questionnaire (Demiris et al. 2000).

  • Specific techniques to explore how users might improve the layout or design of the software. For example, to help understand what users think of as a “logical” menu structure for an information resource, investigators can use a card sorting technique. This entails listing each function available on all the menus on a separate card and then asking users to sort these cards into several piles according to which functions seem to go together (see www.useit.com); a minimal analysis sketch appears below.
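
The sketch below is illustrative only and is not part of the original chapter: it aggregates hypothetical open card-sort data into a co-occurrence matrix and applies hierarchical clustering (here with SciPy) to suggest candidate menu groupings. The function names and the number of groups are assumptions made for the example.

```python
from itertools import combinations

import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

# Hypothetical card sorts: each participant grouped menu functions into piles.
sorts = [
    [{"order test", "view results"}, {"prescribe", "renew prescription"}],
    [{"order test", "prescribe"}, {"view results", "renew prescription"}],
    [{"order test", "view results", "renew prescription"}, {"prescribe"}],
]

items = sorted({fn for sort in sorts for pile in sort for fn in pile})
index = {fn: i for i, fn in enumerate(items)}

# Count how often each pair of functions was placed in the same pile.
co = np.zeros((len(items), len(items)))
for sort in sorts:
    for pile in sort:
        for a, b in combinations(sorted(pile), 2):
            co[index[a], index[b]] += 1
            co[index[b], index[a]] += 1

# Convert co-occurrence into a distance matrix and cluster into two menu groups.
dist = 1.0 - co / len(sorts)
np.fill_diagonal(dist, 0.0)
groups = fcluster(linkage(squareform(dist), method="average"), t=2, criterion="maxclust")
for fn, g in zip(items, groups):
    print(f"menu group {g}: {fn}")
```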

Depending on the aim of a usability study, it may suffice to employ a small number of potential users. Nielsen has shown that, if the aim is to identify only major software faults, the proportion identified rises quickly up to about 5 or 6 users and then much more slowly, plateauing at about 15–20 users (Nielsen 1994). Five users will often identify 80% of software problems. However, investigators conducting such small studies, useful though they may be for software development, cannot then expect to publish them in a scientific journal. The achievement in this case is having found answers to a very specific question about a specific software prototype. This kind of local reality test is unlikely to appeal to the editors or readers of a journal. By contrast, the results of formal laboratory function studies, which typically employ more users, are more amenable to journal publication.
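
As a back-of-the-envelope illustration of this plateau effect (not taken from the chapter), the problem-discovery model usually attributed to Nielsen and Landauer treats each user as detecting any given problem with some average probability p; the value p ≈ 0.31 used below is a commonly quoted assumption, not a figure from this text.

```python
# Expected proportion of usability problems found after n users,
# assuming each user detects a given problem with probability p.
def proportion_found(n_users: int, p: float = 0.31) -> float:
    return 1 - (1 - p) ** n_users

for n in (1, 3, 5, 10, 15):
    print(f"{n:>2} users -> ~{proportion_found(n):.0%} of problems found")
# Under this model five users find roughly 80% of the problems, which is why
# small samples are often considered sufficient for iterative design work.
```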

Scenario 2: A Commercial Resource Has Been Deployed Across a Large Enterprise, and There Is a Need to Understand Its Impact on Users as Well as on the Organization

The type of evaluation questions that arise here include:

  • On what fraction of the occasions when the resource could have been used was it actually used?

  • Who uses it, and why? Are these the intended users, and are they satisfied with it?

  • Does using the resource influence information or communication flows?

  • Does using the resource influence users’ knowledge or skills?

  • Does using the resource improve users’ work?

  • For clinical information resources, does using the resource change outcomes for patients?

  • How does the resource influence the whole organization and its relevant subunits?

  • Do the overall benefits, costs, or risks differ for specific groups of users, for departments, or for the whole organization?

  • How much does the resource really cost the organization?

  • Should the organization keep the resource as it is, improve it or replace it?

  • How can the resource be improved, at what cost, and what benefits would result?

To each of the above questions, one can add: “Why, or why not?”, to get a broader understanding of what is happening as a result of use of the resource.

This evaluation scenario, suggesting a problem impact study, is often what people think of first when the concept of evaluation is introduced. However, we have seen in this chapter that it is one of many evaluation scenarios, arising relatively late in the life cycle of an information resource. When these impact-oriented evaluations are undertaken, they usually result from a realization by stakeholders, who have invested significantly in an information resource, that the benefits of the resource are uncertain and there is need to justify recurring costs. These stakeholders usually vary in the kind of evaluation methods that will convince them about the impacts that the resource is or is not having. Many such stakeholders will wish to see quantified indices of benefits or harms from the resource, for example the number of users and daily uses, the amount the resource improves productivity or reduces costs, or perhaps other benefits such as reduced waiting times to perform key tasks or procedures, lengths of hospital stay or occurrence of adverse events. Such data are collected through objectivist studies as discussed earlier. Other stakeholders may prefer to see evidence of perceived benefit and positive views of staff, in which case staff surveys, focus groups and unstructured interviews may prove the best evaluation methods. Often, a combination of many methods is necessary to extend the investigation from understanding what impact the resource has to why this impact occurs – or fails to occur.

If the investigator is pursuing objectivist methods, deciding which of the possible effect variables to include in an impact study, and developing ways to measure them, can be the most challenging aspect of an evaluation study design. (These and related issues receive the attention of five full chapters of a textbook by the authors of this chapter (Friedman and Wyatt 2005).) Investigators usually wish to limit the number of effect measures employed in a study for several reasons: to conserve limited evaluation resources, to minimize manipulation of the practice environment, and to avoid the statistical problems that result from analyzing a large number of measures.
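
The following sketch, which is illustrative rather than drawn from the chapter, shows one of those statistical problems: as the number of independently tested effect measures grows, so does the chance of at least one spurious “significant” finding, and corrections such as Bonferroni’s make each individual test correspondingly stricter.

```python
# Family-wise error rate for k independent effect measures tested at alpha,
# and the per-test threshold under a simple Bonferroni correction.
alpha = 0.05
for k in (1, 3, 5, 10, 20):
    family_wise = 1 - (1 - alpha) ** k
    print(f"{k:>2} measures: P(at least one false positive) ~ {family_wise:.0%}, "
          f"Bonferroni per-test alpha = {alpha / k:.4f}")
```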

Effect or impact studies can also use subjectivist approaches to allow the most relevant “effect” issues to emerge over time and with increasingly deep immersion into the study environment. This emergent feature of subjectivist work obviates the need to decide in advance which effect variables to explore, and is considered by proponents of subjectivist approaches to be among their major advantages.

In health care particularly, every intervention carries some risk, which must be judged against the risks of doing nothing or of providing an alternative intervention. It is difficult to decide whether an information resource is an improvement unless the performance of the current decision-makers is also measured in a comparison-based evaluation. For example, if physicians’ decisions are to become more accurate following introduction of a decision-support tool, the resource needs to be “right” when the user would usually be “wrong”. This could mean that the tool’s error rate is lower than the physician’s, that its errors occur in different cases, or that its errors are of a different kind or less serious than the clinician’s, so that new errors are not introduced by clinicians following resource advice even when that advice is incorrect, a phenomenon known as “automation bias” (Goddard et al. 2012).
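
A hedged sketch of this kind of comparison follows; the case-level data are invented, and the analysis (a McNemar-style exact test on the discordant cases, assuming SciPy 1.7 or later for binomtest) is one simple way to ask whether the tool tends to be right where the clinician is wrong, and how often it might introduce new errors.

```python
import numpy as np
from scipy.stats import binomtest

# Hypothetical case-level correctness for the same ten cases.
clinician_correct = np.array([1, 1, 0, 0, 1, 0, 1, 1, 0, 1], dtype=bool)
tool_correct = np.array([1, 1, 1, 0, 1, 1, 0, 1, 1, 1], dtype=bool)

tool_rescues = int(np.sum(~clinician_correct & tool_correct))  # tool right, clinician wrong
new_errors = int(np.sum(clinician_correct & ~tool_correct))    # potential automation-bias harm
both_wrong = int(np.sum(~clinician_correct & ~tool_correct))

print(f"tool corrects the clinician in {tool_rescues} cases, "
      f"could introduce a new error in {new_errors}, both wrong in {both_wrong}")

# Exact McNemar-style test on the discordant cases (binomial with p = 0.5).
discordant = tool_rescues + new_errors
print("p =", binomtest(tool_rescues, discordant, 0.5).pvalue)
```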

For effect studies, it is often important to know something about how practitioners carry out their work prior to the introduction of the information resource. Suitable measures include the accuracy, timing, and confidence level of their decisions and the amount of information they require before making a decision. Although data for such a study can sometimes be collected using abstracts of cases or problems in a laboratory setting (Fig. 15.2), these studies inevitably raise questions of generalization to the real world. We observe here one of many trade-offs that occur in the design of evaluation studies. Although the control over the case mix that is possible in a laboratory study can lead to a more precise estimate of practitioner decision making, ultimately it may prove better to conduct a baseline study while the individuals are doing real work in a real practice setting. Often this audit of current decisions and actions provides useful input to the design of the information resource, as well as a reference against which resource performance may later be compared.

When conducting problem impact studies in health care settings, investigators can sometimes save themselves much time and effort, without sacrificing validity, by measuring effect in terms of certain health care processes rather than patient outcomes, in other words by employing a user effect study as a proxy for a problem impact study. For example, measuring the mortality or complication rate in patients with heart attacks requires data collection from hundreds of patients, as complications and death are (fortunately) rare events. However, as long as large, rigorous trials or meta-analyses have determined that a certain procedure (e.g., giving heart attack patients streptokinase within 24 h) correlates closely with the desired patient outcome, it is perfectly valid to measure the rate of performing this procedure as a “surrogate” for the desired outcome. Mant and Hicks demonstrated that measuring the quality of care by quantifying a key process in this way may require one tenth as many patients as measuring outcomes (Mant and Hicks 1995).
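
The sketch below illustrates the arithmetic behind that saving using a standard two-proportion sample-size formula; the effect sizes (mortality falling from 10% to 8% versus timely streptokinase use rising from 60% to 75%) are assumptions chosen for the example, not figures from Mant and Hicks (1995).

```python
from scipy.stats import norm

def n_per_group(p1: float, p2: float, alpha: float = 0.05, power: float = 0.80) -> float:
    """Classical sample size per arm for comparing two proportions (normal approximation)."""
    z_a, z_b = norm.ppf(1 - alpha / 2), norm.ppf(power)
    p_bar = (p1 + p2) / 2
    numerator = (z_a * (2 * p_bar * (1 - p_bar)) ** 0.5
                 + z_b * (p1 * (1 - p1) + p2 * (1 - p2)) ** 0.5) ** 2
    return numerator / (p1 - p2) ** 2

print("outcome study (mortality 10% -> 8%): ~", round(n_per_group(0.10, 0.08)), "patients per arm")
print("process study (streptokinase 60% -> 75%): ~", round(n_per_group(0.60, 0.75)), "patients per arm")
```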

1.2 Appendix B: Exemplary Evaluation Studies

In this appendix, we briefly summarize studies that align with many of the study types described in Tables 13.1 and 13.2.

Usability Study

Assessing Performance of an Electronic Health Record Using Cognitive Task Analysis.

Saitwal et al. (2010) is a pure usability testing study that evaluates the Armed Forces Health Longitudinal Technology Application EHR using a cognitive task analysis approach referred to as Goals, Operators, Methods, and Selection rules (GOMS). Specifically, the authors evaluated the system response time and the complexity of the graphical user interface (GUI) for a set of 14 prototypical tasks carried out using the EHR. They paid special attention to the inter-rater reliability of the two evaluators who used GOMS to analyze the GUI as the tasks were completed. Each task was broken down into a series of steps, with the intent of determining the percentage of steps classified as “mental operators”. Execution time was then calculated for each step and summed to obtain a total time for task completion.
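
To make the method concrete, here is a small keystroke-level-model sketch in the spirit of GOMS; the operator durations are the commonly cited KLM defaults and the task steps are hypothetical, not values taken from Saitwal et al. (2010).

```python
# Commonly cited keystroke-level-model operator times (seconds).
OPERATOR_SECONDS = {
    "K": 0.28,  # keystroke
    "P": 1.10,  # point with the mouse
    "B": 0.10,  # mouse button press
    "H": 0.40,  # move hands between keyboard and mouse
    "M": 1.35,  # mental preparation
}

# One hypothetical EHR task: locate a field, type a value, point and save.
task = ["M", "P", "B", "M", "K", "K", "K", "K", "H", "P", "B"]

total_time = sum(OPERATOR_SECONDS[op] for op in task)
mental_share = task.count("M") / len(task)
print(f"estimated execution time: {total_time:.1f} s")
print(f"mental operators: {mental_share:.0%} of steps")
```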

Lab Function Study

Diagnostic inaccuracy of smartphone applications for melanoma detection.

Wolf et al. (2013) conducted an evaluation study of smartphone applications designed to detect melanoma and sought to quantify their diagnostic inaccuracy. The study is exemplary of a lab function study, and it complements the Beaudoin et al. (2016) study described below, because the authors paid special attention to measuring application function in a laboratory setting, using digital clinical images whose diagnoses had previously been established by a dermatopathologist’s histologic analysis. The authors compared four different smartphone applications, assessing the sensitivity, positive predictive value, and negative predictive value of each against the histologic diagnosis. Rather than focusing on function in a real health care setting with real users, the authors were interested in determining which applications performed best under controlled conditions.
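
For readers unfamiliar with these metrics, the following sketch computes them from a two-by-two table of app verdicts against the histologic reference standard; the counts are hypothetical and are not taken from Wolf et al. (2013).

```python
def diagnostic_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Sensitivity, specificity, PPV and NPV from a 2x2 confusion table."""
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "positive predictive value": tp / (tp + fp),
        "negative predictive value": tn / (tn + fn),
    }

# Hypothetical app verdicts on lesion images with known histology.
for name, value in diagnostic_metrics(tp=42, fp=35, fn=18, tn=93).items():
    print(f"{name}: {value:.2f}")
```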

Field Function Study

Evaluation of a machine learning capability for a clinical decision support system to enhance antimicrobial stewardship programs.

Beaudoin et al. (2016) conducted an observational study to evaluate the function of a clinical decision support system (an antimicrobial prescription surveillance system, APSS) combined with a learning module for antimicrobial stewardship pharmacists in a Canadian university hospital system. The authors developed a rule-based machine learning module, derived from expert pharmacist recommendations, that triggers alerts for inappropriate prescribing of piperacillin–tazobactam. The combined system was deployed to pharmacists, and its outputs were studied prospectively over a five-week period within the hospital system. Analyses assessed the accuracy, positive predictive value, and sensitivity of the combined system, the learning module alone, and the APSS alone, compared with pharmacist opinion. This is an exemplary field function study because the authors evaluated the ability of the combined rule-based learning module and APSS to detect inappropriate prescribing in the field with real patients.

Lab User Effect Study

Applying human factors principles to alert design increases efficiency and reduces prescribing errors in a scenario-based simulation.

Russ et al. (2014) describe a study evaluating the redesign of alerts using human factors principles and the redesign’s influence on prescribing by providers. The study is exemplary of a lab user effect study because it analyzed the frequency of prescribing errors by providers and was conducted in a simulated environment (the Human-Computer Interaction and Simulation Laboratory in a Veterans Affairs Medical Center). The authors were particularly interested in three types of alerts: drug-drug interaction, drug-allergy, and drug-disease alerts. Three scenarios, containing 19 possible alerts intended to span situations both familiar and unfamiliar to prescribers, were developed for the study. The authors used a crossover design in which participants completed scenarios with both the original and redesigned alerts, separated by a two-week “washout period” to reduce carry-over contamination in the repeated measures. Special attention was paid to a repeated-measures comparative analysis of the influence of the original versus redesigned alerts on perceived workload and prescribing errors. The authors also employed elements of usability testing during the study, such as assessing learnability, efficiency, satisfaction, and usability errors.
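
Because every prescriber works with both alert versions in a crossover design, the natural analysis is within-participant; the sketch below runs a paired comparison on invented error counts and is illustrative only, not the analysis reported by Russ et al. (2014).

```python
import numpy as np
from scipy.stats import ttest_rel

# Hypothetical prescribing errors per participant under each alert design.
errors_original = np.array([3, 2, 4, 1, 3, 5, 2, 3, 4, 2])
errors_redesign = np.array([1, 2, 2, 0, 1, 3, 1, 2, 2, 1])

t_stat, p_value = ttest_rel(errors_original, errors_redesign)
print(f"mean reduction: {np.mean(errors_original - errors_redesign):.1f} errors per participant")
print(f"paired t-test: t = {t_stat:.2f}, p = {p_value:.3f}")
```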

Field User Effect Study

Reminders to physicians from an introspective computer medical record: A two-year randomized trial.

McDonald et al. (1984) conducted a two-year randomized controlled trial to evaluate the effects of a computer-stored medical record system that reminds physicians, ahead of a patient encounter, about actions the patient needs. This study most closely aligns with a field user effect study because of its attention to behavior change in preventive care delivery associated with use of the information resource, and it is exemplary because its rigorous design accounts for the hierarchical nature of clinicians working in teams without manipulating the practice environment. Randomization occurred at the team (cluster) level, and analyses were performed at both the cluster and individual levels. The study did include problem impact metrics; however, no significant changes in these outcomes were observed during the study.
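
A minimal sketch of the cluster-level part of such an analysis follows, comparing one summary value per team between arms; the team-level proportions are invented and this is not the actual analysis reported by McDonald et al. (1984).

```python
import numpy as np
from scipy.stats import ttest_ind

# Hypothetical proportion of eligible preventive-care actions completed, one value per team.
reminder_teams = np.array([0.49, 0.55, 0.47, 0.52, 0.58, 0.50])
control_teams = np.array([0.29, 0.33, 0.31, 0.27, 0.35, 0.30])

t_stat, p_value = ttest_ind(reminder_teams, control_teams)
print(f"difference in team means: {reminder_teams.mean() - control_teams.mean():.2f}")
print(f"two-sample t-test on team means: t = {t_stat:.2f}, p = {p_value:.4f}")
```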

Field User Effect Study

Electronic health records and health care quality over time in a federally qualified health center.

Kern et al. (2015) conducted a three-year comparative study across six sites of a federally qualified health center in New York to analyze the association between implementation of an electronic health record (EHR) and the quality of care delivered, as measured by change in compliance with Stage 1 Meaningful Use quality measures. This study is an exemplary field user effect study for its attention to measures of clinician behavior in care delivery (test and screening ordering through the EHR) and its explicit use of statistical techniques to account for repeated measures on patients over time. The study also includes two problem impact metrics (change in HbA1c and LDL cholesterol) analyzed over the study period; however, the study’s intent was primarily focused on clinician ordering behavior.

Problem Impact Study

Effects of a mobile phone short message service on antiretroviral treatment adherence in Kenya (WelTel Kenya1): A randomised trial.

Lester et al. (2010) is an exemplar of a problem impact study. The authors conducted a randomized controlled trial to measure improvement in patient adherence to antiretroviral therapy (ART) and suppression of viral load following receipt of mobile phone communications from health care workers. The study randomized patients to the intervention group (receiving mobile phone messages from health care workers) or to the control group (standard care). Outcomes were clearly identified and focused on behavioral effects (drug adherence) and on the extent to which improvements in adherence influenced patient health status (viral load). The careful attention to randomization and the use of effect size metrics in the analysis are critical components of measuring the overall impact of mobile phone communications on patient health.
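
As a hedged illustration of the effect measures such a trial reports, the sketch below computes the risk difference and relative risk for adherence with a normal-approximation confidence interval; the counts are illustrative and should not be read as the WelTel Kenya1 results.

```python
import math

# Illustrative counts of adherent patients per arm (not the trial's data).
adherent_sms, n_sms = 168, 273
adherent_ctrl, n_ctrl = 132, 265

p_sms, p_ctrl = adherent_sms / n_sms, adherent_ctrl / n_ctrl
risk_difference = p_sms - p_ctrl
relative_risk = p_sms / p_ctrl

# 95% confidence interval for the relative risk, computed on the log scale.
se_log_rr = math.sqrt(1 / adherent_sms - 1 / n_sms + 1 / adherent_ctrl - 1 / n_ctrl)
lo, hi = (math.exp(math.log(relative_risk) + s * 1.96 * se_log_rr) for s in (-1, 1))
print(f"risk difference = {risk_difference:.2f}, relative risk = {relative_risk:.2f} "
      f"(95% CI {lo:.2f} to {hi:.2f})")
```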

Suggested Reading

Ammenwerth, E., & Rigby, M. (Eds.). (2016). Evidence-based health informatics. Amsterdam: IOS Press. This work includes an extensive exploration of evaluation methods pertinent to health informatics.

Anderson, J. G., & Aydin, C. E. (2005). Evaluating the organizational impact of health care information systems. New York: Springer. This is an excellent edited volume that covers a wide range of methodological and substantive approaches to evaluation in informatics.

Brender, J. (2006). Handbook of evaluation methods for health informatics. Burlington: Elsevier Academic Press. Along with the Friedman and Wyatt text cited below, this is one of the few textbooks available that focus on evaluation in health informatics.

Cohen, P. R. (1995). Empirical methods for artificial intelligence. Cambridge, MA: MIT Press. This is a nicely written, detailed book that is focused on evaluation of artificial intelligence applications, not necessarily those operating in medical domains. It emphasizes objectivist methods and could serve as a basic statistics course for computer science students.

Fink, A. (2004). Evaluation fundamentals: Insights into the outcomes, effectiveness, and quality of health programs (2nd ed.). Thousand Oaks: Sage Publications. A popular text that discusses evaluation in the general domain of health.

Friedman, C. P., & Wyatt, J. C. (2006). Evaluation methods in biomedical informatics. New York: Springer. This is the book on which the current chapter is based. It offers expanded discussion of almost all issues and concepts raised in the current chapter.

Jain, R. (1991). The art of computer systems performance analysis: Techniques for experimental design, measurement, simulation, and modelling. New York: Wiley. This work offers a technical discussion of a range of objectivist methods used to study computer systems. The scope is broader than Cohen’s book (1995) described earlier. It contains many case studies and examples and assumes knowledge of basic statistics.

Lincoln, Y. S., & Guba, E. G. (1985). Naturalistic inquiry. Thousand Oaks: Sage Publications. This is a classic book on subjectivist methods. The work is very rigorous but also very easy to read. Because it does not focus on medical domains or information systems, readers must make their own extrapolations.

Rossi, P. H., Lipsey, M. W., & Freeman, H. E. (2004). Evaluation: A systematic approach (7th ed.). Thousand Oaks: Sage Publications. This is a valuable textbook on evaluation, emphasizing objectivist methods, and is very well written. It is generic in scope, and the reader must relate the content to biomedical informatics. There are several excellent chapters addressing pragmatic issues of evaluation. These nicely complement the chapters on statistics and formal study designs.

Questions for Discussion

  1.

    Associate each of the following hypothetical evaluation scenarios with one or more of the nine types of studies listed in Table 13.1. Note that some scenarios may include more than one type of study.

    (a)

      An order communication system is implemented in a small hospital. Changes in laboratory workload are assessed.

    (b)

      The developers of the order communication system recruit five potential users to help them assess how readily each of the main functions can be accessed from the opening screen and how long it takes users to complete them.

    (c)

      A study team performs a thorough analysis of the information required by psychiatrists to whom patients are referred by a community social worker.

    (d)

      A biomedical informatics expert is asked for her opinion about a PhD project on a new bioinformatics algorithm. She requests copies of the student’s code and documentation for review.

    (e)

      A new intensive care unit system is implemented alongside manual paper charting for a month. At the end of this time, the quality of the computer-derived data and data recorded on the paper charts is compared. A panel of intensive care experts is asked to identify, independently, episodes of hypotension from each data set.

    (f)

      A biomedical informatics professor is invited to join the steering group for a series of apps to support people living with diabetes. The only documentation available to critique at the first meeting is a statement of the project goal, description of the planned development method, and the advertisements and job descriptions for team members.

    (g)

      Developers invite educationalists to test a prototype of a computer-aided learning system as part of a user-centered design workshop.

    (h)

      A program is devised that generates a predicted 24-h blood glucose profile using seven clinical parameters. Another program uses this profile and other patient data to advise on insulin dosages. Diabetologists are asked to prescribe insulin for a series of “paper patients” given the 24-h profile alone, and then again after seeing the computer-generated advice. They are also asked their opinion of the advice.

    (i)

      A program to generate alerts to prevent drug interactions is installed in a geriatric clinic that already has a computer-based medical record system. Rates of clinically significant drug interactions are compared before and after installation of the alerting program.

  2.

    Choose any alternative area of biomedicine (e.g., drug trials) as a point of comparison, and list at least four factors that make studies in biomedical informatics more difficult to conduct successfully than in that area. Given these difficulties, discuss whether it is worthwhile to conduct empirical studies in biomedical informatics or whether we should use intuition or the marketplace as the primary indicators of the value of an information resource.

  3.

    Assume that you run a philanthropic organization that supports biomedical informatics. In investing the scarce resources of your organization, you have to choose between funding a new system or resource development, or funding empirical studies of resources already developed. What would you choose? How would you justify your decision?

  4.

    To what extent is it possible to be certain how effective a medical informatics resource really is? What are the most important criteria of effectiveness?

  5.

    Do you believe that independent, unbiased observers of the same behavior or outcome should agree on the quality of that outcome?

  6.

    Many of the evaluation approaches assert that a single unbiased observer is a legitimate source of information in an evaluation, even if that observer’s data or judgments are unsubstantiated by other people. Give examples drawn from our society where we vest important decisions in a single experienced and presumed impartial individual.

  7.

    Do you agree with the statement that all evaluations appear equivocal when subjected to serious scrutiny? Explain your answer.

Copyright information

© 2021 Springer Nature Switzerland AG

About this chapter

Cite this chapter

Friedman, C.P., Wyatt, J.C. (2021). Evaluation of Biomedical and Health Information Resources. In: Shortliffe, E.H., Cimino, J.J. (eds) Biomedical Informatics. Springer, Cham. https://doi.org/10.1007/978-3-030-58721-5_13

  • DOI: https://doi.org/10.1007/978-3-030-58721-5_13

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-58720-8

  • Online ISBN: 978-3-030-58721-5

  • eBook Packages: Medicine, Medicine (R0)
