
Evaluation of the accuracy of an artificial intelligence in identifying contraindications to exercise therapy - Comparison with and interrater reliability of physical therapists' judgments

  • Original Paper
  • Published in Health and Technology

Abstract

Purpose

This study validates a rule-based artificial intelligence (AI) system for identifying contraindications to exercise therapy. Its accuracy and performance are evaluated against physical therapists' assessments and in relation to patients' clinical characteristics.

Method

The dataset comprised 80 patient cases with clinical characteristics, each assessed by 20 physical therapists for contraindications to exercise therapy. Agreement among the therapists was quantified with Fleiss' kappa, and agreement between the therapists and the AI with a pooled kappa. AI performance was assessed by sensitivity, specificity, accuracy and F1 score. Clinical characteristics were compared across the therapists' vote groups using ANOVA with Bonferroni post-hoc tests.
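
This pipeline can be summarised in a short Python sketch. The data, variable names and group means below are illustrative placeholders, not the study data, and the study's pooled kappa (aggregated across raters) is approximated here by Cohen's kappa between the majority-vote consensus and the AI:

    # Illustrative sketch of the analysis pipeline (placeholder data, not the study dataset).
    from itertools import combinations

    import numpy as np
    from scipy import stats
    from sklearn.metrics import cohen_kappa_score, confusion_matrix
    from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa
    from statsmodels.stats.multitest import multipletests

    rng = np.random.default_rng(0)

    # Ratings: rows = 80 patient cases, columns = 20 therapists;
    # 1 = "contraindication exists", 0 = "no contraindication".
    ratings = rng.integers(0, 2, size=(80, 20))

    # Interrater agreement among therapists: Fleiss' kappa on a
    # cases-by-categories count table.
    table, _ = aggregate_raters(ratings)
    kappa_therapists = fleiss_kappa(table)

    # Agreement between the therapist majority vote and the AI
    # (a perfectly matching classifier here, purely for illustration).
    consensus = (ratings.sum(axis=1) > 10).astype(int)
    ai_rating = consensus.copy()
    kappa_ai = cohen_kappa_score(consensus, ai_rating)

    # AI classification performance against the consensus labels.
    tn, fp, fn, tp = confusion_matrix(consensus, ai_rating).ravel()
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / len(consensus)
    f1 = 2 * tp / (2 * tp + fp + fn)

    # One-way ANOVA of a clinical characteristic (hypothetical pain-intensity
    # scores) across the three vote groups, followed by Bonferroni-corrected
    # pairwise post-hoc t-tests.
    groups = {
        "no contraindication": rng.normal(3.0, 1.5, 35),
        "contraindication": rng.normal(6.0, 1.5, 29),
        "no consensus": rng.normal(4.5, 1.5, 16),
    }
    f_stat, p_omnibus = stats.f_oneway(*groups.values())
    pairs = list(combinations(groups, 2))
    p_raw = [stats.ttest_ind(groups[a], groups[b]).pvalue for a, b in pairs]
    reject, p_adj, _, _ = multipletests(p_raw, method="bonferroni")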

Results

The physical therapists had a mean (SD) age of 40.85 (8.23) years and a mean (SD) professional experience of 14.53 (8.20) years. Of the 80 patient cases, the therapists reached consensus on 64: 35 cases with no contraindication and 29 cases with a consensus that a contraindication to exercise therapy exists. In the remaining 16 cases there was no consensus. Overall, therapists showed 87.5% agreement, with Fleiss' κ = .43. The pooled kappa between therapists and the AI was κpooled = .63. The AI achieved perfect values (1.0) for sensitivity, specificity, accuracy and F1 score. Comparisons across the consensus groups revealed statistically significant differences in pain intensity, duration, timing, and quality.
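
The perfect scores follow directly from the counts above: if the AI matched the consensus label on all 64 consensus cases, the confusion matrix has TP = 29, TN = 35 and FP = FN = 0. A minimal worked check, assuming the metrics were computed against the consensus labels on the 64 consensus cases:

    # Worked check of the reported perfect scores (assumes metrics were
    # computed on the 64 consensus cases against the consensus labels).
    tp, tn, fp, fn = 29, 35, 0, 0
    sensitivity = tp / (tp + fn)                # 29/29 = 1.0
    specificity = tn / (tn + fp)                # 35/35 = 1.0
    accuracy = (tp + tn) / (tp + tn + fp + fn)  # 64/64 = 1.0
    f1 = 2 * tp / (2 * tp + fp + fn)            # 58/58 = 1.0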

Conclusion

The study shows substantial agreement between physical therapists and the AI, consistent with similar studies in the musculoskeletal field. The differences in clinical characteristics across vote groups underline the importance of clinical reasoning when detecting contraindications. In conclusion, advanced technologies such as decision support and expert systems could have a profound impact on clinical practice by improving accuracy, personalizing exercise programs and enabling telemedicine referrals, supporting efficient care and better-informed patient decisions.

Trial registration

Registered on 30 December 2021 via OSF Registries: https://doi.org/10.17605/OSF.IO/YCNJQ.


Data availability statement

The data can be requested from the corresponding author.


Acknowledgements

The authors would like to thank, among others, Philipp Schlüter, Andres Jung, Prof. Dr. Ursula Hübner and Steffen Schulz for their help with specific statistical questions. We would also like to thank medicalmotion GmbH for providing the patient cases.

Funding

The authors declare that no funding, grants or other support was received during the preparation of this manuscript.

Author information


Contributions

Conceptualization: Griefahn, Annika; Luedtke, Kerstin; Zalpour, Christoff. Methodology: Griefahn, Annika; Luedtke, Kerstin. Formal analysis and investigation: Griefahn, Annika. Writing—original draft preparation: Griefahn, Annika; Luedtke, Kerstin; Zalpour, Christoff. Writing—review and editing: Griefahn, Annika; Luedtke, Kerstin; Zalpour, Christoff. Supervision: Luedtke, Kerstin; Zalpour, Christoff.

Corresponding author

Correspondence to Annika Griefahn.

Ethics declarations

Ethics approval

Approval was obtained from the ethics committee of the University of Applied Sciences Osnabrück (ID: HSOS/2021/1/3). The procedures used in this study adhere to the tenets of the Declaration of Helsinki.

Consent to participate

All participants provided written informed consent following a detailed explanation of the study’s purpose.

Consent for publication

Not applicable.

Competing interests

AG is an employee of medicalmotion GmbH. The remaining authors declare no competing interests.

Disclaimer

All authors have read and approved the final version of the manuscript. All authors agree that they are responsible for all aspects of the work, and they will ensure that issues related to the accuracy or integrity of any part of the work are adequately investigated and resolved.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Below is the link to the electronic supplementary material.

Supplementary file1 (DOCX 29 KB)

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Griefahn, A., Zalpour, C. & Luedtke, K. Evaluation of the accuracy of an artificial intelligence in identifying contraindications to exercise therapy - Comparison with and interrater reliability of physical therapists' judgments. Health Technol. 14, 513–522 (2024). https://doi.org/10.1007/s12553-024-00827-w

