Abstract
Purpose
This study validates a rule-based artificial intelligence (AI) system for identifying contraindications to exercise therapy in a medical context. It evaluates the system's accuracy and performance by comparing its judgments with physical therapists' assessments and with patients' clinical characteristics.
Method
The dataset comprised 80 patient cases whose clinical characteristics were assessed by 20 physical therapists for contraindications to exercise therapy. Fleiss' kappa measured agreement among the physical therapists, and pooled kappa measured agreement between the physical therapists and the AI. AI performance was assessed by sensitivity, specificity, accuracy, and F1 score. Clinical characteristics were compared across the therapists' votes using ANOVA with Bonferroni post-hoc tests.
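Fleiss' kappa extends chance-corrected agreement to more than two raters, which suits 20 therapists rating the same cases. Below is a minimal sketch of both agreement measures. It assumes that every case is labelled by every rater with a category string, and the function names `fleiss_kappa`, `cohen_components`, and `pooled_kappa` are hypothetical. The pooled-kappa part assumes one common definition: observed and chance agreement are averaged over all therapist-AI pairs before a single kappa is formed. The study's exact pooling procedure may differ.

```python
from __future__ import annotations

from collections import Counter
from typing import Sequence


def fleiss_kappa(ratings: Sequence[Sequence[str]]) -> float:
    """Fleiss' kappa: kappa = (P_bar - P_e) / (1 - P_e).

    ratings[i] holds the category labels that all n raters gave to item i.
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(item) != n_raters for item in ratings):
        raise ValueError("each item needs the same number (>= 2) of raters")

    categories = sorted({label for item in ratings for label in item})
    counts = [Counter(item) for item in ratings]  # n_ij: raters per category

    # Per-item agreement P_i = (sum_j n_ij^2 - n) / (n * (n - 1))
    p_items = [
        (sum(c[cat] ** 2 for cat in categories) - n_raters)
        / (n_raters * (n_raters - 1))
        for c in counts
    ]
    p_bar = sum(p_items) / n_items

    # Chance agreement P_e = sum_j p_j^2, where p_j is category j's share
    # of all ratings (undefined if every rating falls in one category).
    total = n_items * n_raters
    p_e = sum((sum(c[cat] for c in counts) / total) ** 2 for cat in categories)
    return (p_bar - p_e) / (1 - p_e)


def cohen_components(a: Sequence[str], b: Sequence[str]) -> tuple[float, float]:
    """Observed (p_o) and chance (p_e) agreement for two raters."""
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n
    ca, cb = Counter(a), Counter(b)
    p_e = sum(ca[k] * cb[k] for k in set(a) | set(b)) / (n * n)
    return p_o, p_e


def pooled_kappa(pairs: Sequence[tuple[Sequence[str], Sequence[str]]]) -> float:
    """Average p_o and p_e over all rater pairs, then form one kappa."""
    comps = [cohen_components(a, b) for a, b in pairs]
    p_o = sum(c[0] for c in comps) / len(comps)
    p_e = sum(c[1] for c in comps) / len(comps)
    return (p_o - p_e) / (1 - p_e)
```

For example, consider four toy cases rated by three raters as [C, C, C], [N, N, N], [C, C, N], and [N, N, N]. Fleiss' kappa for these is 23/35 ≈ 0.66. A single near-split case pulls kappa well below the raw share of unanimous cases.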
Results
The physical therapists had a mean age of 40.85 (8.23) years and a mean experience of 14.53 (8.20) years. Of the 80 patient cases, the therapists reached consensus on 64: 35 cases with no contraindication and 29 cases in which a contraindication to exercise therapy exists. In the remaining 16 cases, there was no consensus between therapists. Overall, the therapists had 87.5% agreement, with Fleiss' kappa κπ = .43. The pooled kappa between therapists and the AI was κpooled = .63. The AI achieved perfect scores (1.0) for sensitivity, specificity, accuracy, and F1 score. Comparisons across the therapists' consensus groups revealed statistically significant differences in pain intensity, duration, timing, and quality.
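The four performance measures all derive from one confusion matrix. Here is a minimal sketch of how they follow from it. It assumes that "contraindication exists" is the positive class and that the 64 consensus cases (29 positive, 35 negative) serve as the reference standard; both choices are assumptions, not the study's stated procedure. The function `evaluate` is hypothetical, and it also assumes that both classes occur and that at least one positive prediction is made, so that no denominator is zero.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Metrics:
    sensitivity: float
    specificity: float
    accuracy: float
    f1: float


def evaluate(reference: Sequence[bool], predicted: Sequence[bool]) -> Metrics:
    """Compare predictions with a reference; True = contraindication exists."""
    if len(reference) != len(predicted):
        raise ValueError("reference and predicted must have the same length")
    pairs = list(zip(reference, predicted))
    tp = sum(r and p for r, p in pairs)          # contraindication correctly flagged
    tn = sum(not r and not p for r, p in pairs)  # correctly cleared
    fp = sum(not r and p for r, p in pairs)      # flagged without contraindication
    fn = sum(r and not p for r, p in pairs)      # missed contraindication

    sensitivity = tp / (tp + fn)  # recall on contraindicated cases
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return Metrics(sensitivity, specificity, accuracy, f1)
```

A value of 1.0 on all four measures corresponds to zero false positives and zero false negatives on the reference cases. Calling `evaluate` with identical lists of 29 `True` and 35 `False` labels reproduces that result.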
Conclusion
The study shows substantial agreement between physical therapists and the AI, consistent with similar musculoskeletal studies. The differences in clinical characteristics between the therapists' consensus groups highlight the importance of clinical reasoning in detecting contraindications. Advanced technologies such as clinical decision support and expert systems could have a profound impact on clinical practice: they could improve accuracy, personalize exercise programs, and support telemedicine referrals, leading to more efficient care and better patient decisions.
Trial registration
30.12.2021 via OSF Registries, https://doi.org/10.17605/OSF.IO/YCNJQ.
![](http://media.springernature.com/m312/springer-static/image/art%3A10.1007%2Fs12553-024-00827-w/MediaObjects/12553_2024_827_Fig1_HTML.png)
![](http://media.springernature.com/m312/springer-static/image/art%3A10.1007%2Fs12553-024-00827-w/MediaObjects/12553_2024_827_Fig2_HTML.png)
![](http://media.springernature.com/m312/springer-static/image/art%3A10.1007%2Fs12553-024-00827-w/MediaObjects/12553_2024_827_Fig3_HTML.png)
Data availability statement
The data can be requested from the corresponding author.
Acknowledgements
The authors would like to thank, among others, Philipp Schlüter, Andres Jung, Prof. Dr. Ursula Hübner and Steffen Schulz for their help with specific statistical questions. We would also like to thank medicalmotion GmbH for providing the patient cases.
Funding
The authors declare that no funding, grants or other support was received during the preparation of this manuscript.
Author information
Contributions
Conceptualization: Griefahn, Annika; Luedtke, Kerstin; Zalpour, Christoff. Methodology: Griefahn, Annika; Luedtke, Kerstin. Formal analysis and investigation: Griefahn, Annika. Writing—original draft preparation: Griefahn, Annika; Luedtke, Kerstin; Zalpour, Christoff. Writing—review and editing: Griefahn, Annika; Luedtke, Kerstin; Zalpour, Christoff. Supervision: Luedtke, Kerstin; Zalpour, Christoff.
Ethics declarations
Ethics approval
Approval was obtained from the ethics committee of the University of Applied Sciences Osnabrück (ID: HSOS/2021/1/3). The procedures used in this study adhere to the tenets of the Declaration of Helsinki.
Consent to participate
All participants provided written informed consent following a detailed explanation of the study’s purpose.
Consent for publication
Not applicable.
Competing interests
AG is an employee of medicalmotion GmbH. The remaining authors declare no competing interests.
Disclaimer
All authors have read and approved the final version of the manuscript. All authors agree that they are responsible for all aspects of the work, and they will ensure that issues related to the accuracy or integrity of any part of the work are adequately investigated and resolved.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary Information
Below is the link to the electronic supplementary material.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Griefahn, A., Zalpour, C. & Luedtke, K. Evaluation of the accuracy of an artificial intelligence in identifying contraindications to exercise therapy - Comparison with and interrater reliability of physical therapists judgments. Health Technol. 14, 513–522 (2024). https://doi.org/10.1007/s12553-024-00827-w
DOI: https://doi.org/10.1007/s12553-024-00827-w