Abstract
Purpose
Currently available methods for surgical skill assessment are either subjective or provide only global evaluations for the overall task. Such global evaluations do not inform trainees about where in the task they need to perform better. In this study, we investigated the reliability and validity of a framework to generate objective skill assessments for segments within a task, and compared task-level assessments from our framework, computed using segment ratings crowdsourced from surgically untrained individuals and from expert surgeons, against manually assigned global rating scores.
Methods
Our framework includes (1) a binary classifier trained to generate preferences for pairs of task segments (i.e., given a pair of segments, specification of which one was performed better), (2) computing segment-level percentile scores based on the preferences, and (3) predicting task-level scores using the segment-level scores. We conducted a crowdsourcing user study to obtain manual preferences for segments within a suturing and knot-tying task from a crowd of surgically untrained individuals and a group of experts. We analyzed the inter-rater reliability of preferences obtained from the crowd and experts, and investigated the validity of task-level scores obtained using our framework. In addition, we compared accuracy of the crowd and expert preference classifiers, as well as the segment- and task-level scores obtained from the classifiers.
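The abstract does not spell out how the segment-level percentile score in step (2) is derived from the pairwise preferences. A minimal sketch of one natural instantiation, assuming the score reduces to each segment's win fraction over the comparisons it appears in (the function name and toy data below are illustrative, not the authors' implementation):

```python
def percentile_scores(prefs):
    """Score each segment by the fraction of its pairwise comparisons it won.

    prefs: dict mapping an unordered pair (a, b) to the preferred segment.
    Returns: dict mapping each segment to a score in [0, 1].
    """
    counts = {}  # segment -> [wins, comparisons]
    for (a, b), winner in prefs.items():
        for seg in (a, b):
            counts.setdefault(seg, [0, 0])[1] += 1
        counts[winner][0] += 1
    return {seg: wins / total for seg, (wins, total) in counts.items()}

# Toy example: "s1" beats "s2" and "s3"; "s2" beats "s3".
prefs = {("s1", "s2"): "s1", ("s1", "s3"): "s1", ("s2", "s3"): "s2"}
scores = percentile_scores(prefs)
# s1 wins 2/2 comparisons, s2 wins 1/2, s3 wins 0/2
```

In practice the preferences would come from the trained binary classifier rather than from manual ratings, and step (3) would regress task-level GRS from these segment-level scores.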
Results
We observed moderate inter-rater reliability within the crowd (Fleiss’ kappa, \(\kappa = 0.41\)) and experts (\(\kappa = 0.55\)). For both the crowd and experts, the accuracy of an automated classifier trained using all the task segments exceeded the corresponding inter-rater agreement [crowd classifier 85 % (SE 2 %), expert classifier 89 % (SE 3 %)]. We predicted the overall global rating score (GRS) for the task with a root-mean-squared error lower than one standard deviation of the ground-truth GRS. Segment-level scores obtained using the crowd and expert preference classifiers were highly correlated with each other (\(\rho \ge 0.86\)), as were the task-level scores (\(\rho \ge 0.84\)), which were also statistically equivalent within a margin of two points (on a score ranging from 6 to 30). Our analyses, however, did not establish statistical equivalence in accuracy between the crowd and expert classifiers within a 10 % margin.
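Fleiss' kappa, used above to quantify inter-rater reliability among more than two raters, has a standard closed form: the mean observed per-item agreement corrected for chance agreement derived from the marginal category proportions. A self-contained sketch (the ratings matrices below are illustrative; `statsmodels.stats.inter_rater.fleiss_kappa` provides an equivalent implementation):

```python
def fleiss_kappa(ratings):
    """Fleiss' kappa for a ratings matrix.

    ratings: n_items x n_categories list of lists of counts; each row
    sums to the (constant) number of raters per item.
    """
    n_items = len(ratings)
    n_raters = sum(ratings[0])
    # Observed agreement: for each item, the fraction of rater pairs agreeing.
    p_i = [(sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
           for row in ratings]
    p_bar = sum(p_i) / n_items
    # Chance agreement from the marginal proportion of each category.
    n_cats = len(ratings[0])
    p_j = [sum(row[j] for row in ratings) / (n_items * n_raters)
           for j in range(n_cats)]
    p_e = sum(p * p for p in p_j)
    return (p_bar - p_e) / (1 - p_e)

# Perfect agreement (3 raters, 2 categories) yields kappa = 1.
kappa = fleiss_kappa([[3, 0], [0, 3]])
```

Values in the 0.41–0.60 range, as reported here, are conventionally read as "moderate" agreement.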
Conclusions
Our framework implemented using crowdsourced pairwise comparisons leads to valid objective surgical skill assessment for segments within a task, and for the task overall. Crowdsourcing yields reliable pairwise comparisons of skill for segments within a task with high efficiency. Our framework may be deployed within surgical training programs for objective, automated, and standardized evaluation of technical skills.
Notes
The term ground truth, here and henceforth, denotes a reference value obtained by pooling the crowd/expert responses.
Participants could take breaks and return to answer these HITs at a later time; thus, we cannot draw reliable conclusions from these numbers.
The number of HITs rated by both the crowd and experts was 120. However, filtering the HITs based on the agreement metric (“HIT agreement and HIT confidence”) drops the count to 75.
Acknowledgments
We thank all participants in our crowdsourcing user study, and Intuitive Surgical, Inc., for facilitating capture of data from the dVSS. A combined effort from the Language of Surgery project team led to the development of the manual task segmentation. The Johns Hopkins Science of Learning Institute and internal funding from the Johns Hopkins University supported this work.
Ethics declarations
Conflict of interest
Anand Malpani, S Swaroop Vedula, C C Grace Chen, and Gregory D Hager declare that they have no conflict of interest.
Ethical standard
All procedures performed in studies involving human participants were in accordance with the ethical standards of the institutional and/or national research committee and with the 1964 Helsinki declaration and its later amendments or comparable ethical standards.
Informed consent
Informed consent was obtained from all individual participants included in the study.
Cite this article
Malpani, A., Vedula, S.S., Chen, C.C.G. et al. A study of crowdsourced segment-level surgical skill assessment using pairwise rankings. Int J CARS 10, 1435–1447 (2015). https://doi.org/10.1007/s11548-015-1238-6