Abstract
Purpose
Currently available methods for surgical skill assessment are either subjective or provide only global evaluations for the overall task. Such global evaluations do not inform trainees about where in the task they need to perform better. In this study, we investigated the reliability and validity of a framework to generate objective skill assessments for segments within a task, and compared task-level assessments from our framework, computed using segment ratings crowdsourced from surgically untrained individuals and from expert surgeons, against manually assigned global rating scores.
Methods
Our framework includes (1) a binary classifier trained to generate preferences for pairs of task segments (i.e., given a pair of segments, specification of which one was performed better), (2) computing segment-level percentile scores based on the preferences, and (3) predicting task-level scores using the segment-level scores. We conducted a crowdsourcing user study to obtain manual preferences for segments within a suturing and knot-tying task from a crowd of surgically untrained individuals and a group of experts. We analyzed the inter-rater reliability of preferences obtained from the crowd and experts, and investigated the validity of task-level scores obtained using our framework. In addition, we compared accuracy of the crowd and expert preference classifiers, as well as the segment- and task-level scores obtained from the classifiers.
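The abstract does not spell out how the segment-level percentile score in step (2) is derived from the pairwise preferences. A minimal sketch of one natural instantiation, assuming the score reduces to each segment's win fraction over the comparisons it appears in (the function name and toy data below are illustrative, not the authors' implementation):

```python
def percentile_scores(prefs):
    """Score each segment by the fraction of its pairwise comparisons it won.

    prefs: dict mapping an unordered pair (a, b) to the preferred segment.
    Returns: dict mapping each segment to a score in [0, 1].
    """
    counts = {}  # segment -> [wins, comparisons]
    for (a, b), winner in prefs.items():
        for seg in (a, b):
            counts.setdefault(seg, [0, 0])[1] += 1
        counts[winner][0] += 1
    return {seg: wins / total for seg, (wins, total) in counts.items()}

# Toy example: "s1" beats "s2" and "s3"; "s2" beats "s3".
prefs = {("s1", "s2"): "s1", ("s1", "s3"): "s1", ("s2", "s3"): "s2"}
scores = percentile_scores(prefs)
# s1 wins 2/2 comparisons, s2 wins 1/2, s3 wins 0/2
```

In practice the preferences would come from the trained binary classifier rather than from manual ratings, and step (3) would regress task-level GRS from these segment-level scores.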
Results
We observed moderate inter-rater reliability within the crowd (Fleiss’ kappa, \(\kappa = 0.41\)) and experts (\(\kappa = 0.55\)). For both the crowd and experts, the accuracy of an automated classifier trained using all the task segments exceeded the corresponding inter-rater agreement [crowd classifier 85 % (SE 2 %), expert classifier 89 % (SE 3 %)]. We predicted the overall global rating score (GRS) for the task with a root-mean-squared error lower than one standard deviation of the ground-truth GRS. Segment-level scores obtained using the crowd and expert preference classifiers were highly correlated with each other (\(\rho \ge 0.86\)), as were the task-level scores (\(\rho \ge 0.84\)), which were also statistically equivalent within a margin of two points (on a score ranging from 6 to 30). Our analyses, however, did not establish statistical equivalence in accuracy between the crowd and expert classifiers within a 10 % margin.
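Fleiss' kappa, used above to quantify inter-rater reliability among more than two raters, has a standard closed form: the mean observed per-item agreement corrected for chance agreement derived from the marginal category proportions. A self-contained sketch (the ratings matrices below are illustrative; `statsmodels.stats.inter_rater.fleiss_kappa` provides an equivalent implementation):

```python
def fleiss_kappa(ratings):
    """Fleiss' kappa for a ratings matrix.

    ratings: n_items x n_categories list of lists of counts; each row
    sums to the (constant) number of raters per item.
    """
    n_items = len(ratings)
    n_raters = sum(ratings[0])
    # Observed agreement: for each item, the fraction of rater pairs agreeing.
    p_i = [(sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
           for row in ratings]
    p_bar = sum(p_i) / n_items
    # Chance agreement from the marginal proportion of each category.
    n_cats = len(ratings[0])
    p_j = [sum(row[j] for row in ratings) / (n_items * n_raters)
           for j in range(n_cats)]
    p_e = sum(p * p for p in p_j)
    return (p_bar - p_e) / (1 - p_e)

# Perfect agreement (3 raters, 2 categories) yields kappa = 1.
kappa = fleiss_kappa([[3, 0], [0, 3]])
```

Values in the 0.41–0.60 range, as reported here, are conventionally read as "moderate" agreement.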
Conclusions
Our framework implemented using crowdsourced pairwise comparisons leads to valid objective surgical skill assessment for segments within a task, and for the task overall. Crowdsourcing yields reliable pairwise comparisons of skill for segments within a task with high efficiency. Our framework may be deployed within surgical training programs for objective, automated, and standardized evaluation of technical skills.
Notes
The term ground truth, here and henceforth, denotes a reference value obtained by pooling the crowd/expert responses.
Participants could take breaks and return to answer these HITs at a later time; thus, we cannot draw reliable conclusions from these numbers.
The number of HITs rated by both the crowd and experts was 120. However, filtering the HITs based on the agreement metric (“HIT agreement and HIT confidence”) drops the count to 75.
Acknowledgments
We thank all participants in our crowdsourcing user study, and Intuitive Surgical, Inc., for facilitating capture of data from the dVSS. A combined effort from the Language of Surgery project team led to the development of the manual task segmentation. The Johns Hopkins Science of Learning Institute and internal funding from the Johns Hopkins University supported this work.
Ethics declarations
Conflict of interest
Anand Malpani, S Swaroop Vedula, C C Grace Chen, and Gregory D Hager declare that they have no conflict of interest.
Ethical standard
All procedures performed in studies involving human participants were in accordance with the ethical standards of the institutional and/or national research committee and with the 1964 Helsinki declaration and its later amendments or comparable ethical standards.
Informed consent
Informed consent was obtained from all individual participants included in the study.
Cite this article
Malpani, A., Vedula, S.S., Chen, C.C.G. et al. A study of crowdsourced segment-level surgical skill assessment using pairwise rankings. Int J CARS 10, 1435–1447 (2015). https://doi.org/10.1007/s11548-015-1238-6