Introduction

Substantial surgical skills are required to perform an arthroscopic procedure without risking iatrogenic injury to articular cartilage and within the time routinely scheduled for the operation [16, 17, 20, 23]. Learning arthroscopic skills takes considerable time and carries an increased risk of surgical errors during the early stages of the learning curve when operating on patients [4, 17, 24]. The traditional learning model, in which the trainee is supervised continuously by the surgeon, attempts to minimize these surgical errors. However, as the training time available for acquiring arthroscopic skills is being reduced [8, 12] and societal demands for high-quality healthcare increase [23], initiatives have been taken to train basic skills away from the operating room [8, 12].

Training of arthroscopic surgical skills is preferably performed with actual instrument handling. This approach is supported by the theory that skilled motor behavior relies on accurate predictive models of our body and of the environment we interact with (eg, instruments) [5, 14, 28, 33, 34]. These predictive models are stored in the central nervous system. To perform a given task, the best available predictive model is selected. A key feature of this theory is that these predictive models are tuned, updated, and learned through feedback from our sensory organs (vision and proprioception).

This requires medical simulators that facilitate adequate training. A broad spectrum of simulators has been described in the literature. Traditionally, cadaveric material has been used as a substitute for live patients [10, 20]. Its value is evident; however, its disadvantages are limited availability and preparation time. Two types of simulators have been introduced to overcome these disadvantages: anatomic bench models [25, 32] and virtual reality systems [4, 15, 17, 26, 35]. As these simulators have matured, they have become commercially available. However, it is unclear whether they qualify as suitable means of training arthroscopic skills.

We therefore addressed the following questions: (1) do commercial simulators have construct validity (times to perform tasks) and face validity (realism), and (2) is the perception of usefulness (educational value and user-friendliness) related to level of experience?

Materials and Methods

On February 18, 2009, we performed a systematic search with the Google™ and Yahoo!® search engines, as these index the largest set of Web pages [1, 7]. Combinations of the search terms arthroscopy, simulator, orthopaedic, models, simulation, and trainer were used. A complementary search was performed in classification code G09B23/28 of the Esp@cenet® patent database. Eight different physical and virtual reality arthroscopy simulators were commercially available. The companies were invited to provide their simulator at our institute for 2 weeks. Two companies agreed to participate: Toltech Knee Arthroscopy Simulator (Touch of Life Technologies, Aurora, CO, USA) and InsightArthroVR® Arthroscopy Simulator (GMV, Madrid, Spain). The other companies [6, 9, 19, 21, 25, 29, 30] declined for various reasons unrelated to financial issues.

The Toltech Knee Arthroscopy Simulator (Simulator A) is a virtual reality simulator for arthroscopic knee surgery with two handles that give haptic feedback (Fig. 1) (Appendix 1). The InsightArthroVR® Arthroscopy Simulator (Simulator B) is a virtual reality simulator for arthroscopic knee and shoulder surgery with a multitool that gives haptic feedback (Fig. 2) (Appendix 1).

Fig. 1

A photograph shows a participant performing tasks on Simulator A.

Fig. 2

A photograph shows a participant performing tasks on Simulator B.

We recruited 37 participants: (1) all staff members who routinely perform arthroscopy and were present at the time of testing (except the main researcher, GMMJK), (2) all residents present at the time of testing, and (3) medical students and researchers of our orthopaedic department. The participants were divided into three groups with different levels of arthroscopic experience: novices, who had never performed an arthroscopic procedure; intermediates, who had performed up to 59 arthroscopies; and experts, who had performed more than 60 arthroscopies. This boundary of 60 arthroscopies was based on the average opinion of fellowship directors who were asked to estimate the number of operations a trainee should have performed before performing unsupervised meniscectomies [24]. Simulator A was evaluated by 22 participants in April 2009 and Simulator B by 22 participants in October 2009 (Fig. 3). One participant had reached a higher level of experience between the two test sessions. The corresponding subgroups had similar characteristics (Fig. 3).

Fig. 3

A flowchart shows the participant population. Subgroups were formed at three levels of arthroscopic experience based on the number of arthroscopies performed: novices (0), intermediates (1–59), and experts (> 60). Seven participants evaluated both simulators. Age in years and the number of attended arthroscopies (“Observation”) are expressed as medians with ranges in parentheses. The number of participants who had previously used a simulator (“Simulator”) or had experience playing computer games (“Games”) is also shown.

Each participant was scheduled for a maximum of 30 minutes. Participants had no opportunity to familiarize themselves with either simulator before the experiment. The researcher demonstrated how to select the exercises and how to perform the calibration protocol and the test tasks.

The assessment of construct validity (time to perform a task) was based on one basic navigation task. As the simulators were unlikely to offer an identical navigation task, one navigation task was prescribed that could be performed on all simulators to allow comparison. With the arthroscope placed in the anterolateral portal and the probe in the anteromedial portal, nine anatomic landmarks had to be probed sequentially: medial femoral condyle, medial tibial plateau, posterior horn of the medial meniscus, midsection of the medial meniscus, ACL, lateral femoral condyle, lateral tibial plateau, posterior horn of the lateral meniscus, and midsection of the lateral meniscus [32]. The participants were asked to repeat this navigation task up to five times within a limit of 10 minutes. The navigation task time was defined as described previously [32] and was determined from a separate video recording of the simulator monitor on which the virtual intraarticular joint is presented. We recorded the median time per experience group for each of the five repetitions of the navigation task.
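
The per-repetition group medians can be tabulated directly from the recorded times. The following is a minimal sketch of that step, assuming the task times are stored in a table with hypothetical columns participant, group, repetition, and time_s; it is an illustration only, not the software used in the study.

```python
# Minimal sketch: median navigation task time per experience group for each
# of the five repetitions. The file name and column names are assumptions.
import pandas as pd

times = pd.read_csv("navigation_times.csv")  # hypothetical file: participant, group, repetition, time_s

medians = (
    times.groupby(["repetition", "group"])["time_s"]
    .median()
    .unstack("group")  # rows: repetitions 1-5; columns: novice/intermediate/expert
)
print(medians)
```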

Face validity (realism), educational value, and user-friendliness were determined by giving the participants a second task, in which exercises characteristic of the particular simulator had to be performed, and by asking them to fill out a questionnaire afterward. The exercises were selected by the faculty surgeon (GMMJK) and the company to ensure they best represented the capability of the simulator. Assistance in performing these exercises was given only if a participant failed to make progress for a period of 2 minutes. Feedback on task performance given by the simulator was pointed out to the participants. The characteristic exercise chosen for Simulator A was “inspection of the suprapatellar pouch with only the 30° arthroscope.” This exercise is set up in three stages: watching an instruction video of the exercise, performing the exercise once guided by example hint-images in a stepwise sequence, and performing the complete exercise once again without guidance. The exercise chosen for Simulator B was threefold: microfracture technique to treat a cartilage lesion of the femoral condyle, visual exploration and probing of a superior labrum anterior posterior (SLAP) lesion, and placement of three suture anchors to repair a Bankart lesion (shoulder instability). All three exercises were preceded by textual instructions and had to be performed once. The questionnaire consisted of questions regarding general information (Fig. 3); face validity of the outer appearance of the simulator, the intraarticular virtual joint, and the virtual instruments (Table 1); educational value; and user-friendliness (Table 2). Questions were answered on a 10-point numerical rating scale (NRS) (eg, 0 = completely unrealistic and 10 = completely realistic) or were dichotomous, requiring a yes/no answer. A 10-point NRS was chosen because all participants were Dutch and this grading system is used at all Dutch educational institutions; we therefore expected the grading to be based on a uniform interpretation of the NRS. A value of 7 or greater was considered sufficient. Some questions featured a “not applicable (N/A)” answer option, which could be used solely by novices, as these questions required prior knowledge of the real-life arthroscopic situation. For the same reason, only the answers of the expert and intermediate groups were used for simulator realism and educational value, and only the answers of the novice and intermediate groups were used for user-friendliness.

Table 1 Questions addressing face validity
Table 2 Questions addressing educational value and user-friendliness
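
As an illustration of the scoring described above, the sketch below aggregates one participant’s answers into a mean summary score (NRS items, with “N/A” answers excluded) and an Educational Value I sum score; the question labels and values are fabricated placeholders, not items from the actual questionnaire.

```python
# Hedged sketch of the questionnaire aggregation; labels and values are invented.

# NRS answers (0-10) for one face-validity aspect; None marks an "N/A" answer.
outer_appearance = {"overall_look": 8, "joint_positioning": 7, "portal_placement": None}

answered = [v for v in outer_appearance.values() if v is not None]
mean_summary_score = sum(answered) / len(answered)   # mean of the answered NRS items
is_sufficient = mean_summary_score >= 7              # 7 or greater was considered sufficient

# Educational Value I: sum of five yes/no questions (yes = 1, no = 0), range 0-5.
educational_value_answers = [True, True, False, True, True]
educational_value_sum = sum(educational_value_answers)

print(mean_summary_score, is_sufficient, educational_value_sum)
```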

Normality of the task-time distributions was assessed with Kolmogorov-Smirnov tests. Owing to small sample sizes and skewed distributions, the task times were analyzed nonparametrically. Construct validity was determined for each simulator independently by using Kruskal-Wallis tests to test for overall differences in task times among the three experience groups for each of the five task repetitions. The significance level was adjusted for multiple comparisons with the Bonferroni-Holm procedure (alpha = 0.05) [11]; when a significant overall difference was detected, we performed pairwise comparisons between the experience groups using Mann-Whitney U tests. The scores for the three separate aspects of face validity of the simulators (Table 1) and for User-friendliness I (Table 2) were expressed as mean summary scores of the corresponding questions. Educational Value I (Table 2) was expressed as a sum score of five dichotomous questions and ranged from 0 to 5. The mean summary scores (Face Validity and User-friendliness I) were verified for normality with Kolmogorov-Smirnov tests, expressed as mean and SD, and assessed for differences between the two simulators with Student’s t-tests. The ordinal Educational Value I scores were presented as medians with ranges and analyzed with a Mann-Whitney U test. The dichotomous questions (Educational Value II and User-friendliness II), expressed as categorical yes/no answers, were presented as frequencies and percentages and analyzed with chi-square tests or Fisher’s exact test (when one or more cells had expected counts less than five). The significance level was again adjusted for multiple comparisons with the Bonferroni-Holm procedure (alpha = 0.05) [11].
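
To make the nonparametric comparison concrete, the following sketch shows the sequence of tests for the task times of a single repetition using SciPy: a Kruskal-Wallis test across the three experience groups, a Bonferroni-Holm step-down adjustment (shown as a standalone helper), and pairwise Mann-Whitney U tests when the overall test is significant. The numbers are fabricated placeholders, not study data.

```python
# Illustrative sketch of the nonparametric analysis; task times are placeholders.
from itertools import combinations
from scipy.stats import kruskal, mannwhitneyu

# Task times (seconds) for one repetition, per experience group (invented values).
groups = {
    "novice":       [447, 380, 600, 181, 520],
    "intermediate": [129, 60, 311, 150, 98],
    "expert":       [125, 68, 245, 110],
}

# Overall test for a difference among the three experience groups.
h_stat, p_overall = kruskal(*groups.values())

def holm_significant(p_values, alpha=0.05):
    """Bonferroni-Holm step-down: returns which p values remain significant.
    In the study this adjustment is applied over the five repetitions."""
    order = sorted(range(len(p_values)), key=lambda i: p_values[i])
    significant = [False] * len(p_values)
    for rank, i in enumerate(order):
        if p_values[i] <= alpha / (len(p_values) - rank):
            significant[i] = True
        else:
            break
    return significant

# Pairwise comparisons, performed only when the overall test is significant.
if p_overall < 0.05:
    for (name_a, a), (name_b, b) in combinations(groups.items(), 2):
        u_stat, p_pair = mannwhitneyu(a, b, alternative="two-sided")
        print(f"{name_a} vs {name_b}: U = {u_stat:.1f}, p = {p_pair:.3f}")
```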

Results

With the exception of two participants, the novice group completed only the first repetition of the navigation task, or none at all, on Simulator A within the time limit (Fig. 4). The novices were slower (p = 0.001) in completing the first repetition. Post hoc analysis showed the navigation task times of the experts (median, 125 seconds; range, 68–245 seconds) and the intermediates (median, 129 seconds; range, 60–311 seconds) were faster (p < 0.001 and p = 0.01, respectively) than those of the novices (median, 447 seconds; range, 181–600 seconds) (Fig. 4). The task times of the intermediates and the experts were similar (p = 0.93). No differences were observed between the experience groups for the other repetitions. For Simulator B, we observed slower task completion by the novices for the second and third repetitions (p = 0.005 and p = 0.008, respectively). The navigation task times of the first repetition of the experts (median, 90 seconds; range, 65–177 seconds) were not faster than those of the novices (median, 165 seconds; range, 109–605 seconds) (p = 0.019) or those of the intermediates (median, 105 seconds; range, 75–204 seconds) (p = 0.503) (Fig. 4). Post hoc comparisons of the second and third repetitions showed faster (p = 0.001 and p = 0.002, respectively) task times of the experts compared with those of the novices (Fig. 4). The task times of the intermediates were not different from those of the experts or novices for these repetitions.

Fig. 4A–B

The graphs show the results of the navigation repetitions for (A) Simulator A and (B) Simulator B. The results are presented as medians with ranges. Construct validity was observed for the first repetition of Simulator A and the second and third repetitions of Simulator B.

The mean face validity scores of the outer appearance and the simulated intraarticular joint were 7.3 (SD, 1.4) and 6.4 (SD, 1.4) for Simulator A and 8.4 (SD, 0.6) and 6.1 (SD, 0.9) for Simulator B, respectively. Thus, they were judged sufficient by the intermediates and experts (Fig. 5). The mean face validity score of the simulated instruments was 4.9 (SD, 1.5) for Simulator A and 5.7 (SD, 1.2) for Simulator B. Thus, the face validity of the simulated instruments was judged barely sufficient for both simulators (Fig. 5). Differences were not observed for any aspect of face validity between the simulators. The median sum score for Educational Value I was 3 (range, 1–5) for Simulator A and 5 (range, 2–5) for Simulator B (p = 0.009). Simulator A was judged suitable preparation for real-life surgery (Educational Value II) by 10 of 11 participants (91%), as was Simulator B by all 13 participants (100%) (p = 0.46). The mean score of 8.3 (SD, 1.0) for User-friendliness I of Simulator B was greater (p < 0.001) than that for Simulator A (6.5 [SD, 1.3]) (Fig. 5). More (p = 0.002) respondents felt the need to read the manual (User-friendliness II) before operating Simulator A (11 of 15, 73.3%) than before operating Simulator B (two of 13, 15.4%).

Fig. 5

A graph shows the results of the normalized sum scores for face validity and User-friendliness I. The values are expressed as means with SDs. User-friendliness I is the combined opinion of the intermediates and novices; the other columns are the combined opinions of the experts and the intermediates. The face validity of the outer appearance and intraarticular joint were judged sufficient. The face validity of the instruments was judged barely sufficient for both simulators. Differences were not observed for any aspect of face validity between the simulators. The mean score for User-friendliness I of Simulator B was greater (p < 0.001) than that for Simulator A.

Discussion

As arthroscopic simulators mature and become commercially available, it is unclear whether they are suitable for use in training. We therefore addressed the following questions: (1) do commercial simulators have construct validity (times to perform tasks) and face validity (realism), and (2) is the perception of usefulness (educational value and user-friendliness) related to level of experience?

We note limitations to our study. First is the relatively small number of participants in each experience group, which could have led to nonsignificant results and to the skewed distribution of task times. The groups could not be enlarged owing to logistical constraints; however, care was taken to include all experts and intermediates present at the time of testing to prevent selection bias. Other evaluation studies with simulators have experienced similar problems in recruiting participants [3, 17, 22, 32]. Second is the absence of transfer or predictive validation, which was not feasible within the available time frame. Studies performed with similar arthroscopy simulators [9, 12] do show that training on these systems shortens the operative learning curve. These findings are in line with the opinion of all participants that training on either simulator would be good preparation before performing real-life arthroscopy. Third, our study is limited to two arthroscopic simulators, which were not particularly distinct from each other, as both are virtual reality systems with haptic feedback devices; this is reflected in the results. Had other types of simulators been included, such as anatomic bench models, a wider range of alternatives could have been described and differences would likely have been more pronounced. Fourth, only one navigation task was used to assess construct validity. The choice of this task is in line with tasks evaluated in other studies [12, 17, 22, 32], and it is regarded as an important arthroscopic skill to master before operating in the theater [27]. Fifth, only a few tasks were used to determine face validity, educational value, and user-friendliness. These tasks were chosen carefully and reflected the way exercises are built up and the way feedback on performance is given by each simulator; we therefore assumed the participants gained a good impression of the learning environment of each simulator. Sixth, the choice of experience levels was somewhat arbitrary, especially the boundary between the novice and intermediate groups. This could have influenced the demonstration of construct validity, as the experience levels might have been insufficiently distinctive.

Neither simulator showed full construct validity (Fig. 4), because differences in task times between novices and experts were not observed for all repetitions, and the task times of intermediates and experts were similar. These findings are comparable to those of Srivastava et al. [31], who used a similar division of experience levels and found no substantial differences between the groups. They speculated that their results may have been influenced by the fact that experts knew what to expect and novices were highly motivated; the same could be true for our study. A more detailed comparison with other studies cannot be made, as the criteria to qualify as expert, intermediate, or novice differ among studies [17, 26, 32] or a different acceptable significance level was chosen [2]. We recommend setting uniform experience levels when performing this type of study. By using the study of O’Neill et al. [24], we aimed to provide a solid foundation for assigning experience levels. Task time was chosen as an outcome measure because it is widely used and validated in assessing the learning of surgical skills, it can be measured on all commercially available arthroscopy simulators, and it makes overall objective comparison possible.

Face validity was observed for both simulators, although there is room for improvement. The presence of tactile feedback in an arthroscopy simulator is considered essential to adequately imitate clinical practice and to train safe manipulation [22, 36]. Intermediates and experts indicated that tissue probing was unrealistic on both simulators (Fig. 5). Training skills without receiving natural feedback could lead to an offset in the internal models stored in the central nervous system, which might increase errors in the operating room. Providing realistic force feedback for cutting or shaving is another challenge to implement in these simulators [15]. The intraarticular joint space of Simulator B was considered too large. Additionally, as both simulators present virtual reality images, they leave an artificial impression; this could be improved with the latest animation techniques used in the gaming industry. These face validity results are comparable to those of other studies, in which imitation of the real-life situation generally is sufficient but none receives a perfect score [2, 17, 18, 31, 32]. An explanation could be that simulators that do not resemble a human joint are graded more leniently because they so obviously do not resemble reality, whereas simulators that come close to a real-life representation are scrutinized more thoroughly for small deviations. Educational value was perceived for both simulators by intermediates and experts. This subjective opinion is supported by Issenberg et al. [13], who identified a top 10 list of the most important educational criteria for medical simulators. Both simulators fulfill seven of the 10 criteria, including the most important ones: giving feedback on performance, allowing repetitive practice, and allowing integration into the curriculum. Unfortunately, neither offers training in precise portal placement, which is another important skill to be mastered before starting to operate on patients [27].

Overall, Simulator B was considered more user-friendly than Simulator A, although Simulator A was graded satisfactory (Fig. 5). The feedback given by Simulator B resembles the way mainstream computer games provide feedback. For both simulators, there is room for improvement. Simulator B offers a larger variety of exercises and is more user-friendly, whereas Simulator A showed a more distinct difference in task times between experts and novices. Teaching surgeons can embrace this type of simulator for implementation in training curricula.