Introduction
Most surgical procedures, including temporal bone surgery, require demanding cognitive and psychomotor skills of the surgeon. High-quality training with repeated practice is important to ensure competency, a good surgical outcome and patient safety.Reference Reznick1 Novices have traditionally been introduced to temporal bone surgery through hands-on cadaveric dissection.Reference George and De2 Nevertheless, because of a decrease in human cadaveric temporal bones available for dissection,Reference Frithioff, Sørensen and Andersen3 interest in alternative training methods such as virtual reality simulation has increased. Even though the evidence for efficacy of virtual reality simulation training is well-established,Reference Zhao, Kennedy, Yukawa, Pyman and O'Leary4–Reference Javia and Deutsch6 implementation and systematic integration in the curriculum is often limited.Reference Frithioff, Sørensen and Andersen3
Virtual reality simulation allows the trainee to practice on an unlimited number of cases but also provides the opportunity for directed self-regulated learning.Reference Brydges, Dubrowski and Regehr7 This represents a self-directed learning experience in which the trainees are able to regulate their own learning, scaffolded by instructional design and learning supports provided by the educator and without the presence of a human instructor.Reference Brydges, Dubrowski and Regehr7
Several benefits of directed self-regulated learning have been reported, such as long-term benefits on performance as well as cost-effectiveness because little or no presence of an instructor is needed.Reference Brydges, Nair, Ma, Shanks and Hatala8,Reference Frendø, Thingaard, Konge, Sølvsten and Steven9 Feedback has consistently been identified as a key feature of successful simulation-based surgical training,Reference Cook, Brydges, Zendejas, Hamstra and Hatala10,Reference Issenberg, Mcgaghie, Petrusa, Gordon and Scalese11 and this can be provided by the simulator itself.Reference Wijewickrema, Zhou, Ioannou, Copson and Piromchai12–Reference Zirkle, Roberson, Leuwer and Dubrowski15 Altogether, this allows trainees to practice and acquire surgical skills at any time, even at home.Reference Frendø, Thingaard, Konge, Sølvsten and Steven9
In temporal bone surgical skills training, virtual reality simulation with continuous simulator-integrated tutoring has been found to accelerate the initial learning curves of novices.Reference Andersen, Konge, Cayé-Thomasen and Sørensen16 However, after just a few procedures novices seemingly reach a learning curve plateau because of over-reliance on tutoring.Reference Andersen, Konge and Mikkelsen17 In accordance with the ‘guidance hypothesis’, this over-reliance on continuous (concurrent) feedback negatively affects performance when the feedback is withdrawn.Reference Park, Shea, Wright, Shea, Reduced-frequency and Park18 Feedback also affects the cognitive processes of the learner,Reference Hatala, Cook, Zendajas, Hamstra and Brydges19 and cognitive load theory provides a theoretical framework for understanding learning from a cognitive perspective. The main premise of cognitive load theory is that working memory and information processing capacity are limited, especially for the novice learner.Reference Sweller20 If the total cognitive load exceeds the capacity of the learner, the resulting cognitive overload negatively affects performance and learning.Reference Haji, Cheung, Woods, Regehr, de Ribaupierre and Dubrowski21,Reference Frithioff, Frendø, Mikkelsen, Sørensen and Andersen22 However, some cognitive load (the germane load) is required for the formation of mental schemata (i.e. learning), and continuous feedback can interact with this process.Reference Van Merriënboer and Sweller23
In contrast to continuous feedback, the use of summative (terminal) feedback appears to result in better learning.Reference Hatala, Cook, Zendajas, Hamstra and Brydges19 In virtual reality temporal bone surgical simulation, such summative feedback has mostly been based on expert ratings of performance using structured assessment tools.Reference Sethia, Kerwin and Wiet24 This is time-consuming and requires either instructor presence during training or later assessment based on a recording of the procedure or evaluation of the final product. This makes timely summative feedback nearly impossible. Many simulator-gathered metrics for performance have been suggested,Reference Al-Shahrestani, Sørensen and Andersen25 and recent efforts to integrate these into valid assessment enable automated and immediate summative feedback.Reference Andersen, Mikkelsen and Sørensen14 For other procedural skills such as endoscopyReference Vilmann, Norsk, Bo and Svendsen26 and ultrasound,Reference Ahmed, Niessen, Gallagher, Breslin, Dunngalvin and Shorten27 automated simulator-based feedback has shown positive effects on novices’ performance.
Very little is known about the effects of using summative feedback in virtual reality temporal bone simulation training, but we hypothesise that it will improve end-of-training performance, increase retention of skills and modify cognitive load for the novice. In this study, we therefore want to compare summative feedback based on simulator metrics against standard training without summative feedback in distributed virtual reality simulation training of mastoidectomy.
Materials and methods
Study design, participants and setting
This was a prospective, controlled, randomised trial of an educational intervention. In order to represent true novice trainees, 27 medical students were recruited from the University of Copenhagen, Denmark, and 24 completed the training programme. Figure 1 shows the Consolidated Standards of Reporting Trials flow diagram. Participants were recruited from both clinical and non-clinical semesters, but none had any clinical exposure to temporal bone surgery as this is not part of the pre-graduate curriculum. Prior temporal bone surgical simulation training was the only exclusion criterion. Participants were volunteers and did not receive compensation, and the training was considered an extracurricular activity. The trial took place at the Simulation Centre at Copenhagen Academy of Medical Education and Simulation from October to December 2019 with retention testing in February–March 2020.
Simulation equipment
The virtual reality simulation platform used was an experimental version of the Visible Ear Simulator (version 2.1) that features a range of simulator-integrated metrics for feedback.Reference Andersen, Mikkelsen and Sørensen14 The Visible Ear Simulator is a high-fidelity virtual reality temporal bone surgical simulator offered as academic freeware online.Reference Sørensen, Mosegaard and Trier28 The simulator uses the Geomagic Touch haptic device (3D Systems, Rock Hill, USA) for drilling of a virtual temporal bone with force feedback.
Randomisation
Participants were randomised by the first author (AF) with a 1:1 allocation ratio into two groups using an online random sequence generator before starting the training programme. Upon dropout of one participant, a new participant was recruited and assigned to the same group as the participant who dropped out.
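The allocation rule described above can be sketched in a few lines of Python. This is only an illustration of 1:1 allocation with same-group replacement of a dropout: the trial itself used an online random sequence generator, and the function names, participant identifiers and seed below are hypothetical.

```python
import random


def allocate_1_to_1(participant_ids, seed=None):
    """Randomly split participants into two equal-sized arms (1:1 allocation).

    Hypothetical helper; the trial used an online random sequence generator.
    """
    rng = random.Random(seed)
    shuffled = list(participant_ids)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    allocation = {pid: "control" for pid in shuffled[:half]}
    allocation.update({pid: "intervention" for pid in shuffled[half:]})
    return allocation


def replace_dropout(allocation, dropout_id, replacement_id):
    """Give a newly recruited participant the arm of the one who dropped out."""
    updated = dict(allocation)
    updated[replacement_id] = updated.pop(dropout_id)
    return updated


# Illustrative identifiers only: 24 participants, 12 per arm.
participants = [f"P{n:02d}" for n in range(1, 25)]
allocation = allocate_1_to_1(participants, seed=2019)
```

Replacing a dropout within the same arm keeps the arms balanced, but the replacement is not itself randomised, which is worth bearing in mind when reading the group comparisons.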
Intervention
Participants in both groups first completed a background questionnaire. Next, participants were introduced to the simulator's navigation and controls by a brief and individual hands-on exercise (5 minutes).
Both training programmes (control and intervention) consisted of five blocks of distributed training: each block was spaced by at least one week and consisted of three identical procedures (complete anatomical mastoidectomy procedures with posterior tympanotomy). As a warm-up, participants were guided by colour-coding (green-lighting) of the bone volume to be drilled in procedure 1 (baseline) but not during any of the following procedures (procedures 2 to 15). Both groups had access to an on-screen, step-by-step dissection guide (standard instructions), which was available at all times during all training procedures. There was no time limit for the procedures.
In contrast to the control group, the intervention group received structured, written summative feedback based on simulator metrics immediately after each procedure.Reference Andersen, Mikkelsen and Sørensen14 This scoring and feedback sheet (Appendix 1) provides the participant with an overall metrics-based score as well as feedback on choice of drill, bone volume removed, and collisions with important anatomical structures including the dura, facial nerve, chorda tympani, semi-circular canals and the ossicles.
Two months after finishing the initial training, all participants were invited back for retention testing. This consisted of two procedures (procedures 16 and 17) identical to the training procedures but without access to the on-screen instructions and without summative feedback or access to prior scoring sheets.
Outcomes
The primary outcome was manual assessment of the mastoidectomy performance (final-product score). This was done after the trial using a 26-item modified Welling scale for final-product analysisReference Andersen, Cayé-Thomasen and Sørensen29 of the end results of the drilling (Figure 2). Two experienced raters (SAWA and MSS), who were blinded to participant, procedure number and group assignment, assessed the performances.
A secondary outcome was the metrics-based score, which is based on five sub-scores combining different metrics and reflecting a correct use of drills, efficiency and goal-directed drilling behaviour. A proficiency level (i.e. pass) for this score has previously been established at a metrics-based score of 83.6 per cent.Reference Andersen, Mikkelsen and Sørensen14 We further added a collisions score based on the number of collisions with critical structures and also recorded the time used for the procedure.
Cognitive load was another secondary outcome and was measured by secondary-task reaction time, which is an established method for estimating cognitive load.Reference Naismith and Cavalcanti30 This was done using a reaction timer (American Educational Products, Fort Collins, USA) measuring the time (in 1/100 seconds) it takes to press on a foot switch in response to a beep. Measurements were performed in series of four at baseline (before and after training) and at 5 minutes and 15 minutes during the simulation. Cognitive load was calculated as the mean reaction time during simulation divided by the mean reaction time at baseline (i.e. the relative reaction time).Reference Andersen, Mikkelsen, Konge, Cayé-Thomasen and Sørensen31
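The relative reaction time described above is a simple ratio of means. The following minimal sketch shows the arithmetic; the function name and the sample values (in 1/100 seconds) are hypothetical and are not trial data.

```python
from statistics import mean


def relative_reaction_time(baseline_times, simulation_times):
    """Mean reaction time during simulation divided by mean baseline reaction time.

    Both arguments are reaction times in the same unit (here 1/100 seconds).
    A value above 1 means slower secondary-task responses during simulation,
    which is interpreted as a higher cognitive load.
    """
    baseline_mean = mean(baseline_times)
    if baseline_mean <= 0:
        raise ValueError("baseline reaction times must be positive")
    return mean(simulation_times) / baseline_mean


# Hypothetical values: series of four before and four after training (baseline),
# and series of four at 5 minutes and at 15 minutes during the simulation.
baseline = [30, 32, 28, 30, 31, 29, 30, 30]
during_simulation = [44, 46, 45, 45, 47, 43, 45, 45]
ratio = relative_reaction_time(baseline, during_simulation)  # 45 / 30 = 1.5
```

In this made-up example, a ratio of 1.5 corresponds to reaction times that are 50 per cent longer during simulation than at baseline.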
Sample size
Sample size calculations were based on experience from similar studies because sample size calculations for repeated measurement designs are not well-defined. Therefore, we chose 12 participants in each arm which, based on previous studies, should be able to detect a 10 per cent difference in the final-product outcome.
Statistical methods
Data were analysed using SPSS® (version 25) for Mac® OSX. Because of repeated measurements, linear mixed models, using the principles outlined by Leppink,Reference Leppink32 were used in the analyses. Models were iteratively built to investigate the different factors and their interactions as fixed effects: for the final-product score, the final model included group, procedure number and rater; for the metrics-based score outcomes, the final model included group and procedure number; for the cognitive load outcome, the final model included only group because the timing of the reaction time measurement during the procedure (at 5 minutes and 15 minutes) and procedure number were not found to influence cognitive load; for the retention procedures, the corresponding models included group and rater (final-product score) or group only (metrics-based score and cognitive load). Estimated marginal means and p-values of the linear mixed models were reported. P-values less than 0.05 were considered statistically significant.
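As an illustration of the model structure for the final-product score, the fixed effects named above can be written out as follows. The notation is hypothetical, and the participant-level random intercept is an assumption made here to reflect the repeated measurements; the text specifies only the fixed effects, not the random-effects or covariance structure actually used.

```latex
% Hypothetical notation: i = participant, j = procedure number, k = rater.
% The random intercept u_i is an assumption; the text names only the fixed effects.
\begin{equation}
  y_{ijk} = \beta_0 + \beta_1\,\mathrm{group}_i + \beta_2\,\mathrm{procedure}_j
          + \beta_3\,\mathrm{rater}_k + u_i + \varepsilon_{ijk},
  \qquad u_i \sim \mathcal{N}(0, \sigma_u^2), \quad
  \varepsilon_{ijk} \sim \mathcal{N}(0, \sigma^2)
\end{equation}
```

Under this reading, the reported between-group mean difference corresponds to the estimate of the group coefficient, and the per-procedure improvement to the estimate of the procedure coefficient.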
Ethics
The regional ethical committee of the Capital Region of Denmark found this educational trial exempt (H-19069755). Written consent was obtained from participants.
Results
Participants in the control and intervention groups had similar baseline characteristics including self-reported computer skills and gaming frequency (Table 1).
Effects on final-product score
For the expert assessment of the final-product score performance, the two groups had similar performance at baseline (i.e. the warm-up procedure) (mean difference = 0.7 points; p = 0.45). During the trial, final-product score increased with repeated practice (0.08 points per procedure; p = 0.045) in both groups as expected (Figure 3). Importantly, we found that the intervention group significantly outperformed the control group (mean difference = 1.0 point; p = 0.001). At retention testing, the intervention group performed slightly better than the control group, but this was not statistically significant (Table 2).
Effects on metrics-based score, collisions and time
For performance assessment using the automated metrics-based score, we found similar results. Participants scored similarly at baseline (mean difference = 1.9; p = 0.60) and repeated practice increased the metrics-based score (1.6 per cent per procedure; p < 0.001). During training, the intervention group markedly outperformed the control group (mean difference = 12.7 per cent; p < 0.001; Figure 4). This also resulted in the intervention group having more total performances that passed the pre-defined proficiency level compared with the control group (41.6 per cent vs 8.8 per cent; p < 0.001). Finally, at retention testing, the intervention group continued to have a higher metrics-based score compared with the control group (mean difference = 6.9 per cent; p = 0.02) (Table 2). We found a poor correlation between the metrics-based score and final-product score (r² = −0.04).
For collisions and time, the intervention group made significantly fewer total collisions (mean, 43.4 vs 54.1; p < 0.001) and also completed the procedure using less time compared with the control group (mean difference = 4.6 minutes; p < 0.001). At retention testing, we found no statistically significant difference in the number of collisions (mean difference = 6.3; p = 0.31) or time (mean difference = 2.4 minutes; p = 0.35).
Effects on cognitive load
There was no statistically significant difference in cognitive load between the intervention and control groups at baseline (mean difference = 6.2 per cent; p = 0.33) or during training (mean difference = 1 per cent; p = 0.20), and cognitive load did not decrease with repeated practice. In contrast, the intervention group was found to have a higher cognitive load compared with the control group during retention testing (mean difference = 10 per cent; p < 0.001) (Table 2). When comparing cognitive load at the end of training (procedures 13–15) with the retention test (procedures 16–17), cognitive load was 7.1 per cent higher for the intervention group (p = 0.005) whereas the control group experienced a 1.8 per cent decrease in cognitive load (p = 0.005).
Discussion
Overall, we found that the summative feedback intervention considerably improved novices’ performances during virtual reality simulation training and accelerated the initial learning curve, whether measured by manual assessment or by automated scoring based on simulator metrics. Further, the intervention resulted in fewer collisions with key structures (e.g. the facial nerve) and also decreased time to complete the procedure. At the retention test, metrics-based score remained higher for the intervention group; however, there was no significant difference in performance for the final-product score. The intervention did not affect cognitive load during training; however, during the retention testing, the cognitive load induced in the intervention group was significantly increased.
It is not surprising that the intervention group had a higher metrics-based score compared with the control group during the training because the intervention group received this score along with feedback based on the same metrics after each completed procedure, whereas the control group did not receive any summative feedback. The learning curves of both groups (Figures 3 and 4) follow a classic pattern with initial fast acceleration of performance followed by gradual plateauing after just a few procedures (i.e. negatively accelerated learning curves).Reference Andersen, Konge, Cayé-Thomasen and Sørensen16 The difference in metrics-based score between groups at procedure two reflects the feedback that the intervention group received after their warm-up procedure (procedure one).
The metrics-based score mainly reflects process and efficiency, such as choosing the appropriate burr size and type, time aspects, and goal-directed behaviour. In line with previous studies,Reference Andersen, Mikkelsen and Sørensen14 we found the metrics-based score to correlate poorly with the manual final-product score, which considers only the end result and emphasises safety-related parts of the procedure, such as avoiding drilling holes and damaging key structures.Reference Andersen, Mikkelsen and Sørensen14,Reference Zirkle, Taplin, Anthony and Dubrowski33 Nevertheless, providing the participants with the summative metrics-based score and collision information had a positive impact on their final-product performance (final-product score). Consequently, the automated summative feedback appears to be a strong educational tool for directed, self-regulated learning. Ultimately, this allows learners to develop basic surgical skills in mastoidectomy, reducing the need for human instructorsReference Brydges, Dubrowski and Regehr7 who can be saved for more advanced training, such as on cadavers.
Our study adds new knowledge for several reasons: first, it is the first study to investigate automated summative feedback in temporal bone training because all previous studies have used continuous feedback (real-time feedback), through green-lighting for example.Reference Wijewickrema, Zhou, Ioannou, Copson and Piromchai12,Reference Davaris, Wijewickrema, Zhou, Piromchai, Bailey, Kennedy, Isotani, Millán, Ogan, Hastings, McLaren and Luckin34,Reference Wijewickrema, Piromchai, Zhou, Ioannou, Bailey and Kennedy35 Next, we have studied the effect in a prolonged, distributed training programme, which is closer to real-life training conditions. Also, we included retention testing after two to three months to study the effect on longer term performance. Finally, we did not only measure the performance as simulator-gathered score (metrics-based score), but also as assessed by experts using an established mastoidectomy assessment tool (final-product score).
This study on summative feedback was motivated by previous findings, which demonstrated that real-time feedback may have negative effects when it is withdrawn.Reference Andersen, Konge, Cayé-Thomasen and Sørensen16 This is likely explained by tutoring over-reliance, which easily occurs in early stages of learning. In contrast, we now report that summative feedback does not have the same negative impact on acquisition of skills or retention, which is consistent with ‘the guidance hypothesis’.Reference Hatala, Cook, Zendajas, Hamstra and Brydges19 We cannot, however, determine the ideal number of procedures with summative feedback after which performance remains stable once the feedback is withdrawn. A future step would be to further investigate the effects of summative feedback on transfer of simulation skills to performance in cadaveric dissection.
• Simulation-based training can be used for self-directed acquisition of temporal bone surgical skills
• Automated feedback is key for effective directed, self-regulated learning, but the best way to provide such feedback is unknown
• Metrics-based summative feedback leads to a more efficient and safer drilling behaviour in virtual reality mastoidectomy training
• Metrics-based summative feedback is a strong educational tool for novices in the early acquisition of temporal bone surgical skills
• Metrics-based summative feedback can be integrated as an automated learning support in simulation-based temporal bone training
We found cognitive load to be similar and stable for the two groups during training. Surprisingly, during the retention testing, a higher cognitive load was induced in the intervention group. Other studies within virtual reality simulation-based training of mastoidectomy have found that other learning supports affect cognitive loadReference Andersen, Mikkelsen and Sørensen36,Reference Andersen, Frendø, Guldager and Sørensen37: for example, continuous feedback through automated tutoring reduces cognitive load during training but at the cost of inducing a very high cognitive load when tutoring is withdrawn. According to cognitive load theory, a low cognitive load during training of complex skills is not unconditionally beneficial for actual learning because some cognitive resources need to be allocated for the learning process itself.Reference Andersen, Frendø and Sørensen38,Reference Van Merriënboer, Kester and Pass39 The sub-components of cognitive load are difficult to measure separately, and because relative reaction time estimates the total cognitive load, we are not able to determine whether the distribution between sub-components differs between our two groups.
A limitation of our study is that we used medical students as participants. In contrast to even first-year residents, medical students are true novices in relation to the procedure, and their learning objectives and motivation might therefore be very different. Consequently, we cannot directly extrapolate our results to more experienced learners, and future studies should elucidate whether our findings also apply to experienced learners or other specialties. Furthermore, we did not investigate a transfer outcome such as performance in cadaver dissection or in the operating theatre. As the virtual reality environment differs from the operating theatre in several ways (e.g. no bleeding or need for suctioning), a complete transfer of skills cannot be expected.Reference Andersen, Foghsgaard, Konge, Cayé-Thomasen and Sørensen5,Reference Gawecki, Wegrzyniak, Mickiewicz, Talar, Wierzbicka and Gawłowska40,Reference Andersen, Foghsgaard, Cayé-Thomasen and Sørensen41 A strength of our study is that our training programme was distributed (i.e. comprised multiple sessions separated by several days), which is not only an important part of directed self-regulated learningReference Gawecki, Wegrzyniak, Mickiewicz, Talar, Wierzbicka and Gawłowska40 but also results in better acquisition of skills in temporal bone surgery compared with massed practice.Reference Andersen, Konge, Cayé-Thomasen and Sørensen16,Reference Andersen, Foghsgaard, Cayé-Thomasen and Sørensen41 Validity evidence for the metrics-based score that we used for summative feedback has been established.Reference Andersen, Mikkelsen and Sørensen14 However, metrics are simulator-specific and vary between simulators,Reference Al-Shahrestani, Sørensen and Andersen25 and consequently, integration of metrics-based score for summative feedback in other simulators requires context-specific validity evidence to be collected.
Our study has several implications for virtual reality simulation-based training in temporal bone surgery. Automated, summative metrics-based feedback leads to an improved training and retention performance, supporting directed, self-regulated learning where the trainee can practice without the presence of human instructors. Furthermore, learning curves were accelerated, and even though the performance-gap between the control and intervention group in this study might diminish over time, summative metrics-based feedback can help reduce training time to reach a certain level of competence. The metrics-based feedback also resulted in a more efficient and safer drilling behaviour, which hopefully could translate into a safe clinical behaviour as well. Finally, virtual reality simulation training should be considered a first step before using other training modalities, saving cadaver and instructional resources, for example, until the trainee has demonstrated adequate skills in simulation. A comprehensive surgical training curriculum should integrate different training modalities and implement mastery learning where feedback, score-tracking and testing constitute crucial elements.Reference Smith and Lonie42
Conclusion
Summative metrics-based feedback has several positive effects on novices’ performance in virtual reality simulation-based training of temporal bone surgery. This includes increasing performance during training, reducing the number of collisions with key structures and reducing time for each simulated procedure. These positive effects seemed to be retained to some degree after two to three months. For these reasons, summative feedback can potentially lead to a safer, better and more efficient performance. The intervention did not seem to affect the total cognitive load during training, most likely because cognitive resources were allocated towards germane load (i.e. formation of mental schemata). Altogether, automated metrics-based summative feedback is a valuable educational tool in novices’ initial mastoidectomy skills acquisition and can be integrated as a support for directed, self-regulated learning in the basic temporal bone skills training curriculum.
Acknowledgements
Steven Andersen has received research funding for his postdoctoral study from the Independent Research Fund Denmark (8026-00003B). The remaining authors have no other sources of funding or support to declare.
Competing interests
None declared