Assessment of pediatric anesthesia trainees is complicated by the random nature of adverse patient events and the vagaries of clinical exposure. However, assessment is critical to improve patient safety. In previous studies, a multiple scenario assessment provided reliable and valid measures of the abilities of anesthesia residents. The purpose of this study was to develop a set of relevant simulated pediatric perioperative scenarios and to determine their effectiveness in the assessment of anesthesia residents and pediatric anesthesia fellows.
Ten simulation scenarios were designed to reflect situations encountered in perioperative pediatric anesthesia care. Anesthesiology residents and fellows consented to participate and were debriefed after each scenario. Two pediatric anesthesiologists scored each scenario using a key action checklist. The psychometric properties (reliability, validity) of the scores were studied.
Thirty-five anesthesiology residents and pediatric anesthesia fellows participated. The participants with greater experience administering pediatric anesthetics generally outperformed those with less experience. Score variance attributable to raters was low, yielding a high interrater reliability.
A multiple-scenario, simulation-based assessment of pediatric perioperative care was designed and administered to residents and fellows. The scores obtained from the assessment indicated the content was relevant and that raters could reliably score the scenarios. Participants with more training achieved higher scores, but there was a wide range of ability among subjects. This method has the potential to contribute to pediatric anesthesia performance assessment, but additional measures of validity including correlations with more direct measures of clinical performance are needed to establish the utility of this approach.
What We Already Know about This Topic
Medical simulation using multiple scenarios may enhance training and predict clinical skills in anesthesia residents, but this has not been confirmed in pediatric anesthesia
What This Article Tells Us That Is New
In a study of residents and pediatric anesthesia fellows, a multiscenario simulation assessment was developed with high interrater reliability and correlation with degree of clinical experience
ANESTHESIOLOGY has pioneered the use of simulation technologies to enhance medical education and advance the evaluation of practitioner knowledge and skills. Simulation has been used to provide education in a number of domains in anesthesia practice that are difficult to assess. Some recent studies have used simulation to evaluate whether novice anesthesia residents are able to recognize and manage intraoperative hypoxia or hypotension, to assess team performance during obstetric emergencies, to determine whether residents effectively manage an intraoperative pediatric cardiac arrest, and to provide a method to train residents to wean patients from cardiopulmonary bypass.1–4 Simulation has also been used to evaluate and train pediatric residents in the performance of cardiopulmonary resuscitation.5 In previous studies, a resident's skill was more reliably estimated when multiple scenarios were included in the assessment.6–8 In this investigation, a multiple-scenario assessment was designed to assess the advanced skills required in pediatric anesthesia settings.
Studies that have examined risks to children receiving anesthesia have described younger patient age, higher degree of illness, and provider inexperience as contributing to poor anesthesia outcomes. A physician's skill and experience are essential components of competent clinical care that contribute to safer outcomes in pediatric patients. The scenarios in this study were designed to reflect the broad range of ages and sizes of pediatric patients and to simulate the types of adverse events reported in pediatric anesthesia practice.9–10
Anesthesiology specialty training stipulates that residents should have a minimum of 2 months of pediatric anesthesia experience and provide anesthesia for a variety of pediatric patients.** However, these experiences may not effectively provide residents with the skills needed to manage various pediatric anesthetic conditions. One of the purposes of this study was to determine how experience and training time in pediatric anesthesia affected a resident's ability to manage a variety of pediatric anesthesia critical events. Although the American Board of Anesthesiology sets this as a minimum training bar to achieve board eligibility, there are no studies that compare the performance in simulation scenarios of residents who have completed the 2-month minimum of pediatric anesthesia training with that of residents who have completed less than 2 months. In this project, we used simulation technology to assess pediatric anesthesia skills in anesthesiology trainees who are learning to administer anesthesia to children. Our first goal was to design and construct a set of simulated scenarios that a resident who was board-eligible in anesthesiology would be expected to manage capably. A second goal was to use the performances of the anesthesia residents and pediatric anesthesia fellows to determine whether the scenario scores were reproducible and provided an estimate of ability.
Materials and Methods
Scenarios
Scenarios were designed to incorporate topics from the pediatric anesthesia component of the content outline for the American Board of Anesthesiology's In-Training Examination.†† Scenario design was also guided by the Accreditation Council for Graduate Medical Education's expectations for pediatric anesthesia competence for a graduating resident in anesthesiology. After the scenario content was delineated, input about expectations of resident performance was obtained from six faculty members in the division of pediatric anesthesia at Washington University in St. Louis, Missouri, and a set of checklist scoring items was established for each scenario. Initial scenarios were pilot-tested with residents to determine the feasibility of the project and the logistics of running the simulations. The scenarios were then fine-tuned by having three pediatric anesthesiologists complete them and provide feedback about the scenario design and the scoring points. The goal was to present content reflecting clinical situations that are commonly encountered in pediatric anesthesia settings (laryngospasm, neonatal resuscitation) rather than those typically referred to tertiary pediatric care centers (complex congenital heart disease), and to provide a balanced set of scenarios that broadly assess the practitioner in diverse aspects of pediatric anesthetic management.
Scoring
Ten scenarios were developed to reflect a broad array of potential clinical situations encountered in the perioperative management of pediatric patients (table 1). For each scenario, a checklist was created with 8–10 key diagnostic and therapeutic items. For example, in the neonatal resuscitation scenario, the checklist included the following items: dries baby, stimulates baby, suctions mouth, begins bag-mask ventilation, calls for help, starts chest compressions for bradycardia, intubates baby, requests umbilical venous catheterization, and requests a neonatal intensive care unit bed. The scenarios and checklists were designed, piloted, and revised with the input of actively practicing academic pediatric anesthesiologists.
Participants
The Institutional Review Board at Washington University approved this study, and written informed consent was obtained from all participants. There were 51 anesthesiology residents and 8 pediatric anesthesia fellows during the period when the study was conducted. Sessions generally occurred during the participants' pediatric anesthesiology rotation, and 35 anesthesiology residents and fellows (59%) were available and consented to participate. These included residents in the Department of Anesthesiology at Washington University in at least their second month of pediatric anesthesiology training as well as pediatric anesthesia fellows. Participation was not mandatory, although no one refused to participate. The results were kept confidential and were neither shared with the program director nor included in the participants' evaluations on their pediatric anesthesia rotation. The scenarios were performed in a 2-h session in the Saigh Pediatric Simulation Center at St. Louis Children's Hospital using the Laerdal SimNewB® (Wappingers Falls, NY) and the METI PediaSIM HPS® (Sarasota, FL) mannequins. The performances were documented using the B-Line SimCapture® (Washington, DC) audiovisual recording system. Each participant was given an orientation to the mannequins and the method of the study, and then participated in the scenarios individually. They had 5 min to complete each simulation, with a short debriefing between scenarios. The assessment cohort included 11 first-year anesthesia residents with less than 2 months of experience in pediatric anesthesia. Of the 24 trainees with more than 2 months of pediatric anesthesia experience, 7 were second-year residents, 9 were third-year residents, and 8 were pediatric anesthesia fellows. There were 22 males (62.9%) and 13 females (37.1%). The average age of the participants was 32.2 yr (SD = 3.6). Most of the participants (n = 31, 89%) were currently certified in advanced cardiac life support, and most (n = 31, 89%) were not currently certified in pediatric advanced life support.
Rater Training/Scoring
Three raters scored the performances: the lead investigator (JF) and two pediatric anesthesiologists (MB, WW). All scenario performances were evaluated independently by at least two raters (one subject was evaluated by all three raters). The lead investigator designed and conducted all of the sessions and scored all of the performances. The other two raters (MB and WW) provided scores for 17 and 19 residents, respectively. One of these raters (MB) had previously participated in the pilot testing as a subject and as a rater. The lead author oriented the raters to the purpose and the design of the study, including the use of the checklist. Any ambiguities in the checklists noted by the raters were addressed before implementation, and the lead author was always available to attend to any concerns. The raters have completed pediatric anesthesia fellowships, are board-certified in anesthesiology, and have active pediatric anesthesia practices. Two independent scores were thus obtained for each resident on each of the 10 scenarios. Scenario scores, by rater, were calculated as the percentage of checklist items credited. Individual resident scenario scores were the average of the two rater scores, and an overall score was calculated as the average of the scenario scores.
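As an illustration of the scoring arithmetic described above, the following is a minimal sketch in Python (with hypothetical checklist data and names, not the study's actual scoring code) of how per-rater scenario percentages, averaged scenario scores, and an overall score could be computed:

```python
# Minimal sketch of checklist-based scoring; data and names are hypothetical.

def scenario_score(credited, total):
    """Percentage of key action checklist items credited by one rater."""
    return 100.0 * credited / total

# One participant: for each scenario, (items credited, items on checklist) per rater.
ratings = {
    "laryngospasm": [(7, 9), (6, 9)],
    "neonatal resuscitation": [(8, 9), (8, 9)],
    # ... remaining scenarios would follow the same pattern
}

# Scenario score = average of the two raters' percentages.
scenario_scores = {
    name: sum(scenario_score(c, t) for c, t in pairs) / len(pairs)
    for name, pairs in ratings.items()
}

# Overall score = average of the scenario scores.
overall = sum(scenario_scores.values()) / len(scenario_scores)
print(scenario_scores, round(overall, 1))
```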
Statistical Analysis
Several analyses were performed to investigate the psychometric properties of the scenario scores. To explore potential sources of measurement error in the checklist scores, generalizability theory was used.11 Estimates of the reliability of the scores, over raters and simulation scenarios, were obtained. Because all scenarios may not be equally effective for measuring pediatric anesthesia skills, scenario-level discrimination (D) statistics were calculated (the correlation between individual scenario scores and the overall score obtained by a resident). To examine potential performance differences by pediatric anesthesiology experience, a repeated measures analysis of variance (RM ANOVA) was completed, with the hypothesis that residents with more experience would outperform those with less experience. Residents were assigned to one of three groups: those who had completed 2 months or less of pediatric anesthesia training, those who had completed 3–5 months, and those who had completed more than 5 months. The first group has less training than the American Board of Anesthesiology's minimum requirement of 2 months of pediatric anesthesiology. The second group represents the typical duration of pediatric anesthesia rotations completed in the anesthesiology residency at Washington University. The third group reflects those who have a particular interest in pediatric anesthesiology or are pursuing fellowship training in pediatric anesthesiology.
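As a concrete illustration of the discrimination (D) statistic described above, the sketch below (in Python, on a simulated participant-by-scenario score matrix rather than the study data) computes, for each scenario, the correlation between scenario scores and the overall score:

```python
# Sketch of scenario-level discrimination (D) statistics: the correlation between
# each scenario's scores and the participants' overall scores. Data are simulated
# placeholders, not the study's scores.
import numpy as np

rng = np.random.default_rng(0)
scores = rng.uniform(30, 90, size=(35, 10))   # 35 participants x 10 scenarios (percent)

overall = scores.mean(axis=1)                  # overall score per participant
D = [float(np.corrcoef(scores[:, j], overall)[0, 1]) for j in range(scores.shape[1])]

for j, d in enumerate(D, start=1):
    print(f"Scenario {j}: D = {d:.2f}")
```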
Results
Descriptive Statistics
Descriptive statistics (checklist score) are presented in table 2 by scenario. Based on the average performance of all residents, the lowest scores were achieved on the appendicitis with sepsis scenario (mean = 45.5%), followed by the venous air embolus scenario (mean = 48.5%). The highest scores were achieved on the bronchospasm scenario (mean = 66.0%). The D values were all positive, indicating that performance on each scenario was positively related to overall performance. The laryngospasm scenario had the highest discrimination value (D = 0.71).
Reliability
To estimate the reliability of the assessment scores, a generalizability study was conducted, and the estimated variance components are presented in table 3. Variance components were estimated based on a person (P) by rater (R) by task (T) design. That is, each of the residents (person, n = 35) was rated by at least two independent raters across the 10 tasks (scenarios). Decision studies were conducted to estimate the generalizability of the checklist scores for different measurement conditions (i.e., number of simulation scenarios, number of raters per scenario).
The person (resident) variance component is an estimate of the variance across residents in resident-level mean scores. Ideally, most of the variance should manifest in the person component, indicating that individual abilities account for differences in observed scores. In this design, the other “main effect” variance components include task (scenario) and rater. The task component is the estimated variance of scenario mean scores. Because the estimate is much greater than zero, this indicates that the 10 tasks vary in average difficulty. As shown in table 2, mean performance, by simulation scenario, ranged from 45.5% for appendicitis with sepsis to 66.0% for the bronchospasm scenario. The rater component is the variance of rater mean scores. The relatively small value indicates that the raters do not differ much in terms of average stringency. The fact that the task component is much larger than the rater component indicates that the raters differ much less in average stringency than the simulation scenarios differ in average difficulty. The large person × task interaction variance component suggests that the rank ordering of resident mean scores differs considerably across the simulation scenarios. The relatively small rater × task component indicates that the raters rank order the difficulty of the scenarios similarly. The final variance component, error, is the residual variance that includes the three-way interaction and all other unexplained sources of variation.
Based on the generalizability study described previously, the variance components were used to estimate the reliability of the resident scores for various measurement designs. This process, referred to as decision studies, allows the most efficient measurement procedures for future operations to be identified. For the simulation assessment, the goal is to generalize the residents' scores to a universe of generalization that includes many other tasks (scenarios) and many other raters. For a situation where each resident is rated in each of the 10 scenarios (tasks; nt = 10) and each of the tasks is rated by two independent raters (nr = 2), the sample sizes for the decision study are the same as those for the generalizability study. Based on the estimated variance components for nt = 10 and nr = 2, the generalizability coefficient (ρ2) is 0.57. The generalizability coefficient for a design that incorporates nt = 10 tasks (scenarios) and only a single rater (nr = 1) is estimated to be ρ2 = 0.54, only slightly smaller than for a design with two raters per task. The precision of the resident ability estimate is therefore only minimally affected by the number of raters for each scenario. The estimated variance components for a design involving twice as many tasks (n = 20 scenarios) and a single rater are also displayed in table 3. Based on these estimates, ρ2 = 0.70. Taken together, the results from the decision studies indicate that gains in measurement precision can best be achieved by increasing the number of tasks (scenarios) and not the number of raters per given task.
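For reference, in a fully crossed person × task × rater design, the decision-study generalizability coefficient is conventionally obtained from the estimated variance components as shown below. This is a standard generalizability theory expression, not reproduced from the article; substituting the table 3 estimates with n_t = 10 and n_r = 2 (or n_r = 1) would yield coefficients of the magnitude reported above.

$$
\rho^{2} \;=\; \frac{\sigma^{2}_{p}}{\sigma^{2}_{p} + \dfrac{\sigma^{2}_{pt}}{n_{t}} + \dfrac{\sigma^{2}_{pr}}{n_{r}} + \dfrac{\sigma^{2}_{ptr,e}}{n_{t}\,n_{r}}}
$$

Here \sigma^{2}_{p} is the person (resident) variance component, \sigma^{2}_{pt} and \sigma^{2}_{pr} are the person × task and person × rater interaction components, \sigma^{2}_{ptr,e} is the residual, and n_t and n_r are the numbers of tasks and raters per task in the intended measurement design.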
Although the reliability of the resident scores is most dependent on the length of the assessment (number of scenarios), it is still important to quantify the association between the scores provided by the two independent raters. The standardized coefficient of interrater reliability was 0.80 (the average correlation between the scores assigned by the two raters to resident performances on the same task).
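To illustrate this interrater statistic, the following Python sketch (with simulated rater scores, not the study data) correlates the two raters' scores on each task across participants and then averages the correlations over tasks:

```python
# Sketch of the interrater reliability estimate: for each scenario, correlate the
# two raters' scores across participants, then average over scenarios.
# Simulated placeholder data.
import numpy as np

rng = np.random.default_rng(2)
true_ability = rng.uniform(40, 80, size=(35, 10))        # participants x scenarios
rater1 = true_ability + rng.normal(0, 5, size=(35, 10))  # rater 1 scores with noise
rater2 = true_ability + rng.normal(0, 5, size=(35, 10))  # rater 2 scores with noise

per_task_r = [np.corrcoef(rater1[:, j], rater2[:, j])[0, 1] for j in range(10)]
interrater_reliability = float(np.mean(per_task_r))
print(round(interrater_reliability, 2))
```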
Validity Evidence
In an attempt to gather evidence to support the validity of the assessment scores, the performance of residents with more pediatric anesthesiology experience was compared with that of those with less experience. Residents were grouped based on reported months of pediatric anesthesia experience (group 1, <2 months, n = 14; group 2, 2–5 months, n = 12; group 3, >5 months, n = 9). RM ANOVA was conducted to test the null hypothesis that there was no difference in scores based on experience. The independent variables were resident group (1, 2, 3) and scenario (repeated measure). The dependent variable was the summary checklist scenario score.
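A hedged sketch of this mixed between-within design, here using the pingouin package on simulated long-format data (the analysis software actually used in the study is not specified), might look like the following:

```python
# Sketch of the RM ANOVA design: scenario as the within-subject (repeated) factor,
# experience group as the between-subjects factor. Data are simulated placeholders.
import numpy as np
import pandas as pd
import pingouin as pg

rng = np.random.default_rng(1)
groups = ["<2 months", "2-5 months", ">5 months"]

rows = []
for subj in range(35):
    group = groups[subj % 3]                      # hypothetical group assignment
    for scen in range(10):
        rows.append({"subject": subj,
                     "group": group,
                     "scenario": f"scenario_{scen + 1}",
                     "score": rng.uniform(30, 90)})
df = pd.DataFrame(rows)

# Mixed ANOVA: tests the scenario main effect, group main effect, and interaction.
aov = pg.mixed_anova(data=df, dv="score", within="scenario",
                     subject="subject", between="group")
print(aov[["Source", "F", "p-unc"]])
```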
Based on the RM ANOVA, there was no significant group × scenario interaction, indicating that any differences in group mean scores were consistent across scenarios. There was a significant main effect attributable to scenario (F1,9 = 6.4, P < 0.01), indicating that, averaged over resident groups (experience levels), the scenarios were not of equivalent difficulty. There was also a significant main effect attributable to group (F1,2 = 4.9, P = 0.01), indicating that, averaged over scenarios, there were significant differences in performance between groups. Based on a post hoc analysis of group scores (Scheffé F test), there were significant differences between groups 3 and 1 and between groups 2 and 1, but not between groups 2 and 3. Residents with at least 2 months of pediatric anesthesia experience significantly outperformed those with less than 2 months of experience.
Because there were no significant differences in scores between participants with 2–5 months of experience (group 2) and those with greater than 5 months of experience (group 3), these two groups were combined. A comparison of group performance (⩽2 months of experience vs. >2 months of experience) is presented in figure 1. Although there was no significant group × scenario interaction in the RM ANOVA, between-group performance differences were negligible for some scenarios. For example, on the endotracheal tube out scenario, the less experienced residents (⩽2 months, mean = 62.4%) in aggregate actually outperformed those with more experience (>2 months, mean = 61.4%). Based on 35 participants, there were positive associations between the overall performance scores (aggregated over raters and scenarios) and months of postgraduate training (r = 0.22) and months of pediatric anesthesia training (r = 0.27). There was considerable variability in the overall performance of the residents, especially within the more experienced group (fig. 2).
Fig. 1. Mean key action scores by scenario and pediatric anesthesia experience. The total key action scores out of 100% for the simulation scenarios for each group of trainees with either less (red bars) or more (blue bars) than 2 months of pediatric anesthesia training.
Fig. 2. Total key action score by participant. The total key action scores achieved by each trainee with either less (red squares) or more (blue squares) than 2 months of pediatric anesthesia training. Each point represents a specific trainee's percent overall score. The scores are scattered to highlight both the overlap and the score variation that occurs among individuals within and between groups.
Discussion
Children, most notably neonates and infants, are more likely than adults to experience morbidity and mortality during the perioperative period.9 Closed claims analyses demonstrate that major sources of liability in children include cardiovascular and respiratory events, younger age, and higher American Society of Anesthesiologists physical status.12,13 A variety of analyses, including reviews of case reports, investigations of perioperative deaths, and closed insurance claims, indicate that a lack of physician experience and skill is a contributing factor to adverse outcomes.14–17 In this study, we designed a set of scenarios to reflect the nature of cardiac and respiratory conditions that can lead to adverse outcomes in the perioperative care of children.
Finding methods to assess advanced skills in anesthesiology is essential not only to establish a performance expectation for training, but also to delineate some of the differences in skills required and expected in fellowship training. The recent decision to establish a certification examination for pediatric anesthesiologists, beyond the accreditation program for fellowships that began in 1997, suggests that defining these differences is of increasing importance. Simulation can be a very effective method to assess the advanced skills and judgments acquired during pediatric anesthesia fellowship training. As part of our study, we developed a set of simulated scenarios, and associated scoring rubrics, to evaluate a trainee's skill in managing acute events in pediatric anesthesia.
In this investigation, residents and pediatric anesthesia fellows were assessed on a set of pediatric scenarios designed to cover a broad range of skills and abilities in anesthesia practice. Subjects with more training and experience received higher scores than residents with less experience, which supports the construct validity of the assessment as a measure of skills acquired during residency training. One reason pediatric fellows did not outperform residents could be the decision not to include more complex scenarios such as managing anesthesia for congenital cardiac disease, solid organ transplantation, or congenital syndromes. Alternatively, the variability in experience among a small cohort of fellows may have affected the results.
The scenarios were designed based on the pediatric anesthesia component of the content outline for the American Board of Anesthesiology's In-Training Examination and the Accreditation Council for Graduate Medical Education's expectations of a graduating resident in anesthesiology for competence in pediatric anesthesia. The mean scenario scores, which ranged from 45.5% (for appendicitis with sepsis) to 66% (for bronchospasm), suggest that a challenging range of content was presented, with some scenarios much more difficult than others. For most scenarios, some of the residents were able to achieve high scores in excess of 88%, which suggests that scoring expectations were reasonable. However, the lower mean scenario scores indicate that the assessment, at least for some participants, was quite difficult. The scenario difficulty may relate to the relatively inclusive checklists, which contained items with less straightforward content.
The scoring analysis offers a method to evaluate whether individual scenario constructs and scoring items contribute to the overall assessment. For example, the appendicitis scenario (mean score = 46%) required participants to check the blood pressure before induction, recognize and treat relatively severe hypotension in a 7 yr old, note the potential for sepsis, and anticipate the potential need for intensive care unit admission. Similarly, the venous air embolus scenario (mean score = 48%) required the participant to consider the possibility of air embolism during a major procedure (nephrectomy for tumor). The discrimination analysis indicated that although these scenarios were relatively difficult, participants who achieved higher overall scores were also able to manage these more difficult scenarios. A detailed checklist item analysis would be useful for identifying items on which performance was weak or unrelated to the overall scenario score, so that content or scoring items could be modified for use in future assessments. The relative difficulty of the scenarios could be addressed by eliminating or modifying checklist items that did not contribute to the assessment. This item analysis could also be useful to further increase rater reliability by identifying scoring items on which raters were more likely to disagree.
The wide range of scores and their substantial variation may reflect the differences in skill acquisition that exist among resident classes. Multiple factors likely contribute to the wide range in performance, including aspects of the simulation assessment itself, such as the design and construction of the scenarios. The preliminary results suggest that this variance in ability increases throughout training and may lead to even greater differences in ability at the completion of training and entry into clinical practice. Although performance improves as training progresses, the variance in ability suggests that clinical experience and approaches to training must be tailored to ability. A competency-based training approach, using individual assessment techniques such as simulation, could be used to advance training more rapidly for some residents and to provide remedial experiences for others who do not meet a minimal standard.
The results from our generalizability study were similar to those reported elsewhere.7,18 As noted in the performance assessment literature, content (or context) specificity can have an appreciable effect on performance. Therefore, although we attempted to measure general skills in pediatric anesthesia, performance on one type of scenario may not predict performance on another type. As a result, several scenarios are required to obtain a reproducible measure of resident ability. For a 10-scenario assessment with two independent raters, the generalizability coefficient was moderate (ρ2 = 0.57). Although this is likely adequate for low-stakes, formative assessments, additional measurement precision would be required if the assessment were to be used for summative (e.g., certification) purposes. An even more reliable assessment would be possible by adding scenarios that have similar measurement properties (difficulty, discrimination) to construct an assessment form of reasonable length (e.g., 10–15 scenarios) that yields more precise ability estimates. In this study, the precision of the scores was not greatly affected by the rater or by the number of raters per scenario. This may be due in part to the fact that all of the raters were pediatric anesthesiologists who not only trained as raters for these scenarios but also had participated in a number of previous simulation studies.
In this study, individuals with additional pediatric anesthesia experience outperformed, on average, those with less. Although some evidence to support the validity of the assessment scores was garnered, the process of validation is far from complete. First, although there were significant differences in performance between individuals with less than 2 months of pediatric anesthesia experience and those with more experience, this finding alone does not establish competence. The marked variance among individuals at the same level of training and experience indicates that a more in-depth process of defining individual ability is necessary. As evidenced by the individual overall scores, many residents with 2 months of experience may meet a performance standard; conversely, many participants with considerably more experience performed at a level even lower than the median of the less experienced residents. These findings support the concept that progression through residency could effectively be competence-based rather than time-based. Once an acceptable performance standard is determined through some form of standard-setting process, the relationship between competence and experience could be investigated.19,20 Second, with a 10-scenario assessment in the current format, the generalizability of a resident's performance to another potential set of scenarios is only moderate. It would therefore be worthwhile to investigate, in more detail, the content-specific characteristics of scenarios that underlie individual performance. Information gathered through this process would be valuable in constructing better assessment forms (mixes of scenarios) that yield more reliable and valid estimates of skills in pediatric anesthesia. Third, the consequential effect of introducing simulation-based assessment in pediatric anesthesia needs to be documented. Our cross-sectional analysis of resident performance could, and should, be augmented by longitudinal studies that follow residents over the course of training. This would help determine how skill and ability change among residents and define how to effectively accelerate the acquisition and retention of skill and ability during training. Finally, although there is some evidence to suggest that experience in actual patient care settings relates to performance in the simulated environment, specific data related to the practice of pediatric anesthesia need to be gathered.
In summary, we describe a multiple-scenario simulation using high-fidelity pediatric mannequins that provides a means to assess performance in areas that are critical to patient care in pediatric anesthesia. Simulation-based education and assessment offer an additional modality for performance-based credentialing of resident competence in managing anesthetics for children. This method has the potential to contribute to performance-based pediatric anesthesia assessment, but additional measures of validity obtained from comparisons with clinical performance are needed to establish this approach as a method to evaluate competence. Ultimately this approach could be used to assess competence and, as a result, have an effect on how other pediatric specialists assess the skills and performance of physicians who care for children.
The authors thank Margie Hassler, R.N., M.E.D., Nurse Facilitator, Saigh Pediatric Simulation Center, St. Louis Children's Hospital, St. Louis, Missouri, and Anne L. Glowinski, M.D., M.P.E., Associate Professor, Department of Psychiatry, Washington University School of Medicine, St. Louis, Missouri, for their assistance.