Simulations have taken a central role in the education and assessment of medical students, residents, and practicing physicians. The introduction of simulation-based assessments in anesthesiology, especially those used to establish various competencies, has demanded fairly rigorous studies concerning the psychometric properties of the scores. Most important, major efforts have been directed at identifying, and addressing, potential threats to the validity of simulation-based assessment scores. As a result, organizations that wish to incorporate simulation-based assessments into their evaluation practices can access information regarding effective test development practices, the selection of appropriate metrics, the minimization of measurement errors, and test score validation processes. The purpose of this article is to provide a broad overview of the use of simulation for measuring physician skills and competencies. For simulations used in anesthesiology, studies that describe advances in scenario development, the development of scoring rubrics, and the validation of assessment results are synthesized. Based on the summary of relevant research, psychometric requirements for practical implementation of simulation-based assessments in anesthesiology are forwarded. As technology expands, and simulation-based education and evaluation takes on a larger role in patient safety initiatives, the groundbreaking work conducted to date can serve as a model for those individuals and organizations that are responsible for developing, scoring, or validating simulation-based education and assessment programs in anesthesiology.
The specific purpose of this article is to provide an overview of some of the issues that must be addressed to more fully embrace simulation-based methodology in the assessment of anesthesiologists. These assessments may be formative (e.g., education of residents), involving detailed participant feedback, or summative (e.g., graduation requirements and board certification), with higher stakes consequences for those who participate. The following four general areas are highlighted: defining the pertinent skills and choosing relevant simulation tasks, establishing appropriate metrics, determining the sources of measurement error in test scores, and providing evidence to support the validity of test score inferences. For each of these areas, a general discussion is integrated with a brief review and synthesis of relevant anesthesia-related investigations. Because many of the logistic impediments have been addressed as part of recently established performance-based certification and licensure examinations,1 and the specific challenges of integrating simulation into existing anesthesia training curricula have been noted,2 the discussion, both general and specific to anesthesiology, will center on psychometric issues and not those associated with test administration logistics, physical test site specifications, or curriculum development. Knowing more about the specific psychometric challenges and potential solutions allows for further expansion of simulation-based assessment in anesthesiology. Before these challenges are outlined, a brief overview of the use of simulation, both in general and specific to anesthesiology, is presented.
Background
The need for assessments that encompass the skills required for specialty practice remains a priority of the Institute of Medicine. However, the Anesthesia Patient Safety Foundation, a pioneering organization in promoting safety, observed in its response to the Institute of Medicine report that these assessments “are not a simple matter and (defining and assessing competence in practice) will require considerable research.”‡ Fortunately, previously conducted studies, both in anesthesiology and in other disciplines, have led to enhancements across the spectrum of simulation modalities, including advances in scenario design, the formulation and utilization of sophisticated scoring algorithms, and the development of innovative methodologies to set performance standards. Moreover, considerable research has been undertaken with the express purpose of identifying potential threats to the validity of test score interpretations. Although there will certainly continue to be psychometric, and other, difficulties, past experience with performance-based assessments suggests that most challenges can be addressed through focused research efforts.
Performance assessments in medicine have a long history.3 Various simulation modalities have been used to assess student, resident, and practitioner competence as well as to identify curricular deficiencies.4–9 Recently, based primarily on concerns related to physician competency and patient safety, summative assessments, including those targeting specific performance domains, have been incorporated into the process used to license and certify physicians.10 In contrast to formative assessments, where the primary goal is to provide feedback to the individual concerning strengths and weaknesses, summative evaluation activities are meant to determine some endpoint status (e.g., competent or not competent; ready to practice independently). Appropriately, these types of assessments, in addition to focusing on the evaluation of knowledge and clinical reasoning, have targeted other important competencies such as patient care, interpersonal and communication skills,11 and teamwork.12
One of the main simulation modalities used to assess the clinical skills of physicians involves the employment of standardized patients, lay people who are trained to portray the mannerisms and model the complaints of real patients.1,13–15 In developing these standardized patient-based simulations, especially those associated with certification and licensure, much was learned about examination design, test administration and logistics, quality assurance and, perhaps most important, psychometrics.16,17 With respect to examination design, efforts were made to model simulation scenarios to specifically measure certain skills in a realistic way by choosing simulated patient complaints based on actual practice data.18 Likewise, to ensure the precision of the scores, and any associated competency decisions,19,20 both quantitative and qualitative performance-based quality assurance measures have been developed.21 As testing practices evolve, and new simulation modalities emerge, they will need to be similarly scrutinized with respect to the reliability of the scores (or decisions made based on the scores), the validity of the inferences one can make based on the scores, and their overall fairness.
In anesthesiology, computer-based case management scenarios, task trainers, and mannequins have all been used as part of both formative and summative simulation-based assessments. From a general perspective, Seropian et al.22,23 provided a detailed overview of the concepts and methodology of simulation in medicine. Issenberg et al.24 summarize the benefits of simulation-based assessment and outline the use of simulation technology for healthcare professional skills training and assessment. Schwid25 presents a synopsis of high fidelity mannequin-based simulations and the available technologies. Similarly, Cooper and Taqueti26 provide a brief history of the development of mannequin simulators for clinical education and training. Cumin and Merry27 review the current spectrum of anesthetic simulators and provide guidelines for their selection for specific tasks. Ziv et al.28 provide an overview of credentialing and certification with simulation. There are even guidelines for those who want to include standardized patients in their anesthesia training programs.29,30 Sinz31 describes the history of anesthesiology's role in simulation and the efforts of the American Society of Anesthesiologists to promote simulation-based instruction. Scalese et al.32 summarize the use of technology for skills training and competency assessment in medical education. Finally, going forward, Gaba33 provides a future vision of simulation in healthcare. Taken together, these reviews, guidelines, and ideas delimit the potential uses of simulation technologies for the education and assessment of anesthesiologists.
Although much of the simulation work in anesthesiology has been limited to part-task trainers (e.g., airway management trainers) or full-body models (i.e., electromechanical mannequins), the vast array of research conducted so far has advanced the entire field of performance assessment. In particular, several articles have specifically identified the numerous challenges and opportunities,23,33–37 promoted the use of simulation-based assessment to identify individual skills deficiencies and associated curricular problems,38–42 evaluated human factors and systems-level problems43 and, as part of continuing medical education activities, advocated simulations for use as part of the assessment of anesthesiologists with lapsed skills.44 As well, much of the research conducted so far has focused on the individual trainee or practitioner, including physicians in need of remediation44 and those involved in self-directed lifelong learning activities.24 Recently, the American Board of Anesthesiology outlined the four steps required for maintenance of certification in anesthesia. One step in the process involves the evaluation of practice ability; candidates can demonstrate this practice performance assessment and improvement at accredited simulation centers. Over their 10-yr Maintenance of Certification in Anesthesiology (MOCA) cycle, diplomates must complete two practice performance assessment and improvement activities. By including a step that requires practice performance assessment and improvement, albeit formative in nature, relatively infrequent, and not specifically associated with performance standards, the American Board of Anesthesiology recognizes the role of simulated environments in improving skill and expertise. Moreover, given the more recent emphasis on the education and evaluation of teams in high-acuity medical situations45 and the assessment of interprofessional communication,46 the high fidelity simulated environment offers the potential to assess many of the complex skills needed by specialists. Anesthesiologists can practice skills that improve their clinical and teamwork competencies, especially in preventing and managing critical events and maintaining their expertise in handling uncommon and rare events. To do this effectively, however, care must be taken to create simulation scenarios that yield meaningful scores.
Key Issues
Defining the Skills and Choosing the Appropriate Simulation Tasks
Although much has been written about the development of mannequin simulators and the design of educational programs,9,22,24,47 the construction of quality simulation-based assessments continues to be a difficult task. Test developers must pay attention to the intended purpose of the test, the knowledge and skills to be evaluated and, for performance-based activities, the specific context for, and design of, the exercises.48 To be effective, the assessment activities must also be targeted at the ability level of the examinee.49 If the purpose of the assessment is not clear, then any ability measures that are gathered may be inaccurate. The choice of skills to be evaluated is usually guided by curricular information, competency guidelines,§ and the technical limitations of the chosen simulators.35 Once these evaluation issues have been identified and synthesized, one is left with the task of specifying the simulation parameters. Most important among these is choosing the particular scenarios that offer the best opportunity to sample the knowledge and skills that one wishes to measure. For anesthesiology, and other specialties, one can access available data resources such as the National Hospital Ambulatory Medical Care Survey∥ to determine the most prevalent conditions and procedures. However, often the best opportunity to assess specific skill sets such as clinical decision making and communication is to select rare, or reasonably complex, events such as when air is entrained during an operation or when septic shock complicates the perioperative period. Based on existing performance-based certification and licensure examinations, an effective strategy has been to rely on both healthcare data resources, where available, and expert judgment.
With the rapid development of simulator technology, including full body mannequins and part-task trainers, the potential domain for assessment, both in terms of the skills being measured and the tasks that can be modeled, has greatly expanded.27,50,51 For example, with the newer electromechanical mannequins, in addition to an inventory of preprogrammed scenarios, simpler and more intuitive programming interfaces allow faculty to model scenarios with realistic respiratory mechanics, carbon dioxide waveforms, and hemodynamic responses. For other simulation modalities such as standardized patients, it is often difficult, if not impossible, to measure procedural and management skills. Mannequins and part-task trainers can be used to evaluate specific therapeutic skills (e.g., airway management, venipuncture techniques, administering drugs, and cognitive steps in decision making) and, in combination with other simulation modalities, abilities related to resource management, professionalism, and teamwork.52,53 Similar to the expansion of knowledge-based item-testing formats, the introduction of new simulation modalities provides an opportunity to measure various skills in different and more realistic ways, a change that should eventually yield more robust and defensible assessments.
Although the introduction of new simulation modalities will certainly expand the assessment domain, there are some limitations with current technologies, many of which have been acknowledged in the literature.54 First, even with the most sophisticated high fidelity mannequins, some presentations and patient conditions cannot be modeled very well (e.g., sweating, changes in skin color, and response to painful stimuli). As a result, there will still be a need to incorporate assessment activities that involve direct patient contact. Second, for electromechanical mannequins, the interrelationships between different physiologic variables can be imperfect, especially when attempting to simulate a patient with an unstable condition who then receives multiple pharmacologic interventions. If the simulator responds unpredictably to a given intervention (e.g., coadministration of an anesthetic and an inotropic drug), whether this is a function of canned programming or operator intervention, those being assessed may become confused and act in ways that are consistent with instructional feedback but inconsistent with intended patient care expectations. As a result, it will be difficult to have any confidence in the assessment results. Moreover, to the extent that those being assessed are continually cued by changes in monitored output, improperly scripted or modeled scenarios, or ones that are unnecessarily complex, will provide a poor milieu for evaluating ability.35 Those charged with developing simulation-based assessments must balance the need to measure specific abilities with the technological limitations of the simulators, recognizing that many conditions cannot be simulated with sufficient fidelity, potentially compromising stakeholder buy-in.55,56
The scenario is the fundamental building block of most simulation-based assessments in anesthesiology. In general, the design and development procedures for a simulation-based assessment include the following: selecting competence domains that are amenable to a simulation environment, defining the expected skills that are needed to diagnose and manage the crisis, and designing a scenario that has the required skills embedded into the framework. In anesthesiology, the development of simulation scenarios, both computer- and mannequin-based, has been described in detail and typically involves a structured process to gather the insights and opinions of experts in the field.57–59 The process of scenario development and selection can later be cross-referenced with curriculum, training, or certification expectations. For anesthesia specialty training, the Joint Council on In-Training Examinations, a committee of the American Society of Anesthesiologists and American Board of Anesthesiology, publishes a relatively detailed content outline that delineates the basic and clinical sciences areas (including anesthesia procedures, methods, and techniques) that a specialist must be knowledgeable about.# Scenarios for many of the content areas described in the outline can easily be simulated for both education and assessment. For example, the recognition and management of the side-effects of anesthetic drugs, respiratory depression, hypotension, anaphylaxis, cardiovascular events (arrhythmias, myocardial ischemia), surgical procedures (air entrainment, hemorrhage), and complications related to equipment failure can all be modeled in simulation scenarios. Overall, the simulated environment is an ideal setting to explore many of the conditions, side-effects, and complications that are listed as key content domains in the outline.
The practice domain of anesthesia is fairly well defined, and the majority of simulated scenarios tend to concentrate on the skill sets expected during a crisis. The rationale for selecting the crisis event as the content of a typical scenario is based on a number of considerations. First, a physician's failure to rapidly manage an acute-care event is often associated with an adverse patient outcome. In a critical patient care setting, or in a situation where unexpected anesthetic or surgical complications arise, the outcome may hinge on whether or not the anesthesiologist knows how to manage a crisis. Second, physicians, particularly residents, frequently manifest skill deficits in performing a logical, sequential, and timely patient evaluation. The “borderline” resident often struggles with setting priorities, managing time effectively, and recognizing when to call for help. In clinical practice, residents with serious skill deficits in these essential domains are often not recognized until multiple questionable judgments and skills deficits are observed in a crisis setting. The acute care simulation is useful in assessing a resident's skill in managing many of these common but difficult-to-evaluate skill sets.59 Scenarios designed around acute care events typically include skills in setting priorities, generating hypotheses, processing knowledge, assigning probabilities, isolating important from unimportant information, integrating competing issues, acknowledging limits, and learning when to call for assistance.60 Finally, critical events normally include a compressed timeline. A scenario designed to assess the dynamic, interrelated skills required to resolve a crisis quickly can tap numerous abilities, including communication, planning, and both inferential and deductive reasoning.
Specifying what needs to be assessed, both in practice and as part of educational activities, is not necessarily complex. However, because some skills are quite difficult to measure (e.g., teamwork),61 and various practice situations (e.g., ones involving multiple healthcare workers) are not easily modeled in the simulation environment, difficult challenges remain. One of the most important of these, described in the next section, is the development of fair, reliable, and valid outcome measures.37
Developing Appropriate Metrics
If simulation-based activities are to be used for assessment-related activities, either formative or summative, appropriate metrics must be constructed. One needs to be reasonably certain that the scores, however gathered, reflect “true” ability. In anesthesiology, and other disciplines that use simulations as part of education and certification, creating these rubrics is certainly one of the main assessment challenges. Although much has been learned from the development of performance-based assessments of clinical skills that utilize standardized patients,62 the adaptation of some of this knowledge to those types of simulations that would be appropriate for anesthesiology is not without difficulties. With this in mind, efforts to develop scoring metrics for high fidelity simulators are currently expanding at a rapid pace.12,58,63–66
Based on the literature related to simulation-based assessment in anesthesiology, two types of scores have predominated—explicit process and implicit process. Explicit process scores take the form of commonly used checklists or key actions. For a given simulation scenario, content experts (usually practitioners), often with the support of patient care guidelines, determine which actions, based on the presenting complaint, are important for the candidate (medical student, resident, practicing physician) to complete to properly manage the scenario.63,64,67 For example, to manage intraoperative hypoxemia, an anesthesiologist should take a number of initial steps to correct hypoxia (100% O2) as well as diagnose the cause (auscultation, evaluation of lung compliance, carbon dioxide waveform, and others). These important management activities, when listed as checklist items or key actions, are the logical basis of the scoring rubric. While the weighting of checklist items may have little impact on the overall score, especially for more common clinical presentations where individual tasks are conditionally dependent,68 this strategy may be appropriate for some acute care simulation events such as malignant hyperthermia. Here, certain actions such as recognizing the condition and then administering dantrolene must be done to effectively manage the patient's condition. In essence, the shortening of checklists to essential key actions implicitly weights critical procedural or management tasks. Unfortunately, although checklists, or shorter key actions, have worked reasonably well and have provided modestly reproducible scores depending on the number of simulated scenarios,58 they have been criticized for a number of reasons. First, checklists, while objective in terms of scoring, can be subjective in terms of construction.69 Even if specific practice guidelines exist for given conditions to inform what goes on the checklist, there can still be considerable debate as to which actions are important or necessary given the patient's condition. Without this consensus, one could certainly question the validity of the scenario scores. Second, the use of checklists often promotes rote behaviors such as using rapid-fire questioning techniques or performing many, some perhaps irrelevant, procedures. Third, and likely most germane for acute care simulations typical to anesthesia, it is difficult to take into account timing and sequencing when using checklists or key actions. One can envision many scenarios where not only what the physician does is important but also the order and timing in which it is done. For example, in a scenario associated with a circuit leak, the participant who quickly recognizes and rapidly corrects the hypoventilation would avert a more serious, prolonged period of hypoventilation leading to hypoxia. Although checklist-based timing has been used in some evaluations,70,71 the order of actions, at least for explicit process-based scoring, is often ignored completely.
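To make the mechanics of weighted, time-aware key-action scoring concrete, the following minimal sketch shows one way such a rubric might be scored automatically. The scenario, action names, weights, and deadlines are hypothetical illustrations and are not drawn from any published rubric.

```python
# Minimal sketch of key-action scoring with weights and time limits, for a
# hypothetical circuit-leak scenario. All names and values are illustrative.
from dataclasses import dataclass
from typing import Optional

@dataclass
class KeyAction:
    name: str
    weight: float                        # relative importance of the action
    deadline_s: Optional[float] = None   # credit only if completed within this time

def score_scenario(key_actions, observed_times):
    """observed_times maps action name -> seconds from scenario start (absent if never done)."""
    earned, possible = 0.0, 0.0
    for action in key_actions:
        possible += action.weight
        t = observed_times.get(action.name)
        if t is None:
            continue  # action never performed
        if action.deadline_s is not None and t > action.deadline_s:
            continue  # performed too late for credit
        earned += action.weight
    return earned / possible  # proportion of weighted key actions completed in time

rubric = [
    KeyAction("increase_FiO2_to_100", weight=2.0, deadline_s=60),
    KeyAction("hand_ventilate_and_check_circuit", weight=2.0, deadline_s=120),
    KeyAction("call_for_help", weight=1.0),
]
performance = {"increase_FiO2_to_100": 45, "hand_ventilate_and_check_circuit": 150}
print(f"Scenario score: {score_scenario(rubric, performance):.2f}")  # 0.40
```

A design of this kind credits not just whether an action occurred but whether it occurred in time, which addresses the sequencing criticism noted above without abandoning the objectivity of explicit scoring.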
Implicit process scoring, where the entire performance is rated as a whole, can also be used in simulation-based assessments. In the physician community, there is often considerable reluctance to use rating scales, citing concerns regarding interrater reliability. However, based on the literature, holistic or global rating scales can be effective, psychometrically defensible tools for measuring certain constructs, especially those that are complex and multidimensional such as teamwork and communication.65,72–74 In many situations, avoiding rating scales, simply because they involve expert judgment rather than the documentation of explicit actions, may not be advisable. With proper construction and effective rater training, they can be used to measure some important medical skills, including the nontechnical aspects of anesthesia practice. They also allow raters to take into account egregious actions and unnecessary patient management strategies (e.g., performing a needle decompression of the left chest for a scenario requiring an endobronchial intubation), something that would be quite difficult to do with checklists or even key actions.75 From a reliability perspective, even though two raters watching the same simulation encounter may not produce the exact same score, or scores, it is often possible to minimize this source of error. In addition, where systematic differences in rater stringency exist, score equating strategies can be used.76 In many instances, especially those where multiple scenario assessments are used, one may actually prefer to sacrifice some measurement precision to achieve greater score validity.17
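The simplest form of stringency adjustment can be sketched as follows: each rater's scores are recentered to the pooled mean, which assumes the raters scored comparable examinee samples. Formal equating designs (e.g., many-facet Rasch models) are considerably more sophisticated; the values below are invented purely for illustration.

```python
# Illustrative sketch: remove systematic rater stringency by recentering each
# rater's ratings to the pooled mean before aggregation. Data are invented.
from statistics import mean

ratings_by_rater = {
    "rater_A": [6, 7, 5, 8],   # tends to rate high (lenient)
    "rater_B": [4, 5, 3, 6],   # tends to rate low (stringent)
}

pooled_mean = mean(score for scores in ratings_by_rater.values() for score in scores)

adjusted = {
    rater: [score - (mean(scores) - pooled_mean) for score in scores]
    for rater, scores in ratings_by_rater.items()
}
print(adjusted)  # both raters now share the pooled mean, removing the stringency offset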
When developing rating scales (implicit process measures), evaluators often concentrate solely on the measurement rubric (i.e., specification of the constructs that are going to be measured and deciding the number of score points for the scale), frequently ignoring any rater training or quality assurance regimes. Although physician raters are clearly content experts, this does not necessarily qualify them to be evaluators. Regardless of their clinical experience and capabilities, evaluators need to be trained to use implicit process measures. Training protocols can include specific rater exercises (e.g., rating benchmarked performances), various quality assurance measures (e.g., double rating a sample of examinee performances), and periodic refresher training. By developing a meaningful rubric and ensuring, through training, that the evaluators are providing ratings that accurately reflect examinee abilities, it is possible to minimize bias and produce more valid measures of performance.
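As one illustration of such a quality assurance check, the sketch below computes percent exact agreement and an unweighted Cohen's kappa on a hypothetical double-rated sample of encounters; for ordinal holistic scales a weighted kappa or an intraclass correlation would often be preferred, and all ratings shown are fabricated.

```python
# Quality-assurance sketch: agreement statistics for a double-rated sample of
# encounters scored on the same holistic scale. Ratings are invented.
from collections import Counter

rater1 = [3, 4, 2, 5, 3, 4, 1, 3]
rater2 = [3, 4, 3, 5, 3, 3, 1, 3]

n = len(rater1)
observed = sum(a == b for a, b in zip(rater1, rater2)) / n  # exact agreement

p1, p2 = Counter(rater1), Counter(rater2)
expected = sum((p1[c] / n) * (p2[c] / n) for c in set(rater1) | set(rater2))

kappa = (observed - expected) / (1 - expected)  # chance-corrected agreement
print(f"Exact agreement: {observed:.2f}, Cohen's kappa: {kappa:.2f}")
```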
The impetus to create physician-specific ability measures,77,78 combined with advances in simulator technology, provides an opportunity to develop new, perhaps more meaningful, measurement rubrics. Many of the available electromechanical devices typically used in anesthesia training can generate machine-readable records of the physiologic responses of the mannequin as well as the treatments used during the simulation. Provided that the mannequin responds realistically and reproducibly to any intervention (e.g., ventilation or drug therapy), and the timing of the actions can be demarcated, it should be possible to develop explicit performance measures that are based on patient (simulator) outcomes. For example, in a scenario associated with hypoventilation such as endotracheal tube obstruction or cuff leak, the changes in the mannequin's minute ventilation and carbon dioxide may serve as a reasonable measure of the participant's performance. While developing these types of scoring metrics will require some additional work in scenario design and construction, this approach may provide a more effective and timely method to assess performance and provide feedback to participants.
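A hypothetical sketch of such an outcome-based metric is shown below: it derives the duration and a crude "dose" of hypoventilation from a logged minute-ventilation trace. The log format, threshold, and sample values are assumptions for illustration only and do not correspond to any particular simulator's output.

```python
# Sketch of an outcome-based metric computed from a mannequin's machine-readable
# log: time spent below a ventilation threshold and the area of the deficit.
# Log format, threshold, and values are assumed for illustration.

log = [  # (time_s, minute_ventilation_L_per_min) sampled from the simulator
    (0, 6.0), (30, 3.5), (60, 2.0), (90, 2.5), (120, 4.8), (150, 6.2), (180, 6.0),
]
THRESHOLD = 5.0  # L/min below which the mannequin is considered hypoventilated

time_below = 0.0
deficit_area = 0.0  # crude "dose" of hypoventilation (L/min * s)
for (t0, v0), (t1, v1) in zip(log, log[1:]):
    dt = t1 - t0
    if v0 < THRESHOLD:
        time_below += dt
        deficit_area += (THRESHOLD - v0) * dt

print(f"Time below threshold: {time_below:.0f} s, deficit area: {deficit_area:.0f}")
```

A participant who corrects the problem quickly accumulates a smaller deficit, so the metric rewards timely intervention directly rather than inferring it from observed actions.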
For anesthesiology-related simulation activities, studies have specifically reported the processes used to develop scoring rubrics59,79 and, historically, the need to refine the existing rating systems.80 Byrne and Greaves81 provide a review of the assessment instruments used for anesthetic simulations, highlighting the need for better measurement tools. For checklists, key actions, and even holistic rating scales, attention has been paid to the construct being measured (e.g., clinical decision making), the applicability of various simulation technologies for gathering performance measures and, where applicable, relevant practice guidelines. For many simulation-based assessment activities, expert panels, often as part of some structured Delphi technique, have developed the scoring rubrics.64,67 Even though this strategy is appropriate, there is often little documented evidence regarding the qualifications of the panelists or the subsequent training of the raters. Although physicians are normally used as raters, and the rating is often accomplished via video review, establishing effective rater training programs is paramount, especially when the correctness of certain actions that are scored is open to some interpretation. Without this, any structured activity to develop scoring rubrics may still not yield meaningful scores.
For anesthesia simulations, “technical” skills such as airway management, drug administration, and placement of catheters have typically been scored with checklists or key actions. In contrast, “nontechnical” skills such as communication, planning, teamwork, and situation awareness, which play a key role in anesthesia,82 have generally been scored with some form of holistic rating scale. Given the multidimensional nature of constructs such as communication skills, a more subjective rating methodology seems apropos. Various studies have looked at the relationships between scoring modalities, with some concluding that the relative ranking of participant abilities does not vary much whether holistic or analytic (checklist/key action) scores are used.59,83 This finding will certainly depend on the type of simulation encounter used and the specific construct being measured. Based on the research conducted to date, the use of key action scores does seem to hold some advantages, at least for measuring procedural skills. First, at least for acute care scenarios, there is generally relatively little disagreement on what constitutes key actions. Second, they are relatively easy to score. Finally, if time stamps are used for critical actions, the sequencing of these actions can be captured. Overall, although much work has been conducted to develop meaningful rubrics for anesthesiology-based simulations, additional research aimed at specifically informing the construction and adoption of various measurement scales is certainly warranted.81
Assessing the Reliability of Test Scores
For simulation-based assessments, whether used for formative (e.g., residency education) or summative (e.g., certification or licensure) purposes, one needs to be reasonably confident that the scores are reliable. Compared with typical knowledge-based tests, there can be many other sources of measurement error in simulation-based assessments, including those associated with the rater.20,84 If these sources, or facets, are not accounted for in the design, one can get an incomplete picture concerning the reliability of the scores. Where checklists or key actions are used to generate scores for a simulation scenario, internal consistency coefficients are typically calculated.85 Although these coefficients can be presented as reliability indices, they provide only a measure of the consistency of examinee performance, across scoring items, within a scenario. For a multiscenario assessment, provided that care was taken in developing the scenario-specific performance measures, and specific skills are being assessed, one should be more concerned with the consistency of examinees' scores over encounters, and not within each individual one. Often, some measure of interrater reliability is also computed.86,87 While scoring consistency between raters is certainly important, relying solely on a scenario-based measure of agreement is also incomplete. Even if two raters are somewhat inconsistent in their scoring, this may not necessarily lead to an unreliable total assessment score. To better understand the sources of measurement error in a multiscenario performance-based simulation assessment, generalizability (G) studies are often used.88,89 These studies are conducted to specifically delimit the relative magnitude of various error sources and their associated interactions. Following the G-study, decision (D) studies can be undertaken to determine the optimal scoring design (e.g., number of simulated encounters or number of raters per given encounter): that is, one that will provide sufficiently reliable scores given the purpose of the assessment.
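For readers unfamiliar with the mechanics, the following sketch estimates variance components and a generalizability coefficient for the simplest fully crossed person-by-scenario design with a single rater. The score matrix is fabricated; a real G-study of a simulation-based assessment would typically include raters (and their interactions) as additional facets.

```python
# Minimal person-by-scenario G-study sketch using classical ANOVA estimators of
# the variance components and the generalizability coefficient. Data invented.
import numpy as np

scores = np.array([  # rows = examinees, columns = scenarios (percent scores)
    [62, 70, 55, 68],
    [80, 76, 72, 84],
    [55, 60, 48, 58],
    [74, 69, 77, 73],
    [66, 72, 61, 70],
], dtype=float)

n_p, n_t = scores.shape
grand = scores.mean()
person_means = scores.mean(axis=1)
task_means = scores.mean(axis=0)

ms_p = n_t * np.sum((person_means - grand) ** 2) / (n_p - 1)
residual = scores - person_means[:, None] - task_means[None, :] + grand
ms_res = np.sum(residual ** 2) / ((n_p - 1) * (n_t - 1))

var_p = max((ms_p - ms_res) / n_t, 0.0)   # true-score (person) variance
var_res = ms_res                          # person-by-scenario interaction plus error

g_coefficient = var_p / (var_p + var_res / n_t)
print(f"var(person)={var_p:.1f}, var(residual)={var_res:.1f}, G={g_coefficient:.2f}")
```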
Within the performance assessment literature, many studies have been conducted to estimate the impact of various designs on the reproducibility of the scores.90 Although raters have been identified as a source of variability, their overall impact on reliability, given proper training and well-specified rubrics, tends to be minimal, often being far outweighed by task sampling variance. Essentially, given the content specificity of certain simulation tasks, especially those that are designed to assess technical skills, examinees may perform inconsistently from one task (or scenario) to the next. For example, based on previous experience and training, a participant may effectively recognize and treat anaphylaxis, yet fail to diagnose or effectively manage myocardial ischemia. As a result, if there are few tasks, and one is attempting to measure overall clinical ability, the reliability of the assessment score can be poor. To think of this concept another way, if we are trying to measure skills in patient management, for example, more performance samples (simulated scenarios) will mitigate the overall impact of content specificity, thus yielding a more precise overall ability measure. In general, for these types of performance-based assessments, issues regarding inadequate score reliability can best be addressed by increasing the number of simulated tasks rather than increasing the number of raters per given encounter or simulated scenario. As well, to minimize any rater effects, it is usually most effective to use as many different raters as possible for any given examinee (e.g., a different rater for every task or scenario).17
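This trade-off can be illustrated with a small decision-study projection under assumed variance components from a hypothetical person x scenario x rater G-study. The component values below are invented solely to show the typical pattern in which adding scenarios improves generalizability more than adding raters.

```python
# D-study sketch: projected generalizability for different numbers of scenarios
# and raters, given assumed (invented) variance components from a prior G-study.
def projected_g(var_p, var_pt, var_pr, var_ptr_e, n_tasks, n_raters):
    error = var_pt / n_tasks + var_pr / n_raters + var_ptr_e / (n_tasks * n_raters)
    return var_p / (var_p + error)

components = dict(var_p=40.0, var_pt=60.0, var_pr=5.0, var_ptr_e=30.0)  # assumed values

for n_tasks in (4, 8, 12):
    for n_raters in (1, 2):
        g = projected_g(**components, n_tasks=n_tasks, n_raters=n_raters)
        print(f"{n_tasks:>2} scenarios, {n_raters} rater(s): G = {g:.2f}")
```

With these illustrative components, moving from 4 to 12 scenarios raises the projected coefficient far more than doubling the raters at any fixed number of scenarios, which is the pattern described above.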
In anesthesiology, there are some unique challenges associated with the development of simulation-based assessments, especially those where fairly reliable estimates of ability are required. Unlike many performance-based assessments in clinical medicine, where fairly generic skills are being measured (e.g., history taking), the management of patients by anesthesiologists can be very task specific. Where this is true, and one wants to measure skills related to patient management, it could take many more encounters to achieve a reproducible measure of ability. Fortunately, many important events in anesthesia practice, including a large number that can be effectively modeled in a simulated environment, require fairly rapid interventions. Unlike typical standardized patient-based cases, which usually last from 10 to 20 min, acute care scenarios can easily be modeled to take place in a 5-min period. Given that the simulation scenarios can be relatively short, it is possible to include more of them in an assessment. Moreover, for nontechnical skills such as communication, task specificity would likely not be as great. Here, one would not expect that an individual's ability to communicate with the patient, or other healthcare provider, would vary much as a function of the type of simulated event. Therefore, fewer behavioral samples (scenarios) would be needed to yield a reliable total score.
Given the amount of work that has been conducted to develop scoring rubrics for anesthesia-related simulation assessments,64 it is not surprising that efforts have been directed at disentangling various sources of measurement error in the scores. As is common in medicine, the initial focus of many psychometric investigations rests with establishing interrater reliability. As noted previously, while it is important that two or more raters viewing the same performance are reasonably consistent in their evaluations, this is only one facet to take into account.16 Interestingly, some investigations provide evidence to support rater consistency74,91–94 while others do not.12,95 While the exact cause for these disparate findings is hard to pinpoint, it likely rests with differences in the skills being assessed, the assessment mode (e.g., live vs. videotape review), the choice of scoring rubrics, and whether specific rater training protocols are used.16 All these factors could have some measurable impact on rater consistency. Recently, there has been a general recognition that obtaining reasonably precise measures of ability requires multiple scenarios or tasks sampled over a relatively broad content domain.96 However, some behavioral attributes such as communication and teamwork may be less dependent on content knowledge, thus requiring fewer performance samples. In an investigation of the psychometric properties of a simulation-based assessment of anesthesiologists, Weller et al.97 reported that 12–15 scenarios were needed to reliably rank trainees on their ability to manage simulated crises. Several studies of anesthesia residents and anesthesiologists have incorporated multistation assessments, reporting reasonable interstation reliabilities for evaluations that incorporate 8–12 scenarios.58,59,98
Providing Evidence to Support the Validity of Test Score Inferences
Validity relates to the inferences that we want to make based on the assessment scores. Inspecting the simulation literature, in general, and the research related to performance-based examinations, in particular, there are numerous potential sources of evidence to support the validity of test scores and the associated inferences made based on these scores.99–101 However, it should be noted that the validation process is never complete and that scores may be valid for one purpose and not for another. Additional evidence to support the intended test score inferences can always be gathered.
For performance-based assessments, there has been a heavy emphasis on content-related issues.102,103 To support the content validity of the assessment, simulated scenarios are often modeled and scripted based on actual practice characteristics, including the types of patients that are normally seen in particular settings. With respect to rubrics, special care is usually taken to define the specific skill sets and to develop measures, often from an evidence-based perspective, that adequately reflect them. Finally, the encounters are typically modeled in realistic ways, using the same equipment that would be found in a real operating theater or other venue. All of these strategies, including feedback from stakeholders regarding the verisimilitude of the simulated scenarios,104 will help support the content validity of the test scores.
If a simulation-based assessment is designed to measure certain skills, then it is imperative that evidence be procured to support this claim.105 Various strategies can be used to accomplish this goal. If several skills are being measured, then one could postulate relationships among them. For example, if the simulation is designed to measure both procedural and communication skills, then one would expect that the scores for these two domains should be somewhat, albeit imperfectly, related. Likewise, if external measures are available (e.g., knowledge-based in-training examination scores), one might postulate both strong and weak relationships between the simulation assessment scores and these criterion measures. Often, the criterion is some measure of clinical experience. Here, one would normally expect that individuals with greater expertise (e.g., more advanced training or board certification) and appropriate training will perform better on the simulation tasks.58,71,94,106,107 If this is not the case, then one could question whether valid inferences can be made based on the assessment scores. Overall, to the extent that the observed relationships, both internal and external, substantiate the hypothesized relationships, support for the validity of test score interpretations can be gathered.
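A simple way to examine such postulated relationships is to compute the correlations among the score domains and any external criterion, as in the brief sketch below. The scores are fabricated for illustration; in a real validation study the expected pattern of stronger convergent and weaker discriminant correlations would be specified in advance.

```python
# Convergent/discriminant sketch: correlations among hypothetical procedural,
# communication, and written-knowledge scores for the same cohort. Data invented.
import numpy as np

procedural = np.array([72, 80, 65, 90, 58, 77, 84, 69])
communication = np.array([68, 75, 70, 82, 60, 74, 79, 66])
written_exam = np.array([61, 78, 55, 88, 52, 70, 81, 63])

def r(x, y):
    return np.corrcoef(x, y)[0, 1]

print(f"procedural vs written exam:  r = {r(procedural, written_exam):.2f}")
print(f"procedural vs communication: r = {r(procedural, communication):.2f}")
```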
Unlike the more common formative simulation-based assessments, the purpose of some simulation-based evaluations is to assure the public that the individual who passes the examination is fit to practice, either independently or under supervision. Here, it is imperative that score-based decisions are valid and reproducible. To accomplish this, a variety of standard setting techniques are available, some of which have been applied to acute care mannequin-based assessments.108 As part of a structured process, subject-matter experts make judgments, either based on the score scale or some sampling of performances, concerning minimal competency as it relates to the particular simulation task. Using various statistical procedures, these judgments are then used to derive an appropriate cut-score, the point on the score scale that separates those who are minimally adequate (or some other definition) from those who are not. Unfortunately, while defensible cut-scores can be established for performance-based assessments, procuring evidence to support the validity of the associated decisions can be complicated.109 Although performance on the simulation-based assessment may be indicative of future aptitude or competence, and there are some longitudinal studies that support this for certain skills,110 the predictive relationships may be weak and difficult to measure.111 From a research perspective, only those who “pass” the initial assessment can be studied; individuals who do not demonstrate competence are generally not allowed to practice, effectively attenuating any relationships between the assessment scores and any future outcome measure. Nevertheless, the introduction of simulation-based assessment, if done correctly, can provide the public with greater assurance that practitioners are qualified.44,112 Also, if the consequential impact of other previously implemented assessments is a guide, this will ultimately lead to a growth in simulation-based educational programs, a change that will likely lead to greater patient safety.113–115
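One widely used test-centered approach, an Angoff-style procedure, can be sketched as follows: each judge estimates the probability that a minimally competent candidate would complete each scored element, and the cut-score is derived from the judges' summed estimates. The judges, elements, and probabilities below are hypothetical and shown only to illustrate the arithmetic; other standard setting methods would proceed differently.

```python
# Angoff-style standard-setting sketch: the cut-score is the mean of each judge's
# summed probability estimates for a minimally competent candidate. Data invented.
judgments = {  # judge -> estimated probabilities for each scored element
    "judge_1": [0.90, 0.70, 0.60, 0.80, 0.50],
    "judge_2": [0.85, 0.75, 0.55, 0.70, 0.60],
    "judge_3": [0.95, 0.65, 0.50, 0.75, 0.55],
}

per_judge_cuts = {judge: sum(p) for judge, p in judgments.items()}
cut_score = sum(per_judge_cuts.values()) / len(per_judge_cuts)

print(per_judge_cuts)                                   # each judge's implied passing score
print(f"Recommended cut-score: {cut_score:.2f} of 5 elements")
```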
Although the simulation-based training of anesthesiologists has taken place for some time, and there were early calls for establishing the efficacy of this training,116 more rigorous studies aimed at establishing the validity of assessment scores have come only more recently. From a content validity perspective, simulation scenarios have been modeled on real patient events and have included scoring rubrics that are keyed to practice-based guidelines. Moreover, in addition to specific patient management tasks, simulation scenarios have been developed to specifically target nontechnical skills such as communication, teamwork, and clinical decision making.12,117,118 Although not generally considered strong evidence for validity, several studies have provided data summarizing the opinions of those being assessed. Based on various simulation modalities, and a host of clinical presentations, most studies indicate that participants thought that the exercises were realistic and pedagogically useful with respect to clinical training and competency assessment.119,120 Berkenstadt et al.,91 based on simulations incorporated in the Israeli Board Examination in anesthesiology, reported that those exposed to this form of assessment preferred it to the traditional oral examination. Given that trainees need to demonstrate specific skills, Savoldelli et al.92 also supported the use of simulations as an adjunct to the oral examination for senior anesthesia residents. As simulation technology expands, the breadth of clinical scenarios that can be modeled will certainly increase, providing additional opportunities to gather content validity evidence.
Other sources of validity evidence have been reported throughout the literature. If the scoring systems are appropriate, and actually measure the intended construct, or constructs, then one would expect that those individuals with more training and experience would perform better. Likewise, given the effects of experiential education, especially if it involves repetitive practice and appropriate feedback, those being trained with simulators should show some performance gains over time or with additional training.83,121 Going back almost 20 yr, a number of studies involving the assessment of medical students, residents, and practicing anesthesiologists have demonstrated this finding.64,122 Moreover, individuals participating in simulator training have been able to retain their skills over time.123 In addition to the evidence that supports the discriminant validity of the simulation-based exercises,98,124 some studies have looked at the relationships between simulation scores and other measures of performance such as written tests of knowledge, course grades, and various nonsimulation-based resident evaluations.125 From a criterion-related validity perspective, some studies have shown a moderate relationship between simulation performance and knowledge. Schwid et al.94 reported positive correlations between simulation scores and both faculty evaluations and American Board of Anesthesiology written in-training examination scores. While other studies have shown little relationship between simulation scores and other evaluations,126,127 this may be a function of differences in the constructs that are being measured. Investigators are quick to acknowledge that knowing what to do, which can be measured in many different ways, is somewhat different from showing what you can actually do, either in a real or simulated environment. As an example, one could envision an anesthesiology resident who performs well on in-training assessments, indicating knowledge of what to do in certain situations, but cannot effectively use this knowledge in managing a real or simulated event. To explain some simulator-criterion relationships, or lack thereof, one must not forget that to effectively use simulators as assessment devices, those individuals being evaluated must have some familiarity with the devices.119 Often, the orientation process is insufficient. As a result, one might expect only moderate associations between simulator performance and other, perhaps marginally related, ability measures.
From a validity perspective, the strongest evidence lies with the establishment of a link between simulator performance and practice with real patients. To date, there have been relatively few studies that have shown a significant impact of simulator training from a patient outcome perspective. Unlike in many other disciplines, however, there has been some excellent work showing that skills acquired in the simulation environment transfer to “real world” situations. For example, Weller et al.128 used simulation scenarios to investigate the impact of trained assistance on error rates in anesthesia and concluded that a simulation-based model can provide rigorous evidence on safety interventions. For resuscitation skills, Domuracki et al.129 found that learning on the simulator, provided there was appropriate feedback, transferred to clinical practice. Kuduvalli et al.130 also reported long-term retention and transferability of acquired skills into subsequent clinical practice. Unfortunately, these findings were based only on questionnaire responses. Although it can be considered as indirect validity evidence, Blum et al.131 reported that training courses, often using simulation, can make faculty staff eligible for malpractice premium reductions. Even with these sources of validity evidence, there is still a need to continue to address the long-term effects of experiential learning on the retention of knowledge and acquired skills. More important, while extremely difficult to do, establishing a causal link between simulator performance and actual patient outcomes is essential.132
Conclusion
The expansion of simulation models, including those using mannequins and part-task trainers, will lead to many more opportunities to model real-life events. This, in turn, will demand additional investigations to support the use of the resulting assessment scores, either for educational activities (e.g., providing feedback to anesthesia trainees) or, more importantly, for higher stakes decisions concerning provider competency, including those associated with licensure and maintenance of certification activities.
From a development perspective, defining the skills and choosing appropriate simulation tasks is paramount. Simulation scenarios constructed to measure teamwork and communication skills may not be suited to measure procedural skills. Ideally, simulation scenarios should be modeled on real events and constructed in such a way as to provide the best milieu to evaluate the skills of interest, whether technical or nontechnical. To accomplish this goal, expert opinion and relevant practice guidelines (where they exist) are the key elements. If the development of the simulation scenarios is not sufficiently rigorous, the assessment scores, regardless of the purpose of the evaluation, or who is being evaluated, may not have much meaning.
Once the content area, or areas, has been identified, developing appropriate metrics for simulation-based assessment activities is the next essential step. Whether one is providing residents with feedback, or assessing competence as part of certification or licensure activities, scores are needed. In general, the choice of metrics will depend on what is being measured. For technical skills (e.g., airway management), it is usually possible to identify observable key actions and develop analytic measurement tools. For nontechnical skills, such as communication and teamwork, rating scales are usually more appropriate. Regardless of the metric that is chosen, care must be taken to identify the elements, or behaviors, that anchor the score scale. For key actions, the raters must know when, and when not, to give credit. For holistic, or global, rating tools, the raters must be clear about the construct being measured (e.g., teamwork) and how someone who is more able in this domain differs from someone who is less able.
For most simulation-based assessments, estimating the reliability of the scores is not that difficult. Addressing the various sources of measurement error can, however, be quite challenging. For situations where the scores are being used for higher stakes purposes, aggregate scores from multiple scenarios will generally be needed to obtain a sufficiently reliable estimate of ability. One should think of the scenarios as vehicles to measure the skills—more scenarios, or testing time, will, in general, yield more reliable estimates of ability. Although one should also be concerned with potential rater effects (i.e., interrater inconsistency), rater training, combined with various quality assurance activities, will help minimize this potential source of error. Those charged with implementing simulation-based assessments in anesthesiology must identify the various sources of measurement error and use this information, where possible, to modify the structure of their evaluations. If an individual's score is not a sufficiently precise measure of his/her ability, actions based on this score (e.g., ranking within the class, the provision of feedback, and certification decisions) could be misleading or erroneous.
For simulation-based assessments, providing evidence to support the validity of test score inferences is essential. Even for lower stakes formative assessment activities, including practice performance assessment and improvement initiatives and maintenance of certification in anesthesiology-related activities, one needs to know that the scores are reasonably accurate reflections of the skills that are purportedly being evaluated. In anesthesiology, much work has been conducted to gather evidence to support the validity of simulation-based assessment scores. These efforts should ultimately lead to better, more defensible assessments, ones that can be used to identify individual strengths and weaknesses, including competency deficits. Going forward, outcomes-based research centering on quantifying the paths between simulation-based assessment, skills acquisition (and decay), and patient safety is essential. Without additional construct validity evidence, the utility of simulation-based assessment, at least for higher stakes applications such as board certification and licensure (or maintenance of licensure), is likely to continue to be questioned.
Anesthesiology as a specialty has made numerous prescient commitments to safer patient care. Simulation, as adopted by anesthesiologists, was eventually recognized as an assessment modality that can overcome many of the inherent patient risks involved in specialty training. A physician's advanced diagnostic and therapeutic management skills, and the ability to integrate knowledge, apply clinical judgment, communicate, and work within a team, can all be assessed during a high fidelity simulation. These types of performance assessments, when constructed with care and appropriately validated, are considered an essential element in elevating practice standards and, ultimately, in improving the safety of anesthesia practice. Through years of research, the breadth of simulation activities in anesthesiology has widened, with model-based training and assessment, albeit currently limited in scope, now accepted as one of the steps to maintain certification in the profession. By adopting simulation-based training and assessment, and actively addressing many of the challenges associated with developing psychometrically sound evaluations, the specialty has recognized the need for professional skill development, continuing on a path that demonstrates a long-term commitment to patient care.