ALTHOUGH good clinical care in anesthesia has many components,1 the ability to diagnose and treat acute, life-threatening perioperative abnormalities is near the top of most anesthesiologists' lists. In comparison to other industries performing hazardous work, health care lags behind in its capability to ensure that its personnel are uniformly and provably skilled practitioners.2,3 How to measure clinician performance challenges all domains of medicine and is particularly difficult for hazardous and complex domains such as anesthesiology that involve invasive therapies with the proverbial "hours of boredom, moments of terror." In this issue of Anesthesiology, Murray et al.4 report on their team's continuing effort (see also the December 2003 issue of Anesthesiology5) to develop a validated test of this performance ability of anesthesiologists using mannequin-based simulation scenarios.
Simulation offers advantages in assessing this skill because real acute events are relatively uncommon, are diverse in pathophysiology, and cannot be observed without intervention should mistakes be made. In developing their test, the authors used the same principles and experiences that have guided the careful development of tests of basic clinical skills using "standardized patients" (actors). These have been used in the Educational Commission for Foreign Medical Graduates Clinical Skills Assessment5,6 and are now being used for Step 2 of the United States Medical Licensing Examination. In the 2003 article, Murray et al. described a simulation-based test for medical students; in the current article, they extend the test population to anesthesia residents. Their primary goal was to establish the psychometric properties of the examination, defining how its results depend on differences among cases, subjects, raters, and four types of rating scales. They also wished to assess its "construct validity" by testing the construct that senior residents (clinical anesthesia [CA]-2 or CA-3) would perform better than junior residents (CA-1).
Key design decisions were that the test would cover only the "technical" response to the events, not any nontechnical skills (e.g., communication or leadership), and that it would involve the anesthesiologist working alone, not in a team.7–11 Although the ability to respond with the appropriate technical actions is a critical performance skill, it may not be enough. In aviation, it was found that 70% of US airliner crashes involved crews with adequate stick-and-rudder skills, flying aircraft that could have been flown to safety.12 Aviation then shifted considerable emphasis in training and assessment to nontechnical skills of individuals and crews. Several research groups, mine included, believe that measuring technical ability is similarly a necessary but not sufficient assessment of anesthesiologists' skills.7–11
The authors' chosen statistical technique, generalizability theory,13 offers a means to tease out the separate contributions of the examinees, the raters, and the cases to the total variance seen across the test sessions. Perhaps the most important finding of the generalizability theory analysis was that far more variance was attributable to the cases and the case-subject interaction than was attributable to the raters or the rating system. The corollary to this finding is that—at least for the skills they assessed—a fair test needs a large number of cases6–10 but only a few raters. Because time and costs are limited, these findings drove the design decision to use extremely short case scenarios lasting only 5 min, allowing a large number of cases to be presented in a 1-h test session.
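To sketch why this corollary follows, consider the textbook fully crossed person × case × rater model from generalizability theory (an illustrative form only, not necessarily the exact design Murray et al. analyzed). An observed score is decomposed as
\[
X_{pcr} = \mu + \nu_p + \nu_c + \nu_r + \nu_{pc} + \nu_{pr} + \nu_{cr} + \nu_{pcr,e},
\]
and the generalizability coefficient for a test built from \(n_c\) cases and \(n_r\) raters is
\[
E\rho^2 = \frac{\sigma_p^2}{\sigma_p^2 + \sigma_{pc}^2/n_c + \sigma_{pr}^2/n_r + \sigma_{pcr,e}^2/(n_c n_r)}.
\]
When the case-subject interaction component \(\sigma_{pc}^2\) dominates and the rater components are small, increasing \(n_c\) shrinks the error term far more than increasing \(n_r\) does, which is precisely the pattern that favors many short scenarios scored by only a few raters.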
The emphasis of this article on psychometrics highlights a conundrum of measuring the performance of professionals.14 Observing anesthesiologists doing full cases could be sensitive to the complexities of the job but would likely lack psychometric rigor. Conversely, to achieve a fair test with robust psychometric properties, it may be necessary to control the work situation carefully, perhaps missing some of its complexity. Consider this analogy. What is the best way to pick a "good runner"? One could use a sprint (e.g., 100 m), which could be scored objectively with low interrater variance. It could surely distinguish between the fit and the unfit, but is it a good surrogate for races of 1,500 m or 10 km? A good test requires not only psychometric validity and content validity but also context validity and, ultimately, predictive validity for the tasks and skills of interest. In the anesthesia case, 300 s is a short time to step into a clinical situation, make a reasonable assessment, and implement key actions. Scenarios must be chosen that are unambiguous, require little diagnostic investigation, and have rapidly applicable therapies. In real patient care, such hyperacute situations with unambiguous findings and treatments may be the exception. With ultrashort simulations, is there a risk of missing the forest for the trees?
Perhaps so, but we have to start somewhere, and these two articles by Murray's group represent the most systematic and carefully controlled test development to date in anesthesiology regarding technical performance of acute event management. It is hard to argue that competent anesthesiologists should not be able to perform well on such a test, even if it is artificial. To corroborate this extrapolation, it would be wise not only to assess its psychometrics when applied to more experienced personnel, but also to look at the details of individual success or failure to be sure that they are not artifacts of the artificially limited test.
One should also be careful about the "constructs" that are tested for construct validity. In this study, CA-2 and CA-3 residents were grouped together to be compared to CA-1 residents. Why not test the construct of steady improvement year by year? In a study of Canadian anesthesia students, residents, community anesthesiologists, and university faculty, Devitt et al.15 assumed the construct that performance would increase across these populations in that order. They argued that university anesthesiologists would perform best because they usually did their own cases, did more complex cases, and were immersed in an academic environment. However, this construct might not be equally applicable to many US centers, where faculty may supervise others rather than perform their own cases. Also, is "experience" itself a guarantee of success on all tasks? We9,16 and others17 have shown that it is not: in simulation studies involving residents, faculty, and community anesthesiologists, some highly experienced personnel failed catastrophically in managing certain acute events, whereas some juniors performed exceptionally well. A further step in test development should be to create benchmark metrics of performance by true clinical experts. One means to select a cohort of known experts would be to use peer ratings18 by experienced clinicians. Those rated as expert by a large fraction of their peers probably do have outstanding clinical skills. The considerable work in surgery on metrics for testing surgical psychomotor skills19,20 is a useful guide, but establishing metrics for decision making in anesthesiology may be more difficult.
The context of performance assessment also must be considered. High-stakes summative assessment (e.g., United States Medical Licensing Examination or specialty board certification) is only one application. Performance testing is relevant for less exacting purposes such as "formative assessment" of students and trainees, pedagogical research, and research on human factors in medical systems. For these applications, some degree of psychometric rigor may be traded off against better applicability and scope of test content. Even a perfect test of intraoperative medical management should be only one element of a multifaceted assessment of the anesthesiologist's skills.
Ultimately, the public's desire for safer care with greater accountability will be the main driver for the health professions to conduct credible, regular, and never-ending assessments of the performance of their members.21 The science of performance assessment in anesthesia has been advanced substantially by Murray et al., but even they have only scratched the surface of a complex set of questions that will challenge our profession and the rest of health care for the foreseeable future.