Although good clinical care in anesthesia has many components,1 the ability to diagnose and treat acute, life-threatening perioperative abnormalities is near the top of most anesthesiologists' lists. In comparison to other industries performing hazardous work, health care lags behind in its capability to ensure that its personnel are uniformly and provably skilled practitioners.2,3 How to measure clinician performance challenges all domains of medicine and is particularly difficult for hazardous and complex domains such as anesthesiology that involve invasive therapies with the proverbial "hours of boredom, moments of terror." In this issue of Anesthesiology, Murray et al.4 report on their team's continuing effort (see also the December 2003 issue of Anesthesiology5) to develop a validated, mannequin-based simulation test of this ability in anesthesiologists.

Simulation offers advantages in assessing this skill because real acute events are relatively uncommon, are diverse in pathophysiology, and cannot be observed without intervention should mistakes be made. In developing their test, the authors used the same principles and experiences that have guided the careful development of tests of basic clinical skills using "standardized patients" (actors). These have been used in the Educational Commission for Foreign Medical Graduates Clinical Skills Assessment5,6 and are now being used for Step 2 of the United States Medical Licensing Examination. In the 2003 article, Murray et al. described a simulation-based test for medical students; in the current article, they extend the test population to anesthesia residents. Their primary goal was to establish the psychometric properties of the examination, defining how its results depend on differences among cases, subjects, raters, and four types of rating scales. They also wished to assess its "construct validity" by testing the construct that senior residents (clinical anesthesia [CA]-2 or CA-3) would perform better than junior residents (CA-1).

Key design decisions were that the test would cover only the "technical" response to the events, not any nontechnical skills (e.g., communication or leadership), and that it would involve the anesthesiologist working alone, not in a team.7–11 Although the ability to respond with the appropriate technical actions is a critical performance skill, it may not be enough. In aviation, it was found that 70% of US airliner crashes involved crews with adequate stick-and-rudder skills, flying aircraft that could have been flown to safety.12 Aviation then shifted considerable emphasis in training and assessment to the nontechnical skills of individuals and crews. Several research groups, mine included, believe that measuring technical ability is similarly a necessary but not sufficient assessment of anesthesiologists' skills.7–11

The authors' chosen statistical technique, generalizability theory,13 offers a means to tease out the different contributions of examinees, raters, and cases to the total variance seen across the test sessions. Perhaps the most important finding of the generalizability theory analysis was that far more variance was attributable to the cases and the case-subject interaction than to the raters or the rating system. The corollary to this finding is that, at least for the skills they assessed, a fair test needs a large number of cases6–10 but only a few raters. Because time and costs are limited, these findings drove the design decision to use extremely short case scenarios lasting only 5 min, allowing a large number of cases to be presented in a 1-h test session.
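
To make this kind of variance decomposition concrete, the following is a minimal sketch, not the authors' actual analysis: it simulates a fully crossed subject x case x rater score table with assumed effect sizes (large case and case-subject effects, small rater effects) and estimates the variance components with the classical expected-mean-square algebra of generalizability theory. All sample sizes, effect magnitudes, and variable names are illustrative assumptions.

```python
# Illustrative G-theory sketch: estimate variance components for a fully
# crossed person x case x rater design from simulated scores.
import numpy as np

rng = np.random.default_rng(0)
n_p, n_c, n_r = 20, 10, 2          # persons (residents), cases, raters -- assumed sizes

# Simulate scores with large case and person-by-case effects and a small
# rater effect, roughly mirroring the pattern described in the editorial.
person = rng.normal(0, 0.6, n_p)[:, None, None]
case   = rng.normal(0, 1.0, n_c)[None, :, None]
rater  = rng.normal(0, 0.2, n_r)[None, None, :]
p_x_c  = rng.normal(0, 0.9, (n_p, n_c))[:, :, None]
error  = rng.normal(0, 0.4, (n_p, n_c, n_r))
scores = 5.0 + person + case + rater + p_x_c + error   # shape (n_p, n_c, n_r)

grand = scores.mean()
m_p, m_c, m_r = scores.mean((1, 2)), scores.mean((0, 2)), scores.mean((0, 1))
m_pc, m_pr, m_cr = scores.mean(2), scores.mean(1), scores.mean(0)

# Mean squares for the random-effects p x c x r ANOVA (one observation per cell).
ms_p = n_c * n_r * np.sum((m_p - grand) ** 2) / (n_p - 1)
ms_c = n_p * n_r * np.sum((m_c - grand) ** 2) / (n_c - 1)
ms_r = n_p * n_c * np.sum((m_r - grand) ** 2) / (n_r - 1)
ms_pc = n_r * np.sum((m_pc - m_p[:, None] - m_c[None, :] + grand) ** 2) / ((n_p - 1) * (n_c - 1))
ms_pr = n_c * np.sum((m_pr - m_p[:, None] - m_r[None, :] + grand) ** 2) / ((n_p - 1) * (n_r - 1))
ms_cr = n_p * np.sum((m_cr - m_c[:, None] - m_r[None, :] + grand) ** 2) / ((n_c - 1) * (n_r - 1))
resid = (scores - m_pc[:, :, None] - m_pr[:, None, :] - m_cr[None, :, :]
         + m_p[:, None, None] + m_c[None, :, None] + m_r[None, None, :] - grand)
ms_e = np.sum(resid ** 2) / ((n_p - 1) * (n_c - 1) * (n_r - 1))

# Variance components from the expected mean squares (negative estimates clipped to 0 below).
var = {
    "person":         (ms_p - ms_pc - ms_pr + ms_e) / (n_c * n_r),
    "case":           (ms_c - ms_pc - ms_cr + ms_e) / (n_p * n_r),
    "rater":          (ms_r - ms_pr - ms_cr + ms_e) / (n_p * n_c),
    "person x case":  (ms_pc - ms_e) / n_r,
    "person x rater": (ms_pr - ms_e) / n_c,
    "case x rater":   (ms_cr - ms_e) / n_p,
    "residual":       ms_e,
}
total = sum(max(v, 0.0) for v in var.values())
for name, v in var.items():
    print(f"{name:>14s}: {max(v, 0.0):.3f} ({100 * max(v, 0.0) / total:.1f}% of total)")
```

In a subsequent decision study, component estimates of this kind are what drive design choices such as how many cases, and how many raters, a fair examination actually needs.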

The emphasis of this article on psychometrics highlights a conundrum of measuring the performance of professionals.14 Observing anesthesiologists doing full cases could be sensitive to the complexities of the job but would likely lack psychometric rigor. Conversely, to achieve a fair test with robust psychometric properties, it may be necessary to control the work situation carefully, perhaps missing some of its complexity. Consider this analogy. What is the best way to pick a "good runner"? One could use a sprint (e.g., 100 m), which could be scored objectively with low interrater variance. It could surely distinguish between the fit and the unfit, but is it a good surrogate for races of 1,500 m or 10 km? A good test requires not only psychometric and content validity but also context validity and, ultimately, predictive validity for the tasks and skills of interest. In the anesthesia case, 300 s is a short time to step into a clinical situation, make a reasonable assessment, and implement key actions. Scenarios must be chosen that are unambiguous, require little diagnostic investigation, and have rapidly applicable therapies. In real patient care, such hyperacute situations with unambiguous findings and treatments may be the exception. With ultrashort simulations, is there a risk of missing the forest for the trees?

Perhaps so, but we have to start somewhere, and these two articles by Murray's group represent the most systematic and carefully controlled test development to date in anesthesiology for the technical management of acute events. It is hard to argue that competent anesthesiologists should not be able to perform well on such a test, even if it is artificial. To corroborate this extrapolation, it would be wise not only to assess the test's psychometrics when applied to more experienced personnel, but also to look at the details of individual success or failure to be sure that they are not artifacts of the artificially limited test.

One should also be careful about the "constructs" that are tested for construct validity. In this study, CA-2 and CA-3 residents were grouped together to be compared to CA-1 residents. Why not test the construct of steady improvement year by year? In a study of Canadian anesthesia students, residents, community anesthesiologists, and university faculty, Devitt et al.15 assumed a construct that the performance of these populations would increase in that order. They argued that university anesthesiologists would perform best because they usually did their own cases, did more complex cases, and were immersed in an academic environment. However, this construct might not be equally applicable to many US centers, where faculty may supervise others rather than perform their own cases. Also, is "experience" itself a guarantee of success on all tasks? We9,16 and others17 have demonstrated that it is not, in simulation studies involving residents, faculty, and community anesthesiologists in which some highly experienced personnel failed catastrophically in managing certain acute events, whereas some juniors performed exceptionally well. A further step in test development should be to create benchmark metrics of performance by true clinical experts. One means to select a cohort of known experts would be to use peer ratings18 by experienced clinicians. Those rated as expert by a large fraction of their peers probably do have outstanding clinical skills. The considerable work in surgery on metrics for testing surgical psychomotor skills19,20 is a useful guide, but establishing metrics for decision making in anesthesiology may be more difficult.

The context of performance assessment also must be considered. High-stakes summative assessment (e.g., United States Medical Licensing Examination or specialty board certification) is only one application. Performance testing is relevant for less exacting purposes such as "formative assessment" of students and trainees, pedagogical research, and research on human factors in medical systems. For these applications, some degree of psychometric rigor may be traded off against better applicability and scope of test content. Even a perfect test of intraoperative medical management should be only one element of a multifaceted assessment of the anesthesiologist's skills.

Ultimately, the public's desire for safer care with greater accountability will be the main driver for the health professions to conduct credible, regular, and never-ending assessments of the performance of their members.21 The science of performance assessment in anesthesia has been advanced substantially by Murray et al., but even they have only scratched the surface of a complex set of questions that will challenge our profession and the rest of health care for the foreseeable future.

1. Anesthesiology Residency Review Committee: Program Requirements for Graduate Medical Education in Anesthesiology (document 040pr703). Chicago, Accreditation Council for Graduate Medical Education, 2004
2. Trunkey DD, Botney R: Assessing competency: A tale of two professions. J Am Coll Surg 2001; 192:385–95
3. Gaba D: Structural and organizational issues in patient safety: A comparison of health care to other high-hazard industries. California Management Rev 2001; 43:83–102
4. Murray DJ, Boulet JR, Kras JF, Woodhouse JA, Cox T, McAllister JD: Acute care skills in anesthesia practice: A simulation-based resident performance assessment. Anesthesiology 2004; 101:1084–95
5. Boulet JR, Murray D, Kras J, Woodhouse J, McAllister J, Ziv A: Reliability and validity of a simulation-based acute care skills assessment for medical students and residents. Anesthesiology 2003; 99:1270–80
6. Sutnick AI, Stillman PL, Norcini JJ, Friedman M, Regan MB, Williams RG, Kachur EK, Haggerty MA, Wilson MP: ECFMG assessment of clinical competence of graduates of foreign medical schools. Educational Commission for Foreign Medical Graduates. JAMA 1993; 270:1041–5
7. Gaba D, Fish K, Howard S: Crisis Management in Anesthesiology. New York, Churchill-Livingstone, 1994
8. Betzendoerfer D: TOMS: Team oriented medical simulation: Briefing, simulation, debriefing. The Anesthesia Simulator as an Educational Tool 1995: 51–5
9. Gaba D, Howard S, Flanagan B, Smith B, Fish K, Botney R: Assessment of clinical performance during simulated crises using both technical and behavioral ratings. Anesthesiology 1998; 89:8–18
10. Fletcher G, Flin R, McGeorge P, Glavin R, Maran N, Patey R: Anaesthetists' Non-Technical Skills (ANTS): Evaluation of a behavioural marker system. Br J Anaesth 2003; 90:580–8
11. Glavin RJ, Maran NJ: Development and use of scoring systems for assessment of clinical competence. Br J Anaesth 2002; 88:329–30
12. Billings CE, Reynard WD: Human factors in aircraft incidents: Results of a 7-year study. Aviat Space Environ Med 1984; 55:960–5
13. Shavelson R: Generalizability Theory: A Primer. Newbury Park, California, Sage Publications, 1991
14. Gaba DM: Two examples of how to evaluate the impact of new approaches to teaching (editorial). Anesthesiology 2002; 96:1–2
15. Devitt J, Kurreck M, Cohen M, Cleave-Hogg D: The validity of performance assessments using simulation. Anesthesiology 2001; 95:36–42
16. DeAnda A, Gaba D: The role of experience in the response to simulated critical incidents. Anesth Analg 1991; 72:308–15
17. Schwid H, O'Donnell D: Anesthesiologists' management of simulated critical incidents. Anesthesiology 1992; 76:495–501
18. Ramsey P, Weinrich M, Carline J, Inui T, Larson E, LoGerfo J: Use of peer ratings to evaluate physician performance. JAMA 1993; 269:1655–60
19. Satava RM, Gallagher AG, Pellegrini CA: Surgical competence and surgical proficiency: Definitions, taxonomy, and metrics. J Am Coll Surg 2003; 196:933–7
20. Satava RM, Cuschieri A, Hamdorf J: Metrics for objective assessment. Surg Endosc 2003; 17:220–6
21. Gaba D: The future vision of simulation in health care. Qual Safety in Health Care (in press)