Photo: J. P. Rathmell.

“Simulation for education is disparate from simulation for assessment. ... It would be naïve to expect simulation alone to be a panacea.”

ONE of my duties as a residency program director is to determine whether a resident is ready to graduate, attesting that the resident has demonstrated sufficient ability to enter practice without direct supervision and is qualified to practice anesthesia independently.

Furthermore, in a clinical practice setting, anesthesiologists may be consulted by medical staff or hospital leadership and told that a physician in the practice is subpar clinically or has behaved poorly during an event, with the request that we “look into it.”

Because patient wellbeing is at stake, correctly assessing a doctor’s capabilities is an important endeavor.

The goal of this commentary is to summarize the challenges that exist in assessing physician performance by using simulation techniques. The take-home message is that simulation is one tool in the toolbox to be used along with other modalities. Although it would be convenient and timely for the science of simulation to have advanced to the point where it could provide reliable and valid measures of an anesthesiologist’s skills, much work still needs to be done. One contribution to that work is published in this issue of Anesthesiology.1

The study by Dr. Blum et al. focused first on identifying the top five common problem areas displayed by anesthesia residents, as judged by their academic faculty. To illustrate the complexity of the issue of simulation for assessment, I took the six core competencies promulgated by the Accreditation Council for Graduate Medical Education, which regulates housestaff training in the United States, and mapped them to the analogous Canadian Medical Education Directives for Specialists. Then, I assigned the five potential performance problem areas identified by Blum et al. to those competencies (table 1). This somewhat arbitrary process made clear to me that assessing performance first requires defining it precisely, which is complex because the definition needs to encapsulate the many varied dimensions of practice.

Table 1. Complexity of Defining Physician Performance

The study did not aim to address overall ability but rather specific behaviors or skills considered to be at risk for suboptimal performance. On the basis of these five potential weaknesses, the investigators designed seven scenarios (e.g., preoperative workup of a patient for urgent exploratory laparotomy) and then evaluated trainee performance in a simulation center that used mannequins as well as actors playing out each scenario. The authors took the time and trouble to establish a scoring system in advance by working with a panel of clinician experts to iteratively discuss and reach consensus on what low, medium, and high performance would look like for each of the five behaviors. They then asked blinded raters to use that scale and grade performance.

The study reports three things. First, the methodology used was generally reasonable for measuring how well housestaff performed. Second, similar to other studies, there was a marked variability in how well residents performed even if they were in the same year of training. Third, only modest improvements in performance could be detected in residents in more advanced stages in training.

This is of interest because the Accreditation Council for Graduate Medical Education is introducing the Next Accreditation System in 2014, which will require that dozens of “milestones” be formally evaluated after each year of training. These milestones are intended to break down each of the six Accreditation Council for Graduate Medical Education competencies into smaller pieces, with well-defined, observable developmental steps moving linearly from novice to expert. Perhaps many residents will achieve milestones ahead of schedule, making it difficult to observe differences between junior and senior resident performance. Or perhaps only modest improvements were found in more senior residents because skill acquisition typically follows a nonlinear trajectory, as most people learn in a “bump-plateau, bump-plateau” manner. Alternatively, the rating tools may not have been sensitive enough to grade the levels of performance accurately. Perhaps less likely is that the test scenarios are too blunt to differentiate performance at the higher end of the scale, between good and very good, for instance. Is defining bad performance easier than defining good performance?

A group of physicians participating in any simulation course usually contains a few superstars and a few low-performing individuals, whereas the majority is in the middle and difficult to rank order in clinical capacity. Will this be sufficiently helpful to residency programs? Certainly, simulation techniques will not be the primary or sole assessment mechanism as multiple types of assessments will be necessary to comply with the letter and spirit of the Milestones Project.

In medical care, it is difficult to precisely know what expert physician performance looks like. Direct observation of the physician working in their normal job has the advantage of being prospective, but the disadvantage of evaluating a small and likely unrevealing sample of the total range of expected skills.

Adverse event analyses and peer review have the advantage of dealing with the actual management of a potentially difficult problem. However, they are retrospective, so they carry the bias of hindsight, and the details of the case often cannot be determined accurately. Because care is typically provided by multiple personnel, it is also difficult to isolate individual performance issues. And, now more than ever, patient satisfaction ratings are being used by some bodies to judge performance.

Often the evidence-based best practice for definitive standard performance is unknown. Each patient requires a case-by-case mixture of cognitive and psychomotor skills, knowledge, tasks, decision-making, and teamwork. Please also keep in mind that what physicians do in the real world of daily practice is affected by systems factors (the operating room environment, the staff we work with, production pressure) and by individual factors, such as how well rested we are at work.

As a result, simulation may be one methodology called upon to help with assessment. Challenges exist, however, and are outlined below.

Defining What Expert Physician Performance Looks Like

Some Aspects of Physician Performance May Be Measurable.

An anesthesiologist providing care for a surgical patient, a patient with chronic pain, or a patient in the intensive care unit combines technical skills with nontechnical skills in unique ways. As a result, investigators who use simulation to study physician performance must first determine what they are looking for. Both knowledge and certain aspects of patient care can be assessed using a checklist created via consensus by a group of clinicians, as was done by Blum et al.1 This might work well for some scenarios, such as what to do and say if the patient’s temperature and carbon dioxide increase dramatically in a simulation of a patient with malignant hyperthermia. Even then, though, do the skills measured in any given acute situation generalize to other clinical situations? What is the required minimum percentage of correctly completed checklist items, and should the items be weighted according to importance? Answering these questions is called standard setting, which is an entire discipline in itself. Nonetheless, it ultimately comes down to an arbitrary cut point.

Complex Nontechnical Behaviors Are Difficult to Define and Measure.

In contrast to certain tasks for which checklists can be developed, the correct response to other common challenges that anesthesiologists face may be more difficult to condense to a checklist. Examples of such complex behaviors arise daily: how to properly communicate with a family about a patient with a do-not-resuscitate status being brought to the operating room, or exactly when and how much to transfuse a patient who is bleeding during a cancer operation.

Last week I brought my nephew to see the pediatrician. The physician had a fabulous bedside manner and was efficient in her use of the short visit time, but I am sure other pediatricians with different approaches could have been similarly effective. I was reminded once again that medicine is both art and science, and multiple skills are involved.

Common Problem Behaviors Are Difficult to Assess.

Some terms I have heard over the years to describe problem physicians include: “being cavalier,” “using a cookbook approach,” “having poor attention to detail,” “being insensitive to patients,” “failing to heed advice,” “not learning from experience,” “arriving to work late,” “arguing with others,” “not being a team player,” “acting rudely.”

Those descriptors span multiple Accreditation Council for Graduate Medical Education competencies. How does one go about testing them with or without simulation techniques? Although there is some overlap in the framework and terms chosen to classify the métier of the physician, how can we possibly measure something that we often cannot even define, such as professionalism?

Selection of Optimal Simulation Technique

When the term simulation arises in discourse, the image of a computerized mannequin in a converted operating room with actors playing the roles of surgeons and nurses may come to mind. This allows for full-blown scenarios challenging the physician to prospectively multitask all their nontechnical professional skills including teamwork, communication, and leadership.

In fact, simulation encompasses a variety of techniques to replicate real experiences with planned ones. These evoke substantial aspects of the real world in an interactive manner.

Something seemingly as simple as a mock oral exam, where the learner has to quickly formulate correct responses to decision-making questions, is on the spectrum of simulation. Even though it is a hypothetical case, the learner feels as if they are actually taking care of a patient. This may have more fidelity—likeness to reality—than high-tech techniques such as computer interactive software or virtual worlds with three-dimensional visualization.

For assessing proficiency with technical tasks, it may be sufficient to replicate specific portions of the body. Dozens of commercial devices and trainers are available for simulating airway management, vascular access, and lumbar puncture, to name a few.

In contrast, assessing how a resident handles difficult patient encounters such as “end-of-life” discussions may be better achieved by using standardized patients, actors able to give a consistent, predefined history. Hybrid simulation combining standardized patients and part-task trainers may be optimal at other times.

Working to identify when one simulation technique is better than another for testing all the different elements of physician practice in the least costly manner deserves continued attention.

There is public pressure for documentation that all practicing physicians meet or exceed a baseline minimum level.

Although much of the development work in simulation, such as the study by Blum et al.,1 has targeted housestaff, evaluating the proficiency of practitioners may present some additional challenges. For example, many anesthesiologists work with Certified Registered Nurse Anesthetists, anesthesia nurses, anesthesiologist assistants, fellows, residents, and other providers. Should performance be evaluated differently when the physician works in a team care model?

If simulation techniques were used for formal assessment of physicians in practice, another confounding issue to tackle is the possibility that a practitioner may function well in a known hospital environment with familiar operating room staff but underperform in an unfamiliar setting such as a simulation center. Does it make a difference in observed outcome if the practitioner trained before simulation came into vogue, or graduated from a residency with minimal exposure to simulation? Arranging for prior practice in the test environment may reduce some of this effect. Perhaps these confounding factors partially explain several studies’ findings that some anesthesiologists do not perform well during mannequin-based assessment of the ability to manage simulated intraoperative emergencies, even though the practitioners do not appear to have problems in their day-to-day practice. There is much yet to be understood about using simulation for assessment.

For Maintenance of Certification in Anesthesiology, diplomates need to complete multiple activities. Part 4 (Practice Performance Assessment and Improvement) includes a Simulation Education Course intended as purely educational, without any assessment of performance. Self-assessment may occur, with the debriefing session helping participants realize whether there is some area of practice they need to review or improve.

As many of you know, objective structured clinical examinations will be added to the American Board of Anesthesiology’s traditional oral examination as part of the new Applied Examination for Board Certification. Objective structured clinical examinations usually consist of multiple stations, each with a different examiner and each requiring 5 to 15 min, and they have been well established as assessment mechanisms for many years.

Research is underway to further understand potential future uses of mannequin-based simulation. For example, the Agency for Healthcare Research and Quality funded an ongoing multi-institution study titled “Creating simulation-based performance assessment tools for practicing physicians” to investigate the ability of anesthesiologists to detect and manage uncommon but potentially lethal events. This study’s objectives are to (1) develop standardized simulation scenarios and associated tools to conduct simulation-based assessment of board-certified anesthesiologists, (2) evaluate whether simulation-based clinical assessment can be reliably delivered across multiple national sites for purposes of recertification, and (3) describe the distribution of clinical performance assessment scores during simulation.

In the Stanford Anesthesia residency, we are fortunate to have faculty experienced in the use of simulation for education and training, which allows us to target high priority skills and knowledge. Examples include crisis resource management, supervision of junior residents, and conversation with family members after a patient complication. Many different situations can be portrayed and scheduled whenever convenient and repeated. This artificial environment is perceived as a safe space and allows learners to make mistakes without fear of repercussions.

In my own case, I still remember that day in 1992 at the Anesthesia Crisis Resource Management course at the Palo Alto VA Medical Center where, as a resident, I learned many skills that I still routinely use today. These include calling for help, delegating and confirming, avoiding fixation errors, mobilizing available resources, anticipating and planning for problems, and knowing the environment. Simulation enjoys a positive perception by learners owing to its effective role in education.

In summary, where does this leave us? The study by Blum et al. provides a path to identify residents performing poorly in common problem areas. This is valuable because society increasingly expects some overall determination of doctors’ abilities to care for patients.

The anesthesia community, not just simulation researchers, can work together on defining good performance as this is an issue for everyone. Simulation for education is disparate from simulation for assessment. As the stakes of the assessment rise, the validity and reliability of the test and proper training of the raters also need to rise. It would be naïve to expect simulation alone to be a panacea. It is one tool at our disposal to be used along with other assessment modalities.

1. Blum et al., for the Harvard Assessment of Anesthesia Resident Performance Research Group: Simulation-based assessment to identify critical gaps in safe anesthesia resident performance.