Clinical teachers and trainees share a common view of what constitutes excellent clinical teaching, but associations between these behaviors and high teaching scores have not been established. This study used residents’ written feedback to their clinical teachers to identify themes associated with above- or below-average teaching scores.
All resident evaluations of their clinical supervisors in a single department were collected from January 1, 2007 until December 31, 2008. A mean teaching score assigned by each resident was calculated. Evaluations that were 20% higher or 15% lower than the resident’s mean score were used. A subset of these evaluations was reviewed, generating a list of 28 themes for further study. Two researchers then independently coded the presence or absence of these themes in each evaluation. Interrater reliability of the themes and logistic regression were used to evaluate the predictive associations of the themes with above- or below-average evaluations.
Five hundred twenty-seven above-average and 285 below-average evaluations were evaluated for the presence or absence of 15 positive themes and 13 negative themes, which were divided into four categories: teaching, supervision, interpersonal, and feedback. Thirteen of 15 positive themes correlated with above-average evaluations and nine had high interrater reliability (Intraclass Correlation Coefficient >0.6). Twelve of 13 negative themes correlated with below-average evaluations, and all had high interrater reliability. On the basis of these findings, the authors developed 13 recommendations for clinical educators.
The authors developed 13 recommendations for clinical teachers using the themes identified from the above- and below-average clinical teaching evaluations submitted by anesthesia residents.
Constructive feedback from trainees improves faculty teaching scores
Although trainees identify behaviors of an ideal teacher, whether they utilize these concepts in providing feedback to faculty is not known
In a two-step process, comments from faculty evaluations over a 2-yr period at one institution were studied to identify themes associated with above- and below-average ratings by trainees
Thirteen themes were identified, using trainee evaluations, and these fell into four domains associated with outstanding teaching
The Accreditation Council for Graduate Medical Education requires residency programs to provide trainees the opportunity to evaluate their supervising faculty members. These evaluations should include the “faculty’s clinical teaching abilities, commitment to the educational program, clinical knowledge, professionalism, and scholarly activities.”1 Such evaluations are often used to support promotion, decide resident assignments, distribute bonus money, and remediate less-skilled teachers.2,3
Evaluations that include constructive feedback may be viewed positively by faculty;4 however, low-performing faculty may be harmed by constructive feedback.5 Previous studies have shown that constructive criticism generally improves clinical teaching scores over time,6,7 whereas teaching scores in the absence of constructive feedback have more mixed results.6 Ideally, comments would reflect specific and modifiable factors, thus providing guidance for the faculty to improve their performance. For example, when surgical faculty were scored on their suitability as role models, many of the lowest-ranked faculty made significant gains over a 6-month period.7 Similarly, anesthesiologists showed gains in their overall teaching scores within a year when provided with both evaluative scores and feedback.6
Using surveys and focus groups,8,9 residents identified four unique roles of the clinician-teacher: physician, supervisor, teacher, and person. These roles are consistent with the Ulian conceptual model, which encompasses specific teaching behaviors and approaches, attitudes toward the trainee, and interpersonal skills.10 Trainees expressed preferences for teaching faculty who had clear expertise and up-to-date clinically relevant knowledge, provided appropriate autonomy and supervision, provided formative feedback, provided efficient relevant teaching through discussion, exhibited kindness and sensitivity toward the trainees, and adopted a collegial manner.8,9 When using general descriptive terms, faculty and residents largely share a single view of what constitutes an ideal clinical teacher.11 To date, the link between these ideals and residents’ thoughts and comments when evaluating their best clinical teachers has not been established. To enhance learning of new material, studies have shown that comparing and contrasting new ideas, topics, or themes can be a more effective and efficient method than simply learning one idea, topic, or theme, and then moving on (massed learning).12 Comparing and contrasting themes found in evaluations with high and low teaching scores is thus an effective way to identify important differences associated with high and low evaluations. The objective of this study was to identify themes found in resident feedback that characterize better-than- or worse-than-average teaching capacities of anesthesia faculty members. We hypothesize that specific behaviors and characteristics will be associated with better- and worse-than-average teaching scores.
Materials and Methods
Anesthesia residents at the Massachusetts General Hospital evaluate their clinical teachers on a monthly basis using numerical scores and free text, as has been previously described.6 The evaluation form has seven assessment categories: overall time spent, clinical supervision, quality of teaching, quantity of teaching, role modeling, and encouragement given to think about the science of anesthesia. Each item is rated on a Likert scale (0–10), with 10 representing the highest score. Free-text comments can be entered in three separate boxes: strengths, areas that need improvement, and additional comments. Teaching scores are calculated by summing up the seven individual scores, and thus, they range from 0 to 70.
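The scoring rule described above can be sketched in a few lines. This is only an illustration, assuming the seven item ratings arrive as a plain list of numbers; the function name and the example ratings are hypothetical and do not come from the evaluation system itself.

```python
def teaching_score(item_ratings):
    """Sum seven item ratings (each 0-10) into a single 0-70 teaching score."""
    if len(item_ratings) != 7:
        raise ValueError("expected exactly seven item ratings")
    for rating in item_ratings:
        if not 0 <= rating <= 10:
            raise ValueError("each rating must lie between 0 and 10")
    return sum(item_ratings)


# Hypothetical ratings for one evaluation, one value per assessment category.
example_score = teaching_score([9, 8, 9, 7, 8, 9, 8])
print(example_score)
```

Because each item is bounded by 0 and 10, the sum is bounded by 0 and 70, which matches the range stated in the text.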
Evaluations are submitted in a confidential process, where each resident evaluator’s name is replaced by a unique number. Faculty members are given their aggregate teaching scores along with normative data and all their free-text comments. Data are released at 6-month intervals and with a delay to ensure resident anonymity.
Our study protocol was approved by the Institutional Review Board of Massachusetts General Hospital (# 2011P-000676) and exempted from review by the University of Michigan (# HUM0005519). We used all the resident evaluations of the faculty, submitted from January 1, 2007 to December 31, 2008. Using unique identifiers, we calculated a mean teaching score that each resident gave to the teaching faculty. We retained only those evaluations that had teaching scores 20% higher or 15% lower than the residents’ mean teaching scores in order to obtain a substantial number of evaluations, with comments related to above- and below-average teaching scores.
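The selection step can be illustrated with a short sketch. It assumes that "20% higher" means a score of at least 1.20 times the resident's own mean and "15% lower" means a score of at most 0.85 times that mean; the text does not spell out the exact comparison, so both cutoffs, the tuple layout, and the toy data are assumptions made for illustration.

```python
from collections import defaultdict
from statistics import mean

ABOVE_FACTOR = 1.20  # assumed reading of "20% higher than the resident's mean"
BELOW_FACTOR = 0.85  # assumed reading of "15% lower than the resident's mean"


def select_evaluations(evaluations):
    """Keep evaluations far enough from each resident's own mean score.

    evaluations: list of (resident_id, faculty_id, score) tuples.
    Returns a list of (resident_id, faculty_id, score, label) tuples,
    where label is "above" or "below".
    """
    scores_by_resident = defaultdict(list)
    for resident_id, _faculty_id, score in evaluations:
        scores_by_resident[resident_id].append(score)
    resident_mean = {r: mean(s) for r, s in scores_by_resident.items()}

    selected = []
    for resident_id, faculty_id, score in evaluations:
        own_mean = resident_mean[resident_id]
        if score >= ABOVE_FACTOR * own_mean:
            selected.append((resident_id, faculty_id, score, "above"))
        elif score <= BELOW_FACTOR * own_mean:
            selected.append((resident_id, faculty_id, score, "below"))
    return selected


# Hypothetical data: resident r1 varies scores, resident r2 gives everyone 60.
toy = [
    ("r1", "f1", 50), ("r1", "f2", 65), ("r1", "f3", 35),
    ("r2", "f1", 60), ("r2", "f2", 60), ("r2", "f3", 60),
]
print(select_evaluations(toy))
```

In this toy data, resident r2 contributes nothing to the selection because every score equals that resident's mean. A resident-relative threshold therefore drops evaluators who give the same score to all faculty.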
After reading all comments submitted during the first 6-month period, Dr. Baker developed a list of recurring themes (tables 1 and 2). Next, all evaluations in the research database were independently reviewed by two investigators (Drs. Haydar and Charnin) blinded to the teaching score. They determined the presence or absence of these themes. If a comment matched none of the listed themes, the investigator could select “positive, not otherwise specified” (positive NOS), “negative, not otherwise specified” (negative NOS), or none. They also predicted whether the evaluation was associated with an above-average or below-average teaching score based on the comments alone.
We evaluated interrater reliability for each theme, using an Intraclass Correlation Coefficient (ICC) and Cohen Kappa, and compared the relationship of each theme with the dichotomized above- and below-average scores, using the two-sided Fisher exact test. We accepted ICCs between 0.40 and 0.59 as fair, between 0.60 and 0.74 as good, and above 0.75 as excellent agreement.13 Themes with poor interrater reliability (i.e., ICC <0.40) were excluded from further analyses. We performed linear regressions to assess for collinearity between variables. We then excluded from our predictive model themes that loaded solely on above- or below-average evaluations, as they may cause major errors in logistic regressions.14 Finally, we used a logistic regression model, where the dichotomous outcome, above-average or below-average evaluation, was regressed on the independent themes. All comparisons were two-sided, and a P value less than 0.05 was accepted as significant. All analyses were conducted using SPSS (version 20; IBM Corporation, Armonk, NY) or Origin (version 7.5 SR4; OriginLab Corp., Northampton, MA).
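As a rough illustration of the reliability step, the sketch below computes Cohen's kappa for two raters coding one theme as present (1) or absent (0), and maps an ICC value onto the agreement bands quoted above. It is not the SPSS procedure used in the study. The rater codes are hypothetical, and treating the bands as contiguous (for example, counting exactly 0.75 as excellent) is an assumption, since the text leaves small gaps between bands.

```python
def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters' binary (0/1) codes on the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must code the same non-empty set of items")
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    p_a = sum(rater_a) / n  # proportion coded "present" by rater A
    p_b = sum(rater_b) / n  # proportion coded "present" by rater B
    expected = p_a * p_b + (1 - p_a) * (1 - p_b)  # chance agreement
    if expected == 1.0:
        return float("nan")  # kappa is undefined when neither rater varies
    return (observed - expected) / (1 - expected)


def reliability_band(icc):
    """Map an ICC onto the bands quoted in the text (boundaries assumed)."""
    if icc < 0.40:
        return "poor"
    if icc < 0.60:
        return "fair"
    if icc < 0.75:
        return "good"
    return "excellent"


# Hypothetical codes for one theme across ten evaluations.
rater_a = [1, 1, 1, 0, 0, 0, 1, 0, 1, 0]
rater_b = [1, 1, 0, 0, 0, 0, 1, 0, 1, 1]
kappa = cohen_kappa(rater_a, rater_b)
print(round(kappa, 3), reliability_band(0.95))
```

Kappa corrects raw agreement for the agreement expected by chance. This matters for rarely used themes: two raters who almost always code "absent" will agree often even without sharing a definition of the theme.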
From January 1, 2007 to December 31, 2008, 117 residents submitted 9,786 evaluations on 162 faculty members. The overall mean teaching score was 58.3 with an SD of 10.3. There were 527 evaluations that had both a teaching score 20% above the resident’s mean and comments. There were 285 evaluations that had both a teaching score 15% below the resident’s mean and comments. We excluded nine of these evaluations for unintelligible comments and 198 (25%) that contained only “positive NOS,” “negative NOS,” or both, leaving 605 evaluations for analysis. Of these, 61% were above average, whereas 39% were below average. Sixty-seven percent of all evaluations had positive comments (97% of above-average, 21% of below-average), whereas 43% had negative comments (8% of above-average, 97% of below-average). The distribution of above- and below-average teaching scores received by each individual faculty member is presented in figure 1. The distribution of above- and below-average teaching scores assigned by each resident is presented in figure 2. A majority of residents (71%) submitted evaluations that ended up in the research database, and 82% of the teaching faculty was represented. Among these faculty members, 29% received exclusively above-average evaluations and 14% received exclusively below-average evaluations, whereas the majority (56%) received both. Among the residents represented, 59% submitted both above- and below-average evaluations, whereas 24% submitted only above-average evaluations, and 17% submitted only below-average evaluations.
Using only comments, we correctly identified 90.8% of below-average evaluations and 94.6% of above-average evaluations. The interrater reliability between reviewers’ classifications into above- and below-average evaluations was excellent with an ICC of 0.95. A high degree of interrater reliability13 (ICC >0.6) was found for 9 of 15 positive themes and for all negative themes (tables 3 and 4). One positive theme (teaching to the appropriate level of the resident; table 1, p5) and both “Positive NOS” and “Negative NOS” had very poor reliability and were excluded from further analysis. Nearly all positive themes had statistically significant associations with above-average teaching scores. Only one positive theme (having high expectations for the resident; table 3, p10) was not significantly associated with above-average teaching scores. Nearly all negative themes had significant associations with below-average teaching scores. Only one negative theme (providing teaching that is overly limited in scope or clinically irrelevant; table 4, n3) was not significantly associated with below-average teaching scores. No items were found to be collinear, indicating that each theme was independent of the other.
Our logistic regression analysis demonstrated six positive themes that were independently associated with above-average evaluations (having education-oriented discussions; spending adequate time teaching; demonstrating an active effort in teaching the resident; allowing a healthy balance of supervision to autonomy; providing support during teaching a new procedure; and treating the resident in a collegial and/or respectful manner; table 3: p3, p4, p6, p9, p11, and p15, respectively). In addition, several positive themes were found only among above-average evaluations, but these themes could not be regressed using binary logistic regression due to their highly skewed distribution. Three of these themes were highly correlated with above-average evaluations using Fisher exact test (explaining why specific management strategies were used; challenging the resident to a better performance; and providing developmental feedback; table 3: p2, p12, and p14, respectively).
Our logistic regression analysis also demonstrated eight negative themes that were independently associated with below-average evaluations (failing to explain why specific management strategies were chosen; spending an inadequate amount of time teaching; being too rigid or prescriptive in the management of a patient; being too passive or unhelpful during busy or challenging times; intervening prematurely without involving the resident in a decision or a procedure; providing insufficient supervision or too little autonomy in the management of the patient; becoming impatient, frustrated, or angry with the resident; and adopting an intimidating demeanor or treating the resident in an overly rude, condescending, or abrasive manner; table 4: n1, n2, n4, n5, n6, n7, n10, and n12, respectively). In addition, several negative themes were found only among below-average evaluations, but these themes could not be regressed using binary logistic regression because the themes did not appear in the above-average evaluations (no counts). Three of these themes were highly correlated with below-average evaluations using Fisher exact test (having a low clinical ability as perceived by the resident; being overly critical of the resident; and speaking ill of other residents who are not present; table 4: n8, n11, and n13, respectively).
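A small sketch may help show why themes with zero counts in one group were assessed with the Fisher exact test rather than logistic regression. With an empty cell in the 2×2 table, the sample log odds ratio is not finite, so a maximum-likelihood logistic coefficient has no finite estimate, whereas the exact test still returns a P value. The counts below are hypothetical toy numbers, not study data. The two-sided rule used here (summing all tables no more probable than the observed one) is one common convention and is assumed rather than taken from the text.

```python
from math import comb, log


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact P value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2 = a + b, c + d
    col1 = a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_observed = prob(a)
    low, high = max(0, col1 - row2), min(row1, col1)
    # Sum every table at most as probable as the observed one.
    return sum(prob(x) for x in range(low, high + 1)
               if prob(x) <= p_observed * (1 + 1e-9))


def log_odds_ratio(a, b, c, d):
    """Sample log odds ratio, or None when an empty cell makes it non-finite."""
    if 0 in (a, b, c, d):
        return None
    return log((a * d) / (b * c))


# Hypothetical theme: present in 0 of 10 above-average evaluations
# and in 5 of 10 below-average evaluations.
table = (0, 10, 5, 5)
p_value = fisher_exact_two_sided(*table)
print(round(p_value, 4), log_odds_ratio(*table))
```

The exact test conditions on the table margins and enumerates the possible tables directly. It therefore needs no parameter estimate and is unaffected by the empty cell.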
Due to lower interrater reliability (ICC <0.6) of some of the themes in the logistic regression, we performed a second and separate logistic regression using only the second rater’s themes. To accomplish this regression, we also excluded p13 and n9, as these themes had zero counts for below-average and above-average evaluations, respectively. The regression results, using only the second rater’s data, found the same statistically significant associations as the first regression. Associations that were not significant in the first regression were not significant when coded by the second rater.
Resident evaluations of the teaching faculty can be considered “high-stakes” evaluations with potential implications for promotion, raises, and other professional compensation.3 Trainee expressions of ideal behaviors and characteristics of teaching faculty fall neatly into the four domains described by the Ulian model: physician, supervisor, teacher, and person.8,10 Among the themes we evaluated, most address the latter three domains, and many relate to potentially modifiable behaviors. The comments in our study convey diagnostic teaching information because we were able to correctly categorize more than 90% of the evaluations as being above- or below-average based solely on the comments. Residents disproportionately included more positive comments on below-average evaluations than negative comments on above-average evaluations. They also submitted many more above-average evaluations than below-average evaluations. The mean score for faculty teaching was well above the center of the scale, and this indicates grade inflation, as has been seen elsewhere.6,15 We also encountered below-average evaluations that contained no negative comments (8.6%), and some above-average evaluations that contained no positive comments (3.0%). Characterizing above-average teaching using resident comments may be helpful in identifying why a particular faculty member receives high scores. Using raw scores alone can be problematic because of grade inflation and other independent factors, which influence teaching scores.16
Our chosen positive themes provide support for many practices that clinician-educators regard as essential. These include providing both appropriate autonomy and supervision, imparting knowledge, providing developmental feedback, and doing this in a manner that is fitting of a future colleague. Many of our positive themes reflect the trainee’s desire to learn and develop as a clinician. Importantly, when residents expressed satisfaction with faculty feedback, the teaching scores were always above-average. Providing constructive feedback is considered to be an essential element of clinical teaching by the Accreditation Council for Graduate Medical Education,1 expert panels,17 faculty,18 and trainees,8,18 alike. The importance of feedback has only recently been recognized, having been rarely mentioned in descriptions of the ideal teacher just a generation ago.19 Paradoxically, by providing such feedback, teaching faculty may lower their teaching scores, as compared with merely providing praise.20 At the same time, however, being merely personable, collegial, and respectful was not sufficient to ensure a high-scoring evaluation. The single most common theme found in above-average evaluations related to the faculty member spending adequate time teaching.
Negative themes had a high degree of reliability and were significantly associated with below-average evaluations. This establishes these themes as possible causes for the low teaching scores. Many of the negative themes represent behaviors that are the negative corollary of the exemplary behaviors in the positive themes. Avoiding these behaviors may provide an avenue for faculty members to improve trainee satisfaction with their teaching and supervision, and subsequently improve their own teaching scores. Our residents frequently request more intraoperative teaching, and this was reflected in the comments we studied. Similarly, in a recent study of free-text comments, the most common “area for improvement” comment was a request for more teaching.21 Several of our negative themes may be easy for clinical teachers to manage, such as avoiding derogatory comments about residents, or taking the time to explain clinical decision-making. In a recent contextual analysis of clinical teaching faculty evaluations, Myers21 concluded that written comments “… seem unlikely to provide faculty with substantive feedback.” In contrast, our study reveals pointed and constructive criticism in low-scoring evaluations, with a high degree of reliability and a significant and independent correlation with having a low teaching score. This difference in conclusion may stem from our focus on high- and low-scoring evaluations, which may contain more substantive content, as well as our use of more evaluations over a longer period of time.
Taken together, our results have implications for clinical teachers. The numeric teaching score has little value independent of the associated comments, at least for above- or below-average evaluations. Most faculty members included in this study received both above- and below-average evaluations (fig. 1), and most residents submitted both above- and below-average evaluations (fig. 2). Though periodic evaluations may have a positive effect on teaching scores, a recent review concluded that evaluations were insufficient as a means to improve teaching effectiveness.22 For low-performing faculty, receiving low-scoring evaluations may in fact have a negative effect on teaching performance.5 Moreover, as with any individual,23 faculty members may have limited insight into their own teaching effectiveness.24 Use of observation, mentorship, and other faculty development tools has been demonstrated to improve teaching performance, with some having positive effects lasting for years.22
A particular strength of this study is the combination of a contextual analysis (recurring themes found in free-text comments) with above- or below-average evaluations. Contextual analyses have been done previously establishing different domains of the ideal clinical teacher.10,21 To date, however, none of these themes had been demonstrated to correlate with positive evaluations or high teaching scores. This study demonstrates correlations with key behaviors in each of the four domains, both positively and negatively, and thus, provides some validation for each domain. It is important to state that our results are correlations, and thus, we cannot state cause and effect. These results are best characterized as descriptive of our resident-clinical teacher interactions.
Findings from this study may be limited by selection bias because we excluded evaluations that were not associated with above- or below-average teaching scores. This allowed for efficient detection of positive and negative teaching themes; however, further study of the themes found in average-scoring evaluations might help elucidate more nuanced themes. Additionally, the potential for reporting bias may have resulted from our subjective designation of comments into different categories, our use of a single individual (Dr. Baker) to create the initial set of themes, and the lack of specific definitions for several of our themes. Creating the theme list using the Delphi method may have been preferable; however, the results of independent logistic regression models for each rater were comparable, supporting the overall theme construction. They also had extremely strong correlations with the appropriate category of above- or below-average teaching scores. The designated themes captured the majority of the comments in these evaluations because less than half included “positive NOS” or “negative NOS.” Many of these NOS comments had no developmental or meaningful content; many simply stated “great teacher.” Some of the residents submitting evaluations were excluded from our study database due to the tendency to give the same teaching score to all faculty members, precluding the ability to separate above- or below-average scores. Our analysis did not take into account the hierarchical nature of the data structure, and we treated each evaluation as an independent measure. When the same resident evaluates the same faculty member on more than one occasion, the evaluations are unlikely to be independent. This may have inflated the statistical significance of some themes. However, this would not have any effect on the interrater reliability. Further study is warranted using independent samples from multiple institutions. 
Finally, we reiterate the correlative nature of our results, which precludes our ability to state cause and effect.
This study is predicated on the idea that residents submitting evaluations accurately identify and report valid reasons for the high or low teaching score that they assign. Using learner evaluations to assess the quality of teaching has long been a subject of debate.16 Given the lack of a “definitive standard” for assessing clinical instruction, trainee feedback remains the main avenue for assessment. Recent advances include validated instruments,25 specific benchmarks set by expert panels,17 and focus groups to help define “better teaching.”
This study found specific positive and negative themes in the comments section of resident evaluations of clinical teachers and related these themes to above- and below-average teaching scores. This study provides an association between above-average teaching scores and the behaviors associated with excellent teaching. Conversely, this study provides an association between below-average teaching scores and the behaviors associated with below-average teaching. These themes are recast as recommendations in table 5.