The objective of this study was to systematically determine the diagnostic accuracy of bedside tests for predicting difficult intubation in patients with no airway pathology. Thirty-five studies (50,760 patients) were selected from electronic databases. The overall incidence of difficult intubation was 5.8% (95% confidence interval, 4.5-7.5%). Screening tests included the Mallampati oropharyngeal classification, thyromental distance, sternomental distance, mouth opening, and Wilson risk score. Each test yielded poor to moderate sensitivity (20-62%) and moderate to fair specificity (82-97%). The most useful bedside test for prediction was found to be a combination of the Mallampati classification and thyromental distance (positive likelihood ratio, 9.9; 95% confidence interval, 3.1-31.9). Currently available screening tests for difficult intubation have only poor to moderate discriminative power when used alone. Combinations of tests add some incremental diagnostic value in comparison to the value of each test alone. The clinical value of bedside screening tests for predicting difficult intubation remains limited.
Unanticipated difficult intubation can be challenging to anesthesiologists. Numerous investigators have attempted to predict difficult intubation by using a simple bedside physical examination. Mallampati et al.1 introduced in 1985 a currently well-known screening test that classifies visibility of the oropharyngeal structure. The distance from the thyroid notch to the mentum (thyromental distance), the distance from the upper border of the manubrium sterni to the mentum (sternomental distance), and a simple summation of risk factors (Wilson risk sum score) are widely recognized as tools for predicting difficult intubation.2,3 Nevertheless, the diagnostic accuracy of these screening tests has varied from trial to trial,4 probably because of differences in the incidence of difficult intubation, inadequate statistical power, different test thresholds, or differences in patient characteristics. Questions remain as to whether a combination of tests may improve predictive accuracy or whether predictive accuracy differs for specific groups of patients, such as obstetric or obese patients, in whom difficult intubation is considered to occur more often than in normal patients. A recent editorial by Yentis5 made clear how hard it is to predict difficult intubation because of its low rate of occurrence and questioned whether attempts at prediction are likely to be useful.
To answer these questions, we systematically reviewed and synthesized the published data relating to the performance of diagnostic tests for difficult intubation in normal-airway patients scheduled to undergo general anesthesia.
Study Selection and Quality Assessment
We searched MEDLINE (1980 through May 2004) and the Cochrane Central Register of Controlled Trials (2004, issue 1) for reports of studies and trials relating to the accuracy of predictive tests for difficult intubation. No language restrictions were applied. The initial search terms were difficult airway, difficult intubation, and difficult laryngoscopy. A manual search of references cited in published reports and reviews was also performed.
Reports were independently selected and reviewed by two investigators (T. S. and Z. W.). The systematic review process for selection of eligible studies is shown in figure 1. Reported studies were selected if they met the following criteria: (1) the study was prospective; (2) at least one bedside diagnostic test was used; (3) absolute numbers of true-positive, false-positive, true-negative, and false-negative results were available or could be derived from the published data; and (4) a standard laryngoscope was used. We did not include articles from any study with insufficient data, with patients whose airway was anatomically abnormal, or with a complicated and not widely accepted scoring system. We excluded retrospective studies, studies requiring impractical and costly diagnostic tests that are not yet widely accepted (e.g., radiologic examinations), and studies involving a special laryngoscope or technique.
The quality of acceptable reports was assessed independently by two authors (T. S. and Z. W.). Studies were graded according to four a priori criteria for quality, as described by Romagnuolo et al.6: (1) blinding, (2) consecutive recruitment of patients, (3) single (vs. composite) accepted standards, and (4) nonselective use of the accepted standard. One point was given for each criterion; the maximum possible score was 4.
We defined a Cormack–Lehane grade7 of 3 or greater as the accepted standard for difficult intubation, as in most studies included in the review. Some authors reported a required special technique, multiple unsuccessful attempts, or a combination of these as the accepted standard for difficult intubation. If a definition seemed so subjective as to be generally unacceptable, we abandoned it and substituted the Cormack–Lehane grades whenever available. Extracted from the reports were the number of patients, mean age, general patient characteristics, criteria for difficult intubation, type of laryngoscope blade, incidence of difficult intubation, type of screening test, and absolute numbers of true-positive, false-positive, true-negative, and false-negative results. For a diagnostic test to be included in our analysis, at least three reports of that test had to have been identified in our literature search. The Mallampati classification, thyromental distance, sternomental distance, mouth opening, Wilson risk sum score, and a combination of the Mallampati classification and thyromental distance met our inclusion criteria. In addition, we performed subgroup analysis. Obstetric or obese patient groups were analyzed if they were grouped separately within a study and the subgroup appeared in at least three studies.
We calculated pooled estimates of the incidence of difficult intubation, sensitivity, specificity, positive and negative likelihood ratios, and natural logarithm of diagnostic odds ratio by the DerSimonian–Laird random-effects model.8 Rates were pooled after logit transformation, weighting each study rate by the inverse of the sum of its within-study variance and the between-study variance for that measure, and were then back-transformed into standard proportions with 95% confidence intervals (CIs). Homogeneity of the effect size across trials was tested by chi-square statistics. Heterogeneity was defined as P < 0.1.
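The pooling step described above can be illustrated with a minimal Python sketch. It assumes the usual large-sample variance for a logit-transformed proportion, 1/events + 1/(total − events), and assumes that no study has zero events or zero non-events; the function name and the example counts are hypothetical and are not taken from the reviewed studies. The Q statistic it returns is the chi-square homogeneity statistic with k − 1 degrees of freedom.

```python
import math


def pool_logit_dersimonian_laird(events, totals):
    """Pool proportions on the logit scale with DerSimonian-Laird weights.

    Assumes at least two studies and 0 < events < total in every study.
    Returns (pooled proportion, lower 95% CI, upper 95% CI, Q, tau^2).
    """
    # Logit of each study proportion and its approximate variance.
    y = [math.log(e / (n - e)) for e, n in zip(events, totals)]
    v = [1 / e + 1 / (n - e) for e, n in zip(events, totals)]

    # Fixed-effect (inverse-variance) estimate, needed for Cochran's Q.
    w = [1 / vi for vi in v]
    fixed = sum(wi * yi for wi, yi in zip(w, y)) / sum(w)
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, y))

    # Method-of-moments estimate of the between-study variance.
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - (len(y) - 1)) / c)

    # Random-effects weights: inverse of (within + between) variance.
    w_star = [1 / (vi + tau2) for vi in v]
    pooled = sum(wi * yi for wi, yi in zip(w_star, y)) / sum(w_star)
    se = math.sqrt(1 / sum(w_star))

    def inverse_logit(x):
        return 1 / (1 + math.exp(-x))

    return (
        inverse_logit(pooled),
        inverse_logit(pooled - 1.96 * se),
        inverse_logit(pooled + 1.96 * se),
        q,
        tau2,
    )


# Hypothetical counts of difficult intubations per study (illustration only).
pooled, lower, upper, q, tau2 = pool_logit_dersimonian_laird(
    events=[12, 30, 8, 45], totals=[200, 350, 180, 900]
)
print(f"pooled incidence {pooled:.3f} (95% CI {lower:.3f}-{upper:.3f})")
```

When the studies are homogeneous, the between-study variance is estimated as zero and the result reduces to the fixed-effect inverse-variance estimate.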
Sensitivity is the ratio of the true-positive number to the sum of true-positive plus false-negative numbers. Specificity is the ratio of the true-negative number to the sum of true-negative plus false-positive numbers.9 Likelihood ratios are obtained as follows: positive likelihood ratio = sensitivity/(1 − specificity); negative likelihood ratio = (1 − sensitivity)/specificity. Likelihood ratios greater than 10 and less than 0.1 are considered strong evidence for ruling in or ruling out diagnoses, respectively, under most circumstances.10 The log diagnostic odds ratio is the natural logarithm of (positive likelihood ratio/negative likelihood ratio) and provides a single summary of diagnostic performance.11
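As an illustration of these definitions, the following Python sketch computes each measure from a single 2 × 2 table. The function name and the counts are hypothetical; the log diagnostic odds ratio is taken here as the natural logarithm of the ratio of the two likelihood ratios, and the sketch assumes that no cell of the table is zero.

```python
import math


def diagnostic_metrics(tp, fp, fn, tn):
    """Compute accuracy measures from one 2 x 2 table (no zero cells assumed)."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    lr_positive = sensitivity / (1 - specificity)
    lr_negative = (1 - sensitivity) / specificity
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "lr_positive": lr_positive,
        "lr_negative": lr_negative,
        # ln(LR+ / LR-), which equals ln((tp * tn) / (fp * fn)).
        "log_dor": math.log(lr_positive / lr_negative),
    }


# Hypothetical counts (illustration only).
metrics = diagnostic_metrics(tp=30, fp=90, fn=30, tn=850)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```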
The diagnostic performance of each test was also assessed by means of summary receiver operating characteristic (ROC) curves according to the method described by Moses et al.12 Briefly, the true-positive rate was plotted against the false-positive rate for each study. To avoid calculation problems caused by values of zero, 0.5 was added to each cell of the respective contingency table. The summary ROC model is described by the equation D = a + bS. The summary ROC curve analysis is based on regression analysis of logit-transformed data, which plots the difference between the logit of the true-positive rate (TPR) and the logit of the false-positive rate (FPR) (D = logit TPR − logit FPR) on the y-axis and their sum (S = logit TPR + logit FPR) on the x-axis. The y-axis (D) is equivalent to the log diagnostic odds ratio, and the x-axis (S) is a measure of how the test characteristics vary with the test threshold. The regression coefficient b examines the extent to which the log odds ratio is dependent on the threshold values chosen. The linear regression analysis was weighted by the inverse of the variance of D. The regression line was back-transformed to the ROC space.
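The regression and back-transformation can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the software used in the analysis: the variance of D is approximated by the sum of the reciprocals of the four corrected cell counts, the function names are hypothetical, and the 2 × 2 tables are invented for the example. Solving D = a + bS for the true-positive rate gives logit TPR = [a + (1 + b) logit FPR]/(1 − b), which is the form used for the back-transformation.

```python
import math


def logit(p):
    return math.log(p / (1 - p))


def fit_summary_roc(tables):
    """Fit D = a + b*S by weighted least squares.

    tables: iterable of (tp, fp, fn, tn) counts, one tuple per study.
    """
    d_values, s_values, weights = [], [], []
    for tp, fp, fn, tn in tables:
        # Continuity correction: add 0.5 to every cell.
        tp, fp, fn, tn = (x + 0.5 for x in (tp, fp, fn, tn))
        tpr = tp / (tp + fn)
        fpr = fp / (fp + tn)
        d_values.append(logit(tpr) - logit(fpr))
        s_values.append(logit(tpr) + logit(fpr))
        # Assumed variance of D (variance of a log odds ratio).
        weights.append(1 / (1 / tp + 1 / fp + 1 / fn + 1 / tn))

    total_w = sum(weights)
    s_bar = sum(w * s for w, s in zip(weights, s_values)) / total_w
    d_bar = sum(w * d for w, d in zip(weights, d_values)) / total_w
    sxx = sum(w * (s - s_bar) ** 2 for w, s in zip(weights, s_values))
    sxy = sum(
        w * (s - s_bar) * (d - d_bar)
        for w, s, d in zip(weights, s_values, d_values)
    )
    b = sxy / sxx
    a = d_bar - b * s_bar
    return a, b


def summary_tpr(fpr, a, b):
    """Back-transform the fitted line to ROC space for a given FPR."""
    logit_tpr = (a + (1 + b) * logit(fpr)) / (1 - b)
    return 1 / (1 + math.exp(-logit_tpr))


# Hypothetical (tp, fp, fn, tn) tables (illustration only).
tables = [(20, 50, 30, 900), (15, 80, 10, 600), (40, 120, 60, 1500), (8, 20, 12, 300)]
a, b = fit_summary_roc(tables)
curve = [(fpr, summary_tpr(fpr, a, b)) for fpr in (0.05, 0.1, 0.2, 0.4)]
```

A slope b close to zero suggests that the log diagnostic odds ratio does not depend on the threshold, in which case the intercept a alone summarizes performance.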
Assessment of Publication Bias
To assess the potential for publication bias, a funnel plot was constructed in which the log of relative risks was plotted against the associated number of patients.13 In addition, correlation between standardized log relative risks and the associated number of patients was determined by the Kendall rank correlation coefficient. The correlation between sample size and relative risk would be strong if small studies with null results were less likely to be published. A significant correlation between sample size and relative risk would not exist in the absence of publication bias. Statistical significance was defined for treatment effects as P < 0.05 and for heterogeneity and publication bias as P < 0.1. Analyses were performed with Microsoft Excel (Microsoft Corporation, Redmond, WA), Meta-DiSc® (Hospital Ramón y Cajal, Madrid, Spain), and Number Cruncher Statistical System 2004 (NCSS Statistical Systems, Kaysville, UT).
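A rank correlation test of this kind can be sketched in Python as follows. The sketch is an assumption-laden illustration rather than the procedure implemented in the statistical packages named above: it computes Kendall's tau without a correction for ties, uses the large-sample normal approximation for the two-sided P value, and runs on invented effect sizes and sample sizes.

```python
import math
from itertools import combinations


def kendall_tau(x, y):
    """Kendall rank correlation (no tie correction) with a two-sided
    P value from the large-sample normal approximation."""
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        direction = (x1 - x2) * (y1 - y2)
        if direction > 0:
            concordant += 1
        elif direction < 0:
            discordant += 1
    n = len(x)
    tau = (concordant - discordant) / (n * (n - 1) / 2)
    # Under the null hypothesis, var(tau) = 2(2n + 5) / (9n(n - 1)).
    z = tau / math.sqrt(2 * (2 * n + 5) / (9 * n * (n - 1)))
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return tau, p_value


# Hypothetical standardized effects and study sizes (illustration only).
effects = [0.9, 1.4, 0.7, 1.1, 1.6, 0.8, 1.2]
sizes = [120, 450, 300, 80, 1500, 640, 220]
tau, p_value = kendall_tau(effects, sizes)
print(f"tau = {tau:.2f}, P = {p_value:.2f}")
```

With the threshold stated above, a P value below 0.1 would be read as evidence of funnel plot asymmetry.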
The electronic search resulted in 3,318 hits. Thirty-five studies14–48 representing 50,760 patients met the inclusion criteria (table 1). Non–English-language reports included 3 in French, 2 in German, 1 in Italian, and 1 in Japanese. We excluded the original articles by Mallampati et al.1 and Wilson et al.3 because the test designers were also the test assessors. One report49 was included in the analysis of obese populations but was excluded from the final analysis because of possible duplication of data.
The overall incidence of difficult intubation was 5.8% (95% CI, 4.5–7.5%) for the overall patient population, 6.2% (95% CI, 4.6–8.3%) for normal patients excluding obstetric and obese patients, 3.1% (95% CI, 1.7–5.5%) for obstetric patients, and 15.8% (95% CI, 14.3–17.5%) for obese patients. Data pertaining only to obstetric patients were given in four reports. Of these reports, three assessed risk on the basis of the Mallampati classification and one assessed risk on the basis of sternomental distance; therefore, we analyzed only the Mallampati test data. Data pertaining exclusively to obese patients were given in four reports, all of which assessed risk on the basis of the Mallampati test.
Pooled estimates of the incidence of difficult intubation, sensitivity, specificity, positive and negative likelihood ratios, and natural logarithm of diagnostic odds ratio as well as the regression model equation for each test are shown in table 2. The summary ROC curve for each test is shown in figure 2. With the exception of thyromental distance, diagnostic accuracy did not vary with the test threshold in any test. Because diagnostic accuracy tended to vary with the test threshold for thyromental distance (P = 0.056), we calculated likelihood ratios with an adjusted cutoff point; a stricter criterion for thyromental distance (≤ 6.0 cm) was applied. In a subgroup of eight studies14,20–23,26,32,42 with a cutoff of 6.0 cm or less, pooled positive and negative likelihood ratios were updated to 4.1 (95% CI, 2.3–7.0) and 0.8 (95% CI, 0.6–0.9), respectively, with significant heterogeneity, indicating that a thyromental distance of 6.0 cm or less slightly improved the prediction of difficult intubation.
We calculated posttest probability because it enabled us to generalize our results across varying pretest probabilities (i.e., varying incidences of difficult intubation).10,11 Calculation of posttest probabilities by means of likelihood ratios is shown in table 2. For example, patients with a 5% pretest probability of difficult intubation have a 15% risk of difficult intubation after a positive thyromental distance test result and a 4% risk of difficult intubation after a negative thyromental distance test result. The risk of difficult intubation after positive and negative test results is shown with a possible range of pretest probabilities (table 2).
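The conversion from pretest to posttest probability can be reproduced with a short Python sketch. It uses the standard odds form of Bayes' theorem (posttest odds = pretest odds × likelihood ratio); the function name is hypothetical, and the two likelihood ratios are the pooled positive likelihood ratios quoted in the text for thyromental distance (3.4) and for the combination of the Mallampati classification and thyromental distance (9.9).

```python
def posttest_probability(pretest_probability, likelihood_ratio):
    """Convert a pretest probability to a posttest probability via odds."""
    pretest_odds = pretest_probability / (1 - pretest_probability)
    posttest_odds = pretest_odds * likelihood_ratio
    return posttest_odds / (1 + posttest_odds)


# Pooled positive likelihood ratios quoted in the text.
for label, lr in (("thyromental distance", 3.4), ("Mallampati + thyromental", 9.9)):
    risk = posttest_probability(0.05, lr)
    print(f"{label}: {risk:.1%} risk after a positive result")
```

With a 5% pretest probability, these likelihood ratios give posttest probabilities of approximately 15% and 34%, in agreement with the values reported in the text.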
Symmetry in the funnel plot was supported by nonsignificant Kendall correlation coefficients of 0.18 (P = 0.14) for the Mallampati test and 0.23 (P = 0.19) for thyromental distance, which suggests the absence of publication bias.
The Mallampati score may estimate the size of the tongue relative to the oral cavity1,4 and may possibly indicate whether displacement of the tongue by the laryngoscope blade is likely to be easy or difficult. In addition, it assesses whether the mouth can be opened adequately to permit intubation. The Mallampati test assesses not only pharyngeal structure but also head and neck mobility. Recent investigation50 has suggested that craniocervical extension relates to mouth opening, and limited head or neck mobility may result in a poor Mallampati score. Despite theoretical arguments for this test, poor pooled sensitivity values and relatively moderate specificity values were obtained in our analysis. Positive and negative likelihood ratios were moderate but unsatisfactory for clinical use. Heterogeneity was present in sensitivity and specificity. Heterogeneity and inadequate diagnostic performance may result in part from inconsistency or uncertainty in performing the tests; e.g., the Mallampati test may have been conducted with or without phonation and/or with different head or tongue positions. Some reports omitted descriptions of how the tests were administered. Because of these factors, the Mallampati test may be of marginal diagnostic value.
Thyromental distance is considered to be an indicator of mandibular space.4 This test also reflects whether displacement of the tongue by the laryngoscope blade will be easy or difficult. The diagnostic value of thyromental distance proved unsatisfactory in our analysis. A wide range in test sensitivity may result in heterogeneity. Heterogeneity may be due to the variety of test thresholds: cutoff points varied from 4.0 to 7.0 cm. The summary ROC analysis showed a trend toward variation in overall diagnostic performance of the thyromental test in relation to test threshold. Our additional analysis showed that the positive likelihood ratio improved from 3.4 to 4.1 when a stricter cutoff criterion (≤ 6.0 cm) was applied. Because one study21 with a cutoff of less than 4.0 cm yielded higher diagnostic performance with positive and negative likelihood ratios of 9.4 and 0.03, respectively, we should reevaluate the test threshold for thyromental distance. Another source of heterogeneity may be variation in measurement conditions: thyromental distance could have been measured from inside or outside the mentum. The methods of measurement must be standardized.
Sternomental distance can be an indicator of head and neck mobility.31 Head extension is believed to be an important factor in determining the ease or difficulty of intubation. Among single-factor tests, sternomental distance yielded the highest positive likelihood ratio and diagnostic odds ratio with moderate sensitivity and specificity. The negative likelihood ratio was lower than that of any other test, suggesting that it is the best single test for ruling out difficult intubation. The cutoff point of sternomental distance was consistently 12.5 to 13.5 cm. However, only three studies were included in our analysis. Therefore, the diagnostic performance remains inconclusive. Further investigation is required because so few studies address sternomental distance.
Mouth opening seemed in our analysis to be an inadequate predictor of difficult intubation. It may be argued that mouth opening indicates movement of the temporomandibular joint and that significantly limited mouth opening hinders exposure of the larynx. Several studies based on multivariate analysis3,51 indicated that limited mouth opening is strongly associated with difficult intubation. The unexpected result in our analysis may reflect variation in measurement thresholds; the threshold was not even clearly stated in one study.32 Our analysis suggests that mouth opening is not a useful test; however, we could not determine whether this is because of limited data or because mouth opening is truly not useful in predicting difficult intubation. This area would benefit from further investigation.
Wilson Risk Score
The CI of the Wilson risk score is narrower than that of other tests, and sensitivity and specificity are homogeneous. The same criterion (score ≥ 2) was applied in all studies included in our analysis, making the data cluster very closely together and thus yielding a narrower CI in pooled sensitivity and specificity. All included studies set the test threshold somewhat high; therefore, sensitivity remained low and specificity remained high on our summary ROC curve. The Wilson risk score with a cutoff value of 2 or greater yielded a low true-positive rate and a low false-positive rate, meaning that the test threshold correctly identifies patients for whom intubation will be easy. Although our analysis did not include the original data of Wilson et al.,3 our pooled sensitivity and specificity with a cutoff score of 2 or greater seem to be similar to their original sensitivity and specificity data. This suggests that the Wilson risk score has high reproducibility.
Combination of Mallampati Classification and Thyromental Distance
We found that a combination of the Mallampati test and thyromental distance most accurately predicted difficult intubation. This combination yielded low sensitivity, but the positive likelihood ratio (9.9) supports the test as a strong predictor of difficult intubation. The log diagnostic odds ratio (3.3) and the area under the summary ROC curve (0.84) are the highest of all tests. Patients with a 5% pretest probability of difficult intubation were shown to have a 34% risk of difficult intubation after a positive result for the combination test, a 16% risk after a positive result of the Mallampati test alone, and a 15% risk after a positive result of thyromental distance alone. Therefore, the discriminative power is greater when the tests are used in combination rather than alone. These findings suggest that a combination of the Mallampati classification and thyromental distance has the highest discriminative power among currently available tests. However, heterogeneity and an insufficient number of studies limit definitive conclusions.
Mallampati Classification in Obstetric and Obese Populations
We found that diagnostic performance of the Mallampati test in obstetric and obese populations is similar to that in the overall population. The diagnostic odds ratios in these populations are similar, and the trend toward poor sensitivity and fair specificity remained. We also found the incidence of difficult intubation in obese (body mass index > 30) patients to be approximately 2.5 times that of normal patients (15.8% vs. 6.2%). Obese patients with a 15% pretest probability of difficult intubation had a 34% risk of difficult intubation after a positive Mallampati test result, twice the risk of the normal population with a 5% pretest probability. Excessive soft tissue in the velopalate, retropharynx, and submandibular regions in obese patients may cause difficulty in laryngoscopy.49 Our result confirms the common understanding that obese patients have a greater incidence of difficult intubation than normal patients. Because of the high incidence of difficult intubation in these patients, the Mallampati test may yield higher posttest probability of difficult intubation in obese patients than in normal patients. Data for the obstetric population, however, remain inconclusive because of the small number of studies and the heterogeneity.
Strengths and Limitations
Our meta-analysis showed the incidence of difficult intubation in normal patients without pathologic airway anatomy to be 5.8%, which lies within the limits of the incidence reported in the literature we reviewed.2,4,52,53 This can be viewed as a strength in terms of the external validity of our findings. However, our meta-analysis has several limitations. First, although publication bias was not identified for the Mallampati classification or for thyromental distance, few studies were included for the other diagnostic tests, and unpublished studies may exist. Second, the reference standard for difficult intubation differed somewhat among studies. Most studies defined difficult intubation as a Cormack–Lehane grade of 3 or greater, but some studies used other classification systems (e.g., Intubation Difficulty Scale Score17) or repeated attempts.19,26,40 The Cormack–Lehane scale was not originally designed for grading the degree of difficulty in laryngoscopy or tracheal intubation.5 In addition, laryngoscopy with or without application of external cricoid pressure or of backward, upward, and rightward pressure (BURP maneuver) on the thyroid cartilage to facilitate a laryngoscopic view might have affected the Cormack–Lehane grade in individual studies. Controversy lingers as to the definitions of difficult intubation and difficult laryngoscopy.5,53
Given that the screening tests included proved to have inadequate diagnostic power, is any attempt at prediction likely to be useful? Should any predictive attempt be advocated? These questions cannot be generally answered; however, as Wilson stated, “No test is likely to be perfect, therefore, it remains essential that every anesthetist must be trained and equipped to deal with the now much less common, unexpected failure to intubate.”54 We concur, and we believe that attempts at prediction are much less important than knowing what to do when difficulty is encountered.
In conclusion, currently available screening tests for difficult intubation have only poor to moderate discriminative power when used alone. Combinations of individual tests or risk factors add some incremental diagnostic value in comparison to the value of each test alone. However, the clinical value of these bedside screening tests for predicting difficult intubation remains limited.
The authors thank Joseph Lau, M.D. (Professor of Medicine and Clinical Research and Director of the Center for Clinical Evidence Synthesis, Tufts-New England Medical Center, Boston, Massachusetts), for providing statistical advice and reviewing the manuscript and Javier Zamora, M.D. (Associate Professor, Unidad de Bioestadística Clínica, Hospital Ramón y Cajal, Madrid, Spain), for kindly providing statistical software. We also thank Toshiro Shitara, M.D., Ph.D. (Chief Anesthesiologist, Sakakibara Memorial Hospital, Tokyo, Japan), for his superb technical assistance.