The purpose of this meta-analysis is to compare clinical screening tests for obstructive sleep apnea and establish an evidence base for their preoperative use. Diagnostic odds ratios were used as summary measures of accuracy, and false-negative rates were used as measures of the rate of missed diagnosis with each screening test. Metaregression revealed that clinical models, logarithmic equations, combined techniques, cephalometry, and morphometry are significant test characteristics, whereas body mass index, history of hypertension, and nocturnal choking are significant test elements associated with higher diagnostic accuracy. Test accuracy in repeated validation studies of the same screening test is variable, suggesting an underlying heterogeneity in either the clinical presentation of obstructive sleep apnea or the measured clinical elements of these models. Based on the false-negative rates, it is likely that most of the clinical screening tests will miss a significant proportion of patients with obstructive sleep apnea.
OBSTRUCTIVE sleep apnea (OSA) affects 2–4% of the population1 in the United States and is now considered a significant risk factor for perioperative morbidity and mortality.2,3 The risks of OSA in the general population are well known and include hypertension,4 coronary artery disease,5,6 stroke,7 pulmonary hypertension,8 sudden cardiac death,9 and deep vein thrombosis,10 to name a few that directly impact perioperative outcome. Overnight polysomnography is the standard for diagnosis of OSA, but its value in the management of patients scheduled to undergo surgery is reduced by significant issues with resource availability.11 Full polysomnography involves an overnight stay in a designated sleep laboratory with multichannel monitoring to measure electroencephalography, chin and leg electromyography, electro-oculography, chest and abdominal respiratory effort, nasal airflow via a thermistor and/or nasal cannula, oxygen saturation, and heart rate, in addition to several sleep architecture measures. Traditionally, the apnea-hypopnea index (AHI) or the respiratory disturbance index has been used as a measure of the presence of OSA and its severity. Accepted diagnostic thresholds for OSA have varied between AHI values of 5 or more per hour1,12 and 10 or more per hour.13
Current guidelines by the American Society of Anesthesiologists (ASA) recommend preoperative polysomnography when indicated.12 Although it may indeed be the most cost-effective strategy for diagnosing OSA,14 the urgency of the planned operative procedure is an important factor limiting a policy of liberal preoperative polysomnography.15 It is estimated that 93% of females and 82% of males with OSA may be undiagnosed.16 Further, with existing resources it would take several years to meet the current demand for polysomnography in the general population.11 All of these points make a compelling argument in favor of cost-effective prediction models to help anesthesiologists assess the risk of OSA preoperatively. It is with this background that we set out to systematically review alternatives to polysomnography in the published literature. Indeed, there have been numerous past efforts, primarily by experts in sleep medicine, to devise alternative clinical methods of predicting OSA as an aid to screening patients at high risk. These methods are broadly classified as questionnaires and clinical prediction models (algorithms, artificial neural networks, cephalometry, morphometry, and other combined techniques and regression models). There is no consensus from the ASA or the American Academy of Sleep Medicine about the best screening tests, with the exception of portable devices for the diagnosis of OSA.15 Most of the current screening methods have been validated in sleep laboratory populations, and it is important to recognize that basic differences exist between the populations studied in sleep laboratories and those evaluated preoperatively. On the one hand, patients are referred to a sleep laboratory because of a perceived high risk of OSA, and a questionnaire or clinical screening test administered there essentially functions as a second, highly specific step to rule in the diagnosis of OSA. Anesthesiologists, on the other hand, need a highly sensitive clinical test that can robustly rule out OSA in a lower-risk population, without recourse to confirmatory polysomnography. In addition, screening tests validated in high-risk populations often report higher sensitivity than is seen when the same test is used in a lower-risk population. Identifying the most accurate screening test, with reproducibly low false-negative rates, is of critical importance in this context. A previous meta-analysis of screening tests for OSA17 was published in 2000 and analyzed several screening methods, including partial-time polysomnography, partial-channel polysomnography, oximetry, portable devices, prediction equations, flow-volume loops, global impression, questionnaires, and other clinical, chemical, and radiologic screening tests. Methodologically, it suffers from a major inadequacy in that several largely heterogeneous studies were pooled without analysis of the relative merits and demerits of the individual tests. This lack of discriminatory analysis makes it difficult for an anesthesiologist to make an evidence-based choice of preoperative screening test. Further, several new validation studies of prediction models for OSA have been published in the years since that meta-analysis. The purpose of the current systematic review is, therefore, to update the literature and identify the best approach to clinical prediction of OSA by comparing clinical screening tests for ease of use and accuracy in predicting both the diagnosis and the severity of OSA.
We investigated this question using quantitative methods to retrieve and analyze the relevant published literature.
Materials and Methods
The reviewers (S.K.R. and L.A.J.) searched the electronic databases PubMed and Ovid for articles published in English from 1966 to May 2008 using the phrases sleep apnea, obstructive sleep apnea, prediction, diagnosis, screening, and combinations of these phrases. We then manually searched the associated articles and the bibliography of any relevant published article we retrieved for additional pertinent references. We also checked the Cochrane Controlled Trials register and hand searched the journals Sleep, American Journal of Respiratory and Critical Care Medicine, Thorax, Chest, International Journal of Obesity, Obesity, Obesity Reviews, Obesity Surgery, and Annals of Internal Medicine for additional articles.
Inclusion Criteria and Assessment of Study Quality
Studies that measured the diagnostic value of questionnaires, clinical scales, or prediction equations (algorithms or regression equations) compared with standard overnight polysomnography were included. We excluded from review studies that did not provide the prevalence (pretest probability) of OSA with raw data in 2 × 2 tables, sensitivity and specificity, or positive and negative likelihood ratios. We also excluded studies in which the reference standard was not overnight monitored polysomnography in a hospital or laboratory facility; we therefore excluded studies that used portable monitoring as the standard. Two reviewers (S.K.R. and L.A.J.) independently screened the titles and abstracts of all articles identified by the search strategy and individually determined inclusion of studies for the analysis, guided by previously established methodologic standards for diagnostic test research in OSA.18 Full copies of all selected articles were retrieved. Disagreements regarding inclusion and exclusion were resolved by discussion, which involved manually rechecking the data extraction from each disputed article. Where there was a lack of agreement or clarity even after this discussion, further advice was sought from our institution's specialist on OSA (Ronald D. Chervin, M.D.). The quality of studies accepted for this review was analyzed under the Quality Assessment of Diagnostic Accuracy Studies19 framework for completeness and accuracy of reporting.
Definitions and Statistical Analysis
A questionnaire was defined as a set of questions with no additional physical measurement involved. A clinical model combined elements of history and physical examination, with or without additional measurements and investigations (radiologic, oximetric, or laboratory). We arbitrarily chose three criteria to define the ease of use of a given test and reflect its applicability as a preoperative screening tool: the number of variables, the use of a linear or logarithmic scale, and the need for additional clinical methods. An ease-of-use scale was thus developed, ranging from 0 (easy) to 3 (complex methodology). One point was added for the presence of each of the following: four or more test elements or variables; a logarithmic scale; and the need for additional techniques, measurements, or investigations.
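For illustration only, this scale can be expressed as a short Python function; this is not software used in the review, and the argument names are hypothetical.

def ease_of_use_score(n_variables, uses_log_scale, needs_additional_tests):
    # Return the ease-of-use score, from 0 (easy) to 3 (complex methodology).
    score = 0
    if n_variables >= 4:           # four or more test elements or variables
        score += 1
    if uses_log_scale:             # logarithmic rather than linear scale
        score += 1
    if needs_additional_tests:     # e.g., cephalometry, morphometry, or oximetry
        score += 1
    return score

# Example: a four-item linear questionnaire with no extra measurements scores 1.
print(ease_of_use_score(4, False, False))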
The frequencies of true-positive, true-negative, false-positive, and false-negative (FN) results were abstracted from all selected studies. True positives were defined as the frequency of patients with OSA and a positive screening test. False positives were the frequency of patients with a positive screening test but no OSA. FNs were the frequency of patients with OSA and a negative screening test. True negatives were the frequency of patients with a negative screening test and no OSA. Where raw 2 × 2 tables were not presented, these data were derived from the relevant results. From each 2 × 2 table, we computed sensitivity, specificity, likelihood ratios, and the diagnostic odds ratio (DOR), which combines data on sensitivity and specificity to give an indication of a test's ability to rule a condition in or out. The DOR was chosen as the primary summary measure of test accuracy for comparison, for reasons described in further detail in the Discussion. A DOR of greater than 81 was chosen to identify an excellent test, because this indicated that the specificity and sensitivity were both greater than 0.9.20 A DOR of 10–80 was termed a good test,21 a DOR of 5–10 was arbitrarily termed average, 2–5 was considered poor, and less than 2 was considered of no value in prediction. These summary measures of diagnostic accuracy were reported as point estimates with 95% confidence intervals. The FN rate was derived as (1 − sensitivity) and was used as a measure of the rate of missed diagnosis for any given screening test. An ideal test was one with a DOR greater than 81 and an FN rate of 0%.
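For illustration, the derivation of these summary measures from a single 2 × 2 table can be sketched in Python; this is a minimal sketch, not the software used for the pooled analyses. It assumes no zero cells, and the 95% confidence interval for the DOR is computed on the log-odds scale.

import math

def diagnostic_summary(tp, fp, fn, tn):
    # Sensitivity, specificity, likelihood ratios, DOR with 95% CI, and FN rate
    # from a 2 x 2 table (assumes all four cells are nonzero).
    sens = tp / (tp + fn)
    spec = tn / (tn + fp)
    lr_pos = sens / (1 - spec)
    lr_neg = (1 - sens) / spec
    dor = (tp * tn) / (fp * fn)                 # equal to lr_pos / lr_neg
    se_log_dor = math.sqrt(1 / tp + 1 / fp + 1 / fn + 1 / tn)
    dor_ci = (math.exp(math.log(dor) - 1.96 * se_log_dor),
              math.exp(math.log(dor) + 1.96 * se_log_dor))
    fn_rate = 1 - sens                          # proportion of missed OSA diagnoses
    return {"sensitivity": sens, "specificity": spec, "LR+": lr_pos,
            "LR-": lr_neg, "DOR": dor, "DOR 95% CI": dor_ci, "FN rate": fn_rate}

# Hypothetical 2 x 2 table: 80 true positives, 30 false positives,
# 20 false negatives, and 70 true negatives give a DOR of about 9.3 ("average").
print(diagnostic_summary(80, 30, 20, 70))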
In addition, the following unique descriptors of each test were collected: questionnaire or clinical model, year of publication, Quality Assessment of Diagnostic Accuracy Studies score, number of patients in the study, age, sex balance, prevalence of OSA, mean body mass index, number of variables, linear scale or log equation, and presence of additional diagnostic modalities (cephalometry, morphometry, or oximetry). Elements of the screening tests were also collected as binary yes/no data, including body mass index, snoring, age, sex, hypertension, witnessed apnea, neck circumference, choking or gasping in sleep, tiredness, and daytime somnolence. We computed statistics for individual studies and combined them using Meta-DiSc (version 1.2; Ramon y Cajal Hospital, Madrid, Spain).22 We planned to use the Mantel-Haenszel fixed-effects model if the studies were homogeneous for the diagnostic performance indices and the DerSimonian-Laird random-effects model if they showed heterogeneity. A further within-test subanalysis was performed for screening tests with more than one validation study, to assess whether the reported accuracy of the test in one study was reproducible across all studies. The homogeneity of likelihood ratios and DORs was assessed using the Cochran Q test based on inverse variance weights, which has a chi-square distribution with k − 1 degrees of freedom.22 A P value less than 0.05 was considered to indicate statistically significant heterogeneity between studies. In addition, the I2 index was used to quantify any heterogeneity: a value of 0% indicates no heterogeneity, and larger values indicate increasing heterogeneity. I2 values of 25, 50, and 75% were chosen to denote low, moderate, and high heterogeneity, as described by Higgins et al.23
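As a brief, hypothetical illustration of how the I2 index follows from the Cochran Q statistic (the numbers below are illustrative, not values from this review):

def i_squared(q, k):
    # Higgins I^2 (%) from Cochran's Q computed over k studies (k - 1 degrees of freedom).
    # Values near 25, 50, and 75% indicate low, moderate, and high heterogeneity.
    df = k - 1
    return max(0.0, (q - df) / q) * 100.0

# Example: Q = 60 across 10 studies gives I^2 = 85%, i.e., high heterogeneity.
print(i_squared(60, 10))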
Finally, a random-effects metaregression of the previously listed screening test elements was undertaken by adding these elements as covariates to the regression model. The resulting parameter estimates were back-transformed (antilogarithmic transformation) to relative diagnostic odds ratios (rDORs).24 An rDOR of 1 indicates that the particular screening test element does not affect the overall DOR of the test. An rDOR of greater than 1 means that a particular element bestows a higher DOR on a test compared with tests without that variable. For descriptive purposes, an rDOR > 2 was arbitrarily chosen to identify significant variables. The diagnostic threshold effect was studied using Littenberg and Moses' fitted model,25 D = a + bS, where D is the natural logarithm of the DOR and S is the natural logarithm of the product of the odds of a true-positive test result and the odds of a false-positive test result. The statistical significance of the regression coefficient b (P < 0.05) was tested to assess whether diagnostic accuracy varied significantly with changes in threshold.
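The threshold analysis can be illustrated with a minimal Python sketch of the Littenberg-Moses fit. The sensitivity/specificity pairs below are hypothetical rather than data from the included studies, and a full analysis would also weight the studies and test b against its standard error.

import numpy as np

def d_and_s(sens, spec):
    # D = ln(DOR); S = ln(odds of a true-positive result x odds of a false-positive result).
    logit_tpr = np.log(sens / (1 - sens))
    logit_fpr = np.log((1 - spec) / spec)
    return logit_tpr - logit_fpr, logit_tpr + logit_fpr

studies = [(0.92, 0.55), (0.85, 0.70), (0.78, 0.80), (0.70, 0.88)]  # (sensitivity, specificity)
D, S = zip(*(d_and_s(se, sp) for se, sp in studies))
b, a = np.polyfit(S, D, 1)      # ordinary least-squares fit of D = a + b*S
print(a, b)                     # a statistically significant slope b would indicate a threshold effect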
Results
Our initial search strategy (fig. 1) retrieved 6,816 potentially relevant diagnostic studies, which were screened first by title and then by abstract. These included the unduplicated results of multiple search engines and the hand-searched journals listed previously. After review of the relevant articles and their bibliographies, 115 studies were considered potentially appropriate for the review. On more detailed review of these publications, a further 89 articles were excluded from the final analysis, either because the reference standard used was not overnight polysomnography or because clinical methods were not used in the screening tests. A total of 26 articles were therefore accepted for final analysis. Of these, 8 pertained to questionnaires, and the remaining 18 described clinical prediction tests, including linear scales, algorithms, regression models, morphometry, cephalometry, combined prediction models, and neural networks. The 26 studies included a total of 6,794 patients with suspected OSA (median sample size, 123; range, 33–1,409). The study prevalence of OSA ranged from 0.09 to 0.847, and the proportion of male subjects ranged from 24% to 100%.
The study quality of the included articles was variable, with Quality Assessment of Diagnostic Accuracy Studies scores ranging from 6 to 13. In addition, there was evidence of verification bias as described by Irwig et al.,18 because several studies were derived from nonrandomly chosen populations and did not describe the excluded cases in sufficient detail. Tables 1–4 describe the sensitivity, specificity, and likelihood ratios of the various screening tests for prediction of OSA. Random-effects models were used because of the significant heterogeneity between studies, as measured by both the Cochran Q test (P < 0.05) and the I2 index (74.4–90.6%).
Table 2. Test Characteristics of Questionnaires Predicting the Presence of Severe Obstructive Sleep Apnea

Table 3. Test Characteristics of Clinical Models Predicting the Diagnosis of Obstructive Sleep Apnea

Table 4. Test Characteristics of Clinical Models Predicting the Presence of Severe Obstructive Sleep Apnea

An additional subgroup analysis was undertaken for the screening tests with more than one validation study. Three tests were identified for this purpose: the Berlin questionnaire, the Maislin multivariable apnea index (algorithm), and the Kushida index (morphometry). There was a high degree of heterogeneity within each test across the various studies (I2 > 75%). Only the Kushida index reproducibly performed as an excellent predictor (DOR > 81) in all validation studies. A summary table of the range of FN rates was generated for each screening test (table 5), with a summary recommendation on the utility of each test. No single questionnaire or clinical model satisfied the criteria for the ideal preoperative screening test.
Figures 2–5 describe the DOR characteristics of the individual questionnaires and clinical screening tests for prediction of the diagnosis and severity of OSA. Although cumulative analyses were performed for all test characteristics, the large heterogeneity precluded more specific comment on the pooled results. Broadly, clinical models had marginally better pooled DORs than questionnaires for prediction of both OSA diagnosis (pooled DOR 10.49 vs. 5.02) and severity (pooled DOR 17.24 vs. 10.12).
Fig. 2. Plot of diagnostic odds ratios (ORs) and 95% confidence intervals (CIs) of questionnaires predicting the diagnosis of obstructive sleep apnea. SDQ = Sleep Disorders Questionnaire; STOP = acronym from Chung et al.31
Fig. 3. Plot of diagnostic odds ratios (ORs) and 95% confidence intervals (CIs) of questionnaires predicting the presence of severe obstructive sleep apnea. SDQ = Sleep Disorders Questionnaire; STOP = acronym from Chung et al.31
Fig. 4. Plot of diagnostic odds ratios (ORs) and 95% confidence intervals (CIs) of clinical models predicting the diagnosis of obstructive sleep apnea. ASA = American Society of Anesthesiologists; BMI = body mass index; MAP = multivariable apnea prediction; SDB = sleep-disordered breathing; STOP-BANG = acronym from Chung et al.31
Fig. 5. Plot of diagnostic odds ratios (ORs) and 95% confidence intervals (CIs) of clinical models predicting the presence of severe obstructive sleep apnea. ASA = American Society of Anesthesiologists; BASHIM = acronym from Dixon et al.30; BMI = body mass index; MAP = multivariable apnea prediction; SDB = sleep-disordered breathing; STOP-BANG = acronym from Chung et al.31
Finally, a metaregression of several study covariates and elements was undertaken to identify the sources of this heterogeneity (table 6). Study characteristics with rDOR > 2 were log equations (nonlinear scales), clinical models, clinical-cephalometry, combined techniques, and morphometry, in ascending order of magnitude. Clinical elements associated with rDOR > 2 were body mass index, hypertension, and a history of choking or gasping. Covariates associated with rDOR < 2 were prevalence of OSA, AHI threshold for diagnosis, publication year, number of variables, oximetry, age, witnessed apnea, neck circumference, tiredness, and daytime somnolence. History of snoring and the sex balance of the study population had rDORs of 1.93 and 1.81, respectively. On diagnostic threshold analysis (table 7), the regression coefficient b was not statistically significant (P = 0.189), suggesting that the diagnostic threshold does not influence diagnostic accuracy; that is, the DOR is independent of the chosen diagnostic AHI threshold.
Discussion
We report the results of a meta-analysis of clinical screening tests for the prediction of the diagnosis and severity of OSA. Severe OSA can be predicted by questionnaires and clinical tests with a high degree of accuracy. The Berlin questionnaire, the Sleep Disorders Questionnaire, morphometry (Kushida index), and the combined clinical-cephalometry model (Battagel) were the most accurate questionnaires and clinical models. However, there is a high degree of heterogeneity and a high FN rate with all questionnaires and most clinical prediction models, making it possible that a significant proportion of patients with OSA will be missed by all of the questionnaires and most of the clinical models. Metaregression revealed that clinical models, log equations, combined techniques, cephalometry, and morphometry are significant test characteristics, whereas body mass index, history of hypertension, and nocturnal choking are significant test elements in the more accurate prediction models.
Importance of Test Accuracy: Implications for Anesthesiologists
There are several described summary measures for meta-analyses of diagnostic tests, namely sensitivity, specificity, predictive values, likelihood ratios, the DOR, receiver operating characteristic curve analysis, and the area under the curve. Each of these summary measures has unique advantages and disadvantages. An ideal diagnostic test in a healthy population should have relatively high sensitivity with sufficient specificity and should also be minimally intrusive, relatively inexpensive, and able to identify patients early in the disease process.17 Although high sensitivity helps to rule out OSA preoperatively, it is incomplete and potentially inaccurate as a summary statistic on its own, especially in the presence of spectrum bias and study heterogeneity. Combining sensitivity with specificity improves this, arguably at the cost of more complex comparisons.

Although false positives could significantly increase costs, directly and indirectly, because of prolonged postanesthesia care unit stays mandated by the ASA practice guidelines12 or by more conservative local discharge policies, the bigger priority during the perioperative period is preventing mortality and morbidity related to OSA. Using FN rates as a summary measure of the robustness of each screening test, we were able to describe the proportion of missed diagnoses among patients with OSA. Because sensitivity is considered prevalence independent, it follows that FN rates are also independent of the prevalence of OSA. The FN rate is typically expressed as a conditional probability or a percentage. The Berlin questionnaire, which is now commonly used in several hospitals, has FN rates of 14.5–38.2%, clearly making it undependable for robustly ruling out OSA preoperatively. Similar FN rates were observed with the ASA model (12.3–37.9%), the STOP questionnaire (20.5–34.4%), and the STOP-BANG model (0.0–16.4%) for the diagnosis of OSA. FN rates tended to be marginally lower when predicting the presence of severe OSA across all studies.

Of the remaining summary measures, positive and negative predictive values are highly dependent on disease prevalence and therefore have limited value in comparing tests. Summary receiver operating characteristic curve analysis is recommended for studies that exhibit a threshold effect, but it is difficult to interpret and apply to practice. Likelihood ratios offer a user-friendly way of expressing convincing diagnostic evidence (positive likelihood ratio > 10 with negative likelihood ratio < 0.1) or strong diagnostic evidence (positive likelihood ratio > 5 with negative likelihood ratio < 0.2); again, the need to pair these summary measures reduces their utility in comparative analyses. The DOR is defined as the ratio of the odds of a positive test in patients with OSA relative to the odds of a positive test in the nondiseased. It can also be calculated as the ratio of the positive and negative likelihood ratios and represents the best single point estimate of the receiver operating characteristic curve. Importantly, the DOR used as a single indicator of test performance is not prevalence dependent.26 It provides an assessment of how well a decision tool or clinician distinguishes healthy from unhealthy patients; the larger the DOR, the better the diagnostic accuracy. For these reasons, we chose the DOR as the primary measure of test accuracy. Although the DOR itself is independent of disease prevalence, its clinical interpretation varies with disease prevalence.
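Expressed formally,

\[
\mathrm{DOR} = \frac{\mathrm{LR}^{+}}{\mathrm{LR}^{-}} = \frac{\text{sensitivity}/(1-\text{specificity})}{(1-\text{sensitivity})/\text{specificity}} = \frac{\text{sensitivity}}{1-\text{sensitivity}} \times \frac{\text{specificity}}{1-\text{specificity}},
\]

so that, with illustrative values not drawn from any study in this review (sensitivity 0.85 and specificity 0.77), the positive likelihood ratio is approximately 3.7, the negative likelihood ratio approximately 0.19, and the DOR approximately 19, placing such a test in the "good" range of the scale used here.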
A test with a DOR of 10.00 is considered a very good test by current standards in populations at high risk. A DOR of 10.00 in a low-risk population, on the other hand, may represent a very weak association between the experimental test and the standard test.21 Therefore, the expectation that a highly accurate prediction tool validated for screening in sleep clinics will perform equally well in the preoperative period may be fallacious. An additional criticism of the DOR is that it ignores the relative weights of sensitivity and specificity. We accounted for this shortcoming by choosing a DOR threshold of 81, thereby requiring robustness across both sensitivity and specificity.
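The anchor point for this threshold follows directly from the definition: with a sensitivity and specificity of 0.9 each,

\[
\mathrm{DOR} = \frac{0.9}{1-0.9} \times \frac{0.9}{1-0.9} = 9 \times 9 = 81,
\]

so the chosen cutoff of 81 corresponds to the point at which both sensitivity and specificity reach 0.9.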
The lack of reproducibility in the diagnostic performance of all the studied screening tests is an important finding of this study. Based on these findings, case-control studies or clinical protocols that depend on simple prediction models such as the Berlin questionnaire, the ASA model, or the STOP questionnaire for identification of high-risk groups are bound to suffer from significant FN error. Further research into predictive modeling should compare the most accurate tests from this analysis against one another in a truly representative surgical population. Perhaps more importantly, anesthesiologists need to consider the importance of identifying a critical outcome measure for defining significant OSA. There is insufficient prospective evidence in the literature to support the view that mild to moderate OSA is associated with significant adverse postoperative outcomes. Indeed, regular continuous positive airway pressure therapy for mild OSA (with excessive daytime somnolence) and moderate OSA has not been shown to have the same impact on systemic disease as it has for severe OSA.27 Further research into defining the subset of OSA patients at higher risk of perioperative mortality or morbidity is a crucial and much-needed step to avoid both unnecessary costs from excessive resource utilization on the one hand and missed diagnoses, with their potential for harm, on the other.
Analysis of Screening Test Characteristics and Elements
The characteristics of the screening tests that bestowed higher accuracy included clinical models, log equations, combined techniques, cephalometry, and morphometry. Although questionnaires were inferior to clinical models, they are widely used as screening tools and therefore warrant further discussion. The Berlin questionnaire28 was the most accurate questionnaire for predicting the diagnosis of OSA. The least accurate was the Epworth Sleepiness Scale,29 possibly because excessive daytime sleepiness occurs commonly in obese individuals without OSA, driven by mechanisms other than nighttime sleep deprivation.30 Regardless of the abnormal sleep physiology in obese patients, most questionnaire models share the generic problem of adding to the user's burden of data collection. The simplest questionnaire, the STOP questionnaire,31 was a poor predictor of OSA (DOR 2.9) compared with other published questionnaires. Similarly, the recently validated ASA screening tool was of no value in predicting the diagnosis of OSA (DOR 1.6) and of poor value in predicting its severity (DOR 4.08). In contrast, several other clinical models tended to perform substantially better than questionnaires in predicting the diagnosis of OSA. Clinical models typically use several elements that identify the OSA phenotype more robustly than questionnaires alone. A similar trend toward improved accuracy of clinical models over questionnaires was seen in predicting severe OSA. The two most accurate clinical models used additional cephalometry32 and morphometry,33 suggesting that upper airway measurements could improve the accuracy of currently used clinical methods. However, the complexity of these tests could hinder their incorporation into standard preoperative evaluation. The STOP-BANG clinical scale31 was identified as an excellent method for prediction of severe OSA, with a DOR of 141.5 in one study. The ease of use of this clinical test (a linear scale with no need for additional investigations) makes it a user-friendly option for screening for severe OSA in the immediate preoperative period, although it is only an average predictor of the diagnosis of OSA (DOR 6.59). As with the other models, further validation of this screening test is essential before a final comment on its preoperative utility can be made.
Study Limitations
Before we consider the implications of these results for clinical practice, it is important to consider some of the limitations and strengths of our methods and those of the clinical studies in our review. Although we cannot be absolutely certain that we retrieved all published material in this area, we are confident that our methodology allowed a thorough review of the publications within all accessible large databases, and the risk of missing studies was minimized by two independent searches by the two authors. Aside from these points, there were several shortcomings in the reviewed publications, namely threshold variability, study heterogeneity, verification bias, and spectrum bias. Threshold variability refers to the influence of the variable AHI threshold used explicitly or implicitly by the reference test in the validation process. Indeed, based on pooled DOR values, clinical models and questionnaires seemed to be more accurate when predicting severe OSA (AHI threshold 25 or 30). At first glance, there appeared to be sufficient variability in the AHI threshold chosen for the diagnosis of OSA to explain at least some of the heterogeneity seen in this meta-analysis. However, on metaregression, the AHI diagnostic threshold was not a significant contributor to study accuracy in comparison with other variables. The use of a threshold AHI of 5 per hour assumes that all patients with AHI < 5 per hour are morphologically and physiologically distinct from those with AHI > 5 per hour; it also places mild, moderate, and severe OSA patients in one group, thereby assuming that they share similar traits. Similarly, a threshold of AHI > 30 per hour for defining severe OSA assumes that normal subjects and patients with mild or moderate OSA form one homogeneous group with traits uniquely different from those of patients with severe OSA. Both assumptions are probably fallacious, because OSA encompasses a spectrum of physical and physiologic traits that overlap with those of normal patients.

There is also some evidence that an AHI < 5 per hour does not necessarily mean that the patient does not have OSA, because repeated sleep studies on consecutive days have shown discordance in diagnosis.34 The first-night effect describes the variance in AHI between the first and second nights of polysomnography, independent of the duration of sleep time. There are several plausible explanations for the first-night effect, including the AHI itself, anxiety, the presence of psychiatric disorders, psychoactive medications, alcohol intake, and the age of the patient. The percentage of patients misdiagnosed on the basis of a single-night study has been reported to be as high as 43%,35 but these effects are seen exclusively at the mild end of the OSA spectrum. Subsequent studies have shown a significantly lower misdiagnosis rate, and the American Academy of Sleep Medicine and the ASA currently recognize single-night testing as the standard for the diagnosis of sleep apnea. All of these factors may explain the significant intratest heterogeneity seen with the three tests specifically analyzed for reproducibility, namely the Berlin questionnaire, the multivariable apnea prediction, and the Kushida index. Any intratest heterogeneity that results in FNs is a clinically important problem for anesthesiologists, because patients with OSA may be exposed to harm. Of the studied models, only the Kushida index was deemed an excellent test in repeated studies.
Two main biases observed in this meta-analysis deserve further mention: verification bias and spectrum effect (spectrum bias). To avoid verification bias, it is important that diagnostic accuracy be assessed in consecutive patients who present with the clinical problem of interest; not all of the studies analyzed in this meta-analysis met this standard. There are two types of verification bias: "partial verification," which occurs when the reference standard is not applied to all participating patients, and "differential verification," which occurs when the results of a decision tool influence the choice of reference standard applied. Sensitivity tends to be overestimated when partial verification bias is present, and both sensitivity and specificity tend to be overstated when differential verification bias is present.36 One critical consideration common to most of the studied OSA prediction questionnaires and models is the high pretest probability of OSA, i.e., all of the study patients typically attended a sleep clinic for suspected OSA or other sleep-related breathing disorders. This is a very different clinical scenario from the typical surgical population, in which the clinical distinction between patients with and without OSA is possibly more apparent. In essence, these could be considered two distinct study populations. Prediction models that are derived and validated in high-risk populations are subject to spectrum effect or spectrum bias and report higher sensitivity than is seen when the test is used in a lower-risk population. The advantage of deriving screening tests in a representative healthy population is that this is exactly how the tests will be used in practice. The disadvantage is that the absolute frequency of abnormalities is much lower, meaning that the confidence intervals for the results are wider unless the study enrolls far more patients.36,37
In summary, this review provides a comprehensive and up-to-date synthesis of the literature on the accuracy of clinical screening methods for the diagnosis of OSA. Severe OSA can be predicted with a high degree of accuracy by clinical methods that could be used preoperatively, but no single prediction tool functions as an ideal preoperative test. The Berlin questionnaire and the Sleep Disorders Questionnaire were the two most accurate questionnaires, whereas morphometry and combined clinical-cephalometry were the most accurate clinical models. However, test accuracy, as defined by the DOR, has poor reproducibility across multiple validation studies of the same screening tool. Based on FN rates and heterogeneity, it is possible that all of the studied questionnaires and most of the clinical models will fail to identify a significant proportion of patients with OSA. Because of significant differences between the validation study patients and surgical patients, further validation of the most accurate screening tests identified in this meta-analysis is essential in a typical surgical population to establish the best preoperative method of screening for OSA.
The authors thank Ronald D. Chervin, M.D. (Professor, Department of Neurology, University of Michigan, Ann Arbor, Michigan), for his help with questions during the study and overall support of the authors' endeavor.