External validation of published risk stratification models is essential to determine their generalizability. This study evaluates the performance of the Risk Stratification Indices (RSIs) and 30-day mortality Risk Quantification Index (RQI).
108,423 adult hospital admissions with anesthetics were identified (2006–2011). RSIs for mortality and length-of-stay endpoints were calculated using published methodology. 91,128 adult, noncardiac inpatient surgeries were identified with administrative data required for RQI calculation.
RSI in-hospital mortality and RQI 30-day mortality Brier scores were 0.308 and 0.017, respectively. RSI discrimination, by area under the receiver operating characteristic curve, was excellent at 0.966 (95% CI, 0.963–0.970) for in-hospital mortality, 0.903 (0.896–0.909) for 30-day mortality, 0.866 (0.861–0.870) for 1-yr mortality, and 0.884 (0.882–0.886) for length-of-stay. RSI calibration, however, was poor overall (17% predicted in-hospital mortality vs. 1.5% observed after inclusion of the regression constant), as demonstrated by calibration plots. Removal of self-fulfilling diagnosis and procedure codes (20,001 of 108,423; 20%) yielded similar results. RQIs were calculated for only 62,640 of 91,128 patients (68.7%) due to unmatched procedure codes. Patients with unmatched codes were younger and had higher American Society of Anesthesiologists physical status and 30-day mortality. The area under the receiver operating characteristic curve for the 30-day mortality RQI was 0.888 (0.879–0.897). The model also demonstrated good calibration. A restricted index (Procedural Severity Score + American Society of Anesthesiologists physical status) performed as well as the original RQI model (age + American Society of Anesthesiologists physical status + Procedural Severity Score).
Although the RSIs demonstrated excellent discrimination, poor calibration limits their generalizability. The 30-day mortality RQI performed well with age providing a limited contribution.
Risk Stratification Indices and Risk Quantification Indices were developed to predict clinical endpoints using administrative patient data
External validation of risk stratification models determines their generalizability
Validation should characterize overall performance and assess discrimination (the probability that measured risk is higher for a case than for a noncase) and calibration (how well predicted probabilities align with observed outcomes)
Patient data from the Massachusetts General Hospital were used to show that the Risk Stratification Indices had excellent discrimination and poor calibration but the 30-day mortality Risk Quantification Indices performed well
Administrative patient data are increasingly used to evaluate patterns and outcomes of disease. Several indices have been developed to predict mortality and other endpoints, including the Charlson Comorbidity Index,1 variations on the Charlson Comorbidity Index,2,3 the Elixhauser method,4 and the Procedural Index for Mortality Risk.5 The Risk Stratification Indices6 (RSIs) were developed using International Classification of Diseases, Ninth Revision, Clinical Modification (ICD-9-CM) diagnosis and procedure codes for hospital inpatients who were more than 65 yr of age, which were obtained from the Medicare Provider Analysis and Review (MEDPAR) database for the period of 2001–2006. The RSIs have received much attention for their potential application as a tool for risk-adjusting important healthcare outcomes.7 However, they have not yet been externally validated.
In the RSI models, a level of risk was derived for diagnosis and procedure codes associated with a hospital stay, using logistic regression modeling for in-hospital mortality and Cox proportional hazards modeling for postdischarge death and length-of-stay (LOS). Summation of the covariate coefficients associated with a patient’s procedure and diagnostic codes generates an RSI. The RSIs demonstrated excellent discrimination (c statistic) as predictive indices for mortality and LOS endpoints when applied to a large MEDPAR validation set and the authors’ institutional data. Additional statistical analysis to determine overall model performance and calibration was recommended.8 Calibration of a model refers to the degree to which predicted and actual outcomes agree.9 A follow-up analysis assessed calibration by comparing actual mortality rates with RSI value groups.10 A comparison against predicted mortality was not made.
The Risk Quantification Indices11 (RQIs) are another risk index system developed to predict 30-day mortality and morbidity using a small number of administrative data points. The indices were developed using adult noncardiac surgical patient data from the National Surgical Quality Improvement Project database12 for the period of 2005–2008. Using Current Procedural Terminology (CPT) codes, the authors derived a procedural severity score (PSS) that measured procedural risk. These scores were combined with patient age and American Society of Anesthesiologists (ASA) physical status to create a predictive index of 30-day mortality and major morbidity. The goal of this study was to externally evaluate the performance of the RSIs and 30-day mortality RQI using patient data at our institution.
Materials and Methods
This study was approved by the Partners Institutional Review Board, Boston, Massachusetts (2011P000253). For validation of the RSIs, the authors identified adult (18 yr of age or older) inpatient admissions that included anesthetics for the period 2006–2011 using the Massachusetts General Hospital (MGH) anesthesia information management system. For the 30-day mortality RQI validation, we identified all adult noncardiac inpatient surgical cases by excluding cardiac cases and nonsurgical procedures that required anesthetics, specifically cardiology, electroconvulsive therapy, and labor epidurals.
After identifying the patient population using MGH anesthesia information management system data, we obtained the ASA physical status and primary surgical CPT code from our anesthesia billing system (2007–2011). Diagnostic and procedure ICD-9-CM codes and LOS data were obtained from the MGH billing system. Mortality endpoints were derived from the Partners HealthCare System Research Patient Data Repository (RPDR). The RPDR is a secure centralized administrative data warehouse that contains patient encounter data from multiple hospital information systems in the Partners Health System.13,14 The RPDR links patients to the National Death Index, a central computerized index of death record information maintained by the National Center for Health Statistics division of the Centers for Disease Control.15
RSI Validation Methodology
For validation of the RSIs, a level of risk was assigned for each patient in our sample population by summation of the covariate coefficients associated with each diagnosis and procedure ICD-9-CM code. The methodology for RSI calculation is available at the authors’ Web site.§ Briefly, patient data are organized into an “input file,” with each row representing a single patient stay that contains all associated ICD-9-CM diagnosis and procedure codes. Using SPSS (version 17.0; IBM, Armonk, NY), the published SPSS macro was executed with the MGH input file to assign β coefficients to each diagnosis and procedure code. β Coefficients were summed to calculate an RSI value.
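The summation step described above can be sketched in a few lines; this is an illustrative reconstruction, not the published SPSS macro, and the coefficient values below are invented for the example.

```python
# Minimal sketch of the RSI summation step: each ICD-9-CM diagnosis or
# procedure code on a patient stay maps to a published beta coefficient,
# and the RSI is their sum (plus, for in-hospital mortality, a regression
# constant). All coefficient values here are invented placeholders.
beta = {"427.5": 3.1, "96.71": 1.8, "401.9": 0.2}  # code -> coefficient (made up)

def rsi(codes, coefficients, constant=0.0):
    """Sum coefficients for all matched codes; unmatched codes contribute 0."""
    return constant + sum(coefficients.get(code, 0.0) for code in codes)
```

Under these invented coefficients, a stay coded with 427.5 and 401.9 would receive an RSI of 3.3.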
Overall model performance was determined using the Brier score. To assess discrimination, the aggregate performance of each RSI on outcomes of interest was quantified by calculating the area under the receiver operating characteristic curve (AUROC). Consistent with the original RSI study, LOS was assessed by determining whether the index risk was above or below the median LOS for the primary ICD-9-CM procedure code. Calibration was assessed by comparing actual with predicted endpoints16 for in-hospital mortality, 30-day mortality, and 1-yr mortality in the RPDR data set. Predicted outcomes for mortality endpoints and LOS were calculated by the inverse logit function, 1/(1 + e^−RSI). For time-dependent endpoints (30-day mortality, 1-yr mortality, and LOS), calibration curves were generated using cases where endpoints were known, eliminating the concern for censored data. The authors defined “self-fulfilling” ICD-9-CM codes as “conditions that required immediate medical intervention” and “procedures that are typically performed during emergency resuscitation,” and then applied this definition to the RSI in-hospital mortality model (appendix 1). ICD-9-CM codes that are self-fulfilling with respect to outcome (e.g., cardiac arrest) were removed from the MGH data set and RSI model performance reanalyzed.
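The inverse logit transformation used above converts a summed RSI value into a predicted probability; a minimal sketch of the formula:

```python
import math

def inverse_logit(rsi_value):
    """Predicted probability from a summed RSI value: 1 / (1 + e^(-RSI))."""
    return 1.0 / (1.0 + math.exp(-rsi_value))

# An RSI of 0 corresponds to a predicted probability of 0.5; large negative
# values approach 0 and large positive values approach 1.
```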
RQI Validation Methodology
To validate the RQI for 30-day mortality, CPT codes corresponding to patients’ primary procedure were assigned weights (i.e., PSS) and combined with ASA physical status and age to calculate the RQI. Technical issues precluded the use of the published R module, available at the RQI Web site.‖ Published PSS covariates together with model parameters provided by the original authors (appendix 2) were used to compute the 30-day mortality RQI using SPSS (appendix 3). Results were compared against actual patient endpoints in the RPDR data set. Overall model performance was assessed with the Brier score. Discrimination was quantified by AUROC calculation. Calibration was assessed by comparing actual with predicted endpoints16 for 30-day mortality in the RPDR data set. RQI model performance (age, ASA physical status, and PSS) was compared with limited versions (age + ASA, age + PSS, ASA + PSS) to assess the degree to which each variable contributed to the RQI performance.
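The structure of the full and restricted RQI models can be sketched as a logistic linear predictor. The coefficients below are invented placeholders, not the model parameters provided by the original authors (appendix 2); the sketch shows only how the three inputs combine and how a restricted model drops a term.

```python
import math

# Placeholder coefficients for illustration only; the actual parameters
# were provided by the original RQI authors (appendix 2).
B0, B_AGE, B_ASA, B_PSS = -9.0, 0.02, 1.1, 0.8

def rqi_30day_mortality(age, asa, pss):
    """Predicted 30-day mortality from age, ASA physical status, and PSS."""
    z = B0 + B_AGE * age + B_ASA * asa + B_PSS * pss
    return 1.0 / (1.0 + math.exp(-z))

# A restricted model (e.g., ASA + PSS) simply omits one term from z; with
# coefficients of this relative magnitude, age contributes little to z.
```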
In addition to the validation methodology above, statistical comparisons were made to characterize the patient population that did not generate RQI values due to unmatched CPT codes. The chi-square test was used for categorical variables (mortality, ASA physical status >2) and a two-tailed independent t test was used to compare continuous variables (age). All statistical comparisons were performed using SPSS. Calibration plots and Brier scores were generated using R (version 2.15.1, rms package; R Core Team, R Foundation for Statistical Computing, Vienna, Austria).# The specific code used to generate RSI and RQI calibration plots can be found in appendix 4.
A total of 108,423 adult anesthetic records, each corresponding to an individual inpatient admission, were identified for validating the RSIs. Overall, there were 3,811 unique principal diagnosis codes and 1,873 unique principal procedure codes. Characteristics of the data set are illustrated in tables 1 and 2. Overall, the RSIs demonstrated excellent discrimination, but poor calibration. The results for each endpoint’s Brier score and AUROC are summarized in table 3 and figure 1. The AUROC for in-hospital mortality, 30-day mortality, 1-yr mortality, and LOS at our institution were 0.966 (95% CI, 0.963–0.970), 0.903 (0.896–0.909), 0.866 (0.861–0.870), and 0.884 (0.882–0.886), respectively, compared with the original findings of 0.977 (0.977–0.980), 0.854 (0.834–0.875), 0.832 (0.825–0.839), and 0.886 (0.883–0.888), respectively.6
Calibration “in-the-large” for RSI in-hospital mortality illustrated a discrepancy between actual (1.5%) and predicted (51.7%) in-hospital mortality. The authors identified a regression constant (−2.198) in the published RSI “all-covariates.xls” file§ that was not used in the published SPSS implementation macro, “RSI Calculation for Web Use Rev 2.sps,”§ and used this to calculate an adjusted RSI in-hospital mortality. Incorporation of this constant improved the calibration (predicted in-hospital mortality of 17.5%) and Brier score (table 3), although calibration remained poor. Calibration plots also demonstrated poor calibration for 30-day mortality, 1-yr mortality, and LOS (fig. 2). Assessment of a sample results data file published on the RSI Web site, “sample data rev2.sav,”§ was similar (49% predicted in-hospital mortality, 18% predicted in-hospital mortality after including the regression constant).
Records with any ICD-9-CM codes that were identified as self-fulfilling with respect to outcome (appendix 1) were removed from the MGH data set (20,001 of 108,423, 20%) and RSI model performance reanalyzed; discrimination slightly improved whereas calibration remained poor (fig. 3).
The authors identified 91,128 anesthetic records for noncardiac surgical cases with the data required for calculation of the 30-day mortality RQI. There were 1,694 unique primary surgical CPT codes. Characteristics of the data set are shown in table 4. Of these, RQI calculations could not be performed for 28,488 cases (31.3%) due to unmatched CPT codes. Of the 197 unique unmatched CPT codes, the 10 most common are illustrated in table 5. Compared with the matched CPT data set, patients with unmatched CPT codes were younger (50.8 ± 21.6 vs. 56.2 ± 19.1 yr old; P < 0.001), were more likely to have an ASA physical status greater than 2 (54.3 vs. 38.0%; P < 0.001) and had greater 30-day mortality (4.7 vs. 1.9%; P < 0.001).
For the 62,640 cases with matched CPT codes, Brier score and AUROC are shown in table 3 and figure 4. AUROC for 30-day mortality at our institution was 0.888 (0.879–0.897). This performance was similar to the originally reported AUROC of 0.915 (0.906–0.924).11 The 30-day mortality RQI demonstrated good calibration (fig. 4).
The discrimination of the age + ASA, age + PSS, and ASA + PSS models is shown in table 6. Calibration of these component models was not assessed. Of the three model elements, age provided the smallest contribution to the resulting discrimination.
Our results indicate important limitations to the generalizability of RSIs and 30-day mortality RQI. To the authors’ knowledge, this is the first comprehensive external evaluation of these indices to be published.
The RSIs use ICD-9-CM codes for diseases and procedures associated with hospitalization. Using up to 10 diagnostic and 10 procedure codes to assign a level of risk to a patient stay, the RSIs capture the underlying clinical condition in a manner that is intended for retrospective risk-adjusted quality-of-care comparisons. By contrast, the RQIs use information available before a hospitalization to predict expected outcomes. Compared with other indices for outcome adjustment, the RQI uses fewer data points: a primary surgical CPT code, ASA physical status, and age.
Both RSI and RQI indices use an aggregation scheme to account for low-frequency codes and annual revisions in code definitions. In the RSI models, hierarchical selection processes were used on the MEDPAR data set to select a set of codes based on average annual incidence to ensure consistency of codes across years. Less frequent ICD-9-CM codes were truncated and reassessed to ensure an average annual occurrence of more than 5,000 for four- and five-character ICD-9-CM codes and more than 1,000 for three-character ICD-9-CM codes. ICD-10, which consists of over 100,000 alphanumeric diagnostic and procedure codes, is scheduled for rollout in October 2014.17 A similar derivation scheme could be applied to generate RSI models compatible with ICD-10. The RQI aggregation scheme was developed on a reserved subset of the National Surgical Quality Improvement Project database.12 Frequently used CPT codes were represented as separate cohorts. Less common procedures were aggregated according to one of 244 categories as described by the Clinical Classifications Software for Services and Procedures (U.S. Department of Health and Human Services Agency for Healthcare Research and Quality, Rockville, MD).18 If the number of cases within a Clinical Classifications Software group was low, these groups were further aggregated into an “all-purpose other” group. The resulting scheme defines PSS scores for a subset of CPT codes.
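The PSS lookup implied by this aggregation scheme can be sketched as a fallback chain: score frequent CPT codes directly, pool less common codes by Clinical Classifications Software category, and leave the remainder unscored (so no RQI can be computed, as with the unmatched codes in the Results). All mappings and values below are invented placeholders, not the published scheme.

```python
pss_by_cpt = {"44140": 1.2}       # hypothetical: frequent CPT codes scored directly
ccs_of_cpt = {"44141": "CCS-78"}  # hypothetical CPT -> CCS category lookup
pss_by_ccs = {"CCS-78": 0.9}      # hypothetical pooled CCS-level scores

def procedural_severity(cpt):
    """Return a PSS for a CPT code, or None if the code is unmatched."""
    if cpt in pss_by_cpt:                # 1) exact CPT cohort
        return pss_by_cpt[cpt]
    ccs = ccs_of_cpt.get(cpt)
    if ccs in pss_by_ccs:                # 2) CCS category pool
        return pss_by_ccs[ccs]
    return None                          # 3) unmatched: no PSS, hence no RQI
```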
Evaluation of a predictive model should characterize overall performance and assess discrimination and calibration.8,16,19 The Brier score is a statistical measure used to assess overall predictive performance by computing the squared difference between predicted and actual outcomes.16,20 Smaller differences between predicted and actual data points reflect a better overall “goodness-of-fit” of a model. The Brier score ranges from 0 to 1: a model with perfect prediction has a score of 0, whereas a noninformative model that provides predictions no better than chance with an outcome incidence of 50% has a score of 0.25. Unlike the AUROC, interpretation of the Brier score depends on the incidence of the outcome. The Adjusted RSI In-Hospital Mortality model without self-fulfilling codes had the best Brier score among the RSIs, but it was still considerably higher than the RQI 30-day Mortality Brier score.
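The Brier score can be computed in a few lines; this sketch reproduces the two reference points mentioned above (perfect prediction scores 0, and a noninformative 0.5 prediction at 50% incidence scores 0.25).

```python
def brier_score(predicted, observed):
    """Mean squared difference between predicted probabilities and 0/1 outcomes."""
    return sum((p - y) ** 2 for p, y in zip(predicted, observed)) / len(observed)

# Perfect prediction -> 0.0; constant 0.5 prediction at 50% incidence -> 0.25.
```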
Calibration refers to how well a model’s predicted probabilities align with the observed outcomes. Discrimination, by contrast, represents the probability that the measured risk is higher for a case than for a noncase, that is, how well a model can rank-order cases.9 It is not surprising, for instance, that discrimination for RSI In-Hospital Mortality remained strong after removal of self-fulfilling ICD-9-CM codes: a well-ordered list remains in order regardless of which cases are removed. But discrimination does not characterize actual predicted probabilities, which are fundamental to clinical risk-prediction models. Calibration is especially important for prognostic models in which the clinical question is the chance of a future outcome, given current risk factors.16,19 RSI In-Hospital Mortality calibration remained poor after removal of self-fulfilling ICD-9-CM codes. The improvement in Brier score likely reflects the modest improvement in discrimination.
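Discrimination as defined here, the probability that a case receives a higher measured risk than a noncase, is exactly the concordance interpretation of the AUROC; a small sketch with invented scores makes it concrete.

```python
def concordance(scores, outcomes):
    """AUROC as a concordance probability: the chance that a randomly chosen
    case scores higher than a randomly chosen noncase (ties count 0.5)."""
    cases = [s for s, y in zip(scores, outcomes) if y == 1]
    noncases = [s for s, y in zip(scores, outcomes) if y == 0]
    wins = sum(1.0 if c > n else 0.5 if c == n else 0.0
               for c in cases for n in noncases)
    return wins / (len(cases) * len(noncases))
```

Because the statistic depends only on pairwise ordering, removing observations leaves the ordering of the remaining pairs unchanged, which is consistent with discrimination surviving the removal of self-fulfilling codes while calibration did not.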
Although the RSIs demonstrate excellent discrimination, the current models’ poor calibration limits their use as a tool to compare risk-adjusted outcomes among entities. Of note, there is a significant mean age difference between the MEDPAR data set (mean age 74) and MGH data set (mean age 55). This age difference may influence calibration, as greater risk is associated with increasing age. Although the original RSI study includes an internal validation of Cleveland Clinic data with mean age of 56.6 yr,6 there are no calibration data published for comparison. The physiologic changes of aging result in increased incidence of comorbidities such as high blood pressure** and diabetes.†† For example, the impact of diabetes on recovery after hip fracture is moderated by age.21 Thus, the risk associated with a diagnosis or procedure code may be different when comparing patients based on age alone.
It is interesting to note the impact of age in another well-established predictive index, the Revised Cardiac Risk Index. In the Revised Cardiac Risk Index, discrimination was evaluated using the AUROC whereas calibration was assessed by comparing the predicted and actual major cardiac complication rates by risk class.22 The mean age of patients within the Revised Cardiac Risk Index study was 66 yr, with age more than 70 yr correlating with a relative risk of 1.9 (1.1–3.2) for cardiac complications.22 Although increased age was associated with higher morbidity in the original Cardiac Risk Index,23 the six independent risk factors identified by logistic regression analysis for the Revised Cardiac Risk Index did not include age.22,24
For the RQI calculation, we observed a relatively high rate of unmatched CPT codes in our data set of inpatient anesthetics, which is likely a product of the aggregation scheme and the relatively broad set of anesthetic cases included in our data set. Alternative CPT aggregation schemes have been proposed.25 The data set used to derive the RQI model was obtained from the National Surgical Quality Improvement Project and may not be representative of procedures or CPT coding practices at our institution. Our data set was derived from anesthesia billing records. Cardiac and nonprocedural anesthetic records were removed from our data set to conform more closely to the sample population used for derivation of the RQI with the goal of evaluating the generalizability of the RQI as a novel severity scoring methodology using primary CPT codes. The most common unmatched CPT codes listed in table 5 represent procedures for which an RQI score would be useful. A robust capture of CPT codes for 30-day mortality RQI calculation is important because the current analysis indicates these procedures were associated with a significantly higher ASA physical status and 30-day mortality.
There are a number of limitations that must be appreciated when using administrative data. Code definitions change over time. The RQI was derived using data from 2005 to 2008, whereas the current data timeframe was 2006–2011; thus, only 3 yr overlap when comparing data sets. CPT code changes for the years 2005–2011 totaled more than 2,500.26–32 Furthermore, the RQI incorporates the ASA physical status,33,34 which has notable provider variability.35,36 Regional or institutional differences in coding practices may also contribute to coding variability. Sources of coding error occur along the entire patient trajectory.37–39 Error rates in coding have been shown to range from approximately 10 to 20%.40,41
Risk indices that use administrative data may not adequately assess how well hospitals and providers identify and respond to adverse events that developed during their care, described as “failure to rescue.”‡‡ The degree to which in-hospital complications are documented differs among institutions. Rewarding hospitals solely for lower documented in-hospital complication rates, such as using the RSIs for risk-adjusted comparison of outcomes among institutions, may not reflect important quality standards such as “monitoring” and “action taken.”
Additionally, the current study has several limitations. The authors were unable to identify present-on-admission diagnoses and index procedures in order to exclude conditions or procedures that occurred during the hospitalization. As a result, including non–present-on-admission data may lead to overly optimistic predictive capability of the RSI in the current validation. As these data become more widely available, models may be used to assess risk of hospital-acquired conditions based on admission diagnoses or planned procedures.19
The authors’ institution does not routinely collect the endpoints required for validating 30-day morbidity outside of participation in National Surgical Quality Improvement Project. As a result, the authors were unable to assess the performance of the RQI for 30-day morbidity. In addition, the model for RQI 30-day mortality was derived using inpatient and outpatient surgical data. The current data set included inpatient surgeries only; thus, the authors were unable to validate RQI 30-day mortality model performance for outpatient surgery.
Although the RSI models for risk-adjusted healthcare outcomes demonstrated excellent discrimination, the poor calibration of the current models raises concerns about their generalizability. Assessment of calibration of the MEDPAR data set used to generate the original RSIs would be informative, with the potential to rederive the risk associated with covariates of interest to improve its performance on external data sets.
The RQI for 30-day mortality performed well on the current data set for matched CPT data. However, the current data reveal a large number of unmatched CPT codes for cases associated with significantly higher morbidity and mortality. A robust capture of CPT codes for 30-day mortality RQI calculation may identify patients at increased risk. Inclusion of age in the RQI provided limited additional predictive information in the analysis of the current data set.
The authors thank Frank E. Harrell, Jr., Ph.D., Chair and Professor of Biostatistics, Vanderbilt University, Nashville, Tennessee, for his invaluable assistance with statistical analysis.
Cleveland Clinic: Outcomes Research, Risk Stratification Index. Available at: http://my.clevelandclinic.org/anesthesia/outcomes/risk-stratification-index.aspx. Accessed November 23, 2012.
Cleveland Clinic: Outcomes Research, Risk Quantification Index. Available at: http://my.clevelandclinic.org/anesthesia/outcomes/risk-quantification-index.aspx. Accessed November 23, 2012.
Vanderbilt University, Department of Biostatistics: Statistical Computing. Available at: http://biostat.mc.vanderbilt.edu/wiki/Main/StatComp. Accessed November 23, 2012.
American Heart Association: Statistical Fact Sheet (2012). Available at: http://www.heart.org/HEARTORG/General/Populations_UCM_319119_Article.jsp. Accessed June 10, 2013.
American Diabetes Association: Diabetes Statistics (2011). Available at: http://www.diabetes.org/diabetes-basics/diabetes-statistics/. Accessed February 2, 2013.
Agency for Healthcare Research and Quality: Patient Safety Network Glossary, Failure to Rescue. Available at: http://psnet.ahrq.gov/popup_glossary.aspx?name=failuretorescue. Accessed February 2, 2013.