Hospitals are increasingly required to publicly report outcomes, yet performance is best interpreted in the context of population and procedural risk. We sought to develop a risk-adjustment method using administrative claims data to assess both national-level and hospital-specific performance.
A total of 35,179,507 patient stay records from 2001-2006 Medicare Provider Analysis and Review (MEDPAR) files were randomly divided into development and validation sets. Risk stratification indices (RSIs) for length of stay and mortality endpoints were derived from aggregate risk associated with individual diagnostic and procedure codes. Performance of the RSIs was tested prospectively on the validation database, as well as a single institution registry of 103,324 adult surgical patients, and compared with the Charlson comorbidity index, which was designed to predict 1-yr mortality. The primary outcome was the C statistic indicating the discriminatory power of alternative risk-adjustment methods for prediction of outcome measures.
A single risk-stratification model predicted 30-day and 1-yr postdischarge mortality; separate risk-stratification models predicted length of stay and in-hospital mortality. The RSIs performed well on the national dataset (C statistics for median length of stay and 30-day mortality were 0.86 and 0.84). They performed significantly better than the Charlson comorbidity index on the Cleveland Clinic registry for all outcomes. The C statistics for the RSIs and Charlson comorbidity index were 0.89 versus 0.60 for median length of stay, 0.98 versus 0.65 for in-hospital mortality, 0.85 versus 0.76 for 30-day mortality, and 0.83 versus 0.77 for 1-yr mortality. Addition of demographic information only slightly improved performance of the RSI.
RSI is a broadly applicable and robust system for assessing hospital length of stay and mortality for groups of surgical patients based solely on administrative data.
What We Already Know about This Topic
❖ Hospitals are increasingly required to publicly report outcomes, yet performance is best interpreted in the context of population and procedural risk.
❖ Good predictive systems that are based on readily accessible data are not currently available.
What This Article Tells Us That Is New
❖ The authors developed broadly applicable and robust risk-stratification systems for assessing hospital length of stay and mortality for surgical patients based solely on administrative data.
A CENTRAL tenet of national health care quality improvement, as described by the Hospital Quality Alliance,** is public reporting of hospital-level outcome statistics. A critical assumption behind public outcome reporting is that patients (or their insurers) are rational consumers who will choose hospitals reporting superior results and that this, in turn, will serve as an economic incentive for quality improvement in health care. It is thus likely that reported results will have considerable financial impact throughout the healthcare system.
The difficulty with unadjusted outcomes is that baseline patient risk varies considerably. Even for a given procedure performed by the same surgeon in the same hospital, mortality may vary considerably because of preexisting patient demographics, comorbidities, and disease stage.1,2 Furthermore, some medical and surgical procedures are either substantially less effective or more dangerous than others, resulting in unforeseen complications that have an adverse impact on recovery and survival. Outcomes can thus only be reasonably interpreted in light of baseline population risk, the cumulative impact of procedures, and complications associated with hospital care.
Risk stratification schemes have been developed for selected procedures. In general, they use critical laboratory test results and other disease-specific clinical findings that correlate with outcome. An alternative approach is to base stratification on administrative claims databases containing diagnostic and procedure descriptors, with or without readily available patient demographic characteristics.1,3 One shortcoming of existing risk-stratification methods is inadequate validation across diverse populations, facilities, geographical regions, and various medical procedures. For example, the widely used Charlson comorbidity index (CCI) was developed in 1987 from fewer than 600 patients.4 Development of a broadly validated risk-stratification method would permit relevant outcomes, such as duration of hospitalization and mortality, to be fairly compared across healthcare institutions. Availability of an open-source, reproducible method would also foster a more consistent and transparent outcome comparison process. Our goal was to develop risk-adjustment models from a national administrative database from the Centers for Medicare and Medicaid Services (CMS), and to validate performance of the resulting models in a large single-center electronic registry of surgical patients.
Materials and Methods
A dataset was constructed from the 2001–2006 Medicare Provider Analysis and Review (MEDPAR) database (CMS dataset; n = 79,741,480). The MEDPAR file is a national stay-based dataset derived from claims made for payment to CMS under the Medicare program. Each record in the file represents a single patient stay. Data fields include demographic data (age, gender), up to 10 diagnosis codes and 6 procedure codes (coded according to the International Classification of Diseases, Ninth Revision, Clinical Modification [ICD-9-CM]), length of stay (LOS), and days from admission to death.
We excluded patients younger than 65 yr, those having no procedure or procedures with an annual average occurrence of less than 5,000, or a patient stay with less than 1 yr of follow-up. The final dataset was randomly divided into development (n = 17,589,824) and validation (n = 17,589,683) datasets (fig. 1). CCI was computed for each patient stay.4,5
Our approach was to derive a measure of the risk posed by each patient's comorbidities, jointly with the risk associated with each procedure. Diagnosis and procedure codes (ICD-9-CM) were used to generate the optimum covariate set for modeling each endpoint (LOS, in-patient mortality, and 30-day and 1-yr postdischarge mortality). The ICD-9-CM codes are hierarchical; therefore, it was possible to truncate the codes to a higher level to ensure consistency of the covariates across time to account for new codes and changes in code use (fig. 2). In successive iterations, covariates were selected in a step-wise manner based on the statistical significance of the covariates in a multivariable model (Stepwise Hierarchical Selection). Cox proportional hazards modeling was used to model time to postdischarge death and time to discharge. Because the timing of the diagnostic and procedure codes during the hospitalization was unknown, logistic regression was used to model in-hospital mortality.
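As a minimal sketch of the truncation idea (not the authors' code), the following Python function walks an ICD-9-CM code up to progressively more general levels. The dotted string format and the function names `truncate_icd9` and `ancestry` are assumptions made for illustration; the levels actually used in the study are those defined in fig. 2.

```python
from typing import List, Optional


def truncate_icd9(code: str) -> Optional[str]:
    """Return the next more general level of a dotted ICD-9-CM code.

    Returns None when the code is already at its most general level
    (no digits after the decimal point).
    """
    if "." not in code:
        return None
    head, tail = code.split(".", 1)
    if len(tail) <= 1:
        return head                   # e.g. "428.2" -> "428"
    return f"{head}.{tail[:-1]}"      # e.g. "428.21" -> "428.2"


def ancestry(code: str) -> List[str]:
    """List a code followed by each progressively more general level."""
    levels = [code]
    parent = truncate_icd9(code)
    while parent is not None:
        levels.append(parent)
        parent = truncate_icd9(parent)
    return levels


if __name__ == "__main__":
    print(ancestry("428.21"))  # diagnosis code, most specific level first
    print(ancestry("81.54"))   # procedure code
```

Truncating to a shared parent is what lets a covariate defined at a general level keep matching when a more specific child code is introduced or retired in a later coding year.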
During early iterations of the Stepwise Hierarchical Selection algorithm (fig. 3), the development dataset was randomly sampled to permit a reasonable execution time. The initial models were developed from a 1% sample of the development dataset (n = 175,898). The second and third iterations, with smaller covariate sets, were based on a 10% sample. The entire development dataset was used for the final iteration. The criteria for including covariates in the first two iterations (P < 0.2 and P < 0.05) of the algorithm were selected to allow the largest number of likely variables to be identified. The criterion for the third iteration of the algorithm (P < 10⁻⁶) was selected after examining the output and identifying a threshold below which the highly significant variables were clustered. A Hierarchical Dataset Coding algorithm was used to translate the diagnosis and procedure codes from the development dataset into the final covariate set. The Hierarchical Dataset Coding algorithm selectively collapsed diagnosis and procedure codes into binary covariates (0 or 1), as determined by the covariate set.
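The collapsing step can be illustrated with a short Python sketch. It is an assumption-laden reading of the description above, not the published algorithm: each code on a stay is matched to the most specific covariate that covers it, and that covariate is set to 1. The `D`/`P` prefixes (used here to keep diagnosis and procedure codes apart), the function names, and the example covariate set are all hypothetical.

```python
from typing import Dict, Iterable, List, Set


def code_levels(code: str) -> List[str]:
    """Return a dotted code and its more general parents, most specific first."""
    head, _, tail = code.partition(".")
    levels = [f"{head}.{tail[:i]}" for i in range(len(tail), 0, -1)]
    levels.append(head)
    return levels


def encode_stay(codes: Iterable[str], covariate_set: Set[str]) -> Dict[str, int]:
    """Collapse one stay's codes into binary (0 or 1) covariates.

    Assumption: a code switches on only the most specific covariate
    that covers it; codes with no covering covariate are ignored.
    """
    row = dict.fromkeys(sorted(covariate_set), 0)
    for code in codes:
        for level in code_levels(code):
            if level in covariate_set:
                row[level] = 1
                break
    return row


if __name__ == "__main__":
    # Hypothetical covariate set: one collapsed category, one specific
    # diagnosis level, and one procedure code.
    covariates = {"D428", "D250.0", "P60.5"}
    stay = ["D428.21", "D250.00", "P81.54"]
    print(encode_stay(stay, covariates))
```

In this example the heart-failure code collapses to the category-level covariate, the diabetes code matches a more specific covariate, and the procedure code matches nothing and contributes no covariate.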
A Cox or logistic model was used to estimate the hazard associated with each covariate. The initial covariate set included 1,951 variables used for the initial model of each endpoint. The limit of statistical significance applied to the model covariates was P less than 0.2 in the first iteration, P less than 0.05 after the second, and P less than 10⁻⁶ after the third. The fourth iteration was used to recalculate the final hazard ratios. The final model for each endpoint resulted in a different number of variables: in-hospital mortality, 184; 30-day mortality, 240; 1-yr mortality, 503; and LOS, 1,096.
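The iteration schedule described above can be sketched as a simple filtering loop. This is only an outline under stated assumptions: `fit_pvalues` is a hypothetical callable standing in for the Cox or logistic fit on a random sample of the development data, and the hierarchical collapsing of nonsignificant codes shown in fig. 3 is omitted. The fourth iteration, which refits the surviving covariates on the full development dataset, is likewise left out.

```python
from typing import Callable, Dict, List, Sequence, Tuple

# (sample fraction of the development dataset, P-value threshold)
# for the three selection iterations described in the text.
SCHEDULE: Tuple[Tuple[float, float], ...] = (
    (0.01, 0.2),
    (0.10, 0.05),
    (0.10, 1e-6),
)

FitFunction = Callable[[Sequence[str], float], Dict[str, float]]


def stepwise_select(
    covariates: Sequence[str],
    fit_pvalues: FitFunction,
    schedule: Sequence[Tuple[float, float]] = SCHEDULE,
) -> List[str]:
    """Keep only covariates whose P value passes each successive threshold.

    fit_pvalues(covariates, sample_fraction) is assumed to fit a
    multivariable model on a random sample of that size and return
    a P value for every covariate it was given.
    """
    kept = list(covariates)
    for fraction, threshold in schedule:
        pvalues = fit_pvalues(kept, fraction)
        kept = [c for c in kept if pvalues[c] < threshold]
    return kept


if __name__ == "__main__":
    # Toy stand-in for model fitting: fixed, made-up P values.
    toy = {"a": 1e-9, "b": 0.01, "c": 0.3, "d": 0.1}
    print(stepwise_select(list(toy), lambda covs, frac: {c: toy[c] for c in covs}))
```

Passing the fitting routine in as a parameter keeps the sketch independent of any particular statistics package.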
A risk stratification index (RSI) for each of the endpoints of interest was then developed, with RSI_1YR, RSI_30days, RSI_INHOSP, and RSI_LOS denoting predictors of 1-yr, 30-day, and in-hospital mortality, and time to discharge within 30 days, respectively. The RSI value for each patient stay was calculated by adding the covariate coefficients associated with the procedure and diagnostic codes linked to that stay. The coefficient (β_j) of each covariate calculated by the Cox modeling process was the natural log of the hazard ratio associated with that covariate (or the natural log of the odds ratio for the logistic model); that is, β_j = ln(hazard ratio_j). The total hazard arising from a particular patient's diagnostic and procedure codes can be calculated as the exponential of the sum of the covariate coefficients associated with those codes. Total hazard has a non-Gaussian distribution; it is preferable, therefore, to use RSI as a risk-adjustment factor rather than the total hazard itself.
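The arithmetic in this paragraph is small enough to show directly. The sketch below assumes a stay has already been translated into covariate names and that the coefficients are held in a look-up table; the covariate names and coefficient values are invented for illustration and are not taken from the published models.

```python
import math
from typing import Dict, Iterable


def rsi(stay_covariates: Iterable[str], coefficients: Dict[str, float]) -> float:
    """Sum the coefficients (log hazard ratios) of the covariates on a stay.

    Each covariate is counted once; covariates absent from the model
    contribute nothing.
    """
    return sum(coefficients.get(c, 0.0) for c in set(stay_covariates))


def total_hazard(stay_covariates: Iterable[str], coefficients: Dict[str, float]) -> float:
    """Total hazard is the exponential of the RSI."""
    return math.exp(rsi(stay_covariates, coefficients))


if __name__ == "__main__":
    # Hypothetical coefficients: one harmful and one protective covariate.
    coefficients = {"D428": 0.5, "D272.0": -0.2}
    stay = ["D428", "D272.0", "D999"]  # "D999" is not in the model
    print(rsi(stay, coefficients), total_hazard(stay, coefficients))
```

Because the RSI is a sum of logarithms, a covariate with a hazard ratio below 1 has a negative coefficient and lowers the index.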
Prospective validation was initially conducted using the CMS validation dataset. The predictive power of the RSIs was evaluated on both the development and validation datasets using the C statistic (area under the receiver operating characteristic curve) for mortality and median LOS (coded as a binary covariate with values corresponding to either greater than or less than the median) and Harrell's C index (a measure of relative predictive performance) for time to discharge within 30 days.6 For each dataset, validation was conducted using all hospital stays, and separately for stays including principal procedures likely to require full anesthetic management (for details, see Supplemental Digital Content 1, a table showing likely surgical procedures, https://links.lww.com/ALN/A642). The effect of sample size on the predictive accuracy of RSI was assessed by repeatedly randomly sampling the CMS validation dataset to obtain sets from 100 to 50,000 patient stays. Confidence intervals for the C index were obtained by bootstrapping techniques. Statistical significance was defined as P less than 0.05 for comparisons of C statistics and C indices.
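For readers unfamiliar with the C statistic, the following sketch computes it for a binary outcome with the rank-sum (Mann-Whitney) formula and attaches a percentile bootstrap confidence interval. The percentile method, the function names, and the default of 1,000 resamples are assumptions; the text states only that bootstrapping techniques were used. Harrell's C index for time to discharge is not covered here.

```python
import random
from typing import List, Sequence, Tuple


def c_statistic(scores: Sequence[float], outcomes: Sequence[int]) -> float:
    """Area under the ROC curve; tied scores share an average rank."""
    n = len(scores)
    order = sorted(range(n), key=lambda i: scores[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1  # 1-based rank shared by the tied block
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    n_pos = sum(1 for y in outcomes if y)
    n_neg = n - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("both outcome classes are required")
    rank_sum = sum(r for r, y in zip(ranks, outcomes) if y)
    return (rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def bootstrap_ci(
    scores: Sequence[float],
    outcomes: Sequence[int],
    n_boot: int = 1000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile bootstrap interval for the C statistic."""
    rng = random.Random(seed)
    n = len(scores)
    stats: List[float] = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [outcomes[i] for i in idx]
        if 0 < sum(1 for y in ys if y) < n:  # skip resamples with one class only
            stats.append(c_statistic([scores[i] for i in idx], ys))
    stats.sort()
    lower = stats[int(alpha / 2 * n_boot)]
    upper = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


if __name__ == "__main__":
    risk = [0.1, 0.4, 0.35, 0.8]
    died = [0, 0, 1, 1]
    print(c_statistic(risk, died), bootstrap_ci(risk, died, n_boot=200))
```

A value of 0.5 corresponds to chance discrimination and 1.0 to perfect separation of stays with and without the outcome.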
A second prospective validation evaluated the performance of the RSIs on surgical patients from one tertiary medical center. With approval of the Cleveland Clinic Institutional Review Board (Cleveland, Ohio), this validation used the Cleveland Clinic Perioperative Health Documentation System (PHDS), an electronic medical record–based registry of noncardiac surgical patients from January 2005 to December 2009 (n = 103,324). We constructed a dataset from this registry that was structured in the same stay-based format as the MEDPAR dataset, with ICD-9-CM procedure and diagnosis codes. Patients younger than 18 yr and those with missing data were excluded (n = 2,122). The four RSIs were computed using the Hierarchical Dataset Coding algorithm, and performance was evaluated using the C statistic and C index. The performance of the RSIs was compared with that of the CCI. We also evaluated whether inclusion of demographic characteristics improved prediction accuracy.
Statistical programming was implemented in SPSS (SPSS Inc., Chicago, IL) and Python (Python Software Foundation, Hampton, NH). The CMS development and validation datasets were evenly split by even or odd record numbers provided by Medicare; random selection for bootstrapping was performed using the SPSS uniform distribution function.
Results
Characteristics of the CMS development and validation datasets did not differ significantly. There were significant differences between the CMS and PHDS validation datasets (table 1). Surgical patients in the PHDS dataset were younger and had fewer comorbidities and lower CCI. PHDS patients had shorter hospital stays and lower mortality rates.
Performance statistics on the CMS development and validation databases are presented in table 2. Performance on the validation dataset was not statistically different from that on the development database, indicating that the degree to which the RSIs predict the endpoints is highly consistent. Performance was significantly better in the population that had surgical procedures that likely required full anesthetic management, possibly because ICD-9 codes are better characterized in the surgical population that usually gets a careful preoperative evaluation. The predictors associated with the highest and lowest hazard ratio for each of the three models are provided in tables 3–5. The complete set of covariates and coefficients is provided in Supplemental Digital Content 2A–C, tables showing predictors and coefficients for each model, https://links.lww.com/ALN/A643.††
Prospective evaluation of the RSIs on the PHDS dataset is presented in table 2. The performance of each RSI was significantly better than CCI, a difference that persisted after the addition of demographic characteristics to both models. Adding demographic characteristics significantly improved RSI performance only for the 1-yr mortality endpoint. RSI_1YR predicted mortality at 30 days as well as the independent RSI_30days model; the RSI_1YR model can thus be equally used for either 30-day or 1-yr postdischarge mortality endpoints.
Receiver operating characteristic curves for the four endpoints indicate significantly better performance for the RSIs across the sensitivity-specificity range compared with CCI and demographics alone.
Figure 4 shows that superior performance was most evident for in-hospital mortality and median LOS and less pronounced for the remaining endpoints. The ability of CCI to predict 30-day LOS was no better than chance. RSI_1YR predictive accuracy appeared stable down to a sample size as small as several thousand hospital stays (fig. 5).
Discussion
Hospital performance measures and public reporting are key methods to drive quality improvement. The validity of comparisons among hospitals depends critically on accurate stratification of population and procedure risk. Furthermore, accurate and universally applicable risk-stratification methods would reduce incentives for hospitals to “cherry pick” healthier patients or perform simpler procedures that might improve their unadjusted outcomes.
Risk-stratification systems developed for specific subpopulations may generalize poorly and are thus unsuitable for characterizing all outcomes within a single hospital, much less for comparing among diverse hospitals. There are also outcome prediction systems, such as the National Surgical Quality Improvement Program stratification,7 that are broad-based but depend on clinical information that is not readily available for all hospitalizations. The National Surgical Quality Improvement Program, for example, depends on highly trained nurse reviewers who collect clinical data from a small fraction of patients at participating centers. These clinical details presumably augment prediction accuracy but are not easily available for other patients, even in National Surgical Quality Improvement Program participating centers, much less for patients in nonparticipating hospitals. Any system with potential for broad applicability must therefore be based exclusively on readily available administrative claims data.
We developed broadly applicable empirical models for stratifying postoperative risk that are based on ICD-9 diagnostic and procedure codes and demographic characteristics, information that is standardized, objective, and available for virtually every admitted patient requiring a procedure. Unlike proprietary systems,8 ours is publicly available and transparent and can thus be applied by any stakeholder to objectively risk-adjust hospital outcomes. Furthermore, this method can be easily updated to reflect evolving coding conventions (i.e., conversion to ICD-10 or introduction of entirely new codes), and can be extended to include other populations and outcomes, such as morbidity and cost of care.
It is noteworthy that demographic characteristics only modestly improved some of our models' predictive accuracy. Including age, weight, sex, and race, for example, improves the C statistic based on ICD-9 codes alone by only ≈0.02 for 1-yr mortality but has no significant impact on the models for in-hospital mortality or LOS. It is thus apparent that risk is better characterized by diagnosis and procedure codes than by demographic characteristics including age, a result that is consistent with previous observations.9
One of the most commonly used stratification systems, the CCI, was designed to predict 1-yr mortality. We found that our long-term mortality RSI model predicts 30-day and 1-yr postdischarge mortality comparably well, and more accurately than the CCI, although, unsurprisingly, the difference for 1-yr mortality was smaller. In contrast to longer-term mortality, distinct models were required for the most accurate prediction of LOS and in-hospital mortality. Our models for these acute outcomes were considerably better than the CCI. For example, the C statistic for in-hospital mortality is 0.977 with our RSI versus 0.654 with the CCI. We thus present three models that accurately predict four important outcomes: LOS and in-hospital, 30-day, and 1-yr mortality.
The RSI models were developed from ≈17 million MEDPAR records and validated on an additional ≈17 million MEDPAR records. It is reassuring that the results were not statistically different, with C statistics typically differing by less than 0.001. However, the more important validation was to apply the RSI model developed from MEDPAR data to the Cleveland Clinic's PHDS database. This was a considerably stricter test because the populations differ in several important ways. For example, the MEDPAR dataset includes all stay-based procedures whether surgical or not, whereas PHDS is surgical. Thus, only approximately 30% of the MEDPAR cases were likely to have been surgical, whereas all the PHDS cases were. Furthermore, the average age of the MEDPAR patients was 18 yr older than in the PHDS population, and only 32% of the PHDS patients were older than 65 yr and thus eligible for Medicare. Finally, the baseline comorbidity, as measured by the CCI and the number of diagnostic codes, was lower in the PHDS patients. Nonetheless, the predictive accuracy of RSI was not statistically different between the MEDPAR and Cleveland Clinic patients, indicating that the RSI system is broadly applicable.
RSI performance appears to remain accurate in samples as small as several thousand hospital stays. This suggests that risk stratification can be used in smaller hospitals or at frequent intervals in larger hospitals.
There are more than 16,000 ICD-9 diagnostic codes and more than 4,500 procedure codes, of which ≈10,000 and ≈3,000, respectively, are in common use. All were considered in development of our risk-stratification models. ICD-9 codes are hierarchical, enabling the “collapsing” of codes to higher (more general) levels. Our method takes advantage of the possibility that marginally predictive codes may increase in predictive power when combined with other related codes because doing so increases the occurrence rate. By first retaining strongly predictive (small P value) individual codes as covariates and then collapsing the remaining codes to create composite covariates with higher occurrence rates, we have derived a highly predictive set of covariates without relying on a priori assumptions to create covariates. The result is a set of models that, unlike various proprietary systems, is reproducible and transparent.
Our models include between 184 and 1,096 codes. Although this might appear overly complicated, CMS billing conventions supply up to 16 ICD-9 codes for each patient record. Individual risk for each outcome can thus be determined from a look-up table and simple calculations; however, our results suggest that at least several thousand patients need to be aggregated to produce reliable predictions.
That various baseline characteristics are associated with poor outcome is consistent with clinical intuition. Among the strongest predictors of mortality, for example, were diagnostic codes associated with preexisting malignancy; intracerebral hemorrhage, organic brain syndrome, and heart failure were also strong predictors of 30-day and 1-yr mortality, all with P values less than 10⁻³⁰⁰.
Less intuitive is that certain baseline characteristics were protective. For example, a diagnosis of hypercholesterolemia reduced the risk of mortality at all time points. In the MEDPAR dataset, 90.8% of patients with a diagnosis of hypercholesterolemia also have a diagnosis of cardiovascular or cerebrovascular disease, which is a strong predictor of poor outcome. Statin therapy, the primary treatment for hypercholesterolemia, is associated with a reduction in coronary and all-cause mortality as well as major vascular events.10 It is likely that patients with cardiovascular disease who carried an ICD-9 code for hypercholesterolemia were treated with statins and thus protected relative to patients with cardiovascular or cerebrovascular disease who did not take statins.
Certain surgical procedures were also found to be protective, especially radical prostatectomy. In the MEDPAR dataset, 99.4% of patients undergoing radical prostatectomy had cancer. Malignant neoplasms are among the highest risk factors in our model. Prostatectomy was thus apparently protective compared with patients with cancer who did not have a radical prostatectomy.
These examples show that individual codes cannot be considered in isolation because each patient's risk is determined by the totality of the codes they carry. In other words, our models are predicated on a relative relationship between covariates associated with an underlying risk and diagnoses or procedures associated with treatment that reduces that risk. Furthermore, this relative relationship is based on a MEDPAR record, which consistently includes up to 6 procedure codes and 10 diagnostic codes for each admission. Covariates, therefore, should not be used in isolation or in databases that are not consistent with the MEDPAR stay-based ICD-9-CM format. The general method we present can easily be extended to other administrative record formats and, although similar predictive performance may be achieved, the relative risk associated with specific procedures and diagnoses is likely to vary based on the coding method used.
Use of administrative claims information, including our RSIs, can suffer from regional variations in coding validity or reimbursement gaming.11,12 But given the penalties for fraudulent coding, it seems unlikely that many hospitals consistently game the system. The contribution of miscoding to our nationally derived models should thus be minimal.
A more serious limitation of our system is that it does not distinguish a priori codes related to baseline health status and planned procedures from actual procedure codes and complications accumulated during hospitalization. The reason is that the MEDPAR and most of the PHDS data are derived from claims reports that do not indicate the diagnostic codes present on admission, which reflect baseline patient characteristics, or the principal planned or required procedures as opposed to diagnosis and procedure codes arising from complications during hospitalization. Our system thus assigns risk stratification based on all reported ICD-9 codes, including those that resulted from care-induced complications.
Fortunately, the Agency for Healthcare Research and Quality has published a set of codes usually associated with complications.13 In a study of two state-wide databases, 92–94% of secondary diagnoses were present on admission, so the contribution of additional in-hospital complication codes might be expected to be limited.14 It is thus possible to perform risk stratification with and without these “complication codes,” which will provide a reasonable distinction between baseline and procedure-related risk versus complications associated with hospital care. The RSI covariates associated with the Agency for Healthcare Research and Quality Clinical Classification Software complications‡‡ are denoted in SDC table 2A–C. To further evaluate the effect of in-hospital complications, we backed the risk associated with the Clinical Classification Software complication codes out of the RSI models; the predictive performance of the residual models on the PHDS validation database was not statistically different from RSI, including complications. This lack of significant impact may in part result from the low complication rates encountered at the Cleveland Clinic and theoretically may be greater at other institutions.
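One plausible reading of "backing out" complication risk is to drop the coefficients of flagged covariates before summing. The sketch below shows that reading only; it is an assumption about the procedure, and the covariate names, coefficient values, and the contents of the complication set are hypothetical.

```python
from typing import Dict, Iterable, Set


def rsi_with_and_without_complications(
    stay_covariates: Iterable[str],
    coefficients: Dict[str, float],
    complication_covariates: Set[str],
) -> Dict[str, float]:
    """Return the RSI computed with all covariates and with flagged ones removed."""
    present = set(stay_covariates)
    full = sum(coefficients.get(c, 0.0) for c in present)
    residual = sum(
        coefficients.get(c, 0.0)
        for c in present
        if c not in complication_covariates
    )
    return {"full": full, "residual": residual}


if __name__ == "__main__":
    # Hypothetical model: one baseline diagnosis and one covariate
    # flagged as a likely in-hospital complication.
    coefficients = {"D428": 0.5, "D998.5": 0.75}
    flagged = {"D998.5"}
    print(rsi_with_and_without_complications(["D428", "D998.5"], coefficients, flagged))
```

Comparing discrimination of the full and residual indices on the same stays is then a direct way to gauge how much of the predicted risk is attributable to flagged codes.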
In summary, hospitals are increasingly required to publicly report outcomes. However, outcomes can only be reasonably interpreted in the context of baseline-related and procedure-related risk. We thus present three validated RSIs that predict four major outcomes for hospitalization with procedures: LOS and in-hospital, 30-day postdischarge, and 1-yr postdischarge mortality. Our system, RSI, uses only readily available administrative claim codes. It can thus be used to risk-adjust hospital outcomes wherever these claim codes are used to describe patient stays.
The authors gratefully acknowledge the contributions of Eric K. Christiansen, M.B.A. (Senior Operations Analyst, Anesthesia Operations Group, Anesthesiology Institute, Cleveland Clinic, Cleveland, Ohio), who led extraction of data from the Perioperative Health Documentation System.