The validity of basing healthcare reimbursement policy on pay-for-performance is grounded in the accuracy of performance measurement.
Monte Carlo simulation was used to examine the accuracy of performance profiling as a function of statistical methodology, case volume, and the extent to which hospital or physician performance deviates from the average.
There is extensive variation in the true-positive rate and false discovery rate as a function of model specification, hospital quality, and hospital case volume. Hierarchical and nonhierarchical modeling are both highly accurate at very high case volumes for very low-quality hospitals. At equivalent case volumes and hospital effect sizes, the true-positive rate is higher for nonhierarchical modeling than for hierarchical modeling, but the false discovery rate is generally much lower for hierarchical modeling than for nonhierarchical modeling. At low hospital case volumes (200) that are typical for many procedures, and for hospitals with twice the rate of death or major complications for patients undergoing isolated coronary artery bypass graft surgery at the average hospital, hierarchical modeling missed 90.6% of low-quality hospitals, whereas nonhierarchical modeling missed 65.3%. However, at low case volumes, 38.9% of hospitals classified as low-quality outliers using nonhierarchical modeling were actually average quality, compared to 5.3% using hierarchical modeling.
Nonhierarchical modeling frequently misclassified average-quality hospitals as low quality. Hierarchical modeling commonly misclassified low-quality hospitals as average. Assuming that the consequences of misclassifying an average-quality hospital as low quality outweigh the consequences of misclassifying a low-quality hospital as average, hierarchical modeling may be the better choice for quality measurement.
The validity of basing healthcare reimbursement policy on pay-for-performance is grounded in the accuracy of performance measurement
In Monte Carlo simulations, nonhierarchical modeling frequently misclassified average-quality hospitals as low quality, whereas hierarchical modeling commonly misclassified low-quality hospitals as average
During the past decade, there has been enormous growth in the use of public and nonpublic reporting and, more recently, in pay-for-performance initiatives to drive quality improvement. In part, these initiatives were motivated by early studies reporting very large decreases in mortality and morbidity after the introduction of nonpublic reporting for cardiac and noncardiac surgery.1–3 But more recent studies have been unable to demonstrate outcome benefits with either nonpublic reporting (American College of Surgeons National Surgical Quality Improvement Program4), public reporting (Medicare Hospital Compare5), or pay-for-performance (Premier Pay-for-Performance6 and the Centers for Medicare and Medicaid Services [CMS] program of nonpayment for preventable infections7). This lack of improvement in healthcare outcomes with performance reporting is concerning, given that current efforts to redesign the healthcare system are anchored in performance measurement. One possible explanation for the lack of evidence linking hospital performance reporting and better outcomes is that hospital report cards do not convey accurate information on hospital quality.
Criticism of performance measurement usually focuses on the limitations of risk adjustment. But even with perfect risk adjustment, there may be significant differences between hospitals’ observed performance and their true performance because of chance alone,8 especially when hospital case volumes are low. Hospitals with small case volumes will exhibit more extreme variation in their observed performance than hospitals with higher case volumes but identical true performance. CMS, along with the American College of Surgeons and the Society of Thoracic Surgeons, uses hierarchical modeling to address the limited reliability of performance measures at small hospital case volumes. By “shrinking” the adjusted performance of lower-volume hospitals toward the performance of the average hospital, hierarchical modeling minimizes the risk of misclassifying average hospitals as quality outliers. In comparison, nonhierarchical modeling does not adjust the magnitude of the risk-adjusted mortality rate to reflect differences in reliability between low- and high-volume hospitals. Conceptually, these two approaches involve a trade-off between sensitivity (how often low-performance hospitals will be correctly identified as low performance) and specificity (how often average-performing hospitals will be correctly classified as average rather than flagged as low performance). Hierarchical modeling could lead to a “false complacency”9 because some, or even many, low-performance hospitals can be misclassified as average, and therefore have less incentive to improve their performance. On the other hand, nonhierarchical modeling may result in some average hospitals being classified as low performance, resulting in unwarranted damage to those hospitals’ professional reputations, as well as loss of income.
We designed this study to compare the accuracy of hospital report cards based on hierarchical and nonhierarchical modeling. Using Monte Carlo simulation, we (1) specified the true performance of all the hospitals and (2) perfectly risk adjusted for differences in patient casemix across hospitals. We based the patient risk distribution and outcomes on data from the New York State Cardiac Surgery Registry. We then examined the ability of hierarchical and nonhierarchical modeling to identify known low- and high-quality hospitals across a range of hospital case volumes and hospital performance. The findings of this study may help government regulators, policymakers, and third-party payers better understand the limitations of performance measures for assessing true quality. Our findings may also provide physicians and hospitals with some of the knowledge they need to help shape the future of payment reform, in partnership with regulatory agencies, as we transition to value-based purchasing.
Materials and Methods
We based this study on population-based data from the New York State Cardiac Surgery Reporting System for patients undergoing isolated coronary artery bypass graft (CABG) surgery in New York between 2009 and 2010. These data were obtained from the New York State Department of Health. The database includes information on patient demographics, preoperative risk factors, encrypted hospital identifiers, in-hospital mortality, and major postoperative complications (stroke; Q-wave myocardial infarction [MI]; deep sternal wound infection; bleeding requiring reoperation; sepsis or endocarditis; gastrointestinal bleeding, perforation, or infarction; renal failure; respiratory failure; unplanned cardiac reoperation; or interventional procedure). Our study was approved by the Institutional Review Board at the University of Rochester (Rochester, New York), and the requirement for informed consent was waived.
Creation of Simulation Data
We defined a composite outcome of in-hospital mortality or major in-hospital complication (either Q-wave MI,* renal failure,† or stroke‡). We estimated a fixed-effects logistic regression model in which the dependent variable was the composite outcome. Patient risk factors (age, obesity, ejection fraction, emergency surgery, congestive heart failure, previous MI, calcified aorta, previous open heart surgery, renal failure, cerebrovascular disease, chronic obstructive pulmonary disease, and hematocrit) and hospital fixed effects were the independent variables. We included hospital fixed effects in order to obtain unbiased coefficient estimates for the patient-level risk factors; omitting hospital fixed effects can lead to biased estimates for patient risk factors when hospital effects are correlated with patient risk. This model was used to estimate the predicted probability of the composite outcome for each patient conditional on patient risk factors only. We used the linear predictor (the log odds of the predicted probability) as the patient risk score in our simulation.
In the baseline Monte Carlo simulation, we created a synthetic data set with 100 hospitals and 200 patients per hospital (20,000 patients). Each patient was assigned a risk score using a random number generator based on the distribution of linear risk scores in the original data source specified as a normal distribution ~N(−4.0172, σ = 0.7394). We randomly assigned 200 patients to each of the 100 hospitals. The first five hospitals (hospitals 1 to 5) were specified as low-quality hospitals and were assigned a hospital-specific risk equal to the log odds of 3 (corresponding to an adjusted odds ratio [AOR] of 3). Hospitals 6 to 10 were specified as high-quality hospitals and were assigned a hospital-specific risk equal to the log odds of 0.33 (AOR of 0.33). Hospitals 11, 12, …, 100 were specified as average-quality hospitals and were assigned a hospital-specific risk equal to the log odds of 1 (AOR of 1.0). The incidence of death or major complication in the baseline analysis was 2.70%. Using a latent variable model framework for binary outcomes,10 we created a latent variable y* such that
y*ij = riskij + Σj δj Pij + εij

where riskij is the linear risk score for patient i in hospital j, δj is the hospital-specific effect (the log odds specified above) for hospital j, Pij = 1 if patient i is treated by hospital j and 0 otherwise, and the error term εij follows the standard logistic distribution. The binary outcome yij for the ith patient is a function of the latent variable:

yij = 1 if y*ij > 0; yij = 0 otherwise
Using this approach, we created synthetic data sets with 20,000 patients, each with a linear risk score, hospital identifier (1, 2, …,100), and a composite outcome based on the latent variable framework.
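The data-generating process described above can be sketched in a few lines. The following is an illustrative reimplementation in Python rather than the authors’ Stata code; the random seed, array layout, and variable names are our own choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n_hosp, n_per = 100, 200  # baseline scenario: 100 hospitals x 200 patients each

# Hospital-specific effects on the log-odds scale:
# hospitals 1-5 low quality (AOR 3), 6-10 high quality (AOR 0.33), the rest average (AOR 1).
delta = np.log(np.r_[np.full(5, 3.0), np.full(5, 1 / 3), np.ones(90)])

# Patient linear risk scores drawn from the distribution reported in the text.
risk = rng.normal(-4.0172, 0.7394, size=(n_hosp, n_per))

# Latent-variable model: y* = risk_ij + delta_j + eps_ij, with eps ~ logistic(0, 1);
# the observed binary outcome is y = 1 when y* > 0.
eps = rng.logistic(0.0, 1.0, size=(n_hosp, n_per))
y = (risk + delta[:, None] + eps > 0).astype(int)

print(f"overall event rate: {y.mean():.4f}")  # should land in the vicinity of the reported 2.70%
```

Because the hospital effects enter additively on the latent (log-odds) scale, perfect risk adjustment holds by construction: any systematic deviation in a hospital’s outcome rate is attributable only to its assigned effect and to chance.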
By construction, patient outcomes were simulated as a function of patient risk and true hospital performance. By specifying our simulated data sets in this fashion, we have perfect risk adjustment and there are no unmeasured differences in patient risk across hospitals. Furthermore, we know the exact identity and performance of high- and low-quality hospitals. Simulated data are an ideal platform for comparing the accuracy of hierarchical and nonhierarchical models in identifying low- and high-quality hospitals.
For each of the simulated data sets, we assessed hospital performance using two different approaches: nonhierarchical modeling to estimate hospital observed/expected mortality (OE) ratios and hierarchical modeling to estimate hospital AOR. The OE ratio is the ratio of the hospital observed rate of the composite outcome (O) to the hospital expected rate of adverse outcome (E), where E is the average of the estimated probabilities of adverse outcomes for the patients in a hospital.
Using the simulated data set, we first estimated a nonhierarchical logistic regression model in which the dependent variable was the simulated composite outcome and the independent variable was the linear risk predictor:

logit(Pij) = β0 + β1 × riskij

where Pij is the probability that patient i treated by hospital j will experience the composite outcome, and riskij is the linear risk score for patient i in hospital j (as defined previously).
We then estimated the predicted probability of the composite outcome for each patient. For each hospital, we calculated the OE ratio and its 95% CI using Byar’s approximation of the exact Poisson distribution.11 Hospitals with an OE ratio significantly greater than 1 (P < 0.05) were classified as low-performance outliers, and hospitals with an OE ratio significantly less than 1 were classified as high-performance outliers.
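For concreteness, here is one way to compute the OE ratio’s confidence bounds with Byar’s approximation to the exact Poisson interval and apply the outlier rule above. This is a sketch of the standard published formula, not the authors’ code, and the function names are ours.

```python
import math

def byar_ci(observed, expected, z=1.96):
    """95% CI for an O/E ratio using Byar's approximation to the exact Poisson CI."""
    lower = 0.0
    if observed > 0:
        lower = observed * (1 - 1 / (9 * observed) - z / (3 * math.sqrt(observed))) ** 3
    o1 = observed + 1  # the upper bound is computed at observed + 1
    upper = o1 * (1 - 1 / (9 * o1) + z / (3 * math.sqrt(o1))) ** 3
    return lower / expected, upper / expected

def classify(observed, expected):
    """Flag a hospital whose 95% CI for the OE ratio excludes 1."""
    lo, hi = byar_ci(observed, expected)
    if lo > 1:
        return "low-performance outlier"
    if hi < 1:
        return "high-performance outlier"
    return "not an outlier"

# 20 observed events against 10 expected: significantly worse than average.
print(classify(20, 10))  # low-performance outlier
```

Note that with only 10 expected events, even a hospital with twice the expected count is flagged, but a hospital with 10 observed events is not; the width of the Poisson interval at small counts is what drives the volume effects reported below.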
We then estimated a hierarchical model in the simulated data and specified hospitals as a random effect:

logit(Pij) = β0 + β1 × riskij + μj

where μj is the hospital-specific random effect for hospital j.
The empirical Bayes estimate of the hospital effect, μj, was exponentiated to yield an AOR for each hospital. Hospitals with an AOR significantly greater than 1 (P < 0.05) were classified as low-performance outliers, and hospitals with an AOR significantly less than 1 (P < 0.05) were classified as high-performance outliers. Because the incidence of the composite outcome was less than 10%, the AOR approximates the risk ratio (OE ratio).12
Assessing the Impact of Using Nonhierarchical versus Hierarchical Modeling on the Accuracy of Hospital Profiling
We then estimated (1) the true-positive rate (TPR; also called the sensitivity) for low- and high-performance hospitals and (2) the false discovery rate (FDR; which is equivalent to 1 − positive predictive value) for low- and high-performance hospitals, where

TPR = (number of true performance outliers correctly classified as outliers) / (total number of true performance outliers)

FDR = (number of hospitals falsely classified as outliers) / (total number of hospitals classified as outliers)
We repeated these calculations for 500 different simulated data sets and calculated the TPR and FDR measures—for high- and low-performance hospitals—averaged over 500 different replications.
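Within each replication, the two accuracy measures reduce to simple counting. A minimal sketch, representing hospital identifiers as Python sets (the names are ours):

```python
def tpr_fdr(true_outliers, flagged):
    """TPR and FDR for one simulated replication.

    true_outliers: hospitals simulated as true performance outliers.
    flagged: hospitals the model classified as outliers.
    """
    true_positives = len(true_outliers & flagged)
    tpr = true_positives / len(true_outliers)
    # FDR = 1 - positive predictive value; defined as 0 when nothing is flagged.
    fdr = (len(flagged) - true_positives) / len(flagged) if flagged else 0.0
    return tpr, fdr

# Hospitals 1-5 are true low-quality outliers; the model flags hospitals 1, 2, and 11.
tpr, fdr = tpr_fdr({1, 2, 3, 4, 5}, {1, 2, 11})
print(tpr, fdr)  # TPR = 0.4; FDR = 1/3, since hospital 11 is a false discovery
```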
To further understand the effect of the statistical approach on hospital profiling accuracy, we performed a series of Monte Carlo simulations in which we incrementally increased hospital case volume from 100 to 1,000. We then repeated our analyses after simulating low-performance hospitals with an AOR of 2 and then 1.5, and high-performance hospitals with an AOR of 0.5 and then 0.67.
For descriptive purposes, we defined a scale to categorize the accuracy of a scoring system (table 1). We have defined our scale such that an FDR greater than or equal to 20% results in a “not acceptable” rating, regardless of the value for the TPR. We designed this categorical accuracy scale to give more weight to achieving a low FDR versus a high TPR. The rationale for this weighting scheme is our assumption that the consequences of misclassifying an average-quality hospital as a low-quality hospital outweigh the consequences of misclassifying a performance outlier as average. Monte Carlo simulations and statistical modeling were performed using STATA/MP (version 14.0; StataCorp LP, USA).
Results
We found extensive variation in the TPR and FDR as a function of model specification (hierarchical vs. nonhierarchical), hospital performance (e.g., hospital AOR 3.0 vs. 1.5), and hospital case volume (fig. 1). In general, higher case volumes were associated with higher TPRs for both hierarchical and nonhierarchical models (fig. 1). For example, the TPR for hospitals (with patients undergoing isolated CABG surgery) with an AOR of 2.0 was 9.4% (95% CI, 8.0 to 10.8) for hospitals with a case volume of 200 versus 90.9% (95% CI, 89.7 to 92.2) for hospitals with a case volume of 1,000 using hierarchical modeling (fig. 2).
At equivalent case volumes and hospital effect sizes, the FDR was generally much lower for hierarchical modeling than for nonhierarchical modeling (fig. 1). For example, the FDR was 5.3% (95% CI, 3.6 to 7.0) for hierarchical modeling versus 38.9% (95% CI, 36.2 to 41.6) for nonhierarchical modeling for hospitals with an AOR of 2.0 and a case volume of 200 (fig. 2). The FDR was very high for nonhierarchical modeling at low effect sizes and low case volumes (fig. 1). For example using nonhierarchical modeling, the FDR for hospitals with an AOR of 1.5 and a case volume of 200 was 67.4% (95% CI, 64.2 to 70.5) and the FDR for hospitals with an AOR of 0.67 and a case volume of 200 was 100% (SEM, 0; figs. 2 and 3). By comparison, using hierarchical modeling, the FDR for hospitals with an AOR of 1.5 and case volume of 200 was 2.1% (95% CI, 0.9 to 3.3) and the FDR for hospitals with an AOR of 0.67 and case volume of 200 was 0% (SEM, 0).
However, at equivalent case volumes and hospital effect sizes, the TPR was higher for nonhierarchical modeling than for hierarchical modeling. For example, the TPR was 34.7% (95% CI, 32.9 to 36.5) for hospitals with an AOR of 2.0 and case volume of 200 using nonhierarchical modeling versus 9.4% (95% CI, 8.0 to 10.8) using hierarchical modeling (fig. 2). This gap in TPR between nonhierarchical and hierarchical modeling decreased substantially at higher hospital volumes. For example, the TPR was 95.8% (95% CI, 95.0 to 96.6) for hospitals with an AOR of 2.0 and a case volume of 1,000 using nonhierarchical modeling and 90.9% (95% CI, 89.7 to 92.2) using hierarchical modeling (fig. 2).
Both hierarchical and nonhierarchical models exhibited high accuracy for extremely low-quality outliers (AOR, 3.0) with hospital case volumes of 200 or greater (fig. 4). Hierarchical models exhibited moderate-to-high accuracy for low-quality outliers with an AOR of 2 and case volumes of 500 or greater and nonhierarchical models with hospital volumes of 700 or greater. Hierarchical and nonhierarchical models exhibited unacceptable accuracy for low-quality outliers with an AOR of 1.5 at case volumes between 100 and 1,000.
Nonhierarchical models exhibited unacceptable accuracy for detecting high-performance outliers (with AOR ranging between 0.33 and 0.67 and case volumes between 100 and 1,000; fig. 4). Hierarchical modeling exhibited moderate-to-high levels of accuracy only for extreme high-quality outliers (AOR, 0.33) at hospital case volumes of 600 or greater (fig. 4). Hierarchical modeling exhibited unacceptable accuracy for high-quality outliers with an AOR of 0.67 at all case volumes between 100 and 1,000 and for high-quality outliers with an AOR of 0.5 at case volumes between 100 and 800.
Discussion
Using Monte Carlo simulation, we found that hierarchical modeling more accurately identifies true high- and low-quality hospitals than conventional nonhierarchical modeling. We also found that, in general, the accuracy of both modeling approaches increased at higher hospital case volumes and as the performance gap between hospital outliers and average hospitals widened. The primary limitation of nonhierarchical modeling was the large percentage of hospitals that were falsely classified as low-performance outliers: as many as 65% of hospitals classified as low-quality outliers were actually not low-quality hospitals in our baseline analysis. However, even though hierarchical modeling rarely misclassified average hospitals as low-quality outliers, it frequently misclassified low- and high-quality hospitals as average quality. In particular, at low hospital volumes (200 cases), hierarchical modeling missed more than 90% of low-quality hospitals with adverse event rates twice as high as average and missed 100% of high-quality hospitals with one third the rate of adverse events of average hospitals. Our findings highlight the lack of accuracy of both risk-adjustment methodologies for identifying high- and low-quality hospitals with low case volumes, even when perfectly adjusting for differences in patient risk.
Since the publication of Iezzoni’s seminal article “The Risks of Risk Adjustment,”13 it has been widely recognized that the choice of risk factors used for risk adjustment has a profound impact on quality measurement: different risk-adjustment models frequently disagree on which hospitals are delivering low- and high-quality care.13–15 The impact of the statistical methodology used to specify the risk-adjustment model—nonhierarchical versus hierarchical modeling—has received less attention and is generally not well understood by most clinicians. It is generally accepted that hierarchical modeling identifies fewer outlier hospitals and that the two methods can lead to very different conclusions regarding hospital quality.16 What is not as well established is which method leads to the more accurate identification of performance outliers. More importantly, the extent to which even the “best” risk adjustment can lead to the wrong conclusions about hospital quality is poorly understood by clinical audiences.
With few exceptions, hierarchical modeling has become the de facto standard for risk-adjustment modeling and has replaced nonhierarchical modeling. There are several theoretical arguments for using hierarchical modeling. With nonhierarchical modeling, hospitals with low case volumes may have risk-adjusted outcomes that are extreme, simply by chance alone. Moreover, low-volume hospitals identified as quality outliers are more likely to be reclassified as average performers in the future—regression to the mean—using nonhierarchical modeling.17 In theory, hierarchical modeling increases the reliability of the evaluation of low-volume hospitals by “shrinking” risk-adjusted rates toward the mean: a hospital’s risk-adjusted performance is estimated as a weighted average of its own adverse outcome rate and the average performance of all hospitals. The extent to which a hospital’s performance is shrunk toward the grand mean increases as the case volume decreases. Shrinkage accounts for the low sensitivity of hierarchical modeling at low case volumes. Similarly, shrinkage accounts for the low FDR because of the “smoothing” effect of hierarchical modeling on extreme outcome rates.
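The shrinkage mechanism can be illustrated with the usual normal-approximation form of an empirical Bayes estimator. This is a didactic sketch, not the estimator inside the hierarchical logistic models above: the weighting formula, the function name, and the value of the between-hospital standard deviation tau are illustrative assumptions.

```python
import math

def shrink(raw_log_odds, se, tau):
    """Shrink a hospital's raw log-odds effect toward the grand mean (0 = average hospital).

    w = tau^2 / (tau^2 + se^2): the weight on the hospital's own estimate grows
    as its standard error shrinks, i.e., as its case volume grows.
    """
    w = tau ** 2 / (tau ** 2 + se ** 2)
    return w * raw_log_odds

raw = math.log(2.0)  # an observed doubling of the odds of an adverse outcome
# Low volume (large standard error): pulled almost all the way back to average.
print(shrink(raw, se=0.8, tau=0.3))  # about 0.085
# High volume (small standard error): retains most of the observed effect.
print(shrink(raw, se=0.1, tau=0.3))  # about 0.624
```

The same observed doubling of the odds is thus reported as a near-average effect for a low-volume hospital but as nearly the full effect for a high-volume hospital, which is exactly the sensitivity/FDR trade-off documented in the results.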
Other studies have also used Monte Carlo simulation to compare the accuracy of hierarchical and nonhierarchical methods for hospital profiling. Austin et al.18 used acute MI data from Ontario, setting the hospital case volume to a fixed value of 250 in their simulation. This analysis identified quality outliers based on the point estimate for the hospital relative OR alone, as opposed to the more standard approach of using the point estimate in combination with the 95% CI. They found that the fixed-effects model exhibited greater sensitivity than the random-effects model, whereas the random-effects model had greater specificity for identifying quality outliers. In a simulation study based on data from high-volume hospitals in Kaiser Permanente, Kipnis et al.19 also found that random-effects models have the highest specificity and the lowest sensitivity. However, the high case volumes used in this study (average hospital volume exceeded 5,000 cases) would be expected to lead to reliable estimates of hospital quality regardless of whether hierarchical or nonhierarchical modeling was used. Using CABG data from the New York State Department of Health, Racz and Sedransk20 showed that outlier detection is reduced using random-effects modeling at low hospital volumes but not at high hospital case volumes. Unlike our study, Racz and Sedransk20 only simulated low-quality outliers, as opposed to both high- and low-quality outliers. More recently, Austin et al.21 used Monte Carlo simulation to show that the accuracy of hierarchical modeling improves at higher case volumes. In their simulation study, also based on Ontario acute MI data, they found that the accuracy of hospital report cards improves at higher hospital volumes and that hospital volumes must be greater than 300 in order for greater than or equal to 70% of hospitals to be classified correctly.
Their study, however, did not examine the impact of statistical methodology as the performance gap between hospital outliers and average hospitals increased. None of these studies explored the effects of hierarchical versus nonhierarchical modeling on the proportion of hospitals classified as low-quality outliers that are actually falsely identified as low-quality hospitals.
Our study extends the previous work by examining the performance of hierarchical and nonhierarchical models for hospitals with outcome rates 50%, 100%, or 300% higher than average hospitals or 30%, 50%, or 70% lower than average hospitals, with case volumes ranging between 100 and 1,000. Our study not only compares the performance of hierarchical and nonhierarchical modeling, but does this using a range of simulated hospital volumes and hospital performance. In particular, our study quantifies the proportion of hospitals classified as low-quality outliers that are actually falsely identified as low-quality hospitals. This error in classification is likely to be viewed as critically important by hospitals and physicians. The most striking finding in our study is the unacceptably high FDR using nonhierarchical modeling when true low-quality hospitals have 50% or even 100% higher rates of adverse outcomes than average hospitals, even at case volumes as high as 500. Equally important is our finding that hospitals with one-third fewer bad outcomes will be misclassified as “average” 98% of the time using hierarchical modeling, even at case volumes as high as 1,000.
Our findings have important policy implications in light of recent legislation expanding pay-for-performance to individual physicians. As part of its effort to transform the healthcare system and improve quality, CMS introduced the Physician Quality Reporting System (PQRS) in 2006. PQRS incentivized physicians to assess and report the quality of their care. Under PQRS, physicians were paid to report their outcomes—pay for reporting—but were not penalized or rewarded for their outcomes.22 Because physician performance was not publicly reported and payments were not tied to quality in PQRS, the accuracy of physician quality reporting was not an issue for most physicians. However, with the enactment of the Medicare Access and CHIP Reauthorization Act and the introduction of the Merit-based Incentive Payment System (MIPS), CMS is now transitioning physicians away from fee-for-service to pay-for-performance. Beginning in 2019, physicians will be subject to a plus or minus performance adjustment starting at 4% and increasing to 9% in 2022.23
These changes in reimbursement policy raise the question of whether physician report cards are accurate enough to be used for public reporting and pay-for-performance. Our study shows that we must use hierarchical modeling, as opposed to nonhierarchical modeling, in order to avoid falsely classifying many average-quality physicians as low performers. However, at typical hospital volumes of 200 cases, hierarchical modeling “misses” more than 90% of low-quality physicians with twice the rate of adverse events of average physicians and 100% of high-quality physicians with half the rate of adverse events of average physicians. This problem may not prevent CMS from using “inaccurate” report cards as the basis for MIPS. For example, in the case of the Hospital Readmissions Reduction Program, CMS penalizes hospitals up to 1% of all of their CMS billings if the estimated OE readmission ratio is 1.01 for a group of targeted conditions, up to 2% for an OE ratio of 1.02, and up to 3% for an OE ratio of 1.03.24 In most cases, it is not possible to conclude that hospitals with OE ratios of 1.01 to 1.03 are truly low quality, yet these hospitals are subject to CMS penalties. Report cards will never be 100% accurate. However, because report cards are more accurate at higher case volumes, it is critical that physicians and their societies advocate for hospital-based measures, as opposed to physician-based measures, for MIPS: hospital-based physician measures aggregate the results of many physicians and are therefore more likely to capture and report true quality because of their larger sample sizes.
Our study also illustrates one of the most difficult challenges of quality measurement: no approach is perfect, but some work better than others, depending on who is using the quality information. Assuming that patients place the highest priority on avoiding low-quality physicians and hospitals, nonhierarchical modeling may be preferred because it is most likely to flag low-quality hospitals. Physicians and hospitals, on the other hand, are more concerned about the risk of being falsely labeled as low quality and would thus prefer the approach that minimizes the risk of false positives: hierarchical modeling. CMS and third-party payers, which are likely to base payments on the rank order of hospital performance (hospitals in the top quartile are rewarded and those in the bottom quartile are penalized), may prefer hierarchical modeling because of its reliability, or smoothing, effects. When it comes to the quality of quality measurement, “quality” is very much in the eyes of the user.
This study has several limitations. First, we based our simulation on isolated CABG surgery and did not consider other procedures. We selected this clinical scenario because CABG surgery is very common and has a long history of quality reporting. It is likely, however, that had we selected more common procedures with lower rates of adverse outcomes, we would have found that quality reporting is even less accurate.25 Second, our definition of report card accuracy is arbitrary and by design imposes a greater “penalty” for misclassifying average hospitals as low-performance outliers. Thus, our conclusion regarding the accuracy of quality reporting for hierarchical versus nonhierarchical modeling may be sensitive to our definition of accuracy. However, our choice of definitions does not alter the fact that in our simulation, 49% of high-volume hospitals (500 cases) identified as low performance (hospitals with 50% higher rates of death or major complications) using nonhierarchical modeling were in fact average performance hospitals. This key finding makes it difficult to recommend the use of nonhierarchical modeling for quality reporting because of the potential unintended consequences of falsely labeling average-quality hospitals as low quality. Third, our simulation did not consider the additional impact on the accuracy of quality measurement of inadequate adjustment for risk factors. Fourth, every hospital in each of the simulation scenarios had the same case volume. Our simulation does not account for the variation in case volumes that is seen in practice. Finally, although we used a hospital case volume of 200 to represent “typical” surgical case volumes in our baseline analysis, many surgical procedures are performed less commonly at many hospitals.25
Report cards based on hierarchical modeling are more accurate than report cards based on nonhierarchical modeling—primarily because of the unacceptably high FDRs with nonhierarchical modeling. Furthermore, although quality reporting reliably identifies high-volume low-quality hospitals with extremely poor performance, the quality of quality reporting is significantly degraded at lower hospital volumes and at levels of performance that are more likely to be seen in clinical practice. In theory, quality reporting will increase transparency and accountability and will improve patient outcomes by incentivizing higher-quality care and penalizing lower-quality care. In practice, quality reporting may be very inaccurate—especially at low clinician volumes—and may only reliably identify physicians and hospitals whose performance deviates extremely from normal.
Supported by the Department of Anesthesiology at the University of Rochester School of Medicine, Rochester, New York.
The authors declare no competing interests.