In a recent report, we described risk stratification indices (RSIs) for mortality and duration of hospital stay.1 We reported the C statistic along with graphical receiver operating characteristic curves to assess the performance of these predictive models on a prospective validation data set. Pace correctly points out that the C statistic is a measure of model discrimination and that a complete validation also requires an assessment of calibration.
The RSI for in-hospital mortality is derived using a logistic model and therefore the C statistic is an appropriate metric of discrimination. The RSIs of 30-day mortality, 1-yr mortality, and 30-day discharge, however, are derived using Cox proportional hazards modeling. For these, a more appropriate measure of discrimination is Harrell's C (concordance) index, which is defined as the proportion of all usable data samples in which the predictions and the outcomes are concordant.2 Although the C statistic is defined for dichotomous outcomes, the C index is more broadly applicable, being appropriate for censored time-to-event response variables as well as continuous and ordinal outcomes.
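To make the definition concrete, the following is a minimal sketch of Harrell's C index for right-censored data. It is an illustration, not the code used in our analysis. The function name and arguments are hypothetical, and two simplifications are assumed: pairs with tied follow-up times are skipped, and pairs with tied predictions count as half concordant. A pair is treated as usable when the patient with the shorter follow-up time had an observed event.

```python
from itertools import combinations


def harrell_c_index(times, events, risks):
    """Harrell's C index for right-censored time-to-event data.

    times  : observed follow-up time for each patient
    events : 1 if the event was observed, 0 if the patient was censored
    risks  : predicted risk score (higher means higher predicted hazard)
    """
    concordant = 0.0
    usable = 0
    for i, j in combinations(range(len(times)), 2):
        # Order the pair so that patient a has the shorter follow-up time.
        a, b = (i, j) if times[i] <= times[j] else (j, i)
        # Assumption: skip tied times; a pair is usable only if the
        # shorter follow-up time ended in an observed event.
        if times[a] == times[b] or not events[a]:
            continue
        usable += 1
        if risks[a] > risks[b]:
            concordant += 1.0   # earlier event had the higher predicted risk
        elif risks[a] == risks[b]:
            concordant += 0.5   # tied predictions count as half
    if usable == 0:
        raise ValueError("no usable pairs")
    return concordant / usable


# Toy data (made up for illustration): the third patient is censored.
times = [1, 2, 3, 4]
events = [1, 1, 0, 1]
risks = [0.9, 0.7, 0.5, 0.1]
print(harrell_c_index(times, events, risks))
```

A value of 0.5 corresponds to predictions no better than chance, and 1.0 to perfect ordering of the usable pairs.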
We calculated the C index for each of these three RSIs on the Cleveland Clinic validation data set using a bootstrap methodology to estimate the 95% confidence intervals. The C indices (table 1) were nearly identical to the previously reported C statistics—although with somewhat wider confidence intervals—thus revealing good discrimination across all four RSI models.
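As an illustration of how a bootstrap confidence interval can be obtained, the sketch below uses the percentile method. This is an assumption: the specific bootstrap variant is not stated above, and the function name, its parameters, and the default of 1,000 resamples are hypothetical. Patients are resampled with replacement, the statistic is recomputed on each resample, and the 2.5th and 97.5th percentiles of the resampled values are taken as the interval.

```python
import random


def bootstrap_ci(rows, statistic, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for statistic(rows).

    rows      : list of per-patient records (any type)
    statistic : function mapping a list of records to a number,
                for example a C index computed on the resampled patients
    """
    rng = random.Random(seed)
    n = len(rows)
    estimates = []
    for _ in range(n_boot):
        # Resample n patients with replacement.
        sample = [rows[rng.randrange(n)] for _ in range(n)]
        estimates.append(statistic(sample))
    estimates.sort()
    lower = estimates[int((alpha / 2) * n_boot)]
    upper = estimates[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lower, upper


# Toy usage with the sample mean standing in for the C index.
values = list(range(100))
mean = lambda xs: sum(xs) / len(xs)
print(bootstrap_ci(values, mean))
```

Because whole patients are resampled, each patient's predicted risk and outcome stay paired, so the same routine can wrap any discrimination measure.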
As suggested by Pace, we assessed calibrations of the RSI models on the Cleveland Clinic validation data set by means of calibration graphs, which are graphical representations of the Hosmer-Lemeshow goodness-of-fit test.3 These were constructed by grouping patients into approximately equal-size bins of equivalent RSI values. The number of bins was chosen to achieve as even a distribution of patients among bins as possible, given the existence of ties. The mean RSI within each bin was then plotted against the mortality rate or mean length-of-stay within that bin.
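The binning step can be sketched as follows. This is a simplified illustration rather than the procedure used for the published graphs: the function name, the default number of bins, and the rule for closing a bin are assumptions. Patients are sorted by RSI, a bin is closed once it reaches the target size, and patients with identical RSI values are never split across bins, so the final number of bins may be smaller than requested.

```python
def calibration_bins(scores, outcomes, n_bins=10):
    """Group patients into roughly equal-size bins of similar score.

    scores   : predicted index (for example, RSI) for each patient
    outcomes : observed outcome for each patient (0/1 for mortality,
               or a continuous value such as length of stay)
    Returns one (mean score, mean outcome, bin size) tuple per bin.
    """
    pairs = sorted(zip(scores, outcomes))
    target = len(pairs) / n_bins
    bins, current = [], []
    for score, outcome in pairs:
        # Close the bin once it is full, but keep tied scores together.
        if current and len(current) >= target and score != current[-1][0]:
            bins.append(current)
            current = []
        current.append((score, outcome))
    if current:
        bins.append(current)
    return [
        (
            sum(s for s, _ in b) / len(b),   # x-axis: mean score in the bin
            sum(o for _, o in b) / len(b),   # y-axis: event rate or mean outcome
            len(b),
        )
        for b in bins
    ]


# Toy data (made up for illustration) with tied scores.
scores = [0.1, 0.1, 0.2, 0.3, 0.3, 0.3, 0.4, 0.5]
died = [0, 0, 0, 1, 0, 1, 1, 1]
for mean_score, rate, size in calibration_bins(scores, died, n_bins=4):
    print(round(mean_score, 3), rate, size)
```

Plotting the first element of each tuple against the second gives one point per bin on the calibration graph.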
The graphs indicate good calibration across the four RSI models (fig. 1), with mortality and extended-stay events most prevalent in the higher predicted-risk groups. (There is no expectation of linearity in these plots; goodness of calibration is indicated by monotonic left-to-right increases.) The low event rate for the in-hospital mortality endpoint results in very few events in the lower predicted risk groups; this gives rise to the “hockey stick” appearance of the graph. As these events accumulate in the 30-day and 1-yr mortality graphs, more events occur in the lower predicted risk groups, and the calibration graphs become smoother. The continuous endpoint in the 30-day discharge graph yields a smooth calibration relationship.
In summary, our risk stratification models exhibit both good discrimination and calibration. The indices can thus be used to adjust for differences in baseline and procedural risk, permitting fair outcome comparisons among hospitals and practices. We have put the system in the public domain to facilitate use; details, including model coefficients, SPSS programs, and sample files, are available at the Web site.*
†The Cleveland Clinic, Cleveland, Ohio. email@example.com