An appropriate measure of performance is needed to identify anesthetic depth indicators that are promising for use in clinical monitoring. To avoid misleading results, the measure must take into account both desired indicator performance and the nature of available performance data. Ideally, anesthetic depth indicator value should correlate perfectly with anesthetic depth along a lighter-deeper anesthesia continuum. Experimentally, however, a candidate anesthetic depth indicator is judged against a "gold standard" indicator that provides only quantal observations of anesthetic depth. The standard anesthetic depth indicator is the patient's response to a specified stimulus. The resulting observed anesthetic depth scale may consist only of patient "response" versus "no response," or it may have multiple levels. The measurement scales for both the candidate anesthetic depth indicator and observed anesthetic depth are no more than ordinal; that is, only the relative rankings of values on these scales are meaningful.
Criteria were established for a measure of anesthetic depth indicator performance and the performance measure that best met these criteria was found.
The performance measure recommended by the authors is prediction probability P_K, a rescaled variant of Kim's d_y·x measure of association. This performance measure shows the correlation between anesthetic depth indicator value and observed anesthetic depth, taking into account both desired performance and the limitations of the data. Prediction probability has a value of 1 when the indicator predicts observed anesthetic depth perfectly, and a value of 0.5 when the indicator predicts no better than a 50:50 chance. Prediction probability avoids the shortcomings of other measures. For example, as a nonparametric measure, P_K is independent of scale units and does not require knowledge of underlying distributions or efforts to linearize or to otherwise transform scales. Furthermore, P_K can be computed for any degree of coarseness or fineness of the scales for anesthetic depth indicator value and observed anesthetic depth; thus, P_K fully uses the available data without imposing additional arbitrary constraints, such as the dichotomization of either scale. And finally, P_K can be used to perform both grouped- and paired-data statistical comparisons of anesthetic depth indicator performance. Data for comparing depth indicators, however, must be gathered via the same response-to-stimulus test procedure and over the same distribution of anesthetic depths.
Prediction probability P_K is an appropriate measure for evaluating and comparing the performance of anesthetic depth indicators.
A monitor of anesthetic depth [1–3] during general anesthesia would be useful for assessing a patient's response to anesthetic agents and for titrating administration of the agents. Several anesthetic depth indicators have been suggested for monitoring purposes, including those based on hemodynamics, [4–6],* spontaneous electroencephalogram, [4,5,7],*,** auditory evoked potential, [8,9] spontaneous electromyogram, [10,11] esophageal contractility, [12,13] pupillary reflex, [4,14,15] skin conductivity, [16] and anesthetic concentration [4,17] and delivery rate. [18] Combinations of variables, using discriminant analysis, [19,20],*** multivariate logistic regression, [21,22] and neural networks, [20],*** also are being considered. An appropriate performance measure is needed to evaluate and compare candidate indicators of anesthetic depth. Such a measure could help identify indicators that merit further investigation.
A wide variety of possible performance measures are available. Among these are measures of difference or separation, such as the familiar parametric t and F statistics and the nonparametric Mann-Whitney statistic [23]; classification measures, such as sensitivity, specificity, and percent correct [24,25]; receiver operating characteristic (ROC) measures, such as parametric and nonparametric ROC area [25–27]; quantal response curve measures, such as the likelihood value and the slope parameter [28–31]; and parametric and nonparametric correlation measures, such as the Pearson product-moment correlation, [23] the biserial, [32] and the Spearman correlation coefficients [23] and a variety of measures of association. [33–39]
A performance measure for assessing proposed anesthetic depth indicators must be selected with care to avoid false conclusions about the potential utility of an indicator or how it compares with other indicators. The measure should have a meaningful interpretation and permit statistical comparisons of performance. More specifically, the measure should take into account the relationship desired between anesthetic depth indicator value and anesthetic depth and the nature of the available experimental data. In doing so, it should not require unjustified assumptions or ignore available information in the data. Previous measures have not adequately met these requirements.
We now present a new measure, prediction probability (P_K), for evaluating and comparing anesthetic depth indicators. This article is divided into three main sections. In the first section, we characterize the problem of assessing anesthetic depth indicators to identify the specific criteria that a performance measure should meet. In the second section, we present P_K as a measure that meets these criteria. Specifically, we define and interpret P_K, and describe how it is computed and used. We illustrate the application of P_K both on hypothetical data and on the experimental data of Leslie et al., [4] the first investigators to use this measure to compare candidate indicators of anesthetic depth. In the third section, we contrast the proposed measure to a variety of alternative measures.
Background: Assessing Anesthetic Depth Indicator Performance
Measurement Scales
Observed Versus Underlying Anesthetic Depth Scale. Anesthetic depth often is defined in terms of a response-to-stimulus test, such as whether or not the patient moves to skin incision [40,41] or responds to a voice command. [42,43] Such a "gold standard" anesthetic depth indicator defines a dichotomous scale of observed anesthetic depths having the two levels "response" and "no response." We assume that this observation scale is a coarse lumping of an underlying anesthetic depth continuum. The test stimulus defines a critical threshold point, dividing the underlying continuum into the two observed levels. A patient at a depth of anesthesia less than this critical point moves when stimulated, whereas a patient at a depth of anesthesia greater than this point does not. A stimulus of a different strength can define a different threshold point on the underlying depth continuum. [7,17,44,45] For example, the stimulus of suturing during skin closure defines a lesser threshold point than that from a 25-cm surgical incision. [17]
The scale of observed anesthetic depth can have more than two levels. One way to define a multilevel scale is to apply a graded sequence of stimuli. For example, the application of two stimuli, the first being weak and the second strong, could define the three ordered levels, "response to weak stimulus," "response to strong but not to weak stimulus," and "no response." A multilevel observed depth scale also can be created for a single-strength stimulus. For example, patient response to a single stimulus could be graded more finely into the categories "unequivocal move," "equivocal move," and "no move." [8] Or, for a repeated stimulus, the scale could be more finely graded by measuring the percentage of the patient's appropriate responses to the stimulus. [9]
The level of measurement of the observed anesthetic depth scale is no more than ordinal, in contrast to interval or ratio. That is, anesthetic levels can be rank ordered along the scale, but neither the sizes of intervals nor ratios of levels are meaningful in the context of depth of anesthesia. For example, consider the three-level observed depth scale, "response to weak stimulus," "response to strong but not to weak stimulus," and "no response." These levels can be rank ordered, so the scale is ordinal, but it is not an interval scale, because we cannot say how the difference between the levels of "response to weak stimulus" and "response to strong but not to weak stimulus" compares with the difference between the levels of "response to strong but not to weak stimulus" and "no response."
Anesthetic Depth Indicator Scale. The anesthetic depth indicator scale also is ordinal. Usually, the anesthetic depth indicator scale is finely graded compared with the observed depth scale, but it can be coarse and even dichotomous. As a specific example of a possible anesthetic depth indicator, consider the amplitude of the pupillary reflex to a light flash. [4,14,15] Theoretically, reflex amplitude is a continuous variable, although experimentally, it is made discrete by the limits of resolution for processing, display, or data storage. Reflex amplitudes can be rank ordered, so the anesthetic depth indicator scale is at least ordinal. One might argue that the reflex amplitude scale is interval or ratio, because the scale has units of length, and differences and ratios of length have meaning. However, within the context of predicting anesthetic depth, there is no uniform meaning over the range of reflex amplitudes of a given difference between, or ratio of, reflex amplitudes. Furthermore, because we do not know the form of the bivariate distribution of anesthetic depth and reflex amplitude, or even have as yet a meaningful interval or ratio scale for anesthetic depth, we do not know how to transform reflex amplitude into a variable for which intervals or ratios are meaningful.
It may be argued that the measurement scale for anesthetic concentration goes beyond ordinal, to interval or ratio, especially in view of minimum alveolar concentration additivity. [46]However, if an anesthetic depth indicator that has an interval or ratio output scale is to be compared with an indicator that has an output scale that is no more than ordinal, the same measure of performance must be used for both, namely, a performance measure applicable to ordinal scales.
Ideal Versus Actual Performance of an Anesthetic Depth Indicator
Ideal Performance. Ideally, an anesthetic depth indicator would change in a continuous fashion to show the changes in a patient's underlying anesthetic depth, as illustrated by the hypothetical curve in Figure 1(a). In the figure, the horizontal axis shows anesthetic depth indicator value, x, increasing to the right, and the vertical axis shows the underlying anesthetic depth, y_u, increasing downward. These directional conventions are a compromise between those commonly used for graphs and for tables; we use the same conventions later for tables. We use x for anesthetic depth indicator value and y_u for anesthetic depth to denote that during clinical monitoring we want the known indicator value, x, to predict the unknown anesthetic depth, y_u.
Figure 1. A hypothetical ideal anesthetic depth indicator. (a) Ideal relationship between indicator value x and underlying anesthetic depth y_u. (b) Ideal relationship between indicator value x and quantal observed anesthetic depth y. The vertical axes show underlying and observed anesthetic depths increasing downward.
Mathematically, anesthetic depth y_u in Figure 1(a) is a monotonically increasing function of anesthetic depth indicator value x. This type of relationship is ideal, because it ensures that anesthetic depth indicator value can predict anesthetic depth perfectly and that a change in y_u is reflected by a change in the same direction in x. To say that y_u is a mathematical function of x means that there is one and only one y_u for each value of x; we want this mathematical relationship because clinically we want to use x to predict y_u. The statement "y_u is a function of x" is not intended to suggest that anesthetic depth y_u is caused by indicator x. "Monotonically increasing" means that y_u increases as x increases; that is, the slope of y_u versus x is positive (recall that y_u increases downward in Figure 1). If desired, a nonlinear transformation could be used to straighten the curve.
Experimentally, values of observed, not underlying, anesthetic depth are available. Thus, anesthetic depth indicator performance can be assessed only in terms of how well indicator value correlates with or predicts observed anesthetic depth. Figure 1(b) illustrates the effect of the quantal nature of the observed anesthetic depth scale on the hypothetical ideal relationship in Figure 1(a). For generality, we show two threshold points on the continuum, y_u1 and y_u2, which divide the continuum into a three-level observed depth scale, with quantal levels y_1, y_2, and y_3. As a result, the smoothly changing curve in Figure 1(a) is reduced to a series of steps in Figure 1(b). Mathematically, observed anesthetic depth y in Figure 1(b) is a monotonically nondecreasing function of anesthetic depth indicator x. That is, the mathematical curve of y versus x can have zero and positive, but not negative, slope (again, recall that y increases downward in the figure). Given this relationship, anesthetic depth indicator value can predict observed anesthetic depth perfectly.
Actual Performance. Now consider an actual, non-ideal anesthetic depth indicator used on a population of patients. At each point on the underlying anesthetic depth continuum, there is a distribution or spread of indicator values. Consequently, each indicator value corresponds to a distribution or spread of underlying anesthetic depths. Figure 2 illustrates a population distribution of hypothetical data points, each point consisting of an anesthetic depth indicator value and an underlying anesthetic depth. To create the figure, the x-y_u plane from Figure 1(a) was rotated and tilted, and a z direction was added. The x axis is west-to-east, the y_u axis is north-to-south, and the z axis elevates upward out of the plane. The curved ridge or hill elevated above the plane represents the population distribution. For comparison, the dashed curve on the x-y_u plane shows the original hypothetical ideal curve from Figure 1(a). The west-east dark line at anesthetic depth y_u1 illustrates the spread in anesthetic depth indicator values at a given point on the depth continuum. Similarly, the north-south dark line at indicator value x_1 illustrates the spread in anesthetic depth corresponding to a given anesthetic depth indicator value.
Figure 2. Effects on the ideal relationship in Figure 1a of the distributional spread of indicator value at each underlying anesthetic depth.
The population distribution in Figure 2 degrades the indicator's ability to predict observed, as well as underlying, anesthetic depth. The effect of this distribution on the graph of observed anesthetic depth versus indicator value in Figure 1(b) would be to increase the lengths of the line segments at the different levels of observed depth, resulting in horizontal overlapping of these segments. A given indicator value then would correspond to more than one observed anesthetic depth. That is, observed anesthetic depth no longer would be a monotonically nondecreasing function of anesthetic depth indicator value, and indicator value no longer could predict observed depth perfectly.
Performance Measurement Criteria
In summary, to assess anesthetic depth indicator performance, we want a measure of the correlation or association between indicator value and underlying anesthetic depth. Experimentally, however, we can measure only observed, not underlying, anesthetic depth. In view of this limitation, we define ideal indicator performance to mean the perfect prediction by indicator value x of observed anesthetic depth y. Anesthetic depth indicator value and observed anesthetic depth are ordinal variables. Therefore, ideal indicator performance is achieved when observed anesthetic depth y is mathematically a monotonically nondecreasing function of indicator value x. We want a performance measure that shows how close the relationship between observed anesthetic depth and indicator value is to this goal. The performance measure should fully use the experimental data for any degree of coarseness or fineness of the indicator and observed anesthetic depth scales. In addition, as stated earlier, the measure should have a meaningful interpretation and permit statistical comparisons of performance.
Recommended Performance Measure: Prediction Probability P_K
The performance measure we recommend to meet these criteria is prediction probability P_K, a type of nonparametric correlation known as a measure of association. Measures of association are attractive for our application because they are suited to ordinal variables and can accommodate variable scales having any degree of coarseness or fineness. Measures of association are not new, but there are many to choose from, and insight is still being developed as to their properties and appropriate use. Relatively recently, Freeman identified which measure of association was appropriate for each of several models of ideal relationship between the variables. [47] Our ideal is for observed anesthetic depth y to be a monotonically nondecreasing mathematical function of anesthetic depth indicator x. For this ideal, Kim's d_y·x [37] is the appropriate measure of association. [47] We have rescaled d_y·x to create P_K, a measure having an interpretation that is simpler and more meaningful for our application. We use the subscript K in P_K to denote its close relationship to Kim's measure.
Measures of association, which apply to cross-tabulations of ordinal variables, are not widely familiar. Therefore, before proceeding with P_K, we show that experimental data on anesthetic depth indicators can be expressed in a tabular format. We also review the terminology of ordinal relationships and restate the performance desired of an anesthetic depth indicator in this terminology.
Tabular Presentation of Anesthetic Depth Indicator Data
Because the scales for both anesthetic depth indicator value and observed anesthetic depth are no more than ordinal, the information in experimental data resides in the rank order of the data points along these scales, not their specific numeric values and units. Thus, without any loss of information, we can present experimental data in table form, with column index increasing to the right and showing increasing rank of indicator values, and row index increasing downward and showing increasing rank of observed anesthetic depth.
To express the experimental data in table form, we do not need to "bin" the data. Instead, on each scale, we let each distinct experimental value define a distinct rank; that is, each distinct indicator value in the data sample defines a column, and each distinct observed anesthetic depth value in the data sample defines a row. Data for continuous, as well as discrete, scales can be put in table form in this way.
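This ranking-based tabulation can be made concrete with a short sketch (illustrative Python, not part of the original article; the function name and the list-of-lists layout are our own). Each distinct indicator value in the sample defines a column, increasing in rank to the right, and each distinct observed depth defines a row, increasing in rank downward:

```python
from collections import Counter

def cross_tabulate(points):
    """Tabulate (indicator value, observed depth) data points.

    Each distinct x value in the sample defines a column (rank increasing
    to the right); each distinct y value defines a row (rank increasing
    downward). Cell entries count the points in each combination.
    """
    xs = sorted({x for x, _ in points})      # column labels, by rank
    ys = sorted({y for _, y in points})      # row labels, by rank
    counts = Counter(points)
    return [[counts[(x, y)] for x in xs] for y in ys]
```

No binning is required: continuous indicator values simply produce one column per distinct value in the sample.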
As an example, consider the six hypothetical data points shown on the ideal relationship between underlying anesthetic depth and indicator value in Figure 1(a) and replotted on the three-level observed anesthetic depth scale in Figure 1(b). Table 1 shows these same six data points. Each cell entry in the table shows the number of data points that have that particular combination of indicator value (column) and observed anesthetic depth (row). Because observed anesthetic depth y in Figure 1(b) is a monotonically nondecreasing function of indicator value x, all of the data points in each succeeding row of Table 1 lie to the right of all of the data points in previous rows. An anesthetic depth indicator that generated these data should be judged to perform perfectly.
Table 2 shows a more general set of hypothetical data points, assumed to be a sampling from some unknown bivariate distribution of underlying anesthetic depth and indicator value, such as that in Figure 2. Observed anesthetic depth again has three levels. Table 2 shows nonideal indicator performance, because there is overlap from row to row.
Desired Performance Expressed in the Terminology of Ordinal Relationships
The relationship between ordinal variables x (indicator value) and y (observed anesthetic depth) is described in terms of the rank ordering of the x and y values for pairs of data points. A concordance occurs when the x values and y values for a pair of data points are rank ordered in the same direction. For example, in Table 2, the point in cell B2 and a point in cell A1 or in cell C3 compose a concordance. A discordance occurs when the x and y values are rank ordered in opposite directions (e.g., the point in cell B2 and a point in cell C1). An indicator-only, or x-only, tie (tie in indicator value but not in observed depth) occurs when the x values are tied but the y values are not (e.g., the point in cell B2 and a point in cell B1). Similarly, a depth-only, or y-only, tie occurs when the x values are not tied but the y values are (e.g., the point in cell B2 and a point in cell D2). Finally, a joint tie occurs when there are ties in both x and in y (e.g., any two data points in cell D2).
Our ideal is for indicator value x to predict observed anesthetic depth y perfectly. Therefore, concordances are desired, because the rank order of the x values correctly predicts the rank order of the y values. Discordances are undesirable, because x order incorrectly predicts y order. Indicator-only ties also are undesirable, because the values of x provide no predictive information about the rank order of the y values; the order of the y values is then only a guess, with a 50:50 chance of being correct. Ties in y (both y-only and joint ties) are not caused by the indicator, but by the experimental limitations that result in a quantal scale of observed anesthetic depth, so ties in y should not be considered in the evaluation of indicator performance.
Thus, an ideal relationship between indicator value x and observed anesthetic depth y consists of concordances, with no discordances or x-only ties, and with ties in y tolerated. A measure of anesthetic depth indicator performance should reward concordances, penalize discordances and x-only ties, and ignore ties in y (both y-only and joint ties).
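These pair types can be expressed compactly in code (a hypothetical Python sketch, not from the original article), where each data point is an (x, y) tuple of indicator value and observed anesthetic depth:

```python
def classify_pair(p, q):
    """Classify a pair of (indicator value, observed depth) data points."""
    (x1, y1), (x2, y2) = p, q
    if y1 == y2:
        return "y tie"            # y-only or joint tie: ignored
    if x1 == x2:
        return "x-only tie"       # indicator gives no ordering information
    # concordant when x and y values are ranked in the same direction
    if (x1 - x2) * (y1 - y2) > 0:
        return "concordance"
    return "discordance"
```

The sign of the product (x1 - x2)(y1 - y2) captures whether the indicator ordering agrees with the depth ordering.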
Definition and Interpretation of P_K
Prediction probability P_K is a variant of Kim's d_y·x [37] measure of association. Kim's d_y·x is defined for ordinal variables x and y in terms of the types of pairs of data points just described. Let P_c, P_d, and P_tx be the respective probabilities that two data points drawn at random, independently and with replacement, from the population are a concordance, a discordance, or an x-only tie. The only other possibility is that the two data points are tied in observed depth y; therefore, the sum of P_c, P_d, and P_tx is the probability that the two data points have distinct values of observed anesthetic depth, that is, that they are not tied in y.
Kim's d_y·x is defined to be

d_y·x = (P_c - P_d)/(P_c + P_d + P_tx). (1)

Alternatively, we define prediction probability P_K to be

P_K = (1 + d_y·x)/2, (2)

which, by inserting (1) into (2), becomes

P_K = (P_c + (1/2)P_tx)/(P_c + P_d + P_tx). (3)
Thus, P_K and Kim's d_y·x differ in scale and range of values but convey the same information. As desired, both Kim's d_y·x and P_K reward concordances, penalize discordances and indicator-only ties, and ignore ties in observed depth y. The range for Kim's d_y·x is from -1 to +1, while that for P_K is from 0 to 1. When the probabilities of discordance and indicator-only tie are both zero, d_y·x and P_K both equal 1. When the probability of discordance equals that of concordance, d_y·x = 0 and P_K = 0.5. A negative value of d_y·x, or a value of P_K less than 0.5, means that discordances are more likely than concordances.
The advantage of prediction probability P_K over d_y·x is its simple interpretation as a probability that directly relates to the goal of using indicator value to predict observed anesthetic depth. Specifically, given two randomly selected data points with distinct observed anesthetic depths, P_K is the probability that the indicator values of the data points predict correctly which of the data points is the lighter (or deeper). Appendix 1 supports this interpretation. A value of P_K = 0.5 means that the indicator correctly predicts the anesthetic depths only 50% of the time, i.e., no better than a 50:50 chance. A value of P_K = 1 means that the indicator predicts the anesthetic depths correctly 100% of the time.
In contrast, though Kim's d_y·x embodies the same information as P_K, its interpretation is a more abstract and cumbersome difference between two probabilities. Specifically, given two randomly selected data points with distinct observed anesthetic depths, Kim's d_y·x is the probability that the two data points are concordant, minus the probability that the two data points are discordant.
Estimation of P_K
Prediction probability P_K is computed from sample data by replacing the probabilities in Equation 3 with sample estimates. Table 1 and Table 2 are examples of sample data. Estimates of P_K and its standard error (SE) can be obtained by using the jackknife method or more traditionally by using the relationship of P_K to other measures of association. Calculations can be performed by using a custom spreadsheet macro, PKMACRO,**** or commercial statistical software. The program PKMACRO provides both jackknife and the more traditional estimates of P_K and its SE.
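The substitution of sample estimates into Equation 3 can be sketched as follows (an illustrative Python estimator, not PKMACRO; counting over all distinct pairs of sample points suffices here, because the self-pairs allowed by sampling with replacement are joint ties in y, which Equation 3 ignores):

```python
from itertools import combinations

def prediction_probability(points):
    """Sample estimate of P_K = (P_c + P_tx/2) / (P_c + P_d + P_tx).

    Each point is an (indicator value, observed depth) tuple.
    """
    nc = nd = ntx = 0                         # pair-type counts
    for (x1, y1), (x2, y2) in combinations(points, 2):
        if y1 == y2:
            continue                          # ties in y are ignored
        if x1 == x2:
            ntx += 1                          # x-only tie
        elif (x1 - x2) * (y1 - y2) > 0:
            nc += 1                           # concordance
        else:
            nd += 1                           # discordance
    return (nc + 0.5 * ntx) / (nc + nd + ntx)
```

A perfectly monotone sample yields 1.0, a perfectly reversed sample yields 0.0, and an indicator unrelated to depth hovers near 0.5.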
We recommend using the jackknife method [48,49] to estimate P_K and its SE. An advantage of the jackknife method is that sampling variability can be approximated by the Student's t distribution, [48] thus taking into account sample size. Another advantage is that the jackknife method makes possible paired-data, as well as grouped-data, statistical comparisons of P_K values. Finally, the jackknife method reduces bias in the estimation of P_K, although, as we shall see, bias may not be a significant concern. The jackknife method assumes independent data points.
At the outset, the application of the jackknife method to P_K appears computationally forbidding. For a sample of n data points, the method requires the computation of n + 1 estimates of P_K, one on the entire n-sample and n more on the (n - 1)-size samples obtained by deleting each of the n data points one at a time. Fortunately, it is possible to take advantage of the mathematical structure of P_K to speed up these computations. As a result, PKMACRO performs the jackknife computations rapidly.
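A direct, unoptimized jackknife sketch in Python may clarify the procedure; it simply recomputes P_K on each leave-one-out sample rather than exploiting the structural speedup used by PKMACRO, so it is an O(n^3) illustration only:

```python
import math
from itertools import combinations

def pk(points):
    """Sample P_K over all distinct pairs of (x, y) points."""
    nc = nd = ntx = 0
    for (x1, y1), (x2, y2) in combinations(points, 2):
        if y1 == y2:
            continue                          # ties in y are ignored
        if x1 == x2:
            ntx += 1                          # x-only tie
        elif (x1 - x2) * (y1 - y2) > 0:
            nc += 1                           # concordance
        else:
            nd += 1                           # discordance
    return (nc + 0.5 * ntx) / (nc + nd + ntx)

def jackknife_pk(points):
    """Jackknife estimate of P_K and its SE via leave-one-out pseudovalues."""
    n = len(points)
    full = pk(points)
    loo = [pk(points[:i] + points[i + 1:]) for i in range(n)]   # n deletions
    pseudo = [n * full - (n - 1) * p for p in loo]              # pseudovalues
    pk_jack = sum(pseudo) / n
    var = sum((p - pk_jack) ** 2 for p in pseudo) / (n * (n - 1))
    return pk_jack, math.sqrt(var)
```

For ideal data the pseudovalues are all 1, giving P_K,jack = 1 with SE 0, as reported for Table 1.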
For the nonideal data in Table 2, the sample estimate of prediction probability computed using Equation 3 is P_K = 0.867. The jackknife method gives nearly the same value, P_K,jack = 0.866, suggesting that there is little or no bias in P_K. The jackknife SE estimate is sigma_PK,jack = 0.070. Alternatively, for the ideal data in Table 1, P_K = P_K,jack = 1 with a jackknife SE of 0. These values were computed using PKMACRO.
A more traditional approach to computing P_K and its SE is to use the close relationship between Kim's d_y·x [37] and an older measure of association, Somers' d_yx. [35] This approach also assumes independent data points. It permits grouped-data, but not paired-data, comparisons of P_K values. An advantage of this approach is that computations can be performed using commercial statistical programs for Somers' measure. Previous work on Somers' measure [39] can be used to show that P_K is asymptotically unbiased and Gaussian. There are no clear guidelines, however, on the minimum sample size required to assume this asymptotic distribution.
As reviewed in Appendix 2, Kim's d_y·x is structurally equal to Somers' d_xy (note the reversed subscripts), and Somers' d_xy can be found using commercial software programs, such as BMDP [50] (BMDP Statistical Software, Los Angeles, CA), SPSS [51] (SPSS, Chicago, IL), and SAS [52] (SAS Institute, Cary, NC). Thus, P_K can be computed by inserting d_y·x = d_xy into Equation 2. The commercial programs compute Goodman and Kruskal's [39] approximate standard error (ASE) of d_xy, denoting it S_1, or approximate standard error 1. By Equation 2, the corresponding SE of P_K, sigma_PK1, equals one half of S_1. For the nonideal data in Table 2, BMDP program 4F [50] gives d_xy = 0.734 and S_1 = 0.132. Therefore, by Equation 2, P_K = (1 + 0.734)/2 = 0.867, the same value for P_K obtained previously, and sigma_PK1 = 0.132/2 = 0.066, which is close to the jackknife SE shown previously. For the ideal data in Table 1, the more traditional approach results in P_K = 1 with an SE of 0, the same values obtained using the jackknife method.
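In code, the conversion from commercial-software output is a one-liner (a sketch of Equation 2; the inputs are Somers' d_xy and its ASE S_1 as reported by, for example, BMDP):

```python
def pk_from_somers(dxy, s1):
    """Convert Somers' d_xy and its ASE S_1 to P_K and sigma_PK1 (Equation 2)."""
    return (1 + dxy) / 2, s1 / 2
```

With the Table 2 values d_xy = 0.734 and S_1 = 0.132, this reproduces P_K = 0.867 and sigma_PK1 = 0.066.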
Brown and Benedetti [53] recommended an adjusted version of S_1, denoted S_0, when using a sparse data table to test the null hypothesis of no association at a small level of significance (e.g., 0.01). The BMDP and SPSS software packages provide this alternative SE estimate for Somers' measure as a t value equal to d_xy/S_0. The corresponding SE estimate for P_K is sigma_PK0 = S_0/2; in terms of P_K, the associated t value equals (P_K - 0.5)/sigma_PK0. For Table 2, BMDP gives t = 5.643, showing that there is statistically significant association at P = 0.0001. Correspondingly, S_0 = 0.130 and sigma_PK0 = 0.130/2 = 0.065, which is close to the previous values of sigma_PK1 and sigma_PK,jack. The program PKMACRO computes both sigma_PK1 and sigma_PK0.
Leslie et al. [4] applied P_K to experimental data for ten candidate indicators for monitoring anesthetic depth. In their study of human volunteers given propofol and nitrous oxide, observations of anesthetic depth consisted of noting whether or not a volunteer moved in response to electrical stimulation. This stimulus corresponds to a depth threshold point, such as y_u1 in our Figure 2, dividing the depth scale into two observed levels, "move" and "no move."
Jackknife estimates of P_K on the 130-stimulus data for the ten candidates (Table 2 of Leslie et al. [4]) ranged from 0.736 for propofol blood concentration to 0.864 for the Bispectral Index of the electroencephalogram. The jackknife estimates were close to the traditional estimates. Specifically, the values of P_K,jack and P_K agreed out to at least 13 decimal places, again suggesting negligible bias in P_K. The values of sigma_PK,jack were greater than those of sigma_PK1 by an average of 1.5% and were less than those of sigma_PK0 by an average of 4.9%.
Hypothesis Tests on P_K
Test of an Indicator's Ability to Predict Observed Anesthetic Depth. One useful statistical test of indicator performance is whether P_K is different from 0.5, the value for an indicator that has no predictive power. For an n-sample, this test can be performed by using the t statistic (P_K,jack - 0.5)/sigma_PK,jack with n - 1 degrees of freedom. For our Table 2, this statistic has the value 5.227. This value is consistent with the previously mentioned t statistic associated with sigma_PK0, again showing the presence of statistically significant predictive ability at P = 0.0001.
For the ten candidate indicators investigated by Leslie et al., [4] the 130-stimulus t statistic values for jackknife hypothesis tests of the predictive ability of each variable ranged from 4.87 for propofol blood concentration to 11.05 for the Bispectral Index. Even with a Bonferroni correction for multiple comparisons, [54] all ten indicators showed statistically significant abilities to predict observed anesthetic depth at P = 0.001.
Test to Compare the Performance of Two Indicators. Another useful test is a comparison of the performance of two anesthetic depth indicators. If the data are collected independently on the two indicators, the test statistic is the difference between the two sample values of PK, divided by an estimated SE of this difference. If jackknife results are available, a t test can be performed. The choice of method for determining the estimated SE of the difference and the number of degrees of freedom for this test depends on whether or not the variances of the jackknife pseudovalues [48,49] for the two indicators are equal. [23] Alternatively, if the sampling variabilities for the two estimates of PK are assumed to be Gaussian, then the estimated SE of the difference is the square root of the sum of the squares of the estimated SEs of the two sample PK values, and the test statistic is Gaussian.
If paired data on the two indicators are collected and the indicators are positively correlated, we can avail ourselves of the greater statistical power of a paired-data method of comparing indicator performance. Paired-data comparisons can be performed using PKMACRO. Leslie et al. [4] used the jackknife method to make paired-data comparisons of the PK values for alternative depth indicators.
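A paired-data comparison can be sketched along the same lines. The sketch below uses illustrative data, not the study's PKMACRO implementation: it computes jackknife pseudovalues of PK for two indicators measured on the same cases and then applies a paired t test to the per-case pseudovalue differences.

```python
import math
import statistics

def pk(xs, ys):
    """Sample PK (concordances 1, indicator-only ties 1/2, discordances 0)."""
    conc = disc = tie_x = 0
    n = len(xs)
    for i in range(n):
        for j in range(i + 1, n):
            if ys[i] == ys[j]:
                continue
            if xs[i] == xs[j]:
                tie_x += 1
            elif (xs[i] < xs[j]) == (ys[i] < ys[j]):
                conc += 1
            else:
                disc += 1
    return (conc + 0.5 * tie_x) / (conc + disc + tie_x)

def pseudovalues(xs, ys):
    """Jackknife pseudovalues of PK, one per case."""
    n = len(xs)
    pk_all = pk(xs, ys)
    return [n * pk_all - (n - 1) * pk(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
            for i in range(n)]

def paired_t(xa, xb, ys):
    """Paired t statistic (n - 1 df) on per-case pseudovalue differences."""
    diffs = [a - b for a, b in zip(pseudovalues(xa, ys), pseudovalues(xb, ys))]
    n = len(diffs)
    return statistics.mean(diffs) / (statistics.stdev(diffs) / math.sqrt(n))

y  = [0, 0, 0, 1, 1, 1]          # same six cases for both indicators
xa = [1, 2, 3, 4, 5, 6]          # indicator A: orders the cases perfectly
xb = [1, 5, 3, 2, 6, 4]          # indicator B: three discordant pairs
t_paired = paired_t(xa, xb, y)
```

Because the same cases underlie both sets of pseudovalues, their positive correlation reduces the variance of the differences, which is the source of the extra statistical power mentioned above.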
Prediction probability is useful for comparing anesthetic depth indicators because it does not depend on distributional assumptions, the particular type or units of an indicator variable's scale, or the choice of a particular variable threshold value, and because its expected value is asymptotically independent of the number of experimental data points. When comparing indicators, however, it is necessary to gather data using the same stimulus procedure and over the same distribution of anesthetic depths. A good way to ensure appropriate conditions for comparing the performance of two anesthetic depth indicators is to measure the indicator values simultaneously for the same subjects--hence, the importance of being able to carry out paired-data comparisons. The sensitivity of PK to data range is analogous to that of the more familiar Pearson product-moment correlation coefficient, r, when r is used to measure the degree of linear relationship between two interval variables. [55]
P sub K Versus Alternative Performance Measures
Separation Measures
One approach to measuring the performance of an anesthetic depth indicator is to determine the difference, or separation, between the populations of indicator value x that correspond to different values of observed anesthetic depth y. Consider first dichotomous observed depth y, with values R for "response" and N for "no response" such that N > R, and let xR and xN denote corresponding values of x.
Perhaps the most familiar separation measure is the t statistic (or its equivalent in this situation, the F statistic), which evaluates separation between the means of two populations. [23] A drawback of this parametric statistic is that its proper interpretation requires the distributions of xR and xN to be simultaneously Gaussian with equal variance. Analysis of variance [23] also assumes Gaussian distributions with equal variance. Another drawback to these statistics is that their expected values change with the number of experimental data points, reducing their usefulness for comparisons. In contrast, prediction probability PK does not require distributional assumptions, and its expected value is asymptotically independent of sample size.
Nonparametric alternatives to the t statistic, such as the Mann-Whitney U statistic, or, equivalently, Kendall's S or the Wilcoxon W, have the advantage that they can be used with ordinal variables. [23,34] These measures are appropriate for testing the hypothesis that the xR and xN populations are not separated. The chi-square statistic offers an even less restrictive test of the independence of the xR and xN populations. [23] Unlike the value of PK, however, with its meaningful interpretation as a probability of prediction, the numeric values of these statistics lack intrinsic meanings, and again they vary with sample size.
Prediction probability PK applies, and retains its meaning, when y has multiple levels. In contrast, while the separation measures can be generalized to test the mutual separation or independence of distributions of x for multiple y values, they cannot show the presence or degree of the relationship of interest, namely that x and y increase together.
Classification and Receiver Operating Characteristic Measures
Another approach to measuring performance is to determine how successfully indicator value x classifies experimental data points according to observed anesthetic depth. Again consider first dichotomous y with values R and N. The x scale then is dichotomized by selecting a threshold, and x less than or greater than the threshold is used to predict that the value of observed depth y is, respectively, R or N. Sensitivity, specificity, and percent correct are common classification measures.
As discussed in Swets and Pickett [25], these and other related measures have shortcomings. Sensitivity or specificity alone is an incomplete performance measure, because each takes into account only one of the two possible types of classification errors. In contrast, PK and percent correct take both types of errors into account. The values of sensitivity, specificity, and percent correct depend on the threshold used to dichotomize x. This dependence on threshold is undesirable, because there is no universal criterion for choosing a threshold value for a given anesthetic depth indicator, nor is there a clear way to select equivalent threshold values for alternative indicators. In contrast, PK takes into account all possible threshold values along the x scale, no matter how coarsely or finely graded the scale is. Another drawback of sensitivity and specificity is that, unlike PK, they cannot be used beyond dichotomous y.
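To make the threshold dependence concrete, here is a small sketch with hypothetical data (y = 1 denoting "no response") that computes sensitivity, specificity, and percent correct at two different thresholds on the same indicator.

```python
def classification_measures(xs, ys, threshold):
    """Treat x >= threshold as a prediction of y = 1 ("no response")
    and x < threshold as a prediction of y = 0 ("response")."""
    tp = sum(1 for x, y in zip(xs, ys) if y == 1 and x >= threshold)
    fn = sum(1 for x, y in zip(xs, ys) if y == 1 and x < threshold)
    tn = sum(1 for x, y in zip(xs, ys) if y == 0 and x < threshold)
    fp = sum(1 for x, y in zip(xs, ys) if y == 0 and x >= threshold)
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    percent_correct = (tp + tn) / len(xs)
    return sensitivity, specificity, percent_correct

x = [1, 2, 3, 4, 5, 6]           # hypothetical indicator values
y = [0, 0, 1, 0, 1, 1]           # observed depth (1 = no response)
at_3_5 = classification_measures(x, y, 3.5)
at_4_5 = classification_measures(x, y, 4.5)
# The two thresholds give different sensitivity/specificity trade-offs
# for the same indicator and the same data, illustrating why a
# threshold-dependent measure is hard to use for comparisons.
```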
For dichotomous y, parametric [27] and nonparametric [26] ROC areas, like prediction probability PK, take into account both sensitivity and specificity for all possible threshold values of x. Parametric ROC area has the drawback that, unlike PK, it requires distributional assumptions, typically that the x scale is transformable so that xR and xN are simultaneously Gaussian, though with perhaps different variances. Like PK, nonparametric ROC area can be applied to ordinal data, but, unlike PK, it cannot be used beyond dichotomous y.
Moreover, the statistical assumptions on which ROC area is based do not apply generally to anesthetic depth indicator data. Receiver operating characteristic analysis makes the assumption that x is measured conditioned on y for each of the two values of y. [26] In contrast, and consistent with PK, data for assessing anesthetic depth indicators usually consist of joint measurements of indicator value x and observed depth y, neither of which is known with certainty beforehand.
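For dichotomous y, the agreement between PK and the nonparametric (Mann-Whitney) ROC area can be checked directly; the sketch below uses illustrative data that include an indicator-only tie.

```python
def pk(xs, ys):
    """Sample PK (concordances 1, indicator-only ties 1/2, discordances 0)."""
    conc = disc = tie_x = 0
    n = len(xs)
    for i in range(n):
        for j in range(i + 1, n):
            if ys[i] == ys[j]:
                continue
            if xs[i] == xs[j]:
                tie_x += 1
            elif (xs[i] < xs[j]) == (ys[i] < ys[j]):
                conc += 1
            else:
                disc += 1
    return (conc + 0.5 * tie_x) / (conc + disc + tie_x)

def roc_area(xs, ys):
    """Nonparametric ROC area for dichotomous y: the fraction of
    (response, no-response) pairs in which the no-response case has the
    greater x, with ties in x counted as one half (Mann-Whitney form)."""
    x_r = [x for x, y in zip(xs, ys) if y == 0]   # "response" cases
    x_n = [x for x, y in zip(xs, ys) if y == 1]   # "no response" cases
    score = sum(1.0 if b > a else 0.5 if b == a else 0.0
                for a in x_r for b in x_n)
    return score / (len(x_r) * len(x_n))

x = [1, 2, 2, 3, 5, 4]
y = [0, 0, 1, 0, 1, 1]
area = roc_area(x, y)
```

For dichotomous y the pairs with distinct observed depths are exactly the (response, no-response) cross pairs, so the two computations coincide.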
Quantal Response Curves
When observed anesthetic depth y is dichotomous, with values R and N, and x is a suitably transformed indicator value, a simple, two-parameter function, such as the logistic or probit curve, can be fitted to a set of sample data to show the percent probability, versus x, that y = N. [28–31] The appeal of such a "quantal response curve" is that, for the distribution of anesthetic depths on which it is based, it shows directly how to achieve a particular probability of no response to stimulus. Common quantal response curve parameters are the value of x at the 50% probability of no patient response, x50, and the steepness or slope of the curve, reciprocally related to the spread in x of the derivative of the curve.
Quantal response curves can be used to compare the performance of two anesthetic depth indicators. If the indicators have a common scale, then the better of the two indicators has the quantal response curve with the steeper slope. Another approach to comparing two indicators, say indicator A and indicator B, is to use stepwise regression and a measure such as the likelihood (i.e., the maximized likelihood value). [29–31,50–52] The likelihood is the probability of occurrence of the sample results for the given quantal response curve parameter values; the likelihood increases to 1 when the quantal response curve fits the sample data perfectly.
An example of this approach would be to fit individual logistic curves to the suitably transformed indicators A and B, and then to use bivariate regression to fit a logistic curve to the linear combination of the transformed indicators A and B. If the likelihood value for the bivariate regression is statistically better than the univariate value for indicator A, but not better than the univariate value for indicator B, then indicator B is the better one.
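The likelihood computation that underlies such regression-based comparisons can be sketched as follows. This is a minimal stand-in for packaged logistic regression software, not the study's procedure: a univariate logistic curve fitted by gradient ascent on the log-likelihood, using hypothetical, non-separable data.

```python
import math

def fit_logistic(xs, ys, iters=5000, lr=0.5):
    """Fit P(y = 1 | x) = 1 / (1 + exp(-(a + b*x))) by gradient ascent
    on the log-likelihood (a concave objective, so a small fixed step
    converges for well-scaled data)."""
    a = b = 0.0
    n = len(xs)
    for _ in range(iters):
        ga = gb = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(a + b * x)))
            ga += y - p
            gb += (y - p) * x
        a += lr * ga / n
        b += lr * gb / n
    return a, b

def log_likelihood(xs, ys, a, b):
    """Log of the likelihood of the quantal data under the given curve;
    it is 0 (likelihood 1) only if the curve explains the data perfectly."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = 1.0 / (1.0 + math.exp(-(a + b * x)))
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return total

x = [1, 2, 2, 3, 3, 4, 4, 5]     # hypothetical transformed indicator values
y = [0, 0, 1, 0, 1, 0, 1, 1]     # dichotomous observed depth (1 = no response)
a, b = fit_logistic(x, y)
ll_fit = log_likelihood(x, y, a, b)
ll_null = log_likelihood(x, y, 0.0, 0.0)   # flat 50:50 curve, no x term
```

In a stepwise comparison, the fitted log-likelihoods of the competing models (univariate A, univariate B, bivariate A + B) would be compared in this fashion.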
A drawback to using quantal response curves to assess indicator performance is that, unlike PK, they are limited to dichotomous observed anesthetic depth y. Also unlike PK, the measure of indicator quality obtained using a quantal response curve depends on whatever nonlinear transformation is first applied to x. We expand on this issue in a later section. If slope parameters are compared, then the transformation used to put both indicators onto a common scale also affects the comparison. As with PK, sample quantal response curves depend on the distribution of anesthetic depths, so data on indicators to be compared need to be gathered over the same depth distributions.
Correlation Measures
Prediction probability PK is a type of correlation measure. The familiar Pearson product-moment correlation coefficient, rho or r, has the drawback that its usual statistical interpretation assumes that x and y are jointly Gaussian. [23] Similarly, the biserial correlation coefficient [32] assumes that y is a dichotomization of an underlying variable that is jointly Gaussian with x. The Spearman rank correlation coefficient, [23] a nonparametric alternative to r, has the advantage of avoiding distributional assumptions. In its basic form, however, it has the drawback of assuming that there are no tied values of x or y, whereas the coarse nature of the observed anesthetic depth y scale typically results in many ties. Although procedures exist to correct for ties, [34] the numeric value of the Spearman correlation coefficient still lacks an intrinsic meaning, in contrast to the interpretation of PK as a probability of correct prediction.
Prediction probability PK is a variation of Kim's dy·x measure of association, which is equivalent to Morris' gamma sub k. [36] We mentioned previously that Kim's dy·x is closely related to Somers' dyx [35] but that the latter is not appropriate here; as Appendix 2 shows, Somers' dyx inappropriately ignores indicator-only ties and penalizes depth-only ties. There are several other closely related measures of association, including Kendall's tau sub b, [33] Stuart's tau sub c, [33] Goodman and Kruskal's gamma, [39] and Wilson's e. [38] These measures also are inappropriate, because the ways they treat ties in the data are inconsistent with the model we established of ideal indicator performance, namely, that observed depth y should be a monotonically nondecreasing function of indicator value x. Kendall's and Stuart's measures have the additional drawback that their numeric values lack an intrinsic meaning. [56] As desired, Goodman and Kruskal's gamma ignores ties in y, but it inappropriately also ignores indicator-only ties. These ties, in which the same indicator value x occurs for two different values of observed depth y, degrade the predictive power of x and should be penalized. Wilson's e is not appropriate, because it penalizes not only indicator-only ties but also depth-only ties. These latter ties occur because of the experimentally coarse observed depth scale and should not count against indicator performance.
P sub K Versus Parametric Performance Measures that Require Model Fitting
An alternative to assuming ordinality and using a nonparametric performance measure such as PK is to attempt to transform the data so that they fit the requirements for a parametric performance measure. An example of such a transformation is the use of the logarithm of anesthetic concentration to improve the fit of the logistic or probit parametric model of the quantal response curve for dichotomous y. [17,18,21,57,58] (The sigmoidal quantal response equations used by Ausems et al. [17] and by Vuyk et al. [58] are equivalent to fitting logistic curves to the logarithm of concentration. [29])
Ultimately, the clinical user of an anesthetic depth indicator needs to know how to interpret its output in the context of anesthesia administration. Model fitting may be a necessary step in achieving this level of understanding. However, when seeking promising indicators, there are drawbacks to the use of a model-dependent performance measure. The difficulty is that such a measure reflects not only the indicator's inherent potential but also how well the model is fitted. The process of model fitting is iterative, with no guarantee of finding the best data transformation, [59–61] and there are multiple, at times contradictory, measures of goodness-of-fit. [30,31,50–52]
We now illustrate how PK compares with a model-based performance measure, specifically, a measure based on logistic regression, using the 130-stimulus data for the ten candidate indicators of Leslie et al. [4] The measure of logistic curve fit we choose is the SPSS "significance" value for "-2 Log Likelihood." [51] This significance value is obtained for a test of the null hypothesis that the likelihood equals 1, that is, that the given logistic curve explains the dichotomous quantal data perfectly. Thus, a larger significance value corresponds to a better fit of the logistic curve to the quantal data. To investigate the effect of distributional shape, we apply both of the performance measures, PK and logistic significance, before and after logarithmic transformations of the variables.
Figure 3 shows logistic significance values for the ten candidate indicators versus the corresponding sample estimates of PK. The open symbols in Figure 3 are for the original indicators, and the filled symbols are for the logarithmic transformations of these indicators. The effect of logarithmic transformation was the greatest on the percent beta power of the electroencephalographic spectrum (BETA). Before the transformation, logistic significance ranked BETA as only the fifth best of the candidates; afterward, the ranking of BETA by logistic significance improved to second best. In contrast, PK ranked BETA as second best, without the need for a transformation. Because PK is a nonparametric measure, its values were unaffected by the transformation, that is, by the shape of the data distribution.
Figure 3. Logistic significance value vs. PK for ten candidate indicators for monitoring anesthetic depth. The vertical axis shows the significance for a chi-square test on the likelihood value in which the null hypothesis is that the logistic curve explains the quantal data perfectly. Open symbols show the results for the original indicators; closed symbols show the results for logarithmic transformations of the indicators. BPROP = blood propofol concentration; EPROP = model-estimated effect-site propofol concentration; BIS = EEG Bispectral Index; DELT = EEG spectrum percent delta power; BETA = EEG spectrum percent beta power; F95 = EEG 95-percent spectral edge frequency; F50 = EEG median frequency; SABP = systolic arterial blood pressure; RA = pupillary reflex amplitude; CV = pupillary constriction velocity. The data are from the study of Leslie et al. [4].
When using a model-based performance measure, it may not be clear whether the data need to be transformed and, if so, what transformation is best. The presence of positive skew in histograms of BETA for "move" and "no move" suggested that a logarithmic transformation might improve logistic fit. [59] Figure 4 shows this skew (Figure 4a) and its reduction after the transformation (Figure 4b). Some other transformation, however, might have resulted in an even better logistic fit. Histograms, as well as previous propofol studies, [18,58] suggested that logarithmic transformations also would improve the logistic curve fits for blood and model-estimated effect-site propofol concentrations. The increases in logistic significance values for these variables in Figure 3 confirm these expectations.
Figure 4. Histograms of electroencephalographic percent beta power (BETA) for the "move" and "no move" levels of observed anesthetic depth. (a) Raw data. (b) After logarithmic transformation of the data. The data are from the study of Leslie et al. [4].
For some other indicators, however, the histograms were misleading. A logarithmic transformation of electroencephalographic percent delta power seemed to increase, rather than reduce, distributional skew, yet Figure 3 shows that the transformation slightly improved the fit of the logistic curve. Also, although the skew for median frequency F50 resembled that shown in Figure 4(a) for BETA, Figure 3 shows that the logarithmic transformation of the data worsened the fit of the logistic curve. Thus, for the logistic performance measure, an approach more sophisticated than simply viewing histograms is required in the search for appropriate data transformations.
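Because PK depends only on rank order, any monotonically increasing transformation, logarithmic or otherwise, leaves it unchanged, so no such search is needed. A quick check with hypothetical positive indicator values:

```python
import math

def pk(xs, ys):
    """Sample PK (concordances 1, indicator-only ties 1/2, discordances 0)."""
    conc = disc = tie_x = 0
    n = len(xs)
    for i in range(n):
        for j in range(i + 1, n):
            if ys[i] == ys[j]:
                continue
            if xs[i] == xs[j]:
                tie_x += 1
            elif (xs[i] < xs[j]) == (ys[i] < ys[j]):
                conc += 1
            else:
                disc += 1
    return (conc + 0.5 * tie_x) / (conc + disc + tie_x)

x = [0.5, 2.0, 1.0, 1.0, 8.0, 4.0]   # hypothetical positively skewed indicator
y = [0, 0, 0, 1, 1, 1]
pk_raw = pk(x, y)
pk_log = pk([math.log(v) for v in x], y)  # same ranks, same ties, same PK
```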
We believe that PK shows the potential performance of an indicator of anesthetic depth, whereas this potential may be revealed by a model-based measure only after a suitable transformation of the indicator has been found. The results in Figure 3 support our view. There is a strong correlation between sample values of logistic significance and PK after indicators were transformed to improve logistic curve fit. Specifically, using the higher of each pair of significance values shown, the sample correlation coefficient between logistic significance and PK in Figure 3 is 0.971.
In summary, prediction probability PK is a performance measure particularly suited to evaluating and comparing anesthetic depth indicators. This measure is the probability that an indicator can predict correctly the rank order of an arbitrary pair of distinct observed anesthetic depths. The measure has the value of 0.5 when the indicator has no useful predictive power and the value of 1 when the indicator predicts perfectly. Sample prediction probability PK and its estimated SE can be computed for any degree of coarseness or fineness of the anesthetic depth indicator and observed anesthetic depth scales, ranging from dichotomous to continuous. Confidence intervals can be determined, and grouped-data and paired-data statistical comparisons can be made of anesthetic depth indicator performance. Prediction probability PK is convenient for making comparisons because it is not dependent on linear or nonlinear scaling of x, or on the choice of some x threshold, and because its expected value is asymptotically independent of the number of experimental data points. Data for comparisons, however, must be gathered using the same stimulus procedure and for the same distribution of anesthetic depths.
The authors thank Kate Leslie, M.B.B.S., F.A.N.Z.C.A., Daniel I. Sessler, M.D., and the ANESTHESIOLOGY reviewers for many helpful suggestions; Mehernoor F. Watcha, M.D., for providing information on biserial correlation; and Paul Manberg, Ph.D., for discussions on logistic curve analysis.
Appendix 1. Probabilistic Interpretation of P sub K
Consider two cases drawn randomly, but with distinct values of observed anesthetic depth y. Assume that we use anesthetic depth indicator value x for the two cases to predict which case is deeper according to the following rule: (A) if the two indicator values are distinct, predict that anesthesia is deeper for the case with the greater indicator value; (B) if the indicator values are equal, randomly guess which case is deeper. Then PK is the probability of correctly predicting the deeper case from the indicator values.
We now prove this interpretation of PK as a prediction probability. The probability of a correct prediction of the deeper case by part A of the prediction rule equals the probability that the two cases are concordant. This probability, conditioned on the assumption of distinct observed depths, is Equation 4. Part B of the prediction rule applies if the indicator values are equal. The probability of repeated indicator values, again conditioned on distinct observed depths, is Equation 5.
According to the prediction rule, if the indicator values for the two cases are equal, then randomly guess which is the deeper case. Because there is a 50:50 chance that the guess is correct, the probability of a correct prediction by part B of the prediction rule is one half of Equation 5. The overall probability of a correct prediction of the deeper case, conditioned on distinct observed depths, is Equation 4 plus one half of Equation 5, which is the right-hand side of Equation 3, the definition of PK.
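This interpretation can also be checked numerically. The sketch below (hypothetical data) draws random pairs of cases with distinct observed depths, applies parts A and B of the prediction rule, and compares the empirical proportion of correct predictions with PK computed from its definition.

```python
import random

def pk(xs, ys):
    """Sample PK (concordances 1, indicator-only ties 1/2, discordances 0)."""
    conc = disc = tie_x = 0
    n = len(xs)
    for i in range(n):
        for j in range(i + 1, n):
            if ys[i] == ys[j]:
                continue
            if xs[i] == xs[j]:
                tie_x += 1
            elif (xs[i] < xs[j]) == (ys[i] < ys[j]):
                conc += 1
            else:
                disc += 1
    return (conc + 0.5 * tie_x) / (conc + disc + tie_x)

def simulate_rule(xs, ys, trials=100_000, seed=1):
    """Monte Carlo estimate of the probability that the prediction rule
    picks the deeper of two randomly drawn cases with distinct y."""
    rng = random.Random(seed)
    n = len(xs)
    correct = done = 0
    while done < trials:
        i, j = rng.randrange(n), rng.randrange(n)
        if ys[i] == ys[j]:
            continue                      # redraw: depths must be distinct
        if ys[j] < ys[i]:
            i, j = j, i                   # now case j is the deeper case
        if xs[i] != xs[j]:
            correct += xs[j] > xs[i]      # part A: greater x predicts deeper
        elif rng.random() < 0.5:
            correct += 1                  # part B: random 50:50 guess
        done += 1
    return correct / done

x = [1, 3, 2, 2, 6, 5]
y = [0, 0, 0, 1, 1, 1]
estimate = simulate_rule(x, y)
```

The Monte Carlo estimate converges to the PK value given by Equation 3 as the number of trials grows.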
Appendix 2. The Relationship between Kim's d sub y·x and Somers' d sub yx
Kim's dy·x and Somers' dyx are both "asymmetric" measures of association, in that they are intended to measure prediction accuracy in just one direction, specifically, to measure how well x predicts y. The order of the x and y subscripts for these measures denotes this particular direction of prediction.
The mathematical expression for Somers' dyx has the same form as that in Equation 1 for Kim's dy·x, except for one critical difference. To create the expression for Somers' dyx, the probability Ptx of an indicator-only (x-only) tie in the denominator of Equation 1 for Kim's dy·x is replaced by the probability Pty of a depth-only (y-only) tie. That is, both Kim's dy·x and Somers' dyx reward concordances and penalize discordances. However, whereas Kim's dy·x penalizes indicator-only ties and ignores ties in y, Somers' dyx penalizes depth-only ties and ignores ties in x. Because of this critical difference, Somers' dyx is not an appropriate predictive measure for our application.
Kim's dy·x, however, is structurally equal to Somers' dxy (note the reversal of subscript order). Thus, using Equation 2, we can apply theory developed for Somers' dxy to PK. Also, we can use commercial statistical programs that compute Somers' measure, such as the BMDP program 4F [50] (frequency tables), the SPSS command CROSSTABS, [51] and the SAS procedure FREQ, [52] to compute Kim's dy·x and, therefore, PK.
These programs print out estimates of Somers' measure for both directions of prediction, that is, dyx for x predicting y and dxy for y predicting x. To show the direction of prediction, the programs label the predicting variable as "independent" and the predicted variable as "dependent." We want to use indicator value x (independent variable) to predict observed depth y (dependent variable). To obtain a value for Kim's dy·x, we must select Somers' dxy (note the reversed subscripts) from the computer output, as though our goal were to use observed depth y to predict indicator value x, that is, as though y were the independent variable and x the dependent variable.
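To make the bookkeeping concrete, the sketch below (illustrative data) counts the four pair types directly and forms both Somers' measures. Taking Somers' dxy, which is structurally Kim's dy·x, and rescaling it as (d + 1)/2, which we take here to be the rescaling of Equation 2, recovers PK.

```python
def pair_counts(xs, ys):
    """Counts of concordant (c), discordant (d), indicator-only-tied (tx),
    and depth-only-tied (ty) pairs; pairs tied on both are set aside."""
    c = d = tx = ty = 0
    n = len(xs)
    for i in range(n):
        for j in range(i + 1, n):
            if xs[i] == xs[j] and ys[i] == ys[j]:
                continue                      # tied on both x and y
            if ys[i] == ys[j]:
                ty += 1                       # depth-only (y-only) tie
            elif xs[i] == xs[j]:
                tx += 1                       # indicator-only (x-only) tie
            elif (xs[i] < xs[j]) == (ys[i] < ys[j]):
                c += 1
            else:
                d += 1
    return c, d, tx, ty

x = [1, 2, 2, 3]
y = [0, 0, 1, 1]
c, d, tx, ty = pair_counts(x, y)
somers_dyx = (c - d) / (c + d + ty)   # penalizes depth-only ties
somers_dxy = (c - d) / (c + d + tx)   # structurally equal to Kim's d(y.x)
pk_value = (somers_dxy + 1) / 2       # rescaling to a prediction probability
```

Note how the indicator-only tie (the two cases with x = 2 but different y) lowers dxy and hence PK, while the depth-only ties do not, which is exactly the tie treatment argued for above.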
*Dutton RC, Smith WD, Smith NT: Does the EEG predict anesthetic depth better than cardiovascular variables? (abstract). ANESTHESIOLOGY 1990; 73:A532.
**Dutton RC, Smith WD, Smith NT: EEG prediction of arousal during anesthesia with combinations of isoflurane, fentanyl, and N2O (abstract). ANESTHESIOLOGY 1991; 75:A448.
***Watt RC, Samuelson H, Navabi MJ: A comparison of artificial neural networks and classical statistical analysis. ANESTHESIOLOGY 1991; 75:A451.
****The command macro, PKMACRO, written in Microsoft Excel 4.0 (Microsoft, Redmond, WA) for the Macintosh computer (Apple, Cupertino, CA), is available from the authors.