A common type of study performed by anesthesiologists determines the effect of an intervention on pain reported by groups of patients. The goal of this study was to evaluate the effectiveness of t, analysis of variance (ANOVA), Mann-Whitney, and Kruskal-Wallis tests to compare visual analog scale (VAS) measurements between two or among three groups of patients. These results may be particularly helpful during the design of studies that measure pain with a VAS.
One VAS measurement was obtained from each of 480 nulliparous women in labor who were receiving oxytocin (149), nalbuphine (159), or epidural bupivacaine (172). Multiple simulated samples were then drawn from these data. These simulated samples were used in computer simulations of clinical trials comparing VAS measurements among groups. t and ANOVA tests were performed before and after an arcsin transformation was used, to make the data closer to a normal distribution. VAS measurements were also compared after they were divided into five ranked categories.
The statistical distributions of VAS measurements were not normal (P < $10^{-7}$). Arcsin transformation made the distributions closer to normal distributions. Nevertheless, no statistical test incorrectly suggested that a difference existed among groups, when there was no difference, more often than the expected rate. t or ANOVA tests had a slightly greater statistical power than the other tests to detect differences among groups. Because arcsin transformation both decreased differences among means and reduced the variance to a lesser extent, it decreased power to detect differences among groups. Statistical power to detect differences among groups was not less for a five-category VAS than for a continuous VAS.
We conclude that t and ANOVA, without an accompanying arcsin transformation, are good tests to find differences in VAS measurements among groups.
Key words: Measurement techniques, pain: visual analog scale. Statistics: power analysis. Study design: computer simulation; Monte Carlo simulation.
A common problem in anesthesia is to assess the effect of some intervention on levels of pain. For example, one may want to know whether patients who have a wound infiltrated with local anesthetic have less pain postoperatively. Visual analog scale (VAS) measurements are frequently used for pain assessment.* Subjects mark, along a 10-cm line, the position that denotes the severity of their pain. The distance from the marked point to the bottom of the scale is used as a measure of pain. For guidance, the phrases "no pain" and "worst imaginable pain" are placed at the bottom and top of the line, respectively.
The choice of a statistical test to analyze VAS data is controversial. Procedures based on the rank order of data (nonparametric procedures) have an advantage. They do not incorrectly suggest that a difference exists among groups more often than specified by the nominal false-positive or type 1 error rate, regardless of whether the statistical distribution of VAS measurements is normal or not. Scott and Huskisson have recommended that nonparametric methods be used to analyze VAS measurements. Procedures that assume that the VAS measurements follow a normal distribution (parametric procedures) also have an advantage. If the distribution of the data were normal, the tests would have the highest power (i.e., ability) to detect differences among groups. However, even if the distribution were not normal, parametric tests usually give false-positive rates that are close to the nominal value. Therefore, Philip recommended that parametric statistics be used to analyze VAS data. A recent literature review of papers that analyzed VAS measurements showed that 54% of the studies used a nonparametric test (Mann-Whitney or Kruskal-Wallis), and 34% used a parametric test (t test or analysis of variance [ANOVA]). It appears no consensus has emerged in the scientific literature.
The goal of this study was to use computer simulation, based on actual patient data, to evaluate the accuracy and effectiveness of several statistical tests to compare VAS measurements among groups. The Mann-Whitney and Kruskal-Wallis tests (nonparametric procedures) determine whether two or more than two groups have the same mean rank, respectively; t and ANOVA tests (parametric procedures) check whether two or more than two groups have the same mean, respectively. In addition, t and ANOVA are often performed after an arcsin transformation has been done on the VAS measurements. The ends of the VAS limit the value of the VAS measurements. Therefore, before data transformation, the variance of VAS measurements clustered at the ends of the VAS line is less than the variance of the measurements located in the middle of the line. The transformation spreads out VAS results at the extremes, but does not affect VAS measurements toward the middle. By doing so, it brings the distributions of VAS data closer to normal distributions, which can increase the power of the parametric tests to detect differences among groups.
Materials and Methods
VAS data were obtained from previously published obstetric anesthesia studies performed at the University of Iowa Hospitals and Clinics. [6,7] The methodologies were described in those studies and are briefly reported here to put the VAS measurements into an appropriate clinical context. The previously published studies [6,7] examined whether early administration of epidural analgesia affected obstetric outcome. Informed consent was obtained from nulliparous women with a singleton fetus in a vertex presentation who requested epidural analgesia during labor at 36 weeks' gestation or later. All women had a lumbar epidural catheter placed, and in all women the cervix was 3-5 cm dilated. One group ("oxytocin") of these women (n = 149) was receiving intravenous oxytocin at the time of randomization. Each of these women reported a VAS measurement before she received analgesia. A second group ("epidural") of women in spontaneous labor (n = 172) received early epidural bupivacaine analgesia. A VAS measurement was obtained from each woman 30 min after she received analgesia. The third group ("nalbuphine") of women in spontaneous labor (n = 159) reported a VAS measurement 30 min after they received nalbuphine 10 mg intravenously.**
VAS measurements were obtained in a standard fashion. The VAS ranged from 0 to 10 cm. The lines were vertical. The labels 'no pain' and 'worst possible pain' were printed at the bottom and top of the line, respectively. For each group, Lilliefors' test was used to determine whether the observed distributions of VAS measurements were normal.
Groups compared were oxytocin-oxytocin, epidural-epidural, oxytocin-oxytocin-oxytocin, epidural-epidural-epidural, oxytocin-nalbuphine, nalbuphine-epidural, and oxytocin-nalbuphine-epidural. We chose the first four permutations because, as reported in the Results, the oxytocin and epidural groups' distributions were the least well described by a normal distribution. We chose the latter three permutations because, as reported below, these comparisons minimized differences among the group means. Separate Monte Carlo simulation studies were done with 5, 10, 20, 30, 40, or 50 patients per group. First, VAS measurements were randomly selected with replacement from the measured VAS distributions. Second, various statistical tests were used to compare VAS data among the groups. Third, the first and second steps were repeated 3,999 times. Each time new random samples were drawn from the measured groups. Reported percentages give the fraction of the 4,000 simulated clinical studies for which each test rejected the hypothesis of no difference among groups at the 0.05 level (i.e., for which each test found a significant difference).
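The three steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the code used in the study: the function names (`mann_whitney_p`, `rejection_rate`), the synthetic populations `high` and `low`, and the seeds are hypothetical. Plain resampling with replacement stands in for the interpolated empirical distribution, and the Mann-Whitney P value is taken from the two-sided normal approximation with a tie correction.

```python
import random
from statistics import NormalDist


def mann_whitney_p(x, y):
    """Two-sided Mann-Whitney P value (normal approximation, tie-corrected)."""
    pooled = sorted((value, group)
                    for group, sample in enumerate((x, y))
                    for value in sample)
    n1, n2 = len(x), len(y)
    n = n1 + n2
    rank_sum_x = 0.0
    tie_term = 0
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        mid_rank = (i + 1 + j) / 2  # average of ranks i+1 .. j
        rank_sum_x += mid_rank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        tie_term += (j - i) ** 3 - (j - i)
        i = j
    u = rank_sum_x - n1 * (n1 + 1) / 2
    var_u = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u <= 0:
        return 1.0
    z = (u - n1 * n2 / 2) / var_u ** 0.5
    return 2 * (1 - NormalDist().cdf(abs(z)))


def rejection_rate(populations, n_per_group, p_value,
                   n_sim=4000, alpha=0.05, seed=0):
    """Fraction of simulated studies in which p_value(...) < alpha."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_sim):
        # Step 1: draw each group with replacement from its measured values.
        samples = [rng.choices(pop, k=n_per_group) for pop in populations]
        # Step 2: apply the statistical test to the simulated study.
        if p_value(*samples) < alpha:
            rejections += 1
    # Step 3 is the loop itself: n_sim simulated studies in total.
    return rejections / n_sim


# Hypothetical 0-10 cm scores for illustration only, NOT the study data.
_rng = random.Random(1)
high = [round(min(10.0, max(0.0, _rng.gauss(7.5, 2.0))), 1) for _ in range(150)]
low = [round(min(10.0, max(0.0, _rng.gauss(6.0, 2.5))), 1) for _ in range(150)]

type1 = rejection_rate([high, high], 20, mann_whitney_p)  # null is true
power = rejection_rate([high, low], 20, mann_whitney_p)   # null is false
print(f"type 1 error rate: {type1:.3f}, power: {power:.3f}")
```

Comparing a population with itself estimates the type 1 error rate; comparing two different populations estimates power. Any other test can be substituted by passing a different `p_value` function.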
Mathematical details of the algorithms are listed, in arbitrary order, in this paragraph. The 95% confidence limit on the type 1 error rates equaled $\pm 1.96\sqrt{(0.05)(1 - 0.05)/4{,}000} = \pm 0.007$. Therefore, proportions were accurate to two decimal places. If we had identified an appropriate theoretical distribution to represent the data, we would have used it for generation of random VAS measurements. We expected beta distributions would be appropriate, because they describe random variables that range between finite lower (0-cm) and upper (10-cm) limits. However, we found that beta distributions fit the data poorly. Thus, VAS measurements were randomly obtained from the measured statistical distributions by the standard method, which uses a continuous piecewise linear empirical distribution function. In brief, the VAS measurements were sorted in ascending order. Uniformly distributed random numbers were generated between 1 and the number of VAS measurements. Random VAS measurements were obtained by linear interpolation between corresponding VAS measurements. When comparing two groups, we used one-sided tests to decrease computation time. Generally, one-sided tests have less accurate type 1 error rates; thus, our simulation results are applicable to two-sided tests. The t test assumed a common variance between groups. Significances of the Mann-Whitney statistics were found by referring them to the chi-square distribution with one degree of freedom, after adjustment for ties among the ranks of the VAS measurements. The arcsin transformation of each VAS measurement equaled the arcsin in radians of the square root of the VAS measurement divided by the maximum VAS measurement. Computer code*** was written and compiled in the CA-Realizer BASIC dialect.
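Two of the details in this paragraph, the interpolated empirical sampling and the arcsin transformation, might look like the following Python sketch. The names `sample_empirical` and `arcsin_transform` and the five example values are hypothetical. The handling of the fractional position is one reasonable reading of the description, not a reproduction of the original BASIC code.

```python
import math
import random


def sample_empirical(sorted_vas, rng):
    """Draw one value from a piecewise linear empirical distribution.

    sorted_vas must be in ascending order. A uniform position between
    rank 1 and rank n is drawn, and the value is interpolated linearly
    between the two measurements that bracket that position.
    """
    n = len(sorted_vas)
    position = rng.uniform(1, n)
    lower = int(math.floor(position))   # 1-based rank at or below position
    upper = min(lower + 1, n)
    fraction = position - lower
    low_value = sorted_vas[lower - 1]
    high_value = sorted_vas[upper - 1]
    return low_value + fraction * (high_value - low_value)


def arcsin_transform(vas, vas_max=10.0):
    """Arcsin (in radians) of the square root of VAS / maximum VAS."""
    return math.asin(math.sqrt(vas / vas_max))


# Hypothetical measurements in cm, for illustration only.
sorted_vas = sorted([0.0, 1.5, 3.0, 7.0, 10.0])
rng = random.Random(0)
draws = [sample_empirical(sorted_vas, rng) for _ in range(5)]
print([round(d, 2) for d in draws])
print([round(arcsin_transform(v), 3) for v in sorted_vas])
```

The transformation maps 0 cm to 0 and the maximum (10 cm) to pi/2, with 5 cm mapping to pi/4.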
Data Collection and Transformation
By reviewing data from previously published clinical studies, we identified one VAS measurement from each of 480 women. They were receiving oxytocin (149), nalbuphine (159), or epidural bupivacaine (172). The means ± standard deviations of the VAS measurements, on a 0- to 10-cm scale, were 7.7 ± 1.8, 6.1 ± 2.6, and 2.2 ± 2.2 cm for the oxytocin, nalbuphine, and epidural groups, respectively.**** The distributions were not normal (P = 0.0006, P = 0.008, and P < $10^{-8}$ for the oxytocin, nalbuphine, and epidural groups, respectively) (Figure 1). The percentages of women rating their pain at 10 cm were 13%, 5%, and 1% for the oxytocin, nalbuphine, and epidural groups, respectively. The percentages of women rating their pain as 0 cm were 0%, 1%, and 16%, respectively. Because some women reported their pain as 0 cm, simple data transformations such as the logarithm and the reciprocal could not be used: neither is defined at 0.
Arcsin transformation of the VAS measurements had both beneficial and disadvantageous effects. After transformation, the statistical distributions were closer to normal distributions. From this beneficial effect of transformation, the parametric tests (t and ANOVA) would be expected to be more accurate, and have a greater power to detect differences among groups. As expected, the arcsin transformation reduced the variance. It also reduced differences among means (as might be expected) but to a greater extent than it decreased the variance. For example, the t statistic comparing VAS measurements between the oxytocin and nalbuphine groups equaled 6.1 and 5.9 before and after transformation, respectively. As a result, arcsin transformation of the VAS data does not necessarily increase the power to detect differences among groups. We assessed the overall effect by using computer simulation.
Computer Simulation with Two Groups
Four thousand simulation studies were done to compare VAS measurements between two groups of patients (oxytocin vs. oxytocin). For each simulation, two samples were randomly selected from the measured distribution of VAS measurements. We recorded the percentage of the simulated studies for which each test rejected the hypothesis of no difference between groups at the 0.05 level (Table 1). Essentially, then, we found the percentage of the studies for which the computed P value was less than 0.05 (i.e., the false positive or type 1 error rate). By using a 0.05 criterion, we set up the computer simulation so that the correct false-positive rate was 5%. In other words, because the null hypothesis of no difference between groups was true, we expected a type 1 error rate of exactly 5%. No statistical test had rates greater than 5% (Table 1). This encouraging result implies that none of the tests would incorrectly suggest that a difference exists between groups when there is no difference, more often than the expected type 1 error rate of 5%.
Power Analysis with Two Groups
We examined the power of the statistical tests to detect a difference in VAS measurements between two groups, when a difference exists. Groups compared were oxytocin versus nalbuphine and nalbuphine versus epidural. As before, Table 1 gives the percentages of 4,000 simulated studies for which the P value was less than 0.05 (i.e., for which a significant difference was found). However, because we compared different groups, there was no single correct percentage. Rather, larger percentages implied a greater power to detect the difference between the two groups. Because the null hypothesis of no difference between groups was not true, we expected increasing the number of patients per group to increase statistical power.
Three results are apparent from Table 1. First, the difference in mean VAS measurements between the nalbuphine and epidural groups (6.1 - 2.2 = 3.9 cm) exceeded the difference between the oxytocin and nalbuphine groups (7.7 - 6.1 = 1.6 cm). Therefore, powers were generally greater when comparing nalbuphine with epidural (Table 1). Second, generally, increasing the number of patients per group caused power to increase at a rate exceeding the square root of the relative change in group size. For example, increasing the number of patients from 5 to 50 caused the t test's power for oxytocin versus nalbuphine (Table 1) to increase 0.97/0.25 = 3.9-fold, not $\sqrt{50/5}$ = 3.2-fold. These results show that, as expected, modest increases in very small sample sizes (e.g., 5 patients per group) can dramatically increase the statistical power to detect differences between groups. Third, Student's t test had a slightly greater power than the Mann-Whitney or the t test after arcsin transformation of the data (Table 1). This implies that fewer patients would be needed in a clinical study, if the t test were used.
Computer Simulation with Three Groups
Four thousand simulation studies were done to compare VAS measurements among three groups of patients receiving oxytocin. As before, we recorded the percentage of studies for which each test rejected the hypothesis of no difference among groups at the 0.05 level (Table 2) (i.e., the false-positive or type 1 error rate). The correct false-positive rate was 5%. No statistical test had rates greater than 5% (Table 2). This result implies that none of the tests would incorrectly suggest that a difference exists among groups, when there is no difference, more often than the expected type 1 error rate of 5%.
Power Analysis with Three Groups
We examined the power of the statistical tests to detect differences in VAS measurements among the three groups: oxytocin, nalbuphine, and epidural (Table 2). Larger percentages imply a greater power to detect differences among the three groups. Analogous to the two group results, ANOVA had a slightly greater power than the Kruskal-Wallis test or ANOVA after arcsin transformation of the data (Table 2). This implies that fewer patients would be needed in a clinical study, if ANOVA were used.
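For readers who want to see the nonparametric three-group comparison concretely, the following Python sketch computes the Kruskal-Wallis statistic with a tie correction. The function names and the example values are hypothetical. The P value helper relies on the fact that, for exactly three groups, the reference chi-square distribution has two degrees of freedom, whose upper tail is exp(-H/2); it should not be used for other numbers of groups.

```python
import math


def kruskal_wallis_h(*groups):
    """Kruskal-Wallis H statistic with the standard correction for ties."""
    pooled = sorted((value, index)
                    for index, sample in enumerate(groups)
                    for value in sample)
    n = len(pooled)
    rank_sums = [0.0] * len(groups)
    tie_term = 0
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        mid_rank = (i + 1 + j) / 2  # tied values share the average rank
        for k in range(i, j):
            rank_sums[pooled[k][1]] += mid_rank
        tie_term += (j - i) ** 3 - (j - i)
        i = j
    h = (12 / (n * (n + 1))
         * sum(r * r / len(g) for r, g in zip(rank_sums, groups))
         - 3 * (n + 1))
    correction = 1 - tie_term / (n ** 3 - n)
    return h / correction if correction > 0 else 0.0


def p_value_three_groups(h):
    """Upper tail of chi-square with 2 degrees of freedom (three groups only)."""
    return math.exp(-h / 2)


# Hypothetical scores in cm, for illustration only.
h = kruskal_wallis_h([1, 2, 3], [4, 5, 6], [7, 8, 9])
print(round(h, 3), round(p_value_three_groups(h), 4))
```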
Analysis with Five Categories
Sometimes VAS measurements are recorded not as a distance from 0 to 10 cm but as a scale, which may have as few as five categories. The number of different categories on the scale varies, but rarely are fewer than five categories included. We evaluated what effect having only five categories would have on the results of the computer simulation. To do so, we converted the measured VAS measurements into five blocks. Measurements of 0-2 cm became 1; 2-4 cm became 2; and so forth.***** Using this method to convert from the continuous VAS to the five-point system, the ranks that would have been selected the most were 5, 4, and 1 for the oxytocin, nalbuphine, and epidural groups, respectively. Fifty percent of the women would have rated their pain as 4-5, 3-4, and 1 for the oxytocin, nalbuphine, and epidural groups, respectively.
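The conversion to five blocks can be written as a one-line rule. In the Python sketch below, the function name `to_five_categories` is hypothetical, and the treatment of boundary values (exactly 2, 4, 6, or 8 cm go to the higher category, and 10 cm stays in category 5) is an assumption, because the text does not say how boundaries were assigned.

```python
def to_five_categories(vas_cm):
    """Map a 0-10 cm VAS measurement to a category from 1 to 5.

    Assumption: each 2-cm block is closed on the left, so 2.0 cm falls
    in category 2, and the top block also includes 10.0 cm.
    """
    if not 0.0 <= vas_cm <= 10.0:
        raise ValueError("VAS measurement must lie between 0 and 10 cm")
    return min(5, int(vas_cm // 2) + 1)


# Hypothetical measurements in cm, for illustration only.
print([to_five_categories(v) for v in (0.0, 1.9, 2.0, 5.5, 9.9, 10.0)])
```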
Using five ranked categories rather than a continuous scale did not affect our results qualitatively (Table 3). As before, we recorded the percentage of studies for which each test rejected the hypothesis of no difference among groups at the 0.05 level (i.e., the false-positive or type 1 error rate). The expected false-positive rate was 5%. No statistical test had rates greater than 5% (Table 3). This result implies that none of the tests would incorrectly suggest that a difference exists between groups, when there is no difference, more often than the expected type 1 error rate of 5%. t and ANOVA, without an accompanying arcsin transformation, had the greatest power to detect difference(s) among groups. Statistical power to detect differences among groups was not less for the five-category VAS (Table 3) than for the continuous VAS (Table 1 and Table 2).******
The general validity of our conclusions depends predominantly on the assumption that the observed distributions of VAS measurements are reasonable representations of the distributions at other centers or for nonobstetric patients. This weakness of our study reflects the inherent drawback to any single center clinical study. The significant feature of the distributions was that they were close enough to a normal distribution for the parametric tests (t and ANOVA) to do well. These characteristics are likely to hold at other centers. Therefore, we expect that our conclusions can be applied quite liberally, with one caveat. None of the three groups of patients had many (> 16%) patients who ranked their pain as 0 or 10 cm. Parametric tests may do poorly when many patients specify an extreme value. Assume that a researcher were to find that many patients were specifying one of the two extremes. Then, a nonparametric test (Mann-Whitney or Kruskal-Wallis) may be a better choice than a parametric method. However, the researcher should also question the validity of her pain measurements. The VAS test may have been administered in a way that made it too insensitive to detect subtle changes in pain.
Sometimes researchers must use a discrete VAS (e.g., when studying pain in young children). A five-category VAS had as high a power as the corresponding continuous VAS (Table 1, Table 2, Table 3). Therefore, the researchers need not expect to use more patients solely because they are using a five-category rather than continuous VAS. However, our results do not support the routine use of discrete VAS. A continuous VAS should be used, whenever possible, because it is an accurate instrument to measure pain. [2,5] We compared statistical power and not the accuracy of pain measurement.
Our findings will be most helpful to researchers for study design and for analysis of studies that include few patients. For example, assume that VAS data were collected on many patients (e.g., with more than 100 per group). Then, a statistician could use a statistical method (e.g., Box and Cox power transformation), to find a mathematical transformation that makes the statistical distribution of the VAS data closer to a normal distribution. However, when studies are small (e.g., fewer than 50 patients per group), the use of these approaches is difficult to justify. Furthermore, these methods rarely appear in the anesthesiology literature. Our results show that researchers would be making a reasonably good decision by simply using a t test or ANOVA. Researchers should be cautious to not blindly apply our results if they expect a greater percentage (> 16%) of their patients to rank their pain as 0 or 10 cm. However, despite this caveat, researchers can justifiably use the most powerful statistical tests, t or ANOVA, when comparing VAS measurements among small groups. They should not feel compelled to use non-parametric methods because the sample sizes are small and the statistical distribution not known. Identifying appropriate statistical tests for small clinical studies is important, because decreasing the number of patients in a study decreases its costs, both financial and ethical. Furthermore, researchers prospectively do not know what mathematical transformation to make to the VAS data they have not yet collected to make the data normally distributed. Our study shows that power analyses to find suitable sample sizes can be done using equations applicable to a t test or ANOVA.
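As an illustration of such a power analysis, the Python sketch below applies the usual normal approximation to the two-group t test. It is not the authors' calculation. The function names are hypothetical, the common standard deviation is assumed to be the root mean square of the two reported group standard deviations, and the approximation ignores the small-sample correction of the t distribution, so it is slightly optimistic for very small groups. The means and standard deviations are those reported above for the oxytocin and nalbuphine groups.

```python
import math
from statistics import NormalDist

_norm = NormalDist()


def approx_power(delta, sd, n_per_group, alpha=0.05, sides=1):
    """Approximate power of a two-group comparison of means."""
    z_alpha = _norm.inv_cdf(1 - alpha / sides)
    noncentrality = delta / (sd * math.sqrt(2 / n_per_group))
    return _norm.cdf(noncentrality - z_alpha)


def approx_n_per_group(delta, sd, power=0.80, alpha=0.05, sides=2):
    """Approximate patients per group needed to reach the requested power."""
    z_alpha = _norm.inv_cdf(1 - alpha / sides)
    z_beta = _norm.inv_cdf(power)
    return math.ceil(2 * (z_alpha + z_beta) ** 2 * sd ** 2 / delta ** 2)


# Oxytocin versus nalbuphine, from the reported summary statistics.
delta = 7.7 - 6.1                                  # difference in means, cm
sd = math.sqrt((1.8 ** 2 + 2.6 ** 2) / 2)          # assumed common SD, cm

print(round(approx_power(delta, sd, 50), 2))       # one-sided, 50 per group
print(approx_n_per_group(delta, sd))               # two-sided, 80% power
```

With 50 patients per group and a one-sided test, the approximation gives a power close to the 0.97 quoted from Table 1 for the t test.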
In summary, we evaluated the accuracy and effectiveness of several statistical tests that can compare VAS measurements among groups. No statistical test incorrectly suggested that a difference existed among groups, when there was no difference, more often than the expected type 1 error rate. If a researcher were to detect a difference among groups with any one of the tests, the result would be statistically reliable. The t or ANOVA tests, for differences between two or more than two groups' means, respectively, had slightly greater power than the other tests to detect difference(s) among groups. Furthermore, parametric methods have the advantage that much theory is available to guide experimental design and post hoc analysis, more so than for nonparametric tests. Therefore, we conclude that t and ANOVA are good choices to compare VAS measurements among groups. We expect that these results will be most helpful during the design of studies that will assess pain with a VAS. Our results may, however, not hold when many (> 16%) patients rank their pain at one of the two extremes.
*Olsen S, Nolan MF, Kori S: Pain measurement: An overview of two commonly used methods. Anesthesiology Review 6:11-15, 1992.
**In the published study, 162 women received nalbuphine. We used 159 VAS measurements because we performed the current analysis before the last 3 patients were enrolled in the study.
***The computer code is available on request from the corresponding author.
****Skewness equaled -0.82, -0.36, and 1.20 for the oxytocin, nalbuphine, and epidural groups, respectively.
*****We converted the continuous VAS to the five-point system to permit statistical analysis. However, we caution that we are not advocating use of a five-point system. A patient presented with a 10-cm scale may choose values in a way that does not map directly to a five-point scale (i.e., a patient may not map 0-2 cm into a value of 1, and so forth, as we did).
******To check our results, we repeated the simulations with a two-category VAS. As expected, the statistical power was much less than with the five-category VAS, especially for the small group sizes.