To the Editor:—

We write to make the case that the practice of providing a priori sample size calculations, recently endorsed in an Anesthesiology editorial,¹ is in fact undesirable. Presentation of confidence intervals serves the same purpose but is superior because it more accurately reflects the actual data, is simpler to present, addresses uncertainty more directly, and encourages more careful interpretation of results. The clinical trial report² lauded in the editorial in fact serves to illustrate the drawbacks of sample size calculation as a data analysis tool. The a priori calculation presented is based on assumptions about length of stay (normally distributed with an SD of 4.5 days) that did not hold in the actual data, an analysis (comparison of mean length of stay between two groups by t test) that was not presented, and a sample size that was not attained. It therefore does not help the reader interpret the results, which is the proper goal when reporting on a study that has been completed. The post hoc power calculation presented retains most of these deficiencies and therefore does not help the reader to assess the strength of evidence against a 1.0-day mean advantage for one treatment versus another. In contrast, a confidence interval for the difference in means would directly address this issue. Although the presence of outliers would require a bootstrapping method³ to obtain a valid confidence interval for a difference in means, this bit of extra effort is certainly worthwhile for the central issue of a study and is, in any case, much better than relying on convoluted reasoning with invalid power approximations.
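To indicate how modest this extra effort is, the following is a minimal sketch of a percentile bootstrap confidence interval for a difference in means. It is an illustration only, not the analysis of the trial discussed here: the function name `bootstrap_mean_diff_ci`, its parameters, and the two lists of lengths of stay are hypothetical and were invented for the example, with one outlier included on purpose. The simple percentile interval is used for brevity; refinements such as the bias-corrected and accelerated interval may be preferable in practice.

```python
import random
import statistics


def bootstrap_mean_diff_ci(group_a, group_b, n_resamples=5000, level=0.95, seed=0):
    """Percentile bootstrap CI for mean(group_a) - mean(group_b).

    Each group is resampled with replacement, independently of the other,
    and the difference in resampled means is recorded each time.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    diffs = []
    for _ in range(n_resamples):
        resample_a = rng.choices(group_a, k=len(group_a))
        resample_b = rng.choices(group_b, k=len(group_b))
        diffs.append(statistics.fmean(resample_a) - statistics.fmean(resample_b))
    diffs.sort()
    alpha = 1.0 - level
    # Take the empirical alpha/2 and 1 - alpha/2 quantiles of the differences.
    lower_index = int((alpha / 2) * n_resamples)
    upper_index = int((1 - alpha / 2) * n_resamples) - 1
    return diffs[lower_index], diffs[upper_index]


# Hypothetical lengths of stay in days (NOT data from any study);
# group_a contains one deliberately extreme value.
group_a = [5, 6, 6, 7, 7, 8, 8, 9, 10, 31]
group_b = [5, 6, 7, 7, 8, 8, 9, 9, 11, 12]

observed = statistics.fmean(group_a) - statistics.fmean(group_b)
low, high = bootstrap_mean_diff_ci(group_a, group_b)
print(f"Observed difference in means: {observed:.2f} days")
print(f"95% percentile bootstrap CI: {low:.2f} to {high:.2f} days")
```

The resulting interval can be compared directly with a difference considered clinically important, such as 1.0 day, which is precisely the assessment that a power calculation leaves to indirect reasoning.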

Perhaps the worst aspect of reporting sample size or power calculations is that it encourages interpretation of studies’ results based only on P values, in particular the widespread fallacy of interpreting P > 0.05 as proving the null hypothesis. The other article⁴ cited by the editorial provides a glaring example of this type of reasoning, concluding that reporting of sample size calculations did not change over time in any journal but did increase overall (see their fig. 2). Returning to the clinical trial report, consider the statement that death rates “were similar” in the four subgroups. Although this is an accurate characterization of what was actually observed, unsophisticated readers are liable to interpret it (contrary to the authors’ intentions) to mean that the study found strong evidence against any substantial difference in death rates. In fact, the exact⁵ 95% confidence interval around the odds ratio for death comparing intravenous versus epidural postoperative analgesia goes from 0.36 to 5.4, which is wide enough to make clear to most readers that this study by itself provides only very weak evidence against a clinically important difference in death rates.

We urge reviewers, editors, and those conducting studies of reporting quality to give authors full credit for providing confidence intervals instead of sample size calculations in reports of completed studies. Indeed, for the reasons illustrated here, it would be best to discourage the practice of using sample size and power calculations as substitutes for more direct assessment of uncertainty using confidence intervals.

1. Todd MM: Clinical research manuscripts in Anesthesiology. Anesthesiology 2001; 95: 1051–3
2. Norris EJ, Beattie C, Perler BA, Martinez EA, Meinert CL, Anderson GF, Grass JA, Sakima NT, Gorman R, Achuff SC, Martin BK, Minken SL, Williams GM, Traystman RJ: Double-masked randomized trial comparing alternate combinations of intraoperative anesthesia and postoperative analgesia in abdominal aortic surgery. Anesthesiology 2001; 95: 1054–67
3. Efron B, Tibshirani RJ: An Introduction to the Bootstrap. London, Chapman and Hall, 1993
4. Pua HL, Lerman J, Crawford MW, Wright JG: An evaluation of the quality of clinical trials in anesthesia. Anesthesiology 2001; 95: 1068–73
5. Mehta CR, Patel NR: Exact logistic regression: theory and examples. Stat Med 1995; 14: 2143–60