There are few issues facing the field that are more concerning and contentious than the possible neurotoxic effects of anesthetics on children. Although laboratory studies report that virtually all commonly used anesthetics induce neurodegeneration in the developing animal brain, observational studies in humans are less conclusive: some report an association between exposure to anesthesia/surgery and adverse neurobehavioral outcomes, whereas others do not.1 Among the many methodologic problems associated with human studies are the outcome measures available to the investigators.1,2 Because virtually all of these studies are retrospective, the outcome is not chosen by the investigator and therefore may not provide the most meaningful measure of the cognitive or behavioral effect. In addition, the various neurocognitive outcomes may or may not be comparable, as few studies have reported more than a single end point. In this issue of Anesthesiology, Ing et al.3 have attempted to provide a structured comparison of outcome measures representative of those found in most studies of this type. As in their previous publication,4 data from the Raine Study, a cohort of 2,868 children born from 1989 to 1992 in Western Australia, were examined for an association between exposure to anesthesia/surgery before the age of 3 yr and three distinct but closely related outcomes: direct neuropsychological testing, International Classification of Diseases, 9th Revision (ICD-9)-coded clinical disorders, and a group test of academic achievement. Of the 781 children included, 112 had been exposed to anesthesia/surgery; among those exposed, the risk of deficits on individual language assessments and of ICD-9 codes for language or cognitive disorders was increased. In contrast, exposed and unexposed children did not differ with regard to academic achievement. The authors conclude that these data explain some of the variation in the literature and underscore the importance of the outcome measure when interpreting studies of cognitive function. Similar findings have previously been noted in other studies using more than a single measure of neurodevelopment.5
A cursory review of the literature suggests that the majority of studies with negative results use broad measures of academic performance, such as group tests of achievement (e.g., the California Achievement Test and Danish standardized tests of achievement) and teacher-parent rating scales similar to those used in this study.6–9 Studies using individually administered tests of cognitive performance have been uniformly positive, most commonly in the areas of speech and language. The larger studies performed in Europe that utilize group tests (or similar measures) tend to be negative, whereas smaller studies using individual neurobehavioral tests are more frequently positive.
Utilization of ICD-9 codes in epidemiologic research is common, as administrative data are widely available and often represent the only source of information related to an outcome of interest. Unfortunately, errors in coding are exceedingly common and represent a source of significant bias.10 Attention deficit hyperactivity disorder (ADHD) provides an instructive example alluded to by the authors. Ing et al. utilized ICD-9 codes as a means of identifying relevant behavioral or cognitive outcomes, including ADHD, the diagnosis of which is clearly delineated within the Diagnostic and Statistical Manual of Mental Disorders, 4th Edition (DSM-IV). However, in studies of ADHD diagnostic accuracy, only one third of children diagnosed with ADHD have been subjected to DSM-IV criteria, and as many as two thirds of children with ADHD have a diagnosed learning disability that may or may not be identified with a specific ICD-9 code.11 It is therefore difficult to be certain whether a child has the outcome of interest (ADHD) or has a similar outcome that may confound the relationship (learning disability). In the case of the study by Ing et al., the problem of miscoding was magnified by assigning codes from parental reports of childhood illness rather than from medical records, an additional source of potential bias. Ing et al. somewhat inaccurately compare ADHD as an outcome in their study with that in the study by Sprung et al.12 The comparison provides an instructive example of how apparently identical outcome measures may differ in profound ways. In the study by Sprung et al., ADHD was diagnosed by strict DSM-IV criteria using a robust medical record and unique access to school records, information unavailable to Ing et al. In addition, Sprung et al., but not Ing et al., were able to separate children with ADHD alone from those with both a learning disability and ADHD, permitting the effects of these overlapping cognitive disorders to be examined separately. Consequently, the methodology of Ing et al. almost certainly overestimates the frequency of ADHD and cannot determine whether the observed differences are truly driven by ADHD or are the result of confounding between ADHD and learning disability. As such, these data should be compared with those of Sprung et al. with great caution, if at all.
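To illustrate how coding error of this kind can distort an association, consider a minimal sketch. The true rates and accuracy figures below are hypothetical, chosen purely for illustration; they are not estimates from Ing et al., Sprung et al., or the cited diagnostic-accuracy literature. Non-differential misclassification of the outcome inflates the apparent prevalence in both groups and pulls the observed odds ratio toward the null:

```python
# Hypothetical illustration of non-differential outcome misclassification.
# Sensitivity/specificity and true rates are assumptions for illustration,
# not measured properties of ICD-9 ADHD codes.

def observed_rate(true_rate, sensitivity, specificity):
    """Apparent outcome rate after imperfect classification."""
    return sensitivity * true_rate + (1 - specificity) * (1 - true_rate)

def odds_ratio(p1, p0):
    """Odds ratio comparing outcome rates p1 (exposed) and p0 (unexposed)."""
    return (p1 / (1 - p1)) / (p0 / (1 - p0))

p_exposed, p_unexposed = 0.15, 0.10   # hypothetical true ADHD rates
se, sp = 0.70, 0.90                   # hypothetical accuracy of the codes

p1_obs = observed_rate(p_exposed, se, sp)     # 0.19: frequency overestimated
p0_obs = observed_rate(p_unexposed, se, sp)   # 0.16: frequency overestimated

print(f"true OR:     {odds_ratio(p_exposed, p_unexposed):.2f}")  # ~1.59
print(f"observed OR: {odds_ratio(p1_obs, p0_obs):.2f}")          # ~1.23
```

In this sketch, the apparent prevalence rises in both groups (consistent with overestimation of ADHD frequency), while the association is attenuated. Differential miscoding, such as parents of exposed children recalling diagnoses more readily, could instead bias the estimate in either direction.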
The lack of an obvious human phenotype for anesthetic neurotoxicity represents a major obstacle to study design and interpretation. The study by Ing et al. is intended, in part, to identify a robust end point for evaluating existing work as well as for designing future studies that may be more informative. The unique feature of the data reported by Ing et al. is the extensive neurodevelopmental testing that was performed repeatedly for each of the studied subjects; no other study to date contains as much cognitive outcome data as this report and their previous publication based on the same cohort. Apart from the studies from the Mayo Clinic, those by Ing et al. are the only extant studies that contain data from individually administered tests of cognition, and it is striking that these studies are both positive and report disproportionate effects on speech and language. Nonetheless, as mentioned above, caution should be used when interpreting these data, as many of the outcomes are interrelated and the use of multiple tests increases the risk of a type I statistical error. Also noteworthy is the observation that 25% of the exposed group comprised children undergoing myringotomy, a population known to be at risk for later language and learning problems.13
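To put the multiple-testing concern in concrete terms (a textbook calculation, not one performed by Ing et al.): if k independent comparisons are each tested at significance level α, the probability that at least one is spuriously positive is

\[ P(\text{at least one false positive}) = 1 - (1 - \alpha)^{k}, \]

so ten independent tests at α = 0.05 already carry a 1 − 0.95^10 ≈ 0.40 chance of at least one false-positive finding before any correction (e.g., Bonferroni) is applied.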
Ing et al. suggest that group tests may lack sufficient sensitivity to detect small differences in performance between those exposed and those not exposed, but that such minor differences may not be clinically or academically meaningful. They also suggest that studies using large cohorts but insensitive outcomes are likely to be negative and should be interpreted with caution, whereas studies using individually administered tests of cognition may be more likely to be positive and can provide insight into phenotype (i.e., abnormalities in speech and language). However, the value of ICD-9 or other administrative data as an end point in this setting is unclear and awaits the results of studies that examine the correlation between such codes and direct testing across locations and time periods. Moreover, studies using comprehensive cognitive testing are laborious and expensive; the sample sizes in such studies will therefore invariably be small. If this approach is used more widely in the future, a possible consequence is the accumulation of underpowered studies that may either overestimate the effect of interest (type I error) or fail to detect a true difference (type II error) because of limited sample size. Indeed, similar concerns have been raised regarding studies of postoperative cognitive dysfunction (POCD) in the elderly.14,15 POCD researchers still have no tools that can reliably assess the presence of POCD, and increasing the number of tests used to classify POCD increases the sensitivity to change not only in postoperative patients but also in controls.14
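To make the sample-size concern concrete, the following sketch runs a standard two-sample power calculation with the statsmodels library. The effect size is a generic "small" standardized difference (Cohen's d = 0.2) chosen purely for illustration; only the exposed-group size of 112 comes from the study discussed above, and the actual analyses of Ing et al. were not simple t-tests.

```python
# Illustrative two-sample power calculation; the effect size is an
# assumption, not a value estimated by Ing et al. or any study cited here.
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()

# Sample size per group needed to detect a small effect (d = 0.2)
# at alpha = 0.05 with 80% power.
n_per_group = analysis.solve_power(effect_size=0.2, alpha=0.05, power=0.8)
print(f"n per group to detect d = 0.2: {n_per_group:.0f}")  # ~393

# Power actually achieved with 112 subjects per group
# (112 is the exposed-group size reported in this study).
achieved = analysis.solve_power(effect_size=0.2, alpha=0.05, nobs1=112)
print(f"power with n = 112 per group: {achieved:.2f}")  # ~0.32
```

Under even these generous assumptions, a cohort of this size has roughly a one-in-three chance of detecting a true small effect, which is why a proliferation of small, individually tested cohorts risks both missed effects and exaggerated positive reports.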
Ing et al. should be congratulated for their contribution to our understanding of the growing concerns related to the effects of exposure to anesthetic agents on young children. However, not all outcome measures are created equal: the devil is truly in the details, with regard not only to the outcome but also to many other aspects of study design and conduct not discussed here. Moreover, the problems encountered in the POCD literature suggest that one must ascertain under what circumstances individually administered cognitive tests are themselves meaningful human outcome measures. Indeed, exactly how different are individually administered tests of speech and language from school tests? Surely, good school test scores require adequate speech and learning skills.
Acknowledgments
Dr. Flick is supported by the Eunice Kennedy Shriver National Institute of Child Health and Human Development (Bethesda, Maryland) to study this topic (grant no. R01 HD 071907-01).
Competing Interests
The authors are not supported by, nor maintain any financial interest in, any commercial activity that may be associated with the topic of this article.