A biomarker may provide a diagnosis, assess disease severity or risk, or guide other clinical interventions such as the use of drugs. Although considerable progress has been made in standardizing the methodology and reporting of randomized trials, less has been accomplished concerning the assessment of biomarkers. Biomarker studies are often presented with poor biostatistics and methodologic flaws that preclude them from delivering a reliable and reproducible scientific message. This article discusses a host of issues that can improve the statistical evaluation and reporting of biomarker studies. Investigators should be aware of these issues when designing their studies, editors and reviewers when analyzing a manuscript, and readers when interpreting results.
HISTORICALLY, the term biomarker referred to analytes in biologic samples that predict a patient's disease state. However, the term has evolved over time to cover any biologic measurement, recently including genomic or proteomic analyses, that could also predict a response to a drug (efficacy, toxicity, or pharmacokinetics) or indicate an underlying physiologic mechanism.1 New biomarkers exploring the cardiovascular system, kidney, central nervous system, inflammation, and sepsis are under the scrutiny of bioengineering companies, and we are witnessing a biomarker revolution similar to the imaging technique revolution.2 Remarkably, this revolution has already occurred for cancer drugs.1
Assessment of these biomarkers is complex but valuable in perioperative and critical care medicine as markers of diagnosis, disease severity, and risk. Although considerable progress has been made in standardizing the methodology and reporting of randomized trials, less has been accomplished concerning the assessment of diagnostic and prognostic biomarkers. Analysis of the literature, even in prestigious journals, has revealed that the methodologic quality of diagnostic studies is on average poor.3 Recommendations concerning the reporting of diagnostic studies, the Standards for Reporting of Diagnostic Accuracy (STARD) initiative, have been published recently,4 several years after the first recommendations concerning the reporting of randomized trials.5 However, these recommendations do not encompass all issues of this rapidly evolving domain.
The purpose of this article is to provide the anesthesiologist with a comprehensive introduction to the problems, potential solutions, and limitations raised by the assessment of the diagnostic properties of modern biomarkers. It is important to appreciate the available statistical and methodologic tools to face the biomarker revolution, either as a clinical investigator or as a consumer of the scientific literature. This is no easy task, for we must now look beyond the classic diagnostic indices (sensitivity, specificity, and predictive values) and even beyond the more widely used receiver operating characteristic (ROC) curves by integrating the principles of Bayesian theory. To appreciate these issues, the different roles of a biomarker must first be explored.
Role of a Biomarker
A biomarker may serve different roles (table 1) and, thus, needs to meet several reporting goals. A biomarker may provide a diagnosis or assess severity (or a risk). For example, cardiac troponin I is a very sensitive and specific biomarker of myocardial infarction in the postoperative period of noncardiac surgery.6 In contrast, it is considered only a severity biomarker in pulmonary embolism,9 whereas procalcitonin is considered both a diagnostic and a severity biomarker of infection.8 Biomarkers are often used for risk stratification. For example, blood lactate levels have been proposed for risk stratification in sepsis.14 However, the purposes of the diagnostic and prognostic settings differ markedly. In the diagnostic setting, the outcome (the disease), although unknown, has already occurred, whereas in the prognostic setting, the outcome remains to be determined and can only be estimated as a probability or a risk, and the uncertain nature of this outcome should be considered.
There are several important hierarchical steps in demonstrating the clinical interest of a biomarker:
Demonstrate that the biomarker is significantly modified in diseased patients as compared with controls.
Assess the diagnostic properties of the biomarker.
Compare the diagnostic properties of the biomarker to existing tests.
Demonstrate that the diagnostic properties of the biomarker increase the ability of the physician to make a decision; this might be difficult to analyze because the timing of diagnosis may be crucial and not easy to identify. For example, although the accuracy of procalcitonin to diagnose postoperative infection after cardiac surgery was lower than that of physicians, procalcitonin enabled the diagnosis to be made earlier.7
Assess the usefulness of the biomarker, which should be clearly distinguished from the quality of the diagnostic information provided.15 Assessment of usefulness involves both characteristics of the test itself (cost, invasiveness, technical difficulty, rapidity) and characteristics of the clinical context (prevalence of the disease, consequences of the outcome, cost, and consequences of therapeutic options).
Demonstrate that the measurement of the biomarker modifies outcome (intervention studies). For example, several studies nicely demonstrated that a diagnostic strategy based on procalcitonin level reduces antibiotic use for acute respiratory tract infections, exacerbation of chronic obstructive pulmonary disease, and ventilator-associated pneumonia.16 However, intervention studies are lacking for many novel biomarkers or give conflicting results for others.17
For all stages of this process, it is important to understand the pathophysiologic mechanisms involved in the biomarker's synthesis and production, its kinetic properties, and its physiologic effects. For example, brain natriuretic peptide (BNP) is known to be released predominantly from the left cardiac ventricle in response to increased ventricular wall stretch, volume expansion, and overload. Its physiologic role includes systemic and pulmonary arterial vasodilation, promotion of natriuresis and diuresis, and inhibition of the renin-angiotensin-aldosterone system and of endothelins. In contrast, the pathophysiologic background of procalcitonin remains poorly understood.
A biomarker may also guide other clinical decisions, particularly concerning the use of drugs. This area is now widely developed in oncology, in which biomarkers are used to predict the efficacy and/or toxicity of a drug.1 For example, procalcitonin has been advocated to guide the clinician in deciding the duration of antibiotic therapy,13 and genetic determinants of the metabolic activation of clopidogrel have been shown to modulate the clinical outcome of patients treated with clopidogrel after an acute myocardial infarction.12 Finally, a biomarker can be used as a surrogate endpoint in a clinical trial,1,18 but this issue is beyond the scope of this review.
The Bayesian Approach
One way to conceptualize the utility of a biomarker is its value for enhancing our existing knowledge in predicting the probability of some outcome (e.g., disease state, prognosis). In this regard, Bayesian statistical methods provide a powerful system from which to update existing information about the likelihood of the occurrence of some disease or prognosis. A comprehensive introduction to these methods is far beyond the scope of this review, but the interested reader is referred to one of the several textbooks on applying Bayesian statistical methods to medical problems.19
Bayes' theorem uses two types of information to compute a predicted probability of the outcome. First, the prior probability (or pretest probability) of the outcome must be considered. For biomarker studies involving the diagnosis of a disease, this can be akin to the general prevalence of the disease in the population under study, or what we know about the base rate of the disease for this individual without any additional information. This information is combined with the predictive power of the biomarker (i.e., the ability of the test to discriminate between disease states) to adjust our prediction of the likelihood of the outcome. Stated simply, the predicted odds of a patient having the disease (posttest odds) can be calculated as20: posttest odds = (pretest odds) × (predictive power of the evidence), where odds = probability/(1 − probability).
Numerical examples of this calculation are given in Likelihood Ratios, where likelihood ratios (LHRs) are discussed. However, the use of Bayes' theorem to “update” our expectation of the presence of a disease is illustrated using Fagan's nomograms (fig. 1).21 In these examples, it can be seen how disease prevalence (pretest probability) is used in conjunction with the LHR (strength of evidence) to calculate an updated (posttest) probability of the disease.
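The arithmetic behind Fagan's nomogram is a multiplication on the odds scale; a minimal sketch follows (the function name and numerical values are ours, for illustration only):

```python
def posttest_probability(pretest_prob, lhr):
    """Update a pretest probability with a likelihood ratio (Bayes' theorem).

    The update is multiplicative on the odds scale, which is what Fagan's
    nomogram computes graphically: posttest odds = pretest odds x LHR.
    """
    pretest_odds = pretest_prob / (1.0 - pretest_prob)
    posttest_odds = pretest_odds * lhr
    return posttest_odds / (1.0 + posttest_odds)

# A positive test with LHR = 10 applied to a 10% pretest probability
# raises the probability of disease to roughly 53%:
p = posttest_probability(0.10, 10.0)

# An uninformative test (LHR = 1) leaves the probability unchanged:
unchanged = posttest_probability(0.30, 1.0)
```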
Although a powerful method for updating assumptions given the available information, there are many instances where an appropriate estimate of the pretest probability is not known (or agreed on). In such cases, different physicians might have different estimates of the probability of the disease for a given patient. Further, information may actually be available for one or more risk factors, but the unique combination of these factors may obscure the subjective probability for a given patient. For these applications, the sensitivity of the expectation can be checked against a range of assumptions (see Reclassification Table for more details).
The diagnostic performance of a biomarker is often evaluated by its sensitivity and specificity. Sensitivity is the ability to detect a disease in patients in whom the disease is truly present (i.e., a true positive), and specificity is the ability to rule out the disease in patients in whom the disease is truly absent (i.e., a true negative). Calculation of these indices requires knowledge of a patient's “true” disease state and a dichotomous prediction based on the biomarker (i.e., disease is predicted to be present or absent) to construct a 2 × 2 contingency table. Table 2 displays how the frequency of predictions from a sample of patients could be used in conjunction with their known disease state to calculate sensitivity and specificity.
Although sensitivity and specificity are the most commonly reported variables in diagnostic studies, they do not directly apply to many clinical situations: the physician would rather know the probability that the disease is truly present or absent given a positive or negative test than the probability of a positive test given the presence of the disease (sensitivity). These more clinically interesting probabilities are provided by the positive and negative predictive values. Table 2 presents the calculation of these predictive indices.
The diagnostic accuracy of a test is the proportion of correctly classified patients (i.e., the sum of true positive and true negative tests). Perhaps because it is the most intuitive index of diagnostic performance, diagnostic accuracy is sometimes reported as a global assessment of the test. However, the use of this index for this purpose is inherently flawed and produces unsatisfactory estimates under a range of situations, such as when the prevalence of the disease substantially deviates from 50%.22 It is recommended that authors report more than just a single estimate of diagnostic accuracy.
The Youden index Y = sensitivity + specificity − 1 represents the difference between the diagnostic performance of the test and the best possible performance23 (sometimes called the “regret,” defined as the utility loss because of uncertainty about the true state).24 Interestingly, accuracy is actually a weighted average of sensitivity and specificity, using the prevalence of the disease as weight. It should be clear that these five indices (sensitivity, specificity, negative and positive predictive values, and accuracy) are partially redundant, because knowing three of them enables calculation of the rest.
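The five indices, and the prevalence-weighted identity linking accuracy to sensitivity and specificity, can be verified with a short computation (the counts are hypothetical, chosen only for illustration):

```python
def diagnostic_indices(tp, fp, fn, tn):
    """Classic diagnostic indices from a 2 x 2 contingency table of counts."""
    sensitivity = tp / (tp + fn)            # P(test+ | disease)
    specificity = tn / (tn + fp)            # P(test- | no disease)
    ppv = tp / (tp + fp)                    # P(disease | test+)
    npv = tn / (tn + fn)                    # P(no disease | test-)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    youden = sensitivity + specificity - 1
    return sensitivity, specificity, ppv, npv, accuracy, youden

# Hypothetical counts: 90 true positives, 20 false positives,
# 10 false negatives, 80 true negatives (prevalence 50%).
sens, spec, ppv, npv, acc, youden = diagnostic_indices(90, 20, 10, 80)

# Accuracy is the prevalence-weighted average of sensitivity and specificity:
prevalence = (90 + 10) / 200
assert abs(acc - (sens * prevalence + spec * (1 - prevalence))) < 1e-12
```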
Influence of Prevalence
Although sensitivity and specificity are not markedly influenced by the prevalence of the disease, negative predictive value, positive predictive value, and accuracy are affected by prevalence. Figure 2 shows the influence of prevalence on the various diagnostic indices. This issue is of paramount importance because disease prevalence can markedly differ from one population to another. The prevalence of sepsis in an intensive care unit is high compared with that in the emergency department, and the positive predictive value and accuracy of procalcitonin for sepsis might markedly differ between these two settings. For example, Falcoz et al.25 reported that a procalcitonin level of 1 ng/ml had a positive predictive value of 0.63 for predicting postoperative infection after thoracic surgery. However, in that study, the prevalence of infection was 16%. If a post hoc analysis had been conducted restricting inclusion to patients with systemic inflammatory response syndrome criteria, the prevalence would have been 63% and the positive predictive value 0.90.
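The dependence of the positive predictive value on prevalence follows directly from Bayes' rule; a brief sketch (the sensitivity and specificity values are assumed for illustration and are not those of the cited study):

```python
def ppv_from_prevalence(sensitivity, specificity, prevalence):
    """Positive predictive value via Bayes' rule for a given prevalence."""
    true_pos = sensitivity * prevalence
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    return true_pos / (true_pos + false_pos)

# Illustrative indices only: with sensitivity 0.85 and specificity 0.90,
# the PPV rises sharply when prevalence goes from 16% to 63%.
ppv_low = ppv_from_prevalence(0.85, 0.90, 0.16)    # roughly 0.62
ppv_high = ppv_from_prevalence(0.85, 0.90, 0.63)   # roughly 0.94
```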
Although the mathematical calculation of sensitivity and specificity is not necessarily altered by prevalence, certain clinical situations may foster higher estimates.26,27 These indices can be influenced by case mix, disease severity, or risk factors for the disease.27 For example, a biomarker is likely to be more sensitive among severe than among milder cases of the disease. The sensitivity of procalcitonin to diagnose bacterial infection is greater in patients with meningitis than in patients with pyelonephritis.28
LHRs are another way of describing the prognostic or diagnostic value of a biomarker. Although we call them “diagnostic” LHRs, these ratios are LHRs in the true statistical sense and correspond to the ratios of the likelihood of the observed test result in the diseased versus nondiseased populations. Two dimensions of accuracy have to be considered: the LHR for a positive test (positive LHR) and the LHR for a negative test (negative LHR). One of the most interesting features of LHRs is that they quantify the increase in knowledge about the presence of disease that is gained through the diagnostic test. Thus, LHRs can also be regarded as Bayes factors, as the following formulas show: after a positive test, posttest odds of disease = (positive LHR) × (pretest odds of disease); after a negative test, posttest odds of disease = (negative LHR) × (pretest odds of disease).
Furthermore, LHRs are not dependent on disease prevalence and, thus, are considered a robust global measure of the diagnostic properties of a test; they can also be used with tests that have more than two possible results (see interval LHR).
More pragmatically, the positive LHR ranges from 1.0 to infinity and the negative LHR from 0 to 1.0. An uninformative test having no relation with the disease has LHRs of 1.0, whereas a perfect test would have a positive LHR equal to infinity and a negative LHR of 0 (table 3). For example, a common-sense translation of a positive LHR of 8.6 for a plasma soluble triggering receptor expressed on myeloid cells 1 value exceeding 60 ng/ml is that this value is obtained approximately nine times more often from a patient with sepsis than from a patient without sepsis.29 Experts usually consider that tests with a positive LHR greater than 10 (or a negative LHR less than 0.1) have the potential to alter clinical decisions. Although it could be tempting to follow definitive rules of thumb for interpretation (such as those provided in table 3), we must primarily consider the clinical setting to determine what level of increased likelihood is clinically relevant to improve the management of patients.
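For a dichotomized test, both LHRs follow directly from sensitivity and specificity; a sketch with illustrative values:

```python
def likelihood_ratios(sensitivity, specificity):
    """Positive and negative likelihood ratios from sensitivity/specificity.

    positive LHR = sensitivity / (1 - specificity)
    negative LHR = (1 - sensitivity) / specificity
    """
    positive_lhr = sensitivity / (1.0 - specificity)
    negative_lhr = (1.0 - sensitivity) / specificity
    return positive_lhr, negative_lhr

# An uninformative test (sensitivity = 1 - specificity) has both LHRs at 1.0:
uninformative = likelihood_ratios(0.7, 0.3)

# A strongly discriminating test (illustrative values):
strong = likelihood_ratios(0.95, 0.90)   # positive LHR 9.5
```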
Revisiting the nomograms in figure 1, several important issues become clear concerning diagnostic LHRs. First, no substantial change in prediction (expectation) is possible without a strong LHR. As might be expected, when an LHR provides no added information (e.g., LHR = 1.0), the pretest probability equals the posttest probability. Second, the pretest probability greatly influences what can be learned from even a very predictive biomarker (e.g., LHR = 10.0): very high (or low) pretest probabilities result in smaller adjustments of expectation than less extreme ones.
Basics of the ROC Curve.
The ROC curve is formed from a series of pairs (proportion of true positive results; proportion of false-positive results), that is, (sensitivity; 1 − specificity): the sequence of points obtained for different cutoff points can be joined to draw the empirical ROC curve, or a smooth curve can be obtained by appropriate fitting, usually using the binormal distribution (fig. 3).30,31 In other words, the positive LHRs calculated at various values of the diagnostic test can be plotted to produce a ROC curve; thus, a ROC curve is a graphical way of presenting the information contained in a table of LHRs. Graphically, the positive LHR is the slope of the line through the origin (sensitivity = 0; 1 − specificity = 0) and a given point on the ROC curve, whereas the negative LHR is the slope of the line through the point opposite the origin (sensitivity = 1; 1 − specificity = 1) and that given point on the ROC curve.
The area under the receiver operating characteristic curve (AUCROC, also called the c statistic or c index) is equivalent to the probability that the biomarker is higher in a diseased patient than in a control and, thus, is a measure of discrimination. By convention, ROC curves should be presented above the identity line (fig. 2), which represents a test without any value, performing like chance. It is important to note that the following points belong to the identity line: of course sensitivity = 0.50 and specificity = 0.50, but also sensitivity = 0.90 and specificity = 0.10, and sensitivity = 0.10 and specificity = 0.90. This makes clear that sensitivity cannot be interpreted without specificity. The AUCROC should be reported with confidence intervals (CIs) to allow statistical evaluation versus the identity line or statistical comparison versus other diagnostic tests (see Comparison of ROC Curves). Usually, biomarkers are considered to have good discriminative properties when the AUC is higher than 0.75 and excellent ones when it is higher than 0.90 (table 3). The ROC curve is a global assessment of test accuracy without any a priori hypothesis concerning the chosen cutoff, is relatively independent of prevalence, and is a simple plot that is easily appreciated visually. However, the cutoff points and the number of patients are not typically presented (although a small sample size is easily detected by a jagged and bumpy ROC curve). The generation of a ROC curve is no longer cumbersome because most statistical software provides the calculation and display of the relevant parameters.
There are three common summary measures for the accuracy described by a ROC curve. The first is simply to report the pair of values (sensitivity and specificity) associated with a chosen cutoff point. The second is the AUCROC, and the third is the area under a portion of the curve (partial area) for a prespecified range of values. Interpreting the AUCROC is somewhat problematic because a substantial portion of the variance in this index comes from values of the biomarker of no clinical relevance. One ROC curve may have a higher proportion of false positives than another in the region of clinical interest, yet the two curves may cross, leading to different conclusions when they are compared on the basis of the entire area. Therefore, it is recommended that examination of the ROC curve be conducted in terms of the partial area or the average sensitivity over a range of clinically relevant proportions of false positives in addition to the AUCROC.32
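The equivalence between the AUCROC and the probability that the biomarker is higher in a diseased patient than in a control can be computed directly from raw values; a brute-force sketch on invented data (real software uses more efficient rank-based algorithms):

```python
def auc_mann_whitney(diseased, controls):
    """AUC = P(marker in a diseased patient > marker in a control),
    with ties counted as 1/2 (the Mann-Whitney U / c statistic)."""
    wins = 0.0
    for d in diseased:
        for c in controls:
            if d > c:
                wins += 1.0
            elif d == c:
                wins += 0.5
    return wins / (len(diseased) * len(controls))

def empirical_roc(diseased, controls):
    """(1 - specificity, sensitivity) pairs for every observed cutoff."""
    points = []
    for cut in sorted(set(diseased) | set(controls)):
        sens = sum(d >= cut for d in diseased) / len(diseased)
        fpr = sum(c >= cut for c in controls) / len(controls)
        points.append((fpr, sens))
    return points

# Invented biomarker values for four diseased patients and four controls:
auc = auc_mann_whitney([3, 5, 7, 9], [1, 2, 4, 6])   # 13/16 = 0.8125
```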
Comparison of ROC Curves.
The first step of any ROC curve comparison should be a visual inspection of the graphical representations. This inspection allows the evaluation of large differences between AUCROC values and the detection of situations where the ROC curves cross. However, formal statistical testing is required to assess differences between the curves. Several different approaches are possible, and all must take into consideration the nature of the collected data. When the predictive value of a new biomarker is compared with an existing standard(s), two or more empirical curves are constructed based on tests performed on the same individuals. Statistical analysis of differences between these curves must take into account the fact that one individual contributes two scores to the analysis. Most biomarker studies collect data that are paired (i.e., measurements are correlated) in nature. Parametric approaches to these comparisons assume that there is a continuous spectrum of possible values of the biomarker for both diseased and nondiseased patients (generally true with biomarkers) and that the underlying distribution is Gaussian (normal). However, this assumption is often not tenable in biomarker studies. Despite this, paired parametric methods of ROC comparison, using an approach described by Hanley and McNeil,33 are often used to evaluate biomarkers. An alternative nonparametric paired method, described by DeLong et al.,34 is based on the Mann–Whitney U statistic. The two approaches yield similar estimates even in nonbinormal models.35
Two main limitations must be considered for global comparisons of ROC curves. First, this way of comparing two ROC curves is not precise, specifically when the two curves cross each other. Second, many cutoffs on the ROC curves are not considered in practice because their associated specificity and sensitivity are not clinically relevant. To reduce the impact of these limitations, comparisons of partial AUCROC within a specific range of specificity for two correlated ROC curves have been developed and might be interesting to consider for some biomarkers.36
Finally, to maximize the generalizability of the observed data, resampling methods have been proposed to compare ROC curves. This modern approach has become easier to conduct with the increase in computing power and seems to provide more accurate results for small sample sizes.
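A resampling comparison of two paired AUCROC values can be sketched as follows (a percentile bootstrap on invented paired data; a production analysis would rather use DeLong's method or a stratified bootstrap):

```python
import random

def auc(diseased, controls):
    """Mann-Whitney estimate of the AUC, ties counted as 1/2."""
    wins = sum((d > c) + 0.5 * (d == c) for d in diseased for c in controls)
    return wins / (len(diseased) * len(controls))

def paired_bootstrap_auc_diff(marker_a, marker_b, disease, n_boot=2000, seed=0):
    """Percentile bootstrap CI for AUC(a) - AUC(b) on paired data.

    marker_a, marker_b: values of two biomarkers measured on the SAME
    patients; disease: 1 for diseased, 0 for control. Resampling whole
    patients preserves the pairing (correlation) between the markers.
    """
    rng = random.Random(seed)
    n = len(disease)
    diffs = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        d = [i for i in idx if disease[i] == 1]
        c = [i for i in idx if disease[i] == 0]
        if not d or not c:
            continue  # degenerate resample: skip
        auc_a = auc([marker_a[i] for i in d], [marker_a[i] for i in c])
        auc_b = auc([marker_b[i] for i in d], [marker_b[i] for i in c])
        diffs.append(auc_a - auc_b)
    diffs.sort()
    return diffs[int(0.025 * len(diffs))], diffs[int(0.975 * len(diffs)) - 1]
```

If the 95% CI of the difference excludes 0, the two paired AUCROC values differ significantly at the 5% level.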
Determination of Cutoff.
The ROC curve is used to determine a cutoff point for clinical discrimination. The method used to choose this cutoff is crucial but unfortunately not always reported in published studies.37 In some situations, we do not wish to (or cannot) privilege either sensitivity (identifying diseased patients) or specificity (excluding control patients), and thus the cutoff point is chosen as the one that minimizes misclassification. Two techniques are often used to choose an “optimal” cutoff. The first (I) minimizes the mathematical distance between the ROC curve and the ideal point (sensitivity = specificity = 1) and thus intuitively minimizes misclassification. The second (J) maximizes the Youden index (sensitivity + specificity − 1) and thus intuitively maximizes appropriate classification (fig. 3).38 Interestingly, Perkins et al.39 present a sophisticated argument that the J point should be preferred, because I does not rely solely on the rate of misclassification but also on an unperceived quadratic term that is responsible for the observed differences between I and J.39
However, the use of I or even J may not be satisfactory, for two main reasons. First, this equipoise decision, which privileges neither sensitivity nor specificity, is valid only when the prevalence is 0.50; in other situations, the prevalence should be taken into account. Second, in many clinical situations, the researcher may wish to privilege either sensitivity or specificity because the consequences of false-positive and false-negative results are not equivalent in terms of the cost–benefit relationship. For example, it is clearly more crucial to rule out bacterial meningitis than to treat with antibiotics a patient who has viral meningitis, at least one due to enterovirus.40 The researcher should assign a relative cost (financial or health cost, from the patient, care provider, or society point of view) of a false-positive to a false-negative result and consider the prevalence P; these elements can be combined to calculate a slope m = (false-positive cost/false-negative cost) × ([1 − P]/P), the operating point on the ROC curve being the one that maximizes the function (sensitivity − m[1 − specificity]).15,41 Other methods include the ratio of the net cost of treating controls to the net benefit of treating diseased individuals, together with the prevalence.42,43
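The candidate cutoffs I and J and the cost-weighted optimum can be located by scanning the empirical ROC curve; a sketch (the data and the slope m are illustrative):

```python
def choose_cutoffs(diseased, controls, m=None):
    """Candidate 'optimal' cutoffs on an empirical ROC curve.

    Returns the cutoff minimizing the distance to the ideal corner (I),
    the cutoff maximizing the Youden index (J), and, if a slope m is
    given, the cutoff maximizing sensitivity - m * (1 - specificity).
    """
    best = {"I": (float("inf"), None),
            "J": (-float("inf"), None),
            "m": (-float("inf"), None)}
    for cut in sorted(set(diseased) | set(controls)):
        sens = sum(d >= cut for d in diseased) / len(diseased)
        spec = sum(c < cut for c in controls) / len(controls)
        dist = ((1 - sens) ** 2 + (1 - spec) ** 2) ** 0.5
        if dist < best["I"][0]:
            best["I"] = (dist, cut)
        if sens + spec - 1 > best["J"][0]:
            best["J"] = (sens + spec - 1, cut)
        if m is not None and sens - m * (1 - spec) > best["m"][0]:
            best["m"] = (sens - m * (1 - spec), cut)
    return {k: v[1] for k, v in best.items()}

# Perfectly separated toy data: every criterion picks the same cutoff (5).
cuts = choose_cutoffs([5, 6, 7, 8], [1, 2, 3, 4], m=1.0)
```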
In any case, the following recommendations apply: (1) the choice of the researcher must be clearly explained and justified; (2) the choice (or at least its methodology) must be decided a priori; (3) the ROC curve should be provided to allow readers to form their own opinion; and (4) the cutoff that maximizes the Youden index should also be indicated. It remains clear that a data-driven choice of cutoff tends to exaggerate the diagnostic performance of the biomarker.44 This bias should be recognized and probably concerns many published studies.
Surprisingly, although the cutoff point has a crucial role in the decision process, it is provided in most (if not all) studies without any CI. This may constitute a major methodologic flaw, particularly in small-sample studies, because the cutoff point might be markedly influenced by the values of very few patients, even when the CIs of the sensitivity and specificity associated with that cutoff are reported.45,46 The absence of a CI is probably related to the fact that more sophisticated statistical methods are required. The principle of all these methods is to perform multiple resampling of the studied population to provide a large sample of different populations, and hence a large sample of cutoff points, yielding a mean (or median) associated with its 95% CI. Several resampling techniques can be used (bootstrap, jackknife, leave-one-out, n-fold sampling).47,48 In a recent study, Fellahi et al.49 used a bootstrap technique to provide medians and 95% CIs for cutoff points of troponin Ic in patients undergoing various types of cardiac surgery, enabling the comparison of these different cutoff points. Here again, the CI enables the researcher and the reader to honestly communicate or interpret the values presented, taking the sample size into account.
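A percentile-bootstrap CI for a Youden-optimal cutoff can be sketched in a few lines (illustrative data; the jackknife and n-fold variants follow the same resampling principle):

```python
import random

def youden_cutoff(diseased, controls):
    """Cutoff maximizing the Youden index on the empirical ROC curve."""
    best_cut, best_j = None, -1.0
    for cut in sorted(set(diseased) | set(controls)):
        sens = sum(d >= cut for d in diseased) / len(diseased)
        spec = sum(c < cut for c in controls) / len(controls)
        if sens + spec - 1 > best_j:
            best_j, best_cut = sens + spec - 1, cut
    return best_cut

def bootstrap_cutoff_ci(diseased, controls, n_boot=1000, seed=1):
    """Percentile 95% CI for the Youden-optimal cutoff (a sketch)."""
    rng = random.Random(seed)
    cuts = []
    for _ in range(n_boot):
        d = [rng.choice(diseased) for _ in diseased]   # resample diseased
        c = [rng.choice(controls) for _ in controls]   # resample controls
        cuts.append(youden_cutoff(d, c))
    cuts.sort()
    return cuts[int(0.025 * n_boot)], cuts[int(0.975 * n_boot) - 1]
```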
The Gray Zone.
Another option for clinical discrimination is to avoid providing a single cutoff that dichotomizes the population and instead to propose two cutoffs separated by a “gray zone” (fig. 4). The first cutoff is chosen to exclude the diagnosis with near-certainty (i.e., privileging sensitivity), and the second to include the diagnosis with near-certainty (i.e., privileging specificity). When the value of the biomarker falls into the gray zone between the two cutoffs, uncertainty exists, and the physician should pursue the diagnosis using additional tools. This approach is probably more useful from a clinical point of view and is now more widely used in clinical research. Moreover, the two cutoffs and the gray zone define three intervals of the biomarker, each of which can be associated with a respective LHR: the positive LHR of the highest interval is used to include the diagnosis and the negative LHR of the lowest interval to exclude it. This option is often called the interval LHR50 and results in less loss of information and less distortion than choosing a single cutoff, providing an advantage in interpretation over a binary outcome and allowing the clinician to interpret the results more thoroughly, thereby improving clinical decision-making.
Here again, the 95% CIs of the cutoff points may be calculated using a resampling method,47,48 and the rules for choosing the cutoffs should be determined a priori and clearly explained and justified.
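Choosing the two cutoffs of a gray zone amounts to scanning the ROC curve for target sensitivity and specificity levels; a sketch (the 0.95 targets and the data are arbitrary illustration values):

```python
def gray_zone(diseased, controls, min_sens=0.95, min_spec=0.95):
    """Two cutoffs bounding a gray zone (a sketch).

    The lower cutoff privileges sensitivity: values below it exclude the
    diagnosis with near-certainty. The upper cutoff privileges specificity:
    values above it include the diagnosis with near-certainty.
    """
    lower = upper = None
    for cut in sorted(set(diseased) | set(controls)):
        sens = sum(d >= cut for d in diseased) / len(diseased)
        spec = sum(c < cut for c in controls) / len(controls)
        if sens >= min_sens:
            lower = cut   # highest cutoff still meeting the sensitivity target
        if spec >= min_spec and upper is None:
            upper = cut   # lowest cutoff reaching the specificity target
    return lower, upper

# Overlapping toy distributions: values between the two cutoffs are the
# gray zone, where additional diagnostic tools are needed.
zone = gray_zone([4, 5, 6, 7, 8], [1, 2, 3, 4, 5])   # -> (4, 6)
```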
Biomarkers' abilities to predict a disease are commonly evaluated using ROC curves. The improvement in AUCROC for a model containing the new biomarker is defined simply as the difference between the AUCROC values calculated using models with and without the biomarker of interest. This increase, however, is often very small in magnitude. Ware et al.51 and Pepe et al.52 describe examples in which large odds ratios are required to meaningfully increase the AUCROC. As a consequence, many risk factors that we know to be clinically important may not affect the c statistic very much. Thus, the ROC curve approach might be considered insensitive for evaluating the gain provided by biomarkers.52 Furthermore, ROC curves are frequently not helpful for evaluating biomarkers because they do not provide information about the actual risks or the proportion of participants who have high- or low-risk values. Moreover, when ROC curves for two biomarkers are compared, the models are aligned according to their false-positive rates (that is, different risk thresholds are applied to the two models to achieve the same false-positive rate), and this might be considered inappropriate.53 In addition, the AUCROC or c statistic has poor clinical relevance: clinicians are never asked to compare the risks of a pair of patients of whom one will eventually have the event and one will not. To complement the results obtained by ROC curves, new approaches to evaluating risk prediction have been proposed. One of the most interesting is the risk stratification table. This methodology better focuses on the key purpose of risk prediction, which remains to classify individuals into clinically relevant risk categories. Pencina et al.54 have recently proposed two ways of assessing improvement in model performance using reclassification tables: the Net Reclassification Index (NRI) and the Integrated Discrimination Improvement.
The NRI approach enables us to assess the ability of a biomarker to modify risk strata and alter clinical decisions. It requires a predefined risk stratification, usually expressed in several strata (>2); the use of 3 strata (high, intermediate, and low risk) is probably the most easily handled for routine clinical management.55 The NRI is the combination of four components: the proportions of individuals with events who move up or down a category and the proportions of individuals without events who move up or down a category. Because the NRI and its four components might be affected by the choice of risk stratification, lack of clear agreement on the categories that are clinically important could be problematic when using the NRI to assess new biomarkers. This concern is shared with the Hosmer-Lemeshow test. Again, prevalence, predictive values, cost, and benefit should probably be considered to make clinically relevant decisions.56 In contrast, the Integrated Discrimination Improvement does not require predefined strata and can be seen as a continuous version of the NRI, with differences in predicted probabilities of disease used instead of predefined strata. Alternatively, it can be defined as the difference between the mean predicted probabilities of events and of nonevents.
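The four components of the NRI can be computed directly from the before/after risk categories; a sketch on invented categories (events moving up and nonevents moving down count positively):

```python
def net_reclassification_index(old_cat, new_cat, event):
    """NRI from risk categories before/after adding a biomarker (a sketch).

    old_cat, new_cat: ordinal risk categories (e.g., 0 = low,
    1 = intermediate, 2 = high) assigned by the models without and with
    the biomarker; event: 1 if the outcome occurred, 0 otherwise.
    """
    up_ev = down_ev = up_ne = down_ne = 0
    n_event = sum(event)
    n_nonevent = len(event) - n_event
    for old, new, e in zip(old_cat, new_cat, event):
        if new > old:       # moved to a higher risk category
            up_ev += e
            up_ne += 1 - e
        elif new < old:     # moved to a lower risk category
            down_ev += e
            down_ne += 1 - e
    # Events should move up, nonevents should move down:
    return (up_ev - down_ev) / n_event + (down_ne - up_ne) / n_nonevent

# Invented data: three events (two reclassified upward) and two
# nonevents (one reclassified downward).
old = [0, 1, 1, 1, 1]
new = [1, 2, 1, 0, 1]
event = [1, 1, 1, 0, 0]
nri = net_reclassification_index(old, new, event)   # 2/3 + 1/2
```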
The NRI and Integrated Discrimination Improvement tables provide an important increase in the power to detect an improvement in risk stratification associated with the use of a new biomarker. Indeed, numerous clinical situations exist in which a seemingly small increase in AUCROC leads to a substantial improvement in reclassification by the NRI and/or the Integrated Discrimination Improvement table. This suggests that a very small increase in AUCROC might still reflect a meaningful improvement in risk prediction and that the exclusive use of the ROC curve is not sufficient to demonstrate that a biomarker is not useful. This is clearly an evolving domain of biostatistics,53 which should be seriously considered for perioperative medicine and risk stratification.
Common Pitfalls of the Evaluation of a Biomarker
A biomarker relies on a biologic assay, which is subject to measurement error. Thus, the precision of the measurement of the biomarker and its limit of detection should be provided (reproducibility). Moreover, the measurement of the biomarker should be sensitive (i.e., able to detect very low concentrations) and specific (i.e., measuring the biomarker itself without interference from other molecules, particularly those related to its metabolism). All these intrinsic properties are important to report and disclose to the reader.
Moreover, because most biomarkers are molecules produced by our cells and measured in blood, urine, or another organic fluid, the possibility that abnormal cells produce the biomarker through a pathophysiologic mechanism completely different from the known one(s) should always be considered. For example, there is some evidence that KIM-1, a biomarker of kidney injury, can be secreted by kidney cancer cells in the absence of renal injury.∥ This point may be difficult to assess because the physiology of a given biomarker is often incompletely known, and abnormal cells may produce almost anything. Normality may also differ between populations (e.g., troponin in cardiac surgery vs. other types of surgery).49
The analytical characteristics of any assay should be distinguished from its diagnostic characteristics.57 The terms “limit of detection,” “limit of quantitation,” and “minimal detectable concentration” are synonyms for analytical sensitivity. Polymerase chain reaction is considered a very sensitive test because it can detect a very low number of copies of a gene or gene fragment. However, despite this exquisite analytical sensitivity, its diagnostic sensitivity may be far from perfect when the target DNA is absent from the biologic material analyzed: this could be the case for a patient with endocarditis whose withdrawn blood samples do not contain any bacteria. In the same way, polymerase chain reaction can be considered an assay with exquisite analytical specificity, but its diagnostic specificity may be imperfect simply because of contamination.57
Numerical Expression of Diagnostic Variables
Most of these variables are percentages and thus can be expressed either using the unit “percent” or as a decimal with two digits: sensitivity may be presented either as 89% or 0.89. The important point is to ensure consistency throughout a given manuscript for a given variable and among all diagnostic variables. More importantly, because these variables are percentages, a CI (95% CI) should always be reported.58 The lower and upper limits of the 95% CI inform the reader about the interval in which 95% of all estimates of the measure (e.g., sensitivity, area under the curve) would fall if the study were repeated over and over again. When LHRs are reported, CIs that include 1 indicate that the study has not shown convincing evidence of any diagnostic value of the investigated biomarker. Thus, the reader does not know whether a test with a positive LHR of 20 but a 95% CI of 0.7–43 is useful. A study reporting a positive LHR of 5.1 with a 95% CI of 4.0–6.0 provides more precise evidence than another study arriving at a positive LHR of 9.7 with a 95% CI of 2.3–17. The sample size in critical care medicine studies is usually small, leading to wide CIs. Likewise, studies of diagnostic tests are too often underpowered to allow statistically sound inferences about differences in test accuracy.
The reporting of CIs enables the researcher and the reader to effectively communicate or understand the values presented, taking into account the uncertainty inherent in any sample size. This is particularly important because most of these variables are calculated using only a fraction of the whole population studied: for example, an interesting sensitivity of 0.90 in a large population of 500 patients (of whom only 10 have the disease) may seem less interesting when its 95% CI, 0.60–0.98, is considered. Likelihood and diagnostic ratios are ratios of probabilities but should also be reported with their CIs. Moreover, CIs enable the reader to directly interpret the statistical inference.58
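A CI for a proportion such as sensitivity can be computed in several ways; the Wilson score interval sketched below is one common choice (exact Clopper-Pearson intervals are another). Applied to the example above, 9 true positives among 10 diseased patients, it gives an interval close to the quoted 0.60–0.98.

```python
import math

def wilson_ci(successes, n, z=1.96):
    """95% Wilson score interval for a proportion, e.g., sensitivity =
    true positives / diseased patients. z = 1.96 for a two-sided 95% CI."""
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half
```

For small numbers of diseased patients, the interval is strikingly wide, which is precisely the point made in the text.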
Role of Time
In most clinical situations, the issue of the time of biomarker measurement is of limited interest, mainly because the time of onset of the pathologic process and/or disease is unknown. However, in other situations, the time of onset can be readily determined. This is the case for acute chest pain and the appearance in the blood of a biomarker of myocardial infarction. In that example, although troponin is recognized as an ideal biomarker (both very sensitive and very specific), it needs more time to become detectable than myoglobin, which is considered a poorer diagnostic biomarker but one that appears earlier (fig. 5).59 The importance of timing can be crucial in perioperative medicine, particularly in the postoperative period, because the timing of the insult (anesthesia/surgery) is precisely known. For example, in cardiac surgery, Fellahi et al.11 suggested that troponin should be measured 24 h after cardiopulmonary bypass to gain the maximum information. In contrast, the time profile of another biomarker such as BNP may be completely different in that surgical setting.60
The issue of time of measurement may also be crucial when considering the pathophysiologic processes assessed by biomarkers, which presupposes a clear understanding of those processes, something that is not always available. In sepsis, the simultaneous occurrence of proinflammatory processes (tumor necrosis factor-α and interleukin-6) and antiinflammatory processes (interleukin-10), and the complex interaction of therapeutics, may render the analysis of biomarkers difficult, particularly when the onset of the various infection processes (onset of infection vs. onset of severe infection vs. onset of shock) remains vague.
Diagnostic tests may vary substantially when measured in different patient populations, particularly when the studied populations differ in characteristics such as demographic features (age and sex) and spectrum of the disease (severity, acute vs. chronic illness, pathologic location or form).61 Moreover, a diagnostic test may work well in a global population but not in a given subgroup. For example, procalcitonin may not be a good biomarker of infection in pyelonephritis28 or intraabdominal abscess.62 Procalcitonin is also not a good biomarker of infection in a population exposed to heatstroke, even when half of the patients are truly infected, simply because heatstroke itself increases procalcitonin.63 In the perioperative period, the type of surgery may be an important cause of variation. The properties of cardiac troponin I for diagnosing postoperative myocardial infarction are fundamentally different in noncardiac versus cardiac surgery, simply because cardiac surgery alone is responsible for an important postoperative release of cardiac troponin, which has multiple causes: surgical cardiac trauma, extracorporeal circulation, and defibrillation.60 Even within cardiac surgery, different cutoff points of cardiac troponin to predict major postoperative cardiac events are observed when comparing cardiopulmonary bypass, valve, or combined surgery.49
All these important issues are usually summarized as spectrum bias. Therefore, precise information concerning the population studied and its case mix should be provided by researchers and understood by readers (table 4).
The issue of different populations can be analyzed more broadly as an issue of external influences (covariates). This is the case when factors other than the disease affect the biomarker, including factors that affect the test procedure (apparatus and centers), the value of the biomarker itself (see below for the influence of kinetics), or the relation of the biomarker to the outcome. Thus, adjustment for covariates may be an important component of the evaluation of biomarkers.64 When the covariate does not modify ROC performance, the covariate-adjusted ROC curve is an appropriate tool to assess classification accuracy and is analogous to the adjusted odds ratio in an association study. In contrast, when a covariate affects ROC performance, ROC curves for specific covariate groups should be used. Covariate adjustment may also be important when comparing biomarkers, even under a paired design, because unadjusted comparisons could be biased.64
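For reference, the empirical AUCROC is simply the probability that a randomly chosen case scores higher than a randomly chosen control, and one elementary form of covariate adjustment averages within-stratum AUCs. The sketch below is illustrative only: the pairs-based weighting is one possible convention (an assumption, not the only choice), and dedicated statistical packages implement more rigorous covariate-adjusted ROC methods.

```python
def auc(cases, controls):
    """Nonparametric AUCROC: the probability that a randomly chosen case
    scores higher than a randomly chosen control (ties count one half).
    Equivalent to the Mann-Whitney U statistic scaled to [0, 1]."""
    pairs = [(x > y) + 0.5 * (x == y) for x in cases for y in controls]
    return sum(pairs) / len(pairs)


def covariate_adjusted_auc(strata):
    """Pairs-weighted average of within-stratum AUCs. `strata` maps each
    covariate level to (case_values, control_values); this simple pooling
    is only appropriate when the covariate does not itself modify ROC
    performance, as noted in the text."""
    weights = {k: len(c) * len(n) for k, (c, n) in strata.items()}
    total = sum(weights.values())
    return sum(auc(*strata[k]) * w for k, w in weights.items()) / total
```

When the stratum-specific AUCs differ materially, reporting them separately (rather than pooling) is the safer option.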
Importance of the Biomarker Kinetics
A biomarker has its own kinetics, involving metabolism and elimination. This important issue has been poorly recognized, at least partly because the kinetics of biomarkers are often poorly investigated. Just as renal or liver insufficiency may influence the pharmacokinetics of drugs, it may also influence the kinetics of a biomarker and interfere with its diagnostic properties. For example, procalcitonin has recently been shown to be increased in patients with impaired renal function who undergo vascular surgery. This increase was observed in both infected and noninfected patients (fig. 6) and interferes with the cutoff point chosen but not with the diagnostic performance.65 This effect could be of paramount importance in the postoperative period or in the intensive care unit because these patients are more likely to present organ failures. When comparing two biologic forms derived from BNP, the active form and its prometabolite N-terminal prohormone brain natriuretic peptide, Ray et al.66 observed that the diagnostic properties of N-terminal prohormone brain natriuretic peptide were decreased compared with BNP, probably because of the differential impact of renal function on these two biomarkers in an elderly population.
One of the important variables associated with a decrease in organ function is age.10Because we are more frequently caring for elderly patients, it is important that biomarkers be tested not only in a middle-aged population but also in an elderly population.
The range of values reported for the sensitivity and specificity of any biomarker in different studies is often very wide. This variability is revealed by most meta-analyses of biomarker studies.67 Apart from differences concerning the cutoff point, which should be considered a definition issue, one of the most important reasons for this wide variation is that diagnostic studies are plagued by numerous biases68:
The main problem resides in the reference test used. In many clinical situations, the definitions of cases and controls do not rely on a perfect “gold standard” reference test (see below), and patients with an ambiguous classification are often ignored in the analysis. This leads to a case-control design that overestimates diagnostic accuracy.3,69 This bias (also called spectrum bias) may be associated with the largest bias effect.3
Selection bias occurs when nonconsecutive or nonrandomly selected patients are included.
The lack of blinding for the biomarker tested can also introduce bias, which usually overestimates diagnostic accuracy, although this effect seems to be relatively small.3
Verification bias is caused by the selection of a population of patients who actually receive the reference test, thus ignoring unverified patients; when not all patients are subjected to the reference test; or when different reference tests are used (partial verification bias).3 This bias may be particularly confounding when the decision to perform the reference test is based on the result of the studied test.
The test may produce an uninterpretable result. Although this problem is frequently not formally reported, because these observations are removed from the analysis, this practice can introduce bias. For subjectively interpreted tests (which may particularly be the case for biomarkers measured with rapid bedside techniques), interobserver variation can have a silent but important impact that may be neither estimated nor reported.
The accuracy of a biomarker may improve over time because of either improvement in the skills of the biologist or reader or improvement in technology. The measurement of cardiac troponin is a good illustration of progressive improvement in technology over the past decade, leading to a marked and progressive decrease in the cutoff defining normality.70
Finally, as for clinical trials, publication bias may occur because studies showing encouraging results have a higher likelihood of being published. This bias is important to consider for meta-analysis. In the absence of registration of diagnostic studies, it is difficult to estimate the impact of this source of bias.
Statistical Power Issue
Although frequently overlooked,71 statistical power considerations are as important for studies examining the diagnostic performance of a biomarker as for other types of research (e.g., clinical trials). Thus, all biomarker studies should include an a priori calculation of the number of patients to be included. The exact statistical power considerations relevant for interpreting a biomarker study depend on the nature or purpose of the study but generally focus on demonstrating that the sensitivity and/or specificity of a biomarker is superior to some stated value (e.g., sensitivity >0.75). Of note, this focus on sensitivity and specificity applies even when the predictive values are of greater interest (as they often are), because predictive values also depend on the prevalence of the underlying disease (fig. 2).
Calculating the required sample size to provide some level of desired statistical power (1 − β) is analogous to testing a difference from a theoretical proportion (i.e., a one-sample difference in proportions) using a one-sided CI. For the calculation, the sensitivity or specificity of the test is treated as a proportion and compared with a minimally acceptable value: a null hypothesis is posited that the sensitivity or specificity of the test is equal to the minimally acceptable value, with the alternative hypothesis that the test value is greater than this minimal value. To test this hypothesis, a type-I error rate must be specified (usually α = 0.05) to construct a one-sided CI (1 − α). Further, because of the nature of the inference being conducted, the desired statistical power is conservatively set to 95% (1 − β = 0.95). Finally, the expected sensitivity or specificity of the biomarker must be anticipated so that the difference between the two proportions can be used in the calculation.
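The calculation just described can be sketched with the usual normal approximation for a one-sample, one-sided comparison of a proportion with a minimally acceptable value. The values p0 = 0.75 and p1 = 0.90 in the test are hypothetical; for sensitivity, the result is the number of diseased subjects, which must be inflated by the expected prevalence to obtain the total sample size.

```python
import math
from statistics import NormalDist

def n_for_sensitivity(p0, p1, alpha=0.05, power=0.95, prevalence=None):
    """Number of diseased subjects needed to show that sensitivity exceeds
    the minimally acceptable value p0, assuming a true value p1, using a
    one-sided normal approximation. If `prevalence` is supplied, the total
    sample size is also returned."""
    z_a = NormalDist().inv_cdf(1 - alpha)  # one-sided type-I error
    z_b = NormalDist().inv_cdf(power)      # power = 1 - beta
    n = ((z_a * math.sqrt(p0 * (1 - p0)) + z_b * math.sqrt(p1 * (1 - p1)))
         / (p1 - p0)) ** 2
    n_diseased = math.ceil(n)
    if prevalence is None:
        return n_diseased
    return n_diseased, math.ceil(n_diseased / prevalence)
```

The inflation by prevalence makes the point in the text concrete: a sensitivity study in a low-prevalence condition may require a very large total enrollment even when the number of required diseased subjects seems modest.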
Although this process can seem daunting, several resources are available to assist researchers and consumers of biomarker research. First, the calculation is now routinely available in most statistical software applications; the assumptions must still be provided by the user, but the algorithms perform the calculation itself. Second, Flahault et al.72 have recently provided an extensive overview of the process, including tables of routinely encountered values. Finally, a growing list of internet sites host statistical power calculators for a variety of applications. Although many of these sites are not formally vetted, several are hosted by universities and are thus quite useful.
Nevertheless, as we advocate going beyond sensitivity and specificity in this review, it should be emphasized that calculation of the required sample size can now be based on the sensitivity at a particular false-positive rate,73 the AUCROC,73–75 including the partial AUCROC,73,76 or the reclassification indices.54 Moreover, the objective of a study could also be to determine the value of a cutoff or to compare two or more biomarkers. Diagnostic assessment of biomarkers thus involves many forms of statistical analysis, but sample size calculation remains feasible, even if sometimes difficult. Clinicians should first define the aim of the planned research and then evaluate their ability to conduct the corresponding power calculation. These techniques are not yet available in most common statistical software applications, and more advanced statistical software# may be dissuasive for occasional use by clinical researchers; the advice of a biostatistician may therefore be very helpful.
Imperfect Reference Test
In a diagnostic study, the reference test should be a gold standard, but in many clinical situations this is not possible. A universally recognized standard may not exist (e.g., cardiac failure), may not have been performed in many patients (e.g., autopsy), or logistically cannot be performed concurrently. For example, when evaluating BNP, echocardiography for heart failure is not always performed in the emergency department but is usually performed later during hospitalization.34 Moreover, in many situations, biomarkers are compared with scores derived from several clinical metrics of unknown reliability in place of a confirmed diagnosis: the Framingham score for heart failure with the biomarker BNP, the criteria of systemic inflammatory response syndrome and sepsis with procalcitonin,77 and the Risk, Injury, Failure, Loss, and End-stage Kidney (RIFLE) score for acute renal failure.78
When an imperfect reference test must be used, it should be recognized that measures of test performance can be distorted.79 Glueck et al.80 showed that when inappropriate reference standards are used, the observed AUCROC can be greater than the true area, the typical direction of the bias being a strong inflation of sensitivity with a slight deflation of specificity. Taken together, this information warrants the use of reliable reference standards that are not prone to such bias.
Several options are available to improve a reference standard when a gold standard does not exist or cannot be used. First, expert consensus can be used to define the diagnosis. For this task, at least three experts are needed, with a majority rule.81 These experts should have complete access to all available information except that concerning the biomarker test, to which they should be blinded. The statistical agreement between experts should be quantified and reported. A second option is to assign a probability value (i.e., 0–1) that corresponds to a subjective or derived (logistic regression using dedicated variables) probability that a patient has the disease. Third, one can use covariance information to estimate a model of the multivariate normal distributions of disease-positive and disease-negative patients when several accurate tests are being compared. Finally, one can transform the diagnostic problem into a clinical outcome problem.82
There are also situations in which the reference test outcome is not binary (yes or no) but ordinal or continuous. Obuchowski et al.83 proposed an ROC-type nonparametric measure of diagnostic accuracy for this setting: a discrimination test in which a diagnostic test is compared with a continuous reference test to determine how well it distinguishes the outcomes of the reference test.
New biomarkers not only modify our diagnostic process but can also change the definition of a disease.84 For example, cardiac troponin has progressively modified the definition of the diagnosis of myocardial infarction.70 Glasziou et al.84 have proposed three main principles that may assist the replacement of a current reference test: the consequences of the new test can be understood through disagreements between the reference and the new test; resolving disagreements between the new and reference tests requires a fair, but not necessarily perfect, “umpire” test; and possible umpire tests include causal exposures, concurrent testing, prognosis, and the response to treatment. A fair umpire test is one that does not favor either the reference or the new test and is thus considered unbiased.
STARD Statement for Diagnosis Studies
The STARD initiative was published recently to improve the quality of reporting for studies of diagnostic accuracy.4 Complete and accurate reporting of biomarker studies allows the reader to detect potential bias and judge the clinical applicability and generalizability of results. The STARD recommendations follow the template of the Consolidated Standards of Reporting Trials statement for the reporting of randomized controlled trials (RCTs).5 The STARD guideline attempts to improve the reporting of several factors that may threaten the internal or external validity of the results of a study, including design deficiencies, selection of patients, execution of the index test, selection of the reference standard, and analysis of the data. That these reporting improvements are needed is evidenced by a survey of diagnostic accuracy studies published in major medical journals between 1978 and 1993, which found generally poor methodologic quality and underreporting of key methodologic elements.85 Similar shortcomings were observed in most specialized journals.86
The STARD guideline (table 4) provides a checklist of 25 items to verify that relevant information is provided. Similar to the reporting of RCTs, a flow diagram is strongly recommended, with an item advocating the extensive use of CIs. Although the STARD initiative is a crucial step for improving reporting, because of the heterogeneity of available methods, only a few general recommendations concerning the statistical methods were offered.
Associated Clinical Predictors and/or Multiple Biomarkers
Pretest risk is rarely equal for all patients in a population, and clinical predictors, such as age, are most often present. The clinical question is therefore how much a new biomarker improves the risk stratification provided by the classic predictors (clinical and biologic). The risk prediction obtained with a new biomarker alone, even if excellent, may have no clinical application if it does not improve the risk stratification obtained with the usual predictors. The use of a risk prediction model (a statistical model that combines information from several predictors) is the most frequent approach. The purpose of a risk prediction model is to accurately stratify individuals into clinically relevant risk categories. Common types of models include logistic regression, Cox proportional hazards, and classification trees. Two nested models are then constructed and compared, one with the usual predictors and the other with the usual predictors plus the new biomarker.
In most clinical situations, clinicians want to apply more than one biomarker, and a multiple-biomarker approach is increasingly used in several domains of modern medicine. For example, stratification of cardiovascular risk in the general population is improved when several biomarkers, such as C-reactive protein, troponin, and BNP, are considered.87 After cardiac surgery, a multiple-biomarker approach has been shown to improve the prediction of poor long-term outcome when compared with the classic clinical Euroscore (fig. 7).60 There are two main approaches: (1) several biomarkers testing the same pathophysiologic process; and (2) different biomarkers testing different pathologic processes. For example, C-reactive protein may assess the postoperative inflammatory response, BNP the cardiac strain, and troponin any ischemic myocardial damage, all of them influencing final outcome in cardiac surgery.11
However, the results of different tests are usually not independent of each other, even if they assess different pathophysiologic mechanisms, indicating that sequential Bayesian calculations may not be appropriate. Other statistical methods, which take colinearity and interdependence into account, should be considered. With logistic regression, the regression coefficients can be used to calculate the probability of disease presence, and a simplified score can then be derived from these coefficients. Biases associated with multivariate analyses are common and can greatly impair the replication of these scores; the best way to limit this is to use both internal and external validation of the models.88
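To illustrate, once a logistic model has been fitted, its coefficients yield a predicted probability of disease, and a simplified integer score can be derived by scaling and rounding the coefficients. The coefficients and intercept below are hypothetical; rounding simplifies bedside use at the cost of some accuracy, which is one more reason why the resulting score must itself be validated internally and externally.

```python
import math

def risk_probability(coefs, intercept, values):
    """Predicted probability of disease from a fitted logistic model:
    p = 1 / (1 + exp(-(intercept + sum(b_i * x_i)))). The coefficients
    here are hypothetical placeholders for a real multivariable fit."""
    lp = intercept + sum(b * x for b, x in zip(coefs, values))
    return 1 / (1 + math.exp(-lp))

def simplified_score(coefs, values, unit=None):
    """Integer score: each coefficient is divided by the smallest
    coefficient (or a chosen unit) and rounded, a common way to derive
    a simple clinical score from regression coefficients."""
    unit = unit or min(abs(b) for b in coefs)
    return sum(round(b / unit) * x for b, x in zip(coefs, values))
```

The simplified score preserves the ranking of patients reasonably well but no longer returns a calibrated probability, so its risk strata must be re-anchored in the validation data.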
Meta-analysis of Diagnostic Studies
Systematic reviews are conducted to gain insight into the available evidence on a research topic. A meta-analysis is conducted in a two-stage process: summary statistics are first computed for each individual study, and a weighted average is then computed across studies.89 In this regard, meta-analyses of biomarker diagnostic studies are similar to other types of meta-analyses.67 For that reason, their reporting should generally follow existing guidelines for meta-analysis, such as the Preferred Reporting Items for Systematic reviews and Meta-Analyses (PRISMA).90
Despite these general similarities, the conduct of a meta-analysis of biomarker studies differs from meta-analyses of RCTs in several important ways. First, the assessment of study quality for diagnostic studies varies considerably from that of RCTs. Individual studies of the same biomarker can vary considerably in the choice of threshold, the population under study, and even the measurement of the biomarker or reference standard. The choice of patient recruitment strategy can also affect the assessment, with one study finding that recruiting patients and controls separately can lead to an overestimation of the test's diagnostic accuracy.3 To assist in the evaluation of study quality, several specialized tools have been created, such as the STARD guidelines (table 4) and the quality assessment of studies of diagnostic accuracy (QUADAS) tool, which has been used in systematic reviews.91 The accurate characterization of a biomarker's performance in a particular setting for a specific population depends on sorting through the available evidence to focus primarily on relevant studies of high quality.
The statistical techniques used to aggregate the results of biomarker studies also differ from those of meta-analyses of RCTs. The meta-analysis of diagnostic studies requires the consideration of two index measures (e.g., sensitivity and specificity), as opposed to a single index in the meta-analysis of an RCT.92 Heterogeneity in these indices is also expected from several different sources,3 and this heterogeneity must be considered in the statistical model used to pool the estimates.93 The choice of model and estimation strategy is not trivial, with several novel techniques, such as the hierarchical summary ROC94 and multivariate random-effects meta-analysis,95 offering distinct advantages over traditional approaches. For the interested reader, Deeks et al.92 offer an informative illustration of the meta-analytical process.
Studies of biomarkers, either as diagnostic or prognostic variables, are often presented with poor biostatistics and methodologic flaws that preclude them from providing a reliable and reproducible scientific message. This practice contrasts dramatically with bioengineering companies' prolific development of new biomarkers exploring the cardiovascular system, kidney, central nervous system, inflammation, and sepsis. To address this gap in methodologic quality, some recommendations have been produced recently, but even they do not cover the entire biomarker domain.4 Two main reasons may explain this situation. First, there is a widely recognized delay between the development of biostatistical techniques and their implementation in medical journals. Second, even in biostatistics, this domain has not been thoroughly explored and developed. Thus, there is an urgent need to accelerate the improvement of our methods of analyzing biomarkers, particularly concerning the use of the ROC curve, the choice of cutoff point (including the definition of a gray zone), appropriate a priori calculation of the number of patients to include, and the extensive use of validation techniques. Admittedly, if we look retrospectively at one of our own recent studies,66 we now realize that considerably more information could have been provided (fig. 8), which should be considered a promising and encouraging signal.96
There are important potential shortcomings in biomarker studies. Investigators should be aware of these when designing their studies, editors and reviewers when analyzing a manuscript, and readers when interpreting results.