### Statistical Methods to Evaluate Biomarkers

In the early phase of biomarker development, the association between a biomarker and the outcome is often assessed using regression models, reporting odds ratios, hazard ratios, or estimates of relative risk to quantify this association, preferably including an assessment of the biomarker's value over established biomarkers or clinical characteristics. A prospective design is preferable because it facilitates clear inclusion criteria, data collection procedures (minimizing missing data), and standardization of measurements, and it ensures all relevant clinical information is measured. Registering a protocol and prespecifying the study objectives, biomarkers of interest, and statistical methods will reduce publication bias and selective reporting.^{[1]}

A commonly used approach to estimate biomarker discrimination and the incremental value of a biomarker is to calculate the area under the receiver operating characteristic curve (AUC).^{[15]} The receiver operating characteristic curve is formed by plotting the false positive rate (1 − specificity) on the x-axis against the true positive rate (sensitivity) on the y-axis. The AUC quantifies the discriminative ability of the biomarker, ranging from 0.5 (*i.e.*, no better than flipping a coin) to 1 (*i.e.*, perfect discrimination). Discrimination is the ability of the biomarker to differentiate those with and without the event (*e.g.*, quantifying whether those with the event tend to have higher biomarker values than those who do not).
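
The AUC has a useful rank-based interpretation: it equals the probability that a randomly selected patient with the event has a higher biomarker value than a randomly selected patient without the event (ties counted as half). A minimal sketch in Python, using made-up biomarker values (the data below are illustrative, not from any cited study):

```python
def auc(event_values, nonevent_values):
    """AUC as the probability that a randomly chosen patient with the
    event has a higher biomarker value than one without (ties count 0.5)."""
    wins = 0.0
    for e in event_values:
        for n in nonevent_values:
            if e > n:
                wins += 1.0
            elif e == n:
                wins += 0.5
    return wins / (len(event_values) * len(nonevent_values))

events = [2.1, 3.4, 3.9, 5.0]      # biomarker values in patients with the event
nonevents = [1.0, 1.8, 2.5, 3.0]   # biomarker values in patients without
print(auc(events, nonevents))      # 0.875
```

This pairwise comparison is exactly what the trapezoidal area under the empirical receiver operating characteristic curve computes.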

So-called "optimal" biomarker thresholds are often determined by maximizing the Youden index (maximum [sensitivity + specificity − 1]), *i.e.*, selecting the biomarker value that maximizes the sum of sensitivity and specificity.^{[1]} However, such an approach is problematic if the biomarker is used either to rule out (high sensitivity) or to confirm (high specificity) a diagnosis; in those situations, negative and positive likelihood ratios can instead be used to select thresholds. A 95% CI around any "optimal" cutoff should be reported (*e.g.*, obtained by bootstrap resampling).^{[1,16]} Furthermore, dichotomization (and, more generally, categorization) of a biomarker is biologically implausible, as no threshold of a biomarker exists that causes a sudden change in risk (*e.g.*, there is typically no reason why a person's risk on either side of a cut-point should be dramatically different).
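
A sketch of choosing a cutoff by the Youden index and attaching a bootstrap percentile interval, using hypothetical data (a positive test is taken as biomarker value ≥ cutoff; all names and data are illustrative):

```python
import random

def sens_spec(values, labels, cutoff):
    """Sensitivity and specificity when value >= cutoff is called positive."""
    tp = sum(1 for v, y in zip(values, labels) if y == 1 and v >= cutoff)
    fn = sum(1 for v, y in zip(values, labels) if y == 1 and v < cutoff)
    tn = sum(1 for v, y in zip(values, labels) if y == 0 and v < cutoff)
    fp = sum(1 for v, y in zip(values, labels) if y == 0 and v >= cutoff)
    return tp / (tp + fn), tn / (tn + fp)

def youden_cutoff(values, labels):
    """Cutoff maximizing sensitivity + specificity - 1 over observed values."""
    best = None
    for c in sorted(set(values)):
        se, sp = sens_spec(values, labels, c)
        j = se + sp - 1
        if best is None or j > best[0]:
            best = (j, c)
    return best[1]

def bootstrap_ci(values, labels, n_boot=2000, seed=1):
    """95% percentile interval for the Youden cutoff by resampling patients."""
    rng = random.Random(seed)
    pairs = list(zip(values, labels))
    cuts = []
    for _ in range(n_boot):
        sample = [rng.choice(pairs) for _ in pairs]
        vs, ys = zip(*sample)
        if len(set(ys)) < 2:   # need both classes in the resample
            continue
        cuts.append(youden_cutoff(vs, ys))
    cuts.sort()
    return cuts[int(0.025 * len(cuts))], cuts[int(0.975 * len(cuts))]

values = [1, 2, 3, 4, 5, 6, 7, 8]
labels = [0, 0, 0, 0, 1, 1, 1, 1]
print(youden_cutoff(values, labels))  # 5 (perfect separation in this toy data)
```

The width of the bootstrap interval is a useful reminder of how unstable these "optimal" cutoffs typically are in realistic, overlapping data.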

Categorization (including dichotomization) of a continuous measurement (*e.g.*, biomarkers) should therefore be avoided during statistical analysis, as it will result in a loss of information and negatively impact predictive accuracy.^{[17–19]} The statistical analysis should ideally retain continuous measurements on their original scale, allowing for nonlinear relationships to be considered (using restricted cubic splines or fractional polynomials).^{[20]}
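
One way to let a continuous biomarker have a nonlinear association with the outcome is to expand it into restricted cubic spline terms. A minimal sketch of the nonlinear basis terms under Harrell's parameterization (knot placement and scaling conventions vary between software packages, so treat this as illustrative):

```python
def rcs_basis(x, knots):
    """Nonlinear basis terms of a restricted cubic spline: k knots yield
    k - 2 terms (used alongside the linear term x itself). The construction
    constrains the fitted curve to be linear beyond the outer knots."""
    t = knots
    k = len(t)
    def pos3(u):
        return max(u, 0.0) ** 3  # truncated cubic (x - t)+^3
    scale = (t[-1] - t[0]) ** 2  # common scaling so terms stay on x's scale
    terms = []
    for j in range(k - 2):
        term = (pos3(x - t[j])
                - pos3(x - t[k - 2]) * (t[k - 1] - t[j]) / (t[k - 1] - t[k - 2])
                + pos3(x - t[k - 1]) * (t[k - 2] - t[j]) / (t[k - 1] - t[k - 2]))
        terms.append(term / scale)
    return terms
```

A regression model then includes `x` plus these terms as predictors; testing their coefficients jointly against zero assesses nonlinearity. In practice, established implementations (*e.g.*, spline functions in standard statistical software) should be used rather than hand-rolled bases.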

To assess the incremental value of a novel biomarker when added to a clinical model or a standard biomarker, the difference in AUC between two prediction models (improvement in discrimination) is often used.^{[21]} Methods such as the DeLong nonparametric test and the Hanley and McNeil method are then used to compare the AUC of the biomarker under investigation against that of an already established biomarker or clinical model assessed in the same set of individuals.^{[22,23]} The main limitation of comparing AUCs is that a relatively large "independent" association is needed to produce a meaningfully larger AUC for the new biomarker. In response to this insensitivity, reclassification methods (*e.g.*, net reclassification index, integrated discrimination index) have been proposed and are described in Table 2.^{[8]} However, despite their popularity, it has since been shown that these approaches offer little more than existing approaches and can be unreliable in certain situations.^{[24]} In particular, reclassification methods have been shown to have inflated false positive rates when testing the improved predictive performance of a novel biomarker.^{[25,26]} Approaches based on net benefit using decision analytic methods are now widely recommended, as they allow a meaningful assessment of a new biomarker against an established biomarker (or combination of biomarkers) by weighing the benefit of correct decisions (true positives) against the relative harm of incorrect ones (false positives).^{[21,27,28]} The comparison is made across all (or a range of) risk thresholds to evaluate whether the new biomarker adds clinical utility.
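
At a risk threshold p_t, net benefit counts true positives as gains and weights false positives by the odds p_t/(1 − p_t), all on a per-patient scale: NB = TP/n − (FP/n) × p_t/(1 − p_t). A sketch of a decision-curve comparison with hypothetical predicted risks (the strategy with the highest net benefit at a clinically sensible threshold is preferred; "treat all" and "treat none" are the reference strategies):

```python
def net_benefit(predicted_risks, outcomes, threshold):
    """Net benefit of treating every patient whose predicted risk is at or
    above the threshold: TP/n - (FP/n) * threshold / (1 - threshold)."""
    n = len(outcomes)
    tp = sum(1 for p, y in zip(predicted_risks, outcomes) if p >= threshold and y == 1)
    fp = sum(1 for p, y in zip(predicted_risks, outcomes) if p >= threshold and y == 0)
    return tp / n - (fp / n) * (threshold / (1 - threshold))

# Hypothetical predicted risks from a model including the new biomarker:
risks = [0.05, 0.20, 0.40, 0.70, 0.90]
outcomes = [0, 0, 1, 1, 1]
for t in (0.1, 0.2, 0.3):
    nb_model = net_benefit(risks, outcomes, t)
    nb_all = net_benefit([1.0] * len(outcomes), outcomes, t)  # treat everyone
    print(t, round(nb_model, 3), round(nb_all, 3))  # treat-none has net benefit 0
```

Plotting net benefit of each strategy across a range of thresholds gives the decision curve; a model with the new biomarker has added clinical utility only where its curve lies above both the established model and the default strategies.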

#### Clinical Risk Prediction Models Using Biomarkers

Clinical prediction models are typically developed using regression models (*e.g.*, logistic regression or Cox regression). Logistic regression is mainly used for short-term binary outcomes (*e.g.*, mortality, postoperative myocardial infarction), while survival methods (such as Cox regression) are used for time-to-event outcomes and allow for censoring. Methods for handling missing data should be considered before analysis (*e.g.*, multiple imputation).^{[29]} Predictors with a high amount of missing data can be problematic, as this indicates the measurement is infrequently performed in daily practice, potentially limiting a biomarker model's usefulness. The choice of which variables to include in a model needs consideration: variables should have clinical relevance and be readily available at the intended moment of use of the model. The functional form of any continuous variables (*e.g.*, biomarkers) should be appropriately investigated using fractional polynomials or restricted cubic splines to fully capture any nonlinearity in their association with the outcome.^{[17,20]} The number of candidate predictors to consider in multivariable modeling has historically been constrained relative to the number of outcome events, a concept called events-per-variable, to minimize the risk of overfitting (a condition where a statistical model describes random variation in the data rather than the true underlying relationship).^{[30]} It was widely recommended that studies should only be carried out when the events-per-variable ratio exceeds 10. However, the events-per-variable concept has recently been refuted as having no strong scientific grounds.^{[31,32]} More recently, context-specific sample size formulae have been developed to minimize the potential for overfitting; these depend not only on the number of events relative to the number of candidate predictors (*i.e.*, those considered for inclusion, not necessarily those that end up in the final model), but also on the total number of participants, the outcome proportion, and the expected predictive performance (Box 1 and Box 2).^{[33]}
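
As an illustration, one criterion in the cited sample size framework targets a minimum expected uniform shrinkage factor S (commonly 0.9), given the number of candidate predictor parameters and an anticipated Cox–Snell R². A simplified sketch (the full framework combines several criteria, so this single formula understates what is required):

```python
import math

def min_sample_size(n_params, r2_cs, shrinkage=0.9):
    """Minimum n so that the expected uniform shrinkage factor is at least
    `shrinkage`, given the number of candidate predictor parameters and an
    anticipated Cox-Snell R-squared (one criterion of the cited framework)."""
    return math.ceil(n_params / ((shrinkage - 1) * math.log(1 - r2_cs / shrinkage)))

# e.g., 10 candidate parameters and an anticipated Cox-Snell R^2 of 0.15:
print(min_sample_size(10, 0.15))  # 549
```

Note how the requirement grows as the anticipated R² shrinks: weakly predictive settings need far more participants per candidate parameter, which is exactly what a fixed events-per-variable rule fails to capture.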

The use of penalized regression methods (*e.g.*, least absolute shrinkage and selection operator [lasso], ridge regression, elastic net) can be considered, since these facilitate the choice of variables to include in the model while minimizing overfitting (Table 2).^{[34–36]} However, it has been reported that penalized approaches do not necessarily solve the problems associated with small sample sizes.^{[37]} General and biomarker-specific considerations for developing multivariable prediction models are summarized in Box 3.
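
To see the shrinkage such penalties induce, consider ridge regression, whose solution has the closed form (XᵀX + λI)⁻¹Xᵀy. A toy two-predictor sketch (illustrative only; lasso and elastic net require iterative solvers, and in practice dedicated software with cross-validated penalty selection should be used):

```python
def ridge_2d(xs, ys, lam):
    """Closed-form ridge solution (X'X + lam*I)^-1 X'y for two predictors,
    no intercept; illustrates coefficient shrinkage, not a full implementation."""
    # Build X'X + lam*I (entries a, b, d) and X'y (entries g0, g1).
    a = sum(x[0] * x[0] for x in xs) + lam
    b = sum(x[0] * x[1] for x in xs)
    d = sum(x[1] * x[1] for x in xs) + lam
    g0 = sum(x[0] * y for x, y in zip(xs, ys))
    g1 = sum(x[1] * y for x, y in zip(xs, ys))
    det = a * d - b * b
    # Solve the 2x2 system by Cramer's rule.
    return ((d * g0 - b * g1) / det, (a * g1 - b * g0) / det)

xs = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
ys = [1.0, 1.0, 2.0]
print(ridge_2d(xs, ys, 0.0))  # (1.0, 1.0): ordinary least squares
print(ridge_2d(xs, ys, 1.0))  # coefficients shrunk toward zero
```

Increasing λ pulls both coefficients toward zero, trading a little bias for less variance; lasso additionally sets some coefficients exactly to zero, performing variable selection.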

More recently, machine learning methods have been gaining interest as an alternative to regression-based models in critical care and perioperative medicine.^{[38–40]} Algorithms that improve the clinical use of biomarkers have been developed with machine learning.^{[41,42]} A practical definition of machine learning is that it uses algorithms that automatically learn (*i.e.*, are trained) from data, in contrast to clinical prediction models, which are based on prespecified predictors and functional forms. These algorithms are divided into two categories: supervised and unsupervised. Supervised machine learning algorithms are used to uncover the relationship between a set of clinical features and biomarkers and known outcomes (predictive and prognostic models; Supplemental Digital Content 1, http://links.lww.com/ALN/C503).^{[34]} The main supervised learning algorithms (*e.g.*, artificial neural networks, tree-based methods, support vector machines) are described in Table 2. Supervised conventional statistical modeling (*e.g.*, logistic regression) and supervised machine learning should be considered complementary rather than mutually exclusive.^{[43,44]} Marafino *et al.* applied a supervised machine learning algorithm, incorporating measures of clinical trajectory, to vital signs and biologic data from the first 24 h of admission for more than 100,000 unique intensive care unit (ICU) patients to develop and validate ICU mortality prediction models. The resulting model, leveraging serial data points for each predictor variable, exhibited discrimination comparable to classical mortality scores (*e.g.*, the Simplified Acute Physiology Score III and Acute Physiologic Assessment and Chronic Health Evaluation IV scores).^{[41]} In another example, Zhang *et al.* developed a machine learning prediction model to differentiate between volume-responsive and volume-unresponsive acute kidney injury (AKI) in 6,682 critically ill patients. Extreme gradient boosting with decision trees was reported to outperform a traditional logistic regression model in differentiating the two groups.^{[42]}

Machine learning is often claimed to have superior performance in high-dimensional settings (*i.e.*, with a large number of explanatory variables). However, there is limited evidence to support this claim in fair and meaningful comparisons with regression-based approaches, as observed in a recent systematic review that showed no performance benefit in clinical studies.^{[45]} While machine learning algorithms are often declared to perform well, they require very large datasets, massive computation, and considerable expertise.^{[46]} As such, they should not be considered an "easy path to perfect prediction." Limitations include overfitting, whereby the algorithm captures random error in the training dataset and therefore fails to generalize to future predictions.^{[47]} Approaches to control for overfitting should be adapted from the established clinical prediction model literature to provide an unbiased assessment of predictive accuracy. Another disadvantage of supervised machine learning algorithms is that the underlying association between covariates and outcome often cannot be fully understood by clinicians ("black box" models).^{[48]} Conversely, in logistic regression models, the regression coefficient of each covariate can be easily interpreted as an odds ratio (the exponential of the regression coefficient), which reflects the magnitude of the association with the outcome. A causal interpretation of any association in a prediction model should nonetheless be avoided, as the aim of a prediction model is to predict, not to attribute causality.^{[49]} The interpretation of a model that includes biomarkers reflecting distinct pathophysiological pathways (*e.g.*, myocardial injury, endothelial dysfunction) and their associations with outcome is more intuitive for clinicians with classical regression models than with machine learning algorithms.

Regardless of whether more traditional regression-based approaches or modern machine learning have been used to develop a prediction model, predictive accuracy can be assessed with several metrics. The two widely recommended measures are calibration and discrimination.^{[4,49]} Calibration assesses how well the risk predicted by the model agrees with the actual observed risk. Calibration can be assessed graphically by plotting the observed risk of the outcome (*e.g.*, mortality, postoperative AKI) against the predicted risk.^{[50]} Discrimination is a measure of how well the biomarker model can distinguish those who have the outcome of interest from those who do not (mainly evaluated by the AUC). Another measure of predictive accuracy is the Brier score (the mean squared difference between patient outcomes and predicted risks), which is sometimes presented as reflecting the clinical utility of prediction models; however, it has been suggested that the Brier score does not appropriately evaluate the clinical utility of diagnostic tests or prediction models.^{[51]} In practice, no single measure is sufficient, and multiple metrics characterizing different components of predictive accuracy are required.^{[52]}
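
The basic computations are simple. A sketch of the Brier score and calibration-in-the-large, the crudest calibration summary (a full calibration assessment plots observed against predicted risk, *e.g.*, by deciles of predicted risk; data below are hypothetical):

```python
def brier_score(predicted, observed):
    """Mean squared difference between predicted risks and binary outcomes."""
    return sum((p - y) ** 2 for p, y in zip(predicted, observed)) / len(observed)

def calibration_in_the_large(predicted, observed):
    """Mean observed outcome minus mean predicted risk; 0 means the model is
    correct on average (it can still be miscalibrated within risk strata)."""
    n = len(observed)
    return sum(observed) / n - sum(predicted) / n

predicted = [0.1, 0.2, 0.8, 0.9]   # hypothetical predicted risks
observed = [0, 0, 1, 1]            # actual outcomes
print(round(brier_score(predicted, observed), 3))       # 0.025
print(calibration_in_the_large(predicted, observed))    # ~0: calibrated on average
```

For reference, an uninformative model that predicts the outcome prevalence for everyone in a balanced sample has a Brier score of 0.25, so the score is interpreted relative to such benchmarks rather than in absolute terms.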

Assessing model performance is a vital step. During the development of a prediction model, internal validation should be carried out using cross-validation or bootstrapping; these techniques use only the original study sample and mimic the uncertainty of the model-building process.^{[4,49]} The reason to carry out an internal validation is to obtain a bias-corrected estimate of model performance; for regression-based models, the regression coefficients can subsequently be shrunk to account for overfitting.^{[54]} A stronger test of a model is external validation, which consists of assessing the performance (discrimination and calibration) of the prediction model in participant data different from that used for model development (typically collected from different institutions).^{[4,49]} It is often expected that upon external validation the calibration of the model will be poorer, and methods to recalibrate the model should be considered.^{[55]}
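
The bootstrap optimism correction can be sketched with a deliberately simple "model" (a single cutoff chosen to maximize training accuracy; in a real prediction model the entire modeling process, including any variable selection, must be repeated in every bootstrap sample). All names and data are illustrative:

```python
import random

def accuracy(values, labels, cutoff):
    """Proportion correctly classified when value >= cutoff predicts the event."""
    return sum(1 for v, y in zip(values, labels)
               if (v >= cutoff) == (y == 1)) / len(labels)

def fit_cutoff(values, labels):
    """'Model fitting': pick the cutoff with the highest training accuracy."""
    return max(sorted(set(values)), key=lambda c: accuracy(values, labels, c))

def optimism_corrected(values, labels, n_boot=200, seed=0):
    """Harrell-style bootstrap: apparent performance minus average optimism
    (performance of each bootstrap-fitted model on its own bootstrap sample
    minus its performance on the original sample)."""
    rng = random.Random(seed)
    pairs = list(zip(values, labels))
    apparent = accuracy(values, labels, fit_cutoff(values, labels))
    optimism = 0.0
    for _ in range(n_boot):
        sample = [rng.choice(pairs) for _ in pairs]
        vs, ys = zip(*sample)
        c = fit_cutoff(vs, ys)
        optimism += accuracy(vs, ys, c) - accuracy(values, labels, c)
    return apparent - optimism / n_boot
```

On noisy data the corrected estimate falls below the apparent one, quantifying how much of the apparent performance was overfitting; the same loop structure applies with AUC, calibration slope, or any other performance measure.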

Anesthesiology. 2020;134(1):15-25. © 2020 American Society of Anesthesiologists | Lippincott Williams & Wilkins