Abstract
The ability to predict postinduction hypotension remains limited and challenging due to the multitude of data elements that may be considered
Novel machine-learning algorithms may offer a systematic approach to predict postinduction hypotension, but are understudied
Among 13,323 patients undergoing a variety of surgical procedures, 8.9% experienced a mean arterial pressure less than 55 mmHg within 10 min of induction start
While some machine-learning algorithms perform worse than logistic regression, several techniques may be superior
Gradient boosting machine, with tuning, demonstrates a receiver operating characteristic area under the curve of 0.76, a positive predictive value of 19%, and negative predictive value of 96%
Hypotension is a risk factor for adverse perioperative outcomes. Machine-learning methods allow large amounts of data for development of robust predictive analytics. The authors hypothesized that machine-learning methods can provide prediction for the risk of postinduction hypotension.
Data was extracted from the electronic health record of a single quaternary care center from November 2015 to May 2016 for patients over age 12 that underwent general anesthesia, without procedure exclusions. Multiple supervised machine-learning classification techniques were attempted, with postinduction hypotension (mean arterial pressure less than 55 mmHg within 10 min of induction by any measurement) as primary outcome, and preoperative medications, medical comorbidities, induction medications, and intraoperative vital signs as features. Discrimination was assessed using cross-validated area under the receiver operating characteristic curve. The best performing model was tuned and final performance assessed using split-set validation.
Out of 13,323 cases, 1,185 (8.9%) experienced postinduction hypotension. Area under the receiver operating characteristic curve using logistic regression was 0.71 (95% CI, 0.70 to 0.72), support vector machines was 0.63 (95% CI, 0.58 to 0.60), naive Bayes was 0.69 (95% CI, 0.67 to 0.69), k-nearest neighbor was 0.64 (95% CI, 0.63 to 0.65), linear discriminant analysis was 0.72 (95% CI, 0.71 to 0.73), random forest was 0.74 (95% CI, 0.73 to 0.75), neural nets 0.71 (95% CI, 0.69 to 0.71), and gradient boosting machine 0.76 (95% CI, 0.75 to 0.77). Test set area for the gradient boosting machine was 0.74 (95% CI, 0.72 to 0.77).
The success of this technique in predicting postinduction hypotension demonstrates feasibility of machine-learning models for predictive analytics in the field of anesthesiology, with performance dependent on model selection and appropriate tuning.
HYPOTENSION has been demonstrated to be an independent risk factor for adverse perioperative outcomes.1–5 Despite careful administration of medications by anesthesiologists during induction of general anesthesia, hypotension is often an unintended consequence. Thus, early recognition of intraoperative hypotension may lead to preventive measures to improve anesthetic and surgical outcome.
There are currently few available methods used in clinical practice for prediction of postinduction hypotension.6,7 It is likely that a variety of factors are involved in the precipitation of postinduction hypotension, including patient comorbidities, home medications taken on the day of surgery, and medications used for induction of anesthesia. Due to this complexity, simple modeling techniques may not be sufficient for risk prediction of postinduction hypotension.
Machine-learning methods provide an opportunity for large amounts of data to be incorporated into development of robust predictive analytics, often without many of the pitfalls and restrictions of standard modeling techniques.8,9 These techniques are increasingly being used in various fields of medicine, including in the diagnosis of primary hyperparathyroidism, prognosis in stage III colorectal cancer, discharge disposition following craniotomy, mortality prediction after cardiac surgery, and identifying graft failure after liver transplantation.10–16 Modern electronic health records with integrated anesthesiology intraoperative records provide the opportunity for complex clinical decision support tools to aid clinicians in making decisions based on objective evidence and data, in addition to training and experience. There is now access to large amounts of perioperative data via the electronic health record. Before implementation of these tools, however, there must be proof-of-concept explorations followed by rigorous validation (“bench work”) to confirm reliability in clinical practice (“bedside”).
In the current study, we hypothesized that we could develop a highly discriminative machine-learning model for prediction of postinduction hypotension using information readily available from the electronic health record to demonstrate the viability of machine-learning methods for intraoperative predictive analytics.
Materials and Methods
The data did not contain any direct patient identifiers, and no direct interaction with human subjects was involved. Our Institutional Review Board (New York University Langone Health, New York, New York) does not directly review such analyses, and requires certification that no direct patient identifier is accessible via the data kept in record. This study was considered exempt from review and written informed consent was consequently waived.
Patient Population
Patients who underwent any procedure requiring general anesthesia from November 2015 to May 2016 were included. Only patients over the age of 12 were included. There were no procedure-based exclusions.
Primary Outcome
The target output (or primary outcome) was postinduction hypotension, defined as any single mean arterial pressure (MAP) less than 55 mmHg on any noninvasive or continuous blood pressure measurement within 10 min of the recorded induction time of general anesthesia. Designation of hypotension was limited only to the available recorded data. Any recorded blood pressure measurement was included, whether it was from an arterial line or from a non-invasive blood pressure cuff. Any blood pressure taken by non-invasive blood pressure cuff was recorded, so the frequency would depend on the cycle time designated by the anesthesiologist. Continuous blood pressure measurement was recorded at a resolution of 1 min. Multiple thresholds for the adverse potential for hypotension have also been explored, and relationships have been drawn between MAP less than 49 mmHg and postoperative mortality. In addition, systolic blood pressure and relative blood pressure changes have also been explored in association with adverse postoperative outcome. This particular threshold was chosen based on the association of MAP less than 55 mmHg with adverse perioperative outcomes.1,3,4,17 Specifically, MAP less than 55 mmHg has been associated with both acute kidney injury and myocardial infarction in multiple previous explorations of large data sets in association with outcomes. Because this was a binary categorical designation (hypotension either occurred or it did not), only classification methods of machine learning were considered.
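The outcome definition above can be sketched in code. The following is a minimal illustration in Python (the study itself used R). The function name, the layout of the readings, and the sample values are hypothetical, and treating the window bounds as inclusive is an assumption, since the text does not specify it.

```python
from datetime import datetime, timedelta

MAP_THRESHOLD_MMHG = 55          # threshold stated in the text
WINDOW = timedelta(minutes=10)   # window after the recorded induction time

def postinduction_hypotension(induction_time, map_readings):
    """Return True if any recorded MAP inside the window is below the threshold.

    map_readings: iterable of (timestamp, map_mmhg) pairs from any source
    (arterial line or noninvasive cuff). This layout is hypothetical.
    """
    window_end = induction_time + WINDOW
    return any(
        induction_time <= ts <= window_end and value < MAP_THRESHOLD_MMHG
        for ts, value in map_readings
    )

induction = datetime(2016, 1, 5, 8, 0)
readings = [
    (induction + timedelta(minutes=2), 72),
    (induction + timedelta(minutes=6), 53),   # single reading below 55 mmHg
    (induction + timedelta(minutes=14), 50),  # outside the 10-min window
]
print(postinduction_hypotension(induction, readings))
```

A single qualifying reading is enough to label the case, which mirrors the "any single MAP" wording; readings outside the window are ignored regardless of their value.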
Data Source
Data were aggregated from the electronic health record (Epic; Epic Systems, USA) from a single large academic institution, including a six-month range from November 2015 to May 2016. The data were made available by unified reports initially created by the hospital information technology department for reporting and research purposes. There were reports generated for: demographic data, intraoperative medications administered, intraoperative event times, preoperative medical comorbidities in the form of the problem list as of the time of the encounter, intraoperative vital signs and ventilator data, and preoperative medications as of the time of the encounter. Each set of values was assessed individually for validity, including by visualization of summary data and case sampling, in which 100 cases were assessed manually.
Data Elements
From demographic data, the following were extracted: age, sex, body mass index, time of surgery, and American Society of Anesthesiologists (ASA) Physical Status score. From intraoperative medications, by using the time of medication administration, only medications given between entry into the operating area and 10 min after induction start were included. Intraoperative vital signs were restricted to the same time period (from start of anesthesia to 10 min after induction start). At our institution, preoperative IV medications are not given before entry into the operating room. Induction start is a single manually entered event that indicates the beginning of administration of induction medication, and records that did not include an induction time were not included. We do not indicate induction end within the electronic health record at our institution. The data frame was truncated to include only the relevant time period. From the preoperative medical comorbidities, the following were extracted by text search from the problem list, which is included in every patient electronic health record and can be populated as discrete fields by any documenting provider: coronary artery disease, hypertension, congestive heart failure, atrial fibrillation, chronic kidney disease, asthma, chronic obstructive pulmonary disease, gastroesophageal reflux disease, obstructive sleep apnea, diabetes mellitus, and aortic stenosis. These exact terms were used for searching the problem list, as only these discrete terms can be entered to define the given comorbidity within the problem list (i.e., coronary artery disease is listed as “coronary artery disease” or “coronary artery disease [CAD]” and not only “CAD”). The entire list of preoperative medications, as separated into pharmacologic classes, was retained for more specific feature selection. 
Pharmacologic classes were determined by the underlying Epic categorization system, more specifically the Clarity (Epic Systems) designations intended for querying and reporting of medications. Clarity is the relational database that houses data from the Epic electronic health record. Clarity categorizes any given medication by therapeutic class, pharmacologic class, and pharmacologic subclass. For the purposes of this study, we used the most specific category: the pharmacologic subclass. Only electronic health record data were used, and no waveform data from the monitor were analyzed. Data was obtained exactly as it was recorded in the electronic health record.
Statistical Analysis
Data Cleaning and Feature Selection.
Although machine learning is expected to help in unbiased feature selection, our initial feature selection for inclusion was decided based on clinical judgment of factors that are potential contributors to postinduction hypotension as a basis for data extraction, as well as what was available for extraction from the electronic health record. The previously mentioned medical comorbidities, age, sex, body mass index, time of surgery, and ASA Physical Status score were included. Time of surgery was represented by the hour during which the patient entered the procedure area, as a continuous variable (e.g., entering the room at 7:26 am was encoded as 7; entering the room at 10:15 pm was encoded as 22). ASA Physical Status score was represented as a categorical variable. Age and body mass index were included as continuous variables, while the remaining were considered categorical. Medications used during induction of general anesthesia were investigated, and only the most common medications were included: midazolam, propofol, etomidate, fentanyl, rocuronium, and succinylcholine. All medications were included as continuous variables indicating dose administered. In circumstances where there was no data for an intraoperative medication, such as no value for administration of propofol, that value was converted to zero. The data were inspected and continuous data did not indicate a need for normalization. Blood pressures that were obviously out of physiologic range were excluded (MAP less than 20 mmHg, MAP greater than 200 mmHg, or pulse pressure less than 20 mmHg), but no other attempt was made at artifact detection. No other preprocessing was performed. Otherwise, all data were used without modification, and no additional measures were taken regarding potentially “missing” or misclassified data. This was done to preserve application of real-world data. Features were examined in relation to the primary outcome for data leakage or perfect separation potential.
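The three cleaning rules described in this paragraph (hour encoding, zero-filling of unrecorded induction doses, and the physiologic range filter) can be sketched as follows. This is an illustrative Python sketch, not the study's R code; the function names and the dictionary layout are hypothetical, and it assumes systolic and diastolic values are available to compute pulse pressure.

```python
from datetime import datetime

INDUCTION_DRUGS = ("midazolam", "propofol", "etomidate",
                   "fentanyl", "rocuronium", "succinylcholine")

def encode_hour(room_entry):
    """Hour of entry into the procedure area, 0-23 (7:26 am becomes 7)."""
    return room_entry.hour

def fill_missing_doses(doses):
    """Treat an induction drug with no recorded value as a dose of zero."""
    return {drug: doses.get(drug, 0.0) for drug in INDUCTION_DRUGS}

def is_physiologic(systolic, diastolic, mean):
    """Keep a reading unless MAP < 20, MAP > 200, or pulse pressure < 20 mmHg."""
    return 20 <= mean <= 200 and (systolic - diastolic) >= 20

print(encode_hour(datetime(2016, 1, 5, 22, 15)))
print(fill_missing_doses({"fentanyl": 100.0}))
print(is_physiologic(120, 110, 113))
```

Zero-filling encodes "not recorded" and "not given" identically, which matches the text's stated choice to take the record as it stands.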
Data leakage refers to information within the training set that leads to excessively optimistic predictions. This data may be information that is not available in the real world setting, or data that contains the information that is to be predicted. For example, in this setting, data leakage could have occurred if lowest MAP had been included as a feature, as this value is directly related to the binary definition of hypotension used in the models. Perfect separation refers to data that clearly forces the outcomes of the algorithm into one classification or another (i.e., if variable x = 1, then outcome y = 1 always). No features were considered to be a risk for data leakage or perfect separation. From intraoperative vital signs and ventilator data, the first MAP, maximum end tidal volatile anesthetic concentration, and mean peak inspiratory pressure were included, all of which were included as continuous variables. Recursive feature elimination was used as a wrapper method on top of random forest for feature selection, using the “rfe” function from the “caret” package, resulting in a subset of the available features for inclusion within the machine-learning models.
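A simple screen for perfect separation in a categorical feature can be sketched as below. This is a hypothetical Python helper for illustration only; the text does not describe how the authors performed this check. The screen is meaningful only for categorical features, because a continuous feature with all-unique values would pass it trivially.

```python
from collections import defaultdict

def separates_perfectly(feature, outcome):
    """True if every distinct feature value co-occurs with exactly one outcome."""
    outcomes_by_value = defaultdict(set)
    for x, y in zip(feature, outcome):
        outcomes_by_value[x].add(y)
    both_classes_present = len(set(outcome)) > 1
    return both_classes_present and all(
        len(ys) == 1 for ys in outcomes_by_value.values()
    )

outcome = [0, 0, 1, 1, 0, 1]
# Hypothetical leaky flag derived from the lowest MAP: it mirrors the outcome.
lowest_map_below_55 = [0, 0, 1, 1, 0, 1]
# Hypothetical comorbidity flag: seen with both outcomes, so no separation.
hypertension = [1, 0, 1, 0, 0, 1]

print(separates_perfectly(lowest_map_below_55, outcome))
print(separates_perfectly(hypertension, outcome))
```

The first feature is also an example of data leakage in the sense described above, since it encodes the quantity being predicted.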
Model Selection.
Because not all machine-learning methods have robust internal validation, data was randomly separated into 70/30 training and test sets for validation. Specifically, 70% of the data was used for training the machine learned models, and 30% was held out for the test set (fig. 1). In no predetermined order, machine-learning algorithms were trained on the training set, using tenfold cross-validation repeated three times to minimize initial overfitting. Because of concerns of how various machine-learning models treat class imbalance, area under the receiver operating characteristic curve was used as the primary performance metric due to its threshold-independence instead of a simple accuracy metric, which may not be reflective of performance in the setting of class imbalance. In addition, the threshold-dependent measures of sensitivity and specificity at the “best” thresholds were computed for each model. “Best” threshold refers to the threshold at which sensitivity and specificity are both maximized, not necessarily the optimal threshold for clinical integration. The following machine-learning algorithms were trained: logistic regression, support vector machines, naive Bayes, k-nearest neighbor, linear discriminant analysis, random forest, neural nets, and stochastic gradient boosting machine. Although the goal was developing a predictive model, an additional aim was to explore how various machine-learning algorithms compare with respect to handling of pre- and intraoperative data.
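The two performance measures described here can be illustrated with a short sketch. The code below is Python rather than the R used in the study, and the toy scores and labels are invented. It assumes that the “best” threshold corresponds to maximizing sensitivity plus specificity (Youden's index), which is one common reading of "both maximized"; the source does not name the criterion explicitly.

```python
def auc(scores, labels):
    """AUC as the probability that a positive case outranks a negative case."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def best_threshold(scores, labels):
    """Return (threshold, sensitivity, specificity) maximizing their sum."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    best = None
    for t in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= t)
        tn = sum(1 for s, y in zip(scores, labels) if y == 0 and s < t)
        sens, spec = tp / n_pos, tn / n_neg
        if best is None or sens + spec > best[1] + best[2]:
            best = (t, sens, spec)
    return best

# Invented predicted risks and observed outcomes (1 = hypotension).
scores = [0.1, 0.2, 0.4, 0.5, 0.6, 0.7, 0.9]
labels = [0, 0, 1, 0, 1, 1, 1]

print(auc(scores, labels))
print(best_threshold(scores, labels))
```

The AUC is computed without choosing any cutoff, which is why it is described as threshold-independent, whereas sensitivity and specificity change with every candidate threshold.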
Fig. 1. Diagram of methods. The complete data set was split into training and test sets. The machine-learning methods were trained on the training set and the best performer selected for additional parameter tuning before being applied to the test set for validation.
All available features after recursive feature elimination were used for all training algorithms. Some machine-learning algorithms perform built-in feature selection (i.e., random forest), and this behavior was not restricted. The “caret” package in R was used for initial training and tenfold cross-validation, using receiver operating characteristic as the performance metric, and basic tuning on parameters specific to each method of machine learning (https://CRAN.R-project.org/package=caret, accessed May 4, 2018). Tuning was performed by a limited grid-search as dictated by the defaults in the package. Functions, packages, and tuning parameters used for each machine-learning method are shown in appendix 1.18–21 (https://CRAN.R-project.org/package=gbm, accessed May 4, 2018). Receiver operating characteristic curves were generated using the “pROC” package.22 Bootstrap 95% CI were computed with 2,000 stratified bootstrap replicates by the “ci” function in the “pROC” package.
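A stratified bootstrap CI for the area under the curve can be sketched as follows. This is a simplified Python illustration, not the "pROC" implementation: it assumes a percentile interval, resamples cases and non-cases separately so that each replicate keeps the original class counts, and uses simulated scores.

```python
import random

def auc_from_groups(pos_scores, neg_scores):
    """AUC as the fraction of positive/negative pairs ranked correctly."""
    wins = sum((p > n) + 0.5 * (p == n) for p in pos_scores for n in neg_scores)
    return wins / (len(pos_scores) * len(neg_scores))

def stratified_bootstrap_ci(pos_scores, neg_scores, n_boot=2000, level=0.95, seed=0):
    """Percentile CI from replicates that resample each class with replacement."""
    rng = random.Random(seed)
    replicates = sorted(
        auc_from_groups(rng.choices(pos_scores, k=len(pos_scores)),
                        rng.choices(neg_scores, k=len(neg_scores)))
        for _ in range(n_boot)
    )
    tail = (1 - level) / 2
    lower = replicates[int(tail * n_boot)]
    upper = replicates[int((1 - tail) * n_boot) - 1]
    return lower, upper

# Simulated risk scores: 15 cases and 60 non-cases (invented data).
data_rng = random.Random(42)
pos = [data_rng.gauss(0.6, 0.2) for _ in range(15)]
neg = [data_rng.gauss(0.4, 0.2) for _ in range(60)]

print(auc_from_groups(pos, neg))
print(stratified_bootstrap_ci(pos, neg))
```

Stratifying the resampling matters when the outcome is uncommon, because an unstratified replicate could by chance contain very few positive cases.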
Model Tuning and Testing.
After model selection, which involved a coarse-level tuning of the models, the best performing algorithm as determined by highest area under the receiver operating characteristic curve was fine-tuned further for parameters specific to the method. Tuning refers to optimization of the algorithm by modification of parameters in order to achieve the best performance. Modifiable parameters are specific to each machine-learning algorithm (e.g., number of trees for random forest, distance and kernel for k-nearest neighbor). Tuning parameters were assessed by expanded manual grid search, in which large but realistic ranges of values are given for each tuning parameter, and performance of the resulting models is compared. Variable importance was determined for the final model if this was possible for the given machine-learning algorithm, as not all machine-learning algorithms are amenable to computing variable importance. Variable importance is computed based on how important any given feature is to aid in the classification process when the classifier is built, determined by its effect on the performance measure. Generally, variable importance helps to assess the impact of any given variable on the performance of the algorithm. If a variable with high importance is permuted or removed from the model, the performance decreases. The greater the importance, the more essential the variable is to the performance of the model. Nonetheless, assumptions about effect size cannot be drawn directly about the relationship of variable importance to the primary outcome.23 The “varImp” function from the “caret” package was used for variable importance (https://CRAN.R-project.org/package=caret, accessed May 4, 2018). The final model was then simulated on the test set to determine generalizability of the algorithm, and assess whether the model was overfitted.24 The process is depicted in figure 1.
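The idea that permuting an important variable degrades performance can be illustrated with a permutation-importance sketch. This is a generic Python illustration and should not be read as the computation performed by “varImp”, which may derive importance differently depending on the model. The scoring rule stands in for a trained classifier and, like the simulated data, is hypothetical.

```python
import random

def auc(scores, labels):
    """AUC as the fraction of positive/negative pairs ranked correctly."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def model_score(row):
    """Hypothetical fixed scoring rule; it ignores the third feature."""
    return 0.8 * row[0] + 0.2 * row[1]

def permutation_importance(rows, labels, column, rng):
    """Drop in AUC after shuffling one feature column across cases."""
    baseline = auc([model_score(r) for r in rows], labels)
    shuffled = [r[column] for r in rows]
    rng.shuffle(shuffled)
    permuted = [r[:column] + (v,) + r[column + 1:] for r, v in zip(rows, shuffled)]
    return baseline - auc([model_score(r) for r in permuted], labels)

rng = random.Random(1)
rows = [(rng.random(), rng.random(), rng.random()) for _ in range(200)]
# Simulated outcome driven mainly by the first feature.
labels = [1 if r[0] + 0.3 * rng.random() > 0.6 else 0 for r in rows]

importance = [permutation_importance(rows, labels, c, rng) for c in range(3)]
print(importance)
```

A feature the model never uses shows an importance of exactly zero, and a large drop says only that the model relies on the feature, not how large or in which direction its clinical effect is.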
All statistical operations were performed using the R statistical software (R Foundation for Statistical Computing, version 3.3.2, Austria). A sample data set (Supplemental Digital Content 1, https://links.lww.com/ALN/B773) and sample code of the primary analysis (Supplemental Digital Content 2, https://links.lww.com/ALN/B774) are provided.
Sensitivity Analyses
Different Definition of Hypotension.
We used MAP less than 55 mmHg as the definition of hypotension due to its association with certain postoperative outcomes. Other studies suggest a more conservative definition of hypotension, namely MAP less than 65 mmHg, as being associated with harm, specifically myocardial and kidney injury.17 Because of this, we undertook a sensitivity analysis in which MAP less than 65 mmHg was considered as the definition for hypotension, and trained the best performing algorithm on this new definition to generate a new model.
Treated Hypotension.
We initially only aimed to identify patients that experienced hypotension, as in those cases there was an implication that the hypotension was unanticipated. In cases of anticipated or suspected hypotension, the anesthesiologist is likely to have treated the hypotension. Sometimes the treatment is successful and other times not. This creates a few potential outcomes: untreated hypotension, unsuccessful treatment of hypotension, and successful treatment of hypotension. Our initial model would only capture the first two options. We undertook a sensitivity analysis in which the third outcome (successful treatment) is explored by including administration of phenylephrine or ephedrine as part of the target output definition in addition to MAP less than 55 mmHg, and trained the best performing algorithm on this new definition to generate a new model.
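The alternative target definitions used in the sensitivity analyses can be expressed as a small variation of the primary outcome. The sketch below is illustrative Python with hypothetical names. It assumes that the treated-hypotension outcome is positive when either the MAP criterion is met or a vasopressor was recorded; the text does not spell out the exact combination rule.

```python
VASOPRESSORS = frozenset({"phenylephrine", "ephedrine"})

def outcome_with_treatment(lowest_map_mmhg, drugs_given, threshold_mmhg=55):
    """Positive if MAP fell below the threshold OR a vasopressor was recorded."""
    below = lowest_map_mmhg < threshold_mmhg
    treated = any(drug.lower() in VASOPRESSORS for drug in drugs_given)
    return below or treated

# Successfully treated case: MAP never fell below 55, but phenylephrine was given.
print(outcome_with_treatment(62, ["propofol", "phenylephrine"]))
# Same lowest MAP, no vasopressor: negative under the 55 mmHg definition.
print(outcome_with_treatment(62, ["propofol"]))
```

Passing `threshold_mmhg=65` reproduces the MAP criterion of the first sensitivity analysis. Here the lowest MAP is used only to construct the label, never as a model feature.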
Adjusting for Class Imbalance.
Although threshold invariant metrics such as area under the receiver operating characteristic curve tend to be more resistant to class imbalance, there are additional methods to reduce the impact of class imbalance on measuring model performance. We undertook a sensitivity analysis in which a down-sampling of the majority class was performed on the training set to reduce initial class imbalance. The test set was not modified at all. The best performing algorithm was trained on the down-sampled training set to generate a new model.
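Down-sampling of the majority class can be sketched as below. This Python illustration assumes that the majority class is randomly reduced to the size of the minority class (a 1:1 ratio); the text does not state the ratio that was used, and the function name and toy data are hypothetical.

```python
import random

def downsample_majority(rows, labels, seed=0):
    """Randomly drop majority-class cases until both classes are the same size."""
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    minority, majority = (pos, neg) if len(pos) <= len(neg) else (neg, pos)
    keep = sorted(minority + rng.sample(majority, len(minority)))
    return [rows[i] for i in keep], [labels[i] for i in keep]

# Toy training set with roughly the imbalance reported in the study (9 of 100).
train_rows = list(range(100))
train_labels = [1] * 9 + [0] * 91

balanced_rows, balanced_labels = downsample_majority(train_rows, train_labels)
print(len(balanced_rows), sum(balanced_labels))
```

Only the training data are passed through this step; the held-out test set keeps its original class distribution, so performance is measured at the real-world event rate.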
Results
After exclusion of cases without an induction time and patients younger than age 12, there were 13,323 cases remaining, 1,185 (8.9%) of which experienced postinduction hypotension. There were 412 cases (3.0%) with missing induction time. There were 2,051 (15%) cases that utilized continuous arterial blood pressure monitoring. Ultimately, the training set contained 9,326 cases and the test set contained 3,997 cases. There were 816 (8.7%) cases of postinduction hypotension within the training set, and there were 369 (9.2%) cases of postinduction hypotension within the test set. Data characteristics of the complete data set are detailed in table 1, while data characteristics of the data set only including those features for modeling are detailed in appendix 2. Final feature selection after recursive feature elimination is depicted in figure 2.
Table 1. Data Set Population Characteristics and Characteristics of Patients Who Experienced and Did Not Experience Postinduction Hypotension

Fig. 2. Reduction of dimensionality by recursive feature elimination on the training data set. The number of features used for training was reduced from the list of features on the left to the list of features on the right. Italics indicate features that were eliminated. ACE inhibitors, angiotensin converting enzyme inhibitors; ASA score, American Society of Anesthesiologists Physical Status score.
After training, area under the receiver operating characteristic curve using logistic regression was 0.71 (95% CI, 0.70 to 0.72); support vector machines was 0.63 (95% CI, 0.58 to 0.60); naive Bayes was 0.69 (95% CI, 0.67 to 0.69); k-nearest neighbor was 0.64 (95% CI, 0.63 to 0.65); linear discriminant analysis was 0.72 (95% CI, 0.71 to 0.73); random forest 0.74 (95% CI, 0.73 to 0.75); neural nets was 0.71 (95% CI, 0.69 to 0.71); and gradient boosting machines was 0.76 (95% CI, 0.75 to 0.77). Receiver operating characteristic curves, as well as sensitivity and specificity at “best” thresholds for each machine-learning method, are depicted in figure 3.
Fig. 3. Receiver operating characteristic curves of machine-learning methods for prediction of postinduction hypotension in the training data set. A greater area under the receiver operating characteristic curve (AUC) represents higher discriminative ability of the model. Areas under the receiver operating characteristic curves, as well as specificity and sensitivity of each machine-learning model for prediction of postinduction hypotension at “best” threshold, are presented with 95% CIs. “Best” threshold refers to the threshold at which specificity and sensitivity are both maximized.
Based on the model selection process, gradient boosting machine was the strongest initial performer and was selected for continued tuning and further testing. The parameters tuned for the gradient boosting machine method were the number of trees (range 50 to 400), interaction depth (range 1 to 8), shrinkage (range 0.01 to 0.3), and the minimum number of observations at terminal node (range 5 to 30). Final tuning resulted in a gradient boosting machine algorithm with 200 trees, interaction depth of 6, shrinkage of 0.05, and 30 minimum observations at terminal node. The final model had an area under the receiver operating characteristic curve of 0.76 (95% CI, 0.75 to 0.77). Final variable importance can be seen in figure 4. The model run on the test set had an area under the receiver operating characteristic curve of 0.74 (95% CI, 0.72 to 0.77), a positive predictive value of 19% (95% CI, 16 to 21%) and a negative predictive value of 96% (95% CI, 95 to 97%). Areas under the receiver operating characteristic curve for all machine-learning classifiers run on the test set are presented in table 2 solely for comparison in this setting.
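An expanded grid over the four tuned parameters can be enumerated as in the sketch below. This is illustrative Python, not the study's R code. Only the range endpoints and the final selected values come from the text; the intermediate grid points and the parameter names (including `min_obs_in_node` for the terminal-node minimum) are assumptions.

```python
from itertools import product

# Endpoints follow the reported ranges; intermediate points are hypothetical.
grid = {
    "n_trees": [50, 100, 200, 400],
    "interaction_depth": [1, 2, 4, 6, 8],
    "shrinkage": [0.01, 0.05, 0.1, 0.3],
    "min_obs_in_node": [5, 10, 20, 30],
}

def expand_grid(grid):
    """Return one dict per combination of candidate parameter values."""
    names = list(grid)
    return [dict(zip(names, values))
            for values in product(*(grid[name] for name in names))]

candidates = expand_grid(grid)
print(len(candidates))

# The configuration reported as the final tuned model.
reported_final = {"n_trees": 200, "interaction_depth": 6,
                  "shrinkage": 0.05, "min_obs_in_node": 30}
print(reported_final in candidates)
```

In a full search, each candidate would be fitted under cross-validation and the combination with the highest area under the curve retained; the number of fits grows multiplicatively with each added parameter value.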
Table 2. Area under the Receiver Operating Characteristic Curves (AUROC) for Each Machine-learning Classifier Run on the Test Data Set

Fig. 4. Variable importance of features included in stochastic gradient boosting machine-learning algorithm for prediction of postinduction hypotension. Variable importance is computed based on how important any given feature is to aid in the classification process when the classifier is built, determined by its effect on the performance measure. The greater the importance, the more essential the variable is to the performance of the model. Assumptions about effect size cannot be drawn directly about the relationship of variable importance to the primary outcome. ACE inhibitor, angiotensin converting enzyme inhibitor; ASA score, American Society of Anesthesiologists Physical Status score; DMARD, disease modifying antirheumatic drug; DPP4, dipeptidyl peptidase-4 inhibitor; max. sevoflurane conc., maximum sevoflurane concentration; max. desflurane conc., maximum desflurane concentration; PPI/H2 blocker, proton pump inhibitor/H2 blocker.
Sensitivity Analyses
The model with a different definition of hypotension than the primary analysis (MAP less than 65 mmHg) had an area under the receiver operating characteristic curve of 0.72 (95% CI, 0.71 to 0.72), with specificity of 65% and sensitivity of 67% at the “best” threshold.
The model that incorporated administration of phenylephrine or ephedrine within the hypotension outcome definition had an area under the receiver operating characteristic curve of 0.75 (95% CI, 0.74 to 0.75), with specificity of 63% and sensitivity of 73% at the best threshold.
The model that utilized the down-sampled training set had an area under the receiver operating characteristic curve of 0.76 (95% CI, 0.75 to 0.77), with specificity of 69% and sensitivity of 69% at the best threshold.
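Predictive values depend on how common the outcome is, in addition to sensitivity and specificity. The Python sketch below shows the relationship. Pairing the sensitivity and specificity of the down-sampled model with the event rate of the test set (369 of 3,997) is an illustrative assumption made here, not a result reported by the study.

```python
def predictive_values(sensitivity, specificity, prevalence):
    """Return (PPV, NPV) from sensitivity, specificity, and outcome prevalence."""
    tp = sensitivity * prevalence
    fn = (1 - sensitivity) * prevalence
    tn = specificity * (1 - prevalence)
    fp = (1 - specificity) * (1 - prevalence)
    return tp / (tp + fp), tn / (tn + fn)

test_prevalence = 369 / 3997   # cases of hypotension in the test set
ppv, npv = predictive_values(0.69, 0.69, test_prevalence)
print(round(ppv, 3), round(npv, 3))
```

With an event rate near 9%, a classifier with balanced sensitivity and specificity of 69% yields a positive predictive value below 20% and a negative predictive value above 95%. This pattern is expected for an uncommon outcome: most alerts are false positives, while a negative prediction is usually correct.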
Discussion
In this study, we examined the use of machine-learning methods based on existing information in the electronic health record for intraoperative predictive analytics, specifically prediction of postinduction hypotension. The final model used a gradient boosting machine that demonstrated strong discrimination in both the training (area under the receiver operating characteristic curve 0.76, 95% CI, 0.75 to 0.77) and testing (area under the receiver operating characteristic curve 0.74, 95% CI, 0.72 to 0.77) sets. Gradient boosting machines function as an ensemble method by sequentially adding weak classifiers, in this case decision trees, to reach a final model based on improvement of each classifier. The results of this exploration are not surprising considering the nature of the data, namely that the machine-learning methods that handle class imbalance better (gradient boosting machines, random forest, logistic regression) performed better than other methods. Boosting algorithms tend to suffer in cases of highly misclassified data, so the strong performance of this algorithm offers some indication of the veracity of the modeled data. Most of the variables of high importance are also not unexpected in terms of clinical credibility; it is realistic to expect that features such as age, induction agents, volatile anesthetic concentration, and mean peak inspiratory pressure would be relevant for prediction of postinduction hypotension. Some were surprising, however, such as the relatively high importance of levothyroxine and bisphosphonates.
Some machine-learning algorithms have been reported within the anesthesiology, perioperative care, and pain medicine fields, such as for predicting mortality after cardiac surgery,13 predicting postoperative sepsis and acute kidney injury,25 predicting postoperative pain or the need for pain consults,26,27 or predicting patient controlled analgesia consumption.28 These methods, however, do not extend into the intraoperative period. There are two major differences between our approach and that of other previous acute hypotension analysis approaches (as from the PhysioNet Challenge29). The settings of these explorations are primarily the intensive care unit, wherein data acquisition for processing occurs over a longer period of time and there is not necessarily a discrete inciting event related to hypotension. Additionally, this exploration used electronic health record data as opposed to waveform data. Waveform data is reliant on the quality of the waveform, as well as presence of invasive blood pressure monitoring, which is not available for all surgical cases.29,30 While there are some predictive tools for intraoperative hypotension, none currently seem to utilize the electronic health record for clinical decision support integration, or utilize machine learning.31,32 Being able to predict postinduction hypotension may ultimately allow clinicians to tailor their induction agents by prepopulating a model and observing the risk of hypotension, or allow an intraoperative alert to notify the clinician of impending hypotension so that it can be treated. Our machine-learning approach has a strong precedent in both medical and nonmedical fields.
There are a number of benefits to using machine learning for problems such as this. The most obvious of these is the ability to incorporate large amounts of disparate data into a unified algorithm. Most machine-learning methods are highly scalable, and thus can handle a variety of problems with differing feature types. The sensitivity analyses demonstrate the flexibility of machine-learning approaches to variations in target definition. Machine learning is particularly useful when the limits of human understanding have been superseded. For example, despite a thorough understanding of pharmacology, normal physiology, pathophysiology, and surgical factors, postinduction hypotension still occurs at a surprisingly high rate, likely because the number of variables involved is so vast and complex. Because of this, such a problem is a prime target for machine learning. Although each individual machine-learning method may have its own restrictions, most are not bound by the restrictions of classical prediction methods, such as linearity assumptions and the importance of identifying interactions between terms. Circumstances in which regression methods are utilized for machine learning can be preprocessed and tuned using techniques to minimize the impact of those assumptions, though not always entirely eliminated.
There are some disadvantages to machine-learning methods, however. Training times can vary widely depending on the methods and tuning parameters, number and complexity of features, computing power, and sheer volume of data. This can make the iterative process required for tuning models relatively time consuming as compared to simple rule-based if–then approaches. As with any other predictive modeling technique, any given machine-learning technique may not be an ideal approach for all tasks. For example, there were a number of methods in this study that performed more poorly than logistic regression, which, although considered a machine-learning algorithm, is more accessible and familiar to the medical community due to its roots in statistical learning. While some machine-learning methods offer information as to the relevance of various features, such as the variable importance shown for the gradient boosting machine algorithm, machine-learning tools tend to be, effectively, “black boxes.” While utility can be measured using performance metrics, the lack of transparency in the algorithms may be unsatisfactory for those who want a complete understanding of the clinical implications before making more specific practice modifications. However, the unbiased nature of machine-learning algorithms may allow insight into previously unexplored or unexpected factors that may contribute to a given outcome. For example, exploring why time of day was a variable of high importance may lead to further insight into a potential for modification. As with many other classifiers, threshold-dependent measures such as sensitivity and specificity are not meaningful until an appropriate threshold has been chosen to balance the two. That choice should be based on clinical guidance and on weighing the implications of misclassification as a result of over- or underdiagnosis. Finally, care must be taken when developing models to avoid overfitting, which can happen as a result of data leakage or perfect separation problems, among other causes.8
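As an illustration of the threshold choice described above, the following minimal sketch selects the cutoff that minimizes a weighted misclassification cost and reports the resulting sensitivity and specificity. It is written in Python rather than the R software used in this study and is not part of the reported analysis. The function names, the toy risk scores, and the cost weights are hypothetical assumptions chosen only to show how the preferred threshold moves when missed cases are judged more or less costly than false alarms.

```python
from typing import Sequence, Tuple


def sens_spec(scores: Sequence[float], labels: Sequence[int],
              threshold: float) -> Tuple[float, float]:
    """Sensitivity and specificity when score >= threshold is called positive."""
    tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= threshold)
    fn = sum(1 for s, y in zip(scores, labels) if y == 1 and s < threshold)
    tn = sum(1 for s, y in zip(scores, labels) if y == 0 and s < threshold)
    fp = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= threshold)
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return sensitivity, specificity


def pick_threshold(scores: Sequence[float], labels: Sequence[int],
                   cost_fn: float, cost_fp: float) -> float:
    """Return the observed score that minimizes total misclassification cost.

    cost_fn and cost_fp are illustrative weights for a missed case and a
    false alarm; in practice they would come from clinical judgment.
    Ties are resolved in favor of the lowest threshold.
    """
    best_threshold, best_cost = None, float("inf")
    for t in sorted(set(scores)):
        fn = sum(1 for s, y in zip(scores, labels) if y == 1 and s < t)
        fp = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= t)
        cost = cost_fn * fn + cost_fp * fp
        if cost < best_cost:
            best_threshold, best_cost = t, cost
    return best_threshold


# Hypothetical predicted risks and outcomes (1 = hypotension), not study data.
scores = [0.05, 0.10, 0.20, 0.30, 0.40, 0.60, 0.70, 0.90]
labels = [0, 0, 0, 1, 0, 1, 0, 1]

# Missed hypotension judged five times as costly as a false alarm.
t_sensitive = pick_threshold(scores, labels, cost_fn=5, cost_fp=1)
# False alarms judged five times as costly as a missed case.
t_specific = pick_threshold(scores, labels, cost_fn=1, cost_fp=5)

print(t_sensitive, sens_spec(scores, labels, t_sensitive))
print(t_specific, sens_spec(scores, labels, t_specific))
```

In this toy example, weighting missed cases more heavily yields a lower cutoff with higher sensitivity, whereas weighting false alarms more heavily yields a higher cutoff with higher specificity. The underlying model and its discrimination are unchanged in both cases.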
Limitations exist within this project. Although more than 10,000 cases were incorporated into the machine learning, they were extracted only over a six-month period, and within a single institution. A larger data set over a longer time period may have produced slightly different results, as practice may have changed over the course of the date range, and a greater amount of data may have allowed more complex relationships among the data to be used for algorithm training. Nonetheless, the expectation is that if incorporated into practice, algorithms would be routinely updated based on the most recent data so as to reflect current practice, as has been suggested in other studies of clinical data source relevance.33 While we used the area under the receiver operating characteristic curve as a performance metric, there are a number of other acceptable methods for comparison among machine-learning methods. The performance metric or metrics should be determined based on the examination of the data and the desired outcome, and no single performance metric is likely to fully encompass the viability of a machine-learning application in a given setting. This was a single-institution study, which, without validation against an external database, limits the use of this precise model in another setting. While an area under the receiver operating characteristic curve of 0.76 is demonstrative of reasonably strong discrimination, there still exists substantial potential for improvement in performance before clinical use may be acceptable. There are a number of features that were not easily available as discrete values within the electronic health record for incorporation into the model. For example, numerical indices of heart function such as ventricular ejection fraction, details of airway management, or preoperative lab values may have improved the discrimination of the model. While this may seem to be a limitation, the methods demonstrated herein reflect how practical model training would actually occur, with only those features readily available from most electronic health records. As additional data features become available, the scalability of most machine-learning methods allows the existing model to be enhanced, likely improving its predictive power. As electronic health records evolve, there has also been a transition to including more discrete data fields, which will aid in development of predictive analytics, as well as in other forms of data mining. For data that may not be easily transformed into discrete fields, data aggregation may ultimately have to rely on natural language processing. Finally, our approach finely tuned only one model. It is possible that if other algorithms had been tuned, they may have performed as well as, or better than, the single model we tuned.
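For readers less familiar with the performance metric discussed above, the area under the receiver operating characteristic curve can be read as the probability that a randomly chosen case with the outcome receives a higher predicted risk than a randomly chosen case without it, with ties counted as one half. The following minimal sketch computes it directly from that definition. It is written in Python rather than the R software used in this study, and the function name and toy data are hypothetical assumptions, not the study's code or data.

```python
from typing import Sequence


def roc_auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Area under the ROC curve by pairwise comparison.

    Counts, over every (positive, negative) pair, how often the positive
    case has the higher score; tied scores contribute one half.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative case")

    concordant = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                concordant += 1.0
            elif p == n:
                concordant += 0.5
    return concordant / (len(positives) * len(negatives))


# Hypothetical predicted risks and outcomes (1 = hypotension), not study data.
toy_scores = [0.05, 0.10, 0.20, 0.30, 0.40, 0.60, 0.70, 0.90]
toy_labels = [0, 0, 0, 1, 0, 1, 0, 1]

print(round(roc_auc(toy_scores, toy_labels), 2))
```

The pairwise form is quadratic in the number of cases and is shown only for clarity. Because it depends solely on how cases are ranked, the metric says nothing about calibration or about performance at any particular decision threshold, which is one reason no single metric is sufficient.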
Current intraoperative clinical decision support in most electronic health records is primarily rule based, and has demonstrated success in improving adherence to practice guidelines in cases of blood pressure and glucose management, and prophylaxis for postoperative nausea and vomiting and infection.34–39 Like rule-based clinical decision support systems derived from existing guidelines, machine-learning–based systems will need to be grounded in validated methodologies before assimilation into an intraoperative workflow. For this reason, clinical “bedside” application of these tools should be preceded by “bench” validation as described and demonstrated herein, followed by proof of clinical improvement.
Acknowledgments
The authors would like to thank the Department of Anesthesiology, Perioperative Care, and Pain Medicine, New York University Langone Health, New York, New York, for granting time to support this work.
Research Support
Support was provided solely from institutional and/or departmental sources.
Competing Interests
The authors declare no competing interests.
Functions, Packages, and Tuning Parameters in the R Statistical Software Used for Each Machine-learning Algorithm
