Delirium poses significant risks to patients, but countermeasures can be taken to mitigate negative outcomes. Accurately forecasting delirium in intensive care unit (ICU) patients could guide proactive intervention. Our primary objective was to predict ICU delirium by applying machine learning to clinical and physiologic data routinely collected in electronic health records.
Two prediction models were trained and tested using a multicenter database (years of data collection 2014 to 2015), and externally validated on two single-center databases (2001 to 2012 and 2008 to 2019). The primary outcome variable was delirium defined as a positive Confusion Assessment Method for the ICU screen, or an Intensive Care Delirium Screening Checklist score of 4 or greater. The first model, named the “24-hour model,” used data from the first 24 h after ICU admission to predict delirium any time afterward. The second model, designated the “dynamic model,” predicted the onset of delirium up to 12 h in advance. Model performance was compared with a widely cited reference model.
For the 24-h model, delirium was identified in 2,536 of 18,305 (13.9%), 768 of 5,299 (14.5%), and 5,955 of 36,194 (11.9%) of patient stays, respectively, in the development sample and two validation samples. For the 12-h lead time dynamic model, delirium was identified in 3,791 of 22,234 (17.0%), 994 of 6,166 (16.1%), and 5,955 of 28,440 (20.9%) patient stays, respectively. Mean area under the receiver operating characteristics curve (AUC) (95% CI) for the first 24-h model was 0.785 (0.769 to 0.801), significantly higher than the modified reference model with AUC of 0.730 (0.704 to 0.757). The dynamic model had a mean AUC of 0.845 (0.831 to 0.859) when predicting delirium 12 h in advance. Calibration was similar in both models (mean Brier Score [95% CI] 0.102 [0.097 to 0.108] and 0.111 [0.106 to 0.116]). Model discrimination and calibration were maintained when tested on the validation datasets.
Machine learning models trained with routinely collected electronic health record data accurately predict ICU delirium, supporting dynamic time-sensitive forecasting.
Existing intensive care unit (ICU) delirium prediction models consider a parsimonious set of clinical variables, lack dynamic prediction capability, and have received limited external validation
In a multicenter electronic health record database of 22,234 intensive care unit (ICU) patients from 2014 to 2015, delirium was identified using the Confusion Assessment Method for the ICU screen or Intensive Care Delirium Screening Checklist
Static and dynamic machine learning algorithms were trained, tested, and externally validated to predict the onset of delirium during the ICU stay
The static model using data from the first 24 h after ICU admission to predict delirium at any point during the ICU stay demonstrated higher discrimination compared with a widely cited reference model
The dynamic model was able to predict delirium up to 12 h in advance with reasonable discrimination and calibration
Delirium is common in the acute care setting and particularly in intensive care units (ICUs), affecting up to 35% of hospitalized patients and up to 80% of patients requiring intensive care,1 and costing an estimated $164 billion annually in healthcare expenditures.2 The onset of delirium in hospitalized patients has been independently associated with poor short-term and long-term health outcomes, and research aimed at preventing or treating delirium is regarded as a public health priority.3
Approximately 30 to 40% of delirium cases might be amenable to delirium-reduction strategies.4 Multicomponent interventions focusing on device and catheter removal, promotion of normal sleep-wake cycles, and early mobilization are cost-effective methods for preventing and treating delirium.4 In critically ill patients, implementation of a structured bundle of treatments has been associated with a 40% reduction in delirium in a multisite cohort of more than 15,000 critically ill patients,5 and use of the α2 agonist dexmedetomidine could decrease delirium risk by up to 48%6 in ICU patients requiring sedation. Although these approaches are promising, ICU delirium may be underrecognized and misdiagnosed.7 Delirium screening is inconsistent in many health systems and, even when consistently deployed, may not capture relevant events due to the acute onset and fluctuating nature of the disorder.8
Research during the past two decades has identified a number of delirium risk factors, some of which may be modifiable.3 The ability to predict delirium onset in high-risk individuals might allow preventive or treatment strategies to be implemented in a more targeted or even personalized fashion. Here, we created two models to predict delirium: an early prediction model to identify delirium onset at any time during intensive care by using data available early in the ICU stay, and a dynamic model to predict the onset of delirium 0 to 12 h in the future. We hypothesized that physiologic and clinical variables routinely acquired during intensive care would be associated with the probability of delirium onset.
Materials and Methods
The overarching goal was to predict the onset of ICU delirium by training machine learning models with physiologic and clinical features routinely available at the bedside. If our primary hypothesis is correct, we will reject the null hypothesis that physiologic and clinical variables routinely acquired during intensive care have no relation to the probability of delirium onset. Research followed the Transparent Reporting of a Multivariable Prediction Model for Individual Prognosis checklist,9 a copy of which is provided in table E1 of the Supplemental Digital Content (http://links.lww.com/ALN/C999). The model pipeline is schematized in figure E1 in the Supplemental Digital Content (http://links.lww.com/ALN/C999). All code is available on GitHub at https://github.com/ryanlu41/delirium.
The data analysis and statistical plan for the first 24-h model development were written and recorded in the investigators’ files before data were accessed, while additional model development occurred after the data were accessed. This included determining which patients developed delirium in the development dataset and extracting features guided by domain expertise. The distributions of these features were compared using Mann-Whitney U tests (for nonparametric comparison of continuous features in two independent samples) or chi-square tests (for comparison of proportions in categorical data). All analyses used a P-value threshold of 0.05 for significance. Statistical and machine learning software packages used are detailed in subsequent sections.
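As an illustration of the two univariate tests named above, the following minimal sketch applies them with SciPy. The data are synthetic and the variable names (`age_delirium`, `ventilation_table`) are hypothetical; this is not the study's analysis code.

```python
import numpy as np
from scipy.stats import chi2_contingency, mannwhitneyu

ALPHA = 0.05  # significance threshold stated in the text

# Synthetic, illustrative values only (not study data).
rng = np.random.default_rng(0)
age_delirium = rng.normal(68, 12, size=200)
age_no_delirium = rng.normal(62, 14, size=800)

# Continuous feature: two-sided Mann-Whitney U test on two independent samples.
u_stat, p_continuous = mannwhitneyu(
    age_delirium, age_no_delirium, alternative="two-sided"
)

# Categorical feature: chi-square test on a 2x2 contingency table.
# Rows: delirium yes/no. Columns: hypothetical exposure yes/no.
ventilation_table = np.array([[120, 80], [300, 500]])
chi2, p_categorical, dof, expected = chi2_contingency(ventilation_table)

print(f"Mann-Whitney U p = {p_continuous:.3g}; chi-square p = {p_categorical:.3g}")
```

Each P value would then be compared against `ALPHA` feature by feature.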
Data Sources
Research in this report was performed on three fully deidentified publicly available datasets made available via the Massachusetts Institute of Technology (Cambridge, Massachusetts) PhysioNet repository10 : the Philips eICU Collaborative Research Database (hereafter referred to as the development dataset), the third version of Medical Information Mart for Intensive Care (often abbreviated as MIMIC-III, and hereafter referred to as validation dataset 1), and the fourth version of Medical Information Mart for Intensive Care (often abbreviated as MIMIC-IV, and hereafter referred to as validation dataset 2). The first was used for model training and testing, while the latter two were used for external validation. The development dataset is a multicenter electronic health record–based database containing granular data on 200,859 admissions to ICUs between 2014 and 2015 at 208 hospitals located in the United States.11 Validation dataset 1 comprises electronic health record data from 61,532 ICU stays at Beth Israel Deaconess Medical Center in Boston, Massachusetts, from 2001 to 2012.12 Validation dataset 2 comprises electronic health record data from 76,943 ICU stays at Beth Israel Deaconess Medical Center from 2008 to 2019.13 The two validation datasets likely overlap to some extent because they come from the same hospital and have several years in common.
Data in both validation databases have been deidentified, and the institutional review boards of Massachusetts Institute of Technology (number 0403000206) and Beth Israel Deaconess Medical Center (number 2001-P-001699/14) both approved the use of the database for research. Because the database does not contain protected health information, a waiver of the requirement for informed consent was included in the institutional review board approval. Data in the development dataset are also deidentified, and research using the development dataset is exempt from institutional review board approval due to the retrospective design, lack of direct patient intervention, and the security schema, for which the reidentification risk was certified as meeting safe harbor standards by an independent privacy expert (Privacert, Cambridge, Massachusetts; Health Insurance Portability and Accountability Act Certification number 1031219–2).
Modeling Paradigms
Two modeling paradigms were created (fig. 1). The first, named the “first 24-hour model,” analyzed data collected in the 24 h after ICU admission to predict the probability of delirium at any subsequent point during the ICU stay. The second model, designated “dynamic model,” used cumulative data from ICU admission to the prediction time point and computed the probability of delirium onset 0 to 12 h in the future. For ICU stays with delirium, the model was trained on data obtained before the first positive delirium screen. For ICU stays without delirium, the model was trained on data obtained before a randomly selected negative delirium screen.
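The observation window of the dynamic paradigm can be sketched as follows. This is a minimal illustration under an assumption not spelled out in the text: the prediction time is taken as the reference delirium screen time minus the lead time, and only measurements from ICU admission up to that prediction time are retained. The function name and column names are hypothetical.

```python
import pandas as pd

def observation_window(measurements, reference_time_h, lead_time_h):
    """Keep measurements from ICU admission up to the prediction time.

    Assumption: prediction time = reference screen time - lead time.
    """
    prediction_time_h = reference_time_h - lead_time_h
    hours = measurements["hours_from_admit"]
    return measurements.loc[(hours >= 0) & (hours <= prediction_time_h)]

# Synthetic heart rate series for one stay (hours since ICU admission).
heart_rate = pd.DataFrame({
    "hours_from_admit": [1.0, 20.0, 45.0, 55.0, 61.0],
    "value": [88, 92, 101, 110, 115],
})

# Reference screen at 61.3 h with a 12-h lead time: data up to 49.3 h are used.
window = observation_window(heart_rate, reference_time_h=61.3, lead_time_h=12)
print(window)
```

Measurements recorded after the prediction time are left out so that no information from the lead interval reaches the model.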
Model frameworks for the first 24-h and dynamic modeling paradigms. ICU, intensive care unit.
Case Identification
Flow diagrams illustrating the process for case identification and selection are shown in figure 2. Patient stays were selected for the first 24-h model if they were admitted to the ICU, remained in the ICU for at least 24 h, and were screened for delirium using the Confusion Assessment Method for the Intensive Care Unit14 or the Intensive Care Delirium Screening Checklist.15 To limit the possibility that patients had delirium before the time of prediction, we excluded patient stays in which there was a positive delirium test or diagnosis in the first 24 h.
Study flow diagrams for the first 24-h model (left), and the 12-h lead time dynamic model with at least 12 h of data (right). ICU, intensive care unit.
For the dynamic model, we selected ICU stays of patients who were in the ICU for at least 12 h and who had at least one delirium screening. To limit the possibility that patients entered the ICU with delirium, we excluded patients in whom there was a positive delirium screen or diagnosis in the first 12 h. Delirium-positive cases were identified by finding the first positive Confusion Assessment Method for the ICU or Intensive Care Delirium Screening Checklist screen in a patient stay, defining that time point as delirium onset, and using data preceding onset to make predictions. The median time (and interquartile range) of delirium onset in the 12-h lead time dynamic development model cohort was 61.3 (38.3 to 109.5) hours after ICU admission. Delirium-negative cases were obtained by finding ICU stays where all delirium screening was negative, randomly selecting one of the screens, and using data from before that delirium test to make predictions. The median time (and interquartile range) of these randomly selected negative delirium screens in the development cohort was 39.7 (30.4 to 59.8) hours after ICU admission. This resulted in the model making predictions across a wide range of times in ICU stays.
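The selection of reference times described above can be sketched with pandas. The table layout, column names, and function name are assumptions made for illustration, and the sketch covers screening results only, not the diagnosis-based exclusions.

```python
import numpy as np
import pandas as pd

# Hypothetical long-format table of delirium screens (synthetic values).
screens = pd.DataFrame({
    "stay_id": [1, 1, 1, 2, 2, 3, 3, 3],
    "hours_from_admit": [10.0, 30.0, 50.0, 20.0, 40.0, 5.0, 26.0, 60.0],
    "positive": [False, True, True, False, False, True, False, False],
})

EXCLUSION_WINDOW_H = 12  # dynamic model; the first 24-h model uses 24

def select_reference_times(screens, window_h, seed=0):
    """Return one reference time and label per eligible ICU stay."""
    rng = np.random.default_rng(seed)
    rows = []
    for stay_id, grp in screens.groupby("stay_id"):
        pos = grp[grp["positive"]]
        if (pos["hours_from_admit"] <= window_h).any():
            continue  # excluded: positive screen inside the early window
        if len(pos) > 0:
            # Delirium-positive stay: onset is the first positive screen.
            rows.append((int(stay_id), float(pos["hours_from_admit"].min()), 1))
        else:
            # Delirium-negative stay: pick one negative screen at random.
            pick = rng.choice(grp["hours_from_admit"].to_numpy())
            rows.append((int(stay_id), float(pick), 0))
    return pd.DataFrame(rows, columns=["stay_id", "reference_time_h", "delirium"])

cohort = select_reference_times(screens, EXCLUSION_WINDOW_H)
print(cohort)
```

In this toy table, stay 1 is labeled positive with onset at 30 h, stay 2 is labeled negative with a randomly chosen screen, and stay 3 is excluded because of a positive screen at 5 h.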
Outcome Variable
The primary outcome variable was delirium, defined as a positive Confusion Assessment Method for the ICU screen or a score of 4 or more on the Intensive Care Delirium Screening Checklist, without any contradictions from diagnostic code information. Both Confusion Assessment Method for the ICU and Intensive Care Delirium Screening Checklist scores are documented in the development dataset, while only the Confusion Assessment Method for the ICU is recorded in both validation databases. Additionally, the development database marks some patients with a free text delirium diagnosis. Patients with this diagnosis alone and no positive delirium screenings were excluded from the study. In the development dataset, the median (interquartile range) interval between delirium tests was 4.0 h (1.0 to 12.0 h), while in validation dataset 1 it was 9.2 h (4.0 to 13.3 h) and in validation dataset 2 it was 9.4 h (4.0 to 12.3 h).
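The screening rule in this definition can be written as a small predicate. This is a sketch only: the function and argument names are hypothetical, and the check against diagnostic code information is not modeled.

```python
from typing import Iterable, Optional, Tuple

ICDSC_THRESHOLD = 4  # Intensive Care Delirium Screening Checklist cutoff

def screen_is_positive(cam_icu_positive: Optional[bool],
                       icdsc_score: Optional[int]) -> bool:
    """True for a positive CAM-ICU screen or an ICDSC score of 4 or more.

    Either argument may be None when that instrument was not recorded.
    """
    if cam_icu_positive is True:
        return True
    return icdsc_score is not None and icdsc_score >= ICDSC_THRESHOLD

def stay_has_delirium(
    screens: Iterable[Tuple[Optional[bool], Optional[int]]]
) -> bool:
    """A stay is delirium positive if any of its screens is positive."""
    return any(screen_is_positive(cam, icdsc) for cam, icdsc in screens)

print(stay_has_delirium([(False, None), (None, 5)]))
```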
Predictive Variables
Predictive variables to consider in the model were identified through literature review, clinician guidance, and dataset exploration. Variables extracted included patient demographics, medical history and comorbidities, laboratory studies, medications administered, other treatments, nurse documentation, and physiologic time series (both nurse-validated data and automated data from monitors), with all time stamps being recorded at minute-level resolution. All variables used in the model were temporally distinct from data used in assessing outcomes. All analyses up to this point were done using Python, specifically the numpy16 and pandas17 packages, with the exception of the comorbidity features that were created using the R package comorbidity.18
Preprocessing
The distributions of each feature were examined by a fellowship-trained, board-certified intensive care physician, who helped define upper and lower bounds of physiologic plausibility; values deemed implausible were then removed. For each model and lead time, features with more than 20% of samples missing were excluded, which primarily resulted in the removal of less common laboratory tests from the feature space, such as alkaline phosphatase measurements or monocyte counts. Missing values were then imputed using mean imputation, based on training data means.
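The three preprocessing steps can be sketched with pandas as follows. The plausibility bounds and helper names are hypothetical (the study's bounds were set by a clinician and are not listed here); the 20% missingness threshold and the use of training-set means follow the text.

```python
import numpy as np
import pandas as pd

# Hypothetical plausibility bounds, for illustration only.
BOUNDS = {"heart_rate": (20, 300), "lactate": (0.1, 30)}
MAX_MISSING_FRACTION = 0.20

def mask_implausible(df):
    """Replace values outside the plausibility bounds with NaN."""
    out = df.copy()
    for col, (low, high) in BOUNDS.items():
        if col in out:
            out[col] = out[col].where(out[col].between(low, high))
    return out

def fit_preprocessor(train):
    """Choose retained features and imputation means from training data only."""
    cleaned = mask_implausible(train)
    keep = [c for c in cleaned.columns
            if cleaned[c].isna().mean() <= MAX_MISSING_FRACTION]
    return keep, cleaned[keep].mean()

def apply_preprocessor(df, keep, means):
    """Apply bounds, drop sparse features, and impute with training means."""
    return mask_implausible(df)[keep].fillna(means)

train = pd.DataFrame({
    "heart_rate": [80.0, 90.0, 1000.0, 100.0, 70.0],  # 1000 is implausible
    "lactate": [1.0, np.nan, np.nan, 2.0, np.nan],    # 60% missing
})
new_data = pd.DataFrame({"heart_rate": [np.nan, 500.0], "lactate": [1.0, 2.0]})

keep, means = fit_preprocessor(train)
processed = apply_preprocessor(new_data, keep, means)
print(processed)
```

Fitting the retained feature list and the means on the training split, then reusing them on held-out data, keeps the imputation step from seeing test-set values.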
Feature Development and Analysis
Predictive features were created from the processed data. Categorical variables were one-hot encoded into individual features (i.e., translated into binary variables) and sometimes grouped together for simplicity. For numerical variables with multiple values during the patient stay, estimates of central tendency and variance such as means and standard deviations were calculated. For higher frequency variables such as respiratory rate, heart rate, blood pressure, or oxygen saturation data, more complex features such as Fourier transform coefficients or wavelets were computed using the Python tsfresh package.19 A full list of features used in the models is available in table E2 in the Supplemental Digital Content (http://links.lww.com/ALN/C999).
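A minimal pandas sketch of the two simpler feature types (one-hot encoding and per-stay summary statistics) is shown below. The column names, including `unit_type`, are hypothetical, and the Fourier and wavelet features computed with tsfresh are not reproduced.

```python
import pandas as pd

# Hypothetical categorical variable per stay (synthetic values).
static = pd.DataFrame({"stay_id": [1, 2, 3],
                       "unit_type": ["MICU", "SICU", "MICU"]})
one_hot = pd.get_dummies(static, columns=["unit_type"], dtype=int)

# Repeated numerical measurements summarized per stay as mean and
# sample standard deviation.
vitals = pd.DataFrame({"stay_id": [1, 1, 1, 2, 2],
                       "heart_rate": [80.0, 90.0, 100.0, 70.0, 74.0]})
summary = (vitals.groupby("stay_id")["heart_rate"]
           .agg(["mean", "std"])
           .add_prefix("heart_rate_"))

# One row per stay; stays without measurements keep NaN for later imputation.
features = one_hot.merge(summary, left_on="stay_id", right_index=True, how="left")
print(features)
```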
Model Development
Model features were analyzed using three machine learning algorithms: logistic regression, random forest,20 and gradient boosting (CatBoost21 ), as well as an ensemble or stacked model using outputs from all three algorithms. All modeling and evaluation, excluding the CatBoost algorithm, was done using Python, specifically the scikit-learn package22 and SciPy package.23 Clinically relevant features and their relationships to delirium risk were determined using logistic regression, random forest, or SHapley Additive exPlanations (SHAP or Shapley values).24 Shapley values indicate a quantitative association between a feature and a given model output, with high Shapley values indicating association with high model output, and vice versa. Shapley plots are increasingly leveraged to visualize complex relationships captured by machine learning.
For each modeling exercise, features were removed if the feature importance as determined internally by each training algorithm was zero, or if they were deemed implausible by a fellowship-trained, board-certified intensive care physician.
A nested cross-validation scheme with five inner folds and four outer folds was used to train and evaluate models. In this setup, the complete dataset is split into four different outer training and testing set combinations. Each outer training set is then further split into five different datasets. For a given iteration of the outer loop, hyperparameters are first tuned using 5-fold cross-validation on the training set of the iteration. A model with the optimal hyperparameters is then trained on the outer training set and evaluated on the testing set of the iteration. This process is repeated four times for each different outer training and testing split. To train a final model, hyperparameters are tuned with the outer splits, and then a model with the best hyperparameters is trained on the entire dataset.
Hyperparameters were tuned using Bayesian hyperparameter optimization where feasible, or grid search otherwise. For Bayesian hyperparameter optimization, the Tree-structured Parzen Estimator Approach25 was utilized. Final hyperparameters are in table E3 in the Supplemental Digital Content (http://links.lww.com/ALN/C999).
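The nested cross-validation scheme can be sketched with scikit-learn as follows. This is an illustration under stated assumptions: synthetic data, logistic regression only, a small grid search in place of Bayesian optimization, and stratified folds (the text does not state whether folds were stratified).

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic, imbalanced data standing in for the feature matrix.
X, y = make_classification(n_samples=400, n_features=10,
                           weights=[0.85, 0.15], random_state=0)

outer = StratifiedKFold(n_splits=4, shuffle=True, random_state=0)
inner = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
param_grid = {"C": [0.01, 0.1, 1.0]}  # hypothetical search space

outer_aucs = []
for train_idx, test_idx in outer.split(X, y):
    # Inner loop: tune hyperparameters with 5-fold CV on the outer training set.
    search = GridSearchCV(LogisticRegression(max_iter=1000), param_grid,
                          scoring="roc_auc", cv=inner)
    search.fit(X[train_idx], y[train_idx])
    # The refit best model is evaluated once on the held-out outer test set.
    proba = search.predict_proba(X[test_idx])[:, 1]
    outer_aucs.append(roc_auc_score(y[test_idx], proba))

# Final model: tune on the outer splits, then refit on the entire dataset.
final_search = GridSearchCV(LogisticRegression(max_iter=1000), param_grid,
                            scoring="roc_auc", cv=outer)
final_search.fit(X, y)
print(outer_aucs, final_search.best_params_)
```

The four values in `outer_aucs` correspond to the outer test-set performances that a scheme of this kind reports.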
A model based on the features from the PREdiction of DELIRium in ICu patients model (often abbreviated as PRE-DELIRIC, and hereafter referred to as the reference model),26 a widely cited reference model in ICU delirium prediction, was created for comparison, with slight adjustments (described in table E4 of the Supplemental Digital Content, http://links.lww.com/ALN/C999) to features based on data availability in the development dataset. These reference model–based features were used to train a logistic regression model in the development dataset.
Model Performance Evaluation
Model performance was evaluated using three metrics: area under the receiver operating characteristic curve (AUC), area under the precision recall curve (or mean precision), and Brier score with calibration curves. Models were also selected to minimize the computation required for training and making predictions. These metrics indicate the strength of the relationship between the predictive variables and delirium risk. The performances on the outer testing sets are reported. To externally validate the results, the final model trained on the entire development dataset was tested on the validation datasets’ extracted features, without using these features in the model training process. For each of these metrics, 95% CIs were also calculated. Detailed performance metrics are reported for the various iterations of the first 24-h model and the dynamic models in table E5, figure E2, and figure E3 in the Supplemental Digital Content (http://links.lww.com/ALN/C999). For the primary analysis, we will reject the null hypothesis if AUC is more than 0.5.
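The three metrics can be computed with scikit-learn as sketched below. The data are synthetic, and the percentile bootstrap used for the 95% CI is an assumption made for illustration; the text does not state how the intervals were derived.

```python
import numpy as np
from sklearn.metrics import (average_precision_score, brier_score_loss,
                             roc_auc_score)

def evaluate(y_true, y_prob):
    """Discrimination (AUC, average precision) and calibration (Brier)."""
    return {
        "auc": roc_auc_score(y_true, y_prob),
        "average_precision": average_precision_score(y_true, y_prob),
        "brier": brier_score_loss(y_true, y_prob),
    }

def bootstrap_ci(y_true, y_prob, metric, n_boot=200, seed=0):
    """Percentile bootstrap 95% CI for a metric (assumed method)."""
    rng = np.random.default_rng(seed)
    n = len(y_true)
    stats = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)
        if y_true[idx].min() == y_true[idx].max():
            continue  # resample contains a single class; metric undefined
        stats.append(metric(y_true[idx], y_prob[idx]))
    low, high = np.percentile(stats, [2.5, 97.5])
    return float(low), float(high)

# Synthetic labels and predicted probabilities (not study data).
rng = np.random.default_rng(1)
y_true = (rng.random(500) < 0.15).astype(int)
y_prob = np.clip(0.1 + 0.3 * y_true + rng.normal(0, 0.1, 500), 0.001, 0.999)

metrics = evaluate(y_true, y_prob)
auc_low, auc_high = bootstrap_ci(y_true, y_prob, roc_auc_score)
print(metrics, (auc_low, auc_high))
```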
Results
Patient Characteristics
Flow diagrams with details of patient inclusion and exclusion are provided in figure 2, and characteristics of the population are reported in table 1, and online tables E6 and E7 in the Supplemental Digital Content (http://links.lww.com/ALN/C999). The cohort for the first 24-h model consisted of 18,305 total patient stays in the development dataset of which 2,536 (13.9%) were labeled as delirium positive. In validation dataset 1, a total of 5,299 patient stays were identified of which 768 (14.5%) were delirium positive, and in validation dataset 2 a total of 36,194 patient stays were identified, of which 5,955 (11.9%) were delirium positive. Across all datasets, median APACHE IV scores and unadjusted mortality were significantly higher in delirium-positive patients.
Characteristics of Cases and Controls from the Development Dataset for the First 24-h Model Cohort

For the 12-h lead time dynamic model, the development cohort consisted of 22,234 total patient stays in the development dataset of which 3,791 (17.0%) were labeled as delirium positive and 18,443 (83.0%) were negative. In validation dataset 1, a total of 6,166 patient stays were identified of which 994 (16.1%) were delirium positive and 5,172 (83.9%) were negative, and, in validation dataset 2, a total of 28,440 patient stays were identified, of which 5,955 (20.9%) were delirium positive and 22,485 (79.1%) were delirium negative. Characteristics of admissions analyzed in the development cohort for the first 24-h model are in table 1, and similar characteristics from the validation cohorts are available in table E6 and table E7 in the Supplemental Digital Content (http://links.lww.com/ALN/C999).
First 24-h Model Performance
The best-performing algorithm was CatBoost, trained with a pruned feature space of 155 features. All predictive performance metrics are summarized in figure 3. Although AUCs from models that share features do not have a standard statistical test,27 the mean AUC (95% CI) was 0.785 (0.769 to 0.801), which is higher than that of the modified reference model that had a mean AUC of 0.730 (0.704 to 0.757). This model was validated successfully in the validation dataset 1 population (AUC of 0.796) and the validation dataset 2 population (AUC of 0.810). While holding the sensitivity at 0.85, the model achieved a specificity of 0.556 (0.515 to 0.586), negative predictive value of 0.948 (0.943 to 0.950), and positive predictive value of 0.282 (0.264 to 0.296). The mean precision was 0.384 (0.357 to 0.411) in the development dataset, while in validation dataset 1 it was 0.389 and in validation dataset 2 it was 0.475. The mean Brier score was 0.102 (0.097 to 0.108) in the development dataset, 0.105 in validation dataset 1, and 0.110 in validation dataset 2.
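Reporting specificity and predictive values at a fixed sensitivity requires choosing a probability threshold first. The following sketch shows one way to do this with NumPy, under the assumption that the threshold is the highest cutoff that still flags the required fraction of true cases; the function name and the toy data are hypothetical.

```python
import numpy as np

def metrics_at_sensitivity(y_true, y_prob, target_sensitivity=0.85):
    """Threshold-dependent metrics with sensitivity of at least the target."""
    y_true = np.asarray(y_true).astype(bool)
    y_prob = np.asarray(y_prob, dtype=float)
    # Scores of true cases, highest first; flag at least k of them.
    pos_scores = np.sort(y_prob[y_true])[::-1]
    k = int(np.ceil(target_sensitivity * pos_scores.size - 1e-9))
    threshold = pos_scores[k - 1]
    pred = y_prob >= threshold
    tp = np.sum(pred & y_true)
    fp = np.sum(pred & ~y_true)
    fn = np.sum(~pred & y_true)
    tn = np.sum(~pred & ~y_true)
    return {
        "threshold": float(threshold),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
    }

# Toy example: 4 delirium-positive and 6 delirium-negative stays.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_prob = [0.9, 0.8, 0.7, 0.2, 0.6, 0.5, 0.3, 0.1, 0.1, 0.05]
result = metrics_at_sensitivity(y_true, y_prob, target_sensitivity=0.85)
print(result)
```

With only four positive stays, reaching 0.85 sensitivity requires flagging all four, so the achieved sensitivity in this toy example is 1.0 rather than exactly 0.85.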
Model performance metrics for the first 24-h models, including the re-created reference model (PREdiction of DELIrium in ICu patients, often abbreviated as PRE-DELIRIC), and our current first 24-h CatBoost model on the development and external validation datasets. (A) Receiver operating characteristic curves with 95% CI shaded. (B) First 24-h model precision-recall curves with average precision and 95% CI shaded. (C) Calibration curves and calculated Brier scores. AUC, area under the receiver operating characteristics curve.
Dynamic Model Performance
The dynamic models performed overall better than the first 24-h model, with higher performances noted at short lead times (fig. 4; although the significance of AUC comparisons between models with shared features may not be ascertainable using available statistical methodologies27 ). The 12-h model had a mean AUC (95% CI) of 0.845 (0.831 to 0.859) and was validated in the validation dataset 1 population (AUC of 0.804) and the validation dataset 2 population (AUC of 0.838). When holding sensitivity at 0.85, the model achieved a specificity of 0.657 (0.623 to 0.691), negative predictive value of 0.955 (0.953 to 0.957), and positive predictive value of 0.337 (0.315 to 0.359). The mean precision was 0.590 (0.566 to 0.613) in the development dataset, while in validation dataset 1 it was 0.449 and in validation dataset 2 it was 0.593. The mean Brier score was 0.111 (0.106 to 0.116) in the development dataset, 0.165 in validation dataset 1, and 0.132 in validation dataset 2. With 6-h lead time, which is greater than the median time between delirium tests in the development dataset, the mean AUC (95% CI) was 0.880 (0.872 to 0.887). External validation performance was variable, with validation dataset 2 results generally within the 95% CI of the development dataset results, and validation dataset 1 performance being slightly worse. However, the maximum absolute difference in AUC between development and validation samples was 0.04. In all instances, the 95% CI of the AUCs excluded 0.5; thus, we reject the null hypothesis that physiologic and clinical variables routinely acquired during intensive care have no relation to the probability of delirium onset.
Time dependence of model discrimination. AUC (A), average precision (B), and Brier scores (C) of the dynamic model at different lead times, with 95% CI indicated by error bars and corresponding external validation performance. AUC, area under the receiver operating characteristics curve.
Feature Importance Analysis
Feature importance was determined using the Shapley values, with relative importance and directionality shown in figure 5 for the dynamic model at 1-h lead time, and for the first 24-h model in figure E7 of the Supplemental Digital Content (http://links.lww.com/ALN/C999). Although feature importance varied by model, features involving Glasgow Coma Scores, Richmond Agitation Sedation Scale, age, mechanical ventilation, and overall acuity were important in prediction. Length of ICU stay before delirium onset and time of day were important predictors for the dynamic model. Contrary to one of our primary hypotheses, physiologic time series features based on 5-min frequency blood pressure, respiratory rate, heart rate, and oxygen saturation data did not increase either model’s performance, as can be seen from performance metrics shown in table E8 in the Supplemental Digital Content (http://links.lww.com/ALN/C999).
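To make the meaning of a Shapley value concrete, the sketch below computes exact Shapley values by brute force for a three-feature toy score. It does not use the SHAP package and is not the study's pipeline: the scoring function, its coefficients, and the single-reference baseline (features outside a coalition are set to a reference patient's values) are all assumptions chosen for illustration.

```python
from itertools import combinations
from math import factorial

def shapley_values(predict, x, baseline):
    """Exact Shapley values of predict(x) relative to predict(baseline)."""
    n = len(x)

    def value(subset):
        # Features in the coalition take the patient's values, others the baseline's.
        z = [x[i] if i in subset else baseline[i] for i in range(n)]
        return predict(z)

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for r in range(n):
            weight = factorial(r) * factorial(n - r - 1) / factorial(n)
            for s in combinations(others, r):
                total += weight * (value(set(s) | {i}) - value(set(s)))
        phi.append(total)
    return phi

def toy_risk(z):
    """Hypothetical risk score (not a probability, not a fitted model)."""
    age, gcs, ventilated = z
    return 0.01 * age - 0.05 * gcs + 0.2 * ventilated + 0.002 * age * ventilated

patient = [80, 9, 1]     # older, lower Glasgow Coma Score, ventilated
reference = [60, 15, 0]  # hypothetical reference patient
phi = shapley_values(toy_risk, patient, reference)
print(dict(zip(["age", "gcs", "ventilated"], phi)))
```

The values sum to the difference between the patient's score and the reference score, which is the property that lets each value be read as that feature's share of the model output.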
Feature analysis of the dynamic model. Shapley summary plot of the top 20 features for the dynamic model, at 12-h lead time for prediction of delirium. Each dot represents the Shapley value of one sample for that feature. A feature’s Shapley value represents the association of that feature to the risk score, with positive values indicating an association with a higher risk of delirium, and negative values indicating an association with a lower risk of delirium. The location of the dot on the x-axis represents its Shapley value, and its color represents the feature’s absolute value. For example, low age is associated with low Shapley values, while high age is associated with high Shapley values, indicating that elderly patients are at higher risk for delirium. ICU, intensive care unit.
Discussion
Main Findings
Using large clinical databases, we developed and validated two new models for the prediction of delirium in the ICU. The first 24-h early prediction model performed better than the adapted reference model (fig. 3) and calibrated well in a contemporary dataset (fig. 3). The second dynamic delirium prediction model, which enables estimates of delirium risk that are continually updated over time, had higher discrimination than the modified reference model and similar or better discrimination compared with published delirium prediction models. Both models generally validated well on two external datasets, especially on the more contemporary dataset, although calibration of this model was limited in the validation cohorts, where, at high predicted probability, the observed probability of delirium was considerably lower.
Analysis of Existing Literature
Studies evaluating ICU delirium prediction have varied in patient characteristics, predictive frameworks, and model performance (see table E9 in the Supplemental Digital Content, http://links.lww.com/ALN/C999). Many of the most predictive features identified in previous work (e.g., age, mechanical ventilation, severity of illness [APACHE, SOFA], exposure to benzodiazepines) were confirmed in the models presented here. Most previous studies reporting high predictive performance used static models that were unable to predict delirium onset at a specific time point. PREdiction of DELIRium in ICu patients (PRE-DELIRIC)26 is the most studied model of this type, with large variability in performance across different populations internationally, although a meta-analysis estimated an aggregate AUC (95% CI) of 0.844 (0.793 to 0.896).28 Many previous studies had strict patient inclusion criteria, focusing on certain conditions or ICU types, thereby limiting their generalizability.29,30 Several reports on higher-performing models lacked external validation and were based on much smaller sample sizes than the ones reported here.29,30 The limited number of previous studies that made time-specific predictions of delirium onset evaluated their models on a daily basis,29–31 for example, at midnight, and predicted delirium onset in the next 24 h.
Physiologic Time Series
Early in this project we postulated that complex features from physiologic time series data (specifically blood pressure, respiratory rate, oxygen saturation, and heart rate) could be used to predict the onset of delirium. This hypothesis was not verified in our models. Although such features had some predictive power, they did not improve model performance and were more computationally costly. This counterintuitive result could potentially be due to feature redundancy, or perhaps the lack of hypothesized association.
Strengths
Large and Heterogeneous Datasets
Our study uses data from the Philips eICU Collaborative Research Database, which contains more than 200,000 unique ICU stays from more than 200 hospitals across the United States. This dataset includes data from critical care units in a wide range of different health system sizes, organizational structures, and settings.11 We extracted 22,234 ICU stays from this dataset for training, testing, and validating our dynamic model. This is larger than population samples used in previous delirium prediction studies, and this population likely has greater heterogeneity than the data used in the original reference model dataset, which trained on data from 1,613 stays in a single hospital in the Netherlands and validated with data from four other hospitals in the same country.26 Model results obtained on diverse populations could be more generalizable, especially in the United States where the data were collected. The publicly available nature of the dataset offers opportunities for other research groups to evaluate our models’ reproducibility.
Feature Space
Our models identified several predictive features that were not in the reference model (fig. 5). Low Glasgow Coma Scores, longer length of stay, and the prediction time being late at night were all associated with higher risk of delirium. These new features may explain why our model outperforms the reference model.
Potential Advantages of Dynamic Prediction
The dynamic model presented here is designed to predict delirium at set times up to 12 h in the future, potentially a more context-sensitive approach than that of other prediction systems. More time-specific onset predictions could allow targeted implementation of preventive measures in patients with immediate high risk. At shorter lead times (1 h or less) this algorithm is identifying delirium close to the present, which could improve the ability to treat ongoing delirium and reduce harm.
External Validation
The delirium prediction models were validated on two large independent external datasets, although validation on the older external dataset was less robust. These results suggest that the models performed well in the development dataset and may also be applicable to other populations. The vast majority of features were observed in the external validation dataset, suggesting that the features in our model could be generalizable to a range of intensive care settings.
Limitations
Study Design
We have evaluated the relationship between delirium as an outcome and a range of different exposure variables. We feel that this design, which is equivalent to a case-control methodology, is well suited to the modeling task. However, we recognize its limitations, including the inability to determine causality, and potential sources of bias associated with selection and recall, as well as the temporal bias that can occur when there is overemphasis on features that are close in time to the outcome of interest.32,33
Outcome Labels
The frequency of delirium observed in both development and validation samples is lower than that reported in other ICU delirium studies. This may in part be explained by our exclusion of patients who had delirium early in their ICU stay. It is also likely that some patients had been misclassified by clinicians documenting their delirium screens. Hypoactive delirium, in particular, may be overlooked in ICU settings, even when using clinical screens.34 Data based on such documentation may have biased our models, making them more suited for predicting hyperactive delirium.
Another key limitation is the uncertainty regarding the precise timing of delirium onset and resolution, and the impact of this uncertainty on delirium detection and even epidemiology.35,36 One of the cardinal features of delirium is its fluctuating nature, which makes precise timing of detection difficult. As in other clinical studies of delirium, we identified cases from documented Confusion Assessment Method for the ICU and Intensive Care Delirium Screening Checklist screening tests; however, the temporal interval between documentation and the clinical state of patients may vary. Similarly, inconsistencies in the frequency of delirium screening represent another limitation. These inconsistencies might reflect clinician workflows in the ICU setting that can directly impact when these tests are conducted and documented.
Feature Data
Another important limitation in this work was the reduced availability or missingness of certain predictive variables in all of the datasets. A number of key variables for delirium prediction (e.g., pre-ICU functional status, detailed histories of cognitive and neurologic conditions, psychologic disorders) were not consistently available, which likely reduced the predictive capabilities of our model. Using routine clinical data means that our predictive features are also susceptible to inconsistency that arises from manual charting.
Model Utility
Although the models presented in this work had overall good performance characteristics, their real-world utility needs to be determined given questions regarding calibration and dynamic prediction. Calibration results were below expectation. Clinical decisions based on a model that overpredicts might lead to unnecessary resource allocation and even harm. Reduced calibration may be due to overfitting, high levels of heterogeneity in the data, use of too many predictive features, and biased sampling.37 Although our training and testing performances do not suggest overfitting, the other cited factors may have influenced our calibration results.
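To make the calibration terms above concrete, the following is a minimal sketch of how a Brier score and a simple overprediction check can be computed from binary outcomes and predicted probabilities. It is illustrative only and is not the analysis code used in this study; the function names `brier_score` and `mean_overprediction` and the toy values are assumptions introduced here.

```python
def brier_score(y_true, y_prob):
    """Mean squared difference between predicted probability and observed outcome.

    y_true: list of 0/1 outcomes (1 = delirium observed); y_prob: predicted risks in [0, 1].
    Lower is better; 0 is a perfect score.
    """
    if len(y_true) != len(y_prob) or len(y_true) == 0:
        raise ValueError("y_true and y_prob must be non-empty and of equal length")
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


def mean_overprediction(y_true, y_prob):
    """Mean predicted risk minus observed event rate.

    A positive value indicates that the model overpredicts on average.
    """
    if len(y_true) != len(y_prob) or len(y_true) == 0:
        raise ValueError("y_true and y_prob must be non-empty and of equal length")
    return sum(y_prob) / len(y_prob) - sum(y_true) / len(y_true)


# Toy example (hypothetical values, not study data).
outcomes = [0, 0, 1, 0]
risks = [0.5, 0.5, 0.5, 0.5]
score = brier_score(outcomes, risks)
bias = mean_overprediction(outcomes, risks)
```

The Brier score depends on outcome prevalence: a model that assigns every stay the same risk p, equal to the prevalence, scores p(1 - p). Reported values are therefore easier to interpret against that baseline than in isolation.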
The paradigm of dynamic predictive modeling, while potentially impactful, presents theoretical and implementation challenges. As the temporal offset between the observation and prediction window narrows, the distinction between predicting an event and identifying it becomes less clear. Regarding implementation of a dynamic prediction system for delirium, one can envision that models with sufficient lead times (e.g., 12 h or more) could be leveraged for antidelirium interventions; in contrast, the actionable impact of shorter lead times (e.g., 1 h or 6 h) needs further study.
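The role of lead time can be illustrated with a short sketch. It assumes that each observation carries a timestamp in hours since ICU admission and that only observations recorded at or before the cutoff (onset time minus lead time) are available to the model. The function `observations_before_cutoff` and the toy values are hypothetical and do not reproduce the study's data pipeline.

```python
def observations_before_cutoff(observations, onset_hour, lead_hours):
    """Keep only observations recorded at or before (onset_hour - lead_hours).

    observations: list of (hour_since_admission, value) tuples.
    A smaller lead_hours moves the cutoff closer to the event, so the model
    sees data that are closer in time to delirium onset.
    """
    if lead_hours < 0:
        raise ValueError("lead_hours must be non-negative")
    cutoff = onset_hour - lead_hours
    return [(t, v) for t, v in observations if t <= cutoff]


# Toy heart rate series (hypothetical values), with delirium onset at hour 23.
heart_rate = [(0, 80), (5, 90), (11, 100), (20, 110)]
with_12h_lead = observations_before_cutoff(heart_rate, onset_hour=23, lead_hours=12)
with_1h_lead = observations_before_cutoff(heart_rate, onset_hour=23, lead_hours=1)
```

In this toy series the 1-h lead time admits the observation at hour 20, which the 12-h lead time excludes. This shows how short lead times can blur prediction into detection.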
Singular Prediction per ICU Stay
To maintain independence between samples, our dynamic model was trained by treating each ICU stay as a sample. However, in a realistic clinical setting, a dynamic model would need to make multiple iterative predictions per ICU stay on a regular basis, likely increasing the challenge of accurate and timely prediction. An analysis of one sample’s predicted risk over time is shown in figure E8 in the Supplemental Digital Content (http://links.lww.com/ALN/C999), and such analyses will be replicated in other samples in future studies.
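As a sketch of what iterative prediction within a single stay might look like, the example below re-scores a stay at regular intervals using only the data available up to each prediction time. The function `risk_trajectory`, the 4-h interval, and the placeholder `toy_risk` scoring function are assumptions made for illustration; they are not the trained model or the procedure used to produce figure E8.

```python
def risk_trajectory(observations, stay_hours, interval_hours, risk_fn):
    """Return (prediction_hour, risk) pairs computed at regular intervals.

    observations: list of (hour_since_admission, value) tuples.
    risk_fn: callable that maps the observations seen so far to a risk in [0, 1].
    """
    if interval_hours <= 0:
        raise ValueError("interval_hours must be positive")
    trajectory = []
    t = interval_hours
    while t <= stay_hours:
        seen = [(h, v) for h, v in observations if h <= t]
        trajectory.append((t, risk_fn(seen)))
        t += interval_hours
    return trajectory


def toy_risk(seen):
    """Placeholder score (hypothetical): fraction of observations with value > 100."""
    if not seen:
        return 0.0
    return sum(1 for _, v in seen if v > 100) / len(seen)


# Toy heart rate series (hypothetical values) for a 12-h stay, scored every 4 h.
heart_rate = [(1, 90), (5, 110), (9, 120)]
trajectory = risk_trajectory(heart_rate, stay_hours=12, interval_hours=4, risk_fn=toy_risk)
```

Successive predictions within a stay share most of their input data and are therefore not independent samples, which is the reason given above for training on one sample per stay.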
Comparison with Reference Model
The modified reference model constructed in this work is not identical to the original model, because it includes adjustments (described in table E4, http://links.lww.com/ALN/C999). Its performance could have been influenced by the transformation of features.
Conclusion
Leveraging machine learning applied to very large datasets, we have developed and externally validated a novel approach for prediction of delirium in the ICU. After further prospective testing and validation, these models could help support the implementation of delirium-reducing interventions in high-risk individuals.
Acknowledgments
The authors thank Sridevi V. Sarma, Ph.D., Associate Professor, Biomedical Engineering, Whiting School of Engineering, Johns Hopkins University, Baltimore, Maryland, for her invaluable guidance in the early phase of this project, Karol M. Pencina, Ph.D., Chief Biostatistician at Harvard Medical School, Boston, Massachusetts, for his advice on statistical analysis, and Ike Zhang, Student in Biomedical Engineering at the Whiting School of Engineering, Johns Hopkins University, Baltimore, Maryland, for his aid in preparing this manuscript.
Research Support
Support was provided solely from institutional and/or departmental sources at Johns Hopkins University (Baltimore, Maryland).
Competing Interests
The authors declare no competing interests.
Supplemental Digital Content
Supplemental Digital Content, http://links.lww.com/ALN/C999