Background

Delirium poses significant risks to patients, but countermeasures can be taken to mitigate negative outcomes. Accurately forecasting delirium in intensive care unit (ICU) patients could guide proactive intervention. Our primary objective was to predict ICU delirium by applying machine learning to clinical and physiologic data routinely collected in electronic health records.

Methods

Two prediction models were trained and tested using a multicenter database (years of data collection 2014 to 2015), and externally validated on two single-center databases (2001 to 2012 and 2008 to 2019). The primary outcome variable was delirium, defined as a positive Confusion Assessment Method for the ICU screen or an Intensive Care Delirium Screening Checklist score of 4 or greater. The first model, named the “24-hour model,” used data from the first 24 h after ICU admission to predict delirium at any time afterward. The second model, designated the “dynamic model,” predicted the onset of delirium up to 12 h in advance. Model performance was compared with a widely cited reference model.

Results

For the 24-h model, delirium was identified in 2,536 of 18,305 (13.9%), 768 of 5,299 (14.5%), and 5,955 of 36,194 (11.9%) of patient stays, respectively, in the development sample and two validation samples. For the 12-h lead time dynamic model, delirium was identified in 3,791 of 22,234 (17.0%), 994 of 6,166 (16.1%), and 5,955 of 28,440 (20.9%) patient stays, respectively. Mean area under the receiver operating characteristics curve (AUC) (95% CI) for the first 24-h model was 0.785 (0.769 to 0.801), significantly higher than the modified reference model with AUC of 0.730 (0.704 to 0.757). The dynamic model had a mean AUC of 0.845 (0.831 to 0.859) when predicting delirium 12 h in advance. Calibration was similar in both models (mean Brier Score [95% CI] 0.102 [0.097 to 0.108] and 0.111 [0.106 to 0.116]). Model discrimination and calibration were maintained when tested on the validation datasets.

Conclusions

Machine learning models trained with routinely collected electronic health record data accurately predict ICU delirium, supporting dynamic time-sensitive forecasting.

Editor’s Perspective
What We Already Know about This Topic
  • Existing intensive care unit (ICU) delirium prediction models consider a parsimonious set of clinical variables, lack dynamic prediction capability, and have received limited external validation

What This Manuscript Tells Us That Is New
  • In a multicenter electronic health record database of 22,234 intensive care unit (ICU) patients from 2014 to 2015, delirium was identified using the Confusion Assessment Method for the ICU screen or Intensive Care Delirium Screening Checklist

  • Static and dynamic machine learning algorithms were trained, tested, and externally validated to predict the onset of delirium during the ICU stay

  • The static model using data from the first 24 h after ICU admission to predict delirium at any point during the ICU stay demonstrated higher discrimination compared with a widely cited reference model

  • The dynamic model was able to predict delirium up to 12 h in advance with reasonable discrimination and calibration

Delirium is common in the acute care setting and particularly in intensive care units (ICUs), affecting up to 35% of hospitalized patients and up to 80% of patients requiring intensive care,1  and costing an estimated $164 billion annually in healthcare expenditures.2  The onset of delirium in hospitalized patients has been independently associated with poor short-term and long-term health outcomes, and research aimed at preventing or treating delirium is regarded as a public health priority.3 

Approximately 30 to 40% of delirium cases might be amenable to delirium-reduction strategies.4  Multicomponent interventions focusing on device and catheter removal, promotion of normal sleep-wake cycles, and early mobilization are cost-effective methods for preventing and treating delirium.4  In critically ill patients, implementation of a structured bundle of treatments has been associated with a 40% reduction in delirium in a multisite cohort of more than 15,000 critically ill patients,5  and use of the α2 agonist dexmedetomidine could decrease delirium risk by up to 48%6  in ICU patients requiring sedation. Although these approaches are promising, ICU delirium may be underrecognized and misdiagnosed.7  Delirium screening is inconsistent in many health systems and, even when consistently deployed, may not capture relevant events due to the acute onset and fluctuating nature of the disorder.8 

Research during the past two decades has identified a number of delirium risk factors, some of which may be modifiable.3  The ability to predict delirium onset in high-risk individuals might allow preventive or treatment strategies to be implemented in a more targeted or even personalized fashion. Here, we created two models to predict delirium: an early prediction model to identify delirium onset at any time during intensive care by using data available early in the ICU stay, and a dynamic model to predict the onset of delirium 0 to 12 h in the future. We hypothesized that physiologic and clinical variables routinely acquired during intensive care would be associated with the probability of delirium onset.

The overarching goal was to predict the onset of ICU delirium by training machine learning models with physiologic and clinical features routinely available at the bedside. If our primary hypothesis is correct, we will reject the null hypothesis that physiologic and clinical variables routinely acquired during intensive care have no relation to the probability of delirium onset. Research followed the Transparent Reporting of a Multivariable Prediction Model for Individual Prognosis checklist,9  a copy of which is provided in table E1 of the Supplemental Digital Content (http://links.lww.com/ALN/C999). The model pipeline is schematized in figure E1 in the Supplemental Digital Content (http://links.lww.com/ALN/C999). All code is available on GitHub at https://github.com/ryanlu41/delirium.

The data analysis and statistical plan for the first 24-h model development were written and recorded in the investigators’ files before data were accessed, while additional model development occurred after the data were accessed. This included determining which patients developed delirium in the development dataset and extracting features guided by domain expertise. Differences in the distributions of these features between cases and controls were assessed using Mann-Whitney U tests (a nonparametric comparison of a continuous feature in two independent samples) or chi-square tests (for comparison of proportions in categorical data). All analyses used a P-value threshold of 0.05 for significance. Statistical and machine learning software packages used are detailed in subsequent sections.
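As an illustration of the two univariate tests named above, the following is a minimal sketch using SciPy. The data are synthetic, and the variable names (age_pos, age_neg, table) are hypothetical; this is not the study code.

```python
import numpy as np
from scipy.stats import chi2_contingency, mannwhitneyu

rng = np.random.default_rng(0)

# Synthetic continuous feature (for example, age) in cases and controls.
age_pos = rng.normal(loc=68, scale=12, size=200)
age_neg = rng.normal(loc=62, scale=14, size=800)

# Two-sided Mann-Whitney U test for the continuous feature.
u_stat, p_continuous = mannwhitneyu(age_pos, age_neg, alternative="two-sided")

# 2x2 contingency table for a categorical feature. Rows: delirium yes/no.
# Columns: feature present/absent. Counts are invented for illustration.
table = np.array([[120, 80],
                  [240, 560]])
chi2, p_categorical, dof, expected = chi2_contingency(table)

ALPHA = 0.05  # significance threshold stated in the text
print(f"Mann-Whitney U: p={p_continuous:.4g}, significant={p_continuous < ALPHA}")
print(f"Chi-square: p={p_categorical:.4g}, dof={dof}, significant={p_categorical < ALPHA}")
```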

Data Sources

Research in this report was performed on three fully deidentified publicly available datasets made available via the Massachusetts Institute of Technology (Cambridge, Massachusetts) PhysioNet repository10 : the Philips eICU Collaborative Research Database (hereafter referred to as the development dataset), the third version of Medical Information Mart for Intensive Care (often abbreviated as MIMIC-III, and hereafter referred to as validation dataset 1), and the fourth version of Medical Information Mart for Intensive Care (often abbreviated as MIMIC-IV, and hereafter referred to as validation dataset 2). The first was used for model training and testing, while the latter two were used for external validation. The development dataset is a multicenter electronic health record–based database containing granular data on 200,859 admissions to ICUs between 2014 and 2015 at 208 hospitals located in the United States.11  Validation dataset 1 comprises electronic health record data from 61,532 ICU stays at Beth Israel Deaconess Medical Center in Boston, Massachusetts, from 2001 to 2012.12  Validation dataset 2 comprises electronic health record data from 76,943 ICU stays at Beth Israel Deaconess Medical Center from 2008 to 2019.13  The two validation datasets likely overlap to some extent because they come from the same hospital and have several years in common.

Data in both validation databases have been deidentified, and the institutional review boards of Massachusetts Institute of Technology (number 0403000206) and Beth Israel Deaconess Medical Center (number 2001-P-001699/14) both approved the use of the database for research. Because the database does not contain protected health information, a waiver of the requirement for informed consent was included in the institutional review board approval. Data in the development dataset are also deidentified, and research using the development dataset is exempt from institutional review board approval due to the retrospective design, lack of direct patient intervention, and the security schema, for which the reidentification risk was certified as meeting safe harbor standards by an independent privacy expert (Privacert, Cambridge, Massachusetts; Health Insurance Portability and Accountability Act Certification number 1031219–2).

Modeling Paradigms

Two modeling paradigms were created (fig. 1). The first, named the “first 24-hour model,” analyzed data collected in the 24 h after ICU admission to predict the probability of delirium at any subsequent point during the ICU stay. The second model, designated “dynamic model,” used cumulative data from ICU admission to the prediction time point and computed the probability of delirium onset 0 to 12 h in the future. For ICU stays with delirium, the model was trained on data obtained before the first positive delirium screen. For ICU stays without delirium, the model was trained on data obtained before a randomly selected negative delirium screen.

Fig. 1.

Model frameworks for the first 24-h and dynamic modeling paradigms. ICU, intensive care unit.


Case Identification

Flow diagrams illustrating the process for case identification and selection are shown in figure 2. Patient stays were selected for the first 24-h model if they were admitted to the ICU, remained in the ICU for at least 24 h, and were screened for delirium using the Confusion Assessment Method for the Intensive Care Unit14  or the Intensive Care Delirium Screening Checklist.15  To limit the possibility that patients had delirium before the time of prediction, we excluded patient stays in which there was a positive delirium test or diagnosis in the first 24 h.

Fig. 2.

Study flow diagrams for the first 24-h model (left), and the 12-h lead time dynamic model with at least 12 h of data (right). ICU, intensive care unit.


For the dynamic model, we selected ICU stays of patients who were in the ICU for at least 12 h and who had at least one delirium screening. To limit the possibility that patients entered the ICU with delirium, we excluded patients in whom there was a positive delirium screen or diagnosis in the first 12 h. Delirium-positive cases were identified by finding the first positive Confusion Assessment Method for the ICU or Intensive Care Delirium Screening Checklist screen in a patient stay, defining that time point as delirium onset, and using data preceding onset to make predictions. The median time (and interquartile range) of delirium onset in the 12-h lead time dynamic development model cohort was 61.3 (38.3 to 109.5) hours after ICU admission. Delirium-negative cases were obtained by finding ICU stays where all delirium screening was negative, randomly selecting one of the screens, and using data from before that delirium test to make predictions. The median time (and interquartile range) of these randomly selected negative delirium screens in the development cohort was 39.7 (30.4 to 59.8) hours after ICU admission. This resulted in the model making predictions across a wide range of times in ICU stays.
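The selection of prediction time points described above can be sketched as follows. This is an illustrative reading of the text, not the authors' code: the function name select_prediction_time, the tuple layout, and the rule of dropping stays whose data cutoff would fall at or before ICU admission are all assumptions.

```python
import random

def select_prediction_time(screens, lead_hours, rng):
    """Pick the anchor screen for one ICU stay and return (label, cutoff).

    screens: list of (hours_since_icu_admission, is_positive) tuples.
    lead_hours: how far ahead of the anchor screen the prediction is made.
    Returns None when the stay cannot be used (assumed rule, see lead-in).
    """
    screens = sorted(screens)
    positive_times = [t for t, is_positive in screens if is_positive]
    if positive_times:
        # Delirium-positive stay: the first positive screen defines onset.
        anchor, label = positive_times[0], 1
    elif screens:
        # Delirium-negative stay: anchor on one randomly chosen negative screen.
        anchor, label = rng.choice(screens)[0], 0
    else:
        return None
    cutoff = anchor - lead_hours  # only data before this time may be used
    if cutoff <= 0:
        return None
    return label, cutoff

rng = random.Random(42)
stay_a = [(10.0, False), (30.0, False), (61.0, True), (70.0, True)]
stay_b = [(8.0, False), (20.0, False), (40.0, False)]
print(select_prediction_time(stay_a, 12, rng))  # onset at 61 h, so cutoff is 49 h
print(select_prediction_time(stay_b, 12, rng))  # depends on the random draw
```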

Outcome Variable

The primary outcome variable was delirium, defined as a positive Confusion Assessment Method for the ICU screen or a score of 4 or more on the Intensive Care Delirium Screening Checklist, without any contradictions from diagnostic code information. Both Confusion Assessment Method for the ICU and Intensive Care Delirium Screening Checklist scores are documented in the development dataset, while only the Confusion Assessment Method for the ICU is recorded in both validation databases. Additionally, the development database marks some patients with a free text delirium diagnosis. Patients with this diagnosis alone and no positive delirium screenings were excluded from the study. In the development dataset, the median (interquartile range) interval between delirium tests was 4.0 h (1.0 to 12.0 h), while in validation dataset 1 it was 9.2 h (4.0 to 13.3 h) and in validation dataset 2 it was 9.4 h (4.0 to 12.3 h).
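The outcome definition can be expressed as a small labeling rule. The sketch below is hypothetical (the function names and argument layout are not taken from the study code) and omits the check against diagnostic code information, which the text does not describe in detail.

```python
from typing import List, Optional

ICDSC_THRESHOLD = 4  # a checklist score of 4 or more counts as positive

def screen_is_positive(cam_icu_positive: Optional[bool],
                       icdsc_score: Optional[int]) -> bool:
    """One screening event is positive if either instrument is positive."""
    cam = cam_icu_positive is True
    icdsc = icdsc_score is not None and icdsc_score >= ICDSC_THRESHOLD
    return cam or icdsc

def stay_label(screen_results: List[bool],
               has_free_text_diagnosis: bool) -> Optional[int]:
    """Return 1 (delirium), 0 (no delirium), or None (stay excluded)."""
    if not screen_results:
        return None  # stays without any screening are not eligible
    if any(screen_results):
        return 1
    if has_free_text_diagnosis:
        return None  # diagnosis alone, with no positive screen: excluded
    return 0

print(screen_is_positive(False, 5))        # True
print(stay_label([False, False], True))    # None
```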

Predictive Variables

Predictive variables to consider in the model were identified through literature review, clinician guidance, and dataset exploration. Variables extracted included patient demographics, medical history and comorbidities, laboratory studies, medications administered, other treatments, nurse documentation, and physiologic time series (both nurse-validated data and automated data from monitors), with all time stamps being recorded at minute-level resolution. All variables used in the model were temporally distinct from data used in assessing outcomes. All analyses up to this point were done using Python, specifically the numpy16  and pandas17  packages, with the exception of the comorbidity features that were created using the R package comorbidity.18 

Preprocessing

The distribution of each feature was examined by a fellowship-trained, board-certified intensive care physician, who helped define upper and lower bounds of physiologic plausibility; values deemed implausible were then removed. For each model and lead time, features with more than 20% of samples missing were excluded, which primarily resulted in the removal of less common laboratory tests from the feature space, such as alkaline phosphatase measurements or monocyte counts. Missing values were then imputed using mean imputation, based on training data means.
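The three preprocessing steps above (plausibility bounds, the 20% missingness filter, and mean imputation from training data) might look like the following NumPy sketch. The bounds and values are invented, and computing the missingness fraction on the training split is an assumption; the text does not say which split was used.

```python
import numpy as np

def preprocess(train, test, bounds, max_missing=0.20):
    """train, test: 2-D arrays (rows are stays, columns are features).
    bounds: one (low, high) plausibility pair per column."""
    train = np.array(train, dtype=float)
    test = np.array(test, dtype=float)
    # 1. Replace physiologically implausible values with NaN.
    for j, (low, high) in enumerate(bounds):
        for arr in (train, test):
            col = arr[:, j]  # a view, so the assignment below edits arr
            col[(col < low) | (col > high)] = np.nan
    # 2. Drop features with more than max_missing of training values missing.
    keep = np.isnan(train).mean(axis=0) <= max_missing
    train, test = train[:, keep], test[:, keep]
    # 3. Impute remaining gaps with means computed on training data only.
    train_means = np.nanmean(train, axis=0)
    train = np.where(np.isnan(train), train_means, train)
    test = np.where(np.isnan(test), train_means, test)
    return train, test, keep

# Columns: heart rate, pH, and a rarely ordered laboratory test.
train = [[80.0, 7.40, np.nan],
         [95.0, np.nan, np.nan],
         [900.0, 7.35, 1.2],   # heart rate of 900 is implausible
         [70.0, 7.45, np.nan],
         [85.0, 7.38, np.nan]]
test = [[np.nan, 7.30, 0.9]]
bounds = [(20, 300), (6.5, 8.0), (0, 50)]

train_out, test_out, keep = preprocess(train, test, bounds)
print(keep)       # the third column is dropped (80% missing)
print(test_out)   # missing heart rate filled with the training mean
```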

Feature Development and Analysis

Predictive features were created from the processed data. Categorical variables were one-hot encoded into individual features (i.e., translated into binary variables) and sometimes grouped together for simplicity. For numerical variables with multiple values during the patient stay, estimates of central tendency and variance such as means and standard deviations were calculated. For higher frequency variables such as respiratory rate, heart rate, blood pressure, or oxygen saturation data, more complex features such as Fourier transform coefficients or wavelets were computed using the Python tsfresh package.19  A full list of features used in the models is available in table E2 in the Supplemental Digital Content (http://links.lww.com/ALN/C999).
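A minimal pandas sketch of the simpler feature types described above (one-hot encoding and per-stay summary statistics) is shown here. The column names and values are hypothetical, and the Fourier and wavelet features computed with tsfresh are not reproduced.

```python
import pandas as pd

# Hypothetical static and time-varying data for three ICU stays.
static = pd.DataFrame({"stay_id": [1, 2, 3],
                       "unit_type": ["MICU", "SICU", "MICU"]})
vitals = pd.DataFrame({
    "stay_id": [1, 1, 1, 2, 2, 3, 3, 3],
    "heart_rate": [80, 90, 100, 60, 70, 110, 115, 120],
})

# One-hot encode the categorical variable into binary indicator columns.
one_hot = pd.get_dummies(static.set_index("stay_id")["unit_type"],
                         prefix="unit", dtype=int)

# Summarize repeated measurements by central tendency and spread per stay.
summary = (vitals.groupby("stay_id")["heart_rate"]
           .agg(["mean", "std"])
           .add_prefix("heart_rate_"))

features = one_hot.join(summary)
print(features)
```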

Model Development

Model features were analyzed using three machine learning algorithms: logistic regression, random forest,20  and gradient boosting (CatBoost21 ), as well as an ensemble or stacked model using outputs from all three algorithms. All modeling and evaluation, excluding the CatBoost algorithm, was done using Python, specifically the scikit-learn package22  and SciPy package.23  Clinically relevant features and their relationships to delirium risk were determined using logistic regression, random forest, or SHapley Additive exPlanations (SHAP or Shapley values).24  Shapley values indicate a quantitative association between a feature and a given model output, with high Shapley values indicating association with high model output, and vice versa. Shapley plots are increasingly leveraged to visualize complex relationships captured by machine learning. For each modeling exercise, features were removed if the feature importance as determined internally by each training algorithm was zero, or if they were deemed implausible by a fellowship-trained, board-certified intensive care physician.

A nested cross-validation scheme with five inner folds and four outer folds was used to train and evaluate models. In this setup, the complete dataset is split into four different outer training and testing set combinations. Each outer training set is then further split into five different datasets. For a given iteration of the outer loop, hyperparameters are first tuned using 5-fold cross-validation on the training set of the iteration. A model with the optimal hyperparameters is then trained on the outer training set and evaluated on the testing set of the iteration. This process is repeated four times for each different outer training and testing split. To train a final model, hyperparameters are tuned with the outer splits, and then a model with the best hyperparameters is trained on the entire dataset.

Hyperparameters were tuned using Bayesian hyperparameter optimization where feasible, or grid search otherwise. For Bayesian hyperparameter optimization, the Tree-structured Parzen Estimator Approach25  was utilized. Final hyperparameters are in table E3 in the Supplemental Digital Content (http://links.lww.com/ALN/C999).
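The nested cross-validation scheme (four outer folds, five inner folds) can be sketched with scikit-learn as below. Several choices here are assumptions made for brevity: the data are synthetic, logistic regression stands in for the full set of algorithms, the folds are stratified, and grid search replaces the Tree-structured Parzen Estimator.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic, imbalanced binary classification data (about 15% positives).
X, y = make_classification(n_samples=600, n_features=20, n_informative=5,
                           weights=[0.85, 0.15], random_state=0)

outer = StratifiedKFold(n_splits=4, shuffle=True, random_state=0)
inner = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
param_grid = {"C": [0.01, 0.1, 1.0, 10.0]}

outer_aucs = []
for train_idx, test_idx in outer.split(X, y):
    # Inner loop: tune hyperparameters on the outer training set only.
    search = GridSearchCV(LogisticRegression(max_iter=1000), param_grid,
                          scoring="roc_auc", cv=inner)
    search.fit(X[train_idx], y[train_idx])
    # GridSearchCV refits the best model on the whole outer training set,
    # so predictions on the held-out outer test set are unbiased by tuning.
    prob = search.predict_proba(X[test_idx])[:, 1]
    outer_aucs.append(roc_auc_score(y[test_idx], prob))

print("Outer-fold AUCs:", np.round(outer_aucs, 3))

# Final model: tune with the outer splits, then refit on the entire dataset.
final_search = GridSearchCV(LogisticRegression(max_iter=1000), param_grid,
                            scoring="roc_auc", cv=outer)
final_search.fit(X, y)
final_model = final_search.best_estimator_
```

Keeping the tuning inside each outer training set is what allows the outer test folds to serve as the reported performance estimate.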

A model based on the features from the PREdiction of DELIRium in ICu patients model (often abbreviated as PRE-DELIRIC, and hereafter referred to as the reference model),26  a widely cited reference model in ICU delirium prediction, was created for comparison, with slight adjustments (described in table E4 of the Supplemental Digital Content, http://links.lww.com/ALN/C999) to features based on data availability in the development dataset. These reference model–based features were used to train a logistic regression model in the development dataset.

Model Performance Evaluation

Model performance was evaluated using three metrics: area under the receiver operating characteristic curve (AUC), area under the precision-recall curve (or mean precision), and Brier score or calibration curve. The computation required for training and making predictions was also kept to a minimum. These metrics indicate the strength of the relationship between the predictive variables and delirium risk. The performances on the outer testing sets are reported. To externally validate the results, the final model trained on the entire development dataset was tested on the validation datasets’ extracted features, without using these features in the model training process. For each of these metrics, 95% CIs were also calculated. Detailed performance metrics are reported for the various iterations of the first 24-h model and the dynamic models in table E5, figure E2, and figure E3 in the Supplemental Digital Content (http://links.lww.com/ALN/C999). For the primary analysis, we will reject the null hypothesis if AUC is more than 0.5.
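The three metrics can be computed with scikit-learn as in the sketch below. The text does not state how the 95% CIs were derived, so the percentile bootstrap used here is an assumption, as are the function name evaluate and the synthetic predictions.

```python
import numpy as np
from sklearn.metrics import (average_precision_score, brier_score_loss,
                             roc_auc_score)

def evaluate(y_true, y_prob, n_boot=200, seed=0):
    """Return {metric: (point_estimate, ci_low, ci_high)}."""
    y_true = np.asarray(y_true)
    y_prob = np.asarray(y_prob)
    metrics = {"auc": roc_auc_score,
               "average_precision": average_precision_score,
               "brier": brier_score_loss}
    rng = np.random.default_rng(seed)
    n = len(y_true)
    boot = {name: [] for name in metrics}
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)  # resample stays with replacement
        if y_true[idx].min() == y_true[idx].max():
            continue  # AUC is undefined when only one class is present
        for name, fn in metrics.items():
            boot[name].append(fn(y_true[idx], y_prob[idx]))
    results = {}
    for name, fn in metrics.items():
        low, high = np.percentile(boot[name], [2.5, 97.5])
        results[name] = (fn(y_true, y_prob), low, high)
    return results

# Synthetic labels (about 15% positive) and loosely informative probabilities.
rng = np.random.default_rng(1)
y_true = rng.binomial(1, 0.15, size=1000)
noise = rng.normal(0, 0.1, size=1000)
y_prob = np.clip(0.15 + 0.25 * (y_true - 0.15) + noise, 0.01, 0.99)

results = evaluate(y_true, y_prob)
for name, (point, low, high) in results.items():
    print(f"{name}: {point:.3f} ({low:.3f} to {high:.3f})")
```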

Patient Characteristics

Flow diagrams with details of patient inclusion and exclusion are provided in figure 2, and characteristics of the population are reported in table 1 and in online tables E6 and E7 in the Supplemental Digital Content (http://links.lww.com/ALN/C999). The cohort for the first 24-h model consisted of 18,305 total patient stays in the development dataset, of which 2,536 (13.9%) were labeled as delirium positive. In validation dataset 1, a total of 5,299 patient stays were identified, of which 768 (14.5%) were delirium positive, and in validation dataset 2 a total of 36,194 patient stays were identified, of which 5,955 (11.9%) were delirium positive. Across all datasets, median APACHE IV scores and unadjusted mortality were significantly higher in delirium-positive patients.

Table 1.

Characteristics of Cases and Controls from the Development Dataset for the First 24-h Model Cohort


For the 12-h lead time dynamic model, the development cohort consisted of 22,234 total patient stays in the development dataset of which 3,791 (17.0%) were labeled as delirium positive and 18,443 (83.0%) were negative. In validation dataset 1, a total of 6,166 patient stays were identified of which 994 (16.1%) were delirium positive and 5,172 (83.9%) were negative, and, in validation dataset 2, a total of 28,440 patient stays were identified, of which 5,955 (20.9%) were delirium positive and 22,485 (79.1%) were delirium negative. Characteristics of admissions analyzed in the development cohort for the first 24-h model are in table 1, and similar characteristics from the validation cohorts are available in table E6 and table E7 in the Supplemental Digital Content (http://links.lww.com/ALN/C999).

First 24-h Model Performance

The best-performing algorithm was CatBoost, trained with a pruned feature space of 155 features. All predictive performance metrics are summarized in figure 3. Although AUCs from models that share features do not have a standard statistical test,27  the mean AUC (95% CI) was 0.785 (0.769 to 0.801), which is higher than that of the modified reference model that had a mean AUC of 0.730 (0.704 to 0.757). This model was validated successfully in the validation dataset 1 population (AUC of 0.796) and the validation dataset 2 population (AUC of 0.810). While holding the sensitivity at 0.85, the model achieved a specificity of 0.556 (0.515 to 0.586), negative predictive value of 0.948 (0.943 to 0.950), and positive predictive value of 0.282 (0.264 to 0.296). The mean precision was 0.384 (0.357 to 0.411) in the development dataset, while in validation dataset 1 it was 0.389 and in validation dataset 2 it was 0.475. The mean Brier score was 0.102 (0.097 to 0.108) in the development dataset, 0.105 in validation dataset 1, and 0.110 in validation dataset 2.
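Reporting specificity and predictive values at a fixed sensitivity requires choosing a decision threshold first. The sketch below shows one way to do this, assuming the threshold is the highest one whose sensitivity reaches the target; the function name operating_point and the toy data are hypothetical, and the simple loop is written for clarity rather than speed.

```python
import numpy as np

def operating_point(y_true, y_prob, target_sensitivity=0.85):
    """Pick the highest threshold with sensitivity >= target and report
    specificity, PPV, and NPV at that threshold."""
    y_true = np.asarray(y_true).astype(bool)
    y_prob = np.asarray(y_prob, dtype=float)
    # Walk candidate thresholds from high to low until enough cases are caught.
    for thr in np.unique(y_prob)[::-1]:
        pred = y_prob >= thr
        tp = np.sum(pred & y_true)
        fn = np.sum(~pred & y_true)
        if tp / (tp + fn) >= target_sensitivity:
            break
    fp = np.sum(pred & ~y_true)
    tn = np.sum(~pred & ~y_true)
    return {
        "threshold": float(thr),
        "sensitivity": float(tp / (tp + fn)),
        "specificity": float(tn / (tn + fp)),
        "ppv": float(tp / (tp + fp)),
        "npv": float(tn / (tn + fn)),
    }

# Toy example: 4 delirium-positive and 6 delirium-negative stays.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_prob = [0.9, 0.8, 0.7, 0.2, 0.6, 0.3, 0.1, 0.1, 0.05, 0.05]
point = operating_point(y_true, y_prob, target_sensitivity=0.85)
print(point)
```

In this toy example three of four cases give a sensitivity of only 0.75, so the threshold must drop to 0.2 to reach the target, which lowers specificity and positive predictive value to 4 of 6.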

Fig. 3.

Model performance metrics for the first 24-h models, including the re-created reference model (PREdiction of DELIrium in ICu patients, often abbreviated as PRE-DELIRIC), and our current first 24-h CatBoost model on the development and external validation datasets. (A) Receiver operating characteristic curves with 95% CI shaded. (B) First 24-h model precision-recall curves with average precision and 95% CI shaded. (C) Calibration curves and calculated Brier scores. AUC, area under the receiver operating characteristics curve.


Dynamic Model Performance

The dynamic models performed overall better than the first 24-h model, with higher performances noted at short lead times (fig. 4; although the significance of AUC comparisons between models with shared features may not be ascertainable using available statistical methodologies27 ). The 12-h model had a mean AUC (95% CI) of 0.845 (0.831 to 0.859) and was validated in the validation dataset 1 population (AUC of 0.804) and the validation dataset 2 population (AUC of 0.838). When holding sensitivity at 0.85, the model achieved a specificity of 0.657 (0.623 to 0.691), negative predictive value of 0.955 (0.953 to 0.957), and positive predictive value of 0.337 (0.315 to 0.359). The mean precision was 0.590 (0.566 to 0.613) in the development dataset, while in validation dataset 1 it was 0.449 and in validation dataset 2 it was 0.593. The mean Brier score was 0.111 (0.106 to 0.116) in the development dataset, 0.165 in validation dataset 1, and 0.132 in validation dataset 2. With 6-h lead time, which is greater than the median time between delirium tests in the development dataset, the mean AUC (95% CI) was 0.880 (0.872 to 0.887). External validation performance was variable, with validation dataset 2 results generally within the 95% CI of the development dataset results, and validation dataset 1 performance being slightly worse. However, the maximum absolute difference in AUC between development and validation samples was 0.04. In all instances, the 95% CI of the AUCs excluded 0.5; thus, we reject the null hypothesis that physiologic and clinical variables routinely acquired during intensive care have no relation to the probability of delirium onset.

Fig. 4.

Time dependence of model discrimination. AUC (A), average precision (B), and Brier scores (C) of the dynamic model at different lead times, with 95% CI indicated by error bars and corresponding external validation performance. AUC, area under the receiver operating characteristics curve.


Feature Importance Analysis

Feature importance was determined using the Shapley values, with relative importance and directionality shown in figure 5 for the dynamic model at 1-h lead time, and for the first 24-h model in figure E7 of the Supplemental Digital Content (http://links.lww.com/ALN/C999). Although feature importance varied by model, features involving Glasgow Coma Scores, Richmond Agitation Sedation Scale, age, mechanical ventilation, and overall acuity were important in prediction. Length of ICU stay before delirium onset and time of day were important predictors for the dynamic model. Contrary to one of our primary hypotheses, physiologic time series features based on 5-min frequency blood pressure, respiratory rate, heart rate, and oxygen saturation data did not increase either model’s performance, as can be seen from performance metrics shown in table E8 in the Supplemental Digital Content (http://links.lww.com/ALN/C999).

Fig. 5.

Feature analysis of the dynamic model. Shapley summary plot of the top 20 features for the dynamic model, at 12-h lead time for prediction of delirium. Each dot represents the Shapley value of one sample for that feature. A feature’s Shapley value represents the association of that feature to the risk score, with positive values indicating an association with a higher risk of delirium, and negative values indicating an association with a lower risk of delirium. The location of the dot on the x-axis represents its Shapley value, and its color represents the feature’s absolute value. For example, low age is associated with low Shapley values, while high age is associated with high Shapley values, indicating that elderly patients are at higher risk for delirium. ICU, intensive care unit.


Main Findings

Using large clinical databases, we developed and validated two new models for the prediction of delirium in the ICU. The first 24-h early prediction model performed better than the adapted reference model (fig. 3) and calibrated well in a contemporary dataset (fig. 3). The second dynamic delirium prediction model, which enables estimates of delirium risk that are continually updated over time, had higher discrimination than the modified reference model and similar or better discrimination compared with published delirium prediction models. Both models generally validated well on two external datasets, especially on the more contemporary dataset, although calibration of this model was limited in the validation cohorts, where, at high predicted probability, the observed probability of delirium was considerably lower.

Analysis of Existing Literature

Studies evaluating ICU delirium prediction have varied in patient characteristics, predictive frameworks, and model performance (see table E9 in the Supplemental Digital Content, http://links.lww.com/ALN/C999). Many of the most predictive features identified in previous work (e.g., age, mechanical ventilation, severity of illness [APACHE, SOFA], exposure to benzodiazepines) were confirmed in the models presented here. Most previous studies reporting high predictive performance used static models that were unable to predict delirium onset at a specific time point. PREdiction of DELIRium in ICu patients (PRE-DELIRIC)26  is the most studied model of this type, with large variability in performance across different populations internationally, although a meta-analysis estimated an aggregate AUC (95% CI) of 0.844 (0.793 to 0.896).28  Many previous studies had strict patient inclusion criteria, focusing on certain conditions or ICU types, thereby limiting their generalizability.29,30  Several reports on higher-performing models lacked external validation and were based on much smaller sample sizes than the ones reported here.29,30  The limited number of previous studies that made time-specific predictions of delirium onset evaluated their models on a daily basis,29–31  for example, at midnight, and predicted delirium onset in the next 24 h.

Physiologic Time Series

Early in this project we postulated that complex features from physiologic time series data (specifically blood pressure, respiratory rate, oxygen saturation, and heart rate) could be used to predict the onset of delirium. This hypothesis was not verified in our models. Although such features had some predictive power, they did not improve model performance and were more computationally costly. This counterintuitive result could potentially be due to feature redundancy, or perhaps the lack of hypothesized association.

Strengths

Large and Heterogeneous Datasets

Our study uses data from the Philips eICU Collaborative Research Database, which contains more than 200,000 unique ICU stays from more than 200 hospitals across the United States. This dataset includes data from critical care units in a wide range of different health system sizes, organizational structures, and settings.11  We extracted 22,234 ICU stays from this dataset for training, testing, and validating our dynamic model. This is larger than population samples used in previous delirium prediction studies, and this population likely has greater heterogeneity than the data used in the original reference model dataset, which trained on data from 1,613 stays in a single hospital in the Netherlands and validated with data from four other hospitals in the same country.26  Model results obtained on diverse populations could be more generalizable, especially in the United States where the data were collected. The publicly available nature of the dataset offers opportunities for other research groups to evaluate our models’ reproducibility.

Feature Space

Our models identified several predictive features that were not in the reference model (fig. 5). Low Glasgow Coma Scale scores, longer length of stay, and a prediction time late at night were all associated with higher risk of delirium. These new features may explain why our model outperforms the reference model.

Potential Advantages of Dynamic Prediction

The dynamic model presented here is designed to predict delirium at set times up to 12 h in the future, potentially a more context-sensitive approach than that of other prediction systems. More time-specific onset predictions could allow targeted implementation of preventive measures in patients with immediate high risk. At shorter lead times (1 h or less) this algorithm is identifying delirium close to the present, which could improve the ability to treat ongoing delirium and reduce harm.
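The relationship between lead time and the data available to a prediction can be sketched as follows. This is a minimal illustration under stated assumptions, not the windowing used in this study: the function name `observation_window`, the 24-h default window length, and the rule that a prediction uses only data recorded before the prediction time are all hypothetical.

```python
def observation_window(onset_h, lead_h, window_h=24.0):
    """Return (start, end) in hours since ICU admission for the data that a
    prediction made `lead_h` hours before delirium onset may use.

    Assumptions (not taken from the study): the prediction time is
    onset_h - lead_h, and the model looks back at most `window_h` hours.
    Returns None when the prediction time would precede ICU admission.
    """
    if lead_h < 0 or window_h <= 0:
        raise ValueError("lead_h must be >= 0 and window_h must be > 0")
    prediction_time = onset_h - lead_h
    if prediction_time <= 0:
        return None
    start = max(0.0, prediction_time - window_h)
    return (start, prediction_time)


# Windows for a hypothetical onset 40 h after admission at three lead times.
windows = {lead: observation_window(40.0, lead) for lead in (1, 6, 12)}
```

As the lead time shrinks, the end of the observation window moves toward the onset itself, which is the sense in which short lead times approach identification rather than prediction.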

External Validation

The delirium prediction models were validated on two large independent external datasets, although validation on the older external dataset was less robust. These results suggest that the models performed well in the development dataset and may also be applicable to other populations. The vast majority of features were observed in the external validation dataset, suggesting that the features in our model could be generalizable to a range of intensive care settings.

Limitations

Study Design

We have evaluated the relationship between delirium as an outcome and a range of different exposure variables. We feel that this design, which is equivalent to a case-control methodology, is well suited to the modeling task. However, we recognize its limitations, including the inability to determine causality, and potential sources of bias associated with selection and recall, as well as the temporal bias that can occur when there is overemphasis on features that are close in time to the outcome of interest.32,33 

Outcome Labels

The frequency of delirium observed in both development and validation samples is lower than that reported in other ICU delirium studies. This may in part be explained by our exclusion of patients who had delirium early in their ICU stay. It is also likely that some patients had been misclassified by clinicians documenting their delirium screens. Hypoactive delirium, in particular, may be overlooked in ICU settings, even when using clinical screens.34  Data based on such documentation may have biased our models, making them more suited for predicting hyperactive delirium.

Another key limitation is the uncertainty regarding the precise timing of delirium onset and resolution, and the impact of this uncertainty on delirium detection and even epidemiology.35,36 One of the cardinal features of delirium is its fluctuating nature, which makes precise timing of detection challenging. As in other clinical studies of delirium, we identified cases from documented Confusion Assessment Method for the ICU and Intensive Care Delirium Screening Checklist screening tests; however, the temporal interval between documentation and the clinical state of patients may vary. Similarly, inconsistencies in the frequency of delirium screening represent another limitation. This might reflect clinician workflows in the ICU setting that can directly impact when these tests are conducted and documented.
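The outcome definition given in Methods (a positive Confusion Assessment Method for the ICU screen, or an Intensive Care Delirium Screening Checklist score of 4 or greater) can be expressed as a small labeling routine. The sketch below is illustrative only: the function names, the tuple layout of a screen record, and the handling of missing screens are assumptions, and the documented time of the first positive screen is used as a proxy for onset, which is subject to the timing uncertainty described above.

```python
from typing import Optional

ICDSC_THRESHOLD = 4  # "4 or greater", per the outcome definition in Methods


def screen_is_positive(cam_icu_positive: Optional[bool],
                       icdsc: Optional[int]) -> Optional[bool]:
    """Classify one documented screen; None means nothing was documented."""
    if cam_icu_positive is None and icdsc is None:
        return None
    cam_pos = cam_icu_positive is True
    icdsc_pos = icdsc is not None and icdsc >= ICDSC_THRESHOLD
    return cam_pos or icdsc_pos


def first_positive_hour(screens):
    """Return the documented hour of the first positive screen, or None.

    `screens` is an iterable of (hour, cam_icu_positive, icdsc) tuples,
    a hypothetical record layout chosen for this example.
    """
    for hour, cam, icdsc in sorted(screens, key=lambda s: s[0]):
        if screen_is_positive(cam, icdsc):
            return hour
    return None


# Made-up screens for one stay, deliberately out of chronological order.
example_screens = [(30, None, 5), (12, False, 2), (20, True, None)]
onset_proxy = first_positive_hour(example_screens)
```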

Feature Data

Another important limitation in this work was the reduced availability or missingness of certain predictive variables in all of the datasets. A number of key variables for delirium prediction (e.g., pre-ICU functional status, detailed histories of cognitive and neurologic conditions, psychologic disorders) were not consistently available, which likely reduced the predictive capabilities of our model. Using routine clinical data means that our predictive features are also susceptible to inconsistency that arises from manual charting.

Model Utility

Although the models presented in this work had overall good performance characteristics, their real-world utility needs to be determined given questions regarding calibration and dynamic prediction. Calibration results were below expectation. Clinical decisions based on a model that overpredicts might lead to unnecessary resource allocation and even harm. Reduced calibration may be due to overfitting, high levels of heterogeneity in the data, use of too many predictive features, and biased sampling.37  Although our training and testing performances do not suggest overfitting, the other cited factors may have influenced our calibration results.
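For reference, the Brier score reported in Results is the mean squared difference between predicted risk and the observed binary outcome, and overprediction can be summarized by the gap between mean predicted risk and the observed event rate. The following is a minimal sketch of both quantities; the function names and the four example predictions are made up for illustration and are not study data.

```python
def brier_score(y_true, y_prob):
    """Mean squared difference between predicted risk and outcome (0 or 1)."""
    if len(y_true) != len(y_prob) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


def mean_calibration_gap(y_true, y_prob):
    """Mean predicted risk minus observed event rate.

    A positive value means the model overpredicts on average; a negative
    value means it underpredicts.
    """
    if len(y_true) != len(y_prob) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(y_true)
    return sum(y_prob) / n - sum(y_true) / n


# Illustrative values only: one event among four stays.
y_true = [0, 0, 0, 1]
y_prob = [0.1, 0.2, 0.3, 0.8]
score = brier_score(y_true, y_prob)
gap = mean_calibration_gap(y_true, y_prob)
```

A lower Brier score is better, but because it mixes discrimination and calibration, a summary such as the calibration gap (or a full calibration curve) is needed to see whether predicted risks are systematically too high.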

The paradigm of dynamic predictive modeling, while potentially impactful, presents theoretical and implementation challenges. As the temporal offset between the observation and prediction window narrows, the distinction between predicting an event and identifying it becomes less clear. Regarding implementation of a dynamic prediction system for delirium, one can envision that models with sufficient lead times (e.g., 12 h or more) could be leveraged for antidelirium interventions; in contrast the actionable impact of shorter lead times (e.g., 1 h or 6 h) needs further study.

Singular Prediction per ICU Stay

To maintain independence between samples, our dynamic model was trained by treating each ICU stay as a sample. However, in a realistic clinical setting, a dynamic model would need to make multiple iterative predictions per ICU stay on a regular basis, likely increasing the challenge of accurate and timely prediction. An analysis of one sample’s predicted risk over time is shown in figure E8 in the Supplemental Digital Content (http://links.lww.com/ALN/C999), and such analyses will be replicated in other samples in future studies.
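Repeated prediction over a single stay can be sketched as scoring the stay at regular intervals and recording the first time the predicted risk crosses an alert threshold. This is a hypothetical illustration, not the analysis shown in figure E8: `risk_trajectory`, `first_alert`, the 12-h step, the threshold, and the toy scoring function standing in for a trained model are all assumptions.

```python
def risk_trajectory(stay_length_h, step_h, score_fn):
    """Score one stay every `step_h` hours; score_fn(hour) returns a risk."""
    if step_h <= 0:
        raise ValueError("step_h must be positive")
    trajectory = []
    t = step_h
    while t <= stay_length_h:
        trajectory.append((t, score_fn(t)))
        t += step_h
    return trajectory


def first_alert(trajectory, threshold):
    """Return the first hour at which risk meets the threshold, else None."""
    for hour, risk in trajectory:
        if risk >= threshold:
            return hour
    return None


def toy_score(hour):
    """Toy stand-in for a trained model: risk rises linearly with time."""
    return min(1.0, hour / 100)


trajectory = risk_trajectory(stay_length_h=48, step_h=12, score_fn=toy_score)
alert_hour = first_alert(trajectory, threshold=0.3)
```

Because each stay then contributes several correlated predictions, evaluating such a system requires care that predictions from the same stay are not treated as independent samples.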

Comparison with Reference Model

The modified reference model constructed in this work is not identical to the original model, because it includes adjustments (described in table E4, http://links.lww.com/ALN/C999). Its performance could have been influenced by the transformation of features.

Conclusion

Leveraging machine learning applied to very large datasets, we have developed and externally validated a novel approach for prediction of delirium in the ICU. After further prospective testing and validation, these models could help support the implementation of delirium-reducing interventions in high-risk individuals.

Acknowledgments

The authors thank Sridevi V. Sarma, Ph.D., Associate Professor, Biomedical Engineering, Whiting School of Engineering, Johns Hopkins University, Baltimore, Maryland, for her invaluable guidance in the early phase of this project, Karol M. Pencina, Ph.D., Chief Biostatistician at Harvard Medical School, Boston, Massachusetts, for his advice on statistical analysis, and Ike Zhang, Student in Biomedical Engineering at the Whiting School of Engineering, Johns Hopkins University, Baltimore, Maryland, for his aid in preparing this manuscript.

Research Support

Support was provided solely from institutional and/or departmental sources at Johns Hopkins University (Baltimore, Maryland).

Competing Interests

The authors declare no competing interests.

Supplemental Digital Content, http://links.lww.com/ALN/C999

1. Vasilevskis EE, Han JH, Hughes CG, Ely EW: Epidemiology and risk factors for delirium across hospital settings. Best Pract Res Clin Anaesthesiol. 2012; 26:277–87
2. Inouye SK, Westendorp RGJ, Saczynski JS: Delirium in elderly people. Lancet. 2014; 383:911–22
3. Oh ES, Fong TG, Hshieh TT, Inouye K: Delirium in older persons: Advances in diagnosis and treatment. JAMA. 2018; 318:1161–74
4. Hshieh TT, Yue J, Oh E, Puelle M, Dowal S, Travison T, Inouye SK: Effectiveness of multicomponent nonpharmacological delirium interventions: A meta-analysis. JAMA Intern Med. 2015; 175:512–20
5. Pun BT, Balas MC, Barnes-Daly MA, Thompson JL, Aldrich JM, Barr J, Byrum D, Carson SS, Devlin JW, Engel HJ, Esbrook CL, Hargett KD, Harmon L, Hielsberg C, Jackson JC, Kelly TL, Kumar V, Millner L, Morse A, Perme CS, Posa PJ, Puntillo KA, Schweickert WD, Stollings JL, Tan A, D’Agostino McGowan L, Ely EW: Caring for critically ill patients with the ABCDEF bundle. Crit Care Med. 2019; 47:3–14
6. Burry LD, Cheng W, Williamson DR, Adhikari NK, Egerod I, Kanji S, Martin CM, Hutton B, Rose L: Pharmacological and non-pharmacological interventions to prevent delirium in critically ill patients: a systematic review and network meta-analysis. Intensive Care Med. 2021; 47:943–60
7. Spronk PE, Riekerk B, Hofhuis J, Rommes JH: Occurrence of delirium is severely underestimated in the ICU during daily care. Intensive Care Med. 2009; 35:1276–80
8. Heymann A, Radtke F, Schiemann A, Lütz A, MacGuill M, Wernecke KD, Spies C: Delayed treatment of delirium increases mortality rate in intensive care unit patients. J Int Med Res. 2010; 38:1584–95
9. Collins GS, Reitsma JB, Altman DG, Moons KGM: Transparent reporting of a multivariable prediction model for individual prognosis or diagnosis (TRIPOD): The TRIPOD Statement. BMC Med. 2015; 13:1–10
10. Goldberger AL, Amaral LA, Glass L, Hausdorff JM, Ivanov PC, Mark RG, Mietus JE, Moody GB, Peng CK, Stanley HE: PhysioBank, PhysioToolkit, and PhysioNet: components of a new research resource for complex physiologic signals. Circulation. 2000; 101:E215–20
11. Pollard TJ, Johnson AEW, Raffa JD, Celi LA, Mark RG, Badawi O: The eICU collaborative research database, a freely available multi-center database for critical care research. Sci Data. 2018; 5:1–13
12. Johnson AEW, Pollard TJ, Shen L, Lehman LWH, Feng M, Ghassemi M, Moody B, Szolovits P, Anthony Celi L, Mark RG: MIMIC-III, a freely accessible critical care database. Sci Data. 2016; 3:1–9
13. Johnson AEW, Bulgarelli L, Pollard T, Horng S, Celi LA, Mark R: MIMIC-IV (version 2.0). PhysioNet, June 12, 2022. Available at: https://doi.org/10.13026/7vcr-e114. Accessed November 20, 2022
14. Ely EW, Margolin R, Francis J, May L, Truman B, Dittus R, Speroff T, Gautam S, Bernard GR, Inouye SK: Evaluation of delirium in critically ill patients: Validation of the Confusion Assessment Method for the intensive care unit (CAM-ICU). Crit Care Med. 2001; 29:1370–9
15. Bergeron N, Dubois MJ, Dumont M, Dial S, Skrobik Y: Intensive care delirium screening checklist: Evaluation of a new screening tool. Intensive Care Med. 2001; 27:859–64
16. Harris CR, Millman KJ, van der Walt SJ, Gommers R, Virtanen P, Cournapeau D, Wieser E, Taylor J, Berg S, Smith NJ, Kern R, Picus M, Hoyer S, van Kerkwijk MH, Brett M, Haldane A, del Río JF, Wiebe M, Peterson P, Gérard-Marchant P, Sheppard K, Reddy T, Weckesser W, Abbasi H, Gohlke C, Oliphant TE: Array programming with NumPy. Nature. 2020; 585:357–62
17. McKinney W: Data structures for statistical computing in Python. In: Proceedings of the 9th Python in Science Conference. Edited by van der Walt S, Millman J. Austin, Texas, June 28 to July 3, 2010, pp 56–61
18. Gasparini A: comorbidity: An R package for computing comorbidity scores. J Open Source Softw. 2018; 3:648
19. Christ M, Braun N, Neuffer J, Kempa-Liehr AW: Time Series FeatuRe Extraction on basis of Scalable Hypothesis tests (tsfresh – A Python package). Neurocomputing. 2018; 307:72–7
20. Breiman L: Random forests. Mach Learn. 2001; 45:5–32
21. Prokhorenkova L, Gusev G, Vorobev A, Dorogush AV, Gulin A: CatBoost: Unbiased boosting with categorical features. Proceedings of the 32nd International Conference on Neural Information Processing Systems. 2018; 6638–48
22. Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, Blondel M, Prettenhofer P, Weiss R, Dubourg V, Vanderplas J, Passos A, Cournapeau D: Scikit-learn: Machine learning in Python. J Mach Learn Res. 2011; 12:2825–30
23. Virtanen P, Gommers R, Oliphant TE, Haberland M, Reddy T, Cournapeau D, Burovski E, Peterson P, Weckesser W, Bright J, van der Walt SJ, Brett M, Wilson J, Millman KJ, Mayorov N, Nelson ARJ, Jones E, Kern R, Larson E, Carey CJ, Polat I, Feng Y, Moore EW, VanderPlas J, Laxalde D, Perktold J, Cimrman R, Henriksen I, Quintero EA, Harris CR, Archibald AM, Ribeiro AH, Pedregosa F, van Mulbregt P; SciPy 1.0 Contributors: SciPy 1.0: fundamental algorithms for scientific computing in Python. Nat Methods. 2020; 17:261–72
24. Lundberg SM, Nair B, Vavilala MS, Horibe M, Eisses MJ, Adams T, Liston DE, Low DKW, Newman SF, Kim J, Lee SI: Explainable machine-learning predictions for the prevention of hypoxaemia during surgery. Nat Biomed Eng. 2018; 2:749–60
25. Bergstra J, Bardenet R, Bengio Y, Kégl B: Algorithms for hyper-parameter optimization. 25th Annual Conference on Neural Information Processing Systems. NIPS 2011, December 12 to 17, 2011, Granada, Spain; 2546–54
26. Van Den BM, Pickkers P, Slooter AJC, Kuiper MA, Spronk PE, Van Der VP, Van Der HJ, Donders R, Van AT, Schoonhoven L: Development and validation of PRE-DELIRIC (PREdiction of DELIRium in ICu patients) delirium prediction model for intensive care patients: Observational multicentre study. BMJ. 2012; 344:17
27. Demler OV, Pencina MJ, D’Agostino RB: Misuse of DeLong test to compare AUCs for nested models. Stat Med. 2012; 31:2577–87
28. Chen X, Lao Y, Zhang Y, Qiao L, Zhuang Y: Risk predictive models for delirium in the intensive care unit: A systematic review and meta-analysis. Ann Palliat Med. 2021; 10:1467
29. Fan H, Ji M, Huang J, Yue P, Yang X, Wang C, Ying W: Development and validation of a dynamic delirium prediction rule in patients admitted to the Intensive Care Units (DYNAMIC-ICU): A prospective cohort study. Int J Nurs Stud. 2019; 93:64–73
30. Marra A, Pandharipande PP, Shotwell MS, Chandrasekhar R, Girard TD, Shintani AK, Peelen LM, Moons KGM, Dittus RS, Ely EW, Vasilevskis EE: Acute brain dysfunction: development and validation of a daily prediction model. Chest. 2018; 154:293–301
31. Moon KJ, Jin Y, Jin T, Lee SM: Development and validation of an automated delirium risk assessment system (Auto-DelRAS) implemented in the electronic health record system. Int J Nurs Stud. 2018; 77:46–53
32. Breslow NE: Statistics in epidemiology: The case-control study. J Am Stat Assoc. 1996; 91:14–28
33. Yuan W, Beaulieu-Jones BK, Yu KH, Lipnick SL, Palmer N, Loscalzo J, Cai T, Kohane IS: Temporal bias in case-control design: preventing reliable predictions of the future. Nat Commun. 2021; 12:1–10
34. Cour K, Andersen-Ranberg NC, Weihe S, Poulsen LM, Mortensen CB, Kjer CKW, Collet MO, Estrup S, Mathiesen O: Distribution of delirium motor subtypes in the intensive care unit: a systematic scoping review. Crit Care. 2022; 26:1–11
35. Wilson JE, Mart MF, Cunningham C, Shehabi Y, Girard TD, MacLullich AMJ, Slooter AJC, Ely EW: Delirium. Nat Rev Dis Prim. 2020; 6:1–26
36. Pandharipande P, Ely EW, Arora RC, Balas MC, Boustani M, La Calle GH, Cunningham C, Devlin JW, Elefante J, Han JH, MacLullich M, Maldonado JR, Morandi A, Needham DM, Page VJ, Rose L, Salluh JIF, Sharshar T, Shehabi Y, Skrobik Y, Slooter AJC, Smith HAB: The intensive care delirium research agenda: A Multinational, interprofessional perspective. Intensive Care Med. 2017; 43:1329–39
37. Calster BV, McLernon DJ, Smeden MV, Wynants L, Steyerberg EW: Calibration: The Achilles heel of predictive analytics. BMC Med. 2019; 17:1–7