Figure: Example of a Threshold Plot for an AI Binary Classification Model. The scores of various test characteristics are shown on the y-axis as a function of the discrimination threshold (x-axis). Of note, recall is also known as sensitivity or the true positive rate, and precision is also known as the positive predictive value. The f1 score represents the harmonic mean of precision and recall, and the threshold that maximizes the f1 score (in this case, tf = 0.75) is often used as a default for thresholding. The queue rate is the proportion of observations with a predicted probability greater than the corresponding threshold value. This figure was created using Yellowbrick open source Python code (Journal of Open Source Software 2019;4:1075).
There is a reason that algorithm-based tools, trained via machine learning or artificial intelligence (AI) concepts, are already ubiquitous in our lives: They are immensely powerful. It is inevitable that these tools will eventually pervade medicine, including anesthesiology. However, this work identifies and offers solutions to several challenges in implementing AI in anesthesia: 1) identifying appropriate data elements, 2) thresholding complex clinical problems into “yes” or “no” based on probability, 3) integrating tools into the electronic health record, and 4) defining rigorous implementation testing practices to minimize the potential for unintended patient harm.
In the proper hands, AI tools promise to transform the massive amount of perioperative data at our disposal into valuable insights and predictions. For example, ambient intelligence-assisted health care – a type of AI-based tool – involves creating intelligent and responsive environments through the integration of sensors, wearable devices, and other ambient intelligence tools to monitor and analyze patient health data in real time (Nature 2020;585:193-202). Imagine a camera-based system that quantifies patient mobility in the intensive care setting. By identifying patients with poor mobilization, clinicians could develop risk profiles for ICU myopathy and direct resources like targeted physical therapy. Postoperative monitoring using sensors and wearable devices is another logical application.
AI-based conversational large language models like ChatGPT also have potential for health care integration. Plausible uses include integration into note writing to ease the documentation burden of health care providers, allowing for more patient-facing opportunities and time for complex clinical decision-making. Others envision applications in medical education such as creating realistic patient scenarios and providing personalized learning experiences (JMIR Med Educ 2023;9:e46885). There are potential applications in medical writing, such as generating summaries of medical research (J Med Syst 2023;47:33).
Applications of AI also represent a powerful tool to improve patient care in low-resource settings. In an obstetric clinical service in rural Juana Vicente, Dominican Republic, AI-based software helps midwives and other skilled birth attendants interpret images and gestational measurements, facilitating obstetric triage and referral (Radiology 2020;297:513-20). One could imagine a similar approach to preoperative cardiac evaluation by using AI-assisted point-of-care transthoracic ultrasound in low-resource settings. Ultimately, however, the value our field will reap from these methods depends on our understanding of the capabilities of AI tools and ability to rigorously test the implementation of applications.
When designing an AI-based tool, it is important to consider what types of results or outputs algorithms can provide. Consider the example of predicting intraoperative transfusion. An algorithm could provide a probability of intraoperative transfusion. Knowing that there is a 0.1% risk of transfusion – or a 20% risk – could impact your anesthesia plan.
Alternatively, we could create an algorithm to estimate the volume of blood required for a given case. This information, when taken in aggregate across a day or a week's worth of cases, would provide relevant information for a blood bank to improve the efficiency of blood product acquisition and minimize wastage.
We could force the algorithm to make a binary yes/no decision. For example, say we wanted to use the risk of transfusion to guide the ordering of preoperative type and screens. In such a binary classification problem, we have the power to threshold the decision. If we consider a 10% probability to be high risk for transfusion, we could threshold at 10%, and cases with a 5% risk would be considered “no transfusion, do not order type and screen” (J Clin Anesth September 2023).
It is often more powerful, however, to set a threshold to achieve test characteristics relevant to a clinical problem. Each choice of threshold yields a corresponding set of sensitivity, specificity, precision, accuracy, false positive, and false negative values. The Figure shows how precision (also known as positive predictive value) and recall (also known as sensitivity or true positive rate) change as a function of the threshold (x-axis) for a given model. When predicting transfusion to guide type and screen ordering, we can set the threshold so that the algorithm's sensitivity matches the current institutional rate of obtaining type and screens for patients who go on to require transfusion. In other words, we can tune the algorithm to do as well as current practice at ordering type and screens for cases that require transfusion. If the algorithm also succeeds in identifying patients at low risk of transfusion, clinical application will result in a large reduction in overall type and screen ordering while maintaining the current rate of having type and screens available for transfusion cases.
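One way to picture this approach is the sketch below, which searches for the highest threshold whose sensitivity still meets a target and then reports the queue rate (here, the share of cases that would receive a type and screen). The function names, the use of "greater than or equal to" at the cutoff, and the eight example cases are assumptions made for illustration only.

```python
def threshold_for_sensitivity(probabilities, outcomes, target_sensitivity):
    """Return the highest threshold whose sensitivity is at least the target.

    Assumption: a case is flagged when its probability is >= the threshold.
    """
    if not 0.0 <= target_sensitivity <= 1.0:
        raise ValueError("target_sensitivity must be between 0 and 1")
    positives = sum(1 for y in outcomes if y)
    if positives == 0:
        raise ValueError("at least one positive outcome is required")

    # Sensitivity can only rise as the threshold falls, so scanning candidate
    # thresholds from high to low finds the highest one that meets the target.
    for t in sorted(set(probabilities), reverse=True):
        captured = sum(1 for p, y in zip(probabilities, outcomes) if y and p >= t)
        if captured / positives >= target_sensitivity:
            return t
    return min(probabilities)  # not reached: the lowest threshold flags every case


def queue_rate(probabilities, threshold):
    """Proportion of cases flagged at the given threshold."""
    return sum(1 for p in probabilities if p >= threshold) / len(probabilities)


# Hypothetical predicted risks and observed transfusions for eight cases.
risks = [0.01, 0.02, 0.03, 0.05, 0.08, 0.15, 0.30, 0.60]
transfused = [False, False, False, False, True, False, True, True]

# Suppose current practice has a type and screen for every transfused case.
t = threshold_for_sensitivity(risks, transfused, target_sensitivity=1.0)
print(t, queue_rate(risks, t))
```

In this toy example, every transfused case is still captured while only half of all cases are flagged. With real data, the threshold should be chosen on one dataset and its performance confirmed on another.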
Data availability and the timing of data elements are important to consider. To serve as a predictor, a variable must be available before the time at which the prediction is made. For example, if we want to predict case cancellation so that the anesthesiologist building the following day's schedule can identify cases likely to be cancelled before the patient arrives, every predictor must be available a day before the scheduled case.
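A simple check of this timing constraint might look like the sketch below, which keeps only the candidate predictors recorded at or before the moment the prediction is made. The predictor names and timestamps are hypothetical; a real project would draw them from the institution's own data dictionary.

```python
from datetime import datetime, timedelta


def usable_predictors(available_at, prediction_time):
    """Return the names of predictors recorded at or before prediction_time.

    available_at maps a predictor name to the time its value is first
    recorded, or None if it is never recorded before the case.
    """
    return sorted(
        name
        for name, recorded in available_at.items()
        if recorded is not None and recorded <= prediction_time
    )


# Hypothetical case: predict cancellation one day before the scheduled start.
case_start = datetime(2024, 3, 12, 7, 30)
prediction_time = case_start - timedelta(days=1)

# Hypothetical predictors and the times their values first become available.
available_at = {
    "scheduled_procedure": datetime(2024, 2, 20, 14, 0),
    "preop_hemoglobin": datetime(2024, 3, 8, 9, 15),
    "day_of_surgery_vitals": datetime(2024, 3, 12, 6, 45),  # too late to use
    "preop_clinic_visit_completed": None,                   # never recorded
}

print(usable_predictors(available_at, prediction_time))
```

Applying the same filter to the training data helps avoid building a model on information that will not exist when the prediction is needed.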
To integrate a model into the electronic health record, you must consider how the clinician will interact with the prediction. Some examples include recommending a course of action, alerting clinicians to the high risk of a condition or event, or changing the default in an order set. No helpful algorithm is perfect, and if the user depends on the algorithm and it fails, there is a potential for faith to be lost in the tool even if the algorithm performs better overall than clinical judgment alone. One can increase transparency by providing information on what risk factors contribute to a given prediction and the associated probability of the outcome of interest.
There is currently a massive gap between the thousands of papers describing the development of AI models in silico and the far fewer works testing AI-based tools in clinical practice (J Med Internet Res 2021;23:e25759). Rigorously testing algorithm-based tools is vital, as it is possible for inadvertent bias to enter a model. Racial algorithmic bias refers to the systematic discrimination against certain racial groups in the development and implementation of algorithms (Science 2019;366:447-53; JMIR Med Inform 2022;10:e36388). All models should be tested for bias throughout development and with clinical implementation.
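One basic check among many possible ones is to compare a test characteristic, such as sensitivity, across patient subgroups. The sketch below is an assumption about how such a check could be written, with made-up group labels and outcomes. It is not a complete bias audit and does not reproduce the methods of the cited studies.

```python
from collections import defaultdict


def sensitivity_by_group(predicted, actual, groups):
    """Return sensitivity for each group that has at least one positive outcome."""
    true_pos = defaultdict(int)
    false_neg = defaultdict(int)
    for p, a, g in zip(predicted, actual, groups):
        if a:  # only cases with the outcome contribute to sensitivity
            if p:
                true_pos[g] += 1
            else:
                false_neg[g] += 1
    result = {}
    for g in set(true_pos) | set(false_neg):
        result[g] = true_pos[g] / (true_pos[g] + false_neg[g])
    return result


def largest_gap(per_group):
    """Difference between the best and worst group-level values."""
    values = list(per_group.values())
    return max(values) - min(values) if values else 0.0


# Hypothetical model flags, observed outcomes, and subgroup labels.
flagged = [True, True, False, True, True, False, False, True]
outcome = [True, True, True, True, True, True, False, False]
group = ["A", "A", "A", "A", "B", "B", "B", "B"]

per_group = sensitivity_by_group(flagged, outcome, group)
print(per_group, largest_gap(per_group))
```

A large gap between groups would prompt a closer look at the training data, the outcome definition, and the chosen threshold before and after clinical deployment.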
Optimal implementation of AI-based tools brings together individuals with diverse skills. Experts in process mapping, human factors, and implementation science can provide insights into how users interact with the electronic health record to make clinical decisions. Ensuring institutional support by engaging with leadership is helpful in clearing the inevitable hurdles inherent in creating a novel process. In addition, connecting with other health professionals who share the perioperative space can improve the efficacy of the intervention by increasing buy-in. Finally, engaging with health information technology is helpful to leverage their technical skills, especially in clinical decision support integration.
The integration of AI into clinical anesthesiology promises revolutions in clinical practice that aim to enhance patient outcomes and safety. However, unlike many uses of AI in industry, where failure of an algorithm could result in less web traffic or sales, medical application has the capacity to cause unintended patient harm. Therefore, careful design and testing of applications is necessary to truly realize the potential of innovation in this space. By actively participating in the development and testing of AI-based tools, we can direct AI integration toward areas of need in perioperative medicine to improve the care of our patients.
Matthew Zapf, MD, Assistant Professor of Anesthesiology, Research Informatics Division, and Director, Center for Evidence-Based Anesthesia, Adult Anesthesiology, Department of Anesthesiology, Vanderbilt University Medical Center, Nashville, Tennessee.