Abstract
Commercial applications of artificial intelligence and machine learning have made remarkable progress recently, particularly in areas such as image recognition, natural speech processing, language translation, textual analysis, and self-learning. Progress had historically languished in these areas, such that these skills had come to seem ineffably bound to intelligence. However, these commercial advances have performed best at single-task applications in which imperfect outputs and occasional frank errors can be tolerated.
The practice of anesthesiology is different. It embodies a requirement for high reliability, and a pressured cycle of interpretation, physical action, and response rather than any single cognitive act. This review covers the basics of what is meant by artificial intelligence and machine learning for the practicing anesthesiologist, describing how decision-making behaviors can emerge from simple equations. Relevant clinical questions are introduced to illustrate how machine learning might help solve them—perhaps bringing anesthesiology into an era of machine-assisted discovery.
The human mind excels at estimating the motion and interaction of objects in the physical world, at inferring cause and effect from a limited number of examples, and at extrapolating those examples to determine plans of action to cover previously unencountered circumstances. This ability to reason is backed by an extraordinary memory that subconsciously sifts events into those experiences that are pertinent and those that are not, and is also capable of persisting those memories even in the face of significant physical damage. The associative nature of memory means that the aspects of past experiences that are most pertinent to the current circumstance can be almost effortlessly recalled to conscious thought. However, set against these remarkable cerebral talents are fatigability, a cognitive laziness that presents as a tendency to short-cut mental work, and a detailed short-term working memory that is tiny in scope. The human mind is slow and error-prone at performing even straightforward arithmetic or logical reasoning.1
In contrast, an unremarkable desktop computer in 2019 can rapidly retrieve and process data from 32 gigabytes of internal memory—a quarter of a trillion discrete bits of information—with absolute fidelity and tirelessness, given an appropriately constructed program to execute. The greatest progress in artificial intelligence has historically been in those realms that can most easily be represented by the manipulation of logic and that can be rigorously defined and structured, known as classical or symbolic artificial intelligence. Such problems are quite unlike the vagaries of the interactions of objects in the physical world. Computers are not good at coming to decisions—indeed, the formal definition of the modern computer arose from the proof that certain propositions are logically undecidable2 —and classical approaches to artificial intelligence do not easily capture the idea of a “good enough” solution.
For most of human history, the practice of medicine has been predominantly heuristic and anecdotal. Traditionally, quantitative patient data would be relatively sparse, decision making would be based on clinical impression, and outcomes would be difficult to relate with much certainty to the quality of the decisions made. The transition to evidence-based practice3 and Big Data is a relatively recent occurrence. In contrast, anesthesiologists have long relied on personalized streams of quantified data to care for their unconscious patients, and advances in monitoring and the richness of that data have underpinned the dramatic improvements in patient safety in the specialty.4 Anesthesiologists also practice at the sharper end of cause and effect: decisions usually cannot be postponed, and errors in judgment are often promptly and starkly apparent.
The general question of artificial intelligence and machine learning in anesthesiology can be stated as follows:
There is some outcome that should be either attained or avoided.
It is not certain what factors lead to that outcome, or a clinical test that predicts that outcome cannot be designed.
Nevertheless, a body of patient data is available that provides at least circumstantial evidence as to whether the outcome will occur. The data are plausibly, but not definitively, related.
The signal, if it is present in the patient data, is too diffuse across the data set for it to be learned reliably from the number of cases that an anesthesiologist might personally encounter, or the clinical decision-making relies upon a subconscious judgment that the anesthesiologist cannot elucidate.
Can an algorithm, derived from the given data and outcomes, provide insight in order to improve patient management and the decision-making process?
This form of machine learning might be termed machine-assisted discovery.
This article takes the form of an integrative review,5 defined as “a review method that summarizes past empirical or theoretical literature to provide a more comprehensive understanding of a particular phenomenon or healthcare problem.” The article therefore introduces the theory underlying classical and modern approaches to artificial intelligence and machine learning, and surveys current empirical and clinical areas to which these techniques are being applied. Concepts in the fundamentals of artificial intelligence and machine learning are introduced incrementally:
Beginning with classical or symbolic artificial intelligence, a logical representation of the problem is crafted and then searched for an optimal solution.
Model fitting of physiologic parameters to an established physiologic model is shown as an extension of search.
Augmented linear regression is shown to allow certain nonlinear relationships between outcomes and physiologic variables to be discerned, even in the absence of a defined physiologic model. It requires sufficient expertise about which combinations of nonphysiologic transformations of the variables might be informative.
Neural networks are shown to provide a mechanism to establish a relationship between input variables and an output without defining a logical representation of the problem or defining transformations of the inputs in advance. However, this flexibility comes at considerable computational cost and a final model with a behavior that may be hard to comprehend.
Numerous other theoretical and computational approaches do exist, and these may have practical advantages depending on the nature of the problem and the structure of the desired outputs.6
The literature search for an integrative review should be transparent and reproducible, comprehensive but focused and purposive. A literature search was performed using PubMed for articles published since 2000 using the following terms: “artificial intelligence anesthesiology” (543 matches), “computerized analysis anesthesiology” (353 matches), “machine learning anesthesiology” (91 matches), and “convolutional neural network anesthesiology” (1 match). Matches were reviewed for suitability, and augmented with references of historical significance. The specialty of anesthesiology features a broad history of attempts to apply computational algorithms, artificial intelligence, and machine learning to tasks in an attempt to improve patient safety and anesthesia outcomes (table 1). Recent significant and informative empirical advances are reviewed more closely.
Classical Artificial Intelligence and Searching
Creating a classical artificial intelligence algorithm begins with the three concepts of a bounded solution space, an efficient search, and termination criteria.
First, using what is known about the problem, a set of possible solutions that the algorithm can produce is defined. The algorithm will be allowed to choose one of these possible solutions, and so the solution set must be created in such a way that it is reasonably certain that the best possible solution is among the choices available. The algorithm will never be able to think outside of this “box,” and in that sense the solution space is bounded. In the game of tic-tac-toe, for example, the set of solutions is those squares that have not yet been taken. The best solution is the one that most diminishes the opponent’s ability to win, ideally until victory is achieved (i.e., minimax).7 In real life problems, however, it can be difficult to define a bounded set of solutions or even say explicitly what “best” means.
Second, the possible solutions are progressively evaluated and searched, trying to find the best one. In designing and programming the search strategy, anything else of worth that is known about the problem should be incorporated, such as how to value one solution versus another, ways to search efficiently by focusing on areas of the solution space that are more likely to be productive,8 and intermediate results that might allow certain subsets of the solution space to be excluded from further evaluation (i.e., pruning). Sometimes the knowledge and understanding of the underlying problem might be quite weak, and then in the worst case it may be necessary to fall back on an exhaustive and computationally intensive brute-force search of all the possible solutions.
Third, the algorithm must terminate and present a result. Given enough time, eventually the algorithm should ideally find and select the optimum solution. Depending on the structure of the problem and the search algorithm, it may be possible to guarantee through theory that the algorithm will terminate with the optimal solution within a constrained amount of time. A weaker theoretical guarantee would be that the algorithm will at least improve its solution with each search iteration. However, in the general case and if no such theoretical guarantee is possible, the algorithm might only select the best good-enough solution found within an allowed time limit, or perhaps issue an error message that no sufficiently satisfactory solutions were identified.
Search-based classical artificial intelligence has obvious applications to practical problems such as wayfinding on road maps, in which a route must be chosen that is connected by legal driving maneuvers and arrives in the shortest time. Less obviously, this same logic can be applied to real-world problems such as locating a lost child in a supermarket. According to the order of operations above, the first step is to create a bounded solution set: by covering the exit, the location of the child is reasonably bounded to be somewhere within the supermarket. Second, a search is begun. A naive approach might be to walk up and down every aisle in turn until the child is found but, from insight, far better search strategies for this problem can be easily identified. The most efficient search strategy is clearly to walk along the ends of the aisles: this allows whole aisles to be scanned and excluded (i.e., pruned) rapidly. Third, the search terminates either on finding the child, or on determining that additional resources must be employed if the child cannot be found within a certain time.
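To make these three steps concrete, the sketch below applies them to the wayfinding example: a small, entirely hypothetical road map defines the bounded solution space, a uniform-cost search expands the cheapest partial route first and prunes any route that reaches a location more expensively than an already-known alternative, and the search terminates either on reaching the destination or on exhausting a fixed budget of evaluations. The place names and travel times are invented for illustration.

```python
import heapq

# A toy road map: each location maps to {neighbor: travel_minutes}.
# The graph itself is the bounded solution space: only routes along
# these (hypothetical) roads can ever be proposed.
ROADS = {
    "hospital": {"main_st": 4, "river_rd": 7},
    "main_st": {"bridge": 6, "hospital": 4},
    "river_rd": {"bridge": 3, "hospital": 7},
    "bridge": {"clinic": 5, "main_st": 6, "river_rd": 3},
    "clinic": {},
}

def shortest_route(start, goal, max_expansions=10_000):
    """Uniform-cost search: expand the cheapest partial route first,
    prune locations already reached more cheaply, and terminate on the
    goal or when the expansion budget is exhausted."""
    frontier = [(0, start, [start])]       # (cost so far, location, route)
    best_cost = {start: 0}                 # cheapest known cost per location
    expansions = 0
    while frontier and expansions < max_expansions:
        cost, node, route = heapq.heappop(frontier)
        if node == goal:                   # termination: best route found
            return cost, route
        if cost > best_cost.get(node, float("inf")):
            continue                       # pruning: a cheaper path is already known
        expansions += 1
        for neighbor, minutes in ROADS[node].items():
            new_cost = cost + minutes
            if new_cost < best_cost.get(neighbor, float("inf")):
                best_cost[neighbor] = new_cost
                heapq.heappush(frontier, (new_cost, neighbor, route + [neighbor]))
    return None                            # termination: no acceptable solution found in time

print(shortest_route("hospital", "clinic"))
# -> (15, ['hospital', 'main_st', 'bridge', 'clinic'])
```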
Designing classical artificial intelligence algorithms is not a turnkey mathematical task; it is heavily dependent on the human expertise of the designer. In classical artificial intelligence, the role of the computer is to contribute its immense power of calculation to evaluate the relative merits of a large number of possible solutions, which the designer provides. This division of labor can be dated back to Ada Lovelace’s 1843 description of the conception of the modern computer: “The Analytical Engine has no pretensions whatever to originate anything. It can do whatever we know how to order it to perform. It can follow analysis; but it has no power of anticipating any analytical relations or truths. Its province is to assist us in making available what we are already acquainted with.”9
In 1997, the IBM supercomputer Deep Blue defeated the then world champion, Garry Kasparov, at chess. It had been a longstanding goal for a machine to be able to play chess at levels unattainable by humans.10 The rules of chess are clear and unambiguous, and the actions take place within the confines of the board. The state of play is completely apparent and known to both players. It is possible to list all the available legal moves, all responses to those moves, all responses to those responses, and so on—the solution space is bounded. In principle, it is not even particularly difficult to write a program that can play chess flawlessly. The program simply tries out (i.e., searches) all possible moves and all possible responses until the game is either won or lost. However, a computer program that tried to evaluate every possible move and all of its consequences would not be able to make its first move, so immense is the search space.11 Deep Blue’s success rested on two pillars. First, its search algorithm possessed an evaluation function to approximate the relative value of a position. This function was crafted from the distilled, programmed, strategic wisdom of human chess experts, and allowed the search algorithm to ignore choices that were likely to be unproductive. Second, this search algorithm was supported by brute-force computing power capable of evaluating two hundred million moves per second. These techniques proved sufficient for Deep Blue to achieve superhuman mastery of a game with approximately 10⁴⁷ possible board positions—an immense but bounded space. In many ways, however, mastery of chess was classical artificial intelligence’s triumph but also its swansong. The division of labor remains the same as in Lovelace’s original description, and the human strategic understanding of the game was not outdone but instead overwhelmed by the indefatigability of the machine’s tactical evaluation of millions and millions of positions. The computer did what it was told, but it did not learn.
In anesthesiology practice, the closest example is open-loop target-controlled infusion. Pharmacokinetic models describe the forward relationship from a drug administration schedule D(t) to an effect site concentration e(t). However, it is the inverse solution that is required: for a requested e(t), some D(t) should be produced, perhaps subject to limits on administration rate or plasma concentration.12 An open-loop target-controlled infusion pump will perform a search for a drug administration schedule that brings the predicted concentration of the medication within the body toward this goal, subject to the given constraints. The underlying equations are concise and effective,13 but the device cannot become more proficient at its task. It follows the algorithms that it is given.
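A deliberately simplified sketch of this kind of search is shown below. It assumes a toy one-compartment model with an effect site and invented rate constants, not any published pharmacokinetic model, and at each control interval it searches a bounded set of candidate infusion rates for the one whose predicted effect-site concentration most closely approaches the requested target without breaching a plasma concentration limit.

```python
# Toy one-compartment model with an effect site; every parameter here is
# invented for illustration and is not from any published pharmacokinetic model.
DT = 1.0 / 6.0                         # control interval, min (10 s)
V, KE, KE0 = 10.0, 0.3, 0.5            # volume (l), elimination and equilibration constants (1/min)
RATES = [0.25 * i for i in range(81)]  # bounded solution space: candidate rates, 0 to 20 mg/min

def step(cp, ce, rate, dt=DT, n=10):
    """Advance plasma (cp) and effect-site (ce) concentrations by dt under a
    constant infusion rate, using simple Euler substeps."""
    h = dt / n
    for _ in range(n):
        cp += h * (rate / V - KE * cp)
        ce += h * KE0 * (cp - ce)
    return cp, ce

def choose_rate(cp, ce, target, cp_max=8.0):
    """Search the bounded set of candidate rates for the one whose predicted
    effect-site concentration comes closest to the target, pruning any rate
    that would breach the plasma concentration limit."""
    best_rate, best_err = 0.0, float("inf")
    for rate in RATES:
        cp_next, ce_next = step(cp, ce, rate)
        if cp_next > cp_max:
            continue                   # constraint violated: exclude this choice
        err = abs(ce_next - target)
        if err < best_err:
            best_rate, best_err = rate, err
    return best_rate

cp = ce = 0.0
for _ in range(120):                   # 20 min of simulated open-loop control at 10-s intervals
    rate = choose_rate(cp, ce, target=3.0)
    cp, ce = step(cp, ce, rate)
print(f"plasma {cp:.2f} ug/ml, effect site {ce:.2f} ug/ml")
```

The device repeats this bounded search at every interval, but the model it searches over never changes: it cannot learn from the patient in front of it.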
Model Fitting as a Form of Searching
Anesthesiologists take particular interest in objective patient outcomes, and whether these good or bad outcomes can be predicted from the data that are available. Lacking a direct test for the desired outcome (in which case prognosis would be straightforward), the research question becomes whether the patient’s outcome is in some way imprinted upon and foreshadowed by the imperfectly informative data that are available. An approach is to seek to fit models to the data in order to try to make more reliable predictions, and therefore potentially discover previously unappreciated but useful relationships within the data. This approach requires a large enough body of data and patient outcomes on which model fitting can be performed, and this large body of data cannot reasonably be analyzed by hand.
A model is created using an example set for which the data and outcomes are known—i.e., the training data. The essence of a useful model is that it should be able to make useful predictions about data it has not previously seen, i.e., that it is generalizable. An overly complicated model may become overfit to its training data, such that its predictions are not generalizable. Figure 1 shows examples of this. Each figure shows a population of green circles, representing notionally favorable outcomes, and red crosses, representing notionally unfavorable outcomes. The question is whether the two available items of data, Feature X and Feature Y, can predict the outcome. Three models are fit to the exact same training data, to produce a black line (known as the discriminant) that separates the figure into prediction regions, shaded either green or red, accordingly. An item of training data is correctly classified when its symbol falls in a region that is shaded the same color, and is misclassified when it does not (i.e., when it lies on the wrong side of the discriminant).
Figure 1A shows a model that is underfit. Although most of the symbols are correctly classified, there are several misclassified red crosses on the left of the figure, and the simple linear discriminant has no way to capture these. A decision algorithm based on this model would have high specificity (green circles are almost all correctly classified), but a lower sensitivity (several bad outcomes are erroneously predicted as good). The decision performance is therefore somewhat reminiscent of the Mallampati test,14 which also demonstrates high specificity but low sensitivity.15 The discriminant in figure 1A would function better if it could assume a more complex form. Figure 1B, in contrast, shows a model that is overfit. Although all the training data are correctly classified, the unwieldy discriminant is governed too much by the satisfaction of individual data points rather than the overall structure of the problem. This model is unlikely to generalize well to new data, as it is overly elaborate. Figure 1C shows a model that is appropriately fit to the data (indeed, the data were created to illustrate this point). The discriminant is complex enough to capture the distribution of the outcomes, but it is also parsimonious in that the shape of the discriminant is described by only a few parameters. In practice, of course, the true underlying distribution is not known in advance, so the performance of the discriminant must be tested statistically. The discriminant in figure 1C has fewer degrees of freedom than the discriminant in figure 1B, so its performance is statistically more likely to represent the true nature of the underlying process even though it has more misclassifications than the overfit discriminant. Model fitting is therefore a form of search in which the choices are the parameters admitted to the model and their relative weights, in order to find models that are statistically most likely to represent the underlying process based upon the training data that are available.
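The trade-off between underfitting and overfitting can be demonstrated in a few lines. The sketch below uses synthetic data (not the data of figure 1) and assumes the scikit-learn library: discriminants of increasing flexibility are fit to the same training set, and their accuracy on the training data is compared with their accuracy on held-out data. The most flexible model typically fits the training data best but does not necessarily generalize best.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic stand-in for figure 1: two features, and an outcome that depends
# on a moderately curved boundary plus a little noise (purely illustrative).
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(300, 2))
y = ((X[:, 1] > 0.5 * X[:, 0] ** 2 - 1) ^ (rng.random(300) < 0.1)).astype(int)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Fit discriminants of increasing flexibility to the same training data and
# compare training accuracy with accuracy on data the model has never seen.
for degree in (1, 3, 12):
    model = make_pipeline(PolynomialFeatures(degree),
                          StandardScaler(),
                          LogisticRegression(C=100, max_iter=10000))
    model.fit(X_train, y_train)
    print(degree,
          round(model.score(X_train, y_train), 2),   # fit to the training data
          round(model.score(X_test, y_test), 2))     # generalization to held-out data
```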
Discovering Nonlinear Relationships in Clinical Medicine
Many judgments in anesthesiology are based on absolute thresholds or linear combinations of variables. A patient with a heart rate above 100 beats/min is tachycardic, and one with a temperature above 38°C is febrile. A man whose ECG features an R wave in lead aVL and an S wave in lead V3 that combined exceed 28 mm has left ventricular hypertrophy.16
Logistic regression is useful when fitting a weighted combination of variables to an outcome. Logistic regression defines an error function that measures the extent to which the current weighted combination of variables tends to misclassify outcomes. These weights are subsequently modified in order to improve the classification rate. The regression algorithm determines the changes in the weightings that would most improve the current classification, and then repeats the process until an optimum weighting is settled upon. The regression algorithm therefore performs a gradient descent on the error function; one can picture an imaginary ball rolling down a landscape defined by the error function until a lowest, optimum point is reached such that the best linear combination of the input variables is determined. Logistic regression is a powerful machine-learning technique that works quickly and is also convex, meaning essentially that the “ball” can roll down to the optimum solution from any starting point (for almost all well-posed problems). However, many outcome problems in anesthesiology and critical care clearly do not depend on linear criteria. For example, intensive care unit outcomes that may depend on a patient’s potassium level, or glucose level, or airway positive end-expiratory pressure are more likely to be Goldilocks problems: the best outcomes require an amount that is neither too big, nor too small, but just right. Figure 2A illustrates such a situation, in which the good outcomes are clustered around a point in the feature space, and deviations from that point result in poor outcomes. As a clinical correlate, one might imagine that the outcomes are timely intensive care unit discharges,17 and the data K and G represent well-controlled levels of potassium and glucose, although the data shown here are purely artificial and created solely for this example. The data show a clear clustering of the outcomes, but an algorithm that is only capable of producing a discriminant based on a linear combination of K and G would not be able to capture that separation. Rather than performing a nonlinear regression over the two variables K and G, a solution lies in transforming the data by calculating the squares of K and G (i.e., K2, G2) and their cross-term KG, and then performing an augmented linear logistic regression over the five variables K, G, K2, G2, and KG.
As shown in figure 2B, a linear discriminant in the K², G² plane will perfectly separate the outcomes. This discriminant, given by K² + G² − 9 = 0, is exactly the same as a circle of radius 3 in figure 2A, demonstrating that nonlinear boundaries can be discovered. Although it may seem clinically bizarre to talk about the squared value of the serum potassium (K²), it is easy to write a quadratic function that has clinical meaning. For example, the function −2K² + 17K − 35 is positive if the value of K lies between 3.5 and 5.0, but turns negative for any more hypokalemic or hyperkalemic value outside that range. This simple example underscores the ways in which the outputs from machine-learning algorithms can seem inscrutable or black box. To a computer, the two definitions are essentially equivalent: one is no more meaningful or better than the other. It takes human effort to explain numerical results in a clinically meaningful way.18
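The sketch below works through this augmentation on synthetic data arranged like figure 2A. The two (centered) variables are augmented with their squares and cross-term, and an ordinary linear logistic regression is then fit by gradient descent over the five inputs; the fitted discriminant behaves like the circle of radius 3, with the squared terms doing the work.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic stand-in for figure 2A: outcomes are favorable when the centered
# values of K and G lie within a radius of 3 of the target point.
K = rng.uniform(-5, 5, 400)
G = rng.uniform(-5, 5, 400)
y = (K ** 2 + G ** 2 < 9).astype(float)

# Augment the two variables with their squares and cross-term, then fit an
# ordinary linear logistic regression over the five inputs (plus intercept).
X = np.column_stack([K, G, K ** 2, G ** 2, K * G, np.ones_like(K)])
w = np.zeros(X.shape[1])

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for _ in range(30000):
    p = sigmoid(X @ w)                 # predicted probability of a good outcome
    grad = X.T @ (p - y) / len(y)      # gradient of the logistic (cross-entropy) error
    w -= 0.01 * grad                   # one gradient-descent step down the error surface

pred = sigmoid(X @ w) > 0.5
print(np.round(w, 2))                  # K^2 and G^2 typically receive large negative weights;
                                       # K, G, and KG stay near zero
print("accuracy:", np.mean(pred == (y == 1)))   # typically close to 1.0 for this separable toy data
```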
Augmenting the variable space with quadratic terms allows a linear algorithm to define nonlinear features like islands (as in fig. 2A) and open curves (as in fig. 1B), but the technique can be extended further by using higher polynomial terms. Augmentation can also be performed with reciprocal powers such as K⁻¹ (i.e., 1/K), which would in principle allow the machine-learning algorithm to discern useful relationships based on ratios. Common such clinical examples include the following:
The Shock Index19 (heart rate / systolic blood pressure), which rises in response to the combined increase in heart rate and decrease in blood pressure associated with hypovolemia.
The Rapid Shallow Breathing Index20 (respiratory frequency / tidal volume), which rises in response to the fast, small, panting breaths associated with respiratory failure.
The Body Mass Index (weight / height²), which represents obesity as excess weight distributed over an insufficiently sized physical frame.
A downside to augmenting the variable space is that the number of input variables can increase dramatically, which can overwhelm the size of the available training data and lead to a significant risk of overfitting. One challenge is that the input variables and the augmented combinations that are to be considered must be fully defined in advance. Only nonlinear relationships that can be approximated from a linear combination of the variables that are supplied can be found. For real-world problems in medicine and biology, considerable expertise is required in order to define a meaningful and informative set of inputs. Human insight must also be applied to determine what problem should be solved and what outputs are useful. When only limited knowledge is available about the best way to frame a problem numerically, modern artificial intelligence and neural networks provide an alternative approach.
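As a rough illustration of how quickly augmentation inflates the input space, the number of distinct polynomial terms of total degree at most d that can be formed from n base variables is the binomial coefficient C(n + d, d); the short sketch below tabulates this growth for a few values.

```python
from math import comb

# Number of distinct polynomial terms (including all cross-terms) of total
# degree <= d that can be built from n base variables: C(n + d, d).
for n in (5, 20, 50):
    for d in (2, 3, 4):
        print(f"{n} variables, degree {d}: {comb(n + d, d):,} terms")
```

With 50 base variables and fourth-degree terms this already exceeds 300,000 inputs, which makes clear how easily the augmented space can dwarf the available training data.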
Modern Artificial Intelligence
The limitations of classical artificial intelligence were particularly apparent in attempts to produce programs capable of playing Go. Go is, at least in terms of its rules, a simpler game than chess. Two players take turns to place a stone, white or black, on a 19 × 19 board. Plays in Go take place at the intersections of the grid lines, rather than on the squares as in chess. Once a stone is placed, it does not subsequently move. Briefly, the game is won by whoever manages to corral the largest total space on the board. However, in play, Go is a much more complex game than chess, with approximately 10¹⁷⁰ board positions compared to 10⁴⁷. It is hard to overstate the magnitude of these combinatoric numbers. There are about 10⁸⁰ atoms in the universe, so if each individual atom were actually another universe in its own right, then that would still represent only a total of 10¹⁶⁰ atoms. Even the best classical artificial intelligence approaches to Go seemed unable to accomplish anything better than amateurish play.
In March 2016, a computer program, AlphaGo, defeated a human player of the highest professional caliber, the world number two Lee Sedol, in a head-to-head five-game series of Go. This was the first time that a computer program had beaten a player of that level of skill without handicaps. Although AlphaGo featured an algorithm to choose moves that was somewhat guided by the design of its developers, its evaluation function was composed of a neural network that had been trained against a database of recorded games and outcomes.21 In chess, a crude valuation of a position can be made from the strength of the pieces remaining to the player and their ability to move freely. In Go, the stones do not have an equivalent individual worth and the valuation of a Go position instead depends on the relative spatial interplay of the player’s stones and the opponent’s stones. Where classical artificial intelligence algorithms were unable to discern this strategic posture, AlphaGo’s neural network approach was successful.
Neural Networks
Figure 3A illustrates the simplest feasible fully connected feed-forward neural network, taking two inputs and returning one outcome. The network is composed of the inputs, an input layer, a hidden layer, an output layer, and the output. Each layer is fully connected to the next, meaning there is a path from each node (i.e., neuron) to every node in the following layer. Each path has an associated weight, which describes how much the signal traveling along that path is amplified or attenuated or inverted. At each node, the weighted inputs are added together and then applied to an activation function. Each of the nodes illustrated here uses a sigmoid activation function, which is the most basic of the standard activation functions as shown in figure 3B. The general idea is that a node, in a manner reminiscent of a biologic neuron, will remain “off” until a suitable degree of excitation is reached, at which point it will quickly turn “on.” The first node in the input layer, for example, receives inputs from Features S and T. These inputs are weighted by w_fs,i1 and w_ft,i1, respectively, so the total input z to the first input node is given by z = w_fs,i1·S + w_ft,i1·T. The total input is then applied to the sigmoid function, producing an output from this node of σ(z) = 1/(1 + exp(−z)). This output feeds forward to the next, hidden layer along with the other weighted contributions from the input layer, and so forth until an output is produced. The output of the sigmoid activation function is always between 0 and 1, so if the outcomes are classified as 0 (e.g., red crosses) and 1 (e.g., green circles), the performance of the network can be assessed by how closely it predicts the various outcomes in the training data.
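A minimal sketch of this forward pass is shown below, using arbitrary placeholder weights for the 2-2-2-1 network of figure 3A (10 weights in all); the numerical values carry no meaning and exist only to make the data flow explicit.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Layer weights for the tiny network of figure 3A: 2 inputs -> 2-node input
# layer -> 2-node hidden layer -> 1-node output layer (10 weights in total).
# The numerical values here are arbitrary placeholders.
W_in = np.array([[ 0.6, -1.2],
                 [ 0.8,  0.4]])       # features S, T to the input layer
W_hid = np.array([[ 1.1,  0.3],
                  [-0.7,  0.9]])      # input layer to hidden layer
W_out = np.array([[ 1.5],
                  [-0.8]])            # hidden layer to the single output node

def forward(s, t):
    x = np.array([s, t])
    a_in = sigmoid(x @ W_in)          # e.g., first node: sigmoid(w_fs,i1*S + w_ft,i1*T)
    a_hid = sigmoid(a_in @ W_hid)     # weighted sums feed forward through the hidden layer
    return sigmoid(a_hid @ W_out)[0]  # final output, always between 0 and 1

print(forward(0.2, 0.9))
```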
The behavior of the network depends on the values of the various weights w, and so the general idea of machine learning in a neural network is to adjust these weights until satisfactory performance is achieved. To begin, the weights are set to random values and so the initial performance of the network will usually be poor. However, for each error in prediction that is output, a degree of blame can be apportioned over the weights that contributed to it, and these weights can then be adjusted accordingly. This process is called back propagation,22 and it is the process by which the network learns to improve from its mistakes. Data are fed forward through the weights and nodes to produce output predictions, and then the errors in these predictions are propagated backward through the network to readjust the weights. This process is continued until, hopefully, the network settles to some form in which it is able to model the outputs satisfactorily based upon the input data. Beyond this basic description, of course, there are extraordinary implementational details and subtleties. For example, even in the rudimentary neural network shown in figure 3A, there are already 10 different weights that can be adjusted. The number of parameters in any practical network will be very large, and a great deal of care is required in the handling of the training data in order to avoid immediate overfitting. Additionally, the error function for neural networks is not globally convex, so there is no guarantee that the learning process will converge upon the optimum solution, and it may instead settle on some less ideal solution. In the gradient descent analogy described earlier, this would be like the imaginary ball becoming stuck in a small divot and failing to roll down to the valley below. Two ways around this problem are either to survey the landscape by starting from a selection of different locations, or to occasionally give the ball (or the landscape) some sort of shake (i.e., stochastic gradient descent23 ). Nevertheless, the process remains very computationally intensive and slow, despite technical advances in repurposing the hardware of three-dimensional graphics cards (i.e., Graphics Processing Units [GPUs]) to parallelize the calculations.24
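The sketch below trains that same tiny network by plain gradient descent on a deliberately easy synthetic task, to make the forward-then-backward rhythm of back propagation explicit. A practical implementation would add bias terms, more careful initialization, and stochastic mini-batches; none of those refinements are shown here.

```python
import numpy as np

rng = np.random.default_rng(2)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# Toy training data: two features in (-1, 1), outcome 1 when their sum is positive.
X = rng.uniform(-1, 1, size=(200, 2))
y = (X.sum(axis=1) > 0).astype(float).reshape(-1, 1)

# Random starting weights for the 2 -> 2 -> 2 -> 1 network of figure 3A.
W1, W2, W3 = (rng.normal(0, 0.5, shape) for shape in ((2, 2), (2, 2), (2, 1)))
LR = 1.0

for _ in range(20000):
    # Forward pass: data flow through the weights and sigmoid nodes.
    A1 = sigmoid(X @ W1)
    A2 = sigmoid(A1 @ W2)
    A3 = sigmoid(A2 @ W3)
    # Backward pass: apportion the prediction error over the weights.
    dZ3 = (A3 - y) / len(y)                  # error at the output node
    dZ2 = (dZ3 @ W3.T) * A2 * (1 - A2)       # blame propagated back to the hidden layer
    dZ1 = (dZ2 @ W2.T) * A1 * (1 - A1)       # ...and back to the input layer
    for W, A_prev, dZ in ((W3, A2, dZ3), (W2, A1, dZ2), (W1, X, dZ1)):
        W -= LR * (A_prev.T @ dZ)            # gradient-descent weight update

pred = sigmoid(sigmoid(sigmoid(X @ W1) @ W2) @ W3) > 0.5
print("training accuracy:", np.mean(pred == (y == 1)))   # typically well above 0.9 on this easy task
```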
The primary reason to take on the burden of training neural networks is that they possess the new property of universality.25 Universality means that, given an adequately large number of nodes in the respective layers, the weights of a neural network can be configured to approximate any other continuous function to within any desired level of accuracy.25 This leads to two immediate and important benefits.
The property of universality stipulates that the neural network can, in principle, represent any continuous function to any desired degree of accuracy. The idea of a function is very broad—it does not just mean the transformation of one numerical value into another. It incorporates any transformation of input data into an output, such as a Go board position into a verdict as to whether that position is winning or losing,21 or determining the location of a lesion on a three-dimensional magnetic resonance image of the brain.26 A function can be any transformation, even if its mathematical form is not known in advance.
The behavior of the network is dependent on the weights. The network learns the appropriate weights solely from the training data that it is given. Therefore, the network can learn the functional relationship between the outcome and the data even if there is no preexisting knowledge about what that relationship might be. However, it can be extraordinarily difficult to reverse this process to determine an efficient statement of the functional relationship that is described by the fitted weights. This leads to the well-known criticism that the operation of a neural network is particularly hard to characterize and therefore hard to validate.
As shown in figure 4 therefore, it is possible, at least in the abstract, for a neural network to take a preoperative image of a patient and produce a prediction about how difficult that patient’s intubation might be. The proposed function is a transformation from the pixel values of the image to an estimated Cormack-Lehane view,27 but it is hard to intuit in advance what the form of that underlying function might turn out to be. While the picture alone is very unlikely to contain sufficient information to produce a reliable prediction, it is plausible that it is in some way informative as to the outcome. Although a fully connected neural network is shown, the universality theorem only demonstrates that a fully connected network with a single hidden layer can represent any function. It does not guarantee that the network contains a reasonably tractable number of nodes, nor that the inputs are informative as to the output, nor that convergence to a satisfactory answer can occur within a feasible amount of time. The current state of the art in computer science therefore involves finding network topologies that use a more efficient number of nodes and can be trained in a reasonable period of time. Two examples of these alternative network connection patterns are deep convolutional neural networks,28 in which there are several hidden layers but many weights are constrained to have the same set of values, and residual neural networks,29 in which additional paths with a weight of one skip over intervening hidden layers. Both of these approaches derive plausible justification from analogous arrangements of neurons in the mammalian brain, such as visual field maps for convolutional networks and pyramidal projection neurons for residual networks.
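The weight-sharing idea behind convolutional layers can be seen in a few lines: the same small set of weights is slid along the input, so the layer needs only a handful of parameters where a fully connected layer would need one weight per connection. The sketch below is illustrative only and uses an arbitrary three-weight kernel on a one-dimensional signal.

```python
import numpy as np

signal = np.arange(12, dtype=float)    # a 12-sample input (e.g., one row of an image)
kernel = np.array([0.25, 0.5, 0.25])   # ONE shared set of 3 weights

# Convolution: the same 3 weights are reused at every position, so this layer
# has 3 parameters. A fully connected layer mapping 12 inputs to the same
# 10 outputs would need 12 x 10 = 120 independent weights.
conv_out = np.array([signal[i:i + 3] @ kernel for i in range(len(signal) - 2)])
print(conv_out.shape, conv_out)
```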
Evolution in Time
Each of the feed-forward neural network tasks illustrated so far makes its predictions based solely on the input data available at that immediate time. They are stateless, i.e., they have no temporal relationship to any measurement taken before or after. If a neural network is intended to make intraoperative decisions about patient management, then the network will require some way to base its decision-making on memories of evolving trends. Figure 5 illustrates an Elman network,30 with three inputs and one output. In this topology, weighted paths project from the outputs of the hidden layer to a context layer, and then further weighted paths return from the context layer to the inputs of the hidden layer. An Elman network is the simplest example of a recurrent neural network that can evaluate changes in data over time. For example, the output might be a decision whether to transfuse or not,31 and the inputs might be the clinically observable parameters heart rate (R), blood pressure (P), and estimated blood loss (B). The context layer would allow the network to discern and respond to trends in these inputs.
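The sketch below shows one time step of such a network with untrained, randomly chosen weights, using the three inputs named above; in practice the weights would be learned from training data, for example by back propagation through time, and the crude input scaling shown is purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

N_IN, N_HIDDEN = 3, 4   # inputs: heart rate, blood pressure, estimated blood loss
W_in  = rng.normal(0, 0.5, (N_IN, N_HIDDEN))      # inputs -> hidden layer
W_ctx = rng.normal(0, 0.5, (N_HIDDEN, N_HIDDEN))  # context layer -> hidden layer
W_out = rng.normal(0, 0.5, (N_HIDDEN, 1))         # hidden layer -> output node

def elman_step(x, context):
    """One time step of an Elman network: the hidden layer sees both the
    current inputs and a copy of its own previous activations (the context),
    which is what lets the output depend on trends rather than a snapshot."""
    hidden = sigmoid(x @ W_in + context @ W_ctx)
    output = sigmoid(hidden @ W_out)[0]
    return output, hidden            # the new hidden state becomes the next context

# Feed a short, entirely made-up intraoperative trend through the network.
context = np.zeros(N_HIDDEN)
for hr, map_, ebl in [(70, 80, 0.1), (85, 72, 0.4), (105, 60, 0.9)]:
    x = np.array([hr / 100, map_ / 100, ebl])     # crude scaling toward the 0-1 range
    decision, context = elman_step(x, context)
print("untrained transfusion score:", round(decision, 3))
```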
Practical Approaches to Machine Learning in Anesthesiology
Advances in technology and monitoring can change the impetus for machine learning. For example, a neural network developed to detect esophageal intubation from flow-loop parameters32 is obviated by continuous capnography.33,34 In this instance, a reliable clinical test has made readily apparent what was once an insidious and devastating complication. A machine-learning model to predict difficult intubation from patient appearance35 must now be tempered by the convenience and ubiquity of video laryngoscopy. Advances in airway management technology have broadened the range of outcomes of laryngeal visualization that can be accepted. Anesthesiologists have considered the possibility of an algorithm that might autonomously control depth of anesthesia based on electroencephalogram recordings36,37 since the 1950s, yet this concept remains very much a topic of current research.38
Two papers from 2018 illustrate the theoretical concepts covered. The first paper, by Hatib et al.,39 uses a very highly augmented data set in conjunction with logistic regression to produce an algorithmic model that can, in post hoc analysis, detect the incipient onset of hypotension up to 15 min before it actually occurs. For model training, the authors employed a database of 545,959 min of high-fidelity (100 Hz) arterial waveform recordings acquired from the records of 1,334 patients, internally validated against the records of 350 additional patients that were held back. The training data set included 25,461 episodes of hypotension. The model itself is derived from 51 base variables assembled from significant features extracted from the processing of arterial waveforms obtained by the Edwards FloTrac device (Edwards Lifesciences, USA).40 Each variable was augmented with its square and the reciprocals of both (i.e., X, X², X⁻¹, X⁻²), and then every combination of these variables was generated to produce an overall input set of 2,603,125 parameters. The authors chose two clearly separated outcomes: hypotension defined as mean arterial pressure less than 65 mmHg (e.g., notionally red crosses), and nonhypotension defined as mean arterial pressure greater than 75 mmHg (e.g., notionally green circles), but did not consider the “gray zone” between these outcomes. Despite the large number of available parameters and the risk of overfitting, the authors were nevertheless able to use a parsimonious parameter selection process to produce a final model that depended on only 23 of the 2.6 million available inputs (Maxime Cannesson, M.D., Ph.D., Department of Anesthesiology, UCLA Medical Center, Santa Monica, California. Electronic personal communication, June 13, 2018). The study did have some limitations, notably that it did not include any episodes in which hypotension was caused by surgical intervention, all model fitting and assessments were retrospective and offline, and the algorithm made no recommendations as to whether an intervention should be performed. Nevertheless, the authors demonstrated an algorithm that was apparently able to foresee episodes of hypotension in operative patients up to 15 min in advance of the onset of the event itself with an area under the curve of 0.95.
The second paper, by Lee et al.,41 describes a neural network approach to predicting the Bispectral Index (BIS) based upon the infusion history of propofol and remifentanil. This paper is particularly noteworthy because a strongly theoretical approach to this question already exists in the target-controlled infusion literature. The classical approach is to model the pharmacokinetics of propofol42 and remifentanil43 in the body independently, based upon the infusion history. The effect site concentration of each drug is then combined in a response surface model,44 producing an estimate of the BIS. These classical pharmacokinetic models are well established and have been used to demonstrate closed-loop target-controlled infusion control of anesthetized patients.45 In contrast, Lee et al. created a neural network composed of two stages. The first stage receives the infusion history of propofol and remifentanil over the preceding 30 min with a resolution of 10 s (i.e., 180 inputs for each medication). The inputs for each medication are fed to two separate eight-node recurrent neural networks. Rather than using an Elman30 arrangement, as seen in figure 5, the paper made use of a newer configuration known as a Long Short-Term Memory.46 Simple recurrent neural networks such as the Elman network have difficulty recalling or learning events that happen over a long timeframe as their training error gradients become too small to be adaptive. The Long Short-Term Memory is a more robust memory topology that also includes pathways that explicitly cause the network to reinforce or forget remembered states. The output from the Long Short-Term Memory layer is applied directly to a simple fully connected feed-forward neural network with 16 nodes of the type shown in figures 3 and 4. A single output node emits a scaled BIS estimation. The network was developed from a database of 231 patient cases (101 cases used for training, 30 for validation, and 100 for final testing), and comprised a total of around 2 million data points. In post hoc analysis, the classical pharmacokinetic/pharmacodynamic models were able to predict the BIS value with a root-mean-square error of 15 over all phases of the anesthetic. Despite being naive to all existing theory, the neural network comfortably outperformed the best current models with a root-mean-square error of 9—a remarkable victory for modern artificial intelligence over existing classical pharmacokinetic/pharmacodynamic expert systems47 that might lead us to question the ongoing utility of classical response surface models.48
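A sketch of a network topology along the lines described is shown below, written against the Keras application programming interface. The branch and layer sizes follow the description in the text (two eight-unit recurrent branches feeding a 16-node fully connected layer), but the activation functions, preprocessing, and training details are assumptions for illustration and are not taken from the published model.

```python
import tensorflow as tf

# Two 30-min infusion histories at 10-s resolution: 180 time steps per drug.
propofol = tf.keras.Input(shape=(180, 1), name="propofol_history")
remifent = tf.keras.Input(shape=(180, 1), name="remifentanil_history")

# One eight-unit Long Short-Term Memory branch per medication.
p_branch = tf.keras.layers.LSTM(8)(propofol)
r_branch = tf.keras.layers.LSTM(8)(remifent)

# Merge the two branches, pass them through a small fully connected layer,
# and emit a single scaled BIS estimate.
merged = tf.keras.layers.Concatenate()([p_branch, r_branch])
hidden = tf.keras.layers.Dense(16, activation="relu")(merged)
bis_scaled = tf.keras.layers.Dense(1, activation="sigmoid")(hidden)

model = tf.keras.Model(inputs=[propofol, remifent], outputs=bis_scaled)
model.compile(optimizer="adam", loss="mse")
model.summary()
```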
Future Directions
The most exciting recent advance in machine learning has been the development of AlphaGo Zero,49 a system capable of learning how to play board games without any human guidance, solely through self-play. It performs at a level superior to all previous algorithms and human players in chess, Go, and shogi. This learning approach requires that the system be able to play several lifetimes’ worth of simulated games against itself. Although anesthesia simulators exist, they do not currently simulate patient physiology with the fidelity with which a simulated chess game matches a real chess game.
The most plausible route to the introduction of artificial intelligence and machine learning into anesthetic practice is that the routine intraoperative management of patients will begin to be handed off to closed-loop control algorithms. Maintaining a stable anesthetic is a good first application because the algorithms do not necessarily have to be able to render diagnoses, but rather to detect if the patient has begun to drift outside the control parameters that have been set by the anesthesiologist.50 In this regard, such systems would be like an autopilot, maintaining control but alarming and disconnecting if conditions outside the expected performance envelope were encountered—hardly a threat to the clinical autonomy of the anesthesiologist.51 A closed-loop control system need not necessarily have any learning capability itself, but it provides the means to collect a large amount of physiologic data from many patients with high fidelity, and this is an essential precursor for machine learning. Access to large volumes of high-quality data will enable more machine-learning successes, such as the offline post hoc prediction of BIS41 and hypotension39 discussed above. For now, finding algorithms that provide good clinical predictions in real time should be emphasized. Management of all the parameters of a stable anesthetic is not a simple problem,52 but embedding53 the machine in the care of the patient is a good way to begin.54
Research Support
Supported by National Institutes of Health (Bethesda, Maryland) grant No. R01 GM121457.
Competing Interests
Dr. Connor holds the following patents pertinent to the subject matter: U.S. Patent 8460215, Systems and methods for predicting potentially difficult intubation of a subject; U.S. Patent 9113776, Systems and methods for secure portable patient monitoring; U.S. Patent 9549283, Systems and methods for determining the presence of a person.