A Comparison of Five Algorithmic Methods and Machine Learning Pattern Recognition for Artifact Detection in Electronic Records of Five Different Vital Signs: A Retrospective Analysis

Background: Research using electronic health record physiologic data is common, and such data invariably include artifacts. Traditionally, these artifacts have been handled using simple filter techniques. The authors hypothesized that different artifact detection algorithms, including machine learning, may be necessary to provide optimal performance for various vital signs and clinical contexts. Methods: In a retrospective single-center study, intraoperative operating room and intensive care unit (ICU) electronic health record datasets including heart rate, oxygen saturation, blood pressure, temperature, and capnometry were included. All records were screened for artifacts by at least two human experts. Classical artifact detection methods (cutoff, multiples of SD [z-value], interquartile range, and local outlier factor) and a supervised learning model implementing long short-term memory neural networks were tested for each vital sign against the human expert reference dataset. For each artifact detection algorithm, sensitivity and specificity were calculated. Results: A total of 106 (53 operating room and 53 ICU) patients were randomly selected, resulting in 392,808 data points. Human experts annotated 5,167 (1.3%) data points as artifacts. The artifact detection algorithms demonstrated large variations in performance. The specificity was above 90% for all detection methods and all vital signs. The neural network showed significantly higher sensitivities than the classic methods for heart rate (ICU, 33.6%; 95% CI, 33.1 to 44.6), systolic invasive blood pressure (in both the operating room [62.2%; 95% CI, 57.5 to 71.9] and the ICU [60.7%; 95% CI, 57.3 to 71.8]), and temperature in the operating room (76.1%; 95% CI, 63.6 to 89.7). The CI for specificity overlapped for all methods. Generally, sensitivity was low, with only the z-value for oxygen saturation in the operating room reaching 88.9%. All other sensitivities were less than 80%.
Conclusions: No single artifact detection method consistently performed well across different vital signs and clinical settings. Neural networks may be a promising artifact detection method for specific vital signs.


What This Article Tells Us That Is New
• In a single-center retrospective analysis of 53 operating room and 53 intensive care unit (ICU) patients, 5,167 of 392,808 (1.3%) electronic health record measurements (heart rate, oxygen saturation measured by pulse oximetry [Spo2], blood pressure, temperature, and capnometry) were annotated by human reviewers as artifacts.
• A comparison of classic artifact detection methods (cutoff, multiples of SD [z-value], interquartile range, and local outlier factor) and a supervised neural network against the human reviewer standard demonstrated that no single method was superior from a sensitivity or specificity perspective.
• The highest performing method's sensitivity ranged widely for operating room patients, from 36% for diastolic noninvasive blood pressure to 89% for Spo2. The highest performing sensitivity was greater than 70% for capnometry, Spo2, temperature, and invasive mean arterial pressure for operating room patients.
• For ICU patients, the highest performing method's sensitivity ranged from 34% for heart rate to 74% for Spo2. The highest performing sensitivity was greater than 70% for capnometry, Spo2, and temperature.

Comparison of Artifact Detection Methods
Maleczek et al.
The collection of physiologic data is common in anesthesia and intensive care, and a large variety of high-resolution vital signs is generated and stored routinely in hospitals with electronic health record systems. The availability of large datasets of vital signs offers a tremendous opportunity to conduct clinical research using data from thousands of patients.3-7 This large-scale collection of vital signs invariably includes collecting artifacts as well, with a value outside the normal ranges of the vital sign in question being likelier to be an artifact than a value within normal ranges, according to some reports.8,9 These artifacts can originate from different factors, such as electrocautery, disconnected arterial lines, or movement of the lines.10 Although artifacts are typically recognized easily and ignored by the treating staff in real time, retrospective analysis of large datasets does not have documentation of this thought process. Artifact data can alter clinical outcome classification and impact descriptive and inferential analyses.8,11,12 Artifact filtering can have a substantial impact on hypotension prevalence and a small effect on the reported association between hypotension and myocardial injury.12 To date, most large retrospective studies use simple filter techniques, whereas some studies do not comment on artifact handling at all.1 The most commonly used filters are cutoff filters and moving mean/median. Moving mean/median differs from all other filters, as it modifies nearly every data point.13 It is unclear which artifact detection algorithm may be suitable for perioperative and critical care data, since large-scale studies about artifact detection in these data are rare.8,9,13,14 To compare different artifact filtering methods, we applied the currently used methods and augmented them with algorithms known in data science as well as a neural network specially trained for artifact filtering.13,15
Supervised learning algorithm frameworks that implement neural networks in pattern recognition have shown numerous successes. Since artifacts are also a type of pattern, or rather a break of a certain type of pattern, testing neural networks for artifact recognition is an obvious choice.16,17 The main objective of this project was to provide guidance on which algorithm is best suited for filtering the artifacts of each of the most common vital parameters: heart rate, blood pressure, temperature, capnometry, and peripheral oxygen saturation. We hypothesized that a specially trained neural network would outperform classic algorithms.

Study Design
The Medical University of Vienna's ethics committee approved this study and waived the need for informed consent (reference No. 2179/2020). The Medical University of Vienna is a tertiary care hospital with approximately 50,000 surgical procedures and 7,000 to 8,000 intensive care unit (ICU) admissions per year. We conducted a retrospective study using the Medical University of Vienna's perioperative database. The study population consisted of all patients who underwent surgery between January 1, 2019, and September 1, 2020. To ensure that only complete datasets were included, ICU patients had to have at least 120 h of records of all five vital parameters (heart rate, blood pressure, temperature, capnometry, and peripheral oxygen saturation). Device data sources for each of these parameters are described below. Surgical patients had to have at least 30 min of records of the five parameters.
The sample size was defined using a pragmatic approach based on how much data could be annotated in a reasonable time (approximately 4.5 h per expert reviewer) while providing the team with enough data points to split the data and use all filtering algorithms. The sample included 53 patients admitted to the ICU and 53 patients undergoing surgery. The patients were randomly selected from a list of all included patients using a random number generator in Python (Python Software Foundation, USA).18 To limit the amount of ICU patient data, only 120 h of records were used per patient. Data from surgical cases were used in total.
The study followed the Enhancing the Quality and Transparency of Health Research (EQUATOR) Standards for Reporting Diagnostic Accuracy Studies (STARD) guideline.19 The checklist used can be found in Supplemental Digital Content 1 (https://links.lww.com/ALN/D489).

Data Sources
Data were collected from the perioperative database. The perioperative database is constantly synchronized with the Philips IntelliSpace Critical Care and Anesthesia (Philips, The Netherlands) electronic health record, recording

all patients perioperatively and in the ICU. The database contains data on vital parameters and manually entered observations/actions by all healthcare professionals. In the operating room, discrete vital parameters are stored every 15 s; in the ICU, the temporal resolution is 15 min. No artifact recognition method is applied before saving the data; therefore, the raw parameters are saved and can be used for scientific applications. Heart rate, blood pressure, temperature, and pulse oximeter values were collected via a Draeger Infinity monitoring system consisting of both the Draeger Infinity Delta and Infinity M540 systems (Draeger, Germany). Capnometry values were collected via an anesthesia machine (Draeger Primus or Draeger Perseus) in the operating room. In the ICU, carbon dioxide is measured using Draeger monitors. The heart rate parameter studied in this analysis was the electrocardiogram heart rate; pulse rate from the arterial line was not available. During artifact filtering, the human experts had the opportunity to see the pulse rate from the oxygen saturation signal. Blood pressure was collected both noninvasively using a cuff and invasively using arterial lines (mainly radial). For both blood pressure signals, no further signal processing was undertaken, to provide all methods with the "raw" data available in the electronic health record. For cases in which both invasive and noninvasive blood pressure signals were available, both were annotated by the reviewers.

Data Processing
All available values of the five vital signs of interest were extracted from the database for the randomly selected patients. No further processing of the data was performed, except for deleting all capnometry values less than 2 mmHg. By default, the anesthesia machines transfer these values to the electronic health record as soon as they are switched on. Beyond that, no further changes were made to the data; no interpolation was performed.
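This single preprocessing step can be sketched as follows; the table layout and column names here are illustrative assumptions, not the actual database schema.

```python
import pandas as pd

# Hypothetical extract of the vital-sign table; in the study the data came
# from the perioperative database. Only capnometry readings below 2 mmHg
# (machine-idle values) are dropped; everything else is left untouched.
records = pd.DataFrame({
    "vital_sign": ["etco2", "etco2", "hr", "etco2"],
    "value": [0.0, 38.0, 72.0, 1.5],
})
mask = (records["vital_sign"] == "etco2") & (records["value"] < 2.0)
records = records[~mask].reset_index(drop=True)
print(len(records))  # 2 rows remain after dropping the machine-idle readings
```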
In the first step, vital sign artifacts were annotated independently by five human experts: final-year medical students who, after their 4-month anesthesia internship in the operating theater, received 4 h of training both as a group and individually, as well as feedback on demand. Further details can be found in Supplemental Digital Content 5 (https://links.lww.com/ALN/D493). A web-based front-end interface was used to review the patient charts and annotate the artifacts. In this self-developed front end, all reported vital signs plus the pulse rate from pulse oximetry could be viewed singly or combined. Artifacts could be annotated either by clicking on single data points or by circling them to mark more than one. Across two training sessions, the experts were instructed to annotate every data point that they believed to be an artifact. This included a discussion of the most important causes: disconnected/displaced lines, blood sampling, electrocautery, patient movement, and others. Four of the experts annotated 53 patients each (equally distributed between the ICU and operating room). Each patient was annotated independently by two experts. If the two experts had conflicting annotations, the fifth member of the expert panel made the final decision by majority vote regarding whether the data point was an artifact.

Artifact Detection Algorithms
Parallel to and independent from the human artifact filtering used as the reference standard during algorithm comparison, the following artifact detection algorithms were applied to the data (invasive and noninvasive blood pressure signals had the same handling throughout):
• Cutoff: For the cutoff algorithm, the following ranges were defined as valid (and physiologically possible), and all values beyond these ranges were defined as artifacts: systolic blood pressure, 20 to 300 mmHg; mean blood pressure, 10 to 250 mmHg; diastolic blood pressure, 5 to 225 mmHg; capnometry, 5 to 150 mmHg; temperature, 25° to 45°C; heart rate, 5 to 300/min; and Spo2, 0 to 100%.7
• Z-value: The z-value was calculated for each vital sign and for each patient. All values lying more than 3 SDs from the mean were defined as artifacts. For normally distributed data, this threshold marks values outside the interval around the mean containing 99.73% of observations.
• Interquartile range: The interquartile range was calculated for each vital sign and for each patient. All values lying more than 3 interquartile ranges outside the quartiles were defined as artifacts.
• Local outlier factor: The local outlier factor was calculated as described by Breunig et al.20 Seconds were used for Euclidean distances on the x-axis, and the vital sign-specific values (mmHg for blood pressure, percentage for Spo2, and others) were on the y-axis. A k = 7 was chosen primarily to identify extreme changes in the time series. A local outlier factor greater than or equal to 1.5 was used to label data points as artifacts.
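The three threshold-based detectors above can be sketched in a few lines of numpy. This is a minimal illustration, not the study code (which is in Supplemental Digital Content 2 and 3); the local outlier factor, which would typically come from scikit-learn's LocalOutlierFactor, is omitted here.

```python
import numpy as np

def cutoff_artifacts(values, low, high):
    """Flag values outside a fixed physiologically valid range."""
    v = np.asarray(values, dtype=float)
    return (v < low) | (v > high)

def zvalue_artifacts(values, factor=3.0):
    """Flag values more than `factor` SDs from the per-patient mean."""
    v = np.asarray(values, dtype=float)
    sd = v.std()
    if sd == 0:
        return np.zeros(v.shape, dtype=bool)
    return np.abs(v - v.mean()) > factor * sd

def iqr_artifacts(values, factor=3.0):
    """Flag values more than `factor` IQRs outside [Q1, Q3]."""
    v = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(v, [25, 75])
    iqr = q3 - q1
    return (v < q1 - factor * iqr) | (v > q3 + factor * iqr)

# Toy heart-rate trace with one obvious spike (illustrative data only)
hr = np.array([72, 75, 74, 73, 74, 71, 72, 73, 74, 75,
               72, 73, 74, 71, 75, 73, 72, 74, 73, 71, 250.0])
print(cutoff_artifacts(hr, 5, 300))  # all False: 250/min is within 5-300
print(zvalue_artifacts(hr))          # only the 250 spike is flagged
print(iqr_artifacts(hr))             # only the 250 spike is flagged
```

Note how the cutoff filter misses the 250/min spike entirely because it is still "physiologically possible", while the distribution-based filters catch it; this mirrors the low cutoff sensitivity reported in the Results.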
Additionally, a long short-term memory neural network was trained using the human reference standard to show its ability to predict this standard in another part of the dataset. The dataset was transformed into batches of time series. The steps for this method included (1) normalizing the input features to an interval of [-1, 1], (2) calculating the first and second derivatives of the time-dependent input values, and (3) creating a time series for each data feature within a defined time window. Any data points outside the observed period were set to 0. The dataset was then randomly split into a training set (80%) and a test set (20%) while ensuring that data from a single patient were not split. The network architecture comprised an input layer, a long short-term memory layer, and an output layer optimized for the size of the time window and batches. The process was halted once the accuracy,

specificity, sensitivity, and area under the receiver operating characteristics curve (AUC) did not improve in the test set. The number of neurons was estimated based on the input size, output size, batch size, and size of the observed time interval. The loss function was applied using an entropy gradient function with the ADAM21 optimizer. To implement these algorithms, Python 3.818 (primarily with the pandas,22 numpy,23 scikit-learn,24 and scipy25 packages) was used; the code can be found in Supplemental Digital Content 2 and 3 (https://links.lww.com/ALN/D490, https://links.lww.com/ALN/D491).
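The three preprocessing steps listed above (normalization to [-1, 1], first and second derivatives, zero-padded time windows) can be sketched as follows. This is a numpy-only illustration; the window length and feature layout are assumptions for display, and the actual network hyperparameters are in the supplemental code.

```python
import numpy as np

def normalize(signal):
    """Scale a vital-sign series to [-1, 1], preserving relative intervals."""
    s = np.asarray(signal, dtype=float)
    lo, hi = s.min(), s.max()
    if hi == lo:
        return np.zeros_like(s)
    return 2.0 * (s - lo) / (hi - lo) - 1.0

def make_windows(signal, window=8):
    """Stack [value, 1st derivative, 2nd derivative] over a sliding window.

    Points before the start of the record are zero-padded, matching the
    paper's rule that data outside the observed period are set to 0.
    Returns an array of shape (len(signal), window, 3)."""
    s = normalize(signal)
    d1 = np.gradient(s)                  # first time derivative
    d2 = np.gradient(d1)                 # second time derivative
    feats = np.stack([s, d1, d2], axis=1)            # shape (T, 3)
    padded = np.vstack([np.zeros((window - 1, 3)), feats])
    return np.stack([padded[i:i + window] for i in range(len(s))])

hr = np.array([70, 72, 71, 73, 240, 72, 74, 73], dtype=float)
X = make_windows(hr, window=4)
print(X.shape)  # (8, 4, 3): one window per data point
```

Each window would then be fed to the long short-term memory layer, with the human annotation of the window's last point as the training label.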

Statistical Analysis
The main objective of this study was to compare the sensitivity and specificity of all artifact filtering methods used, to provide a guide on which algorithm is best suited for further research. For the calculation of sensitivity and specificity, the human reviewer standard defined the "true values" for the presence or absence of artifacts. Sensitivity was defined as the ratio of artifacts correctly annotated by each algorithm compared to the human reviewer reference standard. Specificity was defined as the ratio of data points correctly annotated as not being an artifact, consistent with the human reviewer reference standard. After all the artifact detection algorithms were applied, the results were compared to the human reference standard. For each artifact detection algorithm, true positives, true negatives, false positives, and false negatives were calculated. Specificity, sensitivity, positive predictive value, and negative predictive value, including 95% CI calculated using Wilson's method,26,27 were displayed per vital sign parameter and artifact detection algorithm. Comparisons of CI were done using the complete nonoverlap method.28 Observations were viewed as independent; no within-person clustering of performance was conducted. Descriptive statistics were calculated using mean ± SD or median and 25% and 75% quartiles, respectively, as appropriate. A formal comparison of artifact detection methods was conducted by comparing the 95% CI of sensitivity/specificity. The null hypothesis was that there was no difference in sensitivity and specificity between the neural network and any other method. All statistical analysis was done using Python 3.818 as described above.
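The per-method summary described above can be sketched as follows. The Wilson score interval is computed here from its closed form (statsmodels' proportion_confint with method="wilson" would give the same interval), and the confusion-matrix counts are illustrative, not the study's data.

```python
import math

def wilson_ci(successes, n, z=1.96):
    """95% Wilson score interval for a binomial proportion."""
    if n == 0:
        return (0.0, 1.0)
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return (center - half, center + half)

def performance(tp, fp, tn, fn):
    """Sensitivity/specificity with Wilson CIs from a confusion matrix."""
    return {
        "sensitivity": tp / (tp + fn),
        "sensitivity_ci": wilson_ci(tp, tp + fn),
        "specificity": tn / (tn + fp),
        "specificity_ci": wilson_ci(tn, tn + fp),
    }

# Illustrative counts only (not the study's data)
stats = performance(tp=62, fp=40, tn=9900, fn=38)
print(f'sensitivity {stats["sensitivity"]:.1%}', stats["sensitivity_ci"])
```

Because artifacts are rare (1.3% prevalence), the specificity CIs are much narrower than the sensitivity CIs, which is why the complete-nonoverlap comparison mostly discriminates methods on sensitivity.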
Sensitivity Analysis. As most of the tested algorithms relied on the definition of thresholds or factors, a sensitivity analysis was conducted using different thresholds. For all algorithms, receiver operating characteristic curves are shown, and the area under the curve was calculated. Furthermore, all key statistical figures (sensitivity, specificity, positive predictive value, and negative predictive value) are shown in the main analysis.
Defining thresholds for the cutoff method was challenging.30,31 Indeed, the literature about reference ranges of vital signs is sparse, often relying on cohort studies focusing on outcomes.32-35 Therefore, four additional thresholds were defined: (1) using a 95% CI from the complete dataset, (2) values outside of physiologic ranges, (3) values that would worry the treating healthcare professionals, and (4) values needing urgent treatment. Details can be found in table 1.
For the z-value and interquartile range, in addition to using 3 as the factor, all values between 2 and 3.5 were tested in 0.5 steps. For the calculation of the receiver operating characteristics curve, all values between 0.5 and 5 were used.
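A sketch of this threshold sweep on synthetic data, illustrating the sensitivity/specificity trade-off reported in the Results; the signal and artifact labels here are simulated and stand in for the human annotations.

```python
import numpy as np

def zvalue_flags(values, factor):
    """Flag values more than `factor` SDs from the mean."""
    v = np.asarray(values, dtype=float)
    sd = v.std()
    return np.abs(v - v.mean()) > factor * sd if sd else np.zeros(len(v), bool)

rng = np.random.default_rng(0)
signal = rng.normal(75, 5, 500)      # plausible heart rates (synthetic)
labels = np.zeros(500, dtype=bool)
labels[::50] = True                  # 10 synthetic artifacts...
signal[::50] += 120                  # ...injected as large spikes

# Sweep the factor from 0.5 to 5 in 0.5 steps, as in the sensitivity analysis
for factor in np.arange(0.5, 5.01, 0.5):
    flagged = zvalue_flags(signal, factor)
    sens = (flagged & labels).sum() / labels.sum()
    spec = (~flagged & ~labels).sum() / (~labels).sum()
    print(f"factor {factor:.1f}: sensitivity {sens:.2f}, specificity {spec:.2f}")
```

On this toy series the spikes stay detectable at every factor, while lowering the factor steadily flags more normal values, reproducing the pattern that increased sensitivity comes at the cost of rapidly decreasing specificity.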

Results
The study population consisted of 28,388 operating room patients and 1,262 ICU patients with available data for all five vital signs. A total of 106 patients (53 ICU and 53 operating room) were randomly selected from the study population. Demographic details can be found in table 2.
For the operating room data and ICU data, the mean (± SD) duration of observation was 2.6 ± 1.8 h and 120.0 ± 0.2 h, respectively. During that time, a total of 395,213 data points were included. After excluding the capnometry values less than 2 mmHg, 392,808 data points remained. The mean number of data points per operating room patient was 3,087.1 ± 3,117.0; in the ICU, it was 4,324.4 ± 2,880.6. Of these, the four human experts annotated a total of 11,699 data points as artifacts, including the data points annotated by both experts evaluating each patient. In 2,891 data points, consensus was met in the first step, leaving 5,917 data points (50.6%) for the third expert's decision. In 2,276 of those cases (38.5%), the adjudicating expert decided that the data point was an artifact, resulting in 5,167 annotated artifacts (1.3% of the data points). Splitting the data into a training set (80%) and a test set (20%) resulted in 310,085 instances without artifacts and 4,162 instances (1.33%) with artifacts in the training set. In the test set, there were 77,557 instances without artifacts and 1,005 instances with artifacts (1.28%).
The application of the artifact detection algorithms resulted in a large variation in annotated artifacts. For example, the interquartile range method resulted in 6,196 annotated artifacts, while the local outlier factor annotated only 1,189 data points as artifacts. Details can be found in table 3. Data describing the long short-term memory neural network are missing from table 3 due to the dataset split.
The hypothesis that the neural network showed significantly higher sensitivities than the classic methods

was found to be true for the following vital signs: heart rate (ICU, 33.6%; 95% CI, 33.1 to 44.6), systolic invasive blood pressure (in both the operating room [62.2%; 95% CI, 57.5 to 71.9] and the ICU [60.7%; 95% CI, 57.3 to 71.8]), and temperature in the operating room (76.1%; 95% CI, 63.6 to 89.7). Specificity was very similar across all methods. As expected, the interquartile range and z-value performed very similarly, and the data were equally distributed. The best-performing methods are summarized in tables 4 and 5.
The specificity was above 90% for all detection methods and all vital signs. However, the sensitivity was low for cutoff, z-value, interquartile range, and local outlier factor, with only the z-value for oxygen saturation in the operating room reaching 88.9%. All other sensitivity values were less than 80%, with the sensitivity of the local outlier factor not exceeding 10%. An example of the neural network's performance can be found in figure 1.
A comparison of the performance across methods revealed significant differences between vital signs, methods, and clinical locations. For example, for heart rate in the ICU, the long short-term memory neural network showed a significantly higher sensitivity of 33.6% and specificity of 99.2%, whereas the interquartile range showed 19.5% and 99.4%, the z-value performed similarly (25.3% and 99.6%), and the cutoff showed a sensitivity of only 3.8%, with a specificity of 100%. By contrast, the cutoff showed better results for invasive mean arterial pressure (MAP) in the operating room (sensitivity, 74.9%; specificity, 100%) but not in the ICU (9.3% and 100%). Details, including 95% CI, can be found in table 4, showing performance in data from the operating room, and in table 5, showing performance in data from the ICU.
To show the performance of different thresholds when using the interquartile range and z-value, a sensitivity analysis was conducted: all threshold values from 0.5 to 5 were tested, as well as different cutoff values. The results showed that using values other than those previously described resulted in better sensitivity, while specificity stayed at an acceptable level. For example, the second threshold level (values worrying the treating healthcare professionals) resulted in 75% sensitivity for invasive MAP. However, specificity decreased to 92%, whereas the originally used thresholds showed 39.8%/100%. This trend was seen for all thresholds used: increased sensitivity led to rapidly decreasing specificity. All calculations can be found in table 1.

All AUC values were above 0.61, with most exceeding 0.85. For example, applying the z-value to MAP resulted in an AUC of 0.88, while applying the interquartile range to carbon dioxide resulted in an AUC of 0.96. The resulting receiver operating characteristic curves, including all AUC values, can be found in Supplemental Digital Content 1 (https://links.lww.com/ALN/D489).
[Table 4 legend: Sensitivity, specificity, positive predictive value, and negative predictive value of all artifact detection algorithms in the operating room. Note that too little information was available to train the neural net for noninvasive blood pressure. *Methods with the highest sensitivity; specificity was > 97% in all marked methods. †Cells with a sensitivity above 70% and a specificity above 95%. Diastolic, diastolic blood pressure; MAP, mean arterial pressure; neural network, long short-term memory (machine learning algorithm); Spo2, oxygen saturation measured by pulse oximetry; systolic, systolic blood pressure.]

Discussion
In the current study, we found that artifact filtering methods performed differently both in terms of specific vital signs and the clinical context of intraoperative versus intensive care. No one method was found to be consistently superior across different vital signs and clinical contexts. Compared to human experts annotating artifacts retrospectively,8,9,36 the methods of interquartile range, z-value, and cutoff filters showed high specificity but only intermediate sensitivity;
[Table 5 legend: Sensitivity, specificity, positive predictive value, and negative predictive value of all artifact detection algorithms in the intensive care unit. Note that too little information was available to train the neural net for noninvasive blood pressure. *Methods with the highest sensitivity; specificity was > 96% in all marked methods. †Cells with a sensitivity above 70% and a specificity above 95%. Diastolic, diastolic blood pressure; MAP, mean arterial pressure; neural network, long short-term memory (machine learning algorithm); Spo2, oxygen saturation measured by pulse oximetry; systolic, systolic blood pressure.]

the local outlier factor had a sensitivity less than 10%. By contrast, a specially trained long short-term memory neural network showed higher sensitivity values, while specificity remained as high as that of the other methods. Narrowing the thresholds of the cutoff filter in a sensitivity analysis also increased sensitivity; however, specificity decreased rapidly. The thoughtful selection of artifact detection methods for each clinical parameter is important. For specific clinical parameters, the use of neural networks demonstrated higher artifact filtering performance.
Artifact filtering is of the utmost importance, as it has the potential to alter scientific results.8,12,37 Some studies have not described any detail of artifact detection at all.38,39 In the current study, it was shown that different artifact detection methods perform differently on each vital sign.
Blood pressure is the focus of many perioperative and critical care research efforts, with both MAP and systolic pressure being reported.1,3,5,7 No single method performed best for invasive MAP: in the operating room, the cutoff method performed best, while in the ICU, the neural network performed best. The sensitivity analysis showed that narrowing the limits rapidly increased sensitivity, with a drop in specificity. This further emphasizes the relevance of choosing the right algorithm with the right threshold. This is especially relevant when looking at pathologic states, such as intraoperative hypotension or hypothermia: erroneously flagging vital signals as artifacts would lead to the exclusion of relevant information.
The same importance of choosing the right method for the right parameter was seen for heart rate, for which all algorithms showed a sensitivity of less than 40%, with large differences between the ICU and the operating room. Although the sensitivity analysis showed a tendency toward increased sensitivity, specificity dropped rapidly. The most probable cause of the low sensitivities is rapid changes in heart rate: electrocautery in the operating room and movement or arrhythmia in the ICU. These artifacts can easily be detected using the pulse rate from the oxygen saturation signal, which was not available to the artifact detection methods.
Data on artifact filtering for vital signs other than heart rate and blood pressure are sparse. The current data demonstrate that each method performed differently for temperature, pulse oximetry, and capnometry with regard to sensitivity and specificity. In contrast to heart rate and blood pressure, the z-value and interquartile range performed well, while the neural network showed mixed results.

As with blood pressure and heart rate, choosing the right threshold, especially for cutoff filters, is an important topic. Although sensitivity was increased by changing the thresholds, specificity dropped for temperature and capnometry. Only for pulse oximetry did the specificity remain stable, which can most probably be attributed to the special distribution of its values.
Interestingly, all algorithms performed worse in the ICU than in the operating room. There are multiple potential explanations for this difference, with the decreased resolution of data in the ICU being the most important. For example, although an artifact due to blood sampling can easily be detected in the arterial blood pressure signal at a 15-s resolution, it will diminish at a 15-min resolution.
Due to the similar statistical approaches used for the interquartile range and z-value, similar results were expected. Nevertheless, for datasets with a "low resolution" vital sign, such as the noninvasive blood pressure in this study, the z-value often performed better than the interquartile range. Generally, the local outlier factor performed poorly across all vital signs, with sensitivity never exceeding 15%. One conclusion may be that the use of the local outlier factor in anesthetic data is questionable, in contrast to other datasets.40 The topic of data granularity itself could not be studied in this project, as the data resolution was evenly distributed. Further research is necessary, especially on the effects of data granularity on artifact filtering algorithms and on the characterizations of different vital signs, which are most often subsequently used for statistical analysis. As the collection of high-resolution vital signs, including waveforms, becomes increasingly popular, the problem of low-granularity data will diminish, at least in perioperative data.

Limitations
One significant limitation of this study is the use of retrospective electronic health record data as the foundation for creating the human reference standard, rather than real-time point-of-care annotation. This possibly explains why the rate of artifacts marked by the reviewers (1.3%) was lower than in some previous studies, which reported up to 14%.8,9,39 Using retrospective data can lead to missed artifacts. For example, the incorrect height of a blood pressure transducer cannot be identified in retrospective data, patient movement cannot be seen, and the dislocation of a saturation sensor can only be assumed. However, as the main objective was a comparison of artifact filtering methods for use in retrospective data, we decided to choose a retrospective approach for the human experts as well.
The neural network used has some limitations. Of importance is the fact that the tested network was trained with the human reference standard and therefore could at best predict that reference standard. External validation is needed to prove generalizability before using the model in other centers. In this case, approximately 450 person-hours were needed to annotate all 106 patients, including the majority votes for artifacts on which two reviewers disagreed.
Another limitation is that humans, classic artifact detection algorithms, and neural networks may not be comparable. Although the neural network is trained with actual reference standard data, such as human expert reviewer annotations, the classic algorithms do not have access to individually annotated data. Classic algorithms are based on thresholds, population distributions, or mathematical transformations. Humans, by contrast, can utilize much more information to filter artifacts than is available to long short-term memory neural networks. The use of electrocautery or the movement of cables is obvious to the human eye but difficult to learn for long short-term memory neural networks. Neural network training has the potential to overfit, a risk that was reduced by using a randomly split internal test dataset. As shown in Supplemental Digital Content 1 (https://links.lww.com/ALN/D489), the neural network's calibration plots showed poor calibration for certain variables, such as pulse oximetry values and temperature, both in the ICU and the operating room, whereas they showed good performance for others, such as systolic blood pressure or heart rate, in both clinical settings. This indicates that there may be a relevant risk of decision errors in particular risk ranges.
The use of human experts is a significant limitation of this study. Although all experts had experience in clinical anesthesia and intensive care, incorrect filtering was always possible. To address this concern, every data point was independently evaluated by two experts. The distribution of patient data to two of the four experts was performed randomly. For cases in which these two experts diverged, a third expert decided by majority vote. The number of data points for which this majority vote was necessary was high (50.6% of annotated data points), leading to the concern that the analysis relied on an imperfect reference standard. The use of three independent experts is the maximum previously described in the literature, leading to the assumption that the reference standard used is the best possible in that setting.8,39 Following Walsh, we conducted a naive analysis accepting that results may be underestimated or overestimated, while other methods seemed impractical in this case.41 A further limitation is the single-center data source, which limits generalizability. The highly standardized way in which monitoring devices are produced and used is encouraging, but we cannot exclude systematic errors, such as those described previously.42,43 In addition, many commercially available electronic health records record intraoperative physiologic data only every 60 s, calling into question the validity of our findings in those datasets. Validation of the neural network on external, completely new data is essential to establish the potential value of the network in broad patient populations.


Conclusions
The use of simple, universal physiologic artifact detection methods seems inferior to vital sign-specific artifact detection algorithms. Commonly used artifact detection methods performed very differently when tested for different vital signs and for different settings (ICU vs. operating room). Using a pretrained neural network for artifact filtering in retrospective data may be an additional valid option as an artifact detection method, although performance may be worse in different datasets, and potential overfitting is an important limitation.

Fig. 1. Example of the neural network's performance compared to the human experts in a sample of heart rates from one patient (example signal: normalized heart rate). Converting the minimal heart rate to -1 and the maximal heart rate to 1 facilitates display of the neural network's performance, while the intervals between heart rate values remain the same.

Table 3.
Number of Annotated Artifacts

Table 4.
Performance of the Algorithms in the Operating Room

Table 5.
Performance of the Algorithms in the Intensive Care Unit