Background

Research using electronic health record physiologic data is common, and such data invariably include artifacts. Traditionally, these artifacts have been handled using simple filter techniques. The authors hypothesized that different artifact detection algorithms, including machine learning, may be necessary to provide optimal performance across vital signs and clinical contexts.

Methods

In a retrospective single-center study, intraoperative operating room and intensive care unit (ICU) electronic health record datasets comprising heart rate, oxygen saturation, blood pressure, temperature, and capnometry were analyzed. All records were screened for artifacts by at least two human experts. Classical artifact detection methods (cutoff, multiples of SD [z-value], interquartile range, and local outlier factor) and a supervised learning model implementing long short-term memory neural networks were tested for each vital sign against the human expert reference dataset. For each artifact detection algorithm, sensitivity and specificity were calculated.

Results

A total of 106 (53 operating room and 53 ICU) patients were randomly selected, resulting in 392,808 data points. Human experts annotated 5,167 (1.3%) data points as artifacts. The artifact detection algorithms demonstrated large variations in performance. The specificity was above 90% for all detection methods and all vital signs. The neural network showed significantly higher sensitivities than the classic methods for heart rate (ICU, 33.6%; 95% CI, 33.1 to 44.6), systolic invasive blood pressure (in both the operating room [62.2%; 95% CI, 57.5 to 71.9] and the ICU [60.7%; 95% CI, 57.3 to 71.8]), and temperature in the operating room (76.1%; 95% CI, 63.6 to 89.7). The CI for specificity overlapped for all methods. Generally, sensitivity was low, with only the z-value for oxygen saturation in the operating room reaching 88.9%. All other sensitivities were less than 80%.

Conclusions

No single artifact detection method consistently performed well across different vital signs and clinical settings. Neural networks may be a promising artifact detection method for specific vital signs.

Editor’s Perspective
What We Already Know about This Topic
  • Modern perioperative and critical care clinical research often uses electronic health record physiologic data.

  • The ideal physiologic data artifact detection algorithm remains unclear.

What This Article Tells Us That Is New
  • In a single-center retrospective analysis of 53 operating room and 53 intensive care unit (ICU) patients, 5,167 of 392,808 (1.3%) electronic health record measurements (heart rate, oxygen saturation measured by pulse oximetry [Spo2], blood pressure, temperature, and capnometry) were annotated by human reviewers as an artifact.

  • A comparison of classic artifact detection methods (cutoff, multiples of SD [z-value], interquartile range, and local outlier factor) and a supervised neural network against the human reviewer standard demonstrated that no single method was superior from a sensitivity or specificity perspective.

  • The highest performing method’s sensitivity ranged widely for operating room patients, from 36% for diastolic noninvasive blood pressure to 89% for Spo2. The highest performing sensitivity was greater than 70% for capnometry, Spo2, temperature, and invasive mean arterial pressure for operating room patients.

  • For ICU patients, the highest performing method’s sensitivity ranged from 34% for heart rate to 74% for Spo2. The highest performing sensitivity was greater than 70% for capnometry, Spo2, and temperature.

The collection of physiologic data is common in anesthesia and intensive care, and a large variety of high-resolution vital signs is generated and stored routinely in hospitals with electronic health record systems. The availability of large datasets of vital signs offers a tremendous opportunity to conduct clinical research using data from thousands of patients. For example, large datasets of blood pressure values have enabled researchers to determine associations between intraoperative hypotension and clinically relevant outcomes, such as acute kidney injury, stroke, and myocardial injury.1–7 

This large-scale collection of vital signs invariably includes artifacts as well; according to some reports, a value outside the normal range of the vital sign in question is likelier to be an artifact than a value within the normal range.8,9  These artifacts can originate from different factors, such as electrocautery, disconnected arterial lines, or movement of the lines.10  Although artifacts are typically recognized easily and ignored by the treating staff in real time, retrospective analysis of large datasets lacks documentation of this real-time judgment. Artifact data can alter clinical outcome classification and affect descriptive and inferential analyses.8,11,12  Artifact filtering can have a substantial impact on hypotension prevalence and a small effect on the reported association between hypotension and myocardial injury.12  To date, most large retrospective studies use simple filter techniques, whereas some studies do not comment on artifact handling at all.1  The most commonly used filters are cutoff filters and moving mean/median filters. The moving mean/median differs from all other filters in that it modifies nearly every data point.13  It is unclear which artifact detection algorithm is most suitable for perioperative and critical care data, since large-scale studies of artifact detection in these data are rare.8,9,13,14 
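
The last point is easy to see in a toy example. The following minimal sketch (a hypothetical pandas snippet; the heart rate values are invented) shows how a centered 3-point moving median removes a single spike but also rewrites almost every other value, unlike a pure detection filter that only flags suspect points:

    import pandas as pd

    # Hypothetical heart rate series with one obvious artifact (beats/min).
    hr = pd.Series([72, 74, 73, 250, 75, 74, 76])

    # A centered 3-point moving median removes the 250 beats/min spike ...
    smoothed = hr.rolling(window=3, center=True).median()

    # ... but it also replaces almost every other value in the series.
    print(smoothed.tolist())  # [nan, 73.0, 74.0, 75.0, 75.0, 75.0, nan]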

To compare different artifact filtering methods, we applied the currently used methods and augmented them with algorithms known from data science as well as a neural network specially trained for artifact filtering.13,15  Supervised learning frameworks implementing neural networks have shown numerous successes in pattern recognition. Since artifacts are themselves a kind of pattern, or rather a break in an expected pattern, testing neural networks for artifact recognition is a natural choice.16,17  The main objective of this project was to provide guidance on which algorithm is best suited for filtering artifacts from each of the most common vital parameters: heart rate, blood pressure, temperature, capnometry, and peripheral oxygen saturation. We hypothesized that a specially trained neural network would outperform the classic algorithms.

Study Design

The Medical University of Vienna's ethics committee approved this study and waived the need for informed consent (reference No. 2179/2020). The Medical University of Vienna is a tertiary care hospital with approximately 50,000 surgical procedures and 7,000 to 8,000 intensive care unit (ICU) admissions per year. We conducted a retrospective study using the Medical University of Vienna's perioperative database. The study population consisted of all patients who underwent surgery between January 1, 2019, and September 1, 2020. To ensure that only complete datasets were included, ICU patients had to have at least 120 h of records of all five vital parameters (heart rate, blood pressure, temperature, capnometry, and peripheral oxygen saturation); device data sources for each of these parameters are described below. Surgical patients had to have at least 30 min of records of the five parameters.

The sample size was defined pragmatically, based on how much data could be annotated in a reasonable time (approximately 4.5 h per expert reviewer) while providing enough data points to split the data and apply all filtering algorithms. The sample included 53 patients admitted to the ICU and 53 patients undergoing surgery. The patients were randomly selected from a list of all eligible patients using a random number generator in Python (Python Software Foundation, USA).18  To limit the amount of ICU patient data, only 120 h of records were used per patient. Data from surgical cases were used in full.
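
As an illustration only, the random selection step could look like the following sketch; the ID pools are placeholders sized to the study population reported in the Results, and the seed is an assumption (the article does not report one):

    import random

    # Placeholder ID pools sized to the reported study population
    # (28,388 operating room and 1,262 ICU patients); the real IDs
    # came from the perioperative database.
    or_patient_ids = list(range(28388))
    icu_patient_ids = list(range(1262))

    random.seed(2020)  # hypothetical seed for reproducibility
    or_sample = random.sample(or_patient_ids, 53)
    icu_sample = random.sample(icu_patient_ids, 53)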

The study followed the Standards for Reporting Diagnostic Accuracy Studies (STARD) guideline of the Enhancing the Quality and Transparency of Health Research (EQUATOR) Network.19  The checklist used can be found in Supplemental Digital Content 1 (https://links.lww.com/ALN/D489).

Data Sources

Data were collected from the perioperative database. The perioperative database is constantly synchronized with the Philips IntelliSpace Critical Care and Anesthesia (Philips, The Netherlands) electronic health record, recording all patients perioperatively and in the ICU. The database contains data on vital parameters and manually entered observations/actions by all healthcare professionals. In the operating room, discrete vital parameters are stored every 15 s; in the ICU, the temporal resolution is 15 min. No artifact recognition method is applied before saving the data; therefore, the raw parameters are saved and can be used for scientific applications. Heart rate, blood pressure, temperature, and pulse oximeter values were collected via a Draeger Infinity monitoring system consisting of both the Draeger Infinity Delta and Infinity M540 systems (Draeger, Germany). Capnometry values were collected via an anesthesia machine (Draeger Primus or Draeger Perseus) in the operating room. In the ICU, carbon dioxide is measured using Draeger monitors. The heart rate parameter studied in this analysis was the electrocardiogram heart rate; pulse rate from the arterial line was not available. During artifact filtering, the human experts had the opportunity to see the pulse rate from the oxygen saturation. Blood pressure was collected both noninvasively using a cuff and invasively using arterial lines (mainly radial). For both blood pressure signals, no further signal processing was undertaken to provide all methods with the “raw” data available in the electronic health record. For cases in which both invasive and noninvasive blood pressure signals were available, both were annotated by the reviewers.

Data Processing

All available values of the five vital signs of interest were extracted from the database for the randomly selected patients. No further processing of the data was performed, except for deleting all capnometry values less than 2 mmHg; by default, the anesthesia machines transfer such values to the electronic health record as soon as they are switched on. Beyond that, no changes were made to the data, and no interpolation was performed.
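
In a long-format extract, this exclusion is a single mask; the sketch below assumes a hypothetical pandas DataFrame with "vital_sign" and "value" columns (the actual database schema is not described in the article):

    import pandas as pd

    # Hypothetical long-format extract: one row per stored value.
    data = pd.DataFrame({
        "vital_sign": ["etco2", "etco2", "hr", "etco2"],
        "value": [0.0, 38.0, 72.0, 1.5],
    })

    # Drop capnometry values below 2 mmHg (machine-on default values);
    # all other vital signs are kept untouched.
    mask = (data["vital_sign"] == "etco2") & (data["value"] < 2)
    data = data[~mask]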

In the first step, vital sign artifacts were annotated independently by five human experts: final-year medical students who, after their 4-month anesthesia internship in the operating theater, received 4 h of training both as a group and individually, as well as feedback on demand. Further details can be found in Supplemental Digital Content 5 (https://links.lww.com/ALN/D493). A web-based front-end interface was used to review the patient charts and annotate the artifacts. In this self-developed front end, all reported vital signs plus the pulse rate from pulse oximetry could be viewed singly or combined. Artifacts could be annotated either by clicking on single data points or by circling them to mark more than one. Across two training sessions, the experts were instructed to annotate every data point that they believed to be an artifact. The training included a discussion of the most important causes: disconnected/displaced lines, blood sampling, electrocautery, patient movement, and others. Four of the experts annotated 53 patients each (equally distributed between the ICU and operating room), so that each patient was annotated independently by two experts. If the two experts had conflicting annotations, the fifth expert made the final decision by majority vote on whether the data point was an artifact.

Artifact Detection Algorithms

In parallel with and independent from the human artifact annotation used as the reference standard during algorithm comparison, the following artifact detection algorithms were applied to the data (invasive and noninvasive blood pressure signals were handled identically throughout; a minimal sketch of these classic filters follows the list):

  • Cutoff: For the cutoff algorithm, the following ranges were defined as valid (and physiologically possible), and all values beyond these ranges were defined as artifacts: systolic blood pressure, 20 to 300 mmHg; mean blood pressure, 10 to 250 mmHg; diastolic blood pressure, 5 to 225 mmHg; capnometry, 5 to 150 mmHg; temperature, 25° to 45°C; heart rate, 5 to 300/min; and Spo2, 0 to 100%.7 

  • Z-value: The z-value was calculated for each vital sign and for each patient. All values lying more than 3 SDs from the patient's mean were defined as artifacts; under a normal distribution, this threshold flags only the 0.27% of values outside the central 99.73%.

  • Interquartile range: The interquartile range was calculated for each vital sign and for each patient. All values lying more than three interquartile ranges from the median were defined as artifacts, a threshold chosen to be analogous to the 3-SD z-value rule.

  • Local outlier factor: The local outlier factor was calculated as described by Breunig et al.20  Euclidean distances were computed with time in seconds on the x-axis and the vital sign–specific values (mmHg for blood pressure, percentage for Spo2, and so on) on the y-axis. A neighborhood size of k = 7 was chosen primarily to identify extreme changes in the time series. Data points with a local outlier factor greater than or equal to 1.5 were labeled as artifacts.20 
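
The following is a minimal sketch of the four classic detectors as described above, not the authors' published code (that is in Supplemental Digital Content 2). The function names and example data are ours; the interquartile range anchoring follows the description above, and the local outlier factor uses scikit-learn's LocalOutlierFactor, which stores the negated score:

    import numpy as np
    from sklearn.neighbors import LocalOutlierFactor

    def cutoff_filter(values, low, high):
        # Flag values outside a fixed physiologic range,
        # e.g., 20 to 300 mmHg for systolic blood pressure.
        values = np.asarray(values, dtype=float)
        return (values < low) | (values > high)

    def z_value_filter(values, factor=3.0):
        # Flag values lying more than `factor` SDs from the patient's mean.
        values = np.asarray(values, dtype=float)
        z = (values - values.mean()) / values.std()
        return np.abs(z) > factor

    def iqr_filter(values, factor=3.0):
        # Flag values lying more than `factor` interquartile ranges from the
        # median; the study's exact anchoring is defined in its supplemental code.
        values = np.asarray(values, dtype=float)
        q1, q3 = np.percentile(values, [25, 75])
        return np.abs(values - np.median(values)) > factor * (q3 - q1)

    def lof_filter(seconds, values, k=7, threshold=1.5):
        # Flag points with a local outlier factor >= 1.5, computed on
        # (time in seconds, vital sign value) pairs with Euclidean distance.
        X = np.column_stack([seconds, values]).astype(float)
        lof = LocalOutlierFactor(n_neighbors=k)
        lof.fit(X)
        return -lof.negative_outlier_factor_ >= threshold  # scikit-learn negates the score

    # Example: systolic blood pressure sampled every 15 s (invented values).
    t = np.arange(0, 150, 15)
    sbp = np.array([118, 120, 4, 119, 121, 122, 310, 120, 118, 119])
    print(cutoff_filter(sbp, 20, 300))  # flags the 4 and 310 mmHg values
    print(lof_filter(t, sbp))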

Additionally, a long short-term memory neural network was trained on the human reference standard to test its ability to predict this standard in a held-out part of the dataset. The dataset was transformed into batches of time series. The steps for this method were (1) normalizing the input features to the interval [–1, 1], (2) calculating the first and second derivatives of the time-dependent input values, and (3) creating a time series for each data feature within a defined time window; any data points outside the observed period were set to 0. The dataset was then randomly split into a training set (80%) and a test set (20%) while ensuring that data from a single patient were not split across the sets. The network architecture comprised an input layer, a long short-term memory layer, and an output layer, optimized for the size of the time window and batches. Training was halted once the accuracy, specificity, sensitivity, and area under the receiver operating characteristic curve (AUC) no longer improved in the test set. The number of neurons was estimated from the input size, output size, batch size, and size of the observed time interval. A cross-entropy loss was minimized with the Adam21  optimizer. To implement these algorithms, Python 3.818  (primarily with the pandas,22  numpy,23  scikit-learn,24  and scipy25  packages) was used; the code can be found in Supplemental Digital Content 2 and 3 (https://links.lww.com/ALN/D490, https://links.lww.com/ALN/D491).
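
The authors' full network is provided in Supplemental Digital Content 3; the sketch below only illustrates the preprocessing and architecture described above, written here in PyTorch as an assumption. The window size, hidden size, learning rate, training length, and synthetic data are ours, not the study's settings (a patient-wise split, e.g., with scikit-learn's GroupShuffleSplit, would enforce that no patient appears in both sets):

    import numpy as np
    import torch
    from torch import nn

    def make_features(values, window=16):
        # Normalize one patient's trace to [-1, 1], add first and second
        # derivatives, and cut it into fixed-length, zero-padded windows.
        v = np.asarray(values, dtype=np.float32)
        v = 2 * (v - v.min()) / (v.max() - v.min() + 1e-8) - 1
        d1 = np.gradient(v)
        d2 = np.gradient(d1)
        feats = np.stack([v, d1, d2], axis=-1).astype(np.float32)  # (time, 3)
        pad = (-len(v)) % window
        feats = np.pad(feats, ((0, pad), (0, 0)))  # zeros outside the period
        return torch.from_numpy(feats.reshape(-1, window, 3))

    class ArtifactLSTM(nn.Module):
        # Input layer -> LSTM layer -> per-time-step artifact logit.
        def __init__(self, n_features=3, hidden=32):
            super().__init__()
            self.lstm = nn.LSTM(n_features, hidden, batch_first=True)
            self.head = nn.Linear(hidden, 1)

        def forward(self, x):                      # x: (batch, window, 3)
            out, _ = self.lstm(x)
            return self.head(out).squeeze(-1)      # (batch, window) logits

    model = ArtifactLSTM()
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = nn.BCEWithLogitsLoss()               # binary cross-entropy

    x = make_features(np.random.normal(75, 3, size=480))  # synthetic heart rate
    y = torch.zeros_like(x[..., 0])                # labels: 0 = not an artifact
    for _ in range(3):                             # truncated training loop
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        optimizer.step()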

Statistical Analysis

The main objective of this study was to compare the sensitivity and specificity of all artifact filtering methods used, to provide a guide on which algorithm is best suited for further research. For the calculation of sensitivity and specificity, the human reviewer standard defined the "true" presence or absence of artifacts. Sensitivity was defined as the proportion of reference standard artifacts correctly annotated by each algorithm; specificity was defined as the proportion of data points correctly annotated as not being an artifact, consistent with the human reviewer reference standard. After all the artifact detection algorithms were applied, the results were compared to the human reference standard, and true positive, true negative, false positive, and false negative counts were calculated for each algorithm. Specificity, sensitivity, positive predictive value, and negative predictive value, including 95% CI calculated using Wilson's method,26,27  were displayed per vital sign parameter and artifact detection algorithm. CIs were compared using the complete nonoverlap method.28  Observations were treated as independent; no adjustment for within-patient clustering was performed. Descriptive statistics were calculated as mean ± SD or median with 25% and 75% quartiles, as appropriate. A formal comparison of artifact detection methods was conducted by comparing the 95% CI of sensitivity/specificity. The null hypothesis was that there was no difference in sensitivity and specificity between the neural network and any other method. All statistical analysis was done using Python 3.818  as described above.
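
As an illustration, sensitivity, specificity, and Wilson score intervals can be computed from confusion matrix counts as in the sketch below; the counts shown are hypothetical, not study results:

    import math

    def wilson_ci(successes, n, z=1.96):
        # 95% Wilson score interval for a proportion.
        if n == 0:
            return (float("nan"), float("nan"))
        p = successes / n
        denom = 1 + z**2 / n
        center = (p + z**2 / (2 * n)) / denom
        half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
        return center - half, center + half

    # Hypothetical confusion matrix counts for one vital sign and one method.
    tp, fn, tn, fp = 620, 385, 77000, 557

    sensitivity = tp / (tp + fn)   # artifacts correctly flagged
    specificity = tn / (tn + fp)   # non-artifacts correctly left unflagged
    print(sensitivity, wilson_ci(tp, tp + fn))
    print(specificity, wilson_ci(tn, tn + fp))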

Sensitivity Analysis

As most of the tested algorithms relied on the definition of thresholds or factors, a sensitivity analysis was conducted using different thresholds. For all algorithms, receiver operating characteristic curves are shown, and the area under the curve was calculated. Furthermore, all key statistical measures (sensitivity, specificity, positive predictive value, and negative predictive value) were calculated as in the main analysis.

Defining thresholds for the cutoff method was challenging. Multiple ways of calculating a reference range are described in the literature, with a 95% CI being the most commonly used, especially for laboratory values.29–31  However, the literature on reference ranges of vital signs is sparse and often relies on cohort studies focused on outcomes.32  For oxygen saturation, for example, normal values and values not requiring treatment differ between certain patient groups (e.g., acute respiratory distress syndrome or acute myocardial infarction).33–35 

Therefore, four additional thresholds were defined: (1) using a 95% CI from the complete dataset, (2) values outside of physiologic ranges, (3) values that would worry the treating healthcare professionals, and (4) values needing urgent treatment. Details can be found in table 1.

Table 1. Values Used for Sensitivity Analysis

For the z-value and interquartile range, in addition to the factor of 3, all factors between 2 and 3.5 were tested in steps of 0.5. For the calculation of the receiver operating characteristic curves, all factors between 0.5 and 5 were used.
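
Such a sweep amounts to rerunning one detector at each factor and recording sensitivity/specificity pairs, which trace an empirical receiver operating characteristic curve. The sketch below does this for the z-value filter on synthetic data; the data and helper name are ours, not the study's:

    import numpy as np

    def sweep_z_threshold(values, is_artifact, factors=np.arange(0.5, 5.01, 0.5)):
        # Sensitivity/specificity of the z-value filter at each factor,
        # tracing the points of an empirical ROC curve.
        values = np.asarray(values, dtype=float)
        truth = np.asarray(is_artifact, dtype=bool)
        z = np.abs((values - values.mean()) / values.std())
        for f in factors:
            flagged = z > f
            tp = np.sum(flagged & truth)
            fn = np.sum(~flagged & truth)
            tn = np.sum(~flagged & ~truth)
            fp = np.sum(flagged & ~truth)
            yield f, tp / max(tp + fn, 1), tn / max(tn + fp, 1)

    # Synthetic example: plausible heart rates plus a few obvious spikes.
    rng = np.random.default_rng(0)
    hr = np.concatenate([rng.normal(75, 5, 995), [220, 240, 15, 230, 5]])
    labels = np.concatenate([np.zeros(995, bool), np.ones(5, bool)])
    for factor, sens, spec in sweep_z_threshold(hr, labels):
        print(f"factor={factor:.1f}  sensitivity={sens:.2f}  specificity={spec:.2f}")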

Results

The study population consisted of 28,388 operating room patients and 1,262 ICU patients with available data for all five vital signs. A total of 106 patients (53 ICU and 53 operating room) were randomly selected from the study population. Demographic details can be found in table 2.

Table 2. Demographic Details

For the operating room data and ICU data, the mean (± SD) duration of observation was 2.6 ± 1.8 h and 120.0 ± 0.2 h, respectively. During that time, a total of 395,213 data points were recorded. After exclusion of the capnometry values less than 2 mmHg, 392,808 data points remained. The mean number of data points per operating room patient was 3,087.1 ± 3,117.0; in the ICU, it was 4,324.4 ± 2,880.6. Counting both annotators of each patient, the four human experts annotated a total of 11,699 data points as artifacts. For 2,891 data points, consensus was reached in this first step, leaving 5,917 data points (50.6%) for the third expert's decision. In 2,276 of those cases (38.5%), the third expert decided that the data point was an artifact, resulting in 5,167 annotated artifacts (1.3% of all data points). Splitting the data into a training set (80%) and a test set (20%) resulted in 310,085 instances without artifacts and 4,162 instances (1.33%) with artifacts in the training set. In the test set, there were 77,557 instances without artifacts and 1,005 instances with artifacts (1.28%).

The application of the artifact detection algorithms resulted in a large variation in the number of annotated artifacts. For example, the interquartile range method annotated 6,196 data points as artifacts, whereas the local outlier factor annotated only 1,189. Details can be found in table 3. Data for the long short-term memory neural network are not shown in table 3 because of the dataset split.

Table 3. Number of Annotated Artifacts

The hypothesis that the neural network shows significantly higher sensitivities than the classic methods was confirmed for the following vital signs: heart rate (ICU, 33.6%; 95% CI, 33.1 to 44.6), systolic invasive blood pressure (in both the operating room [62.2%; 95% CI, 57.5 to 71.9] and the ICU [60.7%; 95% CI, 57.3 to 71.8]), and temperature in the operating room (76.1%; 95% CI, 63.6 to 89.7). Specificity was very similar across all methods. As expected, the interquartile range and z-value performed very similarly, and the data were equally distributed. The best-performing methods are summarized in tables 4 and 5.

Table 4. Performance of the Algorithms in the Operating Room

Table 5. Performance of the Algorithms in the Intensive Care Unit

The specificity was above 90% for all detection methods and all vital signs. However, sensitivity was low for the cutoff, z-value, interquartile range, and local outlier factor methods, with only the z-value for oxygen saturation in the operating room reaching 88.9%. All other sensitivity values were less than 80%, and the local outlier factor's sensitivity never exceeded 10%. An example of the neural network's performance can be found in figure 1.

Fig. 1. Example of the neural network's performance compared to the human experts in a sample of heart rates from one patient (example signal: normalized heart rate). Mapping the minimal heart rate to –1 and the maximal heart rate to 1 simplifies the input for the neural network while preserving the relative intervals between heart rate values.

A comparison of the performance across methods revealed significant differences between vital signs, methods, and clinical locations. For example, for heart rate in the ICU, the long short-term memory neural network showed a significantly higher sensitivity of 33.6% and specificity of 99.2%, whereas the interquartile range showed 19.5% and 99.4%, the z-value performed similarly (25.3% and 99.6%), and the cutoff showed a sensitivity of only 3.8%, with a specificity of 100%. By contrast, the cutoff showed better results for invasive mean arterial pressure (MAP) in the operating room (sensitivity, 74.9%; specificity, 100%) but not in the ICU (9.3% and 100%). Details, including 95% CI, can be found in table 4, showing performance in data from the operating room, and in table 5, showing performance in data from the ICU.

To show the performance of different thresholds when using the interquartile range and z-value, a sensitivity analysis was conducted: all factor values from 0.5 to 5 were tested, as well as different cutoff values. The results showed that thresholds other than those previously described yielded better sensitivity while specificity stayed at an acceptable level. For example, the threshold level based on values that would worry the treating healthcare professionals resulted in 75% sensitivity for invasive MAP; however, specificity decreased to 92%, whereas the originally used thresholds yielded 39.8% and 100%, respectively. This trend was seen across all tested thresholds: increased sensitivity led to rapidly decreasing specificity. All calculations can be found in table 1 and in Supplemental Digital Content 1 and 4 (https://links.lww.com/ALN/D489; https://links.lww.com/ALN/D492).

All AUC values were above 0.61, with most exceeding 0.85. For example, applying the z-value to MAP resulted in an AUC of 0.88, while applying the interquartile range to carbon dioxide resulted in an AUC of 0.96. The resulting receiver operating characteristic curves, including all AUC values, can be found in Supplemental Digital Content 1 (https://links.lww.com/ALN/D489).

Discussion

In the current study, we found that artifact filtering methods performed differently depending on both the specific vital sign and the clinical context (intraoperative vs. intensive care unit). No single method was consistently superior across vital signs and clinical contexts. Compared to human experts annotating artifacts retrospectively,8,9,36  the interquartile range, z-value, and cutoff filters showed high specificity but only intermediate sensitivity; the local outlier factor had a sensitivity of less than 10%. By contrast, a specially trained long short-term memory neural network showed higher sensitivity values, while specificity remained as high as that of the other methods. Narrowing the thresholds of the cutoff filter in a sensitivity analysis also increased sensitivity; however, specificity decreased rapidly. Thoughtful selection of artifact detection methods for each clinical parameter is therefore important. For specific clinical parameters, neural networks demonstrated higher artifact filtering performance.

Artifact filtering is of the utmost importance, as it has the potential to alter scientific results.8,12  In the vast majority of anesthesiologic and intensive care publications to date, only basic methods of artifact detection in recordings of continuous vital parameters have been reported.1–3,7  Some studies have not described any detail of artifact detection at all. However, a broad variety of artifact detection algorithms and highly specialized neural networks have been published for artifact detection in retrospective data.37–39  In the current study, it was shown that different artifact detection methods perform differently on each vital sign.

Blood pressure is the focus of many perioperative and critical care research efforts, with both MAP and systolic pressure being reported.1,3,5,7  No single method performed best for invasive MAP: in the operating room, the cutoff method performed best, while in the ICU, the neural network performed best. The sensitivity analysis showed that narrowing the limits rapidly increased sensitivity, with a drop in specificity. This further emphasizes the relevance of choosing the right algorithm with the right threshold. This is especially relevant when looking at pathologic states, such as intraoperative hypotension or hypothermia: erroneously flagging vital signals as artifacts would lead to the exclusion of relevant information.

The same importance of choosing the right method for the right parameter was seen for heart rate, for which all algorithms showed a sensitivity of less than 40% with large differences between the ICU and the operating room. Although the sensitivity analysis showed a tendency toward increased sensitivity, specificity dropped rapidly. The most probable cause for the low sensitivities is rapid changes in heart rate: electrocautery in the operating room and movement or arrhythmia in the ICU. These artifacts can easily be detected when using the pulse rate from the oxygen saturation—a signal not available to the artifact detection methods.

Data on artifact filtering for vital signs other than heart rate and blood pressure are sparse. The current data demonstrate that each method performed differently for temperature, pulse oximetry, and capnometry, with regard to sensitivity and specificity. In contrast to heart rate and blood pressure, z-value and interquartile range performed well, while the neural network showed mixed results.

As with blood pressure and heart rate, choosing the right threshold—especially for cutoff filters—is an important topic. Although sensitivity was increased by changing the thresholds, specificity dropped for temperature and capnometry. Only for pulse oximetry did specificity remain stable, which can most probably be attributed to the distinctive distribution of pulse oximetry values.

Interestingly, all algorithms performed worse in the ICU than in the operating room. There are multiple potential explanations for this difference, the most important being the lower resolution of the ICU data. For example, although an artifact due to blood sampling can easily be detected in the arterial blood pressure signal at a 15-s resolution, it may disappear entirely at a 15-min resolution.

Due to the similar statistical approaches underlying the interquartile range and z-value, similar results were expected. Nevertheless, for datasets with a low-resolution vital sign, such as the noninvasive blood pressure in this study, the z-value often performed better than the interquartile range. Generally, the local outlier factor performed poorly across all vital signs, with sensitivity never exceeding 15%. One conclusion may be that the use of the local outlier factor in anesthetic data is questionable, in contrast to its performance on other datasets.40  The topic of data granularity itself could not be studied in this project, as the data resolution was evenly distributed. Further research is necessary, especially on the effects of data granularity on artifact filtering algorithms and on the characterizations of different vital signs that are most often used subsequently for statistical analysis. As the collection of high-resolution vital signs, including waveforms, becomes increasingly popular, the problem of low-granularity data should diminish—at least in perioperative data.

Limitations

One significant limitation of this study is the use of retrospective electronic health record data as the foundation for creating the human reference standard, rather than real-time point-of-care annotation. This possibly explains why the rate of artifacts marked by the reviewers (1.3%) was lower than in some previous studies, which reported rates of up to 14%.8,9,39  Using retrospective data can lead to missed artifacts. For example, incorrect leveling of a blood pressure transducer cannot be identified in retrospective data, patient movement cannot be seen, and the dislocation of a saturation sensor can only be inferred. However, as the main objective was a comparison of artifact filtering methods for use in retrospective data, we chose a retrospective approach for the human experts as well.

The neural network used has some limitations. Of importance is the fact that the tested network was trained with the human reference standard and therefore could, at best, reproduce that standard. External validation is needed to prove generalizability before the model is used in other centers. In this study, approximately 450 person-hours were needed to annotate all 106 patients, including the majority votes for data points on which two reviewers disagreed.

Another limitation is that humans, classic artifact detection algorithms, and neural networks may not be directly comparable. Although the neural network is trained on actual reference standard data, such as the human expert reviewer annotations, the classic algorithms do not have access to individually annotated data; they are based on thresholds, population distributions, or mathematical transformations. Humans, by contrast, can use much more information to filter artifacts than is available to long short-term memory neural networks. The use of electrocautery or the movement of cables is obvious to the human eye but difficult for long short-term memory neural networks to learn. Neural network training can also overfit, a risk that was reduced by using a randomly split internal test dataset. As shown in Supplemental Digital Content 1 (https://links.lww.com/ALN/D489), the neural network's calibration plots showed poor calibration for certain variables, such as pulse oximetry values and temperature, in both the ICU and the operating room, whereas they showed good performance for others, such as systolic blood pressure or heart rate, in both clinical settings. This indicates that there may be a relevant risk of decision errors in particular value ranges.

The use of human experts is a significant limitation of this study. Although all experts had experience in clinical anesthesia and intensive care, incorrect filtering was always possible. To address this concern, every data point was independently evaluated by two experts, with patient data distributed randomly to two of the four experts. For cases in which these two experts diverged, a third expert decided by majority vote. The number of data points for which this majority vote was necessary was high (50.6% of annotated data points), raising the concern that the analysis relied on an imperfect reference standard. The use of three independent experts is the maximum previously described in the literature, supporting the assumption that the reference standard used is the best achievable in this setting.8,39  Following Walsh,41  we conducted a naive analysis, accepting that results may be under- or overestimated, as other methods seemed impractical in this case.

A further limitation is the single-center data source, which has limited generalizability. The highly standardized way in which monitoring devices are produced and used is encouraging, but we cannot exclude systematic errors, such as those described previously.42,43  In addition, many commercially available electronic health records record only intraoperative physiologic data every 60 s, calling into question the validity of our findings in these datasets. Validation of the neural network used on external, completely new data is essential to establish the potential value of the network on broad patient populations.

Conclusions

The use of simple, universal physiologic artifact detection methods seems inferior to vital sign–specific artifact detection algorithms. Commonly used artifact detection methods performed very differently when tested on different vital signs and in different settings (ICU vs. operating room). A pretrained neural network may be a valid additional option for artifact filtering in retrospective data, although performance may degrade on different datasets and potential overfitting remains an important limitation.

Code Availability

Code for the algorithms used can be found in Supplemental Digital Content 2 and 3 (https://links.lww.com/ALN/D490 and https://links.lww.com/ALN/D491); the full code is available from the corresponding author upon reasonable request.

Research Support

This study was funded only by departmental funds.

Competing Interests

The authors declare no competing interests.

Supplemental Digital Content 1: ROC curves, calibration plots, Standards for Reporting Diagnostic Accuracy Studies (STARD) checklist, https://links.lww.com/ALN/D489

Supplemental Digital Content 2: Jupyter notebook of Python code, https://links.lww.com/ALN/D490

Supplemental Digital Content 3: Jupyter notebook with a simplified long short-term memory neural network's code for learning, https://links.lww.com/ALN/D491

Supplemental Digital Content 4: Supplemental table 1: Artifact detection algorithms’ performance in combined data; supplemental table 2: Sensitivity analysis of cutoff method, https://links.lww.com/ALN/D492

Supplemental Digital Content 5: Description of the human reviewers’ training, https://links.lww.com/ALN/D493

References

1. Walsh M, Devereaux PJ, Garg AX, et al.: Relationship between intraoperative mean arterial pressure and clinical outcomes after noncardiac surgery: Toward an empirical definition of hypotension. Anesthesiology 2013; 119:507–15

2. Sun LY, Wijeysundera DN, Tait GA, Beattie WS: Association of intraoperative hypotension with acute kidney injury after elective noncardiac surgery. Anesthesiology 2015; 123:515–23

3. Bijker JB, Persoon S, Peelen LM, et al.: Intraoperative hypotension and perioperative ischemic stroke after general surgery: A nested case-control study. Anesthesiology 2012; 116:658–64

4. Gregory A, Stapelfeldt WH, Khanna AK, et al.: Intraoperative hypotension is associated with adverse clinical outcomes after noncardiac surgery. Anesth Analg 2021; 132:1654–65

5. Wesselink EM, Kappen TH, Torn HM, Slooter AJC, van Klei WA: Intraoperative hypotension and the risk of postoperative adverse outcomes: A systematic review. Br J Anaesth 2018; 121:706–21

6. Johnson AEW, Pollard TJ, Shen L, et al.: MIMIC-III, a freely accessible critical care database. Sci Data 2016; 3:160035

7. Salmasi V, Maheshwari K, Yang D, et al.: Relationship between intraoperative hypotension, defined by either reduction from baseline or absolute thresholds, and acute kidney and myocardial injury after noncardiac surgery: A retrospective cohort analysis. Anesthesiology 2017; 126:47–65

8. Hoorweg A, Pasma W, van Wolfswinkel L, de Graaff JC: Incidence of artifacts and deviating values in research data obtained from an anesthesia information management system in children. Anesthesiology 2018; 128:293–304

9. Kool NP, van Waes JAR, Bijker JB, et al.: Artifacts in research data obtained from an anesthesia information and management system. Can J Anaesth 2012; 59:833–41

10. Takla G, Petre JH, Doyle DJ, Horibe M, Gopakumaran B: The problem of artifacts in patient monitor data during surgery: A clinical and methodological review. Anesth Analg 2006; 103:1196–204

11. Hoare SW, Beatty PC: Automatic artifact identification in anaesthesia patient record keeping: A comparison of techniques. Med Eng Phys 2000; 22:547–53

12. Pasma W, Peelen LM, van Buuren S, van Klei WA, de Graaff JC: Artifact processing methods influence on intraoperative hypotension quantification and outcome effect estimates. Anesthesiology 2020; 132:723–37

13. Chen L, Dubrawski A, Wang D, et al.: Using supervised machine learning to classify real alerts and artifact in online multisignal vital sign monitoring data. Crit Care Med 2016; 44:e456–63

14. Du CH, Glick D, Tung A: Error-checking intraoperative arterial line blood pressures. J Clin Monit Comput 2019; 33:407–12

15. Hravnak M, Chen L, Dubrawski A, Bose E, Clermont G, Pinsky MR: Real alerts and artifact classification in archived multi-signal vital sign monitoring data: Implications for mining big data. J Clin Monit Comput 2016; 30:875–88

16. Pao Y: Adaptive Pattern Recognition and Neural Networks. Reading, Massachusetts, Addison-Wesley Publishing Co., Inc., 1989

17. Ripley BD: Pattern Recognition and Neural Networks. Cambridge, Cambridge University Press, 1996

18. Van Rossum G, Drake FL Jr: Python Tutorial. Amsterdam, Netherlands, Centrum voor Wiskunde en Informatica, 1995

19. Bossuyt PM, Reitsma JB, Bruns DE, et al.; STARD Group: STARD 2015: An updated list of essential items for reporting diagnostic accuracy studies. BMJ 2015; 351:h5527

20. Breunig MM, Kriegel H-P, Ng RT, Sander J: LOF: Identifying density-based local outliers. SIGMOD Rec 2000; 29:93–104

21. Kingma DP, Ba J: Adam: A method for stochastic optimization. Published as a conference paper at the 3rd International Conference for Learning Representations, San Diego, 2015. doi:10.48550/arXiv.1412.6980

22. McKinney W: Data structures for statistical computing in Python, Proceedings of the 9th Python in Science Conference, Austin, Texas, June 28 to July 3, 2010. Edited by van der Walt S, Millman J. 2010, pp 56–61

23. Harris CR, Millman KJ, van der Walt SJ, et al.: Array programming with NumPy. Nature 2020; 585:357–62

24. Pedregosa F, Varoquaux G, Gramfort A, et al.: Scikit-learn: Machine learning in Python. J Mach Learn Res 2011; 12:2825–30

25. Virtanen P, Gommers R, Oliphant TE, et al.: SciPy 1.0: Fundamental algorithms for scientific computing in Python. Nat Methods 2020; 17:261–72

26. Wilson EB: Probable inference, the law of succession, and statistical inference. J Am Stat Assoc 1927; 22:209–12

27. Altman D: Statistics with Confidence: Confidence Intervals and Statistical Guidelines, 2nd edition. Edited by Machin D, Bryant T, Gardner M. London, BMJ Books, 2013

28. Cumming G, Finch S: Inference by eye: Confidence intervals and how to read pictures of data. Am Psychol 2005; 60:170–80

29. Liu W, Bretz F, Cortina-Borja M: Reference range: Which statistical intervals to use? Stat Methods Med Res 2021; 30:523–34

30. Menacer S, Claessens Y-E, Meune C, et al.: Reference range values of troponin measured by sensitive assays in elderly patients without any cardiac signs/symptoms. Clin Chim Acta 2013; 417:45–7

31. Roshan D, Ferguson J, Pedlar CR, et al.: A comparison of methods to generate adaptive reference ranges in longitudinal monitoring. PLoS One 2021; 16:e0247338

32. Brouwers S, Sudano I, Kokubo Y, Sulaica EM: Arterial hypertension. Lancet 2021; 398:249–61

33. Bos LD, Martin-Loeches I, Schultz MJ: ARDS: Challenges in patient care and frontiers in research. Eur Respir Rev 2018; 27:170107

34. Ranieri VM, Rubenfeld GD, Thompson BT, et al.; ARDS Definition Task Force: Acute respiratory distress syndrome: The Berlin definition. JAMA 2012; 307:2526–33

35. Ibanez B, James S, Agewall S, et al.; ESC Scientific Document Group: 2017 ESC guidelines for the management of acute myocardial infarction in patients presenting with ST-segment elevation: The Task Force for the management of acute myocardial infarction in patients presenting with ST-segment elevation of the European Society of Cardiology (ESC). Eur Heart J 2018; 39:119–77

36. Hravnak M, Chen L, Bose E, et al.: Artifact patterns in continuous noninvasive monitoring of patients. Intensive Care Med 2013; 39:S405

37. Simpao AF, Nelson O, Ahumada LM: Automated anesthesia artifact analysis: Can machines be trained to take out the garbage? J Clin Monit Comput 2021; 35:225–7

38. Hashimoto DA, Witkowski E, Gao L, Meireles O, Rosman G: Artificial intelligence in anesthesiology: Current techniques, clinical applications, and limitations. Anesthesiology 2020; 132:379–94

39. Pasma W, Wesselink EM, van Buuren S, de Graaff JC, van Klei WA: Artifacts annotations in anesthesia blood pressure data by man and machine. J Clin Monit Comput 2021; 35:259–67

40. He T, Zhou Q, Zou Y: Automatic detection of age-related macular degeneration based on deep learning and local outlier factor algorithm. Diagnostics (Basel) 2022; 12:532

41. Walsh T: Fuzzy gold standards: Approaches to handling an imperfect reference standard. J Dent 2018; 74:S47–9

42. Fawzy A, Wu TD, Wang K, et al.: Racial and ethnic discrepancy in pulse oximetry and delayed identification of treatment eligibility among patients with COVID-19. JAMA Intern Med 2022; 182:730–8

43. Bothe TL, Bilo G, Parati G, Haberl R, Pilz N, Patzak A: Impact of oscillometric measurement artefacts in ambulatory blood pressure monitoring on estimates of average blood pressure and of its variability: A pilot study. J Hypertens 2023; 41:140–9
This is an open access article distributed under the Creative Commons Attribution License 4.0 (CCBY), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.