Electronic Medical Records (EMRs) contain many missing values, which poses difficulties for data scientists who want to build models from this data. In a previous post, we discussed the different feature engineering methods available for diagnosis codes, medication data and clinical notes in EMRs. In this post, we highlight the challenges that missing values pose when modelling EMR time-series data and discuss some techniques to address them.
EMR data, especially laboratory measurements and vital signs, often contain missing values for various reasons such as time and cost constraints (Chen et al., 2019). While hospital systems are capable of capturing the entirety of data measurements, some patient data are still found missing from databases (Adibuzzaman et al., 2018). Another contributor to this sparsity is that doctors often write important diagnoses in a free-text format that is not converted into a machine-readable format. In the clinical time-series data from the PhysioNet 2019 challenge, many laboratory measurements are missing at any given hour of a patient’s hospital admission (Figure 1). These shortcomings make it harder for algorithms to capture patterns in medical data sets.
Figure 1: Proportion of measured values in each feature in training data from the PhysioNet 2019 challenge. The majority of lab measurements had only a small proportion of recorded values.
Machine learning algorithms typically cannot accommodate incomplete data, so we must pre-process the data before modelling. One approach is to discard all examples containing missing values; however, this can remove a significant portion of the training data and is generally not desirable. A more common approach is to apply data imputation. Column-based summary statistics such as the mean and median, calculated from the other observed values, can be used for imputation. A downside to column-based imputation methods is that they only take into account values from the same feature. Values from other columns can help determine a more accurate estimate by exploiting correlations between features. For example, in detecting acute kidney failure, one might expect associated changes in creatinine levels and urine output. Some examples of imputation algorithms that can exploit relationships between features include k-nearest neighbours (Troyanskaya et al., 2001), random forests (Tang and Ishwaran, 2017), and Multiple Imputation by Chained Equations (MICE) (van Buuren, 2007).
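The imputation methods above can be sketched with scikit-learn. The toy matrix and feature layout below are illustrative, not the PhysioNet schema; `IterativeImputer` is scikit-learn's MICE-style imputer and is still marked experimental, hence the extra enabling import.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import SimpleImputer, KNNImputer, IterativeImputer

# Toy matrix: rows are patient-hours, columns are lab features with NaNs
# standing in for unmeasured values.
X = np.array([
    [1.0, 7.2, np.nan],
    [1.1, np.nan, 40.0],
    [np.nan, 7.4, 42.0],
    [0.9, 7.3, 41.0],
])

# Column-based imputation: each NaN is replaced by its column's median,
# ignoring all other features.
median_imputed = SimpleImputer(strategy="median").fit_transform(X)

# KNN imputation: each NaN is estimated from the k most similar rows,
# exploiting correlations across features.
knn_imputed = KNNImputer(n_neighbors=2).fit_transform(X)

# MICE-style iterative imputation: each feature with missing values is
# modelled as a function of the other features, cycling until convergence.
mice_imputed = IterativeImputer(random_state=0).fit_transform(X)
```

In practice the imputer should be fit on the training split only and then applied to held-out data, to avoid leaking information across the split.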
One assumption often made when using these imputation methods is that the data is missing completely at random (MCAR), meaning each data point has an equal probability of being missing. In the case of the PhysioNet dataset, where longitudinal trajectories are provided for each patient, data may instead be missing not at random because measurements are taken on different schedules and at different frequencies. As illustrated in Figure 2, some measurements are recorded every hour while others are far more infrequent. Hence, the “missingness” in such data is not always random; rather, measurements are simply taken at different time intervals. This can make standard imputation methods less effective and more susceptible to bias.
Figure 2: Frequency of time intervals between subsequent measurements of Creatinine, Lactate and Alkalinephos. We can see that the features are measured at different time intervals: Lactate measurements are typically taken 1 hour apart, whereas Creatinine and Alkalinephos are also commonly taken 24 hours apart. Data taken from the PhysioNet 2019 challenge.
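The interval analysis behind Figure 2 can be sketched in a few lines of pandas: record the hours at which a feature was actually observed, then take successive differences. The toy hourly series below is illustrative.

```python
import numpy as np
import pandas as pd

# Toy hourly record for one patient; NaN marks hours with no lactate draw.
lactate = pd.Series(
    [2.1, np.nan, 2.0, np.nan, np.nan, 1.8],
    index=pd.RangeIndex(6, name="hour"),
)

# Hours at which a measurement was actually recorded.
observed_hours = lactate.dropna().index.to_series()

# Time intervals between subsequent measurements
# (the first difference is NaN, so drop it).
intervals = observed_hours.diff().dropna()
```

Aggregating these intervals across patients and plotting their frequencies per feature reproduces the kind of histogram shown in Figure 2.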
In the context of longitudinal time series data, we can consider alternative techniques to deal with missing values:
- We can augment the dataset with auxiliary variables that represent missingness. For example, binary masking features can be created to indicate whether a value was observed or imputed. Che et al. (2018) introduced features capturing the time interval between measurements in addition to such masking features, so the model learns from the original input as well as a temporal representation of the missing values.
- We can impute values by carrying forward the last measurement, which assumes that the value remained steady in between. However, if the last measurement is noisy, the carried-forward values propagate that noise through the dataset.
- We can use an autoregressive integrated moving average (ARIMA) model to impute missing values based on past values (Kohn and Ansley, 1986). This technique is commonly used in time series forecasting models.
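The first two techniques above can be sketched with pandas. The column name and toy hourly series for a single patient are illustrative; the time-delta feature follows the idea in Che et al. (2018).

```python
import numpy as np
import pandas as pd

# Toy hourly record for one patient; NaN marks hours with no measurement.
df = pd.DataFrame({
    "hour": [0, 1, 2, 3, 4, 5],
    "creatinine": [1.2, np.nan, np.nan, 1.5, np.nan, np.nan],
})

# 1) Binary mask: 1 where a value was actually measured, 0 where missing.
df["creatinine_mask"] = df["creatinine"].notna().astype(int)

# 2) Time since the last observed measurement, as an auxiliary feature:
#    carry forward the hour of the most recent observation, then subtract.
last_obs_hour = df["hour"].where(df["creatinine"].notna()).ffill()
df["creatinine_delta"] = df["hour"] - last_obs_hour

# 3) Last-observation-carried-forward imputation.
df["creatinine_locf"] = df["creatinine"].ffill()
```

The mask and delta columns are fed to the model alongside the imputed values, letting it distinguish genuine measurements from carried-forward ones.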
Handling missing values is an important consideration when applying machine learning to time series data. This is especially pertinent to clinical data, in which a patient’s physiological profile is often incomplete at any given time. The accuracy and runtime of the imputation methods are further considerations. While we discussed several imputation methods and their trade-offs, no method for handling missing values is universally applicable. We recommend investigating multiple approaches to suit the context of the specific learning problem. While we were not able to cover every strategy in depth, we hope to inspire a greater appreciation of the challenges faced when working with EMR data.
Adibuzzaman, M., DeLaurentis, P., Hill, J., Benneyworth, B.D., 2018. Big data in healthcare – the promises, challenges and opportunities from a research perspective: A case study with a model database. AMIA Annu Symp Proc 2017, 384–392.
Che, Z., Purushotham, S., Cho, K., Sontag, D., Liu, Y., 2018. Recurrent Neural Networks for Multivariate Time Series with Missing Values. Sci Rep 8, 6085.
Chen, D., Liu, S., Kingsbury, P., Sohn, S., Storlie, C.B., Habermann, E.B., Naessens, J.M., Larson, D.W., Liu, H., 2019. Deep learning and alternative learning strategies for retrospective real-world clinical data. NPJ Digit Med 2.
Kohn, R., Ansley, C.F., 1986. Estimation, Prediction, and Interpolation for ARIMA Models with Missing Data. Journal of the American Statistical Association 81, 751–761.
Tang, F., Ishwaran, H., 2017. Random Forest Missing Data Algorithms. Stat Anal Data Min 10, 363–377.
Troyanskaya, O., Cantor, M., Sherlock, G., Brown, P., Hastie, T., Tibshirani, R., Botstein, D., Altman, R.B., 2001. Missing value estimation methods for DNA microarrays. Bioinformatics 17, 520–525.
van Buuren, S., 2007. Multiple imputation of discrete and continuous data by fully conditional specification. Stat Methods Med Res 16, 219–242.