BLOG: Preprocessing for accurate downstream machine learning with ContingentAI™
Victoria Catterson, Vice President of Data Science Research
BioSymetrics’ ContingentAI™ optimizes the full machine learning (ML) workflow, from feature engineering, imputation, and selection through to modelling and analysis. The typical approach to ML tunes a model’s hyperparameters to find the best performance; instead, we tune every decision point in the data pipeline to optimize performance even further. This means building thousands of models, one for every permutation of the preprocessing and modelling parameters, to ensure we find the best model for a given task.
A recent project highlighted the benefit of this method for downstream model performance. We used ContingentAI™ to identify phenoclusters in Parkinson’s Disease (PD): different subgroups of patients who may respond better to a tailored therapeutic approach. The full phenoclustering workflow includes 5 steps:
Feature engineering
Scaling and imputation
Feature selection
Clustering
Metric calculation
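For readers who prefer code, the sketch below shows how steps 2 through 5 might chain together using off-the-shelf scikit-learn components. The specific components (median imputation, PCA, k-means, silhouette score) and all parameter values are illustrative assumptions, not the actual ContingentAI™ configuration, which searches over many alternatives at each step.

```python
# Illustrative sketch of steps 2-5 of the phenoclustering workflow.
# Component choices and parameters are assumptions for demonstration only.
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

preprocess = Pipeline([
    ("impute", SimpleImputer(strategy="median")),  # step 2: scaling and imputation
    ("scale", StandardScaler()),
    ("reduce", PCA(n_components=0.9)),             # step 3: feature selection / reduction (stand-in)
])

# Step 1 (feature engineering) happens upstream; here placeholder data stands in
# for the engineered patient feature matrix.
X_engineered = np.random.default_rng(0).normal(size=(200, 30))

X_reduced = preprocess.fit_transform(X_engineered)
clusters = KMeans(n_clusters=7, random_state=0).fit_predict(X_reduced)   # step 4: clustering
print("silhouette:", silhouette_score(X_reduced, clusters))              # step 5: metric calculation
```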
This article focuses on the feature engineering of patient comorbidities, and how choices made in step 1 of the workflow impacted model performance at step 5. By thinking carefully about the meaning of raw data, we were able to improve downstream model performance by engineering features more representative of the clinical reality for patients.
Comorbidities of Parkinson’s Disease
Parkinson’s Disease (PD) is a complex neurodegenerative disorder, exhibiting substantial heterogeneity in symptom presentation, disease progression, and therapeutic response. Current treatment protocols apply a “one-size-fits-all” approach that does not account for this variability, focusing almost exclusively on the typical motor symptoms of PD, such as tremor and dyskinesia.
Non-motor symptoms of PD often have an even greater impact on daily life. These non-motor effects, such as depression, sleep disorders, and incontinence, are typically treated as separate comorbidities, despite being well-known effects of dopaminergic neuron loss and nervous system degeneration. This means that patients are prescribed the therapies used for standalone, non-PD cases, or are even left to self-manage these symptoms, which may be a suboptimal treatment path.
The Parkinson’s Progression Markers Initiative (PPMI) is a dataset containing clinical data for patients with PD. Comorbidities are recorded in three main ways: the reason given for a medication prescription, recorded contact with a specific hospital department, and an intake questionnaire covering common symptoms and comorbidities of PD. These sources only partially overlap: many patients report a symptom but take no medication for it, were prescribed medication but do not report the symptom (perhaps because it is managed by the medication), or only have a record of contact with the relevant hospital department. All three aspects of the data shed light on the same core question: does a patient have a given non-motor symptom or not?
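As a rough illustration of what these three sources look like in practice, the snippet below merges them into a single patient-by-indicator matrix of 0/1 flags. Apart from FEATDEPRES, which appears later in this article, the table layouts and column names are hypothetical stand-ins rather than the actual PPMI field names.

```python
# Hypothetical example: merge medication reasons, hospital department contacts,
# and questionnaire answers into one binary patient-by-indicator matrix.
# Column names (other than FEATDEPRES) are invented for illustration.
import pandas as pd

meds = pd.DataFrame({"PATNO": [1, 2], "CONMED_REASON": ["depression", "hypertension"]})
hosp = pd.DataFrame({"PATNO": [2], "HOSP_DEPT": ["psychiatry"]})
quest = pd.DataFrame({"PATNO": [1, 2], "FEATDEPRES": [1, 0], "FEATSLEEP": [0, 1]})

# One-hot encode medication reasons and department contacts, one row per patient.
med_flags = pd.get_dummies(meds.set_index("PATNO")["CONMED_REASON"], prefix="MED").groupby("PATNO").max()
hosp_flags = pd.get_dummies(hosp.set_index("PATNO")["HOSP_DEPT"], prefix="DEPT").groupby("PATNO").max()

# Join with the questionnaire flags; a missing record means "no indicator", i.e. 0.
indicators = (
    quest.set_index("PATNO")
    .join(med_flags, how="left")
    .join(hosp_flags, how="left")
    .fillna(0)
    .astype(int)
)
print(indicators)
```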
Preprocessing of comorbidities
Principal Component Analysis (PCA) identifies underlying signal derived from linear combinations of multiple noisy features. This is ideal for our case: we want to understand whether a patient has depression, say, from multiple indicators such as whether they were on medications for mood disorders, whether they had hospital contact with psychiatry, and whether they report symptoms of depression in their intake questionnaire.
When we performed PCA on the combined comorbidity indicators, we saw that comorbidity component 3 does exactly this: the features with the highest positive loadings are medication for depression, contact with psychiatry, and reported symptoms of depression (FEATDEPRES) (Figure 1 below).
Figure 1: Comorbidity component 3 indicates likelihood of depression.
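A minimal sketch of this step: fit PCA on the patient-by-indicator comorbidity matrix and inspect each component’s loadings to see which raw indicators it combines. The data below is a random placeholder, so its loadings are meaningless; on the real PPMI indicators, component 3’s strongest positive loadings were the depression-related features shown in Figure 1.

```python
# Sketch: PCA over binary comorbidity indicators, then inspect component loadings.
# The indicator matrix here is random placeholder data; real loadings require PPMI.
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
columns = ["MED_depression", "DEPT_psychiatry", "FEATDEPRES", "MED_hypertension", "FEATSLEEP"]
indicators = pd.DataFrame(rng.integers(0, 2, size=(300, len(columns))), columns=columns)

X = StandardScaler().fit_transform(indicators)
pca = PCA(n_components=4, random_state=0)
components = pca.fit_transform(X)   # patients x comorbidity components, used downstream

# Loadings: how strongly each raw indicator contributes to each component.
loadings = pd.DataFrame(
    pca.components_,
    columns=indicators.columns,
    index=[f"comorbidity_component_{i}" for i in range(pca.n_components_)],
)
print(loadings.round(2))
```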
However, the disadvantage of PCA is that it adds a layer of indirection between the model and the data. When examining feature importance, the final model places a certain weight on comorbidity component 3, which in turn consists of loadings on depression medications and psychiatric involvement. We must unpick both levels before we can say how much impact medication for depression has on the final result.
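To make the two levels concrete: for a linear model trained on the components, the effective weight of a raw indicator is the model’s coefficient on each component multiplied by that component’s loading on the indicator, summed over the components. The sketch below continues the previous one (reusing `indicators`, `pca`, and `components`), with random placeholder cluster labels standing in for the real targets.

```python
# Sketch: unpick per-indicator influence when a linear model is trained on PCA
# components. Continues the previous sketch; cluster labels are placeholders.
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

y = np.random.default_rng(1).integers(0, 7, size=len(indicators))  # placeholder labels
clf = LogisticRegression(max_iter=1000).fit(components, y)

# Effective weight of raw indicator j for cluster k (on the standardized scale):
#   sum over components i of clf.coef_[k, i] * pca.components_[i, j]
effective_weights = pd.DataFrame(
    clf.coef_ @ pca.components_,
    columns=indicators.columns,
    index=[f"cluster_{k}" for k in clf.classes_],
)
print(effective_weights.round(2))
```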
Given this added complexity, it is fair to ask whether engineering comorbidity features is truly necessary, or whether we can achieve an equally good clustering of patients with the raw features.
The experiment
We developed PD phenoclusters as described in our previous article. This assigned each patient a cluster number from 0 to 6, indicating their likely disease trajectory over the first two years post-diagnosis. Then, to study how important individual features are in distinguishing one cluster from another, we built supervised classifiers to predict cluster number for each patient.
We were very happy with the model performance (Figure 2 below). As in all projects, we applied ContingentAI™ to explore the preprocessing and modelling space and ensure we found the best performing model. In this case, we used the engineered comorbidity features along with other clinical and demographic attributes as the starting feature vector. We then applied PCA to the full feature vector, varying the explained variance threshold across 0.3, 0.5, 0.7, 0.9, and 1.0 (i.e., the raw data), before testing Logistic Regression (LogReg) and XGBoost models with five-fold cross validation. We assessed performance using the area under the receiver operating characteristic curve (ROC AUC); the LogReg model with PCA at 0.5 explained variance achieved the highest median ROC AUC.
Figure 2: ROC AUC of cluster prediction models with engineered comorbidity features.
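The sweep itself can be written as a simple grid over PCA explained-variance thresholds and model families, scored by five-fold cross-validated ROC AUC. The sketch below uses synthetic placeholder data and covers only this one preprocessing decision point; the actual ContingentAI™ search explores many more, so the numbers it prints will not match Figure 2.

```python
# Sketch of the sweep: PCA explained-variance thresholds x model families,
# scored by 5-fold cross-validated one-vs-rest ROC AUC.
# X and y are synthetic placeholders; xgboost is assumed to be installed.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from xgboost import XGBClassifier

X, y = make_classification(n_samples=400, n_features=40, n_informative=15,
                           n_classes=7, n_clusters_per_class=1, random_state=0)

models = {
    "LogReg": LogisticRegression(max_iter=2000),
    "XGBoost": XGBClassifier(eval_metric="mlogloss"),
}

for variance in [0.3, 0.5, 0.7, 0.9, None]:   # None = raw features (no PCA)
    for name, model in models.items():
        steps = [("scale", StandardScaler())]
        if variance is not None:
            steps.append(("pca", PCA(n_components=variance)))
        steps.append(("model", model))
        scores = cross_val_score(Pipeline(steps), X, y, cv=5, scoring="roc_auc_ovr")
        label = "raw" if variance is None else f"PCA {variance}"
        print(f"{name:8s} {label:8s} median ROC AUC = {np.median(scores):.3f}")
```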
Next, we repeated the cluster prediction workflow, but with the raw comorbidity features (reported symptoms and medications) in place of the engineered comorbidities, plus a couple of additional PCA explained variance thresholds. The ROC AUCs are shown in Figure 3 below. Note the difference in the y-axis scale compared to Figure 2! While all LogReg models in Figure 2 have a median AUC above 0.99, none of the models in Figure 3 reach this level. Interestingly, the best performing LogReg and XGBoost models in Figure 3 are those without PCA, in direct contrast to Figure 2.
Conclusion
Our key takeaway from this experiment is that engineering the comorbidities genuinely enhanced the performance of the downstream model, boosting the best median AUC from 0.975 to over 0.995. Peak performance matters, but just as importantly, models using the engineered comorbidities reached good performance far more consistently, regardless of the other pipeline parameters.
It can be difficult to assess the benefit of intermediate steps in the data analysis pipeline until you see the final model’s performance. ContingentAI™ is our way of ensuring we find the best solution for any given ML task.
Follow us on LinkedIn to hear more about ContingentAI™ and our other projects.