BLOG: Machine learning without data sharing via federation
Victoria Catterson, Vice President of Data Science Research
BioSymetrics was delighted to be part of a Digital Global Innovation Cluster research project exploring federated learning on clinical data. Along with our partners DNAstack and Integrate.ai, we validated the use of Canadian-made technology for federated machine learning (ML) on data from patients with Parkinson’s Disease (PD). As a result, our model can make personalized predictions about the trajectory of the disease over the first two years post-diagnosis.
Why federated learning?
A perennial challenge for ML on patient clinical data is the difficulty of assembling enough data to draw conclusions with desired statistical power. One mitigation is to combine data from multiple hospitals, but the data sharing arrangements can take longer to put in place than the duration of the project itself.
Federated learning is a method of ML training where the data never leaves its original site. Instead, a centralized orchestrator manages local task runners located where the data lives. The local task runner trains on the slice of the data that it can see (such as a single hospital’s patients), then sends model weights to the orchestrator for integration with similar results from all other task runners (hospitals) in the network. The orchestrator then combines all local weights into a global set of weights, which get sent back to the local runners for the next round of training.
In this way, federated learning mirrors the process of batch training, with weights being repeatedly collected, combined, and distributed for as many rounds as it takes for the model to converge.
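To make the collect, combine, and distribute loop concrete, here is a minimal sketch in plain Python. It is an illustration only, not the Integrate.ai platform or the model used in this project. It assumes a toy linear model, synthetic data for two hypothetical sites, and a sample-size-weighted average as the combining rule (commonly called federated averaging); the function names `local_train`, `aggregate`, and `federated_fit` are made up for this example.

```python
import random

def local_train(weights, data, lr=0.1, epochs=5):
    """One site's task runner: a few gradient-descent steps on y ~ w*x + b.

    Only this function ever touches the site's raw data.
    """
    w, b = weights
    n = len(data)
    for _ in range(epochs):
        grad_w = sum(2 * (w * x + b - y) * x for x, y in data) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in data) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return (w, b)

def aggregate(site_weights, site_sizes):
    """Orchestrator: combine local weights, weighted by each site's sample count.

    Assumption: weighted averaging is one common choice; the source does not
    say which combining rule was used.
    """
    total = sum(site_sizes)
    w = sum(sw[0] * n for sw, n in zip(site_weights, site_sizes)) / total
    b = sum(sw[1] * n for sw, n in zip(site_weights, site_sizes)) / total
    return (w, b)

def federated_fit(sites, rounds=50):
    """Repeat: send global weights out, train locally, collect, and combine."""
    global_weights = (0.0, 0.0)
    for _ in range(rounds):
        local = [local_train(global_weights, data) for data in sites]
        global_weights = aggregate(local, [len(d) for d in sites])
    return global_weights

def make_site(n, seed):
    """Synthetic stand-in for one hospital's data, drawn from y = 3x + 1 plus noise."""
    rng = random.Random(seed)
    xs = [rng.uniform(-1, 1) for _ in range(n)]
    return [(x, 3.0 * x + 1.0 + rng.gauss(0, 0.1)) for x in xs]

sites = [make_site(40, seed=1), make_site(60, seed=2)]
w, b = federated_fit(sites)
print(f"global model: y = {w:.2f}x + {b:.2f}")
```

Note that the orchestrator functions only ever receive weights and sample counts. The lists of patient-level records stay inside `local_train`, which is the property that makes federation attractive for clinical data.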
Why Parkinson’s Disease?
Parkinson’s Disease (PD) is a complex neurodegenerative disorder, exhibiting substantial heterogeneity in symptom presentation, disease progression, and therapeutic response. Current treatment protocols apply a “one-size-fits-all” approach that does not account for this variability, potentially leading to suboptimal outcomes for many patients.
We applied our unique ContingentAI™ phenoclustering approach to segment the PD population into seven phenotypically similar subgroups (Figure 1). Each group is associated with different outcomes after 2 years, including speed of progression of motor and non-motor symptoms, and level of impact on daily life. We believe that certain subgroups may benefit from tailored therapeutic approaches, rather than considering the PD population as a homogeneous group.
Figure 1: UMAP representation of the seven PD clusters
This methodology proved effective in our prior work on COVID-19-related Acute Respiratory Distress Syndrome (Cheyne et al., 2023).
What did we find?
We used Integrate.ai’s federated ML platform to build a cluster prediction model on two sets of data within DNAstack’s Publisher framework, representing two separate hospitals’ patients, with a third set held out as a validation test set. While one site’s partial model consistently performed better than the other (Figure 2, blue and orange lines respectively), the average performance of the combined model tended to improve with each round of training (Figure 2, black line).
Figure 2: Performance metrics for the federated ML model
The result is a model that can be applied to new patients, identifying which cluster they are most likely to belong to. Each cluster is associated with different outcomes at the two-year mark, quantifying the speed of progression of motor and non-motor symptoms. This could be used as a tool for precision medicine, to help predict what to expect after diagnosis.
Follow us on LinkedIn to hear more about our developments and breakthroughs as they come.