Improving Mechanism of Action prediction by increasing chemical white space
By Victoria Catterson, Vice President of Data Science Research – May 9, 2025
Identifying Mechanism of Action (MOA) following phenotypic screening remains a major bottleneck in drug discovery. Our Elion platform uses proprietary machine learning (ML) models to predict MOA, and while these models perform well, we’re continually looking for ways to improve them. To this end, we increased our in silico compound library from 1M to 557M compounds, without adding new target annotations, and improved our MOA predictions on three key metrics:
More of our validation screens returned the correct target
The correct target was ranked higher than before in the majority of screens
The correct target was listed in the top 3 in one third of our validation screens.
This post explains what we did and why.
Overview of MOA prediction
MOA prediction is a critical part of the drug discovery process, as new therapies are very unlikely to be approved without evidence of the mechanism. Unlike target-based screens, which start with a known target and test for compounds that display binding activity, phenotypic screens start by testing for compounds that resolve a specific phenotype, often with the additional benefit of concurrently testing for toxicity and off-target effects. This has the potential to leapfrog years of sequential testing required by a target-based screen, but it requires more effort to determine the mechanism of a successful drug candidate.
Our platform uses phenotypic screen data, comprising compounds in SMILES format and a binary indicator of which compounds resolved the phenotype, to build a machine learning model that predicts this activity for any compound. The model is then used to virtually screen our in silico compound library, ranking the chances of each compound resolving the phenotype. Some of those compounds are annotated with known targets and mechanisms, collected from public data sources such as the Therapeutic Target Database and the Drug Repurposing Hub. The platform quantifies the likelihood of each target being the true target of the hit compounds in the phenotypic screen, based on the distribution of known compounds with that target annotation throughout the ranked in silico compound space.
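To make that flow concrete, here is a minimal sketch of the virtual-screening step. It assumes Morgan fingerprints and a random forest as stand-ins for our proprietary featurization and models, and uses placeholder compound lists; in practice the ~557M-compound library would be scored in streamed batches rather than held in memory.

```python
# Minimal sketch of phenotypic-screen modeling plus virtual screening.
# Fingerprints and model choice are illustrative stand-ins, not the
# platform's actual (proprietary) featurization or architecture.
import numpy as np
from rdkit import Chem
from rdkit.Chem import AllChem
from sklearn.ensemble import RandomForestClassifier

def featurize(smiles_list):
    """Convert SMILES strings to 2048-bit Morgan fingerprints."""
    fps = []
    for smi in smiles_list:
        mol = Chem.MolFromSmiles(smi)
        if mol is None:
            continue  # skip unparseable SMILES
        fp = AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=2048)
        fps.append(np.array(fp))
    return np.array(fps)

# Phenotypic screen data: SMILES plus a binary resolved/not-resolved label.
screen_smiles = ["CCO", "c1ccccc1O", "CC(=O)Oc1ccccc1C(=O)O"]  # placeholder
screen_hits = np.array([0, 1, 1])                              # placeholder

model = RandomForestClassifier(n_estimators=500, random_state=0)
model.fit(featurize(screen_smiles), screen_hits)

# Score the in silico library and rank every compound by its predicted
# chance of resolving the phenotype.
library_smiles = ["CCN", "CCCO", "c1ccncc1"]  # stand-in for the full library
scores = model.predict_proba(featurize(library_smiles))[:, 1]
ranking = np.argsort(-scores)  # best-ranked compounds first
```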
New data
Our previous in silico library was curated from various sources, including the ChemBridge library of ~1M screening compounds. Of course, many of these compounds are un-annotated, as they are not currently drugs with known mechanism or target.
Prior internal R&D showed that when we increased the compound space covered by the in silico library – even by adding completely un-annotated compounds – the performance of our mechanism prediction algorithm improved. This is because the ranking and distribution of annotated compounds becomes clearer against a larger background: the difference between sitting in the top 1% of compounds likely to be a hit, versus the top 5%, is easier to resolve among 1M compounds than among 100,000 (the toy simulation after this list makes the effect concrete). For this reason, adding more un-annotated “chemical white space” into our in silico library has two advantages:
It improves our target prediction metrics (on our known validation cases), and
It has the potential to find novel compounds that are not as widely studied as those in a drug-like library such as ChemBridge’s.
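The toy simulation below illustrates the ranking effect. Uniform random scores are a stand-in for model outputs: the percentile rank of a fixed, well-scoring annotated compound is estimated with noticeably less wobble against a 1M-compound background than against a 100,000-compound one.

```python
# Toy simulation: a larger un-annotated background sharpens percentile
# ranks. Scores here are uniform random stand-ins for model outputs.
import numpy as np

rng = np.random.default_rng(0)

def percentile_rank(score, background):
    """Fraction of background compounds scored below `score`."""
    return (background < score).mean()

annotated_score = 0.95  # a strongly-ranked annotated compound

for n in (100_000, 1_000_000):
    # Repeat the experiment to measure how much the estimated
    # percentile wobbles at each background size.
    estimates = [
        percentile_rank(annotated_score, rng.random(n)) for _ in range(20)
    ]
    print(f"n={n:>9,}: percentile = {np.mean(estimates):.4f} "
          f"+/- {np.std(estimates):.5f}")
```

The spread of the estimates shrinks roughly threefold when the background grows tenfold, which is exactly the extra resolution that lets the platform distinguish a top-1% annotated compound from a top-5% one.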
As a result, we added the ZINC-20 dataset to our existing curated in silico library. When converted into canonical SMILES format, this dataset contains ~557M compounds, substantially increasing our compound space from its prior ~1M! Notably, this did not add any new annotations to our library; it simply filled in more compound space between our annotated compounds.
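For readers curious about the canonicalization step, the sketch below shows one way to canonicalize and deduplicate incoming SMILES with RDKit. The file paths are hypothetical, and at ZINC scale the deduplication would be done out of core rather than with an in-memory set.

```python
# One way to canonicalize and deduplicate incoming SMILES with RDKit.
# Paths are hypothetical; this is an illustration, not our pipeline.
from rdkit import Chem

def canonicalize(smiles):
    """Return RDKit's canonical SMILES, or None if parsing fails."""
    mol = Chem.MolFromSmiles(smiles)
    return Chem.MolToSmiles(mol) if mol is not None else None

seen = set()
with open("zinc20.smi") as f, open("library.smi", "w") as out:
    for line in f:
        if not line.strip():
            continue
        smi = line.split()[0]  # .smi convention: SMILES first, then an ID
        can = canonicalize(smi)
        if can is not None and can not in seen:
            seen.add(can)
            out.write(can + "\n")
```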
Results
We reran our suite of validation screens through Elion with the expanded library of ~557M compounds. There were no changes to the way models were built, features were selected, or annotations were analyzed: the only change was that the ranking of each annotated compound’s likelihood of resolving the phenotype was now calculated out of ~557M compounds instead of the previous ~1M.
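As a rough illustration of where library size enters the calculation, the sketch below scores each candidate target by testing whether its annotated compounds sit unusually high in the ranked library, using a Mann-Whitney U test. This is a plausible stand-in, not our platform’s actual statistic, and the target names and indices are made up.

```python
# Hedged stand-in for the target-scoring step: test whether compounds
# annotated with a given target rank unusually high in the library.
import numpy as np
from scipy.stats import mannwhitneyu

def target_enrichment(scores, target_annotations):
    """Return {target: p-value} for annotated compounds ranking high.

    scores: model scores for every library compound (~557M in production).
    target_annotations: {target: indices of compounds annotated with it}.
    """
    results = {}
    for target, idx in target_annotations.items():
        mask = np.zeros(len(scores), dtype=bool)
        mask[idx] = True
        stat, p = mannwhitneyu(scores[mask], scores[~mask],
                               alternative="greater")
        results[target] = p
    return results

scores = np.random.default_rng(1).random(1000)       # stand-in scores
annotations = {"EGFR": [1, 5, 42], "BRAF": [7, 99]}  # hypothetical
print(target_enrichment(scores, annotations))        # rank targets by p-value
```

The larger the un-annotated background, the more finely each annotated compound’s rank is resolved, and the more sensitive a test like this becomes.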
Many of our metrics improved:
More of our validation screens returned the correct target (7 of 9, as compared to 5 before)
The correct target was ranked higher than before in the majority of screens (5 of the 7 correct)
The correct target was listed in the top 3 in one third of our validation screens.
In practice, this is a huge improvement gained simply by adding un-annotated compounds. Generally, the next step after predicting targets is to perform binding assays for confirmation, so a true target in the top 3 predictions means the confirmatory assay will be performed in the first round of testing. This translates into a further shortening of the drug discovery timeline, with lower costs as we test fewer false positives.
In short, adding un-annotated compounds to our in silico library improves prediction accuracy, which shortens timelines and reduces the costs associated with confirming the correct target.
Future directions
This is just the first of multiple R&D paths we are exploring as we seek continuous improvement of our platform. Another exciting strand of research in development is our Elion compound foundation model. After evaluating off-the-shelf models, we identified key ways to improve the abstraction of compound space learned by a transformer model, which we believe translates into better mapping of similarities between compounds and, ultimately, the generation of better, more realistic, synthesizable compounds. We have a number of ideas for how to integrate this foundation model into our platform, one of which is adoption into our MOA prediction capabilities. Stay tuned for future updates!