CASE STUDY: Machine learning for activity prediction, as part of lead compound generation
The Challenge: Iterating quickly over multiple large feature sets, while keeping the flexibility to test models at scale, is a challenge for any data scientist.
The Solution: Using Augusta, we compared nine fingerprint libraries (FPLs) and a set of chemical properties head-to-head for predicting inhibition of a protease associated with Crimean-Congo Hemorrhagic Fever (CCHF).
It is well known that the choice of FPL is critical for machine learning model accuracy. New results show that the number of motifs included from a library can be just as important, and that more is not always better.
The study identified clear differences in the performance of models built with each FPL (Figure 1), with the chemical properties, MACCS FP, and GpiDAPH3 FP showing the best precision at 10–20% recall.
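For readers less familiar with the metric, precision at a fixed recall can be sketched as follows. This is a minimal illustration, not the Augusta implementation: the function name `precision_at_recall` and the toy scores are assumptions made for this example, and tied scores are not given any special handling.

```python
import math


def precision_at_recall(scores, labels, target_recall):
    """Precision of the shortest top-ranked list that reaches target_recall.

    scores: model scores, higher means "more likely active".
    labels: 1 for active compounds, 0 for inactive ones.
    target_recall: fraction of all actives to recover, in (0, 1].
    """
    total_actives = sum(labels)
    if total_actives == 0:
        raise ValueError("at least one active compound is required")
    # Number of actives that must be recovered to reach the target recall.
    needed = max(1, math.ceil(target_recall * total_actives))
    # Rank compounds from highest to lowest score.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = 0
    for n_selected, (_, label) in enumerate(ranked, start=1):
        hits += label
        if hits >= needed:
            # Fraction of the selected compounds that are truly active.
            return hits / n_selected


# Toy data for illustration only (not from the study).
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.5]
toy_labels = [0, 1, 1, 0, 1]
toy_precision = precision_at_recall(toy_scores, toy_labels, 0.5)
```

A low recall target such as 10% focuses the comparison on the very top of the ranked list, which is where candidate compounds would typically be picked for follow-up.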
Data scientists also studied the effect of limiting the number of motifs taken from a given FPL. Using mutual information to rank motifs, they found surprising variation in model performance. For MACCS, the more keys included in the model, the higher the precision at 10% recall. However, contrary to expectations, 1,500 motifs from the TAT library outperformed 3,000, and 3,000 piDAPH4 motifs outperformed both 7,500 and 10,000.
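The ranking step described above can be sketched in a few lines. This is an illustrative example under stated assumptions, not the study's code: it assumes binary fingerprints (one 0/1 bit per motif) and binary activity labels, and the names `mutual_information` and `top_k_motifs` are hypothetical.

```python
import math
from collections import Counter


def mutual_information(feature, labels):
    """Mutual information, in bits, between two discrete sequences."""
    n = len(labels)
    joint = Counter(zip(feature, labels))
    feature_counts = Counter(feature)
    label_counts = Counter(labels)
    mi = 0.0
    for (x, y), count in joint.items():
        p_xy = count / n
        # p(x, y) / (p(x) * p(y)) simplifies to count * n / (count_x * count_y).
        ratio = count * n / (feature_counts[x] * label_counts[y])
        mi += p_xy * math.log2(ratio)
    return mi


def top_k_motifs(fingerprints, labels, k):
    """Return the column indices of the k motifs most informative about labels."""
    n_motifs = len(fingerprints[0])
    scores = [
        mutual_information([row[j] for row in fingerprints], labels)
        for j in range(n_motifs)
    ]
    ranked = sorted(range(n_motifs), key=lambda j: scores[j], reverse=True)
    return ranked[:k]


# Toy data for illustration only: 4 compounds, 3 motif bits each.
toy_fingerprints = [
    [1, 0, 1],
    [1, 1, 0],
    [0, 0, 1],
    [0, 1, 0],
]
toy_labels = [1, 1, 0, 0]  # 1 = active, 0 = inactive
selected = top_k_motifs(toy_fingerprints, toy_labels, k=1)
```

In the toy data, only the first motif tracks activity, so it is the one retained. Varying `k` and retraining is what makes it possible to see that a smaller motif set can outperform a larger one.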
This suggests that an optimal fingerprint must be tailored to the specific activity being predicted, as certain motifs are critical to some models and irrelevant to others. Determining this optimal fingerprint requires trial-and-error testing of both the FPL and number of motifs.
The BioSymetrics Augusta platform automates the computational workflow, in this instance allowing the scientist to specify the set of FPLs and the numbers of motifs to test. The software then builds, validates, and assesses machine learning models for each combination.
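Conceptually, this kind of sweep is a loop over every FPL and motif count. The sketch below is a generic illustration and does not represent the Augusta API: `sweep`, `candidates`, and `evaluate` are hypothetical names, and `evaluate` stands in for whatever routine builds, validates, and scores a model (for example, returning precision at 10% recall).

```python
def sweep(candidates, evaluate):
    """Score every (library, motif count) pair and return results, best first.

    candidates: dict mapping a library name to the motif counts to try.
    evaluate: hypothetical callable (library, n_motifs) -> score, where a
        higher score is better.
    """
    results = []
    for library, motif_counts in candidates.items():
        for n_motifs in motif_counts:
            score = evaluate(library, n_motifs)
            results.append((score, library, n_motifs))
    return sorted(results, key=lambda result: result[0], reverse=True)


# Placeholder libraries and scores for illustration only (not study results).
placeholder_scores = {
    ("lib_a", 100): 0.2,
    ("lib_a", 200): 0.5,
    ("lib_b", 100): 0.4,
}
ranking = sweep(
    {"lib_a": [100, 200], "lib_b": [100]},
    lambda library, n_motifs: placeholder_scores[(library, n_motifs)],
)
best_score, best_library, best_n_motifs = ranking[0]
```

Keeping the motif counts per library, rather than one shared list, allows for libraries of very different sizes.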
The Result: Faster time-to-market and greater confidence in results.