A batch effect occurs whenever non-biological factors influence your experimental readout. Often, these effects are so large that they become a barrier to understanding the underlying biology. With simple readouts, batch effects can be easy to visualize and understand. However, with the advent of high-content screening methodologies (e.g. cellular imaging and transcriptomics), it becomes more challenging to tease apart and visualize these effects. This is further compounded when
The Connectivity Map (CMap) is a conceptual, comprehensive linking of cellular signatures to genomic (i.e. mutation) and pharmacological (i.e. drug-mediated) effects. The CMap dataset is based on the L1000 assay (developed by the Broad Institute), which measures the mRNA abundance of 978 landmark genes plus 80 control genes from human cells. As with any assay, L1000 data is noisy. Experimental replicates (the same compound tested on the same cell line under the same conditions) often
The effect of SMILES format on chemical database overlap
A common format for representing compounds is the Simplified Molecular Input Line Entry System (SMILES), which encodes a chemical structure as a short string. But despite being a standard format, it is possible to represent the same structure in multiple ways. For example, caffeine can be represented as “CN1C=NC2=C1C(=O)N(C(=O)N2C)C” or equally validly as “Cn1c(=O)c2c(ncn2C)n(C)c1=O”, depending on the starting atom.
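To make the caffeine example concrete, here is a minimal sketch showing that the two strings fail an exact string comparison even though they describe the same heavy atoms. The helper `heavy_atom_counts` and its regular expression are illustrative assumptions, not part of any library, and they only cover atoms written without square brackets. Matching composition is a necessary check, not proof of identical structure; real deduplication would typically rely on a cheminformatics toolkit that generates canonical SMILES.

```python
import re
from collections import Counter

# Hypothetical helper pattern: matches only unbracketed "organic subset" atoms.
# Bracket atoms such as [nH] or [Na+] are NOT handled by this sketch.
ATOM_PATTERN = re.compile(r"Cl|Br|[BCNOPSFI]|[bcnops]")


def heavy_atom_counts(smiles: str) -> Counter:
    """Count heavy atoms, treating aromatic (lowercase) and aliphatic symbols alike."""
    return Counter(token.capitalize() for token in ATOM_PATTERN.findall(smiles))


# The two caffeine SMILES strings quoted in the text.
caffeine_a = "CN1C=NC2=C1C(=O)N(C(=O)N2C)C"
caffeine_b = "Cn1c(=O)c2c(ncn2C)n(C)c1=O"

# Naive string matching treats them as different compounds...
strings_match = caffeine_a == caffeine_b

# ...while the heavy-atom composition is identical.
compositions_match = heavy_atom_counts(caffeine_a) == heavy_atom_counts(caffeine_b)

print(strings_match, compositions_match)  # False True
```

Two databases that store the same compound under different SMILES strings would therefore appear not to overlap if entries were joined on the raw string.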
Electronic Medical Records (EMRs) contain many missing values, which creates difficulties for data scientists who want to model these data. In a previous post, we discussed the different feature engineering methods available for the diagnosis codes, medication data and clinical notes in EMRs. In this post, we highlight the challenges that missing values pose when modelling EMR time-series data and discuss some techniques to address them. EMR data, especially for laboratory
PURPOSE: Identify mechanism of action (MoA) from animal phenotype models
OVERVIEW: BioSymetrics leverages a proprietary machine learning platform (Augusta™) to generate structure-based activity predictions. This, in combination with a vertebrate, in vivo phenotypic profiling framework, has allowed us to make phenotype-mechanism association predictions across a range of potential clinical applications.
INPUT: Chemical structures, experimental datasets (public and private)
OUTPUT: Implicated pathways/processes
USE CASE: Phenotype MoA Prediction
1. INPUT: Phenotype Assays
2. Activity prediction model is fit and validated
3. The
PURPOSE: Quantify and correct bias in high-content screening (HCS) data
INPUT: Chemical structures, morphological properties (or original images)
OUTPUT: Dynamic workflow that integrates bias removal and mechanism prediction
USE CASE: Batch effects are a common issue when dealing with high-throughput assays, often resulting in patterns within the data that are unrelated to assay response. Machine Learning (ML) models latch on to any source of regularity. Without Augusta™ pre-processing and Contingent-AI (patent pending), ML models will learn
CASE STUDY: Machine learning for activity prediction, as part of lead compound generation
The Challenge: Quickly iterating over multiple large feature sets, with the flexibility to test models at scale, is a challenge for any data scientist.
USE CASE: Value-Based Care
CLIENT: Major UK-based healthcare network in partnership with Intacare.
OVERVIEW: The annual cost of radiotherapy is escalating year on year, with little visibility into root causes and little control. Maintaining cost-efficient healthcare for patients required an investigation of current code/claim and cost data.
GOAL: Identify and quantify the potential cost savings of revising existing reimbursement mechanisms.
“Processing the data manually would have required many months of man hours.” Matt Hickey, CEO, Intacare
A Comprehensive Overview of Data Cleaning and Feature Engineering Techniques for Clinical Data Housed in Electronic Medical Records. The electronic medical record (EMR) is a digital version of a patient’s chart that collects data related to a patient’s visits, such as past medical history, lab results, prescriptions, diagnoses, and patient-reported outcomes. EMR data are notorious for being messy, incomplete, and inconsistent. Part of the “messiness” is due to the diverse nature of clinical data.
Challenge: Combine Disparate Data Sets in Preprocessing for ML
Summary: Compelling results show that combining data sources generally yielded better diagnostic performance than any single data set alone (Figures 1 and 2).