Electronic Medical Records (EMRs) contain a large number of missing values, which creates difficulties for data scientists who want to model this data. In a previous post, we discussed the feature engineering methods available for the diagnosis codes, medication data and clinical notes of EMRs. In this post, we highlight the challenges that missing values pose when modelling EMR time-series data and discuss some techniques to address them. EMR data, especially for laboratory
PURPOSE: Identify mechanism of action (MoA) from animal phenotype models. OVERVIEW: BioSymetrics leverages a proprietary machine learning platform (Augusta™) to generate structure-based activity predictions. This, in combination with a vertebrate in vivo phenotypic profiling framework, has allowed us to make phenotype-mechanism association predictions across a range of potential clinical applications. INPUT: Chemical structures, experimental datasets (public and private). OUTPUT: Implicated pathways/processes. USE CASE: Phenotype MoA Prediction. 1. INPUT: Phenotype assays. 2. Activity prediction model is fit and validated. 3. The
PURPOSE: Quantify and correct bias in high-content screening (HCS) data. INPUT: Chemical structures, morphological properties (or original images). OUTPUT: Dynamic workflow that integrates bias removal and mechanism prediction. USE CASE: Batch effects are a common issue when dealing with high-throughput assays, often resulting in patterns within the data that are unrelated to assay response. Machine Learning (ML) models latch on to any source of regularity. Without Augusta™ pre-processing and Contingent-AI (patent pending), ML models will learn
CASE STUDY: Machine learning for activity prediction, as part of lead compound generation. The Challenge: Quickly iterating over multiple large feature sets, with the flexibility to test models at scale, is a challenge for any data scientist.
USE CASE: Value-Based Care. CLIENT: Major UK-based healthcare network, in partnership with Intacare. OVERVIEW: The annual cost of radiotherapy is escalating year-on-year, with little visibility into root cause or control. Maintaining cost-efficient healthcare for patients required an investigation of current code/claim and cost data. GOAL: Identify and quantify potential cost savings from revising existing reimbursement mechanisms. “Processing the data manually would have required many months of man hours.” Matt Hickey, CEO, Intacare
A Comprehensive Overview of Data Cleaning and Feature Engineering Techniques for Clinical Data Housed in Electronic Medical Records. The electronic medical record (EMR) is a digital version of a patient’s chart that collects data related to the patient’s visits, such as past medical history, lab results, prescriptions, diagnoses, and patient-reported outcomes. EMR data are notorious for being messy, incomplete, and inconsistent. Part of the “messiness” is due to the diverse nature of clinical data.
Challenge: Combine Disparate Data Sets in Pre-Processing for ML. Summary: Compelling results show that combining data sources generally yielded better diagnostic performance than any single data set alone (Figures 1 & 2).
Report Title: Distributed Processing Frameworks for Machine Learning of Combined Biomedical Data Types. This white paper discusses the computing requirements of the combined data types that the Augusta™ platform was built to handle. It is a must-read for understanding the compute-power complexities of pre-processing various data types and for identifying ideal scenarios when using or pricing Augusta™. Please complete the form below to download our free white paper.