With the advent of high-content screening methodologies (e.g. cellular imaging, transcriptomics), it becomes more challenging to tease apart and visualize batch effects. The problem is compounded when building machine learning models, which can easily exploit these confounding variables instead of real biological signal to generate predictions, leading to poor real-world relevance.
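As an illustrative sketch (not a method from this article): one simple mitigation for a batch-driven confounder is per-batch mean-centering, which removes a systematic offset between batches before modeling. The function name and data below are hypothetical.

```python
def mean_center_by_batch(features, batches):
    """Subtract each batch's mean from that batch's feature values."""
    by_batch = {}
    for x, b in zip(features, batches):
        by_batch.setdefault(b, []).append(x)
    means = {b: sum(xs) / len(xs) for b, xs in by_batch.items()}
    return [x - means[b] for x, b in zip(features, batches)]

# Two batches with a systematic +10 offset in batch "B"
features = [1.0, 2.0, 3.0, 11.0, 12.0, 13.0]
batches = ["A", "A", "A", "B", "B", "B"]
corrected = mean_center_by_batch(features, batches)
print(corrected)  # batch offset removed: [-1.0, 0.0, 1.0, -1.0, 0.0, 1.0]
```

A quick diagnostic in the same spirit: if a classifier can predict the batch label from the features, the model still has a confounder to exploit.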
As with any assay, L1000 data is noisy. Experimental replicates (the same compound tested on the same cell line under the same conditions) often yield different measured expression levels. De-noising the L1000 data makes it easier to see the true assay response and to pick a representative concentration for each compound.
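A hedged sketch of the idea (not the actual L1000 pipeline): average replicate profiles per dose, then pick the dose whose averaged profile shows the strongest overall response as the representative concentration. The profile values below stand in for differential-expression z-scores and are invented.

```python
def average_replicates(replicates):
    """Element-wise mean of a list of equal-length expression profiles."""
    n = len(replicates)
    return [sum(vals) / n for vals in zip(*replicates)]

def representative_dose(profiles_by_dose):
    """Pick the dose whose averaged profile has the largest L1 norm."""
    averaged = {d: average_replicates(reps) for d, reps in profiles_by_dose.items()}
    return max(averaged, key=lambda d: sum(abs(v) for v in averaged[d]))

# Hypothetical compound measured at two doses, two replicates each
profiles = {
    "1uM": [[0.2, -0.1, 0.3], [0.4, 0.1, 0.1]],
    "10uM": [[1.5, -1.2, 2.0], [1.1, -0.8, 1.6]],
}
print(representative_dose(profiles))  # 10uM: strongest averaged response
```

Averaging damps replicate-level noise, and the magnitude criterion is one simple stand-in for "clearest assay response"; a production pipeline would use a more robust aggregate.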
When are two compounds the same? We examine the effect of the Simplified Molecular Input Line Entry System (SMILES) format on chemical database overlap, including best practices for canonicalization and harmonization, to understand the impact of these representation effects on a particular dataset and a specific application.
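A toy sketch of why canonicalization matters for overlap estimates: the same molecule written as two different (but valid) SMILES strings will not match on raw string comparison. A real pipeline would canonicalize with a cheminformatics toolkit such as RDKit (`Chem.MolToSmiles`); the lookup table below hand-writes that step for illustration.

```python
CANONICAL = {                # hypothetical canonicalizer output
    "C(C)O": "CCO",          # ethanol, two surface forms
    "OCC": "CCO",
    "c1ccccc1": "c1ccccc1",  # benzene, aromatic vs Kekule form
    "C1=CC=CC=C1": "c1ccccc1",
}

db_a = {"C(C)O", "c1ccccc1"}
db_b = {"OCC", "C1=CC=CC=C1"}

raw_overlap = db_a & db_b
canon_overlap = {CANONICAL[s] for s in db_a} & {CANONICAL[s] for s in db_b}
print(len(raw_overlap), len(canon_overlap))  # 0 2: the overlap was hidden
```

Without canonicalization the two databases appear disjoint; after it, they are identical, which is exactly the kind of distortion that propagates into downstream dataset construction.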
CASE STUDY: Machine learning for activity prediction, as part of lead compound generation
The Challenge: Quickly iterating over multiple large feature sets, with the flexibility to test models at scale, is a challenge for any data scientist.
USE CASE: Value Based Care
CLIENT: Major UK-based healthcare network in partnership with Intacare.
OVERVIEW: The annual cost of radiotherapy is escalating year on year, with little visibility into root causes or control. Maintaining cost-efficient healthcare for patients required an investigation of current code/claim and cost data.
GOAL: Identify and quantify potential cost savings of revising existing reimbursement mechanisms.
PROBLEM: The data were incongruous. Each healthcare provider used different systems, taxonomies, codes, and cost bases when mapping radiotherapy procedures for claims submission.
- 75,000+ claims
- 1725+ unique narratives
- Thousands of individual and duplicate codes
- Data types: text, numeric
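The normalization problem above can be sketched as a crosswalk: each provider submits claims under its own codes, so a mapping table re-keys provider-specific codes to one shared taxonomy before costs can be compared. All provider names, codes, and costs below are invented, not from the Intacare engagement.

```python
CROSSWALK = {
    ("provider_a", "RT-100"): "IMRT_PLANNING",
    ("provider_b", "77301"): "IMRT_PLANNING",
    ("provider_a", "RT-200"): "IMRT_DELIVERY",
    ("provider_b", "77385"): "IMRT_DELIVERY",
}

def normalize(claims):
    """Re-key each claim's provider-specific code to the shared taxonomy."""
    return [
        {"procedure": CROSSWALK[(c["provider"], c["code"])], "cost": c["cost"]}
        for c in claims
    ]

claims = [
    {"provider": "provider_a", "code": "RT-100", "cost": 950},
    {"provider": "provider_b", "code": "77301", "cost": 1400},
]
normalized = normalize(claims)
# Both claims now fall under the same label, exposing the cost gap
print({c["procedure"] for c in normalized})  # {'IMRT_PLANNING'}
```

Once claims share a taxonomy, cost comparisons across providers become a simple group-by, which is what makes the downstream savings analysis tractable.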
SOLUTION: Intacare used Augusta Pre-Processing workflows to quickly normalize procedure and cost data collected from multiple sources. The team also created an automatable workflow to streamline future analysis and report generation.
- Identified inefficiencies across 24,281 claims
- 39% of total claims
- Projected cost savings: $5 million
Pre-processing the data with Augusta workflows reduced data-management time from multiple weeks to just hours.
Moreover, the workflows are now available within Augusta as standard packages, easily replicated for future projects or if additional data needs to be interrogated.
Download the Intacare case study in PDF format.
Challenge: Combine Disparate Data Sets in Pre-Processing for ML
Summary: Compelling results show that combining data sources generally allowed better diagnostic performance than any data set alone (Figures 1 & 2).
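A minimal sketch of the combining step, not the study's actual pipeline: join two sources on a shared sample ID so each sample carries features from both before model training. Field names and values are hypothetical; a real pipeline might use `pandas.DataFrame.merge` instead.

```python
imaging = {"s1": {"cell_area": 310.0}, "s2": {"cell_area": 205.0}}
omics = {"s1": {"gene_x": 1.4}, "s2": {"gene_x": -0.7}}

def combine(*sources):
    """Inner-join dicts of {sample_id: feature_dict} on sample ID."""
    shared = set.intersection(*(set(s) for s in sources))
    return {
        sid: {k: v for src in sources for k, v in src[sid].items()}
        for sid in sorted(shared)
    }

combined = combine(imaging, omics)
print(combined["s1"])  # {'cell_area': 310.0, 'gene_x': 1.4}
```

The inner join keeps only samples present in every source, which is the usual conservative choice when modalities are collected on overlapping but non-identical sample sets.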