Feature Selection Using Contingent AI: Going Beyond Mutual Information
Feature selection yields an information-rich vector of understandable features, ultimately leading to higher-performing and more explainable models. Rather than scoring features one at a time, the better approach is to consider features in combination, maximizing the information available to the model.
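A toy illustration of why combinations matter (the dataset and helper below are hypothetical, not from the post): with an XOR-style label, each feature alone carries zero mutual information with the target, yet the pair determines it completely — exactly the case that per-feature scoring misses.

```python
from collections import Counter
from math import log2

# Hypothetical toy dataset: the label y is the XOR of two binary features.
# Individually x1 and x2 share no information with y; together they fix it.
data = [(x1, x2, x1 ^ x2) for x1 in (0, 1) for x2 in (0, 1)] * 25  # 100 rows

def mutual_information(pairs):
    """Empirical mutual information (in bits) between two discrete variables."""
    n = len(pairs)
    joint = Counter(pairs)
    px = Counter(a for a, _ in pairs)
    py = Counter(b for _, b in pairs)
    return sum(c / n * log2((c / n) / ((px[a] / n) * (py[b] / n)))
               for (a, b), c in joint.items())

mi_x1 = mutual_information([(x1, y) for x1, _, y in data])            # 0.0 bits
mi_x2 = mutual_information([(x2, y) for _, x2, y in data])            # 0.0 bits
mi_pair = mutual_information([((x1, x2), y) for x1, x2, y in data])   # 1.0 bit
print(mi_x1, mi_x2, mi_pair)
```

Any selector that ranks features marginally would discard both features here; evaluating them jointly recovers the full bit of information.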
Mitigating Batch Effects in Cell Painting Data
With the advent of high-content screening methodologies (e.g. cellular imaging, transcriptomics), it becomes more challenging to tease apart and visualize batch effects. This is further compounded when building machine learning models, which can easily exploit these confounding variables instead of real biological signal, yielding predictions with poor real-world relevance.
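One simple mitigation for additive batch effects is per-batch mean-centering. The sketch below uses hypothetical per-plate feature values (the plate names and numbers are made up); it is a minimal illustration, not the full correction pipeline a real Cell Painting analysis would use.

```python
from statistics import mean

# Hypothetical feature readouts keyed by plate; the constant offset between
# plates stands in for a plate- or day-level batch effect.
batches = {
    "plate_1": [1.0, 1.2, 0.9, 1.1],
    "plate_2": [3.0, 3.2, 2.9, 3.1],  # same biology, shifted by the batch
}

# Per-batch mean-centering removes additive batch offsets before modelling
# or visualization, so a model cannot lean on the plate identity.
corrected = {b: [v - mean(vals) for v in vals] for b, vals in batches.items()}
print(corrected["plate_1"])
print(corrected["plate_2"])  # after centering, both plates line up
```

More sophisticated corrections (e.g. standardizing against per-plate controls) follow the same pattern: estimate the batch component, then subtract it out.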
De-noising CMap L1000 Data
As with any assay, L1000 data is noisy: experimental replicates (the same compound tested on the same cell line under the same conditions) often yield different measured expression levels. De-noising the L1000 data makes it easier to see the true assay response and to pick a representative concentration for each compound.
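One basic de-noising step is collapsing replicates to their median, which damps outlier wells while preserving the central response. The replicate values, compound names, and the "strongest absolute response" rule for picking a concentration below are all illustrative assumptions, not the post's exact procedure.

```python
from statistics import median

# Hypothetical replicate-level expression z-scores for one landmark gene,
# keyed by (compound, concentration in uM).
replicates = {
    ("cmpd_A", 10.0): [2.1, 1.8, 2.4],
    ("cmpd_A", 1.0): [0.3, -0.1, 0.2],
}

# Median across replicates as a simple consensus signature.
consensus = {key: median(vals) for key, vals in replicates.items()}

# Pick a representative concentration: here, the one with the strongest
# absolute consensus response (an assumed rule for illustration).
best = max(consensus, key=lambda k: abs(consensus[k]))
print(consensus)  # {('cmpd_A', 10.0): 2.1, ('cmpd_A', 1.0): 0.2}
print(best)       # ('cmpd_A', 10.0)
```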
When are Two Compounds the Same?
When are two compounds the same? We examine how the Simplified Molecular Input Line Entry System (SMILES) format affects chemical database overlap, cover best practices for canonicalization and harmonization, and assess the impact of these effects on a particular dataset and application.
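The core pitfall can be shown without a cheminformatics toolkit: the same molecule admits many valid SMILES strings, so naive string comparison under-counts database overlap. In practice the canonical form comes from a toolkit such as RDKit (`Chem.MolToSmiles(Chem.MolFromSmiles(s))`); the lookup table below is a hand-written stand-in for that call, used only for illustration.

```python
# Two databases listing ethanol with different (both valid) SMILES strings.
db_a = {"CCO"}  # ethanol, written carbon-first
db_b = {"OCC"}  # ethanol, written oxygen-first

print(len(db_a & db_b))  # 0 -- naive string overlap misses the match

# Hand-written canonical forms standing in for a toolkit call such as
# RDKit's Chem.MolToSmiles(Chem.MolFromSmiles(s)); real pipelines should
# use the toolkit, not a lookup table.
CANONICAL = {"CCO": "CCO", "OCC": "CCO"}

canon_a = {CANONICAL[s] for s in db_a}
canon_b = {CANONICAL[s] for s in db_b}
print(len(canon_a & canon_b))  # 1 -- overlap recovered after canonicalization
```

The lesson generalizes: canonicalize (and harmonize salts, charges, and stereochemistry flags) before computing any cross-database statistic.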
Dealing with Missing Values in Healthcare Data
In this post, we highlight the challenges posed by missing values when modelling time-series data from electronic medical records (EMRs) and discuss some techniques to address them.
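One common technique for clinical time series is last-observation-carried-forward imputation paired with a missingness indicator, since in EMR data the fact that a measurement was never ordered is itself informative. The vitals below and the helper function are hypothetical; with pandas the same idea is `Series.ffill()` plus `Series.isna()`.

```python
# Hypothetical hourly heart-rate readings for one patient; None marks a
# missing reading (often meaning no one ordered the measurement).
heart_rate = [72, None, None, 80, None, 78]

def ffill_with_indicator(values):
    """Carry the last observation forward and record where imputation happened."""
    filled, was_missing, last = [], [], None
    for v in values:
        was_missing.append(v is None)
        if v is not None:
            last = v
        filled.append(last)
    return filled, was_missing

filled, mask = ffill_with_indicator(heart_rate)
print(filled)  # [72, 72, 72, 80, 80, 78]
print(mask)    # [False, True, True, False, True, False]
```

Feeding the indicator mask to the model alongside the imputed values lets it learn from the missingness pattern instead of silently absorbing imputed numbers as real observations.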
Feature Engineering of Electronic Medical Records
A comprehensive overview of data cleaning and feature engineering techniques for clinical data.
Dishing Dirt About Clean Data
A daughter's desire to please her parents shows how a well-intentioned data scientist can cause far more harm and expense in the long run by selecting and creating the wrong features during data pre-processing.