A batch effect occurs whenever non-biological factors influence an experimental readout. These effects can be large enough to present a barrier to understanding the underlying biology. With simple readouts, batch effects are often easy to visualize and understand. With the advent of high-content screening methodologies (e.g., cellular imaging, transcriptomics), however, teasing apart and visualizing these effects becomes more challenging. The problem is compounded when building machine learning models, which can easily latch onto these confounding variables instead of real biological signal, yielding predictions with poor real-world relevance.
Recently, a large collection of image data was released capturing changes in cellular morphology under varying chemical treatments. With over 30K molecules screened in this fashion, the data set provides a unique starting point for building activity models (i.e., predicting molecules that inhibit specific proteins or pathways of interest). Alongside each image, the authors published metadata such as the batch, plate ID, and even well location of each collected data point. Using this information, we can begin to understand various sources of batch effects. As seen in Figure 1, there is clear evidence of batch effects: sorting by non-biological parameters reveals noticeable striated patterns.
Fig. 1: Correlation heat map describing the average similarity of experiments across all experimental plates. Data is sorted according to batch ID.
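The striated pattern in Figure 1 can be reproduced conceptually with a small simulation. The sketch below uses toy data (simulated profiles and batch IDs, not the published data set) to show how sorting wells by batch ID before computing a correlation matrix exposes block structure driven by batch membership:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Toy stand-in for Cell Painting profiles: rows are wells, columns are
# morphology features; batch_id is the only metadata column here.
# (The real data also carries plate ID and well location.)
n_wells, n_features, n_batches = 60, 50, 3
batch_ids = np.repeat(np.arange(n_batches), n_wells // n_batches)

# Inject a per-batch, per-feature offset to mimic a batch effect.
batch_offsets = rng.normal(size=(n_batches, n_features))
features = rng.normal(size=(n_wells, n_features)) + batch_offsets[batch_ids]

df = pd.DataFrame(features)
df["batch_id"] = batch_ids

# Sort wells by batch ID, then compute the well-by-well correlation
# matrix. Blocks along the diagonal reveal batch-driven similarity.
df_sorted = df.sort_values("batch_id", kind="stable")
corr = np.corrcoef(df_sorted.drop(columns="batch_id").to_numpy())
```

Plotting `corr` as a heat map (e.g., with `matplotlib.pyplot.imshow`) would show the same kind of striation as Figure 1: within-batch entries are systematically more correlated than between-batch entries.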
Machine learning models trained directly on the data could easily exploit non-biological factors, inflating apparent predictive accuracy. To counteract this effect, we can apply a variety of preprocessing and normalization procedures. Rather than trying a single strategy and assuming success, we directly compare numerous preprocessing workflows. As seen in Figure 2, there are many paths and combinations of normalization procedures to consider.
Fig. 2: Preprocessing and normalization workflow for Cell Painting data.
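As one illustrative step from a workflow like Figure 2 (this is a generic sketch, not the authors' exact pipeline), features can be robustly standardized within each plate so that plate-level shifts in baseline and scale are removed. The plate column name and feature names below are hypothetical:

```python
import numpy as np
import pandas as pd

def robust_z(x: pd.Series) -> pd.Series:
    """Robust z-score: center on the median, scale by the MAD
    (1.4826 makes the MAD consistent with the standard deviation
    for normal data); the small epsilon avoids division by zero."""
    med = x.median()
    mad = (x - med).abs().median()
    return (x - med) / (1.4826 * mad + 1e-9)

rng = np.random.default_rng(0)
# Toy profiles spanning two plates with different baselines.
df = pd.DataFrame({
    "plate_id": ["P1"] * 50 + ["P2"] * 50,
    "feat_a": np.concatenate([rng.normal(0, 1, 50), rng.normal(5, 2, 50)]),
    "feat_b": np.concatenate([rng.normal(1, 1, 50), rng.normal(-3, 1, 50)]),
})

feature_cols = ["feat_a", "feat_b"]
# Normalize each feature within its own plate, removing plate-level shifts.
df[feature_cols] = df.groupby("plate_id")[feature_cols].transform(robust_z)
```

After this step, each feature is centered and scaled per plate, so downstream models can no longer separate plates by their raw baselines alone. Analogous grouping by batch ID gives per-batch normalization.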
After developing this preprocessing workflow, we now have a series of data sets, each uniquely altered to reduce bias. However, we still don’t know which method to use, or to what degree bias has decreased. To address this, we propose a simple quantitative measure of bias. As illustrated in Figure 3, we build two classification models: one to predict specific labels of biological interest and another to predict non-biological factors. Ideally, non-biological labels should not be predictable; in practice, however, they often are when working with raw data. By comparing the accuracy of the two models, we can quantify the degree to which bias has been reduced and how much we can trust that true biological effects drive the predictions.
Fig. 3: Developing a quantitative measure of bias in high content biological data sets.
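The two-model comparison in Figure 3 can be sketched as follows, using simulated features and hypothetical labels (here `activity` stands in for a biological readout such as DRD2 inhibition, and `batch` for a non-biological factor such as assay date):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
n, d = 300, 20
batch = rng.integers(0, 3, size=n)      # non-biological label
activity = rng.integers(0, 2, size=n)   # biological label

# Simulated profiles carrying a strong batch signature and a weaker
# biological signal, mimicking raw (uncorrected) data.
X = rng.normal(size=(n, d))
X += rng.normal(size=(3, d))[batch] * 1.5
X[:, :5] += activity[:, None] * 0.8

model = lambda: LogisticRegression(max_iter=1000)
bio_acc = cross_val_score(model(), X, activity, cv=5).mean()
nonbio_acc = cross_val_score(model(), X, batch, cv=5).mean()
# nonbio_acc far above chance (~0.33 for three batches) flags bias.
```

The gap between `nonbio_acc` and chance serves as the bias score: an effective preprocessing workflow should drive it toward chance while leaving `bio_acc` intact.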
Using this measure of bias, we can directly compare each of our preprocessed data sets against the raw data to determine which workflow is most applicable. As seen in Figure 4, the raw data are strongly correlated with the date on which the assay was conducted: our model can predict the specific date a compound was assayed with over 80% accuracy. Counterintuitively, the best preprocessing workflow for mitigating bias does not account for assay date directly, but instead uses more precise subsets (batch and plate ID). With these steps, we are able to strongly reduce batch effects within Cell Painting data and quantify the reduction with a simple method.
Fig. 4: Comparing bias correction workflows for Cell Painting using a quantitative measure of bias.
DRD2 inhibitory activity was used as our label for biological activity while assay date was used for evaluating non-biological bias.
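The Figure 4 comparison can be mimicked end to end on simulated data. This sketch (hypothetical labels and a simple per-batch centering as the candidate correction, not the authors' chosen workflow) scores a data set before and after correction with the bias measure above:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(2)
n, d = 300, 20
batch = rng.integers(0, 3, size=n)      # stand-in for assay date / batch
activity = rng.integers(0, 2, size=n)   # stand-in for DRD2 activity

X = rng.normal(size=(n, d))
X += rng.normal(size=(3, d))[batch] * 1.5    # per-batch feature offsets
X[:, :5] += activity[:, None] * 0.8          # biological signal

# Candidate correction: center features within each batch, analogous
# to the per-batch/per-plate normalization steps of Figure 2.
Xc = X.copy()
for b in np.unique(batch):
    Xc[batch == b] -= Xc[batch == b].mean(axis=0)

model = lambda: LogisticRegression(max_iter=1000)
raw_bias = cross_val_score(model(), X, batch, cv=5).mean()
corrected_bias = cross_val_score(model(), Xc, batch, cv=5).mean()
corrected_bio = cross_val_score(model(), Xc, activity, cv=5).mean()
```

A good workflow drives `corrected_bias` toward chance while `corrected_bio` stays high; repeating this scoring for each preprocessed data set gives the kind of side-by-side comparison shown in Figure 4.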