Dishing Dirt About Clean Data

June 20, 2019

My daughter was very proud of herself.  She had scrubbed her new bathtub clean.  It was beautiful.  Until about two weeks later when odd water stains began to form around the tub, and grow darker with time.

Mom and I could not understand what was wrong, until we realized her choices about how to clean it had caused irreparable damage.

It appears she had used a steel wool pad with a powder cleanser on a fiberglass tub.  She had read the instructions on the can (‘for use in bathrooms’), and knew dad often scrubbed pots with steel wool.  She cleaned the best way she knew how, with a method she was familiar with.  It appeared to work… until it entered the trial stage.

A recent series of healthcare industry events, articles, and webinars has renewed the conversation about data hygiene for AI.  Many industry pundits and naysayers are dishing dirt about clean data, calling the lack of it the biggest impediment to good machine learning.  This is only partly true.  Equally important, if not more so, are the decisions made about how to prepare that data.  Those decisions can cause the greatest damage.

Collecting, normalizing, and standardizing data are the obvious areas where bias begins.  These are the known knowns of bias that we understand as an industry and often seek to account for in our models.  However, the selection and engineering of the features used to train ML models is where we need to confront the true dangers of introducing bias.  Like my daughter, the data scientist with good intentions can cause far more harm and expense in the long run by selecting and creating the wrong features during data pre-processing.
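
To make this concrete, here is a minimal sketch in Python, using made-up numbers rather than real patient data.  The group labels, the lab values, and the `impute` and `group_mean` helpers are all hypothetical, chosen only for illustration.  It shows how one well-intentioned pre-processing choice (filling missing values with the overall mean) can quietly pull an under-measured group toward the majority.

```python
from statistics import mean

# Synthetic, made-up records: (group, lab_value). None marks a missing value.
# Group B is measured less often than group A (an assumption for illustration).
records = [
    ("A", 10.0), ("A", 12.0), ("A", 11.0), ("A", 9.0),
    ("B", 20.0), ("B", None), ("B", None), ("B", 22.0),
]

def impute(records, strategy):
    """Fill missing values with the overall mean or the per-group mean."""
    observed = [v for _, v in records if v is not None]
    overall = mean(observed)

    # Collect observed values per group to compute group-specific means.
    by_group = {}
    for g, v in records:
        if v is not None:
            by_group.setdefault(g, []).append(v)
    group_means = {g: mean(vs) for g, vs in by_group.items()}

    def fill(g):
        return overall if strategy == "overall" else group_means[g]

    return [(g, v if v is not None else fill(g)) for g, v in records]

def group_mean(records, group):
    """Average value for one group after imputation."""
    return mean(v for g, v in records if g == group)

for strategy in ("overall", "per_group"):
    filled = impute(records, strategy)
    print(strategy, group_mean(filled, "B"))
```

With these toy numbers, the overall-mean fill drags group B's average from 21.0 down to 17.5, while the per-group fill leaves it at 21.0.  Neither choice is automatically right… the point is that the decision itself shapes what the model sees.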

Just as my daughter read the label, a data scientist will read published papers in an attempt to understand the best way to pre-process the data.  They may also consult a colleague or expert to see if steel wool is an appropriate tool.  But even well-informed decisions can lead to failure.

This is why we’ve constructed ‘Contingent AI’… the element within the Augusta™ framework that looks at optimizing the procedures themselves, allowing the scientist to iterate, explore, and validate decisions used in the preparation of their feature sets.  (see: What is Contingent AI?)

As for my daughter? Yes… that was an expensive tub… and a time-consuming repair.