Feature Selection Using Contingent AI: Going Beyond Mutual Information

What should you do when building ML models and you have many more features than samples? This is very common in healthcare: between Electronic Health Records (EHRs) and genomic profiles, we often have hundreds or even thousands of features per person. If the cohort size n is smaller than the feature vector size m, the model risks overfitting to noise and spurious relationships.

There are three approaches typically used to reduce the size of a feature vector:

  • Feature fusion. Two or more features are combined into one, such as combining height and weight into BMI (a brief sketch of fusion and transformation follows this list). The benefit is that we reduce the feature space while still having features that are interpretable. However, not all features can be fused in a meaningful way, and sometimes the originals provide distinct information. For example, systolic and diastolic blood pressure are often individually important, as are individual genes in a common pathway. Fusing these would not generally be beneficial.

  • Feature transformation. The full feature space can be transformed using methods such as Principal Component Analysis (PCA), and then reduced by dropping the components that contribute least variance. The main disadvantage is that features are no longer easily interpretable, as each feature becomes a linear combination of all original features. As a result, models become less explainable.

  • Feature selection. The feature vector is reduced in size so that it contains only those features that contribute most information to the task at hand. The benefit is that all remaining features are easily interpretable and meaningful. The main challenge is in how to identify the most information-rich set of features to keep.
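
For concreteness, the first two approaches might look something like the sketch below. This is only an illustration: the patient table, column names, and the 95% variance cut-off are all hypothetical.

```python
import pandas as pd
from sklearn.decomposition import PCA

# Hypothetical patient table; a real EHR extract would have many more columns.
df = pd.DataFrame({
    "height_m":     [1.60, 1.75, 1.82, 1.68, 1.71],
    "weight_kg":    [61.0, 80.5, 95.2, 70.3, 77.9],
    "systolic_bp":  [118, 135, 142, 121, 128],
    "diastolic_bp": [76, 88, 95, 79, 84],
})

# Feature fusion: combine height and weight into BMI and drop the originals.
df["bmi"] = df["weight_kg"] / df["height_m"] ** 2
fused = df.drop(columns=["height_m", "weight_kg"])

# Feature transformation: PCA on standardized features, keeping only enough
# components to explain 95% of the variance.
standardized = (fused - fused.mean()) / fused.std()
components = PCA(n_components=0.95, svd_solver="full").fit_transform(standardized)
```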

Feature selection: the old way

The standard approach to feature selection is to use a metric like mutual information (MI) to score each feature by how much information it shares with the model target. We then keep all features with MI above a given threshold.
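
A minimal sketch of this thresholding approach, assuming scikit-learn (the threshold value is illustrative):

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

def mi_threshold_select(X, y, threshold=0.01):
    """Keep the indices of features whose estimated mutual information
    with the target exceeds a fixed threshold."""
    mi = mutual_info_classif(X, y, random_state=0)
    return np.flatnonzero(mi > threshold)
```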

However, this doesn’t take account of interactions between features. Two features may both have high correlation with the target, yet provide the same fundamental information about it. At best, two such highly correlated features lead to an underweighting of their importance in the final model; at worst, they prevent a model from training entirely, due to collinearity.

Even if we remove highly correlated features prior to applying MI, there is a second limitation of the thresholding approach. It is entirely possible that a third feature has much lower MI yet offers information complementary to the two above. The ideal feature vector would contain this feature plus one of the pair, but a typical thresholding approach will never include it.
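
To make this concrete, consider a small synthetic illustration (entirely hypothetical data): feature a drives the target strongly, b is a noisy near-copy of a, and c carries a weaker but complementary signal.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(0)
n = 5000

a = rng.normal(size=n)             # strong driver of the target
b = a + 0.1 * rng.normal(size=n)   # near-duplicate of a: high MI, but redundant
c = rng.normal(size=n)             # weaker, complementary signal
y = (2 * a + 0.7 * c + rng.normal(size=n) > 0).astype(int)

X = np.column_stack([a, b, c])
print(mutual_info_classif(X, y, random_state=0))
# a and b typically score far higher than c, so a threshold set between them
# keeps the redundant pair and discards the complementary feature.
```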

The standard approach to feature selection therefore often leaves the model with less than the maximal available information, and consequently lower performance.

Iterating to the best solution

The problem then is to find the optimal combination of features that provides the most information about the target. At BioSymetrics, we apply our Contingent AI approach to iterate over many thousands of random subsets of the features, and examine which combinations consistently rise to the top.

An example workflow looks like this (a code sketch follows the list):

  • 25,000 times over, randomly sample p of the m features, where p < 0.8*n. This is the “sampled” set of features.

  • Extract these p features for a stratified sample of 80% of the patients, giving a training set with dimensions 0.8*n x p.

  • Train a simple shallow model such as Logistic Regression, using L1 regularization to force feature selection behaviour. (L1 penalizes the sum of the absolute values of the coefficients, rather than L2’s half the sum of their squares, analogous to Manhattan distance versus Euclidean distance.)

  • Examine the feature importance of the sampled set of features. All those with a non-zero coefficient have been “chosen” by the model.
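
A minimal sketch of this loop, assuming scikit-learn and a feature matrix X of shape n × m with a binary target y. The function name, subset-size rule, and regularization strength are illustrative choices, not the production Contingent AI implementation.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

def contingent_feature_counts(X, y, n_iter=25_000, train_frac=0.8, seed=0):
    """Count how often each feature is randomly sampled and how often it is
    'chosen' (given a non-zero coefficient) by an L1-penalized model."""
    rng = np.random.default_rng(seed)
    n, m = X.shape
    p = min(m, int(train_frac * n) - 1)   # subset size p, kept below 0.8 * n
    sampled = np.zeros(m, dtype=int)
    chosen = np.zeros(m, dtype=int)

    for _ in range(n_iter):
        feats = rng.choice(m, size=p, replace=False)

        # Stratified 80% training split, restricted to the sampled features.
        X_tr, _, y_tr, _ = train_test_split(
            X[:, feats], y, train_size=train_frac, stratify=y)
        X_tr = StandardScaler().fit_transform(X_tr)

        # Shallow L1-penalized model: the penalty drives some coefficients to zero.
        model = LogisticRegression(penalty="l1", solver="liblinear", C=1.0)
        model.fit(X_tr, y_tr)

        sampled[feats] += 1
        chosen[feats[np.abs(model.coef_[0]) > 0]] += 1

    return sampled, chosen
```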

After 25,000 iterations, we can examine the distributions of feature sampling and feature choice. An example is given in the Figure below: the number of times each feature was sampled (orange) approximates a uniform distribution, whereas feature choice (blue) is non-uniform. Features to the left of the plot are candidates for selection, while those on the right are never chosen by a model.
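
Continuing the sketch above (matplotlib assumed, with names carried over from the previous block), the two counts can be compared per feature, sorted by how often each was chosen:

```python
import numpy as np
import matplotlib.pyplot as plt

sampled, chosen = contingent_feature_counts(X, y)
order = np.argsort(chosen)[::-1]   # most frequently chosen features first
idx = np.arange(len(order))

plt.bar(idx, sampled[order], color="orange", alpha=0.6, label="sampled")
plt.bar(idx, chosen[order], color="tab:blue", alpha=0.8, label="chosen")
plt.xlabel("feature (ranked by times chosen)")
plt.ylabel("number of iterations")
plt.legend()
plt.show()
```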

Figure: Count of iterations where a feature was sampled versus chosen.

Conclusion

Having many more features than samples is a common situation in data analysis, and when it arises it is essential to reduce the feature space before ML modelling. When performed correctly, feature selection results in an information-rich vector of understandable features, ultimately leading to higher performing and more explainable models. The correct approach is to consider features in combination, in order to maximize the information available to the model.

Feature selection is just one component of the process towards building robust, interpretable, generalizable models. At BioSymetrics, we use our Contingent AI approach to iterate over every step of the workflow, ensuring high quality results for healthcare applications.
