The effect of SMILES format on chemical database overlap

A common format for representing compounds is the Simplified Molecular Input Line Entry System (SMILES), which encodes a chemical structure as a short string. But although SMILES is a standard format, the same structure can be written in multiple ways. For example, caffeine can be represented as “CN1C=NC2=C1C(=O)N(C(=O)N2C)C” or equally validly as “Cn1c(=O)c2c(ncn2C)n(C)c1=O”, depending on the starting atom and on whether aromatic rings are written with alternating double bonds or with lowercase aromatic atoms.

This poses a challenge when comparing two databases for overlap in chemical space. If we simply compare the raw SMILES strings, the results for caffeine from the two databases cannot be merged, because the two strings look like different structures. To address this, the industry often uses “canonical SMILES”, where a canonicalization algorithm applies a fixed set of rules to produce a unique ordering of the atoms. As long as the same canonicalization algorithm is used, any valid SMILES representation of caffeine will be converted into the same canonical SMILES string.
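
As a minimal sketch of this idea, the snippet below uses the open-source RDKit toolkit (our choice for illustration; the text above does not name a specific canonicalizer) to convert both caffeine strings to canonical SMILES. The helper name canonical_smiles is hypothetical.

```python
from rdkit import Chem


def canonical_smiles(smiles: str) -> str:
    """Return RDKit's canonical SMILES for a SMILES string."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"could not parse SMILES: {smiles!r}")
    # MolToSmiles writes canonical SMILES by default.
    return Chem.MolToSmiles(mol)


# The two caffeine representations quoted in the text.
caffeine_kekule = "CN1C=NC2=C1C(=O)N(C(=O)N2C)C"
caffeine_aromatic = "Cn1c(=O)c2c(ncn2C)n(C)c1=O"

# The raw strings differ, but their canonical forms should match.
print(caffeine_kekule == caffeine_aromatic)
print(canonical_smiles(caffeine_kekule) == canonical_smiles(caffeine_aromatic))
```

Canonical SMILES from different toolkits are generally not interchangeable, so both databases need to pass through the same function before they are compared.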

However, there is a deeper question of the chemical meaning of two compounds being considered “the same”. It is relatively easy to determine that two different starting points on a molecule represent an identical structure, and the two representations should be considered one compound. But in contrast, stereoisomers and tautomers are examples where, depending on the circumstances, we may or may not want to consider different representations to be a single compound.

What are the differences?

Stereoisomers are substances with a single 2D representation (such as a SMILES string without stereochemistry annotations), which, when considered in 3D space, can have two or more different shapes. For example, in the generic amino acid shown in the figure below, the central carbon atom bonds to fragments H, COOH, R, and NH2. In 2D it makes no difference if the clockwise order of the branches is H, COOH, R, NH2 (left) or H, NH2, R, COOH (right). But in 3D these produce differently shaped molecules. It may be that only one shape interacts with a protein of interest, and is therefore a viable drug candidate.


With this reasoning it seems appropriate to treat stereoisomers as different compounds. But in practice, it can be more realistic to treat all stereoisomers as the same compound. Many compound synthesis strategies are not stereoselective, meaning that samples bought for analysis are likely to be a mixture of stereoisomers (a racemic mixture, in the case of a single chiral center). Even if only one stereoisomer is effective, database entries will still record activity, just at a higher dose than if the pure stereoisomer were used. And in the case of ibuprofen, humans produce an isomerase that converts the less active (R)-enantiomer into the active (S)-enantiomer.

Therefore, stereochemistry may or may not be relevant, depending on context.
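
To make the choice concrete, here is a small sketch (again assuming RDKit, with a hypothetical helper named stereo_key) that builds a comparison key for a compound either with or without its stereochemistry. The two ibuprofen strings differ only in the chirality tag on the carbon next to the acid group, so they represent the two enantiomers.

```python
from rdkit import Chem


def stereo_key(smiles: str, keep_stereo: bool) -> str:
    """Canonical SMILES, optionally with stereochemistry stripped."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"could not parse SMILES: {smiles!r}")
    if not keep_stereo:
        # Clears chiral tags and double-bond stereo; modifies mol in place.
        Chem.RemoveStereochemistry(mol)
    return Chem.MolToSmiles(mol)


# The two enantiomers of ibuprofen (only @ versus @@ differs).
ibuprofen_a = "CC(C)Cc1ccc(cc1)[C@H](C)C(=O)O"
ibuprofen_b = "CC(C)Cc1ccc(cc1)[C@@H](C)C(=O)O"

# Stereo kept: the enantiomers are treated as different compounds.
print(stereo_key(ibuprofen_a, True) == stereo_key(ibuprofen_b, True))
# Stereo ignored: the enantiomers collapse to one compound.
print(stereo_key(ibuprofen_a, False) == stereo_key(ibuprofen_b, False))
```

Which of the two keys is appropriate depends on the application, as discussed above.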

In contrast, tautomers are structural isomers that readily interconvert, so both the 2D representation and the 3D shape differ between them. The tautomers of a compound contain the same set of atoms, but the bonding pattern changes easily (for example, when a proton relocates within the molecule). As in the case of stereoisomers, any given sample of the substance will contain some distribution of the structural isomers. But tautomers are different in that the proportion of each isomer changes dynamically to maintain equilibrium. If one of the isomers were to interact with a protein of interest, the molecules that dock with the protein would be removed from the system, and some of the other isomer would convert to restore equilibrium.

Therefore, it is appropriate to link results across databases by considering all tautomers of a compound to be the same compound.
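
A simple illustration of tautomer harmonization is the keto and enol forms of acetone. The sketch below assumes RDKit's MolStandardize module, which provides a tautomer canonicalizer similar in spirit to the one in MolVS; the helper name tautomer_key is hypothetical.

```python
from rdkit import Chem
from rdkit.Chem.MolStandardize import rdMolStandardize

# Enumerates tautomers and picks one representative by a scoring scheme.
_enumerator = rdMolStandardize.TautomerEnumerator()


def tautomer_key(smiles: str) -> str:
    """Canonical SMILES of the canonical tautomer."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"could not parse SMILES: {smiles!r}")
    return Chem.MolToSmiles(_enumerator.Canonicalize(mol))


keto = "CC(C)=O"  # acetone
enol = "C=C(C)O"  # propen-2-ol, the enol tautomer of acetone

# Plain canonical SMILES keeps the two forms apart ...
print(Chem.MolToSmiles(Chem.MolFromSmiles(keto)) ==
      Chem.MolToSmiles(Chem.MolFromSmiles(enol)))
# ... while the tautomer key should map both to one representative.
print(tautomer_key(keto) == tautomer_key(enol))
```

The representative chosen is a bookkeeping convention for matching records; it is not necessarily the form that dominates in solution.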

Best practices

Stereochemistry and tautomers are just two of the difficulties in determining when two compounds are the same. Packages such as MolVS provide tools for harmonizing lists of compounds according to these and other attributes. At a minimum, we recommend canonicalization and harmonizing tautomers, with other attributes (stereochemistry, fragments, charges, and isotopes) being considered in some cases. Ideally, the results of any workflow should be compared with two versions of the compound list:

  • Canonicalization and tautomer harmonization,
  • Canonicalization and tautomer, stereochemistry, fragment, charge, and isotope harmonization.

This is the best way to understand the impact of these compound effects on a particular dataset and specific application.
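
The two versions of the compound list could be produced along the following lines. This is a sketch under stated assumptions, not the workflow used for the results reported here: it uses RDKit's MolStandardize module in place of MolVS, the function names minimal_key and full_key are hypothetical, and the order of the steps is one reasonable choice among several.

```python
from rdkit import Chem
from rdkit.Chem.MolStandardize import rdMolStandardize

_tautomers = rdMolStandardize.TautomerEnumerator()
_fragments = rdMolStandardize.LargestFragmentChooser()
_uncharger = rdMolStandardize.Uncharger()


def _parse(smiles: str) -> Chem.Mol:
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"could not parse SMILES: {smiles!r}")
    return mol


def minimal_key(smiles: str) -> str:
    """Version 1: canonicalization and tautomer harmonization."""
    return Chem.MolToSmiles(_tautomers.Canonicalize(_parse(smiles)))


def full_key(smiles: str) -> str:
    """Version 2: also harmonize fragments, charges, isotopes, stereo."""
    mol = _parse(smiles)
    mol = _fragments.choose(mol)      # keep the largest fragment (drop counterions)
    mol = _uncharger.uncharge(mol)    # neutralize charges where possible
    for atom in mol.GetAtoms():
        atom.SetIsotope(0)            # clear isotope labels
    Chem.RemoveStereochemistry(mol)   # drop stereo annotations (in place)
    return Chem.MolToSmiles(_tautomers.Canonicalize(mol))


sodium_acetate = "CC(=O)[O-].[Na+]"
acetic_acid = "CC(=O)O"

# The salt and the free acid stay separate under version 1 ...
print(minimal_key(sodium_acetate) == minimal_key(acetic_acid))
# ... and should merge under version 2.
print(full_key(sodium_acetate) == full_key(acetic_acid))
```

Running a downstream workflow once with each key and comparing the outcomes shows how sensitive that workflow is to the extra harmonization steps.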

Raw
  SMILES: CCCCOC(=O)[C@@H](C)OC1:C:C:C(OC2:C:C:C(C(F)(F)F):C:N:2):C:C:1
  Notes: Contains aromatic bonds denoted by “:”. Contains a chiral center.

Canonicalized
  SMILES: CCCCOC(=O)[C@@H](C)Oc1ccc(Oc2ccc(C(F)(F)F)cn2)cc1
  Notes: Explicit aromatic bond symbols are removed, and the aromatic rings are written with lowercase atom symbols instead.

Tautomer Canonicalized
  SMILES: CCCCOC(=O)C(C)Oc1ccc(Oc2ccc(C(F)(F)F)cn2)cc1
  Notes: Chiral center is removed because both enantiomers around this center are interchangeable via keto-enol tautomerization.

Results on database overlap

We applied these harmonization steps to four publicly available databases:

The graph above shows how much compound overlap was found between databases. The raw data with unprocessed SMILES has the lowest overlap, with only 80 compounds found in common between the Therapeutic Target Database and CMap. By the time we apply the full set of harmonizations, there are 1372 compounds with results in both. This means that results from these two databases can be merged for 1372 compounds, instead of only 80.
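
The overlap count itself is a set intersection over whichever compound key is chosen. The sketch below is pure Python and purely illustrative: the two tiny "databases", their record IDs, and the stand-in harmonizer are hypothetical, and in practice the key function would be a real canonicalization or standardization routine.

```python
from typing import Callable, Dict, Set


def compound_keys(db: Dict[str, str], key: Callable[[str], str]) -> Set[str]:
    """Apply a key function to every SMILES in a database (id -> SMILES)."""
    return {key(smiles) for smiles in db.values()}


def overlap(db_a: Dict[str, str], db_b: Dict[str, str],
            key: Callable[[str], str] = lambda s: s) -> int:
    """Number of compounds shared by two databases under a given key."""
    return len(compound_keys(db_a, key) & compound_keys(db_b, key))


# Hypothetical records: each database stores caffeine in a different form.
db_a = {"A-001": "CN1C=NC2=C1C(=O)N(C(=O)N2C)C", "A-002": "CCO"}
db_b = {"B-417": "Cn1c(=O)c2c(ncn2C)n(C)c1=O", "B-418": "CC(C)=O"}

# Stand-in for a real harmonizer: a lookup that maps both caffeine strings
# to one key and leaves every other string unchanged.
_STAND_IN = {"CN1C=NC2=C1C(=O)N(C(=O)N2C)C": "Cn1c(=O)c2c(ncn2C)n(C)c1=O"}


def stand_in_harmonize(smiles: str) -> str:
    return _STAND_IN.get(smiles, smiles)


raw_overlap = overlap(db_a, db_b)                       # raw strings
harmonized_overlap = overlap(db_a, db_b, stand_in_harmonize)
print(raw_overlap, harmonized_overlap)
```

With raw strings the two databases appear to share nothing, while the harmonized key recovers the shared caffeine record, which is the same effect the database comparison shows at a much larger scale.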

These results demonstrate that molecular standardization is a vital preprocessing step for chemical datasets, especially when mapping between databases. At BioSymetrics, we dig into the details of preprocessing and develop workflows that automatically perform and evaluate different molecular standardization strategies. Using our unique platform, we develop end-to-end machine learning workflows that integrate and optimize preprocessing steps directly from raw datasets.