Data Integration in the Life Sciences

Volume 3615 of the series Lecture Notes in Computer Science, pp. 145-157

Assigning Unique Keys to Chemical Compounds for Data Integration: Some Interesting Counter Examples

  • Greeshma Neglur, Laboratory for Advanced Computing, University of Illinois at Chicago
  • Robert L. Grossman, Laboratory for Advanced Computing, University of Illinois at Chicago
  • Bing Liu, Department of Computer Science, University of Illinois at Chicago

Abstract

Integrating data involving chemical structures is simpler when unique identifiers (UIDs) can be associated with those structures; for example, the identifiers can serve as database keys. One common approach is to use the Unique SMILES notation introduced in [2]. The Unique SMILES algorithm views a chemical structure as a graph with atoms as nodes and bonds as edges, and uses a depth-first traversal of the graph to generate the SMILES string. The algorithm establishes a node ordering by using certain symmetry properties of the graph. In this paper, we present certain molecular graphs for which the algorithm fails to generate UIDs. Indeed, we show that different graphs in the same symmetry class employed by the Unique SMILES algorithm have different Unique SMILES IDs. We tested the algorithm on the National Cancer Institute (NCI) database [7] and found several molecular structures for which the algorithm also failed. We have also written a Python script that generates molecular graphs for which the algorithm fails.
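
To make the graph view concrete, the following is a minimal Python sketch of the two ingredients the abstract mentions: grouping atoms into symmetry classes and emitting a string by depth-first traversal. It is not the Unique SMILES algorithm of [2] and it is not the authors' script. The function names (`adjacency`, `symmetry_classes`, `dfs_string`), the choice of (element, degree) as the initial invariant, and the tie-breaking by input index are all assumptions made for illustration. The sketch handles only acyclic molecules with single bonds and suppressed hydrogens.

```python
from collections import defaultdict


def adjacency(bonds):
    """Build an undirected adjacency list from (atom_index, atom_index) pairs."""
    neighbors = defaultdict(list)
    for a, b in bonds:
        neighbors[a].append(b)
        neighbors[b].append(a)
    return neighbors


def _rank(invariants):
    """Map each distinct invariant to a dense integer rank (0, 1, 2, ...)."""
    index = {v: r for r, v in enumerate(sorted(set(invariants.values())))}
    return {i: index[v] for i, v in invariants.items()}


def symmetry_classes(atoms, bonds):
    """Assign a class rank to every atom by iterative refinement.

    Assumption: the initial invariant is (element symbol, degree); the
    published algorithm uses a richer invariant and a different update rule.
    """
    neighbors = adjacency(bonds)
    ranks = _rank({i: (atoms[i], len(neighbors[i])) for i in range(len(atoms))})
    while True:
        # Refine each atom's rank using the sorted ranks of its neighbors.
        refined = {
            i: (ranks[i], tuple(sorted(ranks[j] for j in neighbors[i])))
            for i in ranks
        }
        new_ranks = _rank(refined)
        # Stop once refinement no longer splits any class.
        if len(set(new_ranks.values())) == len(set(ranks.values())):
            return new_ranks
        ranks = new_ranks


def dfs_string(atoms, bonds):
    """Emit a SMILES-like string for an acyclic, single-bonded molecule."""
    neighbors = adjacency(bonds)
    ranks = symmetry_classes(atoms, bonds)

    def order(i):
        # Ties within a symmetry class are broken by input index, which is
        # an arbitrary choice of this sketch.
        return (ranks[i], i)

    def visit(node, parent):
        children = sorted((n for n in neighbors[node] if n != parent), key=order)
        out = atoms[node]
        for k, child in enumerate(children):
            sub = visit(child, node)
            # All branches except the last are wrapped in parentheses.
            out += sub if k == len(children) - 1 else "(" + sub + ")"
        return out

    start = min(range(len(atoms)), key=order)
    return visit(start, None)


# Propane: the two terminal carbons fall into the same symmetry class.
propane = symmetry_classes(["C", "C", "C"], [(0, 1), (1, 2)])
# Ethanol and isobutane as small traversal examples.
ethanol = dfs_string(["C", "C", "O"], [(0, 1), (1, 2)])
isobutane = dfs_string(["C", "C", "C", "C"], [(0, 1), (0, 2), (0, 3)])
```

In this sketch, atoms that share a rank are indistinguishable to the ordering step, so the traversal has to fall back on an arbitrary tie-break. Any scheme intended to yield a unique identifier must resolve such ties in a way that does not depend on how the input happens to be numbered.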