Assigning Unique Keys to Chemical Compounds for Data Integration: Some Interesting Counter Examples

  • Greeshma Neglur
  • Robert L. Grossman
  • Bing Liu
Part of the Lecture Notes in Computer Science book series (LNCS, volume 3615)

Abstract

Integrating data involving chemical structures is simplified when unique identifiers (UIDs) can be associated with chemical structures. For example, these identifiers can be used as database keys. One common approach is to use the Unique SMILES notation introduced in [2]. The Unique SMILES algorithm views a chemical structure as a graph with atoms as nodes and bonds as edges, and uses a depth-first traversal of the graph to generate the SMILES string. The algorithm establishes a node ordering by using certain symmetry properties of the graph. In this paper, we present certain molecular graphs for which the algorithm fails to generate UIDs. Indeed, we show that different graphs in the same symmetry class employed by the Unique SMILES algorithm have different Unique SMILES IDs. We tested the algorithm on the National Cancer Institute (NCI) database [7] and found several molecular structures for which the algorithm also failed. We have also written a Python script that generates molecular graphs for which the algorithm fails.
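
The symmetry-based node ordering mentioned above can be illustrated with a minimal Python sketch of the rank-refinement loop, steps (1) to (6) of the CANON extract in reference [9]. It is not the authors' script and not the Daylight implementation. The neighbor lists are read off the sample adjacency list in reference [8] with bond orders and atom attributes dropped; the initial invariant (number of heavy-atom neighbors), the tuple-based stable sort, and the names NEIGHBORS, PRIMES, dense_rank and symmetry_classes are assumptions made for illustration. Tie breaking (step 7) and SMILES generation are omitted.

```python
from math import prod

# Heavy-atom neighbor lists taken from the sample adjacency list in
# reference [8]; bond orders and atom attributes are ignored (assumption).
NEIGHBORS = {
    1: [2], 2: [1, 3], 3: [2, 4, 11], 4: [3, 5], 5: [4, 6, 8],
    6: [5, 7], 7: [6], 8: [5, 9], 9: [8, 10, 11], 10: [9], 11: [3, 9],
}

# One prime per possible rank; 12 primes are enough for this 11-atom graph.
PRIMES = [2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37]


def dense_rank(keys):
    """Map each node's sort key to a rank 1..k; equal keys share a rank."""
    order = {key: i + 1 for i, key in enumerate(sorted(set(keys.values())))}
    return {node: order[key] for node, key in keys.items()}


def symmetry_classes(neighbors):
    """Refine ranks until the partition stops changing (steps 1 to 6)."""
    # Step 1: initial invariant, simplified here to the neighbor count.
    ranks = dense_rank({n: len(nbrs) for n, nbrs in neighbors.items()})
    while True:
        # Step 2: product of the primes corresponding to neighbors' ranks.
        products = {
            n: prod(PRIMES[ranks[m] - 1] for m in nbrs)
            for n, nbrs in neighbors.items()
        }
        # Steps 3 and 4: the previous rank is the primary sort key, so the
        # new ranking can only split existing classes, never reorder them.
        new_ranks = dense_rank({n: (ranks[n], products[n]) for n in neighbors})
        # Step 5: stop once the partition is invariant.
        if new_ranks == ranks:
            return ranks
        ranks = new_ranks


def group_by_rank(ranks):
    """Collect the nodes of each class, ordered by rank."""
    groups = {}
    for node, r in sorted(ranks.items()):
        groups.setdefault(r, []).append(node)
    return [groups[r] for r in sorted(groups)]


if __name__ == "__main__":
    print(group_by_rank(symmetry_classes(NEIGHBORS)))
```

Under these assumptions the 11 atoms fall into seven classes: {1, 7}, {10}, {2, 6}, {8, 11}, {4}, {9} and {3, 5}. Atoms that share a class are the ones whose ties the full algorithm must still break in step (7), which is the stage at which a canonical ordering can go wrong.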

References

  1. Weininger, D.: SMILES, a Chemical Language and Information System 1: Introduction to Methodology and Encoding Rules, Medicinal Chemistry Project, Pomona College (1988)
  2. Weininger, D., Weininger, A., Weininger, J.L.: SMILES 2: Algorithm for Generation of Unique SMILES Notation, Daylight Chemical Information Systems, Irvine, California 92714 (1989). Note that although the Unique SMILES implementation has been changed by Daylight Chemical Information Systems, this appears to be the most recent publication describing the algorithm.
  3. Weininger, D.: SMILES 3: Depicting Graphical Depiction of Chemical Structures, Daylight Chemical Information Systems, New Orleans, Louisiana
  4. A SMILES to graph translation can be found at: http://www.daylight.com/daycgi/depict
  5. A SMILES to Unique SMILES translation can be found at: http://cactus.nci.nih.gov/services/translate/
  6. More counter examples can be found at the web site: http://ncdm171.lac.uic.edu/neglur/USMILES/USMILES.html
  7. NCI database, http://129.43.27.140/ncidb2/ (retrieved on March 2, 2005)
  8. Sample adjacency list used: {1:[['C',1,6,'0',3],[[1,2]]], 2:[['C',2,6,'0',2],[[1,1],[1,3]]], 3:[['C',4,6,'0',0],[[1,2],[2,4],[1,11]]], 4:[['C',3,6,'0',1],[[2,3],[1,5]]], 5:[['C',4,6,'0',0],[[1,4],[1,6],[2,8]]], 6:[['C',2,6,'0',2],[[1,5],[1,7]]], 7:[['C',1,6,'0',3],[[1,6]]], 8:[['C',3,6,'0',1],[[2,5],[1,9]]], 9:[['C',4,6,'0',0],[[1,8],[1,10],[2,11]]], 10:[['C',1,6,'0',3],[[1,9]]], 11:[['C',3,6,'0',1],[[1,3],[2,9]]]}
  9. CANON Algorithm (extract from reference [2]): (1) Set the atomic vector to initial invariants. Go to step 3. (2) Set vector to product of primes corresponding to neighbors' ranks. (3) Sort vector, maintaining stability over previous ranks. (4) Rank atomic vector. (5) If not invariant partitioning, go to step 2. (6) On first pass, save partitioning as symmetry classes. (7) If highest rank is smaller than number of nodes, break ties, go to step 2. (8) ... else done.
  10. See http://bioweb.dataspaceweb.org/chemicalKeys (retrieved on March 2, 2005)
  11. Beyer, T., Proskurowski, A.: Symmetries in graph coding. In: Proceedings of Northwest 1976 ACM–CIPS Pacific Regional Symposium, pp. 198–203 (1976)
  12. Böhm, C., Santolini, A.: A quasi-decision algorithm for the p-equivalence of two matrices. ICC Bull. 8(1), 57–69 (1964)
  13. IUPAC, Nomenclature of Organic Chemistry. Pergamon Press, Oxford (1979)
  14. Klin, M.H., Lebedev, O.V., Pivina, T.S., Zefirov, N.S.: Nonisomorphic cycles of maximum length in a series of chemical graphs and the problem of application of IUPAC nomenclature rules. MATCH 27, 133–151 (1992)
  15.
  16. Randic, M., Brissey, G.M., Wilkins, C.L.: Computer perception of topological symmetry via canonical numbering of atoms. Journal of Chemical Information and Computer Sciences 21(1), 52–59 (1981)
  17. McKay, B.: Practical Graph Isomorphism. Congr. Numer. 30, 45–87 (1981)
  18. Morgan, H.L.: The Generation of a Unique Machine Description for Chemical Structures – A Technique Developed at Chemical Abstracts Service. J. Chem. Doc. 5, 107–113 (1965)
  19. Braun, J., Gugisch, R., Kerber, A., Laue, R., Meringer, M., Rücker, C.: MOLGEN-CID, A Canonizer for Molecules and Graphs Accessible through the Internet. Journal of Chemical Information and Computer Sciences 44, 542–548 (2004)
  20. Grossman, R., Hamelberg, D., Kasturi, P., Liu, B.: Experimental Studies of the Universal Chemical Key (UCK) Algorithm on the NCI Database of Chemical Compounds. In: Proceedings of the 2003 IEEE Computer Society Bioinformatics Conference (CSB 2003), pp. 244–250. IEEE Computer Society, Los Alamitos (2003)

Copyright information

© Springer-Verlag Berlin Heidelberg 2005

Authors and Affiliations

  • Greeshma Neglur (1)
  • Robert L. Grossman (1)
  • Bing Liu (2)
  1. Laboratory for Advanced Computing, University of Illinois at Chicago, Chicago, USA
  2. Department of Computer Science, University of Illinois at Chicago, Chicago, USA
