Journal of Computer-Aided Molecular Design, Volume 25, Issue 5, pp 427–441

Drug discovery using very large numbers of patents. General strategy with extensive use of match and edit operations

  • Barry Robson
  • Jin Li
  • Richard Dettinger
  • Amanda Peters
  • Stephen K. Boyer


A patent database of 6.7 million compounds, generated by a very high performance computer (Blue Gene), requires new techniques for exploitation when extensive use of chemical similarity is involved. Such exploitation includes the taxonomic classification of chemical themes and data mining to assess mutual information between themes and companies. Importantly, we also launch candidates that evolve by “natural selection”: they are selected for failure to partially match the patent database and for their ability to bind appropriately to the protein target, as assessed by simulation on Blue Gene. An unusual feature of our method is that algorithms and workflows rely on dynamic interaction between match-and-edit instructions, which in practice are regular expressions. Similarity testing with these instructions uses SMILES strings and, less frequently, graph or connectivity representations. Examining how this performs in high throughput, we note that chemical similarity and novelty are human concepts whose meaning largely derives from their utility in specific contexts. For some purposes, mutual information involving chemical themes might be a better concept.
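To make the match-and-edit idea concrete, the following is a minimal sketch and not the authors' implementation. It assumes that an edit instruction can be modelled as a regular-expression substitution on a SMILES string, and that a "partial match" against the patent database can be modelled as a regular-expression search for a patented theme. The names `EDIT_RULES`, `PATENTED_THEMES`, `mutate` and `survives` are hypothetical, as are the rules and the theme, which were chosen only for illustration. The binding simulation described in the abstract is not represented here.

```python
import re

# Hypothetical edit rules: (pattern to match, replacement text).
# These operate on SMILES as plain text, not on the chemical graph.
EDIT_RULES = [
    (re.compile(r"Cl"), "F"),                # swap a chlorine for a fluorine
    (re.compile(r"C\(=O\)O$"), "C(=O)N"),    # terminal carboxylic acid -> amide
]

# Hypothetical patented "theme", written as a regular expression over SMILES.
PATENTED_THEMES = [
    re.compile(r"c1ccccc1C\(=O\)O"),         # benzoic-acid-like text fragment
]


def mutate(smiles):
    """Return the set of strings produced by applying each rule once."""
    children = set()
    for pattern, replacement in EDIT_RULES:
        edited, count = pattern.subn(replacement, smiles, count=1)
        if count:  # keep only rules that actually matched
            children.add(edited)
    return children


def survives(smiles):
    """A candidate survives if no patented theme matches any part of it."""
    return not any(theme.search(smiles) for theme in PATENTED_THEMES)


seed = "CC(=O)Oc1ccccc1C(=O)O"  # aspirin, used here as a toy starting point
candidates = {child for child in mutate(seed) if survives(child)}
print(candidates)
```

Text-level matching of this kind is only meaningful if equivalent structures are written consistently, since one molecule can have many valid SMILES strings; the sketch sidesteps that issue by using a single hand-written seed.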


Patent analytics · Drug discovery · Ligand design · Regular expression · Similarity




Copyright information

© Springer Science+Business Media B.V. 2011

Authors and Affiliations

  • Barry Robson (1)
  • Jin Li (2)
  • Richard Dettinger (3)
  • Amanda Peters (4)
  • Stephen K. Boyer (5)

  1. St Matthews University School of Medicine, Grand Cayman, Cayman Islands; The University of Wisconsin-Stout, Menomonie, USA
  2. Global Compound Sciences, AstraZeneca R&D, Macclesfield, UK
  3. Prentice, Rochester, USA
  4. Department of Physics, Harvard University, Cambridge, USA
  5. Collabra Inc., San Jose, USA
