Abstract
Millions of existing patent documents and journal articles dealing with chemistry describe chemical structures by way of structure images (so-called Kekulé structures). Although human-readable, these structure images cannot be interpreted by a computer and are therefore unusable in most chemoinformatics applications, such as structure and substructure searches or chemo-biological property calculations. Many formats exist for storing structural information in computer-readable form, but converting millions of images by hand is cumbersome and time-consuming. There is therefore a need for an automatic tool that converts images into structures. One of the first such tools, OROCS, was presented at ICDAR in 1993. We present modern developments in optical structure recognition that build upon these earlier ideas and add enhancements to two steps: the automatic extraction of structure images from the surrounding text and graphics, and the conversion of the extracted images into a molecular format. We describe in detail two top-performing chemical OCR applications: one open source and one academic software package. Performance was judged by the TREC-CHEM 2011 and CLEF 2012 challenges.
References
Casey R, Boyer S, Healey P, Miller A, Oudot B, Zilles Z (1993) Optical recognition of chemical graphics. In: Proceedings of the international conference on document analysis and recognition, pp 627–632
McDaniel J, Balmuth J (1992) Kekule - OCR optical chemical (structure) recognition. J Chem Inf Comput Sci 32:373–378
Contreras M, Allendes C, Alvarez L, Rozas R (1990) Computational perception and recognition of digitized molecular structures. J Chem Inf Comput Sci 30:302–307
Ibison P, Jacquot M, Kam F, Neville A, Simpson R, Tonnelier C, Venczel T, Johnson A (1993) Chemical literature data extraction - the CLiDE project. J Chem Inf Comput Sci 33:338–344
Park J, Rosania G, Shedden K, Nguyen M, Lyu N, Saitou K (2009) Automated extraction of chemical structure information from digital raster images. Chem Cent J 3(1):4
Zimmermann M, Thi L, Hofmann M (2005) Combating illiteracy in chemistry: towards computer-based chemical structure reconstruction. ERCIM News 60:40–41
Zimmermann M (2006) Large scale evaluation of chemical structure recognition. In: Proceedings of the 4th text mining symposium in life sciences
Filippov IV, Nicklaus MC (2009) Optical structure recognition software to recover chemical information: OSRA, an open source solution. J Chem Inf Model 49(3):740–743
Valko AT, Johnson AP (2009) CLiDE Pro: the latest generation of CLiDE, a tool for optical chemical structure recognition. J Chem Inf Model 49(4):780–787
Sadawi NM, Sexton AP, Sorge V (2012) Chemical structure recognition: a rule based approach. In: Viard-Gaudin C, Zanibbi R (eds) 19th Document recognition and retrieval conference (DRR 2012), SPIE, Bellingham
Cychosz JM (1994) Efficient binary image thinning using neighborhood maps. In: Graphics gems IV. Academic, San Diego, pp 465–473
Otsu N (1979) A threshold selection method from gray-level histograms. IEEE Trans Syst Man Cybern 9:62–66
Guo Z, Hall RW (1989) Parallel thinning with two subiteration algorithms. Commun ACM 32(3):359–373
Douglas DH, Peucker TK (1973) Algorithms for the reduction of the number of points required to represent a digitized line or its caricature. Cartographica 10(2):112–122
Jain A, Trier D, Taxt T (1996) Feature extraction methods for character recognition: a survey. Pattern Recogn 29(4):641–662
Accelrys (2011) CTfile format. http://accelrys.com/products/collaborative-science/biovia-draw/ctfile-no-fee.html
O’Boyle NM, Banck M, James CA, Morley C, Vandermeersch T, Hutchison GR (2011) Open Babel: an open chemical toolbox. J Cheminform 3:33
Lupu M, Jiashu Z, Huang J, Gurulingappa H, Filippov I, Tait J (2011) Overview of the TREC 2011 chemical IR track. In: Proceedings of TREC
Piroi F, Lupu M, Hanbury A, Sexton A, Magdy W, Filippov I (2012) CLEF-IP 2012: retrieval experiments in the intellectual property domain. In: Working notes of CLEF
Heifets A, Jurisica I (2012) SCRIPDB: a portal for easy access to syntheses, chemicals and reactions in patents. Nucleic Acids Res 40(D1):D428–D433. doi:10.1093/nar/gkr919
Wendling L, Tabbone S (2003) Recognition of arrows in line drawings based on the aggregation of geometric criteria using the Choquet integral. In: Seventh international conference on document analysis and recognition–ICDAR 2003, Edinburgh, pp 299–303
Copyright information
© 2017 Springer-Verlag GmbH Germany
About this chapter
Cite this chapter
Filippov, I.V., Lupu, M., Sexton, A.P. (2017). Modern Approaches to Chemical Image Recognition. In: Lupu, M., Mayer, K., Kando, N., Trippe, A. (eds) Current Challenges in Patent Information Retrieval. The Information Retrieval Series, vol 37. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-662-53817-3_14
DOI: https://doi.org/10.1007/978-3-662-53817-3_14
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-662-53816-6
Online ISBN: 978-3-662-53817-3
eBook Packages: Computer Science (R0)