Chemical Text Mining for Lead Discovery

  • Muthukumarasamy Karthikeyan
  • Renu Vyas


With the growth of the Internet, the volume of information available in public resources has expanded enormously. New tools are needed to navigate documents automatically, word by word, to extract useful patterns, concepts, and knowledge, and to uncover conclusions that are not explicitly stated in a document. Recently, developers and scientists in computational linguistics have devised several text-mining tools and techniques that process natural language and convert its information content into facts and data for interpretation, analysis, and prediction. Text mining draws on data mining, information retrieval, natural language processing (NLP), and machine learning (ML) methods. It provides researchers with metadata for identifying meaningful associations among the terms prevalent in their respective domains. It thereby helps to establish meaning, context, and semantics; to identify hidden concepts and trends; and to discover previously unknown relationships and correlations in the largely fragmented, unstructured, and scattered information in the public domain. In this chapter, we introduce the general concept of text mining and then describe its features and tools, especially those for handling the biomedical and chemical literature relevant to drug and lead discovery, which includes over 22.9 million abstracts in PubMed. The emphasis is on building and using simple text-mining tools in a practical way, harnessing open-source and commercially available software, and on understanding the overall strategic challenges in this field. An open-source-based tool, MegaMiner, has been developed for mining chemically significant literature and can be used effectively to solve chemoinformatics problems related to lead discovery. Given a text-based query submitted to a distributed computing platform, MegaMiner can directly predict lead molecules for a target disease of interest.


Text-mining; Clustering; Stemming; Chemoinformatics; Lead discovery; MegaMiner; Open-source tools
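
As a rough illustration of the "simple text-mining tools" the abstract refers to, the following minimal Python sketch tokenizes a sentence, removes stop words, applies a crude suffix-stripping stemmer, and counts the resulting terms. It is not the MegaMiner implementation: the stop-word list, the suffix rules, the sample sentence, and all function names are assumptions made purely for illustration, and a real pipeline would use an established stemmer and a curated stop-word list.

```python
import re
from collections import Counter

# Illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"a", "an", "and", "the", "of", "in", "for", "to", "is", "are"}

# Suffixes removed by the crude stemmer below (an assumption, not a real
# stemming algorithm such as Porter's).
SUFFIXES = ("ing", "ed", "s")


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def crude_stem(token):
    """Strip the first matching suffix if at least three letters remain."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def term_frequencies(text):
    """Tokenize, drop stop words, stem, and count the remaining terms."""
    tokens = [t for t in tokenize(text) if t not in STOP_WORDS]
    return Counter(crude_stem(t) for t in tokens)


# Hypothetical sample sentence, not taken from the chapter.
sample = ("Screening of kinase inhibitors: "
          "the screened inhibitors yielded two leads.")
frequencies = term_frequencies(sample)
print(frequencies.most_common(3))
```

Here "screening" and "screened" collapse to the same stem, which is the point of stemming: term counts then reflect concepts rather than surface forms. The suffix rules are deliberately naive and will mis-stem many words.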



Copyright information

© Springer India 2014

Authors and Affiliations

  1. Digital Information Resource Centre, National Chemical Laboratory, Pune, India
  2. Scientist (DST), Division of Chemical Engineering and Process Development, National Chemical Laboratory, Pune, India
