Practical Chemoinformatics

pp 415-449


Chemical Text Mining for Lead Discovery

  • Muthukumarasamy Karthikeyan, Digital Information Resource Centre, National Chemical Laboratory (corresponding author)
  • Renu Vyas, Scientist (DST), Division of Chemical Engineering and Process Development, National Chemical Laboratory



With the growth of the Internet, the information available in public resources has expanded enormously. New tools are needed to navigate documents automatically, word by word, in order to extract useful patterns, concepts, and knowledge, or to discover something that is not explicitly stated in a document and draw useful conclusions from it. Recently, computational linguistics developers and scientists have devised several text-mining tools and techniques that process natural language and convert its information content into facts and data for interpretation, analysis, and prediction. Text mining combines data mining, information retrieval, natural language processing (NLP), and machine learning (ML) methods. It provides researchers with metadata for ascertaining meaningful associations among the terms prevalent in their respective domains. It thus aids in finding meaning, context, and semantics; in identifying hidden concepts and trends; and in discovering hitherto unknown relationships and correlations in the largely fragmented, unstructured, and scattered information in the public domain. In this chapter, we present the general concept of text mining, followed by its features and tools, especially those for handling the biomedical and chemical literature relevant to drug/lead discovery, which is available in over 22.9 million abstracts in PubMed. The emphasis is on building and using simple text-mining tools in a practical way by harnessing open-source and commercially available tools, and on understanding the overall strategic challenges in this field. We have developed MegaMiner, an open-source-based tool for mining literature of chemical significance that can be used effectively to solve chemoinformatics problems related to lead discovery. Given a text-based query submitted on a distributed computing platform, MegaMiner can directly predict lead molecules for a target disease of interest.


Keywords: Text-mining, Clustering, Stemming, Chemoinformatics, Lead discovery, MegaMiner, Open-source tools