Interactive Document Indexing Method Based on Explicit Semantic Analysis

  • Andrzej Janusz
  • Wojciech Świeboda
  • Adam Krasuski
  • Hung Son Nguyen
Part of the Lecture Notes in Computer Science book series (LNCS, volume 7413)


In this article we propose a general framework for semantic indexing and search of texts within scientific document repositories. In our approach, a semantic interpreter, which can be seen as a tool for automatic tagging of textual data, is interactively updated based on user feedback to improve the quality of the tags that it produces. In our experiments, we index our document corpus using the Explicit Semantic Analysis (ESA) method. In this algorithm, an external knowledge base is used to measure relatedness between words and concepts, and those assessments are used to assign meaningful concepts to given texts. We explain how the weights expressing relations between particular words and concepts can be improved through interaction with users or by employing expert knowledge. We also present results of experiments on a document corpus acquired from the PubMed Central repository to demonstrate the feasibility of our approach.
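
The idea of scoring concepts from word-concept weights, and then adjusting those weights from feedback, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration rather than the authors' implementation: the tiny `WEIGHTS` table, the concept names, the additive scoring in `interpret`, and in particular the fixed-step update in `apply_feedback` (the abstract does not specify the update rule).

```python
from collections import Counter

# Hypothetical word-to-concept association weights. In ESA these would be
# derived from an external knowledge base; the values here are made up.
WEIGHTS = {
    "insulin": {"Diabetes Mellitus": 0.9, "Pancreas": 0.6},
    "glucose": {"Diabetes Mellitus": 0.7, "Metabolism": 0.5},
    "tumor": {"Neoplasms": 0.95},
}


def interpret(text, weights, top_k=2):
    """Return the top_k (concept, score) pairs for a text.

    A concept's score is the sum, over the words of the text, of
    term frequency times the word-concept weight.
    """
    counts = Counter(text.lower().split())
    scores = {}
    for word, tf in counts.items():
        for concept, w in weights.get(word, {}).items():
            scores[concept] = scores.get(concept, 0.0) + tf * w
    # Sort by descending score, breaking ties alphabetically.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:top_k]


def apply_feedback(weights, text, concept, accepted, rate=0.1):
    """Assumed update rule: shift existing weights by a fixed step.

    If the user accepts the tag, weights between the words of the text and
    the concept grow by `rate`; if the user rejects it, they shrink by
    `rate`, but never below zero. The table is modified in place.
    """
    delta = rate if accepted else -rate
    for word in set(text.lower().split()):
        if word in weights and concept in weights[word]:
            weights[word][concept] = max(0.0, weights[word][concept] + delta)
```

For example, `interpret("insulin glucose glucose", WEIGHTS)` ranks "Diabetes Mellitus" first and "Metabolism" second, and a call such as `apply_feedback(WEIGHTS, "insulin glucose", "Diabetes Mellitus", accepted=False)` lowers the corresponding weights for later queries.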


Semantic Search, Interactive Learning, Explicit Semantic Analysis, PubMed, MeSH





Copyright information

© Springer-Verlag Berlin Heidelberg 2012

Authors and Affiliations

  • Andrzej Janusz (1)
  • Wojciech Świeboda (1)
  • Adam Krasuski (1, 2)
  • Hung Son Nguyen (1)
  1. Faculty of Mathematics, Informatics and Mechanics, The University of Warsaw, Warsaw, Poland
  2. Chair of Computer Science, The Main School of Fire Service, Warsaw, Poland
