LiveDoc: Showing Contextual Information Using Topic Modeling Techniques

  • Jayati Deshmukh
  • K. M. Annervaz
  • Shubhashis Sengupta
  • Neetu Pathak
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9729)

Abstract

We present a solution named LiveDoc, which augments natural language text documents with relevant contextual background information. This background information helps readers to understand the context of the discourse better by fetching relevant information from other sources such as Wikipedia. Often the readers do not possess all background and supplementary information required for comprehending the purport of a narrative such as a news op-ed article. At the same time, it is not possible for authors to provide all contextual information while addressing a particular topic. LiveDoc processes the information in a document; uses extracted entities to fetch relevant background information in the context of the document from various sources (as defined by user) using semantic matching and topic modeling techniques like Latent Dirichlet Allocation and Hierarchical Dirichlet Process; and presents the background information to the user by augmenting the original document with the fetched information. Reader is then equipped better to understand the document with this additional background information. We present the effectiveness of our solution through extensive experimentation and associated results.

Keywords

Information retrieval Topic modeling Natural language processing Data contextualization Latent Dirichlet Allocation Hierarchical Dirichlet Process 

Preview

Unable to display preview. Download preview PDF.

Unable to display preview. Download preview PDF.

References

  1. 1.
    Bird, S.: NLTK: the natural language toolkit. In: Proceedings of the COLING/ACL on Interactive Presentation Sessions, pp. 69–72. Association for Computational Linguistics (2006)Google Scholar
  2. 2.
    Bishop, C.M.: Pattern recognition and machine learning. Springer (2006)Google Scholar
  3. 3.
    Blei, D.M.: Probabilistic topic models. Communications of the ACM 55(4), 77–84 (2012)MathSciNetCrossRefGoogle Scholar
  4. 4.
    Blei, D.M., Ng, A.Y., Jordan, M.I.: Latent dirichlet allocation. The Journal of Machine Learning Research 3, 993–1022 (2003)MATHGoogle Scholar
  5. 5.
    Budanitsky, A., Hirst, G.: Semantic distance in wordnet: An experimental, application-oriented evaluation of five measures (2001)Google Scholar
  6. 6.
    Bunescu, R.C., Mooney, R.J.: A shortest path dependency kernel for relation extraction. In: Proceedings of the Conference on Human Language Technology and Empirical Methods in Natural Language Processing, pp. 724–731. Association for Computational Linguistics (2005)Google Scholar
  7. 7.
    Cassidy, T., Ji, H., Ratinov, L.A., Zubiaga, A., Huang, H.: Analysis and enhancement of wikification for microblogs with context expansion. In: COLING, vol. 12, pp. 441–456 (2012)Google Scholar
  8. 8.
    Cucerzan, S.: Large-scale named entity disambiguation based on wikipedia data. In: EMNLP-CoNLL, vol. 7, pp. 708–716 (2007)Google Scholar
  9. 9.
    Dumais, S.T.: Latent semantic analysis. Annual Review of Information Science and Technology 38(1), 188–230 (2004)CrossRefGoogle Scholar
  10. 10.
    Feldman, R., Sanger, J.: The text mining handbook: advanced approaches in analyzing unstructured data. Cambridge University Press (2007)Google Scholar
  11. 11.
    Ferragina, P., Scaiella, U.: Tagme: On-the-fly annotation of short text fragments (by wikipedia entities). In: Proceedings of the 19th ACM International Conference on Information and Knowledge Management, CIKM 2010, pp. 1625–1628 (2010). http://doi.acm.org/10.1145/1871437.1871689
  12. 12.
    Finkel, J.R., Grenager, T., Manning, C.: Incorporating non-local information into information extraction systems by gibbs sampling. In: Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics, pp. 363–370. Association for Computational Linguistics (2005)Google Scholar
  13. 13.
    Greene, D., Cunningham, P.: Practical solutions to the problem of diagonal dominance in kernel document clustering. In: Proc. 23rd International Conference on Machine learning (ICML 2006), pp. 377–384. ACM Press (2006)Google Scholar
  14. 14.
    GuoDong, Z., Jian, S., Jie, Z., Min, Z.: Exploring various knowledge in relation extraction. In: Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics, pp. 427–434. Association for Computational Linguistics (2005)Google Scholar
  15. 15.
    Hearst, M.A.: Texttiling: Segmenting text into multi-paragraph subtopic passages. Computational Linguistics 23(1), 33–64 (1997)Google Scholar
  16. 16.
    Jiang, J.J., Conrath, D.W.: Semantic similarity based on corpus statistics and lexical taxonomy. CoRR cmp-lg/9709008 (1997)Google Scholar
  17. 17.
    Kulkarni, S., Singh, A., Ramakrishnan, G., Chakrabarti, S.: Collective annotation of wikipedia entities in web text. In: Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD 2009, pp. 457–466 (2009). http://doi.acm.org/10.1145/1557019.1557073
  18. 18.
    Liu, X., Li, Y., Wu, H., Zhou, M., Wei, F., Lu, Y.: Entity linking for tweets. In: ACL (1), pp. 1304–1311 (2013)Google Scholar
  19. 19.
    Mihalcea, R.: Using wikipedia for automatic word sense disambiguation. In: HLT-NAACL, pp. 196–203 (2007)Google Scholar
  20. 20.
    Mihalcea, R., Csomai, A.: Wikify!: linking documents to encyclopedic knowledge. In: Proceedings of the Sixteenth ACM Conference on Information and Knowledge Management, pp. 233–242. ACM (2007)Google Scholar
  21. 21.
    Milne, D., Witten, I.H.: Learning to link with wikipedia. In: Proceedings of the 17th ACM Conference on Information and Knowledge Management, pp. 509–518. ACM (2008)Google Scholar
  22. 22.
    Nadeau, D., Sekine, S.: A survey of named entity recognition and classification. Lingvisticae Investigationes 30(1), 3–26 (2007)CrossRefGoogle Scholar
  23. 23.
    Nenkova, A., McKeown, K.: A survey of text summarization techniques. In: Mining Text Data, pp. 43–76. Springer (2012)Google Scholar
  24. 24.
    Řehůřek, R., Sojka, P.: Software framework for topic modelling with large corpora. In: Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, pp. 45–50. ELRA, Valletta. http://is.muni.cz/publication/884893/en
  25. 25.
    Sutton, C., McCallum, A.: An introduction to conditional random fields for relational learning. Introduction to Statistical Relational Learning, 93–128 (2006)Google Scholar
  26. 26.
    Teh, Y.W., Jordan, M.I., Beal, M.J., Blei, D.M.: Hierarchical dirichlet processes. Journal of the American Statistical Association 101(476) (2006)Google Scholar
  27. 27.
    Wang, T., Li, Y., Bontcheva, K., Cunningham, H., Wang, J.: Automatic extraction of hierarchical relations from text. Springer (2006)Google Scholar
  28. 28.
    Zelenko, D., Aone, C., Richardella, A.: Kernel methods for relation extraction. The Journal of Machine Learning Research 3, 1083–1106 (2003)MathSciNetMATHGoogle Scholar
  29. 29.
    Zhou, Y., Nie, L., Rouhani-Kalleh, O., Vasile, F., Gaffney, S.: Resolving surface forms to wikipedia topics. In: Proceedings of the 23rd International Conference on Computational Linguistics, pp. 1335–1343. Association for Computational Linguistics (2010)Google Scholar
  30. 30.
    Zhu, J., Nie, Z., Liu, X., Zhang, B., Wen, J.R.: Statsnowball: a statistical approach to extracting entity relationships. In: Proceedings of the 18th International Conference on World Wide Web, pp. 101–110. ACM (2009)Google Scholar

Copyright information

© Springer International Publishing Switzerland 2016

Authors and Affiliations

  • Jayati Deshmukh
    • 1
  • K. M. Annervaz
    • 1
  • Shubhashis Sengupta
    • 1
  • Neetu Pathak
    • 1
  1. 1.Accenture Technology LabsBangaloreIndia

Personalised recommendations