Building semantic kernels for cross-document knowledge discovery using Wikipedia

  • Regular Paper
  • Knowledge and Information Systems

Abstract

Research into text mining has progressed over the past decade. One of the main challenges now is gauging the difficulty of taking advantage of outside knowledge in the discovery process. In this work, to address the limitations of the traditional bag-of-words model and expand the search scope beyond the document collections at hand, we present a new text mining approach that incorporates Wikipedia as background knowledge. Various semantic kernels are built from the extensive knowledge derived from Wikipedia and applied to the search scenario of detecting potential semantic relationships between topics. We demonstrate the effectiveness of our approach by comparing it with competitive baselines, as well as with alternative solutions in which only part of the Wikipedia resources (e.g., the Wiki-article contents or the associated Wiki-categories) is considered.
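To make the contrast with the bag-of-words model concrete, the following is a minimal sketch of a generic semantic kernel, in which term vectors are projected into a concept space before taking the inner product. It is not the construction used in the article: the vocabulary, the concepts, the term-to-concept weights in S, and all function names are hypothetical and chosen only for illustration.

```python
# Illustrative sketch only: the vocabulary, concepts, weights, and function
# names below are hypothetical and are not taken from the article.

def bow_kernel(d1, d2):
    """Plain bag-of-words kernel: dot product of two term-count vectors."""
    return sum(a * b for a, b in zip(d1, d2))

def project(doc, term_concept):
    """Map a term vector into concept space: phi(d) = d S."""
    n_concepts = len(term_concept[0])
    return [
        sum(doc[t] * term_concept[t][c] for t in range(len(doc)))
        for c in range(n_concepts)
    ]

def semantic_kernel(d1, d2, term_concept):
    """Semantic kernel k(d1, d2) = (d1 S) . (d2 S)."""
    return bow_kernel(project(d1, term_concept), project(d2, term_concept))

# Hypothetical vocabulary: ["car", "automobile", "banana"]
# Hypothetical concepts:   ["Vehicle", "Fruit"]
S = [
    [1.0, 0.0],  # car        -> Vehicle
    [1.0, 0.0],  # automobile -> Vehicle
    [0.0, 1.0],  # banana     -> Fruit
]

doc_a = [1, 0, 0]  # mentions only "car"
doc_b = [0, 1, 0]  # mentions only "automobile"
doc_c = [0, 0, 1]  # mentions only "banana"

bow_ab = bow_kernel(doc_a, doc_b)          # the two documents share no terms
sem_ab = semantic_kernel(doc_a, doc_b, S)  # both map to the same concept
sem_ac = semantic_kernel(doc_a, doc_c, S)  # the concepts do not overlap
```

In this toy setting the bag-of-words kernel sees no overlap between doc_a and doc_b, while the semantic kernel relates them through the shared concept. In a Wikipedia-based setting, the role of S would presumably be played by weights derived from article contents or categories, but that mapping is an assumption here.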

Acknowledgments

This research work is supported in part by the NSF Grant (IIS-1452898) and the NSF/North Dakota EPSCoR IIP Seed Grant (EPS-0814442).

Author information

Correspondence to Peng Yan.

About this article

Cite this article

Yan, P., Jin, W. Building semantic kernels for cross-document knowledge discovery using Wikipedia. Knowl Inf Syst 51, 287–310 (2017). https://doi.org/10.1007/s10115-016-0973-5

  • DOI: https://doi.org/10.1007/s10115-016-0973-5
