Abstract
Research into text mining has progressed over the past decade. One of the main challenges now is making effective use of outside knowledge in the discovery process. In this work, to address the limitations of the traditional bag-of-words model and expand the search scope beyond the document collections at hand, we present a new text mining approach that incorporates Wikipedia as background knowledge. Various semantic kernels are built from the extensive knowledge derived from Wikipedia and applied to the search scenario of detecting potential semantic relationships between topics. We demonstrate the effectiveness of our approach by comparing it with competitive baselines, as well as with alternative solutions that consider only part of the Wikipedia resources (e.g., the Wiki-article contents or the associated Wiki-categories).
Acknowledgments
This research work is supported in part by NSF Grant IIS-1452898 and NSF/North Dakota EPSCoR IIP Seed Grant EPS-0814442.
Cite this article
Yan, P., Jin, W. Building semantic kernels for cross-document knowledge discovery using Wikipedia. Knowl Inf Syst 51, 287–310 (2017). https://doi.org/10.1007/s10115-016-0973-5