An Exploration of Learning to Link with Wikipedia: Features, Methods and Training Collection

  • Jiyin He
  • Maarten de Rijke
Part of the Lecture Notes in Computer Science book series (LNCS, volume 6203)

Abstract

We describe our participation in the Link-the-Wiki track at INEX 2009. We apply machine learning methods to the anchor-to-best-entry-point task and explore the impact of three aspects of our approach: the features, the learning methods, and the collection used to train the models. We find that a learning-to-rank approach and a binary classification approach do not differ much. The new Wikipedia collection, which is larger and contains more links than the collection used previously, provides better training material for our models. In addition, a heuristic run that combines the two intuitively most useful features outperforms the machine-learning-based runs, which suggests that further analysis and selection of features is necessary.

Copyright information

© Springer-Verlag Berlin Heidelberg 2010

Authors and Affiliations

  • Jiyin He (1)
  • Maarten de Rijke (1)
  1. ISLA, University of Amsterdam, Amsterdam, The Netherlands
