Multi-label Wikipedia Classification with Textual and Link Features

  • Boris Chidlovskii
Part of the Lecture Notes in Computer Science book series (LNCS, volume 6203)

Abstract

We address the problem of categorizing a large set of linked documents in which both content and structure matter, in particular the Wikipedia collection proposed at the INEX 2009 XML Mining challenge. We analyze the network of collection pages and turn it into features for classification. We combine the content-based and link-based features of pages to train an accurate categorizer for unlabelled pages. In the multi-label setting, we revise a number of existing techniques and test some that scale well. We report evaluation results obtained with a variety of learning methods and techniques on the training set of the Wikipedia corpus.
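As an illustration only, the idea of combining content-based and link-based features into one page representation can be sketched as follows. This is not the paper's method: the function names, the toy pages, and the use of simple in-degree and out-degree counts as link features are all assumptions made for this example. A real system would likely use richer graph features and a proper text vectorizer.

```python
from collections import Counter


def link_features(pages, links):
    """Return {page: (in_degree, out_degree)} for a directed list of (src, dst) links."""
    in_deg = Counter(dst for _, dst in links)
    out_deg = Counter(src for src, _ in links)
    return {p: (in_deg[p], out_deg[p]) for p in pages}


def text_features(text, vocabulary):
    """Return term counts for the given vocabulary (a toy bag-of-words)."""
    counts = Counter(text.lower().split())
    return [counts[w] for w in vocabulary]


def page_representation(pages, links, vocabulary):
    """Concatenate content features and link features into one vector per page."""
    graph = link_features(pages, links)
    return {
        p: text_features(text, vocabulary) + list(graph[p])
        for p, text in pages.items()
    }


# Hypothetical toy collection: three pages and the links between them.
pages = {
    "A": "graph theory and graph algorithms",
    "B": "history of france",
    "C": "graph drawing",
}
links = [("A", "C"), ("B", "A"), ("C", "A")]
vocabulary = ["graph", "history"]

features = page_representation(pages, links, vocabulary)
for page, vector in features.items():
    print(page, vector)
```

Each resulting vector holds the term counts followed by the two degree counts. In a multi-label setting, such vectors could be passed to one binary classifier per category, which is one common baseline among several.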

Keywords

Betweenness Centrality; Graph Feature; Transductive Support Vector Machine; Collection Page; Page Representation
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

Copyright information

© Springer-Verlag Berlin Heidelberg 2010

Authors and Affiliations

  • Boris Chidlovskii, Xerox Research Centre Europe, Meylan, France
