Wikipedia Mining for an Association Web Thesaurus Construction

  • Kotaro Nakayama
  • Takahiro Hara
  • Shojiro Nishio
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4831)


Wikipedia has become a huge phenomenon on the WWW. As a corpus for knowledge extraction, it has several notable characteristics: a huge number of articles, live updates, a dense link structure, brief link texts, and URL identification for concepts. In this paper, we propose an efficient link mining method, pfibf (Path Frequency - Inversed Backward link Frequency), and an extension, "forward / backward link weighting (FB weighting)", in order to construct a huge-scale association thesaurus. We demonstrate the effectiveness of the proposed methods by comparison with conventional methods such as co-occurrence analysis and TF-IDF.
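The abstract names pfibf but does not give its formula, so the following is only a rough sketch of what a "path frequency times inversed backward link frequency" score could look like, by analogy with TF-IDF. Everything in it is an assumption made for illustration: the function names, the 1/length path weighting, the `max_len` cutoff, the logarithmic form of the backward link term, and the toy link graph. It should not be read as the authors' definition.

```python
import math


def backward_link_counts(links):
    """Count how many distinct pages link to each page (assumed 'bf')."""
    counts = {page: 0 for page in links}
    for targets in links.values():
        for dst in set(targets):
            counts[dst] = counts.get(dst, 0) + 1
    return counts


def path_frequency(links, source, max_len=2):
    """Assumed 'pf': sum 1/length over simple paths of at most max_len hops."""
    scores = {}

    def walk(node, visited, depth):
        if depth == max_len:
            return
        for nxt in links.get(node, ()):
            if nxt in visited:
                continue  # keep paths simple (no repeated pages)
            scores[nxt] = scores.get(nxt, 0.0) + 1.0 / (depth + 1)
            walk(nxt, visited | {nxt}, depth + 1)

    walk(source, {source}, 0)
    return scores


def pfibf_sketch(links, source, max_len=2):
    """Hypothetical score: pf(source, target) * log(N / bf(target))."""
    pages = set(links) | {t for ts in links.values() for t in ts}
    bf = backward_link_counts(links)
    pf = path_frequency(links, source, max_len)
    # Every scored target was reached via a link, so bf[target] >= 1.
    return {t: w * math.log(len(pages) / bf[t]) for t, w in pf.items()}


# Toy link graph (hypothetical): page -> pages it links to.
links = {"A": ["B", "C"], "B": ["C"], "C": [], "D": ["C"]}
scores = pfibf_sketch(links, "A")
```

In this toy graph, "C" is reachable from "A" by more paths than "B", but it is also linked from most other pages, so the backward link term discounts it and "B" ends up with the higher score. That mirrors how IDF discounts terms that appear in most documents.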





Copyright information

© Springer-Verlag Berlin Heidelberg 2007

Authors and Affiliations

  • Kotaro Nakayama (1)
  • Takahiro Hara (1)
  • Shojiro Nishio (1)
  1. Dept. of Multimedia Eng., Graduate School of Information Science and Technology, Osaka University, 1-5 Yamadaoka, Suita, Osaka 565-0871, Japan
