Information Retrieval, Volume 13, Issue 4, pp 315–345

Classifying documents with link-based bibliometric measures

  • T. Couto
  • N. Ziviani
  • P. Calado
  • M. Cristo
  • M. Gonçalves
  • E. S. de Moura
  • W. Brandão
Abstract

Automatic document classification can be used to organize documents in a digital library, construct on-line directories, improve the precision of web searching, or facilitate interactions between users and search engines. In this paper we explore how linkage information inherent to different document collections can be used to enhance the effectiveness of classification algorithms. We experimented with three link-based bibliometric measures (co-citation, bibliographic coupling, and Amsler) on three different document collections: a digital library of computer science papers, a web directory, and an on-line encyclopedia. Results show that both hyperlink and citation information can be used to learn reliable and effective classifiers based on a kNN classifier. In one of the test collections, we obtained improvements of up to 69.8% in macro-averaged F1 over the traditional text-based kNN classifier, which serves as the baseline in our experiments. We also present alternative ways of combining bibliometric-based classifiers with text-based classifiers. Finally, we conducted studies to analyze the situations in which the bibliometric-based classifiers failed, and show that in such cases it is hard to reach consensus regarding the correct classes, even for human judges.
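The abstract names the three measures without defining them, so the following is only a minimal sketch under stated assumptions. It assumes the commonly used normalized (Jaccard-style) forms: co-citation compares the sets of documents that cite each document, bibliographic coupling compares the sets of documents each one cites, and Amsler compares the union of both. The function names, the edge-list input, and the similarity-weighted kNN vote are all illustrative choices, not necessarily the exact formulation used in the paper.

```python
from collections import defaultdict


def jaccard(a, b):
    """Return |a ∩ b| / |a ∪ b|, or 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def build_link_sets(edges):
    """Build parent/child sets from (source, target) pairs,
    where each pair means 'source cites (or links to) target'."""
    parents = defaultdict(set)   # documents that cite a given document
    children = defaultdict(set)  # documents that a given document cites
    for src, dst in edges:
        children[src].add(dst)
        parents[dst].add(src)
    return parents, children


def cocitation(d1, d2, parents, children):
    """Assumed form: overlap between the sets of documents citing d1 and d2."""
    return jaccard(parents[d1], parents[d2])


def bib_coupling(d1, d2, parents, children):
    """Assumed form: overlap between the sets of documents cited by d1 and d2."""
    return jaccard(children[d1], children[d2])


def amsler(d1, d2, parents, children):
    """Assumed form: overlap between the combined parent and child sets."""
    return jaccard(parents[d1] | children[d1], parents[d2] | children[d2])


def knn_classify(doc, labeled, similarity, parents, children, k=3):
    """Label `doc` by a similarity-weighted vote among its k most similar
    labeled documents (the weighting scheme is an assumption).
    Returns None when no neighbor has a positive similarity."""
    scored = [
        (similarity(doc, other, parents, children), label)
        for other, label in labeled.items()
        if other != doc
    ]
    top = sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]
    votes = defaultdict(float)
    for score, label in top:
        votes[label] += score
    if not votes or max(votes.values()) <= 0:
        return None
    return max(votes, key=votes.get)


# Hypothetical toy collection: p1 and p2 both cite a and b; p3 cites c.
edges = [
    ("p1", "a"), ("p1", "b"), ("p2", "a"), ("p2", "b"), ("p3", "c"),
    ("a", "r1"), ("b", "r1"), ("b", "r2"), ("c", "r3"),
]
parents, children = build_link_sets(edges)
labeled = {"b": "IR", "c": "DB"}
predicted = knn_classify("a", labeled, cocitation, parents, children, k=2)
```

In this toy collection, document a shares both of its citing documents with b and none with c, so a co-citation kNN assigns it the label of b. Returning None when no neighbor has a positive similarity reflects that a document without usable links cannot be classified from link evidence alone, which is one reason to combine link-based and text-based classifiers.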

Keywords

Text classification, Links, Web directories, Digital libraries

Notes

Acknowledgments

This work was supported by the Brazilian National Institute of Science and Technology for the Web (Grant MCT/CNPq 573871/2008-6), Project FCT IR-BASE (Grant POSC/EIA/58194/2004), Project InfoWeb (MCT/CNPq/CT-INFO 550874/2007-0), Project InWeb (Grant 573871/2008-6 CNPq), Project 5S-VQ (Grant MCT/CNPq/CT-INFO 55.1013/2005-2), Project SIRIAA (Grant MCT/CNPq/CT-Amazônia 55.3126/2005-9), CNPq Grant 305237/02-0 (Nivio Ziviani), CNPq Grant 302209/2007-7 (Edleno S. de Moura), and CNPq Grant 301043/2006-0 (Marcos André Gonçalves).

Copyright information

© Springer Science+Business Media, LLC 2009

Authors and Affiliations

  • T. Couto (1)
  • N. Ziviani (1)
  • P. Calado (2)
  • M. Cristo (3)
  • M. Gonçalves (1)
  • E. S. de Moura (4, corresponding author)
  • W. Brandão (1)

  1. Department of Computer Science, Federal University of Minas Gerais, Belo Horizonte, Brazil
  2. IST/INESC-ID, Lisbon, Portugal
  3. FUCAPI-Analysis, Research and Tech. Innovation Center, Manaus, Brazil
  4. Department of Computer Science, Federal University of Amazonas, Manaus, Brazil