A Genetic Programming Approach for Combining Structural and Citation-Based Evidence for Text Classification in Web Digital Libraries

  • Baoping Zhang
  • Weiguo Fan
  • Yuxin Chen
  • Edward A. Fox
  • Marcos André Gonçalves
  • Marco Cristo
  • Pável Calado
Part of the Studies in Fuzziness and Soft Computing book series (STUDFUZZ, volume 197)


This paper investigates how citation-based information and structural content (e.g., title, abstract) can be combined to improve the classification of text documents into predefined categories. We evaluate eight similarity measures, five derived from the citation structure of the collection and three derived from the structural content, and determine how they can be fused to improve classification effectiveness. To discover the best fusion framework, we apply Genetic Programming (GP) techniques. Our experiments, using documents from the ACM digital library and the ACM classification scheme, show that GP can discover similarity functions that work better than any single type of evidence in isolation, and that combining these functions through simple majority voting yields performance comparable to that of Support Vector Machine classifiers.
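The fusion idea in the abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the chapter: the evidence names (`cocitation`, `coupling`, `title`, `abstract`), the particular arithmetic expression standing in for a GP-evolved similarity function, the nearest-neighbour labelling step, and the ACM-style category labels. It only shows how several similarity scores might be fused into one value and how several classifier outputs might be combined by majority voting.

```python
from collections import Counter


def fused_similarity(scores):
    """Hypothetical stand-in for a GP-evolved function over evidence scores.

    A real GP run would search over expression trees like this one; the
    expression here is hand-written purely for illustration.
    """
    return (scores["cocitation"] * scores["title"]
            + scores["coupling"]
            + 0.5 * scores["abstract"])


def knn_label(fused_fn, neighbours, k=3):
    """Label a target document from its k most similar training documents.

    neighbours: list of (label, scores) pairs, where scores holds the
    evidence values between the target and one training document.
    """
    ranked = sorted(neighbours, key=lambda n: fused_fn(n[1]), reverse=True)[:k]
    votes = Counter(label for label, _ in ranked)
    return votes.most_common(1)[0][0]


def majority_vote(labels):
    """Combine the labels proposed by several classifiers by simple majority."""
    return Counter(labels).most_common(1)[0][0]


# Hypothetical evidence between one target document and four training documents.
neighbours = [
    ("H.3", {"cocitation": 0.8, "title": 0.5, "coupling": 0.2, "abstract": 0.4}),
    ("I.2", {"cocitation": 0.1, "title": 0.9, "coupling": 0.0, "abstract": 0.2}),
    ("H.3", {"cocitation": 0.3, "title": 0.3, "coupling": 0.5, "abstract": 0.6}),
    ("I.2", {"cocitation": 0.0, "title": 0.1, "coupling": 0.1, "abstract": 0.1}),
]

predicted = knn_label(fused_similarity, neighbours, k=3)
combined = majority_vote([predicted, "I.2", "H.3"])
```

In this sketch the fused score ranks training documents for a single target document, and `majority_vote` stands in for the final combination step. How the chapter's experiments actually build and combine the classifiers is not specified in the abstract.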





Copyright information

© Springer-Verlag Berlin Heidelberg 2006

Authors and Affiliations

  • Baoping Zhang (1)
  • Weiguo Fan (1)
  • Yuxin Chen (1)
  • Edward A. Fox (1)
  • Marcos André Gonçalves (2)
  • Marco Cristo (2)
  • Pável Calado (3)

  1. Department of Computer Science, Virginia Polytechnic Institute and State University, Blacksburg, USA
  2. Department of Computer Science, Federal University of Minas Gerais, Belo Horizonte, MG, Brazil
  3. IST/INESC-ID, Lisbon, Portugal
