A Graph-Based Framework for Web Document Mining

  • Adam Schenker
  • Horst Bunke
  • Mark Last
  • Abraham Kandel
Part of the Lecture Notes in Computer Science book series (LNCS, volume 3163)


In this paper we describe methods for performing data mining on web documents whose content is represented by graphs. We show how traditional clustering and classification methods, which usually operate on vector representations of data, can be extended to work with graph-based data. Specifically, we give graph-theoretic extensions of the k-Nearest Neighbors classification algorithm and the k-means clustering algorithm that process graphs, and we show how retaining structural information can improve performance over the vector model approach. We introduce several types of graph-based web document representations and compare their performance for clustering and classification.
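
To make the idea of a graph-theoretic k-Nearest Neighbors classifier concrete, the following is a minimal sketch and not the implementation used in the paper. It assumes that each document graph has uniquely labeled nodes (one node per term) and directed edges between adjacent terms, that graph size is counted as nodes plus edges, and that distance is one minus the relative size of the maximum common subgraph. Under the unique-label assumption the maximum common subgraph reduces to a set intersection. The names make_graph, graph_size, mcs_distance and knn_classify are hypothetical.

```python
from collections import Counter


def make_graph(words):
    """Build a simple document graph from a word sequence.

    Assumption: one node per distinct term, one directed edge
    for each pair of adjacent, distinct terms.
    """
    nodes = set(words)
    edges = {(a, b) for a, b in zip(words, words[1:]) if a != b}
    return nodes, edges


def graph_size(graph):
    """Size of a graph, assumed here to be |nodes| + |edges|."""
    nodes, edges = graph
    return len(nodes) + len(edges)


def mcs_distance(g1, g2):
    """Distance based on the maximum common subgraph (MCS).

    With unique node labels, the MCS is the intersection of the
    node sets together with the intersection of the edge sets.
    """
    common = (g1[0] & g2[0], g1[1] & g2[1])
    largest = max(graph_size(g1), graph_size(g2))
    if largest == 0:
        return 0.0  # two empty graphs are treated as identical
    return 1.0 - graph_size(common) / largest


def knn_classify(query, training, k=3):
    """Majority vote among the k training graphs nearest to query.

    training is a list of (graph, label) pairs.
    """
    ranked = sorted(training, key=lambda item: mcs_distance(query, item[0]))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]
```

The only change relative to vector-based k-NN is the distance function: the vote itself is unchanged, so any other graph distance can be substituted for mcs_distance without touching knn_classify.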



Copyright information

© Springer-Verlag Berlin Heidelberg 2004

Authors and Affiliations

  • Adam Schenker (1)
  • Horst Bunke (2)
  • Mark Last (3)
  • Abraham Kandel (1, 4)

  1. University of South Florida, Tampa, USA
  2. University of Bern, Bern, Switzerland
  3. Ben-Gurion University of the Negev, Beer-Sheva, Israel
  4. Tel-Aviv University, Tel-Aviv, Israel