Automatic Hierarchical Classification of Structured Deep Web Databases

  • Weifeng Su
  • Jiying Wang
  • Frederick Lochovsky
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4255)

Abstract

We present a method that automatically classifies structured deep Web databases according to a pre-defined topic hierarchy. We assume that every node of the topic hierarchy contains some manually classified databases, i.e., training databases. Each training database is probed with queries constructed from the node titles of the topic hierarchy, and the query result counts reported by the database are used to represent its content. A new database can therefore be probed with the same set of queries and assigned to the node whose training databases are most similar to it. Specifically, a support vector machine classifier is trained at each internal node of the topic hierarchy on these training databases, and a new database is classified top-down through the hierarchy, level by level. A feature extension method is proposed to create discriminant features. Experiments on real structured Web databases collected from the Internet show that this classification method is quite accurate.
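
The probe-and-classify pipeline described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: to stay dependency-free it substitutes a nearest-centroid rule with cosine similarity for the support vector machine at each internal node, and it omits the feature extension step, which the abstract does not detail. The hierarchy, the query list, the result counts, and all function names (`train`, `classify`, and the helpers) are hypothetical and chosen only for illustration.

```python
import math

# Hypothetical topic hierarchy: each internal node maps to its child nodes.
HIERARCHY = {
    "root": ["Books", "Autos"],
    "Books": ["Fiction", "Textbooks"],
    "Autos": ["Cars", "Motorcycles"],
}

# Probe queries built from the node titles. Each database is represented by
# the result counts it reports for these queries, in this order.
QUERIES = ["books", "autos", "fiction", "textbooks", "cars", "motorcycles"]

# Made-up result counts for manually classified training databases.
TRAINING = {
    "Fiction": [[900, 2, 700, 50, 5, 1], [500, 0, 450, 20, 3, 0]],
    "Textbooks": [[800, 5, 40, 750, 10, 2], [300, 1, 10, 280, 2, 0]],
    "Cars": [[3, 600, 1, 0, 550, 30], [10, 900, 4, 2, 850, 60]],
    "Motorcycles": [[1, 400, 0, 0, 20, 380], [4, 700, 2, 1, 50, 650]],
}


def normalize(counts):
    """Scale a count vector to unit length so database size does not dominate."""
    norm = math.sqrt(sum(c * c for c in counts))
    return [c / norm for c in counts] if norm else [0.0] * len(counts)


def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def cosine(a, b):
    """Cosine similarity of two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def leaves_under(node):
    """All leaf topics in the subtree rooted at `node`."""
    children = HIERARCHY.get(node)
    if not children:
        return [node]
    return [leaf for child in children for leaf in leaves_under(child)]


def train(training):
    """Build one model per internal node: a centroid for each child subtree."""
    models = {}
    for node, children in HIERARCHY.items():
        models[node] = {
            child: centroid(
                [normalize(v) for leaf in leaves_under(child) for v in training[leaf]]
            )
            for child in children
        }
    return models


def classify(models, counts):
    """Descend from the root, picking the most similar child at each level."""
    vector = normalize(counts)
    node, path = "root", []
    while node in models:
        children = models[node]
        node = max(children, key=lambda child: cosine(vector, children[child]))
        path.append(node)
    return path


models = train(TRAINING)
print(classify(models, [120, 1, 5, 110, 0, 0]))
```

Calling `classify` with the result counts of a newly probed database returns the path of nodes chosen from the top of the hierarchy down to a leaf. Replacing the centroid comparison in `classify` with a trained per-node SVM would bring the sketch closer to the approach the abstract describes.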

Copyright information

© Springer-Verlag Berlin Heidelberg 2006

Authors and Affiliations

  • Weifeng Su (1)
  • Jiying Wang (2)
  • Frederick Lochovsky (1)
  1. Hong Kong University of Science & Technology, Hong Kong
  2. City University, Hong Kong
