Automatic Hierarchical Classification of Structured Deep Web Databases
We present a method that automatically classifies structured deep Web databases into a predefined topic hierarchy. We assume that every node of the topic hierarchy contains some manually classified databases, which serve as training databases. Each training database is probed with queries constructed from the node titles of the topic hierarchy, and the result counts reported by the database are used to represent its content. A new database can therefore be probed with the same set of queries and assigned to the node whose training databases are most similar to it. Specifically, a support vector machine classifier is trained at each internal node of the topic hierarchy on these training databases, and a new database is classified top-down through the hierarchy, level by level. A feature extension method is proposed to create discriminant features. Experiments on real structured Web databases collected from the Internet show that this classification method is quite accurate.
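The probe-and-classify pipeline described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It makes several assumptions: the hierarchy, the node titles, and the result counts are invented toy data; training databases are attached to leaf nodes only; counts are normalized to fractions; and a nearest-centroid rule with cosine similarity stands in for the per-node support vector machine so that the example needs only the standard library. All function names (`probe`, `train`, `classify`, `fake_db`) are hypothetical.

```python
import math

# Hypothetical topic hierarchy: each internal node maps to its child nodes.
HIERARCHY = {
    "root": ["books", "autos"],
    "books": ["fiction", "textbooks"],
    "autos": ["cars", "motorcycles"],
}

# Probe queries are built from the node titles of the hierarchy.
QUERIES = [title for children in HIERARCHY.values() for title in children]


def probe(count_fn):
    """Turn the result counts a database reports into a feature vector.

    count_fn stands in for submitting a query to the database's search
    interface and reading the reported number of matches.
    """
    counts = [float(count_fn(q)) for q in QUERIES]
    total = sum(counts)
    # Assumption: normalize so large and small databases are comparable.
    return [c / total if total else 0.0 for c in counts]


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def centroid(vectors):
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def leaves_under(node):
    children = HIERARCHY.get(node)
    if not children:
        return [node]
    return [leaf for child in children for leaf in leaves_under(child)]


def train(training):
    """Build one model per internal node: a centroid for each child subtree.

    training maps a leaf node to the feature vectors of its training
    databases. A centroid replaces the SVM used in the paper.
    """
    return {
        node: {
            child: centroid(
                [v for leaf in leaves_under(child) for v in training[leaf]]
            )
            for child in children
        }
        for node, children in HIERARCHY.items()
    }


def classify(models, vector, node="root"):
    """Descend the hierarchy level by level, picking the most similar child."""
    while node in models:
        node = max(models[node], key=lambda c: cosine(vector, models[node][c]))
    return node


def fake_db(counts):
    """Toy database that answers each probe query with a fixed count."""
    return lambda query: counts.get(query, 0)


training = {
    "fiction": [probe(fake_db({"books": 90, "fiction": 70, "textbooks": 5}))],
    "textbooks": [probe(fake_db({"books": 80, "textbooks": 60, "fiction": 4}))],
    "cars": [probe(fake_db({"autos": 50, "cars": 40, "motorcycles": 2}))],
    "motorcycles": [probe(fake_db({"autos": 30, "motorcycles": 25, "cars": 3}))],
}
models = train(training)

new_db = probe(fake_db({"books": 40, "fiction": 3, "textbooks": 35}))
print(classify(models, new_db))
```

Swapping the centroid comparison for a trained classifier at each internal node would bring the sketch closer to the described method; the top-down descent in `classify` would stay the same.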