CatRelate: A New Hierarchical Document Category Integration Algorithm by Learning Category Relationships
We address the problem of integrating documents from a source catalog into a master catalog. Existing approaches either treat it as a flat category integration problem, ignoring the useful hierarchy information in the catalogs, or handle it hierarchically but without a rigorous model. In contrast, our method is based on correctly identifying the relationships among categories: Match, Disjoint, SubConcept, SuperConcept, and Overlap, which correspond to the relations between sets in set theory. Compared with the traditional Match/NotMatch relationship used in the literature, this set of relationships is more expressive. The relationships among categories are first learned probabilistically and then refined by considering the hierarchy context. Our preliminary experiments show that the method helps to identify category relationships correctly and thus increases the accuracy of document integration.
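The five relationships named above can be illustrated with exact set operations. The following is a minimal sketch, not the paper's algorithm: it assumes each category is represented by a plain set of document identifiers and classifies the pair deterministically, whereas the method described here learns the relationships probabilistically and refines them using the hierarchy. The names `Relation` and `relate` are hypothetical, and so is the convention that SubConcept means the first category is strictly contained in the second.

```python
from enum import Enum


class Relation(Enum):
    MATCH = "Match"
    DISJOINT = "Disjoint"
    SUBCONCEPT = "SubConcept"
    SUPERCONCEPT = "SuperConcept"
    OVERLAP = "Overlap"


def relate(a: set, b: set) -> Relation:
    """Classify category `a` relative to category `b` by exact set relations.

    Assumption: a category is the set of its document ids. Two empty
    categories compare equal and are reported as MATCH; an empty category
    against a non-empty one is reported as DISJOINT.
    """
    if a == b:
        return Relation.MATCH
    if a.isdisjoint(b):
        return Relation.DISJOINT
    if a < b:  # strict subset: every document of a is also in b
        return Relation.SUBCONCEPT
    if a > b:  # strict superset: a contains every document of b
        return Relation.SUPERCONCEPT
    return Relation.OVERLAP  # some shared documents, neither contains the other


if __name__ == "__main__":
    source = {"d1", "d2"}
    master = {"d1", "d2", "d3"}
    print(relate(source, master).value)
```

A Match/NotMatch scheme would collapse the last four outcomes into a single NotMatch label, which is the loss of expressiveness the abstract refers to.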