Abstract
Automatically recognizing which Web documents are “of interest” for some specified application is non-trivial. As a step toward solving this problem, we propose a technique for recognizing which multiple-record Web documents apply to an ontologically specified application. Given the values and kinds of values recognized by an ontological specification in an unstructured Web document, we apply three heuristics: (1) a density heuristic that measures the percentage of the document that appears to apply to an application ontology, (2) an expected-value heuristic that compares the number and kind of values found in a document to the number and kind expected by the application ontology, and (3) a grouping heuristic that considers whether the values of the document appear to be grouped as application-ontology records. Then, based on machine-learned rules over these heuristic measurements, we determine whether a Web document is applicable for a given ontology. Our experimental results show that we have been able to achieve over 90% for both recall and precision, with an F-measure of about 95%.
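As a rough illustration of the first two heuristics described in the abstract, the following minimal Python sketch computes a density score and an expected-value score for a toy document. Everything in it is an assumption made for illustration: the `TOY_ONTOLOGY` fields, the regular-expression recognizers, the expected counts, the character-coverage definition of density, and the use of cosine similarity for the expected-value comparison are not taken from the abstract, which does not give the exact formulas. The grouping heuristic and the machine-learned decision rules are not sketched here.

```python
import math
import re

# Hypothetical toy "ontology": field name -> (regex recognizer, expected count per record).
# The patterns and expected counts are illustrative assumptions, not the paper's ontology.
TOY_ONTOLOGY = {
    "year": (re.compile(r"\b(?:19|20)\d{2}\b"), 1.0),
    "price": (re.compile(r"\$\d[\d,]*"), 1.0),
    "phone": (re.compile(r"\b\d{3}-\d{4}\b"), 1.0),
}


def recognize(text, ontology):
    """Return {field: [(start, end), ...]} for every value the recognizers find."""
    return {
        name: [m.span() for m in pattern.finditer(text)]
        for name, (pattern, _expected) in ontology.items()
    }


def density(text, matches):
    """Assumed density measure: fraction of characters covered by recognized values."""
    if not text:
        return 0.0
    covered = set()
    for spans in matches.values():
        for start, end in spans:
            covered.update(range(start, end))
    return len(covered) / len(text)


def expected_value_similarity(matches, ontology):
    """Assumed expected-value measure: cosine similarity of observed vs. expected counts."""
    names = sorted(ontology)
    observed = [len(matches[n]) for n in names]
    expected = [ontology[n][1] for n in names]
    dot = sum(o * e for o, e in zip(observed, expected))
    norm = math.sqrt(sum(o * o for o in observed)) * math.sqrt(sum(e * e for e in expected))
    return dot / norm if norm else 0.0


if __name__ == "__main__":
    page = "1998 Honda $4,500 call 555-1234\n1995 Ford $2,000 call 555-9876\n"
    found = recognize(page, TOY_ONTOLOGY)
    print("density:", density(page, found))
    print("expected-value similarity:", expected_value_similarity(found, TOY_ONTOLOGY))
```

In a fuller pipeline, scores like these (together with a grouping score) would be the features passed to a learned classifier; the sketch stops at computing the features.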
Copyright information
© 2001 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Embley, D.W., Ng, Y.-K., Xu, L. (2001). Recognizing Ontology-Applicable Multiple-Record Web Documents. In: Kunii, H.S., Jajodia, S., Sølvberg, A. (eds) Conceptual Modeling — ER 2001. ER 2001. Lecture Notes in Computer Science, vol 2224. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-45581-7_41
DOI: https://doi.org/10.1007/3-540-45581-7_41
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-42866-4
Online ISBN: 978-3-540-45581-3
eBook Packages: Springer Book Archive