Recognizing Ontology-Applicable Multiple-Record Web Documents

Embley, David W.; Ng, Yiu-Kai; Xu, Li

doi:10.1007/3-540-45581-7_41

Conference paper
First Online: 18 December 2001

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 2224)

Abstract

Automatically recognizing which Web documents are “of interest” for some specified application is non-trivial. As a step toward solving this problem, we propose a technique for recognizing which multiple-record Web documents apply to an ontologically specified application. Given the values and kinds of values recognized by an ontological specification in an unstructured Web document, we apply three heuristics: (1) a density heuristic that measures the percent of the document that appears to apply to an application ontology, (2) an expected-value heuristic that compares the number and kind of values found in a document to the number and kind expected by the application ontology, and (3) a grouping heuristic that considers whether the values of the document appear to be grouped as application-ontology records. Then, based on machine-learned rules over these heuristic measurements, we determine whether a Web document is applicable for a given ontology. Our experimental results show that we have been able to achieve over 90% for both recall and precision, with an F-measure of about 95%.
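
The abstract names the three heuristics but gives no formulas, so the following Python sketch is only one possible reading of them, not the authors' implementation. It assumes that density is a character-coverage ratio, that the expected-value comparison can be approximated by cosine similarity between found and expected counts per kind of value, and that grouping can be approximated by checking how many distinct record fields appear in each consecutive run of recognized values. All function and parameter names are hypothetical.

```python
import math
from typing import Dict, List


def density(matched_chars: int, total_chars: int) -> float:
    """Fraction of the document's characters covered by recognized values.

    Assumption: "percent of the document" is measured in characters.
    """
    return matched_chars / total_chars if total_chars else 0.0


def expected_value_similarity(found: Dict[str, float],
                              expected: Dict[str, float]) -> float:
    """Cosine similarity between found and expected counts per kind of value.

    Assumption: cosine similarity stands in for the paper's comparison.
    """
    kinds = set(found) | set(expected)
    dot = sum(found.get(k, 0.0) * expected.get(k, 0.0) for k in kinds)
    norm_found = math.sqrt(sum(v * v for v in found.values()))
    norm_expected = math.sqrt(sum(v * v for v in expected.values()))
    if norm_found == 0.0 or norm_expected == 0.0:
        return 0.0
    return dot / (norm_found * norm_expected)


def grouping(kinds_in_order: List[str], record_kinds: List[str]) -> float:
    """Average share of distinct record fields per consecutive group.

    Assumption: the recognized kinds, in document order, are cut into
    groups of len(record_kinds); a well-grouped document shows each
    record field once per group. Incomplete trailing groups are ignored.
    """
    n = len(record_kinds)
    if n == 0:
        return 0.0
    wanted = set(record_kinds)
    groups = [kinds_in_order[i:i + n] for i in range(0, len(kinds_in_order), n)]
    full = [g for g in groups if len(g) == n]
    if not full:
        return 0.0
    return sum(len(set(g) & wanted) / n for g in full) / len(full)
```

In the approach the abstract describes, measurements like these would be the inputs to machine-learned rules that make the final applicability decision; that learning step is not sketched here.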

Author information

Authors and Affiliations

Dept. of Computer Science, Brigham Young University, 84602, Provo, Utah, USA
David W. Embley, Yiu-Kai Ng & Li Xu

Editor information

Editors and Affiliations

Ricoh Company, Ltd., Software Research Center, 1-1-17 Koishikawa, Bunkyo-ku, 112-0002, Tokyo, Japan
Hideko S. Kunii
George Mason University, Center for Secure Information Systems, Mail Stop 4A4, Fairfax, 22030-4444, VA, USA
Sushil Jajodia
Department of Computer and Information Science, The Norwegian University of Science and Technology, 7491, Trondheim, Norway
Arne Sølvberg

About this paper

Cite this paper

Embley, D.W., Ng, Y.-K., Xu, L. (2001). Recognizing Ontology-Applicable Multiple-Record Web Documents. In: Kunii, H.S., Jajodia, S., Sølvberg, A. (eds) Conceptual Modeling — ER 2001. ER 2001. Lecture Notes in Computer Science, vol 2224. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-45581-7_41

DOI: https://doi.org/10.1007/3-540-45581-7_41
Published: 18 December 2001
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-42866-4
Online ISBN: 978-3-540-45581-3
eBook Packages: Springer Book Archive
