Recognizing Ontology-Applicable Multiple-Record Web Documents

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 2224)

Abstract

Automatically recognizing which Web documents are “of interest” for some specified application is non-trivial. As a step toward solving this problem, we propose a technique for recognizing which multiple-record Web documents apply to an ontologically specified application. Given the values and kinds of values recognized by an ontological specification in an unstructured Web document, we apply three heuristics: (1) a density heuristic that measures the percentage of the document that appears to apply to an application ontology, (2) an expected-value heuristic that compares the number and kind of values found in a document to the number and kind expected by the application ontology, and (3) a grouping heuristic that considers whether the values of the document appear to be grouped as application-ontology records. Then, based on machine-learned rules over these heuristic measurements, we determine whether a Web document is applicable to a given ontology. Our experimental results show that we have been able to achieve over 90% for both recall and precision, with an F-measure of about 95%.
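
As an illustration only, the three heuristic measurements might be sketched as follows. The abstract does not give formulas, so everything here is an assumption: the `Match` representation, the function names, the use of character coverage for density, cosine similarity for the expected-value comparison, and fixed-size consecutive groups for the grouping measure are all hypothetical choices, not the authors' definitions. The learned rules that combine the three measurements are not shown.

```python
import math
from typing import Dict, List, Tuple

# Hypothetical representation of a recognized value:
# (field_name, start_offset, end_offset) within the document text.
Match = Tuple[str, int, int]


def density(doc_length: int, matches: List[Match]) -> float:
    """Fraction of document characters covered by recognized values (assumed definition)."""
    if doc_length <= 0:
        return 0.0
    covered = set()
    for _, start, end in matches:
        # Clamp to the document and count overlapping matches only once.
        covered.update(range(max(start, 0), min(end, doc_length)))
    return len(covered) / doc_length


def expected_value_similarity(matches: List[Match], expected: Dict[str, float]) -> float:
    """Cosine similarity between observed and expected per-field counts (assumed definition)."""
    observed = {field: 0 for field in expected}
    for field, _, _ in matches:
        if field in observed:
            observed[field] += 1
    dot = sum(observed[f] * expected[f] for f in expected)
    norm_obs = math.sqrt(sum(v * v for v in observed.values()))
    norm_exp = math.sqrt(sum(v * v for v in expected.values()))
    if norm_obs == 0 or norm_exp == 0:
        return 0.0
    return dot / (norm_obs * norm_exp)


def grouping(matches: List[Match], record_fields: List[str]) -> float:
    """Average share of distinct record fields per consecutive group of n values (assumed definition)."""
    n = len(record_fields)
    if n == 0:
        return 0.0
    wanted = set(record_fields)
    # Order the relevant values by where they appear in the document.
    ordered = [f for f, _, _ in sorted(matches, key=lambda m: m[1]) if f in wanted]
    groups = [ordered[i:i + n] for i in range(0, len(ordered) - n + 1, n)]
    if not groups:
        return 0.0
    return sum(len(set(g)) / n for g in groups) / len(groups)


# Toy example: a 100-character document with two "records" of make and price.
sample = [("make", 0, 10), ("price", 10, 20), ("make", 50, 60), ("price", 60, 70)]
d = density(100, sample)
e = expected_value_similarity(sample, {"make": 1.0, "price": 1.0})
g = grouping(sample, ["make", "price"])
```

In this toy case, 40 of the 100 characters are covered, the observed field counts are proportional to the expected ones, and every consecutive pair of values contains both fields. A rule learner would then take triples such as `(d, e, g)` from labeled documents as its training features.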

Copyright information

© 2001 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Embley, D.W., Ng, Y.-K., Xu, L. (2001). Recognizing Ontology-Applicable Multiple-Record Web Documents. In: Kunii, H.S., Jajodia, S., Sølvberg, A. (eds) Conceptual Modeling — ER 2001. ER 2001. Lecture Notes in Computer Science, vol 2224. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-45581-7_41

  • DOI: https://doi.org/10.1007/3-540-45581-7_41

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-42866-4

  • Online ISBN: 978-3-540-45581-3

  • eBook Packages: Springer Book Archive
