Recognition of Data Records in Semi-structured Web-Pages Using Ontology and χ² Statistical Distribution

  • Conference paper
Advanced Data Mining and Applications (ADMA 2008)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 5139)

Abstract

Information extraction (IE) has emerged as a novel discipline in computer science. In IE, intelligent algorithms are employed to extract the required data and structure them so that they are suitable for querying. Most IE systems use the structure of a web page, e.g. its HTML tags, to recognize the target information. In this article, an algorithm is developed that recognizes the main region of a web page containing the target information by means of an ontology, the web-page structure, and the χ² goodness-of-fit test. After the main region is recognized, the records within it are identified, and each record is written to a text file.
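
The abstract does not say how the goodness-of-fit test is applied, so the following is only a minimal sketch of one plausible reading, not the authors' algorithm. It assumes that each candidate region is plain text, that the ontology supplies a list of attribute keywords, and that the expected share of each keyword in a record-bearing region is known in advance. The names `keyword_counts`, `best_region`, and `expected_share` are hypothetical. The region whose keyword counts deviate least from the expected shares, measured by Pearson's statistic, is taken as the main region.

```python
import re
from collections import Counter


def chi_square_statistic(observed, expected):
    """Pearson goodness-of-fit statistic: sum of (O - E)^2 / E."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


def keyword_counts(text, keywords):
    """Count how often each ontology keyword occurs in a region's text."""
    tokens = Counter(re.findall(r"[a-z]+", text.lower()))
    return [tokens[k] for k in keywords]


def best_region(regions, keywords, expected_share):
    """Return (index, statistic) of the region that best fits the expected
    keyword shares. Regions with no keyword at all are skipped.
    Assumption: a smaller statistic means a better candidate."""
    best_index, best_stat = None, float("inf")
    for i, text in enumerate(regions):
        observed = keyword_counts(text, keywords)
        total = sum(observed)
        if total == 0:
            continue
        # Scale the expected shares to the number of keywords observed.
        expected = [total * share for share in expected_share]
        stat = chi_square_statistic(observed, expected)
        if stat < best_stat:
            best_index, best_stat = i, stat
    return best_index, best_stat


if __name__ == "__main__":
    # Hypothetical regions: a navigation bar, a record list, a sidebar.
    regions = [
        "home about contact login",
        "title author price title author price title author price",
        "title title title title price",
    ]
    keywords = ["title", "author", "price"]
    print(best_region(regions, keywords, [1 / 3, 1 / 3, 1 / 3]))
```

In this toy input the second region contains each keyword equally often, so it matches the uniform expected shares and is selected. A fuller version would compare the statistic against a χ² critical value for the appropriate degrees of freedom rather than simply taking the minimum.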

Copyright information

© 2008 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Keshavarzi, A., Rahmani, A.M., Mohsenzadeh, M., Keshavarzi, R. (2008). Recognition of Data Records in Semi-structured Web-Pages Using Ontology and χ² Statistical Distribution. In: Tang, C., Ling, C.X., Zhou, X., Cercone, N.J., Li, X. (eds) Advanced Data Mining and Applications. ADMA 2008. Lecture Notes in Computer Science, vol 5139. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-88192-6_71

  • DOI: https://doi.org/10.1007/978-3-540-88192-6_71

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-88191-9

  • Online ISBN: 978-3-540-88192-6

  • eBook Packages: Computer Science, Computer Science (R0)