Part of the book series: Advances in Intelligent and Soft Computing (AINSC, volume 111)

Abstract

Most previous information extraction (IE) work relies on analysis of the DOM tree of the HTML file. When content must be extracted from hundreds of information sources in a specific domain such as news, this approach loses accuracy. Based on the features of news articles, this paper proposes a new way to obtain the desired news content by washing out noise information and computing text-group statistics. Experiments demonstrate the effectiveness of the algorithm.
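
The abstract does not spell out the algorithm, so the following is only a minimal sketch of what "washing" followed by text-group statistics could look like, not the authors' method. It assumes that noise means scripts, styles, comments, and markup, that a text group is a run of long lines interrupted by at most `max_gap` short ones, and that the group with the most characters is the article body. The names `wash`, `extract_news_text`, `min_len`, and `max_gap` are hypothetical and chosen for illustration.

```python
import re
from html import unescape

# Assumption: the full content of these elements is noise, not article text.
NOISE_BLOCKS = re.compile(r"<(script|style|noscript|iframe)\b.*?</\1\s*>", re.I | re.S)
COMMENTS = re.compile(r"<!--.*?-->", re.S)
# Assumption: these tags mark the boundary between two blocks of text.
BLOCK_BREAK = re.compile(
    r"</?(p|div|br|li|td|tr|h[1-6]|ul|ol|table|section|article)\b[^>]*>", re.I
)
ANY_TAG = re.compile(r"<[^>]+>")


def wash(html):
    """Strip noise and markup; return the non-empty text lines of the page."""
    html = COMMENTS.sub("", html)
    html = NOISE_BLOCKS.sub("", html)
    html = BLOCK_BREAK.sub("\n", html)      # one line per block-level element
    text = unescape(ANY_TAG.sub("", html))  # drop remaining inline tags
    stripped = (line.strip() for line in text.split("\n"))
    return [line for line in stripped if line]


def extract_news_text(html, min_len=40, max_gap=1):
    """Return the group of long lines that holds the most characters.

    min_len and max_gap are illustrative thresholds, not values from the paper.
    """
    best, current, gap = [], [], 0
    # Trailing empty strings force the last open group to be closed and scored.
    for line in wash(html) + [""] * (max_gap + 1):
        if len(line) >= min_len:
            current.append(line)
            gap = 0
        else:
            gap += 1
            if gap > max_gap and current:
                if sum(map(len, current)) > sum(map(len, best)):
                    best = current
                current = []
    return "\n".join(best)
```

Calling `extract_news_text(page_html)` on a fetched page returns the selected lines joined by newlines. Short navigation links and footers fall below `min_len` and are left out unless they sit inside a longer run of text. The approach never builds a DOM tree, which matches the motivation given in the abstract, though the thresholds would need tuning per corpus.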

Copyright information

© 2011 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Su, W., Junping, D., Tian, G. (2011). A New Way of News Extraction by Text Washing and Statistics. In: Jiang, L. (eds) Proceedings of the 2011, International Conference on Informatics, Cybernetics, and Computer Engineering (ICCE2011) November 19–20, 2011, Melbourne, Australia. Advances in Intelligent and Soft Computing, vol 111. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-25188-7_24

  • DOI: https://doi.org/10.1007/978-3-642-25188-7_24

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-25187-0

  • Online ISBN: 978-3-642-25188-7

  • eBook Packages: Engineering, Engineering (R0)
