Abstract
Most previous information extraction (IE) work relies on analysis of the DOM tree of an HTML file. When hundreds of information sources in a specific domain such as news must be extracted, this approach suffers from decreased accuracy. Based on the features of news articles, this paper proposes a new way to obtain the desired news content by washing out noise information and computing text group statistics. Experiments demonstrate the effectiveness of the algorithm.
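The abstract gives no algorithmic detail, so the following is only an illustrative sketch of one way "washing" and "text group statistics" might be read, not the authors' actual method. It assumes that washing means discarding scripts, styles, comments, and tags, and that a text group is a run of consecutive long text lines. The function names `wash` and `extract_content`, the list of block-level tags, and the `min_len` threshold are all hypothetical.

```python
import re
from html import unescape


def wash(html: str) -> list[str]:
    """Remove markup noise and return the non-empty text lines (assumed 'washing' step)."""
    # Drop script and style elements together with their bodies, then comments.
    html = re.sub(r"(?is)<(script|style)\b.*?</\1\s*>", " ", html)
    html = re.sub(r"(?s)<!--.*?-->", " ", html)
    # Treat block-level tags as line breaks so each block lands on its own line.
    html = re.sub(r"(?i)</?(p|div|br|li|tr|td|h[1-6]|ul|ol|table)\b[^>]*>", "\n", html)
    # Strip the remaining (inline or structural) tags and decode entities.
    text = unescape(re.sub(r"<[^>]+>", "", html))
    return [line.strip() for line in text.split("\n") if line.strip()]


def extract_content(html: str, min_len: int = 20) -> str:
    """Return the group of consecutive long lines with the most characters.

    min_len is an assumed threshold: shorter lines (menus, link labels,
    footers) are treated as noise and also act as group separators.
    """
    groups, current = [], []
    for line in wash(html):
        if len(line) >= min_len:
            current.append(line)
        elif current:
            groups.append(current)
            current = []
    if current:
        groups.append(current)
    if not groups:
        return ""
    # The 'statistic' used here is simply the total character count per group.
    best = max(groups, key=lambda g: sum(len(line) for line in g))
    return "\n".join(best)
```

Because this sketch works on washed text lines rather than on the DOM tree, it does not depend on any site-specific page structure, which is the property the abstract emphasizes. A real system would likely need a more careful threshold and richer statistics than a raw character count.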
Copyright information
© 2011 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Su, W., Junping, D., Tian, G. (2011). A New Way of News Extraction by Text Washing and Statistics. In: Jiang, L. (eds) Proceedings of the 2011, International Conference on Informatics, Cybernetics, and Computer Engineering (ICCE2011) November 19–20, 2011, Melbourne, Australia. Advances in Intelligent and Soft Computing, vol 111. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-25188-7_24
DOI: https://doi.org/10.1007/978-3-642-25188-7_24
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-25187-0
Online ISBN: 978-3-642-25188-7
eBook Packages: Engineering, Engineering (R0)