Multilingual news extraction via stopword language model scoring

Wu, Yu-Chieh

doi:10.1007/s10844-016-0395-6

Published: 18 March 2016

Volume 48, pages 191–213, (2017)

Journal of Intelligent Information Systems

Abstract

Web news provides a quick and convenient means of building large document collections. Creating a web news corpus has typically required constructing a set of HTML parsing rules to identify content text. In general, these parsing rules are written manually and must be tailored to each web page layout. We address this issue and propose a news content recognition algorithm that is language and layout independent. Our method first scans a given HTML document and roughly localizes a set of candidate news areas. Next, it applies a purpose-built scoring function to rank the candidates and select the best content. To validate this approach, we evaluate the system's performance using 1092 items of multilingual web news data covering 17 global regions and 11 distinct languages. We compare our method with nine published content extraction systems using standard settings. The results of this empirical study show that our method outperforms the second-best approach (Boilerpipe) by 6.04 % and 10.79 % in relative micro and macro F-measure, respectively. We also apply our system to monitor online RSS news distribution: it collected 0.4 million news articles from 200 RSS channels in 20 days. A sample quality test on this collection shows that our method achieved 93 % extraction accuracy for large news streams.
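
The abstract outlines a two-step pipeline (localize candidate areas, then rank them with a scoring function) without giving the scoring function itself. The following is a minimal Python sketch of that general idea, not the author's algorithm: it assumes that candidate areas are simply the text of block-level HTML elements, and that a candidate's score is the number of stopword tokens it contains under the best-matching language. The names `STOPWORDS`, `BlockCollector`, `stopword_score`, and `extract_main_text` are hypothetical, the stopword lists are tiny illustrative samples, and whitespace tokenization is assumed, so the sketch does not handle languages written without spaces.

```python
from html.parser import HTMLParser

# Tiny illustrative stopword samples (assumption: a real system would use
# much fuller lists, one per supported language).
STOPWORDS = {
    "en": {"the", "of", "and", "to", "in", "a", "is", "that", "for", "on", "with"},
    "es": {"el", "la", "de", "que", "y", "en", "los", "un", "por", "con", "las"},
}

BLOCK_TAGS = {"div", "p", "td", "article", "section", "li"}  # candidate boundaries
SKIP_TAGS = {"script", "style"}  # never treated as content


class BlockCollector(HTMLParser):
    """Collects the text between block-level tag boundaries as candidate areas."""

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._buf = []
        self._skip_depth = 0

    def _flush(self):
        # Normalize whitespace and store the buffered text as one candidate.
        text = " ".join(" ".join(self._buf).split())
        if text:
            self.blocks.append(text)
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip_depth += 1
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self._skip_depth = max(0, self._skip_depth - 1)
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._buf.append(data)

    def close(self):
        super().close()
        self._flush()


def stopword_score(text):
    """Number of stopword tokens under the best-matching language."""
    tokens = [t.strip(".,;:!?\"'()").lower() for t in text.split()]
    return max(sum(t in words for t in tokens) for words in STOPWORDS.values())


def extract_main_text(html):
    """Return the candidate block with the highest stopword score."""
    collector = BlockCollector()
    collector.feed(html)
    collector.close()
    if not collector.blocks:
        return ""
    return max(collector.blocks, key=stopword_score)


page = """
<html><body>
<div id="nav"><a href="/">Home</a> <a href="/world">World</a> <a href="/login">Login</a></div>
<div id="story"><p>The mayor said that the city is ready for the storm
and that shelters are open to residents in need.</p></div>
<div id="footer">Copyright 2016 Example News</div>
</body></html>
"""
print(extract_main_text(page))
```

The intuition behind this kind of score is that navigation bars, link lists, and footers contain few function words, while running article text contains many, so no per-site parsing rules are needed. A raw count also favors longer blocks; whether and how the published method normalizes for length is not stated in the abstract.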

References

Ando, R.K., & Zhang, T. (2005). A framework for learning predictive structures from multiple tasks and unlabeled data. Journal of Machine Learning Research, 6, 1817–1853.
Androutsopoulos, I., & Malakasiotis, P. (2010). A survey of paraphrasing and textual entailment methods. Journal of Artificial Intelligence Research, 38, 135–187.
Batzios, A., Dimou, C., Symeonidis, A.L., & Mitkas, P.A. (2008). BioCrawler: An intelligent crawler for the semantic web. Expert Systems with Applications, 35(1–2), 524–530.
Chen, Y., Lee, S.Y.M., & Huang, O.C. (2012). A robust web personal name information extraction system. Expert Systems with Applications, 39(3), 2690–2699.
Gils, B.V., Proper, E., Bommel, P.V., & Weide, T.P.V.D. (2007). On the quality of resources on the Web: An information retrieval perspective. Information Sciences, 177(21), 4566–4597.
Gottron, T. (2008a). Combining content extraction heuristics: the CombinE system. In Proceedings of the 10th International Conference on Information Integration and Web-based Applications Services (pp. 591–595).
Gottron, T. (2008b). Content code blurring: a new approach to content extraction. In Proceedings of the 19th International Conference on Database and Expert Systems Application (pp. 29–33).
Han, H., Noro, T., & Tokuda, T. (2009). An automatic web news article contents extraction system based on RSS feeds. Journal of Web Engineering, 8(3), 268–284.
Huang, S., Zheng, X., Wang, X., & Chen, D. (2011). News information extraction based on adaptive weighting using unsupervised Bayesian algorithm. In Proceedings of the 2011 international conference on Web information systems and mining (pp. 251–258).
Kohlschütter, C., Fankhauser, P., & Nejdl, W. (2011). Boilerplate detection using shallow text features. In Proceedings of the third ACM international conference on Web search and data mining (pp. 441–450).
Li, L., Zhou, R., & Huang, D. (2009). Two-phase biomedical named entity recognition using CRFs. Computational Biology and Chemistry, 33(4), 334–338.
Lin, D., & Wu, X. (2009). Phrase clustering for discriminative learning. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing (pp. 1030–1038).
Liu, W., Meng, X., & Meng, W. (2010). ViDE: A Vision-Based Approach for Deep Web Data Extraction. IEEE Transactions on Knowledge and Data Engineering, 22(3), 447–460.
Manning, C.D., Raghavan, P., & Schütze, H. (2008). Introduction to Information Retrieval. Cambridge University Press.
Manning, C.D., & Schütze, H. (2009). Foundations of statistical natural language processing. London: The MIT Press.
Miao, G., Tatemura, J., Hsiung, W., Sawires, A., & Moser, L.E. (2009). Extracting data records from the web using tag path clustering. In Proceedings of the 18th international conference on World wide web (pp. 981–990).
Mohammadzadeh, H., Gottron, T., & Schweiggert, F. (2011). Extracting the main content of web documents based on a naive smoothing method. In Proceedings of the International Conference on Knowledge Discovery and Information Retrieval (pp. 470–475).
Moschitti, A., & Quarteroni, S. (2011). Linguistic kernels for answer re-ranking in question answering systems. Information Processing and Management, 47(6), 825–842.
Oh, H., Myaeng, S.H., & Jang, M. (2007). Semantic passage segmentation based on sentence topics for question answering. Information Sciences, 177(18), 3696–3717.
Pasternack, J., & Roth, D. (2009). Extracting article text from the web with maximum subsequence segmentation. In Proceedings of the 18th international conference on World wide Web (pp. 971–980).
Qureshi, P.A.R., & Memon, N. (2012). Hybrid CETR model of content extraction. Journal of Computer and System Sciences, 78(4), 1248–1257.
Saha, S.K., Sarkar, S., & Mitra, P. (2009). Feature selection techniques for maximum entropy based biomedical named entity recognition. Journal of Biomedical Informatics, 42(5), 905–911.
Sun, F., Song, D., & Liao, L. (2011). DOM Based content extraction via text density. In Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval (pp. 245–254).
Suzuki, J., & Isozaki, H. (2008). Semi-supervised sequential labeling and segmentation using giga-word scale unlabeled data. In Proceedings of the 46th Annual Meeting of the ACL: Human Language Technologies (pp. 665–673).
Tsai, R.T. (2010). Chinese text segmentation: A hybrid approach using transductive learning and statistical association measures. Expert Systems with Applications, 37(5), 3553–3560.
Cardoso, E.T., Jabour, I.V., Laber, E.S., Rodrigues, R., & Cardoso, P. (2011). An efficient language-independent method to extract content from news webpages. In Proceedings of the ACM Symposium on Document Engineering (pp. 121–128).
Voorhees, E.M. (2001). Overview of the TREC 2001 question answering track. In Proceedings of the 10th Text Retrieval Conference (pp. 42–52).
Wang, J., He, X., Wang, C., Pei, J., Bu, J., Chen, C., Guan, Z., & Lu, G. (2009). News article extraction with template-independent wrapper. In Proceedings of the 18th international conference on World wide web (pp. 1085–1086).
Weninger, T., Hsu, W.H., & Han, J. (2010). CETR: Content extraction via tag ratios. In Proceedings of the 19th international conference on World wide Web (pp. 971–980).
Wu, Y., Lee, Y., & Yang, J. (2008). Robust and efficient multiclass SVM models for phrase pattern recognition. Pattern Recognition, 41(9), 2874–2889.
Xu, G., Niu, Z., Uetz, P., Gao, X., Qin, X., & Liu, H. (2009). Semi-supervised Learning of Text Classification on Bacterial Protein-Protein Interaction Documents. In Proceedings of the International Joint Conference on Bioinformatics Systems Biology and Intelligent Computing (pp. 263–270).
Yang, Y., & Liu, X. (1999). A re-examination of text categorization methods. In Proceedings of the 22nd annual international ACM SIGIR conference on Research and development in information retrieval (pp. 42–49).
Yen, S., Lee, Y., Ying, J., & Wu, Y. (2011). A logistic regression-based smoothing method for Chinese text categorization. Expert Systems with Applications, 38(9), 11581–11590.
Zhai, C., & Lafferty, J. (2004). A study of smoothing methods for language models applied to information retrieval. ACM Transactions on Information Systems, 22(2), 179–214.
Zhang, C., & Lin, Z. (2010). Automatic web news content extraction based on similar pages. In Proceedings of the International Conference on Web Information Systems and Mining (pp. 232–236).
Zheng, S., Song, R., & Wen, J. (2007). Template-Independent News extraction based on visual consistency. In Proceedings of the 22nd national conference on Artificial intelligence (pp. 1507–1512).

Acknowledgments

The authors acknowledge support under MOST Grants MOST 103-2221-E-130-004-

Author information

Authors and Affiliations

Department of New Media and Communication Administration, Ming Chuan University, Taipei 111, Taiwan, Republic of China
Yu-Chieh Wu

Corresponding author

Correspondence to Yu-Chieh Wu.

About this article

Cite this article

Wu, YC. Multilingual news extraction via stopword language model scoring. J Intell Inf Syst 48, 191–213 (2017). https://doi.org/10.1007/s10844-016-0395-6

Received: 16 December 2014
Revised: 24 January 2016
Accepted: 26 January 2016
Published: 18 March 2016
Issue Date: February 2017
DOI: https://doi.org/10.1007/s10844-016-0395-6
