Abstract
News mining has attracted increasing attention because of the overwhelming volume of news produced every day. Many news portals, such as Sina (http://www.sina.com) and Chinanews (http://www.chinanews.com), develop tools to manage billions of news articles and provide services that meet a wide range of needs. News analysis applications mine this news to reveal valuable information. All of them depend on news meta-data, the fundamental element supporting news analysis, so extracting and maintaining news meta-data is an important and challenging task. In this paper, we present a system specialized for Chinese news meta-data extraction. It identifies 28 kinds of meta-data and provides not only a pipeline to extract them but also a systematic way to manage them. It facilitates organizing and conducting news mining processes and improves efficiency by avoiding duplicated work. More specifically, it introduces a novel way to categorize news based on each word's ability to represent a category. It also adapts existing methods to extract keywords, entities, and event elements. Integrating our system into news mining applications has demonstrated its value for news analysis.
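The abstract does not spell out how a word's ability to represent a category is measured. As a rough illustration only, the sketch below assumes one simple reading: a word's score for a category is the share of its training occurrences that fall in that category, and an article is assigned the category with the largest summed score. The function names, the scoring rule, and the toy data are all hypothetical and are not taken from the paper.

```python
from collections import Counter, defaultdict


def word_category_scores(labeled_docs):
    """For each word, estimate the share of its occurrences in each category.

    Assumption: this ratio stands in for a word's "ability to represent"
    a category; the paper's actual measure may differ.
    """
    counts = defaultdict(Counter)  # word -> Counter(category -> occurrences)
    for category, tokens in labeled_docs:
        for tok in tokens:
            counts[tok][category] += 1
    scores = {}
    for word, per_cat in counts.items():
        total = sum(per_cat.values())
        scores[word] = {cat: n / total for cat, n in per_cat.items()}
    return scores


def categorize(tokens, scores):
    """Return the category with the largest summed word score, or None if no word is known."""
    totals = Counter()
    for tok in tokens:
        for cat, s in scores.get(tok, {}).items():
            totals[cat] += s
    return totals.most_common(1)[0][0] if totals else None


# Toy, pre-segmented training data (hypothetical). Real Chinese text would
# first need a word segmenter, which is outside the scope of this sketch.
TRAIN = [
    ("sports", ["比赛", "球队", "冠军", "记者"]),
    ("sports", ["球队", "教练", "比赛"]),
    ("finance", ["股市", "银行", "利率", "记者"]),
    ("finance", ["银行", "股市", "投资"]),
]

SCORES = word_category_scores(TRAIN)
```

In this toy setup a word such as 记者 (reporter), which appears equally in both categories, contributes the same amount to each, while a word such as 球队 (team) counts fully toward sports. That is the intuition behind weighting words by how well they represent a category.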
Notes
- 2. When, Who, What, Where, Why, How.
Acknowledgement
This work is supported by the 973 Program (No. 2014CB340504), NSFC-ANR (No. 61261130588), the Tsinghua University Initiative Scientific Research Program (No. 20131089256), the THU-NUS NExT Co-Lab, and the National Natural Science Foundation of China (No. 61303075).
Copyright information
© 2016 Springer International Publishing Switzerland
About this paper
Cite this paper
Xia, J., Xie, F., Zhang, M., Su, Y., Luan, H. (2016). CNME: A System for Chinese News Meta-Data Extraction. In: Qi, G., Kozaki, K., Pan, J., Yu, S. (eds) Semantic Technology. JIST 2015. Lecture Notes in Computer Science, vol 9544. Springer, Cham. https://doi.org/10.1007/978-3-319-31676-5_7
DOI: https://doi.org/10.1007/978-3-319-31676-5_7
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-31675-8
Online ISBN: 978-3-319-31676-5
eBook Packages: Computer Science, Computer Science (R0)