CNME: A System for Chinese News Meta-Data Extraction
News mining has gained increasing attention because of the overwhelming volume of news produced every day. Many news portals, such as Sina (http://www.sina.com) and Chinanews (http://www.chinanews.com), develop tools to manage billions of news articles and provide services to meet a wide range of needs. News analysis applications mine this news to reveal valuable information. All of them depend on news meta-data, the fundamental element that supports news analysis. Extracting and maintaining news meta-data has therefore become an important and challenging task. In this paper, we present a system specialized for Chinese news meta-data extraction. It identifies 28 kinds of meta-data and provides not only a pipeline to extract them but also a systematic way to manage them. It facilitates organizing and conducting news mining processes and improves efficiency by avoiding duplicated work. More specifically, it introduces a novel way to categorize news based on how well individual words represent a category. It also adapts existing methods to extract keywords, entities and event elements. Integrating our system into news mining applications has demonstrated its value for news analysis work.
Keywords: News analysis, Meta-data extraction, Keyword extraction, Entity linking
This work is supported by the 973 Program (No. 2014CB340504), NSFC-ANR (No. 61261130588), the Tsinghua University Initiative Scientific Research Program (No. 20131089256), the THU-NUS NExT Co-Lab and the National Natural Science Foundation of China (No. 61303075).