CNME: A System for Chinese News Meta-Data Extraction

Xia, Junbo; Xie, Fei; Zhang, Mengdi; Su, Yu; Luan, Huanbo

doi:10.1007/978-3-319-31676-5_7

Conference paper
First Online: 20 March 2016

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 9544)

Abstract

News mining has attracted increasing attention because of the overwhelming volume of news produced every day. Many news portals, such as Sina (http://www.sina.com) and Chinanews (http://www.chinanews.com), develop tools to manage billions of news articles and provide services that meet a wide range of needs. News analysis applications mine this news to reveal valuable information. What they all need is news meta-data, the fundamental element that supports news analysis. Extracting and maintaining news meta-data has therefore become an important and challenging task. In this paper, we present a system specialized for Chinese news meta-data extraction. It identifies 28 kinds of meta-data and provides not only a pipeline to extract them but also a systematic way to manage them. It facilitates organizing and conducting news mining processes and improves efficiency by avoiding duplicated work. More specifically, it introduces an innovative way to categorize news based on how well words represent a category. It also adapts existing methods to extract keywords, entities, and event elements. Integrating our system into news mining applications has demonstrated its value for news analysis.
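
The abstract does not spell out how a word's ability to represent a category is measured, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes that the score of a word for a category is the share of that word's training occurrences that fall in the category, and that an article is assigned the category with the highest summed score. The function names `train_word_scores` and `categorize`, the toy corpus, and the category labels are all hypothetical, and the tokens are assumed to be already segmented.

```python
from collections import Counter, defaultdict


def train_word_scores(labeled_docs):
    """Estimate how strongly each word represents each category.

    labeled_docs: iterable of (tokens, category) pairs (tokens pre-segmented).
    Returns {word: {category: share of the word's occurrences in that category}}.
    This scoring rule is an assumption made for illustration only.
    """
    per_word = defaultdict(Counter)
    totals = Counter()
    for tokens, category in labeled_docs:
        for token in tokens:
            per_word[token][category] += 1
            totals[token] += 1
    return {
        word: {cat: n / totals[word] for cat, n in cats.items()}
        for word, cats in per_word.items()
    }


def categorize(tokens, word_scores):
    """Return the category with the highest summed word score.

    Returns None when none of the tokens was seen during training.
    """
    votes = Counter()
    for token in tokens:
        for cat, score in word_scores.get(token, {}).items():
            votes[cat] += score
    if not votes:
        return None
    return votes.most_common(1)[0][0]


# Hypothetical toy corpus: two finance articles and two sports articles.
training = [
    (["股市", "上涨", "投资"], "finance"),
    (["银行", "投资", "利率"], "finance"),
    (["球队", "比赛", "冠军"], "sports"),
    (["比赛", "上涨", "球员"], "sports"),
]
scores = train_word_scores(training)
print(categorize(["投资", "利率", "比赛"], scores))
```

In this sketch a word that appears evenly across categories (here "上涨") contributes equally to each and so does little to separate them, while a word seen in only one category contributes its full weight to that category.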

Notes

1. https://en.wikipedia.org/wiki/Metadata.
2. When, Who, What, Where, Why, How.
3. http://www.news.cn.
4. http://www.chinanews.com.
5. http://www.people.com.cn.
6. http://www.tencent.com.

Acknowledgement

This work is supported by the 973 Program (No. 2014CB340504), NSFC-ANR (No. 61261130588), the Tsinghua University Initiative Scientific Research Program (No. 20131089256), the THU-NUS NExT Co-Lab, and the National Natural Science Foundation of China (No. 61303075).

Author information

Authors and Affiliations

Knowledge Engineering Group, Department of Computer Science and Technology, Tsinghua University, Beijing, 100084, People’s Republic of China
Junbo Xia, Fei Xie, Mengdi Zhang, Yu Su & Huanbo Luan
Communication Technology Bureau, Xinhua News Agency, Beijing, 100803, China
Junbo Xia, Fei Xie, Mengdi Zhang, Yu Su & Huanbo Luan

Corresponding author

Correspondence to Junbo Xia.

Editor information

Editors and Affiliations

Southeast University, Nanjing, China
Guilin Qi
Osaka University, Ibaraki, Japan
Kouji Kozaki
The University of Aberdeen, Aberdeen, United Kingdom
Jeff Z. Pan
Zhongnan Hospital of Wuhan University, Wuhan, China
Siwei Yu

Cite this paper

Xia, J., Xie, F., Zhang, M., Su, Y., Luan, H. (2016). CNME: A System for Chinese News Meta-Data Extraction. In: Qi, G., Kozaki, K., Pan, J., Yu, S. (eds) Semantic Technology. JIST 2015. Lecture Notes in Computer Science, vol 9544. Springer, Cham. https://doi.org/10.1007/978-3-319-31676-5_7

DOI: https://doi.org/10.1007/978-3-319-31676-5_7
Published: 20 March 2016
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-31675-8
Online ISBN: 978-3-319-31676-5
eBook Packages: Computer Science, Computer Science (R0)
