CNME: A System for Chinese News Meta-Data Extraction

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 9544)

Abstract

News mining has attracted increasing attention because of the overwhelming volume of news produced every day. Many news portals, such as Sina (http://www.sina.com) and Chinanews (http://www.chinanews.com), develop tools to manage billions of news articles and provide services that meet a wide range of needs. News analysis applications mine this news to reveal valuable information. What they all need is news meta-data, the fundamental element supporting news analysis. Extracting and maintaining news meta-data has therefore become an important and challenging task. In this paper, we present a system specialized for Chinese news meta-data extraction. It can identify 28 kinds of meta-data and provides not only a pipeline to extract them but also a systematic way to manage them. It facilitates the organization and execution of news mining processes and improves efficiency by avoiding duplicated work. More specifically, it introduces an innovative way to categorize news based on each word's ability to represent a category. It also adapts existing methods to extract keywords, entities and event elements. Integrating our system into news mining applications has demonstrated its value for news analysis work.
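The abstract does not spell out how a word's ability to represent a category is measured. As a minimal sketch only, assuming this ability is approximated by the share of a word's occurrences that fall in each category, and assuming articles are already segmented into words, the idea can be illustrated as follows. The function names `train_word_scores` and `categorize` and the toy training data are hypothetical and are not taken from the CNME system.

```python
from collections import Counter, defaultdict


def train_word_scores(labeled_docs):
    """For each word, estimate the share of its occurrences per category.

    labeled_docs: iterable of (tokens, category) pairs, where tokens is a
    list of already-segmented words (segmentation itself is assumed done).
    """
    word_cat_counts = defaultdict(Counter)
    for tokens, category in labeled_docs:
        for tok in tokens:
            word_cat_counts[tok][category] += 1

    scores = {}
    for word, cat_counts in word_cat_counts.items():
        total = sum(cat_counts.values())
        # Assumed proxy for "ability to represent a category".
        scores[word] = {cat: n / total for cat, n in cat_counts.items()}
    return scores


def categorize(tokens, scores):
    """Sum per-category word scores and return the top category, or None."""
    totals = Counter()
    for tok in tokens:
        for cat, score in scores.get(tok, {}).items():
            totals[cat] += score
    if not totals:
        return None  # no known word in the article
    return max(totals, key=totals.get)


# Toy, hand-made training data (hypothetical, for illustration only).
training = [
    (["球队", "比赛", "进球"], "sports"),
    (["比赛", "冠军", "球队"], "sports"),
    (["股市", "银行", "利率"], "finance"),
    (["银行", "比赛", "投资"], "finance"),
]
scores = train_word_scores(training)
label = categorize(["球队", "冠军", "比赛"], scores)
```

In this sketch a word that appears in only one category contributes its full weight to that category, while a word spread across categories contributes proportionally less to each. A real system would likely add smoothing and stop-word filtering.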

Notes

  1. https://en.wikipedia.org/wiki/Metadata.

  2. When, Who, What, Where, Why, How.

  3. http://www.news.cn.

  4. http://www.chinanews.com.

  5. http://www.people.com.cn.

  6. http://www.tencent.com.

Acknowledgement

This work is supported by the 973 Program (No. 2014CB340504), NSFC-ANR (No. 61261130588), the Tsinghua University Initiative Scientific Research Program (No. 20131089256), the THU-NUS NExT Co-Lab, and the National Natural Science Foundation of China (No. 61303075).

Author information

Corresponding author

Correspondence to Junbo Xia.

Copyright information

© 2016 Springer International Publishing Switzerland

About this paper

Cite this paper

Xia, J., Xie, F., Zhang, M., Su, Y., Luan, H. (2016). CNME: A System for Chinese News Meta-Data Extraction. In: Qi, G., Kozaki, K., Pan, J., Yu, S. (eds) Semantic Technology. JIST 2015. Lecture Notes in Computer Science, vol 9544. Springer, Cham. https://doi.org/10.1007/978-3-319-31676-5_7

  • DOI: https://doi.org/10.1007/978-3-319-31676-5_7

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-31675-8

  • Online ISBN: 978-3-319-31676-5

  • eBook Packages: Computer Science, Computer Science (R0)