Cluster Computing, Volume 22, Supplement 4, pp 9977–9990

Research on Tibetan hot words, sensitive words tracking and public opinion classification

  • Guixian Xu
  • Changzhi Wang
  • Haishen Yao
  • Qi Qi


The rapid development of Tibetan information technology provides rich resources for Tibetan information processing. Building a Tibetan corpus is basic work in the field of Tibetan information processing. In this paper, we design a system for Tibetan web data collection and web page preprocessing. It accesses web resources in a timely and efficient manner and provides a basis for further analysis of Tibetan data. It can be used to build a Tibetan corpus and enrich Tibetan digital resources. It can also alleviate the sparsity of Tibetan corpus data and the lack of resources, making Tibetan information processing more convenient. Hot words reflect the topics that attract Tibetan people's attention in a given period. First, the paper proposes a method for reducing the dimension of the Tibetan news text space, which effectively reduces the complexity of subsequent processing. Second, a term weighting method based on an improved TFIDF is proposed for Tibetan text information extraction; it assigns different weights to words in different positions in order to extract hot words. For sensitive word discovery and public opinion classification, a sensitive-word thesaurus is collected manually, and sensitive words are extracted by comparing the text against this thesaurus. Public opinion classification is based on the proposed classification formula and a public opinion thesaurus, and assigns each Tibetan text to one public opinion class. Software is developed to automatically collect Tibetan web pages, preprocess them, extract text features and hot words, discover sensitive words, and classify each Tibetan text into one public opinion class. The experiments show that the Tibetan hot word extraction is effective and that the public opinion classification results are significant.
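As a rough illustration of the position-weighted term weighting and thesaurus lookup summarized above, the following minimal Python sketch may help. It is not the paper's formula, which this abstract does not state. The position names, the weight values in `POSITION_WEIGHTS`, the smoothed IDF variant, and the function names `weighted_tf`, `hot_words`, and `sensitive_hits` are all assumptions made for the example. Input tokens are assumed to be already segmented.

```python
import math
from collections import Counter

# Hypothetical position weights: the abstract only says that words in
# different locations receive different weights, not what the values are.
POSITION_WEIGHTS = {"title": 3.0, "body": 1.0}


def weighted_tf(doc):
    """Position-weighted term frequency for one document.

    `doc` maps a position name ("title", "body") to a list of tokens
    that are assumed to be already segmented.
    """
    tf = Counter()
    for position, tokens in doc.items():
        weight = POSITION_WEIGHTS.get(position, 1.0)
        for token in tokens:
            tf[token] += weight
    return tf


def hot_words(docs, top_k=3):
    """Rank terms by summed position-weighted TF times a smoothed IDF."""
    n_docs = len(docs)

    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter()
    for doc in docs:
        seen = set()
        for tokens in doc.values():
            seen.update(tokens)
        doc_freq.update(seen)

    scores = Counter()
    for doc in docs:
        for term, tf in weighted_tf(doc).items():
            # Smoothed IDF (an assumed, commonly used variant).
            idf = math.log((n_docs + 1) / (doc_freq[term] + 1)) + 1.0
            scores[term] += tf * idf
    return [term for term, _ in scores.most_common(top_k)]


def sensitive_hits(doc, thesaurus):
    """Return the tokens of `doc` found in a hand-collected thesaurus."""
    found = set()
    for tokens in doc.values():
        found.update(t for t in tokens if t in thesaurus)
    return sorted(found)


if __name__ == "__main__":
    # Placeholder tokens stand in for segmented Tibetan words.
    docs = [
        {"title": ["a"], "body": ["a", "b", "c"]},
        {"title": ["d"], "body": ["b", "c"]},
    ]
    print(hot_words(docs, top_k=2))
    print(sensitive_hits(docs[0], {"b", "z"}))
```

In this sketch a term that appears in a title counts three times as much as the same term in the body, so title terms that are rare across the collection rise to the top of the ranking. The thesaurus lookup is a plain set-membership test.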


Web crawler, Tibetan hot words, Term weight computing, Sensitive words discovery, Text classification



This work was supported by the Beijing Social Science Foundation (No. 14WYB040), the First Class University and First Class Discipline construction funds of Minzu University of China (No. 2017MDYL12), the National Natural Science Foundation of China (No. 61309012), and the National Key Technology Research and Development Program of the Ministry of Science and Technology of China (No. 2014BAK10B03).



Copyright information

© Springer Science+Business Media, LLC 2017

Authors and Affiliations

  1. Information Engineering College, Minzu University of China, Beijing, China
