Abstract
Subject words give a compact indication of what a text is about, and an automatic summary conveys its theme and core content. This paper studies a multi-feature fusion algorithm for subject word extraction and summary generation on Tibetan web text. First, Tibetan web pages are collected and preprocessed to extract their useful content. Second, the BCCF word segmentation algorithm is used to segment the text into words. A multi-feature fusion algorithm is then proposed to extract the subject words: it combines several factors, such as a word's frequency, length, and type, to compute each word's weight and select the subject words of the text. For summary generation, a sentence weighting algorithm is designed in terms of word frequency, position, and other features; the summary is formed by computing the sentence weights and removing redundant sentences. Experiments show that the multi-feature fusion algorithm achieves good results for both subject word extraction and summary generation. The work contributes to research on Tibetan information processing.
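The two weighting steps described in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the coefficients `ALPHA`, `BETA`, `GAMMA`, the `TYPE_WEIGHT` table, the lead-sentence bonus, and the Jaccard-overlap redundancy test are all assumptions made for illustration, since the abstract does not give the actual formulas. Input is assumed to be already segmented into tokens.

```python
from collections import Counter

# Hypothetical fusion coefficients; the abstract does not state the real values.
ALPHA, BETA, GAMMA = 0.6, 0.2, 0.2
# Hypothetical weights per word type (e.g. taken from a part-of-speech tag).
TYPE_WEIGHT = {"noun": 1.0, "verb": 0.6, "other": 0.2}


def word_weights(tokens, types):
    """Fuse normalized frequency, length and type into one weight per word."""
    freq = Counter(tokens)
    max_freq = max(freq.values())
    max_len = max(len(w) for w in freq)
    return {
        w: ALPHA * freq[w] / max_freq
        + BETA * len(w) / max_len
        + GAMMA * TYPE_WEIGHT.get(types.get(w, "other"), TYPE_WEIGHT["other"])
        for w in freq
    }


def subject_words(tokens, types, k=3):
    """Return the k highest-weighted words (ties broken alphabetically)."""
    weights = word_weights(tokens, types)
    return sorted(weights, key=lambda w: (-weights[w], w))[:k]


def summarize(sentences, types, n=2, max_overlap=0.5):
    """Pick up to n sentence indices by weight, skipping redundant ones.

    sentences: list of token lists. Redundancy is approximated here by
    Jaccard overlap of token sets, which is an assumption of this sketch.
    """
    tokens = [w for sent in sentences for w in sent]
    weights = word_weights(tokens, types)
    scored = []
    for i, sent in enumerate(sentences):
        if not sent:
            continue  # ignore empty sentences
        content = sum(weights[w] for w in sent) / len(sent)
        position_bonus = 0.1 if i == 0 else 0.0  # assumed lead-sentence bonus
        scored.append((content + position_bonus, i))

    chosen = []
    for _, i in sorted(scored, key=lambda t: (-t[0], t[1])):
        current = set(sentences[i])
        redundant = any(
            len(current & set(sentences[j])) / len(current | set(sentences[j]))
            > max_overlap
            for j in chosen
        )
        if not redundant:
            chosen.append(i)
        if len(chosen) == n:
            break
    return sorted(chosen)  # restore document order
```

Calling `subject_words(tokens, types, k)` on a segmented document returns its top-weighted words, and `summarize(sentences, types, n)` returns the indices of the selected sentences in document order. Normalizing each feature to the range 0 to 1 before fusing keeps any single feature from dominating purely because of its scale.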
Acknowledgements
This work was supported by the Beijing Social Science Foundation (No. 14WYB040), the First-Class University and First-Class Discipline construction funds of Minzu University of China (No. 2017MDYL12), the National Key Technology Research and Development Program of the Ministry of Science and Technology of China (No. 2014BAK10B03), and the National Natural Science Foundation of China (Nos. 61309012 and 61331013).
Cite this article
Xu, GX., Yao, HS. & Wang, C. Research on multi-feature fusion algorithm for subject words extraction and summary generation of text. Cluster Comput 22 (Suppl 5), 10883–10895 (2019). https://doi.org/10.1007/s10586-017-1219-3
DOI: https://doi.org/10.1007/s10586-017-1219-3