A study for extracting keywords from data with deep learning and suffix array

Xu, Wentao; Nong, Ge

doi:10.1007/s11042-021-11762-7

Published: 26 January 2022

Multimedia Tools and Applications, volume 81, pages 7419–7437 (2022)

Abstract

While a suffix index built on a suffix array supports full-text search over arbitrary data, searches can be accelerated with a keyword index over the set of keywords extracted from the data. This article attempts to design a method for extracting keywords from data using deep learning and a suffix array. The study starts with Chinese text because many Chinese word segmentation results are available for performance evaluation. We propose a new Chinese word segmentation method that combines a neural network with a suffix array of the training data. The suffix array is used to divide long sentences in the input text into short fragments, which our neural network then segments without a context window. Experiments on typical datasets show that the proposed method achieves encouraging precision, recall and \(F_1\) score compared with other advanced methods, while avoiding the drawback of a context window. The study offers useful experience for designing a general solution to extract keywords from data using a suffix array.
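
For readers unfamiliar with the data structure, the following is a minimal sketch of how a suffix array can support full-text search. It is an illustration only and not the authors' implementation: the function names `build_suffix_array` and `find_occurrences` are hypothetical, and the construction simply sorts all suffixes, which is far slower than the linear-time algorithms used in practice.

```python
def build_suffix_array(text):
    """Return the start positions of all suffixes of `text`, in lexicographic order.

    Naive construction by direct sorting; assumed here only for brevity.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text, sa, pattern):
    """Return every position where `pattern` occurs in `text`, in ascending order.

    Suffixes that start with `pattern` form one contiguous block of the
    suffix array, so two binary searches locate the block boundaries.
    """
    m = len(pattern)

    # First binary search: leftmost suffix whose m-character prefix is >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo

    # Second binary search: leftmost suffix whose m-character prefix is > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid

    return sorted(sa[start:lo])


text = "banana"
sa = build_suffix_array(text)
print(sa)                                # [5, 3, 1, 0, 4, 2]
print(find_occurrences(text, sa, "ana"))  # [1, 3]
```

Because Python strings are sequences of Unicode code points, the same sketch works unchanged on Chinese text. Each query costs O(m log n) character comparisons for a pattern of length m over a text of length n, which is the baseline that a separate keyword index aims to improve on.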

Notes

https://www.elastic.com/cn/blog/elastic-search-7-2-0-released
http://sighan.cs.uchicago.edu/bakeoff2005/
https://dumps.wikimedia.org/zhwiki/20210101/zhwiki-20210101-pages-articles-multistream.xml.bz2
https://radimrehurek.com/gensim/index.html
https://github.com/BYVoid/OpenCC
https://github.com/fxsjy/jieba/
https://catalog.ldc.upenn.edu/LDC2007T36
https://github.com/lancopku/PKUSeg-python

Acknowledgements

This work was funded by the National Natural Science Foundation of China (Grant number 61872391) and the Special Funds for Guangzhou Scientific and Technological Innovation and Development (Grant number 201802010011).

Author information

Authors and Affiliations

School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou, China
Wentao Xu & Ge Nong

Corresponding author

Correspondence to Ge Nong.

Additional information

Publisher's note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Xu, W., Nong, G. A study for extracting keywords from data with deep learning and suffix array. Multimed Tools Appl 81, 7419–7437 (2022). https://doi.org/10.1007/s11042-021-11762-7

Received: 09 September 2020
Revised: 09 September 2021
Accepted: 25 November 2021
Published: 26 January 2022
Issue Date: February 2022
DOI: https://doi.org/10.1007/s11042-021-11762-7
