A study for extracting keywords from data with deep learning and suffix array

Multimedia Tools and Applications

Abstract

A suffix index built on a suffix array can support full-text searches over any data, and its search speed can be improved with a keyword index over the set of keywords extracted from the data. In this article, we attempt to design a method for extracting keywords from data using deep learning and a suffix array. The study starts with Chinese texts because many Chinese word segmentation results are available for performance evaluation. We propose a new method for Chinese word segmentation that combines a neural network with a suffix array of the training data. The suffix array is used to divide long sentences in the input text into short fragments, so that our neural network can segment them more accurately without relying on a context window. Experiments on typical datasets show that the proposed method achieves encouraging precision, recall and \(F_1\) scores compared with existing advanced methods, while avoiding the drawback of a context window. This study provides useful experience for designing a general solution to extract keywords from data using a suffix array.
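To make the idea of full-text search over a suffix array concrete, the following is a minimal Python sketch. It is not the method of the article: the construction here simply sorts all suffixes, which is far slower than the linear-time algorithms used in practice, and the function names `build_suffix_array` and `find_occurrences` are hypothetical names chosen for illustration.

```python
from __future__ import annotations


def build_suffix_array(text: str) -> list[int]:
    """Return the start positions of all suffixes of `text` in lexicographic order.

    Naive construction by sorting the suffixes directly; adequate for a
    small demonstration only.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text: str, sa: list[int], pattern: str) -> list[int]:
    """Return the sorted start positions of `pattern` in `text`.

    Because the suffixes are sorted, all suffixes that begin with `pattern`
    occupy one contiguous range of `sa`; two binary searches locate it.
    """
    m = len(pattern)

    # Lower bound: first suffix whose first m characters are >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo

    # Upper bound: first suffix whose first m characters are > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid

    return sorted(sa[start:lo])


if __name__ == "__main__":
    text = "banana"
    sa = build_suffix_array(text)
    print(sa)                                 # [5, 3, 1, 0, 4, 2]
    print(find_occurrences(text, sa, "ana"))  # [1, 3]
```

Each query costs a logarithmic number of substring comparisons regardless of which substring is searched, which is why a suffix index can answer arbitrary full-text queries; a keyword index, as discussed in the abstract, would serve as a faster path for the subset of queries that match extracted keywords.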

Notes

  1. https://www.elastic.com/cn/blog/elastic-search-7-2-0-released

  2. http://sighan.cs.uchicago.edu/bakeoff2005/

  3. https://dumps.wikimedia.org/zhwiki/20210101/zhwiki-20210101-pages-articles-multistream.xml.bz2

  4. https://radimrehurek.com/gensim/index.html

  5. https://github.com/BYVoid/OpenCC

  6. https://github.com/fxsjy/jieba/

  7. https://catalog.ldc.upenn.edu/LDC2007T36

  8. https://github.com/lancopku/PKUSeg-python

Acknowledgements

This work was funded by the National Natural Science Foundation of China (Grant number 61872391) and the Special Funds for Guangzhou Scientific and Technological Innovation and Development (Grant number 201802010011).

Author information

Corresponding author

Correspondence to Ge Nong.

Additional information

Publisher's note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Xu, W., Nong, G. A study for extracting keywords from data with deep learning and suffix array. Multimed Tools Appl 81, 7419–7437 (2022). https://doi.org/10.1007/s11042-021-11762-7

  • DOI: https://doi.org/10.1007/s11042-021-11762-7
