Privacy BERT-LSTM: a novel NLP algorithm for sensitive information detection in textual documents

  • Original Article
  • Published in: Neural Computing and Applications

Abstract

The growing volume of textual data and the widespread adoption of natural language processing (NLP) techniques make safeguarding sensitive private information a critical challenge, creating a pressing demand for robust and accurate NLP-based techniques that detect sensitive information in text. This paper addresses the detection and classification of sensitive private information in textual documents by proposing a novel algorithm named Privacy BERT-LSTM. The algorithm uses BERT to obtain contextual embeddings and an LSTM to process sequential information. BERT, with its bidirectional encoding, captures the nuances and meaning of the text, while the LSTM captures long-range dependencies in it. An attention mechanism additionally highlights the important regions of a document, which supports the detection step. Performance is evaluated on the SMS Spam Collection dataset using standard metrics and compared with the state-of-the-art techniques CASSED, PRIVAFRAME, CNN-LSTM, Conv-FFD, GCSA, TSIIP, and C-PIIM. In these experiments, Privacy BERT-LSTM achieves an accuracy of 92.50%, an F1-score of 85.02%, and a precision of 89.36%, outperforming the baseline models in identifying various types of sensitive information. The work thereby contributes to privacy protection in NLP applications, opens avenues for future investigation of sensitive information detection, and offers insights for researchers and practitioners working on privacy-sensitive NLP tasks.
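
The abstract does not specify the form of the attention mechanism, so the following is only a minimal sketch of one common choice: dot-product attention pooling over per-token hidden states. The function names, the `query` vector, and the toy values are assumptions made for illustration and are not taken from the paper. In a full model, the hidden states would come from an LSTM run over BERT embeddings and the query vector would be learned.

```python
import math
from typing import List, Sequence, Tuple


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(
    hidden_states: Sequence[Sequence[float]],
    query: Sequence[float],
) -> Tuple[List[float], List[float]]:
    """Collapse per-token hidden states into a single context vector.

    hidden_states: one vector per token (stand-in for LSTM outputs
                   computed over BERT embeddings).
    query:         scoring vector (hypothetical; learned in a real model).
    Returns (attention weights, weighted-sum context vector).
    """
    # Dot-product score of each token state against the query vector.
    scores = [sum(h_i * q_i for h_i, q_i in zip(h, query)) for h in hidden_states]
    # Normalise the scores so the weights sum to one.
    weights = softmax(scores)
    dim = len(hidden_states[0])
    # Context vector: attention-weighted sum of the token states.
    context = [
        sum(w * h[d] for w, h in zip(weights, hidden_states)) for d in range(dim)
    ]
    return weights, context


# Toy values, made up for illustration: three tokens, two-dimensional states.
states = [[0.1, 0.0], [2.0, 1.0], [0.2, 0.1]]
query = [1.0, 1.0]
weights, context = attention_pool(states, query)
```

In this toy input the second token receives the largest weight, which is the sense in which attention "highlights" a region of the text. The pooled context vector would then feed a classification layer that labels the document as sensitive or not.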

Data availability

The data that support the findings of this study are available from the corresponding author upon reasonable request.

References

  1. Ohata EF, Mattos CLC, Gomes SL, Rebouças EDS, Rego PAL (2022) A text classification methodology to assist a large technical support system. IEEE Access 10:108413–108421

  2. Hassan F, Sánchez D, Domingo-Ferrer J (2021) Utility-preserving privacy protection of textual documents via word embeddings. IEEE Trans Knowl Data Eng 35(1):1058–1071

  3. Lynn HM, Kim P, Pan SB (2021) Data independent acquisition based bi-directional deep networks for biometric ECG authentication. Appl Sci 11(3):1125

  4. Khan AR, Yasin A, Usman SM, Hussain S, Khalid S, Ullah SS (2022) Exploring lightweight deep learning solution for malware detection in IoT constraint environment. Electronics 11(24):4147

  5. Gambarelli G, Gangemi A (2022) PRIVAFRAME: a frame-based knowledge graph for sensitive personal data. Big Data Cognit Comput 6(3):90

  6. Zhao M, Fu X, Zhang Y, Meng L, Tang B (2022) Highly imbalanced fault diagnosis of mechanical systems based on wavelet packet distortion and convolutional neural networks. Adv Eng Inform 51:101535

  7. Zhao X, Zhu X, Liu J, Hu Y, Gao T, Zhao L, Yao J, Liu Z (2024) Model-assisted multi-source fusion hypergraph convolutional neural networks for intelligent few-shot fault diagnosis to electro-hydrostatic actuator. Inf Fus 104:102186

  8. Zhao X, Yao J, Deng W, Jia M, Liu Z (2022) Normalized conditional variational auto-encoder with adaptive focal loss for imbalanced fault diagnosis of bearing-rotor system. Mech Syst Signal Process 170:108826

  9. Zhu X, Zhao X, Yao J, Deng W, Shao H, Liu Z (2023) Adaptive multiscale convolution manifold embedding networks for intelligent fault diagnosis of servo motor-cylindrical rolling bearing under variable working conditions. IEEE/ASME Transactions on Mechatronics.

  10. Aubaid AM, Mishra A (2020) A rule-based approach to embedding techniques for text document classification. Appl Sci 10(11):4009

  11. Huo L, Jiang J (2023) Research on intelligent perception algorithm for sensitive information. Appl Sci 13(6):3383

  12. Zhang K, Jiang X (2023) Sensitive data detection with high-throughput machine learning models in electrical health records. arXiv preprint arXiv:2305.03169.

  13. García M, Maldonado S, Vairetti C (2021) Efficient n-gram construction for text categorization using feature selection techniques. Intell Data Anal 25(3):509–525

  14. Barve Y, Saini JR, Pal K, Kotecha K (2022) A novel evolving sentimental bag-of-words approach for feature extraction to detect misinformation. Int J Adv Comput Sci Appl 13(4):266–275

  15. Wang Z, Wang D, Li Q (2021) Keyword extraction from scientific research projects based on SRP-TF-IDF. Chin J Electron 30(4):652–657

  16. Luo X (2021) Efficient English text classification using selected machine learning techniques. Alex Eng J 60(3):3401–3409

  17. Kulkarni P, Cauvery NK (2021) Personally identifiable information (PII) detection in the unstructured large text corpus using natural language processing and unsupervised learning technique. Int J Adv Comput Sci Appl 12(9):508–517

  18. Liu Y, Yang CY, Yang J (2021) A graph convolutional network-based sensitive information detection algorithm. Complexity 2021:1–8

  19. Roslan NIM, Foozy CFM (2022) A comparison of sensitive information detection framework using LSTM and RNN techniques. J Soft Comput Data Min 3(2):92–103

  20. Victor N, Lopez D (2020) Sl-LSTM: a Bi-directional LSTM with stochastic gradient descent optimization for sequence labeling tasks in big data. Int J Grid High Perform Comput (IJGHPC) 12(3):1–16

  21. García-Pablos A, Perez N, Cuadros M (2020) Sensitive data detection and classification in Spanish clinical text: Experiments with BERT. arXiv preprint arXiv:2003.03106.

  22. Guo Y, Liu J, Tang W, Huang C (2021) Exsense: extract sensitive information from unstructured data. Comput Secur 102:102156

  23. Qasim R, Bangyal WH, Alqarni MA, Ali Almazroi A (2022) A fine-tuned BERT-based transfer learning approach for text classification. J Healthc Eng 2022:1–17

  24. Devlin J, Chang MW, Lee K, Toutanova K (2018) BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

  25. Yuan Y, Lin L, Huo LZ, Kong YL, Zhou ZG, Wu B, Jia Y (2020) Using an attention-based LSTM encoder–decoder network for near real-time disturbance detection. IEEE J Sel Top Appl Earth Obs Remote Sens 13:1819–1832

  26. Deng J, Cheng L, Wang Z (2021) Attention-based BiLSTM fused CNN with gating mechanism model for Chinese long text classification. Comput Speech Lang 68:101182

  27. Almeida T, Hidalgo J (2012) SMS spam collection. UCI Mach Learn Repos. https://doi.org/10.24432/C5CC84

  28. Kužina V, Petric AM, Barišić M, Jović A (2023) CASSED: context-based approach for structured sensitive data detection. Expert Syst Appl 223:119924

  29. Butt UA, Amin R, Aldabbas H, Mohan S, Alouffi B, Ahmadian A (2023) Cloud-based email phishing attack using machine and deep learning algorithm. Complex Intell Syst 9(3):3043–3070

  30. Zhang Q, Guo Z, Zhu Y, Vijayakumar P, Castiglione A, Gupta BB (2023) A deep learning-based fast fake news detection model for cyber-physical social services. Pattern Recogn Lett 168:31–38

Funding

Not applicable.

Author information

Corresponding author

Correspondence to Janani Muralitharan.

Ethics declarations

Conflicts of interest

Not applicable.

Consent to participate

Not applicable.

Consent for publication

Not applicable.

Human and animal rights

This article does not contain any studies with human or animal subjects performed by any of the authors.

Informed consent

Informed consent was obtained from all individual participants included in the study.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Muralitharan, J., Arumugam, C. Privacy BERT-LSTM: a novel NLP algorithm for sensitive information detection in textual documents. Neural Comput & Applic (2024). https://doi.org/10.1007/s00521-024-09707-w

  • DOI: https://doi.org/10.1007/s00521-024-09707-w
