Privacy BERT-LSTM: a novel NLP algorithm for sensitive information detection in textual documents

Muralitharan, Janani; Arumugam, Chandrasekar

doi:10.1007/s00521-024-09707-w

Original Article
Published: 16 May 2024

Neural Computing and Applications

Janani Muralitharan¹ &
Chandrasekar Arumugam²

Abstract

In the modern digital era, the growing volume of textual data and the widespread adoption of natural language processing (NLP) techniques have made safeguarding sensitive private information a critical challenge. As a result, there is a pressing demand for robust and accurate NLP-based techniques that detect sensitive information in textual data efficiently. This paper addresses the detection and classification of sensitive private information in textual documents by proposing a novel algorithm named Privacy BERT-LSTM. The algorithm employs BERT to obtain contextual embeddings and an LSTM to process sequential information. BERT, with its bidirectional characteristics, captures the nuances and meaning of the text, while the LSTM captures long-range dependencies. In addition, an attention mechanism highlights the important regions of a document, which further supports detection. Performance is evaluated on the SMS Spam Collection dataset using standard metrics and compared with several state-of-the-art techniques, namely CASSED, PRIVAFRAME, CNN-LSTM, Conv-FFD, GCSA, TSIIP, and C-PIIM. The experimental results show that Privacy BERT-LSTM outperforms these baselines in identifying various types of sensitive information, achieving an accuracy of 92.50%, an F1-score of 85.02%, and a precision of 89.36%. This research therefore contributes to privacy protection in NLP applications, opens avenues for future work on sensitive information detection, and offers insights for researchers and practitioners working on privacy-sensitive NLP tasks.
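
The abstract reports precision and F1-score but not recall. As a rough sketch, assuming both figures describe the same class under the same averaging scheme (the abstract does not say), recall can be recovered from the identity F1 = 2PR / (P + R). The helper name recall_from_precision_f1 is hypothetical, and the resulting value is an inference from the reported numbers, not a figure taken from the paper.

```python
def recall_from_precision_f1(precision: float, f1: float) -> float:
    """Solve F1 = 2PR / (P + R) for R, given P and F1 as fractions in (0, 1]."""
    denominator = 2 * precision - f1
    if denominator <= 0:
        # No positive recall can satisfy the identity when F1 >= 2P.
        raise ValueError("F1 must be smaller than twice the precision")
    return f1 * precision / denominator


# Values reported in the abstract, converted from percentages to fractions.
precision = 0.8936
f1 = 0.8502

# Assumption: precision and F1 refer to the same class and averaging scheme.
recall = recall_from_precision_f1(precision, f1)
print(f"implied recall: {recall:.4f}")  # implied recall: 0.8108
```

Under those assumptions the implied recall is about 81%, lower than the reported precision. If that holds, missed sensitive items would be more common than false alarms.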

Data availability

The data that support the findings of this study are available from the corresponding author upon reasonable request.

References

Ohata EF, Mattos CLC, Gomes SL, Rebouças EDS, Rego PAL (2022) A text classification methodology to assist a large technical support system. IEEE Access 10:108413–108421
Hassan F, Sánchez D, Domingo-Ferrer J (2021) Utility-preserving privacy protection of textual documents via word embeddings. IEEE Trans Knowl Data Eng 35(1):1058–1071
Lynn HM, Kim P, Pan SB (2021) Data independent acquisition based bi-directional deep networks for biometric ECG authentication. Appl Sci 11(3):1125
Khan AR, Yasin A, Usman SM, Hussain S, Khalid S, Ullah SS (2022) Exploring lightweight deep learning solution for malware detection in IoT constraint environment. Electronics 11(24):4147
Gambarelli G, Gangemi A (2022) PRIVAFRAME: a frame-based knowledge graph for sensitive personal data. Big Data Cognit Comput 6(3):90
Zhao M, Fu X, Zhang Y, Meng L, Tang B (2022) Highly imbalanced fault diagnosis of mechanical systems based on wavelet packet distortion and convolutional neural networks. Adv Eng Inform 51:101535
Zhao X, Zhu X, Liu J, Hu Y, Gao T, Zhao L, Yao J, Liu Z (2024) Model-assisted multi-source fusion hypergraph convolutional neural networks for intelligent few-shot fault diagnosis to electro-hydrostatic actuator. Inf Fus 104:102186
Zhao X, Yao J, Deng W, Jia M, Liu Z (2022) Normalized conditional variational auto-encoder with adaptive focal loss for imbalanced fault diagnosis of bearing-rotor system. Mech Syst Signal Process 170:108826
Zhu X, Zhao X, Yao J, Deng W, Shao H, Liu Z (2023) Adaptive multiscale convolution manifold embedding networks for intelligent fault diagnosis of servo motor-cylindrical rolling bearing under variable working conditions. IEEE/ASME Trans Mechatron
Aubaid AM, Mishra A (2020) A rule-based approach to embedding techniques for text document classification. Appl Sci 10(11):4009
Huo L, Jiang J (2023) Research on intelligent perception algorithm for sensitive information. Appl Sci 13(6):3383
Zhang K, Jiang X (2023) Sensitive data detection with high-throughput machine learning models in electrical health records. arXiv preprint arXiv:2305.03169
García M, Maldonado S, Vairetti C (2021) Efficient n-gram construction for text categorization using feature selection techniques. Intell Data Anal 25(3):509–525
Barve Y, Saini JR, Pal K, Kotecha K (2022) A novel evolving sentimental bag-of-words approach for feature extraction to detect misinformation. Int J Adv Comput Sci Appl 13(4):266–275
Wang Z, Wang D, Li Q (2021) Keyword extraction from scientific research projects based on SRP-TF-IDF. Chin J Electron 30(4):652–657
Luo X (2021) Efficient English text classification using selected machine learning techniques. Alex Eng J 60(3):3401–3409
Kulkarni P, Cauvery NK (2021) Personally identifiable information (PII) detection in the unstructured large text corpus using natural language processing and unsupervised learning technique. Int J Adv Comput Sci Appl 12(9):508–517
Liu Y, Yang CY, Yang J (2021) A graph convolutional network-based sensitive information detection algorithm. Complexity 2021:1–8
Roslan NIM, Foozy CFM (2022) A comparison of sensitive information detection framework using LSTM and RNN techniques. J Soft Comput Data Min 3(2):92–103
Victor N, Lopez D (2020) Sl-LSTM: a Bi-directional LSTM with stochastic gradient descent optimization for sequence labeling tasks in big data. Int J Grid High Perform Comput (IJGHPC) 12(3):1–16
García-Pablos A, Perez N, Cuadros M (2020) Sensitive data detection and classification in Spanish clinical text: experiments with BERT. arXiv preprint arXiv:2003.03106
Guo Y, Liu J, Tang W, Huang C (2021) Exsense: extract sensitive information from unstructured data. Comput Secur 102:102156
Qasim R, Bangyal WH, Alqarni MA, Ali Almazroi A (2022) A fine-tuned BERT-based transfer learning approach for text classification. J Healthc Eng 2022:1–17
Devlin J, Chang MW, Lee K, Toutanova K (2018) BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805
Yuan Y, Lin L, Huo LZ, Kong YL, Zhou ZG, Wu B, Jia Y (2020) Using an attention-based LSTM encoder–decoder network for near real-time disturbance detection. IEEE J Sel Top Appl Earth Obs Remote Sens 13:1819–1832
Deng J, Cheng L, Wang Z (2021) Attention-based BiLSTM fused CNN with gating mechanism model for Chinese long text classification. Comput Speech Lang 68:101182
Almeida T, Hidalgo J (2012) SMS spam collection. UCI Mach Learn Repos. https://doi.org/10.24432/C5CC84
Kužina V, Petric AM, Barišić M, Jović A (2023) CASSED: context-based approach for structured sensitive data detection. Expert Syst Appl 223:119924
Butt UA, Amin R, Aldabbas H, Mohan S, Alouffi B, Ahmadian A (2023) Cloud-based email phishing attack using machine and deep learning algorithm. Complex Intell Syst 9(3):3043–3070
Zhang Q, Guo Z, Zhu Y, Vijayakumar P, Castiglione A, Gupta BB (2023) A deep learning-based fast fake news detection model for cyber-physical social services. Pattern Recogn Lett 168:31–38

Funding

Not applicable.

Author information

Authors and Affiliations

Department of Information Technology, St. Joseph’s College of Engineering, Chennai, Tamil Nadu, India
Janani Muralitharan
Department of Computer Science and Engineering, St. Joseph’s College of Engineering, Chennai, Tamil Nadu, India
Chandrasekar Arumugam

Corresponding author

Correspondence to Janani Muralitharan.

Ethics declarations

Conflicts of interest

Not applicable.

Consent to participate

Not applicable.

Consent for publication

Not applicable.

Human and animal rights

This article does not contain any studies with human or animal subjects performed by any of the authors.

Informed consent

Informed consent was obtained from all individual participants included in the study.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Muralitharan, J., Arumugam, C. Privacy BERT-LSTM: a novel NLP algorithm for sensitive information detection in textual documents. Neural Comput & Applic (2024). https://doi.org/10.1007/s00521-024-09707-w

Received: 22 August 2023
Accepted: 25 March 2024
Published: 16 May 2024
DOI: https://doi.org/10.1007/s00521-024-09707-w
