Abstract
The amount of data produced every day is enormous. According to Forbes, 2.5 quintillion bytes of data are created daily (Marr, 2018). The volume of unstructured data is also multiplying daily, forcing organizations to spend significant time, effort, and money to manage and govern their data assets. This volume of unstructured data also creates data privacy challenges in handling and auditing the data and in meeting the requirements set by governing bodies and regulations such as governments, auditors, and data protection, legislative, and federal laws, including the General Data Protection Regulation (GDPR), the Basel Committee on Banking Supervision (BCBS), the Health Insurance Portability and Accountability Act (HIPAA), and the California Consumer Privacy Act (CCPA).
Organizations must set up a robust data protection framework and governance to identify, classify, protect, and monitor the sensitive data residing in unstructured data sources. Data discovery and classification is the process of scanning the organization’s data sources, both structured and unstructured, that could potentially contain sensitive or regulated data.
Most organizations use various data discovery and classification tools to scan structured and unstructured sources. However, gaps in scanning and discovering sensitive data elements in unstructured sources prevent these organizations from meeting their overall privacy and protection needs. Hence, they resort to manual methods to fill these gaps.
The main objective of this study is to build a solution that systematically scans an unstructured data source, detects the sensitive data elements, automatically classifies them according to the data classification categories, and visualizes the results on a dashboard. This solution uses Machine Learning (ML) and Natural Language Processing (NLP) techniques to detect the sensitive data elements contained in unstructured data sources. It can be used as a first step before performing data encryption, tokenization, anonymization, and masking as part of the overall data protection journey.
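The scan, detect, and classify flow described above can be illustrated with a minimal sketch. It is not the authors' method: simple regular expressions stand in for the ML/NLP detectors, and the pattern names, the classification levels (`Public`, `Internal`, `Confidential`, `Restricted`), and the function names are all hypothetical choices made for illustration.

```python
import re
from collections import Counter

# Hypothetical classification levels, ordered from least to most sensitive.
LEVELS = ["Public", "Internal", "Confidential", "Restricted"]

# Hypothetical detectors: regexes stand in for the ML/NLP models of the paper.
# Each entry maps a sensitive data type to (pattern, classification level).
PATTERNS = {
    "EMAIL": (
        re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b"),
        "Confidential",
    ),
    "US_SSN": (re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), "Restricted"),
}


def scan_text(text):
    """Return one finding per match of any detector in the text."""
    findings = []
    for data_type, (pattern, level) in PATTERNS.items():
        for match in pattern.finditer(text):
            findings.append(
                {"type": data_type, "value": match.group(), "level": level}
            )
    return findings


def classify(findings):
    """Label a document with the most sensitive level among its findings."""
    if not findings:
        return "Public"
    return max((f["level"] for f in findings), key=LEVELS.index)


def summarize(documents):
    """Scan a mapping of document name to text and build a report
    that a dashboard could consume (label plus counts per data type)."""
    report = {}
    for name, text in documents.items():
        findings = scan_text(text)
        report[name] = {
            "label": classify(findings),
            "counts": dict(Counter(f["type"] for f in findings)),
        }
    return report


if __name__ == "__main__":
    sample_documents = {
        "memo.txt": "Contact jane.doe@example.com about the offsite.",
        "hr.txt": "SSN 123-45-6789 on file for jane.doe@example.com",
        "note.txt": "Lunch at noon.",
    }
    for name, result in summarize(sample_documents).items():
        print(name, result)
```

Labelling each document with the highest level among its findings is one simple policy. A real deployment would likely replace the regexes with trained entity recognizers and feed the report to a visualization layer.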
References
Bonta, R.: California Consumer Privacy Act (CCPA). Retrieved from State of California Department of Justice: https://oag.ca.gov/privacy/ccpa (2022)
Cha, S.-C., Yeh, K.-H.: A Data-Driven Security Risk Assessment Scheme for Personal Data Protection. IEEE, pp. 50510–50517 (2018)
David, D.: AI Unleashes the Power of Unstructured Data. Retrieved from CIO (2019, July 9). https://www.cio.com/article/3406806/ai-unleashes-the-power-of-unstructured-data.html
Gartner Top Strategic Technology Trends for 2022. (2022). Retrieved from Gartner: https://www.gartner.com/en/information-technology/insights/top-technology-trends
Goswami, S.: The Rising Concern Around Consumer Data And Privacy. Retrieved from Forbes (2020, December 14). https://www.forbes.com/sites/forbestechcouncil/2020/12/14/the-rising-concern-around-consumer-data-and-privacy/?sh=30741b43487e
Hill, M.: The 12 biggest data breach fines, penalties, and settlements so far. Retrieved from CSO (2022, August 16). https://www.csoonline.com/article/3410278/the-biggest-data-breach-fines-penalties-and-settlements-so-far.html
Kulkarni, R.: Big Data Goes Big. Retrieved from Forbes (2019, February 7). https://www.forbes.com/sites/rkulkarni/2019/02/07/big-data-goes-big/?sh=278b2aa820d7
Marr, B.: How Much Data Do We Create Every Day? The Mind-Blowing Stats Everyone Should Read. Retrieved from Forbes (2018, May 21). https://www.forbes.com/sites/bernardmarr/2018/05/21/how-much-data-do-we-create-every-day-the-mind-blowing-stats-everyone-should-read/?sh=4e4f805860ba
Mehmood, A., Natgunanathan, I., Xiang, Y., Hua, G., Guo, S.: Protection of Big Data Privacy. IEEE, pp. 1821–1834 (2016)
Office for Civil Rights (OCR). (2022, January 19). Your Rights Under HIPAA. Retrieved from HHS.gov: https://www.hhs.gov/hipaa/for-individuals/guidance-materials-for-consumers/index.html
Steele, K.: A Guide to Types of Sensitive Information. Retrieved from BigID (2021, November 3). https://bigid.com/blog/sensitive-information-guide/
Truong, N.B., Sun, K., Lee, G.M., Guo, Y.: GDPR-Compliant Personal Data Management: A Blockchain-Based Solution. IEEE, pp. 1746–1761 (2019)
What Is Data Management? (2022). Retrieved from OCI: https://www.oracle.com/database/what-is-data-management/
Wolford, B.: What is GDPR, the EU’s new data protection law? Retrieved from GDPR.EU (2020). https://gdpr.eu/what-is-gdpr/
Xu, L., Jiang, C., Wang, J., Yuan, J., Ren, Y.: Information Security in Big Data: Privacy and Data Mining. IEEE, pp. 1149–1176 (2014)
Yaqoob, I., Salah, K., Jayaraman, R., Al-Hammadi, Y.: Blockchain for healthcare data management: opportunities, challenges, and future recommendations. Springer Link, pp. 11475–11490 (2022)
Zhang, X., et al.: MRMondrian: Scalable Multidimensional Anonymisation for Big Data Privacy Preservation. IEEE, pp. 125–139 (2017)
Copyright information
© 2023 ICST Institute for Computer Sciences, Social Informatics and Telecommunications Engineering
About this paper
Cite this paper
Ponde, S., Kulkarni, A., Agarwal, R. (2023). AI/ML Based Sensitive Data Discovery and Classification of Unstructured Data Sources. In: Nandan Mohanty, S., Garcia Diaz, V., Satish Kumar, G.A.E. (eds) Intelligent Systems and Machine Learning. ICISML 2022. Lecture Notes of the Institute for Computer Sciences, Social Informatics and Telecommunications Engineering, vol 471. Springer, Cham. https://doi.org/10.1007/978-3-031-35081-8_31
DOI: https://doi.org/10.1007/978-3-031-35081-8_31
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-35080-1
Online ISBN: 978-3-031-35081-8
eBook Packages: Computer Science, Computer Science (R0)