Privacy Disclosures Detection in Natural-Language Text Through Linguistically-Motivated Artificial Neural Networks

  • Conference paper

Abstract

An increasing number of people share information through text messages, emails, and social media without proper privacy checks. In many situations, this can lead to serious privacy threats. This paper presents a methodology for providing extra safety precautions without being intrusive to users. We have developed and evaluated a model that helps users take control of their shared information by automatically identifying text (i.e., a sentence or a transcribed utterance) that might contain personal or private disclosures. We apply off-the-shelf natural language processing tools to derive linguistic features such as part-of-speech tags, syntactic dependencies, and entity relations. From these features, we build and train a multichannel convolutional neural network as a classifier to identify short texts that contain personal, private disclosures. We show how our model can notify users when a piece of text discloses personal or private information, and we evaluate our approach on a binary classification task, achieving 93% accuracy on our own labeled dataset and 86% on a ground-truth dataset. Unlike document-level classification approaches in natural language processing, our framework is designed to take sentence-level context into account.
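To make the idea of multichannel input more concrete, the following is a minimal sketch of how one sentence could be encoded into parallel, aligned integer sequences, one per linguistic feature, before being passed to a multichannel network. It is an illustration under stated assumptions, not the authors' implementation: the helper names (`build_vocab`, `encode`, `to_channels`), the choice of three channels, the padding scheme, and the hand-annotated tags (which in practice would come from an off-the-shelf NLP tool) are all hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical hand-annotated sentences: (token, part-of-speech, dependency label).
# In practice these tags would be produced by an off-the-shelf NLP tool.
Annotated = List[Tuple[str, str, str]]

PAD, UNK = 0, 1  # reserved ids for padding and unseen symbols


def build_vocab(sequences: List[List[str]]) -> Dict[str, int]:
    """Map each distinct symbol to an integer id, starting after PAD and UNK."""
    vocab: Dict[str, int] = {}
    for seq in sequences:
        for symbol in seq:
            if symbol not in vocab:
                vocab[symbol] = len(vocab) + 2
    return vocab


def encode(seq: List[str], vocab: Dict[str, int], max_len: int) -> List[int]:
    """Convert symbols to ids, truncate to max_len, and right-pad with PAD."""
    ids = [vocab.get(symbol, UNK) for symbol in seq[:max_len]]
    return ids + [PAD] * (max_len - len(ids))


def to_channels(sentences: List[Annotated], max_len: int) -> Dict[str, List[List[int]]]:
    """Split annotated sentences into aligned word, POS, and dependency channels."""
    channels: Dict[str, List[List[int]]] = {}
    for i, name in enumerate(("word", "pos", "dep")):
        # Lowercase only the word channel; tag labels are kept as-is.
        raw = [[tok[i].lower() if name == "word" else tok[i] for tok in sent]
               for sent in sentences]
        vocab = build_vocab(raw)  # each channel gets its own vocabulary
        channels[name] = [encode(seq, vocab, max_len) for seq in raw]
    return channels


sentences = [
    [("I", "PRON", "nsubj"), ("have", "VERB", "ROOT"), ("diabetes", "NOUN", "dobj")],
    [("The", "DET", "det"), ("hotel", "NOUN", "nsubj"), ("was", "AUX", "ROOT"),
     ("clean", "ADJ", "acomp")],
]
channels = to_channels(sentences, max_len=5)
for name, rows in channels.items():
    print(name, rows)
```

Each channel has the same shape, so position *k* in the word sequence lines up with position *k* in the part-of-speech and dependency sequences. A multichannel classifier could then embed and convolve each channel separately before merging them, though the exact architecture used in the paper is not specified here.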

Notes

  1. It is worth mentioning that the accuracy fluctuates only slightly as we change the number of neurons in these layers. This is plausible because the layer might need more neurons to model non-linearity better only when it sees relatively more data.

  2. https://anonymous.4open.science/repository/3c84ab7b-02ce-4fd7-b982-f278d6f3c4f4/

Acknowledgments

The authors would like to thank the National Science Foundation for its support through the Computer and Information Science and Engineering (CISE) program and Research Initiation Initiative (CRII) grant number 1657774 of the Secure and Trustworthy Cyberspace (SaTC) program: A System for Privacy Management in Ubiquitous Environments.

Author information

Corresponding author

Correspondence to Hoda Mehrpouyan.

Copyright information

© 2019 ICST Institute for Computer Sciences, Social Informatics and Telecommunications Engineering

About this paper

Cite this paper

Mehdy, N., Kennington, C., Mehrpouyan, H. (2019). Privacy Disclosures Detection in Natural-Language Text Through Linguistically-Motivated Artificial Neural Networks. In: Li, J., Liu, Z., Peng, H. (eds) Security and Privacy in New Computing Environments. SPNCE 2019. Lecture Notes of the Institute for Computer Sciences, Social Informatics and Telecommunications Engineering, vol 284. Springer, Cham. https://doi.org/10.1007/978-3-030-21373-2_14

  • DOI: https://doi.org/10.1007/978-3-030-21373-2_14

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-21372-5

  • Online ISBN: 978-3-030-21373-2

  • eBook Packages: Computer Science, Computer Science (R0)
