Privacy Disclosures Detection in Natural-Language Text Through Linguistically-Motivated Artificial Neural Networks

  • Nuhil Mehdy
  • Casey Kennington
  • Hoda Mehrpouyan
Conference paper
Part of the Lecture Notes of the Institute for Computer Sciences, Social Informatics and Telecommunications Engineering book series (LNICST, volume 284)

Abstract

An increasing number of people share information through text messages, emails, and social media without proper privacy checks, which in many situations can lead to serious privacy threats. This paper presents a methodology for providing extra safety precautions without being intrusive to users. We have developed and evaluated a model that helps users take control of their shared information by automatically identifying text (i.e., a sentence or a transcribed utterance) that might contain personal or private disclosures. We apply off-the-shelf natural language processing tools to derive linguistic features such as part-of-speech tags, syntactic dependencies, and entity relations. From these features, we build and train a multichannel convolutional neural network as a classifier that identifies short texts containing personal or private disclosures. We show how our model can notify users when a piece of text discloses personal or private information, and we evaluate our approach on a binary classification task, achieving 93% accuracy on our own labeled dataset and 86% on a ground-truth dataset. Unlike document classification tasks in natural language processing, our framework is designed to take sentence-level context into account.
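To make the idea of a multichannel convolutional classifier concrete, the following is a minimal sketch of a forward pass only, written in plain Python. It is not the authors' implementation: the abstract does not give the architecture, so the choice of three channels (words, part-of-speech tags, dependency labels), the embedding size, the filter count, and the kernel width are all assumptions, and the names `Channel`, `conv1d_maxpool`, and `classify` are hypothetical. The tags are written by hand here, standing in for the output of an off-the-shelf NLP tool, and the weights are random, so the score carries no meaning until the model is trained.

```python
import math
import random


def make_embedding(vocab, dim, rng):
    """Random (untrained) embedding table: one vector per symbol."""
    return {sym: [rng.uniform(-0.5, 0.5) for _ in range(dim)] for sym in vocab}


def conv1d_maxpool(seq_vectors, filters, bias):
    """Slide each filter over the sequence, apply ReLU, keep the max response."""
    width = len(filters[0])  # kernel width, in tokens
    pooled = []
    for f, b in zip(filters, bias):
        best = 0.0  # ReLU floor; also the result when the input is too short
        for start in range(len(seq_vectors) - width + 1):
            window = seq_vectors[start:start + width]
            act = b + sum(w * x
                          for w_row, x_row in zip(f, window)
                          for w, x in zip(w_row, x_row))
            best = max(best, act)
        pooled.append(best)
    return pooled


class Channel:
    """One input channel: embedding lookup, 1D convolution, global max pooling."""

    def __init__(self, vocab, dim, n_filters, width, rng):
        self.emb = make_embedding(vocab, dim, rng)
        self.unk = [0.0] * dim  # vector used for unseen symbols
        self.filters = [[[rng.uniform(-0.5, 0.5) for _ in range(dim)]
                         for _ in range(width)]
                        for _ in range(n_filters)]
        self.bias = [0.0] * n_filters

    def forward(self, symbols):
        vectors = [self.emb.get(s, self.unk) for s in symbols]
        return conv1d_maxpool(vectors, self.filters, self.bias)


def classify(channels, inputs, dense_w, dense_b):
    """Concatenate the pooled features of every channel, then apply a sigmoid unit."""
    features = []
    for name, channel in channels.items():
        features.extend(channel.forward(inputs[name]))
    logit = dense_b + sum(w * x for w, x in zip(dense_w, features))
    return 1.0 / (1.0 + math.exp(-logit))


# Hand-written annotations (assumed, for illustration only).
sentence = {
    "word": ["i", "was", "diagnosed", "with", "diabetes"],
    "pos": ["PRON", "AUX", "VERB", "ADP", "NOUN"],
    "dep": ["nsubjpass", "auxpass", "ROOT", "prep", "pobj"],
}

rng = random.Random(0)
N_FILTERS = 3
channels = {
    name: Channel(sorted(set(tokens)), dim=4, n_filters=N_FILTERS, width=2, rng=rng)
    for name, tokens in sentence.items()
}
dense_w = [rng.uniform(-0.5, 0.5) for _ in range(N_FILTERS * len(channels))]
score = classify(channels, sentence, dense_w, dense_b=0.0)
print(f"disclosure score (untrained): {score:.3f}")
```

Each channel sees the same sentence through a different linguistic view, and the pooled features are concatenated before the final decision. In practice a deep learning library would replace the hand-written loops and supply the training procedure.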

Keywords

Privacy, Security, Natural language processing, Machine learning

Notes

Acknowledgments

The authors would like to thank the National Science Foundation for its support through the Computer and Information Science and Engineering (CISE) program and Research Initiation Initiative (CRII) grant number 1657774 of the Secure and Trustworthy Cyberspace (SaTC) program: A System for Privacy Management in Ubiquitous Environments.

Copyright information

© ICST Institute for Computer Sciences, Social Informatics and Telecommunications Engineering 2019

Authors and Affiliations

  • Nuhil Mehdy (1)
  • Casey Kennington (1)
  • Hoda Mehrpouyan (1), email author
  1. Boise State University, Boise, USA