Abstract
Digital service users are routinely exposed to Privacy Policy consent forms, through which they enter contractual agreements consenting to the specifics of how their personal data is managed and used. Nevertheless, despite the renewed importance of these policies following legislation such as the European GDPR, a majority of people still ignore them due to their length and complexity. To counteract this potentially dangerous reality, in this paper we present three different models that assign pre-defined categories to privacy policy paragraphs using supervised machine learning. To train our neural networks, we exploit a dataset containing 115 privacy policies defined by US companies. An evaluation shows that our approach outperforms the state of the art by 5% over comparable, previously reported F1 values. In addition, our method is completely reproducible since we provide open access to all resources. Given these two contributions, our approach can be considered a strong baseline for privacy policy classification.
Notes
- 1.
To retrieve the exact source used: <https://www.amazon.com/gp/help/customer/display.html?nodeId=468496> (sub-entry "What Choices Do I Have?"), last accessed March 2nd, 2020.
- 2.
- 3.
- 4.
They also claim that a model that predicts that all labels are present would have 100% precision and recall, which is obviously wrong.
- 5.
- 6.
- 7.
The BertLMDataBunch class provides a from_raw_corpus method that takes a list of raw texts and creates a DataBunch for the language model learner.
- 8.
Here, we only consider high-level categories.
- 9.
All splits are available for further experiments. See footnote 13.
- 10.
Fine-tuning BERT took 33 h for 3 epochs on a single GPU. Once it is completed, training the classification model takes only a few hours, depending on the number of epochs.
- 11.
Website privacy policies in the EU also depend on Directive 2002/58/CE.
- 12.
Website privacy policies in the European Union also depend on Directive 2002/58/CE.
- 13.
A supplementary archive is available online for download: <https://github.com/SmartDataAnalytics/Polisis_Benchmark>. The archive contains, inter alia, the source code required to reproduce all the experiments, some useful documentation, and the necessary datasets.
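Note 4 disputes the claim that a model predicting every label as present would reach 100% precision and recall. The following is a minimal sketch of why that claim fails, not the authors' evaluation code. It assumes micro-averaged scores, and both the toy gold annotations and the four short label names are made up for illustration. Under these assumptions, recall is perfect because no gold label is missed, but precision drops because every absent label counts as a false positive.

```python
from typing import List, Set, Tuple


def micro_precision_recall_f1(
    gold: List[Set[str]], pred: List[Set[str]]
) -> Tuple[float, float, float]:
    """Micro-averaged precision, recall and F1 for multi-label predictions."""
    # Count label-level true positives, false positives and false negatives
    # over all paragraphs at once (this pooling is what makes it micro-averaging).
    tp = sum(len(g & p) for g, p in zip(gold, pred))
    fp = sum(len(p - g) for g, p in zip(gold, pred))
    fn = sum(len(g - p) for g, p in zip(gold, pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    return precision, recall, f1


# Hypothetical label set and gold annotations for three paragraphs.
LABELS = {"collection", "sharing", "choice", "retention"}
gold = [{"collection"}, {"sharing", "choice"}, {"retention"}]

# A degenerate classifier that marks every label as present in every paragraph.
all_labels_pred = [set(LABELS) for _ in gold]

precision, recall, f1 = micro_precision_recall_f1(gold, all_labels_pred)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
# prints: precision=0.333 recall=1.000 f1=0.500
```

On this toy data the degenerate classifier finds all 4 gold labels but also emits 8 spurious ones, so only recall reaches 100%.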
Acknowledgment
This work has been partly supported by the European H2020 project “DAPSI” under the Grant Agreement 871498.
Copyright information
© 2020 IFIP International Federation for Information Processing
About this paper
Cite this paper
Mousavi Nejad, N., Jabat, P., Nedelchev, R., Scerri, S., Graux, D. (2020). Establishing a Strong Baseline for Privacy Policy Classification. In: Hölbl, M., Rannenberg, K., Welzer, T. (eds) ICT Systems Security and Privacy Protection. SEC 2020. IFIP Advances in Information and Communication Technology, vol 580. Springer, Cham. https://doi.org/10.1007/978-3-030-58201-2_25
DOI: https://doi.org/10.1007/978-3-030-58201-2_25
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-58200-5
Online ISBN: 978-3-030-58201-2
eBook Packages: Computer Science, Computer Science (R0)