Establishing a Strong Baseline for Privacy Policy Classification

Mousavi Nejad, Najmeh; Jabat, Pablo; Nedelchev, Rostislav; Scerri, Simon; Graux, Damien

doi:10.1007/978-3-030-58201-2_25

Part of the book series: IFIP Advances in Information and Communication Technology (IFIPAICT, volume 580)

Included in the following conference series:

IFIP International Conference on ICT Systems Security and Privacy Protection

Abstract

Digital service users are routinely exposed to Privacy Policy consent forms, through which they enter contractual agreements consenting to the specifics of how their personal data is managed and used. Nevertheless, despite renewed importance following legislation such as the European GDPR, a majority of people still ignore policies due to their length and complexity. To counteract this potentially dangerous reality, in this paper we present three different models that assign pre-defined categories to privacy policy paragraphs using supervised machine learning. To train our neural networks, we exploit a dataset containing 115 privacy policies defined by US companies. An evaluation shows that our approach outperforms the state of the art by 5% on comparable, previously reported F1 values. In addition, our method is completely reproducible, since we provide open access to all resources. Given these two contributions, our approach can be considered a strong baseline for privacy policy classification.
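As an illustration of the F1 values mentioned above, the following minimal sketch computes micro- and macro-averaged F1 for a paragraph classifier that may assign several categories to one paragraph. The category names, the toy gold and predicted labels, and the helper names `f1` and `micro_macro_f1` are assumptions made for this example; this excerpt does not state which averaging scheme the reported figures use.

```python
from typing import Dict, List, Sequence, Set, Tuple

# Hypothetical category names, used only for illustration.
LABELS = ["First Party Collection/Use", "Third Party Sharing/Collection", "Data Retention"]


def f1(tp: int, fp: int, fn: int) -> float:
    """F1 from raw counts; defined here as 0.0 when all counts are zero."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_macro_f1(
    gold: Sequence[Set[str]], pred: Sequence[Set[str]], labels: List[str]
) -> Tuple[float, float]:
    """Return (micro-F1, macro-F1) over per-paragraph label sets."""
    # counts[label] = [true positives, false positives, false negatives]
    counts: Dict[str, List[int]] = {label: [0, 0, 0] for label in labels}
    for g, p in zip(gold, pred):
        for label in labels:
            if label in p and label in g:
                counts[label][0] += 1
            elif label in p:
                counts[label][1] += 1
            elif label in g:
                counts[label][2] += 1
    # Macro: average the per-label F1 scores, so every label weighs the same.
    macro = sum(f1(*c) for c in counts.values()) / len(labels)
    # Micro: pool the counts first, so frequent labels dominate the score.
    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    fn = sum(c[2] for c in counts.values())
    return f1(tp, fp, fn), macro


# Toy data: four paragraphs with gold and predicted label sets.
gold = [
    {"First Party Collection/Use"},
    {"First Party Collection/Use", "Third Party Sharing/Collection"},
    {"Data Retention"},
    {"Third Party Sharing/Collection"},
]
pred = [
    {"First Party Collection/Use"},
    {"First Party Collection/Use"},
    set(),
    {"Third Party Sharing/Collection", "First Party Collection/Use"},
]

micro, macro = micro_macro_f1(gold, pred, LABELS)
print(f"micro-F1 = {micro:.3f}, macro-F1 = {macro:.3f}")
```

On this toy data the rare label that is never predicted pulls the macro average well below the micro average, which is why the two schemes should not be compared against each other.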

Notes

1. To retrieve the exact source used: <https://www.amazon.com/gp/help/customer/display.html?nodeId=468496> (sub-entry "What Choices Do I Have?"), last accessed March 2, 2020.
2. https://en.wikipedia.org/wiki/Do_Not_Track
3. https://usableprivacy.org/
4. They also claim that a model that predicts that all labels are present would have 100% precision and recall, which is obviously wrong: such a model reaches 100% recall, but every predicted label that is actually absent lowers its precision.
5. https://github.com/huggingface/transformers
6. https://github.com/kaushaltrivedi/fast-bert
7. The BertLMDataBunch class contains a from_raw_corpus method that takes a list of raw texts and creates a DataBunch for the language model learner.
8. Here, we only consider high-level categories.
9. All splits are available for further experiments. See footnote 13.
10. Fine-tuning BERT took 33 h for 3 epochs on a single GPU. Once fine-tuning is complete, training the classification model takes only a few hours, depending on the number of epochs.
11. Website privacy policies in the EU also depend on Directive 2002/58/EC.
12. Website privacy policies in the European Union also depend on Directive 2002/58/EC.
13. A supplementary archive is available online for download: <https://github.com/SmartDataAnalytics/Polisis_Benchmark>. The archive contains, inter alia, the source code required to reproduce all the experiments, some useful documentation, and the necessary datasets.
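The point made in note 4 can be checked on a toy example. The sketch below scores a degenerate model that marks every label as present for every paragraph. The label matrix and the helper name `precision_recall` are assumptions made for illustration, and the counts are pooled over all paragraph-label pairs (micro-averaging).

```python
from typing import List, Tuple


def precision_recall(gold: List[List[int]], pred: List[List[int]]) -> Tuple[float, float]:
    """Micro-averaged precision and recall over binary label matrices."""
    tp = fp = fn = 0
    for gold_row, pred_row in zip(gold, pred):
        for g, p in zip(gold_row, pred_row):
            if p and g:
                tp += 1  # label predicted and actually present
            elif p:
                fp += 1  # label predicted but absent
            elif g:
                fn += 1  # label present but missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Toy gold labels: 4 paragraphs x 3 categories (1 = category present).
gold = [
    [1, 0, 0],
    [1, 1, 0],
    [0, 0, 1],
    [0, 1, 0],
]

# Degenerate model: every label is predicted as present for every paragraph.
all_positive = [[1] * len(row) for row in gold]

precision, recall = precision_recall(gold, all_positive)
print(f"precision = {precision:.3f}, recall = {recall:.3f}")
```

Nothing is missed, so recall is 1.0, but precision equals the share of paragraph-label pairs that are truly positive (5 of 12 here), which stays below 100% unless every label applies to every paragraph.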

Acknowledgment

This work has been partly supported by the European H2020 project “DAPSI” under the Grant Agreement 871498.

Author information

Authors and Affiliations

Smart Data Analytics (SDA), University of Bonn, Bonn, Germany
Najmeh Mousavi Nejad & Rostislav Nedelchev
Fraunhofer Intelligent Analysis and Information Systems (IAIS), Sankt Augustin, Germany
Najmeh Mousavi Nejad & Simon Scerri
Company Watch Ltd., London, England
Pablo Jabat
ADAPT Centre, Trinity College Dublin, Dublin, Ireland
Damien Graux

Corresponding author

Correspondence to Najmeh Mousavi Nejad.

Editor information

Editors and Affiliations

University of Maribor, Maribor, Slovenia
Marko Hölbl
Goethe University Frankfurt, Frankfurt, Germany
Kai Rannenberg
University of Maribor, Maribor, Slovenia
Tatjana Welzer

About this paper

Cite this paper

Mousavi Nejad, N., Jabat, P., Nedelchev, R., Scerri, S., Graux, D. (2020). Establishing a Strong Baseline for Privacy Policy Classification. In: Hölbl, M., Rannenberg, K., Welzer, T. (eds) ICT Systems Security and Privacy Protection. SEC 2020. IFIP Advances in Information and Communication Technology, vol 580. Springer, Cham. https://doi.org/10.1007/978-3-030-58201-2_25

DOI: https://doi.org/10.1007/978-3-030-58201-2_25
Published: 14 September 2020
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-58200-5
Online ISBN: 978-3-030-58201-2
eBook Packages: Computer Science, Computer Science (R0)