Establishing a Strong Baseline for Privacy Policy Classification

Part of the IFIP Advances in Information and Communication Technology book series (IFIPAICT, volume 580)

Abstract

Digital service users are routinely exposed to Privacy Policy consent forms, through which they enter contractual agreements consenting to the specifics of how their personal data is managed and used. Nevertheless, despite renewed importance following legislation such as the European GDPR, a majority of people still ignore policies due to their length and complexity. To counteract this potentially dangerous reality, in this paper we present three different models that are able to assign pre-defined categories to privacy policy paragraphs, using supervised machine learning. To train our neural networks, we use a dataset containing 115 privacy policies defined by US companies. An evaluation shows that our approach outperforms the state of the art by 5% over comparable, previously reported F1 values. In addition, our method is completely reproducible since we provide open access to all resources. Given these two contributions, our approach can be considered a strong baseline for privacy policy classification.

Keywords

  • Privacy policy
  • Multi-label classification
  • Deep learning

Notes

  1. To retrieve the exact source used: <https://www.amazon.com/gp/help/customer/display.html?nodeId=468496> (sub-entry "What Choices Do I Have?"), last accessed March 2, 2020.

  2. https://en.wikipedia.org/wiki/Do_Not_Track.

  3. https://usableprivacy.org/.

  4. They also claim that a model that predicts that all labels are present would have 100% precision and recall, which is incorrect: such a model reaches 100% recall, but every false positive lowers its precision.

  5. https://github.com/huggingface/transformers.

  6. https://github.com/kaushaltrivedi/fast-bert.

  7. The BertLMDataBunch class contains a from_raw_corpus method that takes a list of raw texts and creates a DataBunch for the language model learner.

  8. Here, we only consider high-level categories.

  9. All splits are available for further experiments. See footnote 13.

  10. Fine-tuning BERT took 33 h for 3 epochs on a single GPU. Once fine-tuning is complete, training the classification model takes only a few hours, depending on the number of epochs.

  11. Website privacy policies in the EU also depend on Directive 2002/58/EC.

  12. Website privacy policies in the European Union also depend on Directive 2002/58/EC.

  13. A supplementary archive is available online for download: <https://github.com/SmartDataAnalytics/Polisis_Benchmark>. The archive contains, inter alia, the source code required to reproduce all the experiments, some useful documentation, and the necessary datasets.
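
The point made in note 4 can be checked with a few lines of arithmetic. The following is a minimal sketch, not the authors' evaluation code: it assumes micro-averaged precision and recall, and the gold labels are a made-up toy example (3 paragraphs, 3 categories) rather than data from the actual corpus. The helper name `micro_precision_recall` is hypothetical.

```python
def micro_precision_recall(y_true, y_pred):
    """Micro-averaged precision and recall over 0/1 multi-label matrices."""
    tp = fp = fn = 0
    for true_row, pred_row in zip(y_true, y_pred):
        for t, p in zip(true_row, pred_row):
            if p == 1 and t == 1:
                tp += 1  # label predicted and actually present
            elif p == 1 and t == 0:
                fp += 1  # label predicted but absent
            elif p == 0 and t == 1:
                fn += 1  # label present but missed
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Toy gold labels (illustrative only): rows are paragraphs, columns are categories.
y_true = [
    [1, 0, 0],
    [0, 1, 1],
    [0, 0, 1],
]

# A trivial "model" that predicts every label as present for every paragraph.
y_all_present = [[1, 1, 1] for _ in y_true]

precision, recall = micro_precision_recall(y_true, y_all_present)
print(f"precision={precision:.3f} recall={recall:.3f}")
```

In this toy setting the all-labels predictor misses nothing, so recall is 1.0, but five of its nine predictions are false positives, so precision is 4/9. Precision would reach 100% only if every paragraph truly carried every label.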

Acknowledgment

This work has been partly supported by the European H2020 project “DAPSI” under the Grant Agreement 871498.

Corresponding author

Correspondence to Najmeh Mousavi Nejad.

Copyright information

© 2020 IFIP International Federation for Information Processing

Cite this paper

Mousavi Nejad, N., Jabat, P., Nedelchev, R., Scerri, S., Graux, D. (2020). Establishing a Strong Baseline for Privacy Policy Classification. In: Hölbl, M., Rannenberg, K., Welzer, T. (eds) ICT Systems Security and Privacy Protection. SEC 2020. IFIP Advances in Information and Communication Technology, vol 580. Springer, Cham. https://doi.org/10.1007/978-3-030-58201-2_25

  • DOI: https://doi.org/10.1007/978-3-030-58201-2_25

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-58200-5

  • Online ISBN: 978-3-030-58201-2

  • eBook Packages: Computer Science, Computer Science (R0)