Automated Identification of Sensitive Data via Flexible User Requirements

  • Ziqi Yang
  • Zhenkai Liang
Conference paper
Part of the Lecture Notes of the Institute for Computer Sciences, Social Informatics and Telecommunications Engineering book series (LNICST, volume 254)

Abstract

Protecting sensitive data in web and mobile applications requires first identifying that data, which typically demands intensive manual effort. In addition, what counts as sensitive depends on users’ requirements and the application context. Existing research on identifying sensitive data from its descriptive text focuses on keyword or phrase search. These approaches can produce many false positives and false negatives because they do not consider the semantics of the descriptions. In this paper, we propose S3, an automated approach to identify sensitive data based on user requirements. It considers semantic, syntactic, and lexical information comprehensively, aiming to identify sensitive data by the semantics of its descriptive text. We introduce the notion of a concept space to represent the user’s notion of privacy, through which our approach can support flexible user requirements in defining sensitive data. Our approach learns users’ preferences from readable concepts initially provided by users and automatically identifies related sensitive data. We evaluate our approach on over 18,000 top popular applications from the Google Play Store. S3 achieves an average precision of 89.2% and an average recall of 95.8% in identifying sensitive data.
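As a rough illustration of why semantic matching against user-provided concepts can catch labels that keyword search misses, the sketch below compares word vectors by cosine similarity. It is not the S3 implementation: the vectors in TOY_VECTORS are hand-made for this example, and the names is_sensitive and precision_recall and the 0.95 threshold are all assumptions. A real system would use learned embeddings together with the syntactic and lexical analysis the abstract mentions.

```python
import math

# Hypothetical toy embeddings, hand-made for illustration only.
# A real system would load learned word vectors instead.
TOY_VECTORS = {
    "password": (0.9, 0.1, 0.0),
    "passcode": (0.85, 0.15, 0.05),
    "pin": (0.8, 0.2, 0.1),
    "salary": (0.1, 0.9, 0.1),
    "income": (0.15, 0.85, 0.1),
    "weather": (0.0, 0.1, 0.9),
    "forecast": (0.05, 0.1, 0.9),
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def is_sensitive(label, concept_space, threshold=0.95):
    """Return True if any known word in the label is close to a user concept.

    concept_space is a set of readable concept words chosen by the user;
    the threshold is an assumed value, not one taken from the paper.
    """
    words = [w for w in label.lower().split() if w in TOY_VECTORS]
    concepts = [c for c in concept_space if c in TOY_VECTORS]
    return any(
        cosine(TOY_VECTORS[w], TOY_VECTORS[c]) >= threshold
        for w in words
        for c in concepts
    )


def precision_recall(predicted, actual):
    """Precision and recall of a predicted set against a ground-truth set."""
    true_positives = len(predicted & actual)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    return precision, recall


if __name__ == "__main__":
    # "passcode" does not contain the keyword "password", yet it matches
    # the user's concept through vector similarity.
    print(is_sensitive("Enter passcode", {"password"}))
    # The same label is judged differently under different user concepts.
    print(is_sensitive("Monthly income", {"password"}))
    print(is_sensitive("Monthly income", {"password", "salary"}))
```

In this sketch, changing the set of concept words changes what is flagged, which mirrors the idea of flexible user requirements: the same label "Monthly income" is flagged only once the user adds a salary-related concept.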

Notes

Acknowledgment

This research is supported by the National Research Foundation, Prime Minister’s Office, Singapore under its National Cybersecurity R&D Programme (Grant No. NRF2015NCR-NCR002-001).

Copyright information

© ICST Institute for Computer Sciences, Social Informatics and Telecommunications Engineering 2018

Authors and Affiliations

  1. National University of Singapore, Singapore, Singapore
