Analyzing Characteristic Host Access Patterns for Re-identification of Web User Sessions

  • Dominik Herrmann
  • Christoph Gerber
  • Christian Banse
  • Hannes Federrath
Part of the Lecture Notes in Computer Science book series (LNCS, volume 7127)

Abstract

An attacker who is able to observe a web user over a long period of time learns a lot about the user's interests. Users with regularly changing IP addresses may be difficult to track, though. We show how patterns mined from web traffic can be used to re-identify a majority of users, i.e., to link multiple sessions belonging to the same user. We implement the web user re-identification attack using a Multinomial Naïve Bayes classifier and evaluate it on a real-world dataset from 28 users. Our evaluation setup reflects the limited knowledge of an attacker operating a malicious web proxy server, who can observe only the host names visited by its users. The results suggest that consecutive sessions can be linked with high probability for session durations from 5 minutes to 48 hours and that user profiles degrade only slowly over time. We also propose basic countermeasures and evaluate their efficacy.
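
To make the idea of linking sessions by host access patterns concrete, the following is a minimal sketch of a Multinomial Naïve Bayes classifier over host-name counts. It is not the authors' implementation: the function names `train` and `predict`, the add-one (Laplace) smoothing, the uniform prior over users, and the toy host names are all assumptions made for illustration.

```python
import math
from collections import Counter


def train(sessions):
    """Build per-user log-probabilities from one training session per user.

    sessions: dict mapping a user label to the list of host names
    observed in that user's training session (hypothetical data).
    """
    vocab = {host for hosts in sessions.values() for host in hosts}
    models = {}
    for user, hosts in sessions.items():
        counts = Counter(hosts)
        total = sum(counts.values())
        # Add-one smoothing is an assumption of this sketch.
        models[user] = {
            host: math.log((counts[host] + 1) / (total + len(vocab)))
            for host in vocab
        }
    return models, vocab


def predict(models, vocab, hosts):
    """Return the user whose profile best explains the observed hosts."""
    # Hosts never seen in training carry no information here and are skipped.
    counts = Counter(host for host in hosts if host in vocab)
    scores = {
        user: sum(n * logp[host] for host, n in counts.items())
        for user, logp in models.items()
    }
    # Uniform prior over users, so the log-likelihood alone decides.
    return max(scores, key=scores.get)


# Toy data: host names and user labels are invented for illustration.
training = {
    "alice": ["news.example", "mail.example", "news.example", "wiki.example"],
    "bob": ["games.example", "video.example", "games.example", "mail.example"],
}
models, vocab = train(training)
print(predict(models, vocab, ["news.example", "wiki.example", "mail.example"]))  # alice
```

In this sketch each user's earlier session serves as the training instance, and a later session observed under a new IP address is assigned to the user with the highest score, which is the sense in which two sessions are "linked".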

Keywords

Training Instance · Proxy Server · Session Duration · Access Frequency · User Session

Copyright information

© Springer-Verlag Berlin Heidelberg 2012

Authors and Affiliations

  • Dominik Herrmann (1)
  • Christoph Gerber (1)
  • Christian Banse (1)
  • Hannes Federrath (1)

  1. Research Group Security in Distributed Systems, Department of Informatics, University of Hamburg, Hamburg, Germany