Tracking Users on the Internet with Behavioral Patterns: Evaluation of Its Practical Feasibility

  • Christian Banse
  • Dominik Herrmann
  • Hannes Federrath
Part of the IFIP Advances in Information and Communication Technology book series (IFIPAICT, volume 376)

Abstract

Traditionally, service providers who want to track the activities of Internet users rely on explicit tracking techniques such as HTTP cookies. From a privacy perspective, behavior-based tracking is even more dangerous because it allows service providers to track users passively, i.e., without cookies. In this case, multiple sessions of a user are linked by exploiting characteristic patterns mined from network traffic.

In this paper we study the feasibility of behavior-based tracking in a real-world setting, which has not been established so far. In principle, behavior-based tracking can be carried out by any attacker who can observe the activities of users on the Internet. We design and implement a behavior-based tracking technique that consists of a Naive Bayes classifier supported by a cosine similarity decision engine. We evaluate our technique using a large-scale dataset that contains all queries received by a DNS resolver that is used by more than 2100 concurrent users on average per day. Our technique correctly links 88.2% of the surfing sessions on a day-to-day basis. We also discuss various countermeasures that reduce the effectiveness of our technique.
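To make the idea of a Naive Bayes classifier backed by a cosine similarity check more concrete, the following is a minimal sketch and not the authors' implementation. It assumes each session is represented as a bag of queried hostnames with counts, that a multinomial Naive Bayes model with Laplace smoothing and uniform priors proposes the most likely user, and that the proposal is accepted only if the cosine similarity between the two sessions reaches a threshold. The function names, the threshold `min_cosine`, and the example hostnames are hypothetical; the abstract does not specify how the two components interact.

```python
import math
from collections import Counter


def nb_log_score(test, profile, vocab_size):
    """Multinomial Naive Bayes log-likelihood of a test session under one
    user's training profile, with Laplace (add-one) smoothing."""
    total = sum(profile.values())
    return sum(
        count * math.log((profile.get(host, 0) + 1) / (total + vocab_size))
        for host, count in test.items()
    )


def cosine(a, b):
    """Cosine similarity between two sparse hostname-count vectors."""
    dot = sum(count * b[host] for host, count in a.items() if host in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def link_session(test, profiles, min_cosine=0.3):
    """Return the user whose profile best explains the test session, or
    None if the cosine check rejects the match.
    ASSUMPTION: the accept/reject rule and threshold are illustrative only."""
    vocab = set(test)
    for profile in profiles.values():
        vocab.update(profile)
    best = max(
        profiles,
        key=lambda user: nb_log_score(test, profiles[user], len(vocab)),
    )
    return best if cosine(test, profiles[best]) >= min_cosine else None


# Hypothetical day-1 profiles (training) and day-2 sessions (test).
profiles = {
    "alice": Counter({"news.example": 5, "mail.example": 3}),
    "bob": Counter({"games.example": 6, "video.example": 2}),
}
known = Counter({"news.example": 4, "mail.example": 1})
unknown = Counter({"shop.example": 3})

print(link_session(known, profiles))    # alice
print(link_session(unknown, profiles))  # None
```

In this sketch the cosine check acts as a guard: Naive Bayes always names some user, even for a session that shares no hostnames with any profile, so a similarity threshold is one simple way to decline a link in that case.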

Keywords

Test Instance · Range Query · Cosine Similarity · Training Instance · Tracking Technique

These keywords were added by machine and not by the authors.


Copyright information

© IFIP International Federation for Information Processing 2012

Authors and Affiliations

  • Christian Banse (1)
  • Dominik Herrmann (2)
  • Hannes Federrath (2)

  1. Fraunhofer AISEC, Garching b. München, Germany
  2. Department of Informatics, University of Hamburg, Germany
