A query scrambler for search privacy on the internet

Abstract

We propose a method for search privacy on the Internet, focusing on enhancing plausible deniability against search engine query-logs. The method approximates the target search results without submitting the intended query, or any other query that would expose it, by employing instead sets of queries that represent more general concepts. We model the problem theoretically, and investigate the practical feasibility and effectiveness of the proposed solution with a set of real queries with privacy issues on a large web collection. The findings may have implications for other IR research areas, such as query expansion and fusion in meta-search. Finally, we discuss ideas for privacy, such as k-anonymity, and how these may be applied to search tasks.

Notes

  1. http://www.torproject.org

  2. The Private Web Search (PWS) tool is a Firefox Add-on. It is available on-line but seems not to be further developed. Its latest version is v0.4.2, which supports Firefox up to version 2. The PWS as well as the TrackMeNot tool have been developed in the context of the Portia project (http://crypto.stanford.edu/portia/).

  3. http://en.wikipedia.org/wiki/Plausible_deniability

  4. A simple definition of a martingale sequence from Motwani and Raghavan (1995): A sequence of random variables \(X_0, X_1, \ldots\) is said to be a martingale sequence if for all \(i > 0\), \(E[X_i \mid X_0,\ldots,X_{i-1}]=X_{i-1}\).

  5. A collocation is two or more words that often go together to form a specific meaning, e.g., ‘hot dog’.

  6. The LCS is defined as the ancestor node common to both input synsets whose shortest path to the root node is the longest.

  7. http://lethe.nonrelevant.net/datasets/95-seed-queries-v1.0.txt

  8. http://boston.lti.cs.cmu.edu/Data/clueweb09/

  9. http://www.lemurproject.org

  10. http://www.congition.com

  11. http://www.hakia.com

  12. More precisely, “cortisone#n#1” in Wordnet, i.e., its 1st, most-frequent, sense as a noun.

  13. More precisely, “hormone#n#1” in Wordnet, i.e., its 1st, most-frequent, sense as a noun.

  14. At the time of the writing of this paper, there was no Nokia tablet in the market.
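The LCS definition in note 6 can be illustrated with a minimal sketch. The taxonomy below is a hypothetical toy hierarchy invented for this example, not actual WordNet data, and the names `PARENTS`, `ancestors`, `depth`, and `lowest_common_subsumer` are assumptions of the sketch rather than anything taken from the article. It treats each node as its own ancestor and picks, among the common ancestors, the one whose shortest path to the root is longest.

```python
from functools import lru_cache

# Hypothetical toy taxonomy (child -> list of parents); NOT real WordNet data.
PARENTS = {
    "entity": [],
    "substance": ["entity"],
    "hormone": ["substance"],
    "drug": ["substance"],
    "cortisone": ["hormone", "drug"],
    "insulin": ["hormone"],
    "aspirin": ["drug"],
}


def ancestors(node):
    """Return the node itself plus every node reachable via parent links."""
    seen = set()
    stack = [node]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(PARENTS[current])
    return seen


@lru_cache(maxsize=None)
def depth(node):
    """Length of the shortest path from the node up to a root node."""
    parents = PARENTS[node]
    if not parents:
        return 0
    return 1 + min(depth(parent) for parent in parents)


def lowest_common_subsumer(a, b):
    """Common ancestor whose shortest path to the root is the longest."""
    common = ancestors(a) & ancestors(b)
    # Sorting first makes tie-breaking deterministic (alphabetical order).
    return max(sorted(common), key=depth)


if __name__ == "__main__":
    print(lowest_common_subsumer("cortisone", "insulin"))
    print(lowest_common_subsumer("insulin", "aspirin"))
```

In a real setting the parent links would come from a lexical database such as WordNet; the dictionary here only stands in for that structure.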

References

  1. Barbaro, M., & Zeller, T. (2006). A face is exposed for AOL searcher no. 4417749. Accessed June 3, 2010 from http://www.nytimes.com/2006/08/09/technology/09aol.html.

  2. Boldi, P., Bonchi, F., Castillo, C., & Vigna, S. (2009). From "Dango" to "Japanese Cakes": Query reformulation models and patterns. In: WI-IAT ’09: Proceedings of the 2009 IEEE/WIC/ACM International Joint Conference on Web Intelligence and Intelligent Agent Technology (pp.183–190). Washington, DC, USA: IEEE Computer Society.

  3. Chor, B., Gilboa, N., & Naor, M. (1997). Private information retrieval by keywords. Tech. Rep. Technical Report TR CS0917. Haifa: Department of Computer Science, Technion, Israel Institute of Technology.

  4. Domingo-Ferrer, J., Bras-Amorós, M., Wu, Q., & Manjón, J. A. (2009a). User-private information retrieval based on a peer-to-peer community. Data & Knowledge Engineering 68(11), 1237–1252.

  5. Domingo-Ferrer, J., Solanas, A., & Castella-Roca, J. (2009b). h(k)-private information retrieval from privacy-uncooperative queryable databases. Online Information Review, 33(4), 720–744.

  6. Erola, A., Castellà-Roca, J., Navarro-Arribas, G., & Torra, V. (2011). Semantic microaggregation for the anonymization of query logs using the open directory project. SORT-Statistics and Operations Research Transactions, 41–58.

  7. Fagin, R., Kumar, R., & Sivakumar, D. (2003). Comparing top k lists. SIAM Journal on Discrete Mathematics, 17(1), 134–160.

  8. Howe, D. C., & Nissenbaum, H. (2009). TrackMeNot: Resisting surveillance in web search. In: Lessons from the Identity Trail: Anonymity, Privacy, and Identity in a Networked Society (Chap. 23, pp. 417–436). Oxford, UK: Oxford University Press.

  9. Jones, R., Kumar, R., Pang, B., & Tomkins, A. (2008). Vanity fair: Privacy in querylog bundles. In: CIKM ’08: Proceeding of the 17th ACM Conference on Information and Knowledge Management (pp. 853–862). New York, NY, USA: ACM.

  10. Kendall, M. G. (1938). A new measure of rank correlation. Biometrika, 30(1/2), 81–93.

  11. Kumar, R., Novak, J., Pang, B., & Tomkins, A. (2007). On anonymizing query logs via token-based hashing. In: WWW ’07: Proceedings of the 16th International Conference on World Wide Web (pp. 629–638). New York, NY, USA: ACM.

  12. Miller, G. A. (1995). WordNet: A lexical database for English. Communications of the ACM, 38(1), 39–41.

  13. Mitzenmacher, M., & Upfal, E. (2005). Probability and computing: Randomized algorithms and probabilistic analysis. Cambridge, MA: Cambridge University Press.

  14. Motwani, R., & Raghavan, P. (1995). Randomized Algorithms. Cambridge, MA: Cambridge University Press.

  15. Murugesan, M., & Clifton, C. (2009). Providing privacy through plausibly deniable search. In: SDM, SIAM (pp. 768–779).

  16. Ostrovsky, R., & Skeith, W. I. (2007). A survey of single-database PIR: techniques and applications. In: Public Key Cryptography (PKC 2007), Lecture Notes in Computer Science (Vol. 4450, pp. 393–411). Berlin and Heidelberg: Springer.

  17. Pang, H., Ding, X., & Xiao, X. (2010). Embellishing text search queries to protect user privacy. Proceedings of the VLDB Endowment, 3(1), 598–607.

  18. Pass, G., Chowdhury, A., & Torgeson, C. (2006). A picture of search. In: InfoScale ’06: Proceedings of the 1st International Conference on Scalable Information Systems. New York, NY, USA: ACM Press.

  19. Raykova, M., Vo, B., Bellovin, S. M., & Malkin, T. (2009). Secure anonymous database search. In: R. Sion & D. Song (Eds.), CCSW (pp. 115–126). ACM.

  20. Saint-Jean, F., Johnson, A., Boneh, D., & Feigenbaum, J. (2007). Private web search. In: WPES ’07: Proceedings of the 2007 ACM Workshop on Privacy in Electronic Society. (pp. 84–90). New York, NY, USA: ACM.

  21. Shen, X., Tan, B., & Zhai, C. (2007). Privacy protection in personalized search. SIGIR Forum, 41(1), 4–17.

  22. Spink, A., Wolfram, D., Jansen, M. B. J., & Saracevic, T. (2001). Searching the web: The public and their queries. Journal of the American Society for Information Science and Technology, 52(3), 226–234.

  23. Strube, M., & Ponzetto, S. P. (2006). Wikirelate! computing semantic relatedness using wikipedia. In: Proceedings of the 21st National Conference on Artificial Intelligence (Vol. 2, pp. 1419–1424). Menlo Park, CA: AAAI Press.

  24. Sweeney, L. (2002). k-Anonymity: a model for protecting privacy. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 10(5), 557–570.

  25. Varelas, G., Voutsakis, E., Raftopoulou, P., Petrakis, E. G., & Milios, E. E. (2005). Semantic similarity methods in wordnet and their application to information retrieval on the web. In: WIDM ’05: Proceedings of the 7th Annual ACM International Workshop on Web Information and Data Management. (pp. 10–16). New York, NY, USA: ACM.

  26. Wu, Z., & Palmer, M. (1994). Verb semantics and lexical selection. In: Proceedings of the 32nd Annual Meeting of the Association for Computational Linguistics (pp. 113–138). New Mexico: Las Cruces.

  27. Yan, P., Jiao, Y., Hurson, A. R., & Potok, T. E. (2006). Semantic-based information retrieval of biomedical data. In: SAC ’06: Proceedings of the 2006 ACM Symposium on Applied computing (pp. 1700–1704). New York, NY, USA: ACM.

  28. Yekhanin, S. (2010). Private information retrieval. Communications of the ACM, 53(4), 68–73.

Acknowledgments

We thank Jaap Kamps from University of Amsterdam, the Netherlands, for providing access to the ClueWeb09_B dataset, and Savvas Chatzichristofis from Democritus University of Thrace for creating Figs. 1 and 2.

Corresponding author

Correspondence to Avi Arampatzis.

About this article

Cite this article

Arampatzis, A., Efraimidis, P.S. & Drosatos, G. A query scrambler for search privacy on the internet. Inf Retrieval 16, 657–679 (2013). https://doi.org/10.1007/s10791-012-9212-1

Keywords

  • Query scrambler
  • Search privacy
  • WordNet
  • Fusion