Versatile Query Scrambling for Private Web Search
We consider the problem of privacy leaks suffered by Internet users when they perform web searches, and propose a framework to mitigate them. In brief, given a ‘sensitive’ search query, the objective of our work is to retrieve the target documents from a search engine without disclosing the actual query. Our approach, which builds upon and improves recent work on search privacy, approximates the target search results by replacing the private user query with a set of blurred or scrambled queries. The results of the scrambled queries are then used to cover the private user interest. We model the problem theoretically, define a set of privacy objectives with respect to web search, and investigate the effectiveness of the proposed solution using a set of privacy-sensitive queries on a large web collection. Experiments show large improvements in retrieval effectiveness over a baseline previously reported in the literature. Furthermore, the methods are more versatile, behave more predictably, apply to a wider range of information needs, and provide privacy that is more comprehensible to the end-user. Additionally, we investigate the perceived privacy via a user study, and measure the system’s usefulness, taking into account the trade-off between retrieval effectiveness and privacy. The practical feasibility of the methods is demonstrated in a field experiment, scrambling queries against a popular web search engine. The findings may have implications for other IR research areas, such as query expansion, query decomposition, and distributed retrieval.
Keywords: Query scrambler; Search privacy; Query-based document sampling; Mutual information; Set covering; Inter-user agreement
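The idea of covering the private query's results with the results of scrambled queries can be illustrated with a minimal sketch. It is not the authors' method: it assumes an offline evaluation setting in which the target result set is known (for example, from a local collection), and it uses the standard greedy heuristic for set covering. The names `greedy_cover`, `coverage`, and the toy document identifiers are hypothetical.

```python
from typing import Dict, List, Set


def greedy_cover(target: Set[str],
                 results: Dict[str, Set[str]],
                 max_queries: int) -> List[str]:
    """Greedily pick scrambled queries whose results cover the target set.

    target      -- documents returned by the private query (assumed known offline)
    results     -- maps each scrambled query to the documents it returns
    max_queries -- upper bound on how many scrambled queries may be issued
    """
    uncovered = set(target)
    chosen: List[str] = []
    while uncovered and len(chosen) < max_queries:
        candidates = [q for q in results if q not in chosen]
        # Pick the query that covers the most still-uncovered target documents.
        best = max(candidates,
                   key=lambda q: len(results[q] & uncovered),
                   default=None)
        if best is None or not (results[best] & uncovered):
            break  # no remaining query adds any coverage
        chosen.append(best)
        uncovered -= results[best]
    return chosen


def coverage(target: Set[str],
             results: Dict[str, Set[str]],
             chosen: List[str]) -> float:
    """Fraction of target documents found in the union of the chosen results."""
    if not target:
        return 0.0
    covered: Set[str] = set()
    for q in chosen:
        covered |= results[q]
    return len(covered & target) / len(target)


if __name__ == "__main__":
    # Toy data: hypothetical document ids and scrambled queries.
    target = {"d1", "d2", "d3", "d4"}
    results = {
        "scrambled A": {"d1", "d2", "d3"},
        "scrambled B": {"d3", "d4"},
        "scrambled C": {"d9"},
    }
    picked = greedy_cover(target, results, max_queries=5)
    print(picked, coverage(target, results, picked))
```

In a live deployment the target set is not available to the client, so a sketch like this is only suited to measuring how well a given set of scrambled queries would have approximated the private query's results.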
The material in Sect. 8 was contributed by George Stamatelatos, master’s student at the Electrical & Computer Engineering department, Democritus University of Thrace, Greece.