Versatile Query Scrambling for Private Web Search

Abstract

We consider the problem of privacy leaks suffered by Internet users when they perform web searches, and propose a framework to mitigate them. In brief, given a ‘sensitive’ search query, the objective of our work is to retrieve the target documents from a search engine without disclosing the actual query. Our approach, which builds upon and improves recent work on search privacy, approximates the target search results by replacing the private user query with a set of blurred or scrambled queries. The results of the scrambled queries are then used to cover the private user interest. We model the problem theoretically, define a set of privacy objectives with respect to web search, and investigate the effectiveness of the proposed solution using a set of privacy-sensitive queries on a large web collection. Experiments show substantial improvements in retrieval effectiveness over a baseline previously reported in the literature. Furthermore, the methods are more versatile, behave more predictably, apply to a wider range of information needs, and provide privacy that is more comprehensible to the end-user. Additionally, we investigate the perceived privacy via a user study and measure the system’s usefulness, taking into account the trade-off between retrieval effectiveness and privacy. The practical feasibility of the methods is demonstrated in a field experiment, scrambling queries against a popular web search engine. The findings may have implications for other IR research areas, such as query expansion, query decomposition, and distributed retrieval.
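The idea of covering the private interest with the pooled results of scrambled queries can be illustrated with a minimal sketch. It is not the paper's evaluation code: the function name `coverage`, the document identifiers, and the choice of a plain recall-style ratio are all assumptions made for illustration, and the effectiveness measures actually used in the experiments are not given in this excerpt.

```python
from typing import Iterable, Set


def coverage(target: Iterable[str],
             scrambled_results: Iterable[Iterable[str]]) -> float:
    """Fraction of the target documents that appear in the pooled
    (unioned) results of the scrambled queries.

    Hypothetical helper: a simple recall-style ratio, assumed here
    only to illustrate the 'cover the private interest' idea.
    """
    target_set: Set[str] = set(target)
    if not target_set:
        return 0.0
    pooled: Set[str] = set()
    for results in scrambled_results:
        pooled.update(results)
    return len(target_set & pooled) / len(target_set)


# Made-up document ids: results the private query would have returned,
# and results returned by two scrambled queries sent in its place.
target = ["d1", "d2", "d3", "d4"]
scrambled = [["d1", "x"], ["d3", "d4", "y"]]
print(coverage(target, scrambled))  # 0.75
```

In this toy case three of the four target documents are recovered without the private query ever being submitted.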

Notes

  1. https://duckduckgo.com/.

  2. https://ixquick.com/.

  3. https://www.startpage.com/.

  4. Plausible deniability is a legal concept which refers to the lack of evidence proving an allegation. See also http://en.wikipedia.org/wiki/Plausible_deniability.

  5. http://www.torproject.org.

  6. A more general query \(w\) than the private query \(q\) will always hit more results than \(q\) thus achieving some k-indistinguishability, while a \(w\) hitting other but overlapping results than \(q\) may or may not be more general.

  7. http://lemurproject.org/clueweb09.

  8. http://code.google.com/p/text-categorization/.

  9. “Document retrieval” meant that document frequency statistics were used (as in Eq. 3) for the terms; using collection frequencies instead requires a window-based approach for calculating term co-occurrence. In both cases, “maximum window of 16 words” meant that only pairs of co-occurring words within 16 words were considered.

  10. http://lethe.nonrelevant.net/datasets/95-seed-queries-v1.0.txt.

  11. http://boston.lti.cs.cmu.edu/Data/clueweb09/.

  12. http://www.lemurproject.org.

  13. This formula has its roots in clustering, where a commonly used rule-of-thumb for the number of clusters \(v\) to look for in \(n\) data points is \(v \approx \sqrt{n/2}\). We apply this rule-of-thumb in reverse: when going for volume \(v\) we use the top \(n = 2v^2 +1\) data points (scrambled queries); the +1 is immaterial. The problem can indeed also be solved with clustering: first cluster the top-\(n\) scrambled queries into \(v\) clusters with queries hitting similar sets of documents within each cluster, and then select one representative scrambled query from each cluster.
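The rule of thumb in note 13 and the clustering alternative it mentions can be sketched as follows. Only the relations \(v \approx \sqrt{n/2}\) and \(n = 2v^2 + 1\) come from the note. Everything else is an assumption for illustration: the function names, the Jaccard overlap as the similarity between hit sets, average-link agglomerative merging as the clustering method, and taking the top-ranked member of each cluster as its representative. The note does not say which clustering algorithm or selection rule would be used.

```python
import math
from itertools import combinations
from typing import Dict, List, Set


def top_n_for_volume(v: int) -> int:
    """Top-n scrambled queries to consider for volume v: n = 2v^2 + 1."""
    return 2 * v * v + 1


def clusters_for_points(n: int) -> int:
    """Rule of thumb v ~ sqrt(n / 2), taken as an integer of at least 1."""
    return max(1, math.isqrt(n // 2))


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Overlap of two hit sets (assumed similarity measure)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def pick_representatives(hits: Dict[str, Set[str]], v: int) -> List[str]:
    """Cluster the top-n queries into v groups with similar hit sets and
    return the best-ranked query of each group.

    `hits` maps scrambled queries, in ranked order (best first), to the
    ids of the documents they hit. Average-link merging is an assumed
    choice, not taken from the paper.
    """
    if v < 1:
        raise ValueError("v must be at least 1")
    queries = list(hits)[: top_n_for_volume(v)]
    clusters = [[q] for q in queries]

    def similarity(c1: List[str], c2: List[str]) -> float:
        total = sum(jaccard(hits[a], hits[b]) for a in c1 for b in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > v:
        # Merge the most similar pair; i < j keeps rank order intact.
        i, j = max(combinations(range(len(clusters)), 2),
                   key=lambda p: similarity(clusters[p[0]], clusters[p[1]]))
        clusters[i].extend(clusters[j])
        del clusters[j]
    return [cluster[0] for cluster in clusters]


# Made-up hit sets for five ranked scrambled queries.
hits = {
    "a": {"1", "2", "3"},
    "b": {"1", "2", "4"},
    "c": {"7", "8"},
    "d": {"7", "8", "9"},
    "e": {"20"},
}
print(top_n_for_volume(3))            # 19
print(pick_representatives(hits, 3))  # ['a', 'c', 'e']
```

For \(v = 3\) the formula gives \(n = 19\), and applying the rule of thumb to 19 points gives 3 clusters again, which is why the +1 is immaterial. The pairwise merging above is cubic in the number of queries, so it is suitable only for the small \(n\) values that the formula produces.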

Acknowledgments

The material in Sect. 8 was contributed by George Stamatelatos, master’s student at the Electrical & Computer Engineering department, Democritus University of Thrace, Greece.

Author information

Corresponding author

Correspondence to Avi Arampatzis.

Additional information

An early shorter version of this work was published in European Conference on Information Retrieval, 2013.

About this article

Cite this article

Arampatzis, A., Drosatos, G. & Efraimidis, P.S. Versatile Query Scrambling for Private Web Search. Inf Retrieval J 18, 331–358 (2015). https://doi.org/10.1007/s10791-015-9256-0

Keywords

  • Query scrambler
  • Search privacy
  • Query-based document sampling
  • Mutual information
  • Set covering
  • Inter-user agreement