Similarity Search Combining Query Relaxation and Diversification

  • Ruoxi Shi
  • Hongzhi Wang
  • Tao Wang
  • Yutai Hou
  • Yiwen Tang
  • Jianzhong Li
  • Hong Gao
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10178)


We study the similarity search problem, which aims to find the results in a given data set that are similar to a query string. To balance the number of results against their quality, we combine query result diversification with query relaxation. Relaxation guarantees a sufficient number of results by returning more elements relevant to the query when too few are found, while diversification reduces the similarity among the returned results. Trading off similarity against diversity improves the user experience. To this end, we define a novel objective function that combines similarity and diversity, and we propose three algorithms to optimize it. Two of them, genGreedy and genCluster, perform relaxation first and then select a subset of the candidates to diversify. The third, CB2S, splits the dataset into smaller pieces using k-means clustering and processes queries over several small sets to retrieve more diverse results. The balance between similarity and diversity is controlled by a threshold, which has a default value and can be adjusted to the user's preference. Extensive experiments on various datasets demonstrate the effectiveness and efficiency of our system.
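
The relax-then-diversify idea described in the abstract can be illustrated with a small sketch. This is not the paper's genGreedy algorithm or its objective function, neither of which is given here. It is a generic greedy selection under stated assumptions: edit distance as the string measure, a hypothetical threshold `tau` that is widened until enough candidates exist, and a hypothetical weight `lam` standing in for the similarity/diversity balance.

```python
from typing import List, Tuple


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def norm_dist(a: str, b: str) -> float:
    """Edit distance scaled to [0, 1] (assumed normalization)."""
    return edit_distance(a, b) / max(len(a), len(b), 1)


def relax(query: str, data: List[str], k: int) -> Tuple[List[str], int]:
    """Widen the threshold tau until at least k candidates match,
    or until tau is large enough to admit every string."""
    max_tau = max((max(len(query), len(s)) for s in data), default=0)
    tau = 0
    while True:
        cands = [s for s in data if edit_distance(query, s) <= tau]
        if len(cands) >= k or tau >= max_tau:
            return cands, tau
        tau += 1


def greedy_diversify(query: str, cands: List[str], k: int,
                     lam: float = 0.5) -> List[str]:
    """Greedily pick up to k candidates, scoring each by
    lam * similarity to the query + (1 - lam) * distance
    to the closest item already chosen."""
    chosen: List[str] = []
    remaining = list(cands)
    while remaining and len(chosen) < k:
        def score(s: str) -> float:
            sim = 1.0 - norm_dist(query, s)
            div = min((norm_dist(s, c) for c in chosen), default=0.0)
            return lam * sim + (1.0 - lam) * div
        best = max(remaining, key=score)
        chosen.append(best)
        remaining.remove(best)
    return chosen


if __name__ == "__main__":
    data = ["apple", "apples", "applet", "ample", "maple", "orange"]
    cands, tau = relax("apple", data, k=3)
    print(tau, cands)
    print(greedy_diversify("apple", cands, k=3))
```

Setting `lam` closer to 1 favors strings near the query, while values closer to 0 favor results that differ from one another. The abstract's adjustable threshold plays a comparable role, though its exact definition may differ from this weight.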



This paper was partially supported by NSFC grants U1509216 and 61472099, the National Sci-Tech Support Plan 2015BAH10F01, the Scientific Research Foundation for the Returned Overseas Chinese Scholars of Heilongjiang Province LC2016026, and the MOE–Microsoft Key Laboratory of Natural Language Processing and Speech, Harbin Institute of Technology. Hongzhi Wang is the corresponding author of this paper.



Copyright information

© Springer International Publishing AG 2017

Authors and Affiliations

  • Ruoxi Shi (1)
  • Hongzhi Wang (1)
  • Tao Wang (1)
  • Yutai Hou (1)
  • Yiwen Tang (1)
  • Jianzhong Li (1)
  • Hong Gao (1)

  1. Harbin Institute of Technology, Harbin, China
