Pivot-Based Similarity Wide-Joins Fostering Near-Duplicate Detection

  • Luiz Olmes Carvalho
  • Lucio Fernandes Dutra Santos
  • Agma Juci Machado Traina
  • Caetano Traina Jr.
Conference paper
Part of the Lecture Notes in Business Information Processing book series (LNBIP, volume 291)


Monitoring systems that aim to improve decision making in emergency scenarios currently benefit from crowdsourced information. The main issue with this kind of data is that the gathered reports quickly become too similar to one another. Such overly similar reports, called near-duplicates, add no valuable knowledge to assist crisis control committees in their decision-making tasks. Current approaches to detecting near-duplicates are usually based on two-phase processing: the first phase relies on similarity queries or clustering techniques, and the second, most computationally costly phase refines the result of the first. Aiming to reduce that cost and to improve near-duplicate detection, we developed a framework based on the similarity wide-join database operator. This paper extends the wide-join definition, enabling it to overcome its restrictions, and provides an efficient pivot-based algorithm that speeds up the entire process while retrieving the most similar elements in a single pass. We also investigate alternatives and propose efficient algorithms for choosing the pivots. Experiments on real datasets show that our framework is up to three orders of magnitude faster than competing techniques from the literature, while also improving the quality of the result by about 35%.
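
The general idea of pivot-based pruning in a similarity join can be illustrated with a minimal sketch. This is not the authors' algorithm: it is a generic illustration that assumes a metric distance (Euclidean here), a hypothetical function name `pivot_range_self_join`, and hypothetical parameters `radius` and `top_k` standing in for a similarity threshold and the number of most similar pairs to keep. Distances from every element to a few pivots are precomputed, and the triangle inequality gives a lower bound on the distance between two elements, so pairs whose bound already exceeds the threshold are discarded without computing their actual distance.

```python
import heapq
import math
from itertools import combinations


def pivot_range_self_join(points, pivots, radius, top_k, dist=math.dist):
    """Return the top_k closest pairs within `radius`, plus the number of
    actual distance computations between elements (illustrative only)."""
    # Precompute the distance from every element to every pivot.
    table = [[dist(p, q) for q in pivots] for p in points]

    computed = 0
    candidates = []
    for i, j in combinations(range(len(points)), 2):
        # Triangle inequality: |d(x, p) - d(y, p)| <= d(x, y) for any pivot p,
        # so the largest such difference is a lower bound on d(x, y).
        lower = max(
            (abs(a - b) for a, b in zip(table[i], table[j])), default=0.0
        )
        if lower > radius:
            continue  # pruned without computing the real distance

        d = dist(points[i], points[j])
        computed += 1
        if d <= radius:
            candidates.append((d, i, j))

    # Keep only the most similar pairs, ordered by increasing distance.
    return heapq.nsmallest(top_k, candidates), computed


if __name__ == "__main__":
    data = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.0, 5.2), (10.0, 0.0)]
    chosen_pivots = [(0.0, 0.0), (10.0, 0.0)]  # assumed, hand-picked pivots
    pairs, n_dist = pivot_range_self_join(data, chosen_pivots, 0.5, 2)
    print(pairs, n_dist)
```

In this sketch the pivots are picked by hand; how to choose them well is exactly the kind of question the paper investigates, and the choice strongly affects how many pairs the lower bound can discard.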


Similarity search · Similarity join · Query operators · Wide-join · Near-duplicate detection



Copyright information

© Springer International Publishing AG 2017

Authors and Affiliations

  • Luiz Olmes Carvalho (1, 2), corresponding author
  • Lucio Fernandes Dutra Santos (1, 3)
  • Agma Juci Machado Traina (1)
  • Caetano Traina Jr. (1)

  1. Institute of Mathematics and Computer Sciences, University of São Paulo, São Carlos, Brazil
  2. Federal Institute of Minas Gerais, Belo Horizonte, Brazil
  3. Federal Institute in the North of Minas Gerais, Montes Claros, Brazil
