Semantic Similarity Group By Operators for Metric Data

  • Natan A. Laverde
  • Mirela T. Cazzolato
  • Agma J. M. Traina
  • Caetano Traina Jr.
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10609)


Grouping operators summarize data in a DBMS by arranging elements into groups using identity comparisons. However, for metric data, grouping by identity is seldom useful, since the concept of similarity is often a better fit. Some operators can group data elements using similarity, but the existing ones do not achieve good results for certain data domains or distributions. The major contributions of this work are a novel operator called SGB-Vote, which assigns groups using an election involving already assigned groups, and an extension for current operators that bounds each group to a maximum number of nearest neighbors. The operators were implemented in a framework and evaluated using real and synthetic datasets from diverse domains, considering both the quality of the groups and the execution time. The results show that the proposed operators produce higher-quality groups in all tested datasets and highlight that the operators can run efficiently inside a DBMS.
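
The abstract does not spell out how the election works, so the following Python sketch is only one possible reading of a vote-based similarity grouping, not the authors' SGB-Vote algorithm. The names `vote_group_by`, `eps`, `k`, and `metric` are hypothetical, as are the input order processing and the tie-breaking rule (lowest group id wins). Each element looks at the already-grouped elements within distance `eps`, optionally restricted to its `k` nearest ones, and joins the group that receives the most votes; if no such neighbor exists, it opens a new group.

```python
from collections import Counter
from math import dist  # Euclidean distance, used here as an example metric


def vote_group_by(points, eps, k=None, metric=dist):
    """Return one group id per point (illustrative sketch, assumptions above)."""
    labels = []
    for i, p in enumerate(points):
        # Already-assigned elements within the similarity threshold eps,
        # sorted from nearest to farthest.
        near = sorted(
            (d, j) for j in range(i) if (d := metric(p, points[j])) <= eps
        )
        if k is not None:
            near = near[:k]  # assumed bound: only the k nearest neighbors vote
        if not near:
            # No similar element has a group yet, so start a new group.
            labels.append(max(labels, default=-1) + 1)
            continue
        # Each neighbor votes for its own group; ties go to the lowest id.
        votes = Counter(labels[j] for _, j in near)
        top = max(votes.values())
        labels.append(min(g for g, c in votes.items() if c == top))
    return labels


pts = [(0, 0), (0.5, 0), (10, 10), (10.4, 10), (0.2, 0.1)]
print(vote_group_by(pts, eps=1.0))  # [0, 0, 1, 1, 0]
```

Any function satisfying the metric axioms can be passed as `metric`, which is what makes a sketch like this applicable to metric data rather than only to vectors.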


Similarity Group By, Grouping, Similarity comparison, Metric data



This research is partially funded by FAPESP, CNPq, CAPES, and the RESCUER Project, as well as by the European Commission (Grant: 614154) and by the CNPq/MCTI (Grant: 490084/2013-3).



Copyright information

© Springer International Publishing AG 2017

Authors and Affiliations

  • Natan A. Laverde (1)
  • Mirela T. Cazzolato (1)
  • Agma J. M. Traina (1)
  • Caetano Traina Jr. (1)
  1. Institute of Mathematics and Computer Sciences, University of Sao Paulo, Sao Carlos, Brazil
