A New Concept of Sets to Handle Similarity in Databases: The SimSets

  • Ives R. V. Pola
  • Robson L. F. Cordeiro
  • Caetano Traina Jr.
  • Agma J. M. Traina
Part of the Lecture Notes in Computer Science book series (LNCS, volume 8199)


Traditional DBMSs depend heavily on the concept that a set never includes the same element twice. Modern applications, on the other hand, must deal with complex data, such as images, videos and genetic sequences, for which an exact match between two elements seldom occurs and is generally meaningless. It therefore makes sense that sets of complex data should not include two elements that are “too similar”. How can a concept equivalent to “sets” be created for complex data? And how can novel algorithms be designed so that it is naturally embedded in existing DBMSs? These are the issues that we tackle in this paper, through the concept of “similarity sets”, or SimSets for short. Several scenarios may benefit from SimSets. A typical example appears in sensor networks, in which SimSets can identify sensors that recurrently report similar measurements, so that some of them can be turned off to save energy. Specifically, our main contributions are: (i) highlighting the central properties of SimSets; (ii) proposing the basic algorithms required to create them from metric datasets, carefully designed to be naturally embedded into existing DBMSs; and (iii) evaluating their use in real-world applications to show that SimSets can improve data storage and retrieval as well as analysis processes. We report experiments on real data from networks of sensors within meteorological stations, providing a better conceptual underpinning for similarity search operations.
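The idea of a set that contains no two “too similar” elements can be illustrated with a minimal sketch. This is not the algorithm proposed in the paper: it is a simple greedy filter, written under the assumptions that similarity is measured by a metric distance (Euclidean here), that two elements count as “too similar” when their distance is at most a threshold, and that elements are processed in input order. The names `greedy_simset`, `euclidean` and `xi`, as well as the sample readings, are hypothetical.

```python
import math
from typing import Callable, List, Sequence, Tuple

Point = Tuple[float, ...]


def euclidean(a: Point, b: Point) -> float:
    """Euclidean distance, used here as one possible metric."""
    return math.dist(a, b)


def greedy_simset(
    items: Sequence[Point],
    dist: Callable[[Point, Point], float],
    xi: float,
) -> List[Point]:
    """Keep an element only if it is farther than xi from every kept element.

    Assumption: "too similar" means dist(x, y) <= xi. The result depends on
    the input order, so this is only one of many valid similarity sets.
    """
    kept: List[Point] = []
    for x in items:
        if all(dist(x, y) > xi for y in kept):
            kept.append(x)
    return kept


# Hypothetical sensor readings: (temperature, relative humidity).
readings = [
    (20.0, 0.50),
    (20.1, 0.51),  # close to the first reading
    (25.0, 0.40),
    (24.9, 0.41),  # close to the third reading
    (30.0, 0.30),
]

simset = greedy_simset(readings, euclidean, xi=0.5)
print(simset)
```

In the sensor scenario, the readings that the filter drops would correspond to sensors whose measurements are already represented by a kept one. A different processing order can yield a different result, which is one reason why the construction of such sets needs careful algorithmic treatment.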


Keywords: Sensor Network · Real World Application · Complex Data · Cosine Similarity · Similarity Threshold
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.





Copyright information

© Springer-Verlag Berlin Heidelberg 2013

Authors and Affiliations

  • Ives R. V. Pola (1)
  • Robson L. F. Cordeiro (1)
  • Caetano Traina Jr. (1)
  • Agma J. M. Traina (1)

  1. Computer Science Department - ICMC, University of São Paulo at São Carlos, Brazil
