A New Concept of Sets to Handle Similarity in Databases: The SimSets

Pola, Ives R. V.; Cordeiro, Robson L. F.; Traina, Caetano; Traina, Agma J. M.

doi:10.1007/978-3-642-41062-8_4

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 8199)

Included in the following conference series: International Conference on Similarity Search and Applications

Abstract

Traditional DBMSs rely heavily on the concept that a set never includes the same element twice. Modern applications, however, must deal with complex data, such as images, videos and genetic sequences, for which an exact match between two elements seldom occurs and is generally meaningless. It therefore makes sense that sets of complex data should not include two elements that are “too similar”. How can we create a concept equivalent to “sets” for complex data? And how can we design novel algorithms that allow it to be naturally embedded in existing DBMSs? These are the issues we tackle in this paper through the concept of “similarity sets”, or SimSets for short. Several scenarios may benefit from SimSets. A typical example appears in sensor networks, in which SimSets can identify sensors that recurrently report similar measurements, so that some of them can be turned off to save energy. Specifically, our main contributions are: (i) highlighting the central properties of SimSets; (ii) proposing the basic algorithms required to create them from metric datasets, carefully designed to be naturally embedded into existing DBMSs; and (iii) evaluating their use in real-world applications to show that SimSets can improve data storage and retrieval, as well as the analysis processes. We report experiments on real data from networks of sensors within meteorological stations, providing a better conceptual underpinning for similarity search operations.
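To make the idea of a set with no two “too similar” elements concrete, the following is a minimal illustrative sketch, not the algorithms proposed in the paper. It assumes a simple greedy, order-dependent policy: an element is kept only if it is farther than a threshold from every element already kept. The function name `greedy_simset`, the `threshold` parameter, the Euclidean distance and the sample sensor readings are all assumptions introduced for illustration.

```python
from math import dist
from typing import Callable, Iterable, List, TypeVar

T = TypeVar("T")


def greedy_simset(
    items: Iterable[T],
    distance: Callable[[T, T], float],
    threshold: float,
) -> List[T]:
    """Return a subset in which every pair of elements is more than
    `threshold` apart under `distance` (hypothetical greedy policy)."""
    kept: List[T] = []
    for item in items:
        # Keep the item only if it is not "too similar" to any kept element.
        if all(distance(item, other) > threshold for other in kept):
            kept.append(item)
    return kept


# Hypothetical sensor readings as (temperature, humidity) pairs.
readings = [(20.0, 0.50), (20.1, 0.51), (25.0, 0.40), (20.05, 0.50), (30.0, 0.30)]

# Euclidean distance (math.dist) stands in for any metric distance function.
representatives = greedy_simset(readings, dist, threshold=0.5)
print(representatives)
```

In this sketch the result depends on the order in which elements arrive, and each insertion scans all kept elements; a database implementation would more plausibly rely on a metric index and a well-defined extraction policy.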

Author information

Authors and Affiliations

Computer Science Department - ICMC, University of São Paulo at São Carlos, Brazil
Ives R. V. Pola, Robson L. F. Cordeiro, Caetano Traina Jr. & Agma J. M. Traina

Editor information

Editors and Affiliations

Database Laboratory, Universidade da Coruña, Spain
Nieves Brisaboa & Oscar Pedreira
Faculty of Informatics, Masaryk University, Brno, Czech Republic
Pavel Zezula

Cite this paper

Pola, I.R.V., Cordeiro, R.L.F., Traina, C., Traina, A.J.M. (2013). A New Concept of Sets to Handle Similarity in Databases: The SimSets. In: Brisaboa, N., Pedreira, O., Zezula, P. (eds) Similarity Search and Applications. SISAP 2013. Lecture Notes in Computer Science, vol 8199. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-41062-8_4

DOI: https://doi.org/10.1007/978-3-642-41062-8_4
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-41061-1
Online ISBN: 978-3-642-41062-8
eBook Packages: Computer Science (R0)
