Abstract
The amount of data available on the Internet now exceeds the zettabyte (ZB) scale, a scenario known in the literature as Big Data. Although traditional databases are very efficient at finding and retrieving specific content, they are inefficient in a Big Data scenario, since the great majority of such data is unstructured and scattered across the Internet. New databases are therefore required to support similarity search. To handle this challenging scenario, this chapter proposes exploiting the Hamming similarity that exists between content identifiers generated with the Random Hyperplane Hashing function. Such identifiers provide the basis for building distributed infrastructures that facilitate similarity search. We present two approaches: a P2P solution (Hamming DHT) and a data center solution (HCube). Evaluations indicate that both are capable of improving recall in a similarity search.
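As an illustration of the idea in the abstract, the following is a minimal sketch of random hyperplane hashing in the general form described by Charikar: each identifier bit records on which side of a random hyperplane a content vector falls, so vectors with a small angle between them tend to receive identifiers with a small Hamming distance. The function names, the 64-bit identifier length, the Gaussian hyperplanes, and the sample vectors are assumptions made for this example and are not taken from the chapter.

```python
import random


def make_hyperplanes(num_bits, dim, seed=0):
    """Draw one random Gaussian normal vector per identifier bit (assumed setup)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_bits)]


def rhh_identifier(vector, hyperplanes):
    """Build an integer identifier: one bit per hyperplane, set when the dot product is >= 0."""
    ident = 0
    for plane in hyperplanes:
        dot = sum(p * x for p, x in zip(plane, vector))
        ident = (ident << 1) | (1 if dot >= 0 else 0)
    return ident


def hamming_distance(a, b):
    """Number of bit positions in which two identifiers differ."""
    return bin(a ^ b).count("1")


def hamming_similarity(a, b, num_bits):
    """Fraction of identifier bits on which two identifiers agree."""
    return 1.0 - hamming_distance(a, b) / num_bits


# Hypothetical content vectors, used only to exercise the functions.
NUM_BITS = 64
planes = make_hyperplanes(NUM_BITS, dim=4)
item = rhh_identifier([1.0, 2.0, 3.0, 4.0], planes)
near = rhh_identifier([1.1, 2.0, 2.9, 4.2], planes)
far = rhh_identifier([-1.0, -2.0, -3.0, -4.0], planes)

print(hamming_similarity(item, near, NUM_BITS))
print(hamming_similarity(item, far, NUM_BITS))
```

In this sketch the nearly parallel vector is expected to agree with the original on most bits, while the opposite vector lands on the other side of every hyperplane. A distributed infrastructure could then place or look up content by identifier so that similar items end up close to each other; how Hamming DHT and HCube actually do this is described in the chapter itself and is not reproduced here.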
Copyright information
© 2015 Springer International Publishing Switzerland
About this chapter
Cite this chapter
da Silva Villaça, R., Pasquini, R., de Paula, L.B., Magalhães, M.F. (2015). Exploring the Hamming Distance in Distributed Infrastructures for Similarity Search. In: Xhafa, F., Barolli, L., Barolli, A., Papajorgji, P. (eds) Modeling and Processing for Next-Generation Big-Data Technologies. Modeling and Optimization in Science and Technologies, vol 4. Springer, Cham. https://doi.org/10.1007/978-3-319-09177-8_1
DOI: https://doi.org/10.1007/978-3-319-09177-8_1
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-09176-1
Online ISBN: 978-3-319-09177-8
eBook Packages: Engineering, Engineering (R0)