Exploring the Hamming Distance in Distributed Infrastructures for Similarity Search

Modeling and Processing for Next-Generation Big-Data Technologies

Abstract

The amount of data available on the Internet now exceeds the zettabyte (ZB) scale, a scenario known in the literature as Big Data. Although traditional databases are very efficient at finding and retrieving specific content, they are inefficient in Big Data scenarios, since the great majority of such data is unstructured and scattered across the Internet. New databases are therefore required to support similarity search. To address this challenge, this chapter proposes exploring the Hamming similarity that exists between content identifiers generated with the Random Hyperplane Hashing function. Such identifiers provide the basis for building distributed infrastructures that facilitate similarity search. We present two approaches: a P2P solution (Hamming DHT) and a data center solution (HCube). The evaluations presented indicate that both are capable of improving recall in similarity search.
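To make the central idea concrete, the following is a minimal sketch of how identifiers of this kind could be generated and compared. It assumes the common sign-of-random-projection formulation of Random Hyperplane Hashing, in which each identifier bit records on which side of a random hyperplane a content vector falls. The function names, the Gaussian sampling of hyperplanes, the seed, and the 16-bit identifier length are all illustrative assumptions and are not taken from the chapter.

```python
import random


def make_hyperplanes(num_bits, dim, seed=0):
    """Draw num_bits random hyperplanes (normal vectors) in dim dimensions.

    Assumption: components are sampled from a standard Gaussian.
    """
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_bits)]


def rhh_identifier(vector, hyperplanes):
    """Build an integer identifier with one bit per hyperplane.

    A bit is 1 when the vector lies on the non-negative side of the hyperplane.
    """
    ident = 0
    for plane in hyperplanes:
        dot = sum(p * v for p, v in zip(plane, vector))
        ident = (ident << 1) | (1 if dot >= 0 else 0)
    return ident


def hamming_distance(a, b):
    """Number of bit positions in which two identifiers differ."""
    return bin(a ^ b).count("1")


def hamming_similarity(a, b, num_bits):
    """Fraction of bit positions in which two identifiers agree."""
    return 1.0 - hamming_distance(a, b) / num_bits


if __name__ == "__main__":
    NUM_BITS = 16  # hypothetical identifier length
    planes = make_hyperplanes(NUM_BITS, dim=3)
    query = rhh_identifier([1.0, 2.0, 3.0], planes)
    near = rhh_identifier([1.1, 2.0, 2.9], planes)
    far = rhh_identifier([-1.0, -2.0, -3.0], planes)
    print(hamming_similarity(query, near, NUM_BITS))
    print(hamming_similarity(query, far, NUM_BITS))
```

Under this formulation, vectors separated by a small angle tend to receive identifiers that differ in few bits, so a distributed infrastructure that places identifiers with a small Hamming distance close to each other can serve a similarity query by visiting only a few nodes.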

Author information

Corresponding author

Correspondence to Rodolfo da Silva Villaça.

Copyright information

© 2015 Springer International Publishing Switzerland

About this chapter

Cite this chapter

da Silva Villaça, R., Pasquini, R., de Paula, L.B., Magalhães, M.F. (2015). Exploring the Hamming Distance in Distributed Infrastructures for Similarity Search. In: Xhafa, F., Barolli, L., Barolli, A., Papajorgji, P. (eds) Modeling and Processing for Next-Generation Big-Data Technologies. Modeling and Optimization in Science and Technologies, vol 4. Springer, Cham. https://doi.org/10.1007/978-3-319-09177-8_1

  • DOI: https://doi.org/10.1007/978-3-319-09177-8_1

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-09176-1

  • Online ISBN: 978-3-319-09177-8

  • eBook Packages: Engineering, Engineering (R0)
