Large Scale Knowledge Matching with Balanced Efficiency-Effectiveness Using LSH Forest

  • Michael Cochez
  • Vagan Terziyan
  • Vadim Ermolayev
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10190)

Abstract

Evolving Knowledge Ecosystems were proposed as an approach to the Big Data challenge, following the hypothesis that knowledge evolves in a way similar to biological systems. The inner workings of the knowledge ecosystem can therefore be modeled after natural evolution. An evolving knowledge ecosystem consists of Knowledge Organisms, which form a representation of the knowledge, and the environment in which they reside. The environment consists of contexts, which are composed of so-called knowledge tokens. These tokens are ontological fragments extracted from information tokens, which in turn originate from the streams of information flowing into the ecosystem. In this article we investigate the use of LSH Forest (a self-tuning indexing scheme based on locality-sensitive hashing) for solving the problem of placing new knowledge tokens in the right contexts of the environment. We argue and show experimentally that LSH Forest possesses the required properties and could be used for large distributed set-ups. Further, we show experimentally that, for our type of data, minhashing works better than random hyperplane hashing. This paper is an extension of the paper “Balanced Large Scale Knowledge Matching Using LSH Forest” presented at the International Keystone Conference 2015.
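
To make the two hash families compared in the abstract concrete, the following is a minimal Python sketch and not the authors' implementation. It assumes that a knowledge token can be reduced to a set of string features; the function names, the salted-hash construction, and the sample tokens are hypothetical and chosen only for illustration. Minhash estimates the Jaccard similarity of two sets, whereas random hyperplane hashing estimates the cosine of the angle between their indicator vectors.

```python
import hashlib
import math
import random


def _hash(seed, feature):
    """Deterministic 64-bit hash of a feature under a seed (illustrative choice)."""
    digest = hashlib.blake2b(
        feature.encode("utf-8"),
        digest_size=8,
        salt=seed.to_bytes(8, "little"),
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(features, num_hashes=256):
    """For each seeded hash function, keep the smallest hash over the set."""
    return [min(_hash(seed, f) for f in features) for seed in range(num_hashes)]


def estimate_jaccard(sig_a, sig_b):
    """The fraction of agreeing positions estimates the Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def hyperplane_signature(features, num_bits=256):
    """One bit per random hyperplane: the sign of the set's projection onto it.

    Each hyperplane has one Gaussian component per feature, derived
    deterministically from the (plane, feature) pair.
    """
    bits = []
    for plane in range(num_bits):
        projection = sum(
            random.Random(f"{plane}:{f}").gauss(0.0, 1.0) for f in features
        )
        bits.append(projection >= 0.0)
    return bits


def estimate_cosine(sig_a, sig_b):
    """Bits agree with probability 1 - theta/pi, so theta = pi * (1 - agreement)."""
    agreement = sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
    return math.cos(math.pi * (1.0 - agreement))


def jaccard(a, b):
    """Exact Jaccard similarity, for comparison with the estimate."""
    return len(a & b) / len(a | b)


# Hypothetical knowledge tokens, each reduced to a set of string features.
token_a = {"person", "name", "email", "worksFor"}
token_b = {"person", "name", "email", "affiliation"}

print("exact Jaccard:    ", jaccard(token_a, token_b))
print("minhash estimate: ", estimate_jaccard(minhash_signature(token_a),
                                             minhash_signature(token_b)))
print("cosine estimate:  ", estimate_cosine(hyperplane_signature(token_a),
                                            hyperplane_signature(token_b)))
```

In an index such as LSH Forest, signatures like these would serve as the keys under which tokens are stored, so that tokens with similar signatures end up close together. The sketch only covers the signature step and leaves the index itself out.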

Keywords

Evolving knowledge ecosystems, Locality-sensitive hashing, LSH Forest, Minhash, Random hyperplane hashing, Big data

Notes

Acknowledgments

The authors would like to thank the Faculty of Information Technology of the University of Jyväskylä for financially supporting this research. Further, the implementation of the software was greatly simplified by the Guava library by Google, the Apache Commons Math™ library, and the Rabin hash library by Bill Dwyer and Ian Brandt.

Copyright information

© Springer International Publishing AG 2017

Authors and Affiliations

  • Michael Cochez (1, 2, 3)
  • Vagan Terziyan (3)
  • Vadim Ermolayev (4)

  1. Fraunhofer Institute for Applied Information Technology FIT, Sankt Augustin, Germany
  2. RWTH Aachen University, Informatik 5, Aachen, Germany
  3. Faculty of Information Technology, University of Jyväskylä, Jyväskylä, Finland
  4. Department of IT, Zaporozhye National University, Zaporozhye, Ukraine
