Two-dimensional indexing to provide one-integrated-memory view of distributed memory for a massively-parallel search engine

  • Tae-Seob Yun
  • Kyu-Young Whang (Email author)
  • Hyuk-Yoon Kwon
  • Jun-Sung Kim
  • Il-Yeol Song
Article

Abstract

We propose two-dimensional indexing, a novel in-memory indexing architecture that operates over the distributed memory of a massively-parallel search engine. The goal of two-dimensional indexing is to provide a one-integrated-memory view, as in a single-node system with one large integrated memory. In two-dimensional indexing, we partition the entire index into n × m fragments and distribute them over the memories of multiple nodes in such a way that each fragment is stored entirely in the main memory of one node. The proposed architecture is scalable because it uses a scaled-out shared-nothing architecture, and it achieves low query response time because it processes queries in main memory. We also propose the concept of the one-memory point, which is the amount of memory space required to store the entire index completely in main memory, thereby providing a one-integrated-memory view. We first demonstrate the effectiveness of two-dimensional indexing with single-keyword queries and then extend the notion to handle multiple-keyword queries. To handle multiple-keyword queries, we adopt pre-join, which materializes a multiple-keyword query a priori, as well as a new notion of semi-memory join, which avoids the extensive communication overhead of performing joins across multiple nodes. In experiments using a real-life search query set over a database of 100 million crawled Web documents, we show that two-dimensional indexing can effectively provide a one-integrated-memory view without requiring much additional memory compared with a single-node system using one large integrated memory. We also show that, in an ideal case, a six-node prototype improves query processing performance by up to 535.54 times over a disk-based search engine that has an equivalent amount of in-memory buffer but does not use two-dimensional indexing. This improvement is expected to grow as the system is scaled out to a larger number of machines.
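
The partitioning idea in the abstract can be illustrated with a minimal sketch. The abstract does not say what the two dimensions are, so this example assumes, purely for illustration, that rows split the keyword space by hash and columns split the document space by document ID. The function names, the round-robin placement, and the toy postings are all hypothetical and are not taken from the paper.

```python
from collections import defaultdict
from zlib import crc32


def keyword_row(keyword, n):
    """Row of the fragment grid for a keyword (assumed: hash partitioning)."""
    return crc32(keyword.encode("utf-8")) % n


def fragment_of(keyword, doc_id, n, m):
    """Map one posting to its (row, col) fragment in an n x m grid.

    Assumption: columns partition documents by doc ID modulo m.
    """
    return keyword_row(keyword, n), doc_id % m


def build_fragments(postings, n, m):
    """Group (keyword, doc_id) postings into in-memory fragments."""
    fragments = defaultdict(lambda: defaultdict(list))
    for keyword, doc_id in postings:
        fragments[fragment_of(keyword, doc_id, n, m)][keyword].append(doc_id)
    return fragments


def assign_to_nodes(n, m, num_nodes):
    """Place each fragment wholly on one node (round-robin, an assumption)."""
    return {(r, c): (r * m + c) % num_nodes for r in range(n) for c in range(m)}


def search(keyword, fragments, n, m):
    """Single-keyword query: read the m fragments in the keyword's row."""
    row = keyword_row(keyword, n)
    result = []
    for col in range(m):
        # .get avoids creating empty fragments for rows that hold no data
        result.extend(fragments.get((row, col), {}).get(keyword, []))
    return sorted(result)


postings = [("cat", 1), ("cat", 2), ("dog", 2), ("cat", 7)]
fragments = build_fragments(postings, n=2, m=3)
placement = assign_to_nodes(n=2, m=3, num_nodes=6)
cat_docs = search("cat", fragments, n=2, m=3)
```

Under these assumptions, a single-keyword query touches only the fragments in one row, and each of those fragments lives entirely in the memory of one node, which is the property the abstract requires of the distribution.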

Keywords

  • Massively-parallel search engine
  • DB-IR integration
  • Pre-join
  • Multiple-keyword search queries
  • Distributed memory

Notes

Acknowledgements

This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korean Government (MSIT) (No. 2016R1A2B4015929).

Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2018

Authors and Affiliations

  • Tae-Seob Yun (1)
  • Kyu-Young Whang (1), Email author
  • Hyuk-Yoon Kwon (2)
  • Jun-Sung Kim (1)
  • Il-Yeol Song (3)

  1. Department of Computer Science, KAIST, Daejeon, Korea
  2. Department of Global Fusion Industrial Engineering, Seoul National University of Science and Technology, Seoul, Korea
  3. College of Information Science and Technology, Drexel University, Philadelphia, USA
