Size Oblivious Programming with InfiniMem

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9519)

Abstract

Many recently proposed Big Data processing frameworks make programming easier, but typically expect the datasets to fit in the memory of either a single multicore machine or a cluster of multicore machines. When this assumption does not hold, these frameworks fail. We introduce the InfiniMem framework, which enables size-oblivious processing of large collections of objects that do not fit in memory by making them disk-resident. InfiniMem is easy to program with: the user simply indicates the large collections of objects that are to be made disk-resident, and InfiniMem transparently handles their I/O management. The InfiniMem library can manage a very large number of objects in a uniform manner, even though the objects have different characteristics and relationships which, when processed, give rise to a wide range of access patterns requiring different organizations of data on the disk. We demonstrate the ease of programming and the versatility of InfiniMem with 3 different probabilistic analytics algorithms and 3 different size-oblivious graph processing frameworks; each requires minimal effort, just 6–9 additional lines of code. We show that InfiniMem can successfully generate a mesh with 7.5 million nodes and 300 million edges (4.5 GB on disk) in 40 min, and that it performs the PageRank computation on a 14 GB graph with 134 million vertices and 805 million edges at 14 min per iteration on an 8-core machine with 8 GB RAM. Many graph generators and processing frameworks cannot handle such large graphs. We also exploit InfiniMem on a cluster to scale up an object-based distributed shared memory (DSM).
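
The abstract does not show InfiniMem's actual API, but the idea of a size-oblivious, disk-resident collection can be illustrated with a small, hypothetical stand-in. The C++ sketch below defines a DiskArray container and a Vertex record (both assumptions for illustration, not InfiniMem's real interface) that keeps fixed-size records in a backing file and fetches them on demand, so the collection can grow well beyond RAM while the processing loop reads it as if it were an in-memory array.

    // Illustrative sketch only: a minimal disk-backed container standing in
    // for the idea of a size-oblivious collection. DiskArray and Vertex are
    // hypothetical names; InfiniMem's real API is not shown in the abstract.
    #include <cstdio>
    #include <cstdint>
    #include <stdexcept>
    #include <string>

    struct Vertex {                 // fixed-size, trivially copyable record
        double rank;
        std::uint32_t out_degree;
    };

    class DiskArray {
    public:
        explicit DiskArray(const std::string& path)
            : f_(std::fopen(path.c_str(), "w+b")) {
            if (!f_) throw std::runtime_error("cannot open backing file");
        }
        ~DiskArray() { std::fclose(f_); }
        DiskArray(const DiskArray&) = delete;            // file handle is
        DiskArray& operator=(const DiskArray&) = delete; // not shareable

        void put(std::size_t i, const Vertex& v) {       // write record i
            std::fseek(f_, static_cast<long>(i * sizeof(Vertex)), SEEK_SET);
            std::fwrite(&v, sizeof(Vertex), 1, f_);
        }
        Vertex get(std::size_t i) {                      // read record i
            std::fseek(f_, static_cast<long>(i * sizeof(Vertex)), SEEK_SET);
            Vertex v{};
            if (std::fread(&v, sizeof(Vertex), 1, f_) != 1)
                throw std::runtime_error("read past end of collection");
            return v;
        }
    private:
        std::FILE* f_;
    };

    int main() {
        DiskArray vertices("vertices.bin");  // disk-resident, not in RAM
        const std::size_t n = 1'000'000;     // could be far larger than RAM
        for (std::size_t i = 0; i < n; ++i)
            vertices.put(i, Vertex{1.0 / n, 0});
        // Process as if in memory: the container handles the I/O.
        double sum = 0.0;
        for (std::size_t i = 0; i < n; ++i)
            sum += vertices.get(i).rank;
        std::printf("total rank = %f\n", sum);
    }

A production-quality library would likely buffer and cache records rather than issue one seek per access and, as the abstract notes, would choose different on-disk organizations for objects with different characteristics and access patterns; this sketch only conveys the programming model of marking a collection disk-resident and letting the library manage I/O.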


Copyright information

© Springer International Publishing Switzerland 2016

Authors and Affiliations

  • Sai Charan Koduru
  • Rajiv Gupta
  • Iulian Neamtiu

  Department of Computer Science and Engineering, University of California, Riverside, USA
