nativeNDP: Processing Big Data Analytics on Native Storage Nodes

Vinçon, Tobias; Hardock, Sergey; Riegger, Christian; Koch, Andreas; Petrov, Ilia

doi:10.1007/978-3-030-28730-6_9

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 11695)

Included in the following conference series:

European Conference on Advances in Databases and Information Systems

Abstract

Data analytics tasks on large datasets are computationally intensive and often demand the compute power of cluster environments. Yet data cleansing, preparation, dataset characterization, and the computation of statistics or metrics are frequent steps. They are mostly performed ad hoc, in an explorative manner, and mandate low response times. However, such steps are I/O-intensive and typically very slow because of low data locality and inadequate interfaces and abstractions along the stack. The usual result is prohibitively expensive scans of the full dataset and transformations at interface boundaries.

In this paper, we examine R as an analytical tool that manages large persistent datasets in Ceph, a widespread cluster file system. We propose nativeNDP, a framework for Near-Data Processing that pushes down primitive R tasks and executes them in situ, directly within the storage device of a cluster node. Across a range of data sizes, we show that nativeNDP is more than an order of magnitude faster than other pushdown alternatives.
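
To illustrate the general idea of pushdown described above, the following is a minimal sketch under stated assumptions, not the nativeNDP implementation. It contrasts a host-side computation, which ships every value off the device, with a near-data computation, which ships only the result. The class `StorageNode`, its methods, and the 8-bytes-per-value accounting are hypothetical and exist only for this example; the sketch models neither Ceph nor R.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class StorageNode:
    """Hypothetical stand-in for a storage device holding one numeric column."""
    values: List[float]
    bytes_sent: int = 0  # bytes shipped from the device to the host so far

    def read_all(self) -> List[float]:
        # Conventional path: every value (assumed 8 bytes each) crosses
        # the device/host boundary before any computation happens.
        self.bytes_sent += 8 * len(self.values)
        return list(self.values)

    def pushdown_mean(self) -> float:
        # Near-data path: the mean is computed on the device and only
        # one 8-byte result is shipped back to the host.
        self.bytes_sent += 8
        return sum(self.values) / len(self.values)


def host_side_mean(node: StorageNode) -> float:
    """Compute the mean on the host after a full scan of the device."""
    data = node.read_all()
    return sum(data) / len(data)
```

Both paths return the same value; in this toy model they differ only in the number of bytes that leave the device, which grows with the dataset size for the host-side path and stays constant for the pushed-down one.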

Acknowledgements

This work has been partially supported by HAW Promotion MWK, Baden-Württemberg and BMBF PANDAS 01IS18081C/D.

Author information

Authors and Affiliations

Data Management Lab, Reutlingen University, Reutlingen, Germany
Tobias Vinçon, Sergey Hardock, Christian Riegger & Ilia Petrov
Databases and Distributed Systems Group, TU Darmstadt, Darmstadt, Germany
Sergey Hardock
Embedded Systems and Applications Group, TU Darmstadt, Darmstadt, Germany
Andreas Koch

Corresponding author

Correspondence to Tobias Vinçon.

Editor information

Editors and Affiliations

University of Maribor, Maribor, Slovenia
Tatjana Welzer
Alpen-Adria Universität Klagenfurt, Klagenfurt, Austria
Johann Eder
University of Maribor, Maribor, Slovenia
Vili Podgorelec
University of Maribor, Maribor, Slovenia
Aida Kamišalić Latifić

Cite this paper

Vinçon, T., Hardock, S., Riegger, C., Koch, A., Petrov, I. (2019). nativeNDP: Processing Big Data Analytics on Native Storage Nodes. In: Welzer, T., Eder, J., Podgorelec, V., Kamišalić Latifić, A. (eds) Advances in Databases and Information Systems. ADBIS 2019. Lecture Notes in Computer Science, vol 11695. Springer, Cham. https://doi.org/10.1007/978-3-030-28730-6_9

DOI: https://doi.org/10.1007/978-3-030-28730-6_9
Published: 13 August 2019
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-28729-0
Online ISBN: 978-3-030-28730-6
eBook Packages: Computer Science, Computer Science (R0)