Abstract
Data analytics tasks on large datasets are computationally intensive and often demand the compute power of cluster environments. Yet data cleansing, preparation, dataset characterization, and the computation of statistics or metrics are frequent steps. They are mostly performed ad hoc, in an explorative manner, and mandate low response times. However, such steps are I/O intensive and typically very slow due to low data locality and inadequate interfaces and abstractions along the stack. These shortcomings typically result in prohibitively expensive scans of the full dataset and transformations on interface boundaries.
In this paper, we examine R as an analytical tool managing large persistent datasets in Ceph, a widespread cluster file system. We propose nativeNDP, a framework for Near-Data Processing that pushes down primitive R tasks and executes them in situ, directly within the storage device of a cluster node. Across a range of data sizes, we show that nativeNDP is more than an order of magnitude faster than other pushdown alternatives.
Acknowledgements
This work has been partially supported by HAW Promotion MWK, Baden-Württemberg and BMBF PANDAS 01IS18081C/D.
Copyright information
© 2019 Springer Nature Switzerland AG
About this paper
Cite this paper
Vinçon, T., Hardock, S., Riegger, C., Koch, A., Petrov, I. (2019). nativeNDP: Processing Big Data Analytics on Native Storage Nodes. In: Welzer, T., Eder, J., Podgorelec, V., Kamišalić Latifić, A. (eds) Advances in Databases and Information Systems. ADBIS 2019. Lecture Notes in Computer Science, vol 11695. Springer, Cham. https://doi.org/10.1007/978-3-030-28730-6_9
DOI: https://doi.org/10.1007/978-3-030-28730-6_9
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-28729-0
Online ISBN: 978-3-030-28730-6
eBook Packages: Computer Science, Computer Science (R0)