Abstract
AI and analytics applications are good at deriving meaningful insights from data, but they do not always cope well with the storage management challenges that come with a high pace of data generation. At the same time, a conventional data storage and management layer is not optimized to derive timely insights and value from huge volumes of data. This problem is rooted in a classical cross-layer dilemma wherein neither the application nor the storage layer has the deep knowledge needed to optimize the whole system. We resolve this omniscience dilemma by introducing ProSPECT, a set of techniques to proactively optimize analytics computations and data storage. ProSPECT enables a data fabric to become aware of the purpose and relevance of stored data by intercepting the lineage of workflows under execution within existing analytics frameworks. Partial analytics computations can then be initiated proactively by the data fabric layer, where data is stored and managed. ProSPECT provides analytics applications with relevant data or precomputed insights and alleviates storage management challenges using proactive tiering and data approximation. We describe experiments with application case studies using Apache Spark and Alluxio to demonstrate an order of magnitude reduction in the storage space occupied in the fastest tier and in time to value for analytics applications.
Additional information
Doug Voigt: Hewlett Packard Enterprise (Retired).
About this article
Cite this article
Murugan, M., Bhattacharya, S., Voigt, D. et al. ProSPECT: Proactive Storage Using Provenance for Efficient Compute and Tiering. Trans Indian Natl. Acad. Eng. 7, 219–234 (2022). https://doi.org/10.1007/s41403-021-00261-8
DOI: https://doi.org/10.1007/s41403-021-00261-8