ProSPECT: Proactive Storage Using Provenance for Efficient Compute and Tiering

  • Original Article
Transactions of the Indian National Academy of Engineering

Abstract

AI and analytics applications are good at deriving meaningful insights from data, but they do not always cope well with the storage management challenges that come with a high pace of data generation. At the same time, a conventional data storage and management layer is not optimized to derive timely insights and value from huge volumes of data. This problem is rooted in a classical cross-layer dilemma wherein neither the application nor the storage layer has the deep knowledge needed to optimize the whole system. We resolve this omniscience dilemma by introducing ProSPECT, a set of techniques to proactively optimize analytics computations and data storage. ProSPECT enables a data fabric to become aware of the purpose and relevance of stored data by intercepting the lineage of workflows under execution within existing analytics frameworks. Partial analytics computations can then be initiated proactively by the data fabric layer, where data is stored and managed. ProSPECT provides analytics applications with relevant data or precomputed insights and alleviates storage management challenges using proactive tiering and data approximation. We describe experiments with application case studies using Apache Spark and Alluxio to demonstrate an order of magnitude reduction in the storage space occupied in the fastest tier and in time to value for analytics applications.
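
The idea of using workflow lineage to drive proactive tiering can be illustrated with a minimal sketch. This is not the ProSPECT implementation and uses neither Apache Spark nor Alluxio; the function name `plan_tiers`, the dictionary representation of lineage, the two tier labels, and the example workflow are all assumptions introduced here for illustration. The sketch keeps a dataset in the fast tier only while some step that has not yet run still reads it.

```python
from typing import Dict, List, Set


def plan_tiers(lineage: Dict[str, List[str]], completed: Set[str]) -> Dict[str, str]:
    """Assign each dataset to a 'fast' or 'slow' tier from workflow lineage.

    lineage maps each derived dataset to the input datasets it is computed from.
    completed holds the derived datasets that have already been produced.
    A dataset is placed in the fast tier only if a pending step still reads it.
    (Hypothetical helper: names and tier labels are illustrative assumptions.)
    """
    # Collect every dataset name that appears anywhere in the lineage graph.
    datasets = set(lineage)
    for inputs in lineage.values():
        datasets.update(inputs)

    # Inputs of steps that have not run yet are the ones worth keeping hot.
    still_needed: Set[str] = set()
    for output, inputs in lineage.items():
        if output not in completed:
            still_needed.update(inputs)

    return {
        name: ("fast" if name in still_needed else "slow")
        for name in sorted(datasets)
    }


# Hypothetical four-dataset workflow: raw -> cleaned -> features -> model.
example_lineage = {
    "cleaned": ["raw"],
    "features": ["cleaned"],
    "model": ["features"],
}

# Once "cleaned" exists, "raw" is no longer read by any pending step.
example_plan = plan_tiers(example_lineage, completed={"cleaned"})
print(example_plan)
```

In this toy workflow, `raw` can be demoted as soon as `cleaned` has been produced, while `cleaned` and `features` remain in the fast tier because later steps still depend on them. A real data fabric would also have to weigh dataset sizes, tier capacities, and access costs, which this sketch ignores.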

Notes

  1. Apache-Falcon: https://falcon.apache.org/.

  2. Hitachi Ethernet Drives: http://www.hgst.com/company/innovation-center.

  3. Joyent Manta: https://www.joyent.com/manta.

  4. Flight Dataset: http://stat-computing.org/dataexpo/2009/the-data.html.

  5. BDAS: http://ampcamp.berkeley.edu/wp-content/uploads/2013/02/Berkeley-Data-Analytics-Stack-BDAS-Overview-Ion-Stoica-Strata-2013.pdf.

  6. KDD Cup 1999 Dataset: http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html.


Author information

Corresponding author

Correspondence to Muthukumar Murugan.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Doug Voigt: Hewlett Packard Enterprise (Retired).

About this article

Cite this article

Murugan, M., Bhattacharya, S., Voigt, D. et al. ProSPECT: Proactive Storage Using Provenance for Efficient Compute and Tiering. Trans Indian Natl. Acad. Eng. 7, 219–234 (2022). https://doi.org/10.1007/s41403-021-00261-8

  • DOI: https://doi.org/10.1007/s41403-021-00261-8
