Instant-On Scientific Data Warehouses

Lazy ETL for Data-Intensive Research
  • Yağız Kargın
  • Holger Pirk
  • Milena Ivanova
  • Stefan Manegold
  • Martin Kersten
Part of the Lecture Notes in Business Information Processing book series (LNBIP, volume 154)


At the dawn of the data-intensive research era, scientific discovery deploys data analysis techniques similar to those that drive business intelligence. As in classical Extract, Transform and Load (ETL) processes, data is loaded in its entirety from external data sources (repositories) into a scientific data warehouse before it can be analyzed. This process is both time- and resource-intensive, and it may be largely unnecessary when only a subset of the data is of interest to a particular user. To overcome this problem, we propose a novel technique that lowers the cost of data loading: Lazy ETL. Data is extracted and loaded transparently, on the fly, and only for the data items that a query actually requires. Extensive experiments demonstrate a significant reduction in the time from source data availability to query answer compared to state-of-the-art solutions. In addition to reducing the cost of bootstrapping a scientific data warehouse, our approach also reduces the cost of loading newly arriving data.
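The idea of loading only the required data items can be illustrated with a minimal sketch. It is a toy illustration under stated assumptions, not the system described in the paper: the `LazyWarehouse` class, its methods, and the station names are all hypothetical. Sources are registered cheaply up front, and the costly extraction runs only the first time a query touches an item.

```python
from typing import Callable, Dict, List


class LazyWarehouse:
    """Hypothetical toy warehouse: sources are registered eagerly, loaded lazily."""

    def __init__(self) -> None:
        # item id -> function that performs the (expensive) extraction
        self._extractors: Dict[str, Callable[[], List[float]]] = {}
        # item id -> values that have already been extracted and loaded
        self._loaded: Dict[str, List[float]] = {}
        # number of extractions actually performed, for inspection
        self.load_count = 0

    def register(self, item_id: str, extractor: Callable[[], List[float]]) -> None:
        """Record where an item comes from without reading its data."""
        self._extractors[item_id] = extractor

    def _fetch(self, item_id: str) -> List[float]:
        """Extract and load an item on first access, then reuse the loaded copy."""
        if item_id not in self._loaded:
            self._loaded[item_id] = self._extractors[item_id]()
            self.load_count += 1
        return self._loaded[item_id]

    def query_mean(self, item_ids: List[str]) -> float:
        """Answer a query, loading only the items the query refers to."""
        values = [v for item_id in item_ids for v in self._fetch(item_id)]
        return sum(values) / len(values)


def make_extractor(values: List[float]) -> Callable[[], List[float]]:
    """Stand-in for reading one file from an external repository."""
    return lambda: list(values)


warehouse = LazyWarehouse()
warehouse.register("station_a", make_extractor([1.0, 2.0, 3.0]))
warehouse.register("station_b", make_extractor([10.0, 20.0]))
warehouse.register("station_c", make_extractor([100.0]))

# Only station_a is extracted; station_b and station_c stay untouched.
mean_a = warehouse.query_mean(["station_a"])
```

In this sketch, the three registrations cost nothing beyond bookkeeping, and the query over `station_a` triggers a single extraction. An eager ETL pipeline would have loaded all three sources before answering anything.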


Keywords: Query Processing, Query Time, Data Ingestion, Query Evaluation, Incremental Update
These keywords were added by machine and not by the authors.





Copyright information

© Springer-Verlag Berlin Heidelberg 2013

Authors and Affiliations

  • Yağız Kargın (1)
  • Holger Pirk (1)
  • Milena Ivanova (1)
  • Stefan Manegold (1)
  • Martin Kersten (1)

  1. Centrum Wiskunde & Informatica (CWI), Amsterdam, The Netherlands
