A Content-Driven ETL Processes for Open Data

  • Alain Berro
  • Imen Megdiche
  • Olivier Teste
Part of the Advances in Intelligent Systems and Computing book series (AISC, volume 312)


The emergence of statistical Open Data (OD) appears very promising for generating various analysis scenarios for decision-making systems. Nevertheless, OD has problematic characteristics such as semantic and structural heterogeneity, lack of schemas, autonomy and dispersion. These characteristics challenge traditional Extract-Transform-Load (ETL) processes, since the latter generally deal with well-structured schemas. In this paper we propose content-driven ETL processes that automate the extraction phase "as far as possible", based only on the content of flat Open Data sources. Our processes rely on data annotations and data mining techniques to discover hierarchical relationships. Processed data are then transformed into instance-schema graphs to facilitate structural data integration and the definition of the multidimensional schemas of the data warehouse.
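
To make the idea of discovering hierarchical relationships from content alone more concrete, the following is a minimal sketch and not the authors' algorithm. It assumes that a hierarchy between two columns of a flat table can be approximated by a strict many-to-one mapping (every child value has exactly one parent value, but not the reverse). The function names `is_rollup` and `hierarchy_edges`, and the sample rows, are hypothetical and chosen only for illustration.

```python
def is_rollup(rows, child, parent):
    """Return True if every value of `child` maps to exactly one value of `parent`."""
    seen = {}
    for row in rows:
        c, p = row[child], row[parent]
        # setdefault records the first parent seen for c; a later mismatch breaks the mapping
        if seen.setdefault(c, p) != p:
            return False
    return True


def hierarchy_edges(rows, columns):
    """Build a toy instance-schema graph from a flat table.

    Schema edges link a child column to a parent column; instance edges
    link the corresponding cell values. This is an illustrative assumption,
    not the graph model defined in the paper.
    """
    schema_edges = []
    instance_edges = set()
    for child in columns:
        for parent in columns:
            if child == parent:
                continue
            # keep only strict many-to-one mappings (exclude one-to-one columns)
            if is_rollup(rows, child, parent) and not is_rollup(rows, parent, child):
                schema_edges.append((child, parent))
                for row in rows:
                    instance_edges.add((row[child], row[parent]))
    return schema_edges, instance_edges


# Made-up sample rows standing in for a flat Open Data source.
rows = [
    {"city": "Toulouse", "region": "Occitanie", "country": "France"},
    {"city": "Montpellier", "region": "Occitanie", "country": "France"},
    {"city": "Lyon", "region": "Auvergne-Rhone-Alpes", "country": "France"},
    {"city": "Turin", "region": "Piedmont", "country": "Italy"},
]

schema_edges, instance_edges = hierarchy_edges(rows, ["city", "region", "country"])
for child, parent in schema_edges:
    print(f"{child} -> {parent}")
```

On this sample the sketch also reports the transitive edge from `city` to `country`; a fuller treatment would prune such edges, and would need to tolerate the noise and missing values typical of real Open Data.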


Open Data, ETL, Graphs, Self-Service BI, Hierarchical classification, Data warehouse





Copyright information

© Springer International Publishing Switzerland 2015

Authors and Affiliations

  1. Manufacture Tabacs, Université Toulouse I, Toulouse, France
  2. IRIT, Université Toulouse III, Toulouse, France
  3. IUT Blagnac, Université Toulouse II, Toulouse, France
