Simplifying Extract-Transform-Load for Ranked Hierarchical Trees via Mapping Specifications

  • Sarfaraz Soomro
  • Andréa Matsunaga
  • José A. B. Fortes
Part of the Advances in Intelligent Systems and Computing book series (AISC, volume 346)


A popular approach to integrating heterogeneous data sources is to Extract, Transform and Load (ETL) data from disparate sources into a consolidated data store while addressing integration challenges including, but not limited to, structural differences between the source and target schemas, semantic differences in their vocabularies, and data encoding. This work focuses on the integration of tree-like hierarchical data which, when modeled as a relational schema, can take the shape of a flat schema, a self-referential schema, or a hybrid schema. Examples include evolutionary taxonomies, geological time scales, and organizational charts. Given the observed complexity of developing ETL processes for this particular but common type of data, our work aims to reduce the time and effort required to map and transform it. Our research automates and simplifies all possible transformations involving ranked self-referential and flat representations by (a) proposing MSL+, an extension to IBM’s Mapping Specification Language (MSL), to succinctly express the mapping between schemas while hiding the actual transformation implementation complexity from the user, and (b) implementing a transformation component for the Talend open-source ETL platform, called Tree Transformer (TT). We evaluated MSL+ and TT in the context of biodiversity data integration, where this class of transformations is a recurring pattern. We demonstrate the effectiveness of MSL+ with respect to development time savings, as well as a 2- to 25-fold improvement in transformation time achieved by TT compared with existing implementations and with Talend built-in components.
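To make the two representations concrete, the following is a minimal Python sketch of converting between a flat schema (one column per rank) and a ranked self-referential schema (each row points to its parent). It is not MSL+ syntax and not the Tree Transformer implementation; the rank list, field names (`id`, `parent_id`, `rank`, `name`), and both function names are assumptions chosen purely for illustration.

```python
from typing import Dict, List, Optional

# Hypothetical rank order, most general first (illustrative only).
RANKS = ["kingdom", "family", "genus"]

Row = Dict[str, Optional[str]]
Node = Dict[str, object]


def flat_to_self_referential(rows: List[Row]) -> List[Node]:
    """Turn flat rows (one column per rank) into parent-pointer nodes."""
    nodes: List[Node] = []
    ids: Dict[tuple, int] = {}  # (parent_id, rank, name) -> node id
    for row in rows:
        parent_id = None
        for rank in RANKS:
            name = row.get(rank)
            if name is None:
                continue  # missing rank: the next rank attaches to the nearest ancestor
            key = (parent_id, rank, name)
            if key not in ids:
                ids[key] = len(nodes) + 1
                nodes.append({"id": ids[key], "parent_id": parent_id,
                              "rank": rank, "name": name})
            parent_id = ids[key]
    return nodes


def self_referential_to_flat(nodes: List[Node]) -> List[Row]:
    """Emit one flat row per leaf by walking parent pointers up to the root."""
    by_id = {n["id"]: n for n in nodes}
    parent_ids = {n["parent_id"] for n in nodes}
    rows: List[Row] = []
    for node in nodes:
        if node["id"] in parent_ids:
            continue  # only leaves produce a flat row
        row: Row = {rank: None for rank in RANKS}
        current = node
        while current is not None:
            row[current["rank"]] = current["name"]
            current = by_id.get(current["parent_id"])
        rows.append(row)
    return rows
```

In this sketch, shared ancestors are deduplicated by their path from the root, so two genera in the same family yield a single family node. A hybrid schema, or trees in which interior nodes must also appear as their own flat rows, would need additional handling that is not shown here.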


data integration, hierarchical tree, self-referential schema, flat schema, mapping language, schema mapping, data transformation, MSL, ETL





Copyright information

© Springer International Publishing Switzerland 2015

Authors and Affiliations

  • Sarfaraz Soomro (1)
  • Andréa Matsunaga (1)
  • José A. B. Fortes (1)
  1. Advanced Computing and Information Systems (ACIS) Laboratory, University of Florida, Gainesville, USA
