Advanced Data Integration with Signifiers: Case Studies for Rail Automation

  • Alexander Wurl
  • Andreas Falkner
  • Alois Haselböck
  • Alexandra Mazak
Conference paper
Part of the Communications in Computer and Information Science book series (CCIS, volume 814)


In Rail Automation, planning future projects requires the integration of business-critical data from heterogeneous, often noisy data sources. Current integration approaches often neglect uncertainties and inconsistencies in the integration process and thus cannot guarantee the necessary data quality. To tackle these issues, we propose a semi-automated process for data import in which the user resolves ambiguous data classifications. Finding the correct data warehouse entry for a source value in a proprietary, often semi-structured format is supported by the notion of a signifier, which is a natural extension of composite primary keys. In three case studies we show that this approach (i) facilitates high-quality data integration while minimizing user interaction, (ii) leverages approximate name matching of railway station and entity names, and (iii) helps extract features from contextual data for data cross-checks, thus supporting the planning phases of railway projects.
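The idea of matching a source value against warehouse entries through a composite, approximately compared key can be illustrated with a minimal sketch. It is not the authors' implementation: the warehouse contents, the entry IDs, the function names, the averaging of per-component similarities, and the `accept` and `margin` thresholds are all assumptions made for illustration, and Python's standard `difflib` stands in for whatever string metric is actually used.

```python
from difflib import SequenceMatcher

# Hypothetical warehouse: each key is a tuple of name components that
# plays the role of a composite key compared approximately, not exactly.
WAREHOUSE = {
    ("signal", "main", "led"): "E-100",
    ("signal", "shunting", "led"): "E-101",
    ("point", "machine", "electric"): "E-200",
}


def component_score(a, b):
    """Case-insensitive similarity of two components, in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def signifier_score(source, key):
    """Average component similarity (an assumed scoring rule)."""
    if len(source) != len(key):
        return 0.0
    return sum(component_score(s, k) for s, k in zip(source, key)) / len(key)


def classify(source, accept=0.85, margin=0.1):
    """Return ("matched" | "ambiguous" | "unknown", candidate entry IDs).

    "ambiguous" means several entries score close to the best one, so
    the decision would be handed to the user.
    """
    ranked = sorted(
        ((signifier_score(source, key), entry_id)
         for key, entry_id in WAREHOUSE.items()),
        reverse=True,
    )
    best_score = ranked[0][0]
    if best_score < accept:
        return ("unknown", [])
    close = [entry_id for score, entry_id in ranked
             if score >= accept and best_score - score <= margin]
    status = "matched" if len(close) == 1 else "ambiguous"
    return (status, close)


if __name__ == "__main__":
    print(classify(("Signl", "main", "LED")))       # noisy spelling
    print(classify(("balise", "fixed", "data")))    # no similar entry
    print(classify(("signal", "main", "led"), accept=0.7, margin=0.3))
```

In this sketch, only the "ambiguous" outcome requires user interaction; clear matches are assigned automatically and values without a sufficiently similar entry are reported as unknown. Loosening `accept` and `margin` trades fewer missed matches for more questions to the user.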


Data integration, Signifier, Data quality



Copyright information

© Springer International Publishing AG, part of Springer Nature 2018

Authors and Affiliations

  • Alexander Wurl (1)
  • Andreas Falkner (1)
  • Alois Haselböck (1)
  • Alexandra Mazak (2)
  1. Siemens AG Österreich, Corporate Technology, Vienna, Austria
  2. Business Informatics Group, TU Wien, Vienna, Austria
