Journal on Data Semantics, Volume 7, Issue 2, pp 65–85

Big Data Semantics

  • Paolo Ceravolo
  • Antonia Azzini
  • Marco Angelini
  • Tiziana Catarci
  • Philippe Cudré-Mauroux
  • Ernesto Damiani
  • Alexandra Mazak
  • Maurice Van Keulen
  • Mustafa Jarrar
  • Giuseppe Santucci
  • Kai-Uwe Sattler
  • Monica Scannapieco
  • Manuel Wimmer
  • Robert Wrembel
  • Fadi Zaraket
Original Article


Big Data technology has discarded traditional data modeling approaches as no longer applicable to distributed data processing. It is, however, widely recognized that Big Data imposes novel challenges in data and infrastructure management. Indeed, multiple components and procedures must be coordinated to ensure a high level of data quality and accessibility for the application layers, e.g., data analytics and reporting. In this paper, the third of its kind co-authored by members of IFIP WG 2.6 on Data Semantics, we present a review of the literature addressing these topics and discuss relevant challenges for future research. Based on our literature review, we argue that methods, principles, and perspectives developed by the Data Semantics community can contribute significantly to addressing Big Data challenges.



This research was partially supported by the European Union’s Horizon 2020 research and innovation programme under the TOREADOR project, Grant Agreement No. 688797. The work of R. Wrembel is supported by the National Science Center, Grant No. 2015/19/B/ST6/02637.


  1. 1.
    Zikopoulos P, Eaton C et al (2011) Understanding big data: analytics for enterprise class hadoop and streaming data. McGraw-Hill Osborne Media, New YorkGoogle Scholar
  2. 2.
    Ward JS, Barker A (2013) Undefined by data: a survey of big data definitions. arXiv preprint arXiv:1309.5821
  3. 3.
    Beyer MA, Laney D (2012) The importance of big data: a definition. Gartner, Stamford, pp 2014–2018Google Scholar
  4. 4.
    Laney D (2001) 3d data management: controlling data volume, velocity and variety. META Gr Res Note 6:70Google Scholar
  5. 5.
    Hashem IAT, Yaqoob I, Anuar NB, Mokhtar S, Gani A, Khan SU (2015) The rise of "big data" on cloud computing: review and open research issues. Inf Syst 47:98–115CrossRefGoogle Scholar
  6. 6.
    Wamba SF, Akter S, Edwards A, Chopin G, Gnanzou D (2015) How big data can make big impact: findings from a systematic review and a longitudinal case study. Int J Prod Econ 165:234–246 [Online]. Accessed 20 Feb 2018
  7. 7.
    Madden S (2012) From databases to big data. IEEE Internet Comput 16(3):4–6CrossRefGoogle Scholar
  8. 8.
    Amazon A (2016) Amazon 2016 [Online]. 2016-01-06
  9. 9.
    Hadoop A (2009) Hadoop [Online]. 2009-03-06
  10. 10.
    Chen H, Chiang RH, Storey VC (2012) Business intelligence and analytics: from big data to big impact. MIS Q 36(4):1165–1188Google Scholar
  11. 11.
    Wu X, Zhu X, Wu G-Q, Ding W (2014) Data mining with big data. IEEE Trans Knowl Data Eng 26(1):97–107CrossRefGoogle Scholar
  12. 12.
    Hilbert M (2016) Big data for development: a review of promises and challenges. Dev Policy Rev 34(1):135–174MathSciNetCrossRefGoogle Scholar
  13. 13.
    Assunç ao MD, Calheiros RN, Bianchi S, Netto MA, Buyya R, (2015) Big data computing and clouds: trends and future directions. J Parallel Distrib Comput 79:3–15Google Scholar
  14. 14.
    Markl V (2014) Breaking the chains: on declarative data analysis and data independence in the big data era. Proc VLDB Endow 7(13):1730–1733CrossRefGoogle Scholar
  15. 15.
    Damiani E, Oliboni B, Quintarelli E, Tanca L (2003) Modeling semistructured data by using graph-based constraints. OTM confederated international conferences "On the move to meaningful internet systems". Springer, Berlin, pp 20–21Google Scholar
  16. 16.
    Poole J, Chang D, Tolbert D, Mellor D (2003) Common warehouse metamodel. Developer’s guide, Wiley, HobokenGoogle Scholar
  17. 17.
    Ardagna C, Asal R, Damiani E, Vu Q (2015) From security to assurance in the cloud: a survey. ACM Comput Surv: CSUR 48(1):2:1–2:50Google Scholar
  18. 18.
    Wilkinson MD, Dumontier M, Aalbersberg IJ, Appleton G, Axton M, Baak A, Blomberg N, Boiten J-W, da Silva Santos LB, Bourne PE et al (2016) The fair guiding principles for scientific data management and stewardship. Sci Data 3:160018CrossRefGoogle Scholar
  19. 19.
    Aberer K, Catarci T, Cudré-Mauroux P, Dillon T, Grimm S, Hacid M-S, Illarramendi A, Jarrar M, Kashyap V, Mecella M et al (2004) Emergent semantics systems. Semantics of a networked world. Semantics for grid databases. Springer, Berlin, pp 14–43CrossRefGoogle Scholar
  20. 20.
    Cudré-Mauroux P, Aberer K, Abdelmoty AI, Catarci T, Damiani E, Illaramendi A, Jarrar M, Meersman R, Neuhold EJ, Parent C et al (2006) Viewpoints on emergent semantics. In: Spaccapietra S, Aberer K, Cudré-Mauroux P (eds) Journal on data semantics VI. Springer, Berlin, pp 1–27Google Scholar
  21. 21.
    Ardagna CA, Ceravolo P, Damiani E (2016) Big data analytics as-a-service: Issues and challenges. In: IEEE International conference on Big Data (Big Data). IEEE, pp 3638–3644Google Scholar
  22. 22.
    Chen M, Mao S, Liu Y (2014) Big data: a survey. Mob Netw Appl 19(2):171–209CrossRefGoogle Scholar
  23. 23.
    Azzini A, Ceravolo P (2013) Consistent process mining over big data triple stores. In: IEEE international congress on Big Data (BigData Congress). IEEE, pp 54–61Google Scholar
  24. 24.
    Woods WA (1975) What’s in a link: foundations for semantic networks. In: Representation and understanding. Elsevier, pp 35–82Google Scholar
  25. 25.
    Franklin MJ, Halevy AY, Maier D (2005) From databases to dataspaces: a new abstraction for information management. SIGMOD Rec 34(4):27–33 [Online].
  26. 26.
    Smith K, Seligman L, Rosenthal A, Kurcz C, Greer M, Macheret C, Sexton M, Eckstein A (2014) Big metadata: the need for principled metadata management in big data ecosystems. In: Proceedings of workshop on data analytics in the Cloud, series DanaC’14. ACM, New York, pp 13:1–13:4 [Online].
  27. 27.
    Waller MA, Fawcett SE (2013) Data science, predictive analytics, and big data: a revolution that will transform supply chain design and management. J Bus Logist 34(2):77–84CrossRefGoogle Scholar
  28. 28.
    Borkar V, Carey MJ, Li C (2012) Inside big data management: ogres, onions, or parfaits? In: Proceedings of the 15th international conference on extending database technology. ACM, pp 3–14Google Scholar
  29. 29.
    White T (2012) Hadoop: the definitive guide. O’Reilly Media Inc, SebastopolGoogle Scholar
  30. 30.
    Jagadish H (2015) Big data and science: myths and reality. Big Data Res 2(2):49–52MathSciNetCrossRefGoogle Scholar
  31. 31.
    Pääkkönen P, Pakkala D (2015) Reference architecture and classification of technologies, products and services for big data systems. Big Data Res 2(4):166–186CrossRefGoogle Scholar
  32. 32.
    Ardagna C, Bellandi V, Bezzi M, Ceravolo P, Damiani E, Hebert C (June 2017) A model-driven methodology for big data analytics-as-a-service. In: Proceedings of BigData Congress, Honolulu. HI, USAGoogle Scholar
  33. 33.
    Labrinidis A, Jagadish HV (2012) Challenges and opportunities with big data. Proc VLDB Endow 5(12):2032–2033.
  34. 34.
    Ardagna CA, Bellandi V, Bezzi M, Ceravolo P, Damiani E, Hebert C (2018) Model-based big data analytics-as-a-service: take big data to the next level. IEEE Trans Serv Comput PP(99):1–1Google Scholar
  35. 35.
    Liao C, Squicciarini A (2015) Towards provenance-based anomaly detection in mapreduce. In: 15th IEEE/ACM international symposium on cluster, cloud and grid computing (CCGrid), vol 2015. IEEE, pp 647–656Google Scholar
  36. 36.
    Duggan J, Elmore AJ, Stonebraker M, Balazinska M, Howe B, Kepner J, Madden S, Maier D, Mattson T, Zdonik S (2015) The BigDAWG polystore system. SIGMOD Rec 44(2):11–16CrossRefGoogle Scholar
  37. 37.
    Sowmya R, Suneetha K (2017) Data mining with big data. In: 11th international conference on intelligent systems and control (ISCO). IEEE, pp 246–250Google Scholar
  38. 38.
    Zhou W, Mapara S, Ren Y, Li Y, Haeberlen A, Ives Z, Loo BT, Sherr M (2012) Distributed time-aware provenance. In: Proceedings of the VLDB endowment, vol 6, no 2. VLDB Endowment, pp 49–60Google Scholar
  39. 39.
    Akoush S, Sohan R, Hopper A (2013) Hadoopprov: towards provenance as a first class citizen in mapreduce. In: TaPPGoogle Scholar
  40. 40.
    Glavic B (2014) Big data provenance: challenges and implications for benchmarking. In: Rabl T, Poess M, Baru C, Jacobsen H-A (eds) Specifying big data benchmarks. Springer, Berlin, Heidelberg, pp 72–80Google Scholar
  41. 41.
    Berti-Equille L, Ba ML (2016) Veracity of big data: challenges of cross-modal truth discovery. J. Data Inf Qual 7(3):12:1–12:3Google Scholar
  42. 42.
    Kläs M, Putz W, Lutz T (2016) Quality evaluation for big data: a scalable assessment approach and first evaluation results. In: 2016 joint conference of the international workshop on software measurement and the international conference on software process and product measurement (IWSM-MENSURA). IEEE, pp 115–124Google Scholar
  43. 43.
    Daiber J, Jakob M, Hokamp C, Mendes PN (2013) Improving efficiency and accuracy in multilingual entity extraction. In: Proceedings of the 9th international conference on semantic systems. ACM, pp 121–124Google Scholar
  44. 44.
    Shin J, Wu S, Wang F, De Sa C, Zhang C, Ré C (July 2015) Incremental knowledge base construction using DeepDive. Proc VLDB Endow 8(11), 1310–1321. ISSN 2150-8097.
  45. 45.
    Chiticariu L, Krishnamurthy R, Li Y, Raghavan S, Reiss FR, Vaithyanathan S (2010) Systemt: an algebraic approach to declarative information extraction. In: Proceedings of the association for computational linguistics, pp 128–137Google Scholar
  46. 46.
    Fuhring P, Naumann F (2007) Emergent data quality annotation and visualization [Online]. Accessed 20 Feb 2018
  47. 47.
    Bondiombouy C, Kolev B, Levchenko O, Valduriez P (2016) Multistore big data integration with CloudMdsQL. In: Hameurlain A, Küng J, Wagner R, Chen Q (eds) Transactions on large-scale data-and knowledge-centered systems XXVIII: special issue on database-and expert-systems applications. Springer, Berlin, Heidelberg, pp 48–74.
  48. 48.
    Bergamaschi S, Beneventano D, Mandreoli F, Martoglia R, Guerra F, Orsini M, Po L, Vincini M, Simonini G, Zhu S , Gagliardelli L, Magnotta L (2018) From data integration to big data integration. In: Flesca S, Greco S, Masciari E, Saccà D (eds) A comprehensive guide through the Italian database research over the last 25 years. Springer, Cham, pp 43–59Google Scholar
  49. 49.
    Ramakrishnan R, Sridharan B, Douceur JR, Kasturi P, Krishnamachari-Sampath B, Krishnamoorthy K, Li P, Manu M, Michaylov S, Ramos R et al (2017) Azure data lake store: a hyperscale distributed file service for big data analytics. In: Proceedings of the 2017 ACM international conference on management of data. ACM, pp 51–63Google Scholar
  50. 50.
    Masseroli M, Kaitoua A, Pinoli P, Ceri S (2016) Modeling and interoperability of heterogeneous genomic big data for integrative processing and querying. Methods 111:3–11CrossRefGoogle Scholar
  51. 51.
    Scannapieco M, Virgillito A, Zardetto D (2013) Placing big data in official statistics: a big challenge? In: Proceedings of NTTS (new techniques and technologies for statistics), March 5–7, BrusselsGoogle Scholar
  52. 52.
    Gualtieri M, Hopkins B (2014) SQL-For-Hadoop: 14 capable solutions reviewed. ForresterGoogle Scholar
  53. 53.
    Liu H, Kumar TA, Thomas JP (2015) Cleaning framework for big data-object identification and linkage. In: IEEE international congress on Big Data (BigData Congress). IEEE, pp 215–221Google Scholar
  54. 54.
    Gulzar MA, Interlandi M, Han X, Li M, Condie T, Kim M (2017) Automated debugging in data-intensive scalable computing. In: Proceedings of the 2017 symposium on cloud computing, series SoCC ’17. ACM, New York, pp 520–534 [Online].
  55. 55.
    de Wit T (2017) Using AIS to make maritime statistics. In: Proceedings of NTTS (New techniques and technologies for statistics), March 14–16, BrusselsGoogle Scholar
  56. 56.
    Zardetto D, Scannapieco M, Catarci T (2010) Effective automated object matching. In: Proceedings of the 26th international conference on data engineering, ICDE 2010, March 1-6, Long Beach, California, USA, pp 757–768Google Scholar
  57. 57.
    Xin RS, Gonzalez JE, Franklin MJ, Stoica I (2013) Graphx: a resilient distributed graph system on spark. In: First international workshop on graph data management experiences and systems, GRADES 2013, co-loated with SIGMOD/PODS, New York, NY, USA, June 24, p 2 [Online]. Accessed 20 Feb 2018
  58. 58.
    Junghanns M, Petermann A, Gómez K, Rahm E (2015) GRADOOP: scalable graph data management and analytics with hadoop. CoRR [Online]. arxiv:1506.00548
  59. 59.
    Yu J, Wu J, Sarwat M (2015) Geospark: a cluster computing framework for processing large-scale spatial data. In: Proceedings of the 23rd SIGSPATIAL international conference on advances in geographic information systems, Bellevue, WA, USA, November 3–6, pp 70:1–70:4 [Online].
  60. 60.
    You S, Zhang J, Gruenwald L (2015) Large-scale spatial join query processing in cloud. In: 31st IEEE international conference on data engineering workshops, ICDE workshops 2015, Seoul, South Korea, April 13–17, pp 34–41. [Online].
  61. 61.
    Saleh O, Hagedorn S, Sattler K (2015) Complex event processing on linked stream data. Datenbank Spektrum 15(2):119–129CrossRefGoogle Scholar
  62. 62.
    Kornacker M, Behm A, Bittorf V, Bobrovytsky T, Ching C, Choi A, Erickson J, Grund M, Hecht D, Jacobs M, Joshi I, Kuff L, Kumar D, Leblang A, Li N, Pandis I, Robinson H, Rorke D, Rus S, Russell J, Tsirogiannis D, Wanderman-Milne S, Yoder M (2015) Impala: a modern, open-source SQL engine for hadoop. In: CIDR 2015, seventh biennial conference on innovative data systems research, Asilomar, CA, USA, January 4–7, Online proceedings, 2015 [Online].
  63. 63.
    Costea A, Ionescu A, Raducanu B, Switakowski M, Bârca C, Sompolski J, Luszczak A, Szafranski M, de Nijs G, Boncz PA (2016) Vectorh: taking sql-on-hadoop to the next level. In: Proceedings of the 2016 international conference on management of data, SIGMOD conference 2016, San Francisco, CA, USA, June 26–July 01, pp 1105–1117 [Online].
  64. 64.
    Schätzle A, Przyjaciel-Zablocki M, Skilevic S, Lausen G (2016) S2RDF: RDF querying with SPARQL on spark. PVLDB 9(10):804–815 [Online].
  65. 65.
    Cudré-Mauroux P, Enchev I, Fundatureanu S, Groth PT, Haque A, Harth A, Keppmann FL, Miranker DP, Sequeda J, Wylot M (2013) Nosql databases for RDF: an empirical evaluation. In: The semantic Web—ISWC 2013—12th international semantic web conference, Sydney, NSW, Australia, October 21–25, Proceedings, Part II, 2013, pp 310–325 [Online].
  66. 66.
    Appice A, Ceci M, Malerba D (2018) Relational data mining in the era of big data. In: Flesca S, Greco S, Masciari E, Saccà D (eds) A comprehensive guide through the Italian database research over the last 25 years. Springer, cham, pp 323–339.
  67. 67.
    Khare S, An K, Gokhale AS, Tambe S, Meena A (2015) Reactive stream processing for data-centric publish/subscribe. In: Proceedings of the 9th international conference on distributed event-based systems (DEBS). ACM, pp 234–245Google Scholar
  68. 68.
    Poggi F, Rossi D, Ciancarini P, Bompani L (2016) Semantic run-time models for self-adaptive systems: a case study. In: 2016 IEEE 25th international conference on enabling technologies: infrastructure for collaborative enterprises (WETICE). IEEE, pp 50–55Google Scholar
  69. 69.
    Um J-H, Lee S, Kim T-H, Jeong C-H, Song S-K, Jung H (2016) Semantic complex event processing model for reasoning research activities. Neurocomputing 209:39–45CrossRefGoogle Scholar
  70. 70.
    Giese M, Soylu A, Vega-Gorgojo G, Waaler A, Haase P, Jiménez-Ruiz E, Lanti D, Rezk M, Xiao G, Özçep Ö et al (2015) Optique: zooming in on big data. Computer 48(3):60–67CrossRefGoogle Scholar
  71. 71.
    Unece big data quality framework [Online]. Accessed 20 Feb 2018
  72. 72.
    Severin J, Lizio M, Harshbarger J, Kawaji H, Daub CO, Hayashizaki Y, Bertin N, Forrest AR, Consortium F et al (2014) Interactive visualization and analysis of large-scale sequencing datasets using zenbu. Nat Biotechnol 32(3):217–219Google Scholar
  73. 73.
    Mezghani E, Exposito E, Drira K, Da Silveira M, Pruski C (2015) A semantic big data platform for integrating heterogeneous wearable data in healthcare. J Med Syst 39(12):185CrossRefGoogle Scholar
  74. 74.
    Ginsberg J, Mohebbi M, Patel R, Brammer L, Smolinski M, Brilliant L (2009) Detecting influenza epidemics using search engine query data. Nature 457(7232):1012–1014CrossRefGoogle Scholar
  75. 75.
    Sculley D, Holt G, Golovin D, Davydov E, Phillips T, Ebner D, Chaudhary V, Young M, Crespo J-F, Dennison D (2015) Hidden technical debt in machine learning systems. In: Cortes C, Lawrence ND, Lee DD, Sugiyama M, Garnett R (eds) Advances in neural information processing systems 28, Curran Associates, Inc., pp 2503–2511.
  76. 76.
    Chang F, Dean J, Ghemawat S, Hsieh WC, Wallach DA, Burrows M, Chandra T, Fikes A, Gruber RE (2008) Bigtable: a distributed storage system for structured data. ACM Trans Comput Syst 26(2):4:1–4:26.
  77. 77.
    Suriarachchi I, Plale B (2016) Provenance as essential infrastructure for data lakes. In: Proceedings of international workshop on provenance and annotation of data and processes. LNCS 9672Google Scholar
  78. 78.
    Terrizzano I, Schwarz P, Roth M, Colino JE (2015) Data wrangling: the challenging journey from the wild to the lake. In: Proceedings of conference on innovative data systems research (CIDR)Google Scholar
  79. 79.
    Teradata (2014) Putting the data lake to work: a guide to best practices. Accessed on 20 June 2017 [Online]
  80. 80.
    Batini C, Scannapieco M (2016) Data and information quality—dimensions. Principles and techniques, series. In: Data-centric systems and applications. SpringerGoogle Scholar
  81. 81.
    Agrawal D, Bernstein P, Bertino E, Davidson S, Dayal U, Franklin M, Gehrke J, Haas L, Halevy A, Han J et al (2011) Challenges and opportunities with big data. Purdue University, Cyber Center Technical ReportsGoogle Scholar
  82. 82.
    Liu M, Wang Q (2016) Rogas: a declarative framework for network analytics. Proceedings of international conference on very large data bases (VLDB) 9(13):1561–1564Google Scholar
  83. 83.
    Hasan O, Habegger B, Brunie L, Bennani N, Damiani E (2013) A discussion of privacy challenges in user profiling with big data techniques: the EEXCESS use case. In: IEEE international congress on Big Data (BigData Congress). IEEE, pp 25–30Google Scholar
  84. 84.
    Doan A, Ardalan A, Ballard JR, Das S, Govind Y, Konda P, Li H, Paulson E, Zhang H et al (2017) Toward a system building agenda for data integration. arXiv preprint arXiv:1710.00027
  85. 85.
    Flood M, Grant J, Luo H, Raschid L, Soboroff I, Yoo K (2016) Financial entity identification and information integration (feiii) challenge: the report of the organizing committee. In: Proceedings of the second international workshop on data science for macro-modeling. ACM, p 1Google Scholar
  86. 86.
    Haryadi AF, Hulstijn J, Wahyudi A, Van Der Voort H, Janssen M (2016) Antecedents of big data quality: an empirical examination in financial service organizations. In: IEEE international conference on Big Data (Big Data). IEEE, pp 116–121Google Scholar
  87. 87.
    Benedetti F, Beneventano D, Bergamaschi S (2016) Context semantic analysis: a knowledge-based technique for computing inter-document similarity. Springer International Publishing, Berlin, pp 164–178Google Scholar
  88. 88.
    Ford E, Carroll JA, Smith HE, Scott D, Cassell JA (2016) Extracting information from the text of electronic medical records to improve case detection: a systematic review. J Am Med Inform Assoc 23(5):1007–1015.
  89. 89.
    Haas D, Krishnan S, Wang J, Franklin MJ, Wu E (2015) Wisteria: nurturing scalable data cleaning infrastructure. Proc VLDB Endow 8(12):2004–2007.
  90. 90.
    Cabot J, Toman D, Parsons J, Pastor O, Wrembel R (2016) Big data and conceptual models: are they mutually compatible? In: International conference on conceptual modeling (ER), panel discussion [Online]. Accessed 20 Feb 2018
  91. 91.
    Voigt M, Pietschmann S, Grammel L, Meißner K (2012) Context-aware recommendation of visualization components. In: Proceedings of the 4th international conference on information, process, and knowledge management. Citeseer, pp 101–109Google Scholar
  92. 92.
    Soylu A, Giese M, Jimenez-Ruiz E, Kharlamov E, Zheleznyakov D, Horrocks I (2013) OptiqueVQS: towards an ontology-based visual query system for big data. In: Proceedings of the fifth international conference on management of emergent digital ecosystems, series, MEDES ’13. ACM, New York, pp 119–126 [Online].
  93. 93.
    McKenzie G, Janowicz K, Gao S, Yang J-A, Hu Y (2015) POI pulse: a multi-granular, semantic signature-based information observatory for the interactive visualization of big geosocial data. Cartographica Int J Geogr Inf Geovis 50(2):71–85Google Scholar
  94. 94.
    Habib MB, Van Keulen (2016) TwitterNEED: a hybrid approach for named entity extraction and disambiguation for tweet. Nat Lang Eng 22(3):423–456.
  95. 95.
    Magnani M, Montesi D (2010) A survey on uncertainty management in data integration. JDIQ 2(1):5:1–5:33.
  96. 96.
    van Keulen M (2012) Managing uncertainty: the road towards better data interoperability. Inf Technol: IT 54(3):138–146.
  97. 97.
    Andrews P, Kalro A, Mehanna H, Sidorov A (2016) Productionizing machine learning pipelines at scale. In: Machine learning systems workshop at ICMLGoogle Scholar
  98. 98.
    Sparks ER, Venkataraman S, Kaftan T, Franklin MJ, Recht B (2017) Keystoneml: optimizing pipelines for large-scale advanced analytics. In: 2017 IEEE 33rd international conference on data engineering (ICDE), pp 535–546Google Scholar
  99. 99.
    Meng X, Bradley J, Yavuz B, Sparks E, Venkataraman S, Liu D, Freeman J, Tsai D, Amde M, Owen S et al (2016) Mllib: machine learning in apache spark. J Mach Learn Res 17(1):1235–1241MathSciNetzbMATHGoogle Scholar
  100. 100.
    Böse J-H, Flunkert V, Gasthaus J, Januschowski T, Lange D, Salinas D, Schelter S, Seeger M, Wang Y (2017) Probabilistic demand forecasting at scale. Proc VLDB Endow 10(12):1694–1705CrossRefGoogle Scholar
  101. 101.
    Baylor D, Breck E, Cheng H-T, Fiedel N, Foo CY, Haque Z, Haykal S, Ispir M, Jain V, Koc L et al (2017) Tfx: a tensorflow-based production-scale machine learning platform. In: Proceedings of the 23rd ACM SIGKDD international conference on knowledge discovery and data mining. ACM, pp 1387–1395Google Scholar
  102. 102.
    Ardagna C, Ceravolo P, Cota GL, Kiani MM, Damiani E (2017) What are my users looking for when preparing a big data campaign. In: IEEE international congress on Big Data (BigData Congress). IEEE, pp 201–208Google Scholar
  103. 103.
    Palmér C (2017) Modelling eu directive 2016/680 using enterprise architectureGoogle Scholar
  104. 104.
    Atzmueller M, Kluegl P, Puppe F (2008) Rule-based information extraction for structured data acquisition using textmarker. In: Proceedings of LWA, pp 1–7Google Scholar
  105. 105.
    Settles B (2011) Closing the loop: fast, interactive semi-supervised annotation with queries on features and instances. In: Proceedings of EMNLP.ACL, pp 1467–1478Google Scholar
  106. 106.
    Müller C, Strube M (2006) Multi-level annotation of linguistic data with MMAX2. Corpus Technol Lang Pedag New Resour New Tools New Methods 3:197–214Google Scholar
  107. 107.
    Stenetorp P, Pyysalo S, Topić G, Ohta T, Ananiadou S, Tsujii J (2012) Brat: a web-based tool for NLP-assisted text annotation. In: Proceedings of the demonstrations at the 13th conference of the European chapter of the association for computational linguistics. Association for Computational Linguistics, Stroudsburg, PA, USA, pp 102–107Google Scholar
  108. 108.
    Auer S, Bizer C, Kobilarov G, Lehmann J, Cyganiak R, Ives Z (2007) DBpedia: a nucleus for a web of open data. In: Aberer K, Choi K-S, Noy N, Allemang D, Lee K, Nixon L, Golbeck J, Mika P, Maynard D, Mizoguchi R, Schreiber G, Cudré-Mauroux P (eds) The semantic web. Springer, Berlin, Heidelberg, pp 722–735CrossRefGoogle Scholar
  109. 109.
    Bizer C, Heath T, Berners-Lee T (2009) Linked data–the story so far. Int J Semant Web Inf Syst: IJSWIS 5(3):1–22CrossRefGoogle Scholar
  110. 110.
    Benikova D, Biemann C (2016) Semreldata ? Multilingual contextual annotation of semantic relations between nominals: dataset and guidelines. In: LRECGoogle Scholar
  111. 111.
    Lu A, Wang W, Bansal M, Gimpel K, Livescu K (2015) Deep multilingual correlation for improved word embeddings. In: NAACL-HLTGoogle Scholar
  112. 112.
    Pecina P, Toral A, Way A, Papavassiliou V, Prokopidis P, Giagkou M (2011) Towards using web-crawled data for domain adaptation in statistical machine translation. In: The 15th conference of the European association for machine translation (EAMT)Google Scholar
  113. 113.
    Yasseri T, Spoerri A, Graham M, Kertész J (2014) The most controversial topics in Wikipedia: a multilingual and geographical analysis. In: Fichman P, Hara N (eds) Global Wikipedia: international and cross-cultural issues in online collaboration. Rowman & Littlefield Publishers Inc, Lanham, pp 25–48Google Scholar
  114. 114.
    Micher JC (2012) Improving domain-specific machine translation by constraining the language model. Army Research Laboratory, Technical Report of ARL-TN-0492Google Scholar
  115. 115.
    D’Haen J, den Poel DV, Thorleuchter D, Benoit D (2016) Integrating expert knowledge and multilingual web crawling data in a lead qualification system. Decis Support Syst 82:69–78CrossRefGoogle Scholar
  116. 116.
    Helou MA, Palmonari M, Jarrar M (2016) Effectiveness of automatic translations for cross-lingual ontology mapping. J Artif Int Res 55(1):165–208MathSciNetGoogle Scholar
  117. 117.
    Furno D, Loia V, Veniero M, Anisetti M, Bellandi V, Ceravolo P, Damiani E (2011) Towards an agent-based architecture for managing uncertainty in situation awareness. In: 2011 IEEE symposium on intelligent agent (IA). IEEE, pp 1–6Google Scholar
  118. 118.
    Dalvi N, Ré C, Suciu D (2009) Probabilistic databases: diamonds in the dirt. Commun ACM 52(7):86–94.
  119. 119.
    Ceravolo P, Damiani E, Fugazza C (2007) Trustworthiness-related uncertainty of semantic web-style metadata: a possibilistic approach. In: ISWC workshop on uncertainty reasoning for the semantic web (URSW), vol 327 [Sn], pp 131–132Google Scholar
  120. 120.
    Panse F, van Keulen M, Ritter N (2013) Indeterministic handling of uncertain decisions in deduplication. JDIQ 4(2):91–925.
  121. 121.
    Abedjan Z, Golab L, Naumann F (2015) Profiling relational data: a survey. VLDB J 24(4):557–581.
  122. 122.
    Papenbrock T, Ehrlich J, Marten J, Neubert T, Rudolph J-P, Schönberg M, Zwiener J, Naumann F (2015) Functional dependency discovery: an experimental evaluation of seven algorithms. Proc VLDB Endow 8(10):1082–1093CrossRefGoogle Scholar
  123. 123.
    Chen CLP, Zhang C (2014) Data-intensive applications, challenges, techniques and technologies: a survey on big data. Inf Sci 275:314–347CrossRefGoogle Scholar
  124. 124.
    Naumann F (2014) Data profiling revisited. SIGMOD Rec 42(4):40–49CrossRefGoogle Scholar
  125. 125.
    Ahmadov A, Thiele M, Eberius J, Lehner W, Wrembel R (2015) Towards a hybrid imputation approach using web tables. In: IEEE/ACM international symposium on big data computing (BDC), pp 21–30Google Scholar
  126. 126.
    Ahmadov A, Thiele M, Lehner W, Wrembel R (2017) Context similarity for retrieval-based imputation. In: International symposium on foundations and applications of big data analytics (FAB) (to appear) Google Scholar
  127. 127.
    Li Z, Sharaf MA, Sitbon L, Sadiq S, Indulska M, Zhou X (2014) A web-based approach to data imputation. World Wide Web 17(5):873–897CrossRefGoogle Scholar
  128. 128.
    Miao X, Gao Y, Guo S, Liu W (2018) Incomplete data management: a survey. Front Comput Sci 12(1):4–25.
  129. 129.
    Wiederhold G (1992) Mediators in the architecture of future information systems. IEEE Comput 25(3):38–49CrossRefGoogle Scholar
  130. 130.
    Tonon A, Demartini G, Cudré-Mauroux P (2012) Combining inverted indices and structured search for ad-hoc object retrieval. In: The 35th international ACM SIGIR conference on research and development in information retrieval, SIGIR ’12, Portland, OR, USA, August 12-16, pp 125–134 [Online].
  131. 131.
    Catasta M, Tonon A, Demartini G, Ranvier J, Aberer K, Cudré-Mauroux P (2014) B-hist: entity-centric search over personal web browsing history. J Web Semant 27:19–25 [Online].
  132. 132.
    Flood M, Jagadish HV, Raschid L (2016) Big data challenges and opportunities in financial stability monitoring. Financ Stab Rev 20:129–142Google Scholar
  133. 133.
    Ni LM, Tan H, Xiao J (2016) Rethinking big data in a networked world. Front Comput Sci 10(6):965–967CrossRefGoogle Scholar
  134. 134.
    Kolb L, Thor A, Rahm E (2012) Dedoop: efficient deduplication with hadoop. PVLDB 5(12):1878–1881Google Scholar
  135. 135.
    Ghemawat S, Gobioff H, Leung S (2003) The google file system. In: Proceedings of the 19th ACM symposium on operating systems principles 2003, SOSP 2003, Bolton Landing, NY, USA, October 19–22, pp 29–43 [Online].
  136. 136.
    Dittrich J, Quiané-Ruiz J, Richter S, Schuh S, Jindal A, Schad J (2012) Only aggressive elephants are fast elephants. PVLDB 5(11):1591–1602 [Online].
  137. 137.
    Carbone P, Katsifodimos A, Ewen S, Markl V, Haridi S, Tzoumas K (2015) Apache flink™: stream and batch processing in a single engine. IEEE Data Eng Bull 38(4):28–38 [Online].
  138. 138.
    Zaharia M, Chowdhury M, Das T, Dave A, Ma J, McCauly M, Franklin MJ, Shenker S, Stoica I (2012) Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing. In: Proceedings of the 9th USENIX symposium on networked systems design and implementation, NSDI 2012, San Jose, CA, USA, April 25–27, pp 15–28 [Online]. Accessed 20 Feb 2018
  139.
    Armbrust M, Xin RS, Lian C, Huai Y, Liu D, Bradley JK, Meng X, Kaftan T, Franklin MJ, Ghodsi A, Zaharia M (2015) Spark SQL: relational data processing in spark. In: Proceedings of the 2015 ACM SIGMOD international conference on management of data, Melbourne, Victoria, Australia, May 31–June 4, pp 1383–1394
  140.
    Hagedorn S, Götze P, Sattler K (2017) The STARK framework for spatio-temporal data analytics on spark. In: Datenbanksysteme für Business, Technologie und Web (BTW 2017), 17. Fachtagung des GI-Fachbereichs "Datenbanken und Informationssysteme" (DBIS), 6.–10. März 2017, Stuttgart, Germany, Proceedings, pp 123–142
  141.
    Meng X, Bradley JK, Yavuz B, Sparks ER, Venkataraman S, Liu D, Freeman J, Tsai DB, Amde M, Owen S, Xin D, Xin R, Franklin MJ, Zadeh R, Zaharia M, Talwalkar A (2016) Mllib: machine learning in apache spark. J Mach Learn Res 17:34:1–34:7
  142.
    Abouzeid A, Bajda-Pawlikowski K, Abadi DJ, Rasin A, Silberschatz A (2009) Hadoopdb: an architectural hybrid of mapreduce and DBMS technologies for analytical workloads. PVLDB 2(1):922–933
  143.
    Du J, Wang H, Ni Y, Yu Y (2012) Hadooprdf: a scalable semantic data analytical engine. In: Intelligent computing theories and applications—8th international conference, ICIC 2012, Huangshan, China, July 25–29. Proceedings, pp 633–641
  144.
    Schätzle A, Przyjaciel-Zablocki M, Neu A, Lausen G (2014) Sempala: interactive SPARQL query processing on hadoop. In: The semantic Web—ISWC 2014—13th international semantic web conference, Riva del Garda, Italy, October 19–23. Proceedings, Part I, pp 164–179
  145.
    Ladwig G, Harth A (2011) Cumulusrdf: linked data management on nested key-value stores. In: Proceedings of the 7th international workshop on scalable semantic web knowledge base systems (SSWS2011) at the 10th international semantic web conference (ISWC2011), October 2011
  146.
    Corbellini A, Mateos C, Zunino A, Godoy D, Schiaffino S (2017) Persisting big-data: the NoSQL landscape. Inf Syst 63:1–23
  147.
    Barbará D (2002) Requirements for clustering data streams. SIGKDD Explor Newsl 3(2):23–27
  148.
    Gama J, Aguilar-Ruiz J (2007) Knowledge discovery from data streams. Intell Data Anal 11(1):1–2
  149.
    Meir-Huber M, Köhler M (2014) Big data in Austria. Austrian Ministry for Transport, Innovation and Technology (BMVIT), Technical report
  150.
    Nural MV, Peng H, Miller JA (2017) Using meta-learning for model type selection in predictive big data analytics. In: 2017 IEEE international conference on Big Data (Big Data). IEEE, pp 2027–2036
  151.
    Cunha T, Soares C, de Carvalho AC (2018) Metalearning and recommender systems: a literature review and empirical study on the algorithm selection problem for collaborative filtering. Inf Sci 423:128–144
  152.
    Blair G, Bencomo N, France R (2009) Models@ run.time. Computer 42(10):22–27
  153.
    Schmid S, Gerostathopoulos I, Prehofer C, Bures T (2017) Self-adaptation based on big data analytics: a model problem and tool. In: IEEE/ACM 12th international symposium on software engineering for adaptive and self-managing systems (SEAMS). IEEE, pp 102–108
  154.
    Hartmann T, Moawad A, Fouquet F, Nain G, Klein J, Traon YL (2015) Stream my models: reactive peer-to-peer distributed models@run.time. In: Proceedings of the 18th international conference on model driven engineering languages and systems (MoDELS). ACM/IEEE
  155.
    van der Aalst W, Damiani E (2015) Processes meet big data: connecting data science with process science. IEEE Trans Serv Comput 8(6):810–819
  156.
    Luckham DC (2001) The power of events: an introduction to complex event processing in distributed enterprise systems. Addison-Wesley, Boston
  157.
    van der Aalst WMP (2012) Process mining. Commun ACM 55(8):76–83
  158.
    van der Aalst WMP, Adriansyah A, de Medeiros AKA, Arcieri F, Baier T, Blickle T, Bose RPJC, van den Brand P, Brandtjen R, Buijs JCAM, Burattin A, Carmona J, Castellanos M, Claes J, Cook J, Costantini N, Curbera F, Damiani E, de Leoni M, Delias P, van Dongen BF, Dumas M, Dustdar S, Fahland D, Ferreira DR, Gaaloul W, van Geffen F, Goel S, Günther CW, Guzzo A, Harmon P, ter Hofstede AHM, Hoogland J, Ingvaldsen JE, Kato K, Kuhn R, Kumar A, Rosa ML, Maggi FM, Malerba D, Mans RS, Manuel A, McCreesh M, Mello P, Mendling J, Montali M, Nezhad HRM, zur Muehlen M, Munoz-Gama J, Pontieri L, Ribeiro J, Rozinat A, Pérez HS, Pérez RS, Sepúlveda M, Sinur J, Soffer P, Song M, Sperduti A, Stilo G, Stoel C, Swenson KD, Talamo M, Tan W, Turner C, Vanthienen J, Varvaressos G, Verbeek E, Verdonk M, Vigo R, Wang J, Weber B, Weidlich M, Weijters T, Wen L, Westergaard M, Wynn MT (2011) Process mining manifesto. In: Proceedings of the business process management workshops (BPM). Springer, pp 169–194
  159.
    Dumas M, van der Aalst WMP, ter Hofstede AHM (2005) Process-aware information systems: bridging people and software through process technology. Wiley, Hoboken
  160.
    van Dongen BF, van der Aalst WMP (2005) A meta model for process mining data. In: Proceedings of the international workshop on enterprise modelling and ontologies for interoperability (EMOI) co-located with the 17th conference on advanced information systems engineering (CAiSE)
  161.
    Al-Ali H, Damiani E, Al-Qutayri M, Abu-Matar M, Mizouni R (2016) Translating bpmn to business rules. In: International symposium on data-driven process discovery and analysis. Springer, pp 22–36
  162.
    Hripcsak G, Rothschild AS (2005) Agreement, the f-measure, and reliability in information retrieval. J Am Med Inform Assoc 12(3):296–298
  163.
    Gilson O, Silva N, Grant PW, Chen M (2008) From web data to visualization via ontology mapping. Comput Graph Forum 27(3):959–966
  164.
    Nazemi K, Burkhardt D, Breyer M, Stab C, Fellner DW (2010) Semantic visualization cockpit: adaptable composition of semantics-visualization techniques for knowledge-exploration. In: International association of online engineering (IAOE): international conference interactive computer aided learning, pp 163–173
  165.
    Nazemi K, Breyer M, Forster J, Burkhardt D, Kuijper A (2011) Interacting with semantics: a user-centered visualization adaptation based on semantics data. In: Smith MJ, Salvendy G (eds) Human interface and the management of information. Interacting with information. Springer, Berlin, Heidelberg, pp 239–248
  166.
    Melo C, Mikheev A, Le-Grand B, Aufaure M-A (2012) Cubix: a visual analytics tool for conceptual and semantic data. In: IEEE 12th international conference on data mining workshops (ICDMW). IEEE, pp 894–897
  167.
    Fluit C, Sabou M, Van Harmelen F (2006) Ontology-Based information visualization: toward semantic web applications. In: Geroimenko V, Chen C (eds) Visualizing the semantic Web: XML-Based internet and information visualization. Springer, London, pp 45–58
  168.
    Krivov S, Williams R, Villa F (2007) Growl: a tool for visualization and editing of owl ontologies. Web Semant Sci Serv Agents World Wide Web 5(2):54–57
  169.
    Chu D, Sheets DA, Zhao Y, Wu Y, Yang J, Zheng M, Chen G (2014) Visualizing hidden themes of taxi movement with semantic transformation. In: Visualization symposium (PacificVis), IEEE pacific. IEEE, pp 137–144
  170.
    Catarci T, Scannapieco M, Console M, Demetrescu C (2017) My (fair) big data. In: 2017 IEEE international conference on Big Data, BigData 2017, Boston, MA, USA, December 11–14, pp 2974–2979
  171.
    Oracle (2015) The five most common big data integration mistakes to avoid. White paper. Accessed 20 June 2017
  172.
    Ali SMF, Wrembel R (2017) From conceptual design to performance optimization of ETL workflows: current state of research and open problems. VLDB J
  173.
    Olston C, Reed B, Srivastava U, Kumar R, Tomkins A (2008) Pig latin: a not-so-foreign language for data processing. In: Proceedings of the ACM SIGMOD international conference on management of data, SIGMOD, Vancouver, BC, Canada, June 10–12, pp 1099–1110
  174.
    Venkataraman S, Yang Z, Liu D, Liang E, Falaki H, Meng X, Xin R, Ghodsi A, Franklin MJ, Stoica I, Zaharia M (2016) Sparkr: scaling R programs with spark. In: Proceedings of the 2016 international conference on management of data, SIGMOD conference 2016, San Francisco, CA, USA, June 26–July 01, pp 1099–1104
  175.
    Dinter B, Gluchowski P, Schieder C (2015) A stakeholder lens on metadata management in business intelligence and big data: results of an empirical investigation
  176.
    Yazici A, George R (1999) Fuzzy database modeling. Studies in fuzziness and soft computing, vol 26. Physica Verlag. ISBN 978-3-7908-1171-1
  177.
    Shafer G (1976) A mathematical theory of evidence. Princeton University Press, Princeton
  178.
    Wanders B, van Keulen M, van der Vet P (2015) Uncertain groupings: probabilistic combination of grouping data. In: Proceedings of DEXA. LNCS, vol 9261. Springer, pp 236–250
  179.
    Huang J, Antova L, Koch C, Olteanu D (2009) MayBMS: a probabilistic database management system. In: Proceedings of SIGMOD. ACM, pp 1071–1074
  180.
    Thiele M, Fischer U, Lehner W (2009) Partition-based workload scheduling in living data warehouse environments. Inf Syst 34(4–5):382–399
  181.
    Angelini M, Santucci G (2013) Modeling incremental visualizations. In: Proceedings of the EuroVis workshop on visual analytics (EuroVA13), pp 13–17
  182.
    Schulz H-J, Angelini M, Santucci G, Schumann H (2016) An enhanced visualization process model for incremental visualization. IEEE Trans Vis Comput Graph 22(7):1830–1842
  183.
    Stolper CD, Perer A, Gotz D (2014) Progressive visual analytics: user-driven visual exploration of in-progress analytics. IEEE Trans Vis Comput Graph 20(12):1653–1662
  184.
    Fekete J-D, Primet R (2016) Progressive analytics: a computation paradigm for exploratory data analysis. arXiv preprint arXiv:1607.05162
  185.
    Shneiderman B, Aris A (2006) Network visualization by semantic substrates. IEEE Trans Vis Comput Graph 12(5):733–740
  186.
    Wu D, Greer MJ, Rosen DW, Schaefer D (2013) Cloud manufacturing: strategic vision and state-of-the-art. J Manuf Syst 32(4):564–579
  187.
    Martin KE (2015) Ethical issues in the big data industry. MIS Q Exec 14:2

Copyright information

© Springer-Verlag GmbH Germany, part of Springer Nature 2018

Authors and Affiliations

  • Paolo Ceravolo (1, Email author)
  • Antonia Azzini (2)
  • Marco Angelini (3)
  • Tiziana Catarci (3)
  • Philippe Cudré-Mauroux (4)
  • Ernesto Damiani (5)
  • Alexandra Mazak (6)
  • Maurice Van Keulen (7)
  • Mustafa Jarrar (8)
  • Giuseppe Santucci (3)
  • Kai-Uwe Sattler (9)
  • Monica Scannapieco (10)
  • Manuel Wimmer (6)
  • Robert Wrembel (11)
  • Fadi Zaraket (12)

  1. Università Degli Studi di Milano, Milan, Italy
  2. Consortium for the Technology Transfer, C2T, Milan, Italy
  3. SAPIENZA University of Rome, Rome, Italy
  4. University of Fribourg, Fribourg, Switzerland
  5. EBTIC, Khalifa University, Abu Dhabi, UAE
  6. Vienna University of Technology, Wien, Austria
  7. University of Twente, Enschede, The Netherlands
  8. Birzeit University, Birzeit, Palestine
  9. TU Ilmenau, Ilmenau, Germany
  10. Directorate for Methodology and Statistical Design, Italian National Institute of Statistics (Istat), Rome, Italy
  11. Poznan University of Technology, Poznan, Poland
  12. American University of Beirut, Beirut, Lebanon
