Abstract
The continuous improvement in connectivity, storage and data processing capabilities allow access to a data deluge from the big data generated on open, private, social and IoT (Internet of Things) data islands. Data Lakes introduced as a storage repository to organize this raw data in its native format until it is needed. The rationale behind a Data Lake is to store raw data and let the data analyst decide how to curate them later. Previously, we introduced the novel notion of Knowledge Lake, i.e., a contextualized Data Lake, and proposed algorithms to turn the raw data (stored in Data Lakes) into contextualized data and knowledge using extraction, enrichment, annotation, linking and summarization techniques. In this tutorial, we introduce Intelligent Knowledge Lakes to facilitate linking Artificial Intelligence (AI) and Data Analytics. This will enable AI applications to learn from contextualized data and use them to automate business processes and develop cognitive assistance for facilitating the knowledge intensive processes or generating new rules for future business analytics.
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
Similar content being viewed by others
Notes
- 1.
- 2.
- 3.
- 4.
An entity E is represented as a data object that exists separately and has a unique identity. Entities are described by a set of attributes.
References
Alsubaiee, S., et al.: Storage management in AsterixDB. Proc. VLDB Endow. 7(10), 841–852 (2014)
Amouzgar, F., Beheshti, A., Ghodratnama, S., Benatallah, B., Yang, J., Sheng, Q.Z.: iSheets: a spreadsheet-based machine learning development platform for data-driven process analytics. In: Liu, X., et al. (eds.) ICSOC 2018. LNCS, vol. 11434, pp. 453–457. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-17642-6_43
Beheshti, A., Benatallah, B., Motahari-Nezhad, H.R.: ProcessAtlas: a scalable and extensible platform for business process analytics. Softw.: Pract. Exp. 48(4), 842–866 (2018)
Beheshti, A., Benatallah, B., Nouri, R., Chhieng, V.M., Xiong, H., Zhao, X.: CoreDB: a data lake service. In: Proceedings of the 2017 ACM on Conference on Information and Knowledge Management, CIKM 2017, Singapore, 06–10 November 2017, pp. 2451–2454 (2017)
Beheshti, A., Benatallah, B., Nouri, R., Tabebordbar, A.: CoreKG: a knowledge lake service. PVLDB 11(12), 1942–1945 (2018)
Beheshti, A., Benatallah, B., Tabebordbar, A., Motahari-Nezhad, H.R., Barukh, M.C., Nouri, R.: DataSynapse: a social data curation foundry. Distrib. Parallel Databases 37(3), 351–384 (2019)
Beheshti, A., Moraveji-Hashemi, V., Yakhchi, S., Motahari-Nezhad, H.R., Ghafari, S.M., Yang, J.: personality2vec: enabling the analysis of behavioral disorders in social networks. In: Proceedings of the 13th ACM International Conference on Web Search and Data Mining, WSDM 2020, Houston, Texas, USA (2020)
Beheshti, A., et al.: iProcess: enabling IoT platforms in data-driven knowledge-intensive processes. In: Weske, M., Montali, M., Weber, I., vom Brocke, J. (eds.) BPM 2018. LNBIP, vol. 329, pp. 108–126. Springer, Cham (2018). https://doi.org/10.1007/978-3-319-98651-7_7
Beheshti, A., Vaghani, K., Benatallah, B., Tabebordbar, A.: CrowdCorrect: a curation pipeline for social data cleansing and curation. In: Mendling, J., Mouratidis, H. (eds.) CAiSE 2018. LNBIP, vol. 317, pp. 24–38. Springer, Cham (2018). https://doi.org/10.1007/978-3-319-92901-9_3
Beheshti, S., Benatallah, B., Motahari-Nezhad, H.R.: Galaxy: a platform for explorative analysis of open data sources. In: Proceedings of the 19th International Conference on Extending Database Technology, EDBT 2016, Bordeaux, France, 15–16 March 2016, pp. 640–643 (2016)
Beheshti, S., Benatallah, B., Motahari-Nezhad, H.R.: Scalable graph-based OLAP analytics over process execution data. Distrib. Parallel Databases 34(3), 379–423 (2016)
Beheshti, S., et al.: Process Analytics - Concepts and Techniques for Querying and Analyzing Process Data. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-25037-3
Beheshti, S., Benatallah, B., Venugopal, S., Ryu, S.H., Motahari-Nezhad, H.R., Wang, W.: A systematic review and comparative analysis of cross-document coreference resolution methods and tools. Computing 99(4), 313–349 (2017)
Beheshti, S., Motahari Nezhad, H.R., Benatallah, B.: Temporal provenance model (TPM): model and query language. CoRR, abs/1211.5009 (2012)
Beheshti, S., Tabebordbar, A., Benatallah, B., Nouri, R.: On automating basic data curation tasks. In: Proceedings of the 26th International Conference on World Wide Web Companion, Perth, Australia, 3–7 April 2017, pp. 165–169 (2017)
Berners-Lee, T.: Designing the web for an open society. In: Proceedings of the 20th International Conference on World Wide Web, WWW 2011, Hyderabad, India, 28 March–1 April 2011, pp. 3–4 (2011)
Freitas, A., Curry, E.: Big Data Curation. In: Cavanillas, J.M., Curry, E., Wahlster, W. (eds.) New Horizons for a Data-Driven Economy, pp. 87–118. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-21569-3_6
Gitelman, L.: Raw Data Is an Oxymoron. MIT Press, Cambridge (2013)
Goldberg, Y., Levy, O.: word2vec explained: deriving Mikolov et al’.s negative-sampling word-embedding method. arXiv preprint arXiv:1402.3722 (2014)
Hai, R., Geisler, S., Quix, C.: Constance: an intelligent data lake system. In: Proceedings of the 2016 International Conference on Management of Data, pp. 2097–2100. ACM (2016)
Lord, P., Macdonald, A., Lyon, L., Giaretta, D.: From data deluge to data curation. In: Proceedings of the UK e-science All Hands meeting, pp. 371–375. Citeseer (2004)
McAfee, A., Brynjolfsson, E., Davenport, T.H., Patil, D., Barton, D.: Big data: the management revolution. Harv. Bus. Rev. 90(10), 60–68 (2012)
Miller, D.: Tales from Facebook. Polity, Cambridge (2011)
Miloslavskaya, N., Tolstoy, A.: Big data, fast data and data lake concepts. Procedia Comput. Sci. 88, 300–305 (2016)
Moreau, L., et al.: The open provenance model core specification (v1.1). Future Gener. Comput. Syst. 27(6), 743–756 (2011)
Murthy, D.: Twitter. Polity Press, Cambridge (2018)
Schiliro, F., et al.: iCOP: IoT-enabled policing processes. In: Liu, X., et al. (eds.) ICSOC 2018. LNCS, vol. 11434, pp. 447–452. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-17642-6_42
Shadbolt, N., et al.: Linked open government data: Lessons from data.gov.uk. IEEE Intell. Syst. 27(3), 16–24 (2012)
Stonebraker, M., et al.: Data curation at scale: the data tamer system. In: CIDR (2013)
Strapparava, C., Valitutti, A., et al.: Wordnet affect: an affective extension of wordnet. In: Lrec, vol. 4, pp. 40. Citeseer (2004)
Sumbaly, R., Kreps, J., Shah, S.: The big data ecosystem at linkedin. In: Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data, pp. 1125–1134. ACM (2013)
Tene, O., Polonetsky, J.: Big data for all: privacy and user control in the age of analytics. Nw. J. Tech. Intell. Prop. 11, xxvii (2012)
Terrizzano, I.G., Schwarz, P.M., Roth, M., Colino, J.E.: Data wrangling: the challenging yourney from the wild to the lake. In: CIDR (2015)
Xia, F., Yang, L.T., Wang, L., Vinel, A.: Internet of things. Int. J. Commun. Syst. 25(9), 1101 (2012)
Zomaya, A.Y., Sakr, S.: Handbook of Big Data Technologies. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-49340-4
Acknowledgements
We Acknowledge the AI-enabled Processes (AIP) (https://aip-research-center.github.io/) Research Centre for funding part of this research.
Author information
Authors and Affiliations
Corresponding author
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2020 Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Beheshti, A., Benatallah, B., Sheng, Q.Z., Schiliro, F. (2020). Intelligent Knowledge Lakes: The Age of Artificial Intelligence and Big Data. In: U, L., Yang, J., Cai, Y., Karlapalem, K., Liu, A., Huang, X. (eds) Web Information Systems Engineering. WISE 2020. Communications in Computer and Information Science, vol 1155. Springer, Singapore. https://doi.org/10.1007/978-981-15-3281-8_3
Download citation
DOI: https://doi.org/10.1007/978-981-15-3281-8_3
Published:
Publisher Name: Springer, Singapore
Print ISBN: 978-981-15-3280-1
Online ISBN: 978-981-15-3281-8
eBook Packages: Computer ScienceComputer Science (R0)