Intelligent Knowledge Lakes: The Age of Artificial Intelligence and Big Data

Part of the Communications in Computer and Information Science book series (CCIS, volume 1155)

Abstract

Continuous improvements in connectivity, storage and data processing capabilities give access to a deluge of big data generated on open, private, social and IoT (Internet of Things) data islands. Data Lakes were introduced as storage repositories that organize this raw data in its native format until it is needed. The rationale behind a Data Lake is to store raw data and let the data analyst decide how to curate it later. Previously, we introduced the novel notion of a Knowledge Lake, i.e., a contextualized Data Lake, and proposed algorithms to turn the raw data (stored in Data Lakes) into contextualized data and knowledge using extraction, enrichment, annotation, linking and summarization techniques. In this tutorial, we introduce Intelligent Knowledge Lakes to facilitate linking Artificial Intelligence (AI) and Data Analytics. This will enable AI applications to learn from contextualized data and use it to automate business processes and to develop cognitive assistance for facilitating knowledge-intensive processes or generating new rules for future business analytics.

Keywords

  • Knowledge Lake
  • Data Lake
  • Data analytics
  • Data curation
  • Artificial Intelligence
  • Big data

Notes

  1. https://en.wikipedia.org/wiki/Boston_Marathon_bombing
  2. https://developers.google.com/knowledge-graph
  3. https://www.wikidata.org/
  4. An entity E is represented as a data object that exists separately and has a unique identity. Entities are described by a set of attributes.

Acknowledgements

We acknowledge the AI-enabled Processes (AIP) Research Centre (https://aip-research-center.github.io/) for funding part of this research.

Author information

Corresponding author

Correspondence to Amin Beheshti.

Copyright information

© 2020 Springer Nature Singapore Pte Ltd.

About this paper

Cite this paper

Beheshti, A., Benatallah, B., Sheng, Q.Z., Schiliro, F. (2020). Intelligent Knowledge Lakes: The Age of Artificial Intelligence and Big Data. In: U, L., Yang, J., Cai, Y., Karlapalem, K., Liu, A., Huang, X. (eds) Web Information Systems Engineering. WISE 2020. Communications in Computer and Information Science, vol 1155. Springer, Singapore. https://doi.org/10.1007/978-981-15-3281-8_3

  • DOI: https://doi.org/10.1007/978-981-15-3281-8_3

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-15-3280-1

  • Online ISBN: 978-981-15-3281-8

  • eBook Packages: Computer Science, Computer Science (R0)