Data Science and Distributed Intelligence: Recent Developments and Future Insights

  • Alfredo Cuzzocrea
  • Mohamed Medhat Gaber
Part of the Studies in Computational Intelligence book series (SCI, volume 446)


Big Data, Data Science and MapReduce are three keywords that have flooded our research papers and technical articles during the last two years. Moreover, because the computational infrastructures supporting Data Science (such as Clouds and Grids) are inherently distributed, it is natural to view Distributed Intelligence as the underlying paradigm for novel Data Science challenges. Following this major trend, in this paper we provide background on these new terms, followed by a discussion of recent developments in the data mining and data warehousing areas in light of these keywords. Finally, we offer our insights into the next stages of research and development in this area.


Data Mining, Data Repository, Business Intelligence, Multidimensional Data, Linear Support Vector Machine
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.





Copyright information

© Springer-Verlag Berlin Heidelberg 2013

Authors and Affiliations

  • Alfredo Cuzzocrea
    • 1
    • 2
  • Mohamed Medhat Gaber
    • 1
    • 2
  1. ICAR-CNR and University of Calabria, Cosenza, Italy
  2. School of Computing, University of Portsmouth, Portsmouth, UK
