The main objective of this chapter is to provide an overview of the modern field of data science and some of its current progress. The overview focuses on two important paradigms: (1) the big data paradigm, which describes a problem space for big data analytics, and (2) the machine learning paradigm, which describes a solution space for big data analytics. The chapter also gives a preliminary description of the important elements of data science: the data, the knowledge (also called responses), and the operations. The terms knowledge and responses are used interchangeably in the rest of the book. Preliminary information on data formats, data types, and classification is also presented. The chapter emphasizes the importance of collaboration among experts from multiple disciplines and describes some current institutions whose collaborative activities provide useful resources.
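The three elements named above can be made concrete with a small sketch (this example is illustrative only and is not taken from the chapter; the feature values, labels, and the nearest-centroid rule are assumptions chosen for brevity). The data are feature vectors, the responses are class labels, and an operation here is classification:

```python
# Illustrative sketch (assumed, not from the chapter): the data are feature
# vectors, the responses are class labels, and the operation is a simple
# nearest-centroid classification.

def centroids(data, responses):
    """Compute the mean feature vector (centroid) of each class."""
    sums, counts = {}, {}
    for x, y in zip(data, responses):
        s = sums.setdefault(y, [0.0] * len(x))
        for i, v in enumerate(x):
            s[i] += v
        counts[y] = counts.get(y, 0) + 1
    return {y: [v / counts[y] for v in s] for y, s in sums.items()}

def classify(x, cents):
    """Assign x the label of the closest centroid (squared Euclidean distance)."""
    return min(cents, key=lambda y: sum((a - b) ** 2 for a, b in zip(x, cents[y])))

data = [[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [4.8, 5.2]]    # the data
responses = ["normal", "normal", "intrusion", "intrusion"]  # the responses
cents = centroids(data, responses)
print(classify([1.1, 0.9], cents))  # → normal
```

Even this toy setup reflects the two paradigms: the size and structure of `data` belong to the problem space, while the choice of classification rule belongs to the solution space.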



Thanks to the Department of Statistics, University of California, Berkeley; the Center for Science of Information, Purdue University; the Statistical and Applied Mathematical Sciences Institute; and the Institute for Mathematics and its Applications, University of Minnesota, for their support, which contributed to the development of this book.



Copyright information

© Springer Science+Business Media New York 2016

Authors and Affiliations

  • Shan Suthaharan, Department of Computer Science, UNC Greensboro, Greensboro, USA
