Analyzing Big Data Streams with Apache SAMOA

  • Conference paper
Behavioral Analytics in Social and Ubiquitous Environments (MUSE 2015, MSM 2015, MSM 2016)

Abstract

Apache SAMOA (Scalable Advanced Massive Online Analysis) is an open-source platform for mining big data streams. Big data is defined as datasets whose size is beyond the ability of typical software tools to capture, store, manage and analyze, due to the time and memory complexity. Velocity is one of the main properties of big data. Apache SAMOA provides a collection of distributed streaming algorithms for the most common data mining and machine learning tasks such as classification, clustering, and regression, as well as programming abstractions to develop new algorithms. It features a pluggable architecture that allows it to run on several distributed stream processing engines such as Apache Flink, Apache Storm, Apache Samza, and Apache Apex. Apache SAMOA is written in Java and is available at https://samoa.incubator.apache.org/ under the Apache Software License version 2.0.

Notes

  1. http://hadoop.apache.org

  2. http://mahout.apache.org

  3. http://storm.apache.org

  4. https://flink.apache.org/

  5. http://samza.apache.org/

  6. https://github.com/samoa-moa/samoa-moa

  7. http://moa.cms.waikato.ac.nz/datasets/, http://osmot.cs.cornell.edu/kddcup/datasets.html

Corresponding author

Correspondence to Albert Bifet.

Copyright information

© 2019 Springer Nature Switzerland AG

About this paper

Cite this paper

Kourtellis, N., de Francisci Morales, G., Bifet, A. (2019). Analyzing Big Data Streams with Apache SAMOA. In: Atzmueller, M., Chin, A., Lemmerich, F., Trattner, C. (eds) Behavioral Analytics in Social and Ubiquitous Environments. MUSE 2015, MSM 2015, MSM 2016. Lecture Notes in Computer Science, vol 11406. Springer, Cham. https://doi.org/10.1007/978-3-030-34407-8_3

  • DOI: https://doi.org/10.1007/978-3-030-34407-8_3

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-33906-7

  • Online ISBN: 978-3-030-34407-8

  • eBook Packages: Computer Science, Computer Science (R0)
