Large-Scale Learning from Data Streams with Apache SAMOA

Kourtellis, Nicolas; De Francisci Morales, Gianmarco; Bifet, Albert

doi:10.1007/978-3-319-89803-2_8

Part of the book series: Studies in Big Data (SBD, volume 41)

Abstract

Apache SAMOA (Scalable Advanced Massive Online Analysis) is an open-source platform for mining big data streams. Big data refers to datasets whose size exceeds the ability of typical software tools to capture, store, manage, and analyze them, because of the time and memory complexity involved. Apache SAMOA provides a collection of distributed streaming algorithms for the most common data mining and machine learning tasks, such as classification, clustering, and regression, as well as programming abstractions for developing new algorithms. It features a pluggable architecture that allows it to run on several distributed stream processing engines, such as Apache Flink, Apache Storm, and Apache Samza. Apache SAMOA is written in Java and is available at https://samoa.incubator.apache.org under the Apache Software License version 2.0.

Author information

Authors and Affiliations

Telefonica Research, Barcelona, Spain
Nicolas Kourtellis
Qatar Computing Research Institute, Doha, Qatar
Gianmarco De Francisci Morales
LTCI, Télécom ParisTech, Paris, France
Albert Bifet

Corresponding author

Correspondence to Nicolas Kourtellis.

Editor information

Editors and Affiliations

Institute Mines-Telecom Lille Douai, Douai, France
Moamar Sayed-Mouchaweh

About this chapter

Cite this chapter

Kourtellis, N., De Francisci Morales, G., Bifet, A. (2019). Large-Scale Learning from Data Streams with Apache SAMOA. In: Sayed-Mouchaweh, M. (eds) Learning from Data Streams in Evolving Environments. Studies in Big Data, vol 41. Springer, Cham. https://doi.org/10.1007/978-3-319-89803-2_8

DOI: https://doi.org/10.1007/978-3-319-89803-2_8
Published: 29 July 2018
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-89802-5
Online ISBN: 978-3-319-89803-2
eBook Packages: Engineering, Engineering (R0)
