D3CAS: Distributed Clustering Algorithm Applied to Short-Text Stream Processing

Molina, Roberto; Hasperué, Waldo; Villa Monte, Augusto

doi:10.1007/978-3-030-20787-8_15

Part of the book series: Communications in Computer and Information Science (CCIS, volume 995)

Included in the following conference series:

Argentine Congress of Computer Science


Abstract

This article presents a proof of concept of D3CAS, a dynamic, density-based clustering algorithm. The algorithm was implemented to run under the Spark Streaming framework and processes data streams. It was tested on a stream of short texts consisting of requests written by social media users, taken from the Pizza Request Dataset. The results, obtained in a virtualized environment, were analyzed under different configurations of the algorithm's parameters, which made it possible to identify the configurations that yield the best results. Because the dataset includes a label for each text in the stream, cluster purity could be measured and the results compared with those reported by the authors of the dataset.
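As an illustration of the evaluation measure mentioned in the abstract, the following is a minimal sketch of cluster purity under its common textbook definition: each cluster is credited with the count of its most frequent label, and the sum is divided by the total number of items. The function name, the toy cluster assignments, and the labels "a" and "b" are assumptions made for this example; they are not taken from the paper or from the Pizza Request Dataset, and the paper may compute purity differently.

```python
from collections import Counter, defaultdict


def cluster_purity(cluster_ids, labels):
    """Return the purity of a clustering, a value in (0, 1].

    cluster_ids[i] is the cluster assigned to item i and labels[i] is its
    ground-truth label. Both names are hypothetical, chosen for this sketch.
    """
    if len(cluster_ids) != len(labels):
        raise ValueError("cluster_ids and labels must have the same length")
    if not labels:
        raise ValueError("at least one item is required")

    # Count how often each label occurs inside each cluster.
    counts_by_cluster = defaultdict(Counter)
    for cluster, label in zip(cluster_ids, labels):
        counts_by_cluster[cluster][label] += 1

    # Each cluster contributes the size of its majority label.
    majority_total = sum(max(c.values()) for c in counts_by_cluster.values())
    return majority_total / len(labels)


# Toy data (illustrative only): three clusters, two labels.
clusters = [0, 0, 0, 1, 1, 2]
truth = ["a", "a", "b", "b", "b", "a"]
purity = cluster_purity(clusters, truth)
print(f"purity = {purity:.3f}")
```

In this toy case the majority counts are 2, 2 and 1 over 6 items, so the purity is 5/6. Purity tends to rise as the number of clusters grows, so it is usually read alongside the number of clusters produced.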



Author information

Authors and Affiliations

Facultad de Informática, Instituto de Investigación en Informática (III-LIDI), Universidad Nacional de La Plata, La Plata, Argentina
Roberto Molina, Waldo Hasperué & Augusto Villa Monte
CIN-EVC, La Plata, Argentina
Roberto Molina
Comisión de Investigaciones Científicas (CIC), Provincia de Buenos Aires, Argentina
Waldo Hasperué
UNLP, La Plata, Argentina
Augusto Villa Monte


Corresponding author

Correspondence to Waldo Hasperué.

Editor information

Editors and Affiliations

National University of La Plata, La Plata, Argentina
Patricia Pesado
National University of Buenos Aires Center, Buenos Aires, Argentina
Claudio Aciti


About this paper

Cite this paper

Molina, R., Hasperué, W., Villa Monte, A. (2019). D3CAS: Distributed Clustering Algorithm Applied to Short-Text Stream Processing. In: Pesado, P., Aciti, C. (eds) Computer Science – CACIC 2018. CACIC 2018. Communications in Computer and Information Science, vol 995. Springer, Cham. https://doi.org/10.1007/978-3-030-20787-8_15


DOI: https://doi.org/10.1007/978-3-030-20787-8_15
Published: 17 May 2019
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-20786-1
Online ISBN: 978-3-030-20787-8
eBook Packages: Computer Science, Computer Science (R0)
