Skip to main content
Log in

AdaHash: hashing-based scalable, adaptive hierarchical clustering of streaming data on Mapreduce frameworks

  • Regular Paper
  • Published:
International Journal of Data Science and Analytics Aims and scope Submit manuscript

Abstract

Despite the recent growth in large-scale, distributed streaming data processing, there are currently limited options for flexible clustering of streaming data on industrial production frameworks such as Mapreduce and Spark. The clustering methods being used by practitioners on such systems do not respond rapidly to new data, or do not adjust the number of clusters appropriately as more data are processed. This problem is particularly acute for unstructured data like text and other non-enumerated types that are common in log and message streams and not analyzed at scale for precisely this reason. To address such issues, this paper proposes a method for hierarchical clustering using adaptive hash (AdaHash) values that can be re-calculated during a periodic batch process and used for subsequent streaming processing at the speed of data arrival, assuming sufficient distributed compute resources. We demonstrate that this method is as fast as other (optimal) hashing methods while enabling an adaptive hash function on each batch cycle within a lambda architecture computing framework.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Fig. 1
Fig. 2
Fig. 3
Fig. 4
Fig. 5
Fig. 6
Fig. 7
Fig. 8

Similar content being viewed by others

References

  1. Apache spark mllib. http://spark.apache.org/docs/latest/ml-guide.html (2017). Accessed 15 Jan 2017

  2. Marz, N., Warren, J.: Big Data: Principles and Best Practices of Scalable Realtime Data Systems, 1st edn. Manning Publications Co., Greenwich (2015)

    Google Scholar 

  3. Fagin, R., Nievergelt, J., Pippenger, N., Strong, H.R.: Extendible hashing—a fast access method for dynamic files. ACM Trans. Database Syst. 4(3), 315 (1979). https://doi.org/10.1145/320083.320092

    Article  Google Scholar 

  4. Lu, P., Chen, G., Ooi, B.C., Vo, H.T., Wu, S.: ScalaGiST: scalable generalized search trees for mapreduce systems [innovative systems paper]. Proc. VLDB Endow. 7(14), 1797 (2014). https://doi.org/10.14778/2733085.2733087

    Article  Google Scholar 

  5. Dean, J., Ghemawat, S.: MapReduce: simplified data processing on large clusters. In: 6th Symposium on Operating System Design and Implementation (OSDI 2004), San Francisco, California, USA, 6–8 Dec 2004, pp. 137–150. http://www.usenix.org/events/osdi04/tech/dean.html (2004)

  6. Zaharia, M.: An architecture for fast and general data processing on large clusters. Ph.D. thesis, EECS Department, University of California, Berkeley. (2014). http://www2.eecs.berkeley.edu/Pubs/TechRpts/2014/EECS-2014-12.html. Accessed 11 Sept 2017

  7. Apache spark. http://spark.apache.org/ (2017). Accessed 20 Jan 2017

  8. Zadeh, J., Apostolopoulos, G., Tryfonas, C., Sudhakar, M.: Defense at scale: building a central nervous system for the SOC. Blackhat 2015, 1–8 (2015)

    Google Scholar 

  9. Databricks spark scikit-learn. https://databricks.com/blog/2016/02/08/auto-scaling-scikit-learn-with-apache-spark.html (2016). Accessed 22 Jan 2017

  10. Dean, J., Ghemawat, S.: MapReduce: Simplified data processing on large clusters. In: Proceedings of the 6th Conference on Symposium on Opearting Systems Design and Implementation—Volume 6 (USENIX Association, Berkeley, CA, USA, 2004), OSDI’04, pp. 10–10. (2004). http://dl.acm.org/citation.cfm?id=1251254.1251264. Accessed 11 Oct 2017

  11. Okasaki, C.: Purely Functional Data Structures. Cambridge University Press, Cambridge (1999)

    MATH  Google Scholar 

  12. Indyk, P., Motwani, R.: Approximate nearest neighbors: towards removing the curse of dimensionality. In: Proceedings of the Thirtieth Annual ACM Symposium on the Theory of Computing, Dallas, Texas, USA, 23–26 May 1998, pp. 604–613. https://doi.org/10.1145/276698.276876 (1998)

  13. Apache spark mllib lsh. https://spark.apache.org/docs/2.1.0/ml-features.html (2017). Accessed 21 Nov 2017

  14. Wang, J., Shen, H.T., Song, J., Ji, J.: Hashing for Similarity Search: A Survey. CoRR arXiv:1408.2927 (2014)

  15. Jain, A.K., Dubes, R.C.: Algorithms for Clustering Data. Prentice Hall, Upper Saddle River (1988)

    MATH  Google Scholar 

  16. Aggarwal, C.C., Reddy, C.K. (eds.): Data Clustering: Algorithms and Applications. CRC Press, Boca Raton (2014). http://www.crcpress.com/product/isbn/9781466558212. Accessed 11 Oct 2017

  17. Zhang, T., Ramakrishnan, R., Livny, M.: BIRCH: an efficient data clustering method for very large databases. In: Proceedings of the 1996 ACM SIGMOD International Conference on Management of Data, Montreal, Canada, pp. 103–114 (1996)

  18. Ghesmoune, M., Lebbah, M., Azzag, H.: A new growing neural gas for clustering data streams. Neural Netw. 78, 36 (2016). https://doi.org/10.1016/j.neunet.2016.02.003

    Article  Google Scholar 

  19. Silva, J.A., Faria, E.R., Barros, R.C., Hruschka, E.R., Carvalho, A.C.P.L.F., Gama, J.A.: Data stream clustering: a survey. ACM Comput. Surv. 46(1), 1–13 (2013). https://doi.org/10.1145/2522968.2522981

    Article  MATH  Google Scholar 

  20. Aggarwal, C.C., Han, J., Wang, J., Yu, P.S.: A framework for clustering evolving data streams. In: Proceedings of the 29th International Conference on Very Large Data Bases—Volume 29 (VLDB Endowment, 2003), VLDB ’03, pp. 81–92. (2003). http://dl.acm.org/citation.cfm?id=1315451.1315460. Accessed 11 Sept 2017

    Chapter  Google Scholar 

  21. Cao, F., Estert, M., Qian, W., Zhou, A.: Density-based clustering over an evolving data stream with noise. In: Proceedings of the 2006 SIAM International Conference on Data Mining, pp. 328–339. https://doi.org/10.1137/1.9781611972764.29 (2006)

  22. Backhoff, O., Ntoutsi, E.: Scalable online-offline stream clustering in apache spark. In: 2016 IEEE 16th International Conference on Data Mining Workshops (ICDMW), pp. 37–44 (2016)

  23. Thornthwaite, W.: Dimensional modeling basics get flexible real-world dimensions with analysis services 2005. SQL Server Pro (2006)

  24. Apache common log format. https://httpd.apache.org/docs/1.3/logs.html (2004). Accessed 13 Mar 2017

  25. Bahmani, B., Moseley, B., Vattani, A., Kumar, R., Vassilvitskii, S.: Scalable K-means++. Proc. VLDB Endow. 5(7), 622 (2012)

    Article  Google Scholar 

Download references

Acknowledgements

The authors would like to thank Ravi Srinivasan for providing valuable feedback on the mathematical framing of the model.

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Dean Teffer.

Rights and permissions

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Teffer, D., Srinivasan, R. & Ghosh, J. AdaHash: hashing-based scalable, adaptive hierarchical clustering of streaming data on Mapreduce frameworks. Int J Data Sci Anal 8, 257–267 (2019). https://doi.org/10.1007/s41060-018-0145-7

Download citation

  • Received:

  • Accepted:

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s41060-018-0145-7

Keywords

Navigation