Clustering Blockchain Data

  • Sudarshan S. Chawathe
Chapter
Part of the Unsupervised and Semi-Supervised Learning book series (UNSESUL)

Abstract

Blockchain datasets, such as those generated by Bitcoin, Ethereum, and other popular cryptocurrencies, are intriguing examples of big data. Analysis of these datasets has diverse applications, such as detecting fraud and illegal transactions, characterizing major services, identifying financial hotspots, and profiling the usage and performance of large peer-to-peer consensus-based systems. Unsupervised learning methods in general, and clustering methods in particular, hold the potential to discover unanticipated patterns leading to valuable insights. However, the volume, velocity, and variety of blockchain data, as well as the difficulties in evaluating results, pose significant challenges to the efficient and effective application of clustering methods to blockchain data. Nevertheless, recent and ongoing work has both adapted classic methods and developed new ones tailored to the characteristics of such data. This chapter motivates the study of clustering methods for blockchain data and introduces the key blockchain concepts from a data-centric perspective. It presents different models and methods used for clustering blockchain data, and describes the challenges of evaluating such methods along with some solutions.

Keywords

Blockchain Data Cryptocurrencies Ethereum Bitcoin Address Block Subsidy 

Notes

Acknowledgements

This work was supported in part by the US National Science Foundation grants EAR-1027960 and PLR-1142007. Several improvements resulted from detailed feedback from the reviewers.

Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  1. University of Maine, Orono, USA
