Sparse p-adic data coding for computationally efficient and effective big data analytics

  • F. Murtagh
Research Articles

Abstract

We develop the theory and practical implementation of p-adic sparse coding of data. Rather than the standard sparsifying criterion that uses the L0 pseudo-norm, we use the p-adic norm. We require that the hierarchy or tree be node-ranked, as is standard practice in agglomerative and other hierarchical clustering, but not necessarily with decision trees. In order to structure the data, all computational processing operations are direct readings of the data, or are bounded by a constant number of direct readings of the data, implying linear computational time. Through p-adic sparse data coding, efficient storage results, and for bounded p-adic norm stored data, search and retrieval are constant time operations. Examples show the effectiveness of this new approach to content-driven encoding and displaying of data.
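
For orientation, the two quantities contrasted in the abstract can be sketched for integer data as follows. This is a minimal illustration of the standard textbook definitions only, not the coding scheme developed in the paper; the function names are hypothetical and chosen for this example.

```python
from fractions import Fraction


def p_adic_valuation(n: int, p: int) -> int:
    """Largest k such that p**k divides the nonzero integer n."""
    if n == 0:
        raise ValueError("the valuation of 0 is +infinity")
    n = abs(n)
    k = 0
    while n % p == 0:
        n //= p
        k += 1
    return k


def p_adic_norm(n: int, p: int) -> Fraction:
    """Standard p-adic norm |n|_p = p**(-v_p(n)), with |0|_p = 0."""
    if n == 0:
        return Fraction(0)
    return Fraction(1, p ** p_adic_valuation(n, p))


def l0_pseudo_norm(x) -> int:
    """L0 pseudo-norm: the number of nonzero entries of a vector."""
    return sum(1 for v in x if v != 0)
```

For example, `p_adic_norm(12, 2)` evaluates to `Fraction(1, 4)`, since 12 = 2^2 · 3, whereas `l0_pseudo_norm([0, 3, 0, 1])` simply counts the two nonzero entries.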

Keywords

big data, p-adic numbers, ultrametric topology, hierarchical clustering, binary rooted tree, computational and storage complexity

Copyright information

© Pleiades Publishing, Ltd. 2016

Authors and Affiliations

  1. Department of Computing and Mathematics, University of Derby, Derby, UK
  2. Department of Computing, Goldsmiths, University of London, London, UK
