Data Mining and Knowledge Discovery, Volume 24, Issue 2, pp 325–354

Descriptive matrix factorization for sustainability: Adopting the principle of opposites

  • Christian Thurau
  • Kristian Kersting
  • Mirwaes Wahabzada
  • Christian Bauckhage

Abstract

Climate change, the global energy footprint, and strategies for sustainable development have become topics of considerable political and public interest. The public debate is informed by an exponentially growing amount of data, and there are diverse partisan interests when it comes to interpretation. We therefore believe that data analysis methods are called for that provide results which are intuitively understandable even to non-experts. Moreover, such methods should be efficient so that non-expert users can perform their own analysis at low expense in order to understand the effects of different parameters and influential factors. In this paper, we discuss a new technique for factorizing data matrices that meets both of these requirements. The basic idea is to represent a set of data by means of convex combinations of extreme data points. This often accommodates human cognition. In contrast to established factorization methods, the approach presented in this paper can also determine over-complete bases. At the same time, convex combinations allow for highly efficient matrix factorization. Based on techniques adopted from the field of distance geometry, we derive a linear-time algorithm to determine suitable basis vectors for factorization. Using several environmental and developmental data sets as examples, we discuss the performance and characteristics of the proposed approach and validate that significant efficiency gains are obtainable without performance decreases compared to existing convexity-constrained approaches.
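The basic idea stated in the abstract, representing each data point as a convex combination of a few extreme data points, can be illustrated with a minimal Python sketch. This is not the authors' algorithm: the farthest-point selection below is a simple stand-in heuristic rather than the distance-geometry method derived in the paper, the coefficients are found by plain projected gradient descent, and all function names (`farthest_point_extremes`, `project_to_simplex`, `convex_coefficients`, `combine`) as well as the toy data are assumptions made only for illustration.

```python
import math


def farthest_point_extremes(points, k):
    """Pick k indices of far-apart points (stand-in heuristic, not the paper's method)."""
    dim = len(points[0])
    mean = [sum(p[d] for p in points) / len(points) for d in range(dim)]
    # Start from the point farthest from the data mean.
    chosen = [max(range(len(points)), key=lambda i: math.dist(points[i], mean))]
    while len(chosen) < k:
        candidates = [i for i in range(len(points)) if i not in chosen]
        # Add the point with the largest summed distance to those already chosen.
        chosen.append(max(
            candidates,
            key=lambda i: sum(math.dist(points[i], points[j]) for j in chosen),
        ))
    return chosen


def project_to_simplex(v):
    """Euclidean projection of v onto {a : a_j >= 0, sum_j a_j = 1}."""
    cumulative, theta = 0.0, 0.0
    for i, u in enumerate(sorted(v, reverse=True), start=1):
        cumulative += u
        shift = (cumulative - 1.0) / i
        if u - shift > 0:
            theta = shift  # the last index satisfying the condition determines theta
    return [max(x - theta, 0.0) for x in v]


def combine(coeffs, basis):
    """Return sum_j coeffs[j] * basis[j]."""
    dim = len(basis[0])
    return [sum(c * b[d] for c, b in zip(coeffs, basis)) for d in range(dim)]


def convex_coefficients(x, basis, steps=500):
    """Approximate x by a convex combination of basis points (projected gradient descent)."""
    k = len(basis)
    total = sum(v * v for b in basis for v in b)
    # 1 / trace(B^T B) is a safe step size for the objective 0.5 * ||B a - x||^2.
    step = 1.0 / total if total > 0 else 1.0
    coeffs = [1.0 / k] * k
    for _ in range(steps):
        residual = [a - t for a, t in zip(combine(coeffs, basis), x)]
        grad = [sum(r * v for r, v in zip(residual, b)) for b in basis]
        coeffs = project_to_simplex([c - step * g for c, g in zip(coeffs, grad)])
    return coeffs


# Toy data (hypothetical): three triangle corners plus three interior points.
points = [(0.0, 0.0), (4.0, 0.0), (0.0, 4.0), (1.0, 1.0), (2.0, 1.0), (1.0, 2.0)]
extreme_idx = farthest_point_extremes(points, 3)
basis = [points[i] for i in extreme_idx]
coeffs = convex_coefficients(points[3], basis)
print(extreme_idx, [round(c, 3) for c in coeffs])
```

Because the coefficients are non-negative and sum to one, each can be read as the share of one extreme point in the reconstructed data point, which is the kind of interpretability the abstract argues for. The selection step here costs more than linear time as written; it is meant only to show the role that extreme points play, not to reproduce the efficiency claims of the paper.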

Keywords

Matrix factorization, Convex combinations, Distance geometry, Large-scale data analysis

Copyright information

© The Author(s) 2011

Authors and Affiliations

  • Christian Thurau (1)
  • Kristian Kersting (1)
  • Mirwaes Wahabzada (1)
  • Christian Bauckhage (1)

  1. Fraunhofer Institute for Intelligent Analysis and Information Systems IAIS, Sankt Augustin, Germany
