
Principal Component Analysis for Categorical Histogram Data: Some Open Directions of Research

  • Conference paper
Classification and Multivariate Analysis for Complex Data Structures

Abstract

In recent years, the analysis of symbolic data, where the units are categories, classes or concepts described by intervals, distributions, sets of categories and the like, has become a challenging task, since many application fields generate massive amounts of data that are difficult to store and to analyze with traditional techniques [1]. In this paper we propose a strategy for extending standard PCA to such data in the case where the variable values are “categorical histograms” (i.e. a set of categories, called bins, with their relative frequencies). These variables are a special case of “modal” variables (see, for example, Diday and Noirhomme [5]) or of “compositional” variables (Aitchison [1]) where the weights are not necessarily frequencies. First, we introduce “metabins”, which mix together bins of the different histograms and enhance interpretability. Standard PCA applied to the bins of such a data table loses the histogram constraints and assumes independence between the bins, whereas copulas take care of the probabilities and the underlying dependencies. Then, we give several ways of representing the units (called “individuals”), the bins, the variables and the metabins when the number of categories is not the same for each variable. A way of representing the variation of the individuals and of obtaining histograms as output is also given. Finally, some theoretical results allow the representation of the categorical histogram variables inside a hypercube covering the correlation sphere.
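
To make the starting point concrete, the following is a minimal sketch of the baseline the abstract refers to: standard PCA applied directly to the bins of categorical histogram variables that have different numbers of categories. It is not the metabin or copula approach of the paper. The toy data, the variable names `colour` and `size`, and the helper functions `bins_table` and `pca` are all assumptions made for illustration.

```python
import numpy as np

# Hypothetical toy data: 4 individuals described by 2 categorical
# histogram variables. Each row of a block holds relative frequencies
# over that variable's bins, so each row sums to 1.
colour = np.array([  # variable with 3 bins
    [0.2, 0.5, 0.3],
    [0.6, 0.1, 0.3],
    [0.3, 0.3, 0.4],
    [0.1, 0.7, 0.2],
])
size = np.array([  # variable with 2 bins
    [0.4, 0.6],
    [0.9, 0.1],
    [0.5, 0.5],
    [0.2, 0.8],
])


def bins_table(*histogram_vars):
    """Concatenate histogram variables into one individuals x bins table."""
    for block in histogram_vars:
        if not np.allclose(block.sum(axis=1), 1.0):
            raise ValueError("each histogram must sum to 1")
    return np.hstack(histogram_vars)


def pca(table, n_components=2):
    """Plain PCA via SVD of the column-centred table (assumed baseline)."""
    centred = table - table.mean(axis=0)
    u, s, vt = np.linalg.svd(centred, full_matrices=False)
    scores = u[:, :n_components] * s[:n_components]   # individuals
    loadings = vt[:n_components].T                    # bins
    explained = (s ** 2) / np.sum(s ** 2)             # variance shares
    return scores, loadings, explained[:n_components]


table = bins_table(colour, size)
scores, loadings, explained = pca(table)
print("scores of the individuals:\n", scores)
print("loadings of the bins:\n", loadings)
print("share of variance explained:", explained)
```

In this sketch the bins are treated as ordinary columns, so nothing forces a reconstructed row to remain a valid histogram. The only trace of the constraint is that, within each variable, the loadings of its bins sum to zero, because every centred histogram row sums to zero.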


References

  1. Aitchison, J.: The statistical analysis of compositional data. Chapman & Hall, London; Biometrika 70(1), 57–65 (1996)

  2. Billard, L., Diday, E.: Symbolic data analysis, Wiley, Chichester (2006)

  3. Bock, H.H., Diday, E. (eds.): Analysis of symbolic data, Springer, Berlin (2000)

  4. Clayton, D.G.: A model for association in bivariate life tables and its application in epidemiological studies of familial tendency in chronic disease incidence. Biometrika 65, 141–152 (1978)

  5. Diday, E., Noirhomme-Fraiture, M. (eds.): Symbolic data analysis and the SODAS software, pp. 279–311, Wiley, Chichester (2008)

  6. Frank, M.J.: On the simultaneous associativity of F(x, y) and x+y-F(x, y). Aequationes Math. 19, 194–226 (1979)

  7. Ichino, M.: Symbolic principal component analysis based on the nested covering. In: ISI2007, Lisbon, 2007

  8. Ichino, M.: Symbolic PCA for histogram-valued data. In: Proceedings IASC, Yokohama, Japan, 5–8 Dec 2008

  9. Lebart, L., Morineau, A., Piron, M.: Statistique exploratoire multidimensionnelle. Dunod Editeur, Paris (1995)

  10. Makosso Kallyth, S., Diday, E.: Analyse en composantes principales de variables symboliques de types histogrammes. In: D’Aubigny, G. (ed.) Proceedings SFC. IMAG, Grenoble, France (Sept 2009)

  11. Nagabhushan, P., Kumar, P.: Principal component analysis of histogram data. In: Liu, D. et al. (eds.) ISNN 2007, Part II, LNCS 4492, pp. 1012–1021. Springer, Berlin, Heidelberg (2007)

  12. Nelsen, R.B.: An introduction to copulas. In: Lecture Notes in Statistics, Springer, New York, NY (1998)

  13. Rodriguez, O., Diday, E., Winsberg, S.: Generalization of the principal component analysis to histogram data. In: Workshop on Symbolic Data Analysis of the 4th European Conference on Principles and Practice of Knowledge Discovery in Data Bases, 12–16 Sept 2000, Lyon (2001)

  14. Schweizer, B., Sklar, A.: Probabilistic metric spaces, Elsevier, North-Holland, New York (1983)

  15. Sklar, A.: Fonctions de répartition à n dimensions et leurs marges. Publ. Inst. Stat. Univ. Paris 8, 229–231 (1959)

Author information

Corresponding author

Correspondence to Edwin Diday.

Copyright information

© 2011 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Diday, E. (2011). Principal Component Analysis for Categorical Histogram Data: Some Open Directions of Research. In: Fichet, B., Piccolo, D., Verde, R., Vichi, M. (eds) Classification and Multivariate Analysis for Complex Data Structures. Studies in Classification, Data Analysis, and Knowledge Organization. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-13312-1_1
