
Principal component analysis for histogram-valued data

Regular Article · Advances in Data Analysis and Classification

Abstract

This paper introduces a principal component methodology for analysing histogram-valued data within the symbolic data domain; no comparable method currently exists for this type of data. The proposed method uses a symbolic covariance matrix to determine the principal component space. The resulting observations are represented in principal component space as polytopes for visualization, and a numerical representation of the resulting polytopes via histogram-valued output is also presented. The necessary algorithms are included. The technique is illustrated on a weather data set.


References

  • Anderson TW (1963) Asymptotic theory for principal components analysis. Ann Math Stat 34:122–148

  • Anderson TW (1984) An introduction to multivariate statistical analysis, 2nd edn. John Wiley, New York

  • Bertrand P, Goupil F (2000) Descriptive statistics for symbolic data. In: Bock H-H, Diday E (eds) Analysis of symbolic data: exploratory methods for extracting statistical information from complex data. Springer, Berlin, pp 103–124

  • Billard L (2008) Sample covariance functions for complex quantitative data. In: Mizuta M, Nakano J (eds) Proceedings World Congress, International Association of Statistical Computing. Japanese Society of Computational Statistics, Japan, pp 157–163

  • Billard L (2011) Brief overview of symbolic data and analytic issues. Stat Anal Data Min 4:149–156

  • Billard L, Diday E (2003) From the statistics of data to the statistics of knowledge: symbolic data analysis. J Am Stat Assoc 98:470–487

  • Billard L, Diday E (2006) Symbolic data analysis: conceptual statistics and data mining. John Wiley, Chichester

  • Billard L, Guo JH, Xu W (2011) Maximum likelihood estimators for bivariate interval-valued data. Technical Report, University of Georgia, Athens, GA, under revision

  • Billard L, Le-Rademacher J (2013) Symbolic principal components for interval-valued data. Revue des Nouvelles Technologies de l'Information 25:31–40

  • Bock HH, Diday E (2000) Analysis of symbolic data: exploratory methods for extracting statistical information from complex data. Springer, Berlin

  • Cazes P (2002) Analyse factorielle d'un tableau de lois de probabilité. Rev Stat Appl 50:5–24

  • Cazes P, Chouakria A, Diday E, Schecktman Y (1997) Extension de l'analyse en composantes principales à des données de type intervalle. Rev Stat Appl 45:5–24

  • Chouakria A (1998) Extension des méthodes d'analyse factorielle à des données de type intervalle. Thèse de doctorat, Université Paris Dauphine, Paris

  • Douzal-Chouakria A, Billard L, Diday E (2011) Principal component analysis for interval-valued observations. Stat Anal Data Min 4:229–246

  • Ichino M (2011) The quantile method for symbolic principal component analysis. Stat Anal Data Min 4:184–198

  • Irpino A, Lauro C, Verde R (2003) Visualizing symbolic data by closed shapes. In: Schader M, Gaul W, Vichi M (eds) Between data science and applied data analysis. Springer, Berlin, pp 244–251

  • Johnson RA, Wichern DW (2002) Applied multivariate statistical analysis, 5th edn. Prentice Hall, New Jersey

  • Jolliffe IT (2004) Principal component analysis, 2nd edn. Springer, New York

  • Lauro NC, Palumbo F (2000) Principal component analysis of interval data: a symbolic data analysis approach. Comput Stat 15:73–87

  • Lauro NC, Verde R, Irpino A (2008) Principal component analysis of symbolic data described by intervals. In: Diday E, Noirhomme-Fraiture M (eds) Symbolic data analysis and the SODAS software. Wiley, Chichester, pp 279–311

  • Le-Rademacher J (2008) Principal component analysis for interval-valued and histogram-valued data and likelihood functions and some maximum likelihood estimators for symbolic data. Doctoral dissertation, University of Georgia

  • Le-Rademacher J, Billard L (2011) Likelihood functions and some maximum likelihood estimators for symbolic data. J Stat Plan Inference 141:1593–1602

  • Le-Rademacher J, Billard L (2012) Symbolic-covariance principal component analysis and visualization for interval-valued data. J Comput Graph Stat 21:413–432

  • Le-Rademacher J, Billard L (2013) Principal component histograms from interval-valued observations. Comput Stat 28:2117–2138

  • Makosso-Kallyth S, Diday E (2012) Adaptation of interval PCA to symbolic histogram variables. Adv Data Anal Classif 6:147–159

  • Mardia KV, Kent JT, Bibby JM (1979) Multivariate analysis. Academic Press, New York

  • Palumbo F, Lauro NC (2003) A PCA for interval-valued data based on midpoints and radii. In: Yanai H, Okada A, Shigemasu K, Kano Y, Meulman J (eds) New developments in psychometrics. Springer, Tokyo, pp 641–648

  • Shapiro AF (2009) Fuzzy random variables. Insur Math Econ 44:307–314

  • Xu W (2010) Symbolic data analysis: interval-valued data regression. PhD thesis, University of Georgia

  • Zadeh LA (1965) Fuzzy sets. Inf Control 8:338–353

  • Zadeh LA (1968) Probability measures of fuzzy events. J Math Anal Appl 23:421–427


Author information

Corresponding author

Correspondence to L. Billard.


Appendix: Algorithm

The algorithm to construct the polytope representation of the observations in principal component space has essentially two parts. The first part ("Constructing the matrix of vertices") builds the matrices of vertices needed to construct the actual polytopes; the construction of the polytopes per se is then described in "Constructing the polytopes". Extensions to two- and three-dimensional polytope plots are given in "Constructing two and three dimensional plots", and the algorithm to compute the histograms from the resulting polytopes is given in "Constructing the PC histograms". The indexing notation used in these algorithms is similar to that of the R language: the position of an element of a vector, a matrix, or an array is specified in a pair of square brackets, [ ]. The index of a vector element is enclosed in the brackets; an element of a matrix is specified by a pair of numbers separated by a comma, the first specifying the row and the second the column; and an element of an array is specified by three numbers separated by commas, corresponding to row, column, and matrix, respectively. Also, we use lower case to represent an observed data matrix [e.g., \(\mathbf{x}_i^v\), to distinguish it from the random data matrix \(\mathbf{{X}}_i^v\) of Eq. (8)].

1.1 Constructing the matrix of vertices

First, assume that the observed data vector \(\mathbf{x}_i\) has been separated into a vector of subinterval endpoints and a vector of the relative frequencies. That is, let \(\mathbf{x}_{ep}\) be the vector of subinterval endpoints and \(\mathbf{x}_{rf}\) be the vector of subinterval relative frequencies. Then, \(\mathbf{x}_{ep}\) has \(\sum ^p_{j=1}{(s_{ij}+1)}\) elements and has the form

$$\begin{aligned} \mathbf{x}_{ep} = \left[ \begin{array}{lcccccccccr} a_{i11}&\ldots&a_{i1(s_{i1}+1)}&\ldots&a_{ip1}&\ldots&a_{ip(s_{ip}+1)}\end{array} \right] \end{aligned}$$

where \(a_{ijk}\), for \(k=1,\ldots , s_{ij}+1\) and \(j = 1, \ldots , p\), are elements of the set \(E_{ij}\). The vector \(\mathbf{x}_{rf}\) has \(\sum ^p_{j=1}{s_{ij}}\) elements and has the form

$$\begin{aligned} \mathbf{x}_{rf} = \left[ \begin{array}{lccccccccr} p_{i11}&\ldots&p_{i1s_{i1}}&\ldots&p_{ip1}&\ldots&p_{ips_{ip}} \end{array} \right] \end{aligned}$$

where \(p_{ijk}\) is the relative frequency of the kth subinterval of the observed histogram \(x_{ij}\). Before creating the matrix of vertices for observation i, we also need a p-vector whose elements are the numbers of subintervals of the \(X_{ij}\). Let \(\mathbf{ns}\) denote this vector. Then,

$$\begin{aligned} \mathbf{ns} = \left[ \begin{array}{lcr} s_{i1}&\ldots&s_{ip} \end{array} \right] . \end{aligned}$$
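As an illustration, the flattening into \(\mathbf{x}_{ep}\), \(\mathbf{x}_{rf}\), and \(\mathbf{ns}\) can be sketched in Python for a two-variable histogram observation (a minimal sketch, not the authors' code; the endpoint and frequency values are made up):

```python
# Illustrative histogram observation with p = 2 variables:
# variable 1 has two subintervals, variable 2 has three.
endpoints = [[0.0, 5.0, 10.0], [20.0, 25.0, 30.0, 35.0]]  # a_{ijk}, k = 1..s_ij+1
rel_freqs = [[0.4, 0.6], [0.2, 0.5, 0.3]]                 # p_{ijk}, k = 1..s_ij

x_ep = [a for var in endpoints for a in var]  # sum_j (s_ij + 1) elements
x_rf = [f for var in rel_freqs for f in var]  # sum_j s_ij elements
ns = [len(var) for var in rel_freqs]          # (s_i1, ..., s_ip)

print(x_ep)  # [0.0, 5.0, 10.0, 20.0, 25.0, 30.0, 35.0]
print(x_rf)  # [0.4, 0.6, 0.2, 0.5, 0.3]
print(ns)    # [2, 3]
```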

With the information in \(\mathbf{x}_{ep}\), \(\mathbf{x}_{rf}\), and \(\mathbf{ns}\), we can proceed with constructing the matrix of vertices \(\mathbf{x}^v_i\) using the following five steps:

\(\underline{Step~1:}\) Create a (\(p+1\))-vector \(\mathbf{nr}\) whose \((j+1)^{\mathrm{{th}}}\) element, for \(j=1,\ldots , p\), is the number of times that points \(a_{ijk}\), for \(k = 1, \ldots , s_{ij}+1\), must be repeated in Step 5 below. The first element of \(\mathbf{nr}\) is the number of rows of the matrix of observed vertices, \(\mathbf{x}^v_i\).

  1. 1.

    For \(j=1, \ldots , p\), set \(\mathbf{nr}[p-j+1] = \prod _{l=p-j+1}^{p}{(s_{il}+1)}\).

  2. 2.

    Set \(\mathbf{nr}[p+1] = 1\).

\(\underline{\mathrm{Step~2:}}\) Create a (\(p+1\))-vector \(\mathbf{nr}_p\) whose \((j+1)^{th}\) element, for \(j=1,\ldots , p\), is the number of sub-hyperrectangles present in observation i when all variables up to j are excluded.

  1. 1.

    For \(j=1, \ldots , p\), set \(\mathbf{nr}_p[p-j+1] = \prod _{l=p-j+1}^{p}{s_{il}}\).

  2. 2.

    Set \(\mathbf{nr}_p[p+1] = 1\).

\(\underline{\mathrm{Step~3:}}\) Create a p-vector \(\mathbf{sp}\) whose j th element is the position of the element of \(\mathbf{x}_{ep}\) which is the first subinterval endpoint for variable j.

  1. 1.

    Set \(\mathbf{sp}[1] = 1\).

  2. 2.

For \(j=1, \ldots , p-1\), set \(\mathbf{sp}[j+1] = \sum _{l=1}^j{s_{il}} + j + 1\).

\(\underline{\mathrm{Step~4:}}\) Initialize the matrix of observed vertices \(\mathbf{x}^v_i\) by letting \(\mathbf{x}^v_i\) be an (\(N_i \times p\)) matrix of zeros where \(N_i=\prod _{j=1}^p{(s_{ij}+1)}\).

\(\underline{\mathrm{Step~5:}}\) Update the elements of \(\mathbf{x}^v_i\) by

  1. 1.

    For \(j=1, \ldots , p\), do

    1. (a)

      Let \(nj = \mathbf{ns}[j]\).

    2. (b)

      Let \(rj = \mathbf{nr}[j+1]\).

    3. (c)

      Let \(sj = \mathbf{sp}[j]\).

    4. (d)

      For \(l= 0,\ldots , nj\),

      • For \(k = 1, \ldots , rj\),

      • set \(\mathbf{x}^v_i[l(rj) + k,j]=\mathbf{x}_{ep}[sj+l]\).

  2. 2.

    For \(j=2, \ldots , p\), do

    1. (a)

      Let \(tj = \frac{\mathbf{nr}[1]}{\mathbf{nr}[j]}-1\).

    2. (b)

      Let \(rj = \mathbf{nr}[j]\).

    3. (c)

      For \(l= 1,\ldots , tj\),

      • For \(k = 1, \ldots , rj\),

      • set \(\mathbf{x}^v_i[l(rj) + k,j]=\mathbf{x}^v_i[k,j]\).

End of Step 5. At the end of Step 5, we obtain the matrix \(\mathbf{x}^v_i\) whose rows are the coordinates of the vertices of observation i.
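The net effect of Steps 1–5 is to enumerate every combination of one subinterval endpoint per variable, with the last variable varying fastest. Under that reading, the vertex matrix can be sketched compactly in Python (a sketch equivalent to the steps above, not the authors' code; the endpoint values are illustrative):

```python
from itertools import product

def vertex_matrix(endpoints):
    """All combinations of one subinterval endpoint per variable, one
    variable per column, with the last variable varying fastest -- the
    same row order that Steps 1-5 above produce for x_i^v."""
    return [list(row) for row in product(*endpoints)]

endpoints = [[0.0, 5.0, 10.0], [20.0, 25.0, 30.0, 35.0]]  # illustrative values
xv = vertex_matrix(endpoints)
# N_i = prod_j (s_ij + 1) = 3 * 4 = 12 rows, p = 2 columns
print(len(xv), len(xv[0]))  # 12 2
print(xv[0], xv[1], xv[4])  # [0.0, 20.0] [0.0, 25.0] [5.0, 20.0]
```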

1.2 Constructing the polytopes

The following algorithm includes seven steps:

\(\underline{\mathrm{Step~1:}}\) First, compute the matrix of transformed vertices, \(\mathbf{y}_i^v (=\mathbf{x}^v_i \mathbf{e})\), for the polytope representing observation i in a principal components space.
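A minimal numpy sketch of this projection follows. Note that `cov` below is an arbitrary symmetric positive-definite placeholder standing in for the symbolic covariance matrix of the paper, and the vertex coordinates are illustrative:

```python
import numpy as np

# Placeholder for the symbolic covariance matrix; e collects its eigenvectors.
cov = np.array([[4.0, 1.2],
                [1.2, 2.0]])
eigvals, e = np.linalg.eigh(cov)   # columns of e are eigenvectors
order = np.argsort(eigvals)[::-1]  # sort PCs by decreasing variance
e = e[:, order]

# Illustrative vertex matrix x_i^v (4 vertices, p = 2).
xv = np.array([[0.0, 20.0], [0.0, 25.0], [5.0, 20.0], [5.0, 25.0]])
yv = xv @ e                        # transformed vertices y_i^v = x_i^v e
print(yv.shape)  # (4, 2)
```

Because `e` is orthogonal, the projection is a rotation: distances between vertices are preserved in PC space.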

\(\underline{\mathrm{Step~2:}}\) Next, create a three-dimensional array \(\mathbf{y}_i\) to store the transformed vertices that belong to a sub-polytope together. The array \(\mathbf{y}_i\) is a result of combining \(r_i= \prod ^p_{j=1}{s_{ij}}\) matrices \(\mathbf{y}^h_i\) where \(h = 1, \ldots , r_i\). Each matrix \(\mathbf{y}^h_i\) of dimension (\(2^p \times p\)) contains coordinates of all vertices that belong to sub-polytope h.

  1. 1.

    Initialize array \(\mathbf{y}_i\) by letting \(\mathbf{y}_i\) be an array of zeros with dimension (\(2^p \times p \times r_i\)).

  2. 2.

    Update the elements of \(\mathbf{y}_i\) by running the following nested loop,

    1. (a)

      Set \(kr_0 = 0\) and \(ni_0 = 0\).

    2. (b)

      For \(j = 1, \ldots , p-1\),

      • For \(l_j = 0, \ldots , s_{ij}-1\),

        1. i.

          Let \(kr_j = kr_{j-1}+(\mathbf{nr}[j+1])l_j\).

        2. ii.

          Let \(ni_j = ni_{j-1}+(\mathbf{nr}_p[j+1])l_j\).

        3. iii.

          For \(k = 1, \ldots , \mathbf{ns}[p]\),

          1. A.

            Set \(kr = kr_{p-1} + k\).

          2. B.

            Set \(ni = ni_{p-1} + k\).

          3. C.

            Set \(\mathbf{y}_i[1,,ni]=\mathbf{y}_i^v[kr,]\)

          4. D.

            For \(o = 1, \ldots , p\), do

            • For \(r = 1, \ldots , 2^{(o-1)}\),

            • set \(\mathbf{y}_i[2^{(o-1)}+r,,ni] = \mathbf{y}_i^v[kr[r] + \mathbf{nr}[p-o+2],]\) and

            • set \(kr = (kr,kr[r] + \mathbf{nr}[p-o+2])\).
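Read as a whole, Step 2 collects, for each sub-hyperrectangle (one subinterval chosen per variable), the \(2^p\) corner vertices that make up one sub-polytope. A Python sketch of that grouping, built directly from the endpoints rather than by indexing into \(\mathbf{y}_i^v\) (a sketch, not the authors' code; illustrative values):

```python
from itertools import product

def sub_polytope_vertices(endpoints):
    """For each sub-hyperrectangle h (one subinterval per variable), return
    its 2^p corner vertices -- the content of the h-th matrix of y_i above."""
    # Pairs (lower, upper) of consecutive endpoints, per variable.
    intervals = [list(zip(e[:-1], e[1:])) for e in endpoints]
    polys = []
    for combo in product(*intervals):                 # one subinterval per variable
        corners = [list(c) for c in product(*combo)]  # the 2^p corners
        polys.append(corners)
    return polys

endpoints = [[0.0, 5.0, 10.0], [20.0, 25.0, 30.0, 35.0]]  # illustrative values
polys = sub_polytope_vertices(endpoints)
print(len(polys), len(polys[0]))  # r_i = 2*3 = 6 sub-polytopes, 2^2 = 4 corners each
```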

\(\underline{\mathrm{Step~3:}}\) Next, reconstruct polytopes corresponding to sub-hyperrectangles of observation i by following the next two sub-steps.

\(\underline{\mathrm{Step~3-A.}}\) Construct the matrix of connected vertices \(\mathbf{C}\) associated with \(\mathbf{y}_i^v\) as follows:

  1. 1.

    Initialize \(\mathbf{C}\) as a \(2^p \times p\) matrix of zeros.

  2. 2.

    Update \(\mathbf{C}\) by doing the following step for \(j = 1, \dots ,p\),

    • For \(j_1 = 0, \dots , 2^{(j-1)}-1\), do

      • For \(j_2 = ((2^{(p-j+1)})j_1 + 1),\dots ,((2^{(p-j+1)})j_1 + 2^{(p-j)})\), set \(\mathbf{C}[j_2,j] = j_2 + 2^{(p-j)}\).

      • For \(j_2 = ((2^{(p-j+1)})j_1 + 2^{(p-j)} + 1),\dots ,((2^{(p-j+1)})j_1 + 2^{(p-j+1)})\),

      • set \(\mathbf{C}[j_2,j] = j_2 - 2^{(p-j)}\).
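The pattern behind Step 3-A is that, with the vertices numbered in binary (dimension 1 as the most significant bit), two vertices are connected along dimension j exactly when their labels differ in that one bit. A 0-based Python sketch of \(\mathbf{C}\) (a reading of the step, not the authors' code; adding 1 to every entry recovers the 1-based matrix above):

```python
def connected_vertices(p):
    """0-based analogue of the matrix C of Step 3-A: entry [v][j] is the
    vertex that differs from vertex v only in dimension j+1, obtained by
    flipping the corresponding bit of v's binary label."""
    return [[v ^ (1 << (p - 1 - j)) for j in range(p)] for v in range(2 ** p)]

C = connected_vertices(2)
print(C)  # [[2, 1], [3, 0], [0, 3], [1, 2]]
```

For p = 2 this matches the loop above: the 1-based matrix it produces is [[3, 2], [4, 1], [1, 4], [2, 3]].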

\(\underline{\mathrm{Step~3-B}}\) A p-dimensional plot of the polytopes is constructed in the principal component space by the following two steps:

  1. 1.

    Make a scatter plot of \(\mathbf{y}_i^v\).

  2. 2.

    Construct the vertices of each sub-polytope as follows, for each \(h = 1,\dots ,r_i\),

    • For \(v_1 = 1,\dots ,2^p\), do

    • for \(j_1 = 2,\dots ,p+1\),

    • set \(v_2 = \mathbf{C}[v_1,j_1]\), and

    • connect the points \(\mathbf{y}_i[v_1,h]\) and \(\mathbf{y}_i[v_2,h]\) with a line.

End of Step 3. We now have a plot of the \(r_i\) polytopes representing observation \(\mathbf{x}_i\), \(i = 1,\dots ,n\), in PC space. This step is an adaptation of Steps 3–4 for obtaining the polytope for interval-valued data; see Le-Rademacher (2008) and Le-Rademacher and Billard (2012, Supplemental Material).

At the end of Step 3, polytopes representing observation i in a principal component space are plotted. While these polytopes are now constructed, we recall that the densities of a histogram observation vary across the hyperrectangles. To create the vector of densities for these polytopes, follow the next four steps.

\(\underline{\mathrm{Step~4:}}\) Create a p-vector \(\mathbf{sp}_p\) whose j th element is the position of the element of \(\mathbf{x}_{rf}\) which is the first subinterval relative frequency for variable j.

  1. 1.

    Set \(\mathbf{sp}_p[1] = 1\).

  2. 2.

    For \(j=1, \ldots , p-1\),

    • set \(\mathbf{sp}_p[j+1] = \sum _{l=1}^j{s_{il}} + 1\).

\(\underline{\mathrm{Step~5:}}\) Let \(\mathbf{x}^v_p\) be an \((r_i \times p)\) matrix of relative frequencies. The row h of \(\mathbf{x}^v_p\) contains the relative frequencies of subintervals making up sub-hyperrectangle h. Initialize \(\mathbf{x}^v_p\) by setting all elements of \(\mathbf{x}^v_p\) to zeros.

\(\underline{\mathrm{Step~6:}}\) Update the elements of \(\mathbf{x}^v_p\) by

  1. 1.

    For \(j=1, \ldots , p\), do

    • Let \(nj = \mathbf{ns}[j]\).

    • Let \(rj = \mathbf{nr}_p[j+1]\).

    • Let \(sj = \mathbf{sp}_p[j]\).

    • For \(l= 0,\ldots , nj-1\),

      • For \(k = 1, \ldots , rj\),

      • set \(\mathbf{x}^v_p[l(rj) + k,j]=\mathbf{x}_{rf}[sj+l]\).

  2. 2.

    For \(j=2, \ldots , p\), do

    • Let \(tj = \frac{\mathbf{nr}_p[1]}{\mathbf{nr}_p[j]}-1\).

    • Let \(rj = \mathbf{nr}_p[j]\).

    • For \(l= 1,\ldots , tj\),

      • For \(k = 1, \ldots , rj\),

        • set \(\mathbf{x}^v_p[l(rj) + k,j]=\mathbf{x}^v_p[k,j]\).

\(\underline{\mathrm{Step~7:}}\) Let \(\mathbf{d}_i\) be an \(r_i\)-vector whose elements are densities of the sub-hyperrectangles belonging to observation i. The density for each sub-hyperrectangle is the product of relative frequencies of the p subintervals making up that sub-hyperrectangle. That is, for \(h=1, \ldots , r_i\), \(\mathbf{d}_i[h] = \prod ^p_{j=1}{\mathbf{x}^v_p[h,j]}\).

At the end of Step 7, we obtain a vector of densities \(\mathbf{d}_i\) whose h th element is the density of sub-hyperrectangle h of observation i.
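Steps 4–7 amount to taking, for each sub-hyperrectangle, the product of one relative frequency per variable, enumerated in the same order as the sub-polytopes (last variable fastest). A Python sketch (not the authors' code; illustrative frequencies):

```python
from itertools import product
from math import prod, isclose

def densities(rel_freqs):
    """Density of each sub-hyperrectangle: the product of the relative
    frequencies of the p subintervals making it up (Step 7)."""
    return [prod(combo) for combo in product(*rel_freqs)]

rel_freqs = [[0.4, 0.6], [0.2, 0.5, 0.3]]  # illustrative values
d = densities(rel_freqs)
print(len(d))                # r_i = 2 * 3 = 6
print(isclose(sum(d), 1.0))  # True: densities sum to 1 over the observation
```

Since the relative frequencies of each variable sum to 1, the densities over all \(r_i\) sub-hyperrectangles sum to 1 as well.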

1.3 Constructing two and three dimensional plots

Usually, visualization of the projections of observations onto the principal component space is limited to two dimensions, \(PC_{\nu _1} \times PC_{\nu _2}\). This is achieved by replacing substeps 1 and 2 in Step 3-B of the polytope algorithm ("Constructing the polytopes" above) by the following three substeps:

  1. 1.

    Let \(\mathbf{y}^{(2)}_i\) be the \(r_i2^p \times 2\) matrix whose first and second columns are, respectively, columns \(\nu _1\) and \(\nu _2\) of \(\mathbf{y}_i^v\).

  2. 2.

    Make a scatter plot of \(\mathbf{y}^{(2)}_i\).

  3. 3.

    Connect corresponding points of \(\mathbf{y}^{(2)}_i\) by using substep 2 of Step 3-B in "Constructing the polytopes", with \(\mathbf{y}_i^v\) replaced by \(\mathbf{y}^{(2)}_i\); now \(p=2\).

To construct a three-dimensional plot of \(PC_{\nu _1} \times PC_{\nu _2} \times PC_{\nu _3}\), follow the same three steps as here for constructing two-dimensional plots except that \(\mathbf{y}^{(2)}_i\) is replaced by \(\mathbf{y}^{(3)}_i\) where now, in substep 1, \(\mathbf{y}^{(3)}_i\) is an \((r_i2^p \times 3)\) matrix with columns \(\nu _1\), \(\nu _2\), and \(\nu _3\) of \(\mathbf{y}_i^v\). In substep 3, \(p=3\).
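As a sketch of the connect-the-points substep in two dimensions (not the authors' code; the corner coordinates are illustrative and the actual drawing call is left out), the line segments to draw for one sub-polytope can be listed as:

```python
def plot_segments_2d(y2, C):
    """Line segments for the 2-D polytope plot: connect each projected
    vertex to its partners in C (0-based analogue of Step 3-B, substep 2)."""
    segs = []
    for v1, partners in enumerate(C):
        for v2 in partners:
            if v1 < v2:  # list each edge only once
                segs.append((tuple(y2[v1]), tuple(y2[v2])))
    return segs

# 0-based connectivity matrix for p = 2, and illustrative projected corners
# of one sub-polytope in binary vertex order.
C = [[2, 1], [3, 0], [0, 3], [1, 2]]
y2 = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
segs = plot_segments_2d(y2, C)
print(len(segs))  # 4 edges of the quadrilateral
```

Each segment pair would then be passed to the plotting routine of choice.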

1.4 Constructing the PC histograms

The following algorithm constructs a histogram representing the \(\nu\)th principal component for observation i by first computing the PC histograms corresponding to the sub-polytopes of observation i, and then combining the \(r_i\) histograms into one \({PC}_{\nu }\) histogram representing observation i.

\(\underline{\mathrm{Step~1:}}\) Follow the algorithm of Le-Rademacher and Billard (2013) to create the (\(r_i \times 3s\)) matrix \(\mathbf{z}_{i \nu }\) whose hth row contains the subinterval endpoints and the relative frequencies for sub-polytope h as specified in Eq. (14). Here, elements \(3k\), for \(k=1, \ldots , s\), of \(\mathbf{z}_{i \nu }[h,]\) are the unadjusted relative frequencies of the histogram representing sub-polytope h.

\(\underline{\mathrm{Step~2:}}\) Update the relative frequencies from Step 1 by setting \(\mathbf{z}_{i \nu }[h, 3k] = \mathbf{d}_i[h]\,\mathbf{z}_{i \nu }[h, 3k]\), for \(h = 1, \ldots , r_i\) and \(k = 1, \ldots , s\).

\(\underline{\mathrm{Step~3:}}\) This next step combines the \(r_i\) histograms in \(\mathbf{z}_{i \nu }\) into one histogram with subintervals of equal width.

  1. 1.

    Let lo and hi be the lowest and the highest endpoints of the \(r_i\) histograms of observation i. Then, \(lo = min(\mathbf{{z}}_{i \nu }[,1])\) and \(hi = max(\mathbf{{z}}_{i \nu }[,3s-1])\).

  2. 2.

    Let sn denote the desired number of subintervals for the combined histogram. Then, each subinterval has width \(sw = (hi - lo)/sn\).

  3. 3.

    Let \(\mathbf {hm}\) be an (\(sn \times 3\)) transition matrix whose columns 1 and 2 contain the subinterval endpoints and column 3 contains the relative frequencies of the combined \({PC}_{\nu }\) histogram for observation i. Initialize \(\mathbf {hm}\) by setting its elements to zero.

  4. 4.

    Update \(\mathbf{hm}\) as follows. For \(t = 1, \ldots , sn\), do

    1. (a)

      Set the endpoints of subinterval t by letting \(\mathbf{{hm}}[t,1] = lo + (sw)(t-1)\) and \(\mathbf{{hm}}[t,2] = lo + (sw)t\).

    2. (b)

      Let \(\mathbf{fr}\) be an (\(r_i \times s\)) matrix whose (h, q) element is the proportion of subinterval q of sub-polytope h that falls within subinterval t. Initialize \(\mathbf{fr}\) by setting its elements to zero.

      • For \(h = 1,\ldots , r_i\), do

      • For \(q = 1,\ldots , s\), do

        \(\underline{\mathrm{Case~a:}}\) If \(\mathbf{z}_{i \nu }[h,3q-2] \ge \mathbf{hm}[t,1]\) and \(\mathbf{z}_{i \nu }[h, 3q-1] \le \mathbf{hm}[t,2]\), set \(\mathbf{fr}[h,q] = \mathbf{z}_{i \nu }[h,3q]\).

        \(\underline{\mathrm{Case~b:}}\) If \(\mathbf{z}_{i \nu }[h,3q-2] \ge \mathbf{hm}[t,1]\) and \(\mathbf{z}_{i \nu }[h, 3q-2] < \mathbf{hm}[t,2]\) and \(\mathbf{z}_{i \nu }[h, 3q-1] > \mathbf{hm}[t,2]\), set \(\mathbf{fr}[h,q] = \frac{(\mathbf{z}_{i \nu }[h,3q])(\mathbf{hm}[t,2]-\mathbf{z}_{i \nu }[h,3q-2])}{\mathbf{z}_{i \nu }[h,3q-1]-\mathbf{z}_{i \nu }[h,3q-2]}\).

        \(\underline{\mathrm{Case~c:}}\) If \(\mathbf{z}_{i \nu }[h,3q-2] < \mathbf{hm}[t,1]\) and \(\mathbf{z}_{i \nu }[h, 3q-1] > \mathbf{hm}[t,1]\) and \(\mathbf{z}_{i \nu }[h, 3q-1] \le \mathbf{hm}[t,2]\), set \(\mathbf{fr}[h,q] = \frac{(\mathbf{z}_{i \nu }[h,3q])(\mathbf{z}_{i \nu }[h,3q-1]-\mathbf{hm}[t,1])}{\mathbf{z}_{i \nu }[h,3q-1]-\mathbf{z}_{i \nu }[h,3q-2]}\).

        \(\underline{\mathrm{Case~d:}}\) If \(\mathbf{z}_{i \nu }[h,3q-2] < \mathbf{hm}[t,1]\) and \(\mathbf{z}_{i \nu }[h, 3q-1] > \mathbf{hm}[t,2]\), set \(\mathbf{fr}[h,q] = \frac{(\mathbf{z}_{i \nu }[h,3q])(\mathbf{hm}[t,2]-\mathbf{hm}[t,1])}{\mathbf{z}_{i \nu }[h,3q-1]-\mathbf{z}_{i \nu }[h,3q-2]}\).

    3. (c)

      Let \(\mathbf{{hm}}[t,3] = \sum _{h=1}^{r_i}{\sum _{q=1}^s{\mathbf{{fr}}[h,q]}}\).

  5. 5.

    Let \(sh = \sum ^{sn}_{t=1}{\mathbf{hm}[t,3]}\).

  6. 6.

    Update \(\mathbf{hm}[t,3] =\mathbf{hm}[t,3]/sh\), for \(t = 1, \ldots , sn\).
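Cases a–d above all reduce to a single overlap rule: the mass a subinterval contributes to a bin is its (density-weighted) relative frequency times the fraction of the subinterval lying inside the bin. A Python sketch of Step 3 under that reading (not the authors' code; the input triples are illustrative and already density-weighted):

```python
def rebin(histo, sn):
    """Combine weighted subintervals into sn equal-width bins.
    'histo' is a list of (a, b, w) triples: subinterval [a, b] with mass w.
    Each bin receives w times the fraction of [a, b] overlapping it."""
    lo = min(a for a, _, _ in histo)
    hi = max(b for _, b, _ in histo)
    sw = (hi - lo) / sn
    out = []
    for t in range(sn):
        b_lo, b_hi = lo + sw * t, lo + sw * (t + 1)
        mass = sum(w * max(0.0, min(b, b_hi) - max(a, b_lo)) / (b - a)
                   for a, b, w in histo)
        out.append((b_lo, b_hi, mass))
    total = sum(m for _, _, m in out)
    return [(a, b, m / total) for a, b, m in out]  # normalise (substeps 5-6)

# Two overlapping sub-polytope histograms flattened into one triple list.
histo = [(0.0, 2.0, 0.3), (2.0, 4.0, 0.3), (1.0, 3.0, 0.4)]
bins = rebin(histo, 4)
print([round(m, 3) for _, _, m in bins])  # [0.15, 0.35, 0.35, 0.15]
```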

At the end of this step, we have the subinterval endpoints and the relative frequencies for the combined histogram. Let \(\mathbf{pc}_{\nu }\) be the (\(n \times 3sn\)) matrix whose ith row contains the \({PC}_{\nu }\) histogram for observation i. Then, for \(t = 1,\ldots , sn\), do

  1. 1.

    Let \(\mathbf{{pc}}_{\nu }[i,3t-2] = \mathbf{{hm}}[t, 1]\).

  2. 2.

    Let \(\mathbf{{pc}}_{\nu }[i,3t-1] = \mathbf{{hm}}[t, 2]\).

  3. 3.

    Let \(\mathbf{{pc}}_{\nu }[i,3t] = \mathbf{{hm}}[t, 3]\).

This step concludes the histogram algorithm. Repeat these steps for all observations.


Cite this article

Le-Rademacher, J., Billard, L. Principal component analysis for histogram-valued data. Adv Data Anal Classif 11, 327–351 (2017). https://doi.org/10.1007/s11634-016-0255-9
