
Principal component analysis for histogram-valued data

Regular Article · Advances in Data Analysis and Classification

Abstract

This paper introduces a principal component methodology for analysing histogram-valued data within the symbolic data domain; no comparable method currently exists for this type of data. The proposed method uses a symbolic covariance matrix to determine the principal component space. The resulting observations are represented in principal component space as polytopes for visualization, and a numerical representation of the resulting polytopes via histogram-valued output is also presented. The necessary algorithms are included. The technique is illustrated on a weather data set.


References

  • Anderson TW (1963) Asymptotic theory for principal components analysis. Ann Math Stat 34:122–148

  • Anderson TW (1984) An introduction to multivariate statistical analysis, 2nd edn. John Wiley, New York

  • Bertrand P, Goupil F (2000) Descriptive statistics for symbolic data. In: Bock H-H, Diday E (eds) Analysis of symbolic data: exploratory methods for extracting statistical information from complex data. Springer, Berlin, pp 103–124

  • Billard L (2008) Sample covariance functions for complex quantitative data. In: Mizuta M, Nakano J (eds) Proceedings World Congress, International Association of Statistical Computing. Japanese Society of Computational Statistics, Japan, pp 157–163

  • Billard L (2011) Brief overview of symbolic data and analytic issues. Stat Anal Data Min 4:149–156

  • Billard L, Diday E (2003) From the statistics of data to the statistics of knowledge: symbolic data analysis. J Am Stat Assoc 98:470–487

  • Billard L, Diday E (2006) Symbolic data analysis: conceptual statistics and data mining. John Wiley, Chichester

  • Billard L, Guo JH, Xu W (2011) Maximum likelihood estimators for bivariate interval-valued data. Technical Report, University of Georgia, Athens, GA, under revision

  • Billard L, Le-Rademacher J (2013) Symbolic principal components for interval-valued data. Revue des Nouvelles Technologies de l'Information 25:31–40

  • Bock HH, Diday E (2000) Analysis of symbolic data: exploratory methods for extracting statistical information from complex data. Springer, Berlin

  • Cazes P (2002) Analyse factorielle d'un tableau de lois de probabilité. Rev Stat Appl 50:5–24

  • Cazes P, Chouakria A, Diday E, Schecktman Y (1997) Extension de l'analyse en composantes principales à des données de type intervalle. Rev Stat Appl 45:5–24

  • Chouakria A (1998) Extension des méthodes d'analyse factorielle à des données de type intervalle. Thèse de doctorat, Université Paris Dauphine, Paris

  • Douzal-Chouakria A, Billard L, Diday E (2011) Principal component analysis for interval-valued observations. Stat Anal Data Min 4:229–246

  • Ichino M (2011) The quantile method for symbolic principal component analysis. Stat Anal Data Min 4:184–198

  • Irpino A, Lauro C, Verde R (2003) Visualizing symbolic data by closed shapes. In: Schader M, Gaul W, Vichi M (eds) Between data science and applied data analysis. Springer, Berlin, pp 244–251

  • Johnson RA, Wichern DW (2002) Applied multivariate statistical analysis, 5th edn. Prentice Hall, New Jersey

  • Jolliffe IT (2004) Principal component analysis, 2nd edn. Springer, New York

  • Lauro NC, Palumbo F (2000) Principal component analysis of interval data: a symbolic data analysis approach. Comput Stat 15:73–87

  • Lauro NC, Verde R, Irpino A (2008) Principal component analysis of symbolic data described by intervals. In: Diday E, Noirhomme-Fraiture M (eds) Symbolic data analysis and the SODAS software. Wiley, Chichester, pp 279–311

  • Le-Rademacher J (2008) Principal component analysis for interval-valued and histogram-valued data and likelihood functions and some maximum likelihood estimators for symbolic data. Doctoral dissertation, University of Georgia

  • Le-Rademacher J, Billard L (2011) Likelihood functions and some maximum likelihood estimators for symbolic data. J Stat Plan Inference 141:1593–1602

  • Le-Rademacher J, Billard L (2012) Symbolic-covariance principal component analysis and visualization for interval-valued data. J Comput Graph Stat 21:413–432

  • Le-Rademacher J, Billard L (2013) Principal component histograms from interval-valued observations. Comput Stat 28:2117–2138

  • Makosso-Kallyth S, Diday E (2012) Adaptation of interval PCA to symbolic histogram variables. Adv Data Anal Classif 6:147–159

  • Mardia KV, Kent JT, Bibby JM (1979) Multivariate analysis. Academic Press, New York

  • Palumbo F, Lauro NC (2003) A PCA for interval-valued data based on midpoints and radii. In: Yanai H, Okada A, Shigemasu K, Kano Y, Meulman J (eds) New developments in psychometrics. Springer, Tokyo, pp 641–648

  • Shapiro AF (2009) Fuzzy random variables. Insur Math Econ 44:307–314

  • Xu W (2010) Symbolic data analysis: interval-valued data regression. PhD thesis, University of Georgia

  • Zadeh LA (1965) Fuzzy sets. Inf Control 8:338–353

  • Zadeh LA (1968) Probability measures of fuzzy events. J Math Anal Appl 23:421–427


Author information

Corresponding author

Correspondence to L. Billard.


Appendix: Algorithm

The algorithm to construct the polytope representation of the observations in principal component space has essentially two parts. The first part ("Constructing the matrix of vertices") builds the matrices of vertices needed to construct the actual polytopes; the construction of the polytopes per se is then described in "Constructing the polytopes". Extensions to two- and three-dimensional polytope plots are given in "Constructing two and three dimensional plots", and the algorithm to compute the histograms from the resulting polytopes is given in "Constructing the PC histograms". The indexing notation used in these algorithms is similar to that of the R language: the position of an element of a vector, a matrix, or an array is specified in a pair of square brackets, [ ]. The index of a vector element is enclosed in the brackets; an element of a matrix is specified by a pair of numbers separated by a comma, the first specifying the row and the second the column; and an element of an array is specified by three numbers separated by commas, corresponding to row, column, and matrix, respectively. Also, we use lower case to represent an observed data matrix [e.g., \(\mathbf{x}_i^v\), to distinguish it from the random data matrix \(\mathbf{{X}}_i^v\) of Eq. (8)].

1.1 Constructing the matrix of vertices

First, assume that the observed data vector \(\mathbf{x}_i\) has been separated into a vector of subinterval endpoints and a vector of the relative frequencies. That is, let \(\mathbf{x}_{ep}\) be the vector of subinterval endpoints and \(\mathbf{x}_{rf}\) be the vector of subinterval relative frequencies. Then, \(\mathbf{x}_{ep}\) has \(\sum ^p_{j=1}{(s_{ij}+1)}\) elements and has the form

$$\begin{aligned} \mathbf{x}_{ep} = \left[ \begin{array}{lcccccccccr} a_{i11}&\ldots&a_{i1(s_{i1}+1)}&\ldots&a_{ip1}&\ldots&a_{ip(s_{ip}+1)}\end{array} \right] \end{aligned}$$

where \(a_{ijk}\), for \(k=1,\ldots , s_{ij}+1\) and \(j = 1, \ldots , p\), are elements of the set \(E_{ij}\). The vector \(\mathbf{x}_{rf}\) has \(\sum ^p_{j=1}{s_{ij}}\) elements and has the form

$$\begin{aligned} \mathbf{x}_{rf} = \left[ \begin{array}{lccccccccr} p_{i11}&\ldots&p_{i1s_{i1}}&\ldots&p_{ip1}&\ldots&p_{ips_{ip}} \end{array} \right] \end{aligned}$$

where \(p_{ijk}\) is the relative frequency of the kth subinterval of the observed histogram \(x_{ij}\). Before creating the matrix of vertices for observation i, we also need a p-vector whose elements are the numbers of subintervals of the \(X_{ij}\). Let \(\mathbf{ns}\) denote this vector. Then,

$$\begin{aligned} \mathbf{ns} = \left[ \begin{array}{lcr} s_{i1}&\ldots&s_{ip} \end{array} \right] . \end{aligned}$$
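As an illustration, the flattening into \(\mathbf{x}_{ep}\), \(\mathbf{x}_{rf}\), and \(\mathbf{ns}\) can be sketched in Python for a two-variable histogram observation (a minimal sketch, not the authors' code; the endpoint and frequency values are made up):

```python
# Illustrative histogram observation with p = 2 variables:
# variable 1 has two subintervals, variable 2 has three.
endpoints = [[0.0, 5.0, 10.0], [20.0, 25.0, 30.0, 35.0]]  # a_{ijk}, k = 1..s_ij+1
rel_freqs = [[0.4, 0.6], [0.2, 0.5, 0.3]]                 # p_{ijk}, k = 1..s_ij

x_ep = [a for var in endpoints for a in var]  # sum_j (s_ij + 1) elements
x_rf = [f for var in rel_freqs for f in var]  # sum_j s_ij elements
ns = [len(var) for var in rel_freqs]          # (s_i1, ..., s_ip)

print(x_ep)  # [0.0, 5.0, 10.0, 20.0, 25.0, 30.0, 35.0]
print(x_rf)  # [0.4, 0.6, 0.2, 0.5, 0.3]
print(ns)    # [2, 3]
```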

With the information in \(\mathbf{x}_{ep}\), \(\mathbf{x}_{rf}\), and \(\mathbf{ns}\), we can proceed with constructing the matrix of vertices \(\mathbf{x}^v_i\) using the following five steps:

\(\underline{Step~1:}\) Create a (\(p+1\))-vector \(\mathbf{nr}\) whose \((j+1)^{\mathrm{{th}}}\) element, for \(j=1,\ldots , p\), is the number of times that points \(a_{ijk}\), for \(k = 1, \ldots , s_{ij}+1\), must be repeated in Step 5 below. The first element of \(\mathbf{nr}\) is the number of rows of the matrix of observed vertices, \(\mathbf{x}^v_i\).

  1. 1.

    For \(j=1, \ldots , p\), set \(\mathbf{nr}[p-j+1] = \prod _{l=p-j+1}^{p}{(s_{il}+1)}\).

  2. 2.

    Set \(\mathbf{nr}[p+1] = 1\).

\(\underline{\mathrm{Step~2:}}\) Create a (\(p+1\))-vector \(\mathbf{nr}_p\) whose \((j+1)^{th}\) element, for \(j=1,\ldots , p\), is the number of sub-hyperrectangles present in observation i when all variables up to j are excluded.

  1. 1.

    For \(j=1, \ldots , p\), set \(\mathbf{nr}_p[p-j+1] = \prod _{l=p-j+1}^{p}{s_{il}}\).

  2. 2.

    Set \(\mathbf{nr}_p[p+1] = 1\).

\(\underline{\mathrm{Step~3:}}\) Create a p-vector \(\mathbf{sp}\) whose j th element is the position of the element of \(\mathbf{x}_{ep}\) which is the first subinterval endpoint for variable j.

  1. 1.

    Set \(\mathbf{sp}[1] = 1\).

  2. 2.

For \(j=1, \ldots , p-1\), set \(\mathbf{sp}[j+1] = \sum _{l=1}^j{s_{il}} + j + 1\).

\(\underline{\mathrm{Step~4:}}\) Initialize the matrix of observed vertices \(\mathbf{x}^v_i\) by letting \(\mathbf{x}^v_i\) be an (\(N_i \times p\)) matrix of zeros where \(N_i=\prod _{j=1}^p{(s_{ij}+1)}\).

\(\underline{\mathrm{Step~5:}}\) Update the elements of \(\mathbf{x}^v_i\) by

  1. 1.

    For \(j=1, \ldots , p\), do

    1. (a)

      Let \(nj = \mathbf{ns}[j]\).

    2. (b)

      Let \(rj = \mathbf{nr}[j+1]\).

    3. (c)

      Let \(sj = \mathbf{sp}[j]\).

    4. (d)

      For \(l= 0,\ldots , nj\),

      • For \(k = 1, \ldots , rj\),

      • set \(\mathbf{x}^v_i[l(rj) + k,j]=\mathbf{x}_{ep}[sj+l]\).

  2. 2.

    For \(j=2, \ldots , p\), do

    1. (a)

      Let \(tj = \frac{\mathbf{nr}[1]}{\mathbf{nr}[j]}-1\).

    2. (b)

      Let \(rj = \mathbf{nr}[j]\).

    3. (c)

      For \(l= 1,\ldots , tj\),

      • For \(k = 1, \ldots , rj\),

      • set \(\mathbf{x}^v_i[l(rj) + k,j]=\mathbf{x}^v_i[k,j]\).

End of Step 5. At the end of Step 5, we obtain the matrix \(\mathbf{x}^v_i\) whose rows are the coordinates of the vertices of observation i.
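The net effect of Steps 1–5 is to enumerate every combination of one subinterval endpoint per variable, with the last variable varying fastest. Under that reading, the vertex matrix can be sketched compactly in Python (a sketch equivalent to the steps above, not the authors' code; the endpoint values are illustrative):

```python
from itertools import product

def vertex_matrix(endpoints):
    """All combinations of one subinterval endpoint per variable, one
    variable per column, with the last variable varying fastest -- the
    same row order that Steps 1-5 above produce for x_i^v."""
    return [list(row) for row in product(*endpoints)]

endpoints = [[0.0, 5.0, 10.0], [20.0, 25.0, 30.0, 35.0]]  # illustrative values
xv = vertex_matrix(endpoints)
# N_i = prod_j (s_ij + 1) = 3 * 4 = 12 rows, p = 2 columns
print(len(xv), len(xv[0]))  # 12 2
print(xv[0], xv[1], xv[4])  # [0.0, 20.0] [0.0, 25.0] [5.0, 20.0]
```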

1.2 Constructing the polytopes

The following algorithm includes seven steps:

\(\underline{\mathrm{Step~1:}}\) First, compute the matrix of transformed vertices, \(\mathbf{y}_i^v (=\mathbf{x}^v_i \mathbf{e})\), for the polytope representing observation i in a principal components space.
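A minimal numpy sketch of this projection follows. Note that `cov` below is an arbitrary symmetric positive-definite placeholder standing in for the symbolic covariance matrix of the paper, and the vertex coordinates are illustrative:

```python
import numpy as np

# Placeholder for the symbolic covariance matrix; e collects its eigenvectors.
cov = np.array([[4.0, 1.2],
                [1.2, 2.0]])
eigvals, e = np.linalg.eigh(cov)   # columns of e are eigenvectors
order = np.argsort(eigvals)[::-1]  # sort PCs by decreasing variance
e = e[:, order]

# Illustrative vertex matrix x_i^v (4 vertices, p = 2).
xv = np.array([[0.0, 20.0], [0.0, 25.0], [5.0, 20.0], [5.0, 25.0]])
yv = xv @ e                        # transformed vertices y_i^v = x_i^v e
print(yv.shape)  # (4, 2)
```

Because `e` is orthogonal, the projection is a rotation: distances between vertices are preserved in PC space.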

\(\underline{\mathrm{Step~2:}}\) Next, create a three-dimensional array \(\mathbf{y}_i\) to store the transformed vertices that belong to a sub-polytope together. The array \(\mathbf{y}_i\) is a result of combining \(r_i= \prod ^p_{j=1}{s_{ij}}\) matrices \(\mathbf{y}^h_i\) where \(h = 1, \ldots , r_i\). Each matrix \(\mathbf{y}^h_i\) of dimension (\(2^p \times p\)) contains coordinates of all vertices that belong to sub-polytope h.

  1. 1.

    Initialize array \(\mathbf{y}_i\) by letting \(\mathbf{y}_i\) be an array of zeros with dimension (\(2^p \times p \times r_i\)).

  2. 2.

    Update the elements of \(\mathbf{y}_i\) by running the following nested loop,

    1. (a)

      Set \(kr_0 = 0\) and \(ni_0 = 0\).

    2. (b)

      For \(j = 1, \ldots , p-1\),

      • For \(l_j = 0, \ldots , s_{ij}-1\),

        1. i.

          Let \(kr_j = kr_{j-1}+(\mathbf{nr}[j+1])l_j\).

        2. ii.

          Let \(ni_j = ni_{j-1}+(\mathbf{nr}_p[j+1])l_j\).

        3. iii.

          For \(k = 1, \ldots , \mathbf{ns}[p]\),

          1. A.

            Set \(kr = kr_{p-1} + k\).

          2. B.

            Set \(ni = ni_{p-1} + k\).

          3. C.

            Set \(\mathbf{y}_i[1,,ni]=\mathbf{y}_i^v[kr,]\)

          4. D.

            For \(o = 1, \ldots , p\), do

            • For \(r = 1, \ldots , 2^{(o-1)}\),

            • set \(\mathbf{y}_i[2^{(o-1)}+r,,ni] = \mathbf{y}_i^v[kr[r] + \mathbf{nr}[p-o+2],]\) and

            • set \(kr = (kr,kr[r] + \mathbf{nr}[p-o+2])\).
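Read as a whole, Step 2 collects, for each sub-hyperrectangle (one subinterval chosen per variable), the \(2^p\) corner vertices that make up one sub-polytope. A Python sketch of that grouping, built directly from the endpoints rather than by indexing into \(\mathbf{y}_i^v\) (a sketch, not the authors' code; illustrative values):

```python
from itertools import product

def sub_polytope_vertices(endpoints):
    """For each sub-hyperrectangle h (one subinterval per variable), return
    its 2^p corner vertices -- the content of the h-th matrix of y_i above."""
    # Pairs (lower, upper) of consecutive endpoints, per variable.
    intervals = [list(zip(e[:-1], e[1:])) for e in endpoints]
    polys = []
    for combo in product(*intervals):                 # one subinterval per variable
        corners = [list(c) for c in product(*combo)]  # the 2^p corners
        polys.append(corners)
    return polys

endpoints = [[0.0, 5.0, 10.0], [20.0, 25.0, 30.0, 35.0]]  # illustrative values
polys = sub_polytope_vertices(endpoints)
print(len(polys), len(polys[0]))  # r_i = 2*3 = 6 sub-polytopes, 2^2 = 4 corners each
```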

\(\underline{\mathrm{Step~3:}}\) Next, reconstruct polytopes corresponding to sub-hyperrectangles of observation i by following the next two sub-steps.

\(\underline{\mathrm{Step~3-A.}}\) Construct the matrix of connected vertices \(\mathbf{C}\) associated with \(\mathbf{y}_i^v\) as follows:

  1. 1.

    Initialize \(\mathbf{C}\) as a \(2^p \times p\) matrix of zeros.

  2. 2.

    Update \(\mathbf{C}\) by doing the following step for \(j = 1, \dots ,p\),

    • For \(j_1 = 0, \dots , 2^{(j-1)}-1\), do

      • For \(j_2 = ((2^{(p-j+1)})j_1 + 1),\dots ,((2^{(p-j+1)})j_1 + 2^{(p-j)})\), set \(\mathbf{C}[j_2,j] = j_2 + 2^{(p-j)}\).

      • For \(j_2 = ((2^{(p-j+1)})j_1 + 2^{(p-j)} + 1),\dots ,((2^{(p-j+1)})j_1 + 2^{(p-j+1)})\),

      • set \(\mathbf{C}[j_2,j] = j_2 - 2^{(p-j)}\).
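The pattern behind Step 3-A is that, with the vertices numbered in binary (dimension 1 as the most significant bit), two vertices are connected along dimension j exactly when their labels differ in that one bit. A 0-based Python sketch of \(\mathbf{C}\) (a reading of the step, not the authors' code; adding 1 to every entry recovers the 1-based matrix above):

```python
def connected_vertices(p):
    """0-based analogue of the matrix C of Step 3-A: entry [v][j] is the
    vertex that differs from vertex v only in dimension j+1, obtained by
    flipping the corresponding bit of v's binary label."""
    return [[v ^ (1 << (p - 1 - j)) for j in range(p)] for v in range(2 ** p)]

C = connected_vertices(2)
print(C)  # [[2, 1], [3, 0], [0, 3], [1, 2]]
```

For p = 2 this matches the loop above: the 1-based matrix it produces is [[3, 2], [4, 1], [1, 4], [2, 3]].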

\(\underline{\mathrm{Step~3-B}}\) A p-dimensional plot of the polytopes is constructed in the principal component space by the following two steps:

  1. 1.

    Make a scatter plot of \(\mathbf{y}_i^v\).

  2. 2.

    Construct the vertices of each sub-polytope as follows, for each \(h = 1,\dots ,r_i\),

    • For \(v_1 = 1,\dots ,2^p\), do

    • for \(j_1 = 2,\dots ,p+1\),

    • set \(v_2 = \mathbf{C}[v_1,j_1]\), and

    • connect the points \(\mathbf{y}_i[v_1,h]\) and \(\mathbf{y}_i[v_2,h]\) with a line.

End of Step 3. We now have a plot of the \(r_i\) polytopes representing observation \(\mathbf{x}_i\), \(i = 1,\dots ,n\), in PC space. This step is an adaptation of Steps 3–4 for obtaining the polytope for interval-valued data; see Le-Rademacher (2008) and Le-Rademacher and Billard (2012, Supplemental Material).

At the end of Step 3, polytopes representing observation i in a principal component space are plotted. While these polytopes are now constructed, we recall that the densities of a histogram observation vary across the hyperrectangles. To create the vector of densities for these polytopes, follow the next four steps.

\(\underline{\mathrm{Step~4:}}\) Create a p-vector \(\mathbf{sp}_p\) whose j th element is the position of the element of \(\mathbf{x}_{rf}\) which is the first subinterval relative frequency for variable j.

  1. 1.

    Set \(\mathbf{sp}_p[1] = 1\).

  2. 2.

    For \(j=1, \ldots , p-1\),

    • set \(\mathbf{sp}_p[j+1] = \sum _{l=1}^j{s_{il}} + 1\).

\(\underline{\mathrm{Step~5:}}\) Let \(\mathbf{x}^v_p\) be an \((r_i \times p)\) matrix of relative frequencies. The row h of \(\mathbf{x}^v_p\) contains the relative frequencies of subintervals making up sub-hyperrectangle h. Initialize \(\mathbf{x}^v_p\) by setting all elements of \(\mathbf{x}^v_p\) to zeros.

\(\underline{\mathrm{Step~6:}}\) Update the elements of \(\mathbf{x}^v_p\) by

  1. 1.

    For \(j=1, \ldots , p\), do

    • Let \(nj = \mathbf{ns}[j]\).

    • Let \(rj = \mathbf{nr}_p[j+1]\).

    • Let \(sj = \mathbf{sp}_p[j]\).

    • For \(l= 0,\ldots , nj-1\),

      • For \(k = 1, \ldots , rj\),

      • set \(\mathbf{x}^v_p[l(rj) + k,j]=\mathbf{x}_{rf}[sj+l]\).

  2. 2.

    For \(j=2, \ldots , p\), do

    • Let \(tj = \frac{\mathbf{nr}_p[1]}{\mathbf{nr}_p[j]}-1\).

    • Let \(rj = \mathbf{nr}_p[j]\).

    • For \(l= 1,\ldots , tj\),

      • For \(k = 1, \ldots , rj\),

        • set \(\mathbf{x}^v_p[l(rj) + k,j]=\mathbf{x}^v_p[k,j]\).

\(\underline{\mathrm{Step~7:}}\) Let \(\mathbf{d}_i\) be an \(r_i\)-vector whose elements are densities of the sub-hyperrectangles belonging to observation i. The density for each sub-hyperrectangle is the product of relative frequencies of the p subintervals making up that sub-hyperrectangle. That is, for \(h=1, \ldots , r_i\), \(\mathbf{d}_i[h] = \prod ^p_{j=1}{\mathbf{x}^v_p[h,j]}\).

At the end of Step 7, we obtain a vector of densities \(\mathbf{d}_i\) whose h th element is the density of sub-hyperrectangle h of observation i.
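Steps 4–7 amount to taking, for each sub-hyperrectangle, the product of one relative frequency per variable, enumerated in the same order as the sub-polytopes (last variable fastest). A Python sketch (not the authors' code; illustrative frequencies):

```python
from itertools import product
from math import prod, isclose

def densities(rel_freqs):
    """Density of each sub-hyperrectangle: the product of the relative
    frequencies of the p subintervals making it up (Step 7)."""
    return [prod(combo) for combo in product(*rel_freqs)]

rel_freqs = [[0.4, 0.6], [0.2, 0.5, 0.3]]  # illustrative values
d = densities(rel_freqs)
print(len(d))                # r_i = 2 * 3 = 6
print(isclose(sum(d), 1.0))  # True: densities sum to 1 over the observation
```

Since the relative frequencies of each variable sum to 1, the densities over all \(r_i\) sub-hyperrectangles sum to 1 as well.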

1.3 Constructing two and three dimensional plots

Usually, visualization of the projections of observations onto the principal component space is limited to two dimensions, \(PC_{\nu _1} \times PC_{\nu _2}\). This is achieved by replacing substeps 1 and 2 in Step 3-B of the polytope algorithm ("Constructing the polytopes" above) by the following three substeps:

  1. 1.

    Let \(\mathbf{y}^{(2)}_i\) be the \(r_i2^p \times 2\) matrix whose first and second columns are, respectively, columns \(\nu _1\) and \(\nu _2\) of \(\mathbf{y}_i^v\).

  2. 2.

    Make a scatter plot of \(\mathbf{y}^{(2)}_i\).

  3. 3.

    Connect corresponding points of \(\mathbf{y}^{(2)}_i\) by using substep 2 of Step 3-B in "Constructing the polytopes", with \(\mathbf{y}_i^v\) replaced by \(\mathbf{y}^{(2)}_i\); now \(p=2\).

To construct a three-dimensional plot of \(PC_{\nu _1} \times PC_{\nu _2} \times PC_{\nu _3}\), follow the same three steps as here for constructing two-dimensional plots except that \(\mathbf{y}^{(2)}_i\) is replaced by \(\mathbf{y}^{(3)}_i\) where now, in substep 1, \(\mathbf{y}^{(3)}_i\) is an \((r_i2^p \times 3)\) matrix with columns \(\nu _1\), \(\nu _2\), and \(\nu _3\) of \(\mathbf{y}_i^v\). In substep 3, \(p=3\).
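As a sketch of the connect-the-points substep in two dimensions (not the authors' code; the corner coordinates are illustrative and the actual drawing call is left out), the line segments to draw for one sub-polytope can be listed as:

```python
def plot_segments_2d(y2, C):
    """Line segments for the 2-D polytope plot: connect each projected
    vertex to its partners in C (0-based analogue of Step 3-B, substep 2)."""
    segs = []
    for v1, partners in enumerate(C):
        for v2 in partners:
            if v1 < v2:  # list each edge only once
                segs.append((tuple(y2[v1]), tuple(y2[v2])))
    return segs

# 0-based connectivity matrix for p = 2, and illustrative projected corners
# of one sub-polytope in binary vertex order.
C = [[2, 1], [3, 0], [0, 3], [1, 2]]
y2 = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
segs = plot_segments_2d(y2, C)
print(len(segs))  # 4 edges of the quadrilateral
```

Each segment pair would then be passed to the plotting routine of choice.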

1.4 Constructing the PC histograms

The following algorithm constructs a histogram representing the \(\nu\)th principal component for observation i by first computing the PC histograms corresponding to the sub-polytopes of observation i, and then combining the \(r_i\) histograms into one \({PC}_{\nu }\) histogram representing observation i.

\(\underline{\mathrm{Step~1:}}\) Follow the algorithm of Le-Rademacher and Billard (2013) to create the (\(r_i \times 3s\)) matrix \(\mathbf{z}_{i \nu }\) whose hth row contains the subinterval endpoints and the relative frequencies for sub-polytope h as specified in Eq. (14). Here, elements \(3k\), for \(k=1, \ldots , s\), of \(\mathbf{z}_{i \nu }[h,]\) are the unadjusted relative frequencies of the histogram representing sub-polytope h.

\(\underline{\mathrm{Step~2:}}\) Update the relative frequencies from Step 1 by setting \(\mathbf{z}_{i \nu }[h, 3k] = \mathbf{d}_i[h]\,\mathbf{z}_{i \nu }[h, 3k]\), for \(h = 1, \ldots , r_i\) and \(k = 1, \ldots , s\).

\(\underline{\mathrm{Step~3:}}\) This next step combines the \(r_i\) histograms in \(\mathbf{z}_{i \nu }\) into one histogram with subintervals of equal width.

  1. 1.

    Let lo and hi be the lowest and the highest endpoints of the \(r_i\) histograms of observation i. Then, \(lo = min(\mathbf{{z}}_{i \nu }[,1])\) and \(hi = max(\mathbf{{z}}_{i \nu }[,3s-1])\).

  2. 2.

    Let sn denote the desired number of subintervals for the combined histogram. Then, each subinterval has width \(sw = (hi - lo)/sn\).

  3. 3.

    Let \(\mathbf {hm}\) be an (\(sn \times 3\)) transition matrix whose columns 1 and 2 contain the subinterval endpoints and column 3 contains the relative frequencies of the combined \({PC}_{\nu }\) histogram for observation i. Initialize \(\mathbf {hm}\) by setting its elements to zero.

  4. 4.

    Update \(\mathbf{hm}\) as follows. For \(t = 1, \ldots , sn\), do

    1. (a)

      Set the endpoints of subinterval t by letting \(\mathbf{{hm}}[t,1] = lo + (sw)(t-1)\) and \(\mathbf{{hm}}[t,2] = lo + (sw)t\).

    2. (b)

      Let \(\mathbf{fr}\) be an (\(r_i \times s\)) matrix whose (h, q) element is the proportion of subinterval q of sub-polytope h that falls within subinterval t. Initialize \(\mathbf{fr}\) by setting its elements to zero.

      • For \(h = 1,\ldots , r_i\), do

      • For \(q = 1,\ldots , s\), do

        \(\underline{\mathrm{Case~a:}}\) If \(\mathbf{z}_{i \nu }[h,3q-2] \ge \mathbf{hm}[t,1]\) and \(\mathbf{z}_{i \nu }[h, 3q-1] \le \mathbf{hm}[t,2]\), set \(\mathbf{fr}[h,q] = \mathbf{z}_{i \nu }[h,3q]\).

        \(\underline{\mathrm{Case~b:}}\) If \(\mathbf{z}_{i \nu }[h,3q-2] \ge \mathbf{hm}[t,1]\) and \(\mathbf{z}_{i \nu }[h, 3q-2] < \mathbf{hm}[t,2]\) and \(\mathbf{z}_{i \nu }[h, 3q-1] > \mathbf{hm}[t,2]\), set \(\mathbf{fr}[h,q] = \frac{(\mathbf{z}_{i \nu }[h,3q])(\mathbf{hm}[t,2]-\mathbf{z}_{i \nu }[h,3q-2])}{\mathbf{z}_{i \nu }[h,3q-1]-\mathbf{z}_{i \nu }[h,3q-2]}\).

        \(\underline{\mathrm{Case~c:}}\) If \(\mathbf{z}_{i \nu }[h,3q-2] < \mathbf{hm}[t,1]\) and \(\mathbf{z}_{i \nu }[h, 3q-1] > \mathbf{hm}[t,1]\) and \(\mathbf{z}_{i \nu }[h, 3q-1] \le \mathbf{hm}[t,2]\), set \(\mathbf{fr}[h,q] = \frac{(\mathbf{z}_{i \nu }[h,3q])(\mathbf{z}_{i \nu }[h,3q-1]-\mathbf{hm}[t,1])}{\mathbf{z}_{i \nu }[h,3q-1]-\mathbf{z}_{i \nu }[h,3q-2]}\).

        \(\underline{\mathrm{Case~d:}}\) If \(\mathbf{z}_{i \nu }[h,3q-2] < \mathbf{hm}[t,1]\) and \(\mathbf{z}_{i \nu }[h, 3q-1] > \mathbf{hm}[t,2]\), set \(\mathbf{fr}[h,q] = \frac{(\mathbf{z}_{i \nu }[h,3q])(\mathbf{hm}[t,2]-\mathbf{hm}[t,1])}{\mathbf{z}_{i \nu }[h,3q-1]-\mathbf{z}_{i \nu }[h,3q-2]}\).

    3. (c)

      Let \(\mathbf{{hm}}[t,3] = \sum _{h=1}^{r_i}{\sum _{q=1}^s{\mathbf{{fr}}[h,q]}}\).

  5. 5.

    Let \(sh = \sum ^{sn}_{t=1}{\mathbf{hm}[t,3]}\).

  6. 6.

    Update \(\mathbf{hm}[t,3] =\mathbf{hm}[t,3]/sh\), for \(t = 1, \ldots , sn\).
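Cases a–d above all reduce to a single overlap rule: the mass a subinterval contributes to a bin is its (density-weighted) relative frequency times the fraction of the subinterval lying inside the bin. A Python sketch of Step 3 under that reading (not the authors' code; the input triples are illustrative and already density-weighted):

```python
def rebin(histo, sn):
    """Combine weighted subintervals into sn equal-width bins.
    'histo' is a list of (a, b, w) triples: subinterval [a, b] with mass w.
    Each bin receives w times the fraction of [a, b] overlapping it."""
    lo = min(a for a, _, _ in histo)
    hi = max(b for _, b, _ in histo)
    sw = (hi - lo) / sn
    out = []
    for t in range(sn):
        b_lo, b_hi = lo + sw * t, lo + sw * (t + 1)
        mass = sum(w * max(0.0, min(b, b_hi) - max(a, b_lo)) / (b - a)
                   for a, b, w in histo)
        out.append((b_lo, b_hi, mass))
    total = sum(m for _, _, m in out)
    return [(a, b, m / total) for a, b, m in out]  # normalise (substeps 5-6)

# Two overlapping sub-polytope histograms flattened into one triple list.
histo = [(0.0, 2.0, 0.3), (2.0, 4.0, 0.3), (1.0, 3.0, 0.4)]
bins = rebin(histo, 4)
print([round(m, 3) for _, _, m in bins])  # [0.15, 0.35, 0.35, 0.15]
```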

At the end of this step, we have the subinterval endpoints and the relative frequencies for the combined histogram. Let \(\mathbf{pc}_{\nu }\) be the (\(n \times 3sn\)) matrix whose ith row contains the \({PC}_{\nu }\) histogram for observation i. Then, for \(t = 1,\ldots , sn\), do

  1. 1.

    Let \(\mathbf{{pc}}_{\nu }[i,3t-2] = \mathbf{{hm}}[t, 1]\).

  2. 2.

    Let \(\mathbf{{pc}}_{\nu }[i,3t-1] = \mathbf{{hm}}[t, 2]\).

  3. 3.

    Let \(\mathbf{{pc}}_{\nu }[i,3t] = \mathbf{{hm}}[t, 3]\).

This step concludes the histogram algorithm. Repeat these steps for all observations.


Cite this article

Le-Rademacher, J., Billard, L. Principal component analysis for histogram-valued data. Adv Data Anal Classif 11, 327–351 (2017). https://doi.org/10.1007/s11634-016-0255-9
