Principal component analysis on interval data

Gioia, Federica; Lauro, Carlo N.

doi:10.1007/s00180-006-0267-6

Published: 01 June 2006

Computational Statistics, Volume 21, pages 343–363 (2006)

Federica Gioia¹ &
Carlo N. Lauro¹

Summary

Real-world data analysis is often affected by different types of error, such as measurement errors, computation errors, and imprecision related to the method adopted for estimating the data.

The uncertainty in the data, which is strictly connected to the above errors, may be treated by considering, rather than a single value for each datum, the interval of values in which it may fall: interval data. Statistical units described by interval data can be regarded as a special case of Symbolic Object (SO). In Symbolic Data Analysis (SDA), these data are represented as boxes. Accordingly, the purpose of the present work is to extend Principal Component Analysis (PCA) to obtain a visualisation of such boxes in a lower-dimensional space, pointing out the relationships among the variables, among the units, and between the two. The aim is to use, when possible, the tools of interval algebra to adapt the mathematical models underlying classical PCA to the case in which an interval data matrix is given. The proposed method has been tested on a real data set, and the numerical results, which are in agreement with the theory, are reported.

Notes

¹ The method works with intervals that are small, in the sense of the ratio between the radius and the coordinate of the centre of each interval. Empirically, it has been observed that this ratio must be approximately 2–3%.
² Since the α-th eigenvalue of Θ is computed by perturbing the α-th eigenvalue of (X^c)’X^c, the ordering of the interval eigenvalues is given by the natural ordering of the corresponding scalar eigenvalues of (X^c)’X^c.
¹ The absolute contributions on the first axes range from the interval [0, 0.91] for Linseed to the interval [0, 0.16] for Sesame; this reflects the “size” of the individuals on the first axes.
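
The size condition in the first note can be checked numerically before applying the method. The following is a minimal sketch, not part of the original article: the function name, the example intervals and the 3% cut-off used as a hard threshold are all assumptions made for illustration.

```python
def radius_to_centre_ratios(intervals):
    """Return radius / |centre| for each (lower, upper) interval."""
    ratios = []
    for lower, upper in intervals:
        centre = (lower + upper) / 2.0
        radius = (upper - lower) / 2.0
        if centre == 0:
            raise ValueError("ratio undefined for an interval centred at 0")
        ratios.append(radius / abs(centre))
    return ratios


# Hypothetical interval values, not taken from the paper's data set.
example = [(98.0, 102.0), (49.0, 51.0), (9.0, 11.0)]
ratios = radius_to_centre_ratios(example)

# Treating 3% as a hard threshold is an assumption; the note only says
# the ratio should be "approximately 2-3%".
small_enough = [r <= 0.03 for r in ratios]
```

Here the first two intervals have a ratio of 2% and pass the check, while the third has a ratio of 10% and does not.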

References

Alefeld, G. & Herzberger, J. (1983), ‘Introduction to Interval Computations’, Academic Press, New York.
Billard, L. & Diday, E. (2002), ‘Symbolic Regression Analysis’, Proceedings IFCS, in: Krzysztof Jajuga et al. (eds.), Data Analysis, Classification and Clustering Methods, Springer-Verlag, Heidelberg.
Billard, L. & Diday, E. (2000), ‘Regression Analysis for Interval-Valued Data’, in: Data Analysis, Classification and Related Methods (eds. H.-H. Bock and E. Diday), Springer, 103–124.
Burkill, J. C. (1924), ‘Functions of Intervals’, Proceedings of the London Mathematical Society, 22, 375–446.
Canal, L. & Pereira, M. (1998), ‘Towards statistical indices for numeroid data’, in: Proceedings of the NTTS’98 Seminar, Sorrento, Italy.
Cazes, P., Chouakria, A., Diday, E. & Schektman, Y. (1997), ‘Extension de l’analyse en composantes principales à des données de type intervalle’, Revue de Statistique Appliquée, XLV, 3, 5–24.
Chouakria, A. (1998), ‘Extension des méthodes d’analyse factorielle à des données de type intervalle’, Paris IX Dauphine.
Chouakria, A., Diday, E. & Cazes, P. (1998), ‘An improved factorial representation of symbolic objects’, in: KESDA’98, April, Luxembourg.
Deif, A. S. (1991a), ‘The Interval Eigenvalue Problem’, ZAMM 71 (1), 61–64, Akademie-Verlag, Berlin.
Deif, A. S. (1991b), ‘Singular Values of an Interval Matrix’, Linear Algebra and its Applications 151, 125–133.
Deif, A. S. & Rohn, J. (1994), ‘On the Invariance of the Sign Pattern of Matrix Eigenvectors Under Perturbation’, Linear Algebra and its Applications 196, 63–70.
Gioia, F. (2001), ‘Statistical Methods for Interval Variables’, Ph.D. thesis, Dip. di Matematica e Statistica, Università di Napoli “Federico II”, in Italian.
Gioia, F. & Lauro, C. (2005), ‘Basic Statistical Methods for Interval Data’, Statistica Applicata, 17 (1). In press.
Kearfott, R. B. & Kreinovich, V. (eds.) (1996), ‘Applications of Interval Computations’, Kluwer Academic Publishers.
Lauro, C. N. & Palumbo, F. (2000), ‘Principal component analysis of interval data: A symbolic data analysis approach’, Computational Statistics, 15 (1), 73–87.
Lauro, C. N., Verde, R. & Palumbo, F. (2000), ‘Factorial methods with cohesion constraints on symbolic objects’, in: IFCS’00.
Marino, M. & Palumbo, F. (2003), ‘Interval arithmetic for the evaluation of imprecise data effects in least squares linear regression’, Statistica Applicata, 3.
Moore, R. E. (1966), ‘Interval Analysis’, Prentice Hall, Englewood Cliffs, NJ.
Neumaier, A. (1990), ‘Interval methods for systems of equations’, Cambridge University Press, Cambridge.
Palumbo, F. & Lauro, C. N. (2003), ‘A PCA for interval valued data based on midpoints and radii’, in: New developments in Psychometrics, Yanai H. et al. eds., Psychometric Society, Springer-Verlag, Tokyo.
Rodriguez, O. (2000), ‘Classification et Modèles Linéaires en Analyse des Données Symboliques’, Doctoral Thesis, Université de Paris Dauphine IX.
Rohn, J. (1993), ‘Interval Matrices: Singularity and real eigenvalues’, SIAM J. Matrix Anal. Appl., 14, 82–91.
Seif, N. P., Hashem, S. & Deif, A. S. (1992), ‘Bounding the Eigenvectors for Symmetric Interval Matrices’, ZAMM 72, 233–236.
Sunaga, T. (1958), ‘Theory of an Interval Algebra and its Application to Numerical Analysis’, Gakujutsu Bunken Fukyu-kai, Tokyo.
Young, R. C. (1931), ‘The algebra of many-valued quantities’, Math. Ann. 104, 260–290.

Author information

Authors and Affiliations

Dipartimento di Matematica e Statistica, Università degli Studi di Napoli “Federico II”, Napoli, Italy
Federica Gioia & Carlo N. Lauro

Appendix

Given two single-valued variables X_r = (x_ir), X_s = (x_is), i = 1, …, n, the correlation between X_r and X_s may be computed as follows:

$$corr\left(X_{r}, X_{s}\right)=h\left(x_{1, r}, \ldots, x_{n, r} ; x_{1, s}, \ldots, x_{n, s}\right)=\frac{cov\left(X_{r}, X_{s}\right)}{\sqrt{var\left(X_{r}\right)} \sqrt{var\left(X_{s}\right)}}$$

(1)

Let us consider now the following interval-valued variables:

$$X_{r}^{I}=\left(X_{i r}=\left[\underline{x}_{i r}, \overline{x}_{i r}\right]\right)_{i}, \quad X_{s}^{I}=\left(X_{i s}=\left[\underline{x}_{i s}, \overline{x}_{i s}\right]\right)_{i}, \quad i=1, \ldots, n$$

the interval correlation is computed as follows (Gioia & Lauro 2005):

$$Corr\left(X_{r}^{I}, X_{s}^{I}\right)=\left[\min_{\substack{x_{i r} \in X_{i r} \\ x_{i s} \in X_{i s} \\ i=1, \ldots, n}} h\left(x_{1, r}, \ldots, x_{n, r} ; x_{1, s}, \ldots, x_{n, s}\right), \ \max_{\substack{x_{i r} \in X_{i r} \\ x_{i s} \in X_{i s} \\ i=1, \ldots, n}} h\left(x_{1, r}, \ldots, x_{n, r} ; x_{1, s}, \ldots, x_{n, s}\right)\right]$$

where h(x_1,r,…,x_n,r;x_1,s,…,x_n,s) is the function in (1).
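
To make the definition concrete, the sketch below estimates the interval correlation by evaluating h at a finite set of points of the box: the midpoints, the box vertices (for small n) and uniform random points. This is only an illustration and not the optimisation procedure of Gioia & Lauro; because finitely many points are evaluated, the result is an inner estimate contained in the exact interval. The function names and the example intervals are hypothetical.

```python
import itertools
import math
import random


def corr(xs, ys):
    """Classical correlation h of equation (1) for two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n
    vx = sum((x - mx) ** 2 for x in xs) / n
    vy = sum((y - my) ** 2 for y in ys) / n
    return cov / math.sqrt(vx * vy)


def interval_corr_estimate(X_r, X_s, samples=2000, seed=0):
    """Inner estimate of [min h, max h] over the box of interval values.

    X_r and X_s are lists of (lower, upper) pairs of equal length n.
    """
    rng = random.Random(seed)
    boxes = list(X_r) + list(X_s)
    n = len(X_r)
    # Candidate points: midpoints, vertices (small boxes only), random points.
    candidates = [[(lo + hi) / 2.0 for lo, hi in boxes]]
    if len(boxes) <= 12:
        candidates.extend(list(v) for v in itertools.product(*boxes))
    for _ in range(samples):
        candidates.append([rng.uniform(lo, hi) for lo, hi in boxes])
    values = []
    for point in candidates:
        xs, ys = point[:n], point[n:]
        # Skip points where a variable is constant (zero variance).
        if len(set(xs)) > 1 and len(set(ys)) > 1:
            values.append(corr(xs, ys))
    return min(values), max(values)


# Hypothetical interval data, not taken from the paper's data set.
X_r = [(1.0, 1.2), (2.0, 2.1), (2.9, 3.3), (4.0, 4.4)]
X_s = [(2.0, 2.3), (2.9, 3.1), (4.2, 4.4), (4.8, 5.3)]
lo, hi = interval_corr_estimate(X_r, X_s)
```

When every interval degenerates to a single point, the estimate collapses to the classical correlation of equation (1).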

Analogously, given the single-valued variable X_r, the standardized variable S_r=(s_ir)_i of X_r is given by:

$$s_{i r}=\frac{x_{i r}-\overline{x}_{r}}{\sqrt{n\,\sigma_{r}^{2}}}, \quad i=1, \ldots, n$$

(2)

where $\overline{x}_{r}$ and $\sigma_{r}^{2}$ are the mean and the variance of X_r, respectively.

When an interval-valued variable $X_{r}^{I}$ is given, following the same approach as Gioia & Lauro (2005), the component s_ir in (2), for each i=1,…,n, becomes the following function:

$$s_{i r}\left(x_{1 r}, \ldots, x_{n r}\right)=\frac{x_{i r}-\overline{x}_{r}}{\sqrt{n\,\sigma_{r}^{2}}}$$

(3)

as x_ir varies in $\left[\underline{x}_{i r}, \overline{x}_{i r}\right]$, i=1,…,n. The standardized interval component $s_{i r}^{I}$ of $X_{r}^{I}$ may be computed by minimizing/maximizing function (3), i.e. by calculating the following interval:

$$s_{i r}^{I}=\left[\min_{\substack{x_{i r} \in X_{i r} \\ i=1, \ldots, n}} s_{i r}\left(x_{1 r}, \ldots, x_{n r}\right), \ \max_{\substack{x_{i r} \in X_{i r} \\ i=1, \ldots, n}} s_{i r}\left(x_{1 r}, \ldots, x_{n r}\right)\right]$$

(4)
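
As an illustration of (4), the sketch below standardizes one variable and then estimates each interval component by evaluating the standardized values at the midpoints and at random points of the box. It assumes the divisor is the square root of n times the variance (that is, √n times the standard deviation), so that each standardized column has unit norm. The sampling scheme, function names and data are hypothetical and are not the authors' procedure; each estimated interval is contained in the exact one.

```python
import math
import random


def standardize(xs):
    """Scalar standardization, assuming the divisor sqrt(n * variance)."""
    n = len(xs)
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs) / n  # population variance
    scale = math.sqrt(n * var)
    return [(x - mean) / scale for x in xs]


def standardized_intervals_estimate(X_r, samples=5000, seed=0):
    """Inner estimate of the intervals s_ir^I, i = 1..n, for one variable.

    X_r is a list of (lower, upper) pairs.
    """
    rng = random.Random(seed)
    points = [[(lo + hi) / 2.0 for lo, hi in X_r]]  # midpoints first
    points += [[rng.uniform(lo, hi) for lo, hi in X_r] for _ in range(samples)]
    n = len(X_r)
    lows, highs = [math.inf] * n, [-math.inf] * n
    for xs in points:
        if len(set(xs)) < 2:  # zero variance: standardization undefined
            continue
        for i, s in enumerate(standardize(xs)):
            lows[i] = min(lows[i], s)
            highs[i] = max(highs[i], s)
    return list(zip(lows, highs))


# Hypothetical interval-valued variable, not from the paper's data set.
X_r = [(1.0, 1.2), (2.0, 2.1), (2.9, 3.3), (4.0, 4.4)]
S_r = standardized_intervals_estimate(X_r)
```

Note that every component s_ir depends on all n values through the mean and the variance, which is why the minimum and maximum in (4) run over the whole box and not only over the i-th interval.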

$s_{i r}^{I}$ in (4) is the interval of values that the standardized component s_ir may take when each component x_ir ranges over its interval of values. To compute the interval standardized matrix S^I of an n×p interval matrix X^I, interval (4) is computed for each i=1,…,n and each r=1,…,p. Given a real matrix X with standardized matrix S, the product matrix is defined as $S S^{\prime}=\left(s s^{\prime}_{i j}\right)$. Given an interval matrix X^I, the product of S^I by its transpose is not computed by the interval matrix product $S^{I}\left(S^{I}\right)^{\prime}$ but by minimizing/maximizing each component of SS’ as each x_ij varies in its interval of values X_ij. The interval matrix $\left(S S^{\prime}\right)^{I}=\left(\left(s s_{i j}^{\prime}\right)^{I}\right)$ is:

$$\left(s s_{i j}^{\prime}\right)^{I}=\left[\min_{\substack{x_{i j} \in X_{i j} \\ i=1, \ldots, n}} ss^{\prime}_{i j}\left(x_{i j}, \ldots, x_{n j}\right), \ \max_{\substack{x_{i j} \in X_{i j} \\ i=1, \ldots, n}} ss^{\prime}_{i j}\left(x_{i j}, \ldots, x_{n j}\right)\right]$$
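
One reason for preferring componentwise minimization/maximization over a plain interval matrix product is the dependency problem of interval arithmetic: when the same quantity appears more than once in an expression, naive interval operations treat each occurrence as independent and return an interval that is wider than the true range. The following minimal sketch shows this on a single interval; the `Interval` class is a hypothetical helper written for this illustration, not a library API.

```python
class Interval:
    """Closed interval [lo, hi] with naive interval arithmetic."""

    def __init__(self, lo, hi):
        self.lo, self.hi = lo, hi

    def __sub__(self, other):
        return Interval(self.lo - other.hi, self.hi - other.lo)

    def __mul__(self, other):
        products = [self.lo * other.lo, self.lo * other.hi,
                    self.hi * other.lo, self.hi * other.hi]
        return Interval(min(products), max(products))

    def as_tuple(self):
        return (self.lo, self.hi)


def true_range_of_square(lo, hi):
    """Exact range of x*x for x in [lo, hi]."""
    upper = max(lo * lo, hi * hi)
    lower = 0.0 if lo <= 0.0 <= hi else min(lo * lo, hi * hi)
    return (lower, upper)


x = Interval(-1.0, 2.0)
naive_square = (x * x).as_tuple()               # (-2.0, 4.0): too wide
exact_square = true_range_of_square(-1.0, 2.0)  # (0.0, 4.0)
naive_diff = (x - x).as_tuple()                 # (-3.0, 3.0); true range is {0}
```

The entries of SS’ contain products and differences of quantities that all depend on the same x_ij, so a naive interval product would overestimate their ranges in the same way.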

About this article

Cite this article

Gioia, F., Lauro, C.N. Principal component analysis on interval data. Computational Statistics 21, 343–363 (2006). https://doi.org/10.1007/s00180-006-0267-6

Published: 01 June 2006
Issue Date: June 2006
DOI: https://doi.org/10.1007/s00180-006-0267-6
