Principal component analysis on interval data

Computational Statistics

Summary

Real-world data analysis is often affected by different types of error, such as measurement errors, computation errors, and imprecision related to the method adopted for estimating the data.

The uncertainty in the data, which is strictly connected to the above errors, may be treated by considering, rather than a single value for each datum, the interval of values in which it may fall: the interval datum. Statistical units described by interval data can be regarded as a special case of Symbolic Object (SO). In Symbolic Data Analysis (SDA), these data are represented as boxes. Accordingly, the purpose of the present work is to extend Principal Component Analysis (PCA) to obtain a visualisation of such boxes in a lower-dimensional space, pointing out the relationships among the variables, among the units, and between the two. The aim is to use, when possible, the tools of interval algebra to adapt the mathematical models underlying classical PCA to the case in which an interval data matrix is given. The proposed method has been tested on a real data set, and the numerical results, which are in agreement with the theory, are reported.

Notes

  1. 1 The method works with intervals that are small, as measured by the ratio between the radius and the coordinate of the centre of each interval. Empirically, it has been observed that this ratio must be approximately 2–3%.

  2. 2 Since the α-th eigenvalue of Θ is computed by perturbing the α-th eigenvalue of (Xc)’Xc, the ordering of the interval eigenvalues is given by the natural ordering of the corresponding scalar eigenvalues of (Xc)’Xc.

  3. 1 The absolute contributions on the first axes range from the interval [0, 0.91] for Linseed to the interval [0, 0.16] for Sesame; this reflects the “size” of the individuals on the first axes.
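The condition in the first note can be turned into a simple screening step. The following Python sketch computes the ratio of radius to absolute centre for each interval and flags the intervals above a threshold. The function names, the 3% threshold (taken here as the upper end of the empirical 2–3% range), the treatment of a zero centre, and the sample values are all assumptions made for illustration; they are not part of the method itself.

```python
def relative_radius(lower, upper):
    """Return radius / |centre| for one interval [lower, upper]."""
    if lower > upper:
        raise ValueError("lower bound exceeds upper bound")
    centre = (lower + upper) / 2.0
    radius = (upper - lower) / 2.0
    if centre == 0.0:
        # Assumption: a zero centre makes the ratio undefined, so any
        # non-degenerate interval there is treated as infinitely wide.
        return float("inf") if radius > 0.0 else 0.0
    return radius / abs(centre)


def wide_intervals(matrix, threshold=0.03):
    """List (row, column, ratio) for every interval whose ratio exceeds threshold."""
    flagged = []
    for i, row in enumerate(matrix):
        for j, (lower, upper) in enumerate(row):
            ratio = relative_radius(lower, upper)
            if ratio > threshold:
                flagged.append((i, j, ratio))
    return flagged


# Hypothetical 2 x 2 interval data matrix (values invented for illustration).
data = [
    [(98.0, 102.0), (49.5, 50.5)],
    [(190.0, 210.0), (75.0, 76.0)],
]
flagged = wide_intervals(data)
print(flagged)
```

Intervals reported by `wide_intervals` are the ones for which the small-interval assumption of the method may not hold.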

References

  • Alefeld, G. & Herzberger, J. (1983), ‘Introduction to Interval Computations’, Academic Press, New York.

  • Billard, L. & Diday, E. (2002), ‘Symbolic regression Analysis’, Proceedings IFCS, in: Data Analysis, Classification and Clustering Methods (eds. K. Jajuga et al.), Springer-Verlag, Heidelberg.

  • Billard, L. & Diday, E. (2000), ‘Regression Analysis for Interval-Valued Data’, in: Data Analysis, Classification and Related Methods (eds. H.-H. Bock and E. Diday), Springer, 103–124.

  • Burkill, J. C. (1924), ‘Functions of Intervals’, Proceedings of the London Mathematical Society, 22, 375–446.

  • Canal, L. & Pereira, M. (1998), ‘Towards statistical indices for numeroid data’, in: Proceedings of the NTTS’98 Seminar, Sorrento Italy.

  • Cazes, P., Chouakria, A., Diday, E. & Schektman, Y. (1997), ‘Extension de l’analyse en composantes principales à des données de type intervalle’, Revue de Statistique Appliquée, XIV, 3, 5–24.

  • Chouakria, A. (1998), ‘Extension des méthodes d’analyse factorielle à des données de type intervalle’, Paris IX Dauphine.

  • Chouakria, A., Diday, E. & Cazes, P. (1998), ‘An improved factorial representation of symbolic objects’, in: KESDA’98 April, Luxembourg.

  • Deif, A. S. (1991a), ‘The Interval Eigenvalue Problem’, ZAMM 71 (1), 61–64, Akademie-Verlag, Berlin.

  • Deif, A. S. (1991b), ‘Singular Values of an Interval Matrix’, Linear Algebra and its Applications 151, 125–133.

  • Deif, A. S. & Rohn, J. (1994), ‘On the Invariance of the Sign Pattern of Matrix Eigenvectors Under Perturbation’, Linear Algebra and its Applications 196, 63–70.

  • Gioia, F. (2001), ‘Statistical Methods for Interval Variables’, Ph.D. thesis, Dip. di Matematica e Statistica-Università di Napoli “Federico II”, in Italian.

  • Gioia, F. & Lauro, C. (2005), ‘Basic Statistical Methods for Interval Data’, Statistica Applicata, 17 (1). In press.

  • Kearfott, R. B. & Kreinovich, V. (Eds.) (1996), ‘Applications Of Interval Computations’, Kluwer Academic Publishers.

  • Lauro, C. N. & Palumbo, F. (2000), ‘Principal component analysis of interval data: A symbolic data analysis approach’, Computational Statistics, 15 (1), 73–87.

  • Lauro, C. N., Verde, R. & Palumbo, F. (2000), ‘Factorial methods with cohesion constraints on symbolic objects’, in: IFCS’00.

  • Marino, M. & Palumbo, F. (2003), ‘Interval arithmetic for the evaluation of imprecise data effects in least squares linear regression’, Statistica Applicata, 3.

  • Moore, R. E. (1966), ‘Interval Analysis’, Prentice Hall, Englewood Cliffs, NJ.

  • Neumaier, A. (1990), ‘Interval methods for systems of equations’, Cambridge University Press, Cambridge.

  • Palumbo, F. & Lauro, C.N. (2003), ‘A PCA for interval valued data based on midpoints and radii’, in: New developments in Psychometrics, Yanai H. et al. eds., Psychometric Society, Springer-Verlag, Tokyo.

  • Rodriguez, O. (2000), ‘Classification et Modèles Linéaires en Analyse des Données Symboliques’, Doctoral Thesis, Université Paris IX Dauphine.

  • Rohn, J. (1993), ‘Interval Matrices: Singularity and Real Eigenvalues’, SIAM J. Matrix Anal. Appl. 14, 82–91.

  • Seif, N. P., Hashem, S. & Deif, A. S. (1992), ‘Bounding the Eigenvectors for Symmetric Interval Matrices’, ZAMM 72, 233–236.

  • Sunaga, T. (1958), ‘Theory of an Interval Algebra and its Application to Numerical Analysis’, Gakujutsu Bunken Fukyu-kai, Tokyo.

  • Young, R. C. (1931), ‘The algebra of many-valued quantities’, Math. Ann. 104, 260–290.

Appendix

Given two single-valued variables Xr = (xir) and Xs = (xis), i = 1,…,n, the correlation between Xr and Xs may be computed as follows:

$$\operatorname{corr}\left(X_{r}, X_{s}\right)=h\left(x_{1, r}, \ldots, x_{n, r} ; x_{1, s}, \ldots, x_{n, s}\right)=\frac{\operatorname{cov}\left(X_{r}, X_{s}\right)}{\sqrt{\operatorname{var}\left(X_{r}\right)} \sqrt{\operatorname{var}\left(X_{s}\right)}} \tag{1}$$

Let us consider now the following interval-valued variables:

$$X_{r}^{I}=\left(X_{i r}=\left[\underline{x}_{i r}, \overline{x}_{i r}\right]\right)_{i}, \quad X_{s}^{I}=\left(X_{i s}=\left[\underline{x}_{i s}, \overline{x}_{i s}\right]\right)_{i}, \quad i=1, \ldots, n$$

the interval correlation is computed as follows (Gioia & Lauro 2005):

$$\operatorname{Corr}\left(X_{r}^{I}, X_{s}^{I}\right)=\left[\min_{\substack{x_{i r} \in X_{i r} \\ x_{i s} \in X_{i s} \\ i=1, \ldots, n}} h\left(x_{1, r}, \ldots, x_{n, r} ; x_{1, s}, \ldots, x_{n, s}\right),\ \max_{\substack{x_{i r} \in X_{i r} \\ x_{i s} \in X_{i s} \\ i=1, \ldots, n}} h\left(x_{1, r}, \ldots, x_{n, r} ; x_{1, s}, \ldots, x_{n, s}\right)\right]$$

where h(x1,r,…,xn,r; x1,s,…,xn,s) is the function in (1).
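As a rough illustration of the definition above, the following Python sketch evaluates h at every vertex of the box and keeps the smallest and largest values. This is a swapped-in simplification, not the optimisation used by the authors: vertex enumeration only gives an inner estimate of the interval correlation, because the extremes of h need not be attained at vertices, and it is practical only for very small n. The function names and sample intervals are hypothetical.

```python
import itertools
import math


def corr(xs, ys):
    """Classical correlation h of equation (1), using 1/n moments."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n
    vx = sum((x - mx) ** 2 for x in xs) / n
    vy = sum((y - my) ** 2 for y in ys) / n
    return cov / (math.sqrt(vx) * math.sqrt(vy))


def interval_corr_vertices(x_int, y_int):
    """Min and max of h over the vertices of the box (an inner estimate)."""
    values = []
    for xs in itertools.product(*x_int):
        for ys in itertools.product(*y_int):
            # Skip vertices where one variable is constant (zero variance).
            if len(set(xs)) > 1 and len(set(ys)) > 1:
                values.append(corr(xs, ys))
    return min(values), max(values)


# Hypothetical interval-valued variables with n = 4 units (invented values).
x_int = [(1.0, 1.2), (2.0, 2.3), (3.0, 3.1), (4.0, 4.4)]
y_int = [(2.0, 2.2), (2.9, 3.5), (4.1, 4.3), (4.8, 5.0)]
lo, hi = interval_corr_vertices(x_int, y_int)
print(lo, hi)
```

When every interval degenerates to a single point, the two bounds coincide with the classical correlation in (1). A bounded numerical optimiser started from several points would be a natural replacement for the vertex loop when the exact extremes are required.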

Analogously, given the single-valued variable Xr, the standardized variable Sr = (sir)i of Xr is given by:

$$s_{i r}=\frac{x_{i r}-\overline{x}_{r}}{\sqrt{n} \cdot \sigma_{r}^{2}}, \quad i=1, \ldots, n \tag{2}$$

where \(\overline{x}_{r}\) and \(\sigma_{r}^{2}\) are the mean and the variance of Xr respectively.

When an interval-valued variable \(X_{r}^{I}\) is given, following the approach of Gioia & Lauro (2005), the component sir in (2), for each i=1,…,n, becomes the following function:

$$s_{i r}\left(x_{1 r}, \ldots, x_{n r}\right)=\frac{x_{i r}-\overline{x}_{r}}{\sqrt{n} \cdot \sigma_{r}^{2}} \tag{3}$$

as each xir varies in \(\left[\underline{x}_{i r}, \overline{x}_{i r}\right]\), i=1,…,n. The standardized interval component \(s_{i r}^{I}\) of \(X_{r}^{I}\) may be computed by minimizing and maximizing function (3), i.e. by calculating the following interval:

$$s_{i r}^{I}=\left[\min_{\substack{x_{i r} \in X_{i r} \\ i=1, \ldots, n}} s_{i r}\left(x_{1 r}, \ldots, x_{n r}\right),\ \max_{\substack{x_{i r} \in X_{i r} \\ i=1, \ldots, n}} s_{i r}\left(x_{1 r}, \ldots, x_{n r}\right)\right] \tag{4}$$
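The interval in (4) can be approximated for a small example with the Python sketch below, which evaluates the standardized component at every vertex of the box of one interval-valued variable. Two assumptions are made and should be checked against the intended definition: the denominator is taken as √n times the standard deviation (the square root of the variance, computed with 1/n), which is the scaling under which the standardized columns have unit sum of squares; and vertex enumeration is used in place of a true optimisation, so the result is an inner estimate. The function names and sample intervals are hypothetical.

```python
import itertools
import math


def standardize(column):
    """Standardize one real column: (x - mean) / (sqrt(n) * sd), 1/n variance.

    Assumption: sd is the square root of the variance; change this helper
    if a different normalisation is intended.
    """
    n = len(column)
    mean = sum(column) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in column) / n)
    return [(x - mean) / (math.sqrt(n) * sd) for x in column]


def standardized_interval(column_int, i):
    """Vertex estimate of the interval (4) for unit i of one interval column."""
    values = []
    for vertex in itertools.product(*column_int):
        if len(set(vertex)) > 1:  # skip vertices with zero variance
            values.append(standardize(vertex)[i])
    return min(values), max(values)


# Hypothetical interval-valued variable with n = 4 units (invented values).
column_int = [(1.0, 1.2), (2.0, 2.3), (3.0, 3.1), (4.0, 4.4)]
s_int = [standardized_interval(column_int, i) for i in range(len(column_int))]
for bounds in s_int:
    print(bounds)
```

Each entry of `s_int` is a pair of lower and upper bounds for one unit; repeating the computation for every column gives a vertex estimate of the interval standardized matrix.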

\(s_{i r}^{I}\) in (4) is the interval of values that the standardized component sir may take when each component xir ranges over its interval. To compute the interval standardized matrix SI of an n×p interval matrix XI, interval (4) is computed for each i=1,…,n and each r=1,…,p.

Given a real matrix X, and denoting by S the standardized matrix of X, the product matrix is defined as \(S S^{\prime}=\left(s s^{\prime}_{i j}\right)\). Given an interval matrix XI, the product of SI by its transpose is not computed by the interval matrix product \(S^{I}\left(S^{I}\right)^{\prime}\), but by minimizing and maximizing each component of SS’ as each xij varies in its interval of values. The interval matrix \(\left(S S^{\prime}\right)^{I}=\left(\left(s s_{i j}^{\prime}\right)^{I}\right)\) is:

$$\left(s s_{i j}^{\prime}\right)^{I}=\left[\min_{\substack{x_{i j} \in X_{i j} \\ i=1, \ldots, n}} s s_{i j}^{\prime}\left(x_{1 j}, \ldots, x_{n j}\right),\ \max_{\substack{x_{i j} \in X_{i j} \\ i=1, \ldots, n}} s s_{i j}^{\prime}\left(x_{1 j}, \ldots, x_{n j}\right)\right]$$
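The reason for optimising each component of SS’ directly, rather than multiplying interval matrices, can be seen on a small example. The Python sketch below compares the two on a hypothetical 3×2 interval matrix. It relies on the same assumptions as a plain vertex enumeration: the bounds are inner estimates, the denominator of the standardization is taken as √n times the standard deviation, and no vertex of the box may have a constant column. All function names and data values are invented for illustration.

```python
import itertools
import math


def standardize_matrix(x):
    """Column-wise standardization of a real n x p matrix (list of rows).

    Assumption: each column is divided by sqrt(n) * sd with 1/n variance.
    """
    n, p = len(x), len(x[0])
    s = [[0.0] * p for _ in range(n)]
    for r in range(p):
        col = [x[i][r] for i in range(n)]
        mean = sum(col) / n
        sd = math.sqrt(sum((v - mean) ** 2 for v in col) / n)
        for i in range(n):
            s[i][r] = (col[i] - mean) / (math.sqrt(n) * sd)
    return s


def product_entry(s, i, j):
    """Entry (i, j) of S S' for a real standardized matrix S."""
    return sum(a * b for a, b in zip(s[i], s[j]))


def box_vertices(x_int):
    """Yield every vertex of the box as a real n x p matrix (list of rows)."""
    n, p = len(x_int), len(x_int[0])
    flat = [x_int[a][b] for a in range(n) for b in range(p)]
    for vertex in itertools.product(*flat):
        yield [list(vertex[a * p:(a + 1) * p]) for a in range(n)]


def direct_product_interval(x_int, i, j):
    """Min and max of ss'_ij itself over the vertices of the box."""
    values = [product_entry(standardize_matrix(x), i, j)
              for x in box_vertices(x_int)]
    return min(values), max(values)


def naive_product_interval(x_int, i, j):
    """Interval-arithmetic product of rows i and j of the vertex estimate of S^I."""
    p = len(x_int[0])
    s_vertices = [standardize_matrix(x) for x in box_vertices(x_int)]
    lo = hi = 0.0
    for r in range(p):
        a = (min(s[i][r] for s in s_vertices), max(s[i][r] for s in s_vertices))
        b = (min(s[j][r] for s in s_vertices), max(s[j][r] for s in s_vertices))
        corners = [a[0] * b[0], a[0] * b[1], a[1] * b[0], a[1] * b[1]]
        lo += min(corners)
        hi += max(corners)
    return lo, hi


# Hypothetical 3 x 2 interval data matrix (values invented for illustration).
x_int = [
    [(1.0, 1.2), (5.0, 5.4)],
    [(2.0, 2.3), (3.0, 3.2)],
    [(3.0, 3.1), (4.0, 4.1)],
]
direct = direct_product_interval(x_int, 0, 1)
naive = naive_product_interval(x_int, 0, 1)
print("direct:", direct)
print("naive: ", naive)
```

The naive bounds always enclose the direct ones, because interval arithmetic treats the two factors of each product as if they could vary independently, whereas in SS’ they depend on the same underlying xij.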

Cite this article

Gioia, F., Lauro, C.N. Principal component analysis on interval data. Computational Statistics 21, 343–363 (2006). https://doi.org/10.1007/s00180-006-0267-6
