Abstract
With contemporary data sets becoming too large to analyze directly, various forms of aggregated data are becoming common. The original individual data are points, but after aggregation the observations are, for example, interval-valued. While some researchers simply analyze the set of averages of the observations by aggregated class, it is easily established that this approach ignores much of the information in the original data set. The initial theoretical work for interval-valued data was that of Le-Rademacher and Billard (J Stat Plan Infer 141:1593–1602, 2011), but those results were limited to estimation of the mean and variance of a single variable only. This article seeks to redress the limitation of their work by deriving the maximum likelihood estimator for the all-important covariance statistic, a basic requirement for numerous methodologies, such as regression, principal components, and canonical analyses. Asymptotic properties of the proposed estimators are established. The Le-Rademacher and Billard results emerge as special cases of our wider derivations.
References
Anderson T (2003) An Introduction to Multivariate Statistical Analysis. John Wiley, UK
Beranger B, Lin H, Sisson SA (2022) New models for symbolic data analysis. Adv Data Anal Classif 16:1–41
Bertrand P, Goupil F (2000) Descriptive statistics for symbolic data. In: Bock H-H, Diday E (eds) Analysis of symbolic data: exploratory methods for extracting statistical information from complex data. Springer-Verlag, Berlin, pp 103–124
Billard L (2008) Sample covariance functions for complex quantitative data. In: Mizuta M, Nakano J (eds) World Congress, International Association of Computational Statistics. Japanese Society of Computational Statistics, Yokohama, Japan, pp 157–163
Billard L, Diday E (2003) From the statistics of data to the statistics of knowledge: symbolic data analysis. J Am Stat Assoc 98:470–487
Billard L, Diday E (2006) Symbolic data analysis: conceptual statistics and data mining. Wiley, Chichester
Bock H-H, Diday E (eds) (2000) Analysis of symbolic data: exploratory methods for extracting statistical information from complex data. Springer-Verlag, Berlin
Brito P, Polaillon G (2005) Structuring probabilist data by Galois lattices. Math Soc Sci 43:77–104
Cariou V, Billard L (2015) Generalization method when manipulating relational databases. Revue des Nouvelles Technol de l’Infor 27:59–86
Casella G, Berger RL (2002) Statistical Inference, 2nd edn. Duxbury, Pacific Grove CA
Clark CE (1962) The PERT model for the distribution of activity time. Oper Res 10:405–406
Diday E (1988) The symbolic approach in clustering and related methods of data analysis. In: Bock H-H (ed) Classification and Related Methods of Data Analysis, Proceeding IFCS 1987 Aachen, Germany. North-Holland, Netherlands, pp 673–684
Diday E (1995) Probabilist, possibilist and belief objects for knowledge analysis. Ann Oper Res 55:227–276
Diday E, Emilion R (1996) Lattices and capacities in analysis of probabilist objects. In: Diday E, Lechevallier Y, Opitz O (eds) Ordinal and Symbolic Data, Proceeding International Conference on Ordinal and Symbolic Data Analysis - OSDA 95, Paris. Springer, Heidelberg, pp 13–30
Diday E, Emilion R (1998) Capacities and credibilities in analysis of probabilistic objects by histograms and lattices. In: Hayashi C, Ohsumi N, Yajima K, Tanaka Y, Bock H-H, Baba Y (eds) Data Science, Classification, and Related Methods. Springer, USA, pp 353–357
Diday E, Emilion R (2003) Maximal and stochastic Galois lattices. Discret Appl Math 127:271–284
Diday E, Emilion R, Hillali Y (1996) Symbolic data analysis of probabilistic objects by capacities and credibilities. In: Proceedings XXXVIII Riunione Scientifica Società Italiana di Statistica, pp 5–22
Douzal-Chouakria A, Billard L, Diday E (2011) Principal component analysis for interval-valued observations. Stat Anal Data Min 4:229–246
Emilion R (1997) Différentiation des capacités et des intégrales de Choquet. Comptes Rendus de l’Academie des Sci-Series I - Math 324:389–392
Fisher RA (1915) Frequency distribution of the values of the correlation coefficient in samples from an indefinitely large population. Biometrika 10:507–521
Lauro NC, Palumbo F (2000) Principal component analysis of interval data: a symbolic data analysis approach. Comput Stat 15:73–87
Lehmann EL (1983) Theory of Point Estimation. Wiley-Interscience, New Jersey
Lehmann EL (1986) Testing Statistical Hypotheses, 2nd edn. Wiley-Interscience, New Jersey
Le-Rademacher J, Billard L (2011) Likelihood functions and some maximum likelihood estimators for symbolic data. J Stat Plan Infer 141:1593–1602
Le-Rademacher J, Billard L (2012) Symbolic-covariance principal component analysis and visualization for interval-valued data. J Comput Graph Stat 21:413–432
Leroy B, Chouakria A, Herlin I, Diday E (1996) Approche géométrique et classification pour la reconnaissance de visage. Reconnaissance des Formes et Intelligence Artificielle, INRIA and IRISA and CNRS, France, pp 548–557
Liu F, Billard L (2022) Partition of interval-valued observations using regression. J Classif 39:55–77
Malcolm DG, Roseboom JH, Clark CE, Fazar W (1959) Application of a technique for research and development program evaluation. Oper Res 7:646–669
Moore RE (1966) Interval Analysis. Prentice-Hall, Englewood Cliffs NJ
Oliveira MR, Azeitona M, Pacheco A, Valadas R (2022) Association measures for interval variables. Adv Data Anal Classif 16:491–520
Oliveira MR, Vilela M, Pacheco A, Valadas R, Salvador P (2017) Extracting information from interval data using symbolic principal component analysis. Aust J Stat 46:79–87
Rahman PA, Beranger B, Roughan M, Sisson SA (2020) Likelihood-based inference for modelling packet transit from thinned flow summaries. IEEE Trans Signal Inf Process over Netw 8:571–583
Samadi SY, Billard L (2021) Analysis of dependent data aggregated into intervals. J Multivar Anal 186:104817
Wishart J (1928) The generalised product moment distribution in samples from a normal multivariate population. Biometrika 20:32–52
Whitaker T, Beranger B, Sisson SA (2020) Composite likelihood methods for histogram-valued random variables. Stat Comput 30:1459–1477
Whitaker T, Beranger B, Sisson SA (2021) Logistic regression models for aggregated data. J Comput Graph Stat 30:1049–1067
Xu W (2010) Symbolic data analysis: interval-valued data regression. University of Georgia, USA
Zhang X, Beranger B, Sisson SA (2020) Constructing likelihood functions for interval-valued random variables. Scand J Stat 47:1–35
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Appendices
Appendix A
Early work on interval data sometimes transformed the interval-valued variable into two variables, center and range (or equivalently, given their one-to-one correspondence, into the endpoint values). Consider the Y values of the interval-valued data sets of Table 3. For an interval \([a, b]\), let us denote the interval center by \(Y^c = (a + b)/2\) and the interval half-range by \(Y^r = (b - a)/2\). Then the first and second columns of Table 4(a) give the sample variances of Y for the interval centers and half-ranges, respectively (calculated by classical results or as special cases of Eq. (2)). The third column shows the sum \(\hbox {Var}(Y^c) + \hbox {Var}(Y^r)\). This can be compared with the sample variance \(\hbox {Var}(Y)\) of the intervals in the right-most column (from Eq. (2) and Bertrand and Goupil (2000)). We see that the sum \(\hbox {Var}(Y^c) + \hbox {Var}(Y^r)\) is sometimes greater, and sometimes less, than the symbolic variance \(\hbox {Var}(Y)\); which of the two occurs depends on the actual data. The fourth data set consists of classical values (with \(a\equiv [a,a]\)); in this case \(\hbox {Var}(Y^r)=0\) and so \(\hbox {Var}(Y^c) = \hbox {Var}(Y)\), as it should.
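The comparison described above can be sketched numerically. The following is a minimal illustration, assuming that Eq. (2) has the Bertrand and Goupil (2000) form of the interval sample variance with divisor n; the function names and the toy endpoints are hypothetical and are not the data of Table 3.

```python
import numpy as np

def symbolic_variance(a, b):
    """Interval sample variance in the Bertrand and Goupil (2000) form, divisor n.

    Assumed here to be what Eq. (2) denotes; a and b hold the interval endpoints.
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    n = a.size
    return (a**2 + a * b + b**2).sum() / (3 * n) - (a + b).sum() ** 2 / (4 * n**2)

def center_and_half_range_variances(a, b):
    """Classical variances (divisor n) of the centers and of the half-ranges."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    centers = (a + b) / 2
    half_ranges = (b - a) / 2
    return centers.var(), half_ranges.var()  # np.var uses divisor n by default

# Hypothetical intervals [a_i, b_i]; these are not the data of Table 3.
a = np.array([1.0, 2.0, 4.0, 7.0])
b = np.array([3.0, 5.0, 6.0, 12.0])

var_c, var_r = center_and_half_range_variances(a, b)
print("Var(Y^c) + Var(Y^r):", var_c + var_r)
print("symbolic Var(Y):    ", symbolic_variance(a, b))

# Classical points written as degenerate intervals [a, a].
var_c0, var_r0 = center_and_half_range_variances(a, a)
print("classical case:", var_r0, var_c0, symbolic_variance(a, a))
```

Under this assumed form, substituting \(a = Y^c - Y^r\) and \(b = Y^c + Y^r\) gives \(\hbox {Var}(Y) = \hbox {Var}(Y^c) + (3n)^{-1}\sum _i (Y^r_i)^2\), so whether the sum \(\hbox {Var}(Y^c) + \hbox {Var}(Y^r)\) exceeds \(\hbox {Var}(Y)\) turns on whether the variance of the half-ranges exceeds one third of their mean square.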
Likewise, using the center and range values for both Y and X, we can calculate the classical covariances of the centers and of the ranges, and the symbolic interval covariances from Eq. (3); these are shown in Table 4(b). Again, the sum \(\hbox {Cov}(Y^c, X^c) + \hbox {Cov}(Y^r, X^r)\) can be greater, or smaller, than the symbolic covariance \(\hbox {Cov}(Y, X)\); for classical observations, this sum equals the symbolic covariance, as expected.
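A corresponding sketch for the covariance comparison follows. It assumes that Eq. (3) takes the form of the interval sample covariance in Billard (2008), with divisor n; the function names and the toy intervals are hypothetical and are not the data of Table 3.

```python
import numpy as np

def symbolic_covariance(a1, b1, a2, b2):
    """Interval sample covariance in the Billard (2008) form, divisor n.

    Assumed here to be what Eq. (3) denotes; (a1, b1) and (a2, b2) hold the
    endpoints of the two interval-valued variables.
    """
    a1, b1, a2, b2 = (np.asarray(v, dtype=float) for v in (a1, b1, a2, b2))
    n = a1.size
    m1 = ((a1 + b1) / 2).mean()  # symbolic sample mean of variable 1
    m2 = ((a2 + b2) / 2).mean()  # symbolic sample mean of variable 2
    terms = (2 * (a1 - m1) * (a2 - m2) + (a1 - m1) * (b2 - m2)
             + (b1 - m1) * (a2 - m2) + 2 * (b1 - m1) * (b2 - m2))
    return terms.sum() / (6 * n)

def classical_covariance(u, v):
    """Classical sample covariance with divisor n."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return ((u - u.mean()) * (v - v.mean())).mean()

# Hypothetical intervals for Y and X (not the data of Table 3).
ya = np.array([1.0, 2.0, 4.0, 7.0])
yb = np.array([3.0, 5.0, 6.0, 12.0])
xa = np.array([0.5, 1.0, 3.0, 2.0])
xb = np.array([2.5, 1.5, 6.0, 4.0])

cov_c = classical_covariance((ya + yb) / 2, (xa + xb) / 2)  # centers
cov_r = classical_covariance((yb - ya) / 2, (xb - xa) / 2)  # half-ranges
print("Cov(Y^c, X^c) + Cov(Y^r, X^r):", cov_c + cov_r)
print("symbolic Cov(Y, X):           ", symbolic_covariance(ya, yb, xa, xb))

# Classical points as degenerate intervals: both quantities coincide.
print("classical case:", classical_covariance(ya, xa),
      symbolic_covariance(ya, ya, xa, xa))
```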
For a second aspect, suppose a data set consists of intervals that all have the same center but different range values. Then the variance-covariance terms for the centers are zero. Conversely, if the observations have different center values but all have the same range value, then the variance-covariance terms for the ranges are zero. In either case, methods that rely on the relevant variance-covariance matrix cannot be properly implemented since, e.g., in regression that matrix is zero and in principal components the eigenvalues are zero.
The variance-covariance definition of Eq. (3) does not have these limitations.
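The two degenerate cases can be checked with a small sketch. It again assumes that Eq. (3) has the Billard (2008) form with divisor n, written here in matrix form; the helper name and the toy values are hypothetical.

```python
import numpy as np

def symbolic_cov_matrix(lower, upper):
    """Interval covariance matrix for n x p arrays of lower and upper endpoints.

    Billard (2008) form with divisor n, assumed here to match Eq. (3).
    """
    lower = np.asarray(lower, dtype=float)
    upper = np.asarray(upper, dtype=float)
    n = lower.shape[0]
    mean = ((lower + upper) / 2).mean(axis=0)  # symbolic means, one per variable
    lo, up = lower - mean, upper - mean
    return (2 * lo.T @ lo + lo.T @ up + up.T @ lo + 2 * up.T @ up) / (6 * n)

# Case 1: every interval has the same center, but the half-ranges differ.
centers = np.full((4, 2), 5.0)
half = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 2.0]])
center_cov = np.cov(centers, rowvar=False, bias=True)  # classical, divisor n
sym_same_center = symbolic_cov_matrix(centers - half, centers + half)
print("centers only:\n", center_cov)       # zero matrix
print("symbolic:\n", sym_same_center)      # not zero

# Case 2: the centers differ, but every interval has the same half-range.
centers2 = np.array([[1.0, 2.0], [3.0, 1.0], [4.0, 6.0], [8.0, 5.0]])
half2 = np.full((4, 2), 1.5)
range_cov = np.cov(half2, rowvar=False, bias=True)
sym_same_range = symbolic_cov_matrix(centers2 - half2, centers2 + half2)
print("half-ranges only:\n", range_cov)    # zero matrix
print("symbolic eigenvalues:", np.linalg.eigvalsh(sym_same_range))
```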
Appendix B
The log likelihood function \(\ln L_I\) from Eq. (12) and Eq. (13) is
Then successively differentiating \(\ln L_I\) with respect to each of the eight parameters in \({{\varvec{\tau }}}\), we obtain
where \(G=\gamma _1\gamma _2 - \gamma _3^2\).
Then, substituting the relevant maximum likelihood estimators and setting the derivatives to zero, we obtain the maximum likelihood estimators \(\hat{{{\varvec{\tau }}}}_{xy} = (\hat{\mu }_x, \hat{\mu }_y, \hat{\sigma }_x^2, \hat{\sigma }_y^2, \hat{\rho }, \hat{\gamma }_1,\hat{\gamma }_2,\hat{\gamma }_3)\) for \({{\varvec{\tau }}}_{xy} = (\mu _x, \mu _y, \sigma _x^2, \sigma _y^2, \rho , \gamma _1,\gamma _2,\gamma _3)\) as given by Eqs. (19–22). We also note that, instead of solving the partial derivative in Eq. (42) to derive the estimator \(\hat{\rho }\), we can more easily obtain the result of Eq. (42) by following, e.g., Casella and Berger (2002, p. 358), who suggest using a partially maximized likelihood function.
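The idea of a partially maximized likelihood can be illustrated on the classical bivariate normal model. The sketch below is not the interval likelihood \(\ln L_I\) of this appendix; it is a simplified stand-in with simulated point data, and all names and parameter values in it are hypothetical. The means and variances are held at their maximum likelihood values, and the remaining function of \(\rho \) alone is maximized on a grid.

```python
import numpy as np

# Simulated classical (point-valued) bivariate normal data; values are arbitrary.
rng = np.random.default_rng(0)
n = 500
cov = np.array([[1.0, 0.6], [0.6, 2.0]])
xy = rng.multivariate_normal([0.0, 1.0], cov, size=n)
x, y = xy[:, 0], xy[:, 1]

# Step 1: closed-form maximum likelihood values of the means and variances
# (divisor n), and the resulting sample correlation r.
mx, my = x.mean(), y.mean()
sx2, sy2 = x.var(), y.var()
r = ((x - mx) * (y - my)).mean() / np.sqrt(sx2 * sy2)

def partial_loglik(rho):
    """Bivariate normal log-likelihood as a function of rho only.

    The means and variances are fixed at their maximum likelihood values and
    additive constants not involving rho are dropped.
    """
    return -0.5 * n * np.log(1 - rho**2) - n * (1 - rho * r) / (1 - rho**2)

# Step 2: maximize the one-dimensional function over a fine grid in (-1, 1).
grid = np.linspace(-0.999, 0.999, 19981)
rho_hat = grid[np.argmax(partial_loglik(grid))]
print("grid maximizer:", rho_hat, " sample correlation:", r)
```

In this classical setting the derivative of the partially maximized function is proportional to \((r - \rho )(1 + \rho ^2)\), so the maximizer is the sample correlation \(r\); the grid search simply confirms this numerically.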
About this article
Cite this article
Samadi, S.Y., Billard, L., Guo, JH. et al. MLE for the parameters of bivariate interval-valued model. Adv Data Anal Classif (2023). https://doi.org/10.1007/s11634-023-00546-6
DOI: https://doi.org/10.1007/s11634-023-00546-6
Keywords
- Interval data
- Likelihood
- Bivariate normal distribution
- Bivariate Wishart distribution
- Conditional moments