Abstract
The literature on modeling heterogeneous data via mixture models generally assumes that samples are drawn from the underlying population by simple random sampling (SRS). This study exploits the bivariate ranked set sampling (BVRSS) technique to learn finite mixture models. We generalize the expectation-maximization (EM) algorithm under univariate RSS to the bivariate case. Through a simulation study under a noisy setting, we compare the performance of the proposed rank-based estimators with that of their SRS-based competitors in estimating the unknown parameters and cluster assignments. The proposed methodology is applied to a breast cancer data set to classify patients' tumors as malignant or benign. The results show that the extra rank information in BVRSS samples leads to better inference about the unknown features of mixture models.
Data availability
The real dataset analyzed during the current study is available at http://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic). The complete data generation process for the simulation study is described in Sect. 4.1.
Funding
The authors did not receive support from any organization for the submitted paper, and this research received no specific grant from any funding agency in the public, commercial, or not-for-profit sectors.
Ethics declarations
Conflict of interest
The authors have no conflicts of interest to declare that are relevant to the content of this article.
Appendix
Appendix A.1: Proof of Lemma 3.1
The conditional pdfs of the latent variables are
Similarly, \(f\big (v_{sk}^{(j)} \mid x_{[i](j)s}, y_{(i)[j]s}, \varvec{\Psi } \big ) = c_3 \prod _{k=1}^{K} \left( \frac{\pi _k \bar{F}_k(x_{[i](j)s}; \varvec{\xi }_k)}{\bar{F}(x_{[i](j)s}; \varvec{\Psi })} \right) ^{v_{sk}^{(j)}}\), \(f\big (u_{sk}^{(i)} \mid x_{[i](j)s}, y_{(i)[j]s}, \varvec{\Psi } \big ) = c_4 \prod _{k=1}^{K} \left( \frac{\pi _k F_k^{[j]}(y_{(i)[j]s}; \varvec{\xi }_k)}{F^{[j]}(y_{(i)[j]s}; \varvec{\Psi })} \right) ^{u_{sk}^{(i)}}\), and \(f\big (d_{sk}^{(i)} \mid x_{[i](j)s}, y_{(i)[j]s}, \varvec{\Psi } \big ) = c_5 \prod _{k=1}^{K} \left( \frac{\pi _k \bar{F}_k^{[j]}(y_{(i)[j]s}; \varvec{\xi }_k)}{\bar{F}^{[j]}(y_{(i)[j]s}; \varvec{\Psi })} \right) ^{d_{sk}^{(i)}}\).
Since the latent variables are independent, their joint conditional pdf is the product of the conditional pdfs given above.
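Each conditional pdf above has a multinomial-type kernel, so the ratios inside the parentheses act as component membership probabilities. As an illustration only, the following minimal sketch evaluates the weights \(\pi _k \bar{F}_k(x)/\bar{F}(x)\) for a single observation. It assumes univariate normal components (the appendix does not fix the component family), and the names `normal_sf` and `survival_weights` and all parameter values are hypothetical.

```python
import math

def normal_sf(x, mu, sigma):
    """Survival function 1 - F(x) of a N(mu, sigma^2) distribution."""
    return 0.5 * math.erfc((x - mu) / (sigma * math.sqrt(2.0)))

def survival_weights(x, pis, mus, sigmas):
    """Return pi_k * Fbar_k(x) / Fbar(x) for every component k.

    Assumption of this sketch: normal components with known parameters.
    """
    # Numerators pi_k * Fbar_k(x; xi_k), one per component.
    terms = [p * normal_sf(x, m, s) for p, m, s in zip(pis, mus, sigmas)]
    # Mixture survival function Fbar(x; Psi) is the sum of the numerators.
    total = sum(terms)
    return [t / total for t in terms]

# Hypothetical two-component mixture evaluated at x = 1.0.
weights = survival_weights(1.0, [0.4, 0.6], [0.0, 3.0], [1.0, 1.0])
```

The weights sum to one by construction. The same pattern would apply to the \(F_k^{[j]}\) and \(\bar{F}_k^{[j]}\) ratios once those distribution functions are available.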
Appendix A.2: Proof of equation (22)
To maximize (21) with respect to \(\pi _k\), we drop the terms that do not depend on \(\pi _k\). The quantity \(Q_{M_1}(\varvec{\Psi } \mid \varvec{\Psi }^{(t)})\) then reduces to \(Q_{M_1}(\pi _k)\) as follows
where \(\eta _{k,M_1}(x_{[i](j)s}, y_{(i)[j]s}; \varvec{\Psi }^{(t)})\) is defined in (23). We use the method of Lagrange multipliers under the constraint \(\sum _{k=1}^{K} \pi _k = 1\). The Lagrangian function is
Setting the derivative of \({\mathcal {L}}(\pi _k, \lambda )\) with respect to \(\pi _k\) equal to zero, we have
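The Lagrangian step can be sketched in generic form. This is an illustration under stated assumptions, not the paper's displayed equations: \(\eta _k \ge 0\) is shorthand introduced here for the total weight multiplying \(\log \pi _k\) in the reduced objective (that is, \(\eta _{k,M_1}\) summed over all sample indices), and the sign convention for \(\lambda \) is a choice made for this sketch.

\[
{\mathcal {L}}(\pi _1,\ldots ,\pi _K,\lambda ) = \sum _{k=1}^{K} \eta _k \log \pi _k - \lambda \left( \sum _{k=1}^{K} \pi _k - 1 \right) ,
\]
\[
\frac{\partial {\mathcal {L}}}{\partial \pi _k} = \frac{\eta _k}{\pi _k} - \lambda = 0 \quad \Longrightarrow \quad \pi _k = \frac{\eta _k}{\lambda } .
\]
Summing over \(k\) and using \(\sum _{k=1}^{K} \pi _k = 1\) gives \(\lambda = \sum _{l=1}^{K} \eta _l\), hence
\[
\pi _k = \frac{\eta _k}{\sum _{l=1}^{K} \eta _l} .
\]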
About this article
Cite this article
Haji Aghabozorgi, H., Eskandari, F. Clustering and estimation of finite mixture models under bivariate ranked set sampling with application to a breast cancer study. Stat Papers 65, 705–736 (2024). https://doi.org/10.1007/s00362-023-01411-6
DOI: https://doi.org/10.1007/s00362-023-01411-6