Clustering and estimation of finite mixture models under bivariate ranked set sampling with application to a breast cancer study

Haji Aghabozorgi, Hamid; Eskandari, Farzad

doi:10.1007/s00362-023-01411-6

Regular Article
Published: 02 March 2023

Volume 65, pages 705–736, (2024)
Statistical Papers

Abstract

In the literature on modeling heterogeneous data via mixture models, it is generally assumed that the samples are drawn from the underlying population using the simple random sampling (SRS) technique. This study exploits the bivariate ranked set sampling (BVRSS) technique to learn finite mixture models. We generalize the expectation-maximization (EM) algorithm under univariate RSS to the bivariate case. Computationally, through a simulation study under a noisy setting, we compare the performance of the proposed rank-based estimators with that of the SRS-based competitors in estimating unknown parameters and cluster assignments. The proposed methodology is applied to a breast cancer data set to diagnose malignant or benign tumors in patients. The results showed that the extra rank information in BVRSS samples leads to a better inference about the unknown features of mixture models.
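To make the sampling design concrete, the following is a minimal sketch of how one BVRSS sample could be drawn. It is not taken from the article: it assumes the commonly described two-stage procedure (rank X within each row of an m × m pool, then rank Y among the m selected pairs), perfect ranking on the true values, and a hypothetical two-component bivariate normal population. The names `bvrss_sample` and `draw_mixture_pairs` are illustrative only.

```python
import numpy as np


def draw_mixture_pairs(n, rng):
    """Hypothetical population: a two-component bivariate normal mixture."""
    labels = rng.random(n) < 0.4
    means = np.where(labels[:, None], [0.0, 0.0], [3.0, 3.0])
    cov = np.array([[1.0, 0.5], [0.5, 1.0]])
    noise = rng.multivariate_normal(np.zeros(2), cov, size=n)
    return means + noise


def bvrss_sample(draw_pairs, m, r, rng):
    """Draw r cycles of a BVRSS sample with set size m (perfect ranking assumed).

    draw_pairs(n, rng) must return an (n, 2) array of (x, y) pairs.
    Entry [s, i, j] of the result is the pair quantified in cycle s for
    Y-rank i + 1 and X-rank j + 1.
    """
    out = np.empty((r, m, m, 2))
    rows = np.arange(m)
    for s in range(r):
        for i in range(m):
            for j in range(m):
                # One pool of m * m units, arranged as m rows of m pairs.
                pool = draw_pairs(m * m, rng).reshape(m, m, 2)
                # In each row keep the pair with the (j + 1)-th smallest X.
                order_x = np.argsort(pool[:, :, 0], axis=1)
                picked = pool[rows, order_x[:, j]]
                # Among those m pairs keep the one with the (i + 1)-th smallest Y.
                order_y = np.argsort(picked[:, 1])
                out[s, i, j] = picked[order_y[i]]
    return out


rng = np.random.default_rng(0)
sample = bvrss_sample(draw_mixture_pairs, m=3, r=5, rng=rng)
```

Each cycle inspects m^4 units but measures only m^2 of them, which is where the extra rank information comes from.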


Data availability

The real dataset analyzed during the current study is available at http://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic). The complete data generation process for the simulation study is described in Sect. 4.1.

Funding

The authors did not receive support from any organization for the submitted paper, and this research received no specific grant from any funding agency in the public, commercial, or not-for-profit sectors.

Author information

Authors and Affiliations

Department of Statistics, Allameh Tabataba’i University, Tehran, Iran
Hamid Haji Aghabozorgi & Farzad Eskandari

Corresponding author

Correspondence to Hamid Haji Aghabozorgi.

Ethics declarations

Conflict of interest

The authors have no conflicts of interest to declare that are relevant to the content of this article.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix

Appendix A.1: Proof of Lemma 3.1

The conditional pdfs of the latent variables are

$$\begin{aligned}
&f\big (z_{sk}^{(ij)} \,\big |\, x_{[i](j)s}, y_{(i)[j]s}, \varvec{\Psi }\big ) = \frac{f\big (x_{[i](j)s}, y_{(i)[j]s} \,\big |\, z_{sk}^{(ij)}, \varvec{\Psi }\big )\, f\big (z_{sk}^{(ij)}\big )}{f^{[i](j),(i)[j]}\big (x_{[i](j)s}, y_{(i)[j]s}; \varvec{\Psi }\big )}\\
&\quad = \frac{c_1 \prod \limits _{k=1}^K \{ f_k(x_{[i](j)s}, y_{(i)[j]s}; \varvec{\xi }_k)\} ^{z_{sk}^{(ij)}} \{ F(x_{[i](j)s}; \varvec{\Psi })\} ^{j-1} \{ \bar{F}(x_{[i](j)s}; \varvec{\Psi })\} ^{m-j} \{ F^{[j]}(y_{(i)[j]s}; \varvec{\Psi })\} ^{i-1} \{ \bar{F}^{[j]}(y_{(i)[j]s}; \varvec{\Psi })\} ^{m-i}}{c_1 f(x_{[i](j)s}, y_{(i)[j]s}; \varvec{\Psi }) \{ F(x_{[i](j)s}; \varvec{\Psi })\} ^{j-1} \{ \bar{F}(x_{[i](j)s}; \varvec{\Psi })\} ^{m-j} \{ F^{[j]}(y_{(i)[j]s}; \varvec{\Psi })\} ^{i-1} \{ \bar{F}^{[j]}(y_{(i)[j]s}; \varvec{\Psi })\} ^{m-i}}\\
&\qquad \times \prod \limits _{k=1}^K \pi _k^{z_{sk}^{(ij)}} = \prod \limits _{k=1}^K \left( \frac{\pi _k f_k\big (x_{[i](j)s}, y_{(i)[j]s}; \varvec{\xi }_k\big )}{f\big (x_{[i](j)s}, y_{(i)[j]s}; \varvec{\Psi }\big )}\right) ^{z_{sk}^{(ij)}},\\
&f\big (w_{sk}^{(j)} \,\big |\, x_{[i](j)s}, y_{(i)[j]s}, \varvec{\Psi }\big ) = \frac{f\big (x_{[i](j)s}, y_{(i)[j]s} \,\big |\, w_{sk}^{(j)}, \varvec{\Psi }\big )\, f\big (w_{sk}^{(j)}\big )}{f^{[i](j),(i)[j]}\big (x_{[i](j)s}, y_{(i)[j]s}; \varvec{\Psi }\big )}\\
&\quad = \frac{c_1 c_2 f(x_{[i](j)s}, y_{(i)[j]s}; \varvec{\Psi }) \prod \limits _{k=1}^K \{ F_k(x_{[i](j)s}; \varvec{\xi }_k)\} ^{w_{sk}^{(j)}} \{ \bar{F}(x_{[i](j)s}; \varvec{\Psi })\} ^{m-j} \{ F^{[j]}(y_{(i)[j]s}; \varvec{\Psi })\} ^{i-1} \{ \bar{F}^{[j]}(y_{(i)[j]s}; \varvec{\Psi })\} ^{m-i}}{c_1 f(x_{[i](j)s}, y_{(i)[j]s}; \varvec{\Psi }) \{ F(x_{[i](j)s}; \varvec{\Psi })\} ^{j-1} \{ \bar{F}(x_{[i](j)s}; \varvec{\Psi })\} ^{m-j} \{ F^{[j]}(y_{(i)[j]s}; \varvec{\Psi })\} ^{i-1} \{ \bar{F}^{[j]}(y_{(i)[j]s}; \varvec{\Psi })\} ^{m-i}}\\
&\qquad \times \prod \limits _{k=1}^K \pi _k^{w_{sk}^{(j)}} = c_2 \prod \limits _{k=1}^K \left( \frac{\pi _k F_k\big (x_{[i](j)s}; \varvec{\xi }_k\big )}{F\big (x_{[i](j)s}; \varvec{\Psi }\big )}\right) ^{w_{sk}^{(j)}}.
\end{aligned}$$

In the same way, $f\big (\left. {v_{sk}^{(j)}} \right| {x_{[i](j)s}},{y_{(i)[j]s}},\varvec{\Psi } \big ) = {c_3}\prod \limits \nolimits _{k = 1}^K {{{\left( {{{\pi _k}{{\bar{F}}_k}({x_{[i](j)s}};{\varvec{\xi }_k})} \over {\bar{F}({x_{[i](j)s}};\varvec{\Psi }) }}\right) }^{v_{sk}^{(j)}}}}$, $f\big (\left. {u_{sk}^{(i)}} \right| {x_{[i](j)s}},{y_{(i)[j]s}},\varvec{\Psi } \big )={c_4}\prod \limits \nolimits _{k = 1}^K {{{\left( {{{\pi _k}F_k^{[j]} ({y_{(i)[j]s}};{\varvec{\xi }_k})} \over {{F^{[j]}}({y_{(i)[j]s}};\varvec{\Psi }) }}\right) }^{u_{sk}^{(i)}}}}$, and $f\big (\left. {d_{sk}^{(i)}} \right| {x_{[i](j)s}},{y_{(i)[j]s}},\varvec{\Psi } \big ) = {c_5}\prod \limits \nolimits _{k = 1}^K {{{\left( {{{\pi _k}{\bar{F}}_k^{[j]} ({y_{(i)[j]s}};{\varvec{\xi }_k})} \over {{{\bar{F}}^{[j]}}({y_{(i)[j]s}};\varvec{\Psi }) }}\right) }^{d_{sk}^{(i)}}}}$.

Since the latent variables are mutually independent given the observed pair, we have

$$\begin{aligned}&f\big ({x_{[i](j)s}},{y_{(i)[j]s}}, \varvec{z}_s^{(ij)}, \varvec{w}_s^{(j)}, \varvec{v}_s^{(j)}, \varvec{u}_s^{(i)}, \varvec{d}_s^{(i)}; \varvec{\Psi }\big ) \\&\quad = f \big (\left. {z_{sk}^{(ij)},w_{sk}^{(j)},v_{sk}^{(j)},u_{sk}^{(i)} ,d_{sk}^{(i)}} \right| {x_{[i](j)s}},{y_{(i)[j]s}},\varvec{\Psi } \big ) {f^{[i](j),(i)[j]}}\big ({x_{[i](j)s}},{y_{(i)[j]s}},\varvec{\Psi } \big )\\&\quad = f \big (\left. {z_{sk}^{(ij)}} \right| {x_{[i](j)s}},{y_{(i)[j]s}},\varvec{\Psi } \big ) f \big (\left. {w_{sk}^{(j)}} \right| {x_{[i](j)s}},{y_{(i)[j]s}},\varvec{\Psi } \big ) f \big (\left. {v_{sk}^{(j)}} \right| {x_{[i](j)s}},{y_{(i)[j]s}},\varvec{\Psi } \big )\\&\qquad \times f\big (\left. {u_{sk}^{(i)}} \right| {x_{[i](j)s}},{y_{(i)[j]s}},\varvec{\Psi } \big ) f\big (\left. {d_{sk}^{(i)}} \right| {x_{[i](j)s}},{y_{(i)[j]s}},\varvec{\Psi } \big ){f^{[i](j),(i)[j]}} \big ({x_{[i](j)s}},{y_{(i)[j]s}},\varvec{\Psi } \big ) \\&\quad = c_1 c_2 c_3 c_4 c_5 \Bigg \{ \prod \limits _{k = 1}^K{ {\pi _k^{ z_{sk}^{(ij)} + w_{sk}^{(j)} + v_{sk}^{(j)} +u_{sk}^{(i)} + d_{sk}^{(i)} } } {\big \{ {f_k}\big ({x_{[i](j)s}},{y_{(i)[j]s}};\varvec{\xi }_k \big )\big \} ^{z_{sk}^{(ij)}}}} \\&\qquad \times {{\big \{ {F_k}\big ({x_{[i](j)s}};\varvec{\xi }_k \big )\big \} ^{w_{sk}^{(j)}}}{\big \{ {{\bar{F}}_k}\big ({x_{[i](j)s}};\varvec{\xi }_k \big )\big \} ^{v_{sk}^{(j)}}}{\big \{ F_k^{[j]}\big ({y_{(i)[j]s}};\varvec{\xi }_k \big )\big \} ^{u_{sk}^{(i)}}}{\big \{ \bar{F}_k^{[j]}\big ({y_{(i)[j]s}};\varvec{\xi }_k \big )\big \} ^{d_{sk}^{(i)}}} } \Bigg \}. \end{aligned}$$
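As a rough illustration of the conditional probabilities in Lemma 3.1, the sketch below evaluates $\pi_k f_k / f$, $\pi_k F_k / F$ and $\pi_k \bar{F}_k / \bar{F}$ for one observed pair. It assumes bivariate normal components and the usual mixture definitions $f = \sum_k \pi_k f_k$ and $F = \sum_k \pi_k F_k$; neither assumption is stated in this appendix, and all function names are hypothetical. The terms involving $F^{[j]}$ are left out because their definition is not given here.

```python
import math
import numpy as np


def normal_cdf(x, mean, sd):
    """Univariate normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))


def bvn_pdf(x, y, mean, cov):
    """Bivariate normal density at (x, y)."""
    d = np.array([x, y]) - mean
    quad = d @ np.linalg.inv(cov) @ d
    return math.exp(-0.5 * quad) / (2.0 * math.pi * math.sqrt(np.linalg.det(cov)))


def component_probabilities(x, y, weights, means, covs):
    """Return the per-component probabilities for z, w and v at one pair.

    Assumes Gaussian components; weights has shape (K,), means (K, 2),
    covs (K, 2, 2).
    """
    K = len(weights)
    dens = np.array([bvn_pdf(x, y, means[k], covs[k]) for k in range(K)])
    cdf_x = np.array(
        [normal_cdf(x, means[k][0], math.sqrt(covs[k][0, 0])) for k in range(K)]
    )
    p_z = weights * dens           # pi_k f_k(x, y)
    p_w = weights * cdf_x          # pi_k F_k(x)
    p_v = weights * (1.0 - cdf_x)  # pi_k Fbar_k(x)
    # Dividing by the sum over k is the division by f, F and Fbar respectively.
    return p_z / p_z.sum(), p_w / p_w.sum(), p_v / p_v.sum()


weights = np.array([0.5, 0.5])
means = np.array([[0.0, 0.0], [2.0, 2.0]])
covs = np.array([np.eye(2), np.eye(2)])
p_z, p_w, p_v = component_probabilities(1.0, 1.0, weights, means, covs)
```

In this symmetric example the pair lies midway between the two component means, so the density-based probabilities are equal while the CDF-based ones favour the component with the smaller mean.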

Appendix A.2: Proof of equation (22)

To maximize (21) with respect to $\pi _k$, we drop the terms that do not depend on $\pi _k$. The quantity ${Q_{{M_1}}}(\left. \varvec{\Psi } \right| {\varvec{\Psi }^{(t)}})$ then reduces to $Q_{M_1}(\pi _k)$ as follows

$$\begin{aligned} Q_{M_1}({\pi _k}) = (2m-1)\sum \limits _{s = 1}^r {\sum \limits _{j = 1}^m {\sum \limits _{i = 1}^m {\sum \limits _{k = 1}^K {{\eta _{k,{M_1}}}({x_{[i](j)s}},{y_{(i)[j]s}};{\varvec{\Psi }^{(t)}}) \log {\pi _k} } } } }, \end{aligned}$$

where ${\eta _{k,{M_1}}}({x_{[i](j)s}},{y_{(i)[j]s}};{\varvec{\Psi }^{(t)}})$ is defined in (23). We use the Lagrange multiplier method under the constraint $\sum \nolimits _{k=1}^K {\pi _k} = 1$. The Lagrangian function is

$$\begin{aligned} {\mathcal {L}}({\pi _k},\lambda ) = {Q_{{M_1}}}({\pi _k}) - \lambda (\sum \limits _{k = 1}^K {{\pi _k}} - 1). \end{aligned}$$

Setting the derivative of ${\mathcal {L}}({\pi _k},\lambda ) $ with respect to $\pi _k$ equal to zero, and absorbing the constant factor $(2m-1)$ into $\lambda $, we have

$$\begin{aligned} \pi _{k,{M_1}}^{(t + 1)} = {1 \over \lambda }\sum \limits _{s = 1}^r {\sum \limits _{j = 1}^m {\sum \limits _{i = 1}^m {{\eta _{k,{M_1}}}({x_{[i](j)s}},{y_{(i)[j]s}};{\varvec{\Psi }^{(t)}})} } } , \end{aligned}$$

where from (16)–(20), we have

$$\begin{aligned} \lambda&= \sum \limits _{s = 1}^r {\sum \limits _{j = 1}^m {\sum \limits _{i = 1}^m {\sum \limits _{k = 1}^K {{\eta _{k,{M_1}}}({x_{[i](j)s}},{y_{(i)[j]s}};{\varvec{\Psi }^{(t)}})} } } }\\&= \sum \limits _{s = 1}^r {\sum \limits _{j = 1}^m {\sum \limits _{i = 1}^m {1 \over {(2m-1)}} \big \{1+ (j-1) + (m-j) + (i-1) + (m-i)\big \} }} \\&= rm^2. \end{aligned}$$
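The resulting update can be sketched numerically as follows. This is only an illustration: it assumes the weights $\eta_{k,M_1}$ have already been computed and stored in an array of shape (r, m, m, K) whose entries sum to one over k for each sampled unit, as the derivation of $\lambda$ above implies. The function name and the randomly generated weights are hypothetical.

```python
import numpy as np


def update_mixing_proportions(eta):
    """Update pi_k from weights eta of shape (r, m, m, K).

    Assumes eta[s, j, i, :] sums to 1 for every sampled unit, so the
    normalising constant equals r * m**2.
    """
    r, m, m_check, _ = eta.shape
    if m != m_check:
        raise ValueError("eta must have shape (r, m, m, K)")
    lam = eta.sum()  # sum over s, j, i and k
    return eta.sum(axis=(0, 1, 2)) / lam


# Hypothetical weights: one probability vector per sampled unit.
rng = np.random.default_rng(0)
r, m, K = 4, 3, 2
eta = rng.dirichlet(np.ones(K), size=(r, m, m))
pi_new = update_mixing_proportions(eta)
```

Dividing by the total of the weights rather than by a hard-coded `r * m**2` keeps the result a valid probability vector even if the weights are only approximately normalised.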

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Haji Aghabozorgi, H., Eskandari, F. Clustering and estimation of finite mixture models under bivariate ranked set sampling with application to a breast cancer study. Stat Papers 65, 705–736 (2024). https://doi.org/10.1007/s00362-023-01411-6

Received: 15 June 2022
Revised: 30 November 2022
Published: 02 March 2023
Issue Date: April 2024
DOI: https://doi.org/10.1007/s00362-023-01411-6
