Clustering and estimation of finite mixture models under bivariate ranked set sampling with application to a breast cancer study

  • Regular Article
  • Statistical Papers

Abstract

In the literature on modeling heterogeneous data via mixture models, it is generally assumed that the samples are drawn from the underlying population using the simple random sampling (SRS) technique. This study exploits the bivariate ranked set sampling (BVRSS) technique to learn finite mixture models. We generalize the expectation-maximization (EM) algorithm under univariate RSS to the bivariate case. Computationally, through a simulation study under a noisy setting, we compare the performance of the proposed rank-based estimators with that of the SRS-based competitors in estimating unknown parameters and cluster assignments. The proposed methodology is applied to a breast cancer data set to diagnose malignant or benign tumors in patients. The results showed that the extra rank information in BVRSS samples leads to a better inference about the unknown features of mixture models.

Data availability

The real dataset analyzed during the current study is available at http://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic). The complete data generation process for the simulation study is described in Sect. 4.1.

Funding

The authors did not receive support from any organization for the submitted paper, and this research received no specific grant from any funding agency in the public, commercial, or not-for-profit sectors.

Author information

Corresponding author

Correspondence to Hamid Haji Aghabozorgi.

Ethics declarations

Conflict of interest

The authors have no conflicts of interest to declare that are relevant to the content of this article.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix

Appendix A.1: Proof of Lemma 3.1

The conditional pdfs of the latent variables are

$$\begin{aligned}
&f\big (z_{sk}^{(ij)} \,\big |\, x_{[i](j)s}, y_{(i)[j]s}, \varvec{\Psi }\big ) = \frac{f\big (x_{[i](j)s}, y_{(i)[j]s} \,\big |\, z_{sk}^{(ij)}, \varvec{\Psi }\big )\, f\big (z_{sk}^{(ij)}\big )}{f^{[i](j),(i)[j]}\big (x_{[i](j)s}, y_{(i)[j]s}; \varvec{\Psi }\big )}\\
&\quad = \frac{c_1 \prod \limits _{k=1}^K \{f_k(x_{[i](j)s}, y_{(i)[j]s}; \varvec{\xi }_k)\}^{z_{sk}^{(ij)}} \{F(x_{[i](j)s}; \varvec{\Psi })\}^{j-1} \{\bar{F}(x_{[i](j)s}; \varvec{\Psi })\}^{m-j} \{F^{[j]}(y_{(i)[j]s}; \varvec{\Psi })\}^{i-1} \{\bar{F}^{[j]}(y_{(i)[j]s}; \varvec{\Psi })\}^{m-i}}{c_1 f(x_{[i](j)s}, y_{(i)[j]s}; \varvec{\Psi }) \{F(x_{[i](j)s}; \varvec{\Psi })\}^{j-1} \{\bar{F}(x_{[i](j)s}; \varvec{\Psi })\}^{m-j} \{F^{[j]}(y_{(i)[j]s}; \varvec{\Psi })\}^{i-1} \{\bar{F}^{[j]}(y_{(i)[j]s}; \varvec{\Psi })\}^{m-i}}\\
&\qquad \times \prod \limits _{k=1}^K \pi _k^{z_{sk}^{(ij)}} = \prod \limits _{k=1}^K \left( \frac{\pi _k f_k\big (x_{[i](j)s}, y_{(i)[j]s}; \varvec{\xi }_k\big )}{f\big (x_{[i](j)s}, y_{(i)[j]s}; \varvec{\Psi }\big )} \right) ^{z_{sk}^{(ij)}},\\
&f\big (w_{sk}^{(j)} \,\big |\, x_{[i](j)s}, y_{(i)[j]s}, \varvec{\Psi }\big ) = \frac{f\big (x_{[i](j)s}, y_{(i)[j]s} \,\big |\, w_{sk}^{(j)}, \varvec{\Psi }\big )\, f\big (w_{sk}^{(j)}\big )}{f^{[i](j),(i)[j]}\big (x_{[i](j)s}, y_{(i)[j]s}; \varvec{\Psi }\big )}\\
&\quad = \frac{c_1 c_2 f(x_{[i](j)s}, y_{(i)[j]s}; \varvec{\Psi }) \prod \limits _{k=1}^K \{F_k(x_{[i](j)s}; \varvec{\xi }_k)\}^{w_{sk}^{(j)}} \{\bar{F}(x_{[i](j)s}; \varvec{\Psi })\}^{m-j} \{F^{[j]}(y_{(i)[j]s}; \varvec{\Psi })\}^{i-1} \{\bar{F}^{[j]}(y_{(i)[j]s}; \varvec{\Psi })\}^{m-i}}{c_1 f(x_{[i](j)s}, y_{(i)[j]s}; \varvec{\Psi }) \{F(x_{[i](j)s}; \varvec{\Psi })\}^{j-1} \{\bar{F}(x_{[i](j)s}; \varvec{\Psi })\}^{m-j} \{F^{[j]}(y_{(i)[j]s}; \varvec{\Psi })\}^{i-1} \{\bar{F}^{[j]}(y_{(i)[j]s}; \varvec{\Psi })\}^{m-i}}\\
&\qquad \times \prod \limits _{k=1}^K \pi _k^{w_{sk}^{(j)}} = c_2 \prod \limits _{k=1}^K \left( \frac{\pi _k F_k\big (x_{[i](j)s}; \varvec{\xi }_k\big )}{F\big (x_{[i](j)s}; \varvec{\Psi }\big )} \right) ^{w_{sk}^{(j)}}.
\end{aligned}$$

Similarly, \(f\big (\left. {v_{sk}^{(j)}} \right| {x_{[i](j)s}},{y_{(i)[j]s}},\varvec{\Psi } \big ) = {c_3}\prod \limits \nolimits _{k = 1}^K {{{\left( {{{\pi _k}{{\bar{F}}_k}({x_{[i](j)s}};{\varvec{\xi }_k})} \over {\bar{F}({x_{[i](j)s}};\varvec{\Psi }) }}\right) }^{v_{sk}^{(j)}}}}\), \(f\big (\left. {u_{sk}^{(i)}} \right| {x_{[i](j)s}},{y_{(i)[j]s}},\varvec{\Psi } \big )={c_4}\prod \limits \nolimits _{k = 1}^K {{{\left( {{{\pi _k}F_k^{[j]} ({y_{(i)[j]s}};{\varvec{\xi }_k})} \over {{F^{[j]}}({y_{(i)[j]s}};\varvec{\Psi }) }}\right) }^{u_{sk}^{(i)}}}}\), and \(f\big (\left. {d_{sk}^{(i)}} \right| {x_{[i](j)s}},{y_{(i)[j]s}},\varvec{\Psi } \big ) = {c_5}\prod \limits \nolimits _{k = 1}^K {{{\left( {{{\pi _k}{\bar{F}}_k^{[j]} ({y_{(i)[j]s}};{\varvec{\xi }_k})} \over {{{\bar{F}}^{[j]}}({y_{(i)[j]s}};\varvec{\Psi }) }}\right) }^{d_{sk}^{(i)}}}}\).
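The ratios inside these products, such as \(\pi _k f_k / f\) and \(\pi _k F_k / F\), are the cell probabilities of the latent membership variables. As a minimal numerical sketch, the snippet below evaluates the first two ratios under the assumption that each component is bivariate normal, so that \(F_k\) is the normal marginal cdf of the first coordinate. The component family and the function names `z_probs` and `w_probs` are illustrative assumptions and are not taken from the article.

```python
import math
import numpy as np

def bvn_pdf(x, y, mean, cov):
    """Bivariate normal density at (x, y); Gaussian components are an assumption."""
    diff = np.array([x, y], dtype=float) - np.asarray(mean, dtype=float)
    cov = np.asarray(cov, dtype=float)
    quad = diff @ np.linalg.solve(cov, diff)
    return math.exp(-0.5 * quad) / (2.0 * math.pi * math.sqrt(np.linalg.det(cov)))

def norm_cdf(x, mu, sigma):
    """Univariate normal cdf via the error function."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def z_probs(x, y, pis, means, covs):
    """pi_k f_k(x, y) / f(x, y) for k = 1..K (hypothetical helper)."""
    num = np.array([p * bvn_pdf(x, y, m, c) for p, m, c in zip(pis, means, covs)])
    return num / num.sum()

def w_probs(x, pis, means, covs):
    """pi_k F_k(x) / F(x), with F_k the marginal cdf of the first coordinate."""
    num = np.array([p * norm_cdf(x, m[0], math.sqrt(c[0][0]))
                    for p, m, c in zip(pis, means, covs)])
    return num / num.sum()

# Two symmetric components with equal weights (illustrative values only).
pis = [0.5, 0.5]
means = [(-1.0, -1.0), (1.0, 1.0)]
covs = [np.eye(2), np.eye(2)]
z = z_probs(0.0, 0.0, pis, means, covs)   # equal densities at the origin
w = w_probs(0.0, pis, means, covs)        # units below x favour component 1
```

At the midpoint between the two components, `z` is split evenly, while `w` puts more mass on the component with the smaller mean, since units ranked below the observed value are more likely to come from it.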

Since the latent variables are mutually independent, we have

$$\begin{aligned}&f\big ({x_{[i](j)s}},{y_{(i)[j]s}}, \varvec{z}_s^{(ij)}, \varvec{w}_s^{(j)}, \varvec{v}_s^{(j)}, \varvec{u}_s^{(i)}, \varvec{d}_s^{(i)}; \varvec{\Psi }\big ) \\&\quad = f \big (\left. {z_{sk}^{(ij)},w_{sk}^{(j)},v_{sk}^{(j)},u_{sk}^{(i)} ,d_{sk}^{(i)}} \right| {x_{[i](j)s}},{y_{(i)[j]s}},\varvec{\Psi } \big ) {f^{[i](j),(i)[j]}}\big ({x_{[i](j)s}},{y_{(i)[j]s}},\varvec{\Psi } \big )\\&\quad = f \big (\left. {z_{sk}^{(ij)}} \right| {x_{[i](j)s}},{y_{(i)[j]s}},\varvec{\Psi } \big ) f \big (\left. {w_{sk}^{(j)}} \right| {x_{[i](j)s}},{y_{(i)[j]s}},\varvec{\Psi } \big ) f \big (\left. {v_{sk}^{(j)}} \right| {x_{[i](j)s}},{y_{(i)[j]s}},\varvec{\Psi } \big )\\&\qquad \times f\big (\left. {u_{sk}^{(i)}} \right| {x_{[i](j)s}},{y_{(i)[j]s}},\varvec{\Psi } \big ) f\big (\left. {d_{sk}^{(i)}} \right| {x_{[i](j)s}},{y_{(i)[j]s}},\varvec{\Psi } \big ){f^{[i](j),(i)[j]}} \big ({x_{[i](j)s}},{y_{(i)[j]s}},\varvec{\Psi } \big ) \\&\quad = c_1 c_2 c_3 c_4 c_5 \Bigg \{ \prod \limits _{k = 1}^K{ {\pi _k^{ z_{sk}^{(ij)} + w_{sk}^{(j)} + v_{sk}^{(j)} +u_{sk}^{(i)} + d_{sk}^{(i)} } } {\big \{ {f_k}\big ({x_{[i](j)s}},{y_{(i)[j]s}};\varvec{\xi }_k \big )\big \} ^{z_{sk}^{(ij)}}}} \\&\qquad \times {{\big \{ {F_k}\big ({x_{[i](j)s}};\varvec{\xi }_k \big )\big \} ^{w_{sk}^{(j)}}}{\big \{ {{\bar{F}}_k}\big ({x_{[i](j)s}};\varvec{\xi }_k \big )\big \} ^{v_{sk}^{(j)}}}{\big \{ F_k^{[j]}\big ({y_{(i)[j]s}};\varvec{\xi }_k \big )\big \} ^{u_{sk}^{(i)}}}{\big \{ \bar{F}_k^{[j]}\big ({y_{(i)[j]s}};\varvec{\xi }_k \big )\big \} ^{d_{sk}^{(i)}}} } \Bigg \}. \end{aligned}$$

Appendix A.2: Proof of Equation (22)

To maximize (21) with respect to \(\pi _k\), we drop the terms that do not depend on \(\pi _k\). The quantity \({Q_{{M_1}}}(\left. \varvec{\Psi } \right| {\varvec{\Psi }^{(t)}})\) then reduces to \(Q_{M_1}(\pi _k)\), given by

$$\begin{aligned} Q_{M_1}({\pi _k}) = (2m-1)\sum \limits _{s = 1}^r {\sum \limits _{j = 1}^m {\sum \limits _{i = 1}^m {\sum \limits _{k = 1}^K {{\eta _{k,{M_1}}}({x_{[i](j)s}},{y_{(i)[j]s}};{\varvec{\Psi }^{(t)}}) \log {\pi _k} } } } }, \end{aligned}$$

where \({\eta _{k,{M_1}}}({x_{[i](j)s}},{y_{(i)[j]s}};{\varvec{\Psi }^{(t)}})\) is defined in (23). We use the Lagrange multiplier method under the constraint \(\sum \nolimits _{k=1}^K {\pi _k} = 1\). The Lagrangian function is

$$\begin{aligned} {\mathcal {L}}({\pi _k},\lambda ) = {Q_{{M_1}}}({\pi _k}) - \lambda (\sum \limits _{k = 1}^K {{\pi _k}} - 1). \end{aligned}$$

Setting the partial derivative of \({\mathcal {L}}({\pi _k},\lambda )\) with respect to \(\pi _k\) equal to zero, and absorbing the constant factor \(2m-1\) into \(\lambda \), we obtain

$$\begin{aligned} \pi _{k,{M_1}}^{(t + 1)} = {1 \over \lambda }\sum \limits _{s = 1}^r {\sum \limits _{j = 1}^m {\sum \limits _{i = 1}^m {{\eta _{k,{M_1}}}({x_{[i](j)s}},{y_{(i)[j]s}};{\varvec{\Psi }^{(t)}})} } } , \end{aligned}$$

where, by the constraint \(\sum \nolimits _{k=1}^K {\pi _k} = 1\) and (16)–(20), we have

$$\begin{aligned} \lambda&= \sum \limits _{s = 1}^r {\sum \limits _{j = 1}^m {\sum \limits _{i = 1}^m {\sum \limits _{k = 1}^K {{\eta _{k,{M_1}}}({x_{[i](j)s}},{y_{(i)[j]s}};{\varvec{\Psi }^{(t)}})} } } }\\&= \sum \limits _{s = 1}^r {\sum \limits _{j = 1}^m {\sum \limits _{i = 1}^m {1 \over {(2m-1)}} \big \{1+ (j-1) + (m-j) + (i-1) + (m-i)\big \} }} \\&= rm^2. \end{aligned}$$
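The resulting update, \(\pi _k^{(t+1)} = (rm^2)^{-1} \sum _s \sum _j \sum _i \eta _{k,M_1}\), can be checked numerically. The sketch below is illustrative only: definition (23) is not reproduced here, so `combine_weights` assumes that \(\eta _{k,M_1}\) is the weighted average of five posterior probability vectors with weights \(1, j-1, m-j, i-1, m-i\), which is the form suggested by the normalization in the last display. The random Dirichlet inputs stand in for the E-step output, and all function names are hypothetical.

```python
import numpy as np

def combine_weights(tz, tw, tv, tu, td):
    """Assumed form of eta: weighted average of five posterior vectors.

    Each input has shape (r, m, m, K), indexed as [s, j, i, k].
    """
    r, m, _, K = tz.shape
    j = np.arange(1, m + 1).reshape(1, m, 1, 1)   # rank index j on axis 1
    i = np.arange(1, m + 1).reshape(1, 1, m, 1)   # rank index i on axis 2
    total = tz + (j - 1) * tw + (m - j) * tv + (i - 1) * tu + (m - i) * td
    return total / (2 * m - 1)

def update_mixing_proportions(eta):
    """M-step for the mixing proportions: sum eta over s, j, i and divide by lambda."""
    lam = eta.sum()                       # equals r * m**2 when eta sums to 1 over k
    return eta.sum(axis=(0, 1, 2)) / lam

# Synthetic posterior vectors standing in for the E-step output.
rng = np.random.default_rng(0)
r, m, K = 3, 4, 2
taus = [rng.dirichlet(np.ones(K), size=(r, m, m)) for _ in range(5)]
eta = combine_weights(*taus)
pi_new = update_mixing_proportions(eta)
```

With these inputs, `eta.sum()` matches \(rm^2 = 48\) up to floating-point error and `pi_new` sums to one, in line with the derivation.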

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Haji Aghabozorgi, H., Eskandari, F. Clustering and estimation of finite mixture models under bivariate ranked set sampling with application to a breast cancer study. Stat Papers 65, 705–736 (2024). https://doi.org/10.1007/s00362-023-01411-6

  • DOI: https://doi.org/10.1007/s00362-023-01411-6
