Abstract
On paper, prevention appears to be a good complement to health insurance. However, its implementation is often costly. To maximize the impact and efficiency of prevention plans, plans should target particular groups of policyholders. In this article, we propose a way of clustering policyholders that could be a starting point for the targeting of prevention plans. This two-step method considers mainly policyholder health consumption for classification. The dimension is first reduced using a nonnegative matrix factorization algorithm, producing intermediate health product clusters. Policyholders are then clustered using Kohonen’s map algorithm. This leads to a natural visualization of the results, allowing the simple comparison of results from different databases. The method is applied to two real French health insurer datasets. The method is shown to be easily understandable and able to cluster most policyholders efficiently.
Notes
The term health product is used for every item of health expenditure that may be refunded by the insurer (such as GP visits, nights at the hospital, medication and glasses).
The legal retirement age in France is 62.
More precisely, if H denotes the frequency matrix as described above, the matrix \(\log(H + 1)\) is computed.
Six different implementations have been tested: those proposed by Lee and Seung (Lee, [37]), Brunet et al. (Brunet, [9]), Pascual-Montano et al. (nsNMF, [43]) and Badea (Offset, [4]) and the two proposed by Kim and Park (snmf/l and snmf/r, [31]). The “snmf/l” algorithm yields among the best results, while being significantly faster. Implementations from the R package “NMF”, developed by Gaujoux and Seoighe [19], were used in the analysis presented here.
The R package “Kohonen”, developed by Wehrens et al. ([55]), was used in the analysis presented here.
A well-known alternative is to choose the starting points via a PCA; however, Akinduko et al. show that this is not suitable for non-linear datasets [3].
For the databases used here, dimension reduction dramatically improves clustering.
The health product “Legal copayment” may be unfamiliar to the reader. In the French health system, many health products are partially reimbursed by the public insurer, “l’Assurance Maladie”. The price of health products is fixed by law (for example, a GP consultation costs 25 Euros). However, to limit health consumption, the public insurer does not refund all of this amount (only 16.5 Euros for GPs). Here, we call the gap between the two amounts (25 − 16.5 = 8.5 Euros for a GP consultation) the “legal copayment”. Moreover, GPs are allowed to charge higher fees that are not covered by the public insurer. The reimbursement of the legal copayment is usually covered by private insurance.
An individual can need glasses and have an operation in the same year. Such an individual would then belong to two clusters: the optic cluster and the hospitalization cluster. In this regard, the health sector is linked to a multi-label context.
In fact, almost none of the 20 HPCs made using the PCA offer satisfying consistency.
The \(R^{2}\) coefficient is given by \(R^{2} = 1 - \frac{\sum _{i=1}^{k} I_{C_{i}}}{I_{C}}\), with \(I_{C_{i}}\) denoting the inertia of the cluster \(C_{i}\). The inertia has been computed using the cosine similarity, following the recommendation of Huang [28].
See Sect. 2 for a review of the subdatabases.
According to Wikipedia, “Orthoptics is a profession allied to eye care professions whose primary emphasis is the diagnosis and non-surgical management of strabismus (wandering eyes), amblyopia (lazy eye) and eye movement disorders”.
The same definition as that of Huang [28] has been used.
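The \(R^{2}\) coefficient defined in the notes above can be illustrated with a short sketch. It assumes that the inertia of a set of points is the sum of the cosine distances (one minus the cosine similarity) between each point and the centroid of the set. The exact inertia formula is not spelled out here, so this choice, the toy data and all function names are illustrative assumptions rather than the computation used for the reported results.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def centroid(points):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(points)
    return [sum(column) / n for column in zip(*points)]


def inertia(points):
    """Assumed inertia: sum of cosine distances to the centroid."""
    center = centroid(points)
    return sum(cosine_distance(p, center) for p in points)


def r_squared(clusters):
    """R^2 = 1 - (sum of within-cluster inertias) / (total inertia)."""
    all_points = [p for cluster in clusters for p in cluster]
    within = sum(inertia(cluster) for cluster in clusters)
    return 1.0 - within / inertia(all_points)


# Hypothetical toy data: two groups pointing in clearly different directions.
clusters = [
    [[5.0, 1.0, 0.0], [4.0, 0.5, 0.0]],
    [[0.0, 1.0, 6.0], [0.0, 0.5, 7.0]],
]
r2 = r_squared(clusters)
```

With well-separated clusters, the within-cluster inertia is small compared with the total inertia and the coefficient is close to 1. Placing all points in a single cluster gives a coefficient of 0.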
References
Aggarwal CC, Yu PS (2000) Finding generalized projected clusters in high dimensional spaces, vol. 29. ACM
Agrawal R, Gehrke J, Gunopulos D, Raghavan P (1998) Automatic subspace clustering of high dimensional data for data mining applications, vol. 27. ACM
Akinduko AA, Mirkes EM, Gorban AN (2016) SOM: Stochastic initialization versus principal components. Inform Sci 364:213–221
Badea L (2008) Extracting gene expression profiles common to colon and pancreatic adenocarcinoma using simultaneous nonnegative matrix factorization. In: Biocomputing 2008, pp. 267–278. World Scientific
Beaulieu N, Cutler DM, Ho K, Isham G, Lindquist T, Nelson A, O’Connor P (2006) The business case for diabetes disease management for managed care organizations. In: Forum for Health Economics & Policy, vol. 9. De Gruyter
Beyer K, Goldstein J, Ramakrishnan R, Shaft U (1999) When is nearest neighbor meaningful? In: International conference on database theory, pp. 217–235. Springer, Berlin
Boutsidis C, Gallopoulos E (2008) SVD based initialization: a head start for nonnegative matrix factorization. Pattern Recogn 41(4):1350–1362
Brockett PL, Xia X, Derrig RA (1998) Using Kohonen’s self-organizing feature map to uncover automobile bodily injury claims fraud. J Risk Insur pp. 245–274
Brunet JP, Tamayo P, Golub TR, Mesirov JP (2004) Metagenes and molecular pattern discovery using matrix factorization. Proc Natl Acad Sci 101(12):4164–4169
Bühlmann H, Gisler A (2006) A course in credibility theory and its applications. Springer Science & Business Media, Berlin
Cardoso-Cachopo A (2007) Improving Methods for Single-label Text Categorization. PhD thesis, Instituto Superior Tecnico, Universidade Tecnica de Lisboa
Cheng CH, Fu AW, Zhang Y (1999) Entropy-based subspace clustering for mining numerical data. In: Proceedings of the fifth ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 84–93. ACM
Darblade M (2015) Analyse de profils de consommation et tarification des futures garanties sur-complémentaire santé. Master’s thesis, ISFA
Dargent-Molina P, Cassou B (2008) Prévention des chutes et des fractures chez les femmes âgées. Gérontologie et société 31(2):65–78
Deerwester S, Dumais ST, Furnas GW, Landauer TK, Harshman R (1990) Indexing by latent semantic analysis. J Am Soc Inform Sci 41(6):391–407
Derrig RA, Ostaszewski KM (1995) Fuzzy techniques of pattern recognition in risk and claim classification. J Risk Insur 62(3):447–482
Ding C, He X (2004) K-means clustering via principal component analysis. In: Proceedings of the twenty-first international conference on Machine learning, p. 29. ACM
Gauchon R, Hermet JP (2019) La psychiatrie: un risque important en assurance santé?
Gaujoux R, Seoighe C (2010) A flexible R package for nonnegative matrix factorization. BMC Bioinform 11(1):367
Ghoreyshi S, Hosseinkhani J (2015) Developing a clustering model based on k-means algorithm in order to creating different policies for policyholders in insurance industry. Int J Adv Comput Sci Inf Technol (IJACSIT) 4(2):46–53
Hainaut D (2019) A self-organizing predictive map for non-life insurance. Eur Actuar J 9(1):173–207
Henckaerts R, Antonio K, Clijsters M, Verbelen R (2018) A data driven binning strategy for the construction of insurance tariff classes. Scand Actuar J 2018(8):681–705
Herring B (2010) Suboptimal provision of preventive healthcare due to expected enrollee turnover among private insurers. Health Econ 19(4):438–448
Hinneburg A, Keim DA (1999) Optimal grid-clustering: Towards breaking the curse of dimensionality in high-dimensional clustering. In: 25th International Conference on Very Large Databases, pp. 506–517
Hinton GE, Salakhutdinov RR (2009) Replicated softmax: an undirected topic model. In: Y. Bengio, D. Schuurmans, J.D. Lafferty, C.K.I. Williams, A. Culotta (eds.) Advances in Neural Information Processing Systems 22, pp. 1607–1614. Curran Associates, Inc.
Hoyer PO (2004) Non-negative matrix factorization with sparseness constraints. J Mach Learn Res 5:1457–1469
Hoyle D, Rattray M (2003) PCA learning for sparse high-dimensional data. EPL (Europhys Lett) 62(1):117
Huang A (2008) Similarity measures for text document clustering. In: Proceedings of the sixth New Zealand computer science research student conference (NZCSRSC2008), Christchurch, New Zealand, pp. 49–56
Indyk P, Motwani R (1998) Approximate nearest neighbors: towards removing the curse of dimensionality. In: Proceedings of the thirtieth annual ACM symposium on Theory of computing, pp. 604–613. ACM
Jones BW, Chung W (2016) Topic modeling of small sequential documents: Proposed experiments for detecting terror attacks. In: Intelligence and Security Informatics (ISI), 2016 IEEE Conference on, pp. 310–312. IEEE
Kim H, Park H (2007) Sparse non-negative matrix factorizations via alternating non-negativity-constrained least squares for microarray data analysis. Bioinformatics 23(12):1495–1502
Kohonen T (1990) The self-organizing map. Proc IEEE 78(9):1464–1480
Kuang D, Choo J, Park H (2015) Nonnegative matrix factorization for interactive topic modeling and document clustering. Partitional clustering algorithms. Springer, Berlin, pp 215–243
Kuo R, Lin S, Shih C (2007) Mining association rules through integration of clustering analysis and ant colony system for health insurance database in Taiwan. Expert Syst Appl 33(3):794–808
Langville AN, Meyer CD, Albright R, Cox J, Duling D (2006) Initializations for the nonnegative matrix factorization. In: Proceedings of the twelfth ACM SIGKDD international conference on knowledge discovery and data mining, pp. 23–26. Citeseer
Lawrence MS, Stojanov P, Polak P, Kryukov GV, Cibulskis K, Sivachenko A, Carter SL, Stewart C, Mermel CH, Roberts SA, Kiezun A (2013) Mutational heterogeneity in cancer and the search for new cancer-associated genes. Nature 499(7457):214
Lee DD, Seung HS (1999) Learning the parts of objects by non-negative matrix factorization. Nature 401(6755):788
Lee DD, Seung HS (2001) Algorithms for non-negative matrix factorization. In: Advances in neural information processing systems, pp. 556–562
Mote SR, Baid UR, Talbar SN (2017) Non-negative matrix factorization and self-organizing map for brain tumor segmentation. In: Wireless Communications, Signal Processing and Networking (WiSPNET), 2017 International Conference on, pp. 1133–1137. IEEE
Murtagh F (1995) Interpreting the Kohonen self-organizing feature map using contiguity-constrained clustering. Pattern Recogn Lett 16(4):399–408
Nesvijevskaia A, Taudou B (2016) La data science au service de la prévention santé et prévoyance : nouveaux paradigmes. 17ème rencontre Mutré, 14–15 November, Nantes. Tech. rep., Malakoff Mederic
Paatero P, Tapper U (1994) Positive matrix factorization: A non-negative factor model with optimal utilization of error estimates of data values. Environmetrics 5(2):111–126
Pascual-Montano A, Carazo JM, Kochi K, Lehmann D, Pascual-Marqui RD (2006) Nonsmooth nonnegative matrix factorization (nsnmf). IEEE Trans Pattern Anal Mach Intell 28(3):403–415
Pauca VP, Piper J, Plemmons RJ (2006) Nonnegative matrix factorization for spectral data analysis. Linear Algebra Appl 416(1):29–47
Pauca VP, Shahnaz F, Berry MW, Plemmons RJ (2004) Text mining using non-negative matrix factorizations. In: Proceedings of the 2004 SIAM International Conference on Data Mining, pp. 452–456. SIAM
Peng Y, Kou G, Sabatka A, Chen Z, Khazanchi D, Shi Y (2006) Application of clustering methods to health insurance fraud detection. In: Service Systems and Service Management, 2006 International Conference on, vol. 1, pp. 116–120. IEEE
Rennie JD, Shih L, Teevan J, Karger DR (2003) Tackling the poor assumptions of naive Bayes text classifiers. In: Proceedings of the 20th international conference on machine learning (ICML-03), pp. 616–623
Robertson S (2004) Understanding inverse document frequency: on theoretical arguments for IDF. J Doc 60(5):503–520
Rousseeuw PJ (1987) Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. J Comput Appl Math 20:53–65
Settles B, Craven M, Ray S (2008) Multiple-instance active learning. In: Advances in neural information processing systems, pp. 1289–1296
Utsumi A (2010) Evaluating the performance of nonnegative matrix factorization for constructing semantic spaces: Comparison to latent semantic analysis. In: 2010 IEEE International Conference on Systems, Man and Cybernetics, pp. 2893–2900. IEEE
Van Benthem MH, Keenan MR (2004) Fast algorithm for the solution of large-scale non-negativity-constrained least squares problems. J Chemometr 18(10):441–450
Verrall RJ, Yakoubov YH (1999) A fuzzy approach to grouping by policyholder age in general insurance. J Actuar Pract 7:181–204
Wang D, Cui P, Zhu W (2016) Structural deep network embedding. In: Proceedings of the 22nd ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 1225–1234. ACM
Wehrens R, Buydens LM et al (2007) Self- and super-organizing maps in R: the Kohonen package. J Stat Softw 21(5):1–19
World Health Organization (2017) Depression and other common mental disorders: global health estimates. No. WHO/MSD/MER/2017.2
Wu B, Wang E, Zhu Z, Chen W, Xiao P (2018) Manifold NMF with L21 norm for clustering. Neurocomputing 273:78–88
Xu H, Caramanis C, Sanghavi S (2010) Robust PCA via outlier pursuit. In: Advances in Neural Information Processing Systems, pp. 2496–2504
Xu L, Yuille AL (1995) Robust principal component analysis by self-organizing rules based on statistical physics approach. IEEE Trans Neural Netw 6(1):131–143
Yeo AC, Smith KA, Willis RJ, Brooks M (2001) Clustering technique for risk classification and prediction of claim costs in the automobile insurance industry. Intell Syst Account Finan Manag 10(1):39–50
Zhang T, Ramakrishnan R, Livny M (1996) BIRCH: an efficient data clustering method for very large databases. In: ACM Sigmod Record, vol. 25, pp. 103–114. ACM
Acknowledgements
The authors would like to thank Alexandra Barral for useful comments over the duration of this research, and Nabil Rachdi for technical advice. They are also grateful to Addactis in France for providing the data, and to everyone (including two reviewers) who reread the paper. This research was carried out in the framework of the Chair Prevent'Horizon, supported by the risk foundation Louis Bachelier and in partnership with Claude Bernard Lyon 1 University, Addactis in France, AG2R La Mondiale, G2S, Covea, Groupama Gan Vie, Groupe Pasteur Mutualité, Harmonie Mutuelle, Humanis Prévoyance and La Mutuelle Générale. S. Loisel acknowledges support from the IDR Actuariat Durable sponsored by Milliman Paris, and the DAMI research chair sponsored by BNP Paribas Cardif.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendix
1.1 Appendix 1: Comparing the clusters obtained using the proposed method with those obtained using a very basic approach
1.2 Appendix 2: An example of another map obtained from the same data
See Fig. 14.
1.3 Appendix 3: Test of the algorithm on a database with known clusters
While several tests have been carried out to assess the relevance of the results obtained in health insurance databases, it is impossible to compute an objective error metric because the underlying clusters are unknown. To carry out this kind of test, it is thus necessary to use a dataset from another field. For example, the text mining field offers many databases with known clusters. Moreover, it is common practice in this field to work with a word frequency matrix, making it realistic to apply the NMF/Kohonen process to a text-mining dataset.
The 20-newsgroups dataset is chosen to perform this test. This is a well-known text-mining dataset and has been used for various text-mining tasks, such as word embedding (e.g., [25, 54]), unsupervised clustering (e.g., [17, 28]) and supervised classification (e.g., [47, 50]). The training subset has been used, representing 11 293 texts from 20 different newsgroups. For this study, we use the dataset as pre-processed by Cardoso-Cachopo [11] (the “no-short” dataset). The objective is to find the original newsgroup of each document.
As text mining is not one of the goals of this paper, the results presented below come from the first run of the algorithm, without attempts to calibrate the model or improve the results. The dimension is first reduced to 60 before clustering, and the frequency matrix is pre-processed using the tf-idf method, which is a common practice in text mining. Since the 20-newsgroups dataset contains 20 different natural clusters, the HAC has been calibrated to obtain 20 different classes.
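The tf-idf pre-processing mentioned above can be sketched on a toy word-count matrix. The sketch uses the common variant tf × log(N / df), with the term frequency normalized by document length. The exact variant used for the test is not stated, so this formula, the toy counts and the function name are assumptions made for illustration only.

```python
import math


def tf_idf(counts):
    """Weight a documents x terms count matrix by tf * log(N / df)."""
    n_docs = len(counts)
    n_terms = len(counts[0])
    # Document frequency: number of documents that contain each term.
    df = [sum(1 for row in counts if row[j] > 0) for j in range(n_terms)]
    # Terms that never occur get an idf of 0 to avoid dividing by zero.
    idf = [math.log(n_docs / d) if d > 0 else 0.0 for d in df]
    weighted = []
    for row in counts:
        total = sum(row)
        if total == 0:
            # An empty document keeps an all-zero row.
            weighted.append([0.0] * n_terms)
            continue
        weighted.append([(c / total) * w for c, w in zip(row, idf)])
    return weighted


# Hypothetical counts: 3 documents (rows) and 3 terms (columns).
counts = [
    [2, 1, 0],
    [1, 0, 3],
    [1, 0, 0],
]
weights = tf_idf(counts)
```

In this toy example the first term appears in every document, so its weight is zero everywhere, while terms concentrated in a single document receive the largest weights.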
From the Kohonen map (Fig. 15), it is possible to see that clusters 2 and 10 are spread out. Moreover, clusters 6 and 9 seem significantly larger than the others. Their purity score confirms that they are less homogeneous than the other clusters (purity is shown in Fig. 17). Except for these four clusters and cluster 18, purity is acceptable. The overall purity is 62%, and the total entropy (footnote 15) is 0.4, which is significantly better than the results obtained by Huang from the same dataset [28], even though achieving a good score was not the aim here.
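The purity and entropy scores discussed above can be computed as in the following sketch. Purity is the share of documents that belong to the majority class of their cluster. Entropy is taken here as the size-weighted average of the per-cluster class entropy, normalized by the logarithm of the number of true classes. This normalization is assumed to be in line with the definition of Huang [28] but is not confirmed by the text, and the function name and toy labels are hypothetical.

```python
import math
from collections import Counter


def purity_and_entropy(cluster_ids, true_labels):
    """Return (overall purity, overall normalized entropy) of a clustering."""
    n = len(cluster_ids)
    n_classes = len(set(true_labels))
    # Count the true labels found inside each cluster.
    by_cluster = {}
    for cluster, label in zip(cluster_ids, true_labels):
        by_cluster.setdefault(cluster, Counter())[label] += 1
    purity = 0.0
    entropy = 0.0
    for counts in by_cluster.values():
        size = sum(counts.values())
        # Purity contribution: size of the majority class in this cluster.
        purity += max(counts.values()) / n
        # Class entropy inside the cluster (natural logarithm).
        h = -sum((k / size) * math.log(k / size) for k in counts.values())
        if n_classes > 1:
            h /= math.log(n_classes)  # assumed normalization to [0, 1]
        entropy += (size / n) * h
    return purity, entropy


# Hypothetical example: 6 documents, 2 clusters, 2 true classes.
cluster_ids = [0, 0, 0, 1, 1, 1]
true_labels = ["a", "a", "b", "b", "b", "b"]
purity, entropy = purity_and_entropy(cluster_ids, true_labels)
```

A high purity together with a low entropy indicates homogeneous clusters. A perfect clustering gives a purity of 1 and an entropy of 0.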
Comparing Figs. 16 and 17, even though the algorithm does not identify all of the documents in a given cluster, the resulting clusters are still reliable. This means that if one wants to identify all of the policyholders with psychiatric medication (for example), this algorithm is not very appropriate. However, if a psychiatric class is identified, it is reliable enough to justify the targeting of a prevention plan.
To summarize, the method produces acceptable results for the 20-newsgroups dataset. Most of the clusters represent a specific newsgroup. However, the method cannot differentiate between very similar newsgroups, such as those devoted to IBM and Mac computers. This produces large clusters containing most of the documents that the method cannot differentiate.
This clustering method is thus able to construct meaningful policyholder clusters. However, large classes (such as the everyday-care cluster) are heterogeneous and should not be used to target prevention plans: they contain policyholders who cannot be differentiated by the algorithm.
Cite this article
Gauchon, R., Loisel, S. & Rullière, JL. Health policyholder clustering using medical consumption. Eur. Actuar. J. 10, 599–626 (2020). https://doi.org/10.1007/s13385-020-00244-z
DOI: https://doi.org/10.1007/s13385-020-00244-z