
Simultaneous dimension reduction and clustering via the NMF-EM algorithm


Abstract

Mixture models are among the most popular tools for clustering. However, when the dimension and the number of clusters are large, estimating the clusters becomes challenging, as does their interpretation. Restrictions on the parameters can be used to reduce the dimension; an example is given by mixtures of factor analyzers (MFA) for Gaussian mixtures. The extension of MFA to non-Gaussian mixtures is not straightforward. We propose a new constraint on the parameters of non-Gaussian mixture models: the K component parameters are combinations of elements from a small dictionary of, say, H elements, with \(H \ll K\). Including a nonnegative matrix factorization (NMF) step in the EM algorithm allows us to estimate the dictionary and the parameters of the mixture simultaneously. We propose the acronym NMF-EM for this algorithm, which is implemented in the R package nmfem. This approach is motivated by the clustering of passengers from ticketing data: we apply NMF-EM to data from two Transdev public transport networks. In this case, the words are easily interpreted as typical slots in a timetable.
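In matrix form, and with illustrative notation (not necessarily that of the paper), the constraint can be written as follows: gathering the \(D\)-dimensional component parameters \(\theta_1,\dots,\theta_K\) column-wise in a matrix \(\Theta \in \mathbb{R}_+^{D\times K}\), we impose

\[ \Theta = \Lambda W, \qquad \Lambda \in \mathbb{R}_+^{D\times H}, \quad W \in \mathbb{R}_+^{H\times K}, \quad H \ll K, \]

so that each \(\theta_k = \sum_{h=1}^{H} w_{hk}\,\lambda_h\) is a nonnegative combination of the \(H\) dictionary elements (the "words") \(\lambda_1,\dots,\lambda_H\).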



Acknowledgements

We would like to thank the anonymous Referees and the Associate Editor for their constructive comments and suggestions. We also thank Denis COUTROT and Nadir MEZIANI from Transdev for their support and comments on previous versions of this work.

Author information


Corresponding author

Correspondence to Pierre Alquier.

Additional information


Léna Carel: This paper was written when the first author was a Ph.D. student at ENSAE Paris funded by the Transdev Group. Both authors acknowledge the Transdev Group for funding, and for providing the data used in this paper.

Appendix

1.1 Analysis of the clusters of users

As noted above, we have no personal information in our data, so we cannot describe the users in each cluster individually. However, for each transaction made, we have the encrypted card number and the transport ticket used, so we can recover, for each card, the most used transport ticket during the period. This provides interesting information, as some fare schemes are associated with age ranges (Young, Senior...) and with time periods (Unit, Annual or Monthly Subscription). Let us now describe each cluster in terms of age ranges (Fig. 8a–c).
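As a minimal illustration of this step, the base-R sketch below recovers the most used ticket type per card; the table and column names (transactions, card_id, ticket_type) are hypothetical and do not come from the actual Transdev data.

```r
# Hypothetical transaction log: one row per validation, with the encrypted
# card number and the fare product (ticket type) used for that validation.
transactions <- data.frame(
  card_id     = c("A", "A", "A", "B", "B"),
  ticket_type = c("Annual", "Annual", "Unit", "Senior", "Senior")
)

# For each card, keep the most frequently used ticket type over the period.
most_used   <- function(x) names(which.max(table(x)))
main_ticket <- tapply(transactions$ticket_type, transactions$card_id, most_used)
main_ticket  # A: "Annual", B: "Senior"
```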

Fig. 8 Age range analysis of the clusters

Adults are more present in clusters 7 and 9, which are clusters with check-ins mostly in the morning. People benefiting from half-price fares are present in every cluster, but with the highest rates in clusters 2, 3, 4 and 5. Children (4–6) are not very present on the network, but they are more represented in clusters 1, 5 and 9. Young travelers (6–25) are more present in clusters 1 and 4; these clusters correspond to school time slots. Clusters 8 and 10 contain large rates of seniors and free travelers. As these clusters have profiles with diffuse travel during the week, and as free travelers are unemployed or low-income people, these groupings make sense.

Figure 9 shows the distribution of transport ticket types across clusters. Unit products are mostly used in clusters 8 and 10, which are the clusters with many seniors and free travelers. As they do not have work or school obligations, they likely use unit products for occasional trips. Clusters 1, 3, 4 and 9, which have mostly school-related profiles, show a large majority of annual subscribers. A possible interpretation is that schoolchildren and students are captive users of public transportation and have to use the network to go to class every day; buying an annual pass is then more advantageous than buying any other product type.

Fig. 9 Transportation ticket type analysis of the clusters

As described in Sect. 4.1, we kept only users whose first trip of the day is made at the same station at least \(50\%\) of the study time. That main "morning station" is called the "home station", as it gives us an estimate of where users live. Figures 10 and 11 show the shares of clusters by home station, that is, the share of travelers identified as belonging to each cluster living near each station.
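A minimal base-R sketch of this filtering rule is given below; the column names are hypothetical, and the actual preprocessing of Sect. 4.1 may differ in its details.

```r
# Hypothetical check-in table: card, calendar day, time and boarding station.
checkins <- data.frame(
  card_id = c("A", "A", "A", "A", "B", "B"),
  day     = as.Date(c("2017-03-06", "2017-03-06", "2017-03-07", "2017-03-08",
                      "2017-03-06", "2017-03-07")),
  time    = c("07:45", "17:30", "07:50", "08:01", "09:10", "14:00"),
  station = c("S1", "S9", "S1", "S2", "S3", "S4")
)

# First check-in of each (card, day): the "morning station" of that day.
checkins     <- checkins[order(checkins$card_id, checkins$day, checkins$time), ]
first_of_day <- checkins[!duplicated(checkins[, c("card_id", "day")]), ]

# Share of days spent at the modal morning station, for each card.
home_share <- tapply(first_of_day$station, first_of_day$card_id,
                     function(s) max(table(s)) / length(s))

# Keep only the cards whose main morning station covers at least 50% of the days;
# that station is the "home station" of the user.
kept_cards <- names(home_share)[home_share >= 0.5]
```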

Fig. 10 Share of clusters per home station: clusters 1–6

Fig. 11 Share of clusters per home station: clusters 7–10

We note that:

  1. Cluster 1: travelers are over-represented at peripheral stations.

  2. Cluster 2: no particular pattern observed.

  3. Cluster 3: no particular pattern observed.

  4. Cluster 4: a few stations show an over-representation of cluster 4.

  5. Cluster 5: over-representation of the cluster at two stations in the north.

  6. Cluster 6: no particular pattern observed.

  7. Cluster 7: one station is \(100\%\) represented by cluster 7; as only one user is assigned to this station, no particular pattern is observed.

  8. Cluster 8: the cluster is over-represented at one station in the city center and at another further away.

  9. Cluster 9: cluster 9 is over-represented at a few stations in the center.

  10. Cluster 10: the cluster is over-represented in the poorest neighborhoods of the city.

1.2 Stations profile clustering

Clustering the different stations of the network allows us to better understand the different types of stations and to group them by temporal similarity. As we have a small number of stations (475), it is not safe to proceed as described above for the users clustering; indeed, a K larger than 6 or 7 leads to very small clusters. Instead, we fixed H and K a priori to 3 and 5, respectively.
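For concreteness, here is a minimal base-R sketch of the NMF-EM principle for a multinomial mixture: an E-step computing the responsibilities, followed by an M-step in which a few multiplicative (KL-type) NMF updates fit \(\Lambda W\) to the matrix of expected counts. It only illustrates the idea; it is not the implementation of the nmfem package, and the function name nmf_em below is ours.

```r
# x: n x D matrix of nonnegative counts (e.g. check-ins per time slot),
# K: number of clusters, H: size of the dictionary (H << K).
nmf_em <- function(x, K, H, n_iter = 100, eps = 1e-10) {
  n <- nrow(x); D <- ncol(x)
  # Random initialisation; columns of Lambda (D x H) and W (H x K) are
  # normalised so that Theta = Lambda %*% W has columns in the simplex.
  Lambda <- matrix(runif(D * H), D, H); Lambda <- sweep(Lambda, 2, colSums(Lambda), "/")
  W      <- matrix(runif(H * K), H, K); W      <- sweep(W, 2, colSums(W), "/")
  pi_k   <- rep(1 / K, K)
  for (it in 1:n_iter) {
    Theta <- Lambda %*% W                               # D x K component parameters
    # E-step: responsibilities under the multinomial mixture (log scale).
    logp <- x %*% log(Theta + eps)
    logp <- sweep(logp, 2, log(pi_k), "+")
    logp <- logp - apply(logp, 1, max)
    resp <- exp(logp); resp <- resp / rowSums(resp)     # n x K
    # M-step: mixture weights, then expected counts per cluster.
    pi_k <- colMeans(resp)
    A <- t(x) %*% resp                                  # D x K expected counts
    # A few multiplicative KL-NMF updates so that Lambda %*% W fits A.
    for (j in 1:5) {
      Th     <- Lambda %*% W + eps
      Lambda <- Lambda * ((A / Th) %*% t(W)) / (matrix(1, D, K) %*% t(W) + eps)
      Th     <- Lambda %*% W + eps
      W      <- W * (t(Lambda) %*% (A / Th)) / (t(Lambda) %*% matrix(1, D, K) + eps)
    }
    # Renormalise so that every column of Theta remains a probability vector.
    Lambda <- sweep(Lambda, 2, colSums(Lambda), "/")
    W      <- sweep(W, 2, colSums(W), "/")
  }
  list(pi = pi_k, Lambda = Lambda, W = W, Theta = Lambda %*% W, resp = resp)
}
```

For the station data of this section, a call could then look like res <- nmf_em(x, K = 5, H = 3), where x would be the matrix of check-in counts per station and time slot; the columns of res$Lambda would play the role of the words shown in Fig. 12.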

Fig. 12 Words obtained by NMF-EM on stations data with \(K=5\) and \(H=3\)

The three words obtained are shown in Fig. 12. The first time component is characterized by check-ins at 7 and 8 a.m.; we will call it the "morning component". The second time component shows check-ins at 4 and 5 p.m. on Mondays, Tuesdays, Thursdays and Fridays, and check-ins at 12 p.m. on Wednesdays; we will name it the "end of school component". The third component shows check-ins at 6 p.m., on Wednesday afternoons, on Saturdays and during off-peak periods; this component will be called the "off-peak component".

Fig. 13 Clusters obtained by NMF-EM on stations data with \(K=5\) and \(H=3\)

Figure 13 shows the five clusters. Stations in cluster 1 are stations with check-ins only in the morning, at 7 or 8 a.m.; these stations are likely in residential areas. In cluster 2, the stations have check-ins all day long, but with the highest probabilities during peaks. Stations in cluster 3 have check-ins in the morning and at the end of school; they are likely to be near schools in residential areas. Stations in cluster 4 have check-ins only at end-of-school times, so these stations are probably near schools. Finally, stations in cluster 5 are quite similar to the ones in cluster 1: a large majority of check-ins are made in the morning (7 or 8 a.m.). The only difference is that check-ins during the rest of the day are more likely in cluster 5 than in cluster 1.

Open data from the French National Institute of Statistics and Economic Studies (INSEE) allow us to introduce contextual information. Firstly, a database containing socioeconomic data on a 200 m \(\times \) 200 m grid is available; we used two of its indicators: the number of inhabitants and the percentage of households living in collective housing per tile. Secondly, we used a database referencing and geolocating every French company or administration, from which we obtained the number of employees per tile. By clustering the tiles of the study area, we obtained different groups of areas that allow us to analyse the stations more finely. Table 3 describes the mean tile of each cluster.

Table 3 Description of the tile clusters
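The clustering procedure used for the tiles is not specified in this appendix; as one plausible illustration, a k-means on the standardised tile indicators could look as follows (tile_data and its columns are hypothetical).

```r
# Hypothetical tile table built from the INSEE grid and company databases:
# inhabitants, share of collective housing and number of employees per tile.
set.seed(1)
tile_data <- data.frame(
  inhabitants        = rpois(200, 150),
  collective_housing = runif(200),
  employees          = rpois(200, 80)
)

# Standardise the indicators and cluster the tiles into four groups
# (k-means is only one plausible choice of clustering method).
km <- kmeans(scale(tile_data), centers = 4, nstart = 20)

# Mean tile per cluster, on the original scale (cf. Table 3).
aggregate(tile_data, by = list(cluster = km$cluster), FUN = mean)
```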
Fig. 14 Map of the stations; the opacity of each point is proportional to the adequacy between the station and the clusters

Fig. 15 Words obtained by NMF-EM on users data with \(K=10\) and \(H=7\)

Fig. 16 Clusters obtained by NMF-EM on users data with \(K=10\) and \(H=7\)

As the tiles contained in clusters 1 and 2 are those with the smallest number of employees, they can be described as residential areas. Moreover, the percentage of collective housing allows us to distinguish them: cluster 1 has more households living in collective housing than cluster 2. We will therefore refer to tiles from cluster 1 as residential areas with collective housing, and to tiles from cluster 2 as residential areas with individual housing. Since both the number of inhabitants and the number of employees are high, tiles from cluster 3 will be referred to as mixed areas. Finally, as the number of employees in cluster 4 is very large, we will refer to these tiles as business areas.

The panels of Fig. 14 show the geographical distribution of the five clusters. In Fig. 14a, we observe the stations contained in cluster 1, which groups stations with check-ins only in the morning; on the map, these stations are distant from the city center and are mainly located in residential areas. Figure 14b shows the stations of cluster 2, which have check-ins all day long with stronger attendance during peak periods; these stations are mainly located in the city center. Figure 14c, d look alike: clusters 3 and 4 both carry the "end of school" component, and the points on the map are close to educational establishments. Figure 14e shows the stations from cluster 5; these stations have check-ins all day long, but most are made in the morning. Looking at the map, we cannot identify any significant pattern.

1.3 Passengers profile clustering on another network

To assess the effectiveness of the algorithm, we also applied it to another network, located in the Netherlands. Applying the same model selection method as in Sect. 4.2, we obtained the optimal values \(K=10\) and \(H=7\). Figures 15 and 16 show the profiles of the words and of the clusters obtained, respectively.

The interpretation of the words is:

  1. Word 1: travels at 6 or 7 a.m. and, slightly, around 4 p.m. during the week.

  2. Word 2: travels during the week-end.

  3. Word 3: diffuse travel habits from 8 a.m. to 4 p.m., Mondays to Fridays.

  4. Word 4: travels at 7 a.m. on weekdays.

  5. Word 5: diffuse habits with the highest probabilities from 5 p.m. to 12 a.m. during the week.

  6. Word 6: diffuse habits from 9 a.m. to 5 p.m., with the highest probability at 1 p.m., Mondays to Saturdays.

  7. Word 7: travels at 8 a.m. and 5 p.m.

We can interpret the cluster as follows:

  1. Cluster 1: diffuse habits from 9 a.m. to 5 p.m., with the highest probability at 1 p.m., Mondays to Saturdays.

  2. Cluster 2: travels at 6 or 7 a.m. and at 4 or 5 p.m. during the week.

  3. Cluster 3: diffuse habits from 7 a.m. to 6 p.m. on weekdays.

  4. Cluster 4: diffuse travel habits from 9 a.m. to 11 p.m.

  5. Cluster 5: travels at 7 or 8 a.m., with diffuse habits during the afternoon.

  6. Cluster 6: travels at 8 a.m. and 5 p.m.

  7. Cluster 7: diffuse travel habits from 7 a.m. to 5 p.m., Mondays to Fridays.

  8. Cluster 8: diffuse habits from 8 a.m. to 4 p.m. during the week.

  9. Cluster 9: travels during the week-end.

  10. Cluster 10: travels at 7 or 8 a.m. and around 4 p.m.


Cite this article

Carel, L., Alquier, P. Simultaneous dimension reduction and clustering via the NMF-EM algorithm. Adv Data Anal Classif 15, 231–260 (2021). https://doi.org/10.1007/s11634-020-00398-4

