Abstract
Non-negative matrix factorization has been used in many research disciplines, and the number of factors plays a crucial role in it. However, a fully data-driven method for determining this number is not yet available in the literature. Based on the fact that the most appropriate number of factors should generate the best prediction, in this paper we propose a selection method using a two-step delete-one-out approach, called twice cross-validation. This method is easy to implement and is fully data-driven. It also works when constraints, such as sparsity, are imposed on the factorization. Extensive simulations and real data analyses suggest that the proposed method performs well in most cases and can select the number of factors correctly when the number of factors is much smaller than the dimension of the variables and the sample size is reasonably large. As an important application, the proposed method is used for source apportionment of air pollution in Singapore, where it provides physically reasonable source profiles.
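The principle stated in the abstract, that the best number of factors is the one that predicts best, can be illustrated with a minimal sketch. This is not the twice cross-validation procedure of the paper, whose details are not given here. It is a simpler stand-in: hold out a random subset of matrix entries, fit a non-negative factorization on the remaining entries, and compare the prediction error on the held-out entries across candidate numbers of factors. The function names `masked_nmf` and `holdout_error_by_rank`, the synthetic data, and all parameter values are assumptions made for illustration.

```python
import numpy as np


def masked_nmf(X, mask, rank, n_iter=200, seed=0, eps=1e-9):
    """Fit X ~ W @ H with W, H >= 0, using only entries where mask is True.

    Hypothetical helper: multiplicative updates for the masked
    squared-error loss (a binary-weighted variant of the Lee-Seung updates).
    """
    rng = np.random.default_rng(seed)
    n, p = X.shape
    W = rng.random((n, rank)) + eps
    H = rng.random((rank, p)) + eps
    M = mask.astype(float)
    MX = M * X
    for _ in range(n_iter):
        H *= (W.T @ MX) / (W.T @ (M * (W @ H)) + eps)
        W *= (MX @ H.T) / ((M * (W @ H)) @ H.T + eps)
    return W, H


def holdout_error_by_rank(X, ranks, holdout_frac=0.1, seed=0):
    """Mean squared prediction error on held-out entries for each rank."""
    rng = np.random.default_rng(seed)
    # True marks entries used for fitting; False marks held-out entries.
    mask = rng.random(X.shape) >= holdout_frac
    errors = {}
    for r in ranks:
        W, H = masked_nmf(X, mask, r, seed=seed)
        resid = (X - W @ H)[~mask]
        errors[r] = float(np.mean(resid ** 2))
    return errors


# Synthetic non-negative data with three underlying factors (an assumption
# made purely for this illustration, not data from the paper).
data_rng = np.random.default_rng(1)
X = data_rng.random((60, 3)) @ data_rng.random((3, 12))
X += 0.01 * data_rng.random((60, 12))

errors = holdout_error_by_rank(X, ranks=[1, 2, 3, 4, 5])
best_rank = min(errors, key=errors.get)
print(errors, best_rank)
```

Holding out individual entries rather than whole rows lets every row and column still contribute to the fit, which is one common way to obtain an out-of-sample error for a factorization. The candidate with the smallest held-out error is taken as the selected number of factors in this sketch.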
Acknowledgements
We are most grateful to the AE and two referees for their valuable comments and constructive suggestions, which have led to a substantial improvement of this paper. YC Xia’s research is partially supported by MOE Tier 1 Grant: R-155-000-193-114, and MOE Grant of Singapore: MOE2014-T2-1-072, and National Natural Science Foundation of China, 11771066.
Cite this article
Yan, M., Yang, X., Hang, W. et al. Determining the number of factors for non-negative matrix and its application in source apportionment of air pollution in Singapore. Stoch Environ Res Risk Assess 33, 1175–1186 (2019). https://doi.org/10.1007/s00477-019-01677-z
DOI: https://doi.org/10.1007/s00477-019-01677-z