Abstract
Identifying critical episodes of environmental pollution, whether framed as an outlier identification problem or as a classification problem, is a common application of multivariate functional data analysis. This article addresses the effects of robustifying multivariate functional samples on the identification of critical pollution episodes in Medellín, Colombia. To do so, it compares 18 depth-based outlier identification methods through simulation and highlights the best options in terms of precision. It then applies the two best-performing methods to robustify a real dataset of air pollution (PM2.5 concentration) in the Metropolitan Area of Medellín, Colombia, and compares the effects of robustifying the samples on the accuracy of supervised classification with the multivariate functional DD-classifier. Our results show that 10 out of 20 methods reviewed perform better for at least one kind of outlier. Nevertheless, no clear positive effects of robustification were identified with the real dataset.
Notes
The notation is standardized for uniformity and does not necessarily match that of the original sources.
The original weighting function proposed by Claeskens et al. (2014) assigns to each point t a weight corresponding to the proportion of amplitude of the multivariate dataset at that point.
Simplicial depth is also used with Method one, as can be seen in López-Pintado et al. (2014), but that method is not explored in this document.
This inequality is shown in more detail in Ieva and Paganoni (2017), with the corresponding plots of (modified) band depth (X axis) against (modified) epigraph index (Y axis). Every point falls below the boundary defined by this inequality; points that fall too far from the boundary could be understood as shape outliers, while points with low (M)BD could be considered magnitude outliers.
The indicator ranges from 0.5, indicating the poorest performance, where all positives are false and none are true, to 2, where all positives are true and none are false. This indicator must nevertheless be complemented with the false positive rate.
The imputation algorithm consists of estimating the smoothed mean value for each t using a cubic spline, followed by a modification of the EM algorithm with the following steps: (1) replace missing values by estimates; (2) estimate the parameters \(\mu\) and \(\Sigma\); (3) estimate the level for each multivariate time series; (4) re-estimate the missing values with the new parameters (Junger and Ponce de Leon 2015). The procedure was carried out using the R package mtsdi.
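As a rough illustration of the iterate-and-re-estimate idea in this note, the following Python sketch fills missing values with the conditional mean of a multivariate normal model and repeats until the imputations stabilize. It is a simplified stand-in and not the mtsdi procedure: the cubic-spline level estimation and the covariance correction of a full EM step are omitted, and the function name `em_style_impute` and its parameters are hypothetical.

```python
import numpy as np

def em_style_impute(X, max_iter=100, tol=1e-8):
    """Simplified EM-style imputation (assumption: rows are observations,
    columns are variables, missing entries are NaN)."""
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)
    filled = X.copy()

    # Replace missing values by initial estimates (observed column means).
    col_means = np.nanmean(X, axis=0)
    filled[missing] = np.take(col_means, np.where(missing)[1])

    for _ in range(max_iter):
        # Estimate mu and Sigma from the currently completed data.
        mu = filled.mean(axis=0)
        sigma = np.atleast_2d(np.cov(filled, rowvar=False))
        previous = filled.copy()

        # Re-estimate each missing entry given the observed entries of its row.
        for i in np.where(missing.any(axis=1))[0]:
            m = missing[i]
            o = ~m
            if not o.any():
                # Nothing observed in this row: fall back to the mean.
                filled[i, m] = mu[m]
                continue
            coef = np.linalg.solve(sigma[np.ix_(o, o)], filled[i, o] - mu[o])
            filled[i, m] = mu[m] + sigma[np.ix_(m, o)] @ coef

        # Stop once the imputed values no longer change.
        if np.max(np.abs(filled - previous)) < tol:
            break
    return filled

# Toy usage: the second column is exactly twice the first, one value missing.
toy = np.array([[0, 0], [1, 2], [2, np.nan], [3, 6], [4, 8], [5, 10]], dtype=float)
completed = em_style_impute(toy)
print(completed[2, 1])
```

Observed entries are never modified; only the positions that were missing are updated at each pass. For real pollutant series, the temporal level step described in the note would still be needed, since this sketch treats the rows as exchangeable.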
References
Berrendero JR, Justel A, Svarc M (2011) Principal components for multivariate functional data. Comput Stat Data Anal 55(9):2619–2634. https://doi.org/10.1016/j.csda.2011.03.011
Claeskens G, Hubert M, Slaets L, Vakili K (2014) Multivariate functional halfspace depth. J Am Stat Assoc 109(505):411–423. https://doi.org/10.1080/01621459.2013.856795
Cuesta-Albertos JA, Nieto-Reyes A (2008) The random Tukey depth. Comput Stat Data Anal 52(11):4979–4988. https://doi.org/10.1016/j.csda.2008.04.021 arXiv:0707.0167
Cuesta-Albertos JA, Febrero-Bande M, Oviedo de la Fuente M (2017) The DD^G-classifier in the functional setting. Test 26(1):119–142. https://doi.org/10.1007/s11749-016-0502-6
Cuevas A, Febrero M, Fraiman R (2006) On the use of the bootstrap for estimating functions with functional data. Comput Stat Data Anal 51(2):1063–1074. https://doi.org/10.1016/j.csda.2005.10.012
Cuevas A, Febrero M, Fraiman R (2007) Robust estimation and classification for functional data via projection-based depth notions. Comput Stat 22(3):481–496. https://doi.org/10.1007/s00180-007-0053-0
Dai W, Genton MG (2019) Directional outlyingness for multivariate functional data. Comput Stat Data Anal 131:50–65. https://doi.org/10.1016/j.csda.2018.03.017 arXiv:1612.04615
Dai W, Mrkvička T, Sun Y, Genton MG (2020) Functional outlier detection and taxonomy by sequential transformations. Comput Stat Data Anal 149:11901573 arXiv:1808.05414
Febrero M, Galeano P, González-Manteiga W (2008) Outlier detection in functional data by depth measures, with application to identify abnormal NOx levels. Environmetrics 19(4):331–345. https://doi.org/10.1002/env.878
Febrero-Bande M, Oviedo de la Fuente M (2012) Statistical computing in functional data analysis: the R package fda.usc. J Stat Softw 51(4):1–28 (http://www.jstatsoft.org/v51/i04/)
Febrero-Bande M, Galeano P, González-Manteiga W (2007) A functional analysis of NOx levels: location and scale estimation and outlier detection. Rep Stat Oper Res 22(3):481–496. https://doi.org/10.1007/s00180-007-0053-0
Fraiman R, Muniz G (2001) Trimmed means for functional data. Test 10(2):419–440. https://doi.org/10.1007/BF02595706
Hubert M, Vandervieren E (2008) An adjusted boxplot for skewed distributions. Comput Stat Data Anal 52(12):5186–5201. https://doi.org/10.1016/j.csda.2007.11.008
Hubert M, Rousseeuw PJ, Segaert P (2015) Multivariate functional outlier detection. Stat Methods Appl 24(2):177–202. https://doi.org/10.1007/s10260-015-0297-8
Hyndman RJ, Shang HL (2010) Rainbow plots, bagplots, and boxplots for functional data. J Comput Graph Stat 19(1):29–45. https://doi.org/10.1198/jcgs.2009.08158
Ieva F, Paganoni AM (2013) Depth measures for multivariate functional data. Commun Stat Theory Methods 42(7):1265–1276. https://doi.org/10.1080/03610926.2012.746368
Ieva F, Paganoni AM (2017) Component-wise outlier detection methods for robustifying multivariate functional samples. Stat Pap. https://doi.org/10.1007/s00362-017-0953-1
Ieva F, Paganoni AM, Romo J, Tarabelloni N (2019) Roahd package: robust analysis of high dimensional data. R J 11(2):291–307. https://doi.org/10.32614/RJ-2019-032
Junger WL, Ponce de Leon A (2015) Imputation of missing data in time series for air pollutants. Atmos Environ 102:96–104. https://doi.org/10.1016/j.atmosenv.2014.11.049
Kosiorowski D, Zawadzki Z (2020) DepthProc: an R package for robust exploration of multidimensional economic phenomena [Computer software manual]
Li J, Cuesta-Albertos JA, Liu RY (2012) DD-classifier: nonparametric classification procedure based on DD-plot. J Am Stat Assoc 107(498):737–753. https://doi.org/10.1080/01621459.2012.688462
Liang D, Zhang H, Chang X, Huang H (2020) Modeling and regionalization of China’s PM2.5 using spatial-functional mixture models. J Am Stat Assoc 116:116–132. https://doi.org/10.1080/01621459.2020.1764363
Liu RY (1990) On a notion of data depth based on random simplices. Ann Stat 18(1):405–414
Liu RY, Parelius JM, Singh K (1999) Multivariate analysis by data depth: descriptive statistics, graphics and inference. Ann Stat 27(3):783–858. https://doi.org/10.2307/120138
López-Pintado S, Romo J (2009) On the concept of depth for functional data. J Am Stat Assoc 104(486):718–734. https://doi.org/10.1198/jasa.2009.0108
López-Pintado S, Sun Y, Lin JK, Genton MG (2014) Simplicial band depth for multivariate functional data. Adv Data Anal Classif 8(3):321–338. https://doi.org/10.1007/s11634-014-0166-6
Martínez J, Saavedra Á, García-Nieto PJ, Piñeiro JI, Iglesias C, Taboada J, Pastor J (2014) Air quality parameters outliers detection using functional data analysis in the Langreo urban area (Northern Spain). Appl Math Comput 241(2):1–10. https://doi.org/10.1016/j.amc.2014.05.004
Nagy S, Gijbels I, Hlubinka D (2017) Depth-based recognition of shape outlying functions. J Comput Graph Stat 26(4):883–893. https://doi.org/10.1080/10618600.2017.1336445
Ojo OT, Lillo RE, Fernandez Anta A (2020) fdaoutlier: Outlier detection tools for functional data analysis [Computer software manual]. https://CRAN.R-project.org/package=fdaoutlier (R package version 0.1.1)
Ramsay J, Silverman B (2005) Functional data analysis. Springer, Berlin
Rousseeuw PJ, Ruts I (1999) The bagplot: a bivariate boxplot. Stat Comput Graph 53(4):382–387
Sánchez-Lasheras F, Ordóñez-Galán C, García-Nieto PJ, García-Gonzalo E (2020) Detection of outliers in pollutant emissions from the Soto de Ribera coal-fired power plant using functional data analysis: a case study in northern Spain. Environ Sci Pollut Res 27(1):8–20. https://doi.org/10.1007/s11356-019-04435-4
Segaert P, Hubert M, Rousseeuw P, Raymaekers J (2020) mrfDepth: Depth measures in multivariate, regression and functional settings [Computer software manual]. https://CRAN.R-project.org/package=mrfDepth (R package version 1.0.12)
Shaadan N, Deni SM, Jemain AA (2012) Assessing and comparing PM10 pollutant behaviour using functional data approach. Sains Malaysiana 41(11):1335–1344
Shaadan N, Jemain AA, Latif MT, Deni SM (2015) Anomaly detection and assessment of PM10 functional data at several locations in the Klang Valley, Malaysia. Atmos Pollut Res 6(2):365–375. https://doi.org/10.5094/APR.2015.040
SIATA (2019) Generalidades de la información Red de Calidad del Aire del Valle de Aburrá (Tech. Rep.). Medellín: Área Metropolitana del Valle de Aburrá. https://siata.gov.co/descarga_siata/index.php/info/aire/
SIATA (2021) Información de calidad del aire. https://siata.gov.co/descarga_siata/index.php/index2/calidad_aire/
Sun Y, Genton MG (2011) Functional boxplots. J Comput Graph Stat 20(2):316–334. https://doi.org/10.1198/jcgs.2011.09224
Tarabelloni N, Arribas-Gil A, Ieva F, Paganoni AM, Romo J (2018) roahd: Robust analysis of high dimensional data [Computer software manual]. https://CRAN.R-project.org/package=roahd (R package version 1.4)
Torres JM, Nieto PJ, Alejano L, Reyes AN (2011) Detection of outliers in gas emissions from urban areas using functional data analysis. J Hazard Mater 186(1):144–149. https://doi.org/10.1016/j.jhazmat.2010.10.091
Torres JM, Pérez JP, Val JS, McNabola A, Comesaña MM, Gallagher J (2020) A functional data analysis approach for the detection of air pollution episodes and outliers: a case study in Dublin, Ireland. Mathematics. https://doi.org/10.3390/math8020225
Wang Y, Xu K, Li S (2020) The functional spatio-temporal statistical model with application to O3 pollution in Beijing, China. Int J Environ Res Public Health. https://doi.org/10.3390/ijerph17093172
World Health Organization (2006) WHO Air quality guidelines for particulate matter, ozone, nitrogen dioxide and sulfur dioxide - Global Update 2005 (Tech. Rep.). https://doi.org/10.1007/s12011-019-01864-7
Zuo Y, Serfling R (2000) General notions of statistical depth function. Statistics 28(2):461–482
Additional information
Handling Editor: Dr. Luiz Duczmal.
About this article
Cite this article
Roldán-Alzate, L.M., Zuluaga, F. Assessing the effects of multivariate functional outlier identification and sample robustification on identifying critical PM2.5 air pollution episodes in Medellín, Colombia. Environ Ecol Stat 29, 801–825 (2022). https://doi.org/10.1007/s10651-022-00544-5
DOI: https://doi.org/10.1007/s10651-022-00544-5