Abstract
Prompt and accurate prediction of the air quality index (AQI) has become necessary for tackling mounting environmental threats. This paper proposes a feature-driven hybrid method for hourly, 3-step-ahead, deterministic AQI prediction, which comprises three modules. In Module 1, an “extract-merge-filter” feature engineering procedure is designed to capture potential features from the AQI series, and ten feature sets are generated as candidates. In Module 2, six models (Light Gradient Boosting Machine, Extreme Gradient Boosting, Long Short-Term Memory, Convolutional Neural Network, Multilayer Perceptron, and Deep Neural Network) are developed as base predictors and applied to the candidate feature sets. In Module 3, each predictor is first matched with its optimal feature set using a comprehensive metric, and the predictors are then combined in an ensemble whose weights are optimized with OPTUNA. A case study on AQI data from four Chinese cities is carried out to demonstrate the method. The experimental results show the following: (1) Feature engineering significantly boosts prediction performance and provides interpretable findings for practical use. (2) Customized feature input for each predictor is more effective than a fixed input and raises performance further. (3) OPTUNA is a promising tool for optimizing ensemble weights. The final ensemble model outperforms single machine learning models and is robust.
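The weighting step of Module 3 can be illustrated with a minimal sketch. It is not the authors' implementation: the study uses OPTUNA to optimize the ensemble weights, whereas this example substitutes a plain random search so that it runs with the Python standard library alone. The AQI values, the three base-predictor outputs, the RMSE objective, the constraint that weights are non-negative and sum to one, and all function names are assumptions made only for illustration.

```python
import random


def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    return (sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)) ** 0.5


def combine(predictions, weights):
    """Weighted sum of base-predictor outputs, one value per time step."""
    n_steps = len(predictions[0])
    return [sum(w * p[i] for w, p in zip(weights, predictions)) for i in range(n_steps)]


def search_weights(predictions, y_true, n_trials=2000, seed=0):
    """Random search over ensemble weights (a stand-in for OPTUNA, not the paper's method).

    Assumption: weights are non-negative and normalized to sum to one.
    """
    rng = random.Random(seed)
    n_models = len(predictions)
    # Start from equal weights so the result is never worse than simple averaging.
    best_w = [1.0 / n_models] * n_models
    best_err = rmse(y_true, combine(predictions, best_w))
    for _ in range(n_trials):
        raw = [rng.random() for _ in range(n_models)]
        total = sum(raw)
        candidate = [r / total for r in raw]
        err = rmse(y_true, combine(predictions, candidate))
        if err < best_err:
            best_w, best_err = candidate, err
    return best_w, best_err


# Hypothetical hourly AQI observations and three hypothetical base-predictor outputs.
observed = [50, 55, 60, 58, 52, 49]
base_predictions = [
    [v + 4 for v in observed],  # predictor that overestimates
    [v - 2 for v in observed],  # predictor that underestimates
    [v + 1 for v in observed],  # predictor with a small positive bias
]

equal_error = rmse(observed, combine(base_predictions, [1.0 / 3] * 3))
best_weights, best_error = search_weights(base_predictions, observed)
print("equal-weight RMSE:", equal_error)
print("searched weights:", best_weights, "RMSE:", best_error)
```

In practice the weights would be tuned on a validation split rather than on the data used for the final evaluation, and the random search would be replaced by the optimizer of choice.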
Data availability
The data that support the findings of this study are available from the corresponding author on reasonable request.
Funding
The study is fully supported by the National Natural Science Foundation of China (Grant No. 52072412), the Changsha Science & Technology Project (Grant No. KQ1707017), and the Hunan Province Science and Technology Talent Support Project (Grant No. 2020TJ-Q06).
Ethics declarations
Ethics approval
Not applicable.
Consent to participate
Not applicable.
Consent to publish
Not applicable.
Competing interests
The authors declare no competing interests.
Additional information
Publisher's note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Yin, Y., Liu, H. Air quality index prediction based on three-stage feature engineering, model matching, and optimized ensemble. Air Qual Atmos Health 16, 1871–1890 (2023). https://doi.org/10.1007/s11869-023-01380-7
DOI: https://doi.org/10.1007/s11869-023-01380-7