Air quality index prediction based on three-stage feature engineering, model matching, and optimized ensemble

Yin, Yucheng; Liu, Hui

doi:10.1007/s11869-023-01380-7

Published: 23 May 2023

Volume 16, pages 1871–1890 (2023)

Air Quality, Atmosphere & Health

Abstract

Prompt and accurate prediction of the air quality index (AQI) has become necessary for tackling mounting environmental threats. This paper proposes a feature-driven hybrid method for hourly, 3-step-ahead, deterministic AQI prediction, consisting of three modules. In Module 1, an “extract-merge-filter” feature engineering procedure captures potential features from the AQI series and generates ten candidate feature sets. In Module 2, six models (Light Gradient Boosting Machine, Extreme Gradient Boosting, Long Short-Term Memory, Convolutional Neural Network, Multilayer Perceptron, and Deep Neural Network) are developed as base predictors and run on the candidate feature sets. In Module 3, each predictor is first matched with its optimal feature set using a comprehensive metric, and the matched predictors are then combined in an ensemble whose weights are optimized with OPTUNA. A case study on AQI data from four Chinese cities demonstrates the method. The experimental results show the following: (1) Feature engineering significantly boosts prediction performance and provides interpretable findings for practical use. (2) Feeding each predictor its own customized features is more effective than a fixed input and raises performance further. (3) OPTUNA is a promising tool for optimizing ensemble weights. The final ensemble model outperforms the single machine learning models and is robust.
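The ensemble step in Module 3 can be illustrated with a minimal sketch. It is not the authors' implementation: the data are synthetic, the three noisy "predictors" stand in for the six base models, and the RMSE objective, the weight range, the held-out validation split, and all function names (`rmse`, `combine`, `normalize`, `search_weights`) are assumptions made for illustration. The sketch uses Optuna when it is installed and otherwise falls back to a plain random search, which is a stand-in and not Optuna's sampler.

```python
import math
import random


def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def combine(preds, weights):
    """Weighted sum of base-predictor forecasts, one value per time step."""
    return [sum(w * p[t] for w, p in zip(weights, preds)) for t in range(len(preds[0]))]


def normalize(raw):
    """Rescale positive raw weights so they sum to 1."""
    total = sum(raw)
    return [r / total for r in raw]


def search_weights(preds, y_val, n_trials=60, seed=0):
    """Return normalized ensemble weights that minimize validation RMSE."""
    n = len(preds)

    def loss(raw):
        return rmse(y_val, combine(preds, normalize(raw)))

    try:
        import optuna
    except ImportError:
        optuna = None

    if optuna is not None:
        optuna.logging.set_verbosity(optuna.logging.WARNING)
        study = optuna.create_study(
            direction="minimize", sampler=optuna.samplers.TPESampler(seed=seed)
        )
        # Evaluate equal weights first so the result is never worse than that baseline.
        study.enqueue_trial({f"w{i}": 0.5 for i in range(n)})
        study.optimize(
            lambda trial: loss([trial.suggest_float(f"w{i}", 0.01, 1.0) for i in range(n)]),
            n_trials=n_trials,
        )
        best_raw = [study.best_params[f"w{i}"] for i in range(n)]
    else:
        # Fallback (assumption): random search over the same box, not Optuna's TPE.
        rng = random.Random(seed)
        best_raw = [0.5] * n
        best_loss = loss(best_raw)
        for _ in range(n_trials):
            raw = [rng.uniform(0.01, 1.0) for _ in range(n)]
            current = loss(raw)
            if current < best_loss:
                best_raw, best_loss = raw, current
    return normalize(best_raw)


# Synthetic validation series and three hypothetical predictors with different noise levels.
data_rng = random.Random(42)
y_val = [80 + 30 * math.sin(t / 6) for t in range(120)]
preds = [[y + data_rng.gauss(0, s) for y in y_val] for s in (4.0, 8.0, 12.0)]

weights = search_weights(preds, y_val)
ens_rmse = rmse(y_val, combine(preds, weights))
equal_rmse = rmse(y_val, combine(preds, [1 / 3] * 3))
print([round(w, 3) for w in weights], round(ens_rmse, 3), round(equal_rmse, 3))
```

Sampling raw weights in a box and normalizing them afterwards keeps the search space simple while guaranteeing that the final weights sum to 1. In practice the weights would be tuned on a validation split and the ensemble scored on separate test data.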

Data availability

The data that support the findings of this study are available from the corresponding author on reasonable request.

Funding

The study is fully supported by the National Natural Science Foundation of China (Grant No. 52072412), the Changsha Science & Technology Project (Grant No. KQ1707017), and the Hunan Province Science and Technology Talent Support Project (Grant No. 2020TJ-Q06).

Author information

Authors and Affiliations

Institute of Artificial Intelligence and Robotics (IAIR), Key Laboratory of Traffic Safety On Track of Ministry of Education, School of Traffic and Transportation Engineering, Central South University, Changsha, 410075, Hunan, China
Yucheng Yin & Hui Liu

Corresponding author

Correspondence to Hui Liu.

Ethics declarations

Ethics approval

Not applicable.

Consent to participate

Not applicable.

Consent to publish

Not applicable.

Competing interests

The authors declare no competing interests.

About this article

Cite this article

Yin, Y., Liu, H. Air quality index prediction based on three-stage feature engineering, model matching, and optimized ensemble. Air Qual Atmos Health 16, 1871–1890 (2023). https://doi.org/10.1007/s11869-023-01380-7

Received: 26 November 2022
Accepted: 15 May 2023
Published: 23 May 2023
Issue Date: September 2023
DOI: https://doi.org/10.1007/s11869-023-01380-7
