1 Introduction

Air pollution is considered by the United Nations to be one of the most significant environmental risks to health worldwide, and is consequently addressed in the United Nations Sustainable Development Goals [1]. Air quality can vary significantly across territories, even at high geographic and temporal granularity [2]: an accurate assessment of population exposure to pollutants would require an almost continuous distribution of ground measuring stations, which is far from feasible. Hence, scientific research has devoted significant effort to implementing air quality models that increase the usability of measurements limited in time and space by inferring more detailed information through data processing. Many different models have been implemented over the years, applying diverse approaches [3], the most common being Kriging interpolation and land-use regression (LUR), along with more complex processing frameworks such as chemical transport models (CTM). However, recent reviews on the topic [3,4,5,6] highlighted some critical issues with these widespread approaches, namely the limited performance of the more basic statistical models (Kriging, LUR) and the high requirements in terms of data and computational capabilities of the more complex models (CTM). For these reasons, the implementation of models based on machine learning (ML) algorithms has increased sharply in recent years, and ML is now the most widespread approach in scientific research in this field, setting a new state of the art, in particular with relation to health impact assessment, where advanced data processing and geographical modelling are taking over from more traditional approaches [7]. Models of this kind offer a performance comparable (or even superior) to CTM, while relying on less data and requiring less computational capability. 
However, ‘machine learning’ is a macro-category, which includes many algorithms with significant differences in terms of mathematical background, applicability, complexity and interpretability; the number of different solutions is virtually unlimited, as these algorithms can also be modified and adapted to specific frameworks. Furthermore, multiple ML algorithms can be used as separate ‘functional blocks’ in the different phases of a single modelling process, to build ‘ensemble’ architectures. What emerges is a complex scenario, with many different solutions in circulation that are difficult to compare, evaluate and replicate, which hinders the definition of the state of the art. As a consequence, it may be difficult to identify the best solutions to be tested when designing a new project focused on air quality.

In this context, the objective of this scoping review was to analyze the latest scientific research on ML applied to air quality modelling, focusing in particular on particulate matter (PM), known to be a serious hazard for human health [8, 9]. The intent was to identify the most widespread solutions and to compare them according to level of evidence, thus identifying requirements and possibly supporting the design of future projects in the field. The research goal of this review is therefore to verify whether machine learning has become the state-of-the-art methodology in air quality modelling (either globally or in limited areas), whether there are specific architectures and algorithms that outperform other solutions, and what performance can be expected from such models.

The manuscript is structured as follows. Sect. 2 (Review methodology) describes the procedure for collecting and analyzing the relevant scientific literature. Sect. 3 (Objective of selected studies) classifies the identified studies according to their aim, distinguishing explorative correlation analysis, interpolation and forecast. Sect. 4 (Geographic distribution) analyzes the distribution of the studies by target country. Sect. 5 (Input data sources) assesses the input data on which the models were based. Sect. 6 (Used algorithms and estimated performance) analyzes and comparatively evaluates the different solutions implemented. Sect. 7 presents a critical discussion of the results and the consequent conclusions.

2 Review methodology

The database explored was Google Scholar; the keywords applied in the query were ‘pollution’, ‘PM’ and ‘particulate matter’, ‘interpolation’, ‘prediction’ and ‘forecast’, ‘machine learning’ and ‘ML’. Reflecting the strong attention towards this topic, the number of potentially relevant results (as returned by the search engine) was very high, with over 7000 results (in English). In order to capture the latest evolution of the state of the art, the search was therefore limited to the last year (2022) only, thus reducing the number of potential results to 940. Based on title and abstract, articles were selected as relevant if the study included the development (or at least the use) of a specific model to estimate PM concentration; this further reduced the number to 169. Finally, after full-text reading, 142 relevant studies (under the same criterion) were identified, including 4 literature reviews [3,4,5,6] and 138 observational studies [10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99,100,101,102,103,104,105,106,107,108,109,110,111,112,113,114,115,116,117,118,119,120,121,122,123,124,125,126,127,128,129,130,131,132,133,134,135,136,137,138,139,140,141,142,143,144,145,146,147]. These 138 studies were analyzed, collecting structured information on (1) the study objective (primary and, where present, secondary), (2) the target pollutant(s), (3) the data sources, (4) the method of attribute selection, (5) the target territory, (6) the spatial and temporal resolution, (7) the models tested, (8) the performance evaluation method and results, and (9) the final model selected. Such information was manually recorded and structured in a relational database, using a pre-defined coded vocabulary that allowed comparison and processing. 
Once the information was structured within this framework, the subsequent analyses could be automated and were implemented in the Python programming language (v 3.7). This approach made it possible to quickly obtain statistics and graphics, as well as the list of references corresponding to each identified group of studies. A first round of analysis concerned the study objectives, resulting in a classification that allowed research sub-groups to be identified. Once this classification was performed, all following analyses were repeated separately on the whole database and on the single groups. For qualitative information (such as items 1–5, 7 and 9), basic statistics were extracted, discussing absolute and relative frequencies of the different labels, and generally aggregating single-occurrence elements (i.e. found in only one study across the database) in the ‘other’ category. For quantitative information (items 6 and especially 8), a more advanced analysis was applied, assessing the cumulative results as average and confidence interval, and evaluating the robustness of the results in terms of the number of studies in each category-based sub-group. It is worth noting, however, that the different aspects of the analysis were addressed one by one, adjusting the methodology where necessary according to the specific needs.
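The frequency analysis described above for qualitative information can be sketched as follows. This is a minimal illustration rather than the code actually used in the review: the function name `label_frequencies`, its parameters and the example labels are assumptions introduced here, and only the Python standard library is used.

```python
from collections import Counter


def label_frequencies(labels, min_count=2, other_label="other"):
    """Return {label: (absolute count, relative frequency in %)}.

    Labels found in fewer than `min_count` studies are pooled under
    `other_label`, mirroring the aggregation of elements that occur in
    one study only.
    """
    counts = Counter(labels)
    total = sum(counts.values())
    merged = Counter()
    for label, n in counts.items():
        key = label if n >= min_count else other_label
        merged[key] += n
    return {k: (n, round(100 * n / total, 2)) for k, n in merged.items()}


# Hypothetical records, one label per study (not data from the review).
example = ["RF", "RF", "LSTM", "RF", "LSTM", "GWR", "SVR"]
freqs = label_frequencies(example)
for label, (n, pct) in sorted(freqs.items(), key=lambda kv: -kv[1][0]):
    print(f"{label}: {n} ({pct}%)")
```

In this toy input, the two labels that appear only once end up in the ‘other’ group, while the repeated ones keep their own entry.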

3 Objective of selected studies

The 138 identified observational studies were classified according to their primary goal. In particular, 3 categories were identified:

A. Correlation analysis: these studies aimed at analyzing the impact of the different data sources considered on PM concentration, possibly evaluating their weight within the implemented ML models. This category included 14 (10.14%) studies [10,11,12,13,14,15,16,17,18,19,20,21,22,23].

B. Spatial and/or temporal interpolation models: these studies aimed at inferring missing data in space (missing records and/or computation of continuous mapping from discrete points) and/or in time. This category included 65 (47.1%) studies [26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90].

C. Forecast models: these studies aimed at developing and validating predictive models to forecast the future concentration of PM. This category included 56 (40.58%) studies [92,93,94,95,96,97,98,99,100,101,102,103,104,105,106,107,108,109,110,111,112,113,114,115,116,117,118,119,120,121,122,123,124,125,126,127,128,129,130,131,132,133,134,135,136,137,138,139,140,141,142,143,144,145,146,147].

In addition to this classification, some further studies should be considered, whose main scope was not PM modelling but in which a model for PM concentration (possibly developed externally) was used. Specifically, there were 3 such studies: two of them [24, 25] had explorative correlation analysis as a secondary purpose (category A), while the third [91] had spatio-temporal interpolation as a secondary purpose (category B). Therefore, the final number of studies in the three categories was:

A. Correlation analysis: 14 + 2 = 16 (11.59%) studies [10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25].

B. Interpolation: 65 + 1 = 66 (47.83%) studies [26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91].

C. Forecast: 56 (40.58%) studies [92,93,94,95,96,97,98,99,100,101,102,103,104,105,106,107,108,109,110,111,112,113,114,115,116,117,118,119,120,121,122,123,124,125,126,127,128,129,130,131,132,133,134,135,136,137,138,139,140,141,142,143,144,145,146,147].

However, it is worth noting that this classification should not be interpreted as rigid, as many studies could be assigned to more than one category when secondary aims are taken into account. A graphical representation of the sub-categories is reported in Fig. 1.

Fig. 1 Classification of studies published in 2022 relevant to modelling of particulate matter concentration, divided according to the primary (first column) and secondary (derived blocks) study aims

4 Geographic distribution

Considering the territory under analysis, the vast majority of the studies (112, 81.16%) concerned Asia, in particular its eastern and south-eastern regions. Among them, the largest contribution came from China [11,12,13, 19,20,21, 25,26,27,28,29,30, 34, 35, 37, 38, 40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60, 62, 63, 71,72,73,74,75,76,77,78,79,80, 93, 98, 100,101,102,103,104,105, 115,116,117, 123,124,125,126,127,128,129,130,131,132,133,134,135], which, with 73 studies (52.9%), accounted for more than half of the total. A second block included India [15, 24, 35, 61, 83,84,85, 94, 106, 120, 121, 136, 137] (13, 9.42%), South Korea [22, 26, 60, 62, 63, 87, 111, 118, 140,141,142,143] (12, 8.7%) and USA [10, 17, 18, 23, 33, 66,67,68, 89, 97, 114] (11, 7.97%), while the other countries addressed in more than one publication were Japan [60, 63, 95, 122] and Thailand [16, 64, 112, 144] (4 each, 2.9%), Taiwan [69, 96, 119] and UK [35, 113, 147] (3 each, 2.17%), and Spain [88, 145], Germany [82, 91], Iran [107, 138], Malaysia [108, 109] and Canada [39, 70] (2 each, 1.45%). The complete count (subdivided by study category according to the classification defined in Sect. 3) is reported in Table 1.

Table 1 Number of studies per country relevant to machine-learning based modelling of particulate matter concentration; for references of countries with more than one occurrence, please refer to the main text

When normalizing the number of studies by the population of the target countries (number of studies per 10 million inhabitants), the most studied country was by far South Korea (2.34), while (considering only countries with more than one publication) the second was Taiwan (1.26). Countries with the largest absolute numbers had lower values: 0.51 for China, 0.09 for India, 0.33 for USA. A graphical representation of the normalized number of studies is provided in Fig. 2, while complete results (also subdivided by study category according to the classification defined in Sect. 3) are reported in Table 2.
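The normalization used here divides the number of studies by the population expressed in units of 10 million inhabitants. A minimal sketch is given below; the function name and the two example entries are hypothetical (round population figures chosen for readability, not the values used in the review), so the outputs are not meant to reproduce Table 2.

```python
def studies_per_10_million(n_studies, population):
    """Number of studies per 10 million inhabitants."""
    if population <= 0:
        raise ValueError("population must be positive")
    return n_studies / (population / 10_000_000)


# Illustrative inputs only: (number of studies, population).
hypothetical = {
    "Country X": (12, 50_000_000),
    "Country Y": (73, 1_400_000_000),
}

normalized = {
    name: round(studies_per_10_million(n, pop), 2)
    for name, (n, pop) in hypothetical.items()
}
for name, value in normalized.items():
    print(f"{name}: {value} studies per 10 million inhabitants")
```

The example shows why a country with a large absolute number of studies can still rank low once population is taken into account.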

Fig. 2 Heat-map with number of studies per country, relevant to machine-learning based modelling of particulate matter concentration, normalized on the country’s population

Table 2 Number of studies per country, relevant to machine-learning based modelling of particulate matter concentration, normalized on the country’s population

5 Input data sources

Concerning the data sources used in the modelling, a first distinction should be made between univariate and multivariate models. Univariate models were based on a single data source, represented by the time series of PM concentration as recorded by ground stations. There were 15 such studies [31, 35, 36, 83, 85, 87,88,89, 95, 113, 116, 130, 132, 135, 147] in total (10.87%), 8 of which were interpolation models (12.12% of category B) and 7 were predictive models (12.5% of category C). No study of this kind belonged to category A, whose scope is precisely to evaluate the impact of other data sources on the target (concentration of PM).

The majority of studies were multivariate: 123 in total (89.13%), 16 for correlation analysis (100%), 58 for interpolation models (87.88%) and 49 for prediction models (87.5%). In such models, the most used data source (besides ground stations) was meteorological information, with 92 [13,14,15,16, 19,20,21, 24,25,26,27, 30, 34, 38, 40, 42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63, 65,66,67,68, 70, 71, 73, 74, 76,77,78,79,80,81,82, 84, 90, 91, 93, 94, 96,97,98,99,100,101,102,103,104,105, 107, 109,110,111,112, 114, 115, 117, 119, 122, 124,125,126,127, 129, 133, 134, 136,137,138,139,140, 142, 144, 145] studies (74.8% of all multivariate models), followed by satellite data, where 45 [15, 17, 26, 29, 30, 32, 40, 42, 43, 45,46,47,48, 50,51,52,53, 55, 57, 60,61,62,63, 65, 67, 70, 73,74,75,76,77,78,79,80,81,82, 93, 107, 112, 115, 120, 133, 134, 141, 144] studies (36.59%) used Aerosol Optical Depth (AOD), and 37 [10,11,12,13, 15, 17,18,19,20, 22, 26, 34, 37, 40, 41, 43, 45, 47, 48, 51, 52, 54, 56, 58, 60, 62,63,64,65, 68, 70,71,72,73, 79,80,81,82, 86, 93, 97, 110, 112, 115, 117, 129, 131, 139, 144] (30.08%) used other satellite imagery. Other widely included variables were land use and/or topography, present in 51 [11,12,13, 16, 18, 22, 23, 25, 26, 30, 33, 34, 38, 39, 42, 43, 45,46,47,48,49,50,51, 53,54,55,56, 58, 59, 61, 63, 64, 66, 68, 71, 73,74,75,76,77, 80, 81, 92, 93, 101, 112, 115, 122, 133, 139, 144] studies (41.46%). Less frequently included data sources were measurements of other pollutants, other previously implemented models, demography, ad-hoc micro-sensor networks, road traffic information and wildfire localization. Furthermore, 12 [10, 11, 13, 18,19,20, 22, 34, 51, 80, 110, 129] studies (9.76%) used other specific categories of variables not used in any other study, and therefore not classified. A complete description of the data sources included in multivariate models, also subdivided by study category, is reported in Table 3.

Table 3 Data sources used in studies relevant to machine-learning based modelling of particulate matter concentration, subdivided according to study aim

6 Used algorithms and estimated performance

With regard to the type of final algorithm chosen, a first distinction can be made between single-block models and ‘ensemble’ architectures. Single-block models are algorithms trained for the specific intended task, while ‘ensemble’ architectures are systems composed of multiple functional blocks, in series and/or in parallel, each of which is essentially a single-block model performing a specific sub-task within the overall framework. This more complex approach was tested in 43 [22, 24, 25, 29,30,31, 36, 42, 44, 46, 47, 64, 67, 84, 89, 97,98,99,100,101,102, 104,105,106,107,108, 110, 117, 118, 121, 123, 126, 127, 129,130,131, 133, 135, 138, 141, 142, 145, 147] studies (31.16%), 3 of which [22, 24, 25] were correlation analyses (18.75% of category A), 12 [29,30,31, 36, 42, 44, 46, 47, 64, 67, 84, 89] were interpolation models (18.18% of category B), and 28 [97,98,99,100,101,102, 104,105,106,107,108, 110, 117, 118, 121, 123, 126, 127, 129,130,131, 133, 135, 138, 141, 142, 145, 147] were prediction models, accounting for 50% of category C. This last category is therefore the application field where ensemble architectures were most widespread.
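The idea of functional blocks arranged in parallel and in series can be illustrated with a deliberately simple sketch. The two predictors below (persistence and moving average) are toy stand-ins for trained ML models, and all function names and values are hypothetical; none of this reproduces an architecture from the reviewed studies.

```python
from statistics import mean


def persistence_block(series):
    """Single block: predict the next value as the last observed one."""
    return series[-1]


def moving_average_block(series, window=3):
    """Single block: predict the next value as the mean of the last values."""
    return mean(series[-window:])


def parallel(blocks, combine=mean):
    """Run blocks side by side on the same input and combine their outputs."""
    return lambda series: combine([block(series) for block in blocks])


def series_of(first, second):
    """Chain two blocks: the output of `first` becomes the input of `second`."""
    return lambda x: second(first(x))


def clip_non_negative(value):
    """Post-processing block: a concentration cannot be negative."""
    return max(0.0, value)


# Two blocks in parallel, followed in series by a post-processing block.
ensemble = series_of(
    parallel([persistence_block, moving_average_block]),
    clip_non_negative,
)

pm_history = [12.0, 15.0, 18.0, 21.0]  # hypothetical PM readings (µg/m3)
print(ensemble(pm_history))
```

Each block can be evaluated on its own as a single-block model, which is also how the algorithms are treated individually in the comparison that follows.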

For the sake of comparison, in the following analysis all algorithms were considered individually, even when they were part of a more complex structure. The reason is that ensemble architectures are essentially unique and implemented ad hoc for each specific project: the same architecture was never used more than once across all the studies considered, which prevents any kind of comparison.

The analysis covered both the algorithms that were tested and evaluated and those that were selected for the final implementation as the best performing. A first relevant result is that 232 different algorithms were used (in the test phase) across all the studies, of which only 60 (25.86%) were tested in more than one study. Similarly, considering the final model chosen as the best performing, only 20 solutions out of a total of 108 (18.52%) were selected in more than one study.

Considering only the repeated solutions, the most frequently tested algorithm was Random Forest (RF: 43 [10, 12, 13, 19, 20, 37, 39, 42, 45, 46, 48, 52, 54, 55, 58,59,60,61, 63, 64, 68, 70, 71, 73, 75, 89, 90, 95, 99, 101, 103, 107, 109, 110, 113, 114, 117, 120, 122, 134, 136, 140, 141], 31.16%), followed by Long Short-Term Memory (LSTM: 34 [36, 38, 46, 84, 94, 96, 98, 100, 102,103,104,105, 111, 113, 114, 116, 119, 125,126,127, 129, 130, 137, 138, 142, 143, 145], 24.64%) and by Convolutional Neural Networks (CNN: 19 [84, 86, 94, 96, 100, 108, 113, 114, 116, 126, 127, 130, 142, 145], 13.77%). As for the algorithm selected as the best performing, the two most frequent were again RF (27 [9, 12, 13, 15, 19, 20, 37, 39, 42, 45, 48, 52, 60, 63, 64, 68, 70, 73, 89, 95, 99, 101, 109, 110, 117, 122, 141], 19.57%) and LSTM (12 [24, 36, 84, 96, 98, 103, 105, 108, 126, 129, 136, 145], 8.7%), while the third was eXtreme Gradient Boosting (XGBoost: 7 [26, 42, 51, 59, 61, 107, 110], 5.07%, tested in 12 [26, 42, 51, 55, 61, 90, 103, 107, 110, 113, 140], 8.7%).

Considering the choice of the best performing algorithm according to the application fields and study objectives categorized in Sect. 3, for correlation analysis (category A) the most used algorithm was RF, with 6 [10, 12, 13, 15, 19, 20] studies out of 16 (37.5%), followed by Geographically Weighted Regression (GWR: 2 [11, 18], 12.5%); in all the other cases, ad-hoc solutions were implemented and not repeated anywhere else, while in 4 further cases [21,22,23, 25] (25%) the analysis was based on classic statistical methods, meaning that no actual model was developed. In interpolation models (category B), RF was also the most widely adopted solution (13 [37, 39, 42, 45, 48, 52, 60, 63, 64, 68, 70, 73, 89], 19.7%), followed by XGBoost (5 [26, 42, 51, 59, 61], 7.58%) and Deep Forest (DF: 4 [50, 53, 54, 58], 6.06%). Within predictive models (category C), the most used was instead LSTM (9 [96, 98, 103, 105, 108, 126, 129, 136, 145], 16.07%), followed by RF (8 [95, 99, 101, 109, 110, 117, 122, 141], 14.29%), while three different algorithms shared the third highest frequency (3 each, 5.36%), namely CNN [100, 126, 138], AutoRegressive Moving Average (ARMA) [110, 130, 132] and Chemical Transport Models (CTM [101, 131, 147], always included in ensemble architectures in the analyzed studies). The frequency of application of the most widespread algorithms, subdivided according to study category, is reported in Fig. 3.

Fig. 3 Algorithms used in studies relevant to machine-learning based modelling of particulate matter concentration, subdivided according to study aim

With regard to performance, the most common metric considered for comparison (118 studies, 85.51%) was the root mean squared error (RMSE), expressed in µg/m³ and reported as median [1st quartile – 3rd quartile] or as 95% confidence interval (Table 4).
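For reference, RMSE is the square root of the mean squared difference between observed and estimated concentrations, so it carries the same unit as the concentrations themselves. A minimal sketch follows; the function name and the sample values are illustrative and are not taken from any of the reviewed studies.

```python
import math


def rmse(observed, predicted):
    """Root mean squared error between two equally long sequences."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical PM concentrations (µg/m3): measured vs. estimated by a model.
observed = [10.0, 20.0, 30.0]
predicted = [12.0, 18.0, 33.0]
print(f"RMSE = {rmse(observed, predicted):.2f} µg/m3")
```

Because the errors are squared before averaging, a few large deviations weigh more heavily on the result than many small ones.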

Table 4 Statistics about the performance evaluation (through root mean squared error, RMSE) of algorithms used in studies relevant to machine-learning based modelling of particulate matter concentration, subdivided according to study aim

The following analysis considered the performance of the individual algorithms applied; the need for a common metric meant that only studies reporting RMSE could be included (118 studies, 85.51%, as per Table 4). Since RMSE measures error, lower values correspond to higher performance. Moreover, for statistical robustness, only algorithms used in at least two different studies were included, whether applied as a single-block framework or as one of the multiple blocks in an ensemble architecture. In the first case (single blocks), the best performance was that of LSTM (5.75 ± 5.18 µg/m³), although the level of evidence is low, as it was used in only two studies [96, 103]. XGBoost follows with 7.78 ± 4.68 µg/m³ and a higher level of evidence, being used in 5 studies [26, 51, 59, 61, 107]. The highest level of evidence was reached for RF, applied in 19 studies [10, 12, 13, 15, 19, 20, 37, 45, 48, 52, 60, 63, 68, 70, 73, 89, 95, 109, 122], with a lower but comparable performance of 12.79 ± 9.1 µg/m³, while the worst performing solution (mainly due to a large confidence interval among the 4 cases of application [50, 53, 54, 58]) was the Deep Forest (DF) with 27 ± 14.61 µg/m³.

When considering the second group, ensemble architectures, the best performance was reached when an RF was included (10.44 ± 7.64 µg/m³, with 6 cases of application [42, 64, 99, 101, 110, 117]), followed by LSTM (13.15 ± 9.1 µg/m³, with 7 cases of application [24, 36, 84, 98, 105, 108, 126]). A comparative graphical representation of these results is reported in Fig. 4.
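The text does not state how the ‘average ± interval’ figures were derived from the per-study RMSE values. Purely as an illustration, the sketch below assumes a normal-approximation 95% interval, with half-width 1.96·s/√n; that choice, the function name and the input values are all assumptions and may differ from the procedure actually used in the review.

```python
import math
from statistics import mean, stdev


def mean_with_ci(values, z=1.96):
    """Return (mean, half-width) of a normal-approximation confidence interval.

    Assumption: half-width = z * sample standard deviation / sqrt(n).
    With very few studies this approximation is crude.
    """
    if len(values) < 2:
        raise ValueError("at least two values are needed")
    half_width = z * stdev(values) / math.sqrt(len(values))
    return mean(values), half_width


# Hypothetical RMSE values (µg/m3) reported by three studies for one algorithm.
rmse_values = [4.0, 6.0, 8.0]
center, half = mean_with_ci(rmse_values)
print(f"{center:.2f} ± {half:.2f} µg/m3 (n = {len(rmse_values)})")
```

Reporting the number of studies next to the interval keeps the level of evidence visible, which matters when groups contain only two or three studies.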

Fig. 4 Statistics about the performance evaluation (through root mean squared error, RMSE) of algorithms used in studies relevant to machine-learning based modelling of particulate matter concentration. Blue lines refer to single-block frameworks, while green ones represent the use of algorithms within ensemble architectures. Lines thickness is proportional to the level of evidence (number of studies presenting that solution), also reported in labels

As previously stated, the specific aim of each study has a primary impact on the implemented algorithms and their performance. To take this into account, this performance analysis was also conducted separately for the three categories identified in Sect. 3.

For category A, relevant to correlation analysis, it must be noted that the full implementation of a model is not required to fulfill the goal and, as a result, only two algorithms could be evaluated: RF, 5.05 ± 4.04 µg/m³ on 3 applications [10, 19, 20], and GWR, 30.57 ± 20.8 µg/m³ on 2 applications [11, 18]. In category B, interpolation models, the best results were obtained with LSTM within an ensemble architecture (9.9 ± 2 µg/m³), although with low evidence (only 2 cases of application [36, 84]). The most frequently applied approach was the implementation of a single-block framework based on an enhanced decision-tree-like algorithm, such as an Extremely Randomized Tree (11.22 ± 1.28 µg/m³ with 2 cases [81, 90]), Deep Forest (DF: 16.66 ± 4.5 µg/m³ with 4 applications [50, 53, 54, 58]) or XGBoost (13.95 ± 5.15 µg/m³ with 3 applications [51, 59, 61]). These enhanced decision trees showed, although with lower evidence, a higher performance when compared with a basic RF (14.9 ± 8.45 µg/m³ with 10 applications [37, 45, 48, 52, 60, 63, 68, 70, 73, 89]). A graphical representation of these performance results (category B) is provided in Fig. 5. Concerning category C, predictive modelling, the best results were obtained with a single-block LSTM (4.32 ± 3.23 µg/m³ with 3 applications [96, 103, 136]), closely followed by a single RF (6.49 ± 4.91 µg/m³ with 3 applications [95, 109, 122]), while among ensemble architectures the best performing were those including a CNN (5.81 ± 5.46 µg/m³ with 3 applications [100, 126, 145]). The most frequently applied approaches were the inclusion, in the ensemble architecture, of either an RF [99, 101, 110, 117, 141] (12 ± 7.93 µg/m³) or an LSTM [98, 105, 126, 129, 145] (13.7 ± 9.1 µg/m³), both with 5 cases of application. Lower performance, again mainly due to a wide range of values in the 3 cases of application [101, 131, 147], was provided by ensemble models including a CTM (18.79 ± 11.99 µg/m³). 
A graphical representation of these results (category C) is provided in Fig. 6.

Fig. 5 Statistics about the performance evaluation (through root mean squared error, RMSE) of algorithms used in studies relevant to machine-learning based interpolation modelling of particulate matter concentration. Blue lines refer to single-block frameworks, while green ones represent the use of algorithms within ensemble architectures. Lines thickness is proportional to the level of evidence (number of studies presenting that solution), also reported in labels

Fig. 6 Statistics about the performance evaluation (through root mean squared error, RMSE) of algorithms used in studies relevant to machine-learning based predictive modelling of particulate matter concentration. Blue lines refer to single-block frameworks, while green ones represent the use of algorithms within ensemble architectures. Lines thickness is proportional to the level of evidence (number of studies presenting that solution), also reported in labels

7 Conclusions

A scientific literature review was performed on the topic of advanced computational techniques (mainly machine learning, ML) applied to air quality models, with a specific focus on particulate matter (PM). This topic proved to be of very high interest to the international scientific community, with an impressive volume of scientific literature produced. Indeed, a single year (2022) was sufficient to identify a total of 138 relevant studies to be included and fully analyzed. While, on the one hand, this represents a limitation (resulting in a very short time period for the review), it should also be considered a relevant result in itself, showing an exceptional level of interest and attention from the scientific community towards this field, as well as its remarkable dynamism and speed of advancement.

According to the analysis, ML is the leading-edge technology in air quality modelling, and recent scientific research confirms a widespread application of this approach, thus giving a positive answer (on a global scale) to the primary research question addressed in this work. In particular, three main fields of application emerged, with the largest share of studies focused on the spatial and/or temporal interpolation of data (either filling gaps in recordings or inferring a continuous measurement from sparse samples), followed closely by prediction models for concentration forecast, and a smaller number of studies focused on correlation analysis between explanatory factors and concentration levels.

Despite a fairly wide geographical distribution of the countries under examination, the larger part of the production focused on eastern and south-eastern Asia, in particular China (in absolute numbers) and South Korea (in proportion to population). The most frequent approach in PM modelling was to implement multivariate models (almost 9 cases out of 10), including additional measurements on top of ground stations, mainly meteorological data and satellite-derived information (such as AOD), although many other data sources were frequently considered (land use, demography, previous models, etc.), thus confirming established knowledge in the field [4].

With regard to the methodological solutions implemented, a high degree of fragmentation was found, with the vast majority of studies developing unique ad-hoc frameworks. The literature [3] highlights that there is no single best solution suiting all needs, which can to some extent explain the observed fragmentation; nevertheless, this variety hinders replicability and therefore comparison, and is thus a potential barrier to identifying best practices for future studies on this topic. As a result, the second research question is left unanswered, since it was not possible to identify a specific architecture or model that consistently outperforms other solutions.

Despite this obstacle, it was possible to draw some significant conclusions, and to address the last research question about the expected performance. In particular, the most interesting assessment concerns the relationship between the estimated performance and the level of complexity of the models. In this sense, a primary distinction can be made between classic ML (e.g. RF) and Deep Learning (DL, e.g. CNN), with the latter being, in line with the literature [4,5,6], more widespread for prediction tasks. However, the overall evidence does not point clearly to a superiority of DL over the simpler classic ML. When considering single-block frameworks, DL shows a better performance, although with evidence of different robustness: for instance, a DL approach such as LSTM reached RMSE = 5.75 ± 5.18 µg/m³, against classic ML such as XGBoost with RMSE = 7.78 ± 4.68 µg/m³ or RF with RMSE = 12.79 ± 9.1 µg/m³. When instead considering ensemble architectures, the opposite result emerges, as recorded for RF, with RMSE = 10.44 ± 7.64 µg/m³, and LSTM, with RMSE = 13.15 ± 9.1 µg/m³. It must be noted that, given the large overlap in confidence intervals, there is no evidence of a higher suitability of one approach over the other. It is nevertheless possible to partly confirm previous literature results [4, 5] in the field of DL. For instance, a primary role of LSTM and CNN, considered by the literature to solve many issues affecting older approaches (e.g. the vanishing gradient issue), was verified. 
On the contrary, other parts of established knowledge were not confirmed, such as the preferability of the Gated Recurrent Unit (GRU) over other approaches, which did not emerge clearly in this analysis: only one case [24] was recorded in which it was selected as the best solution (moreover, in that case, the GRU was put in series with an LSTM to build an ensemble architecture, reaching a higher performance than either of the two used individually). In any case, it is advisable to always base the choice on the final goal of the modelling task. Indeed, ML offers the possibility to easily implement explainable-AI models [148], which is vital to generate evidence about the impact of different factors on the levels of pollution, thus generating insights for policy makers about a proper management of the territory in terms of land use [149] and human activities.
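The overlap between the intervals quoted above can be checked directly from the reported figures. The sketch below treats each ‘value ± half-width’ as a closed interval and tests every pair within each group; the helper names are introduced here for illustration, and an overlap check of this kind is only a rough reading of the evidence, not a formal statistical test.

```python
from itertools import combinations


def interval(center, half_width):
    """Return (lower, upper) bounds for a value reported as center ± half_width."""
    return (center - half_width, center + half_width)


def overlaps(a, b):
    """True if the closed intervals a and b share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


# RMSE figures (µg/m3) as reported in the text for single-block frameworks.
single_block = {
    "LSTM": interval(5.75, 5.18),
    "XGBoost": interval(7.78, 4.68),
    "RF": interval(12.79, 9.1),
}
# RMSE figures (µg/m3) as reported in the text for ensemble architectures.
ensemble = {
    "RF": interval(10.44, 7.64),
    "LSTM": interval(13.15, 9.1),
}

for group_name, group in (("single-block", single_block), ("ensemble", ensemble)):
    for (name_a, a), (name_b, b) in combinations(group.items(), 2):
        print(f"{group_name}: {name_a} vs {name_b} -> overlap: {overlaps(a, b)}")
```

Every pair in both groups overlaps, which is consistent with the statement that the figures alone do not establish the superiority of one approach.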

Secondly, in addition to the distinction between classic ML and DL, another important distinction is between single-block frameworks and the more complex ensemble architectures. Previous reviews [3,4,5,6] generally agreed in identifying ensemble architectures as the best performing approaches. At first sight, a different result seems to have emerged in this review, with a lower average performance of ensemble architectures across the different studies. However, considering the 37 studies that implemented an ensemble architecture (and were included in the comparison, being evaluated through RMSE), the vast majority of them (26, 70.3%) made this choice after comparing the ensemble architecture with other single-block models, thus recording a performance increase when the more complex solution was tested. Therefore, it is possible to hypothesize that the overall higher performance of single-block models is actually due to the different experimental set-ups, rather than to the characteristics of the models themselves. This hypothesis is also corroborated by the fact that the inverse scenario, i.e. a single-block model preferred over an ensemble architecture when both were tested, was a very rare occurrence, with only 3 cases out of 96 (3.13%). Therefore, while it is possible to state that an ensemble architecture can help reach higher performance, it must also be specified that the appropriateness of this approach depends strongly on the experimental set-up. An increase in complexity does not automatically result in a higher performance, which partly contradicts previous literature. The trade-off between complexity and expected performance should therefore be carefully analyzed case by case, according to needs and specifics in terms of context, application scenario, aim, available data sources, and characteristics of target and explanatory data, with different choices being suitable in different situations. 
While some solutions proved generally more robust and have stronger evidence than others, an extensive comparative analysis of different models is recommended when implementing a new solution. Choosing the model that is reported to have the best numerical performance can be misleading, as the quantitative evaluations proved to depend more on the initial set-up than on the developed model.

In conclusion, this study shows that the target field is one of the fastest-evolving and most diverse applications of machine learning technologies. In this scenario, a relevant application of the analysis performed is to provide a reference framework for researchers approaching the topic, having identified the most relevant features of the case studies to be taken into account when defining the experimental set-up. In light of all the above, while this literature review can be considered a reference for general benchmarking, even greater relevance should be attributed to the methodological guidelines proposed.