Comparison of statistical clustering techniques for the classification of modelled atmospheric trajectories
Abstract
In this study, we used and compared three different statistical clustering methods: a hierarchical, a nonhierarchical (K-means) and an artificial neural network technique (self-organizing maps (SOM)). These classification methods were applied to a 4-year dataset of 5-day kinematic back trajectories of air masses arriving in Athens, Greece at 12.00 UTC, at three different heights above the ground. The atmospheric back trajectories were simulated with the HYSPLIT Version 4.7 model of the National Oceanic and Atmospheric Administration (NOAA). The meteorological data used for the computation of trajectories were obtained from the NOAA reanalysis database. A comparison of the three statistical clustering methods through statistical indices was attempted. It was found that all three statistical methods seem to depend on the arrival height of the trajectories, but the degree of dependence differs substantially. Hierarchical clustering showed the highest level of dependence on the arrival height for fast-moving trajectories, followed by SOM. K-means was found to be the clustering technique least dependent on the arrival height. The air quality management applications of these results in relation to PM_{10} concentrations recorded in Athens, Greece, were also discussed. Differences in PM_{10} concentrations between certain clusters were found to be statistically significant (at the 95% confidence level), indicating that these clusters appear to be associated with long-range transport of particulates. This study can improve the interpretation of modelled atmospheric trajectories, leading to a more reliable analysis of synoptic weather circulation patterns and their impacts on urban air quality.
Keywords
Cluster Technique, PM10 Concentration, Sahara Desert, Back Trajectory, Circulation Regime
1 Introduction
Computer-based statistical classifications use various meteorological parameters from one or more surface or upper-air stations in order to identify the weather types occurring in an area (Yao 1998). Although these “objective” methods do not entirely rely on the investigator's expert judgment, they still introduce certain limitations to the physical interpretation of the results. In order to reduce the subjectivity of these classifications and improve the analysis of large meteorological datasets, principal component and/or cluster analyses have been used to classify weather types.
Cluster analysis groups data into subsets (clusters), in such a way that data within each cluster are more closely related to one another than those assigned to different clusters. There are two main methods of statistical clustering: the hierarchical and the nonhierarchical (K-means).
The statistical clustering techniques for the classification of atmospheric trajectories used in the past vary widely, based on different hierarchical and nonhierarchical methods. In recent years, artificial neural network techniques have increasingly been recognized as a useful statistical technique for the classification of both environmental and meteorological data.
In order to analyze the atmospheric circulation patterns, several investigators applied multivariate techniques including clustering methods to classify modelled back trajectories (Leavey and Sweeny 1990; Dorling and Davis 1995; Cape et al. 2000; Jorba et al. 2004). In these studies, the coordinates of the back trajectories were used as the clustering variables. The coordinates along the route of the trajectories led to the identification of distinct groups with similar characteristics, i.e. similar direction of approach and speed of passage over potential pollution source areas (Dorling et al. 1992). The trajectory types are more readily interpretable in terms of the synoptic conditions that form them. Large-scale circulation features often result in certain trajectory clusters. Recently, Borge et al. (2007) used a two-stage cluster analysis (based on the nonhierarchical K-means algorithm) to classify back trajectories arriving in three different areas in Europe. This technique, although not fully objective, has the advantage of producing highly disaggregated trajectory clusters which may correspond to significantly different ambient PM_{10} (i.e. particulates with diameter smaller than 10 μm) and ozone levels.
Considerable work on the analysis of back trajectories and their associated uncertainty has also been done with the aid of the Flexpart model by Stohl et al. (2002), while Wernli and Davies (1997) examined the space-time structure and dynamics of extratropical cyclones with the use of a Lagrangian-based method.
In recent years, artificial neural network, fuzzy logic and data mining techniques have gained interest and have been progressively recognized as promising techniques for the prediction and classification of both meteorological and air quality data (Schlink et al. 2006). Specifically, Kolehmainen et al. (2000, 2001) applied the self-organizing map (SOM) algorithm to forecast NO_{2} levels in Stockholm. Recently, Schädler and Sasse (2006) analyzed the connection between precipitation and synoptic atmospheric processes in the eastern Mediterranean using SOM. To the authors' knowledge, neither fuzzy nor SOM techniques have been used, up to now, to classify atmospheric back trajectories.
The aim of the present study is to apply several statistical clustering techniques (hierarchical, nonhierarchical K-means and artificial neural network SOM) to a 4-year dataset of modelled atmospheric trajectories, in order to examine their performance and sensitivity to certain parameters such as the arrival height and to compare the resulting back trajectory groups derived from the different clustering techniques. These statistical clustering techniques can be used to identify synoptic weather regimes and long-range transport patterns that may affect air pollution.
2 Methods
2.1 Model and input data
Five-day-long kinematic back trajectories arriving in Athens, Greece (37.2°N, 23.47°E), computed at 12.00 UTC (i.e. 14.00 local time) for every day during a 4-year period (2001–2004) were used. The kinematic back trajectories were calculated with version 4 of the Hybrid Single-Particle Lagrangian Integrated Trajectory model (HYSPLIT) developed by the National Oceanic and Atmospheric Administration (NOAA) Air Resources Laboratory (Draxler and Hess 1998; Rolph 2003). HYSPLIT has been widely used in air pollution applications (Artíñano et al. 2001; Borge et al. 2007) and can be run either online or offline on a PC (further information on the model can be found at: http://www.arl.noaa.gov/ready/hysplit4.html).
The meteorological data used for the computation of the trajectories in the present study were obtained from the NOAA reanalysis database (http://www.arl.noaa.gov/archives.php). These data were produced by the National Center for Environmental Prediction Global Data Assimilation System (Kanamitsu 1989), which uses the spectral Medium-Range Forecast Model for the weather forecast. The horizontal resolution (2.5° latitude–longitude) of the trajectories matched the resolution of the reanalysis data. In total, 29 vertical levels were used. The vertical levels at which meteorological fields were computed (first at 10 m, second at 75 m above ground) were denser near the ground and sparser above. The vertical transport was modelled using the isobaric option of HYSPLIT. The back trajectories were computed, every 6 h, at three heights above ground (10, 100 and 500 m). These heights were chosen in such a way that the results of this study could be used for air quality purposes.
The use of trajectories is based on the assumption that the troposphere is constituted of consistent layers that are transported undergoing gradual mixing with the background (Newell et al. 1999). The length of back trajectories is restricted by the distances between the source regions and the point of arrival. In this study, we have chosen 5-day back trajectories, which are consistent with previous studies carried out in the same area (Mihalopoulos et al. 1997).
2.2 Clustering procedure
Cluster analysis is a multivariate statistical technique designed to find structure inside a dataset (Everitt 1980). The approach involves splitting a dataset into a number of groups that need to be as different as possible. It is a rather objective classification method since there are techniques and criteria, examining the minimization of the distance within a cluster and maximization of the distance between clusters, for finding the optimal number of clusters. Clustering techniques are unsupervised learning processes that try to group data based on a similarity and/or dissimilarity measure. The use of different measures is expected to lead to different clusters.
In this study, we employed a variation of a graph-based method (Salvador and Chan 2005) in order to define the appropriate number of clusters for our dataset of modelled atmospheric trajectories. Having determined the number of clusters, we proceeded by utilizing three different clustering techniques. This gives us the ability to examine whether the generated clusters differ substantially for different clustering techniques, depending strongly on the employed similarity measure. This assessment relies on the analysis of the average trajectories (centroids) for each cluster, which have been calculated by averaging the coordinates of the individual trajectories belonging to each cluster.
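The centroid calculation described above can be illustrated with a minimal sketch. It assumes that each trajectory is stored as a list of (latitude, longitude) pairs with one pair per time step; the function name and the data layout are illustrative assumptions, not part of the original analysis. Plain arithmetic averaging of longitudes also assumes that no trajectory crosses the ±180° meridian.

```python
from typing import List, Tuple

Point = Tuple[float, float]   # (latitude, longitude) in degrees (assumed layout)
Trajectory = List[Point]      # one point per time step


def cluster_centroid(trajectories: List[Trajectory]) -> Trajectory:
    """Average the coordinates of all member trajectories, step by step."""
    if not trajectories:
        raise ValueError("cluster is empty")
    n_steps = len(trajectories[0])
    if any(len(t) != n_steps for t in trajectories):
        raise ValueError("all trajectories must have the same number of points")
    n = len(trajectories)
    return [
        (
            sum(t[i][0] for t in trajectories) / n,  # mean latitude at step i
            sum(t[i][1] for t in trajectories) / n,  # mean longitude at step i
        )
        for i in range(n_steps)
    ]
```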
Trajectories are defined as the series of successive points where the air mass is located at 6-h intervals. Distances are calculated between points corresponding to the same time interval. The distance between two trajectories is the sum of the Euclidean distances between the trajectories' points.
In order to analyze the back trajectory data, we employed the following statistical clustering techniques: hierarchical clustering (Johnson 1967), the K-means algorithm (McQueen 1967) and the self-organizing maps (Kohonen 2001).
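A minimal sketch of this trajectory distance is given below. It assumes that both trajectories hold the same number of (latitude, longitude) points and that the Euclidean distance is taken directly on these coordinates; the function name is hypothetical.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float]  # (latitude, longitude), assumed layout


def trajectory_distance(a: Sequence[Point], b: Sequence[Point]) -> float:
    """Sum of point-to-point Euclidean distances at matching time steps."""
    if len(a) != len(b):
        raise ValueError("trajectories must have the same number of points")
    return sum(math.hypot(pa[0] - pb[0], pa[1] - pb[1]) for pa, pb in zip(a, b))
```

For two trajectories of two points each, the result is simply the sum of the two point separations.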
2.3 Determining the number of clusters
When the above procedure was applied to the back trajectory data from Athens and after utilizing 20 clusters, the screen-like curve required by the L-method could still not be attained. Using an even larger number of clusters would not offer any practical solution to the problem under study. Figure 1 presents the above-mentioned screen-like curve. The x-axis represents the number of clusters while the y-axis represents the normalised Euclidean distance between the trajectories to fall within the interval (2–20). Based on this slight variation of the L-method, we finally obtained the required curve and estimated the appropriate number of clusters, which was 5.6 (Fig. 1). It must be noted, at this point, that several fitting approaches were attempted (e.g. higher order polynomials, interpolation, splines etc.) to determine the number of clusters. In every case, the point of interest was approximately at the same location, thus always yielding six as the optimum number of clusters.
2.4 Hierarchical clustering
Setting six as the optimum number of clusters, we employed four different algorithms in order to segregate the back trajectory data, starting with the simple hierarchical clustering technique. Briefly, hierarchical clustering partitions data following a series of steps either by grouping (agglomerative or bottom-up) or by separating (divisive or top-down) the objects one by one in each step. Agglomerative approaches are those most commonly employed and this is the one used in our work. According to this, the two closest clusters are merged in each step. It should be noted that the procedure starts with singleton clusters and ends with a single cluster that contains all the objects. Moreover, there are a number of different techniques to measure the distance (or the similarity) between the clusters, which may lead to different subsets. The simplest of these techniques and the one utilized here is the squared Euclidean distance method. At each step, the distances between all objects of one cluster and all objects of another cluster are computed and the minimum one defines the distance between these two clusters. This is carried out for all possible pairs of clusters, so the pair with the minimum distance is merged and the process continues similarly. The final output of this procedure is the grouping of all elements in the form of a tree hierarchy. “Cutting” the tree at a given level will yield the corresponding clustering with certain precision. Conversely, having the number of clusters fixed, the equivalent precision level can easily be determined.
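The agglomerative procedure with the minimum-distance merging rule described above (commonly called single linkage) can be sketched as follows. This is a naive illustration that stops when a requested number of clusters is reached; the function name and the generic distance argument are assumptions, and a production analysis would normally rely on an optimized library routine.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def single_linkage(
    items: Sequence[T],
    dist: Callable[[T, T], float],
    n_clusters: int,
) -> List[List[int]]:
    """Merge the two closest clusters until n_clusters remain.

    Returns clusters as lists of indices into items.
    """
    if not 1 <= n_clusters <= len(items):
        raise ValueError("n_clusters must be between 1 and len(items)")
    clusters = [[i] for i in range(len(items))]  # start from singletons
    while len(clusters) > n_clusters:
        best = None  # (distance, index of cluster a, index of cluster b)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Cluster distance = minimum distance over all object pairs.
                d = min(
                    dist(items[i], items[j])
                    for i in clusters[a]
                    for j in clusters[b]
                )
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a].extend(clusters[b])
        del clusters[b]
    return clusters
```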
2.5 K-means algorithm
K-means is a nonhierarchical clustering algorithm widely used in many applications that require partitional clustering. The difference between the two techniques is that hierarchical algorithms create successive clusters applying user-determined cluster numbers, whereas K-means partitional algorithms determine iteratively all clusters in one step (Dorling and Davis 1995). The main concept behind K-means is the utilization of k centroids (or “seeds”), one per cluster. Initially, the centroids are placed randomly in space and all objects are assigned to the nearest centroid. The positions of the centroids are then re-estimated and located in the centre of the cluster they correspond to. This procedure, object assignment and centroid re-estimation, is repeated until the k centroids are fixed. It should be mentioned that the final clusters produced by K-means are sensitive to the initial placement of the centroids, so the procedure should be repeated several times in order to identify stable centroid positions and provide more robust results.
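The assignment and re-estimation loop, together with repeated random starts, can be sketched as below. The sketch assumes that each trajectory has been flattened into a numeric vector, that the initial centroids are drawn from the data points themselves, and that squared Euclidean distance is used; all function names are illustrative.

```python
import random
from typing import List, Sequence, Tuple

Vector = Sequence[float]
Result = Tuple[List[int], List[List[float]], float]


def _sq_dist(a: Vector, b: Vector) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_once(data: Sequence[Vector], k: int, rng: random.Random,
                max_iter: int = 100) -> Result:
    """One K-means run from a random start; returns labels, centroids, SSE."""
    centroids = [list(p) for p in rng.sample(list(data), k)]
    labels: List[int] = []
    for _ in range(max_iter):
        # Assignment step: each object goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: _sq_dist(p, centroids[c])) for p in data
        ]
        if new_labels == labels:  # assignments (and centroids) are fixed
            break
        labels = new_labels
        # Re-estimation step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(data, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    sse = sum(_sq_dist(p, centroids[lab]) for p, lab in zip(data, labels))
    return labels, centroids, sse


def kmeans_best_of(data: Sequence[Vector], k: int, n_runs: int = 20,
                   seed: int = 0) -> Result:
    """Repeat K-means n_runs times and keep the run with the smallest SSE."""
    rng = random.Random(seed)
    return min((kmeans_once(data, k, rng) for _ in range(n_runs)),
               key=lambda run: run[2])
```

Keeping the run with the smallest within-cluster sum of squared errors is one simple way of reducing the dependence on the initial seeds.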
Bearing this in mind, we ran K-means twenty times and finally selected six clusters that minimized (as much as possible) the within-cluster sum of squared errors (i.e. distances) between the centroids and the corresponding data points. The K-means clustering employed in this study is suitable for large datasets because of its relatively small computational requirements. Its main feature is that the optimum number of clusters is given by the algorithm (Moody et al. 1995). The measure of distance used to evaluate the root-mean-square deviation of a trajectory from its centroid was based on the Haversine formula of the great-circle distance between two points (Sinnott 1984) concerning longitude and latitude variables.
2.6 Self-organizing maps
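As an illustration, the Haversine great-circle distance and a root-mean-square deviation built on it may be written as follows. The mean Earth radius of 6371 km and the function names are assumptions introduced for this sketch.

```python
import math
from typing import Sequence, Tuple

EARTH_RADIUS_KM = 6371.0  # mean Earth radius (assumed value)


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    # min() guards against rounding slightly above 1 for antipodal points.
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


def rms_deviation_km(trajectory: Sequence[Tuple[float, float]],
                     centroid: Sequence[Tuple[float, float]]) -> float:
    """Root-mean-square great-circle deviation of a trajectory from a centroid."""
    if len(trajectory) != len(centroid):
        raise ValueError("trajectory and centroid must have equal length")
    sq = [haversine_km(p[0], p[1], c[0], c[1]) ** 2
          for p, c in zip(trajectory, centroid)]
    return math.sqrt(sum(sq) / len(sq))
```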
SOMs are considered an advanced approach to clustering that can produce reliable segregation even in difficult cases. They operate similarly to K-means, but instead of using a number of clusters they utilize a grid of nodes with predetermined shape and size. This grid iteratively adjusts to the data until it maps their structure in space as closely as possible. The obtained nodes (or “clusters”) are also organized in a 2D grid so that similar clusters are placed near each other. In that way, clustering is performed following a structured approach, in contrast with the unstructured K-means approach. When SOM is used, our dataset can be organised in six clusters, in two alternative 2D grids (1 × 6 and 2 × 3 configuration). Consequently, two different sets of clusters were generated. In the first case, the nodes were placed in a sequential order, so the two remote clusters (left and right) were the most unrelated, while in the second case the nodes were placed in a 2D lattice arrangement.
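A minimal sketch of how such a grid of nodes adjusts to the data is given below. It is a generic online SOM with a Gaussian neighbourhood and linearly decaying learning rate and radius; these choices, the default parameter values and the function names are assumptions and do not reproduce the configuration used in this study.

```python
import math
import random
from typing import List, Sequence


def best_matching_unit(weights: Sequence[Sequence[float]],
                       x: Sequence[float]) -> int:
    """Index of the node whose weight vector is closest to x."""
    return min(range(len(weights)),
               key=lambda n: sum((a - b) ** 2 for a, b in zip(weights[n], x)))


def train_som(data: Sequence[Sequence[float]], rows: int, cols: int,
              n_iter: int = 2000, lr0: float = 0.5,
              seed: int = 0) -> List[List[float]]:
    """Train a rows x cols SOM; returns one weight vector per node."""
    rng = random.Random(seed)
    grid = [(r, c) for r in range(rows) for c in range(cols)]
    # Initialise every node on a randomly chosen data vector.
    weights = [list(rng.choice(data)) for _ in grid]
    sigma0 = max(rows, cols) / 2.0
    for t in range(n_iter):
        frac = t / n_iter
        lr = lr0 * (1.0 - frac)                  # decaying learning rate
        sigma = max(sigma0 * (1.0 - frac), 0.3)  # shrinking neighbourhood
        x = rng.choice(data)
        br, bc = grid[best_matching_unit(weights, x)]
        for node, (r, c) in enumerate(grid):
            # Nodes close to the winner on the grid are pulled more strongly.
            h = math.exp(-((r - br) ** 2 + (c - bc) ** 2) / (2 * sigma ** 2))
            w = weights[node]
            for d in range(len(w)):
                w[d] += lr * h * (x[d] - w[d])
    return weights
```

Calling `train_som(data, 1, 6)` or `train_som(data, 2, 3)` corresponds to the two grid shapes mentioned above; each trajectory is then assigned to the node returned by `best_matching_unit`.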
3 Results and discussion
3.1 Circulation regimes
 Cluster A:
Air masses have their origin either over the Sahara desert or the Gulf of Sidra and, more generally, the maritime area between Africa and Sicily. It is a rather slow-moving regime reaching Athens after passing over the sea, from the south or southwest. In some cases, the air masses are moving around Athens in a wider sense, representing regional circulation.
 Cluster B:
Initially, the air masses originate from the wider area of central-eastern Europe. Then they cross the Balkans, arriving in Athens after their passage over northern Greece and/or the Aegean Sea. It is a rather slow-moving air mass regime. In one case (10 m, hierarchical), the mean centroid has its origin over the north Adriatic Sea.
 Cluster C:
The origin of this cluster is over Russia (in many cases over eastern Russia). The air masses move towards the Black Sea, pass over eastern Thrace and the North Aegean Sea and reach Athens from the northeast.
 Cluster D:
This is a fast-moving cluster. It originates from northwest Europe and crosses central Europe, arriving in Athens after passing over the Balkans.
 Cluster E:
This is the fastest-moving cluster. Its origin is over the mid-Atlantic; it then passes over France, the western Mediterranean and Italy or the Adriatic Sea/former Yugoslavia, arriving in Athens mainly from the west. It must be noted that in one case (10 m, K-means) the origin of the air mass was over the Cantabrian Sea.
 Cluster F:
The origin of the air mass is over the Pyrenees Mountains/Gulf of Lion or the western Mediterranean. Then, it crosses southern Italy, reaching Athens from the west. At 100 m and 10 m, the origin of the mean back trajectory centroid is over the Adriatic Sea for two clustering techniques (K-means and hierarchical at 100 m, and K-means at 10 m). It is a rather slow-moving cluster.
Figures 2a–c present the six cluster centroids for 500, 100 and 10 m height above ground as well as the percentages of occurrence of each cluster, for the hierarchical, K-means and SOM (2 × 3) clustering techniques. SOM (1 × 6) results are not presented since they are very similar to those for SOM (2 × 3).
3.2 Results produced by different clustering techniques
Table 1 Cluster occurrence (%) per clustering technique for three arrival heights
Cluster  SOM 1 × 6 500 m  SOM 1 × 6 100 m  SOM 1 × 6 10 m  SOM 2 × 3 500 m  SOM 2 × 3 100 m  SOM 2 × 3 10 m  K-means 500 m  K-means 100 m  K-means 10 m  Hier 500 m  Hier 100 m  Hier 10 m
A  18.6  23.8  15.2  22.9  23.8  15.8  14.9  14.3  13.2  17.5  14.4  13.3 
B  25.5  27.2  29.2  24.9  27.4  29.2  29.0  25.0  22.7  20.6  35.9  24.4 
C  9.7  8.4  11.6  9.6  8.2  11.6  21.5  17.0  19.4  15.2  4.8  26.7 
D  13.1  15.5  15.3  14.4  16.4  15.3  9.9  11.5  11.2  8.8  11.5  17.0 
E  8.4  7.7  8.3  7.9  7.7  8.3  7.6  10.5  11.3  3.2  12.3  6.9 
F  24.6  17.4  19.8  20.3  17.4  19.8  17.0  21.8  22.2  34.7  21.1  9.7 
At 100 m arrival height, two main differences were detected between clustering techniques. The first one was for cluster C, whose origin was identified over western Russia according to the K-means analysis, while it was found to be over eastern Russia according to the hierarchical and SOM analyses. The other discrepancy was for cluster F, whose origin was identified over the northern Mediterranean for hierarchical and K-means, but over the Gulf of Lion using SOM (Fig. 2b). At this height, the percentage of occurrence (Table 1) varies greatly between clustering techniques, especially for cluster C.
At 10 m arrival height, quite a few discrepancies were detected between clustering techniques. Specifically, cluster F originated from the Adriatic Sea according to the K-means analysis, but from the Tyrrhenian Sea in SOM and the Pyrenees in the hierarchical analysis. For cluster B, the origin of the centroid is over the Adriatic Sea in hierarchical, but over Ukraine according to the other two clustering techniques (Fig. 2c). Cluster E has its origin over the Cantabrian Sea in K-means, but over the mid-Atlantic in the other two clustering techniques. Although significant differences were found in the origin of the cluster centroids at 10 m, the percentage of occurrence of each group is quite similar, except for clusters C (12–29%) and F (10–22%). Generally, the SOM technique attributes more cases to cluster A (Table 1).
Table 2 Distance between centroids of clusters and within-cluster variance for the K-means clustering approach at the three arrival heights (in km)
Cluster  A  B  C  D  E  F
10 m
A  -  112  178  168  157  110
B  112  -  111  80  242  123
C  178  111  -  175  240  102
D  168  80  174  -  314  201
E  157  242  240  314  -  141
F  110  123  102  201  141  -
Variance  860  224  465  978  1,721  382
100 m
A  -  432  201  109  152  128
B  432  -  599  365  559  489
C  201  599  -  309  142  252
D  109  365  309  -  232  141
E  152  559  142  232  -  128
F  128  489  252  141  128  -
Variance  826  455  1,769  1,017  656  548
500 m
A  -  513  737  593  621  595
B  513  -  374  189  169  762
C  737  374  -  192  250  1,102
D  593  189  192  -  135  919
E  621  169  250  135  -  914
F  595  762  1,102  919  914  -
Variance  810  1,775  2,337  1,420  1,100  772
In order to compare the classifications made by the different clustering techniques at the three different heights above ground, we used the Somers' d test of significance, which is a measure of the association between two ordinal variables (classifications) that ranges from −1 to 1. Values close to an absolute value of 1 indicate a strong relationship between the two classifications and values close to 0 indicate little or no relationship between the two classifications.
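For reference, the asymmetric form of Somers' d can be computed directly from concordant and discordant pairs, as in the sketch below. It assumes that cluster labels have been coded as integers and that d is taken for the second classification given the first; which variant of the statistic was used in this study is not stated, so this is an illustration only.

```python
from typing import Sequence


def somers_d(x: Sequence[int], y: Sequence[int]) -> float:
    """Somers' d of y given x.

    d = (concordant - discordant) / (number of pairs not tied on x).
    """
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    concordant = discordant = untied_x = 0
    for i in range(len(x)):
        for j in range(i + 1, len(x)):
            dx = x[i] - x[j]
            if dx == 0:
                continue  # pairs tied on x do not enter the denominator
            untied_x += 1
            product = dx * (y[i] - y[j])
            if product > 0:
                concordant += 1
            elif product < 0:
                discordant += 1
    if untied_x == 0:
        raise ValueError("all pairs are tied on x")
    return (concordant - discordant) / untied_x
```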
Table 3 Somers' d test of significance results for pairs of clustering techniques at three arrival heights
  Hier 500  Hier 100  Hier 10  K-means 500  K-means 100  K-means 10  SOM 2 × 3 500  SOM 2 × 3 100  SOM 2 × 3 10
Hier 500  -  0.31  0.15  0.32  -  -  0.36  -  -
Hier 100  -  -  0.14  -  0.63  -  -  0.49  -
Hier 10  -  -  -  -  -  0.18  -  -  0.28
K-means 500  -  -  -  -  0.43  0.37  0.54  -  -
K-means 100  -  -  -  -  -  0.88  -  0.40  -
K-means 10  -  -  -  -  -  -  -  -  0.56
SOM 2 × 3 500  -  -  -  -  -  -  -  0.62  0.54
SOM 2 × 3 100  -  -  -  -  -  -  -  -  0.68
SOM 2 × 3 10  -  -  -  -  -  -  -  -  -
3.3 Seasonal distribution
Table 4 Seasonal distribution of clusters (%) per clustering technique for three arrival heights
Cluster  SOM 1 × 6 500 m  SOM 1 × 6 100 m  SOM 1 × 6 10 m  SOM 2 × 3 500 m  SOM 2 × 3 100 m  SOM 2 × 3 10 m  K-means 500 m  K-means 100 m  K-means 10 m  Hier 500 m  Hier 100 m  Hier 10 m
Winter  
A  11  11  8  12  11  9  9  9  8  9  9  7 
B  10  10  12  9  10  11  11  10  9  8  15  10 
C  5  6  8  5  6  7  10  9  11  8  4  13 
D  6  8  8  7  8  8  5  6  6  5  6  10 
E  6  5  6  6  5  6  4  7  7  2  8  4 
F  12  10  8  11  10  9  10  9  9  18  8  6 
Summer  
A  8  12  7  11  12  7  6  5  5  9  5  6 
B  16  17  17  15  17  18  18  15  14  12  21  15 
C  4  3  4  4  3  4  12  8  9  8  1  16 
D  7  7  11  8  7  7  4  5  5  3  6  7 
E  2  3  3  3  3  3  3  4  4  1  4  3 
F  13  8  8  9  8  11  7  13  13  17  13  3 
Cluster A, describing air mass transport from the Sahara and the southern Mediterranean over Athens, appears to be more frequent during winter months. This is due to the fact that March and April, in which the maximum of sand transport from the Sahara Desert over Greece occurs, were included in the winter period in our analysis (Grivas et al. 2008; Kocak et al. 2007). An exception to this pattern was detected at 100 m arrival height for the two SOM techniques.
A clear seasonal pattern is observed for cluster B, which is the northeasterly regime (Etesian winds), occurring during the summer months over Athens. Indeed, cluster B has higher occurrence during the summer period for all clustering techniques and arrival heights.
The cluster C centroid reaches Athens from northeastern directions, having its origin over Russia. This regime’s occurrence is uniformly distributed around the year, presenting a rather mixed behaviour. It mainly appears as a winter regime when the SOM techniques are applied, but as a summer regime when the K-means or the hierarchical methods are used.
Clusters D and E mainly represent winter regimes, associated with northern, western or northwestern air flow over Greece (Table 4). Their seasonal distribution is consistent with earlier studies describing the weather regimes over Athens (Kassomenos 2003a, b).
Finally, cluster F presents a mixed behaviour. At 500 m, it is mainly a winter regime, while at the other two arrival heights the summer occurrence appears to be more frequent. This westerly wind regime may be associated with local circulation patterns over Athens (Kassomenos 2003a, b).
3.4 Sensitivity of clustering techniques
The sensitivity of the clustering techniques applied to the three datasets (one per arrival height) was tested in two different ways: (1) with the application of the same clustering technique to three different arrival heights; (2) with the application of three clustering techniques to the same height.
3.4.1 Application of the same clustering technique to three arrival heights
In this section, we examine whether trajectories arriving in Athens at different heights on the same day are classified in the same cluster using a specific clustering technique.
For the hierarchical clustering technique, the values estimated by the Somers' d test of significance were 0.31, 0.15 and 0.14 (Table 3) for the pairs of arrival heights (500, 100 m), (500, 10 m) and (100, 10 m), respectively. Values close to 0 indicate little or almost no relationship between the classifications and thus high sensitivity of the method (i.e. a small percentage of days associated with the same trajectory cluster for different arrival heights).
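The percentage of days that keep the same trajectory cluster under two classifications (as reported in Table 5) can be illustrated with a short sketch. It assumes that the cluster labels of the two classifications have already been matched (A–F) and that the percentage is taken relative to the cluster membership in the first classification; the choice of denominator and the function name are assumptions.

```python
from collections import Counter
from typing import Dict, Sequence


def same_cluster_percent(labels_a: Sequence[str],
                         labels_b: Sequence[str]) -> Dict[str, float]:
    """For each cluster in labels_a, % of its days with the same label in labels_b."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both classifications must cover the same days")
    totals = Counter(labels_a)
    matches = Counter(a for a, b in zip(labels_a, labels_b) if a == b)
    return {c: 100.0 * matches[c] / totals[c] for c in sorted(totals)}
```

Here `labels_a` and `labels_b` would hold, day by day, the cluster assigned at two arrival heights (or by two clustering techniques).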
Table 5 Trajectories (%) classified in the same cluster for two different arrival heights
Cluster  Hier (500, 10 m)  Hier (500, 100 m)  Hier (100, 10 m)  K-means (500, 10 m)  K-means (500, 100 m)  K-means (100, 10 m)  SOM 2 × 3 (500, 10 m)  SOM 2 × 3 (500, 100 m)  SOM 2 × 3 (100, 10 m)
A  54  56  81  83  86  97  80  72  97 
B  15  37  7  54  56  92  56  67  76 
C  30  66  10  70  78  81  47  70  62 
D  23  23  10  37  39  86  10  61  69 
E  6  10  35  8  7  88  53  63  64 
F  71  50  14  24  27  88  64  75  67 
The K-means clustering technique presented low sensitivity at all three arrival heights. Values estimated by the Somers' d test of significance were 0.63 and 0.37 for the pairs of arrival heights of 500 and 100 m and 500 and 10 m, respectively, and 0.88 for the pair 100 and 10 m, as shown in Table 3. For clusters A and C, the K-means technique was less sensitive to the arrival height. For cluster A, the percentage of trajectories that remained in the same cluster was 83–97%, while for cluster C this percentage ranged between 70 and 81% (Table 5). Trajectories in cluster E were very sensitive to the arrival height for the pairs of 500 and 10 m and 500 and 100 m, but they were almost insensitive for the pair of 100 and 10 m. It is interesting to note that for the arrival height pair of 100 and 10 m, 81–97% of the trajectories remained in the same cluster for all six transport regimes (A–F).
The SOM technique (1 × 6, 2 × 3) shows low sensitivity to all three arrival heights. The values estimated for the Somers' d test of significance ranged between 0.54 and 0.68 for SOM (2 × 3) (Table 3) and between 0.607 and 0.703 for SOM (1 × 6) (not shown).
The highest percentage values were detected for the 100 and 10 m pair (62–97% of trajectories remain in the same cluster for all transport regimes). Furthermore, this technique showed very low sensitivity (61–75%) for the pair of arrival heights of 500 and 100 m. Only cluster D trajectories appeared to be very sensitive to the arrival height for the pair of 500 and 10 m, while the opposite occurred for cluster A (Table 5).
Overall, the SOM clustering technique was less sensitive to the arrival height for clusters A and F, since 72–97% and 64–75% of the trajectories in clusters A and F, respectively, remained in the same cluster when using different arrival heights. By contrast, clusters C and especially D were found to be more dispersed and thus more sensitive to the arrival height (Table 5).
3.4.2 Application of three clustering techniques to the same arrival height
In this section, we examine whether a trajectory arriving in Athens at a specific height is classified in the same cluster when different clustering techniques are used.
Table 6 Trajectories (%) classified in the same cluster using two different clustering techniques
Cluster  10 m Hier, K-means  10 m Hier, SOM 2 × 3  10 m K-means, SOM 2 × 3  100 m Hier, K-means  100 m Hier, SOM 2 × 3  100 m K-means, SOM 2 × 3  500 m Hier, K-means  500 m Hier, SOM 2 × 3  500 m K-means, SOM 2 × 3
A  81  76  81  78  88  82  57  59  79 
B  10  0  78  78  70  72  32  55  41 
C  55  1  44  27  99  38  48  45  38 
D  53  9  71  58  45  48  39  47  73 
E  73  53  50  72  45  29  12  40  46 
F  3  50  52  62  32  31  67  38  63 
The highest variability in cluster classification for the 500 m arrival height was observed between hierarchical and K-means, with a cluster overlap percentage of 12–67%. For the (hierarchical, SOM 2 × 3) pair the respective percentage of trajectories ranged between 38 and 59%, while for the (K-means, SOM 2 × 3) pair it ranged between 38 and 79% (Table 6). Looking at particular clusters, it can be observed that the pair of (K-means, SOM 2 × 3) is less sensitive for clusters A and D (79 and 73%, respectively), the pair of (hierarchical, K-means) for clusters F and A (67% and 57%, respectively), and the pair of (hierarchical, SOM 2 × 3) for clusters A and B (59% and 55%, respectively; Table 6).
At 100 m arrival height, the values estimated for the Somers' d test of significance between different clustering techniques ranged from 0.49 for the pair of (hierarchical, SOM 2 × 3) to 0.63 for the pair of (K-means, hierarchical; Table 3). A relatively large percentage of the trajectories (32–99%) were classified in the same cluster when hierarchical and SOM techniques were applied to the dataset of trajectories arriving at 100 m height. The respective percentages for the other two pairs of clustering techniques (i.e. K-means and SOM 2 × 3, and hierarchical and K-means) were generally lower, indicating a higher sensitivity of the classification to the clustering technique (Table 6).
For the (hierarchical, SOM 2 × 3) pair of clustering techniques, the highest percentage of overlap was found for cluster C trajectories, followed by cluster A, while the lowest was found for cluster F. For the (hierarchical, K-means) pair the highest values were obtained for clusters A and B (approximately 77%) and the lowest for cluster C (58%). Finally, when we applied the K-means and SOM techniques to trajectories arriving at 100 m height, the overlap of the results ranged from 29% for cluster E to 82% for cluster A (Table 6).
Finally, we applied all three clustering techniques to the dataset of trajectories arriving at 10 m. The Somers' d values for the three pairs of clustering techniques ranged between 0.18 and 0.28 (Table 3). The highest value corresponded to the (SOM 2 × 3, hierarchical) pair, and the lowest to the (hierarchical, K-means) pair, indicating high sensitivity of the latter pair of methods applied to trajectories arriving at 10 m height.
When we applied K-means and SOM (2 × 3) to the 10 m arrival height dataset, 44–81% of the trajectories were classified in the same cluster. Again the highest percentage was obtained for cluster A, followed by cluster B, while the lowest value was for cluster C. For the pairs of (hierarchical, K-means) and (hierarchical, SOM 2 × 3), the classification depended greatly on the selected clustering approach, with the exception of cluster A (Table 6).
3.5 Analysis of PM_{10} concentrations
The aim of this work is to present a tool to interpret atmospheric back trajectories based on their respective technique. In this section, we discuss the air quality management implications of our results in relation with particulate matter concentrations in ambient air (PM_{10}). It must be noted that several papers dealing with back trajectories stated that this technique could be used to estimate longrange transport of polluted air masses over a specific area (Dorling et al. 1992; Stohl et al. 2002; Borge et al. 2007; Grivas et al. 2008). This information could be very useful to the relevant authorities, since they could subtract the longrange contribution often related to natural sources (e.g. desert dust) from the PM_{10} concentrations measured in a city such as Athens in order to comply with EU air quality regulations (Vardoulakis and Kassomenos 2008). In this way, in terms of air quality management, the knowledge of the source and path of an air mass (represented by a cluster of back trajectories) could be very useful. The critical question is whether the estimation of the longrange transport contribution to urban PM_{10} depends on the clustering approach used or on the arrival height above the ground specified in the back trajectory modelling.
Thus, we may conclude that although certain clusters appear to be associated with long-range transport of particulates, it is not possible at this stage to attribute them directly to natural sources such as desert dust or sea salt (Grivas et al. 2008). The presented results should be seen as qualitative and preliminary, as more effort is needed to quantify the impact of long-range atmospheric transport on local PM_{10}.
Factors influencing the relation between clusters and PM_{10} concentrations, such as travelling time, washout, deposition, partitioning into PM_{10}, PM_{2.5} and PM_{1.0}, the chemical composition of the particulates (in the three size fractions), as well as possible local effects, are not taken into consideration at this stage and are planned to be examined in forthcoming work.
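To illustrate how a difference in PM_{10} concentrations between two clusters could be checked at the 95% confidence level, the sketch below uses a two-sided permutation test on the difference of means. This is an assumption made for illustration only: the text above does not specify which statistical test was applied, and the function name `permutation_p_value` and the PM_{10} values are hypothetical.

```python
import random
from statistics import mean

def permutation_p_value(group_a, group_b, n_permutations=5000, seed=0):
    """Two-sided permutation test for a difference in means between two groups."""
    rng = random.Random(seed)
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            extreme += 1
    # Add-one correction keeps the estimated p-value strictly above zero.
    return (extreme + 1) / (n_permutations + 1)

# Hypothetical daily PM10 values (micrograms per cubic metre) on days
# assigned to two different trajectory clusters.
pm10_cluster_a = [62, 58, 71, 66, 59, 64, 70, 61]
pm10_cluster_b = [41, 38, 45, 36, 44, 40, 39, 43]
p_value = permutation_p_value(pm10_cluster_a, pm10_cluster_b)
significant_at_95 = p_value < 0.05
```

A permutation test is chosen here because it makes no distributional assumption about daily PM_{10} values; a parametric or rank-based test could be substituted without changing the overall logic.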
4 Conclusions and future work

The air masses affecting Athens originate mainly from the Sahara desert of North Africa, Central-Eastern Europe, Russia, north-western Europe, the mid-Atlantic and the western Mediterranean.

The hierarchical clustering technique seems to be very sensitive to the arrival height of the trajectories, for both fast- and slow-moving clusters. The K-means (non-hierarchical) clustering technique appeared to be less sensitive to the arrival height, while the variability of the occurrence of each circulation regime was small.

The neural network SOM clustering technique was found to be sensitive to the arrival height. Compared with the other two methods, it was found to be more sensitive than K-means and less sensitive than the hierarchical technique.

All three statistical methods seem to be sensitive to the arrival height of the trajectories, but the degree of sensitivity differs substantially. Hierarchical clustering showed the highest level of sensitivity to the arrival height for fast-moving trajectories (C, D and E clusters, i.e. air masses originating from the mid-Atlantic, NW Europe and east Russia), followed by SOM. K-means was found to be the least sensitive clustering technique for this variable.

Slow-moving circulation regimes, especially those originating from North Africa or the western Mediterranean (clusters A and F), did not show significant sensitivity to the arrival height or to the clustering technique applied.

The low sensitivity of the clustering detected for the pair of arrival heights of 10 and 100 m shows that the variability of the transport between 10 m and 100 m cannot be established using the atmospheric back trajectories produced in this study. More work is needed to reach safer conclusions on this point.

Use meteorological fields at higher resolution

A range of clustering techniques should preferably be used when investigating atmospheric trajectories.
The use of modelled atmospheric trajectories and their classification into circulation regimes can play a significant role in studying air pollution in Athens, Greece, an area with a large number of exceedances of the daily PM_{10} concentration limit imposed by the European Union. It may also contribute to the quantification of local and long-range transport contributions (natural or anthropogenic) to the PM_{10}/PM_{2.5} concentrations recorded in Athens. The quantification of these influences is still an open issue, not only for Athens but also for other major cities in southern Europe, and can be further investigated using back trajectory modelling. It is recommended that a range of arrival heights above the ground is used in such applications. Finally, excessive reliance on one particular clustering technique for data analysis and interpretation should be avoided.
Notes
Acknowledgements
The authors gratefully acknowledge the NOAA Air Resources Laboratory (ARL) for the provision of the FNLHYSPLIT data, the HYSPLIT transport and dispersion model and the READY web site (http://www.arl.noaa.gov/ready.htm) used in this work. The authors would also like to thank the two anonymous reviewers for their valuable and constructive suggestions, which improved this work substantially.
References
 Artíñano B, Querol X, Salvador P, Rodríguez S, Alonso DG, Alastuey A (2001) Assessment of airborne particulate levels in Spain in relation to the new EU-Directive. Atmos Environ 35:S43–S53
 Borge R, Lumbreras J, Vardoulakis S, Kassomenos P, Rodriguez E (2007) Analysis of long-range transport influences on urban PM10 using two-stage atmospheric trajectory clusters. Atmos Environ 41:4434–4450
 Cape JN, Methven J, Hudson LE (2000) The use of trajectory cluster analysis to interpret trace gas measurements at Mace Head, Ireland. Atmos Environ 34:3651–3663
 Dorling SR, Davies TD (1995) Extending cluster analysis-synoptic meteorology links to characterize chemical climates at six northwest European monitoring stations. Atmos Environ 29:145–167
 Dorling SR, Davies TD, Pierce CE (1992) Cluster analysis: a technique for estimating the synoptic meteorological controls on air and precipitation chemistry. Atmos Environ 26:2575–2581
 Draxler RR, Hess GD (1998) An overview of the HYSPLIT 4 modelling system for trajectories, dispersion and deposition. Aust Meteorol Mag 47:295–308
 Everitt B (1980) Cluster analysis. Halstead, New York, p 136
 Grivas G, Chaloulakou A, Kassomenos P (2008) An overview of the PM10 pollution problem, in the Metropolitan Area of Athens, Greece. Assessment of controlling factors and potential impact of long range transport. Sci Total Environ 389:165–177
 Johnson SC (1967) Hierarchical Clustering Schemes. Psychometrika 2:241–254
 Jorba O, Perez C, Rocadenbosch F, Baldasano J (2004) Cluster Analysis of 4-Day back Trajectories arriving in the Barcelona area, Spain, from 1997 to 2002. J Appl Meteorol 43:887–900
 Kanamitsu M (1989) Description of the NMC Global Data Assimilation and Forecast System. Weather Forecasting 4:335–342
 Kassomenos P (2003a) Anatomy of the synoptic conditions occurring over southern Greece during the second half of 20th century. Part I. Summer and Winter. Theor Appl Climatol 75(1–2):65–77
 Kassomenos P (2003b) Anatomy of the synoptic conditions occurring over southern Greece during the second half of 20th century. Part II. Spring and Autumn. Theor Appl Climatol 75(1–2):79–92
 Kocak M, Mihalopoulos N, Kubilay N (2007) Contributions of natural sources to high PM10 and PM2.5 events in the eastern Mediterranean. Atmos Environ 41:3806–3818
 Kohonen T (2001) Self organizing maps. Springer
 Kolehmainen M, Martikainen H, Hiltunen T, Ruuskanen J (2000) Forecasting air quality parameters using hybrid neural network modelling. Environ Monit Assess 65:277–286
 Kolehmainen M, Martikainen H, Ruuskanen J (2001) Neural networks and periodic components used in air quality forecasting. Atmos Environ 35:815–825
 Leavey M, Sweeny J (1990) The influence of long range transport of air pollutants on summer visibility at Dublin. Int J Climatol 10:191–201
 McQueen JB (1967) Some methods for classification and Analysis of Multivariate Observations. Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, vol. 1. University of California Press, Los Angeles, pp 281–297
 Melas D, Ziomas I, Klemm O, Zerefos CS (1998) Anatomy of the Sea breeze circulation in Athens area under weak large-scale ambient winds. Atmos Environ 32:2223–2237
 Mihalopoulos N, Stephanou E, Kanakidou M, Pilitsidis S, Bousquet P (1997) Tropospheric aerosol ionic composition in the Eastern Mediterranean region. Tellus B Chem Phys Meteorol 49:314–326
 Moody JL, Oltmans SJ, Levy H II, Merrill T (1995) Transport climatology of tropospheric ozone. Bermuda, 1988–1991. J Geophys Res 100:7179–7194
 Newell R, Thuret V, Cho J, Stoller P, Marenco A, Smit H (1999) Ubiquity of quasi-horizontal layers in the troposphere. Nature 398:316–319
 Rolph GD (2003) READY: Real time Environmental Applications and Display system. NOAA Air Resources Laboratory (http://www.arl.noaa.gov/ready.html)
 Salvador S, Chan P (2005) Learning States and rules for detecting anomalies in Time Series. Appl Intell 23:241–255
 Schädler G, Sasse R (2006) Analysis of the connection between precipitation and synoptic scale processes in the Eastern Mediterranean using self-organizing maps. Meteorol Z 15(3):273–278
 Schlink U, Herbarth O, Richter M, Dorling S, Nunnari G, Cawley G, Pelikan E (2006) Statistical models to assess the health effects and to forecast ground-level ozone. Environ Model Softw 21:547–558
 Stohl A, Eckhardt S, Forster C, James P, Spichtinger N, Seibert P (2002) A replacement for simple back trajectory calculations in the interpretation of atmospheric trace substance measurement. Atmos Environ 36:4635–4648
 Sinnott RW (1984) Virtues of the Haversine. Sky Telescope 68:159
 Vardoulakis S, Kassomenos P (2008) Comparison of factors influencing PM_{10} levels in Athens (Greece) and Birmingham (UK). Atmos Environ 42:3949–3963
 Wernli H, Davies H (1997) A Lagrangian-based analysis of extratropical cyclones. I: the method and some applications. Q J R Meteorol Soc 123:467–489
 Yao CS (1998) A loading correlation model for climatic classification in terms of synoptic climatology. Theor Appl Climatol 61:113–120