Theoretical and Applied Climatology, Volume 102, Issue 1–2, pp 1–12

Comparison of statistical clustering techniques for the classification of modelled atmospheric trajectories

  • P. Kassomenos
  • S. Vardoulakis
  • R. Borge
  • J. Lumbreras
  • C. Papaloukas
  • S. Karakitsios
Original Paper

Abstract

In this study, we used and compared three different statistical clustering methods: a hierarchical, a non-hierarchical (K-means) and an artificial neural network technique (self-organizing maps (SOM)). These classification methods were applied to a 4-year dataset of 5-day kinematic back trajectories of air masses arriving in Athens, Greece at 12.00 UTC, at three different heights above the ground. The atmospheric back trajectories were simulated with the HYSPLIT Version 4.7 model of the National Oceanic and Atmospheric Administration (NOAA). The meteorological data used for the computation of trajectories were obtained from the NOAA reanalysis database. A comparison of the three statistical clustering methods through statistical indices was attempted. It was found that all three statistical methods seem to depend on the arrival height of the trajectories, but the degree of dependence differs substantially. Hierarchical clustering showed the highest level of dependence on the arrival height for fast-moving trajectories, followed by SOM. K-means was found to be the clustering technique least dependent on the arrival height. The air quality management applications of these results in relation to PM10 concentrations recorded in Athens, Greece, were also discussed. PM10 concentrations during certain clusters were found to differ significantly (at the 95% confidence level), indicating that these clusters appear to be associated with long-range transport of particulates. This study can improve the interpretation of modelled atmospheric trajectories, leading to a more reliable analysis of synoptic weather circulation patterns and their impacts on urban air quality.

Keywords

Cluster Technique; PM10 Concentration; Sahara Desert; Back Trajectory; Circulation Regime

1 Introduction

Computer-based statistical classifications use various meteorological parameters from one or more surface or upper-air stations in order to identify the weather types occurring in an area (Yao 1998). Although these “objective” methods do not entirely rely on the investigator's expert judgment, they still introduce certain limitations to the physical interpretation of the results. In order to reduce the subjectivity of these classifications and improve the analysis of large meteorological datasets, principal component and/or cluster analyses have been used to classify weather types.

Cluster analysis groups data into subsets (clusters), in such a way that data within each cluster are more closely related to one another than those assigned to different clusters. There are two main methods of statistical clustering: the hierarchical and the non-hierarchical (K-means).

The statistical clustering techniques for the classification of atmospheric trajectories used in the past vary widely, based on different hierarchical and non-hierarchical methods. In recent years, artificial neural network techniques have increasingly been recognized as a useful statistical technique for the classification of both environmental and meteorological data.

In order to analyze the atmospheric circulation patterns, several investigators applied multivariate techniques including clustering methods to classify modelled back trajectories (Leavey and Sweeny 1990; Dorling and Davis 1995; Cape et al. 2000; Jorba et al. 2004). In these studies, the coordinates of the back trajectories were used as the clustering variables. The coordinates along the air mass type route of the trajectories led to the identification of distinct groups with similar characteristics, i.e. similar direction of approach and speed of passage over potential pollution source areas (Dorling et al. 1992). The trajectory types are more readily interpretable in terms of the synoptic conditions that form them. Large-scale circulation features often result in certain trajectory clusters. Recently, Borge et al. (2007) used a two-stage cluster analysis (based on the non-hierarchical K-means algorithm) to classify back trajectories arriving in three different areas in Europe. This technique, although not fully objective, has the advantage of producing highly disaggregated trajectory clusters which may correspond to significantly different ambient PM10 (i.e. particulates with diameter smaller than 10 μm) and ozone levels.

Considerable work in the field of the analysis of back trajectories and their associated uncertainty was also made with the aid of the model Flexpart by Stohl et al. (2002), while Wernli and Davies (1997) examined the space-time structure and dynamics of extratropical cyclones with the use of a Lagrangian-based method.

In recent years, artificial neural network, fuzzy logic and data mining techniques have gained interest and have been progressively recognized as promising techniques for the prediction and classification of both meteorological and air quality data (Schlink et al. 2006). Specifically, Kolehmainen et al. (2000, 2001) applied the self-organizing map (SOM) algorithm to forecast NO2 levels in Stockholm. Recently, Schädler and Sasse (2006) analyzed the connection between precipitation and synoptic atmospheric processes in the eastern Mediterranean using SOM. To the authors' knowledge, neither fuzzy nor SOM techniques have been used, up to now, to classify atmospheric back trajectories.

The aim of the present study is to apply several statistical clustering techniques (hierarchical, non-hierarchical K-means and artificial neural network SOM) to a 4-year dataset of modelled atmospheric trajectories, in order to examine their performance and sensitivity to certain parameters such as the arrival height and to compare the resulting back trajectory groups derived from the different clustering techniques. These statistical clustering techniques can be used to identify synoptic weather regimes and long-range transport patterns that may affect air pollution.

2 Methods

2.1 Model and input data

Five-day long kinematic back trajectories arriving in Athens, Greece (37.2°N, 23.47°E), computed at 12.00 UTC (i.e. 14.00 local time) for every day during a 4-year period (2001–2004) were used. The kinematic back trajectories were calculated with version 4 of the Hybrid Single-Particle Lagrangian Integrated Trajectory model (HYSPLIT) developed by the National Oceanic and Atmospheric Administration (NOAA) Air Resources Laboratory (Draxler and Hess 1998; Rolph 2003). HYSPLIT has been widely used in air pollution applications (Artíñano et al. 2001; Borge et al. 2007) and can be run either online or offline on a PC (further information on the model can be found at: http://www.arl.noaa.gov/ready/hysplit4.html).

The meteorological data used for the computation of the trajectories in the present study were obtained from the NOAA reanalysis database (http://www.arl.noaa.gov/archives.php). These data were produced by the National Center for Environmental Prediction Global Data Assimilation System (Kanamitsu 1989) which uses the spectral Medium-Range Forecast Model for the weather forecast. The horizontal resolution (2.5° for latitude–longitude) of the trajectories matched the resolution of the reanalysis data. In total, 29 vertical levels were used. The vertical levels (first at 10 m, second at 75 m above ground) at which meteorological fields were computed were denser near the ground and sparser above. The vertical transport was modelled using the isobaric option of HYSPLIT. The back trajectories were computed, every 6 h, at three heights above ground (10, 100 and 500 m). These heights were chosen in such a way that the results of this study could be used for air quality purposes.

The use of trajectories is based on the assumption that the troposphere is constituted of consistent layers that are transported undergoing gradual mixing with the background (Newell et al. 1999). The length of back-trajectories is restricted by the distances between the source regions and the point of arrival. In this study, we have chosen 5-day back trajectories which are consistent with previous studies carried out in the same area (Mihalopoulos et al. 1997).

2.2 Clustering procedure

Cluster analysis is a multivariate statistical technique designed to find structure inside a dataset (Everitt 1980). The approach involves splitting a dataset into a number of groups that need to be as different as possible. It is a rather objective classification method since there are techniques and criteria, examining the minimization of the distance within a cluster and maximization of the distance between clusters, for finding the optimal number of clusters. Clustering techniques are unsupervised learning processes that try to group data based on a similarity and/or dissimilarity measure. The use of different measures is expected to lead to different clusters.

In this study, we employed a variation of a graph-based method (Salvador and Chan 2005) in order to define the appropriate number of clusters for our dataset of modelled atmospheric trajectories. Having determined the number of clusters, we proceed by utilizing three different clustering techniques. This gives us the ability to examine whether the generated clusters differ substantially for different clustering techniques depending strongly on the employed similarity measure. This assessment relies on the analysis of the average trajectories (centroids) for each cluster, which have been calculated by averaging the coordinates of the individual trajectories belonging to each cluster.

Trajectories are defined as the series of successive points where the air mass is located in 6-h intervals. Distances are calculated between points corresponding to the same time interval. The distance between two trajectories is the sum of the Euclidean distances of the trajectories' points.
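As an illustration only, the trajectory distance defined above and the centroid averaging described earlier can be sketched in Python. The function names and the toy coordinates are hypothetical, and the sketch assumes each trajectory is stored as a list of (x, y) points, one per 6-h step; the coordinate system used for the Euclidean distances is not specified here.

```python
import math

def trajectory_distance(traj_a, traj_b):
    """Sum of Euclidean distances between points at the same time step."""
    if len(traj_a) != len(traj_b):
        raise ValueError("trajectories must have the same number of time steps")
    return sum(math.dist(p, q) for p, q in zip(traj_a, traj_b))

def centroid(trajectories):
    """Average trajectory: mean coordinates at each time step."""
    n = len(trajectories)
    steps = len(trajectories[0])
    return [
        (sum(t[i][0] for t in trajectories) / n,
         sum(t[i][1] for t in trajectories) / n)
        for i in range(steps)
    ]

# Two made-up three-point trajectories (not data from this study).
a = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
b = [(0.0, 3.0), (1.0, 4.0), (2.0, 0.0)]
print(trajectory_distance(a, b))  # 3 + 4 + 0 = 7.0
print(centroid([a, b]))           # [(0.0, 1.5), (1.0, 2.0), (2.0, 0.0)]
```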

In order to analyze the back trajectory data, we employed the following statistical clustering techniques: hierarchical clustering (Johnson 1967), the K-means algorithm (McQueen 1967) and the self-organizing maps (Kohonen 2001).

2.3 Determining the number of clusters

In clustering applications, the appropriate number of clusters is commonly determined following a trial-and-error procedure combined with manual verification. However, when the dataset is very large and multidimensional, as in our case, the above approach is impractical and an automated method is required. In the present study, the l-method (Salvador and Chan 2005) was employed. According to this method, clustering is performed sequentially using a different number of clusters each time. The number of clusters is then plotted against an evaluation metric of the clusters, e.g. distance, similarity or error. Each side of the generated screen-like curve, right and left (Fig. 1), is fitted with a straight line and the point where the two straight lines intersect determines the appropriate number of clusters.
Fig. 1

Application of the l-method to determine the appropriate number of clusters (in this case found to be 5.6)

When the above procedure was applied to the back trajectory data from Athens, the screen-like curve required by the l-method could not be attained even after utilizing 20 clusters. Using an even larger number of clusters would not offer any practical solution to the problem under study. Figure 1 presents the above-mentioned screen-like curve. The x-axis represents the number of clusters, while the y-axis represents the Euclidean distance between the trajectories, normalised to fall within the interval (2–20). Based on this slight variation of the l-method, we finally obtained the required curve and estimated the appropriate number of clusters, which was 5.6 (Fig. 1). It must be noted, at this point, that several fitting approaches were attempted (e.g. higher order polynomials, interpolation, splines etc.) to determine the number of clusters. In every case, the point of interest was at approximately the same location, thus always yielding six as the optimum number of clusters.
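The two-line fit behind the l-method can be sketched as follows. This is a minimal illustration under stated assumptions, not the exact procedure used in this study: the split between the left and right sides is chosen by minimising the size-weighted RMSE of the two fits, the function names are hypothetical, and the evaluation curve is synthetic. The normalisation step mentioned above is not reproduced.

```python
def fit_line(xs, ys):
    """Ordinary least-squares line; returns (slope, intercept, rmse)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    intercept = my - slope * mx
    sq_err = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    return slope, intercept, (sq_err / n) ** 0.5

def l_method(ks, metric):
    """x-coordinate where the best-fitting pair of lines intersect."""
    best = None
    for split in range(2, len(ks) - 1):       # each side keeps at least 2 points
        left = fit_line(ks[:split], metric[:split])
        right = fit_line(ks[split:], metric[split:])
        # Weight each side's RMSE by its share of the points.
        score = (split * left[2] + (len(ks) - split) * right[2]) / len(ks)
        if best is None or score < best[0]:
            best = (score, left, right)
    _, (m1, c1, _), (m2, c2, _) = best
    return (c2 - c1) / (m1 - m2)              # assumes the two slopes differ

# Synthetic curve: steep decline up to k = 6, nearly flat afterwards.
ks = list(range(2, 21))
metric = [26 - 3 * k if k <= 6 else 9.2 - 0.2 * k for k in ks]
print(round(l_method(ks, metric), 3))         # close to 6 for this made-up curve
```

A non-integer intersection, such as the 5.6 reported above, would then be rounded to the nearest whole number of clusters.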

2.4 Hierarchical clustering

Setting six as the optimum number of clusters, we employed four different algorithms in order to segregate the back trajectory data, starting with the simple hierarchical clustering technique. Briefly, hierarchical clustering partitions data following a series of steps either by grouping (agglomerative or bottom-up) or by separating (divisive or top-down) the objects one by one in each step. Agglomerative approaches are those most commonly employed and this is the one used in our work. According to this, the two closest clusters are merged in each step. It should be noted that the procedure starts with singleton clusters and ends with a single cluster that contains all the objects. Moreover, there are a number of different techniques to measure the distance (or the similarity) between the clusters, which may lead to different subsets. The simplest of these techniques and the one utilized here is the squared Euclidean distance method. At each step the distances between all objects of one cluster and all objects of another cluster are computed and the minimum one defines the distance between these two clusters. This is carried out for all possible pairs of clusters so the pair with the minimum distance is merged and the process continues similarly. The final output of this procedure is the grouping of all elements in the form of a tree hierarchy. “Cutting” the tree at a given level will yield the corresponding clustering with certain precision. Conversely, having the number of clusters fixed the equivalent precision level can easily be determined.
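The agglomerative procedure with the minimum-distance merging rule described above can be illustrated with a deliberately naive Python sketch. The function name and the one-dimensional toy data are hypothetical; a real application would pass trajectories and a trajectory distance instead, and would use a far more efficient implementation.

```python
def single_linkage(points, n_clusters, dist):
    """Naive agglomerative clustering using the minimum-distance rule."""
    clusters = [[i] for i in range(len(points))]   # start from singleton clusters
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Cluster distance = smallest distance between any two members.
                d = min(dist(points[i], points[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a].extend(clusters.pop(b))        # merge the closest pair
    return clusters

# Made-up one-dimensional data with squared Euclidean distance.
values = [0.0, 1.0, 10.0, 11.0, 50.0]
groups = single_linkage(values, 3, lambda p, q: (p - q) ** 2)
print(groups)  # [[0, 1], [2, 3], [4]] (indices into `values`)
```

Stopping the loop at a fixed number of clusters corresponds to cutting the tree at the level that yields that many groups.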

2.5 K-means algorithm

K-means is a non-hierarchical clustering algorithm widely used in many applications that require partitional clustering. The difference between the two techniques is that hierarchical algorithms create successive clusters applying user-determined cluster numbers, whereas K-means partitional algorithms determine iteratively all clusters in one step (Dorling and Davis 1995). The main concept behind K-means is the utilization of k centroids (or “seeds”), one per cluster. Initially, the centroids are placed randomly in space and all objects are assigned to the nearest centroid. The positions of the centroids are then re-estimated and located in the centre of the cluster they correspond to. This procedure, object assignment and centroid re-estimation, is repeated until the k centroids are fixed. It should be mentioned that the final clusters produced by K-means are sensitive to the initial placement of the centroids, so the procedure should be repeated several times in order to identify stable centroid positions and provide more robust results.
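The assignment and re-estimation loop, together with repeated runs from different initial centroids, can be sketched as below. This is an illustrative sketch only: the function names are hypothetical, the initial centroids are drawn from the data points themselves (an assumption; other random placements are possible), and plain squared Euclidean distance on 2-D points stands in for the trajectory distance.

```python
import random

def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, rng, max_iter=100):
    """One K-means run; returns (centroids, labels, sse)."""
    centroids = rng.sample(points, k)              # random initial seeds
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(k), key=lambda c: sq_dist(p, centroids[c]))
                      for p in points]
        if new_labels == labels:                   # assignments are stable
            break
        labels = new_labels
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:                            # keep the old seed if empty
                centroids[c] = tuple(sum(v) / len(members) for v in zip(*members))
    sse = sum(sq_dist(p, centroids[lab]) for p, lab in zip(points, labels))
    return centroids, labels, sse

def best_of(points, k, restarts=20, seed=0):
    """Repeat K-means and keep the run with the smallest within-cluster SSE."""
    rng = random.Random(seed)
    return min((kmeans(points, k, rng) for _ in range(restarts)),
               key=lambda run: run[2])

# Made-up data: two well-separated groups of three points.
pts = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centroids, labels, sse = best_of(pts, 2)
print(labels, round(sse, 3))
```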

Bearing this in mind, we ran K-means twenty times and finally selected six clusters that minimized (as much as possible) the within cluster sum of squared errors (i.e. distances) between the centroids and the corresponding data points. The K-means clustering employed in this study is suitable for large datasets because of its relatively small computational requirements. Its main feature is that the optimum number of clusters is given by the algorithm (Moody et al. 1995). The measure of distance used to evaluate the root-mean square deviation of a trajectory from its centroid was based on the Haversine formula of the great-circle distance between two points (Sinnott 1984) concerning longitude and latitude variables.
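For reference, the Haversine great-circle distance can be written as a short Python function. The Earth radius value and the function names are assumptions made for this sketch, and the root-mean-square helper reflects one plausible reading of the deviation measure mentioned above, with points given as (latitude, longitude) in degrees.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; assumed value, not from this study

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def rms_deviation_km(trajectory, centroid):
    """Root-mean-square great-circle distance between matching time steps."""
    sq = [haversine_km(*p, *c) ** 2 for p, c in zip(trajectory, centroid)]
    return math.sqrt(sum(sq) / len(sq))

# A quarter of the equator: 0°E to 90°E at latitude 0.
print(round(haversine_km(0.0, 0.0, 0.0, 90.0), 1))  # about 10007.5 km
```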

2.6 Self-organizing maps

SOM is considered an advanced clustering approach that can produce reliable segregation even in difficult cases. It operates similarly to K-means, but instead of using a number of clusters it utilizes a grid of nodes with predetermined shape and size. This grid iteratively adjusts to the data until it maps their structure in space as closely as possible. The obtained nodes (or “clusters”) are also organized in a 2D grid so that similar clusters are placed near each other. In that way, clustering is performed following a structured approach, in contrast with the unstructured K-means approach. When SOM is used, our dataset can be organised in six clusters, in two alternative 2D grids (1 × 6 and 2 × 3 configuration). Consequently, two different sets of clusters were generated. In the first case, the nodes were placed in a sequential order, so the two remote clusters (left and right) were the most unrelated, while in the second case the nodes were placed in a 2D lattice arrangement.
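A very small SOM can be sketched in Python to show how a grid of nodes adjusts to the data. Everything here is an assumption made for illustration (function names, linear decay of the learning rate and neighbourhood width, Gaussian neighbourhood, initialisation from data samples, toy 2-D data); it is not the configuration used in this study.

```python
import math
import random

def best_node(weights, x):
    """Index of the node whose weight vector is closest to x."""
    return min(range(len(weights)),
               key=lambda i: sum((w - v) ** 2 for w, v in zip(weights[i], x)))

def train_som(data, rows, cols, epochs=200, lr0=0.5, seed=0):
    """Train a tiny SOM; returns one weight vector per grid node (row-major)."""
    rng = random.Random(seed)
    nodes = [(r, c) for r in range(rows) for c in range(cols)]
    weights = [list(rng.choice(data)) for _ in nodes]   # initialise from samples
    sigma0 = max(rows, cols) / 2.0
    for epoch in range(epochs):
        frac = epoch / epochs
        lr = lr0 * (1.0 - frac)                         # decaying step size
        sigma = max(sigma0 * (1.0 - frac), 0.5)         # shrinking neighbourhood
        for x in rng.sample(data, len(data)):           # shuffled pass over data
            bmu = nodes[best_node(weights, x)]
            for i, node in enumerate(nodes):
                grid_d2 = (node[0] - bmu[0]) ** 2 + (node[1] - bmu[1]) ** 2
                h = math.exp(-grid_d2 / (2.0 * sigma ** 2))
                weights[i] = [w + lr * h * (v - w) for w, v in zip(weights[i], x)]
    return weights

# Made-up data: two well-separated groups mapped onto a 1 x 2 grid.
data = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
weights = train_som(data, 1, 2)
hits = [best_node(weights, x) for x in data]
print(hits)
```

In this sketch, the 1 × 6 and 2 × 3 configurations discussed above would correspond to `train_som(data, 1, 6)` and `train_som(data, 2, 3)`; each input is then assigned to the cluster of its closest node.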

3 Results and discussion

3.1 Circulation regimes

Following the methodology described above, we applied the three clustering techniques (hierarchical, K-means and SOM) for six clusters (Fig. 2). The correspondence used to label the clusters was based on a visual assessment of the similarity of trajectory characteristics, mainly location and length. The six groups of back trajectories arriving at 500, 100 and 10 m above ground in Athens include the following circulation regimes:
  1. Cluster A: Air masses have their origin either over the Sahara Desert or the Gulf of Sidra and the maritime area between Africa and Sicily in general. It is a rather slow-moving regime reaching Athens after passing over the sea, from the south or southwest. In some cases, the air masses move around the wider Athens area, representing regional circulation.

  2. Cluster B: Initially, the air masses originate from the wider area of central-eastern Europe. Then they cross the Balkans, arriving in Athens after their passage over northern Greece and/or the Aegean Sea. It is a rather slow-moving air mass regime. In one case (10 m, hierarchical), the mean centroid has its origin over the north Adriatic Sea.

  3. Cluster C: The origin of this cluster is over Russia (in many cases over eastern Russia). The air masses move towards the Black Sea, pass over eastern Thrace and the North Aegean Sea and reach Athens from the northeast.

  4. Cluster D: This is a fast-moving cluster. It originates from northwest Europe and crosses central Europe, arriving in Athens after passing over the Balkans.

  5. Cluster E: This is the fastest moving cluster. Its origin is over the mid-Atlantic; it then passes over France, the western Mediterranean and Italy or the Adriatic Sea/former Yugoslavia, arriving in Athens mainly from the west. It must be noted that in one case (10 m, K-means) the origin of the air mass was over the Cantabrian Sea.

  6. Cluster F: The origin of the air mass is over the Pyrenees Mountains/Gulf of Lion or the western Mediterranean. Then, it crosses southern Italy, reaching Athens from the west. At 100 m and 10 m the origin of the mean back trajectory centroid is over the Adriatic Sea for two clustering techniques (K-means and hierarchical at 100 m, and K-means at 10 m). It is a rather slow-moving cluster.

Fig. 2

Back trajectory cluster centroids at (a) 500 m, (b) 100 m and (c) 10 m for the hierarchical, K-means and SOM (2 × 3) techniques. The percentage of occurrence for each cluster is shown in parentheses

Figures 2a–c present the six cluster centroids for 500, 100 and 10 m height above ground as well as the percentages of occurrence of each cluster, for the hierarchical, K-means and SOM (2 × 3) clustering technique. SOM (1 × 6) results are not presented since they are very similar to those for SOM (2 × 3).

3.2 Results produced by different clustering techniques

In general, at 500 m arrival height, the three clustering techniques identify quite similar circulation patterns (Fig. 2a). The main difference was the origin of cluster B, which was found to be over Austria/Hungary using K-means, but over Ukraine using the hierarchical or SOM techniques. Differences were also found in the percentages of occurrence of each cluster between clustering techniques. For example, 14–23% of the trajectories arriving at 500 m height above the ground were classified in cluster A and 21–29% in cluster B (Table 1).
Table 1

Cluster occurrence (%) per clustering technique for three arrival heights

| Cluster | SOM 1 × 6 500 m | SOM 1 × 6 100 m | SOM 1 × 6 10 m | SOM 2 × 3 500 m | SOM 2 × 3 100 m | SOM 2 × 3 10 m | K-means 500 m | K-means 100 m | K-means 10 m | Hier 500 m | Hier 100 m | Hier 10 m |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| A | 18.6 | 23.8 | 15.2 | 22.9 | 23.8 | 15.8 | 14.9 | 14.3 | 13.2 | 17.5 | 14.4 | 13.3 |
| B | 25.5 | 27.2 | 29.2 | 24.9 | 27.4 | 29.2 | 29.0 | 25.0 | 22.7 | 20.6 | 35.9 | 24.4 |
| C | 9.7 | 8.4 | 11.6 | 9.6 | 8.2 | 11.6 | 21.5 | 17.0 | 19.4 | 15.2 | 4.8 | 26.7 |
| D | 13.1 | 15.5 | 15.3 | 14.4 | 16.4 | 15.3 | 9.9 | 11.5 | 11.2 | 8.8 | 11.5 | 17.0 |
| E | 8.4 | 7.7 | 8.3 | 7.9 | 7.7 | 8.3 | 7.6 | 10.5 | 11.3 | 3.2 | 12.3 | 6.9 |
| F | 24.6 | 17.4 | 19.8 | 20.3 | 17.4 | 19.8 | 17.0 | 21.8 | 22.2 | 34.7 | 21.1 | 9.7 |

At 100 m arrival height, two main differences were detected between clustering techniques. The first one was for cluster C whose origin was identified over western Russia according to the K-means analysis, while it was found to be over eastern Russia according to the hierarchical and SOM analysis. The other discrepancy was for cluster F, whose origin was identified over northern Mediterranean for Hierarchical and K-means, but over the Gulf of Lion using SOM (Fig. 2b). At this height, the percentage of occurrence (Table 1) varies greatly between clustering techniques, especially for cluster C.

At 10 m arrival height, quite a few discrepancies were detected between clustering techniques. Specifically, cluster F originated from the Adriatic Sea according to the K-means analysis, but from the Tyrrhenian Sea in SOM and the Pyrenees in hierarchical analysis. For cluster B, the origin of the centroid is over the Adriatic Sea in hierarchical, but over Ukraine according to the other two clustering techniques (Fig. 2c). Cluster E has its origin over the Cantabrian Sea in K-means, but over the mid-Atlantic in the other two clustering techniques. Although significant differences were found in the origin of the cluster centroids at 10 m, the percentage of occurrence of each group is quite similar, except for clusters C (12–29%) and F (10–22%). Generally, the SOM technique attributes more cases to cluster A (Table 1).

The distance between the centroids of the clusters as well as the within cluster variance was computed for all clustering techniques and arrival heights. Table 2 presents, as an example, these results for the K-means clustering technique.
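Table 2

Distance between centroids of clusters and within-cluster variance for the K-means clustering approach at the three arrival heights (in km)

| Arrival height | | A | B | C | D | E | F |
|---|---|---|---|---|---|---|---|
| 10 m | A | | 112 | 178 | 168 | 157 | 110 |
| 10 m | B | 112 | | 111 | 80 | 242 | 123 |
| 10 m | C | 178 | 111 | | 175 | 240 | 102 |
| 10 m | D | 168 | 80 | 174 | | 314 | 201 |
| 10 m | E | 157 | 242 | 240 | 314 | | 141 |
| 10 m | F | 110 | 123 | 102 | 201 | 141 | |
| 10 m | Variance | 860 | 224 | 465 | 978 | 1,721 | 382 |
| 100 m | A | | 432 | 201 | 109 | 152 | 128 |
| 100 m | B | 432 | | 599 | 365 | 559 | 489 |
| 100 m | C | 201 | 599 | | 309 | 142 | 252 |
| 100 m | D | 109 | 365 | 309 | | 232 | 141 |
| 100 m | E | 152 | 559 | 142 | 232 | | 128 |
| 100 m | F | 128 | 489 | 252 | 141 | 128 | |
| 100 m | Variance | 826 | 455 | 1,769 | 1,017 | 656 | 548 |
| 500 m | A | | 513 | 737 | 593 | 621 | 595 |
| 500 m | B | 513 | | 374 | 189 | 169 | 762 |
| 500 m | C | 737 | 374 | | 192 | 250 | 1,102 |
| 500 m | D | 593 | 189 | 192 | | 135 | 919 |
| 500 m | E | 621 | 169 | 250 | 135 | | 914 |
| 500 m | F | 595 | 762 | 1,102 | 919 | 914 | |
| 500 m | Variance | 810 | 1,775 | 2,337 | 1,420 | 1,100 | 772 |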

In order to compare the classifications made by the different clustering techniques at the three different heights above ground, we used the Somers' d test of significance, which is a measure of the association between two ordinal variables (classifications) that ranges from −1 to 1. Values with an absolute value close to 1 indicate a strong relationship between the two classifications and values close to 0 indicate little or no relationship between the two classifications.
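For readers unfamiliar with the statistic, the asymmetric form of Somers' d can be computed from concordant and discordant pairs as in the sketch below. The function name is hypothetical, and the sketch assumes that cluster labels have been coded as ordered numbers; neither the coding of the clusters nor the variant of the statistic used in this study is specified here.

```python
def somers_d(x, y):
    """Somers' d of y given x: (concordant - discordant) / pairs not tied on x."""
    concordant = discordant = untied_x = 0
    n = len(x)
    for i in range(n):
        for j in range(i + 1, n):
            dx, dy = x[i] - x[j], y[i] - y[j]
            if dx == 0:
                continue                 # pairs tied on x are excluded
            untied_x += 1
            if dx * dy > 0:
                concordant += 1
            elif dx * dy < 0:
                discordant += 1
    return (concordant - discordant) / untied_x

# Made-up codes: identical classifications give 1, reversed ones give -1.
print(somers_d([1, 2, 3, 4], [1, 2, 3, 4]))   # 1.0
print(somers_d([1, 2, 3, 4], [4, 3, 2, 1]))   # -1.0
print(somers_d([1, 1, 2, 2], [1, 2, 1, 2]))   # 0.0
```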

A strong relationship between the 10 and 100 m classifications was found when we applied the K-means clustering technique (Table 3): the Somers' d test of significance gave a value of 0.88. A significant relationship was also found when both SOM techniques were applied (0.68). By contrast, the hierarchical method gave a much weaker relationship (0.14).
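Table 3

Somers' d test of significance results for pairs of clustering techniques at three arrival heights

| | Hier 500 | Hier 100 | Hier 10 | K-means 500 | K-means 100 | K-means 10 | SOM 2 × 3 500 | SOM 2 × 3 100 | SOM 2 × 3 10 |
|---|---|---|---|---|---|---|---|---|---|
| Hier 500 | | 0.31 | 0.15 | 0.32 | | | 0.36 | | |
| Hier 100 | | | 0.14 | | 0.63 | | | 0.49 | |
| Hier 10 | | | | | | 0.18 | | | 0.28 |
| K-means 500 | | | | | 0.43 | 0.37 | 0.54 | | |
| K-means 100 | | | | | | 0.88 | | 0.40 | |
| K-means 10 | | | | | | | | | 0.56 |
| SOM 2 × 3 500 | | | | | | | | 0.62 | 0.54 |
| SOM 2 × 3 100 | | | | | | | | | 0.68 |
| SOM 2 × 3 10 | | | | | | | | | |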

3.3 Seasonal distribution

The seasonal distribution of the results explains the annual circulation regime characteristics in more detail. Table 4 presents the 6-month seasonal (i.e. May to October for summer, November to April for winter) occurrence of the clusters at the three arrival heights.
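The six-month seasonal split can be expressed compactly in Python. This is a sketch with hypothetical function names and made-up dates and labels; it assumes that percentages are taken relative to all days, which is consistent with the winter and summer entries of a cluster in Table 4 adding up to roughly its share in Table 1.

```python
from collections import Counter
from datetime import date

def season(day):
    """Six-month seasons as defined above: May to October is summer."""
    return "summer" if 5 <= day.month <= 10 else "winter"

def seasonal_occurrence(days, labels):
    """Percentage of all days falling in each (season, cluster) pair."""
    counts = Counter((season(d), c) for d, c in zip(days, labels))
    return {key: 100.0 * n / len(days) for key, n in counts.items()}

# Made-up example: four days and their cluster labels.
days = [date(2001, 1, 1), date(2001, 7, 1), date(2001, 4, 30), date(2001, 10, 31)]
labels = ["A", "A", "B", "A"]
print(seasonal_occurrence(days, labels))
# {('winter', 'A'): 25.0, ('summer', 'A'): 50.0, ('winter', 'B'): 25.0}
```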
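Table 4

Seasonal distribution of clusters (%) per clustering technique for three arrival heights

| Season | Cluster | SOM 1 × 6 500 m | SOM 1 × 6 100 m | SOM 1 × 6 10 m | SOM 2 × 3 500 m | SOM 2 × 3 100 m | SOM 2 × 3 10 m | K-means 500 m | K-means 100 m | K-means 10 m | Hier 500 m | Hier 100 m | Hier 10 m |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Winter | A | 11 | 11 | 8 | 12 | 11 | 9 | 9 | 9 | 8 | 9 | 9 | 7 |
| Winter | B | 10 | 10 | 12 | 9 | 10 | 11 | 11 | 10 | 9 | 8 | 15 | 10 |
| Winter | C | 5 | 6 | 8 | 5 | 6 | 7 | 10 | 9 | 11 | 8 | 4 | 13 |
| Winter | D | 6 | 8 | 8 | 7 | 8 | 8 | 5 | 6 | 6 | 5 | 6 | 10 |
| Winter | E | 6 | 5 | 6 | 6 | 5 | 6 | 4 | 7 | 7 | 2 | 8 | 4 |
| Winter | F | 12 | 10 | 8 | 11 | 10 | 9 | 10 | 9 | 9 | 18 | 8 | 6 |
| Summer | A | 8 | 12 | 7 | 11 | 12 | 7 | 6 | 5 | 5 | 9 | 5 | 6 |
| Summer | B | 16 | 17 | 17 | 15 | 17 | 18 | 18 | 15 | 14 | 12 | 21 | 15 |
| Summer | C | 4 | 3 | 4 | 4 | 3 | 4 | 12 | 8 | 9 | 8 | 1 | 16 |
| Summer | D | 7 | 7 | 11 | 8 | 7 | 7 | 4 | 5 | 5 | 3 | 6 | 7 |
| Summer | E | 2 | 3 | 3 | 3 | 3 | 3 | 3 | 4 | 4 | 1 | 4 | 3 |
| Summer | F | 13 | 8 | 8 | 9 | 8 | 11 | 7 | 13 | 13 | 17 | 13 | 3 |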

Cluster A, describing air mass transport from the Sahara and the southern Mediterranean over Athens, appears to be more frequent during winter months. This is due to the fact that March and April, in which the maximum of sand transport from the Sahara Desert occurs over Greece, were included in the winter period in our analysis (Grivas et al. 2008; Kocak et al. 2007). An exception to this pattern was detected at 100 m arrival height for the two SOM techniques.

A clear seasonal pattern is observed for cluster B, which is the northeasterly regime (Etesian winds), occurring during the summer months over Athens. Indeed, cluster B has higher occurrence during the summer period for all clustering techniques and arrival heights.

Cluster C centroid reaches Athens from northeastern directions having its origin over Russia. This regime’s occurrence is uniformly distributed around the year, presenting a rather mixed behaviour. It mainly appears as a winter regime when the SOM techniques are applied, but then as a summer regime when the K-means or the Hierarchical methods are used.

Clusters D and E mainly represent winter regimes, associated with northern, western or northwestern air flow over Greece (Table 4). Their seasonal distribution is consistent with earlier studies describing the weather regimes over Athens (Kassomenos 2003a, b).

Finally, cluster F shows a mixed behaviour. At 500 m it is mainly a winter regime, while at the other two arrival heights the summer occurrence appears to be more frequent. This westerly wind regime may be associated with local circulation patterns over Athens (Kassomenos 2003a, b).

3.4 Sensitivity of clustering techniques

The sensitivity of the clustering techniques applied to the three datasets (one per arrival height) was tested in two different ways: (1) With the application of the same clustering techniques to three different arrival heights; (2) with the application of three clustering techniques to the same height.

3.4.1 Application of the same clustering technique to three arrival heights

In this section, we examine whether trajectories arriving in Athens at different heights on the same day are classified in the same cluster using a specific clustering technique.
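The kind of comparison examined in this section can be sketched as follows. The function name and the labels are hypothetical, and the sketch assumes that the first classification of each pair serves as the reference, i.e. percentages are taken over the days assigned to each cluster in that classification.

```python
def same_cluster_percent(labels_ref, labels_other):
    """For each cluster in labels_ref, the % of its days given the same label in labels_other."""
    totals, matches = {}, {}
    for ref, other in zip(labels_ref, labels_other):
        totals[ref] = totals.get(ref, 0) + 1
        matches[ref] = matches.get(ref, 0) + (ref == other)
    return {c: 100.0 * matches[c] / totals[c] for c in totals}

# Made-up daily labels for two arrival heights (not data from this study).
upper = ["A", "A", "B", "B", "B", "C"]
lower = ["A", "B", "B", "B", "A", "C"]
shares = same_cluster_percent(upper, lower)
print({c: round(v, 1) for c, v in shares.items()})  # {'A': 50.0, 'B': 66.7, 'C': 100.0}
```

The same function applies unchanged when the two label lists come from two clustering techniques at the same arrival height.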

For the hierarchical clustering technique, the values estimated by the Somers' d test of significance were 0.31, 0.15 and 0.14 (Table 3) for the pairs of arrival heights (500, 100 m), (500, 10 m) and (100, 10 m), respectively. Values close to 0 indicate little or almost no relationship between the classifications and thus high sensitivity of the method (i.e. small percentage of days associated with the same trajectory cluster for different arrival heights).

In hierarchical clustering, 54–81% of cluster A trajectories arriving in Athens at one of the three specified heights (10, 100 or 500 m) would belong to the same cluster for the other two arrival heights. The highest percentage of common occurrence was found for the (100, 10 m) pair of arrival heights, indicating a very low sensitivity (Table 5). This method seems more sensitive for trajectories classified in cluster E (e.g. only 6% and 10% of cluster E trajectories arriving at 500 m were classified in the same cluster for 10- and 100-m arrival heights, respectively). The high sensitivity of this cluster could be explained by: (1) the fact that it is the fastest moving transport pattern, and (2) the fact that cluster E corresponds to a winter regime. During winter, the depth of the planetary boundary layer is generally smaller, yielding a decoupling of transport regimes associated with 100 m and 500 m arrival heights (Melas et al. 1998). Cluster F trajectories were found to be less sensitive, as 71% and 50% of them were classified in the same cluster for the arrival height pairs of 500 and 10 m and 500 and 100 m, respectively (Table 5). It is interesting to note that 66% of the trajectories attributed to cluster C (air masses coming from Russia) at 500 m are still attributed to the same cluster at 100-m arrival height. The reasons for the high sensitivity within this cluster are the speed of the transport pattern and the shallow boundary layer, as in cluster E.
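Table 5

Trajectories (%) classified in the same cluster for two different arrival heights

| Cluster | Hierarchical (500, 10 m) | Hierarchical (500, 100 m) | Hierarchical (100, 10 m) | K-means (500, 10 m) | K-means (500, 100 m) | K-means (100, 10 m) | SOM 2 × 3 (500, 10 m) | SOM 2 × 3 (500, 100 m) | SOM 2 × 3 (100, 10 m) |
|---|---|---|---|---|---|---|---|---|---|
| A | 54 | 56 | 81 | 83 | 86 | 97 | 80 | 72 | 97 |
| B | 15 | 37 | 7 | 54 | 56 | 92 | 56 | 67 | 76 |
| C | 30 | 66 | 10 | 70 | 78 | 81 | 47 | 70 | 62 |
| D | 23 | 23 | 10 | 37 | 39 | 86 | 10 | 61 | 69 |
| E | 6 | 10 | 35 | 8 | 7 | 88 | 53 | 63 | 64 |
| F | 71 | 50 | 14 | 24 | 27 | 88 | 64 | 75 | 67 |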

The K-means clustering technique presented low sensitivity at all three arrival heights. Values estimated by the Somers' d test of significance were 0.63 and 0.37 for the pairs of arrival heights of 500 and 100 m and 500 and 10 m and 0.88 for the pair 100 and 10 m, as shown in Table 3. For clusters A and C, the K-means technique was less sensitive to the arrival height. For cluster A, the percentage of trajectories that remained in the same cluster was 83–97%, while for cluster C this percentage ranged between 70 and 81% (Table 5). Trajectories in cluster E were very sensitive to the arrival height for the pairs of 500 and 10 m and 500 and 100 m, but they were almost insensitive for the pair of 100 and 10 m. It is interesting to note that for the arrival height pair of 100 and 10 m, 81–97% of the trajectories remained in the same cluster for all six transport regimes (A–F).

The SOM technique (1 × 6, 2 × 3) shows low sensitivity to all three arrival heights. The values estimated for the Somers' d test of significance ranged between 0.54 and 0.68 for SOM (2 × 3) (Table 3) and between 0.607 and 0.703 for SOM (1 × 6) (not shown).

The highest percentage values were detected for the 100 and 10 m pair (62–97% of trajectories remain in the same cluster for all transport regimes). Furthermore, this technique showed very low sensitivity (61–75%) for the pair of arrival heights of 500 and 100 m. Only cluster D trajectories appeared to be very sensitive to the arrival height for the pair of 500 and 10 m, while the opposite occurred for cluster A (Table 5).

Overall, the SOM clustering technique was less sensitive to the arrival height for clusters A and F, since 72–97% and 64–75% of the trajectories in clusters A and F, respectively, remained in the same cluster when using different arrival heights. By contrast, clusters C and especially D were found to be more dispersed and thus more sensitive to the arrival height (Table 5).
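The Somers' d values quoted in this section can be illustrated with a short sketch of the asymmetric statistic d(Y|X), computed as the number of concordant minus discordant pairs divided by the number of pairs not tied on X. This is an illustration under stated assumptions, not the authors' implementation: the integer coding of the cluster labels (A = 1, B = 2, and so on) and the example vectors are hypothetical.

```python
from itertools import combinations


def somers_d(x, y):
    """Somers' d of y with respect to x, i.e. d(Y|X).

    x and y are equally long sequences of ordinal codes, here integer
    codes of cluster labels (the coding is an assumption of this sketch).
    """
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    concordant = discordant = untied_x = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        if x1 == x2:
            continue  # pairs tied on x do not enter the denominator
        untied_x += 1
        product = (x1 - x2) * (y1 - y2)
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
        # pairs tied on y only count in the denominator
    if untied_x == 0:
        raise ValueError("x is constant, Somers' d is undefined")
    return (concordant - discordant) / untied_x


# Hypothetical integer-coded labels for six days at two arrival heights
codes_height_1 = [1, 1, 2, 2, 3, 3]
codes_height_2 = [1, 2, 2, 2, 3, 1]
d_value = somers_d(codes_height_1, codes_height_2)
print(d_value)
```

Values close to 1 indicate that the two classifications order the trajectories almost identically. Recent SciPy releases also provide `scipy.stats.somersd`, which may be preferable for large datasets.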

3.4.2 Application of three clustering techniques to the same arrival height

In this section, we examine whether a trajectory arriving in Athens at a specific height is classified in the same cluster when different clustering techniques are used.

At 500 m arrival height, the values estimated by the Somers' d test of significance between different clustering techniques ranged from 0.32 for the (hierarchical, K-means) pair to 0.54 for the (K-means, SOM 2 × 3) pair (Table 3). This indicates a higher consistency between the K-means and SOM (2 × 3) methods when applied to the same dataset. However, the percentage of trajectories classified in the same cluster for the (K-means, SOM 2 × 3) pair is lower at 500 m arrival height compared to the other two heights (Table 6).
Table 6

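Trajectories (%) classified in the same cluster using two different clustering techniques

| Cluster | 10 m: Hier, Kmeans | 10 m: Hier, SOM | 10 m: Kmeans, SOM 2 × 3 | 100 m: Hier, Kmeans | 100 m: Hier, SOM 2 × 3 | 100 m: Kmeans, SOM 2 × 3 | 500 m: Hier, Kmeans | 500 m: Hier, SOM 2 × 3 | 500 m: Kmeans, SOM 2 × 3 |
|---|---|---|---|---|---|---|---|---|---|
| A | 81 | 76 | 81 | 78 | 88 | 82 | 57 | 59 | 79 |
| B | 10 | 0 | 78 | 78 | 70 | 72 | 32 | 55 | 41 |
| C | 55 | 1 | 44 | 27 | 99 | 38 | 48 | 45 | 38 |
| D | 53 | 9 | 71 | 58 | 45 | 48 | 39 | 47 | 73 |
| E | 73 | 53 | 50 | 72 | 45 | 29 | 12 | 40 | 46 |
| F | 3 | 50 | 52 | 62 | 32 | 31 | 67 | 38 | 63 |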

The highest variability in cluster classification at 500 m arrival height was observed between hierarchical and K-means clustering, with a cluster overlap of 12–67%. For the (hierarchical, SOM 2 × 3) pair the respective percentage of trajectories ranged between 38 and 59%, while for the (K-means, SOM 2 × 3) pair it ranged between 38 and 79% (Table 6). Looking at particular clusters, the (K-means, SOM 2 × 3) pair was least sensitive for clusters A and D (79% and 73%, respectively), the (hierarchical, K-means) pair for clusters F and A (67% and 57%, respectively), and the (hierarchical, SOM 2 × 3) pair for clusters A and B (59% and 55%, respectively; Table 6).

At 100 m arrival height, the values estimated by the Somers' d test of significance between different clustering techniques ranged from 0.49 for the (hierarchical, SOM 2 × 3) pair to 0.63 for the (K-means, hierarchical) pair (Table 3). A relatively large percentage of the trajectories (32–99%) were classified in the same cluster when the hierarchical and SOM techniques were applied to the dataset of trajectories arriving at 100 m height. The respective percentages for the other two pairs of clustering techniques (i.e. K-means and SOM 2 × 3, and hierarchical and K-means) were generally lower, indicating a higher sensitivity of the classification to the clustering technique (Table 6).

For the (hierarchical, SOM 2 × 3) pair of clustering techniques, the highest percentage of overlap was found for cluster C trajectories, followed by cluster A, while the lowest was found for cluster F. For the (hierarchical, K-means) pair the highest values were obtained for clusters A and B (approximately 77%) and the lowest for cluster C (58%). Finally, when we applied the K-means and SOM techniques to trajectories arriving at 100 m height, the overlap ranged from 29% for cluster E to 82% for cluster A (Table 6).

Finally, we applied all three clustering techniques to the dataset of trajectories arriving at 10 m. The Somers' d values for the three pairs of clustering techniques ranged between 0.18 and 0.28 (Table 3). The highest value corresponded to the (SOM 2 × 3, hierarchical) pair and the lowest to the (hierarchical, K-means) pair, indicating high sensitivity of the latter pair of methods when applied to trajectories arriving at 10 m height.

When we applied K-means and SOM (2 × 3) to the 10-m arrival height dataset, 44–81% of the trajectories were classified in the same cluster. Again the highest percentage was obtained for cluster A, followed by cluster B, while the lowest value was for cluster C. For the pairs of (hierarchical, K-means) and (hierarchical, SOM 2 × 3), the classification depended greatly on the selected clustering approach, with the exception of cluster A (Table 6).
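A comparison of two clustering techniques at a fixed arrival height can be sketched as a cross-tabulation of the two label sequences, which also shows where the non-overlapping trajectories end up. The sketch assumes that the clusters of both techniques have already been matched to the same transport regimes (A to F), as in Table 6; the function names and the example labels are hypothetical.

```python
from collections import Counter


def cross_tabulate(labels_first, labels_second):
    """Count trajectory days for each (first label, second label) pair."""
    if len(labels_first) != len(labels_second):
        raise ValueError("both label sequences must cover the same days")
    return Counter(zip(labels_first, labels_second))


def overall_agreement(table):
    """Share (%) of all days that received the same label from both techniques."""
    total = sum(table.values())
    same = sum(n for (first, second), n in table.items() if first == second)
    return 100.0 * same / total


# Hypothetical labels for eight days at one arrival height
kmeans_labels = ["A", "A", "B", "B", "C", "C", "C", "D"]
som_labels = ["A", "A", "B", "C", "C", "C", "D", "D"]

table = cross_tabulate(kmeans_labels, som_labels)
for (first, second), count in sorted(table.items()):
    print(f"K-means {first} -> SOM {second}: {count}")
print(f"overall agreement: {overall_agreement(table):.1f}%")
```

The diagonal entries of the table correspond to the overlapping trajectories; the off-diagonal entries indicate which regimes the two techniques tend to confuse.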

3.5 Analysis of PM10 concentrations

The aim of this work is to present a tool to interpret atmospheric back trajectories based on the clustering technique used to classify them. In this section, we discuss the air quality management implications of our results in relation to particulate matter concentrations in ambient air (PM10). It must be noted that several papers dealing with back trajectories have stated that this technique could be used to estimate long-range transport of polluted air masses over a specific area (Dorling et al. 1992; Stohl et al. 2002; Borge et al. 2007; Grivas et al. 2008). This information could be very useful to the relevant authorities, since they could subtract the long-range contribution often related to natural sources (e.g. desert dust) from the PM10 concentrations measured in a city such as Athens in order to comply with EU air quality regulations (Vardoulakis and Kassomenos 2008). In terms of air quality management, the knowledge of the source and path of an air mass (represented by a cluster of back trajectories) could therefore be very useful. The critical question is whether the estimation of the long-range transport contribution to urban PM10 depends on the clustering approach used or on the arrival height above the ground specified in the back trajectory modelling.

Figure 3 presents the results per clustering technique and arrival height for the six clusters. Daily mean PM10 data were obtained from an urban background monitoring site located in central Athens. All clustering approaches and arrival heights indicate that cluster A (representing air masses coming from the Sahara desert and the Mediterranean south of Athens) was associated with the highest PM10 concentrations (with the exception of hierarchical clustering of back trajectories arriving at 500 m). In most cases, cluster F (representing air masses coming from the western Mediterranean, possibly enriched with sand from the Sahara and sea salt particulates) was associated with the second highest PM10 concentrations. Some discrepancies were observed in hierarchical clustering between 10 and 500 m arrival height, indicating a higher sensitivity of this approach. Cluster B (representing air masses coming from Central Europe and the northern Balkans) was associated with the third highest PM10 concentrations in almost all cases except hierarchical clustering at 500 m, followed by cluster D (air masses coming from Russia). Using the Tukey honestly significant difference test, we found that the PM10 differences between clusters B, C, D and E were not statistically significant at the 95% confidence level in most cases. On the other hand, PM10 differences between clusters A and F were found to be statistically significant at the 95% confidence level.
Fig. 3

Mean PM10 concentrations for the six clusters of back trajectories per clustering technique and arrival height
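The per-cluster means shown in Fig. 3 can be sketched as a simple grouping of daily mean PM10 values by the cluster label of the corresponding trajectory. The function name, the handling of missing days and the concentration values (in µg/m³) are assumptions made for illustration only and are not data from this study.

```python
from collections import defaultdict
from statistics import mean


def mean_pm10_by_cluster(cluster_labels, pm10_values):
    """Average the daily mean PM10 of all days assigned to each cluster."""
    if len(cluster_labels) != len(pm10_values):
        raise ValueError("one PM10 value (or None) is needed per day")
    grouped = defaultdict(list)
    for label, value in zip(cluster_labels, pm10_values):
        if value is not None:  # skip days without a valid measurement
            grouped[label].append(value)
    return {label: mean(values) for label, values in sorted(grouped.items())}


# Hypothetical cluster labels and daily mean PM10 values
labels = ["A", "A", "F", "F", "B", "B"]
pm10 = [60.0, 70.0, 50.0, None, 40.0, 44.0]

means = mean_pm10_by_cluster(labels, pm10)
ranking = sorted(means, key=means.get, reverse=True)
print(means)
print(ranking)
```

Repeating this calculation for each clustering technique and arrival height gives one group of means per panel, which can then be compared across approaches.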

Thus, we may conclude that although certain clusters appear to be associated with long-range transportation of particulates, it is not possible at this stage to directly attribute them to natural sources such as desert dust or sea salt (Grivas et al. 2008). The presented results may be seen as qualitative and preliminary, as more effort is needed to quantify the impact of long-range atmospheric transport on local PM10.

Factors influencing the relation between clusters and PM10 concentrations, such as travelling time, washout, deposition, partitioning of PM10, PM2.5 and PM1.0, chemical composition of the particulates (in their three infancies), as well as possible local effects, were not taken into consideration at this stage and are planned to be examined in forthcoming work.

4 Conclusions and future work

In this study, we investigated the sensitivity of cluster classification of atmospheric back trajectories to three commonly used statistical techniques (a hierarchical, a non-hierarchical and a neural network self-organizing map). The L-method was used to define the optimum number of clusters in our analysis, which included modelled kinematic back trajectories arriving in Athens at three different heights during a 4-year period. From this analysis, the following conclusions could be drawn:
  • The air masses affecting Athens have their origin mainly in the Sahara desert of North Africa, Central-Eastern Europe, Russia, north-western Europe, the mid-Atlantic and the western Mediterranean.

  • The hierarchical clustering technique seems to be very sensitive to the arrival height of the trajectories, for both fast and slow moving clusters. The K-means (non-hierarchical) clustering technique appeared to be less sensitive to the arrival height, while the variability of the occurrence of each circulation regime was small.

  • The neural network SOM clustering technique was found to be sensitive to the arrival height. Compared with the other two methods, it was found to be more sensitive than K-means and less sensitive than the hierarchical technique.

  • All three statistical methods seem to be sensitive to the arrival height of the trajectories, but the degree of sensitivity differs substantially. Hierarchical clustering showed the highest level of sensitivity for fast moving trajectories (C, D and E clusters, i.e. air masses originating from mid-Atlantic, NW Europe and east Russia) to the arrival height, followed by SOM. K-means was found to be the least sensitive clustering technique for this variable.

  • Slow-moving circulation regimes, especially those originating from North Africa or western Mediterranean (clusters A and F) did not show significant sensitivity to the arrival heights or to the clustering technique applied.

  • The low sensitivity of the clustering for the (10, 100 m) pair of arrival heights shows that the variability of the transport between 10 m and 100 m may not be established using the atmospheric back trajectories produced in this study. More work is needed to reach safer conclusions.

  • Meteorological fields at higher resolution should be used.

  • A range of clustering techniques should be preferably used when investigating atmospheric trajectories.

The use of modelled atmospheric trajectories and their classification into circulation regimes can play a significant role in studying air pollution in Athens, Greece, an area with a large number of exceedances of the daily PM10 concentration limit imposed by the European Union. It may also contribute to the quantification of local and long-range transport contributions (natural or anthropogenic) to the PM10/PM2.5 concentrations recorded in Athens. The quantification of these influences is still an open issue, not only for Athens but also for other major cities in southern Europe, and can be further investigated using back trajectory modelling. It is recommended that a range of arrival heights above the ground is used in such applications. Finally, excessive reliance on one particular clustering technique for data analysis and interpretation should be avoided.

Notes

Acknowledgements

The authors gratefully acknowledge the NOAA Air Resources Laboratory (ARL) for the provision of the FNL-HYSPLIT data, the HYSPLIT transport and dispersion model and the READY website (http://www.alr.noaa.gov/ready.htm) used in this work. The authors would also like to thank the two anonymous reviewers for their valuable and constructive suggestions that improved this work substantially.

References

  1. Artíñano B, Querol X, Salvador P, Rodríguez S, Alonso DG, Alastuey A (2001) Assessment of airborne particulate levels in Spain in relation to the new EU-Directive. Atmos Environ 35:S43–S53
  2. Borge R, Lumbreras J, Vardoulakis S, Kassomenos P, Rodriguez E (2007) Analysis of long-range transport influences on urban PM10 using two-stage atmospheric trajectory clusters. Atmos Environ 41:4434–4450
  3. Cape JN, Methven J, Hudson LE (2000) The use of trajectory cluster analysis to interpret trace gas measurements at Mace Head, Ireland. Atmos Environ 34:3651–3663
  4. Dorling SR, Davies TD (1995) Extending cluster analysis-synoptic meteorology links to characterize chemical climates at six northwest European monitoring stations. Atmos Environ 29:145–167
  5. Dorling SR, Davies TD, Pierce CE (1992) Cluster analysis: a technique for estimating the synoptic meteorological controls on air and precipitation chemistry. Atmos Environ 26:2575–2581
  6. Draxler RR, Hess GD (1998) An overview of the HYSPLIT 4 modelling system for trajectories, dispersion and deposition. Aust Meteorol Mag 47:295–308
  7. Everitt B (1980) Cluster analysis. Halstead, New York, p 136
  8. Grivas G, Chaloulakou A, Kassomenos P (2008) An overview of the PM10 pollution problem, in the Metropolitan Area of Athens, Greece. Assessment of controlling factors and potential impact of long range transport. Sci Total Environ 389:165–177
  9. Johnson SC (1967) Hierarchical Clustering Schemes. Psychometrika 2:241–254
  10. Jorba O, Perez C, Rocadenbosch F, Baldasano J (2004) Cluster Analysis of 4-Day back Trajectories arriving in the Barcelona area, Spain, from 1997 to 2002. J Appl Meteorol 43:887–900
  11. Kanamitsu M (1989) Description of the NMC Global Data Assimilation and Forecast System. Weather Forecasting 4:335–342
  12. Kassomenos P (2003a) Anatomy of the synoptic conditions occurring over southern Greece during the second half of 20th century. Part I. Summer and Winter. Theor Appl Climatol 75(1–2):65–77
  13. Kassomenos P (2003b) Anatomy of the synoptic conditions occurring over southern Greece during the second half of 20th century. Part II. Spring and Autumn. Theor Appl Climatol 75(1–2):79–92
  14. Kocak M, Mihalopoulos N, Kubilay N (2007) Contributions of natural sources to high PM10 and PM2.5 events in the eastern Mediterranean. Atmos Environ 41:3806–3818
  15. Kohonen T (2001) Self organizing maps. Springer
  16. Kolehmainen M, Martikainen H, Hiltunen T, Ruuskanen J (2000) Forecasting air quality parameters using hybrid neural network modelling. Environ Monit Assess 65:277–286
  17. Kolehmainen M, Martikainen H, Ruuskanen J (2001) Neural networks and periodic components used in air quality forecasting. Atmos Environ 35:815–825
  18. Leavey M, Sweeny J (1990) The influence of long range transport of air pollutants on summer visibility at Dublin. Int J Climatol 10:191–201
  19. McQueen JB (1967) Some methods for classification and Analysis of Multivariate Observations. Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, vol. 1. University of California Press, Los Angeles, pp 281–297
  20. Melas D, Ziomas I, Klemm O, Zerefos CS (1998) Anatomy of the Sea breeze circulation in Athens area under weak large-scale ambient winds. Atmos Environ 32:2223–2237
  21. Mihalopoulos N, Stephanou E, Kanakidou M, Pilitsidis S, Bousquet P (1997) Tropospheric aerosol ionic composition in the Eastern Mediterranean region. Tellus B Chem Phys Meteorol 49:314–326
  22. Moody JL, Oltmans SJ, Levy H II, Merrill T (1995) Transport climatology of tropospheric ozone. Bermuda, 1988–1991. J Geophys Res 100:7179–7194
  23. Newell R, Thuret V, Cho J, Stoller P, Marenco A, Smit H (1999) Ubiquity of quasi-horizontal layers in the troposphere. Nature 398:316–319
  24. Rolph GD (2003) READY: Real time Environmental Applications and Display system. NOAA Air Resources Laboratory (http://www.arl.noaa.gov/ready.html)
  25. Salvador S, Chan P (2005) Learning States and rules for detecting anomalies in Time Series. Appl Intell 23:241–255
  26. Schädler G, Sasse R (2006) Analysis of the connection between precipitation and synoptic scale processes in the Eastern Mediterranean using self-organizing maps. Meteorol Z 15(3):273–278
  27. Schlink U, Herbarth O, Richter M, Dorling S, Nunnari G, Cawley G, Pelikan E (2006) Statistical models to assess the health effects and to forecast ground-level ozone. Environ Model Softw 21:547–558
  28. Stohl A, Eckhardt S, Forster C, James P, Spichtinger N, Seibert P (2002) A replacement for simple back trajectory calculations in the interpretation of atmospheric trace substance measurement. Atmos Environ 36:4635–4648
  29. Sinnott RW (1984) Virtues of the Haversine. Sky Telescope 68:159
  30. Vardoulakis S, Kassomenos P (2008) Comparison of factors influencing PM10 levels in Athens (Greece) and Birmingham (UK). Atmos Environ 42:3949–3963
  31. Wernli H, Davies H (1997) A Lagrangian-based analysis of extratropical cyclones. I: the method and some applications. Q J R Meteorol Soc 123:467–489
  32. Yao CS (1998) A loading correlation model for climatic classification in terms of synoptic climatology. Theor Appl Climatol 61:113–120

Copyright information

© Springer-Verlag 2009

Authors and Affiliations

  • P. Kassomenos (1)
  • S. Vardoulakis (2)
  • R. Borge (3)
  • J. Lumbreras (3)
  • C. Papaloukas (4)
  • S. Karakitsios (4)
  1. Department of Physics, Laboratory of Meteorology, University of Ioannina, Ioannina, Greece
  2. Public and Environmental Health Research Unit, London School of Hygiene and Tropical Medicine, London, UK
  3. Department of Chemical and Environmental Engineering, Technical University of Madrid (UPM), Madrid, Spain
  4. Department of Biological Applications and Technology, University of Ioannina, Ioannina, Greece
