Preferred intra-seasonal circulation patterns of the Indian summer monsoon and active-break cycles

Intra-Seasonal circulation regimes are identified from a cluster analysis of 5-day mean anomaly fields of 850 hPa horizontal winds from the ERA-Interim reanalysis for the boreal summer season (June–Sept. for 1979–2018) over the region (50°–100° E; 5° S–35° N). The k-means method was applied to the leading 6 principal components yielding k clusters. The degree of clustering is significant compared to synthetic data sets for any value of k>3\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$k > 3$$\end{document}. The circulation is most likely to stay in the same cluster from one pentad to the next; significant transitions (with 95% confidence level) form a cycle. The similarity between the cycle depicted from 4 or 5 clusters and the active-break cycle, as well as the 45-day oscillation, is established by composites of 850 hPa winds, 200 hPa divergence, 500 hPa vorticity and vertical pressure velocity, precipitable water, diabatic heating and rainfall over India: Strong convection over the subtropical Indian Ocean moves to the central Bay of Bengal and central India, subsequently to the northern Bay of Bengal and west Bengal, and then further north into the Himalayas. We also find preferred transitions in which the convection moves equatorward from central India. The number of complete cycles found in 40 summers is 7 in the 4-cluster analysis. The number of times the system undergoes four (three) consecutive legs of the cycle is 16 (31). For 5 clusters only 3 complete cycles are found. sequences of five, four and three consecutive legs occur 10, 11 and 28 times, respectively.


Introduction
The atmospheric circulation associated with the Indian monsoon has a very rich structure on intra-seasonal time scales. Fluctuations in the wind field over the broad Indian region strongly modulate the occurrence of synoptic systems and the resulting rainfall, so that prolonged episodes (several days to one week) of strong rainfall (and of no rainfall) are observed over central India and surrounding regions (Murakami et al. 1984;Annamalai and Slingo 2001;Rajeevan et al. 2010) Since this well-known "active-break cycle" has a large impact on agriculture (Prasanna 2020) the fidelity with which weather and climate models treat this cycle (and the circulation in which it is embedded) is clearly of concern (Pattanaik et al. 2020). These active and break periods are often characterized by compositing fields based on their time lagged relationship with a single, regionally averaged index of extreme rainfall, as in Rajeevan et al. (2010).
This cycle is embedded in a complex of larger scale intraseasonal fluctuations which extend over the broad Indian Ocean-western Pacific region. There is general agreement that two broad-band oscillations, the 30-60 day boreal summer intra-seasonal oscillation, or BSISO (Murakami et al. 1984;Lau and Chan 1986;Kemball-Cook and Wang 2001;Annamalai and Sperber 2005;Hazra and Krishnamurthy 2014) and the 10-20 day (quasi bi-weekly, or QBW) oscillation (Krishnamurthi and Bhalme 1976;Krishnamurthi and Ardunay 1980;Chatterjee and Goswami 2004) dominate this variability. In order to identify these oscillations and determine their character, a number of complex spatio-termporal spectral analysis techniques have been used, including extended empirical orthogonal functions (Suhas et al. 2013;Sahai et al. 2015), multi-channel singular spectrum analysis (Krishnamurthy and Shukla 2000, 2007, 2008, principal oscillation patterns (Annamalai and Slingo 2001) and cyclostationary empirical orthogonal functions (Annamalai and Sperber 2005).
Comparisons of the BSISO in simulations (Neena et al. 2017) and forecasts (Lee et al. 2015) with those from reanalyses have been carried out in the EEOF framework, although this process is somewhat cumbersome.
In contrast to the local index-based approach used to diagnose the active-break cycle, and the complex mode techniques used to isolate the BSISO and QBW, this article analyzes the large-scale variability of the circulation field, focusing on circulation patterns ("regimes") which are by some measure preferred states of the system. This approach can be implemented via the general technique of cluster analysis, which while very often used in studies of midlatitude intra-seasonal variability (see Straus et al. 2017), has not been widely applied to the Indian monsoon circulation. Chu et al. (2017) use self-organizing maps (a form of cluster analysis) to characterize the intra-seasonal variability of outgoing long-wave radiation and low-level zonal winds in the very broad region extending from East Africa to the western Pacific, and are able fto identify both stationary and propagating patterns. The purpose of this article is to explore the structure of preferred circulation patterns over the Indian region (50°-100° E, 0°-35° N) and to relate transitions between them to the active-break cycle over India. This is meant to provide a framework in which the model forecasts (and simulations) can be evaluated with regard to the both the circulation regimes and, via their transitions, to the active-break cycle. Recent interest in the fidelity of predictions and simulations of the active-break cycle is evidenced in the work of Neena et al. (2017) and Pattanaik et al. (2020).
Cluster analysis partitions every state of the atmosphere (whether daily or time-filtered anomalies) between a small number of clusters on the basis of a common structure. The average of all states in a cluster defines a centroid (also called a circulation regime). The complex evolution of the circulation is "coarse-grained" and so is described by the residence time within a regime and the probability of transition into another regime. Active-break cycles within this approach emerge as preferred transition paths between regimes. A fuller picture of the cycle and its components is available by compositing a number of variables over states assigned to each regime.
In Sect. 2, the data set and the analysis methods are described. Methods of estimating statistical significance are described in Sect. 3, which can be skipped by the reader without loss of continuity. Section 4 gives a number of composite fields over the broad South Asian region for the characteristic states (centroids) from each cluster for a 4-cluster analysis and a 5-cluster analysis, while Sect. 5 describes the composite diabatic heating over the global tropical belt. The frequency of occurrence and lifetime of the regimes, as well as evidence for preferred transition paths between them, are given in Sect. 6. A Discussion is given in Sect. 7, and the Conclusions in Sect. 8.

Data
Once-daily fields of the horizontal wind (u, v) at 200, 500 and 850 hPa and the vertical pressure velocity ( ) at 500 hPa were obtained from the ERA-Interim reanalysis (Dee et al. 2011) for the 120-day period starting 01June for the 40 years 1979 through 2018. These data were interpolated from the original (512 × 256) (lon × lat) Gaussian grid to a (128 × 64) (lon × lat) Gaussian grid. 24 consecutive five-day means (pentads) of the winds, divergence (D), vorticity ( ) and vertical pressure velocity ( ) were formed for use in the cluster analysis and composites.
Additionally, four-times daily data of the horizontal winds (u, v), temperature T, and pressure vertical velocity were obtained from the ERA5 reanalysis (Hersbach et al. 2020) at 37 pressure levels for the purpose of estimating the diabatic heating. These fields were interpolated from the original 0.25° grid to a (256 × 128) (lon × lat) Gaussian grid prior to calculating the heating. The diabatic heating, whose calculation is described in Sect. 2.3 was vertically integrated into nine layers, and pentads formed.
Daily Indian precipitation data for June through September of the years 1979-2018 were obtained from the India Meteorological Department (Pai et al. 2014). This dataset, which has been updated regularly, covers only the Indian land region on a 0.25° (lon × lat) grid. The daily data were averaged into pentads, as described above. For all the fields described above, the seasonal cycle was computed by projecting the seasonal time series of pentads (for each grid point and year) onto the first three Legendre polynomials, equivalent to a parabola in time, as in Straus (1983). The polynomial coefficients were then averaged over all 40 years, and a single smooth seasonal cycle reconstructed. Anomalies were obtained by removing this seasonal cycle.

Cluster analysis
The goal of the cluster analysis is to describe the slowly varying, large-scale circulation field over the Indian region in terms of a relatively few characteristic patterns. Although it is possible to apply a cluster analysis to the raw pentad fields of (u, v) directly, it is convenient to pre-filter the data using a principal component analysis, which provides a set of patterns (the empirical orthogonal functions or EOFs), and their time coefficients (the principal components, or PCs). The PCs provide a new representation of the data set in a reduced-dimensional coordinate space. It is in this space that the cluster analysis is carried out.
We apply a principal component analysis to the vector field of (u, v) at 850 hPa over the wide Indian region (50.63°-98.44° E, 1.40 S°-34.88° N). The leading 6 (12) modes explain 65% ( 79% ) of the total space-time variance. Since the variance explained drops off rapidly with mode number, we retain only 6 modes in the cluster calculations to be described , although we have verified that the composite circulation and rainfall for each cluster are nearly unchanged when 12 modes are retained.
The particular method of cluster analysis used is the k-means method (Deday and Simon 1976;Desbois et al. 1982;Michelangeli et al. 1995;Straus et al. 2017), which takes as input the leading principal component time series, and seeks to approximately identify regions in this phase space in which the local density is higher than would be expected based on a purely multi-normal distribution. Given a number k of clusters to be sought, the method assigns each observed state (pentad) uniquely to one of the k clusters, each consisting of points"bunched" together in the phase space. Given these assignments, the average PC coordinates of all states in a given cluster (the so-called "centroid") define a single set of PC coefficients which corresponds (using their respective EOF patterns) to a unique pattern, termed a circulation regime. The criterion that determines how the states are assigned is a global one, based on a single parameter R, called the variance ratio, which is defined as the ratio of the average sum of squares of the centroid coordinates (weighted by the cluster population) to the average intra-cluster variance. This ratio is used both to monitor the convergence of the algorithm, and in the assessment of its significance. Technical details of the k-means clustering algorithm and more interpretation of the results can be found in Straus et al. (2017).
While each individual circulation regime consist of a linear combination of the leading EOFs, the regimes are not constrained to be symmetric in sign: a regime pattern often does not have the opposite pattern represented in the set of k regimes. This is in distinct contrast to the output of principal component analysis, which describes modes which are equally likely to have either sign.
It should be pointed out that there is no unique method of choosing what k, the number of regimes, should be. Retaining too many regimes generally leads to very poor reproducibility to sub-sampling, while retaining too few regimes may obscure important distinctions in local behavior. In this paper we report results for both k = 4 and k = 5 regimes.
The cluster analysis yields an assignment of each 5-day period to one of k regimes. Composites of the full (not truncated) fields of anomalies of low-level (850 hPa) circulation u, v, , D, and rainfall over each regime are examined. The frequency of transitions between regimes, and the possible sequences of regimes, together give a coarse-grained approximation to trajectories in phase space.

Diabatic heating
The diabatic heating rate Q is estimated from the fourtimes daily ERA5 data at 37 pressure levels as a residual in the thermodynamic equation. The heating is evaluated on the (256 × 128) (lon × lat) Gaussian grid, using a residual method as reported in Swenson and Straus (2021): The heating rate Q is obtained from: Here is potential temperature, p is pressure, = dp dt , c p the specific heat at constant pressure per unit mass, the horizontal velocity field, and Π = p 0 p , with p 0 = 1000 hPa and = R c p , R being the gas constant. Q is transformed from the grid representation to spherical harmonics, truncated to a triangular (T159) representation, and transformed back to the grid.

Significance of results
The assessment of significance of the cluster analysis is a global one, applying to the entire assignment of clusters, and not the local density in the PC space. The idea is that if the time series of one the PCs of the observed states (assumed to have Gaussian statistics) is statistically independent of the time series of all other PCs, then any maximum in probability density not at the origin (identifying a cluster centroid) can only occur due to sampling error: in the limit of a large sample, the joint probability density of all the PCs should be a multi-normal distribution (no clustering). Based on this idea, surrogate stochastic Gaussian time series are constructed for each individual PC, with the same length as the observed time series. The surrogate time series are constructed in such a way that the time lag-correlation statistics of each observed PC are preserved, but the time series of different PCs are independent (Christiansen 2007). When the clustering algorithm is applied to the surrogate time series, it will always give a set of cluster assignments (for a given k) and a variance ratio R, but the apparent clustering can only be due to the finite size of the sample. Repeating the application of the cluster analysis on 100 independently generated sets of surrogate time series, we determine how many times the value of R exceeds the observed value. The corresponding fraction yields the p-value, so that if = 0.05 then we are 95% confident that the clustering of the observed time series is not due only to sampling error. We have carried out this analysis for a range of k, from 2 to 6, and find that for k = 4,5, or 6 the confidence level is 99%, indicating very little chance that the degree of clustering in the original data is due to sampling error. Common to many methods of clustering, the final choice for the value of k is to some extent arbitrary (Christiansen 2007). We used both k = 4 and 5 in our analysis, but focused on the choice of 5 clusters, as explained in Sect. 4. We should note that for highly skewed data, Christiansen (2007) has shown that k-means algorithm may incorrectly identify multiple clusters. In our case this is not a concern: the leading two PCs have small skewness (− 0.143 and 0.125).
In order to test the robustness of the cluster centroids to sub-sampling, we repeated the cluster analysis using 1000 randomly chosen half-length data sets. For each such halflength dataset, the variance ratio R was computed and the resulting histogram formed into a probability distribution (not shown). The distribution is sharply peaked at a value near 1.01 for k = 4 and k = 5 , and at a slightly higher value for k = 6 . Another measure of the agreement between the clusters in the half-length data sets and those in the full data set is the pattern correlation between the cluster centroids. For each half-length data set and value of k, each cluster centroid was matched to one of the k centroids in the full data set based on pattern correlation. The k values of correlation were then averaged (using the Fisher transformation) to assign a single value to that half-length data set. In all cases almost the entire distribution lies above correlation values of 0.8, with a small tail below this for k = 4 and k = 6.
Preferred transition paths between the clusters are assessed with the use of the transition matrix, whose ith row and jth column gives the number of times the system makes a transition between cluster i and cluster j. The off-diagonal elements define transitions, whose statistical significance is determined by comparison to the results of a large number (here 1000) of scrambled data sets. In each scrambled data set the cluster assignments are resampled using a bootstrap with replacement technique: the cluster assignment of each state (pentad p of year x) is replaced with that of a randomly chosen pentad q and year y. If a block of n pentads q + 0, q + 1, ...q + n − 1 of year y are all assigned to the same cluster, then the pentads associated with the entire block of n days are assigned to pentads p, p + 1, ..., p + n − 1 of year x.
The significance of the composite anomalies of the vorticity, divergence, vertical velocity and diabatic heating were assessed with a two-sided t-test. Significance is indicated for all anomalies exceeding 95% confidence.

Circulation regimes
Composites of a number of anomaly fields are presented here for the cases of k = 4 and k = 5 . The 850 hPa horizontal wind field (u, v), shown using the full anomalies (not truncated in PC-space), is helpful in indicating the location of anomalies in the Monsoon trough. This is also indicated by the 500 hPa vorticity . Both the 200 hPa divergence D and the 500 hPa vertical pressure velocity indicate regions of anomalous upward motion associated with convection. Finally, we show the rainfall anomalies (available only over the Indian land region).

Four regimes
Composites for k = 4 are shown in Figs. 1, 2, and 3. The composite figures are ordered (from (a) through (d)) to correspond to a preferred cyclical transition path In Fig. 1, Cluster (a) shows low-level cyclonic anomalies over the equatorial Indian Ocean associated with enhanced upper-level divergence (indicative of deep convection), and a suppression of the Monsoon trough over India itself (with Easterly anomalies) associated with a negative divergence anomaly in the Bay of Bengal. The mid-level and anomalies for the same cluster, shown in Fig. 2a, indicate enhanced vorticity and upward motion over the equatorial Indian Ocean, with suppressed vorticity and downward motion over India and the Bay of Bengal. Consistent with this, the corresponding rainfall anomalies over India, shown in Fig. 3 show suppressed rainfall over the Western Ghats and over the Indo-Gangetic plain.
Cluster (b) presents a sharp contrast: the Monsoon trough is now enhanced, with positive upper-level divergence anomalies over the Bay of Bengal and central India (Fig. 1b), associated with enhanced mid-level vorticity and upward motion (Fig. 2b) and strongly enhanced rainfall over the Western Ghats and over central India (Fig. 3b). The enhanced rainfall over the west coast is consistent with the enhanced on-shore low-level winds, seen in Fig. 3.
In cluster (c), the enhanced cyclonic low-level anomaly (and the accompanying upper-level divergence) are now located further to the northeast compared to their positions in cluster (b), and are also weaker (compare Fig. 1b and c). Enhanced mid-level vorticity (along with rising motion) is seen over the northern Bay of Bengal, with a large area of sinking motion and decreased vorticity further to the southwest (Fig. 2c). As expected from the circulation anomalies, the rainfall is modestly enhanced over northeastern India but is suppressed over central India (Fig. 3c).
Finally, in cluster (d) the enhanced low-level cyclonic anomaly and associated upper level divergence (and rising motion) have moved further northward to the eastern Himalayan mountain range (see Figs. 1d and 2d), while anomalous sinking motion and negative vorticity are seen over the Bay of Bengal and across much of India into the Arabian Sea. Over the tropical Indian Ocean and southern India, increased convection, with upper level divergence and rising motion are apparent in Figs. 1d and 2d. The rainfall anomalies (Fig. 3d) are consistent with the circulation, with suppressed rainfall over much of central India, but enhanced rainfall in the Himalayas and over southern India.

Five regimes
Composites for k = 5 are shown in Figs. 4, 5, and 6. The composite figures are ordered (from (a) through (e)) to correspond to a preferred cyclical transition path (a) → (b) → (c) → (d) → (e) → (a) which will be discussed in Sect. 6.
The low-level circulation and upper-level divergence anomalies for cluster (a) shown in Fig. 4a are very similar to cluster (a) for k = 4 (compare Fig. 1a), although the center of upper-level divergence over the tropics is displaced southwards slightly in Fig. 4a, while a small area of positive divergence is seen over the eastern Himalayas. These  The cluster (a) rainfall composite for k = 5 shown in Fig. 6a indicates suppressed rainfall over central India and the western Ghats with enhanced rainfall over the Himalayas and south central India, in rough agreement with cluster (a) for k = 4 (Fig. 3a), although the positive anomalies in the k = 5 cluster are clearly stronger. Cluster (b) (Fig. 6b) shows a weak enhancement of rainfall over central India and only a small remnant of the dryness which has moved northeastward (compared to cluster (a)). In cluster (c), strong positive anomalies of rainfall appear over central India and the western Ghats, with a suppression seen over the eastern Himalayas (Fig. 6c), very similar to cluster (b) for k = 4 (Fig. 3b). The positive rainfall anomalies weaken and move northeastward in cluster (d) (Fig. 6d), matching cluster (c) for k = 4 (Fig. 3c). At the end of the sequence, cluster e (Fig. 6e) shows widespread suppression of rainfall over both central India and the western Ghats, and a small area of enhanced rainfall over the eastern Himalayas, a pattern very similar to that seen for k = 4 in cluster (d) (Fig. 3d).

Diabatic heating
Composites of the anomalies of diabatic heating (vertically integrated from 1000 to 50 hPa) are given for the four clusters ( k = 4 ) over the global tropical belt in Fig. 7. Again we shall discuss the clusters in the framework of a cyclical transition path (a) → (b) → (c) → (d) → (a).
In cluster (a), prominent tropical Indian Ocean heating anomalies to the south of the equator are connected to a northwest-to-southeast ("tilted") band of heating extending from the Arabian Sea, across southern India into the southern Bay of Bengal and further east over the Maritime continent. A parallel, similarly oriented band of cooling lies about 15 degrees further to the north. Heating anomalies are also seen further north over the Himalayas. The tilted band of heating moves northward in cluster (b) so that it covers central and northern India, and extends southeastwards into the tropical western Pacific. The Himalayan area of heating in cluster (a) is replaced by cooling in cluster (b), as is the tropical Indian Ocean heating. The tilted heating bands moves further northward in cluster (c), reaching the Himalayas in cluster (d), in which heating is again seen in the tropical Indian Ocean. Generally bands of heating are separated by bands of cooling. Of interest is the alternating areas of cooling (in clusters (a) and (b)) and heating (clusters (c) and (d)) over the eastern tropical Pacific.
Heating anomalies for five clusters are shown in Fig. 8, labeled (a) through (e) as in Sect. 4.2. Cluster (a) is very similar to cluster (a) for the 4 cluster case (Fig. 7), while cluster (d) and (e) for 5 clusters are very similar to clusters (c) and (d) for 4 clusters, respectively. It is clusters (b) and (c) in the 5-cluster case which are merged into cluster (b) in the 4-cluster case (compare Figs. 7b and 8b). The far eastern Pacific cooling noted in the 4-cluster case is seen in clusters (a) and (b) in Fig. 8, whereas the previously noted eastern Pacific heating is seen in clusters (d) and (e).

Regime lifetimes
The average lifetime in a cluster (regime) is given for each of the k = 4 clusters in Table 1 and for each of the k = 5 clusters in Table 2, in both pentads and the corresponding number of days. Also given is the total time the system resides in each cluster. The average regime lifetime for k = 4 is 8.6 days, while that for k = 5 is 7.8 days. For k = 4 the system spends 58% of time in regimes (b) and (c), the regimes which show  Fig. 1b and c. The remaining 40% is divided nearly evenly between the regime with divergence closer to the tropics or over the Himalayas (Fig. 1a and d.) Considering 5 clusters, the system spends 61% of time in regimes (b)-(d) with upper-level divergence over the Bay of Bengal (Fig. 4b through d). The remaining two regimes (a) and (e) with rising motion to the north and south of India (Fig. 4a and e) are occupied for the remaining 39% of the time. Table 3 presents the total number of transitions between individual clusters for k = 4 . The number of transitions between cluster i and cluster j is given in the ith row and jth column. The rows and columns are labeled (a) through (d) to correspond to the composites of Figs. 1, 2 and 3. The diagonal elements give the number of pentads for which the system stayed in the same cluster, and this is largest for clusters (b) and (c). The off-diagonal elements are generally smaller, and give the number of transitions. Bold values are significant at the 95% confidence level using the boot-strap method detailed in Sect. 3. Significant transitions include the complete cycle: (a) → (b) , (b) → (c) , (c) → (d) , and (d) → (a) . In addition, the number of transitions from cluster (c) → (b) is not only significant, but is also larger than the (c) → (d) leg of the complete cycle. Table 4 shows the number of episodes during which 3, 4 or 5 legs of the cycle occur consecutively, and the average length of each episode. Such episodes are counted as long as residence in one of the clusters is followed (after one or more pentads) by residence in the next cluster in the cycle. All possible sequences within the entire cycle are counted; for example the entry in the table for 3  legs includes

Regime transitions
The entire cycle from (a) through (d) back to (a) occurs only 7 times in the 40 years; 3 consecutive legs occur on average for over three quarters of the years. The average cycle length for the entire cycle is about 48 days. Table 5 presents the total number of transitions between individual clusters for k = 5 , in the same form as Table 3. The significant transitions encompass the complete cycle: In addition to the transitions encompassing the cycle, one transition in the opposite direction ( (b) → (a) ) is seen. This transition will be discussed further in Sect. 7. Table 6 shows the number of episodes during which 3, 4, 5 or 6 legs of the cycle occur consecutively, and the average length of each episode. The episodes are counted in a way analogous to those in Table 4.

Five clusters
The entire cycle from (a) through (e) back to (a) occurs only 3 times in the 40 years, with an average length of 42 days. Subsets of the cycle occur increasingly more often as the number of required legs of the cycle decreases; 3 legs occur 70% of the time, taking about 23 days.

Comparison of 4 and 5 clusters
In general, the cyclical sequence of states for k = 5 from The states in the k = 4 sequence seem to occur at slightly earlier times than the corresponding ones in the k = 5 sequence. For example, the appearance of rising motion and upper-level divergence over the Himalayas appear at the "end" of the cycle using 4 clusters, but appear both at the "end" of the k = 5 cycle and the "beginning" of the next one. Thus we see slightly stronger rising motion over the Himalayas in Fig. 5a compared to 2a, as well as the weaker rising motion over the equatorial Indian Ocean in in Figs. 5e compared to 2d. Remnants of positive rainfall anomalies over the Himalayas and south India are seen in the k = 5 cluster (a) composite (Fig. 6a) which appear at end of the "previous" cycle for the k = 4 cluster (d) (Fig. 3d).

Relationship to active/break episodes and intra-seasonal oscillations
The sequences of rainfall composites presented here can be compared to composites of the active and break episodes presented by Rajeevan et al. (2010) (hereafter R). These authors define the dates of these isolated episodes on the basis of rainfall anomalies averaged over the Monsoon core region exceeding ±1 (standard deviation) for 3 consecutive days, using a high resolution data set of precipitation over India. The active phase lag-0 composite of R, showing large positive rainfall anomalies over the Monsoon core region of India, over the western Ghats, along with suppressed rainfall over southern India and to the northeast over the Himalayas, is very similar to cluster (c) for k = 5 (Fig. 6c), as well as with cluster (b) for k = 4 (3b). The break composite presented in R shows completely the reverse pattern, similar to the k = 5 cluster (a) (and to some extent cluster (e)) in Fig. 6, and cluster (d) for k = 4 (Fig. 3d). It should be noted that the longer life-span of break phases compared to active phases found by R is not reflected in the average persistence times seen for the clusters. For example, Table 2 shows that for k = 5 cluster (c) (active phase) has the same average persistence (7.8 days) as clusters (a) and (e), which encompass the break phase. Krishnamurthy and Shukla (2008), herafter KS8, show rainfall composites based on 8 phases of both the 45-day oscillation and 28-day oscillation obtained from application of the MSSA technique to observed outgoing long-wave radiation. The composites for phases 7 and 8 of the 45-day oscillation (from Fig. 4 of KS8), with large negative OLR over India, correspond well to the circulation in regime (c) for k = 5 shown in Figs. 4 and 5. The opposite phases of KS8 (1 and 2) correspond to the circulation in regime (a) in Figs. 4 and 5. The correspondence of our results with the 28-day oscillation of KS8 is less clear. This oscillation shows a dipolar structure (in latitude) of OLR over the Indian longitudes, in contrast to the more tripolar patterns seen in both the 45-day oscillation and our cluster composites. The association of our results with the 45-day oscillation is consistent with the average length of time it takes for the entire preferred sequence of k = 4 or k = 5 clusters to complete, 42-48 days, with partial sequences completing in about 34 days (Tables 4 and 6).
The phases of the 45-day oscillation in diabatic heating, shown in Hazra and Krishnamurthy (2014) also map onto the 5 clusters shown in Fig. 8, with cluster (a) corresponding to phases 2 and 3 of the 45-day oscillation, cluster (c) corresponding to phase 6, and cluster (e) to phases 8 and 1. Note that as the strong equatorial Indian Ocean (EIO) vertically integrated heating in cluster (a) gives way to EIO diabatic cooling in cluster (c), the negative precipitation anomalies seen over India in Fig. 6 for cluster (a) become positive anomalies for cluster (c). This association with lack of convection in the EIO with positive Indian Monsoon rainfall is also seen on the inter-annual time scale, as discussed by Gadgil et al. (2004).

Mechanisms of transitions between clusters
The mechanisms of northward and eastward propagation seen in the BSISO have been discussed extensively in the literature in terms of Rossby wave dynamics and air-sea interaction (e.g. Kemball-Cook and Wang 2001;Zhang et al. 2018;Hazra and Krishnamurthy 2018). Hazra and Krishnamurthy (2018) emphasize the role of moisture preconditioning, in which large-scale circulations create positive moisture anomalies leading to convection, followed by and negative moisture anomalies before suppressed conditions.
In order to indicate the moisture availability in the set of 5 clusters, we show the precipitable water (PW) for the five clusters in Fig. 9. The northeastward propagation of moist anomalies, closely tracking the upper level divergence seen in Fig. 4, is very clear. Figure 10 shows the changes in PW just prior to each of the six significant transitions indicated in Table 5. For each regime transition, the figure shows the difference between the PW for the last pentad prior to the transition and the mean PW of the regime being left. For example, the PW difference in regime (a) five days prior to transitions to regime (b) (seen in Fig. 10a) shows an enhancement of moisture over the Arabian Sea and Bay of Bengal north of the area of large PW in regime (a). The enhancement leads to a northward movement of convection and corresponding areas of high PW as the system moves to regime (b). The change in PW prior to the transition from (b) to (c) consists of more of a strengthening and widening of the existing area of high PW, rather than a migration. The strengthening is consistent with the difference in moisture between regimes (b) and (c). The remaining transitions (c) to (d), (d) to (e) and (e) to (a) are all presaged by a movement of the PW anomalies.
The PW prior to the transition from (b) back to (a), which is not part of the active-break cycle, is shown in panel (f) of Fig. 10. A strong drying is seen extending from the Bay of Bengal west-northwestward across central India and intensifying in the Arabian Sea, while moistening occurs over the equatorial Indian Ocean to the south. This south-north dipole in moisture changes presages the drying over India and the southward movement of moisture leading back to regime (a).

Summary and conclusions
The cluster analysis presented here categorizes the evolution of the full circulation in time, rather than focus on only those components which are oscillatory. The preferred transition paths between the circulation regimes clearly reflect aspects of both the active/break episodes and the 45-day oscillation which is associated with northeastward propagation of the circulation. Yet significant transitions (from cluster (c) to (b) in Figs. 2 and 3, and from cluster (b) to cluster (a) in Figs. 5 and 6) are characterized by equatorward propagation, and there are a number of other (not statistically significant) transitions between clusters. Such transitions occur during periods when regular northward propagation is not seen.     The evaluation of forecast and reforecast anomalies using the clustering approach presented has a number of advantages, including: (i) the focus on large-scale circulation patterns that are frequently seen in reanalyses; (ii) a natural way to describe the components of the active/ break cycle in forecasts as well as the transitions that are not part of this cycle, and (iii) its applicability to short forecasts, for which application of time-filtering is not an option. The assignment of each forecast pentad anomaly to one of the observed regimes can be done on the basis of pattern correlation or mean-squared error, and leads to a coarse-grained representation of the (re)forecast evolution. Ensembles of such (re)-forecasts can then meaningfully summarized in terms of the evolution of regime probabilities. To evaluate the fidelity of long simulations, another option is to apply the clustering procedure described in this paper to determine the realism of the preferred circulation states as well as their transition paths. In this way the presence (or absence) of a simulated active-break cycle can be obtained without the use of spectral time filtering.