1 Introduction

The Regional Earthquake Likelihood Model (RELM) working group formed in 2000 and was supported by the Southern California Earthquake Center (SCEC) and the United States Geological Survey (USGS). The group’s main purpose was to improve seismic hazard assessment and to increase understanding of earthquake generation processes. Seismic hazard analysis requires two fundamental components: an earthquake forecast that describes the probabilities of earthquake occurrence in a spatio-temporal volume, and a ground-motion model that transforms each forecasted event into a site-specific estimate of ground-shaking. RELM participants focused on the former component and developed several earthquake forecast models (Bird and Liu, 2007; Console et al., 2007; Ebel et al., 2007; Gerstenberger et al., 2007; Helmstetter et al., 2007; Holliday et al., 2007; Kagan et al., 2007; Petersen et al., 2007; Rhoades, 2007; Shen et al., 2007; Ward, 2007; Wiemer and Schorlemmer, 2007). These models span a broad range of input data and methods: most are based on past seismicity, but some incorporate geodetic data and/or geological insights. See Field (2007) and the special volume of Seismological Research Letters for more details on the RELM project.

In addition to developing forecast models, RELM also explored comparative testing strategies and established a plan for conducting these tests. The group developed a suite of likelihood tests (Schorlemmer et al., 2007) to be implemented within a Testing Center, a facility in which earthquake forecast models are installed as software codes and in which all necessary tests are conducted in an automated and fully prospective fashion (Schorlemmer and Gerstenberger, 2007). By the end of the 5-year project, 19 earthquake forecasts were submitted for prospective testing in the period of 1 January 2006, 00:00–1 January 2011, 00:00. These forecasts were not installed as software codes in the Testing Center because the RELM group decided to use simple forecast tables; nevertheless, the processing is fully automated and does not require human interaction. All other models in the Testing Center, including the RELM 1-day models, are installed as codes.

Following the conclusion of the RELM project, the Collaboratory for the Study of Earthquake Predictability (CSEP) was formed as a venue to expand upon the RELM experiment and to establish and maintain a Testing Center (Jordan, 2006). CSEP is built upon a global partnership to promote rigorous earthquake predictability experiments in various tectonic environments. In addition to establishing new testing regions, CSEP is developing new testing methods, introducing new kinds of earthquake forecast models, and improving upon the testing rules suggested by the RELM working group. The U.S. branch of CSEP inherited all RELM earthquake forecasts, as well as the task of testing them according to the rules outlined by Schorlemmer et al. (2007) in a Testing Center designed according to Schorlemmer and Gerstenberger (2007).

All models developed by RELM participants forecast earthquakes in a testing area that covers the state of California and all regions within about one degree of its borders. This test region was chosen to include any earthquake that might cause shaking within the state of California (Schorlemmer and Gerstenberger, 2007). The RELM working group proposed two major classes of forecasts: 1 day and 5 years (Schorlemmer and Gerstenberger, 2007). In contrast to daily or yearly periodicity in weather, earthquakes do not follow obvious seasonal or cyclical patterns that could scientifically justify the chosen durations. Rather, the classes are end-user-oriented: the 5-year class is relevant for seismic hazard calculations, while the 1-day class allows a closer look at aftershock hazard forecasts and potential short-term precursor detection. Daily forecasts can make use of all seismicity up to and including the previous day to adapt to new earthquakes and to re-calibrate the model, whereas the 5-year forecasts are fixed at the beginning of the experiment and never updated. Because of this fundamental difference in setup, models were submitted either to the 1-day class or to the 5-year class. Forecasts submitted to the 5-year class were taken to be time-invariant. We briefly describe the models below; a detailed summary of the models is given by Field (2007), while the full descriptions of each model can be found in the individual articles in the special volume of Seismological Research Letters (see Table 1).

Table 1 RELM models being evaluated within the Testing Center

One of the main goals of RELM was to test models comparatively; to compare models, a significant standardization of the forecasts was necessary. Therefore, all testing rules, the testing period, the testing area, and the earthquake catalog and its processing were defined by Schorlemmer and Gerstenberger (2007) and agreed upon by the members of the RELM working group. This standardization also required that all RELM models provide grid-based forecasts: earthquake rates specified in latitude/longitude/magnitude bins, and characterized by Poisson uncertainty. Models that declare alarms or forecast fault ruptures were not considered, as no testing method was developed or specified for these kinds of forecasts.

In this paper we describe the different model classes and present the results from the first 2.5 years of testing the time-invariant 5-year RELM forecasts. Because the forecasts were specified as being time-invariant, all forecast rates were halved for the results presented here. We emphasize, however, that these results are preliminary because the forecasts were specified as 5-year forecasts. As more earthquakes occur, the results will likely change. Nevertheless, the results indicate which models are consistent with the observations to date and which models have so far performed best in comparative testing.

2 Models

2.1 5-Year Models

The forecasts submitted to the 5-year class represent a broad spectrum of models, each of which is built on its own set of scientific hypotheses pertaining to the occurrence of earthquakes. Most of the models use past seismicity as the primary data set for model calibration and parameter value estimation, and they then extrapolate historical seismicity rates into the future. However, some models make use of geological, geodetic, and/or tectonic data.

Large earthquakes are followed by dozens to hundreds of earthquakes in their immediate wake. If a very large event were to occur in California tomorrow, its triggered earthquakes would likely dominate the statistics of the entire 5-year period. Because mainshocks and dependent aftershocks cannot be distinguished by any physical measurement, a compromise was made to accommodate models which forecast independent mainshocks only. Two forecast subclasses were created: one for forecasts of mainshocks only (mainshock models) and one for forecasts of all earthquakes (mainshock+aftershock models). Schorlemmer and Gerstenberger (2007) and Schorlemmer et al. (2007) provide details on the declustering procedure that is used at the testing center to create catalogs of mainshocks against which the mainshock models are tested. Both subclasses forecast rates of earthquakes with magnitude greater than or equal to 4.95 with a binning of 0.1 magnitude units (resulting in magnitude bins of [4.95, 5.05), [5.05, 5.15), etc., with a final bin starting at magnitude 8.95 with no upper limit) and a spatial binning of 0.1° × 0.1° with the cell boundaries aligned to the full degrees. The observed magnitude is taken to be the magnitude reported in the Advanced National Seismic System (ANSS) catalog, disregarding the magnitude scale.
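The sketch below illustrates this discretization for a hypothetical event; labelling each cell by its lower-left corner and the function name are choices made for illustration, not part of the RELM specification.

```python
# Minimal sketch of the RELM discretization described above: 0.1° x 0.1°
# spatial cells aligned to full degrees, 0.1-unit magnitude bins starting at
# 4.95, and a final open-ended bin starting at 8.95. Labelling each cell by
# its lower-left corner is an assumption for illustration.
import math

def relm_bin(lat, lon, mag):
    lat_edge = math.floor(lat * 10) / 10                 # lower cell edge
    lon_edge = math.floor(lon * 10) / 10
    # 40 bins span [4.95, 8.95); index 40 is the open-ended final bin.
    # The small epsilon guards against floating-point edge effects.
    mag_index = min(int((mag - 4.95) / 0.1 + 1e-9), 40)
    mag_edge = round(4.95 + 0.1 * mag_index, 2)
    return lat_edge, lon_edge, mag_edge

# A hypothetical M 5.12 event at 35.47° N, 120.37° W falls in the cell with
# lower-left corner (35.4, -120.4) and magnitude bin [5.05, 5.15).
print(relm_bin(35.47, -120.37, 5.12))
```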

2.2 Mainshock Models

Twelve mainshock models were submitted to RELM; these were formally registered and published on the RELM website (http://relm.cseptesting.org, see also Table 1 and Figs. 1 and 2). Of these, many were generated by smoothing past seismicity under different assumptions. The models Ebel-et-al.Mainshock and Ebel-et-al.Mainshock.Corrected (see below for the explanation of the double entry), developed by Ebel et al. (2007), average the 5-year rate of \(M {\geq}5\) earthquakes in 3° by 3° cells from a declustered catalog from 1932 until 2004 and use a Gutenberg-Richter distribution for computing rates per magnitude. The model Kagan-et-al.Mainshock (Kagan et al., 2007) smooths past earthquakes using a longer catalog dating back to 1800 and it accounts for the spatial extent of large earthquake ruptures. Rates are calculated using a tapered Gutenberg-Richter distribution with corner magnitude 8. Helmstetter et al. (2007) extend this approach to their Helmstetter-et-al.Mainshock model by including past \(M {\geq}2\) events since 1984 in the smoothing, by optimizing the smoothing, and by accounting for the spatial variability of the completeness magnitude. The model Ward.Seismic81 (Ward, 2007) is also based on smoothing past earthquakes, in this case going back to 1850.

Fig. 1

Forecast maps of 5-year mainshock models. Colors indicate the forecast rate of all events with \(M {\geq}4.95\) (unmasked areas only), reducing the latitude/longitude/magnitude forecasts to latitude/longitude forecasts by summing over the magnitude bins. The observed target earthquakes are shown as white squares; only those earthquakes occurring in unmasked cells are shown for each model. Models from left to right: (first row) Ebel-et-al.Mainshock.Corrected with Ebel-et-al.Mainshock as inset, Helmstetter-et-al.Mainshock, and Holliday-et-al.PI. (second row) Kagan-et-al.Mainshock, Shen-et-al.Mainshock, and Ward.Combo81

Fig. 2

Forecast maps of 5-year mainshock models. Colors indicate the forecast rate of all events with \(M {\geq}4.95\) (unmasked areas only), reducing the latitude/longitude/magnitude forecast to latitude/longitude forecasts by summing over the magnitude bins. The observed target earthquakes are shown as white squares; only those earthquakes occurring in unmasked cells are shown for each model. Models from left to right: (first row) Ward.Geodetic81, Ward.Geodetic85, and Ward.Geologic81. (second row) Ward.Seismic81, Ward.Simulation, and Wiemer-Schorlemmer.ALM

Wiemer and Schorlemmer (2007) estimated the a and b values of the Gutenberg-Richter distribution in each latitude/longitude cell to test the hypothesis that spatial variations in these values designate stationary asperities that govern the relative frequency of large and small earthquakes (the Wiemer-Schorlemmer.ALM model). The model Holliday-et-al.PI, submitted by Holliday et al. (2007), is based on the assumption that regions of strongly fluctuating seismicity will be the regions of future large earthquakes.

Some models include data other than past earthquake observations. Three models are based solely on geodetic data. In one, Shen-et-al.Mainshock, Shen et al. (2007) assumed that the earthquake rate is proportional to the horizontal maximum shear strain rate. The magnitude rates are obtained from a spatially-invariant tapered Gutenberg-Richter distribution with corner magnitude 8.02. A second model, Ward.Geodetic81 by Ward (2007), uses a larger data set and a different technique to map strain rates to seismicity rates. The sole difference between this and the third model, Ward.Geodetic85 by Ward (2007), is the maximum magnitude in the truncated Gutenberg-Richter distribution (8.1 and 8.5, respectively).

Ward (2007) also provided a mainshock model based solely on geological data (Ward.Geologic81). The model is constructed by mapping fault slip rates into a smoothed geological moment rate density and then into seismicity rate, again assuming a spatially invariant truncated Gutenberg-Richter distribution. The model Ward.Simulation is based on simulations of velocity-weakening friction on a fixed fault network representing California. The model Ward.Combo81 presents the average of the seismic, geodetic, and geological models by Ward (2007).

2.3 Mainshock+Aftershock Models

Six mainshock+aftershock models were submitted to RELM (see Table 1 and Fig. 3). Of these, all but one are modifications of corresponding mainshock forecasts: Ebel et al. (2007), Kagan et al. (2007), Helmstetter et al. (2007) and Shen et al. (2007) calibrated their mainshock+aftershock forecasts to a complete catalog, whereas their mainshock forecasts were calibrated to a declustered catalog of past seismicity. The model Bird-Liu.Neokinema by Bird and Liu (2007) is based on a local kinematic model of surface velocities derived from geodetic, tectonic, geological, and stress-direction data. The velocities are mapped into seismic moment rate and then into long-term seismicity rate.

Fig. 3

Forecast maps of all 5-year mainshock+aftershock models. Colors indicate the forecast rate of all events with \(M \geq 4.95\) (unmasked areas only), reducing the latitude/longitude/magnitude forecasts to latitude/longitude forecasts by summing over the magnitude bins. The observed target earthquakes are shown as white squares; only those earthquakes occurring in unmasked cells are shown for each model. Models from left to right: (first row) Bird-Liu.Neokinema, Ebel-et-al.Aftershock, and Helmstetter-et-al.Aftershock. (second row) Kagan-et-al.Aftershock, Shen-et-al.Aftershock, and Ebel-et-al.Aftershock.Corrected. The Ebel-et-al.Aftershock.Corrected model was submitted on 12 November 2006 and is therefore tested against a smaller set of earthquakes

2.4 Corrected Forecast Groups

Two additional 5-year model classes were introduced to account for corrected versions of the models by Ebel et al. (2007). In their initial submission, the forecasts were erroneous at some locations; they were replaced by a corrected version on 12 November 2006. Because of the logic of truly prospective testing, the mainshock class and the mainshock+aftershock class were expanded into two groups each. The first group includes all initial RELM submissions and compares them to observations from 1 January 2006 forward, while the second group (denoted by a “corrected” suffix) covers all initial submissions and the corrected version of the model by Ebel et al. (2007). Because the corrected versions were submitted later, testing for this group started at the submission date of the corrected versions.

For any further model addition or correction, a new group will be introduced. Such a group would consist of all existing models and the new submissions, and the starting date for testing would be the submission date of the new contributions.

3 Testing Center

The Testing Center is a multi-computer system running the CSEP Testing Center software. It is divided into four main components: the development system, the integration system, the operational system, and the web presentation system (Zechar et al., 2009). The development system is used for developing the Testing Center software and for model development and installation. After the Testing Center software and the respective models run successfully on the development system, their functionality is tested on the integration system. Each day this system checks out all necessary software codes and performs unit and acceptance tests for all software programs. This step is introduced to mimic the operational system and to detect possible problems before codes are transferred to the operational system. The operational system has the same setup as the integration system; however, the codes are updated only every three months, according to the release schedule of new versions of the Testing Center software. On the operational system, all tests are performed on different schedules depending on the model groups. All results are copied to the web presentation system, from which they can be retrieved.

The design of the Testing Center followed the four main goals as outlined by Schorlemmer and Gerstenberger (2007):

  • Transparency. All computer codes are managed in a version control repository and are freely available. Thus, all changes to the codes are documented, and a web-based collaboration system allows everyone to monitor the software development. The Testing Center codes are published under the open-source General Public License, and most of the models that were submitted as codes are open source and can be used by other researchers. The RELM 5-year models were submitted as simple forecast files, which are also freely available on the RELM website (http://relm.cseptesting.org). The Testing Center also catalogs all data files used for generating and testing forecasts. All of these files are freely available.

  • Controlled Environment. The Testing Center ensures truly prospective tests of all submitted models with the same data. Every model submission is time-stamped and is only tested for periods after the submission date. Such an environment is needed for continuous testing of short-term models like the RELM 1-day model class. Because modelers cannot modify their models after submission, no conscious or unconscious bias of a modeler is introduced into the forecasts.

  • Comparability. One of the major purposes of the Testing Center is the comparative testing of models. Models are tested for consistency with the observation and against each other (given the observation) to assess their comparative performance.

  • Reproducibility. Full reproducibility of any result is perhaps the most important feature of the Testing Center. Each data set used for computing a test is stored in the system. Thus, any forecast and any input data set can be reproduced and the tests can be recomputed at any time. Each test computation also stores the system configuration for full reproducibility.

3.1 Tests for Evaluating the Earthquake Forecasts

Schorlemmer et al. (2007) proposed a suite of statistical tests to evaluate probabilistic earthquake forecasts. Similar tests were discussed by Jackson (1996) and used by Kagan and Jackson (1994, 1995) for the evaluation of long-term forecasts of large earthquakes. In the language of statistical hypothesis testing, the tests fall into the class of significance tests: Assuming a null hypothesis (a given forecast model), the distribution of an observable test statistic is simulated; if the observed test statistic (e.g., the number of earthquakes) falls into the upper or lower tail of the distribution, the null hypothesis is rejected. The predictive distributions are constructed from model-dependent Monte Carlo simulations and hence are not assumed to be asymptotically normal. Daley and Vere-Jones (2004) and Harte and Vere-Jones (2005) explored performance evaluations based on the entropy score and the information gain.

Three tests are used to evaluate the RELM forecasts: the first two—the L(ikelihood)-Test and the N(umber)-Test—measure the consistency of the forecasts with the observations, while the third—the likelihood R(atio)-Test—measures the relative performance of one model against another. Each of these tests compares forecast rates with observed rates, and although they make slightly different measurements, these tests are not independent metrics.

For the RELM models, the forecast in each bin is the expected Poisson earthquake rate (the mean seismicity rate), which is usually a very small floating-point number (e.g., \(10^{-4}\)). To evaluate the likelihood of the model forecast given an observation (which is an integer, usually 0 or 1), the discrete Poisson distribution with mean equal to the forecast is used. For simplicity, the forecasts are specified such that the observations in all bins are treated as independent, allowing the joint probability to factorize into a product of per-bin probabilities.
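Written out (with \(\omega_b\) the observed count and \(\lambda_b\) the forecast rate in bin \(b\)), the per-bin Poisson probability and the factorized joint log-likelihood used in the tests below are

\[
\Pr(\omega_b \mid \lambda_b) = \frac{\lambda_b^{\omega_b}}{\omega_b!}\,e^{-\lambda_b},
\qquad
\log L = \sum_b \log \Pr(\omega_b \mid \lambda_b) = \sum_b \bigl(-\lambda_b + \omega_b \log \lambda_b - \log \omega_b!\bigr).
\]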

3.2 The Number- or N-Test

The N(umber)-Test measures the consistency of the total forecasted rate with the total number of observed earthquakes, summed over all bins. The results of the N-Test indicate whether a forecast has predicted too many earthquakes, too few earthquakes, or a number of earthquakes that is considered to be consistent with the observed number. For example, consider a model that predicted λ = 28.4 earthquakes in the total space-time-magnitude testing region, and assume that, like the RELM models we consider, the forecast is characterized by Poisson uncertainty. If ω = 30 events were observed during the experiment, the model obtains a quantile score of \(\delta=\hbox{Poi}(\omega=30|\lambda=28.4)=0.66\) (here Poi stands for the Poisson cumulative distribution function). A model may be rejected if δ is very small (e.g., less than 0.025) or very large (e.g., greater than 0.975), which would indicate that the observed number of earthquakes falls into the far lower or far upper end of the forecast distribution, respectively. In either case, the number of observed earthquakes is unlikely given the model forecast and, hence, the forecast is inconsistent with the observation. The N-Test disregards the spatial and magnitude distributions of the forecast and the observations, emphasizing each forecast’s rate model.
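A minimal sketch of this quantile-score computation, assuming Poisson uncertainty and using SciPy, is given below; the numbers follow the example above.

```python
# Minimal sketch of the N-Test quantile score under Poisson forecast
# uncertainty, using the numbers from the example above.
from scipy.stats import poisson

total_forecast_rate = 28.4   # lambda: forecast rate summed over all bins
observed_count = 30          # omega: observed number of target earthquakes

# delta = Poi(omega | lambda): probability of observing at most omega events
delta = poisson.cdf(observed_count, total_forecast_rate)
print(f"delta = {delta:.2f}")   # approximately 0.66

# Two-sided rejection at the a priori 5% significance level
rejected = (delta < 0.025) or (delta > 0.975)
```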

3.3 The Likelihood- or L-Test

The L(ikelihood)-Test measures the consistency of a forecast with the observed rate and distribution of earthquakes. In each latitude-longitude-magnitude bin, the log-likelihood of an observation, given the forecast, is computed (again assuming the Poisson distribution). The log-likelihoods are then summed over all bins. To understand whether this sum—the observed log-likelihood—is consistent with what would be expected if the model were correct, many synthetic catalogs consistent with the model forecast are simulated, and their log-likelihoods calculated. This process produces a distribution of log-likelihoods, assuming that the model of interest is the “true” model. The statistic γ measures the proportion of simulated log-likelihoods less than the observed log-likelihood. If γ is low (e.g., less than 0.05), then the observed log-likelihood is much smaller than what would be expected given the model’s veracity. The observation may therefore be considered inconsistent with the model. If γ is very high, the observed likelihood is considerably higher than expected, given the model forecast’s veracity. In this case, however, it may be that a model predicted the distribution of earthquakes well but smoothed its forecast too much, and therefore high γ values are not considered grounds for model rejection. For example, consider the case when earthquakes occur only in a model’s most highly-ranked bins—those bins with the highest forecast rates. If the model is smooth, simulations consistent with the model would produce more diffuse seismicity than that observed, yielding simulated catalogs with events in bins with lower forecast rates, and thus a very high γ. Considering this effect, the L-Test is one-sided.
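A minimal sketch of this procedure is shown below; the forecast rates and observed counts are illustrative placeholders, not values from any RELM model.

```python
# Minimal sketch of the L-Test, assuming independent Poisson bins. The
# forecast rates and observed counts below are illustrative placeholders.
import numpy as np
from scipy.stats import poisson

rng = np.random.default_rng(42)
forecast = np.array([2e-3, 5e-4, 1e-2, 3e-3])   # per-bin forecast rates
observed = np.array([0, 0, 1, 0])               # per-bin observed counts

def joint_log_likelihood(counts, rates):
    # Sum of per-bin Poisson log-likelihoods
    return poisson.logpmf(counts, rates).sum()

L_obs = joint_log_likelihood(observed, forecast)

# Simulate synthetic catalogs consistent with the forecast and compute their
# log-likelihoods under the same model.
n_sim = 10000
L_sim = np.array([joint_log_likelihood(rng.poisson(forecast), forecast)
                  for _ in range(n_sim)])

# gamma: fraction of simulated log-likelihoods below the observed one.
# The test is one-sided: only a very small gamma is grounds for rejection.
gamma = np.mean(L_sim < L_obs)
```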

3.4 The Likelihood-Ratio- or R-Test

The likelihood R(atio)-Test consists of a pairwise comparison between forecasts (e.g., forecasts i and j). The observed log-likelihood is calculated for each model forecast, and the difference—the observed likelihood ratio—indicates which model better fits the observations. To understand whether this difference is significant, a null hypothesis that model i is correct is adopted and synthetic catalogs consistent with this model are produced. The likelihood ratio is calculated for each simulated catalog. If the fraction αij of simulated likelihood ratios less than the observed likelihood ratio is very small (e.g., less than 0.05), the observed likelihood ratio is deemed significantly small and model i is rejected. So that no single forecast is given an advantage, this procedure is applied symmetrically: synthetic catalogs are also simulated assuming model j to be true, and these simulations are used to estimate αji. Comparing each model with all other models results in a table of α values.
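Below is a minimal sketch of this symmetric procedure for one pair of hypothetical forecasts; the rates and counts are again illustrative placeholders.

```python
# Minimal sketch of the R-Test between two hypothetical forecasts i and j,
# assuming independent Poisson bins and a shared set of unmasked cells.
# Forecast rates and observed counts are illustrative placeholders.
import numpy as np
from scipy.stats import poisson

rng = np.random.default_rng(0)
forecast_i = np.array([2e-3, 5e-4, 1e-2, 3e-3])
forecast_j = np.array([1e-3, 1e-3, 8e-3, 5e-3])
observed = np.array([0, 0, 1, 0])

def log_likelihood(counts, rates):
    return poisson.logpmf(counts, rates).sum()

def alpha(null_model, other_model, observed, n_sim=10000):
    """Quantile score for rejecting `null_model` in favor of `other_model`."""
    # Observed likelihood ratio, with the null model in the reference role
    r_obs = (log_likelihood(observed, null_model)
             - log_likelihood(observed, other_model))
    # Simulate catalogs from the null model and recompute the ratio each time
    r_sim = np.array([
        log_likelihood(sim, null_model) - log_likelihood(sim, other_model)
        for sim in (rng.poisson(null_model) for _ in range(n_sim))
    ])
    # Fraction of simulated ratios below the observed one; a very small value
    # (e.g., < 0.05) rejects the null model in favor of the other model.
    return np.mean(r_sim < r_obs)

alpha_ij = alpha(forecast_i, forecast_j, observed)  # test of model i
alpha_ji = alpha(forecast_j, forecast_i, observed)  # symmetric test of model j
```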

3.5 Masking

Several models are based on data that are not available throughout the entire testing area, and some researchers felt that their model was not applicable everywhere in the testing area. For a forecast to fully cover the testing area, such a model needs an additional “background” model to fill the gaps. RELM requested that all submitted models cover the entire testing area, although modelers were permitted to mask the area in which they were unable to create their forecast according to their scientific ideas. Thus, the area of the genuine forecast can be identified during testing, although it is also possible to evaluate a model over the entire testing area if a background model is chosen. Currently, only the unmasked areas are tested in the Testing Center; that is, a forecast is only evaluated over bins which are unmasked. For the R-Test, only bins which are unmasked in both forecasts are considered.

3.6 Uncertainties in Observations

The earthquake catalog data used to test forecasts contain measurement uncertainties. To account for these uncertainties in the tests, Schorlemmer et al. (2007) proposed generating “modified” catalogs. Each event’s location and magnitude is perturbed using an error distribution suggested by the catalog compilers. Additionally, in the case of mainshock catalogs, declustering according to Reasenberg (1985) is applied using parameters that are sampled as described by Schorlemmer and Gerstenberger (2007). For each observed catalog, 1000 modified catalogs are generated; these modified catalogs are used to estimate the uncertainty in the test results that arises from the uncertainties in the earthquake data.
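A minimal sketch of generating one such modified catalog is shown below; the Gaussian error model and the specific error sizes are assumptions for illustration, not the distributions suggested by the catalog compilers.

```python
# Minimal sketch of building one "modified" catalog by perturbing each event's
# location and magnitude. The Gaussian error model and the error sizes below
# are illustrative assumptions, not the catalog compilers' error distributions.
import numpy as np

rng = np.random.default_rng(7)

def modify_catalog(catalog, horiz_err_km=1.0, mag_err=0.1):
    """catalog: list of events, each a dict with 'lat', 'lon', and 'mag'."""
    km_per_deg = 111.0
    modified = []
    for ev in catalog:
        dlat = rng.normal(0.0, horiz_err_km) / km_per_deg
        dlon = rng.normal(0.0, horiz_err_km) / (
            km_per_deg * np.cos(np.radians(ev["lat"])))
        modified.append({
            "lat": ev["lat"] + dlat,
            "lon": ev["lon"] + dlon,
            "mag": ev["mag"] + rng.normal(0.0, mag_err),
        })
    return modified

# 1000 modified catalogs, as in the procedure described above
catalog = [{"lat": 34.00, "lon": -117.50, "mag": 5.10}]
modified_catalogs = [modify_catalog(catalog) for _ in range(1000)]
```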

4 Results

In this section we report preliminary summary results for the first half of the ongoing 5-year RELM experiment in California. Detailed results are available at http://us.cseptesting.org, where they are archived and regularly updated. We remind the reader that these results are preliminary, as they are based on only half of the planned observation period.

4.1 Observed Earthquakes

Twelve earthquakes with magnitude greater than or equal to 4.95 were reported in the ANSS catalog in the RELM testing region during the first half of the ongoing 5-year experiment. Table 2 lists the properties of these target events. Among the details in Table 2 is the estimated independence probability for each earthquake, computed by a Monte-Carlo application (Schorlemmer and Gerstenberger, 2007) of the Reasenberg (1985) declustering algorithm. For example, the first target earthquake has an independence probability, \(P_I\), of 21%, indicating that the declustering algorithm identified this earthquake as belonging to a cluster in 79% of the declustering iterations, each using a different, Monte Carlo-sampled set of algorithm parameters from a range of plausible values. The independence probabilities were used during evaluation of the mainshock and mainshock.corrected forecast group models; as mentioned in the previous section, the tests estimate the effect of observation uncertainties by generating modified catalogs, and the independence probability determines in what percentage of the modified catalogs a given earthquake appears.
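A minimal sketch of how such an independence probability propagates into the modified mainshock catalogs, consistent with the description above (one Bernoulli draw per modified catalog), is:

```python
# Minimal sketch: an event with independence probability P_I is included in
# each of the 1000 modified mainshock catalogs with probability P_I (one
# Bernoulli draw per catalog), consistent with the description above.
import numpy as np

rng = np.random.default_rng(1)
p_independent = 0.21   # example value for the first target event in Table 2
n_catalogs = 1000

included = rng.random(n_catalogs) < p_independent
print(f"event appears in {included.sum()} of {n_catalogs} modified catalogs")
```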

Table 2 Observed target earthquakes of magnitude \(M_{\rm ANSS} \geq 4.95\) in the testing area

For the 5-year mainshock forecast class, only a subset of the events in Table 2 is considered. This subset is determined by applying the Reasenberg (1985) declustering algorithm to the original observed catalog, using standard California parameters. Those events that are not removed by the declustering are considered mainshocks and are used to evaluate the 5-year mainshock forecasts.

An investigation of historical seismicity rates in the RELM testing region indicates that the observed sample of 12 earthquakes (nine of them mainshocks) in a 2.5-year period is relatively small, but not significantly so. We analyzed the rate of all \(M \ge 4.95\) earthquakes from 1 January 1932 to 30 June 2004 using the ANSS catalog. To compare with the experimental observation, we divided this time period into 29 non-overlapping periods of 2 years and 6 months duration; the rates in each period are shown in Fig. 4a. On average, 15.45 earthquakes (10.59 of them mainshocks) were observed during each 2.5-year period, with a sample standard deviation of 9.99. As suggested by Jackson and Kagan (1999) (see also Vere-Jones, 1970; Kagan, 1973), we found that the number of earthquakes in each period is better fit by a negative binomial distribution than by a Poisson distribution—that is, the best-fit negative binomial distribution obtains a lower Akaike Information Criterion (AIC) value (Akaike, 1974) (206.4) than the best-fit Poisson distribution (278.2). The best-fitting negative binomial distribution also provides a marginally better fit to the mainshock rate distribution: the negative binomial model obtains an AIC value of 167.3, whereas the Poisson model obtains an AIC of 168.5. The seismicity rate data and the best fits are shown in Fig. 4b. We find that the best-fit negative binomial distribution is described by parameter values (τ, ν) = (2.83, 0.15); under this model, the probability of obtaining fewer than 12 earthquakes is 41.01%. Accordingly, under the best-fit model for mainshock rates, the probability of obtaining fewer than nine mainshocks is 32.91%. Despite our finding that the negative binomial distribution better fits historical rates of seismicity, RELM forecasts were formulated as having Poisson uncertainty, and therefore the tests applied to the models are based on Poisson statistics.
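A minimal sketch of such a model comparison is given below; the per-period counts are illustrative placeholders rather than the 29 observed rates, and SciPy's two-parameter (n, p) form of the negative binomial is used instead of the (τ, ν) parameterization quoted above.

```python
# Minimal sketch of comparing Poisson and negative binomial fits to per-period
# earthquake counts via AIC. The counts below are illustrative placeholders,
# not the 29 observed 2.5-year rates, and SciPy's (n, p) parameterization is
# used instead of the (tau, nu) parameterization quoted in the text.
import numpy as np
from scipy.stats import poisson, nbinom
from scipy.optimize import minimize

counts = np.array([4, 9, 12, 31, 8, 15, 22, 6, 40, 11, 17, 9])

# Poisson: one parameter; the maximum-likelihood estimate is the sample mean
lam = counts.mean()
aic_poisson = 2 * 1 - 2 * poisson.logpmf(counts, lam).sum()

# Negative binomial: two parameters (n, p), fit by numerical maximum likelihood
def neg_log_likelihood(params):
    n, p = params
    if n <= 0 or not 0 < p < 1:
        return np.inf
    return -nbinom.logpmf(counts, n, p).sum()

fit = minimize(neg_log_likelihood, x0=[2.0, 0.2], method="Nelder-Mead")
aic_nbinom = 2 * 2 + 2 * fit.fun   # fit.fun is the minimized negative log-likelihood

print(f"AIC Poisson: {aic_poisson:.1f}, AIC negative binomial: {aic_nbinom:.1f}")
```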

Fig. 4

Earthquake rates in California from 1 January 1932 to 30 June 2004. (left) Bar graph showing the number of earthquakes in 29 non-overlapping periods of 2 years and 6 months duration. White and gray bars indicate the number of earthquakes in the declustered catalog (mainshocks only) and in the complete catalog, respectively. (right) Cumulative distribution function of the earthquake rates in the complete catalog from the left frame. The solid black line indicates the observation, the solid gray line indicates the Poisson distribution with rate λ = 15.45, and the dashed black line indicates the best-fit negative binomial distribution

4.2 Mainshock Models

The summary results for the mainshock forecast class are given in Tables 3, 4, and 5. Table 3 lists the quantile scores for the L- and N-Tests. The RELM working group decided a priori to use a significance value of 5%; in the case of the two-sided N-Test, this corresponds to critical values of 2.5% and 97.5%. Bold values in the tables indicate that the corresponding forecast is inconsistent with the observed target earthquake catalog. Recall that the γ quantile score, associated with the L-Test, describes how well a forecast matches the observed distribution of earthquakes. A very low γ score is grounds for rejecting a model, while a very high γ score is suspect, but not grounds for rejection. On the other hand, an extremely low or extremely high δ quantile score—characterizing the overall rate of earthquakes but not including any spatial information—yields rejection. From Table 3 we see that the observations during the first half of the RELM experiment are inconsistent—at the a priori significance level—with the Holliday-et-al.PI, Ward.Combo81, Ward.Geodetic81, Ward.Geologic81, and Ward.Seismic81 forecasts. All of these models have overpredicted in the first half of the experiment, as indicated by their small δ values. (See also Fig. 5 for a visual comparison of predicted and observed numbers of earthquakes per model.)

Table 3 L-Test and N-Test results for the mainshock forecast class
Table 4 Result details for the mainshock forecast class
Table 5 R-Test results for the mainshock forecast class
Fig. 5

Visual comparison of predicted and observed numbers of earthquakes per model in the mainshock and mainshock+aftershock forecast classes. For each model, the bar indicates the range of observed earthquake counts that would be consistent with the model, assuming a Poisson distribution. The gray squares indicate the number of earthquakes observed in each model's unmasked area. If the gray square overlaps with the bar, the model is consistent with the observation
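A minimal sketch of how such a consistency range can be computed for a given total forecast rate, assuming a Poisson distribution and the a priori 5% significance level (2.5% in each tail), is:

```python
# Minimal sketch: the range of observed counts consistent with a total forecast
# rate at the 5% significance level (2.5% in each tail), assuming a Poisson
# distribution, matching the two-sided N-Test quantile criterion.
import numpy as np
from scipy.stats import poisson

def consistent_count_range(total_rate, alpha=0.05):
    counts = np.arange(0, int(10 * total_rate) + 10)
    delta = poisson.cdf(counts, total_rate)
    ok = counts[(delta >= alpha / 2) & (delta <= 1 - alpha / 2)]
    return int(ok.min()), int(ok.max())

print(consistent_count_range(28.4))   # e.g., the consistent range for a rate of 28.4
```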

Table 4 shows the contribution of each earthquake to the resulting likelihoods per model and highlights for each earthquake the model with the highest forecast rate in the respective bin—in other words, which model best forecast the earthquake. The Wiemer-Schorlemmer.ALM model provides the highest forecast rate for four earthquakes, and the Helmstetter-et-al.Mainshock model has the highest forecast rate for three earthquakes. The Ebel-et-al.Mainshock and Holliday-et-al.PI models provide the highest forecast rate for one earthquake each.

The R-Test results for the mainshock forecast class are shown in Table 5 and provide a comparative evaluation of the forecasts. This table lists the α quantile scores for each pairwise comparison; for simplicity, we exclude the pairwise comparisons that would include the models shown to be inconsistent by the L- and/or N-Tests. Scores indicating that the corresponding model can be rejected are shown in bold. In this case, such a score indicates that the row model (labeled to the left) should be rejected in favor of the column model (labeled at the top). For example, the α value in the first row and second column indicates that the Ebel-et-al.Mainshock forecast should be rejected in favor of the Helmstetter-et-al.Mainshock forecast. From this table, we find that only the Helmstetter-et-al.Mainshock forecast is not rejected (because all other rows contain at least one bold value). Moreover, all models are rejected in favor of the Helmstetter-et-al.Mainshock forecast (all scores in the second column are bold).

4.3 Mainshock Corrected

As mentioned in the Models section, the mainshock.corrected forecast group contains all the same forecasts as the mainshock forecast class with one exception: the Ebel-et-al.Mainshock.Corrected forecast is added and implicitly replaces the Ebel-et-al.Mainshock forecast. For consistency, the experiment for this forecast group began on 12 November 2006, so it contains only earthquakes 3–11 from Table 2. The summary results for this forecast group are shown in Tables 6 and 7. In this forecast group, the L- and N-Test results indicate that the observed earthquake distribution is consistent with all forecast models except the Ward.Combo81 and Ward.Geodetic81 models, which overpredicted the number of events (Table 6). The R-Test results are similar to the results for the mainshock forecast class and indicate that only the Helmstetter-et-al.Mainshock forecast is not rejected in any pairwise comparison (Table 7).

Table 6 L-Test and N-Test results for the mainshock.corrected forecast group
Table 7 R-Test results for the mainshock.corrected forecast group

4.4 Mainshock+Aftershock Models

The summary results for the mainshock+aftershock forecast class are shown in Tables 8, 9, and 10. N-Test results show that the Bird-Liu.Neokinema model and the Ebel-et-al.Aftershock model have each predicted too many earthquakes in the experiment to date (see also Fig. 5). The R-Test results show that only the Helmstetter-et-al.Aftershock forecast is not rejected in any pairwise comparison.

Table 8 L-Test and N-Test results for the mainshock+aftershock forecast class
Table 9 Result details for the mainshock+aftershock forecast class
Table 10 R-Test results for the mainshock+aftershock forecast class

4.5 Mainshock+Aftershock Corrected

In analogy to the mainshock.corrected forecast group, the mainshock+aftershock.corrected forecast group was added alongside the mainshock+aftershock forecast class. The Ebel-et-al.Aftershock.Corrected forecast is added and implicitly replaces the Ebel-et-al.Aftershock forecast. For consistency, the experiment for this forecast group began on 12 November 2006. The summary results for this forecast group are shown in Tables 11 and 12.

Table 11 L-Test and N-Test results for the mainshock+aftershock.corrected forecast group
Table 12 R-Test results for the mainshock+aftershock.corrected forecast group

As in the mainshock+aftershock forecast group, the N-Test results show that the Ebel-et-al.Aftershock model has predicted too many earthquakes in the experiment to date, as has the Ebel-et-al.Aftershock.Corrected model. The R-Test results show that only the Helmstetter-et-al.Aftershock forecast is not rejected in any pairwise comparison.

5 Discussion

The science of earthquake predictability is an active field with many unsolved problems, including the question of best practices for formulating and evaluating earthquake forecasts. The RELM effort, as one of the first large-scale, prospective, and cooperative predictability experiments, can provide lessons along these lines. RELM experiment participants decided to specify their forecasts as the expected rate of earthquakes in latitude/longitude/magnitude bins, and they decided that the forecasts should be interpreted as having Poisson uncertainty. As we showed in the Observed Earthquakes subsection (and as shown by Jackson and Kagan, 1999), seismicity rates are better fit by a negative binomial distribution than by a Poisson distribution; therefore it may be worthwhile for future forecasts to specify an additional parameter per bin (or per forecast) that allows for negative binomial uncertainty. Preferably, a forecast should specify a discrete probability distribution in each bin. This approach would not require the agreement of all participants on one particular distribution to be used for testing, and it would also allow for propagating uncertainties of input data into the forecast (Werner and Sornette, 2008). The tests and forecast format that RELM decided to use are relatively simple yet powerful. Nevertheless, they are not without flaws; for example, the assumption that observations in each space-time-magnitude bin are independent may sometimes be violated, particularly in the wake of a large earthquake.

Some of these issues will be addressed by considering alternative forecast formats, e.g., by allowing models to specify the likelihood distribution to be used. Moreover, CSEP is incorporating modifications to the current tests and other tests, e.g., alarm-based tests that do not require a specific rate or uncertainty model (Molchan, 1990; Molchan and Kagan, 1992; Kagan, 2007; Molchan and Keilis-Borok, 2008; Zechar and Jordan, 2008).

The stability of RELM test results—including those presented here—is not easy to understand comprehensively. We made an effort to address the stability of the L-Test by exploring a hypothetical predictability experiment. For a given forecast, we determined the bin with the lowest forecast rate, and we generated a modified catalog by adding to the observed catalog one additional event placed in this bin. This additional event represents the most unexpected occurrence according to the model, and we examined whether this single event could cause an otherwise accepted forecast to be rejected. We applied the L-Test to each forecast and the corresponding modified catalog and compared the resulting γ statistic with the observed γ reported in the tables throughout the Results section. We find that there is no simple relationship: some forecasts were rejected while others were not, and rejection depended on the peakedness of a forecast. For example, if a forecast has a very high ratio between its highest and lowest forecast values (i.e., it is very peaked), the most unexpected event has a much stronger effect on the L-Test result than otherwise. In other words, the stability of test results is model-dependent, and this issue should be considered carefully in future experiments.
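A minimal sketch of this check, reusing the illustrative L-Test routine from above with placeholder rates and counts, is:

```python
# Minimal sketch of the stability check described above: add one event to the
# bin with the lowest forecast rate and recompute the L-Test gamma statistic.
# Forecast rates and observed counts are illustrative placeholders.
import numpy as np
from scipy.stats import poisson

def l_test_gamma(forecast, observed, n_sim=10000, seed=0):
    rng = np.random.default_rng(seed)
    log_lik = lambda counts: poisson.logpmf(counts, forecast).sum()
    L_obs = log_lik(observed)
    L_sim = np.array([log_lik(rng.poisson(forecast)) for _ in range(n_sim)])
    return np.mean(L_sim < L_obs)

forecast = np.array([2e-3, 5e-4, 1e-2, 3e-3])
observed = np.array([0, 0, 1, 0])

# Place the "most unexpected" event in the bin with the lowest forecast rate
modified = observed.copy()
modified[np.argmin(forecast)] += 1

gamma_original = l_test_gamma(forecast, observed)
gamma_modified = l_test_gamma(forecast, modified)
print(gamma_original, gamma_modified)
```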

Another aspect of result stability is the duration of the experiment. Five years will most likely not be long enough for a comprehensive and final test result, as it can be questioned how representative the seismicity of these particular five years is. One effect of this problem can be seen in the results of the mainshock and mainshock.corrected forecast groups. While five models are rejected in the former group based on N-Test results, only two are rejected in the latter group. The exclusion of about 11 months from testing changes the test results considerably. However, the R-Test results suggest in both cases that the Helmstetter-et-al.Mainshock forecast cannot be rejected in favor of any other model.

The fact that some forecasts masked a significant portion of the testing area led to the problem that eight of the twelve mainshock forecasts were tested against only two earthquakes. Four of these eight were rejected for overpredicting the number of events. Although only two earthquakes occurred in the unmasked areas, this low number is itself informative: these models expected far more events and are therefore inconsistent with the observation.

Although the RELM project ended in 2005, efforts to develop testing methods, implement these methods into Testing Center software systems, and expand the scope of experiments to other seismically active regions are ongoing, as is the experiment considered in this study. CSEP, the successor of RELM, took over the entire operation and development and is becoming a global reference project for earthquake predictability research.

Standardization can be considered one of the most important achievements of the RELM project and the Testing Center. The substantial consensus of RELM participants on the tests, rules, and processes is more than just a nucleus for other efforts. The Testing Center software is currently deployed at facilities in New Zealand, Europe, and Japan, and the rules set in California have been adopted by all new Testing Centers. The next major step will be the unification of all these efforts into a global testing program, which is possible only because of this successful standardization.