Estimation, diagnostics, and extensions of nonparametric Hawkes processes with kernel functions

The Hawkes self-exciting model has become one of the most popular point-process models in many research areas in the natural and social sciences because of its capacity for investigating the clustering effect and positive interactions among individual events/particles. This article discusses a general nonparametric framework for the estimation, extensions, and post-estimation diagnostics of Hawkes models, in which we use the kernel functions as the basic smoothing tool.


Introduction
Analyzing time series data has a core role in analyzing data series with an evolutionary characteristic. However, when we investigate the underlying processes at the microscale, most are continuous processes, point processes, or a mixture. For example, sales at a shop are not daily incomes, but a process of each trade, which includes trading times, amounts, and the types of goods. Point process models are common in research as a natural tool to model the patterns of discrete events that occur in a continuous space, time, or a space-time domain, such as urban fires, wild forest fires, crimes, earthquakes, diseases, tree locations, animal locations, communication network failures, and so on.
Depending on the type of the domain in which the events occur, researchers classify point into two classes: spatial point processes and spatiotemporal/temporal point processes. The difference between these two types of models is that the latter have a special evolutionary time axis on which researchers can sort events according to their chronological order, and share many common features with time series sequences. Spatial point processes do not have such evolutionary direction and are usually regarded as a permanent pattern of particle locations or a snapshot of a spatiotemporal point process. Spatial point processes are usually modeled using the moment intensities and the Papangelou intensity (e.g., Møller and Waagepetersen 2003). When a property or a characteristic is attached to each event, such as the magnitude of an earthquake or the burned area of a wild fire, the point process is then called a marked point process.
Among the different types of point processes, clustering point processes attracted much interest from mathematicians and statisticians. Typical clustering processes include the Neyman-Scott process Scott 1953, 1958), which has been used to describe the distribution of locations of galaxies in the universe, and the Bartlett-Lewis process to model the rainfall process (Bartlett 1963;Lewis 1964). Many spatiotemporal/temporal clustering point processes can be categorized as a Hawkes self-exciting process (Hawkes 1971a, b;Hawkes and Oakes 1974). In short, a Hawkes process consists of a series of discrete events that each stem from one of two subprocesses: the background subprocess or the clustering subprocess. The former is considered a Poisson process, which can be inhomogeneous in space and/ or nonstationary in time, while the latter consists of events from the exciting effect of all of the events that occurred in the past. Equivalently, each event, whether it is a background event or an excited event, triggers ("encourages") the occurrence of the future events according to some probability rules. The Hawkes excitating model has become one of the most popular models in point process data analysis in both natural and social sciences because of its capacity to investigate the clustering effect (positive interactions) among individual events/ particles, and to, thus, help determine the potential causal relationship among them. Nowadays, due to the rapid development of observation and data-storage technologies, big data is also a hot topic in point-process data analysis. With many sequences (datasets) or a long sequence (dataset) containing a huge number of discrete events, a quick tool or general framework to quantify and forecast the clustering or the triggering effect among events is desirable. The Hawkes process model fits this purpose.
In seismological application, researchers use a special form of the Hawkes process, called the epidemic-type aftershock sequence (ETAS) model (e.g., Ogata 1988Ogata , 1998Zhuang et al. 2002Zhuang et al. , 2004Console et al. 2003;Helmstetter et al. 2003;Lombardi et al. 2010;Guo et al. 2015) to evaluate the probabilities of future earthquakes as well as analyzing the characteristics of seismicity. It has been adopted by many research institutions or governmental agencies in the United States, Switzerland, Italy, New Zealand, Japan, China, and so on. (Schorlemmer et al. 2018). The ETAS model is now accepted as the standard model for describing earthquake clusters (e.g., Schorlemmer et al. 2018;Huang et al. 2016). Such a model is also used in crime data analysis (see, e.g., Mohler et al. 2011) and in economics, where studies show that the interaction between prices has epidemic features (e.g., Bacry and Muzy 2014). In recent years, researchers applied this model for a data analysis of terrorists' behavior (e.g, Tench et al. 2016), interactions in social networks (e.g., Zipkin et al. 2015), and genomes or neuronal activities (e.g., Truccolo et al. 2005), among others. In all of these areas, a big portion of the theories and methodologies were originally developed using studies of earthquakes as an outcome of studies on the ETAS model. The second biggest source is crime data analysis. The applications of the Hawkes process in other areas are mainly for parameter estimation and results explanation.
The core of spatiotemporal point-process models is the conditional intensity, which gives the expectation of the number of events occurring in a unit time-space range in the near future, under the condition that we know the history of previous events in the process, and/or the history of one or more relevant processes, up to the current time, but not including it. Starting from the conditional intensity, we can conduct a parameter estimation, simulation, forecast, and even control. Many powerful tools associated with the conditional intensity function have been developed for the Hawkes process, such as stochastic de-clustering, stochastic reconstruction, the expectation-maximization (EM) algorithm, first-and higher-order residuals, and Bayesian analysis, as well as the theories associated with asymptotic properties (see a review by Reinhart 2018).
This study focuses on the use of kernel functions to solve the estimation problem of this type of point processes. With these solutions, we provide a ready-to-use tool to perform modeling, analysis, and forecasting for different point-process data in different application areas by using the Hawkes-type point process. In the following, Sect. 2 provides the basic concepts and formulations of the Hawkes process and its variations. Section 3 describes the estimation methods related to parametric Hawkes Models, including the maximum likelihood estimate (MLE), (EM) algorithm, and stochastic declustering. Section 4 introduces the nonparametric kernel estimates of the nonparametric background rate and clustering response components, and Sect. 5 uses two examples to explain how to extend existing models in a data-driven manner.

Hawkes process
We can determine a point process by its conditional intensity (e.g., Daley and Vere-Jones 2003;Zhuang 2015). For a temporal point process with no overlapping events (a simple temporal point process), the conditional intensity is where H t denotes the -algebra generated by the observational N before time t, but not including t. For any measurable set D ∈ ℝ, where f(t) is a predictable function; that is, its value at t is determined by the observation history of N before t. This equality and its spatiotemporal (high dimensional) versions are the key to the theories and methods summarized in this article. The Hawkes process describes the stochastic excitations among a series of events that occur in a continuous time domain or in a spatiotemporal domain. A temporal Hawkes process, supposing its realization N = {t i ∶ i ∈ ℤ} with ℤ being the set of all integers, has a conditional intensity in the form where is the rate of occurrence of spontaneous events (also called background events or immigrants), and g(t) is the rate of occurrence of the direct offspring generated by an event occurring at 0.
The criticality parameter, which is the average number of direct offspring per ancestor, is A stable and stationary Hawkes process requires < 1 . Otherwise, the rate of occurrence grows to infinity with time. If < 1 , then this parameter is identical to the branching ratio, which is the proportion of non-spontaneous events in the whole process. In general, these two quantities are different (see Zhuang et al. 2013 for details).
We can extend the Hawkes process easily to the spatiotemporal version where x denotes the location in the space of ℝ d . We can also generalize it the multivariate case where, if we have K types events in total, then each type has a conditional intensity of for k = 1, … , K , where k (x) represents the rate of occurrence of spontaneous events (also called background) for type-k events, and g k (t − s, x − u) is the rate of occurrence of type-k events excited by a type-event at (s, u). We can also extend the space-time Hawkes process to cases of marked processes easily: where x and m denote the location in the space of ℝ d and the mark in the space of , respectively, and f (m | m � ) gives the p.d.f. for the magnitudes of direct offspring from an event of magnitude m ′ . We can regard the multivariate case simply as a marked point process in which the mark takes only a finite number of values.
In the above, we assume that the marks of triggered events are location-and time independent.
Linlin model The earliest version of the Hawkes mode used by Hawkes and Oakes (1974) had the following self-and mutually exciting process where K and L are two given non-negative integers, and X(t) denotes the external process that can trigger events in N, which can be a point process, a continuous process, or mixture of both. Its name, Linlin, came from Ogata's FORTRAN program named "Linlin.f", meaning a linear response effect for both internal and external responses. Ogata and Akaike (1982) use this model to investigate the temporal clustering patterns of seismicity in different regions, and the correlation of seismicity among different regions. A technical problem is to keep g(t) and h(t) positive during the optimization when fitting this model to the data.
The space-time Epidemic-Type Aftershock Sequence (ETAS) model The spatiotemporal ETAS model has been used widely to describe the clustering features of earthquakes in space and time (see Ogata 1998;Zhuang et al. 2002Zhuang et al. , 2004Zhuang and Ogata 2006;Ogata and Zhuang 2006;Console et al. 2003;Sornette and Werner 2005a, b;Helmstetter et al. 2005). The conditional intensity of this model is where t, (x, y), and m represent the time of occurrence, spatial location, and magnitude of the earthquake, respectively. In the formula above, represents the probability density of the earthquake magnitude, where m c is the magnitude threshold of the earthquake, and is the expectation of the number of children (productivity), which is a Poisson random variable, from an event of magnitude m. Furthermore, is the probability density function of the length of the time interval between a child and its parent, and is the probability density function of the relative locations between the parent and children, where m c is the magnitude threshold. Epidemic forecasting In modeling and forecasting routinely collected invasive meningococcal disease (IMD), Meyer et al. (2012Meyer et al. ( , 2016 and Meyer and Held (2014) use the following model where k,l is the intensity offset in the spatiotemporal grid (k, l); ( (t), (x, y)) is the grid index in which t, (x, y) is located; (t), (x,y) is a linear predictor of endemic covariates on the grid that contains (t, (x, y)); j is a predictor attached to each infected individual; g and f are the temporal and spatial response functions, respectively; and I * (t, x, y) = {j ∶ t − ≥ t j < t ∧ ||(x, y) − (x j , y j )|| ≤ } with and being positive constants.
Social networks Fox et al. (2016) and Zipkin et al. (2015) use a multivariate Hawkes process to model the mail sent between pairs in a network of officers at the West Point military academy. The difference between these two studies is that the former uses the messages sent by the same sender as a component and the latter uses messages between each pair of officers as a component.

Parametric estimation
We can classify the forms of Hawkes models into three categories: parametric, nonparametric, and semiparametric. The parametric model can be estimated through the MLE method and the EM algorithm.

Likelihood function and MLE
Given the observation data of a spatiotemporal parametric Hawkes model in a space-time window S × T , the likelihood function can be written as where (t, x; ) is the conditional intensity of the process and denotes the parameter vector in the model (Daley and Vere-Jones 2003). We can estimate the model parameters, supposing that they are regular, by maximizing the likelihood above; that is, Rathbun (1996) discusses the asymptotic normality of the MLE for point processes.

Decomposing and reassembling the events: stochastic declustering
Consider a Hawkes process with conditional intensity where (t, x) is the background rate, which is different from the corresponding term in (5) as it can be time dependent, and g(t, x) is the rate of occurrence triggered by an event at time 0 and the location at the origin. The probability that an event, say j, is a background event; that is, the background probability, is and the probability that event j is triggered by another event i is It is easy to see .
which implies that an event is always either a background event or is triggered by a previous event. Another explanation for (19) and (20) is that, once an event, say j, occurs at (t, x), we can say that, at (t, x), we observed j background events, and that for each i = 1, … , j − 1 , event i triggers ij direct offspring at (t j , x j ) . In this way, we separate event j into background and offspring events from previous events (Zhuang et al. 2004). We say that the above probabilities, j , j = 1, 2, … , n, are background probabilities because if we select each event j with probability j , we can realize a Poisson process with rate (t, x) . To prove this point, we need only to show that the compensator for the resulting process is rate (u, x) . For any measurable set B ∈ ℝ d , where X(t, x) is a random field that takes values of 1 and 0 with probabilities (t,x) is a predictable function, then In the above, H s + = ∩ u>s H u represents the history of N up to time s and including s.

Substituting the above equation into (22), we have
That is, the resulting process has a compensator with a deterministic rate (t, x) . Thus, it is a Poisson process.

Expectation-maximization algorithm
A direct use of the background and triggering probabilities is to construct an expectation-maximization (EM) algorithm (Veen and Schoenberg 2008;Li et al. 2019). First, we treat the whole process as a missing data problem. The complete observation for each event j is (t j , x j , j ) , where j takes a value 0 if it is a background event and i if it is triggered by event i. Thus, the complete likelihood for the whole process is The parameters can be estimated with the following EM algorithm: E-Step For each step k, calculate (k) j and (k) ij for j = 1, 2, … , n and i = 1, 2, … , j − 1.
M-Step Maximize the expected log-likelihood: to obtain the model parameters.
The computational complexity of this algorithm is the same as the original MLE method, proportional to n 2 , where n is the number of events in the process.

The nonparametric background rate
The EM algorithm is almost easy to implement when and are both parametric functions, with only some unknowns parameters. However, in many cases, the explicit form of the background rate is usually unknown. Veen and Schoenberg (2008) divide the whole study region into several subregions, each with a constant background rate; that is, they assumed that the background rate was a 2D piecewise function, with all values for the background rate in each subregion being parameters to estimate. Using the MLE method, the background rate k in each subregion can be estimated as follows: Suppose the background rate where K is the total number of subregions and the whole area A = ∪ K k=1 S k , S k ∩ S l = ∅ for 1 ≤ k ≠ l ≤ K . Then, and yield Multiplying both sides by ̂k and rearranging the terms, we have As discussed in Sect. 3.2, j and ij , i = 1, 2, … , j − 1 , quantify how event j is sliced into background and offspring from previous events. Equation (31) in fact provides a histogram estimator of the background rate function in an iterative manner.
We can modify such histogram estimates easily into kernel estimates. For example, Zhuang et al. (2002Zhuang et al. ( , 2004) use a weighted kernel function estimate of (⋅, ⋅) in combination with variable bandwidths where h j is the bandwidth for the kernel function corresponding to event j, equal to its distance to the n p th closest neighboring event, n p = 2-15. When is also time dependent, when we can also estimate it using a weighted kernel estimation, for example, as follows where Z (t) h t (⋅) is the temporal kernel with bandwidth h t .

The nonparametric triggering term
We can estimate the triggering term g(⋅, ⋅) by where the denominator is for normalizing purposes, Z h is the Gaussian kernel with bandwidth h, and ij is as defined in (20).
We can verify the estimator above in the following way. First, the spatiotemporal version of (2), holds for any predictable process f(t, x), any given time interval [T 1 , T 2 ] , and any area S, provided that the integral on either side exists, or that f is nonnegative. Second, let Regarding t, x, t i , and x i as fixed and substituting where Δ x and Δ y are small positive numbers. These estimates can be revised into their kernel function version correspondingly.

When both the background rate and triggering term are nonparametric
In the above, when estimating (t, x) and g(t, x), we need to know i and ij , and when estimating i and ij , we need to know and g. We can resolve this loop using an iterative algorithm. Given an observed process of events {(t i , x i ) ∶ i = 1, … , n} in a time-space window T × S , by guessing some initial value of and g, we obtain i and ij for all possible i, j. Then, we estimate the background rate and each component in the clustering part g using i and ij with some nonparametric methods, such as kernel estimates or histograms. Once we update and g, we go back to the step of calculating ⋅ and ⋅⋅ , or stop if convergence is reached. In summary, the iterative estimation procedure includes the following three integrants: Algorithm 1

Stochastic declustering (Expectation).
Calculate the background probability and triggering probabilities.

Reconstruction (Maximization I).
Estimate the nonparametric function in the model using nonparametric methods such as kernel functions. 3. Parametrization (Maximization II). Use the MLE method or EM algorithm to estimate the parameters in the parametric functions.
In an iterative algorithm such as the one above, the feedback should be negative. However, we cannot be assured of negative feedback if we use a nonparametric estimate. For example, consider that we use only to estimate and g in the conditional intensity If all ij , j = 1, … , N and i < j increase at some time, then we obtain a large ĝ(t) , which yields a larger ij in the next step. This positive feedback finally yields a trivial solution with ̂(t) = 0 . To avoid this positive feedback, Zhuang et al. (2002) and Zhuang and Mateu (2019) introduce relaxation coefficients to prevent positive feedback. Instead of (43), they use as the conditional intensity function, and the estimates for and g become with restrictions ∫ T 0̂( t) dt∕T = 1 and ∫ ∞ 0ĝ (t) dt = 1 . In the above equations, the parameters and A are the relaxation coefficients and are estimated by the MLE or EM method, as given in Sects. 3.1 and 3.3.
Introducing the relaxation coefficients changes Algorithm 1 into the following new algorithm.

Reconstruction (Maximization I).
Estimate the nonparametric function in the model using nonparametric methods such as kernel functions. 3. Parametrization (Maximization II). Use the MLE method or EM algorithm to estimate the parameters in the parametric functions and the relaxation coefficients for the nonparametric functions.

Choice of bandwidth
Bandwidth selection is always an unavoidable topic with kernel estimation. Many previous works discuss fixed bandwidth, such as Silverman's rule of thumb in Silverman (1986) and Scott (2009). Alternatively and technically, this can be done with the cross-validation method (e.g., Hall and Wehrly 1991) or using the forward predictive likelihood (e.g., Chiodi and Adelfio 2011). In principle, the bandwidth should be selected according to data resolution, which is at the order of 1-10 times of the nearest neighboring distance. Zhuang (2011) reduces n p to 3 or 4. Xiong et al. (2019) compare two variable bandwidth kernel estimates, with their optimal bandwidth parameters obtained using cross-validation, with two other nonparametric methods, a Bayesian smoothing procedure on a tessellation configuration with smoothness prior, and a newly proposed incomplete centroidal Voronoi tessellation method. They find that the performance differences among the methods are marginal in estimating the seismicity rate.

Boundary effect correction
The second problem is boundary correction. In most cases, observations lie within some specific range, while the kernel function distributes the mass of an event over a much larger, or even infinite, range. The observation range makes the total weight for an event not equal to 1 and the estimates for the values near the boundary depend on the shape of the boundary. In this study, we adopt the weight-based correction (Hall and Turlach 1999), which can also be called the truncated kernel function.
For each event, we use a truncated density as the kernel for event i, for which S is the support range of observation.

3 5 Extending the Hawkes model
We should note that a useful model reflects only partial information about the observation. One of the tasks in statistical analysis is to extend a model such that it includes more predictable information. This requires not only for statistical analysis, but also to obtain a good understanding of the observed phenomena. In this section, we provide two examples to illustrate how to extend the Hawkes model to accommodate the need for better modeling the data.

Example 1: Building more physics into the model-incorporating earthquake source geometry into seismicity modeling
In the space-time ETAS model, all of the earthquake events are regarded as a point in space-time-magnitude domain. In fact, the rupture of each earthquake has a spatial extension on the earthquake fault, from several kilometers for a M5.5 event up to about 500 km for a M8.0 event. When we regard the focal source of an earthquake as a point and describe the spatial response to the triggering effect an isotropic function of distance from the epicenter of the parent location, as in (13), biased results will be obtained in data analysis, since aftershocks are usually distributed along the rupture fault, especially for large earthquakes. For example, Hainzl et al. (2008) discuss the impact of the rupture extension of the 1992 M7.3 Landers, California earthquake on parameter estimations of the point-process model by comparing the results from space-time and purely temporal ETAS models. They find that ignoring the rupture extensions of earthquakes and assuming an isotropic aftershock response could lead to a significant bias in the parameter estimations, especially an underestimated value. Several researchers attempted to correct such biases. As early as in 1998, Ogata (1998) suggested that the aftershock rate is spatially elliptically distributed and that the centroid of the ellipsoid formed from the aftershock cluster should be used as the location of the main shock instead of the initial fracture point. Considering such anisotropy, (13) was replaced by a general bivariate Guassian density and takes different values for different clusters. The aftershock cluster from each main shock was determined by the MBC algorithm (Ogata et al. 1995). The modified elliptic distribution model outputs better results for observed seismicity. However, the MBC algorithm is quite subjective and empirical in terms of dividing aftershock clusters, and the location of the main shock is still a point, either the epicenter or the centroid of the aftershock epicenters. Marsan and Lengliné (2010) and Bach and Hainzl (2012) propose an alternative treatment for the anisotropic response as a function of the distance to earthquake fault, instead of a function of the epicenter distance.
Instead of regarding large earthquakes to have point sources, Guo et al. (2015Guo et al. ( , 2017) develop a finite-source ETAS model to incorporate the spatial extensions of their ruptures. Each earthquake rupture consists of many small patches, and each patch triggers its own aftershocks isotropically and independently as a usual mainshock. The superposition of triggering effects from all patches produces an anisotropic pattern of the aftershock locations, mainly distributed along the rupturing fault. In mathematics, the spatial response of the production of direct offspring to a large earthquake with source body S i is where i (u) is the productivity density at location u in the focal zone and the numerator is used for normalization. The model parameters, the unobserved fault geometry, and the background rate are estimated simultaneously through an EM-type iterative algorithm similar to Algorithm 1. We can use their treatment to invert the earthquake fault from seismicity.
In the estimation, the productivity density (u) is also implemented using stochastic reconstruction, which has a histogram version of estimation of where represents the probability that event k is triggered by patch C j , which contains location u in the rupture of event i. Similar to the background rate , the function can also be estimated iteratively (Guo et al. 2017;. This model has been applied to seismicity in different regions, such as Southwestern China (Guo et al. 2015), Japan (Guo et al. 2017), and Italy . Figure 1 shows the surface projection of the spatial variation of the productivity density along the Nocia earthquake (2016-10-30 08:40 local time, 13.11 • E, 42.83 • N, Mw 6.5) rupture fault obtained by . The main coseismic slip area is to the updip from the hypocenter. The major parts of the direct aftershocks are located on the north and south of the area with biggest coseismic slip. These patches of high productivity situated are close to but do not overlap with the area of the high coseismic slips. The feature has been observed for many large earthquakes.

Example 2: Complexity in the background rate: crime modeling
This example is related to the model used by Molher and others in modeling crime behavior. Studies using crime data based on point processes do not consider the periodic components in the background rate. Since criminals are also human beings, their behaviors should be controlled by their biological clock and could be influenced by periodic social activity (Felson and Boba 2010). Thus, studies should account for periodicity, for instance, daily periodicity and weekly periodicity, to build a more precise model.
Zhuang and Mateu (2019) use the following space-time point process model to describe the robbery data in Castollen, Spain, which they specify completely using a conditional intensity function where A and 0 are the relaxation coefficients to estimate, the average values of the trend term t (t) , daily periodicity d (t) , weekly periodicity w (t) , and spatial background heterogeneity b (x, y) are all normalized to 1, and the temporal response g 1 and the spatial response g 2 are normalized as p.d.f.s.
Though we cannot estimate the periodic components of the background rate in our model formulation directly using the stochastic reconstruction method, we can solve this problem using the spatiotemporal version of (2). Given a realization of the point process {(t i , x i ) ∶ i = 1, 2, … , n} in a time-space range [T 1 , T 2 ] × S , where t and x = (x (1) , x (2) ) denote time and location, respectively, the long-term trend term g 1 (t − s) h(x − u) dN(s, u),   Comparison between the pattern of the productivity of direct offspring along the rupture areas (contour images) inferred by the finite-source ETAS model and coseismic slip (contour lines) for the Norcia earthquake (2016-10-30, Mw 6.5). The values of the coseismic slip from zero to the maximum for each event are indicated by contour lines from green to red in rainbow colors. The red stars represent the epicenter of the corresponding major earthquake, and the blue dots represent the locations of towns and cities in the area. The small black dots mark the locations of small events that occurred shortly after the corresponding major events. The traces of active faults are also plotted in black lines. The kinematic model adopted for the earthquake of Norcia (2016-10-30) is based on Chiaraluce et al. (2017) and f (t, x) = w (t) (t, x) , and substitute f into (35). Then, assuming that t is smooth enough, where Δ t is a small positive number. For ease of writing, define Then, Similarly, we can reconstruct the other components in the background rate as follows and where (50)  , y) . c Estimated trend function. d Estimated weekly periodicity. e Estimated daily periodicity. f Estimated temporal response function. g Estimated spatial response function