Parallel Evaluation of Chebyshev Approximations: Applications in Astrodynamics

Approximations based on Chebyshev polynomials have several astrodynamic applications. The performance of these approximations can be improved by parallel implementations exploiting parallel architectures, such as OpenMP and CUDA. In this paper, we introduce the parallel implementation to two astrodynamic applications. The first is the gravitational finite element model (FEM): a piecewise Chebyshev approximation that replaces high degree and order gravitational spherical harmonic models (SHMs). Thus, much lower degree, locally valid functions can efficiently model and compute local gravity perturbations in parallel structure for efficient performance. For this model, the total gravity acceleration is split into a reference and disturbance term. The reference includes two-body plus J2\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$J_2$$\end{document}, which are relatively cheap to compute. The FEM approximates the higher-order gravity terms. It is developed from a 2D mesh grid covering a sphere of a specified radius, and a family of spherical shells is sampled using a cosine distribution in the radial direction. To reduce the required memory when seeking a specific accuracy, an adaptive version of the gravitational FEM is introduced. In addition, a parallel implementation of the FEM using OpenMP is preseneted. We show the runtime comparison for the 200 degree × 200 order EGM2008 SHM and the serial and parallel equivalent FEM algorithms. The other application is the Chebyshev-Picard method (CPM): a numerical integrator that solves an ordinary differential equation by approximating the integrand using a Chebyshev approximant and iterates over the trajectory via Picard iteration. A parallel CUDA implementation of the CPM method in conjunction with the EGM2008 SHM and the FEM is introduced. We present numerical examples for propagating four Earth-orbiting satellites considering both the 200×200\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$200\times 200$$\end{document} EGM2008 SHM and the equivalent FEM representation to test the algorithm’s performance via parallel and serial computation (i.e., a single CPU thread).


Introduction
Function approximation is concerned with approximating a certain "target" function, with another "approximating" function that closely matches the "target/true" function in a specific manner. It is beneficial when the target function is supposedly "difficult to work with". A typical example is solving differential equations that do not have closed-form solutions. Spectral methods address this problem by approximating the solution u( ) ∈ ℝ using a sum of N + 1 spectral basis functions i ( ) , e.g. Chebyshev polynomials and Legendre polynomials, i.e. u( are the coefficients of the expansion. For orthogonal polynomials, the coefficients are evaluated by projecting the function u( ) onto each basis function i ( ) . In contrast to spectral methods that use high-order global basis functions over the entire domain, finite element methods divide the domain into a finite number of sub-domains and use low-order local basis functions. Therefore, finite element methods are computationally more efficient, especially when low and medium accuracies are sought. However, their memory requirements are significantly higher than those of the spectral methods. For the spectral and finite element methods, the orthogonal Chebyshev polynomials are commonly used as basis functions due to the fast convergence and completeness of the Chebyshev series [1]. In addition to their continuous orthogonality, Chebyshev polynomials are discrete orthogonal for two sets of nodes (their roots and extrema). Thus, no numerical matrix inversion is required to calculate the coefficients when interpolating using these nodes. Another advantage of using these sets is that the interpolating error has the same order of magnitude as truncating the Chebyshev series. We show two astrodynamics applications that discrete Chebyshev approximation: the gravitational finite FEM and the Chebyshev-Picard Method (CPM) numerical integrator. The performance of the approximation, measured by the execution time, is improved by employing parallel architectures, such as OpenMP and CUDA. Thus we introduce parallel implementations to the gravitational FEM and the CPM in conjunction with the gravitational SHM and FEM.
Parallelism is inevitable since the heat dissipation impedes the increase of processor clock speed. Therefore, even personal computers are equipped with parallel processors, and some of them have graphical processing units (GPUs) that can be used for general-purpose processing. In this paper, we use two different application programming interfaces: OpenMP and CUDA. The main difference between them is that OpenMP does all processes on the central processing units (CPUs), whereas CUDA runs the serial part of the code on a single CPU and the parallel part on GPU. In astrodynamics, different levels of parallelism could be used. For instance, if we have a catalog of a large number of space objects, we can propagate each object's orbit in parallel. This paper focuses on two other levels of parallelism that can exploit the existence of parallel computer architectures even in personal computers. The first level is parallelism across the ODE function. The gravitational FEM is suited for this level of parallelism since it is based on a multivariate Chebyshev approximation that can be evaluated in parallel. We use OpenMP to accelerate the evaluation of the FEM. The second level of parallelism is across the integration method. Unlike the current state-of-the-use numerical integrators, such as the multistep Gauss-Jackson method [2] and the explicit single-step DORPRI8 method [3], implicit integration methods are well suited for parallelism, since the ordinary differential equation (ODE) at each node is evaluated in parallel. We utilize CUDA to accelerate the CPM in conjunction with the Earth gravitational model 2008 (EGM2008) SHM and the equivalent FEM.
The Earth's gravitational force is the primary force affecting the motion of the Earth-orbiting satellites. The SHM is standard for evaluating gravitational potential and acceleration (potential gradient). It is a spectral method that represents the potential as a series expansion of associated Legendre polynomials. The potential at a given point is a function of its coordinates, the maximum degree and order of expansion, and the gravity model used. A gravity model includes the Earth's constants and the coefficients of the series expansion. With the availability of satellite observations, gravity models have increasingly evolved by increasing the number of coefficients and accuracy. The current standard model is the EGM2008 of maximum degree 2190, and maximum order 2159 [4]. In many applications, a low degree and order gravity field is sufficient due to the uncertainties in modeling of non-conservative forces, including drag and solar radiation pressure. Considering high degree and order gravity fields is particularly needed for applications such as precise orbit determination to measure aerodynamic drag for re-entry of satellites. Even with a low degree and order gravity fields, evaluating the spherical harmonic series is computationally intensive compared to other forces. To reduce the computational cost, Junkins developed a locally valid FEM gravity representation that replaces a degree 23 spherical harmonic series by a 3rd order expansion [5]. Bani Younes and Junkins revisited this work using multivariate discrete Chebyshev approximations [6,7]. Since they include the boundaries, the Chebyshev extrema (also called: Chebyshev-Gauss-Lobatto (CGL)) were used as the collocation nodes. This model reduces the computational cost with equivalent accuracy, up to 10 significant digits of the 200 × 200 EGM2008 SHM.
Bani Younes [6] pointed out that further improvement in the computational memory can be sought by eliminating noncontributing coefficients in the FEM. Building on that and instead of using a fixed number of Chebyshev expansion terms for all elements, we introduce an adaptive FEM that allows the choice of the number of Chebyshev expansion terms to be included to maintain a prescribed accuracy tolerance. In other words, all elements contribute nearly evenly to the approximation error. Since, over their orthogonality domain [−1, 1] , the absolute values of Chebyshev polynomials for all degrees are bounded by 1, we remove coefficients below the tolerance. If the prescribed tolerance is 10 −10 , this adaptive technique decreases the coefficients' memory from 940 Megabytes to 430 Megabytes. More reductions are obtained when lower accuracies are desired. The performance of the FEM can be further accelerated by evaluating the Chebyshev approximation in parallel. One essential contribution of this paper is to investigate and address possible extensions and enhancements of the FEM approach using OpenMP acceleration of the computation for higher fidelity gravity potential. OpenMP is a library for parallel programming in the symmetric multi-processors model that offers the optimal balance of execution cost versus the interconnection of the memory network. The results in this paper show that about 3x speedup of the parallel implementation to the FEM compared with the serial implementation is obtained using eight threads.
The other application of Chebyshev approximations, discussed in this paper, is solving ODEs by approximating the integrand using a Chebyshev series. Clenshaw [8] first used Chebyshev approximation to solve linear ODEs numerically. This method is limited to solving differential equations whose coefficients are polynomial functions of the independent variable. To avoid this restriction and enable solving nonlinear ODEs, Clenshaw and Norton [9] proposed using Picard iteration to iterate for the solution while using a Chebyshev series expansion to approximate the integrand and the trajectory. At each iteration step, they integrated the basis functions term-by-term to establish a recursive trajectory approximation technique, hence the name Chebyshev-Picard method (CPM). In this paper, we present its matrix-vector derivation using the Chebyshev integration matrix. Moreover, we introduce a simple formula, based on the Kepler's equation, that allows using fixed true anomaly steps instead of fixed time steps to integrate the orbit problem. This formula improves the approximation since the orbital motion is much faster and more nonlinear near perigee than near apogee. In general, many physical problems have peculiar characteristics that, if considered carefully, will lead to improved approximation.
Since the approximation is available at once on each Picard iteration step, the CPM integrand evaluations are independent since the approximation is available at once on each Picard iteration step. The feasibility of parallel computation using the Chebyshev-Picard methods has been studied by Shaver [10] and Fukushima [11]. Shaver [10] made the first serious study of parallel computation with a variant of CPM. After that, Fukushima [11] implemented a Chebyshev-Picard algorithm on a vector computer. However, for one example problem, the vector code of Fukushima was shown to be slower than the scalar code, thus leading the author to surprisingly conclude that his vectorized code led to additional overhead and an inefficient implementation. One decade later, Bai [12] implemented a vector/matrix form of the CPM method to solve Keplerian motion using CUDA. She achieved 2-3x speedup compared with a CPU version of the algorithm and pointed out that more speedup might be achieved for a problem with more complicated force functions and a larger number of required iterations. Motivated by Bai's work, Koblick [13] achieved a 2x speedup when compared with other solvers by implementing CPM using Matlab's Parallel Computing Toolbox for a high degree and order gravity model. Lately, Probe et al. [14] used CUDA to implement the vector/matrix form of CPM in a segmentation scheme considering a high degree and order gravity field. However, the maximum degree and order of the gravity field for their implementation are 150. In this work, we extend the work of [8,[14][15][16] to allow the choice of the arbitrary degree and order of the considered gravity field. Moreover, a CUDA implementation to the FEM model is added. We also compare the performance of the CUDA implementation of CPM to its serial version employing the SHM and FEM gravitational acceleration models.
Chebyshev-Picard method is a collocation method that employs CGL points as the collocation nodes and fixed-point iteration to solve the resulting nonlinear system of equations. Collocation methods are a subset of implicit Runge-Kutta (IRK) methods where each set of collocation nodes defines a particular method. Gauss-Legendre (GL) sets are optimal since the order of accuracy of GL-IRK methods is twice the number of nodes, and no other method can have a higher accuracy order 17. This fact motivated many researchers to investigate using the GL-IRK method in orbit propagation. Jones [18] developed a variable step GL-IRK method for first-order systems, known as VGL-s. He showed that VGL-s requires fewer function calls than DORPRI8 for near-circular orbits and matches it for a highly eccentric Molniya orbit. For second-order systems, Aristoff et al. [19,20] introduced a variable-step GL-IRK, dubbed VGL-IRK. Compared to DORPRI8, VGL-IRK is significantly more efficient than DORPRI8 for low and highly eccentric orbits. Similarly, CPM can be considered an IRK method that uses CGL nodes as the collocation nodes and fixed-point iteration to solve the resulting nonlinear system of equations. The advantages of using CGL over GL nodes are discussed in Section 4.1. The CUDA implementation to CPM can be slightly modified to parallelize GL-IRK methods.
The key contributions of this paper are as follows. First, we introduce a parallel scheme for FEM integrated with CPM and implemented in both CPU and GPU. Two levels of parallelism are implemented; a) parallel FEM is implemented in the OpenMP platform, and b) parallel CPM integrated with FEM and SHM is implemented in CUDA. The results show about 3x speedup gain compared to the serially implemented FEM and about two orders of magnitude speedup gain compared to the SHM. Second, we develop an adaptive FEM approach to reduce the computational memory by eliminating noncontributing coefficients in the approximation while maintaining ten significant digits accuracy. Third, we introduce a segmentation scheme based on the true anomaly as the independent variable instead of fixed time steps to improve the approximation convergence. To test the performance of the FEM and CPM, several orbit cases are employed. The performance is quantified by the execution time required to achieve a given accuracy. The accuracy is defined by the Hamiltonian integral equation, conservation of energy, adopted by Bani Younes [6].
The paper is organized as follows. Section 2 reviews the properties of Chebyshev polynomials and discusses the univariate and multivariate Chebyshev approximations. Section 3 gives an overview of the gravitational FEM model and the adaptive algorithm. It also introduces the parallel OpenMP algorithm. In Section 4, we introduce the vector-matrix derivation to the Chebyshev-Picard method, the time-stepping based on fixed true anomaly steps, and the CUDA implementation. The results are given in Section 5. Finally, we present the conclusions in Section 6.

Basic Properties of Chebyshev Polynomials
Chebyshev polynomials are a sequence of polynomials that are orthogonal with respect to the weight function w( where T i ( ) denotes the Chebyshev polynomial in of degree i. On the interval ∈ [−1, 1] , this polynomial is also defined by the trigonometric relation Chebyshev polynomials obey the three-term recurrence relation with T 0 ( ) = 1 and T 1 ( ) = .
In addition to their continuous orthogonality, Chebyshev polynomials are discrete orthogonal for two sets nodes (their zeros and extrema) on the interval [−1, 1] . The extrema / CGL nodes are preferred because they include the endpoints and, in contrast to other orthogonal polynomials nodes, are explicitly computed, rather than being roots of a transcendental equation. The M + 1 CGL nodes are given by The discrete orthogonality conditions of the Chebyshev polynomials for the M + 1 CGL nodes of a maximum degree N are The cosine set of nodes has high nodal density near the ±1 domain boundaries that compensates and approximately minimizes the Runge phenomena [21]. The Chebyshev polynomial of an order i ≤ N evaluated at the M + 1 CGL nodes can be directly computed from Eq. 1 as For later use, we define the Chebyshev regression matrix ∈ ℝ (N+1)×(M+1) as follows When used in numerical integration, Chebyshev approximations exploit the integration property that allows expressing the indefinite integral of T i ( ) in terms of Chebyshev polynomials [22]. For i ≥ 2, and for i = 0 and i = 1 , this integral equals T 1 ( ) and 1 4 T 2 ( ) + 1 4 T 0 ( ) , respectively.

Univariate and Multivariate Approximations
Let f ( ) be a univariate Lipschitz continuous function in defined on the interval [−1, 1] 1 , then it can be represented by a unique Chebyshev series, This series converges to the approximand absolutely and uniformly [23]. Due to the orthogonality of Chebyshev polynomials, the coefficients of this series, i , are given by The Chebyshev projection, p N ( ), is a polynomial approximant resulting from truncating the Chebyshev series at degree N [24], i.e., Then, it follows that the truncation error f − p N = ∑ ∞ i=N+1 i T i ( ) converges to 0 uniformly and absolutely. The smoother the function is, the faster the truncation error converges as N → ∞ . In particular, if the function f is m times continuously differentiable with f m is of bounded total variation B, then ||f ( ) − p N ( )|| = O(N −m ) 2 and the coefficients for i ≥ m + 1 satisfy Furthermore, if the function is analytic, the coefficients i decrease geometrically, and the truncation error is the same order-of-magnitude as the last coefficient retained [1].
To avoid the integration in computing the coefficients as given in Eq. 8, another approximation of the function f is obtained by interpolation in M + 1 CGL nodes where N ≤ M . The Chebyshev interpolant f N ( ) is given by where the coefficients a i , chosen to satisfy the discrete orthogonality conditions of Eq. 3, are computed directly as A primary advantage of the interpolation in the CGL nodes is that the interpolation error is within a factor of 2 of the truncation error [23]. It follows that f − f N converges uniformly and absolutely [23]. It also follows that if the function f is m times continuously differentiable with f m is of bounded total variation B, then The coefficients are computed from a Kronecker product (⊗) of d univariate Chebyshev interpolation matrices as [6] Here a ∈ ℝ ) , and N i + 1, M i + 1, i , and i are the number of coefficients, number of CGL nodes, the univariate Chebyshev interpolation matrix, and the weight matrix associated with i , respectively.

Gravitational Finite Element Model
The classical solution to Laplace's equation for gravity is adopted using the globally valid spherical harmonic gravity potential model, defined by where r is the geocentric radius (i.e., distance from the Earth's center to the typical point near the Earth), and are the geocentric (geographic) latitude and longitude respectively, = GM is the Earth's gravitational-mass constant, R ⊕ is the Earth equator radius, C m n and S m n are spherical harmonic gravity coefficients, and P m n are the fully normalized associated Legendre polynomials of degree n and order m. The acceleration can be obtained from the potential using the classical gradient relationships in spherical or rectangular coordinates as follows: The classical solution to Laplace's equation for gravity is adopted using the globally valid spherical harmonic gravity potential model. However, this approach reveals two major challenges [5]: 1. Choosing a finite upper limit of the series defines the accuracy (the more we know about gravity, the more terms are required and the more it costs to compute acceleration). 2. Convergence is very inefficient and slow for n > 2 , so, for the current state of the art, tens of thousands of terms are required to obtain a sufficiently high accuracy global gravity representation.
Given the slow convergence of global gravity models, previous work truncated the classical expansion at n = 2 and introduced a FEM local gravity representation in the anticipation that much lower degree functions can be used to efficiently model and compute locally valid gravity perturbations. Applicable to both irregular and near-spherical-shaped bodies, methods in this class expedite computations by effectively trading computer memory for runtime speed. First proposed by Junkins in 1976 [5] and explored by [6,7], Geopotential FEM interpolation methods have been bolstered recently by the extraordinary memory resources of standard computers.

FEM Representations of the Geopotential Acceleration
The FEM is fully developed in [6] using the classical discrete Chebyshev polynomial approximation to solve for a piecewise approximation for the high-degree and order spherical harmonic gravity models. The higher-order perturbations in the EGM2008 spherical harmonic model were replaced by a finite element model orthogonal approximation [6,7]. Thus, much lower degree, locally valid functions are used to efficiently model and compute local gravity perturbations. The total gravity acceleration is split into a reference and a perturbative term. The reference includes two-body plus J 2 contributions, which are relatively cheap to compute, and the higher-order gravity terms are considered perturbative. This perturbative term is approximated using a multivariate Chebyshev interpolant.
The FEM consists of multiple FEM grids. A sphere of radius R ⊕ is covered by a 2D mesh ( 4 × 4 ) degree ( , ) ∶ 0 ≤ ≤ 360 • ; − 88 • ≤ ≤ 88 • cellular FEM grid, except for the polar caps of angular radius 2 degrees. At arbitrary r r min = R ⊕ ≤ r ≤ r max = 7R ⊕ . Observing Fig. 1, since the gravity field within the range of interest [ r min = R ⊕ up to r max = 7 R ⊕ ] results in significant differential changes along the radial position r, it becomes elegant to subdivide the  [6] problem into spherical shells based on the spectral content of gravity perturbations in each region. For instance, nearest the Earth and within the Earth's atmosphere (r ≤ 1.02 R ⊕ ) , the gravity perturbations are strongest, and due to the atmospheric drag, the overwhelming majority of orbit computations occur outside this spherical shell. This observation motivates the adoption of two spherical regions as shown in Fig. 2: (1) atmospheric region r ∈ R ⊕ , 1.02 R ⊕ , and (2) exo-atmospheric region r ∈ 1.02 R ⊕ , 7 R ⊕ .
A family of spherical shells is sampled using a cosine distribution in the radial direction. For each 2D cell on a particular shell, gravity is approximated using a bivariate Chebyshev interpolant. A 16th degree polynomial is used in Region I ∈ [R ⊕ , 1.02R ⊕ ] and a 13th degree polynomial in Region II ∈ [1.02R ⊕ , 7R ⊕ ].
Let the spherical coordinates (r, , and ) be respectively transformed into ( , , and ) where −1 ≤ , , ≤ +1 with the sample points located according to the cosine distribution. Note that the transformed position (r) obeys smart cosine-like transformation as a function of the transformed radial variable ( ). The number of radial samples is 10 in the first region and 38 in the second region. The Chebyshev interpolant to the perturbative gravitational acceleration is given by Here i,j,k ∈ ℝ 3 is a vector of FEM coefficient and N r , N , and N are the degrees of the interpolant for the r, , and directions, respectively. The approximant N r ,N ,N ( , , ) is added to the two-body and J 2 terms to compute the total gravity acceleration [6].

Adaptive FEM
The main drawback of the FEM is a large memory needed. The memory can be significantly reduced by removing noncontributing coefficients, i.e., removing all coefficients with absolute values less than a given tolerance. Thus, all grids contribute evenly to the approximation error. If the tolerance is 10 −10 , then there is about a 64% coefficient saving for each component in the first region, r ∈ [R ⊕ , 1.02R ⊕ ] , and about a 75% coefficient saving for each component in the second region, r ∈ [R ⊕ , 1.02R ⊕ ] . The total number of coefficient savings is shown in Tables 1 and 2. This is a huge memory saving since only 36% of the original set of coefficients for the first region, and 25% for the second region are required to produce an approximation that satisfies this desired accuracy level.
The benefit of this optimization is efficiently reducing the memory size associated with the interpolating coefficients and subsequently reducing the runtime. The coefficients storage is reduced from 940 Megabytes to 430 Megabytes.

OpenMP Implementation
Open Multi-Processing (OpenMP) is an application programming interface that supports shared-memory parallel programming in C/C++ and Fortran. The parallel structure in OpenMP consists of a master thread and slave threads. The master thread is a series of instructions executed consecutively. It forks off a specified number of slave threads that execute blocks of code in parallel, as shown in Fig. 3. These threads run concurrently, with the runtime environment allocating threads to different processors. Figure 4 presents a flowchart for the OpenMP FEM algorithm. The inputs are the spherical coordinates ( r, , and ). Based on these coordinates, the 2D cells are selected, and they are mapped to the normalized parameters , , and which, in turn, are used to evaluate the univariate Chebyshev polynomials  Then, the primary thread is forked off to evaluate the multivariate Chebyshev interpolant N r ,N ,N ( , , ) in parallel. The threads then join back to evaluate the FEM acceleration by adding the spherical and J 2 terms.

Derivation of the method
An initial value problem (IVP) seeks the solution of y(t) that satisfies the ordinary differential equation (ODE) and the given initial condition (t 0 ) = 0 , where t 0 is the initial time and t f is the final time. From the fundamental theorem of calculus, Eq. 18 can be arranged without approximation to obtain the integral equation ) is continuous and Lipschitz ∀ ∈ ℝ p and ∀t ∈ [t 0 , t f ] . However, if the function is only locally Lipschitz, then the solution is unique over an interval [t 0 , t 0 + ] for some > 0 [26]. If these conditions hold, then the sequence of approximate solutions { j } ∞ j=1 , generated by Picard iteration, converges to that unique solution (t) . For Picard iteration, the i th approximate solution is given by In other words, the Picard iteration method can solve an IVP by repetitive integration of the integrand. Nevertheless, this method is not well suited for solving broad ODEs or problems with complicated integrands.
To remove the limitation of Picard iteration, the Chebyshev-Picard method (CPM) approximates the integrand using a Chebyshev interpolant, allowing using the integration property to iterate for the solution. To do so, the trajectory must be transformed from the time domain t ∈ [t 0 , t f ] onto the domain ∈ [−1, 1] . The transformation is given as follows: This transformation is applied to the ODE in Eq. 18 to transform from (t, (t)) to a new scaled form of the ODE ( , ( 1 + 2 )): and, analogous to Eq. 20, Picard iteration for this transformed system becomes By approximating the integrand using a Chebyshev interpolant of degree N, we get By exploiting the integration property in Eq. 6, the approximate solution evaluated at the CGL nodes is given by is the integration operator matrix defined by 2 ∈ ℝ (N+2)×(M+1) is the Chebyshev regression matrix with one more row representing the increase of the polynomial degree by one, and ∈ ℝ (N+2)×(M+1) is the polynomial matrix at the lower limit whose i-j element (i.e., the ith row, jth column element) is given by Substituting Eq. 13 into Eq. 24 gives , and is the Chebyshev integration matrix defined by Note that the matrix does not depend on the function but only on the number of nodes M + 1 and the degree of the Chebyshev interpolant N. Since the CGL nodes include the endpoints, the approximate solution at t f is the last row of Y j .
CPM belongs to the family of collocation methods for which the integrand is approximated with a polynomial interpolating a set of nodes within the interval. This family also contains Gaussian quadrature-based implicit Runge-Kutta (IRK) (t, (t)) = 2 ( , ( 1 + 2 )), methods. A quadrature formula approximates a definite integral of a function in over the interval [−1, 1] by a weighted average of the function evaluated at a set of distinct nodes. That is, where w( ) is a weight function, i are the quadrature nodes, and w i are the quadrature weights. A natural choice for the nodes is to use equidistant nodes over the interval [−1, 1] , and this leads to Newton-Cotes methods [17]. The N + 1 Newton-Cotes formula exactly integrates all polynomials of degree N. However, Newton-Cotes formulae do not converge for all integrands [27]. In contrast, Gauss quadrature formulae converge for all continuous functions, and an (N+1)-node Gauss quadrature formula integrates exactly 2N + 1 polynomials. The nodes for this formula are the roots of a polynomial of degree N of the set of polynomials which are orthogonal with respect to the weight function w( ) . For instance, if the weight function w( ) ≡ 1∕ √ 1 − 2 , then the roots of the N + 1 Chebyshev polynomials are employed. In contrast, since the Legendre polynomials are orthogonal with respect to the weight function w( ) ≡ 1 , Gauss-Legendre methods use the roots of Legendre polynomials to integrate functions in the form ∫ 1 −1 f ( )d . Several authors have studied the application of Gauss-Legendre IRK (GL-IRK) methods in astrodynamics [18,20,[28][29][30], and the numerical tests showed that GL-IRK methods are faster than classical integrators in several cases.
On the other hand, the Clenshaw-Curtis quadrature is the base for CPM. This quadrature method approximates integrand by a Chebyshev interpolant and integrates the partial sum of the interpolant series. In contrast to Newton-Cotes formulae, Clenshaw-Curtis formulae converge for all continuous integrands [23]. Clenshaw-Curtis quadrature, however, is not as accurate as Gauss-Legendre quadrature since the (N+1)-node Clenshaw-Curtis formula can exactly integrate polynomials only up to only degree N . Despite that, Trefethen [27] showed with numerical examples that Clenshaw-Curtis and Gauss-Legendre have approximately the same rate of convergence for nonanalytic integrands. Aside from the accuracy of integration, Clenshaw-Curtis formulae have several advantages over Gauss-Legendre ones. First, Gauss-Legendre nodes are never nested; whereas Clenshaw-Curtis formulae are nested since the nodes of the formula of degree N are a subset of the nodes of the formula of degree 2N. This property is beneficial in estimating the accuracy without the need for additional integrand evaluations. Second, the indefinite integration property, as given in Eq. 6, allows the evaluation of the quadrature weights analytically in terms of Chebyshev polynomials. Therefore, the solution forms a Chebyshev series. Last but not least, the Clenshaw-Curtis nodes include the endpoints, and these nodes can be evaluated analytically. Comparing the performance of Gauss-Legendreand Clenshaw-Curtis-based integrators for solving the orbit problem will be examined in future work.

Time Segmentation
The approximation error, f − f N , depends on the smoothness of the function, the interval, and the degree of the approximation. For a given required accuracy and a fixed degree of the Chebyshev approximation, the domain may be split into a number of segments that contribute evenly to the integration error. For the orbit problem, using equal time segments is far from optimal for eccentric orbits since the orbit is faster and feels more nonlinear anomalies at the perigee. Instead, using segments of equal true anomaly allows taking shorter time steps are needed near the perigee (see Fig. 5).
At a given time t k , true anomaly (t k ) , and true anomaly step (d ) . the true anomaly at t k+1 is given by The eccentric anomaly E(t k+1 ) is then calculated via where e is the instantaneous eccentricity. The mean anomaly can be obtained by the Kepler's equation: Then, it follows that the corresponding time segment is given by where Tp is the instantaneous orbit period.

CUDA Implementation
Compute unified device architecture (CUDA) is a parallel computing platform and programming model developed by Nvidia for general computing on its own graphics Fig. 5 Schematic of moderately eccentric orbit divided into segments of even length in time and true anomaly, [31] processing units (GPUs). Typically, most GPU applications utilize the CPU (host) and one or more GPUs (devices). The kernels are launched (invoked) by the CPU and generate many threads on the GPU to exploit data parallelism. These threads execute the same section of code, but in parallel, on the GPU. In CUDA, the host and devices have separate memory spaces, so memory must be allocated to the device memory to transfer data for computation. Following kernel completion, the data must be transferred back from the device to the host. Because of the latency inherent in this data transfer, some applications are more efficient only using serial computation on a CPU, depending on the number of computations completed in parallel.

Fig. 6
Flowchart showing implementation of the parallel CPM algorithm using CUDA CPM is well suited to be implemented using CUDA as the function at each CGL node is evaluated independently. Figure 6 shows the CUDA implementation of the CPM algorithm. The inputs to the algorithm are the degree of the Chebyshev approximation (N), the number of CGL nodes ( M + 1 ), the initial time ( t 0 ), the final time ( t f ), the ODE ( (t, ) ), and the initial state vector ( 0 ). The algorithm starts with calculating the CPM matrices. The Picard iteration should start with an initial guess. For the current implementation, the initial guess at each node is equal to the initial condition. The initial guess is copied from the host to the device. Then, the ODE at each node is evaluated in parallel. Using the CPM integration operator matrix, the state vector at each node is then updated. The updated state vectors are copied from the device to the host if the stopping criterion is met; if not, another iteration is performed the updated state vectors are used to evaluate the ODE.

Results and Discussion
We present results for propagating four Earth-orbiting satellites to test the performance of the FEM compared to the 200 × 200 SHM and the parallel implementation of CPM compared with its serial implementation. A geometrical representation of the four test orbits is shown in Fig. 7. We selected these orbits as they span a wide eccentricity range as well as a large semimajor axis range: a low Earth orbit (LEO), a geostationary orbit (GEO), a medium Earth orbit (MEO), and a highly elliptic orbit (HEO). Table 3 presents the initial conditions of these orbits. All computations underlying this paper have been done in a machine with the following characteristics: The performance is measured by the execution time required to achieve a given accuracy. Since the gravitation force is conservative in nature, the variation in the total energy (Hamiltonian) along the propagated trajectory can be used as a metric to quantify the error of the numerical integration method [6]. For an Earth-Centered-Earth-Fixed (ECEF) system, the Hamiltonian (H) is given by where ECEF and ECEF are the position and velocity vectors in the rotating frame, is the Earth's angular velocity, and U is the Earth's gravitational potential. In addition to the position, the potential is a function of the maximum degree and order of the expansion. Therefore, the Hamiltonian depends on the gravitational employed. The Hamiltonian accuracy in digits ( ) is computed as where N s is the number of integration steps, H k is the Hamiltonian at time k, and H r is a reference Hamiltonian.

Finite Element Method Results
Our Matlab implementation to the FEM (that approximates the 200 × 200 EGM2008 SHM) and the adaptive FEM truncated at 10 −10 (FEM(10)) and at 10 −8 (FEM(08)) are compared against the 200 × 200 and 100 × 100 EGM2008 SHMs. The reference Hamiltonian is evaluated using the initial state vector and the 500 × 500 EGM2008 SHM. Figure 8 shows the performance of these gravity models to propagate the LEO and HEO orbits using DORPRI8 [3] numerical integrator. We did not include the GEO and GPS orbits as low degree/order gravity fields are sufficient even for high accuracies. For the HEO orbit, the radial distance exceeds the FEM upper limit. Therefore, we use the radial adaptive method presented in [32] to choose the degree of the EGM2008 SHM automatically. In general, for solutions with 10 accurate digits or less, the FEM is between two to three orders of magnitude times faster than the 200 × 200 SHM. For the LEO case, the 200 × 200 SHM, FEM, and FEM(10) have the same maximum accuracy level (around ten digits), whereas the accuracies for the 100 × 100 SHM and FEM(08) are limited to around eight digits. Therefore, the FEM is more efficient and allows two more accurate digits than the 100 × 100 SHM. In contrast, for the HEO case, the 100 × 100 SHM can generate near 12-digit accurate solutions. However, the execution time of the FEM is at least two orders of magnitude less than that of the 100 × 100 SHM for accuracies of less than 10 digits.
The adaptive FEM(10) and FEM(08) do not decrease the execution times and may increase the computational costs since more computations are needed to access the coefficients. However, they significantly decrease the required memory for Region I from 675 megabytes to 350 megabytes and 120 megabytes for the FEM(10) and FEM(08), respectively.
In addition to the Matlab implementation to the FEM, we implemented it using C++ and OpenMP. In Fig. 9, we show the runtime comparison for the 200 × 200 SHM, serial FEM, and parallel FEM (using OpenMP) for various number of nodes. The results show that the OpenMP FEM implementation allows 3x speedup compared with the serial FEM and nearly two orders of magnitude speedup compared with the serial implementation of the 200 × 200 SHM. The OpenMP FEM is divided into eight threads for these results. Higher speedup gains are expected if more threads are used. Table 4 show the computational efficiency results for the four test cases using serial and CUDA CPM in conjunction with the 200×200 EGM2008 SHM and the equivalent FEM. The solution is obtained, of each case, for one orbital period. The maximum number of iterations 100 was chosen as the stopping condition for all cases.

Figure 10 and
As expected for the four orbits, the CUDA CPM is more efficient than the serial counterpart, with a speedup gain being nearly 5 when integrating both the SHM and the FEM. Moreover, the FEM gravity model algorithms are nearly 40 times faster than the SHM algorithms. Therefore, the CUDA CPM for integrating the FEM is almost 200 times faster than using the serial CPM to integrate the SHM. However,

Conclusions
This paper reviewed the univariate and multivariate discrete Chebyshev approximations and presented two astrodynamics applications of Chebyshev approximations. The parallel evaluation of these applications improves the performance of these applications. The first application is the gravitational finite element model (FEM). This model replaces the standard spherical harmonics model (SHM). In terms of the computational cost, we have shown that the serial FEM could reduce the computational cost by two orders of magnitude compared with a 200 × 200 SHM. In addition, around 3x speedup could be further obtained by using the parallel version of the FEM using only eight threads in the CPU. The downside of the FEM is its limited precision; the maximum accuracy is around ten digits. In many missions, this level of accuracy is accepted. In addition, an efficient parallel CUDA implementation of the Chebyshev approximation-based numerical integrator Chebyshev Picard Method (CPM) is presented. A segmentation scheme based on the orbit true anomaly is introduced to enhance the overall performance for efficient propagation of eccentric orbits. Numerical examples for propagating four Earth-orbiting satellites considering the 200 × 200 EGM2008 SHM and the equivalent FEM representation are presented to test the algorithm's performance via parallel and serial computation. Used in conjunction with the orthogonal FEM gravity approximation, the CPM enables a 200x speedup for solutions with 10 accurate digits.