1 Introduction

This is Part 2 of our work entitled Towards a reliable implementation of least-squares collocation for higher index differential-algebraic equations, which is introduced and classified in detail in Part 1. Here we briefly collect the ingredients necessary for a fluent reading of this second part.

Consider a linear boundary value problem for a DAE with properly involved derivative,

$$ \begin{array}{@{}rcl@{}} A(t)(Dx)'(t)+B(t)x(t) & =&q(t),\quad t\in[a,b], \end{array} $$
(1)
$$ \begin{array}{@{}rcl@{}} G_{a}x(a)+G_{b}x(b) & =&d. \end{array} $$
(2)

with \([a,b]\subset \mathbb {R}\) being a compact interval, \(D=[I\ 0]\in \mathbb {R}^{k\times m}\), k < m, with the identity matrix \(I\in {\mathbb {R}}^{k\times k}\). Furthermore, \(A(t)\in {\mathbb {R}}^{m\times k}\), \(B(t)\in {\mathbb {R}}^{m\times m}\), and \(q(t)\in {\mathbb {R}}^{m}\) are assumed to be sufficiently smooth with respect to t ∈ [a,b]. Moreover, \(G_{a},G_{b}\in \mathbb {R}^{l_{dyn}\times m}\). Here, \(l_{dyn}\) denotes the dynamical degree of freedom of the DAE, that is, the number of free parameters which can be fixed by initial and boundary conditions. We assume further that \(\ker D\subseteq \ker G_{a}\) and \(\ker D\subseteq \ker G_{b}\).

Unlike regular ordinary differential equations (ODEs), where \(l_{dyn}=k=m\), for DAEs it holds that \(0\leq l_{dyn}\leq k<m\); in particular, \(l_{dyn}=k\) for index-one DAEs, \(l_{dyn}<k\) for higher index DAEs, and \(l_{dyn}=0\) can certainly happen.

Let \(\boldsymbol {\mathfrak {P}}_{K}\) denote the set of all polynomials of degree less than or equal to K ≥ 0.

Given the partition π,

$$ \pi: \quad a=t_{0}<t_{1}<\cdots<t_{n}=b, $$
(3)

with the stepsizes \(h_{j}=t_{j}-t_{j-1}\), let \(C_{\pi }([a,b],\mathbb {R}^{m})\) denote the space of piecewise continuous functions having breakpoints only at the meshpoints of the partition π. Let N ≥ 1 be a fixed integer. We look for an approximate solution of our boundary value problem in the ansatz space Xπ,

$$ \begin{array}{@{}rcl@{}} X_{\pi} & =&\{x\in C_{\pi}([a,b],\mathbb{R}^{m}):Dx\in C([a,b],\mathbb{R}^{k}), \\ && x_{\kappa}\lvert_{[t_{j-1},t_{j})}\in\boldsymbol{\mathfrak{P}}_{N}, \kappa=1,\ldots,k,\quad x_{\kappa}\lvert_{[t_{j-1},t_{j})}\in\boldsymbol{\mathfrak{P}}_{N-1}, \kappa=k+1,\ldots,m, j=1,\ldots,n\}. \end{array} $$
(4)

The continuous version of the least-squares method reads: find an \(x_{\pi }\in X_{\pi }\) that minimizes the functional

$$ \boldsymbol{{\varPhi}}(x)={{\int}_{a}^{b}}|A(t)(Dx)'(t)+B(t)x(t)-q(t)|^{2}\mathrm{d}t +|G_{a}x(a)+G_{b}x(b) -d|^{2}. $$
(5)

Hanke and März [11, Theorem 1] provide sufficient conditions ensuring the existence and uniqueness of the approximate solution from Xπ.

The functional values Φ(x), which are needed when minimizing over \(x\in X_{\pi }\), cannot be evaluated exactly, and the integral must be discretized accordingly. Taking into account that the boundary value problem is ill-posed in the higher index case, perturbations of the functional may have a serious influence on the error of the approximate least-squares solution or even prevent convergence towards the exact solution. Therefore, careful approximations of the integral in Φ are required. We adopt the options provided in [11], in which \(M\geq N+1\) so-called collocation points

$$ 0\leq\tau_{1}<\cdots<\tau_{M}\leq 1 $$
(6)

are used; these induce, on the subintervals of the partition π, the points

$$ t_{ji}=t_{j-1}+\tau_{i}h_{j},\quad i=1,\ldots,M, j=1,\ldots,n. $$

Introducing, for each \(x\in X_{\pi }\) and w(t) = A(t)(Dx)'(t) + B(t)x(t) − q(t), the corresponding vector \(W\in {\mathbb {R}}^{mMn}\) by

$$ W=\left[\begin{array}{c} W_{1}\\ \vdots\\ W_{n} \end{array}\right]\in\mathbb{R}^{mMn},\quad W_{j}=h_{j}^{1/2}\left[\begin{array}{c} w(t_{j1})\\ \vdots\\ w(t_{jM}) \end{array}\right]\in\mathbb{R}^{mM}, $$
(7)

we turn to an approximate functional of the form

$$ \begin{array}{@{}rcl@{}} \boldsymbol{{\varPhi}}_{\pi,M}(x)=W^{T}\mathscr{L}W+|G_{a}x(a)+G_{b}x(b)-d|^{2}, \quad x\in X_{\pi}, \end{array} $$
(8)

with a positive definite symmetric matrix

$$ \begin{array}{@{}rcl@{}} \mathscr{L}=\text{diag}(L\otimes I_{m},\ldots,L\otimes I_{m}). \end{array} $$
(9)

As detailed in [11], we have different options for the positive definite symmetric matrix \(L\in \mathbb {R}^{M\times M}\), namely

$$ \begin{array}{@{}rcl@{}} L&=&L^{C}=M^{-1}I_{M}, \end{array} $$
(10)
$$ \begin{array}{@{}rcl@{}} L&=&L^{I}=\text{diag}(\gamma_{1},\ldots,\gamma_{M}), \end{array} $$
(11)
$$ \begin{array}{@{}rcl@{}} L&=&L^{R}=(\tilde{V}^{-1})^{T}\tilde{V}^{-1}, \end{array} $$
(12)

see [11, Section 3] for details concerning the selection of the quadrature weights and the construction of the mass matrix. We emphasize that the matrices \(L^{C},L^{I},L^{R}\) depend only on M, the node sequence (6), and the quadrature weights, but do not depend on the partition π and its stepsizes at all.
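
For illustration, the following C++/Eigen sketch sets up \(L^{C}\) and \(L^{I}\), under the assumption that the weights \(\gamma _{1},\ldots ,\gamma _{M}\) are the Gauss-Legendre weights on [0,1]; the function names and the Golub-Welsch route are ours, while the actual selection is the one specified in [11, Section 3]:

```cpp
#include <Eigen/Dense>
#include <cmath>

// Gauss-Legendre nodes tau and weights gamma on [0,1] via the Golub-Welsch
// algorithm: the eigenvalues of the symmetric tridiagonal Jacobi matrix are
// the nodes on [-1,1]; the squared first components of the normalized
// eigenvectors give (after the affine map) the weights on [0,1].
void gaussLegendre01(int M, Eigen::VectorXd& tau, Eigen::VectorXd& gamma) {
    Eigen::MatrixXd J = Eigen::MatrixXd::Zero(M, M);
    for (int i = 1; i < M; ++i)
        J(i, i - 1) = J(i - 1, i) = i / std::sqrt(4.0 * i * i - 1.0);
    Eigen::SelfAdjointEigenSolver<Eigen::MatrixXd> es(J);
    tau = ((es.eigenvalues().array() + 1.0) / 2.0).matrix();   // [-1,1] -> [0,1]
    gamma = es.eigenvectors().row(0).transpose().array().square().matrix();
}

// The two simplest options (10) and (11) for the mass matrix L.
Eigen::MatrixXd buildLC(int M) { return Eigen::MatrixXd::Identity(M, M) / M; }
Eigen::MatrixXd buildLI(const Eigen::VectorXd& gamma) { return gamma.asDiagonal(); }
```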

In the context of the experiments below, we denote each of the different versions of the functional by \( \boldsymbol {{\varPhi }}_{\pi ,M}^{C}\), \( \boldsymbol {{\varPhi }}_{\pi ,M}^{I}\), and \( \boldsymbol {{\varPhi }}_{\pi ,M}^{R}\), respectively.

It should be underlined that minimizing each version of the functional Φπ,M on Xπ can be viewed as a special least-squares method for solving the overdetermined collocation system W = 0, Gax(a) + Gbx(b) = d, with respect to \(x\in X_{\pi }\), that is, in detail, the collocation system

$$ \begin{array}{@{}rcl@{}} A(t_{ji})(Dx)'(t_{ji})+B(t_{ji})x(t_{ji}) & =&q(t_{ji}),\quad i=1,\ldots,M,\quad j=1,\ldots,n, \end{array} $$
(13)
$$ \begin{array}{@{}rcl@{}} G_{a}x(a)+G_{b}x(b) & =&d. \end{array} $$
(14)

The system (13)–(14) for \(x\in X_{\pi }\) is overdetermined since Xπ has dimension nmN + k, whereas the system consists of \(nmM+l_{dyn}>nmN+k+l_{dyn}\geq nmN+k\) scalar equations. We refer to [11, Theorem 2] for sufficient conditions which ensure the existence and uniqueness of the minimizing element

$$ \begin{array}{@{}rcl@{}} x_{\pi}=\text{argmin}\{\boldsymbol{{\varPhi}}_{\pi,M}(x):x\in X_{\pi}\}. \end{array} $$
(15)
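
To make the overdetermination count above explicit: each subinterval carries \(k(N+1)+(m-k)N=mN+k\) coefficients, so that, before imposing continuity, there are \(n(mN+k)\) unknowns; the continuity of Dx at the n − 1 interior meshpoints removes k(n − 1) of them,

$$ \dim X_{\pi}=n(mN+k)-k(n-1)=nmN+k, $$

while the collocation conditions (13) alone contribute \(nmM\geq nm(N+1)\) scalar equations.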

Once the basis of the ansatz space Xπ has been chosen and the collocation nodes are selected, the discrete problem (15) for a linear boundary value problem (1)–(2) leads to a constrained linear least-squares problem

$$ \varphi(c)=\lvert\mathscr{A}c-r\rvert_{\mathbb{R}^{nmM+l_{dyn}}}^{2}\rightarrow\min! $$
(16)

under the linear constraint

$$ \mathscr{C}c=0. $$
(17)

The equality constraint (17) consists of the k(n − 1) continuity conditions for the elements of Xπ, while the functional φ(c) is a reformulation of the functional (8). Here, \(c\in \mathbb {R}^{n(mN+k)}\) is the vector of coefficients of the basis functions for Xπ disregarding the continuity conditions. Furthermore, it holds that \(r\in {\mathbb {R}}^{nmM+l_{dyn}}\), \({\mathscr{A}}\in \mathbb {R}^{(nmM+l_{dyn})\times n(mN+k)}\), and \({\mathscr{C}}\in \mathbb {R}^{(n-1)k\times n(mN+k)}\). The matrices \({\mathscr{A}}\) and \({\mathscr{C}}\) are very sparse. Owing to the construction, \({\mathscr{C}}\) has full row rank.

We specify the structure of \({\mathscr{A}}\) and \({\mathscr{C}}\) in detail in Section 2 below. Different approaches to solving the constrained optimization problem (16)–(17) have been tested. We report on the related experiments in Section 4. The examples used in various places are collected in Section 3. The performance of the linear solvers is discussed in Section 5. Section 6 shows some additional experiments concerning the weighting of the boundary conditions. Section 7 contains final remarks.

2 The structure of the discrete problem (16), (17)

To begin with, based on the analysis in [11, Section 4], we provide a basis of the ansatz space Xπ. Assume that {p0,…,pN− 1} is a basis of \(\boldsymbol {\mathfrak {P}}_{N-1}\) defined on the reference interval [0,1]. Then, \(\{\bar {p}_{0},\ldots ,\bar {p}_{N}\}\) given by

$$ \bar{p}_{i}(\rho)=\left\{\begin{array}{ll} 1, & i=0,\\ {\int}_{0}^{\rho}p_{i-1}(\sigma)\mathrm{d}\sigma, & i=1,\ldots,N,\quad\rho\in[0,1], \end{array}\right. $$
(18)

forms a basis of \(\boldsymbol {\mathfrak {P}}_{N}\). The transformation to the subinterval \(I_{j}=[t_{j-1},t_{j})\) of the partition π (3) yields

$$ \begin{array}{@{}rcl@{}} p_{ji}(t) =p_{i}((t-t_{j-1})/h_{j}),\quad \bar{p}_{ji}(t) =h_{j}\bar{p}_{i}((t-t_{j-1})/h_{j}), \end{array} $$
(19)

and in particular

$$ \begin{array}{@{}rcl@{}} \bar{p}_{ji}(t_{j-1})&=&h_{j}\bar{p}_{i}(0)=h_{j}\left\{\begin{array}{ll} 1, & i=0,\\ 0, & i=1,\ldots,N, \end{array}\right.\\ \bar{p}_{ji}(t_{j})&=&h_{j}\bar{p}_{i}(1)=h_{j}\left\{\begin{array}{ll} 1, & i=0,\\ {{\int}_{0}^{1}}p_{i-1}(\sigma)\mathrm{d}\sigma, & i=1,\ldots,N. \end{array}\right. \end{array} $$

Next we form the matrix functions

$$ \begin{array}{@{}rcl@{}} \bar{\mathscr{P}}_{j} = \begin{bmatrix} \bar{p}_{j0}&{\ldots} &\bar{p}_{jN} \end{bmatrix}: I_{j}\rightarrow \mathbb{R}^{1\times(N+1)},\quad \mathscr{P}_{j} = \begin{bmatrix} p_{j0}&\ldots&p_{j,N-1} \end{bmatrix}: I_{j}\rightarrow\mathbb{R}^{1\times N}, \end{array} $$

such that

$$ \begin{array}{@{}rcl@{}} \bar{\mathscr{P}}_{j}(t_{j-1})&=&h_{j}\left[\begin{array}{llll} 1&0&{\ldots} &0 \end{array}\right], \quad j=1,\ldots,n, \end{array} $$
(20)
$$ \begin{array}{@{}rcl@{}} \bar{\mathscr{P}}_{j}(t_{j})&=&h_{j}\left[\begin{array}{lllll} 1&{{\int}_{0}^{1}}p_{0}(\sigma)\mathrm{d}\sigma&{\ldots} &{{\int}_{0}^{1}}p_{N-1}(\sigma)\mathrm{d}\sigma \end{array}\right], \quad j=1,\ldots,n. \end{array} $$
(21)

Observe that choosing {p0,…,pN− 1} to be Legendre polynomials simplifies the latter matrix to

$$ \begin{array}{@{}rcl@{}} \bar{\mathscr{P}}_{j}(t_{j})&=h_{j}\left[\begin{array}{lllll} 1&1&0&{\ldots} &0 \end{array}\right], \quad j=1,\ldots,n, \end{array} $$

which will prove important.
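
For illustration, a small C++ sketch of one way to evaluate this basis, assuming that {p0,…,pN− 1} are the shifted Legendre polynomials on [0,1]; the antiderivatives in (18) then follow from the classical identity \((2n+1)\int P_{n}=P_{n+1}-P_{n-1}\) transported to [0,1]:

```cpp
#include <vector>

// Shifted Legendre polynomials p_i on [0,1] via the three-term recurrence
// (i+1) P_{i+1}(x) = (2i+1) x P_i(x) - i P_{i-1}(x), with x = 2*rho - 1.
std::vector<double> shiftedLegendre(int N, double rho) {
    std::vector<double> p(N);
    double x = 2.0 * rho - 1.0;
    if (N > 0) p[0] = 1.0;
    if (N > 1) p[1] = x;
    for (int i = 1; i + 1 < N; ++i)
        p[i + 1] = ((2 * i + 1) * x * p[i] - i * p[i - 1]) / (i + 1);
    return p;
}

// The integrated basis \bar p_i of (18): \bar p_0 = 1, \bar p_1(rho) = rho,
// and for i >= 2 the Legendre identity gives
// \int_0^rho p_{i-1} = (P_i(rho) - P_{i-2}(rho)) / (2 (2i - 1)).
std::vector<double> barP(int N, double rho) {
    std::vector<double> P = shiftedLegendre(N + 1, rho), bp(N + 1);
    bp[0] = 1.0;
    if (N >= 1) bp[1] = rho;
    for (int i = 2; i <= N; ++i)
        bp[i] = (P[i] - P[i - 2]) / (2.0 * (2 * i - 1));
    return bp;
}
```

Evaluating barP at ρ = 1 returns (1, 1, 0, …, 0), which reproduces the simplified row displayed above.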

For \(x\in X_{\pi }\) we introduce the notation

$$ \begin{array}{@{}rcl@{}} x(t)=x_{j}(t)=\left[\begin{array}{ccc} x_{j1}(t)\\\vdots\\x_{jm}(t) \end{array}\right]\in\mathbb{R}^{m},\quad Dx(t)=Dx_{j}(t)=\begin{bmatrix} x_{j1}(t)\\\vdots\\x_{jk}(t) \end{bmatrix}\in\mathbb{R}^{k},\quad t\in I_{j}. \end{array} $$

Then, we expand each xj componentwise,

$$ \begin{array}{@{}rcl@{}} x_{j\kappa}(t) & =&\sum\limits_{l=0}^{N}c_{j\kappa l}\bar{p}_{jl}(t)=\bar{\mathscr{P}}_{j}(t)c_{j\kappa},\quad\kappa=1,\ldots,k, \\ x_{j\kappa}(t) & = & \sum\limits_{l=0}^{N-1}c_{j\kappa l}p_{jl}(t)=\mathscr{P}_{j}(t)c_{j\kappa},\quad\kappa=k+1,\ldots,m, \end{array} $$

with

$$ \begin{array}{@{}rcl@{}} c_{j\kappa}=\begin{bmatrix} c_{j\kappa 0}\\\vdots\\c_{j \kappa N} \end{bmatrix}\in \mathbb{R}^{N+1},\quad \kappa=1,\ldots,k,\quad c_{j\kappa}=\begin{bmatrix} c_{j\kappa 0}\\\vdots\\c_{j \kappa, N-1} \end{bmatrix}\in \mathbb{R}^{N},\quad \kappa=k+1,\ldots,m. \end{array} $$

Introducing further

$$ \begin{array}{@{}rcl@{}} {{\varOmega}}_{j}(t)=\left[\begin{array}{cc} I_{k}\otimes\bar{\mathscr{P}}_{j}(t) & 0\\ 0 & I_{m-k}\otimes \mathscr{P}_{j}(t) \end{array}\right]\in\mathbb{R}^{m\times(mN+k)}, \quad c_{j}=\begin{bmatrix} c_{j 1}\\\vdots\\c_{j m} \end{bmatrix}\in \mathbb{R}^{mN+k}, \end{array} $$

we represent

$$ \begin{array}{@{}rcl@{}} x_{j}(t)&=&{{\varOmega}}_{j}(t)c_{j}, \end{array} $$
(22)
$$ \begin{array}{@{}rcl@{}} Dx_{j}(t)&=&D{{\varOmega}}_{j}(t)c_{j}=\begin{bmatrix} I_{k}\otimes\bar{\mathscr{P}}_{j}(t) & 0 \end{bmatrix}c_{j},\quad t\in I_{j}, \quad j=1,\ldots,n. \end{array} $$
(23)

Now we collect all coefficients cjκl in the vector c,

$$ \begin{array}{@{}rcl@{}} c=\begin{bmatrix} c_{1}\\\vdots\\c_{n} \end{bmatrix}\in\mathbb{R}^{nmN+nk}. \end{array} $$

It follows that the matrix \({\mathscr{C}}\in \mathbb {R}^{k(n-1)\times n(mN+k)}\) in (17) corresponding to the continuity requirement for Dx has the precise form

$$ \begin{array}{@{}rcl@{}} \mathscr{C}=\left[\begin{array}{cccccc} I_{k}\otimes \bar{\mathscr{P}}_{1}(t_{1}) & -I_{k}\otimes \bar{\mathscr{P}}_{2}(t_{1})\\ & I_{k}\otimes \bar{\mathscr{P}}_{2}(t_{2}) & -I_{k}\otimes \bar{\mathscr{P}}_{3}(t_{2})\\ & {\ddots} & {\ddots} & &\\ & & & &{\ddots} &{\ddots} \\ & & & & I_{k}\otimes \bar{\mathscr{P}}_{n-1}(t_{n-1}) &- I_{k}\otimes \bar{\mathscr{P}}_{n}(t_{n-1}) \end{array}\right]. \end{array} $$
(24)

By construction, the segments Dxj, j = 1,…,n, together form a continuous function Dx on [a,b] exactly when \({\mathscr{C}}c=0\).
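
As an illustration, a sketch of assembling \({\mathscr{C}}\) from (24) in triplet form with Eigen's sparse facilities; the 0-based coefficient layout mirrors the ordering of c introduced above, and all names are ours:

```cpp
#include <Eigen/Sparse>
#include <vector>

// Assemble the continuity matrix C of (24) as a sparse matrix.
// barP1[l] = \bar p_l(1); with the Legendre basis this is (1, 1, 0, ..., 0).
// h[j] is the stepsize of subinterval j (0-based); m, k, N, n as in the text.
Eigen::SparseMatrix<double> assembleC(int n, int m, int k, int N,
                                      const std::vector<double>& h,
                                      const std::vector<double>& barP1) {
    const int blk = m * N + k;                    // length of each c_j
    std::vector<Eigen::Triplet<double>> trip;
    for (int j = 0; j < n - 1; ++j)               // interface between I_{j+1}, I_{j+2}
        for (int kap = 0; kap < k; ++kap) {
            const int row = j * k + kap;
            const int colL = j * blk + kap * (N + 1);       // left c-block
            const int colR = (j + 1) * blk + kap * (N + 1); // right c-block
            for (int l = 0; l <= N; ++l)          // h_{j+1} * \bar p_l(1)
                if (barP1[l] != 0.0)
                    trip.emplace_back(row, colL + l, h[j] * barP1[l]);
            trip.emplace_back(row, colR, -h[j + 1]);  // -h_{j+2} * \bar p_0(0)
        }
    Eigen::SparseMatrix<double> C(k * (n - 1), n * blk);
    C.setFromTriplets(trip.begin(), trip.end());
    return C;
}
```

With the Legendre basis, \(\bar {p}_{l}(1)=0\) for l ≥ 2, so each row carries only three nonzeros; this matches the sparsity observation made in Section 5.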

Regarding the structure of \({\mathscr{C}}\in \mathbb {R}^{k(n-1)\times n(mN+k)}\) we know that

$$ \begin{array}{@{}rcl@{}} \text{rank}\mathscr{C}=k(n-1),\quad \dim\ker\mathscr{C}=nmN+k=\dim X_{\pi}, \end{array} $$

and formula (22) provides a one-to-one relation between Xπ and \(\ker {\mathscr{C}}\subset \mathbb {R}^{n(mN+k)}\).

Now we turn to the detailed description of the functional value (16). To this end, we factorize \(L=\tilde {L}^{T}\tilde {L}\) and \({\mathscr{L}}=\tilde {{\mathscr{L}}}^{T}\tilde {{\mathscr{L}}}\) such that

$$ \begin{array}{@{}rcl@{}} \tilde{\mathscr{L}}=\text{diag}(\tilde{L}\otimes I_{m},\cdots,\tilde{L}\otimes I_{m}) \end{array} $$

and (8) rewrites as

$$ \begin{array}{@{}rcl@{}} \boldsymbol{{\varPhi}}_{\pi,M}(x)=\lvert\tilde{\mathscr{L}}W\rvert^{2}_{\mathbb{R}^{nmM}}+|G_{a}x(a)+G_{b}x(b)-d|_{\mathbb{R}^{l_{dyn}}}^{2}, \quad x\in X_{\pi}. \end{array} $$

Applying (22) and (23), we derive

$$ \begin{array}{@{}rcl@{}} G_{a}x(a)+G_{b}x(b)=G_{a}D^{+}D{{\varOmega}}_{1}(t_{0})c_{1}+G_{b}D^{+}D{{\varOmega}}_{n}(t_{n})c_{n}=:{{\varGamma}}_{a}c_{1}+{{\varGamma}}_{b}c_{n} \end{array} $$

with matrices \({{\varGamma }}_{a},{{\varGamma }}_{b} \in \mathbb {R}^{l_{dyn}\times (mN+k)}\), and

$$ \begin{array}{@{}rcl@{}} w(t_{ji})=\underbrace{\left[A(t_{ji})(D{{\varOmega}}_{j})'(t_{ji})+B(t_{ji}){{\varOmega}}_{j}(t_{ji})\right]}_{=\mathscr{A}_{ji}}c_{j} -q(t_{ji})=\mathscr{A}_{ji}c_{j}-q(t_{ji}), \end{array} $$

with \({\mathscr{A}}_{ji}\in \mathbb {R}^{m\times (mN+k)}\). According to (7) we set

$$ W_{j}=h_{j}^{1/2}\left[\begin{array}{c} w(t_{j1})\\ \vdots\\ w(t_{jM}) \end{array}\right]= h_{j}^{1/2}\left[\begin{array}{c} \mathscr{A}_{j1}\\ \vdots\\ \mathscr{A}_{jM} \end{array}\right] c_{j}- h_{j}^{1/2}\left[\begin{array}{c}q(t_{j1})\\ \vdots\\ q(t_{jM}) \end{array}\right] $$

and

$$ \begin{array}{@{}rcl@{}} (\tilde{L}\otimes I_{m})W_{j}=\underbrace{h_{j}^{1/2}(\tilde{L}\otimes I_{m})\begin{bmatrix} \mathscr{A}_{j1}\\\vdots\\\mathscr{A}_{jM} \end{bmatrix}}_{=:\mathscr{A}_{j}}c_{j}-(\tilde{L}\otimes I_{m})\underbrace{h_{j}^{1/2} \begin{bmatrix} q(t_{j1})\\\vdots\\q(t_{jM}) \end{bmatrix}}_{=:W_{j}^{[q]}} =\mathscr{A}_{j}c_{j}-(\tilde{L}\otimes I_{m})W_{j}^{[q]}. \end{array} $$

Finally, introducing the sparse matrix \({\mathscr{A}}\in \mathbb {R}^{(nmM+l_{dyn})\times (nmN+nk)}\) and the vector \(r\in {\mathbb {R}}^{nmM+l_{dyn}}\),

$$ \mathscr{A}=\left[\begin{array}{ccccc} \mathscr{A}_{1} & 0 & {\cdots} & & 0\\ 0 & {\ddots} & & & \vdots\\ {\vdots} & & \ddots\\ & & & {\ddots} & 0\\ 0 & & & & \mathscr{A}_{n}\\ {{\varGamma}}_{a} & 0 & {\cdots} & 0 & {{\varGamma}}_{b} \end{array}\right], \quad r=\left[\begin{array}{c}(\tilde{L}\otimes I_{m}) W_{1}^{[q]} \\ {\vdots} \\(\tilde{L}\otimes I_{m}) W_{n}^{[q]} \\ d \end{array}\right] $$

we arrive at the representation

$$ \begin{array}{@{}rcl@{}} \varphi(c)&=&\lvert\mathscr{A}c-r\rvert_{\mathbb{R}^{nmM+l_{dyn}}}^{2}=\sum\limits_{j=1}^{n}\lvert\mathscr{A}_{j}c_{j}-(\tilde{L}\otimes I_{m}) W_{j}^{[q]}\rvert^{2}_{{\mathbb{R}}^{mM}}+\lvert {{\varGamma}}_{a}c_{1}+{{\varGamma}}_{b}c_{n}-d\rvert^{2}_{{\mathbb{R}}^{l_{dyn}}}\\ &=&\sum\limits_{j=1}^{n}\lvert (\tilde{L}\otimes I_{m}) W_{j} \rvert^{2}_{\mathbb{R}^{mM}}+\lvert {{\varGamma}}_{a}c_{1}+{{\varGamma}}_{b}c_{n}-d\rvert^{2}_{{\mathbb{R}}^{l_{dyn}}}=\lvert \tilde{\mathscr{L}}W\rvert^{2}_{{\mathbb{R}}^{nmM}}+\lvert {{\varGamma}}_{a}c_{1}+{{\varGamma}}_{b}c_{n}-d\rvert^{2}_{{\mathbb{R}}^{l_{dyn}}}\\ &=&\boldsymbol{{\varPhi}}_{\pi,M}(x), \end{array} $$

as desired. Eventually, each minimizer xπ ∈argmin{Φπ,M(x) : x ∈ Xπ} corresponds to a minimizer \(c_{min}\in \text{argmin}\{\varphi (c):c\in {\mathbb {R}}^{nmN+nk},{\mathscr{C}}c=0\}\), and vice versa. Recall that [11, Theorem 2] provides sufficient conditions for xπ to be unique.
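
For concreteness, a sketch of how a single block \({\mathscr{A}}_{ji}\) could be assembled from the basis values; note that \(\bar {p}_{jl}'(t)=p_{l-1}((t-t_{j-1})/h_{j})\) for l ≥ 1 by (18) and (19), so the factor hj cancels in the derivative. The names are ours:

```cpp
#include <Eigen/Dense>

// One collocation block A_{ji} = A(t_{ji}) (D Omega_j)'(t_{ji}) + B(t_{ji}) Omega_j(t_{ji}).
// barP holds \bar p_{jl}(t_{ji})/h_j for l = 0..N, dbarP holds the derivative
// values (0, p_0, ..., p_{N-1}) at the reference point, P holds p_{jl}(t_{ji}).
Eigen::MatrixXd collocationBlock(const Eigen::MatrixXd& A,   // m x k, at t_{ji}
                                 const Eigen::MatrixXd& B,   // m x m, at t_{ji}
                                 const Eigen::VectorXd& barP,
                                 const Eigen::VectorXd& dbarP,
                                 const Eigen::VectorXd& P,
                                 int m, int k, int N) {
    Eigen::MatrixXd Omega = Eigen::MatrixXd::Zero(m, m * N + k);
    Eigen::MatrixXd dDOmega = Eigen::MatrixXd::Zero(k, m * N + k);
    for (int kap = 0; kap < k; ++kap) {          // differential components
        Omega.block(kap, kap * (N + 1), 1, N + 1) = barP.transpose();
        dDOmega.block(kap, kap * (N + 1), 1, N + 1) = dbarP.transpose();
    }
    for (int kap = k; kap < m; ++kap)            // algebraic components
        Omega.block(kap, k * (N + 1) + (kap - k) * N, 1, N) = P.transpose();
    return A * dDOmega + B * Omega;              // m x (mN + k)
}
```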

Proposition 1

Let the functional Φπ,M have a unique minimizer xπ on Xπ. Then the following assertions are valid:

  (1) There is exactly one minimizer cmin of the functional φ on \(\ker {\mathscr{C}}\).

  (2) If the columns of \({\mathscr{B}}\in \mathbb {R}^{(nmN+nk)\times (nmN+k)}\) form a basis of \(\ker {\mathscr{C}}\), then \(\mathfrak {A}:={\mathscr{A}}{\mathscr{B}}\) has full column rank nmN + k.

Proof

(1) follows directly from the above representations of the related functionals. (2): \(z\in \ker \mathfrak {A}\) implies \({\mathscr{B}} z\in \ker {\mathscr{A}}\) and \(c_{min}+{\mathscr{B}}z\in \ker {\mathscr{C}}\), hence \(\varphi (c_{min}+{\mathscr{B}}z)=\varphi (c_{min})\). Owing to the uniqueness of the minimizer, it follows that \({\mathscr{B}}z=0\), and in turn z = 0, since \({\mathscr{B}}\) has full column rank. □

3 Test examples

The first test problem is often used in the literature to show that standard integration methods fail if applied to higher index DAEs, e.g., [13, 14].

Example 1

The DAE

$$ \begin{array}{@{}rcl@{}} x^{\prime}_{2}(t)+x_{1}(t) & =&q_{1}(t),\\ t\eta x^{\prime}_{2}(t)+x^{\prime}_{3}(t)+(\eta+1)x_{2}(t) & =&q_{2}(t),\\ t\eta x_{2}(t)+x_{3}(t) & =&q_{3}(t),\quad t\in [0,1], \end{array} $$

has index 3 and dynamical degree of freedom \(l_{dyn}=0\), so that no initial or boundary conditions are necessary for unique solvability. We choose the exact solution

$$ \begin{array}{@{}rcl@{}} x_{\ast,1}(t) & =&e^{-t}\sin t,\\ x_{\ast,2}(t) & =&e^{-2t}\sin t,\\ x_{\ast,3}(t) & =&e^{-t}\cos t \end{array} $$

and adapt the right-hand side q accordingly. For the exact solution, it holds \(\|x_{\ast }\|_{L^{2}((0,1),{\mathbb {R}}^{3})}\approx 0.673\), \(\|x_{\ast }\|_{L^{\infty }((0,1),{\mathbb {R}}^{3})}=1\), and \(\|x_{\ast }\|_{{H_{D}^{1}}((0,1),{\mathbb {R}}^{3})}\approx 1.11\).

The next example is the linearized version of a test problem presented in [6] that has also been discussed, e.g., in [12].

Example 2

We consider the DAE

$$ A(Dx)'(t)+B(t)x(t)=q(t),\quad t\in[0,5], $$

where

$$ \begin{array}{@{}rcl@{}} A&=&\begin{bmatrix} 1&0&0&0&0&0\\ 0&1&0&0&0&0\\ 0&0&1&0&0&0\\ 0&0&0&1&0&0\\ 0&0&0&0&1&0\\ 0&0&0&0&0&1\\ 0&0&0&0&0&0 \end{bmatrix}, D=\begin{bmatrix} 1&0&0&0&0&0&0\\ 0&1&0&0&0&0&0\\ 0&0&1&0&0&0&0\\ 0&0&0&1&0&0&0\\ 0&0&0&0&1&0&0\\ 0&0&0&0&0&1&0 \end{bmatrix},\\ B(t)&=& \begin{bmatrix} 0&0&0&-1&0&0&0\\ 0&0&0&0&-1&0&0\\ 0&0&0&0&0&-1&0\\ 0&0&\sin t&0&1&-\cos t&-2\rho \cos^{2}t\\ 0&0&-\cos t&-1&0&-\sin t&-2\rho \sin t\cos t\\ 0&0&1&0&0&0&2\rho \sin t\\ 2\rho \cos^{2}t&2\rho \sin t \cos t&-2\rho\sin t&0&0&0&0 \end{bmatrix},\quad \rho=5, \end{array} $$

subject to the initial conditions

$$ x_{2}(0)=1,\quad x_{3}(0)=2,\quad x_{5}(0)=0,\quad x_{6}(0)=0. $$

This problem has index 3 and dynamical degree of freedom \(l_{dyn}=4\). The right-hand side q has been chosen in such a way that the exact solution becomes

$$ \begin{array}{@{}rcl@{}} x_{\ast,1} &=& \sin t, \qquad\qquad\quad x_{\ast,4} = \cos t, \\ x_{\ast,2} &=& \cos t, \qquad\qquad \quad x_{\ast,5} = -\sin t, \\ x_{\ast,3} &=& 2\cos^{2} t, \qquad \qquad x_{\ast,6} = -2\sin 2t, \\ x_{\ast,7} &=& -\rho^{-1}\sin t. \end{array} $$

For the exact solution, it holds \(\|x_{\ast }\|_{L^{2}((0,5),{\mathbb {R}}^{7})}\approx 5.2\), \(\|x_{\ast }\|_{L^{\infty }((0,5),\mathbb {R}^{7})}=2\), and \(\|x_{\ast }\|_{{H_{D}^{1}}((0,5),\mathbb {R}^{7})}\approx 9.4\).

In contrast to Example 2, which is an initial value problem, the following example is a boundary value problem.

Example 3

On the interval [0,1], consider the DAE

$$ \left[\begin{array}{cccccc} 1 & 0 & 0 & 0 & 0 & 0\\ 0 & 1 & 0 & 0 & 0 & 0\\ 0 & 0 & 0 & 0 & 0 & 0\\ 0 & 0 & 1 & 0 & 0 & 0\\ 0 & 0 & 0 & 1 & 0 & 0\\ 0 & 0 & 0 & 0 & 1 & 0 \end{array}\right]\frac{d}{dt}\left[\begin{array}{c} x_{1}\\ x_{2}\\ y_{1}\\ y_{2}\\ y_{3}\\ y_{4} \end{array}\right]+\left[\begin{array}{cccccc} 0 & -\lambda & 0 & 0 & 0 & 0\\ -\lambda & 0 & 0 & 0 & 0 & 0\\ -1 & 0 & 1 & 0 & 0 & 0\\ 0 & 0 & 0 & 1 & 0 & 0\\ 0 & 0 & 0 & 0 & 1 & 0\\ 0 & 0 & 0 & 0 & 0 & 1 \end{array}\right]\left[\begin{array}{c} x_{1}\\ x_{2}\\ y_{1}\\ y_{2}\\ y_{3}\\ y_{4} \end{array}\right]=\left[\begin{array}{c} 0\\ 0\\ 0\\ 0\\ 0\\ 0 \end{array}\right],\quad\lambda>0, $$

subject to the boundary conditions

$$ x_{1}(0)=x_{1}(1)=1. $$

This DAE can be brought into the proper form (1) by setting

$$ A=\left[\begin{array}{ccccc} 1 & 0 & 0 & 0 & 0\\ 0 & 1 & 0 & 0 & 0\\ 0 & 0 & 0 & 0 & 0\\ 0 & 0 & 1 & 0 & 0\\ 0 & 0 & 0 & 1 & 0\\ 0 & 0 & 0 & 0 & 1 \end{array}\right],\quad D=\left[\begin{array}{cccccc} 1 & 0 & 0 & 0 & 0 & 0\\ 0 & 1 & 0 & 0 & 0 & 0\\ 0 & 0 & 1 & 0 & 0 & 0\\ 0 & 0 & 0 & 1 & 0 & 0\\ 0 & 0 & 0 & 0 & 1 & 0 \end{array}\right],\quad B=\left[\begin{array}{cccccc} 0 & -\lambda & 0 & 0 & 0 & 0\\ -\lambda & 0 & 0 & 0 & 0 & 0\\ -1 & 0 & 1 & 0 & 0 & 0\\ 0 & 0 & 0 & 1 & 0 & 0\\ 0 & 0 & 0 & 0 & 1 & 0\\ 0 & 0 & 0 & 0 & 0 & 1 \end{array}\right]. $$

This DAE has the tractability index μ = 4 and dynamical degree of freedom \(l_{dyn}=2\). The solution reads

$$ \begin{array}{@{}rcl@{}} x_{\ast,1}(t) & =&\frac{e^{-\lambda t}(e^{\lambda}+e^{2\lambda t})}{1+e^{\lambda}}\\ x_{\ast,2}(t) & =&\frac{e^{-\lambda t}(-e^{\lambda}+e^{2\lambda t})}{1+e^{\lambda}}\\ y_{\ast,1}(t) & =&\frac{e^{-\lambda t}(e^{\lambda}+e^{2\lambda t})}{1+e^{\lambda}}\\ y_{\ast,2}(t) & =&\lambda\frac{e^{-\lambda t}(-e^{\lambda}+e^{2\lambda t})}{1+e^{\lambda}}\\ y_{\ast,3}(t) & =&\lambda^{2}\frac{e^{-\lambda t}(e^{\lambda}+e^{2\lambda t})}{1+e^{\lambda}}\\ y_{\ast,4}(t) & =&\lambda^{3}\frac{e^{-\lambda t}(-e^{\lambda}+e^{2\lambda t})}{1+e^{\lambda}} \end{array} $$

4 Approaches to solve the constrained optimization problem (16)–(17)

Different approaches to solving the constrained optimization problem (16)–(17) have been tested, namely the direct elimination method, the weighting of the constraints, and a special deferred correction procedure, as specified in the following three subsections.

4.1 Direct elimination method

The matrix \({\mathscr{C}}\) has full row rank (n − 1)k.

The solution manifold of (17), that is \(\ker {\mathscr{C}}\), forms an (nmN + k)-dimensional subspace of \({\mathbb {R}}^{nmN+nk}\) which can be characterized by

$$ \mathscr{C}c=0\text{ if and only if }c=\mathscr{B}z\text{ for some }z\in\mathbb{R}^{nmN+k}. $$

Here, the columns of \({\mathscr{B}}\in \mathbb {R}^{n(mN+k)\times (nmN+k)}\) form an orthonormal basis of \(\ker {\mathscr{C}}\). With this representation, the constrained minimization problem can be reduced to the unconstrained one

$$ \tilde{\varphi}(z)=\lvert\mathscr{A}\mathscr{B}z-r\rvert_{\mathbb{R}^{nmM+l_{dyn}}}^{2}\rightarrow\min! $$

Owing to Proposition 1, the matrix product \({\mathscr{A}}{\mathscr{B}}\) has full column rank nmN + k.

The implemented algorithm is that of [5] (see also [4, Section 5.1.2]), which is sometimes called the direct elimination method. In our tests below, the direct method seems to be the most robust one.
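
The idea can be sketched with dense Eigen factorizations as follows; the actual implementation of [5] works with sparse factors and an orthonormal kernel basis, whereas FullPivLU::kernel() below is merely one simple way to obtain some basis of \(\ker {\mathscr{C}}\):

```cpp
#include <Eigen/Dense>

// Direct elimination, dense sketch: obtain a basis B of ker(C), then solve
// the unconstrained problem min |A B z - r|^2 and map back, c = B z.
Eigen::VectorXd solveByElimination(const Eigen::MatrixXd& A,
                                   const Eigen::MatrixXd& C,
                                   const Eigen::VectorXd& r) {
    Eigen::FullPivLU<Eigen::MatrixXd> lu(C);
    Eigen::MatrixXd B = lu.kernel();              // columns span ker C
    // A*B has full column rank by Proposition 1, so QR least squares is safe.
    Eigen::VectorXd z = (A * B).colPivHouseholderQr().solve(r);
    return B * z;                                 // satisfies C c = 0 exactly
}
```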

4.2 Weighting of the constraints to solve the optimization problem (16)–(17)

In this approach, a sufficiently large parameter ω > 0 is chosen and the problem (16)–(17) is replaced by the free minimization problem

$$ \varphi_{\omega}(c)=\lvert\mathscr{A}c-r\rvert_{\mathbb{R}^{nmM+l_{dyn}}}^{2}+\omega\lvert\mathscr{C}c\rvert^{2}_{\mathbb{R}^{k(n-1)}}\rightarrow\min! $$

It is known that the minimizer cω of φω converges towards the solution of (16)–(17) as \(\omega \rightarrow \infty \) (cf. [9, Section 12.1.5]). Two different orderings of the equations have been implemented. One is

$$ \mathscr{G}=\left[\begin{array}{c} \omega\mathscr{C}\\ \mathscr{A} \end{array}\right],\quad\bar{r}=\left[\begin{array}{c} 0\\ r \end{array}\right] $$

while the other uses a block-bidiagonal structure as is common for collocation methods for ODEs, cf. [1]. It is known that the order of the equations in the weighting method may have a large impact on the accuracy of the solutions [16]. In our test examples, however, we did not observe a difference in the behavior of the two orderings.
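
Schematically, the first ordering leads to the following dense sketch; note that stacking \(\omega {\mathscr{C}}\) weights the constraint residual by ω2 in the functional, so matching φω literally would call for \(\sqrt {\omega }\):

```cpp
#include <Eigen/Dense>

// Weighting method, dense sketch: solve the stacked least-squares problem
// with G = [omega*C; A] and rbar = [0; r], minimizing
// |A c - r|^2 + omega^2 |C c|^2.
Eigen::VectorXd solveByWeighting(const Eigen::MatrixXd& A,
                                 const Eigen::MatrixXd& C,
                                 const Eigen::VectorXd& r, double omega) {
    Eigen::MatrixXd G(C.rows() + A.rows(), A.cols());
    G << omega * C, A;                            // first ordering from above
    Eigen::VectorXd rbar = Eigen::VectorXd::Zero(G.rows());
    rbar.tail(r.size()) = r;
    return G.colPivHouseholderQr().solve(rbar);   // least-squares solution
}
```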

The results of the weighting method depend substantially on the choice of the parameter ω. In order to obtain an accurate approximation of the exact solution c of the problem (16)–(17), a large value of ω should be used (in the absence of rounding errors). However, if ω becomes too large, the algorithm may lack numerical stability. A discussion of this topic is given in [16]. In particular, it turns out that the algorithm used for the QR decomposition and the pivoting strategies have a strong influence on the success of this method. In our implementation, we use the sparse QR implementation of [8]. On the other hand, an accuracy of the solution far below the approximation error of xπ is not necessary. Therefore, a number of experiments have been performed in order to gain some insight into what reasonable choices might be.

Experiment 1

Influence of the choice of the weighting parameter ω

We use Example 2. Two sets of parameters are selected: (i) N = 5, n = 160 and (ii) N = 20, n = 20. Choice (i) corresponds to low degree polynomials on a correspondingly large number of subintervals, while (ii) uses higher degree polynomials on a correspondingly small number of subintervals. Both cases have been selected according to [11, Table 20] in such a way that a high accuracy can be obtained while, at the same time, the problem conditioning has only a small influence. The other parameters chosen in this experiment are M = N + 1, Gauss-Legendre collocation nodes, and Legendre polynomials as basis functions. The error in dependence on ω is measured both with respect to the exact solution and with respect to a reference solution obtained by the direct solution method. The results are provided in Tables 1 and 2. The results for Example 3 below are quite similar. They indicate that the optimal ω may vary considerably depending on the problem parameters. However, the accuracy against the exact solution is rather insensitive to ω.

Table 1 Influence of parameter ω for the constraints in Example 2 using N = 5 and n = 160
Table 2 Influence of parameter ω for the constraints in Example 2 using N = 20 and n = 20

4.3 Deferred correction procedure

The direct solution method based on eliminating the constraints often has the deficiency of generating a considerable amount of fill-in in the intermediate matrices. An approach to overcome this situation has been proposed in [16]: the solutions of the weighting approach are iteratively enhanced by a defect correction process. This method is implemented in the form presented in [2, 3], called there the deferred correction procedure for constrained least-squares problems. As a stopping criterion, the estimate (i) in [3, p. 254] has been implemented. Additionally, a bound for the maximal number of iterations can be provided. Under reasonable conditions, at most two iterations should be sufficient for obtaining maximal accuracy (with respect to the sensitivity of the problem) for the discrete solution.
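
The following sketch conveys the flavor of the procedure as iterative refinement of the weighted problem, reusing a single factorization; the actual update and the stopping estimate (i) of [3, p. 254] are more elaborate, so this is schematic only:

```cpp
#include <Eigen/Dense>

// Deferred correction, schematic sketch: iterative refinement of the weighted
// problem with a fixed small iteration count standing in for the stopping test.
Eigen::VectorXd solveByDeferredCorrection(const Eigen::MatrixXd& A,
                                          const Eigen::MatrixXd& C,
                                          const Eigen::VectorXd& r,
                                          double omega, int maxIter = 2) {
    Eigen::MatrixXd G(C.rows() + A.rows(), A.cols());
    G << omega * C, A;
    Eigen::ColPivHouseholderQR<Eigen::MatrixXd> qr(G);    // factorize once
    Eigen::VectorXd c = Eigen::VectorXd::Zero(A.cols());
    for (int it = 0; it < maxIter; ++it) {
        Eigen::VectorXd res(G.rows());
        res.head(C.rows()) = -omega * (C * c);            // constraint rhs is 0
        res.tail(A.rows()) = r - A * c;
        c += qr.solve(res);                               // corrected solution
    }
    return c;
}
```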

The iterative solver using defect corrections may overcome the difficulties connected with a suitable choice of the parameter ω in the weighting method. According to Experiment 1, we would expect the optimal ω to be of the order of magnitude \(10^{-3}\) to \(10^{2}\), with an optimum around \(10^{-2}\). This is in contrast to [3], where the choice \(\omega \approx \varepsilon _{\text {mach}}^{-1/3}\) is recommended for the deferred correction algorithm. We test the performance of the deferred correction solver in the next experiment. Here, the tolerance in the convergence check is set to \(10^{-15}\). The iterations are considered not to converge if the convergence check has failed after two iterations.

Experiment 2

We check the performance of the deferred correction solver in dependence on the weight parameter ω. Both Examples 2 and 3 are used. The results are presented in Tables 3, 4, 5, and 6. They indicate that a larger value of ω seems to be preferable.

Table 3 Influence of the parameter ω on the accuracy of the discrete solution for Example 2 using N = 5 and n = 160. The error of the solution with respect to the exact solution (A) and with respect to a discrete reference solution obtained by a direct method (B) is given in the norms of \(L^{2}((0,5),{\mathbb {R}}^{7})\), \(L^{\infty }((0,5),{\mathbb {R}}^{7})\), and \({H_{D}^{1}}((0,5),{\mathbb {R}}^{7})\). Two iterations are applied
Table 4 Influence of the parameter ω on the accuracy of the discrete solution for Example 2 using N = 20 and n = 20
Table 5 Influence of the parameter ω on the accuracy of the discrete solution for Example 3 using N = 20 and n = 5
Table 6 Influence of the parameter ω on the accuracy of the discrete solution for Example 3 using N = 5 and n = 20

5 Performance of the linear solvers

In this section, we intend to provide some insight into the behavior of the linear solvers. This concerns both the accuracy as well as the computational resources (computation time, memory consumption). All these data are highly implementation dependent. Also the hardware architecture plays an important role.

The linear solvers have been implemented using the standard strategy of subdividing them into a factorization step and a solve step. The price to pay is a larger memory consumption. However, their use in the context of, e.g., a modified Newton method may decrease the computation time considerably.

The tests have been run on a Dell Latitude E5550 laptop running Linux. While the program is purely sequential, the MKL library may use shared-memory parallel versions of its BLAS and LAPACK routines. The CPU of the machine is an Intel(R) Core(TM) i7-5600U CPU @ 2.60GHz providing two cores, each of them capable of hyperthreading. For the test runs, CPU throttling has been disabled such that all cores ran at roughly 3.2 GHz.

The parameter for the weighting solver is ω = 1, while the corresponding parameter for the deferred correction solver is \(\omega =\epsilon _{\text {mach}}^{-1/3}\approx 1.65\times 10^{5}\). These parameters have been chosen since they seem to be best suited to the examples. The test cases (combinations of N and n) have been selected by choosing the best combinations in [11, Tables 20 and 21], respectively.

Experiment 3

First, we consider Example 2. For all values of N, M = N + 1 Gauss-Legendre nodes have been used. The characteristics of the test cases using Legendre basis functions are provided in Table 7. Owing to the special properties of the Legendre polynomials, the matrix \({\mathscr{C}}\) representing the constraints is extremely sparse, featuring only three nonzero elements per row. The computational results are shown in Table 8. In the next computations, the Chebyshev basis has been used, which leads to a slightly denser matrix \({\mathscr{C}}\). The results are provided in Tables 9 and 10.

Table 7 Case characteristics for Experiment 3 using the Legendre basis
Table 8 Computing times, permanent workspace needed, and error for the cases described in Table 7. The computing times are provided in milliseconds
Table 9 Case characteristics for Experiment 3 using the Chebyshev basis
Table 10 Computing times, permanent workspace needed, and error for the cases described in Table 9

The previous example is an initial value problem. This structure may have consequences for the performance of the linear solvers. Therefore, in the next experiment, we consider a boundary value problem.

Experiment 4

We repeat Experiment 3 with Example 3. The problem characteristics and computational results are provided in Tables 11, 12, 13, and 14. It should be noted that the deferred correction solver returned normally (tolerance \(10^{-15}\) as before) after at most two iterations in all cases. However, in some cases, the results are completely off. This happens, for example, in Tables 12 and 14, cases 1 and 2, for \(\boldsymbol {{\varPhi }}_{\pi ,M}^{C}\).

Table 11 Case characteristics for Experiment 4 using the Legendre basis
Table 12 Computing times, permanent workspace needed, and error for the cases described in Table 11
Table 13 Case characteristics for Experiment 4 using the Chebyshev basis
Table 14 Computing times, permanent workspace needed, and error for the cases described in Table 13

It should be noted that a considerable amount of memory for the QR factorizations is consumed by the internal representation of the Q-factor in SPQR. This can be avoided if the factorization and solution steps are interwoven.

6 Sensitivity of boundary condition weighting

As is already known from boundary value problems for ODEs and index-1 DAEs, the scaling of the boundary conditions poses a special problem; here, this concerns the inclusion of the boundary conditions (2). Their scaling is independent of the scaling of the DAE (1). Therefore, it seems reasonable to provide an additional possibility for scaling the boundary conditions. We decided to enable this by introducing an additional parameter α to be chosen by the user. Thus, Φ from (5) is replaced by the functional

$$ \tilde{\boldsymbol{{\varPhi}}}(x)={{\int}_{a}^{b}}|A(t)(Dx)'(t)+B(t)x(t)-q(t)|^{2}\mathrm{d}t +\alpha |G_{a}x(a)+G_{b}x(b) -d|^{2}. $$

Analogously, the discretized versions \(\boldsymbol {{\varPhi }}_{\pi ,M}^{R}\), \(\boldsymbol {{\varPhi }}_{\pi ,M}^{I}\) and \(\boldsymbol {{\varPhi }}_{\pi ,M}^{C}\) are replaced by their counterparts \(\tilde {\boldsymbol {{\varPhi }}}_{\pi ,M}^{R}\), \(\tilde {\boldsymbol {{\varPhi }}}_{\pi ,M}^{I}\) and \(\tilde {\boldsymbol {{\varPhi }}}_{\pi ,M}^{C}\) with weighted boundary conditions. The convergence theorems will hold true for these modifications of the functional, too.
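
In the discrete problem (16), this modification is inexpensive: only the boundary block row \([{{\varGamma }}_{a}\ 0\ {\cdots }\ 0\ {{\varGamma }}_{b}]\) of \({\mathscr{A}}\) and the entry d of r are scaled by \(\sqrt {\alpha }\). A sketch with a hypothetical helper:

```cpp
#include <Eigen/Dense>
#include <cmath>

// Boundary-condition weighting: scaling the last l_dyn rows of the discrete
// matrix and of r by sqrt(alpha) realizes the factor alpha in tilde-Phi,
// since |s v|^2 = alpha |v|^2 for s = sqrt(alpha).
void weightBoundaryRows(Eigen::MatrixXd& A, Eigen::VectorXd& r,
                        int ldyn, double alpha) {
    const double s = std::sqrt(alpha);
    A.bottomRows(ldyn) *= s;
    r.tail(ldyn) *= s;
}
```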

Experiment 5

Influence of α on the accuracy

We use the example and settings of Experiment 1. The results are provided in Table 15.

Table 15 Influence of weight parameter α for the boundary conditions in Example 2

Experiment 6

Influence of α on the accuracy

We repeat the previous experiment with Example 3. The discretization parameters are (i) N = 5, n = 20 and (ii) N = 20, n = 5. All other settings correspond to those of Experiment 5. The results are presented in Table 16.

Table 16 Influence of weight parameter α for the boundary conditions in Example 3

The results of Experiments 5 and 6 indicate that the final accuracy is rather insensitive to the choice of α. It should be noted that the coefficient matrices in Examples 2 and 3 are well-scaled.

7 Final remarks

In summary, we investigated questions related to an efficient and reliable realization of a least-squares collocation method. These questions are particularly important since a higher index DAE is an essentially ill-posed problem in the naturally given spaces, which is why we must be prepared for highly sensitive discrete problems. In Part 1, in order to obtain an overall procedure that is as robust as possible, we provided criteria leading to a robust selection of the collocation points and of the basis functions, the latter choice also affecting the shape of the resulting discrete problem. We refer to the corresponding final remarks and conclusions in [11].

A critical ingredient of the implementation of the method is the algorithm used for the solution of the discrete linear least-squares problem. Given the expected ill conditioning of the least-squares problem, a QR factorization with column pivoting must lie at the heart of the algorithm. At the same time, the sparsity structure must be exploited as far as possible. In our tests, the direct elimination solver seems to be the most robust one. With respect to efficiency and accuracy, the deferred correction solver is preferable. However, it failed in certain tests.

The results for M = N + 1 are not much different from those obtained for larger M, for which we do not yet have an explanation.

In conclusion, we note that earlier implementations, among others the one from the very first paper on this matter [13], which started from proven ingredients of ODE codes, are, from today's point of view and experience, rather crude versions of least-squares collocation. Nevertheless, the test results computed with them were already very impressive. This strengthens our belief that a careful implementation of the method gives rise to a very efficient solver for higher index DAEs.

The algorithms have been implemented in C++11. All computations have been performed on a laptop running OpenSuSE Linux, release Leap 15.1, using the GNU g++ compiler (version 7.5.0) [15], the Eigen matrix library (version 3.3.7) [10], SuiteSparse (version 5.6.0) [7], in particular its sparse QR factorization [8], and Intel® MKL (version 2019.5-281), all in double precision with a rounding unit of \(\epsilon _{\text {mach}}\approx 2.22\times 10^{-16}\). The code is optimized using optimization level -O3.