1 Introduction

The reconstruction of an audio signal with missing or clipped samples is a classical problem in signal processing that has been widely discussed in the specialized literature; see [13].

In this paper, we report experiments with a method inspired by computational topology, namely the homotopy continuation method, used to enhance the standard recovery of speech recordings based on \(\ell ^1\)-minimization [46].

To formulate the problem precisely, we consider the following nonparametric observation model:

$$y = \varTheta s + e \in \mathbb {R}^P$$

where \(s \in \mathbb {R}^N\) is the speech signal to recover, \(e \in \mathbb {R}^P\) is a noise vector, and \(\varTheta \in \mathbb {R}^{P \times N}\) models the acquisition device. This model is nowadays complemented by the additional assumption of sparsity, which refers to the fact that many natural signals can be expanded (using a suitable dictionary \(\varTheta \)) with only a few nonzero coefficients. We assume a compressed sensing scenario in which the operator \(\varTheta \) could be a realization of a random Gaussian, Bernoulli, or partial Fourier matrix satisfying the restricted isometry property (RIP) [7]. However, given the special characteristics of natural signals such as speech recordings, which usually consist of sets of distinct components such as transients and harmonics oriented in time and frequency, we use for the proposed method a Gabor frame generated by the Alltop sequences, as proposed in [8, 9].

2 Gabor Frames and \(\ell ^1\)-Minimization

Frames \((g_i)_{i \in I}\) generalize the idea of a basis in a Hilbert space \({\varvec{H}}\): they are indexed families such that the so-called frame operator S

$$\begin{aligned} Sf= \sum _{i\in I}{ \langle f,g_i \rangle g_i} \end{aligned}$$
(1)

is invertible. The main tool for time-frequency analysis is the Short-Time Fourier Transform, defined for functions \(f,g \in {{\varvec{L}}^2}({\mathbb R}^d)\) at \(\lambda = (\alpha ,\beta ) \in \mathbb {R}^{2d} \) by

$$\begin{aligned} V_gf(\lambda ) = V_gf(\alpha , \beta ) = { \langle f, M_\beta T_\alpha g \rangle } = \langle f, \pi (\lambda ) g \rangle \end{aligned}$$
(2)

where \( T_\alpha f(t)=f(t-\alpha )\) is the translation (time shift) and \(M_\beta f(t)=e^{2\pi i\beta \cdot t}f(t) \) is the modulation (frequency shift). The operators \(\pi (\lambda ):= M_{\beta } T_{\alpha }\) are called time-frequency shifts, and the set \(\varLambda =\{\lambda ; \, \, \lambda = (\alpha ,\beta ) \in {{ {{\mathbb {R}^d}}\times {\widehat{{\mathbb R}}^d}}}\}\) is a lattice [11]. The Gabor system \(\mathcal{{G}}(g,\varLambda )=\{\pi (\lambda )g; \, \, \lambda \in \varLambda \}\) over the lattice \(\varLambda \), consisting of the translated and modulated versions of one atom g, is a frame for the space \( L^{2}({\mathbb {R}^d})\) if and only if there exist \(0< A \le B < \infty \) (frame bounds) with

$$\begin{aligned} A||f||^{2}\le \sum _{\lambda \in \varLambda }| \langle f, \pi (\lambda ) g \rangle |^{2}\le B||f||^{2} \quad \text{ for } \text{ every } f\in L^{2}({\mathbb {R}^d}). \end{aligned}$$
(3)

To construct the Gabor frames, we use the Alltop sequences, as proposed in [8].
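As an illustration of the construction, the following is a minimal sketch in a finite, cyclic setting, which is an assumption made here since the text does not spell out the discretization. It uses the standard Alltop sequence \(g_n = p^{-1/2} e^{2\pi i n^3/p}\) for a prime \(p \ge 5\), builds all \(p^2\) cyclic time-frequency shifts of it, and evaluates the frame operator (1). The function name `alltop_gabor_frame` is hypothetical.

```python
import numpy as np

def alltop_gabor_frame(p):
    """Return a (p, p**2) matrix whose columns are the cyclic
    time-frequency shifts M^l T^k g of the Alltop window g.
    Assumes p is a prime >= 5 (not checked here)."""
    n = np.arange(p)
    # Unit-norm Alltop sequence; n**3 is reduced mod p to keep the phase exact.
    g = np.exp(2j * np.pi * ((n ** 3) % p) / p) / np.sqrt(p)
    atoms = []
    for k in range(p):                      # cyclic translation: (T^k g)[n] = g[n - k]
        shifted = np.roll(g, k)
        for l in range(p):                  # modulation: (M^l f)[n] = e^{2 pi i l n / p} f[n]
            atoms.append(np.exp(2j * np.pi * l * n / p) * shifted)
    return np.stack(atoms, axis=1)

Phi = alltop_gabor_frame(7)
S = Phi @ Phi.conj().T                      # frame operator (1) in matrix form
gram = np.abs(Phi.conj().T @ Phi)
np.fill_diagonal(gram, 0.0)
coherence = gram.max()                      # largest inner product between distinct atoms
```

In this finite setting the frame operator is a multiple of the identity, so the frame bounds in (3) coincide (A = B = p), and the inner products between distinct atoms do not exceed \(1/\sqrt{p}\) in modulus.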

To recover an approximation of the signal s, a standard method is basis pursuit denoising, or \(\ell ^1\)-minimization [10]. This method uses the \(\ell ^1\) norm as a sparsity-enforcing penalty, which leads to an optimization problem in which the signal is recovered as a minimizer:

$$\begin{aligned} s_\rho \in \operatorname*{argmin}_{s \in \mathbb {R}^N} \frac{1}{2} ||y-\varTheta s||^2 + \rho ||s||_1 \end{aligned}$$
(4)

where the \(\ell ^1\) norm is defined as \( ||s||_1 = \sum _i \vert s_i \vert \).

The parameter \(\rho \) should be set in accordance with the noise level \(||e||\).

In the noiseless case, \(e=0\), we let \(\rho \rightarrow 0^+\) and solve the constrained basis pursuit problem \(s_{0^+} \in \operatorname{argmin}_{\varTheta s= y} ||s||_1.\)

To avoid technical difficulties, we further assume that \(\varTheta \) is such that \(s_\rho \) is uniquely defined.

In the following, for some index set \(I \subset \{1,\ldots ,N\}\), we denote by

$$\varTheta _I = (\theta _i)_{i \in I} \in \mathbb {R}^{P \times \vert I \vert }$$

the sub-matrix obtained by extracting the columns \(\theta _i \in \mathbb {R}^P\) of \(\varTheta \) indexed by I. The support of a vector is \( supp(x) = \left\{ i \in \{1,\ldots ,N\} \;:\; x_i \ne 0 \right\} .\)

Using results from convex analysis, we obtain that \(s_\rho \) is a solution of (4) if and only if

$$ \left\{ \begin{array}{l} (C1) \qquad \varTheta _I^*( y-\varTheta _I s_{\rho ,I} ) = \rho sign(s_{\rho ,I}), \\ (C2) \qquad || \varTheta _{J}^*( y-\varTheta _I s_{\rho ,I} ) ||_\infty \le \rho \end{array} \right. $$

where \(I=supp(s_\rho )\) and \(J = I^c\) is its complement.
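Conditions (C1) and (C2) can be verified numerically for any candidate vector. The sketch below assumes a real-valued \(\varTheta \), so that \(\varTheta ^*\) is the transpose; the helper names `check_optimality` and `soft_threshold` are hypothetical. It uses the special case \(\varTheta = \mathrm{Id}\), where the minimizer of (4) is known in closed form as the soft-thresholding of y.

```python
import numpy as np

def check_optimality(Theta, y, s, rho, tol=1e-8):
    """Return (c1, c2): whether s satisfies (C1) and (C2) for problem (4).
    Assumes real-valued Theta, y and s."""
    corr = Theta.T @ (y - Theta @ s)        # Theta^* (y - Theta_I s_I), since s vanishes off I
    I = np.flatnonzero(s)                   # support of s
    J = np.setdiff1d(np.arange(s.size), I)  # complement of the support
    c1 = bool(np.allclose(corr[I], rho * np.sign(s[I]), atol=tol))
    c2 = bool(np.all(np.abs(corr[J]) <= rho + tol))
    return c1, c2

def soft_threshold(x, rho):
    """Closed-form minimizer of (4) when Theta is the identity."""
    return np.sign(x) * np.maximum(np.abs(x) - rho, 0.0)

y = np.array([3.0, -0.5, 1.2, -2.0])
Theta = np.eye(4)
rho = 1.0
s_rho = soft_threshold(y, rho)
conditions = check_optimality(Theta, y, s_rho, rho)
```

Here both conditions hold for `s_rho`, whereas the unpenalized solution `s = y` violates (C1), because its residual correlation is zero rather than \(\rho \, sign(s)\).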

3 The Homotopy-Continuation Algorithm and Experiments

Topology helps us understand the different degrees of connectivity of a geometric object. Dealing with topological isomorphisms, or homeomorphisms, between continuous geometric objects is a hard task, and discretization strategies such as triangulations are employed to reduce the computational complexity of the topological interrogation. While homology treats the notion of a hole in linear-algebraic terms, homotopy deals with the same issues in purely combinatorial terms. Therefore, homotopy computation is in general much harder than homology computation, but in combination with numerical methods it can prove to be a useful tool for signal recovery and also for image recognition.

The proposed homotopy-based method for speech recovery gradually deforms a trivial initialization of the speech vector into the original speech vector through path tracking. The numerical homotopy procedure is based on the fact that the objective function undergoes a homotopy from the \(\ell ^2\) to the \(\ell ^1\) optimization as the algorithm progresses. The homotopy algorithm proceeds by iteratively computing the value \(s_\rho \).

We summarize the complete algorithm below:

Algorithm (figure a): \(\ell ^1\)-minimization with homotopy deformation
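Since the algorithm listing is not reproduced in the text, the following is a minimal sketch of a standard homotopy (path-following) scheme for problem (4), not necessarily the exact procedure used in the experiments. It assumes a real-valued \(\varTheta \) whose active sub-matrices \(\varTheta _I\) have full column rank; the function name `homotopy_l1` and its parameters are hypothetical. Starting from \(s = 0\) at \(\rho = ||\varTheta ^* y||_\infty \), it decreases \(\rho \) while keeping (C1) and (C2) satisfied, adding an index to the support when (C2) becomes tight and removing one when a coefficient crosses zero.

```python
import numpy as np

def homotopy_l1(Theta, y, rho_target, max_iter=10_000, tol=1e-12):
    """Follow the solution path of (4) from rho_max down to rho_target."""
    N = Theta.shape[1]
    s = np.zeros(N)
    c = Theta.T @ y                           # correlations Theta^* (y - Theta s) at s = 0
    rho = np.max(np.abs(c))                   # for rho >= rho_max the solution is s = 0
    if rho <= rho_target:
        return s
    active = [int(np.argmax(np.abs(c)))]      # support I starts with the most correlated atom
    for _ in range(max_iter):
        I = np.array(active)
        J = np.setdiff1d(np.arange(N), I)
        Theta_I = Theta[:, I]
        # Direction that keeps (C1) valid as rho decreases:
        # Theta_I^* Theta_I d_I = sign(c_I)
        d_I = np.linalg.solve(Theta_I.T @ Theta_I, np.sign(c[I]))
        a = Theta.T @ (Theta_I @ d_I)         # correlations evolve as c - gamma * a

        with np.errstate(divide="ignore", invalid="ignore"):
            # Steps at which an inactive atom reaches |c_j| = rho - gamma
            enter = np.concatenate([(rho - c[J]) / (1.0 - a[J]),
                                    (rho + c[J]) / (1.0 + a[J])])
            # Steps at which an active coefficient crosses zero
            leave = -s[I] / d_I
        enter[~(enter > tol)] = np.inf        # discard non-positive and undefined steps
        leave[~(leave > tol)] = np.inf
        gamma_in = enter.min() if enter.size else np.inf
        gamma_out = leave.min()
        gamma_stop = rho - rho_target         # step that reaches the requested rho
        gamma = min(gamma_in, gamma_out, gamma_stop)

        s[I] += gamma * d_I
        rho -= gamma
        if gamma == gamma_stop:
            return s
        if gamma_in < gamma_out:              # an atom enters the support
            active.append(int(J[np.argmin(enter) % J.size]))
        else:                                 # a coefficient leaves the support
            i_out = int(I[np.argmin(leave)])
            s[i_out] = 0.0
            active.remove(i_out)
        c = Theta.T @ (y - Theta @ s)         # recompute correlations for stability
    return s
```

Each iteration changes the support by one index, so the number of iterations grows with the number of nonzero coefficients of the solution. Letting `rho_target` tend to zero approaches the constrained basis pursuit solution in the noiseless case.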

For the numerical experiments, we used five speech signals s, each 2 to 5 s long, recorded with a microphone and sampled at 16 kHz. All signals were normalized, after which noise of level \(\sigma = 0.05 * norm(\varTheta *s)/sqrt(P)\) was applied. We used \(P = round(N/4)\), where \(N = size(s)\). The distorted measurements were defined by the expression \( y = \varTheta *s + \sigma *randn(P,1)\), as in [5]. These measurements were the input to our algorithm.
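The measurement setup above can be sketched as follows. Two choices in this sketch are assumptions and not taken from the text: the normalization divides by the maximum absolute value, and \(\varTheta \) is drawn as a random Gaussian matrix as a stand-in for the Gabor frame dictionary. The function name `simulate_measurements` is hypothetical.

```python
import numpy as np

def simulate_measurements(s, rng, noise_factor=0.05):
    """Return (Theta, y, sigma) with y = Theta s + sigma * noise and P = round(N / 4)."""
    s = s / np.max(np.abs(s))                 # assumed normalization (peak amplitude 1)
    N = s.size
    P = int(np.floor(N / 4 + 0.5))            # round half up, as in P = round(N/4)
    # Assumption: Gaussian sensing matrix instead of the Alltop Gabor frame
    Theta = rng.standard_normal((P, N)) / np.sqrt(P)
    clean = Theta @ s
    sigma = noise_factor * np.linalg.norm(clean) / np.sqrt(P)
    y = clean + sigma * rng.standard_normal(P)
    return Theta, y, sigma

rng = np.random.default_rng(0)
s = np.sin(2 * np.pi * 220 * np.arange(1600) / 16000)   # 0.1 s test tone at 16 kHz
Theta, y, sigma = simulate_measurements(s, rng)
```

With this scaling, \(\sigma \) equals 5% of the root-mean-square value of the noiseless measurements, so the noise level adapts to the energy of each signal.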

In Fig. 1, we display the first 6 iterations of the algorithm to visualize the homotopic progression towards the correct restoration. For clarity, only the first 2000 samples of the speech signal are shown.

Fig. 1. First 6 iterations of the homotopy algorithm (original in red, recovered in blue) (Color figure online)

Even though the algorithm provides a complete recovery of the original speech recording, a drawback is the large number of iterations. In our experiments, we recovered all five speech signals with a number of iterations of almost half the signal length, depending on the distortion applied. Compared with other \(\ell ^1\)-minimization methods such as iterative shrinkage-thresholding, proximal gradient, or augmented Lagrange multiplier methods, the homotopy achieves the best accuracy, although, as mentioned above, it takes longer to converge when the distortion is high. Since speech recognition is usually a sensitive task, the accuracy of the reconstruction makes us confident in the utility of the proposed algorithm.

4 Conclusions

In this report, we presented results on speech restoration using the basis pursuit algorithm in a sparse Gabor frame scenario, enhanced with a topology-inspired procedure called the homotopy continuation method. The method allows a complete recovery of a speech recording with missing or clipped samples, but at a high computational cost due to the large number of iterations required. Further parallelization of the algorithm is being considered by the authors.