Autonomous Robots

Volume 34, Issue 3, pp 217–232

The ManyEars open framework

Microphone array open software and open hardware system for robotic applications


  • François Grondin
    • IntRoLab – Intelligent, Interactive, Integrated Robotics Lab, Interdisciplinary Institute for Technological Innovation, Université de Sherbrooke
  • Dominic Létourneau
    • IntRoLab – Intelligent, Interactive, Integrated Robotics Lab, Interdisciplinary Institute for Technological Innovation, Université de Sherbrooke
  • François Ferland
    • IntRoLab – Intelligent, Interactive, Integrated Robotics Lab, Interdisciplinary Institute for Technological Innovation, Université de Sherbrooke
  • Vincent Rousseau
    • IntRoLab – Intelligent, Interactive, Integrated Robotics Lab, Interdisciplinary Institute for Technological Innovation, Université de Sherbrooke

DOI: 10.1007/s10514-012-9316-x

Cite this article as:
Grondin, F., Létourneau, D., Ferland, F. et al. Auton Robot (2013) 34: 217. doi:10.1007/s10514-012-9316-x


ManyEars is an open framework for microphone array-based audio processing. It consists of a sound source localization, tracking and separation system that can provide an enhanced speaker signal for improved speech and sound recognition in real-world settings. The ManyEars software framework is composed of a portable and modular C library, along with a graphical user interface for tuning the parameters and for real-time monitoring. This paper presents the integration of the ManyEars library with Willow Garage’s Robot Operating System. To facilitate the use of ManyEars on various robotic platforms, the paper also introduces the customized microphone board and sound card distributed as an open hardware solution for the implementation of robotic audition systems.


Keywords: Open source · Sound source localization · Sound source separation · Mobile robotics · USB sound card · Open hardware · Microphone array

1 Introduction

Autonomous robots must be able to perceive sounds from the environment in order to interact naturally with humans. Robots operate in noisy environments, and clear limitations are observed in such conditions when using only one or two microphones (Wolff et al. 2009). In that regard, a microphone array can enhance performance by allowing a robot to localize, track, and separate multiple sound sources.
Fig. 1

Software architecture of the ManyEars library

Using an array of eight microphones, ManyEars (Valin et al. 2006a, b; Valin et al. 2003) demonstrated that it can, simultaneously and in real-time, reliably localize and track up to four of the loudest sound sources in reverberant and noisy environments (Valin et al. 2006b). ManyEars can also reliably separate up to three sources in an adverse environment with a suitable signal-to-noise ratio improvement for speech recognition (Yamamoto et al. 2006, 2005). ManyEars needs at least four microphones to operate, and the number of microphones used influences the number of sources that can be processed. It has mostly been used with arrays of eight microphones, to match the maximum number of analog input channels on the sound cards used. ManyEars has been used on different platforms including Spartacus (Michaud et al. 2007), SIG2 (Yamamoto et al. 2005) and ASIMO (Yamamoto et al. 2006), and as a pre-processing module for improved speech recognition (Valin et al. 2007; Yamamoto et al. 2007, 2006; Yamamoto et al. 2005). Many components of ManyEars are also used in HARK (HRI-JP Audition for Robots with Kyoto University) (Nakadai et al. 2010), an open source real-time system that also integrates new localization techniques such as GEVD MUSIC (Generalized EigenValue Decomposition Multiple Signal Classification) and GSVD-MUSIC (Generalized Singular Value Decomposition Multiple Signal Classification) (Nakadai et al. 2012; Nakamura et al. 2012; Otsuka et al. 2011).

The first implementations of ManyEars and HARK both rely heavily on FlowDesigner (Létourneau et al. 2005; Valin et al. 2008), an open source data flow development environment used to build complex applications by combining small and reusable building blocks. To facilitate maintenance and portability, ManyEars is now implemented in C as a modular library, with no dependence on external libraries. The source code is available online (Grondin et al. 2012) under the GNU GPL license (Free Software Foundation, Inc. 2012). A Graphical User Interface (GUI), also available online (IntRoLab 2012), is used to display the tracked sound sources in real time and to facilitate configuration and tuning of the parameters of the ManyEars library. This paper presents these implementations and their integration with Willow Garage’s Robot Operating System (ROS) (Quigley et al. 2009).

To make use of the ManyEars library, a computer, a sound card and microphones are required. ManyEars can be used with commercially available sound cards and microphones. However, commercial sound cards present limitations for embedded robotic applications: they are usually expensive; they provide functionalities that are not required for robot audition, such as sound effects, integrated mixing, optical inputs/outputs, S/PDIF, MIDI and numerous analog outputs; and they require significant power and space. The EAR sensor has been proposed as an alternative (Bonnal et al. 2009), but it remains large and is strongly coupled to software which runs on a Field-Programmable Gate Array (FPGA). With ManyEars, computations are done on an onboard computer rather than embedded in the sound card, to facilitate portability and maintenance. To this end, the paper also introduces a customized microphone acquisition board and an 8-input sound card distributed as an open hardware alternative for robotic audition systems.

The paper is organized as follows. Sect. 2 presents the revised implementation of ManyEars as an open source C library. Sect. 3 introduces the GUI and explains ManyEars’ portability by presenting its integration with ROS. Sect. 4 describes ManyEars’ open hardware components, and Sect. 5 presents test cases illustrating the use of the implemented framework.

2 ManyEars

Figure 1 illustrates the software architecture of the ManyEars Library. It is composed of five modules: Preprocessing, Localization, Tracking, Separation and Postprocessing. These modules receive inputs and generate data using the Microphones, Potential Sources, Tracked Sources, Separated Sources and Postfiltered Sources data structures. In the following subsections, each of the five modules is described, along with the equations implemented in the code. Unless mentioned otherwise, the parameters provided were set empirically to be robust to environmental changes. More detailed explanations and justifications of these equations and parameters are available in (Valin et al. 2006b) (Preprocessing, Localization and Tracking) and in (Valin et al. 2004) (Separation and Postprocessing). Also note that in this section, the variables m, l and k stand for the microphone, frame and bin indexes, respectively.
Fig. 2

Block diagram of the preprocessing module

2.1 Preprocessing module

The Preprocessing Module uses a MicST (Microphone Signal Transform) data structure to transform the time-domain signal of each microphone (sampled at 48,000 samples/s) into weighted frequency frames, as shown in Fig. 2. The Preprocessor function splits the microphone signal in the time domain into individual frames of N = 1,024 samples. Each frame is multiplied by a power-complementary window and then transformed to the frequency domain with a Fast Fourier Transform (FFT), which leads to \(X_{m}^{l}[k]\). The MCRA (Minimum Controlled Recursive Averaging) function is used to estimate the spectrum of the stationary noise \((\lambda ^{s})_{m}^{l}[k]\) during silence periods (Cohen and Berdugo 2002). The estimates are initialized to zero at frame l = 0 (Eq. 1) and then updated recursively. The weighting factor \(\zeta _{m}^{l}[k]\) is then computed at each frequency bin according to Eqs. 2, 3, and 4. The variables \(\xi _{m}^{l}[k]\) and \((\lambda ^{r})_{m}^{l}[k]\) respectively represent the a priori Signal-to-Noise Ratio (SNR) estimate and the reverberation estimate (Ephraim and Malah 1984, 1985). The parameter \(\alpha _{d} = 0.1\) is the adaptation rate, \(\gamma = 0.3\) the reverberation decay for the room, and \(\delta = 1.0\) the level of reverberation. These parameters need to be adjusted to the environment.
$$\begin{aligned} (\lambda ^{r})_{m}^{0}[k] = (\lambda ^{s})_{m}^{0}[k] = \xi _{m}^{0}[k] = \zeta _{m}^{0}[k] = 0, \quad 0 \le m < M,\ 0 \le k < N \end{aligned}$$
$$\begin{aligned} (\lambda ^{r})_{m}^{l}[k] = \gamma (\lambda ^{r})_{m}^{l-1}[k] + \frac{(1-\gamma )}{\delta }\displaystyle \left|\zeta _{m}^{l-1}[k]X_{m}^{l-1}[k]\right|^2 \end{aligned}$$
$$\begin{aligned} \xi _{m}^{l}[k] = \displaystyle \frac{(1-\alpha _{d}) \displaystyle \left| \zeta _{m}^{l-1}[k] {X_{m}^{l-1}[k]}\right|^2 + \alpha _{d} \displaystyle \left|{X_{m}^{l}[k]}\right|^2}{(\lambda ^{r})_{m}^{l}[k] + (\lambda ^{s})_{m}^{l}[k]} \end{aligned}$$
$$\begin{aligned} \zeta _{m}^{l}[k] = \frac{\xi _{m}^{l}[k]}{\xi _{m}^{l}[k]+1} \end{aligned}$$

2.2 Localization module

Figure 3 illustrates the block diagram of the Localization Module. The Microphones data structure contains the Cartesian position (in meters) of each microphone in relation to the center of the array. A uniform unit sphere (with a 1 m radius) is generated at the initialization of the Sphere data structure. This sphere is recursively generated from a tetrahedron, for a total of 2,562 points. This resolution can be adjusted to satisfy real-time requirements. The delay between each pair of microphones for sound propagation from a source is precomputed at each point on the sphere during initialization of the Delays data structure, and stored in an array. Each delay corresponds to the direct path of sound, even though this assumption is affected by diffraction around the body of the robot. However, experiments show that the system still performs well as long as a few microphones capture the direct path (Valin et al. 2006b). The cross-correlation \(R_{m_1,m_2}^{l}(\tau )\) between microphones \(m_1\) and \(m_2\) is then computed for each new frame according to Eq. 5, with \(\tau \) representing the delay.
$$\begin{aligned} R_{m_1,m_2}^{l}(\tau ) = \sum _{k=0}^{N-1} { \frac{ \zeta _{m_1}^{l}[k] X_{m_{1}}^{l}[k] }{ | X_{m_{1}}^{l}[k] | } \frac{ \zeta _{m_2}^{l}[k] X_{m_{2}}^{l}[k]^* }{ | X_{m_{2}}^{l}[k] | } e^{\left( \frac{j 2 \pi k \tau }{N} \right)}} \end{aligned}$$
Fig. 3

Block diagram of the localization module

To speed up computations, Eq. 5 is evaluated with an Inverse Fast Fourier Transform (IFFT). Although the IFFT reduces the number of operations, this step remains one of the most computationally expensive parts of ManyEars. Moreover, since this operation is done for each pair of microphones, the complexity order is \(O(M(M-1)/2)\), where M is the number of microphones. A beamformer search is performed (Valin et al. 2006b) and implemented in the Beamformer function. Once Q potential sources are found, their positions and probabilities are stored in the Potential Sources data structure. The position (x, y, z) of each potential source \(q\) is represented by the observation vector \(\mathbf O _{q}^{l}\). The probability \(P^{l}_{q}\) for each potential source \(q\) to be a true source (and not a false detection) is computed according to Eq. 6. The variable \(E^{l}_{0}\) stands for the energy of the beamformer for the first potential source, and the constant \(E_{T} = 600\) represents the energy threshold, adjusted to the environment to find a good trade-off between false and missed source detections. Experiments showed that the energy of the first potential source is related to the confidence that it is a valid source, while this is not the case for the next potential sources (Valin et al. 2006b). For this reason, the probability depends on the energy for \(q=0\) and is set to constant values found empirically for the other sources (\(0 < q < Q\)). The probability for the first source is null when the energy is null, and goes to one as the energy goes to infinity. Note that these probabilities are independent and do not sum to one (\(\sum _{q=0}^{Q-1}{P^{l}_{q}} \not \equiv 1\), \(0 \le E^{l}_{0} < \infty \)).
$$\begin{aligned} P^{l}_{q} = {\left\{ \begin{array}{ll} (E^{l}_{0}/E_{T})^2 / 2,&q = 0, E^{l}_{0} \le E_{T} \\ 1 - (E^{l}_{0}/E_{T})^{-2} / 2&q = 0, E^{l}_{0} > E_{T} \\ 0.3&q = 1 \\ 0.16&q = 2 \\ 0.03&q = 3 \end{array}\right.} \end{aligned}$$

2.3 Tracking module

Figure 4 represents the block diagram of the tracking module. There is a particle filter for each tracked source, represented by the Filter functions. There are \(S\) tracked sources and filters, each made of \(F\) particles. Each tracked source is assigned a unique ID and a position. The ID of a source stays the same over time as long as the source is active.
Fig. 4

Block diagram of the tracking module

The state vector of each particle, \(\mathbf s _{s}^{l}(f) = [ (\mathbf x _{s}^{l}(f))^{T} \; (\dot{\mathbf{x }}_{s}^{l}(f))^{T} ]^{T}\), is composed of a \((x,y,z)\) position, \(\mathbf x _{s}^{l}(f)\), and a velocity, \(\dot{\mathbf{x }}_{s}^{l}(f)\), where \((.)^{T}\) denotes the transpose operator. The beamformer provides this module with an observation vector for each frame \(l\) and potential source \(q\), denoted by the variable \(\mathbf O ^{l}_{q}\). These observation vectors are concatenated in a single vector \(\mathbf O ^{l} = \left[ \mathbf O ^{l}_{0}, \dots , \mathbf O ^{l}_{Q-1} \right]\). Moreover, the vector \(\mathbf O ^{1:l} = \{ \mathbf O ^{i}, i = 1, \dots , l \}\) stands for the set of all observations over time from frame 1 to frame l.

During prediction, the position \(\mathbf x _{s}^{l}(f)\) and velocity \(\dot{\mathbf{x }}_{s}^{l}(f)\) of each particle f for source s are updated according to Eqs. 7 and 8. The parameters \(a_{s}^{l}(f)\) and \(b_{s}^{l}(f)\) stand for the damping and excitation terms, respectively. They are obtained with Eqs. 9 and 10. The parameters \(\alpha _{s}^{l}(f)\) and \(\beta _{s}^{l}(f)\) are chosen according to the state of each particle. These values and the proportion of particles associated with each state are provided in Table 1. The parameter \(\Delta T = 0.04\) stands for the time interval between updates. These three parameters are set to optimize tracking for both static and moving sound sources, and are robust to environmental changes since the source dynamics are independent of reverberation and noise. The variable \(F_{x}\) represents a normally distributed random variable.
$$\begin{aligned} \mathbf x _{s}^{l}(f) = \mathbf x _{s}^{l-1}(f) + \Delta T \dot{\mathbf{x }}_{s}^{l}(f) \end{aligned}$$
$$\begin{aligned} \dot{\mathbf{x }}_{s}^{l}(f) = a_{s}^{l}(f) \dot{\mathbf{x }}_{s}^{l-1}(f) + b_{s}^{l}(f) F_{x} \end{aligned}$$
$$\begin{aligned} a_{s}^{l}(f) = e^{-\alpha _{s}^{l}(f) \Delta T} \end{aligned}$$
$$\begin{aligned} b_{s}^{l}(f) = \beta _{s}^{l}(f) \sqrt{1 - a_{s}^{l}(f)^2} \end{aligned}$$
Table 1

Particle parameters: the values of \(\alpha _{s}^{l}(f)\) and \(\beta _{s}^{l}(f)\) and the proportion of particles for each particle state (e.g., constant velocity)

The position of each particle is normalized such that each particle stays on the unit sphere. The velocity is also normalized to ensure it is tangent to the sphere surface.

Each observation \(\mathbf O ^{l}_{q}\) is either a false detection (hypothesis \(H0\)), a new source not yet being tracked (hypothesis \(H2\)), or a match to one of the sources currently tracked (hypothesis \(H1\)). The function \(g_{c}^{l}(q)\) shown in Eq. 11 maps each observation \(\mathbf O ^{l}_{q}\) to a hypothesis. The vector \(\mathbf g _{c}^{l}\) introduced in Eq. 12 concatenates the mapping functions of all observations in a vector.
$$\begin{aligned} g_{c}^{l}(q) = {\left\{ \begin{array}{ll} -2,&H0: \mathbf O ^{l}_{q} \text{ is a false detection}\\ -1,&H2: \mathbf O ^{l}_{q} \text{ is a new source} \\ 0,&H1: \mathbf O ^{l}_{q} \rightarrow \text{ source } s = 0 \\ \vdots & \\ S-1,&H1: \mathbf O ^{l}_{q} \rightarrow \text{ source } s = S - 1 \\ \end{array}\right.} \end{aligned}$$
$$\begin{aligned} \mathbf g _{c}^{l} = \left\{ g_{c}^{l}(q), q = 0, \dots , Q-1 \right\} \end{aligned}$$
The variable \(c\) stands for the index of each realisation of the vector \(\mathbf g _{c}^{l}\). There are \((S+2)^Q\) possible realisations, as demonstrated in Eq. 13.
$$\begin{aligned} \begin{array}{lcrrcccrcrc} \mathbf g _{0}^{l}&= \{&-2&,&\dots&,&-2&,&-2&\} \\ \mathbf g _{1}^{l}&= \{&-2&,&\dots&,&-2&,&-1&\} \\ \mathbf g _{2}^{l}&= \{&-2&,&\dots&,&-2&,&0&\} \\ \mathbf g _{3}^{l}&= \{&-2&,&\dots&,&-2&,&1&\} \\ {\vdots }&\,&\,&\,&\,&\,&{\vdots }&\,&\,&\,&\\ \mathbf g _{S}^{l}&= \{&-2&,&\dots&,&-2&,&S - 2&\} \\ \mathbf g _{S+1}^{l}&= \{&-2&,&\dots&,&-2&,&S - 1&\} \\ \mathbf g _{S+2}^{l}&= \{&-2&,&\dots&,&-1&,&-2&\} \\ \mathbf g _{S+3}^{l}&= \{&-2&,&\dots&,&-1&,&-1&\} \\ {\vdots }&\,&\,&\,&\,&\,&{\vdots }&\,&\,&\,&\\ \mathbf g _{(S+2)^Q - 2}^{l}&= \{&S-1&,&\dots&,&S-1&,&S-2&\} \\ \mathbf g _{(S+2)^Q - 1}^{l}&= \{&S-1&,&\dots&,&S-1&,&S-1&\} \\ \end{array} \end{aligned}$$
The expression \(P(\mathbf g _{c}^{l}|\mathbf O ^{1:l})\) stands for the probability of a realisation \(\mathbf g _{c}^{l}\) given the observations \(\mathbf O ^{1:l}\). Equation 14 introduces an alternative representation derived from Bayes’ rule.
$$\begin{aligned} P(\mathbf g _{c}^{l}|\mathbf O ^{1:l}) = \frac{P(\mathbf O ^{1:l}|\mathbf g _{c}^{l}) P(\mathbf g _{c}^{l})}{\displaystyle \sum _{c=0}^{(S+2)^{Q}-1}{P(\mathbf O ^{1:l}|\mathbf g _{c}^{l}) P(\mathbf g _{c}^{l})}} \end{aligned}$$
Conditional independence is assumed for the observations given the mapping function (\(P(\mathbf O ^{1:l}|\mathbf g _{c}^{l})\)), which leads to the decomposition expressed by Eq. 15. Independence of mapping functions is also assumed, and therefore the a priori probability \(P(\mathbf g _{c}^{l})\) is decomposed as shown in Eq. 16.
$$\begin{aligned} P(\mathbf O ^{1:l}|\mathbf g _{c}^{l}) = \prod _{q=0}^{Q-1} p(\mathbf O _{q}^{1:l}|g_{c}^{l}(q)) \end{aligned}$$
$$\begin{aligned} P(\mathbf g _{c}^{l}) = \prod _{q=0}^{Q-1} p(g_{c}^{l}(q)) \end{aligned}$$
The probability distribution of the observations given the hypothesis is uniform for a false detection or a new source, and otherwise depends on the previous weights of the particle filter (\(p(\mathbf x ^{l-1}_{s}(f)|\mathbf O ^{1:l-1}_{q})\)) and on the probability density of an observation given each particle position (\(p(\mathbf O _{q}^{l}|\mathbf x _{s}^{l}(f))\)), as shown in Eq. 17.
$$\begin{aligned} \begin{array}{l} p(\mathbf O _{q}^{1:l}|g_{c}^{l}(q)) = \\ {\left\{ \begin{array}{ll} 1 / 4 \pi&g_{c}^{l}(q) = -2\\ 1 / 4 \pi&g_{c}^{l}(q) = -1\\ \displaystyle \sum _{f=0}^{F-1}{ \left( \begin{array}{l} p(\mathbf x ^{l-1}_{g_{c}^{l}(q)}(f)|\mathbf O ^{1:l-1}_{q}) \times \\ p(\mathbf O _{q}^{l}|\mathbf x _{g_{c}^{l}(q)}^{l}(f)) \end{array} \right) }&0 \le g_{c}^{l}(q) < S \end{array}\right.} \end{array} \end{aligned}$$
The a priori probability \(p(g_{c}^{l}(q))\) shown in Eq. 18 depends on the a priori probabilities that a new source appears and that there is a false detection. These values are represented by the variables \(P_{new} = 0.005\) and \(P_{false} = 0.05\). The probabilities that a source is observable and is a true source are represented by the variables \(P(Obs_{s}^{l} | \mathbf O ^{1:l-1})\) and \(P^{l}_{q}\), respectively.
$$\begin{aligned} p(g_{c}^{l}(q)) = {\left\{ \begin{array}{ll} (1 - P^{l}_{q}) P_{false}&g_{c}^{l}(q) = -2\\ P^{l}_{q} P_{new}&g_{c}^{l}(q) = -1\\ P^{l}_{q} P(Obs_{s}^{l} | \mathbf O ^{1:l-1})&0 \le g_{c}^{l}(q) < S \end{array}\right.} \end{aligned}$$
Equation 19 shows that, given the previous observations \(\mathbf O ^{1:l-1}\), the probability that the source s is observable (\(P(Obs_{s}^{l}|\mathbf O ^{1:l-1})\)) depends on the probability that the source exists (\(P(E_{s}^{l}|\mathbf O ^{1:l-1})\)) and is active (\(P(A_{s}^{l}|\mathbf O ^{1:l-1})\)).
$$\begin{aligned} P(Obs_{s}^{l}|\mathbf O ^{1:l-1}) = P(E_{s}^{l}|\mathbf O ^{1:l-1})P(A_{s}^{l}|\mathbf O ^{1:l-1}) \end{aligned}$$
The probability that a source is active, \(P(A_{s}^{l}|\mathbf O ^{1:l-1})\), is obtained with a first order Markov process, as shown in Eq. 20. The transition probabilities between states are given by \(P(A_{s}^{l}|A_{s}^{l-1}) = 0.7\) and \(P(A_{s}^{l}|\lnot A_{s}^{l-1}) = 0.3\), which respectively represent the probabilities that a source remains active and becomes active.
$$\begin{aligned} \begin{array}{ll} P(A_{s}^{l}|\mathbf O ^{1:l-1}) =&P(A_{s}^{l}|A_{s}^{l-1})P(A_{s}^{l-1}|\mathbf O ^{1:l-1}) + \\&P(A_{s}^{l}|\lnot A_{s}^{l-1})(1 - P(A_{s}^{l-1}|\mathbf O ^{1:l-1})) \end{array} \end{aligned}$$
The active and inactive states are assumed to be equiprobable, and therefore the probability of activity \(P(A_{s}^{l-1}|\mathbf O ^{1:l-1})\) is obtained with Bayes’ rule in Eq. 21.
$$\begin{aligned} \begin{array}{l} P(A_{s}^{l-1}|\mathbf O ^{1:l-1}) = \\ \left( 1 + \displaystyle \frac{(1 - P(A_{s}^{l-1}|\mathbf O ^{1:l-2}))(1 - P(A_{s}^{l-1}|\mathbf O ^{l-1}))}{P(A_{s}^{l-1}|\mathbf O ^{1:l-2})P(A_{s}^{l-1}|\mathbf O ^{l-1})} \right)^{-1} \end{array} \end{aligned}$$
Equation 22 defines \(P(A_{s}^{l-1}|\mathbf O ^{l-1})\), the probability that a source is active given the current observation, with parameters \(P_{b} = 0.15\) and \(P_{m} = 0.85\).
$$\begin{aligned} P(A_{s}^{l-1}|\mathbf O ^{l-1}) = P_{b} + P_{m}P_{s}^{l-1} \end{aligned}$$
The expression \(P_{s}^{l-1}\) stands for the probability that the tracked source \(s\) is observed, which is obtained from the sum of the probabilities that this source is assigned to each potential source \(q\) (\(P_{s}^{l-1}(q)\)), as expressed by Eq. 23.
$$\begin{aligned} P_{s}^{l-1} = \sum _{q = 0}^{Q-1}{P_{s}^{l-1}(q)} \end{aligned}$$
Setting the a priori probability that a source exists but is not observed to \(P_{o} = 0.5\), the probability that the source exists, \(P(E_{s}^{l}|\mathbf O ^{1:l-1})\), is obtained with Eq. 24.
$$\begin{aligned} P(E_{s}^{l}|\mathbf O ^{1:l-1}) \!=\! P_{s}^{l-1} \!+\! \frac{(1 \!-\! P_{s}^{l-1})P_{o}P(E_{s}^{l-1}|\mathbf O ^{1:l-2})}{1 \!-\! (1 \!-\! P_{o})P(E_{s}^{l-1}|\mathbf O ^{1:l-2})} \end{aligned}$$
The probabilities that there is a false detection (\(P_{H_{0}}^{l}(q)\)), that the source \(s\) is observed (\(P_{s}^{l}(q)\)) and that a new source is observed (\(P_{H_{2}}^{l}(q)\)) are derived from \(P(\mathbf g _{c}^{l}|\mathbf O ^{1:l})\), as shown in Eqs. 25, 26 and 27. The expression \(\delta _{x,y}\) stands for the Kronecker delta. These probabilities are normalized for each value of q.
$$\begin{aligned} P_{H_{0}}^{l}(q) = \sum _{c = 0}^{C-1}{\delta _{-2,g_{c}^{l}(q)}P(\mathbf g _{c}^{l}|\mathbf O ^{1:l})} \end{aligned}$$
$$\begin{aligned} P_{s}^{l}(q) = \sum _{c = 0}^{C-1}{\delta _{s,g_{c}^{l}(q)}P(\mathbf g _{c}^{l}|\mathbf O ^{1:l})} \end{aligned}$$
$$\begin{aligned} P_{H_{2}}^{l}(q) = \sum _{c = 0}^{C-1}{\delta _{-1,g_{c}^{l}(q)}P(\mathbf g _{c}^{l}|\mathbf O ^{1:l})} \end{aligned}$$
The weight of each particle f is given by the expression \(p(\mathbf x _{s}^{l}(f)|\mathbf O ^{1:l})\), and is obtained recursively with Eq. 28.
$$\begin{aligned} p(\mathbf x _{s}^{l}(f)|\mathbf O ^{1:l}) = \frac{p(\mathbf x _{s}^{l}(f)|\mathbf O ^{l})p(\mathbf x _{s}^{l-1}(f)|\mathbf O ^{1:l-1})}{\displaystyle \sum _{f=0}^{F-1}{p(\mathbf x _{s}^{l}(f)|\mathbf O ^{l})p(\mathbf x _{s}^{l-1}(f)|\mathbf O ^{1:l-1})}} \end{aligned}$$
The observations may or may not match the tracked sources. The event \(I_{s}^{l}\) occurs when the source s is observed at frame l. The probability of this event is equal to the expression \(P_{s}^{l}\). The expression \(p(\mathbf x _{s}^{l}(f)|\mathbf O ^{l})\) stands for the probability the observation \(\mathbf O ^{l}\) matches the particle \(\mathbf x _{s}^{l}(f)\), and is obtained in Eq. 29.
$$\begin{aligned} \begin{array}{ll} p(\mathbf x _{s}^{l}(f)|\mathbf O ^{l}) =&p(\lnot I_{s}^{l})p(\mathbf x _{s}^{l}(f)|\mathbf O ^{l},\lnot I_{s}^{l})\\&+ p(I_{s}^{l})p(\mathbf x _{s}^{l}(f)|\mathbf O ^{l},I_{s}^{l}) \end{array} \end{aligned}$$
When the event \(I_{s}^{l}\) does not occur, all particles have the same probability \((1/F)\) of matching the observations. The probability that the particle \(f\) matches the observation \(\mathbf O ^{l}\) (\(p(\mathbf x _{s}^{l}(f) | \mathbf O ^{l}, I_{s}^{l})\)) is obtained from the probability that each potential source \(\mathbf O _{q}^{l}\) matches the particle \(f\) (\(p(\mathbf O _{q}^{l}|\mathbf x _{s}^{l}(f))\)). The denominator is needed to normalize the expression, as shown in Eq. 30.
$$\begin{aligned} \begin{array}{ll} p(\mathbf x _{s}^{l}(f)|\mathbf O ^{l}) =&(1 - P_{s}^{l}) ( 1 / F ) \\&+P_{s}^{l} \left(\frac{\displaystyle \sum _{q=0}^{Q-1}{P_{s}^{l}(q)p(\mathbf O _{q}^{l}|\mathbf x _{s}^{l}(f))}}{\displaystyle \sum _{f=0}^{F-1}{\displaystyle \sum _{q=0}^{Q-1}{P_{s}^{l}(q)p(\mathbf O _{q}^{l}|\mathbf x _{s}^{l}(f))}}}\right) \end{array} \end{aligned}$$
The expression \(p(\mathbf O _{q}^{l}|\mathbf x _{s}^{l}(f))\) is obtained with the sum of Gaussians shown in Eq. 31, where the variable \(d\) stands for the distance between the particle and the observation, as defined in Eq. 32. The initial model is inspired by a Gaussian distribution that matches the distribution of the potential sources obtained from the beamformer. The model is then tuned empirically to fit the observations more accurately, generating the distribution in Eq. 31.
$$\begin{aligned} p(\mathbf O _{q}^{l}|\mathbf x _{s}^{l}(f)) = 0.8 e^{-80d} + 0.18 e^{-8d} + 0.02 e^{-0.4d} \end{aligned}$$
$$\begin{aligned} d = \left\Vert \mathbf x _{s}^{l}(f) - \mathbf O _{q}^{l} \right\Vert \end{aligned}$$
The estimated position of the tracked source \((\mathbf x_{trk} )_{s}^{l}\) is finally obtained with Eq. 33.
$$\begin{aligned} (\mathbf x_{trk} )_{s}^{l} = \sum _{f=0}^{F-1}{p(\mathbf x _{s}^{l}(f)|\mathbf O ^{1:l}) \mathbf x _{s}^{l}(f)} \end{aligned}$$
This estimated position is sent to the tracked source structure, along with the source ID. Resampling is required when the particle diversity is lower than a predefined level (\(N_{min} = 0.7F\)), as shown in Eq. 34.
$$\begin{aligned} \frac{1}{\displaystyle \sum _{f=0}^{F-1}{(p(\mathbf x _{s}^{l}(f)|\mathbf O ^{1:l}))^2}} < N_{min} \end{aligned}$$
A new source may be added if \(P_{H_{2}}^{l}(q)\) exceeds a threshold (fixed to \(0.5\)), and a new filter is then assigned to this source. Each new source is assigned an ID. A source \(s\) being tracked can also be deleted when it stays inactive for too long (\(P_{s}^{l} < 0.5\) for \(l = (l_{now}-24):1:l_{now}\), where \(l_{now}\) is the index of the current frame). All currently tracked sources and their respective IDs are stored in the Tracked Sources data structure.
Fig. 5

Block diagram of the separation module

2.4 Separation module

Figure 5 illustrates the Separation Module block diagram. Geometric Source Separation (GSS) is performed with the unmixing matrix \(\mathbf W ^{l}[k]\) in the GSS function, as expressed by Eq. 35. This matrix is initialized with the information from the Tracked Sources and the Microphones, and then optimized using Eqs. 36 and 37 in order to minimize the independence (\(J_{1}\)) and geometric (\(J_{2}\)) costs. Gradient descent is used as it provides fast convergence at low complexity (Parra and Alvino 2002). The matrices \(\mathbf I \) and \(\mathbf A \) stand for the identity matrix and the direct propagation delays matrix, respectively. The matrix \(\mathbf A \) is defined in Eq. 38, and the variable \(\tau ^{l}_{m,s}\) stands for the delay in samples at frame l when sound leaves source s and reaches microphone m.
$$\begin{aligned} \mathbf Y ^{l}[k] = \mathbf W ^{l}[k] \mathbf X ^{l}[k] \end{aligned}$$
$$\begin{aligned} \frac{\partial J_{1}(\mathbf W ^{l}[k])}{\partial (\mathbf W ^{l})^{*}[k]} = 4 \left( \mathbf E ^{l}[k] \mathbf W ^{l}[k] \mathbf X ^{l}[k] \right) \mathbf X ^{l}[k]^{H} \end{aligned}$$
$$\begin{aligned} \frac{\partial J_{2}(\mathbf W ^{l}[k])}{\partial (\mathbf W ^{l})^{*}[k]} = 2 [ \mathbf W ^{l}[k] \mathbf A ^{l}[k] - \mathbf I ] \mathbf A ^{l}[k]^{H} \end{aligned}$$
$$\begin{aligned} \mathbf A ^{l}[k] = \left[ \begin{array}{ccc} e^{j 2 \pi k \tau ^{l}_{0,0}}&\dots&e^{j 2 \pi k \tau ^{l}_{0,s}} \\ \vdots&\ddots&\vdots \\ e^{j 2 \pi k \tau ^{l}_{m,0}}&\dots&e^{j 2 \pi k \tau ^{l}_{m,s}} \\ \end{array} \right] \end{aligned}$$
The update of the unmixing matrix is performed with Eq. 39. The variables \(\lambda = 0.5\) and \(\mu = 0.001\) stand for the regularization factor and the adaptation rate, respectively.
$$\begin{aligned} \begin{array}{ll} \mathbf W ^{(l+1)}[k] \!=\!&(1\! -\! \lambda \mu ) \mathbf W ^{l}[k] \\&\!-\mu \left[ \left\Vert \mathbf R_{mm} ^{l}[k] \right\Vert ^ {-2}\frac{\partial J_{1}(\mathbf W ^{l}[k])}{\partial (\mathbf W ^{l})^{*}[k]} \!+ \!\frac{\partial J_{2}(\mathbf W ^{l}[k])}{\partial (\mathbf W ^{l})^{*}[k]} \right] \end{array}\nonumber \\ \end{aligned}$$
The covariance matrix of the microphones \(\mathbf R_{mm} ^{l}[k]\), the covariance matrix of the separated sources \(\mathbf R_{ss} ^{l}[k]\) and the intermediate expression \(\mathbf E ^{l}[k]\) are defined in Eqs. 40, 41 and 42. These covariance matrices are obtained with instantaneous estimations, which greatly reduces the amount of computations required. This approximation is similar to the Least Mean Square adaptive filter (Haykin 2002). The operator diag sets all nondiagonal terms to zero.
$$\begin{aligned} \mathbf R_{mm} ^{l}[k] = \mathbf X ^{l}[k]\mathbf X ^{l}[k]^{H} \end{aligned}$$
$$\begin{aligned} \mathbf R_{ss} ^{l}[k] = \mathbf Y ^{l}[k]\mathbf Y ^{l}[k]^{H} \end{aligned}$$
$$\begin{aligned} \mathbf E ^{l}[k] = \mathbf R_{ss} ^{l}[k] - \mathrm diag (\mathbf R_{ss} ^{l}[k]) \end{aligned}$$
The spectra of the separated sources and their corresponding IDs (the same as for the tracked sources) are defined in the Separated Sources data structure.
Post-filtering is then performed on the separated sources. A gain is applied to the separated signals, as expressed by Eq. 43. The gain is computed according to interference and stationary noise (Valin et al. 2004).
$$\begin{aligned} Z^{l}_{s}[k] = G^{l}_{s}[k] Y^{l}_{s}[k] \end{aligned}$$
Moreover, this step requires a MCRA function for each separated source to estimate the stationary noise. The new spectra and their corresponding IDs (the same as for both the tracked and separated sources) are defined in the Postfiltered Sources data structure.

2.5 Postprocessing module

As illustrated by Fig. 6, the separated and postfiltered spectra from the Separated Sources and the Postfiltered Sources are then converted back to the time domain with IFFTs by the Postprocessor function. The new frames are then windowed and overlap-added to generate the new signals. Power-complementary windows are used for analysis and synthesis, and therefore overlap-add is required to achieve signal reconstruction.
Fig. 6

Block diagram of the postprocessing module

3 The ManyEars open software library integrated to ROS

Implementation of ManyEars as an open software library with ROS involves translating the description of the architecture and algorithms presented in Sect. 2 into software processes, developing a Graphical User Interface (GUI) to visualize the tracked sound sources and to fine-tune the parameters of the ManyEars library, and interfacing the library with the ROS environment.

3.1 Software processes

ManyEars’ functions, as shown in the block diagrams of Sect. 2, are managed according to three stages: Initialization, Processing, and Termination. Figure 7 illustrates these stages. All the parameters are stored in the structure parametersStruct. This structure is used to provide parameters to the functionStruct during Initialization, and memory is also allocated for the elements of functionStruct during this step. The functionStruct is then used to perform the processing operations. During Processing, input arguments are consumed, output arguments are generated, and the elements of functionStruct are updated. Finally, during Termination, the previously allocated memory is freed.
Fig. 7

Software structure of each module

To avoid dependency on external libraries, the following utility functions have been created and are used to perform various computations:
  • Memory allocation. Memory is allocated to align arrays for SSE (Streaming SIMD Extensions) operations.

  • FFT and IFFT. The FFT and IFFT operations are performed with a decimation-in-frequency radix-2 algorithm, which is optimized with SSE instructions. Moreover, since the signals are real in the time domain, two transforms are performed with each complex FFT or IFFT.

  • Matrix operations. Most operations are performed on vectors and matrices. For this reason, customized functions for operations on vectors and matrices are created, and make use of SSE instructions.

  • ID Manager. An ID manager is used to generate unique IDs to identify tracked, separated and postfiltered sources.

  • Linear correlation. Linear correlation in the time domain is needed for the Postfiltering Module. This operation is also optimized with SSE instructions.

  • Random number generator. Used by the Tracking Module, this function generates random numbers according to a uniform or a normal distribution.

  • Transcendental function. This function estimates a confluent hypergeometric function for the Postfiltering Module.

  • Window generation. This function generates a Hanning window for the Postfiltering Module. A power-complementary window is also generated for the Preprocessing Module and the Postprocessing Module.
Fig. 8

ManyEars GUI

3.2 Graphical user interface

Figure 8 shows the GUI created for tuning the parameters of the ManyEars Library. The GUI is a complementary tool; it is not required to use the ManyEars Library. It consists of the following subwindows:
  1. Microphone positions, beamformer configuration, source tracking and separation parameters. Parameters can be saved to and loaded from a file for rapid configuration.

  2. Probabilities of sources calculated by the Localization Module in latitude.

  3. Probabilities of sources calculated by the Localization Module in longitude.

  4. Outputs of the Tracking Module in latitude.

  5. Outputs of the Tracking Module in longitude.

  6. 3D unit-sphere representation of the Tracked Sources.

  7. Customizable colour representation of the information displayed by the GUI.

When the ManyEars GUI starts, the user chooses whether to process audio data from a pre-recorded raw file or in real time from the sound card. The application menu allows the user to start or stop processing and to select the audio input. Once processing starts, subwindows (2) through (6) are updated as the audio input is processed. The recorded data can also be saved to a raw file.

The GUI is implemented with the Qt4 framework (Nokia corporation 2012) because of its flexibility, its open source licence and its ability to create cross-platform applications.

3.3 ROS integration

Figure 9 illustrates the integration of the ManyEars library with ROS (Quigley et al. 2009). Oval shapes represent ROS nodes, and rectangular shapes represent topics.

The integration with ROS is done with multiple simple nodes:
  • rt_audio_publisher. This node publishes the raw audio data coming from the sound card in a ROS message called AudioStream, containing the frame number and the stream data of all the microphones in 16-bit signed little-endian format. The default publication topic is /audio_stream.

  • manyears_ros. This node uses the raw stream information published by rt_audio_publisher and executes the sound source localization, tracking and separation algorithm. It can use parameters saved by the ManyEars GUI described in Sect. 3.2. A message called ManyEarsTrackedAudioSource is published for each frame processed. This message contains an array of tracked sources of ROS message type SourceInfo. Each element of the array describes the source properties (ID, position, energy estimation, separation data, longitude, latitude). The default publication topic is /tracked_sources.
Fig. 9

ManyEars-ROS nodes organization

  • manyears_savestream. This node connects to the manyears_ros node and uses the separation_data field of the SourceInfo ROS message to save the separated audio into WAV files. This node can be used to listen to the separated data. No ROS message is transmitted from this node. However, the node could easily be modified to publish the audio data on a topic instead of saving it to WAV files. This would be useful for nodes that use single audio streams, such as speech recognition engines.

  • sound_position_exploitation. This node publishes pose information for each detected sound source. It subscribes to the tracked_sources topic and outputs a geometry_msgs::PoseStamped message with the correct orientation, at unit distance from the geometric center of the microphone array. The default publication topic is source_pose. Figure 10 shows the position data published by this node, visualized with the ROS RViz application.
Fig. 10

Visualization of sound source position using ROS RViz

4 Open hardware microphone acquisition board and 8-input sound card

The microphone acquisition board and the 8-input sound card designed specifically for ManyEars satisfy the following guidelines:
  • Have small microphone boards powered by the sound card directly. Connectors must allow hot-plugging and must be low-cost. Installation of microphones must be easy.

  • Minimalist design supporting up to eight microphone inputs and one stereo output.

  • Minimize physical dimensions for installation on a mobile robot.

  • Low power consumption and support of a wide range of power supply voltage.

  • Minimum signal resolution of 12 effective bits and sampling rates from 8 to 192 kSamples/sec.

  • Fabrication cost comparable to or lower than that of commercially available sound cards.

  • Compatible with multiple operating systems (Linux, MacOS, Windows).

  • Processing of the ManyEars algorithm is done externally, on the host computer, reducing processing power requirements on the sound card and facilitating portability and maintainability.
Fig. 11

Microphone board
Fig. 12

Hardware block diagram

Figure 11 shows one of the designed microphone boards. Each microphone board has its own preamplifier circuit, which is powered by the sound card at 4.3 V. The main electronic components on the top side of the board include the omnidirectional microphone (CUI CMA-4544PF-W) and the preamplifier (STMicroelectronics TS472). The frequency response of the electret microphone is relatively flat from 20 Hz to 20 kHz. The back side of the board holds the RJ-11 connector (low-cost standard telephone jack style) and a potentiometer for easy preamplifier gain adjustment. Parallel insertion of the RJ-11 connector prevents the power line from making contact with the data line during insertion, making the connection hot-pluggable. Signals are mapped to the RJ-11 connector lines such that a standard telephone cable can be used. The connector also has a latch mechanism, which ensures reliable physical connections. The preamplifier has a high signal-to-noise ratio (more than 70 dB according to the TS472 datasheet), differential input and output channels, and a maximum closed-loop gain of approximately 40 dB, required to obtain a peak-to-peak amplitude of 4.3 V at the output. This maximizes the dynamic range of the codecs used by the sound card. Moreover, the preamplifier is positioned as close as possible to the electret microphone in order to reduce the effects of electromagnetic interference and to preserve a good signal-to-noise ratio.

To design the sound card, the first step was to choose the hardware interface. The USB 2.0 High-Speed interface is a good choice because it is more commonly available than FireWire and, unlike standard Ethernet ports, can directly power the sound card. The USB 2.0 transfer rate reaches 480 Mbit/s, which is sufficient to transfer the raw (uncompressed) data of eight microphones and one stereo output. The recently introduced USB Audio Class 2.0 standard (Knapen 2006) supports more channels and higher sampling resolutions and rates than Audio Class 1.0. The total consumption of the system must not exceed 2.5 W (500 mA @ 5 V) for a normal USB configuration. The design is based on the XMOS USB Audio 2.0 Multichannel Reference Design (XMOS ltd 2012) to meet the power and interface requirements. This standard is convenient as it is automatically supported by standard drivers (ALSA for Linux, CoreAudio for OSX). On Windows platforms, a third-party driver provided by XMOS partners is used because USB Audio Class 2.0 is not yet supported natively. The XMOS is strictly used to operate the codecs and forward the sound stream to the host computer (no processing is performed by the sound card).

Figure 12 shows a block diagram of the hardware implementation. The analog signal coming from the microphones is transmitted in differential mode to the codecs. Differential mode is preferred to single-ended signalling because of noise immunity and because it increases the dynamic range of the analog/digital converter of the codec. Since the sound card is designed to operate on a robot, many external devices can induce electromagnetic interference in the transmitted signal. Twisted pairs for differential signal transmission distribute the interference and a differential amplifier rejects the common mode noise, which minimizes the effect of overall electromagnetic interference.

The Preamp module uses a differential audio amplifier (National Semiconductor LME49726) that biases the signal for the codec’s input and also filters the signal to avoid aliasing. This amplifier is specifically intended for audio applications and operates from a single power supply. The band-pass filter in the Preamp module has a flat frequency response in the audio band and a rejection of 20 dB at low frequencies. The configuration of the anti-aliasing filters is the one suggested by the codec manufacturer. Two four-input codec chips (Cirrus Logic CS42448) are used for analog-to-digital conversion. Data is transferred to the XMOS processor using the I2S protocol, and the codecs are configured through the I2C protocol.

The XMOS XS1-L2 dual-core microprocessor operates at 1000 MIPS and is of particular interest for mobile applications. The eight threads of each core are independently activated by events and do not run continuously with the system clock when unused. This reduces power consumption, since only active threads require power. Threads are scheduled directly by the XMOS chip, making real-time performance possible without any operating system. An external PLL (Cirrus Logic CS2300CP) is used to synchronize the codecs and the XMOS cores. This is required to avoid jitter in the clock controlling the sampling of the analog inputs. The original firmware from the XMOS reference design is used, with a small addition to support the second codec. The firmware is stored in the 512k flash memory connected via SPI.

An XTAG2 connector (JTAG interface) is used for programming the SPI flash memory and for debugging. There is also an expansion port available compatible with XMOS standard SKT connector for future use. The XMOS processor is connected to the USB port using an external USB 2.0 PHY (SMSC USB3318). The PHY requires a 13 MHz clock to operate.

The sound card also has an external power connector (from 7 V to 36 V) for cases where USB power is not available or insufficient. The switching power supply (Texas Instruments PTN78000W) accepts this wide input range and converts it efficiently to the required 5 V. When power is supplied both through USB and an external supply, the Power Selection module (Texas Instruments TPS2111) prioritizes the external power source. Figure 13 presents a picture of the designed sound card.
Fig. 13

Sound card

Table 2

Microphone board and sound card characteristics

[Table layout not fully recoverable from extraction. It lists the dimensions (length, width and height, in mm) of the microphone board and the sound card, and the power supply characteristics (voltage in V, current in A, power in W) for the external and USB power configurations, including the maxima with external and with USB power. The recoverable entries are:]

Additional information

Max. sampling latency between channels: 16 µs

Mean noise floor: −132 dBV

Maximum noise floor: −112 dBV

Table 2 summarizes the characteristics of the designed microphone board and sound card. All the design files, gerbers and firmware for both the microphone board and the sound card are available online (Abran-Côté et al. 2012) under the Creative Commons Attribution-ShareAlike 3.0 Unported license (Creative Commons 2012).

5 Demonstration test cases

The ManyEars Library comes with many demonstration test cases available online (Grondin et al. 2012), with parameters tuned to optimize performance. Two of these test cases are presented here: one with static sound sources, and another with moving sound sources. Figure 14 illustrates the coordinate system used in these test cases. The results in this section are displayed with a Matlab/Octave script (demo.m).
Fig. 14

Coordinate system

5.1 Static sound sources

This test case uses an 8-microphone cubic array. According to Rabinkin (1998), the performance of a beamformer with speech sources (with a bandwidth between 100 Hz and 4 kHz) is optimized when the spacing between the microphones varies between 20 cm and 1 m. A 0.32 m \(\times \) 0.32 m \(\times \) 0.32 m array is used to make the spatial gain uniform and to fit on top of a mobile robot (e.g., Pioneer 2 platforms). The diameter of each microphone is 0.8 cm. For the results presented in this paper, the array is positioned 0.6 m above the floor, in a 10 m \(\times \) 10 m \(\times \) 2.5 m room with normal reverberation and some audible background noise generated by fans and other electronics. Two loudspeakers with a diameter of 6 cm are used as sound sources, at a distance of 1.5 m and separated by \(90^{\circ }\). Each speaker is placed approximately 0.6 m above the floor. Speech segments of two female speakers are played for 10 seconds. The microphone signals are recorded and then processed with the ManyEars Library.
Fig. 15

Positions of the tracked sources

Figure 15 shows the longitude and the latitude of the tracked sound sources. These positions match the locations of the loudspeakers; the small difference in latitude is caused by the offset between the heights of the speakers. The localization error of ManyEars has been characterized to be less than \(1^{\circ }\) (Valin et al. 2006a). Figures 16 and 17 show the spectrograms of source 1 and source 2, respectively. The separated and postfiltered spectrograms match many features of the clean spectrograms. Speech intelligibility and recognition are evaluated in Yamamoto et al. (2005, 2006, 2007).
Fig. 16

Spectrograms for source 1
Fig. 17

Spectrograms for source 2

5.2 Moving sound sources

To demonstrate the use of ManyEars on a mobile robot and with an asymmetric array, this test case uses the microphone array on IRL-1, as shown in Fig. 18. Two scenarios have been evaluated, illustrated by Fig. 19, with human speakers producing uninterrupted speech sequences:
  • Scenario A: The two sources are separated by \(90^{\circ }\) with respect to the xy-plane. They move by \(90^{\circ }\) and then come back to their initial position.

  • Scenario B: The two sources are separated by \(180^{\circ }\) with respect to the xy-plane. They move to the position of the other speaker and cross each other.
Fig. 18

IRL-1 with the microphones identified by the red circles
Fig. 19

Positions of the moving sources
Fig. 20

Positions of the tracked sources

As shown by Fig. 20(a), the tracked sources match the positions of the moving sources. In scenario B, the inertia of the particles used for tracking solves the source-crossing problem. However, the sources could have swapped if both speakers had come close to each other at the same time and then moved back to their initial positions. This problem can be addressed by reducing the inertia of the particles (with parameters \(\alpha _{s}^{l}(f)\) and \(\beta _{s}^{l}(f)\) introduced in Sect. 2.3), but source swapping could then occur when speakers cross. The parameters of the Tracking Module were tuned to find a trade-off between these two scenarios.

6 Conclusion

Compared to vision, there are few hardware and software tools available to implement and experiment with robot audition. The ManyEars Open Framework offers both software and hardware solutions to do so. The proposed system is versatile, portable and low-cost. The ManyEars C library is compatible with ROS and provides an easy-to-use GUI for tuning parameters and visualizing the results in real time. Software and hardware components can be easily modified for efficient integration of new audio capabilities into robotic platforms. This new version of ManyEars has recently been used to demonstrate a speaker identification algorithm (Grondin and Michaud 2012), and is currently used in augmented teleoperation and human-robot interaction scenarios with IRL-1, a humanoid robot with compliant actuators for motion and manipulation, artificial vision and audition, and facial expressions (Ferland et al. 2012). The integration of the ManyEars and HARK libraries in ROS suggests that there is potential for further standardization of ROS audio components, which could include data structures, standard DSP operations, audio codecs, and Matlab/Octave script integration. In addition, since the introduction of ManyEars, new methods have been proposed to detect the exact number of active sources (Danes and Bonnal 2010; Ishi et al. 2009), to track moving people (Yao and Odobez 2008), and to separate sound sources using Independent Component Analysis (Mori et al. 2006); these could easily be added to the ManyEars open framework. This effort would lead to a collection of useful, open source and portable tools similar to OpenCV (OpenCV 2012) for image processing.


Set by parameter GLOBAL_MICSNUMBER in the file parameters.h



This work was supported in part by the Natural Sciences and Engineering Research Council of Canada, the Canadian Foundation for Innovation and the Canada Research Chair program.

Copyright information

© Springer Science+Business Media New York 2013