1 Introduction

In recent years, there has been growing interest in the use of non-verbal sound in Human-Robot Interaction (HRI), as evidenced by a recent systematic review by Zhang and Fitter [30]. This interest is motivated by the potential of non-verbal sound to compensate for limitations in robot communicative channels [8] and to enhance the robot’s ability to exhibit rich and engaging behavior [24]. For practitioners in the field of HRI, the goal is to improve the robot’s social interaction skills and establish long-term user relationships [21]. In parallel, interest in merging robots and sound has also been growing in the artistic field, exemplified by robot theater performances [16], dance performances [6], and the use of robotic arms as musical instruments [26]. We firmly believe that these practical and artistic approaches are closely intertwined; as Hoffman argues, artistic explorations “could serve as useful testbeds for the development and evaluation of action coordination in robotics” [13].

Sonification, described as the process of representing data by using sound [10, 17], has been utilized for a variety of functional purposes, such as conveying robot status [15] and providing information about the environment [25]. Beyond these functional purposes, sonification has also been explored for aesthetic purposes in HRI, such as the enjoyment of the interaction [29]. This highlights the potential of sonification to contribute not only to the functional aspects of HRI but also to the subjective experiences of users interacting with robots. As such, the use of sonification in HRI offers a promising avenue for improving the effectiveness and appeal of human-robot interactions.

The development of effective sonification in HRI involves interdisciplinary approaches, drawing from computer science, engineering, psychology, and music [9]. To create effective auditory displays, knowledge of acoustics and sound design is essential. Moreover, understanding the listener’s perceptual and cognitive abilities is critical to ensuring that the auditory displays are intuitive and easily perceived [27]. By leveraging interdisciplinary approaches, researchers and practitioners can create innovative sonification solutions that enhance human-robot interaction and improve the user experience. However, the development of effective sonification in HRI involves multiple production steps using different tools, owing to the variation in software environments used for robot programming and sound production. Sound designers and engineers typically use tools distinct from those used in robotics, and vice versa. Therefore, for sound designers and engineers using tools such as Pure Data, MaxMSP, and SuperCollider, an interface is necessary to communicate with a robotic system.

This paper presents the design and development of PepperOSC, an interface that connects Pepper and NAO robots with sound production tools. The goals of PepperOSC are twofold: (i) to provide a tool for HRI researchers in developing multimodal user interfaces through sonification, and (ii) to lower the barrier for sound designers to contribute to HRI. Our objective is to enable the development of Model-Based Sonification (MBS) as defined in [11], which involves creating sound models that systematically incorporate data to generate evolving acoustic signals. Specifically, we aim to sonify the robot’s movements to enhance strategies for compensating for limitations in robot communicative channels. In addition to our technical contributions, this paper reflects on our experiences working with an interdisciplinary approach.

The paper is structured as follows. In Sect. 2, we discuss related works on interactive sonification in HRI. Section 3 presents the design and implementation of PepperOSC. We then present our use cases in Sect. 4, where we describe two applications we have conducted: (i) a course project by two master’s students in the Interactive Media Technology programme who created a robot sound model in Pure Data, and (ii) a museum installation of a Pepper robot, employing sound models developed by a sound designer (the first author) and a composer/researcher in music technology using MaxMSP and SuperCollider, respectively. Section 5 reflects on our experience of creating PepperOSC and developing sound models, highlighting the potential and challenges of interactive sonification in HRI. Furthermore, a potential use case of how PepperOSC can be utilized in social robotics is also discussed. Finally, we present our conclusions in Sect. 6.

2 Related works

Kramer et al. define sonification as the process of transforming data relations into perceived relations in an acoustic signal [17]. In the context of human-computer interaction, interactive sonification involves the representation of data into sound, allowing humans to interact with a system in a non-visual way [14]. Various approaches have been developed for interactive sonification, ranging from simple Auditory Icons and Earcons that use audio files triggered by a particular state to more complex techniques such as Parameter Mapping Sonification (PMSon) and Model-Based Sonification (MBS), where data features are mapped into acoustic parameters of sonic events or sound models [14].

Interactive sonification has been utilized in a variety of ways in HRI. For instance, Johannsen has employed Auditory Icons and Earcons to design directional sounds of a robot and additional sounds that respond to the robot states of Heavy Load, Waiting, Near Obstacle, and Low Battery [15]. Additionally, Auditory Icons help robots express emotions, as shown by Zahray et al. [29]. One approach to sonification involves mapping varying robotics data into acoustic parameters, as in Hermann et al.’s study, which aimed to “integrate all available information about the (robotic) modules (like message type, density, run-time, results, state) by sound” [12]. Several studies have also implemented sonification by mapping the robot’s movement for expressive purposes [4, 25, 29]. Likewise, the robot’s sensor data have been used in sonification, for example by mapping the user’s movement [25] or emotion [32]. Beyond direct mapping of movement data, novel approaches to movement sonification have also been explored. For example, Bellona et al. used Laban’s Effort System [18] to parameterize the robot’s movement qualities [3], and Frid and Bresin used Blended Sonification by mixing the mechanical sounds of the robot with the sonification of its movements [7]. These various approaches to interactive sonification demonstrate the versatility of sonification in enhancing the human-robot interaction experience.

Although the use of interactive sonification in HRI is increasing, practical guidance and tools for its implementation remain limited, as noted by Zhang et al. [31]. To address this gap, this paper aims to introduce a system that serves as an interface between robotics and sound design tools. By providing a practical solution and reflection to bridge the gap between these two disciplines, this system can enable researchers and sound designers to develop and implement interactive sonification in HRI.

Fig. 1: An overview of PepperOSC. The system uses ALProxy to gain access to all the Pepper robot’s modules, including ALMemory, which supplies the robot’s kinematic data. The data is then streamed out as OSC messages using pyOSC.

3 Design and implementation

PepperOSC is developed as part of the SONAO project [5, 8], which aims to establish new methods for achieving robust interaction between users and humanoid robots through the sonification of expressive gestures. As part of this project, we worked with a Pepper robot and found the need for an interface to extract data from the robot and utilize it in sound design tools. While the Robot Operating System (ROS) is an open-source framework that can be applied to various types of robots, we decided to use NAOqi, the programming framework developed by Aldebaran Robotics specifically for Pepper and NAO robots. We utilized the Python SDK in NAOqi 2.5.10 to develop PepperOSC using Python 2.7.

PepperOSC streams the kinematic data as Open Sound Control (OSC) messages; OSC is a protocol developed specifically for communicating with sound synthesizers [28]. Recently, Zhang et al. integrated the Robot Operating System (ROS) and Pure Data in a system called SonifyIt [31]. SonifyIt uses the socket library to facilitate data streaming between ROS and Pure Data. We chose OSC messages for PepperOSC due to their flexibility in facilitating communication with a wide range of sound design and music production tools.

The core functionality of PepperOSC revolves around its capability to acquire real-time kinematic data from the robot and seamlessly stream it as OSC messages. In addition, we have incorporated a robot control module to enable precise control over the robot’s behavior. In NAOqi, access to real-time data and control are provided by ALProxy, a class of objects that allows access to all of the robot’s modules. Figure 1 illustrates the structure of PepperOSC.

ALProxy facilitates access to kinematic data by connecting with ALMemory, a centralized memory that stores information about the current state of the robot’s sensors and actuators. In the current implementation of PepperOSC, the rotational angles of the head and arm joints (in radians) are retrieved. However, ALMemory also allows straightforward retrieval of other sensor data. To stream the retrieved data out as OSC messages, we used pyOSC. The angular data of each joint is streamed individually, with the joint name (e.g., HeadYaw, HeadPitch, RShoulderRoll) included as part of the OSC address.
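
As a concrete illustration, the following minimal sketch (not the released PepperOSC code) polls joint angles from ALMemory and forwards them as OSC messages with pyOSC. It assumes NAOqi’s Python SDK and pyOSC; the robot address, OSC target, joint list, and polling rate are placeholder values.

```python
import time
from naoqi import ALProxy   # NAOqi Python SDK (Python 2.7)
import OSC                  # pyOSC

ROBOT_IP, ROBOT_PORT = "192.168.1.10", 9559       # assumed robot address
OSC_TARGET = ("127.0.0.1", 9000)                  # e.g. Pure Data / MaxMSP listening here
JOINTS = ["HeadYaw", "HeadPitch", "RShoulderRoll"]

memory = ALProxy("ALMemory", ROBOT_IP, ROBOT_PORT)
client = OSC.OSCClient()
client.connect(OSC_TARGET)

while True:
    for joint in JOINTS:
        # ALMemory exposes the sensed joint position (in radians) under this key
        key = "Device/SubDeviceList/%s/Position/Sensor/Value" % joint
        angle = memory.getData(key)
        # one OSC message per joint, with the joint name in the OSC address
        msg = OSC.OSCMessage("/pepper/joint/%s" % joint)
        msg.append(float(angle))
        client.send(msg)
    time.sleep(0.02)   # ~50 Hz polling; tune to the sound model's needs
```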

PepperOSC controls the robot’s movement through two methods. The first method utilizes pre-programmed Behaviors, which are predetermined movements and speech created using an independent application called Choregraphe. Once the Behaviors are deployed to the robot, PepperOSC can initiate and terminate a behavior using ALBehaviorManager, eliminating the need for granular control over individual joints. In contrast, the second method entails PepperOSC directly controlling individual joints using ALMotion and handling speech output via ALTextToSpeech. The two control methods complement each other effectively. Behaviors are well-suited for pre-defined movements such as emotional displays (e.g., raising both hands in excitement), while direct controls provide interactivity with the robot, allowing real-time adjustments and customization of movements and speech to be handled by PepperOSC, e.g., in a Wizard-of-Oz manner [23]. Moreover, any notification of movement controls can be streamed out as OSC messages, enabling additional customization options for the sonification.
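
The two control paths can be sketched as follows, again assuming NAOqi’s ALProxy; the robot address, Behavior name, and joint targets are hypothetical examples (the Behavior would correspond to a project installed via Choregraphe).

```python
from naoqi import ALProxy

ROBOT_IP, ROBOT_PORT = "192.168.1.10", 9559       # assumed robot address
behavior = ALProxy("ALBehaviorManager", ROBOT_IP, ROBOT_PORT)
motion = ALProxy("ALMotion", ROBOT_IP, ROBOT_PORT)
tts = ALProxy("ALTextToSpeech", ROBOT_IP, ROBOT_PORT)

# (1) Pre-programmed Behavior created in Choregraphe and installed on the robot
if behavior.isBehaviorInstalled("greeting/excited"):    # hypothetical Behavior name
    behavior.runBehavior("greeting/excited")            # blocks until the Behavior ends

# (2) Direct control of individual joints and speech, e.g. for Wizard-of-Oz use
motion.setStiffnesses("Head", 1.0)
motion.setAngles(["HeadYaw", "HeadPitch"], [0.3, -0.1], 0.15)   # names, radians, speed fraction
tts.say("Hello, welcome!")
```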

Fig. 2: Diagram showing the signal flow in the Pure Data application. The sound model is designed to sonify data from one joint and can be duplicated for each sonified joint.

To ensure smooth operation, the kinematic data streaming and robot control processes are executed in parallel using Python threads. With these capabilities, PepperOSC can be expanded to serve as the central controller of an interactive system, as demonstrated by the museum installation presented in Sect. 4.2.
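
A minimal sketch of this parallel structure, using Python’s standard threading module, is shown below; the two functions are placeholders standing in for the streaming and control listings above.

```python
import threading
import time

def stream_joint_angles():
    # placeholder for the ALMemory-to-OSC polling loop sketched earlier
    while True:
        time.sleep(0.02)

def run_control_loop():
    # placeholder for Behavior triggering and direct joint/speech control
    time.sleep(5.0)

# the streaming loop runs in a daemon thread so it stops when the control loop exits
streamer = threading.Thread(target=stream_joint_angles)
streamer.daemon = True
streamer.start()

run_control_loop()
```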

4 Application of PepperOSC

In this section, we present two applications of PepperOSC. The first application involved two master’s students in Interactive Media Technology who undertook a course project aimed at utilizing the real-time kinematic data stream. The second application was an experiment conducted as a museum installation, with a particular emphasis on using PepperOSC as the control module for an interactive sonification system. All of the sound models presented are available in the PepperOSC online repository.

Building upon the insights from Hoffman [13], the increased complexity introduced between the first and second applications can serve as an ideal testbed for further developing and evaluating PepperOSC for practical use. The potential use case of PepperOSC in social robotics is further explored in Sect. 5.

4.1 Sound design by the interactive media technology students

The first application of PepperOSC was aimed at utilizing the stream of kinematic data and was offered as a course assignment for final-year students in the master’s programme in Interactive Media Technology. Two master’s students, who had previously taken courses related to sound in interaction, undertook this project. Their objective was to sonify the robot’s emotional expressions. To facilitate this, four pre-existing robot behaviors, derived from a previous study by the authors [19], were provided. As the primary goal was to test the stream of kinematic data, no additional robot control was employed beyond initiating the behaviors.

The students designed a sound model, implemented using granular synthesis in Pure Data, that can be applied to each of Pepper’s joints to augment its movement through sound. Servo sound recordings were used as inputs for the grains, parameterized in the synthesis using the streamed movement data. The angular speed of each joint was computed from its real-time angular positions and then mapped to the pitch and volume of the synthesized sound. The students assigned a higher pitch to the outward direction of change to draw attention to outward movements. The system is customizable, allowing the audio texture of each output to be modified by changing the grains’ input file. The sound model can be duplicated for each joint by creating sub-patches. A simplified diagram of the Pure Data signal flow is shown in Fig. 2.
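
To make the mapping concrete, the following Python sketch illustrates the idea of deriving angular speed from successive joint angles and mapping it to pitch and volume, with outward motion raising the pitch. It is only an illustration of the mapping logic, not the students’ Pure Data patch; the base pitch, scaling constants, and sign convention are arbitrary examples.

```python
def map_joint_to_sound(prev_angle, angle, dt, base_pitch=220.0):
    """Map one joint's angular change over dt seconds to (pitch in Hz, volume 0-1)."""
    speed = (angle - prev_angle) / dt          # rad/s, signed
    outward = speed > 0                        # sign convention depends on the joint
    pitch = base_pitch * (1.0 + abs(speed))    # faster movement -> higher pitch
    if outward:
        pitch *= 1.5                           # emphasize outward movements
    volume = min(1.0, abs(speed) / 2.0)        # faster movement -> louder, clipped at 1
    return pitch, volume

# example: a joint moving outward at 1.2 rad/s, sampled every 20 ms
pitch, volume = map_joint_to_sound(prev_angle=0.50, angle=0.524, dt=0.02)
```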

4.2 Deployment in a museum installation

Fig. 3: Museum installation setup. Note the placement of the speakers behind the robot for the synthesized sound to ensure multimodal coupling between the sound and the robot’s movement.

Fig. 4: Diagram showing the control flow of the museum installation (adapted from [20]). The control system for the installation is expanded from PepperOSC.

In a recent experimental installation at the Swedish National Museum of Science and Technology (Tekniska) in Stockholm, PepperOSC was utilized to probe the aesthetic strategies of sound produced by movement sonification of a Pepper robot. The robot was presented in two different scenarios: (1) welcoming patrons to a restaurant and (2) providing information to visitors in a shopping center. Two sets of sound models were employed to complement the robot’s gestures, and museum visitors were asked to indicate their preferred sound in terms of sound material and complexity for each of the two scenarios. Results revealed that the visitors preferred subtle sounds that blend well with ambient sounds (i.e., less distracting) and natural sounds where the sound source matches the visual characteristics of the robot, related to both its physical appearance and movement. Additionally, more complex sound models for the sonification of robot movements were favored over simple sawtooth-based sounds. A comprehensive description of the experiment and its results can be found in [20].

This section presents the installation and explains how PepperOSC was utilized.

As shown in Fig. 3, the installation was arranged in a large, dimly lit room featuring a Pepper robot placed in front of a large projection screen wall. Museum visitors could enter the room through doors on both sides of the projection wall at any time. The projection wall displayed photos of the locations where the interaction scenarios occurred, with ambient sounds of the locations emanating from four ceiling speakers. Two additional speakers (ESI Aktiv 05) were placed behind the robot for the synthesized robot sound. A touchscreen station was available in front of the robot, allowing visitors to initiate Pepper’s actions in the experiment and answer questions afterward. The control system for the installation was built on PepperOSC, which communicated with both the robot and the touchscreen GUI using the OSC protocol.

The interaction scenarios were created as Behaviors deployed in the robot, with pre-determined robot movements and speech. A command from the touchscreen station (i.e., from the visitors) would trigger the corresponding Behavior and initiate the movement sonification with the intended sound model. The kinematic data of the robot was continuously streamed to the sound models through a separate thread. After the robot finished the Behavior, a flag was sent back to the touchscreen to proceed to the next step and to the sound model to stop the synthesis. A technical setup diagram is presented in Fig. 4.
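
The control flow can be sketched as follows, under the same assumptions as the listings in Sect. 3 (NAOqi’s ALProxy and pyOSC). The OSC addresses, ports, and Behavior names are placeholders, and the kinematic streaming thread is omitted for brevity.

```python
import OSC
from naoqi import ALProxy

ROBOT_IP, ROBOT_PORT = "192.168.1.10", 9559
behavior = ALProxy("ALBehaviorManager", ROBOT_IP, ROBOT_PORT)

gui = OSC.OSCClient()
gui.connect(("127.0.0.1", 9100))     # touchscreen GUI
sound = OSC.OSCClient()
sound.connect(("127.0.0.1", 9000))   # sound model (MaxMSP / SuperCollider)

def send(client, address, *args):
    msg = OSC.OSCMessage(address)
    for arg in args:
        msg.append(arg)
    client.send(msg)

def on_scenario(addr, tags, data, client_address):
    scenario = data[0]                               # e.g. "restaurant" or "shopping"
    send(sound, "/sonification/start", scenario)     # select and start the sound model
    behavior.runBehavior("scenarios/" + scenario)    # blocks until the gestures finish
    send(sound, "/sonification/stop")                # stop the synthesis
    send(gui, "/installation/done")                  # flag for the GUI to proceed

server = OSC.OSCServer(("0.0.0.0", 9200))
server.addMsgHandler("/installation/scenario", on_scenario)
server.serve_forever()
```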

The setup was designed to run independently without requiring manual controls, and it proved fairly robust during the four days of deployment at the museum. Nevertheless, an assistant was always present near the robot to provide information and invite visitors to participate in the study. This was done to ensure the experiment ran smoothly and to assist visitors who had questions or concerns.

4.2.1 Sound designer perspective

Fig. 5: Diagram showing the signal flow in the MaxMSP application. The robot’s movement data is used as parameters in Sound Design Toolkit objects.

The first set of sounds was created by the first author, who drew on their background in game sound design, to investigate museum visitors’ preferences in terms of sound material. The sounds were developed using the Sound Design Toolkit (SDT) [2] in MaxMSP, with the choice of materials based on the robot’s movements from three perspectives: internal mechanisms, physical appearance, and arm displacement. All sounds were mapped to the position of the robot’s hands in space, calculated from the angular state of three joints: ShoulderRoll, ShoulderPitch, and ElbowRoll.
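
For illustration, the sketch below shows one way the hand position could be approximated from those three joint angles. This is a rough planar two-link approximation under our own simplifying assumptions, not the exact kinematics used in the installation; the link lengths are approximate values for Pepper.

```python
import math

UPPER_ARM = 0.18   # approximate link lengths in meters (assumed values)
FOREARM = 0.15

def hand_position(shoulder_pitch, shoulder_roll, elbow_roll):
    """Rough hand position relative to the shoulder, from three joint angles (radians)."""
    # treat ShoulderPitch and ElbowRoll as a two-link chain in a vertical plane
    r = UPPER_ARM * math.cos(shoulder_pitch) + FOREARM * math.cos(shoulder_pitch + elbow_roll)
    z = -(UPPER_ARM * math.sin(shoulder_pitch) + FOREARM * math.sin(shoulder_pitch + elbow_roll))
    # rotate that plane sideways by ShoulderRoll
    x = r * math.cos(shoulder_roll)
    y = r * math.sin(shoulder_roll)
    return x, y, z
```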

Figure 5 shows the simplified structure of the sound models. The first sound model emphasized the presence of the robot’s internal mechanisms (e.g., servomotors) using the SDT synthesis models [sdt.motor] and [sdt.dcmotor]. The second model emphasized the robot’s physical appearance, using the SDT’s basic solid-interaction models to produce a softer metallic sound; in this model, repeated soft impacts on a hollow metal object were generated, with the movement speed mapped to the intensity of the impacts. Finally, the third model aimed to emphasize the arm displacement, producing whoosh sounds using the SDT object [sdt.windkarman].

4.2.2 Composer perspective

Fig. 6: Diagram showing the signal flow in the feedback sound model (source: [20]). The dashed arrows going into the sound model indicate parameters that can be controlled in real time.

The second set of sounds was developed to examine visitors’ preferences in terms of sound complexity. Three conditions were employed: (1) the original sound generated by Pepper without any additional sound, (2) a simple sound created with a filtered sawtooth waveform mapped to the robot’s hand position in space, and (3) a complex sound model that employs feedback chains. The sounds were developed by a colleague who is a professional composer and researcher in music technology, whose works have utilized feedback chains both in compositions and in sound installations (such as in [22]). For the museum installation, the feedback chains were designed to be driven by the robot arms’ movement.

Figure 6 shows the signal flow diagram of the complex sound model. This sound model involves a feedback chain that is activated by an impulse from a non-band-limited pulse oscillator. The impulse is processed through a bandpass filter that can dynamically adjust the center frequency between two extremes. Spectral processing techniques, such as phase shifting and partial randomization of the bins’ order, are then applied to the signal. The choice of window size for the fast Fourier transform (FFT) analysis affects the rhythmicity of the output. After the spectral manipulations, the resulting signal is converted back to time-domain audio data using an inverse FFT (IFFT). The signal is not only sent to the output, but also sent back into the chain by multiplying it by a feedback factor and re-injecting it into the bandpass filter. Finally, an envelope follower is applied to the resulting sound as a negative feedback control signal to prevent saturation. The dashed arrows going into the sound model in Fig. 6 represent the parameters mapped to the robot arms’ movement, resulting in variations in the output produced by this sound model.
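
For readers less familiar with this kind of feedback design, the following offline, block-based numpy sketch illustrates the structure described above. The actual sound model runs in real time in MaxMSP/SuperCollider; all constants here (window size, feedback factor, band settings, gain scaling) are illustrative rather than taken from the installation.

```python
import numpy as np

SR = 44100          # sample rate (Hz)
N = 1024            # FFT window size; affects the rhythmicity of the output
FEEDBACK = 0.7      # feedback factor

def spectral_bandpass(spec, center_bin, width):
    # crude spectral band-pass: keep only bins around a (movable) center frequency
    out = np.zeros_like(spec)
    lo, hi = max(1, center_bin - width), min(len(spec), center_bin + width)
    out[lo:hi] = spec[lo:hi]
    return out

def process_block(block, state, center_bin=40, width=20):
    # re-inject the previous output, scaled by the feedback factor and by the
    # envelope-follower gain (negative feedback against saturation)
    x = block + FEEDBACK * state["gain"] * state["prev"]
    spec = np.fft.rfft(x)
    spec = spectral_bandpass(spec, center_bin, width)
    spec *= np.exp(1j * np.random.uniform(0.0, 0.5, spec.shape))   # phase shifting
    idx = np.where(np.random.rand(len(spec)) < 0.1)[0]             # pick ~10% of bins
    spec[idx] = spec[np.random.permutation(idx)]                   # randomize their order
    y = np.fft.irfft(spec, n=N)                                    # back to the time domain
    env = np.sqrt(np.mean(y ** 2))                                 # envelope follower (RMS)
    state["gain"] = 1.0 / (1.0 + 5.0 * env)
    state["prev"] = y
    return y

state = {"prev": np.zeros(N), "gain": 1.0}
impulse = np.zeros(N)
impulse[0] = 1.0                                                   # impulse from a pulse oscillator
blocks = [process_block(impulse if i == 0 else np.zeros(N), state) for i in range(20)]
output = np.concatenate(blocks)
```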

5 Discussion

In this section, we reflect on the design and implementation of PepperOSC, highlighting its potential and limitations in both social robotics and artistic exploration.

PepperOSC was developed to address the challenges that HRI researchers and sound designers face when integrating sound production into HRI. For HRI researchers, we provide a tool that enables the development of multimodal user interfaces beyond the playback of recorded audio files. For sound designers who may lack a background in robotics or programming, we provide an interface between the robot and sound design tools, making it accessible and manageable to work with robots.

In the context of the SONAO project, PepperOSC provides a versatile platform for investigating various levels of complexity and sound texture, enabling the exploration of diverse aesthetic strategies for robot movement sonification. This can involve developing new sound models or utilizing existing toolkits such as the Sound Design Toolkit, as exemplified in Sect. 4.2.1. Additionally, the Pure Data application featured in Sect. 4.1 demonstrates the potential of granular synthesis in exploring sound texture, as changes in the grain input can yield distinct texture outcomes.

A particularly challenging task is to incorporate PepperOSC in social robotics. Pre-programmed user interaction and conversation can easily be added to the current version of PepperOSC. However, implementing fully interactive communication extends beyond the scope of this paper and requires further development. One potential approach involves integrating PepperOSC with a speech-driven gesture synthesis system such as [1]. In this approach, as the robot executes the synthesized gestures, PepperOSC can augment them with movement sonification. Further developments in gesture synthesis are necessary to help determine the optimal sound model choice and timing for the sonification to effectively contribute to the social interaction.

Although the use of PepperOSC in social robotics and conversational HRI is still a distant prospect, its deployment in the museum installation discussed in Sect. 4.2 underscores its potential as a primary control system for robot sonification in artistic contexts. PepperOSC’s implementation of the OSC protocol results in a reliable, fast, and user-friendly data stream that can communicate with multiple nodes in a network. Moreover, the protocol allows bidirectional communication between the sound programming tools and the robot, which opens up more adaptive sonification possibilities, enabling the robot to react to changes in the sound and vice versa. For instance, the robot’s movements can be adjusted to match the sound’s characteristics. Furthermore, the OSC protocol can communicate with digital audio workstations (DAWs) such as Ableton Live, making it possible to perform real-time music with the robot. With these capabilities, we believe PepperOSC can support the design of complex robot sonifications by lowering the barrier for sound designers and sound artists to contribute novel insights to HRI.

We identified several limitations associated with the use of PepperOSC. While the PepperOSC scripts can be deployed on the robot, sound programming tools cannot run on the robot and must instead run on a separate machine. This presents two issues: first, streaming the resulting sound through the robot’s speakers can be technically challenging, and second, the robot’s built-in speakers have limited sound quality. According to the technical specifications, Pepper’s two speakers are located in its ears, with a frequency response of 400 Hz to 9 kHz (−6 dB), meaning that many details in the low and high frequency ranges may be lost. To the best of our knowledge, no robots have been specifically designed with sound production in mind, and we encourage further exploration in this direction for future development. To address this limitation in our applications, loudspeakers were strategically placed behind the robot to ensure multimodal coupling between the sound and the robot’s movement (see Fig. 3).

Another limitation of PepperOSC is the manufacturer’s discontinuation of Python SDK development for NAOqi. Furthermore, the SDK only supports Python 2.7.x, which is no longer maintained, limiting future applications of PepperOSC. Another constraint is that PepperOSC is tailored for use with Pepper and NAO robots, which may not suit the needs of researchers working with other robot models. As described in Sect. 3, the core functionality of PepperOSC is its ability to capture real-time kinematic data from a robot. In order to adapt PepperOSC to a different robot, the corresponding SDK must offer means to access the robot’s current state. For instance, in the case of the Cozmo robot, a specific class called cozmo.robot provides this information. Moreover, the Robot Operating System (ROS), an open-source framework compatible with various robots, also offers such methods. Notably, ROS has proven effective in the context of robot sonification [31], further demonstrating its versatility.
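
As a sketch of such an adaptation, the snippet below shows how the same OSC streaming idea could be driven from ROS instead of ALMemory, assuming rospy and a robot that publishes sensor_msgs/JointState on /joint_states; send_osc() is a hypothetical helper wrapping the pyOSC client from Sect. 3.

```python
import rospy
from sensor_msgs.msg import JointState

def send_osc(address, value):
    pass  # placeholder: forward the value as an OSC message, as in Sect. 3

def on_joint_state(msg):
    # JointState carries parallel lists of joint names and positions (radians)
    for name, position in zip(msg.name, msg.position):
        send_osc("/robot/joint/" + name, position)

rospy.init_node("robot_osc_bridge")
rospy.Subscriber("/joint_states", JointState, on_joint_state)
rospy.spin()
```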

Despite the limitations of PepperOSC, we consider the OSC protocol crucial for interactive sonification in HRI due to its compatibility with a wide range of sound programming tools and its ease of network transmission. Moreover, the OSC protocol enables interdisciplinary approaches by providing an interface between the robot, sound designers, composers, and engineers, allowing for collaboration across fields such as robotics, human-computer interaction, and music production and composition. Therefore, we recommend its continued use in future research on interactive sonification in HRI.

6 Conclusion

PepperOSC is an interface designed for interactive sonification and sound model creation in conjunction with Pepper and NAO robots. This paper describes its design and explores two applications: one developed by two master’s students who created a sound model using Pure Data, and the other by a sound designer and a composer/researcher in music technology who developed sound models for a museum installation using MaxMSP and SuperCollider, respectively. These applications demonstrate the versatility of PepperOSC and its ability to explore diverse aesthetic strategies for robot movement sonification.

However, there are several technical limitations to PepperOSC, including the discontinuation of Python SDK development for NAOqi, the need for a proper external speaker setup, and its compatibility with only the Pepper and NAO robots. Nonetheless, the Pepper robot is one of the most widely deployed humanoid robots, present in a variety of contexts around the world, and despite these technical limitations, PepperOSC allows various sound programming tools to be used to create richer sound interaction in HRI applications, as we have shown in this study.