1 Introduction

This chapter presents a study focused on expressive musical gestures that represent the interpretation of musical works. Our search for a gestural language to drive digital sound systems naturally led us to consider conducting, the gestural art of directing musical performances by orchestras or choirs in rehearsal or concert situations. Conducting relies on proven techniques that have evolved over the centuries, from the ancient art of chironomy—a conducting technique that uses hand gestures to direct musical performances, typically Gregorian chants in choirs—to recent orchestral techniques developed for classical music or other music ensembles. In this context, the orchestra can be considered a “meta-instrument,” where the performers master their instrument and are guided by the conductor’s gestures to perform according to the musical intention.

Conducting gestures are fascinating because they embody a deep understanding of the musical piece. The conductor, often a skilled musician, can internalize the musical intent of the work as it was imagined and conceived by the composer. They encode sound images to grasp the organization and flow of the musical discourse: the parallel melodic lines, the rhythms, the variations and breaths, the dynamics of the sounds, the quality of the timbre, and so on. Then, through their body language and gestures, the conductor directs and motivates the musicians, efficiently conveys the organizational and temporal elements of the work, ensures its metrical development, and transfers its expressive and dynamic strengths.

Finally, conducting gestures are designed for effective communication. Although idiosyncratic, a set of them can be shared by a large community of musicians [24]. Conversely, even when semantically close gestures have different realizations, they share similar features of form or kinematics. This sharing, as well as transmission over time, requires an encoding of gestures [34] governed by rules of economy specific to gestural languages. In this chapter, we hypothesize that these rules constitute a grammar of conducting gestures.

Whereas most studies of conducting gestures focus on the gestures made by the dominant hand, i.e., the beating gestures that indicate the structural and temporal organization of the musical piece (tempo, rhythm), this chapter focuses on expressive conducting gestures performed by the non-dominant hand. These gestures convey other aspects of music performance and interpretation, including variations in dynamics and intensity, musical phrasing or articulation, accentuation, entrances and endings, and sound quality and color; more generally, they reflect musical intent and expressiveness. Following the hypothesis that there exists a set of meaningful gestures or features shared by conductors, we propose a grammar of expressive gestures that draws directly from the grammatical foundations of sign languages for the Deaf. Sign languages share common properties with conducting gestures, as both are visual and gestural languages; that is to say, they use the sensorimotor system to produce the gestures and the visual receptors to receive the information. Moreover, similar mechanisms can be observed in both conducting and sign languages, including iconic dynamics and spatial referencing mechanisms to describe and manipulate metaphorical or metonymic entities. Both use space, whether the body space or the space in which the gesture unfolds, thus promoting expression within the narrative or along the musical discourse. We therefore propose to analyze conducting gestures in the light of sign language gestures.

After positioning our approach with sound-related gestures and conducting gestures, we propose in this chapter to analyze the linguistic similarities between the conductor’s gestures and those of sign languages [10]. This approach leads us to define a repertoire of expressive gestures classified into four main categories (Articulation, Dynamics, Attack, and Cut-off) corresponding to classically used sound modulations. Within each category, we define several expressive variations. Our methodological approach can be linked to the theory of sonorous objects [32], and by extension, to that of gestural-sonorous objects [12]. Following this methodology, sound objects are first defined and grouped into functional categories, and then gestures and their variations are identified. We then present our data collection and propose a qualitative evaluation of our gestural dataset using machine learning before briefly reviewing the research challenges for gesture recognition and motion-to-sound transformation systems.

2 Related Work

2.1 Sound-Related Gestures

As mentioned in [13], many works concern the study of musical gestures in fields such as musicology, cognitive science, gesture linguistics, computer music, and human-computer interaction.

Research in gestural control of musical instruments, both physical and simulated, has highlighted the many possibilities to control sound features with various mapping schemes. Beyond the analog or digital relationship between gestural features (including geometrical, kinematic, or physical features) and sound features (including temporal, spectral, or psycho-acoustical features), there are cognitive relationships based on abstract representations of mental images of sounds or movements. This can be connected to the theory of musical sounds presented in [32], where the acoustic substrate of sounds is potentially associated with perceptual images. This theory has been extended to the concept of embodied gestural-sonorous objects [12].
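As a minimal illustration of such a direct mapping scheme (not a scheme proposed in this chapter), one might map kinematic hand features to sound features. The feature names and the linear relations below are assumptions chosen for clarity.

```python
# Illustrative sketch of a direct gesture-to-sound mapping scheme.
# Feature names and relations are hypothetical, not from the chapter.

def map_gesture_to_sound(hand_height, hand_speed):
    """Map geometric/kinematic gesture features to sound features.

    hand_height: normalized vertical hand position in [0, 1]
    hand_speed:  normalized hand speed in [0, 1]
    Returns (amplitude, brightness), both in [0, 1].
    """
    amplitude = hand_height  # higher hand -> louder sound
    # a faster, higher gesture -> brighter timbre (assumed linear blend)
    brightness = 0.5 * hand_height + 0.5 * hand_speed
    return amplitude, brightness

# A raised, fast-moving hand yields a loud, bright sound.
amp, bright = map_gesture_to_sound(0.9, 0.8)
```

Cognitive mapping schemes, by contrast, would interpose an abstract representation (a mental image of the sound) between the two feature sets rather than a direct functional relation.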

There have been proposals for the classification of sound-related gestures [7]. Four functional aspects of musical gestures are usually considered [15]:

  1. Sound-producing gestures, including excitatory gestures such as hitting, bowing, plucking, and blowing, and modifying gestures such as continuous modulations of pitch or timbre

  2. Sound-facilitating gestures that support the sound-producing gestures and include support, phrasing, and entrained gestures

  3. Sound-accompanying gestures that follow the music

  4. Communicative gestures, intended for communication

For sound-producing gestures, the relationship between sound and body motion is well understood by musicians. Godøy et al. [12] argue that different theories can explain the gesture–sound link. According to the ecological perspective, auditory perception exploits cues from previous experiences to produce patterns that give meaning to sound. Other researchers share the idea that motor production is involved in the perception of sound. More specifically, the motor theory of speech perception [22] holds that the listener recognizes speech by activating the motor programs that would produce sounds like those that are being heard. This theory can be transposed to sign language gestures with the motor theory of sign language perception [9]. In this case, the linguistic knowledge is embodied into sensory-motor processes, where sensory data may be visual clues (iconic gestures) or perception of action, and motor commands put into action the multiple degrees of freedom of the articulated system. Our approach builds on these theories from a linguistic point of view.

2.2 Conducting Gestures

Since Mathews’ research on conductor programs [3], much work has focused on conducting gestures, from the analysis and recognition of gestures to their use in gesture-controlled sound systems [16, 28]. Many different sensors have been used to capture the conductor’s gestures, from commercial sensors (e.g., accelerometers, gyroscopes, infrared cameras, and electromyographic (EMG) systems) to sensors designed specifically for conducting, such as the MIDI Baton [17]. Gesture follower systems have been developed, for example, the Conductor Follower [6], or interactive systems using sensor gloves for capturing expressive gestures [27]. Other approaches have led to the recognition of gestures, notably using Hidden Markov Models. This is the case of the system that follows both the rhythm and the amplitude of the right hand, as well as the expressive gestures of the left hand [18], or the system that follows and recognizes conducting gestures by real-time warping of the observed sequence to the learned sequence [1]. These capturing devices and gesture tracking and recognition models have led to multiple systems that map gestures to sound synthesis [33]. These include systems for live performances, home entertainment, interactive public installations, and even conductor training systems [16]. More recently, a sound system that follows conducting gestures has been proposed [19], and machine learning approaches have been developed for music conducting [26, 31].
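The warping-based recognition principle mentioned above can be illustrated with a minimal dynamic time warping (DTW) sketch: an observed gesture trajectory is aligned to each learned template, and the closest template wins. This is an illustrative sketch, not the cited system; the 1-D trajectories and template names are hypothetical.

```python
# Minimal DTW distance between an observed gesture trajectory and a
# learned template (1-D sequences for simplicity).

def dtw_distance(observed, template):
    n, m = len(observed), len(template)
    INF = float("inf")
    # cost[i][j]: best cumulative cost aligning observed[:i] with template[:j]
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(observed[i - 1] - template[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # stretch observed
                                 cost[i][j - 1],      # stretch template
                                 cost[i - 1][j - 1])  # match
    return cost[n][m]

# Nearest-template classification of an observed gesture.
templates = {"crescendo": [0.0, 0.3, 0.6, 1.0], "cut-off": [1.0, 0.5, 0.0]}
observed = [0.0, 0.2, 0.4, 0.7, 1.0]
label = min(templates, key=lambda k: dtw_distance(observed, templates[k]))
# label -> "crescendo"
```

Real systems operate on multidimensional sensor streams and often warp incrementally in real time, but the alignment principle is the same.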

Our approach goes beyond existing studies of conducting gestures, which aim to map gestures to sound systems. Instead, we return to a structural analysis of gestures with sound objects, focusing on the characteristics of gestural languages that encode information at different levels of abstraction and at different time scales.

2.3 Motivation

In our study, we focus on conducting orchestral or choral gestures and, more specifically, on expressive conducting gestures performed by the non-dominant hand. These gestures are not predefined but are hand signs that have been created over the centuries to direct a group of musicians. Thus, constrained by the structure of the musical work, the message conveyed by each gesture corresponds to a desired sound function, and the quality of the movement responds to a clear and understandable musical intention.

The expressive conducting gestures differ significantly from other musical gestures. Unlike sound-producing gestures, they do not involve any interaction with a physical instrument. If we exclude the beating gestures performed with a baton, conducting gestures involve all the degrees of freedom of the conductor’s arm and torso and possibly their gaze and facial expression. Moreover, these are anticipatory gestures based on a predictive reading of the musical score. They are anchored at key moments of the musical discourse, indicating variations of dynamics, attacks, temporal phrase variations (slowing down, acceleration, cuts), and qualitative sound variations (e.g., timbre). These are concise and efficient gestures that anticipate the sound flow in real time while remaining synchronized with the rhythm of the music. These qualities are also those sought in gesture-controlled digital sound systems.

In this chapter, we are interested in the linguistic dimension of the conducting gestures, in the sense that they are structured in several layers of abstraction, following linguistic principles. These layers and rules define the basis of a language whose linguistic structure is similar to signed languages. The economy of representation proper to any language can be expressed by grammatical processes at different levels. First, we will see that the conducting gestures are structured according to a limited number of basic components. Modifying one of these components modifies the gesture’s meaning, which can lead to expressive nuances of the musical interpretation. This linguistic specificity is also characterized by grammatical rules based on the iconic and spatial dimension of gestures, which brings the conducting gestures closer to those of signed languages.

The linguistic extension of the gestural-sonorous objects [12] is a first step towards understanding the underlying grammatical structure of expressive conducting gestures, from hand sign formation to musical phrasing. The objective is not to find a unique repertoire of expressive gestures that would be shared by all conductors; these gestures differ according to the style of the conductor and the type of music. Instead, the aim is to identify structural elements and invariant features that constitute the foundations of these gestural languages and to formalize their rules of production. By extension, this structuring might facilitate the spontaneous understanding between conductors. The examples chosen are partially inspired by those presented in [4]. Our contribution concerns the comparative linguistic study between conducting and sign language gestures, based on a formal grammar of French sign language (LSF) [25].

3 Similarities Between Sign Language and Conducting Gestures

Interestingly, there are strong similarities between conducting and sign language gestures. This similarity can be explained by the fact that these gestural languages both rely on visual and gestural modes of communication and on processes of spatiality and iconicity to build the meaning of the sequence of gestures (utterances in sign language or phrases in conducting). Spatiality is one of the fundamental elements of gestural expression, as the gestures are executed in the 3D geometric space surrounding the body. Iconicity is characterized by the more or less close resemblance between the imagined concept and the performed gesture.

Although sign languages differ from country to country, we find iconicity processes at all levels (phonological, lexical, syntactic–semantic). For example, rain can be represented in sign language with a claw handshape and a hand movement from top to bottom; variations of this sign make possible the creation of the signs river, torrent, or waterfall. Such movements can also control the sound synthesis of natural phenomena such as rain with various strengths in different environments [5]. In Play of the Waves (La Mer, Debussy), the conductor can move back and forth as if a wave was moving through the orchestra. These wave movements are very similar in sign language. It should be noted that iconic signs in sign language are not mimicry. Although they metaphorically imitate particular objects, situations, or actions, they follow specific conventions and rules. Conducting gestures use similar conventions. For example, the conductor, like the signer, uses their body and frontal space efficiently so that the musicians can distinguish the gestures and understand their meaning. Furthermore, the signer or conductor remains in place and refers spatially to static or dynamic entities in this abstract space.

Several aspects explain the richness of expression that both gestural languages offer. First, their multimodality allows the parallel use of information conveyed by different articulatory channels (including handshapes, hand movements, torso orientation, head movements, facial expressions, and eye gaze). Second, the gestures can be broken down into meaningful components that are then recombined to form signs or phrases. Moreover, similar grammatical mechanisms can be observed, both in conducting and sign language gestures.

In this chapter, we will consider three main grammatical processes:

  • The structuring into elementary components that we will call phonological components

  • The spatiality

  • The iconicity

We will review these three mechanisms by showing examples of the similarity between conducting and sign language gestures. In what follows, we will use Millet’s grammar of French Sign Language to describe the structural aspects of both sign language and conducting gestures [25]. This grammar, very flexible and generic, can be extended to different sign languages. We will show how it can apply to both gestural languages at the lexical, syntactic, or semantic levels using inflected processes.

3.1 Phonological Components

In sign language, we can identify minimum units, called phonological components, that are structured to form the signs and that take a limited number of values. One of the basic assumptions is that two distinct signs can be differentiated when only one of the components is changed (the so-called minimal pairs). These phonological components are expressed simultaneously in multiple channels, including manual and non-manual. Manual components contain Placement (PL), Hand Configuration (HC), and Hand Movement (HM), and non-manual ones include facial expression and eye gaze. The components of conducting gestures are similar to those of sign language; we will also call them phonological components.

Both sign language and conducting gestures use a limited set of hand configurations; Millet identifies about 41 in French sign language. In conducting, the number of handshapes regularly used for expressive gestures is about 10 (included in the sign language set), but this depends on the conducting technique and style. Similarly, although continuous, the hand movements in sign language are characterized by typical trajectories that belong to a finite number of shapes. Traditionally, we consider simple elementary movements (pointing, line, arc, ellipse) or complex ones (spiral, waves, etc.). These movements can be achieved at various locations (Locus) in the signing space (starting and target points), according to the three biomechanical planes (see Sect. 4). They can be unitary or repeated movements. Conducting gestures use similar hand movements but are limited in number (mainly pointing, line, arc, and ellipse). Later, we will also see that the location of gestures, in both sign language and conducting, can take values in a finite discrete set of areas surrounding the signer or the conductor. These components, combined in parallel with the other components, form signs that convey meanings that may vary if at least one of the phonological components is modified.

For example, in sign language, the Fist or Pursed hand configuration may be used to pick up a purse or a sheet of paper. Pursed, associated with placement near the mouth and an alternating hand movement of opening and closing the fingers, becomes the sign [DUCK]. In conducting, the attack gesture with the same Fist handshape, associated with a straight downward movement, means to hit hard. The same attack gesture with a Pursed handshape, associated with repeated and precise movements of small amplitude, means beating the bar in staccato mode.
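The decomposition into phonological components described above can be sketched as a small data structure. The component names and values below are hypothetical simplifications, not a repertoire from this chapter; in particular, the staccato gesture is reduced to the same downward movement so that only the handshape differs.

```python
from dataclasses import dataclass, asdict

# A sign or conducting gesture as a bundle of phonological components.
@dataclass(frozen=True)
class Gesture:
    placement: str   # PL: location in the signing/conducting space
    handshape: str   # HC: hand configuration (e.g., Fist, Pursed, Flat)
    movement: str    # HM: hand movement (e.g., line, arc, ellipse)

def is_minimal_pair(g1: Gesture, g2: Gesture) -> bool:
    """True if exactly one phonological component differs."""
    diffs = sum(a != b for a, b in zip(asdict(g1).values(), asdict(g2).values()))
    return diffs == 1

# The two attack gestures from the example, simplified so that only the
# handshape differs: "hit hard" (Fist) vs. staccato beating (Pursed).
hit_hard = Gesture(placement="center", handshape="Fist", movement="line-down")
staccato = Gesture(placement="center", handshape="Pursed", movement="line-down")
```

Changing a single component (here the handshape) yields a minimal pair with a different meaning, mirroring the phonological analysis above.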

4 Spatiality

In sign languages, signs and sentences are organized in space. We differentiate the signer’s space from the signing space. The signer’s space can be divided into discrete areas along three axes: height, distance, and radial, as shown in Fig. 1 (Left), or it can be described relative to the three biomechanical planes: sagittal, frontal, and transverse (Fig. 1, Right).

Fig. 1. The signer’s space. Left: discretization along the height, distance, and radial axes (extracted from [29]). Right: the three anatomical planes: sagittal, frontal, and transverse

The signing space goes beyond the physical or geometrical signer’s space—it is an abstract and delimited space, which makes spatial thinking possible. Through spatialization, signs can be signed at spatial references created and organized in the signing space. Locations, called Locus, become the referent locations of the entity. Deictic gestures may designate this entity by pointing with the index finger, the hand, or even with an eye gaze. Moreover, the entities can be placed relative to each other, with simplified and meaningful hand configurations, called proforms. This is also the space in which the discourse is deployed, which allows the syntactic consistency of sentences. For example, verbs can use trajectories in the signing space that link entities or express syntactic variations by changing personal pronouns. In French sign language, the signing space is divided into discrete pre-semantic areas (Fig. 2, Left).

Fig. 2. Left: The pre-semanticized signing space in French sign language. 1: Neutral space; 2: Pro-3 (pronoun he/she); 3: Pro-1 (pronoun I); 4: Inanimate (goal); 5: Indefinite agent; 6: Locative linked to the verb. Right: The conducting space in a symphony orchestra

We can define the conducting space as a delimited, abstract space that represents the stage, with musicians and groups of instruments (Fig. 2, Right). The conductor’s stage can be compared to a metaphorical surface (for the plan of the orchestral scene) or volume (for the sound), in which some entities can be designated or manipulated. There are spatial metaphors associated with this space. They can be found in the orchestra (e.g., a soloist, the timpanists, the string players), in showing, pointing, occupying space, following lines or curves, etc. They can also be found in the sound, in manipulating the instruments (pulling, pushing, gathering, etc.), or in the sound qualities (evoking a specific timbre, augmenting the brightness, etc.). During the performance, the metaphorical gestures used by the conductor are understood and translated into sounds. The musical discourse is thus elaborated in this space through spatial referencing (Locus), use of deictic gestures, following lines, paths, etc.

5 Iconicity

Iconicity is at the heart of sign languages and, more generally, gestural languages. In this section, we will explore the different types of iconicity involved at three levels: lexical, syntactic, and semantic, both for sign language and for conducting gestures.

5.1 Iconicity at the Lexical Level

At the lexical level, iconicity has an illustrative purpose in sign language, what Cuxac calls “signing by showing” [8]. The signs are thus represented by concrete objects, symbols, or metaphorical concepts. Two kinds of mechanisms can be used to modify the meaning of the signs by changing very few components.

  • A derivative-based mechanism designates a family of signs with a similar component attached to the same meaning. For example, signs located on the side of the forehead have a meaning related to psychic activity, such as [CONCEPT], [TO THINK], or [TO INVENT] (see Fig. 3). The placement is identical, while the hand configuration and hand movement are different.

  • Inflected mechanisms allow a sign to be modified by changing one specific component:

    • Size-and-Shape Specifiers use hand configuration, wrist orientation, and hand movement to describe the shape and size of an object. For example, the sign [BOWL] (Fig. 4, Left) becomes a [BIG-BOWL] (Fig. 4, Right) if the shape or size of the hand trajectory is modified.

    • Although presented earlier as a separate component, spatialization is implicitly included in iconic processes. An entity signed at a specific place is thereafter designated at that location: “this bowl at this place.”

    • Proforms represent entities (e.g., a person or an object) characterized by a limited number of hand configurations. They function as pronouns, thus avoiding naming an entity multiple times. For example, the [PERSON] proform can be positioned in the narrative scene. In addition, one person can be represented in different postures associated with different hand configurations (e.g., a raised finger for a standing person or a curved one for a sitting person). Also, several people can be represented in space (for example, around a table or in a conference room).

Fig. 3. Derivative-based signs in sign language with the same placement on the head. Left: [CONCEPT]. Middle: [TO THINK]. Right: [TO INVENT]

Fig. 4. Left: The standard sign [BOWL]. Right: the sign [BIG-BOWL], with size-and-shape specifier (extracted from [29])

In conducting gestures, we find similar iconic mechanisms. One of the significant components is the placement. Deictic gestures show locations on the stage, indicating, for example, a group of musicians. The handshape can be a traditional deictic index finger, a V handshape or a flat handshape, or even a slightly curved handshape. These deictic gestures can also be performed with different body parts, such as the head or the eye gaze. For example, the sign [LOOK-AT-ME] shown in Fig. 5 (Left), which can be used in both sign language and conducting gestures, involves a V handshape coupled with a pointing hand movement. During the execution of this gesture, the torso and the head move synchronously with the hand. In the same way, the phrase “I am looking at you” implies the same V handshape with a reversed hand movement, while the gaze is directed towards the target representing the entity to be seen, for example, a solo musician. This V-hand configuration can be considered derivative-based for a series of signs involving vision. We also find the various inflection mechanisms mentioned earlier. In the previous example, changing the gaze target or the hand trajectory changes the meaning of the gesture “I look at you.” A conducting gesture can also be performed at a specific location in the conducting stage, thus specifying an instruction to a specific group of musicians (spatialization). Furthermore, when it comes to expressing the radiating quality of an orchestral sound or a bright timbre, the movement can be more or less ample (Size-and-Shape Specifier) (Fig. 5, Right).

Fig. 5. Left: the sign [LOOK-AT-ME], with the V handshape, used both in conducting and sign languages. Right: the conducting gesture for increasing the brightness of the timbre

Among the functional conducting gestures, some concern dynamic gestures associated with the intensity with which the instruments play. These dynamic gestures are generally executed along vertical paths in the frontal plane: louder for an upward gesture and softer for a downward one. An inflected mechanism can be applied to the handshape, with a flat hand stretched upwards or slightly bent downwards, released at the end of the movement. Another inflection can be expressed by the kinematics of the movement: a strong acceleration will accompany a fast crescendo of large amplitude (from p to f), whereas a smoothly decreasing speed will be observed for a soft decrescendo. Thus, the expression of dynamics in expressive conducting gestures uses a combination of phonological components and inflectional processes similar to the size-and-shape specifiers of sign language.

Attack gestures can be represented by arc or line paths. Here also, the inflection can be applied to the handshape and the movement quality. A Fist handshape can express “Hitting hard”, indicating a powerful sound strike. The quality of the movement can also modulate the type of attack, with more or less weight given to the arm movement. To simulate a softer attack, the handshape can be modified, such as an open, flat hand with the palm facing down. In addition, by changing the movement and orientation of the hand, one can more closely imitate actions on specific materials (metal, wood, etc.) and use this metaphor to indicate different qualities of attack (representing, for example, various staccato). Again, these examples show the inflectional mechanisms used in conducting gestures, similar to sign language.

5.2 Iconicity at the Syntactical Level

In sign language, at the syntactic level, the relations between the entities of the scene are embedded in the signing space. For example, this iconicity can be represented by i) a relative Placement of the objects: e.g., “The ball is under the table,” where the proform [BALL] is shown under the proform [TABLE], the signs [BALL] and [TABLE] having been signed before, or ii) verbs described by trajectories in the signing space, also called indicating verbs. Different inflected mechanisms exist for such verbs. The first one is linked to the hand configuration, which represents, for transitive indicating verbs, the direct object. For example, in the two sentences “I give you a glass” and “I give you a coin,” the [GLASS] or the [COIN] are represented by different hand configurations: a cylindrical one for a glass or a pursed one for a coin (Fig. 6).

Fig. 6. Indicating verb with direct objects. “I give you a glass”, performed with a cylindrical HC meaning [GLASS]; with the Pursed HC, it becomes “I give you a coin”

The second inflected process for indicating verbs is achieved by changing the trajectories of the hands in the signing space, according to the agent and the recipient of the verb, respectively. Thus, in the sentence “You give me a glass,” the hand movement follows a line from a point in front of the signer to a point on their chest, whereas in the sentence “I give him a glass,” the line goes from the chest to the right side of the signer, symbolizing the 3rd person pronoun [PRO-3]. The hand configuration representing the direct object [GLASS] (cylindrical hand configuration) is identical.
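The two inflectional mechanisms just described (the hand configuration encoding the direct object, and the trajectory endpoints encoding agent and recipient) can be sketched as a toy lookup. The loci and handshape names below are hypothetical simplifications, not a formal transcription.

```python
# Illustrative sketch of inflection for indicating verbs. Loci and
# handshape names are hypothetical simplifications.

LOCI = {"PRO-1": "chest", "PRO-2": "front", "PRO-3": "right side"}
OBJECT_HC = {"GLASS": "cylindrical", "COIN": "pursed"}

def indicating_verb(agent, recipient, direct_object):
    """Phonological realization of [GIVE] inflected for its arguments:
    the handshape encodes the direct object, the trajectory runs from
    the agent's locus to the recipient's locus."""
    return {
        "handshape": OBJECT_HC[direct_object],
        "trajectory": (LOCI[agent], LOCI[recipient]),
    }

# "I give you a glass" vs. "You give me a glass":
# same handshape, reversed trajectory.
g1 = indicating_verb("PRO-1", "PRO-2", "GLASS")
g2 = indicating_verb("PRO-2", "PRO-1", "GLASS")
```

Swapping the direct object (GLASS for COIN) changes only the handshape, while swapping agent and recipient only reverses the trajectory, mirroring the two inflected processes described above.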

In conducting gestures, conductors also use indicating verbs, as illustrated in Fig. 7 (Right) with the phrase “I propose you prepare to start” corresponding to the sign [PROPOSE-PRO2] ([PRO2] being the 2nd person). Here, the conductor uses this sign to tell the flutist: “I propose you prepare your breath to start playing.” The hand movement goes from the chest towards the flutist, and the hand spreads from closed to open. This expressive gesture is very similar to the indicating verb [OFFER-PRO2], meaning “I offer you” used in different sign languages (Fig. 7, Left).

Fig. 7. Left: the French sign language indicating verb [OFFER-YOU]: “I offer you”. Right: the conducting gesture [PROPOSE-PRO2]: “I propose you prepare to start”

Many other indicating verbs borrowed from sign language are frequently used by conductors, with different meanings according to the context, for example, the sign language signs [TO INVITE], [TO BRING], [TO CARRY], etc. The expressive conducting gestures are not identical, but the inflected mechanisms follow the same rules. They involve primarily changes of the handshape, movement trajectory (direction, start and end locations that change the agent/beneficiary), and kinematics (dynamical quality). Thus, in the gesture “Pulling out an object” (Fig. 8, Left), the hand moves along a straight line from a musician on the stage toward the conductor. This means metaphorically “Pulling a sound.” It may be performed differently according to the direct object represented by the handshape. For “Pulling a full sound,” the Spread-bent handshape represents a specific brass instrument. Note that the French sign language sign [TO-ATTRACT] is very close to this expressive gesture (Fig. 8, Right).

Fig. 8. Left: The conducting gesture “Pulling a brass sound”. Right: The indicating verb [TO-ATTRACT] in French sign language

The substitution of the Spread-bent handshape by the Pinched handshape in Fig. 9 (Right) can be used to indicate the entrance of flute sounds or vocalists (“thinner” sounds). The handshape may represent the envelope of the instruments’ spectrum. In this gesture, the other components remain the same (movement and orientation of the hand), except the placement that might express a higher pitch. This gesture is similar to the French sign language sign [TO CHOOSE] executed with the dominant hand (Fig. 9, Left).

Fig. 9. Left: The indicating verb [TO CHOOSE] in French sign language. Right: The conducting gesture “Pulling a flute sound”

5.3 Iconicity at the Semantic Level

When preparing the orchestration, the conductor must understand the structure of the musical work, both in space (instruments) and time (musical development). In this preparation phase, the score is broken down into essential phases, using points of articulation or other markers (signs, text) located at the level of the instrumental ensemble or a specific group of instruments. This results in a constantly changing combination of instruments that come in and out at different times. Hence, dynamic changes, radiating quality of sound, and timbre are often achieved by the addition or the removal of instruments. During the performance, the conductor can then convey the most important cues of musical development through their gestures. These structural aspects are linked to the semantics of gestures. We distinguish spatial and temporal aspects, as well as aspects specific to the sound texture and quality.

From the point of view of spatial semantics, many gestures indicate musical paths. In particular, they show where a musical phrase begins and ends and in which direction it develops. These paths can be inscribed in the conducting space, showing, for example, the movement from one group of musicians to another. They may also represent melodic lines executed by the movement of the hand, such as a direct line, an arc, or a wave curve. The inflected elements considered here are mainly the placements or trajectories of the hand. The quality of the movement, especially the way the hand moves from one group to another, can also change to inform the musical evolution: slow, abrupt change, etc. Similarly, these trajectories can be found in the sign language narrative. For example, the French sign language sign [REGULAR] may indicate the steady flow of a crowd of people or a herd of gazelles. More generally, the movement of a vehicle or an animated entity can be represented in sign language by a trajectory between several target points in the signing space.

Specific aspects of the temporal structure of the musical work give rise to conducting gestures that indicate essential points of articulation in the development of the music. For example, the conductor, using a circular movement, can tell the musicians to keep moving at a specific tempo. In the same category of temporal semantics, similar circular and repeated trajectories can be found in the signs [TO CONTINUE] or [TO START AGAIN] in French sign language, which can also be used by conductors.

Fig. 10.

Temporal semantics. Left: the conductor’s gesture indicating a cut-off. Right: the sign [TO STOP] in French sign language

Also temporally, the end of a musical phrase can be indicated by the conductor with a cut-off gesture (Fig. 10, Left), which can be modulated by modifying the amplitude of the trajectory, using the whole arm or only the hand, or by closing the hand more or less rapidly. The French sign language sign [TO STOP] (Fig. 10, Right) can also be used by the conductor. Numerous other gestures warn the musicians of places in the score where they should pay attention, which can result, for example, in a deictic gesture with the index finger pointing upwards or a gesture mimicking a pivot zone in a musical passage by drawing it. These gestures are similar to those used in sign language.

Fig. 11.

Sound quality. Left: the conducting gesture “support an object” for a sustained sound. The French sign language signs [HEAVY] (Middle) and [LIGHT] (Right)

Semantic conducting gestures can also express aspects of the sound content or quality (timbre, brightness, spectral envelope, etc.). For example, a conducting gesture mimicking the touch of a flat surface can be used to obtain a homogeneous sound quality. This gesture has similarities with the sign [FLATTENED] in French sign language, which can be used with the Spread-Flat hand configuration and a movement in a horizontal plane to qualify the flat structure of a surface. In the same way, a squeaking sound can be represented with a slow movement and a claw handshape to evoke a thick substance corresponding to a rough spectral texture. Such a material could be represented by the same sign in sign language, for example, to knead a more or less thick and viscous dough. In contrast, a soft material would be associated with the French sign language sign [SOFT]. Finally, the sustained (Tenuto), heavy or light quality of the sound can be expressed by the conducting gesture meaning “Supporting an object” (Fig. 11, Left), or by the signs [HEAVY] or [LIGHT] in French sign language (Fig. 11, Middle and Right).

This presentation of conducting gestures closely related to sign language is far from being exhaustive. It would be interesting to extend this study by analyzing several conducting systems and systematizing the link between expressive conducting gestures and the grammatical mechanisms presented in this chapter. In the following, we use some examples mentioned above to build our gesture–sound database.

6 Repertoire of Four Classes of Conducting Gestures

In this section, we define a restricted subset of meaningful and expressive gestures borrowed from the vocabulary of conductors. These correspond to effective sound variations, particularly those transcribed on musical scores. We therefore proposed a case study to analyze conducting gestures performed by the non-dominant hand. For this purpose, based on the previous study, we created a dataset of expressive gestures to control the interpretation of musical excerpts, and we evaluated this dataset following Laban’s Theory of Effort [20, 23]. Our motivation was twofold. First, we wanted to transfer the nuances written on orchestra scores to expressive gestures. We thereby oriented our choice towards gestures inspired by orchestral conducting for their ability to represent meaningful and expressive sound variations. Second, we relied on the grammar of French sign language [25] to take into account the elements of gestural structuring presented in Sects. 4 and 5. To create this dataset, we followed the sound-tracing methodology [30]. We defined a limited set of sound objects belonging to traditional functional categories and derived gestures that reflect these categories with appropriate expressive variations.

6.1 Sound Categories and Variations

The challenge of the conductor is to have a global idea of the composer’s musical intention, to imagine sounds and colors, and to read and understand all the scores of all the instruments. Besides the information contained in the temporal organization (tempo, rhythm) of the musical excerpt, we focused on four main categories: Articulation, Dynamics, Attack, and Cut-off.

  • The Articulation category is related to the phrasing of the musical discourse, which is strongly dependent on the style of the piece. It expresses how specific parts of a piece are played from the point of view of musical phrasing and how they are linked and co-articulated, taking into account the synchronization and quality of the musical sequencing. Among the techniques of articulation, we have retained three in our case study: Legato (linked notes), Staccato (short and detached notes), and Tenuto (held and sustained notes). Note that these terms and their meanings might differ according to the instrument and the musical context.

  • The Dynamic category, also called Dynamics or Intensity in musicology, characterizes the music’s loudness. In our study, we were interested in variations of dynamics. These variations can be progressive (smooth) or abrupt, with an increase or decrease in intensity. Four dynamic variations have been retained: Long Crescendo, Long/Medium Decrescendo, Short Crescendo, Short Decrescendo.

  • The Attack category gathers different types of accents, which are indicated in the score by different symbols, but also by terms such as sforzato (sfz). In our study, we identified two primary distinctive attacks: Hard Hit, Soft Hit.

  • The Cut-off category expresses the way a musical phrase ends. We have retained two main variations within this last category: Hard Cut-off, Soft Cut-off.

Table 1. Repertoire of gestures: four categories (Articulation, Dynamics, Attack, Cut-off), described by their hand movements (HM) and hand configurations (HC). In each category, there are several classes. Attributes and possible values are given for each class. To simplify the table, we use Bent instead of Spread-Bent and Flat instead of Spread-Flat
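To make the repertoire concrete, the categories and variations retained above can be collected into a plain mapping, e.g. for later use as annotation labels. This is a minimal sketch; the dictionary form is our own illustration, not part of the original protocol:

```python
# The four interpretation categories and their retained variations,
# as enumerated above, collected into an annotation-label mapping.
CATEGORIES = {
    "Articulation": ["Legato", "Staccato", "Tenuto"],
    "Dynamics": ["Long Crescendo", "Long/Medium Decrescendo",
                 "Short Crescendo", "Short Decrescendo"],
    "Attack": ["Hard Hit", "Soft Hit"],
    "Cut-off": ["Hard Cut-off", "Soft Cut-off"],
}

# Eleven variations in total across the four categories
total = sum(len(v) for v in CATEGORIES.values())
print(total)  # 11
```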

6.2 Grammar of Gestures and Their Modulation

We defined a lexicon of gestures and their discrete variations according to the four categories mentioned above: Articulation, Dynamics, Attack, and Cut-off. The gestures in the Dynamic category are generally isolated actions performed in the frontal plane, upward or downward (crescendo or decrescendo), with varied duration, depending on the variation of the sound intensity (short, medium or long). The gestures in the Attack category correspond to short actions, so they can be used isolated or repeated a limited number of times, depending on the nature of the sound accents. The gestures in the Cut-off category are isolated actions performed at the end of musical phrases. They follow an elliptical trajectory that closes at the end, with a handshape that changes from open to closed. The amplitude, duration, and kinematic quality of these gestures change according to the end of the musical phrase. Unlike the other gestures, those of the Articulation category are continuous gestures repeated over one or more cycles. For this category, we considered three gestures involving various hand movements and handshapes performed in different planes with various kinematics.

The structure of these gestures is that of sign language gestures, defined by the parallel composition of phonological components, which gather manual components (Hand Placement, Hand Movement, Hand Configuration) and non-manual components (facial mimicry, eye gaze, mouthing). In this case study, our gestural corpus is composed only of hand–arm gestures. The number and nature of hand configurations and hand movements change according to the context (nature of the musical passage, style of the conductor, etc.). In our expressive gesture dataset, we selected five basic hand configurations that can be seen in Fig. 12 (Spread-Bent, Fist, Pursed, Spread-Flat, O). We retained four hand movements: Line, Arc, Ellipse, and Lemniscate (in 2D geometry, a lemniscate is any of several figure-eight-shaped curves).
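As a purely illustrative sketch (the class names and fields are ours, not the chapter's notation), this parallel composition of manual components can be modeled as a small data type:

```python
from dataclasses import dataclass
from enum import Enum

class HandConfig(Enum):
    """The five hand configurations retained in the dataset (Fig. 12)."""
    SPREAD_BENT = "Spread-Bent"
    FIST = "Fist"
    PURSED = "Pursed"
    SPREAD_FLAT = "Spread-Flat"
    O = "O"

class HandMovement(Enum):
    """The four retained hand movements."""
    LINE = "Line"
    ARC = "Arc"
    ELLIPSE = "Ellipse"
    LEMNISCATE = "Lemniscate"

@dataclass
class Gesture:
    """A gesture as a parallel composition of manual components."""
    category: str              # Articulation, Dynamics, Attack, or Cut-off
    movement: HandMovement
    start_config: HandConfig   # the handshape may evolve during the
    end_config: HandConfig     # gesture, e.g. open-to-closed for a Cut-off
    plane: str                 # e.g. "frontal" or "horizontal"

# The Cut-off gesture: elliptical trajectory, handshape closing at the end
cut_off = Gesture("Cut-off", HandMovement.ELLIPSE,
                  HandConfig.SPREAD_BENT, HandConfig.PURSED, "frontal")
```

Modulating one field while keeping the others fixed mirrors the substitution mechanism described in Sect. 5: swapping only `start_config` yields a semantically related but differently realized gesture.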

Fig. 12.
figure 12

List of the five selected hand configurations. Top, from left to right: Spread-Bent, Fist, Pursed. Bottom, from left to right: Spread-Flat, O

Combining these parallel components results in gestures with specific meaning and expressiveness. The modification of one or more components can lead to the alteration of the gesture’s expressiveness. For example, an Attack can be represented by a generic gesture meaning “Hitting an object.” It is mainly characterized by a vertical Arc component in the frontal plane, the type of the hand configuration indicating the nature of the object being hit, and the quality of the motion indicating the strength of the hitting.

Table 1 illustrates our gesture repertoire according to the four categories and the two manual dimensions (hand movements and hand configurations). For each category, several discrete classes have been identified, associated with a set of discrete attributes and values. Moreover, the modification of the quality of the movement (above all, the kinematic quality, such as speed and acceleration, or the dynamic quality, such as the variation of the effort impelled in the gesture) modulates the gesture and, consequently, the sound nuance.

6.3 Data Acquisition Protocol

How the datasets of gestures or sounds corresponding to the categories described above are constructed is essential insofar as it determines the richness of the resulting sound and gesture variations, in particular, the quality, precision, and subtlety of expressive nuances. In the following, we describe the data acquisition methodology we adopted in our case study.

Our approach is directly inspired by sound tracing experiments on digital tablets whose goal was to produce 2D kinematic tracings related to sounds categorized by Pierre Schaeffer’s typology of sound objects [11]. Other experiments extended this principle in 3D by exploiting motion capture technologies based on markers detected by infrared optical cameras, leading to very accurate recordings [30]. In these experiments, the gestures were performed freely while listening to sound examples with a limited number of sound features (pitch, spectral centroid, dynamic envelope). In our experiments, we also adopted this sound tracing methodology, but we focused instead on higher-level cognitive sound features, using musical excerpts related to the interpretation categories presented above. Several aspects of the sound characteristics intervene simultaneously (dynamics, timbre, etc.). Still, we have selected musical excerpts so that each of them highlights, more specifically, one of the categories identified above. In addition, to limit the variability of gestures, these were determined based on a lexicon of sign language gestures approved by conductors. These gestures, especially those involving iconic dynamics, are similar from one sign language to another and may be shared by different conductors.

Our data collection comprises two kinds of musical excerpts, for a total of 50 musical excerpts:

  • 30 excerpts of orchestral classical music, mostly taken from conducting scores [24]

  • Two musical phrases with different variations played on a piano (one variation at a time, keeping the same tempo of 80 bpm), extracted from works of J. S. Bach: Prelude No. 1 in C Major and Cantata BWV 147

These excerpts cover the four sound categories (Dynamics, Attack, Cut-off, Articulation), and each sound variation is represented in different musical excerpts (at least three excerpts per variation). Moreover, within the same musical excerpt, different nuances of the same variation can be present at different times of the excerpt (for example, several attacks or several cut-offs). An expert conductor validated these musical excerpts and the corresponding chosen gestures.

Our motion data was recorded with a motion capture system based on passive markers and infrared cameras, which precisely measures the positions of markers located on the body (20 markers) and the hands (8 markers per hand) at a frame rate of 200 Hz. Three subjects participated in the recording session: one expert musician with a good level of conducting, one expert musician in classical music, and one non-expert subject.

For each musical excerpt, there was a preliminary training phase in which the excerpt was played several times, and the participant was instructed to perform a given gesture with their non-dominant hand while listening. Then, the executed movement was recorded along with the corresponding sound excerpt. During each recorded sequence, the user repeated the gesture at least five times. This process was repeated for each musical excerpt. After pre-processing and manual segmentation, the dataset comprises 1265 gesture samples per subject. Even though we obtained synchronized gesture and sound data, this experimental protocol has several drawbacks. In particular, the data were recorded in a studio and not in a real orchestral performance situation, which does not allow for analyzing the anticipation specific to conducting gestures. In the following, we only analyze the data of the expert subject.
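Since each recorded sequence contains at least five repetitions of a gesture, segmentation must locate the rest phases between them. The snippet below sketches one common heuristic (a velocity threshold with a minimum segment duration) on a synthetic hand-speed signal at the 200 Hz frame rate; the actual dataset was segmented manually, so the function and its parameters are our own illustration:

```python
import numpy as np

def segment_by_rest(speed, fs=200, thresh=0.05, min_gap=0.25):
    """Split a hand-speed signal into gesture segments separated by
    low-velocity rest phases. `thresh` is the speed threshold and
    `min_gap` the minimum segment duration in seconds."""
    moving = speed > thresh            # boolean activity mask
    min_len = int(min_gap * fs)        # minimum segment length in frames
    segments, start = [], None
    for i, m in enumerate(moving):
        if m and start is None:
            start = i                  # a movement phase begins
        elif not m and start is not None:
            if i - start >= min_len:   # keep only long-enough chunks
                segments.append((start, i))
            start = None
    if start is not None and len(moving) - start >= min_len:
        segments.append((start, len(moving)))
    return segments

# Synthetic speed profile: two movement bursts separated by rest
t = np.arange(0, 3, 1 / 200)
speed = np.where((t > 0.5) & (t < 1.0), 1.0, 0.0) \
      + np.where((t > 1.8) & (t < 2.5), 0.8, 0.0)
print(segment_by_rest(speed))  # two (start, end) frame pairs
```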

6.4 Evaluation and Research Methodology

We used questionnaires to evaluate the gesture and sound databases. Questions concerning the expressive quality of gestures were related to the Effort parameters from the Laban Movement Analysis theory [20, 23]. This theory identifies semantic components that describe the structural, geometric, and dynamic properties of human motion. The Effort components focus more specifically on qualitative movement aspects regarding dynamics, energy, and intent [21]. It comprises four sub-categories (Weight, Time, Space, and Flow), which vary continuously in intensity between opposing poles. The Weight Effort parameter refers to physical movement properties, the two opposing weights being Strong (powerful, forceful) or Light (gentle, delicate, sensitive). The Time Effort parameter represents the sense of urgency and has been defined by two opposing dimensions: Sudden (urgent, quick) and Sustained (stretching the time, steady). The Space Effort parameter defines the directness of the movement, which is related to the attention to the surroundings: Direct (focused and toward a particular spot) and Indirect (multi-focused and flexible). Finally, the Flow Effort parameter defines the continuity of the movement: Free (fluid, released) and Bound (controlled, careful, and restrained).

Within the preliminary study, we were interested in classifying the expressiveness of the performed Articulation gestures. These gestures were evaluated through Laban’s Effort parameters (Weight, Time, Space, Flow). We used two types of questions: i) questions based on Laban Effort parameters, expressed quantitatively on a Likert scale from 1 to 7; ii) questions based on semantic terms (at least three terms per opposite pole). A total of 21 subjects answered the questionnaires. The qualitative variables were coded as numeric variables. This allowed us to propose a classification method for expressive gestures according to the three expressive classes (Legato, Tenuto, Staccato). To classify the expressive qualities, two machine learning methods were used: Logistic Regression and Random Forest. We obtained accuracies of 86% and 84% for the two sets of questions, respectively, which encourages us in our approach.

This preliminary evaluation appears relevant, as it allows us to discriminate the different expressive classes of the Articulation category. It constitutes a methodological approach that can be used in gesture recognition for sonification systems to validate the choice of gestures and their variations. Although applied here to Articulation gestures, it can also be adopted for the other categories.

7 Conclusion

Expressive conducting gestures are essential for guiding musicians and can be the starting point for gesture recognition systems used for sonic interaction. Such an interactive system involves recognizing the gestures being performed, adapting to their variations in real-time, and finding the most effective and meaningful mapping algorithms to match gesture and sound parameters. The specificity of such an approach and related research challenges can be summarized as:

  • Multichannel structure: meaningful gestures can be defined as spatial and temporal structural patterns. These patterns contain multiple channels running in parallel, such as hand configurations and movements, eye gaze, and facial expressions. Within these patterns, we can identify stable and static areas (hand configurations and facial expressions), dynamic areas (hand movements), and transient areas (co-articulation within patterns). For example, the Cut-off gesture can be represented by an elliptical movement (hand movement channel) combined with a handshape (hand configuration channel) evolving between the Spread-Bent and the Pursed shape.

  • Segmentation: motion capture data is represented as a multidimensional time series, which needs to be synchronized with sound to identify meaningful phases and those that constitute transitions between gestures.

  • Annotation: structured and segmented gestures can be labeled and annotated (like syllables or words in a language), thus allowing the identification of meaningful motion chunks. This annotation process can be done manually or automatically. A sequence of postures can be represented by a symbolic sequence similar to a written phrase in natural language and following the indications written on musical scores.

  • Expressive qualities: variational aspects of expressive gestures are inscribed into patterns that can be temporally adjusted according to the musical context and the expressive intention of the conductor. For example, the Tenuto articulation can be realized by a gesture that follows an elliptic trajectory similar to a Legato gesture; it is the variation of speed and acceleration that determines the expressive modulation.
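The multichannel and annotation points above can be sketched together. The key-frame values below are hypothetical, chosen only to mirror the Cut-off example (elliptical movement, handshape closing from Spread-Bent to Pursed):

```python
# A Cut-off pattern as two parallel manual channels sampled at three
# hypothetical key frames. Real patterns would also carry non-manual
# channels (eye gaze, facial expression).
pattern = {
    "hand_movement": ["ellipse-open", "ellipse-mid", "ellipse-close"],
    "hand_config":   ["Spread-Bent",  "Spread-Bent", "Pursed"],
}

# Parallel channels must stay time-aligned: same number of key frames.
frame_counts = {len(channel) for channel in pattern.values()}
assert len(frame_counts) == 1

# Annotation collapses the whole pattern into one symbolic label, the
# way a word labels a sequence of phonemes.
label = "Cut-off"
print(label, "->", list(zip(pattern["hand_movement"],
                            pattern["hand_config"])))
```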

Many unsolved questions remain. One of the central issues in recognition systems concerns gesture adaptation and anticipation, which is necessary to control time-constrained sound processes. Several approaches which exploit different types of models (Hidden Markov models, dynamic time warping, particle filtering, etc.) have been developed [1, 2, 14]. Another issue is related to the gesture–sound mapping process. The very different nature of sound and motion signals makes it difficult to identify the best characteristics of each and propose mappings between them.

A large amount of data is needed to capture the high variability of conductors’ expressive gestures. The advent of neural architectures using deep learning opens up new possibilities for gesture recognition and mapping. Given the time-series nature of gestures, sequence-to-sequence approaches should succeed in recognizing both gestures and their variations, as long as enough data is available to train their deep architectures. Structuring gestures into patterns might improve the performance of these neural networks. However, the data available for training such models is still limited; in particular, aligned motion-audio resources, which would be needed as parallel training data, are lacking.

Beyond the analysis of conducting gestures, this chapter provides insights for building gesture–sound datasets for studying expressive gestures with strong semantics. The proposed methodology opens up the possibility of creating new systems of gestural interaction for sonification; it facilitates the learning of gestures and their sharing by many musicians and non-musicians and contributes to the effectiveness of semiotic communication by exploiting grammatical mechanisms specific to gestural languages.