CoHEARence of audible shapes—a qualitative user study for coherent visual audio design with resynthesized shapes

One way to achieve an intuitive sound design are visual approaches for synthesis and sound collages. Therefore, during spectral synthesis and editing, the sound is designed in a visualization of the frequency domain. In order to create a coherent workflow between visuals and the resulting audio, the stimuli should be matched to each other. In this work, a qualitative user study is presented, which is supposed to show the intuitive understanding from the shape to the sound. The shape is hereby the spectral envelope. The general aim is to find out whether there is a connection between the visual shape and the subsequent auditory impression.


Introduction
Synthesizer and DSP algorithms are often adjusted via value inputs, sliders or knobs, similar to their hardware counterparts.However, sound design and music production becomes more creative through a visual manipulation of audio.
Furthermore, it was shown that multi sensory convergence exists in the low-sensory processing [34] and for visual-auditive stimuli (see [14,27]).This low-level processing of coherent sensor inputs allows the improvement of visual and acoustic perception by simultaneous matching stimuli (compare also [4,34]).The early convergence suggests that acoustic and visual stimuli have a positive effect on each other [26].The correlation between color and sound was examined with an empirical approach [17] and showed that there is a strong correlation between loudness and saturation as well as tonality and brightness.Lars Engeln lars.engeln@tu-dresden.deRainer Groh rainer.groh@tu-dresden.de 1 Chair of MediaDesign, Technische Universität Dresden, Dresden, Germany Besides feature mapping [17], timbre design [29], wave terrain synthesis [31], scanned synthesis [33], and geometric oscillator [24], spectral editing is a common way to have a direct relationship between the visual (spectrogram) and the (resynthesized) sound.In addition, the research field of soundtextures (compare [28]) describes a graphical synthesis that is sometimes done in the frequency domain.As for spectral editing in connection with SpecDraw [7], AudioSculpt [3] provides tools for damping, enforcing, and duplicating spectrals, as well as for time-stretching [2].TAPESTREA [23] allows the creation of sonic spaces from different sounds.With Meta-Synth, images can be used as input within the frequency domain.SPEAR [20] got a sparse representation of the frequency domain and is thereby abstracting the spectrogram with partial lines.Furthermore, a touch interface for partial line mapping corresponding to morphing was shown [12].Moreover, filters and transpositions with metaphorical manipulations of the frequency domain with fluids [8] were proposed, and besides that, affine transformation and image processing effects are introduced with VisualAudio-Design [10].
In order to achieve a visual sound design, a user should be able to understand why the visual arrangement sounds the way it does after synthesis.To better understand how sound is shaped, subjects were asked to perform hand gestures for given melodic phrases [19].To sketch sounds, not only graphical shapes are discussed but also the voice as input for complex ideas of sounds is used [18].On the contrary, with Track-draw [1], a speech synthesizer was controlled by drawing.Also, it is proven that movements induced by sounds of drawing scratches are predictable [32].So, whether a circle, ellipse, stroke, and so on was drawn.In addition, drawing graphic scores were utilized for public and collaborative creation [25].Likewise, text as symbolic input was sonified as spectral shapes (sine sweeps and envelopes of white noise) [30].Besides that, with Sound Mosaics [15,16], a system for sound synthesis via influencing graphical variables was created.With the terms of gestural similarity [22], it can be assumed that a visual and a sound are coherent to each other, if they are created by gestures and the gesture generator remains the same.
To examine the coherence between the visual design and the (re-)synthesized audio, a small qualitative study was conducted with audio-and design-related students.However, the subjects did not order the sounds to the shapes (spectral envelope), or evaluate an existing assignment.Their main task was to visually describe the sound with sketches, whereby the sounds were made by visual shapes in the frequency domain.In this way, the association to a shape while hearing the sound is investigated, and therefore the connection from the visual to the resulting audio could be shown.In addition, the compositional problem according to Klingbeil [21] is been addressed, to compose music in the frequency domain.
The results are classified by function and by visual semiotic meaning.The subject of semiotics is described with its pragmatic, semantic, and syntactic function (compare [6]).Pragmatic describes the effect of a subject, i.e., how a user thinks and feels about a subject.The semantic describes the meaning of the subject and the syntax the general structure of the subject.Thereby, the syntactic structure influences the semantic meaning and the semantic meaning the pragmatic effect.The semantic is differentiated into the symbolic, indexical, and iconic meaning.Symbols are characterized by social conventions, like a vocabulary an user has to learn.In the semiotics, the metaphoric meaning is more like a subclass of symbolic, but it is leading to indexical.The indexical meaning references to hints, which allude a contextual meaning.For instance, the pictogram of a floppy disk has the indexical hint (being a storage medium) to save the current documents.For younger users, who do not grew up with floppy disks, the meaning becomes a symbol, although the visual is still pictographic.Thereby, the iconic meaning is the direct reflection of the thing itself (the storage medium).The semiotic classification of the semantic was compared to a classification of earcons [11].Doing this, abstract earcons up to metaphoric earcons were differentiated and organized.

User study on resynthesized shapes
This user study extends the pre-test [9] by asking additional 16 persons to perform the same tasks as the pre-test.The goal was to get more insights in what people are doing when drawing the sounds they are hearing.That is why the pre-test is briefly considered first; after that, the setup and the tasks for the study are described.Then, the findings are reported, which are discussed afterwards.

User (pre-)test
A pre-test was conducted (see [9]), in which 6 subjects (2 female, 4 male) aged 20-25 took part.Because the pre-test was conducted within a seminar for audio processing, all subjects were at least interested in audio or working with audio.This pre-test should show whether it is generally reasonable to investigate the connection between sound and visual with drawings in depth and which direction should be taken with this very tasks as study method.In Fig. 6, all results for task 1 (see Section 2.4) are shown.The results from the pre-test are A to E in Fig. 6.Only the drawings from five subjects are displayed, because one subject only described the imagination with keywords.
Thereby, A and C drew forms, where A is the only one who has drawn solid forms.D and E used just one stroke and characterized only one single sound from the repetition.A's drawings seem to be based more on spectrograms, B's on waveforms and sinusoids, C's on waveforms, and D's on envelopes (loudness-curve).It seems as if the drawings are based on the iconic meaning, i.e., the representation of the physical signal (with taking the perception into account).

Setup
Seven elementary shapes in different rotations were examined (see Fig. 1).Each shape was repeated three times and with each repetition the shape became smaller, to decouple the shapes size for the examination.
These shapes are interpreted as magnitudes and were resynthesized with a ISTFT using VisualAudio-Design [10].Although VisualAudio-Design is able to estimate phases iteratively, the phases were taken from white noise, because the shapes can be assumed as band-limited envelopes of the white noise.The window size was 4096 and the lowest frequency of a shape was 200 Hz and the highest 800 Hz (each for the biggest), except for F 6. For F 6, it was approx.10-990 Hz, because F 6 was a affine transformation of the square (F 5).
The range 200-800 Hz (roughly two octaves from G3 to G5) was chosen to guarantee an almost flat slant in the isophones.These sounds files were made public for reproduction with the publication of the pre-test [9] on researchgate. 1he subjects did not know that the sounds are created by elementary shapes and they were allowed to hear the audio file multiple times.But, they knew that there were three repetitions of the same sound for each F n , which become smaller and more quiet.The order of the sounds to test were static according to their naming (F 0, F 1, F 2, ..., F 6).So no randomization of the files for each subject was done, although it might be that some people have first listened to all the sounds, or have worked on the sounds in an irregular order.

Participants
In addition to the 5 participants of the pre-test who turned in drawings, a further 16 (3 female, 13 male) took part at the extended study.The mean of the age is 26.7 and has a range of 21 to 43 (compare Fig. 2).All participants are related to (media) computer science and/or media design.
The test group has a equally distributed center at 2.94 (5 is best, 1 is worst) for the experience in graphic design (see Fig. 3) and a lower mean of 2.50 for the experience in sound design.There is also a wide range of how often they make music with a mean of 2.81.Therefore, the test group can be regarded as largely unbiased, although there is a shift towards graphic design and an overhang towards male subjects.More subjects with knowledge in DSP would have been appreciated.

Task 1:
"How does the audio sound to you? Illustrate your imagination (paint a picture or sketch a shape).Record your sound impression with few key points."Task 2: "Sort the sounds from 'I like best' to 'I like least'.Give a point from 0-100 (100 is the best, 0 is the worst) and give a short explanation."Task 3: (3a will stand for task 3 with A i and 3b for B k ) "How do the A i and B k sounds differ each from each other and from the F n sounds?"Initially, all tasks were performed by the subjects.Afterwards, the results from the pre-test were discussed individually with each participant.Subjects in the extended study only considered their impression with key phrases briefly.To deal with the visual impression of the sound as key points Fig. 2 Age distribution of 16 subjects from the extended study Fig. 3 Experience distribution of 16 subjects from the extended study.They were asked for the knowledge and/or the skill in the specific domains (5 is best, 1 is worst) Fig. 4 Variations of F 0 and F 5 for task 3a (see Section 2.4)-the shape is slightly blurred (A1) and even more blurred (A2, A3) Fig. 5 Variations of all shapes for task 3b (see Section 2.4)-the shapes are outlined (only the second biggest shape was used here) Table 1 A brief overview of the results showing stylistics in contrast to the experience (+ is 4 and 5, -is 1).Thereby, lines and shapes are iconic, metaphoric is indexical or symbolic, and abstract is symbolic in terms of the representation of the signal

Findings
The pre-test (see Section 2.1) suggests that the subjects uses the iconic meaning of the perceived signal, so that they would draw lines to visualize envelopes/loudness and waveform-like curves or shapes to visualize bandwidths/spectrograms (compare Table 1 A to E).But in addition, the extended study shows that some subject uses metaphoric and abstract meanings of the sound, for instance, whirling a rope (M F6 and F3, see Fig. 6) or leafing through paper (V F3).These drawn mental images are actually iconic in themselves, because they reproduce a natural action, but viewed from the perspective of the sound, they are considered as symbolic.
All subjects who used iconic meanings used the x-axis for the time (from left to right), except for subject D. D used the y-axis (top-down).For better comparison, D's results were rotated, so that the direction of time matches to the other results.Some people drew only one sound (see Table 1), although each file had repetitions (see Section 2.2), others drew each repetition solo, or in conjunction with the others.
A is the only one who has drawn solid forms, whereas O and R visualized a solid form with additional lines in the shape.At a first glimpse, the F 0 of B, H, J, R, T, U and the F 5 of D, H, J, L, N, O, P, R, S, T, U and the F 4 of D, G, J, S match perfectly with the corresponding shape, as well as C's F 5 and F 6, due to the spikes on it outlining the corners of F 6. B reports that F 6 sounds like two tones, so apparently with G's and T's F 6. The combined impression from all subjects for each F n is shown in Table 2 (the impressions are stabilized with the feedback from the extended study).Thereby, F 5 was mostly negatively characterized, and F 3 and F 6 mostly positively.This is also reflected in the rating (see Fig. 7) from task 2. The box plot indicates that F 5 is significantly less appealing than F 3 and might be less appealing than F 2, F 1, and F 4, but no variance analysis has been done yet.The outlined versions B k shifted the appearance drastically (see Table 2).A sudden garish sound (F 5 ) became novel and nice (B 1 ), and a thrilling sound (F 6 ) became spongy (B 2 ) for instance.With blurring (A i ), the sounds became more pleasant and less sharp, but more noisy.A correspondence between a Fig. 7 Results from the rating task (task 2, see Section 2.4) as box plot with exclusive median, sorted by rating-the higher (and therefore more right) the more appealing the sound is Fig. 8 One possible syntactic order of all drawings in relationship to their meaning and function.Thereby, equivalence classes for their resembling function like sinusodial, waveform, envelope, shape, and spectrogram (all beeing indexical) and resembling meaning like metaphoric and abstract emerges noisy sound and a blurred shape appears as consistent, due to most of the subject reported this.

Discussion
First, the results and findings are interpreted.After that, the scope and further fields of interest for coherent spectral design are indicated.Finally, the interface for resynthesis used here is briefly considered.

Interpretation of the results
Various findings should be elaborated and demonstrated individually in specific studies and a clear formalization of coherent result should be done.
The two triangles pointing up (F 3) and down (F 1) were mostly illustrated round and in the case of F1 also verbally described as a round sound.This impression is reinforced by resynthesis with the ISTFT, since the large analysis window smooths finer resolutions in time like the peak of the triangle (compare the round edges in the re-analysis in Fig. 1).Also, F 2 had a rounder impression than F 4, although they were smoothed in time the same way, because of just being flipped according to the time.Here, temporal (post-)masking might benefits the impression of a harder edge of F 4.
The subjects said that they disliked the hard edges (no attack and no release) of a sound like F 2, F 4, or F 5. That could be why the square F 5 had the worst ratings (having two edges), followed by F 2 (no attack) and F 4 (no release).The blurring (A i ), which softens the hard edges, also made the sounds more pleasant.
Interestingly, the attack into the higher frequencies seems to be more appealing than the attack towards the lower frequencies (see Fig. 7).This is also the key characteristic for F 4 by D, E, F, H, J, and S. The rating of the stacked combination F 6 of F 1 and F 3 is between them.For the subjects, the swelling and un-swelling aspect of F 2 and F 4 is more important, in contrast to the direction of change of F 1 and F 3 (Figs.8 and 9).
A huge variety of different styles for each shape is given (see Fig. 10).This is emphasized in Fig. 8 by having nearly a consistent transition in the syntactic order form symbolic (the very left) and iconic (the very right).No real clusters of unique visual descriptions appeared.More drastic is that the Fig. 10 Overview of all drawings syntactically ordered for each shape triangular shapes are usually perceived as rounded, which leads to confusion with the circle (compare Fig. 11).

Topic of observation
Here, we only considered the visual shape for a resynthesized sound.But a visual design for audio means a lot more: shapes, sounds, transformations, manipulations, effects, and coloration.More in depth, what often forgotten is that additional visualizations, next to the spectrogram the spectral editing is working in, supports visually the understanding of the frequency domain (compare Fig. 9).These are topics for coherent visual audio design as well.Therefore, not only the shape or more generally the data for designing should be examined but also the entire interface and the way of interaction is important for overall coherency.

Interface for spectral editing
The spectral editing app used here for the resynthesis has multiple visualizations surrounding the spectrogram for more detailed insights and was tested with a brief user study [10].The user study indicated that this spectral editing app is creative, clear, predictable, and easy to use with potential for improvement.For a joyful user experience, the interface matters and is influencing the creativity of visual audio design.
Spectral editing means not only editing existing magnitudes but also adding new magnitudes by strokes and shapes.Drawings can also be used for query-by-sketch searches for audible content (c.f.[5,13]).Hereby, the drawing itself does not have to be the sketch to be queried (iconically) but an interpretation of the sketch can also be Fig.11 A naive association of the drawings to the visually most suitable shape used (indexically).Drawings are leading to symbols for an interpretation to the iconic spectrals.By using symbols, the interaction becomes more perceivable iconic.

Conclusion
A study method for the coherence of visual design for sound was introduced, in which the subjects only hear the resulting sound.Our approach is that when the visual impression of what is only heard correlates with the actual visual sound design, there is coherence.Nevertheless, the results have a huge bandwidth of possible drawings for the different sounds, although a visual alphabet can be derived from the results to find a translation from the visual to the sound.For this, the combination of both directions, impression from the visual to the sound and impression from the sound to the visual, should be emphasized.In future a tool for automatic clustering of the result from drawing studies should be considered, to deal with even more participants, due to this very diverse and subjective topic.

Fig. 1
Fig.1Resynthesized shapes for the user test-the shape as input magnitude for the STFT (upper row), the resynthesized waveform of the shape (middle), and the re-analysed spectrogram of the waveform (lower row)

Table 2
Combined characterizations for each F n with same order like Figs.1 and 5with corresponding comparison to the outlined variations B k