Privacy Implications of Voice and Speech Analysis – Information Disclosure by Inference

Internet-connected devices, such as smartphones, smartwatches, and laptops, have become ubiquitous in modern life, reaching ever deeper into our private spheres. Among the sensors most commonly found in such devices are microphones. While various privacy concerns related to microphone-equipped devices have been raised and thoroughly discussed, the threat of unexpected inferences from audio data remains largely overlooked. Drawing from literature of diverse disciplines, this paper presents an overview of sensitive pieces of information that can, with the help of advanced data analysis methods, be derived from human speech and other acoustic elements in recorded audio. In addition to the linguistic content of speech, a speaker’s voice characteristics and manner of expression may implicitly contain a rich array of personal information, including cues to a speaker’s biometric identity, personality, physical traits, geographical origin, emotions, level of intoxication and sleepiness, age, gender, and health condition. Even a person’s socioeconomic status can be reflected in certain speech patterns. The findings compiled in this paper demonstrate that recent advances in voice and speech processing induce a new generation of privacy threats.


Introduction
Since the invention of the phonograph in the late 19th century, it has been technically possible to record and reproduce sounds. For a long time, this technology was exclusively used to capture pieces of audio, such as songs, audio tracks for movies, or voice memos, and for telecommunication between humans. With recent advances in automatic speech recognition, it has also become possible and increasingly popular to interact via voice with computer systems [96].

Inference of Personal Information from Voice Recordings
Based on experimental studies from the academic literature, this section presents existing approaches to infer information about recorded speakers and their context from speech, non-verbal human sounds, and environmental background sounds commonly found in audio recordings. Where available, published patents are also referenced to illustrate the current state of the art and point to potential real-world applications.
Fig. 1 provides an introductory overview of the types of audio features and the categories of inferences discussed in this paper.

Speaker Recognition
Human voices are considered to be unique, like handwriting or fingerprints [100], allowing for the biometric identification of speakers from recorded speech [66]. This has been shown to be possible with speech recorded from a distance [71] and with multispeaker recordings, even under adverse acoustic conditions (e.g., background noise, reverb) [66]. Voice recognition technology has already been patented [50] and is being applied in practice, for example to verify the identity of telephone customers [40] or to recognize users of virtual assistants like Amazon Alexa [1].
Mirroring the privacy implications of facial recognition, voice fingerprinting could be used to automatically link the content and context of sound-containing media files to the identity of speakers for various tracking and profiling purposes.

Inference of Body Measures
Research has shown that human listeners can draw inferences about body characteristics of a speaker based solely on hearing the target's voice [42,55,69]. In [42], voice-based estimates of the waist-to-hip ratio (WHR) of female speakers predicted the speakers' actual WHR, and the estimated shoulder-to-hip ratio (SHR) of male speakers predicted the speakers' actual SHR measurements. In another study, human evaluators estimated the body height and weight of strangers from a voice recording almost as well as they did from a photograph [55].
Various attempts have been made to identify the acoustic voice features that enable such inferences [25,29,69]. In women, relationships were discovered between voice parameters, such as subharmonics and frequency perturbation, and body features, including weight, height, body mass index, and body surface area [29]. Among men, individuals with larger body shape, particularly upper body musculature, are more likely to have low-pitched voices, and the degree of formant dispersion in male voices was found to correlate with body size (height and weight) and body shape (e.g., waist, chest, neck, and shoulder circumference) [25].
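To make the notion of formant dispersion more concrete, the following minimal sketch computes it under one common definition, namely the average spacing between adjacent formant frequencies. Whether [25] uses exactly this definition is an assumption here, and the formant values are hypothetical numbers chosen for illustration, not measurements from any cited study.

```python
from statistics import mean


def formant_dispersion(formants_hz):
    """Average spacing (Hz) between adjacent formant frequencies.

    Assumed definition: mean of F(i+1) - F(i) over the sorted formants.
    """
    if len(formants_hz) < 2:
        raise ValueError("need at least two formant frequencies")
    ordered = sorted(formants_hz)
    gaps = [hi - lo for lo, hi in zip(ordered, ordered[1:])]
    return mean(gaps)


# Hypothetical formant values (Hz) for two speakers, for illustration only.
speaker_a = [500.0, 1500.0, 2500.0, 3500.0]
speaker_b = [600.0, 1800.0, 3000.0, 4200.0]

dispersion_a = formant_dispersion(speaker_a)
dispersion_b = formant_dispersion(speaker_b)
```

In this toy example, speaker A has the smaller dispersion, which would typically be associated with a longer vocal tract.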
Although research on the speech-based assessment of body configuration is not as advanced as other inference methods covered in this paper, corresponding algorithms have already been developed. For instance, researchers were able to automatically estimate the body height of speakers based on voice features with an accuracy of 5.3 centimeters, surpassing human performance at this task [69].
Many people feel uncomfortable sharing their body measurements with strangers [12]. The researchers who developed the aforementioned approach for speech-based body height estimation suggest that their algorithm could be used for "applications related to automatic surveillance and profiling" [69], thereby highlighting just some of the privacy threats that may arise from such inference possibilities.

Mood and Emotion Recognition
There has been extensive research on the automatic identification of emotions from speech signals [21,23,53,95,99]. Even slight changes in a speaker's mental state invoke physiological reactions, such as changes in the nervous system or changes in respiration and muscle tension, which in turn affect the voice production process [20]. Besides voice variations, it is possible to automatically detect non-speech sounds associated with certain emotional states, such as crying, laughing, and sighing [4,23].
Automatic emotion recognition from speech can function under realistic noisy conditions [23,95] as well as across different languages [21] and has long been delivering results that exceed human performance [53]. Audio-based affect sensing methods have already been patented [47,77] and translated into commercial products, such as the voice analytics app Moodies [54].
Information about a person's emotional state can be valuable and highly sensitive. For instance, Facebook's ability to automatically track emotions was a necessary precondition for the company's scandalous 2014 experiment, in which it observed and systematically manipulated the mental states of over 600,000 users for opaque purposes [14].

Inference of Age and Gender
Numerous attempts have been made to uncover links between speech parameters and speaker demographics [26,34,48,92]. A person's gender, for instance, can be reflected in voice onset time, articulation, and duration of vowels, which is due to various reasons, including differences in vocal fold anatomy, vocal tract dimensions, hormone levels, and sociophonetic factors [92]. It has also been shown that male and female speakers differ measurably in word use [26]. Like humans, computer algorithms can identify the sex of a speaker from a voice sample with high accuracy [48]. Precise classification results are achieved even under adverse conditions, such as loud background noise or emotional and intoxicated speech [34].
Just as the gender of humans is reflected in their anatomy, changes in the speech apparatus also occur with the aging process. During puberty, vocal cords are thickened and elongated, the larynx descends, and the vocal tract is lengthened [15]. In adults, age-related physiological changes continue to systematically transform speech parameters, such as pitch, formant frequencies, speech rate, and sound pressure [28,84].
Automated approaches have been proposed to predict a target's age range (e.g., child, adolescent, adult, senior) or actual year of birth based on such measures [28,85]. In [85], researchers were able to estimate the age of male and female speakers with a mean absolute error of 4.7 years. Underlining the potential sensitivity of such inferred demographic information, age-based and sex-based unfair treatment are both among the most prevalent forms of discrimination [24].
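For readers unfamiliar with the metric, the mean absolute error reported for age estimation is simply the average of the absolute differences between true and predicted ages. The sketch below illustrates the computation with hypothetical ages. These are not data from [85], and the function name is an illustrative choice.

```python
def mean_absolute_error(true_values, predicted_values):
    """Average absolute difference between paired true and predicted values."""
    if not true_values or len(true_values) != len(predicted_values):
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum(abs(t - p) for t, p in zip(true_values, predicted_values))
    return total / len(true_values)


# Hypothetical speaker ages (years), for illustration only.
true_ages = [23, 35, 47, 61]
predicted_ages = [27, 30, 52, 57]

# Absolute errors are 4, 5, 5, and 4 years.
mae_years = mean_absolute_error(true_ages, predicted_ages)
```

A mean absolute error of this size means that individual estimates can still be off by considerably more than the average in either direction.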

Inference of Personality Traits
Abundant research has shown that it is possible to automatically assess a speaker's character traits from recorded speech [3,79,80,88]. Some of the markers commonly applied for this purpose are prosodic features, such as speaking rate, pitch, energy, and formants [68], and characteristics of linguistic expression [88].
Existing approaches mostly aim to evaluate speakers along the so-called "Big Five" personality traits (also referred to as the "OCEAN model"), comprising openness, conscientiousness, extroversion, agreeableness, and neuroticism [88]. The speech-based recognition of personality traits is possible both in binary form (high vs. low) and in the form of numerical scores [79]. High estimation accuracies have been achieved for all OCEAN traits [3,80,88].
Besides the Big Five, voice and word use parameters have been correlated with various other personality traits, such as gestural expressiveness, interpersonal awkwardness, fearfulness, and emotionality [26]. Even culture-specific attributes, such as the extent to which a speaker accepts authority and unequal power distribution, can be inferred from speech data [101].
It is well known that personality traits represent valuable information for customer profiling in various industries, including targeted advertising, insurance, and credit risk assessment, with potentially harmful effects for the data subjects [17,18]. Some data analytics firms also offer tools to automatically rate job applicants and predict their likely performance based on vocal characteristics [18].

Deception Detection
Research has shown that the veracity of verbal statements can be assessed automatically [60,107]. Among other speech cues, acoustic-prosodic features (e.g., formant frequencies, speech intensity) and lexical features (e.g., verb tense, use of negative emotion words) were found to be predictive of deceptive utterances [67]. Increased changes in speech parameters were observed when speakers are highly motivated to deceive [98].
Speech-based lie detection methods have become effective, surpassing human performance [60] and almost reaching the accuracy of methods based on brain activity monitoring [107]. There is potential to further improve the classification performance by incorporating information on the speaker's personality [2], some of which can be inferred from voice recordings as well (as we have discussed in section 2.5).
The growing possibilities of deception detection may threaten a recorded speaker's ability to use lies as a means of sharing information selectively, which is considered to be a core aspect of privacy [63].

Detection of Sleepiness and Intoxication
Medium-term states that affect cognitive and physical performance, such as fatigue and intoxication, can have a measurable effect on a speaker's voice. Approaches exist to automatically detect sleepiness from speech [19,89]. There is even evidence that certain speech cues, such as speech onset time, speaking rate, and vocal tract coordination, can be used as biomarkers for the separate assessment of cognitive fatigue [93] and physical fatigue [19].
Similar to sleepiness and fatigue, intoxication can also have various physiological effects, such as dehydration, changes in the elasticity of muscles, and reduced control over the vocal apparatus, leading to changes in speech parameters like pitch, jitter, shimmer, speech rate, speech energy, nasality, and clarity of pronunciation [5,13]. Slurred speech is regarded as a hallmark effect of excessive alcohol consumption [19].
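Two of the parameters named above, jitter and shimmer, quantify cycle-to-cycle irregularity in the voice signal. The following sketch uses the common "local" definitions: the mean absolute difference between consecutive glottal periods (jitter) or consecutive peak amplitudes (shimmer), divided by the respective mean. It assumes that periods and amplitudes have already been extracted from the recording, and the input values are hypothetical. The cited studies may rely on other variants of these measures.

```python
def local_perturbation(values):
    """Mean absolute difference of consecutive values, relative to their mean.

    Applied to glottal periods this yields local jitter, applied to
    per-cycle peak amplitudes it yields local shimmer (assumed definitions).
    """
    if len(values) < 2:
        raise ValueError("need at least two cycles")
    diffs = [abs(b - a) for a, b in zip(values, values[1:])]
    mean_diff = sum(diffs) / len(diffs)
    mean_value = sum(values) / len(values)
    return mean_diff / mean_value


# Hypothetical per-cycle measurements, for illustration only.
periods_ms = [5.0, 5.2, 4.8, 5.0]   # durations of consecutive glottal cycles
amplitudes = [1.0, 0.9, 1.1, 1.0]   # peak amplitudes of the same cycles

jitter = local_perturbation(periods_ms)
shimmer = local_perturbation(amplitudes)
```

A perfectly regular voice would yield zero for both measures, while higher values indicate reduced control over the vocal apparatus.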
Based on such symptoms, intoxicated speech can be automatically detected with high accuracy [89]. For several years now, systems have been achieving results that are on par with human performance [13]. Besides alcohol, the consumption of other drugs such as ±3,4-methylenedioxymethamphetamine ("MDMA") can also be detected based on speech cues [7].

Accent Recognition
During childhood and adolescence, humans develop a characteristic speaking style which encompasses articulation, phoneme production, tongue movement, and other vocal tract phenomena and is mostly determined by a person's regional and social background [64]. Numerous approaches exist to automatically detect the geographical origin or first language of speakers based on their manner of pronunciation ("accent") [9,45,64].
Research has addressed the discrimination of accents within one language, such as regional Indian accents in spoken Hindi (e.g., Kashmiri, Manipuri, Bengali, neutral Hindi) [64] or accents within the English language (e.g., American, British, Australian, Scottish, Irish) [45], as well as the recognition of foreign accents, such as Albanian, Kurdish, Turkish, Arabic, and Russian accents in Finnish [9] or Hindi, Russian, Italian, Thai, and Vietnamese accents in English [9,39].
By means of automated speech analysis, it is not only possible to identify a person's country of origin but also to estimate his or her "degree of nativeness" on a continuous scale [33]. Non-native speakers can even be detected when they are very fluent in the spoken language and have lived in the respective host country for several years [62]. Experimental results show that existing accent recognition systems are effective and have long reached accuracies comparable to human performance [9,39,45,62].
Native language and geographical origin can be sensitive pieces of personal information, which could be misused for the detection and discrimination of minorities. Unfair treatment based on national origin is a widespread form of discrimination [24].

Speaker Pathology
Through indicative sounds like coughs or sneezes and certain speech parameters, such as loudness, roughness, hoarseness, and nasality, voice recordings may contain rich information about a speaker's state of health [19,20,47]. Voice analysis has been described as "one of the most important research topics in biomedical electronics" [104].
Conditions that extend beyond the speech production system can also be detected from voice samples, including Huntington's disease [76], Parkinson's disease [19], amyotrophic lateral sclerosis [74], asthma [104], Alzheimer's disease [27], and respiratory tract infections caused by the common cold and flu [20]. The sound of a person's voice may even serve as an indicator of overall fitness and long-term health [78,103].
Further, voice cues may reveal a speaker's smoking habit: A linear relationship has been observed between the number of cigarettes smoked per day and certain voice features, allowing for speech-based smoker detection in a relatively early stage of the habit (<10 years) [30]. Recorded human sounds can also be used for the automatic recognition of physical pain levels [61] and the detection of sleep disorders like obstructive sleep apnea [19].
Computerized methods for speech-based health assessment reach near-human performance in a variety of recognition and analysis tasks and have already been translated into patents [19,47]. For example, Amazon has patented a system to analyze voice commands recorded by a smart speaker to assess the user's health [47].
The EU's General Data Protection Regulation classifies health-related data as a special category of personal data for which particular protection is warranted (Art. 9 GDPR). Among other discriminatory applications, such data may be used by insurance companies to adjust premiums of policyholders according to their state of health [18].

Mental Health Assessment
Speech abnormalities are a defining characteristic of various mental illnesses. A voice with little pitch variation, for example, is a common symptom in people suffering from schizophrenia or severe depression [36]. Other parameters that may reveal mental health issues include verbal fluency, intonation, loudness, speech tempo, semantic coherence, and speech complexity [8,31,36].
Depressive speech can be detected automatically with high accuracy based on voice cues, even under adverse recording conditions, such as low microphone quality, short utterances, and background environmental noise [19,41]. Not only the detection, but also a severity assessment of depression is possible using a speech sample: In men and women, certain voice features were found to be highly predictive of their HAMD (Hamilton Depression Rating Scale) score, which is the most widely used diagnostic tool to measure a patient's degree of depression and suicide risk [36]. Researchers have even shown that it is possible to predict a future depression based on speech parameters, up to two years before the speaker meets diagnostic criteria [75].
Other mental disorders, such as schizophrenia [31], autism spectrum conditions [19], and post-traumatic stress disorder [102], can also be detected through voice and speech analysis. In some experiments, such methods have already surpassed the classification accuracy of traditional clinical interviews [8].
In common with a person's age, gender, physical health, and national origin, information about mental health problems can be very sensitive, often serving as a basis for discrimination [83].

Prediction of Interpersonal Perception
A person's voice and manner of expression have a considerable influence on how he or she is perceived by other people [44,51,88,90]. In fact, a single spoken word is enough to obtain personality ratings that are highly consistent across independent listeners [10]. Research has also shown that personality assessments based solely on speech correlate strongly with whole person judgements [88]. Conversely, recorded speech may reveal how a speaker tends to be perceived by other people.
Studies have shown, for example, that fast talkers are perceived as more extroverted, dynamic, and competent [80], that individuals with higher-pitched voices are perceived as more open but less conscientious and emotionally stable [44], that specific intonation patterns increase a speaker's perceived trustworthiness and dominance [81], and that certain prosodic and lexical speech features correlate with observer ratings of charisma [88].
Researchers have also investigated the influence of speech parameters on the perception and treatment of speakers in specific contexts and areas of life. It was found, for instance, that voice cues of elementary school students significantly affect the judgements teachers make about their intelligence and character traits [90]. Similarly, certain speech characteristics of job candidates, including their use of filler words, fluency of speaking, and manner of expression, have been used to predict interviewer ratings for traits such as engagement, excitement, and friendliness [70]. Other studies show that voice plays an important role in the popularity of political candidates as it influences their perceived competence, strength, physical prowess, and integrity [51].
According to [6], voters tend to prefer candidates with a deeper voice and greater pitch variability. The same phenomenon can be observed in the appointment of board members: CEOs with lower-pitched voices tend to manage larger companies, earn more, and enjoy longer tenures. In [65], a voice pitch decrease of 22.1 Hz was associated with $187 thousand more in annual salary and a $440 million increase in the size of the enterprise managed. On top of this, voice parameters also have a measurable influence on perceived attractiveness and mate choice [44].
Based on voice samples, it is possible to predict how strangers judge a speaker along certain personality traits, a technique referred to as "automatic personality perception" [88]. Considering that the impression people make on others often has a tangible impact on their possibilities and success in life [6,51,65,90], it becomes clear how sensitive and revealing such information can be.

Inference of Socioeconomic Status
Certain speech characteristics may allow insights into a person's socioeconomic status. There is ample evidence, for instance, that language abilities (including vocabulary, grammatical development, complexity of utterances, and productive and receptive syntax) vary significantly between different social classes, starting in early childhood [38]. Therefore, people from distinct socioeconomic backgrounds can often be told apart based on their "entirely different modes of speech" [11]. Besides grammar and vocabulary, researchers found striking inter-class differences in the variety of perspectives utilized in communication and in the use of stylistic devices, observing that once the nature of the difference is grasped, it is "astonishing how quickly a characteristic organization of communication [can] be detected" [87].
Not only language skills, but also the sound of a speaker's voice may be used to draw inferences about his or her social standing. The menarcheal status of girls, for example, which can be derived from voice samples, is used by anthropologists to investigate living conditions and social inequalities in populations [15]. In certain contexts, voice cues, such as pitch and loudness, can even reveal a speaker's hierarchical rank [52].
Based on existing research, it is difficult to say how precise speech-based methods for the assessment of socioeconomic status can become. However, differences between social classes certainly appear discriminative enough to allow for some forms of automatic classification.

Classification of Acoustic Scenes and Events
Aside from human speech, voice recordings often contain some form of ambient noise. By analyzing background sounds, it is possible to recognize the environment in which an audio sequence was recorded, including indoor environments (e.g., library, restaurant, grocery store, home, metro station, office), outdoor environments (e.g., beach, city center, forest, residential area, urban park), and transport modes (e.g., bus, car, train) [43,97].
Algorithms can even recognize drinking and eating moments in audio recordings and the type of food a person is eating (e.g., soup, rice, apple, nectarine, banana, crisps, biscuits, gummi bears) [19,91]. Commercial applications like Shazam further demonstrate that media sounds, such as songs and movie soundtracks, can be automatically identified and classified into their respective genre with high accuracy, even based on short snippets recorded in a noisy environment [49].
Through such inferences, ambient sounds in audio recordings may not only allow insights into a device holder's context and location, but also into his or her preferences and activities. Certain environments, such as places of worship or street protests, could potentially reveal a person's religious and political affiliations.
Sensitive information can even be extracted from ultrasonic audio signals inaudible to the human ear. An example that has received a lot of media attention recently is the use of so-called "ultrasonic beacons", i.e., high-pitched Morse signals which are secretly emitted by speakers installed in businesses and stores, or embedded in TV commercials and other broadcast content, allowing companies to unobtrusively track the location and media consumption habits of consumers. A growing number of mobile apps (several hundred already, some of them very popular) are using their microphone permission to scan ambient sound for such ultrasonic signals, often without properly informing the user about it [59].
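As a rough illustration of how an app could scan ambient sound for such a signal, the sketch below uses the Goertzel algorithm to estimate the amplitude of a single near-ultrasonic tone in a block of samples. It is not the method of any app discussed in [59]. The 19 kHz beacon frequency, the 48 kHz sample rate, and the synthetic test signal are assumptions made for illustration, and decoding the information carried by a beacon is not covered.

```python
import math


def goertzel_power(samples, sample_rate, target_hz):
    """Squared DFT magnitude at the bin closest to target_hz (Goertzel)."""
    n = len(samples)
    k = round(n * target_hz / sample_rate)
    coeff = 2.0 * math.cos(2.0 * math.pi * k / n)
    s_prev, s_prev2 = 0.0, 0.0
    for x in samples:
        s = x + coeff * s_prev - s_prev2
        s_prev2, s_prev = s_prev, s
    return s_prev ** 2 + s_prev2 ** 2 - coeff * s_prev * s_prev2


def tone_amplitude(samples, sample_rate, target_hz):
    """Estimated amplitude of a sinusoid at target_hz within the block."""
    power = max(goertzel_power(samples, sample_rate, target_hz), 0.0)
    return 2.0 * math.sqrt(power) / len(samples)


SAMPLE_RATE = 48_000   # assumed recording rate (Hz)
BEACON_HZ = 19_000     # assumed beacon frequency (Hz), illustrative only
BLOCK_SIZE = 4_800     # 0.1 s of audio


def synthetic_block(with_beacon):
    """Audible 200 Hz tone, optionally mixed with a faint 19 kHz tone."""
    block = []
    for i in range(BLOCK_SIZE):
        t = i / SAMPLE_RATE
        x = math.sin(2.0 * math.pi * 200.0 * t)
        if with_beacon:
            x += 0.05 * math.sin(2.0 * math.pi * BEACON_HZ * t)
        block.append(x)
    return block


amp_with = tone_amplitude(synthetic_block(True), SAMPLE_RATE, BEACON_HZ)
amp_without = tone_amplitude(synthetic_block(False), SAMPLE_RATE, BEACON_HZ)
```

Even though the beacon is twenty times weaker than the audible tone in this synthetic example, its amplitude is recovered clearly, which hints at why such signals are easy for software to detect while remaining unnoticed by listeners.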

Discussion and Implications
As illustrated in the previous section, sensitive inferences can be drawn from human speech and other sounds commonly found in recorded audio. Apart from the linguistic content of a voice recording, a speaker's patterns of word use, manner of pronunciation, and voice characteristics can implicitly contain information about his or her biometric identity, body features, gender, age, personality traits, mental and physical health condition, emotions, intention to deceive, degree of intoxication and sleepiness, geographical origin, and socioeconomic status. While there is a rich and growing body of research to support the above statement, it has to be acknowledged that many of the studies cited in this paper achieved their classification results under ideal laboratory conditions (e.g., scripted speech, high quality microphones, close-capture recordings, no background noise) [10,20,30,36,55,60,70,82,94,107], which may raise doubt about the generalizability of their inference methods. Also, while impressive accuracies have been reached, it should not be neglected that nearly all of the mentioned approaches still exhibit considerable error rates.
On the other hand, since methods for voice and speech analysis are often subject to non-disclosure agreements, the most advanced know-how arguably rests within the industry and is not publicly available. It can be assumed that numerous corporate and governmental actors with access to speech data from consumer devices possess much larger amounts of training data and more advanced technical capabilities than the researchers cited in this paper. Amazon, for example, spent more than $23 billion on research and development in 2017 alone, has sold more than 100 million Alexa-enabled devices and, according to the company's latest annual report, "customers spoke to Alexa tens of billions more times in 2018 compared to 2017" [108]. Moreover, companies can link speech data with auxiliary datasets (e.g., social media data, browsing behavior, purchase history) to draw other sensitive inferences [47] while the methods considered in this paper exclusively rely on human speech and other sounds commonly found in recorded audio. Looking forward, we expect the risk of unintended information disclosure from speech data to grow further with the continuing proliferation of microphone-equipped devices and the development of more efficient inference algorithms. Deep learning, for instance, still appears to offer significant improvement potential for automated voice analysis [3,19].
While recognizing the above facts and developments as a substantial privacy threat, it is not our intention to deny the many advantages that speech applications offer in areas like public health, productivity, and convenience. Devices with voice control, for instance, improve the lives of people with physical disabilities and enhance safety in situations where touch-based user interfaces are dangerous to use, e.g., while driving a car. Similarly, the detection of health issues from voice samples (see sect. 2.9) could help in treating illnesses more effectively and reduce healthcare costs.
But since inferred information can be misused in countless ways [17,18], robust data protection mechanisms are needed in order to reap the benefits of voice and speech analysis in a socially acceptable manner. At the technical level, many approaches have been developed for privacy protection at different stages of the data life cycle, including operations over encrypted data, differential privacy, data anonymization, secure multiparty computation, and privacy-preserving data processing on edge devices [46,72,106]. Various privacy safeguards have been specifically designed or adjusted for audio mining applications. These include voice binarization, hashing techniques for speech data, fully homomorphic inference systems, differentially private learning, the computation of audio data in separate entrusted units, and speaker de-identification by voice transformation [72,73]. A comprehensive review of cryptography-based solutions for speech data is provided in [72]. Privacy risks can also be moderated by storing and processing only the audio data required for an application's functionality. For example, where only the linguistic content is required, voice recordings can be converted to text in order to eliminate all voice-related information and thereby minimize the potential for undesired inferences.
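To give a flavor of one of the safeguards listed above, the following sketch applies the Laplace mechanism, a basic building block of differential privacy, to release the average of a voice feature across many speakers. It is an illustrative example only and not a method taken from [46,72,106]. It assumes that each speaker contributes exactly one value, that the number of speakers is public, and that values are clipped to a known range. The pitch values and the range of 50 to 300 Hz are hypothetical.

```python
import random


def private_mean(values, lower, upper, epsilon, rng):
    """Release the mean of clipped values with epsilon-differential privacy.

    Laplace noise is sampled as the difference of two exponential draws.
    """
    if not values or epsilon <= 0 or upper <= lower:
        raise ValueError("need values, epsilon > 0, and upper > lower")
    clipped = [min(max(v, lower), upper) for v in values]
    # Changing one speaker's value shifts the mean by at most this much.
    sensitivity = (upper - lower) / len(clipped)
    rate = epsilon / sensitivity  # inverse of the Laplace scale
    noise = rng.expovariate(rate) - rng.expovariate(rate)
    return sum(clipped) / len(clipped) + noise


# Hypothetical mean pitch values (Hz) of 1000 speakers, for illustration only.
rng = random.Random(0)
pitches_hz = [rng.uniform(80.0, 260.0) for _ in range(1000)]

true_mean = sum(pitches_hz) / len(pitches_hz)
noisy_mean = private_mean(pitches_hz, 50.0, 300.0, epsilon=1.0, rng=rng)
```

With many speakers, the added noise is small relative to the aggregate statistic, while the contribution of any single speaker is masked. Smaller values of epsilon yield stronger protection at the cost of more noise.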
In advocating data collection transparency and informational self-determination, the recent privacy discourse has put a focus on the recording mode of microphone-equipped devices, where a distinction can be made between "manually activated," "speech activated," and "always on" [34]. However, data scandals show that reporting modes cannot always be trusted [105]. And even where audio is only recorded and transmitted with a user's explicit consent, sensitive inferences may unnoticeably be drawn from collected speech data, ultimately leaving the user without control over his or her privacy. Enabling the unrestricted screening of audio data for potentially revealing patterns and correlations, recordings are often available to providers of cloud-based services in unencrypted form, an example being voice-based virtual assistants [1,22]. With personal data being the foundation for highly profitable business models and strategic surveillance practices, it is certainly not unusual for speech data to be processed in an unauthorized or unexpected manner. This is well illustrated by recently exposed cases where Amazon, Google, and Apple ordered human contractors to listen to private voice recordings of their customers [22].
The findings compiled in this paper reveal a serious threat to consumer privacy and show that more research is needed into the societal implications of voice and speech processing. In addition to investigating the technical feasibility of inferences from speech data in more detail, future research should explore technical and legal countermeasures to the presented problem, including ways to enforce existing data protection laws more effectively. Of course, the problem of undesired inferences goes far beyond microphones and needs to be addressed for other data sources as well. For example, in recent work, we have also investigated the wealth of sensitive information that can be implicitly contained in data from air quality sensors, infrared motion detectors, smart meters [56], accelerometers [57], and eye tracking sensors [58]. It becomes apparent that sensors in many everyday electronic devices can reveal significantly more information than one would assume based on their advertised functionality. The crafting of solutions to either limit the immense amounts of knowledge and power this creates for certain organizations, or to at least avert impending negative consequences, will be an important challenge for privacy, social justice, and civil rights advocates over the years to come.

Conclusion
Microphones are widely used in connected devices, where they have a large variety of possible applications. While recognizing the benefits of voice and speech analysis, this paper highlights the growing privacy threat of unexpected inferences from audio data. Besides the linguistic content, a voice recording can implicitly contain information about a speaker's identity, personality, body shape, mental and physical health, age, gender, emotions, geographical origin, and socioeconomic status, and may thereby potentially reveal much more information than a speaker wishes and expects to communicate.
Further research is required into the privacy implications of microphone-equipped devices, taking into account the evolving state of the art in data mining technology. As it is impossible, however, to meaningfully determine the limits of inference methods developed behind closed doors, voice recordings, even where the linguistic content does not seem rich and revealing, should be regarded and treated as highly sensitive by default. Since existing technical and legal countermeasures are limited and do not yet offer reliable protection against large-scale misuses of audio data and undesired

Fig. 1. Overview of some sensitive attributes discernible from speech data.