Automated Proxemic Feature Extraction and Behavior Recognition: Applications in Human-Robot Interaction
- Mead, R., Atrash, A. & Matarić, M.J. Int J of Soc Robotics (2013) 5: 367. doi:10.1007/s12369-013-0189-8
In this work, we discuss a set of feature representations for analyzing human spatial behavior (proxemics) motivated by metrics used in the social sciences. Specifically, we consider individual, physical, and psychophysical factors that contribute to social spacing. We demonstrate the feasibility of autonomous real-time annotation of these proxemic features during a social interaction between two people and a humanoid robot in the presence of a visual obstruction (a physical barrier). We then use two different feature representations—physical and psychophysical—to train Hidden Markov Models (HMMs) to recognize spatiotemporal behaviors that signify transitions into (initiation) and out of (termination) a social interaction. We demonstrate that the HMMs trained on psychophysical features, which encode the sensory experience of each interacting agent, outperform those trained on physical features, which only encode spatial relationships. These results suggest a more powerful representation of proxemic behavior with particular implications in autonomous socially interactive and socially assistive robotics.
Keywords: Proxemics · Spatial interaction · Spatial dynamics · Sociable spacing · Social robot · Human-robot interaction · PrimeSensor · Microsoft Kinect
Proxemics is the study of the dynamic process by which people position themselves in face-to-face social encounters. This process is governed by sociocultural norms that, in effect, determine the overall sensory experience of each interacting participant. People use proxemic signals, such as distance, stance, hip and shoulder orientation, head pose, and eye gaze, to communicate an interest in initiating, accepting, maintaining, terminating, or avoiding social interaction [10, 35, 36]. These cues are often subtle and noisy.
A lack of high-resolution metrics limited previous efforts to model proxemic behavior to coarse analyses in both space and time [24, 39]. Recent developments in markerless motion capture—such as the Microsoft Kinect, the Asus Xtion, and the PrimeSensor—have addressed the problem of real-time human pose estimation, providing the means and justification to revisit and more accurately model the subtle dynamics of human spatial interaction.
In this work, we present a system that takes advantage of these advancements and draws on inspiration from existing metrics (features) in the social sciences to automate the analysis of proxemics. We then utilize these extracted features to recognize spatiotemporal behaviors that signify transitions into (initiation) and out of (termination) a social interaction. This automation is necessary for the development of socially situated artificial agents, both virtually and physically embodied.
While the study of proxemics in human-agent interaction is relatively new, there exists a rich body of work in the social sciences that seeks to analyze and explain proxemic phenomena. Some proxemic models from the social sciences have been validated in human-machine interactions. In this paper, we focus on models from the literature that can be applied to recognition and control of proxemics in human-robot interaction (HRI).
2.1 Proxemics in the Social Sciences
The anthropologist Edward T. Hall coined the term “proxemics”, and proposed that psychophysical influences shaped by culture define zones of proxemic distances [15–17]. Mehrabian, Argyle and Dean, and Burgoon et al. analyzed psychological indicators of the interpersonal relationship between social dyads. Schöne was inspired by the spatial behaviors of biological organisms in response to stimuli, and investigated human spatial dynamics from physiological and ethological perspectives; similarly, Hayduk and Mainprize analyzed the personal space requirements of people who are blind. Kennedy et al. studied the amygdala and how emotional (specifically, fight-or-flight) responses regulate space. Kendon analyzed the organizational patterns of social encounters, categorizing them into F-formations: “when two or more people sustain a spatial and orientation relationship in which the space between them is one to which they have equal, direct, and exclusive access.” Proxemics is also impacted by factors of the individual—such as involvement, sex, age, ethnicity, and personality—as well as environmental features—such as lighting, setting, location in setting and crowding, size, and permanence.
2.2 Proxemics in Human-Agent Interaction
The emergence of embodied conversational agents (ECAs) necessitated computational models of social proxemics. Proxemics for ECAs has been parameterized on culture and on so-called “social forces”. The equilibrium theory of spacing was found to be consistent with ECAs in [6, 31]. Pelachaud and Poggi provide a summary of aspects of emotion, personality, and multimodal communication (including proxemics) that contribute to believable embodied agents.
As robotic systems are increasingly placed in close social proximity to human users, it becomes necessary for them to interact with humans using natural modalities. Understanding social spatial behavior is fundamental to enabling these agents to interact appropriately. Rule-based proxemic controllers have been applied to HRI [20, 29, 52]. Interpersonal dynamic models have also been investigated in HRI [38, 48]. Jung et al. proposed guidelines for robot navigation in speech-based and speechless social situations (related to Hall’s voice loudness code), targeted at maximizing user acceptance and comfort. A spatially situated methodology for evaluating “interaction quality” (social presence) in mobile remote telepresence interactions has been proposed and validated based on Kendon’s theory of proxemic F-formations. Contemporary probabilistic modeling techniques have been applied to socially appropriate person-aware robot navigation in dynamic crowded environments [50, 51], to calculate a robot approach trajectory to initiate interaction with a walking person, to recognize the averse and non-averse reactions of children with autism spectrum disorder to a socially assistive robot, and to position the robot for user comfort. A lack of high-resolution quantitative measures has limited these efforts to coarse analyses [23, 39].
In this work, we present a set of feature representations for analyzing proxemic behavior motivated by metrics commonly used in the social sciences. Specifically, in Sect. 3, we consider individual, physical, and psychophysical factors that contribute to social spacing. In Sects. 4 and 5, we demonstrate the feasibility of autonomous real-time annotation of these proxemic features during a social interaction between two people and a humanoid robot. In Sect. 6, we compare the performance of two different feature representations to recognize high-level spatiotemporal social behaviors. In Sect. 7, we discuss the implications of these representations of proxemic behavior, and propose an extension for more continuous and situated proxemics in HRI.
3 Feature Representation and Extraction
In this work, proxemic “features” are based on the most commonly used metrics in the social sciences literature. We first extract Schegloff’s individual features. We then use the features of two individuals to calculate the features for the dyad (pair). We are interested in two popular and validated annotation schemas of proxemics: (1) Mehrabian’s physical features, and (2) Hall’s psychophysical features.
3.1 Individual Features
Stance Pose: most dominant involvement cue; position midway between the left and right ankle positions and orientation orthogonal to the line segment connecting the left and right ankle positions
Hip Pose: subordinate to stance pose; position midway between the left and right hip positions, and orientation orthogonal to the line segment connecting the left and right hip positions
Torso Pose: subordinate to hip pose; position of torso and average of hip pose orientation and shoulder pose orientation (weighted based on relative torso position between hip pose and shoulder pose)
Shoulder Pose: subordinate to torso pose; position midway between the left and right shoulder positions and orientation orthogonal to the line segment connecting the left and right shoulder positions
Head Pose: subordinate to shoulder pose; extracted and tracked using the head pose estimation technique described in 
Hip Torque: angle between hip and stance poses
Torso Torque: angle between torso and hip poses
Shoulder Torque: angle between shoulder and torso poses
Head Torque: angle between head and shoulder poses.
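To make the pose and torque definitions above concrete, the following is a minimal 2D sketch, not the authors’ implementation; `midpoint_pose` and `torque` are hypothetical helper names, and poses are assumed to be (x, y, θ) tuples.

```python
import math

def midpoint_pose(left, right):
    """Position midway between two joints; orientation orthogonal to the
    segment connecting them (e.g., stance pose from the two ankles)."""
    x = (left[0] + right[0]) / 2.0
    y = (left[1] + right[1]) / 2.0
    # Direction of the left-to-right segment; the facing direction is
    # orthogonal to it (here, rotated -90 degrees).
    theta = math.atan2(right[1] - left[1], right[0] - left[0]) - math.pi / 2.0
    return (x, y, theta)

def torque(pose_a, pose_b):
    """Signed angle between two poses, wrapped to [-pi, pi]
    (e.g., hip torque = angle between hip and stance poses)."""
    d = pose_b[2] - pose_a[2]
    return math.atan2(math.sin(d), math.cos(d))
```

The same midpoint construction applies to the hip and shoulder poses; the torque features then chain these poses from stance up to head.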
3.2 Physical Features
Total Distance: magnitude of a Euclidean distance vector from the pelvis of agent A to the pelvis of agent B
Straight-Ahead Distance: magnitude of the x-component of the total distance vector
Lateral Distance: magnitude of the y-component of the total distance vector
Relative Body Orientation: magnitude of the angle of the pelvis of agent B with respect to the pelvis of agent A
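The dyadic physical features above can be sketched as follows; this is an illustrative 2D computation assuming each pelvis pose is an (x, y, θ) tuple, with `physical_features` a hypothetical helper name.

```python
import math

def physical_features(pose_a, pose_b):
    """Dyadic physical features from two pelvis poses (x, y, theta).
    Distances are expressed in agent A's body frame, so the straight-ahead
    component lies along A's facing direction."""
    dx, dy = pose_b[0] - pose_a[0], pose_b[1] - pose_a[1]
    total = math.hypot(dx, dy)  # total (Euclidean) distance
    # Rotate the displacement vector into A's frame.
    c, s = math.cos(-pose_a[2]), math.sin(-pose_a[2])
    straight_ahead = abs(dx * c - dy * s)  # x-component in A's frame
    lateral = abs(dx * s + dy * c)         # y-component in A's frame
    # Magnitude of the angle of B's pelvis with respect to A's pelvis.
    rel = pose_b[2] - pose_a[2]
    rel_orientation = abs(math.atan2(math.sin(rel), math.cos(rel)))
    return {"total": total, "straight_ahead": straight_ahead,
            "lateral": lateral, "rel_orientation": rel_orientation}
```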
3.3 Psychophysical Features
For example, upon first meeting, two Western American strangers often shake hands, and, in doing so, subconsciously gauge each other’s arm length; these strangers will then stand just outside of the extended arm’s reach of the other, so as to maintain a safe distance from a potential fist strike. This sensory experience characterizes “social distance” between strangers or acquaintances. As their relationship develops into a friendship, the risk of a fist strike is reduced, and they are willing to stand within an arm’s reach of one another at a “personal distance”; this is highlighted by the fact that brief physical embrace (e.g., hugging) is common at this range. However, olfactory and thermal sensations of one another are often not as desirable in a friendship, so some distance is still maintained to reduce the potential of these sensory experiences. For these sensory stimuli to become more desirable, the relationship would have to become more intimate; olfactory, thermal, and prolonged tactile interactions are characteristic of intimate interactions, and can only be experienced at close range, or “intimate distance”.
Hall’s coding schema is typically annotated by social scientists based purely on distance and orientation data observed from video. The automation of this tedious process is a major contribution of this work; to our knowledge, this is the first time that these proxemic features have been automatically extracted.
Distance Code: based on total distance; intimate (0″–18″), personal (18″–48″), social (48″–144″), public (more than 144″)
Sociofugal-Sociopetal (SFP) Axis Code: based on relative body orientation (in 20∘ intervals), with face-to-face (axis-0) representing maximum sociopetality and back-to-face (axis-8) representing maximum sociofugality [30, 32, 47]; axis-0 (0∘–20∘), axis-1 (20∘–40∘), axis-2 (40∘–60∘), axis-3 (60∘–80∘), axis-4 (80∘–100∘), axis-5 (100∘–120∘), axis-6 (120∘–140∘), axis-7 (140∘–160∘), or axis-8 (160∘–180∘)
Visual Code: based on head pose; foveal (sharp; 1.5∘ off-center), macular (clear; 6.5∘ off-center), scanning (30∘ off-center), peripheral (95∘ off-center), or no visual contact
Voice Loudness Code: based on total distance; silent (0″–6″), very soft (6″–12″), soft (12″–30″), normal (30″–78″), normal plus (78″–144″), loud (144″–228″), or very loud (more than 228″)
Kinesthetic Code: based on the distances between the hip, torso, shoulder, and arm poses; within body contact distance, just outside body contact distance, within easy touching distance with only forearm extended, just outside forearm distance (“elbow room”), within touching or grasping distance with the arms fully extended, just outside this distance, within reaching distance, or outside reaching distance
Olfaction Code: based on total distance; differentiated body odor detectable (0″–6″), undifferentiated body odor detectable (6″–12″), breath detectable (12″–18″), olfaction probably present (18″–36″), or olfaction not present
Thermal Code: based on total distance; conducted heat detected (0″–6″), radiant heat detected (6″–12″), heat probably detected (12″–21″), or heat not detected
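The distance- and orientation-based codes above reduce to simple interval lookups. The sketch below is illustrative (the function names are ours, not the authors’); the interval boundaries follow the values listed in this section, with distances in inches and orientation in degrees.

```python
def distance_code(d_inches):
    """Hall's distance code from total distance (inches)."""
    if d_inches <= 18:
        return "intimate"
    if d_inches <= 48:
        return "personal"
    if d_inches <= 144:
        return "social"
    return "public"

def voice_loudness_code(d_inches):
    """Hall's voice loudness code from total distance (inches)."""
    for bound, label in [(6, "silent"), (12, "very soft"), (30, "soft"),
                         (78, "normal"), (144, "normal plus"), (228, "loud")]:
        if d_inches <= bound:
            return label
    return "very loud"

def sfp_axis_code(rel_orientation_deg):
    """SFP axis code in 20-degree intervals: axis-0 (face-to-face)
    through axis-8 (back-to-face)."""
    return min(int(rel_orientation_deg // 20), 8)
```

The olfaction and thermal codes follow the same pattern with their own distance intervals.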
4 System Implementation and Discussion
The feature extraction system can be implemented using any human motion capture technique. We utilized the PrimeSensor structured light range sensor and the OpenNI person tracker for markerless motion capture. We chose this setup because it (1) is non-invasive to participants, (2) is readily deployable in a variety of environments (ranging from an instrumented workspace to a mobile robot), and (3) does not interfere with the interaction itself. Joint pose estimates provided by this setup were used to extract individual features, which were then used to extract the physical and psychophysical features of each interaction dyad (two individuals). We developed an error model of the sensor, which was then used to generate error models for individual, physical, and psychophysical features, discussed below.
4.1 Motion Capture System
We conducted an evaluation of the precision of the PrimeSensor distance estimates. The PrimeSensor was mounted atop a tripod at a height of 1.5 meters and pointed straight at a wall. We placed the sensor rig at 0.2-meter intervals between 0.5 meters and 2.5 meters from the wall and, at each location, took distance readings (a collection of 3-dimensional points, referred to as a “point cloud”). We used a planar model segmentation technique to eliminate points in the point cloud that did not fit onto the wall plane. We calculated the average depth reading of each point in the segmented plane, and modeled the sensor error E as a function of distance d (in meters): E(d) = k × d², with k = 0.0055. Our procedure and results are consistent with those reported in the ROS Kinect accuracy test, as well as those of other structured light and stereo camera range estimates, which often differ only in the value of k; thus, if a similar range sensor were used, system performance would scale with k.
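The fitted error model can be expressed directly; `sensor_error` is an illustrative name, and k = 0.0055 is the value fit to our PrimeSensor data (swap in a different k for a similar range sensor).

```python
K = 0.0055  # fit to PrimeSensor data; similar sensors differ mainly in k

def sensor_error(d, k=K):
    """Range error model E(d) = k * d^2, with d in meters."""
    return k * d ** 2
```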
4.2 Individual Features
Shotton et al. provide a comprehensive evaluation of individual joint pose estimates produced by the Microsoft Kinect, a structured light range sensor very similar in hardware to the PrimeSensor. The study reports an accuracy of 0.1 meters, and a mean average precision of 0.984 and 0.914 for tracked (observed) and inferred (obstructed or unobserved) joint estimates, respectively. While the underlying algorithms may differ, the performance is comparable for our purposes.
4.3 Physical Features
In our estimation of subsequent dyadic proxemic features, it is important to note that any range sensor detects the surface of the individual; thus, joint pose estimates are projected into the body by some offset. In prior work, this value was learned, with an average offset of 0.039 meters. To extract accurate physical proxemic features of the social dyad, we subtract twice this value (once for each individual) from the measured ranges to determine the surface-to-surface distance between the two bodies. A comprehensive data collection and analysis of the joint pose offset used by the OpenNI software is beyond the scope and resources of this work; instead, we refer to the comparable figures reported in the literature.
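The surface-to-surface correction described above amounts to a one-line adjustment; the 0.039 m offset is the average value cited in the text, and the function name is illustrative.

```python
JOINT_OFFSET = 0.039  # average joint-to-surface offset (meters), per cited work

def surface_to_surface(measured_distance, offset=JOINT_OFFSET):
    """Correct a pelvis-to-pelvis range to a body-surface-to-body-surface
    distance: subtract the offset once for each individual in the dyad."""
    return measured_distance - 2.0 * offset
```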
4.4 Psychophysical Features
Each feature annotation in Hall’s psychophysical representation was developed based upon values from literature on the human sensory system. It is beyond the scope of this work to evaluate whether or not a participant actually experiences the stimulus in the way specified by a particular feature interval—such evaluations come from literature cited by Hall [15, 16] when the representation was initially proposed. Rather, this work provides a theoretical error model of the psychophysical feature annotations as a function of their respective distance and orientation intervals based on the sensor error models provided above.
5 Interaction Study
We conducted an interaction study to observe and analyze human proxemic behavior in natural social encounters. The study was approved by the Institutional Review Board #UP-09-00204 at the University of Southern California. Each participant was presented with the Experimental Subject’s Bill of Rights, and was informed of the general nature of the study and the types of data that were being captured (video and audio).
The study objectives, setup, and interaction scenario (context) are summarized below. A full description of the interaction study can be found in .
The objective of this study was to demonstrate the utility of real-time annotation of proxemic features to recognize higher-order spatiotemporal behaviors in multi-person social encounters. To do this, we sought to capture proxemic behaviors signifying transitions into (initiation) and out of (termination) social interactions [10, 35]. Initiation behavior attempts to engage or recognize a potential social partner in discourse (also referred to as a “sociopetal” behavior [30, 32]). Termination behavior proposes the end of an interaction in a socially appropriate manner (also referred to as a “sociofugal” behavior [30, 32]). These behaviors are directed at a social stimulus (i.e., an object or another agent), and occur sequentially or in parallel with respect to each stimulus.
Prior to the participant entering the room, the presenter stood on floor marks X and Y for user calibration. The participant later entered the room from floor mark A, and awaited sensor calibration at floor marks B and C; note that, from all participant locations, the physical divider obstructed the participant’s view of the presenter (i.e., the participant could not see and was not aware that the presenter was in the room).
A complete description of the experimental setup and data collection systems can be found in .
As soon as the participant moved away from floor mark C and approached the robot (an initiation behavior directed at the robot), the scenario was considered to have officially begun. Once the participant verbally engaged the robot (unaware that the robot would not respond), the presenter was signaled (via laser pointer out of the field-of-view of the participant) to approach the participant from behind the divider, and attempt to enter the existing interaction between the participant and the robot (an initiation behavior directed at both the participant and the robot, often eliciting an initiation behavior from the participant directed at the presenter). Once engaged in this interaction, the dialogue between the presenter and the participant was open-ended (i.e., unscripted) and lasted 5–6 minutes. Once the interaction was over, the participant exited the room (a termination behavior directed at both the presenter and the robot); the presenter had been previously instructed to return to floor mark Y at the end of the interaction (a termination behavior directed at the robot). Once the presenter reached this destination, the scenario was considered to be complete.
A total of 18 participants were involved in the study. Joint positions recorded by the PrimeSensor were processed to extract the individual, physical, and psychophysical features discussed in Sect. 3.
The data collected from these interactions were annotated with the behavioral events initiation and termination based on each interaction dyad (i.e., behavior of one social agent A directed at another social agent B). The dataset provided 71 examples of initiation and 69 examples of termination. Two sets of features were considered for comparison: (a) Mehrabian’s physical features, capturing distance and orientation; and (b) Hall’s psychophysical features, capturing the sensory experience of each agent. Physical features included total, straight-ahead, and lateral distances, as well as the orientation of agent B with respect to agent A; similarly, psychophysical features included the SFP axis, visual, voice loudness, kinesthetic, olfaction, thermal, and touch codes (Fig. 14). All of these features were automatically extracted using the system described above.
6 Behavior Modeling and Recognition
To examine the utility of these proxemic feature representations, data collected from the pilot study were used to train an automated recognition system for detecting social events, specifically, initiation and termination behaviors (see Sect. 5.1).
6.1 Hidden Markov Models
Hidden Markov Models (HMMs) are stochastic processes for modeling nondeterministic time-sequence data. HMMs are defined by a set of states, s∈S, and a set of observations, o∈O. A transition function, P(s′|s), defines the likelihood of transitioning from state s at time t to state s′ at time t+1. Upon entering a state s, the agent receives an observation drawn from the distribution P(o|s). These models rely on the Markov assumption, which states that the conditional probability of future states and observations depends only on the current state of the agent. HMMs are commonly used for recognition of time-sequence data, such as speech, gesture, and behavior.
For multidimensional data, the observation is often factored into independent features, fi, and treated as a vector, where o=⟨f1,f2,…,fn⟩. The resulting likelihood is the product of the per-feature likelihoods, such that P(o|s)=∏i=1..nP(fi|s). For continuous features, P(fi|s) is maintained as a Gaussian distribution defined by a mean and variance.
When used for discrimination between multiple classes, a separate model, Mj, is trained for each class, j (e.g., initiation and termination behaviors). For a given observation sequence, o1,o2,…,om, the likelihood of that sequence given each model, P(o1,o2,…,om|Mj), is calculated; the sequence is then labeled (classified) with the most likely model. Baum-Welch, an expectation-maximization algorithm, is used to train the parameters of each model, determining the parameters that maximize the likelihood of the training data.
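The classification scheme above can be sketched with the forward algorithm in log space, with observations factored into independent Gaussian features. This is an illustrative sketch, not the authors’ implementation; model parameters (initial distribution, transition matrix, per-state means and variances) are assumed to come from Baum-Welch training.

```python
import numpy as np

def log_gauss(x, means, variances):
    """Per-state log-likelihood of a feature vector under diagonal
    Gaussians; P(o|s) factors into a product over independent features."""
    return -0.5 * np.sum(np.log(2 * np.pi * variances)
                         + (x - means) ** 2 / variances, axis=-1)

def log_likelihood(obs, pi, A, means, variances):
    """Forward algorithm in log space: log P(o_1..o_m | model).
    obs: (T, F) array; means, variances: (S, F); pi: (S,); A: (S, S)."""
    alpha = np.log(pi) + log_gauss(obs[0], means, variances)
    for x in obs[1:]:
        alpha = log_gauss(x, means, variances) + \
                np.logaddexp.reduce(alpha[:, None] + np.log(A), axis=0)
    return np.logaddexp.reduce(alpha)

def classify(obs, models):
    """Label a sequence with the most likely class model (e.g., an
    'initiation' model vs. a 'termination' model)."""
    return max(models, key=lambda name: log_likelihood(obs, *models[name]))
```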
6.2 Results and Analysis
The intuition behind the improvement in performance of the psychophysical representation over the physical representation is that the former embraces sensory experiences situated within the environment whereas the latter does not. Specifically, the psychophysical representation encodes the visual occlusion of the physical divider (described in Sect. 5.2) separating the participants at the beginning of the interaction. For further intuition, consider two people standing 1 meter apart, but on opposite sides of a door; the physical representation would mistakenly classify this as an adequate proxemic scenario (because it only encodes the distance), while the psychophysical representation would correctly classify it as an inadequate proxemic scenario (because the people are visually occluded). These “sensory interference” conditions are the hallmark of the continuous extension of the psychophysical representation proposed in our ongoing work.
7 Summary and Conclusions
In this work, we discussed a set of feature representations for analyzing human spatial behavior (proxemics) motivated by metrics used in the social sciences. Specifically, we considered individual, physical, and psychophysical factors that contribute to social spacing. We demonstrated the feasibility of autonomous real-time annotation of these proxemic features during a social interaction between two people and a humanoid robot in the presence of a visual obstruction (a physical barrier). We then used two different feature representations—physical and psychophysical—to train HMMs to recognize spatiotemporal behaviors that signify transitions into (initiation) and out of (termination) a social interaction. We demonstrated that the HMMs trained on psychophysical features, which encode the sensory experience of each interacting agent, outperform those trained on physical features, which only encode spatial relationships. These results suggest a more powerful representation of proxemic behavior with particular implications in autonomous socially interactive systems.
The models used by the system presented in this paper utilize heuristics based on empirical measures provided by the literature [15, 36, 44], resulting in a discretization of the parameter space. We are investigating a more continuous psychophysical representation learned from data (with a focus on voice loudness, visual, and kinesthetic factors) for the development of robust proxemic controllers for robots situated in complex interactions (e.g., with more than two agents, or with individuals with hearing or visual impairments) and environments (e.g., with loud noises, low light, or visual occlusions).
The proxemic feature extraction and behavior recognition systems are part of the Social Behavior Library in the USC Interaction Lab ROS repository.
These proxemic distances pertain to Western American culture—they are not cross-cultural.
In this implementation, head pose was used to estimate the visual code; however, as the size of each person’s face in the recorded image frames was rather small, the results from the head tracker were quite noisy. If the head pose estimation confidence was below some threshold, the system would instead rely on the shoulder pose for eye gaze estimates.
In this implementation, we utilized the 3-dimensional point cloud provided by our motion capture system for improved accuracy (see Sect. 4.1); however, we assume nothing about the implementations of others, so total distance can be used for approximations in the general case.
More formally, Hall’s touch code distinguishes between caressing and holding, feeling or caressing, extended or prolonged holding, holding, spot touching (hand peck), and accidental touching (brushing); however, automatic extraction of such forms of touching goes beyond the scope of this work.
For example, we do not measure the radiant heat or odor transmitted by one individual and the intensity at the corresponding sensory organ of the receiving individual.
This occurs for the SFP axis estimates, as well as at the public distance interval, the loud and very loud voice loudness intervals, and the outside reach kinesthetic interval.
This work is supported in part by an NSF Graduate Research Fellowship, as well as ONR MURI N00014-09-1-1031 and NSF IIS-1208500, CNS-0709296, IIS-1117279, and IIS-0803565 grants. We thank Louis-Philippe Morency for his insights in integrating his head pose estimation system and in the experimental design process, Mark Bolas and Evan Suma for their assistance in using the PrimeSensor, and Edward Kaszubski for his help in integrating the proxemic feature extraction and behavior recognition systems into the Social Behavior Library.