1 Introduction

The modern Army is quickly transforming into a highly networked force with integrated platforms that will deliver vast amounts of on-demand multimodal data. Individual soldiers will be responsible for unprecedented information management duties while ensuring personal and team situational awareness, decision-making and overall mission effectiveness. Strategies that mitigate the impact of information overload on the soldier are vital and must inform future system designs. Head-mounted displays for the dismounted soldier [1], unmanned autonomous aerial and ground sensors [2–5] and communication platforms [6] could all simultaneously push information to the soldier through smaller and lighter displays. Therefore, strategies for the ideal presentation of information to a user must continue to be an area of active research. Symbiosis of the soldier with machines is envisioned as a mutually interdependent, tightly coupled relationship that maximally exploits human and machine strengths in a seamless interface. Research communities have shown growing interest in this symbiosis due in part to recent progress in modern computing capabilities combined with the availability of ubiquitous sensing modalities for capturing information about the human user in non-laboratory conditions [7, 8]. In this paper, we highlight a few technologies that have the potential to be important components of human-machine interfaces and present new scientific opportunities.

2 Multisensory Information Processing

Our brains generate a unified percept of the world through partially redundant sensory information about an object or event. We watch movies and derive enjoyment even though we are aware that the sounds from people and objects on-screen originate from television or movie theater speakers. We readily perceive that voices in the movie are coming from the actor’s lips. This sensory illusion, the ventriloquist effect, results from our innate ability to integrate auditory and visual information, which produces a perceptual alteration of speech sound location [9]. The McGurk effect [10], another audio-visual illusion, occurs when lip movements alter the phoneme that is perceived. Sensory illusions are important tools for elucidating the neural processes underlying multisensory integration. Behavioral studies have suggested that when two sensory cues are separated by even 200 ms, the advantage of multisensory integration and the perceptual consequences of ventriloquism are greatly reduced [11]. However, multisensory cells such as those recorded in the superior colliculus [12] and cortex [13] still show integrative responses to sensory cue separations of 600 ms and longer. The relationship between the temporal dynamics of single-unit responses in the brain and behavior must be linked with multisensory neural network activity to inform multisensory information presentation and display technology.

2.1 Multisensory Displays

Dynamic and highly adverse operational environments often present scenarios where sensory information is degraded or obstructed. Multisensory cueing has been demonstrated as an effective strategy for orienting attention under non-ideal conditions [14, 15]. Multisensory cueing has also been shown to be an effective strategy for offsetting performance decrements due to stress [16]. Delivery of temporally congruent information is being actively explored for multisensory displays with combined audio-visual and other multisensory interactions for augmenting human performance [17–19]. While some studies have reported reduced effectiveness of combined sensory cues for specific tasks [20], the emerging and unified view is that cueing underused sensory streams provides an overall performance advantage [20, 21]. Human multisensory integration is suggested to rely upon correlations between converging sensory signals that result in statistically optimal input to the nervous system and behavioral outputs [22]. However, the manner in which congruent multisensory information impacts a user’s nervous system in real-world situations has yet to be fully characterized or exploited. Emerging applications for navigation, covert communication and robotic control will benefit from a further understanding of how the underlying neurophysiological mechanisms of multisensory processing relate to the statistics of behavior.

2.2 Multisensory Information Processing in the Brain

The integration of information from multiple senses was originally thought to occur in high-level processing areas in the frontal, temporal or parietal lobes [23–25]. More recent anatomical, neurophysiological and neuroimaging studies in non-human primates and functional brain studies in humans lead to the emerging view that multisensory processing involves a diversity of cortical and sub-cortical neural networks [26–28]. Based on behavioral studies with multisensory cueing, the neural coding strategy within multisensory integrative neural networks must be biased by the extent of spatial and temporal congruency of incoming sensory information [29]. Preliminary findings suggest that converging synaptic signaling by pre-cortical sensory integrating neurons of the thalamus shows augmented output to the primary auditory cortex [30]. Both excitatory and inhibitory signals are strengthened by these congruent sensory inputs, highlighting the diversity of computational modifications occurring within multisensory integrating networks [31]. Decoding multisensory neural network activity could potentially serve as feedback commands for closing the human-machine interaction loop.

The underlying neural codes of multisensory processes must be considered within the context of mathematical and theoretical models in order to best define pathways for improving multisensory interfaces. Feed-forward convergence of information from simultaneous senses (sensory organ to cortex) is accompanied by feed-back input from unisensory processing cortical areas onto lower-level multisensory integrating sites [32]. This view of multisensory processing builds upon the modality appropriateness hypothesis, which proposes that the sensory modality with the greater acuity for a particular discrimination task ultimately dominates perception in a winner-take-all competition [33]. A similar, and complementary, view is that multisensory integration obeys Bayesian probability statistics [34, 35] and most closely resembles the properties of a maximum likelihood integrator [22, 36, 37]. An alternative view is that multisensory enhancement of information processing is a result of temporal or spectral multiplexing, where, for example, spike timing information from single neurons and activity from network oscillations interact in time and lead to an enhanced multiplexed code [38]. The complexity of multisensory integration-induced modifications of the neural code requires improved signal processing approaches for decoding multiscale neural activity, combined with appropriate theoretic frameworks and mathematical modeling, to fully realize the potential of multisensory information processing for informing advanced display technologies.
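To make the maximum likelihood integrator concrete, the minimal sketch below implements the standard inverse-variance (reliability-weighted) combination of two unisensory estimates. The numbers, variable names, and the assumption that vision is the more reliable spatial cue are illustrative only; they are not data or code from the cited studies.

```python
# Minimal sketch of maximum-likelihood (inverse-variance weighted) cue
# integration, the model referenced for audio-visual fusion [22, 36, 37].
# All values below are illustrative assumptions, not data from the paper.
import numpy as np

def integrate_cues(est_a, var_a, est_v, var_v):
    """Combine auditory and visual location estimates.

    Each unisensory estimate is weighted by its reliability (inverse
    variance); the fused estimate has lower variance than either cue alone.
    """
    w_a = (1.0 / var_a) / (1.0 / var_a + 1.0 / var_v)
    w_v = 1.0 - w_a
    fused_est = w_a * est_a + w_v * est_v
    fused_var = 1.0 / (1.0 / var_a + 1.0 / var_v)
    return fused_est, fused_var

if __name__ == "__main__":
    # Vision is usually the more reliable spatial cue, so the fused percept
    # is pulled toward the visual location (the ventriloquist effect).
    est, var = integrate_cues(est_a=10.0, var_a=4.0, est_v=0.0, var_v=1.0)
    print(f"fused location: {est:.1f} deg, variance: {var:.2f}")
```

Because the fused variance is lower than either unisensory variance, this model predicts the behavioral advantage of congruent multisensory cues discussed above.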

3 Complementary Approaches

Three areas of active research are developing methods and technologies that can support multisensory information displays:

  • Brain State Awareness

  • Human Activity Monitoring

  • Direct Brain-Computer Interfaces

Together, these areas lay foundations for next-generation systems that exploit principles of human cognition to mediate ergonomically enhanced human-system interfaces that maximally augment performance. Here we review example technologies, briefly highlighting potential opportunities.

3.1 Brain State Awareness

Performing tasks under complex, dynamic, and time-pressured conditions makes it difficult to maintain operational tempo. Mental workload is a topic of increasing importance in human factors research, and significant effort has been devoted to developing innovative approaches for objectively assessing cognitive load in real time. Stress is another topic of significant importance because of its deleterious impact on the performance of the user of any display technology [14]. Strategies are sought that offer fatigue-offsetting interventions, such as selecting the best information content and format for presentation to the human operator. Data relevant for mental state detection include facial features, involuntary gestures, tactile signals, brain neural signals, and physiological signals (e.g., speech, heart rate, respiration rate, skin temperature, and perspiration). Mental states such as anxiety or fatigue often lead to temporal changes in biophysiological signals that might be classified by machine learning algorithms. For example, anxiety may result in an increased heart rate and blood pressure relative to the physiological signals of an individual’s “normal” mental state. A major challenge is the lack of precise quantitative metrics that define mental states and the difficulty of cross-subject validation.
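As a hedged illustration of this idea, the sketch below trains an off-the-shelf classifier to separate an "anxious" state from a "baseline" state using simple physiological features. The feature set, labels, and synthetic data are assumptions made for the example and do not represent a validated protocol.

```python
# Illustrative sketch only: classifying an operator's state ("anxious" vs.
# "baseline") from simple physiological features with a generic classifier.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

# Each row is one time window: [mean heart rate (bpm), respiration rate
# (breaths/min), skin temperature (deg C)]. Synthetic data stands in for
# recordings from wearable sensors.
baseline = rng.normal([70, 14, 33.5], [5, 2, 0.3], size=(100, 3))
anxious  = rng.normal([88, 19, 33.0], [7, 3, 0.4], size=(100, 3))

X = np.vstack([baseline, anxious])
y = np.array([0] * 100 + [1] * 100)  # 0 = baseline, 1 = anxious

clf = RandomForestClassifier(n_estimators=100, random_state=0)
scores = cross_val_score(clf, X, y, cv=5)
print(f"cross-validated accuracy: {scores.mean():.2f}")
```

Note that the cross-validation here mixes windows from the same (synthetic) recording; cross-subject validation, as noted above, is considerably harder.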

Stress, Anxiety, Uncertainty and Fatigue (SAUF). Recent attempts have been made to detect stress, anxiety, uncertainty and fatigue from visual and infrared images of a human face [39, 40]. An infrared image, either long-wave or mid-wave IR, captures the thermal signature of the skin. Mental states such as stress or anxiety generate subtle changes in local blood flow beneath the skin, reflected as changes in skin temperature. Thermal imagery is quite sensitive to such physiological changes, even though the changes may be invisible to the naked eye in certain groups of individuals [39, 41]. Non-invasive detection methods are highly desirable and offer a simple and affordable computer interface solution. State detection from imaging modalities allows for a passive means of detection without interfering with the operator’s normal activities or requiring operator cooperation, which could be amenable to real-world applications.

For visual/thermal video-based SAUF detection, the first step is to determine facial landmarks such as mouth corners, inner and outer eye corners, the nasal tip, and eyebrow start and end points. These landmark points are algorithmically tracked so that spatial and temporal information, called features, can be extracted from both the visual and thermal videos and subsequently used in pattern classification. Features include eye and/or mouth movement and physiological features such as the temperatures at these facial points. The data size is typically huge: frame rates for visual and thermal videos can be 30 fps or higher, and recordings spanning hours of thermal and visible video are needed for algorithm training. In [40], the authors described the development of a computer system for SAUF detection using both visual and thermal videos in real time. The system achieved detection errors in the range of 3.84 %–8.45 % for anxiety detection. In addition to algorithmic limitations, errors may also arise from view changes (resulting in face deformation), full or partial occlusion, or individual variation. While this approach may not serve as a single-source solution, non-invasive imaging provides an alternative and complementary approach for brain state detection that can accompany brain signal-based detection of mental states [42].
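A simplified sketch of the landmark-tracking portion of such a pipeline is given below, assuming OpenCV, a video file, and manually supplied initial landmark coordinates. The patch size and the use of pixel intensity as a stand-in for calibrated skin temperature are assumptions for illustration, not the method of [40].

```python
# Hedged sketch: track facial landmark points across frames and extract
# per-landmark time series (motion in a visible video; local intensity as a
# proxy for temperature in a calibrated thermal video).
import cv2
import numpy as np

def track_landmarks(video_path, init_points):
    """Track landmark points with pyramidal Lucas-Kanade optical flow and
    return their trajectories plus the local mean intensity at each point."""
    cap = cv2.VideoCapture(video_path)
    ok, frame = cap.read()
    prev_gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    pts = np.asarray(init_points, dtype=np.float32).reshape(-1, 1, 2)
    trajectories, intensities = [pts.reshape(-1, 2).copy()], []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        # Points lost to occlusion would be flagged in `status`; ignored here.
        pts, status, _ = cv2.calcOpticalFlowPyrLK(prev_gray, gray, pts, None)
        trajectories.append(pts.reshape(-1, 2).copy())
        # Mean intensity in a small patch around each landmark; in a thermal
        # video this would map to skin temperature after calibration.
        patch_means = [gray[max(int(y) - 3, 0):int(y) + 4,
                            max(int(x) - 3, 0):int(x) + 4].mean()
                       for x, y in pts.reshape(-1, 2)]
        intensities.append(patch_means)
        prev_gray = gray
    cap.release()
    return np.array(trajectories), np.array(intensities)
```

The resulting trajectories and intensity time series would then be summarized into features (e.g., movement statistics, temperature trends) for a downstream classifier.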

3.2 Human Activity Analysis and Prediction

The objective of human activity analysis and prediction is to understand the physical behavior of a human operator. Near-term activity analysis focuses on understanding what the operator is doing and predicting the operator’s intention for imminent action. Long-term activity analysis aims at recognizing an operator’s habits and traits, such as whether the operator is right- or left-handed, or characteristic keystroke patterns for identity confirmation. Intention recognition and high-level activity recognition are active research areas in artificial intelligence [43–49]. The methods for visual data analysis are general and applicable to a wide range of applications, including human-machine interfaces as well as surveillance across wide geographical regions.

Visual data contains rich information for activity analysis and understanding. An adult can recognize activities from an image or a video segment with little effort. However, visual activity analysis and understanding by a computer has proven extremely difficult. The key challenges are that spatiotemporal features in imagery or video are typically high dimensional, noisy, ambiguous, and lie on (unknown) nonlinear manifolds, and there is a lack of robust methods for detecting the underlying patterns. Human activities occur in a wide variety of contexts and at a wide range of scales. In many cases, contextual information is essential for understanding human activities but is often unavailable. Conceptually, vision-based human activity analysis and understanding consists of several components: action representation, action recognition, and activity recognition and prediction, although the boundary between action and activity may not be analytically definable.

Action Representation. Activity analysis and understanding is typically carried out in a general hierarchical framework. The low-level, atomic components are “actions” or “actionlets”, i.e., primitive motion patterns that typically last for a short duration, such as turning the head or lifting the left arm. An activity is a temporal, typically complex, composition of multiple actions. For example, “making a phone call” can be decomposed into four actions. At the low signal level, actions are characterized by spatiotemporal features and are potentially distinguishable through pattern classification of those features. A component-based hierarchical model was proposed to account for articulation and deformation of the human body due to factors such as view change or partial occlusion [48, 50, 51].
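The hierarchy described above can be captured with a small data structure, sketched below. The class names and the particular four-action decomposition of "making a phone call" are illustrative assumptions rather than the representation used in [48, 50, 51].

```python
# Minimal sketch of a hierarchical action/activity representation:
# an activity is a temporal composition of short, primitive actions.
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Action:
    label: str                     # primitive motion, e.g. "lift left arm"
    interval: Tuple[float, float]  # start / end time in seconds

@dataclass
class Activity:
    label: str
    actions: List[Action]          # temporally ordered components

# Hypothetical decomposition, invented for illustration.
make_phone_call = Activity(
    label="make a phone call",
    actions=[
        Action("reach toward phone", (0.0, 0.8)),
        Action("grasp phone",        (0.8, 1.2)),
        Action("raise hand to ear",  (1.2, 2.0)),
        Action("speak",              (2.0, 6.0)),
    ],
)
```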

Action Recognition. Motion is a critical attribute for action recognition, and spatiotemporal features can be extracted from multiple sequential frames in a video. Examples include Scale-Invariant Feature Transform (SIFT) keypoints and Histograms of Oriented Gradients (HOG) for spatial appearance, and Histograms of Optical Flow (HOF) for motion. Algorithms are used for frame registration and for landmark or object tracking in video to extract temporal motion information. The spatial features and temporal information are combined and fed into pattern analysis and classification algorithms for action recognition.
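A hedged sketch of one such feature combination is shown below, pairing a per-frame HOG descriptor with a simple histogram of dense optical-flow orientations. The resize dimensions, bin count, and Farneback parameters are assumptions chosen for illustration.

```python
# Sketch: combine a spatial descriptor (HOG) with a temporal descriptor
# (histogram of dense optical-flow orientations) for one frame pair.
import cv2
import numpy as np

hog = cv2.HOGDescriptor()  # default 64x128 detection-window geometry

def frame_features(prev_gray, gray, n_flow_bins=8):
    # Spatial appearance: HOG on the resized current frame.
    hog_vec = hog.compute(cv2.resize(gray, (64, 128))).ravel()
    # Temporal motion: histogram of dense optical-flow orientations,
    # weighted by flow magnitude (a simple histogram-of-optical-flow).
    flow = cv2.calcOpticalFlowFarneback(prev_gray, gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    mag, ang = cv2.cartToPolar(flow[..., 0], flow[..., 1])
    hof, _ = np.histogram(ang, bins=n_flow_bins, range=(0, 2 * np.pi),
                          weights=mag)
    hof = hof / (hof.sum() + 1e-8)
    return np.concatenate([hog_vec, hof])
```

Concatenated per-frame vectors like this would typically be pooled over a short clip before being passed to a classifier.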

Activity Recognition and Prediction. Activity recognition typically requires behavior modeling and high-level reasoning, which are essential for activity prediction and near real-time intention prediction. Parametric models such as Hidden Markov Models or Petri Nets, and non-parametric models such as Bayesian inference methods, require prior knowledge that is either learned from past data or manually coded. Such frameworks are flexible enough to allow the incorporation of novel action dependencies for human activities. A general framework for human activity analysis and prediction has been developed [49, 52, 53] and supplemented by a hierarchical framework that can automatically detect contextual information and incorporate it into activity understanding [54].
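To illustrate the Hidden Markov Model case, the sketch below runs a minimal Viterbi decoder over a sequence of recognized atomic actions to infer the most likely underlying activity states. The states, actions, and all probabilities are invented for the example and are not taken from [49, 52, 53].

```python
# Minimal Viterbi decoder: infer the most likely sequence of hidden
# activities from a sequence of recognized atomic actions.
import numpy as np

states = ["idle", "phone_call"]
actions = ["stand_still", "raise_hand", "speak"]

# Prior knowledge (hand-coded here; in practice learned from past data).
start_p = np.log([0.7, 0.3])
trans_p = np.log([[0.9, 0.1],    # idle -> idle / phone_call
                  [0.2, 0.8]])   # phone_call -> idle / phone_call
emit_p = np.log([[0.8, 0.15, 0.05],   # P(action | idle)
                 [0.1, 0.40, 0.50]])  # P(action | phone_call)

def viterbi(obs):
    """Return the most probable hidden-state path for observed action indices."""
    T, N = len(obs), len(states)
    dp = np.full((T, N), -np.inf)
    back = np.zeros((T, N), dtype=int)
    dp[0] = start_p + emit_p[:, obs[0]]
    for t in range(1, T):
        for j in range(N):
            scores = dp[t - 1] + trans_p[:, j]
            back[t, j] = np.argmax(scores)
            dp[t, j] = scores[back[t, j]] + emit_p[j, obs[t]]
    path = [int(np.argmax(dp[-1]))]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return [states[i] for i in reversed(path)]

print(viterbi([actions.index(a) for a in
               ["stand_still", "raise_hand", "speak", "speak"]]))
```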

3.3 Direct Brain-Computer Interface

Machines and humans, unfortunately, do not have an inherent common language for engaging in the human-computer interaction loop. In order for the human in the loop to derive maximal benefit from the interface, the computational framework on the other end must be able to accurately determine user intent in real-world settings, including when the user is under duress and placed into a dynamic physiological and/or neural state. Software specifications like those used in Controlled Natural Languages may provide a possible solution [55, 56]. However, these methods have mainly been tested for simple interfaces, and complex operational environments will require other complementary solutions.

Brain-Computer Interface Methods. Brain-computer interfaces permit direct communication of user intent to machine interfaces. The general framework for open-loop brain-computer interface system control originates from the detection of brain activity related to user intent. Electroencephalography (EEG), electrocorticography (ECoG) and intracortical (single-unit) recording configurations are some of the technologies currently in use for brain-computer interfaces. Other sensing modalities include magnetoencephalography (MEG), positron emission tomography (PET), functional magnetic resonance imaging (fMRI) and functional near-infrared spectroscopy (fNIRS). Together, these modalities are complementary in their information attributes, spatial-temporal resolution and degree of invasiveness. For example, EEG provides high temporal but low spatial resolution, while fMRI provides low temporal but high spatial resolution; ECoG is a semi-invasive technique and intracortical recordings are invasive. Following analog-to-digital conversion, advanced signal processing and machine learning algorithms can be deployed to classify neural activity and derive user intent or state.
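The sketch below illustrates a generic version of this decoding step, extracting band-limited power features from epoched trials and classifying them with a linear discriminant. The sampling rate, band edges, channel count, and synthetic trials are assumptions and do not reflect the BCI2000-based system described next.

```python
# Hedged sketch of a generic neural decoding pipeline: per-trial band-power
# features followed by a linear classifier.
import numpy as np
from scipy.signal import welch
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score

FS = 256  # assumed sampling rate (Hz)
BANDS = {"theta": (4, 8), "alpha": (8, 13), "beta": (13, 30), "gamma": (30, 80)}

def band_power_features(trial):
    """trial: (n_channels, n_samples) array -> flat band-power feature vector."""
    freqs, psd = welch(trial, fs=FS, nperseg=FS, axis=-1)
    feats = []
    for lo, hi in BANDS.values():
        mask = (freqs >= lo) & (freqs < hi)
        feats.append(np.log(psd[:, mask].mean(axis=-1)))  # per-channel log power
    return np.concatenate(feats)

# Synthetic stand-in for epoched trials: 120 trials, 16 channels, 2 s each.
rng = np.random.default_rng(0)
trials = rng.standard_normal((120, 16, 2 * FS))
labels = rng.integers(0, 2, size=120)  # e.g., two imagined phonemes

X = np.array([band_power_features(t) for t in trials])
print(cross_val_score(LinearDiscriminantAnalysis(), X, labels, cv=5).mean())
```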

Detection of Silent Speech. A recent effort attempted to develop a brain-based communication and orientation system using EEG and ECoG signals [57, 58]. The objective was to create signal processing methods that allow detection of imagined speech for communication and determination of directional attention for orientation from brain signals. One key challenge was a limited understanding of how imagined speech relates to overt-speech brain function. To be successful, this study also had to overcome the limited understanding of interactions among networked neurons in speech-processing pathways, the difficulty of determining a baseline for imagined speech, and noise in the neural recordings. Based upon the existing real-time software system BCI2000, algorithms were generated that are capable of extracting electrophysiological features on a single-trial basis. Relative to a chance accuracy of 25 %, ECoG-based decoding showed overall performance levels of roughly 40 % for detection of vowels and consonants during both overt and covert speech [57, 58]. The results indicate a higher-than-chance likelihood of correctly decoding imagined consonants and vowels.

For detecting attention and orientation, the setup is similar to that for imagined speech detection. Each subject was presented with visual cues and stimuli on a computer screen with a built-in eye tracker, which verified ocular fixation on a central cross during data acquisition. The system achieved average detection accuracies of 84.5 % for attention engagement and 48.0 % for attention locus [59, 60] from ECoG data. While this line of work has only achieved recognition of phonemes, a multisensory information processing approach may be taken to improve algorithm performance. Communication inherently involves multisensory processes, which may be exploited to elucidate a new regime of neural network activity that might drive classification schemes of future brain-computer interfaces. Exploration of this idea may offer an opportunity to advance research into fundamental mechanisms of the neural processing of speech and to close the loop in brain-computer interface design to facilitate performance for applications like covert communication and device control.

4 Vision for Future Multisensory Information Displays

Advances in functional neuroimaging combined with signal processing capabilities have led to new opportunities to identify spatial and temporal features of neural processing during real-world experimentation [7, 8]. Research on human-machine interfaces has also considered methods for combining physiological data (e.g., respiration rate, heart rate, blood pressure and temperature) and behavioral information (e.g., posture, eye movements, gesture, and visual/thermal facial expression). The large amount of neural real estate devoted to multisensory processes and the diversity of signaling mechanisms available open new opportunities for human-machine interfaces. Signal processing and data analytic advances can be devoted to decoding information related to this complex signaling and its modification as a result of presenting sensory information through multisensory displays. Brain-computer interface research has largely focused on presenting information to one of a user’s senses while decoding brain activity with open-loop pattern classification, e.g., recording electroencephalography while the user watches a visual display. This research has demonstrated utility in direct brain-computer communication for simple choices like user control of a cursor on a screen, but state-of-the-art pattern classification algorithms show only limited performance for complex tasks such as decoding intended speech. Recent advances point to an emerging opportunity for a paradigm shift. To understand how simultaneous information presentation modifies behavioral responses, we need to determine where and how information from different senses is combined in the brain and what neural computational advantages these processes confer.

4.1 A Lesson from Sensory Deprivation

Sensory deprivation can lead to improvements in the perceptual abilities of the intact senses in blind or deaf individuals. For example, individuals with early-onset blindness show improved temporal and spectral frequency discrimination when compared to those with late-onset blindness or those who are sighted [61]. Early-blind individuals have also been shown to have enhanced sound localization ability relative to sighted individuals [62]. Surprisingly, in the study of Lessard et al., a group of blind subjects who had maintained some level of residual peripheral vision showed degraded sound localization ability relative to the completely blind. Together, these observations highlight the complicated mechanisms mediating multisensory processing when information is missing or corrupted in one sensory stream. This may be relevant to situations in which only degraded sensory information is available, in a high attentional-load operational environment, to a person with full sensory capabilities. A more recent study showed that depriving normally sighted mice of light for as little as two days was enough to elicit potentiation of specific pre-cortical inputs from the thalamus to the auditory [30] or somatosensory cortices [63]. More work is needed in this area, but the underlying neurophysiological mechanisms that mediate responses to sensory deprivation not caused by disease or injury may be relevant and provide inspiration for novel neuroplasticity-based approaches to advanced human-machine interface capabilities and augmented cognition.

5 Conclusion

The state-of-the-art view of multisensory displays has shown advantages of multisensory stimulation and has highlighted the need to understand the underlying neural bases mediating cueing-induced behavioral improvements. Developments in sensor technologies that yield higher-resolution multimodal data are an enabling tool but pose significant computational challenges. However, statistical modeling approaches and advancing computational analysis capabilities are providing new methodologies that make neural information available for direct human-computer interaction. There is a fundamental need to study human cognitive behavior under real-world conditions, and multisensory information displays offer a unique capability to engage humans while they perform outside the laboratory.

State-of-the-art advances have not fully realized the vision of closed-loop human-machine symbiosis, but they have paved the way for more sophisticated theories and technologies that will enable its attainment. Here we have described example technologies that provide emerging opportunities to exploit advances in understanding the principles governing neural processing of information from simultaneous sensory streams in order to create systems that interface with the human in intuitive and, potentially, seamless ways. Multisensory displays show great potential to support future soldier-machine technologies, and future designs should be grounded in data and theory from basic cognitive neuroscience and neurophysiology. The future military operational environment will be more complex and will require more from the human operator as she interacts with soldier systems. To take full advantage of the scientific opportunities presented by multisensory information processing, a deep understanding of how the human brain, body, and sensory systems work in concert to accomplish tasks is required to close the loop in human-systems interactions.

6 Disclaimer

The views and opinions contained in this paper are those of the authors and should not be construed as an official Department of the Army position, policy, or decision.