Research Report

Audio-visual integration of emotion expression
Introduction
Human beings must be able to understand the emotions of others in order to engage in successful social interactions. Affect perception, like speech perception, is a situation in which combining information expressed by the face and the voice of the interlocutor optimises event identification. However, despite the fact that our ability to integrate these two sources into a unified percept could be a determinant of successful social behaviour, the perception of affective states has typically been investigated using one modality at a time.
Recently, a few studies explored the multisensory nature of affective expressions (for review see Campanella and Belin, 2007). They indicated that congruency between facial expression and affective prosody facilitates behavioural reactions to emotional stimuli (Dolan et al., 2001, Massaro and Egan, 1996), and that information obtained via one sense can alter information processing in another (de Gelder and Vroomen, 2000, Ethofer et al., 2006a, Massaro and Egan, 1996). Such cross-modal biases occurred even when participants were instructed to base their judgement on just one of the modalities (de Gelder and Vroomen, 2000, Ethofer et al., 2006a), supporting the notion that the processes underlying integration of facial and vocal affective information are automatic.
With only a few exceptions (de Gelder et al., 1999, Kreifelts et al., 2007), studies on bimodal perception of emotional expressions were conducted using static faces as stimuli. However, neuroimaging studies have revealed that brain regions known to be implicated in the processing of facial affect, such as the posterior superior temporal sulcus (pSTS), the amygdala and the insula, respond more to dynamic than to static emotional expressions (e.g., Haxby et al., 2000, LaBar et al., 2003, Kilts et al., 2003). Most importantly, authors have reported cases of neurologically affected individuals who were incapable of recognizing static facial expressions but could recognize dynamic expressions (Humphreys et al., 1993, Adolphs et al., 2003). Research dealing with the recognition of real-life facial expressions should therefore use dynamic stimuli, because (1) dynamic facial expressions are what we encounter in everyday life and (2) dynamic and static facial expressions are processed differently. This issue is of particular interest in the investigation of audio-visual emotion processing, where pairing dynamic prosody variation with still pictures yields material of very low ecological relevance. Although integration effects have undoubtedly been observed for voices paired with static faces (de Gelder and Vroomen, 2000), such integrative processing should be much stronger when dynamic faces are used (Campanella and Belin, 2007, Ghazanfar et al., 2005, Schweinberger et al., 2007, Sugihara et al., 2006). For example, a recent study on person identification provided compelling evidence that time-synchronized articulating faces influenced the identification of familiar voices more strongly than static faces did (Schweinberger et al., 2007).
Another clear illustration of this point comes from studies of audio-visual speech perception, and in particular the McGurk effect, where clips of moving faces, but not still photographs, influence speech perception (McGurk and MacDonald, 1976, Campanella and Belin, 2007). A further limitation of the aforementioned studies on bimodal emotion perception is that the auditory affective material consisted of speech prosody (words, sentences) spoken in various emotional tones, so that the affective tone of speech (emotional prosody) could interact with any affective value carried by its semantic content (Scherer et al., 1984).
The present study thus attempts to assess the multisensory nature of the perception of affect expressions using ecologically relevant material that approximates real-life conditions of social communication. To do so, we used newly standardized and validated sets of dynamic visual (Simon et al., 2008) and nonverbal vocal (Belin et al., in press) clips of emotional expressions (Fig. 1). In Experiment 1, subjects were required to discriminate between fear and disgust expressions displayed auditorily, visually or audio-visually, in a congruent (the same expression in the two modalities) or incongruent way (different expressions in the two modalities). This method allows us to investigate whether the presentation of bimodal congruent stimuli improves the subject's performance and which modality dominates in a conflicting situation. Since we observed a visual dominance in the perception of multisensory affects, we also included a condition in which the reliability of the visual stimuli was decreased to challenge this dominance. To test whether multisensory interaction in the processing of affective expression is a mandatory process, we conducted a second experiment with the same stimuli as those used in the first, but with the explicit instruction to focus attention on only one sensory modality at a time. If multisensory interaction of affective information is an automatic process, it should take place even if the participant's attention is focused on only one modality (de Gelder and Vroomen, 2000, Massaro and Egan, 1996). Because the influence of a concurrent signal increases in situations where the reliability of a sensory channel is reduced (Ross et al., 2007), such as face perception in the dark or voice recognition in a noisy environment, the reliability of the visual and the auditory signals was manipulated.
The originality of this study resides in the use of highly ecological sets of stimuli in two experiments (the first with unconstrained and the second with constrained focus of attention) where the reliability of the sensory targets was individually challenged in order to shed light on the mechanisms at play in the multisensory processing of affect expression.
Experiment 1
Correct discriminations (Fig. 2) were analysed by submitting Inverse Efficiency (IE) scores (see Data analysis section) to a 2 (Noises: Noisy or Noiseless) × 2 (Emotions: Fear or Disgust) × 3 (Stimuli: Visual, Auditory or Bimodal Congruent) repeated measures ANOVA. As expected, we obtained a main effect of the factor “Noises” (F = 16, p ≤ .001) showing better performance with noiseless than with noisy stimuli. Of great interest for the present study, we also obtained a main effect of the factor
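The Inverse Efficiency score referred to above combines speed and accuracy into a single measure: mean reaction time on correct trials divided by the proportion of correct responses, with lower values indicating better performance. A minimal sketch of that computation, using hypothetical trial data (the function name and values are illustrative, not taken from the study):

```python
def inverse_efficiency(rts_ms, correct):
    """Inverse Efficiency: mean correct-trial RT (ms) / proportion correct.

    rts_ms:  reaction times for each trial, in milliseconds
    correct: matching list of booleans (True = correct response)
    """
    correct_rts = [rt for rt, ok in zip(rts_ms, correct) if ok]
    accuracy = sum(correct) / len(correct)          # proportion correct
    mean_rt = sum(correct_rts) / len(correct_rts)   # mean RT, correct trials only
    return mean_rt / accuracy                        # lower = better

# Hypothetical condition: 4 trials, 3 correct (75% accuracy)
rts = [520, 480, 610, 550]
hits = [True, True, False, True]
ie = inverse_efficiency(rts, hits)  # (1550/3) / 0.75 ≈ 688.9
```

One IE score per participant and condition would then be entered into the 2 × 2 × 3 repeated measures ANOVA described above.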
Discussion
In the present study, participants were required to discriminate between "fear" and "disgust" emotion expressions displayed either auditorily, visually, or audio-visually via short dynamic facial and non-linguistic vocal clips. Our results provide compelling evidence for the multisensory nature of emotion processing and further extend our understanding of the mechanisms at play in the integration of audio-visual expression of affect.
In Experiment 1, when participants were instructed to process
Participants
Sixteen paid volunteers participated in Experiment 1 (8 females; mean age 26, S.D. 9; all right-handed). The same number of subjects participated in Experiment 2 (8 females; mean age 25, S.D. 10; all right-handed with the exception of 1 left-handed female). Four subjects took part in the two experiments. All participants were without any recorded history of neurological or psychiatric problems, reported normal hearing and normal or corrected-to-normal vision and did not use psychotropic
Acknowledgments
We thank Stephane Denis for his help with the experimental setup. OC is a postdoctoral researcher at the Belgian National Funds for Scientific Research (F.R.S.-FNRS). This work was supported by the FRSQ Rehabilitation network (REPAR to OC), the Canada Research Chair Program (ML, FL) and the Natural Sciences and Engineering Research Council of Canada (ML, FL, FG).
References

- et al. (2003). Dissociable neural systems for recognizing emotions. Brain Cogn.
- et al. (2007). Integrating face and voice in person perception. Trends Cogn. Sci.
- et al. (1999). The combined perception of emotion from voice and face: early interaction revealed by human electric brain responses. Neurosci. Lett.
- et al. (2003). Audio-visual integration in schizophrenia. Schizophr. Res.
- et al. (2007). Is Alzheimer's disease a disconnection syndrome? Evidence from a crossmodal audio-visual illusory experiment. Neuropsychologia.
- et al. (2004). Merging the senses into a robust percept. Trends Cogn. Sci.
- et al. (2006). Investigating audiovisual integration of emotional signals in the human brain. Prog. Brain Res.
- et al. (2000). The distributed human neural system for face perception. Trends Cogn. Sci.
- et al. (1993). Expression is computed separately from facial identity, and it is computed separately for moving and static faces: neuropsychological evidence. Neuropsychologia.
- et al. (2003). Dissociable neural pathways are involved in the recognition of emotion in static and dynamic facial expressions. Neuroimage.
- Audiovisual integration of emotional signals in voice and face: an event-related fMRI study. Neuroimage.
- Divided attention: evidence for coactivation with redundant signals. Cognit. Psychol.
- Recognition and discrimination of prototypical dynamic expressions of pain and emotions. Pain.
- The psychophysics toolbox. Spatial Vision.
- Neuropsychology of fear and loathing. Nat. Rev. Neurosci.
- The perception of emotions by ear and by eye. Cogn. Emot.
- Fear recognition in the voice is modulated by unconsciously recognized facial expressions but not by unconsciously recognized affective pictures. Proc. Natl. Acad. Sci. U. S. A.
- Unconscious fear influences emotional awareness of faces and voices. Proc. Natl. Acad. Sci. U. S. A.
- Crossmodal binding of fear in voice and face. Proc. Natl. Acad. Sci. U. S. A.