Brain Research

Volume 1242, 25 November 2008, Pages 126-135

Research Report
Audio-visual integration of emotion expression

https://doi.org/10.1016/j.brainres.2008.04.023

Abstract

Despite the fact that emotions are usually recognized by combining facial and vocal expressions, the multisensory nature of affect perception has scarcely been investigated. In the present study, we report the results of two experiments on the multisensory perception of emotions using newly validated sets of dynamic visual and non-linguistic vocal clips of affect expressions. In Experiment 1, participants were required to categorise fear and disgust expressions displayed auditorily, visually, or using congruent or incongruent audio-visual stimuli. Results showed faster and more accurate categorisation in the bimodal congruent situation than in the unimodal conditions. In the incongruent situation, participants preferentially categorised the affective expression based on the visual modality, demonstrating a visual dominance in emotional processing. However, when the reliability of the visual stimuli was diminished, participants categorised incongruent bimodal stimuli preferentially via the auditory modality. These results demonstrate that visual dominance in affect perception does not occur in a rigid manner, but follows flexible, situation-dependent rules. In Experiment 2, we asked participants to attend to only one sensory modality at a time in order to test the putative mandatory nature of multisensory affective interactions. We observed that even when they were asked to ignore concurrent sensory information, the irrelevant information significantly affected the processing of the target. This effect was especially pronounced when the target modality was less reliable. Altogether, these findings indicate that the perception of emotion expressions is a robust multisensory process that follows rules previously observed in other perceptual domains.

Introduction

Human beings must be able to understand the emotions of others in order to engage in successful social interactions. Affect perception, like speech perception, is a situation in which combining the information conveyed by the face and the voice of the interlocutor optimises event identification. However, despite the fact that our ability to integrate these two sources into a unified percept could be a determinant of successful social behaviour, the perception of affective states has typically been investigated using one modality at a time.

Recently, a few studies have explored the multisensory nature of affective expressions (for review see Campanella and Belin, 2007). They indicated that congruency between facial expression and affective prosody facilitates behavioural reactions to emotional stimuli (Dolan et al., 2001, Massaro and Egan, 1996), and that information obtained via one sense can alter information processing in another (de Gelder and Vroomen, 2000, Ethofer et al., 2006a, Massaro and Egan, 1996). Such cross-modal biases occurred even when participants were instructed to base their judgement on just one of the modalities (de Gelder and Vroomen, 2000, Ethofer et al., 2006a), supporting the notion that the processes underlying the integration of facial and vocal affective information are automatic.

With only a few exceptions (de Gelder et al., 1999, Kreifelts et al., 2007), studies on bimodal perception of emotional expressions were conducted using static faces as stimuli. However, neuroimaging studies have revealed that the brain regions known to be implicated in the processing of facial affect, such as the posterior superior temporal sulcus (pSTS), the amygdala and the insula, respond more to dynamic than to static emotional expressions (e.g., Haxby et al., 2000, LaBar et al., 2003, Kilts et al., 2003). Most importantly, authors have reported cases of neurologically affected individuals who were incapable of recognizing static facial expressions but could recognize dynamic ones (Humphreys et al., 1993, Adolphs et al., 2003). It is therefore more appropriate, in research dealing with the recognition of real-life facial expressions, to use dynamic stimuli, because (1) dynamic facial expressions are what is encountered in everyday life and (2) dynamic and static facial expressions are processed differently. This issue is of particular interest for the investigation of audio-visual emotion processing, where pairing dynamic prosody variations with still pictures yields material of low ecological relevance. Although integration effects have undoubtedly been observed for voices paired with static faces (de Gelder and Vroomen, 2000), such integrative processing is expected to be much stronger when dynamic faces are used (Campanella and Belin, 2007, Ghazanfar et al., 2005, Schweinberger et al., 2007, Sugihara et al., 2006). For example, a recent study on person identification provided compelling evidence that time-synchronized articulating faces influenced the identification of familiar voices more strongly than static faces did (Schweinberger et al., 2007). Another clear illustration of this point comes from studies of audio-visual speech perception, and in particular the McGurk effect, where clips of faces in movement, but not still photographs, influence speech perception (McGurk and MacDonald, 1976, Campanella and Belin, 2007). A further limitation of the aforementioned studies on bimodal emotion perception is that the auditory affective material consisted of speech prosody (words, sentences) spoken with various emotional tones, leaving open the possibility that the affective tone of speech (emotional prosody) interacted with the affective value carried by its semantic content (Scherer et al., 1984).

The present study thus attempts to assess the multisensory nature of the perception of affect expressions using ecologically relevant material that approximates real-life conditions of social communication. To do so, we used newly standardized and validated sets of dynamic visual (Simon et al., 2008) and nonverbal vocal (Belin et al., in press) clips of emotional expressions (Fig. 1). In Experiment 1, subjects were required to discriminate between fear and disgust expressions displayed auditorily, visually or audio-visually, in a congruent (the same expression in the two modalities) or incongruent (different expressions in the two modalities) way. This method allows us to investigate whether the presentation of bimodal congruent stimuli improves the subjects' performance and which modality dominates in a conflicting situation. Since we observed a visual dominance in the perception of multisensory affects, we also included a condition in which the reliability of the visual stimuli was decreased in order to challenge this dominance. To test whether multisensory interaction in the processing of affective expressions is a mandatory process, we conducted a second experiment with the same stimuli as those used in the first, but with the explicit instruction to focus attention on only one sensory modality at a time. If multisensory interaction of affective information is an automatic process, it should take place even if the participant's attention is focused on only one modality (de Gelder and Vroomen, 2000, Massaro and Egan, 1996). Because the influence of a concurrent signal increases in situations where the reliability of a sensory channel is reduced (Ross et al., 2007), such as face perception in the dark or voice recognition in a noisy environment, the reliability of the visual and the auditory signals was manipulated.

The originality of this study resides in the use of highly ecological sets of stimuli in two experiments (the first with unconstrained and the second with constrained focus of attention) in which the reliability of the sensory targets was individually challenged in order to shed light on the mechanisms at play in the multisensory processing of affect expression.

Section snippets

Experiment 1

Correct discriminations (Fig. 2) were analysed by submitting Inverse Efficiency (IE) scores (see Data analysis section) to a 2 (Noises: Noisy or Noiseless) × 2 (Emotions: Fear or Disgust) × 3 (Stimuli: Visual, Auditory or Bimodal Congruent) repeated measures ANOVA. As expected, we obtained a main effect of the factor “Noises” (F = 16, p < .001), showing better performance with noiseless than with noisy stimuli. Of great interest for the present study, we also obtained a main effect of the factor
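
The Inverse Efficiency (IE) score divides the mean reaction time of correct trials by the proportion of correct responses in a given condition, so that lower values reflect faster and/or more accurate performance. Below is a minimal sketch, in Python with pandas and statsmodels, of how such scores could be derived and submitted to the 2 × 2 × 3 repeated measures ANOVA described above; the input file and column names are illustrative assumptions, not the authors' actual analysis pipeline.

```python
import pandas as pd
from statsmodels.stats.anova import AnovaRM

# Hypothetical trial-level data (one row per trial); filename and column
# names are assumptions for illustration. Expected columns:
#   subject, noise ('noisy'/'noiseless'), emotion ('fear'/'disgust'),
#   stimulus ('visual'/'auditory'/'bimodal'), rt (s), correct (1/0).
trials = pd.read_csv("experiment1_trials.csv")

def inverse_efficiency(cell):
    """IE = mean RT of correct trials / proportion correct.
    Lower scores indicate better (faster and/or more accurate) performance."""
    correct_rt = cell.loc[cell["correct"] == 1, "rt"].mean()
    return correct_rt / cell["correct"].mean()

# One IE score per participant and design cell (2 Noise x 2 Emotion x 3 Stimulus).
ie_scores = (trials
             .groupby(["subject", "noise", "emotion", "stimulus"])
             .apply(inverse_efficiency)
             .reset_index(name="IE"))

# Repeated measures ANOVA with Noise, Emotion and Stimulus as within-subject factors.
anova = AnovaRM(ie_scores, depvar="IE", subject="subject",
                within=["noise", "emotion", "stimulus"]).fit()
print(anova)
```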

Discussion

In the present study, participants were required to discriminate between “fear” and “disgust” emotion expressions displayed either auditorily, visually, or audio-visually via short dynamic facial and non-linguistic vocal clips. Our results provide compelling evidence for the multisensory nature of emotion processing and extend further our comprehension of the mechanisms at play in the integration of audio-visual expression of affect.

In Experiment 1, when participants were instructed to process

Participants

Sixteen paid volunteers participated in Experiment 1 (8 females; mean age 26, S.D. 9; all right-handed). The same number of subjects participated in Experiment 2 (8 females; mean age 25, S.D. 10; all right-handed with the exception of 1 left-handed female). Four subjects took part in both experiments. No participant had a recorded history of neurological or psychiatric problems; all reported normal hearing and normal or corrected-to-normal vision and did not use psychotropic

Acknowledgments

We thank Stephane Denis for his help with the experimental setup. OC is a postdoctoral researcher at the Belgian National Fund for Scientific Research (F.R.S.-FNRS). This work was supported by the FRSQ Rehabilitation network (REPAR to OC), the Canada Research Chair Program (ML, FL) and the Natural Sciences and Engineering Research Council of Canada (ML, FL, FG).

References (41)

  • Kreifelts, B., et al. Audiovisual integration of emotional signals in voice and face: an event-related fMRI study. Neuroimage (2007).
  • Miller, J. Divided attention: evidence for coactivation with redundant signals. Cognit. Psychol. (1982).
  • Simon, D., et al. Recognition and discrimination of prototypical dynamic expressions of pain and emotions. Pain (2008).
  • Belin, P., Fillion-Bilodeau, S., Gosselin, F., in press. The “Montreal Affective Voices”: a validated set of nonverbal...
  • Brainard, D.H. The psychophysics toolbox. Spatial Vision (1997).
  • Calder, A.J., et al. Neuropsychology of fear and loathing. Nat. Rev. Neurosci. (2001).
  • de Gelder, B., et al. The perception of emotions by ear and by eye. Cogn. Emot. (2000).
  • de Gelder, B., et al. Fear recognition in the voice is modulated by unconsciously recognized facial expressions but not by unconsciously recognized affective pictures. Proc. Natl. Acad. Sci. U. S. A. (2002).
  • de Gelder, B., et al. Unconscious fear influences emotional awareness of faces and voices. Proc. Natl. Acad. Sci. U. S. A. (2005).
  • Dolan, R.J., et al. Crossmodal binding of fear in voice and face. Proc. Natl. Acad. Sci. U. S. A. (2001).