Background
Humans constantly use dialog to exchange information. We talk to each other at home, at work or on the phone with the goal of communicating information which we believe to be new or important for others. Most people never recognize that the verbal exchanges between them are arranged in a highly structured way. Yet, when people talk, they usually connect their current utterances with preceding expressions of themselves or others. The philologist Hermann Paul noticed this inherent connection between the questions that people ask and the tonal realization of subsequent answers as early as the 19th century [1].
Nowadays, the term 'information structure' is used to designate that connected utterances are composed of so-called 'information units' [2] which can be larger than a single syllable or word. In a simplified view, these units can either consist of 'novelties' or comprise previously 'given', and thus known, facts. In general, the term 'focus' is used to refer to information centers which are currently novel for listeners or contrast with previous assertions of dialog partners (interlocutors). In contrast, information which listeners have already encountered earlier in a discourse is referred to as non-focused or given information.
The proportions of focused and non-focused information within a discourse change constantly. While interlocutors speak about a certain theme, they take alternating turns as speakers and listeners. These alternating turns force interlocutors to persistently reconsider which part of the information is already shared between them and their conversation partners (common knowledge) and which part conveys novelties and/or contrastive assertions. Shared information between interlocutors can then provide the context for forthcoming utterances [3]. However, information that is not yet shared must somehow be highlighted, i.e. focused, to enter the conversational common ground of the interlocutors.
German, as the language under investigation here, provides several linguistic opportunities to realize an information focus in written and/or spoken language (focus position = bold word). These can first be syntactic means accompanied by word order changes (Jeff likes chocolate. → It is chocolate that Jeff likes.) Moreover, an information focus can be induced by semantic-pragmatic requirements, e.g. by wh-words (Who likes chocolate? → Jeff /What does Jeff like? → Chocolate). In spoken language only, a focus can also be overtly highlighted by prosody or accentuation, respectively (Jeff likes chocolate.).
The study at hand serves to explore the interplay of semantic-pragmatic and prosodic factors (i.e. accentuation) in processing the focused information in dialogs. Event-related potentials (ERPs) were utilized to particularly investigate on-line interactions between the semantic-pragmatic and the prosodic focusing device. For this purpose, the electrophysiological consequences of perceiving matching and non-matching associations of pragmatic focus and the (prosodic) focus accentuation during spoken dialog comprehension were compared (see Table 1).
Table 1
Examples of the dialogs with a focus on 'Anna' in the target sentence (F3) or non-focused given information (G3).
Condition GG | Condition FF |
G1: Am Samstag hat Peter mir etwas versprochen. (Peter promised me something on Saturday.) | F1: Am Samstag hat Peter mir etwas versprochen. (Peter promised me something on Saturday.) |
G2: Hat er dir versprochen, Anna zu entlasten? (Did he promise you to support Anna?) | F2: Hat er dir versprochen, Frauke zu entlasten? (Did he promise you to support Frauke?) |
G3: Er hat mir versprochen, [Anna]G zu entlasten und die Küche zu putzen. (He promised me to support Anna and to clean the kitchen.) | F3: Er hat mir versprochen, [ANNA]F zu entlasten und die Küche zu putzen. (He promised me to support Anna and to clean the kitchen.) |
Condition FG | Condition GF |
F1: Am Samstag hat Peter mir etwas versprochen. | G1: Am Samstag hat Peter mir etwas versprochen. |
F2: Hat er dir versprochen, Frauke zu entlasten? | G2: Hat er dir versprochen, Anna zu entlasten? |
G3: Er hat mir versprochen, [Anna]G zu entlasten und die Küche zu putzen. | F3: Er hat mir versprochen, [ANNA]F zu entlasten und die Küche zu putzen. |
For the sake of clarity, we will henceforth refer to the contextually driven pragmatic information centers simply as 'focus'. The actual prosodic realization of these information centers will be referred to as '(focus) accentuation'. Yet, a focus does not only bear consequences for the accentuation of the information center. Rather, words preceding or following the focus position are also influenced in their prosodic properties, which enhances the prominence of the focus with respect to surrounding sentence elements (see [4] for a longer linguistics-based discussion).
Prior studies on behavioral responses during dialog processing have shown that semantic-pragmatically focused information is recognized faster and more easily when it is accented [5, 6]. Moreover, focused information which is not accented is hardly acceptable for listeners, while the superfluous accentuation of non-focused information is more readily accepted [7, 8]. In single sentence processing, the influence of accent positions on sentence interpretation has been studied as well. A study by Price, Ostendorf, Shattuck-Hufnagel and Fong [9] reported only a minor influence of accent positions on the disambiguation of syntactic structures, but a substantial influence of the positions of major prosodic phrase boundaries. On the other hand, two other studies report robust effects of accent positions on the syntactic disambiguation of sentences [10, 11].
With respect to ERP responses to the processing of prosodic and pragmatic information, findings are not straightforward. In single sentence processing, a positive-going ERP is often found when listeners perceive major prosodic boundaries [12-15]. Major prosodic boundaries signal the closure of intonational phrases within sentences. These boundaries manifest in tonal movements on the last syllables preceding the edges, a lengthening of the prefinal boundary syllable and an optional pause [16]. The ERP deflections to these boundaries display a latency of approx. 500 ms and a centro-parietal scalp distribution. Due to its eliciting factor (i.e. major prosodic phrase boundaries), the ERP has been termed Closure Positive Shift (CPS) and is interpreted as an on-line marker for speech segmentation.
However, when listeners process utterances beyond single sentences (i.e. in dialogs), the CPS reveals diverging elicitation factors. Hruska and coworkers [17, 18] conducted a study on the processing of dialogs in German. They presented listeners with context questions comprising either the wh-pronoun 'who' or 'what'. The 'who'-question induced a novelty focus on a noun, while the 'what'-question gave rise to a focus on a verb in a target sentence. In order to determine whether the elicitation of the CPS predominantly relies on the pragmatic aspects of the dialog (i.e. the contextually assigned focus positions) or on the actual prosodic realization (i.e. the focus accentuation), Hruska et al. included an additional manipulation in their design. The questions including the pronoun 'who' (inducing a noun focus) were followed by target sentences comprising either the matching (noun) or the non-matching (verb) accentuation. In addition, questions including the pronoun 'what' (inducing a verb focus) were followed by either the matching (verb) or the non-matching (noun) accentuation.
Most importantly, the results show that when listeners are presented with contextually embedded sentences (i.e. dialogs), the CPS is not elicited by perceiving major prosodic boundaries as it is during context-free single sentence processing [12]. When the context-induced focus position and the accent position in the target sentence were identical, the CPS was elicited at this focused and accented position ('who'-question → CPS to the noun; 'what'-question → CPS to the verb). Yet, when focus and accent position were incongruent, the ERP outcomes were less clear-cut. When a 'what'-question (inducing a verb focus) was followed by a target sentence with noun accentuation, a CPS was elicited in accordance with the accent position (i.e. the noun), which was not the focus position. Moreover, the missing accent in the focus position (i.e. verb) elicited an N400. In contrast, the combination of the 'who'-question (inducing a noun focus) with a target sentence conveying a verb accent induced a CPS and a biphasic N400-P600 pattern in correspondence to the focused noun, which was not accented. Thus, the results of Hruska et al. do not unequivocally determine whether the CPS in dialogs indexes the perception of a contextually promoted focus, of a focus accent, or of both.
The N400 effects which were consistently caused by missing accents on focused sentence constituents are proposed to reflect semantic integration difficulties. In particular, they were attributed to listeners' expectation of an accent when encountering a focus position which was not marked by accentuation. Moreover, the occurrence of a P600 is suggested to signal the revision of a dialog's information structure due to inconsistencies between the pragmatic (focus) and the prosodic structure (i.e. accentuation). Critically, the data of the study are only displayed and statistically evaluated from the absolute sentence onsets. Although the additionally provided acoustic analyses allow for a loose mapping of the critical sentence positions (i.e. noun and verb) onto the ERP effects, the exact time course of the evoked responses remains ambiguous.
The processing of focused information in the visual domain has also been found to yield a positive-going ERP. Bornkessel, Schlesewsky and Friederici [19] employed word order scrambling which resulted in syntactic focus positions. The perception of these focused elements elicited a posterior parietal positivity with a latency of 280–480 ms, which was termed 'focus positivity'. The evoking conditions, latency, and scalp distribution of the visual 'focus positivity' are similar to those of the CPS found in auditory dialog processing [17, 18]. As written language does not convey overt prosodic features, the data provide a hint as to the independence of the positive-going 'focus CPS' from the actual accentuation of a focus.
With respect to the electrophysiological consequences of inadequate accentuation, various effects have been reported previously. Heim and Alter [20] report a frontal P200 to unexpected sentence-initial accents and an N400 for sentence-medial accents in German. Further, Mietz, Toepel, Ischebeck and Alter [21] discussed a still earlier centro-parietal negativity (EN) peaking at 120 ms to unexpected sentence-medial accentuation in German. For Japanese, Ito and Garnsey [22] find a posterior positivity between 250–500 ms for missing sentence-initial focus accents but a later fronto-temporal negativity for missing sentence-medial accents. Furthermore, Magne et al. [23] discuss a sustained centro-posterior positivity between 300–1000 ms for 'pop-out' accents in medial and final sentence positions in French.
To date, the ERP data on the impact of pragmatic and prosodic aspects on utterance processing at and beyond the single sentence level are still far from consistent. Yet, a line of ERP research concerned with intra- and extrasentential context effects on semantic processing, as reflected in particular by the N400 component, reveals close correspondence between both kinds of contextual influences [24, 25]. Evidence from the N400 component indicates that intra-sentential contextual requirements and extra-sentential semantic preconditions (e.g. constraints on semantic interpretation introduced by a preceding context or by world knowledge) take effect at a comparable speed and strength, and are possibly subserved by identical neural networks [26].
The current study thus aims at determining the eliciting factors of the CPS in the context-bound processing of dialogs. Furthermore, we will explore potential influences of inappropriate prosody on dialog perception. In particular, a link is to be drawn between the CPS as an on-line marker for utterance segmentation in context-free (i.e. single sentences → CPS at major prosodic boundaries) vs. context-bound (i.e. dialogs → CPS to focus/accent positions) speech processing.
In line with prior research [17, 18], we propose that the online speech segmentation processes for context-free and context-embedded utterances manifest in a similar ERP component, namely the CPS. Yet, the events which elicit the CPS in sentences and dialogs seem to differ. We suggest that this difference arises from a rather eclectic and economic strategy of listeners to use the most relevant cues for utterance structuring (leaving aside here the lower-level phonological, semantic and syntactic cues that are indispensable for interpretation). In single sentences, speech segmentation by means of prosodic boundaries can help to prevent misunderstandings as in the sentence 'When you learn gradually you worry more.' [9]. In larger discourse and dialogs, however, prosodic boundaries are not as informative as in single sentences [27, 28]. Rather than recognizing the syntax of an utterance, it is more important to determine its informational content, i.e. the information centers. As mentioned before, these information foci can be indicated by context-driven semantic-pragmatic means as well as by accentuation.
We created three-sentence dialogs to explore the interplay of pragmatic and prosodic factors in discourse processing (see Table 1). In particular, dialogs were constructed in which the last (target) sentence either comprised a 'novelty' expressed by the corrected assumption of the interlocutor in noun position (i.e. focused information; condition FF) or only previously mentioned 'given' information (i.e. non-focused; condition GG). Since the dialogs were spoken in a collaborative setting between two speakers, they were naturally accompanied by a corresponding focus or no-focus accentuation (see section on prosodic properties for details).
We propose in general that the conversational contexts (i.e. questions posed by speakers and their prosody) influence listeners' expectations of a focus position in the target utterance. We predict that when contextual cues indicate a focus, listeners use the anticipated focus position to structure the dialog. Accordingly, when utterances bear a noun focus with its corresponding accentuation (condition FF), the CPS should be elicited at this focused and accented noun position. When the dialog does not, however, point to the existence of a focus (condition GG), listeners are expected to structure the target utterance by means of the internal major prosodic boundaries. The CPS should then be apparent when listeners perceive the major prosodic boundaries, as shown for the perception of context-free single sentences [17].
In addition and to specifically disentangle contextually-pragmatically and prosodically driven ERP effects in dialog perception, a further manipulation entered the study design. First, we combined the contexts which give rise to a 'focus' position in the target sentence with the prosodic realization of non-focused 'given' information (condition FG). Second, the contexts which render all information in the target sentence as non-focused 'given' were combined with the target sentences incorporating the accentuation of a 'focus' (condition GF).
As mentioned beforehand, prior ERP data could not unequivocally determine whether information structural conflicts are resolved by listeners in favor of the contextually triggered pragmatic focus structure or the actual accentuation. In turn, our hypotheses on the perceptual outcomes of such a conflict have to be two-fold.
On the one hand, if listeners process the dialogs with an inadequate accentuation by primarily regarding contextual-pragmatic cues, a CPS would be expected to the target sentence noun if it bears a focus (condition FG). The latencies of the CPS in conditions FG and FF would then coincide, since the target sentences of both conditions are preceded by the same contextual information. In the opposite case, where the target sentence noun only conveys non-focused given information (condition GF), a CPS would be expected at the noun-preceding major prosodic phrase boundary due to the lack of a focus position. If this assumption is valid, the CPS timing in the context-identical conditions GF and GG should be similar.
On the other hand, if listeners structure the dialog targets conveying inappropriate accentuation patterns by predominantly relying on the misleading prosody, a CPS should be induced by the noun focus accent in condition GF. The CPS latencies in conditions GF and FF would then coincide, since the target sentences of both conditions bear the same accentuation pattern. Yet, when the target sentences bear the prosody of non-focused given information (condition FG), listeners should exhibit a CPS to the perception of the noun-preceding major prosodic boundary due to the absence of a (focus) accent. Thus, the CPS pattern should then be similar in conditions FG and GG, since both conditions convey prosodically identical target sentences.
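The two sets of predictions can be summarized in a short Python sketch. It is only an illustrative restatement of the hypotheses above and not part of the study's materials or analysis; the function names and the labels "context", "prosody", "noun" and "boundary" are assumptions introduced here. The first letter of a condition label codes the context (F = focus, G = given), the second letter the accentuation of the target sentence.

```python
from itertools import product

# "F" = focus, "G" = given; used for both the context and the target accentuation.
LEVELS = ("F", "G")
ACCOUNTS = ("context", "prosody")  # hypothetical labels for the two hypotheses


def predicted_cps_trigger(context, accent, account):
    """Return the event expected to elicit the CPS under one account.

    account="context": listeners rely on the focus set up by the dialog context.
    account="prosody": listeners rely on the accentuation actually heard.
    """
    if account not in ACCOUNTS:
        raise ValueError(f"unknown account: {account!r}")
    cue = context if account == "context" else accent
    # A focus cue predicts a CPS to the noun; otherwise the CPS is expected
    # at the noun-preceding major prosodic boundary.
    return "noun" if cue == "F" else "boundary"


def prediction_table():
    """Map each condition label (context + accent) to both predictions."""
    table = {}
    for context, accent in product(LEVELS, repeat=2):
        table[context + accent] = {
            account: predicted_cps_trigger(context, accent, account)
            for account in ACCOUNTS
        }
    return table


if __name__ == "__main__":
    for condition, row in prediction_table().items():
        print(condition, row)
```

Under both accounts the matching conditions FF and GG receive the same prediction; only the mismatch conditions FG and GF distinguish between the two hypotheses.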
Discussion
The present study was conducted to investigate the electrophysiological responses to spoken dialog perception. In particular, the study aimed at delineating the influence of contextual-pragmatic and prosodic information on the structuring of quasi-natural connected speech. For this purpose, listeners were presented with dialogs containing focused contrastive (conditions FF and FG) vs. non-focused given information in the target utterances (conditions GG and GF). Moreover, the dialogs comprised either an adequate (conditions FF and GG) or an inadequate accentuation (conditions FG and GF) with respect to the semantic-pragmatic focus.
The behavioral results indicate that listeners are not always certain which prosody should accompany a certain information structure. In fact, the judgment task seems to be easier when the target sentences convey focused information, irrespective of whether it is realized with an appropriate (condition FF) or an inappropriate accentuation (condition FG). These outcomes resemble prior behavioral results on listeners' identification of mis-realized focus accentuation [7, 31]. However, the evaluation of accentuation patterns is much easier for listeners when the 'under'-accentuation of focused information is encountered (condition FG) than when given information is 'over'-accentuated (condition GF).
In general, listeners seem less aware of the accentuation which accompanies non-focused given information (conditions GG and GF). With respect to condition GF, this finding is again in line with prior findings [5, 7, 31]. Alternatively, participants' inaccuracy in condition GG could also be attributed to an additional facet of communication. It is rather unusual for interlocutors to repeat statements just made by someone else (apart from showing surprise, which would then require a particular intonation). Rather, speakers signal approval of an interlocutor's statement by uttering 'Yes' or 'That's right'. We assume that the behavioral responses to conditions GG and GF are at least partly attributable to the violation of cooperation principles in conversation [32], as information from the context is completely repeated in the target sentences.
The ERP data in general show a centro-posterior positive deflection for all conditions. However, the positive shift varies in onset latency. As apparent from Figures 3 and 4, the positivities diverge neither as a function of the prosodic realization of the target utterances nor as a function of the ERP average onset. Figure 3 only comprises ERPs to the focus accentuation variant of the dialogs, and Figure 4 only the responses to the accentuation of non-focused given information. Moreover, the difference between conditions cannot be ascribed to the actual ERP average chosen (B: verb syllable onset; C: noun average starting from verb offset). The onset latency of the positive shifts does, however, differ as a function of the contexts preceding the target sentences. An effect of Context in the time window from 0–500 ms of the noun average statistically corroborates the descriptive difference.
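A time-window comparison of this kind is commonly based on the mean amplitude of each average within the window. The following Python sketch illustrates that measure under stated assumptions: the sampling rate, the toy epoch and the function name `window_mean` are hypothetical, and nothing here is taken from the study's actual analysis pipeline.

```python
def window_mean(samples_uv, sfreq_hz, start_ms, end_ms, epoch_start_ms=0.0):
    """Mean amplitude of the samples whose time stamps lie in [start_ms, end_ms).

    samples_uv      amplitudes of one averaged epoch (microvolts)
    sfreq_hz        sampling rate in Hz (assumed value, not from the study)
    epoch_start_ms  time stamp of the first sample relative to the trigger
    """
    step_ms = 1000.0 / sfreq_hz
    selected = [
        value
        for index, value in enumerate(samples_uv)
        if start_ms <= epoch_start_ms + index * step_ms < end_ms
    ]
    if not selected:
        raise ValueError("time window contains no samples")
    return sum(selected) / len(selected)


# Toy epoch sampled at 100 Hz: 1 microvolt for 0-490 ms, 3 microvolts for 500-990 ms.
toy_epoch_uv = [1.0] * 50 + [3.0] * 50
early = window_mean(toy_epoch_uv, 100.0, 0.0, 500.0)
late = window_mean(toy_epoch_uv, 100.0, 500.0, 1000.0)
print(early, late)
```

Window means computed per participant and condition would then enter the statistical comparison; the half-open interval is a design choice that keeps adjacent windows from sharing a sample.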
As further apparent from Figures 5 and 6, the conditions with identical contexts preceding the target sentences result in similar latencies of the positive-going ERP component. In those target sentences which are preceded by a 'focus' question (conditions FF and FG), the positivity starts ~300 ms later than in those targets which are preceded by a 'no focus' question. While conditions GG and GF both lack a noun focus in the target, conditions FF and FG comprise such a focus position.
Thus, the information structural interplay of context and target sentences seems to direct the structuring of spoken dialogs. In particular, the focus structure of a dialog predominantly influences the interpretation of the dialogic information irrespective of the actual accentuation of the target.
In both 'no focus' conditions (GG and GF) the centro-posterior positivity appears with a temporal lag of ~500 ms to the onset of the prosodic boundary on the verb. Thus, it strongly resembles the Closure Positive Shift (CPS) known from single sentence processing [12, 15] as a marker of online speech structuring.
Crucially, listeners make use of the prosodic boundary cues for utterance structuring before encountering the noun position when processing condition GG and GF. This strategy can only be attributed to listeners' exploitation of the context cues, namely the question intonation. In both 'no focus' conditions (GG and GF) the contexts are accompanied by default question prosodies. We propose that these contextual cues together with the acoustic event of a prosodic boundary in the target sentence then lead listeners towards the utilization of the boundary for utterance structuring. In turn, a CPS is elicited when listeners perceive the prosodic boundary on the verb.
In both 'focus' conditions (FF and FG), on the other hand, the positive shift is apparent with a latency of ~500 ms after the focused noun has been encountered, i.e. ~300 ms later than in the 'no focus' conditions. Due to the scalp topography, the latency and the morphology of the positive-going ERP in conditions FF and FG, we also interpret the deflection as a CPS. In contrast to the 'no focus' conditions GG and GF, however, the CPS in the 'focus' conditions (FF and FG) is induced by the processing of the noun focus in the target sentences.
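The timing relation described here can be made explicit with a small worked example in Python. It is a sketch under stated assumptions: the event onsets are illustrative values chosen so that the noun follows the boundary by 300 ms, which reproduces the approximate difference reported in the text, and the names `CPS_LAG_MS` and `expected_cps_onset` are introduced here for illustration only.

```python
CPS_LAG_MS = 500  # approximate lag between eliciting event and CPS, as reported


def expected_cps_onset(context, boundary_onset_ms, noun_onset_ms):
    """Expected CPS onset if the dialog context alone determines the eliciting event.

    context "F" (focus):  the CPS is time-locked to the focused noun.
    context "G" (given):  the CPS is time-locked to the prosodic boundary.
    """
    if context not in ("F", "G"):
        raise ValueError(f"unknown context: {context!r}")
    anchor_ms = noun_onset_ms if context == "F" else boundary_onset_ms
    return anchor_ms + CPS_LAG_MS


# Illustrative onsets (assumptions): boundary at 0 ms, noun 300 ms later.
no_focus_onset = expected_cps_onset("G", boundary_onset_ms=0, noun_onset_ms=300)
focus_onset = expected_cps_onset("F", boundary_onset_ms=0, noun_onset_ms=300)
difference_ms = focus_onset - no_focus_onset
print(no_focus_onset, focus_onset, difference_ms)
```

Because the accentuation of the target does not enter the computation, the sketch yields identical onsets for FF and FG and for GG and GF, which mirrors the pattern described for the data.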
In accordance with prior research [18], our findings suggest that the structuring of spoken language manifests in a similar ERP component, the Closure Positive Shift (CPS). Compared with context-free sentence presentation, however, the events which induce the component during the perception of context-embedded utterances (i.e. dialogs) differ. When conversational contexts lead listeners to anticipate an information center or focus, they use the focus position to structure the utterance. On the other hand, when the context of an utterance does not guide listeners towards the expectation of an information center, they use major prosodic boundaries for structuring, as they also do by default when perceiving context-free single sentences [12, 14, 15].
Further, our results show that the latency of the CPS does not vary as a function of the appropriateness of the accentuation of the target sentence with respect to a preceding context. Under the 'no focus' context, similar latencies are found for the conditions with the appropriate accentuation (GG) and the inappropriate accentuation (GF); under the 'focus' context, similar CPS timing is likewise apparent for the conditions with the appropriate accentuation (FF) and the inappropriate accentuation (FG). Thus, our study complements the results of Hruska and coworkers [18] in showing that the elicitation of the CPS during dialog perception predominantly relies on contextual factors irrespective of the actual accentuation of a dialog's target.
Apart from the contextual dependence of the CPS in dialog perception, there is some indication that listeners can also perceive a contextually inadequate accentuation. However, this effect is only present when a dialog context signals a focus in the target sentence which is then prosodically realized as non-focused given information (condition FG, see Figure 4A). The inappropriate 'under'-accentuation then evokes a sustained centro-posterior negative deflection (NEG) which is statistically reliable from 1100–1600 ms after the onset of the target sentence.
According to the prosodic analyses, the descriptive onset of the negativity precedes the onset of the focused but unaccented noun. Thus, listeners must be readily able to exploit the subtle prosodic cues conveyed by the sentence-initial fragment ('He promised me'). The prosodic inadequacy of the target sentence emerges further and reaches statistical significance when the absence of the focus accent on the noun is detected.
Similar negative deflections in dialog comprehension have been reported by Hruska et al. [17, 18] for German, and Magne et al. [23] for French. These negativities were interpreted as N400 responses due to integration problems of focused but unaccented words into the information structure of a dialog. Within the current design with quasi-natural connected speech, however, the onset of the negative ERP component can hardly be tied to one discrete element of the target sentence. Apparently, the negative deflection in our study also does not impede the context-bound occurrence of the CPS. In terms of scalp topography and eliciting factors, the negativity (NEG) for condition FG in our study coincides with the previously reported N400 for missing focus accents [17, 23]. Moreover, it resembles N400 reports on discourse-bound semantic processing [24, 25]. With our present materials and design, however, we cannot make unequivocal assertions as to the timing of the NEG component.
In addition to the negativity elicited during the processing of condition FG, the effect of Prosody in the time window from 1000–1500 ms of the overall ANOVA (verb offset average) also indicates an additional process accompanying the CPS. Although the ERPs in condition FG show a more pronounced positive deflection than in condition GG (cf. Figure 4C), no effect of Condition was found within this time window from 1000–1500 ms. Such a P600-like effect could further corroborate the interpretation of the negativity in our study as an N400. In particular, it might indicate that a missing focus accent not only causes meaning-related integration problems but also hinders the constitution of the information structure of a dialog. Further exploration of this issue would, however, require experimental manipulations at the cost of the naturalness of the dialog situations. Either the utterance-internal prosodies would have to be made artificially identical, or only the processing of sentence-initial focus positions could be explored. Moreover, previous studies [20, 22] have shown that the perceptual consequences of sentence-initial vs. sentence-medial accents are hardly comparable, even manifesting in reversed polarities in the ERPs.
Authors' contributions
UT designed and analyzed the materials for the experiment, programmed it, collected a majority of the data, analyzed the data, and drafted the initial paper. AP provided support on the statistics and for manuscript preparation and revision. KA, the head of the research project, contributed to the initial conception of the study and provided consultation for the experiment and manuscript preparation. All authors have read and approved the final manuscript.