Analytic Versus Synthetic Listening
When a harmonic in a complex tone was replaced by a non-informative noise band in the context trials, pitch judgements in the test trials followed the direction of the frequency shift of that harmonic significantly less often than when a different harmonic was replaced by a non-informative noise band in the context trials. This is consistent with the notion that (i) the replacement of a harmonic by a non-informative noise band in the context trials reduced the contribution of that harmonic to the residue pitch in the test trials and (ii) listeners gave smaller weights to a harmonic or frequency region when they had learned that this frequency region did not provide reliable information for the task. According to this view, the present finding is broadly similar to effects of reliability of information demonstrated in the visual domain, e.g., on the use of different depth cues.
Next, we consider an alternative interpretation. Throughout the experiment, listeners were asked to judge the overall pitch of the sounds and to ignore individual components that might pop out, i.e., listeners were encouraged to use a holistic rather than an analytical listening strategy. While we argue below that it is unlikely that listeners used an analytical listening strategy, we first consider whether, in principle, analytic listening could account for the present results. If, in test trials, listeners compared the individual mistuned component with the corresponding harmonic in the first interval, would performance be expected to deteriorate when that harmonic was replaced by a noise band in the context trials? The answer is not obvious. If the noise band in the target frequency region popped out and led to increased segregation of the mistuned harmonic in the test trials, one might expect performance to improve rather than worsen. On the other hand, if subjects tried to focus on the frequency region of the mistuned harmonic, the change from a mistuned tonal component to a noise might be distracting and might make it harder to tune into the specific target pitch.
The question of the use of analytical versus holistic listening strategies when judging the pitch of a complex tone with a mistuned component is not specific to the “discrimination paradigm” used here but has been discussed before in the context of pitch-matching paradigms (Gockel et al.
2005b,
2009). However, in the studies of Gockel et al. (
2005b,
2009) and Darwin and Ciocca (
1992), the mistuned component was presented asynchronously with the remainder of the complex and therefore could always be heard as a separate tone, while in the present study, all components started and stopped together, thus reducing the cues for perceptual segregation. To rule out the possibility that listeners used an analytical listening strategy and compared the individual mistuned component with the corresponding harmonic in the first interval, ideally the complex tones presented in the two intervals would have had no harmonics in common. However, Moore and Glasberg (
1990) showed that the differences in timbre between complex tones with no harmonics in common markedly impaired pitch discrimination. Therefore, this approach would have been unsuitable for the present research question, as pitch shifts due to mistuning of a single harmonic are small in comparison to the difference limens for F0 measured for complex tones with no harmonics in common (Moore et al.
1985; Moore and Glasberg
1990). Instead, the present experiment was designed so as to prevent the perceptual segregation of the target component from the remaining components and thus make it less likely that the mistuned component was compared with the corresponding harmonic component. To achieve this, first, in the context trials, listeners needed to “tune in” to the F0 to perform the task optimally, and reliable and informative cues for the task were conveyed only by those harmonics that were not replaced by noise bands. As noted in
the Results section, the superior performance in the context trials compared to the test trials suggests that subjects were at least combining information across harmonics in the context trials, even though the data do not prove that by doing so they were calculating a residue pitch. Second, in the test trials, the first interval always contained the harmonic complex tone; the mistuned harmonic that might pop out to a certain degree (Moore et al.
1986) only appeared in the second interval. Third, the amount of mistuning employed was small. Moore et al. (
1986) measured thresholds for hearing out an individual mistuned harmonic as a separate tone from the remainder of the complex. For a 420-ms 200-Hz F0 complex tone, the amount of mistuning required was on average 1.3 and 1.8 % for the third and the fourth harmonics, respectively. For shorter durations, thresholds increased. In the present experiment, the tone duration was 200 ms and the amount of mistuning employed (see Table
1) in most cases was below the thresholds determined by Moore et al. (
1986). Also note that listeners 6 and 7, for whom the amount of mistuning employed here was smallest (0.3–0.6 %), showed the same effect of context reliability as was visible in the mean results.
One additional possible factor arises from evidence for the existence of frequency shift detectors (“FSDs”) in the auditory system (Demany and Ramos
2005), together with the suggestion that these FSDs may lead to the phenomenon of “frequency enhancement” (FE) (Erviti et al.
2011; Demany et al.
2013). Evidence for FSDs comes from a paradigm where, following an inharmonic complex tone, listeners are presented with a probe tone, which, in one task, can be slightly (typically one semitone) higher or lower than one of the components in the complex; subjects are required to report whether this shift is up or down. Demany and Ramos (
2005) reported that performance in this task was better than in another task, where the probe tone frequency could equal that of one component in the complex (“present”) or fell mid-way between two components (“absent”). They attributed this to FSDs that were most sensitive to shifts of about one semitone (Demany et al.
2009) and concluded that subjects could hear a shift in the pitch of a component that was not heard in the complex and was not heard retrospectively when the probe was presented. In our test trials, one of the harmonics (third or fourth) differed slightly between the first and second intervals, and so this shift may have been detected by an FSD. Subsequently, Erviti et al. (
2011) argued that detection of shifts via FSDs could lead to FE, as demonstrated in an experiment where a “test” complex was preceded by a precursor that was identical to the test complex or differed from it (only) in a one-semitone shift in one of its components. A probe tone, presented after the test, had a frequency equal to the possibly shifted component or mid-way between two adjacent components, and subjects reported whether the probe was present in the test complex. Erviti et al. (
2011) reported that performance was better when the frequency shift was present rather than absent and concluded that the frequency shift “enhanced” the auditory representation of that component in the test complex, thereby improving performance on the present/absent task. This is relevant to our paradigm because if this improved performance reflects increased segregation of that component, then segregation of the mistuned component in the second interval of our test trials may have been increased by the complex presented in the first interval. This in turn may have allowed subjects to “hear out” the mistuned component with smaller mistunings than in the Moore et al. (
1986) study, where the mistuned harmonic complex was not preceded by a harmonic precursor.
Although we cannot completely rule out any influence of FSDs and FEs, several factors reduce the likelihood that they can account for our results. First, the stimulus parameters used to study FSDs and FE are different from those used here. FSDs are most sensitive to shifts of about 7 %, and the smallest shift for which they have been studied is 3 % (Demany et al.
2009)—larger than any mistuning used in our experiment (see Table
1). FSDs and FE have both been studied only using inharmonic complexes, and we do not know how they would be affected by the perceptual fusion that occurs between harmonics (and near-harmonics) of a common fundamental frequency. Second, because FSDs do not lead to the “conscious perception” of the shifted component in the first sound presented (Demany and Ramos
2005), even if an FSD caused the mistuned component to “pop out” in the second interval of a test trial, it would not do so in the first interval. Given that subjects were instructed to focus on the residue pitch and that attempts to match the segregated component to the pitch of a fused harmonic complex in the first interval would be unsuccessful, it seems unlikely that subjects would—at least initially—adopt this strategy. Hence, for this type of segregation to influence performance, it must be sufficiently salient to have a knock-on effect on subsequent test trials, perhaps by alerting subjects to the possibility of a segregated component and encouraging them to adopt an analytic listening strategy. This would in turn have to survive the presence of intervening context trials on which optimal performance should arise by combining information across harmonics and where the harmonicity and common onset and offset of the components should promote fusion. Indeed, the only physical aspect of the context stimuli that might promote segregation was that the “unreliable” harmonic (as well as three others) was replaced by a noise. As argued above, segregation of that harmonic would likely improve rather than degrade performance. Overall, then, we believe our results are most consistent with the idea that the context effects observed here are—at least partly—due to a reduction of the contribution of a given harmonic to the residue pitch when that harmonic is experienced as non-informative in context trials.
Context Effects in Hearing
The results reported here suggest that the relative contribution of different harmonics to the perception of pitch is not fully hard-wired but is—to a certain degree—plastic. While the effect of context reported here is small, it is consistent with previous reports showing that the relative contribution of individual harmonics to residue pitch depends on duration (Gockel et al.
2005a,
2007). The relative precision of the internal representation of the frequencies of different harmonics changes with duration, and it appears that each harmonic is weighted according to the precision of that internal representation.
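This account amounts to inverse-variance (precision) weighting, the standard rule in models of optimal cue combination. The following sketch is purely illustrative; the shift and sigma values are hypothetical, not estimates from our data:

```python
def residue_pitch_shift(component_shifts, sigmas):
    """Combine per-harmonic frequency-shift estimates into a single
    residue-pitch shift, weighting each harmonic by the precision
    (inverse variance) of its internal frequency representation."""
    weights = [1.0 / s ** 2 for s in sigmas]
    weighted_sum = sum(w * x for w, x in zip(weights, component_shifts))
    return weighted_sum / sum(weights)

# Illustrative case: two harmonics, the first carrying a 1% upward
# shift but represented half as precisely (sigma twice as large);
# it therefore contributes only one fifth of the combined estimate.
print(residue_pitch_shift([1.0, 0.0], [2.0, 1.0]))  # 0.2
```

On this view, changing the tone duration alters the sigmas, and hence the relative weights, without any change to the combination rule itself.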
Other context effects in relation to pitch perception have been reported. Chambers and Pressnitzer (
2014) used Shepard tones (Shepard
1964) separated by a tritone and showed that the perception of the direction of a pitch change from one complex tone to the next could be strongly influenced by the recent history of tones heard. Houtgast (
1976) showed that a single harmonic could give rise to the perception of a subharmonic low pitch when it was preceded by a complex tone of similar pitch with many harmonics and both were presented in a noise background at a low signal-to-noise ratio (but see Burns and Houtsma
1999). Effects of pitch priming on the salience of pitch have also been reported. Presentation of a tone with a salient pitch indicating “what to listen for” can improve the perceptual representation and/or the discrimination of high-pass filtered iterated rippled noises whose pitch is weak (Butler and Trainor
2013). Also, when discriminating between the frequencies of a very short or a noise-like tone and a longer tone with a more salient pitch, performance is better when the longer tone with the more salient pitch is presented first rather than second in the sequence (Demany et al.
2016).
In
the Introduction, we mentioned that, in the auditory domain, flexibility of cue weighting has been studied mainly in the context of speech perception and phonetic categories. One exception is the study of Holt and Lotto (
2006). They investigated the effect of short-term experience on relative cue weighting in relation to the formation of non-speech categories. They trained listeners to categorize sounds as belonging to one or the other of two previously unknown categories (buttons). The sounds were frequency-modulated sinusoids that differed in center frequency (CF) and modulation frequency (MF). One category contained tones with CFs that were, on average, lower than the CFs for the other category and with MFs that were, on average, higher than those for the other category. The tones were equally spaced in terms of just-noticeable differences in both the CF and the MF dimensions. Listeners heard one tone at a time and learned category labels through visual feedback (a light). After training (less than 2 h), listeners had to categorize novel stimuli from the same two-dimensional (CF, MF) space (without feedback). Listeners gave more weight to the CF dimension than to the MF dimension when both dimensions were equally informative with regard to category membership. Decreasing the difference between the means of the two CF ranges, leading to more overlap of the two categories on the CF dimension, but keeping other statistics characterizing the distributions constant, did not change the relative weighting of the CF and MF dimensions by the listeners. However, when the variability of the CF was increased and that of the MF decreased, listeners gave more weight to the MF than to the CF dimension. Therefore, even when cues were equally informative and discriminable, they were not weighted equally; listeners had a clear “bias” towards the CF cue. Nevertheless, a change in weighting strategy could be produced by changes in the distribution of the input parameters. It should be noted that the combination of higher CFs with lower MFs and lower CFs with higher MFs might be considered somewhat unnatural, as it would rarely be encountered.
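The variance manipulation has a simple ideal-observer reading: the informativeness of each dimension scales with the separation of the category means divided by the within-category standard deviation, a d′-like index. The statistics below are hypothetical, chosen only to mirror the qualitative design; note that the listeners' initial bias towards CF is precisely what this index alone does not predict:

```python
def separability(mu_a, mu_b, sigma):
    """Category separability along one stimulus dimension: distance
    between the two category means in units of the common
    within-category standard deviation (a d'-like index)."""
    return abs(mu_a - mu_b) / sigma

# Hypothetical statistics under which CF and MF are equally
# informative; an ideal observer would weight them equally.
print(separability(800.0, 1000.0, 100.0))  # CF: 2.0
print(separability(40.0, 60.0, 10.0))      # MF: 2.0

# Increasing the CF variance while decreasing the MF variance makes
# MF the more informative dimension, the condition under which
# listeners shifted weight from CF to MF.
print(separability(800.0, 1000.0, 400.0))  # CF: 0.5
print(separability(40.0, 60.0, 5.0))       # MF: 4.0
```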
Effects of the reliability of a single cue on (speech) categorization have also been demonstrated. Clayards et al. (
2008) investigated the effect of the reliability of voice onset time (VOT), which is the primary acoustic cue for voicing of word initial stop consonants (Francis et al.
2008). Listeners identified isolated words as either (i) “beach” or “peach,” (ii) “beak” or “peak,” and (iii) “bees” or “peas.” For each pair, a continuum of VOT values was generated. Short VOTs correspond to words such as “beach” and long VOTs to words such as “peach.” For each word in each pair, the VOT values were drawn from a Gaussian distribution with a mean corresponding to the prototypical value for that word in American English (e.g., 0 and 50 ms for “beach” and “peach,” respectively). Stimuli were synthesized using the Klatt synthesizer (Klatt
1980) with all parameters except VOT held constant for each pair and modeled on natural stimuli. The independent variable was the variance of the two VOT distributions. One group of listeners was presented with exemplars of words generated using a small variance for the two VOT distributions, giving high reliability of the VOT cue, and the other group was presented with exemplars generated using a large variance, giving low reliability of the VOT cue. The probability of categorizing a word as, for example, “peach” as a function of VOT (i.e., the categorization function) was affected by the variance of the two VOT distributions; the slope of the categorization function was shallower for the listeners presented with VOTs from the two wider distributions than for the group presented with VOTs from the two narrower distributions. Thus, listeners were sensitive to the entire probability distribution of the VOT, and the categorization of words with given VOTs depended on the distribution of previously experienced VOTs, i.e., on the experienced reliability of the VOTs.
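The dependence of the slope on variance is what a simple Bayesian observer predicts: with equal-variance Gaussian VOT distributions and equal priors, the posterior probability of the long-VOT word is a logistic function of VOT with slope (mu_p - mu_b)/sigma^2, so a larger variance yields a shallower categorization function. The sketch below uses the means quoted above (0 and 50 ms); the standard deviations are illustrative, not those of Clayards et al. (2008):

```python
import math

def p_long_vot(vot, mu_b=0.0, mu_p=50.0, sigma=8.0):
    """Posterior probability of the long-VOT word ("peach") given a VOT,
    for equal-variance Gaussian category distributions and equal priors.
    The Gaussian likelihood ratio reduces to a logistic in VOT with
    slope (mu_p - mu_b) / sigma**2 and midpoint (mu_b + mu_p) / 2."""
    slope = (mu_p - mu_b) / sigma ** 2
    midpoint = 0.5 * (mu_b + mu_p)
    return 1.0 / (1.0 + math.exp(-slope * (vot - midpoint)))

# At a VOT of 30 ms, the low-variance observer is near-certain of
# "peach", while the high-variance observer is much less so: the
# categorization function is shallower when the cue is less reliable.
print(p_long_vot(30.0, sigma=8.0))
print(p_long_vot(30.0, sigma=16.0))
```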
There is also a long history of research into changes in the perception of a single cue or characteristic of a target sound, for example perceived laterality, following either an adaptor with a fixed parameter value of that same cue or an adaptor with variable and changing parameter values. For example, Dahmen et al. (
2010) investigated how auditory spatial processing adapts to stimulus statistics by presenting noise sequences with rapidly fluctuating interaural level differences (ILD) to humans and ferrets. For humans, the mean of the ILD distribution biased the perceived laterality of a following target stimulus, while spatial sensitivity decreased as the distribution’s variance increased. Corresponding neural changes were observed in the inferior colliculus of ferrets; neurons’ ILD preferences adjusted towards the mean of the stimulus distribution and the slope of their rate-ILD functions decreased as the stimulus variance increased. The large body of research into adaptation is beyond the scope of this paper, but adaptation-related mechanisms may of course contribute to changes in the relative weights given to multiple cues.
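The two neural effects reported by Dahmen et al. (2010), a shift of the preferred ILD towards the stimulus mean and a slope reduction as stimulus variance grows, resemble normalization of the input by its recent statistics. The following toy sketch is our own illustration of that idea, not a model fitted to their data:

```python
import math
import statistics

def adapted_rate(ild, recent_ilds):
    """Toy sigmoidal rate-ILD function that renormalizes to recent
    stimulus statistics: its midpoint tracks the mean of recently heard
    ILDs and its slope shrinks as their spread grows, mirroring the two
    adaptation effects described in the text."""
    mu = statistics.mean(recent_ilds)
    sd = statistics.pstdev(recent_ilds) or 1.0  # guard against zero spread
    return 1.0 / (1.0 + math.exp(-(ild - mu) / sd))

# The same +3 dB target evokes a strong response after a history
# centred at 0 dB but a weak one after a history centred at +6 dB
# (the bias effect); a more variable history flattens the function
# and so reduces the response (the sensitivity effect).
print(adapted_rate(3.0, [0.0, 1.0, -1.0, 0.0]))
print(adapted_rate(3.0, [6.0, 7.0, 5.0, 6.0]))
print(adapted_rate(3.0, [0.0, 4.0, -4.0, 0.0]))
```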