Firestone & Scholl (F&S) review an extensive body of research on visual perception. Claims of higher-level effects on lower-level processes, they show, have swept over this research field like a “tidal wave.” Unsurprisingly, other areas of cognitive psychology have been similarly inundated.
Auditory perception and visual perception are alike in the questions they raise about the interplay of cognitive processing with sensory and perceptual analysis of a highly complex and externally determined input; and like visual perception, the processes underlying auditory perception have been well mapped in recent decades. Many of the features of the vision literature F&S note have direct auditory counterparts, such as highly intuitive demonstrations (the McGurk effect, whereby auditory input of [b] combined with visual articulation of [g] produces a percept of [d]; McGurk & MacDonald 1976), control of peripheral attention shift (the “cocktail-party phenomenon,” attending to one interlocutor in a crowd of talking people), or novel pop-out effects, for example, for certain phonologically illegal sequences (Weber 2001).
Auditory and visual perception differ, however, not only in sensory modality but also in the input domain: Visual signals play out in space; auditory input arrives across time. The temporal input dimension has had implications for how the equivalent debate in the speech perception literature has played out; it has proved natural and compelling to treat the question of modularity as one concerning the temporal order of processing – has the bottom-up processing order (e.g., of speech sounds before the words they occur in) been compromised? Studies in which ambiguous speech sounds are categorised differently in varying lexically biased contexts (e.g., a [d/t] ambiguity reported as “d” before -eep but as “t” before -eek because deep and teak are words but teep and deek are not; Ganong 1980) were initially taken as evidence for top-down effects. (Note, however, that rather than tapping directly into perceptual processes, categorisation tasks may largely reflect metalinguistic judgments, as per F&S's Pitfall 2.) This work prompted follow-ups showing, for example, stronger lexical effects in slower responses (Fox 1984), and no build-up of effects with an ambiguous sound in syllable-final rather than syllable-initial position (McQueen 1991); these temporal arguments suggested a response bias account (F&S's Pitfall 3).
In general, F&S's Pitfall 1 (a confirmatory approach) has been the hallmark of much of the pro-top-down speech perception literature. Most of that literature takes the form of a catalogue of findings that are consistent with top-down effects but are not diagnostic. There is frequently little evidence that alternative feed-forward explanations have been considered. One of the few exceptions comes from the study of compensation for coarticulation, a known low-level process in speech perception whereby cues to phonetic contrasts may be weighted differently depending on immediately preceding sounds. An influential study by Elman and McClelland (1988) reported that interpretation of a constant word-initial [t/k] ambiguity could be affected by the lexically determined interpretation of a constant immediately preceding word-final [s/sh] ambiguity (whether it served as the final sound of Christmas or foolish). Pitt and McQueen (1998) reasoned that if this compensation was a necessary consequence of the lexical effect (rather than of transitional probability, as they argued; that is, an artefact as per F&S's Pitfall 4), then if there is no compensation effect there should be no lexical effect either. With the [s/sh] occurring instead in words balanced for transitional probability (juice, bush), the word-initial [t/k] compensation disappeared; but the lexical [s/sh] effect remained. Such studies (testing disconfirmation predictions) are, however, as in the vision literature, vanishingly rare.
In our account in this journal of these and similar sets of studies (Norris et al. 2000), we concluded, as F&S do for the vision literature, that there was then no viable evidence for top-down penetrability of speech-perception processes. In a more recent review (Norris et al. 2016) we reach the same conclusion regarding current research programs in which similar claims have been reworded in terms of prediction processes (“predictive coding”).
Priming effects (F&S's Pitfall 6) are more or less the bread and butter of spoken-word recognition research, so that psychological studies tend to preserve the memory/perception distinction; but in an essentially separate line of speech perception studies, from the branch of linguistics known as sociophonetics, the distinction has in our opinion been blurred. Typical results from this literature include listeners' matching of heard tokens to synthesised vowel comparison tokens being influenced by (a) telling participants the speaker was from Detroit versus Canada (Niedzielski 1999), (b) labeling participants' response sheets “Australian” versus “New Zealander” (Hay et al. 2006), or (c) having a stuffed toy kangaroo or koala versus a stuffed toy kiwi in the room (Hay & Drager 2010). In fairness to these authors, we note that they do not propound large claims concerning penetration of cognition into primary auditory processing (they interpret their results in terms of reference to listening experience). It seems to us, however, that a rich trove of possible new findings could appear if researchers would adopt F&S's advice and debrief participants, then correlate the match responses with the debriefing outcomes.
A comprehensive and thorough review of a substantial body of research (with potentially important implications for theory) is always a great help to researchers – and especially useful if it uncovers new patterns such as, in this case, a systematic set of deficiencies. But in the present article, a service has been performed for researchers of the future as well, in the form of a checklist against which the evidence for theoretical claims can be evaluated. Only research reports that pass (or at least explicitly address) F&S's six criteria can henceforth become part of the serious theoretical conversation. As we have indicated, these criteria have application beyond visual perception; at least speech perception can use them too. Thus, we salute F&S for performing a signal service to the cognitive psychology community. Bottoms up!