Predictive coding (PC) was first introduced for visual perception (Rao & Ballard 1999). PC theory has since been developed in a series of articles into an influential theory of perception. In PC theory, perception is said to involve probabilistic testing of contextually based hypotheses against sensory input. Deviations from predictions result in prediction errors, which are transmitted upstream in the hierarchically organized brain. This is supposed to reduce free energy, that is, to reduce the amount of information needed for perception.
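To fix ideas, here is a minimal one-level sketch of the inference loop just described, written in Python. It is a toy illustration, not the Rao & Ballard implementation; the dimensions, weights, and step size are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 2))   # generative weights: hidden cause -> predicted input
x = rng.normal(size=4)        # sensory input
r = np.zeros(2)               # current hypothesis about the hidden cause
lr = 0.1                      # inference step size

for _ in range(50):
    prediction = W @ r        # top-down prediction of the input
    error = x - prediction    # prediction error, the signal sent "upstream"
    r += lr * (W.T @ error)   # revise the hypothesis so as to reduce the error

# The squared error (a stand-in for "free energy") shrinks over iterations.
print("remaining squared error:", float(np.sum((x - W @ r) ** 2)))
```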
Recently, a PC model of music perception was introduced (Koelsch et al. 2018). This model presumes that event-related potentials (ERPs), measured with EEG, reflect hypothesis testing. The cause of ERPs has been debated. It has, however, been demonstrated that some ERPs are reactions to deviations from predictions and, furthermore, that these reactions occur at several levels of the brain hierarchy (Wacogne et al. 2011). ERPs thus seem to provide biological confirmation of PC theory. Koelsch et al. (2018) focus on the ERPs ERAN (early right anterior negativity) and MMN (mismatch negativity). There is a vast literature on ERPs for music, indicating reactions to deviations in rhythm, melody, harmony, structure, and so forth. The ERP reaction time for music varies from 200 to 600 ms, depending on the complexity of the stimulus. This leaves us with five questions:
1. How can we play together if the perception of music is delayed? A musician trying to fit in would be about half a second late (a back-of-the-envelope calculation follows this list).
2. How can the sound wave even be perceived as music if the processing times for rhythm, melody, chords, and structure differ? Most musical events are predictable to some extent. Musical beat, for example, is entrained; entrainment here means the body's synchronization to the beat. We act on the beat. Rhythm is thus perceived directly. It has been demonstrated that periodic sounds produce bursts of gamma oscillations at sound onsets and, furthermore, that these bursts continue when the stimulus is omitted (Snyder & Large 2005; Tal et al. 2017). If the perception of unpredicted tones or chords were delayed noticeably, they would lag behind the rhythm. The music would fall apart.
3. Why do we not hear musical predictions? According to PC, predictions, if correct, are not affected by sensory input. The prediction should be what we perceive. If a sound is omitted from a predictable pattern, an auditory response can still be elicited (Bendixen et al. 2012). Predictions can sometimes be heard as the inner sensation called musical imagery (Zatorre et al. 1996), but this is clearly different from actual music. If the band stops playing, we do not hear their music, however predictable it might be.
4. Why do we not hear two tones when the expectation is violated? The expected tone should first be perceived and then be replaced by the actual tone. We cannot assume that the second tone mutes the first, as the second tone does not yet exist (in the brain) when the first tone is heard.
5. If it is just an error signal that is sent upstream, the only sensory information about a melody tone is that it is not the expected one. But to infer the actual tone, the brain must know how much, and in which direction, it deviates from the preceding tone. This is sensory information about pitch differences. If we get this information, what is the use of hypothesis testing?
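As promised in question 1, a back-of-the-envelope Python sketch of the timing problem; the tempo and the range of delays are illustrative assumptions.

```python
# Question 1 in numbers (illustrative values): how a 200-600 ms perceptual
# delay compares with the interval between beats at a moderate tempo.
tempo_bpm = 120
beat_interval_ms = 60_000 / tempo_bpm          # 500 ms between beats

for erp_delay_ms in (200, 400, 600):
    lag_in_beats = erp_delay_ms / beat_interval_ms
    print(f"{erp_delay_ms} ms delay = {lag_in_beats:.1f} beats at {tempo_bpm} BPM")

# A musician whose percept lagged by this much would land between roughly
# half a beat and more than a full beat behind the ensemble.
```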
As Brette points out, the word predictive in predictive coding does not designate the prediction of future events. Classic PC models are not designed to explain how we perceive time-varying stimuli, as they do not account for neural transmission delays (Hogendoorn & Burkitt 2019). … Such delays would make it impossible to return a tennis serve, because the actual ball would be several meters ahead of the perceived ball. However, as demonstrated by Nijhawan (1994), the visual system compensates for delays by extrapolating the trajectory of the moving target. This mechanism makes us see the ball where it "should" be, and actually is. A striking demonstration of such extrapolation is the flash-lag effect (FLE), in which a continuously moving object is compared with a discrete, repetitive flashing of a stationary cube (https://www.youtube.com/watch?v=DUBM-GG0gAk). The moving object appears to be ahead of the flashes, although they are synchronized in real time. It is believed that the movement of the object is extrapolated but the flashing of the cube is not, and that the difference in position thus reflects the difference in neural transmission time. Comparing FLE to music perception, two observations can be made:
1. In FLE, the perception of the visual pulse (the flashes), although regular and repetitive, is delayed. In music, the perception of the acoustic pulse (the beat) is not.
2. In FLE, time is dissociated (into extrapolated and non-extrapolated time scales), but this is not the case in music.
These points indicate that acoustic perception differs fundamentally from visual perception. Other mechanisms are in play. The Hogendoorn–Burkitt model of visual extrapolation (Hogendoorn & Burkitt 2019) would not work for music, because music is not a continuous movement.
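A toy numerical sketch of the contrast: what trajectory extrapolation buys for continuous motion, and why there is nothing comparable to extrapolate between discrete tone onsets. The speed and delay values are illustrative assumptions, not figures from the cited work.

```python
# Illustrative numbers only: how extrapolation compensates a transmission
# delay for continuous motion, and why discrete tone onsets offer no
# trajectory to extrapolate from.
delay_s = 0.1            # assumed neural transmission delay
ball_speed_mps = 50.0    # a fast tennis serve, roughly 50 m/s

registered_position_m = 10.0                     # where the ball was, 100 ms ago
lag_m = ball_speed_mps * delay_s                 # uncorrected percept lags ~5 m
extrapolated_m = registered_position_m + lag_m   # percept placed where the ball is now
print(f"lag without extrapolation: {lag_m:.0f} m; "
      f"extrapolated position: {extrapolated_m:.0f} m")

# A tone onset, by contrast, is a discrete event: between onsets there is no
# continuous trajectory whose velocity could be used for the same correction.
```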
If the function of ERPs is not to guide perception, then what is it? It is possible that ERPs simply reflect negative feedback and that their function is learning – the updating of internal models.
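A minimal sketch of this alternative reading, in which the error signal updates the internal model for future predictions rather than constructing the current percept. The delta-rule form, the pitch values, and the learning rate are hypothetical illustrations.

```python
# A hypothetical delta-rule reading of the ERP: the error updates the model
# used for the *next* prediction; it does not construct the current percept.
expected_pitch_hz = 440.0      # the model's prediction (A4)
heard_pitch_hz = 466.2         # the tone actually played (B-flat 4)
learning_rate = 0.2            # assumed update step

prediction_error = heard_pitch_hz - expected_pitch_hz    # the ERP-like signal
expected_pitch_hz += learning_rate * prediction_error    # internal model update
print(f"updated expectation: {expected_pitch_hz:.1f} Hz")
```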
Perception and action are reciprocally dependent, and music is a perfect example. Accordingly, the solution to the dissociation problem may be sought along the lines suggested by Brette, that is, as enactive perception.
Acknowledgment
The author thanks Helge Malmgren for valuable comments.