Clark's target article captures well our excitement about predictive coding and the human ability to incorporate uncertainty into cognitive decisions. One additional factor in matching representational learning to biological findings, which has not been stressed much in the target article, is the importance of sparseness constraints. We discuss this here, together with some critical remarks on Bayesian models and some remaining challenges in quantifying the general approach.
There are many unsupervised generative models that can be used to learn representations that reconstruct input data. Consider, for example, photographs of natural scenes. A common method for dimensionality reduction is principal component analysis, which represents data along orthogonal feature vectors of decreasing variance. However, as nicely pointed out by Olshausen and Field (1996), the corresponding filters do not resemble receptive fields in the brain. In contrast, if a generative model is constrained to minimize not only the reconstruction error but also the number of basis functions used for any specific image, then filters emerge that resemble receptive fields of simple cells in the primary visual cortex.
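To make this objective concrete, the following minimal NumPy sketch alternates between inferring sparse codes and updating a dictionary for a loss of the form ||X − ADᵀ||² + λ||A||₁, in the spirit of Olshausen and Field (1996). All specifics here (the random stand-in data, the ISTA inner solver, step sizes) are our illustrative assumptions, not the original algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins: in the original setting the rows of X would be whitened
# natural-image patches; random data merely keep the sketch self-contained.
X = rng.standard_normal((500, 64))        # 500 "patches" of 8x8 pixels
D = rng.standard_normal((64, 32))         # 32 basis functions (columns of D)
D /= np.linalg.norm(D, axis=0)

lam, lr = 0.1, 0.1                        # sparseness weight, dictionary step

for _ in range(20):                       # alternate inference and learning
    # Infer sparse codes A by ISTA (one of many possible L1 solvers).
    A = np.zeros((X.shape[0], D.shape[1]))
    step = 1.0 / np.linalg.norm(D.T @ D, 2)
    for _ in range(50):
        A -= step * (A @ D.T - X) @ D                             # error gradient
        A = np.sign(A) * np.maximum(np.abs(A) - step * lam, 0.0)  # L1 shrinkage
    # One gradient step on the dictionary for the reconstruction error.
    D += lr * (X - A @ D.T).T @ A / X.shape[0]
    D /= np.linalg.norm(D, axis=0)        # keep basis functions unit norm
```

With whitened natural-image patches in place of the random X, the columns of D typically converge to the localized, oriented, Gabor-like filters referred to above.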
Sparse representation in the neuroscientific context actually has a long and important history. Horace Barlow pointed out for years that the visual system seems to be remarkably set up for sparse representations (Barlow 1961), and probably the first systematic model in this direction was proposed by his student Peter Földiák (1990). It seems that nearly every generative model with a sparseness constraint can reproduce receptive fields resembling those of simple cells (Saxe et al. 2011), and Ng and colleagues have shown that sparse hierarchical Restricted Boltzmann Machines (RBMs) learn features resembling receptive fields in V1 and V2 (Lee et al. 2008). In our own work, we have shown how lateral inhibition can implement sparseness constraints in a biological way while also promoting topographic representations (Hollensen & Trappenberg 2011).
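As a hedged illustration of how lateral inhibition can enforce sparseness, here is a minimal sketch in the spirit of Földiák's (1990) network: Hebbian feedforward weights, anti-Hebbian lateral weights, and adaptive thresholds that hold each unit near a target firing probability. The parameter values and the random binary input are our assumptions for this sketch, not values from any of the cited models.

```python
import numpy as np

rng = np.random.default_rng(1)

n_in, n_out, p = 16, 8, 0.125           # p: target firing probability (sparseness)
Q = rng.uniform(size=(n_out, n_in))     # feedforward weights (Hebbian)
Q /= Q.sum(axis=1, keepdims=True)
W = np.zeros((n_out, n_out))            # lateral weights (anti-Hebbian, inhibitory)
t = np.full(n_out, 0.5)                 # adaptive firing thresholds
alpha, beta, gamma = 0.1, 0.02, 0.02    # learning rates (illustrative values)

def settle(x, n_iter=20):
    """Iterate the recurrent dynamics so lateral inhibition can act."""
    y = np.zeros(n_out)
    for _ in range(n_iter):
        y = (Q @ x + W @ y - t > 0).astype(float)  # binary threshold units
    return y

for _ in range(2000):
    x = (rng.uniform(size=n_in) < 0.2).astype(float)  # random binary input
    y = settle(x)
    W -= alpha * (np.outer(y, y) - p**2)   # co-firing above chance -> more inhibition
    np.fill_diagonal(W, 0.0)
    W = np.minimum(W, 0.0)                 # lateral weights stay inhibitory
    Q += beta * y[:, None] * (x[None, :] - Q)  # Hebbian learning with decay
    t += gamma * (y - p)                   # thresholds track the target rate
```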
Sparse representation has great advantages. By definition, it means that only a small number of cells have to be active to reproduce inputs in great detail. This is not only advantageous energetically; it also amounts to a large compression of the data. Of course, the extreme case of maximal sparseness, corresponding to grandmother cells, is not desirable, as it would hinder any generalization ability of a model. Experimental evidence of sparse coding has been found in V1 (Vinje & Gallant 2000) and hippocampus (Waydo et al. 2006).
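Such claims can be quantified. One common formulation is a sparseness index of the kind used by Vinje and Gallant (2000); the short function below is our own illustrative implementation, returning 0 for a uniform population response and 1 when a single "grandmother" unit carries all the activity.

```python
import numpy as np

def sparseness_index(r):
    """Population sparseness of a nonnegative response vector r.

    S = (1 - A) / (1 - 1/n) with A = (sum(r)/n)^2 / (sum(r^2)/n),
    following the Rolls/Tovee-style measure used in Vinje & Gallant (2000).
    S = 0 for a uniform response; S = 1 when one unit carries all activity.
    """
    r = np.asarray(r, dtype=float)
    n = r.size
    A = (r.sum() / n) ** 2 / (np.square(r).sum() / n)
    return (1.0 - A) / (1.0 - 1.0 / n)

print(sparseness_index([1.0, 1.0, 1.0, 1.0]))  # 0.0 (dense code)
print(sparseness_index([4.0, 0.0, 0.0, 0.0]))  # 1.0 (grandmother cell)
```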
The relation of the efficient coding principle to free energy is discussed by Friston (2010), who provides a derivation of free energy as the difference between complexity and accuracy. That is, minimizing free energy maximizes the probability of the data (accuracy), while also minimizing the divergence (Kullback–Leibler divergence) between the causes we infer from the data and our prior on causes. The fact that the latter is termed complexity reflects our intuition that causes in the world lie in a smaller space than their sensory projections. Thus, our internal representation should mirror the sparse structure of the world.
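In symbols, and in our own notation rather than a quotation of Friston's, this decomposition reads:

```latex
% Free energy as complexity minus accuracy (cf. Friston 2010).
% q(\vartheta): recognition density over hidden causes; p(\vartheta): prior;
% \tilde{s}: sensory data. Notation is ours, chosen for this commentary.
F \;=\; \underbrace{D_{\mathrm{KL}}\big[\,q(\vartheta)\,\|\,p(\vartheta)\,\big]}_{\text{complexity}}
\;-\; \underbrace{\big\langle \ln p(\tilde{s}\mid\vartheta)\big\rangle_{q(\vartheta)}}_{\text{accuracy}}
```

A sparse prior p(ϑ) thus makes non-sparse representations costly: they are penalized through the complexity term even when they reconstruct the data well.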
While Friston shows the equivalence of Infomax and free-energy minimization given a sparse prior, a fully Bayesian implementation would treat the prior itself as a random variable to be optimized through learning. Indeed, Friston goes on to say that the criticism of where these priors come from “dissolves with hierarchical generative models, in which the priors themselves are optimized” (Friston 2010, p. 129). This is precisely what has not yet been achieved: a model that learns a sparse representation of sensory messages because of the world's sparseness, rather than because of its architecture or static priors. Of course, we are likely endowed with a range of priors built into our evolved cortical architecture in order to bootstrap or guide development. What these native priors are, and the form they take, is an interesting and open question.
There are two alternatives to innate priors for explaining the receptive fields we observe. First, there has been a strong tendency to train hierarchical models layer by layer, with each layer learning to reconstruct the output of the previous one without being influenced by top-down expectations. Such top-down modulation is the prime candidate for expressing empirical priors and for shaping learning to incorporate high-level tendencies. How to implement a model that balances conformity to its input against top-down expectations, while offering efficient inference and robustness, is a largely open question (Jaeger 2011). Second, the data typically used to train our models differ substantially from what we are exposed to. The visual cortex experiences a stream of images with substantial temporal coherence and correlation with internal signals such as eye movements, limiting the conclusions we can draw from comparing its representations to models trained on static images (see, e.g., Rust et al. 2005).
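The structural point of the first concern can be seen in a toy sketch of greedy layer-wise training. The linear tied-weight autoencoder below is our own illustrative stand-in (an RBM or sparse autoencoder would be the more usual layer); what matters is that each layer's objective contains only a bottom-up reconstruction term and no top-down expectation.

```python
import numpy as np

rng = np.random.default_rng(2)

def train_layer(H, n_hidden, lr=0.01, epochs=200):
    """Train one tied-weight linear autoencoder layer on its input alone.

    The objective is purely bottom-up reconstruction, ||H - H W W^T||^2;
    nothing in it refers to expectations from layers above.
    """
    W = 0.01 * rng.standard_normal((H.shape[1], n_hidden))
    for _ in range(epochs):
        G = H @ W @ W.T - H                              # reconstruction error
        W -= lr * (H.T @ G @ W + G.T @ H @ W) / len(H)   # gradient step on W
    return W

def greedy_pretrain(X, layer_sizes):
    """Stack layers bottom-up; each trains on the frozen code below it."""
    H, weights = X, []
    for n_hidden in layer_sizes:
        W = train_layer(H, n_hidden)
        weights.append(W)
        H = H @ W            # next layer never feeds back to this one
    return weights

weights = greedy_pretrain(rng.standard_normal((200, 32)), [16, 8])
```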
The final comment we would like to make here concerns the discussion of Bayesian processes. Bayesian models such as the ideal observer have received considerable attention in neuroscience, since they seem to nicely capture human abilities to combine new evidence with prior knowledge in the “correct” probabilistic sense. However, it is important to realize that these Bayesian models are very specific to limited experimental tasks, often with only a few possible relevant states, and such models do not generalize well to changing experimental conditions. In contrast, Bayesian models such as the Boltzmann machine represent general mechanistic implementations of information processing in the brain that, we believe, can realize a general learning machine. While all these models are Bayesian in the sense that they represent causal models with probabilistic nodes, their natures are very different. It is fascinating to think about how specific Bayesian models such as the ideal observer can emerge from general learning machines such as the RBM. Indeed, such a demonstration would be necessary to underpin the story that hierarchical generative models support Bayesian cognitive processing as discussed in the target article.
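To make the contrast explicit, here is the kind of task-specific ideal-observer computation we have in mind: fusing a Gaussian prior with a single Gaussian observation, where precisions add and the posterior mean is a precision-weighted average. The numbers are invented for illustration; a general learning machine such as an RBM would have to discover this computation from data rather than having it written down.

```python
# Ideal-observer fusion of a Gaussian prior with one Gaussian observation:
# precisions (inverse variances) add, and the posterior mean is the
# precision-weighted average. All numbers are invented for illustration.
mu_prior, var_prior = 0.0, 4.0   # prior belief about, say, a stimulus position
mu_obs, var_obs = 2.0, 1.0       # new, more reliable sensory evidence

post_var = 1.0 / (1.0 / var_prior + 1.0 / var_obs)
post_mu = post_var * (mu_prior / var_prior + mu_obs / var_obs)

print(post_mu, post_var)         # 1.6 0.8 -- the evidence dominates the vaguer prior
```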