Rahnev & Denison (R&D) identify many possible sources of (apparent) suboptimality in behavior, including capacity limitations, incorrect observer assumptions about stimulus statistics, heuristic decision rules, and decision noise or criterion jitter (their Table 1). They urge the field to test the collective set of these hypotheses. I strongly support this message; for example, in recent work, we tested no fewer than 24 alternatives to the optimal decision rule (Shen & Ma 2016). However, the research agenda as a whole encounters a practical challenge: It is in most cases impossible to test one hypothesis at a time, because doing so would require the experimenter to make arbitrary choices regarding the other hypotheses.
The rational approach, which is taken surprisingly rarely in the study of perception and cognition, is to consider each hypothesis as a factor in a model (Keshvari et al. 2012; van den Berg et al. 2014). Each factor could be binary or multivalued; for example, the factor “decision rule” could take the values “optimal,” “heuristic 1,” and “heuristic 2.” Just as factorial design is a cherished tool in experimentation (Fisher 1926), models can be combined in a factorial manner: Every logically consistent combination of values of the different factors constitutes a model that should be tested. If there are n binary factors, there will be up to 2^n models. The goodness of fit of each model is then evaluated using one's favorite metric, such as the Akaike information criterion (AIC; Akaike 1974) or leave-one-out cross-validation (Vehtari et al. 2017).
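To make the factorial construction concrete, here is a minimal Python sketch; the factor names, levels, and the user-supplied fitting routine are hypothetical placeholders rather than the models of any particular study.

```python
import itertools

# Hypothetical factors and levels; each combination of levels defines one model.
FACTORS = {
    "encoding_precision": ["fixed", "variable"],
    "observer_assumptions": ["correct", "incorrect"],
    "decision_rule": ["optimal", "heuristic_1", "heuristic_2"],
}

def aic(max_log_likelihood, n_free_params):
    """Akaike information criterion (Akaike 1974); lower values indicate better fit."""
    return 2 * n_free_params - 2 * max_log_likelihood

def factorial_model_comparison(data, fit_model):
    """Cross all factor levels and score every resulting model.

    `fit_model(model_spec, data)` is supplied by the user and must return
    (maximum log likelihood, number of free parameters) for the model
    specified by the dict `model_spec`.
    """
    results = {}
    for combo in itertools.product(*FACTORS.values()):
        model_spec = dict(zip(FACTORS.keys(), combo))
        max_ll, n_params = fit_model(model_spec, data)
        results[combo] = aic(max_ll, n_params)
    return results  # e.g., {("variable", "correct", "optimal"): 1234.5, ...}
```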
In factorial model comparison, ranking all individual models is usually not the end of the road: Often, one is less interested in the evidence for individual models than in the evidence for the different levels of a particular factor. This can be obtained by aggregating across “model families.” A model family consists of all models that share a particular value for a particular factor (e.g., the value “heuristic 1” for the factor “decision rule”). In the binary example with 2^n models, each model family would have 2^(n-1) members. The goodness of fit of a model family is then a suitably chosen average of the goodness of fit of its members. In a fully Bayesian treatment, this averaging is marginalization. If AIC is the metric of choice, one could average the AIC weights (Wagenmakers & Farrell 2004) of the family members (Shen & Ma, in press). Finally, one could represent each family by its best-performing member (van den Berg et al. 2014).
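As an illustration, here is a hedged sketch of family-wise aggregation with AIC weights, w_i proportional to exp(-ΔAIC_i/2), building on the hypothetical `results` dictionary from the sketch above:

```python
import math

def aic_weights(aics):
    """Convert a dict of AIC values into Akaike weights (Wagenmakers & Farrell 2004)."""
    best = min(aics.values())
    rel_lik = {model: math.exp(-(a - best) / 2) for model, a in aics.items()}
    total = sum(rel_lik.values())
    return {model: r / total for model, r in rel_lik.items()}

def family_score(aics, factor_index, level):
    """Average the Akaike weights of all models that share `level` on the factor
    at position `factor_index` of the model tuple (one possible aggregation;
    a fully Bayesian treatment would marginalize instead)."""
    weights = aic_weights(aics)
    members = [w for model, w in weights.items() if model[factor_index] == level]
    return sum(members) / len(members)
```

For example, `family_score(results, 2, "optimal")` would score the family of all models that use the optimal decision rule, regardless of their values on the other factors.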
An alternative to factorial model comparison is to construct a “supermodel,” of which all models of interest are special cases (Acerbi et al. 2014a; Pinto et al. 2009). For example, an observer's belief about a Gaussian stimulus distribution with fixed mean and variance could be modeled using a Gaussian distribution with free mean and variance. Then, all inference amounts to parameter estimation, for which one can use Bayesian methods. In some cases, however, a factor is most naturally considered categorical – for example, when comparing qualitatively distinct decision rules.
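To illustrate the supermodel idea with a toy example (the estimation task, parameter values, and Gaussian cue-combination setup below are my own assumptions, not a specific published model): consider an observer who estimates a stimulus by combining a noisy measurement with a believed Gaussian prior. Fixing the belief parameters at the true stimulus statistics recovers the correct-assumptions observer; leaving them free turns model comparison into parameter estimation.

```python
# True stimulus statistics and sensory noise (assumed values for this sketch).
TRUE_PRIOR_MU, TRUE_PRIOR_SIGMA = 0.0, 5.0
SENSORY_SIGMA = 2.0

def predicted_estimate(measurement, belief_mu, belief_sigma):
    """Posterior-mean estimate under the observer's *believed* prior
    N(belief_mu, belief_sigma**2). The special case belief_mu = TRUE_PRIOR_MU,
    belief_sigma = TRUE_PRIOR_SIGMA is the correct-assumptions observer;
    fitting belief_mu and belief_sigma as free parameters embeds that model
    in a continuous supermodel."""
    weight = belief_sigma**2 / (belief_sigma**2 + SENSORY_SIGMA**2)
    return weight * measurement + (1 - weight) * belief_mu
```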
My lab performed factorial model comparison for the first time in a study of change detection, crossing hypotheses about the nature of encoding precision with hypotheses about the observer's assumptions about encoding precision and about the decision rule (Keshvari et al. 2012). In this case, an optimal-observer model with variable encoding precision and correct observer assumptions won convincingly.
Factorial model comparison is no silver bullet. It is easy to end up with statistically indistinguishable models in the full ranking. In a study of the limitations of visual working memory, we crossed hypotheses about the number of remembered items with ones about the nature of encoding precision, and with ones about the presence of non-target reports (van den Berg et al. 2014). This produced 32 models, many of which were indistinguishable from others in goodness of fit. Family-wise aggregation helped to draw conclusions, but even that may not always suffice. Such non-identifiability, however, is not a problem of the method but a reflection of the difficulty of drawing inferences about multicomponent processes from limited behavioral data. As we wrote in van den Berg et al. (2014, p. 145), “Factorially comparing models using likelihood-based methods is the fairest and most objective method for drawing conclusions from psychophysical data. If that forces researchers to reduce the level of confidence with which they declare particular models to be good representations of reality, we would consider that a desirable outcome.”