The article by Elff et al. makes a valuable contribution on how to apply post-estimation adjustments to statistical tests in hierarchical models estimated by maximum likelihood (ML) when group sizes are small.
My article, which they reference, is part of a larger group of articles investigating the performance of ‘standard’ ML estimation when group sizes are small (see, for example, Bell et al. 2014; Bryan and Jenkins 2015; Maas and Hox 2005). When writing that article (in early 2010), I focused on ML since I perceived it to be the estimation approach most commonly employed in empirical practice.Footnote 1 Like others, I found (using a data structure commonly employed in comparative politics) that working with very small group sizes (say, about ten countries) merits some caution.Footnote 2
While diligence might be warranted when estimating models with limited group sample sizes using ML, no one is forced to abandon the approach and convert to a Bayesian viewpoint. A growing literature has discussed adjustments to statistical tests of coefficients on covariates in small group size settings when the model is estimated using (restricted) ML. Several studies suggest using the Kenward and Roger (1997) correction and assess its empirical performance. The general finding is that these corrections perform rather well under small group sample sizes and even under some violations of normality and sphericity (for example, Arnau, Bono and Vallejo 2009; Arnau et al. 2014; Kowalchuk et al. 2004; Luke 2017; Schaalje, McBride and Fellingham 2002; Spilke, Piepho and Hu 2005). Thus, one author concludes a recent article, which introduces restricted maximum-likelihood (REML) estimation followed by Kenward-Roger (KR) corrected hypothesis tests, with the statement that the availability of ‘frequentist corrections […] preclude researchers from necessarily having to resort to a Bayesian framework’ (McNeish 2017, 666).Footnote 3
Elff et al. forcefully echo these sentiments. In addition to their general exposition of the logic of post-estimation test correction, they present ample simulation evidence using a data structure where the lower-level sample size is large and the second-level sample size is small. Their Monte Carlo evidence is thus much closer to data commonly found in observational political science applications (albeit still based on a highly stylized data-generating process, a point I return to below). They also discuss an immensely useful rule of thumb for approximating the degrees of freedom used in corrected tests (allowing researchers to potentially skip more computationally expensive approximations, such as the one proposed by KR) and demonstrate its reliability in simulations. I agree with their central points, and I am sure that their contribution will be a major reference point for social scientists working with hierarchical models. Below, I clarify some minor points of contention, offer some thoughts on open questions, and make an argument for a continued role of Bayesian specifications. I conclude with some thoughts on best practices for estimating hierarchical models.
On the quality of unadjusted and adjusted ML estimates
While not the main focus of their article (or of this comment), Elff et al. point out that my original MC sample size of 1,000 is too small to assess the tail behavior of simulated quantities. They propose a more involved simulation scheme which uses several random seeds for the random number generator when creating MC draws. I agree with their suggestion. However, their discussion might create the impression that the confidence interval coverage bias found for ML estimates is simply the result of not using their proposed strategy of obtaining MC samples. In Table 1 I provide some evidence that this is not the case and that conducting hypothesis tests from (unadjusted) ML estimates using small group sample sizes is indeed rather anti-conservative. The results displayed in Table 1 are based on a simple data-generating process: a random-intercept model with a macro-level covariate (w_j), which is either continuous (Panel A) or dichotomous (Panel B). I follow Elff et al.'s prescription and use 10,000 MC samples based on ten random seeds.Footnote 4
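To make this setup concrete, the following minimal R sketch generates one draw from a data-generating process of this kind and fits it by ‘standard’ ML with lme4. The coefficient and variance values are illustrative placeholders, not the exact values underlying Table 1.

```r
library(lme4)

m <- 10      # number of groups (countries)
n <- 500     # lower-level units per group (10 x 500 = 5,000 rows, as in Table 1)

one_draw <- function(beta = c(1, 0.5), sd_u = 0.5) {
  g <- rep(seq_len(m), each = n)   # group index
  w <- rnorm(m)                    # continuous macro-level covariate w_j
  u <- rnorm(m, sd = sd_u)         # random intercepts
  y <- beta[1] + beta[2] * w[g] + u[g] + rnorm(m * n)
  data.frame(y = y, w = w[g], g = factor(g))
}

fit_ml <- lmer(y ~ w + (1 | g), data = one_draw(), REML = FALSE)  # 'standard' ML fit
summary(fit_ml)
```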
Table 1. Actual coverage of nominal 95 per cent confidence interval for a second-level covariate under different estimators and post-estimation corrections

Note: Based on 10,000 Monte Carlo samples (10 × 1,000 draws, with ten random seeds initialized from a master seed obtained from random.org). The sample comprises 5,000 observations in ten equally sized groups. The final column reports the one-sided p-value for a one-sample test of proportions.
Lines 1 and 4 in Table 1 show substantial non-coverage for standard ML estimates: even when looking at the right tail of the MC draws, coverage falls about 10 percentage points short of the nominal level. Switching to REML estimation does move the coverage of confidence intervals somewhat closer to its nominal value, but still leaves it about 4 percentage points too low (on average). A one-sided test of proportions indicates that the actual proportion of confidence intervals covering the true parameter value remains significantly below 0.95.
Adding the KR correction perfectly illustrates the point made by Elff et al.: the actual coverage of the adjusted confidence intervals is virtually identical to their nominal 0.95 value. In the case of a continuous covariate, the coverage rate is in fact indistinguishable from its nominal value (in the dichotomous case, the upper limit of the 95 per cent MC interval makes clear that the degree of non-coverage is negligible for any practical purposes). Finally, note that in this simulation, using the (computationally much simpler) m – l – 1 heuristic produces confidence intervals that perform equally well.
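For readers who want to apply these corrections themselves, the following R sketch shows how both can be obtained for a model of the kind fitted above (the KR correction via lmerTest, which draws on pbkrtest; the m – l – 1 interval computed by hand). It reuses the one_draw() helper from the previous sketch.

```r
library(lmerTest)   # loads lme4 and adds KR/Satterthwaite-corrected t-tests

dat <- one_draw()   # one simulated data set, as sketched above

## REML fit with Kenward-Roger corrected test of the macro-level coefficient
fit_reml <- lmerTest::lmer(y ~ w + (1 | g), data = dat, REML = TRUE)
summary(fit_reml, ddf = "Kenward-Roger")

## m - l - 1 heuristic: m = 10 groups, l = 1 second-level covariate
df_heur <- 10 - 1 - 1
b  <- fixef(fit_reml)["w"]
se <- sqrt(diag(as.matrix(vcov(fit_reml))))["w"]
b + c(-1, 1) * qt(0.975, df = df_heur) * se   # 95 per cent confidence interval
```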
More realistic data-generating processes and suggestions for further work
The m – l – 1 heuristic is highly relevant, not only because it will likely be employed by many practitioners, but also because the authors rely on it for hypothesis tests in ‘REML-like’ generalized linear model specifications. It can be expected to work well in many settings, but it would be helpful to gather more evidence on its behavior when simulating more complex or ‘realistic’ data-generating processes (here I am echoing the criticism of Bryan and Jenkins (2015) that most MC studies do not look like real-world applications).
To give a first indication of how the accuracy of the approximation varies with model features, Figure 1 plots histograms of 5,000 MC draws of KR-type degrees-of-freedom estimates from a more complex random-intercept, random-slope model, together with the degrees of freedom suggested by the m – l – 1 heuristic. Panel A refers to a macro-level covariate, while Panel B refers to the main effect of a covariate with a random slope.Footnote 5
Figure 1. Comparison of the KR degrees-of-freedom approximation with the m – l – 1 heuristic
Note: Panel A shows results for the second-level covariate. Panel B shows results for a first-level covariate with a random slope. Based on 5,000 MC samples.
Panel A shows that the m – l – 1 heuristic coincides almost perfectly with its KR-based counterpart. Panel B shows that the KR degree-of-freedom estimates are systematically larger. In this case the direction of the difference is such that tests using the t-distribution with m – l – 1 degrees of freedom will be more conservative than KR-based tests. But I am not sure that this direction is guaranteed in every application.
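Researchers who want to probe this comparison in their own application can do so along the following lines: fit the random-intercept, random-slope model, extract the KR degrees of freedom for each coefficient, and set them against the m – l – 1 value (repeated over many MC draws to produce histograms like those in Figure 1). The DGP values below are illustrative, not those underlying Figure 1.

```r
library(lmerTest)

## one draw from a random-intercept, random-slope DGP (values illustrative)
m <- 10; n <- 500
g  <- rep(seq_len(m), each = n)
w  <- rnorm(m)[g]              # macro-level covariate
x  <- rnorm(m * n)             # lower-level covariate with a random slope
u0 <- rnorm(m, sd = 0.5)[g]    # random intercepts
u1 <- rnorm(m, sd = 0.3)[g]    # random slopes
y  <- 1 + 0.5 * w + 0.5 * x + u0 + u1 * x + rnorm(m * n)
dat_rs <- data.frame(y, x, w, g = factor(g))

fit_rs <- lmerTest::lmer(y ~ x + w + (1 + x | g), data = dat_rs, REML = TRUE)
kr <- summary(fit_rs, ddf = "Kenward-Roger")$coefficients
kr[, "df"]     # KR degrees of freedom for each fixed effect
10 - 1 - 1     # m - l - 1 heuristic (one second-level covariate)
```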
In future work, it would be of interest to see how the m – l – 1 heuristic behaves in extended model setups. I am thinking specifically of two popular variants of hierarchical models. The first is what is sometimes called the ‘correlated random effects’ model, in which averages (or other transforms) of some or all lower-level covariates are included as second-level predictors (Raudenbush 1989; Wooldridge 2010). The second is the class of models involving crossed random effects, which arise when lower-level units are simultaneously members of different groups. A specific example is hierarchical models applied to age-period-cohort analysis (for example, Yang and Land 2008), where the number of cohorts is often small. Thinking about the implementation of the heuristic and assessing its empirical performance in these models – especially when estimating them using ‘REML-like’ approaches with limited dependent variables – would be quite helpful.Footnote 6
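In lme4 formula terms, the two setups look roughly as follows. This is only a sketch: the correlated random effects line reuses the simulated dat_rs from above, while the crossed-effects data set and its variable names are hypothetical placeholders.

```r
library(lme4)

## Correlated random effects: add the group mean of a lower-level covariate
## as an additional second-level predictor
dat_rs$x_bar <- ave(dat_rs$x, dat_rs$g)          # group means of x
fit_cre <- lmer(y ~ x + x_bar + w + (1 | g), data = dat_rs, REML = TRUE)

## Crossed random effects, e.g. an age-period-cohort setup in which
## respondents belong simultaneously to survey periods and birth cohorts
## (apc_dat, age, x1, x2, period and cohort are hypothetical):
# fit_apc <- lmer(y ~ age + x1 + x2 + (1 | period) + (1 | cohort),
#                 data = apc_dat, REML = TRUE)
```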
What about Bayesian inference?
So what role remains for Bayesian analyses of hierarchical models?Footnote 7 Maybe none, as Elff et al. or McNeish (2017) might argue. An alternative view, rooted in the idea of consilience (Whewell 1858, 83–96), might call for a Bayesian specification to be estimated as a complementary robustness check. In any given application of an estimator to a fixed set of data (that is, a real research application, not a Monte Carlo simulation), the congruence between different approaches (employing different assumptions and approximations) can lend additional weight to one's results. The last few years have seen tremendous advances easing the application of Bayesian inferential procedures. In particular, the development of the Stan language (Carpenter et al. 2017) and its associated tools have abstracted away some implementation difficulties. Hierarchical models are now available in ‘pre-packaged’ form (Bürkner 2017) so that researchers can specify models using the familiar R formula interface. Bayesian inference is also possible in Stata using a simple prefix statement.Footnote 8 Note that I am not advocating that these tools be deployed without attention to the underlying details. But they are now available with much greater computational ease and can serve (without any claim of superiority) as a simple plausibility or robustness check, for example, when using the m – l – 1 heuristic in ‘REML-like’ models without access to KR or Satterthwaite approximations, or when estimating models with complex variance structures.
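As an illustration of how little overhead such a complementary check now requires, the random-intercept model from the earlier sketches can be re-estimated with brms in a few lines. This is a sketch with the package's default priors and illustrative sampler settings; in practice, priors and convergence diagnostics deserve attention.

```r
library(brms)

## Bayesian re-estimation of the random-intercept model via Stan,
## reusing the simulated dat from the earlier sketch
fit_bayes <- brm(y ~ w + (1 | g), data = dat,
                 chains = 4, iter = 2000, seed = 1234)
summary(fit_bayes)   # posterior summaries and 95 per cent credible intervals
```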
Final notes on best practice
Surely, best practice when working in a frequentist setting should be to estimate the model using REML and to present hypothesis tests using either the t-distribution with m – l – 1 degrees of freedom or the KR correction (which is now widely available in R and Stata). Elff's iimm R package makes comparing different strategies as easy as possible. When conducting more involved analyses, for example when using generalized linear models or when employing complex variance structures, such as crossed random effects or autoregressive structures, the original KR correction might not necessarily work well (Kenward and Roger 2009, 2591). I think it would be prudent to check one's analysis using an alternative (computationally expensive) strategy, be it Bayesian or frequentist (for example, a parametric bootstrap). Computational performance is such that these alternative specifications can easily be run while writing one's article.
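As one concrete example of such a check, lme4 provides parametric bootstrap confidence intervals directly. The sketch below reuses the REML fit from the earlier snippet; the number of bootstrap samples is an illustrative value.

```r
library(lme4)

## Parametric bootstrap confidence intervals as a computationally
## expensive cross-check on the REML fit from the earlier sketch
confint(fit_reml, method = "boot", nsim = 1000)
```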
A more ambitious version of this proposal is to conduct one's own application-specific Monte Carlo simulation, taking the estimated coefficients and variance-covariance matrix as the baseline data-generating process against which to compare one's chosen estimator(s). Such a data-specific simulation is easily implemented using standard software (for example, the simulate prefix in Stata or the MonteCarlo package in R) and can likewise be run while writing the article. Reporting the expected non-coverage rate of one's tests would communicate the limits (or strengths) of a given set of results more transparently.
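A minimal sketch of what such a data-specific simulation could look like in R, here using lme4's simulate() method rather than a dedicated simulation package: responses are repeatedly drawn from the fitted model, the model is refitted, and coverage of the nominal 95 per cent interval (based on the m – l – 1 heuristic) is recorded. The objects come from the earlier sketches and the number of replications is illustrative.

```r
library(lme4)

## 'True' value is the coefficient estimated from the original fit
beta_w <- fixef(fit_reml)["w"]

covers <- replicate(500, {
  dat$y_star <- simulate(fit_reml)[[1]]                 # draw from the estimated DGP
  refit_mod  <- lmer(y_star ~ w + (1 | g), data = dat, REML = TRUE)
  se <- sqrt(diag(as.matrix(vcov(refit_mod))))["w"]
  ci <- fixef(refit_mod)["w"] + c(-1, 1) * qt(0.975, df = 10 - 1 - 1) * se
  ci[1] <= beta_w && beta_w <= ci[2]                    # does the CI cover beta_w?
})
mean(covers)   # estimated coverage rate under the data-specific DGP
```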