This paper introduces a test for the comparison of multiple misspecified conditional interval models, for the case of dependent observations. Model accuracy is measured using a distributional analog of mean square error, in which the approximation error associated with a given model, say, model i, for a given interval, is measured by the expected squared difference between the conditional confidence interval under model i and the “true” one.
When comparing more than two models, a “benchmark” model is specified, and the test is constructed along the lines of the “reality check” of White (2000, Econometrica 68, 1097–1126). Valid asymptotic critical values are obtained via a version of the block bootstrap that properly captures the effect of parameter estimation error. The results of a small Monte Carlo experiment indicate that the test does not have unreasonable finite sample properties, given small samples of 60 and 120 observations, although the results do suggest that larger samples should likely be used in empirical applications of the test.

The authors express their gratitude to Don Andrews and an anonymous referee for providing numerous useful suggestions, all of which we feel have been instrumental in improving earlier drafts of this paper. The authors also thank Russell Davidson, Clive Granger, Lutz Kilian, Christelle Viaroux, and seminar participants at the 2002 UK Econometrics Group meeting in Bristol, the 2002 European Econometric Society meetings, the 2002 University of Pennsylvania NSF-NBER time series conference, the 2002 EC2 Conference in Bologna, Cornell University, the State University of New York at Stony Brook, and the University of California at Davis for many helpful comments and suggestions on previous versions of this paper.
There are several instances in which merely having a “good” model for the conditional mean and/or variance may not be adequate for the task at hand. For example, financial risk management involves tracking the entire distribution of a portfolio or measuring certain distributional aspects, such as value at risk (see, e.g., Duffie and Pan, 1997). In such cases, models of conditional mean and/or variance may not be satisfactory.
A very small subset of important contributions that go beyond the examination of models of conditional mean and/or variance includes papers that assess the correctness of conditional interval predictions (see, e.g., Christoffersen, 1998); assess volatility predictability by comparing unconditional and conditional interval forecasts (see, e.g., Christoffersen and Diebold, 2000); and assess conditional quantiles (see, e.g., Giacomini and Komunjer, 2005).
Prediction confidence intervals are also discussed in Granger, White, and Kamstra (1989), Chatfield (1993), Diebold, Tay, and Wallis (1998), Clements and Taylor (2001), and the references cited therein.
All of the papers cited in the preceding paragraph consider a null hypothesis of correct dynamic specification of the conditional distribution or of a given conditional confidence interval.
One exception is the approach taken by Corradi and Swanson (2005a), who consider testing the null of correct specification of the conditional distribution for a given information set, thus allowing for dynamic misspecification under both hypotheses.
Assume that the object of interest is a conditional interval model for a scalar random variable, Yt, given a (possibly vector valued) conditioning set, Zt, where Zt contains lags of Yt and/or other variables. In particular, given a group of (possibly) misspecified conditional interval models, say, (F1(ū|Zt,θ1†) − F1(u̲|Zt,θ1†),…,Fm(ū|Zt,θm†) − Fm(u̲|Zt,θm†)), assume that the objective is to compare these models in terms of their closeness to the true conditional interval, F0(ū|Zt,θ0) − F0(u̲|Zt,θ0) = Pr(u̲ ≤ Yt ≤ ū|Zt). If m > 2, we follow White (2000). Namely, we choose a particular model as the “benchmark” and test the null hypothesis that no competing model can provide a more accurate approximation of the “true” model against the alternative that at least one competitor outperforms the benchmark. Needless to say, pairwise comparison of alternative models, in which no benchmark need be specified, follows as a special case. In our context, accuracy is measured using a distributional analog of mean square error. More precisely, the squared (approximation) error associated with model i, i = 1,…,m, is measured in terms of E(((Fi(ū|Zt,θi†) − Fi(u̲|Zt,θi†)) − (F0(ū|Zt,θ0) − F0(u̲|Zt,θ0)))²), where u̲, ū ∈ U, and U is a possibly unbounded set on the real line.
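To fix notation, the accuracy measure just described can be written out in display form as follows (a LaTeX transcription of the definition above; the label SAE_i is ours, introduced only for exposition):

    % Squared approximation error of model i over the interval (\underline{u}, \bar{u})
    \mathrm{SAE}_i
      = E\Big[\Big(\big(F_i(\bar{u}\mid Z^t,\theta_i^{\dagger}) - F_i(\underline{u}\mid Z^t,\theta_i^{\dagger})\big)
          - \big(F_0(\bar{u}\mid Z^t,\theta_0) - F_0(\underline{u}\mid Z^t,\theta_0)\big)\Big)^{2}\Big],
      \qquad i = 1,\dots,m.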
It should be pointed out that one well-known measure of distributional accuracy is the KLIC, in the sense that the “most accurate” model can be shown to be that which minimizes the KLIC (see Section 2 for a more detailed discussion). For the independent and identically distributed (i.i.d.) case, Vuong (1989) suggests a likelihood ratio test for choosing the conditional density model that is closest to the “true” conditional density in terms of the KLIC. Additionally, Giacomini (2002) suggests a weighted version of the Vuong likelihood ratio test for the case of dependent observations, whereas Kitamura (2002) employs a KLIC-based approach to select among misspecified conditional models that satisfy given moment conditions.
Of note is that White (1982) shows that QMLEs minimize the KLIC under mild conditions.
The rest of the paper is organized as follows. Section 2 states the hypothesis of interest and describes the test statistic that will be examined. In Section 3.1, it is shown that the limiting distribution of the statistic (properly recentered) is a functional of a zero mean Gaussian process, with a covariance kernel that reflects both the contribution of parameter estimation error and the effect of (dynamic) misspecification. Section 3.2 discusses the construction of asymptotically valid critical values. This is done via an extension of White's (2000) bootstrap approach to the case of nonvanishing parameter estimation error. The results of a small Monte Carlo experiment are collected in Section 4, and concluding remarks are given in Section 5. Proofs of results stated in the text are given in the Appendix.
Hereafter, P* denotes the probability law governing the resampled series, conditional on the sample; E* and Var* are the mean and variance operators associated with P*; oP*(1), Pr-P denotes a term converging to zero in P*-probability, conditional on the sample and for all samples except a subset with probability measure approaching zero; and OP*(1), Pr-P denotes a term that is bounded in P*-probability, conditional on the sample and for all samples except a subset with probability measure approaching zero. Analogously, Oa.s.*(1) and oa.s.*(1) denote terms that are almost surely bounded or that approach zero almost surely, according to the probability law P* and conditional on the sample.
Our objective is to select among alternative conditional confidence interval models by using parametric conditional distributions for a scalar random variable, Yt, given Zt, where Zt = (Yt−1,…,Yt−s1,Xt,…,Xt−s2+1) with s1,s2 finite. Note that although we assume s1 and s2 are finite, we do not require (Yt,Xt) to be Markovian. In fact, Zt might not contain the entire (relevant) history, and all models may be dynamically misspecified.
Define the group of conditional interval models from which one is to make a selection as (F1(ū|Zt,θ1†) − F1(u̲|Zt,θ1†),…,Fm(ū|Zt,θm†) − Fm(u̲|Zt,θm†)), and define the true conditional interval as

F0(ū|Zt,θ0) − F0(u̲|Zt,θ0) = Pr(u̲ ≤ Yt ≤ ū|Zt).
Hereafter, assume that θi† ∈ Θi, where Θi is a compact set in a finite-dimensional Euclidean space, and let θi† be the probability limit of a quasi-maximum likelihood estimator (QMLE) of the parameters of the conditional distribution under model i. If model i is correctly specified, then θi† = θ0. As mentioned in the introduction, accuracy is measured in terms of a distributional analog of mean square error. In particular, we say that model 1 is more accurate than model 2 if

E(((F1(ū|Zt,θ1†) − F1(u̲|Zt,θ1†)) − (F0(ū|Zt,θ0) − F0(u̲|Zt,θ0)))²) < E(((F2(ū|Zt,θ2†) − F2(u̲|Zt,θ2†)) − (F0(ū|Zt,θ0) − F0(u̲|Zt,θ0)))²).
This measure defines a norm and implies a standard goodness of fit measure.
As mentioned previously, a very well-known measure of distributional accuracy that is already available in the literature is the KLIC (see, e.g., White, 1982; Vuong, 1989; Giacomini, 2002; Kitamura, 2002), according to which we should choose model 1 over model 2 if

E(ln f1(Yt|Zt,θ1†) − ln f2(Yt|Zt,θ2†)) > 0.
The KLIC is a sensible measure of accuracy, as it chooses the model that on average gives higher probability to events that have actually occurred. Also, it leads to simple likelihood ratio type tests. Interestingly, Fernandez-Villaverde and Rubio-Ramirez (2004) have shown that the best model under the KLIC is also the model with the highest posterior probability. However, if we are interested in measuring accuracy for a given conditional confidence interval, this cannot be easily done using the KLIC. For example, if we want to evaluate the accuracy of different models for approximating the probability that the rate of inflation tomorrow, given the rate of inflation today, will be between 0.5% and 1.5%, say, this cannot be done in a straightforward manner using the KLIC. On the other hand, our approach gives an easy way of addressing questions of this type. In this sense, we believe that our approach provides a reasonable alternative to the KLIC.
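To make the contrast concrete, the two selection rules can be written side by side (a sketch in the paper's notation; the shorthand ΔF_i is ours):

    % KLIC-based selection: choose model 1 over model 2 if
    E\big[\ln f_1(Y_t\mid Z^t,\theta_1^{\dagger}) - \ln f_2(Y_t\mid Z^t,\theta_2^{\dagger})\big] > 0.
    % Interval-based selection (this paper): choose model 1 over model 2 if
    E\big[(\Delta F_1 - \Delta F_0)^2\big] < E\big[(\Delta F_2 - \Delta F_0)^2\big],
    \quad \Delta F_i = F_i(\bar{u}\mid Z^t,\theta_i^{\dagger}) - F_i(\underline{u}\mid Z^t,\theta_i^{\dagger}).

The first rule ranks models by expected log-likelihood over the entire distribution; the second is specific to the chosen interval (u̲, ū), which is what permits statements about, say, the 0.5%–1.5% inflation band.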
In what follows, model 1 is taken as the benchmark model, and the objective is to test whether some competitor model can provide a more accurate approximation of F0(ū|·,θ0) − F0(u̲|·,θ0) than the benchmark. The null and the alternative hypotheses are

H0: maxk=2,…,m (μ1² − μk²) ≤ 0 versus HA: maxk=2,…,m (μ1² − μk²) > 0,

where μj² = E((1{u̲ ≤ Yt ≤ ū} − (Fj(ū|Zt,θj†) − Fj(u̲|Zt,θj†)))²), j = 1,…,m; as shown in equation (6) below, μ1² − μk² is exactly the difference between the squared approximation errors of models 1 and k.
Alternatively, if interest focuses on testing the null of equal accuracy of two conditional confidence interval models, say, models 1 and 2, we can simply state the hypotheses as

H0′: μ1² − μ2² = 0 versus HA′: μ1² − μ2² ≠ 0.
Needless to say, if the benchmark model is correctly specified, we do not reject the null. Related tests that instead focus on correct dynamic specification of conditional interval models (as opposed to allowing for misspecification under both hypotheses, as is done with all of our tests) are discussed in Christoffersen (1998).
If the objective is to test for the correct specification of a single conditional interval model, say, model 1, for a given information set, then we can define the hypotheses as

H0′′: Pr(F1(ū|Zt,θ1†) − F1(u̲|Zt,θ1†) = F0(ū|Zt,θ0) − F0(u̲|Zt,θ0)) = 1 versus HA′′: the negation of H0′′.
In the definition of H0′′, θ1† should be replaced by θ0 if Zt is meant as the information set including all the relevant history.
Tests of this sort that consider the correct specification of the conditional distribution for a given information set (i.e., conditional distribution tests that allow for the possibility of dynamic misspecification under both hypotheses) are discussed in Corradi and Swanson (2005a).
To test H0 versus HA, form the following statistic:

ZT = maxk=2,…,m ZT(1,k),   (1)

where

ZT(1,k) = (1/√T) Σ_{t=s}^{T} ((1{u̲ ≤ Yt ≤ ū} − (F1(ū|Zt,θ̂1,T) − F1(u̲|Zt,θ̂1,T)))² − (1{u̲ ≤ Yt ≤ ū} − (Fk(ū|Zt,θ̂k,T) − Fk(u̲|Zt,θ̂k,T)))²),   (2)

with s = max{s1,s2} and

θ̂i,T = arg max_{θi∈Θi} (1/T) Σ_{t=s}^{T} ln fi(Yt|Zt,θi),   (3)
where fi(Yt|Zt,θi) is the conditional density under model i. As fi(·|·) does not in general coincide with the true conditional density, θ̂i,T is the QMLE, and θi† ≠ θ0, in general. More broadly speaking, the results discussed subsequently hold for any estimator for which √T(θ̂i,T − θi†) is asymptotically normal. This is the case for several extremum estimators, such as (nonlinear) least squares and (Q)MLE. However, it is not advisable to use overidentified generalized method of moments (GMM) estimators because √T(θ̂i,T − θi†)
is not asymptotically normal, in general, when model i is not correctly specified (see, e.g., Hall and Inoue, 2003). Needless to say, if interest focuses on testing H0′ versus HA′, one should use the statistic ZT(1,2), and if interest focuses on testing H0′′ versus HA′′, the appropriate test statistic is

(1/√T) Σ_{t=s}^{T} (1{u̲ ≤ Yt ≤ ū} − (F1(ū|Zt,θ̂1,T) − F1(u̲|Zt,θ̂1,T))),   (4)
which is a special case of the statistic considered in Theorem 2 of Corradi and Swanson (2005a) in the context of testing for the correct specification of the “entire” conditional distribution for a given information set. The limiting distribution of (4) and the construction of valid critical values via the bootstrap follow from Theorems 2 and 4 in the paper by Corradi and Swanson (2005a), who also provide some Monte Carlo evidence. Discussion of the test statistic in (4) in relation to the existing literature on testing for the correct conditional distribution is given in the paper just mentioned.
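As a concrete illustration of the statistic in equations (1) and (2), the following is a minimal Python sketch, assuming the fitted interval probabilities Fi(ū|Zt,θ̂i,T) − Fi(u̲|Zt,θ̂i,T) have already been computed for each model; all function and variable names are ours:

    import numpy as np

    def interval_errors_sq(y, lo, hi, interval_probs):
        """Per-observation squared 'error': (1{lo <= y_t <= hi} - interval_probs_t)^2."""
        hit = ((y >= lo) & (y <= hi)).astype(float)
        return (hit - interval_probs) ** 2

    def reality_check_stat(y, lo, hi, probs_benchmark, probs_competitors):
        """Z_T = max_k Z_T(1,k), where Z_T(1,k) is the (1/sqrt(T))-scaled sum of
        differences between the benchmark's and competitor k's squared interval errors."""
        T = len(y)
        e1 = interval_errors_sq(y, lo, hi, probs_benchmark)
        z = [np.sum(e1 - interval_errors_sq(y, lo, hi, pk)) / np.sqrt(T)
             for pk in probs_competitors]
        return max(z)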
The intuition behind equation (2) is very simple. First, note that E(1{u̲ ≤ Yt ≤ ū}|Zt) = Pr(u̲ ≤ Yt ≤ ū|Zt) = F0(ū|Zt,θ0) − F0(u̲|Zt,θ0). Thus, 1{u̲ ≤ Yt ≤ ū} − (Fi(ū|Zt,θi†) − Fi(u̲|Zt,θi†)) can be interpreted as an “error” term associated with computation of the conditional expectation, under Fi. Now, write the statistic in equation (2) as

ZT(1,k) = (1/√T) Σ_{t=s}^{T} (((1{u̲ ≤ Yt ≤ ū} − (F1(ū|Zt,θ̂1,T) − F1(u̲|Zt,θ̂1,T)))² − μ1²) − ((1{u̲ ≤ Yt ≤ ū} − (Fk(ū|Zt,θ̂k,T) − Fk(u̲|Zt,θ̂k,T)))² − μk²)) + √T(μ1² − μk²),   (5)
where μj² = E((1{u̲ ≤ Yt ≤ ū} − (Fj(ū|Zt,θj†) − Fj(u̲|Zt,θj†)))²), j = 1,…,m. In the Appendix, it is shown that the first term in equation (5) weakly converges to a Gaussian process. Also, for j = 1,…,m,

μj² = E((1{u̲ ≤ Yt ≤ ū} − (F0(ū|Zt,θ0) − F0(u̲|Zt,θ0)))²) + E(((Fj(ū|Zt,θj†) − Fj(u̲|Zt,θj†)) − (F0(ū|Zt,θ0) − F0(u̲|Zt,θ0)))²),
given that the expectation of the cross product is zero (which follows because 1{u̲ ≤ Yt ≤ ū} − (F0(ū|Zt,θ0) − F0(u̲|Zt,θ0)) is uncorrelated with any measurable function of Zt). Therefore,

μ1² − μk² = E(((F1(ū|Zt,θ1†) − F1(u̲|Zt,θ1†)) − (F0(ū|Zt,θ0) − F0(u̲|Zt,θ0)))²) − E(((Fk(ū|Zt,θk†) − Fk(u̲|Zt,θk†)) − (F0(ū|Zt,θ0) − F0(u̲|Zt,θ0)))²).   (6)
Before outlining the asymptotic properties of the statistic in equation (1), two comments are worth making.
First, following the reality check approach of White (2000), the problem of testing multiple hypotheses has been reduced to a single test by applying the (single-valued) max function to multiple hypotheses. This approach has the advantage that it avoids sequential testing bias and also captures the correlation across the various models. On the other hand, if we reject the null, we can conclude that there is at least one model that outperforms the benchmark, but we do not have available to us a complete picture concerning which model(s) contribute to the rejection of the null. Of course, some information can be obtained by looking at the distributional analog of mean square error associated with the various models and forming a crude ranking of the models, although the usual cautions associated with using a mean square error type measure to rank models should be taken. Alternatively, our approach can be complemented by a multiple comparison approach, such as the false discovery rate (FDR) approach of Benjamini and Hochberg (1995), which allows one to select among alternative groups of models, in the sense that one can assess which group(s) contribute to the rejection of the null. The FDR approach has the objective of controlling the expected number of false rejections, and in practice one computes p-values associated with the m hypotheses and orders these p-values in increasing fashion, say, P1 ≤ ··· ≤ Pi ≤ ··· ≤ Pm. Then, all hypotheses characterized by Pi ≤ (1 − (i − 1)/m)α are rejected, where α is a given significance level. Such an approach, though less conservative than the Hochberg (1988) approach, is still conservative as it provides bounds on p-values. Overall, we think that a sound practical strategy could be to first implement our reality check type tests. These tests can then be complemented by using a multiple comparison approach, yielding a better overall understanding concerning which model(s) contribute to the rejection of the null, if it is indeed rejected. If the null is not rejected, then we simply choose the benchmark model. Nevertheless, even in this case, it may not hurt to see whether some of the individual hypotheses in the joint null are rejected via a multiple test comparison approach.
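For completeness, the p-value ordering rule described in this paragraph can be transcribed directly into code (a sketch that implements the rejection rule exactly as stated above; note that the standard Benjamini-Hochberg step-up procedure uses a different threshold):

    import numpy as np

    def fdr_reject(pvalues, alpha=0.10):
        """Order p-values increasingly and reject all hypotheses with
        P_(i) <= (1 - (i - 1)/m) * alpha, as described in the text."""
        p = np.sort(np.asarray(pvalues))
        m = len(p)
        i = np.arange(1, m + 1)
        return p <= (1 - (i - 1) / m) * alpha  # boolean mask over the ordered p-values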
Second, it is perhaps worth pointing out that simulation-based versions of the tests discussed here are given in Corradi and Swanson (2005b), in the context of the evaluation of dynamic stochastic general equilibrium models.
The results stated subsequently require the following assumption.
Assumption A: (i) (Yt,Xt) is a strictly stationary and absolutely regular β-mixing process with size −4; (ii) for i = 1,…,m, Fi(u|Zt,θi) is continuously differentiable on the interior of Θi, where Θi is a compact set in ℜ^{pi}, and ∇θi Fi(u|Zt,θi†) is 2r-dominated on Θi uniformly in u, with r > 2;
We say that ∇θi Fi(u|Zt,θi) is 2r-dominated on Θi uniformly in u if its kth element, k = 1,…,pi, is such that |∇θi Fi(u|Zt,θi)|k ≤ Dt(u) and supu∈ℜ E(|Dt(u)|^{2r}) < ∞. For more details on domination conditions, see Gallant and White (1988, p. 33).
is positive definite; and (vi) let
for k = 2,…,m. Define analogous covariance terms vjk, j,k = 2,…,m, and assume that the matrix [vjk] is positive definite.
Recalling that Zt = (Yt−1,…,Yt−s1,Xt,…,Xt−s2+1), A(i) ensures that Zt is strictly stationary and mixing with size −4. Note that A(vi) requires at least one of the competing models to be neither nested in nor nesting the benchmark model. The nonnestedness of at least one competitor ensures that the long-run covariance matrix is positive definite even in the absence of parameter estimation error. However, assumption A(vi) can be relaxed, in which case the limiting distribution of the test statistic takes exactly the same form as given in Theorem 1, which follows, except that the covariance kernel contains only terms that reflect parameter estimation error.
Note that in White (2000), the nonnestedness of at least one competitor is a necessary condition, given that in his context parameter estimation error vanishes asymptotically, whereas in the present context it does not. More precisely, White (2000) considers out-of-sample comparison, using the first R observations for model estimation and the last P observations for model validation, where T = P + R. Parameter estimation error vanishes in his setup either because P/R → 0 or because the same loss function is used for estimation and model validation.
THEOREM 1. Let Assumption A hold. Then

maxk=2,…,m (ZT(1,k) − √T(μ1² − μk²)) ⇒ maxk=2,…,m Z1,k,
where Z1,k is a zero mean Gaussian process with covariance ckk = vkk + pkk + pckk; here vkk denotes the component of the long-run covariance matrix that would obtain in the absence of parameter estimation error, pkk denotes the contribution of parameter estimation error, and pckk denotes the covariance across the two components. The components vkk, pkk, and pckk are given in equations (10)–(12).
Note that the recentered statistic actually differs slightly from ZT(1,k) − √T(μ1² − μk²); however, for notational simplicity, and given that the two are asymptotically equivalent, we use this “approximation” both in the text and in the Appendix. Note also that mθi† depends on the chosen interval (u̲, ū); for notational simplicity, we omit this dependence.
As an immediate corollary, note the following result.
COROLLARY 2. Let Assumptions A(i)–(v) hold and suppose A(vi) is violated. Then the result of Theorem 1 continues to hold, with the limit being the maximum of zero mean normal random variables whose covariance is equal to pkk, as defined in equations (10)–(12).
From Theorem 1 and Corollary 2, it follows that when all competing models provide an approximation to the true conditional interval model that is as (mean square) accurate as that provided by the benchmark (i.e., when μ1² − μk² = 0, ∀k), the limiting distribution corresponds to the maximum of an (m − 1)-dimensional zero-mean normal random vector, with a covariance kernel that reflects both the contribution of parameter estimation error and the dependence structure of the data. Additionally, when all competitor models are worse than the benchmark, the statistic diverges to minus infinity at rate √T. Finally, when only some competitor models are worse than the benchmark, the limiting distribution provides a conservative test, as ZT will always be smaller than maxk=2,…,m (ZT(1,k) − √T(μ1² − μk²)), asymptotically, and therefore the critical values of maxk=2,…,m Z1,k provide upper bounds for the critical values of maxk=2,…,m ZT(1,k). Of course, when HA holds, the statistic diverges to plus infinity at rate √T. It is well known that the maximum of a normal random vector is not a normal random variable, and hence critical values cannot immediately be tabulated. In a related paper, White (2000) suggests obtaining critical values either via Monte Carlo simulation or via use of the bootstrap. Here, we focus on use of the bootstrap, although White's results do not apply in our case, as the contribution of parameter estimation error does not vanish in our setup and hence must be properly taken into account when forming critical values. Before turning our attention to the bootstrap, however, we briefly outline an out-of-sample version of our test statistic.
Thus far, we have compared conditional interval models via a distributional generalization of in-sample mean square error. Needless to say, an out-of-sample version of the statistic may also be constructed. Let T = R + P, and let θ̂i,t denote a recursive estimator computed using the observations up through time t, for t = R, R + 1, …, R + P − 1. A one-step-ahead out-of-sample version of the statistic in equations (1) and (2) is then given by the analog of ZT in which the sum runs over the P out-of-sample observations, each summand evaluates the interval for Yt+1 using the recursively estimated parameters θ̂i,t, and the normalization is 1/√P.
Now, Theorem 1 and Corollary 2 still apply (Corollary 2 requires P/R → π > 0), although the covariance matrices will be slightly different. However, Theorem 3 (in Section 3.2) no longer applies: because of the use of recursive estimation, the block bootstrap is no longer valid, and it is indeed characterized by a bias term whose sign varies across samples. This issue is studied by Corradi and Swanson (2004a), who propose a proper recentering of the quasi-likelihood function.
Corradi and Swanson (2004a) study the case of rolling estimators.
In this section we outline how to obtain valid critical values for the asymptotic distribution of maxk=2,…,m ZT(1,k), via a version of the block bootstrap that properly captures the contribution of parameter estimation error to the covariance kernel associated with the limiting distribution of the test statistic.
In principle, we could have obtained an estimator for C = [ckj], as defined in the statement of Theorem 1, that takes into account the contribution of parameter estimation error, say Ĉ. Then, we could draw N (m − 1)-dimensional standard normal random vectors, say, η(i), i = 1,…,N; for each i, form Ĉ^{1/2}η(i) and take the maximum of its m − 1 elements; and finally compute the empirical distribution of the N maxima. However, as pointed out by White (2000), when the sample size is moderate and the number of models is large, Ĉ is a rather poor estimator for C.
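In code, the simulation alternative described in this footnote would look roughly as follows (a sketch; C_hat stands for a hypothetical positive definite estimate of C, which, as just noted, may be poor in moderate samples):

    import numpy as np

    def max_normal_critical_value(C_hat, N=10000, alpha=0.10, seed=0):
        """Draw N standard normal (m-1)-vectors eta, form C_hat^{1/2} * eta,
        take the max element of each draw, and return the (1 - alpha)
        quantile of the N maxima."""
        rng = np.random.default_rng(seed)
        L = np.linalg.cholesky(C_hat)  # one choice of square root of C_hat
        maxima = (rng.standard_normal((N, C_hat.shape[0])) @ L.T).max(axis=1)
        return np.quantile(maxima, 1 - alpha)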
To show the first-order validity of the bootstrap, we shall obtain the limiting distribution of the bootstrap statistic and show that it coincides with the limiting distribution given in Theorem 1. As all candidate models are potentially misspecified under both hypotheses, the parametric bootstrap is not generally applicable in our context. In fact, if observations are resampled from one of the candidate models, then we cannot ensure that the resampled statistic has the appropriate limiting distribution. Our approach is thus to establish the first-order validity of the block bootstrap in the presence of parameter estimation error, by drawing in part upon results of Goncalves and White (2002, 2004).
Goncalves and White (2002, 2004) consider the more general case of heterogeneous and near epoch dependent observations.
Assume that bootstrap samples are formed as follows. Let Wt = (Yt,Zt). Draw b overlapping blocks of length l from Ws,…,WT, where s = max{s1,s2}, so that bl = T − s. Thus, Ws*,…,Ws+l−1*,…,WT−l+1*,…,WT* is equal to WI1+1,…,WI1+l,…,WIb+1,…,WIb+l, where the Ii, i = 1,…,b, are i.i.d. discrete uniform random variates on {s − 1, s,…, T − l}. It follows that, conditional on the sample, the pseudo time series Wt*, t = s,…,T, consists of b i.i.d. blocks of length l.
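A minimal Python sketch of this resampling scheme (function names are ours; W is the array of observations Ws,…,WT):

    import numpy as np

    def block_bootstrap_sample(W, l, rng):
        """Draw b = len(W) // l overlapping blocks of length l, with i.i.d.
        discrete uniform start points, and lay them end to end, as described above."""
        n = len(W)
        b = n // l
        starts = rng.integers(0, n - l + 1, size=b)  # i.i.d. uniform block starts
        idx = np.concatenate([np.arange(s0, s0 + l) for s0 in starts])
        return W[idx]

    # Usage: W_star = block_bootstrap_sample(W, l=5, rng=np.random.default_rng(0))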
Now, consider the bootstrap analog of ZT. Define the block bootstrap QMLE as

θ̂*i,T = arg max_{θi∈Θi} (1/T) Σ_{t=s}^{T} ln fi(Yt*|Zt*,θi),

and define the bootstrap statistic, ZT* = maxk=2,…,m ZT*(1,k), as the analog of equations (1) and (2), computed using the resampled observations and θ̂*i,T and recentered around the corresponding original-sample quantities (see the Appendix).
It should be pointed out that ln fi(Yt|Zt,θi) and ln fi(Yt*|Zt*,θi) can be replaced by generic functions mi(Yt,Zt,θi) and mi(Yt*,Zt*,θi), provided they satisfy Assumptions A and A2.1 in Goncalves and White (2004) and provided the first-order conditions E(∇θi mi(Yt,Zt,θi†)) = 0 hold. Thus, the results for QMLE extend straightforwardly to generic m-estimators, such as nonlinear least squares or exactly identified GMM. On the other hand, they do not apply to overidentified GMM, as the bootstrap moment conditions no longer have mean zero at the estimated parameters; in that case, even for first-order validity, one has to properly recenter mi(Yt*,Zt*,θi) (see, e.g., Hall and Horowitz, 1996; Andrews, 2002; Inoue and Shintani, 2004).
THEOREM 3. Let Assumption A hold. If l → ∞ and l/T^{1/2} → 0 as T → ∞, then for any ε > 0,

lim_{T→∞} P(sup_{v∈ℜ} |P*(ZT* ≤ v) − P(maxk=2,…,m (ZT(1,k) − √T(μ1² − μk²)) ≤ v)| > ε) = 0,

where P* denotes the probability law of the resampled series, conditional on the sample, and μ1² − μk² is defined as in equation (6).
The preceding result suggests proceeding in the following manner. For any bootstrap replication, compute the bootstrap statistic, ZT*. Perform B bootstrap replications (B large) and compute the quantiles of the empirical distribution of the B bootstrap statistics. Reject H0 if ZT is greater than the (1 − α)th quantile; otherwise, do not reject. Now, for all samples except a set with probability measure approaching zero, ZT has the same limiting distribution as the corresponding bootstrap statistic when μ1² − μk² = 0, ∀k, which is the least favorable case under the null hypothesis. Thus, the preceding approach ensures that the test has asymptotic size α. When one or more, but not all, of the competing models are strictly dominated by the benchmark, the approach ensures that the test has asymptotic size between 0 and α; when all competing models are dominated by the benchmark, the statistic diverges to minus infinity, so that the rule implies zero asymptotic size. Finally, under the alternative, ZT diverges to (plus) infinity, whereas the corresponding bootstrap statistic has a well-defined limiting distribution, ensuring unit asymptotic power. From the previous discussion, we see that the bootstrap distribution provides correct asymptotic critical values only for the least favorable case under the null hypothesis, that is, when all competitor models are as good as the benchmark model. When maxk=2,…,m(μ1² − μk²) = 0 but (μ1² − μk²) < 0 for some k, the bootstrap critical values lead to conservative inference. An alternative to our bootstrap critical values in this case is to construct critical values using subsampling (see, e.g., Politis, Romano, and Wolf, 1999, Ch. 3). Heuristically, construct T − 2bT statistics using subsamples of length bT, where bT/T → 0. The empirical distribution of these statistics computed over the various subsamples properly mimics the distribution of the statistic. Thus, subsampling provides valid critical values even for the case where maxk=2,…,m(μ1² − μk²) = 0 but (μ1² − μk²) < 0 for some k. This is the approach used by Linton, Maasoumi, and Whang (2003), for example, in the context of testing for stochastic dominance. Needless to say, one problem with subsampling is that unless the sample is very large, the empirical distribution of the subsampled statistics may yield a poor approximation of the limiting distribution of the statistic.
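The decision rule just described is straightforward to implement (a sketch; bootstrap_stats stands for the array of B bootstrap statistics ZT*, however they are produced):

    import numpy as np

    def bootstrap_test(Z_T, bootstrap_stats, alpha=0.10):
        """Reject H0 if Z_T exceeds the (1 - alpha) quantile of the B
        bootstrap statistics Z_T*; otherwise, do not reject."""
        critical_value = np.quantile(np.asarray(bootstrap_stats), 1 - alpha)
        return Z_T > critical_value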
Hansen (2005) points out that the conservative nature of the reality check of White (2000) leads to reduced power and that it should be feasible to improve the power and reduce the sensitivity of the reality check test to poor and irrelevant alternatives via use of the modified reality check test outlined in his paper. Given the similarity between the approach taken in our paper and that taken by White (2000), it may also be possible to improve our test performance using the approach of Hansen (2005) to modify our test.
The experimental setup used in this section is as follows. We begin by generating (yt,yt−1,wt,xt,qt)′ as St(0,Σ,v) random variates, where St(0,Σ,v) denotes a Student's t distribution with mean zero, variance Σ, and v degrees of freedom, and where the diagonal elements of Σ are σ², σ², σW², σX², and σQ², with σ12 denoting the covariance between yt and yt−1.
The data generating process (DGP) of interest is the conditional distribution of yt given yt−1 implied by this joint distribution (see, e.g., Spanos, 1999), with conditional mean αyt−1, where α = σ12/σ², so that the conditional mean is a linear function of yt−1 and the conditional variance is a linear function of yt−1².
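For concreteness, the standard conditional distribution for a bivariate Student's t vector matches this description (a reconstruction based on the stated properties, not the paper's own display; it assumes (yt, yt−1) are jointly St with common variance σ² and covariance σ12):

    y_t \mid y_{t-1} \;\sim\; \mathrm{St}\!\left(\alpha\, y_{t-1},\;
        \frac{v + y_{t-1}^{2}/\sigma^{2}}{v+1}\,(\sigma^{2} - \alpha\,\sigma_{12}),\; v+1\right),
    \qquad \alpha = \sigma_{12}/\sigma^{2},

so that the conditional mean is linear in yt−1 and the conditional variance is linear in yt−1², as stated.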
In our experiments, we impose misspecification upon all estimated models by assuming normality (i.e., we assume that Fi, i = 1,…,m, is the normal c.d.f.). Our objective is to ascertain whether a given benchmark model is “better,” in the sense of having lower squared approximation error, than two given alternative models; thus, m = 3. Level and power experiments are defined by adjusting the conditioning information sets used to estimate (via QMLE) the parameters of each conditional model and subsequently to form the test statistic. In all experiments, values of α ∈ {0.4, 0.6, 0.8, 0.9} are used, samples of T = 60 and T = 120 are tried, v = 5, σ² = 1, and σX² = σW² = σQ² ∈ {0.1, 1.0, 10.0}. Throughout, the conditional confidence interval version of the test is constructed, and the upper and lower bounds of the interval are fixed at μY + γσY and μY − γσY, respectively, where μY and σY are the mean and standard deviation of yt and where γ = ½.
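In code, the fixed interval used throughout the experiments, and the fitted interval probability under a (misspecified) conditional normal model, can be sketched as follows (names are ours):

    import numpy as np
    from scipy.stats import norm

    def fixed_interval(y, gamma=0.5):
        """Interval bounds mu_Y - gamma*sigma_Y and mu_Y + gamma*sigma_Y."""
        mu, sigma = np.mean(y), np.std(y)
        return mu - gamma * sigma, mu + gamma * sigma

    def normal_interval_prob(lo, hi, cond_mean, cond_std):
        """Fitted interval probability F_i(hi|Z_t) - F_i(lo|Z_t) under the
        assumed conditional normal model."""
        return norm.cdf(hi, loc=cond_mean, scale=cond_std) - norm.cdf(lo, loc=cond_mean, scale=cond_std)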
Findings corresponding to other choices of the interval are very similar and are available from the authors upon request.
Additional results for other parameterizations, and for cases where critical values are constructed using 250 bootstrap replications, are available upon request and yield qualitatively similar results to those reported in Tables 1 and 2.
In these experiments, we define the conditioning variable sets as follows. For the benchmark model (F1), we use a conditioning set that is a proper subset of Zt. For the two alternative models (F2 and F3), we augment this conditioning set with the variables xt, wt, and qt. In this case, the estimated coefficients associated with xt, wt, and qt have probability limits equal to zero, as none of these variables enters into the true conditional mean function. In addition, all models are misspecified, as conditional normality is assumed throughout. Therefore, the benchmark and the two competitors are equally misspecified. Finally, the limiting distribution of the test statistic in this case is driven by parameter estimation error, as assumption A(vi) does not hold (see Corollary 2 for this case).
In these experiments, we set the conditioning variable sets as follows. For the benchmark model (F1), the conditioning set excludes yt−1. For the two alternative models (F2 and F3), the conditioning set of F2 includes yt−1, whereas that of F3 does not. In this manner, it is ensured that the first of the two alternative models has smaller squared approximation error than the benchmark model. In fact, all three models are incorrect for both the marginal distribution (normal instead of Student's t) and the conditional variance, which is set equal to its unconditional value instead of being a linear function of yt−1². However, one of the competitors, model 2, is correctly specified for the conditional mean, whereas the other two are not. Therefore, model 2 is characterized by a smaller squared approximation error.
Our findings are summarized in Table 1 (empirical level experiments) and Table 2 (empirical power experiments). In these tables, the first column reports the value of α used in a particular experiment, and the remaining entries are rejection frequencies of the null hypothesis that the benchmark model is not outperformed by any of the alternative models. A number of conclusions emerge upon inspection of the tables. Turning first to the empirical level results given in Table 1, note, for example, that the empirical level varies from values grossly above nominal levels (when block lengths and values of α are large) to values below or close to nominal levels (when values of α are smaller). However, it is often the case that moving from 60 to 120 observations results in rejection frequencies closer to the nominal level of the test, as expected (with the exception that the test becomes even more conservative when l is 5 or 6, in many cases). Notice also that when α = 0.4 (low persistence) a block length of 2 usually suffices to capture the dependence structure of the series, whereas for α = 0.9 (high persistence) a larger block length is necessary. Finally, it is worth noting that, overall, the empirical rejection frequencies are not too distant from nominal levels, a result that is somewhat surprising given the small sample sizes used in our experiments. However, the test could clearly be expected to exhibit improved behavior were larger samples of data used.
Table 1. Empirical level experiments: interval = μY ± ½σY
Table 2. Empirical power experiments: interval = μY ± ½σY
With regard to empirical power (see Table 2), note that rejection frequencies increase as α increases. This is not surprising, as the contribution of yt−1 to the conditional mean, which is neglected by models 1 and 3, becomes more substantial as α increases. Overall, for α ≥ 0.6 and for a nominal level of 10%, rejection frequencies are above 0.5 in many cases, again suggesting the need for larger samples.
Note that our Monte Carlo findings are not directly comparable with those of Christoffersen (1998), as his null corresponds to correct dynamic specification of the conditional interval model.
As noted before, rejection frequencies are sensitive to the choice of the block size parameter. This suggests that it should be useful to choose the block length in a data-driven manner. One way in which this may be accomplished is by use of a two-step procedure, as follows (see the sketch below). First, one defines the optimal rate at which the block length should grow as the sample grows; this rate usually depends on the object of interest (e.g., the focus is confidence intervals in our setup; see Lahiri, 2003, Ch. 6, for further details). Second, one computes the optimal block size for a smaller sample via subsampling techniques, as proposed by Hall, Horowitz, and Jing (1995), and then obtains the optimal block length for the full sample, using the optimal rate from the first step.
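The second (rescaling) step of the two-step procedure amounts to the following (a sketch; the rate exponent is an assumption to be set according to the optimal growth rate for the object of interest, per the references above):

    def scale_block_length(l_m, m, T, rate=1/3):
        """Given a block length l_m chosen on a subsample of size m (e.g., by
        grid search over candidate lengths), rescale it to the full sample of
        size T using the assumed optimal growth rate l ~ T**rate."""
        return max(1, round(l_m * (T / m) ** rate))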
Further data-driven methods for computing the block size are reported in Lahiri (2003, Ch. 6).
For higher order properties of statistics studentized with HAC estimators, see, e.g., Götze and Künsch (1996) for the sample mean and Inoue and Shintani (2004) for linear instrumental variables estimators.
We have provided a test that allows for the joint comparison of multiple misspecified conditional interval models for the case of dependent observations and for the case where accuracy is measured using a distributional analog of mean square error. We also outlined the construction of valid asymptotic critical values based on a version of the block bootstrap that properly takes into account the contribution of parameter estimation error. A small number of Monte Carlo experiments were also run to assess the finite-sample properties of the test, and results indicate that the test does not have unreasonable finite-sample properties given very small samples of 60 and 120 observations, although the results do suggest that larger samples should likely be used in empirical applications of the test.
Proof of Theorem 1. Recall that
Thus, from (5),
where
Note that, given Assumptions A(i) and (iii), for i = 1,…,m,

√T(θ̂i,T − θi†) = A(θi†) (1/√T) Σ_{t=s}^{T} ∇θi ln fi(Yt|Zt,θi†) + oP(1),

where A(θi†) = (E(−∇θi² ln fi(Yt|Zt,θi†)))⁻¹. Thus, ZT(1,k) converges in distribution to a normal random variable with variance equal to ckk. The statement in Theorem 1 then follows as a straightforward application of the Cramér–Wold device and the continuous mapping theorem. █
Proof of Corollary 2. Immediate from the proof of Theorem 1. █
Proof of Theorem 3. In the discussion that follows, P* denotes the probability law of the resampled series, conditional on the sample, and E* and Var* denote the expectation and variance operators associated with P*. By oP*(1), Pr-P and OP*(1), Pr-P, we mean a term approaching zero in P*-probability and a term bounded in P*-probability, respectively, conditional on the sample and for all samples except a set with probability measure approaching zero. Write ZT,u*(1,k) as
where
. Now,
as
, by Theorem 2.2 in Goncalves and White (2004), and
, as it converges in P*-distribution and because the term in square brackets is OP*(1), Pr-P. Thus, ZT*(1,k) can be written as
We begin by showing that for i = 1,…,m, conditional on the sample and for all samples except a set of probability measure approaching zero:
(a) The portion of (A.2) preceding the first −2/T term has the same limiting distribution (Pr-P) as
(b) The remainder of (A.2) (from the second −2/T term on) has the same limiting distribution (Pr-P) as
We begin by showing (a). Given the block resampling scheme described in Section 3.2, it is easy to see that
For notational simplicity, set u̲ = −∞ in what follows; needless to say, the same argument applies to any generic u̲ < ū. Recalling that each block, conditional on the sample, is i.i.d.,
where the last equality follows from Theorem 1 in Andrews (1991), given Assumption A and given the growth rate conditions on l. Therefore, given Assumption A, by Theorem 3.5 in Künsch (1989), (a) holds.
We now need to establish (b). First, note that given the mixing and domination conditions in Assumption A, from Lemmas 4 and 5 in Goncalves and White (2004), it follows that
Thus, we can write the sum of the last two terms in equation (A.2) as
Also, by Theorem 2.2 in Goncalves and White (2004), there exists an ε > 0 such that
Thus,
has the same asymptotic normal distribution as
, conditional on the sample and for all samples except a set with probability measure approaching zero. Finally, again by the same argument used in Lemmas A4 and A5 in Goncalves and White (2004),
where mθi†′ = E(∇θi Fi(u|Zt,θi†)(1{Yt ≤ u} − Fi(u|Zt,θi†))). Needless to say, the corresponding terms for model k can be treated in the same manner. Thus, ZT*(1,k) has the same limiting distribution as ZT(1,k), conditional on the sample and for all samples except a set with probability measure approaching zero. █