
A TEST FOR COMPARING MULTIPLE MISSPECIFIED CONDITIONAL INTERVAL MODELS

Published online by Cambridge University Press:  22 August 2005

Valentina Corradi
Affiliation:
Queen Mary–University of London
Norman R. Swanson
Affiliation:
Rutgers University

Abstract

This paper introduces a test for the comparison of multiple misspecified conditional interval models, for the case of dependent observations. Model accuracy is measured using a distributional analog of mean square error, in which the approximation error associated with a given model, say, model i, for a given interval, is measured by the expected squared difference between the conditional confidence interval under model i and the “true” one.

When comparing more than two models, a “benchmark” model is specified, and the test is constructed along the lines of the “reality check” of White (2000, Econometrica 68, 1097–1126). Valid asymptotic critical values are obtained via a version of the block bootstrap that properly captures the effect of parameter estimation error. The results of a small Monte Carlo experiment indicate that the test does not have unreasonable finite sample properties, given small samples of 60 and 120 observations, although the results do suggest that larger samples should likely be used in empirical applications of the test.

The authors express their gratitude to Don Andrews and an anonymous referee for providing numerous useful suggestions, all of which we feel have been instrumental in improving earlier drafts of this paper. The authors also thank Russell Davidson, Clive Granger, Lutz Kilian, Christelle Viaroux, and seminar participants at the 2002 UK Econometrics Group meeting in Bristol, the 2002 European Econometric Society meetings, the 2002 University of Pennsylvania NSF-NBER time series conference, the 2002 EC2 Conference in Bologna, Cornell University, the State University of New York at Stony Brook, and the University of California at Davis for many helpful comments and suggestions on previous versions of this paper.

Research Article

© 2005 Cambridge University Press

1. INTRODUCTION

There are several instances in which merely having a “good” model for the conditional mean and/or variance may not be adequate for the task at hand. For example, financial risk management involves tracking the entire distribution of a portfolio or measuring certain distributional aspects, such as value at risk (see, e.g., Duffie and Pan, 1997). In such cases, models of conditional mean and/or variance may not be satisfactory.

A very small subset of important contributions that go beyond the examination of models of conditional mean and/or variance includes papers that assess the correctness of conditional interval predictions (see, e.g., Christoffersen, 1998); assess volatility predictability by comparing unconditional and conditional interval forecasts (see, e.g., Christoffersen and Diebold, 2000); and assess conditional quantiles (see, e.g., Giacomini and Komunjer, 2005).1

1. Prediction confidence intervals are also discussed in Granger, White, and Kamstra (1989), Chatfield (1993), Diebold, Tay, and Wallis (1998), Clements and Taylor (2001), and the references cited therein.

Needless to say, correct specification of the conditional distribution implies correct specification of all conditional aspects of the model. Perhaps in part for this reason, there has been growing interest in recent years in providing tests for the correct specification of conditional distributions. One contribution in this direction is the conditional Kolmogorov (CK) test of Andrews (1997), which is based on the comparison of the empirical joint distribution of yt and Xt with the product of a given distribution of yt|Xt and the empirical cumulative distribution function (c.d.f.) of Xt. Other contributions in this direction include, for example, the work of Zheng (2000), who suggests a nonparametric test based on a first-order linear expansion of the Kullback–Leibler information criterion (KLIC), and of Altissimo and Mele (2002) and Li and Tkacz (2004), who propose tests based on the comparison of a nonparametric kernel estimate of the conditional density with the density implied under the null hypothesis.2

2. Whang (2000, 2001) proposes a CK type test for the correct specification of the conditional mean.

Following a different route based on use of the probability integral transform, Diebold, Gunther, and Tay (1998) suggest a simple and effective means by which predictive densities can be evaluated (see also Bai, 2003; Diebold, Hahn, and Tay, 1999; Hong, 2001; Hong and Li, 2005).

All of the papers cited in the preceding paragraph consider a null hypothesis of correct dynamic specification of the conditional distribution or of a given conditional confidence interval.3

3. One exception is the approach taken by Corradi and Swanson (2005a), who consider testing the null of correct specification of the conditional distribution for a given information set, thus allowing for dynamic misspecification under both hypotheses.

However, a reasonable assumption in the context of model selection may instead be that all models are approximations of the truth and hence all models are likely misspecified. Along these lines, it is our objective in this paper to provide a test that allows for the joint comparison of multiple misspecified conditional interval models, for the case of dependent observations.

Assume that the object of interest is a conditional interval model for a scalar random variable, $Y_t$, given a (possibly vector valued) conditioning set, $Z^t$, where $Z^t$ contains lags of $Y_t$ and/or other variables. In particular, given a group of (possibly) misspecified conditional interval models, say, $\big(F_1(\overline{u}|Z^t,\theta_1^\dagger) - F_1(\underline{u}|Z^t,\theta_1^\dagger),\ldots,F_m(\overline{u}|Z^t,\theta_m^\dagger) - F_m(\underline{u}|Z^t,\theta_m^\dagger)\big)$, assume that the objective is to compare these models in terms of their closeness to the true conditional interval, $F_0(\overline{u}|Z^t,\theta_0) - F_0(\underline{u}|Z^t,\theta_0) = \Pr(\underline{u} \le Y_t \le \overline{u}\,|\,Z^t)$. If $m > 2$, we follow White (2000). Namely, we choose a particular model as the “benchmark” and test the null hypothesis that no competing model can provide a more accurate approximation of the “true” model against the alternative that at least one competitor outperforms the benchmark. Needless to say, pairwise comparison of alternative models, in which no benchmark need be specified, follows as a special case. In our context, accuracy is measured using a distributional analog of mean square error. More precisely, the squared (approximation) error associated with model $i$, $i = 1,\ldots,m$, is measured in terms of $E\big(\big((F_i(\overline{u}|Z^t,\theta_i^\dagger) - F_i(\underline{u}|Z^t,\theta_i^\dagger)) - (F_0(\overline{u}|Z^t,\theta_0) - F_0(\underline{u}|Z^t,\theta_0))\big)^2\big)$, where $\underline{u},\overline{u} \in U$ and $U$ is a possibly unbounded set on the real line.

It should be pointed out that one well-known measure of distributional accuracy is the KLIC, in the sense that the “most accurate” model can be shown to be that which minimizes the KLIC (see Section 2 for a more detailed discussion). For the independent and identically distributed (i.i.d.) case, Vuong (1989) suggests a likelihood ratio test for choosing the conditional density model that is closest to the “true” conditional density in terms of the KLIC. Additionally, Giacomini (2002) suggests a weighted version of the Vuong likelihood ratio test for the case of dependent observations, whereas Kitamura (2002) employs a KLIC-based approach to select among misspecified conditional models that satisfy given moment conditions.4

4. Of note is that White (1982) shows that QMLEs minimize the KLIC under mild conditions.

Furthermore, the KLIC approach has recently been employed for the evaluation of dynamic stochastic general equilibrium models (see, e.g., Schorfheide, 2000; Fernandez-Villaverde and Rubio-Ramirez, 2004; Chang, Gomes, and Schorfheide, 2002). For example, Fernandez-Villaverde and Rubio-Ramirez show that the KLIC-best model is also the model with the highest posterior probability. However, as we outline in the next section, problems concerning the comparison of conditional confidence intervals may be difficult to address using the KLIC but can be handled quite easily using our generalized mean square measure of accuracy.

The rest of the paper is organized as follows. Section 2 states the hypothesis of interest and describes the test statistic that will be examined. In Section 3.1, it is shown that the limiting distribution of the statistic (properly recentered) is a functional of a zero mean Gaussian process, with a covariance kernel that reflects both the contribution of parameter estimation error and the effect of (dynamic) misspecification. Section 3.2 discusses the construction of asymptotically valid critical values. This is done via an extension of White's (2000) bootstrap approach to the case of nonvanishing parameter estimation error. The results of a small Monte Carlo experiment are collected in Section 4, and concluding remarks are given in Section 5. Proofs of results stated in the text are given in the Appendix.

Hereafter, $P^*$ denotes the probability law governing the resampled series, conditional on the sample; $E^*$ and $\mathrm{Var}^*$ are the mean and variance operators associated with $P^*$; $o_{P^*}(1)$ Pr-$P$ denotes a term converging to zero in $P^*$-probability, conditional on the sample and for all samples except a subset with probability measure approaching zero; and $O_{P^*}(1)$ Pr-$P$ denotes a term that is bounded in $P^*$-probability, conditional on the sample and for all samples except a subset with probability measure approaching zero. Analogously, $O_{a.s.^*}(1)$ and $o_{a.s.^*}(1)$ denote terms that are almost surely bounded and terms that approach zero almost surely, according to the probability law $P^*$ and conditional on the sample.

2. SETUP AND TEST STATISTICS

Our objective is to select among alternative conditional confidence interval models by using parametric conditional distributions for a scalar random variable, $Y_t$, given $Z^t$, where $Z^t = (Y_{t-1},\ldots,Y_{t-s_1},X_t,\ldots,X_{t-s_2+1})$ with $s_1,s_2$ finite. Note that although we assume $s_1$ and $s_2$ are finite, we do not require $(Y_t,X_t)$ to be Markovian. In fact, $Z^t$ might not contain the entire (relevant) history, and all models may be dynamically misspecified.

Define the group of conditional interval models from which one is to make a selection as $\big(F_1(\overline{u}|Z^t,\theta_1^\dagger) - F_1(\underline{u}|Z^t,\theta_1^\dagger),\ldots,F_m(\overline{u}|Z^t,\theta_m^\dagger) - F_m(\underline{u}|Z^t,\theta_m^\dagger)\big)$ and define the true conditional interval as

$$F_0(\overline{u}|Z^t,\theta_0) - F_0(\underline{u}|Z^t,\theta_0) = \Pr(\underline{u} \le Y_t \le \overline{u}\,|\,Z^t).$$

Hereafter, assume that $\theta_i^\dagger \in \Theta_i$, where $\Theta_i$ is a compact set in a finite-dimensional euclidean space, and let $\theta_i^\dagger$ be the probability limit of a quasi-maximum likelihood estimator (QMLE) of the parameters of the conditional distribution under model $i$. If model $i$ is correctly specified, then $\theta_i^\dagger = \theta_0$. As mentioned in the introduction, accuracy is measured in terms of a distributional analog of mean square error. In particular, we say that model 1 is more accurate than model 2 if

$$E\big(\big((F_1(\overline{u}|Z^t,\theta_1^\dagger) - F_1(\underline{u}|Z^t,\theta_1^\dagger)) - (F_0(\overline{u}|Z^t,\theta_0) - F_0(\underline{u}|Z^t,\theta_0))\big)^2\big) < E\big(\big((F_2(\overline{u}|Z^t,\theta_2^\dagger) - F_2(\underline{u}|Z^t,\theta_2^\dagger)) - (F_0(\overline{u}|Z^t,\theta_0) - F_0(\underline{u}|Z^t,\theta_0))\big)^2\big).$$

This measure defines a norm and implies a standard goodness of fit measure.

As mentioned previously, a very well-known measure of distributional accuracy that is already available in the literature is the KLIC (see, e.g., White, 1982; Vuong, 1989; Giacomini, 2002; Kitamura, 2002), according to which we should choose model 1 over model 2 if

$$E\big(\ln f_1(Y_t|Z^t,\theta_1^\dagger) - \ln f_2(Y_t|Z^t,\theta_2^\dagger)\big) > 0,$$

where $f_i$ denotes the conditional density under model $i$.

The KLIC is a sensible measure of accuracy, as it chooses the model that on average gives higher probability to events that have actually occurred. Also, it leads to simple likelihood ratio type tests. Interestingly, Fernandez-Villaverde and Rubio-Ramirez (2004) have shown that the best model under the KLIC is also the model with the highest posterior probability. However, if we are interested in measuring accuracy for a given conditional confidence interval, this cannot be easily done using the KLIC. For example, if we want to evaluate the accuracy of different models for approximating the probability that the rate of inflation tomorrow, given the rate of inflation today, will be between 0.5% and 1.5%, say, this cannot be done in a straightforward manner using the KLIC. On the other hand, our approach gives an easy way of addressing questions of this type. In this sense, we believe that our approach provides a reasonable alternative to the KLIC.

In what follows, model 1 is taken as the benchmark model, and the objective is to test whether some competitor model can provide a more accurate approximation of $F_0(\overline{u}|\cdot,\theta_0) - F_0(\underline{u}|\cdot,\theta_0)$ than the benchmark. The null and the alternative hypotheses are

$$H_0: \max_{k=2,\ldots,m} \big(\mu_1^2 - \mu_k^2\big) \le 0 \quad \text{versus} \quad H_A: \max_{k=2,\ldots,m} \big(\mu_1^2 - \mu_k^2\big) > 0,$$

where $\mu_i^2 = E\big(\big(1\{\underline{u} \le Y_t \le \overline{u}\} - (F_i(\overline{u}|Z^t,\theta_i^\dagger) - F_i(\underline{u}|Z^t,\theta_i^\dagger))\big)^2\big)$; as shown in equation (6) below, $\mu_1^2 - \mu_k^2$ is equal to the difference between the squared approximation errors associated with models 1 and $k$.

Alternatively, if interest focuses on testing the null of equal accuracy of two conditional confidence interval models, say, models 1 and 2, we can simply state the hypotheses as

$$H_0': \mu_1^2 - \mu_2^2 = 0 \quad \text{versus} \quad H_A': \mu_1^2 - \mu_2^2 \ne 0.$$

Needless to say, if the benchmark model is correctly specified, we do not reject the null. Related tests that instead focus on dynamic correct specification of conditional interval models (as opposed to allowing for misspecification under both hypotheses, as is done with all of our tests) are discussed in Christoffersen (1998).

If the objective is to test for the correct specification of a single conditional interval model, say, model 1, for a given information set, then we can define the hypotheses as

$$H_0'': F_1(\overline{u}|Z^t,\theta_1^\dagger) - F_1(\underline{u}|Z^t,\theta_1^\dagger) = \Pr(\underline{u} \le Y_t \le \overline{u}\,|\,Z^t) \ \text{a.s.},$$

versus $H_A''$: the negation of $H_0''$.5

5. In the definition of $H_0''$, $\theta_1^\dagger$ should be replaced by $\theta_0$ if $Z^t$ is meant as the information set including all the relevant history.

Tests of this sort that consider the correct specification of the conditional distribution for a given information set (i.e., conditional distribution tests that allow for the possibility of dynamic misspecification under both hypotheses) are discussed in Corradi and Swanson (2005a).

To test $H_0$ versus $H_A$, form the following statistic:

$$Z_T = \max_{k=2,\ldots,m} Z_T(1,k), \tag{1}$$

where

$$Z_T(1,k) = \frac{1}{\sqrt{T}} \sum_{t=s}^{T} \Big( \big(1\{\underline{u} \le Y_t \le \overline{u}\} - (F_1(\overline{u}|Z^t,\hat\theta_{1,T}) - F_1(\underline{u}|Z^t,\hat\theta_{1,T}))\big)^2 - \big(1\{\underline{u} \le Y_t \le \overline{u}\} - (F_k(\overline{u}|Z^t,\hat\theta_{k,T}) - F_k(\underline{u}|Z^t,\hat\theta_{k,T}))\big)^2 \Big), \tag{2}$$

with $s = \max\{s_1,s_2\}$ and

$$\hat\theta_{i,T} = \arg\max_{\theta_i \in \Theta_i} \frac{1}{T} \sum_{t=s}^{T} \ln f_i(Y_t|Z^t,\theta_i), \tag{3}$$

where $f_i(Y_t|Z^t,\theta_i)$ is the conditional density under model $i$. As $f_i(\cdot|\cdot)$ does not in general coincide with the true conditional density, $\hat\theta_{i,T}$ is the QMLE, and $\theta_i^\dagger \ne \theta_0$, in general. More broadly speaking, the results discussed subsequently hold for any estimator for which $\sqrt{T}(\hat\theta_{i,T} - \theta_i^\dagger)$ is asymptotically normal. This is the case for several extremum estimators, for example, such as (nonlinear) least squares, (Q)MLE, and so on. However, it is not advisable to use overidentified generalized method of moments (GMM) estimators because $\sqrt{T}(\hat\theta_{i,T} - \theta_i^\dagger)$ is not asymptotically normal, in general, when model $i$ is not correctly specified (see, e.g., Hall and Inoue, 2003). Needless to say, if interest focuses on testing $H_0'$ versus $H_A'$, one should use the statistic $Z_T(1,2)$, and if interest focuses on testing $H_0''$ versus $H_A''$, the appropriate test statistic is

[…] (4)

which is a special case of the statistic considered in Theorem 2 of Corradi and Swanson (2005a) in the context of testing for the correct specification of the “entire” conditional distribution for a given information set. The limiting distribution of (4) and the construction of valid critical values via the bootstrap follow from Theorems 2 and 4 of Corradi and Swanson (2005a), who also provide some Monte Carlo evidence. Discussion of the test statistic in (4) in relation to the existing literature on testing for the correct conditional distribution is given in the paper just mentioned.

The intuition behind equation (2) is very simple. First, note that $E(1\{\underline{u} \le Y_t \le \overline{u}\}|Z^t) = \Pr(\underline{u} \le Y_t \le \overline{u}\,|\,Z^t) = F_0(\overline{u}|Z^t,\theta_0) - F_0(\underline{u}|Z^t,\theta_0)$. Thus, $1\{\underline{u} \le Y_t \le \overline{u}\} - (F_i(\overline{u}|Z^t,\theta_i^\dagger) - F_i(\underline{u}|Z^t,\theta_i^\dagger))$ can be interpreted as an “error” term associated with computation of the conditional expectation, under $F_i$. Now, write the statistic in equation (2) as

$$Z_T(1,k) = \frac{1}{\sqrt{T}} \sum_{t=s}^{T} \Big( \big(1\{\underline{u} \le Y_t \le \overline{u}\} - (F_1(\overline{u}|Z^t,\hat\theta_{1,T}) - F_1(\underline{u}|Z^t,\hat\theta_{1,T}))\big)^2 - \mu_1^2 - \big(1\{\underline{u} \le Y_t \le \overline{u}\} - (F_k(\overline{u}|Z^t,\hat\theta_{k,T}) - F_k(\underline{u}|Z^t,\hat\theta_{k,T}))\big)^2 + \mu_k^2 \Big) + \sqrt{T}\big(\mu_1^2 - \mu_k^2\big), \tag{5}$$

where $\mu_j^2 = E\big(\big(1\{\underline{u} \le Y_t \le \overline{u}\} - (F_j(\overline{u}|Z^t,\theta_j^\dagger) - F_j(\underline{u}|Z^t,\theta_j^\dagger))\big)^2\big)$, $j = 1,\ldots,m$. In the Appendix, it is shown that the first term in equation (5) weakly converges to a Gaussian process. Also, for $j = 1,\ldots,m$,

$$\mu_j^2 = E\big(\big(1\{\underline{u} \le Y_t \le \overline{u}\} - (F_0(\overline{u}|Z^t,\theta_0) - F_0(\underline{u}|Z^t,\theta_0))\big)^2\big) + E\big(\big((F_0(\overline{u}|Z^t,\theta_0) - F_0(\underline{u}|Z^t,\theta_0)) - (F_j(\overline{u}|Z^t,\theta_j^\dagger) - F_j(\underline{u}|Z^t,\theta_j^\dagger))\big)^2\big),$$

given that the expectation of the cross product is zero (which follows because $1\{\underline{u} \le Y_t \le \overline{u}\} - (F_0(\overline{u}|Z^t,\theta_0) - F_0(\underline{u}|Z^t,\theta_0))$ is uncorrelated with any measurable function of $Z^t$). Therefore,

$$\mu_1^2 - \mu_k^2 = E\big(\big((F_1(\overline{u}|Z^t,\theta_1^\dagger) - F_1(\underline{u}|Z^t,\theta_1^\dagger)) - (F_0(\overline{u}|Z^t,\theta_0) - F_0(\underline{u}|Z^t,\theta_0))\big)^2\big) - E\big(\big((F_k(\overline{u}|Z^t,\theta_k^\dagger) - F_k(\underline{u}|Z^t,\theta_k^\dagger)) - (F_0(\overline{u}|Z^t,\theta_0) - F_0(\underline{u}|Z^t,\theta_0))\big)^2\big). \tag{6}$$
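To fix ideas, the following minimal sketch computes the sample analog of $Z_T = \max_{k=2,\ldots,m} Z_T(1,k)$ from equations (1) and (2); the helper names (`interval_prob`, `z_T_stat`) and the `(F, theta_hat)` model interface are hypothetical conveniences, not part of the paper.

```python
import numpy as np

def interval_prob(F, u_lo, u_hi, z_t, theta):
    # Model-implied P(u_lo <= Y_t <= u_hi | Z^t) = F(u_hi|Z^t) - F(u_lo|Z^t).
    return F(u_hi, z_t, theta) - F(u_lo, z_t, theta)

def z_T_stat(y, z, models, u_lo, u_hi):
    """models[0] is the benchmark; each entry is a pair (F, theta_hat),
    with F(u, z_t, theta) the model's conditional c.d.f. at u."""
    T = len(y)
    ind = ((y >= u_lo) & (y <= u_hi)).astype(float)   # 1{u <= Y_t <= u-bar}
    # Squared "error" series for each model, as in equation (2).
    sq_err = [np.array([(ind[t] - interval_prob(F, u_lo, u_hi, z[t], th)) ** 2
                        for t in range(T)])
              for F, th in models]
    # Z_T(1,k) for each competitor k, then the max over k.
    stats = [(sq_err[0] - sq_err[k]).sum() / np.sqrt(T)
             for k in range(1, len(models))]
    return max(stats)
```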

Before outlining the asymptotic properties of the statistic in equation (1), two comments are worth making.

First, following the reality check approach of White (2000), the problem of testing multiple hypotheses has been reduced to a single test by applying the (single-valued) max function to multiple hypotheses. This approach has the advantage that it avoids sequential testing bias and also captures the correlation across the various models. On the other hand, if we reject the null, we can conclude that there is at least one model that outperforms the benchmark, but we do not have available to us a complete picture concerning which model(s) contribute to the rejection of the null. Of course, some information can be obtained by looking at the distributional analog of mean square error associated with the various models and forming a crude ranking of the models, although the usual cautions associated with using a mean square error type measure to rank models should be taken. Alternatively, our approach can be complemented by a multiple comparison approach, such as the false discovery rate (FDR) approach of Benjamini and Hochberg (1995), which allows one to select among alternative groups of models, in the sense that one can assess which group(s) contribute to the rejection of the null. The FDR approach has the objective of controlling the expected number of false rejections, and in practice one computes p-values associated with the m hypotheses and orders these p-values in increasing fashion, say, P1 ≤ ··· ≤ Pi ≤ ··· ≤ Pm. Then, all hypotheses characterized by Pi ≤ (1 − (i − 1)/m)α are rejected, where α is a given significance level. Such an approach, though less conservative than the Hochberg (1988) approach, is still conservative as it provides bounds on p-values. Overall, we think that a sound practical strategy could be to first implement our reality check type tests. These tests can then be complemented by using a multiple comparison approach, yielding a better overall understanding concerning which model(s) contribute to the rejection of the null, if it is indeed rejected. If the null is not rejected, then we simply choose the benchmark model. Nevertheless, even in this case, it may not hurt to see whether some of the individual hypotheses in the joint null are rejected via a multiple test comparison approach.
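As an illustration of the p-value ordering rule just described, the sketch below applies the threshold $(1 - (i-1)/m)\alpha$ from the text to a vector of p-values; the function name and inputs are hypothetical stand-ins for the p-values produced by the m individual model comparisons.

```python
import numpy as np

def fdr_style_rejections(pvals, alpha=0.05):
    # Order the p-values in increasing fashion: P(1) <= ... <= P(m).
    m = len(pvals)
    order = np.argsort(pvals)
    rejected = []
    for i, idx in enumerate(order, start=1):
        # Reject hypotheses with P(i) <= (1 - (i - 1)/m) * alpha, as in the text.
        if pvals[idx] <= (1 - (i - 1) / m) * alpha:
            rejected.append(idx)
    return rejected

# Example: fdr_style_rejections(np.array([0.003, 0.04, 0.20]), alpha=0.05)
```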

Second, it perhaps is worth pointing out that simulation-based versions of the tests discussed here are given in Corradi and Swanson (2005b), in the context of the evaluation of dynamic stochastic general equilibrium models.

3. ASYMPTOTIC RESULTS

The results stated subsequently require the following assumption.

Assumption A: (i) $(Y_t,X_t)$ is a strictly stationary and absolutely regular $\beta$-mixing process with size $-4$; (ii) for $i = 1,\ldots,m$, $F_i(u|Z^t,\theta_i)$ is continuously differentiable on the interior of $\Theta_i$, where $\Theta_i$ is a compact set in $\mathbb{R}^{p_i}$, and $\nabla_{\theta_i} F_i(u|Z^t,\theta_i^\dagger)$ is $2r$-dominated on $\Theta_i$, for all $u$, with $r > 2$;6 (iii) $\theta_i^\dagger$ is uniquely identified (i.e., $E(\ln f_i(Y_t|Z^t,\theta_i^\dagger)) > E(\ln f_i(Y_t|Z^t,\theta_i))$ for any $\theta_i \ne \theta_i^\dagger$), where $f_i$ is the density associated with $F_i$; (iv) $f_i$ is twice continuously differentiable on the interior of $\Theta_i$, and $\nabla_{\theta_i} \ln f_i(Y_t|Z^t,\theta_i)$ and $\nabla^2_{\theta_i} \ln f_i(Y_t|Z^t,\theta_i)$ are $2r$-dominated on $\Theta_i$, with $r > 2$; (v) $E(-\nabla^2_{\theta_i} \ln f_i(Y_t|Z^t,\theta_i))$ is positive definite, uniformly on $\Theta_i$, and […] is positive definite; and (vi) let […], for $k = 2,\ldots,m$; define analogous covariance terms, $v_{jk}$, $j,k = 2,\ldots,m$, and assume that $V = [v_{jk}]$ is positive semidefinite.

6. We say that $\nabla_{\theta_i} F_i(u|Z^t,\theta_i)$ is $2r$-dominated on $\Theta_i$ uniformly in $u$ if its $k$th element, $k = 1,\ldots,p_i$, is such that $|\nabla_{\theta_i} F_i(u|Z^t,\theta_i)|_k \le D_t(u)$ and $\sup_{u \in \mathbb{R}} E(|D_t(u)|^{2r}) < \infty$. For more details on domination conditions, see Gallant and White (1988, p. 33).

Recalling that $Z^t = (Y_{t-1},\ldots,Y_{t-s_1},X_t,\ldots,X_{t-s_2+1})$, A(i) ensures that $Z^t$ is strictly stationary and mixing with size $-4$. Note that A(vi) requires at least one of the competing models to be neither nested in nor nesting the benchmark model. The nonnestedness of at least one competitor ensures that the long-run covariance matrix is positive definite even in the absence of parameter estimation error. However, Assumption A(vi) can be relaxed, in which case the limiting distribution of the test statistic takes exactly the same form as given in Theorem 1, which follows, except that the covariance kernel contains only terms that reflect parameter estimation error.7

7. Note that in White (2000), the nonnestedness of at least one competitor is a necessary condition, given that in his context parameter estimation error vanishes asymptotically, whereas in the present context it does not. More precisely, White (2000) considers out-of-sample comparison, using the first R observations for model estimation and the last P observations for model validation, where T = P + R. Parameter estimation error vanishes in his setup either because P/R → 0 or because the same loss function is used for estimation and model validation.

3.1. Limiting Distributions

THEOREM 1. Let Assumption A hold. Then

$$\max_{k=2,\ldots,m} \big( Z_T(1,k) - \sqrt{T}(\mu_1^2 - \mu_k^2) \big) \xrightarrow{d} \max_{k=2,\ldots,m} Z_{1,k},$$

where $Z_{1,k}$ is a zero mean Gaussian process with covariance $c_{kk} = v_{kk} + p_{kk} + pc_{kk}$; $v_{kk}$ denotes the component of the long-run covariance matrix that would obtain in the absence of parameter estimation error, $p_{kk}$ denotes the contribution of parameter estimation error, and $pc_{kk}$ denotes the covariance across the two components. In particular:8

[…]

with9

$$m_{\theta_i^\dagger}' = E\Big(\nabla_{\theta_i}\big(F_i(\overline{u}|Z^t,\theta_i^\dagger) - F_i(\underline{u}|Z^t,\theta_i^\dagger)\big)\big(1\{\underline{u} \le Y_t \le \overline{u}\} - (F_i(\overline{u}|Z^t,\theta_i^\dagger) - F_i(\underline{u}|Z^t,\theta_i^\dagger))\big)\Big)$$

and $A(\theta_i^\dagger) = \big(E(-\nabla^2_{\theta_i} \ln f_i(Y_t|Z^t,\theta_i^\dagger))\big)^{-1}$.

8. Note that the recentered statistic is actually […]; however, for notational simplicity, and given that the two are asymptotically equivalent, we “approximate” it as above, both in the text and in the Appendix.

9. Note that $m_{\theta_i^\dagger}$ depends on the chosen interval $(\underline{u},\overline{u})$; however, for notational simplicity we omit such dependence.

As an immediate corollary, note the following result.

COROLLARY 2. Let Assumptions A(i)–(v) hold and suppose A(vi) is violated. Then

$$\max_{k=2,\ldots,m} \big( Z_T(1,k) - \sqrt{T}(\mu_1^2 - \mu_k^2) \big) \xrightarrow{d} \max_{k=2,\ldots,m} \tilde Z_{1,k},$$

where $\tilde Z_{1,k}$ is a zero mean normal random variable with covariance equal to $p_{kk}$, as defined in equations (10)–(12).

From Theorem 1 and Corollary 2, it follows that when all competing models provide an approximation to the true conditional interval model that is as (mean square) accurate as that provided by the benchmark (i.e., when $\mu_1^2 - \mu_k^2 = 0$, $\forall k$), then the limiting distribution corresponds to the maximum of an $m-1$–dimensional zero-mean normal random vector, with a covariance kernel that reflects both the contribution of parameter estimation error and the dependent structure of the data. Additionally, when all competitor models are worse than the benchmark, the statistic diverges to minus infinity, at rate $\sqrt{T}$. Finally, when only some competitor models are worse than the benchmark, the limiting distribution provides a conservative test, as $Z_T$ will always be smaller than $\max_{k=2,\ldots,m}\big(Z_T(1,k) - \sqrt{T}(\mu_1^2 - \mu_k^2)\big)$, asymptotically, and therefore the critical values of $\max_{k=2,\ldots,m} Z_{1,k}$ provide upper bounds for the critical values of $\max_{k=2,\ldots,m} Z_T(1,k)$. Of course, when $H_A$ holds, the statistic diverges to plus infinity at rate $\sqrt{T}$. It is well known that the maximum of a normal random vector is not a normal random variable, and hence critical values cannot immediately be tabulated. In a related paper, White (2000) suggests obtaining critical values either via Monte Carlo simulation or via use of the bootstrap. Here, we focus on use of the bootstrap, although White's results do not apply in our case, as the contribution of parameter estimation error does not vanish in our setup and hence must be properly taken into account when forming critical values. Before turning our attention to the bootstrap, however, we briefly outline an out-of-sample version of our test statistic.

Thus far, we have compared conditional interval models via a distributional generalization of in-sample mean square error. Needless to say, an out-of-sample version of the statistic may also be constructed. Let $T = R + P$, let $\hat\theta_{i,t}$ be a recursive estimator computed using the first $t$ observations, $t = R, R+1, \ldots, R+P-1$, and let […]. A one-step-ahead out-of-sample version of the statistic in equations (1) and (2) is given by

[…]

Now, Theorem 1 and Corollary 2 still apply (Corollary 2 requires $P/R \to \pi > 0$), although the covariance matrices will be slightly different. However, Theorem 3 (in Section 3.2) no longer applies, as the block bootstrap is no longer valid and is indeed characterized by a bias term whose sign varies across samples. This is because of the use of recursive estimation. This issue is studied by Corradi and Swanson (2004a), who propose a proper recentering of the quasi-likelihood function.10

10. Corradi and Swanson (2004a) study the case of rolling estimators.
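As a rough sketch of the recursive scheme just described (with hypothetical `fit` and `prob` helpers standing in for the model-specific QMLE and conditional interval probability), one-step-ahead interval probabilities can be generated as follows.

```python
import numpy as np

def recursive_interval_probs(y, z, fit, prob, R, u_lo, u_hi):
    """prob(u_lo, u_hi, z_t, theta) returns the model-implied
    P(u_lo <= Y_t <= u_hi | z_t); fit re-estimates theta by QMLE."""
    T = len(y)
    probs = np.empty(T - R)
    for t in range(R, T):
        theta_hat_t = fit(y[:t], z[:t])      # estimated on the first t obs
        probs[t - R] = prob(u_lo, u_hi, z[t], theta_hat_t)  # one-step-ahead
    return probs
```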

3.2. Bootstrap Critical Values

In this section we outline how to obtain valid critical values for the asymptotic distribution of $\max_{k=2,\ldots,m} Z_T(1,k)$, via use of a version of the block bootstrap that properly captures the contribution of parameter estimation error to the covariance kernel associated with the limiting distribution of the test statistic.11

11. In principle, we could have obtained an estimator for $C = [c_{kj}]$, as defined in the statement of Theorem 1, that takes into account the contribution of parameter estimation error; call it $\hat C$. Then, we could draw $N$ $(m-1)$-dimensional standard normal random vectors, say, $\eta^{(i)}$, $i = 1,\ldots,N$, and for each $i$ form $\hat C^{1/2}\eta^{(i)}$, take the maximum of the $m-1$ elements, and finally compute the empirical distribution of the $N$ maxima. However, as pointed out by White (2000), when the sample size is moderate and the number of models is large, $\hat C$ is a rather poor estimator for $C$.

To show the first-order validity of the bootstrap, we shall obtain the limiting distribution of the bootstrap statistic and show that it coincides with the limiting distribution given in Theorem 1. As all candidate models are potentially misspecified under both hypotheses, the parametric bootstrap is not generally applicable in our context. In fact, if observations are resampled from one of the candidate models, then we cannot ensure that the resampled statistic has the appropriate limiting distribution. Our approach is thus to establish the first-order validity of the block bootstrap in the presence of parameter estimation error, by drawing in part upon results of Goncalves and White (2002, 2004).12

12. Goncalves and White (2002, 2004) consider the more general case of heterogeneous and near epoch dependent observations.

Assume that bootstrap samples are formed as follows. Let $W_t = (Y_t, Z^t)$. Draw $b$ overlapping blocks of length $l$ from $W_s,\ldots,W_T$, where $s = \max\{s_1,s_2\}$, so that $bl = T - s$. Thus, $W^*_s,\ldots,W^*_{s+l},\ldots,W^*_{T-l+1},\ldots,W^*_T$ is equal to $W_{I_1+1},\ldots,W_{I_1+l},\ldots,W_{I_b+1},\ldots,W_{I_b+l}$, where $I_i$, $i = 1,\ldots,b$, are i.i.d. discrete uniform random variates on $s-1, s, \ldots, T-l$. It follows that, conditional on the sample, the pseudo time series $W^*_t$, $t = s,\ldots,T$, consists of $b$ i.i.d. blocks of length $l$.
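A minimal sketch of this resampling scheme (assuming, as in the text, that $b \cdot l$ equals the number of retained observations) is:

```python
import numpy as np

def block_bootstrap(W, l, rng):
    # W holds W_s, ..., W_T; draw b i.i.d. uniform starting indices and
    # concatenate b overlapping blocks of length l into one pseudo series.
    n = len(W)
    b = n // l                                   # assumes b * l = n
    starts = rng.integers(0, n - l + 1, size=b)  # i.i.d. discrete uniform
    return np.concatenate([W[i:i + l] for i in starts])

# Example: W_star = block_bootstrap(W, l=5, rng=np.random.default_rng(0))
```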

Now, consider the bootstrap analog of $Z_T$. Define the block bootstrap QMLE as

$$\hat\theta^*_{i,T} = \arg\max_{\theta_i \in \Theta_i} \frac{1}{T} \sum_{t=s}^{T} \ln f_i(Y^*_t|Z^{*t},\theta_i),$$

and define the bootstrap statistic as13

[…]

13. It should be pointed out that $\ln f_i(Y_t|Z^t,\theta_i)$ and $\ln f_i(Y^*_t|Z^{*t},\theta_i)$ can be replaced by generic functions $m_i(Y_t,Z^t,\theta_i)$ and $m_i(Y^*_t,Z^{*t},\theta_i)$, provided they satisfy Assumptions A and A2.1 in Goncalves and White (2004) and provided […]. Thus, the results for QMLE straightforwardly extend to generic m-estimators, such as nonlinear least squares or exactly identified GMM. On the other hand, they do not apply to overidentified GMM, as […]. In that case, even for first-order validity, one has to properly recenter $m_i(Y^*_t,Z^{*t},\theta_i)$ (see, e.g., Hall and Horowitz, 1996; Andrews, 2002; Inoue and Shintani, 2004).

THEOREM 3. Let Assumption A hold. If $l \to \infty$ and $l/T^{1/2} \to 0$ as $T \to \infty$, then

[…]

where $P^*$ denotes the probability law of the resampled series, conditional on the sample, and $\mu_1^2 - \mu_k^2$ is defined as in equation (6).

The preceding result suggests proceeding in the following manner. For any bootstrap replication, compute the bootstrap statistic, $Z_T^*$. Perform $B$ bootstrap replications ($B$ large) and compute the quantiles of the empirical distribution of the $B$ bootstrap statistics. Reject $H_0$ if $Z_T$ is greater than the $(1-\alpha)$th quantile. Otherwise, do not reject. Now, for all samples except a set with probability measure approaching zero, $Z_T$ has the same limiting distribution as the corresponding bootstrap statistic when $\mu_1^2 - \mu_k^2 = 0$, $\forall k$, which is the least favorable case under the null hypothesis. Thus, the preceding approach ensures that the test has asymptotic size $\alpha$. On the other hand, when one or more, but not all, of the competing models are strictly dominated by the benchmark, the preceding approach ensures that the test has asymptotic size between 0 and $\alpha$. When all models are dominated by the benchmark, the statistic diverges to minus infinity, so that the rule implies zero asymptotic size. Finally, under the alternative, $Z_T$ diverges to (plus) infinity, whereas the corresponding bootstrap statistic has a well-defined limiting distribution. This ensures unit asymptotic power. From the previous discussion, we see that the bootstrap distribution provides correct asymptotic critical values only for the least favorable case under the null hypothesis, that is, when all competitor models are as good as the benchmark model. When $\max_{k=2,\ldots,m}(\mu_1^2 - \mu_k^2) = 0$, but $\mu_1^2 - \mu_k^2 < 0$ for some $k$, then the bootstrap critical values lead to conservative inference. An alternative to our bootstrap critical values in this case is to construct critical values using subsampling (see, e.g., Politis, Romano, and Wolf, 1999, Ch. 3). Heuristically, construct $T - 2b_T$ statistics using subsamples of length $b_T$, where $b_T/T \to 0$. The empirical distribution of these statistics computed over the various subsamples properly mimics the distribution of the statistic. Thus, subsampling provides valid critical values even for the case where $\max_{k=2,\ldots,m}(\mu_1^2 - \mu_k^2) = 0$, but $\mu_1^2 - \mu_k^2 < 0$ for some $k$. This is the approach used by Linton, Maasoumi, and Whang (2003), for example, in the context of testing for stochastic dominance. Needless to say, one problem with subsampling is that unless the sample is very large, the empirical distribution of the subsampled statistics may yield a poor approximation of the limiting distribution of the statistic.
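In code, the decision rule just described amounts to comparing the sample statistic with the $(1-\alpha)$ empirical quantile of the bootstrap statistics; this is a minimal sketch with hypothetical inputs (the sample statistic and an array of $B$ bootstrap statistics).

```python
import numpy as np

def reality_check_decision(z_T_sample, z_T_boot, alpha=0.05):
    # Critical value: (1 - alpha) quantile of the B bootstrap statistics.
    crit = np.quantile(z_T_boot, 1.0 - alpha)
    return z_T_sample > crit   # True: reject H0 at level alpha
```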

Hansen (2005) points out that the conservative nature of the reality check of White (2000) leads to reduced power and that it should be feasible to improve the power and reduce the sensitivity of the reality check test to poor and irrelevant alternatives via use of the modified reality check test outlined in his paper. Given the similarity between the approach taken in our paper and that taken by White (2000), it may also be possible to improve our test performance using the approach of Hansen (2005) to modify our test.

4. MONTE CARLO FINDINGS

The experimental setup used in this section is as follows. We begin by generating $(y_t, y_{t-1}, w_t, x_t, q_t)'$ as

[…]

where $St(0,\Sigma,v)$ denotes a Student's t distribution with mean zero, variance $\Sigma$, and $v$ degrees of freedom, with

[…]

The data generating process (DGP) of interest is assumed to be (see, e.g., Spanos, 1999)

[…]

where $\alpha = \sigma_{12}/\sigma^2$, so that the conditional mean is a linear function of $y_{t-1}$ and the conditional variance is a linear function of $y_{t-1}^2$.

In our experiments, we impose misspecification upon all estimated models by assuming normality (i.e., we assume that $F_i$, $i = 1,\ldots,m$, is the normal c.d.f.). Our objective is to ascertain whether a given benchmark model is “better,” in the sense of having lower squared approximation error, than two given alternative models. Thus, $m = 3$. Level and power experiments are defined by adjusting the conditioning information sets used to estimate (via QMLE) the parameters of each conditional model and subsequently to form the estimated conditional interval probabilities. In all experiments, values of $\alpha = \{0.4, 0.6, 0.8, 0.9\}$ are used, samples of $T = 60$ and $120$ are tried, $v = 5$, $\sigma^2 = 1$, and $\sigma_X^2 = \sigma_W^2 = \sigma_Q^2 = \{0.1, 1.0, 10.0\}$. Throughout, the conditional confidence interval version of the test is constructed, and the upper and lower bounds of the interval are fixed at $\mu_Y + \gamma\sigma_Y$ and $\mu_Y - \gamma\sigma_Y$, respectively, where $\mu_Y$ and $\sigma_Y$ are the mean and standard deviation of $y_t$ and where $\gamma = \frac{1}{2}$.14

14. Findings corresponding to […] are very similar and are available from the authors upon request.

Additionally, 5% and 10% nominal level bootstrap critical values are constructed using 100 bootstrap replications, block lengths of l = {2,3,5,6} are tried, and all reported rejection frequencies are based on 5,000 Monte Carlo simulations.15

15. Additional results for cases where […], and where critical values are constructed using 250 bootstrap replications, are available upon request and yield qualitatively similar results to those reported in Tables 1 and 2.
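Putting the pieces together, a single cell of the Monte Carlo design can be sketched as below; `simulate_dgp` is a hypothetical stand-in for the Student's t DGP of this section, and the statistic and bootstrap helpers are the sketches given earlier.

```python
import numpy as np

def rejection_frequency(simulate_dgp, compute_z_T, compute_z_T_boot,
                        T=120, B=100, n_mc=5000, alpha=0.10, seed=0):
    rng = np.random.default_rng(seed)
    rejections = 0
    for _ in range(n_mc):
        y, z = simulate_dgp(T, rng)                       # one Monte Carlo sample
        zT = compute_z_T(y, z)                            # sample statistic Z_T
        zT_boot = np.array([compute_z_T_boot(y, z, rng)   # B bootstrap statistics
                            for _ in range(B)])
        rejections += zT > np.quantile(zT_boot, 1 - alpha)
    return rejections / n_mc
```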

Given $Z^t = (y_{t-1}, x_t, w_t, q_t)$, the experiments reported on are organized as follows.

Empirical Level Experiments.

In these experiments, we define the conditioning variable sets as follows. For the benchmark model ($F_1$), use […], where […] is a proper subset of $Z^t$. For the two alternative models ($F_2$ and $F_3$), we set […], respectively. In this case, the estimated coefficients associated with $x_t$, $w_t$, and $q_t$ have probability limits equal to zero, as none of these variables enters into the true conditional mean function. In addition, all models are misspecified, as conditional normality is assumed throughout. Therefore, the benchmark and the two competitors are equally misspecified. Finally, the limiting distribution of the test statistic in this case is driven by parameter estimation error, as Assumption A(vi) does not hold (see Corollary 2 for this case).

Empirical Power Experiments.

In these experiments, we set the conditioning variable sets as follows. For the benchmark model ($F_1$), use […]. For the two alternative models ($F_2$ and $F_3$), we set […], respectively. In this manner, it is ensured that the first of the two alternative models has smaller squared approximation error than the benchmark model. In fact, all three models are incorrect for both the marginal distribution (normal instead of Student's t) and for the conditional variance, which is set equal to the unconditional value instead of being a linear function of $y_{t-1}^2$. However, one of the competitors, model 2, is correctly specified for the conditional mean, whereas the other two are not. Therefore, model 2 is characterized by a smaller squared approximation error.

Our findings are summarized in Table 1 (empirical level experiments) and Table 2 (empirical power experiments). In these tables, the first column reports the value of α used in a particular experiment, and the remaining entries are rejection frequencies of the null hypothesis that the benchmark model is not outperformed by any of the alternative models. A number of conclusions emerge upon inspection of the tables. Turning first to the empirical level results given in Table 1, note, for example, that empirical level varies from values grossly above nominal levels (when block lengths and values of α are large), to values below or close to nominal levels (when values of α are smaller). However, note that it is often the case that moving from 60 to 120 observations results in rejection frequencies being closer to the nominal level of the test, as expected (with the exception that the test becomes even more conservative when l is 5 or 6, in many cases). Notice also that when α = 0.4 (low persistence) a block length of 2 usually suffices to capture the dependence structure of the series, whereas for α = 0.9 (high persistence) a larger block length is necessary. Finally, it is worth noting that, overall, the empirical rejection frequencies are not too distant from nominal levels, a result that is somewhat surprising given the small sample sizes used in our experiments. However, the test could clearly be expected to exhibit improved behavior were larger samples of data used.

Table 1. Empirical level experiments: Interval = $\mu_Y \pm \frac{1}{2}\sigma_Y$

Table 2. Empirical power experiments: Interval = $\mu_Y \pm \frac{1}{2}\sigma_Y$

With regard to empirical power (see Table 2), note that rejection frequencies increase as α increases. This is not surprising, as the contribution of yt−1 to the conditional mean, which is neglected by models 1 and 3, becomes more substantial as α increases. Overall, for α ≥ 0.6 and for a nominal level of 10%, rejection frequencies are above 0.5 in many cases, again suggesting the need for larger samples.16

16. Note that our Monte Carlo findings are not directly comparable with those of Christoffersen (1998), as his null corresponds to correct dynamic specification of the conditional interval model.

As noted before, rejection frequencies are sensitive to the choice of the block size parameter. This suggests that it should be useful to choose the block length in a data-driven manner. One way in which this may be accomplished is by use of a two-step procedure as follows. First, one defines the optimal rate at which the block length should grow as the sample grows. This rate usually depends on what one is interested in (e.g., the focus is confidence intervals in our setup; see Lahiri, 2003, Ch. 6, for further details). Second, one computes the optimal block size for a smaller sample via subsampling techniques, as proposed by Hall, Horowitz, and Jing (1995), and then obtains the optimal block length for the full sample, using the optimal rate in the first step.17

17. Further data-driven methods for computing the block size are reported in Lahiri (2003, Ch. 6).

However, it is not clear whether application of the Hall et al. (1995) approach leads to an optimal choice (i.e., to the block size that minimizes the appropriate mean squared error, say). The reason for this is that the theoretical optimal block size is obtained by comparing the first (or second) term of the Edgeworth expansion of the actual and bootstrap statistics. However, in our case the statistic is not pivotal, as $Z_T$ and $Z_T^*$ are not scaled by a proper variance estimator, and consequently we cannot obtain an Edgeworth expansion with a standard normal variate as the leading term in the expansion. In principle, we could begin by scaling the test statistic by a heteroskedasticity and autocorrelation consistent (HAC) variance estimator, but in such a case the statistic could no longer be written as a smooth function of the sample mean, and it is not clear whether data-driven block size selection of the variety outlined previously would actually be optimal.18

18. For higher order properties of statistics studentized with HAC estimators, see, e.g., Götze and Künsch (1996) for the sample mean and Inoue and Shintani (2004) for linear instrumental variables estimators.

Although these issues remain unresolved, and are the focus of ongoing research, we nevertheless suggest using a data-driven approach, such as the Hall et al. (1995) approach, with the caveat that the method should at this stage only be thought of as providing a rough guide for block size selection.

5. CONCLUDING REMARKS

We have provided a test that allows for the joint comparison of multiple misspecified conditional interval models for the case of dependent observations and for the case where accuracy is measured using a distributional analog of mean square error. We also outlined the construction of valid asymptotic critical values based on a version of the block bootstrap that properly takes into account the contribution of parameter estimation error. A small number of Monte Carlo experiments were also run to assess the finite-sample properties of the test, and the results indicate that the test does not have unreasonable finite-sample properties given very small samples of 60 and 120 observations, although they do suggest that larger samples should likely be used in empirical applications of the test.

APPENDIX

Proof of Theorem 1. Recall that

[…]

Thus, from (5),

[…]

where […]. Note that, given Assumptions A(i) and (iii), for $i = 1,\ldots,m$,

[…]

where $A(\theta_i^\dagger) = \big(E(-\nabla^2_{\theta_i} \ln f_i(Y_t|Z^t,\theta_i^\dagger))\big)^{-1}$. Thus, $Z_T(1,k)$ converges in distribution to a normal random variable with variance equal to $c_{kk}$. The statement in Theorem 1 then follows as a straightforward application of the Cramér–Wold device and the continuous mapping theorem. █

Proof of Corollary 2. Immediate from the proof of Theorem 1. █

Proof of Theorem 3. In the discussion that follows, $P^*$, $E^*$, and $\mathrm{Var}^*$ denote the probability law of the resampled series, conditional on the sample, and the expectation and variance operators associated with $P^*$, respectively. By the notation $o_{P^*}(1)$ Pr-$P$ and $O_{P^*}(1)$ Pr-$P$, we mean a term approaching zero in $P^*$-probability and a term bounded in $P^*$-probability, respectively, conditional on the sample and for all samples except a set with probability measure approaching zero. Write $Z^*_{T,u}(1,k)$ as

where […]. Now,

[…]

as […], by Theorem 2.2 in Goncalves and White (2004), and […], as it converges in $P^*$-distribution and because the term in square brackets is $O_{P^*}(1)$ Pr-$P$. Thus, $Z_T^*(1,k)$ can be written as

[…] (A.2)

We begin by showing that for i = 1,…,m, conditional on the sample and for all samples except a set of probability measure approaching zero:

(a) The portion of (A.2) preceding the first −2/T term has the same limiting distribution (Pr-P) as

(b) The third and fourth lines of (A.2) (from the second −2/T term on) have the same limiting distribution (Pr-P) as

We begin by showing (a). Given the block resampling scheme described in Section 3.2, it is easy to see that

For notational simplicity, just set $\underline{u} = -\infty$. Needless to say, the same argument applies to any generic $\underline{u} < \overline{u}$. Recalling that each block, conditional on the sample, is i.i.d.,

where the last equality follows from Theorem 1 in Andrews (1991), given Assumption A and given the growth rate conditions on l. Therefore, given Assumption A, by Theorem 3.5 in Künsch (1989), (a) holds.

We now need to establish (b). First, note that given the mixing and domination conditions in Assumption A, from Lemmas 4 and 5 in Goncalves and White (2004), it follows that

Thus, we can write the sum of the last two terms in equation (A.2) as

Also, by Theorem 2.2 in Goncalves and White (2004), there exists an ε > 0 such that

Thus,

has the same asymptotic normal distribution as

, conditional on the sample and for all samples except a set with probability measure approaching zero. Finally, again by the same argument used in Lemmas A4 and A5 in Goncalves and White (2004),

where $m_{\theta_i^\dagger}' = E\big(\nabla_{\theta_i} F_i(\overline{u}|Z^t,\theta_i^\dagger)\big(1\{Y_t \le \overline{u}\} - F_i(\overline{u}|Z^t,\theta_i^\dagger)\big)\big)$. Needless to say, the corresponding terms for model $k$ can be treated in the same manner. Thus, $Z_T^*(1,k)$ has the same limiting distribution as $Z_T(1,k)$, conditional on the sample and for all samples except a set with probability measure approaching zero. █

REFERENCES

Altissimo, F. & A. Mele (2002) Testing the Closeness of Conditional Densities by Simulated Nonparametric Methods. Working paper, LSE.
Andrews, D.W.K. (1991) Heteroskedasticity and autocorrelation consistent covariance matrix estimation. Econometrica 59, 817–858.
Andrews, D.W.K. (1997) A conditional Kolmogorov test. Econometrica 65, 1097–1128.
Andrews, D.W.K. (2002) Higher-order improvements of a computationally attractive k-step bootstrap for extremum estimators. Econometrica 70, 119–162.
Bai, J. (2003) Testing parametric conditional distributions of dynamic models. Review of Economics and Statistics 85, 531–549.
Benjamini, Y. & Y. Hochberg (1995) Controlling the false discovery rate: A practical and powerful approach to multiple testing. Journal of the Royal Statistical Society, Series B 57, 289–300.
Chang, Y.S., J.F. Gomes, & F. Schorfheide (2002) Learning-by-doing as a propagation mechanism. American Economic Review 92, 1498–1520.
Chatfield, C. (1993) Calculating interval forecasts. Journal of Business & Economic Statistics 11, 121–135.
Christoffersen, P.F. (1998) Evaluating interval forecasts. International Economic Review 39, 841–862.
Christoffersen, P. & F.X. Diebold (2000) How relevant is volatility forecasting for financial risk management? Review of Economics and Statistics 82, 12–22.
Clements, M.P. & N. Taylor (2001) Bootstrapping prediction intervals for autoregressive models. International Journal of Forecasting 17, 247–276.
Corradi, V. & N.R. Swanson (2004a) Bootstrap Procedures for Recursive Estimation Schemes with Application to Forecast Model Selection. Working paper, Rutgers University.
Corradi, V. & N.R. Swanson (2004b) Predictive Density Accuracy Tests. Working paper, Rutgers University.
Corradi, V. & N.R. Swanson (2005a) Bootstrap conditional distribution tests in the presence of dynamic misspecification. Journal of Econometrics, forthcoming.
Corradi, V. & N.R. Swanson (2005b) Evaluation of dynamic stochastic general equilibrium models based on distributional comparison of simulated and historical data. Journal of Econometrics, forthcoming.
Diebold, F.X., T. Gunther, & A.S. Tay (1998) Evaluating density forecasts with applications to finance and management. International Economic Review 39, 863–883.
Diebold, F.X., J. Hahn, & A.S. Tay (1999) Multivariate density forecast evaluation and calibration in financial risk management: High frequency returns on foreign exchange. Review of Economics and Statistics 81, 661–673.
Diebold, F.X., A.S. Tay, & K.D. Wallis (1998) Evaluating density forecasts of inflation: The Survey of Professional Forecasters. In R.F. Engle & H. White (eds.), Festschrift in Honor of C.W.J. Granger, pp. 76–90. Oxford University Press.
Duffie, D. & J. Pan (1997) An overview of value at risk. Journal of Derivatives 4, 7–49.
Fernandez-Villaverde, J. & J.F. Rubio-Ramirez (2004) Comparing dynamic equilibrium models to data. Journal of Econometrics 123, 153–180.
Gallant, A.R. & H. White (1988) A Unified Theory of Estimation and Inference for Nonlinear Dynamic Models. Blackwell.
Giacomini, R. (2002) Comparing Density Forecasts via Weighted Likelihood Ratio Tests: Asymptotic and Bootstrap Methods. Working paper, University of California, San Diego.
Giacomini, R. & I. Komunjer (2005) Evaluation and combination of conditional quantile forecasts. Journal of Business and Economic Statistics, forthcoming.
Goncalves, S. & H. White (2002) The bootstrap of the mean for dependent and heterogeneous arrays. Econometric Theory 18, 1367–1384.
Goncalves, S. & H. White (2004) Maximum likelihood and the bootstrap for nonlinear dynamic models. Journal of Econometrics 119, 199–219.
Götze, F. & H.R. Künsch (1996) Second-order correctness of the blockwise bootstrap for stationary observations. Annals of Statistics 24, 1914–1933.
Granger, C.W.J., H. White, & M. Kamstra (1989) Interval forecasting—An analysis based upon ARCH-quantile estimators. Journal of Econometrics 40, 87–96.
Hall, P. & J.K. Horowitz (1996) Bootstrap critical values for tests based on generalized method of moments estimators. Econometrica 64, 891–916.
Hall, P., J.K. Horowitz, & N.J. Jing (1995) On blocking rules for the bootstrap with dependent data. Biometrika 82, 561–574.
Hall, A.R. & A. Inoue (2003) The large sample behavior of the generalized method of moments estimator in misspecified models. Journal of Econometrics 114, 361–394.
Hansen, P.R. (2005) An unbiased test for superior predictive ability. Journal of Business and Economic Statistics, forthcoming.
Hochberg, Y. (1988) A sharper Bonferroni procedure for multiple significance tests. Biometrika 75, 800–803.
Hong, Y. (2001) Evaluation of Out-of-Sample Probability Density Forecasts with Applications to S&P 500 Stock Prices. Working paper, Cornell University.
Hong, Y.M. & H. Li (2005) Out of sample performance of spot interest rate models. Review of Financial Studies 18, 37–84.
Inoue, A. & M. Shintani (2004) Bootstrapping GMM estimators for time series. Journal of Econometrics, forthcoming.
Kitamura, Y. (2002) Econometric Comparisons of Conditional Models. Working paper, University of Pennsylvania.
Künsch, H.R. (1989) The jackknife and the bootstrap for general stationary observations. Annals of Statistics 17, 1217–1241.
Lahiri, S.N. (2003) Resampling Methods for Dependent Data. Springer-Verlag.
Li, F. & G. Tkacz (2004) A consistent test for conditional density functions with time dependent data. Journal of Econometrics, forthcoming.
Linton, O., E. Maasoumi, & Y.J. Whang (2003) Consistent testing for stochastic dominance under general sampling schemes. Review of Economic Studies, forthcoming.
Politis, D.N., J.P. Romano, & M. Wolf (1999) Subsampling. Springer-Verlag.
Schorfheide, F. (2000) Loss function based evaluation of DSGE models. Journal of Applied Econometrics 15, 645–670.
Spanos, A. (1999) Probability Theory and Statistical Inference: Econometric Modelling with Observational Data. Cambridge University Press.
Vuong, Q. (1989) Likelihood ratio tests for model selection and non-nested hypotheses. Econometrica 57, 307–333.
Whang, Y.J. (2000) Consistent bootstrap tests of parametric regression functions. Journal of Econometrics 15, 27–46.
Whang, Y.J. (2001) Consistent specification testing for conditional moment restrictions. Economics Letters 71, 299–306.
White, H. (1982) Maximum likelihood estimation of misspecified models. Econometrica 50, 1–25.
White, H. (2000) A reality check for data snooping. Econometrica 68, 1097–1126.
Zheng, J.X. (2000) A consistent test of conditional parametric distribution. Econometric Theory 16, 667–691.