1. INTRODUCTION
In this paper, we consider tests of serial correlation designed for vector time series models. In univariate time series, many tests for serial correlation have been proposed (e.g., Durbin and Watson, 1950, 1951; Box and Pierce, 1970; Ljung and Box, 1978; Robinson, 1991a; Hong, 1996; Andrews and Ploberger, 1996; Paparoditis, 2000a, 2000b; Lee and Hong, 2001). Testing for serial correlation in multivariate time series appears equally important, because real applications often involve several variables and their potentially complex interactions. Statistical inference and testing procedures in a multivariate setting are discussed in Chitturi (1974, 1976), Hosking (1980, 1981a, 1981b), Li and McLeod (1981), Lütkepohl (1993), Duchesne and Roy (2004), and Paparoditis (2005), among others.
The approach in Duchesne and Roy (2004) relies on a normalized version of a quadratic distance between two multivariate spectral densities. Using a kernel-based spectral density estimator for a d-dimensional process x, the resulting test statistic provides a generalization of the so-called multivariate portmanteau statistic considered in Hosking (1980). See also Chitturi (1974) and Li and McLeod (1981). The test statistic of Duchesne and Roy (2004) is based on a kernel function. The truncated uniform or rectangular kernel delivers a normalized version of Hosking's test. Also, when d = 1, the test reduces to the Hong (1996) test for serial correlation.
The spectral density may exhibit various forms of irregularities for many time series encountered in practice. For example, strong serial dependence or seasonality may produce peaks or bumps in the spectral density at seasonal frequencies. It is well known that, in univariate time series, kernel estimators tend to underestimate the mode of the spectral density (e.g., Priestley, 1981; Lee and Hong, 2001). To circumvent that problem, Lee and Hong (2001) considered wavelet-based spectral density estimation for univariate time series. The first objective of this paper is to extend the Lee and Hong (2001) approach to vector time series models. First, we give the wavelet representation of the multivariate spectral density. Then, we estimate that representation directly. The test is constructed using a normalized metric similar to the one studied in Duchesne and Roy (2004). However, the comparison is made between the multivariate wavelet-based spectral density estimator and the spectral density under the null hypothesis of noncorrelation. With the multivariate wavelet-based estimator, we expect more accurate estimation of spectral densities characterized by irregular features, such as peaks, kinks, or spikes, because in such circumstances wavelet-based estimators yield more powerful procedures than kernel-based methods. However, if the spectral density displays regular behavior, the kernel-based procedure is expected to perform well. The new wavelet-based test should be an interesting complement to the test procedure of Duchesne and Roy (2004).
A fixed lag order p, say, must be determined a priori to compute the Box–Pierce/Hosking test statistics. This is true for many popular portmanteau procedures. For example, in vector autoregressive (VAR) models, the Hosking (1980) test statistic relies on the first p autocorrelation matrices. In real applications, a frequent procedure with Box–Pierce/Hosking type test statistics consists of performing individual tests at the 5% nominal level with many values of p: for example, computing the test statistic for each p = 6,12,18. Statistically, this procedure is not recommended, because computing several test statistics at the 5% level does not give a statistical procedure with an overall nominal level of 5%. Obviously, Bonferroni's procedure or similar adjustments could be performed, but because the individual tests based on different p's are not usually independent, only an upper bound for the overall level is obtained (the exact level being unknown). The Duchesne and Roy (2004) procedure relies on a kernel-based spectral density estimator, and it permits a data-dependent choice of the smoothing parameter. Consequently, as a second objective of the present paper, we propose and justify a suitable data-driven method for choosing the finest scale of the new wavelet-based estimator; the finest scale represents the smoothing parameter in the wavelet estimation context. As in kernel-based spectral density estimation, it seems highly desirable, in practical applications, to have an objective choice of the finest scale. Lee and Hong (2001) considered an algorithm drawn from Walter (1994) but without justification. In particular, the choice of the finest scale was not taken into account in the asymptotic analysis. In this paper, we state the precise conditions under which the asymptotic distribution of the new multivariate test will remain unaffected by a data-dependent method. Furthermore, we establish the integrated mean squared error (MSE) of the wavelet spectral density estimator, and finally we discuss a particular implementation: a plug-in method of the parametric type. Because our test statistic, say Wn(Ĵ), where Ĵ is the data-dependent finest scale estimator, possesses a rigorously established asymptotic distribution, we can perform hypothesis testing at a chosen nominal level.
The organization of the paper is as follows. In Section 2, some preliminaries are given. The wavelet-based test statistic is introduced in Section 3. It is shown to admit a standard normal distribution asymptotically, under the null hypothesis of noncorrelation. In Section 4, general results on data-dependent methods are presented. The integrated MSE of the spectral density estimator is established for the case when the process admits a general dependence structure. As an application of this result, we propose and justify a particular data-driven method using the (parametric) plug-in technique. In Section 5, we present the results of a small Monte Carlo experiment conducted to study the exact level and power of the tests for finite samples. We compare the power of the kernel-based and wavelet-based estimators using data-dependent techniques for the choice of the smoothing parameter and the choice of the finest scale. We conclude in Section 6 with some remarks, and the Appendix contains the proofs of our results.
2. PRELIMINARIES
Let x = {xt, t ∈ ℤ} be a multivariate second-order stationary process of dimension d. We assume that x has mean μ and a regular covariance matrix E[(xt − μ)(xt − μ)⊤] = Σ.
2.1. Temporal and Spectral Notations
The autocovariance at lag j will be denoted by Γx(j) = E[(xt − μ)(xt−j − μ)⊤], where j ∈ ℤ. If we write Γx(j) = [Γx,pq(j)]p,q=1,…,d, the multivariate spectral density f(ω) of x is defined by

f(ω) = (2π)^{−1} ∑_{j=−∞}^{∞} Γx(j)e^{−ijω}, ω ∈ [−π,π].  (1)

A sufficient condition for f(ω) to be a multivariate spectral density is the absolute summability of the cross- and autocovariances (see, e.g., Fuller, 1996, pp. 169–170): ∑_{j=−∞}^{∞} ∥Γx(j)∥ < ∞.
In general, f(·) is complex valued. We denote f̄(·) the complex conjugate of f(·) and f*(·) the transposed complex conjugate of f(·), that is, f*(·) = f̄⊤(·) (for a matrix A, A* will denote the transposed conjugate of A, i.e., A* = Ā⊤). It is well known that f is a Hermitian matrix, meaning that f*(ω) = f(ω).
Whenever the existence of fourth-order moments is required, we shall suppose that process x is fourth-order stationary, and the fourth-order moments and cumulants will be denoted, respectively, by μpqrs(j,k,l) = E[xt(p)xt+j(q)xt+k(r)xt+l(s)] and κpqrs(j,k,l), where j,k,l ∈ ℤ and p,q,r,s ∈ {1,…,d}. Imposing the existence of the fourth-order moments is equivalent to imposing the existence of the fourth-order cumulants. If process x is Gaussian, it is well known that the fourth-order cumulants vanish.
2.2. Wavelet Analysis of the Spectral Density
Wavelet analysis is a mathematical technique generalizing Fourier analysis; it can be used to decompose a function g(·) in the L2(ℝ) space. Using the multiresolution analysis first developed by Mallat (1989) and Meyer (1992), any square-integrable function g admits the following decomposition:

g(ω) = ∑_{k} α_{j0k}φ_{j0k}(ω) + ∑_{j=j0}^{∞} ∑_{k} β_{jk}ψ_{jk}(ω),  (2)

where j0 corresponds to a cutoff resolution level and the wavelet coefficients are given by

α_{j0k} = ∫_{−∞}^{∞} g(ω)φ_{j0k}(ω) dω,  β_{jk} = ∫_{−∞}^{∞} g(ω)ψ_{jk}(ω) dω.  (3)

The function ψ(·) is called the mother wavelet, and the function φ(·) represents the father wavelet. Given {φ(·),ψ(·)}, we obtain {φjk(·),ψjk(·)} using the relations

φ_{jk}(x) = 2^{j/2}φ(2^{j}x − k),  ψ_{jk}(x) = 2^{j/2}ψ(2^{j}x − k),  (4)

where j,k ∈ ℤ. The system {φjk(·),ψjk(·)} constitutes a complete orthonormal basis for the L2(ℝ) space.
The Fourier transforms of the functions ψ(·) and φ(·) are defined by

ψ̂(z) = (2π)^{−1/2} ∫_{−∞}^{∞} ψ(x)e^{−izx} dx,  φ̂(z) = (2π)^{−1/2} ∫_{−∞}^{∞} φ(x)e^{−izx} dx,

where i denotes the complex number √−1 (note that, following the traditional wavelet literature, ψ̂(·) is the Fourier transform of ψ(·), not an estimator; because ψ(·) is not a random quantity, this notation should not cause confusion). The mother and father wavelets satisfy the following assumption.
Assumption A. The mother wavelet ψ(·) is orthonormal such that {ψjk(·), j,k ∈ ℤ} is an orthonormal family in L2(ℝ) and ∫|ψ(x)| dx < ∞. The father wavelet φ(·) satisfies ∫|φ(x)| dx < ∞ and ∫φ(x) dx = 1; it is chosen such that the modulus of its Fourier transform |φ̂(z)| is continuous on ℝ, with φ̂(z) = 0 if |z| > π.
The main motivation for the wavelet appellation is the property ∫ψ(x) dx = 0: the function ψ oscillates and is well localized, and by appropriate scaling such localization can be made arbitrarily fine. The integer j in φjk(·) or ψjk(·) is called the scale parameter (or a resolution level), and the integer k represents a translation parameter. Priestley (1996) gives useful comparisons between Fourier and wavelet analysis; he provides a heuristic interpretation of j as a frequency parameter and of k as related to time.
By Assumption A, the Fourier transforms of the functions ψ(·) and φ(·) exist. Note that, according to Proposition 2.17 of Hernández and Weiss (1996), Assumption A implies that

∑_{j=−∞}^{∞} |ψ̂(2^{j}z)|^{2} = (2π)^{−1}, for almost all z ∈ ℝ.

We assume the following smoothness condition on ψ̂(·).
Assumption B. (i) The mother wavelet satisfies |ψ̂(z)| ≤ C min{|z|^{q}, (1 + |z|)^{−τ}}, for some q ∈ [1,∞), τ > 1/2, and C ∈ (0,∞). (ii) ψ̂(z) = e^{iz/2}b(z) or ψ̂(z) = ie^{iz/2}b(z), where b(·) is a real-valued function.
In Assumption B(i), we require that ψ̂(·) possess some regularity at z = 0 and sufficiently fast decay as z → ∞. Examples of wavelets that satisfy Assumption B(ii) are the Franklin and Meyer wavelets; in fact, most commonly used wavelets satisfy this assumption. The Franklin wavelet ψF(·) is defined via its Fourier transform ψ̂F(·), given in (5).
Franklin wavelets have unbounded support and correspond to infinite impulse response filters. For fundamental notions of signal processing, see Vidakovic (1999). The Meyer wavelet, denoted ψM(·), is another example of a wavelet satisfying Assumption B; it has compact support in the frequency domain and fast decay in the time domain. The Fourier transform of the Meyer wavelet is given by

ψ̂M(z) = (2π)^{−1/2}e^{iz/2} sin[(π/2)ν(3|z|/(2π) − 1)], if 2π/3 ≤ |z| ≤ 4π/3,
ψ̂M(z) = (2π)^{−1/2}e^{iz/2} cos[(π/2)ν(3|z|/(4π) − 1)], if 4π/3 < |z| ≤ 8π/3,  (6)
ψ̂M(z) = 0, otherwise.

The function ν(·) satisfies ν(x) = 0, x ≤ 0, and ν(x) = 1, x ≥ 1, with the additional property ν(x) + ν(1 − x) = 1 for x ∈ [0,1]. See Daubechies (1992, pp. 116–117) for additional details on the Meyer (1986) construction. Meyer's ν-polynomials are given in Vidakovic (1999, p. 66), up to order five. Lee and Hong (2001) adopted a second-order ν-polynomial, defined by ν(x) = x2(3 − 2x), x ∈ [0,1]. Hernández and Weiss (1996, Sect. 2.2) give additional examples of wavelets satisfying Assumption B(ii); Vidakovic (1999), among others, gives further examples of wavelet functions.
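Because the Meyer construction is fully explicit once a ν-polynomial is chosen, ψ̂M can be evaluated numerically in a few lines. The sketch below uses the second-order polynomial ν(x) = x2(3 − 2x) adopted by Lee and Hong (2001); the piecewise form and the phase factor follow the standard parameterization in Daubechies (1992, pp. 116–117), under the Fourier transform convention with the (2π)−1/2 factor, which is an assumption of this sketch.

    import numpy as np

    def nu(x):
        # Second-order Meyer nu-polynomial: 0 below 0, 1 above 1,
        # nu(x) = x^2 (3 - 2x) on [0, 1].
        x = np.clip(x, 0.0, 1.0)
        return x * x * (3.0 - 2.0 * x)

    def meyer_psi_hat(z):
        # Fourier transform of the Meyer mother wavelet; it vanishes
        # outside 2*pi/3 <= |z| <= 8*pi/3, so it is flat (zero) near the
        # origin, consistent with lambda_q = 0 for the Meyer wavelet.
        z = np.atleast_1d(np.asarray(z, dtype=float))
        a = np.abs(z)
        phase = np.exp(1j * z / 2.0) / np.sqrt(2.0 * np.pi)
        lo = (a >= 2.0 * np.pi / 3.0) & (a <= 4.0 * np.pi / 3.0)
        hi = (a > 4.0 * np.pi / 3.0) & (a <= 8.0 * np.pi / 3.0)
        band1 = np.sin(0.5 * np.pi * nu(3.0 * a / (2.0 * np.pi) - 1.0))
        band2 = np.cos(0.5 * np.pi * nu(3.0 * a / (4.0 * np.pi) - 1.0))
        return phase * (np.where(lo, band1, 0.0) + np.where(hi, band2, 0.0))

    print(np.abs(meyer_psi_hat([0.5, 2.5, 5.0, 9.0])).round(4))
    # -> [0.     0.0612 0.3942 0.    ]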
We shall now discuss the wavelet representation of the multivariate spectral density f given in (1). Because f is a 2π-periodic function, it is not square integrable on ℝ. However, a 2π-periodic basis {Φjk(·),Ψjk(·)} can be constructed using the so-called periodization technique:

Φ_{jk}(ω) = (2π)^{−1/2} ∑_{m=−∞}^{∞} φ_{jk}(ω/(2π) + m),  Ψ_{jk}(ω) = (2π)^{−1/2} ∑_{m=−∞}^{∞} ψ_{jk}(ω/(2π) + m),  ω ∈ [−π,π],  (7)

which produces an orthonormal basis for the L2[−π,π] space. Note that, using the definition of the Fourier transform,

ψ̂_{jk}(z) = 2^{−j/2}e^{−izk/2^j}ψ̂(2^{−j}z),

and, by the periodization technique and a change of variable,

Ψ_{jk}(ω) = ∑_{h=−∞}^{∞} ψ̂_{jk}(2πh)e^{ihω}.  (8)

Similar relations hold for Φjk(·). For more details, see Lee and Hong (2001, p. 392).
Let the cutoff point j0 in (2) equal zero. Assumption A and relations (3), (4), (7), and (8) imply that Φ01(ω) = (2π)−1/2 and β01 = (2π)−1/2Γx(0). Consequently, f can be written as

f(ω) = (2π)^{−1}Γx(0) + ∑_{j=0}^{∞} ∑_{k=1}^{2^j} A_{jk}Ψ_{jk}(ω), ω ∈ [−π,π].  (9)

The quantity Ajk is called the wavelet coefficient; it corresponds to a d × d matrix. It is given by

A_{jk} = ∫_{−π}^{π} f(ω)Ψ_{jk}*(ω) dω = ∑_{h=−∞}^{∞} Γx(h)ψ̂_{jk}*(2πh),

where ψ̂jk(·) is the Fourier transform of ψjk(·) and ψ̂jk*(·) denotes the complex conjugate of ψ̂jk(·).
Using classical Fourier analysis, the Fourier coefficients reduce to the autocovariances; it is well known that they depend on the global behavior of the spectral density on [−π,π]. See Brockwell and Davis (1991, Sect. 11.6), among others. However, the wavelet coefficient Ajk depends on the local behavior of the spectral density, because Ψjk(ω) reflects the local behavior of f in an interval of width 2π/2j centered at k/2j. This explains why wavelet analysis represents an appropriate tool when the spectral density admits irregular features; the wavelet coefficients should describe these local characteristics more adequately. Naturally, the spectral density relies on the theoretical autocovariance function, which is unknown in practice. In the next section, we discuss estimation of representation (9) in finite samples.
2.3. Wavelet-Based Spectral Density Estimator
Given that x1,…,xn is a realization of length n of process x, the sample autocovariance at lag j, 0 ≤ |j| ≤ n − 1, is defined by

Cx(j) = n^{−1} ∑_{t=j+1}^{n} (xt − x̄)(xt−j − x̄)⊤, 0 ≤ j ≤ n − 1,  Cx(j) = Cx(−j)⊤, 1 − n ≤ j < 0,  (10)

where x̄ = n^{−1} ∑_{t=1}^{n} xt represents the sample average. A natural estimator for the wavelet coefficient Ajk is given by

Â_{jk} = ∫_{−π}^{π} Ĩn(ω)Ψ_{jk}*(ω) dω = ∑_{h=1−n}^{n−1} Cx(h)ψ̂_{jk}*(2πh),

called the empirical wavelet coefficient, where Ĩn(ω) = (2π)^{−1} ∑_{h=1−n}^{n−1} Cx(h)e^{−ihω} denotes the multivariate periodogram based on the sample autocovariances. The second equality is obtained using the fact that Ψjk(ω) = ∑h ψ̂jk(2πh)e^{ihω}, relation (8).
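A compact numerical sketch of (10) and of the empirical wavelet coefficient follows. The normalization of Âjk below is patterned on the scalar formulas of Lee and Hong (2001), namely Âjk = ∑h Cx(h)ψ̂jk*(2πh) with ψ̂jk(z) = 2−j/2e−izk/2jψ̂(2−jz); the exact constant should be read as illustrative rather than definitive.

    import numpy as np

    def sample_autocov(x):
        # C_x(h), h = 0, ..., n-1, as in (10):
        # C_x(h) = n^{-1} sum_{t=h+1}^{n} (x_t - xbar)(x_{t-h} - xbar)^T.
        n, d = x.shape
        xc = x - x.mean(axis=0)
        return np.array([xc[h:].T @ xc[:n - h] / n for h in range(n)])

    def empirical_wavelet_coeff(C, j, k, psi_hat):
        # A_hat_{jk} = sum_{h=1-n}^{n-1} C_x(h) conj(psi_hat_{jk}(2 pi h)),
        # with C_x(-h) = C_x(h)^T; psi_hat is the Fourier transform of the
        # mother wavelet (e.g., meyer_psi_hat above).
        n, d = C.shape[0], C.shape[1]
        A = np.zeros((d, d), dtype=complex)
        for h in range(1 - n, n):
            Ch = C[h] if h >= 0 else C[-h].T
            z = 2.0 * np.pi * h
            pjk = (2.0 ** (-j / 2.0) * np.exp(-1j * z * k / 2.0 ** j)
                   * psi_hat(z / 2.0 ** j))
            A += Ch * np.conj(pjk)
        return A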
We consider the following wavelet-based spectral density estimator:

f̂(ω) = (2π)^{−1}Cx(0) + ∑_{j=0}^{J} ∑_{k=1}^{2^j} Â_{jk}Ψ_{jk}(ω), ω ∈ [−π,π],  (11)

where J is the finest scale. Estimator (11) has been considered in Hong (2001) in the context of wavelet-based estimation of heteroskedasticity and autocorrelation consistent covariance matrices; he has shown that f̂(0) produces a consistent estimator of f(0). The parameter J ≡ Jn is nonstochastic. It plays a role similar to that of the smoothing parameter in kernel-based estimation. Intuitively, if J is large, in view of (9) and (11), the bias should be small. However, because many empirical wavelet coefficients Âjk will be included in the estimator, the variance should be higher. Inversely, a small J should provide a less variable estimator, at the price of a larger bias.
The advocated spectral density estimator (11) is linear. We do not consider threshold shrinkage as in Neumann (1996), among others. In fact, Neumann (1996) shows that a nonlinear thresholding wavelet estimator can attain a near optimal convergence rate for any spectral density in the balls of the so-called Besov space; recall that, given an interval I, the Besov space Bp,qm represents a set of functions f in Lp(I) that have smoothness m. Neumann (1996) also shows that linear wavelet estimators can attain the optimal rate of convergence if p ≥ 2. In our typical applications, the spectral density is square integrable on [−π,π] with first-order continuous derivatives; therefore, it corresponds to a continuous function. The kind of irregularities that we may encounter will be due principally to cycles, bumps, and periodicities in the spectral density. See Wei (1990, pp. 248–249) for visual illustrations. Although this provides only a heuristic argument in favor of the linear wavelet estimator, considering estimator (11) allows us to develop an asymptotic theory. In the next section, a test statistic based on a suitable quadratic norm is studied; it offers an elegant formulation in a multivariate framework and a well-established asymptotic distribution under the null hypothesis. Furthermore, in Section 4 we derive the integrated MSE of the estimator f̂ over the frequencies [−π,π]. From that latter result, a suitable J can be found to balance the squared bias and the variance according to the integrated MSE criterion.
3. THE TEST STATISTIC AND ITS ASYMPTOTIC NULL DISTRIBUTION
We now introduce the new test statistic. The hypothesis of interest states that x corresponds to a white noise against the alternative of serial correlation of unknown form. More precisely, the null and alternative hypotheses can be written as

H0: Γx(j) = 0 for all j ≠ 0,  versus  H1: Γx(j) ≠ 0 for at least one j ≠ 0.

We can formulate H0 and H1 with the spectral density f(ω) of x. The hypothesis H0 becomes f(ω) = f0(ω) = (2π)−1Γx(0), ∀ω ∈ [−π,π]. Under the null hypothesis, the spectral density reduces to the constant matrix (2π)−1Γx(0); under the alternative hypothesis, f is not identically equal to f0.
To obtain our test statistic, we first define a distance measure. We consider the following normalized quadratic distance:

Q2(f1; f2) = ∫_{−π}^{π} tr{[f1(ω) − f2(ω)]*Γx(0)^{−1}[f1(ω) − f2(ω)]Γx(0)^{−1}} dω.  (12)

The distance measure Q2(f1;f2) can be interpreted as an integrated measure over the frequencies [−π,π]. See Duchesne and Roy (2004) for more details. In the following proposition, we compare the true spectral density f of x with the spectral density f0 of x under the null hypothesis.
PROPOSITION 1. Let Q2(f;f0) be the distance measure given by (12), where f corresponds to the wavelet representation of the spectral density given by (9) and f0 = (2π)−1Γx(0). Then, we have

Q2(f; f0) = ∑_{j=0}^{∞} ∑_{k=1}^{2^j} tr{A_{jk}*Γx(0)^{−1}A_{jk}Γx(0)^{−1}}.  (13)

Proof. First note that f(ω) − f0(ω) = ∑_{j=0}^{∞} ∑_{k=1}^{2^j} A_{jk}Ψ_{jk}(ω). Define δjk = 1 if j = k, and δjk = 0 elsewhere. Using the orthonormality relation

∫_{−π}^{π} Ψ_{jk}(ω)Ψ_{j′k′}*(ω) dω = δ_{jj′}δ_{kk′},

we obtain the desired result. █
When the traditional representation of the spectral density given by (1) is used, Duchesne and Roy (2004) have shown that

Q2(f; f0) = (2π)^{−1} ∑_{j=−∞, j≠0}^{∞} tr{Γx(j)⊤Γx(0)^{−1}Γx(j)Γx(0)^{−1}}.  (14)

We can interpret the distance measure given by (14) as based on an infinite number of lags, whereas the distance measure (13), using the wavelet representation of the spectral density, depends on an infinite number of resolution levels and, consequently, on an infinite degree of approximation.
Our test statistic will be defined as a distance measure between f̂ and f̂0 = (2π)−1Cx(0). By direct calculation, it can be shown that

Q2(f̂; f̂0) = ∑_{j=0}^{J} ∑_{k=1}^{2^j} tr{Â_{jk}*Γx(0)^{−1}Â_{jk}Γx(0)^{−1}}{1 + oP(1)},

using the orthonormality of the basis {Φjk(·),Ψjk(·)} and because Cx(0) − Γx(0) = OP(n−1/2).
Because the covariance matrix Γx(0) is usually unknown, we replace that quantity by the sample covariance matrix Cx(0). It may be shown that this replacement has no effect on the asymptotic arguments that follow. The new test statistic proposed is essentially interpreted as a standardized version of 2πn ∑_{j=0}^{J} ∑_{k=1}^{2^j} tr{Â_{jk}*Cx(0)^{−1}Â_{jk}Cx(0)^{−1}}. It is defined by

Wn = [2πn ∑_{j=0}^{J} ∑_{k=1}^{2^j} tr{Â_{jk}*Cx(0)^{−1}Â_{jk}Cx(0)^{−1}} − d2(2J+1 − 1)] / [4d2(2J+1 − 1)]^{1/2}.  (15)

The quantities d2(2J+1 − 1) and 4d2(2J+1 − 1) denote essentially the mean and variance of the quadratic form appearing in (15). Using the previous approximations, it is easily shown that Wn is asymptotically equivalent to a standardized version of 2πnQ2(f̂; f̂0).
The test Wn is motivated by the distance measure Q2(f;f0), which satisfies Q2(f;f0) ≥ 0, with Q2(f;f0) = 0 if and only if f ≡ f0. Consequently, the test based on (15) rejects the null hypothesis for large values of Wn; the test statistic is one sided.
To establish the asymptotic distribution of Wn under the null hypothesis, we need the following assumption.
Assumption C. Let x = {xt, t ∈ ℤ} be a strong white noise, that is, an independent and identically distributed (i.i.d.) time series of dimension d, whose mean is E(xt) = μ and whose covariance matrix Σ is regular. We assume that the fourth moments of xt are finite.

A random sample x1,x2,…,xn of size n is available. We now state the main result of this section. The symbol →d stands for convergence in distribution.
THEOREM 1. Under Assumptions A, B, and C, J → ∞, 22J/n → 0, the test statistic Wn defined by (15) follows a normal distribution asymptotically, that is, Wn →d N(0,1).
Note that to obtain Theorem 1, we do not assume that process x is Gaussian. The test statistic Wn can be used to test if x1,…,xn are independent when process x is Gaussian. In general, the test statistic Wn allows us to check for the hypothesis H0 of no serial correlation with non-Gaussian observations. Because Wn is a one-sided test statistic, it rejects the null hypothesis at level α if Wn > z1−α, where z1−α denotes the quantile of order 1 − α of the standard normal distribution.
Our test statistic reduces to the Lee and Hong (2001) test when d = 1, because the matrices involved become scalar quantities and the tr(·) of a scalar reduces to that scalar. Lee and Hong established the consistency of their test, and an asymptotic power analysis shows that there is no optimal choice of the wavelet function, because the asymptotic power does not depend on ψ(·). This contrasts strongly with the multivariate procedure of Duchesne and Roy (2004), because a local power analysis makes it possible to derive the asymptotic power, which is a function of the kernel; the maximization of that power in a certain class of kernel functions leads to an optimal choice of the kernel: the Daniell kernel.
For the validity of the asymptotic normal distribution in Theorem 1, the finest scale needs to go to infinity at a slower rate than n. In some sense, that parameter plays the same role as does the bandwidth in kernel-based spectral density estimation. However, the interpretation is fundamentally different. When testing for serial correlation using the kernel-based spectral density estimator, if the kernel possesses a compact support, then the bandwidth admits an easy interpretation; it corresponds to a lag order. For example, for the multivariate serial correlation test of Duchesne and Roy (2004) based on the uniform truncated kernel, the bandwidth determines the number of lags to be considered in the test statistic. Here, J corresponds to the finest resolution level, and this parameter determines the degree of approximation of the spectral density to be estimated; it is not a lag order. In view of this, the selection of J represents an important practical consideration (note that, for practical reasons, the same can be said for the choice of the bandwidth in kernel-based estimation; see Robinson, 1991b, p. 1332). Lee and Hong (2001) considered the data-driven method of Walter (1994) for choosing the finest scale. However, it seems unknown whether the asymptotic distribution of their test statistic remains unchanged when specifying J with Walter's procedure, because, to our knowledge, no formal result on the rate of Walter's data-driven finest scale has been derived in the literature. In the next section, we discuss the adaptive selection of J using the plug-in method. We show, in particular, that with our data-driven method, the asymptotic distribution of our test statistic, say Wn(Ĵ), for a stochastic Ĵ obtained with the plug-in method, is still normal under the null hypothesis. This makes our procedure fully operational in practice.
4. ADAPTIVE CHOICE OF THE FINEST SCALE
4.1. A General Result for the Estimation of the Finest Scale
As in kernel-based spectral density estimation, the wavelet-based spectral density estimator given by (11) relies on a "smoothing parameter," the finest scale J. The choice of the finest scale may affect the power of the test statistic. However, it seems difficult to select an optimal J to maximize the power. Furthermore, because J does not correspond to a lag order, it may be hard to exploit specific knowledge of the alternative hypothesis. Therefore, it appears highly desirable to choose J via suitable automatic methods, which may also reveal valuable information on the unknown alternatives. Let Wn ≡ Wn(J). Theorem 2, which follows, does not concentrate on a particular data-driven method. It gives appropriate conditions on the data-driven finest scale, denoted Ĵ, under which Wn(Ĵ) − Wn(J) →p 0 under the null hypothesis. In particular, a standard application of Slutsky's theorem allows us to conclude that Wn(Ĵ) →d N(0,1) under H0, for Ĵ satisfying those conditions.
THEOREM 2. Let Ĵ be a data-dependent finest scale such that

|2Ĵ/2J − 1| = oP(2−J/2).  (17)

Under Assumptions A, B, and C, J → ∞, 22J/n → 0, the test statistic Wn ≡ Wn(J) defined by (15) is such that Wn(Ĵ) − Wn(J) →p 0.
Intuitively, because J → ∞, condition (17) stipulates that Ĵ approaches J in probability as n → ∞, with the specified convergence rate. According to Theorem 2, if Ĵ converges to J at a suitable rate, then the estimation of J will have no impact on the asymptotic distribution of the test statistic. Condition (17) in Theorem 2 does not impose overly severe restrictions. For example, if 2J ∝ n2ν, ν > 0, then the condition becomes |2Ĵ/2J − 1| = oP(n−ν). Theorem 2 will be of interest if we can find a data-driven method satisfying condition (17). Before exhibiting such a data-dependent Ĵ, we establish, in the next section, the integrated MSE of (11).
4.2. Derivation of the Integrated Mean Squared Error of the Wavelet-Based Spectral Density Estimator
In this section, we derive the asymptotic integrated MSE of f̂ over the entire frequency domain [−π,π]. Although this result is of interest in its own right, it will also suggest, in Section 4.3, a suitable method for using the data to select the finest scale automatically. We consider the following definition for the MSE:

MSE[f̂(ω)] = E{vec[f̂(ω) − f(ω)]*W vec[f̂(ω) − f(ω)]}.  (18)

When ω = 0, the MSE defined in (18) is similar to the MSE considered in Andrews (1991) and Hong (2001). The integrated MSE is then obtained as

∫_{−π}^{π} MSE[f̂(ω)] dω.  (19)

The matrix W corresponds to a d2 × d2 nonstochastic positive semidefinite (psd) weight matrix. We need the following assumption.
Assumption D. Assume that the dependence structure of process x is such that ∑j ∥Γx(j)∥2 ≤ C and suppose that the following cumulant condition is satisfied:

∑_{j=−∞}^{∞} ∑_{k=−∞}^{∞} ∑_{l=−∞}^{∞} |κpqrs(j,k,l)| ≤ C,

where p,q,r,s ∈ {1,…,d}. We suppose that there exists φ ∈ (0,1) such that ∥Γx(h)∥ ≤ Cφ|h|, and we assume that ∑j |j|q∥Γx(j)∥ ≤ C, where q ∈ [1,∞) is as in Assumption B.
The cumulant condition of Assumption D allows for autocorrelation. Fourth-order stationary linear processes with absolutely summable coefficients and innovations whose fourth moments exist satisfy the cumulant condition. See Hannan (1970, p. 211). Finally, when process x is Gaussian, the fourth-order cumulants are zero; the cumulant condition is also satisfied in that latter case.
Define the generalized derivative of f(ω),

f^{(q)}(ω) = (2π)^{−1} ∑_{h=−∞}^{∞} |h|^{q}Γx(h)e^{−ihω}.

Note that, when q is an even integer, f(q)(ω) = (−1)q/2 dqf(ω)/dωq. Put λq equal to the constant characterizing the smoothness of ψ̂(·) at zero. Because the Fourier transform of the Meyer wavelet is flat in a neighborhood of 0, λq ≡ 0 for q > 0. For the Franklin wavelet defined in (5), we have q = 2 and λ2 = π4/45. As will be shown subsequently, the integrated MSE contains two conflicting factors, a factor corresponding to the asymptotic variance and another to the squared bias. The asymptotic bias is a function of the generalized derivative and of the factor λq; these quantities are related to the smoothness of the spectral density and to the smoothness of ψ̂(·) at z = 0. We now state the theorem in which the integrated MSE is derived.
THEOREM 3. Suppose that Assumptions A, B, and D hold, λq ∈ (0,∞), J → ∞, 2J/n → 0 as n → ∞. Then

∫_{−π}^{π} MSE[f̂(ω)] dω = (2^{J+1}/n) ∫_{−π}^{π} tr{W[f̄(ω) ⊗ f(ω)]} dω + λq2^{−2q(J+1)} ∫_{−π}^{π} vec[f^{(q)}(ω)]*W vec[f^{(q)}(ω)] dω + o(2J/n + 2−2qJ).
Theorem 3 gives the integrated MSE over [−π,π] in the vector case of the wavelet spectral density estimator. Other related results include the MSE at a given frequency ω of a kernel-based spectral density estimator, derived in Parzen (1957) in the scalar case and in Hannan (1970, pp. 280, 283) and Andrews (1991, p. 825) in the multivariate case. The integrated MSE of a scalar kernel-based spectral density estimator is given in Robinson (1991b, pp. 1343–1344). Finally, Hong (2001) gives the MSE of the multivariate wavelet spectral density estimator at the frequency ω = 0. Then, Theorem 3 represents an extension of the pointwise result of Hong (2001).
To have power under a large class of alternatives, it seems intuitively preferable to construct test statistics based on a measure relying on all frequencies ω ∈ [−π,π]. Consequently, in testing for serial correlation, the test procedure Wn(J) is based on a distance measure that exploits the wavelet spectral density estimator over the entire frequency domain [−π,π] and not only at a single frequency. Similarly, in Theorem 3, it seems preferable to derive the formula for the asymptotic integrated MSE, because it reveals valuable information about the global behavior of the spectral density, not only the local behavior at a single frequency such as ω = 0. For example, for seasonal time series, the spectral density displays peaks at the typical seasonal frequencies, which are nonzero. The integrated MSE in Theorem 3, which relies on all frequencies, should reveal more information on seasonal behavior than an MSE based solely on the zero frequency.
4.3. On the Estimation of the Finest Scale with a Plug-In Method of the Parametric Type
In this section, we discuss applying Theorem 3 to select the finest scale according to the integrated MSE criterion. The approach advocated here is basically the same as the one Andrews (1991) proposed in a different context. To obtain the optimal J, it suffices to minimize the asymptotic integrated MSE of f̂. Computing the derivative of the asymptotic integrated MSE with respect to J and setting the derivative equal to zero yields 2J0+1 = [2qλqα0(q)n]1/(2q+1), where

α0(q) = ∫_{−π}^{π} vec[f^{(q)}(ω)]*W vec[f^{(q)}(ω)] dω / ∫_{−π}^{π} tr{W[f̄(ω) ⊗ f(ω)]} dω.  (20)

Note that we could have α0(q) = 0 (e.g., under the null hypothesis). However, to satisfy the conditions of Theorem 2, J must be such that J → ∞. This caveat can be easily fixed if we introduce the following slight modification:

α0(l)(q) = max{α0(q), l},  (21)

where l is an arbitrary fixed (with respect to n) lower bound. The introduction of (21) is similar to the modification of the cross-validation method considered in Hong and Shehadeh (1999, pp. 94–95). Then, we define

2J0′+1 = [2qλqα0(l)(q)n]1/(2q+1).  (22)

This J0′ is nevertheless unfeasible, because α0(l)(q) is unknown under the alternative hypothesis. However, we can produce some sort of estimator, say α̂(l)(q) ≡ max{α̂(q), l}, for α0(l)(q), where α̂(q) estimates α0(q). That is, an estimator is "plugged in" in place of the unknown quantity. This gives a data-driven finest scale Ĵ satisfying 2Ĵ+1 = [2qλqα̂(l)(q)n]1/(2q+1), which can be alternatively written as

2Ĵ+1 = max{[2qλqα̂(q)n]1/(2q+1), cn1/(2q+1)},  (23)

where c = (2qλql)1/(2q+1). The following corollary states that the proposed plug-in method satisfies the hypotheses of Theorem 2.
COROLLARY 1. Suppose that Assumptions A, B, C, and D hold, Ĵ is given as in (23), and α̂(q) − ϑ(q) = oP(n−1/(2q+1)) for some nonstochastic ϑ(q) ∈ (0,∞). Then Wn(Ĵ) →d N(0,1) under H0.
In Corollary 1, we require that α̂(q) converge to ϑ(q) at a rate faster than n−1/(2q+1). This ensures that the use of α̂(q) rather than ϑ(q) has no impact on Wn(J) asymptotically. The condition on α̂(q) seems rather easy to satisfy. For example, we do not require that the probability limit ϑ(q) = p lim α̂(q) equal α0(l)(q), where α0(l)(q) is as in (21). When and only when ϑ(q) = α0(l)(q) will the Ĵ in (23) converge to the J0′ in (22).
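As a minimal numerical sketch of rule (23), the following function maps an estimate α̂(q) into a finest scale, with the integer rounding J = ⌈log2(·) − 1⌉ used in Section 5.4; the default constants (q = 2, λ2 = π4/45, c = 3) anticipate the Franklin-wavelet implementation of the simulation study.

    import math

    def finest_scale(alpha_hat, n, q=2.0, lam_q=math.pi ** 4 / 45.0, c=3.0):
        # 2**(J+1) = max{ [2 q lam_q alpha_hat n]^{1/(2q+1)}, c n^{1/(2q+1)} },
        # then J = ceil(log2(.) - 1); c encodes the lower bound l through
        # c = (2 q lam_q l)^{1/(2q+1)}.
        rate = n ** (1.0 / (2.0 * q + 1.0))
        grow = max((2.0 * q * lam_q * max(alpha_hat, 0.0) * n)
                   ** (1.0 / (2.0 * q + 1.0)),
                   c * rate)
        return math.ceil(math.log2(grow) - 1.0)

    # Under the null hypothesis alpha_hat is near 0, so the bound binds:
    print(finest_scale(0.0, 100), finest_scale(0.0, 200))   # -> 2 3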
In summary, the proposed method gives a rule to select the resolution level J based on the minimization of an integrated MSE criterion. In a hypothesis testing context, a sensible criterion consists of finding the J that maximizes the power. The power-maximizing J and the J given in (22) can be expected to differ. However, this is a technically complicated issue in the present context, and a higher order asymptotic analysis would likely be needed. Our Monte Carlo experiments in Section 5 suggest that (23) delivers reasonably large power for the test Wn in finite samples. It thus seems preferable to have an automatic method that adapts J to the shape of the estimated spectral density, rather than an ad hoc procedure.
4.4. Particular Implementation of the Proposed Parametric Plug-In Method
We shall now discuss the practical estimation of the quantity α0(q). In general, expression (20) necessitates integration of a function of the matricial spectral density and of its generalized derivative of order q; in practice, multivariate estimators of f and f(q) should be plugged in. A judicious choice of the matrix W may simplify the procedure considerably. As in Andrews (1991, p. 834), we consider a matrix W that gives weight only to the diagonal elements of f. Consequently, α0(q) becomes

α0(q) = ∑_{a=1}^{d} υa ∫_{−π}^{π} [faa^{(q)}(ω)]^{2} dω / ∑_{a=1}^{d} υa ∫_{−π}^{π} faa^{2}(ω) dω,

where faa(q)(ω) and faa(ω) denote the ath diagonal elements of f(q) and f, respectively. A common choice of υa is 1 for a = 1,…,d.
Plug-in estimators may be parametric or nonparametric. Parametric plug-in estimators yield a less variable finest scale parameter than nonparametric estimators, but at the price of an asymptotic bias in the estimation of the optimal finest scale parameter; this bias arises from the approximate nature of the specified parametric model. However, under our conditions, this bias affects neither the consistency nor the rate of convergence of the multivariate wavelet-based spectral density estimator. Consequently, we advocate here a parametric AR(p(a)) model for {xt(a)}, a = 1,2,…,d:

xt(a) = ∑_{k=1}^{p(a)} φk(a)xt−k(a) + εt(a),  (24)

where we set the initial values xt(a) ≡ 0 if t ≤ 0 and var{εt(a)} = σε2(a). Suppose that φ̂k(a), k = 1,…,p(a), and σ̂ε2(a) are appropriate estimators in (24). For concreteness, we consider q = 2 here. We have

α̂0(2) = ∑_{a=1}^{d} υa ∫_{−π}^{π} [f̂aa^{(2)}(ω)]^{2} dω / ∑_{a=1}^{d} υa ∫_{−π}^{π} f̂aa^{2}(ω) dω,  (25)

where f̂aa(ω) = (σ̂ε2(a)/2π)|1 − ∑_{k=1}^{p(a)} φ̂k(a)e^{−ikω}|^{−2} is the AR(p(a)) spectral density estimator of the ath component. We also use the fact that the generalized spectral derivative satisfies f(2)(ω) = −d2f(ω)/dω2. The estimator α̂0(2) incorporates the information of faa(·) over [−π,π] rather than at 0 only. This is analogous to the work of Robinson (1991b), who considers data-driven choices of the bandwidth in multivariate kernel-based spectral density estimation over the entire frequency domain [−π,π]. We can use one-dimensional numerical integrations to compute α̂0(2). Interestingly, α̂0(2) satisfies the conditions of Corollary 1 with q = 2 because, for parametric AR(p) approximations, α̂0(2) converges to its probability limit at the rate OP(n−1/2) (Andrews, 1991). Note that in practice we can employ the Akaike information criterion/Bayesian information criterion (AIC/BIC) methods to determine p(a) (for a description of these methods, see Priestley, 1981, p. 372).
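For concreteness, the parametric plug-in step can be mocked up as follows: each component is fitted by least squares, the AR spectral densities are evaluated on a grid, and the two integrals are computed by quadrature. The ratio implements the diagonal-weight expression with υa = 1, using a numerical second derivative for f(2) = −f″; treat it as an illustrative approximation of (25) rather than the paper's exact estimator.

    import numpy as np

    def ar_fit(x, p):
        # Least-squares AR(p) fit of a scalar series; returns (phi, sigma2).
        n = len(x)
        y = x[p:]
        X = np.column_stack([x[p - k:n - k] for k in range(1, p + 1)])
        phi, *_ = np.linalg.lstsq(X, y, rcond=None)
        resid = y - X @ phi
        return phi, resid @ resid / len(resid)

    def ar_spectrum(phi, sigma2, omega):
        # f(w) = sigma2 / (2 pi) * |1 - sum_k phi_k e^{-ikw}|^{-2}.
        k = np.arange(1, len(phi) + 1)
        tf = 1.0 - np.exp(-1j * np.outer(omega, k)) @ phi
        return sigma2 / (2.0 * np.pi) / np.abs(tf) ** 2

    def alpha0_hat(series_list, p=2, m=512):
        # Ratio of integrated [f_aa^{(2)}]^2 to integrated f_aa^2 with unit
        # weights; trapezoidal quadrature over [-pi, pi].
        w = np.linspace(-np.pi, np.pi, m)
        num = den = 0.0
        for x in series_list:
            phi, s2 = ar_fit(np.asarray(x, dtype=float), p)
            f = ar_spectrum(phi, s2, w)
            f2 = -np.gradient(np.gradient(f, w), w)   # f^{(2)} = -f''
            num += np.trapz(f2 ** 2, w)
            den += np.trapz(f ** 2, w)
        return num / den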
To summarize, our procedure for testing multivariate serial correlation can be described as follows: (i) Find the data-driven finest scale Ĵ from (23), with α̂0(2) computed from the autoregressions in (24); if a wavelet basis that satisfies q = 2 is adopted (such as the Franklin wavelet with Fourier transform (5)), simplifications occur, and formula (25) can be used. (ii) Compute the test statistic Wn(Ĵ) in (15). (iii) Reject the null hypothesis of no serial correlation if Wn(Ĵ) > z1−α, where z1−α is the 1 − α quantile of N(0,1). A schematic implementation of steps (i)–(iii) is sketched below.
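Combining the earlier sketches, the procedure can be mocked up end to end. The centering d2(2J+1 − 1) and variance 4d2(2J+1 − 1) are those stated in Section 3, while the 2πn scaling of the quadratic form is an assumption patterned on Hong (1996)-type statistics; the sketch therefore illustrates the structure, not the exact constants, of (15).

    import numpy as np
    from scipy.stats import norm

    def wavelet_test(x, psi_hat, J, alpha=0.05):
        # x: (n, d) array; psi_hat: Fourier transform of the mother wavelet;
        # J: finest scale (fixed here; use finest_scale/alpha0_hat above for
        # the data-driven version).
        n, d = x.shape
        C = sample_autocov(x)
        C0inv = np.linalg.inv(C[0])
        q = 0.0
        for j in range(J + 1):
            for k in range(1, 2 ** j + 1):
                A = empirical_wavelet_coeff(C, j, k, psi_hat)
                q += np.trace(A.conj().T @ C0inv @ A @ C0inv).real
        mean = d ** 2 * (2 ** (J + 1) - 1)
        Wn = (2.0 * np.pi * n * q - mean) / np.sqrt(4.0 * mean)
        return Wn, Wn > norm.ppf(1.0 - alpha)   # one-sided rejection rule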
5. SIMULATION RESULTS
In the previous sections, we have studied a new class of test statistics that have interesting properties when the spectral density possesses irregular features. However, from a practitioner's point of view, it is natural to inquire about the finite-sample properties of the wavelet-based test statistic, in particular its exact level and power. The power comparison between the kernel-based test statistic and the wavelet-based test statistic seems of particular interest. To partially answer these questions, we conduct a small Monte Carlo experiment. For each multivariate i.i.d. process described subsequently, we examine the empirical frequencies of rejection of the null hypothesis when the latter is true by using tests with two different nominal levels (5 and 10%) for each of two series lengths (n = 100, 200). The kernel-based test statistic described in Duchesne and Roy (2004) is given by

Tn = [n ∑_{j=1}^{n−1} k^{2}(j/pn) tr{Cx(j)⊤Cx(0)^{−1}Cx(j)Cx(0)^{−1}} − d2Cn(k)] / [2d2Dn(k)]^{1/2},

where the normalizing quantities Cn(k) and Dn(k) are given by

Cn(k) = ∑_{j=1}^{n−1} (1 − j/n)k^{2}(j/pn),  Dn(k) = ∑_{j=1}^{n−1} (1 − j/n)(1 − (j + 1)/n)k^{4}(j/pn),

where k(·) is a kernel function and pn is the smoothing parameter. We choose the Daniell kernel, given by k(z) = sin(πz)/(πz), z ∈ ℝ, which displays optimality in a given class of sufficiently smooth kernels. Usually, the exact alternative hypothesis and, therefore, the optimal value of pn are unknown. In that case, considering an automatic choice of pn seems particularly relevant. The multivariate version of the cross-validation procedure of Robinson (1991b) for determining the bandwidth of a kernel spectrum estimator is employed. We choose that particular nonparametric method because, to our knowledge, in the context of multivariate spectral density estimation, it represents the only data-dependent technique discussed in the statistical literature. For the smoothing parameter pn, we retain the value of m that minimizes the pseudo-log-likelihood defined by

CV(m) = ∑_{j=1}^{n} (log det[f̄m(ζj)] + tr{f̄m^{−1}(ζj)P(ζj)}),

where P(·) denotes the periodogram, f̄m(·) a leave-two-out type smoothed periodogram, and ζj = 2πj/n, j = 1,…,n, correspond to the Fourier frequencies. The optimization was performed over the integer values m = 2,3,…,20.
The power properties of the tests are also investigated for cases where the process can be formulated as a vector autoregressive process of order p, denoted VAR(p), for p = 1,4,8,12. Furthermore, because of its popularity in vector time series modeling, a vector autoregressive moving average VARMA(1,1) process is considered. The specific stochastic processes are described in Section 5.1. The power results are size adjusted: for a given alternative, the number of rejections of the null hypothesis is calculated using the exact (empirical) critical values obtained from the level study. This practice allows us to compare the wavelet-based and kernel-based tests on an equal basis.
5.1. Description of the Experiment
For our investigations of the exact levels of the test statistics, we consider five multivariate i.i.d. processes of dimension d = 2,3,4: the multivariate normal N(0,Σ), the multivariate-t distributions t5 and t8, and two contaminated normal distributions, CN1 and CN2, with

Σ = (1 − ρ)Id + ρ1d1d⊤,

where Id is the d × d identity matrix and 1d = (1,…,1)⊤. We let ρ = 0.1 and σ = 0.5. The multivariate-t distribution with ν degrees of freedom corresponds to the law of the random variable x ≡ (χν2/ν)−1/2z, where z ∼ N(0,Σ) and χν2 is a chi-squared random variable, z and χν2 being independent. This distribution is related to the so-called multivariate Pearson type VII distribution; see Johnson (1987, pp. 117–118) for more details. If x ∼ tν, then E(xx⊤) = [ν/(ν − 2)]Σ, with kurtosis parameter equal to κ = 2/(ν − 4). Consequently, we obtain κ = 2 and κ = ½ for the t5 and t8 distributions, respectively. In robustness studies, the contaminated normal distribution appears to be a useful distribution: it allows us to simulate random variables that deviate from the normal distribution as a result of possible outliers. The kurtosis parameters of the distributions CN1 and CN2 are κ = 0.028 and κ = 0.059, respectively. Note that in unreported simulation experiments we also considered σ = 2.0 (giving κ = 0.323 and κ = 0.479, respectively), but the empirical results were largely similar to the case σ = 0.5. To study the level of the new wavelet-based test statistic for a chosen distribution, B = 10,000 independent realizations are generated for each value of n, and the computations are made in the following way. Note that all the codes are written in the FORTRAN 77 language with the NAG library.
Step 1. The Gaussian, chi-squared, and uniform random variables are generated using the subroutines G05EAF and G05EZF, the function G05DHF, and the function G05CAF, respectively; these subroutines/functions are taken from the NAG library.
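In a modern environment, the draws of step 1 can be reproduced along the following lines; the equicorrelation Σ follows Section 5.1, and the mixing weight eps of the contaminated normal is a hypothetical parameter, since the exact weights defining CN1 and CN2 are not reproduced here.

    import numpy as np

    rng = np.random.default_rng(0)

    def sigma_matrix(d, rho=0.1):
        # Sigma = (1 - rho) I_d + rho 1_d 1_d^T, as in Section 5.1.
        return (1.0 - rho) * np.eye(d) + rho * np.ones((d, d))

    def draw_mvt(n, d, nu, rho=0.1):
        # x = (chi2_nu / nu)^{-1/2} z, z ~ N(0, Sigma): the representation
        # of the multivariate-t given in the text.
        z = rng.multivariate_normal(np.zeros(d), sigma_matrix(d, rho), size=n)
        chi2 = rng.chisquare(nu, size=n)
        return z / np.sqrt(chi2 / nu)[:, None]

    def draw_cn(n, d, eps, scale=0.5, rho=0.1):
        # Contaminated normal: N(0, Sigma) w.p. 1 - eps, N(0, scale^2 Sigma)
        # otherwise; eps is hypothetical here.
        z = rng.multivariate_normal(np.zeros(d), sigma_matrix(d, rho), size=n)
        mask = rng.random(n) < eps
        z[mask] *= scale
        return z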
Step 2. For each realization, the test statistic Tn is computed using the Daniell kernel. The data-driven method of Robinson (1991b) has been implemented to determine the smoothing parameter pn from the data. The test statistic Wn is computed using the Franklin and Meyer wavelets given by (5) and (6) for a nonstochastic finest scale J such that 2J+1 = 3n1/5. For the nonstochastic finest scale, denoted J1, we obtain J1 ≡ 2 when n = 100 and J1 ≡ 3 when n = 200. Because λ2 = π4/45 for the Franklin wavelet and λq ≡ 0 for the Meyer wavelet, the parametric plug-in method described in Section 4.3 is considered for Wn based on ψF only. We use the NAG subroutine D01GAF for the numerical integration. We fit parametric AR(p(a)) models, for p(a) = 1,2,…,20, a = 1,2, using the NAG subroutine G02DAF for least-squares estimation. The autoregressive order is chosen using the AIC criterion. We then modify the AIC method because we need a data-dependent Ĵ such that Ĵ → ∞ as n → ∞; the data-driven proposal (23) with c = 3 is considered. We denote by Ĵ the resulting data-driven finest scale. More details on the automatic method Ĵ are provided in Section 5.4.
Step 3. Finally, for each series of length n, for each test statistic we obtain from the B realizations the empirical frequencies of rejection of the null hypothesis of noncorrelation.
To study the power, we consider five alternatives: a VAR(1), xt = Φdxt−1 + at; a VARMA(1,1), xt = Φdxt−1 + at − Θdat−1; a VAR(4), xt = Adxt−4 + at; a VAR(8), xt = Bdxt−8 + at; and a VAR(12), xt = Cdxt−12 + at, where {at} is i.i.d. The definitions of the parameters Φd, Θd, Ad, Bd, and Cd, d = 2,3,4, are given in Table 1. As in the level study, we consider the normal (N), the multivariate-t (t5 and t8), and the contaminated normal (CN1 and CN2) distributions for at.
The alternatives are chosen according to the shapes of their spectral and cross-spectral densities. First, the VARMA(1,1) alternative admits spatially homogeneous spectral and cross-spectral densities: one spectral density is dominated by low frequencies and the other by high frequencies. The VAR(1) alternative displays spatially homogeneous spectral and cross-spectral densities; the spectral densities are dominated by low frequencies. The third alternative is a VAR(4), which could arise from quarterly data: here, the spectral and cross-spectral densities possess local features. Finally, the VAR(8) and VAR(12) alternatives exhibit spectral densities and cross-spectral densities with many spikes and distinctive local features. The VAR(12) alternative could arise from monthly economic data. The power analysis is conducted in a way similar to the level study. The difference lies in the first step, which is replaced by the following step.
Step 1′. Set the initial values xt = 0 for t ≤ 0. The white noise is generated independently using the same subroutines/functions from the NAG library. However, to reduce the effect of the initial values, 2n + 1 observations are simulated for each alternative to obtain the realization {xt, t = 1,…,2n + 1}; we keep the last n observations.
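Step 1′ translates directly into code; the coefficient matrix below is a placeholder for the Table 1 parameters, which are not reproduced in this excerpt.

    import numpy as np

    def simulate_var1(Phi, innovations):
        # x_t = Phi x_{t-1} + a_t with x_t = 0 for t <= 0; simulate 2n + 1
        # observations and keep the last n, as in step 1'.
        m, d = innovations.shape          # m = 2n + 1
        n = (m - 1) // 2
        x = np.zeros((m, d))
        x[0] = innovations[0]
        for t in range(1, m):
            x[t] = Phi @ x[t - 1] + innovations[t]
        return x[-n:]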
In step 3, we generate B = 10,000 independent realizations in the level study and B = 1,000 independent realizations in the power study. For the nominal level α and B independent realizations, the standard error of the empirical levels is given by [α(1 − α)/B]1/2, which reduces to 0.218% for the nominal level 5% and 0.300% for the nominal level 10% (B = 10,000). The results are given in Tables 2–7.
5.2. Discussion of the Level Study
In Table 2, the results of the level study are presented for the normal, multivariate-t, and contaminated normal distributions. Overall, the empirical results for normal and nonnormal processes are very similar in our experiments. As the dimension d increases, the empirical levels improve slightly for the kernel and wavelet test statistics. If we look more specifically at the wavelet test Wn, it appears that the tests based on J1 and on the data-driven Ĵ admit reasonable levels at the 10% level. The levels of the wavelet test statistics exhibit some overrejection at the 5% level but are rather comparable to those of the kernel-based test statistic. For n = 200, the wavelet test statistic presents empirical levels closer to the nominal level 5%. The levels of the kernel-based test statistic Tn(p̂n) are satisfying and in accordance with the simulation results of Duchesne and Roy (2004).
We now compute descriptive statistics for Ĵ. For the bivariate normal distribution, Ĵ remained close to its lower bound (2 when n = 100 and 3 when n = 200) for the two sample sizes under investigation; similar results are obtained for dimensions d = 3,4. It can be shown empirically that Wn(J) exhibits overrejection of the null hypothesis when J is fixed and very small (e.g., Wn(J) with J ≡ 0), suggesting the importance of the requirement J → ∞ in our asymptotic theory. More precisely, large overrejections occurred for the wavelet-based tests when J ≡ 0, and underrejections occurred when J was fixed and such that J > 4. This differs from the test statistic Tn(pn), which delivers reasonable levels when the smoothing parameter pn is fixed and small (see, e.g., the simulation experiments described in Duchesne and Roy, 2004). Note that, to establish the asymptotic distribution of the kernel-based test statistic Tn(pn), pn must satisfy pn → ∞. The descriptive results for Ĵ for nonnormal processes are largely similar to those obtained with the normal process.
5.3. Discussion of the Power Study
Tables 3 and 4 report the power against VAR(1) and VARMA(1,1) alternatives, respectively. The results for normal and nonnormal processes suggest that the kernel-based test statistic is more powerful in these situations. The differences between the kernel and wavelet approaches seem to increase with d in our experiment. Because the spectral and cross-spectral densities of these alternatives possess spatially regular features, the kernel-based test is expected to perform reasonably well in such situations. The power of the wavelet-based test (Franklin wavelet) based on Ĵ is slightly inferior to the power of the kernel-based test with a cross-validated choice of pn. The differences seem fairly small when n = 200, particularly when d = 2,3. The test statistics Wn(J1,ψF) and Wn(J1,ψM) behave very similarly. Under the VARMA(1,1) and VAR(1) alternatives, for the chosen dimensions, the data-driven finest scale remained close to its lower bound, with Ĵ ≈ 2 when n = 100 and Ĵ ≈ 3 when n = 200.
Table 5 reports the power against the VAR(4) alternative. The results suggest that the choice of the underlying distribution for the innovations has a negligible impact on the empirical powers. The wavelet tests Wn(J1,ψF), Wn(J1,ψM), and Wn(Ĵ,ψF) dominate the kernel-based test statistic Tn(pn) when the alternative is a VAR(4) process. This illustrates that the wavelet test is more powerful than the kernel-based method when the spectral and cross-spectral densities have local distinctive features, due, for example, to a seasonal economic phenomenon.
Finally, Tables 6 and 7 report the power against the VAR(8) and VAR(12) alternatives. Under these two alternatives, the wavelet test clearly outperforms the kernel-based test. This may be related to the many local and irregular features in the spectral densities of the VAR(8) and VAR(12) processes. For the particular VAR(8) alternative, the differences become even larger when the dimension d increases. For both sample sizes, the wavelet tests display a better power than Tn with the cross-validated choice of pn. Comparing the wavelet-based tests in Table 6, Wn(J1,ψF) appears more powerful than Wn(J1,ψM) when n = 100, but their powers are comparable when n = 200. The VAR(12) alternative highlights the need for a data-driven finest scale, because the power of Wn(Ĵ,ψF) appears much higher than that of Wn(J1,ψF) and Wn(J1,ψM), which are based on the deterministic rule 2J+1 ∝ n1/5. In this situation, the deterministic values J1 ≡ 2 (n = 100) and J1 ≡ 3 (n = 200) are not appropriate: much larger values of J are typically necessary, and the proposed method Ĵ adapts very well, at least in our simulations.
From our study, it seems that the cross-validated method does not easily find an appropriate estimator of pn for the VAR(p) alternatives, p ∈ {4,8,12}. This may be due to the slow convergence rate of that nonparametric technique. The problem is more pronounced in large dimensions in several situations, likely due, in part, to the curse of dimensionality. We also computed some descriptive statistics of the proposed data-dependent method. Under Gaussian innovations and d = 2, the selected finest scales for the VAR(4) alternative typically exceeded the lower bounds when n = 100,200, whereas, for the VAR(12) alternative, the data-dependent method Ĵ delivered still larger values of the finest scale. Results are similar for d = 3,4. Again, this illustrates that the plug-in method reveals some information on the true alternative under consideration. Intuitively, we see that a larger resolution level (or a higher level of approximation) is needed when the spectral density possesses more complicated local features.
5.4. Discussion of the Implemented Plug-In Method
To implement the parametric plug-in method Ĵ described in Section 4.4, a lower bound l in formula (23) must be specified. The reason is that our asymptotic analysis requires J → ∞, but in some situations we could have α0(q) = 0, an important example being the null hypothesis. The lower bound ensures that J tends to infinity asymptotically, even under the null hypothesis. The plug-in method based on autoregressive approximating models, as presented in Section 4.4, provides an estimator of α0(q) that tends to 0 in probability under the null hypothesis, that is, α̂0(q) →p 0. Given that the proposed method satisfies this property, it is expected that any reasonable lower bound l will be largely selected when the stochastic process is uncorrelated. On the other hand, more wavelet coefficients are typically needed under strong dependence. Consequently, a larger J should be necessary in this kind of situation, and the lower bound should be retained less systematically. Interestingly, because J is an integer, the proposed method does not rely too strongly on the choice of c or l. It seems that the minimal requirement is to choose a reasonable bound ensuring that J increases with the sample size n. Easy calculations show that any c in the interval (2k/n1/(2q+1), 2k+1/n1/(2q+1)) leads to the same lower bound, that is, J = k, where we used the relation J = ⌈log2(cn1/(2q+1)) − 1⌉, ⌈a⌉ denoting the smallest integer larger than a. The numerical check below illustrates this invariance.
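The invariance of the implied bound over an interval of c values is easy to verify numerically; the snippet checks the case k = 3, q = 2, n = 200.

    import math
    import numpy as np

    n, q, k = 200, 2, 3
    lo, hi = 2 ** k / n ** 0.2, 2 ** (k + 1) / n ** 0.2
    for c in np.linspace(lo + 1e-9, hi - 1e-9, 5):
        J = math.ceil(math.log2(c * n ** (1 / (2 * q + 1))) - 1)
        print(round(c, 3), J)   # every c in (2^3/n^{1/5}, 2^4/n^{1/5}) gives J = 3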
In unreported results, the wavelet-based test statistic (15) based on a fixed J displayed reasonable empirical levels when J ∈ {1,2,3} and J ∈ {2,3,4} for the sample sizes n = 100 and n = 200, respectively. This means that, if the lower bound is neither too small nor too large, Ĵ should lead to similar and reasonable results. However, from an econometric point of view, considering fixed values of J does not represent a satisfactory solution: (1) from our asymptotic theory, J must satisfy J → ∞, and (2) from the discussion given in Section 5.2, empirical experiments suggest that values of J that are too low tend to produce overrejections of the null hypothesis. These considerations indicate that a value of c or l that is too low (too large) is expected to deliver overrejection (underrejection) of a specified nominal level.
To assess empirically the impact of the lower bound l in our simulation experiments, we report in Table 8 the number of times (in percentages) that Ĵ exceeds the lower bound l (and such that Ĵ is strictly larger than J1 after rounding) for the parametric plug-in method described in Section 4.4, where Ĵ relies on an estimator of (25), satisfying (23), based on the Franklin wavelet (5). Recall that, for the Franklin wavelet, q = 2 and λq ≡ λ2 = π4/45, and that l and c are related by c = (2qλql)1/(2q+1). We have chosen l and c such that 2J+1 = cn1/5 ≡ 3n1/5; this arbitrary choice of c gives the lower bound J = 2 when n = 100 and J = 3 when n = 200. Note that the nonstochastic finest scale J1 described in step 2 of Section 5.1 also delivers J1 = 2 when n = 100 and J1 = 3 when n = 200. From Table 8, we observe that, under the null hypothesis and for the VAR(1) and VARMA(1,1) alternatives, the lower bound is more often retained. This is particularly true under the null hypothesis of no serial correlation, where the lower bound is largely selected, as expected. In fact, from our empirical experiments, the fixed values J = 2 and J = 3 should deliver reasonable levels for sample sizes n = 100 and n = 200, respectively. Note, in passing, that as a function of the dimension d, the lower bound is slightly more often exceeded for a larger dimension, at least in our simulations. The VARMA(1,1) and VAR(1) alternatives offer spatially homogeneous spectral and cross-spectral densities. In these cases, the dependence is explained by few wavelet coefficients and a low value of the finest scale J; consequently, the lower bounds are often retained. Note that, in unreported simulation results, the fixed values J = 0,1 displayed higher power than the nonstochastic finest scale J1 or the proposed plug-in method Ĵ for these particular alternatives. A smaller lower bound would increase the power of the wavelet-based test, but the empirical levels were not well controlled for these fixed values of J. However, under seasonal alternatives, for example, the VAR(q) alternatives with q = 4,8,12, the lower bound is more often exceeded. Under the VAR(4) alternative, a J larger than the lower bound appears necessary to explain the dependence, particularly for the smaller sample size. For n = 200, the method selects the bound J = 3 most of the time. From the discussion in Section 5.3, for both sample sizes, Wn(Ĵ,ψF) displayed high power for these choices of the finest scale. For the VAR(8) and, more particularly, the VAR(12) alternatives, the lower bound is selected less systematically. Given that the spectral densities and cross-spectral densities exhibit many spikes and distinctive local features, more wavelet coefficients are necessary and a larger J is selected. Interestingly, for the VAR(12) alternative, a larger power is offered by the plug-in method, and Ĵ exceeds the bound J = 3 more than 99% of the time when n = 200, suggesting the limited impact of the lower bound, at least for this alternative.
6. CONCLUSION
In this paper, new tests of serial correlation are proposed for cases where there is no information on the true alternative hypothesis. Because wavelet analysis represents an effective tool when the spectral and cross-spectral densities exhibit peaks and other local features, we have used a suitable distance measure to develop a test statistic based on a comparison between a multivariate spectral density estimator calculated with the wavelet method and the true spectral density under the null hypothesis of absence of correlation in the process. This new one-sided test admits an asymptotic normal distribution under the null hypothesis. We have proposed and justified a data-driven method for the choice of the finest scale, and we have given some practical implementations. In particular, we have described a parametric plug-in method.
In the simulation experiment, the level and power of the new test were investigated in finite samples for vector time series of dimension d = 2,3,4. We have compared the wavelet test with the kernel-based test statistic of Duchesne and Roy (2004), which relies on a distance between a multivariate spectral density estimator calculated with the kernel method and the true spectral density under the null hypothesis of absence of correlation in the process. For the kernel-based test, we applied the cross-validation technique described in Robinson (1991b) for choosing pn. For the wavelet test, we considered our parametric plug-in method. We have discussed in detail the particular implementation of the proposed method, including the practical choice of a lower bound satisfying the requirements of our asymptotic theory. The level of the wavelet test appeared reasonably well controlled at the nominal levels 5% and 10% with series of 100 and 200 observations using the proposed data-driven method, even if it did tend to overreject slightly at the 5% nominal level.
In conclusion, when the multivariate spectral density exhibits regular features, the kernel-based test of Duchesne and Roy (2004) may be recommended. However, the local properties of the spectral density are very often unknown to the experimenter. Thus, when the multivariate spectral density is expected to present irregular features, due, for example, to quarterly, monthly, or other kinds of seasonal data, the new wavelet test should be a useful complement for the practitioner.
Some avenues of further research would include constructing a test statistic based on a nonlinear wavelet spectral density estimator. It seems likely that the methods used in this paper can be extended to derive the asymptotic distribution of the test statistic deduced from that alternative approach, but the technicalities involved would be far from trivial. In finite samples, it would be particularly interesting to compare the advantages of the nonlinear wavelet estimator over those of the linear wavelet estimator proposed in this paper. These challenging considerations are left for future studies.
APPENDIX
Proof of Theorem 1. We adapt the proof of Lee and Hong (2001), which is valid for a univariate process; we give only the delicate arguments of the generalization to the multivariate case. For simplicity, we assume that E(xt) = 0, and we consider Cx(j) = n−1 ∑_{t=j+1}^{n} xtxt−j⊤. In fact, there is no loss of generality in assuming that the process has a zero mean: it can be shown that, if the process is such that E(xt) = μ and Cx(j) is given by (10), then the asymptotic distribution of Wn(J) is unchanged.
We consider the standardized process yt = Γx−1/2(0)xt, with zero mean and such that var(yt) = Id. Then Cy(j) = Γx−1/2(0)Cx(j)Γx−1/2(0), and we let the empirical wavelet coefficients be computed from the Cy(j). Our approach differs slightly from that of Lee and Hong (2001), because we show in a first step the asymptotic normality of the test statistic in which Cx(0) is replaced by Γx(0); in a final step, we show that replacing Γx(0) by Cx(0) has no impact asymptotically on the limiting distribution. We can write the resulting quadratic form as a weighted sum of lagged cross-products of the standardized observations, say (A.1),
where bJ(h,m) = aJ(h,m) + aJ(−h,−m) + aJ(h,−m) + aJ(−h,m), the quantities aJ(h,m) being defined in terms of the Fourier transforms ψ̂jk(·) evaluated at the points 2πh and 2πm; compare Priestley (1981, eqn. (6.19), p. 392). Lee and Hong (2001, Lem. A.1) give several useful properties of the quantities bJ(h,m). Noting that (A.1) can be expressed in terms of the inner products 〈yt−h,yt−m〉, where 〈x,y〉 = x⊤y, we decompose (A.1) into a leading term, based on the summands with well-separated time indices, and several remainder terms.
The quantity q is such that q → ∞, q/2J → ∞, and q2/n → 0. Note that E(∥yt∥2) = d and E(〈yt−m,yt−h〉) = δhmd. Using Lemma A.1(v) of Lee and Hong (2001), and following an argumentation similar to their Propositions 1 and 2, it is possible to show that the remainder terms are asymptotically negligible; the required bounds rely on Assumption B(ii) (see also Lee and Hong, 2001, App. B), and the constant τ is as in Assumption B. More precisely, to study the remainder terms, we proceed as in Lee and Hong (2001, pp. 406–411), and we make intensive use of inequalities involving vector random variables (see also Duchesne and Roy, 2004). For the first remainder term, using the inequality |tr(AB⊤)| ≤ [tr(AA⊤)]1/2[tr(BB⊤)]1/2, the usual Cauchy–Schwarz inequality, and the same argument as Lee and Hong (2001, p. 408), we deduce that it is asymptotically negligible; a similar treatment applies to its companion term. For the next term, similar calculations, combined with Lemma A.1(ii) of Lee and Hong (2001) applied with ν = 0, show that it is also negligible, because q2/n → 0 and 22J/n → 0. To study the following term, we introduce a suitable matrix of weighted partial sums; straightforward calculations and an argument similar to that of Lee and Hong (2001, p. 409) allow us to deduce that this term is negligible as well. Finally, for the last remainder term, we introduce the decomposition zt = z1t + z2t + z3t, where the components are defined by truncation and I(A) denotes the indicator function of the set A. Using the inequality tr[(A + B + C)(A + B + C)⊤] ≤ 4[tr(AA⊤) + tr(BB⊤) + tr(CC⊤)], it follows that the relevant second moment is bounded by 4(C1nt + C2nt + C3nt), where Cint = E[tr(zitzit⊤)], i = 1,2,3. Proceeding as in Proposition 2 of Lee and Hong (2001), it follows that these quantities are asymptotically negligible. Collecting the results, the remainder terms do not contribute to the limiting distribution.
The asymptotic normality comes from the leading statistic, which can be written as a sum of martingale differences ∑t Unt, where Unt involves the weighted partial sums St−q−1(m) given by (A.5) and vn2 denotes the normalizing variance. It is possible to show that 4d2(2J+1 − 1)/vn2 → 1. To establish the asymptotic normality of the leading statistic, we use the martingale central limit theorem of Brown (1971), as described in Taniguchi and Kakizawa (2000). The result follows if we can show two conditions: (i) a Lyapunov-type moment condition and (ii) the convergence in probability of the normalized conditional variance to one. It is possible to show that E(Unt4) = O(t222J), which is sufficient for (i). For (ii), it is enough to establish the convergence of the sum of the conditional variances, which follows if we proceed as in Lemma A.2 of Lee and Hong (2001). This shows (ii) and, collecting the preceding results, establishes the asymptotic normality of Wn(J). █
Proof of Theorem 2. Let Wn(Ĵ) and Wn(J) be the test statistics computed with the data-driven and the nonstochastic finest scales, respectively. First, we show that Wn(Ĵ) − Wn(J) = oP(1). We can write the difference as the sum of a term reflecting the change in the centering and scaling constants and a term reflecting the change in the quadratic forms, so that we have to establish two results: (i) the difference of the normalizing constants is asymptotically negligible, and (ii) the difference of the quadratic forms is oP(1).
To show (i) is immediate, because of condition (17). We now show (ii). We decompose the difference of the quadratic forms into a sum over the resolution levels lying between min(Ĵ,J) + 1 and max(Ĵ,J). To show that this sum is oP(1), we use the definition of convergence in probability. Using the identity A = (A ∩ B) ∪ (A ∩ Bc), where A and B are two sets and Bc denotes the complement of B, and because in general Pr(A ∪ B) ≤ Pr(A) + Pr(B), it follows that the probability of interest is bounded by two terms, where M > 0 and ε > 0 are arbitrarily large and small numbers, respectively. Because, by hypothesis, relation (17) holds, the first term can be made arbitrarily small. Also, given formula (A.3), and using the Cauchy–Schwarz inequality together with the moment bounds for the empirical wavelet coefficients (cf. Lee and Hong, 2001, App. B), the expectation of the second term converges to zero; using Markov's inequality, this shows that the associated random quantity is oP(1). Similarly, the remaining contribution is oP(1), and an argument similar to the one leading to relation (A.6) gives the conclusion. Because Wn(Ĵ) − Wn(J) = oP(1), this completes the proof of the theorem. █
Proof of Theorem 3. First we note that the integrated MSE decomposes into two parts; the first term is associated with the variance, whereas the last term is the squared bias.
We now study the bias term. Direct algebra, combined with the orthonormality of the wavelet basis, splits the squared bias into two contributions: an estimation-bias contribution at the resolution levels j ≤ J and a truncation contribution from the resolution levels j > J. Replacing (A.8) in (A.7), using the psd property of W, and noting that |bJ(h,m)| ≤ C(J + 1), we deduce a bound for the first contribution; because ∑h |h|∥Γx(h)∥ ≤ C, this shows that the first contribution is asymptotically negligible. For the truncation contribution, we evaluate separately the terms corresponding to r = 0 and r ≠ 0. In the former case, we obtain the λq2−2q(J+1) component of the integrated MSE; in the case r ≠ 0, it follows by Assumption D that the corresponding term is negligible.
We now calculate the variance term. Straightforward calculations give a decomposition whose first term is easily shown to be O(n−1). From Hannan (1970, p. 313), the second term, written in matrix notation, is equal to the sum of three quantities A1n, A2n, and A3n, where Kdd is the commutation matrix, Cum is a certain matrix containing the fourth-order cumulants, and the function wn(u,m,h) is defined in Hannan (1970). By Assumption D, |A3n| ≤ C(J + 1)/n. With a change of variables and proceeding as in Parzen (1957), we obtain |A2n| ≤ C(J + 1)/n. We now study A1n. A truncation argument on j shows that A1n delivers the 2J+1/n component of the integrated MSE; multiplying by W and taking the tr(·) gives the variance term stated in the theorem. Collecting the results, we obtain Theorem 3. █
Proof of Corollary 1. The result follows immediately from Theorem 2, because the assumption of the corollary implies condition (17), where the nonstochastic finest scale J is given by

2J+1 = [2qλq max{ϑ(q), l}n]1/(2q+1).

The latter satisfies the conditions of Theorem 2. █