
THE LIVE METHOD FOR GENERALIZED ADDITIVE VOLATILITY MODELS

Published online by Cambridge University Press:  01 December 2004

Woocheol Kim
Affiliation:
Korea Institute of Public Finance and Humboldt University of Berlin
Oliver Linton
Affiliation:
The London School of Economics

Abstract

We investigate a new separable nonparametric model for time series, which includes many autoregressive conditional heteroskedastic (ARCH) models and autoregressive (AR) models already discussed in the literature. We also propose a new estimation procedure called LIVE, or local instrumental variable estimation, that is based on a localization of the classical instrumental variable method. Our method has considerable computational advantages over the competing marginal integration or projection method. We also consider a more efficient two-step likelihood-based procedure and show that this yields both asymptotic and finite-sample performance gains. This paper is based on Chapter 2 of the first author's Ph.D. dissertation from Yale University. We thank Wolfgang Härdle, Joel Horowitz, Peter Phillips, and Dag Tjøstheim for helpful discussions. We are also grateful to Donald Andrews and two anonymous referees for valuable comments. The second author thanks the National Science Foundation and the ESRC for financial support.

Type
Research Article
Copyright
© 2004 Cambridge University Press

1. INTRODUCTION

Volatility models are of considerable interest in empirical finance. There are many types of parametric volatility models, following the seminal work of Engle (1982). These models are typically nonlinear, which poses difficulties both in computation and in deriving useful tools for statistical inference. Parametric models are prone to misspecification, especially when there is no theoretical reason to prefer one specification over another. Nonparametric models can provide greater flexibility. However, the greater generality of these models comes at a cost—including a large number of lags requires estimation of a high-dimensional smooth, which is known to behave very badly (Silverman, 1986). The “curse of dimensionality” puts severe limits on the dynamic flexibility of nonparametric models. Separable models offer an intermediate position between the complete generality of nonparametric models and the restrictiveness of parametric models. These models have been investigated in cross-sectional settings and also in time series settings.

In this paper, we investigate a generalized additive nonlinear autoregressive conditional heteroskedastic model (GANARCH):

$$y_t = m(x_t) + v^{1/2}(x_t)\,\varepsilon_t, \tag{1.1}$$
$$m(x_t) = F_m\Bigl[c_m + \sum_{\alpha=1}^{d} m_\alpha(y_{t-\alpha})\Bigr], \tag{1.2}$$
$$v(x_t) = F_v\Bigl[c_v + \sum_{\alpha=1}^{d} v_\alpha(y_{t-\alpha})\Bigr], \tag{1.3}$$

where $x_t = (y_{t-1},\ldots,y_{t-d})$, the $m_\alpha(\cdot)$ and $v_\alpha(\cdot)$ are smooth but unknown functions, and $F_m(\cdot)$ and $F_v(\cdot)$ are known monotone transformations (whose inverses are $G_m(\cdot)$ and $G_v(\cdot)$, respectively).¹

¹ The extension to allow the F transformations to be of unknown functional form is considerably more complicated; see Horowitz (2001).

The error process, $\{\varepsilon_t\}$, is assumed to be a martingale difference with unit scale, that is,

$$E(\varepsilon_t \mid \mathcal{F}_{t-1}) = 0, \qquad E(\varepsilon_t^2 \mid \mathcal{F}_{t-1}) = 1,$$

where $\mathcal{F}_t$ is the σ-algebra of events generated by $\{y_k\}_{k=-\infty}^{t}$. Under some weak assumptions, time series generated by nonlinear autoregressive models can be shown to be stationary and strongly mixing with mixing coefficients decaying exponentially fast. Auestad and Tjøstheim (1990) use α-mixing or geometric ergodicity to identify their nonlinear time series model. Similar results are obtained for the additive nonlinear autoregressive conditional heteroskedastic (ARCH) process by Masry and Tjøstheim (1997); see also Cai and Masry (2000) and Carrasco and Chen (2002). We follow the same argument as Masry and Tjøstheim (1997) and assume all the conditions necessary for stationarity and the mixing property of the process $\{y_t\}_{t=1}^{n}$ in (1.1). The standard identification of the components of the mean and variance is made by

$$E[m_\alpha(y_{t-\alpha})] = 0, \qquad E[v_\alpha(y_{t-\alpha})] = 0, \tag{1.4}$$

for all α = 1,…,d. The notable aspect of the model is additivity via known links for the conditional mean and volatility functions. As will be shown later, (1.1)–(1.3) includes a wide variety of time series models in the literature. See Horowitz (2001) for a discussion of generalized additive models in a cross-sectional context.
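To fix ideas, the following minimal sketch simulates a GANARCH process of the form (1.1)–(1.3) with d = 2, taking $F_m$ to be the identity and $F_v$ the exponential (so that $G_v = \log$ and volatility is multiplicative). The component functions, constants, and sample size are our illustrative choices, not the paper's.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative component functions (not the paper's): identity mean link,
# exponential variance link, so log v(x_t) is additive in the two lags.
def m1(y): return 0.3 * np.tanh(y)
def m2(y): return -0.2 * y
def v1(y): return 0.2 * y**2
def v2(y): return 0.1 * np.abs(y)

n, burn = 500, 200
y = np.zeros(n + burn + 2)
for t in range(2, n + burn + 2):
    mean = m1(y[t-1]) + m2(y[t-2])                 # F_m = identity
    var = np.exp(-1.0 + v1(y[t-1]) + v2(y[t-2]))   # F_v = exp, c_v = -1
    y[t] = mean + np.sqrt(var) * rng.standard_normal()
y = y[burn + 2:]                                    # discard burn-in
```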

In a much simpler univariate setup, Robinson (1983), Auestad and Tjøstheim (1990), and Härdle and Vieu (1992) study kernel estimation of the conditional mean function m(·) in (1.1). The so-called CHARN (conditionally heteroskedastic autoregressive nonlinear) model is the same as (1.1) except that m(·) and v(·) are univariate functions of $y_{t-1}$. Masry and Tjøstheim (1995) and Härdle and Tsybakov (1997) apply the Nadaraya–Watson and local linear smoothing methods, respectively, to jointly estimate v(·) together with m(·). Alternatively, Fan and Yao (1996) and Ziegelmann (2002) propose local linear least squares estimation of the volatility function, with an extension by Avramidis (2002) based on local linear maximum likelihood estimation. Also, in a nonlinear vector autoregressive (VAR) context, Härdle, Tsybakov, and Yang (1998) deal with the estimation of the conditional mean in a multilagged extension similar to (1.1). Unfortunately, however, introducing more lags in nonparametric time series models has unpleasant consequences, more so than in the parametric approach. As is well known, smoothing methods in high dimensions suffer from a slower convergence rate, the "curse of dimensionality." Under twice differentiability of m(·), the optimal rate is $n^{-2/(4+d)}$, which gets rapidly worse with the dimension. In high dimensions it is also difficult to describe the function m graphically.

The additive structure has been proposed as a useful way to circumvent these problems in multivariate smoothing. By assuming the target function to be a sum of functions of the covariates, say, $m(x) = c + \sum_{\alpha=1}^{d} m_\alpha(x_\alpha)$, we can effectively reduce the dimensionality of a regression problem and improve the implementability of multivariate smoothing up to that of the one-dimensional case. Stone (1985, 1986) shows that it is possible to estimate $m_\alpha(\cdot)$ and m(·) with the one-dimensional optimal rate of convergence, for example $n^{2/5}$ for twice differentiable functions, regardless of d. The estimates are easily illustrated and interpreted. For these reasons, since the 1980s, additive models have been fundamental to nonparametric regression among both econometricians and statisticians. Regarding estimation methods for achieving the one-dimensional optimal rate, the literature suggests two different approaches: backfitting and marginal integration. The former, originally suggested by Breiman and Friedman (1985), Buja, Hastie, and Tibshirani (1989), and Hastie and Tibshirani (1987, 1990), executes iterative calculations of one-dimensional smoothing until some convergence criterion is satisfied; a sketch of the idea follows. Though intuitively appealing, the statistical properties of the backfitting algorithm were not clearly understood until the recent works of Opsomer and Ruppert (1997) and Mammen, Linton, and Nielsen (1999). They develop specific (linear) backfitting procedures and establish the geometric convergence of their algorithms and the pointwise asymptotic distributions under some conditions. However, one disadvantage of these procedures is the time-consuming iteration required for implementation. Also, the proofs for the linear case cannot be easily generalized to nonlinear cases such as generalized additive models.
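For concreteness, here is the backfitting idea for a two-component model $y = m_1(X_1) + m_2(X_2) + \varepsilon$, sketched with Nadaraya–Watson smooths; the function names and kernel choice are ours, and this is not the specific linear backfitting procedure analyzed in the papers just cited.

```python
import numpy as np

def nw_smooth(x, y, grid, h):
    """Nadaraya-Watson smooth of y on x, evaluated at the grid points."""
    u = (grid[:, None] - x[None, :]) / h
    w = np.exp(-0.5 * u**2)                              # Gaussian kernel weights
    return (w * y).sum(axis=1) / w.sum(axis=1)

def backfit(x1, x2, y, h, iters=20):
    """Alternate univariate smooths of the partial residuals until they settle."""
    m1 = np.zeros_like(y)
    m2 = np.zeros_like(y)
    for _ in range(iters):
        m1 = nw_smooth(x1, y - m2, x1, h); m1 -= m1.mean()   # impose E[m1] = 0
        m2 = nw_smooth(x2, y - m1, x2, h); m2 -= m2.mean()   # impose E[m2] = 0
    return m1, m2
```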

A more recent approach, called marginal integration (MI), is theoretically more tractable: its statistical properties are easy to derive, because it simply averages multivariate kernel estimates. Developed independently by Newey (1994), Tjøstheim and Auestad (1994), and Linton and Nielsen (1995), its simplicity inspired subsequent applications such as Linton, Wang, Chen, and Härdle (1995) for transformation models and Linton, Nielsen, and van de Geer (2003) for hazard models with censoring. In the time series models that are special cases of (1.1) and (1.2) with $F_m$ the identity, Chen and Tsay (1993a, 1993b) and Masry and Tjøstheim (1997) apply backfitting and MI, respectively, to estimate the conditional mean function. Mammen et al. (1999) provide useful results for the same type of models by improving the previous backfitting method with some modification and successfully deriving the asymptotic properties under weak conditions. The separability assumption is also used in volatility estimation by Yang, Härdle, and Nielsen (1999), where the nonlinear ARCH model has additive mean and multiplicative volatility in the form of

$$y_t = c_m + \sum_{\alpha=1}^{d} m_\alpha(y_{t-\alpha}) + \Bigl\{c_v \prod_{\alpha=1}^{d} v_\alpha(y_{t-\alpha})\Bigr\}^{1/2}\varepsilon_t. \tag{1.5}$$

To estimate (1.5), they rely on marginal integration with local linear fits as a pilot estimate and derive the asymptotic properties.

This paper makes two contributions to the additive literature. The first concerns the theoretical development of a new estimation tool, the local instrumental variable estimator for the components of additive models (LIVE for CAM), which was outlined for simple additive cross-sectional regression in Kim, Linton, and Hengartner (1999). The novelty of the procedure lies in the simple definition of the estimator based on univariate smoothing combined with new kernel weights. That is, adjusting the kernel weights via the conditional density of the covariate enables a univariate kernel smoother to estimate the corresponding additive component function consistently. In many respects, the new estimator preserves the good properties of univariate smoothers. The instrumental variable method is analytically tractable for asymptotic theory: it is shown to attain the optimal one-dimensional rate. Furthermore, it is computationally more efficient than the two existing methods (backfitting and MI) in the sense that it reduces the computations by a factor of n smoothings. The other contribution relates to the general coverage of the model we work with. The model in (1.1)–(1.3) extends ARCH models to a generalized additive framework where both the mean and variance functions are additive after some known transformation (see Hastie and Tibshirani, 1990). All the time series models in our previous discussion can be regarded as subclasses of the data generating process for $\{y_t\}$ in (1.1)–(1.3). For example, setting $G_m$ to be the identity and $G_v$ a logarithmic function reduces our model to (1.5). Similar efforts to apply transformations have been made in parametric ARCH models. Nelson (1991) considers a model for the log of the conditional variance, the exponential (G)ARCH class, to embody the multiplicative effects of volatility. The Box–Cox transformation, which is intermediate between linear and logarithmic and which allows nonseparable news impact curves, has also been proposed for volatility. Because it is hard to tell a priori which volatility structure is more realistic, and the choice should be determined by real data, our generalized additive model provides a usefully flexible specification for empirical work. Additionally, from the perspective of potential misspecification, the transformation used here alleviates the restriction imposed by the additivity assumption, which increases the approximating power of our model. Note that when the lagged variables in (1.1)–(1.3) are replaced by different covariates and the observations are independent and identically distributed (i.i.d.), the model becomes the cross-sectional additive model studied by Linton and Härdle (1996). Finally, we also consider more efficient estimation along the lines of Linton (1996, 2000).

The rest of the paper is organized as follows. Section 2 describes the main estimation idea in a simple setting. In Section 3, we define the estimator for the full model. In Section 4 we give our main results, including the asymptotic normality of our estimators. Section 5 discusses more efficient estimation. Section 6 reports a small Monte Carlo study. The proofs are contained in the Appendix.

2. NONPARAMETRIC INSTRUMENTAL VARIABLES: THE MAIN IDEA

This section explains the basic idea behind the instrumental variable method and defines the estimation procedure. For ease of exposition, this will be carried out using an example of simple additive models with i.i.d. data. We then extend the definition to the generalized additive ARCH case in (1.1)–(1.3).

Consider a bivariate additive regression model for i.i.d. data $(y, X_1, X_2)$,

$$y = m_1(X_1) + m_2(X_2) + \varepsilon,$$

where $E(\varepsilon \mid X) = 0$ with $X = (X_1, X_2)$ and the components satisfy the identification conditions $E[m_\alpha(X_\alpha)] = 0$, for α = 1, 2 (the constant term is assumed to be zero, for simplicity). Letting $\eta = m_2(X_2) + \varepsilon$, we rewrite the model as

$$y = m_1(X_1) + \eta, \tag{2.6}$$

which is a classical example of "omitted variable" regression. That is, although (2.6) appears to take the form of a univariate nonparametric regression model, smoothing y on $X_1$ will incur a bias due to the omitted variable η, because η contains $X_2$, which in general depends on $X_1$. One solution is suggested by the classical econometric notion of an instrumental variable. That is, we look for an instrument W such that

$$E(W\eta \mid X_1) = 0 \quad\text{and}\quad E(W \mid X_1) \neq 0$$

with probability one.²

² Note the contrast with the marginal integration or projection method. In that approach one defines $m_1$ by some unconditional expectation

$$m_1(x_1) = E[W(X_2)\, m(x_1, X_2)]$$

for some weighting function W that depends only on $X_2$ and that satisfies $E[W(X_2)] = 1$.
If such a random variable exists, we can write

$$m_1(X_1) = \frac{E(Wy \mid X_1)}{E(W \mid X_1)}. \tag{2.8}$$

This suggests that we estimate the function $m_1(\cdot)$ by nonparametric smoothing of Wy on $X_1$ and of W on $X_1$. In parametric models the choice of instrument is usually not obvious and requires some caution. However, our additive model has a natural class of instruments: $p_2(X_2)/p(X)$ times any measurable function of $X_1$ will do, where p(·), $p_1(\cdot)$, and $p_2(\cdot)$ are the density functions of the covariates X, $X_1$, and $X_2$, respectively. It follows that

$$\frac{E(Wy \mid X_1 = x_1)}{E(W \mid X_1 = x_1)} = \int m(x_1, x_2)\, p_2(x_2)\, dx_2,$$

as required. This formula shows what the instrumental variable estimator is estimating when m is not additive: an average of the regression function over the $X_2$ direction, exactly the same as the target of the marginal integration estimator. For simplicity we will take

$$W = \frac{p_2(X_2)}{p(X)} \tag{2.9}$$

throughout.³

³ If instead we take $W = p_1(X_1)\,p_2(X_2)/p(X)$, this satisfies $E(W \mid X_1) = 1$ and $E(W\eta \mid X_1) = 0$. However, the term $p_1(X_1)$ cancels out of the expression and is redundant.

Up to now, we have implicitly assumed that the distributions of the covariates are known a priori. In practice this is rarely true, and we have to rely on estimates of these quantities. Let $\hat p(\cdot)$, $\hat p_1(\cdot)$, and $\hat p_2(\cdot)$ be kernel estimates of the densities p(·), $p_1(\cdot)$, and $p_2(\cdot)$, respectively. Then the feasible procedure is defined by replacing the instrumental variable W with

$$\hat W = \frac{\hat p_2(X_2)}{\hat p(X)}$$

and taking sample averages instead of population expectations. Section 3 provides a rigorous statistical treatment of feasible instrumental variable estimators based on local linear estimation. See Kim et al. (1999) for a slightly different approach.

Next, we come to the main advantage of the local instrumental variable method: its computational cost. The marginal integration method needs $n^2$ regression smoothings, evaluated at the pairs $(X_{1i}, X_{2j})$ for i, j = 1,…,n, whereas the backfitting method requires nr operations, where r is the number of iterations needed to achieve convergence. The instrumental variable procedure, in contrast, takes at most 2n kernel smoothing operations in a preliminary step for estimating the instrumental variable and another n operations for the regressions. Thus, it can easily be combined with the bootstrap, whose computational cost often becomes prohibitive in the case of marginal integration (see Kim et al., 1999). A sketch of the whole procedure follows.
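The following sketch puts the pieces of Section 2 together for the bivariate model, assuming Gaussian kernels and Nadaraya–Watson smooths in place of the local linear fits used in Section 3; `live_m1` (our name) returns an estimate of $m_1$, up to an additive constant, at the grid points.

```python
import numpy as np

def gauss(u):
    return np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)

def live_m1(x1, x2, y, grid, h, g):
    """Feasible LIVE: smooth W_hat*y and W_hat on X1; return their ratio."""
    n = len(y)
    p2 = gauss((x2[:, None] - x2[None, :]) / g).sum(1) / (n * g)         # p2_hat(X2i)
    pj = (gauss((x1[:, None] - x1[None, :]) / g)
          * gauss((x2[:, None] - x2[None, :]) / g)).sum(1) / (n * g**2)  # p_hat(Xi)
    w = p2 / pj                                                          # instrument W_hat
    k = gauss((grid[:, None] - x1[None, :]) / h)                         # weights at grid
    return (k * (w * y)).sum(1) / (k * w).sum(1)                         # E(Wy|X1)/E(W|X1)

# Example with dependent covariates:
rng = np.random.default_rng(1)
x1 = rng.normal(size=400)
x2 = 0.5 * x1 + rng.normal(size=400)
y = np.sin(x1) + 0.3 * x2**2 + 0.2 * rng.normal(size=400)
m1_hat = live_m1(x1, x2, y, np.linspace(-1.5, 1.5, 31), h=0.25, g=0.3)
```

The instrument is computed once from the two density estimates; every subsequent step is a univariate smooth.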

Finally, we show how the instrumental variable approach can be applied to generalized additive models. Let F(·) be the inverse of a known link function G(·) and let $m(X) = E(y \mid X)$. The model is defined as

$$m(X) = F\{m_1(X_1) + m_2(X_2)\},$$

or equivalently $G(m(X)) = m_1(X_1) + m_2(X_2)$. We maintain the same identification condition, $E[m_\alpha(X_\alpha)] = 0$. Unlike in the simple additive model, there is no direct way to relate Wy to $m_1(X_1)$ here, so (2.8) cannot be implemented. However, under additivity

$$m_1(X_1) = \frac{E\{W\,G(m(X)) \mid X_1\}}{E(W \mid X_1)} \tag{2.11}$$

for the W defined in (2.9). Because m(·) is unknown, we need consistent estimates of m(X) in a preliminary step; the calculation in (2.11) then becomes feasible. In the next section we show how these ideas are translated into estimators for the general time series setting.

3. INSTRUMENTAL VARIABLE PROCEDURE FOR GANARCH

We start with some simplifying notation that will be used repeatedly in what follows. Let $x_t$ be the vector of the d lagged variables up to t − 1, that is, $x_t = (y_{t-1},\ldots,y_{t-d})$, or concisely $x_t = (y_{t-\alpha}, \underline{y}_{t-\alpha})$, where $\underline{y}_{t-\alpha} = (y_{t-1},\ldots,y_{t-\alpha+1},y_{t-\alpha-1},\ldots,y_{t-d})$. Defining

$$H(x_t) = [G_m(m(x_t)),\, G_v(v(x_t))]^\top,$$

we can reformulate (1.1)–(1.3) with a focus on the αth components of the mean and variance as

$$H(x_t) = \varphi_\alpha(y_{t-\alpha}) + H_{\underline{\alpha}}(\underline{y}_{t-\alpha}).$$

To save space we will use the following abbreviations for the functions to be estimated:

$$\varphi_\alpha(y_\alpha) = [M_\alpha(y_\alpha),\, V_\alpha(y_\alpha)]^\top = [c_m + m_\alpha(y_\alpha),\, c_v + v_\alpha(y_\alpha)]^\top, \qquad H_{\underline{\alpha}}(\underline{y}_{t-\alpha}) = \sum_{\beta \neq \alpha} [m_\beta(y_{t-\beta}),\, v_\beta(y_{t-\beta})]^\top.$$

Note that the components $m_\alpha(\cdot)$ and $v_\alpha(\cdot)$ are identified, up to the constant $c = (c_m, c_v)$, by $\varphi_\alpha(\cdot)$, which will be our major interest in estimation. Subsequently, we examine in some detail each relevant step for computing the feasible nonparametric instrumental variable estimator of $\varphi_\alpha(\cdot)$. The set of observations is given by $\{y_t\}$, where n′ = n + d is the total number of observations.

3.1. Step I: Preliminary Estimation of $r_t = H(x_t)$

Because $r_t$ is unknown, we start by computing pilot estimates of the regression surface with a local linear smoother. Let $\hat m(x)$ be the first component of the vector $(\hat a, \hat b)$ that solves

$$\min_{a,\,b} \sum_{t=1}^{n} \bigl[y_t - a - b^\top (x_t - x)\bigr]^2 K_h(x_t - x), \tag{3.12}$$

where $K_h(u) = \prod_{j=1}^{d} K(u_j/h)/h^d$, K(·) is a one-dimensional kernel function, and h = h(n) is a bandwidth sequence. In a similar way, we get the estimate of the volatility surface, $\hat v(x)$, from (3.12) by replacing $y_t$ with the squared residuals, $\hat\varepsilon_t^2 = \{y_t - \hat m(x_t)\}^2$. Then, transforming $[\hat m(x_t), \hat v(x_t)]$ by the known links leads to consistent estimates of $r_t$,

$$\hat r_t = \hat H(x_t) = [G_m(\hat m(x_t)),\, G_v(\hat v(x_t))]^\top.$$
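A sketch of this pilot fit, assuming a product Gaussian kernel in place of the compactly supported K of Assumption A5; `local_linear` (our name) returns the intercept $\hat a$, that is, the minimizing first component in (3.12) evaluated at x0.

```python
import numpy as np

def local_linear(X, y, x0, h):
    """Local linear fit of y on the (n, d) matrix X at the point x0;
    returns the intercept, i.e., the pilot estimate of the surface at x0."""
    Z = X - x0                                    # centered regressors
    w = np.exp(-0.5 * ((Z / h)**2).sum(axis=1))   # product Gaussian kernel weights
    D = np.column_stack([np.ones(len(y)), Z])     # design [1, x_t - x0]
    WD = D * w[:, None]
    beta = np.linalg.solve(D.T @ WD, WD.T @ y)    # weighted least squares
    return beta[0]
```

The volatility surface is obtained by rerunning the same fit with the squared residuals in place of y, and $\hat r_t$ by passing both surfaces through $G_m$ and $G_v$.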

3.2. Step II: Instrumental Variable Estimation of Additive Components

This step involves the estimation of $\varphi_\alpha(\cdot)$, which is equivalent to $[M_\alpha(\cdot), V_\alpha(\cdot)]^\top$ up to the constant c. Let p(·) and $p_{\underline{\alpha}}(\cdot)$ denote the density functions of the random variables $(y_{t-\alpha}, \underline{y}_{t-\alpha})$ and $\underline{y}_{t-\alpha}$, respectively. Define the feasible instrument as

$$\hat W_t = \frac{\hat p_{\underline{\alpha}}(\underline{y}_{t-\alpha})}{\hat p(x_t)},$$

where $\hat p_{\underline{\alpha}}(\cdot)$ and $\hat p(\cdot)$ are computed using the kernel function L(·), for example

$$\hat p(x_t) = \frac{1}{n}\sum_{s \neq t} \prod_{j=1}^{d} L_g(y_{s-j} - y_{t-j}),$$

with $L_g(\cdot) \equiv L(\cdot/g)/g$ and g = g(n) a bandwidth sequence. The instrumental variable local linear estimates $\hat\varphi_\alpha(y_\alpha) = [\hat M_\alpha(y_\alpha), \hat V_\alpha(y_\alpha)]^\top$ are given as $(\hat a_1, \hat a_2)$ through minimizing the localized squared errors elementwise:

$$(\hat a_j, \hat b_j) = \operatorname*{arg\,min}_{a_j,\, b_j} \sum_{t=1}^{n} \hat W_t \bigl[\hat r_t^{(j)} - a_j - b_j (y_{t-\alpha} - y_\alpha)\bigr]^2 K_h(y_{t-\alpha} - y_\alpha), \qquad j = 1, 2, \tag{3.13}$$

where $\hat r_t^{(j)}$ is the jth element of $\hat r_t$.⁴

⁴ For simplicity, we choose a common bandwidth parameter for the kernel function K(·) in (3.12) and (3.13), which amounts to undersmoothing (for our choice of h) for the purposes of estimating m. Undersmoothing in the preliminary estimation of Step I allows us to control the biases from estimating m and v. In addition, the convolution kernel appearing in the asymptotic variance of Theorem 1 relies on the same bandwidth being used for K(·).

The closed form of the solution is

$$\hat\varphi_\alpha(y_\alpha)^\top = e_1^\top \Bigl[\sum_{t=1}^{n} \hat W_t K_h(y_{t-\alpha} - y_\alpha)\, D_t D_t^\top\Bigr]^{-1} \sum_{t=1}^{n} \hat W_t K_h(y_{t-\alpha} - y_\alpha)\, D_t\, \hat r_t^\top, \tag{3.14}$$

where $D_t = [1,\, y_{t-\alpha} - y_\alpha]^\top$ and $e_1 = [1, 0]^\top$.
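In code, (3.13)–(3.14) amount to a weighted least squares solve per evaluation point; a sketch under the same Gaussian-kernel assumption as before (names are ours):

```python
import numpy as np

def iv_local_linear(y_lag, r_hat, w_hat, y0, h):
    """One coordinate of (3.13): local linear fit of r_hat on y_lag at y0,
    with the kernel weights multiplied by the estimated instrument w_hat."""
    z = y_lag - y0
    k = w_hat * np.exp(-0.5 * (z / h)**2)          # W_hat_t * K_h(y_{t-a} - y0)
    D = np.column_stack([np.ones_like(z), z])      # design [1, y_{t-a} - y0]
    WD = D * k[:, None]
    a, b = np.linalg.solve(D.T @ WD, WD.T @ r_hat)
    return a                                       # intercept = phi_alpha_hat(y0)
```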

4. MAIN RESULTS

Let $\mathcal{F}_a^b$ be the σ-algebra of events generated by $\{y_t\}_{t=a}^{b}$ and α(k) the strong mixing coefficient of $\{y_t\}$, defined by

$$\alpha(k) = \sup_{A \in \mathcal{F}_{-\infty}^{0},\; B \in \mathcal{F}_{k}^{\infty}} |P(A \cap B) - P(A)P(B)|.$$

Throughout the paper, we make the following assumptions.

Assumption A.

A1. $\{y_t\}_{t=1}^{\infty}$ is a stationary and strongly mixing process generated by (1.1)–(1.3), with a mixing coefficient such that $\sum_{k \ge 1} k^{a}\,\alpha(k)^{1-2/\nu} < \infty$, for some ν > 2 and 0 < a < (1 − 2/ν).

As pointed out by Masry and Tjøstheim (1997), the condition on the mixing coefficient in A1 is milder than the standard assumption that the coefficient decreases at a geometric rate, that is, $\alpha(k) = e^{-\beta k}$ (for some β > 0). Some technical regularity conditions are stated here. For simplicity, we assume that the process $\{y_t\}_{t=1}^{\infty}$ has a compact support.

A2. The additive component functions, mα(·) and vα(·), for α = 1,…,d, are continuous and twice differentiable on the compact support.

A3. The link functions, Gm and Gv, have bounded continuous second-order derivatives over any compact interval.

A4. The joint and marginal density functions, p(·), $p_\alpha(\cdot)$, and $p_{\underline{\alpha}}(\cdot)$, for α = 1,…,d, are continuous, twice differentiable with bounded (partial) derivatives, and bounded away from zero on the compact support.

A5. The kernel functions, K(·) and L(·), are real, bounded, nonnegative, symmetric (around zero) functions on a compact support satisfying ∫K(u) du = ∫L(u) du = 1 and ∫uK(u) du = ∫uL(u) du = 0. Also, assume that the kernel functions are Lipschitz continuous, |K(u) − K(v)| ≤ C|u − v|.

A6. (i) and (ii): rate conditions on the bandwidths h and g hold. (iii) The bandwidth satisfies a further rate condition involving a sequence {t(n)} of positive integers with t(n) → ∞.

Conditions A2–A5 are standard in kernel estimation. The continuity assumptions in A2 and A4, together with the compact support, imply that the functions are bounded. The bandwidth conditions in A6(i) and A6(ii) are necessary for showing the negligibility of the stochastic error terms arising from the preliminary estimation of m, v, and the densities. Under twice differentiability of these functions as in A2–A4, the given side conditions are satisfied when d ≤ 4. The asymptotic results that follow can be extended to the more general case of d > 4, although we do not prove this in the paper. One route to higher dimensions is to strengthen the differentiability conditions in A2–A4 and use higher order polynomials (see Kim et al., 1999). The additional bandwidth condition in A6(iii) is necessary to control the effects of the dependence of the mixing processes in showing the asymptotic normality of the instrumental variable estimates. The proof of consistency, however, does not require this condition. Define $[\nabla G_m(t), \nabla G_v(t)] = [dG_m(t)/dt,\, dG_v(t)/dt]$. Let $(K*K)_i(u) = \int K(w)K(w+u)\,w^i\,dw$, a convolution of kernel functions, $\mu^2_{K*K} = \int (K*K)_0(u)\,u^2\,du$, and let $\|K\|_2^2$ denote $\int K^2(u)\,du$. The asymptotic properties of the feasible instrumental variable estimates in (3.14) are summarized in the following theorem, whose proof is in the Appendix. Let $\kappa_3(y_\alpha, \underline{z}_\alpha) = E[\varepsilon_t^3 \mid x_t = (y_\alpha, \underline{z}_\alpha)]$ and $\kappa_4(y_\alpha, \underline{z}_\alpha) = E[(\varepsilon_t^2 - 1)^2 \mid x_t = (y_\alpha, \underline{z}_\alpha)]$. A ⊙ B denotes the matrix Hadamard product.
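As a concrete illustration of these kernel constants, suppose K is the standard Gaussian density φ (the assumptions require compact support, so this choice is purely illustrative). Then

$$(K*K)_0(u) = \int \phi(w)\,\phi(w+u)\,dw = \frac{1}{2\sqrt{\pi}}\, e^{-u^2/4}, \qquad \mu^2_{K*K} = \int (K*K)_0(u)\, u^2\, du = 2, \qquad \|K\|_2^2 = \frac{1}{2\sqrt{\pi}},$$

so the convolution of K with itself is again a symmetric density, with second moment twice that of K.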

THEOREM 1. Assume that conditions A1–A6 hold. Then,

Remarks.

1. To estimate $m_\alpha(\cdot)$ and $v_\alpha(\cdot)$ we can use recentered estimates, subtracting from $\hat M_\alpha(\cdot)$ and $\hat V_\alpha(\cdot)$ their sample averages over $\{y_{t-\alpha}\}$. Because the sample averages converge at the parametric rate, the bias and variance of the recentered estimates are the same as those of $\hat M_\alpha(\cdot)$ and $\hat V_\alpha(\cdot)$. For $y = (y_1,\ldots,y_d)$, the estimates of the conditional mean and volatility are defined by

$$\hat m(y) = F_m\Bigl[\hat c_m + \sum_{\alpha=1}^{d} \hat m_\alpha(y_\alpha)\Bigr], \qquad \hat v(y) = F_v\Bigl[\hat c_v + \sum_{\alpha=1}^{d} \hat v_\alpha(y_\alpha)\Bigr].$$

Then, by Theorem 1 and the delta method, their asymptotic distributions follow. It is easy to see that the component estimates are asymptotically uncorrelated for any α and β and that the asymptotic variance of their sum is the sum of the variances of the components.

2. The first term of the bias is of standard form, depending only on the second derivatives, as in other local linear smoothing. The last term reflects the biases from using estimated density functions to construct the feasible instrumental variable, $\hat W_t$. When the instrument built from the known density functions, $p_{\underline{\alpha}}(\underline{y}_{t-\alpha})/p(x_t)$, is used in (3.13), the asymptotic properties of the instrumental variable estimates are the same as those in Theorem 1, except that the asymptotic bias then includes only the first two terms of $B_\alpha(y_\alpha)$.

3. The convolution kernel $(K*K)(\cdot)$ is the legacy of double smoothing in the instrumental variable estimation of "generalized" additive models, because we smooth $\hat r_t$, with $\hat m$ and $\hat v$ given by (multivariate) local linear fits. When $G_m(\cdot)$ is the identity, we can directly smooth $y_t$ instead of $\hat r_t$ to estimate the components of the conditional mean function. Then, as the following theorem shows, the second term of the bias $B_\alpha$ does not arise, and the convolution kernel in the variance is replaced by the usual kernel function.

Suppose that $F_m(t) = F_v(t) = t$ in (1.2) and (1.3). In this case, the instrumental variable estimates of $\varphi_\alpha(y_\alpha)$ can be defined in a simpler way. For $\varphi_\alpha(y_\alpha) = [M_\alpha(y_\alpha), V_\alpha(y_\alpha)]^\top = [c_m + m_\alpha(y_\alpha),\, c_v + v_\alpha(y_\alpha)]^\top$, we define the estimates by the solution to the adjusted-kernel least squares in (3.13), with the modification that the (2 × 1) vector $\hat r_t$ is replaced by $[y_t, \hat\varepsilon_t^2]^\top$, where $\hat\varepsilon_t = y_t - \hat m(x_t)$ is given in Step I in Section 3.1. Theorem 2 shows the asymptotic normality of these estimates. The proof is almost the same as that of Theorem 1 and thus is omitted.

THEOREM 2. Under the same conditions as Theorem 1,

Although the instrumental variable estimators achieve the one-dimensional optimal convergence rate, there is room for improvement in terms of variance. For example, compared with the marginal integration estimators of Linton and Härdle (1996) or Linton and Nielsen (1995), the asymptotic variances of the instrumental variable estimates of $m_1(\cdot)$ in Theorems 1 and 2 include an additional factor of $m_2^2(\cdot)$. This is because the instrumental variable approach treats $\eta = m_2(X_2) + \varepsilon$ in (2.6) as if it were the error term of the regression equation for $m_1(\cdot)$. Note that the second term of the asymptotic covariance in Theorem 2 is the same as that in Yang et al. (1999), where the authors considered only the case of additive mean and multiplicative volatility. The issue of efficiency in estimating an additive component was first addressed by Linton (1996), based on "oracle efficiency" bounds of infeasible estimators given knowledge of the other components. By this standard, both the instrumental variable and marginal integration estimators are inefficient, but they can attain the efficiency bounds through one simple additional step, following Linton (1996, 2000) and Kim et al. (1999).

5. MORE EFFICIENT ESTIMATION

5.1. Oracle Standard

In this section we define a standard of efficiency that could be achieved in the presence of certain information, and then we show how to achieve it in practice. There are several routes to efficiency here, depending on the assumptions one is willing to make about $\varepsilon_t$. We shall take an approach based on likelihood; that is, we shall assume that $\varepsilon_t$ is i.i.d. with known density function f, such as the normal or the t with given degrees of freedom. It is easy to generalize this to the case where f contains unknown parameters, but we shall not do so here. It is also possible to build an efficiency standard based on the moment conditions in (1.1)–(1.3). We choose the likelihood approach because it leads to easy calculations, links with existing work, and is the most common method for estimating parametric ARCH/GARCH models in applied work.
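For example, with standard normal errors we have π(e) = −log f(e) = e²/2 + ½ log 2π, so the negative conditional log density of $y_t$ given $x_t$, which is what gets localized below, is

$$\pi\!\left(\frac{y_t - m(x_t)}{v^{1/2}(x_t)}\right) + \tfrac12 \log v(x_t) = \frac{\{y_t - m(x_t)\}^2}{2\, v(x_t)} + \tfrac12 \log v(x_t) + \tfrac12 \log 2\pi,$$

the familiar Gaussian quasi-likelihood criterion of parametric ARCH estimation.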

There are several standards that we could apply here. First, suppose that we know $(c_m, \{m_\beta(\cdot) : \beta \neq \alpha\})$ and $(c_v, \{v_\alpha(\cdot) : \alpha\})$; then what is the best estimator we can obtain for the function $m_\alpha$ within the local polynomial paradigm? Similarly, suppose that we know $(c_m, \{m_\alpha(\cdot) : \alpha\})$ and $(c_v, \{v_\beta(\cdot) : \beta \neq \alpha\})$; then what is the best estimator we can obtain for the function $v_\alpha$? It turns out that this standard is very high and cannot be achieved in practice. Instead we ask: suppose that we know $(c_m, \{m_\beta(\cdot) : \beta \neq \alpha\})$ and $(c_v, \{v_\beta(\cdot) : \beta \neq \alpha\})$; then what is the best estimator we can obtain for the pair $(m_\alpha, v_\alpha)$? It turns out that this standard can be achieved in practice. Let π denote −log f(·), where f(·) is the density function of $\varepsilon_t$. We use $z_t$ to denote $(x_t, y_t)$, where $x_t = (y_{t-1},\ldots,y_{t-d}) = (y_{t-\alpha}, \underline{y}_{t-\alpha})$. For $\theta = (a_m, a_v, b_m, b_v)$, we define the (negative) conditional local log likelihood $l_t(\theta, \gamma_\alpha)$, and the infeasible local likelihood estimator is defined as the minimizer of

$$\sum_{t=1}^{n} l_t(\theta, \gamma_{\alpha 0})\, K_h(y_{t-\alpha} - y_\alpha),$$

where $\gamma_{\alpha 0}(\cdot) = (\gamma_{m0}(\cdot), \gamma_{v0}(\cdot)) = (c_{m0} + m_{\underline{\alpha}0}(\cdot),\, c_{v0} + v_{\underline{\alpha}0}(\cdot))$ collects the true off-α component sums. From the definition of the score function as the derivative of $l_t(\theta, \gamma_\alpha)$ with respect to θ, the first-order condition for the infeasible estimator is given by setting the kernel-weighted sum of scores to zero.

The asymptotic distribution of the local maximum likelihood estimator has been studied by Avramidis (2002). For $y = (y_1,\ldots,y_d) = (y_\alpha, \underline{y}_\alpha)$, define the information-type quantities entering the asymptotic variance below. With a minor generalization of the results of Avramidis (2002, Theorem 2), we obtain the following asymptotic properties for the infeasible estimators. Let $\varphi_\alpha^c(y_\alpha) = \varphi_\alpha(y_\alpha) - c$, where $c = (c_m, c_v)$.

THEOREM 3. Under Assumption C in the Appendix, it holds that the infeasible estimator of $\varphi_\alpha^c(y_\alpha)$ is asymptotically normal, with asymptotic variance $\Omega_\alpha^*(y_\alpha)$.

A more specific form for the asymptotic variance can be calculated. For example, suppose that the error density function f(·) is symmetric. Then the asymptotic variance of the volatility estimator can be expressed in terms of

$$g(y) = f'(y)\, f^{-1}(y)\, y + 1 \qquad\text{and}\qquad q(y) = \bigl[y^2 f''(y) f(y) + y f'(y) f(y) - y^2 f'(y)^2\bigr] f^{-2}(y).$$

When the error distribution is Gaussian, the asymptotic variance simplifies further. In this case, one can easily see that the infeasible estimator has lower asymptotic variance than the instrumental variable estimator. To see this, we note that $\nabla G_m = 1/\nabla F_m$ and $\|K\|_2^2 \le \|(K*K)_0\|_2^2$ and apply the Cauchy–Schwarz inequality. In a similar way, from $\kappa_4 = 3$ under the Gaussianity assumption on ε, a matching bound follows. These, together with $\kappa_3 = 0$, imply that the second term of $\Sigma_\alpha^*(y_\alpha)$ in Theorem 1 is greater than $\Omega_\alpha^*(y_\alpha)$ in the sense of positive definiteness, and hence $\Sigma_\alpha^*(y_\alpha) \ge \Omega_\alpha^*(y_\alpha)$, because the first term of $\Sigma_\alpha^*(y_\alpha)$ is a nonnegative matrix. The infeasible estimator is more efficient than the instrumental variable estimator because the former uses more information concerning the mean–variance structure. We finally remark that the infeasible estimator is also more efficient than the marginal integration estimator of Yang et al. (1999), whose asymptotic variance corresponds to the second term of $\Sigma_\alpha^*(y_\alpha)$; see the discussion following Theorem 2.

5.2. Feasible Estimation

Let $(\hat m, \hat v)$ be the estimators from (3.12) and (3.13) in Section 3, with the common bandwidth parameter $h_0$ chosen for the kernel function K(·). We define the feasible local likelihood estimator as the minimizer of

$$\sum_{t=1}^{n} l_t(\theta, \hat\gamma_\alpha)\, K_h(y_{t-\alpha} - y_\alpha),$$

where $\hat\gamma_\alpha$ is given by (5.15), with the additional bandwidth parameter h, possibly different from $h_0$. Then the first-order condition for the feasible estimator is given by setting the corresponding kernel-weighted sum of scores in (5.16) to zero. We have the following result.

THEOREM 4. Under Assumptions B and C in the Appendix, it holds that

This result shows that the oracle efficiency bound is achieved by the two-step estimator.
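To illustrate, here is a pointwise sketch of the two-step estimator under identity links and Gaussian errors: the Step II estimates of the off-α components are plugged in, and the local Gaussian likelihood is minimized over the four local parameters. The function name, the crude starting value, and the use of a generic optimizer are our simplifications; in practice one would start from the instrumental variable estimates and could use Newton steps with the analytic score.

```python
import numpy as np
from scipy.optimize import minimize

def two_step_point(y, y_lag, m_rest, v_rest, y0, h):
    """Local Gaussian likelihood at y0 with the off-alpha sums m_rest, v_rest
    plugged in; returns the local levels (a_m, a_v) estimating the alpha-th
    mean and variance components (plus constants) at y0."""
    z = y_lag - y0
    k = np.exp(-0.5 * (z / h)**2)                   # local kernel weights

    def nll(theta):
        am, av, bm, bv = theta
        m = m_rest + am + bm * z                    # local linear in the alpha lag
        v = np.maximum(v_rest + av + bv * z, 1e-6)  # keep the variance positive
        return np.sum(k * ((y - m)**2 / (2 * v) + 0.5 * np.log(v)))

    theta0 = np.array([0.0, y.var(), 0.0, 0.0])     # crude start (use IV estimates)
    return minimize(nll, theta0, method="Nelder-Mead").x[:2]
```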

6. NUMERICAL EXAMPLES

A small-scale simulation is carried out to investigate the finite-sample properties of both the instrumental variable and two-step estimators. The design in our experiment is an additive nonlinear ARCH(2) with volatility components $v_1(\cdot)$ and $v_2(\cdot)$ built from the (cumulative) standard normal distribution function $\Phi_N(\cdot)$, and with $\varepsilon_t$ i.i.d. N(0,1). Figure 1 (solid lines) depicts the shapes of the volatility functions defined by $v_1(\cdot)$ and $v_2(\cdot)$. Based on the preceding model, we simulate 500 samples of ARCH processes with sample size n = 500. For each realization of the ARCH process, we apply the instrumental variable estimation procedure in (3.13) to get preliminary estimates of $v_1(\cdot)$ and $v_2(\cdot)$. Those estimates are then used to compute the two-step estimates of the volatility functions based on the feasible local maximum likelihood estimator of Section 5.2, under the normality assumption for the errors. The infeasible oracle estimates are also provided for comparison. The Gaussian kernel is used for all the nonparametric estimates, and bandwidths are chosen according to the rule of thumb (Härdle, 1990) $h = c_h\, \mathrm{std}(y_t)\, n^{-1/(4+d)}$, where $\mathrm{std}(y_t)$ is the standard deviation of $y_t$. We fix $c_h = 1$ for both the density estimates (for computing the instruments, W) and the instrumental variable estimates in (3.13) and $c_h = 1.5$ for the (feasible and infeasible) local maximum likelihood estimators. To evaluate the performance of the estimators, we calculate the mean squared error (MSE), together with the mean absolute deviation error (MAE), for each simulated data set: for α = 1, 2, the errors $e_{\alpha,\mathrm{MSE}}$ and $e_{\alpha,\mathrm{MAE}}$ average the squared and absolute deviations over grid points $\{y_1,\ldots,y_{50}\}$ on [−1, 1). The grid range covers about 70% of the observations on average. Table 1 gives the averages of the $e_{\alpha,\mathrm{MSE}}$'s and $e_{\alpha,\mathrm{MAE}}$'s over the 500 repetitions.
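In code, the error measures for one replication, together with the rule-of-thumb bandwidth quoted above, look as follows (our implementation of the displayed definitions, with demeaned estimates as in the figures):

```python
import numpy as np

grid = np.linspace(-1.0, 1.0, 50, endpoint=False)        # 50 grid points on [-1, 1)

def rule_of_thumb_h(y, d, ch=1.0):
    """h = ch * std(y) * n^(-1/(4+d)), as quoted from Hardle (1990)."""
    return ch * y.std() * len(y) ** (-1.0 / (4 + d))

def mse_mae(v_hat, v_true):
    """Average squared and absolute deviation over the grid, after demeaning."""
    dev = (v_hat - v_hat.mean()) - (v_true - v_true.mean())
    return np.mean(dev**2), np.mean(np.abs(dev))
```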

Averages of volatility estimates (demeaned): (a) first lag; (b) second lag.

Average MSE and MAE for the three volatility estimators

Table 1 shows that the infeasible oracle estimator is the best of the three, as would be expected. The performance of the instrumental variable estimator seems reasonably good compared with the local maximum likelihood estimators, at least in estimating the volatility function of the first lagged variable. However, the overall accuracy of the instrumental variable estimates is improved by the two-step procedure, which behaves almost as well as the infeasible one, confirming our theoretical results in Theorem 4. For further comparison, Figure 1 shows the averaged estimates of the volatility functions, where the averages are taken, at each grid point, over the 500 simulations. In Figure 2, we also illustrate the estimates for three typical (consecutive) realizations of the ARCH processes.

Volatility estimates (demeaned).

APPENDIX

A.1. Proofs for Section 4.

The proof of Theorem 1 consists of three steps. Without loss of generality we deal with the case α = 1; here we will use the subscript 2, for expositional convenience, to denote the nuisance direction. That is, we write $p_2(\underline{y}_{k-1})$ for $p_{\underline{1}}(\underline{y}_{k-1})$ in the case of the density function. For the component functions, $m_2(\underline{y}_{k-1})$, $v_2(\underline{y}_{k-1})$, and $H_2(\underline{y}_{k-1})$ will be used instead of $m_{\underline{1}}(\underline{y}_{k-1})$, $v_{\underline{1}}(\underline{y}_{k-1})$, and $H_{\underline{1}}(\underline{y}_{k-1})$, respectively. We start by decomposing the estimation errors into the main stochastic term and the bias. We use $X_n \approx Y_n$ to denote $X_n = Y_n\{1 + o_p(1)\}$ in what follows. Let vec(X) denote the vectorization of the elements of the matrix X along its columns.

Proof of Theorem 1.

Step I. Decompositions and Approximations.

Because

is a column vector, the vectorization of equation (3.14) gives

A similar form is obtained for the true function, φ1(y1),

by the identity

because

By defining

, the estimation errors are

where

Observing

where

, it follows by adding and subtracting $r_k = \varphi_1(y_{k-1}) + H_2(\underline{y}_{k-1})$ that

As a result of the boundedness condition in Assumption A2, the Taylor expansion applied to

at [m(xk),v(xk)] yields the first term of τn as

where $m^*(x_k)$ [$v^*(x_k)$] lies between $\hat m(x_k)$ and $m(x_k)$ [$\hat v(x_k)$ and $v(x_k)$]. In a similar way, the Taylor expansion of $\varphi_1(y_{k-1})$ at $y_1$ gives the second term of $\tau_n$ as

The term

continues to be simplified by some further approximations. Define the marginal expectation of estimated density functions

as follows:

In the first approximation, we replace the estimated instrument,

, by the ratio of the expectations of the kernel density estimates, p2(yk−1)/p(xk) and deal with the linear terms in the Taylor expansions. That is,

is approximated with an error of

by t1n + t2n:

based on the following results:

To show (i), consider the first two elements of the term, for example, which are bounded elementwise by

The last equality is direct from the uniform convergence theorems in Masry (1996) that

and

. The proof for (ii) is shown by applying Lemma A.1, which follows. The negligibility of (iii) follows in a similar way from (ii), considering (A.1). Although the asymptotic properties of s0n and t2n are relatively easy to derive, additional approximation is necessary to make t1n more tractable. Note that the estimation errors of the local linear fits,

, are decomposed into

from the approximation results for the local linear smoother in Jones, Davies, and Park (1994). A similar expression holds for volatility estimates,

, with a stochastic term of (1/n)[sum ]l [Kh(xlxk)/p(xl)]v(xl)(εl2 − 1). Define

and let $J(x_l)$ denote the marginal expectation of $J_{k,n}$ with respect to $x_k$. Then the stochastic term of $t_{1n}$, after rearranging the double sums, is approximated by the corresponding sum involving $J(x_l)$, because the approximation error from $J(x_l)$ is negligible, applying the same method as in Lemma A.1. A straightforward calculation gives

where

Observe that $(K*K)_i((y_{l-1} - y_1)/h)$ in $J(x_l)$ is actually a convolution kernel and behaves just like a one-dimensional kernel function of $y_{l-1}$. This means that the standard method (central limit theorem or law of large numbers) for univariate kernel estimates can be applied to derive the asymptotics of

If we define s1n as the remaining bias term of t1n, the estimation errors of

consist of two stochastic terms,

, and three bias terms,

, where

Step II. Computation of Variance and Bias.

We start with showing the order of the main stochastic term,

where ξk = ξ1k + ξ2k,

by calculating its asymptotic variance. Dividing a normalized variance of

into the sums of variances and covariances gives

where the last equality comes from the stationarity assumption.

We claim that

where

Proof of (a). Noting

by the stationarity assumption. Applying integration with substitution of variables and a Taylor expansion, the expectation term is

where $\kappa_3(y_1, \underline{z}_2) = E[\varepsilon_t^3 \mid x_t = (y_1, \underline{z}_2)]$ and $\kappa_4(y_1, \underline{z}_2) = E[(\varepsilon_t^2 - 1)^2 \mid x_t = (y_1, \underline{z}_2)]$. █

Proof of (b). Because

By setting c(n)h → 0, as n → ∞, we separate the covariance terms into two parts:

To show the negligibility of the first part of the covariances, note that the dominated convergence theorem, applied after a Taylor expansion and integration with substitution of variables, gives

Therefore, it follows from the boundedness condition in Assumption A2 that

where $A \le B$ means $a_{ij} \le b_{ij}$ for all elements of the matrices A and B. By the construction of c(n),

Next, we turn to the negligibility of the second part of the covariances:

Let ξ2ki be the ith element of ξ2k, for i = 1,…,4. Using Davydov's lemma (in Hall and Heyde, 1980, Theorem A.5), we obtain

for some ν > 2. The boundedness of

, for example, is evident from the direct calculation that

Thus, the covariance is bounded by

This implies

if a is such that

for example, $c(n)^a h^{1-2/\nu} = 1$, which implies c(n) → ∞. If we further restrict a such that

then

Thus, c(n)h → 0 as required. Therefore,

as n goes to ∞. █

The proof of (c) is immediate from (a) and (b).

Next, we consider the asymptotic bias. Using the standard result on the kernel weighted sum of the stationary series, we first get

because

For the asymptotic bias of s1n, we again use the approximation results in Jones et al. (1994). Then, the first component of s1n, for example, is

and converges to

based on the argument for the convolution kernel given previously. A convolution of symmetric kernels is symmetric, so that $\int (K*K)_0(u)\,u\,du = 0$ and $\int (K*K)_1(u)\,u^2\,du = \int\!\!\int w\,K(w)K(w+u)\,u^2\,dw\,du = 0$. This implies that

To calculate $s_{2n}$, we use the Taylor series expansion of $p_2(\underline{y}_{k-1})/p(x_k)$:

Thus,

Finally, for the probability limit of

we note that

with

, for i = 0,1,2, and

where $q_0 = 1$, $q_1 = 0$, and $q_2 = \mu_K^2$.

Thus,

. Therefore,

Step III. Asymptotic Normality of
.

Applying the Cramér–Wold device, it is sufficient to show asymptotic normality of an arbitrary linear combination of the components. We use the small block–large block argument (see Masry and Tjøstheim, 1997). Partition the set {d, d + 1,…,n} into 2k + 1 subsets with large blocks of size $r = r_n$ and small blocks of size $s = s_n$, where

and [x] denotes the integer part of x. Define

Then,

Because of Assumption A6, there exists a sequence an → ∞ such that

defining the large block size as

It is easy to show by (A.2) and (A.3) that as n → ∞

We first show that Sn′′ and Sn′′′ are asymptotically negligible. The same argument used in step II yields

which implies

from the condition (A.4). Next, consider

where $N_j = j(r + s) + r$. Because $|N_i - N_j + k_1 - k_2| \ge r$ for $i \neq j$, the covariance term is bounded by

The last equality also follows from step II. Hence, (1/n)E {(Sn′′)2} → 0, as n → ∞. Repeating a similar argument for Sn′′′, we get

Now, it remains to show

.

Because $\eta_j$ is measurable with respect to the corresponding σ-algebra, the lemma of Volkonskii and Rozanov (1959), as given in the appendix of Masry and Tjøstheim (1997), implies that, with

,

where the last two equalities follow from (A.4). Thus, the summands {ηj} in Sn′ are asymptotically independent, because an operation similar to (A.5) yields

Finally, because of the boundedness of density and kernel functions, the Lindeberg–Feller condition for the asymptotic normality of Sn′ holds:

for every δ > 0. This completes the proof of step III.

From

, the Slutsky theorem implies

, where

. In sum,

given by

LEMMA A.1. Assume the conditions in Assumptions A1 and A4–A6. For a bounded function, F(·), it holds that

Proof. The proof of (b) is almost the same as that of (a); therefore we show only (a). By adding and subtracting $L_{l|k}(\underline{y}_{l-2} \mid \underline{y}_{k-2})$, the conditional expectation of $L_g(\underline{y}_{l-2} - \underline{y}_{k-2})$ given $\underline{y}_{k-2}$, in $r_{1n}$, we get $r_{1n} = \xi_{1n} + \xi_{2n}$, where

Rewrite ξ2n as

where k*(n) increases to infinity as n → ∞. Let

which exists as a result of the boundedness of F(xk). Then, for a large n, the first part of ξ2n is asymptotically equivalent to (1/n)k*(n)B. The second part of ξ2n is bounded by

Therefore,

, for example.

It remains to show

. Because E1n) = 0 from the law of iteration, we just compute

(1) Consider the case k = i and l ≠ j.

because, by the law of iterated expectations and the definition of $L_{j|k}(\underline{y}_{k-2})$,

(2) Consider the case l = j and k ≠ i.

We only calculate

because the rest of the triple sum consists of expectations of standard kernel estimates and is O(1/n). Note that

where $(L*L)_g(\cdot) = (1/g)\int L(u)\,L(u + \cdot/g)\,du$ is a convolution kernel. Thus, (A.6) is

(3) Consider the case with i = k, j = m:

(4) Consider the case k ≠ i, l ≠ j:

for the same reason as in (1). █

A.2. Proofs for Section 5.

Recall that $x_t = (y_{t-1},\ldots,y_{t-d}) = (y_{t-\alpha}, \underline{y}_{t-\alpha})$ and $z_t = (x_t, y_t)$. In a similar vein, let $x = (y_1,\ldots,y_d) = (y_\alpha, \underline{y}_\alpha)$ and $z = (x, y_0)$. For the score function $s^*(z,\theta,\gamma_\alpha) = s^*(z,\theta,\gamma_\alpha(\underline{y}_\alpha))$, we define its first derivative with respect to the parameter θ by

and use

to denote $E[s^*(z_t,\theta,\gamma_\alpha)]$ and $E[\nabla_\theta s^*(z_t,\theta,\gamma_\alpha)]$, respectively. Also, the score function $s^*(z,\theta,\cdot)$ is said to be Fréchet differentiable (with respect to the sup norm ∥·∥) if there is $S^*(z,\theta,\gamma_\alpha)$ such that, for all $\gamma_\alpha$ with $\|\gamma_\alpha - \gamma_{\alpha 0}\|$ small enough,

for some bounded function b(·). The term S*(z,θ,γα0) is called the functional derivative of s*(z,θ,γα) with respect to γα. In a similar way, we define ∇γ S*(z,θ,γα) to be the functional derivative of S*(z,θ,γα) with respect to γα.

Assumption B. Suppose that (i)

is nonsingular; (ii) $S^*(z,\theta,\gamma_\alpha(\underline{y}_\alpha))$ and $\nabla_\gamma S^*(z,\theta,\gamma_\alpha(\underline{y}_\alpha))$ exist and have square integrable envelopes $\bar S^*(\cdot)$ and $\overline{\nabla_\gamma S}^*(\cdot)$, satisfying

and (iii) both s*(z,θ,γα) and S*(z,θ,γα) are continuously differentiable in θ, with derivatives bounded by square integrable envelopes.

Note that the first condition is related to the identification condition of the component functions, whereas the second concerns Fréchet differentiability (up to the second order) of the score function and uniform boundedness of the functional derivatives. For the main results in Section 5, we need the following conditions. Some of the assumptions are stronger than their counterparts in Assumption A in Section 4. Let $h_0$ and h denote the bandwidth parameters used for the preliminary instrumental variable and the two-step estimates, respectively, and g the bandwidth parameter for the kernel density.

Assumption C.

1. $\{y_t\}_{t=1}^{\infty}$ is stationary and strongly mixing with a mixing coefficient $\alpha(k) = e^{-\beta k}$, for some β > 0, and $E(\varepsilon_t^4 \mid x_t) < \infty$, where $\varepsilon_t = y_t - E(y_t \mid x_t)$.

2. The joint density function, p(·), is bounded away from zero and q-times continuously differentiable on the compact support, with Lipschitz continuous remainders; that is, there exists C < ∞ such that the qth-order partial derivatives $D^\mu p$, for all vectors $\mu = (\mu_1,\ldots,\mu_d)$ with $|\mu| = q$, are Lipschitz continuous with constant C.

3. The component functions, $m_\alpha(\cdot)$ and $v_\alpha(\cdot)$, for α = 1,…,d, are q-times continuously differentiable on the compact support, with Lipschitz continuous qth derivatives.

4. The link functions, Gm and Gv, are q-times continuously differentiable over any compact interval of the real line.

5. The kernel functions, K(·) and L(·), are of bounded support, symmetric about zero, satisfying ∫K(u) du = ∫L(u) du = 1, and of order q; that is, $\int u^i K(u)\,du = \int u^i L(u)\,du = 0$, for i = 1,…,q − 1. Also, the kernel functions are q-times differentiable with Lipschitz continuous qth derivatives.

6. The true parameters θ0 = (mα(yα),vα(yα),mα′(yα),vα′(yα)) lie in the interior of the compact parameter space Θ.

7. (i) g → 0, $ng^d \to \infty$; (ii) $h_0 \to 0$, $nh_0 \to \infty$.

8. (i) Rate conditions on the bandwidths hold and, for some integer ω > d/2, a corresponding condition holds;
 (ii) $n(h_0 h)^{2\omega+1}/\log n \to \infty$ and $h_0^{q-\omega} h^{-\omega-1/2} \to 0$;
 (iii) $nh_0^{d+(4\omega+1)}/\log n \to \infty$ and $q \ge 2\omega + 1$.

Some facts about empirical processes will be useful in the discussion that follows. Define the $L_2$-Sobolev norm (of order q) on the class of real-valued functions with domain $\mathcal{X} \subseteq \mathbb{R}^k$:

$$\|f\|_{q,2} = \Bigl(\sum_{|\mu| \le q} \int_{\mathcal{X}} |D^\mu f(x)|^2\, dx\Bigr)^{1/2},$$

where, for $x = (x_1,\ldots,x_k)$ and a k-vector $\mu = (\mu_1,\ldots,\mu_k)$ of nonnegative integers, $D^\mu f = \partial^{|\mu|} f/\partial x_1^{\mu_1}\cdots\partial x_k^{\mu_k}$ with $|\mu| = \sum_j \mu_j$, and q ≥ 1 is some positive integer. Let $\mathcal{X}$ be an open set in $\mathbb{R}^k$ with minimally smooth boundary as defined by, for example, Stein (1970). Define $\Gamma_\alpha$ as a class of smooth functions on $\mathcal{X}$ whose $L_2$-Sobolev norm is bounded by some constant C < ∞; a second such class is defined in the same way.

Define (i) an empirical process, $\nu_{1n}(\cdot)$, indexed by $\tau_1 \in \Gamma_\alpha$:

$$\nu_{1n}(\tau_1) = \frac{1}{\sqrt{n}} \sum_{t=1}^{n} \bigl\{ f_1(z_t; \tau_1) - E[f_1(z_t; \tau_1)] \bigr\},$$

with pseudometric $\rho_1(\cdot,\cdot)$ on $\Gamma_\alpha$, where $f_1(w; \tau_1) = h^{-1/2} K((w_\alpha - y_\alpha)/h)\, S^*(w, \theta_0, \gamma_{\alpha 0}(\underline{w}_\alpha))\, \tau_1(\underline{w}_\alpha)$; and (ii) an empirical process, $\nu_{2n}(\cdot,\cdot)$, indexed by $(y_\alpha, \tau_2)$:

$$\nu_{2n}(y_\alpha, \tau_2) = \frac{1}{\sqrt{n}} \sum_{t=1}^{n} \bigl\{ f_2(z_t; y_\alpha, \tau_2) - E[f_2(z_t; y_\alpha, \tau_2)] \bigr\},$$

with pseudometric $\rho_2(\cdot,\cdot)$, where $f_2(w; y_\alpha, \tau_2) = h_0^{-1/2} K[(w_\alpha - y_\alpha)/h_0]\,[p_{\underline{\alpha}}(\underline{w}_\alpha)/p(w)]\, G_m'(m(w))\, \tau_2(w)$.

We say that the processes $\{\nu_{1n}(\cdot)\}$ and $\{\nu_{2n}(\cdot,\cdot)\}$ are stochastically equicontinuous at $\tau_{10}$ and $(y_{\alpha 0}, \tau_{20})$, respectively (with respect to the pseudometrics $\rho_1(\cdot,\cdot)$ and $\rho_2(\cdot,\cdot)$, respectively), if the suprema of their increments over shrinking pseudometric balls converge to zero in outer probability, where $P^*$ denotes the outer measure of the corresponding probability measure; these are conditions (A.10) and (A.11), respectively.

Let $\mathcal{F}_1$ be the class of functions such as $f_1(\cdot)$ defined previously. Note that (A.10) follows if Pollard's entropy condition is satisfied by $\mathcal{F}_1$ with some square integrable envelope $\bar F_1$; see Pollard (1990) for more details. Because $f_1(w; \tau_1) = c_1(w)\,\tau_1(\underline{w}_\alpha)$ is the product of smooth functions $\tau_1$ from an infinite-dimensional class (with uniformly bounded partial derivatives up to order q) and a single unbounded function $c_1(w) = h^{-1/2} K((w_\alpha - y_\alpha)/h)\, S^*(w, \theta_0, \gamma_{\alpha 0}(\underline{w}_\alpha))$, the entropy condition is verified by Theorem 2 in Andrews (1994) on a class of functions of type III. Square integrability of the envelope $\bar F_1$ comes from Assumption B(ii). In a similar way, we can show (A.11) by applying the "mix and match" argument of Theorem 3 in Andrews (1994) to $f_2(w; y_\alpha, \tau_2) = c_2(w)\, h_0^{-1/2} K((w_\alpha - y_\alpha)/h_0)\, \tau_2(w)$, where K(·) is Lipschitz continuous in $y_\alpha$, that is, a function of type II.

Proof of Theorem 4. We only give a sketch, because the whole proof is lengthy and relies on arguments similar to Andrews (1994) or Gozalo and Linton (2000) for the i.i.d. case. Expanding the first-order condition in (5.16) and solving for

yields

where θ is the mean value between

. By the uniform law of large numbers in Gozalo and Linton (2000), we have

, which, together with (i) uniform convergence of

by Lemma A.3 and (ii) uniform continuity of the localized likelihood function, Qn(θ,γα) over Θ × Γα, yields

and thus consistency of

. Based on the ergodic theorem on the stationary time series and a similar argument to Theorem 1 in Andrews (1994), consistency of

and uniform convergence of

imply

For the numerator, we first linearize the score function. Under Assumption B(ii), $s^*(z,\theta,\gamma_\alpha)$ is Fréchet differentiable and (A.7) holds, which, because of

(by Lemma A.3 and Assumption C.8(i)), yields a proper linearization of the score term:

where $S^*(z_t, \gamma_{\alpha 0}(\underline{y}_{t-\alpha})) = S^*(z_t, \theta_0, \gamma_{\alpha 0}(\underline{y}_{t-\alpha}))$. Or equivalently, by letting

and $u_t = S^*(x_t, \gamma_{\alpha 0}(\underline{y}_{t-\alpha})) - E[S^*(x_t, \gamma_{\alpha 0}(\underline{y}_{t-\alpha})) \mid x_t = y]$, we have

Note that the asymptotic expansion of the infeasible estimator is equivalent to the first term of the linearized score function premultiplied by the inverse Hessian matrix in (A.12). Because of the asymptotic boundedness of (A.12), it suffices to show the negligibility of the second and third terms.

To calculate the asymptotic order of T2n, we make use of the preceding stochastic equicontinuity results. For a real-valued function δ(·) on

, we define an empirical process

where $f(x_t; y_\alpha, \delta) = K((y_{t-\alpha} - y_\alpha)/h)\, h^{\omega}\, S^*(x_t, \gamma_{\alpha 0}(\underline{y}_{t-\alpha}))\, \delta(\underline{y}_{t-\alpha})$, for some integer ω > d/2. Let

. From the uniform convergence rate in Lemma A.3 and the bandwidth condition C.8(ii), it follows that

Because

is bounded uniformly over

, with probability approaching one, it holds that

. Also, because, for some positive constant C < ∞,

we have

. Hence, following Andrews (1994, p. 2257), the stochastic equicontinuity condition of vn(yα,·) at δ0 = 0 implies that

; that is, T2n is approximated (with an op(1) error) by

We proceed to show the negligibility of $T_{2n}^*$. From the integrability condition on $S^*(z, \gamma_{\alpha 0}(\underline{y}_\alpha))$, it follows, by a change of variables and the dominated convergence theorem, that $\int K_h(y_\alpha - y_{\alpha 0})\, S^*(z, \gamma_{\alpha 0}(\underline{y}_\alpha))\, dF_0(z) = \int S^*[(y, y_{\alpha 0}, \underline{y}_\alpha), \gamma_{\alpha 0}(\underline{y}_\alpha)]\, p(y, y_{\alpha 0}, \underline{y}_\alpha)\, d(y, \underline{y}_\alpha) < \infty$, which, together with

-consistency of

, means that

. Because

this yields

From Lemma A.3,

where

. Under the condition C.8(i),

, integrability of the bias function bβ(yβ) and S*(z0α0(yα)) imply

where

Let

be the ith elements of

, respectively, with S*ij(·) being the (i,j) element of S*(·). By the dominated convergence theorem and the integrability condition, we have

where

and ∇Gj(·) = ∇Gm(·), for j = 1; ∇Gv(·), for j = 2. Because p2(·)/p(·) and

are bounded under the condition of compact support, applying the law of large numbers for i.i.d. errors

leads to

and consequently

. Likewise,

where

and, for the same reason as before, we get

, because E(mα(yt−α)) = E(vα(yt−α)) = 0.

We finally show negligibility of the last term:

Substituting the error decomposition for

and interchanging the summations gives

where the op(1) errors for the remaining bias terms hold under the assumption that

. For

we can easily check that $E(\pi_{1ni}(z_t, z_s) \mid z_t) = E(\pi_{1ni}(z_t, z_s) \mid z_s) = 0$ for $t \neq s$, implying that $\sum\sum_{t \neq s} \pi_{ni}(z_t, z_s)$ is a degenerate second-order U-statistic. The same conclusion also holds for the second term. Hence, the two double sums are mean zero and have variance of the same order as

which is of order n−1h−1. Therefore, T3n = op(1). █

LEMMA A.2. (Masry, 1996). Suppose that Assumption C holds. Then, for any vector

with |μ| = Σj μj ≤ ω,

LEMMA A.3. Suppose that Assumption C holds. Then, for any vector

with |μ| = Σj μj ≤ ω,

Proof. We first show (b). For notational simplicity, the bandwidth parameter h (only in this proof) abbreviates h0. From the decomposition results for the instrumental variable estimates,

By the Cauchy–Schwarz inequality and Lemma A.2 applied with Taylor expansion, it holds that

where the boundedness condition of C.2 is used for the last line. Hence, the standard argument of Masry (1996) implies that

, where $q_i = \int K(u_1)\, u_1^i\, du_1$. From $q_0 = 1$, $q_1 = 0$, and $q_2 = \mu_K^2$, we get the following uniform convergence result for the denominator term; that is,

, uniformly in

. For the numerator, we show the uniform convergence rate of the first element of τn because the other terms can be treated in the same way. Let τn1 denote the first element of τn, that is,

or alternatively,

where

Because $p_{\underline{\alpha}}(\cdot)/p(\cdot)$ is bounded away from zero and $G_m$ has a bounded second-order derivative, the functional $r(x_t; g)$ is Fréchet differentiable in g, with respect to the sup norm ∥·∥, with the (bounded) functional derivative $R(x_t; g) = [\partial r(x_t; g)/\partial g]\big|_{g = g(x_t)}$. This implies that for all g with $\|g - g_0\|$ small enough, there exists some bounded function b(·) such that

By Lemma A.2,

, and consequently, we can properly linearize τn1 as

where the $O_p$ error term holds uniformly in $y_\alpha$. After plugging $G_m(m(x_t)) = c_m + \sum_{1 \le \beta \le d} m_\beta(y_{t-\beta})$ into $r(x_t; g_0)$, a straightforward calculation shows that

where $\varsigma_t = [p_{\underline{\alpha}}(\underline{y}_{t-\alpha})/p(x_t)]\, M_{\underline{\alpha}}(\underline{y}_{t-\alpha})$ and $M_{\underline{\alpha}}(\underline{y}_{t-\alpha}) = \sum_{1 \le \beta \le d,\, \beta \neq \alpha} m_\beta(y_{t-\beta})$. Note that, as a result of the identification condition, $E[\varsigma_t \mid y_{t-\alpha}] = 0$, so the first term is a standard stochastic term appearing in kernel estimates. For a further asymptotic expansion of the second term of $\tau_{n1}$, we apply the stochastic equicontinuity argument to the empirical process $\{\nu_n(\cdot,\cdot)\}$, indexed by

, with

, such that

where $f(x_t; y_\alpha, \delta) = K[(y_{t-\alpha} - y_\alpha)/h]\, h^{\omega}\, [p_{\underline{\alpha}}(\underline{y}_{t-\alpha})/p(x_t)]\, G_m'(m(x_t))\, \delta(\underline{y}_{t-\alpha})$, for some positive integer ω > d/2. Let

. From the uniform convergence rate in Lemma A.2 and the bandwidth condition in C.8(iii), it follows that

, leading to (i)

and (ii)

, where δ0 = 0. These conditions and stochastic equicontinuity of vn(·,·) at (yα0) yield

. Thus, the second term of τn1 is approximated with an

error (uniform in yα) by

which, by substituting

, is given by

where $(K*K)(\cdot)$ is a convolution kernel as defined before. Hence, by letting $b_\alpha(y_\alpha)$ summarize the two bias terms appearing in (A.13) and (A.14), Lemma A.3(b) is shown. The uniform convergence results in part (a) then follow by the standard arguments of Masry (1996), because the two stochastic terms in the asymptotic expansion of

consist only of univariate kernels. █

REFERENCES

Andrews, D.W.K. (1994) Empirical process methods in econometrics. In R.F. Engle & D. McFadden (eds.), Handbook of Econometrics, vol. IV, pp. 2247–2294. North-Holland.
Auestad, B. & D. Tjøstheim (1990) Identification of nonlinear time series: First order characterization and order estimation. Biometrika 77, 669–687.
Avramidis, P. (2002) Local maximum likelihood estimation of volatility function. Manuscript, LSE.
Breiman, L. & J.H. Friedman (1985) Estimating optimal transformations for multiple regression and correlation (with discussion). Journal of the American Statistical Association 80, 580–619.
Buja, A., T. Hastie, & R. Tibshirani (1989) Linear smoothers and additive models (with discussion). Annals of Statistics 17, 453–555.
Cai, Z. & E. Masry (2000) Nonparametric estimation of additive nonlinear ARX time series: Local linear fitting and projections. Econometric Theory 16, 465–501.
Carrasco, M. & X. Chen (2002) Mixing and moment properties of various GARCH and stochastic volatility models. Econometric Theory 18, 17–39.
Chen, R. (1996) A nonparametric multi-step prediction estimator in Markovian structures. Statistica Sinica 6, 603–615.
Chen, R. & R.S. Tsay (1993a) Nonlinear additive ARX models. Journal of the American Statistical Association 88, 955–967.
Chen, R. & R.S. Tsay (1993b) Functional-coefficient autoregressive models. Journal of the American Statistical Association 88, 298–308.
Engle, R.F. (1982) Autoregressive conditional heteroscedasticity with estimates of the variance of U.K. inflation. Econometrica 50, 987–1008.
Fan, J. & Q. Yao (1996) Efficient estimation of conditional variance functions in stochastic regression. Biometrika 85, 645–660.
Gozalo, P. & O. Linton (2000) Local nonlinear least squares: Using parametric information in nonparametric regression. Journal of Econometrics 99(1), 63–106.
Hall, P. & C. Heyde (1980) Martingale Limit Theory and Its Application. Academic Press.
Härdle, W. (1990) Applied Nonparametric Regression. Econometric Monograph Series 19. Cambridge University Press.
Härdle, W. & A.B. Tsybakov (1997) Locally polynomial estimators of the volatility function. Journal of Econometrics 81, 223–242.
Härdle, W., A.B. Tsybakov, & L. Yang (1998) Nonparametric vector autoregression. Journal of Statistical Planning and Inference 68(2), 221–245.
Härdle, W. & P. Vieu (1992) Kernel regression smoothing of time series. Journal of Time Series Analysis 13, 209–232.
Hastie, T. & R. Tibshirani (1987) Generalized additive models: Some applications. Journal of the American Statistical Association 82, 371–386.
Hastie, T. & R. Tibshirani (1990) Generalized Additive Models. Chapman and Hall.
Horowitz, J. (2001) Estimating generalized additive models. Econometrica 69, 499–513.
Jones, M.C., S.J. Davies, & B.U. Park (1994) Versions of kernel-type regression estimators. Journal of the American Statistical Association 89, 825–832.
Kim, W., O. Linton, & N. Hengartner (1999) A computationally efficient oracle estimator of additive nonparametric regression with bootstrap confidence intervals. Journal of Computational and Graphical Statistics 8, 1–20.
Linton, O.B. (1996) Efficient estimation of additive nonparametric regression models. Biometrika 84, 469–474.
Linton, O.B. (2000) Efficient estimation of generalized additive nonparametric regression models. Econometric Theory 16, 502–523.
Linton, O.B. & W. Härdle (1996) Estimating additive regression models with known links. Biometrika 83, 529–540.
Linton, O.B. & J. Nielsen (1995) A kernel method of estimating structured nonparametric regression based on marginal integration. Biometrika 82, 93–100.
Linton, O.B., J. Nielsen, & S. van de Geer (2003) Estimating multiplicative and additive hazard functions by kernel methods. Annals of Statistics 31, 464–492.
Linton, O.B., N. Wang, R. Chen, & W. Härdle (1995) An analysis of transformation for additive nonparametric regression. Journal of the American Statistical Association 92, 1512–1521.
Mammen, E., O.B. Linton, & J. Nielsen (1999) The existence and asymptotic properties of a backfitting projection algorithm under weak conditions. Annals of Statistics 27, 1443–1490.
Masry, E. (1996) Multivariate local polynomial regression for time series: Uniform strong consistency and rates. Journal of Time Series Analysis 17, 571–599.
Masry, E. & D. Tjøstheim (1995) Nonparametric estimation and identification of nonlinear ARCH time series: Strong convergence and asymptotic normality. Econometric Theory 11, 258–289.
Masry, E. & D. Tjøstheim (1997) Additive nonlinear ARX time series and projection estimates. Econometric Theory 13, 214–252.
Nelson, D.B. (1991) Conditional heteroskedasticity in asset returns: A new approach. Econometrica 59, 347–370.
Newey, W.K. (1994) Kernel estimation of partial means. Econometric Theory 10, 233–253.
Opsomer, J.D. & D. Ruppert (1997) Fitting a bivariate additive model by local polynomial regression. Annals of Statistics 25, 186–211.
Pollard, D. (1990) Empirical Processes: Theory and Applications. CBMS Conference Series in Probability and Statistics, vol. 2. Institute of Mathematical Statistics.
Robinson, P.M. (1983) Nonparametric estimation for time series models. Journal of Time Series Analysis 4, 185–208.
Silverman, B.W. (1986) Density Estimation for Statistics and Data Analysis. Chapman and Hall.
Stein, E.M. (1970) Singular Integrals and Differentiability Properties of Functions. Princeton University Press.
Stone, C.J. (1985) Additive regression and other nonparametric models. Annals of Statistics 13, 685–705.
Stone, C.J. (1986) The dimensionality reduction principle for generalized additive models. Annals of Statistics 14, 592–606.
Tjøstheim, D. & B. Auestad (1994) Nonparametric identification of nonlinear time series: Projections. Journal of the American Statistical Association 89, 1398–1409.
Volkonskii, V. & Y. Rozanov (1959) Some limit theorems for random functions. Theory of Probability and Its Applications 4, 178–197.
Yang, L., W. Härdle, & J. Nielsen (1999) Nonparametric autoregression with multiplicative volatility and additive mean. Journal of Time Series Analysis 20, 579–604.
Ziegelmann, F. (2002) Nonparametric estimation of volatility functions: The local exponential estimator. Econometric Theory 18, 985–992.