We introduce a nonparametric regression estimator that is consistent in the presence of measurement error in the explanatory variable when one repeated observation of the mismeasured regressor is available. The approach taken relies on a useful property of the Fourier transform, namely, its ability to convert complicated integral equations into simple algebraic equations. The proposed estimator is shown to be asymptotically normal, and its rate of convergence in probability is derived as a function of the smoothness of the densities and conditional expectations involved. The resulting rates are often comparable to those of kernel deconvolution estimators, which provide consistent estimation under the much stronger assumption that the density of the measurement error is known. The finite-sample properties of the estimator are investigated through Monte Carlo experiments.

This work was made possible in part through financial support from the National Science Foundation via grant SES-0214068. The author is grateful to the referees and the co-editor for their helpful comments.
The bias resulting from the presence of measurement error in the explanatory variables is a common problem in regression analysis. Although numerous solutions to this problem have been derived for parametric regression models, the corresponding problem in nonparametric specifications has remained relatively unexplored.
Some aspects of the nonparametric errors-in-variables problem have been previously investigated. The problem of estimating the density of an unobserved variable when this variable is measured with error and when the density of the error is known has received considerable attention in the literature. In this setting, the so-called kernel deconvolution estimator (for a review of the extensive literature, see, e.g., Carroll and Hall, 1988; Liu and Taylor, 1989; Carroll, Ruppert, and Stefanski, 1995) has been shown to reach the optimal rate of convergence (Fan, 1991b). The problem of the nonparametric estimation of a regression function when the independent variable is measured with an error drawn from a known distribution has also been studied. In this case, a kernel regression estimator based on kernel deconvolution is known to achieve optimal convergence rates (Fan and Truong, 1993). A more challenging problem is the estimation of densities and regression functions when the independent variable is measured with an error drawn from an unknown distribution. Thanks to an identity due to Kotlarski (see Rao, 1992, p. 21), the identification of the density of an unobserved random variable is possible when the joint density of two error-contaminated measurements of that variable is known. Li and Vuong (1998) show that the empirical version of this identity leads to a consistent estimator with known convergence rates.
In contrast to the nonparametric density estimation problem, the nonparametric estimation of conditional expectations under similar conditions has so far remained unsolved. This is the gap our paper intends to fill by extending the traditional Nadaraya–Watson kernel regression estimator to allow for the independent variable to be contaminated with an error of unknown distribution. We show that the availability of two error-contaminated measurements of the independent variable is all that is needed to achieve identification. The usefulness of this result stems from the observation that although distributional assumptions are often not appropriate in applications, thus precluding the use of kernel deconvolution estimators, repeated measurements can frequently be found in data sets (Ashenfelter and Krueger, 1994; Hausman, Newey, and Powell, 1995; Morey and Waldman, 1998; Bowles, 1972; Borus and Nestel, 1973; Freeman, 1984).¹
¹Freeman's data set (the January 1977 Employer–Employee Matched Sample, Current Population Survey) contains wages reported by employers and employees, which are perfect examples of repeated measurements.
Our analysis not only derives the convergence rate of the proposed estimator but also provides its asymptotic distribution. The asymptotic properties of the estimator are analyzed through various analytical examples, and its finite-sample properties are investigated through Monte Carlo simulations that illustrate the bias-correcting power of our estimator. All proofs can be found in the Appendix.
To understand the difficulties faced in nonparametric estimation in the presence of measurement error, it is instructive to recall the well-known solution to the simpler problem of finding the density of an unobserved variable x* given an imperfect measurement z (for a review, see Carroll et al., 1995):

z = x* + Δz.    (1)

The measurement error Δz is usually assumed to be independent of x* and to be drawn from a known density. It is well known that the density of z is given by the convolution of the density of x* with the density of Δz. Thanks to the convolution theorem, this relationship can be concisely expressed using characteristic functions:

m(ν) = φ(ν)ψ(ν),    (2)

where φ(ν), m(ν), and ψ(ν), respectively, denote the characteristic functions of x*, z, and Δz. We can therefore identify the characteristic function of interest, φ(ν), through

φ(ν) = m(ν)/ψ(ν),    (3)
where m(ν) can be estimated by the Fourier transform of a nonparametric estimator of the density of z, such as a kernel estimator. The problem with this procedure arises from the fact that, under mild assumptions (such as assuming that the density of Δz is continuous), ψ(ν) vanishes as ν → ∞, so that this operation is not well defined for all ν. Hence, merely replacing m(ν) by a consistent estimate m̂(ν) may not yield a consistent estimate of φ(ν), because small errors in m̂(ν) are magnified by the arbitrarily large factor 1/ψ(ν). This is the well-known ill-posed inverse problem that occurs when one tries to invert a convolution operation. The so-called kernel deconvolution estimator (Carroll et al., 1995; Fan, 1991b) addresses this problem by estimating m(ν) using a kernel whose Fourier transform, κ(ν), is compactly supported. This ensures that the estimated characteristic function m̂(ν) is also compactly supported, which in turn guarantees that the numerator of equation (3) will vanish well before the denominator causes the ratio to diverge.
It is clear that truncating the characteristic function of z in this fashion introduces a bias. To obtain a consistent estimator, the support of κ(ν) is allowed to expand as sample size grows in such a way that the total integrated noise over all frequencies in the support of κ(ν) decreases. The faster ψ(ν) → 0 as ν → ∞, the more slowly the support of κ(ν) can expand with sample size and the slower the convergence rate. This is the fundamental difficulty associated with nonparametric estimation in the presence of measurement error. As the smoothness of the density of the measurement error increases, the characteristic function ψ(ν) goes to zero increasingly rapidly as ν → ∞ and the convergence rate worsens. The smoothness of the density of x* also plays a role in determining the convergence rate. The bias introduced by the truncation of m(ν) at a finite frequency is governed by the rate of decay of φ(ν) as ν → ∞. The smoother the density of x*, the faster its Fourier transform φ(ν) decays as ν → ∞, and the faster the bias decreases as the kernel bandwidth shrinks.
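For concreteness, the following minimal sketch implements this kernel deconvolution density estimator under the assumption of a known normal measurement error; the grid size, the sinc-type kernel (whose Fourier transform is the indicator of [−1,1]), and all function names are illustrative choices rather than the paper's notation.

```python
import numpy as np

def deconvolution_density(z, x_grid, h, sigma_dz):
    """Kernel deconvolution estimate of the density of x* from z = x* + dz,
    assuming dz ~ N(0, sigma_dz^2) is known and using a kernel whose Fourier
    transform kappa is 1 on [-1, 1], so kappa(h*nu) restricts |nu| <= 1/h."""
    nu = np.linspace(-1.0 / h, 1.0 / h, 513)          # frequency grid
    dnu = nu[1] - nu[0]
    # Empirical characteristic function m_hat(nu) = n^-1 sum_j exp(i nu z_j).
    m_hat = np.exp(1j * np.outer(nu, z)).mean(axis=1)
    # Characteristic function psi(nu) of the known error density.
    psi = np.exp(-0.5 * sigma_dz**2 * nu**2)
    phi_hat = m_hat / psi                             # equation (3), truncated
    # Fourier inversion: f_hat(x) = (2 pi)^-1 int phi_hat(nu) e^{-i nu x} dnu.
    return (np.exp(-1j * np.outer(x_grid, nu)) @ phi_hat).real * dnu / (2 * np.pi)
```

Shrinking h too quickly makes 1/ψ(ν) blow up on the frequency grid, which is precisely the ill-posedness discussed above.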
The literature focusing on kernel deconvolution estimators typically describes the smoothness of a density in terms of the asymptotic rate of decay of its Fourier transform as frequency ν goes to infinity. The basis for such a description is that the number of continuous derivatives a density admits is directly related to the asymptotic behavior of its Fourier transform as ν → ∞. This leads to the traditional distinction between “ordinarily smooth” functions (which admit a finite number of continuous derivatives and whose Fourier transform decays as |ν|^γ, γ < 0) and “supersmooth” functions (which admit an infinite number of continuous derivatives and whose Fourier transform decays as exp(α|ν|^β), α < 0, β > 0). Examples of ordinarily smooth densities are the gamma, uniform, and double exponential densities; the normal and Cauchy densities are supersmooth.
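As a concrete illustration (writing ψ for the characteristic function), consider two distributions that reappear in the Monte Carlo section; the Laplace form below follows from the scale normalization Var = σ²:

ψ(ν) = 1/(1 + σ²ν²/2) ∼ |ν|^{−2}    (Laplace L(0,σ²), ordinarily smooth),
ψ(ν) = exp(−(σ²/2)ν²)              (normal N(0,σ²), supersmooth).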
The kernel deconvolution estimator exhibits a wide variety of convergence rates depending on the smoothness of the densities involved. Whenever the densities of x* and of Δz are ordinarily smooth, the kernel deconvolution estimator will exhibit a rate of convergence of the form n^{−c} for some c > 0, where n is the sample size. The situation degrades significantly when the density of Δz is supersmooth while the density of x* remains ordinarily smooth. The convergence rate is then of the form (ln n)^{−c} for some c > 0, which is slower than any negative power of n.
The problem solved in this paper is more challenging than the one described above. First, we focus on a kernel regression estimator rather than a kernel density estimator. Second, we assume the density of the measurement error to be unknown.
Our task is to find a function ĝ(x̄*) such that plim_{n→∞} ĝ(x̄*) = g(x̄*) ≡ E[y | x* = x̄*]. We consider x* a scalar to simplify the exposition, although a multivariate extension is clearly possible.²

²As in any nonparametric regression, the well-known “curse of dimensionality” of course limits the number of dimensions that can be handled in practice.

In the absence of measurement error, this could be accomplished with the Nadaraya–Watson kernel estimator, evaluated at a given point x̄*,

Σ_{l=1}^{n} y_l K_h(x̄* − x_l*) / Σ_{l=1}^{n} K_h(x̄* − x_l*),

where x_l* and y_l for l = 1,…,n denote the data points and the kernel K_h(·) is of the form

K_h(x*) = h^{−1} K(h^{−1} x*)
and h is the bandwidth parameter. The problem we are facing is that x* is not observed. As shown in Schennach (2004), the availability of two repeated measurements of x*,

x = x* + Δx  and  z = x* + Δz,

provides enough information to identify any moment of the form E[u(y,x*)] for any function u(y,x*). Because the probability limit (at constant bandwidth h) of the Nadaraya–Watson kernel estimator is the ratio

E[y K_h(x̄* − x*)] / E[K_h(x̄* − x*)],

a similar technique can be applied here, setting u(y,x*) = y^k K_h(x̄* − x*), for k = 0,1. The extension of the existing results to a nonparametric setting nevertheless requires additional steps to handle the fact that we need to characterize an infinite family of moments, indexed by x̄* ∈ ℝ. Fortunately, this complication can be elegantly handled by observing that the convolution operations involved in computing the Nadaraya–Watson estimator are converted into simple products through the Fourier transform operation, enabling the whole family of moments to be estimated in a single operation. The formal result that permits identification is summarized in the following set of assumptions and associated theorem. Throughout the paper, we take the convention that integrals without explicit bounds are taken over the whole real line.
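Before turning to the assumptions, it is worth making the conversion into products explicit. Each of these moments is a convolution evaluated at x̄*, so that (in the notation introduced below, with f(x*) the density of x*):

Φ_k(x̄*) ≡ E[y^k K_h(x̄* − x*)] = ∫ E[y^k | x*] f(x*) K_h(x̄* − x*) dx* = (r_k * K_h)(x̄*),  where r_k(x*) ≡ E[y^k | x*] f(x*),

and hence, by the convolution theorem,

∫ Φ_k(x̄*) e^{iξx̄*} dx̄* = φ_k(ξ) κ(hξ),  with φ_k(ξ) ≡ E[y^k e^{iξx*}],

where φ_k(ξ) and κ(hξ) are the Fourier transforms of r_k and K_h, respectively.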
Assumption 1. Δz and x* are mutually independent.
Assumption 2. E[|x*|], E[|Δx|], and E[|y|] are finite.
Assumption 3. E[|y^k K_h(x̄* − x*)|] < ∞ for all x̄* ∈ ℝ, any h > 0, and k = 0,1.
THEOREM 1. Under Assumptions 1–3, and provided |E[e^{iξz}]| > 0 for any finite ξ, the function

Φ_k(x̄*) ≡ E[y^k K_h(x̄* − x*)],    (12)

for x̄* ∈ ℝ and k = 0,1, can be expressed solely in terms of moments that involve the observable variables y, x, and z:

Φ_k(x̄*) = (2π)^{−1} ∫ κ(hξ) φ_k(ξ) e^{−iξx̄*} dξ,    (13)

where φ_k(ξ) ≡ E[y^k exp(iξx*)] is given by³

φ_0(ξ) = exp( ∫_0^ξ i m_x(ζ)/m_1(ζ) dζ ),    (14)

φ_1(ξ) = (m_y(ξ)/m_1(ξ)) φ_0(ξ),    (15)

where κ(ξ) ≡ ∫ K(x*) e^{iξx*} dx* is the Fourier transform of the kernel K(x*) and, for a = 1, x, y,

m_a(ξ) ≡ E[a e^{iξz}].    (16)

³Equation (14) is similar to an identity derived by Kotlarski (see Rao, 1992, p. 21), but our proof of this result requires weaker independence assumptions. In particular, we do not require independence between Δx and x* and between Δx and Δz.

Note that knowledge of the moments m_a(ξ), for a = 1, x, y, which involve observable variables only, is sufficient to identify Φ_k(x̄*). Because the moments m_a(ξ) can be estimated from the corresponding sample averages, we propose the following estimator.
DEFINITION 1. Let (x_i, y_i, z_i), for i = 1,…,n, denote a sample of size n. For a given x̄* ∈ ℝ and some sequence of bandwidths h_n → 0, let

ĝ(x̄*) ≡ Φ̂_1(x̄*)/Φ̂_0(x̄*),    (17)

where, for k = 0,1,

Φ̂_k(x̄*) = (2π)^{−1} ∫ κ(h_n ξ) φ̂_k(ξ) e^{−iξx̄*} dξ,    (18)

with

φ̂_0(ξ) = exp( ∫_0^ξ i m̂_x(ζ)/m̂_1(ζ) dζ ),    (19)

φ̂_1(ξ) = (m̂_y(ξ)/m̂_1(ξ)) φ̂_0(ξ),    (20)

and where, for a = 1, x, y,

m̂_a(ξ) ≡ n^{−1} Σ_{i=1}^{n} a_i e^{iξz_i}.    (21)
An interesting property of this estimator is that it reduces to the Nadaraya–Watson estimator in the absence of measurement error (i.e., when z = x = x*). Indeed, in that case, i m̂_x(ζ)/m̂_1(ζ) = (dm̂_1(ζ)/dζ)/m̂_1(ζ), and equation (19) can be integrated analytically to yield φ̂_0(ξ) = m̂_1(ξ), thus implying that equation (20) becomes φ̂_1(ξ) = m̂_y(ξ). With these equalities in mind, equation (18) then defines the Fourier representation of the numerator and the denominator of the Nadaraya–Watson estimator.
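As an illustration, here is a minimal numerical sketch of the estimator in Definition 1 as reconstructed above; the frequency grid, the trapezoid integration, and the sinc-type kernel (κ = 1 on [−1,1], so that κ(h_nξ) simply truncates the integrals at |ξ| = 1/h_n) are illustrative implementation choices, not the paper's prescription.

```python
import numpy as np

def g_hat(y, x, z, x_bar, h, n_grid=1001):
    """Sketch of Definition 1: g_hat(x_bar) = Phi_1/Phi_0, using a kernel
    whose Fourier transform is 1 on [-1, 1] and 0 elsewhere."""
    xi = np.linspace(-1.0 / h, 1.0 / h, n_grid)   # odd length, so grid contains 0
    dxi = xi[1] - xi[0]
    e = np.exp(1j * np.outer(xi, z))              # e^{i xi z_i}
    m1 = e.mean(axis=1)                           # m_hat_1(xi), equation (21)
    mx = (e * x).mean(axis=1)                     # m_hat_x(xi)
    my = (e * y).mean(axis=1)                     # m_hat_y(xi)
    integrand = 1j * mx / m1
    # Antiderivative by the trapezoid rule, anchored so the integral starts at 0.
    F = np.concatenate(([0.0], np.cumsum(0.5 * (integrand[:-1] + integrand[1:]) * dxi)))
    F -= F[n_grid // 2]
    phi0 = np.exp(F)                              # equation (19)
    phi1 = (my / m1) * phi0                       # equation (20)
    # Phi_hat_k = (2 pi)^-1 int kappa(h xi) phi_hat_k(xi) e^{-i xi x_bar} dxi.
    w = np.exp(-1j * xi * x_bar) * dxi / (2.0 * np.pi)
    return (w @ phi1).real / (w @ phi0).real      # equations (17)-(18)
```

Note how small values of |m̂_1(ζ)| on the grid, which occur for very small bandwidths, produce the near division by zero mentioned in the Monte Carlo section.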
To ensure that the proposed estimator is well behaved, we need to make the following assumption.
Assumption 4. The Fourier transform of the kernel, κ(ξ), is (i) bounded and (ii) compactly supported (without loss of generality, we consider the support to be [−1,1]).
The boundedness of κ(ξ) is a very weak requirement because any kernel K(z) violating it would necessarily fail to be absolutely integrable. The assumption of compact support of κ(ξ) is commonly made in the derivation of the asymptotic properties of kernel deconvolution estimators (Fan and Truong, 1993). The need for this assumption arises from the fact that the estimator involves a division by an asymptotically vanishing characteristic function. Under very mild smoothness requirements, characteristic functions decay to zero as frequency increases toward infinity. A compactly supported kernel (in Fourier representation) explicitly makes the frequency range considered in a given sample finite, ensuring that the divergence is kept under control.
The restriction of compact support (in Fourier representation) poses few problems in practice, because one can take any given kernel K(x*) and construct a modified kernel K̄(x*) that exhibits most of the properties of the original kernel, while possessing a compact support in Fourier representation. This is achieved by computing the Fourier transform κ(ξ) of the original kernel K(x*) and multiplying it by a “windowing” function W(ξ) that vanishes beyond a given frequency:

κ̄(ξ) = κ(ξ) W(ξ).

Judicious choice of a windowing function will ensure that the modified kernel K̄(x*) keeps most of the properties of the original kernel. For instance, a windowing function such as the one given in equation (23), which is identically 1 for |ξ| ≤ ξ̄, decreases smoothly, and vanishes for |ξ| ≥ 1, will leave the order of the kernel unaffected, because the windowing function is constant in the neighborhood of the origin. The fact that this windowing function is infinitely many times differentiable will guarantee that the modified kernel K̄(x*) decays faster than any power of x* as |x*| → ∞ (provided that the original kernel K(x*) had this property).
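A concrete C^∞ window with exactly these properties can be built from the standard bump-function construction; the following sketch is offered as an illustrative stand-in, since the exact form of the paper's equation (23) is not reproduced here:

```python
import numpy as np

def smooth_window(xi, xi_bar=0.5):
    """C-infinity window: equals 1 for |xi| <= xi_bar, 0 for |xi| >= 1, with an
    infinitely differentiable monotone transition in between (bump-function
    partition of unity; one of many valid choices)."""
    a = np.abs(np.asarray(xi, dtype=float))
    out = np.zeros_like(a)
    out[a <= xi_bar] = 1.0
    mid = (a > xi_bar) & (a < 1.0)
    t = (a[mid] - xi_bar) / (1.0 - xi_bar)        # rescaled to (0, 1)
    f = lambda s: np.exp(-1.0 / s)                # f and all derivatives -> 0 at 0+
    out[mid] = f(1.0 - t) / (f(t) + f(1.0 - t))
    return out
```

Multiplying the Fourier transform of a given kernel by this window yields a kernel satisfying Assumption 4 while leaving the kernel's order unaffected.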
This section is organized as follows. To facilitate the analysis of the asymptotic properties of the proposed estimator ĝ(x̄*), we first provide a linear representation of this estimator, denoted ĝ_L(x̄*), that will be shown to be asymptotically equivalent to ĝ(x̄*). This linearization serves two purposes. First, it will enable the derivation of the convergence rate of the estimator using techniques that are analogous to the standard bias and variance decomposition used in the context of conventional kernel estimators. Second, a linear representation is essential to establish the asymptotic normality of the estimator.
In this section, we provide very general results that summarize the properties of a linearized estimator ĝ_L(x̄*) that will be used to establish the asymptotic properties of ĝ(x̄*). The form of the estimator prompts two levels of linearization. First, as is commonly done in the analysis of nonparametric conditional expectation kernel estimators, the ratio Φ̂_1(x̄*)/Φ̂_0(x̄*) in equation (17) is expanded in a Taylor series up to first order. Second, unlike the usual Nadaraya–Watson estimator and kernel deconvolution estimators, the Φ̂_k(x̄*) themselves take the form of nonlinear functionals of the data generating process. It is thus convenient to carry out the linearization a step further by calculating the Fréchet derivative of Φ̂_k(x̄*) with respect to the estimated moments m̂_a(ξ) in the vicinity of the true moments m_a(ξ).⁴ The following definition gives a linearized version ĝ_L(x̄*) of the estimator ĝ(x̄*).

⁴The calculation of the Fréchet derivative can be found in the proof of Lemma 2 in the Appendix.
DEFINITION 2. For x̄* ∈ ℝ, let ĝ_L(x̄*) denote the linearization of ĝ(x̄*) obtained by expanding the ratio in equation (17) in a Taylor series up to first order around (Φ_1(x̄*), Φ_0(x̄*)) and by replacing each Φ̂_k(x̄*), for k = 0,1, by its Fréchet-linear approximation in the estimated moments m̂_a(ξ) around the true moments m_a(ξ), where Φ_k(x̄*) is given by equation (13).
The advantage of the linear representation provided by Definition 2 is that it is possible to decompose the error ĝ_L(x̄*) − g(x̄*) into well-defined “bias” and “variance” terms, as given by Lemma 1, which follows.
Assumption 5. (y_i, x_i, z_i, x_i*, Δy_i, Δx_i, Δz_i) for i = 1,…,n is an independent and identically distributed (i.i.d.) sequence.

Assumption 6. E[y^{2−j}|z|^{j}] < ∞ and E[x^{2−j}|z|^{j}] < ∞ for j = 0,1.

Assumption 7. The density of x* is nonzero at x̄*.
LEMMA 1. Under Assumptions 1–7, for x̄* ∈ ℝ,

ĝ_L(x̄*) − g(x̄*) = B(x̄*) + V(x̄*),    (24)

where the bias term B(x̄*) ≡ E[ĝ_L(x̄*)] − g(x̄*) and the variance term V(x̄*) ≡ ĝ_L(x̄*) − E[ĝ_L(x̄*)] are given explicitly in equations (30)–(32): V(x̄*) is expressed in equation (30) in terms of the linearized Φ̂_k(x̄*) and the Φ_k(x̄*), and Var[V(x̄*)] is expressed in equations (31) and (32) in terms of the covariances Ω_{k_1 k_2}(ξ_1, ξ_2), for k_1, k_2 = 0,1, of the estimated moments, with the weighting functions U_a^{k}(ζ, x̄*, h_n) defined in equation (26) and Ω_{k_1 k_2}(ξ_1, ξ_2) in equation (27), where ĝ_L(x̄*) is given in Definition 2, where φ_k(ξ) for k = 0,1 is given in Theorem 1, and where † denotes complex conjugation.
Under our assumptions, the expectation and the variance of ĝ_L(x̄*) are well defined, even though the corresponding moments of ĝ(x̄*) may not exist. As long as the remainder ĝ(x̄*) − ĝ_L(x̄*) can be shown to be asymptotically negligible in probability, the mean and the variance of ĝ_L(x̄*) can be interpreted as the mean and the variance of the limiting distribution of ĝ(x̄*), whether or not the first two moments of ĝ(x̄*) are bounded. This situation is not unique, as these observations apply to any estimator involving ratios of random quantities. To ascertain that the linear approximation ĝ_L(x̄*) is appropriate, the following lemma provides the order of the remainder of the linearization of ĝ(x̄*) and also the order of the statistical fluctuations in ĝ_L(x̄*). This result is included for completeness, but it is not essential for the reader to master it to understand the main results of the subsequent sections.
LEMMA 2. Let Assumptions 1–7 hold and let, for φ_0(ζ), φ_1(ζ), and m_1(ζ) as in Theorem 1, U(h_n) and λ(h_n) denote the bounding sequences defined in equations (34) and (35) in terms of these functions and of φ_0′(ξ) ≡ dφ_0(ξ)/dξ.⁵ If (i) h_n → 0, (ii) U(h_n)n^{−1/2} → 0, and (iii) λ(h_n)n^{−1/2+ε} → 0 for some ε > 0, then

V(x̄*) = O_p(U(h_n)n^{−1/2})    (36)

and

ĝ(x̄*) − ĝ_L(x̄*) = O_p(U(h_n)n^{−1/2} λ(h_n)n^{−1/2+ε}).    (37)

If, in addition, (iv) U(h_n)n^{−1/2}λ(h_n)n^{−1/2+ε} = o((Var[V(x̄*)])^{1/2}), then

(ĝ(x̄*) − g(x̄*) − B(x̄*))/(Var[V(x̄*)])^{1/2} has the same limiting distribution as V(x̄*)/(Var[V(x̄*)])^{1/2}.    (38)

⁵Note that the ratio |φ_0′(ζ)|/|φ_0(ζ)| entering the definitions of λ(h_n) and U(h_n) can equivalently be written as |m_x(ζ)|/|m_1(ζ)|, because |m_x(ζ)|/|m_1(ζ)| = |E[x e^{iζz}]|/|E[e^{iζz}]| = |E[x* e^{iζz}]|/|E[e^{iζz}]| = (|E[x* e^{iζx*}]|/|E[e^{iζx*}]|)(|E[e^{iζΔz}]|/|E[e^{iζΔz}]|) = |E[x* e^{iζx*}]|/|E[e^{iζx*}]| = |φ_0′(ζ)|/|φ_0(ζ)|.
The quantity U(h_n) is defined so that it bounds any of the quantities defined in equation (26) that enter the expression of the asymptotic variance of the estimator, whereas λ(h_n) bounds the remainder terms from the linearization performed in Definition 2. As expected, the preceding stochastic expansion is written in terms of successive powers of n^{−1/2}, with the exception that the second term is proportional to n^{−1+ε} instead of n^{−1}, because bounding the second remainder term involves uniformly bounding various random functions, which slows the rate down by a factor n^{ε}.
In the proof of our convergence rate and asymptotic normality results, we will subsequently verify that the hypotheses of Lemma 2 are implied by more primitive regularity conditions. The first conclusion of the lemma (equations (36) and (37)) will be sufficient to obtain the convergence rate of the estimator. Indeed, if it can be shown that λ(h_n)n^{−1/2+ε} → 0, the convergence rate is then simply given by O_p(U(h_n)n^{−1/2}). Because O_p(U(h_n)n^{−1/2}) is an upper bound on the convergence rate, which may or may not be binding, the second, slightly stronger conclusion of Lemma 2 (equation (38)) will be needed to obtain the limiting distribution of the estimator. The basic intuition behind the additional condition (iv) is that, for the O_p(U(h_n)n^{−1/2}λ(h_n)n^{−1/2+ε}) nonlinear remainder to have no effect on the limiting distribution, it must be asymptotically negligible relative to the exact standard deviation of ĝ_L(x̄*), which is given by (Var[V(x̄*)])^{1/2}, by Lemma 1.
We now provide primitive regularity conditions that will enable us to derive explicit convergence rates. These regularity conditions take the form of smoothness restrictions imposed via constraints on the tail behavior of various Fourier transforms. To specify the regularity conditions, we employ the following convenient notation.
DEFINITION 3. An expression of the form f(ζ) ≼ g(ζ) for ζ ∈ ℝ indicates that there exists a constant C > 0, independent of ζ, such that f(ζ) ≤ Cg(ζ) for all ζ ∈ ℝ (and similarly for ≽). Analogously, a_n ≼ b_n for two sequences a_n, b_n indicates that there exists a constant C independent of n such that a_n ≤ Cb_n for all n ∈ ℕ.
The literature focusing on “kernel deconvolution estimators” (see, e.g., Carroll et al., 1995) and related estimators (Fan and Truong, 1993) traditionally distinguishes between “ordinarily smooth” functions (whose Fourier transform decays as |ζ|^γ, γ < 0 as |ζ| → ∞) and “supersmooth” functions (whose Fourier transform decays as exp(α|ζ|^β), α < 0, β > 0 as |ζ| → ∞). For conciseness, our regularity conditions are given in terms of expressions of the form (1 + |ζ|)^γ exp(α|ζ|^β), thereby simultaneously covering the ordinarily smooth and supersmooth cases.
Assumption 8. The functions φ_0(ζ) = E[e^{iζx*}], φ_0′(ζ) ≡ dφ_0(ζ)/dζ, φ_1(ζ) = E[y e^{iζx*}], and m_1(ζ) = E[e^{iζz}] satisfy

|φ_0′(ζ)|/|φ_0(ζ)| ≼ (1 + |ζ|)^{γ_r}    (39)

for some γ_r ≥ 0 and

|φ_k(ζ)| ≼ (1 + |ζ|)^{γ_φ} exp(α_φ|ζ|^{β_φ})  for k = 0,1,    (40)

|m_1(ζ)| ≽ (1 + |ζ|)^{γ_m} exp(α_m|ζ|^{β_m}),    (41)

for some α_φ ≤ 0, β_φ ≥ 0, α_m ≤ 0, β_m ≥ 0, γ_φ, and γ_m such that γ_φ β_φ ≥ 0 and γ_m β_m ≥ 0.
A few remarks are in order. While the rate of decay of φ0(ζ), the characteristic function of x*, is entirely determined by the smoothness of the density f (x*) of x*, the rate of decay of φ1(ζ) is governed by the smoothness of f (x*)E [y|x*]. Verifying equation (40) would first involve finding bounds on |φ0(ζ)| and |φ1(ζ)| individually before taking the most slowly decaying term. Regrouping φ0(ζ) and φ1(ζ) in a single assumption is possible without loss of generality, because both quantities enter the expression of the estimator in a similar fashion. This grouping is also notationally convenient, as it will reduce the number of independent orders of magnitude that have to be considered when determining the convergence rates of the estimator.
As is always the case in deconvolution-type estimators, one quantity (here m_1(ζ)) needs to be bounded below (in equation (41)), instead of above, because it appears in a denominator in the expression of the estimator. Note that equation (41) is implied by separate lower bounds on the modulus of the characteristic functions of x* and Δz, because m_1(ζ) = E[e^{iζz}] = E[e^{iζx*}]E[e^{iζΔz}]. The grouping of E[e^{iζx*}] and E[e^{iζΔz}] is also aimed at reducing the notational burden. Although the constraint on the ratio φ_0′(ζ)/φ_0(ζ) imposed by equation (39) may appear unusual, it is clearly implied by a more familiar upper bound on |φ_0′(ζ)| combined with a lower bound on |φ_0(ζ)|. The absence of a term of the form exp(α_r|ζ|^{β_r}) in equation (39) results in very little loss of generality, because all common ordinarily smooth and supersmooth functions are such that equation (39) holds for γ_r = 1.
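For instance, equation (39) with γ_r = 1 can be verified directly for a standard normal and a standard Cauchy x*:

φ_0(ζ) = exp(−ζ²/2)  ⟹  |φ_0′(ζ)|/|φ_0(ζ)| = |ζ| ≤ (1 + |ζ|)¹,
φ_0(ζ) = exp(−|ζ|)   ⟹  |φ_0′(ζ)|/|φ_0(ζ)| = 1  ≤ (1 + |ζ|)¹.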
Before we can derive the convergence rate of the estimator, we also need to characterize the type of kernel K(x*) used. While most studies of measurement error in nonparametric settings focus either on finite-order kernels (Fan, 1991b; Fan and Truong, 1993) or on infinite-order kernels (Politis and Romano, 1999; Li and Vuong, 1998), we will consider both finite- and infinite-order kernels. The traditional finite-order kernels we consider are defined in Assumption 9.
Assumption 9. ∫ K(x*) dx* = 1 and, for some integer γ_κ > 0, ∫ (x*)^{j} K(x*) dx* = 0 for j = 1,…,γ_κ − 1 and

∫ |x*|^{γ_κ} |K(x*)| dx* < ∞.    (44)
We also consider the following class of “infinite-order” kernels.
Assumption 10. The Fourier transform of the kernel, κ(ξ), is such that κ(ξ) = 1 for |ξ| ≤ ξ̄ for some ξ̄ > 0.
Assumption 10 allows for a kernel of the form

K(x*) = sin(x*)/(πx*),

which is particularly suited to the Fourier representation because its Fourier transform is 1 in the [−1,1] interval and zero elsewhere. This type of kernel has previously been used in other Fourier-based estimators (Li and Vuong, 1998) and amounts to truncating the Fourier transform above a given frequency. When both E[y|x*] and the density of x* are infinitely many times differentiable, an infinite-order kernel will guarantee that the bias goes to zero faster than any power of the bandwidth. The bias could then, for instance, be an exponentially decaying function of the inverse bandwidth h^{−1}.
The procedure to determine the asymptotic rates of pointwise convergence in probability can be outlined as follows.
Calculation of the bias B(x̄*). We distinguish two cases, depending on whether the kernel used satisfies Assumption 9 or Assumption 10. In the following two lemmas, recall that the parameters γ_φ, α_φ, and β_φ, defined in Assumption 8, describe the smoothness of the density f(x*) of x* and of the conditional expectation E[y|x*] by specifying that their Fourier transforms both decay at least as fast as (1 + |ζ|)^{γ_φ} exp(α_φ|ζ|^{β_φ}) as frequency ζ → ∞.
LEMMA 3. Under Assumptions 1–8, if the kernel is of order γ_κ, as defined by Assumption 9, then the bias satisfies

B(x̄*) = O((1 + h^{−1})^{γ_b} exp(α_b(h^{−1})^{β_b})),

where α_b = 0, β_b = 0, and γ_b = −γ, with γ = γ_κ if α_φ ≠ 0 and, if α_φ = 0, γ the largest integer such that γ ≤ γ_κ and γ < −γ_φ − 1.
LEMMA 4. Under Assumptions 1–8, if the kernel satisfies Assumption 10 for some constant ξ̄, then the bias satisfies

B(x̄*) = O((1 + h^{−1})^{γ_b} exp(α_b(h^{−1})^{β_b})),

where γ_b = γ_φ + 1, α_b = α_φ ξ̄^{β_φ}, and β_b = β_φ.
In short, when a finite-order kernel is used, the rate of decrease of the bias is controlled either by the order of the kernel γκ or by the smoothness of f (x*) and E [y|x*] , whichever is more limiting. In particular, when both f (x*) and E [y|x*] are supersmooth, so that αφ ≠ 0, it is the order of the kernel that determines the rate of decrease of the bias. When an infinite-order kernel is used, only the smoothness of f (x*) and E [y|x*] matters. Note that the bias term is identical to that of a traditional kernel estimator that would be used if x* were perfectly observed, because, via equations (12) and (13), the bias can be expressed entirely in terms of φk(ζ) for k = 0,1 and the kernel, which are nonrandom measurement error-free quantities.
Calculation of the order of the variance term V(x̄*).
LEMMA 5. Under Assumptions 1–8, the variance term satisfies

V(x̄*) = O_p((1 + h_n^{−1})^{γ_v} exp(α_v(h_n^{−1})^{β_v}) n^{−1/2}),

where γ_v = 2 + γ_φ − γ_m + γ_r, α_v = −α_m, and β_v = β_m.
Note that the order of the variance term is determined not only by the smoothness of f(x*) and E[y|x*] (through γ_φ, α_φ, β_φ, and γ_r) but also by the smoothness of the density of the measurement error Δz (through the terms γ_m, α_m, and β_m). It is important to point out that the variance term increases much faster as h → 0 (at constant n) than that of a standard kernel estimator with perfectly observed variables (whose variance term is O_p((h_n n)^{−1/2})). Combined with the fact that the bias term is unchanged, as indicated in step 1, this implies that the achievable convergence rates will generally be slower than for a conventional kernel estimator.
Determination of the rate of decrease of the bandwidth that offers the best trade-off between bias squared and variance. To obtain explicit rates of convergence, we need to distinguish various cases, based on the values of β_b, which characterizes the rate of convergence of the bias term as the bandwidth shrinks, and β_v, which characterizes the rate of divergence of the variance term as the bandwidth shrinks (at constant sample size). Both β_b and β_v represent an “exponent of supersmoothness,” that is, the constant β in an expression of the form (h_n^{−1})^γ exp(α(h_n^{−1})^β).
THEOREM 2. Under Assumptions 1–8 and either Assumption 9 or 10, the optimal bandwidth choices and the corresponding convergence rates in probability of the estimator can be expressed in terms of the constants α_b, β_b, γ_b, α_v, β_v, γ_v defined by Lemmas 3–5. Let ε > 0 be arbitrarily small, let C_1, C_2 be some positive constants, and let x̄* ∈ ℝ be given.
Case 1. If βv > βb > 0
Case 2. If βv > 0 and βb = 0 (with αb = 0 and γb < 0)
Case 3. If βb = βv ≠ 0
Case 4. If βb = βv = 0 (with αb = αv = 0 and γb < 0)
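As a worked illustration of the balancing that underlies Case 4, treat the upper bounds of Lemmas 3 and 5 as if they were exact orders (consistent with the bandwidth used in the proof of Theorem 2):

|B(x̄*)| ≍ (h_n^{−1})^{γ_b}  (γ_b < 0)  and  |V(x̄*)| ≍ (h_n^{−1})^{γ_v} n^{−1/2}  (γ_v > 0),

so equating the two orders yields (h_n^{−1})^{γ_v − γ_b} = n^{1/2}, that is, h_n^{−1} = n^{1/(2γ_v − 2γ_b)},

and the resulting convergence rate is ĝ(x̄*) − g(x̄*) = O_p(n^{γ_b/(2γ_v − 2γ_b)}).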
A few remarks are in order. First, it can be verified (see the proof of Theorem 2 in the Appendix) that the bandwidth sequences given above are such that conditions (i) and (ii) of Lemma 2 hold, thus implying that the nonlinear remainders are indeed negligible and that our simple bias–variance decomposition is justified. Second, the arbitrarily small ε was introduced to drastically simplify the calculations and the statement of the results at the expense of a very small loss in precision. Third, it is impossible to have β_b > β_v, because β_b = β_φ, β_v = β_m, and β_m ≥ β_φ: since m_1(ζ) = E[e^{iζx*}]E[e^{iζΔz}], the characteristic function of z cannot decay more slowly than that of x*.
The convergence rate of the proposed estimator varies substantially as a function of the smoothness of the densities and the conditional expectations involved. An important trend to observe among these rates is that large values of βb (indicating a rapidly decreasing bias as h → 0) and small values of βv (indicating a slowly increasing variance as h → 0) are desirable. The convergence rates obtained are typically slower than that of the Nadaraya–Watson kernel estimator used when the variables are perfectly observed. This limitation is not an artifact of our estimation procedure: it has also been observed in the simpler Fan and Truong estimator, which is known to be optimal under stronger assumptions than ours (see Fan and Truong, 1993). The different cases will be discussed—and compared to Fan and Truong's findings—in more detail in Section 4.
Although we have focused on pointwise convergence rates, our results also provide information regarding global convergence rates. The upper bounds on the pointwise bias and variance (and on the nonlinear remainder terms) are in fact independent of x̄*. If the density of x* is bounded away from zero over some finite interval [a,b], it is straightforward to show that the weighted error (∫_a^b |ĝ(x̄*) − g(x̄*)|^p w(x̄*) dx̄*)^{1/p} converges to zero in probability at the same rate as the pointwise rates derived earlier, for any bounded weighting function w(x̄*) and any p ∈ [1,2]. However, rates of uniform convergence in probability do not follow directly from the results presented above.
To establish the asymptotic normality of the proposed estimator, we need to introduce a few additional assumptions. First, we need assumptions that are commonly made whenever a central limit theorem for triangular arrays is invoked (see, e.g., Härdle and Linton, 1994, Theorem 2; Andrews, 1991, Assumption A).
Assumption 11. There exist constants C, δ > 0 such that E[|x|^{2+δ}|z] ≤ C and E[|y|^{2+δ}|z] ≤ C for all z, and a constant C′ > 0 such that Var[x|z] ≥ C′ and Var[y|z] ≥ C′ for all z.⁶

⁶The familiar condition E[|K(x*)|^{2+δ}] < ∞, which is helpful to show the asymptotic normality of standard kernel estimators, is of no use in establishing the asymptotic normality of our more complex estimator. In any case, Assumption 4 implies that E[|K(x*)|^{2+δ}] < ∞.
The remaining assumptions are used to ensure that condition (iv) of Lemma 2 holds, so that the higher order remainder terms are asymptotically negligible relative to the standard deviation of the linearized estimator ĝ_L(x̄*). The main obstacle to overcome is the necessity of finding a lower bound for the variance Var[ĝ_L(x̄*)] of the estimator. The difficulty of obtaining such a result is noted by Fan (1991a) in his study of the limiting distribution of the kernel deconvolution estimator. Fan's solution to this problem is simply to assume that the tails of the various Fourier transforms entering the estimator are not only bounded by some function of the form ζ^γ exp(α|ζ|^β) but are asymptotically equal (as |ζ| → ∞) to such a functional form, thereby limiting the set of allowed functions. Our solution to this problem is similar in spirit to Fan's but considerably expands the range of possible behavior toward infinity by employing the concept of functions that are “well behaved at infinity,” as described by Lighthill (1962).⁷ The following definition formalizes this notion.

⁷We expand Lighthill's definition by allowing for exponential tails, which is essential to handle supersmooth functions.
DEFINITION 4. Let 𝒲 be the set of all functions ψ: ℝ → ℂ such that (i) ψ(ζ) is absolutely integrable over every finite interval and (ii) ∫_{|ζ|≥T} |ψ(ζ) − Ψ(ζ)| dζ < ∞ for some T > 0 and some function Ψ(ζ) that can be written as a finite linear combination of finite products of functions of the form |ζ|^c, sgn(ζ)|ζ|^c, ln|ζ|, sin(cζ), cos(cζ), and exp(cζ^γ), for constants c and γ > 0.
Assumption 12. For a given x̄* ∈ ℝ, the functions U_a^{k}(ζ, x̄*, 0) and dU_a^{k}(ζ, x̄*, 0)/dζ, for k = 0,1 and a = 1, x, y and for U_a^{k}(ζ, x̄*, h) given in equation (26), belong to 𝒲.
For simplicity, we do not state Assumption 12 in terms of elementary quantities such as m_1(ζ) and φ_k(ζ), but it is clear that Assumption 12 is only a few algebraic manipulations away from being a primitive condition. We need to constrain the derivative of U_a^{k}(ζ, x̄*, 0) to rule out counterexamples in which the density of z arbitrarily far away from the point of evaluation x̄* could have a nonvanishing influence on the variance of ĝ_L(x̄*) asymptotically, making it difficult to characterize the behavior of the variance as n → ∞.
The following condition requires the distribution of z to be supported on the whole real line, which is usually the case in deconvolution problems, because distributions that have a nonvanishing characteristic function (as imposed by equation (41) in Assumption 8) rarely have compact support.

Assumption 13. f(z) > 0 for all z ∈ ℝ.
Finally, we need to impose a few constraints that would be very difficult to state in a more primitive fashion. However, these assumptions are not very restrictive because the counterexamples violating them are somewhat contrived.
Assumption 14. Var[ĝ_L(x̄*)] is of an order no smaller than that of any of the individual terms entering its asymptotic representation.

This assumption merely states that the variance of the estimator is of an order no less than any term in its asymptotic representation. This constraint can only be violated if two or more of those terms happen to cancel out asymptotically, which is unlikely because each term depends on different random quantities.
Assumption 15. For Ω_{k_1 k_2}(ξ_1, ξ_2) as in equation (27), the integrals defining Var[Φ̂_k(x̄*)] are of the same order as the corresponding integrals of the absolute values of their integrands, for k = 0,1 and all n sufficiently large.

This assumption requires that these two integrals be of the same order. It precludes Ω_{k k}(ξ_1, ξ_2) from having an oscillatory behavior (as ξ varies) such that a precise cancellation would occur between its values at different ξ during the integration. The cancellation would have to occur for all ζ and n sufficiently large and be such that the order of Var[ĝ_L(x̄*)] would be affected.
Assumptions 12–15 imply condition (iv) in Lemma 2, thus establishing the required asymptotic negligibility of the nonlinear remainder terms. If it is possible to calculate the ratio U(h_n)n^{−1/2}λ(h_n)n^{−1/2+ε}/(Var[V(x̄*)])^{1/2} directly and verify that it goes to 0 asymptotically, then Assumptions 12–15 can be avoided altogether.⁸

⁸And the term n^{1/(3+2γ_r−2γ_m)} in equation (64) can be replaced by n^{1/(2+2γ_r−2γ_m)}.
THEOREM 3. Under Assumptions 1–8 and 11–15, for any given x̄* ∈ ℝ and any sequence h_n satisfying h_n^{−1} ≼ n^{−η} n^{1/(3+2γ_r−2γ_m)} if α_m = 0, or (h_n^{−1})^{β_m} ≤ ((1 − η)/(−2α_m)) ln n if α_m ≠ 0, for some η > 0, we have

(ĝ(x̄*) − g(x̄*) − B(x̄*)) / (Var[V(x̄*)])^{1/2} →_d N(0,1),

where B(x̄*) and Var[V(x̄*)] are given in Lemma 1.
Section 3.3 derives the convergence rates of the proposed estimator under very general conditions. We now focus on specific examples that will allow us to compare these convergence rates with those derived for the estimator proposed by Fan and Truong (1993), which is the most closely related to ours. Fan and Truong's estimator extends the standard kernel deconvolution estimators used for density estimation in the presence of a measurement error drawn from a known distribution to the case of nonparametric regressions. The estimator presented here accomplishes a more difficult task than Fan and Truong's because it considers the density of the measurement error unknown, relying instead on two error-contaminated measurements of the unobserved regressor. Hence, it would come as no surprise if the kernel deconvolution rates were better. The comparison is nevertheless instructive, because it quantifies the precision loss incurred by relaxing the distributional assumptions regarding the measurement error.
We consider four examples. We first study the “difficult” deconvolution problem that consists of estimating an ordinarily smooth conditional expectation (E [y|x*]) when the density of both the true regressor x* and the measurement error Δz are supersmooth. This problem is difficult because a supersmooth measurement error strongly damps out the high-frequency components of E [y|x*] and of the density of x*. Inverting this operation involves the amplification of these damped-out components, an operation that necessarily causes a substantial magnification of the statistical noise. In standard kernel deconvolution estimators, this situation gives rise to extremely slow convergence rates, and it is instructive to verify that the situation does not degrade further when the distribution of the measurement error is unknown. The second example shows that this slow convergence problem is avoided when the conditional expectation E [y|x*] is supersmooth as well. The third example assumes the density of the measurement error is ordinarily smooth, a situation that avoids the slow convergence problem for the kernel deconvolution estimator but, as we will see, not for our estimator. The final example completes the analysis by showing that when all quantities are ordinarily smooth, the slow convergence problem is avoided.
Table 1 summarizes the assumptions made in each of the four cases considered. A few remarks are in order. In each case, we assume that the order of the kernel is sufficiently large so that the smoothness of E [y|x*] and of the density of x* (and not the order of the kernel) is the factor limiting the rate at which the bias goes to zero. We also assume that equation (39) holds with γr = 1. Table 1 also summarizes the convergence rates obtained by applying Theorem 2 in each of the four examples considered. We will now discuss the significance of these results.
Table 1. Convergence rates obtained under given regularity assumptions
In Example 1, the rates are entirely comparable to those obtained by Fan and Truong (1993) for kernel deconvolution estimators. They found rates of the form (ln n)^{−k/β}, where k is the number of continuous derivatives that g(x*) possesses. Because a function whose Fourier transform behaves asymptotically as ζ^{−(k+1+ε)} necessarily has k continuous derivatives, it is clear that the rates are comparable. The rates differ by ε, because Fan and Truong formulate their regularity conditions in terms of derivatives whereas we formulate them in terms of the asymptotic behavior of Fourier transforms. Formulating our regularity conditions in terms of derivatives would yield results identical to Fan and Truong's. It is remarkable that under the assumptions leading to the worst-case convergence rates for kernel deconvolution estimators, the assumption of a known measurement error distribution can be relaxed without bringing the convergence rate down further.
Example 2 shows that the slow convergence rate problem can be alleviated if the unknown regression function g(x*) is supersmooth and if an “infinite-order” kernel is used. This situation ensures that the bias term goes to zero faster than any power of h, which is sufficient to convert a convergence rate of the form (ln n)^γ to a rate of the form n^γ for γ < 0. More generally, relatively fast convergence rates can be achieved with infinite-order kernels whenever case 3 of Section 3.3 applies. Caution is, however, advised when using high-order kernels. They are known not to perform as well in finite samples as their asymptotic properties would suggest (see Härdle and Linton, 1994). The origin of the problem is that a high-order kernel must necessarily take negative values over a portion of its support, which makes it likely for the denominator of the Nadaraya–Watson kernel estimator to approach zero, even at a point where the true density is bounded away from zero.
In Example 3, making the density of the measurement error Δz ordinarily smooth instead of supersmooth does not improve the convergence rates relative to Example 1. This is in sharp contrast to the behavior of kernel deconvolution estimators, whose convergence rates take the form of a negative power of n under the same assumptions. The reason for this distinction is that the only characteristic function appearing in the denominator of a kernel deconvolution estimator is that of the measurement error Δz, whereas in our estimator, it is the characteristic function of z that appears in the denominator. The density of z is supersmooth if either the density of the true regressor x* or that of the measurement error Δz is supersmooth. Hence, a supersmooth density for x* will also cause our estimator to converge slowly.
In Example 4, it is seen that when the density of x* is made ordinarily smooth as well, the slow convergence problem is avoided, as expected. The resulting rates are not necessarily identical to those of Fan and Truong's kernel deconvolution estimator, but the rates at least take the form of a negative power of n, indicating that the distributional assumptions regarding the measurement error can be relaxed without an undue increase in the statistical noise.
We now investigate the finite-sample properties of the proposed estimator through various Monte Carlo simulations. The designs are chosen so as to illustrate the examples of Section 4, summarized in Table 1, which cover the most common combinations of smooth and supersmooth distributions and conditional expectations. As an example of a supersmooth distribution, the normal distribution with variance σ² naturally comes to mind. Its characteristic function has a tail of the form exp(−(σ²/2)|ζ|²). As an example of an ordinarily smooth distribution, we consider the Laplace (or double exponential) distribution with mean μ and variance σ², denoted by L(μ,σ²) and defined by the density

f(x) = (√2 σ)^{−1} exp(−√2 |x − μ|/σ)

for any x ∈ ℝ. The tail of the characteristic function of a Laplace density is of the form |ζ|^{−2}.
Our example of a supersmooth regression function is the error function, which has a Fourier transform decaying at the rate |ζ|^{−1} exp(−(1/4)|ζ|²) as |ζ| → ∞. Finally, our example of an ordinarily smooth regression function is a piecewise linear continuous function with a discontinuous first derivative, whose Fourier transform decays as |ζ|^{−2}. To simplify comparisons, both functions are normalized to have the same range and a similar general shape, so that any difference in the results can be attributed to their difference in smoothness. All simulations proceed by drawing 500 samples of 1,000, 2,000, or 8,000 observations from the distributions given in Table 2. Table 2 also provides the theoretical convergence rate in each case, obtained by substituting the appropriate smoothness parameters in the expressions of Table 1. The distribution of Δy is never altered, because it has little impact on the asymptotic properties of the estimator except for a trivial scaling of some of the components of the asymptotic variance. For each sample, the variables x, y, z are constructed through

x = x* + Δx,  z = x* + Δz,  y = g(x*) + Δy.

The variables (y,x,z) are used as an input for our estimator, and the variables (y,x) are fed into the Nadaraya–Watson estimator. We also construct an (infeasible) Nadaraya–Watson estimator from the variables (y,x*) for comparative purposes. For all three estimators, an infinite-order kernel whose Fourier transform is given by equation (23) with ξ̄ = ½ is used. In this fashion, the kernel is never the factor limiting the convergence rate. For each sample, we keep track of the value of the estimated function at a given point (here, x* = 1) and use it to calculate the bias squared, the variance, and the sum of the two, the mean square error. A set of bandwidths ranging from 1.0 to 2.5 is scanned in increments of 0.05 in search of the bandwidth minimizing the mean square error.⁹

⁹For less than 0.5% of the samples drawn, numerical issues associated with near division by zero in equations (19) and (20) were observed for a few of the smallest bandwidths sampled. To simplify the reporting of the results as a function of bandwidth, these draws were discarded and new draws were made so that the total number of samples kept remains 500. Of course, when studying any given sample, practitioners would simply never choose such a small bandwidth. The problem only occurs because we are performing Monte Carlo simulations and wish to report averages over replications as a function of bandwidth.
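A minimal sketch of one such simulation draw follows; the specific distributions and parameters of Table 2 are not reproduced here, so all distributional choices below are placeholders.

```python
import numpy as np
from math import erf

rng = np.random.default_rng(0)

def draw_sample(n, supersmooth_g=True):
    """Hypothetical draw mimicking the structure of the designs: x and z are
    two error-contaminated measurements of x*, and y = g(x*) + dy."""
    x_star = rng.normal(0.0, 1.0, n)
    dx = rng.normal(0.0, 0.5, n)                 # measurement error in x (placeholder)
    dz = rng.normal(0.0, 0.5, n)                 # measurement error in z (placeholder)
    dy = rng.normal(0.0, 0.1, n)                 # regression disturbance (placeholder)
    g = (np.vectorize(erf) if supersmooth_g      # supersmooth g: error function
         else lambda u: np.clip(u, -1.0, 1.0))   # ordinarily smooth g: kinked, same range
    return g(x_star) + dy, x_star + dx, x_star + dz

y, x, z = draw_sample(1000)
```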
Table 2. Monte Carlo simulation designs
Table 3 compares the bias squared, the variance, and the mean square error of the three estimators considered as a function of bandwidth for a sample size of 1,000. For conciseness, only a subset of the bandwidths considered is shown. The rightmost column gives all quantities evaluated at the optimal bandwidth (which may lie between two of the bandwidths listed in the previous columns). A few important features can be consistently observed throughout the four examples considered.
Table 3. Monte Carlo simulation results for the examples
In comparison with the Nadaraya–Watson estimator, our estimator is clearly very effective at reducing the bias. More specifically, the bias of the Nadaraya–Watson estimator does not converge to zero with decreasing bandwidth but instead settles at a nonzero value. In contrast, the bias of our estimator decreases by orders of magnitude over the range of bandwidths sampled as the bandwidth decreases. Our estimator's residual bias is attributable to the fact that we are performing a nonparametric estimation, so that a fully unbiased estimation is impossible. In fact, it can readily be seen that, at a given bandwidth, the bias of our estimator is very close to the bias of the infeasible Nadaraya–Watson estimator using the uncontaminated regressor x*, thus indicating that our estimator does not appear to introduce additional bias at the sample size considered. Of course, because the variance of our estimator is larger than that of the infeasible Nadaraya–Watson estimator, a larger bandwidth must be used, and the resulting bias, evaluated at the optimal bandwidth, is slightly larger than in the error-free case.
The bias reduction made possible by the proposed estimator comes at the expense of an increased variance relative to the Nadaraya–Watson estimator based on mismeasured regressors. However, the decrease in the bias more than offsets the increase in the variance, so that the mean square error we obtain is still better than for the Nadaraya–Watson estimator.
It is instructive to observe the estimator's behavior as a function of the smoothness of the various densities and conditional expectations considered. The asymptotic theory presented earlier predicts the convergence rate, which can be directly compared with the change in the mean square error at the optimal bandwidth, as a function of sample size for each of the examples considered (see Table 4). The fifth column of Table 4, labeled “MSE8000/MSE2000,” reports the ratio of mean square error at a sample size of 8,000 relative to the mean square error at sample size 2,000. We focus on these sample sizes because the differences between the various examples are more readily seen at large sample sizes. In Examples 1 and 3, where the convergence rate should be slow (i.e., a negative power of the log of sample size), convergence is indeed much slower than for Examples 2 and 4, where the convergence rate should be fast (i.e., a negative power of sample size). Moreover, the decrease in mean square error predicted by asymptotic theory (obtained by squaring the rates given in Table 2 and shown in the last column of Table 4) is an excellent predictor of the actual decrease in three out of the four examples. Note that the systematic changes in bandwidth as a function of sample size are difficult to distinguish from the inherent simulation noise, because bandwidth variations are much smaller than the changes in mean square error, as predicted by Theorem 2.
Table 4. Monte Carlo simulation results as a function of sample size
Monte Carlo simulations can also be used to verify the applicability of the asymptotic distribution in a finite sample. The designs described in Table 2 are again used, with the mean-square-error-minimizing bandwidths given in Table 3 and a sample size of 1,000. For each sample, we keep track of the value of the estimated function at a given point (x* = 1.0) and the estimated variance at that point, obtained from equations (31) and (32) by replacing all expected values by sample averages. The point estimates are then standardized, that is, demeaned by the average of the point estimates and normalized by the square root of the average of the estimated pointwise variances. Figure 1 shows the empirical cumulative distribution function (c.d.f.) of the standardized point estimates p_i for i = 1,…,500, obtained by sorting the p_i in increasing order and by joining the points (p_i, (i − 1)/499) by lines. The resulting empirical c.d.f. (jagged lines in Figure 1) agrees very well with the normal c.d.f. predicted by asymptotic theory (shown as a smooth line in Figure 1).
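The standardization and the construction of the empirical c.d.f. just described amount to the following few lines (a sketch; the estimates and variances would come from the 500 replications):

```python
import numpy as np

def standardized_cdf_points(estimates, variances):
    """Demean by the average point estimate, scale by the root of the average
    estimated pointwise variance, and return the sorted pairs (p_i, (i-1)/(R-1))
    traced as the jagged line in Figure 1."""
    p = np.sort((estimates - estimates.mean()) / np.sqrt(variances.mean()))
    return p, np.arange(len(p)) / (len(p) - 1.0)
```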
Figure 1. Comparison between the finite-sample and the asymptotic distributions of the estimator. The abscissa is the standardized point estimate.
This paper presents a new kernel-based nonparametric estimator that extends the conventional Nadaraya–Watson kernel estimator to cover the case of an error-ridden regressor. We show that identification is achievable when one repeated measurement of the error-contaminated regressor is available. One remarkable property of our estimator is that it requires no knowledge of the distribution of the measurement error, contrary to the popular kernel deconvolution estimator. The convergence rate and the asymptotic distribution of the proposed estimator are derived. A series of examples illustrates the main factors determining the convergence rate and enables us to compare the convergence rates we obtain with those of earlier estimators. Various Monte Carlo simulations are used to investigate the finite-sample properties of the estimator.
Proof of Theorem 1. The result can be shown by direct substitution. Assumption 2 ensures that all expectations are well defined. First, observe that equation (14) indeed provides the value of φ_0(ξ), by using Assumption 1:

i m_x(ζ)/m_1(ζ) = i E[x e^{iζz}]/E[e^{iζz}] = i E[x* e^{iζx*}]E[e^{iζΔz}]/(E[e^{iζx*}]E[e^{iζΔz}]) = φ_0′(ζ)/φ_0(ζ),

so that exp(∫_0^ξ i m_x(ζ)/m_1(ζ) dζ) = exp(∫_0^ξ (φ_0′(ζ)/φ_0(ζ)) dζ) = φ_0(ξ), since φ_0(0) = 1; an analogous argument delivers equation (15) for φ_1(ξ). Letting f(x*) be the density of x*, one can then show that Φ_1(x̄*) and Φ_0(x̄*) in equation (13), respectively, provide the numerator and the denominator of the Nadaraya–Watson estimator. In what follows, we use the independence between x* and Δz and the fact that κ(ξ) is the Fourier transform of the kernel K(x*). █
Proof of Lemma 1. The fact that E[ĝ_L(x̄*)] − g(x̄*) = B(x̄*) follows from equation (25) and the fact that E[m̂_a(ξ)] = m_a(ξ). Finally, to calculate Var[ĝ_L(x̄*)], we note that, by equation (25), the variance involves only the second moments of the m̂_a(ξ). Equation (31) then follows directly from squaring equation (24), taking its expectation, and using the expression for the second moments just derived. █
LEMMA 6. If a_j and z_j are sequences of i.i.d. real-valued random variables such that E[a_j²] < ∞ and E[|a_j||z_j|] < ∞, then, for any u, U ≥ 0 and ε > 0,

sup_{|ξ| ≤ U n^{u}} | n^{−1} Σ_{j=1}^{n} a_j e^{iξz_j} − E[a_j e^{iξz_j}] | = O_p(n^{−1/2+ε}).

Proof. See Lemma 6 in Schennach (2004). █
Proof of Lemma 2. To compute the Fréchet derivative of Φ̂_k(x̄*) with respect to the estimated moments m̂_a(ξ) in the vicinity of the true moments m_a(ξ), we first note a few simple results. A ratio of two random functions N̂(ξ)/D̂(ξ) can be exactly written as the ratio N(ξ)/D(ξ) of their nonrandom limits, plus a term that is linear in the deviations N̂(ξ) − N(ξ) and D̂(ξ) − D(ξ), plus a remainder that can be written in two alternative ways, each convenient for different bounding purposes (equation (A.32)). Similarly, the exponential factor in the estimator admits, for some random function δQ_x(ξ) whose supremum is suitably controlled, an analogous exact expansion into its nonrandom limit, a linear term, and a remainder (equation (A.37)).
Substituting expansions (A.32) and (A.37) into Φ̂_k(x̄*) for k = 0,1 and keeping the terms linear in the deviations m̂_a(ξ) − m_a(ξ) gives the linearization of Φ̂_k(x̄*), denoted Φ̂_k^L(x̄*). By making use of the Fourier inversion identity for any absolutely integrable function f, we obtain the expression for Φ̂_k^L(x̄*) entering Definition 2.
The order of Φ̂_k^L(x̄*) − Φ_k(x̄*) (in probability) can be found through its variance, given by Lemma 1, where, by Assumptions 5 and 6, the relevant second moments are finite. It follows that each of the variance contributions is bounded, and therefore that Φ̂_k^L(x̄*) − Φ_k(x̄*) = O_p(U(h_n)n^{−1/2}), where U(h_n), given in the statement of the lemma, has been explicitly constructed to bound any of the terms in equation (26) (up to a multiplicative constant). By equation (30), equation (A.49) implies equation (36) in the statement of the lemma, provided that Assumption 7 holds.
To establish equation (37), we substitute expansions (A.32) and (A.37) into Φ̂_k(x̄*) for k = 0,1 and remove the terms linear in the deviations m̂_a(ξ) − m_a(ξ). We then find that the nonlinear remainder Φ̂_k(x̄*) − Φ̂_k^L(x̄*) can be written as a sum of higher order terms. These terms can then be bounded in terms of λ(h_n), U(h_n) (given in the statement of the lemma), and sup_{|ξ| ≤ h_n^{−1}} |m̂_a(ξ) − m_a(ξ)|, where the supremum can be taken over [−h_n^{−1}, h_n^{−1}] because κ(h_n ξ) vanishes outside that interval. By Lemma 6, this supremum is O_p(n^{−1/2+ε}) for any ε > 0. Also, we note that m̂_1(0) = m_1(0) = 1. Now, for k = 0,1, the leading remainder terms can be bounded accordingly, and the remaining terms can be similarly bounded.
for some δ > 0. By a standard Taylor expansion of the ratio
around
, we have
for some
lying between
. Because (i) we have just shown that
, (ii)
by assumption, and (iii)
is bounded and
is bounded away from zero by assumption, it follows that
converge in probability to finite quantities and therefore
is of the same order as
, thus implying equation (37).
To establish the second conclusion of the lemma, we note that, because condition (iv) holds, we can write the difference between ĝ(x̄*) and ĝ_L(x̄*), standardized by (Var[V(x̄*)])^{1/2}, explicitly. Then, this standardized difference converges to zero in probability because U(h_n)n^{−1/2}λ(h_n)n^{−1/2+ε} = o((Var[V(x̄*)])^{1/2}) by assumption. █
Proof of Lemma 3. First, by equation (13), we have, for k = 0,1, an exact expression for the bias of Φ_k(x̄*) in terms of κ(hξ) and φ_k(ξ). Expanding the Fourier transform of the kernel in a Taylor series up to order γ, we obtain equation (A.98). Now let the order γ be chosen as follows. If α_φ (defined in Assumption 8) is nonzero, then let γ = γ_κ, the order of the kernel. If α_φ = 0, then let γ be the largest integer such that γ ≤ γ_κ and γ < −γ_φ − 1. With this choice of γ, equation (A.98) simplifies, because all terms where i < γ vanish, by the definition of the order of a kernel. Furthermore, the coefficient of the leading term, the γth absolute moment of the kernel, is finite by equation (44) of Assumption 9. The remaining integral term is finite also, because our choice of γ guarantees that the integrand decays to zero faster than ξ^{−1}. Then, by a standard Taylor expansion of the ratio Φ_1(x̄*)/Φ_0(x̄*) around its limit, the convergence rate of B(x̄*) is of the same order also. █
LEMMA 7. For ζ ≥ 0, if γ > 0, α < 0, and β > 0, or if α = β = 0 and γ > 0, then

∫_0^ζ (1 + ξ)^{γ} exp(−αξ^{β}) dξ ≼ (1 + ζ)^{1+γ} exp(−αζ^{β}).

Proof. The case where α = β = 0 is trivial. If α < 0 and β > 0, Lemma 4.2 in Li and Vuong (1998) shows that, for γ > 0, ∫_0^ζ ξ^{γ} exp(−αξ^{β}) dξ ∼ (−αβ)^{−1} ζ^{1+γ−β} exp(−αζ^{β}) as ζ → ∞, thus implying the result because ξ^{1+γ−β} exp(−αξ^{β}) ≼ ξ^{1+γ} exp(−αξ^{β}). █
Proof of Lemma 4. From equation (A.96), the only contribution to the bias of Φ_k(x̄*) comes from frequencies |ξ| ≥ ξ̄h^{−1}, where κ(hξ) may differ from 1. Then, by a standard Taylor expansion of the ratio Φ_1(x̄*)/Φ_0(x̄*) around its limit, the convergence rate of B(x̄*) is O((1 + h^{−1})^{γ_φ+1} exp(α_φ(ξ̄h^{−1})^{β_φ})) also. █
LEMMA 8. For ζ ≥ 0, if β ≥ 0 and if (1 + ξ)^{γ} exp(αξ^{β}) is increasing in ξ, then

∫_0^ζ (1 + ξ)^{γ} exp(αξ^{β}) dξ ≼ (1 + ζ)^{1+γ} exp(αζ^{β}).

Proof. The integrand is bounded by its value at ξ = ζ, so the integral is at most ζ(1 + ζ)^{γ} exp(αζ^{β}) ≼ (1 + ζ)^{1+γ} exp(αζ^{β}). █
Proof of Lemma 5. By Lemma 2, the order of the variance term is O_p(U(h_n)n^{−1/2}), where U(h_n) is built, for k = 0,1 and a = 1, x, y, from bounds on the quantities in equation (26), each of which can be bounded using Assumption 8 together with Lemmas 7 and 8. It follows that U(h_n) ≼ (1 + h_n^{−1})^{γ_v} exp(α_v(h_n^{−1})^{β_v}) with γ_v = 2 + γ_φ − γ_m + γ_r, α_v = −α_m, and β_v = β_m. Hence, V(x̄*) = O_p((1 + h_n^{−1})^{γ_v} exp(α_v(h_n^{−1})^{β_v})n^{−1/2}). █
Proof of Theorem 2. We make use of the order of the bias B(x̄*) provided by Lemmas 3 and 4 and of the order of the variance term V(x̄*) provided by Lemma 5. To check that the higher order term ĝ(x̄*) − ĝ_L(x̄*) does not affect the rates obtained by considering the first-order terms only, we observe that, by Lemma 2, the upper bound on V(x̄*) provided by Lemma 5 holds for ĝ(x̄*) − E[ĝ_L(x̄*)] also, if we can show that λ(h_n)n^{−(1/2)+ε} = o(1) for some ε.

We consider each subcase of the theorem separately. Let R_n denote the convergence rate to be established, so that ĝ(x̄*) − g(x̄*) = O_p(R_n). Throughout the proof, let ε, ε_1, ε_2,… denote arbitrarily small positive numbers.
Case 1. β_v > β_b. If the bandwidth h_n is chosen to balance the exponentially decaying bias factor against the exponentially diverging variance factor, for some ε_v, ε_b > 0, the bias and the variance are of the same order and the convergence rate follows. Now, to check the negligibility of the higher-order terms, we verify that λ(h_n)n^{−1/2+ε_1} = o(1) for some suitably chosen ε_1 > 0. Noting that β_v = β_m and α_v = −α_m if β_v > β_b, the required bound follows by direct substitution of the bandwidth sequence.

Case 2. β_b = 0 (and γ_b < 0) and β_v > 0. For some ε_v > 0, let h_n^{−1} = ((ln n)/(2α_v(1 + ε_v)))^{1/β_v}. Then the variance term vanishes at a rate that is a negative power of n (up to logarithmic factors), the bias term, of order (ln n)^{γ_b/β_v}, dominates, and R_n = (ln n)^{γ_b/β_v}. The negligibility of the higher order terms follows by the same substitution.

Case 3. β_b = β_v ≠ 0. For some ε > 0, let h_n^{−1} be chosen to balance the two exponential factors exp(α_b(h_n^{−1})^{β_b}) and exp(α_v(h_n^{−1})^{β_v})n^{−1/2}, which is possible because both exponents of supersmoothness coincide. The resulting rate R_n is then a negative power of n.

Case 4. β_b = β_v = 0 (and α_b = α_v = 0 and γ_b < 0). Let h_n^{−1} = n^{1/(2γ_v−2γ_b)}. Then R_n = n^{γ_b/(2γ_v−2γ_b)}. Noting that γ_b ≥ γ_φ + 1, γ_v = 2 + γ_φ − γ_m + γ_r, and γ_m ≤ 0, the exponent of n is negative, and the negligibility condition λ(h_n)n^{−1/2+ε} = o(1) can be verified directly. █
LEMMA 9. Let K_n(z) be a sequence of real-valued nonrandom functions of a real variable, let a_j and z_j be i.i.d. sequences with a_j satisfying E[a_j^{2+δ}|z_j = z] ≤ C for some C, δ > 0 for all z and Var[a_j|z_j = z] ≥ C for some C > 0 and for all z, and let

σ_n² ≡ Var[a_j K_n(z_j)].

If inf_{n≥N} σ_n > 0 for some N ∈ ℕ and E[|K_n(z_j)|^{2+δ}] = o(n^{δ/2}σ_n^{2+δ}) for some δ > 0, then

(n^{1/2}σ_n)^{−1} Σ_{j=1}^{n} (a_j K_n(z_j) − E[a_j K_n(z_j)]) →_d N(0,1).
Proof. Let Z_nj = a_j K_n(z_j). The proof consists in verifying that Z_nj satisfies the hypothesis of the Lindeberg–Feller central limit theorem for triangular arrays. Indeed, the Z_n1,…,Z_nn are i.i.d. by assumption, and it remains to be shown that the Lindeberg condition holds: for all ε > 0,

I_n ≡ σ_n^{−2} E[1(|Z_nj| ≥ εσ_n n^{1/2}) Z_nj²] → 0.

First, noting that 1(ab ≥ c) ≤ 1(a ≥ c^{η}) + 1(b ≥ c^{1−η}) for any a, b, c ≥ 0 and any η ∈ ]0,1[, we can write I_n ≼ T_1 + T_2, where

T_1 ≡ σ_n^{−2} E[1(|a_j| ≥ ε^{η}σ_n^{η}n^{η/2}) a_j² K_n²(z_j)],  T_2 ≡ σ_n^{−2} E[a_j² 1(|K_n(z_j)| ≥ ε^{1−η}σ_n^{1−η}n^{(1−η)/2}) K_n²(z_j)].

Then, T_1 = σ_n^{−2} E[E[1(|a_j| ≥ ε^{η}σ_n^{η}n^{η/2}) a_j² | z_j] |K_n(z_j)|²] ≼ σ_n^{−2} E[(ε^{η}σ_n^{η}n^{η/2})^{−δ} |K_n(z_j)|²], because E[a_j^{2+δ}|z_j = z] ≤ C (i.e., E[1(|a_j| ≥ c)a_j²|z_j] ≤ E[1(|a_j| ≥ c)(a_j/c)^{δ}a_j²|z_j] ≤ c^{−δ}E[1(|a_j| ≥ c)a_j^{2+δ}|z_j] ≤ c^{−δ}E[a_j^{2+δ}|z_j] ≼ c^{−δ}). Noting that σ_n² ≽ E[K_n²(z_j)] (since Var[a_j|z_j] ≥ C), we have T_1 ≼ (E[K_n²(z_j)])^{−1} E[(ε^{η}σ_n^{η}n^{η/2})^{−δ} K_n²(z_j)] = (ε^{η}σ_n^{η}n^{η/2})^{−δ}(E[K_n²(z_j)])^{−1}E[K_n²(z_j)] = (ε^{η}σ_n^{η}n^{η/2})^{−δ} → 0. Also, T_2 = σ_n^{−2}E[E[a_j²|z_j] 1(|K_n(z_j)| ≥ ε^{1−η}σ_n^{1−η}n^{(1−η)/2}) K_n²(z_j)] ≤ Cσ_n^{−2}E[1(|K_n(z_j)| ≥ ε^{1−η}σ_n^{1−η}n^{(1−η)/2}) K_n²(z_j)] ≼ E[1(|K_n(z_j)| ≥ ε^{1−η}σ_n^{1−η}n^{(1−η)/2}) K_n²(z_j)].
Let s_n denote the slope of the triangular function considered below. For a given value of E[K_n²(z_j)], the maximum value of E[1(|K_n(z_j)| ≥ C)K_n²(z_j)] for some C > 0 is obtained when the support of K_n(z) is inside the support of the distribution of z and when K_n(z) is triangular, in which case the relevant threshold is attained at l_n = (3σ_n²/(2s_n²))^{1/3}. Then, evaluating the resulting integral explicitly shows that the bound on T_2 vanishes as n → ∞, and it follows that I_n → 0, as desired. █
LEMMA 10. If ψ ∈ 𝒲, then lim_{|z|→∞} p(z) = 0, where p(z) is the inverse Fourier transform of ψ(ζ).

Proof. This result is Theorem 18 in Lighthill (1962), with the trivial modification that the Fourier transform is replaced by the inverse Fourier transform and with the slight extension that allows the tail behavior of the function ψ(ζ) to be exponential (see Definition 4). This extension is straightforward because Lighthill's proof proceeds by writing ψ(ζ) = Ψ(ζ) + (ψ(ζ) − Ψ(ζ)), where (ψ(ζ) − Ψ(ζ)) can be handled using the Riemann–Lebesgue lemma. By the assumption that ψ ∈ 𝒲, the function Ψ(ζ) can be chosen such that its inverse Fourier transform, p_∞(z), can be calculated analytically and be shown to satisfy lim_{|z|→∞} p_∞(z) = 0. All that is needed to allow for more flexible choices of tail behavior than initially employed by Lighthill is to find functions Ψ(ζ) whose inverse Fourier transforms can be calculated analytically and have the appropriate tail behavior. Using the techniques described in Gel'fand and Shilov (1964, Example 5, p. 169), the inverse Fourier transform of exponentials of the form exp(cζ^{γ}) can be shown to be expressible as a series in the derivatives δ^{(k)}(z) of Dirac's delta distribution, where δ^{(k)} denotes the kth derivative. This distribution clearly vanishes as |z| → ∞, as required. Note that, although such a distribution does not belong to the class of the so-called tempered distributions, it does belong to the wider class of distributions that forms the dual of compactly supported infinitely differentiable test functions (i.e., the so-called Type K distributions of Gel'fand and Shilov). █
Proof of Theorem 3. According to the second conclusion of Lemma 2, to have asymptotic normality of ĝ(x̄*) we need to show that condition (iv) holds, that is, that U(h_n)n^{−1/2}λ(h_n)n^{−1/2+ε} = o((Var[ĝ_L(x̄*)])^{1/2}) for some ε > 0. We proceed by finding a lower bound on Var[ĝ_L(x̄*)] and relating it to U(h_n). First, by Assumption 14, Var[ĝ_L(x̄*)] ≽ Var[T_{k,a,n}], where T_{k,a,n}, for k = 0,1 and a = 1, x, y, is given by

T_{k,a,n} = n^{−1} Σ_{j=1}^{n} a_j K_{k,a,n}(z_j),    (A.191)

where K_{k,a,n}(z) is the inverse Fourier transform of U_a^{k}(ζ, x̄*, h_n).    (A.192)

Then,

Var[T_{k,a,n}] ≽ n^{−1} ∫_{z∈I_{a,k}} (K_{k,a,n}(z))² E[a²|z] f(z) dz

for any finite interval I_{a,k} not reduced to a point. By Assumptions 11 and 13, inf_{z∈I_{a,k}} E[a²|z] f(z) ≥ C > 0, and we have

Var[T_{k,a,n}] ≽ n^{−1} ∫_{z∈I_{a,k}} (K_{k,a,n}(z))² dz.
We now show that the ratio ∫(K_{k,a,n}(z))² dz / ∫_{z∈I_{a,k}}(K_{k,a,n}(z))² dz remains bounded as n → ∞, thus implying that ∫(K_{k,a,n}(z))² dz diverges at the same rate as ∫_{z∈I_{a,k}}(K_{k,a,n}(z))² dz. First, lim_{n→∞} K_{k,a,n}(z) ≡ K_{k,a,∞}(z) is the inverse Fourier transform of U_a^{k}(ζ, x̄*, 0) and, by the moment theorem, the inverse Fourier transform of dU_a^{k}(ζ, x̄*, 0)/dζ is izK_{k,a,∞}(z). Because dU_a^{k}(ζ, x̄*, 0)/dζ belongs to 𝒲 by Assumption 12, we can apply Lemma 10 to conclude that lim_{|z|→∞}|z||K_{k,a,∞}(z)| = 0. Therefore, there exist constants A, C > 0 such that |K_{k,a,∞}²(z)| ≤ A|z|^{−2} for |z| ≥ C and k = 0,1 and a = 1, x, y. It is therefore impossible for the ratio above to become unbounded as n → ∞ if I_{a,k} is chosen to be [−C,C]. We can then write the resulting lower bound as equation (A.197).
By Parseval's identity and the fact that U_a^{k}(ζ, x̄*, h_n) vanishes for |ζ| ≥ h_n^{−1}, we have equation (A.198). By the Cauchy–Schwarz inequality, we obtain an inequality that becomes, upon rearrangement, equation (A.200). Collecting equations (A.197), (A.198), and (A.200), we obtain a lower bound on Var[ĝ_L(x̄*)] in terms of U(h_n) and h_n.
We then observe that, by equation (34) and Assumption 15, this lower bound can be related to U(h_n), yielding (Var[ĝ_L(x̄*)])^{1/2} ≽ h_n^{1/2} U(h_n) n^{−1/2}. Combining the two bounds implies that h_n^{1/2}λ(h_n)n^{−1/2+ε} → 0 for some ε > 0 is a sufficient condition for the asymptotic negligibility of the higher order terms, which we can now verify.

If α_m = 0, then h_n^{1/2}λ(h_n)n^{−1/2+ε} ≼ (1 + h_n^{−1})^{1/2}(1 + h_n^{−1})(1 + h_n^{−1})^{γ_r−γ_m} n^{−1/2+ε} = (1 + h_n^{−1})^{3/2+γ_r−γ_m} n^{−1/2+ε} ≼ (n^{−η}n^{1/(3+2γ_r−2γ_m)})^{3/2+γ_r−γ_m} n^{−1/2+ε} ≼ n^{−η(3/2+γ_r−γ_m)} n^{ε} = o(1) for ε > 0 sufficiently small.

If α_m ≠ 0, then h_n^{1/2}λ(h_n)n^{−1/2+ε} ≼ (1 + h_n^{−1})^{3/2+γ_r−γ_m} exp(−α_m(h_n^{−1})^{β_m}) n^{−1/2+ε} ≼ exp(−α_m(1 + ε_2)(h_n^{−1})^{β_m}) n^{−1/2+ε} ≼ exp(((1 + ε_2)(1 − η)/2) ln n) n^{−1/2+ε} = n^{−(1/2)(η+ηε_2−ε_2)} n^{ε} = o(1) for some ε_2 > 0 and for ε > 0 sufficiently small.
We have now shown that the limiting distribution of ĝ(x̄*) is the same as that of ĝ_L(x̄*). To obtain the limiting distribution of ĝ_L(x̄*), we note that ĝ_L(x̄*) is a finite linear combination of the kernel-type estimators T_{k,a,n} defined in equation (A.191) using the kernels K_{k,a,n}(z) defined by equation (A.192). The asymptotic normality of T_{k,a,n} can be shown using Lemma 9, provided that we can show that E[|K_{k,a,n}(z_j)|^{2+δ}] = o(n^{δ/2}σ_n^{2+δ}) for some δ > 0. By the moment theorem, this requirement is satisfied if the corresponding integrals of |U_a^{k}(ζ, x̄*, h_n)| are suitably bounded for k = 0,1 and a = 1, x, y. Using the same techniques as in the proof of Lemma 5, these integrals are bounded by expressions of the form (1 + h_n^{−1})^{γ} exp(α(h_n^{−1})^{β}), where γ = 3 + γ_φ − γ_m + γ_r, α = −α_m, and β = β_m.

If α_m = 0, then α = 0 and the requirement is readily verified because (i) (3 + γ_φ − γ_m + γ_r) > 0, since γ_φ ≥ −γ_m and γ_r ≥ 0, and (ii) σ_n is bounded away from zero by the argument above. If α_m ≠ 0, the same conclusion follows because the bandwidth condition of the theorem keeps the exponential factor under control. Hence, the hypotheses of Lemma 9 are verified, and the T_{k,a,n} are asymptotically normal. The expectation and the variance of ĝ_L(x̄*) can then be calculated as in Lemma 1. █