We introduce a nonparametric regression estimator that is consistent in the presence of measurement error in the explanatory variable when one repeated observation of the mismeasured regressor is available. The approach taken relies on a useful property of the Fourier transform, namely, its ability to convert complicated integral equations into simple algebraic equations. The proposed estimator is shown to be asymptotically normal, and its rate of convergence in probability is derived as a function of the smoothness of the densities and conditional expectations involved. The resulting rates are often comparable to those of kernel deconvolution estimators, which provide consistent estimation under the much stronger assumption that the density of the measurement error is known. The finite-sample properties of the estimator are investigated through Monte Carlo experiments.

This work was made possible in part through financial support from the National Science Foundation via grant SES-0214068. The author is grateful to the referees and the co-editor for their helpful comments.
The bias resulting from the presence of measurement error in the explanatory variables is a common problem in regression analysis. Although numerous solutions to this problem have been derived for parametric regression models, the corresponding problem in nonparametric specifications has remained relatively unexplored.
Some aspects of the nonparametric errors-in-variables problem have been previously investigated. The problem of estimating the density of an unobserved variable when this variable is measured with error and when the density of the error is known has received considerable attention in the literature. In this setting, the so-called kernel deconvolution estimator (for a review of the extensive literature, see, e.g., Carroll and Hall, 1988; Liu and Taylor, 1989; Carroll, Ruppert, and Stefanski, 1995) has been shown to reach the optimal rate of convergence (Fan, 1991b). The problem of the nonparametric estimation of a regression function when the independent variable is measured with an error drawn from a known distribution has also been studied. In this case, a kernel regression estimator based on kernel deconvolution is known to achieve optimal convergence rates (Fan and Truong, 1993). A more challenging problem is the estimation of densities and regression functions when the independent variable is measured with an error drawn from an unknown distribution. Thanks to an identity due to Kotlarski (see Rao, 1992, p. 21), the identification of the density of an unobserved random variable is possible when the joint density of two error-contaminated measurements of that variable is known. Li and Vuong (1998) show that the empirical version of this identity leads to a consistent estimator with known convergence rates.
In contrast to the nonparametric density estimation problem, the nonparametric estimation of conditional expectations under similar conditions has so far remained unsolved. This is the gap our paper intends to fill by extending the traditional Nadaraya–Watson kernel regression estimator to allow for the independent variable to be contaminated with an error of unknown distribution. We show that the availability of two error-contaminated measurements of the independent variable is all that is needed to achieve identification. The usefulness of this result stems from the observation that although distributional assumptions are often not appropriate in applications, thus precluding the use of kernel deconvolution estimators, repeated measurements can frequently be found in data sets (Ashenfelter and Krueger, 1994; Hausman, Newey, and Powell, 1995; Morey and Waldman, 1998; Bowles, 1972; Borus and Nestel, 1973; Freeman, 1984).¹
¹Freeman's data set (the January 1977 Employer–Employee Matched Sample, Current Population Survey) contains wages reported by employers and employees, which are perfect examples of repeated measurements.
Our analysis not only derives the convergence rate of the proposed estimator but also provides its asymptotic distribution. The asymptotic properties of the estimator are analyzed through various analytical examples, and its finite-sample properties are investigated through Monte Carlo simulations that illustrate the bias-correcting power of our estimator. All proofs can be found in the Appendix.
To understand the difficulties faced in nonparametric estimation in the presence of measurement error, it is instructive to recall the well-known solution to the simpler problem of finding the density of an unobserved variable x* given an imperfect measurement z (for a review, see Carroll et al., 1995):

z = x* + Δz.    (1)

The measurement error Δz is usually assumed to be independent of x* and to be drawn from a known density. It is well known that the density of z is given by the convolution of the density of x* with the density of Δz. Thanks to the convolution theorem, this relationship can be concisely expressed using characteristic functions:

m(ν) = φ(ν)ψ(ν),    (2)

where φ(ν), m(ν), and ψ(ν), respectively, denote the characteristic functions of x*, z, and Δz. We can therefore identify the characteristic function of interest, φ(ν), through

φ(ν) = m(ν)/ψ(ν),    (3)
where m(ν) can be estimated by the Fourier transform of a nonparametric estimator of the density of z, such as a kernel estimator. The problem with this procedure arises from the fact that, under mild assumptions (such as assuming that the density of Δz is continuous), ψ(ν) vanishes as ν → ∞, so that this operation is not well defined for all ν. Hence, merely replacing m(ν) by a consistent estimate m̂(ν) may not yield a consistent estimate of φ(ν), because small errors in m̂(ν) are magnified by the arbitrarily large factor 1/ψ(ν). This is the well-known ill-posed inverse problem that occurs when one tries to invert a convolution operation. The so-called kernel deconvolution estimator (Carroll et al., 1995; Fan, 1991b) addresses this problem by estimating m(ν) using a kernel whose Fourier transform, κ(ν), is compactly supported. This ensures that the estimated characteristic function m̂(ν) is also compactly supported, which in turn guarantees that the numerator of equation (3) will vanish well before the denominator causes the ratio to diverge.
It is clear that truncating the characteristic function of z in this fashion introduces a bias. To obtain a consistent estimator, the support of κ(ν) is allowed to expand as sample size grows in such a way that the total integrated noise over all frequencies in the support of κ(ν) decreases. The faster ψ(ν) → 0 as ν → ∞, the more slowly the support of κ(ν) can expand with sample size and the slower the convergence rate. This is the fundamental difficulty associated with nonparametric estimation in the presence of measurement error. As the smoothness of the density of the measurement error increases, the characteristic function ψ(ν) goes to zero increasingly rapidly as ν → ∞ and the convergence rate worsens. The smoothness of the density of x* also plays a role in determining the convergence rate. The bias introduced by the truncation of m(ν) at a finite frequency is governed by the rate of decay of φ(ν) as ν → ∞. The smoother the density of x*, the faster its Fourier transform φ(ν) decays as ν → ∞, and the faster the bias decreases as the kernel bandwidth shrinks.
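For concreteness, the following minimal sketch implements this kernel deconvolution density estimator under the assumption of a known normal measurement error; the grid size, the sinc-type kernel (whose Fourier transform is the indicator of [−1,1]), and all function names are illustrative choices rather than the paper's notation.

```python
import numpy as np

def deconvolution_density(z, x_grid, h, sigma_dz):
    """Kernel deconvolution estimate of the density of x* from z = x* + dz,
    assuming dz ~ N(0, sigma_dz^2) is known and using a kernel whose Fourier
    transform kappa is 1 on [-1, 1], so kappa(h*nu) restricts |nu| <= 1/h."""
    nu = np.linspace(-1.0 / h, 1.0 / h, 513)          # frequency grid
    dnu = nu[1] - nu[0]
    # Empirical characteristic function m_hat(nu) = n^-1 sum_j exp(i nu z_j).
    m_hat = np.exp(1j * np.outer(nu, z)).mean(axis=1)
    # Characteristic function psi(nu) of the known error density.
    psi = np.exp(-0.5 * sigma_dz**2 * nu**2)
    phi_hat = m_hat / psi                             # equation (3), truncated
    # Fourier inversion: f_hat(x) = (2 pi)^-1 int phi_hat(nu) e^{-i nu x} dnu.
    return (np.exp(-1j * np.outer(x_grid, nu)) @ phi_hat).real * dnu / (2 * np.pi)
```

Shrinking h too quickly makes 1/ψ(ν) blow up on the frequency grid, which is precisely the ill-posedness discussed above.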
The literature focusing on kernel deconvolution estimators typically describes the smoothness of a density in terms of the asymptotic rate of decay of its Fourier transform as frequency ν goes to infinity. The basis for such a description is that the number of continuous derivatives a density admits is directly related to the asymptotic behavior of its Fourier transform as ν → ∞. This leads to the traditional distinction between “ordinarily smooth” functions (which admit a finite number of continuous derivatives and whose Fourier transform decays as |ν|^γ, γ < 0) and “supersmooth” functions (which admit an infinite number of continuous derivatives and whose Fourier transform decays as exp(α|ν|^β), α < 0, β > 0). Examples of ordinarily smooth densities are the gamma, uniform, and double exponential densities; the normal and Cauchy densities are supersmooth.
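As a concrete illustration (writing ψ for the characteristic function), consider two distributions that reappear in the Monte Carlo section; the Laplace form below follows from the scale normalization Var = σ²:

ψ(ν) = 1/(1 + σ²ν²/2) ∼ |ν|^{−2}    (Laplace L(0,σ²), ordinarily smooth),
ψ(ν) = exp(−(σ²/2)ν²)              (normal N(0,σ²), supersmooth).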
The kernel deconvolution estimator exhibits a wide variety of convergence rates depending on the smoothness of the densities involved. Whenever the densities of x* and of Δz are ordinarily smooth, the kernel deconvolution estimator will exhibit a rate of convergence of the form n^{−c} for some c > 0, where n is the sample size. The situation degrades significantly when the density of Δz is supersmooth while the density of x* remains ordinarily smooth. The convergence rate is then of the form (ln n)^{−c} for some c > 0, which is slower than any negative power of n.
The problem solved in this paper is more challenging than the one described above. First, we focus on a kernel regression estimator rather than a kernel density estimator. Second, we assume the density of the measurement error to be unknown.
Our task is to find a function ĝ(x̄*) such that plim_{n→∞} ĝ(x̄*) = g(x̄*) ≡ E[y | x* = x̄*]. We consider x* a scalar to simplify the exposition, although a multivariate extension is clearly possible.²

²As in any nonparametric regression, the well-known “curse of dimensionality” of course limits the number of dimensions that can be handled in practice.

In the absence of measurement error, this could be accomplished with the Nadaraya–Watson kernel estimator, evaluated at a given point x̄*,

Σ_{l=1}^{n} y_l K_h(x̄* − x_l*) / Σ_{l=1}^{n} K_h(x̄* − x_l*),

where x_l* and y_l for l = 1,…,n denote the data points and the kernel K_h(·) is of the form

K_h(x*) = h^{−1} K(h^{−1} x*)
and h is the bandwidth parameter. The problem we are facing is that x* is not observed. As shown in Schennach (2004), the availability of two repeated measurements of x*,

x = x* + Δx  and  z = x* + Δz,

provides enough information to identify any moment of the form E[u(y,x*)] for any function u(y,x*). Because the probability limit (at constant bandwidth h) of the Nadaraya–Watson kernel estimator is the ratio

E[y K_h(x̄* − x*)] / E[K_h(x̄* − x*)],

a similar technique can be applied here, setting u(y,x*) = y^k K_h(x̄* − x*), for k = 0,1. The extension of the existing results to a nonparametric setting nevertheless requires additional steps to handle the fact that we need to characterize an infinite family of moments, indexed by x̄* ∈ ℝ. Fortunately, this complication can be elegantly handled by observing that the convolution operations involved in computing the Nadaraya–Watson estimator are converted into simple products through the Fourier transform operation, enabling the whole family of moments to be estimated in a single operation. The formal result that permits identification is summarized in the following set of assumptions and associated theorem. Throughout the paper, we take the convention that integrals without explicit bounds are taken over the whole real line.
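Before turning to the assumptions, it is worth making the conversion into products explicit. Each of these moments is a convolution evaluated at x̄*, so that (in the notation introduced below, with f(x*) the density of x*):

Φ_k(x̄*) ≡ E[y^k K_h(x̄* − x*)] = ∫ E[y^k | x*] f(x*) K_h(x̄* − x*) dx* = (r_k * K_h)(x̄*),  where r_k(x*) ≡ E[y^k | x*] f(x*),

and hence, by the convolution theorem,

∫ Φ_k(x̄*) e^{iξx̄*} dx̄* = φ_k(ξ) κ(hξ),  with φ_k(ξ) ≡ E[y^k e^{iξx*}],

where φ_k(ξ) and κ(hξ) are the Fourier transforms of r_k and K_h, respectively.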
Assumption 1. Δz and x* are mutually independent.
Assumption 2. E[|x*|], E[|Δx|], and E[|y|] are finite.
Assumption 3. E[|y^k K_h(x̄* − x*)|] < ∞ for all x̄* ∈ ℝ, any h > 0, and k = 0,1.
THEOREM 1. Under Assumptions 1–3, and provided |E[e^{iξz}]| > 0 for any finite ξ, the function

Φ_k(x̄*) ≡ E[y^k K_h(x̄* − x*)],    (12)

for x̄* ∈ ℝ and k = 0,1, can be expressed solely in terms of moments that involve the observable variables y, x, and z:

Φ_k(x̄*) = (2π)^{−1} ∫ κ(hξ) φ_k(ξ) e^{−iξx̄*} dξ,    (13)

where φ_k(ξ) ≡ E[y^k exp(iξx*)] is given by³

φ_0(ξ) = exp( ∫_0^ξ i m_x(ζ)/m_1(ζ) dζ ),    (14)

φ_1(ξ) = (m_y(ξ)/m_1(ξ)) φ_0(ξ),    (15)

where κ(ξ) ≡ ∫ K(x*) e^{iξx*} dx* is the Fourier transform of the kernel K(x*) and, for a = 1, x, y,

m_a(ξ) ≡ E[a e^{iξz}].    (16)

³Equation (14) is similar to an identity derived by Kotlarski (see Rao, 1992, p. 21), but our proof of this result requires weaker independence assumptions. In particular, we do not require independence between Δx and x* and between Δx and Δz.

Note that knowledge of the moments m_a(ξ), for a = 1, x, y, which involve observable variables only, is sufficient to identify Φ_k(x̄*). Because the moments m_a(ξ) can be estimated from the corresponding sample averages, we propose the following estimator.
DEFINITION 1. Let (x_i, y_i, z_i), for i = 1,…,n, denote a sample of size n. For a given x̄* ∈ ℝ and some sequence of bandwidths h_n → 0, let

ĝ(x̄*) ≡ Φ̂_1(x̄*)/Φ̂_0(x̄*),    (17)

where, for k = 0,1,

Φ̂_k(x̄*) = (2π)^{−1} ∫ κ(h_n ξ) φ̂_k(ξ) e^{−iξx̄*} dξ,    (18)

with

φ̂_0(ξ) = exp( ∫_0^ξ i m̂_x(ζ)/m̂_1(ζ) dζ ),    (19)

φ̂_1(ξ) = (m̂_y(ξ)/m̂_1(ξ)) φ̂_0(ξ),    (20)

and where, for a = 1, x, y,

m̂_a(ξ) ≡ n^{−1} Σ_{i=1}^{n} a_i e^{iξz_i}.    (21)
An interesting property of this estimator is that it reduces to the Nadaraya–Watson estimator in the absence of measurement error (i.e., when z = x = x*). Indeed, in that case, i m̂_x(ζ)/m̂_1(ζ) = (dm̂_1(ζ)/dζ)/m̂_1(ζ), and equation (19) can be integrated analytically to yield φ̂_0(ξ) = m̂_1(ξ), thus implying that equation (20) becomes φ̂_1(ξ) = m̂_y(ξ). With these equalities in mind, equation (18) then defines the Fourier representation of the numerator and the denominator of the Nadaraya–Watson estimator.
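As an illustration, here is a minimal numerical sketch of the estimator in Definition 1 as reconstructed above; the frequency grid, the trapezoid integration, and the sinc-type kernel (κ = 1 on [−1,1], so that κ(h_nξ) simply truncates the integrals at |ξ| = 1/h_n) are illustrative implementation choices, not the paper's prescription.

```python
import numpy as np

def g_hat(y, x, z, x_bar, h, n_grid=1001):
    """Sketch of Definition 1: g_hat(x_bar) = Phi_1/Phi_0, using a kernel
    whose Fourier transform is 1 on [-1, 1] and 0 elsewhere."""
    xi = np.linspace(-1.0 / h, 1.0 / h, n_grid)   # odd length, so grid contains 0
    dxi = xi[1] - xi[0]
    e = np.exp(1j * np.outer(xi, z))              # e^{i xi z_i}
    m1 = e.mean(axis=1)                           # m_hat_1(xi), equation (21)
    mx = (e * x).mean(axis=1)                     # m_hat_x(xi)
    my = (e * y).mean(axis=1)                     # m_hat_y(xi)
    integrand = 1j * mx / m1
    # Antiderivative by the trapezoid rule, anchored so the integral starts at 0.
    F = np.concatenate(([0.0], np.cumsum(0.5 * (integrand[:-1] + integrand[1:]) * dxi)))
    F -= F[n_grid // 2]
    phi0 = np.exp(F)                              # equation (19)
    phi1 = (my / m1) * phi0                       # equation (20)
    # Phi_hat_k = (2 pi)^-1 int kappa(h xi) phi_hat_k(xi) e^{-i xi x_bar} dxi.
    w = np.exp(-1j * xi * x_bar) * dxi / (2.0 * np.pi)
    return (w @ phi1).real / (w @ phi0).real      # equations (17)-(18)
```

Note how small values of |m̂_1(ζ)| on the grid, which occur for very small bandwidths, produce the near division by zero mentioned in the Monte Carlo section.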
To ensure that the proposed estimator is well behaved, we need to make the following assumption.
Assumption 4. The Fourier transform of the kernel, κ(ξ), is (i) bounded and (ii) compactly supported (without loss of generality, we consider the support to be [−1,1]).
The boundedness of κ(ξ) is a very weak requirement because any kernel K(z) violating it would necessarily fail to be absolutely integrable. The assumption of compact support of κ(ξ) is commonly made in the derivation of the asymptotic properties of kernel deconvolution estimators (Fan and Truong, 1993). The need for this assumption arises from the fact that the estimator involves a division by an asymptotically vanishing characteristic function. Under very mild smoothness requirements, characteristic functions decay to zero as frequency increases toward infinity. A compactly supported kernel (in Fourier representation) explicitly makes the frequency range considered in a given sample finite, ensuring that the divergence is kept under control.
The restriction of compact support (in Fourier representation) poses few problems in practice, because one can take any given kernel K(x*) and construct a modified kernel K̄(x*) that exhibits most of the properties of the original kernel, while possessing a compact support in Fourier representation. This is achieved by computing the Fourier transform κ(ξ) of the original kernel K(x*) and multiplying it by a “windowing” function W(ξ) that vanishes beyond a given frequency:

κ̄(ξ) = κ(ξ) W(ξ).

Judicious choice of a windowing function will ensure that the modified kernel K̄(x*) keeps most of the properties of the original kernel. For instance, a windowing function such as the one given in equation (23), which is identically 1 for |ξ| ≤ ξ̄, decreases smoothly, and vanishes for |ξ| ≥ 1, will leave the order of the kernel unaffected, because the windowing function is constant in the neighborhood of the origin. The fact that this windowing function is infinitely many times differentiable will guarantee that the modified kernel K̄(x*) decays faster than any power of x* as |x*| → ∞ (provided that the original kernel K(x*) had this property).
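A concrete C^∞ window with exactly these properties can be built from the standard bump-function construction; the following sketch is offered as an illustrative stand-in, since the exact form of the paper's equation (23) is not reproduced here:

```python
import numpy as np

def smooth_window(xi, xi_bar=0.5):
    """C-infinity window: equals 1 for |xi| <= xi_bar, 0 for |xi| >= 1, with an
    infinitely differentiable monotone transition in between (bump-function
    partition of unity; one of many valid choices)."""
    a = np.abs(np.asarray(xi, dtype=float))
    out = np.zeros_like(a)
    out[a <= xi_bar] = 1.0
    mid = (a > xi_bar) & (a < 1.0)
    t = (a[mid] - xi_bar) / (1.0 - xi_bar)        # rescaled to (0, 1)
    f = lambda s: np.exp(-1.0 / s)                # f and all derivatives -> 0 at 0+
    out[mid] = f(1.0 - t) / (f(t) + f(1.0 - t))
    return out
```

Multiplying the Fourier transform of a given kernel by this window yields a kernel satisfying Assumption 4 while leaving the kernel's order unaffected.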
This section is organized as follows. To facilitate the analysis of the asymptotic properties of the proposed estimator ĝ(x̄*), we first provide a linear representation of this estimator, denoted ĝ_L(x̄*), that will be shown to be asymptotically equivalent to ĝ(x̄*). This linearization serves two purposes. First, it will enable the derivation of the convergence rate of the estimator using techniques that are analogous to the standard bias and variance decomposition used in the context of conventional kernel estimators. Second, a linear representation is essential to establish the asymptotic normality of the estimator.
In this section, we provide very general results that summarize the properties of a linearized estimator ĝ_L(x̄*) that will be used to establish the asymptotic properties of ĝ(x̄*). The form of the estimator prompts two levels of linearization. First, as is commonly done in the analysis of nonparametric conditional expectation kernel estimators, the ratio Φ̂_1(x̄*)/Φ̂_0(x̄*) in equation (17) is expanded in a Taylor series up to first order. Second, unlike the usual Nadaraya–Watson estimator and kernel deconvolution estimators, the Φ̂_k(x̄*) themselves take the form of nonlinear functionals of the data generating process. It is thus convenient to carry out the linearization a step further by calculating the Fréchet derivative of Φ̂_k(x̄*) with respect to the estimated moments m̂_a(ξ) in the vicinity of the true moments m_a(ξ).⁴ The following definition gives a linearized version ĝ_L(x̄*) of the estimator ĝ(x̄*).

⁴The calculation of the Fréchet derivative can be found in the proof of Lemma 2 in the Appendix.
DEFINITION 2. For x̄* ∈ ℝ, let ĝ_L(x̄*) denote the linearization of ĝ(x̄*) obtained by expanding the ratio in equation (17) in a Taylor series up to first order around (Φ_1(x̄*), Φ_0(x̄*)) and by replacing each Φ̂_k(x̄*), for k = 0,1, by its Fréchet-linear approximation in the estimated moments m̂_a(ξ) around the true moments m_a(ξ), where Φ_k(x̄*) is given by equation (13).
The advantage of the linear representation provided by Definition 2 is that it is possible to decompose the error ĝ_L(x̄*) − g(x̄*) into well-defined “bias” and “variance” terms, as given by Lemma 1, which follows.
Assumption 5. (y_i, x_i, z_i, x_i*, Δy_i, Δx_i, Δz_i) for i = 1,…,n is an independent and identically distributed (i.i.d.) sequence.

Assumption 6. E[y^{2−j}|z|^{j}] < ∞ and E[x^{2−j}|z|^{j}] < ∞ for j = 0,1.

Assumption 7. The density of x* is nonzero at x̄*.
LEMMA 1. Under Assumptions 1–7, for x̄* ∈ ℝ,

ĝ_L(x̄*) − g(x̄*) = B(x̄*) + V(x̄*),    (24)

where the bias term B(x̄*) ≡ E[ĝ_L(x̄*)] − g(x̄*) and the variance term V(x̄*) ≡ ĝ_L(x̄*) − E[ĝ_L(x̄*)] are given explicitly in equations (30)–(32): V(x̄*) is expressed in equation (30) in terms of the linearized Φ̂_k(x̄*) and the Φ_k(x̄*), and Var[V(x̄*)] is expressed in equations (31) and (32) in terms of the covariances Ω_{k_1 k_2}(ξ_1, ξ_2), for k_1, k_2 = 0,1, of the estimated moments, with the weighting functions U_a^{k}(ζ, x̄*, h_n) defined in equation (26) and Ω_{k_1 k_2}(ξ_1, ξ_2) in equation (27), where ĝ_L(x̄*) is given in Definition 2, where φ_k(ξ) for k = 0,1 is given in Theorem 1, and where † denotes complex conjugation.
Under our assumptions, the expectation and the variance of ĝ_L(x̄*) are well defined, even though the corresponding moments of ĝ(x̄*) may not exist. As long as the remainder ĝ(x̄*) − ĝ_L(x̄*) can be shown to be asymptotically negligible in probability, the mean and the variance of ĝ_L(x̄*) can be interpreted as the mean and the variance of the limiting distribution of ĝ(x̄*), whether or not the first two moments of ĝ(x̄*) are bounded. This situation is not unique, as these observations apply to any estimator involving ratios of random quantities. To ascertain that the linear approximation ĝ_L(x̄*) is appropriate, the following lemma provides the order of the remainder of the linearization of ĝ(x̄*) and also the order of the statistical fluctuations in ĝ_L(x̄*). This result is included for completeness, but it is not essential for the reader to master it to understand the main results of the subsequent sections.
LEMMA 2. Let Assumptions 1–7 hold and let, for φ_0(ζ), φ_1(ζ), and m_1(ζ) as in Theorem 1, U(h_n) and λ(h_n) denote the bounding sequences defined in equations (34) and (35) in terms of these functions and of φ_0′(ξ) ≡ dφ_0(ξ)/dξ.⁵ If (i) h_n → 0, (ii) U(h_n)n^{−1/2} → 0, and (iii) λ(h_n)n^{−1/2+ε} → 0 for some ε > 0, then

V(x̄*) = O_p(U(h_n)n^{−1/2})    (36)

and

ĝ(x̄*) − ĝ_L(x̄*) = O_p(U(h_n)n^{−1/2} λ(h_n)n^{−1/2+ε}).    (37)

If, in addition, (iv) U(h_n)n^{−1/2}λ(h_n)n^{−1/2+ε} = o((Var[V(x̄*)])^{1/2}), then

(ĝ(x̄*) − g(x̄*) − B(x̄*))/(Var[V(x̄*)])^{1/2} has the same limiting distribution as V(x̄*)/(Var[V(x̄*)])^{1/2}.    (38)

⁵Note that the ratio |φ_0′(ζ)|/|φ_0(ζ)| entering the definitions of λ(h_n) and U(h_n) can equivalently be written as |m_x(ζ)|/|m_1(ζ)|, because |m_x(ζ)|/|m_1(ζ)| = |E[x e^{iζz}]|/|E[e^{iζz}]| = |E[x* e^{iζz}]|/|E[e^{iζz}]| = (|E[x* e^{iζx*}]|/|E[e^{iζx*}]|)(|E[e^{iζΔz}]|/|E[e^{iζΔz}]|) = |E[x* e^{iζx*}]|/|E[e^{iζx*}]| = |φ_0′(ζ)|/|φ_0(ζ)|.
The quantity U(h_n) is defined so that it bounds any of the quantities defined in equation (26) that enter the expression of the asymptotic variance of the estimator, whereas λ(h_n) bounds the remainder terms from the linearization performed in Definition 2. As expected, the preceding stochastic expansion is written in terms of successive powers of n^{−1/2}, with the exception that the second term is proportional to n^{−1+ε} instead of n^{−1}, because bounding the second remainder term involves uniformly bounding various random functions, which slows the rate down by a factor n^{ε}.
In the proof of our convergence rate and asymptotic normality results, we will subsequently verify that the hypotheses of Lemma 2 are implied by more primitive regularity conditions. The first conclusion of the lemma (equations (36) and (37)) will be sufficient to obtain the convergence rate of the estimator. Indeed, if it can be shown that λ(h_n)n^{−1/2+ε} → 0, the convergence rate is then simply given by O_p(U(h_n)n^{−1/2}). Because O_p(U(h_n)n^{−1/2}) is an upper bound on the convergence rate, which may or may not be binding, the second, slightly stronger conclusion of Lemma 2 (equation (38)) will be needed to obtain the limiting distribution of the estimator. The basic intuition behind the additional condition (iv) is that, for the O_p(U(h_n)n^{−1/2}λ(h_n)n^{−1/2+ε}) nonlinear remainder to have no effect on the limiting distribution, it must be asymptotically negligible relative to the exact standard deviation of ĝ_L(x̄*), which is given by (Var[V(x̄*)])^{1/2}, by Lemma 1.
We now provide primitive regularity conditions that will enable us to derive explicit convergence rates. These regularity conditions take the form of smoothness restrictions imposed via constraints on the tail behavior of various Fourier transforms. To specify the regularity conditions, we employ the following convenient notation.
DEFINITION 3. An expression of the form f(ζ) ≼ g(ζ) for ζ ∈ ℝ indicates that there exists a constant C > 0, independent of ζ, such that f(ζ) ≤ Cg(ζ) for all ζ ∈ ℝ (and similarly for ≽). Analogously, a_n ≼ b_n for two sequences a_n, b_n indicates that there exists a constant C independent of n such that a_n ≤ Cb_n for all n ∈ ℕ.
The literature focusing on “kernel deconvolution estimators” (see, e.g., Carroll et al., 1995) and related estimators (Fan and Truong, 1993) traditionally distinguishes between “ordinarily smooth” functions (whose Fourier transform decays as |ζ|^γ, γ < 0 as |ζ| → ∞) and “supersmooth” functions (whose Fourier transform decays as exp(α|ζ|^β), α < 0, β > 0 as |ζ| → ∞). For conciseness, our regularity conditions are given in terms of expressions of the form (1 + |ζ|)^γ exp(α|ζ|^β), thereby simultaneously covering the ordinarily smooth and supersmooth cases.
Assumption 8. The functions φ_0(ζ) = E[e^{iζx*}], φ_0′(ζ) ≡ dφ_0(ζ)/dζ, φ_1(ζ) = E[y e^{iζx*}], and m_1(ζ) = E[e^{iζz}] satisfy

|φ_0′(ζ)|/|φ_0(ζ)| ≼ (1 + |ζ|)^{γ_r}    (39)

for some γ_r ≥ 0 and

|φ_k(ζ)| ≼ (1 + |ζ|)^{γ_φ} exp(α_φ|ζ|^{β_φ})  for k = 0,1,    (40)

|m_1(ζ)| ≽ (1 + |ζ|)^{γ_m} exp(α_m|ζ|^{β_m}),    (41)

for some α_φ ≤ 0, β_φ ≥ 0, α_m ≤ 0, β_m ≥ 0, γ_φ, and γ_m such that γ_φ β_φ ≥ 0 and γ_m β_m ≥ 0.
A few remarks are in order. While the rate of decay of φ0(ζ), the characteristic function of x*, is entirely determined by the smoothness of the density f (x*) of x*, the rate of decay of φ1(ζ) is governed by the smoothness of f (x*)E [y|x*]. Verifying equation (40) would first involve finding bounds on |φ0(ζ)| and |φ1(ζ)| individually before taking the most slowly decaying term. Regrouping φ0(ζ) and φ1(ζ) in a single assumption is possible without loss of generality, because both quantities enter the expression of the estimator in a similar fashion. This grouping is also notationally convenient, as it will reduce the number of independent orders of magnitude that have to be considered when determining the convergence rates of the estimator.
As is always the case in deconvolution-type estimators, one quantity (here m_1(ζ)) needs to be bounded below (in equation (41)), instead of above, because it appears in a denominator in the expression of the estimator. Note that equation (41) is implied by separate lower bounds on the modulus of the characteristic functions of x* and Δz, because m_1(ζ) = E[e^{iζz}] = E[e^{iζx*}]E[e^{iζΔz}]. The grouping of E[e^{iζx*}] and E[e^{iζΔz}] is also aimed at reducing the notational burden. Although the constraint on the ratio φ_0′(ζ)/φ_0(ζ) imposed by equation (39) may appear unusual, it is clearly implied by a more familiar upper bound on |φ_0′(ζ)| combined with a lower bound on |φ_0(ζ)|. The absence of a term of the form exp(α_r|ζ|^{β_r}) in equation (39) results in very little loss of generality, because all common ordinarily smooth and supersmooth functions are such that equation (39) holds for γ_r = 1.
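For instance, equation (39) with γ_r = 1 can be verified directly for a standard normal and a standard Cauchy x*:

φ_0(ζ) = exp(−ζ²/2)  ⟹  |φ_0′(ζ)|/|φ_0(ζ)| = |ζ| ≤ (1 + |ζ|)¹,
φ_0(ζ) = exp(−|ζ|)   ⟹  |φ_0′(ζ)|/|φ_0(ζ)| = 1  ≤ (1 + |ζ|)¹.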
Before we can derive the convergence rate of the estimator, we also need to characterize the type of kernel K(x*) used. While most studies of measurement error in nonparametric settings focus either on finite-order kernels (Fan, 1991b; Fan and Truong, 1993) or on infinite-order kernels (Politis and Romano, 1999; Li and Vuong, 1998), we will consider both finite- and infinite-order kernels. The traditional finite-order kernels we consider are defined in Assumption 9.
Assumption 9. ∫ K(x*) dx* = 1 and, for some integer γ_κ > 0, ∫ (x*)^{j} K(x*) dx* = 0 for j = 1,…,γ_κ − 1 and

∫ |x*|^{γ_κ} |K(x*)| dx* < ∞.    (44)
We also consider the following class of “infinite-order” kernels.
Assumption 10. The Fourier transform of the kernel, κ(ξ), is such that κ(ξ) = 1 for |ξ| ≤ ξ̄ for some ξ̄ > 0.
Assumption 10 allows for a kernel of the form

K(x*) = sin(x*)/(πx*),

which is particularly suited to the Fourier representation because its Fourier transform is 1 in the [−1,1] interval and zero elsewhere. This type of kernel has previously been used in other Fourier-based estimators (Li and Vuong, 1998) and amounts to truncating the Fourier transform above a given frequency. When both E[y|x*] and the density of x* are infinitely many times differentiable, an infinite-order kernel will guarantee that the bias goes to zero faster than any power of the bandwidth. The bias could then, for instance, be an exponentially decaying function of the inverse bandwidth h^{−1}.
The procedure to determine the asymptotic rates of pointwise convergence in probability can be outlined as follows.
Calculation of the bias B(x̄*). We distinguish two cases, depending on whether the kernel used satisfies Assumption 9 or Assumption 10. In the following two lemmas, recall that the parameters γ_φ, α_φ, and β_φ, defined in Assumption 8, describe the smoothness of the density f(x*) of x* and of the conditional expectation E[y|x*] by specifying that their Fourier transforms both decay at least as fast as (1 + |ζ|)^{γ_φ} exp(α_φ|ζ|^{β_φ}) as frequency ζ → ∞.
LEMMA 3. Under Assumptions 1–8, if the kernel is of order γ_κ, as defined by Assumption 9, then the bias satisfies

B(x̄*) = O((1 + h^{−1})^{γ_b} exp(α_b(h^{−1})^{β_b})),

where α_b = 0, β_b = 0, and γ_b = −γ, with γ = γ_κ if α_φ ≠ 0 and, if α_φ = 0, γ the largest integer such that γ ≤ γ_κ and γ < −γ_φ − 1.
LEMMA 4. Under Assumptions 1–8, if the kernel satisfies Assumption 10 for some constant ξ̄, then the bias satisfies

B(x̄*) = O((1 + h^{−1})^{γ_b} exp(α_b(h^{−1})^{β_b})),

where γ_b = γ_φ + 1, α_b = α_φ ξ̄^{β_φ}, and β_b = β_φ.
In short, when a finite-order kernel is used, the rate of decrease of the bias is controlled either by the order of the kernel γκ or by the smoothness of f (x*) and E [y|x*] , whichever is more limiting. In particular, when both f (x*) and E [y|x*] are supersmooth, so that αφ ≠ 0, it is the order of the kernel that determines the rate of decrease of the bias. When an infinite-order kernel is used, only the smoothness of f (x*) and E [y|x*] matters. Note that the bias term is identical to that of a traditional kernel estimator that would be used if x* were perfectly observed, because, via equations (12) and (13), the bias can be expressed entirely in terms of φk(ζ) for k = 0,1 and the kernel, which are nonrandom measurement error-free quantities.
Calculation of the order of the variance term V(x̄*).
LEMMA 5. Under Assumptions 1–8, the variance term satisfies

V(x̄*) = O_p((1 + h_n^{−1})^{γ_v} exp(α_v(h_n^{−1})^{β_v}) n^{−1/2}),

where γ_v = 2 + γ_φ − γ_m + γ_r, α_v = −α_m, and β_v = β_m.
Note that the order of the variance term is determined not only by the smoothness of f(x*) and E[y|x*] (through γ_φ, α_φ, β_φ, and γ_r) but also by the smoothness of the density of the measurement error Δz (through the terms γ_m, α_m, and β_m). It is important to point out that the variance term increases much faster as h → 0 (at constant n) than that of a standard kernel estimator with perfectly observed variables (whose variance term is O_p((h_n n)^{−1/2})). Combined with the fact that the bias term is unchanged, as indicated in step 1, this implies that the achievable convergence rates will generally be slower than for a conventional kernel estimator.
Determination of the rate of decrease of the bandwidth that offers the best trade-off between bias squared and variance. To obtain explicit rates of convergence, we need to distinguish various cases, based on the values of β_b, which characterizes the rate of convergence of the bias term as the bandwidth shrinks, and β_v, which characterizes the rate of divergence of the variance term as the bandwidth shrinks (at constant sample size). Both β_b and β_v represent an “exponent of supersmoothness,” that is, the constant β in an expression of the form (h_n^{−1})^γ exp(α(h_n^{−1})^β).
THEOREM 2. Under Assumptions 1–8 and either Assumption 9 or 10, the optimal bandwidth choices and the corresponding convergence rates in probability of the estimator can be expressed in terms of the constants α_b, β_b, γ_b, α_v, β_v, γ_v defined by Lemmas 3–5. Let ε > 0 be arbitrarily small, let C_1, C_2 be some positive constants, and let x̄* ∈ ℝ be given.
Case 1. If βv > βb > 0
Case 2. If βv > 0 and βb = 0 (with αb = 0 and γb < 0)
Case 3. If βb = βv ≠ 0
Case 4. If βb = βv = 0 (with αb = αv = 0 and γb < 0)
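As a worked illustration of the balancing that underlies Case 4, treat the upper bounds of Lemmas 3 and 5 as if they were exact orders (consistent with the bandwidth used in the proof of Theorem 2):

|B(x̄*)| ≍ (h_n^{−1})^{γ_b}  (γ_b < 0)  and  |V(x̄*)| ≍ (h_n^{−1})^{γ_v} n^{−1/2}  (γ_v > 0),

so equating the two orders yields (h_n^{−1})^{γ_v − γ_b} = n^{1/2}, that is, h_n^{−1} = n^{1/(2γ_v − 2γ_b)},

and the resulting convergence rate is ĝ(x̄*) − g(x̄*) = O_p(n^{γ_b/(2γ_v − 2γ_b)}).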
A few remarks are in order. First, it can be verified (see the proof of Theorem 2 in the Appendix) that the bandwidth sequences given above are such that conditions (i) and (ii) of Lemma 2 hold, thus implying that the nonlinear remainders are indeed negligible and that our simple bias–variance decomposition is justified. Second, the arbitrarily small ε was introduced to drastically simplify the calculations and the statement of the results at the expense of a very small loss in precision. Third, it is impossible to have β_b > β_v, because β_b = β_φ, β_v = β_m, and β_m ≥ β_φ: since m_1(ζ) = E[e^{iζx*}]E[e^{iζΔz}], the characteristic function of z cannot decay more slowly than that of x*.
The convergence rate of the proposed estimator varies substantially as a function of the smoothness of the densities and the conditional expectations involved. An important trend to observe among these rates is that large values of βb (indicating a rapidly decreasing bias as h → 0) and small values of βv (indicating a slowly increasing variance as h → 0) are desirable. The convergence rates obtained are typically slower than that of the Nadaraya–Watson kernel estimator used when the variables are perfectly observed. This limitation is not an artifact of our estimation procedure: it has also been observed in the simpler Fan and Truong estimator, which is known to be optimal under stronger assumptions than ours (see Fan and Truong, 1993). The different cases will be discussed—and compared to Fan and Truong's findings—in more detail in Section 4.
Although we have focused on pointwise convergence rates, our results also provide information regarding global convergence rates. The upper bounds on the pointwise bias and variance (and on the nonlinear remainder terms) are in fact independent of x̄*. If the density of x* is bounded away from zero over some finite interval [a,b], it is straightforward to show that the weighted error (∫_a^b |ĝ(x̄*) − g(x̄*)|^p w(x̄*) dx̄*)^{1/p} converges to zero in probability at the same rate as the pointwise rates derived earlier, for any bounded weighting function w(x̄*) and any p ∈ [1,2]. However, rates of uniform convergence in probability do not follow directly from the results presented above.
To establish the asymptotic normality of the proposed estimator, we need to introduce a few additional assumptions. First, we need assumptions that are commonly made whenever a central limit theorem for triangular arrays is invoked (see, e.g., Härdle and Linton, 1994, Theorem 2; Andrews, 1991, Assumption A).
Assumption 11. There exist constants C, δ > 0 such that E[|x|^{2+δ}|z] ≤ C and E[|y|^{2+δ}|z] ≤ C for all z, and a constant C′ > 0 such that Var[x|z] ≥ C′ and Var[y|z] ≥ C′ for all z.⁶

⁶The familiar condition E[|K(x*)|^{2+δ}] < ∞, which is helpful to show the asymptotic normality of standard kernel estimators, is of no use in establishing the asymptotic normality of our more complex estimator. In any case, Assumption 4 implies that E[|K(x*)|^{2+δ}] < ∞.
The remaining assumptions are used to ensure that condition (iv) of Lemma 2 holds, so that the higher order remainder terms are asymptotically negligible relative to the standard deviation of the linearized estimator ĝ_L(x̄*). The main obstacle to overcome is the necessity of finding a lower bound for the variance Var[ĝ_L(x̄*)] of the estimator. The difficulty of obtaining such a result is noted by Fan (1991a) in his study of the limiting distribution of the kernel deconvolution estimator. Fan's solution to this problem is simply to assume that the tails of the various Fourier transforms entering the estimator are not only bounded by some function of the form ζ^γ exp(α|ζ|^β) but are asymptotically equal (as |ζ| → ∞) to such a functional form, thereby limiting the set of allowed functions. Our solution to this problem is similar in spirit to Fan's but considerably expands the range of possible behavior toward infinity by employing the concept of functions that are “well behaved at infinity,” as described by Lighthill (1962).⁷ The following definition formalizes this notion.

⁷We expand Lighthill's definition by allowing for exponential tails, which is essential to handle supersmooth functions.
DEFINITION 4. Let 𝒲 be the set of all functions ψ: ℝ → ℂ such that (i) ψ(ζ) is absolutely integrable over every finite interval and (ii) ∫_{|ζ|≥T} |ψ(ζ) − Ψ(ζ)| dζ < ∞ for some T > 0 and some function Ψ(ζ) that can be written as a finite linear combination of finite products of functions of the form |ζ|^c, sgn(ζ)|ζ|^c, ln|ζ|, sin(cζ), cos(cζ), and exp(cζ^γ), for constants c and γ > 0.
Assumption 12. For a given x̄* ∈ ℝ, the functions U_a^{k}(ζ, x̄*, 0) and dU_a^{k}(ζ, x̄*, 0)/dζ, for k = 0,1 and a = 1, x, y and for U_a^{k}(ζ, x̄*, h) given in equation (26), belong to 𝒲.
For simplicity, we do not state Assumption 12 in terms of elementary quantities such as m_1(ζ) and φ_k(ζ), but it is clear that Assumption 12 is only a few algebraic manipulations away from being a primitive condition. We need to constrain the derivative of U_a^{k}(ζ, x̄*, 0) to rule out counterexamples in which the density of z arbitrarily far away from the point of evaluation x̄* could have a nonvanishing influence on the variance of ĝ_L(x̄*) asymptotically, making it difficult to characterize the behavior of the variance as n → ∞.
The following condition requires the distribution of z to be supported on the whole real line, which is usually the case in deconvolution problems, because distributions that have a nonvanishing characteristic function (as imposed by equation (41) in Assumption 8) rarely have compact support.

Assumption 13. f(z) > 0 for all z ∈ ℝ.
Finally, we need to impose a few constraints that would be very difficult to state in a more primitive fashion. However, these assumptions are not very restrictive because the counterexamples violating them are somewhat contrived.
Assumption 14. Var[ĝ_L(x̄*)] is of an order no smaller than that of any of the individual terms entering its asymptotic representation.

This assumption merely states that the variance of the estimator is of an order no less than any term in its asymptotic representation. This constraint can only be violated if two or more of those terms happen to cancel out asymptotically, which is unlikely because each term depends on different random quantities.
Assumption 15. For Ω_{k_1 k_2}(ξ_1, ξ_2) as in equation (27), the integrals defining Var[Φ̂_k(x̄*)] are of the same order as the corresponding integrals of the absolute values of their integrands, for k = 0,1 and all n sufficiently large.

This assumption requires that these two integrals be of the same order. It precludes Ω_{k k}(ξ_1, ξ_2) from having an oscillatory behavior (as ξ varies) such that a precise cancellation would occur between its values at different ξ during the integration. The cancellation would have to occur for all ζ and n sufficiently large and be such that the order of Var[ĝ_L(x̄*)] would be affected.
Assumptions 12–15 imply condition (iv) in Lemma 2, thus establishing the required asymptotic negligibility of the nonlinear remainder terms. If it is possible to calculate the ratio U(h_n)n^{−1/2}λ(h_n)n^{−1/2+ε}/(Var[V(x̄*)])^{1/2} directly and verify that it goes to 0 asymptotically, then Assumptions 12–15 can be avoided altogether.⁸

⁸And the term n^{1/(3+2γ_r−2γ_m)} in equation (64) can be replaced by n^{1/(2+2γ_r−2γ_m)}.
THEOREM 3. Under Assumptions 1–8 and 11–15, for any given x̄* ∈ ℝ and any sequence h_n satisfying h_n^{−1} ≼ n^{−η} n^{1/(3+2γ_r−2γ_m)} if α_m = 0, or (h_n^{−1})^{β_m} ≤ ((1 − η)/(−2α_m)) ln n if α_m ≠ 0, for some η > 0, we have

(ĝ(x̄*) − g(x̄*) − B(x̄*)) / (Var[V(x̄*)])^{1/2} →_d N(0,1),

where B(x̄*) and Var[V(x̄*)] are given in Lemma 1.
Section 3.3 derives the convergence rates of the proposed estimator under very general conditions. We now focus on specific examples that will allow us to compare these convergence rates with those derived for the estimator proposed by Fan and Truong (1993), which is the most closely related to ours. Fan and Truong's estimator extends the standard kernel deconvolution estimators used for density estimation in the presence of a measurement error drawn from a known distribution to the case of nonparametric regressions. The estimator presented here accomplishes a more difficult task than Fan and Truong's because it considers the density of the measurement error unknown, relying instead on two error-contaminated measurements of the unobserved regressor. Hence, it would come as no surprise if the kernel deconvolution rates were better. The comparison is nevertheless instructive, because it quantifies the precision loss incurred by relaxing the distributional assumptions regarding the measurement error.
We consider four examples. We first study the “difficult” deconvolution problem that consists of estimating an ordinarily smooth conditional expectation (E [y|x*]) when the density of both the true regressor x* and the measurement error Δz are supersmooth. This problem is difficult because a supersmooth measurement error strongly damps out the high-frequency components of E [y|x*] and of the density of x*. Inverting this operation involves the amplification of these damped-out components, an operation that necessarily causes a substantial magnification of the statistical noise. In standard kernel deconvolution estimators, this situation gives rise to extremely slow convergence rates, and it is instructive to verify that the situation does not degrade further when the distribution of the measurement error is unknown. The second example shows that this slow convergence problem is avoided when the conditional expectation E [y|x*] is supersmooth as well. The third example assumes the density of the measurement error is ordinarily smooth, a situation that avoids the slow convergence problem for the kernel deconvolution estimator but, as we will see, not for our estimator. The final example completes the analysis by showing that when all quantities are ordinarily smooth, the slow convergence problem is avoided.
Table 1 summarizes the assumptions made in each of the four cases considered. A few remarks are in order. In each case, we assume that the order of the kernel is sufficiently large so that the smoothness of E [y|x*] and of the density of x* (and not the order of the kernel) is the factor limiting the rate at which the bias goes to zero. We also assume that equation (39) holds with γr = 1. Table 1 also summarizes the convergence rates obtained by applying Theorem 2 in each of the four examples considered. We will now discuss the significance of these results.
Table 1. Convergence rates obtained under given regularity assumptions
In Example 1, the rates are entirely comparable to those obtained by Fan and Truong (1993) for kernel deconvolution estimators. They found rates of the form (ln n)^{−k/β}, where k is the number of continuous derivatives that g(x*) possesses. Because a function whose Fourier transform behaves asymptotically as ζ^{−(k+1+ε)} necessarily has k continuous derivatives, it is clear that the rates are comparable. The rates differ by ε, because Fan and Truong formulate their regularity conditions in terms of derivatives whereas we formulate them in terms of the asymptotic behavior of Fourier transforms. Formulating our regularity conditions in terms of derivatives would yield results identical to Fan and Truong's. It is remarkable that under the assumptions leading to the worst-case convergence rates for kernel deconvolution estimators, the assumption of a known measurement error distribution can be relaxed without bringing the convergence rate down further.
Example 2 shows that the slow convergence rate problem can be alleviated if the unknown regression function g(x*) is supersmooth and if an “infinite-order” kernel is used. This situation ensures that the bias term goes to zero faster than any power of h, which is sufficient to convert a convergence rate of the form (ln n)^γ to a rate of the form n^γ for γ < 0. More generally, relatively fast convergence rates can be achieved with infinite-order kernels whenever case 3 of Section 3.3 applies. Caution is, however, advised when using high-order kernels. They are known not to perform as well in finite samples as their asymptotic properties would suggest (see Härdle and Linton, 1994). The origin of the problem is that a high-order kernel must necessarily take negative values over a portion of its support, which makes it likely for the denominator of the Nadaraya–Watson kernel estimator to approach zero, even at a point where the true density is bounded away from zero.
In Example 3, making the density of the measurement error Δz ordinarily smooth instead of supersmooth does not improve the convergence rates relative to Example 1. This is in sharp contrast to the behavior of kernel deconvolution estimators, whose convergence rates take the form of a negative power of n under the same assumptions. The reason for this distinction is that the only characteristic function appearing in the denominator of a kernel deconvolution estimator is that of the measurement error Δz, whereas in our estimator, it is the characteristic function of z that appears in the denominator. The density of z is supersmooth if either the density of the true regressor x* or that of the measurement error Δz is supersmooth. Hence, a supersmooth density for x* will also cause our estimator to converge slowly.
In Example 4, it is seen that when the density of x* is made ordinarily smooth as well, the slow convergence problem is avoided, as expected. The resulting rates are not necessarily identical to those of Fan and Truong's kernel deconvolution estimator, but the rates at least take the form of a negative power of n, indicating that the distributional assumptions regarding the measurement error can be relaxed without an undue increase in the statistical noise.
We now investigate the finite-sample properties of the proposed estimator through various Monte Carlo simulations. The designs are chosen so as to illustrate the examples of Section 4, summarized in Table 1, which cover the most common combinations of smooth and supersmooth distributions and conditional expectations. As an example of a supersmooth distribution, the normal distribution with variance σ² naturally comes to mind. Its characteristic function has a tail of the form exp(−(σ²/2)|ζ|²). As an example of an ordinarily smooth distribution, we consider the Laplace (or double exponential) distribution with mean μ and variance σ², denoted by L(μ,σ²) and defined by the density

f(x) = (√2 σ)^{−1} exp(−√2 |x − μ|/σ)

for any x ∈ ℝ. The tail of the characteristic function of a Laplace density is of the form |ζ|^{−2}.
Our example of a supersmooth regression function is the error function, which has a Fourier transform decaying at the rate |ζ|^{−1} exp(−(1/4)|ζ|²) as |ζ| → ∞. Finally, our example of an ordinarily smooth regression function is a piecewise linear continuous function with a discontinuous first derivative, whose Fourier transform decays as |ζ|^{−2}. To simplify comparisons, both functions are normalized to have the same range and a similar general shape, so that any difference in the results can be attributed to their difference in smoothness. All simulations proceed by drawing 500 samples of 1,000, 2,000, or 8,000 observations from the distributions given in Table 2. Table 2 also provides the theoretical convergence rate in each case, obtained by substituting the appropriate smoothness parameters in the expressions of Table 1. The distribution of Δy is never altered, because it has little impact on the asymptotic properties of the estimator except for a trivial scaling of some of the components of the asymptotic variance. For each sample, the variables x, y, z are constructed through

x = x* + Δx,  z = x* + Δz,  y = g(x*) + Δy.

The variables (y,x,z) are used as an input for our estimator, and the variables (y,x) are fed into the Nadaraya–Watson estimator. We also construct an (infeasible) Nadaraya–Watson estimator from the variables (y,x*) for comparative purposes. For all three estimators, an infinite-order kernel whose Fourier transform is given by equation (23) with ξ̄ = ½ is used. In this fashion, the kernel is never the factor limiting the convergence rate. For each sample, we keep track of the value of the estimated function at a given point (here, x* = 1) and use it to calculate the bias squared, the variance, and the sum of the two, the mean square error. A set of bandwidths ranging from 1.0 to 2.5 is scanned in increments of 0.05 in search of the bandwidth minimizing the mean square error.⁹

⁹For less than 0.5% of the samples drawn, numerical issues associated with near division by zero in equations (19) and (20) were observed for a few of the smallest bandwidths sampled. To simplify the reporting of the results as a function of bandwidth, these draws were discarded and new draws were made so that the total number of samples kept remains 500. Of course, when studying any given sample, practitioners would simply never choose such a small bandwidth. The problem only occurs because we are performing Monte Carlo simulations and wish to report averages over replications as a function of bandwidth.
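A minimal sketch of one such simulation draw follows; the specific distributions and parameters of Table 2 are not reproduced here, so all distributional choices below are placeholders.

```python
import numpy as np
from math import erf

rng = np.random.default_rng(0)

def draw_sample(n, supersmooth_g=True):
    """Hypothetical draw mimicking the structure of the designs: x and z are
    two error-contaminated measurements of x*, and y = g(x*) + dy."""
    x_star = rng.normal(0.0, 1.0, n)
    dx = rng.normal(0.0, 0.5, n)                 # measurement error in x (placeholder)
    dz = rng.normal(0.0, 0.5, n)                 # measurement error in z (placeholder)
    dy = rng.normal(0.0, 0.1, n)                 # regression disturbance (placeholder)
    g = (np.vectorize(erf) if supersmooth_g      # supersmooth g: error function
         else lambda u: np.clip(u, -1.0, 1.0))   # ordinarily smooth g: kinked, same range
    return g(x_star) + dy, x_star + dx, x_star + dz

y, x, z = draw_sample(1000)
```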
Table 2. Monte Carlo simulation designs
Table 3 compares the bias squared, the variance, and the mean square error of the three estimators considered as a function of bandwidth for a sample size of 1,000. For conciseness, only a subset of the bandwidths considered is shown. The rightmost column gives all quantities evaluated at the optimal bandwidth (which may lie between two of the bandwidths listed in the previous columns). A few important features can be consistently observed throughout the four examples considered.
Table 3. Monte Carlo simulation results for the examples
In comparison with the Nadaraya–Watson estimator, our estimator is clearly very effective at reducing the bias. More specifically, the bias of the Nadaraya–Watson estimator does not converge to zero with decreasing bandwidth but instead settles at a nonzero value. In contrast, the bias of our estimator decreases by orders of magnitude over the range of bandwidths sampled as the bandwidth decreases. Our estimator's residual bias is attributable to the fact that we are performing a nonparametric estimation, so that a fully unbiased estimation is impossible. In fact, it can readily be seen that, at a given bandwidth, the bias of our estimator is very close to the bias of the infeasible Nadaraya–Watson estimator using the uncontaminated regressor x*, thus indicating that our estimator does not appear to introduce additional bias at the sample size considered. Of course, because the variance of our estimator is larger than that of the infeasible Nadaraya–Watson estimator, a larger bandwidth must be used, and the resulting bias, evaluated at the optimal bandwidth, is slightly larger than in the error-free case.
The bias reduction made possible by the proposed estimator comes at the expense of an increased variance relative to the Nadaraya–Watson estimator based on mismeasured regressors. However, the decrease in the bias more than offsets the increase in the variance, so that the mean square error we obtain is still better than for the Nadaraya–Watson estimator.
It is instructive to observe the estimator's behavior as a function of the smoothness of the various densities and conditional expectations considered. The asymptotic theory presented earlier predicts the convergence rate, which can be directly compared with the change in the mean square error at the optimal bandwidth, as a function of sample size for each of the examples considered (see Table 4). The fifth column of Table 4, labeled “MSE8000/MSE2000,” reports the ratio of mean square error at a sample size of 8,000 relative to the mean square error at sample size 2,000. We focus on these sample sizes because the differences between the various examples are more readily seen at large sample sizes. In Examples 1 and 3, where the convergence rate should be slow (i.e., a negative power of the log of sample size), convergence is indeed much slower than for Examples 2 and 4, where the convergence rate should be fast (i.e., a negative power of sample size). Moreover, the decrease in mean square error predicted by asymptotic theory (obtained by squaring the rates given in Table 2 and shown in the last column of Table 4) is an excellent predictor of the actual decrease in three out of the four examples. Note that the systematic changes in bandwidth as a function of sample size are difficult to distinguish from the inherent simulation noise, because bandwidth variations are much smaller than the changes in mean square error, as predicted by Theorem 2.
Table 4. Monte Carlo simulation results as a function of sample size
Monte Carlo simulations can also be used to verify the applicability of the asymptotic distribution in a finite sample. The designs described in Table 2 are again used, with the mean-square-error-minimizing bandwidths given in Table 3 and a sample size of 1,000. For each sample, we keep track of the value of the estimated function at a given point (x* = 1.0) and the estimated variance at that point, obtained from equations (31) and (32) by replacing all expected values by sample averages. The point estimates are then standardized, that is, demeaned by the average of the point estimates and normalized by the square root of the average of the estimated pointwise variances. Figure 1 shows the empirical cumulative distribution function (c.d.f.) of the standardized point estimates p_i for i = 1,…,500, obtained by sorting the p_i in increasing order and by joining the points (p_i, (i − 1)/499) by lines. The resulting empirical c.d.f. (jagged lines in Figure 1) agrees very well with the normal c.d.f. predicted by asymptotic theory (shown as a smooth line in Figure 1).
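The standardization and the construction of the empirical c.d.f. just described amount to the following few lines (a sketch; the estimates and variances would come from the 500 replications):

```python
import numpy as np

def standardized_cdf_points(estimates, variances):
    """Demean by the average point estimate, scale by the root of the average
    estimated pointwise variance, and return the sorted pairs (p_i, (i-1)/(R-1))
    traced as the jagged line in Figure 1."""
    p = np.sort((estimates - estimates.mean()) / np.sqrt(variances.mean()))
    return p, np.arange(len(p)) / (len(p) - 1.0)
```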
Figure 1. Comparison between the finite-sample and the asymptotic distributions of the estimator. The abscissa is the standardized point estimate.
This paper presents a new kernel-based nonparametric estimator that extends the conventional Nadaraya–Watson kernel estimator to cover the case of an error-ridden regressor. We show that identification is achievable when one repeated measurement of the error-contaminated regressor is available. One remarkable property of our estimator is that it requires no knowledge of the distribution of the measurement error, contrary to the popular kernel deconvolution estimator. The convergence rate and the asymptotic distribution of the proposed estimator are derived. A series of examples illustrates the main factors determining the convergence rate and enables us to compare the convergence rates we obtain with those of earlier estimators. Various Monte Carlo simulations are used to investigate the finite-sample properties of the estimator.
Proof of Theorem 1. The result can be shown by direct substitution. Assumption 2 ensures that all expectations are well defined. First, observe that equation (14) indeed provides the value of φ_0(ξ), by using Assumption 1:

i m_x(ζ)/m_1(ζ) = i E[x e^{iζz}]/E[e^{iζz}] = i E[x* e^{iζx*}]E[e^{iζΔz}]/(E[e^{iζx*}]E[e^{iζΔz}]) = φ_0′(ζ)/φ_0(ζ),

so that exp(∫_0^ξ i m_x(ζ)/m_1(ζ) dζ) = exp(∫_0^ξ (φ_0′(ζ)/φ_0(ζ)) dζ) = φ_0(ξ), since φ_0(0) = 1; an analogous argument delivers equation (15) for φ_1(ξ). Letting f(x*) be the density of x*, one can then show that Φ_1(x̄*) and Φ_0(x̄*) in equation (13), respectively, provide the numerator and the denominator of the Nadaraya–Watson estimator. In what follows, we use the independence between x* and Δz and the fact that κ(ξ) is the Fourier transform of the kernel K(x*). █
Proof of Lemma 1. The fact that E[ĝ_L(x̄*)] − g(x̄*) = B(x̄*) follows from equation (25) and the fact that E[m̂_a(ξ)] = m_a(ξ). Finally, to calculate Var[ĝ_L(x̄*)], we note that, by equation (25), the variance involves only the second moments of the m̂_a(ξ). Equation (31) then follows directly from squaring equation (24), taking its expectation, and using the expression for the second moments just derived. █
LEMMA 6. If a_j and z_j are sequences of i.i.d. real-valued random variables such that E[a_j²] < ∞ and E[|a_j||z_j|] < ∞, then, for any u, U ≥ 0 and ε > 0,

sup_{|ξ| ≤ U n^{u}} | n^{−1} Σ_{j=1}^{n} a_j e^{iξz_j} − E[a_j e^{iξz_j}] | = O_p(n^{−1/2+ε}).

Proof. See Lemma 6 in Schennach (2004). █
Proof of Lemma 2. To compute the Fréchet derivative of Φ̂_k(x̄*) with respect to the estimated moments m̂_a(ξ) in the vicinity of the true moments m_a(ξ), we first note a few simple results. A ratio of two random functions N̂(ξ)/D̂(ξ) can be exactly written as the ratio N(ξ)/D(ξ) of their nonrandom limits, plus a term that is linear in the deviations N̂(ξ) − N(ξ) and D̂(ξ) − D(ξ), plus a remainder that can be written in two alternative ways, each convenient for different bounding purposes (equation (A.32)). Similarly, the exponential factor in the estimator admits, for some random function δQ_x(ξ) whose supremum is suitably controlled, an analogous exact expansion into its nonrandom limit, a linear term, and a remainder (equation (A.37)).
Substituting expansions (A.32) and (A.37) into Φ̂_k(x̄*) for k = 0,1 and keeping the terms linear in the deviations m̂_a(ξ) − m_a(ξ) gives the linearization of Φ̂_k(x̄*), denoted Φ̂_k^L(x̄*). By making use of the Fourier inversion identity for any absolutely integrable function f, we obtain the expression for Φ̂_k^L(x̄*) entering Definition 2.
The order of Φ̂_k^L(x̄*) − Φ_k(x̄*) (in probability) can be found through its variance, given by Lemma 1, where, by Assumptions 5 and 6, the relevant second moments are finite. It follows that each of the variance contributions is bounded, and therefore that Φ̂_k^L(x̄*) − Φ_k(x̄*) = O_p(U(h_n)n^{−1/2}), where U(h_n), given in the statement of the lemma, has been explicitly constructed to bound any of the terms in equation (26) (up to a multiplicative constant). By equation (30), equation (A.49) implies equation (36) in the statement of the lemma, provided that Assumption 7 holds.
To establish equation (37), we substitute expansions (A.32) and (A.37) into Φ̂_k(x̄*) for k = 0,1 and remove the terms linear in the deviations m̂_a(ξ) − m_a(ξ). We then find that the nonlinear remainder Φ̂_k(x̄*) − Φ̂_k^L(x̄*) can be written as a sum of higher order terms. These terms can then be bounded in terms of λ(h_n), U(h_n) (given in the statement of the lemma), and sup_{|ξ| ≤ h_n^{−1}} |m̂_a(ξ) − m_a(ξ)|, where the supremum can be taken over [−h_n^{−1}, h_n^{−1}] because κ(h_n ξ) vanishes outside that interval. By Lemma 6, this supremum is O_p(n^{−1/2+ε}) for any ε > 0. Also, we note that m̂_1(0) = m_1(0) = 1. Now, for k = 0,1, the leading remainder terms can be bounded accordingly, and the remaining terms can be similarly bounded.
for some δ > 0. By a standard Taylor expansion of the ratio
around
, we have
for some
lying between
. Because (i) we have just shown that
, (ii)
by assumption, and (iii)
is bounded and
is bounded away from zero by assumption, it follows that
converge in probability to finite quantities and therefore
is of the same order as
, thus implying equation (37).
To establish the second conclusion of the lemma, we note that, because condition (iv) holds, we can write the difference between ĝ(x̄*) and ĝ_L(x̄*), standardized by (Var[V(x̄*)])^{1/2}, explicitly. Then, this standardized difference converges to zero in probability because U(h_n)n^{−1/2}λ(h_n)n^{−1/2+ε} = o((Var[V(x̄*)])^{1/2}) by assumption. █
Proof of Lemma 3. First, by equation (13), we have, for k = 0,1, an exact expression for the bias of Φ_k(x̄*) in terms of κ(hξ) and φ_k(ξ). Expanding the Fourier transform of the kernel in a Taylor series up to order γ, we obtain equation (A.98). Now let the order γ be chosen as follows. If α_φ (defined in Assumption 8) is nonzero, then let γ = γ_κ, the order of the kernel. If α_φ = 0, then let γ be the largest integer such that γ ≤ γ_κ and γ < −γ_φ − 1. With this choice of γ, equation (A.98) simplifies, because all terms where i < γ vanish, by the definition of the order of a kernel. Furthermore, the coefficient of the leading term, the γth absolute moment of the kernel, is finite by equation (44) of Assumption 9. The remaining integral term is finite also, because our choice of γ guarantees that the integrand decays to zero faster than ξ^{−1}. Then, by a standard Taylor expansion of the ratio Φ_1(x̄*)/Φ_0(x̄*) around its limit, the convergence rate of B(x̄*) is of the same order also. █
LEMMA 7. For ζ ≥ 0, if γ > 0, α < 0, and β > 0, or if α = β = 0 and γ > 0, then

∫_0^ζ (1 + ξ)^{γ} exp(−αξ^{β}) dξ ≼ (1 + ζ)^{1+γ} exp(−αζ^{β}).

Proof. The case where α = β = 0 is trivial. If α < 0 and β > 0, Lemma 4.2 in Li and Vuong (1998) shows that, for γ > 0, ∫_0^ζ ξ^{γ} exp(−αξ^{β}) dξ ∼ (−αβ)^{−1} ζ^{1+γ−β} exp(−αζ^{β}) as ζ → ∞, thus implying the result because ξ^{1+γ−β} exp(−αξ^{β}) ≼ ξ^{1+γ} exp(−αξ^{β}). █
Proof of Lemma 4. From equation (A.96), the only contribution to the bias of Φ_k(x̄*) comes from frequencies |ξ| ≥ ξ̄h^{−1}, where κ(hξ) may differ from 1. Then, by a standard Taylor expansion of the ratio Φ_1(x̄*)/Φ_0(x̄*) around its limit, the convergence rate of B(x̄*) is O((1 + h^{−1})^{γ_φ+1} exp(α_φ(ξ̄h^{−1})^{β_φ})) also. █
LEMMA 8. For ζ ≥ 0, if β ≥ 0 and if (1 + ξ)^{γ} exp(αξ^{β}) is increasing in ξ, then

∫_0^ζ (1 + ξ)^{γ} exp(αξ^{β}) dξ ≼ (1 + ζ)^{1+γ} exp(αζ^{β}).

Proof. The integrand is bounded by its value at ξ = ζ, so the integral is at most ζ(1 + ζ)^{γ} exp(αζ^{β}) ≼ (1 + ζ)^{1+γ} exp(αζ^{β}). █
Proof of Lemma 5. By Lemma 2, the order of the variance term is O_p(U(h_n)n^{−1/2}), where U(h_n) is built, for k = 0,1 and a = 1, x, y, from bounds on the quantities in equation (26), each of which can be bounded using Assumption 8 together with Lemmas 7 and 8. It follows that U(h_n) ≼ (1 + h_n^{−1})^{γ_v} exp(α_v(h_n^{−1})^{β_v}) with γ_v = 2 + γ_φ − γ_m + γ_r, α_v = −α_m, and β_v = β_m. Hence, V(x̄*) = O_p((1 + h_n^{−1})^{γ_v} exp(α_v(h_n^{−1})^{β_v})n^{−1/2}). █
Proof of Theorem 2. We make use of the order of the bias B(x̄*) provided by Lemmas 3 and 4 and of the order of the variance term V(x̄*) provided by Lemma 5. To check that the higher order term ĝ(x̄*) − ĝ_L(x̄*) does not affect the rates obtained by considering the first-order terms only, we observe that, by Lemma 2, the upper bound on V(x̄*) provided by Lemma 5 holds for ĝ(x̄*) − E[ĝ_L(x̄*)] also, if we can show that λ(h_n)n^{−(1/2)+ε} = o(1) for some ε.

We consider each subcase of the theorem separately. Let R_n denote the convergence rate to be established, so that ĝ(x̄*) − g(x̄*) = O_p(R_n). Throughout the proof, let ε, ε_1, ε_2,… denote arbitrarily small positive numbers.
Case 1. β_v > β_b. If the bandwidth h_n is chosen to balance the exponentially decaying bias factor against the exponentially diverging variance factor, for some ε_v, ε_b > 0, the bias and the variance are of the same order and the convergence rate follows. Now, to check the negligibility of the higher-order terms, we verify that λ(h_n)n^{−1/2+ε_1} = o(1) for some suitably chosen ε_1 > 0. Noting that β_v = β_m and α_v = −α_m if β_v > β_b, the required bound follows by direct substitution of the bandwidth sequence.

Case 2. β_b = 0 (and γ_b < 0) and β_v > 0. For some ε_v > 0, let h_n^{−1} = ((ln n)/(2α_v(1 + ε_v)))^{1/β_v}. Then the variance term vanishes at a rate that is a negative power of n (up to logarithmic factors), the bias term, of order (ln n)^{γ_b/β_v}, dominates, and R_n = (ln n)^{γ_b/β_v}. The negligibility of the higher order terms follows by the same substitution.

Case 3. β_b = β_v ≠ 0. For some ε > 0, let h_n^{−1} be chosen to balance the two exponential factors exp(α_b(h_n^{−1})^{β_b}) and exp(α_v(h_n^{−1})^{β_v})n^{−1/2}, which is possible because both exponents of supersmoothness coincide. The resulting rate R_n is then a negative power of n.

Case 4. β_b = β_v = 0 (and α_b = α_v = 0 and γ_b < 0). Let h_n^{−1} = n^{1/(2γ_v−2γ_b)}. Then R_n = n^{γ_b/(2γ_v−2γ_b)}. Noting that γ_b ≥ γ_φ + 1, γ_v = 2 + γ_φ − γ_m + γ_r, and γ_m ≤ 0, the exponent of n is negative, and the negligibility condition λ(h_n)n^{−1/2+ε} = o(1) can be verified directly. █
LEMMA 9. Let K_n(z) be a sequence of real-valued nonrandom functions of a real variable, let a_j and z_j be i.i.d. sequences with a_j satisfying E[a_j^{2+δ}|z_j = z] ≤ C for some C, δ > 0 for all z and Var[a_j|z_j = z] ≥ C for some C > 0 and for all z, and let

σ_n² ≡ Var[a_j K_n(z_j)].

If inf_{n≥N} σ_n > 0 for some N ∈ ℕ and E[|K_n(z_j)|^{2+δ}] = o(n^{δ/2}σ_n^{2+δ}) for some δ > 0, then

(n^{1/2}σ_n)^{−1} Σ_{j=1}^{n} (a_j K_n(z_j) − E[a_j K_n(z_j)]) →_d N(0,1).
Proof. Let Z_nj = a_j K_n(z_j). The proof consists in verifying that Z_nj satisfies the hypothesis of the Lindeberg–Feller central limit theorem for triangular arrays. Indeed, the Z_n1,…,Z_nn are i.i.d. by assumption, and it remains to be shown that the Lindeberg condition holds: for all ε > 0,

I_n ≡ σ_n^{−2} E[1(|Z_nj| ≥ εσ_n n^{1/2}) Z_nj²] → 0.

First, noting that 1(ab ≥ c) ≤ 1(a ≥ c^{η}) + 1(b ≥ c^{1−η}) for any a, b, c ≥ 0 and any η ∈ ]0,1[, we can write I_n ≼ T_1 + T_2, where

T_1 ≡ σ_n^{−2} E[1(|a_j| ≥ ε^{η}σ_n^{η}n^{η/2}) a_j² K_n²(z_j)],  T_2 ≡ σ_n^{−2} E[a_j² 1(|K_n(z_j)| ≥ ε^{1−η}σ_n^{1−η}n^{(1−η)/2}) K_n²(z_j)].

Then, T_1 = σ_n^{−2} E[E[1(|a_j| ≥ ε^{η}σ_n^{η}n^{η/2}) a_j² | z_j] |K_n(z_j)|²] ≼ σ_n^{−2} E[(ε^{η}σ_n^{η}n^{η/2})^{−δ} |K_n(z_j)|²], because E[a_j^{2+δ}|z_j = z] ≤ C (i.e., E[1(|a_j| ≥ c)a_j²|z_j] ≤ E[1(|a_j| ≥ c)(a_j/c)^{δ}a_j²|z_j] ≤ c^{−δ}E[1(|a_j| ≥ c)a_j^{2+δ}|z_j] ≤ c^{−δ}E[a_j^{2+δ}|z_j] ≼ c^{−δ}). Noting that σ_n² ≽ E[K_n²(z_j)] (since Var[a_j|z_j] ≥ C), we have T_1 ≼ (E[K_n²(z_j)])^{−1} E[(ε^{η}σ_n^{η}n^{η/2})^{−δ} K_n²(z_j)] = (ε^{η}σ_n^{η}n^{η/2})^{−δ}(E[K_n²(z_j)])^{−1}E[K_n²(z_j)] = (ε^{η}σ_n^{η}n^{η/2})^{−δ} → 0. Also, T_2 = σ_n^{−2}E[E[a_j²|z_j] 1(|K_n(z_j)| ≥ ε^{1−η}σ_n^{1−η}n^{(1−η)/2}) K_n²(z_j)] ≤ Cσ_n^{−2}E[1(|K_n(z_j)| ≥ ε^{1−η}σ_n^{1−η}n^{(1−η)/2}) K_n²(z_j)] ≼ E[1(|K_n(z_j)| ≥ ε^{1−η}σ_n^{1−η}n^{(1−η)/2}) K_n²(z_j)].
Let s_n denote the slope of the triangular function considered below. For a given value of E[K_n²(z_j)], the maximum value of E[1(|K_n(z_j)| ≥ C)K_n²(z_j)] for some C > 0 is obtained when the support of K_n(z) is inside the support of the distribution of z and when K_n(z) is triangular, in which case the relevant threshold is attained at l_n = (3σ_n²/(2s_n²))^{1/3}. Then, evaluating the resulting integral explicitly shows that the bound on T_2 vanishes as n → ∞, and it follows that I_n → 0, as desired. █
LEMMA 10. If ψ ∈ 𝒲, then lim_{|z|→∞} p(z) = 0, where p(z) is the inverse Fourier transform of ψ(ζ).

Proof. This result is Theorem 18 in Lighthill (1962), with the trivial modification that the Fourier transform is replaced by the inverse Fourier transform and with the slight extension that allows the tail behavior of the function ψ(ζ) to be exponential (see Definition 4). This extension is straightforward because Lighthill's proof proceeds by writing ψ(ζ) = Ψ(ζ) + (ψ(ζ) − Ψ(ζ)), where (ψ(ζ) − Ψ(ζ)) can be handled using the Riemann–Lebesgue lemma. By the assumption that ψ ∈ 𝒲, the function Ψ(ζ) can be chosen such that its inverse Fourier transform, p_∞(z), can be calculated analytically and be shown to satisfy lim_{|z|→∞} p_∞(z) = 0. All that is needed to allow for more flexible choices of tail behavior than initially employed by Lighthill is to find functions Ψ(ζ) whose inverse Fourier transforms can be calculated analytically and have the appropriate tail behavior. Using the techniques described in Gel'fand and Shilov (1964, Example 5, p. 169), the inverse Fourier transform of exponentials of the form exp(cζ^{γ}) can be shown to be expressible as a series in the derivatives δ^{(k)}(z) of Dirac's delta distribution, where δ^{(k)} denotes the kth derivative. This distribution clearly vanishes as |z| → ∞, as required. Note that, although such a distribution does not belong to the class of the so-called tempered distributions, it does belong to the wider class of distributions that forms the dual of compactly supported infinitely differentiable test functions (i.e., the so-called Type K distributions of Gel'fand and Shilov). █
Proof of Theorem 3. According to the second conclusion of Lemma 2, to have asymptotic normality of ĝ(x̄*) we need to show that condition (iv) holds, that is, that U(h_n)n^{−1/2}λ(h_n)n^{−1/2+ε} = o((Var[ĝ_L(x̄*)])^{1/2}) for some ε > 0. We proceed by finding a lower bound on Var[ĝ_L(x̄*)] and relating it to U(h_n). First, by Assumption 14, Var[ĝ_L(x̄*)] ≽ Var[T_{k,a,n}], where T_{k,a,n}, for k = 0,1 and a = 1, x, y, is given by

T_{k,a,n} = n^{−1} Σ_{j=1}^{n} a_j K_{k,a,n}(z_j),    (A.191)

where K_{k,a,n}(z) is the inverse Fourier transform of U_a^{k}(ζ, x̄*, h_n).    (A.192)

Then,

Var[T_{k,a,n}] ≽ n^{−1} ∫_{z∈I_{a,k}} (K_{k,a,n}(z))² E[a²|z] f(z) dz

for any finite interval I_{a,k} not reduced to a point. By Assumptions 11 and 13, inf_{z∈I_{a,k}} E[a²|z] f(z) ≥ C > 0, and we have

Var[T_{k,a,n}] ≽ n^{−1} ∫_{z∈I_{a,k}} (K_{k,a,n}(z))² dz.
We now show that the ratio ∫(K_{k,a,n}(z))² dz / ∫_{z∈I_{a,k}}(K_{k,a,n}(z))² dz remains bounded as n → ∞, thus implying that ∫(K_{k,a,n}(z))² dz diverges at the same rate as ∫_{z∈I_{a,k}}(K_{k,a,n}(z))² dz. First, lim_{n→∞} K_{k,a,n}(z) ≡ K_{k,a,∞}(z) is the inverse Fourier transform of U_a^{k}(ζ, x̄*, 0) and, by the moment theorem, the inverse Fourier transform of dU_a^{k}(ζ, x̄*, 0)/dζ is izK_{k,a,∞}(z). Because dU_a^{k}(ζ, x̄*, 0)/dζ belongs to 𝒲 by Assumption 12, we can apply Lemma 10 to conclude that lim_{|z|→∞}|z||K_{k,a,∞}(z)| = 0. Therefore, there exist constants A, C > 0 such that |K_{k,a,∞}²(z)| ≤ A|z|^{−2} for |z| ≥ C and k = 0,1 and a = 1, x, y. It is therefore impossible for the ratio above to become unbounded as n → ∞ if I_{a,k} is chosen to be [−C,C]. We can then write the resulting lower bound as equation (A.197).
By Parseval's identity and the fact that U_a^{k}(ζ, x̄*, h_n) vanishes for |ζ| ≥ h_n^{−1}, we have equation (A.198). By the Cauchy–Schwarz inequality, we obtain an inequality that becomes, upon rearrangement, equation (A.200). Collecting equations (A.197), (A.198), and (A.200), we obtain a lower bound on Var[ĝ_L(x̄*)] in terms of U(h_n) and h_n.
We then observe that, by equation (34) and Assumption 15, this lower bound can be related to U(h_n), yielding (Var[ĝ_L(x̄*)])^{1/2} ≽ h_n^{1/2} U(h_n) n^{−1/2}. Combining the two bounds implies that h_n^{1/2}λ(h_n)n^{−1/2+ε} → 0 for some ε > 0 is a sufficient condition for the asymptotic negligibility of the higher order terms, which we can now verify.

If α_m = 0, then h_n^{1/2}λ(h_n)n^{−1/2+ε} ≼ (1 + h_n^{−1})^{1/2}(1 + h_n^{−1})(1 + h_n^{−1})^{γ_r−γ_m} n^{−1/2+ε} = (1 + h_n^{−1})^{3/2+γ_r−γ_m} n^{−1/2+ε} ≼ (n^{−η}n^{1/(3+2γ_r−2γ_m)})^{3/2+γ_r−γ_m} n^{−1/2+ε} ≼ n^{−η(3/2+γ_r−γ_m)} n^{ε} = o(1) for ε > 0 sufficiently small.

If α_m ≠ 0, then h_n^{1/2}λ(h_n)n^{−1/2+ε} ≼ (1 + h_n^{−1})^{3/2+γ_r−γ_m} exp(−α_m(h_n^{−1})^{β_m}) n^{−1/2+ε} ≼ exp(−α_m(1 + ε_2)(h_n^{−1})^{β_m}) n^{−1/2+ε} ≼ exp(((1 + ε_2)(1 − η)/2) ln n) n^{−1/2+ε} = n^{−(1/2)(η+ηε_2−ε_2)} n^{ε} = o(1) for some ε_2 > 0 and for ε > 0 sufficiently small.
We have now shown that the limiting distribution of ĝ(x̄*) is the same as that of ĝ_L(x̄*). To obtain the limiting distribution of ĝ_L(x̄*), we note that ĝ_L(x̄*) is a finite linear combination of the kernel-type estimators T_{k,a,n} defined in equation (A.191) using the kernels K_{k,a,n}(z) defined by equation (A.192). The asymptotic normality of T_{k,a,n} can be shown using Lemma 9, provided that we can show that E[|K_{k,a,n}(z_j)|^{2+δ}] = o(n^{δ/2}σ_n^{2+δ}) for some δ > 0. By the moment theorem, this requirement is satisfied if the corresponding integrals of |U_a^{k}(ζ, x̄*, h_n)| are suitably bounded for k = 0,1 and a = 1, x, y. Using the same techniques as in the proof of Lemma 5, these integrals are bounded by expressions of the form (1 + h_n^{−1})^{γ} exp(α(h_n^{−1})^{β}), where γ = 3 + γ_φ − γ_m + γ_r, α = −α_m, and β = β_m.

If α_m = 0, then α = 0 and the requirement is readily verified because (i) (3 + γ_φ − γ_m + γ_r) > 0, since γ_φ ≥ −γ_m and γ_r ≥ 0, and (ii) σ_n is bounded away from zero by the argument above. If α_m ≠ 0, the same conclusion follows because the bandwidth condition of the theorem keeps the exponential factor under control. Hence, the hypotheses of Lemma 9 are verified, and the T_{k,a,n} are asymptotically normal. The expectation and the variance of ĝ_L(x̄*) can then be calculated as in Lemma 1. █