
THE UNIQUENESS OF CROSS-VALIDATION SELECTED SMOOTHING PARAMETERS IN KERNEL ESTIMATION OF NONPARAMETRIC MODELS

Published online by Cambridge University Press:  22 August 2005

Qi Li
Affiliation:
Texas A&M University and Tsinghua University
Jianxin Zhou
Affiliation:
Texas A&M University

Abstract

We investigate the issue of the uniqueness of the cross-validation selected smoothing parameters in kernel estimation of multivariate nonparametric regression or conditional probability functions. When the covariates are all continuous variables, we provide a necessary and sufficient condition, and when the covariates are a mixture of categorical and continuous variables, we provide a simple sufficient condition that guarantees asymptotically the uniqueness of the cross-validation selected smoothing parameters.

We thank a referee for the constructive comments.

Type
NOTES AND PROBLEMS
Copyright
© 2005 Cambridge University Press

1. MOTIVATION AND RESULTS

The kernel method is the most popular technique used in the estimation of nonparametric/semiparametric models, and it is well known that the selection of smoothing parameters in nonparametric kernel estimation is of crucial importance. In the context of a regression model, Clarke (1975) proposes the leave-one-out least squares cross-validation method for selecting the smoothing parameters. The asymptotic optimality of this approach is studied by Härdle and Marron (1985) and Härdle, Hall, and Marron (1988) in the context of a univariate regression model, and Fan and Gijbels (1995) have studied bandwidth selection in the context of local polynomial kernel regression. For a regression model with a single (univariate) continuous regressor, Härdle and Marron (1985) and Härdle et al. (1988) show that the cross-validation function has the following expression:

$$\mathrm{CV}(h) = \frac{1}{n}\sum_{i=1}^{n}\bigl[Y_i - \hat g_{-i}(X_i)\bigr]^2 w(X_i) = C_1 h^4 + \frac{C_2}{nh} + (\text{terms unrelated to } h) + o_p\bigl(h^4 + (nh)^{-1}\bigr), \tag{1.1}$$

where $\hat g_{-i}(X_i)$ is the leave-one-out local constant kernel estimator of $g(X_i) \equiv E(Y_i|X_i)$, $k(\cdot)$ is a second-order kernel function, $h$ is the smoothing parameter, $w(\cdot)$ is a weight function, $C_1 = \int\{(\kappa_2/2)[g''(x)f(x) + 2g'(x)f'(x)]\}^2 w(x) f(x)^{-1}\,dx$, $C_2 = \kappa\int\sigma^2(x)w(x)\,dx$, $\kappa_2 = \int k(v)v^2\,dv$, $\kappa = \int k(v)^2\,dv$, $g'(\cdot)$ and $g''(\cdot)$ denote first- and second-order derivative functions, and $\sigma^2(x) = \mathrm{Var}(Y_i|X_i = x)$.

The terms $C_1 h^4$ and $C_2/(nh)$ in (1.1) are the leading squared bias and variance of $\mathrm{CV}(h)$, respectively. Let $\hat h$ denote the cross-validation selected smoothing parameter that minimizes $\mathrm{CV}(h)$; then from (1.1) it is easy to show that $\hat h/h_0 \to 1$ in probability, where $h_0 = [C_2/(4C_1)]^{1/5} n^{-1/5}$. Note that $C_1$ is nonnegative and $C_2 > 0$. Therefore, a necessary and sufficient condition for the existence of the unique benchmark nonstochastic optimal smoothing parameter $h_0$ is that $C_1 > 0$. The assumption that $C_1 > 0$ puts some restrictions on $g(\cdot)$; for example, $g(\cdot)$ cannot be a constant function. A similar necessary and sufficient condition exists that guarantees an asymptotically uniquely defined cross-validation selected smoothing parameter in estimating a conditional probability density function (p.d.f.) with a univariate continuous conditioning variable.
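For illustration only (this is not part of the original note), a minimal Python sketch of the univariate leave-one-out least squares CV criterion, with a Gaussian kernel, a uniform weight $w(\cdot) = 1$, and a made-up data-generating process, is:

```python
import numpy as np

def loo_cv(h, x, y):
    """Leave-one-out least-squares CV criterion CV(h) for the local constant
    (Nadaraya-Watson) estimator with a Gaussian kernel and w(.) = 1."""
    w = np.exp(-0.5 * ((x[:, None] - x[None, :]) / h) ** 2)  # kernel weights
    np.fill_diagonal(w, 0.0)                 # leave observation i out
    g_loo = w @ y / w.sum(axis=1)            # g_{-i}(X_i) for every i
    return np.mean((y - g_loo) ** 2)

rng = np.random.default_rng(0)
n = 200
x = rng.uniform(0, 1, n)
y = np.sin(2 * np.pi * x) + 0.3 * rng.standard_normal(n)   # illustrative DGP

grid = np.linspace(0.01, 0.5, 100)
cv = [loo_cv(h, x, y) for h in grid]
h_hat = grid[int(np.argmin(cv))]
print("CV-selected bandwidth:", h_hat)       # should be of order n^(-1/5)
```

The grid search here simply stands in for a proper numerical minimization of CV(h).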

The cross-validation procedure can be easily extended to the multivariate (regression or p.d.f. estimation) settings for selecting the smoothing parameters. However, the conditions that ensure the uniqueness of cross-validation selected smoothing parameters become more complex. Recently, Hall, Racine, and Li (2004), Hall, Li, and Racine (2004), and Li and Racine (2003, 2004) have considered the problem of nonparametric estimation of conditional density and regression functions with mixed discrete and continuous data. They propose to use the data-driven cross-validation (CV) methods to select the smoothing parameters, and they have shown that the CV selected smoothing parameters are asymptotically equivalent to the nonstochastic optimal smoothing parameters that minimize the asymptotic weighted estimation mean square error. However, when discussing the existence of the asymptotically uniquely defined optimal smoothing parameters, Hall, Racine, and Li (2004) and Li and Racine (2004) impose overly strong conditions. In this note we provide substantially weaker sufficient conditions that guarantee the existence of the uniquely defined CV selected optimal smoothing parameters. We show that when all covariates are continuous random variables, the condition becomes necessary and sufficient for the existence of uniquely defined optimal smoothing parameters.

We consider a nonparametric regression model with mixed discrete and continuous covariates:

$$Y_i = g(X_i) + u_i, \qquad i = 1,\dots,n, \tag{1.2}$$

where $g(\cdot)$ has an unknown functional form, $E(u_i|X_i) = 0$, $X_i = (X_i^c, X_i^d)$, $X_i^d$ is a $q \times 1$ vector of regressors that assume discrete values, and $X_i^c \in \mathbb{R}^p$ are the remaining continuous regressors. We use $X_{ij}^d$ to denote the $j$th component of $X_i^d$, and we assume that $X_{ij}^d$ takes $c_j \ge 2$ different values, that is, $X_{ij}^d \in \{0,1,\dots,c_j - 1\}$ for $j = 1,\dots,q$. We use $D$ to denote the range assumed by $x^d$. We are interested in estimating $g(x) = E(Y_i|X_i = x)$ by the nonparametric kernel method. We use $f(x) = f(x^c, x^d)$ to denote the joint density function. For $x^c = (x_1^c,\dots,x_p^c)$ we use the product kernel $K_c(x^c, X_i^c) = \prod_{j=1}^{p} h_j^{-1} k\bigl((x_j^c - X_{ij}^c)/h_j\bigr)$, where $k$ is a symmetric, univariate density function and $0 < h_j < \infty$ is the smoothing parameter for $x_j^c$. For a discrete regressor we define, for $1 \le j \le q$,

$$l(X_{ij}^d, x_j^d, \lambda_j) = \begin{cases} 1 & \text{if } X_{ij}^d = x_j^d, \\ \lambda_j & \text{if } X_{ij}^d \ne x_j^d, \end{cases} \tag{1.3}$$

where $0 \le \lambda_j \le 1$ is the smoothing parameter for $x_j^d$. Therefore, the product kernel for $x^d = (x_1^d,\dots,x_q^d)$ is given by $K_d(x^d, X_i^d) = \prod_{j=1}^{q} l(X_{ij}^d, x_j^d, \lambda_j)$. The kernel function for the mixed regressors $x = (x^c, x^d)$ is simply the product of $K_c$ and $K_d$, that is, $K(x, X_i) = K_c(x^c, X_i^c)\,K_d(x^d, X_i^d)$. The nonparametric estimate of $g(x)$ is given by $\hat g(x) = \sum_{i=1}^{n} Y_i K(x, X_i)\big/\sum_{i=1}^{n} K(x, X_i)$. We choose $(h, \lambda) = (h_1,\dots,h_p,\lambda_1,\dots,\lambda_q)$ by minimizing the following CV function:

$$\mathrm{CV}_{LC}(h,\lambda) = \frac{1}{n}\sum_{i=1}^{n}\bigl[Y_i - \hat g_{-i}(X_i)\bigr]^2 w(X_i), \tag{1.4}$$

where $\hat g_{-i}(X_i) = \sum_{j\ne i} Y_j K(X_i, X_j)\big/\sum_{j\ne i} K(X_i, X_j)$ is the leave-one-out local-constant (LC) kernel estimator of $g(X_i)$ and $0 \le w(\cdot) \le 1$ is a weight function that serves to avoid difficulties caused by dividing by zero, or by the slow convergence rate of $\hat g_{-i}(X_i)$ when $X_i$ is near the boundary of the support of $X$.
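As a concrete illustration of these definitions (our own sketch, with a Gaussian kernel playing the role of $k$, simulated data, and $w(\cdot) = 1$; the argument names h and lam are illustrative), the mixed product kernel and the leave-one-out LC estimator entering (1.4) can be coded as:

```python
import numpy as np

def mixed_kernel(xc_i, xd_i, Xc, Xd, h, lam):
    """Product kernel K(x, X_j): Gaussian kernels for the p continuous
    regressors times the l(., ., lambda) kernel for the q discrete regressors."""
    kc = np.prod(np.exp(-0.5 * ((xc_i - Xc) / h) ** 2) / h, axis=1)
    kd = np.prod(np.where(Xd == xd_i, 1.0, lam), axis=1)
    return kc * kd

def cv_lc(h, lam, Xc, Xd, y):
    """Leave-one-out least-squares CV function CV_LC(h, lambda) with w(.) = 1."""
    n = len(y)
    resid2 = np.empty(n)
    for i in range(n):
        k = mixed_kernel(Xc[i], Xd[i], Xc, Xd, h, lam)
        k[i] = 0.0                                  # leave observation i out
        resid2[i] = (y[i] - k @ y / k.sum()) ** 2
    return resid2.mean()

# illustrative data: p = 2 continuous regressors and q = 1 binary regressor
rng = np.random.default_rng(1)
n = 150
Xc = rng.uniform(0, 1, (n, 2))
Xd = rng.integers(0, 2, (n, 1))
y = Xc[:, 0] ** 2 + 0.5 * Xd[:, 0] + 0.2 * rng.standard_normal(n)

print(cv_lc(h=np.array([0.2, 0.2]), lam=np.array([0.3]), Xc=Xc, Xd=Xd, y=y))
```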

Define an indicator function $I_j(v^d, x^d) = \mathbf{1}(v_j^d \ne x_j^d)\prod_{s\ne j}\mathbf{1}(v_s^d = x_s^d)$. Note that $I_j(v^d,x^d) = 1$ if and only if $v^d$ and $x^d$ differ only in their $j$th component. Letting $m_j(x)$ and $m_{jj}(x)$ ($m = g$ or $m = f$) denote the first-order and second-order partial derivatives of $m(x^c,x^d)$ with respect to $x_j^c$, Hall, Li, and Racine (2004) have shown that (with $\int dx = \sum_{x^d\in D}\int dx^c$, where $D$ is the support of $X^d$)

$$\mathrm{CV}_{LC}(h,\lambda) = \int\Biggl\{\sum_{j=1}^{p}\frac{\kappa_2 h_j^2}{2}\bigl[g_{jj}(x)f(x) + 2g_j(x)f_j(x)\bigr]f(x)^{-1} + \sum_{j=1}^{q}\lambda_j\sum_{v^d\in D} I_j(v^d,x^d)\bigl[g(x^c,v^d) - g(x)\bigr]f(x^c,v^d)f(x)^{-1}\Biggr\}^2 w(x)f(x)\,dx + \frac{\kappa^p\int\sigma^2(x)w(x)\,dx}{n h_1 h_2\cdots h_p} + \text{terms unrelated to } (h,\lambda) + \text{higher order terms}. \tag{1.5}$$

The preceding results are based on the LC kernel estimation result. Li and Racine (2004) have considered the local linear (LL) CV method. The CV objective function is the same as given in (1.4) but with $\hat g_{-i}(X_i)$ replaced by a leave-one-out LL kernel estimator. Li and Racine (2004) have shown that the resulting CV function, $\mathrm{CV}_{LL}(h,\lambda)$, has the same form as (1.5) with the term $2g_j(x)f_j(x)$ removed.

Define $z_j = n^{2/(4+p)} h_j^2$ for $j = 1,\dots,p$, and $z_{p+j} = n^{2/(4+p)}\lambda_j$ for $j = 1,\dots,q$; then the leading terms of both $\mathrm{CV}_{LC}(h,\lambda)$ and $\mathrm{CV}_{LL}(h,\lambda)$ can be written in the form $c_0\, n^{-4/(p+4)}\chi(z_1,\dots,z_p,z_{p+1},\dots,z_{p+q})$, where $c_0 = \kappa^p\int\sigma^2(x)w(x)\,dx > 0$ is a constant, and

$$\chi(z_1,\dots,z_{p+q}) = z'Az + (z_1 z_2 \cdots z_p)^{-1/2}, \tag{1.6}$$

where $z = (z_1,\dots,z_{p+q})'$ (the prime denotes transpose), $A$ is a $(p+q)\times(p+q)$ symmetric positive semidefinite matrix with its $(j,s)$th element given by $A(j,s) = \int B_j(x)B_s(x)\,dx$, where $B_j(x) = c_0^{-1/2}(\kappa_2/2)\bigl[g_{jj}(x)f(x) + 2g_j(x)f_j(x)\bigr]w(x)^{1/2}f(x)^{-1/2}$ (one removes $2g_j(x)f_j(x)$ in the local linear CV case) for $j = 1,\dots,p$, and $B_{p+j}(x) = c_0^{-1/2}\sum_{v^d\in D} I_j(v^d,x^d)\bigl[g(x^c,v^d) - g(x)\bigr]f(x^c,v^d)\,w(x)^{1/2}f(x)^{-1/2}$ for $j = 1,\dots,q$.
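For intuition (this is not part of the original analysis), the limiting objective (1.6) can be minimized numerically over the nonnegative orthant. The sketch below uses an arbitrary made-up positive semidefinite $A$ with $p = 2$ and $q = 1$, and scipy's bounded optimizer as an illustrative choice:

```python
import numpy as np
from scipy.optimize import minimize

def chi(z, A, p):
    """chi(z) = z'Az + (z_1 ... z_p)^(-1/2); +inf if a continuous-direction
    component is nonpositive."""
    z = np.asarray(z)
    if np.any(z[:p] <= 0):
        return np.inf
    return z @ A @ z + 1.0 / np.sqrt(np.prod(z[:p]))

# illustrative A for p = 2 continuous and q = 1 discrete smoothing parameter
B = np.array([[1.0, 0.5, 0.2],
              [0.3, 1.0, 0.1]])
A = B.T @ B                      # symmetric positive semidefinite by construction
p = 2

res = minimize(chi, x0=np.ones(3), args=(A, p), method="L-BFGS-B",
               bounds=[(1e-8, None)] * 3)
print("z* =", res.x, " chi(z*) =", res.fun)
```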

Hall, Racine, and Li (2004) have considered the CV selection of smoothing parameters in a conditional probability (density) estimation framework and show that their CV objective function also has a leading term of the form given in (1.6), with of course a different definition of $B_j(x)$ for $j = 1,\dots,p + q$. Therefore, the leading term of the CV objective function, in either a regression or a conditional probability model, has the expression given by (1.6). The uniqueness of the CV selected optimal smoothing parameters relies on the uniqueness of a nonnegative vector $z^* = (z_1^*,\dots,z_{p+q}^*)'$ that minimizes (1.6), where $z^* \in \mathbb{R}_+^{p+q} \equiv \{z \in \mathbb{R}^{p+q} : z_j \ge 0,\ j = 1,\dots,p+q\}$. Subsequently we will first focus on the simple case in which all covariates are continuous.

When $q = 0$ (no discrete covariates), all covariates are continuous random variables, and (1.6) becomes

$$\chi_c(z) = z'Az + (z_1 z_2 \cdots z_p)^{-1/2}, \tag{1.7}$$

with $z = (z_1,\dots,z_p)'$, and $A$ is now of dimension $p \times p$. The uniqueness of the CV selected optimal smoothing parameters $h_1,\dots,h_p$ hinges on the uniqueness of a vector $z^* = (z_1^*,\dots,z_p^*)'$ that minimizes (1.7). Let $z^*$ denote the vector of $z$ that minimizes $\chi_c(z)$ over $\mathbb{R}_+^p = \{z \in \mathbb{R}^p : z_j \ge 0,\ j = 1,\dots,p\}$; we ask that

$$z^* \text{ is unique, with } 0 < z_j^* < \infty \text{ for all } j = 1,\dots,p. \tag{1.8}$$

If (1.8) holds true, then the CV selected smoothing parameters are all well defined asymptotically. In fact, it follows from Hall, Li, and Racine (2004) and Hall, Racine, and Li (2004) that $\hat h_j/h_j^0 \to 1$ in probability, or equivalently, $n^{1/(4+p)}\hat h_j \to (z_j^*)^{1/2}$ in probability, where $h_j^0 = (z_j^*)^{1/2} n^{-1/(4+p)}$ is the benchmark nonstochastic optimal smoothing parameter ($j = 1,\dots,p$). The next theorem gives a simple necessary and sufficient condition for (1.8) to hold.

THEOREM 1.1. Assume that $q = 0$ so that $z = (z_1,\dots,z_p)'$; define

$$\mu = \inf\bigl\{z'Az : z \in \mathbb{R}_+^p,\ \|z\| = 1\bigr\}. \tag{1.9}$$

Then $\chi(z)$ has a unique minimizer $z^* = (z_1^*,\dots,z_p^*)'$ with $0 < z_j^* < \infty$ for all $j = 1,\dots,p$ if and only if $\mu > 0$.
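Numerically, $\mu$ can be approximated for a given $A$ by sampling the unit sphere restricted to the nonnegative orthant. The crude Monte Carlo sketch below (our own illustration, with two made-up $2 \times 2$ matrices) contrasts a positive definite $A$ with a singular $A$ whose null space meets the nonnegative orthant:

```python
import numpy as np

def mu_hat(A, n_samples=200_000, seed=0):
    """Crude Monte Carlo approximation of mu = min{ z'Az : z >= 0, ||z|| = 1 }:
    sample directions in the nonnegative orthant, normalize, take the minimum."""
    rng = np.random.default_rng(seed)
    Z = np.abs(rng.standard_normal((n_samples, A.shape[0])))
    Z /= np.linalg.norm(Z, axis=1, keepdims=True)
    return float(np.min(np.einsum("ij,jk,ik->i", Z, A, Z)))

A_pd = np.array([[2.0, 0.5], [0.5, 1.0]])           # positive definite
A_sing = np.array([[1.0, -1.0], [-1.0, 1.0]])       # singular; (1,1)' is in its null space
print("mu (positive definite A):", mu_hat(A_pd))    # > 0: unique interior minimizer
print("mu (singular A):         ", mu_hat(A_sing))  # ~ 0: no well-defined minimizer
```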

Next, we discuss the general case with a mixture of continuous and discrete covariates. Now $z = (z_1,\dots,z_{p+q})'$ and $A$ is a $(p+q)\times(p+q)$ symmetric positive semidefinite matrix. Let $z^* = (z_1^*,\dots,z_{p+q}^*)'$ denote a minimizer of $\chi(z_1,\dots,z_{p+q})$ over $\mathbb{R}_+^{p+q} = \{z \in \mathbb{R}^{p+q} : z_j \ge 0,\ j = 1,\dots,p+q\}$, where $\chi$ is defined by (1.6). We seek conditions that ensure the following result:

$$z^* \text{ is unique, with } 0 < z_j^* < \infty \ (j = 1,\dots,p) \text{ and } 0 \le z_{p+j}^* < \infty \ (j = 1,\dots,q). \tag{1.10}$$

Condition (1.10) will lead to asymptotically uniquely defined CV selected smoothing parameters $\hat h_1,\dots,\hat h_p$ and $\hat\lambda_1,\dots,\hat\lambda_q$. We partition the $A$ matrix as

$$A = \begin{pmatrix} A_{11} & A_{12} \\ A_{12}' & A_{22} \end{pmatrix},$$

where $A_{11}$ is of dimension $p \times p$, $A_{22}$ is of dimension $q \times q$, and $A_{12}$ is of conformable dimension. The following theorem gives the existence and uniqueness of a minimizer for $\chi(z)$.

THEOREM 1.2. Let

$$\mu = \inf\bigl\{z'Az : z \in \mathbb{R}_+^{p+q},\ \|z\| = 1\bigr\}.$$

If $\mu > 0$, then $\chi$ has a minimizer $z^* = (z^{(1)*}, z^{(2)*}) \in \mathbb{R}_+^{p+q}$ with $\chi(z^*) < +\infty$, and a necessary and sufficient condition for a point $z = (z^{(1)}, z^{(2)}) \in \mathbb{R}_+^{p+q}$, where $z^{(1)} = (z_1,\dots,z_p)'$ and $z^{(2)} = (z_{p+1},\dots,z_{p+q})'$, to be a minimizer of $\chi$ is that $z^{(1)} = z^{(1)*}$ and $z^{(2)} = z^{(2)*} + z^{(2)0}$ for some $z^{(2)0} \in N(A_{22})$, the null space of $A_{22}$.¹

¹ The null space of $A_{22}$ is defined as $N(A_{22}) = \{v \in \mathbb{R}^q : A_{22}v = 0\}$.

In particular, if $q = 0$ or $A_{22}$ is positive definite, then the Hessian (second derivative) matrix $\nabla^2\chi(z)$ of $\chi$ is positive definite at every point $z \in \mathbb{R}_+^{p+q}$ with $\chi(z) < +\infty$. Thus $\chi$ has a unique minimizer $z^*$ satisfying (1.10).
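In practice, whether the "in particular" case applies can be checked directly from $A$. The sketch below (our own illustrative helper, not from the paper) tests whether $A_{22}$ is positive definite via its eigenvalues and, if not, reports a basis of $N(A_{22})$:

```python
import numpy as np

def uniqueness_diagnostics(A, p):
    """Sufficient condition of Theorem 1.2: if q = 0 or A22 is positive
    definite, the minimizer of chi is unique. Otherwise return a basis of
    the null space N(A22), which indexes the possible minimizers."""
    A22 = A[p:, p:]
    if A22.size == 0:
        return "q = 0: unique minimizer"
    eigvals, eigvecs = np.linalg.eigh(A22)
    if eigvals.min() > 1e-10:
        return "A22 positive definite: unique minimizer"
    return eigvecs[:, eigvals < 1e-10]       # columns spanning N(A22)

# the p = 1, q = 2 example discussed at the end of Section 2: A22 is singular
A = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 1.0],
              [0.0, 1.0, 1.0]])
print(uniqueness_diagnostics(A, p=1))        # basis of N(A22), proportional to (1, -1)'
```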

2. PROOFS AND DISCUSSIONS

Proof of Theorem 1.1. The "if" part of Theorem 1.1 is a special case of Theorem 1.2 with $q = 0$. Thus we only need to prove the "only if" part. Suppose $\mu = 0$ is attained at some $z^* \in \mathbb{R}_+^p$ with $\|z^*\| = 1$. If $z_i^* \ne 0$ for all $i = 1,\dots,p$, then $\chi(tz^*) = t^2 z^{*\prime}Az^* + t^{-p/2}(z_1^*\cdots z_p^*)^{-1/2} \to 0$ as $t \to +\infty$. This implies that $\chi$ has no minimizer. If $z_i^* = 0$ for some $1 \le i \le p$, without loss of generality, we assume that $z_1^* = \cdots = z_r^* = 0$ for some $1 \le r \le p - 1$. Let $\varepsilon > 0$ be chosen such that $p(1 - \varepsilon) > r$. Let $\bar z = (\bar z_1,\dots,\bar z_p)'$ with $\bar z_i = 1$ for $1 \le i \le r$ and $\bar z_i = 0$ for $r + 1 \le i \le p$. Consider $z(t) = t^{\varepsilon - 1} z^* + t^{\varepsilon}\bar z \in \mathbb{R}_+^p$ for all $t > 0$, because $\mathbb{R}_+^p$ is a convex cone. Noting that $z^{*\prime}Az^* = 0$ and $A$ positive semidefinite imply $Az^* = 0$, we have

$$\chi(z(t)) = t^{2\varepsilon}\,\bar z'A\bar z + t^{-[p(\varepsilon-1)+r]/2}\Bigl(\prod_{i=r+1}^{p} z_i^*\Bigr)^{-1/2},$$

and because $p(\varepsilon - 1) + r < 0$, this implies that $\chi(z(t)) \to 0$ as $t \to 0$. Therefore $\chi$ has no minimizer. █
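The construction in the "only if" part can be followed numerically (our own illustration): take $p = 2$ and $A = \mathrm{diag}(1, 0)$, so that $\mu = 0$ is attained at $z^* = (0, 1)'$ with one zero component, and trace $\chi$ along the path $z(t)$:

```python
import numpy as np

# A is PSD with mu = 0, attained at z* = (0, 1)' (r = 1 zero component).
# Take eps = 0.25 so that p(1 - eps) = 1.5 > r = 1, and follow
# z(t) = t**(eps - 1) * z_star + t**eps * zbar.
A = np.array([[1.0, 0.0], [0.0, 0.0]])
z_star = np.array([0.0, 1.0])
zbar = np.array([1.0, 0.0])
eps = 0.25

def chi(z):
    return z @ A @ z + 1.0 / np.sqrt(np.prod(z))

for t in [1.0, 0.1, 0.01, 1e-4, 1e-8]:
    z_t = t ** (eps - 1.0) * z_star + t ** eps * zbar
    print(f"t = {t:8.0e}   chi(z(t)) = {chi(z_t):.6f}")
# Here chi(z(t)) = t**0.5 + t**0.25 -> 0 as t -> 0, so chi has no minimizer.
```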

Remark 2.1. From the proof of Theorem 1.1 we know that μ > 0 is a necessary and sufficient condition for the existence of a minimizer z* that minimizes χc(z); the uniqueness of the minimizer z* comes from the fact that the Hessian matrix of χc(z) is positive definite.

Note that in Theorem 1.1 $\mu$ is defined as the infimum of $z'Az$, not of $\chi_c(z)$, as it does not contain the term $(z_1 z_2 \cdots z_p)^{-1/2}$. Also note that the minimization is done over the unit sphere restricted to the nonnegative orthant. Theorem 1.1 states that $\mu > 0$ is a necessary and sufficient condition for the existence of a unique minimizer $z^*$ with each component $z_j^*$ ($j = 1,\dots,p$) positive and finite. This condition is substantially weaker than the requirement that $A$ be a positive definite matrix, as assumed in Hall, Racine, and Li (2004) and Li and Racine (2004). It is obvious that when $A$ is positive definite, then $\mu > 0$ because $z \ne 0$ when restricted to $\|z\| = 1$. However, consider the LL regression case with $p = 2$ and $g(x_1,x_2) = x_1^2 + x_2^2$; then $g_{11}(x) = g_{22}(x) = 2$, and this leads to

$$A = c\begin{pmatrix} 1 & 1 \\ 1 & 1 \end{pmatrix},$$

where $c > 0$ is a constant. Thus, $A$ is a singular matrix, and hence it is not positive definite. Nevertheless, it is easy to check that $\mu > 0$ because in this case $z'Az = c(z_1 + z_2)^2 > 0$ for any $z \in \mathbb{R}_+^2$ with $\|z\| = 1$. Therefore, by Theorem 1.1 we know that $z^*$ is uniquely defined with $0 < z_j^* < \infty$ ($j = 1,2$); this implies that the CV selected smoothing parameters are well defined. In fact, $n^{1/6}\hat h_j \to (z_j^*)^{1/2}$ in probability for $j = 1,2$. This result is quite intuitive; given that $g(x)$ is nonlinear in both $x_1$ and $x_2$, one would expect that the CV selected smoothing parameters should converge to zero at the rate of $O_p(n^{-1/(4+p)}) = O_p(n^{-1/6})$.
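A quick numerical check of this example (ours, for illustration): $A = c(1\ 1;1\ 1)$ is singular, yet $z'Az = c(z_1 + z_2)^2$ stays bounded away from zero on the relevant part of the unit sphere:

```python
import numpy as np

c = 1.0                                     # any c > 0
A = c * np.array([[1.0, 1.0], [1.0, 1.0]])  # singular: eigenvalues are 0 and 2c
print("eigenvalues of A:", np.linalg.eigvalsh(A))

# mu = min z'Az over z >= 0 with ||z|| = 1; parameterize z = (cos t, sin t), t in [0, pi/2]
t = np.linspace(0.0, np.pi / 2, 10001)
z = np.stack([np.cos(t), np.sin(t)], axis=1)
mu = np.min(np.einsum("ij,jk,ik->i", z, A, z))
print("mu =", mu)                           # equals c (attained on the axes), hence > 0
```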

Proof of Theorem 1.2. It is clear that $\mathbb{R}_+^{p+q} = \{z \in \mathbb{R}^{p+q} : z_j \ge 0,\ j = 1,\dots,p+q\}$ is a convex cone in $\mathbb{R}^{p+q}$. For each $z \in \mathbb{R}_+^{p+q}$, we write $z = (z^{(1)}, z^{(2)})$, where $z^{(1)} = (z_1,\dots,z_p)'$ and $z^{(2)} = (z_{p+1},\dots,z_{p+q})'$. We have

$$z'Az = z^{(1)\prime}A_{11}z^{(1)} + 2z^{(1)\prime}A_{12}z^{(2)} + z^{(2)\prime}A_{22}z^{(2)}.$$

By the definition (1.6), $\chi$ is a lower semicontinuous function from $\mathbb{R}_+^{p+q}$ to $(0, +\infty]$. For each $z \in \mathbb{R}_+^{p+q}$ with $\|z\| = 1$ and $t > 0$, we have $\chi(tz) \ge t^2 z'Az \ge \mu t^2$. For $r > 0$, denote $B_r = \{z \in \mathbb{R}_+^{p+q} : \|z\| \le r\}$. Thus there exists $R > 0$ such that

$$\inf_{z \in \mathbb{R}_+^{p+q}} \chi(z) = \inf_{z \in B_R} \chi(z).$$

Because $B_R$ is a nonempty compact set, by the Weierstrass theorem, the lower semicontinuous function $\chi$ attains its minimum at some $z^* \in B_R$ with $\chi(z^*) < +\infty$.

To continue our proof of the theorem, let us examine the Hessian (the second-order derivative) matrix $\nabla^2\chi(z)$ of $\chi$ at each point $z \in \mathbb{R}_+^{p+q}$ with $\chi(z) < +\infty$. A direct calculation shows that

$$\nabla^2\chi(z) = 2A + \tfrac{1}{4}(z_1 z_2 \cdots z_p)^{-1/2}\begin{pmatrix} 2G + J & 0 \\ 0 & 0 \end{pmatrix}, \tag{2.1}$$

where $G$ is a $p \times p$ diagonal matrix with its $j$th diagonal element given by $1/z_j^2$ for $j = 1,\dots,p$, and $J$ is a $p \times p$ matrix with its $(j,s)$th element given by $1/(z_j z_s)$, $j,s = 1,\dots,p$; that is, $J = (z_1^{-1},\dots,z_p^{-1})'(z_1^{-1},\dots,z_p^{-1})$ is positive semidefinite. Thus $2G + J$ is a symmetric positive definite matrix. Because $A$ is symmetric positive semidefinite, $2A$ is always symmetric positive semidefinite. The case $q = 0$ implies that $\nabla^2\chi(z)$ is positive definite because $2G + J$ is positive definite; the case $q > 0$ with $A_{22}$ positive definite implies that the sum of the two matrices on the right-hand side of (2.1) is positive definite. That is, the Hessian matrix $\nabla^2\chi(z)$ is positive definite at any point $z \in \mathbb{R}_+^{p+q}$ with $\chi(z) < +\infty$. Thus, $\chi(z)$ has a unique minimizer.
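Expression (2.1) can be checked numerically by comparing it with a finite-difference Hessian of $\chi$ at an interior point; the sketch below (with an arbitrary made-up positive semidefinite $A$) is our own verification aid, not part of the paper:

```python
import numpy as np

def chi(z, A, p):
    return z @ A @ z + 1.0 / np.sqrt(np.prod(z[:p]))

def hessian_analytic(z, A, p):
    """2A + (1/4)(z_1...z_p)^(-1/2) * block-diag(2G + J, 0), as in (2.1)."""
    G = np.diag(1.0 / z[:p] ** 2)
    v = (1.0 / z[:p]).reshape(-1, 1)
    J = v @ v.T
    H = 2.0 * A
    H[:p, :p] += 0.25 / np.sqrt(np.prod(z[:p])) * (2.0 * G + J)
    return H

def hessian_fd(z, A, p, eps=1e-5):
    """Central finite-difference Hessian of chi."""
    m = len(z)
    H = np.zeros((m, m))
    for i in range(m):
        for j in range(m):
            zpp = z.copy(); zpp[i] += eps; zpp[j] += eps
            zpm = z.copy(); zpm[i] += eps; zpm[j] -= eps
            zmp = z.copy(); zmp[i] -= eps; zmp[j] += eps
            zmm = z.copy(); zmm[i] -= eps; zmm[j] -= eps
            H[i, j] = (chi(zpp, A, p) - chi(zpm, A, p)
                       - chi(zmp, A, p) + chi(zmm, A, p)) / (4 * eps ** 2)
    return H

rng = np.random.default_rng(2)
p, q = 2, 1
B = rng.standard_normal((3, p + q))
A = B.T @ B                                    # illustrative PSD A
z = np.array([0.7, 1.3, 0.5])
print(np.max(np.abs(hessian_analytic(z, A, p) - hessian_fd(z, A, p))))
# small: agreement up to finite-difference error
```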

To prove the necessary and sufficient condition, let $z^* = (z^{(1)*}, z^{(2)*})$ be a minimizer of $\chi$. If $z$ is another minimizer of $\chi$, then let $\chi(z^*) = \chi(z) = m$. Denote $z(\alpha) = \alpha z + (1 - \alpha)z^*$ for $0 \le \alpha \le 1$. Because $\chi$ is convex, we have

$$m \le \chi(z(\alpha)) \le \alpha\chi(z) + (1 - \alpha)\chi(z^*) = m,$$

which implies $\chi(z(\alpha)) = m$ for all $0 \le \alpha \le 1$. Because

$$\chi(z(\alpha)) = \chi(z^*) + \nabla\chi(z^*)'(z(\alpha) - z^*) + \tfrac{1}{2}(z(\alpha) - z^*)'\nabla^2\chi(z^*)(z(\alpha) - z^*) + o(\|z(\alpha) - z^*\|^2),$$

where the last term $o(\|z(\alpha) - z^*\|^2)$ represents a higher order term, and $\nabla\chi(z^*)'(z(\alpha) - z^*) \ge 0$ because $z^*$ minimizes $\chi$ over the convex set $\mathbb{R}_+^{p+q}$, we must have (letting $\alpha \to 0$ and noting $z(\alpha) - z^* = \alpha(z - z^*)$) $(z - z^*)'\nabla^2\chi(z^*)(z - z^*) = 0$. By (2.1), this can be true only if $z^{(1)} = z^{(1)*}$. Then we have $z(\alpha)'Az(\alpha) = z^{*\prime}Az^* = C$. Denote $h(\alpha) = z(\alpha)'Az(\alpha) = (2\alpha^2 - 2\alpha + 1)C + (2\alpha - 2\alpha^2)z'Az^*$ for $0 \le \alpha \le 1$. For $0 < \alpha < 1$, we have $0 = h'(\alpha) = (4\alpha - 2)C + (2 - 4\alpha)z'Az^*$, which leads to $z'Az^* = C$, and then $(z - z^*)'A(z - z^*) = 0$. Because $A$ is symmetric positive semidefinite, this implies $A(z - z^*) = 0$, and then $A_{22}(z^{(2)} - z^{(2)*}) = 0$. Thus $z^{(2)} = z^{(2)*} + z^{(2)0}$, where $z^{(2)0} \equiv z^{(2)} - z^{(2)*} \in N(A_{22})$.

Conversely, if $z \in \mathbb{R}_+^{p+q}$ with $z^{(1)} = z^{(1)*}$ and $z^{(2)} = z^{(2)*} + z^{(2)0}$ for some $z^{(2)0} \in N(A_{22})$, to prove that $z$ is a minimizer of $\chi$, we only have to show that $z'Az = z^{*\prime}Az^*$. But this can be easily verified by substituting $z = z^* + (0, z^{(2)0})$. This completes the proof of Theorem 1.2. █

Let us apply Theorem 1.2 to show how to determine the existence and uniqueness of a minimizer for a simple case of $p = 1$ and $q = 2$ with

$$A = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 1 \\ 0 & 1 & 1 \end{pmatrix}.$$

Then $z'Az = z_1^2 + (z_2 + z_3)^2$, and it is easy to see that $\mu > 0$ in this case. So by Theorem 1.2 we know there exists a minimizer $z^*$. However, $q = 2$ and $A_{22}$ is not positive definite, so from the last part of Theorem 1.2 we cannot infer the uniqueness of $z^*$. Nevertheless, it is easy to check that in this case

$$\chi(z) = z_1^2 + (z_2 + z_3)^2 + z_1^{-1/2} \ge z_1^2 + z_1^{-1/2},$$

and that $z^* = (4^{-2/5}, 0, 0)'$ is a minimizer of $\chi(z)$. Let $z = (z_1, z_2, z_3)' \in \mathbb{R}_+^3$ be another minimizer of $\chi$. By the second part of Theorem 1.2, we have $z_1 = z_1^*$ and $z^{(2)} = (z_2, z_3)' = z^{(2)0}$ for some $z^{(2)0} \in N(A_{22})$ (because $z^{(2)*} = (0,0)'$). However, $z^{(2)0} \in N(A_{22})$ implies that $z_3 = -z_2$; this together with $z_2 \ge 0$ and $z_3 \ge 0$ implies that $z^{(2)} = (0,0)'$. Hence, $z = z^*$, and $z^*$ is the unique minimizer of $\chi(z)$.
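A direct numerical minimization reproduces this conclusion (our own check, with scipy's bounded optimizer as an illustrative choice):

```python
import numpy as np
from scipy.optimize import minimize

A = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 1.0],
              [0.0, 1.0, 1.0]])

def chi(z):
    return z @ A @ z + 1.0 / np.sqrt(z[0])          # p = 1, q = 2

res = minimize(chi, x0=np.array([1.0, 1.0, 1.0]), method="L-BFGS-B",
               bounds=[(1e-8, None), (0.0, None), (0.0, None)])
print("numerical z*:", res.x)
print("analytic  z*:", [4 ** (-2 / 5), 0.0, 0.0])   # z1 solves 2*z1 = 0.5*z1**(-1.5)
```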

REFERENCES

Clarke, R.M. (1975) A calibration curve for radiocarbon dates. Antiquity 49, 251-256.
Fan, J. & I. Gijbels (1995) Data-driven bandwidth selection in local polynomial regression: Variable bandwidth selection and spatial adaptation. Journal of the Royal Statistical Society, Series B 57, 371-394.
Härdle, W., P. Hall, & J.S. Marron (1988) How far are automatically chosen regression smoothing parameters from their optimum? Journal of the American Statistical Association 83, 86-99.
Härdle, W. & J.S. Marron (1985) Optimal bandwidth selection in nonparametric regression function estimation. Annals of Statistics 13, 1465-1481.
Hall, P., Q. Li, & J. Racine (2004) Estimation of Regression Function in the Presence of Irrelevant Variables. Working paper, Department of Economics, Texas A&M University.
Hall, P., J. Racine, & Q. Li (2004) Cross-validation and the estimation of conditional probability densities. Journal of the American Statistical Association 99, 1015-1026.
Li, Q. & J. Racine (2003) Nonparametric estimation of distributions with categorical and continuous data. Journal of Multivariate Analysis 86, 266-292.
Li, Q. & J. Racine (2004) Cross-validation on local linear estimators. Statistica Sinica 14, 485-512.