We investigate the issue of the uniqueness of the cross-validation selected smoothing parameters in kernel estimation of multivariate nonparametric regression or conditional probability functions. When the covariates are all continuous variables, we provide a necessary and sufficient condition, and when the covariates are a mixture of categorical and continuous variables, we provide a simple sufficient condition that guarantees asymptotically the uniqueness of the cross-validation selected smoothing parameters.

We thank a referee for the constructive comments.
The kernel method is the most popular technique used in the estimation of nonparametric/semiparametric models, and it is well known that the selection of smoothing parameters in nonparametric kernel estimation is of crucial importance. In the context of a regression model, Clarke (1975) proposes the leave-one-out least squares cross-validation method for selecting the smoothing parameters. The asymptotic optimality of this approach is studied by Härdle and Marron (1985) and Härdle, Hall, and Marron (1988) in the context of a univariate regression model, and Fan and Gijbels (1995) have studied bandwidth selection in the context of local polynomial kernel regression. For a regression model with a single (univariate) continuous regressor, Härdle and Marron (1985) and Härdle et al. (1988) show that the cross-validation function has the following expression:
$$CV(h) = \frac{1}{n}\sum_{i=1}^{n}\,[Y_i - \hat g_{-i}(X_i)]^2 w(X_i) = C_1 h^4 + \frac{C_2}{nh} + (\text{terms unrelated to } h) + o_p\big(h^4 + (nh)^{-1}\big), \tag{1.1}$$
where $\hat g_{-i}(X_i)$ is the leave-one-out local constant kernel estimator of $g(X_i) \equiv E(Y_i|X_i)$, $k(\cdot)$ is a second-order kernel function, $h$ is the smoothing parameter, $w(\cdot)$ is a weight function, $C_1 = \int \{(\kappa_2/2)[g''(x) f(x) + 2g'(x) f'(x)]\}^2 w(x) f(x)^{-1}\,dx$, $C_2 = \kappa \int \sigma^2(x) w(x)\,dx$, $\kappa_2 = \int k(v) v^2\,dv$, $\kappa = \int k(v)^2\,dv$, $g'(\cdot)$ and $g''(\cdot)$ denote first- and second-order derivative functions, and $\sigma^2(x) = \operatorname{Var}(Y_i|X_i = x)$.
The terms $C_1 h^4$ and $C_2/(nh)$ in (1.1) are the leading squared bias and variance of $CV(h)$, respectively. Let $\hat h$ denote the cross-validation selected smoothing parameter that minimizes $CV(h)$; then from (1.1) it is easy to show that $\hat h/h_0 \to 1$ in probability, where $h_0 = [C_2/(4C_1)]^{1/5} n^{-1/5}$. Note that $C_1$ is nonnegative and $C_2 > 0$. Therefore, a necessary and sufficient condition for the existence of the unique benchmark nonstochastic optimal smoothing parameter $h_0$ is that $C_1 > 0$. The assumption that $C_1 > 0$ puts some restrictions on $g(\cdot)$; for example, $g(\cdot)$ cannot be a constant function. A similar necessary and sufficient condition guarantees an asymptotically uniquely defined cross-validation selected smoothing parameter in estimating a conditional probability density function (p.d.f.) with a univariate continuous conditioning variable.
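To illustrate these formulas concretely, the following is a minimal numerical sketch (not taken from any of the cited papers) that computes $C_1$, $C_2$, and the benchmark bandwidth $h_0$ under assumed ingredients: a Gaussian kernel (so $\kappa_2 = 1$ and $\kappa = 1/(2\sqrt{\pi})$), $g(x) = x^2$, a standard normal design density, $\sigma^2(x) = 1$, and the weight $w(x) = \mathbf{1}(|x| \le 2)$, which restricts the integrals to $[-2, 2]$.

```python
# Minimal sketch (assumed design, see text): compute C1, C2, and
# h0 = [C2/(4*C1)]^(1/5) * n^(-1/5) by numerical integration.
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

kappa2 = 1.0                              # int k(v) v^2 dv for the Gaussian kernel
kappa = 1.0 / (2.0 * np.sqrt(np.pi))      # int k(v)^2 dv for the Gaussian kernel

f = norm.pdf                              # assumed design density f(x)
fp = lambda x: -x * norm.pdf(x)           # f'(x)
g1 = lambda x: 2.0 * x                    # g'(x) for the assumed g(x) = x^2
g2 = lambda x: 2.0                        # g''(x)
sigma2 = lambda x: 1.0                    # assumed conditional variance

# C1 = int { (kappa2/2) [g''(x) f(x) + 2 g'(x) f'(x)] }^2 w(x) f(x)^{-1} dx
C1, _ = quad(lambda x: ((kappa2 / 2.0) * (g2(x) * f(x) + 2.0 * g1(x) * fp(x))) ** 2 / f(x),
             -2.0, 2.0)
# C2 = kappa * int sigma^2(x) w(x) dx
C2, _ = quad(lambda x: kappa * sigma2(x), -2.0, 2.0)

n = 500
h0 = (C2 / (4.0 * C1)) ** 0.2 * n ** (-0.2)
print(f"C1 = {C1:.4f}, C2 = {C2:.4f}, h0 = {h0:.4f}")   # C1 > 0 here, so h0 is well defined
```

Because $g$ is not constant in this assumed design, $C_1 > 0$ and the benchmark bandwidth is uniquely determined.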
The cross-validation procedure can be easily extended to multivariate settings (regression or p.d.f. estimation) for selecting the smoothing parameters. However, the conditions that ensure the uniqueness of the cross-validation selected smoothing parameters become more complex. Recently, Hall, Racine, and Li (2004), Hall, Li, and Racine (2004), and Li and Racine (2003, 2004) have considered the problem of nonparametric estimation of conditional density and regression functions with mixed discrete and continuous data. They propose data-driven cross-validation (CV) methods for selecting the smoothing parameters, and they have shown that the CV selected smoothing parameters are asymptotically equivalent to the nonstochastic optimal smoothing parameters that minimize the asymptotic weighted estimation mean squared error. However, when discussing the existence of the asymptotically uniquely defined optimal smoothing parameters, Hall, Racine, and Li (2004) and Li and Racine (2004) impose overly strong conditions. In this note we provide substantially weaker sufficient conditions that guarantee the existence of uniquely defined CV selected optimal smoothing parameters. We show that when all covariates are continuous random variables, the condition is necessary and sufficient for the existence of uniquely defined optimal smoothing parameters.
We consider a nonparametric regression model with mixed discrete and continuous covariates:
$$Y_i = g(X_i) + u_i, \qquad i = 1,\ldots,n,$$
where $g(\cdot)$ has an unknown functional form, $E(u_i|X_i) = 0$, $X_i = (X_i^c, X_i^d)$, $X_i^d$ is a $q \times 1$ vector of regressors that assume discrete values, and $X_i^c \in \mathbb{R}^p$ are the remaining continuous regressors. We use $X_{ij}^d$ to denote the $j$th component of $X_i^d$, and we assume that $X_{ij}^d$ takes $c_j \ge 2$ different values, that is, $X_{ij}^d \in \{0,1,\ldots,c_j-1\}$ for $j = 1,\ldots,q$. We use $D = \prod_{j=1}^{q}\{0,1,\ldots,c_j-1\}$ to denote the range assumed by $x^d$. We are interested in estimating $g(x) = E(Y_i|X_i = x)$ by the nonparametric kernel method. We use $f(x) = f(x^c, x^d)$ to denote the joint density function. For $x^c = (x_1^c,\ldots,x_p^c)$ we use the product kernel
$$K_c(x^c, X_i^c) = \prod_{j=1}^{p} h_j^{-1}\, k\!\left(\frac{x_j^c - X_{ij}^c}{h_j}\right),$$
where $k$ is a symmetric, univariate density function and $0 < h_j < \infty$ is the smoothing parameter for $x_j^c$. For a discrete regressor we define, for $1 \le j \le q$,
$$l(x_j^d, X_{ij}^d, \lambda_j) = \begin{cases} 1 & \text{if } X_{ij}^d = x_j^d,\\ \lambda_j & \text{if } X_{ij}^d \neq x_j^d,\end{cases}$$
where $0 \le \lambda_j \le 1$ is the smoothing parameter for $x_j^d$. Therefore, the product kernel for $x^d = (x_1^d,\ldots,x_q^d)$ is given by $K_d(x^d, X_i^d) = \prod_{j=1}^{q} l(x_j^d, X_{ij}^d, \lambda_j)$. The kernel function for the mixed regressors $x = (x^c, x^d)$ is simply the product of $K_c$ and $K_d$, that is, $K(x, X_i) = K_c(x^c, X_i^c)\, K_d(x^d, X_i^d)$. The nonparametric estimate of $g(x)$ is given by
$$\hat g(x) = \frac{\sum_{i=1}^{n} Y_i\, K(x, X_i)}{\sum_{i=1}^{n} K(x, X_i)}.$$
We choose $(h, \lambda) = (h_1,\ldots,h_p,\lambda_1,\ldots,\lambda_q)$ by minimizing the following CV function:
$$CV(h,\lambda) = \frac{1}{n}\sum_{i=1}^{n}\,[Y_i - \hat g_{-i}(X_i)]^2 w(X_i), \tag{1.4}$$
where $\hat g_{-i}(X_i)$ is the leave-one-out local-constant (LC) kernel estimator of $g(X_i)$ and $0 \le w(\cdot) \le 1$ is a weight function that serves to avoid difficulties caused by dividing by zero, or by the slow convergence rate when $X_i$ is near the boundary of the support of $X$.
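As a concrete illustration (this code is not from the cited papers), the sketch below implements the leave-one-out LC estimator and the CV criterion (1.4) for simulated data with one continuous and one binary regressor. It uses a Gaussian kernel for the continuous component (its normalizing constant cancels in the LC ratio) and the kernel that equals $1$ when the discrete values match and $\lambda_j$ otherwise; the data-generating process, the sample size, and the choice $w(\cdot) \equiv 1$ are arbitrary assumptions made for the illustration.

```python
# Illustrative sketch of the leave-one-out local-constant CV criterion (1.4)
# with the mixed (continuous + discrete) product kernel described in the text.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
n, p, q = 200, 1, 1
Xc = rng.normal(size=(n, p))                  # continuous regressors
Xd = rng.integers(0, 2, size=(n, q))          # binary discrete regressor
Y = Xc[:, 0] ** 2 + 0.5 * Xd[:, 0] + 0.2 * rng.normal(size=n)   # assumed DGP

def loo_cv(params):
    """Leave-one-out local-constant CV(h, lambda) with weight w(.) = 1."""
    h = np.exp(params[:p])                    # enforce h_j > 0
    lam = 1.0 / (1.0 + np.exp(-params[p:]))   # enforce 0 < lambda_j < 1
    # product kernel K(X_i, X_l) for all pairs (i, l)
    Kc = np.prod(np.exp(-0.5 * ((Xc[:, None, :] - Xc[None, :, :]) / h) ** 2) / h, axis=2)
    Kd = np.prod(np.where(Xd[:, None, :] == Xd[None, :, :], 1.0, lam), axis=2)
    K = Kc * Kd
    np.fill_diagonal(K, 0.0)                  # leave-one-out: drop own observation
    ghat = K @ Y / np.maximum(K.sum(axis=1), 1e-12)
    return np.mean((Y - ghat) ** 2)

res = minimize(loo_cv, x0=np.zeros(p + q), method="Nelder-Mead")
print("CV-selected h:", np.exp(res.x[:p]), " lambda:", 1.0 / (1.0 + np.exp(-res.x[p:])))
```

The unconstrained reparameterization ($h_j = e^{t_j}$, $\lambda_j$ through a logistic map) is simply one convenient way to respect the constraints $h_j > 0$ and $0 \le \lambda_j \le 1$ during the numerical search.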
Define an indicator function
$$I_j(v^d, x^d) = \mathbf{1}(v_j^d \neq x_j^d)\prod_{s \neq j} \mathbf{1}(v_s^d = x_s^d).$$
Note that $I_j(v^d, x^d) = 1$ if and only if $v^d$ and $x^d$ differ only in their $j$th component. Letting $m_j(x)$ and $m_{jj}(x)$ ($m = g$ or $m = f$) denote the first-order and second-order partial derivatives of $m(x^c, x^d)$ with respect to $x_j^c$, Hall, Li, and Racine (2004) have shown that (with $\int dx = \sum_{x^d \in D}\int dx^c$, where $D$ is the support of $X^d$)
$$CV_{LC}(h,\lambda) = \int\Big\{\sum_{j=1}^{p}\frac{\kappa_2 h_j^2}{2}\,\frac{g_{jj}(x) f(x) + 2g_j(x) f_j(x)}{f(x)} + \sum_{j=1}^{q}\lambda_j\sum_{v^d\in D} I_j(v^d, x^d)\,\frac{[g(x^c, v^d) - g(x)]\, f(x^c, v^d)}{f(x)}\Big\}^2 w(x) f(x)\,dx + \frac{\kappa^p\int\sigma^2(x) w(x)\,dx}{n h_1\cdots h_p} + (\text{terms unrelated to } (h,\lambda)) + (\text{s.o.}). \tag{1.5}$$
The preceding result is based on LC kernel estimation. Li and Racine (2004) have considered the local linear (LL) CV method. The CV objective function is the same as that given in (1.4) but with $\hat g_{-i}(X_i)$ replaced by a leave-one-out LL kernel estimator. Li and Racine (2004) have shown that the resulting CV function has the same leading form as (1.5) with the term $2g_j(x) f_j(x)$ removed.
Define $z_j$ by $h_j^2 = n^{-2/(4+p)} z_j$ for $j = 1,\ldots,p$, and $z_{p+j}$ by $\lambda_j = n^{-2/(4+p)} z_{p+j}$ for $j = 1,\ldots,q$; then the leading terms (those depending on the smoothing parameters) of both $CV_{LC}(h,\lambda)$ and $CV_{LL}(h,\lambda)$ can be written in the form $c_0 n^{-4/(p+4)}\chi(z_1,\ldots,z_p,z_{p+1},\ldots,z_{p+q})$, where $c_0 = \kappa^p\int\sigma^2(x) w(x)\,dx > 0$ is a constant and
$$\chi(z) = z'Az + \prod_{j=1}^{p} z_j^{-1/2}, \tag{1.6}$$
where $z = (z_1,\ldots,z_{p+q})'$ (the prime denotes transpose) and $A$ is a $(p+q)\times(p+q)$ symmetric positive semidefinite matrix with its $(j,s)$th element given by $A(j,s) = \int B_j(x) B_s(x)\,dx$, where $B_j(x) = c_0^{-1/2}(\kappa_2/2)[g_{jj}(x) f(x) + 2g_j(x) f_j(x)]\, w(x)^{1/2} f(x)^{-1/2}$ (one removes $2g_j(x) f_j(x)$ for the local linear CV function) for $j = 1,\ldots,p$, and $B_{p+j}(x) = c_0^{-1/2}\sum_{v^d\in D} I_j(v^d, x^d)[g(x^c, v^d) - g(x)]\, f(x^c, v^d)\, w(x)^{1/2} f(x)^{-1/2}$ for $j = 1,\ldots,q$.
Hall, Racine, and Li (2004) have considered the CV selection of smoothing parameters in a conditional probability (density) estimation framework and show that their CV objective function also has a leading term of the form given in (1.6), of course with a different definition of $B_j(x)$ for $j = 1,\ldots,p+q$. Therefore, the leading term of the CV objective function, in either a regression or a conditional probability model, has the expression given by (1.6). The uniqueness of the CV selected optimal smoothing parameters relies on the uniqueness of a nonnegative vector $z^* = (z_1^*,\ldots,z_{p+q}^*)'$ that minimizes (1.6), where $z^* \in \mathbb{R}_+^{p+q} \equiv \{z \in \mathbb{R}^{p+q} : z_j \ge 0 \text{ for } j = 1,\ldots,p+q\}$. Subsequently we will first focus on the simple case in which all covariates are continuous.
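As a numerical illustration of minimizing (1.6) (this is not part of the paper's argument), the sketch below minimizes $\chi(z) = z'Az + \prod_{j\le p} z_j^{-1/2}$ over the nonnegative orthant for a hypothetical $A$ with $p = 2$ and $q = 1$; the matrix $B$ below is an arbitrary choice whose only purpose is to generate a valid symmetric positive semidefinite $A = B'B$.

```python
# Illustrative sketch: minimize chi(z) = z'Az + prod_{j<=p} z_j^{-1/2}
# over the nonnegative orthant for a hypothetical PSD matrix A (p = 2, q = 1).
import numpy as np
from scipy.optimize import minimize

p, q = 2, 1
B = np.array([[1.0, 0.5, 0.2],
              [0.0, 1.0, 0.3]])      # arbitrary; A = B'B is symmetric PSD by construction
A = B.T @ B

def chi(z):
    # quadratic (squared-bias) term plus the variance term, which involves only z_1, ..., z_p
    return z @ A @ z + np.prod(z[:p]) ** (-0.5)

bounds = [(1e-8, None)] * p + [(0.0, None)] * q   # z_1,...,z_p > 0; z_{p+1},... >= 0
res = minimize(chi, x0=np.ones(p + q), bounds=bounds, method="L-BFGS-B")
print("z* =", res.x, " chi(z*) =", res.fun)
```

In this particular example $A$ is singular (it has rank two), yet the numerical minimizer is finite and, because the $1\times 1$ block $A_{22}$ is positive, unique by the results discussed below.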
When $q = 0$ (no discrete covariates), all covariates are continuous random variables, and (1.6) becomes
$$\chi_c(z) = z'Az + \prod_{j=1}^{p} z_j^{-1/2}, \tag{1.7}$$
with $z = (z_1,\ldots,z_p)'$, and $A$ is now of dimension $p \times p$. The uniqueness of the CV selected optimal smoothing parameters $h_1,\ldots,h_p$ hinges on the uniqueness of a vector $z^* = (z_1^*,\ldots,z_p^*)'$ that minimizes (1.7). Let $z^*$ denote the vector of $z$ that minimizes $\chi_c(z)$ over $\mathbb{R}_+^p$; we ask that
$$z^* \text{ is unique, with } 0 < z_j^* < \infty \text{ for all } j = 1,\ldots,p. \tag{1.8}$$
If (1.8) holds true, then the CV selected smoothing parameters are all well defined asymptotically. In fact, it follows from Hall, Li, and Racine (2004) and Hall, Racine, and Li (2004) that $\hat h_j/h_j^0 \to 1$ in probability, or equivalently, $n^{1/(4+p)}\hat h_j \to (z_j^*)^{1/2}$ in probability, where $h_j^0 = (z_j^*)^{1/2} n^{-1/(4+p)}$ is the benchmark nonstochastic optimal smoothing parameter ($j = 1,\ldots,p$). The next theorem gives a simple necessary and sufficient condition for (1.8) to hold.
THEOREM 1.1. Assume that $q = 0$ so that $z = (z_1,\ldots,z_p)'$; define
$$\mu = \inf\{\, z'Az : z \in \mathbb{R}_+^p,\ \|z\| = 1 \,\}.$$
Then $\chi(z)$ has a unique minimizer $z^* = (z_1^*,\ldots,z_p^*)'$ with $0 < z_j^* < \infty$ for all $j = 1,\ldots,p$ if and only if $\mu > 0$.
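The condition in Theorem 1.1 can be checked numerically for a given $A$: the sketch below (illustrative only) estimates $\mu$ by minimizing $z'Az$ over the nonnegative part of the unit sphere, parameterizing $z = v^2/\|v^2\|$ and using a multistart search. The two matrices evaluated are hypothetical; the second is the singular matrix $c\big(\begin{smallmatrix}1 & 1\\ 1 & 1\end{smallmatrix}\big)$ (with $c = 1$) that reappears in the example following the theorem.

```python
# Illustrative sketch: estimate mu = inf{ z'Az : z in R_+^p, ||z|| = 1 } numerically.
import numpy as np
from scipy.optimize import minimize

def mu_hat(A, n_starts=20, seed=0):
    rng = np.random.default_rng(seed)
    p = A.shape[0]

    def obj(v):
        z = v ** 2                                  # z >= 0 by construction
        z = z / max(np.linalg.norm(z), 1e-12)       # ||z|| = 1 (with a numerical safeguard)
        return z @ A @ z

    best = np.inf
    for _ in range(n_starts):                       # multistart, since obj is not convex in v
        res = minimize(obj, rng.normal(size=p), method="Nelder-Mead")
        best = min(best, res.fun)
    return best

A_pd = np.array([[2.0, 0.5], [0.5, 1.0]])           # positive definite
A_sing = np.array([[1.0, 1.0], [1.0, 1.0]])         # singular, yet mu = 1 > 0
print(mu_hat(A_pd), mu_hat(A_sing))
```

Both reported values are strictly positive, so in both cases a unique, finite minimizer $z^*$ exists by Theorem 1.1.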
Next, we discuss the general case with a mixture of continuous and discrete covariates. Now, $z = (z_1,\ldots,z_{p+q})'$ and $A$ is a $(p+q)\times(p+q)$ symmetric positive semidefinite matrix. Let $z = (z_{(1)}', z_{(2)}')'$, where $z_{(1)} = (z_1,\ldots,z_p)'$ and $z_{(2)} = (z_{p+1},\ldots,z_{p+q})'$, and let $z^* = (z_{(1)}^{*\prime}, z_{(2)}^{*\prime})'$ denote a minimizer of $\chi(z_1,\ldots,z_{p+q})$. We seek conditions that ensure the following result:
$$z^* \text{ is unique, with } 0 < z_j^* < \infty \text{ for } j = 1,\ldots,p \text{ and } 0 \le z_{p+j}^* < \infty \text{ for } j = 1,\ldots,q. \tag{1.10}$$
Condition (1.10) will lead to asymptotically uniquely defined CV selected smoothing parameters $\hat h_1,\ldots,\hat h_p,\hat\lambda_1,\ldots,\hat\lambda_q$. We partition the $A$ matrix as
$$A = \begin{pmatrix} A_{11} & A_{12}\\ A_{12}' & A_{22}\end{pmatrix},$$
where $A_{11}$ is of dimension $p \times p$, $A_{22}$ is of dimension $q \times q$, and $A_{12}$ has a conformable dimension. The following theorem gives conditions for the existence and uniqueness of a minimizer of $\chi(z)$.
THEOREM 1.2. Let $\mu = \inf\{\, z'Az : z \in \mathbb{R}_+^{p+q},\ \|z\| = 1 \,\}$. If $\mu > 0$, then $\chi$ has a minimizer $z^* \in \mathbb{R}_+^{p+q}$ with $\chi(z^*) < +\infty$, and a necessary and sufficient condition for a point $z = (z_{(1)}', z_{(2)}')' \in \mathbb{R}_+^{p+q}$ to be a minimizer of $\chi$ is that $z_{(1)} = z_{(1)}^*$ and $z_{(2)} = z_{(2)}^* + z_{(2)}^0$ for some $z_{(2)}^0 \in \mathcal{N}(A_{22})$, the null space of $A_{22}$ (defined as $\mathcal{N}(A_{22}) = \{v \in \mathbb{R}^q : A_{22} v = 0\}$).

Moreover, if either $q = 0$, or $q > 0$ and $A_{22}$ is positive definite, then the Hessian matrix $\nabla^2\chi(z)$ of $\chi$ is positive definite at every point $z \in \mathbb{R}_+^{p+q}$ with $\chi(z) < +\infty$. Thus $\chi$ has a unique minimizer $z^*$ satisfying (1.10).
Proof of Theorem 1.1. The "if" part of Theorem 1.1 is a special case of Theorem 1.2 with $q = 0$. Thus we only need to prove the "only if" part. Let $\mu = 0$ be attained at some $z^* \in \mathbb{R}_+^p$ with $\|z^*\| = 1$. If $z_i^* \neq 0$ for all $i = 1,\ldots,p$, then $\chi(t z^*) \to 0$ as $t \to +\infty$. This implies that $\chi$ has no minimizer. If $z_i^* = 0$ for some $1 \le i \le p$, without loss of generality we assume that $z_1^* = \cdots = z_r^* = 0$ for some $1 \le r \le p-1$. Let $\varepsilon > 0$ be chosen such that $p(1-\varepsilon) > r$. Let $\bar z \in \mathbb{R}_+^p$ with $\bar z_i = 1$ for $1 \le i \le r$ and $\bar z_i = 0$ for $r+1 \le i \le p$. Consider $z(t) = t^{\varepsilon}\bar z + t^{\varepsilon-1} z^* \in \mathbb{R}_+^p$ for all $t > 0$, because $\mathbb{R}_+^p$ is a convex cone. We have (using $Az^* = 0$, which follows from $z^{*\prime}Az^* = 0$ and $A$ being positive semidefinite)
$$\chi(z(t)) = t^{2\varepsilon}\,\bar z'A\bar z + t^{-[p(\varepsilon-1)+r]/2}\prod_{j=r+1}^{p}(z_j^*)^{-1/2} \to 0 \quad\text{as } t \to 0,$$
because $p(\varepsilon - 1) + r < 0$. Therefore $\chi$ has no minimizer. █
Remark 2.1. From the proof of Theorem 1.1 we know that $\mu > 0$ is a necessary and sufficient condition for the existence of a minimizer $z^*$ of $\chi_c(z)$; the uniqueness of the minimizer $z^*$ comes from the fact that the Hessian matrix of $\chi_c(z)$ is positive definite.
Note that in Theorem 1.1, $\mu$ is defined as the infimum of $z'Az$, not of $\chi_c(z)$, as it does not contain the term $\prod_{j=1}^{p} z_j^{-1/2}$. Also note that the minimization is done over the unit sphere restricted to the first quadrant (the nonnegative orthant $\mathbb{R}_+^p$). Theorem 1.1 states that $\mu > 0$ is a necessary and sufficient condition for the existence of a unique minimizer $z^*$ with each component $z_j^*$ ($j = 1,\ldots,p$) positive and finite. This condition is substantially weaker than the requirement that $A$ be a positive definite matrix, as assumed in Hall, Racine, and Li (2004) and Li and Racine (2004). It is obvious that when $A$ is positive definite, then $\mu > 0$ because $z \neq 0$ when restricted to $\|z\| = 1$. However, consider the LL regression case with $p = 2$ and $g(x_1, x_2) = x_1^2 + x_2^2$; then $g_{11}(x) = g_{22}(x) = 2$, and this leads to
$$A = c\begin{pmatrix} 1 & 1\\ 1 & 1\end{pmatrix},$$
where $c > 0$ is a constant. Thus, $A$ is a singular matrix, and hence it is not positive definite. Nevertheless, it is easy to check that $\mu > 0$ because in this case $z'Az = c(z_1 + z_2)^2 > 0$ for any $z \in \mathbb{R}_+^2$ with $\|z\| = 1$. Therefore, by Theorem 1.1 we know that $z^*$ is uniquely defined with $0 < z_j^* < \infty$ ($j = 1,2$); this implies that the CV selected smoothing parameters are well defined. In fact, $\hat h_j = (z_j^*)^{1/2} n^{-1/6}(1 + o_p(1))$ for $j = 1,2$. This result is quite intuitive; given that $g(x)$ is nonlinear in both $x_1$ and $x_2$, one would expect the CV selected smoothing parameters to converge to zero at the rate $O_p(n^{-1/(4+p)}) = O_p(n^{-1/6})$.
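This example can also be checked numerically. The sketch below (illustrative only) sets $c = 1$, so $\chi_c(z) = (z_1 + z_2)^2 + (z_1 z_2)^{-1/2}$; the first-order conditions of this particular $\chi_c$ give the closed-form minimizer $z_1^* = z_2^* = (8c)^{-1/3}$, which the numerical search reproduces.

```python
# Illustrative check of the singular-A example: A = c * [[1, 1], [1, 1]] with c = 1.
import numpy as np
from scipy.optimize import minimize

c = 1.0
A = c * np.ones((2, 2))

def chi_c(z):
    return z @ A @ z + (z[0] * z[1]) ** (-0.5)

res = minimize(chi_c, x0=np.array([1.0, 1.0]),
               bounds=[(1e-8, None)] * 2, method="L-BFGS-B")
print("numerical z* :", res.x)                      # approximately [0.5, 0.5]
print("closed form  :", (8.0 * c) ** (-1.0 / 3.0))  # both components equal (8c)^{-1/3}
```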
Proof of Theorem 1.2. It is clear that $\mathbb{R}_+^{p+q}$ is a convex cone in $\mathbb{R}^{p+q}$. For each $z \in \mathbb{R}_+^{p+q}$, we write $z = (z_{(1)}, z_{(2)})$, where $z_{(1)} = (z_1,\ldots,z_p)'$ and $z_{(2)} = (z_{p+1},\ldots,z_{p+q})'$. We have
$$\chi(z) = z'Az + \prod_{j=1}^{p} z_j^{-1/2} \ge z'Az \ge \mu\|z\|^2.$$
By the definition (1.6), $\chi$ is a lower semicontinuous function from $\mathbb{R}_+^{p+q}$ to $(0, +\infty]$. For each $z \in \mathbb{R}_+^{p+q}$ with $\|z\| = 1$ and $t > 0$, we have $\chi(tz) \ge \mu t^2$. For $r > 0$, denote $B_r = \{z \in \mathbb{R}_+^{p+q} : \|z\| \le r\}$. Because $\mu > 0$, $\chi(tz) \to +\infty$ as $t \to +\infty$ uniformly over $\|z\| = 1$; thus there exists $R > 0$ such that
$$\inf_{z \in \mathbb{R}_+^{p+q}} \chi(z) = \inf_{z \in B_R} \chi(z).$$
Because $B_R$ is a nonempty compact set, by the Weierstrass theorem, the lower semicontinuous function $\chi$ attains its minimum at some $z^* \in B_R \subset \mathbb{R}_+^{p+q}$ with $\chi(z^*) < +\infty$.
To continue our proof of the theorem, let us examine the Hessian (the second-order derivative) matrix $\nabla^2\chi(z)$ of $\chi$ at each point $z \in \mathbb{R}_+^{p+q}$ with $\chi(z) < +\infty$. A direct calculation shows that
$$\nabla^2\chi(z) = 2A + \frac{1}{4}\Big(\prod_{j=1}^{p} z_j^{-1/2}\Big)\begin{pmatrix} 2G + J & 0\\ 0 & 0\end{pmatrix}, \tag{2.1}$$
where $G$ is a $p \times p$ diagonal matrix with its $j$th diagonal element given by $1/z_j^2$ for $j = 1,\ldots,p$, and $J$ is a $p \times p$ matrix with its $(j,s)$th element given by $1/(z_j z_s)$, $j,s = 1,\ldots,p$; that is, $J = (z_1^{-1},\ldots,z_p^{-1})'(z_1^{-1},\ldots,z_p^{-1})$ is positive semidefinite. Thus $2G + J$ is a symmetric positive definite matrix. Because $A$ is symmetric positive semidefinite, $\nabla^2\chi(z)$ is always symmetric positive semidefinite. The case $q = 0$ implies that $\nabla^2\chi(z)$ is positive definite because $2G + J$ is positive definite; the case $q > 0$ and $A_{22}$ being positive definite implies that the sum of the two matrices on the right-hand side of (2.1) is positive definite. That is, the Hessian matrix $\nabla^2\chi(z)$ is positive definite at any point $z \in \mathbb{R}_+^{p+q}$ with $\chi(z) < +\infty$. Thus, $\chi(z)$ has a unique minimizer.
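For completeness, the direct calculation behind (2.1) can be sketched as follows. Write $\varphi(z) = \prod_{j=1}^{p} z_j^{-1/2}$, so that $\chi(z) = z'Az + \varphi(z)$ and $\varphi$ depends only on $z_{(1)}$. Then, for $j, s = 1,\ldots,p$,
$$\frac{\partial \varphi(z)}{\partial z_j} = -\frac{1}{2}\, z_j^{-1}\varphi(z), \qquad
\frac{\partial^2 \varphi(z)}{\partial z_j\,\partial z_s} =
\begin{cases}
\dfrac{3}{4}\, z_j^{-2}\varphi(z) & \text{if } j = s,\\[6pt]
\dfrac{1}{4}\, z_j^{-1} z_s^{-1}\varphi(z) & \text{if } j \neq s,
\end{cases}$$
so the Hessian of $\varphi$ equals $\tfrac{1}{4}\varphi(z)(2G + J)$ on the $z_{(1)}$ block and vanishes elsewhere; adding the Hessian $2A$ of the quadratic term gives (2.1).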
To prove the necessary and sufficient condition, let $z^*$ be a minimizer of $\chi$. If $z$ is another minimizer of $\chi$, then let $\chi(z^*) = \chi(z) = m$. Denote $z(\alpha) = \alpha z + (1-\alpha)z^*$ for $0 \le \alpha \le 1$. Because $\chi$ is convex, we have
$$m \le \chi(z(\alpha)) \le \alpha\chi(z) + (1-\alpha)\chi(z^*) = m,$$
which implies $\chi(z(\alpha)) = m$ for all $0 \le \alpha \le 1$. Because
$$\chi(z(\alpha)) = \chi(z^*) + \nabla\chi(z^*)'(z(\alpha) - z^*) + \tfrac{1}{2}(z(\alpha) - z^*)'\nabla^2\chi(z^*)(z(\alpha) - z^*) + o(\|z(\alpha) - z^*\|^2),$$
where the last term $o(\|z(\alpha) - z^*\|^2)$ represents a higher order term, we must have $\nabla\chi(z^*)'(z - z^*) = 0$ and $(z - z^*)'\nabla^2\chi(z^*)(z - z^*) = 0$. By (2.1), and because $2G + J$ is positive definite, this can be true only if $z_{(1)} = z_{(1)}^*$. Then we have $z(\alpha)'Az(\alpha) = z^{*\prime}Az^* = C$ (because $z_{(1)}(\alpha) = z_{(1)}^*$ for all $\alpha$, the product terms in $\chi(z(\alpha))$ and $\chi(z^*)$ coincide). Denote $h(\alpha) = z(\alpha)'Az(\alpha) = (2\alpha^2 - 2\alpha + 1)C + (2\alpha - 2\alpha^2)z'Az^*$ for $0 \le \alpha \le 1$. For $0 < \alpha < 1$, we have $0 = h'(\alpha) = (4\alpha - 2)C + (2 - 4\alpha)z'Az^*$, which leads to $z'Az^* = C$, and then $(z - z^*)'A(z - z^*) = 0$. Because $A$ is symmetric positive semidefinite, this implies $A(z - z^*) = 0$, and then $A_{22}(z_{(2)} - z_{(2)}^*) = 0$. Thus $z_{(2)} = z_{(2)}^* + z_{(2)}^0$, where $z_{(2)}^0 = z_{(2)} - z_{(2)}^* \in \mathcal{N}(A_{22})$.
Conversely, if $z \in \mathbb{R}_+^{p+q}$ with $z_{(1)} = z_{(1)}^*$ and $z_{(2)} = z_{(2)}^* + z_{(2)}^0$ for some $z_{(2)}^0 \in \mathcal{N}(A_{22})$, to prove that $z$ is a minimizer of $\chi$, we only have to show that $z'Az = z^{*\prime}Az^*$. But this can be easily verified by substituting $z = z^* + (0', z_{(2)}^{0\prime})'$ and noting that $A_{22} z_{(2)}^0 = 0$ together with the positive semidefiniteness of $A$ implies $A(0', z_{(2)}^{0\prime})' = 0$. This completes the proof of Theorem 1.2. █
Let us apply Theorem 1.2 to show how to determine the existence and uniqueness of a minimizer for a simple case of $p = 1$ and $q = 2$ with
$$A = \begin{pmatrix} 1 & 0 & 0\\ 0 & 1 & 1\\ 0 & 1 & 1\end{pmatrix}.$$
Then $z'Az = z_1^2 + (z_2 + z_3)^2$, and it is easy to see that $\mu > 0$ in this case. So by Theorem 1.2 we know there exists a minimizer $z^*$. However, $q = 2$ and $A_{22}$ is not positive definite, so from the last part of Theorem 1.2 we cannot infer the uniqueness of $z^*$. Nevertheless, it is easy to check that in this case $\chi(z) = z_1^2 + (z_2 + z_3)^2 + z_1^{-1/2}$ and that $z^* = ((1/4)^{2/5}, 0, 0)'$ is a minimizer of $\chi(z)$. Let $z = (z_1, z_2, z_3)'$ be another minimizer of $\chi$. By the second part of Theorem 1.2, we have $z_1 = z_1^*$ and $z_{(2)} = z_{(2)}^0$ for some $z_{(2)}^0 \in \mathcal{N}(A_{22})$ (because $z_{(2)}^* = (0,0)'$). However, $z_{(2)}^0 \in \mathcal{N}(A_{22})$ implies that $z_3 = -z_2$; this together with $z_2 \ge 0$ and $z_3 \ge 0$ implies that $z_{(2)} = (0,0)'$. Hence, $z = z^*$, and $z^*$ is the unique minimizer of $\chi(z)$.
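A quick numerical verification of this example (illustrative only) confirms that the nonnegativity constraints pin down the minimizer even though $A_{22}$ is singular.

```python
# Illustrative check of the p = 1, q = 2 example:
# z'Az = z1^2 + (z2 + z3)^2, so chi(z) = z1^2 + (z2 + z3)^2 + z1^{-1/2};
# nonnegativity of z2 and z3 forces z2 = z3 = 0, giving z* = ((1/4)^{2/5}, 0, 0)'.
import numpy as np
from scipy.optimize import minimize

A = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 1.0],
              [0.0, 1.0, 1.0]])

def chi(z):
    return z @ A @ z + z[0] ** (-0.5)

bounds = [(1e-8, None), (0.0, None), (0.0, None)]
res = minimize(chi, x0=np.array([1.0, 1.0, 1.0]), bounds=bounds, method="L-BFGS-B")
print("numerical z* :", res.x)            # approximately [0.5743, 0, 0]
print("closed form  :", 0.25 ** 0.4)      # z1* = (1/4)^{2/5}
```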