We investigate the issue of the uniqueness of the cross-validation selected smoothing parameters in kernel estimation of multivariate nonparametric regression or conditional probability functions. When the covariates are all continuous variables, we provide a necessary and sufficient condition, and when the covariates are a mixture of categorical and continuous variables, we provide a simple sufficient condition that guarantees asymptotically the uniqueness of the cross-validation selected smoothing parameters.

We thank a referee for the constructive comments.
The kernel method is the most popular technique used in the estimation of nonparametric/semiparametric models, and it is well known that the selection of smoothing parameters in nonparametric kernel estimation is of crucial importance. In the context of a regression model, Clarke (1975) proposes the leave-one-out least squares cross-validation method for selecting the smoothing parameters. The asymptotic optimality of this approach is studied by Härdle and Marron (1985) and Härdle, Hall, and Marron (1988) in the context of a univariate regression model, and Fan and Gijbels (1995) have studied bandwidth selection in the context of local polynomial kernel regression. For a regression model with a single (univariate) continuous regressor, Härdle and Marron (1985) and Härdle et al. (1988) show that the cross-validation function has the following expression:
$$CV(h) = \frac{1}{n}\sum_{i=1}^{n}\,[Y_i - \hat g_{-i}(X_i)]^2 w(X_i) = C_1 h^4 + \frac{C_2}{nh} + (\text{terms unrelated to } h) + o_p\big(h^4 + (nh)^{-1}\big), \tag{1.1}$$
where $\hat g_{-i}(X_i)$ is the leave-one-out local constant kernel estimator of $g(X_i) \equiv E(Y_i|X_i)$, $k(\cdot)$ is a second-order kernel function, $h$ is the smoothing parameter, $w(\cdot)$ is a weight function, $C_1 = \int \{(\kappa_2/2)[g''(x) f(x) + 2g'(x) f'(x)]\}^2 w(x) f(x)^{-1}\,dx$, $C_2 = \kappa \int \sigma^2(x) w(x)\,dx$, $\kappa_2 = \int k(v) v^2\,dv$, $\kappa = \int k(v)^2\,dv$, $g'(\cdot)$ and $g''(\cdot)$ denote first- and second-order derivative functions, and $\sigma^2(x) = \operatorname{Var}(Y_i|X_i = x)$.
The terms $C_1 h^4$ and $C_2/(nh)$ in (1.1) are the leading squared bias and variance of $CV(h)$, respectively. Let $\hat h$ denote the cross-validation selected smoothing parameter that minimizes $CV(h)$; then from (1.1) it is easy to show that $\hat h/h_0 \to 1$ in probability, where $h_0 = [C_2/(4C_1)]^{1/5} n^{-1/5}$. Note that $C_1$ is nonnegative and $C_2 > 0$. Therefore, a necessary and sufficient condition for the existence of the unique benchmark nonstochastic optimal smoothing parameter $h_0$ is that $C_1 > 0$. The assumption that $C_1 > 0$ puts some restrictions on $g(\cdot)$; for example, $g(\cdot)$ cannot be a constant function. A similar necessary and sufficient condition guarantees an asymptotically uniquely defined cross-validation selected smoothing parameter in estimating a conditional probability density function (p.d.f.) with a univariate continuous conditioning variable.
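To illustrate these formulas concretely, the following is a minimal numerical sketch (not taken from any of the cited papers) that computes $C_1$, $C_2$, and the benchmark bandwidth $h_0$ under assumed ingredients: a Gaussian kernel (so $\kappa_2 = 1$ and $\kappa = 1/(2\sqrt{\pi})$), $g(x) = x^2$, a standard normal design density, $\sigma^2(x) = 1$, and the weight $w(x) = \mathbf{1}(|x| \le 2)$, which restricts the integrals to $[-2, 2]$.

```python
# Minimal sketch (assumed design, see text): compute C1, C2, and
# h0 = [C2/(4*C1)]^(1/5) * n^(-1/5) by numerical integration.
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

kappa2 = 1.0                              # int k(v) v^2 dv for the Gaussian kernel
kappa = 1.0 / (2.0 * np.sqrt(np.pi))      # int k(v)^2 dv for the Gaussian kernel

f = norm.pdf                              # assumed design density f(x)
fp = lambda x: -x * norm.pdf(x)           # f'(x)
g1 = lambda x: 2.0 * x                    # g'(x) for the assumed g(x) = x^2
g2 = lambda x: 2.0                        # g''(x)
sigma2 = lambda x: 1.0                    # assumed conditional variance

# C1 = int { (kappa2/2) [g''(x) f(x) + 2 g'(x) f'(x)] }^2 w(x) f(x)^{-1} dx
C1, _ = quad(lambda x: ((kappa2 / 2.0) * (g2(x) * f(x) + 2.0 * g1(x) * fp(x))) ** 2 / f(x),
             -2.0, 2.0)
# C2 = kappa * int sigma^2(x) w(x) dx
C2, _ = quad(lambda x: kappa * sigma2(x), -2.0, 2.0)

n = 500
h0 = (C2 / (4.0 * C1)) ** 0.2 * n ** (-0.2)
print(f"C1 = {C1:.4f}, C2 = {C2:.4f}, h0 = {h0:.4f}")   # C1 > 0 here, so h0 is well defined
```

Because $g$ is not constant in this assumed design, $C_1 > 0$ and the benchmark bandwidth is uniquely determined.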
The cross-validation procedure can be easily extended to multivariate settings (regression or p.d.f. estimation) for selecting the smoothing parameters. However, the conditions that ensure the uniqueness of the cross-validation selected smoothing parameters become more complex. Recently, Hall, Racine, and Li (2004), Hall, Li, and Racine (2004), and Li and Racine (2003, 2004) have considered the problem of nonparametric estimation of conditional density and regression functions with mixed discrete and continuous data. They propose data-driven cross-validation (CV) methods for selecting the smoothing parameters, and they have shown that the CV selected smoothing parameters are asymptotically equivalent to the nonstochastic optimal smoothing parameters that minimize the asymptotic weighted estimation mean squared error. However, when discussing the existence of the asymptotically uniquely defined optimal smoothing parameters, Hall, Racine, and Li (2004) and Li and Racine (2004) impose overly strong conditions. In this note we provide substantially weaker sufficient conditions that guarantee the existence of uniquely defined CV selected optimal smoothing parameters. We show that when all covariates are continuous random variables, the condition is necessary and sufficient for the existence of uniquely defined optimal smoothing parameters.
We consider a nonparametric regression model with mixed discrete and continuous covariates:
$$Y_i = g(X_i) + u_i, \qquad i = 1,\ldots,n,$$
where $g(\cdot)$ has an unknown functional form, $E(u_i|X_i) = 0$, $X_i = (X_i^c, X_i^d)$, $X_i^d$ is a $q \times 1$ vector of regressors that assume discrete values, and $X_i^c \in \mathbb{R}^p$ are the remaining continuous regressors. We use $X_{ij}^d$ to denote the $j$th component of $X_i^d$, and we assume that $X_{ij}^d$ takes $c_j \ge 2$ different values, that is, $X_{ij}^d \in \{0,1,\ldots,c_j-1\}$ for $j = 1,\ldots,q$. We use $D = \prod_{j=1}^{q}\{0,1,\ldots,c_j-1\}$ to denote the range assumed by $x^d$. We are interested in estimating $g(x) = E(Y_i|X_i = x)$ by the nonparametric kernel method. We use $f(x) = f(x^c, x^d)$ to denote the joint density function. For $x^c = (x_1^c,\ldots,x_p^c)$ we use the product kernel
$$K_c(x^c, X_i^c) = \prod_{j=1}^{p} h_j^{-1}\, k\!\left(\frac{x_j^c - X_{ij}^c}{h_j}\right),$$
where $k$ is a symmetric, univariate density function and $0 < h_j < \infty$ is the smoothing parameter for $x_j^c$. For a discrete regressor we define, for $1 \le j \le q$,
$$l(x_j^d, X_{ij}^d, \lambda_j) = \begin{cases} 1 & \text{if } X_{ij}^d = x_j^d,\\ \lambda_j & \text{if } X_{ij}^d \neq x_j^d,\end{cases}$$
where $0 \le \lambda_j \le 1$ is the smoothing parameter for $x_j^d$. Therefore, the product kernel for $x^d = (x_1^d,\ldots,x_q^d)$ is given by $K_d(x^d, X_i^d) = \prod_{j=1}^{q} l(x_j^d, X_{ij}^d, \lambda_j)$. The kernel function for the mixed regressors $x = (x^c, x^d)$ is simply the product of $K_c$ and $K_d$, that is, $K(x, X_i) = K_c(x^c, X_i^c)\, K_d(x^d, X_i^d)$. The nonparametric estimate of $g(x)$ is given by
$$\hat g(x) = \frac{\sum_{i=1}^{n} Y_i\, K(x, X_i)}{\sum_{i=1}^{n} K(x, X_i)}.$$
We choose $(h, \lambda) = (h_1,\ldots,h_p,\lambda_1,\ldots,\lambda_q)$ by minimizing the following CV function:
$$CV(h,\lambda) = \frac{1}{n}\sum_{i=1}^{n}\,[Y_i - \hat g_{-i}(X_i)]^2 w(X_i), \tag{1.4}$$
where $\hat g_{-i}(X_i)$ is the leave-one-out local-constant (LC) kernel estimator of $g(X_i)$ and $0 \le w(\cdot) \le 1$ is a weight function that serves to avoid difficulties caused by dividing by zero, or by the slow convergence rate when $X_i$ is near the boundary of the support of $X$.
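As a concrete illustration (this code is not from the cited papers), the sketch below implements the leave-one-out LC estimator and the CV criterion (1.4) for simulated data with one continuous and one binary regressor. It uses a Gaussian kernel for the continuous component (its normalizing constant cancels in the LC ratio) and the kernel that equals $1$ when the discrete values match and $\lambda_j$ otherwise; the data-generating process, the sample size, and the choice $w(\cdot) \equiv 1$ are arbitrary assumptions made for the illustration.

```python
# Illustrative sketch of the leave-one-out local-constant CV criterion (1.4)
# with the mixed (continuous + discrete) product kernel described in the text.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
n, p, q = 200, 1, 1
Xc = rng.normal(size=(n, p))                  # continuous regressors
Xd = rng.integers(0, 2, size=(n, q))          # binary discrete regressor
Y = Xc[:, 0] ** 2 + 0.5 * Xd[:, 0] + 0.2 * rng.normal(size=n)   # assumed DGP

def loo_cv(params):
    """Leave-one-out local-constant CV(h, lambda) with weight w(.) = 1."""
    h = np.exp(params[:p])                    # enforce h_j > 0
    lam = 1.0 / (1.0 + np.exp(-params[p:]))   # enforce 0 < lambda_j < 1
    # product kernel K(X_i, X_l) for all pairs (i, l)
    Kc = np.prod(np.exp(-0.5 * ((Xc[:, None, :] - Xc[None, :, :]) / h) ** 2) / h, axis=2)
    Kd = np.prod(np.where(Xd[:, None, :] == Xd[None, :, :], 1.0, lam), axis=2)
    K = Kc * Kd
    np.fill_diagonal(K, 0.0)                  # leave-one-out: drop own observation
    ghat = K @ Y / np.maximum(K.sum(axis=1), 1e-12)
    return np.mean((Y - ghat) ** 2)

res = minimize(loo_cv, x0=np.zeros(p + q), method="Nelder-Mead")
print("CV-selected h:", np.exp(res.x[:p]), " lambda:", 1.0 / (1.0 + np.exp(-res.x[p:])))
```

The unconstrained reparameterization ($h_j = e^{t_j}$, $\lambda_j$ through a logistic map) is simply one convenient way to respect the constraints $h_j > 0$ and $0 \le \lambda_j \le 1$ during the numerical search.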
Define an indicator function
$$I_j(v^d, x^d) = \mathbf{1}(v_j^d \neq x_j^d)\prod_{s \neq j} \mathbf{1}(v_s^d = x_s^d).$$
Note that $I_j(v^d, x^d) = 1$ if and only if $v^d$ and $x^d$ differ only in their $j$th component. Letting $m_j(x)$ and $m_{jj}(x)$ ($m = g$ or $m = f$) denote the first-order and second-order partial derivatives of $m(x^c, x^d)$ with respect to $x_j^c$, Hall, Li, and Racine (2004) have shown that (with $\int dx = \sum_{x^d \in D}\int dx^c$, where $D$ is the support of $X^d$)
$$CV_{LC}(h,\lambda) = \int\Big\{\sum_{j=1}^{p}\frac{\kappa_2 h_j^2}{2}\,\frac{g_{jj}(x) f(x) + 2g_j(x) f_j(x)}{f(x)} + \sum_{j=1}^{q}\lambda_j\sum_{v^d\in D} I_j(v^d, x^d)\,\frac{[g(x^c, v^d) - g(x)]\, f(x^c, v^d)}{f(x)}\Big\}^2 w(x) f(x)\,dx + \frac{\kappa^p\int\sigma^2(x) w(x)\,dx}{n h_1\cdots h_p} + (\text{terms unrelated to } (h,\lambda)) + (\text{s.o.}). \tag{1.5}$$
The preceding result is based on LC kernel estimation. Li and Racine (2004) have considered the local linear (LL) CV method. The CV objective function is the same as that given in (1.4) but with $\hat g_{-i}(X_i)$ replaced by a leave-one-out LL kernel estimator. Li and Racine (2004) have shown that the resulting CV function has the same leading form as (1.5) with the term $2g_j(x) f_j(x)$ removed.
Define $z_j$ by $h_j^2 = n^{-2/(4+p)} z_j$ for $j = 1,\ldots,p$, and $z_{p+j}$ by $\lambda_j = n^{-2/(4+p)} z_{p+j}$ for $j = 1,\ldots,q$; then the leading terms (those depending on the smoothing parameters) of both $CV_{LC}(h,\lambda)$ and $CV_{LL}(h,\lambda)$ can be written in the form $c_0 n^{-4/(p+4)}\chi(z_1,\ldots,z_p,z_{p+1},\ldots,z_{p+q})$, where $c_0 = \kappa^p\int\sigma^2(x) w(x)\,dx > 0$ is a constant and
$$\chi(z) = z'Az + \prod_{j=1}^{p} z_j^{-1/2}, \tag{1.6}$$
where $z = (z_1,\ldots,z_{p+q})'$ (the prime denotes transpose) and $A$ is a $(p+q)\times(p+q)$ symmetric positive semidefinite matrix with its $(j,s)$th element given by $A(j,s) = \int B_j(x) B_s(x)\,dx$, where $B_j(x) = c_0^{-1/2}(\kappa_2/2)[g_{jj}(x) f(x) + 2g_j(x) f_j(x)]\, w(x)^{1/2} f(x)^{-1/2}$ (one removes $2g_j(x) f_j(x)$ for the local linear CV function) for $j = 1,\ldots,p$, and $B_{p+j}(x) = c_0^{-1/2}\sum_{v^d\in D} I_j(v^d, x^d)[g(x^c, v^d) - g(x)]\, f(x^c, v^d)\, w(x)^{1/2} f(x)^{-1/2}$ for $j = 1,\ldots,q$.
Hall, Racine, and Li (2004) have considered the CV selection of smoothing parameters in a conditional probability (density) estimation framework and show that their CV objective function also has a leading term of the form given in (1.6), of course with a different definition of $B_j(x)$ for $j = 1,\ldots,p+q$. Therefore, the leading term of the CV objective function, in either a regression or a conditional probability model, has the expression given by (1.6). The uniqueness of the CV selected optimal smoothing parameters relies on the uniqueness of a nonnegative vector $z^* = (z_1^*,\ldots,z_{p+q}^*)'$ that minimizes (1.6), where $z^* \in \mathbb{R}_+^{p+q} \equiv \{z \in \mathbb{R}^{p+q} : z_j \ge 0 \text{ for } j = 1,\ldots,p+q\}$. Subsequently we will first focus on the simple case in which all covariates are continuous.
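As a numerical illustration of minimizing (1.6) (this is not part of the paper's argument), the sketch below minimizes $\chi(z) = z'Az + \prod_{j\le p} z_j^{-1/2}$ over the nonnegative orthant for a hypothetical $A$ with $p = 2$ and $q = 1$; the matrix $B$ below is an arbitrary choice whose only purpose is to generate a valid symmetric positive semidefinite $A = B'B$.

```python
# Illustrative sketch: minimize chi(z) = z'Az + prod_{j<=p} z_j^{-1/2}
# over the nonnegative orthant for a hypothetical PSD matrix A (p = 2, q = 1).
import numpy as np
from scipy.optimize import minimize

p, q = 2, 1
B = np.array([[1.0, 0.5, 0.2],
              [0.0, 1.0, 0.3]])      # arbitrary; A = B'B is symmetric PSD by construction
A = B.T @ B

def chi(z):
    # quadratic (squared-bias) term plus the variance term, which involves only z_1, ..., z_p
    return z @ A @ z + np.prod(z[:p]) ** (-0.5)

bounds = [(1e-8, None)] * p + [(0.0, None)] * q   # z_1,...,z_p > 0; z_{p+1},... >= 0
res = minimize(chi, x0=np.ones(p + q), bounds=bounds, method="L-BFGS-B")
print("z* =", res.x, " chi(z*) =", res.fun)
```

In this particular example $A$ is singular (it has rank two), yet the numerical minimizer is finite and, because the $1\times 1$ block $A_{22}$ is positive, unique by the results discussed below.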
When $q = 0$ (no discrete covariates), all covariates are continuous random variables, and (1.6) becomes
$$\chi_c(z) = z'Az + \prod_{j=1}^{p} z_j^{-1/2}, \tag{1.7}$$
with $z = (z_1,\ldots,z_p)'$, and $A$ is now of dimension $p \times p$. The uniqueness of the CV selected optimal smoothing parameters $h_1,\ldots,h_p$ hinges on the uniqueness of a vector $z^* = (z_1^*,\ldots,z_p^*)'$ that minimizes (1.7). Let $z^*$ denote the vector of $z$ that minimizes $\chi_c(z)$ over $\mathbb{R}_+^p$; we ask that
$$z^* \text{ is unique, with } 0 < z_j^* < \infty \text{ for all } j = 1,\ldots,p. \tag{1.8}$$
If (1.8) holds true, then the CV selected smoothing parameters are all well defined asymptotically. In fact, it follows from Hall, Li, and Racine (2004) and Hall, Racine, and Li (2004) that $\hat h_j/h_j^0 \to 1$ in probability, or equivalently, $n^{1/(4+p)}\hat h_j \to (z_j^*)^{1/2}$ in probability, where $h_j^0 = (z_j^*)^{1/2} n^{-1/(4+p)}$ is the benchmark nonstochastic optimal smoothing parameter ($j = 1,\ldots,p$). The next theorem gives a simple necessary and sufficient condition for (1.8) to hold.
THEOREM 1.1. Assume that $q = 0$ so that $z = (z_1,\ldots,z_p)'$; define
$$\mu = \inf\{\, z'Az : z \in \mathbb{R}_+^p,\ \|z\| = 1 \,\}.$$
Then $\chi(z)$ has a unique minimizer $z^* = (z_1^*,\ldots,z_p^*)'$ with $0 < z_j^* < \infty$ for all $j = 1,\ldots,p$ if and only if $\mu > 0$.
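The condition in Theorem 1.1 can be checked numerically for a given $A$: the sketch below (illustrative only) estimates $\mu$ by minimizing $z'Az$ over the nonnegative part of the unit sphere, parameterizing $z = v^2/\|v^2\|$ and using a multistart search. The two matrices evaluated are hypothetical; the second is the singular matrix $c\big(\begin{smallmatrix}1 & 1\\ 1 & 1\end{smallmatrix}\big)$ (with $c = 1$) that reappears in the example following the theorem.

```python
# Illustrative sketch: estimate mu = inf{ z'Az : z in R_+^p, ||z|| = 1 } numerically.
import numpy as np
from scipy.optimize import minimize

def mu_hat(A, n_starts=20, seed=0):
    rng = np.random.default_rng(seed)
    p = A.shape[0]

    def obj(v):
        z = v ** 2                                  # z >= 0 by construction
        z = z / max(np.linalg.norm(z), 1e-12)       # ||z|| = 1 (with a numerical safeguard)
        return z @ A @ z

    best = np.inf
    for _ in range(n_starts):                       # multistart, since obj is not convex in v
        res = minimize(obj, rng.normal(size=p), method="Nelder-Mead")
        best = min(best, res.fun)
    return best

A_pd = np.array([[2.0, 0.5], [0.5, 1.0]])           # positive definite
A_sing = np.array([[1.0, 1.0], [1.0, 1.0]])         # singular, yet mu = 1 > 0
print(mu_hat(A_pd), mu_hat(A_sing))
```

Both reported values are strictly positive, so in both cases a unique, finite minimizer $z^*$ exists by Theorem 1.1.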
Next, we discuss the general case with a mixture of continuous and discrete covariates. Now, $z = (z_1,\ldots,z_{p+q})'$ and $A$ is a $(p+q)\times(p+q)$ symmetric positive semidefinite matrix. Let $z = (z_{(1)}', z_{(2)}')'$, where $z_{(1)} = (z_1,\ldots,z_p)'$ and $z_{(2)} = (z_{p+1},\ldots,z_{p+q})'$, and let $z^* = (z_{(1)}^{*\prime}, z_{(2)}^{*\prime})'$ denote a minimizer of $\chi(z_1,\ldots,z_{p+q})$. We seek conditions that ensure the following result:
$$z^* \text{ is unique, with } 0 < z_j^* < \infty \text{ for } j = 1,\ldots,p \text{ and } 0 \le z_{p+j}^* < \infty \text{ for } j = 1,\ldots,q. \tag{1.10}$$
Condition (1.10) will lead to asymptotically uniquely defined CV selected smoothing parameters $\hat h_1,\ldots,\hat h_p,\hat\lambda_1,\ldots,\hat\lambda_q$. We partition the $A$ matrix as
$$A = \begin{pmatrix} A_{11} & A_{12}\\ A_{12}' & A_{22}\end{pmatrix},$$
where $A_{11}$ is of dimension $p \times p$, $A_{22}$ is of dimension $q \times q$, and $A_{12}$ has a conformable dimension. The following theorem gives conditions for the existence and uniqueness of a minimizer of $\chi(z)$.
THEOREM 1.2. Let $\mu = \inf\{\, z'Az : z \in \mathbb{R}_+^{p+q},\ \|z\| = 1 \,\}$. If $\mu > 0$, then $\chi$ has a minimizer $z^* \in \mathbb{R}_+^{p+q}$ with $\chi(z^*) < +\infty$, and a necessary and sufficient condition for a point $z = (z_{(1)}', z_{(2)}')' \in \mathbb{R}_+^{p+q}$ to be a minimizer of $\chi$ is that $z_{(1)} = z_{(1)}^*$ and $z_{(2)} = z_{(2)}^* + z_{(2)}^0$ for some $z_{(2)}^0 \in \mathcal{N}(A_{22})$, the null space of $A_{22}$ (defined as $\mathcal{N}(A_{22}) = \{v \in \mathbb{R}^q : A_{22} v = 0\}$).

Moreover, if either $q = 0$, or $q > 0$ and $A_{22}$ is positive definite, then the Hessian matrix $\nabla^2\chi(z)$ of $\chi$ is positive definite at every point $z \in \mathbb{R}_+^{p+q}$ with $\chi(z) < +\infty$. Thus $\chi$ has a unique minimizer $z^*$ satisfying (1.10).
Proof of Theorem 1.1. The "if" part of Theorem 1.1 is a special case of Theorem 1.2 with $q = 0$. Thus we only need to prove the "only if" part. Let $\mu = 0$ be attained at some $z^* \in \mathbb{R}_+^p$ with $\|z^*\| = 1$. If $z_i^* \neq 0$ for all $i = 1,\ldots,p$, then $\chi(t z^*) \to 0$ as $t \to +\infty$. This implies that $\chi$ has no minimizer. If $z_i^* = 0$ for some $1 \le i \le p$, without loss of generality we assume that $z_1^* = \cdots = z_r^* = 0$ for some $1 \le r \le p-1$. Let $\varepsilon > 0$ be chosen such that $p(1-\varepsilon) > r$. Let $\bar z \in \mathbb{R}_+^p$ with $\bar z_i = 1$ for $1 \le i \le r$ and $\bar z_i = 0$ for $r+1 \le i \le p$. Consider $z(t) = t^{\varepsilon}\bar z + t^{\varepsilon-1} z^* \in \mathbb{R}_+^p$ for all $t > 0$, because $\mathbb{R}_+^p$ is a convex cone. We have (using $Az^* = 0$, which follows from $z^{*\prime}Az^* = 0$ and $A$ being positive semidefinite)
$$\chi(z(t)) = t^{2\varepsilon}\,\bar z'A\bar z + t^{-[p(\varepsilon-1)+r]/2}\prod_{j=r+1}^{p}(z_j^*)^{-1/2} \to 0 \quad\text{as } t \to 0,$$
because $p(\varepsilon - 1) + r < 0$. Therefore $\chi$ has no minimizer. █
Remark 2.1. From the proof of Theorem 1.1 we know that $\mu > 0$ is a necessary and sufficient condition for the existence of a minimizer $z^*$ of $\chi_c(z)$; the uniqueness of the minimizer $z^*$ comes from the fact that the Hessian matrix of $\chi_c(z)$ is positive definite.
Note that in Theorem 1.1, $\mu$ is defined as the infimum of $z'Az$, not of $\chi_c(z)$, as it does not contain the term $\prod_{j=1}^{p} z_j^{-1/2}$. Also note that the minimization is done over the unit sphere restricted to the first quadrant (the nonnegative orthant $\mathbb{R}_+^p$). Theorem 1.1 states that $\mu > 0$ is a necessary and sufficient condition for the existence of a unique minimizer $z^*$ with each component $z_j^*$ ($j = 1,\ldots,p$) positive and finite. This condition is substantially weaker than the requirement that $A$ be a positive definite matrix, as assumed in Hall, Racine, and Li (2004) and Li and Racine (2004). It is obvious that when $A$ is positive definite, then $\mu > 0$ because $z \neq 0$ when restricted to $\|z\| = 1$. However, consider the LL regression case with $p = 2$ and $g(x_1, x_2) = x_1^2 + x_2^2$; then $g_{11}(x) = g_{22}(x) = 2$, and this leads to
$$A = c\begin{pmatrix} 1 & 1\\ 1 & 1\end{pmatrix},$$
where $c > 0$ is a constant. Thus, $A$ is a singular matrix, and hence it is not positive definite. Nevertheless, it is easy to check that $\mu > 0$ because in this case $z'Az = c(z_1 + z_2)^2 > 0$ for any $z \in \mathbb{R}_+^2$ with $\|z\| = 1$. Therefore, by Theorem 1.1 we know that $z^*$ is uniquely defined with $0 < z_j^* < \infty$ ($j = 1,2$); this implies that the CV selected smoothing parameters are well defined. In fact, $\hat h_j = (z_j^*)^{1/2} n^{-1/6}(1 + o_p(1))$ for $j = 1,2$. This result is quite intuitive; given that $g(x)$ is nonlinear in both $x_1$ and $x_2$, one would expect the CV selected smoothing parameters to converge to zero at the rate $O_p(n^{-1/(4+p)}) = O_p(n^{-1/6})$.
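This example can also be checked numerically. The sketch below (illustrative only) sets $c = 1$, so $\chi_c(z) = (z_1 + z_2)^2 + (z_1 z_2)^{-1/2}$; the first-order conditions of this particular $\chi_c$ give the closed-form minimizer $z_1^* = z_2^* = (8c)^{-1/3}$, which the numerical search reproduces.

```python
# Illustrative check of the singular-A example: A = c * [[1, 1], [1, 1]] with c = 1.
import numpy as np
from scipy.optimize import minimize

c = 1.0
A = c * np.ones((2, 2))

def chi_c(z):
    return z @ A @ z + (z[0] * z[1]) ** (-0.5)

res = minimize(chi_c, x0=np.array([1.0, 1.0]),
               bounds=[(1e-8, None)] * 2, method="L-BFGS-B")
print("numerical z* :", res.x)                      # approximately [0.5, 0.5]
print("closed form  :", (8.0 * c) ** (-1.0 / 3.0))  # both components equal (8c)^{-1/3}
```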
Proof of Theorem 1.2. It is clear that $\mathbb{R}_+^{p+q}$ is a convex cone in $\mathbb{R}^{p+q}$. For each $z \in \mathbb{R}_+^{p+q}$, we write $z = (z_{(1)}, z_{(2)})$, where $z_{(1)} = (z_1,\ldots,z_p)'$ and $z_{(2)} = (z_{p+1},\ldots,z_{p+q})'$. We have
$$\chi(z) = z'Az + \prod_{j=1}^{p} z_j^{-1/2} \ge z'Az \ge \mu\|z\|^2.$$
By the definition (1.6), $\chi$ is a lower semicontinuous function from $\mathbb{R}_+^{p+q}$ to $(0, +\infty]$. For each $z \in \mathbb{R}_+^{p+q}$ with $\|z\| = 1$ and $t > 0$, we have $\chi(tz) \ge \mu t^2$. For $r > 0$, denote $B_r = \{z \in \mathbb{R}_+^{p+q} : \|z\| \le r\}$. Because $\mu > 0$, $\chi(tz) \to +\infty$ as $t \to +\infty$ uniformly over $\|z\| = 1$; thus there exists $R > 0$ such that
$$\inf_{z \in \mathbb{R}_+^{p+q}} \chi(z) = \inf_{z \in B_R} \chi(z).$$
Because $B_R$ is a nonempty compact set, by the Weierstrass theorem, the lower semicontinuous function $\chi$ attains its minimum at some $z^* \in B_R \subset \mathbb{R}_+^{p+q}$ with $\chi(z^*) < +\infty$.
To continue our proof of the theorem, let us examine the Hessian (the second-order derivative) matrix $\nabla^2\chi(z)$ of $\chi$ at each point $z \in \mathbb{R}_+^{p+q}$ with $\chi(z) < +\infty$. A direct calculation shows that
$$\nabla^2\chi(z) = 2A + \frac{1}{4}\Big(\prod_{j=1}^{p} z_j^{-1/2}\Big)\begin{pmatrix} 2G + J & 0\\ 0 & 0\end{pmatrix}, \tag{2.1}$$
where $G$ is a $p \times p$ diagonal matrix with its $j$th diagonal element given by $1/z_j^2$ for $j = 1,\ldots,p$, and $J$ is a $p \times p$ matrix with its $(j,s)$th element given by $1/(z_j z_s)$, $j,s = 1,\ldots,p$; that is, $J = (z_1^{-1},\ldots,z_p^{-1})'(z_1^{-1},\ldots,z_p^{-1})$ is positive semidefinite. Thus $2G + J$ is a symmetric positive definite matrix. Because $A$ is symmetric positive semidefinite, $\nabla^2\chi(z)$ is always symmetric positive semidefinite. The case $q = 0$ implies that $\nabla^2\chi(z)$ is positive definite because $2G + J$ is positive definite; the case $q > 0$ and $A_{22}$ being positive definite implies that the sum of the two matrices on the right-hand side of (2.1) is positive definite. That is, the Hessian matrix $\nabla^2\chi(z)$ is positive definite at any point $z \in \mathbb{R}_+^{p+q}$ with $\chi(z) < +\infty$. Thus, $\chi(z)$ has a unique minimizer.
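For completeness, the direct calculation behind (2.1) can be sketched as follows. Write $\varphi(z) = \prod_{j=1}^{p} z_j^{-1/2}$, so that $\chi(z) = z'Az + \varphi(z)$ and $\varphi$ depends only on $z_{(1)}$. Then, for $j, s = 1,\ldots,p$,
$$\frac{\partial \varphi(z)}{\partial z_j} = -\frac{1}{2}\, z_j^{-1}\varphi(z), \qquad
\frac{\partial^2 \varphi(z)}{\partial z_j\,\partial z_s} =
\begin{cases}
\dfrac{3}{4}\, z_j^{-2}\varphi(z) & \text{if } j = s,\\[6pt]
\dfrac{1}{4}\, z_j^{-1} z_s^{-1}\varphi(z) & \text{if } j \neq s,
\end{cases}$$
so the Hessian of $\varphi$ equals $\tfrac{1}{4}\varphi(z)(2G + J)$ on the $z_{(1)}$ block and vanishes elsewhere; adding the Hessian $2A$ of the quadratic term gives (2.1).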
To prove the necessary and sufficient condition, let $z^*$ be a minimizer of $\chi$. If $z$ is another minimizer of $\chi$, then let $\chi(z^*) = \chi(z) = m$. Denote $z(\alpha) = \alpha z + (1-\alpha)z^*$ for $0 \le \alpha \le 1$. Because $\chi$ is convex, we have
$$m \le \chi(z(\alpha)) \le \alpha\chi(z) + (1-\alpha)\chi(z^*) = m,$$
which implies $\chi(z(\alpha)) = m$ for all $0 \le \alpha \le 1$. Because
$$\chi(z(\alpha)) = \chi(z^*) + \nabla\chi(z^*)'(z(\alpha) - z^*) + \tfrac{1}{2}(z(\alpha) - z^*)'\nabla^2\chi(z^*)(z(\alpha) - z^*) + o(\|z(\alpha) - z^*\|^2),$$
where the last term $o(\|z(\alpha) - z^*\|^2)$ represents a higher order term, we must have $\nabla\chi(z^*)'(z - z^*) = 0$ and $(z - z^*)'\nabla^2\chi(z^*)(z - z^*) = 0$. By (2.1), and because $2G + J$ is positive definite, this can be true only if $z_{(1)} = z_{(1)}^*$. Then we have $z(\alpha)'Az(\alpha) = z^{*\prime}Az^* = C$ (because $z_{(1)}(\alpha) = z_{(1)}^*$ for all $\alpha$, the product terms in $\chi(z(\alpha))$ and $\chi(z^*)$ coincide). Denote $h(\alpha) = z(\alpha)'Az(\alpha) = (2\alpha^2 - 2\alpha + 1)C + (2\alpha - 2\alpha^2)z'Az^*$ for $0 \le \alpha \le 1$. For $0 < \alpha < 1$, we have $0 = h'(\alpha) = (4\alpha - 2)C + (2 - 4\alpha)z'Az^*$, which leads to $z'Az^* = C$, and then $(z - z^*)'A(z - z^*) = 0$. Because $A$ is symmetric positive semidefinite, this implies $A(z - z^*) = 0$, and then $A_{22}(z_{(2)} - z_{(2)}^*) = 0$. Thus $z_{(2)} = z_{(2)}^* + z_{(2)}^0$, where $z_{(2)}^0 = z_{(2)} - z_{(2)}^* \in \mathcal{N}(A_{22})$.
Conversely, if $z \in \mathbb{R}_+^{p+q}$ with $z_{(1)} = z_{(1)}^*$ and $z_{(2)} = z_{(2)}^* + z_{(2)}^0$ for some $z_{(2)}^0 \in \mathcal{N}(A_{22})$, to prove that $z$ is a minimizer of $\chi$, we only have to show that $z'Az = z^{*\prime}Az^*$. But this can be easily verified by substituting $z = z^* + (0', z_{(2)}^{0\prime})'$ and noting that $A_{22} z_{(2)}^0 = 0$ together with the positive semidefiniteness of $A$ implies $A(0', z_{(2)}^{0\prime})' = 0$. This completes the proof of Theorem 1.2. █
Let us apply Theorem 1.2 to show how to determine the existence and uniqueness of a minimizer for a simple case of $p = 1$ and $q = 2$ with
$$A = \begin{pmatrix} 1 & 0 & 0\\ 0 & 1 & 1\\ 0 & 1 & 1\end{pmatrix}.$$
Then $z'Az = z_1^2 + (z_2 + z_3)^2$, and it is easy to see that $\mu > 0$ in this case. So by Theorem 1.2 we know there exists a minimizer $z^*$. However, $q = 2$ and $A_{22}$ is not positive definite, so from the last part of Theorem 1.2 we cannot infer the uniqueness of $z^*$. Nevertheless, it is easy to check that in this case $\chi(z) = z_1^2 + (z_2 + z_3)^2 + z_1^{-1/2}$ and that $z^* = ((1/4)^{2/5}, 0, 0)'$ is a minimizer of $\chi(z)$. Let $z = (z_1, z_2, z_3)'$ be another minimizer of $\chi$. By the second part of Theorem 1.2, we have $z_1 = z_1^*$ and $z_{(2)} = z_{(2)}^0$ for some $z_{(2)}^0 \in \mathcal{N}(A_{22})$ (because $z_{(2)}^* = (0,0)'$). However, $z_{(2)}^0 \in \mathcal{N}(A_{22})$ implies that $z_3 = -z_2$; this together with $z_2 \ge 0$ and $z_3 \ge 0$ implies that $z_{(2)} = (0,0)'$. Hence, $z = z^*$, and $z^*$ is the unique minimizer of $\chi(z)$.
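A quick numerical verification of this example (illustrative only) confirms that the nonnegativity constraints pin down the minimizer even though $A_{22}$ is singular.

```python
# Illustrative check of the p = 1, q = 2 example:
# z'Az = z1^2 + (z2 + z3)^2, so chi(z) = z1^2 + (z2 + z3)^2 + z1^{-1/2};
# nonnegativity of z2 and z3 forces z2 = z3 = 0, giving z* = ((1/4)^{2/5}, 0, 0)'.
import numpy as np
from scipy.optimize import minimize

A = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 1.0],
              [0.0, 1.0, 1.0]])

def chi(z):
    return z @ A @ z + z[0] ** (-0.5)

bounds = [(1e-8, None), (0.0, None), (0.0, None)]
res = minimize(chi, x0=np.array([1.0, 1.0, 1.0]), bounds=bounds, method="L-BFGS-B")
print("numerical z* :", res.x)            # approximately [0.5743, 0, 0]
print("closed form  :", 0.25 ** 0.4)      # z1* = (1/4)^{2/5}
```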