We develop general conditions for rates of convergence and convergence in distribution of iterative procedures for estimating finite-dimensional parameters. An asymptotic contraction mapping condition is the centerpiece of the theory. We illustrate some of the results by deriving the limiting distribution of a two-stage iterative estimator of regression parameters in a semiparametric binary response model. Simulation results illustrating the computational benefits of the first-stage iterative estimator are also reported.

We thank a co-editor and two referees for comments and criticisms that led to significant improvements in this paper. We also thank Roger Klein for providing us with Gauss code to compute his estimator.
There is a substantial literature on convergence properties of various widely applied iterative estimation procedures such as the expectation-maximization (EM) algorithm and its many descendants (see, e.g., McLachlan and Krishnan, 1997). Often in this literature, observed data are conditioned on, and convergence refers to numerical convergence of sample iterates to a sample fixed point. Similarly, a rate of convergence refers to how fast the sample iterates converge to the sample fixed point as the number of iterations increases. Although useful computationally, such information is not sufficient for doing asymptotic inference on the parameter of interest, namely, the population parameter estimated by the sample fixed point. For example, such information neither reveals the limiting distribution of the sample sequence nor implies a rate of convergence to the parameter of interest, and it gives no guidelines about the number of iterations needed to achieve these results. The literature on finite-step estimation sheds some light on these problems but requires a starting value that converges to the parameter of interest at a known rate (see, e.g., Robinson, 1988; Lehmann, 1983, Ch. 6.3; Bickel, 1975). By contrast, most iterative estimation procedures do not start at consistent starting values. Hence the need for theory enabling asymptotic inference about a parameter of interest for general iterative estimation procedures. This paper provides such theory.
Specifically, this paper develops checkable conditions for consistency, rates of convergence, and convergence in distribution of iterative procedures for estimating finite-dimensional parameters. The theory covers procedures such as expectation-maximization (EM), Newton–Raphson (NR), and iterative least squares (ILS). The key requirement is that the sample mapping generating the procedure and a population analogue be contraction mappings, asymptotically. We also isolate a convenient bias condition from which sensible stopping rules can be derived.
We illustrate the theory by establishing the limiting distribution of a two-stage iterative estimator of the regression parameters in a semiparametric binary response model. The first stage is an ILS procedure. We apply the theory to show that this procedure consistently estimates the regression parameters in the model. The second stage is a NR procedure that starts at the ILS estimates and is based on the criterion function of the Klein and Spady (1993) estimator that achieves the semiparametric efficiency bound for this model established by Chamberlain (1986) and Cosslett (1987). We use the theory to show that the two-stage procedure also achieves this bound. Although we use this application to illustrate the asymptotic theory, the ILS estimator is interesting in its own right because of the substantial computational advantages it can provide, particularly in applications with many observations and many regressors. To illustrate these benefits, we provide a small simulation study comparing the ILS estimator to the efficient estimator of Klein and Spady (1993). We also briefly discuss extensions of the ILS procedure to semiparametric censored regression models.
Simultaneously and independently of this work, Pastorello, Patilea, and Renault (2003, Sects. 1–4) have developed a similar theory of iterative estimation and consider applications to structural nonadaptive econometric models with an emphasis on financial market models. The theory developed in this paper covers all of their main applications. However, their theory requires a continuity condition (Pastorello et al., 2003, Assumption 1a, p. 452) that limits the applicability of their results. For example, we show that this continuity condition does not hold for the ILS estimator developed in Section 3 for the semiparametric binary response model. Nor do these authors provide guidelines for stopping rules.
The rest of the paper is organized as follows. In Section 2, we develop conditions for consistency, rates of convergence, and convergence in distribution of general iterative estimation procedures. We also provide guidelines for developing stopping rules for these procedures. In Section 3, we apply the theory to the iterative estimator for the semiparametric binary response model described earlier. Section 4 presents the simulation study. Section 5 summarizes. Proofs and other technical supporting material are provided in an Appendix.
This section develops general conditions for rates of convergence and convergence in distribution of iterative procedures for estimating finite-dimensional parameters. We begin by developing some notation.
Let (Ω, 𝓕, 𝓟) denote a probability space. That is, Ω is a sample space, 𝓕 is a σ-field of subsets of Ω, and 𝓟 is a probability measure on 𝓕. For ω ∈ Ω, let Z1(ω,β0),…,Zn(ω,β0) denote a sample of size n from 𝓟, where β0 denotes a fixed parameter of interest in Rk.
Let M(φ) denote a mapping from Rk to Rk with fixed point β0. That is, M(β0) = β0. Population analogues of sample mappings generating common iterative estimators satisfy this fixed point condition. For example, a population analogue of the sample mapping defining an EM procedure has the form

M(φ) = argmaxβ E[Qn(β|φ)],

where Qn(β|φ) is the conditional expectation of the complete data log-likelihood function given the observed data and φ. The outer expectation is over the distribution of the observed data. It follows that E[Qn(β|β0)] is the expected value of the complete data log-likelihood function, which is maximized at β0. That is, M(β0) = β0. A population analogue of a sample mapping defining a NR procedure has the form

M(φ) = φ − [H(φ)]−1G(φ),

where G(φ) is a population analogue of the gradient of the sample objective function and H(φ) is a population analogue of the corresponding sample Hessian. Under general conditions, G(β0) = 0, and so M(β0) = β0.
We show in Section 3 that the fixed point condition just described is also satisfied by a population analogue of a semiparametric ILS mapping. However, this population mapping depends on both the sample size n and the ω that generated the sample from 𝓟. To cover applications like this, from now on we write Mn(φ) for a generic population mapping, letting the subscript n suggest the possible dependence of this mapping on both n and ω. The fixed point condition, then, becomes Mn(β0) = β0 for all n and ω. Write M̂n(φ) for a sample analogue of Mn(φ). Let βn0 denote a starting point in Rk. This starting point may depend on n and ω. For i ≥ 1, define βni = Mn(βn,i−1) and β̂ni = M̂n(β̂n,i−1), where β̂n0 = βn0. We call βni a population iterate. We call β̂ni a sample iterate and also an iterative estimator of β0.
Contraction mappings play a central role in establishing the limiting behavior of β̂ni. We now introduce the notion of an asymptotic contraction mapping.
DEFINITION. For each n ≥ 1 and ω ∈ Ω, let Knω(·) be a function defined on a set S ⊆ X, where (X, E) is a metric space. The collection {Knω(·) : n ≥ 1, ω ∈ Ω} is an asymptotic contraction mapping (denoted ACM) on (S, E) if there exist a constant c in [0,1) that does not depend on n or ω, and sets {An} with each An ∈ 𝓕 and 𝓟(An) → 1 as n → ∞, such that for each ω ∈ An, Knω(·) maps S to itself and for all x, y in S,

E(Knω(x), Knω(y)) ≤ cE(x, y).
The ACM property is a property of the collection of functions {Knω(·) : n ≥ 1,ω ∈ Ω}. For ease of notation, we write {Knω(·)} for this collection.
If {Knω(·)} is an ACM on (S, E) and ω ∈ An, where An is one of the “good” sets described in the definition, then Knω(·) is a contraction mapping on (S, E). Then, by the Banach fixed point theorem (Aliprantis and Border, 1994, pp. 88–89), Knω(·) has a unique fixed point xnω in S, and any sequence defined by xi = Knω(xi−1), i ≥ 1, where x0 ∈ S, converges to xnω as i → ∞. Note that the iterates and the fixed point can depend on n and ω. Also, note that {Knω(·)} can be an ACM without Knω(·) being a contraction mapping for each n and ω.
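For intuition, the following minimal Python sketch (our illustration, not code from the paper) iterates a contraction to its fixed point in the manner just described; the mapping cos is a hypothetical stand-in for Knω:

import math

def iterate_to_fixed_point(K, x0, tol=1e-10, max_iter=1000):
    """Iterate x_i = K(x_{i-1}) until successive iterates are within tol."""
    x = x0
    for i in range(1, max_iter + 1):
        x_new = K(x)
        if abs(x_new - x) < tol:
            return x_new, i
        x = x_new
    return x, max_iter

# cos is a contraction on [0, 1] (|cos'(x)| = |sin(x)| <= sin(1) < 1 there),
# so the iterates converge to its unique fixed point, roughly 0.739085.
x_star, iters = iterate_to_fixed_point(math.cos, 1.0)
print(x_star, iters)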
To establish the limiting distribution of β̂ni, we require that both {Mn(φ)} and {M̂n(φ)} be ACMs on (B0,Ek). Here Ek is the euclidean metric on Rk and B0 is the closed ball centered at β0 of radius |βn0 − β0|. If {Mn(φ)} is an ACM on (B0,Ek), then, for each ω ∈ An (see the definition), β0 is the unique fixed point of Mn(φ) on B0. If {M̂n(φ)} is an ACM on (B0,Ek), then, for each ω ∈ An (not necessarily the same An as for Mn(φ)), M̂n(φ) has a unique fixed point on B0, which we denote β̄n. Unlike β0, β̄n typically depends on both n and ω.
Theorem 1 gives conditions for rates of convergence of β̂n,i(n) to β0. Let Z+ be the positive integers.
THEOREM 1. Let i(n) be a function from Z+ to Z+. Fix δ > 0. If

(i) {Mn(φ)} is an ACM on (B0,Ek);

(ii) nδ|βn,i(n) − β0| = Op(1) as n → ∞;

(iii) nδ supφ∈B0 |M̂n(φ) − Mn(φ)| = Op(1) as n → ∞;

then nδ|β̂n,i(n) − β0| = Op(1) as n → ∞.
Remark 1. To apply Theorem 1, one must choose i(n) to satisfy condition (ii). The choice depends on the order of convergence of the iterative procedure. Let βi, i ≥ 1, and β be points in Rk. The sequence {βi : i ≥ 1} converges to β of order σ ≥ 1 if there exists a constant κ > 0 such that for i ≥ 1, |ei|/|ei−1|σ ≤ κ, where ei = βi − β (cf. Burden, Faires, and Reynolds, 1981, p. 45). The leading special cases are σ = 1 (linear convergence) and σ = 2 (quadratic convergence). For example, the population sequence for the semiparametric ILS procedure analyzed in Section 3 exhibits linear convergence to its fixed point, whereas the population and sample sequences for the semiparametric NR procedure analyzed in Section 3 exhibit quadratic convergence to their respective fixed points.
First, we choose i(n) to satisfy condition (ii) when σ = 1. In this case, the constant κ can be taken to equal the modulus of contraction c guaranteed by condition (i). Assume supn|βn0 − β0| < ∞. A simple recursive calculation shows that nδ|βn,i(n) − β0| ≤ |βn0 − β0| for each n ≥ 1 (a stronger condition than condition (ii)) provided i(n) ≥ −δ ln n/ln c. This bound is sharp and can be used to develop a stopping rule for the iterative sequence. For instance, if σ = 1, δ = ½, n = 5,000, and c ≤ 0.9, then the stronger condition just stated is satisfied provided i(5,000) ≥ 41. Alternatively, one can estimate c with the maximum of the ratios |β̂ni − β̂n,i−1|/|β̂n,i−1 − β̂n,i−2| over a small number of ratios and for different starting values for the sequence.
Next, we choose i(n) to satisfy condition (ii) when σ > 1. Define α(σ) = κ1/(σ−1)|βn0 − β0| and assume α(σ) < 1. A recursive calculation shows that nδ|βn,i(n) − β0| ≤ |βn0 − β0| for n ≥ 1 (again, a stronger condition than condition (ii)) provided i(n) ≥ [ln σ]−1 ln(−δ ln n/ln α(σ)). To illustrate, for a smooth NR procedure, σ = 2 and the constant κ can be taken to equal 2Ck with 2C an upper bound on the absolute value of each of the second-order mixed partial derivatives of each of the k components of Mn (cf. Burden et al., 1981, p. 47).
Recall that Mn(φ) = (Mn1(φ),…,Mnk(φ))′ where each Mnj(φ) is a real-valued function of φ = (φ1,…,φk)′. A NR procedure is smooth if, for each j, the k × k matrix of second-order mixed partial derivatives of Mnj(φ) exists.
Finally, we note that condition (ii) does not require that i(n) → ∞ as n → ∞. As a simple example, consider sampling independent and identically distributed (i.i.d.) observations from a normal distribution with unknown mean β0 and known variance and estimating β0 using a NR procedure. Then it is trivial to show that no matter what the starting value βn0, βn1 = β0. That is, condition (ii) is satisfied for all δ > 0 with i(n) = 1 for all n.
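To make the stopping rules in Remark 1 concrete, here is a small Python sketch (our own illustration, not code from the paper) that computes the iteration counts implied by the two bounds; the modulus c, the constant κ, and α(σ) are assumed known or estimated as described above:

import math

def iterations_linear(delta, n, c):
    """Smallest i(n) with i(n) >= -delta * ln(n) / ln(c)  (the sigma = 1 case)."""
    return math.ceil(-delta * math.log(n) / math.log(c))

def iterations_higher_order(delta, n, sigma, alpha):
    """Smallest i(n) with i(n) >= ln(-delta ln n / ln alpha) / ln(sigma),
    for convergence of order sigma > 1 with alpha = alpha(sigma) < 1."""
    return math.ceil(math.log(-delta * math.log(n) / math.log(alpha)) / math.log(sigma))

print(iterations_linear(0.5, 5000, 0.9))           # 41, matching the example in Remark 1
print(iterations_higher_order(0.5, 5000, 2, 0.5))  # 3 NR steps when alpha(2) = 0.5 (hypothetical value)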
Remark 2. Many iterative procedures converge at rate n1/2, corresponding to δ = ½ in Theorem 1. However, this need not be the case. For example, consider estimating the regression parameters in a semiparametric binary response model using a NR procedure based on the smoothed maximum score estimator of Horowitz (1992). Depending on the smoothness of the data distribution, under the conditions of Theorem 1, this NR procedure will converge at rate nδ, for some δ < ½.
The next result gives conditions for consistency without a rate of convergence.
THEOREM 2. Let i(n) be a function from Z+ to Z+. If

(i) {Mn(φ)} is an ACM on (B0,Ek);

(ii) |βn,i(n) − β0| = op(1) as n → ∞;

(iii) supφ∈B0 |M̂n(φ) − Mn(φ)| = op(1) as n → ∞;

then |β̂n,i(n) − β0| = op(1) as n → ∞.
Remark 3. In Theorem 2, if condition (i) holds, then condition (ii) holds if i(n) → ∞ as n → ∞. No minimum rate of growth is required for i(n) because the bias term in (ii) is not inflated by a factor of nδ as it is in Theorem 1.
Our next objective is to develop conditions for convergence in distribution of β̂n,i(n). To do so, we require that {M̂n(φ)} be an ACM on (B0,Ek).

Assume that the first partial derivatives of the components of M̂n(φ) and Mn(φ) exist. Let ∇φ denote the differential operator (∂/∂φ1,…,∂/∂φk). Write Vn(φ) for the k × k matrix ∇φMn(φ) and V̂n(φ) for the k × k matrix ∇φM̂n(φ). For a k × k matrix A = (aij), let ∥A∥ denote the matrix norm sup{|Ax| : |x| = 1}.
LEMMA 3. Suppose ∇φM̂n(φ) and ∇φMn(φ) exist. If

(i) {Mn(φ)} is an ACM on (B0,Ek);

(ii) supφ∈B0 |M̂n(φ) − Mn(φ)| = op(1) as n → ∞;

(iii) supφ∈B0 ∥V̂n(φ) − Vn(φ)∥ = op(1) as n → ∞;

then {M̂n(φ)} is an ACM on (B0,Ek).
We are now in a position to establish a convergence in distribution result for β̂n,i(n). The limiting distribution depends on the limiting behavior of the infeasible estimator β̄n. Recall that if {M̂n(φ)} is an ACM on (B0,Ek), then, for each ω ∈ An, M̂n(φ) has a unique fixed point on B0, denoted β̄n. Assume that the probability limit of Vn(φ) exists and let V(φ) denote this quantity. In what follows, we use the symbol ⇒ to denote convergence in distribution.
THEOREM 4. Let i(n) be a function from Z+ to Z+ and let {εn} denote an arbitrary sequence of nonnegative real numbers converging to zero as n → ∞. For δ > 0, if

(i) |β̂n,i(n) − β0| = op(1) as n → ∞;

(ii) {M̂n(φ)} is an ACM on (B0,Ek);

(iii) nδ|β̂n,i(n) − β̄n| = op(1) as n → ∞;

(iv) nδ[M̂n(β0) − β0] ⇒ Z as n → ∞ for some random vector Z;

(v) V(φ) is continuous at β0;

(vi) supφ:|φ−β0|≤εn ∥V̂n(φ) − V(φ)∥ = op(1) as n → ∞;

then nδ(β̂n,i(n) − β0) ⇒ DZ as n → ∞, where D = [Ik − V(β0)]−1.
Remark 4. Theorem 2 provides checkable conditions that imply condition (i). Lemma 3 provides checkable conditions that imply condition (ii). Remark 1 concerning the choice of i(n) can be adapted to establish condition (iii). For example, in the application in Section 3, the sample sequence of the semiparametric NR procedure is started at a consistent estimator of β0 and exhibits quadratic convergence to its sample fixed point. Replacing Mn with M̂n and β0 with β̄n in Remark 1 and choosing i(n) ≥ [ln 2]−1 ln(−δ ln n/ln α(2)) is sufficient to establish condition (iii). Finally, note that procedures exhibiting linear convergence satisfy V(β0) ≠ 0, whereas procedures exhibiting quadratic convergence (such as NR procedures) satisfy V(β0) = 0 (Burden et al., 1981, pp. 47–48). Thus, for NR procedures such as the one presented in Section 3, D is the identity matrix, and so nδ(β̂n,i(n) − β0) has the same limiting distribution as nδ(β̄n − β0) as n → ∞.
In this section, we illustrate the theory developed in Section 2 by establishing the limiting behavior of a two-stage iterative estimator of regression parameters in a semiparametric binary response model. The first-stage estimator is an ILS estimator. We establish consistency of this estimator by developing primitive conditions implying the conditions of Theorem 2 in Section 2.
Wang and Zhou (1995) appear to have been the first to propose a semiparametric ILS estimator for binary response models. They use isotonic regression to estimate a certain conditional expectation. They do not prove consistency. The theory developed in this paper covers the ILS estimator of Wang and Zhou. However, the contraction mapping condition and uniform convergence condition of Theorem 1 are harder to prove due to the difficulty in analyzing the isotonic regression component of their estimator.
Consider the binary response model Y = 1{Y* ≥ 0}, where the latent variable Y* = X′β0 − u, X = (W1,…,Wk,Wk+1)′, β0 = (β01,…,β0k,β0,k+1)′, and u is an error term with unknown distribution. Because the distribution of u is not known in this model, restrictions are needed to identify the regression parameters. We assume that Wk+1 is nonconstant and normalize β0,k+1 to unity. Rather than introduce new notation, we reinterpret β0 as the first k components of the true parameter vector divided by β0,k+1 and u as the true error divided by β0,k+1. Also, for each φ in Rk, we write X′φ for W1φ1 + ··· + Wkφk + Wk+1. In addition, we take W1 = 1. That is, W1 is the regressor corresponding to the intercept term in the model.
Let (Y1,X1),…,(Yn,Xn) denote a sample of independent observations from the model just defined and let Xn denote the n × k matrix comprised of the first k components of each Xj, j = 1,…,n. Prewhiten Xn so that Xn′Xn = nIk.
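The prewhitening step can be implemented with a symmetric inverse square root; the following Python sketch is one way to do it (the paper does not prescribe a particular transform):

import numpy as np

def prewhiten(X):
    """Return T with the same column space as X and T'T = n * I_k."""
    n, k = X.shape
    S = X.T @ X / n                       # k x k moment matrix
    vals, vecs = np.linalg.eigh(S)        # S is symmetric positive definite
    S_inv_sqrt = vecs @ np.diag(vals ** -0.5) @ vecs.T
    return X @ S_inv_sqrt

rng = np.random.default_rng(0)
X = rng.exponential(size=(500, 6))        # hypothetical design matrix
T = prewhiten(X)
print(np.allclose(T.T @ T, 500 * np.eye(6)))   # True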
Next, let F and f denote the unknown cumulative distribution function and density function of u. Fix t in R and φ in Rk. Let F(t,φ) = 𝓟{Y = 1 | X′φ = t}. Assume that ∇tF(t,φ) exists and write f(t,φ) = ∇tF(t,φ). Note that F(t,β0) = F(t) and f(t,β0) = f(t). Define

ν(t,φ) = F(t,φ)−1 ∫(−∞,t] s f(s,φ) ds = t − F(t,φ)−1 ∫(−∞,t] F(s,φ) ds

π(t,φ) = [1 − F(t,φ)]−1 ∫(t,∞) s f(s,φ) ds = t + [1 − F(t,φ)]−1 ∫(t,∞) [1 − F(s,φ)] ds.

The second representations for ν(t,φ) and π(t,φ) follow from integration by parts arguments. Note that ν(t,β0) = E(u | u ≤ t) and π(t,β0) = E(u | u > t). Write un(φ) for (u1(φ),…,un(φ))′ with uj(φ) = F(Xj′β0)ν(Xj′φ,φ) + (1 − F(Xj′β0))π(Xj′φ,φ). Write Yj(φ) for Xj′φ − uj(φ). Define the population ILS mapping

Mn(φ) = (Xn′Xn)−1Xn′Yn(φ) = n−1Xn′Yn(φ), where Yn(φ) = (Y1(φ),…,Yn(φ))′.

Note that for all j, uj(β0) = E(u) = 0. Deduce from this and prewhitening that Mn(β0) = β0. That is, β0 is a fixed point of Mn(φ).
We now construct a sample analogue of Mn(φ) by developing numerical integral approximations of ν(Xj′φ,φ) and π(Xj′φ,φ) using nearest neighbor estimators of the F(Xj′φ,φ)'s. Fix α > 1 and let {cn} denote a sequence of nonnegative real numbers satisfying cn → ∞ as n → ∞. For t in R, define the trimming functions τn(t) = 1{|t| ≤ cn} and σn(t) = 1{|t| ≤ cnα}. Note that τn(·) trims more severely than σn(·). These functions help control tail behavior. Relabel observation numbers so that index values are ordered from smallest to largest. That is, let X1′φ ≤ ··· ≤ Xn′φ. Let X0′φ = X1′φ, Xn+1′φ = Xn′φ, and Δ(Xi′φ) = Xi′φ − Xi−1′φ, i = 1,…,n + 1. For j = 1,…,n, define

ν̂n(Xj′φ,φ) = Xj′φ − F̂(Xj′φ,φ)−1 Σi:Xi′φ≤Xj′φ F̂(Xi′φ,φ)Δ(Xi′φ)σn(Xi′φ)

π̂n(Xj′φ,φ) = Xj′φ + [1 − F̂(Xj′φ,φ)]−1 Σi:Xi′φ>Xj′φ [1 − F̂(Xi′φ,φ)]Δ(Xi′φ)σn(Xi′φ).

In the preceding expressions, F̂(Xi′φ,φ) is a nearest neighbor estimator of F(Xi′φ,φ). Let kn be a positive integer. If kn < i ≤ n − kn, then F̂(Xi′φ,φ) is the arithmetic average of the 2kn + 1 binary values Yi−kn,…,Yi+kn. These are symmetric nearest neighbor estimators. If i ≤ kn, then F̂(Xi′φ,φ) is the average of the kn + 1 binary values Yi,…,Yi+kn. If i > n − kn, then F̂(Xi′φ,φ) is the average of the kn + 1 binary values Yi−kn,…,Yi.
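In Python, the three cases of the nearest neighbor estimator translate directly into windowed averages of the sorted binary responses; this sketch (ours, using 0-based indexing) is one way to compute all n estimates:

import numpy as np

def nn_regression(y_sorted, k_n):
    """Nearest neighbor estimates of F at each sorted index value.

    y_sorted: binary Y's ordered by the index values X'phi.
    Interior points get the symmetric average of 2*k_n + 1 values;
    points within k_n of either end get one-sided averages of k_n + 1 values.
    """
    n = len(y_sorted)
    F_hat = np.empty(n)
    for i in range(n):
        if i < k_n:                                   # left boundary
            F_hat[i] = y_sorted[i:i + k_n + 1].mean()
        elif i >= n - k_n:                            # right boundary
            F_hat[i] = y_sorted[i - k_n:i + 1].mean()
        else:                                         # symmetric interior window
            F_hat[i] = y_sorted[i - k_n:i + k_n + 1].mean()
    return F_hat

y = np.array([0, 0, 1, 0, 1, 1, 1, 1], dtype=float)   # toy sorted responses
print(nn_regression(y, 2))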
Write ûn(φ) for (û1(φ),…,ûn(φ))′ with ûj(φ) = [Yjν̂n(Xj′φ,φ) + (1 − Yj)π̂n(Xj′φ,φ)]τn(Xj′φ). Write Ŷj(φ) for Xj′φ − ûj(φ) and Ŷn(φ) for (Ŷ1(φ),…,Ŷn(φ))′. Define the sample ILS mapping

M̂n(φ) = (Xn′Xn)−1Xn′Ŷn(φ) = n−1Xn′Ŷn(φ).
Note that M̂n(φ) is a function of the nearest neighbor estimators F̂(Xi′φ,φ), which are step functions. This implies that M̂n(φ) (and, therefore, the least squares criterion function) is a discontinuous function of φ, violating Assumption 1a in the theory developed in Pastorello et al. (2003, p. 452).
Let β̂n0 denote an arbitrary starting value and for each i ≥ 1, define β̂ni = M̂n(β̂n,i−1). We call β̂ni a semiparametric ILS estimator of β0.
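A skeleton of the resulting ILS loop in Python (our sketch, not the authors' code): uhat_fn is a hypothetical callback standing in for the numerical-integral construction above, and the regressor with coefficient normalized to one is assumed absorbed into its output; the convergence check mirrors the criterion used in the simulations of Section 4.

import numpy as np

def ils(X, phi0, uhat_fn, tol=1e-4, max_iter=1000):
    """Iterate phi_i = (X'X)^{-1} X' (X phi_{i-1} - uhat(phi_{i-1})).

    uhat_fn(phi): hypothetical callback returning the estimated error vector
    built from nearest neighbor estimates and numerical integrals.
    Stops when the maximum relative change between iterates is below tol.
    """
    phi = np.asarray(phi0, dtype=float)
    XtX_inv = np.linalg.inv(X.T @ X)
    for i in range(1, max_iter + 1):
        y_adj = X @ phi - uhat_fn(phi)          # adjusted dependent variable
        phi_new = XtX_inv @ (X.T @ y_adj)       # one least squares step
        rel = np.max(np.abs(phi_new - phi) / np.maximum(np.abs(phi), 1e-12))
        phi = phi_new
        if rel < tol:
            break
    return phi, i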
To prove consistency of the ILS estimator β̂ni, we see from Theorem 2 in Section 2 that we must show that Mn(φ) is an asymptotic contraction mapping and that M̂n(φ) converges uniformly to Mn(φ). We now state and discuss primitive conditions sufficient to imply these high-level conditions.
Let N denote the closure of an open convex neighborhood of β0. Write Su for the support of u and Sφ for the support of X′φ. Write g(t,φ) for the density of X′φ evaluated at t. Throughout, we will maintain the following assumptions for each φ in N.
Assumption A0 describes the data and the model.
Assumptions A1 and A2 are assumptions about u and X and their relationship to each other. Note that we assume that the density of u is log-concave. The log-concave family is a large class of distributions that includes the normal, logistic, extreme-value, Laplace, gamma, beta, and triangular families, among many others. However, log-concavity is only a sufficient condition. As we will demonstrate in the next section, the ILS procedure can work well even when this assumption does not hold, as when u is log-convex. It is not necessary that Su = R. We make these assumptions for ease of exposition and because the infinite support case is a leading special case. The prewhitening in Assumption A2 is used to establish the local contraction mapping property. In general, prewhitening would not be effective in this regard if the index were not linear.
Assumption A3 defines the number of neighbors up to scale. From the proofs of Lemma 1A and Lemma 2A in the Appendix, we see that n1/2 << kn << n is sufficient. However, we choose n3/4 because this rate optimally balances convergence rates of stochastic and bias terms in the nearest neighbor estimators (see proof of Lemma 3A in the Appendix). The constant of proportionality could be any consistent estimate of a measure of spread in the Xi′φ's.
Assumptions A4–A8 define trimming and tail conditions. The interval [−cn,cn] defines the set of t values for which numerical integral estimates of ν(t,φ) and π(t,φ) are computed, and the interval [−cnα,cnα] defines the set of t values at which nearest neighbor nonparametric regression estimates of F(t,φ) are computed. These latter estimates are ratios with estimates of g(t,φ) in denominators and appear in denominators of estimates of ν(t,φ) and π(t,φ). To give an example of what we have in mind for Assumptions A4–A8, suppose cn is proportional to (log n)1/2. Then simple calculations show that Assumption A5 is satisfied provided the tails of X′φ decrease no faster than normal tails. If, in addition, α > (1 + δ)/δ for δ > 0, then the conditions in Assumptions A6–A8 are satisfied at β0 provided the tails of u decrease no faster than normal tails and no slower than |t|−(2+δ) as |t| → ∞.
Assumptions A9–A11 are conditions useful for bounding remainder terms in Taylor expansions or in proving uniform convergence results.
LEMMA 5. Suppose Assumptions A0, A1, A2, and A11 hold. Then there exists an open ball centered at β0 with closure B0 contained in N such that {Mn(φ)} is an ACM on (B0,Ek).
The following result is based on Lemmas 1A, 2A, and 3A in the Appendix, which extend some results of Ichimura (1997) relating kn, the number of nearest neighbors, to nearest neighbor window widths. It also uses Lemma 4A in the Appendix, a result on maximum sample spacings, to show that numerical integral approximations are close to their estimands.
LEMMA 6. Assumptions A0–A11 imply supφ∈B0 |M̂n(φ) − Mn(φ)| = op(1) as n → ∞.
The next result follows from Lemma 5, Lemma 6, and Theorem 2.
THEOREM 7. Suppose Assumptions A0–A11 hold and the starting value β̂n0 lies in the ball B0 from Lemma 5. If i(n) → ∞ as n → ∞, then |β̂n,i(n) − β0| = op(1) as n → ∞.
Remark 5. The starting value, β̂n0, for the ILS procedure may or may not be an element of B0. Thus, successful implementation of the procedure may require good starting values. Standard parametric estimates, such as probit or logit estimates, can be tried. For our simulations, we used ordinary least squares (OLS) estimates. Although these were generally poor estimates of β0, they proved to be good enough starting values for the ILS procedure. Also, note that an informal check of the contraction mapping condition can always be made: given a sequence of ILS iterates {β̂ni}, compute the ratios |β̂n,i+1 − β̂ni|/|β̂ni − β̂n,i−1|. Ratios less than unity give informal support to the contraction mapping assumption.
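This diagnostic is a one-liner to compute; the sketch below (ours) returns the successive ratios for a list of iterates:

import numpy as np

def contraction_ratios(iterates):
    """Ratios |b_{i+1} - b_i| / |b_i - b_{i-1}| for successive iterates.

    Values persistently below one lend informal support to the
    contraction mapping assumption, as described in Remark 5.
    """
    b = [np.asarray(v, dtype=float) for v in iterates]
    d = [np.linalg.norm(b[i + 1] - b[i]) for i in range(len(b) - 1)]
    return [d2 / d1 for d1, d2 in zip(d, d[1:]) if d1 > 0]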
We now develop the efficient estimator. We do so by starting from the consistent ILS estimates and taking Newton–Raphson steps using the criterion function for the efficient Klein and Spady (1993) estimator. Let d = k − 1, reinterpret β0 as the last d components of the true parameter vector, and write β̂n,ILS for the last d components of the ILS estimator. In short, we throw out the intercept term. We do this because the Klein and Spady estimator does not estimate the intercept. Let φ denote an element of Rd. Define the sample NR mapping

M̂n(φ) = φ − [Ĥn(φ)]−1Ĝn(φ),

where Ĝn(φ) is the gradient and Ĥn(φ) the Hessian or outer product gradient of the criterion function of the efficient Klein and Spady estimator as defined in Sherman (1994). This function has the form

n−1 Σj {Yj log F̂(Xj′φ,φ) + (1 − Yj) log[1 − F̂(Xj′φ,φ)]}τn(Xj′φ),

where F̂(Xj′φ,φ) is a nonparametric regression estimator of F(Xj′φ,φ) and τn(x) = 1{|x| ≤ cn}. Define β̂n0 = β̂n,ILS and for i ≥ 1, define β̂ni = M̂n(β̂n,i−1). We call β̂ni a semiparametric NR estimator of β0.
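The NR stage only needs a handful of Newton steps; the following Python sketch (ours) applies the mapping φ − [Ĥn(φ)]−1Ĝn(φ) a fixed number of times, with grad_fn and hess_fn as hypothetical callbacks for the gradient and Hessian of the sample criterion:

import numpy as np

def nr_iterate(phi0, grad_fn, hess_fn, n_steps):
    """Apply the Newton-Raphson mapping phi - H(phi)^{-1} G(phi) n_steps times.

    grad_fn(phi): gradient of the sample criterion at phi.
    hess_fn(phi): Hessian (or outer product of gradients) at phi.
    n_steps can be chosen as i(n) in Remark 4.
    """
    phi = np.asarray(phi0, dtype=float)
    for _ in range(n_steps):
        phi = phi - np.linalg.solve(hess_fn(phi), grad_fn(phi))
    return phi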
Define the population NR mapping

M(φ) = φ − [H(φ)]−1G(φ),

where G(φ) is the gradient and H(φ) the Hessian or outer product gradient of the function

E{Y log F(X′φ,φ) + (1 − Y) log[1 − F(X′φ,φ)]},

where F(t,φ) = 𝓟{Y = 1 | X′φ = t}. Note that M(φ) does not depend on n or ω and M(β0) = β0.
Define V(φ) = ∇φM(φ) and note that V(β0) = 0d, the d × d zero matrix. Assumptions in Sherman (1994) imply that H(φ) is continuous in a neighborhood of β0. It follows that V(φ) is continuous in a neighborhood of β0. Deduce that there exists an open ball centered at β0 with closure B0 such that M(φ) is a contraction mapping on (B0,Ed). Arguments in Sherman (1994) can be adapted to show that as n → ∞, (i) supφ∈B0 |M̂n(φ) − M(φ)| = op(1), (ii) supφ∈B0 ∥V̂n(φ) − V(φ)∥ = op(1), and (iii) √n(β̄n − β0) ⇒ N(0,−[H(β0)]−1). Deduce from these facts, Lemma 3, and Theorem 4 that for i(n) ≥ [ln 2]−1 ln(−0.5 ln n/ln κ) for κ ∈ (0,1) (see Remark 1), √n(β̂n,i(n) − β0) converges in distribution to a N(0,−[H(β0)]−1) random variable as n → ∞. This is the limiting distribution of the efficient Klein and Spady (1993) estimator.
Finally, we note that the ILS procedure developed in this section can be easily extended to semiparametric censored regression models. For example, if the latent variable Y* = X′β0 − u and we observe (Y,X) where Y = Y*1{Y* ≥ 0}, then an ILS estimator of β0 can be defined exactly as in this section after setting Ŷj(φ) = Yj when Yj > 0 and Ŷj(φ) = Xj′φ − π̂n(Xj′φ,φ) when Yj = 0. Similar extensions can cover other censoring schemes such as those that involve top-coding.
In this section, we provide a small simulation study comparing the speed and performance of the ILS procedure developed in Section 3 to the speed and performance of the efficient estimator of Klein and Spady (1993) for the semiparametric binary response model.
The model we use in the simulations has the form Y = 1{u ≤ X′β0}, where the regressor vector X = (W1,…,W6,W7)′ and X′β0 = W1β01 + ··· + W7β07 with β01 = 0, β02 = −2, β03 = −1, β04 = −0.5, β05 = 0.5, β06 = 2, and β07 = 1. In this model, W1 = 1 and W2–W7 are independent standard exponential random variables. In addition, u is independent of X and has either a χ2(1) or a χ2(3) distribution standardized to have mean zero and variance equal to the variance of X′β0. Thus, the signal to noise ratio in the simulations is unity. We normalize on β07 and report estimates of the slope coefficients β02, β03, β04, β05, β06.
Although the ILS procedure directly produces an intercept estimate, the Klein and Spady estimator does not. Consequently, we do not report intercept estimates.
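One way to generate a data set from this design, as a Python sketch (ours; the paper's simulations were coded in GAUSS):

import numpy as np

def simulate(n, df, seed=None):
    """One draw of the Section 4 design: Y = 1{u <= X'beta0}.

    W1 = 1; W2,...,W7 are iid standard exponential; u is chi-square(df),
    df in {1, 3}, standardized to mean zero and variance Var(X'beta0),
    and independent of X, so the signal to noise ratio is one.
    """
    rng = np.random.default_rng(seed)
    beta0 = np.array([0.0, -2.0, -1.0, -0.5, 0.5, 2.0, 1.0])
    X = np.column_stack([np.ones(n), rng.exponential(size=(n, 6))])
    index = X @ beta0
    var_index = np.sum(beta0[1:] ** 2)   # exponential W's have unit variance
    u = rng.chisquare(df, size=n)
    u = (u - df) / np.sqrt(2.0 * df) * np.sqrt(var_index)
    Y = (u <= index).astype(int)
    return Y, X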
For the ILS procedure, we choose the number of neighbors kn by a “leave one out” method of least squares cross-validation calibrated on one of the models defined previously with n = 10,000. This and A3 produce the rule kn ≈ 0.14n3/4, which we use in all the simulations. We do no trimming. Also, we use OLS starting values and say that the procedure has converged when the maximum relative difference between the components of successive iterates is less than 10−4 in absolute value. If this convergence criterion is not met after 1,000 iterations, we stop and use the last five components of the 1,000th iterate as an estimate of (β02,…,β06).
We compute the Klein and Spady criterion function using a program written in the GAUSS language by Roger Klein. This program likewise does no trimming. We compute the Klein and Spady estimator using the MAXLIK optimization routine with PROBIT starting values and say that the procedure has converged when the maximum component of the gradient is less than 10−4 in absolute value. If this criterion is not met after 50 iterations, we stop and use the 50th iterate as an estimate of (β02,…,β06). Typically, the components of the 50th iterate are not changing in the first four decimal places.
Results for both the ILS and Klein and Spady estimators are based on the same simulation data sets. All simulations are performed using GAUSS 3.5 for Windows on a Pentium-III PC with an 800 megahertz processor.
Table 1 presents means and root-mean-squared error (RMSE) statistics for the ILS and Klein and Spady (KS) estimators, based on 100 simulations. Results for the PROBIT and OLS estimators are also provided as points of comparison. For both error distributions, when n = 1,000, both the ILS and Klein and Spady estimators do well in terms of bias. The Klein and Spady estimator slightly outperforms the ILS estimator in terms of RMSE for χ2(1) errors; the two estimators are close in RMSE when errors are χ2(3). When n = 5,000, the estimators are almost indistinguishable in terms of both bias and RMSE. Note that both absolute and relative bias in PROBIT and OLS estimates increase as the parameter values increase in magnitude, the effect being more pronounced for the more skewed χ2(1) distribution.
Table 2 presents computation statistics that allow timing and convergence comparisons of the ILS and Klein and Spady estimators. These statistics correspond to the results presented in Table 1. The first column gives the median number of iterations per simulation, the second column gives the approximate time per iteration, and the third column gives the percentage of simulations where the respective convergence criteria were satisfied. We see that for both error distributions, when n = 1,000, most of the ILS simulations do not satisfy the ILS criterion. However, we see from Table 1 that this does not imply that the ILS iterates are diverging. Rather, the iterates oscillate between vectors whose components differ in the third or fourth decimal places. We suspect that this oscillation is due to lack of smoothness in M̂n(φ) when n = 1,000. Note that when n = 5,000, the median number of ILS iterations decreases to 175, whereas nearly all the simulations satisfy the ILS criterion. The Klein and Spady estimator, on the other hand, satisfies its convergence criterion most of the time, though contrary to our expectations, the percentage decreases as skewness decreases and as sample size increases.
A notable aspect of Table 2 is the timing comparison. The ILS estimator, on average, requires only about 20 seconds to estimate six parameters (five slope coefficients and an intercept) when n = 5,000. The Klein and Spady estimator, on average, requires well over 2 hours to estimate five parameters when n = 5,000. The ILS procedure is fast because (i) only O(n) calculations are needed to compute the nearest neighbor estimators of all the F(Xj′φ,φ)'s and (ii) computation time for a least squares calculation is nearly constant as a function of k, the number of estimated parameters. By contrast, computing the Klein and Spady criterion function, which involves local smoothing, is an O(n2hn) calculation where hn is the deterministic bandwidth for the kernel regression estimators of the F(Xj′φ,φ)'s. In addition, computation time can increase as a quadratic in k, because steps can require the computation of numerical gradients and Hessians of the criterion function.
This paper develops general conditions for rates of convergence and convergence in distribution of iterative procedures for estimating finite-dimensional parameters. The theory covers iterative estimation schemes such as expectation-maximization (EM), Newton–Raphson (NR), and iterative least squares (ILS) procedures. The theory requires a combination of asymptotic contraction mapping conditions and uniform convergence conditions, with convergence in distribution requiring an additional convergence in distribution result for a certain infeasible estimator. A bias condition is isolated that can be used to derive sensible stopping rules.
We illustrate the theory by establishing the limiting distribution of a two-stage iterative estimator of regression parameters in a semiparametric binary response model. The first stage is a consistent ILS procedure. The second stage is a NR procedure that is started at the ILS estimates and is based on the criterion function of the efficient Klein and Spady (1993) estimator. The ILS/NR procedure achieves the semiparametric efficiency bound for this model established by Chamberlain (1986) and Cosslett (1987). Simulations show that the ILS procedure is very fast to compute even for models with many observations and many estimable parameters. In addition, the ILS estimator is comparable to the Klein and Spady estimator in terms of root-mean-squared error in the given simulations. The ILS procedure developed for the semiparametric binary response model easily extends to cover various semiparametric censored regression models.
Proof of Theorem 1. By (i), there exists a constant c in [0,1) that does not depend on n or ω, and sets {An} with each 𝓟(An) → 1 as n → ∞, such that for each ω ∈ An, |Mn(φ) − Mn(γ)| ≤ c|φ − γ| for each φ,γ in B0. Let ρ0 = |βn0 − β0|. By (iii), there exist sets {Bn} with each 𝓟(Bn) → 1 as n → ∞, such that for each ω ∈ Bn for all n large enough, nδ supφ∈B0 |M̂n(φ) − Mn(φ)| is bounded and supφ∈B0 |M̂n(φ) − Mn(φ)| is arbitrarily small, in particular, smaller than (1 − c)ρ0. Let Cn = An ∩ Bn and note that 𝓟(Cn) → 1 as n → ∞. It follows that for each ω ∈ Cn and for each γ in B0,

|M̂n(γ) − β0| ≤ |M̂n(γ) − Mn(γ)| + |Mn(γ) − Mn(β0)| ≤ (1 − c)ρ0 + cρ0 = ρ0.

Deduce that for each ω ∈ Cn, M̂n(φ) maps B0 to itself.
Next, note that |β̂n,i(n) − β0| can be bounded by a bias term plus a stochastic term. That is,

|β̂n,i(n) − β0| ≤ |βn,i(n) − β0| + |β̂n,i(n) − βn,i(n)|.

By (ii), the bias term has order Op(n−δ) as n → ∞. Consider the stochastic term. Recall that Mn(φ) is a contraction mapping on B0 and that for each ω ∈ Cn, M̂n(φ) maps B0 to itself. It follows that for each ω ∈ Cn, the iterates βni and β̂ni, i ≥ 1, all lie in B0. Thus, for each ω ∈ Cn,

|β̂ni − βni| ≤ |M̂n(β̂n,i−1) − Mn(β̂n,i−1)| + |Mn(β̂n,i−1) − Mn(βn,i−1)| ≤ supφ∈B0 |M̂n(φ) − Mn(φ)| + c|β̂n,i−1 − βn,i−1|.

Apply the last inequality recursively to see that for each ω ∈ Cn,

|β̂n,i(n) − βn,i(n)| ≤ (1 − c)−1 supφ∈B0 |M̂n(φ) − Mn(φ)|.

Condition (iii) implies that the stochastic term has order Op(n−δ), which proves the result. █
Proof of Theorem 2. This involves a trivial modification of the proof of Theorem 1. █
Proof of Lemma 3. Fix φ and γ in B0. Because B0 is convex, we may apply a multivariate Taylor expansion (e.g., Dieudonné, 1969, p. 190) to write Mn(φ) − Mn(γ) = Λn(φ,γ)[φ − γ], where Λn(φ,γ) is the k × k matrix whose jth row is the jth row of Vn(·) evaluated at a point on the line segment joining φ and γ. Similarly, we may write M̂n(φ) − M̂n(γ) = Λ̂n(φ,γ)[φ − γ], where Λ̂n(φ,γ) is defined analogously with V̂n(·) in place of Vn(·). Use (i) and (ii) and argue as in the proof of Theorem 1 that there exist sets {Cn} with each 𝓟(Cn) → 1 as n → ∞, such that for each ω ∈ Cn, M̂n(φ) maps B0 to itself and Mn(φ) is a contraction mapping on (B0,Ek) with fixed point β0 and modulus of contraction c ∈ [0,1) independent of n and ω. Then for each ω ∈ Cn and for each φ,γ in B0,

|M̂n(φ) − M̂n(γ)| ≤ |Mn(φ) − Mn(γ)| + ∥Λ̂n(φ,γ) − Λn(φ,γ)∥|φ − γ| ≤ c|φ − γ| + ∥Λ̂n(φ,γ) − Λn(φ,γ)∥|φ − γ|.

By (iii), the last term has order op(|φ − γ|) as n → ∞, from which the result follows. █
Proof of Theorem 4. By (ii), there exist sets {An} with each 𝓟(An) → 1 as n → ∞ such that for each ω ∈ An, β̄n is the unique fixed point of M̂n(φ) on B0. For ω ∈ An, write

β̂n,i(n) − β0 = [β̂n,i(n) − β̄n] + [β̄n − β0].

By (iii), the first term on the right has order op(n−δ) as n → ∞. The result will follow if, as n → ∞,

nδ(β̄n − β0) ⇒ DZ.

Because β̄n is a fixed point of M̂n(φ), we have M̂n(β̄n) = β̄n. Thus, for each ω ∈ An,

β̄n − β0 = [M̂n(β̄n) − M̂n(β0)] + [M̂n(β0) − β0].

Expand M̂n(β̄n) about β0 to get

M̂n(β̄n) − M̂n(β0) = V̂n(βn*)[β̄n − β0],

where βn* is on the line segment connecting β̄n and β0. Combine the last two expressions to get

[Ik − V̂n(βn*)][β̄n − β0] = M̂n(β0) − β0.

Note that

Ik − V̂n(βn*) = Ik − V(β0) − [V(βn*) − V(β0)] − [V̂n(βn*) − V(βn*)].

Conditions (i) and (ii) imply that |β̄n − β0| = op(1) as n → ∞. This and the definition of βn* imply that |βn* − β0| = op(1) as n → ∞. This and condition (v) imply that the first term in brackets has order op(1) as n → ∞. Condition (vi) and the fact that |βn* − β0| = op(1) as n → ∞ imply that the second term in brackets has order op(1) as n → ∞. This and condition (iv) imply the result. █
Next, we prove several lemmas used in the proof of consistency of the semiparametric ILS estimator. The first is an extension of a result of Ichimura (1997, Lemma 4.1, p. 29) relating a nearest neighbor window width to the number of neighbors.
Recall that N is a neighborhood of β0 and, for φ in N, g(t,φ) is the density of X′φ evaluated at t. Define c(t,φ) and d(t,φ) to be the scale factors for the right and left nearest neighbor window widths, respectively, that appear in the approximations below. Note that for all t and φ, c(t,φ) ≥ 1 and d(t,φ) ≥ 1. In addition, by Assumption A10, both c(t,φ) and d(t,φ) are continuous in t and φ. Define an(t,φ) to be the distance from t to its knth nearest neighbor to the right among the index values X1′φ,…,Xn′φ and bn(t,φ) the distance from t to its knth nearest neighbor to the left. Recall that the trimming function σn(t) = 1{|t| ≤ cnα}. Fix μ > 0. Define Sn to be the interval [−cnα,cnα], Ln the lower interval [−cnα,−μ], M the middle interval [−μ,μ], and Un the upper interval [μ,cnα].
LEMMA 1A. Suppose Assumptions A2, A3, A5, A9, and A10 hold. Then, as n → ∞,

supt∈M,φ∈N |rn(t,φ) − 1| = op(1),

where either rn(t,φ) = [nc(t,φ)g(t,φ)an(t,φ)]/kn or rn(t,φ) = [nd(t,φ)g(t,φ)bn(t,φ)]/kn.
Proof. We shall prove the result for the case rn(t,φ) = [nc(t,φ)g(t,φ)an(t,φ)]/kn. The proof for the other case is similar.
Let H(·,t,φ) denote the cumulative distribution function (c.d.f.) of (X′φ − t)+. That is, for s ≥ 0,

H(s,t,φ) = 𝓟{0 < X′φ − t ≤ s}.

Let Hn(s,t,φ) denote the corresponding empirical c.d.f. That is, for s ≥ 0,

Hn(s,t,φ) = n−1 Σi 1{0 < Xi′φ − t ≤ s}.
Note that Hn(an(t,φ),t,φ) = kn/n and so

H(an(t,φ),t,φ) = kn/n + [H(an(t,φ),t,φ) − Hn(an(t,φ),t,φ)].

Standard empirical process results (e.g., Pakes and Pollard, 1989) imply that as n → ∞,

sups,t,φ |Hn(s,t,φ) − H(s,t,φ)| = Op(n−1/2).

Thus, uniformly over t ∈ M and φ ∈ N,

H(an(t,φ),t,φ) = kn/n + Op(n−1/2).   (A.1)

This and a Taylor expansion of H about s = 0 imply that uniformly over t ∈ M and φ ∈ N,

c(t,φ)g(t,φ)an(t,φ) + ½∇tg(t + s*,φ)an(t,φ)² = kn/n + Op(n−1/2),   (A.2)

where s* is between zero and an(t,φ). Because kn >> n1/2, we get from (A.1) that uniformly over t ∈ M and φ ∈ N,

H(an(t,φ),t,φ) = (kn/n)[1 + op(1)].   (A.3)
The continuity of g(t,φ) implies that c(t,φ)g(t,φ) is a continuous function of t and φ on the compact set M × N. Thus, it achieves its minimum on this set. Moreover, this minimum must be positive because the support of X′φ contains M for each φ in N. Deduce from a standard uniform law of large numbers that wp → 1 as n → ∞, there exists a positive number p independent of t and φ, such that there are at least pn points in [t,t + 1] for each t ∈ M and φ ∈ N. Because kn << n, wp → 1 as n → ∞, an(t,φ) (and therefore, s*) must be in the interval [0,1] for each t ∈ M and φ ∈ N. Because n1/2 << kn << n and 1/[c(t,φ)g(t,φ)] is bounded over M × N, it follows from (A.1) that an(t,φ) = Op(kn/n) = op(1) uniformly over M × N. The left-hand side of (A.2) equals

(kn/n)rn(t,φ) + ½∇tg(t + s*,φ)an(t,φ)².

The result follows from (A.3), an(t,φ) = Op(kn/n), and a Taylor expansion of g(t + s*,φ) about s = 0 together with the uniform convergence of s* to zero and supt,φ|∇tg(t,φ)| < ∞. █
LEMMA 2A. Suppose Assumptions A2, A3, A5, A9, and A10 hold. Then, as n → ∞,

(i) supt∈Ln,φ∈N |[nc(t,φ)g(t,φ)an(t,φ)]/kn − 1| = op(1);

(ii) supt∈Un,φ∈N |[nd(t,φ)g(t,φ)bn(t,φ)]/kn − 1| = op(1).
Proof. We shall prove (i). The proof of (ii) is similar.
To establish (i), note that an argument similar to the one given in the proof of Lemma 1A shows that wp → 1 as n → ∞, an(t,φ) (and therefore, s*) must be in the interval [0,cnα − μ + 1] for each t ∈ Ln and φ ∈ N. Fix δ > 0 such that kn << n1−δ. Because n1/2 << kn << n1−δ and 1/[c(t,φ)g(t,φ)] << nδ uniformly over Ln × N, it follows from (A.1) that an(t,φ) = op(kn/n1−2δ) uniformly over Ln × N. Result (i) now follows from (A.2) and (A.3), a Taylor expansion of g(t + s*,φ) about s = 0, the fact that s* = op(kn/n1−2δ) uniformly over Ln × N, and supt,φ|∇tg(t,φ)| < ∞. █
Recall that we use three types of nearest neighbor estimators of F(t,φ): symmetric, asymmetric from the right, and asymmetric from the left. The nearest neighbor estimator that is asymmetric from the right has the form

F̂R(t,φ) = Σi Yi1{0 ≤ Xi′φ − t ≤ an(t,φ)} / Σi 1{0 ≤ Xi′φ − t ≤ an(t,φ)},

whereas the nearest neighbor estimator that is asymmetric from the left has the form

F̂L(t,φ) = Σi Yi1{0 ≤ t − Xi′φ ≤ bn(t,φ)} / Σi 1{0 ≤ t − Xi′φ ≤ bn(t,φ)}.

The symmetric nearest neighbor estimator has the form

F̂S(t,φ) = Σi Yi1{−bn(t,φ) ≤ Xi′φ − t ≤ an(t,φ)} / Σi 1{−bn(t,φ) ≤ Xi′φ − t ≤ an(t,φ)}.

We see that if t ≠ Xj′φ for any j, then F̂S(t,φ) is an arithmetic average of 2kn binary values. We shall establish rates of uniform consistency for F̂R(t,φ). The same results with similar proofs can be obtained for the other types. Note that we can write

F̂R(t,φ) = N̂n(t,φ)/P̂n(t,φ),

where

N̂n(t,φ) = n−1 Σi Yi1{0 ≤ Xi′φ − t ≤ an(t,φ)} and P̂n(t,φ) = n−1 Σi 1{0 ≤ Xi′φ − t ≤ an(t,φ)}.

Our next result gives a rate of uniform convergence of F̂R(t,φ) to F(t,φ).
LEMMA 3A. Suppose Assumptions A2, A3, A5, A9, and A10 hold. Then, as n → ∞, for all δ > 0,

(i) supt∈Sn,φ∈N |F̂R(t,φ) − F(t,φ)| = op(n−1/4+δ);

(ii) supt∈Sn,φ∈N |F̂S(t,φ) − F(t,φ)| = op(n−1/4+δ);

(iii) supt∈Sn,φ∈N |F̂L(t,φ) − F(t,φ)| = op(n−1/4+δ).
Proof. We first prove (i). Decompose F̂R(t,φ) − F(t,φ) into a sum of a stochastic term and a bias term:

F̂R(t,φ) − F(t,φ) = P̂n(t,φ)−1n−1 Σi [Yi − F(Xi′φ,φ)]1{0 ≤ Xi′φ − t ≤ an(t,φ)} + P̂n(t,φ)−1n−1 Σi [F(Xi′φ,φ) − F(t,φ)]1{0 ≤ Xi′φ − t ≤ an(t,φ)}.   (A.4)

Let {εn} denote an arbitrary sequence of nonnegative real numbers satisfying εn → 0 as n → ∞. By Lemma 1A and Lemma 2A, wp → 1 as n → ∞, for t ∈ Sn, φ ∈ N, and a satisfying |nc(t,φ)g(t,φ)a/kn − 1| ≤ εn, the first term in (A.4) is bounded by

(2n/kn) supt,φ,a |n−1 Σi [Yi − F(Xi′φ,φ)]1{0 ≤ Xi′φ − t ≤ a}|.   (A.5)

Standard empirical process results (e.g., Pakes and Pollard, 1989) show that uniformly over t ∈ Sn, φ ∈ N, and a ≥ 0, the term in absolute value signs in (A.5) has order Op(n−1/2) as n → ∞. Deduce from this, Assumptions A3 and A5, and the condition on a, that for all δ > 0, the term in (A.5) has order op(n−1/4+δ) as n → ∞.

Next, consider the bias term in (A.4). This term, wp → 1 as n → ∞, for t ∈ Sn, φ ∈ N, and a satisfying |nc(t,φ)g(t,φ)a/kn − 1| ≤ εn, is bounded by

(2n/kn) supφ∈N ∫ |F(v,φ) − F(t,φ)|1{0 ≤ v − t ≤ a}g(v,φ) dv.   (A.6)

After the change of variable z = (v − t)/a, the integral in (A.6) equals

a ∫[0,1] |F(t + az,φ) − F(t,φ)|g(t + az,φ) dz.

A Taylor expansion of g(t + az,φ) about t, together with the bounded derivatives in Assumption A9, Assumptions A3 and A5, and the condition on a, implies that for each δ > 0, the term in (A.6) has order op(n−1/4+δ) as n → ∞. This proves (i).
Mimic the proof of (i) to prove that (iii) holds; that is,

supt∈Sn,φ∈N |F̂L(t,φ) − F(t,φ)| = op(n−1/4+δ).   (A.7)

To prove (ii), note that, up to the possible double counting of one observation, F̂S(t,φ) is a convex combination of F̂R(t,φ) and F̂L(t,φ). Apply (i), (A.7), and Assumption A5 to get the result. █
LEMMA 4A. Let P be a probability distribution and V1,…,Vn a sample of independent observations from P. Let λn be the infimum of the density of P over [−κn,κn], 0 < κn < ∞, and suppose λn > 0. For αn > 0, partition [−κn,κn] into N = (2κnn)/αn intervals of length αn/n. For i = 1,…,N, let Ai be the event that the ith interval contains at least one sample point. As n → ∞,

(i) 𝓟(∩i=1N Ai) ≥ 1 − N[1 − λnαn/n]n;

(ii) if N exp(−λnαn) → 0, then 𝓟(∩i=1N Ai) → 1.
Proof. By Bonferroni's inequality,

𝓟(∩i=1N Ai) ≥ 1 − Σi=1N 𝓟(Aic) ≥ 1 − N[1 − λnαn/n]n ≥ 1 − N exp(−λnαn).

Both (i) and (ii) now follow from simple calculus. █
Lemma 5A is used in the proof of Lemma 5. It is proved in a more general form in Klein and Spady (1993). For ease of reference, we restate the result in the form in which we use it in the proof of Lemma 5.
LEMMA 5A. Suppose f(·) is bounded and E|Wc| < ∞, c = 1,…,k. Then, for c = 1,…,k,

[∇cF(t,φ)]φ=β0 = −f(t)E(Wc | X′β0 = t).
Proof of Lemma 5. Note that Vn(φ) = ∇φ Mn(φ) is an average of i.i.d. random variables. By Assumption A11 and Lemma 2.13 in Pakes and Pollard (1989), Vn(φ) converges uniformly over N to its probability limit V(φ). Moreover, Assumption A11 implies that V(φ) is continuous on N.
Fix φ and γ in N. By a multivariate Taylor expansion, there exists a φ* on the line segment between φ and γ such that as n → ∞,

|Mn(φ) − Mn(γ)| ≤ ∥Vn(φ*)∥|φ − γ| = [∥Vn(β0)∥ + op(1)]|φ − γ|.

By the continuity of V(φ) and a standard linear algebra result, it is enough to show that wp → 1 as n → ∞, the largest eigenvalue of Vn(β0) in absolute value is strictly less than unity. To this end, recall that

Mn(φ) = n−1 Σj xjYj(φ) with Yj(φ) = Xj′φ − uj(φ),

where xj′ = (Wj1,…,Wjk) denotes the jth row of the prewhitened matrix Xn. Write ∇c for ∂/∂φc, c = 1,2,…,k. The rcth element of Vn(β0) equals

δrc − n−1 Σj Wjr[F(Xj′β0)∇cν(Xj′φ,φ) + (1 − F(Xj′β0))∇cπ(Xj′φ,φ)]β0,   (A.8)

where δrc = 1{r = c}. Recall that X = (W1,…,Wk,Wk+1) and W1 = 1. Integrate by parts and then differentiate and apply Lemma 5A to get that [∇cν(Xj′φ,φ)]β0 equals

ν1(Xj′β0)Wjc + [1 − ν1(Xj′β0)]κc(Xj′β0).

Similarly, [∇cπ(Xj′φ,φ)]β0 equals

π1(Xj′β0)Wjc + [1 − π1(Xj′β0)]κc(Xj′β0),

where κc(t) = E(Wc | X′β0 = t). Integration by parts arguments show that

ν1(t) = f(t)[t − ν(t,β0)]/F(t) and π1(t) = f(t)[π(t,β0) − t]/[1 − F(t)],   (A.9)

where ν1(t) and π1(t) denote ∇tν(t,β0) and ∇tπ(t,β0). Write d(Xj′β0) for F(Xj′β0)ν1(Xj′β0) + [1 − F(Xj′β0)]π1(Xj′β0). The term in outer brackets in (A.8) is equal to

d(Xj′β0)Wjc + [1 − d(Xj′β0)]κc(Xj′β0).

This and prewhitening let us write the rcth element of Vn(β0) as

δrc − n−1 Σj Wjr{d(Xj′β0)Wjc + [1 − d(Xj′β0)]κc(Xj′β0)}.   (A.10)

Because W1 = 1, κ1 = 1. Prewhitening implies that Wj1 = 1 for all j. Deduce that for r = 1,…,k and c = 1, the rcth element of Vn(β0) equals zero. That is, the first column of Vn(β0) is a vector of zeros. This implies that one solution of the characteristic equation of Vn(β0) must be zero. Thus, to prove that wp → 1 as n → ∞, the maximum eigenvalue of Vn(β0) in absolute value is strictly less than unity, it is enough to prove this for the (k − 1) × (k − 1) lower right-hand submatrix of Vn(β0). For convenience, we call this submatrix An.
Prewhitening implies that n−1 Σj WjrWjc = δrc. Deduce from this and (A.10) that the rcth element of An equals

n−1 Σj [1 − d(Xj′β0)]Wjr[Wjc − κc(Xj′β0)].

Let X̃n denote the n × (k − 1) matrix comprised of the second through kth components of each regressor vector and let Xβ0 denote the n × 1 vector with jth component Xj′β0. Define Tn = n−1/2X̃n and let Sn denote the n × (k − 1) matrix whose jcth element is n−1/2κc+1(Xj′β0), the conditional expectations evaluated at the components of Xβ0. Also, let Dn denote the n × n diagonal matrix with jjth element d(Xj′β0). We see that

An = Tn′[In − Dn][Tn − Sn].

By prewhitening, Tn′Tn = Ik−1 = Tn′InTn. The fact that Tn′Dn Sn = Sn′Dn Sn + op(1) as n → ∞ implies that wp → 1 as n → ∞,

An = Wn′Wn + Zn′Zn + op(1),

where Wn = [In − Dn]1/2Tn and Zn = Dn1/2[Tn − Sn]. (Note that log-concavity of u and Proposition 1 in Heckman and Honoré, 1990, imply that the diagonal elements of Dn are in [0,1], making it possible to form Dn1/2 and [In − Dn]1/2.) Thus, wp → 1 as n → ∞, An is a nonnegative definite matrix and so must have all nonnegative eigenvalues. Recall from the preceding discussion that wp → 1 as n → ∞,

An = Ik−1 − Σn′Σn + op(1),

where Σn = Dn1/2Sn. It follows that wp → 1 as n → ∞, the maximum eigenvalue of An is equal to

1 − min|x|=1 x′Σn′Σnx + op(1).

Assumptions A1 and A2 imply that the smallest eigenvalue of Σn′Σn is bounded away from zero wp → 1 as n → ∞. Thus, x′Σn′Σnx is positive and bounded away from zero for all unit vectors x in Rk−1. Thus, wp → 1 as n → ∞, the maximum eigenvalue of An is strictly less than unity. This proves the result. █
Proof of Lemma 6. Note that, uniformly over φ ∈ B0, |M̂n(φ) − Mn(φ)| = n−1|Xn′[ûn(φ) − un(φ)]| is bounded by

n−1|Xn′[ûn(φ) − ũn(φ)]| + n−1|Xn′[ũn(φ) − ūn(φ)]| + n−1|Xn′[ūn(φ) − un(φ)]|,   (A.11)

where ũn(φ) = (ũ1(φ),…,ũn(φ))′ with ũj(φ) = [Yjν(Xj′φ,φ) + (1 − Yj)π(Xj′φ,φ)]τn(Xj′φ), ūn(φ) = (ū1(φ),…,ūn(φ))′ with ūj(φ) = [F(Xj′β0)ν(Xj′φ,φ) + (1 − F(Xj′β0))π(Xj′φ,φ)]τn(Xj′φ), and un(φ) = (u1(φ),…,un(φ))′ with uj(φ) = F(Xj′β0)ν(Xj′φ,φ) + (1 − F(Xj′β0))π(Xj′φ,φ).
Each component of the second term in (A.11) is an average of zero mean i.i.d. random variables. Deduce from this, Assumption A11, euclidean properties of the relevant classes of functions, and Lemma 2.13 in Pakes and Pollard (1989) that this term has order Op(n−1/2) uniformly over φ ∈ B0.
Each component of the third term in (A.11) can be decomposed into its mean plus a term that is an average of zero mean i.i.d. random variables. This latter term has order Op(n−1/2) uniformly over φ ∈ B0. This follows from the same argument used to handle the second term in (A.11). The mean of the rth component is

E{Wr[F(X′β0)ν(X′φ,φ) + (1 − F(X′β0))π(X′φ,φ)][τn(X′φ) − 1]},

which converges to zero as n → ∞ by Assumptions A2 and A11 and dominated convergence, because τn(t) → 1 as n → ∞ for all t in R.
Consider the first term in (A.11). The result will follow if wp → 1 as n → ∞, uniformly over j for which τn(Xj′φ) = 1 and over φ ∈ B0, ν̂n(Xj′φ,φ) converges to ν(Xj′φ,φ) and π̂n(Xj′φ,φ) converges to π(Xj′φ,φ). We will prove the former. Proof of the latter is similar. Note that ν̂n(Xj′φ,φ) − ν(Xj′φ,φ) equals

F(Xj′φ,φ)−1 ∫(−∞,Xj′φ] F(s,φ) ds − F̂(Xj′φ,φ)−1 Σi:Xi′φ≤Xj′φ F̂(Xi′φ,φ)Δ(Xi′φ)σn(Xi′φ),   (A.12)

which can be written as the sum of

[F(Xj′φ,φ)−1 − F̂(Xj′φ,φ)−1] ∫(−∞,Xj′φ] F(s,φ) ds + F̂(Xj′φ,φ)−1 Σi:Xi′φ≤Xj′φ [F(Xi′φ,φ) − F̂(Xi′φ,φ)]Δ(Xi′φ)σn(Xi′φ)   (A.13)

and

F̂(Xj′φ,φ)−1 [∫(−∞,Xj′φ] F(s,φ) ds − Σi:Xi′φ≤Xj′φ F(Xi′φ,φ)Δ(Xi′φ)σn(Xi′φ)].   (A.14)
Consider (A.13) in the last display. Assumption A4 and the fact that F(t,φ) is bounded between zero and unity imply that for all δ > 0,

supφ maxj:τn(Xj′φ)=1 ∫(−∞,Xj′φ] F(s,φ) ds = O(nδ).   (A.15)

Deduce from (A.15), Lemma 3A(iii), and Assumption A6 that for all δ > 0, wp → 1 as n → ∞, uniformly over j and φ,

|F(Xj′φ,φ)−1 − F̂(Xj′φ,φ)−1| = op(n−1/4+δ).   (A.16)

It then follows from (A.15), (A.16), Lemma 3A(iii), and Assumption A6 that for all δ > 0, wp → 1 as n → ∞, uniformly over j and φ, the term in (A.13) has order op(n−1/4+δ).
Next, consider (A.14). Write Xσ′φ for the smallest index value for which σn(Xi′φ) = 1. If τn(Xj′φ) = 1, then Xj′φ ≥ −cn. By Assumption A7, F(t,φ) → 0 as t → −∞ for each φ in N. This and the triangle inequality imply that wp → 1 as n → ∞ the term in (A.14) is bounded by

nδ[Σi:Xσ′φ≤Xi′φ≤Xj′φ |∫(Xi−1′φ,Xi′φ] F(s,φ) ds − F(Xi′φ,φ)Δ(Xi′φ)| + ∫(−∞,Xσ−1′φ] F(s,φ) ds].   (A.17)

By Assumption A5 and Lemma 4A, wp → 1 as n → ∞, Xσ−1′φ can be made arbitrarily close to −cnα. Deduce that the second term in (A.17) is bounded by

nδ ∫(−∞,−cnα+1] F(s,φ) ds,

which converges to zero by Assumption A7. Consider the first term in (A.17). Define

H(ε) = ∫(t−ε,t] F(s,φ) ds.

By Assumption A9 and a Taylor expansion of H(ε) about ε = 0 followed by a Taylor expansion of F(t − ε,φ) about ε = 0, we see that H(ε) = εF(t,φ) + O(ε²) uniformly over t and φ. Deduce from this and Assumption A5 that the first term in (A.17) has order o(n1+δΔn²) for all δ > 0 where Δn = supφ Δn(φ) with Δn(φ) = maxi{Δ(Xi′φ)σn(Xi′φ)}, the norm of the partition of the interval [−cnα,cnα]. By Assumption A5 and Lemma 4A, wp → 1 as n → ∞, Δn << nδ log n/n for all δ > 0. Deduce that as n → ∞, the first term in (A.17) has order op(nδ−1) for all δ > 0. This proves the result. █