Published online by Cambridge University Press: 19 July 2005
The purpose of this paper is to describe the performance of generalized empirical likelihood (GEL) methods for time series instrumental variable models specified by nonlinear moment restrictions as in Stock and Wright (2000, Econometrica 68, 1055–1096) when identification may be weak. The paper makes two main contributions. First, we show that all GEL estimators are first-order equivalent under weak identification. The GEL estimator under weak identification is inconsistent and has a nonstandard asymptotic distribution. Second, the paper proposes new GEL test statistics, which have chi-square asymptotic null distributions independent of the strength or weakness of identification. Consequently, unlike those for Wald and likelihood ratio statistics, the size of tests formed from these statistics is not distorted by the strength or weakness of identification. Modified versions of the statistics are presented for tests of hypotheses on parameter subvectors when the parameters not under test are strongly identified. Monte Carlo results for the linear instrumental variable regression model suggest that tests based on these statistics have very good size properties even in the presence of conditional heteroskedasticity. The tests have competitive power properties, especially for thick-tailed or asymmetric error distributions.

This paper is a revision of Guggenberger's job market paper “Generalized Empirical Likelihood Tests under Partial, Weak, and Strong Identification.” We are thankful to the editor, P.C.B. Phillips, and three referees for very helpful suggestions on an earlier version of this paper. Guggenberger gratefully acknowledges the continuous help and support of his adviser, Donald Andrews, who played a prominent role in the formulation of this paper. He thanks Peter Phillips and Joseph Altonji for their extremely valuable comments.
We also thank Vadim Marner for help with the simulation section and John Chao, Guido Imbens, Michael Jansson, Frank Kleibergen, Marcelo Moreira, Jonathan Wright, and Motohiro Yogo for helpful comments. Aspects of this research have been presented at the 2003 Econometric Society European Meetings; York Econometrics Workshop 2004; Seminaire Malinvaud; CREST-INSEE; and seminars at Albany, Alicante, Austin (Texas), Brown, Chicago, Chicago GSB, Harvard/MIT, Irvine, ISEG/Universidade Tecnica de Lisboa, Konstanz, Laval, Madison (Wisconsin), Mannheim, Maryland, NYU, Penn, Penn State, Pittsburgh, Princeton, Rice, Riverside, Rochester, San Diego, Texas A&M, UCLA, USC, and Yale. We thank all the seminar participants. Guggenberger and Smith received financial support through a Carl Arvid Anderson Prize Fellowship and a 2002 Leverhulme Major Research Fellowship, respectively.
It is often the case that the instrumental variables available to empirical researchers are only weakly correlated with the endogenous variables. That is, identification is weak. Phillips (1989), Nelson and Startz (1990), and a large literature following these early contributions show that in such situations classical normal and chi-square asymptotic approximations to the finite-sample distributions of instrumental variable (IV) estimators and statistics can be very poor. For example, even though likelihood ratio and Wald test statistics are asymptotically chi-square, use of chi-square critical values can lead to extreme size distortions in finite samples (see Dufour, 1997). The purpose of this paper is to ascertain the performance of generalized empirical likelihood (GEL) methods (Newey and Smith, 2004; Smith, 1997, 2001) for time series IV models specified by nonlinear moment restrictions when identification may be weak (as in Stock and Wright, 2000). In particular, the paper makes two principal contributions. First, the asymptotic distribution of the GEL estimator is derived for a weakly identified setup. Second, the paper proposes new, theoretically and computationally attractive GEL test statistics. The asymptotic null distribution of these statistics is chi-square under partial (Phillips, 1989), weak (Stock and Wright, 2000), and strong identification. Thus, the size of tests formed from these statistics is invariant to the strength or weakness of identification. Importantly, we also provide asymptotic power results for the various statistics suggested in this paper.
GEL estimators and test statistics are alternatives to those based on generalized method of moments (GMM); see Hansen (1982), Newey (1985), and Newey and West (1987). GEL has received considerable attention recently because of its competitive bias properties. For example, Newey and Smith (2004) show that for many models the asymptotic bias of empirical likelihood (EL) does not grow with the number of moment restrictions, whereas that of GMM estimators grows without bound, a finding that may imply favorable properties for GEL-based test statistics.
Similar to the findings in Phillips (1984, 1989) and Stock and Wright (2000) for limited information maximum likelihood (LIML), two stage least squares (2SLS), and GMM, GEL estimators of weakly identified parameters have nonstandard asymptotic distributions and are in general inconsistent. Therefore, inference based on the classical normal approximation is inappropriate under weak identification. As in Newey and Smith (2004) for strong identification, the first-order asymptotics of the GEL estimator under weak identification do not depend on the choice of the GEL criterion function. This finding is rather surprising and contrasts with 2SLS and LIML estimators, whose first-order asymptotic theory differs under weak identification.
The statistics proposed here are asymptotically pivotal in contrast to classical Wald and likelihood ratio statistics no matter what the strength of identification. The first statistic, GELRρ, is based on the GEL criterion function and may be thought of as a nonparametric likelihood ratio statistic. Two further statistics generalize the GMM-based K-statistic of Kleibergen (2001) to the GEL context. Like the K-statistic, which is a quadratic form in the first-order derivative vector of the continuous updating GMM objective function, the second GEL statistic, Sρ, is a score-type statistic, being a quadratic form in the GEL criterion score vector. The third statistic, LMρ, is similar in structure to a GMM Lagrange multiplier statistic (Newey and West, 1987) and is asymptotically equivalent to the score-type statistic, being a quadratic form in the sample moment vector. Confidence regions constructed from the K- and GEL score-type statistics are never empty and contain the continuous updating estimator (CUE) and GEL estimator, respectively. All forms of GEL statistics admit limiting chi-square null distributions with degrees of freedom equal to the number of instrumental variables or moment conditions for the first statistic and the dimension of the parameter vector for the second and third statistics. In overidentified situations, therefore, tests based on the latter statistics should be expected to have better power properties than those based on the former. In many cases, an applied researcher is interested in inference on a parameter subvector rather than the whole parameter vector. Modified versions of these statistics are therefore suggested for the subvector case when the remaining parameters are strongly identified.
Monte Carlo simulations for the independent and identically distributed (i.i.d.) linear IV model with a wide range of error distributions compare our test statistics to several others, including homoskedastic and heteroskedastic versions of the K-statistic of Kleibergen (2001, 2002a) and the similar conditional likelihood ratio statistic LRM of Moreira (2003). We find that our tests have very good size properties even in the presence of conditional heteroskedasticity. In contrast, the homoskedastic version of the K-statistic of Kleibergen (2002a) and the LRM-statistic of Moreira (2003) are size-distorted under conditional heteroskedasticity. Our tests have competitive power properties, especially for thick-tailed or asymmetric error distributions. Given the nonparametric construction of the GEL estimator, robustness of GEL-based test statistics to different error distributions should be expected.
Like the work of Stock and Wright (2000), our paper allows for both i.i.d. and martingale difference sequences (m.d.s.) but does not apply to more general time series models; see Assumption Mθ(ii), which follows. Allowing for m.d.s. observations covers various cases of intertemporal Euler equations applications and regression models with m.d.s. errors. Therefore, the extension from the i.i.d. linear (Guggenberger, 2003, Ch. 1) to the particular time series setting with nonlinear moment restrictions considered here seems worthwhile, especially because there is essentially no cost (in terms of complications of the proofs) to making this extension. The proofs for consistency and for the asymptotic distribution of the GEL estimator build on Guggenberger (2003), which adapts those given in Newey and Smith (2004) for the i.i.d. strongly identified context.
Subsequent to the i.i.d. linear version of this paper, two related papers have appeared. First, Caner (2003) derives the asymptotic distribution of the exponential tilting (ET) estimator (see Imbens, Spady, and Johnson, 1998; Kitamura and Stutzer, 1997) under weak identification with nonlinear moment restrictions for independent observations. Caner (2003) also obtains an ET version of the K-statistic for nonlinear moment restrictions. Second, Otsu (2003) considers GEL-based tests under weak identification in a more general time series setting than considered here and examines the GEL criterion function statistic GELRρ and a modified version of the K-statistic based on the Kitamura and Stutzer (1997) and Smith (2001) kernel smoothed GEL estimator that is efficient under strong identification; see also Guggenberger and Smith (2003).
The remainder of the paper is organized as follows. In Section 2, the model and the assumptions are discussed, the GEL estimator is briefly reviewed, and the asymptotic distribution of the GEL estimator under weak identification is derived. Section 3 introduces the GEL-based test statistics. We derive their asymptotic limiting distribution and show that it is unaffected by the degree of identification. Section 4 generalizes these results to hypotheses involving subvectors of the unknown parameter vector. Section 5 describes the simulation results. All proofs are relegated to the Appendix.
The following notation is used in the paper. The symbols →d, →p, and ⇒ denote convergence in distribution, convergence in probability, and weak convergence of empirical processes, respectively. For the latter, see Andrews (1994) for a definition. For convergence “almost surely” we write “a.s.” and “with probability approaching 1” is replaced by “w.p.a.1.”
The space Ci(M) contains all functions that are i times continuously differentiable on M. For a symmetric matrix A, A > 0 means that A is positive definite and λmin(A) and λmax(A) denote the smallest and largest eigenvalue of A in absolute value, respectively. By A′ we denote the transpose of a matrix A. For a full column rank matrix A ∈ Rk×p and positive definite matrix K ∈ Rk×k, we denote by PA(K) the oblique projection matrix A(A′K−1A)−1A′K−1 on the column space of A in the metric K and define MA(K) := Ik − PA(K), where Ik is the k-dimensional identity matrix; we abbreviate this notation to PA and MA if K = Ik. The symbol ⊗ denotes the Kronecker product. Furthermore, vec(M) stands for the column vectorization of the k × p matrix M; i.e., if M = (m1,…,mp) then vec(M) = (m1′,…,mp′)′. Finally, ∥M∥ equals the square root of the largest eigenvalue of M′M.
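The oblique projection just defined can be checked numerically. The following sketch (Python with NumPy; the function name and the random test matrices are ours, not part of the paper) verifies that PA(K) is idempotent, reproduces the columns of A, and reduces to the usual orthogonal projection when K = Ik.

```python
import numpy as np

def oblique_projections(A, K):
    """P_A(K) = A (A' K^{-1} A)^{-1} A' K^{-1} and M_A(K) = I_k - P_A(K),
    as defined in the notation section."""
    k = A.shape[0]
    K_inv = np.linalg.inv(K)
    P = A @ np.linalg.inv(A.T @ K_inv @ A) @ A.T @ K_inv
    return P, np.eye(k) - P

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 2))   # full column rank with probability 1
B = rng.standard_normal((5, 5))
K = B @ B.T + 5 * np.eye(5)       # positive definite metric
P, M = oblique_projections(A, K)
```

Because PA(K)A = A, the residual projector MA(K) annihilates the column space of A, which is the property used repeatedly in the asymptotic variance formulas later in the paper.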
This section is concerned with the asymptotic distribution of the GEL estimator when some elements of the parameter vector of interest may be only weakly identified. Intuitively, then, the moment conditions that define the model may not be particularly informative about these parameters.
We consider models specified by a finite number of moment restrictions. Let {zi : i = 1,…,n} be Rl-valued data and, for each n ∈ N, gn : G × Θ → Rk a given function, where G ⊂ Rl and Θ ⊂ Rp denotes the parameter space. The model has a true parameter θ0 for which the moment condition

Egn(zi,θ0) = 0 for i = 1,…,n     (2.1)

is satisfied. For gn(zi,θ) we will usually write gi(θ).
Guggenberger (2003, Ch. 1) discusses in detail GEL estimation and testing for this model under weak identification. The structural form (SF) equation is given by

y = Yθ0 + u,     (2.2)

and the reduced form (RF) for Y by

Y = ZΠ + V,
where y,u ∈ Rn, Y,V ∈ Rn×p, Z ∈ Rn×k, and Π ∈ Rk×p. The matrix Y may contain both exogenous and endogenous variables, Y = (X,W) say, where X ∈ Rn×pX and W ∈ Rn×pW denote the respective observation matrices of exogenous and endogenous variables. The variables Z = (X,ZW) constitute a set of instruments for the endogenous variables W. The first pX columns of Π equal the first pX columns of Ik, and the first pX columns of V are 0. Denote by Yi, Vi, Zi,…(i = 1,…,n) the ith row of the matrix Y, V, Z,… written as a column vector. Assuming the instruments and the structural error are uncorrelated, Eui Zi = 0, it follows that Egi(θ0) = 0, where for each i = 1,…,n, gi(θ) := (yi − Yi′θ)Zi. Note that in this example gi(θ) depends on n if the RF coefficient matrix Π is modeled to depend on n (see Staiger and Stock, 1997), where Πn = n−1/2C for a fixed matrix C.
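The drifting parameterization Πn = n−1/2C can be illustrated by simulation. In the following sketch (Python/NumPy; all numerical values are our own purely illustrative choices, not the paper's design) the sample moment ĝ(θ) = n−1∑i=1n gi(θ) is close to zero both at the true θ0 and at a distant parameter value, which is precisely the sense in which weak instruments make the moment condition uninformative.

```python
import numpy as np

rng = np.random.default_rng(1)
n, k, p = 5000, 3, 1
theta0 = np.array([0.5])
C = np.ones((k, p))                  # fixed matrix C in Pi_n = n^{-1/2} C
Pi_n = C / np.sqrt(n)                # weak-instrument drifting parameterization
Z = rng.standard_normal((n, k))
u = rng.standard_normal(n)
V = 0.8 * u[:, None] + 0.6 * rng.standard_normal((n, p))  # endogeneity via corr(u, V)
Y = Z @ Pi_n + V
y = Y @ theta0 + u

def g(theta):
    """g_i(theta) = (y_i - Y_i' theta) Z_i, stacked as an n x k array."""
    return (y - Y @ theta)[:, None] * Z

gbar0 = g(theta0).mean(axis=0)               # sample moment at the true value
gbar_far = g(np.array([1.5])).mean(axis=0)   # also near zero under weak ID
```

Because E gi(θ) = E(ZiZi′)Πn(θ0 − θ) shrinks at rate n−1/2, the population moment barely separates θ0 from distant alternatives.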
As in Stock and Wright (2000) the moment conditions may result from conditional moment restrictions. Assume E[h(Yi,θ0)|Fi] = 0, where h : H × Θ → Rk1, H ⊂ Rk2, and Fi is the information set at time i. Let Zi be a k3-dimensional vector of instruments contained in Fi. If gi(θ) := h(Yi,θ) ⊗ Zi, then Egi(θ0) = 0 follows by taking iterated expectations. In (2.1), k = k1k3 and l = k2 + k3.
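A minimal sketch of how the Kronecker construction turns k1 conditional moments and k3 instruments into k = k1k3 unconditional moments (Python/NumPy; the dimensions are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(2)
k1, k3 = 2, 4                  # dim of h(Y_i, theta) and of the instruments Z_i
h_i = rng.standard_normal(k1)  # stand-in for h(Y_i, theta)
Z_i = rng.standard_normal(k3)
g_i = np.kron(h_i, Z_i)        # g_i(theta) := h(Y_i, theta) "kron" Z_i
```

The vector g_i stacks h_i[0]·Z_i on top of h_i[1]·Z_i, so every conditional moment is interacted with every instrument.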
This section is concerned with the asymptotic distribution of the GEL estimator for θ when some components of θ0 = (α0′,β0′)′, α0 say, α0 ∈ A, A ⊂ RpA, are only weakly identified. Intuitively, this means that the moment condition (2.1) is not very informative about α0. For parameter vectors θ = (α′,β0′)′, Egn(zi,θ) may be very close to zero, not only for α close to α0 but also when α is far from α0. In that case, the restriction Egn(zi,θ0) = 0 is not very helpful for making inference on α0. Assumption ID, which follows, provides a theoretical asymptotic framework for this phenomenon, which is taken from Assumption C in Stock and Wright (2000, p. 1061). We refer the reader to Stock and Wright (2000, pp. 1060–1061), which provides substantial detailed motivation for this assumption and an explanation of why it models α0 as weakly and β0 as strongly identified.
To describe the moment and distributional assumptions, we require some additional notation:
where, if defined, Gi(θ) := (∂gi /∂θ)(θ) ∈ Rk×p. For notational convenience, a subscript n has been omitted in certain expressions. Define the k × k matrices
Note that Δ(θ) is Ω(θ) in Stock and Wright (2000). We choose our notation for Ω(θ) for consistency with Newey and Smith (2004).
Let θ = (α′,β′)′, where α ∈ A, A ⊂ RpA, β ∈ B, B ⊂ RpB, and pA + pB = p. Also let
denote an open neighborhood of β0.
Assumption Θ. The true parameter θ0 = (α0′,β0′)′ is in the interior of the compact space Θ = A × B.
Assumption ID.
Next we detail the necessary moment assumptions. (Weak convergence here is defined with respect to the sup-norm on function spaces and the Euclidean norm on Rk.)
Assumption M.
Assumption M(i) adapts Assumption 1(d) of Newey and Smith (2004), E supβ∈B∥gi(β)∥ξ < ∞ for some ξ > 2, from the i.i.d. setting with strong identification (pA = 0 and thus θ = β and Θ = B) to the weakly identified setup considered here. A sufficient condition for M(i) in the time series context and under ID is given by
Indeed, a simple application of the Markov inequality shows that (2.4) implies max1≤i≤n supθ∈Θ∥gi(θ)∥ = Op(n1/ξ) = op(n1/2). See the Appendix for a proof. Assumption M(ii), which adapts Assumption 1(e) of Newey and Smith to the weakly identified setup, ensures that
is nonsingular for
. Assumption M(iii) is essentially the “high-level” Assumption B of Stock and Wright (2000, p. 1059) that states that Ψn obeys a functional central limit theorem. In Assumption B′, Stock and Wright provide primitive sufficient conditions for their Assumption B that can also be found in Andrews (1994). Note that the definition of weak convergence [Andrews (1994, p. 2250)] and M(iii) imply that supθ∈Θ∥Ψn(θ)∥ →d supθ∈Θ∥Ψ(θ)∥ and, thus, also that
. In the proof of Theorem 2 we require
bounded in probability.
It is interesting to note that for i.i.d. data an application of the Borel–Cantelli lemma shows that M(i) is implied by Assumption 1(d) of Newey and Smith (2004) even if ξ = 2; see Owen (1990, Lemma 3) for a proof. Hence, using Lemmas 7–9 given subsequently, their Assumption 1(d) can be weakened to ξ ≥ 2 for the consistency and asymptotic normality of the GEL estimator under strong identification with i.i.d. data (see their Theorems 3.1 and 3.2). Therefore, for i.i.d. data, identical assumptions guarantee consistency and asymptotic normality for both GEL and two-step efficient GMM estimators (Hansen, 1982).
See Guggenberger (2003). For the linear IV model (2.2) Assumption ID can be expressed as the following assumption.
Assumption ID′. Π = Πn = (ΠAn,ΠB) ∈ Rk×(pA+pB), where pA + pB = p. For a fixed matrix CA ∈ Rk×pA, ΠAn = n−1/2CA and ΠB has full column rank.
Under Assumption ID′, i.i.d. data, and instrument exogeneity it follows that
which implies that in the notation of ID(i), m1n(θ) = m1(θ) = E(Zi Zi′) CA(α0 − α) and m2(β) = E(Zi Zi′)ΠB(β0 − β). Also, note that Assumption ID′ includes the partially identified model of Phillips (1989). In particular, choosing pA and setting CA = 0, one obtains a model in which Π may have any desired (less than full) rank.
We now give simple sufficient conditions that imply Assumption M. Let U := (u,V).
Assumption M′.
(i) {(Ui,Zi) : i ≥ 1} are i.i.d.;
(ii) EZiUi′ = 0;
(iii) E∥Zi∥4 < ∞, QZZ := E(Zi Zi′) > 0, Eui2Zi Zi′, EuiVij Zi Zi′, and EVijVik Zi Zi′ exist and are finite for j,k = 1,…,p, where Vij denotes the jth component of the vector Vi;
(iv) Ω(θ) is nonsingular for all θ ∈ A × {β0}.
Assumptions M′(i) and (ii) state that errors and exogenous variables are jointly i.i.d. and the standard instrument exogeneity assumption is satisfied, whereas M′(iii) and (iv) are technical conditions.
The following lemma shows that Assumption M′ in the linear model implies Assumption M.
LEMMA 1. Suppose that Assumptions ID′, M′, and Θ hold in the linear IV model (2.2). Then Assumptions ID and M hold.
Therefore the various technical conditions of Assumption M reduce to very simple moment conditions in the linear model. Note that M′ implies E [supθ∈Θ∥gi(θ)∥ξ] < ∞ for ξ = 2. However, we do not need the assumption E [supθ∈Θ∥gi(θ)∥ξ] < ∞ for a ξ > 2 to prove n1/2-consistency of the GEL estimator of the strongly identified parameters.
Assumption HOM (conditional homoskedasticity). E(UiUi′|Zi) = ΣU > 0.
HOM, which is used in Staiger and Stock (1997), is sufficient for Assumption M′(iv). That is, Assumptions M′(i)–(iii) and HOM imply M′(iv) under ID′. This follows from Ω(θ) = QZZ vα′ΣuVA vα for θ ∈ A × {β0}, where vα′ := (1,(α0 − α)′) and ΣuVA is the (1 + pA) × (1 + pA) upper left submatrix of ΣU. However, M′ is more general than HOM because it allows for conditional heteroskedasticity. For example, ui = ∥Zi∥ζi, where ζi ∼ N(0,1) is independent of Zi ∼ N(0,Ik), is compatible with M′.
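The compatibility of this example with M′ can be checked by simulation. The following sketch (Python/NumPy; sample size and tolerances are our choices) verifies that ui = ∥Zi∥ζi preserves instrument exogeneity, EuiZi = 0, while Eui²ZiZi′ differs from the homoskedastic value Eui²·QZZ; for Zi ∼ N(0,Ik) the two matrices are (k+2)Ik and kIk, respectively.

```python
import numpy as np

rng = np.random.default_rng(3)
n, k = 200_000, 3
Z = rng.standard_normal((n, k))
zeta = rng.standard_normal(n)
u = np.linalg.norm(Z, axis=1) * zeta   # u_i = ||Z_i|| zeta_i

# instrument exogeneity E[u_i Z_i] = 0 still holds
m = (u[:, None] * Z).mean(axis=0)

# E[u_i^2 Z_i Z_i'] = (k+2) I_k for standard normal Z_i, versus
# the homoskedastic value E[u_i^2] Q_ZZ = k I_k
Omega = (u[:, None] * Z).T @ (u[:, None] * Z) / n
homosk = u.var() * (Z.T @ Z / n)
```

The gap between the two matrices is exactly the conditional heteroskedasticity that HOM rules out but M′ permits.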
This section provides a formal definition of the GEL estimator of θ0.
Let ρ be a real-valued function Q → R, where Q is an open interval of the real line that contains 0 and
If defined, let ρj(v) := (∂jρ/∂vj)(v) and ρj := ρj(0) for nonnegative integers j.
The GEL estimator θ̂ is the solution to a saddle point problem

θ̂ := arg minθ∈Θ supλ∈Λ̂n(θ) P̂(θ,λ),     (2.6)

where

P̂(θ,λ) := 2∑i=1n [ρ(λ′gi(θ)) − ρ0]/n.     (2.7)
For compact Θ, continuous ρ, and gi (i = 1,…,n), the existence of an argmin θ̂ may be shown. In fact, the GEL objective function, viewed as a function in θ, can be shown to be lower semicontinuous (ls). A function f(x) is ls at x0 if, for each real number c such that c < f(x0), there exists an open neighborhood U of x0 such that c < f(x) holds for all x ∈ U. The function f is said to be ls if it is ls at each x0 of its domain. It is easily shown that ls functions on compact sets take on their minimum. Uniqueness of θ̂, however, is not implied. As a simple example, consider the i.i.d. linear IV model in (2.2) when p = 2 and let the two components Yij (j = 1,2) of Yi be independent Bernoulli random variables. Then, for each n, the probability that Yi1 = Yi2 for every i = 1,…,n is positive. If Yi1 = Yi2 for every i = 1,…,n, the objective function depends on θ only through θ1 + θ2, and if θ̂ = (θ̂1,θ̂2)′ is an argmin of the objective function, then each θ ∈ Θ with θ1 + θ2 = θ̂1 + θ̂2 is also. To uniquely define θ̂, we could, for example, do the following. From the set of all vectors θ ∈ Θ that minimize the objective function, let θ̂ be the vector that has the smallest first component. (If that does not pin down θ̂ uniquely, choose from the remaining vectors according to the second component, and so on.)
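The lexicographic tie-breaking rule just described is straightforward to implement; a minimal sketch (Python/NumPy; the function name and the candidate set are our own illustration):

```python
import numpy as np

def lexicographic_pick(minimizers):
    """From a finite set of argmin vectors, pick the one with the smallest
    first component; break remaining ties by the second component, and so on."""
    arr = np.atleast_2d(np.asarray(minimizers, dtype=float))
    # np.lexsort treats its LAST key as the primary key, so reverse the columns
    order = np.lexsort(arr.T[::-1])
    return arr[order[0]]

theta_hat = lexicographic_pick([(1.0, 3.0), (0.5, 9.0), (0.5, 2.0)])
```

Among the three candidate minimizers, the rule selects (0.5, 2.0): the two vectors with smallest first component are compared on their second component.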
Here Λ̂n(θ) := {λ ∈ Rk : λ′gi(θ) ∈ Q, i = 1,…,n} denotes the set over which the sup in (2.6) is taken.
Assumption ρ.
(i) ρ is concave on Q;
(ii) ρ is C2 in a neighborhood of 0 and ρ1 = ρ2 = −1.
The definition of the GEL estimator θ̂ is adopted from Newey and Smith (2004). We slightly modify their definition of the criterion function by recentering and rescaling, which simplifies the presentation. We usually write θ̂ for the ρ-dependent estimator θ̂ρ.
The most popular GEL estimators are the CUE, the EL, and the ET estimator, which correspond to ρ(v) = −(1 + v)2/2, ρ(v) = ln(1 − v), and ρ(v) = −exp v, respectively. The EL estimator was introduced by Imbens (1997), Owen (1988, 1990), and Qin and Lawless (1994) and the ET estimator by Imbens et al. (1998) and Kitamura and Stutzer (1997). For a recent survey of GEL methods see Imbens (2002).
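The normalization ρ1 = ρ2 = −1 of Assumption ρ(ii) can be verified numerically for these three choices; a sketch (Python/NumPy with finite-difference derivatives; not from the paper):

```python
import numpy as np

# The three leading GEL criteria, normalized so that rho_1 = rho_2 = -1;
# the criterion function recenters by subtracting rho(0).
rhos = {
    "CUE": lambda v: -(1.0 + v) ** 2 / 2,  # continuous updating
    "EL":  lambda v: np.log(1.0 - v),      # empirical likelihood
    "ET":  lambda v: -np.exp(v),           # exponential tilting
}

def d1(f, x=0.0, h=1e-5):
    """Central finite-difference first derivative."""
    return (f(x + h) - f(x - h)) / (2 * h)

def d2(f, x=0.0, h=1e-4):
    """Central finite-difference second derivative."""
    return (f(x + h) - 2 * f(x) + f(x - h)) / h ** 2

derivs = {name: (d1(f), d2(f)) for name, f in rhos.items()}
```

All three pairs of derivatives at v = 0 equal (−1, −1), which is the sense in which the three criteria agree to second order around the origin.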
A choice of the inverse of the uncentered sample second moment matrix of the moment indicators, n−1∑i=1n gi(θ)gi(θ)′, as the weighting matrix WT(θ) in Stock and Wright (2000, equation (2.2), p. 1058) results in the CUE, which is the GEL estimator based on ρ(v) = −(1 + v)2/2; see Newey and Smith (2004, Theorem 2.1). Hansen, Heaton, and Yaron (1996) and Pakes and Pollard (1989) define the (GMM) CUE using the centered weighting matrix, the sample covariance matrix of the moment indicators. However, as shown in Newey and Smith (2004, footnote 2), both versions of the CUE are numerically identical.
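The numerical identity of the two CUE versions can be illustrated directly: by the Sherman–Morrison formula the two objectives satisfy q_c = q_u/(1 − q_u), a strictly increasing transformation, so they share the same minimizer. A sketch (Python/NumPy; the data-generating design and parameter grid are our illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(4)
n, k = 400, 3
Z = rng.standard_normal((n, k))
u = rng.standard_normal(n)
x = Z[:, 0] * 0.5 + rng.standard_normal(n)  # single regressor
y = x * 1.0 + u                             # true coefficient 1.0

def cue_objectives(theta):
    g = (y - x * theta)[:, None] * Z        # g_i(theta) = (y_i - x_i theta) Z_i
    gbar = g.mean(axis=0)
    Omega = g.T @ g / n                     # uncentered weighting matrix
    V = Omega - np.outer(gbar, gbar)        # centered weighting matrix
    q_u = gbar @ np.linalg.solve(Omega, gbar)
    q_c = gbar @ np.linalg.solve(V, gbar)
    return q_u, q_c

grid = np.linspace(0.0, 2.0, 201)
qu, qc = np.array([cue_objectives(t) for t in grid]).T
```

Both objective profiles attain their minimum at the same grid point, confirming the footnote 2 claim in this simple design.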
This section obtains the asymptotic distribution of the GEL estimator θ̂ under Assumption ID. Theorem 2 shows that the weakly identified parameters of θ0 are estimated inconsistently and their GEL estimator has a nonstandard limiting distribution, whereas the GEL estimator of the strongly identified parameters is n1/2-consistent but no longer asymptotically normal. Analogous results are available for LIML or more generally for GMM; see Phillips (1984) and Stock and Wright (2000, Theorem 1). The rather surprising finding is that the first-order asymptotic theory under ID is identical for all GEL estimators, as long as ρ satisfies Assumption ρ.
The proof of Theorem 2 uses a second-order Taylor expansion of the GEL criterion function in λ about 0 in which the only impact of ρ asymptotically is through ρ1 and ρ2, which are both −1.
If defined, let
For θ = (α′,β′)′ ∈ Θ and b ∈ RpB let
The next theorem establishes the asymptotic behavior of the GEL estimator θ̂ under Assumption ID.
THEOREM 2. Suppose Assumptions Θ, ID, M, and ρ are satisfied. Then
Remark 1. Theorem 2(ii) is analogous to Theorem 1 in Stock and Wright (2000, p. 1062) for GMM. Note that from (A.5) in the Appendix
. Moreover, using the proof of Theorem 2 it can be shown that
Therefore, like
, although n1/2-consistent,
admits a nonstandard asymptotic distribution (see also Caner, 2003). If pA = 0, where all parameters are strongly identified,
, where M2 := M2(β0), Ω := Ω(β0), and Δ := Δ(β0). The covariance matrix reduces to Ω−1MM2(Ω) in the i.i.d. case.
The proof of Theorem 2 also provides a formula (equation (A.7) in the Appendix) for b*(α) := arg minb∈RpB Pαb for α ∈ A. In particular, if pA = 0, (A.7) shows that
where
The matrix V(β0) simplifies to (M2′Ω−1M2)−1 in the i.i.d. case, and thus the preceding formula coincides with Theorem 3.2 of Newey and Smith (2004). However, the asymptotic variance matrix of the GEL estimator in the time series context is in general different from that in Newey and Smith, and the estimator as defined previously would thus be inefficient. Block methods as in Kitamura (1997) or kernel-smoothing methods as in Smith (2001) can be used for efficient GEL estimation in a time series context with strong identification. In the case pA > 0, the fact that the asymptotic distribution of the strongly identified parameter estimates is in general nonnormal is a consequence of the inconsistent estimation of α0.
Remark 2. Given the nonnormal asymptotic distribution of the GMM and GEL parameter estimates under weak identification (established in Theorem 1 in Stock and Wright, 2000, and our Theorem 2, respectively) the asymptotic distribution of test statistics based on these estimators, such as t- or Wald statistics, will also be nonstandard and non-pivotal. Furthermore, these limiting distributions depend on quantities that cannot be consistently estimated (see Staiger and Stock, 1997, p. 564), which militates against their use for the construction of test statistics or confidence regions for θ0. The next section introduces alternative approaches that overcome these difficulties.
The specialization of Theorem 2 to the i.i.d. linear IV model of Example 1 was derived in Guggenberger (2003).
This section proposes several statistics to test the simple hypothesis H0 : θ = θ0 versus H1 : θ ≠ θ0. We establish that they are asymptotically pivotal quantities and have limiting chi-square null distributions under Assumption ID. Therefore these statistics lead to tests whose size properties are unaffected by the strength or weakness of identification. For the time series setup considered here there are at least two other statistics that share this property, namely, the Anderson and Rubin (1949) AR-statistic and the Kleibergen (2001, 2002a) K-statistic. The first statistic, GELRρ(θ), that we describe may be interpreted as a likelihood ratio statistic. It has an asymptotic χ2(k) null distribution and is first-order equivalent to the AR-statistic. The second set of statistics in this section, Sρ(θ) and LMρ(θ), are based on the first-order conditions (FOC) of the GEL criterion function with respect to θ. Each has a limiting χ2(p) null distribution and is first-order equivalent to the K-statistic. For a recent survey on robust inference methods with weak identification, see Stock, Wright, and Yogo (2002).
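The size-robustness property can be illustrated with the AR-statistic, to which GELRρ is first-order equivalent. The following Monte Carlo sketch (Python/NumPy; the design, sample size, and repetition count are our own illustrative choices, not the paper's simulation experiment; 7.815 is the 5% critical value of χ²(3)) computes an AR-type statistic under weak instruments and checks that its rejection rate stays near the nominal 5% level.

```python
import numpy as np

rng = np.random.default_rng(5)
n, k, reps = 250, 3, 400
CHI2_3_95 = 7.815                 # 5% critical value of chi-square with k = 3 df
C = np.full((k, 1), 2.0)          # weak instruments: Pi_n = C / sqrt(n)
theta0 = 1.0
rej = 0
for _ in range(reps):
    Z = rng.standard_normal((n, k))
    u = rng.standard_normal(n)
    v = 0.9 * u + 0.4 * rng.standard_normal(n)     # strong endogeneity
    x = (Z @ (C / np.sqrt(n))).ravel() + v
    y = x * theta0 + u
    g = (y - x * theta0)[:, None] * Z              # moments evaluated under H0
    gbar = g.mean(axis=0)
    Omega = g.T @ g / n - np.outer(gbar, gbar)
    ar = n * gbar @ np.linalg.solve(Omega, gbar)   # AR-type statistic
    rej += ar > CHI2_3_95
rejection_rate = rej / reps
```

Even though the instruments are weak and the endogeneity severe, the statistic evaluated at the hypothesized θ0 retains roughly correct size, in contrast to Wald-type tests in the same design.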
To motivate the first statistic, consider an i.i.d. setting. In this case, GELREL(θ) may be thought of in terms of the empirical likelihood ratio statistic R(θ), where
Newey and Smith (2004) show that under certain conditions including {zi : i ≥ 1} i.i.d.,
. Thus ln R(θ) can be interpreted as the criterion function of the EL estimator.
The criterion function R(θ) can be interpreted as a nonparametric likelihood ratio. Indeed, for fixed θ ∈ Θ and given gi(θ), (i = 1,…,n), the numerator of R(θ) is the maximal probability of observing the given sample gi(θ), (i = 1,…,n), over all discrete probability distributions (w1,…,wn) on the sample such that the sample analogue ∑i=1n wigi(θ) = 0 of the moment condition (2.1) is satisfied. The denominator (1/n)n equals the unrestricted maximal probability. It can then be shown that −2 ln R(θ0) = 2∑i=1n ln(1 − λ(θ0)′gi(θ0)), where λ(θ0) is the vector of Lagrange multipliers associated with the k moment restrictions ∑i=1n wigi(θ0) = 0 in the constrained maximization problem (3.1). Therefore, the renormalized criterion function of the EL estimator has an interpretation as −2 times the logarithm of the likelihood ratio statistic R(θ0).
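The inner maximization behind R(θ) can be sketched for a single moment condition (Python/NumPy; a minimal Newton solver of our own, using the sign convention wi = 1/(n(1 + λgi)) — the paper's multiplier may be defined with the opposite sign):

```python
import numpy as np

rng = np.random.default_rng(6)
g = rng.standard_normal(200) + 0.1   # scalar g_i(theta) at some fixed theta

def el_weights(g, iters=50):
    """Solve the inner EL problem: max sum_i log(n w_i) subject to
    w_i > 0, sum_i w_i = 1, sum_i w_i g_i = 0 (scalar-moment case).
    The multiplier lam solves sum_i g_i / (1 + lam g_i) = 0 and the
    optimal weights are w_i = 1 / (n (1 + lam g_i))."""
    n = len(g)
    lam = 0.0
    for _ in range(iters):           # Newton steps on the scalar FOC
        num = np.sum(g / (1 + lam * g))
        den = -np.sum(g ** 2 / (1 + lam * g) ** 2)
        lam -= num / den
    return 1.0 / (n * (1 + lam * g)), lam

w, lam = el_weights(g)
log_R = np.sum(np.log(len(g) * w))   # log of the likelihood ratio R(theta)
```

The resulting weights are a proper probability distribution satisfying the sample moment restriction, and log R ≤ 0 because the constrained maximum cannot exceed the unrestricted one attained at uniform weights.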
Generalizing from the i.i.d. to the time series setup and from EL to arbitrary ρ, the first statistic we consider is the renormalized GEL criterion function (2.7): GELRρ(θ) := nP̂(θ,λ(θ)), where λ(θ) := arg supλ∈Λ̂n(θ) P̂(θ,λ).
Second, following Kleibergen's (2001) suggestion of constructing a statistic from the FOC with respect to θ but in the GMM framework, we construct a test statistic based on the GEL FOC for θ̂. If the minimum of the objective function is obtained in the interior of Θ, the score vector with respect to θ must equal 0 at θ̂, i.e.,
For θ ∈ Θ, define the k × p matrix
Thus, (3.3) may be written as λ(θ̂)′Dρ(θ̂) = 0′. The test statistic is therefore given as a quadratic form in the score vector λ(θ)′Dρ(θ) evaluated at the hypothesized parameter vector θ
where ρ is any function satisfying Assumption ρ and a consistent estimator of Δ(θ) is employed. We also consider the following variant of Sρ(θ):
that substitutes
for λ(θ) in Sρ(θ); see (A.8) in the Appendix, where it is shown that
. The statistic LMρ(θ) is similar to a GMM Lagrange multiplier statistic given in Newey and West (1987). To make the origin of the preceding test statistics clearer, we adopted the notation LMρ(θ) and Sρ(θ), respectively, in place of Kρ(θ) and KρL(θ) previously given to the statistics in Guggenberger (2003). To use these statistics for hypothesis tests or for the construction of confidence regions one needs a consistent estimator of Δ(θ). Under assumptions given later, the sample average n−1∑i=1n gi(θ)gi(θ)′ may be used.
Alternatively, instead of using uniform weights in the definition of this estimator of Δ(θ), one could use empirical probabilities that are associated with each GEL estimator; see Section 2 of Newey and Smith (2004). However, preliminary Monte Carlo simulations (not reported here) showed no clear improvement in the performance of the test statistics.
in the definition of DCUE(θ), where
denotes any generalized inverse of
.
As noted previously the GEL and GMM CUE are numerically identical. However, although the structures of the two statistics coincide, in general, the statistic LMCUE(θ) and the Kleibergen (2001) K-statistic based on the GMM CUE are not identical. The reason is that, in general, the first-order derivatives of the GMM and GEL CUE objective functions are not equal. The K-statistic in Kleibergen (2001) is based on the FOC of the GMM CUE criterion. It replaces DCUE(θ) in LMCUE(θ) by a different estimator of the derivative matrix; the particular assumptions made on Δ(θ) determine the choice of estimators. If the sample average n−1∑i=1n gi(θ)gi(θ)′ is used as the estimator of Δ(θ), then the statistic LMCUE(θ) and the K-statistic coincide.
Some intuition for these test statistics is provided under strong identification. Under strong identification, Newey and Smith (2004) show consistency of θ̂. Therefore, if the FOC (3.3) hold at θ̂, then, at least asymptotically, they also hold at the true value θ0. The statistic Sρ(θ) can then be interpreted as a quadratic form whose criterion is expected to be small at the true value θ0. If, however, all parameters are weakly identified this argument is no longer valid. From Theorem 2, θ̂ is no longer consistent for θ0. Therefore, although the FOC hold at θ̂, this does not imply automatically that they also approximately hold at the true value θ0. However, it can be shown that under weak identification the FOC λ(θ)′Dρ(θ) = 0′ not only hold at θ̂ w.p.a.1 but are satisfied to order Op(n−1) uniformly over θ ∈ Θ. Thus, under weak identification the FOC do not pin down the true value θ0. Consequently, the power properties of hypothesis tests for θ0 based on the statistics Sρ(θ) or LMρ(θ) should be expected to be better under strong rather than weak identification. Size properties, however, are not affected by the strength or weakness of identification. This is corroborated by the Monte Carlo simulations reported subsequently and theoretically by Theorem 4.
We now consider the asymptotic distribution of GELRρ(θ) evaluated at a vector θ = (α′,β0′)′, thus allowing for a fixed alternative in the weakly identified components. We need the following local version of Assumption M.
Assumption Mθ. Let θ = (α′,β0′)′ ∈ A × {β0}. Suppose
Note that for θ = (α′,β0′)′ Mθ(iii) and ID imply that
. Thus, under Mθ(iii) and ID the assumption
in Mθ(ii) is equivalent to the assumption
for θ = (α′,β0′)′, which is Assumption D′ in Stock and Wright (2000). The assumption rules out many interesting time series cases. However, it is more general than an i.i.d. assumption. The assumption allows for m.d.s. and thus covers various intertemporal Euler equations applications and regression models with m.d.s. errors. As in Stock and Wright, a possible application is the intertemporally separable consumption capital asset pricing model (CCAPM). Without assuming
, a limiting chi-square distribution would no longer obtain in the following theorems. The problem arises because the GEL estimator as defined in (2.6) is not efficient in the time series setup considered here.
THEOREM 3. Suppose ID, Mθ(i)–(iii), and ρ hold for θ = (α′,β0′)′. Then
where the noncentrality parameter δ = m1(θ)′Δ(θ)−1m1(θ). In particular,
To describe the asymptotic distribution of the statistics LMρ(θ0) and Sρ(θ0), we need the following additional assumptions. Write Gi(θ) = (GiA(θ), GiB(θ)), where the matrices GiA(θ) and GiB(θ) are of column dimension pA and pB, respectively.
Let
be an open neighborhood of θ.
Assumption Mθ (continued).
In Mθ(vii) write
Assumption Mθ(iv) allows the interchange of the order of integration and differentiation in Assumption ID, i.e.,
. It also guarantees that M1n(θ) → M1(θ) := (∂m1 /∂θ)(θ). Assumptions ID and Mθ thus imply that
where by ID the limit matrix (0,M2(β0)) is of deficient rank pB. Assumption Mθ(v) is comparable to Mθ(ii), where
was assumed and extends Mθ(ii) to cross-product terms in vec GiA(θ) and gi(θ). Assumption Mθ(vi) contains additional weak technical conditions that guarantee that certain expressions in the proof of Theorem 4 are asymptotically negligible.
The key assumption is Mθ(vii), which is a stronger version of Mθ(iii) and states that a central limit theorem (CLT) holds simultaneously for the centered gi(θ) and part of the derivative matrix, namely, vec GiA(θ). Write
, where
. As shown in the proof of Theorem 4, for θ = (α′,β0′)′, Assumptions ID, ρ, Mθ(i)–(vi), and
imply that D →p − (0,M2(β0)). Therefore, the probability limit of
is not invertible without renormalization. Define D* := DΛ, where Λ is the p × p diagonal matrix diag(n1/2,…,n1/2,1,…,1) whose first pA diagonal elements equal n1/2 and whose remaining elements equal unity. Hence,
In the proof of Theorem 4 we show that under Assumptions ID, ρ, and Mθ(i)–(vi)
Assumption Mθ(vii), in particular the full rank assumption on V(θ), ensures that
has full rank w.p.a.1. Assumption Mθ(vii) is closely related to Assumption 1 of Kleibergen (2001). Unlike Kleibergen (2001), however, we assume ID, which, as just shown, requires us to be specific about which part of the derivative matrix Gi(θ), together with gi(θ), satisfies a CLT with full rank covariance matrix, namely, GiA(θ), which corresponds to the weakly identified parameters. Assumption ID has the advantage that the asymptotic distribution of the test statistics can be obtained under fixed alternatives of the form θ = (α′,β0′)′, so that asymptotic power results can be derived.
THEOREM 4. Suppose ID, Mθ(i)–(vii), and ρ hold for θ = (α′,β0′)′. Then,
where the random p-vector W(α) is defined in (A.11) in the Appendix, ζ ∼ N(0,Ip), and W and ζ are independent. We have W(α0) ≡ 0, and therefore
Remark 1. The proof of Theorem 4 crucially hinges on the fact that n1/2λ(θ0) and vec Dρ(θ0) (suitably normalized) from the FOC (3.3) are asymptotically jointly normally distributed and, moreover, are asymptotically independent. A similar result is critical also for the Kleibergen (2001) K-statistic, which generalizes the Brown and Newey (1998) analysis of efficient GMM moment estimation to the weakly identified setup. Therefore, by using an appropriate weighting matrix in the quadratic forms (3.5) and (3.6) that define Sρ(θ0) and LMρ(θ0), respectively, we immediately obtain the limiting χ2(p) null distribution of Theorem 4.
Remark 2. Theorems 3 and 4 provide a straightforward method to construct confidence regions or hypothesis tests on θ0. For example, a critical region for a test of the hypothesis H0 : θ = θ0 versus H1 : θ ≠ θ0 at significance level r is given by {GELRρ(θ0) ≥ χr2(k)}, where χr2(k) denotes the (1 − r)-critical value from the χ2(k) distribution. A (1 − r)-confidence region for θ0 is obtained by inverting the just-described test, i.e., {θ ∈ Θ : GELRρ(θ) ≤ χr2(k)}. Confidence regions and hypothesis tests based on Sρ(θ) and LMρ(θ) may be constructed in a similar fashion.
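The inversion described in Remark 2 is mechanical once a statistic is available. The following Python sketch illustrates it with a hypothetical placeholder statistic standing in for GELRρ, Sρ, or LMρ (none of which are implemented here); the function name, grid, and toy statistic are assumptions for illustration only.

```python
import numpy as np

def confidence_region(stat, grid, crit):
    """Invert a test: keep every theta in `grid` whose statistic does not
    exceed the chi-square critical value `crit` (3.841 is the 5%-level
    chi-square(1) critical value)."""
    return [t for t in grid if stat(t) <= crit]

# Illustration with a toy quadratic "statistic" in place of GELR, S, or LM:
grid = np.linspace(-1.0, 1.0, 201)
region = confidence_region(lambda t: 50.0 * (t - 0.3) ** 2, grid, 3.841)
```

By construction the region collects all non-rejected grid points, so its coverage inherits the size of the underlying test, which is the point of Remark 2.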
Remark 3. Theorems 3 and 4 demonstrate that GELRρ(θ0), Sρ(θ0), and LMρ(θ0) are asymptotically pivotal statistics under weak and strong identification. Therefore, the size of tests based on these statistics should not vary much with the strength or weakness of identification in finite samples. However, these results also show that under weak identification hypothesis tests based on these statistics are inconsistent. For example, the noncentrality parameter δ does not diverge to infinity for increasing sample size, and therefore the rejection rate under the alternative does not converge to 1. This is intuitively reasonable because if identification is weak one cannot learn much about α0 from the data.
Remark 4. A drawback of GELRρ(θ0) is that its limiting null distribution has degrees of freedom equal to k, the number of moment conditions, rather than p, the dimension of the parameter vector. In general, this has a negative impact on the power properties of hypothesis tests based on GELRρ(θ0) in overidentified situations. In contrast, the limiting null distributions of Sρ(θ0) and LMρ(θ0) have degrees of freedom equal to p; the power of tests based on these statistics should therefore not be negatively affected by a high degree of overidentification. The AR-statistic of Anderson and Rubin (1949) also has a χ2(k) limiting null distribution. Kleibergen (2002b) shows that it equals the sum of two independent statistics, namely, the K-statistic (Kleibergen, 2002a) and a J-statistic (Hansen, 1982), which test location and misspecification, respectively. Mutatis mutandis, a similar decomposition may be given for the GELRρ(θ0) statistic in terms of Sρ(θ0) or LMρ(θ0).
Remark 5. Stock and Wright (2000, Theorem 2) derive the asymptotic distribution under weak identification of the analogue of GELRρ(θ0) for the (GMM) CUE, which is also a χ2(k) null distribution. In the i.i.d. context, Qin and Lawless (1994, Theorem 2) propose the statistic
to test the hypothesis H0 : θ = θ0, which is shown to be asymptotically distributed as χ2(p) under strong identification. However, because of the dependence on
, this statistic is no longer asymptotically pivotal and thus leads to size-distorted tests under weak identification.
Guggenberger (2003) derives the results given in Theorems 3 and 4 under Assumptions Θ, ID′, M′, and ρ allowing for alternatives α ∈ A and Pitman drift in the data generating process (DGP) for the strongly identified parameters to assess the asymptotic power properties of the tests; i.e., ID′ holds and for some fixed b ∈ RpB, y = Y(θ0 + n−1/2(0′,b′)′) + u. To simplify our presentation here we ignore the possibility of Pitman drift. Results for the i.i.d. linear IV model follow directly from the preceding theorems because, as is easily shown, Assumptions ID′, M′, ρ, and V(θ) > 0 imply Mθ for any consistent estimator
. In particular, V(θ) has a simple representation. For θ = (α′,β0′)′, Ω(θ) = Δ(θ) and ΔAA(θ) = E(ViAViA′ ⊗ Zi Zi′), where ViA consists of the first pA components of Vi in (2.3).
We now assume that interest is focused on the subvector α0 ∈ RpA of θ0 = (α0′,β0′)′. However, we no longer maintain Assumption ID. In particular, α0 may not necessarily be weakly identified.
To adapt the test statistics of Section 3 to the subvector case, the basic idea is to replace β by a GEL estimator β̂(α). To make this idea more rigorous, define the GEL estimator β̂(α) for β0:
We usually write β̂ where there is no ambiguity. A requirement of the analysis that follows is that β̂(α0) is consistent for β0. Therefore, we assume that the nuisance parameters β0 that are not involved in the hypothesis under test are strongly identified; see Theorem 2. On the other hand, the components of α0 can be weakly or strongly identified; in Assumption IDα, which follows, we assume the former holds for α01 and the latter for α02, where α0 = (α01′,α02′)′. The main advantage of the subvector test statistics introduced in this section is that they have asymptotically correct size independent of whether α0 is weakly or strongly identified. This property is not shared by classical tests based on Wald, likelihood ratio, or Lagrange multiplier statistics, which in general have correct size only if θ0 is strongly identified. In contrast, the subvector tests in Guggenberger and Wolf (2004), based on a subsampling approach, have exact asymptotic sizes without any additional identification assumption.
Let θ = (α1′,α2′,β′)′, where αj ∈ Aj, Aj ⊂ RpAj, (j = 1,2), pA1 + pA2 = pA and β ∈ B, B ⊂ RpB. Also let
be an open neighborhood of (α02,β0).
Assumption A. The true parameter θ0 = (α01′,α02′,β0′)′ is in the interior of the compact space Θ, where Θ = A1 × A2 × B.
Assumption IDα.
Assumption IDα implies that α01 and (α02,β0) are weakly and strongly identified, respectively. Assumptions A and IDα adapt Assumptions Θ and ID in Section 2 for the subvector case.
Let
We now introduce the subvector statistics. Recall the definition of GELRρ(θ) in (3.2). The GELRρ subvector test statistic is given by
We need the following technical assumptions for our derivation of its asymptotic distribution. To obtain theoretical power properties, we again allow a fixed alternative for the weakly identified components, α01 here.
For a1 ∈ A1 let a := (a1′,α02′)′ be a fixed vector whose strongly identified component α02 is the same as the corresponding component of the true parameter vector θ0. Let
be an open neighborhood of β0.
Assumption Mα.
Mutatis mutandis, Mα has the same interpretation as Mθ. For example Mα(ii) guarantees that
is bounded and
is bounded away from zero w.p.a.1, whereas Mα(iv) and IDα imply that for
we have
. By IDα this last matrix has full column rank for β = β0. If we assume that the GiB(θaβ), (i = 1,…,n), viewed as functions of β, are continuous at β0 a.s., then we can simplify Mα(vi) to
. A similar comment holds for the assumptions in the continuation of Mα that follows.
THEOREM 5. Assume 1 ≤ pA < p. Suppose Assumptions A, IDα, Mα(i)–(vi), and ρ hold for some a1 ∈ A1 and a = (a1′,α02′)′. Then,
where the noncentrality parameter δ is given by
where M2β(·) := (∂m2 /∂β)(·) ∈ Rk×pB. In particular,
Theorem 5 confirms that the subvector statistic GELRρsub(α0), like the full vector statistic GELRρ(θ0), is asymptotically pivotal. As before, this result can be used to construct hypothesis tests and confidence regions for α0.
We now generalize the statistics Sρ and LMρ to the subvector case. The asymptotic variance matrices of
differ from those of
. Therefore different weighting matrices are required in the quadratic forms defining these subvector statistics. In the Appendix (see proofs of Theorems 5 and 6) it is shown that for a = (a1′,α02′)′,
exists w.p.a.1 and that
is asymptotically normal with covariance matrix M(a), where for α = (α1′,α2′)′ ∈ RpA
The first pA elements of the FOC (3.3), evaluated at
, are
For α ∈ RpA, let
which coincides with the definition of Dρ(θ) in (3.4) when α is the full vector θ. Similarly to Sρ(θ) in (3.5) the subvector test statistic Sρsub(α) is constructed as a quadratic form in the vector
from (4.3) with weighting matrix given by M(α) in (4.2). Let
be an estimator of M(α) that is given by replacing the expressions Δ(θαβ0) and M2β(α2,β0) in M(α) by consistent estimators,
say. By Assumptions Mα(ii) and Mα(iv)–(v) we may choose
when α = a = (a1′,α02′)′. Hence,
The statistic LMρsub(α) is constructed like Sρsub(α) but replaces
by
. Thus,
Let
be an open neighborhood of β0, and
.
Assumption Mα (continued).
In Mα(x) write
Assumption Mα(x) is the key assumption and plays a role similar to Mθ(vii). Assumption Mα(vii) extends Mα(iv) by explicitly assuming that integration and differentiation can be exchanged in the expectation of
, whereas Mα(iv) gave primitive conditions that imply that exchange holds for
. Assumptions Mα(v), Mα(vii), and IDα imply that
, which is an important result used in the proof of the next theorem; in a linear model this result is trivially true because
. Assumptions Mα(vii)–(x) are analogous to Mθ(iv)–(vii) with A1 and A2 now playing the roles of A and B, respectively.
THEOREM 6. Assume 1 ≤ pA < p. Suppose Assumptions A, IDα, Mα(i)–(x), and ρ hold for a = (a1′,α02′)′ for a1 ∈ A1. Then,
where the random pA-vector Wα(a) is defined in (A.22) of the Appendix, ζα ∼ N(0,IpA), and ζα and Wα are independent. We have Wα(α0) ≡ 0, and therefore
Remark 1. The subvector statistics are asymptotically pivotal whether elements of α0 are weakly or strongly identified. This result can be used to construct hypothesis tests or confidence regions that have asymptotically correct size or coverage probabilities, independent of the strength or weakness of identification of α0. Compared to the GMM subvector statistic of Kleibergen (2001), the statistics Sρsub(a) and LMρsub(a) are appealing because of their compact formulation.
Remark 2. Even though it is unclear how the asymptotic distribution of these test statistics might be derived without assuming strong identification of β0, it is obvious that neither Sρsub(α0) nor LMρsub(α0) would converge to a χ2(pA) random variable. In general the quantities
in Sρsub(α0) and
in LMρsub(α0) are no longer asymptotically normal because of their dependence on the GEL estimator
, which as a direct consequence of Theorem 2 has a nonstandard limiting distribution if β0 is not strongly identified. Moreover, the subvector version of the K-statistic of Kleibergen (2001) also experiences the same problem in these circumstances as the (GMM) CUE of β0 has a nonnormal limiting distribution under weak identification (see Stock and Wright, 2000). Somewhat surprisingly, however, Monte Carlo simulations by the authors (not reported here) for the subvector statistic LMρsub(α0) indicate that its size properties are not much affected by the strength or weakness of identification of β0. Startz, Zivot, and Nelson (2004) report similar findings from Monte Carlo simulations for the subvector test statistic of Kleibergen (2001).
Guggenberger (2003) derives the corresponding results. Note that Assumptions Θ, ID′, M′, and ρ, together with the assumption that Vα(θaβ0) has full column rank, imply Assumption Mα. In the linear model the components of Vα(θaβ0) can be easily calculated. For example, ΔA1 A1 = E(ViA1ViA1′ ⊗ Zi Zi′), where ViA1 is the subvector of Vi that contains its first pA1 components. Let Y = (X,W) denote the partition of the included variables of the structural equation into exogenous and endogenous variables. Partition θ0 = (θX0′,θW0′)′ and θ = (θX′,θW′)′ conformably. Valid inference is possible on any subvector of θW0 if the appropriate assumptions given previously are fulfilled. Unfortunately, if the dimension of the parameter vector not subject to test is large, then the argmin-sup problem in (4.1) is computationally very involved. Premultiplication of equation (2.2) by MX should ameliorate this problem by eliminating the exogenous variables; i.e., MX y = MXWθW0 + MXu. If Assumption Mα holds for θW0 = (αW0,βW0) and gi(θW) := MX,i′(y − WθW)Zi, where MX,i denotes the ith row of MX written as a column vector, valid inference may be undertaken on αW0.
To assess the efficacy of the hypothesis tests introduced in Theorems 3 and 4, we conduct a set of Monte Carlo experiments. The DGP is given by model (2.2) considered in Example 1 and is similar to that in Kleibergen (2002a, p. 1791), namely,
There is a single right-hand-side endogenous variable and no included exogenous variables, p = 1, and Z ∼ N(0,Ik ⊗ In), where k is the number of instruments and n the sample size. In the just-identified case, i.e., k = 1, Π = Π1, whereas in the overidentified case, k > 1, Π = (Π1,0′)′; i.e., irrelevant instruments are added.
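The design can be sketched in a few lines. The following Python simulation of the DGP is illustrative only; it assumes the usual form y = Yθ0 + u, Y = ZΠ + V of model (2.2) (the displayed equations did not survive extraction), and the function and argument names are our own.

```python
import numpy as np

def simulate_dgp(n=100, k=5, pi1=0.1, rho_uv=0.99, theta0=0.0, seed=None):
    """One sample from the assumed linear IV DGP
        y = Y * theta0 + u,    Y = Z @ Pi + V,
    with a single endogenous regressor, Z ~ N(0, Ik x In),
    corr(u_i, V_i) = rho_uv, and Pi = (pi1, 0, ..., 0)' so that
    k > 1 adds irrelevant instruments."""
    rng = np.random.default_rng(seed)
    Z = rng.standard_normal((n, k))
    cov = np.array([[1.0, rho_uv], [rho_uv, 1.0]])
    uv = rng.multivariate_normal(np.zeros(2), cov, size=n)
    u, V = uv[:, 0], uv[:, 1]
    Pi = np.zeros(k)
    Pi[0] = pi1                 # only the first instrument is relevant
    Y = Z @ Pi + V
    y = Y * theta0 + u
    return y, Y, Z
```

Setting pi1 = 0.1 or 1.0 reproduces the "weak" and "strong" instrument cases discussed subsequently.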
Interest focuses on testing the scalar null hypothesis H0 : θ0 = 0 versus the alternative hypothesis H1 : θ0 ≠ 0.
We examine several distributions for (u,V) to investigate the robustness of the test statistics to different features of the error distribution. All designs are constructed from Design (I) by modifying the distribution of the structural error u.
Design (II) examines the robustness of the performance of the test statistics to thick-tailed distributions for the structural equation error. Design (III) examines robustness with respect to asymmetric structural error distributions. In Design (IV) the structural error ui is bimodal with peaks at −2 and +2.
In addition, the impact of conditional heteroskedasticity on the performance of the test statistics is examined. Designs (IHET)–(IVHET) modify Designs (I)–(IV), respectively, by replacing ui with ∥Zi∥ui.
We calculate three versions of the statistic GELRρ(θ) in (3.2), for ρ(v) = −(1 + v)2/2 (CUE), ρ(v) = ln(1 − v) (EL), and ρ(v) = −exp v (ET). We also consider the corresponding versions for each of Sρ(θ) in (3.5) and LMρ(θ) in (3.6) with
replaced by
. As noted previously, for CUE, Sρ(θ) and LMρ(θ) are then numerically identical. Theorems 3 and 4 present the asymptotic null distributions of these statistics.
To calculate GELRρ(θ), Sρ(θ), and LMρ(θ) for EL and ET, the globally concave maximization problem
must be solved numerically. To do so we implement a variant of the Newton–Raphson algorithm. We initialize the algorithm by setting λ equal to the zero vector. At each iteration the algorithm tries several shrinking step sizes in the search direction and accepts the first one that increases the function value compared to the previous value for λ. This procedure enforces an “uphill climbing” feature of the algorithm.
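For concreteness, the inner maximization for EL, ρ(v) = ln(1 − v), can be sketched as below. This is our illustrative Python implementation of the algorithm just described (start at λ = 0, take Newton directions, and halve the step size until the objective improves), not the authors' code; the function name and tolerances are assumptions.

```python
import numpy as np

def el_lambda(g, tol=1e-10, max_iter=50):
    """Inner GEL maximization for EL, rho(v) = log(1 - v):
    maximize sum_i rho(lambda' g_i) over lambda, where g is the
    (n, k) array of moment contributions g_i(theta)."""
    _, k = g.shape
    lam = np.zeros(k)                           # initialize at zero

    def objective(l):
        v = g @ l
        if np.any(v >= 1.0):                    # outside the domain of log(1 - v)
            return -np.inf
        return np.log1p(-v).sum()

    f_old = objective(lam)
    for _ in range(max_iter):
        w = 1.0 - g @ lam
        scaled = g / w[:, None]
        grad = -scaled.sum(axis=0)              # sum_i rho'(v_i) g_i
        hess = -scaled.T @ scaled               # sum_i rho''(v_i) g_i g_i'
        direction = np.linalg.solve(hess, -grad)
        step = 1.0
        while step > 1e-12:                     # accept the first improving step
            f_new = objective(lam + step * direction)
            if f_new > f_old:
                lam = lam + step * direction
                f_old = f_new
                break
            step /= 2.0
        if np.linalg.norm(grad) < tol or step <= 1e-12:
            break
    return lam
```

For a one-dimensional moment the first-order condition Σ gi/(1 − λgi) = 0 can be solved by hand, which gives a simple check of the routine.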
Additional statistics considered are the Anderson–Rubin test statistic (AR) (see Anderson and Rubin, 1949); two versions of the K-statistic proposed by Kleibergen (2001, 2002a), one (K) assuming homoskedastic errors and the other (KHET) robust to conditional heteroskedasticity; the conditional likelihood ratio test LRM of Moreira (2003); and two versions of the two-stage least squares (2SLS) Wald statistic (see, e.g., Wooldridge, 2002, pp. 98, 100), one (2SLSHOM) assuming homoskedastic errors and the other (2SLSHET) robust to conditional heteroskedasticity.
The statistics are defined as follows:
where suu(θ) := (y − Yθ)′MZ(y − Yθ)/(n − k),
where
. The statistic K(θ) (Kleibergen, 2002a), is not robust to conditional heteroskedasticity. However, a version of the K-statistic in Kleibergen (2001, equation (22)) that uses a heteroskedasticity consistent estimator for the covariance matrix of gi(θ) overcomes this drawback. For model (5.1), the statistic is given by
where
, and
. The statistic KHET(θ) is identical in structure to LMCUE(θ) except the centered components
are used in place of gi(θ) and Gi, respectively. Note that Gi := Gi(θ) does not depend on θ in a linear model. For the LRM statistic, see Moreira (2003, Sect. 3). Finally, the Wald statistics are given by
where
, and
is a conditional heteroskedasticity robust estimator for the variance of
.
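Because the displayed definitions above did not survive extraction, the following Python sketch gives the AR statistic in its standard textbook form, built on the suu(θ) defined above; it is assumed, not verified against the missing display, that this agrees with the paper's formula up to notation.

```python
import numpy as np

def ar_statistic(y, Y, Z, theta):
    """Anderson-Rubin statistic in its textbook form:
        AR(theta) = (y - Y*theta)' P_Z (y - Y*theta) / (k * s_uu(theta)),
    with s_uu(theta) = (y - Y*theta)' M_Z (y - Y*theta) / (n - k),
    for a single right-hand-side endogenous regressor."""
    n, k = Z.shape
    e = y - Y * theta
    pz_e = Z @ np.linalg.solve(Z.T @ Z, Z.T @ e)   # P_Z e, projection on Z
    s_uu = (e - pz_e) @ (e - pz_e) / (n - k)       # M_Z residual variance
    return (e @ pz_e) / (k * s_uu)
```

When the residual e is exactly orthogonal to the instruments, the numerator, and hence the statistic, is zero, which provides a quick sanity check.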
Empirical sizes are calculated using 5% asymptotic critical values for all of the preceding statistics for DGPs (5.1) corresponding to all 54 possible combinations of sample size n = 50, 100, 250, number of instruments k = 1, 5, 10, SF and RF error correlation ρuV = 0.0, 0.5, 0.99, and RF coefficient Π1 = 0.1, 1.0 for Designs (I)–(IV) and (IHET)–(IVHET).
Kleibergen (2002a) generates one sample for the instrument matrix Z from a N(0,Ik ⊗ In) distribution and then keeps Z fixed across R = 10,000 samples of the DGP (5.1) using Design (I) with n = 100 and ρuV = 0.99. We simulate a new matrix Z with each sample of the DGP (5.1). As a consequence, our results do not coincide with those reported by Kleibergen (2002a).
To investigate the sensitivity of the results in Kleibergen (2002a) to the choice of Z, we iterated Kleibergen's (2002a) procedure 100 times; i.e., each time we simulated a matrix Z of instruments that we then kept fixed across R = 1,000 samples of the DGP (5.1). We found strong dependence of the numerical results of the Monte Carlo experiment on Z. For example, in the case Π1 = 1, k = 1, the power of the K-statistic to reject the hypothesis θ0 = 0 when θ0 = 0.4 varied from about 60% to 95% in the 100 experiments. For the specific Z that Kleibergen (2002a) generates, he reports power of about 93% (see his Figure 1, p. 1793).
We use R = 3,000 replications of each DGP. We also use 3,000 realizations each of χ2(1) and χ2(k − 1) random variables to simulate the critical values of Moreira's LRM statistic. For the results reported in Tables 1 and 2, which follow, we use R = 10,000 replications. We refer to Π1 = 0.1 and 1.0 as the "weak" and "strong" instrument cases, respectively. The value of ρuV allows the degree of endogeneity of Y to be varied: for ρuV = 0, Y is exogenous, whereas for ρuV = 0.99, Y is strongly endogenous. We include the just-identified case, k = 1, and two overidentified cases, k = 5 and 10.
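The size computation itself is mechanical: the empirical size is the fraction of null replications in which a statistic exceeds its asymptotic critical value. A minimal, purely illustrative Python sketch:

```python
import numpy as np

def empirical_size(stat_draws, crit):
    """Empirical size: share of null replications whose statistic
    exceeds the asymptotic critical value."""
    return float((np.asarray(stat_draws) > crit).mean())

# Sanity check: squared standard normals are chi-square(1), so testing
# against the 5% critical value 3.841 should reject about 5% of the time.
rng = np.random.default_rng(0)
size = empirical_size(rng.standard_normal(200_000) ** 2, 3.841)
```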
We now describe the results for Designs (I) and (IHET) given in Tables 1 and 2, respectively, which exclude those for GELREL, SET, LMET, AR, and the case n = 100. The qualitative features of the size results for GELREL, SET, and LMET are identical to their ET/EL counterparts. For k = 1, AR coincides with K, and, for k > 1, we find that in most cases K has better size properties than AR. We report K and 2SLSHOM for the homoskedastic and KHET and 2SLSHET for the heteroskedastic design. We now discuss the results for the homoskedastic case of Design (I).
First, we consider the separate effects of Π1, n, ρuV, and k on the size results.
The most important finding is that the empirical sizes of all statistics except 2SLS show little or no dependence on Π1 (additional Monte Carlo results show that this holds true even for the completely unidentified case, Π1 = 0). The sizes of 2SLS, by contrast, depend crucially on the strength or weakness of identification: although for Π1 = 1.0, 2SLS has reliable size properties in many cases, with weak instruments its sizes range over the entire interval from 0% to 100%.
In general, increasing n leads to more accurate size across all statistics. This holds especially true for those statistics that perform poorly for smaller n. For example, the 2SLS statistics, GELRET, and SEL severely overreject in overidentified and strongly endogenous cases when n = 50. Even though they still overreject for n = 250, the rejection rates are much closer to the 5% significance level.
It is easily shown that the rejection rates under the null hypothesis for AR and GELRρ are independent of the value of ρuV; the slight dependence of the size results in Table 1 on ρuV results from the use of different samples. For all the remaining statistics except 2SLS, there is no clear pattern in how ρuV affects size properties, and the dependence of the results on ρuV is small. For 2SLS, however, increasing ρuV leads to severe overrejection when combined with overidentification, especially in the weak instrument case.
Increasing the number of instruments k usually leads to overrejection for 2SLS, GELRET, and SEL. For 2SLS this is especially true under weak identification and/or strong endogeneity. All the other statistics show little dependence on k.
We now turn to a comparison of performance across statistics. The 2SLS statistics should not be used with weak instruments or in strongly endogenous overidentified situations. In all other cases, 2SLS has competitive size properties. The statistics GELRET and SEL severely overreject in overidentified problems when the sample size is small. Overall, then, the statistics LMEL, K, and LRM lead to the best size results. The statistics LMCUE and GELRCUE rank only second because they tend to underreject, especially in overidentified situations. Across the 36 experiments in Table 1, the sizes of LMEL, LMCUE, GELRCUE, K, and LRM lie in the intervals [4.0,6.2], [1.6,5.3], [1.3,5.3], [4.8,8.6], and [4.3,10.3], respectively. The statistics K and LRM usually slightly overreject. In 22 of the 36 cases, the size of LMEL comes closest to the 5% significance level across all the statistics. The corresponding numbers for LMCUE, GELRCUE, K, and LRM are 8, 8, 9, and 7. Based on Design (I), LMEL seems to have a slight advantage over the remaining statistics.
We now discuss the size results for Design (IHET) summarized in Table 2. As most findings are similar to those discussed for Design (I), we only describe the new features.
The statistics 2SLSHOM, K, and LRM perform uniformly worse than in Design (I). Tests based on these statistics severely overreject, especially in the just-identified case. Their performance does not improve when n increases. We therefore report results for the heteroskedasticity robust versions 2SLSHET and KHET. Their size properties and those of the statistics based on GEL methods do not appear to be negatively influenced by the presence of conditional heteroskedasticity. This is to be expected from our earlier theoretical discussion of the GEL statistics, which does not assume conditional homoskedasticity. Of course, 2SLSHET still suffers in weakly identified models, and GELRET and SEL perform poorly in overidentified situations for small n. Rejection rates of the test statistics LMEL, LMCUE, GELRCUE, KHET, and LRM across the 36 experiments of Table 2 are in the intervals [3.6,6.4], [1.6,5.1], [1.0,5.1], [4.3,9.2], and [7.8,28.8], respectively. In 21 of the 36 cases, the size of LMEL comes closest to the 5% significance level across all the statistics. The test statistic KHET wins in 18 cases.
In summary, the only statistics with accurate size properties across all experiments of Designs (I) and (IHET) are LMEL, LMCUE, GELRCUE, and KHET. Based on the preceding results it seems that LMEL enjoys a slight advantage over the other statistics. From the 72 cases in Tables 1 and 2 the empirical size of LMEL is closest to the nominal 5% in 43 cases across all statistics.
The qualitative features of the size results for Designs (II)–(IV) and (IIHET)–(IVHET) are generally very similar to those of their normal counterparts, Designs (I) and (IHET). For this reason, we do not include additional tables for these designs. One striking difference, however, occurs for 2SLS under weak identification with χ2(1) (Design (III)) and bimodal errors (Design (IV)): rejection rates across these 54 combinations for 2SLSHOM lie in the intervals [0.1,7.1] and [0.0,5.4], respectively. Whereas with normal errors and weak identification 2SLS severely overrejects, with these error distributions it severely underrejects.
To summarize this size study, LMEL, LMCUE, GELRCUE, and KHET have reliable size properties across all designs that appear independent of both the strength or weakness of identification and possible conditional heteroskedasticity. The test statistic 2SLS performs very poorly in the presence of weak instruments. The LRM statistic performs well in homoskedastic cases but poorly otherwise.
Empirical power curves are calculated for the preceding statistics and DGPs (5.1) corresponding to all 16 possible combinations of sample size n = 100, 250, number of instruments k = 5, 10, SF and RF error correlation ρuV = 0.5, 0.99, and RF coefficient Π1 = 0.1, 1.0 for each of the error distributions of Designs (I)–(III). Except for LRM, we report size-corrected power curves at the 5% significance level, using critical values calculated in the preceding size comparison. We make an exception for LRM because size correction is not straightforward given its conditional construction and because, as shown previously, LRM has empirical size very close to the nominal 5% level for Designs (I)–(III).
We use R = 1,000 replications from the DGP (5.1) with various values of the true value θ0. The null hypothesis under test is again H0 : θ0 = 0. For weak identification (Π1 = 0.1), θ0 takes values in the interval [−4.0,4.0] whereas, with strong identification (Π1 = 1.0), θ0 ∈ [−0.4,0.4]. We use 1,000 realizations each of χ2(1) and χ2(k − 1) random variables to simulate the critical values of LRM. For those results reported in the figures that follow, we use 10,000 replications from (5.1).
Detailed results are presented only for the statistics LMEL, K, LRM, and 2SLSHET. The statistics LMCUE, LMEL, and LMET display very similar performance across almost all scenarios; we therefore report results only for LMEL. We do not report power results for the statistics SEL and SET because, as seen earlier, their size properties appear to be quite poor for the sample sizes considered here. When k = 1, AR and K are numerically identical, and in overidentified cases K generally performs better than AR; we therefore do not report results for AR (see Kleibergen, 2002a, for a comparison of K and AR). Similarly, GELRCUE is numerically identical to LMρ for k = 1 but leads to a less powerful test for k > 1, and the EL and ET versions of GELRρ have rather unreliable size properties for the sample sizes considered here. Therefore we do not report detailed results for GELRρ.
We first focus on the separate effects of Π1, n, ρuV, and k on power.
With strong identification all statistics have a U-shaped power curve. With the exception of 2SLSHET, the lowest point of the power curve is usually achieved at θ0 = 0. In Designs (I) and (II), 2SLSHET is usually biased, taking on its lowest value at a negative θ0 value in the interval [−0.2,0.0]. When θ0 is weakly identified, the power curves of LMEL, K, and LRM are generally very flat across all θ0 values, often only slightly exceeding the significance level of the test. This is especially true for LMEL and K but less so for LRM, which is generally more powerful than the other two statistics in this situation. There is one exception when the power of the three tests is high. In Design (I) with ρuV = 0.99, although being flat at about 5% for positive θ0 values, the power curves reach a sharp peak of almost 100% around θ0 = −1. The reason for this anomaly is most easily explained in the case k = 1, where
. We have
, which in Design (I) with Π1 = 0.1 equals 1 + 2θ0 ρuV + (1.01)θ02. If ρuV = 0.99 this expression is minimized at around θ0 = −0.98 where it equals approximately 0.03. Therefore, this peak is caused by
taking on large values for θ0 in the neighborhood of −1.
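The quoted minimizer can be checked directly: the quadratic 1 + 2θ0ρuV + 1.01θ02 is minimized at θ0 = −ρuV/1.01. A quick Python verification of the figures in the text:

```python
# Minimize f(t) = 1 + 2*rho*t + 1.01*t**2 analytically: t* = -rho/1.01.
rho = 0.99
t_star = -rho / 1.01
min_val = 1 + 2 * rho * t_star + 1.01 * t_star ** 2
# t_star is about -0.98 and min_val about 0.03, matching the text.
```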
For negative θ0 values with |θ0| > 1 power quickly falls, reaching between 20% and 50% across the different designs at θ0 = −4.
In contrast to the power curves of LMEL, K, and LRM, the power curve of 2SLSHET retains its U-shaped form for Π1 = 0.1. In many cases, the power curve reaches values close to 100% when |θ0| is close to 4.
As is to be expected, the tests are more powerful when n is increased from 100 to 250. This holds uniformly across all statistics and designs, with a more pronounced power increase in the strongly identified cases.
There does not seem to be a systematic effect due to ρuV as it varies with the specific design. For reasons explained previously, the shape of the power curves can change dramatically in Design (I) when ρuV is increased from 0.5 to 0.99 if Π1 = 0.1.
In most cases, there is little change in the power functions when k is increased from 5 to 10. In general, if the power function changes, power is slightly lower for larger k.
We now compare the power functions across statistics. Figures 1a–c display the power curves of the four statistics for Designs (I)–(III) in the case Π1 = 1.0, n = 250, ρuV = 0.5, and k = 5 (the figures for Π1 = 0.1 and for the other parameter combinations are available upon request). The qualitative comparison for the other parameter combinations is very similar, and we therefore focus on these representative cases.
When identification is weak, the test based on LRM is usually more powerful than those based on LMEL and K. The power gain of using LRM is quite substantial for negative θ0 values but less so for positive θ0. However, the Wald test 2SLSHET is by far the most powerful test in all three designs. Except for some small negative θ0 values its power curve uniformly dominates the power curves of the other tests. Recall though that 2SLSHET has unreliable size properties under weak identification.
When identification is strong, LMEL uniformly dominates LRM and K in Designs (II) and (III) (see Figures 1b and 1c), whereas LRM and K uniformly dominate LMEL in Design (I) (see Figure 1a). This result is to be expected: the LMEL test is based on nonparametric GEL methods, whereas LRM and K are motivated within the normal model framework. Although the power gain of LMEL is small in Design (III), it is substantial in Design (II). Therefore, LMEL should be used when errors have thick tails.
With strong identification, the Wald test is the most powerful test for positive θ0 values. For negative θ0 values, its performance varies from being most powerful in Design (III) to least powerful in Design (I). These results confirm that the Wald test is a reasonable choice when identification is strong.
Overall, therefore, the power study does not lead to an unambiguous ranking of the tests considered here; which test is most appropriate depends on the particular error distribution and the degree of identification. With strong identification and thick-tailed or asymmetric errors, LMEL seems to be the best choice, whereas with normal errors LRM and K appear preferable. When identification is weak, LRM generally dominates K and LMEL in terms of power, although, as noted previously, the size properties of LRM deteriorate substantially in the presence of heteroskedasticity.
Proof of Equation (2.4). Let fi := supθ∈Θ∥gi(θ)∥ and K := supi≥1 Efiξ < ∞. Let ε > 0 and choose a positive C ∈ R such that K/Cξ < ε. Then
Pr(max1≤i≤n fi > Cn1/ξ) ≤ ∑i=1n Pr(fiξ > Cξn) ≤ (Cξn)−1 ∑i=1n Efiξ ≤ K/Cξ < ε,
where the first inequality follows from Pr(A ∪ B) ≤ Pr(A) + Pr(B) and the second uses the Markov inequality. It follows that (max1≤i≤n fi)n−1/ξ = Op(1) and thus max1≤i≤n fi = op(n1/2) because ξ > 2. Thus (2.4) implies M(i). █
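The order claim max1≤i≤n fi = op(n1/2) can be illustrated numerically. A minimal sketch (our own illustration, not part of the proof), assuming i.i.d. standard normal draws, which have all moments finite:

```python
import numpy as np

# Illustrate that max_{i<=n} |W_i| grows slower than n^(1/2)
# when the W_i are i.i.d. with enough finite moments.
rng = np.random.default_rng(0)
ratios = []
for n in (10**2, 10**4, 10**6):
    w = rng.standard_normal(n)
    ratios.append(np.abs(w).max() / np.sqrt(n))

print(ratios)  # the ratio shrinks toward zero as n grows
```

For normal draws the maximum grows only like (2 log n)1/2, so the ratio vanishes, consistent with the op(n1/2) statement.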
Proof of Lemma 1. ID holds trivially. By (2.2) and (2.3), gi(θ) = (yi − Yi′θ)Zi = Zi(Zi′Π + Vi′)(θ0 − θ) + Ziui. We next establish that max1≤i≤n supθ∈Θ∥gi(θ)∥ = op(n1/2). An application of the Borel–Cantelli lemma shows that for real-valued i.i.d. random variables Wi such that EWi2 < ∞, max1≤i≤n|Wi| = op(n1/2); see Owen (1990, Lemma 3) for a proof. By the definition of gi(θ) and the triangle inequality,
max1≤i≤n supθ∈Θ∥gi(θ)∥ ≤ supθ∈Θ∥θ0 − θ∥(∥Π∥ max1≤i≤n∥Zi∥2 + max1≤i≤n∥Zi∥∥Vi∥) + max1≤i≤n∥Zi∥|ui|.
By Assumption M′(iii), we can apply the just-mentioned result to each of the three maxima in the preceding display, which proves the result.
Next M(ii) is shown. By the i.i.d. assumption, Ω(θ) = limn→∞ Egi(θ)gi(θ)′, and continuity and boundedness in M(ii) follow immediately from M′(iii) and compactness of Θ. The same is true for the Op(1) statement in M(ii). Finally, uniform convergence follows from the weak law of large numbers and compactness of Θ.
Next M(iii) is proved. It suffices to deal with the empirical process Ψn(·,θ). Finite-dimensional joint convergence follows from the CLT and M′(iii), and stochastic equicontinuity follows from the fact that (θ0 − θ) enters Ψn(·,θ) linearly: the increment of Ψn over θ-values at most δ apart is bounded by δOp(1) by the CLT. Furthermore, Θ is compact by assumption. The proposition in Andrews (1994, p. 2251) can thus be applied, which yields the desired result. █
The following proofs are straightforward generalizations of the proofs in Guggenberger (2003) for the i.i.d. linear model to the more general context considered here. For the proofs of our theorems we require three lemmas, which are modified versions of Lemmas A1–A3 in Newey and Smith (2004). These modifications are necessary because, unlike Newey and Smith, we need to work with weakly and strongly identified parameters and do not make an i.i.d. assumption.
For each n ∈ N, let Θn ⊂ Θ. Let cn := n−1/2 max1≤i≤n supθ∈Θn∥gi(θ)∥. Let Λn := {λ ∈ Rk : ∥λ∥ ≤ n−1/2cn−1/2} if cn > 0 and Λn := Rk otherwise. Write “u.w.p.a.1” for “uniformly over θ ∈ Θn w.p.a.1.”
LEMMA 7. Assume max1≤i≤n supθ∈Θn∥gi(θ)∥ = op(n1/2). Then supθ∈Θn, λ∈Λn, 1≤i≤n |λ′gi(θ)| = op(1) and, u.w.p.a.1, Λn ⊂ Λn(θ), where Λn(θ) is defined in (2.5).
Proof. The case cn = 0 is trivial, and thus wlog cn ≠ 0 can be assumed. By assumption cn = op(1), and the first part of the statement follows from
supθ∈Θn, λ∈Λn, 1≤i≤n |λ′gi(θ)| ≤ (supλ∈Λn∥λ∥)(max1≤i≤n supθ∈Θn∥gi(θ)∥) ≤ n−1/2cn−1/2 · n1/2cn = cn1/2 = op(1),
which also immediately implies the second part. █
LEMMA 8. Suppose max1≤i≤n supθ∈Θn∥gi(θ)∥ = op(n1/2), n−1∑i=1n gi(θ) = Op(n−1/2) uniformly over θ ∈ Θn, λmin(n−1∑i=1n gi(θ)gi(θ)′) ≥ δ for some δ > 0 uniformly over θ ∈ Θn, and Assumption ρ holds. Then a maximizer λθ of the GEL objective (as a function of λ for fixed θ) over Λn(θ) exists u.w.p.a.1, λθ = Op(n−1/2), and the maximized objective is Op(n−1), all uniformly over θ ∈ Θn.
Proof. Without loss of generality cn ≠ 0, and thus Λn can be assumed compact. For θ ∈ Θn, let λθ ∈ Λn maximize the GEL objective over Λn. Such a λθ ∈ Λn exists u.w.p.a.1 because a continuous function takes on its maximum on a compact set and, by Lemma 7 and Assumption ρ, the objective (as a function in λ for fixed θ) is C2 on some open neighborhood of Λn u.w.p.a.1. We now show that λθ in fact maximizes the objective over the whole set Λn(θ) u.w.p.a.1, which then proves the first part of the lemma. By a second-order Taylor expansion around λ = 0, there is a λθ* on the line segment joining 0 and λθ such that for some positive constants C1 and C2
0 ≤ −2λθ′ĝ(θ) + λθ′[n−1∑i=1n ρ2(λθ*′gi(θ))gi(θ)gi(θ)′]λθ ≤ −2λθ′ĝ(θ) − C1∥λθ∥2 ≤ C2∥λθ∥∥ĝ(θ)∥ − C1∥λθ∥2   (A.1)
u.w.p.a.1, where ĝ(θ) := n−1∑i=1n gi(θ) and the second inequality follows as max1≤i≤n ρ2(λθ*′gi(θ)) < −½ u.w.p.a.1 from Lemma 7, continuity of ρ2(·) at zero, and ρ2 = −1, combined with the eigenvalue bound assumed in the lemma. The last inequality follows from the Cauchy–Schwarz inequality. Now, (A.1) implies that ∥λθ∥ ≤ (C2/C1)∥ĝ(θ)∥, the latter being Op(n−1/2) uniformly over θ ∈ Θn by assumption. It follows that λθ ∈ int(Λn) u.w.p.a.1. To prove this, let ε > 0. Because λθ = Op(n−1/2) uniformly over θ ∈ Θn and cn = op(1), there exist Mε < ∞ and nε ∈ N such that Pr(∥n1/2λθ∥ ≤ Mε) > 1 − ε/2 uniformly over θ ∈ Θn and Pr(cn−1/2 > Mε) > 1 − ε/2 for all n ≥ nε. Then Pr(λθ ∈ int(Λn)) = Pr(∥n1/2λθ∥ < cn−1/2) ≥ Pr((∥n1/2λθ∥ ≤ Mε) ∧ (cn−1/2 > Mε)) > 1 − ε for n ≥ nε uniformly over θ ∈ Θn.
Hence, the FOC for an interior maximum, ∑i=1n ρ1(λ′gi(θ))gi(θ) = 0, hold at λ = λθ u.w.p.a.1. By Lemma 7, λθ′gi(θ) lies in the relevant domain for all 1 ≤ i ≤ n u.w.p.a.1, and thus by concavity of the objective (as a function in λ for fixed θ) and convexity of Λn(θ) it follows that λθ maximizes the objective over Λn(θ), which implies the first part of the lemma. From before, λθ = Op(n−1/2) uniformly over θ ∈ Θn. Thus the second and, by (A.1), the third parts of the lemma follow. █
Suppose Θ1 × Θ2 ⊂ Θ, Θi ⊂ Rpi, p1 + p2 = p. Partition θ0 = (θ01′,θ02′)′ accordingly and assume θ02 ∈ Θ2. For d1 ∈ Θ1 define θd1 := (d1′,θ02′)′. By u.w.p.a.1 we denote “uniformly over d1 ∈ Θ1 w.p.a.1.”
LEMMA 9. Suppose max1≤i≤n supθ∈Θ1×Θ2∥gi(θ)∥ = op(n1/2),
for some
uniformly over d1 ∈ Θ1, and Assumption ρ holds.
Then
uniformly over d1 ∈ Θ1.
Proof. Without loss of generality
can be assumed. Define
. Note that λ ∈ Λn and thus
uniformly over θ ∈ Θn w.p.a.1 (see Lemma 7 with Θn := Θ1 × Θ2). By a second-order Taylor expansion around λ = 0, there is a
on the line segment joining 0 and λ such that for some positive constants C1 and C2
u.w.p.a.1, where the first inequality follows from Lemma 7, which implies that
. The second inequality follows by
. The definition of
implies
uniformly over d1 ∈ Θ1. Combining equations (A.2) and (A.3) implies
uniformly over d1 ∈ Θ1. █
Proof of Theorem 2. (i) We first show consistency of β̂. By Assumption ID and M(iii), n−1∑i=1n gi(θ) converges in probability to m2(β) uniformly over Θ, where m2(β) = 0 if and only if β = β0. Therefore, n−1∑i=1n gi(θ̂) = op(1) is a sufficient condition for consistency of β̂. Applying Lemma 8 to the case Θn = {θ0} gives that the GEL objective maximized over λ at θ0 is Op(n−1). Assumption M(ii) implies that the relevant second-moment matrices are bounded by some κ < ∞, and thus Lemma 9 (applied to the case p1 = 0, Θ2 = Θ) implies n−1∑i=1n gi(θ̂) = op(1).
Next we establish n1/2-consistency of β̂. By consistency of β̂ and Assumption M(ii), the assumptions of Lemma 8 hold for some ε > 0, and thus Lemma 8 for the case Θn = {θ̂} implies that the FOC
∑i=1n ρ1(λ′gi(θ̂))gi(θ̂) = 0
have to hold at λ = λ(θ̂), where θ̂ denotes the GEL estimator and λ(θ), for given θ ∈ Θ, is defined in Lemma 8. Expanding the FOC in λ around 0, there exists a mean value λ* between 0 and λ(θ̂) (that may be different for each row) such that
0 = −ĝ(θ̂) + [n−1∑i=1n ρ2(λ*′gi(θ̂))gi(θ̂)gi(θ̂)′]λ(θ̂) = −ĝ(θ̂) − Ω̃λ(θ̂),
where ĝ(θ) := n−1∑i=1n gi(θ) and the matrix Ω̃ := −n−1∑i=1n ρ2(λ*′gi(θ̂))gi(θ̂)gi(θ̂)′ has been implicitly defined. Because λ* lies on the segment joining 0 and λ(θ̂), Lemma 7 and Assumption ρ imply that max1≤i≤n|ρ2(λ*′gi(θ̂)) + 1| = op(1). By Assumption M(ii), it follows that Ω̃ − n−1∑i=1n gi(θ̂)gi(θ̂)′ = op(1), and thus Ω̃ is invertible w.p.a.1 and Ω̃−1 = Op(1). Therefore λ(θ̂) = −Ω̃−1ĝ(θ̂) w.p.a.1. Inserting this into a second-order Taylor expansion for the GEL objective (with mean value λ* as in (A.1)) it follows that the maximized objective at θ̂ equals ĝ(θ̂)′Ω̃−1ĝ(θ̂) up to terms of smaller order. The same argument applied at θ0 shows that the maximized objective at θ0 is Op(n−1). We therefore have, by the definition of θ̂ as minimizer of the maximized objective, ĝ(θ̂)′Ω̃−1ĝ(θ̂) = Op(n−1). By Assumption ID, we have up to op(1) terms that n1/2ĝ(θ̂) = Ψn(θ̂) + m1(θ̂) + n1/2m2(β̂). The same analysis as in the proof of Lemma A1 in Stock and Wright (2000, p. 1091, line six from the top) can now be applied to prove n1/2-consistency of β̂, where the symmetric matrix Ω̃−1 plays the role of the weight matrix in Stock and Wright. Note that in equation (A.4) in Stock and Wright, Assumption M(iii) of bounded sample paths w.p.a.1 is used. Finally, note that the smallest eigenvalue of Ω̃−1 is bounded away from zero w.p.a.1.
(ii) By Assumption M(iii) the empirical process Ψn has bounded sample paths w.p.a.1, and by ID we have, for some mean-vector β between β0 and β0 + n−1/2b (that may differ across rows), n1/2m2(β0 + n−1/2b) = (∂m2/∂β)(β)b. Because the latter expression is bounded, it follows that n−1∑i=1n gi(θαb) = Op(n−1/2) u.w.p.a.1, where u.w.p.a.1 stands for “uniformly over (α,b) ∈ A × BM w.p.a.1.” Therefore, by Lemma 8, λ(θαb) maximizing the GEL objective (for θ = θαb) exists u.w.p.a.1 and λ(θαb) = Op(n−1/2) uniformly over (α,b) ∈ A × BM. This implies that the FOC ∑i=1n ρ1(λ′gi(θαb))gi(θαb) = 0 have to hold at λ = λ(θαb) and θ = θαb u.w.p.a.1. Expanding the FOC and using the same steps and notation as in part (i), it follows that λ(θαb) = −Ω̃−1 n−1∑i=1n gi(θαb), and upon inserting this into a second-order Taylor expansion of the GEL objective we have that the normalized objective is, up to op(1) terms, a quadratic form in n−1/2∑i=1n gi(θαb) with weight matrix Ω̃−1, u.w.p.a.1. The matrices Ω̃ and n−1∑i=1n gi(θαb)gi(θαb)′ converge to Ω((α′,β0′)′) uniformly over A × BM. By M(iii), Ψn(θαb) − Ψn((α′,β0′)′) = op(1) uniformly, and therefore the objective converges weakly to the limit Pαb on A × BM.
By part (i) of the proof and Lemma 3.2.1 in van der Vaart and Wellner (1996, p. 286) it follows that
For given α ∈ A, one can calculate arg minb∈RpB Pαb by solving the FOC for b. Writing Ω for Ω((α′,β0′)′) and M2 for M2(β0) the result is
This holds in particular for α = α*. It follows that α* = arg minα∈A Pαb*(α). █
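The minimizer b*(α) obtained by solving the FOC for b is the usual GLS-type formula. As a sanity check under assumed toy values (Ω, M2, and the limit vector g below are made up for illustration), the b minimizing the quadratic form (g + M2b)′Ω−1(g + M2b) is −(M2′Ω−1M2)−1M2′Ω−1g:

```python
import numpy as np

# Toy check of the closed-form minimizer of a GLS-type quadratic form.
# All numerical values are illustrative assumptions, not from the paper.
Omega = np.array([[2.0, 0.5, 0.0],
                  [0.5, 1.0, 0.2],
                  [0.0, 0.2, 1.5]])   # positive definite weight matrix
M2 = np.array([[1.0, 0.0],
               [0.5, 1.0],
               [0.0, 2.0]])           # full column rank Jacobian
g = np.array([1.0, -2.0, 0.5])

Oi = np.linalg.inv(Omega)
b_star = -np.linalg.solve(M2.T @ Oi @ M2, M2.T @ Oi @ g)

# Verify the first-order condition M2' Omega^{-1} (g + M2 b*) = 0.
foc = M2.T @ Oi @ (g + M2 @ b_star)
print(b_star, foc)
```

Because the objective is strictly convex in b (Ω positive definite, M2 full column rank), the FOC characterizes the unique minimizer.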
Proof of Theorem 3. Applying Lemma 8 to the case Θn = {θ}, it follows that λ(θ) exists u.w.p.a.1 such that λ(θ) = Op(n−1/2) and the FOC of Lemma 8 hold at λ(θ). Using the same steps and notation as in the proof of Theorem 2 leads to λ(θ) = −Ω̃−1 n−1∑i=1n gi(θ) w.p.a.1, where by Mθ(ii) both Ω̃ and n−1∑i=1n gi(θ)gi(θ)′ converge in probability to Δ(θ). By Mθ(iii), n−1/2∑i=1n gi(θ) converges in distribution to the normal limit given there, from which the result follows. █
Proof of Theorem 4. Using Mθ(i)–(iii) and an argument similar to the one that led to (A.5), Sρ(θ) and LMρ(θ) are asymptotically equivalent, and therefore the statement of the theorem involving Sρ(θ) follows immediately from the one for LMρ(θ). We therefore only deal with the statistic LMρ(θ) given in equation (3.8).
First, we show that the matrix D* is asymptotically independent of the normalized sample moment vector n−1/2∑i=1n gi(θ). For notational convenience, from now on we omit the argument θ; e.g., we write gi for gi(θ). By a mean-value expansion about 0 we have ρ1(λ′gi) = −1 + ρ2(ξi)gi′λ for a mean value ξi between 0 and λ′gi, and thus by (A.8) and the definition of Λ we have
where for the last equality we use (3.7) and Assumptions Mθ(v)–(vi). By Assumption Mθ(v) it thus follows that
where w1 := vec(0,−M2(β0),0) ∈ RkpA+kpB+k and
M and v have dimensions (kpA + kpB + k) × (kpA + k) and (kpA + k) × 1, respectively. By Assumption ID, Mθ(vii), and (3.7) v →d N(w2,V(θ)), where w2 := ((vec M1A)′,m1′)′ and M1A are the first pA columns of M1. Therefore
where Ψ := ΔAA − ΔAΔ−1ΔA′ is positive definite. Equation (A.9) proves that D* and the normalized sample moment vector are asymptotically independent.
We now derive the asymptotic distribution of LMρ(θ). Denote by D and g the limiting normal distributions of D* and the normalized sample moment vector, respectively (see equation (A.9)). Subsequently we show that the function h : Rk×p → Rp×k defined by h(D) := (D′Δ−1D)−1/2D′ for D ∈ Rk×p is continuous on a set C ⊂ Rk×p with Pr(D ∈ C) = 1. By the continuous mapping theorem and Mθ(v) we have
By the independence of D and g, the latter random variable is distributed as W + ζ, where the random p-vector W is defined as
ζ ∼ N(0,Ip), and W and ζ are independent. Note that for θ = θ0, W ≡ 0. From (A.10) the statement of the theorem follows.
We now prove the continuity claim for h. Note that h is continuous at each D that has full column rank. It is therefore sufficient to show that D has full column rank a.s. From (A.9) it follows that the last pB columns of D equal −M2(β0), which has full column rank by assumption. Define
and the k × p-matrix
has linearly dependent columns}. Clearly, O is closed and therefore Lebesgue-measurable. Furthermore, O has empty interior and thus has Lebesgue measure 0. For the first pA columns of D, DpA say, it has been shown that vecDpA is normally distributed with full rank covariance matrix Ψ. This implies that for any measurable set O* ⊂ RkpA with Lebesgue measure 0, it holds that Pr(vec(DpA) ∈ O*) = 0, in particular, for O* = O. This proves the continuity claim for h. █
Proof of Theorem 5. By Assumptions
, and by Lemmas 8 and 9 (applied to Θn = {θaβ0} and Θ1 = {a}, Θ2 = B, respectively) we have
. Assumption IDα then implies consistency of
. Applying Lemma 8 to the case
implies that the FOC for λ must hold in the definition of
(see equation (A.4)). Then repeating the analysis that leads to (A.6) in the proof of Theorem 2, we have by Mα(ii)
The next goal is to derive the asymptotic distribution of
. Our analysis follows Newey and Smith (2004); see their proof of Theorem 3.2. Differentiating the FOC (A.4) with respect to λ yields the matrix
, which by Mα(ii) converges in probability to −Δ(θaβ0), which is nonsingular. Therefore, the implicit function theorem implies that there is a neighborhood of
where the solution to the FOC, say
, is continuously differentiable w.p.a.1. The envelope theorem then implies
w.p.a.1. Also, a mean-value expansion of (A.4) in (β,λ) about (β0,0) yields (where gi(θ) inside ρ1 is kept constant at
)
where (β′,λ′) are mean values on the line segment that joins
that may be different for each row. Combining the pB rows of (A.13) with the k rows of (A.14) we get
where the (pB + k) × (pB + k) matrix M̂ has been implicitly defined. By Mα(ii) and Mα(iv)–(vi) the matrix M̂ converges in probability to M, where (writing M2β for M2β((α02,β0)))
and (omitting the argument θaβ0)
It follows that M̂ is nonsingular w.p.a.1. Equation (A.15) implies that w.p.a.1
An expansion of
in β around β0 and the preceding lead to
for some appropriate mean value θ. Note that
which has rank k − pB. From (A.12), GELRρsub(a) →d ξ′Δ(θaβ0)−1MM2β(Δ(θaβ0))ξ, where ξ ∼ N(m1(θaβ0),Δ(θaβ0)), which concludes the proof. █
Proof of Theorem 6. As in the proof of Theorem 5, Sρsub(a) and LMρsub(a) are asymptotically equivalent. Hence, the result for LMρsub(a) implies the result for Sρsub(a).
As in the proof of Theorem 4, renormalize D* := Dρ(a)Λ, where the diagonal pA × pA matrix Λ := diag(n1/2,…,n1/2,1,…,1) has first pA1 diagonal elements equal to n1/2 and the remaining pA2 elements equal to unity. We now show that D* and the normalized sample moment vector are asymptotically independent. By a mean-value expansion about θaβ0 and Assumption Mα(vii) we have, for some mean value (that may be different for each row),
where we have used (A.16) for the last equation. Assumptions Mα(vii) and IDα imply
(recall that m2 does not depend on α1) and thus
Proceeding exactly as in the proof of Theorem 4, using (A.17), (A.19), and Assumptions Mα(vii)–(ix), it follows that
where M ∈ R(kpA1+kpA2+k)×(kpA1+k) and
where the arguments (α02,β0) in M2β and (∂m2 /∂α2) and θaβ0 in ΔA1 and Δ are omitted. By Mα(x), v is asymptotically normal with full rank covariance matrix Vα(θaβ0), and thus the asymptotic covariance matrix of
is given by MVα(θaβ0)M′. For the asymptotic independence of D* and the moment vector, the upper right k(pA1 + pA2) × k-submatrix of MVα(θaβ0)M′ must be 0. This is clear for the kpA2 × k-dimensional submatrix, and we only have to show that the kpA1 × k upper right submatrix
is 0. Using (A.18), the matrix in (A.21) equals −ΔA1 Δ−1PM2β(Δ)MM2β(Δ)Δ, which is clearly 0. This proves the independence claim.
Now denote by D and g the limiting normal distributions of D* and the normalized sample moment vector, respectively, implied by (A.20). Recall M(a) = Δ−1MM2β(Δ) (see equation (4.2)). If the function h : Rk×pA → RpA×k defined by h(D) := (D′M(a)D)−1/2D′ for D ∈ Rk×pA is continuous on a set C ⊂ Rk×pA with Pr(D ∈ C) = 1, then by the continuous mapping theorem
By (A.17) and (A.18) the latter variable is distributed as Wα(a) + ζα, where
Therefore the theorem is proved once we have proved the continuity claim for h. For this step of the proof we need the positive definite assumption for Vα(θaβ0) in Mα(x). It is enough to show that with probability 1, rank(MM2β(Δ)D) = pA. Because the span of the columns of M2β equals the kernel of MM2β(Δ) and rank(M2β) = pB, the latter condition holds if rank(M2β,D) = p. Denote by DpA2 the last pA2 columns of D, which by (A.20) equal −(∂m2 /∂α2). By Assumption IDα, the matrix (∂m2 /∂(α2′,β′)′)((α02,β0)) has rank pA2 + pB, and it remains to show that with probability one, this matrix is linearly independent of the first pA1 columns of D, DpA1 say. Using (A.20) and Vα(θaβ0) > 0, the covariance matrix of vecDpA1 is easily shown to have full column rank pA1 k. An argument analogous to the last step in the proof of Theorem 4 can then be applied to conclude the proof. █