
GENERALIZED EMPIRICAL LIKELIHOOD ESTIMATORS AND TESTS UNDER PARTIAL, WEAK, AND STRONG IDENTIFICATION

Published online by Cambridge University Press:  19 July 2005

Patrik Guggenberger
Affiliation:
UCLA
Richard J. Smith
Affiliation:
cemmap, UCL and IFS and University of Warwick

Abstract

The purpose of this paper is to describe the performance of generalized empirical likelihood (GEL) methods for time series instrumental variable models specified by nonlinear moment restrictions, as in Stock and Wright (2000, Econometrica 68, 1055–1096), when identification may be weak. The paper makes two main contributions. First, we show that all GEL estimators are first-order equivalent under weak identification. The GEL estimator under weak identification is inconsistent and has a nonstandard asymptotic distribution. Second, the paper proposes new GEL test statistics that have chi-square asymptotic null distributions independent of the strength or weakness of identification. Consequently, unlike those for Wald and likelihood ratio statistics, the size of tests formed from these statistics is not distorted by the strength or weakness of identification. Modified versions of the statistics are presented for tests of hypotheses on parameter subvectors when the parameters not under test are strongly identified. Monte Carlo results for the linear instrumental variable regression model suggest that tests based on these statistics have very good size properties even in the presence of conditional heteroskedasticity. The tests have competitive power properties, especially for thick-tailed or asymmetric error distributions.

This paper is a revision of Guggenberger's job market paper “Generalized Empirical Likelihood Tests under Partial, Weak, and Strong Identification.” We are grateful to the editor, P.C.B. Phillips, and three referees for very helpful suggestions on an earlier version of this paper. Guggenberger gratefully acknowledges the continuous help and support of his adviser, Donald Andrews, who played a prominent role in the formulation of this paper. He thanks Peter Phillips and Joseph Altonji for their extremely valuable comments. We also thank Vadim Marner for help with the simulation section and John Chao, Guido Imbens, Michael Jansson, Frank Kleibergen, Marcelo Moreira, Jonathan Wright, and Motohiro Yogo for helpful comments. Aspects of this research were presented at the 2003 Econometric Society European Meetings; the York Econometrics Workshop 2004; Séminaire Malinvaud, CREST-INSEE; and seminars at Albany, Alicante, Austin (Texas), Brown, Chicago, Chicago GSB, Harvard/MIT, Irvine, ISEG/Universidade Tecnica de Lisboa, Konstanz, Laval, Madison (Wisconsin), Mannheim, Maryland, NYU, Penn, Penn State, Pittsburgh, Princeton, Rice, Riverside, Rochester, San Diego, Texas A&M, UCLA, USC, and Yale. We thank all the seminar participants. Guggenberger and Smith received financial support through a Carl Arvid Anderson Prize Fellowship and a 2002 Leverhulme Major Research Fellowship, respectively.

Type
Research Article
Copyright
© 2005 Cambridge University Press

1. INTRODUCTION

It is often the case that the instrumental variables available to empirical researchers are only weakly correlated with the endogenous variables. That is, identification is weak. Phillips (1989), Nelson and Startz (1990), and a large literature following these early contributions show that in such situations classical normal and chi-square asymptotic approximations to the finite-sample distributions of instrumental variable (IV) estimators and statistics can be very poor. For example, even though likelihood ratio and Wald test statistics are asymptotically chi-square, use of chi-square critical values can lead to extreme size distortions in finite samples (see Dufour, 1997). The purpose of this paper is to ascertain the performance of generalized empirical likelihood (GEL) methods (Newey and Smith, 2004; Smith, 1997, 2001) for time series IV models specified by nonlinear moment restrictions when identification may be weak (as in Stock and Wright, 2000). In particular, the paper makes two principal contributions. First, the asymptotic distribution of the GEL estimator is derived for a weakly identified setup. Second, the paper proposes new, theoretically and computationally attractive GEL test statistics. The asymptotic null distribution of these statistics is chi-square under partial (Phillips, 1989), weak (Stock and Wright, 2000), and strong identification. Thus, the size of tests formed from these statistics is invariant to the strength or weakness of identification. Importantly, we also provide asymptotic power results for the various statistics suggested in this paper.

GEL estimators and test statistics are alternatives to those based on generalized method of moments (GMM); see Hansen (1982), Newey (1985), and Newey and West (1987). GEL has received considerable attention recently because of its competitive bias properties. For example, Newey and Smith (2004) show that for many models the asymptotic bias of empirical likelihood (EL) does not grow with the number of moment restrictions, whereas that of GMM estimators grows without bound, a finding that may imply favorable properties for GEL-based test statistics.

Similar to the findings in Phillips (1984, 1989) and Stock and Wright (2000) for limited information maximum likelihood (LIML), two stage least squares (2SLS), and GMM, GEL estimators of weakly identified parameters have nonstandard asymptotic distributions and are in general inconsistent. Therefore, inference based on the classical normal approximation is inappropriate under weak identification. As in Newey and Smith (2004) for strong identification, the first-order asymptotics of the GEL estimator under weak identification do not depend on the choice of the GEL criterion function. This finding is rather surprising and contrasts with 2SLS and LIML estimators, whose first-order asymptotic theory differs under weak identification.

The statistics proposed here are asymptotically pivotal in contrast to classical Wald and likelihood ratio statistics no matter what the strength of identification. The first statistic, GELRρ, is based on the GEL criterion function and may be thought of as a nonparametric likelihood ratio statistic. Two further statistics generalize the GMM-based K-statistic of Kleibergen (2001) to the GEL context. Like the K-statistic, which is a quadratic form in the first-order derivative vector of the continuous updating GMM objective function, the second GEL statistic, Sρ, is a score-type statistic, being a quadratic form in the GEL criterion score vector. The third statistic, LMρ, is similar in structure to a GMM Lagrange multiplier statistic (Newey and West, 1987) and is asymptotically equivalent to the score-type statistic, being a quadratic form in the sample moment vector. Confidence regions constructed from the K- and GEL score-type statistics are never empty and contain the continuous updating estimator (CUE) and GEL estimator, respectively. All forms of GEL statistics admit limiting chi-square null distributions with degrees of freedom equal to the number of instrumental variables or moment conditions for the first statistic and the dimension of the parameter vector for the second and third statistics. In overidentified situations, therefore, tests based on the latter statistics should be expected to have better power properties than those based on the former. In many cases, an applied researcher is interested in inference on a parameter subvector rather than the whole parameter vector. Modified versions of these statistics are therefore suggested for the subvector case when the remaining parameters are strongly identified.

Monte Carlo simulations for the independent and identically distributed (i.i.d.) linear IV model with a wide range of error distributions compare our test statistics to several others, including homoskedastic and heteroskedastic versions of the K-statistic of Kleibergen (2001, 2002a) and the similar conditional likelihood ratio statistic LRM of Moreira (2003). We find that our tests have very good size properties even in the presence of conditional heteroskedasticity. In contrast, the homoskedastic version of the K-statistic of Kleibergen (2002a) and the LRM-statistic of Moreira (2003) are size-distorted under conditional heteroskedasticity. Our tests have competitive power properties, especially for thick-tailed or asymmetric error distributions. Given the nonparametric construction of the GEL estimator, robustness of GEL-based test statistics to different error distributions should be expected.

Like the work of Stock and Wright (2000), our paper allows for both i.i.d. and martingale difference sequences (m.d.s.) but does not apply to more general time series models; see Assumption Mθ(ii), which follows. Allowing for m.d.s. observations covers various cases of intertemporal Euler equations applications and regression models with m.d.s. errors. Therefore, the extension from the i.i.d. linear (Guggenberger, 2003, Ch. 1) to the particular time series setting with nonlinear moment restrictions considered here seems worthwhile, especially because there is essentially no cost (in terms of complications of the proofs) to making this extension. The proofs for consistency and for the asymptotic distribution of the GEL estimator build on Guggenberger (2003), which adapts those given in Newey and Smith (2004) for the i.i.d. strongly identified context.

Subsequent to the i.i.d. linear version of this paper, two related papers have appeared. First, Caner (2003) derives the asymptotic distribution of the exponential tilting (ET) estimator (see Imbens, Spady, and Johnson, 1998; Kitamura and Stutzer, 1997) under weak identification with nonlinear moment restrictions for independent observations. Caner (2003) also obtains an ET version of the K-statistic for nonlinear moment restrictions. Second, Otsu (2003) considers GEL-based tests under weak identification in a more general time series setting than considered here and examines the GEL criterion function statistic GELRρ and a modified version of the K-statistic based on the Kitamura and Stutzer (1997) and Smith (2001) kernel smoothed GEL estimator that is efficient under strong identification; see also Guggenberger and Smith (2003).

The remainder of the paper is organized as follows. In Section 2, the model and the assumptions are discussed, the GEL estimator is briefly reviewed, and the asymptotic distribution of the GEL estimator under weak identification is derived. Section 3 introduces the GEL-based test statistics. We derive their asymptotic limiting distribution and show that it is unaffected by the degree of identification. Section 4 generalizes these results to hypotheses involving subvectors of the unknown parameter vector. Section 5 describes the simulation results. All proofs are relegated to the Appendix.

The following notation is used in the paper. The symbols →d, →p, and ⇒ denote convergence in distribution, convergence in probability, and weak convergence of empirical processes, respectively; for the latter, see Andrews (1994) for a definition. We abbreviate "almost surely" by "a.s." and "with probability approaching 1" by "w.p.a.1."

The space Ci(M) contains all functions that are i times continuously differentiable on M. For a symmetric matrix A, A > 0 means that A is positive definite, and λmin(A) and λmax(A) denote the smallest and largest eigenvalues of A in absolute value, respectively. By A′ we denote the transpose of a matrix A. For a full column rank matrix A ∈ Rk×p and positive definite matrix K ∈ Rk×k, we denote by PA(K) the oblique projection matrix A(A′K−1A)−1A′K−1 on the column space of A in the metric K and define MA(K) := Ik − PA(K), where Ik is the k-dimensional identity matrix; we abbreviate this notation to PA and MA if K = Ik. The symbol ⊗ denotes the Kronecker product. Furthermore, vec(M) stands for the column vectorization of the k × p matrix M; i.e., if M = (m1,…,mp) then vec(M) = (m1′,…,mp′)′. Finally, ∥M∥ equals the square root of the largest eigenvalue of M′M.
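The oblique projection notation can be checked numerically. A minimal numpy sketch, with an arbitrary illustrative choice of the positive definite metric K (the specific matrices are ours, not the paper's):

```python
import numpy as np

rng = np.random.default_rng(0)
k, p = 5, 2
A = rng.standard_normal((k, p))        # full column rank (almost surely)
K = np.eye(k) + 0.5 * np.ones((k, k))  # symmetric positive definite metric (hypothetical)

Kinv = np.linalg.inv(K)
P = A @ np.linalg.inv(A.T @ Kinv @ A) @ A.T @ Kinv  # P_A(K)
M = np.eye(k) - P                                   # M_A(K) := I_k - P_A(K)

assert np.allclose(P @ P, P)   # idempotent: an (oblique) projection
assert np.allclose(P @ A, A)   # reproduces the column space of A
assert np.allclose(M @ A, 0)   # M_A(K) annihilates the column space of A

# with K = I_k the formula reduces to the orthogonal projection P_A
P_orth = A @ np.linalg.inv(A.T @ A) @ A.T
assert np.allclose(P_orth @ P_orth, P_orth)
```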

2. ESTIMATION

This section is concerned with the asymptotic distribution of the GEL estimator when some elements of the parameter vector of interest may be only weakly identified. Intuitively, then, the moment conditions that define the model may not be particularly informative about these parameters.

2.1. Model

We consider models specified by a finite number of moment restrictions. Let {zi : i = 1,…,n} be Rl-valued data and, for each n ∈ N, gn : G × Θ → Rk a given function, where G ⊂ Rl and Θ ⊂ Rp denotes the parameter space. The model has a true parameter θ0 for which the moment condition

Egn(zi,θ0) = 0    (2.1)

is satisfied. For gn(zi,θ) we usually write gi(θ).

Example 1 (i.i.d. linear IV regression)

Guggenberger (2003, Ch. 1) discusses in detail GEL estimation and testing for this model under weak identification. The structural form (SF) equation is given by

y = Yθ0 + u,    (2.2)

and the reduced form (RF) for Y by

Y = ZΠ + V,

where y,u ∈ Rn, Y,V ∈ Rn×p, Z ∈ Rn×k, and Π ∈ Rk×p. The matrix Y may contain both exogenous and endogenous variables, Y = (X,W) say, where X ∈ Rn×pX and W ∈ Rn×pW denote the respective observation matrices of exogenous and endogenous variables. The variables Z = (X,ZW) constitute a set of instruments for the endogenous variables W. The first pX columns of Π equal the first pX columns of Ik, and the first pX columns of V are 0. Denote by Yi, Vi, Zi,… (i = 1,…,n) the ith row of the matrix Y, V, Z,… written as a column vector. Assuming the instruments and the structural error are uncorrelated, EuiZi = 0, it follows that Egi(θ0) = 0, where for each i = 1,…,n, gi(θ) := (yi − Yi′θ)Zi. Note that in this example gi(θ) depends on n if the RF coefficient matrix Π is modeled to depend on n (see Staiger and Stock, 1997), where Πn = n−1/2C for a fixed matrix C.
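Under the weak-instrument drift Πn = n−1/2C, the moment function gi(θ) = (yi − Yi′θ)Zi still satisfies Egi(θ0) = 0, so the sample moment vector at θ0 is Op(n−1/2). A small simulation sketch (all numerical values, including the error correlation, are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(1)
n, k, p = 10_000, 3, 1
C = np.ones((k, p))            # fixed matrix C; Pi_n = n^(-1/2) C models weak instruments
Pi_n = C / np.sqrt(n)
theta0 = np.array([0.5])       # true structural parameter (hypothetical)

Z = rng.standard_normal((n, k))
u = rng.standard_normal(n)                                 # structural error
V = 0.8 * u[:, None] + 0.6 * rng.standard_normal((n, p))   # RF error, correlated with u
Y = Z @ Pi_n + V                                           # endogenous regressor
y = Y @ theta0 + u

g = (y - Y @ theta0)[:, None] * Z   # g_i(theta0) = (y_i - Y_i' theta0) Z_i
gbar = g.mean(axis=0)

# E g_i(theta0) = 0, so the sample mean is O_p(n^(-1/2))
assert np.all(np.abs(gbar) < 5 / np.sqrt(n))
```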

Example 2 (conditional moment restrictions)

As in Stock and Wright (2000) the moment conditions may result from conditional moment restrictions. Assume E[h(Yi,θ0)|Fi] = 0, where h : H × Θ → Rk1, H ⊂ Rk2, and Fi is the information set at time i. Let Zi be a k3-dimensional vector of instruments contained in Fi. If gi(θ) := h(Yi,θ) ⊗ Zi, then Egi(θ0) = 0 follows by taking iterated expectations. In (2.1), k = k1k3 and l = k2 + k3.
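The dimension bookkeeping for the construction gi(θ) = h(Yi,θ) ⊗ Zi can be sketched in a few lines (the dimensions are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
k1, k3 = 2, 3                   # dim of h(Y_i, theta) and of the instrument vector Z_i
h_i = rng.standard_normal(k1)   # stand-in realization of h(Y_i, theta)
Z_i = rng.standard_normal(k3)

g_i = np.kron(h_i, Z_i)         # g_i(theta) = h(Y_i, theta) ⊗ Z_i

assert g_i.shape == (k1 * k3,)  # k = k1 * k3 unconditional moment conditions
# the Kronecker product of two vectors stacks the outer product row by row
assert np.allclose(g_i, np.outer(h_i, Z_i).ravel())
```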

2.2. Assumptions

This section is concerned with the asymptotic distribution of the GEL estimator for θ when some components of θ0 = (α0′,β0′)′, α0 say, where α0 ∈ A and A ⊂ RpA, are only weakly identified. Intuitively, this means that the moment condition (2.1) is not very informative about α0. For parameter vectors θ = (α′,β0′)′, Egn(zi,θ) may be very close to zero, not only for α close to α0 but also when α is far from α0. In that case, the restriction Egn(zi,θ0) = 0 is not very helpful for making inference on α0. Assumption ID, which follows, provides a theoretical asymptotic framework for this phenomenon and is taken from Assumption C in Stock and Wright (2000, p. 1061). We refer the reader to Stock and Wright (2000, pp. 1060–1061) for substantial detailed motivation for this assumption and an explanation of why it models α0 as weakly and β0 as strongly identified.

To describe the moment and distributional assumptions, we require some additional notation:

where, if defined, Gi(θ) := (∂gi/∂θ)(θ) ∈ Rk×p. For notational convenience, a subscript n has been omitted in certain expressions. Define the k × k matrices

Note that Δ(θ) is Ω(θ) in Stock and Wright (2000). We choose our notation for Ω(θ) for consistency with Newey and Smith (2004).

Let θ = (α′,β′)′, where α ∈ A ⊂ RpA, β ∈ B ⊂ RpB, and pA + pB = p. Also let

denote an open neighborhood of β0.

Assumption Θ. The true parameter θ0 = (α0′,β0′)′ is in the interior of the compact space Θ = A × B.

Assumption ID.

Next we detail the necessary moment assumptions.

Weak convergence here is defined with respect to the sup-norm on function spaces and the Euclidean norm on Rk.

Assumption M.

Assumption M(i) adapts Assumption 1(d) of Newey and Smith (2004), E supβ∈B∥gi(β)∥ξ < ∞ for some ξ > 2, from the i.i.d. setting with strong identification (pA = 0 and thus θ = β and Θ = B) to the weakly identified setup considered here. A sufficient condition for M(i) in the time series context and under ID is given by

supi≥1 E supθ∈Θ∥gi(θ)∥ξ < ∞ for some ξ > 2.    (2.4)

Indeed, a simple application of the Markov inequality shows that (2.4) implies max1≤i≤n supθ∈Θ∥gi(θ)∥ = Op(n1/ξ) = op(n1/2). See the Appendix for a proof. Assumption M(ii), which adapts Assumption 1(e) of Newey and Smith to the weakly identified setup, ensures that

is nonsingular for

. Assumption M(iii) is essentially the “high-level” Assumption B of Stock and Wright (2000, p. 1059) that states that Ψn obeys a functional central limit theorem. In Assumption B′, Stock and Wright provide primitive sufficient conditions for their Assumption B that can also be found in Andrews (1994). Note that the definition of weak convergence [Andrews (1994, p. 2250)] and M(iii) imply that supθ∈Θ∥Ψn(θ)∥ →d supθ∈Θ∥Ψ(θ)∥ and, thus, also that

. In the proof of Theorem 2 we require

bounded in probability.

It is interesting to note that for i.i.d. data an application of the Borel–Cantelli lemma shows that M(i) is implied by Assumption 1(d) of Newey and Smith (2004) even if ξ = 2; see Owen (1990, Lemma 3) for a proof. Hence, using Lemmas 7–9 given subsequently, their Assumption 1(d) can be weakened to ξ ≥ 2 for consistency and asymptotic normality of the GEL estimator under strong identification with i.i.d. data (see their Theorems 3.1 and 3.2). Therefore, for i.i.d. data, identical assumptions guarantee consistency and asymptotic normality for both GEL and two-step efficient GMM estimators (Hansen, 1982).
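The op(n1/2) bound on the sample maximum discussed above is easy to see numerically: when ∥gi∥ has finite moments of every order, the maximum over the sample grows far more slowly than n1/2. A simulation sketch with standard normal stand-ins for the moment functions:

```python
import numpy as np

rng = np.random.default_rng(3)
ratios = []
for n in (10**3, 10**5):
    g = rng.standard_normal(n)    # E|g_i|^xi < infinity for every xi
    # normalized maximum: max_i |g_i| / n^(1/2)
    ratios.append(np.abs(g).max() / np.sqrt(n))

# the normalized maximum shrinks as n grows: max_i |g_i| = o_p(n^(1/2))
assert ratios[1] < ratios[0]
assert ratios[1] < 0.05
```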

Example 1 (continued)

See Guggenberger (2003). For the linear IV model (2.2) Assumption ID can be expressed as the following assumption.

Assumption ID′. Π = Πn = (ΠAn,ΠB) ∈ Rk×(pA+pB), where pA + pB = p. For a fixed matrix CA ∈ Rk×pA, ΠAn = n−1/2CA and ΠB has full column rank.

Under Assumption ID′, i.i.d. data, and instrument exogeneity it follows that

Egi(θ) = n−1/2E(ZiZi′)CA(α0 − α) + E(ZiZi′)ΠB(β0 − β),

which implies that in the notation of ID(i), m1n(θ) = m1(θ) = E(ZiZi′)CA(α0 − α) and m2(β) = E(ZiZi′)ΠB(β0 − β). Also, note that Assumption ID′ includes the partially identified model of Phillips (1989). In particular, choosing pA and setting CA = 0, one obtains a model in which Π may have any desired (less than full) rank.

We now give simple sufficient conditions that imply Assumption M. Let U := (u,V).

Assumption M′.

(i) {(Ui,Zi) : i ≥ 1} are i.i.d.;

(ii) EZiUi′ = 0;

(iii) E∥Zi∥4 < ∞, QZZ := E(ZiZi′) > 0, and Eui2ZiZi′, EuiVijZiZi′, and EVijVikZiZi′ exist and are finite for j,k = 1,…,p, where Vij denotes the jth component of the vector Vi;

(iv) Ω(θ) is nonsingular for all θ ∈ A × {β0}.

Assumptions M′(i) and (ii) state that errors and exogenous variables are jointly i.i.d. and the standard instrument exogeneity assumption is satisfied, whereas M′(iii) and (iv) are technical conditions.

The following lemma shows that Assumption M′ in the linear model implies Assumption M.

LEMMA 1. Suppose that Assumptions ID′, M′, and Θ hold in the linear IV model (2.2). Then Assumptions ID and M hold.

Therefore the various technical conditions of Assumption M reduce to very simple moment conditions in the linear model. Note that M′ implies E[supθ∈Θ∥gi(θ)∥ξ] < ∞ for ξ = 2. However, we do not need the assumption E[supθ∈Θ∥gi(θ)∥ξ] < ∞ for some ξ > 2 to prove n1/2-consistency of the GEL estimator of the strongly identified parameters.

Assumption HOM (conditional homoskedasticity). E(UiUi′|Zi) = ΣU > 0.

HOM, which is used in Staiger and Stock (1997), is sufficient for Assumption M′(iv). That is, Assumptions M′(i)–(iii) and HOM imply M′(iv) under ID′. This follows from Ω(θ) = (vα′ΣuVAvα)QZZ for θ ∈ A × {β0}, where vα′ := (1,(α0 − α)′) and ΣuVA is the (1 + pA) × (1 + pA) upper left submatrix of ΣU. However, M′ is more general than HOM because it allows for conditional heteroskedasticity. For example, ui = ∥Zi∥ζi, where ζi ~ N(0,1) is independent of Zi ~ N(0,Ik), is compatible with M′.
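The error design ui = ∥Zi∥ζi mentioned above is a convenient test case: instrument exogeneity EuiZi = 0 holds, yet E(ui2|Zi) = ∥Zi∥2 varies with the instruments, so HOM fails. A simulation sketch:

```python
import numpy as np

rng = np.random.default_rng(4)
n, k = 200_000, 2
Z = rng.standard_normal((n, k))       # Z_i ~ N(0, I_k)
zeta = rng.standard_normal(n)         # zeta_i ~ N(0,1), independent of Z_i
u = np.linalg.norm(Z, axis=1) * zeta  # u_i = ||Z_i|| zeta_i

# instrument exogeneity still holds: E u_i Z_i = 0
assert np.all(np.abs((u[:, None] * Z).mean(axis=0)) < 0.05)

# but E(u_i^2 | Z_i) = ||Z_i||^2 is not constant: the squared error is
# systematically larger when ||Z_i||^2 is large (conditional heteroskedasticity)
r2 = (Z ** 2).sum(axis=1)
small, large = r2 < np.median(r2), r2 >= np.median(r2)
assert (u[small] ** 2).mean() < (u[large] ** 2).mean()
```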

2.3. The GEL Estimator

This section provides a formal definition of the GEL estimator of θ0.

Let ρ : Q → R be a real-valued function, where Q is an open interval of the real line that contains 0 and

If defined, let ρj(v) := (∂jρ/∂vj)(v) and ρj := ρj(0) for nonnegative integers j.

The GEL estimator is the solution to a saddle point problem

For compact Θ, continuous ρ, and gi (i = 1,…,n), the existence of an argmin

may be shown. In fact,

, viewed as a function in θ, can be shown to be lower semicontinuous (ls). A function f (x) is ls at x0 if, for each real number c such that c < f (x0), there exists an open neighborhood U of x0 such that c < f (x) holds for all xU. The function f is said to be ls if it is ls at each x0 of its domain. It is easily shown that ls functions on compact sets take on their minimum. Uniqueness of

, however, is not implied. As a simple example, consider the i.i.d. linear IV model in (2.2) when p = 2 and let the two components Yij, (j = 1,2), of Yi be independent Bernoulli random variables. Then, for each n, the probability that Yi1 = Yi2 for every i = 1,…,n is positive. If Yi1 = Yi2 for every

is an argmin of

, then each θ ∈ Θ with

is also. To uniquely define

, we could, for example, do the following. From the set of all vectors θ ∈ Θ that minimize

, let

be the vector that has the smallest first component. (If that does not pin down

uniquely, choose from the remaining vectors according to the second component, and so on.)
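The lexicographic tie-breaking rule just described can be sketched directly; the function name below is ours, not the paper's, and Python's tuple ordering is exactly the lexicographic order the rule requires:

```python
from typing import List, Tuple

def lexicographic_pick(minimizers: List[Tuple[float, ...]]) -> Tuple[float, ...]:
    """From the set of all minimizing parameter vectors, pick the one with the
    smallest first component; break remaining ties by the second component,
    and so on (tuple comparison in Python is lexicographic)."""
    return min(minimizers)

# two observationally equivalent minimizers theta = (theta_1, theta_2)
assert lexicographic_pick([(1.0, 0.0), (0.5, 0.5)]) == (0.5, 0.5)
# first components tie, so the second component decides
assert lexicographic_pick([(0.5, 0.7), (0.5, 0.2)]) == (0.5, 0.2)
```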

where

Assumption ρ.

(i) ρ is concave on Q;

(ii) ρ is C2 in a neighborhood of 0 and ρ1 = ρ2 = −1.

The definition of the GEL estimator

is adopted from Newey and Smith (2004). We slightly modify their definition of

by recentering and rescaling, which simplifies the presentation. We usually write

.

The most popular GEL estimators are the CUE, the EL, and the ET estimators, which correspond to ρ(v) = −(1 + v)2/2, ρ(v) = ln(1 − v), and ρ(v) = −exp v, respectively. The EL estimator was introduced by Imbens (1997), Owen (1988, 1990), and Qin and Lawless (1994) and the ET estimator by Imbens et al. (1998) and Kitamura and Stutzer (1997). For a recent survey of GEL methods see Imbens (2002).
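All three ρ functions satisfy the normalization ρ1 = ρ2 = −1 of Assumption ρ(ii), which can be verified numerically by finite differences at v = 0:

```python
import math

# the three leading GEL choices of rho
rho = {
    "CUE": lambda v: -(1 + v) ** 2 / 2,
    "EL":  lambda v: math.log(1 - v),
    "ET":  lambda v: -math.exp(v),
}

def d1(f, v=0.0, h=1e-6):
    # central finite-difference first derivative
    return (f(v + h) - f(v - h)) / (2 * h)

def d2(f, v=0.0, h=1e-4):
    # central finite-difference second derivative
    return (f(v + h) - 2 * f(v) + f(v - h)) / h ** 2

# Assumption rho(ii): rho_1 = rho_2 = -1 for each of CUE, EL, and ET
for f in rho.values():
    assert abs(d1(f) + 1) < 1e-5
    assert abs(d2(f) + 1) < 1e-3
```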

A choice of

as the weighting matrix WT(θ) in Stock and Wright (2000, equation (2.2), p. 1058), i.e.,

, results in the CUE which is the GEL estimator based on ρ(v) = −(1 + v)2/2; see Newey and Smith (2004, Theorem 2.1). Hansen, Heaton, and Yaron (1996) and Pakes and Pollard (1989) define the (GMM) CUE using the centered weighting matrix

. However, as shown in Newey and Smith (2004, footnote 2), both versions of the CUE are numerically identical.
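One way to see the numerical identity: writing Q for the uncentered CUE criterion ḡ′Ω̂−1ḡ, the Sherman–Morrison formula gives centered criterion ḡ′(Ω̂ − ḡḡ′)−1ḡ = Q/(1 − Q), a strictly increasing transformation of Q, so the two objectives have the same minimizer. A numpy sketch (the data are arbitrary stand-ins for gi(θ)):

```python
import numpy as np

rng = np.random.default_rng(5)
n, k = 50, 3
g = rng.standard_normal((n, k))              # stand-in for g_i(theta), i = 1,...,n
gbar = g.mean(axis=0)
Omega_u = g.T @ g / n                        # uncentered weighting matrix
Omega_c = Omega_u - np.outer(gbar, gbar)     # centered weighting matrix

Q_u = gbar @ np.linalg.solve(Omega_u, gbar)  # uncentered CUE criterion at theta
Q_c = gbar @ np.linalg.solve(Omega_c, gbar)  # centered CUE criterion at theta

# Q_c = Q_u / (1 - Q_u): a strictly increasing transformation, so both
# criteria are minimized at the same parameter value
assert np.isclose(Q_c, Q_u / (1 - Q_u))
```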

2.4. First-Order Equivalence

This section obtains the asymptotic distribution of the GEL estimator

under Assumption ID. Theorem 2 shows that the weakly identified parameters of θ0 are estimated inconsistently and their GEL estimator has a nonstandard limiting distribution whereas the GEL estimator of the strongly identified parameters is n1/2-consistent but no longer asymptotically normal. Analogous results are available for LIML or more generally for GMM; see Phillips (1984) and Stock and Wright (2000, Theorem 1). The rather surprising finding is that the first-order asymptotic theory under ID is identical for all GEL estimators

, as long as ρ satisfies Assumption ρ.


The proof of Theorem 2 uses a second-order Taylor expansion of

in λ about 0 in which the only impact of ρ asymptotically is through ρ1 and ρ2, which are both −1.

This is in contrast to the asymptotic theory for k-class estimators under weak identification. As shown in Staiger and Stock (1997, Theorem 1), the nonstandard asymptotic distribution of the k-class estimator depends on κ defined by n(k − 1) →d κ. Therefore, LIML and 2SLS are not first-order equivalent under weak identification.

If defined, let

For θ = (α′,β′)′ ∈ Θ and bRpB let

The next theorem establishes the asymptotic behavior of

under Assumption ID.

THEOREM 2. Suppose Assumptions Θ, ID, M, and ρ are satisfied. Then

Remark 1. Theorem 2(ii) is analogous to Theorem 1 in Stock and Wright (2000, p. 1062) for GMM. Note that from (A.5) in the Appendix

. Moreover, using the proof of Theorem 2 it can be shown that

Therefore, like

, although n1/2-consistent,

admits a nonstandard asymptotic distribution (see also Caner, 2003). If pA = 0, where all parameters are strongly identified,

, where M2 := M2(β0), Ω := Ω(β0), and Δ := Δ(β0). The covariance matrix reduces to Ω−1MM2(Ω) in the i.i.d. case.

The proof of Theorem 2 also provides a formula (equation (A.7) in the Appendix) for b*(α) := arg minbRpB Pαb for α ∈ A. In particular, if pA = 0, (A.7) shows that

where

The matrix V0) simplifies to (M2′Ω−1M2)−1 in the i.i.d. case, and thus the preceding formula coincides with Theorem 3.2 of Newey and Smith (2004). However, the asymptotic variance matrix of

in the time series context is in general different from that in Newey and Smith, and the estimator

as defined previously would thus be inefficient. Block methods as in Kitamura (1997) or kernel-smoothing methods as in Smith (2001) can be used for efficient GEL estimation in a time series context with strong identification. In the case pA > 0, the fact that the asymptotic distribution of the strongly identified parameter estimates is in general nonnormal is a consequence of the inconsistent estimation of α0.

Remark 2. Given the nonnormal asymptotic distribution of the GMM and GEL parameter estimates under weak identification (established in Theorem 1 in Stock and Wright, 2000, and our Theorem 2, respectively) the asymptotic distribution of test statistics based on these estimators, such as t- or Wald statistics, will also be nonstandard and non-pivotal. Furthermore, these limiting distributions depend on quantities that cannot be consistently estimated (see Staiger and Stock, 1997, p. 564), which militates against their use for the construction of test statistics or confidence regions for θ0. The next section introduces alternative approaches that overcome these difficulties.

Example 1 (continued)

The specialization of Theorem 2 to the i.i.d. linear IV model of Example 1 was derived in Guggenberger (2003).

3. TEST STATISTICS

This section proposes several statistics to test the simple hypothesis H0 : θ = θ0 versus H1 : θ ≠ θ0. We establish that they are asymptotically pivotal quantities and have limiting chi-square null distributions under Assumption ID. Therefore these statistics lead to tests whose size properties are unaffected by the strength or weakness of identification. For the time series setup considered here there are at least two other statistics that share this property, namely, the Anderson and Rubin (1949) AR-statistic and the Kleibergen (2001, 2002a) K-statistic. The first statistic, GELRρ(θ), that we describe may be interpreted as a likelihood ratio statistic. It has an asymptotic χ2(k) null distribution and is first-order equivalent to the AR-statistic. The second set of statistics in this section, Sρ(θ) and LMρ(θ), are based on the first-order conditions (FOC) of

with respect to θ. Each has a limiting χ2(p) null distribution and is first-order equivalent to the K-statistic. For a recent survey on robust inference methods with weak identification, see Stock, Wright, and Yogo (2002).

To motivate the first statistic, consider an i.i.d. setting. In this case, GELREL(θ) may be thought of in terms of the empirical likelihood ratio statistic R(θ), where

Newey and Smith (2004) show that under certain conditions including {zi : i ≥ 1} i.i.d.,

. Thus ln R(θ) can be interpreted as the criterion function of the EL estimator.

The criterion function R(θ) can be interpreted as a nonparametric likelihood ratio. Indeed, for fixed θ ∈ Θ and given gi(θ), (i = 1,…,n), the numerator of R(θ) is the maximal probability of observing the given sample gi(θ), (i = 1,…,n), over all discrete probability distributions (w1,…,wn) on the sample such that the sample analogue

of the moment condition (2.1) is satisfied. The denominator (1/n)n equals the unrestricted maximal probability. It can then be shown that

, where λ(θ0) is the vector of Lagrange multipliers associated with the k moment restrictions

in the constrained maximization problem (3.1). Therefore, the renormalized criterion function of the EL estimator has an interpretation as −2 times the logarithm of the likelihood ratio statistic R(θ0).
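The inner maximization behind R(θ) has a well-known closed form: the optimal weights are wi = 1/(n(1 + λ′gi(θ))), where λ solves the first-order condition Σi gi(θ)/(1 + λ′gi(θ)) = 0. A scalar (k = 1) sketch with simulated stand-ins for gi(θ), solving for λ by Newton's method:

```python
import numpy as np

rng = np.random.default_rng(6)
n = 50
g = rng.standard_normal(n)   # scalar g_i(theta) at a fixed theta (k = 1)

# inner EL problem: maximize sum_i log w_i  s.t.  sum_i w_i = 1, sum_i w_i g_i = 0.
# Solution: w_i = 1/(n (1 + lam g_i)), with lam solving sum_i g_i/(1 + lam g_i) = 0.
lam = 0.0
for _ in range(50):          # Newton iterations on the FOC in lam
    r = 1.0 + lam * g
    foc = (g / r).sum()
    dfoc = -(g ** 2 / r ** 2).sum()
    lam -= foc / dfoc

w = 1.0 / (n * (1.0 + lam * g))
assert np.isclose(w.sum(), 1.0)     # probabilities sum to one
assert abs((w * g).sum()) < 1e-10   # restricted moment condition holds exactly

R = np.prod(n * w)                  # R(theta): restricted over unrestricted max prob
assert R <= 1.0                     # the constraint can only lower the maximum
```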

Generalizing from the i.i.d. to the time series setup and from EL to arbitrary ρ, the first statistic we consider is the renormalized GEL criterion function (2.7):

Second, following Kleibergen's (2001) suggestion, made in the GMM framework, of constructing a statistic from the FOC with respect to θ, we construct a test statistic based on the GEL FOC for

. If the minimum of the objective function

is obtained in the interior of Θ, the score vector with respect to θ must equal 0 at

, i.e.,

For θ ∈ Θ, define the k × p matrix

Thus, (3.3) may be written as

. The test statistic is therefore given as a quadratic form in the score vector λ(θ)′Dρ(θ) evaluated at the hypothesized parameter vector θ

where ρ is any function satisfying Assumption ρ and Δ̂(θ) is a consistent estimator of Δ(θ). We also consider the following variant of Sρ(θ):

that substitutes

for λ(θ) in Sρ(θ); see (A.8) in the Appendix, where it is shown that

. The statistic LMρ(θ) is similar to a GMM Lagrange multiplier statistic given in Newey and West (1987). To make the origin of the preceding test statistics clearer, we adopt the notation LMρ(θ) and Sρ(θ), respectively, in place of Kρ(θ) and KρL(θ) previously given to the statistics in Guggenberger (2003). To use these statistics for hypothesis tests or for the construction of confidence regions one needs a consistent estimator Δ̂(θ) of Δ(θ). Under assumptions given later, the sample average n−1Σi=1n gi(θ)gi(θ)′ may be used for Δ̂(θ).


Alternatively, instead of using uniform weights in the definition of

one could use empirical probabilities that are associated with each GEL estimator; see Section 2 of Newey and Smith (2004). However, preliminary Monte Carlo simulations (not reported here) showed no clear improvement in the performance of the test statistics.

Note that when ρ(v) = −(1 + v)2/2, i.e., in the case of a GEL CUE criterion, the GEL statistics Sρ(θ) (3.5) and LMρ(θ) (3.6) are then identical and given in closed form by (3.6) with

in the definition of DCUE(θ), where

denotes any generalized inverse of

.

As noted previously the GEL and GMM CUE are numerically identical. However, although the structures of the two statistics coincide, in general, the statistic LMCUE(θ) and the Kleibergen (2001) K-statistic based on the GMM CUE are not identical. The reason is that, in general, the first-order derivatives of the GMM and GEL CUE objective functions are not equal. The K-statistic in Kleibergen (2001) is based on the FOC of the GMM CUE criterion

. It replaces DCUE(θ) in LMCUE(θ) by

, where

is an estimator for

. The particular assumptions made on Δ(θ) determine the choice of estimators

. If the sample average

is used for

for

, then the statistic LMCUE(θ) and the K-statistic coincide.

Some intuition for these test statistics is provided under strong identification. Under strong identification, Newey and Smith (2004) show consistency of the GEL estimator θ̂. Therefore, if the FOC (3.3) hold at θ̂, then, at least asymptotically, they also hold at the true value θ0. The statistic Sρ(θ) can then be interpreted as a quadratic form whose criterion is expected to be small at the true value θ0. If, however, all parameters are weakly identified this argument is no longer valid. From Theorem 2, θ̂ is no longer consistent for θ0. Therefore, although the FOC hold at θ̂, this does not imply automatically that they also approximately hold at the true value θ0. However, it can be shown that under weak identification the FOC λ(θ)′Dρ(θ) = 0′ not only hold at θ̂ w.p.a.1 but are satisfied to order Op(T−1) uniformly over θ ∈ Θ. Thus, under weak identification the FOC do not pin down the true value θ0. Consequently, the power properties of hypothesis tests for θ0 based on the statistics Sρ(θ) or LMρ(θ) should be expected to be better under strong rather than weak identification. Size properties, however, are not affected by the strength or weakness of identification. This is corroborated by the Monte Carlo simulations reported subsequently and theoretically by Theorem 4.

We now consider the asymptotic distribution of GELRρ(θ) evaluated at a vector θ = (α′,β0′)′, thus allowing for a fixed alternative in the weakly identified components. We need the following local version of Assumption M.

Assumption Mθ. Let θ = (α′,β0′)′ ∈ A × {β0}. Suppose

Note that for θ = (α′,β0′)′, Mθ(iii) and ID imply that

. Thus, under Mθ(iii) and ID the assumption

in Mθ(ii) is equivalent to the assumption

for θ = (α′,β0′)′, which is Assumption D′ in Stock and Wright (2000). The assumption rules out many interesting time series cases. However, it is more general than an i.i.d. assumption. The assumption allows for m.d.s. and thus covers various intertemporal Euler equations applications and regression models with m.d.s. errors. As in Stock and Wright, a possible application is the intertemporally separable consumption capital asset pricing model (CCAPM). Without assuming

, a limiting chi-square distribution would no longer obtain in the following theorems. The problem arises because the GEL estimator as defined in (2.6) is not efficient in the time series setup considered here.

THEOREM 3. Suppose ID, Mθ(i)–(iii), and ρ hold for θ = (α′,β0′)′. Then

where the noncentrality parameter δ = m1(θ)′Δ(θ)−1m1(θ). In particular,

To describe the asymptotic distribution of the statistics LMρ(θ0) and Sρ(θ0), we need the following additional assumptions. Write Gi(θ) = (GiA(θ), GiB(θ)), where the matrices GiA(θ) and GiB(θ) are of column dimension pA and pB, respectively.

Let

be an open neighborhood of θ.

Assumption Mθ (continued).

In Mθ(vii) write

Assumption Mθ(iv) allows the interchange of the order of integration and differentiation in Assumption ID, i.e.,

. It also guarantees that M1n(θ) → M1(θ) := (∂m1 /∂θ)(θ). Assumptions ID and Mθ thus imply that

where by ID the limit matrix (0,M2(θ0)) is of deficient rank pB. Assumption Mθ(v) is comparable to Mθ(ii), where

was assumed and extends Mθ(ii) to cross-product terms in vec GiA(θ) and gi(θ). Assumption Mθ(vi) contains additional weak technical conditions that guarantee that certain expressions in the proof of Theorem 4 are asymptotically negligible.

The key assumption is Mθ(vii), which is a stronger version of Mθ(iii) and states that a central limit theorem (CLT) holds simultaneously for the centered gi(θ) and part of the derivative matrix, namely, vec GiA(θ). Write

, where

. As shown in the proof of Theorem 4, for θ = (α′,β0′)′, Assumptions ID, ρ, Mθ(i)–(vi), and

imply that D →p (0,M2(θ0)). Therefore, the probability limit of

is not invertible without renormalization. Define D* := DΛ where the p × p diagonal matrix Λ := diag(n1/2,…,n1/2,1,…,1) with first pA diagonal elements equal to n1/2 and the remainder equal to unity. Hence,

In the proof of Theorem 4 we show that under Assumptions ID, ρ, and Mθ(i)–(vi)

Assumption Mθ(vii), in particular the full rank assumption on V(θ), ensures that

has full rank w.p.a.1. Assumption Mθ(vii) is closely related to Assumption 1 of Kleibergen (2001). Unlike Kleibergen (2001), however, we assume ID, which, as just shown, requires us to be specific about which part of the derivative matrix Gi(θ) together with gi(θ) satisfies a CLT with full rank covariance matrix, namely, GiA(θ), which corresponds to the weakly identified parameters. Assumption ID possesses the advantage that we can obtain the asymptotic distribution of the test statistics under fixed alternatives of the form θ = (α′,β0′)′ and therefore derive asymptotic power results.

THEOREM 4. Suppose ID, Mθ(i)–(vii), and ρ hold for θ = (α′,β0′)′. Then,

where the random p-vector W(α) is defined in (A.11) in the Appendix, ζ ∼ N(0,Ip), and W and ζ are independent. We have W(α0) ≡ 0, and therefore

Remark 1. The proof of Theorem 4 crucially hinges on the fact that n1/2λ(θ0) and vec Dρ(θ0) (suitably normalized) from the FOC (3.3) are asymptotically jointly normally distributed and, moreover, are asymptotically independent. A similar result is critical also for the Kleibergen (2001) K-statistic, which generalizes the Brown and Newey (1998) analysis of efficient GMM moment estimation to the weakly identified setup. Therefore, by using an appropriate weighting matrix in the quadratic forms (3.5) and (3.6) that define Sρ(θ0) and LMρ(θ0), respectively, we immediately obtain the limiting χ2(p) null distribution of Theorem 4.

Remark 2. Theorems 3 and 4 provide a straightforward method to construct confidence regions or hypothesis tests on θ0. For example, a critical region for a test of the hypothesis H0 : θ = θ0 versus H1 : θ ≠ θ0 at significance level r is given by {GELRρ(θ0) ≥ χr2(k)}, where χr2(k) denotes the (1 − r)-critical value from the χ2(k) distribution. A (1 − r)-confidence region for θ0 is obtained by inverting the just-described test, i.e., {θ ∈ Θ : GELRρ(θ) ≤ χr2(k)}. Confidence regions and hypothesis tests based on Sρ(θ) and LMρ(θ) may be constructed in a similar fashion.
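The test-inversion construction can be illustrated with a small sketch. We invert an AR-type statistic (used here in place of the GEL statistics purely for brevity; the grid, model, and function names are our own choices) over a grid of θ values in a scalar linear IV model:

```python
import numpy as np

def ar_stat(theta, y, Y, Z):
    """AR-type statistic; chi2(k) limiting null distribution under homoskedasticity."""
    n, k = Z.shape
    u = y - Y.ravel() * theta
    Pu = Z @ np.linalg.solve(Z.T @ Z, Z.T @ u)   # projection of the residual on Z
    s_uu = (u @ u - u @ Pu) / (n - k)            # u' M_Z u / (n - k)
    return (u @ Pu) / s_uu

rng = np.random.default_rng(2)
n, k, theta0 = 500, 3, 0.0
Z = rng.standard_normal((n, k))
V = rng.standard_normal(n)
Y = Z @ np.ones(k) + V                           # strong instruments
y = Y * theta0 + 0.8 * V + rng.standard_normal(n)

crit = 7.815                                     # chi2(3) 5% critical value
grid = np.linspace(-1.0, 1.0, 201)
region = [t for t in grid if ar_stat(t, y, Y, Z) <= crit]
```

Each θ retained in `region` is a value not rejected at the 5% level; with strong instruments the region is typically a short interval around θ0, whereas with weak instruments it can be wide or unbounded.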

Remark 3. Theorems 3 and 4 demonstrate that GELRρ(θ0), Sρ(θ0), and LMρ(θ0) are asymptotically pivotal statistics under weak and strong identification. Therefore, the size of tests based on these statistics should not vary much with the strength or weakness of identification in finite samples. However, these results also show that under weak identification hypothesis tests based on these statistics are inconsistent. For example, the noncentrality parameter δ does not diverge to infinity as the sample size increases, and therefore the rejection rate under the alternative does not converge to 1. This is intuitively reasonable because if identification is weak one cannot learn much about α0 from the data.

Remark 4. A drawback of GELRρ(θ0) is that its limiting null distribution has degrees of freedom equal to k, the number of moment conditions, rather than the dimension of the parameter vector. In general, this has a negative impact on the power properties of hypothesis tests based on GELRρ(θ0) in overidentified situations. On the other hand, the limiting null distributions of Sρ(θ0) and LMρ(θ0) have degrees of freedom equal to p. Therefore the power of tests based on these statistics should not be negatively affected by a high degree of overidentification. The AR-statistic of Anderson and Rubin (1949) has a χ2(k) limiting null distribution also. Kleibergen (2002b) shows that it equals the sum of two independent statistics, namely, the K-statistic (Kleibergen, 2002a) and a J-statistic (Hansen, 1982) that test location and misspecification, respectively. Mutatis mutandis, a similar decomposition may be given for the GELRρ(θ0) statistic in terms of Sρ(θ0) or LMρ(θ0).

Remark 5. Stock and Wright (2000, Theorem 2) derive the asymptotic distribution under weak identification of the analogue of GELRρ(θ0) for the (GMM) CUE, which is also a χ2(k) null distribution. In the i.i.d. context, Qin and Lawless (1994, Theorem 2) propose the statistic

to test the hypothesis H0 : θ = θ0, which is shown to be asymptotically distributed as χ2(p) under strong identification. However, because of the dependence on

, this statistic is no longer asymptotically pivotal and thus leads to size-distorted tests under weak identification.

Example 1 (continued)

Guggenberger (2003) derives the results given in Theorems 3 and 4 under Assumptions Θ, ID′, M′, and ρ allowing for alternatives α ∈ A and Pitman drift in the data generating process (DGP) for the strongly identified parameters to assess the asymptotic power properties of the tests; i.e., ID′ holds and for some fixed b ∈ RpB, y = Y(θ0 + n−1/2(0′,b′)′) + u. To simplify our presentation here we ignore the possibility of Pitman drift. Results for the i.i.d. linear IV model follow directly from the preceding theorems because, as is easily shown, Assumptions ID′, M′, ρ, and V(θ) > 0 imply Mθ for any consistent estimator

. In particular, V(θ) has a simple representation. For θ = (α′,β0′)′, Ω(θ) = Δ(θ) and ΔAA(θ) = E(ViAViA′ ⊗ Zi Zi′), where ViA consists of the first pA components of Vi in (2.3).

4. SUBVECTOR TEST STATISTICS

We now assume that interest is focused on the subvector α0 ∈ RpA of θ0 = (α0′,β0′)′. However, we no longer maintain Assumption ID. In particular, α0 need not be weakly identified.

To adapt the test statistics of Section 3 to the subvector case, the basic idea is to replace β by a GEL estimator

. To make this idea more rigorous, define the GEL estimator

for β0:

We usually write

where there is no ambiguity. A requirement of the analysis that follows is that

. Therefore, we assume that the nuisance parameters β0 that are not involved in the hypothesis under test are strongly identified; see Theorem 2. On the other hand, the components of α0 can be weakly or strongly identified, and in Assumption IDα, which follows, we assume the former holds for α01 and the latter for α02, where α0 = (α01′,α02′)′. The main advantage of the subvector test statistics introduced in this section is that asymptotically they have accurate sizes independent of whether α0 is weakly or strongly identified. This property is not shared by classical tests based on Wald, likelihood ratio, or Lagrange multiplier statistics. In general, they have correct size only if θ0 is strongly identified. In contrast, the subvector tests in Guggenberger and Wolf (2004) based on a subsampling approach have exact asymptotic sizes without any additional identification assumption.

Let θ = (α1′,α2′,β′)′, where αj ∈ Aj, Aj ⊂ RpAj (j = 1,2), pA1 + pA2 = pA and β ∈ B, B ⊂ RpB. Also let

be an open neighborhood of (α02′,β0′)′.

Assumption A. The true parameter θ0 = (α01′,α02′,β0′)′ is in the interior of the compact space Θ, where Θ = A1 × A2 × B.

Assumption IDα.

Assumption IDα implies that α01 and (α02′,β0′)′ are weakly and strongly identified, respectively. Assumptions A and IDα adapt Assumptions Θ and ID in Section 2 for the subvector case.

Let

We now introduce the subvector statistics. Recall the definition of GELRρ(θ) in (3.2). The GELRρ subvector test statistic is given by

We need the following technical assumptions for our derivation of its asymptotic distribution. To obtain theoretical power properties, we again allow a fixed alternative for the weakly identified components, α01 here.

For a1 ∈ A1 let a := (a1′,α02′)′ be a fixed vector whose strongly identified component α02 is the same as the corresponding component of the true parameter vector θ0. Let

be an open neighborhood of β0.

Assumption Mα.

Mutatis mutandis, Mα has the same interpretation as Mθ. For example Mα(ii) guarantees that

is bounded and

is bounded away from zero w.p.a.1, whereas Mα(iv) and IDα imply that for

we have

. By IDα this last matrix has full column rank for β = β0. If we assume that the GiB(a,β), (i = 1,…,n), viewed as functions of β, are continuous at β0 a.s., then we can simplify Mα(vi) to

. A similar comment holds for the assumptions in the continuation of Mα that follows.

THEOREM 5. Assume 1 ≤ pA < p. Suppose Assumptions A, IDα, Mα(i)–(vi), and ρ hold for some a1 ∈ A1 and a = (a1′,α02′)′. Then,

where the noncentrality parameter δ is given by

where M(·) := (∂m2 /∂β)(·) ∈ Rk×pB. In particular,

Theorem 5 confirms that the subvector statistic GELRρsub(α0), like the full vector statistic GELRρ(θ0), is asymptotically pivotal. As before, this result can be used to construct hypothesis tests and confidence regions for α0.

We now generalize the statistics Sρ and LMρ to the subvector case. The asymptotic variance matrices of

differ from those of

. Therefore different weighting matrices are required in the quadratic forms defining these subvector statistics. In the Appendix (see proofs of Theorems 5 and 6) it is shown that for a = (a1′,α02′)′,

exists w.p.a.1 and that

is asymptotically normal with covariance matrix M(a), where for α = (α1′,α2′)′ ∈ RpA

The first pA elements of the FOC (3.3), evaluated at

, are

For α ∈ RpA, let

which coincides with the definition of Dρ(θ) in (3.4) when α is the full vector θ. Similarly to Sρ(θ) in (3.5) the subvector test statistic Sρsub(α) is constructed as a quadratic form in the vector

from (4.3) with weighting matrix given by M(α) in (4.2). Let

be an estimator of M(α) that is given by replacing the expressions Δ(θαβ0) and M2(θ0) in M(α) by consistent estimators,

say. By Assumptions Mα(ii) and Mα(iv)–(v) we may choose

when α = a = (a1′,α02′)′. Hence,

The statistic LMρsub(α) is constructed like Sρsub(α) but replaces

by

. Thus,

Let

be an open neighborhood of β0, and

.

Assumption Mα (continued).

In Mα(x) write

Assumption Mα(x) is the key assumption and plays a role similar to Mθ(vii). Assumption Mα(vii) extends Mα(iv) by explicitly assuming that integration and differentiation can be exchanged in the expectation of

, whereas Mα(iv) gave primitive conditions that imply that exchange holds for

. Assumptions Mα(v), Mα(vii), and IDα imply that

, which is an important result used in the proof of the next theorem; in a linear model this result is trivially true because

. Assumptions Mα(vii)–(x) are analogous to Mθ(iv)–(vii) with A1 and A2 now playing the roles of A and B, respectively.

THEOREM 6. Assume 1 ≤ pA < p. Suppose Assumptions A, IDα, Mα(i)–(x), and ρ hold for a = (a1′,α02′)′ for some a1 ∈ A1. Then,

where the random pA-vector Wα(a) is defined in (A.22) of the Appendix, ζα ∼ N(0,IpA), and ζα and Wα are independent. We have Wα(α0) ≡ 0, and therefore

Remark 1. The subvector statistics are asymptotically pivotal when elements of α0 are arbitrarily weakly or strongly identified. This result can be used for the construction of test statistics or confidence regions that have correct size or coverage probabilities asymptotically, independent of the strength or weakness of identification of α0. Compared to the GMM subvector statistic of Kleibergen (2001), the statistics Sρsub(a) and LMρsub(a) are appealing because of their compact formulation.

Remark 2. Even though it is unclear how the asymptotic distribution of these test statistics might be derived without assuming strong identification of β0, it is obvious that neither Sρsub(α0) nor LMρsub(α0) would converge to a χ2(pA) random variable. In general the quantities

in Sρsub(α0) and

in LMρsub(α0) are no longer asymptotically normal because of their dependence on the GEL estimator

, which as a direct consequence of Theorem 2 has a nonstandard limiting distribution if β0 is not strongly identified. Moreover, the subvector version of the K-statistic of Kleibergen (2001) also experiences the same problem in these circumstances as the (GMM) CUE of β0 has a nonnormal limiting distribution under weak identification (see Stock and Wright, 2000). Somewhat surprisingly, however, Monte Carlo simulations by the authors (not reported here) for the subvector statistic LMρsub(α0) indicate that its size properties are not much affected by the strength or weakness of identification of β0. Startz, Zivot, and Nelson (2004) report similar findings from Monte Carlo simulations for the subvector test statistic of Kleibergen (2001).

Example 1 (continued)

Guggenberger (2003) derives the corresponding results. Note that Assumptions Θ, ID′, M′, and ρ, together with the assumption that Vα(a,β0) has full column rank, imply Assumption Mα. In the linear model the components of Vα(a,β0) can be easily calculated. For example, ΔA1A1 = E(ViA1ViA1′ ⊗ Zi Zi′), where ViA1 is the subvector of Vi that contains its first pA1 components. Let Y = (X,W) denote the partition of the included variables of the structural equation into exogenous and endogenous variables. Partition θ0 = (θX0′,θW0′)′ and θ = (θX′,θW′)′ conformably. Valid inference is possible on any subvector of θW0 if the appropriate assumptions given previously are fulfilled. Unfortunately, if the dimension of the parameter vector not subject to test is large, then the argmin-sup problem in (4.1) is computationally very involved. Premultiplication of equation (2.2) by MX should ameliorate this problem through the elimination of the exogenous variables; i.e., MX y = MXWθW0 + MXu. If Assumption Mα holds for θW0 = (αW0′,βW0′)′ and gi(θW) := MX,i′(y − WθW)Zi, where MX,i denotes the ith row of MX written as a column vector, valid inference may be undertaken on αW0.
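The elimination of the exogenous variables can be checked numerically: since MX X = 0, premultiplying y = XθX0 + WθW0 + u by MX yields MX y = MX W θW0 + MX u as an exact algebraic identity. A small sketch of this (dimensions and parameter values are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)
n, p_x, p_w = 100, 2, 1
X = rng.standard_normal((n, p_x))        # included exogenous variables
W = rng.standard_normal((n, p_w))        # remaining included variables
u = rng.standard_normal(n)
theta_X0 = np.array([1.0, -1.0])
theta_W0 = np.array([0.5])
y = X @ theta_X0 + W @ theta_W0 + u

# Annihilator matrix M_X = I - X (X'X)^{-1} X'
MX = np.eye(n) - X @ np.linalg.solve(X.T @ X, X.T)
lhs = MX @ y                             # M_X y
rhs = MX @ W @ theta_W0 + MX @ u         # M_X W theta_W0 + M_X u
```

The identity holds because MX annihilates the X columns, so the projected equation no longer involves θX0.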

5. SIMULATION EVIDENCE

To assess the efficacy of the hypothesis tests introduced in Theorems 3 and 4, we conduct a set of Monte Carlo experiments. The DGP is given by model (2.2) considered in Example 1 and is similar to that in Kleibergen (2002a, p. 1791), namely,

There is a single right-hand-side endogenous variable and no included exogenous variables, so p = 1; Z ∼ N(0,Ik ⊗ In), where k is the number of instruments and n the sample size. In the just-identified case, i.e., k = 1, Π = Π1, whereas in the overidentified case, k > 1, Π = (Π1,0′)′; i.e., irrelevant instruments are added.

Interest focuses on testing the scalar null hypothesis H0 : θ0 = 0 versus the alternative hypothesis H1 : θ0 ≠ 0.

5.1. Error Distributions

We examine several distributions for (u,V) to investigate the robustness of the test statistics to potentially different features of the error distribution. All designs are constructed from Design (I) by modifying the distribution of the structural error u.

  • Design (I): (u,V)′ ∼ N(0,Σ ⊗ In), where Σ ∈ R2×2 with diagonal elements unity and off-diagonal elements ρuV.
  • Design (II): ui in Design (I) is modified as ui /(wi /r)1/2, where wi is a χ2(r) random variable independent of ui and Vi, i.e., ui is tr-distributed. We fix r = 2.
  • Design (III): modifies Design (I) by exchanging ui2 − 1 for ui, i.e., ui is a recentered χ2(1) random variable.
  • Design (IV): ui from Design (I) is replaced by Bi|ui + 2| − (1 − Bi)|ui + 2| where Bi is Bernoulli (0.5,0.5) distributed and independent of all other random variables.

Design (II) examines the robustness of the performance of the test statistics to thick-tailed distributions for the structural equation error. Design (III) examines robustness with respect to asymmetric structural error distributions. In Design (IV) the structural error ui is bimodal with peaks at −2 and +2.

In addition, the impact of conditional heteroskedasticity on the performance of the test statistics is examined. Designs (IHET)–(IVHET) modify Designs (I)–(IV), respectively, replacing ui by ∥Zi∥ui.
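A generator for Designs (I)–(IV) and their heteroskedastic variants can be sketched as follows (our own code; function and argument names are assumptions, not the paper's):

```python
import numpy as np

def draw_errors(design, n, rho_uv, rng=None):
    """Draw (u, V) for Designs (I)-(IV); the (.HET) variant multiplies u_i by ||Z_i|| afterward."""
    rng = np.random.default_rng() if rng is None else rng
    e = rng.standard_normal((n, 2))
    V = e[:, 0]
    u = rho_uv * e[:, 0] + np.sqrt(1.0 - rho_uv**2) * e[:, 1]   # corr(u, V) = rho_uv
    if design == "II":                      # t_2-distributed structural error
        w = rng.chisquare(2, size=n)
        u = u / np.sqrt(w / 2.0)
    elif design == "III":                   # recentered chi2(1) structural error
        u = u**2 - 1.0
    elif design == "IV":                    # bimodal structural error, peaks near -2 and +2
        B = rng.integers(0, 2, size=n)
        u = np.where(B == 1, np.abs(u + 2.0), -np.abs(u + 2.0))
    return u, V

rng = np.random.default_rng(4)
n, k, pi1 = 20000, 5, 0.1
Z = rng.standard_normal((n, k))
u, V = draw_errors("I", n, rho_uv=0.5, rng=rng)
Pi = np.r_[pi1, np.zeros(k - 1)]            # irrelevant instruments beyond the first
Y = Z @ Pi + V
y = Y * 0.0 + u                             # theta0 = 0
u_het = np.linalg.norm(Z, axis=1) * u       # the (I_HET) modification
```

For Design (I) the empirical correlation between u and V should be close to ρuV in large samples.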

5.2. Test Statistics

We calculate three versions of the statistic GELRρ(θ) in (3.2), for ρ(v) = −(1 + v)2/2 (CUE), ρ(v) = ln(1 − v) (EL), and ρ(v) = −exp v (ET). We also consider the corresponding versions for each of Sρ(θ) in (3.5) and LMρ(θ) in (3.6) with Δ(θ) replaced by Δ̂(θ). As noted previously, for CUE, Sρ(θ) and LMρ(θ) are then numerically identical. Theorems 3 and 4 present the asymptotic null distributions of these statistics.


To calculate GELRρ(θ), Sρ(θ), and LMρ(θ) for EL and ET, the globally concave maximization problem

must be solved numerically. To do so we implement a variant of the Newton–Raphson algorithm. We initialize the algorithm by setting λ equal to the zero vector. At each iteration the algorithm tries several shrinking step sizes in the search direction and accepts the first one that increases the function value compared to the previous value for λ. This procedure enforces an “uphill climbing” feature of the algorithm.
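The inner maximization can be sketched for ET, ρ(v) = −exp v, where the problem is maxλ n−1∑i ρ(λ′gi(θ)). The following is our own minimal implementation of the backtracking Newton iteration described above (not the authors' code):

```python
import numpy as np

def et_lambda(G, max_iter=100, tol=1e-9):
    """Maximize n^{-1} sum_i -exp(lambda' g_i) over lambda by Newton steps with backtracking."""
    n, k = G.shape
    lam = np.zeros(k)                              # initialize at the zero vector
    obj = lambda l: -np.mean(np.exp(G @ l))
    for _ in range(max_iter):
        w = np.exp(G @ lam)                        # weights exp(lambda' g_i)
        grad = -(G * w[:, None]).mean(axis=0)
        if np.linalg.norm(grad) < tol:
            break
        hess = -(G * w[:, None]).T @ G / n         # negative definite: globally concave problem
        step = -np.linalg.solve(hess, grad)        # Newton direction
        f0, t = obj(lam), 1.0
        while obj(lam + t * step) <= f0 and t > 1e-12:
            t *= 0.5                               # shrink step until the objective increases
        if t <= 1e-12:
            break                                  # no uphill step found: at the optimum
        lam = lam + t * step
    return lam

rng = np.random.default_rng(6)
G = rng.standard_normal((200, 3)) + 0.1            # moment contributions with nonzero mean
lam_hat = et_lambda(G)
```

Accepting only steps that increase the objective enforces the "uphill climbing" feature; at the solution the first-order condition ∑i exp(λ′gi)gi = 0 holds.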

Additional statistics considered are the Anderson–Rubin test statistic (AR) (see Anderson and Rubin, 1949), two versions of the K-statistic proposed by Kleibergen (2001, 2002a), one assuming homoskedastic errors K, the other robust to conditional heteroskedasticity KHET, the conditional likelihood ratio test LRM of Moreira (2003), and two versions of the two-stage least squares (2SLS) Wald statistic 2SLS (see, e.g., Wooldridge, 2002, pp. 98, 100), one assuming homoskedastic errors (2SLSHOM) and the other robust to conditional heteroskedasticity (2SLSHET).9

The statistics are defined as follows:

where suu(θ) := (y − Yθ)′MZ(y − Yθ)/(n − k),

where

. The statistic K(θ) (Kleibergen, 2002a), is not robust to conditional heteroskedasticity. However, a version of the K-statistic in Kleibergen (2001, equation (22)) that uses a heteroskedasticity consistent estimator for the covariance matrix of gi(θ) overcomes this drawback. For model (5.1), the statistic is given by

where

, and

. The statistic KHET(θ) is identical in structure to LMCUE(θ) except the centered components

are used in place of gi(θ) and Gi, respectively. Note that Gi := Gi(θ) does not depend on θ in a linear model. For the LRM statistic, see Moreira (2003, Sect. 3). Finally, the Wald statistics are given by

where

, and

is a conditional heteroskedasticity robust estimator for the variance of

.

Under H0 : θ0 = 0, AR(θ0) →d χ2(k) and K(θ0) →d χ2(p). In the just-identified case k = p = 1, the AR- and K-statistics coincide. Both Wald statistics are asymptotically distributed as χ2(1) under H0 : θ = θ0 and strong identification.

5.3. Size Comparison

Empirical sizes are calculated using 5% asymptotic critical values for all of the preceding statistics for DGPs (5.1) corresponding to all 54 possible combinations of sample size n = 50, 100, 250, number of instruments k = 1, 5, 10, SF and RF error correlation ρuV = 0.0, 0.5, 0.99, and RF coefficient Π1 = 0.1, 1.0 for Designs (I)–(IV) and (IHET)–(IVHET).10

10. Kleibergen (2002a) generates one sample for the instrument matrix Z from a N(0,Ik ⊗ In) distribution and then keeps Z fixed across R = 10,000 samples of the DGP (5.1) using Design (I) with n = 100 and ρuV = 0.99. We simulate a new matrix Z with each sample of the DGP (5.1). As a consequence, our results do not coincide with those reported by Kleibergen (2002a).

To investigate the sensitivity of the results in Kleibergen (2002a) to the choice of Z, we iterated Kleibergen's (2002a) procedure 100 times; i.e., each time we simulated a matrix Z of instruments that we then kept fixed across R = 1,000 samples of the DGP (5.1). We found strong dependence of the numerical results of the Monte Carlo experiment on Z. For example, in the case Π1 = 1, k = 1, the power of the K-statistic to reject the hypothesis θ0 = 0 when θ0 = 0.4 varied from about 60% to 95% in the 100 experiments. For the specific Z that Kleibergen (2002a) generates, he reports power of about 93% (see his Figure 1, p. 1793).

We use R = 3,000 replications of each DGP. We also use 3,000 realizations each of χ2(1) and χ2(k − 1) random variables to simulate the critical values of Moreira's LRM statistic. For the results reported in Tables 1 and 2, which follow, we use R = 10,000 replications. We refer to Π1 = 0.1 and 1.0 as the “weak” and “strong” instrument cases, respectively. The value of ρuV allows the degree of endogeneity of Y to be varied. Whereas for ρuV = 0, Y is exogenous, Y is strongly endogenous for ρuV = 0.99. We include the just-identified case, k = 1, and two overidentified cases, k = 5 and 10.
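The empirical size computation can be sketched as follows for an AR-type statistic with k = 5 instruments (our own self-contained code; because the statistic is evaluated at the true θ0 = 0, the residual equals the structural error and the first-stage coefficient Π1 does not enter, consistent with the null rejection rates of AR being unaffected by instrument strength):

```python
import numpy as np

def ar_empirical_size(n=100, k=5, rho_uv=0.5, reps=500, seed=7):
    """Null rejection rate of an AR-type test of H0: theta0 = 0 at the 5% level."""
    rng = np.random.default_rng(seed)
    crit = 11.070                            # chi2(5) 5% critical value
    rejections = 0
    for _ in range(reps):
        Z = rng.standard_normal((n, k))
        e = rng.standard_normal((n, 2))
        u = rho_uv * e[:, 0] + np.sqrt(1.0 - rho_uv**2) * e[:, 1]
        # Under H0 the residual y - Y * 0 equals the structural error u.
        Pu = Z @ np.linalg.solve(Z.T @ Z, Z.T @ u)
        s_uu = (u @ u - u @ Pu) / (n - k)
        rejections += (u @ Pu) / s_uu > crit
    return rejections / reps

size = ar_empirical_size()
```

With a moderate number of replications the rejection rate should lie near the 5% nominal level, up to simulation noise and the finite-sample gap between the χ2(k) approximation and the exact F-based distribution.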

Size results for Design (I) at 5% significance level

Size results for Design (IHET) at 5% significance level

We now describe the results for Designs (I) and (IHET) given in Tables 1 and 2, respectively, which exclude those for GELREL, SET, LMET, AR, and the case n = 100. The qualitative features of the size results for GELREL, SET, and LMET are identical to their ET/EL counterparts. For k = 1, AR coincides with K, and, for k > 1, we find that in most cases K has better size properties than AR. We report K and 2SLSHOM for the homoskedastic and KHET and 2SLSHET for the heteroskedastic design. We now discuss the results for the homoskedastic case of Design (I).

First, we consider the separate effects of Π1,n, ρuV, and k on the size results.

The most important finding is that the empirical sizes of all statistics except 2SLS show little or no dependence on Π1 (additional Monte Carlo results show that this holds true even for the completely unidentified case Π1 = 0). However, those for 2SLS depend crucially on the strength or weakness of identification. Although for Π1 = 1.0, 2SLS has reliable size properties in many cases, with weak instruments its size ranges over the entire interval from 0% to 100%.

In general, increasing n leads to more accurate size across all statistics, especially for those that perform poorly for smaller n. For example, the 2SLS statistics, GELRET, and SEL severely overreject in overidentified and strongly endogenous cases when n = 50. Even though they still overreject for n = 250, the rejection rates are much closer to the 5% significance level.

It is easily shown that the rejection rates under the null hypothesis for AR and GELRρ are independent of the value of ρuV. The slight dependence of the size results in Table 1 on ρuV results from the use of different samples. For the remaining statistics other than 2SLS, there is little dependence of the results on ρuV and no clear pattern to how ρuV affects their size properties. For 2SLS, however, increasing ρuV leads to severe overrejection when combined with overidentification, especially in the weak instrument case.

Increasing the number of instruments k usually leads to overrejection for 2SLS, GELRET, and SEL. For 2SLS this is especially true under weak identification and/or strong endogeneity. All the other statistics show little dependence on k.

We now turn to a comparison of performance across statistics. The 2SLS statistics should not be used with weak instruments or in strongly endogenous overidentified situations; in all other cases, 2SLS has competitive size properties. The statistics GELRET and SEL severely overreject in overidentified problems when the sample size is small. Overall, then, the statistics LMEL, K, and LRM lead to the best size results. The statistics LMCUE and GELRCUE are only runners-up because they tend to underreject, especially in overidentified situations. Across the 36 experiments in Table 1, the sizes of LMEL, LMCUE, GELRCUE, K, and LRM are in the intervals [4.0,6.2], [1.6,5.3], [1.3,5.3], [4.8,8.6], and [4.3,10.3], respectively. The statistics K and LRM usually slightly overreject. In 22 of the 36 cases, the size of LMEL comes closest to the 5% significance level across all the statistics. The corresponding numbers for LMCUE, GELRCUE, K, and LRM are 8, 8, 9, and 7. Based on Design (I), LMEL seems to have a slight advantage over the remaining statistics.

We now discuss the size results for Design (IHET) summarized in Table 2. As most findings are similar to those discussed for Design (I), we only describe the new features.

The statistics 2SLSHOM, K, and LRM perform uniformly worse than in Design (I). Tests based on these statistics severely overreject, especially in the just-identified case. Their performance does not improve when n increases. We therefore report results for the heteroskedasticity robust versions 2SLSHET and KHET. Their size properties and those of the statistics based on GEL methods do not appear to be negatively influenced by the presence of conditional heteroskedasticity. This is to be expected from our earlier theoretical discussion of the GEL statistics, which does not assume conditional homoskedasticity. Of course, 2SLSHET still suffers in weakly identified models, and GELRET and SEL perform poorly in overidentified situations for small n. Rejection rates of the test statistics LMEL, LMCUE, GELRCUE, KHET, and LRM across the 36 experiments of Table 2 are in the intervals [3.6,6.4], [1.6,5.1], [1.0,5.1], [4.3,9.2], and [7.8,28.8], respectively. In 21 of the 36 cases, the size of LMEL comes closest to the 5% significance level across all the statistics. The test statistic KHET wins in 18 cases.

In summary, the only statistics with accurate size properties across all experiments of Designs (I) and (IHET) are LMEL, LMCUE, GELRCUE, and KHET. Based on the preceding results it seems that LMEL enjoys a slight advantage over the other statistics. From the 72 cases in Tables 1 and 2 the empirical size of LMEL is closest to the nominal 5% in 43 cases across all statistics.

The qualitative features of the size results for Designs (II)–(IV) and (IIHET)–(IVHET) are generally very similar to those of their normal counterparts, Designs (I) and (IHET). For this reason, we do not include additional tables for these designs. One striking difference, however, occurs for 2SLS under weak identification with χ2(1) (Design (III)) and bimodal errors (Design (IV)). Rejection rates across these 54 combinations for 2SLSHOM are in the intervals [0.1,7.1] and [0.0,5.4], respectively. Whereas with normal errors and weak identification 2SLS severely overrejects, with these error distributions it severely underrejects.

To summarize this size study, LMEL, LMCUE, GELRCUE, and KHET have reliable size properties across all designs that appear independent of both the strength or weakness of identification and possible conditional heteroskedasticity. The test statistic 2SLS performs very poorly in the presence of weak instruments. The LRM statistic performs well in homoskedastic cases but poorly otherwise.

5.4. Power Comparison

Empirical power curves are calculated for the preceding statistics and DGPs (5.1) corresponding to all 16 possible combinations of sample size n = 100, 250, number of instruments k = 5, 10, SF and RF error correlation ρuV = 0.5, 0.99, and RF coefficient Π1 = 0.1, 1.0 for each of the error distributions of Designs (I)–(III). Except for LRM, we report size-corrected power curves at the 5% significance level, using critical values calculated in the preceding size comparison. We do so because size correction of LRM is not straightforward as a result of the conditional construction of LRM and, as shown before, for Designs (I)–(III), LRM has empirical size very close to nominal at the 5% significance level.

We use R = 1,000 replications from the DGP (5.1) with various values of the true value θ0. The null hypothesis under test is again H0 : θ0 = 0. For weak identification (Π1 = 0.1), θ0 takes values in the interval [−4.0,4.0] whereas, with strong identification (Π1 = 1.0), θ0 ∈ [−0.4,0.4]. We use 1,000 realizations each of χ2(1) and χ2(k − 1) random variables to simulate the critical values of LRM. For those results reported in the figures that follow, we use 10,000 replications from (5.1).

Detailed results are presented only for the statistics LMEL, K, LRM, and 2SLSHET. The statistics LMCUE, LMEL, and LMET display a very similar performance across almost all scenarios. We therefore only report results for LMEL. We do not report power results for the statistics SEL and SET because, as seen earlier, their size properties appear to be quite poor for the sample sizes considered here. When k = 1, AR and K are numerically identical. In overidentified cases, K generally performs better than AR. We therefore do not report results for AR (see Kleibergen, 2002a, for a comparison of K and AR). Similarly, GELRCUE is numerically identical to LMρ for k = 1 but leads to a less powerful test for k > 1. Also EL and ET versions of GELRρ have rather unreliable size properties for the sample sizes considered here. Therefore we do not report detailed results for GELRρ.

We first focus on the separate effects of Π1, n, ρuV, and k on power.

With strong identification all statistics have a U-shaped power curve. With the exception of 2SLSHET, the lowest point of the power curve is usually achieved at θ0 = 0. In Designs (I) and (II), 2SLSHET is usually biased, taking on its lowest value at a negative θ0 value in the interval [−0.2,0.0]. When θ0 is weakly identified, the power curves of LMEL, K, and LRM are generally very flat across all θ0 values, often only slightly exceeding the significance level of the test. This is especially true for LMEL and K but less so for LRM, which is generally more powerful than the other two statistics in this situation. There is one exception when the power of the three tests is high. In Design (I) with ρuV = 0.99, although being flat at about 5% for positive θ0 values, the power curves reach a sharp peak of almost 100% around θ0 = −1. The reason for this anomaly is most easily explained in the case k = 1, where

. We have

, which in Design (I) with Π1 = 0.1 equals 1 + 2θ0 ρuV + (1.01)θ02. If ρuV = 0.99, this expression is minimized at around θ0 = −0.98, where it equals approximately 0.03. Therefore, this peak is caused by

taking on large values for θ0 in the neighborhood of −1.

For negative θ0 values with |θ0| > 1, power falls quickly, reaching between 20% and 50% across the different designs at θ0 = −4.
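The minimization claim above is elementary calculus; a quick numeric check of the quoted expression (using only the quadratic stated in the text, not the paper's full variance formula):

```python
import numpy as np

# In Design (I) with Pi_1 = 0.1, the text gives the expression
# 1 + 2*theta*rho_uV + 1.01*theta**2; at rho_uV = 0.99 it should be
# minimized near theta = -0.98 with minimum value about 0.03.
rho_uV = 0.99
theta = np.linspace(-2.0, 0.0, 200001)
f = 1.0 + 2.0 * theta * rho_uV + 1.01 * theta**2

print(theta[np.argmin(f)])  # analytic minimizer: -rho_uV/1.01 ≈ -0.9802
print(f.min())              # analytic minimum: 1 - rho_uV**2/1.01 ≈ 0.0296
```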

In contrast to the power curves of LMEL, K, and LRM, the power curve of 2SLSHET retains its U-shaped form for Π1 = 0.1. In many cases, the power curve reaches values close to 100% when |θ0| is close to 4.

As is to be expected, the tests are more powerful when n is increased from 100 to 250. This holds uniformly across all statistics and designs, with a more pronounced power increase in the strongly identified cases.

There does not seem to be a systematic effect due to ρuV as it varies with the specific design. For reasons explained previously, the shape of the power curves can change dramatically in Design (I) when ρuV is increased from 0.5 to 0.99 if Π1 = 0.1.

In most cases, there is little change in the power functions when k is increased from 5 to 10. In general, if the power function changes, then power is slightly lower for larger k.

We now compare the power functions across statistics. Figures 1a–c display the power curves of the four statistics for Designs (I)–(III) in the case Π1 = 1.0, n = 250, ρuV = 0.5, and k = 5 (the figures for Π1 = 0.1 and for the other parameter combinations are available upon request). The qualitative comparison for the other parameter combinations is very similar, and we therefore focus on these representative cases.

Power curves, strong instrument. (a) Normal errors, (b) t(2) errors, (c) χ2 errors.

When identification is weak, the test based on LRM is usually more powerful than those based on LMEL and K. The power gain from using LRM is quite substantial for negative θ0 values but less so for positive θ0. However, the Wald test 2SLSHET is by far the most powerful test in all three designs. Except for some small negative θ0 values, its power curve uniformly dominates the power curves of the other tests. Recall, though, that 2SLSHET has unreliable size properties under weak identification.

When identification is strong, LMEL uniformly dominates LRM and K in Designs (II) and (III) (see Figures 1b and 1c). However, LRM and K uniformly dominate LMEL in Design (I) (see Figure 1a). This result is to be expected. On the one hand, the LMEL test is based on nonparametric GEL methods. On the other hand, LRM and K are motivated within the normal model framework. Although the power gain of LMEL is small in Design (III), it is substantial in Design (II). Therefore, LMEL should be used when errors have thick tails.

With strong identification, the Wald test is the most powerful test for positive θ0 values. For negative θ0 values, its performance varies from being most powerful in Design (III) to least powerful in Design (I). These results confirm that the Wald test is a reasonable choice when identification is strong.

Overall, therefore, the power study does not lead to an unambiguous ranking of the different tests considered here. Which test is most appropriate depends on the particular error distribution and degree of identification. We find that with strong identification and errors with thick tails or asymmetric errors, LMEL seems to be the best choice whereas with normal errors LRM and K appear preferable. When identification is weak, LRM generally dominates K and LMEL in terms of power although as noted previously the size properties of LRM deteriorate substantially in the presence of heteroskedasticity.

APPENDIX: Proofs

Proof of Equation (2.4). Let fi := supθ∈Θ ∥gi(θ)∥. Define K := supi≥1 Efiξ < ∞. Let ε > 0 and choose a positive C ∈ R such that K/Cξ < ε. Then

Pr(max1≤i≤n fi > Cn1/ξ) ≤ Σi=1,…,n Pr(fi > Cn1/ξ) ≤ Σi=1,…,n Efiξ/(Cξn) ≤ K/Cξ < ε,

where the first inequality follows from Pr(A ∪ B) ≤ Pr(A) + Pr(B) and the second uses the Markov inequality. It follows that (max1≤i≤n fi)n−1/ξ = Op(1) and thus (max1≤i≤n fi) = op(n1/2) by ξ > 2. Thus (2.4) implies M(i). █

Proof of Lemma 1. ID holds trivially. By (2.2) and (2.3), gi(θ) = (yi − Yi′θ)Zi = Zi(Zi′Π + Vi′)(θ0 − θ) + Ziui. Next max1≤i≤n supθ∈Θ ∥gi(θ)∥ = op(n1/2) is established. An application of the Borel–Cantelli lemma shows that for real-valued i.i.d. random variables Wi such that EWi2 < ∞, max1≤i≤n |Wi| = op(n1/2); see Owen (1990, Lemma 3) for a proof. By the definition of gi(θ) and the triangle inequality,

supθ∈Θ ∥gi(θ)∥ ≤ C(∥Zi∥2∥Π∥ + ∥Zi∥∥Vi∥ + ∥Zi∥|ui|), where C := max{supθ∈Θ ∥θ0 − θ∥, 1} < ∞ by compactness of Θ.

By Assumption M′(iii), we can apply the just-mentioned result to each of the three summands in the preceding inequality, which proves the result.

Next M(ii) is shown. By the i.i.d. assumption, Ω(θ) = limn→∞ Egi(θ)gi(θ)′, and continuity and boundedness in M(ii) follow immediately from M′(iii) and compactness of Θ. The same is true for the Op(1) statement in M(ii). Finally, uniform convergence follows from the weak law of large numbers and compactness of Θ.

Next M(iii) is proved. Because

, we only have to deal with the empirical process

Finite-dimensional joint convergence follows from the CLT and M′(iii), and stochastic equicontinuity follows from the fact that (θ0 − θ) enters Ψn(·,θ) linearly:

where the last expression is bounded by δOp(1) by the CLT. Furthermore, Θ is compact by assumption. The proposition in Andrews (1994, p. 2251) can thus be applied, which yields the desired result. █

The following proofs are straightforward generalizations of the Guggenberger (2003) proofs for the i.i.d. linear model to the more general context considered here. We require three lemmas that are modified versions of Lemmas A1–A3 in Newey and Smith (2004) for the proofs of our theorems. These modifications are necessary because unlike Newey and Smith we need to work with weakly and strongly identified parameters and do not make an i.i.d. assumption.

For each n ∈ N, let Θn ⊂ Θ. Let cn := n−1/2 max1≤i≤n supθ∈Θn ∥gi(θ)∥. Let Λn := {λ ∈ Rk : ∥λ∥ ≤ n−1/2cn−1/2} if cn > 0 and Λn = Rk otherwise. Write “u.w.p.a.1” for “uniformly over θ ∈ Θn w.p.a.1.”

LEMMA 7. Assume max1≤i≤n supθ∈Θn ∥gi(θ)∥ = op(n1/2).

Then

, where

is defined in (2.5).

Proof. The case cn = 0 is trivial, and thus wlog cn ≠ 0 can be assumed. By assumption cn = op(1), and the first part of the statement follows from

which also immediately implies the second part. █
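These lemmas control the inner GEL maximization of n−1 Σ ρ(λ′gi(θ)) over λ. As a concrete illustration (a minimal numerical sketch of ours, not the paper's implementation), the exponential-tilting member of the GEL family, ρ(v) = 1 − exp(v), which satisfies ρ(0) = 0 and ρ1(0) = ρ2(0) = −1, can be solved by Newton's method:

```python
import numpy as np

def et_lambda(G, iters=25):
    """Maximize (1/n) * sum_i rho(lam'g_i) over lam for rho(v) = 1 - exp(v).

    G is the (n, k) array of moment evaluations g_i(theta); the objective
    is globally concave in lam, so full Newton steps from lam = 0 converge."""
    n, k = G.shape
    lam = np.zeros(k)
    for _ in range(iters):
        w = np.exp(G @ lam)                    # exp(lam'g_i); rho1(v) = rho2(v) = -exp(v)
        grad = -(G * w[:, None]).mean(axis=0)  # (1/n) sum_i rho1(lam'g_i) g_i
        hess = -(G.T * w) @ G / n              # (1/n) sum_i rho2(lam'g_i) g_i g_i'
        lam = lam - np.linalg.solve(hess, grad)
    return lam
```

At the maximizer, the tilted weights proportional to exp(λ′gi) set the reweighted sample moments to zero, which is the empirical analogue exploited throughout the appendix.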

LEMMA 8. Suppose

for some

uniformly over θ ∈ Θn and Assumption ρ holds.

Then

satisfying

exists u.w.p.a.1,

uniformly over θ ∈ Θn.

Proof. Without loss of generality cn ≠ 0, and thus Λn can be assumed compact. For θ ∈ Θn, let λθ ∈ Λn be such that

. Such a λθ ∈ Λn exists u.w.p.a.1 because a continuous function takes on its maximum on a compact set and by Lemma 7 and Assumption ρ,

(as a function in λ for fixed θ) is C2 on some open neighborhood of Λn u.w.p.a.1. We now show that actually

u.w.p.a.1, which then proves the first part of the lemma. By a second-order Taylor expansion around λ = 0, there is a λθ* on the line segment joining 0 and λθ such that for some positive constants C1 and C2

u.w.p.a.1, where the second inequality follows as max1≤i≤n ρ2(λθ*′gi(θ)) < −½ u.w.p.a.1 from Lemma 7, continuity of ρ2(·) at zero, and ρ2(0) = −1. The last inequality follows from

. Now, (A.1) implies that

, the latter being Op(n−1/2) uniformly over θ ∈ Θn by assumption. It follows that λθ ∈ int(Λn) u.w.p.a.1. To prove this, let ε > 0. Because λθ = Op(n−1/2) uniformly over θ ∈ Θn and cn = op(1), there exist Mε < ∞ and nε ∈ N such that Pr(∥n1/2λθ∥ ≤ Mε) > 1 − ε/2 uniformly over θ ∈ Θn and Pr(cn−1/2 > Mε) > 1 − ε/2 for all n ≥ nε. Then Pr(λθ ∈ int(Λn)) = Pr(∥n1/2λθ∥ < cn−1/2) ≥ Pr((∥n1/2λθ∥ ≤ Mε) ∧ (cn−1/2 > Mε)) > 1 − ε for n ≥ nε uniformly over θ ∈ Θn.

Hence, the FOC for an interior maximum

hold at λ = λθ u.w.p.a.1. By Lemma 7,

, and thus by concavity of

(as a function in λ for fixed θ) and convexity of

it follows that

, which implies the first part of the lemma. From before, λθ = Op(n−1/2) uniformly over θ ∈ Θn. Thus the second part and, by (A.1), the third part of the lemma follow. █
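In outline, the expansion (A.1) delivers the Op(n−1/2) bound on λθ via the standard GEL argument; a sketch in the lemma's notation, with C1 the generic constant of the text:

```latex
0 = \hat{P}_\rho(\theta,0) \le \hat{P}_\rho(\theta,\lambda_\theta)
  = \lambda_\theta'\,\hat{g}(\theta)
    + \tfrac{1}{2}\,\lambda_\theta'\Big[n^{-1}\sum_{i=1}^{n}
      \rho_2\big(\lambda_\theta^{*\prime} g_i(\theta)\big)\,
      g_i(\theta)g_i(\theta)'\Big]\lambda_\theta
  \le \|\lambda_\theta\|\,\|\hat{g}(\theta)\| - C_1\|\lambda_\theta\|^2,
```

so that ∥λθ∥ ≤ C1−1∥ĝ(θ)∥ = Op(n−1/2) uniformly over θ ∈ Θn.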

Suppose Θ1 × Θ2 ⊂ Θ, ΘiRpi, p1 + p2 = p. Partition θ0 = (θ01′,θ02′)′ accordingly and assume θ02 ∈ Θ2. For d1 ∈ Θ1 define

By u.w.p.a.1 we denote “uniformly over d1 ∈ Θ1 w.p.a.1.”

LEMMA 9. Suppose max1≤i≤n supθ∈Θ1×Θ2 ∥gi(θ)∥ = op(n1/2),

for some

uniformly over d1 ∈ Θ1, and Assumption ρ holds.

Then

uniformly over d1 ∈ Θ1.

Proof. Without loss of generality

can be assumed. Define

. Note that λ ∈ Λn and thus

uniformly over θ ∈ Θn w.p.a.1 (see Lemma 7 with Θn := Θ1 × Θ2). By a second-order Taylor expansion around λ = 0, there is a

on the line segment joining 0 and λ such that for some positive constants C1 and C2

u.w.p.a.1, where the first inequality follows from Lemma 7, which implies that

. The second inequality follows by

. The definition of

implies

uniformly over d1 ∈ Θ1. Combining equations (A.2) and (A.3) implies

uniformly over d1 ∈ Θ1. █

Proof of Theorem 2. (i) We first show consistency of

. By Assumption ID and M(iii)

, where m2(β) = 0 if and only if β = β0. Therefore,

is a sufficient condition for consistency of

. Applying Lemma 8 to the case Θn = {θ0} gives

. Assumption M(ii) implies

for some κ < ∞, and thus Lemma 9 (applied to the case p1 = 0, Θ2 = Θ) implies

.

Next we establish n1/2-consistency of

. By consistency of

and Assumption M(ii)

for some ε > 0, and thus Lemma 8 for the case

implies that the FOC

have to hold at

, where

and λ(θ), for given θ ∈ Θ, is defined in Lemma 8. Expanding the FOC in λ around 0, there exists a mean value

between

(that may be different for each row) such that

where the matrix

has been implicitly defined. Because

, Lemma 7 and Assumption ρ imply that

. By Assumption M(ii), it follows that

and thus

is invertible w.p.a.1 and

. Therefore

w.p.a.1. Inserting this into a second-order Taylor expansion for

(with mean value λ* as in (A.1)) it follows that

The same argument as for

proves

. We therefore have

. By the definition of

,

By Assumption ID, we have up to op(1) terms that

. The same analysis as in the proof of Lemma A1 in Stock and Wright (2000, p. 1091, line six from the top) can now be applied to prove n1/2-consistency of

, where the symmetric matrix

plays the role of

in Stock and Wright. Note that in equation (A.4) in Stock and Wright, Assumption M(iii) of bounded sample paths w.p.a.1 is used. Finally, note that

is bounded away from zero w.p.a.1.

(ii) By Assumption M(iii)

and by ID we have for some mean-vector β between β0 and β0 + n−1/2b (that may differ across rows)

Because the latter expression is bounded, it follows that

, where u.w.p.a.1 stands for “uniformly over (α,b) ∈ A × BM w.p.a.1.” Therefore, by Lemma 8, λ(θαb) such that

exists u.w.p.a.1 and λ(θαb) = Op(n−1/2) uniformly over (α,b) ∈ A × BM. This implies that the FOC

have to hold at λ = λ(θαb) and θ = θαb u.w.p.a.1. Expanding the FOC and using the same steps and notation as in part (i), it follows that

, and upon inserting this into a second-order Taylor expansion of

we have

u.w.p.a.1. The matrices

converge to Ω((α′,β0′)′) uniformly over A × BM. By M(iii),

, and therefore

on A × BM.

By part (i) of the proof and Lemma 3.2.1 in van der Vaart and Wellner (1996, p. 286) it follows that

For given α ∈ A, one can calculate arg minb∈RpB Pαb by solving the FOC for b. Writing Ω for Ω((α′,β0′)′) and M2 for M2(θ0), the result is

This holds in particular for α = α*. It follows that α* = arg minα∈A Pαb*(α). █

Proof of Theorem 3. Applying Lemma 8 to the case Θn = {θ}, it follows that

exists such that

. Using the same steps and notation as in the proof of Theorem 2 leads to

w.p.a.1, where by Mθ(ii) both

converge in probability to Δ(θ). By Mθ(iii),

from which the result follows. █

Proof of Theorem 4. Using Mθ(i)–(iii) and an argument similar to the one that led to (A.5) we have

and therefore the statement of the theorem involving Sρ(θ) follows immediately from the one for LMρ(θ). Hence we only deal with the statistic LMρ(θ) given in equation (3.8).

First, we show that the matrix D* is asymptotically independent of

. For notational convenience from now on we omit the argument θ; e.g., we write gi for gi(θ). By a mean-value expansion about 0 we have ρ1(λ′gi) = −1 + ρ2(ξi)gi′λ for a mean value ξi between 0 and λ′gi, and thus by (A.8) and the definition of Λ we have

where for the last equality we use (3.7) and Assumptions Mθ(v)–(vi). By Assumption Mθ(v) it thus follows that

where w1 := vec(0,−M2(θ0),0) ∈ RkpA+kpB+k and

M and v have dimensions (kpA + kpB + k) × (kpA + k) and (kpA + k) × 1, respectively. By Assumption ID, Mθ(vii), and (3.7), v →d N(w2,V(θ)), where w2 := ((vec M1A)′,m1′)′ and M1A are the first pA columns of M1. Therefore

where Ψ := ΔAA − ΔA Δ−1ΔA′ is positive definite. Equation (A.9) proves that

are asymptotically independent.

We now derive the asymptotic distribution of LMρ(θ). Denote by D and g the limiting normal distributions of

, respectively (see equation (A.9)). Subsequently we show that the function h : Rk×p → Rp×k defined by h(D) := (D′Δ−1D)−1/2D′ for D ∈ Rk×p is continuous on a set C ⊂ Rk×p with Pr(D ∈ C) = 1. By the continuous mapping theorem and Mθ(v) we have

By the independence of D and g, the latter random variable is distributed as W + ζ, where the random p-vector W is defined as

ζ ∼ N(0,Ip), and W and ζ are independent. Note that for θ = θ0, W ≡ 0. From (A.10) the statement of the theorem follows.

We now prove the continuity claim for h. Note that h is continuous at each D that has full column rank. It is therefore sufficient to show that D has full column rank a.s. From (A.9) it follows that the last pB columns of D equal −M2(θ0), which has full column rank by assumption. Define

and the k × p-matrix

has linearly dependent columns}. Clearly, O is closed and therefore Lebesgue-measurable. Furthermore, O is contained in the zero set of a nonzero polynomial in the matrix entries (the sum of squared p × p minors, which is not identically zero because −M2(θ0) has full column rank) and thus has Lebesgue measure 0. For the first pA columns of D, DpA say, it has been shown that vecDpA is normally distributed with full rank covariance matrix Ψ. This implies that for any measurable set O* ⊂ RkpA with Lebesgue measure 0, it holds that Pr(vec(DpA) ∈ O*) = 0, in particular, for O* = O. This proves the continuity claim for h. █
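The a.s. full-rank property used in this step is easy to see numerically; the following is an illustration of ours (with arbitrary dimensions), not part of the proof:

```python
import numpy as np

# A k x pA matrix with i.i.d. standard normal entries -- i.e., a vectorized
# normal vector with full-rank covariance -- has full column rank with
# probability one; rank deficiency is confined to a Lebesgue-null set.
rng = np.random.default_rng(0)
k, p_A = 5, 3
ranks = [np.linalg.matrix_rank(rng.standard_normal((k, p_A)))
         for _ in range(1000)]
print(min(ranks))  # full column rank p_A = 3 in every draw
```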

Proof of Theorem 5. By Assumptions

, and by Lemmas 8 and 9 (applied to Θn = {θaβ0} and Θ1 = {a}, Θ2 = B, respectively) we have

. Assumption IDα then implies consistency of

. Applying Lemma 8 to the case

implies that the FOC for λ must hold in the definition of

(see equation (A.4)). Then repeating the analysis that leads to (A.6) in the proof of Theorem 2, we have by Mα(ii)

The next goal is to derive the asymptotic distribution of

. Our analysis follows Newey and Smith (2004); see their proof of Theorem 3.2. Differentiating the FOC (A.4) with respect to λ yields the matrix

, which by Mα(ii) converges in probability to −Δ(θaβ0), which is nonsingular. Therefore, the implicit function theorem implies that there is a neighborhood of

where the solution to the FOC, say

, is continuously differentiable w.p.a.1. The envelope theorem then implies

w.p.a.1. Also, a mean-value expansion of (A.4) in (β,λ) about (β0,0) yields (where gi(θ) inside ρ1 is kept constant at

)

where (β′,λ′) are mean values on the line segment that joins

that may be different for each row. Combining the pB rows of (A.13) with the k rows of (A.14) we get

where the (pB + k) × (pB + k) matrix M has been implicitly defined. By Mα(ii) and Mα(iv)–(vi) the matrix M converges in probability to M, where (writing M for M((α020)))

and (omitting the argument θaβ0)

It follows that M is nonsingular w.p.a.1. Equation (A.15) implies that w.p.a.1

An expansion of

in β around β0 and the preceding lead to

for some appropriate mean value θ. Note that

which has rank kpB. From (A.12), GELRρsub(a) →d ξ′Δ(θaβ0)−1MM(Δ(θaβ0))ξ, where ξ ∼ N(m1aβ0),Δ(θaβ0)), which concludes the proof. █

Proof of Theorem 6. As in the proof of Theorem 5,

. Hence, the result for LMρsub(a) implies the result for Sρsub(a).

As in the proof of Theorem 4 renormalize D* := Dρ(a)Λ, where the diagonal pA × pA matrix Λ := diag(n1/2,…,n1/2,1,…,1) has first pA1 diagonal elements equal to n1/2 and the remaining pA2 elements equal to unity. We now show that

are asymptotically independent. By a mean-value expansion about θaβ0 and Assumption Mα(vii) we have for some mean value

(that may be different for each row)

where we have used (A.16) for the last equation. Assumptions Mα(vii) and IDα imply

(recall that m2 does not depend on α1) and thus

Proceeding exactly as in the proof of Theorem 4, using (A.17), (A.19), and Assumptions Mα(vii)–(ix), it follows that

where M ∈ R(kpA1+kpA2+k)×(kpA1+k) and

where the arguments (α020) in M and (∂m2 /∂α2) and θaβ0 in ΔA1 and Δ are omitted. By Mα(x), v is asymptotically normal with full rank covariance matrix Vα(θaβ0), and thus the asymptotic covariance matrix of

is given by MVα(θaβ0)M′. For independence of

, the upper right k(pA1 + pA2) × k-submatrix of MVα(θaβ0)M′ must be 0. This is clear for the kpA2 × k-dimensional submatrix, and we only have to show that the kpA1 × k upper right submatrix

is 0. Using (A.18), the matrix in (A.21) equals −ΔA1 Δ−1PM(Δ)MM(Δ)Δ, which is clearly 0. This proves the independence claim.

Now denote by D and g the limiting normal distributions of

, implied by (A.20). Recall M(a) = Δ−1MM(Δ) (see equation (4.2)). If the function h : Rk×pA → RpA×k defined by h(D) := (D′M(a)D)−1/2D′ for D ∈ Rk×pA is continuous on a set C ⊂ Rk×pA with Pr(D ∈ C) = 1, then by the continuous mapping theorem

By (A.17) and (A.18) the latter variable is distributed as Wα(a) + ζα, where

Therefore the theorem is proved once we have proved the continuity claim for h. For this step of the proof we need the positive definiteness assumption for Vα(θaβ0) in Mα(x). It is enough to show that with probability 1, rank(MM(Δ)D) = pA. Because the span of the columns of M equals the kernel of MM(Δ) and rank(M) = pB, the latter condition holds if rank(M,D) = p. Denote by DpA2 the last pA2 columns of D, which by (A.20) equal −(∂m2 /∂α2). By Assumption IDα, the matrix (∂m2 /∂(α2′,β′)′)((α020)) has rank pA2 + pB, and it remains to show that with probability one, this matrix is linearly independent of the first pA1 columns of D, DpA1 say. Using (A.20) and Vα(θaβ0) > 0, the covariance matrix of vecDpA1 is easily shown to have full rank pA1k. An argument analogous to the last step in the proof of Theorem 4 can then be applied to conclude the proof. █

REFERENCES

Anderson, T.W. & H. Rubin (1949) Estimators of the parameters of a single equation in a complete set of stochastic equations. Annals of Mathematical Statistics 21, 570–582.
Andrews, D.W.K. (1994) Empirical process methods in econometrics. In R. Engle & D. McFadden (eds.), Handbook of Econometrics, vol. 4, 2247–2294. North-Holland.
Brown, B.W. & W.K. Newey (1998) Efficient semiparametric estimation of expectations. Econometrica 66, 453–464.
Caner, M. (2003) Exponential Tilting with Weak Instruments: Estimation and Testing. Working paper, University of Pittsburgh.
Dufour, J. (1997) Some impossibility theorems in econometrics with applications to structural and dynamic models. Econometrica 65, 1365–1387.
Guggenberger, P. (2003) Econometric essays on generalized empirical likelihood, long-memory time series, and volatility. Ph.D. thesis, Yale University.
Guggenberger, P. & R.J. Smith (2003) Generalized Empirical Likelihood Tests in Time Series Models with Potential Identification Failure. Working paper, UCLA and University of Warwick.
Guggenberger, P. & M. Wolf (2004) Subsampling Tests of Parameter Hypotheses and Overidentifying Restrictions with Possible Failure of Identification. Working paper, UCLA.
Hansen, L.P. (1982) Large sample properties of generalized method of moments estimators. Econometrica 50, 1029–1054.
Hansen, L.P., J. Heaton, & A. Yaron (1996) Finite-sample properties of some alternative GMM estimators. Journal of Business & Economic Statistics 14, 262–280.
Imbens, G. (1997) One-step estimators for over-identified generalized method of moments models. Review of Economic Studies 64, 359–383.
Imbens, G. (2002) Generalized method of moments and empirical likelihood. Journal of Business & Economic Statistics 20, 493–506.
Imbens, G., R.H. Spady, & P. Johnson (1998) Information theoretic approaches to inference in moment condition models. Econometrica 66, 333–357.
Kitamura, Y. (1997) Empirical likelihood methods with weakly dependent processes. Annals of Statistics 25, 2084–2102.
Kitamura, Y. & M. Stutzer (1997) An information-theoretic alternative to generalized method of moments estimation. Econometrica 65, 861–874.
Kleibergen, F. (2001) Testing parameters in GMM without assuming that they are identified. Econometrica, forthcoming.
Kleibergen, F. (2002a) Pivotal statistics for testing structural parameters in instrumental variables regression. Econometrica 70, 1781–1805.
Kleibergen, F. (2002b) Two Independent Pivotal Statistics That Test Location and Misspecification and Add-Up to the Anderson–Rubin Statistic. Working paper, Brown University.
Moreira, M.J. (2003) A conditional likelihood ratio test for structural models. Econometrica 71, 1027–1048.
Nelson, C.R. & R. Startz (1990) Some further results on the exact small sample properties of the instrumental variable estimator. Econometrica 58, 967–976.
Newey, W.K. (1985) Generalized method of moments specification testing. Journal of Econometrics 29, 229–256.
Newey, W.K. & R.J. Smith (2004) Higher order properties of GMM and generalized empirical likelihood estimators. Econometrica 72, 219–255.
Newey, W.K. & K.D. West (1987) Hypothesis testing with efficient method of moments estimation. International Economic Review 28, 777–787.
Otsu, T. (2003) Generalized Empirical Likelihood Inference under Weak Identification. Working paper, University of Wisconsin.
Owen, A. (1988) Empirical likelihood ratio confidence intervals for a single functional. Biometrika 75, 237–249.
Owen, A. (1990) Empirical likelihood ratio confidence regions. Annals of Statistics 18, 90–120.
Pakes, A. & D. Pollard (1989) Simulation and the asymptotics of optimization estimators. Econometrica 57, 1027–1057.
Phillips, P.C.B. (1984) The exact distribution of LIML: I. International Economic Review 25, 249–261.
Phillips, P.C.B. (1989) Partially identified econometric models. Econometric Theory 5, 181–240.
Qin, J. & J. Lawless (1994) Empirical likelihood and general estimating equations. Annals of Statistics 22, 300–325.
Smith, R.J. (1997) Alternative semi-parametric likelihood approaches to generalized method of moments estimation. Economic Journal 107, 503–519.
Smith, R.J. (2001) GEL Criteria for Moment Condition Models. Working paper, University of Bristol. Revised version CWP 19/04, cemmap, IFS and UCL. http://cemmap.ifs.org.uk/wps/cwp0419.pdf.
Staiger, D. & J.H. Stock (1997) Instrumental variables regression with weak instruments. Econometrica 65, 557–586.
Startz, R., E. Zivot, & C.R. Nelson (2004) Improved inference in weakly identified instrumental variables regression. In Frontiers of Analysis and Applied Research: Essays in Honor of Peter C.B. Phillips. Cambridge University Press.
Stock, J.H. & J.H. Wright (2000) GMM with weak identification. Econometrica 68, 1055–1096.
Stock, J.H., J.H. Wright, & M. Yogo (2002) A survey of weak instruments and weak identification in generalized method of moments. Journal of Business & Economic Statistics 20, 518–529.
van der Vaart, A.W. & J.A. Wellner (1996) Weak Convergence and Empirical Processes. Springer.
Wooldridge, J. (2002) Econometric Analysis of Cross Section and Panel Data. MIT Press.
Figure 0. Size results for Design (I) at 5% significance level.

Figure 1. Size results for Design (IHET) at 5% significance level.