
GENERALIZED EMPIRICAL LIKELIHOOD ESTIMATORS AND TESTS UNDER PARTIAL, WEAK, AND STRONG IDENTIFICATION

Published online by Cambridge University Press:  19 July 2005

Patrik Guggenberger
Affiliation:
UCLA
Richard J. Smith
Affiliation:
cemmap, UCL and IFS and University of Warwick

Abstract

The purpose of this paper is to describe the performance of generalized empirical likelihood (GEL) methods for time series instrumental variable models specified by nonlinear moment restrictions, as in Stock and Wright (2000, Econometrica 68, 1055–1096), when identification may be weak. The paper makes two main contributions. First, we show that all GEL estimators are first-order equivalent under weak identification. The GEL estimator under weak identification is inconsistent and has a nonstandard asymptotic distribution. Second, the paper proposes new GEL test statistics that have chi-square asymptotic null distributions independent of the strength or weakness of identification. Consequently, unlike those for Wald and likelihood ratio statistics, the size of tests formed from these statistics is not distorted by the strength or weakness of identification. Modified versions of the statistics are presented for tests of hypotheses on parameter subvectors when the parameters not under test are strongly identified. Monte Carlo results for the linear instrumental variable regression model suggest that tests based on these statistics have very good size properties even in the presence of conditional heteroskedasticity. The tests have competitive power properties, especially for thick-tailed or asymmetric error distributions.

This paper is a revision of Guggenberger's job market paper “Generalized Empirical Likelihood Tests under Partial, Weak, and Strong Identification.” We are grateful to the editor, P.C.B. Phillips, and three referees for very helpful suggestions on an earlier version of this paper. Guggenberger gratefully acknowledges the continuous help and support of his adviser, Donald Andrews, who played a prominent role in the formulation of this paper. He thanks Peter Phillips and Joseph Altonji for their extremely valuable comments. We also thank Vadim Marner for help with the simulation section and John Chao, Guido Imbens, Michael Jansson, Frank Kleibergen, Marcelo Moreira, Jonathan Wright, and Motohiro Yogo for helpful comments. Aspects of this research were presented at the 2003 Econometric Society European Meetings; the York Econometrics Workshop 2004; Séminaire Malinvaud, CREST-INSEE; and seminars at Albany, Alicante, Austin (Texas), Brown, Chicago, Chicago GSB, Harvard/MIT, Irvine, ISEG/Universidade Tecnica de Lisboa, Konstanz, Laval, Madison (Wisconsin), Mannheim, Maryland, NYU, Penn, Penn State, Pittsburgh, Princeton, Rice, Riverside, Rochester, San Diego, Texas A&M, UCLA, USC, and Yale. We thank all the seminar participants. Guggenberger and Smith received financial support through a Carl Arvid Anderson Prize Fellowship and a 2002 Leverhulme Major Research Fellowship, respectively.

Type
Research Article
Copyright
© 2005 Cambridge University Press

1. INTRODUCTION

It is often the case that the instrumental variables available to empirical researchers are only weakly correlated with the endogenous variables. That is, identification is weak. Phillips (1989), Nelson and Startz (1990), and a large literature following these early contributions show that in such situations classical normal and chi-square asymptotic approximations to the finite-sample distributions of instrumental variable (IV) estimators and statistics can be very poor. For example, even though likelihood ratio and Wald test statistics are asymptotically chi-square, use of chi-square critical values can lead to extreme size distortions in finite samples (see Dufour, 1997). The purpose of this paper is to ascertain the performance of generalized empirical likelihood (GEL) methods (Newey and Smith, 2004; Smith, 1997, 2001) for time series IV models specified by nonlinear moment restrictions when identification may be weak (as in Stock and Wright, 2000). In particular, the paper makes two principal contributions. First, the asymptotic distribution of the GEL estimator is derived for a weakly identified setup. Second, the paper proposes new, theoretically and computationally attractive GEL test statistics. The asymptotic null distribution of these statistics is chi-square under partial (Phillips, 1989), weak (Stock and Wright, 2000), and strong identification. Thus, the size of tests formed from these statistics is invariant to the strength or weakness of identification. Importantly, we also provide asymptotic power results for the various statistics suggested in this paper.

GEL estimators and test statistics are alternatives to those based on generalized method of moments (GMM); see Hansen (1982), Newey (1985), and Newey and West (1987). GEL has received considerable attention recently because of its competitive bias properties. For example, Newey and Smith (2004) show that for many models the asymptotic bias of empirical likelihood (EL) does not grow with the number of moment restrictions, whereas that of GMM estimators grows without bound, a finding that may imply favorable properties for GEL-based test statistics.

Similar to the findings in Phillips (1984, 1989) and Stock and Wright (2000) for limited information maximum likelihood (LIML), two stage least squares (2SLS), and GMM, GEL estimators of weakly identified parameters have nonstandard asymptotic distributions and are in general inconsistent. Therefore, inference based on the classical normal approximation is inappropriate under weak identification. As in Newey and Smith (2004) for strong identification, the first-order asymptotics of the GEL estimator under weak identification do not depend on the choice of the GEL criterion function. This finding is rather surprising and contrasts with 2SLS and LIML estimators, whose first-order asymptotic theory differs under weak identification.

The statistics proposed here are asymptotically pivotal in contrast to classical Wald and likelihood ratio statistics no matter what the strength of identification. The first statistic, GELRρ, is based on the GEL criterion function and may be thought of as a nonparametric likelihood ratio statistic. Two further statistics generalize the GMM-based K-statistic of Kleibergen (2001) to the GEL context. Like the K-statistic, which is a quadratic form in the first-order derivative vector of the continuous updating GMM objective function, the second GEL statistic, Sρ, is a score-type statistic, being a quadratic form in the GEL criterion score vector. The third statistic, LMρ, is similar in structure to a GMM Lagrange multiplier statistic (Newey and West, 1987) and is asymptotically equivalent to the score-type statistic, being a quadratic form in the sample moment vector. Confidence regions constructed from the K- and GEL score-type statistics are never empty and contain the continuous updating estimator (CUE) and GEL estimator, respectively. All forms of GEL statistics admit limiting chi-square null distributions with degrees of freedom equal to the number of instrumental variables or moment conditions for the first statistic and the dimension of the parameter vector for the second and third statistics. In overidentified situations, therefore, tests based on the latter statistics should be expected to have better power properties than those based on the former. In many cases, an applied researcher is interested in inference on a parameter subvector rather than the whole parameter vector. Modified versions of these statistics are therefore suggested for the subvector case when the remaining parameters are strongly identified.

Monte Carlo simulations for the independent and identically distributed (i.i.d.) linear IV model with a wide range of error distributions compare our test statistics to several others, including homoskedastic and heteroskedastic versions of the K-statistic of Kleibergen (2001, 2002a) and the similar conditional likelihood ratio statistic LRM of Moreira (2003). We find that our tests have very good size properties even in the presence of conditional heteroskedasticity. In contrast, the homoskedastic version of the K-statistic of Kleibergen (2002a) and the LRM-statistic of Moreira (2003) are size-distorted under conditional heteroskedasticity. Our tests have competitive power properties, especially for thick-tailed or asymmetric error distributions. Given the nonparametric construction of the GEL estimator, robustness of GEL-based test statistics to different error distributions should be expected.

Like the work of Stock and Wright (2000), our paper allows for both i.i.d. and martingale difference sequences (m.d.s.) but does not apply to more general time series models; see Assumption Mθ(ii), which follows. Allowing for m.d.s. observations covers various cases of intertemporal Euler equations applications and regression models with m.d.s. errors. Therefore, the extension from the i.i.d. linear (Guggenberger, 2003, Ch. 1) to the particular time series setting with nonlinear moment restrictions considered here seems worthwhile, especially because there is essentially no cost (in terms of complications of the proofs) to making this extension. The proofs for consistency and for the asymptotic distribution of the GEL estimator build on Guggenberger (2003), which adapts those given in Newey and Smith (2004) for the i.i.d. strongly identified context.

Subsequent to the i.i.d. linear version of this paper, two related papers have appeared. First, Caner (2003) derives the asymptotic distribution of the exponential tilting (ET) estimator (see Imbens, Spady, and Johnson, 1998; Kitamura and Stutzer, 1997) under weak identification with nonlinear moment restrictions for independent observations. Caner (2003) also obtains an ET version of the K-statistic for nonlinear moment restrictions. Second, Otsu (2003) considers GEL-based tests under weak identification in a more general time series setting than considered here and examines the GEL criterion function statistic GELRρ and a modified version of the K-statistic based on the Kitamura and Stutzer (1997) and Smith (2001) kernel smoothed GEL estimator that is efficient under strong identification; see also Guggenberger and Smith (2003).

The remainder of the paper is organized as follows. In Section 2, the model and the assumptions are discussed, the GEL estimator is briefly reviewed, and the asymptotic distribution of the GEL estimator under weak identification is derived. Section 3 introduces the GEL-based test statistics. We derive their asymptotic limiting distribution and show that it is unaffected by the degree of identification. Section 4 generalizes these results to hypotheses involving subvectors of the unknown parameter vector. Section 5 describes the simulation results. All proofs are relegated to the Appendix.

The following notation is used in the paper. The symbols →d, →p, and ⇒ denote convergence in distribution, convergence in probability, and weak convergence of empirical processes, respectively; for the latter, see Andrews (1994) for a definition. We abbreviate "almost surely" by "a.s." and "with probability approaching 1" by "w.p.a.1."

The space Ci(M) contains all functions that are i times continuously differentiable on M. For a symmetric matrix A, A > 0 means that A is positive definite, and λmin(A) and λmax(A) denote the smallest and largest eigenvalues of A in absolute value, respectively. By A′ we denote the transpose of a matrix A. For a full column rank matrix A ∈ Rk×p and positive definite matrix K ∈ Rk×k, we denote by PA(K) the oblique projection matrix A(A′K−1A)−1A′K−1 on the column space of A in the metric K and define MA(K) := Ik − PA(K), where Ik is the k-dimensional identity matrix; we abbreviate this notation to PA and MA if K = Ik. The symbol ⊗ denotes the Kronecker product. Furthermore, vec(M) stands for the column vectorization of the k × p matrix M; i.e., if M = (m1,…,mp) then vec(M) = (m1′,…,mp′)′. Finally, ∥M∥ equals the square root of the largest eigenvalue of M′M.
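The oblique projection notation can be checked numerically. A minimal numpy sketch, with an arbitrary illustrative choice of the positive definite metric K (the specific matrices are ours, not the paper's):

```python
import numpy as np

rng = np.random.default_rng(0)
k, p = 5, 2
A = rng.standard_normal((k, p))        # full column rank (almost surely)
K = np.eye(k) + 0.5 * np.ones((k, k))  # symmetric positive definite metric (hypothetical)

Kinv = np.linalg.inv(K)
P = A @ np.linalg.inv(A.T @ Kinv @ A) @ A.T @ Kinv  # P_A(K)
M = np.eye(k) - P                                   # M_A(K) := I_k - P_A(K)

assert np.allclose(P @ P, P)   # idempotent: an (oblique) projection
assert np.allclose(P @ A, A)   # reproduces the column space of A
assert np.allclose(M @ A, 0)   # M_A(K) annihilates the column space of A

# with K = I_k the formula reduces to the orthogonal projection P_A
P_orth = A @ np.linalg.inv(A.T @ A) @ A.T
assert np.allclose(P_orth @ P_orth, P_orth)
```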

2. ESTIMATION

This section is concerned with the asymptotic distribution of the GEL estimator when some elements of the parameter vector of interest may be only weakly identified. Intuitively, then, the moment conditions that define the model may not be particularly informative about these parameters.

2.1. Model

We consider models specified by a finite number of moment restrictions. Let {zi : i = 1,…,n} be Rl-valued data and, for each n ∈ N, gn : G × Θ → Rk a given function, where G ⊂ Rl and Θ ⊂ Rp denotes the parameter space. The model has a true parameter θ0 for which the moment condition

Egn(zi,θ0) = 0    (2.1)

is satisfied. For gn(zi,θ) we usually write gi(θ).

Example 1 (i.i.d. linear IV regression)

Guggenberger (2003, Ch. 1) discusses in detail GEL estimation and testing for this model under weak identification. The structural form (SF) equation is given by

y = Yθ0 + u,    (2.2)

and the reduced form (RF) for Y by

Y = ZΠ + V,

where y,u ∈ Rn, Y,V ∈ Rn×p, Z ∈ Rn×k, and Π ∈ Rk×p. The matrix Y may contain both exogenous and endogenous variables, Y = (X,W) say, where X ∈ Rn×pX and W ∈ Rn×pW denote the respective observation matrices of exogenous and endogenous variables. The variables Z = (X,ZW) constitute a set of instruments for the endogenous variables W. The first pX columns of Π equal the first pX columns of Ik, and the first pX columns of V are 0. Denote by Yi, Vi, Zi,… (i = 1,…,n) the ith row of the matrix Y, V, Z,… written as a column vector. Assuming the instruments and the structural error are uncorrelated, EuiZi = 0, it follows that Egi(θ0) = 0, where for each i = 1,…,n, gi(θ) := (yi − Yi′θ)Zi. Note that in this example gi(θ) depends on n if the RF coefficient matrix Π is modeled to depend on n (see Staiger and Stock, 1997), where Πn = n−1/2C for a fixed matrix C.
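Under the weak-instrument drift Πn = n−1/2C, the moment function gi(θ) = (yi − Yi′θ)Zi still satisfies Egi(θ0) = 0, so the sample moment vector at θ0 is Op(n−1/2). A small simulation sketch (all numerical values, including the error correlation, are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(1)
n, k, p = 10_000, 3, 1
C = np.ones((k, p))            # fixed matrix C; Pi_n = n^(-1/2) C models weak instruments
Pi_n = C / np.sqrt(n)
theta0 = np.array([0.5])       # true structural parameter (hypothetical)

Z = rng.standard_normal((n, k))
u = rng.standard_normal(n)                                 # structural error
V = 0.8 * u[:, None] + 0.6 * rng.standard_normal((n, p))   # RF error, correlated with u
Y = Z @ Pi_n + V                                           # endogenous regressor
y = Y @ theta0 + u

g = (y - Y @ theta0)[:, None] * Z   # g_i(theta0) = (y_i - Y_i' theta0) Z_i
gbar = g.mean(axis=0)

# E g_i(theta0) = 0, so the sample mean is O_p(n^(-1/2))
assert np.all(np.abs(gbar) < 5 / np.sqrt(n))
```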

Example 2 (conditional moment restrictions)

As in Stock and Wright (2000) the moment conditions may result from conditional moment restrictions. Assume E[h(Yi,θ0)|Fi] = 0, where h : H × Θ → Rk1, H ⊂ Rk2, and Fi is the information set at time i. Let Zi be a k3-dimensional vector of instruments contained in Fi. If gi(θ) := h(Yi,θ) ⊗ Zi, then Egi(θ0) = 0 follows by taking iterated expectations. In (2.1), k = k1k3 and l = k2 + k3.
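The dimension bookkeeping for the construction gi(θ) = h(Yi,θ) ⊗ Zi can be sketched in a few lines (the dimensions are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
k1, k3 = 2, 3                   # dim of h(Y_i, theta) and of the instrument vector Z_i
h_i = rng.standard_normal(k1)   # stand-in realization of h(Y_i, theta)
Z_i = rng.standard_normal(k3)

g_i = np.kron(h_i, Z_i)         # g_i(theta) = h(Y_i, theta) ⊗ Z_i

assert g_i.shape == (k1 * k3,)  # k = k1 * k3 unconditional moment conditions
# the Kronecker product of two vectors stacks the outer product row by row
assert np.allclose(g_i, np.outer(h_i, Z_i).ravel())
```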

2.2. Assumptions

This section is concerned with the asymptotic distribution of the GEL estimator for θ when some components of θ0 = (α0′,β0′)′, α0 say, where α0 ∈ A and A ⊂ RpA, are only weakly identified. Intuitively, this means that the moment condition (2.1) is not very informative about α0. For parameter vectors θ = (α′,β0′)′, Egn(zi,θ) may be very close to zero, not only for α close to α0 but also when α is far from α0. In that case, the restriction Egn(zi,θ0) = 0 is not very helpful for making inference on α0. Assumption ID, which follows, provides a theoretical asymptotic framework for this phenomenon and is taken from Assumption C in Stock and Wright (2000, p. 1061). We refer the reader to Stock and Wright (2000, pp. 1060–1061) for substantial detailed motivation for this assumption and an explanation of why it models α0 as weakly and β0 as strongly identified.

To describe the moment and distributional assumptions, we require some additional notation:

where, if defined, Gi(θ) := (∂gi/∂θ)(θ) ∈ Rk×p. For notational convenience, a subscript n has been omitted in certain expressions. Define the k × k matrices

Note that Δ(θ) is Ω(θ) in Stock and Wright (2000). We choose our notation for Ω(θ) for consistency with Newey and Smith (2004).

Let θ = (α′,β′)′, where α ∈ A ⊂ RpA, β ∈ B ⊂ RpB, and pA + pB = p. Also let

denote an open neighborhood of β0.

Assumption Θ. The true parameter θ0 = (α0′,β0′)′ is in the interior of the compact space Θ = A × B.

Assumption ID.

Next we detail the necessary moment assumptions.

Weak convergence here is defined with respect to the sup-norm on function spaces and the Euclidean norm on Rk.

Assumption M.

Assumption M(i) adapts Assumption 1(d) of Newey and Smith (2004), E supβ∈B∥gi(β)∥ξ < ∞ for some ξ > 2, from the i.i.d. setting with strong identification (pA = 0 and thus θ = β and Θ = B) to the weakly identified setup considered here. A sufficient condition for M(i) in the time series context and under ID is given by

supi≥1 E supθ∈Θ∥gi(θ)∥ξ < ∞ for some ξ > 2.    (2.4)

Indeed, a simple application of the Markov inequality shows that (2.4) implies max1≤i≤n supθ∈Θ∥gi(θ)∥ = Op(n1/ξ) = op(n1/2). See the Appendix for a proof. Assumption M(ii), which adapts Assumption 1(e) of Newey and Smith to the weakly identified setup, ensures that

is nonsingular for

. Assumption M(iii) is essentially the “high-level” Assumption B of Stock and Wright (2000, p. 1059) that states that Ψn obeys a functional central limit theorem. In Assumption B′, Stock and Wright provide primitive sufficient conditions for their Assumption B that can also be found in Andrews (1994). Note that the definition of weak convergence [Andrews (1994, p. 2250)] and M(iii) imply that supθ∈Θ∥Ψn(θ)∥ →d supθ∈Θ∥Ψ(θ)∥ and, thus, also that

. In the proof of Theorem 2 we require

bounded in probability.

It is interesting to note that for i.i.d. data an application of the Borel–Cantelli lemma shows that M(i) is implied by Assumption 1(d) of Newey and Smith (2004) even if ξ = 2; see Owen (1990, Lemma 3) for a proof. Hence, using Lemmas 7–9 given subsequently, their Assumption 1(d) can be weakened to ξ ≥ 2 for consistency and asymptotic normality of the GEL estimator under strong identification with i.i.d. data (see their Theorems 3.1 and 3.2). Therefore, for i.i.d. data, identical assumptions guarantee consistency and asymptotic normality for both GEL and two-step efficient GMM estimators (Hansen, 1982).
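The op(n1/2) bound on the sample maximum discussed above is easy to see numerically: when ∥gi∥ has finite moments of every order, the maximum over the sample grows far more slowly than n1/2. A simulation sketch with standard normal stand-ins for the moment functions:

```python
import numpy as np

rng = np.random.default_rng(3)
ratios = []
for n in (10**3, 10**5):
    g = rng.standard_normal(n)    # E|g_i|^xi < infinity for every xi
    # normalized maximum: max_i |g_i| / n^(1/2)
    ratios.append(np.abs(g).max() / np.sqrt(n))

# the normalized maximum shrinks as n grows: max_i |g_i| = o_p(n^(1/2))
assert ratios[1] < ratios[0]
assert ratios[1] < 0.05
```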

Example 1 (continued)

See Guggenberger (2003). For the linear IV model (2.2) Assumption ID can be expressed as the following assumption.

Assumption ID′. Π = Πn = (ΠAn,ΠB) ∈ Rk×(pA+pB), where pA + pB = p. For a fixed matrix CA ∈ Rk×pA, ΠAn = n−1/2CA and ΠB has full column rank.

Under Assumption ID′, i.i.d. data, and instrument exogeneity it follows that

Egi(θ) = n−1/2E(ZiZi′)CA(α0 − α) + E(ZiZi′)ΠB(β0 − β),

which implies that in the notation of ID(i), m1n(θ) = m1(θ) = E(ZiZi′)CA(α0 − α) and m2(β) = E(ZiZi′)ΠB(β0 − β). Also, note that Assumption ID′ includes the partially identified model of Phillips (1989). In particular, choosing pA and setting CA = 0, one obtains a model in which Π may have any desired (less than full) rank.

We now give simple sufficient conditions that imply Assumption M. Let U := (u,V).

Assumption M′.

(i) {(Ui,Zi) : i ≥ 1} are i.i.d.;

(ii) EZiUi′ = 0;

(iii) E∥Zi∥4 < ∞, QZZ := E(ZiZi′) > 0, and Eui2ZiZi′, EuiVijZiZi′, and EVijVikZiZi′ exist and are finite for j,k = 1,…,p, where Vij denotes the jth component of the vector Vi;

(iv) Ω(θ) is nonsingular for all θ ∈ A × {β0}.

Assumptions M′(i) and (ii) state that errors and exogenous variables are jointly i.i.d. and the standard instrument exogeneity assumption is satisfied, whereas M′(iii) and (iv) are technical conditions.

The following lemma shows that Assumption M′ in the linear model implies Assumption M.

LEMMA 1. Suppose that Assumptions ID′, M′, and Θ hold in the linear IV model (2.2). Then Assumptions ID and M hold.

Therefore the various technical conditions of Assumption M reduce to very simple moment conditions in the linear model. Note that M′ implies E[supθ∈Θ∥gi(θ)∥ξ] < ∞ for ξ = 2. However, we do not need the assumption E[supθ∈Θ∥gi(θ)∥ξ] < ∞ for some ξ > 2 to prove n1/2-consistency of the GEL estimator of the strongly identified parameters.

Assumption HOM (conditional homoskedasticity). E(UiUi′|Zi) = ΣU > 0.

HOM, which is used in Staiger and Stock (1997), is sufficient for Assumption M′(iv). That is, Assumptions M′(i)–(iii) and HOM imply M′(iv) under ID′. This follows from Ω(θ) = (vα′ΣuVAvα)QZZ for θ ∈ A × {β0}, where vα′ := (1,(α0 − α)′) and ΣuVA is the (1 + pA) × (1 + pA) upper left submatrix of ΣU. However, M′ is more general than HOM because it allows for conditional heteroskedasticity. For example, ui = ∥Zi∥ζi, where ζi ~ N(0,1) is independent of Zi ~ N(0,Ik), is compatible with M′.
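The error design ui = ∥Zi∥ζi mentioned above is a convenient test case: instrument exogeneity EuiZi = 0 holds, yet E(ui2|Zi) = ∥Zi∥2 varies with the instruments, so HOM fails. A simulation sketch:

```python
import numpy as np

rng = np.random.default_rng(4)
n, k = 200_000, 2
Z = rng.standard_normal((n, k))       # Z_i ~ N(0, I_k)
zeta = rng.standard_normal(n)         # zeta_i ~ N(0,1), independent of Z_i
u = np.linalg.norm(Z, axis=1) * zeta  # u_i = ||Z_i|| zeta_i

# instrument exogeneity still holds: E u_i Z_i = 0
assert np.all(np.abs((u[:, None] * Z).mean(axis=0)) < 0.05)

# but E(u_i^2 | Z_i) = ||Z_i||^2 is not constant: the squared error is
# systematically larger when ||Z_i||^2 is large (conditional heteroskedasticity)
r2 = (Z ** 2).sum(axis=1)
small, large = r2 < np.median(r2), r2 >= np.median(r2)
assert (u[small] ** 2).mean() < (u[large] ** 2).mean()
```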

2.3. The GEL Estimator

This section provides a formal definition of the GEL estimator of θ0.

Let ρ : Q → R be a real-valued function, where Q is an open interval of the real line that contains 0 and

If defined, let ρj(v) := (∂jρ/∂vj)(v) and ρj := ρj(0) for nonnegative integers j.

The GEL estimator is the solution to a saddle point problem

For compact Θ, continuous ρ, and gi (i = 1,…,n), the existence of an argmin

may be shown. In fact,

, viewed as a function in θ, can be shown to be lower semicontinuous (ls). A function f (x) is ls at x0 if, for each real number c such that c < f (x0), there exists an open neighborhood U of x0 such that c < f (x) holds for all xU. The function f is said to be ls if it is ls at each x0 of its domain. It is easily shown that ls functions on compact sets take on their minimum. Uniqueness of

, however, is not implied. As a simple example, consider the i.i.d. linear IV model in (2.2) when p = 2 and let the two components Yij, (j = 1,2), of Yi be independent Bernoulli random variables. Then, for each n, the probability that Yi1 = Yi2 for every i = 1,…,n is positive. If Yi1 = Yi2 for every

is an argmin of

, then each θ ∈ Θ with

is also. To uniquely define

, we could, for example, do the following. From the set of all vectors θ ∈ Θ that minimize

, let

be the vector that has the smallest first component. (If that does not pin down

uniquely, choose from the remaining vectors according to the second component, and so on.)
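The lexicographic tie-breaking rule just described can be sketched directly; the function name below is ours, not the paper's, and Python's tuple ordering is exactly the lexicographic order the rule requires:

```python
from typing import List, Tuple

def lexicographic_pick(minimizers: List[Tuple[float, ...]]) -> Tuple[float, ...]:
    """From the set of all minimizing parameter vectors, pick the one with the
    smallest first component; break remaining ties by the second component,
    and so on (tuple comparison in Python is lexicographic)."""
    return min(minimizers)

# two observationally equivalent minimizers theta = (theta_1, theta_2)
assert lexicographic_pick([(1.0, 0.0), (0.5, 0.5)]) == (0.5, 0.5)
# first components tie, so the second component decides
assert lexicographic_pick([(0.5, 0.7), (0.5, 0.2)]) == (0.5, 0.2)
```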

where

Assumption ρ.

(i) ρ is concave on Q;

(ii) ρ is C2 in a neighborhood of 0 and ρ1 = ρ2 = −1.

The definition of the GEL estimator

is adopted from Newey and Smith (2004). We slightly modify their definition of

by recentering and rescaling, which simplifies the presentation. We usually write

.

The most popular GEL estimators are the CUE, the EL, and the ET estimators, which correspond to ρ(v) = −(1 + v)2/2, ρ(v) = ln(1 − v), and ρ(v) = −exp v, respectively. The EL estimator was introduced by Imbens (1997), Owen (1988, 1990), and Qin and Lawless (1994) and the ET estimator by Imbens et al. (1998) and Kitamura and Stutzer (1997). For a recent survey of GEL methods see Imbens (2002).
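All three ρ functions satisfy the normalization ρ1 = ρ2 = −1 of Assumption ρ(ii), which can be verified numerically by finite differences at v = 0:

```python
import math

# the three leading GEL choices of rho
rho = {
    "CUE": lambda v: -(1 + v) ** 2 / 2,
    "EL":  lambda v: math.log(1 - v),
    "ET":  lambda v: -math.exp(v),
}

def d1(f, v=0.0, h=1e-6):
    # central finite-difference first derivative
    return (f(v + h) - f(v - h)) / (2 * h)

def d2(f, v=0.0, h=1e-4):
    # central finite-difference second derivative
    return (f(v + h) - 2 * f(v) + f(v - h)) / h ** 2

# Assumption rho(ii): rho_1 = rho_2 = -1 for each of CUE, EL, and ET
for f in rho.values():
    assert abs(d1(f) + 1) < 1e-5
    assert abs(d2(f) + 1) < 1e-3
```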

A choice of

as the weighting matrix WT(θ) in Stock and Wright (2000, equation (2.2), p. 1058), i.e.,

, results in the CUE which is the GEL estimator based on ρ(v) = −(1 + v)2/2; see Newey and Smith (2004, Theorem 2.1). Hansen, Heaton, and Yaron (1996) and Pakes and Pollard (1989) define the (GMM) CUE using the centered weighting matrix

. However, as shown in Newey and Smith (2004, footnote 2), both versions of the CUE are numerically identical.
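One way to see the numerical identity: writing Q for the uncentered CUE criterion ḡ′Ω̂−1ḡ, the Sherman–Morrison formula gives centered criterion ḡ′(Ω̂ − ḡḡ′)−1ḡ = Q/(1 − Q), a strictly increasing transformation of Q, so the two objectives have the same minimizer. A numpy sketch (the data are arbitrary stand-ins for gi(θ)):

```python
import numpy as np

rng = np.random.default_rng(5)
n, k = 50, 3
g = rng.standard_normal((n, k))              # stand-in for g_i(theta), i = 1,...,n
gbar = g.mean(axis=0)
Omega_u = g.T @ g / n                        # uncentered weighting matrix
Omega_c = Omega_u - np.outer(gbar, gbar)     # centered weighting matrix

Q_u = gbar @ np.linalg.solve(Omega_u, gbar)  # uncentered CUE criterion at theta
Q_c = gbar @ np.linalg.solve(Omega_c, gbar)  # centered CUE criterion at theta

# Q_c = Q_u / (1 - Q_u): a strictly increasing transformation, so both
# criteria are minimized at the same parameter value
assert np.isclose(Q_c, Q_u / (1 - Q_u))
```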

2.4. First-Order Equivalence

This section obtains the asymptotic distribution of the GEL estimator

under Assumption ID. Theorem 2 shows that the weakly identified parameters of θ0 are estimated inconsistently and their GEL estimator has a nonstandard limiting distribution whereas the GEL estimator of the strongly identified parameters is n1/2-consistent but no longer asymptotically normal. Analogous results are available for LIML or more generally for GMM; see Phillips (1984) and Stock and Wright (2000, Theorem 1). The rather surprising finding is that the first-order asymptotic theory under ID is identical for all GEL estimators

, as long as ρ satisfies Assumption ρ.


The proof of Theorem 2 uses a second-order Taylor expansion of

in λ about 0 in which the only impact of ρ asymptotically is through ρ1 and ρ2, which are both −1.

This is in contrast to the asymptotic theory for k-class estimators under weak identification. As shown in Staiger and Stock (1997, Theorem 1), the nonstandard asymptotic distribution of the k-class estimator depends on κ defined by n(k − 1) →d κ. Therefore, LIML and 2SLS are not first-order equivalent under weak identification.

If defined, let

For θ = (α′,β′)′ ∈ Θ and bRpB let

The next theorem establishes the asymptotic behavior of

under Assumption ID.

THEOREM 2. Suppose Assumptions Θ, ID, M, and ρ are satisfied. Then

Remark 1. Theorem 2(ii) is analogous to Theorem 1 in Stock and Wright (2000, p. 1062) for GMM. Note that from (A.5) in the Appendix

. Moreover, using the proof of Theorem 2 it can be shown that

Therefore, like

, although n1/2-consistent,

admits a nonstandard asymptotic distribution (see also Caner, 2003). If pA = 0, where all parameters are strongly identified,

, where M2 := M2(β0), Ω := Ω(β0), and Δ := Δ(β0). The covariance matrix reduces to Ω−1MM2(Ω) in the i.i.d. case.

The proof of Theorem 2 also provides a formula (equation (A.7) in the Appendix) for b*(α) := arg minbRpB Pαb for α ∈ A. In particular, if pA = 0, (A.7) shows that

where

The matrix V0) simplifies to (M2′Ω−1M2)−1 in the i.i.d. case, and thus the preceding formula coincides with Theorem 3.2 of Newey and Smith (2004). However, the asymptotic variance matrix of

in the time series context is in general different from that in Newey and Smith, and the estimator

as defined previously would thus be inefficient. Block methods as in Kitamura (1997) or kernel-smoothing methods as in Smith (2001) can be used for efficient GEL estimation in a time series context with strong identification. In the case pA > 0, the fact that the asymptotic distribution of the strongly identified parameter estimates is in general nonnormal is a consequence of the inconsistent estimation of α0.

Remark 2. Given the nonnormal asymptotic distribution of the GMM and GEL parameter estimates under weak identification (established in Theorem 1 in Stock and Wright, 2000, and our Theorem 2, respectively) the asymptotic distribution of test statistics based on these estimators, such as t- or Wald statistics, will also be nonstandard and non-pivotal. Furthermore, these limiting distributions depend on quantities that cannot be consistently estimated (see Staiger and Stock, 1997, p. 564), which militates against their use for the construction of test statistics or confidence regions for θ0. The next section introduces alternative approaches that overcome these difficulties.

Example 1 (continued)

The specialization of Theorem 2 to the i.i.d. linear IV model of Example 1 was derived in Guggenberger (2003).

3. TEST STATISTICS

This section proposes several statistics to test the simple hypothesis H0 : θ = θ0 versus H1 : θ ≠ θ0. We establish that they are asymptotically pivotal quantities and have limiting chi-square null distributions under Assumption ID. Therefore these statistics lead to tests whose size properties are unaffected by the strength or weakness of identification. For the time series setup considered here there are at least two other statistics that share this property, namely, the Anderson and Rubin (1949) AR-statistic and the Kleibergen (2001, 2002a) K-statistic. The first statistic, GELRρ(θ), that we describe may be interpreted as a likelihood ratio statistic. It has an asymptotic χ2(k) null distribution and is first-order equivalent to the AR-statistic. The second set of statistics in this section, Sρ(θ) and LMρ(θ), are based on the first-order conditions (FOC) of

with respect to θ. Each has a limiting χ2(p) null distribution and is first-order equivalent to the K-statistic. For a recent survey on robust inference methods with weak identification, see Stock, Wright, and Yogo (2002).

To motivate the first statistic, consider an i.i.d. setting. In this case, GELREL(θ) may be thought of in terms of the empirical likelihood ratio statistic R(θ), where

Newey and Smith (2004) show that under certain conditions including {zi : i ≥ 1} i.i.d.,

. Thus ln R(θ) can be interpreted as the criterion function of the EL estimator.

The criterion function R(θ) can be interpreted as a nonparametric likelihood ratio. Indeed, for fixed θ ∈ Θ and given gi(θ), (i = 1,…,n), the numerator of R(θ) is the maximal probability of observing the given sample gi(θ), (i = 1,…,n), over all discrete probability distributions (w1,…,wn) on the sample such that the sample analogue

of the moment condition (2.1) is satisfied. The denominator (1/n)n equals the unrestricted maximal probability. It can then be shown that

, where λ(θ0) is the vector of Lagrange multipliers associated with the k moment restrictions

in the constrained maximization problem (3.1). Therefore, the renormalized criterion function of the EL estimator has an interpretation as −2 times the logarithm of the likelihood ratio statistic R(θ0).
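The inner maximization behind R(θ) has a well-known closed form: the optimal weights are wi = 1/(n(1 + λ′gi(θ))), where λ solves the first-order condition Σi gi(θ)/(1 + λ′gi(θ)) = 0. A scalar (k = 1) sketch with simulated stand-ins for gi(θ), solving for λ by Newton's method:

```python
import numpy as np

rng = np.random.default_rng(6)
n = 50
g = rng.standard_normal(n)   # scalar g_i(theta) at a fixed theta (k = 1)

# inner EL problem: maximize sum_i log w_i  s.t.  sum_i w_i = 1, sum_i w_i g_i = 0.
# Solution: w_i = 1/(n (1 + lam g_i)), with lam solving sum_i g_i/(1 + lam g_i) = 0.
lam = 0.0
for _ in range(50):          # Newton iterations on the FOC in lam
    r = 1.0 + lam * g
    foc = (g / r).sum()
    dfoc = -(g ** 2 / r ** 2).sum()
    lam -= foc / dfoc

w = 1.0 / (n * (1.0 + lam * g))
assert np.isclose(w.sum(), 1.0)     # probabilities sum to one
assert abs((w * g).sum()) < 1e-10   # restricted moment condition holds exactly

R = np.prod(n * w)                  # R(theta): restricted over unrestricted max prob
assert R <= 1.0                     # the constraint can only lower the maximum
```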

Generalizing from the i.i.d. to the time series setup and from EL to arbitrary ρ, the first statistic we consider is the renormalized GEL criterion function (2.7):

Second, following Kleibergen's (2001) suggestion, made in the GMM framework, of constructing a statistic from the FOC with respect to θ, we construct a test statistic based on the GEL FOC for

. If the minimum of the objective function

is obtained in the interior of Θ, the score vector with respect to θ must equal 0 at

, i.e.,

For θ ∈ Θ, define the k × p matrix

Thus, (3.3) may be written as

. The test statistic is therefore given as a quadratic form in the score vector λ(θ)′Dρ(θ) evaluated at the hypothesized parameter vector θ

where ρ is any function satisfying Assumption ρ and Δ̂(θ) is a consistent estimator of Δ(θ). We also consider the following variant of Sρ(θ):

that substitutes

for λ(θ) in Sρ(θ); see (A.8) in the Appendix, where it is shown that

. The statistic LMρ(θ) is similar to a GMM Lagrange multiplier statistic given in Newey and West (1987). To make the origin of the preceding test statistics clearer, we adopt the notation LMρ(θ) and Sρ(θ), respectively, in place of Kρ(θ) and KρL(θ) previously given to the statistics in Guggenberger (2003). To use these statistics for hypothesis tests or for the construction of confidence regions one needs a consistent estimator Δ̂(θ) of Δ(θ). Under assumptions given later, the sample average n−1Σi=1n gi(θ)gi(θ)′ may be used for Δ̂(θ).


Alternatively, instead of using uniform weights in the definition of

one could use empirical probabilities that are associated with each GEL estimator; see Section 2 of Newey and Smith (2004). However, preliminary Monte Carlo simulations (not reported here) showed no clear improvement in the performance of the test statistics.

Note that when ρ(v) = −(1 + v)2/2, i.e., in the case of a GEL CUE criterion, the GEL statistics Sρ(θ) (3.5) and LMρ(θ) (3.6) are then identical and given in closed form by (3.6) with

in the definition of DCUE(θ), where

denotes any generalized inverse of

.

As noted previously the GEL and GMM CUE are numerically identical. However, although the structures of the two statistics coincide, in general, the statistic LMCUE(θ) and the Kleibergen (2001) K-statistic based on the GMM CUE are not identical. The reason is that, in general, the first-order derivatives of the GMM and GEL CUE objective functions are not equal. The K-statistic in Kleibergen (2001) is based on the FOC of the GMM CUE criterion

. It replaces DCUE(θ) in LMCUE(θ) by

, where

is an estimator for

. The particular assumptions made on Δ(θ) determine the choice of estimators

. If the sample average

is used for

for

, then the statistic LMCUE(θ) and the K-statistic coincide.

Some intuition for these test statistics is provided under strong identification. Under strong identification, Newey and Smith (2004) show consistency of the GEL estimator θ̂. Therefore, if the FOC (3.3) hold at θ̂, then, at least asymptotically, they also hold at the true value θ0. The statistic Sρ(θ) can then be interpreted as a quadratic form whose criterion is expected to be small at the true value θ0. If, however, all parameters are weakly identified this argument is no longer valid. From Theorem 2, θ̂ is no longer consistent for θ0. Therefore, although the FOC hold at θ̂, this does not imply automatically that they also approximately hold at the true value θ0. However, it can be shown that under weak identification the FOC λ(θ)′Dρ(θ) = 0′ not only hold at θ̂ w.p.a.1 but are satisfied to order Op(T−1) uniformly over θ ∈ Θ. Thus, under weak identification the FOC do not pin down the true value θ0. Consequently, the power properties of hypothesis tests for θ0 based on the statistics Sρ(θ) or LMρ(θ) should be expected to be better under strong rather than weak identification. Size properties, however, are not affected by the strength or weakness of identification. This is corroborated by the Monte Carlo simulations reported subsequently and theoretically by Theorem 4.

We now consider the asymptotic distribution of GELRρ(θ) evaluated at a vector θ = (α′,β0′)′, thus allowing for a fixed alternative in the weakly identified components. We need the following local version of Assumption M.

Assumption Mθ. Let θ = (α′,β0′)′ ∈ A × {β0}. Suppose

Note that for θ = (α′,β0′)′, Mθ(iii) and ID imply that

. Thus, under Mθ(iii) and ID the assumption

in Mθ(ii) is equivalent to the assumption

for θ = (α′,β0′)′, which is Assumption D′ in Stock and Wright (2000). The assumption rules out many interesting time series cases. However, it is more general than an i.i.d. assumption. The assumption allows for m.d.s. and thus covers various intertemporal Euler equations applications and regression models with m.d.s. errors. As in Stock and Wright, a possible application is the intertemporally separable consumption capital asset pricing model (CCAPM). Without assuming

, a limiting chi-square distribution would no longer obtain in the following theorems. The problem arises because the GEL estimator as defined in (2.6) is not efficient in the time series setup considered here.

THEOREM 3. Suppose ID, Mθ(i)–(iii), and ρ hold for θ = (α′,β0′)′. Then

where the noncentrality parameter δ = m1(θ)′Δ(θ)−1m1(θ). In particular,

To describe the asymptotic distribution of the statistics LMρ(θ0) and Sρ(θ0), we need the following additional assumptions. Write Gi(θ) = (GiA(θ), GiB(θ)), where the matrices GiA(θ) and GiB(θ) are of column dimension pA and pB, respectively.

Let

be an open neighborhood of θ.

Assumption Mθ (continued).

In Mθ(vii) write

Assumption Mθ(iv) allows the interchange of the order of integration and differentiation in Assumption ID, i.e.,

. It also guarantees that M1n(θ) → M1(θ) := (∂m1 /∂θ)(θ). Assumptions ID and Mθ thus imply that

where by ID the limit matrix (0,M2(θ0)) is of deficient rank pB. Assumption Mθ(v) is comparable to Mθ(ii), where

was assumed and extends Mθ(ii) to cross-product terms in vec GiA(θ) and gi(θ). Assumption Mθ(vi) contains additional weak technical conditions that guarantee that certain expressions in the proof of Theorem 4 are asymptotically negligible.

The key assumption is Mθ(vii), which is a stronger version of Mθ(iii) and states that a central limit theorem (CLT) holds simultaneously for the centered gi(θ) and part of the derivative matrix, namely, vec GiA(θ). Write

, where

. As shown in the proof of Theorem 4, for θ = (α′,β0′)′, Assumptions ID, ρ, Mθ(i)–(vi), and

imply that D →p (0,M2(θ0)). Therefore, the probability limit of

is not invertible without renormalization. Define D* := DΛ where the p × p diagonal matrix Λ := diag(n1/2,…,n1/2,1,…,1) with first pA diagonal elements equal to n1/2 and the remainder equal to unity. Hence,

In the proof of Theorem 4 we show that under Assumptions ID, ρ, and Mθ(i)–(vi)

Assumption Mθ(vii), in particular the full rank assumption on V(θ), ensures that

has full rank w.p.a.1. Assumption Mθ(vii) is closely related to Assumption 1 of Kleibergen (2001). Unlike Kleibergen (2001), however, we assume ID, which, as just shown, requires us to be specific about which part of the derivative matrix Gi(θ) together with gi(θ) satisfies a CLT with full rank covariance matrix, namely, GiA(θ), which corresponds to the weakly identified parameters. Assumption ID possesses the advantage that we can obtain the asymptotic distribution of the test statistics under fixed alternatives of the form θ = (α′,β0′)′ and therefore derive asymptotic power results.

THEOREM 4. Suppose ID, Mθ(i)–(vii), and ρ hold for θ = (α′,β0′)′. Then,

where the random p-vector W(α) is defined in (A.11) in the Appendix, ζ ∼ N(0,Ip), and W and ζ are independent. We have W(α0) ≡ 0, and therefore

Remark 1. The proof of Theorem 4 crucially hinges on the fact that n1/2λ(θ0) and vec Dρ(θ0) (suitably normalized) from the FOC (3.3) are asymptotically jointly normally distributed and, moreover, are asymptotically independent. A similar result is critical also for the Kleibergen (2001) K-statistic, which generalizes the Brown and Newey (1998) analysis of efficient GMM moment estimation to the weakly identified setup. Therefore, by using an appropriate weighting matrix in the quadratic forms (3.5) and (3.6) that define Sρ(θ0) and LMρ(θ0), respectively, we immediately obtain the limiting χ2(p) null distribution of Theorem 4.

Remark 2. Theorems 3 and 4 provide a straightforward method to construct confidence regions or hypothesis tests on θ0. For example, a critical region for a test of the hypothesis H0 : θ = θ0 versus H1 : θ ≠ θ0 at significance level r is given by {GELRρ(θ0) ≥ χr2(k)}, where χr2(k) denotes the (1 − r)-critical value from the χ2(k) distribution. A (1 − r)-confidence region for θ0 is obtained by inverting the just-described test, i.e., {θ ∈ Θ : GELRρ(θ) ≤ χr2(k)}. Confidence regions and hypothesis tests based on Sρ(θ) and LMρ(θ) may be constructed in a similar fashion.
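The test-inversion construction can be illustrated with a small sketch. We invert an AR-type statistic (used here in place of the GEL statistics purely for brevity; the grid, model, and function names are our own choices) over a grid of θ values in a scalar linear IV model:

```python
import numpy as np

def ar_stat(theta, y, Y, Z):
    """AR-type statistic; chi2(k) limiting null distribution under homoskedasticity."""
    n, k = Z.shape
    u = y - Y.ravel() * theta
    Pu = Z @ np.linalg.solve(Z.T @ Z, Z.T @ u)   # projection of the residual on Z
    s_uu = (u @ u - u @ Pu) / (n - k)            # u' M_Z u / (n - k)
    return (u @ Pu) / s_uu

rng = np.random.default_rng(2)
n, k, theta0 = 500, 3, 0.0
Z = rng.standard_normal((n, k))
V = rng.standard_normal(n)
Y = Z @ np.ones(k) + V                           # strong instruments
y = Y * theta0 + 0.8 * V + rng.standard_normal(n)

crit = 7.815                                     # chi2(3) 5% critical value
grid = np.linspace(-1.0, 1.0, 201)
region = [t for t in grid if ar_stat(t, y, Y, Z) <= crit]
```

Each θ retained in `region` is a value not rejected at the 5% level; with strong instruments the region is typically a short interval around θ0, whereas with weak instruments it can be wide or unbounded.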

Remark 3. Theorems 3 and 4 demonstrate that GELRρ(θ0), Sρ(θ0), and LMρ(θ0) are asymptotically pivotal statistics under weak and strong identification. Therefore, the size of tests based on these statistics should not vary much with the strength or weakness of identification in finite samples. However, these results also show that under weak identification hypothesis tests based on these statistics are inconsistent. For example, the noncentrality parameter δ does not diverge to infinity as the sample size increases, and therefore the rejection rate under the alternative does not converge to 1. This is intuitively reasonable because if identification is weak one cannot learn much about α0 from the data.

Remark 4. A drawback of GELRρ(θ0) is that its limiting null distribution has degrees of freedom equal to k, the number of moment conditions, rather than the dimension of the parameter vector. In general, this has a negative impact on the power properties of hypothesis tests based on GELRρ(θ0) in overidentified situations. On the other hand, the limiting null distributions of Sρ(θ0) and LMρ(θ0) have degrees of freedom equal to p. Therefore the power of tests based on these statistics should not be negatively affected by a high degree of overidentification. The AR-statistic of Anderson and Rubin (1949) has a χ2(k) limiting null distribution also. Kleibergen (2002b) shows that it equals the sum of two independent statistics, namely, the K-statistic (Kleibergen, 2002a) and a J-statistic (Hansen, 1982) that test location and misspecification, respectively. Mutatis mutandis, a similar decomposition may be given for the GELRρ(θ0) statistic in terms of Sρ(θ0) or LMρ(θ0).

Remark 5. Stock and Wright (2000, Theorem 2) derive the asymptotic distribution under weak identification of the analogue of GELRρ(θ0) for the (GMM) CUE, which is also a χ2(k) null distribution. In the i.i.d. context, Qin and Lawless (1994, Theorem 2) propose the statistic

to test the hypothesis H0 : θ = θ0, which is shown to be asymptotically distributed as χ2(p) under strong identification. However, because of the dependence on

, this statistic is no longer asymptotically pivotal and thus leads to size-distorted tests under weak identification.

Example 1 (continued)

Guggenberger (2003) derives the results given in Theorems 3 and 4 under Assumptions Θ, ID′, M′, and ρ allowing for alternatives α ∈ A and Pitman drift in the data generating process (DGP) for the strongly identified parameters to assess the asymptotic power properties of the tests; i.e., ID′ holds and for some fixed b ∈ RpB, y = Y(θ0 + n−1/2(0′,b′)′) + u. To simplify our presentation here we ignore the possibility of Pitman drift. Results for the i.i.d. linear IV model follow directly from the preceding theorems because, as is easily shown, Assumptions ID′, M′, ρ, and V(θ) > 0 imply Mθ for any consistent estimator

. In particular, V(θ) has a simple representation. For θ = (α′,β0′)′, Ω(θ) = Δ(θ) and ΔAA(θ) = E(ViAViA′ ⊗ Zi Zi′), where ViA consists of the first pA components of Vi in (2.3).

4. SUBVECTOR TEST STATISTICS

We now assume that interest is focused on the subvector α0 ∈ RpA of θ0 = (α0′,β0′)′. However, we no longer maintain Assumption ID. In particular, α0 need not be weakly identified.

To adapt the test statistics of Section 3 to the subvector case, the basic idea is to replace β by a GEL estimator

. To make this idea more rigorous, define the GEL estimator

for β0:

We usually write

where there is no ambiguity. A requirement of the analysis that follows is that

. Therefore, we assume that the nuisance parameters β0 that are not involved in the hypothesis under test are strongly identified; see Theorem 2. On the other hand, the components of α0 can be weakly or strongly identified, and in Assumption IDα, which follows, we assume the former holds for α01 and the latter for α02, where α0 = (α01′,α02′)′. The main advantage of the subvector test statistics introduced in this section is that asymptotically they have accurate sizes independent of whether α0 is weakly or strongly identified. This property is not shared by classical tests based on Wald, likelihood ratio, or Lagrange multiplier statistics. In general, they have correct size only if θ0 is strongly identified. In contrast, the subvector tests in Guggenberger and Wolf (2004) based on a subsampling approach have exact asymptotic sizes without any additional identification assumption.

Let θ = (α1′,α2′,β′)′, where αj ∈ Aj, Aj ⊂ RpAj (j = 1,2), pA1 + pA2 = pA and β ∈ B, B ⊂ RpB. Also let

be an open neighborhood of (α02′,β0′)′.

Assumption A. The true parameter θ0 = (α01′,α02′,β0′)′ is in the interior of the compact space Θ, where Θ = A1 × A2 × B.

Assumption IDα.

Assumption IDα implies that α01 and (α02′,β0′)′ are weakly and strongly identified, respectively. Assumptions A and IDα adapt Assumptions Θ and ID in Section 2 for the subvector case.

Let

We now introduce the subvector statistics. Recall the definition of GELRρ(θ) in (3.2). The GELRρ subvector test statistic is given by

We need the following technical assumptions for our derivation of its asymptotic distribution. To obtain theoretical power properties, we again allow a fixed alternative for the weakly identified components, α01 here.

For a1 ∈ A1 let a := (a1′,α02′)′ be a fixed vector whose strongly identified component α02 is the same as the corresponding component of the true parameter vector θ0. Let

be an open neighborhood of β0.

Assumption Mα.

Mutatis mutandis, Mα has the same interpretation as Mθ. For example Mα(ii) guarantees that

is bounded and

is bounded away from zero w.p.a.1, whereas Mα(iv) and IDα imply that for

we have

. By IDα this last matrix has full column rank for β = β0. If we assume that the GiB(a,β), (i = 1,…,n), viewed as functions of β, are continuous at β0 a.s., then we can simplify Mα(vi) to

. A similar comment holds for the assumptions in the continuation of Mα that follows.

THEOREM 5. Assume 1 ≤ pA < p. Suppose Assumptions A, IDα, Mα(i)–(vi), and ρ hold for some a1 ∈ A1 and a = (a1′,α02′)′. Then,

where the noncentrality parameter δ is given by

where M(·) := (∂m2 /∂β)(·) ∈ Rk×pB. In particular,

Theorem 5 confirms that the subvector statistic GELRρsub(α0), like the full vector statistic GELRρ(θ0), is asymptotically pivotal. As before, this result can be used to construct hypothesis tests and confidence regions for α0.

We now generalize the statistics Sρ and LMρ to the subvector case. The asymptotic variance matrices of

differ from those of

. Therefore different weighting matrices are required in the quadratic forms defining these subvector statistics. In the Appendix (see proofs of Theorems 5 and 6) it is shown that for a = (a1′,α02′)′,

exists w.p.a.1 and that

is asymptotically normal with covariance matrix M(a), where for α = (α1′,α2′)′ ∈ RpA

The first pA elements of the FOC (3.3), evaluated at

, are

For α ∈ RpA, let

which coincides with the definition of Dρ(θ) in (3.4) when α is the full vector θ. Similarly to Sρ(θ) in (3.5) the subvector test statistic Sρsub(α) is constructed as a quadratic form in the vector

from (4.3) with weighting matrix given by M(α) in (4.2). Let

be an estimator of M(α) that is given by replacing the expressions Δ(θαβ0) and M2(θ0) in M(α) by consistent estimators,

say. By Assumptions Mα(ii) and Mα(iv)–(v) we may choose

when α = a = (a1′,α02′)′. Hence,

The statistic LMρsub(α) is constructed like Sρsub(α) but replaces

by

. Thus,

Let

be an open neighborhood of β0, and

.

Assumption Mα (continued).

In Mα(x) write

Assumption Mα(x) is the key assumption and plays a role similar to Mθ(vii). Assumption Mα(vii) extends Mα(iv) by explicitly assuming that integration and differentiation can be exchanged in the expectation of

, whereas Mα(iv) gave primitive conditions that imply that exchange holds for

. Assumptions Mα(v), Mα(vii), and IDα imply that

, which is an important result used in the proof of the next theorem; in a linear model this result is trivially true because

. Assumptions Mα(vii)–(x) are analogous to Mθ(iv)–(vii) with A1 and A2 now playing the roles of A and B, respectively.

THEOREM 6. Assume 1 ≤ pA < p. Suppose Assumptions A, IDα, Mα(i)–(x), and ρ hold for a = (a1′,α02′)′ for some a1 ∈ A1. Then,

where the random pA-vector Wα(a) is defined in (A.22) of the Appendix, ζα ∼ N(0,IpA), and ζα and Wα are independent. We have Wα(α0) ≡ 0, and therefore

Remark 1. The subvector statistics are asymptotically pivotal when elements of α0 are arbitrarily weakly or strongly identified. This result can be used for the construction of test statistics or confidence regions that have correct size or coverage probabilities asymptotically, independent of the strength or weakness of identification of α0. Compared to the GMM subvector statistic of Kleibergen (2001), the statistics Sρsub(a) and LMρsub(a) are appealing because of their compact formulation.

Remark 2. Even though it is unclear how the asymptotic distribution of these test statistics might be derived without assuming strong identification of β0, it is obvious that neither Sρsub(α0) nor LMρsub(α0) would converge to a χ2(pA) random variable. In general the quantities

in Sρsub(α0) and

in LMρsub(α0) are no longer asymptotically normal because of their dependence on the GEL estimator

, which as a direct consequence of Theorem 2 has a nonstandard limiting distribution if β0 is not strongly identified. Moreover, the subvector version of the K-statistic of Kleibergen (2001) also experiences the same problem in these circumstances as the (GMM) CUE of β0 has a nonnormal limiting distribution under weak identification (see Stock and Wright, 2000). Somewhat surprisingly, however, Monte Carlo simulations by the authors (not reported here) for the subvector statistic LMρsub(α0) indicate that its size properties are not much affected by the strength or weakness of identification of β0. Startz, Zivot, and Nelson (2004) report similar findings from Monte Carlo simulations for the subvector test statistic of Kleibergen (2001).

Example 1 (continued)

Guggenberger (2003) derives the corresponding results. Note that Assumptions Θ, ID′, M′, and ρ, together with the assumption that Vα(a,β0) has full column rank, imply Assumption Mα. In the linear model the components of Vα(a,β0) can be easily calculated. For example, ΔA1A1 = E(ViA1ViA1′ ⊗ Zi Zi′), where ViA1 is the subvector of Vi that contains its first pA1 components. Let Y = (X,W) denote the partition of the included variables of the structural equation into exogenous and endogenous variables. Partition θ0 = (θX0′,θW0′)′ and θ = (θX′,θW′)′ conformably. Valid inference is possible on any subvector of θW0 if the appropriate assumptions given previously are fulfilled. Unfortunately, if the dimension of the parameter vector not subject to test is large, then the argmin-sup problem in (4.1) is computationally very involved. Premultiplication of equation (2.2) by MX should ameliorate this problem through the elimination of the exogenous variables; i.e., MX y = MXWθW0 + MXu. If Assumption Mα holds for θW0 = (αW0′,βW0′)′ and gi(θW) := MX,i′(y − WθW)Zi, where MX,i denotes the ith row of MX written as a column vector, valid inference may be undertaken on αW0.
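The elimination of the exogenous variables can be checked numerically: since MX X = 0, premultiplying y = XθX0 + WθW0 + u by MX yields MX y = MX W θW0 + MX u as an exact algebraic identity. A small sketch of this (dimensions and parameter values are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)
n, p_x, p_w = 100, 2, 1
X = rng.standard_normal((n, p_x))        # included exogenous variables
W = rng.standard_normal((n, p_w))        # remaining included variables
u = rng.standard_normal(n)
theta_X0 = np.array([1.0, -1.0])
theta_W0 = np.array([0.5])
y = X @ theta_X0 + W @ theta_W0 + u

# Annihilator matrix M_X = I - X (X'X)^{-1} X'
MX = np.eye(n) - X @ np.linalg.solve(X.T @ X, X.T)
lhs = MX @ y                             # M_X y
rhs = MX @ W @ theta_W0 + MX @ u         # M_X W theta_W0 + M_X u
```

The identity holds because MX annihilates the X columns, so the projected equation no longer involves θX0.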

5. SIMULATION EVIDENCE

To assess the efficacy of the hypothesis tests introduced in Theorems 3 and 4, we conduct a set of Monte Carlo experiments. The DGP is given by model (2.2) considered in Example 1 and is similar to that in Kleibergen (2002a, p. 1791), namely,

There is a single right-hand-side endogenous variable and no included exogenous variables, so p = 1; Z ∼ N(0,Ik ⊗ In), where k is the number of instruments and n the sample size. In the just-identified case, i.e., k = 1, Π = Π1, whereas in the overidentified case, k > 1, Π = (Π1,0′)′; i.e., irrelevant instruments are added.

Interest focuses on testing the scalar null hypothesis H0 : θ0 = 0 versus the alternative hypothesis H1 : θ0 ≠ 0.

5.1. Error Distributions

We examine several distributions for (u,V) to investigate the robustness of the test statistics to potentially different features of the error distribution. All designs are constructed from Design (I) by modifying the distribution of the structural error u.

  • Design (I): (u,V)′ ∼ N(0,Σ ⊗ In), where Σ ∈ R2×2 with diagonal elements unity and off-diagonal elements ρuV.
  • Design (II): ui in Design (I) is modified as ui /(wi /r)1/2, where wi is a χ2(r) random variable independent of ui and Vi, i.e., ui is tr-distributed. We fix r = 2.
  • Design (III): modifies Design (I) by exchanging ui2 − 1 for ui, i.e., ui is a recentered χ2(1) random variable.
  • Design (IV): ui from Design (I) is replaced by Bi|ui + 2| − (1 − Bi)|ui + 2| where Bi is Bernoulli (0.5,0.5) distributed and independent of all other random variables.

Design (II) examines the robustness of the performance of the test statistics to thick-tailed distributions for the structural equation error. Design (III) examines robustness with respect to asymmetric structural error distributions. In Design (IV) the structural error ui is bimodal with peaks at −2 and +2.

In addition, the impact of conditional heteroskedasticity on the performance of the test statistics is examined. Designs (IHET)–(IVHET) modify Designs (I)–(IV), respectively, replacing ui by ∥Zi∥ui.
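A generator for Designs (I)–(IV) and their heteroskedastic variants can be sketched as follows (our own code; function and argument names are assumptions, not the paper's):

```python
import numpy as np

def draw_errors(design, n, rho_uv, rng=None):
    """Draw (u, V) for Designs (I)-(IV); the (.HET) variant multiplies u_i by ||Z_i|| afterward."""
    rng = np.random.default_rng() if rng is None else rng
    e = rng.standard_normal((n, 2))
    V = e[:, 0]
    u = rho_uv * e[:, 0] + np.sqrt(1.0 - rho_uv**2) * e[:, 1]   # corr(u, V) = rho_uv
    if design == "II":                      # t_2-distributed structural error
        w = rng.chisquare(2, size=n)
        u = u / np.sqrt(w / 2.0)
    elif design == "III":                   # recentered chi2(1) structural error
        u = u**2 - 1.0
    elif design == "IV":                    # bimodal structural error, peaks near -2 and +2
        B = rng.integers(0, 2, size=n)
        u = np.where(B == 1, np.abs(u + 2.0), -np.abs(u + 2.0))
    return u, V

rng = np.random.default_rng(4)
n, k, pi1 = 20000, 5, 0.1
Z = rng.standard_normal((n, k))
u, V = draw_errors("I", n, rho_uv=0.5, rng=rng)
Pi = np.r_[pi1, np.zeros(k - 1)]            # irrelevant instruments beyond the first
Y = Z @ Pi + V
y = Y * 0.0 + u                             # theta0 = 0
u_het = np.linalg.norm(Z, axis=1) * u       # the (I_HET) modification
```

For Design (I) the empirical correlation between u and V should be close to ρuV in large samples.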

5.2. Test Statistics

We calculate three versions of the statistic GELRρ(θ) in (3.2), for ρ(v) = −(1 + v)2/2 (CUE), ρ(v) = ln(1 − v) (EL), and ρ(v) = −exp v (ET). We also consider the corresponding versions for each of Sρ(θ) in (3.5) and LMρ(θ) in (3.6) with Δ(θ) replaced by Δ̂(θ). As noted previously, for CUE, Sρ(θ) and LMρ(θ) are then numerically identical. Theorems 3 and 4 present the asymptotic null distributions of these statistics.


To calculate GELRρ(θ), Sρ(θ), and LMρ(θ) for EL and ET, the globally concave maximization problem

must be solved numerically. To do so we implement a variant of the Newton–Raphson algorithm. We initialize the algorithm by setting λ equal to the zero vector. At each iteration the algorithm tries several shrinking step sizes in the search direction and accepts the first one that increases the function value compared to the previous value for λ. This procedure enforces an “uphill climbing” feature of the algorithm.
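The inner maximization can be sketched for ET, ρ(v) = −exp v, where the problem is maxλ n−1∑i ρ(λ′gi(θ)). The following is our own minimal implementation of the backtracking Newton iteration described above (not the authors' code):

```python
import numpy as np

def et_lambda(G, max_iter=100, tol=1e-9):
    """Maximize n^{-1} sum_i -exp(lambda' g_i) over lambda by Newton steps with backtracking."""
    n, k = G.shape
    lam = np.zeros(k)                              # initialize at the zero vector
    obj = lambda l: -np.mean(np.exp(G @ l))
    for _ in range(max_iter):
        w = np.exp(G @ lam)                        # weights exp(lambda' g_i)
        grad = -(G * w[:, None]).mean(axis=0)
        if np.linalg.norm(grad) < tol:
            break
        hess = -(G * w[:, None]).T @ G / n         # negative definite: globally concave problem
        step = -np.linalg.solve(hess, grad)        # Newton direction
        f0, t = obj(lam), 1.0
        while obj(lam + t * step) <= f0 and t > 1e-12:
            t *= 0.5                               # shrink step until the objective increases
        if t <= 1e-12:
            break                                  # no uphill step found: at the optimum
        lam = lam + t * step
    return lam

rng = np.random.default_rng(6)
G = rng.standard_normal((200, 3)) + 0.1            # moment contributions with nonzero mean
lam_hat = et_lambda(G)
```

Accepting only steps that increase the objective enforces the "uphill climbing" feature; at the solution the first-order condition ∑i exp(λ′gi)gi = 0 holds.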

Additional statistics considered are the Anderson–Rubin test statistic (AR) (see Anderson and Rubin, 1949), two versions of the K-statistic proposed by Kleibergen (2001, 2002a), one assuming homoskedastic errors K, the other robust to conditional heteroskedasticity KHET, the conditional likelihood ratio test LRM of Moreira (2003), and two versions of the two-stage least squares (2SLS) Wald statistic 2SLS (see, e.g., Wooldridge, 2002, pp. 98, 100), one assuming homoskedastic errors (2SLSHOM) and the other robust to conditional heteroskedasticity (2SLSHET).9

The statistics are defined as follows:

where suu(θ) := (y − Yθ)′MZ(y − Yθ)/(n − k),

where

. The statistic K(θ) (Kleibergen, 2002a), is not robust to conditional heteroskedasticity. However, a version of the K-statistic in Kleibergen (2001, equation (22)) that uses a heteroskedasticity consistent estimator for the covariance matrix of gi(θ) overcomes this drawback. For model (5.1), the statistic is given by

where

, and

. The statistic KHET(θ) is identical in structure to LMCUE(θ) except the centered components

are used in place of gi(θ) and Gi, respectively. Note that Gi := Gi(θ) does not depend on θ in a linear model. For the LRM statistic, see Moreira (2003, Sect. 3). Finally, the Wald statistics are given by

where

, and

is a conditional heteroskedasticity robust estimator for the variance of

.

Under H0 : θ0 = 0, AR(θ0) →d χ2(k) and K(θ0) →d χ2(p). In the just-identified case k = p = 1, the AR- and K-statistics coincide. Both Wald statistics are asymptotically distributed as χ2(1) under H0 : θ = θ0 and strong identification.

5.3. Size Comparison

Empirical sizes are calculated using 5% asymptotic critical values for all of the preceding statistics for DGPs (5.1) corresponding to all 54 possible combinations of sample size n = 50, 100, 250, number of instruments k = 1, 5, 10, SF and RF error correlation ρuV = 0.0, 0.5, 0.99, and RF coefficient Π1 = 0.1, 1.0 for Designs (I)–(IV) and (IHET)–(IVHET).10

10. Kleibergen (2002a) generates one sample for the instrument matrix Z from a N(0,Ik ⊗ In) distribution and then keeps Z fixed across R = 10,000 samples of the DGP (5.1) using Design (I) with n = 100 and ρuV = 0.99. We simulate a new matrix Z with each sample of the DGP (5.1). As a consequence, our results do not coincide with those reported by Kleibergen (2002a).

To investigate the sensitivity of the results in Kleibergen (2002a) to the choice of Z, we iterated Kleibergen's (2002a) procedure 100 times; i.e., each time we simulated a matrix Z of instruments that we then kept fixed across R = 1,000 samples of the DGP (5.1). We found strong dependence of the numerical results of the Monte Carlo experiment on Z. For example, in the case Π1 = 1, k = 1, the power of the K-statistic to reject the hypothesis θ0 = 0 when θ0 = 0.4 varied from about 60% to 95% in the 100 experiments. For the specific Z that Kleibergen (2002a) generates, he reports power of about 93% (see his Figure 1, p. 1793).

We use R = 3,000 replications of each DGP. We also use 3,000 realizations each of χ2(1) and χ2(k − 1) random variables to simulate the critical values of Moreira's LRM statistic. For the results reported in Tables 1 and 2, which follow, we use R = 10,000 replications. We refer to Π1 = 0.1 and 1.0 as the “weak” and “strong” instrument cases, respectively. The value of ρuV allows the degree of endogeneity of Y to be varied. Whereas for ρuV = 0, Y is exogenous, Y is strongly endogenous for ρuV = 0.99. We include the just-identified case, k = 1, and two overidentified cases, k = 5 and 10.
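The empirical size computation can be sketched as follows for an AR-type statistic with k = 5 instruments (our own self-contained code; because the statistic is evaluated at the true θ0 = 0, the residual equals the structural error and the first-stage coefficient Π1 does not enter, consistent with the null rejection rates of AR being unaffected by instrument strength):

```python
import numpy as np

def ar_empirical_size(n=100, k=5, rho_uv=0.5, reps=500, seed=7):
    """Null rejection rate of an AR-type test of H0: theta0 = 0 at the 5% level."""
    rng = np.random.default_rng(seed)
    crit = 11.070                            # chi2(5) 5% critical value
    rejections = 0
    for _ in range(reps):
        Z = rng.standard_normal((n, k))
        e = rng.standard_normal((n, 2))
        u = rho_uv * e[:, 0] + np.sqrt(1.0 - rho_uv**2) * e[:, 1]
        # Under H0 the residual y - Y * 0 equals the structural error u.
        Pu = Z @ np.linalg.solve(Z.T @ Z, Z.T @ u)
        s_uu = (u @ u - u @ Pu) / (n - k)
        rejections += (u @ Pu) / s_uu > crit
    return rejections / reps

size = ar_empirical_size()
```

With a moderate number of replications the rejection rate should lie near the 5% nominal level, up to simulation noise and the finite-sample gap between the χ2(k) approximation and the exact F-based distribution.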

Size results for Design (I) at 5% significance level

Size results for Design (IHET) at 5% significance level

We now describe the results for Designs (I) and (IHET) given in Tables 1 and 2, respectively, which exclude those for GELREL, SET, LMET, AR, and the case n = 100. The qualitative features of the size results for GELREL, SET, and LMET are identical to their ET/EL counterparts. For k = 1, AR coincides with K, and, for k > 1, we find that in most cases K has better size properties than AR. We report K and 2SLSHOM for the homoskedastic and KHET and 2SLSHET for the heteroskedastic design. We now discuss the results for the homoskedastic case of Design (I).

First, we consider the separate effects of Π1,n, ρuV, and k on the size results.

The most important finding is that the empirical sizes of all statistics except 2SLS show little or no dependence on Π1 (additional Monte Carlo results show that this holds true even for the completely unidentified case Π1 = 0). However, those for 2SLS depend crucially on the strength or weakness of identification. Although for Π1 = 1.0, 2SLS has reliable size properties in many cases, with weak instruments its size ranges over the entire interval from 0% to 100%.

In general, increasing n leads to more accurate size across all statistics, especially for those that perform poorly for smaller n. For example, the 2SLS statistics, GELRET, and SEL severely overreject in overidentified and strongly endogenous cases when n = 50. Even though they still overreject for n = 250, the rejection rates are much closer to the 5% significance level.

It is easily shown that the rejection rates under the null hypothesis for AR and GELRρ are independent of the value of ρuV. The slight dependence of the size results in Table 1 on ρuV results from the use of different samples. For the remaining statistics other than 2SLS, there is little dependence of the results on ρuV and no clear pattern to how ρuV affects their size properties. For 2SLS, however, increasing ρuV leads to severe overrejection when combined with overidentification, especially in the weak instrument case.

Increasing the number of instruments k usually leads to overrejection for 2SLS, GELRET, and SEL. For 2SLS this is especially true under weak identification and/or strong endogeneity. All the other statistics show little dependence on k.

We now turn to a comparison of performance across statistics. The 2SLS statistics should not be used with weak instruments or in strongly endogenous overidentified situations; in all other cases, 2SLS has competitive size properties. The statistics GELRET and SEL severely overreject in overidentified problems when the sample size is small. Overall, then, the statistics LMEL, K, and LRM lead to the best size results. The statistics LMCUE and GELRCUE are only runners-up because they tend to underreject, especially in overidentified situations. Across the 36 experiments in Table 1, the sizes of LMEL, LMCUE, GELRCUE, K, and LRM are in the intervals [4.0,6.2], [1.6,5.3], [1.3,5.3], [4.8,8.6], and [4.3,10.3], respectively. The statistics K and LRM usually slightly overreject. In 22 of the 36 cases, the size of LMEL comes closest to the 5% significance level across all the statistics. The corresponding numbers for LMCUE, GELRCUE, K, and LRM are 8, 8, 9, and 7. Based on Design (I), LMEL seems to have a slight advantage over the remaining statistics.

We now discuss the size results for Design (IHET) summarized in Table 2. As most findings are similar to those discussed for Design (I), we only describe the new features.

The statistics 2SLSHOM, K, and LRM perform uniformly worse than in Design (I). Tests based on these statistics severely overreject, especially in the just-identified case. Their performance does not improve when n increases. We therefore report results for the heteroskedasticity robust versions 2SLSHET and KHET. Their size properties and those of the statistics based on GEL methods do not appear to be negatively influenced by the presence of conditional heteroskedasticity. This is to be expected from our earlier theoretical discussion of the GEL statistics, which does not assume conditional homoskedasticity. Of course, 2SLSHET still suffers in weakly identified models, and GELRET and SEL perform poorly in overidentified situations for small n. Rejection rates of the test statistics LMEL, LMCUE, GELRCUE, KHET, and LRM across the 36 experiments of Table 2 are in the intervals [3.6,6.4], [1.6,5.1], [1.0,5.1], [4.3,9.2], and [7.8,28.8], respectively. In 21 of the 36 cases, the size of LMEL comes closest to the 5% significance level across all the statistics. The test statistic KHET wins in 18 cases.

In summary, the only statistics with accurate size properties across all experiments of Designs (I) and (IHET) are LMEL, LMCUE, GELRCUE, and KHET. Based on the preceding results it seems that LMEL enjoys a slight advantage over the other statistics. From the 72 cases in Tables 1 and 2 the empirical size of LMEL is closest to the nominal 5% in 43 cases across all statistics.

The qualitative features of the size results for Designs (II)–(IV) and (IIHET)–(IVHET) are generally very similar to those of their normal counterparts, Designs (I) and (IHET). For this reason, we do not include additional tables for these designs. One striking difference, however, occurs for 2SLS under weak identification with χ2(1) (Design (III)) and bimodal errors (Design (IV)). Rejection rates across these 54 combinations for 2SLSHOM are in the intervals [0.1,7.1] and [0.0,5.4], respectively. Whereas with normal errors and weak identification 2SLS severely overrejects, with these error distributions it severely underrejects.

To summarize this size study, LMEL, LMCUE, GELRCUE, and KHET have reliable size properties across all designs that appear independent of both the strength or weakness of identification and possible conditional heteroskedasticity. The test statistic 2SLS performs very poorly in the presence of weak instruments. The LRM statistic performs well in homoskedastic cases but poorly otherwise.

5.4. Power Comparison

Empirical power curves are calculated for the preceding statistics and DGPs (5.1) corresponding to all 16 possible combinations of sample size n = 100, 250, number of instruments k = 5, 10, SF and RF error correlation ρuV = 0.5, 0.99, and RF coefficient Π1 = 0.1, 1.0 for each of the error distributions of Designs (I)–(III). Except for LRM, we report size-corrected power curves at the 5% significance level, using critical values calculated in the preceding size comparison. We do so because size correction of LRM is not straightforward as a result of the conditional construction of LRM and, as shown before, for Designs (I)–(III), LRM has empirical size very close to nominal at the 5% significance level.

We use R = 1,000 replications from the DGP (5.1) with various values of the true value θ0. The null hypothesis under test is again H0 : θ0 = 0. For weak identification (Π1 = 0.1), θ0 takes values in the interval [−4.0,4.0] whereas, with strong identification (Π1 = 1.0), θ0 ∈ [−0.4,0.4]. We use 1,000 realizations each of χ2(1) and χ2(k − 1) random variables to simulate the critical values of LRM. For those results reported in the figures that follow, we use 10,000 replications from (5.1).

Detailed results are presented only for the statistics LMEL, K, LRM, and 2SLSHET. The statistics LMCUE, LMEL, and LMET display a very similar performance across almost all scenarios. We therefore only report results for LMEL. We do not report power results for the statistics SEL and SET because, as seen earlier, their size properties appear to be quite poor for the sample sizes considered here. When k = 1, AR and K are numerically identical. In overidentified cases, K generally performs better than AR. We therefore do not report results for AR (see Kleibergen, 2002a, for a comparison of K and AR). Similarly, GELRCUE is numerically identical to LMρ for k = 1 but leads to a less powerful test for k > 1. Also EL and ET versions of GELRρ have rather unreliable size properties for the sample sizes considered here. Therefore we do not report detailed results for GELRρ.

We first focus on the separate effects of Π1, n, ρuV, and k on power.

With strong identification all statistics have a U-shaped power curve. With the exception of 2SLSHET, the lowest point of the power curve is usually achieved at θ0 = 0. In Designs (I) and (II), 2SLSHET is usually biased, taking on its lowest value at a negative θ0 value in the interval [−0.2,0.0]. When θ0 is weakly identified, the power curves of LMEL, K, and LRM are generally very flat across all θ0 values, often only slightly exceeding the significance level of the test. This is especially true for LMEL and K but less so for LRM, which is generally more powerful than the other two statistics in this situation. There is one exception when the power of the three tests is high. In Design (I) with ρuV = 0.99, although being flat at about 5% for positive θ0 values, the power curves reach a sharp peak of almost 100% around θ0 = −1. The reason for this anomaly is most easily explained in the case k = 1, where

. We have

, which in Design (I) with Π1 = 0.1 equals 1 + 2θ0 ρuV + (1.01)θ02. If ρuV = 0.99, this expression is minimized at around θ0 = −0.98, where it equals approximately 0.03. Therefore, this peak is caused by

taking on large values for θ0 in the neighborhood of −1.

For negative θ0 values with |θ0| > 1, power falls quickly, reaching between 20% and 50% across the different designs at θ0 = −4.
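The minimization claim above is elementary calculus; a quick numeric check of the quoted expression (using only the quadratic stated in the text, not the paper's full variance formula):

```python
import numpy as np

# In Design (I) with Pi_1 = 0.1, the text gives the expression
# 1 + 2*theta*rho_uV + 1.01*theta**2; at rho_uV = 0.99 it should be
# minimized near theta = -0.98 with minimum value about 0.03.
rho_uV = 0.99
theta = np.linspace(-2.0, 0.0, 200001)
f = 1.0 + 2.0 * theta * rho_uV + 1.01 * theta**2

print(theta[np.argmin(f)])  # analytic minimizer: -rho_uV/1.01 ≈ -0.9802
print(f.min())              # analytic minimum: 1 - rho_uV**2/1.01 ≈ 0.0296
```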

In contrast to the power curves of LMEL, K, and LRM, the power curve of 2SLSHET retains its U-shaped form for Π1 = 0.1. In many cases, the power curve reaches values close to 100% when |θ0| is close to 4.

As is to be expected, the tests are more powerful when n is increased from 100 to 250. This holds uniformly across all statistics and designs, with a more pronounced power increase in the strongly identified cases.

There does not seem to be a systematic effect due to ρuV as it varies with the specific design. For reasons explained previously, the shape of the power curves can change dramatically in Design (I) when ρuV is increased from 0.5 to 0.99 if Π1 = 0.1.

In most cases, there is little change in the power functions when k is increased from 5 to 10. In general, if the power function changes, then power is slightly lower for larger k.

We now compare the power functions across statistics. Figures 1a–c display the power curves of the four statistics for Designs (I)–(III) in the case Π1 = 1.0, n = 250, ρuV = 0.5, and k = 5 (the figures for Π1 = 0.1 and for the other parameter combinations are available upon request). The qualitative comparison for the other parameter combinations is very similar, and we therefore focus on these representative cases.

Power curves, strong instrument. (a) Normal errors, (b) t(2) errors, (c) χ2 errors.

When identification is weak, the test based on LRM is usually more powerful than those based on LMEL and K. The power gain from using LRM is quite substantial for negative θ0 values but less so for positive θ0. However, the Wald test 2SLSHET is by far the most powerful test in all three designs. Except for some small negative θ0 values, its power curve uniformly dominates the power curves of the other tests. Recall, though, that 2SLSHET has unreliable size properties under weak identification.

When identification is strong, LMEL uniformly dominates LRM and K in Designs (II) and (III) (see Figures 1b and 1c). However, LRM and K uniformly dominate LMEL in Design (I) (see Figure 1a). This result is to be expected. On the one hand, the LMEL test is based on nonparametric GEL methods. On the other hand, LRM and K are motivated within the normal model framework. Although the power gain of LMEL is small in Design (III), it is substantial in Design (II). Therefore, LMEL should be used when errors have thick tails.

With strong identification, the Wald test is the most powerful test for positive θ0 values. For negative θ0 values, its performance varies from being most powerful in Design (III) to least powerful in Design (I). These results confirm that the Wald test is a reasonable choice when identification is strong.

Overall, therefore, the power study does not lead to an unambiguous ranking of the different tests considered here. Which test is most appropriate depends on the particular error distribution and degree of identification. We find that with strong identification and errors with thick tails or asymmetric errors, LMEL seems to be the best choice whereas with normal errors LRM and K appear preferable. When identification is weak, LRM generally dominates K and LMEL in terms of power although as noted previously the size properties of LRM deteriorate substantially in the presence of heteroskedasticity.

APPENDIX: Proofs

Proof of Equation (2.4). Let fi := supθ∈Θ ∥gi(θ)∥. Define K := supi≥1 Efiξ < ∞. Let ε > 0 and choose a positive C ∈ R such that K/Cξ < ε. Then

Pr(max1≤i≤n fi > Cn1/ξ) ≤ Σi=1,…,n Pr(fi > Cn1/ξ) ≤ Σi=1,…,n Efiξ/(Cξn) ≤ K/Cξ < ε,

where the first inequality follows from Pr(A ∪ B) ≤ Pr(A) + Pr(B) and the second uses the Markov inequality. It follows that (max1≤i≤n fi)n−1/ξ = Op(1) and thus (max1≤i≤n fi) = op(n1/2) by ξ > 2. Thus (2.4) implies M(i). █

Proof of Lemma 1. ID holds trivially. By (2.2) and (2.3), gi(θ) = (yi − Yi′θ)Zi = Zi(Zi′Π + Vi′)(θ0 − θ) + Ziui. Next max1≤i≤n supθ∈Θ ∥gi(θ)∥ = op(n1/2) is established. An application of the Borel–Cantelli lemma shows that for real-valued i.i.d. random variables Wi such that EWi2 < ∞, max1≤i≤n |Wi| = op(n1/2); see Owen (1990, Lemma 3) for a proof. By the definition of gi(θ) and the triangle inequality,

supθ∈Θ ∥gi(θ)∥ ≤ C(∥Zi∥2∥Π∥ + ∥Zi∥∥Vi∥ + ∥Zi∥|ui|), where C := max{supθ∈Θ ∥θ0 − θ∥, 1} < ∞ by compactness of Θ.

By Assumption M′(iii), we can apply the just-mentioned result to each of the three summands in the preceding inequality, which proves the result.

Next M(ii) is shown. By the i.i.d. assumption, Ω(θ) = limn→∞ Egi(θ)gi(θ)′, and continuity and boundedness in M(ii) follow immediately from M′(iii) and compactness of Θ. The same is true for the Op(1) statement in M(ii). Finally, uniform convergence follows from the weak law of large numbers and compactness of Θ.

Next M(iii) is proved. Because

, we only have to deal with the empirical process

Finite-dimensional joint convergence follows from the CLT and M′(iii), and stochastic equicontinuity follows from the fact that (θ0 − θ) enters Ψn(·,θ) linearly:

where the last expression is bounded by δOp(1) by the CLT. Furthermore, Θ is compact by assumption. The proposition in Andrews (1994, p. 2251) can thus be applied, which yields the desired result. █

The following proofs are straightforward generalizations of the Guggenberger (2003) proofs for the i.i.d. linear model to the more general context considered here. We require three lemmas that are modified versions of Lemmas A1–A3 in Newey and Smith (2004) for the proofs of our theorems. These modifications are necessary because unlike Newey and Smith we need to work with weakly and strongly identified parameters and do not make an i.i.d. assumption.

For each n ∈ N, let Θn ⊂ Θ. Let cn := n−1/2 max1≤i≤n supθ∈Θn ∥gi(θ)∥. Let Λn := {λ ∈ Rk : ∥λ∥ ≤ n−1/2cn−1/2} if cn > 0 and Λn = Rk otherwise. Write “u.w.p.a.1” for “uniformly over θ ∈ Θn w.p.a.1.”

LEMMA 7. Assume max1≤i≤n supθ∈Θn ∥gi(θ)∥ = op(n1/2).

Then

, where

is defined in (2.5).

Proof. The case cn = 0 is trivial, and thus wlog cn ≠ 0 can be assumed. By assumption cn = op(1), and the first part of the statement follows from

which also immediately implies the second part. █
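These lemmas control the inner GEL maximization of n−1 Σ ρ(λ′gi(θ)) over λ. As a concrete illustration (a minimal numerical sketch of ours, not the paper's implementation), the exponential-tilting member of the GEL family, ρ(v) = 1 − exp(v), which satisfies ρ(0) = 0 and ρ1(0) = ρ2(0) = −1, can be solved by Newton's method:

```python
import numpy as np

def et_lambda(G, iters=25):
    """Maximize (1/n) * sum_i rho(lam'g_i) over lam for rho(v) = 1 - exp(v).

    G is the (n, k) array of moment evaluations g_i(theta); the objective
    is globally concave in lam, so full Newton steps from lam = 0 converge."""
    n, k = G.shape
    lam = np.zeros(k)
    for _ in range(iters):
        w = np.exp(G @ lam)                    # exp(lam'g_i); rho1(v) = rho2(v) = -exp(v)
        grad = -(G * w[:, None]).mean(axis=0)  # (1/n) sum_i rho1(lam'g_i) g_i
        hess = -(G.T * w) @ G / n              # (1/n) sum_i rho2(lam'g_i) g_i g_i'
        lam = lam - np.linalg.solve(hess, grad)
    return lam
```

At the maximizer, the tilted weights proportional to exp(λ′gi) set the reweighted sample moments to zero, which is the empirical analogue exploited throughout the appendix.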

LEMMA 8. Suppose

for some

uniformly over θ ∈ Θn and Assumption ρ holds.

Then

satisfying

exists u.w.p.a.1,

uniformly over θ ∈ Θn.

Proof. Without loss of generality cn ≠ 0, and thus Λn can be assumed compact. For θ ∈ Θn, let λθ ∈ Λn be such that

. Such a λθ ∈ Λn exists u.w.p.a.1 because a continuous function takes on its maximum on a compact set and by Lemma 7 and Assumption ρ,

(as a function in λ for fixed θ) is C2 on some open neighborhood of Λn u.w.p.a.1. We now show that actually

u.w.p.a.1, which then proves the first part of the lemma. By a second-order Taylor expansion around λ = 0, there is a λθ* on the line segment joining 0 and λθ such that for some positive constants C1 and C2

u.w.p.a.1, where the second inequality follows as max1≤i≤n ρ2(λθ*′gi(θ)) < −½ u.w.p.a.1 from Lemma 7, continuity of ρ2(·) at zero, and ρ2(0) = −1. The last inequality follows from

. Now, (A.1) implies that

, the latter being Op(n−1/2) uniformly over θ ∈ Θn by assumption. It follows that λθ ∈ int(Λn) u.w.p.a.1. To prove this, let ε > 0. Because λθ = Op(n−1/2) uniformly over θ ∈ Θn and cn = op(1), there exist Mε < ∞ and nε ∈ N such that Pr(∥n1/2λθ∥ ≤ Mε) > 1 − ε/2 uniformly over θ ∈ Θn and Pr(cn−1/2 > Mε) > 1 − ε/2 for all n ≥ nε. Then Pr(λθ ∈ int(Λn)) = Pr(∥n1/2λθ∥ < cn−1/2) ≥ Pr((∥n1/2λθ∥ ≤ Mε) ∧ (cn−1/2 > Mε)) > 1 − ε for n ≥ nε uniformly over θ ∈ Θn.

Hence, the FOC for an interior maximum

hold at λ = λθ u.w.p.a.1. By Lemma 7,

, and thus by concavity of

(as a function in λ for fixed θ) and convexity of

it follows that

, which implies the first part of the lemma. From before, λθ = Op(n−1/2) uniformly over θ ∈ Θn. Thus the second part and, by (A.1), the third part of the lemma follow. █
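In outline, the expansion (A.1) delivers the Op(n−1/2) bound on λθ via the standard GEL argument; a sketch in the lemma's notation, with C1 the generic constant of the text:

```latex
0 = \hat{P}_\rho(\theta,0) \le \hat{P}_\rho(\theta,\lambda_\theta)
  = \lambda_\theta'\,\hat{g}(\theta)
    + \tfrac{1}{2}\,\lambda_\theta'\Big[n^{-1}\sum_{i=1}^{n}
      \rho_2\big(\lambda_\theta^{*\prime} g_i(\theta)\big)\,
      g_i(\theta)g_i(\theta)'\Big]\lambda_\theta
  \le \|\lambda_\theta\|\,\|\hat{g}(\theta)\| - C_1\|\lambda_\theta\|^2,
```

so that ∥λθ∥ ≤ C1−1∥ĝ(θ)∥ = Op(n−1/2) uniformly over θ ∈ Θn.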

Suppose Θ1 × Θ2 ⊂ Θ, ΘiRpi, p1 + p2 = p. Partition θ0 = (θ01′,θ02′)′ accordingly and assume θ02 ∈ Θ2. For d1 ∈ Θ1 define

By u.w.p.a.1 we denote “uniformly over d1 ∈ Θ1 w.p.a.1.”

LEMMA 9. Suppose max1≤i≤n supθ∈Θ1×Θ2 ∥gi(θ)∥ = op(n1/2),

for some

uniformly over d1 ∈ Θ1, and Assumption ρ holds.

Then

uniformly over d1 ∈ Θ1.

Proof. Without loss of generality

can be assumed. Define

. Note that λ ∈ Λn and thus

uniformly over θ ∈ Θn w.p.a.1 (see Lemma 7 with Θn := Θ1 × Θ2). By a second-order Taylor expansion around λ = 0, there is a

on the line segment joining 0 and λ such that for some positive constants C1 and C2

u.w.p.a.1, where the first inequality follows from Lemma 7, which implies that

. The second inequality follows by

. The definition of

implies

uniformly over d1 ∈ Θ1. Combining equations (A.2) and (A.3) implies

uniformly over d1 ∈ Θ1. █

Proof of Theorem 2. (i) We first show consistency of

. By Assumption ID and M(iii)

, where m2(β) = 0 if and only if β = β0. Therefore,

is a sufficient condition for consistency of

. Applying Lemma 8 to the case Θn = {θ0} gives

. Assumption M(ii) implies

for some κ < ∞, and thus Lemma 9 (applied to the case p1 = 0, Θ2 = Θ) implies

.

Next we establish n1/2-consistency of

. By consistency of

and Assumption M(ii)

for some ε > 0, and thus Lemma 8 for the case

implies that the FOC

have to hold at

, where

and λ(θ), for given θ ∈ Θ, is defined in Lemma 8. Expanding the FOC in λ around 0, there exists a mean value

between

(that may be different for each row) such that

where the matrix

has been implicitly defined. Because

, Lemma 7 and Assumption ρ imply that

. By Assumption M(ii), it follows that

and thus

is invertible w.p.a.1 and

. Therefore

w.p.a.1. Inserting this into a second-order Taylor expansion for

(with mean value λ* as in (A.1)) it follows that

The same argument as for

proves

. We therefore have

. By the definition of

,

By Assumption ID, we have up to op(1) terms that

. The same analysis as in the proof of Lemma A1 in Stock and Wright (2000, p. 1091, line six from the top) can now be applied to prove n1/2-consistency of

, where the symmetric matrix

plays the role of

in Stock and Wright. Note that in equation (A.4) in Stock and Wright, Assumption M(iii) of bounded sample paths w.p.a.1 is used. Finally, note that

is bounded away from zero w.p.a.1.

(ii) By Assumption M(iii)

and by ID we have for some mean-vector β between β0 and β0 + n−1/2b (that may differ across rows)

Because the latter expression is bounded, it follows that

, where u.w.p.a.1 stands for “uniformly over (α,b) ∈ A × BM w.p.a.1.” Therefore, by Lemma 8, λ(θαb) such that

exists u.w.p.a.1 and λ(θαb) = Op(n−1/2) uniformly over (α,b) ∈ A × BM. This implies that the FOC

have to hold at λ = λ(θαb) and θ = θαb u.w.p.a.1. Expanding the FOC and using the same steps and notation as in part (i), it follows that

, and upon inserting this into a second-order Taylor expansion of

we have

u.w.p.a.1. The matrices

converge to Ω((α′,β0′)′) uniformly over A × BM. By M(iii),

, and therefore

on A × BM.

By part (i) of the proof and Lemma 3.2.1 in van der Vaart and Wellner (1996, p. 286) it follows that

For given α ∈ A, one can calculate arg minb∈RpB Pαb by solving the FOC for b. Writing Ω for Ω((α′,β0′)′) and M2 for M2(θ0), the result is

This holds in particular for α = α*. It follows that α* = arg minα∈A Pαb*(α). █

Proof of Theorem 3. Applying Lemma 8 to the case Θn = {θ}, it follows that

exists such that

. Using the same steps and notation as in the proof of Theorem 2 leads to

w.p.a.1, where by Mθ(ii) both

converge in probability to Δ(θ). By Mθ(iii),

from which the result follows. █

Proof of Theorem 4. Using Mθ(i)–(iii) and an argument similar to the one that led to (A.5) we have

and therefore the statement of the theorem involving Sρ(θ) follows immediately from the one for LMρ(θ). Hence we only deal with the statistic LMρ(θ) given in equation (3.8).

First, we show that the matrix D* is asymptotically independent of

. For notational convenience from now on we omit the argument θ; e.g., we write gi for gi(θ). By a mean-value expansion about 0 we have ρ1(λ′gi) = −1 + ρ2(ξi)gi′λ for a mean value ξi between 0 and λ′gi, and thus by (A.8) and the definition of Λ we have

where for the last equality we use (3.7) and Assumptions Mθ(v)–(vi). By Assumption Mθ(v) it thus follows that

where w1 := vec(0,−M2(θ0),0) ∈ RkpA+kpB+k and

M and v have dimensions (kpA + kpB + k) × (kpA + k) and (kpA + k) × 1, respectively. By Assumption ID, Mθ(vii), and (3.7), v →d N(w2,V(θ)), where w2 := ((vec M1A)′,m1′)′ and M1A are the first pA columns of M1. Therefore

where Ψ := ΔAA − ΔA Δ−1ΔA′ is positive definite. Equation (A.9) proves that

are asymptotically independent.

We now derive the asymptotic distribution of LMρ(θ). Denote by D and g the limiting normal distributions of

, respectively (see equation (A.9)). Subsequently we show that the function h : Rk×p → Rp×k defined by h(D) := (D′Δ−1D)−1/2D′ for D ∈ Rk×p is continuous on a set C ⊂ Rk×p with Pr(D ∈ C) = 1. By the continuous mapping theorem and Mθ(v) we have

By the independence of D and g, the latter random variable is distributed as W + ζ, where the random p-vector W is defined as

ζ ∼ N(0,Ip), and W and ζ are independent. Note that for θ = θ0, W ≡ 0. From (A.10) the statement of the theorem follows.

We now prove the continuity claim for h. Note that h is continuous at each D that has full column rank. It is therefore sufficient to show that D has full column rank a.s. From (A.9) it follows that the last pB columns of D equal −M2(θ0), which has full column rank by assumption. Define

and the k × p-matrix

has linearly dependent columns}. Clearly, O is closed and therefore Lebesgue-measurable. Furthermore, O is contained in the zero set of a nonzero polynomial in the matrix entries (the sum of squared p × p minors, which is not identically zero because −M2(θ0) has full column rank) and thus has Lebesgue measure 0. For the first pA columns of D, DpA say, it has been shown that vecDpA is normally distributed with full rank covariance matrix Ψ. This implies that for any measurable set O* ⊂ RkpA with Lebesgue measure 0, it holds that Pr(vec(DpA) ∈ O*) = 0, in particular, for O* = O. This proves the continuity claim for h. █
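The a.s. full-rank property used in this step is easy to see numerically; the following is an illustration of ours (with arbitrary dimensions), not part of the proof:

```python
import numpy as np

# A k x pA matrix with i.i.d. standard normal entries -- i.e., a vectorized
# normal vector with full-rank covariance -- has full column rank with
# probability one; rank deficiency is confined to a Lebesgue-null set.
rng = np.random.default_rng(0)
k, p_A = 5, 3
ranks = [np.linalg.matrix_rank(rng.standard_normal((k, p_A)))
         for _ in range(1000)]
print(min(ranks))  # full column rank p_A = 3 in every draw
```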

Proof of Theorem 5. By Assumptions

, and by Lemmas 8 and 9 (applied to Θn = {θaβ0} and Θ1 = {a}, Θ2 = B, respectively) we have

. Assumption IDα then implies consistency of

. Applying Lemma 8 to the case

implies that the FOC for λ must hold in the definition of

(see equation (A.4)). Then repeating the analysis that leads to (A.6) in the proof of Theorem 2, we have by Mα(ii)

The next goal is to derive the asymptotic distribution of

. Our analysis follows Newey and Smith (2004); see their proof of Theorem 3.2. Differentiating the FOC (A.4) with respect to λ yields the matrix

, which by Mα(ii) converges in probability to −Δ(θaβ0), which is nonsingular. Therefore, the implicit function theorem implies that there is a neighborhood of

where the solution to the FOC, say

, is continuously differentiable w.p.a.1. The envelope theorem then implies

w.p.a.1. Also, a mean-value expansion of (A.4) in (β,λ) about (β0,0) yields (where gi(θ) inside ρ1 is kept constant at

)

where (β′,λ′) are mean values on the line segment that joins

that may be different for each row. Combining the pB rows of (A.13) with the k rows of (A.14) we get

where the (pB + k) × (pB + k) matrix M has been implicitly defined. By Mα(ii) and Mα(iv)–(vi) the matrix M converges in probability to M, where (writing M for M((α020)))

and (omitting the argument θaβ0)

It follows that M is nonsingular w.p.a.1. Equation (A.15) implies that w.p.a.1

An expansion of

in β around β0 and the preceding lead to

for some appropriate mean value θ. Note that

which has rank kpB. From (A.12), GELRρsub(a) →d ξ′Δ(θaβ0)−1MM(Δ(θaβ0))ξ, where ξ ∼ N(m1aβ0),Δ(θaβ0)), which concludes the proof. █

Proof of Theorem 6. As in the proof of Theorem 5,

. Hence, the result for LMρsub(a) implies the result for Sρsub(a).

As in the proof of Theorem 4 renormalize D* := Dρ(a)Λ, where the diagonal pA × pA matrix Λ := diag(n1/2,…,n1/2,1,…,1) has first pA1 diagonal elements equal to n1/2 and the remaining pA2 elements equal to unity. We now show that

are asymptotically independent. By a mean-value expansion about θaβ0 and Assumption Mα(vii) we have for some mean value

(that may be different for each row)

where we have used (A.16) for the last equation. Assumptions Mα(vii) and IDα imply

(recall that m2 does not depend on α1) and thus

Proceeding exactly as in the proof of Theorem 4, using (A.17), (A.19), and Assumptions Mα(vii)–(ix), it follows that

where M ∈ R(kpA1+kpA2+k)×(kpA1+k) and

where the arguments (α020) in M and (∂m2 /∂α2) and θaβ0 in ΔA1 and Δ are omitted. By Mα(x), v is asymptotically normal with full rank covariance matrix Vα(θaβ0), and thus the asymptotic covariance matrix of

is given by MVα(θaβ0)M′. For independence of

, the upper right k(pA1 + pA2) × k-submatrix of MVα(θaβ0)M′ must be 0. This is clear for the kpA2 × k-dimensional submatrix, and we only have to show that the kpA1 × k upper right submatrix

is 0. Using (A.18), the matrix in (A.21) equals −ΔA1 Δ−1PM(Δ)MM(Δ)Δ, which is clearly 0. This proves the independence claim.

Now denote by D and g the limiting normal distributions of

, implied by (A.20). Recall M(a) = Δ−1MM(Δ) (see equation (4.2)). If the function h : Rk×pA → RpA×k defined by h(D) := (D′M(a)D)−1/2D′ for D ∈ Rk×pA is continuous on a set C ⊂ Rk×pA with Pr(D ∈ C) = 1, then by the continuous mapping theorem

By (A.17) and (A.18) the latter variable is distributed as Wα(a) + ζα, where

Therefore the theorem is proved once we have proved the continuity claim for h. For this step of the proof we need the positive definiteness assumption for Vα(θaβ0) in Mα(x). It is enough to show that with probability 1, rank(MM(Δ)D) = pA. Because the span of the columns of M equals the kernel of MM(Δ) and rank(M) = pB, the latter condition holds if rank(M,D) = p. Denote by DpA2 the last pA2 columns of D, which by (A.20) equal −(∂m2 /∂α2). By Assumption IDα, the matrix (∂m2 /∂(α2′,β′)′)((α020)) has rank pA2 + pB, and it remains to show that with probability one, this matrix is linearly independent of the first pA1 columns of D, DpA1 say. Using (A.20) and Vα(θaβ0) > 0, the covariance matrix of vecDpA1 is easily shown to have full rank pA1k. An argument analogous to the last step in the proof of Theorem 4 can then be applied to conclude the proof. █

REFERENCES

Anderson, T.W. & H. Rubin (1949) Estimators of the parameters of a single equation in a complete set of stochastic equations. Annals of Mathematical Statistics 21, 570–582.
Andrews, D.W.K. (1994) Empirical process methods in econometrics. In R. Engle & D. McFadden (eds.), Handbook of Econometrics, vol. 4, 2247–2294. North-Holland.
Brown, B.W. & W.K. Newey (1998) Efficient semiparametric estimation of expectations. Econometrica 66, 453–464.
Caner, M. (2003) Exponential Tilting with Weak Instruments: Estimation and Testing. Working paper, University of Pittsburgh.
Dufour, J. (1997) Some impossibility theorems in econometrics with applications to structural and dynamic models. Econometrica 65, 1365–1387.
Guggenberger, P. (2003) Econometric essays on generalized empirical likelihood, long-memory time series, and volatility. Ph.D. thesis, Yale University.
Guggenberger, P. & R.J. Smith (2003) Generalized Empirical Likelihood Tests in Time Series Models with Potential Identification Failure. Working paper, UCLA and University of Warwick.
Guggenberger, P. & M. Wolf (2004) Subsampling Tests of Parameter Hypotheses and Overidentifying Restrictions with Possible Failure of Identification. Working paper, UCLA.
Hansen, L.P. (1982) Large sample properties of generalized method of moments estimators. Econometrica 50, 1029–1054.
Hansen, L.P., J. Heaton, & A. Yaron (1996) Finite-sample properties of some alternative GMM estimators. Journal of Business & Economic Statistics 14, 262–280.
Imbens, G. (1997) One-step estimators for over-identified generalized method of moments models. Review of Economic Studies 64, 359–383.
Imbens, G. (2002) Generalized method of moments and empirical likelihood. Journal of Business & Economic Statistics 20, 493–506.
Imbens, G., R.H. Spady, & P. Johnson (1998) Information theoretic approaches to inference in moment condition models. Econometrica 66, 333–357.
Kitamura, Y. (1997) Empirical likelihood methods with weakly dependent processes. Annals of Statistics 25, 2084–2102.
Kitamura, Y. & M. Stutzer (1997) An information-theoretic alternative to generalized method of moments estimation. Econometrica 65, 861–874.
Kleibergen, F. (2001) Testing parameters in GMM without assuming that they are identified. Econometrica, forthcoming.
Kleibergen, F. (2002a) Pivotal statistics for testing structural parameters in instrumental variables regression. Econometrica 70, 1781–1805.
Kleibergen, F. (2002b) Two Independent Pivotal Statistics That Test Location and Misspecification and Add-Up to the Anderson–Rubin Statistic. Working paper, Brown University.
Moreira, M.J. (2003) A conditional likelihood ratio test for structural models. Econometrica 71, 1027–1048.
Nelson, C.R. & R. Startz (1990) Some further results on the exact small sample properties of the instrumental variable estimator. Econometrica 58, 967–976.
Newey, W.K. (1985) Generalized method of moments specification testing. Journal of Econometrics 29, 229–256.
Newey, W.K. & R.J. Smith (2004) Higher order properties of GMM and generalized empirical likelihood estimators. Econometrica 72, 219–255.
Newey, W.K. & K.D. West (1987) Hypothesis testing with efficient method of moments estimation. International Economic Review 28, 777–787.
Otsu, T. (2003) Generalized Empirical Likelihood Inference under Weak Identification. Working paper, University of Wisconsin.
Owen, A. (1988) Empirical likelihood ratio confidence intervals for a single functional. Biometrika 75, 237–249.
Owen, A. (1990) Empirical likelihood ratio confidence regions. Annals of Statistics 18, 90–120.
Pakes, A. & D. Pollard (1989) Simulation and the asymptotics of optimization estimators. Econometrica 57, 1027–1057.
Phillips, P.C.B. (1984) The exact distribution of LIML: I. International Economic Review 25, 249–261.
Phillips, P.C.B. (1989) Partially identified econometric models. Econometric Theory 5, 181–240.
Qin, J. & J. Lawless (1994) Empirical likelihood and general estimating equations. Annals of Statistics 22, 300–325.
Smith, R.J. (1997) Alternative semi-parametric likelihood approaches to generalized method of moments estimation. Economic Journal 107, 503–519.
Smith, R.J. (2001) GEL Criteria for Moment Condition Models. Working paper, University of Bristol. Revised version CWP 19/04, cemmap, IFS and UCL. http://cemmap.ifs.org.uk/wps/cwp0419.pdf.
Staiger, D. & J.H. Stock (1997) Instrumental variables regression with weak instruments. Econometrica 65, 557–586.
Startz, R., E. Zivot, & C.R. Nelson (2004) Improved inference in weakly identified instrumental variables regression. In Frontiers of Analysis and Applied Research: Essays in Honor of Peter C.B. Phillips. Cambridge University Press.
Stock, J.H. & J.H. Wright (2000) GMM with weak identification. Econometrica 68, 1055–1096.
Stock, J.H., J.H. Wright, & M. Yogo (2002) A survey of weak instruments and weak identification in generalized method of moments. Journal of Business & Economic Statistics 20, 518–529.
van der Vaart, A.W. & J.A. Wellner (1996) Weak Convergence and Empirical Processes. Springer.
Wooldridge, J. (2002) Econometric Analysis of Cross Section and Panel Data. MIT Press.
Figure 0. Size results for Design (I) at 5% significance level.

Figure 1. Size results for Design (IHET) at 5% significance level.