1. INTRODUCTION
Consider the linear regression model
$$Y_i = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip} + \varepsilon_i, \qquad i = 1, \ldots, n, \tag{1}$$
where $\varepsilon_1,\ldots,\varepsilon_n$ are independent and identically distributed (i.i.d.) random variables with mean 0 and variance $\sigma^2$. For simplicity, we will assume that the predictors are centered to have mean 0 and that the intercept $\beta_0$ is always estimated by $\bar{Y}$. This assumption allows us to focus on estimation of $\beta_1,\ldots,\beta_p$, but it is not essential.
Throughout this paper, we will assume that the $x_i$'s are nearly collinear in the sense that the matrix
$$C_n = \frac{1}{n}\sum_{i=1}^n x_i x_i^T, \qquad x_i = (x_{i1},\ldots,x_{ip})^T, \tag{2}$$
is nonsingular for each n but that
$$C_n \longrightarrow C \quad \text{as } n \to \infty, \tag{3}$$
where C is singular; we will refer to such designs as “nearly singular.”
The exact definition of near singularity (which will be given in the next section) is an asymptotic one, but in practice, a nearly singular design might be characterized as one where the smallest eigenvalue (or eigenvalues) of $C_n$ is small compared to the trace of $C_n$. In some cases, the near singularity is, in fact, a consequence of the model; see, for example, Phillips (2001). It is well known that ordinary least squares (OLS) estimation, although unbiased, leads to parameter estimates with large variance in this situation. Several alternative methods, which trade bias for variance, have been proposed to deal with this problem; these methods include ridge regression (Hoerl and Kennard, 1970), partial least squares (Wold, 1984; Lorber, Wangen, and Kowalski, 1987), continuum regression (Stone and Brooks, 1990), the “lasso” (Tibshirani, 1996; Radchenko, 2004), and the smoothly clipped absolute deviation (SCAD) penalty of Fan and Li (2001).
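As a concrete illustration of this characterization, here is a minimal numerical sketch (our own, not from the paper), assuming equicorrelated normal predictors and arbitrarily chosen constants:

```python
# Sketch: a nearly singular design in the informal sense above, where the
# smallest eigenvalue of C_n is tiny relative to trace(C_n).  The setup
# (equicorrelated normal predictors, all constants) is illustrative only.
import numpy as np

rng = np.random.default_rng(0)
n, p, rho = 200, 5, 0.99

# Predictors with unit variance and common mutual correlation rho near 1.
Sigma = (1 - rho) * np.eye(p) + rho * np.ones((p, p))
X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
X -= X.mean(axis=0)                   # center the predictors, as assumed above

C_n = X.T @ X / n                     # the matrix C_n of (2)
eigvals = np.linalg.eigvalsh(C_n)     # eigenvalues in ascending order
print("smallest eigenvalue:", eigvals[0])
print("trace:              ", np.trace(C_n))
print("ratio:              ", eigvals[0] / np.trace(C_n))   # near 0
```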
Under the condition
$$\frac{1}{n}\max_{1\le i\le n} x_i^T C_n^{-1} x_i \longrightarrow 0, \tag{4}$$
the OLS estimator, which we will denote by $\widehat{\boldsymbol\beta}_n^{(0)}$, is asymptotically normal; more precisely, we have
$$\sqrt{n}\, C_n^{1/2}\bigl(\widehat{\boldsymbol\beta}_n^{(0)} - \boldsymbol\beta\bigr) \xrightarrow{d} N\bigl(0, \sigma^2 I\bigr) \tag{5}$$
(Srivastava, 1971). Note that the condition (4) can be rewritten as
$$\max_{1\le i\le n} x_i^T\Bigl(\sum_{j=1}^n x_j x_j^T\Bigr)^{-1} x_i \longrightarrow 0,$$
which, if $C_n$ tends to a nonsingular matrix $C$, is equivalent to
$$\frac{1}{n}\max_{1\le i\le n}\|x_i\|^2 \longrightarrow 0;$$
moreover, if $C$ is nonsingular, then the asymptotic normality in (5) can be expressed as
$$\sqrt{n}\bigl(\widehat{\boldsymbol\beta}_n^{(0)} - \boldsymbol\beta\bigr) \xrightarrow{d} N\bigl(0, \sigma^2 C^{-1}\bigr). \tag{6}$$
The convergence in (5) is very general and quite useful in practice even in the case of nearly singular designs; on the other hand, generalizing (5) to estimators obtained after shrinkage procedures (such as ridge regression or the lasso) or automatic model selection procedures (such as the Akaike information criterion [AIC]) is difficult. Results such as (6) (where the normalization is by a sequence of constants rather than a sequence of matrices) turn out to be easier to obtain and can give considerable insight into the properties (from a large-sample perspective) of the particular estimator.
2. PENALIZED LEAST SQUARES ESTIMATION
Regularization is a frequently used technique in statistics for obtaining estimators in situations where standard estimators are unstable or otherwise poorly defined.
We will consider estimating β by minimizing the penalized least squares (LS) criterion
$$\sum_{i=1}^n\bigl(Y_i - x_i^T\boldsymbol\beta\bigr)^2 + \lambda_n\sum_{j=1}^p |\beta_j|^\gamma \tag{7}$$
for a given $\lambda_n$ where $\gamma > 0$; the resulting estimator will be denoted throughout by $\widehat{\boldsymbol\beta}_n$, thereby suppressing its dependence on both $\gamma$ and $\lambda_n$, with $\widehat{\boldsymbol\beta}_n^{(0)}$ denoting the OLS estimator (with $\lambda_n = 0$). These so-called Bridge estimators were introduced by Frank and Friedman (1993) as a generalization of ridge regression (which corresponds to $\gamma = 2$). The special case $\gamma = 1$ corresponds to the lasso (Tibshirani, 1996). Properties of these estimators have been studied by, among others, Fu (1998), Knight and Fu (2000), Radchenko (2004), and Leeb and Pötscher (2006). For $\gamma \le 1$, the estimators minimizing (7) have the potentially attractive feature of being exactly 0 if $\lambda_n$ is sufficiently large, thus combining parameter estimation and model selection; indeed, model selection methods such as AIC and the Bayesian information criterion (BIC) can be viewed as limiting cases as $\gamma \to 0$. Also note that when $\gamma < 1$, the objective function (7) is not convex, and the estimator $\widehat{\boldsymbol\beta}_n$ can be quite sensitive to the choice of $\lambda_n$; more precisely, when $\gamma < 1$, the mapping from $\lambda_n$ to $\widehat{\boldsymbol\beta}_n$ will have jump discontinuities. The SCAD penalty of Fan and Li (2001) is a nonconvex penalty (indexed by two parameters) that combines the features of a lasso-type penalty (for small parameter values) with an AIC-type penalty (for larger parameter values).
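To make the sensitivity point concrete, here is a rough numerical sketch of Bridge estimation (our own illustration, not code from the paper): the criterion (7) is minimized directly for a given $\gamma$ and $\lambda_n$, with random restarts because the criterion is nonconvex for $\gamma < 1$:

```python
# Sketch of Bridge estimation: minimize criterion (7) numerically for a given
# gamma and lambda.  For gamma < 1 the criterion is nonconvex, so the
# derivative-free optimizer is restarted from several random points; even so,
# the fitted coefficients can jump abruptly as lambda varies, as noted above.
import numpy as np
from scipy.optimize import minimize

def bridge(X, y, gamma, lam, n_starts=20, seed=0):
    rng = np.random.default_rng(seed)

    def objective(b):
        return np.sum((y - X @ b) ** 2) + lam * np.sum(np.abs(b) ** gamma)

    best = None
    for _ in range(n_starts):
        res = minimize(objective, rng.normal(size=X.shape[1]),
                       method="Nelder-Mead",
                       options={"xatol": 1e-8, "fatol": 1e-8, "maxiter": 20000})
        if best is None or res.fun < best.fun:
            best = res
    return best.x

# Toy data (illustrative only): two strong coefficients and one zero.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
y = X @ np.array([2.0, 0.0, -1.0]) + rng.normal(size=100)
for lam in (1.0, 5.0, 25.0):
    print(lam, np.round(bridge(X, y, gamma=0.5, lam=lam), 3))
```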
We could also replace (7) by a penalized LS criterion that allows a separate tuning parameter for each coefficient:
$$\sum_{i=1}^n\bigl(Y_i - x_i^T\boldsymbol\beta\bigr)^2 + \sum_{j=1}^p \lambda_{nj}|\beta_j|^\gamma. \tag{8}$$
It is straightforward to generalize the results of this paper to estimators obtained by minimizing (8). The objective function (7) is more common in practice; typically, the predictors are scaled to have a variance of 1.
In this section, we will consider the asymptotic behavior of Bridge estimators when the design is nearly singular. More precisely, suppose that $C_n$ (as defined in (2)) is nonsingular but tends to a singular matrix $C$. In particular, we will assume that
$$a_n\bigl(C_n - C\bigr) \longrightarrow D_0 \tag{9}$$
for some sequence $\{a_n\}$ tending to infinity, where $D_0$ is positive definite on the null space of $C$ (i.e., $v^TD_0v > 0$ for nonzero $v$ with $Cv = 0$). Note that $D_0$ is necessarily nonnegative definite on the null space of $C$, so that it is not too stringent to require it to be positive definite on this null space. If $D_0$ is not positive definite on the null space of $C$, then we can modify (9) to obtain appropriate limiting distributions; this will be considered in the next section. We are also assuming (at least implicitly) that the near singularity affects all the predictors in the model. Applications where the condition (9) holds are given in Phillips (2001) and Gabaix and Ibragimov (2006). A referee has also pointed out a possible connection to the problem of weak instruments (cf. Stock, Wright, and Yogo, 2002), for example, in two-stage least squares estimation. Caner (2004, 2006) considers nearly singular designs in the context of generalized method of moments (GMM) estimation.
To obtain consistency and limiting distributions for $\widehat{\boldsymbol\beta}_n$ minimizing (7), we need to impose conditions on the sequence $\{\lambda_n\}$ so that it does not grow too quickly. In the case where $C$ is nonsingular, Knight and Fu (2000) showed that to obtain nondegenerate limiting distributions for $\widehat{\boldsymbol\beta}_n$, we require $\lambda_n/\sqrt{n} \to \lambda_0 \ge 0$ for $\gamma \ge 1$ and $\lambda_n/n^{\gamma/2} \to \lambda_0$ for $\gamma < 1$; for nearly singular designs, the growth condition on $\{\lambda_n\}$ will be somewhat more stringent.
It is worth mentioning that asymptotic results tend to undersell the value of shrinkage estimation in practice. The reason for this is simple. Shrinkage is used in practice to reduce the variability in the estimation of parameters that are “small” by forcing their estimates toward 0 (or setting them to 0). From an asymptotic perspective, however, the only parameters that are “small” are those that are exactly 0, as all other parameters can be (with probability tending to 1 as $n \to \infty$) distinguished from 0. Thus for a parameter $\beta_k$ whose value is nonzero, shrinkage generally produces bias in the resulting estimator that may or may not vanish asymptotically, and the resulting asymptotic bias is typically not compensated by a reduction in the asymptotic variance. On the other hand, if $\beta_k = 0$, then shrinkage will typically reduce the asymptotic variance of the estimator (without any asymptotic bias), which leads to a sort of superefficiency in these cases. Obviously, it is desirable to produce estimators that have no asymptotic bias when $\beta_k \neq 0$ and are superefficient when $\beta_k = 0$; such estimators exist but can be extremely sensitive to small perturbations in the data or to changes in the choice of tuning parameters. The asymptotic results are very useful in giving insight into how sensitive a given methodology is to the choice of tuning parameters.
A useful tool in the development of the asymptotic distribution of the penalized LS estimators is the notion of epi-convergence in distribution, which is discussed in Pflug (1995), Geyer (1994, 1996), and Knight (1999). A sequence of random lower semicontinuous functions $\{Z_n\}$ on $\mathbb{R}^p$ (taking values in $[-\infty,\infty]$) epi-converges in distribution to $Z$ (written $Z_n \xrightarrow{e\text{-}d} Z$) if, for all closed rectangles $R_1,\ldots,R_k$ with open interiors $R_1^o,\ldots,R_k^o$ and all real $a_1,\ldots,a_k$, we have
$$\Pr\Bigl(\bigcap_{j=1}^k\Bigl\{\inf_{u\in R_j}Z(u) > a_j\Bigr\}\Bigr) \le \liminf_{n\to\infty}\Pr\Bigl(\bigcap_{j=1}^k\Bigl\{\inf_{u\in R_j}Z_n(u) > a_j\Bigr\}\Bigr)$$
$$\le \limsup_{n\to\infty}\Pr\Bigl(\bigcap_{j=1}^k\Bigl\{\inf_{u\in R_j^o}Z_n(u) > a_j\Bigr\}\Bigr) \le \Pr\Bigl(\bigcap_{j=1}^k\Bigl\{\inf_{u\in R_j^o}Z(u) > a_j\Bigr\}\Bigr).$$
Epi-convergence in distribution is particularly useful for studying estimators that minimize (or maximize) objective functions subject to constraints and also estimators that minimize discontinuous (but lower semicontinuous) objective functions; the best-known notion of weak convergence for random functions, based on uniform convergence on compact sets (van der Vaart and Wellner, 1996), is poorly suited to these types of objective functions. However, that type of weak convergence, when applicable, does imply epi-convergence in distribution.
The limiting distributions of “argmin” estimators can often be determined via epi-convergence of the associated objective functions; in particular, if
$$Z_n \xrightarrow{e\text{-}d} Z \qquad\text{and}\qquad U_n = \operatorname*{argmin}(Z_n),$$
where $Z$ has a unique minimizer $U$, then
$$U_n \xrightarrow{d} U,$$
provided that $U_n = O_p(1)$. For an application of epi-convergence in distribution in the context of estimation in nonregular econometric models, see Chernozhukov and Hong (2004) and Chernozhukov (2005).
In the case where the $Z_n$'s are convex with $Z_n \xrightarrow{e\text{-}d} Z$ (where the minimizer $U$ of $Z$ is unique), the condition $U_n = O_p(1)$ is guaranteed, and so $U_n \xrightarrow{d} U$. Moreover, in the case of convexity, finite-dimensional weak convergence of $\{Z_n\}$ to $Z$ is sufficient for $Z_n \xrightarrow{e\text{-}d} Z$, provided that $Z$ is finite (with probability 1) on an open set (Geyer, 1996); however, for nearly singular designs, this latter condition is not satisfied, as the appropriate limiting objective function is finite only on a lower dimensional subspace of $\mathbb{R}^p$. Finite-dimensional weak convergence does imply epi-convergence in distribution if $\{Z_n\}$ is stochastically equi–lower semicontinuous as defined in Knight (1999).
We will now consider the asymptotic behavior of Bridge estimators under nearly singular designs, imposing fairly weak conditions. We will assume that $C_n$ is nonsingular for all $n$ and satisfies (9) for some sequence $\{a_n\}$. Define $b_n = (n/a_n)^{1/2}$, and define $Z_n$ to be
$$Z_n(u) = \sum_{i=1}^n\Bigl[\bigl(\varepsilon_i - x_i^Tu/b_n\bigr)^2 - \varepsilon_i^2\Bigr] + \lambda_n\sum_{j=1}^p\Bigl[\bigl|\beta_j + u_j/b_n\bigr|^\gamma - |\beta_j|^\gamma\Bigr]. \tag{10}$$
If $\widehat{\boldsymbol\beta}_n$ minimizes (7), then the minimizer of (10) is simply $U_n = b_n\bigl(\widehat{\boldsymbol\beta}_n - \boldsymbol\beta\bigr)$; the objective function $Z_n$ in (10) is simply a rescaled version of the objective function (7) with constants subtracted to ensure convergence. Note that because $b_n = (n/a_n)^{1/2} = o(n^{1/2})$, the estimators will have a slower rate of convergence than when $C$ is nonsingular.
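To see where (10) comes from, evaluate (7) at $\boldsymbol\beta + u/b_n$, subtract its value at $\boldsymbol\beta$, and use $Y_i - x_i^T\boldsymbol\beta = \varepsilon_i$ (which is valid under the centering assumptions of Section 1) together with $n/b_n^2 = a_n$:
$$Z_n(u) = -\frac{2}{b_n}\sum_{i=1}^n \varepsilon_i\,x_i^Tu + a_n\,u^TC_nu + \lambda_n\sum_{j=1}^p\Bigl[\bigl|\beta_j + u_j/b_n\bigr|^\gamma - |\beta_j|^\gamma\Bigr].$$
For $u$ in the null space of $C$, $a_nu^TC_nu = a_nu^T(C_n - C)u \to u^TD_0u$ by (9), whereas $a_nu^TC_nu \to \infty$ off the null space; this is what confines the limiting objective function to the null space of $C$.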
The following result was given in Knight and Fu (2000).
THEOREM 1. Assume the linear model (1), where $C_n$ in (2) satisfies (3), (4), and (9), where $C$ is singular and $D_0$ is positive definite on the null space of $C$. Define $W$ to be a zero-mean multivariate normal random vector such that $\operatorname{Var}(u^TW) = \sigma^2u^TD_0u > 0$ for each nonzero $u$ satisfying $Cu = 0$. Let $\widehat{\boldsymbol\beta}_n$ minimize (7) for some $\gamma > 0$ and $\lambda_n \ge 0$.
(i) If $\gamma > 1$ and $\lambda_n/b_n \to \lambda_0 \ge 0$, then $b_n(\widehat{\boldsymbol\beta}_n - \boldsymbol\beta) \xrightarrow{d} \operatorname*{argmin}(Z)$, where, for $u$ satisfying $Cu = 0$,
$$Z(u) = -2u^TW + u^TD_0u + \lambda_0\gamma\sum_{j=1}^p u_j\,\mathrm{sgn}(\beta_j)|\beta_j|^{\gamma-1}.$$

(ii) If $\gamma = 1$ and $\lambda_n/b_n \to \lambda_0 \ge 0$, then $b_n(\widehat{\boldsymbol\beta}_n - \boldsymbol\beta) \xrightarrow{d} \operatorname*{argmin}(Z)$, where, for $u$ satisfying $Cu = 0$,
$$Z(u) = -2u^TW + u^TD_0u + \lambda_0\sum_{j=1}^p\bigl[u_j\,\mathrm{sgn}(\beta_j)\,I(\beta_j \neq 0) + |u_j|\,I(\beta_j = 0)\bigr].$$

(iii) If $\gamma < 1$ and $\lambda_n/b_n^\gamma \to \lambda_0 \ge 0$, then $b_n(\widehat{\boldsymbol\beta}_n - \boldsymbol\beta) \xrightarrow{d} \operatorname*{argmin}(Z)$, where, for $u$ satisfying $Cu = 0$,
$$Z(u) = -2u^TW + u^TD_0u + \lambda_0\sum_{j=1}^p |u_j|^\gamma\, I(\beta_j = 0).$$
Proof. Define $Z_n$ as in (10). First of all, we must show in each case that $Z_n \xrightarrow{e\text{-}d} Z_0$, where $Z_0(u) = Z(u)$ for $u$ satisfying $Cu = 0$ and $Z_0(u) = \infty$ otherwise. This follows by first showing finite-dimensional weak convergence of $\{Z_n\}$ to $Z_0$ and then stochastic equi–lower semicontinuity (e-l-sc) (Knight, 1999) of $\{Z_n\}$; note that, because $Z_0$ is not finite on an open set, finite-dimensional weak convergence is not sufficient for $Z_n \xrightarrow{e\text{-}d} Z_0$ even when $Z_n$ is convex (i.e., when $\gamma \ge 1$). Finally, we must show that
$$U_n = b_n\bigl(\widehat{\boldsymbol\beta}_n - \boldsymbol\beta\bigr) = O_p(1).$$
When $\gamma \ge 1$, this holds automatically from the convexity of the $Z_n$'s; for $\gamma < 1$, it can be established by noting that the quadratic part of $Z_n$ grows faster (in $\|u\|$) than the nonconvex penalty. █
Note that for $\gamma \ge 1$, the limiting distribution will typically depend on $\lambda_0$, whereas for $\gamma < 1$, this is true only if at least one of the $\beta_j$'s is 0. However, if $\gamma < 1$ and at least one $\beta_j$ is 0, then the mapping from $\lambda_0$ to the limiting distribution will have discontinuities; this mapping is continuous for $\gamma \ge 1$ because of the convexity (in $u$) of the limiting objective function $Z$ for any $\lambda_0$.
The condition on $\lambda_n$ in part (iii) of Theorem 1 can be modified to achieve the “best of both worlds” for $\gamma < 1$, that is, no asymptotic bias for estimators of nonzero parameters and superefficiency for estimators of zero parameters. We do this by assuming that $\lambda_n/b_n^\tau \to \lambda_0 > 0$ where $\gamma < \tau < 1$. Although this seems attractive, it should be noted that this is an asymptotic condition and does not really give much insight regarding the choice of $\lambda_n$ for fixed $n$.
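A one-line rate calculation (in the same scaling as the proof of Theorem 1, with $\lambda_n \sim \lambda_0 b_n^\tau$) shows why such a $\tau$ works:
$$\lambda_n\Bigl[\bigl|\beta_j + u_j/b_n\bigr|^\gamma - |\beta_j|^\gamma\Bigr] = O(\lambda_n/b_n) = O\bigl(b_n^{\tau-1}\bigr) \longrightarrow 0 \qquad \text{if } \beta_j \neq 0,$$
$$\lambda_n\bigl|u_j/b_n\bigr|^\gamma = \lambda_0\,b_n^{\tau-\gamma}|u_j|^\gamma\bigl(1 + o(1)\bigr) \longrightarrow \infty \qquad \text{if } \beta_j = 0 \text{ and } u_j \neq 0,$$
so the limiting objective function is unpenalized in the directions of the nonzero parameters and forces $u_j = 0$ whenever $\beta_j = 0$.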
Example 1
Consider a design with $p$ predictors having common mutual correlation $\rho_n$. Assuming the predictors are normalized to have variance 1, we have
$$C_n = (1 - \rho_n)I + \rho_n\mathbf{1}\mathbf{1}^T,$$
where $\mathbf{1} = (1,\ldots,1)^T$; we will assume that $\rho_n \to 1$ and $a_n(1 - \rho_n) \to \psi > 0$. In this case, $\{C_n\}$ converges to the matrix $C = \mathbf{1}\mathbf{1}^T$ (of all 1's), and $a_n(C_n - C) \to D_0$, where
$$D_0 = \psi\bigl(I - \mathbf{1}\mathbf{1}^T\bigr).$$
(In this example, the form of $D_0$ is not particularly important.) If the matrices are $p \times p$, then the null space of $C$ is the space of vectors $u$ with $u_1 + \cdots + u_p = 0$. For the sake of illustration, let us suppose that $\beta_1,\ldots,\beta_p$ are all nonzero and take $\gamma \ge 1$. Then the limiting objective function $Z$ in Theorem 1 is
$$Z(u) = -2u^TW + \psi\,u^Tu + \lambda_0\gamma\sum_{j=1}^p u_j\,\mathrm{sgn}(\beta_j)|\beta_j|^{\gamma-1} \qquad (u_1 + \cdots + u_p = 0), \tag{11}$$
with $Z(u) = \infty$ otherwise, where $W$ is a zero-mean multivariate normal random vector with $\operatorname{Var}(u^TW) = \sigma^2\psi\,u^Tu$ for $u$ in the null space of $C$.
By Theorem 1, we have (setting $b_n = (n/a_n)^{1/2}$)
$$b_n\bigl(\widehat{\boldsymbol\beta}_n - \boldsymbol\beta\bigr) \xrightarrow{d} \operatorname*{argmin}(Z).$$
It is interesting to compare this limiting distribution to the limiting distribution of the OLS estimator:
$$b_n\bigl(\widehat{\boldsymbol\beta}_n^{(0)} - \boldsymbol\beta\bigr) \xrightarrow{d} \operatorname*{argmin}(Z_0),$$
where $Z_0$ is simply $Z$ in (11) with $\lambda_0 = 0$. The size of the asymptotic bias of $\widehat{\boldsymbol\beta}_n$ relative to $\widehat{\boldsymbol\beta}_n^{(0)}$ (which is unbiased) depends on the coefficients of $u_1,\ldots,u_p$ in the penalty
$$\lambda_0\gamma\sum_{j=1}^p u_j\,\mathrm{sgn}(\beta_j)|\beta_j|^{\gamma-1}.$$
Note that these coefficients are bounded (in $\boldsymbol\beta$) only if $\gamma = 1$ (the lasso) and that the bias vanishes for $\gamma > 1$ (i.e., $\operatorname*{argmin}(Z) = \operatorname*{argmin}(Z_0)$) if, and only if, $\beta_1 = \cdots = \beta_p$, whereas for $\gamma = 1$ the bias vanishes under the weaker condition $\mathrm{sgn}(\beta_1) = \cdots = \mathrm{sgn}(\beta_p)$. It should also be noted that the preceding discussion does not depend on the form of the matrix $D_0$.
Next suppose that $\beta_1 \neq 0$ and $\beta_2 = \cdots = \beta_p = 0$. If $\gamma \le 1$ and $\lambda_0 > 0$, then the joint limiting distribution of the estimators of $\beta_2,\ldots,\beta_p$ will have positive probability mass at 0, and because the limiting distribution lies in the null space of $C$, this implies that the limiting distribution of $b_n\bigl(\widehat{\beta}_{n1} - \beta_1\bigr)$ has positive probability mass at 0.
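The design of this example is easy to probe numerically; the following sketch (our own check, taking $\rho_n = 1 - \psi/a_n$ as an assumed parametrization consistent with $a_n(1 - \rho_n) \to \psi$) confirms the form of $D_0$ and its positive definiteness on the null space of $C$:

```python
# Sketch: check that a_n (C_n - C) -> psi (I - 1 1^T) for the equicorrelated
# design of Example 1, taking rho_n = 1 - psi / a_n (an assumed
# parametrization, under which a_n (1 - rho_n) = psi exactly).
import numpy as np

p, psi = 5, 2.0
ones = np.ones((p, p))
C = ones                                   # the limit matrix of all 1's
D0 = psi * (np.eye(p) - ones)
for a_n in (1e2, 1e4, 1e6):
    rho_n = 1 - psi / a_n
    C_n = (1 - rho_n) * np.eye(p) + rho_n * ones
    print(a_n, np.abs(a_n * (C_n - C) - D0).max())   # essentially 0

# D0 is positive definite on the null space of C: for sum(u) = 0,
# u^T D0 u = psi * ||u||^2 > 0.
u = np.array([1.0, -1.0, 0.5, -0.5, 0.0])  # u_1 + ... + u_p = 0
print(u @ D0 @ u, psi * u @ u)             # the two values agree
```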
Theorem 1 can be extended to model selection methods such as AIC and BIC. Suppose that $\widehat{\boldsymbol\beta}_n$ minimizes
$$\frac{1}{\sigma^2}\sum_{i=1}^n\bigl(Y_i - x_i^T\boldsymbol\beta\bigr)^2 + \lambda_n\sum_{j=1}^p I(\beta_j \neq 0); \tag{12}$$
for AIC, $\lambda_n = 2$, whereas for BIC, we have $\lambda_n = \ln(n)$. The following result gives the limiting distribution in AIC-like situations where $\lambda_n \to \lambda_0 \in (0,\infty)$.
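Because (12) involves only a finite number of candidate models, it can be minimized by brute force; the following sketch (an assumed setup, not the paper's code, with $\sigma^2$ treated as known) enumerates all subsets of predictors and selects the one minimizing (12):

```python
# Sketch: best-subset selection by direct minimization of criterion (12),
# computing RSS / sigma^2 + lambda_n * (model size) over all 2^p subsets.
import itertools
import numpy as np

def select_subset(X, y, lam, sigma2=1.0):
    p = X.shape[1]
    best_crit, best_S = np.sum(y ** 2) / sigma2, ()        # empty model
    for k in range(1, p + 1):
        for S in itertools.combinations(range(p), k):
            XS = X[:, list(S)]
            b, *_ = np.linalg.lstsq(XS, y, rcond=None)
            crit = np.sum((y - XS @ b) ** 2) / sigma2 + lam * k
            if crit < best_crit:
                best_crit, best_S = crit, S
    return best_S

rng = np.random.default_rng(2)
n = 100
X = rng.normal(size=(n, 5))
y = 1.5 * X[:, 0] + rng.normal(size=n)     # only the first coefficient is nonzero
print("AIC:", select_subset(X, y, lam=2.0))
print("BIC:", select_subset(X, y, lam=np.log(n)))
```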
THEOREM 2. Assume the linear model (1), where $C_n$ in (2) satisfies (3), (4), and (9), where $C$ is singular and $D_0$ is positive definite on the null space of $C$. Suppose that $\widehat{\boldsymbol\beta}_n$ minimizes (12) and $\lambda_n \to \lambda_0 \ge 0$. Then $b_n(\widehat{\boldsymbol\beta}_n - \boldsymbol\beta) \xrightarrow{d} \operatorname*{argmin}(Z)$, where
$$Z(u) = \infty \quad\text{for } Cu \neq 0,$$
with
$$Z(u) = \frac{1}{\sigma^2}\bigl(-2u^TW + u^TD_0u\bigr) + \lambda_0\sum_{j=1}^p\bigl[I(\beta_j \neq 0) + I(u_j \neq 0,\ \beta_j = 0)\bigr]$$
for $u$ in the null space of $C$.
Proof. Define the objective function
$$Z_n(u) = \frac{1}{\sigma^2}\sum_{i=1}^n\Bigl[\bigl(\varepsilon_i - x_i^Tu/b_n\bigr)^2 - \varepsilon_i^2\Bigr] + \lambda_n\sum_{j=1}^p I\bigl(\beta_j + u_j/b_n \neq 0\bigr) \tag{13}$$
and note that it is minimized at $U_n = b_n(\widehat{\boldsymbol\beta}_n - \boldsymbol\beta)$. First of all, outside of the null space of $C$, it is easy to see that $Z_n(u) \xrightarrow{p} \infty$. For $u$ in the null space of $C$, we have
$$\frac{1}{\sigma^2}\Bigl(-\frac{2}{b_n}\sum_{i=1}^n\varepsilon_i\,x_i^Tu + a_n\,u^TC_nu\Bigr) \xrightarrow{d} \frac{1}{\sigma^2}\bigl(-2u^TW + u^TD_0u\bigr).$$
For the penalty term,
$$\lambda_n\sum_{j=1}^p I\bigl(\beta_j + u_j/b_n \neq 0\bigr) \longrightarrow \lambda_0\sum_{j=1}^p\bigl[I(\beta_j \neq 0) + I(u_j \neq 0,\ \beta_j = 0)\bigr].$$
Thus we have finite-dimensional weak convergence of $\{Z_n\}$ to $Z$.
Epi-convergence in distribution follows by establishing e-l-sc (Knight, 1999); we need to show that for each bounded set $B$, $\varepsilon > 0$, and $\delta > 0$, there exist $u_1,\ldots,u_m \in B$ and open neighborhoods $O(u_1),\ldots,O(u_m)$ such that
$$B \subseteq \bigcup_{i=1}^m O(u_i)$$
and
$$\limsup_{n\to\infty}\Pr\Bigl(\min_{1\le i\le m}\,\inf_{u\in O(u_i)}\bigl[Z_n(u) - Z_n(u_i)\bigr] \le -\varepsilon\Bigr) \le \delta.$$
First of all, note that $Z_n$ is finite for each $n$ with discontinuities at points that do not depend on $n$. If $B$ does not intersect the null space of $C$, then e-l-sc is straightforward; for each $n$, $Z_n$ is approximately a quadratic function tending to $+\infty$. On the other hand, if $B$ does intersect the null space of $C$, then we can take $u_1,\ldots,u_m$ to lie in this null space and obtain the desired inequality. It remains only to establish that $U_n = \operatorname*{argmin}(Z_n) = O_p(1)$; this follows because for $n$ greater than some $n_0$, we have $Z_n(0) \le (\lambda_0 + \varepsilon)p$, and for each $\varepsilon > 0$ there exists a compact set $K_\varepsilon$ such that
$$\limsup_{n\to\infty}\Pr\Bigl(\inf_{u\notin K_\varepsilon} Z_n(u) \le (\lambda_0 + \varepsilon)p\Bigr) \le \varepsilon.$$
Thus $U_n = O_p(1)$. █
As noted previously, AIC corresponds to the case where $\lambda_0 = 2$; Theorem 2 confirms the well-known fact that AIC is not a consistent model selection method in the sense that if $\beta_{r+1} = \cdots = \beta_p = 0$, then asymptotically AIC gives positive probability to models with at least one of $\beta_{r+1},\ldots,\beta_p$ nonzero. Note, however, that the parameter estimators computed by minimizing AIC are themselves consistent (in this case, $b_n$-consistent). So-called consistent model selection procedures such as BIC have $\lambda_n \to \infty$ at some (usually slow) rate.
THEOREM 3. Assume the linear model (1), where $C_n$ in (2) satisfies (3), (4), and (9), where $C$ is singular and $D_0$ is positive definite on the null space of $C$. Suppose that $\widehat{\boldsymbol\beta}_n$ minimizes (12), where $\lambda_n \to +\infty$ with $\lambda_n = o(b_n^2)$. Then $b_n(\widehat{\boldsymbol\beta}_n - \boldsymbol\beta) \xrightarrow{d} \operatorname*{argmin}(Z)$, where
$$Z(u) = \infty \quad\text{for } Cu \neq 0,$$
with
$$Z(u) = \begin{cases} \dfrac{1}{\sigma^2}\bigl(-2u^TW + u^TD_0u\bigr) & \text{if } u_j = 0 \text{ for all } j \text{ with } \beta_j = 0, \\ \infty & \text{otherwise,} \end{cases}$$
for $u$ in the null space of $C$.
Proof. When $\lambda_n \to \infty$, we can rewrite $Z_n$ in (13), up to an additive constant that does not affect its minimizer, as
$$Z_n(u) = \frac{1}{\sigma^2}\sum_{i=1}^n\Bigl[\bigl(\varepsilon_i - x_i^Tu/b_n\bigr)^2 - \varepsilon_i^2\Bigr] + \lambda_n\sum_{j=1}^p\Bigl[I\bigl(\beta_j + u_j/b_n \neq 0\bigr) - I(\beta_j \neq 0)\Bigr].$$
Then for $\beta_j \neq 0$,
$$\lambda_n\Bigl[I\bigl(\beta_j + u_j/b_n \neq 0\bigr) - I(\beta_j \neq 0)\Bigr] \longrightarrow 0$$
uniformly over compact sets (and thus this convergence is also epi-convergence). On the other hand, if $\beta_j = 0$, then
$$\lambda_n\,I\bigl(u_j/b_n \neq 0\bigr) = \lambda_n\,I(u_j \neq 0) \longrightarrow \begin{cases} 0 & \text{if } u_j = 0, \\ \infty & \text{if } u_j \neq 0, \end{cases}$$
where this pointwise convergence can be extended to epi-convergence. Now note that if $\lambda_n$ grows too quickly, then $Z_n$ may be minimized at some $U_n = b_n(\widehat{\boldsymbol\beta}_n - \boldsymbol\beta)$ with $\|U_n\| \xrightarrow{p} \infty$; this possibility is ruled out by the assumption that $\lambda_n = o(b_n^2)$, and so $\operatorname*{argmin}(Z_n) = O_p(1)$. █
The form of the penalty in the asymptotic objective function effectively forces the limiting distribution of $b_n\widehat{\beta}_{nj}$ to be a point mass at 0 when $\beta_j = 0$.
Example 2
Consider a design with $C_n$ defined as in Example 1 with $\beta_1 = \cdots = \beta_p = 0$. When $\rho_n = \rho$ for some $0 \le \rho < 1$, Table 1 gives the limiting distribution of the estimated model size for $p = 5$ and $p = 10$ with $\rho = 0$, 0.5, 0.9; given that we have a “null” model here, the correct model size is 0. Table 1 suggests that as $\rho \to 1$, the probability of selecting a model of size 1 decreases and the probabilities of selecting a model of size 0 or 2 increase. The result of Theorem 2 suggests that if $a_n(1 - \rho_n) \to \psi > 0$, then the probability of AIC selecting a model of size 1 tends to 0; this is somewhat misleading, as the mapping
$$u \mapsto \sum_{j=1}^p I(u_j \neq 0)$$
is not continuous at any $u$ having at least one zero component.
TABLE 1. Limiting distributions of estimated model size for AIC with $p = 5$ and 10 predictors and mutual interpredictor correlations of $\rho = 0$, 0.5, and 0.9. The probability estimates are based on 10,000 replications and have a standard error of at most 0.005. (The table values themselves are not reproduced here.)
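A simulation in the spirit of Table 1 can be sketched as follows (our own illustration with an assumed sample size and replication count; the paper's exact simulation design may differ):

```python
# Sketch: empirical distribution of the AIC-selected model size under the
# "null" model of Example 2 (all beta_j = 0) with equicorrelated predictors.
import itertools
import numpy as np

def aic_model_size(X, y, lam=2.0, sigma2=1.0):
    p = X.shape[1]
    best_crit, best_k = np.sum(y ** 2) / sigma2, 0         # empty model
    for k in range(1, p + 1):
        for S in itertools.combinations(range(p), k):
            XS = X[:, list(S)]
            b, *_ = np.linalg.lstsq(XS, y, rcond=None)
            crit = np.sum((y - XS @ b) ** 2) / sigma2 + lam * k
            if crit < best_crit:
                best_crit, best_k = crit, k
    return best_k

rng = np.random.default_rng(3)
n, p, rho, reps = 200, 5, 0.9, 2000
Sigma = (1 - rho) * np.eye(p) + rho * np.ones((p, p))
counts = np.zeros(p + 1)
for _ in range(reps):
    X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
    y = rng.normal(size=n)                 # null model: all beta_j = 0
    counts[aic_model_size(X, y)] += 1
print(counts / reps)                       # P(selected size = 0), ..., P(= p)
```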
3. OTHER POINTS OF INTEREST
3.1. Higher Order Near Singularity
Theorems 1 and 2 require that the matrix $D_0$ be positive definite on the null space of $C$. Unfortunately, this is not always true; Phillips (2001) gives an example involving polynomial regression with “slowly varying” predictors where this condition is violated.
The near singularity condition $a_n(C_n - C) \to D_0$ with $u^TD_0u > 0$ for nonzero $u$ satisfying $Cu = 0$ can be generalized as follows. Starting from sequences $a_n^{(1)}, a_n^{(2)}, \ldots$ tending to infinity (with $a_n^{(1)} = a_n$), we recursively define matrices $H_1, D_1, H_2, D_2, \ldots$ such that
$$a_n^{(1)}\bigl(C_n - C\bigr) \longrightarrow D_0, \tag{14}$$
$$H_1 = C + \frac{D_0}{a_n^{(1)}}, \tag{15}$$
$$a_n^{(1)}a_n^{(2)}\bigl(C_n - H_1\bigr) \longrightarrow D_1, \tag{16}$$
$$H_{k+1} = H_k + \frac{D_k}{a_n^{(1)}\cdots a_n^{(k+1)}}, \qquad a_n^{(1)}\cdots a_n^{(k+2)}\bigl(C_n - H_{k+1}\bigr) \longrightarrow D_{k+1}. \tag{17}$$
Now define the following subspace of the null space of $C$:
$$\mathcal{N}_k = \bigl\{u : Cu = 0 \text{ and } D_ju = 0 \text{ for } j = 0, \ldots, k-1\bigr\}. \tag{18}$$
Note that $\mathcal{N}_k$ is always well defined (if $C, D_0, \ldots, D_k$ in (14)–(17) are well defined), as it contains at least the vector 0. However, we are most interested in cases where $\mathcal{N}_k$ is larger. We can then redefine $b_n$ in terms of $a_n^{(1)}, \ldots, a_n^{(k+1)}$ as follows:
$$b_n = \biggl(\frac{n}{a_n^{(1)}a_n^{(2)}\cdots a_n^{(k+1)}}\biggr)^{1/2}. \tag{19}$$
Then it can be shown that the conclusions of Theorems 1 and 2 hold with $b_n$ defined as in (19), with $\operatorname{Var}(u^TW) = \sigma^2u^TD_ku > 0$ for nonzero $u \in \mathcal{N}_k$ as defined in (18), and with $D_k$ replacing $D_0$ in the definition of $Z(u)$:
$$Z(u) = \begin{cases} -2u^TW + u^TD_ku + (\text{penalty term as in Theorem 1 or 2}) & \text{if } u \in \mathcal{N}_k, \\ \infty & \text{otherwise.} \end{cases}$$
It is also possible to extend the results of this paper to cases where different degrees of near singularity exist in disjoint subsets of the variables; in this case, we will obtain different rates of convergence for the estimators of the parameters in the different subsets. For example, if $x_i = \operatorname{vec}(x_i^{(1)}, x_i^{(2)})$, then
$$C_n = \begin{pmatrix} C_n^{(11)} & C_n^{(12)} \\ C_n^{(21)} & C_n^{(22)} \end{pmatrix}, \qquad C_n^{(rs)} = \frac{1}{n}\sum_{i=1}^n x_i^{(r)}x_i^{(s)T}.$$
Suppose that $C_n \to C$, where $C_n^{(11)} \to C^{(11)}$ (nonsingular) and $C_n^{(22)} \to C^{(22)}$ (singular) with $a_n\bigl(C_n^{(22)} - C^{(22)}\bigr) \to D_0^{(22)}$; then it is also reasonable to assume that $a_n^{1/2}\bigl(C_n^{(12)} - C^{(12)}\bigr) \to D_0^{(12)}$ (and likewise for $C_n^{(21)}$). In this case, writing $\boldsymbol\beta = \operatorname{vec}(\boldsymbol\beta^{(1)}, \boldsymbol\beta^{(2)})$, we would typically (i.e., subject to other regularity conditions) have
$$\sqrt{n}\bigl(\widehat{\boldsymbol\beta}_n^{(1)} - \boldsymbol\beta^{(1)}\bigr) = O_p(1) \qquad\text{and}\qquad b_n\bigl(\widehat{\boldsymbol\beta}_n^{(2)} - \boldsymbol\beta^{(2)}\bigr) = O_p(1),$$
where $b_n = (n/a_n)^{1/2}$. For shrinkage estimation minimizing (7), we would need to choose $\lambda_n$ to match the slowest rate of convergence in order to obtain nondegenerate limiting distributions.
3.2. Maximum Likelihood and GMM Estimation
The results of this paper extend naturally to maximum likelihood estimation where the information matrix is nearly singular. In regular models where the log-likelihood function is locally quadratic, it is straightforward to extend Theorems 1 and 2; applications would include model selection and shrinkage estimation for so-called generalized linear models, which include logistic regression and log-linear Poisson regression. As mentioned previously, the notion of near singularity may be very useful in determining the asymptotic behavior of estimation procedures with weak instruments. In the context of GMM and generalized empirical likelihood estimation, Caner (2004, 2006) has investigated similar issues.
It is worth noting that there is a considerable literature on estimation for models where the information matrix is singular; for some recent examples, see Barnabani (2002) and Rotnitzky, Cox, Bottai, and Robins (2000). In such cases, typically the limiting distributions of maximum likelihood estimators are concentrated on a lower dimensional subspace or have a slower rate of convergence than the standard rate.