
SHRINKAGE ESTIMATION FOR NEARLY SINGULAR DESIGNS

Published online by Cambridge University Press:  30 November 2007

Keith Knight
Affiliation:
University of Toronto

Abstract

Shrinkage estimation procedures such as ridge regression and the lasso have been proposed for stabilizing estimation in linear models when high collinearity exists in the design. In this paper, we consider asymptotic properties of shrinkage estimators in the case of "nearly singular" designs.

I thank Hannes Leeb and Benedikt Pötscher and also the referees for their valuable comments. This research was supported by a grant from the Natural Sciences and Engineering Research Council of Canada.

Type
Research Article
Copyright
© 2008 Cambridge University Press

1. INTRODUCTION

Consider the linear regression model
$$Y_i = \beta_0 + x_i^T\beta + \varepsilon_i = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip} + \varepsilon_i, \qquad i = 1,\ldots,n, \tag{1}$$
where ε1,…,εn are independent and identically distributed (i.i.d.) random variables with mean 0 and variance σ². For simplicity, we will assume that the predictors are centered to have mean 0 and that the intercept β0 is always estimated by Ȳ, the sample mean of the responses. This assumption allows us to focus on estimation of β1,…,βp, but it is not essential.

Throughout this paper, we will assume that the xi's are nearly collinear in the sense that the matrix
$$C_n = \frac{1}{n}\sum_{i=1}^n x_i x_i^T \tag{2}$$
is nonsingular for each n but that
$$C_n \to C, \tag{3}$$
where C is singular; we will refer to such designs as "nearly singular."

The exact definition of near singularity (which will be given in the next section) is an asymptotic one, but in practice, a nearly singular design might be characterized as one where the smallest eigenvalue (or eigenvalues) of Cn is small compared to the trace of Cn. In some cases, the near singularity is, in fact, a consequence of the model; see, for example, Phillips (2001). It is well known that ordinary least squares (OLS) estimation, although unbiased, leads to parameter estimates with large variance in such designs. Several alternative methods, which trade bias for variance, have been proposed to deal with this problem; these methods include ridge regression (Hoerl and Kennard, 1970), partial least squares (Wold, 1984; Lorber, Wangen, and Kowalski, 1987), continuum regression (Stone and Brooks, 1990), the "lasso" (Tibshirani, 1996; Radchenko, 2004), and the smoothly clipped absolute deviation (SCAD) penalty of Fan and Li (2001).
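To make the variance inflation concrete, the following small simulation (not from the paper; the equicorrelated design, sample size, and ridge penalty are illustrative choices) compares the sampling variability of OLS and ridge regression under a nearly singular design:

```python
# A minimal sketch (illustrative settings): OLS versus ridge regression
# under an equicorrelated, nearly singular design.
import numpy as np

rng = np.random.default_rng(0)
n, p, rho, sigma = 100, 5, 0.99, 1.0
beta = np.ones(p)
lam = 1.0  # illustrative ridge penalty

# Equicorrelated predictors: Cov = (1 - rho) I + rho 11^T, nearly singular as rho -> 1
cov = (1 - rho) * np.eye(p) + rho * np.ones((p, p))
L = np.linalg.cholesky(cov)

ols_reps, ridge_reps = [], []
for _ in range(500):
    X = rng.standard_normal((n, p)) @ L.T
    X -= X.mean(axis=0)  # centered predictors, as assumed in Section 1
    y = X @ beta + sigma * rng.standard_normal(n)
    ols_reps.append(np.linalg.solve(X.T @ X, X.T @ y))
    ridge_reps.append(np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y))

print("total OLS   coefficient variance:", np.var(ols_reps, axis=0).sum())
print("total ridge coefficient variance:", np.var(ridge_reps, axis=0).sum())
```

With ρ close to 1, the ridge estimates typically show far smaller total variance than OLS, at the cost of some bias; this is the bias-for-variance trade that the methods above exploit.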

Under the condition
$$\max_{1\le i\le n} x_i^T\Bigl(\sum_{j=1}^n x_j x_j^T\Bigr)^{-1} x_i \to 0, \tag{4}$$
the OLS estimator, which we will denote by β̂n(0), is asymptotically normal; more precisely, we have
$$\sqrt{n}\,C_n^{1/2}\bigl(\hat\beta_n^{(0)} - \beta\bigr) \to_d N(0, \sigma^2 I) \tag{5}$$
(Srivastava, 1971). Note that the condition (4) can be rewritten as
$$\frac{1}{n}\max_{1\le i\le n} x_i^T C_n^{-1} x_i \to 0,$$
which if Cn tends to a nonsingular matrix C is equivalent to
$$\frac{1}{n}\max_{1\le i\le n} \|x_i\|^2 \to 0;$$
moreover, if C is nonsingular then the asymptotic normality in (5) can be expressed as
$$\sqrt{n}\bigl(\hat\beta_n^{(0)} - \beta\bigr) \to_d N\bigl(0, \sigma^2 C^{-1}\bigr). \tag{6}$$
The convergence in (5) is very general and quite useful in practice even in the case of nearly singular designs; on the other hand, generalizing (5) to estimators obtained after shrinkage procedures (such as ridge regression or the lasso) or automatic model selection procedures (such as the Akaike information criterion [AIC]) is difficult. Results such as (6) (where the normalization is by a sequence of constants rather than a sequence of matrices) turn out to be easier to obtain and can give considerable insight into the properties (from a large-sample perspective) of the particular estimator.

2. PENALIZED LEAST SQUARES ESTIMATION

Regularization is a frequently used technique in statistics for obtaining estimators in situations where standard estimators are unstable or otherwise poorly defined.

We will consider estimating β by minimizing the penalized least squares (LS) criterion
$$\sum_{i=1}^n \bigl(Y_i - \bar{Y} - x_i^T\beta\bigr)^2 + \lambda_n \sum_{j=1}^p |\beta_j|^\gamma \tag{7}$$
for a given λn where γ > 0; the resulting estimator will be denoted throughout by β̂n, thereby suppressing its dependence on both γ and λn, with β̂n(0) denoting the OLS estimator (with λn = 0). These so-called Bridge estimators were introduced by Frank and Friedman (1993) as a generalization of ridge regression (which occurs for γ = 2). The special case γ = 1 corresponds to the lasso (Tibshirani, 1996). Properties of these estimators have been studied by, among others, Fu (1998), Knight and Fu (2000), Radchenko (2004), and Leeb and Pötscher (2006). For γ ≤ 1, the estimators minimizing (7) have the potentially attractive feature of being exactly 0 if λn is sufficiently large, thus combining parameter estimation and model selection; indeed, model selection methods such as AIC and the Bayesian information criterion (BIC) can be viewed as limiting cases as γ → 0. Also note that when γ < 1, the objective function (7) is not convex and the estimator β̂n can be quite sensitive to the choice of λn; more precisely, when γ < 1, the mapping from λn to β̂n will have jump discontinuities. The SCAD penalty of Fan and Li (2001) is a nonconvex penalty (indexed by two parameters) that combines the features of a lasso-type penalty (for small parameter values) with an AIC-type penalty (for larger parameter values).

We could also replace (7) by a penalized LS criterion that allows a separate tuning parameter for each coefficient:
$$\sum_{i=1}^n \bigl(Y_i - \bar{Y} - x_i^T\beta\bigr)^2 + \sum_{j=1}^p \lambda_{nj} |\beta_j|^\gamma. \tag{8}$$
It is straightforward to generalize the results of this paper to estimators obtained by minimizing (8). However, the objective function (7) is more common in practice; typically, the predictors are scaled to have a variance of 1.
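As an illustration of the criterion (7), the following sketch computes a Bridge estimate by direct numerical minimization. The use of a derivative-free, general-purpose optimizer and the particular values of λn and γ are illustrative choices, not the standard computational approach for these estimators:

```python
# A sketch of Bridge estimation: minimize ||y - X b||^2 + lam * sum_j |b_j|^gamma.
import numpy as np
from scipy.optimize import minimize

def bridge(X, y, lam, gamma):
    """Minimize the penalized LS criterion (7) by Nelder-Mead
    (derivative-free, since the penalty is not differentiable at 0)."""
    beta0 = np.linalg.lstsq(X, y, rcond=None)[0]  # start from OLS
    obj = lambda b: np.sum((y - X @ b) ** 2) + lam * np.sum(np.abs(b) ** gamma)
    res = minimize(obj, beta0, method="Nelder-Mead",
                   options={"xatol": 1e-8, "fatol": 1e-8, "maxiter": 20000})
    return res.x

rng = np.random.default_rng(1)
X = rng.standard_normal((50, 3))
X -= X.mean(axis=0)
y = X @ np.array([2.0, 0.0, -1.0]) + rng.standard_normal(50)
print("gamma = 2 (ridge):", bridge(X, y, lam=5.0, gamma=2.0))
print("gamma = 1 (lasso):", bridge(X, y, lam=5.0, gamma=1.0))
```

For γ < 1 the objective is nonconvex, so a single local search of this kind need not find the global minimizer; this is one practical face of the sensitivity to λn noted above.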

In this section, we will consider the asymptotic behavior of Bridge estimators when the design is nearly singular. More precisely, suppose that Cn (as defined in (2)) is nonsingular but tends to a singular matrix C. In particular, we will assume that
$$a_n(C_n - C) \to D_0 \tag{9}$$
for some sequence {an} tending to infinity, where D0 is positive definite on the null space of C (i.e., vTD0v > 0 for nonzero v with Cv = 0). Note that D0 is necessarily nonnegative definite on the null space of C, so it is not too stringent to require it to be positive definite on this null space. If D0 is not positive definite on the null space of C, then we can modify (9) to obtain appropriate limiting distributions; this will be considered in the next section. We are also assuming (at least implicitly) that the near singularity affects all the predictors in the model. Applications where the condition (9) holds are given in Phillips (2001) and Gabaix and Ibragimov (2006). A referee has also pointed out a possible connection to the problem of weak instruments (cf. Stock, Wright, and Yogo, 2002), for example, in two-stage least squares estimation. Caner (2004, 2006) considers nearly singular designs in the context of generalized method of moments (GMM) estimation.

To obtain consistency and limiting distributions for β̂n minimizing (7), we need to impose conditions on the sequence {λn} so that it does not grow too quickly. In the case where C is nonsingular, Knight and Fu (2000) showed that to obtain nondegenerate limiting distributions for β̂n, we require
$$\lambda_n/\sqrt{n} \to \lambda_0 \ge 0$$
for γ ≥ 1 and λn/n^(γ/2) → λ0 for γ < 1; for nearly singular designs, the growth criterion for {λn} will be somewhat more stringent.

It is worth mentioning that asymptotic results tend to undersell the value of shrinkage estimation in practice. The reason for this is simple. Shrinkage is used in practice to reduce the variability in the estimation of parameters that are "small" by forcing their estimates toward 0 (or setting them to 0). However, from an asymptotic perspective, the only parameters that are "small" are those that are exactly 0, as all other parameters can be distinguished from 0 with probability tending to 1 as n → ∞. Thus for a parameter βk whose value is nonzero, shrinkage generally produces bias in the resulting estimator that may or may not vanish asymptotically, and the resulting asymptotic bias is typically not compensated by a reduction in the asymptotic variance. On the other hand, if βk = 0 then shrinkage will typically reduce the asymptotic variance of the estimator (without any asymptotic bias), which leads to a sort of superefficiency in these cases. Obviously, it is desirable to produce estimators that have no asymptotic bias when βk ≠ 0 and are superefficient when βk = 0; such estimators exist but can be extremely sensitive to small perturbations in the data or to changes in the choice of tuning parameters. The asymptotic results are very useful in giving insight into how sensitive a given methodology is to the choice of tuning parameters.
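The trade-off just described is visible in a simple Monte Carlo experiment. The sketch below uses a single predictor standardized so that Σxi² = n (an illustrative assumption under which the lasso has a closed soft-thresholding form) and compares the √n-scaled errors of OLS and the lasso at β = 0 and β ≠ 0:

```python
# A single-predictor sketch (sigma^2 = 1, sum x_i^2 = n assumed): the lasso
# estimate is then soft(OLS, lambda_n / (2n)), and we take lambda_n = lam0 * sqrt(n).
import numpy as np

rng = np.random.default_rng(2)
n, lam0 = 400, 2.0

def soft(z, t):
    return np.sign(z) * max(abs(z) - t, 0.0)

for beta in (0.0, 1.0):
    ols, lasso = [], []
    for _ in range(5000):
        b_ols = beta + rng.standard_normal() / np.sqrt(n)  # N(beta, 1/n)
        ols.append(b_ols)
        lasso.append(soft(b_ols, lam0 / (2 * np.sqrt(n))))
    for name, e in (("OLS  ", ols), ("lasso", lasso)):
        e = np.sqrt(n) * (np.array(e) - beta)  # root-n scaled error
        print(f"beta = {beta}: {name} bias = {e.mean():+.3f}, variance = {e.var():.3f}")
```

At β = 0 the lasso errors show a smaller variance than OLS with no bias (the superefficiency effect), while at β = 1 the lasso errors carry a nonvanishing bias of roughly −λ0/2 with essentially no variance reduction.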

A useful tool in the development of the asymptotic distribution of the penalized LS estimators is the notion of epi-convergence in distribution, which is discussed in Pflug (1995), Geyer (1994, 1996), and Knight (1999). A sequence of random lower semicontinuous functions {Zn} on ℝ^p (taking values in [−∞,∞]) epi-converges in distribution to Z (written Zn →e-d Z) if for closed rectangles R1,…,Rk with open interiors R1°,…,Rk°, we have
$$P\Bigl(\inf_{u\in R_1} Z(u) > a_1,\ldots,\inf_{u\in R_k} Z(u) > a_k\Bigr) \le \liminf_{n\to\infty} P\Bigl(\inf_{u\in R_1} Z_n(u) > a_1,\ldots,\inf_{u\in R_k} Z_n(u) > a_k\Bigr)$$
$$\le \limsup_{n\to\infty} P\Bigl(\inf_{u\in R_1^o} Z_n(u) \ge a_1,\ldots,\inf_{u\in R_k^o} Z_n(u) \ge a_k\Bigr) \le P\Bigl(\inf_{u\in R_1^o} Z(u) \ge a_1,\ldots,\inf_{u\in R_k^o} Z(u) \ge a_k\Bigr)$$
for all real a1,…,ak. Epi-convergence in distribution is particularly useful for studying estimators that minimize (or maximize) objective functions subject to constraints and also estimators that minimize discontinuous (but lower semicontinuous) objective functions; the best-known notion of weak convergence for functions, which is based on uniform convergence on compact sets (van der Vaart and Wellner, 1996), is poorly suited to these types of objective functions. However, this latter type of weak convergence, when applicable, does imply epi-convergence in distribution.

The limiting distributions of "argmin" estimators can often be determined via epi-convergence of the associated objective functions; in particular, if
$$Z_n \to_{e\text{-}d} Z \quad \text{and} \quad U_n = \operatorname{argmin}(Z_n),$$
where Z has a unique minimizer U, then
$$U_n \to_d U$$
provided that Un = Op(1). For an application of epi-convergence in distribution in the context of estimation in nonregular econometric models, see Chernozhukov and Hong (2004) and Chernozhukov (2005).

In the case where the {Zn} are convex with Zn →e-d Z (where the minimizer U of Z is unique), the condition Un = Op(1) is guaranteed, and so Un →d U. Moreover, in the case of convexity, finite-dimensional weak convergence of {Zn} to Z is sufficient for Zn →e-d Z provided that Z is finite (with probability 1) on an open set (Geyer, 1996); however, for nearly singular designs, this latter condition is not satisfied, as the appropriate limiting objective function is finite only on a lower dimensional subspace of ℝ^p. Finite-dimensional weak convergence implies epi-convergence in distribution if {Zn} is stochastically equi–lower semicontinuous as defined in Knight (1999).

We will now consider the asymptotic behavior of nearly singular designs under fairly weak conditions. We will assume that Cn is nonsingular for all n and satisfies (9) for some sequence {an}. Define bn = (n/an)^(1/2) and define Zn to be
$$Z_n(u) = \sum_{i=1}^n\Bigl[\bigl(\varepsilon_i - x_i^T u/b_n\bigr)^2 - \varepsilon_i^2\Bigr] + \lambda_n\sum_{j=1}^p\Bigl(|\beta_j + u_j/b_n|^\gamma - |\beta_j|^\gamma\Bigr). \tag{10}$$
If β̂n minimizes (7) then the minimizer of (10) is simply bn(β̂n − β); the objective function Zn in (10) is simply a rescaled version of the objective function (7) with constants subtracted to ensure convergence. Note that because bn/√n → 0, the estimators will have a slower rate of convergence than when C is nonsingular.
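A simulation sketch of this slower rate, using the equicorrelated design of Example 1 below with ρn = 1 − ψ/an (the choices an = √n, ψ = 1, p = 4, and β = 0 are illustrative): the spread of bn(β̂n(0) − β) stabilizes while that of √n(β̂n(0) − β) grows like an^(1/2).

```python
# Illustrative simulation: under a nearly singular design the OLS error
# stabilizes under b_n = (n/a_n)^(1/2) scaling, not under sqrt(n) scaling.
import numpy as np

rng = np.random.default_rng(3)
p, psi = 4, 1.0
beta = np.zeros(p)

for n in (200, 800, 3200, 12800):
    a_n = np.sqrt(n)
    rho = 1 - psi / a_n
    L = np.linalg.cholesky((1 - rho) * np.eye(p) + rho * np.ones((p, p)))
    errs = []
    for _ in range(300):
        X = rng.standard_normal((n, p)) @ L.T
        y = X @ beta + rng.standard_normal(n)
        errs.append(np.linalg.solve(X.T @ X, X.T @ y) - beta)
    e1 = np.array(errs)[:, 0]
    b_n = np.sqrt(n / a_n)
    print(f"n = {n:6d}: sd of sqrt(n)*err = {np.sqrt(n) * e1.std():6.2f}, "
          f"sd of b_n*err = {b_n * e1.std():6.2f}")
```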

The following result was given in Knight and Fu (2000).

THEOREM 1. Assume the linear model (1) where Cn in (2) satisfies (3), (4), and (9), where C is singular and D0 is positive definite on the null space of C. Define W to be a mean 0 multivariate normal random vector such that Var(uTW) = σ²uTD0u > 0 for each nonzero u satisfying Cu = 0. Let β̂n minimize (7) for some γ > 0 and λn ≥ 0.

(i) If γ > 1 and λn/bn → λ0 ≥ 0 then
$$b_n(\hat\beta_n - \beta) \to_d \operatorname*{argmin}_{Cu=0} Z(u),$$
where
$$Z(u) = -2u^T W + u^T D_0 u + \lambda_0\gamma\sum_{j=1}^p u_j \operatorname{sgn}(\beta_j)|\beta_j|^{\gamma-1}.$$

(ii) If γ = 1 and λn/bn → λ0 ≥ 0 then
$$b_n(\hat\beta_n - \beta) \to_d \operatorname*{argmin}_{Cu=0} Z(u),$$
where
$$Z(u) = -2u^T W + u^T D_0 u + \lambda_0\sum_{j=1}^p\bigl[u_j \operatorname{sgn}(\beta_j)\, I(\beta_j \ne 0) + |u_j|\, I(\beta_j = 0)\bigr].$$

(iii) If γ < 1 and λn/bn^γ → λ0 ≥ 0 then
$$b_n(\hat\beta_n - \beta) \to_d \operatorname*{argmin}_{Cu=0} Z(u),$$
where
$$Z(u) = -2u^T W + u^T D_0 u + \lambda_0\sum_{j=1}^p |u_j|^\gamma\, I(\beta_j = 0).$$

Proof. Define Zn as in (10). First of all, we must show in each case that Zn →e-d Z0, where Z0(u) = Z(u) for u satisfying Cu = 0 and Z0(u) = ∞ otherwise. This follows by first showing finite-dimensional weak convergence of {Zn} to Z0 and then stochastic equi–lower semicontinuity (e-l-sc) (Knight, 1999) of {Zn}; note that, because Z0 is not finite on an open set, finite-dimensional weak convergence is not sufficient for Zn →e-d Z0 even when Zn is convex (i.e., when γ ≥ 1). Finally, we must show that
$$b_n(\hat\beta_n - \beta) = \operatorname{argmin}(Z_n) = O_p(1).$$
When γ ≥ 1, this holds automatically from the convexity of the Zn's; for γ < 1, it can be established by noting that the quadratic part of Zn grows faster (in ∥u∥) than the nonconvex penalty. █

Note that for γ ≥ 1, the limiting distribution will typically depend on λ0, whereas for γ < 1, this is true only if at least one of the βj's is 0. However, if γ < 1 and at least one βj is 0, then the mapping from λ0 to the limiting distribution will have discontinuities; this mapping is continuous for γ ≥ 1 because of the convexity (in u) of the limiting objective function Z for any λ0.

The condition on λn in part (iii) of Theorem 1 can be modified to achieve the "best of both worlds" for γ < 1, that is, no asymptotic bias for estimators of nonzero parameters and superefficiency for estimators of zero parameters. We do this by assuming that λn/bn^τ → λ0 > 0 where γ < τ < 1. Although this seems attractive, it should be noted that this is an asymptotic condition and does not really give much insight regarding the choice of λn for fixed n.

Example 1

Consider a design with p predictors with common mutual correlation ρn. Assuming the predictors are normalized to have variance 1, we have
$$C_n = (1 - \rho_n) I + \rho_n \mathbf{1}\mathbf{1}^T,$$
where 1 = (1,…,1)ᵀ; we will assume that ρn → 1 and an(1 − ρn) → ψ > 0. In this case, {Cn} converges to the matrix C = 11ᵀ (of all 1's) and an(Cn − C) → D0 where
$$D_0 = \psi\bigl(I - \mathbf{1}\mathbf{1}^T\bigr).$$
(In this example, the form of D0 is not particularly important.) If the matrices are p × p then the null space of C is the space of vectors u with u1 + ··· + up = 0. For the sake of illustration, let us suppose that β1,…,βp are all nonzero and take γ ≥ 1. Then the limiting objective function Z in Theorem 1 is
$$Z(u) = -2u^T W + \psi u^T u + \lambda_0\gamma\sum_{j=1}^p u_j \operatorname{sgn}(\beta_j)|\beta_j|^{\gamma-1} \tag{11}$$
where
$$\mathrm{Var}(u^T W) = \sigma^2\psi u^T u \quad \text{for } u_1 + \cdots + u_p = 0.$$
By Theorem 1, we have (setting bn = (n/an)^(1/2))
$$b_n(\hat\beta_n - \beta) \to_d \operatorname*{argmin}_{u_1+\cdots+u_p=0} Z(u).$$

It is interesting to compare this limiting distribution to the limiting distribution of the OLS estimator:
$$b_n\bigl(\hat\beta_n^{(0)} - \beta\bigr) \to_d \operatorname*{argmin}_{u_1+\cdots+u_p=0} Z_0(u),$$
where Z0 is simply Z in (11) setting λ0 = 0. The size of the asymptotic bias of β̂n relative to β̂n(0) (which is unbiased) depends on the coefficients of u1,…,up in the penalty
$$\lambda_0\gamma\sum_{j=1}^p u_j \operatorname{sgn}(\beta_j)|\beta_j|^{\gamma-1}.$$
Note that these coefficients are bounded (in β) only if γ = 1 (the lasso) and that the bias vanishes for γ > 1 (i.e., argmin(Z) = argmin(Z0)) if, and only if, β1 = ··· = βp, whereas for γ = 1, the bias vanishes under the weaker condition sgn(β1) = ··· = sgn(βp). (In either case the penalty is then a constant multiple of u1 + ··· + up, which is 0 on the null space of C.) It should be noted also that the preceding discussion does not depend on the form of the matrix D0.
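The claims of this example are easy to check numerically. The sketch below (with illustrative values of p, ψ, and β) verifies that an(Cn − C) approaches ψ(I − 11ᵀ) and that the linear coefficients of the penalty are constant across j — so that the bias term vanishes on the null space {u : u1 + ··· + up = 0} — exactly when β1 = ··· = βp for γ > 1 or when sgn(β1) = ··· = sgn(βp) for γ = 1:

```python
# Numerical check of Example 1 (illustrative values of p, psi, beta).
import numpy as np

p, psi = 4, 2.0
ones = np.ones((p, p))
for a_n in (1e2, 1e4, 1e6):
    rho = 1 - psi / a_n
    C_n = (1 - rho) * np.eye(p) + rho * ones
    gap = np.abs(a_n * (C_n - ones) - psi * (np.eye(p) - ones)).max()
    print(f"a_n = {a_n:.0e}: max |a_n (C_n - C) - psi (I - 11^T)| = {gap:.2e}")

def penalty_coeffs(beta, gamma):
    # coefficients of u_j in the limiting penalty: gamma * sgn(beta_j) * |beta_j|^(gamma - 1)
    return gamma * np.sign(beta) * np.abs(beta) ** (gamma - 1)

for gamma, beta in [(2.0, np.array([1.0, 1.0, 1.0, 1.0])),
                    (2.0, np.array([1.0, 2.0, 3.0, 4.0])),
                    (1.0, np.array([1.0, 2.0, 3.0, 4.0])),
                    (1.0, np.array([1.0, -2.0, 3.0, 4.0]))]:
    c = penalty_coeffs(beta, gamma)
    # c^T u = 0 for all u with sum(u) = 0 iff the entries of c are all equal
    print(f"gamma = {gamma}, beta = {beta}: bias vanishes? {np.allclose(c, c.mean())}")
```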

Next suppose that β1 ≠ 0 and β2 = ··· = βp = 0. If γ ≤ 1 and λ0 > 0 then the joint limiting distribution of the estimators of β2,…,βp will have positive probability mass at 0, and because the limiting distribution lies in the null space of C (where the components of u sum to 0), this implies that the limiting distribution of bn(β̂n1 − β1) also has positive probability mass at 0.

Theorem 1 can be extended to model selection methods such as AIC and BIC. Suppose that β̂n minimizes
$$\sum_{i=1}^n\bigl(Y_i - \bar{Y} - x_i^T\beta\bigr)^2 + \lambda_n\sigma^2\sum_{j=1}^p I(\beta_j \ne 0); \tag{12}$$
for AIC, λn = 2, whereas for BIC, we have λn = ln(n). The following result gives the limiting distribution in AIC-like situations where λn → λ0 ∈ (0,∞).

THEOREM 2. Assume the linear model (1) where Cn in (2) satisfies (3), (4), and (9), where C is singular and D0 is positive definite on the null space of C. Suppose that β̂n minimizes (12) and λn → λ0 ≥ 0. Then
$$b_n(\hat\beta_n - \beta) \to_d \operatorname{argmin}(Z_0),$$
where Z0(u) = ∞ for Cu ≠ 0, with
$$Z_0(u) = \frac{1}{\sigma^2}\bigl(-2u^T W + u^T D_0 u\bigr) + \lambda_0\sum_{j=1}^p I(\beta_j \ne 0 \text{ or } u_j \ne 0)$$
for u in the null space of C (with W as in Theorem 1).

Proof. Define the objective function
$$Z_n(u) = \frac{1}{\sigma^2}\sum_{i=1}^n\Bigl[\bigl(\varepsilon_i - x_i^T u/b_n\bigr)^2 - \varepsilon_i^2\Bigr] + \lambda_n\sum_{j=1}^p I(\beta_j + u_j/b_n \ne 0) \tag{13}$$
and note that it is minimized at bn(β̂n − β). First of all, outside of the null space of C, it is easy to see that Zn(u) →p ∞. For u in the null space of C, we have
$$\frac{1}{\sigma^2}\sum_{i=1}^n\Bigl[\bigl(\varepsilon_i - x_i^T u/b_n\bigr)^2 - \varepsilon_i^2\Bigr] \to_d \frac{1}{\sigma^2}\bigl(-2u^T W + u^T D_0 u\bigr).$$
For the penalty term,
$$\lambda_n\sum_{j=1}^p I(\beta_j + u_j/b_n \ne 0) \to \lambda_0\sum_{j=1}^p I(\beta_j \ne 0 \text{ or } u_j \ne 0).$$
Thus we have finite-dimensional weak convergence of {Zn} to Z0. Epi-convergence in distribution follows by establishing e-l-sc (Knight, 1999); we need to show that for each bounded set B, ε > 0, and δ > 0, there exist u1,…,um ∈ B and open neighborhoods O(u1),…,O(um) such that
$$B \subseteq \bigcup_{j=1}^m O(u_j)$$
and
$$\limsup_{n\to\infty} P\Bigl(\min_{1\le j\le m}\Bigl[\inf_{u\in O(u_j)} Z_n(u) - Z_n(u_j)\Bigr] \le -\varepsilon\Bigr) \le \delta.$$
First of all, note that Zn is finite for each n with discontinuities at points that do not depend on n. If B does not intersect the null space of C then e-l-sc is straightforward; for each n, Zn is approximately a quadratic function that is tending to +∞. On the other hand, if B does intersect the null space of C then we can take u1,…,um to lie in this null space and obtain the desired inequality. It remains only to establish that argmin(Zn) = Op(1); this follows because for n greater than some n0, we have Zn(0) ≤ (λ0 + ε)p, and there exists a compact set Kε such that
$$\liminf_{n\to\infty} P\Bigl(\inf_{u\notin K_\varepsilon} Z_n(u) > (\lambda_0 + \varepsilon)p\Bigr) \ge 1 - \varepsilon$$
for each ε > 0. Thus argmin(Zn) = Op(1). █

As noted previously, AIC corresponds to the case where λ0 = 2; Theorem 2 confirms the well-known fact that AIC is not a consistent model selection method in the sense that if β(r+1) = ··· = βp = 0 then asymptotically AIC gives positive probability to models with at least one of β(r+1),…,βp nonzero. Note, however, that the parameter estimators computed by minimizing AIC are themselves consistent (in this case, bn-consistent). So-called consistent model selection procedures such as BIC have λn → ∞ at some (usually slow) rate.
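The behavior that Example 2 below reports can be reproduced qualitatively by Monte Carlo. The following sketch performs best-subset selection by enumerating all 2^p submodels, treating σ² = 1 as known; the settings (n, p, ρ, number of replications) are illustrative, so the frequencies need not match Table 1:

```python
# Monte Carlo sketch of AIC model-size selection under a null model
# with equicorrelated predictors (illustrative settings).
import numpy as np
from itertools import combinations

rng = np.random.default_rng(4)
n, p, rho, lam = 100, 5, 0.9, 2.0  # lam = 2 corresponds to AIC
L = np.linalg.cholesky((1 - rho) * np.eye(p) + rho * np.ones((p, p)))

sizes = np.zeros(p + 1)
for _ in range(1000):
    X = rng.standard_normal((n, p)) @ L.T
    y = rng.standard_normal(n)           # null model: all beta_j = 0
    best_size, best_crit = 0, y @ y      # empty model: criterion = RSS = ||y||^2
    for k in range(1, p + 1):
        for S in combinations(range(p), k):
            Xs = X[:, list(S)]
            resid = y - Xs @ np.linalg.lstsq(Xs, y, rcond=None)[0]
            crit = resid @ resid + lam * k   # RSS + lam * sigma^2 * model size
            if crit < best_crit:
                best_crit, best_size = crit, k
    sizes[best_size] += 1
print("estimated P(model size = k), k = 0,...,p:", sizes / sizes.sum())
```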

THEOREM 3. Assume the linear model (1) where Cn in (2) satisfies (3), (4), and (9), where C is singular and D0 is positive definite on the null space of C. Suppose that β̂n minimizes (12) where λn → +∞ with λn = o(bn²). Then
$$b_n(\hat\beta_n - \beta) \to_d \operatorname{argmin}(Z_0),$$
where Z0(u) = ∞ if Cu ≠ 0 or uj ≠ 0 for some j with βj = 0, with
$$Z_0(u) = \frac{1}{\sigma^2}\bigl(-2u^T W + u^T D_0 u\bigr)$$
for all other u in the null space of C (with W as in Theorem 1).

Proof. When λn → ∞, we can rewrite Zn in (13) as
$$Z_n(u) = \frac{1}{\sigma^2}\sum_{i=1}^n\Bigl[\bigl(\varepsilon_i - x_i^T u/b_n\bigr)^2 - \varepsilon_i^2\Bigr] + \lambda_n\sum_{j=1}^p\bigl[I(\beta_j + u_j/b_n \ne 0) - I(\beta_j \ne 0)\bigr] + \lambda_n\sum_{j=1}^p I(\beta_j \ne 0),$$
where the last term does not depend on u and so can be dropped. Then for βj ≠ 0,
$$\lambda_n\bigl[I(\beta_j + u_j/b_n \ne 0) - 1\bigr] \to 0$$
uniformly over compact sets (and thus this convergence is also epi-convergence). On the other hand, if βj = 0 then
$$\lambda_n I(u_j \ne 0) \to \begin{cases} 0 & \text{if } u_j = 0, \\ +\infty & \text{if } u_j \ne 0, \end{cases}$$
where this pointwise convergence can be extended to epi-convergence. Now note that if λn grows too quickly then Zn may be minimized at some u having uj = −bnβj for some j with βj ≠ 0 (so that ∥u∥ → ∞); this possibility is ruled out by the assumption that λn = o(bn²), and so argmin(Zn) = Op(1). █

The form of the penalty in the asymptotic objective effectively forces the limiting distribution of bnβ̂nj to be a point mass at 0 when βj = 0.

Example 2

Consider a design with Cn defined as in Example 1 with β1 = ··· = βp = 0. When ρn = ρ for some fixed 0 ≤ ρ < 1, Table 1 gives the limiting distribution of the estimated model size under AIC for p = 5 and p = 10 with ρ = 0, 0.5, 0.9; given that we have a "null" model here, the correct model size is 0. Table 1 suggests that as ρ → 1, the probability of selecting a model of size 1 decreases and the probabilities of selecting a model of size 0 or 2 increase. The result of Theorem 2 suggests that if an(1 − ρn) → ψ > 0 then the probability of AIC selecting a model of size 1 tends to 0; this is somewhat misleading, as the mapping
$$u \mapsto \sum_{j=1}^p I(u_j \ne 0)$$
is not continuous at any u having at least one 0 component.

Table 1. Limiting distributions of estimated model size for AIC with p = 5 and 10 predictors and mutual interpredictor correlations of ρ = 0, 0.5, and 0.9. The probability estimates are based on 10,000 replications and have a standard error of at most 0.005.

3. OTHER POINTS OF INTEREST

3.1. Higher Order Near Singularity

Theorems 1 and 2 require that the matrix D0 be positive definite on the null space of C. Unfortunately, this is not always true; Phillips (2001) gives an example involving polynomial regression with “slowly varying” predictors where this condition is violated.

The near singularity condition an(Cn − C) → D0 with uTD0u > 0 for nonzero u satisfying Cu = 0 can be generalized as follows. We start by recursively defining matrices H1, D1, H2, D2,… such that

Now define the following subspace of the null space of C:

Note that this subspace is always well defined (if C, D0,…,Dk in (14)–(17) are well defined), as it contains at least the vector 0. However, we are most interested in cases where it is larger. We can then redefine bn in terms of an^(1),…,an^(k) as follows:

Then it can be shown that the conclusions of Theorems 1 and 2 hold with bn defined as in (19), with Var(uTW) = σ²uTDku > 0 for u in the subspace defined in (18), and with Dk replacing D0 in the definition of Z(u).

It is also possible to extend the results of this paper to cases where different degrees of near singularity exist in disjoint subsets of the variables; in this case, we will obtain different convergence rates for the estimators of the parameters in the different subsets. For example, if xi = vec(xi(1), xi(2)) then
$$C_n = \begin{pmatrix} C_n^{(11)} & C_n^{(12)} \\ C_n^{(21)} & C_n^{(22)} \end{pmatrix}.$$
Suppose that Cn → C where Cn(11) → C(11) (nonsingular) and Cn(22) → C(22) (singular) with an(Cn(22) − C(22)) → D0(22); then it is also reasonable to assume that an^(1/2)(Cn(12) − C(12)) → D0(12) (and likewise for Cn(21)). In this case, writing β = vec(β(1), β(2)), we would typically (i.e., subject to other regularity conditions) have
$$\sqrt{n}\bigl(\hat\beta_n^{(1)} - \beta^{(1)}\bigr) = O_p(1) \quad \text{and} \quad b_n\bigl(\hat\beta_n^{(2)} - \beta^{(2)}\bigr) = O_p(1),$$
where bn = (n/an)^(1/2). For shrinkage estimation minimizing (7), we would need to choose λn to match the slowest rate of convergence to obtain nondegenerate limiting distributions.
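A numerical sketch of the two-block case (the construction below, with an = √n, ψ = 1, and β = 0, is an illustrative choice): the OLS coefficients for the well-conditioned block stabilize under √n scaling, while those for the nearly singular block stabilize only under the slower bn scaling.

```python
# Illustrative simulation: two blocks of predictors with different degrees
# of near singularity give two different rates of convergence for OLS.
import numpy as np

rng = np.random.default_rng(5)
p1, p2, psi = 2, 2, 1.0
for n in (400, 1600, 6400):
    a_n = np.sqrt(n)
    rho = 1 - psi / a_n
    L2 = np.linalg.cholesky((1 - rho) * np.eye(p2) + rho * np.ones((p2, p2)))
    errs = []
    for _ in range(300):
        X = np.hstack([rng.standard_normal((n, p1)),           # well conditioned
                       rng.standard_normal((n, p2)) @ L2.T])   # nearly singular
        y = rng.standard_normal(n)                             # beta = 0 for simplicity
        errs.append(np.linalg.solve(X.T @ X, X.T @ y))
    errs = np.array(errs)
    b_n = np.sqrt(n / a_n)
    print(f"n = {n:5d}: sd of sqrt(n)*(block-1 coef) = {np.sqrt(n) * errs[:, 0].std():5.2f}, "
          f"sd of b_n*(block-2 coef) = {b_n * errs[:, p1].std():5.2f}")
```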

3.2. Maximum Likelihood and GMM Estimation

The results of this paper extend naturally to maximum likelihood estimation where the information matrix is nearly singular. In regular models where the log-likelihood function is locally quadratic, it is straightforward to extend Theorems 1 and 2; applications would include model selection and shrinkage estimation for so-called generalized linear models, which include logistic regression and log-linear Poisson regression. As mentioned previously, the notion of near singularity may be very useful in determining the asymptotic behavior of estimation procedures with weak instruments. In the context of GMM and generalized empirical likelihood estimation, Caner (2004, 2006) has investigated similar issues.

It is worth noting that there is a considerable literature on estimation for models where the information matrix is singular; for some recent examples, see Barnabani (2002) and Rotnitzky, Cox, Bottai, and Robins (2000). In such cases, typically the limiting distributions of maximum likelihood estimators are concentrated on a lower dimensional subspace or have a slower rate of convergence than the standard rate.

REFERENCES

Barnabani, M. (2002) Wald-Based Approach with Singular Information Matrix. Working paper 2002/13, Department of Statistics, University of Florence.
Caner, M. (2004) Nearly Singular Design in GMM and Generalized Empirical Likelihood Estimators. Working paper, Department of Economics, University of Pittsburgh.
Caner, M. (2006) Lasso Type GMM Estimator. Working paper, Department of Economics, University of Pittsburgh.
Chernozhukov, V. (2005) Extremal quantile regression. Annals of Statistics 33, 806–839.
Chernozhukov, V. & H. Hong (2004) Likelihood estimation and inference in a class of nonregular econometric models. Econometrica 72, 1445–1480.
Fan, J. & R. Li (2001) Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association 96, 1348–1360.
Frank, I.E. & J.H. Friedman (1993) A statistical view of some chemometrics regression tools (with discussion). Technometrics 35, 109–148.
Fu, W.J. (1998) Penalized regressions: The bridge versus the lasso. Journal of Computational and Graphical Statistics 7, 397–416.
Gabaix, X. & R. Ibragimov (2006) Log(rank − 1/2): A Simple Way to Improve the OLS Estimation of Tail Exponents. Working paper, Harvard Institute of Economic Research.
Geyer, C.J. (1994) On the asymptotics of constrained M-estimation. Annals of Statistics 22, 1993–2010.
Geyer, C.J. (1996) On the Asymptotics of Convex Stochastic Optimization. Technical report, Department of Statistics, University of Minnesota.
Hoerl, A.E. & R.W. Kennard (1970) Ridge regression: Biased estimation for nonorthogonal problems. Technometrics 12, 55–67.
Knight, K. (1999) Epi-convergence in Distribution and Stochastic Equi-semicontinuity. Unpublished manuscript, Department of Statistics, University of Toronto.
Knight, K. & W. Fu (2000) Asymptotics for lasso-type estimators. Annals of Statistics 28, 1356–1378.
Leeb, H. & B. Pötscher (2006) Performance limits for estimators of the risk or distribution of shrinkage-type estimators, and some general lower risk-bound results. Econometric Theory 22, 69–97.
Lorber, A., L.E. Wangen, & B.R. Kowalski (1987) A theoretical foundation for the PLS algorithm. Journal of Chemometrics 1, 19–31.
Pflug, G.C. (1995) Asymptotic stochastic programs. Mathematics of Operations Research 20, 769–789.
Phillips, P.C.B. (2001) Regression with Slowly Varying Regressors. Cowles Foundation Discussion Paper 1310, Yale University.
Radchenko, P. (2004) Reweighting the Lasso. Unpublished manuscript, Department of Statistics, University of Chicago.
Rotnitzky, A., D.R. Cox, M. Bottai, & J. Robins (2000) Likelihood-based inference with singular information matrix. Bernoulli 6, 243–284.
Srivastava, M.S. (1971) On fixed width confidence bounds for regression parameters. Annals of Mathematical Statistics 42, 1403–1411.
Stock, J.H., J.H. Wright, & M. Yogo (2002) A survey of weak instruments and weak identification in generalized method of moments. Journal of Business & Economic Statistics 20, 518–529.
Stone, M. & R.J. Brooks (1990) Continuum regression: Cross-validated sequentially constructed prediction embracing ordinary least squares, partial least squares and principal components regression (with discussion). Journal of the Royal Statistical Society, Series B 52, 237–269; corrigendum (1992) 54, 906–907.
Tibshirani, R. (1996) Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B 58, 267–288.
van der Vaart, A.W. & J.A. Wellner (1996) Weak Convergence and Empirical Processes with Applications to Statistics. Springer.
Wold, H. (1984) PLS regression. In N.L. Johnson & S. Kotz (eds.), Encyclopedia of Statistical Sciences, vol. 6, pp. 581–591. Wiley.