
SHRINKAGE ESTIMATION FOR NEARLY SINGULAR DESIGNS

Published online by Cambridge University Press:  30 November 2007

Keith Knight
Affiliation:
University of Toronto

Abstract

Shrinkage estimation procedures such as ridge regression and the lasso have been proposed for stabilizing estimation in linear models when high collinearity exists in the design. In this paper, we consider asymptotic properties of shrinkage estimators in the case of "nearly singular" designs.

I thank Hannes Leeb and Benedikt Pötscher and also the referees for their valuable comments. This research was supported by a grant from the Natural Sciences and Engineering Research Council of Canada.

Type
Research Article
Copyright
© 2008 Cambridge University Press

1. INTRODUCTION

Consider the linear regression model
$$Y_i = \beta_0 + x_i^T\beta + \varepsilon_i = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip} + \varepsilon_i, \qquad i = 1,\ldots,n, \tag{1}$$
where ε1,…,εn are independent and identically distributed (i.i.d.) random variables with mean 0 and variance σ². For simplicity, we will assume that the predictors are centered to have mean 0 and that the intercept β0 is always estimated by Ȳ, the sample mean of the responses. This assumption allows us to focus on estimation of β1,…,βp, but it is not essential.

Throughout this paper, we will assume that the xi's are nearly collinear in the sense that the matrix
$$C_n = \frac{1}{n}\sum_{i=1}^n x_i x_i^T \tag{2}$$
is nonsingular for each n but that
$$C_n \to C, \tag{3}$$
where C is singular; we will refer to such designs as "nearly singular."

The exact definition of near singularity (which will be given in the next section) is an asymptotic one, but in practice, a nearly singular design might be characterized as one where the smallest eigenvalue (or eigenvalues) of Cn is small compared to the trace of Cn. In some cases, the near singularity is, in fact, a consequence of the model; see, for example, Phillips (2001). It is well known that ordinary least squares (OLS) estimation, although unbiased, leads to parameter estimates with large variance in such designs. Several alternative methods, which trade bias for variance, have been proposed to deal with this problem; these methods include ridge regression (Hoerl and Kennard, 1970), partial least squares (Wold, 1984; Lorber, Wangen, and Kowalski, 1987), continuum regression (Stone and Brooks, 1990), the "lasso" (Tibshirani, 1996; Radchenko, 2004), and the smoothly clipped absolute deviation (SCAD) penalty of Fan and Li (2001).
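To make the variance inflation concrete, the following small simulation (not from the paper; the equicorrelated design, sample size, and ridge penalty are illustrative choices) compares the sampling variability of OLS and ridge regression under a nearly singular design:

```python
# A minimal sketch (illustrative settings): OLS versus ridge regression
# under an equicorrelated, nearly singular design.
import numpy as np

rng = np.random.default_rng(0)
n, p, rho, sigma = 100, 5, 0.99, 1.0
beta = np.ones(p)
lam = 1.0  # illustrative ridge penalty

# Equicorrelated predictors: Cov = (1 - rho) I + rho 11^T, nearly singular as rho -> 1
cov = (1 - rho) * np.eye(p) + rho * np.ones((p, p))
L = np.linalg.cholesky(cov)

ols_reps, ridge_reps = [], []
for _ in range(500):
    X = rng.standard_normal((n, p)) @ L.T
    X -= X.mean(axis=0)  # centered predictors, as assumed in Section 1
    y = X @ beta + sigma * rng.standard_normal(n)
    ols_reps.append(np.linalg.solve(X.T @ X, X.T @ y))
    ridge_reps.append(np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y))

print("total OLS   coefficient variance:", np.var(ols_reps, axis=0).sum())
print("total ridge coefficient variance:", np.var(ridge_reps, axis=0).sum())
```

With ρ close to 1, the ridge estimates typically show far smaller total variance than OLS, at the cost of some bias; this is the bias-for-variance trade that the methods above exploit.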

Under the condition
$$\max_{1\le i\le n} x_i^T\Bigl(\sum_{j=1}^n x_j x_j^T\Bigr)^{-1} x_i \to 0, \tag{4}$$
the OLS estimator, which we will denote by β̂n(0), is asymptotically normal; more precisely, we have
$$\sqrt{n}\,C_n^{1/2}\bigl(\hat\beta_n^{(0)} - \beta\bigr) \to_d N(0, \sigma^2 I) \tag{5}$$
(Srivastava, 1971). Note that the condition (4) can be rewritten as
$$\frac{1}{n}\max_{1\le i\le n} x_i^T C_n^{-1} x_i \to 0,$$
which if Cn tends to a nonsingular matrix C is equivalent to
$$\frac{1}{n}\max_{1\le i\le n} \|x_i\|^2 \to 0;$$
moreover, if C is nonsingular then the asymptotic normality in (5) can be expressed as
$$\sqrt{n}\bigl(\hat\beta_n^{(0)} - \beta\bigr) \to_d N\bigl(0, \sigma^2 C^{-1}\bigr). \tag{6}$$
The convergence in (5) is very general and quite useful in practice even in the case of nearly singular designs; on the other hand, generalizing (5) to estimators obtained after shrinkage procedures (such as ridge regression or the lasso) or automatic model selection procedures (such as the Akaike information criterion [AIC]) is difficult. Results such as (6) (where the normalization is by a sequence of constants rather than a sequence of matrices) turn out to be easier to obtain and can give considerable insight into the properties (from a large-sample perspective) of the particular estimator.

2. PENALIZED LEAST SQUARES ESTIMATION

Regularization is a frequently used technique in statistics for obtaining estimators in situations where standard estimators are unstable or otherwise poorly defined.

We will consider estimating β by minimizing the penalized least squares (LS) criterion
$$\sum_{i=1}^n \bigl(Y_i - \bar{Y} - x_i^T\beta\bigr)^2 + \lambda_n \sum_{j=1}^p |\beta_j|^\gamma \tag{7}$$
for a given λn where γ > 0; the resulting estimator will be denoted throughout by β̂n, thereby suppressing its dependence on both γ and λn, with β̂n(0) denoting the OLS estimator (with λn = 0). These so-called Bridge estimators were introduced by Frank and Friedman (1993) as a generalization of ridge regression (which occurs for γ = 2). The special case γ = 1 corresponds to the lasso (Tibshirani, 1996). Properties of these estimators have been studied by, among others, Fu (1998), Knight and Fu (2000), Radchenko (2004), and Leeb and Pötscher (2006). For γ ≤ 1, the estimators minimizing (7) have the potentially attractive feature of being exactly 0 if λn is sufficiently large, thus combining parameter estimation and model selection; indeed, model selection methods such as AIC and the Bayesian information criterion (BIC) can be viewed as limiting cases as γ → 0. Also note that when γ < 1, the objective function (7) is not convex and the estimator β̂n can be quite sensitive to the choice of λn; more precisely, when γ < 1, the mapping from λn to β̂n will have jump discontinuities. The SCAD penalty of Fan and Li (2001) is a nonconvex penalty (indexed by two parameters) that combines the features of a lasso-type penalty (for small parameter values) with an AIC-type penalty (for larger parameter values).

We could also replace (7) by a penalized LS criterion that allows a separate tuning parameter for each coefficient:
$$\sum_{i=1}^n \bigl(Y_i - \bar{Y} - x_i^T\beta\bigr)^2 + \sum_{j=1}^p \lambda_{nj} |\beta_j|^\gamma. \tag{8}$$
It is straightforward to generalize the results of this paper to estimators obtained by minimizing (8). However, the objective function (7) is more common in practice; typically, the predictors are scaled to have a variance of 1.
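As an illustration of the criterion (7), the following sketch computes a Bridge estimate by direct numerical minimization. The use of a derivative-free, general-purpose optimizer and the particular values of λn and γ are illustrative choices, not the standard computational approach for these estimators:

```python
# A sketch of Bridge estimation: minimize ||y - X b||^2 + lam * sum_j |b_j|^gamma.
import numpy as np
from scipy.optimize import minimize

def bridge(X, y, lam, gamma):
    """Minimize the penalized LS criterion (7) by Nelder-Mead
    (derivative-free, since the penalty is not differentiable at 0)."""
    beta0 = np.linalg.lstsq(X, y, rcond=None)[0]  # start from OLS
    obj = lambda b: np.sum((y - X @ b) ** 2) + lam * np.sum(np.abs(b) ** gamma)
    res = minimize(obj, beta0, method="Nelder-Mead",
                   options={"xatol": 1e-8, "fatol": 1e-8, "maxiter": 20000})
    return res.x

rng = np.random.default_rng(1)
X = rng.standard_normal((50, 3))
X -= X.mean(axis=0)
y = X @ np.array([2.0, 0.0, -1.0]) + rng.standard_normal(50)
print("gamma = 2 (ridge):", bridge(X, y, lam=5.0, gamma=2.0))
print("gamma = 1 (lasso):", bridge(X, y, lam=5.0, gamma=1.0))
```

For γ < 1 the objective is nonconvex, so a single local search of this kind need not find the global minimizer; this is one practical face of the sensitivity to λn noted above.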

In this section, we will consider the asymptotic behavior of Bridge estimators when the design is nearly singular. More precisely, suppose that Cn (as defined in (2)) is nonsingular but tends to a singular matrix C. In particular, we will assume that
$$a_n(C_n - C) \to D_0 \tag{9}$$
for some sequence {an} tending to infinity, where D0 is positive definite on the null space of C (i.e., vTD0v > 0 for nonzero v with Cv = 0). Note that D0 is necessarily nonnegative definite on the null space of C, so it is not too stringent to require it to be positive definite on this null space. If D0 is not positive definite on the null space of C, then we can modify (9) to obtain appropriate limiting distributions; this will be considered in the next section. We are also assuming (at least implicitly) that the near singularity affects all the predictors in the model. Applications where the condition (9) holds are given in Phillips (2001) and Gabaix and Ibragimov (2006). A referee has also pointed out a possible connection to the problem of weak instruments (cf. Stock, Wright, and Yogo, 2002), for example, in two-stage least squares estimation. Caner (2004, 2006) considers nearly singular designs in the context of generalized method of moments (GMM) estimation.

To obtain consistency and limiting distributions for β̂n minimizing (7), we need to impose conditions on the sequence {λn} so that it does not grow too quickly. In the case where C is nonsingular, Knight and Fu (2000) showed that to obtain nondegenerate limiting distributions for β̂n, we require
$$\lambda_n/\sqrt{n} \to \lambda_0 \ge 0$$
for γ ≥ 1 and λn/n^(γ/2) → λ0 for γ < 1; for nearly singular designs, the growth criterion for {λn} will be somewhat more stringent.

It is worth mentioning that asymptotic results tend to undersell the value of shrinkage estimation in practice. The reason for this is simple. Shrinkage is used in practice to reduce the variability in the estimation of parameters that are "small" by forcing their estimates toward 0 (or setting them to 0). However, from an asymptotic perspective, the only parameters that are "small" are those that are exactly 0, as all other parameters can be distinguished from 0 with probability tending to 1 as n → ∞. Thus for a parameter βk whose value is nonzero, shrinkage generally produces bias in the resulting estimator that may or may not vanish asymptotically, and the resulting asymptotic bias is typically not compensated by a reduction in the asymptotic variance. On the other hand, if βk = 0 then shrinkage will typically reduce the asymptotic variance of the estimator (without any asymptotic bias), which leads to a sort of superefficiency in these cases. Obviously, it is desirable to produce estimators that have no asymptotic bias when βk ≠ 0 and are superefficient when βk = 0; such estimators exist but can be extremely sensitive to small perturbations in the data or to changes in the choice of tuning parameters. The asymptotic results are very useful in giving insight into how sensitive a given methodology is to the choice of tuning parameters.
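The trade-off just described is visible in a simple Monte Carlo experiment. The sketch below uses a single predictor standardized so that Σxi² = n (an illustrative assumption under which the lasso has a closed soft-thresholding form) and compares the √n-scaled errors of OLS and the lasso at β = 0 and β ≠ 0:

```python
# A single-predictor sketch (sigma^2 = 1, sum x_i^2 = n assumed): the lasso
# estimate is then soft(OLS, lambda_n / (2n)), and we take lambda_n = lam0 * sqrt(n).
import numpy as np

rng = np.random.default_rng(2)
n, lam0 = 400, 2.0

def soft(z, t):
    return np.sign(z) * max(abs(z) - t, 0.0)

for beta in (0.0, 1.0):
    ols, lasso = [], []
    for _ in range(5000):
        b_ols = beta + rng.standard_normal() / np.sqrt(n)  # N(beta, 1/n)
        ols.append(b_ols)
        lasso.append(soft(b_ols, lam0 / (2 * np.sqrt(n))))
    for name, e in (("OLS  ", ols), ("lasso", lasso)):
        e = np.sqrt(n) * (np.array(e) - beta)  # root-n scaled error
        print(f"beta = {beta}: {name} bias = {e.mean():+.3f}, variance = {e.var():.3f}")
```

At β = 0 the lasso errors show a smaller variance than OLS with no bias (the superefficiency effect), while at β = 1 the lasso errors carry a nonvanishing bias of roughly −λ0/2 with essentially no variance reduction.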

A useful tool in the development of the asymptotic distribution of the penalized LS estimators is the notion of epi-convergence in distribution, which is discussed in Pflug (1995), Geyer (1994, 1996), and Knight (1999). A sequence of random lower semicontinuous functions {Zn} on ℝ^p (taking values in [−∞,∞]) epi-converges in distribution to Z (written Zn →e-d Z) if for closed rectangles R1,…,Rk with open interiors R1°,…,Rk°, we have
$$P\Bigl(\inf_{u\in R_1} Z(u) > a_1,\ldots,\inf_{u\in R_k} Z(u) > a_k\Bigr) \le \liminf_{n\to\infty} P\Bigl(\inf_{u\in R_1} Z_n(u) > a_1,\ldots,\inf_{u\in R_k} Z_n(u) > a_k\Bigr)$$
$$\le \limsup_{n\to\infty} P\Bigl(\inf_{u\in R_1^o} Z_n(u) \ge a_1,\ldots,\inf_{u\in R_k^o} Z_n(u) \ge a_k\Bigr) \le P\Bigl(\inf_{u\in R_1^o} Z(u) \ge a_1,\ldots,\inf_{u\in R_k^o} Z(u) \ge a_k\Bigr)$$
for all real a1,…,ak. Epi-convergence in distribution is particularly useful for studying estimators that minimize (or maximize) objective functions subject to constraints and also estimators that minimize discontinuous (but lower semicontinuous) objective functions; the best-known notion of weak convergence for functions, which is based on uniform convergence on compact sets (van der Vaart and Wellner, 1996), is poorly suited to these types of objective functions. However, this latter type of weak convergence, when applicable, does imply epi-convergence in distribution.

The limiting distributions of "argmin" estimators can often be determined via epi-convergence of the associated objective functions; in particular, if
$$Z_n \to_{e\text{-}d} Z \quad \text{and} \quad U_n = \operatorname{argmin}(Z_n),$$
where Z has a unique minimizer U, then
$$U_n \to_d U$$
provided that Un = Op(1). For an application of epi-convergence in distribution in the context of estimation in nonregular econometric models, see Chernozhukov and Hong (2004) and Chernozhukov (2005).

In the case where the {Zn} are convex with Zn →e-d Z (where the minimizer U of Z is unique), the condition Un = Op(1) is guaranteed, and so Un →d U. Moreover, in the case of convexity, finite-dimensional weak convergence of {Zn} to Z is sufficient for Zn →e-d Z provided that Z is finite (with probability 1) on an open set (Geyer, 1996); however, for nearly singular designs, this latter condition is not satisfied, as the appropriate limiting objective function is finite only on a lower dimensional subspace of ℝ^p. Finite-dimensional weak convergence implies epi-convergence in distribution if {Zn} is stochastically equi–lower semicontinuous as defined in Knight (1999).

We will now consider the asymptotic behavior of nearly singular designs under fairly weak conditions. We will assume that Cn is nonsingular for all n and satisfies (9) for some sequence {an}. Define bn = (n/an)^(1/2) and define Zn to be
$$Z_n(u) = \sum_{i=1}^n\Bigl[\bigl(\varepsilon_i - x_i^T u/b_n\bigr)^2 - \varepsilon_i^2\Bigr] + \lambda_n\sum_{j=1}^p\Bigl(|\beta_j + u_j/b_n|^\gamma - |\beta_j|^\gamma\Bigr). \tag{10}$$
If β̂n minimizes (7) then the minimizer of (10) is simply bn(β̂n − β); the objective function Zn in (10) is simply a rescaled version of the objective function (7) with constants subtracted to ensure convergence. Note that because bn/√n → 0, the estimators will have a slower rate of convergence than when C is nonsingular.
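A simulation sketch of this slower rate, using the equicorrelated design of Example 1 below with ρn = 1 − ψ/an (the choices an = √n, ψ = 1, p = 4, and β = 0 are illustrative): the spread of bn(β̂n(0) − β) stabilizes while that of √n(β̂n(0) − β) grows like an^(1/2).

```python
# Illustrative simulation: under a nearly singular design the OLS error
# stabilizes under b_n = (n/a_n)^(1/2) scaling, not under sqrt(n) scaling.
import numpy as np

rng = np.random.default_rng(3)
p, psi = 4, 1.0
beta = np.zeros(p)

for n in (200, 800, 3200, 12800):
    a_n = np.sqrt(n)
    rho = 1 - psi / a_n
    L = np.linalg.cholesky((1 - rho) * np.eye(p) + rho * np.ones((p, p)))
    errs = []
    for _ in range(300):
        X = rng.standard_normal((n, p)) @ L.T
        y = X @ beta + rng.standard_normal(n)
        errs.append(np.linalg.solve(X.T @ X, X.T @ y) - beta)
    e1 = np.array(errs)[:, 0]
    b_n = np.sqrt(n / a_n)
    print(f"n = {n:6d}: sd of sqrt(n)*err = {np.sqrt(n) * e1.std():6.2f}, "
          f"sd of b_n*err = {b_n * e1.std():6.2f}")
```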

The following result was given in Knight and Fu (2000).

THEOREM 1. Assume the linear model (1) where Cn in (2) satisfies (3), (4), and (9), where C is singular and D0 is positive definite on the null space of C. Define W to be a mean 0 multivariate normal random vector such that Var(uTW) = σ²uTD0u > 0 for each nonzero u satisfying Cu = 0. Let β̂n minimize (7) for some γ > 0 and λn ≥ 0.

(i) If γ > 1 and λn/bn → λ0 ≥ 0 then
$$b_n(\hat\beta_n - \beta) \to_d \operatorname*{argmin}_{Cu=0} Z(u),$$
where
$$Z(u) = -2u^T W + u^T D_0 u + \lambda_0\gamma\sum_{j=1}^p u_j \operatorname{sgn}(\beta_j)|\beta_j|^{\gamma-1}.$$

(ii) If γ = 1 and λn/bn → λ0 ≥ 0 then
$$b_n(\hat\beta_n - \beta) \to_d \operatorname*{argmin}_{Cu=0} Z(u),$$
where
$$Z(u) = -2u^T W + u^T D_0 u + \lambda_0\sum_{j=1}^p\bigl[u_j \operatorname{sgn}(\beta_j)\, I(\beta_j \ne 0) + |u_j|\, I(\beta_j = 0)\bigr].$$

(iii) If γ < 1 and λn/bn^γ → λ0 ≥ 0 then
$$b_n(\hat\beta_n - \beta) \to_d \operatorname*{argmin}_{Cu=0} Z(u),$$
where
$$Z(u) = -2u^T W + u^T D_0 u + \lambda_0\sum_{j=1}^p |u_j|^\gamma\, I(\beta_j = 0).$$

Proof. Define Zn as in (10). First of all, we must show in each case that Zn →e-d Z0, where Z0(u) = Z(u) for u satisfying Cu = 0 and Z0(u) = ∞ otherwise. This follows by first showing finite-dimensional weak convergence of {Zn} to Z0 and then stochastic equi–lower semicontinuity (e-l-sc) (Knight, 1999) of {Zn}; note that, because Z0 is not finite on an open set, finite-dimensional weak convergence is not sufficient for Zn →e-d Z0 even when Zn is convex (i.e., when γ ≥ 1). Finally, we must show that
$$b_n(\hat\beta_n - \beta) = \operatorname{argmin}(Z_n) = O_p(1).$$
When γ ≥ 1, this holds automatically from the convexity of the Zn's; for γ < 1, it can be established by noting that the quadratic part of Zn grows faster (in ∥u∥) than the nonconvex penalty. █

Note that for γ ≥ 1, the limiting distribution will typically depend on λ0, whereas for γ < 1, this is true only if at least one of the βj's is 0. However, if γ < 1 and at least one βj is 0, then the mapping from λ0 to the limiting distribution will have discontinuities; this mapping is continuous for γ ≥ 1 because of the convexity (in u) of the limiting objective function Z for any λ0.

The condition on λn in part (iii) of Theorem 1 can be modified to achieve the "best of both worlds" for γ < 1, that is, no asymptotic bias for estimators of nonzero parameters and superefficiency for estimators of zero parameters. We do this by assuming that λn/bn^τ → λ0 > 0 where γ < τ < 1. Although this seems attractive, it should be noted that this is an asymptotic condition and does not really give much insight regarding the choice of λn for fixed n.

Example 1

Consider a design with p predictors with common mutual correlation ρn. Assuming the predictors are normalized to have variance 1, we have
$$C_n = (1 - \rho_n) I + \rho_n \mathbf{1}\mathbf{1}^T,$$
where 1 = (1,…,1)ᵀ; we will assume that ρn → 1 and an(1 − ρn) → ψ > 0. In this case, {Cn} converges to the matrix C = 11ᵀ (of all 1's) and an(Cn − C) → D0 where
$$D_0 = \psi\bigl(I - \mathbf{1}\mathbf{1}^T\bigr).$$
(In this example, the form of D0 is not particularly important.) If the matrices are p × p then the null space of C is the space of vectors u with u1 + ··· + up = 0. For the sake of illustration, let us suppose that β1,…,βp are all nonzero and take γ ≥ 1. Then the limiting objective function Z in Theorem 1 is
$$Z(u) = -2u^T W + \psi u^T u + \lambda_0\gamma\sum_{j=1}^p u_j \operatorname{sgn}(\beta_j)|\beta_j|^{\gamma-1} \tag{11}$$
where
$$\mathrm{Var}(u^T W) = \sigma^2\psi u^T u \quad \text{for } u_1 + \cdots + u_p = 0.$$
By Theorem 1, we have (setting bn = (n/an)^(1/2))
$$b_n(\hat\beta_n - \beta) \to_d \operatorname*{argmin}_{u_1+\cdots+u_p=0} Z(u).$$

It is interesting to compare this limiting distribution to the limiting distribution of the OLS estimator:
$$b_n\bigl(\hat\beta_n^{(0)} - \beta\bigr) \to_d \operatorname*{argmin}_{u_1+\cdots+u_p=0} Z_0(u),$$
where Z0 is simply Z in (11) setting λ0 = 0. The size of the asymptotic bias of β̂n relative to β̂n(0) (which is unbiased) depends on the coefficients of u1,…,up in the penalty
$$\lambda_0\gamma\sum_{j=1}^p u_j \operatorname{sgn}(\beta_j)|\beta_j|^{\gamma-1}.$$
Note that these coefficients are bounded (in β) only if γ = 1 (the lasso) and that the bias vanishes for γ > 1 (i.e., argmin(Z) = argmin(Z0)) if, and only if, β1 = ··· = βp, whereas for γ = 1, the bias vanishes under the weaker condition sgn(β1) = ··· = sgn(βp). (In either case the penalty is then a constant multiple of u1 + ··· + up, which is 0 on the null space of C.) It should be noted also that the preceding discussion does not depend on the form of the matrix D0.
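The claims of this example are easy to check numerically. The sketch below (with illustrative values of p, ψ, and β) verifies that an(Cn − C) approaches ψ(I − 11ᵀ) and that the linear coefficients of the penalty are constant across j — so that the bias term vanishes on the null space {u : u1 + ··· + up = 0} — exactly when β1 = ··· = βp for γ > 1 or when sgn(β1) = ··· = sgn(βp) for γ = 1:

```python
# Numerical check of Example 1 (illustrative values of p, psi, beta).
import numpy as np

p, psi = 4, 2.0
ones = np.ones((p, p))
for a_n in (1e2, 1e4, 1e6):
    rho = 1 - psi / a_n
    C_n = (1 - rho) * np.eye(p) + rho * ones
    gap = np.abs(a_n * (C_n - ones) - psi * (np.eye(p) - ones)).max()
    print(f"a_n = {a_n:.0e}: max |a_n (C_n - C) - psi (I - 11^T)| = {gap:.2e}")

def penalty_coeffs(beta, gamma):
    # coefficients of u_j in the limiting penalty: gamma * sgn(beta_j) * |beta_j|^(gamma - 1)
    return gamma * np.sign(beta) * np.abs(beta) ** (gamma - 1)

for gamma, beta in [(2.0, np.array([1.0, 1.0, 1.0, 1.0])),
                    (2.0, np.array([1.0, 2.0, 3.0, 4.0])),
                    (1.0, np.array([1.0, 2.0, 3.0, 4.0])),
                    (1.0, np.array([1.0, -2.0, 3.0, 4.0]))]:
    c = penalty_coeffs(beta, gamma)
    # c^T u = 0 for all u with sum(u) = 0 iff the entries of c are all equal
    print(f"gamma = {gamma}, beta = {beta}: bias vanishes? {np.allclose(c, c.mean())}")
```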

Next suppose that β1 ≠ 0 and β2 = ··· = βp = 0. If γ ≤ 1 and λ0 > 0 then the joint limiting distribution of the estimators of β2,…,βp will have positive probability mass at 0, and because the limiting distribution lies in the null space of C (where the components of u sum to 0), this implies that the limiting distribution of bn(β̂n1 − β1) also has positive probability mass at 0.

Theorem 1 can be extended to model selection methods such as AIC and BIC. Suppose that β̂n minimizes
$$\sum_{i=1}^n\bigl(Y_i - \bar{Y} - x_i^T\beta\bigr)^2 + \lambda_n\sigma^2\sum_{j=1}^p I(\beta_j \ne 0); \tag{12}$$
for AIC, λn = 2, whereas for BIC, we have λn = ln(n). The following result gives the limiting distribution in AIC-like situations where λn → λ0 ∈ (0,∞).

THEOREM 2. Assume the linear model (1) where Cn in (2) satisfies (3), (4), and (9), where C is singular and D0 is positive definite on the null space of C. Suppose that β̂n minimizes (12) and λn → λ0 ≥ 0. Then
$$b_n(\hat\beta_n - \beta) \to_d \operatorname{argmin}(Z_0),$$
where Z0(u) = ∞ for Cu ≠ 0, with
$$Z_0(u) = \frac{1}{\sigma^2}\bigl(-2u^T W + u^T D_0 u\bigr) + \lambda_0\sum_{j=1}^p I(\beta_j \ne 0 \text{ or } u_j \ne 0)$$
for u in the null space of C (with W as in Theorem 1).

Proof. Define the objective function
$$Z_n(u) = \frac{1}{\sigma^2}\sum_{i=1}^n\Bigl[\bigl(\varepsilon_i - x_i^T u/b_n\bigr)^2 - \varepsilon_i^2\Bigr] + \lambda_n\sum_{j=1}^p I(\beta_j + u_j/b_n \ne 0) \tag{13}$$
and note that it is minimized at bn(β̂n − β). First of all, outside of the null space of C, it is easy to see that Zn(u) →p ∞. For u in the null space of C, we have
$$\frac{1}{\sigma^2}\sum_{i=1}^n\Bigl[\bigl(\varepsilon_i - x_i^T u/b_n\bigr)^2 - \varepsilon_i^2\Bigr] \to_d \frac{1}{\sigma^2}\bigl(-2u^T W + u^T D_0 u\bigr).$$
For the penalty term,
$$\lambda_n\sum_{j=1}^p I(\beta_j + u_j/b_n \ne 0) \to \lambda_0\sum_{j=1}^p I(\beta_j \ne 0 \text{ or } u_j \ne 0).$$
Thus we have finite-dimensional weak convergence of {Zn} to Z0. Epi-convergence in distribution follows by establishing e-l-sc (Knight, 1999); we need to show that for each bounded set B, ε > 0, and δ > 0, there exist u1,…,um ∈ B and open neighborhoods O(u1),…,O(um) such that
$$B \subseteq \bigcup_{j=1}^m O(u_j)$$
and
$$\limsup_{n\to\infty} P\Bigl(\min_{1\le j\le m}\Bigl[\inf_{u\in O(u_j)} Z_n(u) - Z_n(u_j)\Bigr] \le -\varepsilon\Bigr) \le \delta.$$
First of all, note that Zn is finite for each n with discontinuities at points that do not depend on n. If B does not intersect the null space of C then e-l-sc is straightforward; for each n, Zn is approximately a quadratic function that is tending to +∞. On the other hand, if B does intersect the null space of C then we can take u1,…,um to lie in this null space and obtain the desired inequality. It remains only to establish that argmin(Zn) = Op(1); this follows because for n greater than some n0, we have Zn(0) ≤ (λ0 + ε)p, and there exists a compact set Kε such that
$$\liminf_{n\to\infty} P\Bigl(\inf_{u\notin K_\varepsilon} Z_n(u) > (\lambda_0 + \varepsilon)p\Bigr) \ge 1 - \varepsilon$$
for each ε > 0. Thus argmin(Zn) = Op(1). █

As noted previously, AIC corresponds to the case where λ0 = 2; Theorem 2 confirms the well-known fact that AIC is not a consistent model selection method in the sense that if β(r+1) = ··· = βp = 0 then asymptotically AIC gives positive probability to models with at least one of β(r+1),…,βp nonzero. Note, however, that the parameter estimators computed by minimizing AIC are themselves consistent (in this case, bn-consistent). So-called consistent model selection procedures such as BIC have λn → ∞ at some (usually slow) rate.
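The behavior that Example 2 below reports can be reproduced qualitatively by Monte Carlo. The following sketch performs best-subset selection by enumerating all 2^p submodels, treating σ² = 1 as known; the settings (n, p, ρ, number of replications) are illustrative, so the frequencies need not match Table 1:

```python
# Monte Carlo sketch of AIC model-size selection under a null model
# with equicorrelated predictors (illustrative settings).
import numpy as np
from itertools import combinations

rng = np.random.default_rng(4)
n, p, rho, lam = 100, 5, 0.9, 2.0  # lam = 2 corresponds to AIC
L = np.linalg.cholesky((1 - rho) * np.eye(p) + rho * np.ones((p, p)))

sizes = np.zeros(p + 1)
for _ in range(1000):
    X = rng.standard_normal((n, p)) @ L.T
    y = rng.standard_normal(n)           # null model: all beta_j = 0
    best_size, best_crit = 0, y @ y      # empty model: criterion = RSS = ||y||^2
    for k in range(1, p + 1):
        for S in combinations(range(p), k):
            Xs = X[:, list(S)]
            resid = y - Xs @ np.linalg.lstsq(Xs, y, rcond=None)[0]
            crit = resid @ resid + lam * k   # RSS + lam * sigma^2 * model size
            if crit < best_crit:
                best_crit, best_size = crit, k
    sizes[best_size] += 1
print("estimated P(model size = k), k = 0,...,p:", sizes / sizes.sum())
```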

THEOREM 3. Assume the linear model (1) where Cn in (2) satisfies (3), (4), and (9), where C is singular and D0 is positive definite on the null space of C. Suppose that β̂n minimizes (12) where λn → +∞ with λn = o(bn²). Then
$$b_n(\hat\beta_n - \beta) \to_d \operatorname{argmin}(Z_0),$$
where Z0(u) = ∞ if Cu ≠ 0 or uj ≠ 0 for some j with βj = 0, with
$$Z_0(u) = \frac{1}{\sigma^2}\bigl(-2u^T W + u^T D_0 u\bigr)$$
for all other u in the null space of C (with W as in Theorem 1).

Proof. When λn → ∞, we can rewrite Zn in (13) as
$$Z_n(u) = \frac{1}{\sigma^2}\sum_{i=1}^n\Bigl[\bigl(\varepsilon_i - x_i^T u/b_n\bigr)^2 - \varepsilon_i^2\Bigr] + \lambda_n\sum_{j=1}^p\bigl[I(\beta_j + u_j/b_n \ne 0) - I(\beta_j \ne 0)\bigr] + \lambda_n\sum_{j=1}^p I(\beta_j \ne 0),$$
where the last term does not depend on u and so can be dropped. Then for βj ≠ 0,
$$\lambda_n\bigl[I(\beta_j + u_j/b_n \ne 0) - 1\bigr] \to 0$$
uniformly over compact sets (and thus this convergence is also epi-convergence). On the other hand, if βj = 0 then
$$\lambda_n I(u_j \ne 0) \to \begin{cases} 0 & \text{if } u_j = 0, \\ +\infty & \text{if } u_j \ne 0, \end{cases}$$
where this pointwise convergence can be extended to epi-convergence. Now note that if λn grows too quickly then Zn may be minimized at some u having uj = −bnβj for some j with βj ≠ 0 (so that ∥u∥ → ∞); this possibility is ruled out by the assumption that λn = o(bn²), and so argmin(Zn) = Op(1). █

The form of the penalty in the asymptotic objective effectively forces the limiting distribution of bnβ̂nj to be a point mass at 0 when βj = 0.

Example 2

Consider a design with Cn defined as in Example 1 with β1 = ··· = βp = 0. When ρn = ρ for some fixed 0 ≤ ρ < 1, Table 1 gives the limiting distribution of the estimated model size under AIC for p = 5 and p = 10 with ρ = 0, 0.5, 0.9; given that we have a "null" model here, the correct model size is 0. Table 1 suggests that as ρ → 1, the probability of selecting a model of size 1 decreases and the probabilities of selecting a model of size 0 or 2 increase. The result of Theorem 2 suggests that if an(1 − ρn) → ψ > 0 then the probability of AIC selecting a model of size 1 tends to 0; this is somewhat misleading, as the mapping
$$u \mapsto \sum_{j=1}^p I(u_j \ne 0)$$
is not continuous at any u having at least one 0 component.

Table 1. Limiting distributions of estimated model size for AIC with p = 5 and 10 predictors and mutual interpredictor correlations of ρ = 0, 0.5, and 0.9. The probability estimates are based on 10,000 replications and have a standard error of at most 0.005.

3. OTHER POINTS OF INTEREST

3.1. Higher Order Near Singularity

Theorems 1 and 2 require that the matrix D0 be positive definite on the null space of C. Unfortunately, this is not always true; Phillips (2001) gives an example involving polynomial regression with “slowly varying” predictors where this condition is violated.

The near singularity condition an(Cn − C) → D0 with uTD0u > 0 for nonzero u satisfying Cu = 0 can be generalized as follows. We start by recursively defining matrices H1, D1, H2, D2,… such that

Now define the following subspace of the null space of C:

Note that this subspace is always well defined (if C, D0,…,Dk in (14)–(17) are well defined), as it contains at least the vector 0. However, we are most interested in cases where it is larger. We can then redefine bn in terms of an^(1),…,an^(k) as follows:

Then it can be shown that the conclusions of Theorems 1 and 2 hold with bn defined as in (19), with Var(uTW) = σ²uTDku > 0 for u in the subspace defined in (18), and with Dk replacing D0 in the definition of Z(u).

It is also possible to extend the results of this paper to cases where different degrees of near singularity exist in disjoint subsets of the variables; in this case, we will obtain different convergence rates for the estimators of the parameters in the different subsets. For example, if xi = vec(xi(1), xi(2)) then
$$C_n = \begin{pmatrix} C_n^{(11)} & C_n^{(12)} \\ C_n^{(21)} & C_n^{(22)} \end{pmatrix}.$$
Suppose that Cn → C where Cn(11) → C(11) (nonsingular) and Cn(22) → C(22) (singular) with an(Cn(22) − C(22)) → D0(22); then it is also reasonable to assume that an^(1/2)(Cn(12) − C(12)) → D0(12) (and likewise for Cn(21)). In this case, writing β = vec(β(1), β(2)), we would typically (i.e., subject to other regularity conditions) have
$$\sqrt{n}\bigl(\hat\beta_n^{(1)} - \beta^{(1)}\bigr) = O_p(1) \quad \text{and} \quad b_n\bigl(\hat\beta_n^{(2)} - \beta^{(2)}\bigr) = O_p(1),$$
where bn = (n/an)^(1/2). For shrinkage estimation minimizing (7), we would need to choose λn to match the slowest rate of convergence to obtain nondegenerate limiting distributions.
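A numerical sketch of the two-block case (the construction below, with an = √n, ψ = 1, and β = 0, is an illustrative choice): the OLS coefficients for the well-conditioned block stabilize under √n scaling, while those for the nearly singular block stabilize only under the slower bn scaling.

```python
# Illustrative simulation: two blocks of predictors with different degrees
# of near singularity give two different rates of convergence for OLS.
import numpy as np

rng = np.random.default_rng(5)
p1, p2, psi = 2, 2, 1.0
for n in (400, 1600, 6400):
    a_n = np.sqrt(n)
    rho = 1 - psi / a_n
    L2 = np.linalg.cholesky((1 - rho) * np.eye(p2) + rho * np.ones((p2, p2)))
    errs = []
    for _ in range(300):
        X = np.hstack([rng.standard_normal((n, p1)),           # well conditioned
                       rng.standard_normal((n, p2)) @ L2.T])   # nearly singular
        y = rng.standard_normal(n)                             # beta = 0 for simplicity
        errs.append(np.linalg.solve(X.T @ X, X.T @ y))
    errs = np.array(errs)
    b_n = np.sqrt(n / a_n)
    print(f"n = {n:5d}: sd of sqrt(n)*(block-1 coef) = {np.sqrt(n) * errs[:, 0].std():5.2f}, "
          f"sd of b_n*(block-2 coef) = {b_n * errs[:, p1].std():5.2f}")
```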

3.2. Maximum Likelihood and GMM Estimation

The results of this paper extend naturally to maximum likelihood estimation where the information matrix is nearly singular. In regular models where the log-likelihood function is locally quadratic, it is straightforward to extend Theorems 1 and 2; applications would include model selection and shrinkage estimation for so-called generalized linear models, which include logistic regression and log-linear Poisson regression. As mentioned previously, the notion of near singularity may be very useful in determining the asymptotic behavior of estimation procedures with weak instruments. In the context of GMM and generalized empirical likelihood estimation, Caner (2004, 2006) has investigated similar issues.

It is worth noting that there is a considerable literature on estimation for models where the information matrix is singular; for some recent examples, see Barnabani (2002) and Rotnitzky, Cox, Bottai, and Robins (2000). In such cases, typically the limiting distributions of maximum likelihood estimators are concentrated on a lower dimensional subspace or have a slower rate of convergence than the standard rate.

REFERENCES

Barnabani, M. (2002) Wald-Based Approach with Singular Information Matrix. Working paper 2002/13, Department of Statistics, University of Florence.
Caner, M. (2004) Nearly Singular Design in GMM and Generalized Empirical Likelihood Estimators. Working paper, Department of Economics, University of Pittsburgh.
Caner, M. (2006) Lasso Type GMM Estimator. Working paper, Department of Economics, University of Pittsburgh.
Chernozhukov, V. (2005) Extremal quantile regression. Annals of Statistics 33, 806–839.
Chernozhukov, V. & H. Hong (2004) Likelihood estimation and inference in a class of nonregular econometric models. Econometrica 72, 1445–1480.
Fan, J. & R. Li (2001) Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association 96, 1348–1360.
Frank, I.E. & J.H. Friedman (1993) A statistical view of some chemometrics regression tools (with discussion). Technometrics 35, 109–148.
Fu, W.J. (1998) Penalized regressions: The bridge versus the lasso. Journal of Computational and Graphical Statistics 7, 397–416.
Gabaix, X. & R. Ibragimov (2006) Log(rank − 1/2): A Simple Way to Improve the OLS Estimation of Tail Exponents. Working paper, Harvard Institute of Economic Research.
Geyer, C.J. (1994) On the asymptotics of constrained M-estimation. Annals of Statistics 22, 1993–2010.
Geyer, C.J. (1996) On the Asymptotics of Convex Stochastic Optimization. Technical report, Department of Statistics, University of Minnesota.
Hoerl, A.E. & R.W. Kennard (1970) Ridge regression: Biased estimation for nonorthogonal problems. Technometrics 12, 55–67.
Knight, K. (1999) Epi-convergence in Distribution and Stochastic Equi-semicontinuity. Unpublished manuscript, Department of Statistics, University of Toronto.
Knight, K. & W. Fu (2000) Asymptotics for lasso-type estimators. Annals of Statistics 28, 1356–1378.
Leeb, H. & B. Pötscher (2006) Performance limits for estimators of the risk or distribution of shrinkage-type estimators, and some general lower risk-bound results. Econometric Theory 22, 69–97.
Lorber, A., L.E. Wangen, & B.R. Kowalski (1987) A theoretical foundation for the PLS algorithm. Journal of Chemometrics 1, 19–31.
Pflug, G.C. (1995) Asymptotic stochastic programs. Mathematics of Operations Research 20, 769–789.
Phillips, P.C.B. (2001) Regression with Slowly Varying Regressors. Cowles Foundation Discussion Paper 1310, Yale University.
Radchenko, P. (2004) Reweighting the Lasso. Unpublished manuscript, Department of Statistics, University of Chicago.
Rotnitzky, A., D.R. Cox, M. Bottai, & J. Robins (2000) Likelihood-based inference with singular information matrix. Bernoulli 6, 243–284.
Srivastava, M.S. (1971) On fixed width confidence bounds for regression parameters. Annals of Mathematical Statistics 42, 1403–1411.
Stock, J.H., J.H. Wright, & M. Yogo (2002) A survey of weak instruments and weak identification in generalized method of moments. Journal of Business & Economic Statistics 20, 518–529.
Stone, M. & R.J. Brooks (1990) Continuum regression: Cross-validated sequentially constructed prediction embracing ordinary least squares, partial least squares and principal components regression (with discussion). Journal of the Royal Statistical Society, Series B 52, 237–269; corrigendum (1992) 54, 906–907.
Tibshirani, R. (1996) Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B 58, 267–288.
van der Vaart, A.W. & J.A. Wellner (1996) Weak Convergence and Empirical Processes with Applications to Statistics. Springer.
Wold, H. (1984) PLS regression. In N.L. Johnson & S. Kotz (eds.), Encyclopedia of Statistical Sciences, vol. 6, pp. 581–591. Wiley.