Independence claims in linguistics

John C. Paolillo

doi:10.1017/S0954394511000081

Published online by Cambridge University Press: 05 August 2011

Affiliation: Indiana University

Abstract

Empirical work in linguistics often puts forward claims about the independence of two phenomena as substantive hypotheses. But, independence is always an assumption in the framework of empirical hypothesis testing, meaning independence claims are not empirically verifiable. Hence, they need to be regarded differently from truly empirical hypotheses within the substantive theories of which they are part. In this paper, a number of independence claims are illustrated, alongside their problematic consequences for empirically guided theorizing. Recommendations are made regarding the use of independence that should facilitate the empirical testing of substantive linguistic hypotheses.

Type: Research Article
Information: Language Variation and Change, Volume 23, Issue 2, July 2011, pp. 257–274

DOI: https://doi.org/10.1017/S0954394511000081
Copyright: © Cambridge University Press 2011

Much of linguistics is practiced as an amalgam of structuralist and rationalist practices, in which observation is regarded as a relatively uncomplicated process and emphasis is placed on deduction in the development of theoretical arguments. While clarity and soundness in rational argumentation are important in theory building, clarity in observation is also important. Used in conjunction, observation and reasoning constrain each other. But because of the greater emphasis on rationalism in linguistics, it is quite common for observation to be sacrificed in favor of some theoretical goal (e.g., parsimony, elegance). Consequently, rigor and verifiability are often compromised.

Variationist linguistics is a mode of empirical linguistic inquiry with more developed practices of observation than some other branches. Specifically, linguistic observations are not taken as solitary facts, usable in isolation; instead, they are regarded as potentially varying indications of more general tendencies. Statistical models of varying sophistication are used to sift apart the influences on any given observation that are either general or due to chance, and theorization directly addresses the tendencies that are generalized from the pooled observations. Variationist linguistics is thus an empirical social science. It uses the same statistical models and observational protocols as other empirical social sciences.

Nonetheless, the coupling of observation and theory is not always complete. One way in which a disconnect between observation and theory shows up is in the way that independence, in the statistical sense of the term, emerges as a central claim of a linguistic theory. Statistical independence is the default assumption in hypothesis testing. In fact, it is a basic axiom guiding inference in hypothesis testing. Hence, one does not empirically discover independence in the same way that one discovers nonindependence, and a reported finding of independence cannot be interpreted in the same way, either.

Independence claims come in a variety of forms in linguistics. Linguists often assume that social and linguistic environments in a variable process do not interact (Fasold, 1991), or that the rate of linguistic change is uniform (Kroch, 1989), or that the natural way to combine implicational hierarchies in Optimality Theory (Prince & Smolensky, 1993) is to “harmonically align” the two scales. All of these are independence claims, and all are potentially problematic: they present empirically untestable claims as if they were substantive, empirical hypotheses. Theory building around such concepts should take this into account.

The remainder of this paper is organized as follows. In the next section, statistical hypothesis testing is explained. Emphasis is placed on the logic of the significance test within a statistical modeling framework, and the place of statistical independence in relation to hypothesis testing. The third section outlines four types of independence claims in linguistics: (i) the noninteraction of social and grammatical constraints in variable rules, (ii) the Constant Rate Effect, (iii) implicational relations in two-way tables, and (iv) Harmonic Alignment in Optimality Theory. The final section summarizes the points presented about independence claims in linguistics and suggests some desiderata for theory building so that researchers can avoid asserting independence claims as empirical hypotheses.

HYPOTHESIS TESTING

Independence is an important concept in statistics, and it is fundamental to the construction and use of statistical models. Two events are said to be independent if knowing the outcome of one provides no information about the likely outcome of the other. Statistically, the probability of one event does not depend on the other; the conditional probability of the second event is the same, whatever the outcome of the first event. Statistical independence implies causal independence; hence, independence is a basic assumption, and any hypothesized dependence of events requires special evidence. The procedure for establishing this is known as a hypothesis test.

A hypothesis test applies probability theory as a guide to developing inferences, based on the distribution of some set of observations; it has a number of components: (i) a mathematical model of the expected distribution of observations, (ii) a stochastic component representing chance occurrence, and (iii) a test statistic whose chance distribution is known, measuring the fit of the model to the observed data. These components constrain each other and need to be chosen together in a way that is appropriate to the observations being made.

The mathematical model is typically composed of a number of terms involving predictor variables, or simply predictors (e.g., x₁, x₂), corresponding to distinct aspects of each observation; these are known as “factors” in the variationist tradition and correspond to contextual conditions, whether social, linguistic, or otherwise. Each predictor variable is associated with a parameter (β) representing the quantitative strength of the effect of the associated predictor. These are combined into a mathematical statement, for which the expression in (1) is typical.

$f(y) = \alpha + \beta_1 x_1 + \beta_2 x_2 + \ldots + \varepsilon \qquad (1)$

In (1), y represents the variable of special interest, known as the response variable.¹ A typical response would be the relative frequency of t/d deletion in a sample of English. The function f(y) is a link function, relating the predicted response variable to a sum of terms; because (1) has a link function, it is a Generalized Linear Model (GLM).² Logistic regression, commonly used in variationist analysis (Varbrul/GoldVarb), is a GLM where f(y) is specified as ln[y/(1−y)] (McCullagh & Nelder, 1989; Paolillo, 2002).³ The term ɛ represents the error distribution, the model's stochastic component.
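As a minimal sketch of how the logit link in (1) operates, the following Python fragment maps a sum of model terms onto the probability scale and back. The function names and the parameter values are our own illustrative assumptions and are not taken from any published analysis; the error term ɛ is omitted, so only the systematic part of the model is shown.

```python
import math

def logit(p):
    """Link function f(y) = ln[y / (1 - y)] for a proportion 0 < p < 1."""
    return math.log(p / (1.0 - p))

def inv_logit(eta):
    """Inverse link: map a value on the logit scale back to a probability."""
    return 1.0 / (1.0 + math.exp(-eta))

# Hypothetical parameter values, chosen only for illustration.
alpha, beta1, beta2 = -0.5, 1.2, -0.8

def predicted_probability(x1, x2):
    """Systematic part of model (1), without the error term, as a probability."""
    eta = alpha + beta1 * x1 + beta2 * x2
    return inv_logit(eta)

# Predicted rate of application in each combination of two binary predictors.
for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, round(predicted_probability(x1, x2), 3))
```

On this reading, the parameters are additive on the logit scale, while the predicted proportions are a nonlinear function of them.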

For any given situation, different model statements with different terms are possible, each making a distinct statement about the relations among the observations. For example, a simpler model than (1) could omit the β₁x₁ term, the β₂x₂ term, or both. Model statements omitting a term express the statistical independence of y and the omitted predictor x. Alternatively, a predictor x corresponding to any term whose β value is 0 is independent of y, because that mathematically accomplishes the same thing. Moreover, the terms of the model (1) are assumed to take their values independently of one another: x₁ and x₂ have independent effects on y in this model. There is no need to consult the value of x₁ to determine what the effect on y is for a given value of x₂, or vice versa. Of course, a different model could include a term β₁,₂x₁x₂ whose value depends on the conjunction of the values of x₁ and x₂; such a term is called an interaction term and it represents an effect in which y, x₁, and x₂ are all mutually dependent. The term β₁,₂x₁x₂ is nonetheless independent of the other model terms β₁x₁ and β₂x₂ in that there are distinct contributions to y for each of the three terms. Each term also corresponds to an independently testable hypothesis about the relationship between the x values and y.
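The difference between independent effects and an interaction can be made concrete with a short sketch. Assuming two binary predictors and hypothetical parameter values (the names `additive` and `interacting` are ours), the fragment below computes the effect of x₂ on the logit scale at each value of x₁.

```python
def linear_predictor(x1, x2, alpha, b1, b2, b12=0.0):
    """Sum of model terms on the link scale; b12 is the interaction parameter."""
    return alpha + b1 * x1 + b2 * x2 + b12 * x1 * x2

def effect_of_x2(x1, **params):
    """Change in the linear predictor when x2 goes from 0 to 1, at a given x1."""
    return linear_predictor(x1, 1, **params) - linear_predictor(x1, 0, **params)

# Hypothetical parameter values for illustration only.
additive = dict(alpha=-1.0, b1=0.5, b2=1.5)               # no interaction term
interacting = dict(alpha=-1.0, b1=0.5, b2=1.5, b12=-2.0)  # with interaction term

for name, params in (("additive", additive), ("interacting", interacting)):
    effects = [effect_of_x2(x1, **params) for x1 in (0, 1)]
    print(name, effects)
```

In the additive model the effect of x₂ is the same whatever the value of x₁, so x₁ need not be consulted; once the interaction parameter is nonzero, the effect of x₂ differs across values of x₁.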

Hypothesis testing builds on this kind of model comparison directly by designating some simpler model as a null hypothesis (H₀). Classically, H₀ is described as an absolutely minimal model with only a chance component (e.g., f(y) = ɛ, meaning the observed y is independent of all other observed events), but any reasonable model may have the status of H₀ for a specific test. Most commonly in a model such as (1), terms of the model are tested independently, and for each test, H₀ represents β = 0 for some specific term; that is, the test asks whether the term needs to be included in the model or not.

The hypothesis test is performed by computing a test statistic and comparing its value with its expected chance distribution, derived from ɛ.⁴ The probability of the observed value under that distribution is then compared with an established criterion probability, chosen to be suitably small. Values of the test statistic whose probability is less than the criterion value are considered to indicate that H₀ is inadequate and should be rejected; such values are called “significant.” Values whose probability is greater than the chosen criterion value are regarded as nonsignificant, and the model is retained as satisfactory. When H₀ represents β = 0, significance means that the term needs to be retained (y is dependent on x), and nonsignificance means that y can be treated as independent of the corresponding x. For interaction terms, the interpretation is similar: if H₀ (β₁,₂ = 0) is not rejected, y is assumed to be independent of the conjunction of x₁ and x₂; otherwise, y, x₁, and x₂ are mutually dependent.
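One common way to carry out such a test for a single binary predictor is a likelihood-ratio comparison of the model with and without the term. The sketch below is only an illustration under stated assumptions: the counts are invented, the function names are ours, and the chi-square distribution with one degree of freedom is taken as the chance distribution of the statistic.

```python
import math

def log_lik(k, n, p):
    """Binomial log-likelihood (up to a constant) of k applications in n tokens."""
    return k * math.log(p) + (n - k) * math.log(1.0 - p)

def likelihood_ratio_test(k0, n0, k1, n1):
    """Test H0: beta = 0 (one rate for both contexts); assumes 0 < k < n in each."""
    pooled = (k0 + k1) / (n0 + n1)
    ll_null = log_lik(k0, n0, pooled) + log_lik(k1, n1, pooled)
    # Alternative model: each context has its own rate (the term is included).
    ll_alt = log_lik(k0, n0, k0 / n0) + log_lik(k1, n1, k1 / n1)
    g2 = 2.0 * (ll_alt - ll_null)
    # Upper tail of a chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(max(g2, 0.0) / 2.0))
    return g2, p_value

# Invented counts: 30 of 100 tokens in one context, 50 of 100 in the other.
g2, p_value = likelihood_ratio_test(30, 100, 50, 100)
criterion = 0.05
print("reject H0" if p_value < criterion else "retain H0", round(g2, 2))
```

A nonsignificant result from such a routine only means that H₀ was retained; the routine returns nothing that could count as positive evidence for β = 0.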

Key to statistical hypothesis testing is the notion of rejecting H₀ under evidence that it is unlikely to be true. There is an asymmetry between significant and nonsignificant results in this respect. For significant results, H₀ is rejected, leading the researcher to propose an alternative model. For nonsignificant results, H₀ is retained, but because it was assumed to begin with, its validity has not actually been empirically established by the significance test, but rests instead upon the reasonableness of any arguments that were used to suggest it in the first place. Furthermore, H₀ always represents some kind of independence: y is independent of some x, or y is independent of some conjunction of x's. Hence, independence is always the default assumption in any hypothesis test.

Types of error

A crucial notion in significance testing is the notion of error, by which we mean an incorrect inference. Since statistical inference is based on a probability, specifically a probability associated with H₀, it is guaranteed that some statistical inferences we make will be incorrect. Yet, because we are able to know the probability associated with these errors, it is possible to understand their risk and optimize it in the interest of soundness of inference, on the one hand, and attention to theoretically interesting phenomena, on the other. Normally, this optimum is selected by adjusting the criterion probability for a significance test.

To understand how this optimization works, we need to distinguish two types of incorrect inference, known as type 1 and type 2 errors. A type 1 error occurs when H₀ is falsely rejected in favor of an alternative hypothesis, when in fact the independence represented by H₀ holds. A type 2 error occurs when H₀ is not rejected, even though it does not hold, and some alternative hypothesis H₁ would be more appropriate. Type 1 and type 2 errors are complementary, and there is a tradeoff between them such that as the probability of type 1 error decreases, the probability of type 2 error increases.⁵ But only the distribution under H₀ is actually known, because the hypothesis space for alternatives to H₀ is uncountably infinite and cannot be known. Hence, in hypothesis testing, we adjust the criterion level to be conservative with respect to the probability of type 1 error and are forced to be agnostic about the probability of type 2 error.⁶ In practice, criterion probabilities (often called p levels) of .05 or .01 are typically taken to be acceptable, with the latter being a stricter criterion requiring stronger evidence to reject H₀.
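The tradeoff can be illustrated by simulation. The sketch below is a hypothetical setup of our own: two contexts of 100 tokens each, a likelihood-ratio test of equal rates, and one arbitrarily chosen alternative (rates of .3 and .5). It estimates how often H₀ is rejected when it holds and how often it is retained when that particular alternative holds.

```python
import math
import random

def xlogy(x, y):
    """Return x * ln(y), treating 0 * ln(0) as 0."""
    return 0.0 if x == 0 else x * math.log(y)

def lr_p_value(k0, n0, k1, n1):
    """p value of a 1-df likelihood-ratio test of H0: equal rates in two contexts."""
    def loglik(k, n, p):
        return xlogy(k, p) + xlogy(n - k, 1.0 - p)
    pooled = (k0 + k1) / (n0 + n1)
    null = loglik(k0, n0, pooled) + loglik(k1, n1, pooled)
    alt = loglik(k0, n0, k0 / n0) + loglik(k1, n1, k1 / n1)
    g2 = max(2.0 * (alt - null), 0.0)
    return math.erfc(math.sqrt(g2 / 2.0))

def simulate_p_values(p0, p1, n, trials, seed):
    """Draw `trials` pairs of samples of size n and test each pair."""
    rng = random.Random(seed)
    p_values = []
    for _ in range(trials):
        k0 = sum(rng.random() < p0 for _ in range(n))
        k1 = sum(rng.random() < p1 for _ in range(n))
        p_values.append(lr_p_value(k0, n, k1, n))
    return p_values

def rejection_rate(p_values, criterion):
    """Proportion of simulated tests in which H0 is rejected."""
    return sum(p < criterion for p in p_values) / len(p_values)

# H0 true: same rate in both contexts, so every rejection is a type 1 error.
under_h0 = simulate_p_values(0.3, 0.3, n=100, trials=2000, seed=1)
# One particular H1 (rates .3 and .5): every non-rejection is a type 2 error.
under_h1 = simulate_p_values(0.3, 0.5, n=100, trials=2000, seed=2)

for criterion in (0.05, 0.01):
    type1 = rejection_rate(under_h0, criterion)
    type2 = 1.0 - rejection_rate(under_h1, criterion)
    print(f"criterion {criterion}: type 1 ~ {type1:.3f}, type 2 ~ {type2:.3f}")
```

The type 2 estimate is only available here because the simulation stipulates a specific H₁; in an actual study, no such stipulation is given by the data.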

The distinction between type 1 and type 2 errors allows us to make an important connection between different types of substantive hypotheses and the kinds of inferences we can draw from a hypothesis test. One type of hypothesis asserts that two phenomena are mutually dependent in some way; either one causes the other, or the two arise together from some common cause. Such a hypothesis will be represented in a model by a term relating some predictor x to the observed y. The null hypothesis H₀ is represented by β = 0, which is readily tested by a statistical hypothesis test; the type 1 error, under assumed independence, is known. Such hypotheses, positing dependence, are therefore empirically testable.

A second type of hypothesis proceeds in the other direction, arguing that there is not any kind of statistical dependence between two events. This requires taking some unspecified H₁ as the starting point and arguing for independence. However, it is not clear what H₁ to start with (there are uncountably many), and so we cannot know the probability distribution of the test statistic under H₁. Actually, we must consider all possible H₁, and we must show all of them to be unlikely; many have effects that are immeasurably small, and so not empirically distinct from H₀. In other words, we cannot know the probability distribution of type 2 error, the way we can for type 1, and we cannot set a criterion probability for it. Hence, claims of independence are hypotheses that are not empirically testable.
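To see why the probability of type 2 error cannot be fixed in advance, consider a rough sketch in which the true difference between two contexts is treated as a free quantity. The setup is hypothetical: the baseline rate, sample size, and function name are our assumptions, and the power calculation uses a normal approximation rather than an exact method.

```python
from statistics import NormalDist

def approx_power(p0, delta, n, criterion=0.05):
    """Approximate probability of rejecting H0: delta = 0 (two-sided test),
    with n tokens per context and true rates p0 and p0 + delta."""
    std_normal = NormalDist()
    p1 = p0 + delta
    se = (p0 * (1 - p0) / n + p1 * (1 - p1) / n) ** 0.5
    z_crit = std_normal.inv_cdf(1 - criterion / 2)
    shift = abs(delta) / se
    return std_normal.cdf(shift - z_crit) + std_normal.cdf(-shift - z_crit)

# The type 2 error probability is 1 - power, and it changes with every delta.
for delta in (0.0, 0.001, 0.01, 0.05, 0.2):
    power = approx_power(0.3, delta, n=100)
    print(f"delta = {delta}: P(type 2 error) ~ {1 - power:.3f}")
```

Under these assumptions, alternatives with very small effects are rejected hardly more often than H₀ itself, which is the sense in which they are not empirically distinct from it.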

The role of independence in a hypothesis makes an important distinction between two types of hypothesis and the roles they can have in theory building, when empirical procedures are used to guide the process. Claims of dependence are empirically testable, and contrast with a default position of assumed independence. Claims of independence are not empirically testable. When they arise in a theory, they arise as assumptions and are not guided by empirical principles.

CLAIMS OF INDEPENDENCE

We turn now from the statistical view of independence to considering specific cases in which statements of independence are made as substantive theoretical claims. Independence claims take different forms and are not always easy to recognize. Moreover, some forms of independence claims actually appear to be something else, like logical implication. Other times, independence claims are embedded within other claims, and some effort is required to disentangle them. Nonetheless, independence claims are common in linguistics. In this section, we examine four different claims of independence and illustrate the problems they raise for linguistic theorizing. We begin with the most transparent types of independence claims, and proceed toward forms of independence claims that are less readily recognizable.

Noninteraction of internal and external constraints

An oft-repeated claim in the variationist literature is that internal (linguistic) and external (social) constraints on variation are independent. The claim is found in both substantive and methodological versions, and their consequences are somewhat different. The ultimate source of these claims appears to be Labov; one version appears in Fasold (1991), citing Labov (no reference given), and a nearly identical formulation is given in Labov (1994) as justification for the organization of Principles of Linguistic Change, in which “internal factors” and “external factors” are each explored in separate volumes published seven years apart.

The multivariate analyses that are included in these volumes almost always include both internal and external factors. When the analyses are carried out, it appears that the two sets of factors—internal and external—are effectively independent of each other. … Moreover the internal factors are normally independent of each other, while the external factors are heavily interactive. These basic sociolinguistic findings provide the methodological rationale for the way the material in volumes 1 and 2 is divided and for the separate discussion of internal and external factors. (Labov, 1994:3)⁷

On the substantive level, Labov wants to be able to discuss types of linguistic changes (chain shifts, mergers, and splits) separate from their contexts, so that questions posed by structural and historical linguists can be addressed. On the methodological level, treating social and linguistic constraints separately has advantages for the multivariate models employed, because different predictor variables are more readily treated as independent, and interactions or variable nesting patterns raise problems of combinatoric complexity.

The claim has problems on both the empirical and analytical levels, however. On the empirical level, the claims “social and linguistic factors do not interact” and “linguistic factors do not interact” are both independence claims; they are assumed by default as long as there is no evidence to reject them. Accepting them is subject to a type 2 error of unknown probability, is therefore never empirically supported, and is only as reasonable as the supporting arguments. Sigley (2003) evaluated these claims for t/d deletion in a specific corpus of New Zealand English and demonstrated several social-by-linguistic and linguistic-by-linguistic factor interactions. Hence, Sigley argued, it is unsafe to assume noninteraction in general, and for that reason, he recommended a regime of systematic and thorough testing of interaction effects.

On the analytic level, however, there is a deeper problem, which amplifies Sigley's criticism. At some point prior to the multivariate analysis, there is always an identification of the object of study, that being a variable linguistic process. This identification places both linguistic and social bounds on the problem, framing the data collection and subsequent analysis. The effect of this definition is to partition consideration of potentially related phenomena into different analyses. This procedure provides no guarantee that a lack of observed interaction, for example, is anything other than a methodological artifact. Moreover, otherwise observable interactions are potentially obscured by being presented in separate analyses.

For example, consonant cluster reduction in English affects specific segments (t/d) in specific linguistic contexts (e.g., following consonant, vowel, or pause), which were studied by Guy (1980) in New York City and Philadelphia. The two different speech communities (social factors) show a common variable process, except for a difference in the treatment of following pause: in Philadelphia, it favors retention, whereas in New York City, it favors deletion. Had the two analyses been combined into a common multivariate analysis, this result would be represented by a significant social-by-linguistic interaction for following pause within dialect group. In other words, the observation of different factor weights across examples of related phenomena in different speech communities, where the differences are meaningful, corresponds to the identification of significant interactions between social and linguistic factors in a variable process.

A similar criticism applies to the failure to observe interactions among internal constraints. As before, independence is the default assumption, to which the analyst returns, when evidence for interaction has not been found, and claiming noninteraction is subject to an unknown probability of type 2 error. Moreover, linguistic processes are identified in ways that partition them from other, potentially related processes; to the extent that this involves the identification of contrasting effects in distinct but related processes (e.g., lenition affecting one class of consonants, but not another, even in the same environment), this implies the identification of significant nonindependence relations among linguistic predictors.

This criticism is general, applying equally to all linguistic variables and changes in progress. In every case, there is a linguistic variable of interest, which occurs in specific linguistic environments. The process is observed in a specific speech community, or contrasts linguistically across communities or structural segments of the same community. Hence, identification of interesting instances of nonindependence of internal and external factors (or among internal factors) is central to the practice of variationist analysis. Assumed independence of internal and external factors may be methodologically helpful, but only by introducing a methodological artifact, and substantively, the claim is vacuous.

Constant rate effect

A second independence claim, called the constant rate effect (CRE), is made by Kroch (1989 and related work). The CRE basically says that when a linguistic change occurs, it propagates through all relevant linguistic environments at the same rate. Various examples are presented to illustrate this effect; in all cases, the argument is made that the regression slopes of the change over time are the same in all linguistic environments, and that, therefore, the change proceeds at a constant rate, irrespective of the environment in which it is examined. The model used is a logistic regression model such as (1). A recent reinterpretation of the CRE is framed in terms of constant curvature (Kallel, 2007). By using a more sophisticated statistical model, the rate of change over time is allowed to vary, although at any given point in time it is the same in all environments. This reinterpretation is also an independence claim, subject to the same problems as the CRE.

As an example of the CRE, consider the case of do-periphrasis in English, for which Kroch (1989) reanalyzed data originally produced by Ellegård (1953), which is reproduced in Table 1. The data were submitted to a logistic regression analysis, from which no difference was found in the rate of increase of do-periphrasis over time in the five environments. Each environment has a different frequency of periphrasis at any given time, and that effect is constant and entirely independent of the rate at which do-periphrasis is changing. That the CRE is an independence claim is very clearly stated in Kroch (1989).

Thus, if a study reports a series of multivariate analyses for different time periods, and the contextual effects are constant across these analyses, the rate of change of each context measured separately would necessarily be the same. This equivalence holds because, in statistical terms, the constant rate hypothesis is the claim that the overall rate of use of a form is independent of the contextual effects of its use. (Kroch, 1989:204)

Table 1. Frequency of do-periphrasis in English by time period

Source: Reproduced, with permission, from Kroch (1989:Table 3), who used data from Ellegård (1953).

Kroch then refers to the functional form of the logistic regression model (1) in which the independent contextual factors present in a context are represented by distinct weights added together, concluding, “As is clear from the equation, the contextual effects are constant across time and do not interact with the time variable” (Kroch, 1989:204). In other words, the fact that there is no β_context,time x_context x_time interaction term in the model means that change over time is constant in all contextual environments. Note that the appropriate null hypothesis in this test is that the regression slopes for individual contexts are the same: in other words, we assume that the rate of change is constant, and only reject this assumption if we find evidence otherwise. Failure to find evidence to refute this assumption is not unequivocal support of it. Furthermore, by retreating to the null hypothesis, we invite type 2 error of unknown probability. The strength of the CRE, as a substantive hypothesis, therefore, rests solely on its reasonableness within the substantive domain of linguistics, and not on the quantitative observation of independence.

To see more clearly how the argument for the CRE needs to be made, consider Figure 1, in which the frequency of do-periphrasis from Table 1 is represented on the logit scale, and each time period is plotted at the center of its corresponding time interval. This is the scale on which the logistic regression is estimated, and constancy in the rate of change would be observed as parallel regression lines in this scale. Note that the empirical logits of cells in Table 1 with 0% periphrasis are –∞, and are represented here by the relevant lines trailing off the graph.
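The empirical logit used for such a plot can be computed as in the following sketch. The counts are hypothetical and are not the values in Table 1; the function name is ours.

```python
import math

def empirical_logit(k, n):
    """ln[k / (n - k)] for k occurrences in n tokens; infinite for categorical cells."""
    if k == 0:
        return -math.inf
    if k == n:
        return math.inf
    return math.log(k / (n - k))

# Hypothetical (occurrences, tokens) pairs for illustration only; NOT Table 1.
cells = [(0, 40), (3, 60), (25, 50), (90, 100)]
logits = [empirical_logit(k, n) for k, n in cells]
print(logits)
```

A cell with no occurrences has no finite position on the logit scale, which is why the corresponding lines cannot be drawn to that point.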

Figure 1. Frequency of do-periphrasis in English, measured in empirical logits. Source: Data from Ellegård (1953), as reported in Kroch (1989:Table 3).

For each series in Figure 1, a regression slope needs to be estimated, but there is considerable variation in all of them, such that they cross each other in multiple places, and it is not entirely clear where the “true” lines should fall or what their slopes should be. These slopes are likely to be very similar, and given the variation observed and the small Ns for some of the series, it is highly likely that their confidence intervals overlap. Consequently, a significance test is unlikely to establish the need for individual time-context regression slopes. In other words, we accept the null hypothesis that regression on the five individual series has the same slope as regression on the series aggregated together.

The principal cause of the failure to find nonindependence in the data is the variance of the different series, whose wandering easily overwhelms any estimate of difference between their slopes. Historical data are never randomly sampled, and sampling inadequacies (and potentially biases) very probably confound Ellegård's data. Kroch (1989) did not discuss this aspect of the do-periphrasis data, in spite of its relevance to the observed variability and the apparent inability to observe different rates of change in different contexts. This was an oversight, as consideration of sampling deficiencies should be prior to consideration of any substantive hypotheses such as general principles of language change.

As regards independence and the CRE, it is immaterial how many times we fail to make an observation of differential rate of change. Constant rate of change is assumed, rather than established by empirical finding, and claiming independence invites type 2 error of unknown probability. Because different observational inadequacies apply to different studies of the CRE, no general argument can be made regarding the adequacy of sampling, among other things, for all cases in which the CRE is observed. Finally, as with the independence of internal and external factors in language change, the analyst has methodological freedom to operationalize the object of study in ways that partition consideration of related changes into different historical processes. At a very minimum, there is something that changes in the system (e.g., do-periphrasis) and something that does not (parts of the verbal system not involving do). The availability of such analytical options means that the analyst can always posit an analysis consistent with the CRE. Hence, the CRE is an assumption, not an empirical hypothesis, and it cannot be meaningfully tested using empirical means.

Implicational relations

Another type of circumstance in which linguists make independence claims is in places where an implicational relation is asserted. The involvement of statistical independence is less obvious than in other types of claims, because in these cases, the outcomes involved are typically discrete, and the theories that are proposed to account for them often do not admit variable outcomes. Categorical variation of this type does not immediately appear to be consistent with variationist analysis, and analysis is cast using logical relations instead, such as implication. However, there is actually no inconsistency, and different types of circumstances in categorical analyses correspond to different statistical model statements. Some statements using logical relations turn out to be independence claims when examined this way and, hence, are less interesting than they are commonly thought to be.

The basis for the statistical treatment of logical relations explained here comes from works by Sankoff and Rousseau (e.g., Rousseau, 1989; Rousseau & Sankoff, 1978a, 1978b; Sankoff & Rousseau, 1974, 1980, 1981) that treated implicational scales within the variable rule framework, using the logistic regression model as given in (1). Their basic observation can be stated as follows. Categorical variation is a special case of variation in which the β parameters take extreme values, such that the nonlinear logistic transformation results in categorical predictions (probabilities of 1 or 0). For the logistic transform,⁸ values between 7 and 8 on the logit scale converge to 1 on the probability scale (within three decimal places), meaning that for any combination of factors where the βx terms combine to yield a value greater than 7 (or less than –7), more than 1000 observations would likely be needed in that specific context to observe a variable outcome.
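The arithmetic behind this observation can be checked with a brief sketch. The function name is ours, and the quantity 1/(1 − p) is used as a rough guide to how many tokens would be expected per occurrence of the minority variant.

```python
import math

def inv_logit(eta):
    """Logistic transform from the logit scale to the probability scale."""
    return 1.0 / (1.0 + math.exp(-eta))

for eta in (6.0, 7.0, 7.6, 8.0):
    p = inv_logit(eta)
    # Expected number of tokens per occurrence of the minority variant.
    tokens_per_exception = 1.0 / (1.0 - p)
    print(f"logit {eta}: p = {p:.3f}, about {tokens_per_exception:.0f} tokens per minority variant")
```

The point at which the probability first rounds to 1 at three decimal places falls at roughly 7.6 on the logit scale, that is, between 7 and 8 as stated.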

The relation between logical implication and independence can be seen in Figure 2. Each panel represents a two-by-two table of relative frequencies in which cells with a value of 0 are those in which the response is categorically absent, whereas those with a value of 1 have the response categorically present. The rows and columns represent the absence and presence of two contextual predictors, labeled A and B for generality. The four panels have patterns corresponding to the truth tables of four logical connectives: disjunction (A ∨ B), conjunction (A ∧ B), implication (A → B), and exclusive disjunction (A ◊ B). These patterns are presented as parameterizations of the logistic regression model in (1) in Table 2, in which the center four columns give the parameters of each model, and the last four columns give the cell predictions for each, on the logit scale. Cell values exceeding +7 or –7 yield categorical predictions on the proportion scale (1 or 0), when rounded to three decimal places.⁹ Because all the cells do, only the signs (+/–) matter when reading the predictions on the proportion scale, with positive (+) values corresponding to 1s and negative (–) values corresponding to 0s.

Figure 2. Logical relations in two-way tables.

Table 2. Logistic regression models for different logical connectives

Each of the first three rows of Table 2 represents a model with no interaction term, because β_A,B in each is 0; predictors A and B represent independent contextual effects in these models. Only the last model has a non-0 interaction term, so A and B are nonindependent in it. Notably, model (c), which corresponds to logical implication, requires no interaction parameter to state in the logistic regression model.
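The point that the first three connectives need no interaction term, whereas exclusive disjunction does, can be checked numerically. The following Python sketch assumes 0/1 coding of the absence and presence of A and B and uses illustrative parameter values; both the coding and the values are assumptions made for illustration and are not the entries of Table 2.

```python
import math
from itertools import product


def logistic(x):
    """Map a logit-scale value to a probability."""
    return 1.0 / (1.0 + math.exp(-x))


# (alpha, beta_A, beta_B, beta_AB); hypothetical values, not those of Table 2.
MODELS = {
    "disjunction": (-8.0, 16.0, 16.0, 0.0),
    "conjunction": (-24.0, 16.0, 16.0, 0.0),
    "implication": (8.0, -16.0, 16.0, 0.0),
    "exclusive disjunction": (-8.0, 16.0, 16.0, -32.0),
}


def cell_logits(alpha, beta_a, beta_b, beta_ab):
    """Logit-scale prediction for each (A, B) cell under 0/1 coding."""
    return {
        (a, b): alpha + beta_a * a + beta_b * b + beta_ab * a * b
        for a, b in product((0, 1), repeat=2)
    }


def truth_table(params):
    """Cell predictions on the proportion scale, rounded to 3 places."""
    return {cell: round(logistic(v), 3)
            for cell, v in cell_logits(*params).items()}


for name, params in MODELS.items():
    print(name, "interaction =", params[3], truth_table(params))
```

Under this coding the interaction contrast (cell 11 minus cell 10 minus cell 01 plus cell 00, on the logit scale) equals β_A,B. For the exclusive-disjunction pattern the two positive cells are subtracted from the two negative ones, so the contrast is necessarily negative and no purely additive choice of parameters reproduces it.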

Consider a claim to the effect that the existence of a variant (1) in a context with a predictor A implies the existence of a predictor B in the context, and A without B categorically shows the alternative (0). Predictors A and B are general; they could be speakers, social strata, grammatical environments, or any other relevant predictor that is part of a grammatical model. The only important fact about them is their implicational relation. In just such a case, the relation among A, B, and the variant is represented exactly by panel (c) in Figure 2, with the associated model parameters in the third row of Table 2 expressing the same relation. In other words, an implicational relationship between two predictors is expressed as a model in which A and B have (only) independent effects, and therefore, a claim to the effect that A implies B is an independence claim.

As a concrete example of this, consider the comparison of subject selection in Chamorro (Chung, 1998) and Lummi (Jelinek & Demers, 1983) as analyzed by Aissen and Bresnan (2002b). The relevant variation in both cases is active and passive voice; Aissen and Bresnan discussed this in terms of “subject selection” (passive and active select a particular argument as subject of the sentence), governed by constraints framed within optimality theory (OT) (Aissen, 1997; Aissen & Bresnan, 2002a; Bresnan, Dingare, & Manning, 2001). In Chamorro, the grammatical conditions on active and passive voice are animacy and thematic role, whereas in Lummi, they are proximity (grammatical person deixis) and thematic role. The animacy (animate/inanimate) and proximity (first or second versus third person) of both agent and patient thematic roles are independent of each other, allowing four relevant combinations in each language, as indicated in Table 3. Each of these combinations may realize active, passive, or both, in which case there is variation.

Table 3. Subject selection (voice) and animacy or proximity in Chamorro and Lummi

Source: Adapted from Aissen and Bresnan (Reference Aissen and Bresnan2002b).

Table 4. Generating constraint orders and parameter weights under HA: Original preference orders are a > b > …> z and X > Y

Taking passive as the 1 variant and active as 0, and provisionally treating cells in which active and passive occur together as .5 (0 on the logit scale), logistic regression models can be constructed for the observed variation in both languages as indicated in the lower half of Table 3. Note that both languages exhibit independence of agent and patient (with respect to animacy or proximity), as there is no interaction parameter (β_a.agt,b.pat = 0, for any a and b). Furthermore, any intermediate level of variation in the variable cells can be obtained by adjusting the parameters. For example, if the proximal agent, proximal patient cell in the upper left of the Lummi table had 40% active voice and 60% passive, β_prox.agt could be changed to –8.405; with all other parameters the same, the predicted relative frequency would change only in that specific cell. Both the Lummi and Chamorro examples illustrate independence, and the implicational relations observed in Table 3 are independence claims, representing the null hypothesis that β_a.agt,b.pat = 0. As independence claims, they can only be rejected, not verified, by empirical data, and acceptance of the null hypothesis has an unknown probability of type 2 error. The only content present in the implicational relations of Table 3 resides in the relative magnitude of the agent and patient effects on subject selection.
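The size of such an adjustment follows from the logit transform: a 60/40 split corresponds to a shift of about 0.405 away from the logit value 0 that represents an even split, which appears to be the source of the fractional part of the adjusted parameter. The Python sketch below only checks this arithmetic; which parameter absorbs the shift, and with what sign, depends on the coding used in Table 3, which is not reproduced here.

```python
import math


def logit(p):
    """Log-odds of a proportion p."""
    return math.log(p / (1.0 - p))


def logistic(x):
    """Inverse of the logit."""
    return 1.0 / (1.0 + math.exp(-x))


# Shift on the logit scale that moves a cell from 50/50 to 60/40.
shift = logit(0.60)
print("logit(0.60) =", round(shift, 3))
print("proportions at +shift and -shift:",
      round(logistic(shift), 3), round(logistic(-shift), 3))
```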

Harmonic alignment

Aissen and Bresnan (2002b) did not merely cast the observation of passive/active voice in Chamorro and Lummi in terms of implicational relations in two-by-two tables. Rather, they sought to motivate these observations in terms of a principle of OT known as Harmonic Alignment (HA) (Aissen, 2003; Bresnan et al., 2001). But Rousseau and Sankoff's (1978b) treatment of categorical variation also generalizes to larger two-way and multiple-way classifications, including those implied by OT (Paolillo, 2002, 2010). Categorical variation in such systems is therefore analyzable in terms of the logistic regression model in (1), and the same potential for independence claims arises in the general case. This can be seen by closer examination of the principle of HA.

Harmonic Alignment is appealed to in various applications of OT, both in syntax (e.g., Aissen, 2003; Aissen & Bresnan, 2002a) and phonology (e.g., Anttila, 1997). Harmonic Alignment proposes that a pair of preference scales combine more naturally (“harmonically”) with the preferred items on both scales aligned, as worked out by generating “harmony scales” that are subsequently used to generate constraint rankings. This process is defined when one or both of the original preference scales is binary.¹⁰ We begin with two preference scales X > Y and a > b > …> z. These combine to produce two harmony scales: X/a ≻ X/b ≻ … ≻ X/z, having the preferred element of the first scale combined with each of the elements of the second scale in the original order, and Y/z ≻ … ≻ Y/b ≻ Y/a, with the dispreferred element of the first scale and the elements of the second scale in reverse order. These two harmony scales are interpreted as specifying an ordering of (negative) constraints with the order of the rankings of both scales reversed once more: *X/z » … » *X/b » *X/a and *Y/a » *Y/b » … » *Y/z. The central claim of HA is that these orderings of negative constraints, as derived from the original preference rankings, are fixed. The two sets of constraint rankings, however, are not ordered with respect to each other, and they may be interleaved to get different constraint rankings for different analyses. Anttila (1997) combined this with “crucial unranking” to predict variation in the Finnish genitive plural.
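The derivation just described is mechanical and can be sketched in a few lines of Python. The function name `harmonic_alignment` and the three-member scale used below are hypothetical, introduced only for illustration; the sketch follows the steps as stated in the text and is not taken from any published implementation.

```python
def harmonic_alignment(binary_scale, other_scale):
    """Derive harmony scales and fixed constraint rankings from two
    preference scales, each listed from most to least preferred.

    binary_scale: a pair such as ("X", "Y"), meaning X > Y.
    other_scale:  a list such as ["a", "b", "c"], meaning a > b > c.
    """
    preferred, dispreferred = binary_scale

    # Harmony scale for the preferred element: second scale in original order.
    harmony_pref = [f"{preferred}/{item}" for item in other_scale]
    # Harmony scale for the dispreferred element: second scale reversed.
    harmony_dispref = [f"{dispreferred}/{item}"
                       for item in reversed(other_scale)]

    # Negative constraints reverse each harmony scale once more;
    # the first constraint in each list is the highest ranked.
    ranking_pref = [f"*{cell}" for cell in reversed(harmony_pref)]
    ranking_dispref = [f"*{cell}" for cell in reversed(harmony_dispref)]
    return harmony_pref, harmony_dispref, ranking_pref, ranking_dispref


h_x, h_y, r_x, r_y = harmonic_alignment(("X", "Y"), ["a", "b", "c"])
print("harmony scales:     ", h_x, h_y)
print("constraint rankings:", r_x, r_y)
```

The two returned rankings are left unordered with respect to each other, so any interleaving that preserves the order within each list is a candidate constraint ranking.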

To cast these statements in terms of a statistical model, we first recognize that conjoined constraints of this type essentially name each of the cells independently as interaction effects: *X/a corresponds to the conjunction of X and a in a specific (candidate) form. Constraint ranking simply requires the assigned weights to be in a specific order, leaving their magnitudes (the exact β values) unspecified. We can ask, therefore, if a model with independent parameters will fit this specification of cell values. The relevant model is a logistic regression model in which the binomial value X or Y is modeled with respect to the predictor a, b, … , z; the relevant parameters are found by taking the ratio of the relevant β_X/• and β_Y/• values, such as β_a = β_X/a ÷ β_Y/a. Because the orders of the values β_X/a > β_X/b > …> β_X/z and β_Y/a < β_Y/b <…< β_Y/z are fixed (in opposite directions), the corresponding ratios β_a > β_b > …> β_z are also fixed in order, reflecting the original preference order a > b >…> z. In a logistic regression model, one of these usually takes the value 0, and a parameter α_X describes an overall reference value for the rate of X vs. Y preference. Harmonic Alignment, therefore, derives, through a circuitous process, a constraint order equivalent to a logistic regression model with an order of parameter values identical to the original preference orders.
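The order-preservation step can be illustrated with a small numerical check. In the Python sketch below the weights are hypothetical positive numbers chosen only to respect the two fixed orders; treating them as positive multiplicative weights, and taking logarithms to place them on a scale where the reference level is 0, are assumptions made for the illustration.

```python
import math

levels = ["a", "b", "c"]

# Hypothetical positive weights: decreasing for X/., increasing for Y/.
weights_x = {"a": 8.0, "b": 4.0, "c": 2.0}
weights_y = {"a": 1.0, "b": 2.0, "c": 4.0}


def strictly_decreasing(values):
    """True if each value is greater than the one that follows it."""
    return all(left > right for left, right in zip(values, values[1:]))


# Ratio of the X-weight to the Y-weight for each level of the second scale.
ratios = {lvl: weights_x[lvl] / weights_y[lvl] for lvl in levels}

# Log scale, with the last level as the reference category (value 0).
reference = levels[-1]
log_params = {lvl: math.log(ratios[lvl]) - math.log(ratios[reference])
              for lvl in levels}

print("ratios:", ratios)
print("log-scale parameters:", log_params)
print("order a > b > c preserved:",
      strictly_decreasing([log_params[lvl] for lvl in levels]))
```

Any other positive weights respecting the same two orders give ratios in the same order, which is why the ranking alone fixes the order of the parameters but not their magnitudes.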

Hence, Harmonic Alignment adds no information beyond the original preference orders and represents an independence claim, a null hypothesis about the relation between constraint rankings. It is not empirically verifiable, and acceptance of Harmonic Alignment is subject to an unknown probability of type 2 error. As with the CRE and other independence claims, HA imposes no particular requirement on the way that preference orders or the linguistic predictors so ordered are composed. Consequently, if potentially falsifying data were discovered, HA could readily be maintained by redefining the linguistic predictors or constraints.¹¹ Therefore, HA should not be regarded as a statement that makes substantive empirical claims about language.

INDEPENDENCE AND THEORY CONSTRUCTION

We have seen four examples of independence claims from linguistics, noting that they are problematic from an empirical perspective, because they are not testable and may only be assumed true. Other assumptions consistent with the observations are possible and, hence, may turn out to be true instead, leading to an unknown probability of type 2 inference error. The reasonableness of an independence claim for whatever linguistic phenomenon is being considered depends solely on things other than the observations being made.

In some cases, the independence claims are easily refuted. In others, the independence assumptions are buried under analytical assumptions or metalanguage that are not transparently mapped onto statistical concepts. Consequently, independence claims may arise in ways that are difficult to appreciate or to guard against. How then does one address the role of independence in theorizing so that these problems do not occur? The foregoing examples do not exhaust the places in linguistic theorizing where independence claims can arise. More examples can be cited, each of which appears reasonable, and many have stalwart adherents who would need to be convinced of the nonempirical status of their own and others' independence claims. Independence claims are therefore not always obvious when they occur. However, it is possible to identify certain “danger zones” where independence claims are likely to occur.

One such danger zone involves theoretical constructs enumerating combinatorial possibilities for some kind of phenomenon. Harmonic Alignment is one such example, but OT offers yet others. The function GEN, which generates a set of candidates for an input form, is one, and the permutation of partial constraint rankings (as generated under HA) into more fully specified orders or “factorial typologies” is another (see, e.g., Fong & Anttila, 2010). The formal devices used to explore such combinatorial systems (Cartesian products, Galois lattices) may be used to similarly illuminate statistical models,¹² and a careful mapping across these different abstract domains can often profitably illuminate the relationship between a theoretical model and the family of statistical models it corresponds to. Therefore, one can regard entities such as GEN as formal devices that permit one to work out the consequences of an analysis. At the same time, linguists have a tendency to overinterpret them, so that independence becomes a theoretical claim rather than a guide to finding interesting substantive relationships in data. For this reason, combinatoric generators need to be regarded with some care, so that inappropriate independence claims are not derived from them.

A second way in which independence becomes a claim arises when empirical observations are made and a null hypothesis is not refuted in some presumably large number of cases. Examples of this type are the noninteraction of social and linguistic factors, and the Constant Rate Effect. Independence is inherently present in the statistical models used as the null hypothesis regarding the interrelation of multiple factors. Because refuting the null hypothesis requires data, and datasets used in studies of linguistic variation are small and noisy, it can be quite difficult to refute the null hypothesis. Alternatively, redefining the object of inquiry masks observed dependencies. At some point, a researcher may suspect that the null hypothesis will never be refuted, leading to an interpretation of the assumed model of independence as a substantive result of the empirical analysis, when it is in fact not a substantive result and can never be one. Disconfirmation of an independence claim does lead to a Popperian kind of progress in inquiry, but a failure to disconfirm merely leaves one with one's starting assumptions, and given that the reasons for such failure are many (e.g., insufficient data, poor research design), little can really be concluded from it.

How then does one use independence in empirical reasoning? Independence is a central characteristic of statistical models, where it expresses the notion that different factors can have distinct, simultaneous contributions to an observed phenomenon. This notion corresponds to a relatively simple understanding of causality, which is certainly simpler than one in which complex conditions are required to decide which causes are relevant to a phenomenon before making a prediction. The latter possibility corresponds to some form of interaction, whose representation is more complex, and which we want to have persuasive evidence for before positing. In other words, independence represents our starting point when reasoning about multiple potential causes for a phenomenon. As a null hypothesis, it can guide us into making discoveries when evidence permits us to falsify it, but it makes no theoretical contribution of which we should be overly credulous. Therefore, it is an error to incorporate independence claims into the architecture of an explanatory theory or to use them in any way where they are subject to overinterpretation.

The arguments presented here stress that the notion of statistical independence needs to be properly understood with respect to its role in empirical research. Independence claims commonly arise from the combinatorics of our conceptual models and from observations that fail to refute a null hypothesis. Claims to the effect that two phenomena are necessarily independent generally represent a failure to recognize these circumstances where they occur. Independence may have other useful roles in theorizing, such as in the deductive processes that map out the consequences of a formal model, but in and of themselves independence claims make little useful contribution to the empirical side of the research.

Footnotes

1. Another term is dependent variable, contrasted with independent variable, for which we use the term predictor here. Although there is a tradition behind these terms, they confound discussion of statistical independence, so we do not use them here. Other statistical traditions use the term feature.

2. Link functions in GLMs are restricted to members of the exponential family; see McCullagh and Nelder (1989) for details.

3. Other models like those of factor analysis have different functional forms and are not discussed here for reasons of space.

4. In a GLM, two such tests are available: the analysis of variance test, and the Wald test. In analysis of variance, the test statistic is computed by aggregating the deviances of the observed y from the estimated y, and in turn referencing the aggregate to either the F-distribution (for linear models) or the chi-square distribution (for logistic regression and log-linear models). The Wald test is computed from the β values directly and uses the normal or t-distributions.

5. A similar trade-off occurs in decision theory in psychology, when a decision criterion is adjusted (Swets, 1998), as well as in signal detection theory, both of which differ from statistical hypothesis testing in assuming a forced choice between exactly two alternatives whose probability distributions are both known. In statistical hypothesis testing, only the distribution under H₀ is known.

6. In the extreme case, if we set the type 1 error probability to 0, we will never reject H₀ and, therefore, will be forced to make at least some type 2 errors. Other factors go into the choice of p, including the number of tests to be run (see Sigley, 2003).

7. Weinreich, Labov, and Herzog (1968) make numerous statements that contrast with this assumption, although methodological versions of the independence assumption can be found there as well.

8. The functional inverse of the logit is the logistic, defined as logistic(x) = e^x/(1 + e^x).

9. Sankoff and Rousseau's general proof involves a limit as the parameter weights go to infinity; for purposes of illustration, small finite values are used here.

10. Aissen (2003) develops a version of Harmonic Alignment that works with two nonbinary preference scales. The resulting hierarchy is a two-way table (2003:Figure 4) and its analysis is readily seen to be an implicational scale/model of independence of exactly the form described in Rousseau and Sankoff (1978b).

11. Note that the predictors in the Chamorro and Lummi examples are actually conjunctions (e.g., agent and animate, patient and animate), so a process of definition has already occurred, arriving at complex predictors.

12. A statistical model with only independent factors is a member of a join lattice with a single join operator, whereas a statistical model with interactions is a member of a join lattice with two distinct join operators.

References

Aissen, Judith. (1997). On the syntax of obviation. Language 73(4):705–750.

Aissen, Judith. (2003). Differential object marking: Iconicity vs. economy. Natural Language and Linguistic Theory 21:435–483.

Aissen, Judith, & Bresnan, Joan. (2002a). Optimality and functionality: Objections and refutations. Natural Language and Linguistic Theory 20:81–95.

Aissen, Judith, & Bresnan, Joan. (2002b). Harmonic alignment in morphosyntax: Subject selection. Course handout from Special Joint Summer School (Linguistic Society of America/Deutsche Gesellschaft für Sprachwissenschaft), Düsseldorf, Germany. 2002. Available at: http://www.phil-fak.uni-duesseldorf.de/summerschool2002/LNAissen1.pdf.

Anttila, Arto. (1997). Deriving variation from grammar. In Hinskens, F., van Hout, R., & Wetzels, L. (eds.), Variation, change and phonological theory. Amsterdam: Benjamins. 35–68.

Bresnan, Joan, Dingare, Shipra, & Manning, Christopher D. (2001). Soft constraints mirror hard constraints: Voice and person in English and Lummi. In Butt, M. & King, T. H. (eds.), Proceedings of the LFG 01 Conference, 13–32. Stanford: CSLI Publications. http://csli-publications.stanford.edu/LFG/6/lfg01.html

Chung, Sandra. (1998). The design of agreement: Evidence from Chamorro. Chicago: University of Chicago Press.

Ellegård, Alvar. (1953). The Auxiliary do: The establishment and regulation of its use in English. In Behre, F. (ed.), Gothenburg studies in English. Stockholm: Almqvist and Wiksell.

Fasold, Ralph. (1991). The quiet demise of variable rules. American Speech 66(1):3–21.

Fong, Vivian, & Anttila, Arto. (2010). Variation and ambiguity. In Uyechi, L. & Wee, L.-H. (eds.), Reality exploration and discovery: Pattern interaction in language and life. Stanford, CA: CSLI Publications. 345–358.

Guy, Gregory. (1980). Variation in the group and in the individual. In Labov, W. (ed.), Language variation in space and time. New York: Academic Press. 1–36.

Jelinek, Eloise, & Demers, Richard A. (1983). The agent hierarchy and voice in some Coast Salish languages. International Journal of American Linguistics 49:167–185.

Kallel, Amel. (2007). The loss of negative concord in Standard English: Internal factors. Language Variation and Change 19(1):27–49.

Kroch, Anthony. (1989). Reflexes of grammar in patterns of language change. Language Variation and Change 1(3):199–244.

Labov, William. (1994). Principles of linguistic change. Vol. 1. Internal factors. Oxford: Blackwell.

McCullagh, Peter, & Nelder, John A. (1989). Generalized linear models. 2nd ed. Boca Raton, FL: Chapman & Hall/CRC.

Paolillo, John. (2002). Analyzing linguistic variation: Statistical models and methods. Stanford, CA: CSLI Publications.

Paolillo, John. (2010). Optimality theory as a probabilistic model. In Uyechi, L. & Wee, L.-H. (eds.), Reality exploration and discovery: Pattern interaction in language and life. Stanford, CA: CSLI Publications. 105–124.

Prince, Alan, & Smolensky, Paul. (1993). Optimality theory: Constraint interaction in generative grammar. Piscataway, NJ: Rutgers University Center for Cognitive Science.

Rousseau, Pascale. (1989). A versatile program for the analysis of sociolinguistic data. In Fasold, R. & Schiffrin, D. (eds.), Language change and variation. Amsterdam: Benjamins. 395–409.

Rousseau, Pascale, & Sankoff, David. (1978a). Advances in variable rule methodology. In Sankoff, D. (ed.), Linguistic variation: Models and methods. New York: Academic Press. 57–69.

Rousseau, Pascale, & Sankoff, David. (1978b). Singularities in the analysis of binomial data. Biometrika 65(3):603–608.

Sankoff, David, & Rousseau, Pascale. (1974). A method for assessing variable rule and implicational scale analyses of linguistic variation. In Mitchell, J. (ed.), Computers in the humanities. Edinburgh: Edinburgh University Press. 3–15.

Sankoff, David, & Rousseau, Pascale. (1980). Categorical contexts and variable rules. In Jacobson, S. (ed.), Papers from the Symposium on Scandinavian Syntactic Variation. Stockholm: Almqvist and Wiksell International. 7–22.

Sankoff, David, & Rousseau, Pascale. (1981). Echelles et regles. In Sankoff, D. & Cedergren, H. (eds.), Variation omnibus. Carbondale, IL: Linguistic Research, Inc. 257–269.

Sigley, Robert. (2003). The importance of interaction effects. Language Variation and Change 15:227–253.

Swets, John A. (1998). Separating discrimination and decision in detection, recognition and matters of life and death. In Scarborough, D. & Sternberg, S. (eds.), An invitation to cognitive science. Vol. 4. Methods, models and conceptual issues, 2nd ed. Cambridge, MA: MIT Press. 635–702.

Weinreich, Uriel, Labov, William, & Herzog, Martin. (1968). Empirical foundations for a theory of language change. In Lehman, W. & Malkiel, Y. (eds.), Directions for historical linguistics. Austin: University of Texas Press. 95–188.

Table 1. Frequency of do-periphrasis in English by time period

Figure 1. Frequency of do-periphrasis in English, measured in empirical logits. Source: Data from Ellegård (1953), as reported in Kroch (1989:Table 3).

