
Power and sample size: I. The success of others is a recipe for your failure

Published online by Cambridge University Press:  24 June 2014

Dusan Hadzi-Pavlovic*
Affiliation:
School of Psychiatry, University of New South Wales, Kensington, NSW, Australia Black Dog Institute, Randwick, NSW, Australia
*Corresponding author: Dusan Hadzi-Pavlovic, Black Dog Institute Building, Prince of Wales Hospital, Hospital Road, Randwick, NSW 2031, Australia. Tel: +61 2 9382 3716; Fax: +61 2 9382 3712; E-mail: d.hadzi-pavlovic@unsw.edu.au
Rights & Permissions [Opens in a new window]

Type: Statistically Speaking
Copyright © 2009 John Wiley & Sons A/S

Determining an appropriate sample size has ethical, scientific and economic importance. A study with too few subjects to give a confident finding is of limited scientific value, an inappropriate use of study subjects (particularly if the subjects are exposed to some risk) and a poor use of resources. A study with too many subjects exposes excess subjects to a risk (when present) and wastes resources (including subjects' availability to participate).

Researchers looking to replicate or extend a finding frequently say that ‘a previous study found a significant difference with 10 subjects in each group, so 10 subjects per group will be enough for our study’. Unfortunately, this is often not true. We will look at this issue after we introduce sample size calculation and power analysis.

Power and sample size when comparing two independent groups

In a study with a sample size of n subjects split into two groups to compare two conditions (e.g. low dose vs. high dose) we want to use the observed means and SDs (standard deviations) to decide whether there is a difference between the conditions or not. This decision-making procedure is called statistical hypothesis testing and involves deciding between a null hypothesis (H0) and an alternative hypothesis (H1). Typically, H0 says that there is no difference, whereas H1 says that there is a difference. With two groups we usually use a t-test to make the decision.

A t-test consists of an observed t-statistic (t OBS) calculated from the data and a pre-determined critical t-value (t CRIT). If t OBS ≥ t CRIT or t OBS ≤ −t CRIT we decide in favour of H1 (reject H0); otherwise we decide in favour of H0 (accept H0).
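As an illustration of this decision rule, here is a minimal sketch in Python using SciPy (not part of the original article); the simulated low-dose and high-dose scores are purely hypothetical.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
low_dose = rng.normal(loc=0.0, scale=1.0, size=10)    # hypothetical group 1 scores
high_dose = rng.normal(loc=0.5, scale=1.0, size=10)   # hypothetical group 2 scores

alpha = 0.05
df = len(low_dose) + len(high_dose) - 2
t_crit = stats.t.ppf(1 - alpha / 2, df)                # pre-determined critical value

t_obs, p = stats.ttest_ind(low_dose, high_dose)        # observed t-statistic

if t_obs >= t_crit or t_obs <= -t_crit:
    print(f"t_obs = {t_obs:.2f}: reject H0 (decide there is a difference)")
else:
    print(f"t_obs = {t_obs:.2f}: accept H0 (decide there is no difference)")
```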

When there is no difference (H0 is true), but we decide from the t-test that there is a difference, we make a Type I error. The risk of a Type I error is called α and we usually want this to be low: typically α = 0.05 (or 5%), meaning that we are prepared to falsely reject H0 on up to 5% of occasions. The value of t CRIT is determined by the sample size and the desired α: it is the value which, when used as above, produces false rejections on no more than 5% of occasions.

The top half of Fig. 1 illustrates this procedure for the two-tailed t-test: the curve shows the distribution of t OBS when H0 is true, n = 20 and α = 0.05. When −t CRIT < t OBS < t CRIT then H0 is accepted: correctly so, since H0 is true. On α = 5% of occasions (2.5% of them when t OBS ≤ −t CRIT plus 2.5% when t OBS ≥ t CRIT, hence two tails) we reject H0 and make a Type I error.

Fig. 1 Distributions of t-statistics when the H0 is true and false and their relationship to statistical hypothesis testing and its outcomes.

Just like t CRIT, t OBS has an associated probability, usually reported as the p value or significance level: p is the probability, when H0 is true, of obtaining a value at least as large as +t OBS, plus the corresponding probability for −t OBS. Rejecting H0 because t OBS ≥ t CRIT (or t OBS ≤ −t CRIT) is equivalent to rejecting it because p ≤ α.
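The equivalence between the two ways of rejecting H0 can be checked directly; a small sketch, again assuming SciPy, with a few illustrative t values:

```python
from scipy import stats

alpha, df = 0.05, 18
t_crit = stats.t.ppf(1 - alpha / 2, df)

for t_obs in (1.5, 2.1009, 3.0):                   # illustrative t values
    p = 2 * stats.t.sf(abs(t_obs), df)             # two-tailed p value
    same_decision = (p <= alpha) == (abs(t_obs) >= t_crit)
    print(f"t_obs = {t_obs}, p = {p:.4f}, decisions agree: {same_decision}")
```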

Power: finding differences that are there

When there is a difference (H0 is false) but we decide there is no difference, we make a Type II error. The risk of a Type II error is called β; however, we tend to think in terms of its converse (1 − β), which is called power and is the probability of rejecting H0 when it is false. Power is typically set to at least 80%, meaning that we want to be able to find a true difference on at least 80% of occasions.

The bottom half of Fig. 1 shows what happens when H0 is false. This can happen in an infinite number of ways, and Fig. 1 shows the specific case where the difference is equivalent to an effect size (ES) of 0.5 (see Reference Hadzi-Pavlovic (1) for ESs). When t OBS lies between −t CRIT and t CRIT we incorrectly accept H0 (Type II error), and this occurs on a proportion β of occasions. When t OBS ≥ t CRIT (or, very rarely in this case, t OBS ≤ −t CRIT), we correctly reject H0; this occurs on a proportion 1 − β of occasions (the power). In the example, the Type II error rate is 66.2%, the percentage of t OBS values which will fall between −t CRIT and t CRIT, so the power is 33.8%, meaning that only on one-third of occasions would we correctly conclude that a difference of this size exists.
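These percentages can be reproduced from the noncentral t distribution. The following is a minimal sketch assuming SciPy, and assuming that the n = 20 in Fig. 1 refers to the number of subjects per group, which is the reading that reproduces the quoted 66.2% and 33.8%:

```python
import numpy as np
from scipy import stats

n_per_group, es, alpha = 20, 0.5, 0.05
df = 2 * n_per_group - 2
t_crit = stats.t.ppf(1 - alpha / 2, df)

# When H0 is false, t OBS follows a noncentral t with this noncentrality
delta = es * np.sqrt(n_per_group / 2)

# beta = probability that t OBS falls between -t_crit and +t_crit
beta = stats.nct.cdf(t_crit, df, delta) - stats.nct.cdf(-t_crit, df, delta)
power = 1 - beta
print(f"beta = {beta:.3f}, power = {power:.3f}")   # approximately 0.662 and 0.338
```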

A study does not have one single power; rather, it has a range of powers corresponding to the range of possible H1. Put another way, it has a different power for each different ES: greater power for a greater difference and vice versa. If we plotted the lower half of Fig. 1 for an ES > 0.5, the curve would be shifted to the right and the probability of t OBS ≥ t CRIT would be higher, hence power would be greater. The converse would hold if the ES were < 0.5. In study planning, power is calculated for a specific ES, but this is not too much of a limitation since the power for any larger ES will be greater.

How does changing n affect Fig. 1? As n increases, both curves stay where they are but become narrower, with the result that t CRIT moves to the left (keeping α constant), so the proportion of the lower curve lying below t CRIT (β) decreases and power increases.

In sample size estimation or power analysis we have four values: ES, n, α and power (1 −β). If we decide on any three of these values we can determine the fourth. In sample size estimation we start with an ES, α and β, and then find the n which gives us the power (1 −β) to detect the ES while keeping our Type I error rate at α.

In power analysis we start from an ES, α and n, and then find the corresponding power, telling us what probability we have of detecting that ES, given our n and the t CRIT necessary to keep our Type I error rate at α.

However, the relationship between these values is such that improving one of them worsens the others. For example, if we choose to reduce n, then we lose power unless we also increase α or are prepared to assume that the ES is larger.
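These trade-offs are easy to explore with standard software. A brief sketch, assuming the statsmodels package (not mentioned in the original article), solving for whichever of the four quantities is left unspecified:

```python
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()

# Sample size estimation: fix ES, alpha and power, solve for n per group
n = analysis.solve_power(effect_size=0.5, alpha=0.05, power=0.80)
print(f"n per group for 80% power at ES = 0.5: {n:.1f}")    # about 64

# Power analysis: fix ES, alpha and n, solve for power
power = analysis.solve_power(effect_size=0.5, alpha=0.05, nobs1=20)
print(f"power with 20 per group at ES = 0.5: {power:.3f}")  # about 0.34
```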

Sample size from another published study

Consider a reported study with 10 subjects per group (n = 20) where the two-tailed significance level is given as p = 0.05. This means that t OBS = t CRIT, and from the observed t-statistic (which must have been 2.1009) we can work out the ES. Given the ES, we can work out that the power for an exact replication would be 51%.

In other words, if someone reports a result where t OBS was just at the point at which one decides to reject H0, then replicating their study with the same n and α, while expecting the same ES, results in a study with a power of little more than 50%: only a one-in-two chance of replicating their finding!
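A short sketch (again assuming SciPy) of this replication calculation, working back from a just-significant p = 0.05 with 10 subjects per group:

```python
import numpy as np
from scipy import stats

n_per_group, alpha = 10, 0.05
df = 2 * n_per_group - 2
t_crit = stats.t.ppf(1 - alpha / 2, df)              # 2.1009

# The reported p equals alpha, so the reported t OBS must equal t CRIT,
# and the implied ES follows from t = ES * sqrt(n/2)
es = t_crit * np.sqrt(2 / n_per_group)               # about 0.94

# Power of an exact replication: same n, same alpha, assuming this ES
delta = es * np.sqrt(n_per_group / 2)                # equals t_crit here
power = (1 - stats.nct.cdf(t_crit, df, delta)) + stats.nct.cdf(-t_crit, df, delta)
print(f"implied ES = {es:.2f}, replication power = {power:.2f}")  # about 0.51
```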

Instead of the single p above, imagine a large number of reported p values ranging from 0.10 to 0.001, again with n = 20. For each such study we can calculate the ES and then the power if one attempted to replicate it with n = 20 and α = 0.05. In Fig. 2 we have plotted this power against the reported p value, whereas in Fig. 3 we have plotted the n required to achieve a power of 80%. Unless the reported p is less than 0.01 we will need more subjects than the 20 used in the original study. At worst, for a just-significant finding, we will need twice as many subjects. (If the required sample size is less than that reported because the implied ES is very large, we should ask how plausible that ES is.) The two figures are, of course, reflections of each other, emphasising the connections between the different values.
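The calculations behind the two figures can be sketched along the following lines (assuming SciPy and statsmodels; the specific p values shown are illustrative, not the full range plotted in the paper):

```python
import numpy as np
from scipy import stats
from statsmodels.stats.power import TTestIndPower

n_per_group, alpha = 10, 0.05
df = 2 * n_per_group - 2
analysis = TTestIndPower()

for p_reported in (0.10, 0.05, 0.01, 0.001):
    # ES implied by the reported two-tailed p with 10 subjects per group
    t_obs = stats.t.ppf(1 - p_reported / 2, df)
    es = t_obs * np.sqrt(2 / n_per_group)

    # Fig. 2: power of a replication with the same n and alpha
    power = analysis.solve_power(effect_size=es, alpha=alpha, nobs1=n_per_group)

    # Fig. 3: n per group needed for 80% power at this ES
    n_needed = analysis.solve_power(effect_size=es, alpha=alpha, power=0.80)

    print(f"p = {p_reported}: implied ES = {es:.2f}, "
          f"replication power = {power:.2f}, n per group for 80% = {n_needed:.0f}")
```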

Fig. 2 Power of a study replicating a reported study when using the original sample size and assuming the ES implied by the reported p value.

Fig. 3 Sample size required for power = 80% in a study replicating a reported study when using the original sample size and assuming the ES implied by the reported p value.

References

Hadzi-Pavlovic D. Effect sizes I: differences between means. Acta Neuropsychiatrica 2007;19:318–320.