Friday, October 13, 2017

Power analysis

Power analysis is an important aspect of experimental design. It allows us to determine the sample size required to detect an effect of a given size with a given degree of confidence. Conversely, it allows us to determine the probability of detecting an effect of a given size at a given significance level, under sample size constraints. If the probability is unacceptably low, we would be wise to alter or abandon the experiment.

The following four quantities have an intimate relationship:
  1. sample size
  2. effect size
  3. significance level = P(Type I error) = probability of finding an effect that is not there
  4. power = 1 - P(Type II error) = probability of finding an effect that is there
Given any three, we can determine the fourth.
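
One way to work with this relationship in Python is with the statsmodels power routines. A minimal sketch for a two-sample t-test (the specific numbers are illustrative):

```python
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()

# Effect size, significance level, and power given: solve for sample size.
n = analysis.solve_power(effect_size=0.5, alpha=0.05, power=0.8)
print(f"required sample size per group: {n:.0f}")   # about 64

# Sample size, effect size, and significance level given: solve for power.
power = analysis.solve_power(effect_size=0.5, nobs1=30, alpha=0.05)
print(f"power with 30 subjects per group: {power:.2f}")   # about 0.48
```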


An effect size is a quantitative measure of the strength of a phenomenon. Sample-based effect sizes are distinct from the test statistics used in hypothesis testing: they estimate the magnitude of a relationship, rather than assigning a significance level that reflects whether the observed relationship could be due to chance. The effect size does not directly determine the significance level, or vice versa.

Power analysis can be used to calculate the minimum sample size required so that one can be reasonably likely to detect an effect of a given size. It can also be used to calculate the minimum effect size that is likely to be detected in a study using a given sample size.
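
For the second use, a sketch along the same lines (again with statsmodels; the numbers are illustrative): given a fixed 50 subjects per group, what is the smallest effect size we could detect with 80% power?

```python
from statsmodels.stats.power import TTestIndPower

# Leaving effect_size unset tells solve_power to solve for it.
mde = TTestIndPower().solve_power(nobs1=50, alpha=0.05, power=0.8)
print(f"minimum detectable effect size: {mde:.2f}")   # about 0.57
```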


An effect size (ES) measures the strength of the result and is based solely on magnitude; it does not depend on sample size. So the effect size is pure: it is what was actually found in the study for the sample studied, regardless of the number of subjects. But is what was found generalizable to a population? This is where p-values come into play. A p-value tells you how likely it is that a result at least as large as the one you found would occur by chance alone if there were no real effect; a small p-value therefore suggests the result is unlikely to be due to chance. P-values very much depend on sample size.
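
A quick simulated illustration of this point (a sketch, not a rigorous demonstration): the effect size estimate fluctuates around the same true value at every sample size, while the p-value shrinks steadily as n grows.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

for n in (10, 100, 1000):
    a = rng.normal(0.5, 1.0, n)   # treatment group, true Cohen's d = 0.5
    b = rng.normal(0.0, 1.0, n)   # control group
    pooled_sd = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
    d = (a.mean() - b.mean()) / pooled_sd
    p = stats.ttest_ind(a, b).pvalue
    print(f"n={n:4d}  d={d:5.2f}  p={p:.4f}")
```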

There are three issues with low power.

1. False negatives
2. Inflated effect size estimates
3. Lower positive predictive value

1. False negatives
The most obvious issue with low power is the high likelihood of false negatives, that is, failing to find an effect that is there. By definition, power is 1 - P(Type II error), so low power implies a high probability of Type II errors (false negatives).
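
This is easy to check by simulation; a sketch, with a per-group sample size chosen to give roughly 30% power for a true effect of d = 0.5:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n, true_d, n_sims = 17, 0.5, 10_000   # n = 17/group gives roughly 30% power

hits = sum(
    stats.ttest_ind(rng.normal(true_d, 1.0, n),
                    rng.normal(0.0, 1.0, n)).pvalue < 0.05
    for _ in range(n_sims)
)
# Roughly 30% of studies detect the (real) effect; the other ~70% are
# false negatives.
print(f"significant in {hits / n_sims:.0%} of simulated studies")
```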

2. Inflated effect sizes
Cohen's d is often used as a standardized measure of effect size. It is defined as the difference between two group means divided by a standard deviation of the data (typically the pooled standard deviation): d = (M1 - M2) / s_pooled.
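
In code, using the pooled-standard-deviation convention (one common variant; a minimal sketch):

```python
import numpy as np

def cohens_d(a, b):
    """Difference between two sample means divided by the pooled SD."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * np.var(a, ddof=1) +
                  (nb - 1) * np.var(b, ddof=1)) / (na + nb - 2)
    return (np.mean(a) - np.mean(b)) / np.sqrt(pooled_var)
```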

Effect size estimates computed from samples drawn from a population with a given true effect size will be distributed around that true value. The power of a study does not affect the center of this distribution, but it affects the spread of the distribution and which parts of it reach statistical significance.

The following graph demonstrates the distributions of Cohen's d, based on simulation, when the true effect size is 0.5 and the power is 30% and 90% respectively. Note that the distributions are centered around the true effect size in both cases, but the spreads are different: with high power, the distribution is concentrated more narrowly around the true value. In a sense, with high power, the effect size you get from the sample is a more accurate estimate of the true effect size.
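
A sketch of the kind of simulation that produces such a graph (not the original code; the per-group sample sizes are solved to hit each power level):

```python
import numpy as np
from statsmodels.stats.power import TTestIndPower

rng = np.random.default_rng(2)
true_d, alpha, n_sims = 0.5, 0.05, 10_000

for target_power in (0.3, 0.9):
    # Per-group sample size that yields the target power for true_d.
    n = int(np.ceil(TTestIndPower().solve_power(
        effect_size=true_d, alpha=alpha, power=target_power)))
    ds = []
    for _ in range(n_sims):
        a = rng.normal(true_d, 1.0, n)
        b = rng.normal(0.0, 1.0, n)
        pooled_sd = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
        ds.append((a.mean() - b.mean()) / pooled_sd)
    ds = np.asarray(ds)
    # The mean stays near 0.5 at both powers; the spread shrinks as
    # power grows.
    print(f"power={target_power:.0%}  n/group={n:3d}  "
          f"mean d={ds.mean():.3f}  sd={ds.std():.3f}")
```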



To understand how power influences the region of significance, the shaded area in the following graph shows all the effect sizes that correspond to a statistical test with a p-value less than 0.05. Note that with 30% power, a test is less likely to be significant, and the only values that reach statistical significance are extreme ones. In other words, with low power, you only conclude there is a statistically significant effect when your sample happens to give you an extreme estimate. In this specific case, even a sample that accurately estimates the true effect size (0.5) will not pass the significance test.
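
To see this concretely, here is a sketch computing the smallest |d| that reaches p < 0.05 in each design (per-group sizes of about 17 and 85 correspond to roughly 30% and 90% power for a true d of 0.5):

```python
import numpy as np
from scipy import stats

for n in (17, 85):                      # per-group n for ~30% / ~90% power
    t_crit = stats.t.ppf(0.975, df=2 * n - 2)
    d_crit = t_crit * np.sqrt(2 / n)    # t = d * sqrt(n/2) for equal groups
    print(f"n/group={n:3d}  smallest significant d={d_crit:.2f}")
    # Prints roughly 0.70 at 30% power and 0.30 at 90% power, so an
    # accurate estimate of 0.5 is significant only in the high-power design.
```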



On the other hand, when the power is 90%, you have a much higher chance of getting statistically significant results, and the estimates that pass the test of significance are more likely to be centered around the true effect size.

Suppose we run several studies investigating a specific effect. When the power is low, the reported statistically significant results are likely to overestimate the true effect size. When the power is high, the average estimated effect size across these studies is much closer to the true effect size.

The following graph reports the average reported Cohen's d as a function of statistical power, based on 10,000 simulation runs with a true Cohen's d of 0.5. Note that when the power is high, the average reported effect size is very close to the true effect size. When the power is low, however, we tend to get an inflated estimate of the effect size.
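
A sketch of such a simulation, averaging Cohen's d over the statistically significant runs only (illustrative, not the original code):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
true_d, n_sims = 0.5, 10_000

for n in (17, 85):                      # per-group n for ~30% / ~90% power
    significant_ds = []
    for _ in range(n_sims):
        a = rng.normal(true_d, 1.0, n)
        b = rng.normal(0.0, 1.0, n)
        if stats.ttest_ind(a, b).pvalue < 0.05:
            pooled_sd = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
            significant_ds.append((a.mean() - b.mean()) / pooled_sd)
    # Roughly 0.9 at low power (inflated) vs roughly 0.53 at high power.
    print(f"n/group={n:3d}  mean significant d={np.mean(significant_ds):.2f}")
```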


With low power, we tend to overestimate the effectiveness of our treatments. Because published estimates are inflated, it is also difficult to properly power future studies based on past research.

3. Lower positive predictive value

The positive predictive value (PPV) is the proportion of positive results that are true positives. The PPV describes the performance of a diagnostic test or other statistical measure; a high value indicates that a positive result is likely to be correct.

The PPV is defined as

PPV = (number of true positives) / (number of true positives + number of false positives)

As a function of the significance level (α) and power (1 - β), it can be written as

PPV = (1 - β) × OR / ((1 - β) × OR + α)

Here the odds ratio (OR) represents the odds that the hypothesis being tested is true before the study is run (the pre-study odds).

We often do not know the odds of the hypothesis being true when we run a study, but we can look at what the PPV would be over a range of OR values and a range of power levels. From the following graph we can see that when power is low, it is difficult to draw conclusions even from significant studies. This is likely to lead to wasted resources from following up on false positive findings.
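
A sketch of that calculation, tabulating the PPV formula above over a few odds ratios and power levels (alpha fixed at 0.05; the grid values are illustrative):

```python
alpha = 0.05

for power in (0.1, 0.3, 0.5, 0.8):
    for odds in (0.2, 1.0, 5.0):
        # PPV = (1 - beta) * OR / ((1 - beta) * OR + alpha)
        ppv = power * odds / (power * odds + alpha)
        print(f"power={power:.1f}  OR={odds:3.1f}  PPV={ppv:.2f}")
```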


A study with low statistical power is unlikely to detect a true effect. As we have seen, low power also reduces the likelihood that a statistically significant result reflects a true effect.