Friday, October 13, 2017

Power analysis

Power analysis is an important aspect of experimental design. It allows us to determine the sample size required to detect an effect of a given size with a given degree of confidence. Conversely, it allows us to determine the probability of detecting an effect of a given size with a given level of confidence, under sample size constraints. If the probability is unacceptably low, we would be wise to alter or abandon the experiment.

The following four quantities have an intimate relationship:
  1. sample size
  2. effect size
  3. significance level = P(Type I error) = probability of finding an effect that is not there
  4. power = 1 - P(Type II error) = probability of finding an effect that is there
Given any three, we can determine the fourth.


An effect size is a quantitative measure of the strength of a phenomenon. Sample-based effect sizes are distinguished from the test statistics used in hypothesis testing in that they estimate the strength (magnitude) of a relationship rather than assigning a significance level that reflects whether the observed relationship could be due to chance. The effect size does not directly determine the significance level, or vice versa.

Power analysis can be used to calculate the minimum sample size required so that one can be reasonably likely to detect an effect of a given size. It can also be used to calculate the minimum effect size that is likely to be detected in a study using a given sample size.
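For instance, with a two-sample t-test, a standard power calculator can solve for whichever of the four quantities is left unspecified. The sketch below assumes the statsmodels package; the specific numbers (d = 0.5, alpha = 0.05, 80% power, 30 subjects per group) are just illustrative.

    # Power analysis for a two-sample t-test: leave exactly one of the four
    # quantities unspecified and solve_power computes it.
    # (Sketch assuming statsmodels; the example numbers are arbitrary.)
    from statsmodels.stats.power import TTestIndPower

    analysis = TTestIndPower()

    # Required sample size per group to detect d = 0.5 at alpha = 0.05 with 80% power.
    n_per_group = analysis.solve_power(effect_size=0.5, alpha=0.05, power=0.8)
    print(f"required n per group: {n_per_group:.1f}")   # roughly 64

    # Minimum detectable effect size with 30 subjects per group at the same alpha and power.
    min_effect = analysis.solve_power(nobs1=30, alpha=0.05, power=0.8)
    print(f"minimum detectable d: {min_effect:.2f}")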


An effect size (ES) measures the strength of the result and is based solely on magnitude; it does not depend on sample size. In that sense the effect size is "pure": it is what was actually found in the study for the sample studied, regardless of the number of subjects. But is what was found generalizable to a population? This is where p-values come into play. A p-value gives the probability of observing a result at least as extreme as the one found if there were in fact no effect, and p-values very much depend on sample size.
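To make this concrete, here is a small illustrative simulation (the group means, standard deviation, and seed are arbitrary choices of mine): the same true mean difference is tested with a small and a large sample, and the p-value, unlike the underlying effect, is strongly driven by sample size.

    # Same true effect, very different p-values depending on sample size.
    # (Illustrative sketch; the group means, SD, and seed are arbitrary.)
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)

    for n in (10, 200):
        treated = rng.normal(0.5, 1.0, n)   # true difference of 0.5 SD between groups
        control = rng.normal(0.0, 1.0, n)
        diff = treated.mean() - control.mean()
        t, p = stats.ttest_ind(treated, control)
        print(f"n={n:>3}  observed difference={diff:.2f}  p={p:.4f}")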

There are three issues with low power:

1. False negatives
2. Inflated effect size estimates
3. Lower positive predictive value

1. False negatives
The most obvious issue with low power is the high likelihood of false negatives, that is, failing to find an effect that is there. By definition, power is 1 - P(Type II error), so lower power means a higher probability of Type II errors (false negatives).
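A quick simulation makes the point; in the sketch below (my own illustrative setup: true d = 0.5, 20 subjects per group, alpha = 0.05), the study is underpowered and misses the real effect most of the time.

    # Estimate the false-negative rate of an underpowered two-sample t-test by simulation.
    # (Sketch; true d = 0.5, n = 20 per group, and alpha = 0.05 are arbitrary choices.)
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(1)
    n, true_d, alpha, n_sims = 20, 0.5, 0.05, 5000

    misses = 0
    for _ in range(n_sims):
        treated = rng.normal(true_d, 1.0, n)
        control = rng.normal(0.0, 1.0, n)
        _, p = stats.ttest_ind(treated, control)
        misses += p >= alpha          # failed to detect the real effect

    print(f"estimated power: {1 - misses / n_sims:.2f}")
    print(f"false-negative rate: {misses / n_sims:.2f}")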

2. Inflated effect sizes
Cohen's d is often used as a standardized measure of the effect size. It is defined as the difference between the two group means divided by the pooled standard deviation of the data.
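Computed from two samples, this is straightforward; the helper below is a minimal sketch (the function name is my own, and it uses the standard pooled-variance formula).

    # Cohen's d: difference between the group means divided by the pooled standard deviation.
    import numpy as np

    def cohens_d(a, b):
        a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
        pooled_var = ((len(a) - 1) * a.var(ddof=1) + (len(b) - 1) * b.var(ddof=1)) / (len(a) + len(b) - 2)
        return (a.mean() - b.mean()) / np.sqrt(pooled_var)

    print(cohens_d([5.1, 6.2, 5.8, 6.5], [4.0, 4.8, 5.2, 4.4]))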

Effect size estimates computed from samples drawn from a population with a given true effect size will be distributed around that true value. The power of a study does not affect the mean of this distribution, but it does affect its spread and which parts of it reach statistical significance.

The following graph demonstrates the simulated distributions of Cohen's d when the true effect size is 0.5 and the power is 30% and 90%, respectively. Note that the distributions are centered around the true effect size in both cases, but the spreads are different: with high power, the distribution is more narrowly centered around the true value. In a sense, with high power, the effect size you get from the sample is a more accurate estimate of the true effect size.
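A simulation along these lines is sketched below; the per-group sample sizes (about 17 and 86) are my approximations for 30% and 90% power at a true d of 0.5, and a simple equal-weight pooled SD is used for brevity.

    # Spread of estimated Cohen's d under low vs. high power.
    # (Sketch; n of ~17 and ~86 per group give roughly 30% and 90% power for true d = 0.5.)
    import numpy as np

    rng = np.random.default_rng(2)
    true_d, n_sims = 0.5, 10_000

    def cohens_d(a, b):
        pooled_sd = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
        return (a.mean() - b.mean()) / pooled_sd

    for n, label in ((17, "~30% power"), (86, "~90% power")):
        ds = np.array([cohens_d(rng.normal(true_d, 1.0, n), rng.normal(0.0, 1.0, n))
                       for _ in range(n_sims)])
        print(f"{label}: mean d = {ds.mean():.2f}, SD of d = {ds.std():.2f}")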



To understand how power influences the region of significance, the shaded area in the following graph shows all the effect size estimates that correspond to a statistical test with a p-value less than 0.05. Note that with 30% power, it is less likely for a test to be significant, and only extreme estimates reach statistical significance. In other words, with low power, you only conclude there is a statistically significant effect when your sample happens to give you an extreme estimate. In this specific case, even an accurate estimate of the true effect size (0.5) will not pass the significance test.



On the other hand, when the power is 90%, you have a much higher chance of getting statistically significant results, and the estimates that pass the test of significance are more likely to be centered around the true effect size.

Suppose we run several studies that investigate a specific effect. When the power is low, the statistically significant results that get reported are likely to overestimate the true effect size. When the power is high, the average estimated effect size across these studies is much closer to the true effect size.

The following graph reports the average reported Cohen's d as a function of statistical power, based on 10,000 simulation runs with a true Cohen's d of 0.5. Note that when the power is high, the average reported effect size is very close to the true effect size. When the power is low, however, we tend to get an inflated estimate of the effect size.
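A simulation in the same spirit is sketched below (my own setup: a true d of 0.5, alpha = 0.05, and per-group sample sizes of roughly 17, 31, and 86, which correspond to about 30%, 50%, and 90% power); it averages only the estimates that reach significance.

    # Average effect size among statistically significant results, at several power levels.
    # (Sketch; true d = 0.5, alpha = 0.05; n per group chosen to give ~30%, ~50%, ~90% power.)
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(3)
    true_d, alpha, n_sims = 0.5, 0.05, 10_000

    def cohens_d(a, b):
        pooled_sd = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
        return (a.mean() - b.mean()) / pooled_sd

    for n, label in ((17, "~30% power"), (31, "~50% power"), (86, "~90% power")):
        significant = []
        for _ in range(n_sims):
            treated = rng.normal(true_d, 1.0, n)
            control = rng.normal(0.0, 1.0, n)
            _, p = stats.ttest_ind(treated, control)
            if p < alpha:
                significant.append(cohens_d(treated, control))
        print(f"{label}: mean significant d = {np.mean(significant):.2f}")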


With low power, we tend to overestimate the effectiveness of our treatments, which also makes it difficult to properly power future studies based on past research.

3. Lower positive predictive value

The positive predictive value (PPV) is the proportion of positive results that are true positives. It describes the performance of a diagnostic test or other statistical procedure: a high PPV means that a positive result is likely to reflect a real effect.

The PPV is defined as

PPV = (number of true positives) / (number of true positives + number of false positives).

As a function of the significance level (α) and power (1 − β),

PPV = [(1 − β) × OR] / [(1 − β) × OR + α],

where the odds ratio (OR) represents the pre-study odds that the hypothesis being tested is true.

We often do not know the odds that the hypothesis is true when we run a study, but we can look at what the PPV would be for a range of OR values and a range of power levels. From the following graph we can see that with low power, it is difficult to draw conclusions even from significant studies. This is likely to lead to wasted resources from following up on false-positive studies.
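The formula is easy to tabulate directly; the short sketch below evaluates the PPV over a few illustrative pre-study odds and power levels (the grid values are arbitrary).

    # PPV = (1 - beta) * OR / ((1 - beta) * OR + alpha), evaluated over a small grid.
    # (Sketch; the chosen OR values and power levels are arbitrary illustrations.)
    def ppv(power, odds, alpha=0.05):
        return power * odds / (power * odds + alpha)

    for odds in (0.2, 1.0, 5.0):        # pre-study odds that the hypothesis is true
        cells = [f"power={p:.0%}: PPV={ppv(p, odds):.2f}" for p in (0.2, 0.5, 0.9)]
        print(f"OR={odds:<4}", "  ".join(cells))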


A study with low statistical power is unlikely to detect a true effect, and low power also reduces the likelihood that a statistically significant result reflects a true effect.



Friday, September 22, 2017

Pearson Correlation vs. Spearman Correlation

A correlation coefficient measures the extent to which two variables tend to change together. Two correlation coefficients are commonly used in statistical analysis: the Pearson correlation and the Spearman correlation.

The Pearson correlation coefficient, also referred to as Pearson's r, the Pearson product-moment correlation coefficient (PPMCC), or the bivariate correlation, is a measure of the linear correlation between two variables X and Y.

The Spearman's rank correlation coefficient is a nonparametric measure of rank correlation (statistical dependence between the ranking of two variables). It assesses how well the relationship between two variables can be described using a monotonic function.

The Pearson correlation measures linear relationships, while the Spearman correlation measures monotonic relationships. Compared to the Pearson correlation, the Spearman correlation ignores the actual values of the data and uses only the rank order of the values. The Spearman correlation between two variables is therefore equal to the Pearson correlation between the rank values of those two variables.
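This equivalence is easy to verify numerically; the sketch below (the data are arbitrary, monotonic but nonlinear) compares the Spearman correlation of the raw values with the Pearson correlation of their ranks.

    # Spearman correlation of (x, y) equals the Pearson correlation of their ranks.
    import numpy as np
    from scipy import stats

    x = np.array([1.0, 2.0, 3.0, 5.0, 8.0, 13.0])
    y = np.array([2.1, 3.9, 9.2, 24.0, 67.0, 190.0])    # monotonic but nonlinear in x

    spearman = stats.spearmanr(x, y)[0]
    pearson_of_ranks = stats.pearsonr(stats.rankdata(x), stats.rankdata(y))[0]
    print(spearman, pearson_of_ranks)                   # the two values agree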

Let's look at some examples to understand the relationship between the Pearson correlation and the Spearman correlation.

When a linear relationship exists between two variables, the absolute values of both correlation coefficients are equal to 1.
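For example, with an exactly linear relationship (a trivial sketch with made-up numbers):

    # A perfectly linear relationship: Pearson and Spearman are both 1.
    import numpy as np
    from scipy import stats

    x = np.arange(1, 11, dtype=float)
    y = 3.0 * x + 2.0                     # exact linear relationship

    print(stats.pearsonr(x, y)[0])        # 1.0
    print(stats.spearmanr(x, y)[0])       # 1.0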





When a monotonic but nonlinear relationship exists between two variables, the absolute value of the Spearman correlation is equal to 1 while the absolute value of the Pearson correlation is less than 1.
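A cubic relationship, for instance, is strictly increasing but not linear (another small sketch):

    # A monotonic but nonlinear relationship: Spearman stays at 1, Pearson drops below 1.
    import numpy as np
    from scipy import stats

    x = np.arange(1, 11, dtype=float)
    y = x ** 3                            # strictly increasing, but not linear in x

    print(stats.pearsonr(x, y)[0])        # less than 1
    print(stats.spearmanr(x, y)[0])       # exactly 1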












When the relationship between the two variables is not monotonic, the values of the two correlations are close to 0.
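A parabola centered on the range of x is a simple example (sketch):

    # A non-monotonic relationship: both correlations come out close to zero.
    import numpy as np
    from scipy import stats

    x = np.arange(-5, 6, dtype=float)
    y = x ** 2                            # strong relationship, but not monotonic

    print(stats.pearsonr(x, y)[0])        # approximately 0
    print(stats.spearmanr(x, y)[0])       # approximately 0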



Note that the above figures also report the corresponding two-sided p-values of the correlations, which measure the probability that uncorrelated data would produce a correlation at least as extreme as the one observed. With a small p-value, we are more confident that the calculated correlation reflects a real relationship between the two variables. Note that the p-values of the Pearson correlation are only valid when the variables are normally distributed or the dataset is large enough for the central limit theorem to apply. The p-values of the Spearman correlation do not suffer from this issue because, as a nonparametric test, the Spearman correlation does not make assumptions about the distributions of the variables.

From the above examples, it seems that the (absolute) value of the Spearman correlation is always greater than or equal to that of the Pearson correlation. This seems intuitive: the Spearman correlation captures any monotonic relationship while the Pearson correlation only captures linear ones, so for monotonic relationships the Spearman correlation matches the Pearson correlation when the relationship is linear and exceeds it when the relationship is nonlinear but monotonic. However, this conjecture is not true when the relationship between the two variables is not strictly monotonic, as shown by some examples from Anscombe's quartet. In the second and the fourth examples, the Spearman correlation is less than the Pearson correlation.
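This can be checked directly; the sketch below assumes seaborn's bundled copy of Anscombe's quartet (load_dataset may download it on first use) and scipy.

    # Pearson vs. Spearman on each dataset of Anscombe's quartet.
    import seaborn as sns
    from scipy import stats

    anscombe = sns.load_dataset("anscombe")            # columns: dataset, x, y
    for name, group in anscombe.groupby("dataset"):
        pearson = stats.pearsonr(group["x"], group["y"])[0]
        spearman = stats.spearmanr(group["x"], group["y"])[0]
        print(f"dataset {name}: Pearson = {pearson:.3f}, Spearman = {spearman:.3f}")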