The principles of experimental design — randomization of subjects into two or more groups receiving different treatments — allow us to draw valid conclusions about how well the treatments work.
Part 1: Hypothesis Testing
What is hypothesis testing?
Hypothesis tests are also called significance tests. Their purpose is to help us understand whether random chance might be responsible for an observed effect.
Hypothesis testing: A/B test
The most common hypothesis test is the A/B test:
- An A/B test is an experiment with two groups that receive different treatments; after the experiment, we compare the groups to see which treatment performs better.
- Control group: the group of subjects exposed to no treatment (or to the standard, default treatment).
The key features of an experiment are:
- comparison between groups
- control of other variables
By doing both, we can perform hypothesis testing. Depending on how much control we have over the other variables, we can classify studies into three types:
- If you have a lot of control over the variables, you have an experiment.
- If you have no control over the variables, you have an observational study.
- If you have some control, you have a quasi-experiment.
Null hypothesis
The null hypothesis is a central concept in hypothesis testing. It is the baseline assumption that all the treatments are equivalent and that any difference between the groups is due to chance. Our hope is to prove the null hypothesis wrong and show that the outcomes for groups A and B differ more than chance alone would produce.
The alternative hypothesis is the counterpart to the null hypothesis: together they must cover all possibilities. For example, null = "A ≤ B"; alternative = "A > B".
Hypothesis test types: one-way vs two-way
There are two types of hypothesis tests:
- one-way hypothesis test: in an A/B test, you test a new option (B) against a default option (A), and you will only switch to B if it proves significantly better than A. Because only one direction of the difference matters, this is a one-way (one-tailed) test.
- two-way hypothesis test: the alternative hypothesis is bidirectional. For example, the alternative "A is different from B" means A can be either bigger or smaller than B, so extreme results in either direction count against the null hypothesis (see the sketch below).
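As a minimal sketch of the difference (the array perm_diffs and the observed difference obs_diff below are hypothetical stand-ins for results of a permutation procedure like the one in Part 2), the two variants differ only in which tail(s) of the distribution are counted:
import numpy as np

# Hypothetical permutation results; in practice these come from a
# permutation procedure like the one implemented in Part 2.
rng = np.random.default_rng(0)
perm_diffs = rng.normal(loc=0, scale=10, size=1000)
obs_diff = 15.0

# One-way: only differences in the direction "B better than A" count.
p_one_way = np.mean(perm_diffs >= obs_diff)
# Two-way: extreme differences in either direction count.
p_two_way = np.mean(np.abs(perm_diffs) >= abs(obs_diff))
print(p_one_way, p_two_way)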
ANOVA
The statistical procedure that tests for a statistically significant difference among the groups is called analysis of variance, or ANOVA.
Part 2: Hypothesis Testing Implementation
Random permutation test
In a hypothesis test, we often use a permutation procedure to test the hypothesis, which works as follows:
a) combine the results from the different (original) groups into a single data set
b) shuffle the combined data, then randomly draw (without replacement) a resample of the same size as group A and assign the remaining values to group B
c) calculate the same metric for each resampled group that you calculated for the original groups
d) repeat steps b) and c) R times to yield a permutation distribution of the metric
Now go back to the observed difference between groups and compare it to the set of permuted differences. If the observed difference lies well within the set of permuted differences, then we have not proven anything — the observed difference is within the range of what chance might produce. However, if the observed difference lies outside most of the permutation distribution, then we conclude that chance is not responsible. In technical terms, the difference is statistically significant.
Here is an example to illustrate the permutation test:
import random
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

session_times = pd.read_csv(WEB_PAGE_DATA_CSV)
session_times.Time = 100 * session_times.Time

# Boxplot of session time by page
ax = session_times.boxplot(by='Page', column='Time', figsize=(4, 4))
ax.set_xlabel('')
ax.set_ylabel('Time (in seconds)')
plt.suptitle('')
plt.tight_layout()
plt.show()

# Observed difference in mean session time between the two pages
mean_a = session_times[session_times.Page == 'Page A'].Time.mean()
mean_b = session_times[session_times.Page == 'Page B'].Time.mean()
print(mean_b - mean_a)

# Randomly reassign the pooled values to groups of sizes nA and nB
def perm_fun(x, nA, nB):
    n = nA + nB
    idx_B = set(random.sample(range(n), nB))
    idx_A = set(range(n)) - idx_B
    return x.loc[list(idx_B)].mean() - x.loc[list(idx_A)].mean()

nA = session_times[session_times.Page == 'Page A'].shape[0]
nB = session_times[session_times.Page == 'Page B'].shape[0]
random.seed(1)
perm_diffs = [perm_fun(session_times.Time, nA, nB) for _ in range(1000)]

# Permutation distribution with the observed difference marked
fig, ax = plt.subplots(figsize=(5, 5))
ax.hist(perm_diffs, bins=11, rwidth=0.9)
ax.axvline(x=mean_b - mean_a, color='black', lw=2)
ax.text(50, 190, 'Observed\ndifference', bbox={'facecolor': 'white'})
ax.set_xlabel('Session time differences (in seconds)')
ax.set_ylabel('Frequency')
plt.tight_layout()
plt.show()
# Fraction of permuted differences that exceed the observed difference
print(np.mean(np.array(perm_diffs) > mean_b - mean_a))
In the example above, the printed proportion is about 0.12: roughly 12% of the permuted differences are at least as large as the observed difference. This suggests that the observed difference is within the range of chance variation and thus is not statistically significant. Comparing the observed value of the statistic to the resampled distribution allows you to judge whether an observed difference between samples might occur by chance.
An ANOVA test can be run with the same permutation idea, using the variance among the group means as the test statistic:
four_sessions = pd.read_csv(FOUR_SESSIONS_CSV)

# Boxplot of session time for each of the four pages
ax = four_sessions.boxplot(by='Page', column='Time', figsize=(4, 4))
ax.set_xlabel('Page')
ax.set_ylabel('Time (in seconds)')
plt.suptitle('')
plt.title('')
plt.tight_layout()
plt.show()

# Observed test statistic: variance of the four group means
observed_variance = four_sessions.groupby('Page').mean().var().iloc[0]
print('Observed means:', four_sessions.groupby('Page').mean().values.ravel())
print('Variance:', observed_variance)

# One permutation: shuffle the Time values across pages and recompute
# the variance of the group means
def perm_test(df):
    df = df.copy()
    df['Time'] = np.random.permutation(df['Time'].values)
    return df.groupby('Page').mean().var().iloc[0]

np.random.seed(1)
perm_variance = [perm_test(four_sessions) for _ in range(3000)]
print('Pr(Prob)', np.mean([var > observed_variance for var in perm_variance]))

# Permutation distribution with the observed variance marked
fig, ax = plt.subplots(figsize=(5, 5))
ax.hist(perm_variance, bins=11, rwidth=0.9)
ax.axvline(x=observed_variance, color='black', lw=2)
ax.text(60, 200, 'Observed\nvariance', bbox={'facecolor': 'white'})
ax.set_xlabel('Variance')
ax.set_ylabel('Frequency')
plt.tight_layout()
plt.show()
The p-value, given by Pr(Prob), is 0.09278. In other words, given the same underlying stickiness, 9.3% of the time the response rate among the four pages might differ as much as was actually observed, just by chance. Because this p-value is above the traditional 5% significance threshold, we conclude that the difference among the four pages could well have arisen by chance.
The above permutation test uses purely random shuffling, so it is called a random permutation test. There are two variants:
- an exhaustive permutation test: instead of random shuffles, all possible ways of dividing the data are enumerated; this is only feasible when the sample size is small (see the sketch after this list).
- a bootstrap permutation test: the draws in step b) are made with replacement instead of without replacement.
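As a minimal sketch of the exhaustive variant (the data values a and b below are made up for illustration), every possible assignment of the pooled values to the two groups is enumerated with itertools.combinations:
import itertools
import numpy as np

# Hypothetical small samples; exhaustive enumeration is only feasible
# when the number of possible assignments is small.
a = [23, 34, 21, 30]
b = [45, 31, 50]
combined = np.array(a + b)
obs_diff = np.mean(b) - np.mean(a)

diffs = []
# Every possible way to pick len(b) of the pooled values as "group B"
for idx_b in itertools.combinations(range(len(combined)), len(b)):
    idx_a = [i for i in range(len(combined)) if i not in idx_b]
    diffs.append(combined[list(idx_b)].mean() - combined[idx_a].mean())

# Two-way p-value: fraction of assignments at least as extreme as observed
print(np.mean(np.abs(diffs) >= abs(obs_diff)))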
Resampling tests like this are quite general: the data can be numeric or categorical, the sample sizes can be the same or different, and no assumption of normally distributed data is needed.
T-test for two-group data
The t-test is a formula-based alternative to the permutation test for comparing two groups. It uses a standardized version of the test statistic (such as the difference in means), which is compared against a reference t-distribution; see the scipy.stats.ttest_ind example in Part 3 below.
F-test for multiple-group data
Just like the t-test can be used instead of a permutation test for comparing the mean of two groups, there is a statistical test for ANOVA based on the F-statistic.
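As a sketch (assuming the four_sessions DataFrame from the ANOVA permutation example above, with 'Page' and 'Time' columns), scipy's f_oneway computes the F-statistic and its p-value directly:
from scipy import stats

# Split Time values into one array per page and run a one-way ANOVA F-test
groups = [g['Time'].values for _, g in four_sessions.groupby('Page')]
f_stat, p_value = stats.f_oneway(*groups)
print(f'F-statistic: {f_stat:.4f}, p-value: {p_value:.4f}')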
Part 3: Statistical Significance
Statistical significance is how statisticians measure whether an experiment (or even a study of existing data) yields a result more extreme than what chance might produce. If the result is beyond the realm of chance variation, it is said to be statistically significant.
The p-value is used to measure statistical significance. Given a chance model that embodies the null hypothesis, it is the probability of obtaining results as unusual or extreme as the observed results. (Note that this is not the same as the probability that the result is due to chance.) There are two ways of calculating a p-value:
- a random permutation test, as explained above.
- a statistical model (an approximation formula), for example the chi-square test:
from scipy import stats
# 2x2 contingency table of counts for the two observed groups
survivors = np.array([[200, 23739 - 200],  # observation 1
                      [182, 22588 - 182]])  # observation 2
chi2, p_value, df, _ = stats.chi2_contingency(survivors)
print(f'p-value for single sided test: {p_value / 2:.4f}')
- t-tests: a standardized version of the test statistic (such as the difference in means), compared against a reference t-distribution:
res = stats.ttest_ind(session_times[session_times.Page == 'Page A'].Time,
session_times[session_times.Page == 'Page B'].Time,
equal_var=False)
print(f'p-value for single sided test: {res.pvalue / 2:.4f}')
However, the p-value by itself does not fully capture statistical significance. The American Statistical Association's statement on p-values lists six principles to keep in mind when using them:
- P-values can indicate how incompatible the data are with a specified statistical model.
- P-values do not measure the probability that the studied hypothesis is true, or the probability that the data were produced by random chance alone.
- Scientific conclusions and business or policy decisions should not be based only on whether a p-value passes a specific threshold.
- Proper inference requires full reporting and transparency.
- A p-value, or statistical significance, does not measure the size of an effect or the importance of a result.
- By itself, a p-value does not provide a good measure of evidence regarding a model or hypothesis.
Even if a result is statistically significant, that does not mean it has practical significance. For a data scientist, a p-value is a useful metric in situations where you want to know whether a model result that appears interesting and useful is within the range of normal chance variability.
Part 4: Reference
Book
- Practical Statistics for Data Scientists: 50+ Essential Concepts Using R and Python, Chapter 3: Statistical Experiments and Significance Testing