Hypothesis testing 2: two sample t-test

In a previous post, we introduced the basic terminology of hypothesis testing.  We also wanted to test the null hypothesis H_0 that the two websites have the same clickthrough rate, i.e. p_1=p_2, against the alternative hypothesis H_1 that p_1\neq p_2.  Here we show how to do it.

Test Statistic and the t-test

A test statistic is some function of a sample that is used for hypothesis testing.  The name of a test denotes the distribution of its test statistic: for instance, a z-test is a hypothesis test whose test statistic is Gaussian distributed under the null hypothesis.  However, an exact z-test requires the standard deviation of the data to be known.  When the standard deviation is unknown and has to be estimated from the data, a related alternative is the t-test.  Consider the case of iid observations X_1,\cdots,X_n\sim \mathcal{N}(\mu,\sigma^2) with sample mean \bar{X}=\frac{1}{n}\sum_{i=1}^n X_i\sim \mathcal{N}(\mu,\sigma^2/n) for some unknown parameters \mu,\sigma^2. Let S^2=\frac{1}{n-1}\sum_{i=1}^n (X_i-\bar{X})^2, which is an unbiased estimate of the variance \sigma^2. Then

(1)   \begin{equation*} \frac{\bar{X}-\mu}{S/\sqrt{n}} \end{equation*}

has a Student’s t-distribution with n-1 degrees of freedom.
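To see why, it helps to rewrite (1) as a ratio of two standard pieces. The sketch below relies on two classical facts for Gaussian data that are not derived in this post: (n-1)S^2/\sigma^2 follows a chi-squared distribution with n-1 degrees of freedom, and it is independent of \bar{X}. The symbols Z and V are introduced here only for illustration.

\begin{align*} \frac{\bar{X}-\mu}{S/\sqrt{n}} &= \frac{(\bar{X}-\mu)/(\sigma/\sqrt{n})}{\sqrt{\frac{(n-1)S^2/\sigma^2}{n-1}}} = \frac{Z}{\sqrt{V/(n-1)}},\\ Z&=\frac{\bar{X}-\mu}{\sigma/\sqrt{n}}\sim\mathcal{N}(0,1),\qquad V=\frac{(n-1)S^2}{\sigma^2}\sim\chi^2_{n-1} \end{align*}

A standard normal variable divided by the square root of an independent chi-squared variable over its degrees of freedom is, by definition, t-distributed with that many degrees of freedom. The unknown \sigma cancels, which is why the test can be run without knowing it.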
If we wanted to test whether the population (true) mean equals some hypothesized value p, we would choose a significance level, for instance 0.05, let t be t-distributed with n-1 degrees of freedom, and then check, for a two-sided test, whether either of the following holds:

(2)   \begin{align*} P\left(t\leq \frac{\bar{X}-p}{S/\sqrt{n}}\right)&\leq 0.025\\ P\left(t\geq \frac{\bar{X}-p}{S/\sqrt{n}}\right)&\leq 0.025 \end{align*}

If one of these holds, then we reject the null hypothesis.
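The procedure above can be sketched in Python using only the standard library. This is a minimal illustration, not a reference implementation: the helper names (t_pdf, t_cdf, one_sample_t_test) and the data are hypothetical, and the t-distribution CDF is obtained by numerically integrating the density with Simpson's rule. In practice a library routine such as scipy.stats.ttest_1samp would normally be used instead.

```python
import math
from statistics import mean, stdev


def t_pdf(x, df):
    """Density of Student's t-distribution with df degrees of freedom."""
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(x * x / df))


def t_cdf(x, df, steps=2000):
    """P(t <= x): Simpson's rule on [0, |x|], then symmetry about 0."""
    a = abs(x)
    h = a / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3
    return 0.5 + area if x >= 0 else 0.5 - area


def one_sample_t_test(xs, p, alpha=0.05):
    """Two-sided one-sample t-test of H_0: true mean == p."""
    n = len(xs)
    # Statistic (1) with mu replaced by the hypothesized mean p;
    # stdev() uses the n-1 denominator, matching S.
    t_stat = (mean(xs) - p) / (stdev(xs) / math.sqrt(n))
    lower = t_cdf(t_stat, n - 1)  # P(t <= observed statistic)
    upper = 1.0 - lower           # P(t >= observed statistic)
    # Conditions (2): reject if either tail probability is at most alpha/2.
    reject = lower <= alpha / 2 or upper <= alpha / 2
    return t_stat, lower, upper, reject


# Hypothetical measurements, tested against a hypothesized mean of 5.0.
data = [5.1, 4.9, 5.6, 5.8, 6.0, 5.3, 5.5, 5.7]
t_stat, lower, upper, reject = one_sample_t_test(data, 5.0)
print(f"t = {t_stat:.3f}, lower tail = {lower:.4f}, upper tail = {upper:.4f}")
print("reject H_0" if reject else "fail to reject H_0")
```

The function returns both tail probabilities so that each line of (2) can be inspected separately.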

Two Samples

In our setting we don’t want to test whether a sample mean matches some hypothesized true mean; instead we want to test whether the true means of two populations are equal.  That is, we have two independent samples: X_1^{(1)},\cdots,X_{n_1}^{(1)} with true mean p_1 and variance \sigma_1^2, and X_1^{(2)},\cdots,X_{n_2}^{(2)} with true mean p_2 and variance \sigma_2^2. We assume the sample means satisfy \bar{X}^{(1)}\sim \mathcal{N}(p_1,\sigma_1^2/n_1) and \bar{X}^{(2)}\sim \mathcal{N}(p_2,\sigma_2^2/n_2).  Because the samples are independent, the difference of the sample means satisfies \bar{X}^{(1)}-\bar{X}^{(2)}\sim \mathcal{N}(p_1-p_2,\sigma_1^2/n_1+\sigma_2^2/n_2).  Let S^{2,(1)} be the unbiased estimator of the variance \sigma_1^2 of the first sample and S^{2,(2)} that of \sigma_2^2.  Under the null hypothesis p_1=p_2, \bar{X}^{(1)}-\bar{X}^{(2)}\sim\mathcal{N}(0,\sigma_1^2/n_1+\sigma_2^2/n_2), so that

(3)   \begin{align*} \frac{\bar{X}^{(1)}-\bar{X}^{(2)}}{\sqrt{S^{2,(1)}/n_1+S^{2,(2)}/n_2}} \end{align*}

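follows, approximately, a Student’s t-distribution.  Because the two variances are estimated separately (this is Welch’s t-test), the degrees of freedom are given by the Welch-Satterthwaite approximation \nu\approx\frac{(S^{2,(1)}/n_1+S^{2,(2)}/n_2)^2}{(S^{2,(1)}/n_1)^2/(n_1-1)+(S^{2,(2)}/n_2)^2/(n_2-1)}.  The simpler value n_1+n_2-2 is exact only when the two variances are assumed equal and the denominator uses the pooled variance estimate instead.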
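A minimal Python sketch of the two-sample test built on statistic (3), again using only the standard library. One assumption to flag: the degrees of freedom are computed with the Welch-Satterthwaite approximation, the usual choice when the two variances are estimated separately as in (3), rather than n_1+n_2-2. The helper names and the click data are hypothetical; scipy.stats.ttest_ind with equal_var=False is the usual library route.

```python
import math
from statistics import mean, variance


def t_pdf(x, df):
    """Density of Student's t-distribution with df degrees of freedom."""
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(x * x / df))


def t_cdf(x, df, steps=2000):
    """P(t <= x): Simpson's rule on [0, |x|], then symmetry about 0."""
    a = abs(x)
    h = a / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3
    return 0.5 + area if x >= 0 else 0.5 - area


def welch_t_test(xs, ys, alpha=0.05):
    """Two-sided two-sample test of H_0: p_1 == p_2, variances not pooled."""
    n1, n2 = len(xs), len(ys)
    # Estimated variances of the two sample means: S^2/n for each sample.
    v1, v2 = variance(xs) / n1, variance(ys) / n2
    t_stat = (mean(xs) - mean(ys)) / math.sqrt(v1 + v2)  # statistic (3)
    # Welch-Satterthwaite degrees of freedom (generally not an integer).
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    lower = t_cdf(t_stat, df)
    upper = 1.0 - lower
    reject = lower <= alpha / 2 or upper <= alpha / 2
    return t_stat, df, lower, upper, reject


# Hypothetical click data: 1 = visitor clicked, 0 = visitor did not.
site_a = [1] * 100 + [0] * 900   # 100 clicks out of 1000 visits
site_b = [1] * 150 + [0] * 850   # 150 clicks out of 1000 visits
t_stat, df, lower, upper, reject = welch_t_test(site_a, site_b)
print(f"t = {t_stat:.3f}, df = {df:.1f}")
print("reject H_0" if reject else "fail to reject H_0")
```

With 0/1 click data the Gaussian assumption on the sample means is only an approximation, justified by the central limit theorem when both samples are large.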
