Today we’re going to talk about our first test involving two samples: the t test for two independent samples!
When Would You Use It?
The t test for two independent samples is a parametric test used to determine if two independent samples represent two populations with different mean values.
What Type of Data?
The t test for two independent samples requires interval or ratio data.
Test Assumptions
- Each sample is a simple random sample from the population it represents.
- The distributions underlying each of the populations are normal.
- The variances of the underlying populations are equal (homogeneity of variance; a formal test for this will come in a later week).
Test Process
Step 1: Formulate the null and alternative hypotheses. The null hypothesis claims that the two population means are equal. The alternative hypothesis claims otherwise (one population mean is greater than the other, less than the other, or the means are simply not equal).
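Step 2: Compute the t-score. The t-score is computed as follows:
$$t = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\left[\dfrac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}\right]\left[\dfrac{1}{n_1} + \dfrac{1}{n_2}\right]}}$$
Here $\bar{X}_1$ and $\bar{X}_2$ are the sample means, $s_1^2$ and $s_2^2$ are the sample variances, and $n_1$ and $n_2$ are the sample sizes. The t-score has $n_1 + n_2 - 2$ degrees of freedom.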
Step 3: Obtain the p-value associated with the calculated t-score. The p-value indicates the probability of a difference in the two sample means that is equal to or more extreme than the observed difference between the sample means, under the assumption that the null hypothesis is true.
Step 4: Determine the conclusion. If the p-value is larger than the prespecified α-level, fail to reject the null hypothesis (that is, retain the claim that the population means are equal). If the p-value is smaller than the prespecified α-level, reject the null hypothesis in favor of the alternative.
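The four steps above can be sketched in R as a small function. This is only an illustration: the scores in `group1` and `group2` are invented for the sketch (they are not the STAT 213 data), and the names `pooled_t_test`, `group1`, `group2`, and `decision` are hypothetical. The last line calls R's built-in `t.test` with `var.equal = TRUE`, which should agree with the hand calculation.

```r
# Hypothetical scores, invented purely for illustration (not the STAT 213 data).
group1 <- c(82, 75, 91, 68, 88, 79, 85, 73)
group2 <- c(70, 64, 81, 59, 77, 66)

# Pooled-variance t test for two independent samples (hypothetical helper).
pooled_t_test <- function(x1, x2,
                          alternative = c("two.sided", "greater", "less")) {
  alternative <- match.arg(alternative)
  n1 <- length(x1)
  n2 <- length(x2)
  df <- n1 + n2 - 2

  # Step 2: pooled variance estimate, then the t-score.
  pooled_var <- ((n1 - 1) * var(x1) + (n2 - 1) * var(x2)) / df
  t_score <- (mean(x1) - mean(x2)) / sqrt(pooled_var * (1 / n1 + 1 / n2))

  # Step 3: p-value in the direction stated by the alternative hypothesis.
  p_value <- switch(alternative,
    greater   = pt(t_score, df, lower.tail = FALSE),
    less      = pt(t_score, df),
    two.sided = 2 * pt(abs(t_score), df, lower.tail = FALSE)
  )

  list(t = t_score, df = df, p.value = p_value)
}

# Step 1: H0: mu1 = mu2 versus Ha: mu1 > mu2, with alpha chosen in advance.
alpha <- 0.05
res <- pooled_t_test(group1, group2, alternative = "greater")

# Step 4: compare the p-value to alpha.
decision <- if (res$p.value < alpha) "reject H0" else "fail to reject H0"

# Cross-check against the built-in equal-variance t test.
builtin <- t.test(group1, group2, alternative = "greater", var.equal = TRUE)
```

Printing `res` and `builtin` side by side shows the same t-score, degrees of freedom, and p-value, which is a quick way to catch arithmetic slips in a hand calculation.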
Example
The data for this example come from the midterm scores of my lab section for STAT 213. While lab attendance is technically optional, the students’ attendance is recorded for each lab (students who show up to lab get additional instructional materials unlocked to help them study).
I wanted to see if there was a significant difference in the average midterm score for students who attended lab at least half the time (sample 1) and students who attended lab less than half the time (sample 2). Specifically, I wanted to test the claim that attending lab more frequently was associated with a higher midterm score. Here, n1 = 17 and n2 = 13. Set α = 0.05.
H0: µ1 = µ2 (or µ1 – µ2 = 0)
Ha: µ1 > µ2 (or µ1 – µ2 > 0)
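For these sample sizes, the reference distribution and the one-tailed p-value can be written out explicitly. This is a sketch under the equal-variance assumption listed earlier. The symbols $T_{28}$ (a t-distributed random variable with 28 degrees of freedom) and $t_{\text{obs}}$ (the observed t-score) are notation introduced here for illustration.

$$df = n_1 + n_2 - 2 = 17 + 13 - 2 = 28$$

$$p = P\left(T_{28} \ge t_{\text{obs}}\right)$$

With α = 0.05 and a one-tailed alternative, rejecting H0 when p < 0.05 is equivalent to rejecting when $t_{\text{obs}}$ exceeds the tabled critical value $t_{0.05,\,28} \approx 1.701$.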
Since our p-value is smaller than our alpha-level, we reject H0 and claim that the population means are significantly different (with evidence in favor of the mean being higher for those attending labs more often).
Example in R
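x = read.table('clipboard', header = TRUE)  # data previously copied to the clipboard
x1 = subset(x, attended == 1)[, 1]  # scores: attended lab at least half the time
x2 = subset(x, attended == 0)[, 1]  # scores: attended lab less than half the time
n1 = length(x1)
n2 = length(x2)
xbar1 = mean(x1)
xbar2 = mean(x2)
s1 = (sum(x1^2) - (sum(x1)^2 / n1)) / (n1 - 1)  # sample variance, sample 1
s2 = (sum(x2^2) - (sum(x2)^2 / n2)) / (n2 - 1)  # sample variance, sample 2
df = n1 + n2 - 2  # degrees of freedom for the pooled test
t = (xbar1 - xbar2) / sqrt((((n1 - 1) * s1 + (n2 - 1) * s2) / df) * ((1 / n1) + (1 / n2)))  # test statistic
pval = 1 - pt(t, df)  # one-tailed p-value for Ha: µ1 > µ2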