[This is coming out on a Monday ’cause I was super busy yesterday and had no time to make this/post it.]
Today’s test is a nonparametric test for two samples: the Kolmogorov-Smirnov test for two independent samples!
When Would You Use It?
The Kolmogorov-Smirnov test for two independent samples is a nonparametric test used to determine if two independent samples represent two different populations.
What Type of Data?
The Kolmogorov-Smirnov test for two independent samples requires ordinal data. The test makes the following assumptions:
- All of the observations in the samples are randomly selected and independent of one another.
- The scale of the measurement is ordinal.
Step 1: Formulate the null and alternative hypotheses. The null hypothesis claims that the distribution underlying the population for one sample is the same as the distribution underlying the population for the other sample. The alternative claims that the distributions are not the same.
Step 2: Compute the test statistic. For this test, the test statistic is the greatest vertical distance, at any point, between the cumulative probability distribution constructed from the first sample and the cumulative probability distribution constructed from the second sample. The example further down shows how these calculations are done in a specific testing situation.
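To make Step 2 a bit more concrete, here is a minimal Python sketch of the calculation. The function names (`ecdf`, `ks_two_sample_statistic`) and the play counts are my own made-up placeholders, not anything from the data in this post, and this is just one straightforward way to do it.

```python
from bisect import bisect_right


def ecdf(sample):
    """Return a function giving the proportion of `sample` at or below x."""
    ordered = sorted(sample)
    n = len(ordered)
    return lambda x: bisect_right(ordered, x) / n


def ks_two_sample_statistic(sample_1, sample_2):
    """Greatest vertical distance between the two sample cumulative distributions."""
    cdf_1 = ecdf(sample_1)
    cdf_2 = ecdf(sample_2)
    # The gap between the two step functions only changes at observed values,
    # so it is enough to check every distinct value in the pooled data.
    pooled = sorted(set(sample_1) | set(sample_2))
    return max(abs(cdf_1(x) - cdf_2(x)) for x in pooled)


# Hypothetical play counts, invented purely for illustration.
group_a = [1, 2, 3, 4]
group_b = [3, 4, 5, 6]
print(ks_two_sample_statistic(group_a, group_b))  # 0.5
```

Two identical samples give a statistic of 0, and two samples that don't overlap at all give 1, which are the two extremes of this statistic.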
Step 3: Obtain the critical value. Unlike most of the tests we’ve done so far, you don’t get a precise p-value when computing the results here. Rather, you calculate your test statistic and then compare it to a specific value. This is done using a table. Find the number at the intersection of your sample sizes for your specified alpha-level. Compare this value with your test statistic.
Step 4: Determine the conclusion. If your test statistic is equal to or larger than the table value, reject the null hypothesis (that is, claim that the two samples come from populations with different distributions). If your test statistic is less than the table value, fail to reject the null.
For this test’s example, I want to use some of my music data from 2012. I know that I tend to listen to music from the “electronic” genre and from the “dance” genre fairly equally, so I want to determine, based on play count, if I can say that the population distributions for these genres are similar. To keep things simple, I will use n_electronic = 6 and n_dance = 6.
H0: F_electronic(X) = F_dance(X) for all values of X
Ha: F_electronic(X) ≠ F_dance(X) for at least one value of X
For the computations section of this test, I will display a table of values for the data and describe what the values are and how the test statistic is obtained.
Column A and Column C, together, show the ranked values of the play counts for electronic (Column A) and dance (Column C).
Column B represents the cumulative proportion in the sample for each play count in Column A. For example, for the play count = 7, the cumulative proportion of that value is just 1/6, since there is no smaller value in Column A.
Column D represents the cumulative proportion in the sample for each play count in Column C, just as Column B does for Column A.
Column E is Column B – Column D.
The test statistic is obtained by determining the largest absolute value in Column E. Here, the test statistic is .5. This value is compared to the critical value at α = 0.05, n1 = 6, n2 = 6, which is .667. Since our test statistic is smaller than our critical value, we fail to reject the null: there is not enough evidence to claim that the distributions of play counts for electronic and dance are different.
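If you'd rather let the computer fill in Columns A through E, here is a rough Python sketch of the same procedure. The play counts are hypothetical (they are not my actual music data), the `ks_table` helper is a name I made up, and I'm assuming one row per distinct value in the pooled data. The only number taken from the example is the critical value of .667.

```python
from fractions import Fraction


def ks_table(sample_1, sample_2):
    """Build rows (A, B, C, D, E), one per distinct value in the pooled data."""
    n1, n2 = len(sample_1), len(sample_2)
    rows = []
    for value in sorted(set(sample_1) | set(sample_2)):
        col_a = value if value in sample_1 else None  # value from sample 1, if any
        col_c = value if value in sample_2 else None  # value from sample 2, if any
        col_b = Fraction(sum(x <= value for x in sample_1), n1)  # cumulative proportion, sample 1
        col_d = Fraction(sum(y <= value for y in sample_2), n2)  # cumulative proportion, sample 2
        rows.append((col_a, col_b, col_c, col_d, col_b - col_d))  # Column E = B - D
    return rows


# Hypothetical play counts, invented purely for illustration.
sample_1 = [5, 9, 14, 18, 22, 30]
sample_2 = [11, 16, 20, 25, 33, 40]
critical_value = 0.667  # table value quoted above for alpha = 0.05, n1 = n2 = 6

rows = ks_table(sample_1, sample_2)
for a, b, c, d, e in rows:
    a_text = "" if a is None else str(a)
    c_text = "" if c is None else str(c)
    print(f"{a_text:>3} {str(b):>4} {c_text:>3} {str(d):>4} {str(e):>5}")

# The test statistic is the largest absolute value in Column E.
statistic = max(abs(row[4]) for row in rows)
reject_null = statistic >= critical_value
print(f"D = {float(statistic):.3f}, reject H0: {reject_null}")
```

With these made-up numbers the largest absolute difference is 1/3, which is below .667, so the sketch fails to reject the null. I used `Fraction` so the proportions stay exact (1/6, 1/3, and so on) instead of picking up floating-point rounding.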
Example in R
No R example this week, as this is pretty easy to do by hand, especially with having to rank things.