Week 7: The Kolmogorov-Smirnov Goodness-of-Fit Test for a Single Sample

Today we’re going to do our first goodness-of-fit test: the Kolmogorov-Smirnov goodness-of-fit test for a single sample.

When Would You Use It?
The Kolmogorov-Smirnov goodness-of-fit test is a nonparametric test used in a single sample situation to determine if the distribution of a sample of values conforms to a specific population (or probability) distribution.

What Type of Data?
The Kolmogorov-Smirnov goodness-of-fit test requires ordinal data.

Test Assumptions
None listed.

Test Process
Step 1: Formulate the null and alternative hypotheses. The null hypothesis claims that the distribution of the data in the sample is consistent with the hypothesized theoretical population distribution. The alternative claims that the distribution of the data in the sample is inconsistent with the hypothesized theoretical population distribution.

Step 2: Compute the test statistic. The test statistic for this test is the greatest vertical distance, at any point, between the cumulative probability distribution constructed from the sample and the cumulative probability distribution expected under the hypothesized population distribution. Since the specifics of these calculations depend on which distribution is hypothesized, I will refer you to the example below to show how they are done in a specific testing situation.

Step 3: Obtain the critical value. Unlike most of the tests we’ve done so far, you don’t get a precise p-value when computing the results here. Rather, you calculate your test statistic and then compare it to a specific value. This is done using a table (such as the one here). Find the number at the intersection of your sample size n and the specified alpha-level. Compare this value with your test statistic.

Step 4: Determine the conclusion. If your test statistic is equal to or larger than the table value, reject the null hypothesis (that is, claim that the distribution of the data is inconsistent with the hypothesized population distribution). If your test statistic is less than the table value, fail to reject the null.
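
Steps 2 through 4 can be sketched in R roughly as follows. This is only an illustration under some assumptions: the helper names ks_stat and ks_decide are made up for this sketch, the scores are illustrative values rather than real data, and the critical value still has to come from your own table lookup in Step 3.

```r
# Hypothetical helper for Step 2: greatest vertical distance between the
# sample's cumulative distribution and a hypothesized CDF.
ks_stat <- function(x, cdf, ...) {
  x <- sort(x)
  n <- length(x)
  expected <- cdf(x, ...)            # cumulative proportion under the hypothesis
  at_value <- (1:n) / n              # sample cumulative proportion at each value
  before_value <- (0:(n - 1)) / n    # sample cumulative proportion just below it
  max(abs(at_value - expected), abs(before_value - expected))
}

# Hypothetical helper for Step 4: compare the statistic with the table value.
ks_decide <- function(stat, critical) {
  if (stat >= critical) "reject H0" else "fail to reject H0"
}

# Illustrative scores only (not data from this post).
scores <- c(85, 92, 100, 104, 118)
d <- ks_stat(scores, pnorm, mean = 100, sd = 15)
d
```

Once you have looked up the table value for your n and alpha-level, ks_decide(d, critical) returns the conclusion from Step 4.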

Example
For this test’s example, I wanted to determine, from a sample of n = 59 IQ scores, if scores in the population follow a normal distribution with a mean µ = 100 and standard deviation σ = 15. Set α = 0.05.

H0: IQ scores in the population follow a normal distribution with a mean of 100 and a standard deviation of 15.
Ha: IQ scores in the population deviate from a normal distribution with a mean of 100 and a standard deviation of 15.

Computations:

For the computations section of this test, I will display a table of values for the first three and the last of the IQ scores (sorted from smallest to largest) and describe what the values are and how the test statistic is obtained.

[Image test7: table of Columns A through G for the first three and the last of the sorted IQ scores]

Column A represents the IQ scores of the sample, ranked from lowest to highest.

Column B represents the z-scores of the IQ scores, calculated by subtracting the mean (100) from each score, then dividing by the standard deviation (15).

Column B is not necessary, but is used to make the calculation of Column C easier. Column C is the proportion of cases between the z-score (Column B) and the hypothesized mean of the population’s distribution (100 in this case). For example, for an IQ of 82, the proportion of scores falling between 82 and 100 is .385.

Column D represents the percentile rank of a given IQ score in the hypothesized population distribution. An IQ of 82, for example, is the 11.5th percentile.

Column E represents the cumulative proportion, in the sample, for each IQ. For an IQ of 82, the cumulative proportion is just 1/59, while the cumulative proportion for the highest value, 145, is 59/59.

Column F is the absolute difference between the ith values in Column D and Column E. This represents the differences between the proportions in the sample and the proportions expected under the hypothesized population distribution.

Finally, Column G is the absolute difference between the value of Column D for a given row and the value of Column E for the preceding row. For example, for an IQ of 89, Column G is calculated by taking |0.232 - 0.017|.
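
The values quoted in the column descriptions can be reproduced with a few lines of R. This is a small sketch, not the full table: it assumes that 82 and 89 are the two lowest scores in the sample (which is what the 1/59 = 0.017 in the Column G example implies), and the variable names are just labels chosen for this illustration.

```r
mu <- 100
sigma <- 15
n <- 59

z82 <- (82 - mu) / sigma             # Column B for an IQ of 82: -1.2
C82 <- abs(0.5 - pnorm(abs(z82)))    # Column C: proportion between 82 and 100
D82 <- pnorm(z82)                    # Column D: percentile rank of 82
E82 <- 1 / n                         # Column E: lowest score, so 1/59

D89 <- pnorm((89 - mu) / sigma)      # Column D for an IQ of 89
G89 <- abs(D89 - E82)                # Column G for 89: Column D minus previous Column E

# Rounded to three places, these match the .385, 11.5th percentile, 0.232 and 0.017 above.
round(c(C82 = C82, D82 = D82, E82 = E82, D89 = D89, G89 = G89), 3)
```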

The test statistic is the largest value found in either Column F or Column G. That is, take the maximum of each column, and the larger of the two becomes the test statistic. When these values are computed for the whole dataset, the largest value is 0.438. This value is compared to the critical value at α = 0.05, n > 35, which ends up being:

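[Image test7b: the critical value; the standard large-sample table entry for α = 0.05 and n > 35 is 1.36/√n = 1.36/√59 ≈ 0.177]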

Since our test statistic is larger than our critical value, we reject H0 and claim that IQ scores in the population deviate from a normal distribution with mean 100 and standard deviation 15.
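
The comparison can be checked in R with a short sketch. It assumes the commonly tabulated two-tailed large-sample entry for α = 0.05 and n > 35, which is 1.36 divided by the square root of n; if your table uses a different entry, substitute it for crit.

```r
n <- 59
Tstat <- 0.438             # test statistic reported for the whole dataset
crit <- 1.36 / sqrt(n)     # assumed table entry for alpha = 0.05, n > 35

round(crit, 3)             # 0.177
Tstat >= crit              # TRUE, so H0 is rejected
```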

Example in R

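x <- read.table('clipboard', header = FALSE)    # IQ scores copied to the clipboard (Windows)
x <- sort(as.matrix(x))                         # column A: scores sorted ascending
n <- length(x)
mu <- 100
sigma <- 15
z <- (x - mu) / sigma                           # column B: z-scores
pmu <- 0.5                                      # cumulative proportion at the mean
pz <- pnorm(abs(z), mean = 0, sd = 1, lower.tail = TRUE)
C <- abs(pmu - pz)                              # column C: proportion between score and mean
D <- pnorm(z, mean = 0, sd = 1, lower.tail = TRUE)  # column D: hypothesized cumulative proportion
E <- (1:n) / n                                  # column E: sample cumulative proportion
Fcol <- abs(E - D)                              # column F (named Fcol so R's F, i.e. FALSE, is not overwritten)
Eprev <- c(0, E[-n])                            # column E shifted down one row
G <- abs(Eprev - D)                             # column G
Tstat <- max(Fcol, G)                           # largest value in either column
Tstat                                           # test statistic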
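
As a cross-check, base R's ks.test function (in the stats package) computes the same statistic directly and also reports a p-value. The sketch below uses simulated scores as a stand-in, since the IQ data from the example are not reproduced here; the seed and the variable names are arbitrary choices for this illustration.

```r
set.seed(7)                                  # arbitrary seed, for reproducibility only
x <- rnorm(59, mean = 100, sd = 15)          # simulated stand-in, NOT the post's IQ data

res <- ks.test(x, "pnorm", mean = 100, sd = 15)
D_builtin <- unname(res$statistic)           # same quantity as Tstat above
p_value <- res$p.value                       # ks.test also reports a p-value

# Manual statistic, computed the same way as Columns D through G.
n <- length(x)
D_hyp <- pnorm(sort(x), mean = 100, sd = 15)
D_manual <- max(abs((1:n) / n - D_hyp), abs((0:(n - 1)) / n - D_hyp))

all.equal(D_builtin, D_manual)               # the two approaches should agree
```

With your own data in x, the first ks.test line alone is enough; the manual part is only there to show that it agrees with the column-by-column approach.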