Today we’re going to talk about another measure of association: the point-biserial correlation coefficient!
When Would You Use It?
The point-biserial correlation coefficient is a parametric test used to determine whether, in the population, the correlation between values on two variables is some value other than zero. More specifically, it is used to determine if there is a significant linear relationship between the two variables.
What Type of Data?
The point-biserial correlation coefficient requires one variable to be expressed as interval or ratio data and the other variable to be represented by a dichotomous nominal or categorical scale. It is a special case of the Pearson product-moment correlation coefficient, which requires interval or ratio data for both variables.
Test Assumptions
- The sample has been randomly selected from the population it represents.
- The dichotomous variable is not based on an underlying continuous interval or ratio distribution.
Test Process
Step 1: Formulate the null and alternative hypotheses. The null hypothesis claims that in the population, the correlation between the scores on variable X and variable Y is equal to zero. The alternative hypothesis claims otherwise (that the correlation is less than, greater than, or simply not equal to zero).
Step 2: Compute the test statistic, a t-value. To do so, the actual correlation coefficient, rpb, must be calculated first. This calculation is as follows:
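One common textbook form of the coefficient is sketched here; the notation is an assumption and may differ from the form originally intended. Y is the interval/ratio variable, the dichotomous variable is coded 0 and 1, Ȳ1 and Ȳ0 are the means of Y in the two groups, n1 and n0 are the group sizes, n = n1 + n0, and s_Y is the sample standard deviation of Y computed with n − 1 in the denominator.

```latex
r_{pb} = \frac{\bar{Y}_1 - \bar{Y}_0}{s_Y}\,\sqrt{\frac{n_1 n_0}{n(n-1)}}
```

This gives the same number as the Pearson product-moment correlation computed on the 0/1 codes and Y.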
To compute the t-statistic, the following equation is used:
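The standard t-statistic for testing whether a correlation differs from zero is shown below, with n the sample size. It is consistent with the R output in the example, which reports df = 98 for n = 100.

```latex
t = \frac{r_{pb}\sqrt{n-2}}{\sqrt{1-r_{pb}^{2}}}, \qquad df = n - 2
```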
Step 3: Obtain the p-value associated with the calculated t-score. The p-value indicates the probability of observing a correlation as extreme or more extreme than the observed sample correlation, under the assumption that the null hypothesis is true.
Step 4: Determine the conclusion. If the p-value is larger than the prespecified α-level, fail to reject the null hypothesis (that is, retain the claim that the correlation in the population is zero). If the p-value is smaller than the prespecified α-level, reject the null hypothesis in favor of the alternative.
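The four steps above can be sketched by hand in R. This is a minimal illustration on simulated, hypothetical data (not the music data used in the example); the variable names `group` and `score`, the seed, and the group sizes are all assumptions made for the sketch.

```r
# Hypothetical data, simulated only for illustration
set.seed(1)
group <- rep(c(0, 1), times = c(60, 40))             # dichotomous variable coded 0/1
score <- rnorm(100, mean = 20 + 5 * group, sd = 10)  # interval/ratio variable

n  <- length(score)
n1 <- sum(group == 1)
n0 <- sum(group == 0)
m1 <- mean(score[group == 1])
m0 <- mean(score[group == 0])

# Step 2a: point-biserial correlation.
# sd() divides by n - 1, so the matching factor is n1 * n0 / (n * (n - 1)).
r_pb <- (m1 - m0) / sd(score) * sqrt(n1 * n0 / (n * (n - 1)))

# Step 2b: t-statistic with n - 2 degrees of freedom
t_stat <- r_pb * sqrt(n - 2) / sqrt(1 - r_pb^2)

# Step 3: one-sided p-value for Ha: rho > 0
p_value <- pt(t_stat, df = n - 2, lower.tail = FALSE)

# Step 4: compare the p-value with the chosen alpha level
alpha <- 0.05
reject_h0 <- p_value < alpha

# Cross-check: the hand computation should agree with Pearson's r
all.equal(r_pb, cor(group, score))
```

Running `cor.test(group, score, alternative = "greater")` on the same vectors should report the same correlation, t-statistic, and p-value, since the point-biserial coefficient is Pearson's r applied to a 0/1 variable.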
Let’s look at my music data again! I want to see if there is a significant correlation between the number of times I’ve played a song and whether or not it is a “favorite” (i.e., has 3+ stars). I suspect, of course, that I play my favorite songs more often than my non-favorite ones. If I code “favorite” as 1 and “non-favorite” as 0, then I will expect a positive correlation. I took a sample of n = 100 songs and let α = 0.05.
H0: ρpb = 0
Ha: ρpb > 0
Since our calculated p-value (0.001245, from the R output below) is smaller than our α-level of 0.05, we reject H0 and conclude that the correlation in the population is significantly greater than zero.
Example in R
# Read the data copied to the clipboard; the first row holds the column names
x = read.table('clipboard', header=TRUE)
attach(x)
# One-sided test of Ha: correlation greater than zero
cor.test(favorite, playcount, alternative="greater")

        Pearson's product-moment correlation

data:  favorite and playcount
t = 3.1048, df = 98, p-value = 0.001245
alternative hypothesis: true correlation is greater than 0
95 percent confidence interval:
 0.1407541 1.0000000
sample estimates:
     cor
0.299258
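As a quick sanity check, the reported t-statistic and p-value can be recovered from the sample correlation and sample size alone. This sketch assumes the standard t formula for a correlation with n − 2 degrees of freedom; the variable names are illustrative.

```r
r <- 0.299258   # sample correlation reported by cor.test
n <- 100        # number of songs in the sample

# t-statistic for H0: rho = 0
t_stat <- r * sqrt(n - 2) / sqrt(1 - r^2)

# Upper-tail p-value, matching alternative = "greater"
p_value <- pt(t_stat, df = n - 2, lower.tail = FALSE)

round(t_stat, 4)
signif(p_value, 4)
```

The two values should come out close to the t = 3.1048 and p-value = 0.001245 printed by `cor.test`, up to rounding of the reported correlation.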