An Exercise in Algebra: The Standard Deviation


Alrighty, you people had better be ready for a stats-related post! You can blame this one on the first assignment for my STAT 213 lab students.

A student emailed me tonight asking how he was to go about solving one of the homework questions. I took a look at the question, and this is what it said:

A data set consists of the 11 data points shown below, plus one additional data point. When the additional point is included in the data set, the sample standard deviation of the 12 points is computed to be 14.963. If it is known that the additional data point is 25 or less, find the value of the twelfth data point.

21, 24, 47, 14, 19, 17, 35, 29, 40, 17, 53

At first, I suspected that there might be a neat little trick you could employ in order to solve this question. But after Nate and I tried several different possible shortcuts, we realized (and this was later confirmed by the instructor for the course) that the only way to actually solve this was to do it longhand: working it out with the formula for the standard deviation.

Which we did, ’cause we’re badasses.

I want to show you how it’s done, because it’s actually pretty cool to see how you can figure out the missing value (or values, really; there are generally two values, one on each side of the mean of the known points, that produce the same standard deviation). But I won’t use the numbers above, ’cause the values/sums get pretty big with that size of a sample and with those numbers. So let’s make a fake problem to solve instead.

A data set consists of the 4 data points 3, 4, 6, and 9, plus one additional data point. When the additional point is included in the data set, the sample standard deviation of the 5 points is computed to be 2.55. Find the two possible values of the fifth data point.

Here’s the longhand:

[Image: the longhand solution using the formula for the sample standard deviation]
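For reference, here is one way the longhand can be written out. This is a sketch, assuming the usual sample standard deviation formula with an n − 1 denominator; the symbol x for the unknown fifth point is introduced here and is not from the original working. The known points have sum 22 and sum of squares 142.

```latex
\begin{align*}
s^2 &= \frac{1}{n-1}\left(\sum x_i^2 - \frac{\left(\sum x_i\right)^2}{n}\right) \\
4\,(2.55)^2 = 26.01 &= \left(142 + x^2\right) - \frac{(22 + x)^2}{5} \\
130.05 &= 710 + 5x^2 - \left(484 + 44x + x^2\right) \\
0 &= 4x^2 - 44x + 95.95 \\
x &= \frac{44 \pm \sqrt{44^2 - 4 \cdot 4 \cdot 95.95}}{2 \cdot 4}
   = \frac{44 \pm \sqrt{400.8}}{8} \approx 2.9975 \ \text{or} \ 8.0025
\end{align*}
```

Rounded to whole numbers, that gives 3 and 8 as the two candidates for the fifth point.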

Cool, huh? You can check it by finding the standard deviations of (3, 3, 4, 6, 9) and (3, 4, 6, 8, 9); they’re both approximately 2.55!
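If you want to run that check yourself, something like the following should do it. It only uses base R’s sd(), which computes the sample standard deviation with an n − 1 denominator; the variable names are just placeholders.

```r
# The two completed data sets, one for each candidate value of the fifth point
sd_low  <- sd(c(3, 3, 4, 6, 9))   # fifth point = 3
sd_high <- sd(c(3, 4, 6, 8, 9))   # fifth point = 8

# Both are sqrt(6.5), which rounds to 2.55
print(round(c(sd_low, sd_high), 2))
```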

And, of course, here’s a function I wrote in R called “findpoints” that will do the same thing. It will find the two possible values of the missing data point if it’s given the known points and the standard deviation of the complete dataset.

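findpoints <- function(x, snew) {
  # x: the known data points
  # snew: sample standard deviation of the full data set (known points plus the missing one)
  n <- length(x) + 1               # size of the full data set
  ss <- (snew^2) * (n - 1)         # (n - 1) * s^2, the sum of squared deviations
  sumx <- sum(x)
  sumxx <- sum(x^2)
  # Setting sumxx + p^2 - ((sumx + p)^2)/n equal to ss gives qa*p^2 + qb*p + qc = 0
  qa <- 1 - 1/n
  qb <- -2 * sumx / n
  qc <- sumxx - (sumx^2)/n - ss
  disc <- qb^2 - 4 * qa * qc       # negative if no real point gives this standard deviation
  root1 <- (-qb - sqrt(disc)) / (2 * qa)
  root2 <- (-qb + sqrt(disc)) / (2 * qa)
  roots <- c(root1, root2)
  return(roots)
}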

Let’s try it in R with the data we used for the longhand:

> y = c(3, 4, 6, 9)
> s = 2.55
> findpoints(y, s)
[1] 2.997501 8.002499

Yay!
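As a cross-check, here is a shorter route to the same two values. It is a sketch that relies on a standard identity rather than the quadratic above: adding a point p to m known points increases the sum of squared deviations by (m / (m + 1)) * (p - mean)^2, so the two solutions sit at equal distances on either side of the mean of the known points. The function name findpoints_sym and its argument names are made up for this example.

```r
# Hypothetical helper (not part of the original post): solve for the missing
# point using the distance from the mean of the known points.
findpoints_sym <- function(x, s_full) {
  m <- length(x)                      # number of known points
  ss_known <- sum((x - mean(x))^2)    # squared deviations of the known points
  ss_full <- m * s_full^2             # (n - 1) * s^2 for the full set, n = m + 1
  d <- sqrt((ss_full - ss_known) * (m + 1) / m)  # NaN if s_full is unreachable
  mean(x) + c(-d, d)
}

y <- c(3, 4, 6, 9)
roots <- findpoints_sym(y, 2.55)

# Plug each root back in; both should reproduce the standard deviation of 2.55
check <- sapply(roots, function(p) sd(c(y, p)))
print(roots)
print(check)
```

The roots agree with the output of findpoints above to the precision shown there.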

Sorry, this was a lot of fun, haha.

Edit: the students in STAT 213 were NOT supposed to do it this way! The question was more about using your statistical intuition combined with some guess-and-check to figure out the answer. The way they were supposed to go about it was as follows:

  1. Figure out the standard deviation of the given data points.
  2. Compare that standard deviation with the given standard deviation for all the data points. If the standard deviation for n = 11 is smaller than the standard deviation for n = 12, that suggests the missing point is far from the mean and probably outside the range of the given data values (either larger or smaller). If the standard deviation for n = 11 is bigger than the standard deviation for n = 12, that suggests the missing point is close to the mean and probably within the range of the given data values. (Strictly speaking, what matters is the point’s distance from the mean of the given values rather than the range itself, so treat this as a rule of thumb.)
  3. Combine your knowledge from 2) with the fact that you’re told that the missing data value has to be 25 or less to get a reduced range of possible values for your missing data point.

For example, the standard deviation (13.260) for the n = 11 values in the original example is less than the standard deviation we are given (14.963) for the n = 12 values, which suggests that the additional data point is outside the range of the given data values. This, combined with knowing that the additional point is equal to or less than 25, tells me that the point is most likely less than 14 (since that is the smallest value in our given data). From there, I can start plugging in values less than 14 for that additional point and calculating the standard deviation until I find the value that gives me a standard deviation of 14.963.
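The guess-and-check part can be sketched in R like this. Two things here are assumptions and not part of the assignment: that the missing point is a non-negative whole number (so the candidates are 0 through 13), and that a match means agreeing with 14.963 to three decimal places.

```r
known  <- c(21, 24, 47, 14, 19, 17, 35, 29, 40, 17, 53)
target <- 14.963

# Step 1: standard deviation of the 11 known points (about 13.260)
sd_known <- sd(known)

# Steps 2 and 3: sd_known is below the target, so look below the smallest value.
# Assumption: whole-number, non-negative candidates only.
candidates <- 0:13
sds <- sapply(candidates, function(p) sd(c(known, p)))

# Keep any candidate whose standard deviation matches the target to 3 decimals
hits <- candidates[abs(sds - target) < 0.0005]
print(hits)
```

With those assumptions, the only candidate that survives is 1.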
