Alright y’all, sit your butts down…it’s time for some REGRESSION!
If you’ve ever gone to Yellowstone National Park, you likely stopped to watch Old Faithful shoot off its rather regular jet of water. In case you’ve never seen this display, have a video taken of an eruption in 2007:
(Side note: you can hear people frantically winding their disposable cameras throughout the video. Retro.)
What does Old Faithful have to do with regression, you ask?
Well, while the geyser is neither the largest nor the most regular in Yellowstone, it’s the biggest regular geyser. Its size, combined with the relative predictability of its eruptions, makes it a good geyser for tourists to check out, as park rangers are able to estimate when the eruptions might occur and thus inform people about them. And that’s where regression comes into play: by analyzing the relationship between the length of an Old Faithful eruption and the waiting time until the next one, you can build a regression equation that lets anyone who knows how long the last eruption lasted predict how long the wait until the next one will be. Let’s see how it’s done!*
Part 1: What is Regression?
(This is TOTALLY not comprehensive; it’s just a very brief description of what regression is. There are a lot of assumptions that must be met and a lot of little details that I left out, but I just wanted to give a short overview for anyone who’s like, “I know a little bit about what regression is, but I need a bit of a refresher.”)
Regression (at least the simple linear kind I’m talking about here) is a statistical technique used to describe the relationship between two variables that are thought to be linearly related. It’s a little like correlation in the sense that it can be used to assess the strength of the linear relationship between the variables (think of the relationship between height and weight; in general, the taller someone is, the more they’re likely to weigh, and this relationship is pretty linear). However, unlike correlation, regression requires that the person interested in the data designate one variable as the independent variable and one as the dependent variable. That is, one variable (the independent variable) is used to explain or predict the other variable (the dependent variable, “dependent” because its value is modeled as depending, at least in part, on the value of the independent variable). Note that this is a modeling choice; the regression itself doesn’t prove that one variable causes the other. In the height/weight example, we can say that height is the independent variable and weight the dependent variable, as it makes intuitive sense to say that height affects weight (and it doesn’t really make sense to say that weight affects height).
What regression then allows us to do with these two variables is this: say we have 30 people for whom we’ve measured both height and weight. We can use this information to construct the equation of a line (the regression line) that best describes the linear relationship between height and weight for these 30 people. We can then use this equation for prediction. For example, say you wanted to estimate the weight of a person who was 6 feet tall. By plugging the value of six feet into your regression equation, you can calculate the associated weight estimate.
In short, regression lets us do this: if we have two variables that we suspect have a linear relationship and we have some data available for those two variables, we can use the data to construct the equation of a line that best describes the linear relationship between the variables. We can then use the line to infer or estimate the value of the dependent variable based on any given value of the independent variable.
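If you’d like to see that “construct the line, then plug in a value” idea in code, here’s a minimal Python sketch. Big caveat: the heights and weights below are numbers I made up purely for illustration (and there are 8 of them, not 30), and `fit_line` is just a hypothetical helper name. It uses the standard least-squares formulas for the slope and intercept, which is the usual way “best describes” gets defined, but it’s a sketch, not a full regression analysis.

```python
# Made-up illustrative numbers, NOT real measurements.
heights_in = [60, 62, 64, 66, 68, 70, 72, 74]          # heights in inches
weights_lb = [115, 122, 135, 140, 155, 160, 172, 181]  # weights in pounds


def fit_line(xs, ys):
    """Return (intercept, slope) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = sum of cross-deviations / sum of squared x-deviations.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    # The least-squares line always passes through (mean_x, mean_y).
    intercept = mean_y - slope * mean_x
    return intercept, slope


intercept, slope = fit_line(heights_in, weights_lb)

# Estimate the weight of someone 6 feet (72 inches) tall.
predicted_weight = intercept + slope * 72

print(f"weight = {intercept:.2f} + {slope:.2f} * height")
print(f"predicted weight at 72 inches: {predicted_weight:.1f} lb")
```

With real data you’d also want to check the assumptions I skipped over, but the mechanics are the same: fit the line once, then reuse it for any height you like.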
Part 2: Regression and Old Faithful
We can apply regression to Old Faithful in a useful way. Say you’re a park ranger at Yellowstone and you want to be able to tell tourists when they should start gathering around Old Faithful to watch it spout its water. You know that there’s a relationship between how long each eruption is and the subsequent waiting time until the next eruption. (For the sake of this example, let’s say you also know that this relationship is linear…which it is in real life.) So you want to create a regression equation that will let you predict waiting time from eruption time.
You get your hands on some data** (recorded eruption lengths, to the nearest 0.1 minute, and the subsequent waiting times, to the nearest minute) and you use it to build your regression equation! Let’s pretend you know how to do this in Excel or SPSS or R or something like that. The regression equation you get is as follows:
WaitingTime = 33.97 + 10.36*EruptionTime
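What does this regression equation tell us? The main thing it tells us is that, based on this data set, for every one-minute increase in the length of the eruption (EruptionTime), the predicted waiting time (WaitingTime) until the next eruption increases by 10.36 minutes, on average. (The 33.97 is the intercept: the waiting time the line would predict for an eruption of length zero, which isn’t meaningful on its own but anchors the line.)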
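Here’s a quick way to convince yourself of that slope interpretation, sketched in Python. The two coefficients come straight from the equation above; the function name `predict_waiting_time` is just something I picked for illustration.

```python
INTERCEPT = 33.97  # minutes, from the regression equation above
SLOPE = 10.36      # extra minutes of waiting per extra minute of eruption


def predict_waiting_time(eruption_minutes):
    """Predicted waiting time (minutes) after an eruption of the given length."""
    return INTERCEPT + SLOPE * eruption_minutes


# Compare two eruptions that differ in length by exactly one minute.
shorter = predict_waiting_time(2.0)
longer = predict_waiting_time(3.0)
difference = longer - shorter

print(round(difference, 2))  # 10.36
```

Pick any two eruption lengths that are one minute apart and the predictions will always differ by the slope, which is all “increases by 10.36 minutes” really means.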
It also, of course, gives us a tool for predicting the waiting time for the next eruption following an eruption of any given length. For example, say the first eruption you observe on a Wednesday morning lasts for 2.9 minutes. Now that you’ve got your regression equation, you can set EruptionTime = 2.9 and solve the equation for the WaitingTime. In this case,
WaitingTime = 33.97 + 10.36*2.9 ≈ 33.97 + 30.04 = 64.01 minutes
That means that you estimate the waiting time until the next eruption to be a little bit more than an hour. This is information you can use to help you do your job—telling tourists when the next eruption is likely.
Of course, no regression equation (and thus no prediction based off a regression equation) is perfect—I’ve read that people who try to predict eruptions based on regression equations are usually within a 10-minute margin, plus or minus—but it’s definitely a useful tool in my opinion. Plus it’s stats, so y’know…it’s cool automatically.
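If you wanted to turn all of this into something a ranger could actually announce, a rough Python sketch might look like the one below. Lots of assumptions here: the date and 9:00 start are hypothetical, the function names are mine, I’m assuming the wait is counted from the moment the last eruption ends (the data description doesn’t say), and the plus-or-minus 10 minutes is just the rough margin I mentioned reading about, not a proper statistical prediction interval.

```python
from datetime import datetime, timedelta

MARGIN_MINUTES = 10  # rough plus-or-minus margin, not a formal interval


def predict_waiting_time(eruption_minutes):
    """Predicted waiting time (minutes) from the regression equation."""
    return 33.97 + 10.36 * eruption_minutes


def eruption_window(last_eruption_time, eruption_minutes, margin=MARGIN_MINUTES):
    """Return (earliest, expected, latest) times for the next eruption."""
    wait = timedelta(minutes=predict_waiting_time(eruption_minutes))
    expected = last_eruption_time + wait
    pad = timedelta(minutes=margin)
    return expected - pad, expected, expected + pad


# Hypothetical: a 2.9-minute eruption that ended at 9:00 on a Wednesday morning.
last_eruption = datetime(2024, 1, 3, 9, 0)
earliest, expected, latest = eruption_window(last_eruption, 2.9)

print(f"Expect the next eruption around {expected:%H:%M} "
      f"(roughly {earliest:%H:%M} to {latest:%H:%M})")
```

For the 2.9-minute eruption from the example, that works out to telling people to start gathering a bit before 10:00.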
*I have no idea if Yellowstone officials have actually used regression to determine when to tell crowds to gather at the geyser; I can’t even remember how it’s all set up at the Old Faithful location, seeing as how I was like six years old when I saw it and Nate and I were thwarted in our efforts to see it a few weeks ago. But hey, any excuse to talk about stats, right?
**There are a decent number of Old Faithful datasets out there; I chose this one because it was easy to find and decently precise with regard to recording the durations.