Today I’m going to talk about probability!
Suppose you just went grocery shopping and are waiting at the bus stop to catch the bus back home. It’s Moscow and it’s November, so you’re probably cold, and you’re wondering to yourself what the probability is that you’ll have to wait more than, say, two more minutes for the bus to arrive.
How do we figure this out? Well, the first thing we need to know is that things like wait time are usually modeled as exponential random variables. Exponential random variables have the following pdf: f(x) = λe^(−λx) for x ≥ 0,
where lambda is what’s usually called the “rate parameter” (which gives us info on how “spread out” the corresponding exponential distribution is, but that’s not too important here). So let’s say that for our bus example, lambda is, hmm…1/2.
Now we can figure out the probability that you’ll be waiting more than two minutes for the bus. Let’s integrate that pdf! With λ = 1/2:

P(X > 2) = ∫₂^∞ (1/2)e^(−x/2) dx = e^(−1) ≈ .368
So you have a probability of .368 of waiting more than two minutes for the bus.
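If you’d rather not do the integral by hand, it works out to the exponential’s survival function, P(X > t) = e^(−λt), which is easy to sanity-check in a few lines of Python (a quick sketch assuming our λ = 1/2 from above):

```python
import math

RATE = 0.5  # the lambda we picked for the bus example

def survival(t, rate=RATE):
    """P(X > t) for an exponential RV: the pdf integrated from t to infinity."""
    return math.exp(-rate * t)

print(survival(2))  # ≈ 0.368, matching the hand integration
```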
Cool, huh? BUT WAIT, THERE’S MORE!
Now let’s say you’ve waited at the bus stop for eight whole minutes. You’re bored and you like probability, so you think, “what’s the probability, given that I’ve been standing here for eight minutes now, that I’ll have to wait at least 10 minutes total?”
In other words, given that the wait time thus far has been eight minutes, what is the probability that the total wait time will be at least 10 minutes?
We can represent that conditional probability like this: P(X ≥ 10 | X ≥ 8).
And this can be found as follows: P(X ≥ 10 | X ≥ 8) = P(X ≥ 10 and X ≥ 8) / P(X ≥ 8).
Which can be rewritten as: P(X ≥ 10) / P(X ≥ 8), since waiting at least 10 minutes already implies waiting at least 8.
Which is, using the same pdf and integration as above: e^(−5) / e^(−4) = e^(−1) ≈ .368.
Which is the exact same probability as the probability of having to wait more than two minutes (as calculated above)!
WOAH, MIND BLOWN, RIGHT?!
This is a demonstration of a particular property of the exponential distribution: memorylessness. That is, if we pick an arbitrary point s along an exponential distribution, the probability of waiting more than an additional t beyond s, given that we’ve already waited s, is exactly the same as the probability of waiting more than t from the start: P(X > s + t | X > s) = P(X > t). In other words, the distribution of the remaining wait time does not depend on s.
Another way to think about it: suppose I have an alarm clock that will go off after a time X (with some rate parameter lambda). If we consider X as the “lifetime” of the alarm, it doesn’t matter if I remember when I started the clock or not. If I go away and come back to observe the alarm hasn’t gone off, I can “forget” about the time I was away and still know that the distribution of the lifetime X is still the same as it was when I actually did set the clock, and will be the same distribution at any time I return to the clock (assuming it hasn’t gone off yet).
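If the algebra doesn’t convince you, simulation might. Here’s a quick Monte Carlo sketch (hypothetical code, not from the original post) that draws a million exponential wait times with λ = 1/2 and compares P(X > 2) with P(X > 10 | X > 8):

```python
import random

random.seed(42)          # for reproducibility
RATE = 0.5               # lambda from the bus example
N = 1_000_000

samples = [random.expovariate(RATE) for _ in range(N)]

# Unconditional: what fraction of wait times exceed 2 minutes?
p_uncond = sum(x > 2 for x in samples) / N

# Conditional: among waits that already exceeded 8 minutes,
# what fraction exceed 10 minutes total?
still_waiting = [x for x in samples if x > 8]
p_cond = sum(x > 10 for x in still_waiting) / len(still_waiting)

print(round(p_uncond, 3), round(p_cond, 3))  # both hover around 0.368
```

The conditional estimate is noisier (only a couple percent of the draws survive past 8 minutes), but both land right on e^(−1).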
Isn’t that just the coolest freaking thing?! This totally made my week.
Here’s more “Claudia is bored” random thingies.
I have 111 friends on Facebook. I wanted to see the distribution of birthdays across the months (and the zodiac signs, because why not?). So I Facebook stalked everyone and found that 97 of my 111 friends had their birthdays listed (at least month and day). Here’s the distribution by month:
I knew I had a lot of February, May, and November, but I didn’t know I had so many April and July. Haha, look at August and September. Very interesting, especially in comparison to this.
And here’s some zodiac just ‘cause:
Holycrapholycrapholycrapholycrap this is cool!
Alright. This blog is about odds ratios, when they’re useful, and when they’re not.
Part I: WTF is an odds ratio?
So I feel really dumb because I’ve been dealing with odds ratios all summer for my other job and I just realized that I actually freaking teach odds ratios in class.
An odds ratio is exactly what it sounds like: a ratio of odds (HOLY CRAP NO WAY!). So to better understand it, let’s look at what odds are. Odds are basically ratios of probabilities—specifically, the ratio of the probability of some event happening to the probability of it not happening.
Example: suppose you had 9 M&Ms in a bag (for some strange reason), three of which were red, five of which were green, and one of which was brown. To calculate your odds of pulling a red M&M, take the number of red M&Ms (3) over the number of non-red M&Ms (6). So the odds of pulling a red M&M are 3:6, or 1:2.
So what’s an odds ratio? It’s taking two of these odds and comparing them in ratio form (so it’s like a ratio of ratios). Wiki says it nicely: The odds ratio is the ratio of the odds of an event occurring in one group to the odds of it occurring in another group. If you’ve got the odds for Condition 1 as the numerator and the odds for Condition 2 as the denominator of your odds ratio, interpretation is as follows:
- Odds ratio = 1 means that the event is equally likely to occur in both Condition 1 and Condition 2.
- Odds ratio > 1 means that the event is more likely to occur in Condition 1
- Odds ratio < 1 means that the event is more likely to occur in Condition 2
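In code form, odds and odds ratios are one-liners. A minimal sketch (the function names `odds` and `odds_ratio` are mine, not from any particular library):

```python
def odds(p_event):
    """Odds = P(event) / P(event not happening)."""
    return p_event / (1 - p_event)

def odds_ratio(p1, p2):
    """Odds for Condition 1 divided by odds for Condition 2."""
    return odds(p1) / odds(p2)

# M&M example from above: P(red) = 3/9, so odds = (3/9)/(6/9) = 1:2
print(odds(3 / 9))  # ≈ 0.5, i.e. odds of 1:2
```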
Part II: Where would you see an odds ratio?
My dad is involved in writing and distributing a water quality/water attitudes survey. Over the years such surveys have been distributed to 30-some-odd states and tons of data have been collected. A big part of my job this summer was to go through data from 2008, 2010, and 2012 for the four Pacific Northwest states, AK, ID, OR, and WA.
We looked specifically at a couple questions with binary answers. So let’s take this question as an example.
“Have you received water quality information from environmental agencies?” People could answer “yes” or “no.” So what we were interested in was the proportion of people who answered “yes” for several different demographics. For this example, let’s just use age. We could express this info in two different ways. The raw proportions (proportion saying “yes”) for each age range we defined:
And then the odds ratios:
Why is the >70 group missing on this plot? Because we’re using its odds as the denominator for each of the odds ratio calculations involving the other five age categories. To calculate the odds for the >70 group, we take the proportion of “yes” over the proportion of “no.” Let’s call that odds value D. Now let’s say we want the odds ratio for the < 30 group to the >70 group (the red bar in the second graph up there). We calculate its odds the same way we did for the >70 group. Let’s call that odds value N. Then to get that odds ratio value, we take N/D. Simple as that!
But what’s it telling us? If we look at that red bar in the second graph, it’s an odds ratio of about .8. Since .8 is less than 1, we can say that people who are in the >70 group are more likely to say “yes, I’ve gotten water quality info from environmental agencies” than are people in the <30 group. And we can actually see that difference reflected in the proportions graph: the >70 group’s proportion for “yes” is higher than the <30 group’s. In fact, look at the similar shapes of the two graphs overall.
Part III: Here’s where things get interesting.
So pretty cool so far, right? When you read papers that involve a lot of proportions for binary data like this, the researchers really like to give you odds ratios, sometimes in plots like this. And sometimes it works out where that’s okay, ‘cause the odds ratios reflect what’s actually going on with the raw proportions.
But as is often the case with real data, things aren’t always nice and pretty like that.
Let’s look at another question from the surveys: “How important is clean drinking water?” This was actually originally a Likert scale question (5 different importance values were possible) but we combined ratings to make it binary in the end: “Not Important” vs. “Important.” And again, we wanted to compare answers for several different demographics. Let’s just look at age again. Here’s the odds ratio plot, again using the odds for the >70 group as our denominator for the odds ratio calculations:
Woah! Big differences, huh? I bet the proportions differ dramatically between the age groups too—
Wait, then what the hell is going on with those odds ratios?
Here, dear reader, is where we see an instance of “stuff that works well under normal circumstances goes batshit crazy when we reach extremes.” Take another look at those proportions. No one’s going to say that clean drinking water isn’t important, right? Those are definitely high proportions. Extremely high, one might say. So when we take an odds—the ratio of the proportion for “Important” to the proportion for “Not Important”—we’re dividing relatively big proportions by relatively small ones. The result? Big numbers (.97/.03 = 32.33, for example, compared to a modest .56/.44 ≈ 1.27). But the most important thing is that when you’re dealing with those extreme proportions, small differences are very much exaggerated in the odds ratios.
Suppose the proportion for the >70 group is .97. So its odds would be .97/.03 = 32.33. That 32.33 is our D again. And let’s say that the proportion for the 40 – 49 group is .98 (which I think it was, actually). Its odds would be .98/.02 = 49. The odds ratio: 49/32.33 ≈ 1.52. A sizeable odds ratio! That on its own would suggest a real difference in proportions for these two groups…when in fact, they only differ by .01.
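Here’s that exaggeration effect in a few lines (a sketch using the proportions from this example; note the denominators of the odds are the “no” proportions, .03 and .02):

```python
def odds(p):
    """Odds = P(yes) / P(no)."""
    return p / (1 - p)

d = odds(0.97)  # >70 group: about 32.33
n = odds(0.98)  # 40-49 group: 49.0
print(n / d)    # ≈ 1.52, from a mere .01 gap in the proportions

# The same .01 gap in a moderate range barely registers:
print(odds(0.51) / odds(0.50))  # ≈ 1.04
```

Same .01 difference in “yes” proportions, wildly different odds ratios, purely because the proportions sit near the extreme.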
Part IV: So what?
This whole rambling thing has a point, I promise. As I mentioned, when you see data like this in studies and papers and stuff, you’ll often see odds ratios reported. You won’t see the actual raw proportions nearly as often. In the examples I used here, the “environmental agencies” question was an example where the differences in the odds ratios are actually meaningful, since they reflect the actual trend in the proportions. The “drinking water” question, on the other hand, was an example where the odds ratios on their own are practically meaningless. They’re dramatic, but they over-dramatize very small differences in the actual proportions. You can’t trust them on their own. If the raw proportions are provided, look at them; if not, ask yourself whether dramatic odds ratios even make sense. Would you expect big differences in proportions across groups, or not? Is there something else going on instead?
So the moral of the story is this: be wary as you traverse the vast universe of academic papers! Odds ratios in the mirror may be less impressive than they appear.
(Edit: good lord, this is long. I envisioned it as like three paragraphs. Sorry.)
Said the statistician with the small sample size to the statistician with the large one: “I’m ‘n’-vious!”
POP QUIZ GO: What Englishman was it that Anders Hald called “a genius who almost single-handedly created the foundations for modern statistical science”?
It’s the same guy who Richard Dawkins labeled “the greatest biologist since Darwin.”
Give up? It’s SIR RONALD FISHER!
An evolutionary biologist, geneticist, and statistician, Fisher lived from 1890 to 1962. He had plans to enter the British Army upon his graduation from the University of Cambridge (where he studied mathematics and developed his interest in eugenics) in 1912, but he had horrible eyesight and failed the vision test. So what did he do instead? He worked as a statistician in the City of London, among other things. He also started to write for the Eugenics Review, which only increased his interest in stat methods.
In 1918, his paper “The Correlation Between Relatives on the Supposition of Mendelian Inheritance” was published, in which he introduced the method of analysis of variance (yup, ANOVA!). A year later, after taking a job with an agricultural station, he began to gather numerous sets of data—both large and small—which allowed him to develop methods of experimental design as well as small-sample statistics. Throughout his professional career, he continued to develop ANOVA, promoted maximum likelihood (ML) estimation, described the z-distribution (now used in the form of the F-distribution), and pretty much set up the foundation for the field of population genetics. He also (and I didn’t know this until I read more about him) opposed Bayesian statistics quite vehemently.
Anyway. Thought he deserved a bit of a mention today, since he died on this day in 1962.
Once you let me leave the house.
In the meantime…
STATS JOKES STATS JOKES STATS JOKES!
Because it’s that kind of a day.
- One day there was a fire in a wastebasket in the Dean’s office and in rushed a physicist, a chemist, and a statistician. The physicist immediately starts to work on how much energy would have to be removed from the fire to stop the combustion. The chemist works on which reagent would have to be added to the fire to prevent oxidation. While they are doing this, the statistician is setting fires to all the other wastebaskets in the office. “What are you doing?” they demand. “Well, to solve the problem, obviously you need a large sample size,” the statistician replies.
- What’s the question the Cauchy distribution hates the most?
“Got a moment?”
- Did you hear about the statistician who was looking all over for the sum of eigenvalues from a variance-covariance matrix but couldn’t find a trace?
- Did you hear about the nonparametrician who couldn’t get his driving license? He couldn’t pass the sign test.
- A middle-aged man suddenly contracted the dreaded disease kurtosis. Not only was this disease severely debilitating, but he had the most virulent strain called leptokurtosis. A close friend told him his only hope was to see a statistical physician who specialized in this type of disease. The man was very fortunate to locate a specialist, but he had to travel 800 miles for an appointment.
After a thorough physical exam, the statistical physician exclaimed, “Sir, you are indeed a lucky person in that the FDA has just approved a new drug called Mesokurtimide for your illness. This drug will bulk you in the middle, smooth out your stubby tail, and restore your longer range of functioning. In other words, you will feel ‘NORMAL’ again!”
- What did one regression coefficient say to the other regression coefficient?
“I’m partial to you!”
- Why are the mean, median, and mode like a valuable piece of real estate?
LOCATION! LOCATION! LOCATION!
Yay, I feel better now.
The idea (and written content) of this was Sean’s. I animated it because that’s what I did with stuff back in 2008.
Yes, I still find this absolutely hilarious.
[Edit: oops, this one got lost in the last upload and didn't get posted. Apologies!]
(Note: this has nothing to do with geometry.)
God I love stats.
A lot of the time it seems like the visual representation of data is sacrificed for the actual numerical analyses—be they summary statistics, ANOVAs, factor analyses, whatever. We seem to overlook the importance of “pretty pictures” when it comes to interpreting our data.
This is bad.
One of the first statisticians to recognize this issue and bring it into the spotlight was Francis Anscombe, an English statistician working in the mid-1900s. Anscombe was especially interested in regression—particularly in the idea of how outliers can have a nasty effect on an overall regression analysis.
In fact, Anscombe was so interested in the idea of outliers and of differently-shaped data in general, he created what is known today as Anscombe’s quartet.
No, it’s not a vocal quartet who sings about stats (note to self: make this happen). It is in fact a set of four different datasets, each with the same mean, the same variance, the same correlation between the x and y points, and the same regression line equation.
So what’s different between these datasets? Take a look at these plots:
See how nutso crazy different all those datasets look? They all have the same freaking means/variances/correlations/regression lines.
If this doesn’t emphasize the importance of graphing your data, I don’t know what does.
I mean seriously. What if your x and y variables were “amount of reinforced carbon used in the space shuttle heat shield” and “maximum temperature the heat shield can withstand,” respectively? Plots 2 and 3 would mean TOTALLY DIFFERENT THINGS for the amount of carbon that would work best.
So yeah. Graph your data, you spazmeisters.
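If you want to verify the “same stats, wildly different shapes” claim yourself, the quartet’s published values fit in a few lines of plain Python (the `corr` helper is hand-rolled so nothing beyond the stdlib is needed):

```python
from statistics import mean

x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]  # shared by datasets I-III
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]        # dataset IV
ys = [
    [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68],
    [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74],
    [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73],
    [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89],
]
xs = [x123, x123, x123, x4]

def corr(x, y):
    """Pearson correlation, computed by hand."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

for x, y in zip(xs, ys):
    # each dataset: mean x = 9.0, mean y ≈ 7.50, r ≈ 0.816
    print(round(mean(x), 1), round(mean(y), 2), round(corr(x, y), 3))
```

Four datasets, four matching rows of summary statistics, four completely different scatterplots.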
GOD I LOVE STATS.
I can’t stay away from that website with the mathematician birthday/deathday data.
So I decided to look at the data a little differently this time. Each square represents either a death or a birth on the day of the year. The data are broken up by month by a small white space.
Note the “most eventful” and “least eventful” days. And look at October and all its within-group variance.
I think I’m going to do some stats on this data. Because that’s what I do. But I can’t do it for like another week, since next week involves LOTS of studying/homework/panic attacks.
So if you recall, not too long ago I analyzed whether the frequencies of letters in the English language change depending on the position of the letter within the word. To do so, I gathered about 5,000 English words and compared the frequency distributions of the letters for the first five letters of the words. Click here to check that out if you haven’t already done so.
I’d wanted to go further into the words, but I didn’t have time/data to do so.
So that’s what I did today!
I pulled large samples of 4-, 5-, 6-, 7-, 8-, and 9-letter words from an online Scrabble dictionary*. For each sample, I went through and found the frequency distribution of the 26 letters of the alphabet for each letter place in the word (e.g., for the 4-letter words, I found the frequency distribution of the 26 letters for the first, second, third, and fourth place in the 4-letter word).
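The counting step looks something like this (a sketch with a made-up four-word sample; the real analysis used much larger samples per word length, and the function name `positional_frequencies` is mine):

```python
from collections import Counter

def positional_frequencies(words):
    """For a list of same-length words, return one relative-frequency
    distribution of letters per position in the word."""
    freqs = []
    for i in range(len(words[0])):
        counts = Counter(w[i] for w in words)
        total = sum(counts.values())
        freqs.append({letter: n / total for letter, n in counts.items()})
    return freqs

# hypothetical tiny sample of 4-letter words
freqs = positional_frequencies(["cats", "dogs", "pens", "stat"])
print(freqs[3])  # last-position distribution; note how often 's' shows up
```

Each resulting distribution can then be compared against the overall letter frequencies of English, position by position.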
Because something like this really calls for a visual, I made a gif for each word size (4, 5, 6, 7, 8, 9 letters) comparing the letter frequency for each letter place in the word (in red) against the overall frequency of the letters in the English language as a whole (grey). Check them all out and see if you notice a pattern as the gifs progress through the letter places in the words.
Did you notice it? Regardless of word size, the letter frequencies were most different from the overall frequency in the English language near the beginning and end of the words. Near the “middle” of the words (like the fourth and fifth letters of the nine-letter words, for example), the letter frequencies best matched the overall frequency in the English language (that is, the red distribution best matched the grey distribution).
In addition to the graphical aspect, I of course worked this out with numbers. Like last time, I measured “error” as the absolute value of the total difference between the red and grey distributions for each letter of each word. This confirmed what the gifs show: the smallest error was always for the one or two letters in the “middle” of each word, regardless of size.
Pretty damn cool, huh?
FYI, the six gifs sync and “restart” at the same time every 2,520 frames, in case you’re one of those people who wonders about those types of things.
*Yes, I realize the use of a Scrabble dictionary skews the results a bit, considering that plurals are included in the dictionary as well (notice the “S” is really frequent for the last letter in all cases). But plurals are words, after all, so I figured I’d include them anyway. The pattern still exists even if you omit the last letter from all gifs.
This gentleman is my new favorite living human being.
I’d add that linear algebra is an important middle step as well. A lot of stuff that I really enjoy in the field of statistics is stuff I wouldn’t understand nearly as well had I not taken linear algebra.
Basic statistics (like the stuff I’m teaching) –> Linear algebra –> more advanced statistics (FA, PCA, SEM) –> calculus (or taught concurrently with the previous) –> mathematical statistics
In my personal experience, I was able to get to SEM-level without calculus. I took calculus, but I never really used it in the context of stats.
But now that I’m taking it again, even at the basic level of 170, I’m seeing how this will apply to statistics (especially mathematical stats). And that’s super exciting.
So I don’t think this idea of “stats before calc” discounts the importance of calculus. Rather, I think it focuses on this idea of “practical versus theoretical” understanding. Statistics, especially very basic statistics, is something I think everyone should know. It’s practical, it’s applicable in every field. Calculus gives you a stronger understanding of WHY it’s so practical and applicable (at least in my opinion).
So yeah. Dr. Benjamin was also on the Colbert Report some time ago. I’ll have to find that vid.
Haha, speaking of the Report, I’m going to go watch the Maurice Sendak interviews again.