(Note: this has nothing to do with geometry.)
God I love stats.
A lot of the time it seems like the visual representation of data is sacrificed for the actual numerical analyses—be they summary statistics, ANOVAs, factor analyses, whatever. We seem to overlook the importance of “pretty pictures” when it comes to interpreting our data.
This is bad.
One of the first statisticians to recognize this issue and bring it into the spotlight was Francis Anscombe, an Englishman working in the mid-20th century. Anscombe was especially interested in regression—particularly in how outliers can have a nasty effect on an overall regression analysis.
In fact, Anscombe was so interested in the idea of outliers and of differently-shaped data in general, he created what is known today as Anscombe’s quartet.
No, it’s not a vocal quartet who sings about stats (note to self: make this happen). It is in fact a set of four different datasets, each with the same x and y means, the same x and y variances, the same correlation between x and y, and the same regression line equation.
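If you want to check those identical stats yourself, here’s a quick Python sketch using the first two of Anscombe’s four datasets (values from his 1973 paper):

```python
# Anscombe's datasets I and II share the same x values.
x  = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y1 = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
y2 = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]

def summary(x, y):
    """Mean of y, variance of y, correlation, and regression slope/intercept."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx                  # regression slope
    intercept = my - slope * mx        # regression intercept
    r = sxy / (sxx * syy) ** 0.5       # Pearson correlation
    return my, syy / (n - 1), r, slope, intercept

for y in (y1, y2):
    my, var, r, b1, b0 = summary(x, y)
    print(f"mean={my:.2f} var={var:.2f} r={r:.3f} y={b0:.2f}+{b1:.3f}x")
```

Both datasets come out with mean ≈ 7.50, variance ≈ 4.13, r ≈ 0.816, and the line y = 3.00 + 0.500x, even though their scatterplots look nothing alike.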
So what’s different between these datasets? Take a look at these plots:
See how nutso crazy different all those datasets look? They all have the same freaking means/variances/correlations/regression lines.
If this doesn’t emphasize the importance of graphing your data, I don’t know what does.
I mean seriously. What if your x and y variables were “amount of reinforced carbon used in the space shuttle heat shield” and “maximum temperature the heat shield can withstand,” respectively? Plots 2 and 3 would mean TOTALLY DIFFERENT THINGS for the amount of carbon that would work best.
So yeah. Graph your data, you spazmeisters.
GOD I LOVE STATS.
I can’t stay away from that website with the mathematician birthday/deathday data.
So I decided to look at the data a little differently this time. Each square represents either a death or a birth on the day of the year. The data are broken up by month by a small white space.
Note the “most eventful” and “least eventful” days. And look at October and all its within-group variance.
I think I’m going to do some stats on this data. Because that’s what I do. But I can’t do it for like another week, since next week involves LOTS of studying/homework/panic attacks.
So if you recall, not too long ago I analyzed whether the frequencies of letters in the English language change depending on the letter’s position in the word. To do so, I gathered about 5,000 English words and compared the frequency distributions of the letters for the first five letter places of the words. Click here to check that out if you haven’t already done so.
I’d wanted to go further into the words, but I didn’t have time/data to do so.
So that’s what I did today!
I pulled large samples of 4-, 5-, 6-, 7-, 8-, and 9-letter words from an online Scrabble dictionary*. For each sample, I went through and found the frequency distribution of the 26 letters of the alphabet for each letter place in the word (e.g., for the 4-letter words, I found the frequency distribution of the 26 letters for the first, second, third, and fourth place in the 4-letter word).
Because I think something like this is something that requires some sort of visual, I made a gif for each word size (4, 5, 6, 7, 8, 9 letters) that compares the letter frequency for each letter place in the word (in red) compared to the overall frequency of the letters in the entirety of the English language (grey). Check them all out and see if you notice a pattern as the gifs progress through the letter places in the words.
Did you notice it? Regardless of word size, the letter frequencies were most different from the overall frequency in the English language near the beginning and end of the words. Near the “middle” of the words (like the fourth and fifth letters of the nine-letter words, for example), the letter frequencies best matched the overall frequency in the English language (that is, the red distribution best matched the grey distribution).
In addition to the graphical aspect, I of course worked this out with numbers. Like last time, I measured “error” as the sum of the absolute differences between the red and grey distributions across the 26 letters, computed for each letter place of each word size. This confirmed what the gifs show: the smallest error was always for the one or two letter places in the “middle” of each word, regardless of size.
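Here’s a minimal Python sketch of that error metric. The word list and the flat “overall English” distribution are toy stand-ins, not the actual Scrabble-dictionary data or published letter frequencies:

```python
from collections import Counter

# Toy stand-ins: a few 4-letter words and a flat "overall English"
# distribution (the real analysis used a large Scrabble-dictionary
# sample and real letter frequencies).
words = ["word", "play", "stat", "data", "gram", "plot"]
overall = {ch: 1 / 26 for ch in "abcdefghijklmnopqrstuvwxyz"}

def position_error(words, pos, overall):
    """Sum of |observed - expected| across the 26 letters at one letter place."""
    counts = Counter(w[pos] for w in words)
    n = len(words)
    return sum(abs(counts.get(ch, 0) / n - overall[ch]) for ch in overall)

for pos in range(4):
    print(f"letter place {pos + 1}: error = {position_error(words, pos, overall):.3f}")
```

The smaller this number, the closer the red distribution sits to the grey one at that letter place.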
Pretty damn cool, huh?
FYI, the six gifs sync and “restart” at the same time every 2,520 frames, in case you’re one of those people who wonders about those types of things.
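About that 2,520: assuming each gif has one frame per letter place, the six cycle lengths are 4 through 9 frames, and they all realign at their least common multiple:

```python
from math import lcm  # math.lcm takes multiple arguments as of Python 3.9

# Six gifs cycling every 4, 5, 6, 7, 8, and 9 frames all line up
# again at the least common multiple of those lengths.
print(lcm(4, 5, 6, 7, 8, 9))  # → 2520
```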
*Yes, I realize the use of a Scrabble dictionary skews the results a bit, considering that plurals are included in the dictionary as well (notice the “S” is really frequent for the last letter in all cases). But plurals are words, after all, so I figured I’d include them anyway. The pattern still exists even if you omit the last letter from all the gifs.
This gentleman is my new favorite living human being.
I’d add that linear algebra is an important middle step as well. A lot of stuff that I really enjoy in the field of statistics is stuff I wouldn’t understand nearly as well had I not taken linear algebra.
Basic statistics (like the stuff I’m teaching) -> linear algebra -> more advanced statistics (FA, PCA, SEM) -> calculus (or taught concurrently with the previous) -> mathematical statistics
In my personal experience, I was able to get to SEM-level without calculus. I took calculus, but I never really used it in the context of stats.
But now that I’m taking it again, even at the basic level of 170, I’m seeing how this will apply to statistics (especially mathematical stats). And that’s super exciting.
So I don’t think this idea of “stats before calc” discounts the importance of calculus. Rather, I think it focuses on this idea of “practical versus theoretical” understanding. Statistics, especially very basic statistics, is something I think everyone should know. It’s practical, it’s applicable in every field. Calculus gives you a stronger understanding of WHY it’s so practical and applicable (at least in my opinion).
So yeah. Dr. Benjamin was also on the Colbert Report some time ago. I’ll have to find that vid.
Haha, speaking of the Report, I’m going to go watch the Maurice Sendak interviews again.
You know, sometimes the most “pointless” analyses turn up the coolest stuff.
Today I had…get ready for it…FREE TIME! So I decided to try analyzing a fairly large dataset using SAS (’cause SAS can handle large datasets better than R and because I need to practice my coding anyway).
I went here to get a list of the 5,000 most common words in the English language. What I wanted to do was answer the following questions:
1. What is the frequency distribution of letters looking at just the first letter of each word?
2. Does the distribution in (1) differ from the overall distribution in the whole of the English language?
3. Does either frequency distribution hold for the second letter, third letter, etc.?
LET’S DO THIS!
So the frequency distribution of characters for the first letter of words is well-established. Wiki, of course, has a whole section on it. Note that this distribution is markedly different from the distribution when you consider the frequency of character use overall.
I found practically the same thing with my sample of 5,000 words.
So this wasn’t really anything too exciting.
What I did next, though, was to look at the frequencies for the next four letters (so the second letter of a word, the third letter, the fourth, and the fifth).
Now obviously there were many words in the top 5,000 that weren’t five letters long. So with each additional letter I did lose some data. But I adjusted the comparative percentages so that any differences we saw weren’t due to the data loss.
Anyway. So what I did was plot the “overall frequency” in grey—that is, the frequency of each letter in the whole of the English language—against the observed frequency in my sample of 5,000 words in red—again, for the first, second, third, fourth, and fifth letter of the word.
And what I found was actually really interesting. The further “into” a word we got, the closer the frequencies conformed to the overall frequency in the English language.
The x-axis is the letter (A=1, B=2, … Z=26). The y-axis is the number of instances out of a sample of 5,000 words. See how the red distribution gets closer in shape to the grey distribution as we move from the first to the fifth letter in the words? The “error”—the sum of the absolute differences between the red and grey distributions—gets smaller with each letter further into the word.
I was going to go further into the words, but 1) I left my data at school and 2) I figured anyway that after five letters, I would find a substantial drop in data because there would be a much lower count of words that were 6+ letters long.
COOL, huh? It’s like a reverse Benford’s Law.*
*Edit: actually, now that I think about it, it’s not really a REVERSE Benford’s Law. As I found when I analyzed that pattern, it too rapidly disintegrated: by the second and third digits of a given number, the frequency of the digits 0 – 9 had already conformed to the expected uniform frequency (1/10 each).
I have my first ounce of legitimate free time today and what do I do with it?
“I GOTTA ANALYZE SOME DATA!”
Today’s feature: analyzing Nobel laureates by birth dates.
Nobel Prizes are awarded for achievement in six different categories: physics, chemistry, physiology/medicine, literature, peace, and economic sciences. Thus far, there have been 863 prizes awarded to individuals and organizations.
The Nobel website has a bunch of facts on their laureates, including a database where you can search by birthday. So because I’m me and I like to analyze the most pointless stuff possible, here’s what today’s little flirtation with association entails:
1. Does the birth month of the laureate relate in any way to the category of the award (chem, medicine, etc.)?
2. Does the zodiac sign of the laureate relate in any way to the category of the award?
Vroom, vroom! Let’s do it.
Pre-Analysis: Examining the data
So I should preface this. I decided, upon inspecting the observed contingency table comparing Birth Month and Award Category, to drop the Economics prize altogether. I calculated that the expected cell counts would be very small (because the Economics category is actually the newest Nobel category), and expected counts below the usual rule-of-thumb minimum of 5 would totally throw off the chi-square test. So we’re stuck with the other five categories for our analysis.
Question 1: Relation of birth month to award category
Treating Birth Month as a categorical variable (with categories January – December) and Award Category as another categorical variable (with the five remaining award categories), I performed a chi-square test to examine whether there is an association between the two variables.
Results: χ2 (45) = 81.334, p = 0.0007345. This suggests, using a significance level of .05, that there is a significant relationship between birth month and award category.
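For anyone who wants to run the same kind of test, here’s what the mechanics look like in Python with SciPy, on a small made-up table (not the actual Nobel counts):

```python
from scipy.stats import chi2_contingency

# Hypothetical 3x2 contingency table (rows could be month groups,
# columns award categories) -- illustrative numbers only.
table = [[12,  7],
         [ 9, 14],
         [ 5, 11]]

chi2, p, dof, expected = chi2_contingency(table)
print(f"chi2({dof}) = {chi2:.3f}, p = {p:.4f}")
# dof = (rows - 1) * (cols - 1); the `expected` array is what you
# inspect to make sure no cell count falls below the rule of thumb of 5.
```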
Examining the contingency table again (which I’d post here but it’s being a bitch and won’t format correctly, so I’m just going to list what I see):
- Those born in the summer months (June – August) and the months of late fall (October, November) tend to own the Peace and Literature prizes.
- August-, September-, and October-born have most of the Physics prizes.
- The Chemistry prizes seem pretty evenly distributed throughout the months.
- The summer-born seem to have the most awards overall.
Question 2: Relation of zodiac sign to award category.
I suspected this to have a similar p-value, just solely based on the above analysis.
Results: I get χ2 (54) = 199.8912, p < 0.0001. So this suggests, using our same significance level, that there is a significant relationship between zodiac sign and award category. Which makes sense, considering what we just saw with the months. But what’s interesting is that, judging just by the size of the chi-square, this relationship is actually stronger than the one above.
Looking at the contingency table for this relationship, here are a few of my observations:
- Aries, Geminis, Virgos, and Libras own the Medicine awards.
- Cancers, Sagittarians, and Aquarians own the Physics awards.
- The first five zodiac signs (Aries – Leo) seem to dominate Literature.
- Capricorns are interesting. They have the fewest awards overall, but 30% of the awards they do have are in Peace. That’s far more (percentage-wise) than any other sign. Strange noise.
OKAY THAT’S ALL.
So in my researching for my English essay, I read quite a bit about the Royal Society (or, the Royal Society of London for Improving Natural Knowledge as its full title stands).
Well, today I found out that one of the main developers of my favorite statistical test EVER (factor analysis) was also a part of the Society for a while: Charles Spearman!
So let’s check him out, shall we?
Charles Spearman (1863 – 1945) resigned from 15 years of service in the British Army to pursue a PhD in experimental psychology. By the time he obtained his degree he had already published a paper on the factor analysis of human intelligence. This paper impressed many of his fellow psychologists at the time, mainly because of Spearman’s rigorous application of mathematical techniques and models (factor analysis!) to the analysis of the human mind.
In fact, his work was so impressive that it earned him a place in the Royal Society in 1924. Spearman continued his work, focusing mainly on developing new statistical techniques that could be applied to, among other things, psychological constructs and concepts. He was especially influenced by Galton (developer of correlation) and worked to create a nonparametric version of Pearson’s method of calculating correlation.*
But probably his greatest contribution had to be the part he played in the development of factor analysis. Even today, it’s probably one of the most used statistical techniques in the realm of the social sciences, particularly in psychology.
So there you go! A little bit about one of the founders of the super awesome factor analysis. Cool, huh?
*Actually, this ended up as another “two smart dudes can’t get along feud” between himself and Pearson, the latter not appreciating the nonparametric adaptation of his technique. What do they put in that Royal Society water, anyway?
MIND HAS BEEN BLOWN.
PERSPECTIVE HAS BEEN CHANGED.
PANTS ARE OFF.
THAT LAST STATEMENT IS IRRELEVANT.
You guys probably all already knew this ‘cause you’re smarter than me, but I just learned that the inflection points of the normal distribution occur at the first standard deviation above and below the mean.
Inflection points, remember, are the points on a curve where the concavity changes (where f’’(x) = 0 and the sign of f’’ flips).
I’m not quite sure why that’s a significant thing because I don’t think I’m quite at the math level I need to be to understand it, but I’m pretty sure that it’s a significant thing. Either way, VERY COOL.
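You can actually watch the concavity flip numerically. Here’s a quick Python check using a central-difference approximation of the second derivative (the μ and σ values are arbitrary):

```python
from math import exp, pi, sqrt

def normal_pdf(x, mu=0.0, sigma=1.0):
    return exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * sqrt(2 * pi))

def second_deriv(f, x, h=1e-4):
    """Central-difference approximation of f''(x)."""
    return (f(x + h) - 2 * f(x) + f(x - h)) / h ** 2

mu, sigma = 3.0, 2.0
f = lambda x: normal_pdf(x, mu, sigma)

# Concavity flips exactly at mu +/- sigma: f'' is negative just
# inside one standard deviation and positive just outside it.
print(second_deriv(f, mu + 0.9 * sigma))  # negative (concave down)
print(second_deriv(f, mu + 1.1 * sigma))  # positive (concave up)
print(second_deriv(f, mu + sigma))        # approximately zero
```

Analytically, f’’(x) works out to f(x)·((x−μ)² − σ²)/σ⁴, which is zero exactly when (x−μ)² = σ², i.e., at one standard deviation on either side of the mean.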
I’mma go screw around with calculus now.
Freaking love calculus.
Long-time readers of my blog may remember the post I did a long time ago in which I looked at the zodiac signs of the Presidents of the United States in conjunction with assassinations/assassination attempts.
For whatever reason, that little exploration popped back into my head the other day so I decided to do a more thorough analysis along the same lines.
I went to Wikipedia’s list of assassinated people and pulled both assassination dates and birthdates (when available) into a huge-ass dataset.
Questions of interest:
- Is there a time of the year where more assassinations have tended to occur throughout the world?
- Do assassination victims tend to be born at certain times of the year (and in certain zodiac signs, just for fun), taking into account the general overall frequencies of specific birthdays?
- (And I was going to see whether trends in assassinations differ between the continents, but I totally forgot to for this blog, haha. Maybe later.)
So! The data!
As I said, I looked at both the birthdate (when available; n = 612) and the assassination date (just month and day, not year; n = 778) for all of the victims. I didn’t think it made any sense doing any sort of paired data analysis (pairing birthdate and death date of each individual) because when you think about it, the two should be independent of one another. My being born on February 2nd shouldn’t affect the day and month on which I’d be assassinated, right?
In fact, I figured there’d be no relationship between birth date and death date at all…but I was kind of wrong.
Take a look at this plot (click to enlarge).
This shows all 1,390 points of data—the 612 birthdates and the 778 death dates—and their frequencies by month of the year. Does anybody else find the fact that the two lines are kind of a reflection of each other along a horizontal axis…strange? Especially the stretch from August through March, holy crap.
Keep in mind that this is NOT paired data. Haha, I had to keep telling myself that while looking at this because I kept trying to make logical sense of it. There’s no reason (that I can think of, anyway) why this pattern should be occurring, and yet there it is. Yes, I know it’s not a perfect reflection, I know the differences in counts are pretty small given the sample size, and I know the differences are exaggerated by the Y-axis range (my fault). But still. You have to admit that’s freaky.
Months in which assassinations were most common: June, February, and October.
Months in which most eventual assassination victims were born: March, January, May, and September. Nothing too remarkable; the general frequency of people born in these months versus the number of assassination victims born in these months doesn’t seem markedly different to me.
Most commonly assassinated zodiac signs: Virgo, Aries, Aquarius, and Gemini (which, if you believe in the zodiac affecting personality, could just mean that people of these signs are more apt to take positions that leave them more vulnerable to assassination attempts).
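If you want to replicate the zodiac coding yourself, here’s a sketch of a date-to-sign mapping in Python. Note that the cutoff dates shift by a day in some references, so treat these boundaries as approximate:

```python
# Approximate zodiac boundaries: (start month, start day, sign).
# Exact cutoffs vary by a day depending on the source.
SIGNS = [
    (1, 20, "Aquarius"), (2, 19, "Pisces"), (3, 21, "Aries"),
    (4, 20, "Taurus"), (5, 21, "Gemini"), (6, 21, "Cancer"),
    (7, 23, "Leo"), (8, 23, "Virgo"), (9, 23, "Libra"),
    (10, 23, "Scorpio"), (11, 22, "Sagittarius"), (12, 22, "Capricorn"),
]

def zodiac(month, day):
    """Map a (month, day) date to its zodiac sign."""
    for start_month, start_day, sign in reversed(SIGNS):
        if (month, day) >= (start_month, start_day):
            return sign
    return "Capricorn"  # Jan 1 - Jan 19 wraps around the year end

print(zodiac(2, 2))  # February 2nd falls in Aquarius
```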
It’s a paper by Hadley Wickham and Lisa Stryjewski detailing John Tukey’s 1970 introduction of the boxplot as well as changes, improvements, and adaptations to the visualization.
I’m not going to summarize it here ’cause I think you need the visuals, and since I already post enough crap here without explicit permission, I won’t be reposting any figures from the article. But seriously, read it. It’s fascinating.