Tag Archives: analysis

Dicking around with Data

I have my first ounce of legitimate free time today and what do I do with it?


Today’s feature: analyzing Nobel laureates by birth dates.

Nobel Prizes are awarded for achievement in six different categories: physics, chemistry, physiology/medicine, literature, peace, and economic sciences. Thus far, there have been 863 prizes awarded to individuals and organizations.

The Nobel website has a bunch of facts on their laureates, including a database where you can search by birthday. So because I’m me and I like to analyze the most pointless stuff possible, here’s what today’s little flirtation with association entails:

1. Does the birth month of the laureate relate in any way to the category of the award (chem, medicine, etc.)?

2. Does the zodiac sign of the laureate relate in any way to the category of the award?

Vroom, vroom! Let’s do it.

Pre-Analysis: Examining the data

So I should preface this. I decided, upon inspecting the observed contingency table comparing Birth Month and Award Category, to drop the Economics prize altogether. I calculated that the expected cell counts would be very small (because the Economics category is actually the newest Nobel category); such small cell counts would totally throw the chi-square test. So we’re stuck with the other five categories for our analysis.
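For anyone who wants to run that check themselves, here is a minimal Python sketch of the expected-count calculation (row total times column total, divided by the grand total). The table is made up for illustration, not the actual Nobel data, and the threshold of 5 is the common rule of thumb, assumed here rather than taken from the post.

```python
def expected_counts(table):
    """Expected cell counts under independence: row total * column total / grand total."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]


def share_of_small_cells(table, threshold=5.0):
    """Fraction of cells whose expected count falls below the threshold."""
    cells = [e for row in expected_counts(table) for e in row]
    return sum(e < threshold for e in cells) / len(cells)


# Hypothetical counts: 3 birth-month groups (rows) x 3 award categories (columns).
# The last column plays the role of a sparse, recently added category.
toy_table = [
    [30, 25, 3],
    [28, 27, 2],
    [32, 24, 4],
]

print(expected_counts(toy_table))
print(share_of_small_cells(toy_table))
```

In this toy table every cell in the sparse column has an expected count below 5, which is the kind of pattern that argues for dropping or merging a category before running the chi-square test.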

Question 1: Relation of birth month to award category

Treating Birth Month as a categorical variable (with categories January – December) and Award Category as another categorical variable (with categories equal to the five remaining award categories), I performed a chi-square test to examine if there is an association between the two variables.

Results: χ2(45) = 81.334, p = 0.0007345. This suggests, using a significance level of .05, that there is a significant relationship between birth month and award category.
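A chi-square test of independence can be sketched in plain Python as below. This is not the code used for the analysis above: the function names are made up, the 2 x 2 table is hypothetical, and the p-value uses the standard closed-form series for integer degrees of freedom. Plugging in the reported statistic and degrees of freedom should land close to the reported p-value.

```python
import math


def chi_square_independence(table):
    """Return (statistic, degrees of freedom) for an r x c table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return stat, df


def chi_square_p_value(stat, df):
    """Upper-tail probability of a chi-square variable with integer df."""
    if df % 2 == 0:
        # Even df: exp(-stat/2) * sum over r < df/2 of (stat/2)^r / r!
        term, total = 1.0, 1.0
        for r in range(1, df // 2):
            term *= stat / (2 * r)
            total += term
        return math.exp(-stat / 2) * total
    # Odd df: normal tail plus a finite series in odd powers of sqrt(stat)
    x = math.sqrt(stat)
    term, total = x, 0.0
    for r in range(1, (df - 1) // 2 + 1):
        total += term
        term *= stat / (2 * r + 1)
    return math.erfc(x / math.sqrt(2)) + math.sqrt(2 / math.pi) * math.exp(-stat / 2) * total


# Hypothetical 2 x 2 table, only to show the mechanics
stat, df = chi_square_independence([[10, 20], [20, 10]])
print(stat, df, chi_square_p_value(stat, df))

# The statistic and degrees of freedom reported in the post
print(chi_square_p_value(81.334, 45))
```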

Examining the contingency table again (which I’d post here but it’s being a bitch and won’t format correctly, so I’m just going to list what I see):

  • Those born in the summer months (June – August) and the months of late fall (October, November) tend to own the Peace and Literature prizes.
  • August-, September-, and October-born have most of the Physics prizes.
  • The Chemistry prizes seem pretty evenly distributed throughout the months.
  • The summer-born seem to have the most awards overall.

Question 2: Relation of zodiac sign to award category.

I suspected this to have a similar p-value, just solely based on the above analysis.
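To build the zodiac table, each birthday first has to be mapped to a sign. Here is a rough Python sketch of that step. The cutoff dates are the commonly quoted Western ones and are an assumption (sources differ by a day here and there), and `zodiac_sign` is a made-up helper name, not something from the Nobel database.

```python
# (month, day) on which each sign is assumed to start
SIGN_STARTS = [
    ((1, 20), "Aquarius"),
    ((2, 19), "Pisces"),
    ((3, 21), "Aries"),
    ((4, 20), "Taurus"),
    ((5, 21), "Gemini"),
    ((6, 21), "Cancer"),
    ((7, 23), "Leo"),
    ((8, 23), "Virgo"),
    ((9, 23), "Libra"),
    ((10, 23), "Scorpio"),
    ((11, 22), "Sagittarius"),
    ((12, 22), "Capricorn"),
]


def zodiac_sign(month, day):
    """Return the sign for a birthday given as month and day numbers."""
    sign = "Capricorn"  # January 1-19 still belongs to the sign that began in December
    for start, name in SIGN_STARTS:
        if (month, day) >= start:
            sign = name  # keep the latest sign whose start date has passed
    return sign


print(zodiac_sign(8, 1))
print(zodiac_sign(1, 5))
```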

Results: I get χ2(54) = 199.8912, p < 0.0001. So this suggests, using the same cutoff, that there is a significant relationship between zodiac sign and award category, which makes sense considering what we just saw with the months. What’s interesting is that, judging just by the size of the chi-square, this relationship looks even stronger than the one above.
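One caveat on comparing raw chi-square values: the statistic grows with the sample size and the table dimensions, so an effect size such as Cramér's V is a fairer yardstick for "stronger". The sketch below shows the formula only. It is not something computed in this post, and the sample size in the example is a placeholder, since the exact number of laureates in each table is not stated here.

```python
import math


def cramers_v(chi2, n, n_rows, n_cols):
    """Cramér's V: sqrt(chi2 / (n * min(rows - 1, cols - 1))), ranging from 0 to 1."""
    return math.sqrt(chi2 / (n * min(n_rows - 1, n_cols - 1)))


# Placeholder inputs for illustration only (not the Nobel tables)
print(cramers_v(chi2=50.0, n=100, n_rows=3, n_cols=3))
```

With real data, `n` would be the total count in the contingency table and the row and column counts would match the table actually tested.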

Looking at the contingency table for this relationship, here are a few of my observations:

  • Aries, Geminis, Virgos, and Libras own the Medicine awards.
  • Cancers, Sagittarians, and Aquarians own the Physics awards.
  • The first five zodiac signs (Aries – Leo) seem to dominate Literature.
  • Capricorns are interesting. They have the fewest awards overall, but 30% of the awards they do have are in Peace. That’s far more (percentage-wise) than any other sign. Strange noise.


Blameworthiness and the Anonymous Judge: An Analysis of FML Categories

The website Fmylife was created on January 13, 2008 and serves as a blog for people to post anecdotes relating to unfortunate goings-on (either by their doing or others’) in their lives. The stories that are published allow readers of the blog to essentially assess the placement of blame for each anecdote. As Wiki so succinctly puts it, “anybody who visits the site can decide if the writer of each anecdote’s life indeed ‘sucks’ [‘fuck your life’ or ‘FYL’] or if he or she ‘deserved’ what happened [‘you deserved it’ or ‘YDI’].”
The FML posts belong to one of seven categories: Love, Money, Kids, Work, Health, Miscellaneous, and Intimacy.

Party on.

Anyway, me being me, I wanted to see if people rating the FMLs rated them differently (FYL vs. YDI) depending on the category of the FML. That is, I wanted to see whether people assigned blame (quantified by the number of YDIs voted) to the anecdote poster differently depending on what category the FML belonged to. I had two hypotheses:

a) People would assign blame to the poster more readily when the anecdote belonged to a more “personal” or “individual” category (Money and Health, maybe Miscellaneous).
b) People would be more willing to say FYL to the poster if the anecdote is from a category that involved other individuals (Love or Kids or Work).

Utilizing the “random FML” button, I acquired a random sample of 30 FMLs from each category, save the Intimacy category (‘cause FMLs from that category are not included in the random search). I noted the number of FYLs and the number of YDIs for each anecdote and then computed a paired t-test comparison of mean differences for each category.

H0: µFYL = µYDI for all categories. This means that there is no significant difference between the mean number of FYLs and the mean number of YDIs, regardless of the category.

Ha: µFYL < µYDI for Money and Health categories (meaning most people would assign blame to the poster) and µFYL > µYDI for Love, Kids, and Work categories (meaning most people would NOT assign blame to the poster).

Analyses were done in R. All t-tests were performed under the assumption of unequal variances, as was indicated by the Levene Tests for each group (performed using the lawstat package in R).
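For readers who want to see the mechanics, here is a small Python sketch of the paired t statistic: take the FYL-minus-YDI difference for each anecdote, then divide the mean difference by its standard error. It is not the R code used for the analysis, the vote counts are invented, and `paired_t` is just an illustrative name.

```python
import math
import statistics


def paired_t(fyl_votes, ydi_votes):
    """Return (t statistic, degrees of freedom) for paired vote counts."""
    diffs = [f - y for f, y in zip(fyl_votes, ydi_votes)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    std_error = statistics.stdev(diffs) / math.sqrt(n)  # sample SD of the differences
    return mean_diff / std_error, n - 1


# Hypothetical vote counts for five anecdotes in one category
fyl = [120, 95, 130, 110, 105]
ydi = [100, 90, 100, 105, 95]

t_stat, df = paired_t(fyl, ydi)
print(t_stat, df)
```

With 30 anecdotes per category the degrees of freedom come out to 29, which matches the t(29) values reported below. A positive t means more FYL votes than YDI votes on average.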

Love: t(29) = 5.04, p < 0.0001*
Money: t(29) = 1.76, p = 0.09
Kids: t(29) = 4.24, p = 0.0002*
Work: t(29) = 3.85, p = 0.0005*
Health: t(29) = 1.601, p = 0.06
Miscellaneous: t(29) = 0.922, p = 0.3641
*significant at the 0.05 level

So what does this mean?

While the results were not statistically significant for the “individual-based” groups Money and Health (and Miscellaneous, but I didn’t have any specific hypotheses regarding that category), my second hypothesis received statistical support!
That is, at the 0.05 level of significance, significantly fewer readers place blame on the individual FML poster in the categories of Love, Kids, and Work—categories that were deemed by me to be those that involved the actions of others more than just the action of the individual poster.

So I guess we can very loosely conclude based on my oh-so-scientific way of categorizing the categories (haha) that people who vote on Fmylife tend to assign blame more readily to the individual poster when said poster’s anecdote belongs to a category that includes more individual-based actions than when the anecdote belongs to a category that includes the actions of others.


30-Day Meme – Day 5: Your favorite quote.
I’m not much of a quote person, but I still really like the quote I used in my senior yearbook: “become who you are,” as said by Friedrich Nietzsche. It’s such a simple quote and kind of sums up what I think life is all about.

Haha, I don’t have much more to say about today’s meme entry.

Data, data everywhere and not a model to fit

Things a normal person does to relax:
– sleeps
– hangs out with friends
– copious amounts of alcohol
– screws around

Things Claudia does to relax:
– ignores sleep
– locks herself in her apartment
– copious amounts of Red Bull
– fits a structural equation model to her music data


I’ve spent a cumulative 60+ hours solely on my thesis writing this week, and considering all the other crap I had to finish, what with the semester ending and all, that’s a pretty large amount of time.
Despite that, it’s pretty sad that I spent my first few hours of free time this week fitting an SEM to my music.
BUT IT HAPPENED, so here it is.

With “number of stars” the variable I was most interested in, I wanted to fit what I considered to be a reasonable model that showed the relationships between the number of stars a song eventually received from me (I rarely if ever change the number after I’ve assigned the stars) and other variables, such as play count and date acquired. Note: structural equation modeling is like doing a bunch of regressions at once, allowing you to fit more complicated models and to see where misfit most likely occurs.
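To make the "bunch of regressions" idea a bit more concrete, here is a Python sketch of just one such regression, a single path from play count to stars, reported as a standardized coefficient. This is not an SEM and not the software used for the model. The data are invented, and with only one predictor the standardized slope is simply the Pearson correlation.

```python
import math
import statistics


def standardized_slope(x, y):
    """Standardized regression coefficient of y on a single predictor x."""
    mean_x, mean_y = statistics.mean(x), statistics.mean(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    raw_slope = sxy / sxx
    # Rescale by the ratio of standard deviations to get the standardized value
    return raw_slope * math.sqrt(sxx / syy)


# Hypothetical play counts and star ratings for six songs
play_counts = [2, 5, 9, 14, 20, 31]
stars = [1, 2, 3, 3, 4, 5]

print(standardized_slope(play_counts, stars))
```

An SEM estimates several paths like this one simultaneously and also tests how well the whole set reproduces the observed covariances, which separate regressions cannot do.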

Cool? Cool.


This is the initial model I proposed. The one-way arrows indicate causal relationships (e.g., there is a causal relationship in my proposed model between the genre of a song and the number of stars it has); the double-headed arrow indicates a general correlation without direction. Oh, and “genre” was coded with numbers 1 through 11, with lower numbers indicating my least favorite genres and higher numbers indicating my favorite genres. Important for later.

Using robust maximum likelihood estimation (because of severe nonnormality), I tested this model in terms of its ability to describe the covariance structure evident in the sample (which, in this case, is the 365 songs I downloaded last year).

So here’s what we got!
Satorra-Bentler scaled χ2(7) = 9.68, p = 0.207
Robust CFI: .992
Robust RMSEA: .032
Average absolute standardized residual: 0.0190

All these stats indicate a pretty awesome fit of the model to the data. This is shocking, considering the ridiculous nonnormality in the data itself and the fact that this is the first model I tried.
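As a sanity check on the fit numbers, the textbook RMSEA formula can be sketched in a couple of lines of Python. Robust RMSEA as reported by SEM software may apply extra scaling corrections, so treat this as the basic version only. The inputs are the scaled chi-square, its degrees of freedom, and the 365 songs mentioned above.

```python
import math


def rmsea(chi2, df, n):
    """Basic RMSEA: sqrt(max(chi2 - df, 0) / (df * (n - 1)))."""
    return math.sqrt(max(chi2 - df, 0.0) / (df * (n - 1)))


print(round(rmsea(9.68, 7, 365), 3))  # 0.032
```

The basic formula reproduces the reported value to three decimals. Values below about .05 are conventionally read as close fit.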

Here are the standardized pathway values (analogous to regression coefficients, so if you know what those mean, you can interpret these), with the significant values marked with asterisks:

So what’s this all mean? Well, in general, the relationships I’ve suggested with this model are, according to the stats, a good representation of the actual relationships existing among the variables in real life. Specifically:
– There is a significant positive relationship between genre and play count, which makes sense. Songs from my more preferred genres are played more often.
– There is a strong positive relationship between play count and stars, which also obviously makes a lot of sense.
– The significant negative relationship between date added and play count makes sense as well; the more recently downloaded songs (those with high “date added” numbers) have been played less frequently than older songs.
– There is no significant correlation between genre and song length, which surprises me.
– Genre, length, and play count all have significant, direct effects on how many stars I give a song.
– Another interesting finding is the positive relationship between stars and skips, which suggests that the higher number of stars a song has, the more often it is skipped. Perhaps this is just due to the sheer number of times I play the higher-starred songs. Who knows?

Yay! Fun times indeed.