# Big Week o’ Blog Stats Celebration, Day 1: Mean Word Comparisons by Category

STATS TIME! Are you excited?

First, I want to preface all of this with the list of variables I kept track of when going through my blog archive:

• Blog Number. My first blog is coded as 1, the second as 2, the third is 3, and so on up until 2193.
• Year. Which blogging year the blog came from. There are six years, each spanning May – May.
• Month. January, February, etc.
• Day. The 1st of the month, 2nd of the month, etc.
• Weekday. Monday, Tuesday, etc.
• Word Count. Word count of each post, not counting the title.
• GFI. Gunning Fog Index.
• Punctuation. How many punctuation marks the post contained.
• Title. 0 = title unrelated to blog content, 1 = title directly relevant to blog content, and 2 = ambiguous title; could be related or unrelated.
• Survey. 0 = blog does not contain a survey, 1 = blog contains survey.
• Image. 0 = blog does not contain any images, 1 = blog contains 1+ image(s).
• Category. What category did I tag my blog as (details below).

ALSO NOTE: significance is always judged at the α = 0.05 level (that is, a result counts as significant when p < 0.05). Just didn’t want to have to keep specifying that. :)

So! Today we’re looking at Categories. There are 35 of them (or there will be once I go through and delete all the old “defunct” tags from the few blogs that still have them). Here’s the list in case anybody gives a crap:

So what are we looking at within this sexy, large dataset with respect to categories, then?

Questions of Interest
A) What is the distribution of the categories? That is, which categories are most popular and which are hardly ever used?
B) Do certain categories have a significantly different number of words per post than the other categories?

Hypotheses
A: The most popular categories (by percent) will be Blogging, School, and probably Surveys.
B: The least popular categories will be Ramblings and Sports.
C: Categories with a significantly different number of words per post will be Surveys, Philosophy, and Rants.
D: The three categories specified in Hypothesis C will have higher word counts, not lower.

LET’S DO THIS NOISE.

Analyses
First up, a pie chart! This was my first attempt at visualizing category percentages. By the way, I definitely would have titled this like a good little statistician, but I couldn’t get the image large enough (in my opinion) with the title included. So I’ll call it Percent of Blogs by Category (NOT percent of words by category; that’s just in the ANOVA below).

I had to screw around with this a lot to get it in the easiest to read color scheme. Pie chart with 35 slices = not the best visual, but I think it’s still better than a bar graph in this case.

Table o’ actual counts (click to blow it up so it’s actually readable, haha):

God, all those Blogging blogs.
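
For anyone who wants to tally their own category counts and percentages, here is a minimal Python sketch of the idea. It is not how I actually did it. The function name `category_percentages` and the tiny `toy_tags` list are made up for illustration and are not the real blog data.

```python
from collections import Counter

def category_percentages(categories):
    """Return {category: percent of posts}, ordered from most to least common."""
    counts = Counter(categories)
    total = sum(counts.values())
    return {cat: 100.0 * n / total for cat, n in counts.most_common()}

# Made-up tags for illustration only; NOT the real blog data.
toy_tags = ["Blogging", "School", "Blogging", "Surveys",
            "Blogging", "School", "Sports", "Blogging"]

percentages = category_percentages(toy_tags)
for cat, pct in percentages.items():
    print(f"{cat}: {pct:.1f}%")
```

Each post counts once toward its category, so this gives percent of blogs by category, not percent of words.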

Second: ANOVAs! Well, okay, just one. But it’s an ANOVA!

According to a more in-depth, ANOVA-driven analysis…

• The mean Word Count per blog is statistically significantly different depending on blog Category, F = 23.184, p < 0.001.
• Blogs in the Surveys category have a significantly higher word count than the other categories, t = 7.739, p < 0.0001.
• Blogs in the Writing category have a significantly higher word count than the other categories, t = 3.624, p < 0.001.
• Blogs in the Philosophy category have a significantly higher word count than the other categories, t = 3.365, p < 0.001.
• Blogs in the Rants category have a significantly higher word count than the other categories, t = 2.480, p < 0.05.
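
In case the F statistic looks like magic, here is a rough sketch of what a one-way ANOVA computes: the variation between category means divided by the variation within categories. The actual analysis was run in R. This is plain Python with made-up numbers, and the names `one_way_anova_f` and `toy_counts` are hypothetical.

```python
def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a one-way ANOVA.

    `groups` maps a category name to a list of word counts.
    """
    all_values = [v for vals in groups.values() for v in vals]
    n_total = len(all_values)
    k = len(groups)
    grand_mean = sum(all_values) / n_total

    ss_between = 0.0  # spread of the group means around the grand mean
    ss_within = 0.0   # spread of the posts around their own group mean
    for vals in groups.values():
        mean = sum(vals) / len(vals)
        ss_between += len(vals) * (mean - grand_mean) ** 2
        ss_within += sum((v - mean) ** 2 for v in vals)

    df_between = k - 1
    df_within = n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within

# Made-up word counts for illustration only; NOT the real blog data.
toy_counts = {
    "Surveys": [900, 1100, 1000],
    "Blogging": [300, 500, 400],
    "Sports": [200, 400, 300],
}

f_stat, df_b, df_w = one_way_anova_f(toy_counts)
print(f"F({df_b}, {df_w}) = {f_stat:.3f}")
```

A big F means the category means differ by a lot more than the posts within a category do. Turning F into a p-value needs the F distribution, which this sketch leaves to a stats package.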

I (or R, rather) also computed a buttload of Tukey HSDs (595 of them!) to test the mean differences between each pair of categories. Most of the significant ones involved (as expected) Surveys, Writing, Philosophy, and Rants.
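
Where does 595 come from? It's just the number of ways to pick 2 categories out of 35, which is 35 × 34 / 2. A quick Python check (the `cat1`, `cat2`, ... names are placeholders, not the real category names):

```python
from itertools import combinations
from math import comb

n_categories = 35

# Number of unordered pairs: 35 choose 2 = 35 * 34 / 2.
n_pairs = comb(n_categories, 2)
print(n_pairs)  # 595

# Listing the pairs explicitly, using placeholder category names.
names = [f"cat{i}" for i in range(1, n_categories + 1)]
pairs = list(combinations(names, 2))
print(len(pairs))  # 595
```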

So. Results:
Hypothesis A: supported! Blogging and school, man: my life.
Hypothesis B: mostly supported! There were a few categories that had nearly as few entries as Sports. I’d get rid of the Ramblings category, but then I’d have 34 categories, which isn’t a nicely divisible number like 35 (I like numbers ending in 0 or 5). Guess I just need to ramble more.
Hypothesis C: mostly supported! I’d totally forgotten about Writing.
Hypothesis D: supported! Surveys, Writing, Philosophy, and Rants contained blogs that had higher than average word counts.

Cool, huh?

Tune in tomorrow for more stats no one cares about except me!