Yay, I’ve been waiting for this day! Why? ‘Cause I get to use Wordle. I don’t have any hypotheses for today; rather, I have three main questions of interest.
Question A: how do my “commonly used words” change throughout the years?
Question B: are there some words I use more than others in my blog titles?
Question C: looking at my blog in total, what are my most commonly-used words?
Question A: Using Wordle’s word counts, here’s a table of my top 10 words for each year (note: Wordle can automatically remove “common” words like the, and, a, etc., so I did that). Words consistently highly used across the years are colored.
(Year 1’s “Andy” is because of a short story I posted. Year 4’s “Hate” is because of grad school.)
My top 10 words I use in my titles are:
- Waiter (from all my “Waiter! There’s a…” titles)
Here is a Wordle of my top 100 words spanning all six years!
I would have guessed I’d used the word “blog” a lot more. And my own name less. I use my name in my blog more than “haha” and I’m always dropping “haha”s all over the place! What.
Bonus: here are a few of my common phrases by year. A lot of these are biased because of one blog containing a repeating phrase, but they’re still amusing.
- “Claudia is”
- “Airplane airplane airplane airplane”
- “Who cares about apathy”
- “ag sci computer lab”
- “if you had sex”
- “the socio-adaptive force”
- “who said hello”
- “I can be absolutely fine”
- “go ahead and stir baby” (haha, it took me like twenty minutes to try and figure out why this was a popular phrase; then I remembered it was because of this)
- “the fact that I”
- “wifey wifey wifey wifey”
- “have you ever”
- “best of all possible” (hahaha, this was the year I discovered Leibniz)
- “the mad scientist’s life”
- “the last time you”
- “I hate this” (yup, grad school time)
- “your conversational partner has disconnected” (and Omegle time)
- “approach to environmental ethics”
- “what do you want”
- “today’s song”
- “this week’s science blog”
- “today’s song”
- “you have no idea”
- “for quite some time”
- “all of a sudden”
- “what do you think of”
- “I miss happiness”
- “what would it be”
- “the last time you”
- “sure why not”
It’s day three!
Today we’re looking at three different variables: trends in my Titles, the frequency of blogs involving Surveys, and the frequency of blogs involving Images.
To make sense of these variables and the stats surrounding them, I had to code them. As I said in my first blog stats-related post this week, for the Titles variable, titles were coded 0 if they had nothing to do with the blog content whatsoever (e.g., “Do obedient consonants respond to a Q queue cue?”), a 1 if they were directly relevant to the blog content (e.g., “Greek letters as broken down by meanings in Statistics: a subjective and torturous endeavor”), and 2 if they weren’t completely unrelated but one couldn’t guess the blog content from the title (e.g., “ZOMG”). For the Surveys variable, I just coded the blog entry as 0 if it didn’t contain a survey and 1 if it did. Same thing for the Images variable—a 0 if there were no images and a 1 if there were one or more images.
So. Do I have any hypotheses? Of course I do!
A: The majority of my blog titles have nothing to do with the blog content (that is, they’re coded as 0).
B: I’ve posted more Surveys as time has gone on.
C: I’ve posted more Images as time has gone on.
D: Blogs with Images have fewer words than blogs without Images.
Quick initial analysis: a pie chart of titles!
Hahaha, a quarter of my blog titles tell you absolutely nothing about the associated blogs. That’s fantastic.
Now some more serious fun. To determine whether the amount of Surveys I’ve been posting has been increasing with time, I first made a graph that looks like a bar code to get a rough idea of the frequency/spacing of surveys in my blog*. Each black vertical line represents a Survey blog (y-axis runs from 0 to 1 but since Survey is coded as either a 0 or 1, the appearance of a line indicates Survey = 1).
Second, I looked at the correlation between Blog Number (blog 1 was May 1, 2006, blog 2,192 was May 1, 2012) and the presence of a Survey. The way the coding works, a positive correlation would indicate that as time progressed, I had a greater tendency to post a survey-containing blog.
In this case, I did get a positive correlation of rpb = 0.071. This isn’t the usual Pearson r correlation because I’m not comparing two continuous variables; rather, it’s a point biserial correlation to accommodate the dichotomously-coded Survey variable. However, it’s mathematically equivalent to the Pearson r, so I felt comfortable running a test of significance on the correlation. Turns out, the little .071 correlation is statistically significant, t = 3.346, p < 0.001. This suggests that the true correlation between Blog Number and whether a blog contains a survey is not zero, and that I’ve been posting survey-containing blogs more often as time has gone on.
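For anyone who wants to check the arithmetic, here is a minimal Python sketch of a point biserial correlation and its t statistic. It is not the code used for this post: the toy `blog_number` and `survey` lists are made up, the function names are hypothetical, and n = 2193 is an assumption taken from the total blog count. The t statistic uses the standard formula t = r * sqrt((n - 2) / (1 - r^2)).

```python
import math

def pearson_r(x, y):
    """Pearson correlation; with a 0/1 variable this is the point biserial r."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def t_for_r(r, n):
    """t statistic for testing whether a correlation differs from zero (df = n - 2)."""
    return r * math.sqrt((n - 2) / (1 - r * r))

# Toy data (invented for illustration): ten blogs, 1 = contains a survey.
blog_number = list(range(1, 11))
survey = [0, 0, 0, 1, 0, 0, 1, 0, 1, 1]

r_pb = pearson_r(blog_number, survey)
t_toy = t_for_r(r_pb, len(blog_number))

# Plugging in the rounded correlation from the post, assuming n = 2193 blogs.
t_from_post = t_for_r(0.071, 2193)
print(round(r_pb, 3), round(t_toy, 3), round(t_from_post, 2))
```

With the rounded r = 0.071, this comes out around 3.33, a bit under the reported 3.346; the gap is presumably just rounding in r.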
Taking the same procedure with the Blog Number variable and the dichotomous Image variable, here’s another bar code-esque pic (black lines = blogs containing 1+ image):
Here we get an even stronger correlation of rpb = 0.194, which is statistically significant, t = 9.273, p < 0.0001. This suggests that the true correlation between Blog Number and whether a blog contains an image is not zero, and that I’ve been posting more and more blogs containing an image as time has gone on.
Finally, I checked out word count between all blogs with Images and all blogs without Images. I made two subset data sets, one containing all the blogs with images, one containing all the blogs with no images, and ran a t-test. The difference in word count was (to me) surprisingly large and definitely significant, t = 6.658, p < 0.0001. The actual means of the No Image vs. Image blogs were 290.425 words and 177.925 words, respectively.
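Here is a rough Python sketch of a two-sample comparison like this one. The post doesn’t say which flavor of t-test was run, so this uses Welch’s unequal-variance version as an assumption (a reasonable pick when the two groups differ a lot in size). The word counts and the `welch_t` name are invented for illustration.

```python
import math
import statistics

def welch_t(a, b):
    """Welch's t statistic for the difference in means of two samples."""
    se_a = statistics.variance(a) / len(a)  # squared standard error of mean(a)
    se_b = statistics.variance(b) / len(b)  # squared standard error of mean(b)
    return (statistics.mean(a) - statistics.mean(b)) / math.sqrt(se_a + se_b)

# Made-up word counts, NOT the real blog data.
no_image_words = [310, 280, 295, 270, 300]
image_words = [180, 170, 190, 175]

t_stat = welch_t(no_image_words, image_words)
print(round(t_stat, 3))
```

A positive t here means the first group (no images) has the higher mean word count, which matches the direction reported above.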
Hypothesis A: Haha, totally not supported, and actually opposite: most of my Titles ARE directly relevant to the content. That’s…surprising to me. I name my blogs right before posting them (which is usually like a decade and a half after I write them, given how often I update this blog), and I’ve usually used the “mash the keyboard until the letters make sense” approach to titles. That, or “let’s see what dumb pun I can make today!”
Hypothesis B: Supported! This is probably strongly due to the fact that I’m working to complete the 5,000 Question Survey and have been working on it since late 2010.
Hypothesis C: Supported! WordPress makes it substantially easier to include images than MySpace ever did. Also, more time spent on the internet now = more random humorous images found via StumbleUpon/Tumblr/other blogs/etc.
Hypothesis D: Very supported. The actual word count difference between blogs with and blogs without Images was surprising to me, though the sample size difference could probably be considered a culprit. I guess I shouldn’t be too surprised, though; going through the archives I found quite a few blogs that were like “here’s an image!”, the image, and nothing else.
*Yeah, I know there’s got to be a more sophisticated way to represent this. Creating a CDF doesn’t work with a dichotomous variable. Maybe if I write a loop that adds all the preceding 1’s to each instance of a 1 it hits as it goes from Blog Number = 1 to Blog Number = 2193, and then create sort of a pseudo-CDF using that…hmm…next week’s project!!!
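The loop idea in this footnote boils down to a running total, so here is one way it might look in Python. This is a sketch only: the `survey` list is toy data, and `running_count` and `pseudo_cdf` are hypothetical names.

```python
from itertools import accumulate

def running_count(flags):
    """Cumulative number of 1s seen so far, one entry per blog."""
    return list(accumulate(flags))

def pseudo_cdf(flags):
    """Running count scaled to end at 1.0 (all zeros if there are no 1s)."""
    counts = running_count(flags)
    total = counts[-1] if counts else 0
    if total == 0:
        return [0.0 for _ in counts]
    return [c / total for c in counts]

# Toy 0/1 Survey codes in Blog Number order (made up).
survey = [0, 1, 0, 0, 1, 1, 0, 1]

counts = running_count(survey)
cdf = pseudo_cdf(survey)
print(counts)
print(cdf)
```

Plotting the scaled values against Blog Number would give a step curve: steep stretches mean lots of surveys close together, and flat stretches mean none.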
Yo, blogland! Time for another round of “stats no one cares about except me!”
Today we’re looking at Word Count by Day of the Week, Month, and Year. I’d like to see if there are any general trends or if I blather on about nothing in relatively consistent bursts across time. Maybe if all these days of analyses reveal some trends, I could try fitting a model to this data. I loves me some model fittin’.
Onwards and upwards!
A. No one day of the week will have a statistically significant difference in word count from any other day of the week. I don’t think I blog more or less over the weekend, and I see no reason why any day of the five-day week would have longer blogs than any other.
B. I don’t know if they’ll be significant or not, but I’m predicting that word counts will in general be higher during the spring school months (January–April at least) than the summer/winter months. The more responsibilities I have, the more I turn to blogging for procrastination, and I usually take more credits in the summer.
C. From highest word count to lowest: Year 6, Year 2, Year 5, Year 4, Year 1, Year 3.
Here is a pie chart (a tasty, tasty pie chart!) of the percentage of words I’ve written by the day of the week.
Pretty equal, eh? But what does the ANOVA say? According to the stats, there are no statistically significant differences in word count by day of the week, F = 0.642, p = 0.697. According to the Tukey HSDs, none of the individual pairs of days of the week are statistically significant in terms of their word count, either.
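For the curious, the F statistic in a one-way ANOVA is the between-group mean square divided by the within-group mean square. Here is a minimal Python sketch of that calculation; the three little groups are toy numbers, not my word counts, and `one_way_anova_f` is a hypothetical name.

```python
import statistics

def one_way_anova_f(groups):
    """F statistic for a one-way ANOVA over a list of groups (lists of numbers)."""
    k = len(groups)                       # number of groups
    n = sum(len(g) for g in groups)       # total observations
    grand_mean = sum(sum(g) for g in groups) / n
    ss_between = sum(
        len(g) * (statistics.mean(g) - grand_mean) ** 2 for g in groups
    )
    ss_within = sum(
        sum((x - statistics.mean(g)) ** 2 for x in g) for g in groups
    )
    ms_between = ss_between / (k - 1)     # df between = k - 1
    ms_within = ss_within / (n - k)       # df within = n - k
    return ms_between / ms_within

# Toy data: three "days" with three made-up word counts each.
toy_groups = [[1, 2, 3], [2, 3, 4], [3, 4, 5]]
f_stat = one_way_anova_f(toy_groups)
print(f_stat)
```

An F near or below 1, like the 0.642 above, means the group means differ by no more than you would expect from the scatter within each group.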
Here is another pie chart. This one shows percentage of words by month.
Again, pretty even. Stats? F = 1.505, p = 0.123, meaning that there are no statistically significant differences in word count by month. No statistically significant differences in any of the pairs of months, either.
Finally, we jump to the largest span of time I’m looking at: years! Pie pie pie pie pie:
Haha, holy crap, Year 6 and Year 2 combined account for nearly half of the words in my total blog. Poor little Year 3.
And finally we see some significance! There is a statistically significant difference in word count by blog year, F = 11.021, p < 0.001.
Hypothesis A: Supported! All days of the week are subject to equal amounts of my blathering. Poor things.
Hypothesis B: Eh. Technically January, February, March, April, and May are the wordiest months, but they’re not significantly so.
Hypothesis C: Woo! I totally called it. If anyone’s curious, Year 3 was a word drought because I was living in the house with the guys and I had…other stuff occupying my time.
More to come tomorrow, ladies and gents!
STATS TIME! Are you excited?
First, I want to preface all of this with the list of variables I kept track of when going through my blog archive:
- Blog Number. My first blog is coded as 1, the second as 2, the third is 3, and so on up until 2193.
- Year. Which blogging year the blog came from. There are six years, each spanning May – May.
- Month. January, February, etc.
- Day. The 1st of the month, 2nd of the month, etc.
- Weekday. Monday, Tuesday, etc.
- Word Count. Word count of each post, not counting the title.
- GFI. Gunning Fog Index.
- Punctuation. How many punctuation marks the post contained.
- Title. 0 = title unrelated to blog content, 1 = title directly relevant to blog content, and 2 = ambiguous title; could be related or unrelated.
- Survey. 0 = blog does not contain a survey, 1 = blog contains survey.
- Image. 0 = blog does not contain any images, 1 = blog contains 1+ image(s).
- Category. What category did I tag my blog as (details below).
ALSO NOTE: significance is always judged at the α = 0.05 level (that is, a result counts as significant when p < 0.05). Just didn’t want to have to keep specifying that. :)
So! Today we’re looking at Categories. There are 35 of them (or there will be once I go through and delete all the old “defunct” tags from the few blogs that still have them). Here’s the list in case anybody gives a crap:
So what are we looking at within this sexy, large dataset with respect to categories, then?
Questions of Interest
A) What is the distribution of the categories? That is, which categories are most popular and which are hardly ever used?
B) Do certain categories have a statistically significantly different number of words per post than the other categories?
A: The most popular categories (by percent) will be Blogging, School, and probably Surveys.
B: The least popular categories will be Ramblings and Sports.
C: Categories with a significantly different number of words per post will be Surveys, Philosophy, and Rants.
D: The three categories specified in Hypothesis C will have higher word counts, not lower.
LET’S DO THIS NOISE.
First up, a pie chart! This was my first attempt at visualizing category percentages. By the way, I definitely would have titled this like a good little statistician, but I couldn’t get the image large enough (in my opinion) with the title included. So I’ll call it Percent of Blogs by Category (NOT percent of words by category; that’s just in the ANOVA below).
I had to screw around with this a lot to get it in the easiest to read color scheme. Pie chart with 35 slices = not the best visual, but I think it’s still better than a bar graph in this case.
Table o’ actual counts (click to blow it up so it’s actually readable, haha):
God, all those Blogging blogs.
Second: ANOVAs! Well, okay, just one. But it’s an ANOVA!
According to a more in-depth, ANOVA-driven analysis…
- The mean Word Count per blog is statistically significantly different depending on blog Category, F = 23.184, p < 0.001.
- Blogs in the Surveys category have a significantly higher word count than the other categories, t = 7.739, p < 0.0001.
- Blogs in the Writing category have a significantly higher word count than the other categories, t = 3.624, p < 0.001.
- Blogs in the Philosophy category have a significantly higher word count than the other categories, t = 3.365, p < 0.001.
- Blogs in the Rants category have a significantly higher word count than the other categories, t = 2.480, p < 0.05.
I (or R, rather) also computed a buttload of Tukey HSDs (595 of them!) to test the mean differences between each pair of categories, but most of the significant ones involved (as expected) Surveys, Writing, Philosophy, and Rants.
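Where does 595 come from? With 35 categories, the number of distinct pairs is 35 choose 2 = 35 × 34 / 2 = 595. A quick Python check (the placeholder category names are made up; only the count of 35 comes from the post):

```python
import math
from itertools import combinations

n_categories = 35

# Number of unordered pairs of categories, i.e. pairwise comparisons.
n_pairs = math.comb(n_categories, 2)

# Same count by listing the pairs explicitly, using placeholder names.
categories = [f"category_{i}" for i in range(1, n_categories + 1)]
pairs = list(combinations(categories, 2))

print(n_pairs, len(pairs))
```

That many comparisons is exactly why a correction like Tukey’s HSD is used instead of running 595 separate uncorrected t-tests.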
Hypothesis A: supported! Blogging and school, man: my life.
Hypothesis B: mostly supported! There were a few categories that had nearly as few entries as Sports. I’d get rid of the Rambling category, but then I’d have 34 categories, which isn’t a nicely-dividable number like 35 (I like numbers ending in 0 or 5). Guess I just need to ramble more.
Hypothesis C: mostly supported! I’d totally forgotten about Writing.
Hypothesis D: supported! Surveys, Writing, Philosophy, and Rants contained blogs that had higher than average word counts.
Tune in tomorrow for more stats no one cares about except me!
My blog from May 1, 2006:
“Alrightythen! I finally got a MySpace*. People will now find it easier to stalk me. Anyways, since I’m obsessive-compulsive, I had to start this on the 1st of a Month–I wanted to start this a week or so ago, but NO–that would be in the middle of April. I’m tired now and I want to go to bed, but NO–that would mean I would have to wait another month to start.
Anyway, it’s all good. I’ll be here posting my thoughts every day. At least, that’s what I want to do.
The odds of it happening?
Oh, silly little high-schooler. If only you knew.
Yes, the internet has been putting up with my daily drivel for six years now. Scary stuff, huh? Haha, I should actually blame Aneel and E’raina for this…they’re the ones that peer-pressured me into getting a MySpace and starting a blog. DO YOU SEE WHAT YOU’VE DONE?!
Anyway, yay! I’ve spent the past two weeks or so going through each and every one of my posts to create a nice large data set. Because I spent so long doing it, I thought it only fitting that I not confine my statistical exploration of my blogs to one night (tonight). Therefore, I will give you six days (one for each year I’ve been blogging) of stats/blogging insanity guaranteed to either interest you or make you wonder why the hell you even read this. Or both. Or neither. Maybe it’ll give you cheese. The possibilities? Endless.
This is what’s going down in the Big Week o’ Blog Stats Celebration:
- Wednesday: Comparisons by blog category
- Thursday: Mean word comparisons by weekday, month, and year
- Friday: The use and frequency of images and surveys
- Saturday: Common words and topics
- Sunday: Overall trends: Gunning Fog Index and word count
- Monday: The best blogs
Excited? I am! Any excuse to do some recreational stats.
Here’s to six more years (at least)!
*YES, I started out on MySpace. Give me a break, it was the thing back in 2006, you know it was.
Haha, the blog’s gettin’ revamped all over the place. Today’s upgrade: my own domain!
I can now be found at www.eigenblogger.com (the old eigenblogger.wordpress.com will redirect). Yay!
I also added a “Best Of” tab at the top which contains my ten favorite blogs. Just in case people who are passing through would like a taste of some of the better posts of mine (because 95% of them suck).
Further upgrades will occur throughout the week. I also only have to work Monday, Tuesday, and Wednesday because Arizona is apparently obsessed with rodeos enough to shut down the community college.
OOH and I got a new bike! Pics soon.
BAM, new WordPress theme! Like it? I like how the sidebar (the content to the right) is separate from the main blog. I also like how each post gets its own little box.
Anyway, I’m still having the WordPress vs. Typepad argument in my head. If I stay on WordPress, I’m going to buy my own domain and be www.eigenblogger.com, because that looks awesome and I think (nearly) six years of blogging deserves its own domain.
Okay, that’s all. Sorry. Short blog time.
Hello fine readers! I have a question pertaining to—what else?—my blog!
So since I started my blog back in 2006, it’s been known as “Le Seul Mot Juste” with the subheader “take THAT, Gustave Flaubert!”
However, I’ve grown quite attached to “Eigenblogger,” especially since if I go premium with my WordPress account I’ll be found at http://www.eigenblogger.com, which is more memorable and easier to type out than http://www.leseulmotjuste.com or even http://www.takethatgustave.com. Though that last one would be pretty great.
I’ve also kind of adopted the phrase “live fast, die young, love data” as a motto, so that’d be a good subheader, eh?
Here are three new header designs with the new blog title/subheader. This also gives me the opportunity to try out my first poll! WOOHOO! Designs first, poll at the end.
I want cheese.