It’s day three!
Today we’re looking at three different variables: trends in my Titles, the frequency of blogs involving Surveys, and the frequency of blogs involving Images.
To make sense of these variables and the stats surrounding them, I had to code them. As I said in my first blog stats-related post this week, for the Titles variable, titles were coded 0 if they had nothing to do with the blog content whatsoever (e.g., “Do obedient consonants respond to a Q queue cue?”), a 1 if they were directly relevant to the blog content (e.g., “Greek letters as broken down by meanings in Statistics: a subjective and torturous endeavor”), and 2 if they weren’t completely unrelated but one couldn’t guess the blog content from the title (e.g., “ZOMG”). For the Surveys variable, I just coded the blog entry as 0 if it didn’t contain a survey and 1 if it did. Same thing for the Images variable—a 0 if there were no images and a 1 if there were one or more images.
So. Do I have any hypotheses? Of course I do!
A: The majority of my blog titles have nothing to do with the blog content (that is, they’re coded as 0).
B: I’ve posted more Surveys as time has gone on.
C: I’ve posted more Images as time has gone on.
D: Blogs with Images have fewer words than blogs without Images.
Quick initial analysis: a pie chart of titles!
Hahaha, a quarter of my blog titles tell you absolutely nothing about the associated blogs. That’s fantastic.
Now some more serious fun. To determine whether the number of Surveys I’ve been posting has been increasing with time, I first made a graph that looks like a bar code to get a rough idea of the frequency/spacing of surveys in my blog*. Each black vertical line represents a Survey blog (the y-axis runs from 0 to 1, but since Survey is coded as either a 0 or a 1, the appearance of a line indicates Survey = 1).
Second, I looked at the correlation between Blog Number (blog 1 was May 1, 2006, blog 2,192 was May 1, 2012) and the presence of a Survey. The way the coding works, a positive correlation would indicate that as time progressed, I had a greater tendency to post a survey-containing blog.
In this case, I did get a positive correlation of rpb = 0.071. This isn’t the usual Pearson r correlation because I’m not comparing two continuous variables; rather, it’s a point biserial correlation to accommodate the dichotomously-coded Survey variable. However, it’s mathematically equivalent to the Pearson r, so I felt comfortable running a test of significance on the correlation. Turns out, the little .071 correlation is statistically significant, t = 3.346, p < 0.001. This suggests that the true correlation between Blog Number and the presence of a Survey is not zero: later blogs have been somewhat more likely to contain a survey, though with a correlation this small the trend is a weak one.
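For anyone who wants to check the "mathematically equivalent" claim, here is a minimal Python sketch. The toy data and all function names are made up for illustration (they are not the actual blog data), and the last line assumes n = 2,192 blogs, taken from the Blog Number range mentioned earlier. It computes the point biserial correlation with the textbook formula, compares it to a plain Pearson r on the same 0/1 coding, and converts a correlation to a t statistic with t = r·sqrt(n − 2)/sqrt(1 − r²).

```python
import math

def pearson_r(x, y):
    """Plain Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def point_biserial(x, flags):
    """Textbook point biserial formula: (M1 - M0) / s_x * sqrt(p * q)."""
    n = len(x)
    ones = [a for a, f in zip(x, flags) if f == 1]
    zeros = [a for a, f in zip(x, flags) if f == 0]
    p, q = len(ones) / n, len(zeros) / n
    mean_x = sum(x) / n
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x) / n)  # population SD
    mean_diff = sum(ones) / len(ones) - sum(zeros) / len(zeros)
    return mean_diff / sd_x * math.sqrt(p * q)

def t_from_r(r, n):
    """t statistic for testing r against zero, with n - 2 degrees of freedom."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r * r)

# Hypothetical toy data: ten blogs, 1 = contains a survey.
blog_number = list(range(1, 11))
has_survey = [0, 0, 0, 1, 0, 0, 1, 0, 1, 1]

r_pearson = pearson_r(blog_number, has_survey)
r_pb = point_biserial(blog_number, has_survey)
print(r_pearson, r_pb)  # the two values agree up to floating-point error

# Assumption: n = 2192. Plugging in the rounded r = 0.071 gives roughly 3.33.
t_post = t_from_r(0.071, 2192)
print(t_post)
```

The t value from the rounded correlation comes out close to, but not exactly, the reported 3.346; that gap is about what you would expect from plugging in an r rounded to three decimals.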
Taking the same procedure with the Blog Number variable and the dichotomous Image variable, here’s another bar code-esque pic (black lines = blogs containing 1+ image):
Here we get an even stronger correlation of rpb = 0.194, which is statistically significant, t = 9.273, p < 0.0001. This suggests that the true correlation between Blog Number and the presence of an Image is not zero, and that I’ve been more and more likely to post a blog containing an image as time has gone on.
Finally, I compared word counts between all blogs with Images and all blogs without Images. I made two subset data sets, one containing all the blogs with images and one containing all the blogs with no images, and ran a t-test. The difference in word count was (to me) surprisingly large and definitely significant, t = 6.658, p < 0.0001. The actual means of the No Image vs. Image blogs were 290.425 words and 177.925 words, respectively.
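The post doesn’t say which flavor of t-test was run, so the following is only a sketch of one reasonable option: Welch’s t-test, which does not assume the two groups have equal variances or equal sizes. The word counts below are invented toy numbers, and `welch_t` is a hypothetical helper name.

```python
import math
from statistics import mean, variance

def welch_t(a, b):
    """Welch's t statistic and its approximate degrees of freedom."""
    va = variance(a) / len(a)  # squared standard error of group a's mean
    vb = variance(b) / len(b)  # squared standard error of group b's mean
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    # Welch-Satterthwaite approximation for the degrees of freedom
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df

# Hypothetical word counts, NOT the real blog data.
no_image_words = [310, 250, 420, 180, 295]
image_words = [150, 90, 210, 60, 175, 120]

t_stat, df = welch_t(no_image_words, image_words)
print(t_stat, df)  # positive t: the no-image group has the larger mean here
```

A positive t here just reflects the order of subtraction (No Image minus Image), matching the direction of the means reported above.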
Hypothesis A: Haha, totally not supported, and actually opposite: most of my Titles ARE directly relevant to the content. That’s…surprising to me. I name my blogs right before posting them (which is usually like a decade and a half after I write them, given how often I update this blog), and I’ve usually used the “mash the keyboard until the letters make sense” approach to titles. That, or “let’s see what dumb pun I can make today!”
Hypothesis B: Supported! This is probably due in large part to the fact that I’ve been working to complete the 5,000 Question Survey since late 2010.
Hypothesis C: Supported! WordPress makes it substantially easier to include images than MySpace ever did. Also, more time spent on the internet now = more random humorous images found via StumbleUpon/Tumblr/other blogs/etc.
Hypothesis D: Very supported. The actual word count difference between blogs with and blogs without Images was surprising to me, though the difference in sample size between the two groups could probably be considered a culprit. Still, I guess I shouldn’t be too surprised; going through the archives I found quite a few blogs that were like “here’s an image!”, the image, and nothing else.
*Yeah, I know there’s got to be a more sophisticated way to represent this. Creating a CDF doesn’t work with a dichotomous variable. Maybe if I write a loop that adds all the preceding 1’s to each instance of a 1 it hits as it goes from Blog Number = 1 to Blog Number = 2193, and then create sort of a pseudo-CDF using that…hmm…next week’s project!!!
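For what it’s worth, the loop described in that footnote amounts to a running total of the 0/1 codes, which can then be divided by the final total to get something that climbs from 0 to 1 like a CDF. Here is a minimal Python sketch; the function name and the toy codes are hypothetical, not the real data.

```python
from itertools import accumulate

def running_share(flags):
    """Cumulative share of all 1s seen so far, in blog order."""
    counts = list(accumulate(flags))  # running count of 1s
    total = counts[-1] if counts else 0
    if total == 0:
        return [0.0] * len(flags)  # no 1s at all: flat line at zero
    return [c / total for c in counts]

# Hypothetical Survey codes for six blogs, oldest first.
survey_flags = [0, 1, 0, 0, 1, 1]
running_counts = list(accumulate(survey_flags))
shares = running_share(survey_flags)
print(running_counts)  # [0, 1, 1, 1, 2, 3]
print(shares)
```

Plotted against Blog Number, a curve that bends upward toward the end would mean the surveys are bunched up in the later blogs, while a straight diagonal would mean they are spread evenly.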