It’s the last day of the big statistics marathon. Sad? I am. But I got a few new R projects coming, so you’ll be subject to those shortly.
Anyway. Today is less about stats analyses and more about just general naked-eye trends. The questions we’re looking at today:
A. What are my most popular blogs by view count on WordPress?
B. What are some of the most popular search terms people have used to arrive at my blog?
C. What are some of the most hilarious search terms people have used to arrive at my blog?
D. Blogs/topics I think are worth sharing that didn’t make my Best Of list up top.
I’ve been on WordPress since September 1st, 2010. Since then, my most viewed blogs have been:
- (153 views) Scrabble Letter Values and the QWERTY Keyboard
- (149 views) Colored Beats!
- (58 views) Oh look, PayPal wants me to fill out a survey
- (34 views) TWSB: Well, it certainly would make the cartographer’s job easier…
- (28 views) TWSB: Weebles Wobble (But They Wouldn’t if They Had Three Legs)
- (26 views) Pi vs. e
- (19 views) An analysis of statewise uniform population density (according to Craigslist)
- (19 views) Claudia’s 365 Days of Music – A Review
- (18 views) 5 x 20 seconds of fun
Those may not seem like tremendously large viewing numbers, but considering I’ve got over 2,000 posts and like three people who actually frequent Eigenblogger, 153’s not too bad. Part B explains some of the numbers.
Speaking of which…
Top 10 search phrases are:
- “colored beats”
- “Leibniz porn”
- “what one thing could paypal have done to improve your experience with the account limitation process”
- “le seul mot juste”
- “scrabble letter breakdown”
- “scrabble letter values”
- “scrabble letters”
- “scrabble letter rank”
- “rho rho rho your boat”
Yes, a freakishly large number of the times my blog has been found have been because of somebody (somebodies?) searching for “Leibniz porn.” That is simultaneously awesome and confusing. Does “porn” mean something like “metaphysical texts” in some other language? If not, and at least one person out there is searching for legitimate calculus-oriented, ostentatious wig-wearing, best-of-all-possible-smut Leibniz porn, WHO THE HELL ARE YOU AND WILL YOU BE MY SOUL MATE FOREVER?!
Le Seul Mot Juste was the name of my blog up until like three months ago.
And “rho rho rho?” Who the hell knows. Maybe my intellectually-compatible-perfect-future-boyfriend-husband-thing (hereafter referred to as my ICPFBHT) was trying to make some sort of stats pun as he sat hunched over his computer keyboard in a darkened room, chugging Red Bulls and listening to electronica. Naked. With stacks of Leibniz’ works next to him.
People have found my blog by searching for rather humorous things such as:
- “jokes about leibniz cookies”
- “analysis without anal”
- “paddled in parachute pants”
- “yo dawg science”
- “jokes about godot”
- “if your a noodle and you know it clap your hands” (yeah, I have no idea, either.)
- “who the hell is millard fillmore”
- “gavagai turnips”
It’s shameless self-promotion time! I was going to make a big ol’ flowchart thing that showed you what blogs to go for depending on your general interests, but I’m lazy and I’m sure none of you readers really care that much, so you get this instead.
Got here via a statistics-related post and/or are interested in random recreational stats parties? Why not check out my blogs under the Statistics category?
Interested in philosophy?
What about science?
(Want to read me bitch about stuff?)
Haha, that’s all I got. So there you go! Six days’ worth of stats for six years’ worth of blogs. I hope to entertain you all for another six years at least.
Thank you for reading! Seriously. I’m not all about acquiring followers, but it is really nice to have regular readers. :)
Today is mega trends day. I’ll be looking at blog-wide stuff like the overall changes in word count and the overall changes in the Gunning Fog Index. Woohoo!
A. The Word Count per blog has increased as time has gone on. That is, my blogs today are longer than my blogs when I first started.
B. The GFI per blog has increased as time has gone on.
C. There is no significant correlation between Word Count and the GFI.
I performed a regression (aka a glorified correlation in this case) between Word Count and Blog Number to determine if the number of words per blog has increased as time has gone on. Which indeed it has; predicting Word Count by Blog Number, the regression equation can be written as Word Count = 0.0613*Blog Number. Blog Number predicts a significant proportion of variance in the Word Count variable, F(1,2190) = 14.15, p < 0.001. Here is a plot. The red line is the regression line. As always, click on those bad boy plots to see them more clearly.
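For anyone who wants to see what sits behind a line like “F(1,2190) = 14.15,” here is a minimal Python sketch of a simple regression and its overall F statistic. It is not the R code behind the numbers in this post; the function name `simple_regression` and the four toy data points are made up for illustration. Note that it fits an intercept as well as a slope.

```python
def simple_regression(x, y):
    """Least-squares fit of y = intercept + slope * x, plus the overall F test pieces."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x

    # Split the total variation in y into a model part and a residual part.
    ss_total = sum((yi - mean_y) ** 2 for yi in y)
    ss_resid = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_model = ss_total - ss_resid

    # One predictor, so the model has 1 df and the residuals have n - 2.
    df_resid = n - 2
    f_stat = (ss_model / 1) / (ss_resid / df_resid)
    return slope, intercept, f_stat, df_resid


# Made-up toy data, NOT the real blog archive.
blog_number = [1, 2, 3, 4]
word_count = [2, 4, 5, 8]
slope, intercept, f_stat, df_resid = simple_regression(blog_number, word_count)
print(f"slope={slope:.3f}, intercept={intercept:.3f}, F(1,{df_resid})={f_stat:.2f}")
```

The residual degrees of freedom are n − 2, which is why 2,192 posts give the 2,190 in the F test.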
Same procedure for GFI vs. Blog Number. The GFI, or Gunning Fog Index, remember, is a measurement of the readability of English writing, and its values correspond to the number of years of formal education a person needs in order to fully understand the written passage. For example, a GFI of 10 suggests that an individual must have completed 10th grade in order to understand the material. To achieve near universal understanding, Wiki recommends that the GFI of a bit of text hover around an eight.
Anyway. The regression equation here is GFI = 0.0008639(Blog Number). Blog Number predicts a significant proportion of variance in the GFI variable, F(1, 2190) = 51.86, p < 0.0001. Here is another plot with another regression line.
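For reference, the usual Gunning Fog formula is 0.4 × (average words per sentence + percentage of “complex” words), where complex means three or more syllables. Here is a rough Python sketch of it. The syllable counter is a crude vowel-run heuristic and the function names are hypothetical; this is an illustration, not whatever tool produced the GFI values in this post, and a proper implementation would also skip proper nouns and common suffixes.

```python
import re


def count_syllables(word):
    """Crude estimate: count runs of vowels (an assumption, not a dictionary lookup)."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def gunning_fog(text):
    """0.4 * (words per sentence + 100 * share of words with 3+ syllables)."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    complex_words = [w for w in words if count_syllables(w) >= 3]
    words_per_sentence = len(words) / len(sentences)
    percent_complex = 100 * len(complex_words) / len(words)
    return 0.4 * (words_per_sentence + percent_complex)


# Short sentences with short words score low; long words push the score up.
print(gunning_fog("The cat sat. The dog ran."))
print(gunning_fog("Statistics is fun."))
```

The sketch assumes the text has at least one sentence and one word; empty input would divide by zero.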
Finally, I tested the correlation between Word Count and GFI. The correlation was -0.0028 but was not significant with t = -0.1287, p = 0.8976.
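The t statistic for a correlation comes from t = r·√(n − 2) / √(1 − r²). A quick Python sketch (the function name `correlation_t` is made up, and n = 2192 is taken from the F(1, 2190) degrees of freedom reported for the regressions):

```python
import math


def correlation_t(r, n):
    """t statistic for testing whether a correlation r from n pairs differs from zero."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)


# Plugging in the rounded correlation from the text.
print(round(correlation_t(-0.0028, 2192), 4))
```

Using the rounded r of -0.0028 gives a t of about -0.13, in line with the reported -0.1287; the small gap is presumably just rounding in r.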
A. Supported! The regression line isn’t very steep, but it’s significant still.
B. Supported! That’s actually a pretty impressive regression line, in my opinion.
C. Supported! There’s practically no correlation at all between the length of my blogs and the level of comprehension. I blame the surveys.
Yay, I’ve been waiting for this day! Why? ‘Cause I get to use Wordle. I don’t have any hypotheses for today; rather, I have three main questions of interest.
Question A: how do my “commonly used words” change throughout the years?
Question B: are there some words I use more than others in my blog titles?
Question C: looking at my blog in total, what are my most commonly-used words?
Question A: Using Wordle’s word counts, here’s a table of my top 10 words for each year (note: Wordle can automatically remove “common” words like the, and, a, etc., so I did that). Words consistently highly used across the years are colored.
(Year 1’s “Andy” is because of a short story I posted. Year 4’s “Hate” is because of grad school.)
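The same kind of “top words minus the common ones” count can be sketched in a few lines of Python. This is only an illustration of the idea, not how Wordle works internally; the stop-word list is a tiny made-up sample and `top_words` is a hypothetical helper.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list; a real one would be much longer.
STOP_WORDS = {"the", "and", "a", "an", "of", "to", "in", "is", "it", "i"}


def top_words(text, n=10, stop_words=STOP_WORDS):
    """Return the n most frequent words in text, ignoring case and stop words."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(w for w in words if w not in stop_words)
    return counts.most_common(n)


print(top_words("The blog and the blog of stats", n=2))
```

Running it once per year of posts would give a per-year table like the one described.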
My top 10 words I use in my titles are:
- Waiter (from all my “Waiter! There’s a…” titles)
Here is a Wordle of my top 100 words spanning all six years!
I would have guessed I’d used the word “blog” a lot more. And my own name less. I use my name in my blog more than “haha” and I’m always dropping “haha”s all over the place! What.
Bonus: here are a few of my common phrases by year. A lot of these are biased because of one blog containing a repeating phrase, but they’re still amusing.
- “Claudia is”
- “Airplane airplane airplane airplane”
- “Who cares about apathy”
- “ag sci computer lab”
- “if you had sex”
- “the socio-adaptive force”
- “who said hello”
- “I can be absolutely fine”
- “go ahead and stir baby” (haha, it took me like twenty minutes to try and figure out why this was a popular phrase; then I remembered it was because of this)
- “the fact that I”
- “wifey wifey wifey wifey”
- “have you ever”
- “best of all possible” (hahaha, this was the year I discovered Leibniz)
- “the mad scientist’s life”
- “the last time you”
- “I hate this” (yup, grad school time)
- “your conversational partner has disconnected” (and Omegle time)
- “approach to environmental ethics”
- “what do you want”
- “today’s song”
- “this week’s science blog”
- “today’s song”
- “you have no idea”
- “for quite some time”
- “all of a sudden”
- “what do you think of”
- “I miss happiness”
- “what would it be”
- “the last time you”
- “sure why not”
It’s day three!
Today we’re looking at three different variables: trends in my Titles, the frequency of blogs involving Surveys, and the frequency of blogs involving Images.
To make sense of these variables and the stats surrounding them, I had to code them. As I said in my first blog stats-related post this week, for the Titles variable, titles were coded 0 if they had nothing to do with the blog content whatsoever (e.g., “Do obedient consonants respond to a Q queue cue?”), a 1 if they were directly relevant to the blog content (e.g., “Greek letters as broken down by meanings in Statistics: a subjective and torturous endeavor”), and 2 if they weren’t completely unrelated but one couldn’t guess the blog content from the title (e.g., “ZOMG”). For the Surveys variable, I just coded the blog entry as 0 if it didn’t contain a survey and 1 if it did. Same thing for the Images variable—a 0 if there were no images and a 1 if there were one or more images.
So. Do I have any hypotheses? Of course I do!
A: The majority of my blog titles have nothing to do with the blog content (that is, they’re coded as 0).
B: I’ve posted more Surveys as time has gone on.
C: I’ve posted more Images as time has gone on.
D: Blogs with Images have fewer words than blogs without Images.
Quick initial analysis: a pie chart of titles!
Hahaha, a quarter of my blog titles tell you absolutely nothing about the associated blogs. That’s fantastic.
Now some more serious fun. To determine whether the number of Surveys I’ve been posting has been increasing with time, I first made a graph that looks like a bar code to get a rough idea of the frequency/spacing of surveys in my blog*. Each black vertical line represents a Survey blog (y-axis runs from 0 to 1 but since Survey is coded as either a 0 or 1, the appearance of a line indicates Survey = 1).
Second, I looked at the correlation between Blog Number (blog 1 was May 1, 2006, blog 2,192 was May 1, 2012) and the presence of a Survey. The way the coding works, a positive correlation would indicate that as time progressed, I had a greater tendency to post a survey-containing blog.
In this case, I did get a positive correlation of rpb = 0.071. This isn’t the usual Pearson r correlation because I’m not comparing two continuous variables; rather, it’s a point biserial correlation to accommodate the dichotomously-coded Survey variable. However, it’s mathematically equivalent to the Pearson r, so I felt comfortable running a test of significance on the correlation. Turns out, the little .071 correlation is statistically significant, t = 3.346, p < 0.001. This means that the true correlation between Blog Number and the number of surveys I post is not zero and I’ve been posting more and more surveys as time has gone on.
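The claim that a point biserial correlation is mathematically the same thing as a Pearson r on a 0/1 variable is easy to check numerically. Here is a small Python sketch with made-up data (eight pretend blog numbers and a pretend Survey flag); the function names are just for illustration.

```python
import math


def pearson_r(x, y):
    """Ordinary Pearson correlation between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def point_biserial(x, flag):
    """r_pb = (M1 - M0) / s_x * sqrt(p * q), with s_x the population SD of x."""
    n = len(x)
    ones = [a for a, f in zip(x, flag) if f == 1]
    zeros = [a for a, f in zip(x, flag) if f == 0]
    p, q = len(ones) / n, len(zeros) / n
    mean_x = sum(x) / n
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x) / n)
    return (sum(ones) / len(ones) - sum(zeros) / len(zeros)) / sd_x * math.sqrt(p * q)


# Made-up example: surveys show up more often in the later posts.
blog_number = [1, 2, 3, 4, 5, 6, 7, 8]
survey = [0, 0, 1, 0, 0, 1, 1, 1]
print(pearson_r(blog_number, survey), point_biserial(blog_number, survey))
```

Both functions return the same value (up to floating-point noise), which is why the usual significance test for r carries over.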
Taking the same procedure with the Blog Number variable and the dichotomous Image variable, here’s another bar code-esque pic (black lines = blogs containing 1+ image):
Here we get an even stronger correlation of rpb = 0.194, which is statistically significant, t = 9.273, p < 0.0001. This shows that the true correlation between Blog Number and the number of Images my blog contains is not zero, and I’ve been posting more and more blogs containing an image as time has gone on.
Finally, I checked out word count between all blogs with Images and all blogs without Images. I made two subset data sets, one containing all the blogs with images, one containing all the blogs with no images, and ran a t-test. The difference in word count was (to me) surprisingly large and definitely significant, t = 6.658, p < 0.0001. The actual means of the No Image vs. Image blogs were 290.425 words and 177.925 words, respectively.
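A two-group comparison like this can be sketched in Python as follows. The post doesn’t say which flavor of t-test was run, so this sketch assumes Welch’s unequal-variance version (the default in R’s `t.test`); the data are made up and `welch_t` is a hypothetical name.

```python
import math
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t statistic and approximate degrees of freedom for two samples."""
    va = variance(a) / len(a)  # squared standard error of each group mean
    vb = variance(b) / len(b)
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    # Welch-Satterthwaite approximation for the degrees of freedom.
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


# Made-up word counts, NOT the real blog data.
no_image = [310, 250, 290, 330, 270]
image = [180, 150, 200, 170]
t, df = welch_t(no_image, image)
print(f"t = {t:.3f}, df = {df:.1f}")
```

Welch’s version is a reasonable choice when the two groups differ a lot in size, as the Image and No Image subsets do here.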
Hypothesis A: Haha, totally not supported, and actually opposite: most of my Titles ARE directly relevant to the content. That’s…surprising to me. I name my blogs right before posting them (which is usually like a decade and a half after I write them, given how often I update this blog), and I’ve usually used the “mash the keyboard until the letters make sense” approach to titles. That, or “let’s see what dumb pun I can make today!”
Hypothesis B: Supported! This is probably strongly due to the fact that I’m working to complete the 5,000 Question Survey and have been working on it since late 2010.
Hypothesis C: Supported! WordPress makes it substantially easier to include images than MySpace ever did. Also, more time spent on the internet now = more random humorous images found via StumbleUpon/Tumblr/other blogs/etc.
Hypothesis D: Very supported. The actual word count difference between blogs with and blogs without Images was surprising to me, though the sample size difference could probably be considered a culprit. However, I guess I shouldn’t be too surprised; going through the archives I found quite a few blogs that were like “here’s an image!”, the image, and nothing else.
*Yeah, I know there’s got to be a more sophisticated way to represent this. Creating a CDF doesn’t work with a dichotomous variable. Maybe if I write a loop that adds all the preceding 1’s to each instance of a 1 it hits as it goes from Blog Number = 1 to Blog Number = 2193, and then create sort of a pseudo-CDF using that…hmm…next week’s project!!!
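The “loop that adds all the preceding 1’s” idea is a running total, and it takes only a couple of lines. Here is a minimal Python sketch, with made-up flags and hypothetical function names; dividing by the final total turns the running count into a curve that climbs from 0 to 1, which is the pseudo-CDF described.

```python
from itertools import accumulate


def running_count(flags):
    """Running total of 1's: entry i is the number of 1's among flags[0..i]."""
    return list(accumulate(flags))


def cumulative_share(flags):
    """Running total divided by the overall total, so the curve ends at 1."""
    counts = running_count(flags)
    total = counts[-1]
    return [c / total for c in counts]


# Made-up Survey flags for five posts, in Blog Number order.
survey = [0, 1, 0, 1, 1]
print(running_count(survey))
print(cumulative_share(survey))
```

If surveys were spread evenly over time the curve would hug the diagonal; a curve that sags below it and then shoots up means the surveys cluster toward the recent end. The sketch assumes at least one flag is 1, otherwise the total is zero.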
Yo, blogland! Time for another round of “stats no one cares about except me!”
Today we’re looking at Word Count by Day of the Week, Month, and Year. I’d like to see if there are any general trends or if I blather on about nothing in relatively consistent bursts across time. Maybe if all these days of analyses reveal some trends, I could try fitting a model to this data. I loves me some model fittin’.
Onwards and upwards!
A. No one day of the week will have a statistically significantly different word count from any other day of the week. I don’t think I blog more or less over the weekend, and I see no reason why any day of the five-day week would have longer blogs than any other.
B. I don’t know if they’ll be significant or not, but I’m predicting that word counts will in general be higher during the spring school months (January–April at least) than the summer/winter months. The more responsibilities I have, the more I turn to blogging for procrastination, and I usually take more credits in the summer.
C. From highest word count to lowest: Year 6, Year 2, Year 5, Year 4, Year 1, Year 3.
Here is a pie chart (a tasty, tasty pie chart!) of the percentage of words I’ve written by the day of the week.
Pretty equal, eh? But what does the ANOVA say? According to the stats, there are no statistically significant differences in word count by day of the week, F = 0.642, p = 0.697. According to the Tukey HSDs, none of the individual pairs of days of the week are statistically significant in terms of their word count, either.
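For the curious, the F in a one-way ANOVA is just the between-group mean square divided by the within-group mean square. Here is a minimal Python sketch with three made-up groups standing in for days of the week (the function name is hypothetical, and this is not the R output quoted in this post):

```python
def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of groups of numbers."""
    all_values = [v for g in groups for v in g]
    grand_mean = sum(all_values) / len(all_values)
    group_means = [sum(g) / len(g) for g in groups]

    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means))
    # Variation of each value around its own group mean.
    ss_within = sum((v - m) ** 2 for g, m in zip(groups, group_means) for v in g)

    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Made-up word counts for three "days", NOT the real data.
f_stat, df_b, df_w = one_way_anova_f([[1, 2, 3], [2, 3, 4], [5, 6, 7]])
print(f"F({df_b},{df_w}) = {f_stat:.2f}")
```

An F near 1 means the group means differ about as much as you’d expect from noise alone, which is roughly what the weekday result looks like.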
Here is another pie chart. This one shows percentage of words by month.
Again, pretty even. Stats? F = 1.505, p = 0.123, meaning that there are no statistically significant differences in word count by month. No statistically significant differences in any of the pairs of months, either.
Finally, we jump to the largest span of time I’m looking at: years! Pie pie pie pie pie:
Haha, holy crap, Year 6 and Year 2 combined account for nearly half of the words in my total blog. Poor little Year 3.
And finally we see some significance! There is a statistically significant difference in word count by blog year, F = 11.021, p < 0.001.
Hypothesis A: Supported! All days of the week are subject to equal amounts of my blathering. Poor things.
Hypothesis B: Eh. Technically January, February, March, April, and May are the wordiest months, but they’re not significantly so.
Hypothesis C: Woo! I totally called it. If anyone’s curious, Year 3 was a word drought because I was living in the house with the guys and I had…other stuff occupying my time.
More to come tomorrow, ladies and gents!
STATS TIME! Are you excited?
First, I want to preface all of this with the list of variables I kept track of when going through my blog archive:
- Blog Number. My first blog is coded as 1, the second as 2, the third is 3, and so on up until 2193.
- Year. Which blogging year the blog came from. There are six years, each spanning May – May.
- Month. January, February, etc.
- Day. The 1st of the month, 2nd of the month, etc.
- Weekday. Monday, Tuesday, etc.
- Word Count. Word count of each post, not counting the title.
- GFI. Gunning Fog Index.
- Punctuation. How many punctuation marks the post contained.
- Title. 0 = title unrelated to blog content, 1 = title directly relevant to blog content, and 2 = ambiguous title; could be related or unrelated.
- Survey. 0 = blog does not contain a survey, 1 = blog contains survey.
- Image. 0 = blog does not contain any images, 1 = blog contains 1+ image(s).
- Category. What category did I tag my blog as (details below).
ALSO NOTE: significance is always judged at the p = 0.05 level. Just didn’t want to have to keep specifying that. :)
So! Today we’re looking at Categories. There are 35 of them (or there will be once I go through and delete all the old “defunct” tags from the few blogs that still have them). Here’s the list in case anybody gives a crap:
So what are we looking at within this sexy, large dataset with respect to categories, then?
Questions of Interest
A) What is the distribution of the categories? That is, which categories are most popular and which are hardly ever used?
B) Do certain categories have a statistically significantly different number of words per post than the other categories?
A: The most popular categories (by percent) will be Blogging, School, and probably Surveys.
B: The least popular categories will be Ramblings and Sports.
C: Categories with a significantly different number of words per post will be Surveys, Philosophy, and Rants.
D: The three categories specified in Hypothesis C will have higher word counts, not lower.
LET’S DO THIS NOISE.
First up, a pie chart! This was my first attempt at visualizing category percentages. By the way, I definitely would have titled this like a good little statistician, but I couldn’t get the image large enough (in my opinion) with the title included. So I’ll call it Percent of Blogs by Category (NOT percent of words by category; that’s just in the ANOVA below).
I had to screw around with this a lot to get it in the easiest to read color scheme. Pie chart with 35 slices = not the best visual, but I think it’s still better than a bar graph in this case.
Table o’ actual counts (click to blow it up so it’s actually readable, haha):
God, all those Blogging blogs.
Second: ANOVAs! Well, okay, just one. But it’s an ANOVA!
According to a more in-depth, ANOVA-driven analysis…
- The mean Word Count per blog is statistically significantly different depending on blog Category, F = 23.184, p < 0.001.
- Blogs in the Surveys category have a significantly higher word count than the other categories, t = 7.739, p < 0.0001.
- Blogs in the Writing category have a significantly higher word count than the other categories, t = 3.624, p < 0.001.
- Blogs in the Philosophy category have a significantly higher word count than the other categories, t = 3.365, p < 0.001.
- Blogs in the Rants category have a significantly higher word count than the other categories, t = 2.480, p < 0.05.
I (or R, rather) also computed a buttload of Tukey HSDs (595 of them!) to test the mean differences between each pair of categories, but most of the significant ones involved (as expected) Surveys, Writing, Philosophy, and Rants.
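The 595 comes from counting every unordered pair of the 35 categories: 35 × 34 / 2 = 595. A quick Python check, with placeholder category names since the actual list isn’t reproduced here:

```python
from itertools import combinations
from math import comb


def pair_count(k):
    """Number of unordered pairs among k groups, i.e. k choose 2."""
    return comb(k, 2)


# Placeholder names; the real category list is not reproduced here.
categories = [f"category_{i}" for i in range(1, 36)]
pairs = list(combinations(categories, 2))
print(pair_count(35), len(pairs))
```

Each of those pairs gets its own Tukey HSD comparison, which is why the count balloons so quickly as categories are added.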
Hypothesis A: supported! Blogging and school, man: my life.
Hypothesis B: mostly supported! There were a few categories that had nearly as few entries as Sports. I’d get rid of the Ramblings category, but then I’d have 34 categories, which isn’t a nicely-dividable number like 35 (I like numbers ending in 0 or 5). Guess I just need to ramble more.
Hypothesis C: mostly supported! I’d totally forgotten about Writing.
Hypothesis D: supported! Surveys, Writing, Philosophy, and Rants contained blogs that had higher than average word counts.
Tune in tomorrow for more stats no one cares about except me!
My blog from May 1, 2006:
“Alrightythen! I finally got a MySpace*. People will now find it easier to stalk me. Anyways, since I’m obsessive-compulsive, I had to start this on the 1st of a Month–I wanted to start this a week or so ago, but NO–that would be in the middle of April. I’m tired now and I want to go to bed, but NO–that would mean I would have to wait another month to start.
Anyway, it’s all good. I’ll be here posting my thoughts every day. At least, that’s what I want to do.
The odds of it happening?”
Oh, silly little high-schooler. If only you knew.
Yes, the internet has been putting up with my daily drivel for six years now. Scary stuff, huh? Haha, I should actually blame Aneel and E’raina for this…they’re the ones that peer-pressured me into getting a MySpace and starting a blog. DO YOU SEE WHAT YOU’VE DONE?!
Anyway, yay! I’ve spent the past two weeks or so going through each and every one of my posts to create a nice large data set. Because I spent so long doing it, I thought it only fitting that I not confine my statistical exploration of my blogs to one night (tonight). Therefore, I will give you six days (one for each year I’ve been blogging) of stats/blogging insanity guaranteed to either interest you or make you wonder why the hell you even read this. Or both. Or neither. Maybe it’ll give you cheese. The possibilities? Endless.
This is what’s going down in the Big Week o’ Blog Stats Celebration:
- Wednesday: Comparisons by blog category
- Thursday: Mean word comparisons by weekday, month, and year
- Friday: The use and frequency of images and surveys
- Saturday: Common words and topics
- Sunday: Overall trends: Gunning Fog Index and word count
- Monday: The best blogs
Excited? I am! Any excuse to do some recreational stats.
Here’s to six more years (at least)!
*YES, I started out on MySpace. Give me a break, it was the thing back in 2006, you know it was.