O LAWDY, WE CODIN’
This was one of my favorite topics we covered in that Algorithms class I took in my final semester at U of I:
This is such a great explanation of it too. I love the graph.
Eulering
Heyyyyyyyyy, what’s up, fools?
So remember Project Euler, that site that has hundreds of programming challenge problems? Well, I haven’t had much time for it lately (blame school), but today I decided to log back in and see if there was a problem I could try. And I found one!
This is the problem:
Passcode derivation (Problem 79): A common security method used for online banking is to ask the user for three random characters from a passcode. For example, if the passcode was 531278, they may ask for the 2nd, 3rd, and 5th characters; the expected reply would be: 317.
The text file, keylog.txt, contains fifty successful login attempts.
Given that the three characters are always asked for in order, analyse the file so as to determine the shortest possible secret passcode of unknown length.
This is one that I was able to solve by hand pretty easily, but since it’s a coding challenge site, I figured I ought to give it a shot using R. It took me a bit to get my code just right (there was one particular thing I was trying to do and I couldn’t figure out how to do it in R, so I had to modify things a bit), but I finally got it right!
Anyway, I’m not going to share my code here (it’s discouraged to share solutions outside the problem forums, each of which can only be accessed once you’ve input the correct answer for a given problem), but I thought that this was a super interesting and fun question to try. It’s easy to do by hand, but in my opinion a bit harder to do with code.
If you like this type of stuff, try it out!
Also, happy birthday, mom!
Conduct of Code
OH MY GOD this looks like fun.
From the site (and in case you don’t want to click the link for whatever reason): “Project Euler is a series of challenging mathematical/computer programming problems that will require more than just mathematical insights to solve. Although mathematics will help you arrive at elegant and efficient methods, the use of a computer and programming skills will be required to solve most problems. The motivation for starting Project Euler, and its continuation, is to provide a platform for the inquiring mind to delve into unfamiliar areas and learn new concepts in a fun and recreational context.”
The problems look fairly challenging (at least, challenging in R, which of course is my programming language of choice, I mean c’mon), but at least it will give me a good excuse to practice!
Edit: hahaha, I’ve done like five of them already. But the rest look super hard!
I…
Just taught myself how to write macros in Excel. A lot of them. In like two hours. It was fantastic.
I basically figured out how to do in Excel what I was trying to do in R. It’s actually a lot easier to implement in Excel, especially since (I totally just learned this, pardon my beginner’s excitement) you can create a hyperlink in a cell that links to another cell within the same spreadsheet. Totally had no idea you could do that.
But now I can do stuff in Visual Basic. Woot.
Okay, that’s all.
Adventures in R: Creating a Pseudo-CDF Plot for Binary Data
(Alternate title: “Ha, I’m Dumb”)
(Alternate alternate title: “Skip This if Statistics Bore You”)
You may recall that a few days ago, during one of my Blog Stats blogs, I mentioned the problem of creating a cumulative distribution function-type plot for binary data, which would show the cumulative number of times one of the two values of a binary variable occurred over the range of another variable.
Um, let’s go to the actual example, ‘cause that description sucked.
Let’s say I have two variables called Blogs and Images for a set of data for which N = 2193. The variable Blogs gives the blog number for each post, so it runs from 1 to 2193. The variable Images is a binary variable and is coded 0 if the blog in question contains no image(s) and 1 if the blog contains 1 or more images.
Simple enough, right?
So what I was trying to do was create an easy-to-interpret visual that would show the increase in the cumulative number of blogs containing images over time, where time was measured by the Blogs variable.
Not being ultra well-versed in the world of visually representing binary data, this was the best I could come up with in the heat of the analysis:
If you take a look at the y-axis, it becomes clear that due to the coding, the Images variable could only either equal 0 or 1. When it equaled 1, this plot drew a vertical black line at the spot on the x-axis that matched the corresponding Blogs variable. It’s not the worst graph (and if you scan it at the grocery store, you’ll probably end up with a bag of Fritos or something), but it’s not the easiest-to-interpret graph on the planet either, now is it?
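For anyone who wants to recreate a barcode-style plot like that, here is a rough sketch. The real data isn't shown in this post, so ximage below is simulated (the 30% image rate and the seed are made-up assumptions), and blogs is just a hypothetical name for the Blogs variable.

```r
# Simulated stand-in for the real data (assumption: the actual values are not shown here)
set.seed(1)
n <- 2193
blogs <- 1:n                                 # blog number, 1 through 2193
ximage <- rbinom(n, size = 1, prob = 0.3)    # 1 = has image(s), 0 = none; 0.3 is made up

# type = "h" draws a vertical line up to each value, so only the 1s show up as bars
plot(blogs, ximage, type = "h", xlab = "Blogs", ylab = "Images", yaxt = "n")
axis(2, at = c(0, 1))                        # the y-axis only needs the two coded values
```

With real data you would swap the rbinom() line for your own 0/1 vector.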
What I was really looking for was some sort of cumulative distribution function (CDF) plot, but for binary data. I like how Wiki puts it: “Intuitively, [the CDF] is the ‘area so far’ function of the probability distribution.” As you move right on the x-axis, the height of the CDF curve (read off the y-axis) gives the probability that the variable is less than or equal to the value at that point on the x-axis. That’s assuming your y-axis is set for probability (mine isn’t, but it’s still easy to interpret). This is all well and good for well-behaved ratio data, but what happens if I want to do such a plot for a dichotomously coded variable?
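To make the CDF idea concrete for ordinary numeric data, here is a small sketch using ecdf() from R's built-in stats package. The data is simulated (the mean, SD, and sample size are arbitrary assumptions), and Fn is just an illustrative name.

```r
# Simulated continuous data (assumption: any well-behaved numeric variable would do)
set.seed(1)
x <- rnorm(200, mean = 50, sd = 10)

Fn <- ecdf(x)    # ecdf() returns a step function estimating P(X <= x)
plot(Fn, main = "Empirical CDF", xlab = "x", ylab = "Proportion <= x")

# Evaluating the function at a point gives the proportion of values at or below it
Fn(50)
mean(x <= 50)    # the same proportion, computed directly
```

The curve climbs from 0 to 1 as you move right, which is the “area so far” behavior in the quote.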
There were two ways to go about this:
1) Be a spazz and write some R code to get it done, or
2) Be an anti-spazz and look up if anybody’s written some R code to get it done.
I originally wanted to do 1, which I did, but 2 was actually a lot harder than it should have been.
Let’s look at 1 first. I wanted to plot the number of blogs containing images against time, measured by the Blogs variable. Since I coded blogs containing images as 1 and blogs not containing images as 0, all I needed to get R to do was spit out a list of the cumulative sum of the Images variable at each instance of the Blogs variable (so a total of 2193 sums). Then plot it.
R and I have a…history when it comes to me attempting to write “for” loops. But it finally worked this time. I’ll just give you that little segment, ‘cause the rest of the code’s for the plotting parameters and too long/bothersome to throw on here.
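for (m in seq_along(ximage)) {
  newimage <- ximage[1:m]               # first m entries of the Images vector
  xnew <- sum(newimage)                 # number of image blogs up through blog m
  t <- cbind(m, xnew)                   # pair the blog number with its running total
  points(t, type = "h", pch = "1")      # add a vertical line to the pre-created blank plot
}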
ximage is the name of the vector containing the coded Images variable. For each m from 1 to 2193, this little “for” loop creates a new variable (newimage) holding the first m entries of the Images variable. Another new variable (xnew) calculates the sum of 1s in each newimage. t combines the Blogs number (1 through 2193) with the matching xnew. Finally, the points of t are plotted (on a pre-created blank plot).
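Since the loop only works on top of a plot that already exists, here is a minimal self-contained sketch of the whole thing. It is not the full original code: ximage is simulated (made-up 30% image rate), the axis labels are guesses, and running is a hypothetical helper vector added only so the totals can be inspected afterward.

```r
# Simulated stand-in for the coded Images variable (assumption, not the real data)
set.seed(1)
ximage <- rbinom(2193, size = 1, prob = 0.3)

# Blank canvas: plotting NA draws the axes but no points
plot(NA, xlim = c(1, length(ximage)), ylim = c(0, sum(ximage)),
     xlab = "Blogs", ylab = "Cumulative blogs with images")

running <- numeric(length(ximage))    # hypothetical helper: stores each running total
for (m in seq_along(ximage)) {
  xnew <- sum(ximage[1:m])            # number of 1s among the first m blogs
  running[m] <- xnew
  points(m, xnew, type = "h")         # vertical line from 0 up to the running total
}
```

The last entry of running should match sum(ximage), which is a quick sanity check that the loop counted every image blog.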
So. Wanna see?
Woo!
So I actually figured this out on Wednesday, but I didn’t blog about it because I wanted to see if I could find a function that already does what I wanted. Why did it take an extra three days to find it? Because I couldn’t for the life of me figure out what that type of plot was called. It’s not a true CDF because the curve tracks a running count of 1s across blogs rather than the probability that a variable falls at or below some value. But after obsessively searching (this is the reason for the alternate title: I should have known what this type of plot was called), I finally found a (very, very simple) function that makes what this is: a cumulative frequency graph (I know, I know, duh, right?).
So here’s the minuscule little bit of code needed to do what I did:
cumfreq=cumsum(ximage)
plot(cumfreq, type="h")
The built-in function (it was even in the damn base package. SHAME, Claudia, SHAME!!) cumsum gives a vector of the sum at each instance of ximage; plotting that makes the exact same graph as my code (except I manually fancied up my axes in my code).
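If you'd like to convince yourself the two approaches really agree, here is a quick sketch on simulated data (the real ximage isn't included here, so the 0/1 vector below is made up). loop_sums and cumprop are illustrative names, and the rescaling at the end is an optional extra, not part of the original plot.

```r
set.seed(1)
ximage <- rbinom(2193, size = 1, prob = 0.3)    # simulated 0/1 data (assumption)

# Loop-style running totals, rebuilt with sapply for comparison
loop_sums <- sapply(seq_along(ximage), function(m) sum(ximage[1:m]))
cumfreq <- cumsum(ximage)

all(loop_sums == cumfreq)    # TRUE: both approaches give the same totals

# Optional: divide by the final total to put the y-axis on a 0-to-1 scale
cumprop <- cumfreq / sum(ximage)
plot(cumprop, type = "h", xlab = "Blogs", ylab = "Proportion of image blogs so far")
```

Rescaling like this makes the y-axis read more like a CDF's, though it is still a cumulative frequency graph underneath.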
Cool, eh?
Maybe I’ll post my full code once I make it uncustomized to this particular problem.



