Sunday, June 01, 2008
Statistics, Charts and R
For a few weeks now, I've been playing around with the statistical computing package "R" to draw some graphs/plots of data. Why would I want to do this? Let me quote one of my favorite movies, Sneakers:
"The world isn't run by weapons anymore, or energy, or money. It's run by little ones and zeroes, little bits of data. It's all just electrons. There's a war out there, old friend. A world war. And it's not about who's got the most bullets. It's about who controls the information. What we see and hear, how we work, what we think... it's all about the information!"
...Well, you get the point. We all know that information is king, and analyzing and visualizing that information is becoming more and more important. Presenting data in a way that is easy to digest is key to getting your audience to understand that data. So let me make an introduction...
R is an open source environment for statistical computing. It can do some pretty neat breakdowns of your data and has a lot of built-in functions for doing so. One of its great strengths is generating production-quality graphics and charts. This is what I needed it for and what I'll be explaining here in a moment. I learned R by watching a video introduction to R created by Decision Science News. There were two videos, actually, but they're not very long and they got me to a base level. I then installed R on my Mac, which was cake: go to the R site, download the DMG, run the R executable and you're ready to go. That got me up and running so I could start playing around on my own and using other examples from the web. YMMV on other platforms.
Now for the example. Let's set the stage. Say you have some data in a table, for example, the results of a race my girlfriend competed in, the 2006 San Diego 10K. I copied and pasted that data into a file, scrubbed it down, did some math with perl to turn each finishing time into a number of seconds, and ended up with a CSV file. Download the file, save it locally, and read it in with R:
race<-read.csv("race.csv")
You are reading that CSV file in as a table into a variable called 'race'. Because that CSV has a header as its first line, R automatically names the columns after those headers. To reference a column, use the '$' syntax: 'race$CITY' gives you the 'CITY' column. To check out what you've just done, type "race" on the console. Typing the variable name will spit it all back out. To see a breakdown of what that variable contains, type:
summary(race)
To see stats on the racers' ages, type in:
summary(race$AGE)
Which spits out:
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  10.00   28.00   35.00   37.06   44.00   81.00
Minimum age of a runner was 10, oldest was 81. Average age was 37.06 years old. Doing this for race$SEX shows us there were 411 women and 474 men. Neat! Now for the visuals:
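Before the plots, a quick aside. To try the steps above without the real results file, here's a minimal sketch that builds a few made-up rows, converts finishing times to seconds in R instead of perl, and runs the same summaries. The TIME column, the to_seconds helper, and every value in the data frame are invented for illustration; only the CITY, AGE, SEX and SECONDS column names come from the example. One thing to watch for: on R 4.0 or later, read.csv no longer turns text columns into factors by default, so summary(race$SEX) may just report the column's length and class, while table(race$SEX) gives counts either way.

```r
# Made-up sample rows (NOT the real race results); TIME is a hypothetical column.
sample_rows <- data.frame(
  CITY = c("San Diego", "La Jolla", "Chula Vista", "San Diego"),
  AGE  = c(28, 35, 44, 10),
  SEX  = c("F", "M", "M", "F"),
  TIME = c("0:52:10", "0:47:05", "1:03:30", "0:58:45"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: convert "H:MM:SS" or "MM:SS" strings to seconds.
to_seconds <- function(hms) {
  parts <- strsplit(hms, ":", fixed = TRUE)
  vapply(parts, function(p) {
    p <- as.numeric(p)
    # Weights are 3600, 60, 1 for three fields, or 60, 1 for two.
    sum(p * 60^((length(p) - 1):0))
  }, numeric(1))
}

sample_rows$SECONDS <- to_seconds(sample_rows$TIME)

# Round-trip through a temporary CSV, then read it back like race.csv.
csv_path <- tempfile(fileext = ".csv")
write.csv(sample_rows, csv_path, row.names = FALSE)
race_demo <- read.csv(csv_path)

summary(race_demo$AGE)   # min, quartiles, median, mean, max
table(race_demo$SEX)     # counts per category, factor or character
```

With the real file, the same two calls on race$AGE and race$SEX should give the age summary and the women/men counts quoted above.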

Below is a script I used to generate the graph above. You can see how I am plotting the dots, and drawing both lines:
race<-read.csv("race.csv")
# Main plot: finishing time in minutes vs. age
plot(race$SECONDS/60, race$AGE,
     col="#5fae27",
     main="",
     xlab="Minutes",
     ylab="Age",
     cex=0.5,
     type="p")
# Set the title
title(main="Age vs Time")
# Draw the red line (lowess smoother, narrow span)
lines(stats::lowess(race$SECONDS/60, race$AGE, f=0.1),
      col="red",
      lwd=2)
# Draw the blue line (lowess smoother, wider span)
lines(stats::lowess(race$SECONDS/60, race$AGE, f=0.3),
      col="blue",
      lwd=3)
Not too hard, not too much code...pretty easy in fact! One of the great things about R is the built-in help. For any of those functions, just type:
?function
...and you'll have immediate help (for example, ?plot or ?lowess). I encourage you to do that for the example above, to better understand it. It will describe far better than I can how each one of those functions works.
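Since the two lines in the script differ only in lowess's f argument, here's a small sketch of what that knob does. ?lowess describes f as the smoother span, the proportion of points that influence the smooth at each value, so a larger f gives a smoother line. The data below is randomly generated stand-in data, not the race results, and roughness is a throwaway helper I made up for the comparison. The plot is written to a temporary PDF so the snippet also runs outside an interactive session.

```r
# Synthetic stand-in data (NOT the race results).
set.seed(42)
minutes <- sort(runif(200, min = 35, max = 90))
age     <- 35 + 5 * sin(minutes / 8) + rnorm(200, sd = 8)

# Same smoother as the script above, with the two spans it uses.
fit_narrow <- lowess(minutes, age, f = 0.1)  # follows local wiggles
fit_wide   <- lowess(minutes, age, f = 0.3)  # smoother overall trend

# Hypothetical helper: total up-and-down movement of a fitted line.
roughness <- function(fit) sum(abs(diff(fit$y)))

# Draw to a temporary PDF file instead of the screen.
pdf_path <- tempfile(fileext = ".pdf")
pdf(pdf_path)
plot(minutes, age, col = "#5fae27", cex = 0.5,
     xlab = "Minutes", ylab = "Age", main = "Effect of lowess span")
lines(fit_narrow, col = "red",  lwd = 2)
lines(fit_wide,   col = "blue", lwd = 3)
dev.off()

# Compare how much each fitted line moves up and down.
c(narrow = roughness(fit_narrow), wide = roughness(fit_wide))
```

The narrow-span line should come out noticeably rougher, which is why the red line in the race plot wiggles more than the blue one.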
Let's generate another one, a histogram. That's real easy, you can just type this:
hist(race$AGE, col="red", xlab="Age", breaks=100, main="Histogram of Racers Age")
To generate this:

So, what did I learn from the creation of these plots? My initial suspicion was that younger people would do better in the race, but the data shows that it's almost average across the board. The average age is in the late 30s, but the histogram shows the biggest group was in their mid-to-late 20s. Hardly anyone in their early 20s even entered the race...too busy drinking? Also, there is a neat little cluster at the bottom left of the plot that shows a group of young kids in their teens that did well.
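The eyeballed observations above can also be checked with numbers. A rough sketch, using randomly generated stand-in data rather than the real results: cut() bins ages into 5-year groups, table() counts runners per group, and tapply() averages finishing time per group. The bin edges and the demo data frame are my own choices for illustration; swap in race$AGE and race$SECONDS to run it on the real file.

```r
# Synthetic stand-in data (NOT the race results).
set.seed(7)
demo <- data.frame(
  AGE     = sample(10:81, 300, replace = TRUE),
  SECONDS = round(runif(300, min = 35 * 60, max = 90 * 60))
)

# 5-year age groups: [10,15), [15,20), ..., [80,85)
edges <- seq(10, 85, by = 5)
demo$GROUP <- cut(demo$AGE, breaks = edges, right = FALSE)

runners_per_group <- table(demo$GROUP)
mean_minutes      <- tapply(demo$SECONDS / 60, demo$GROUP, mean)

# Which age group has the most runners?
names(runners_per_group)[which.max(runners_per_group)]

# Average finishing time per group, rounded to one decimal place.
round(mean_minutes, 1)
```

If the per-group averages on the real data sit close together, that backs up the "almost average across the board" reading of the scatter plot.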
I have been making more of these, mostly around sysadmin type stuff. I'll post those as I get more time.