
TUESDAY, 10 JULY 2012

Sourcing Code From GitHub

In previous posts I described how to input data stored on GitHub directly into R.

You can do the same thing with source code stored on GitHub. Hadley Wickham has made the whole process easier by combining the getURL, textConnection, and source commands into one function, source_url, found in his devtools package.

Imagine we have a .R source code file like this:

# Make cars scatter plot
library(ggplot2)
Plot <- qplot(cars$dist, cars$speed) +
  theme_bw()
print(Plot)


It is hosted on GitHub with the URL: https://raw.github.com/christophergandrud/christophergandrud.github.com/master/SourceCode/CarsScatterExample.R

So to run this source code directly in R, all we need to type is:

library(devtools)

SourceURL <- "https://raw.github.com/christophergandrud/christophergandrud.github.com/master/SourceCode/CarsScatterExample.R"

source_url(SourceURL)


There you go.
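For comparison, here is roughly what source_url does behind the scenes, based on the description above (a sketch; the ssl.verifypeer = FALSE argument is only needed if you hit the SSL certificate error covered in the 15 June post below):

library(RCurl)

SourceURL <- "https://raw.github.com/christophergandrud/christophergandrud.github.com/master/SourceCode/CarsScatterExample.R"

# Download the raw script as text, then run it
Code <- getURL(SourceURL, ssl.verifypeer = FALSE)
source(textConnection(Code))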

You can also directly source GitHub gists (which are nice for sharing short bits of code) with the source_gist command.
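For example (a sketch; the gist ID here is hypothetical):

library(devtools)

# source_gist takes a gist ID or URL
source_gist("1234567")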

FRIDAY, 22 JUNE 2012

Salmon's Rules for Researcher/Bloggers

Felix Salmon had a nice piece the other day about the Jonah Lehrer self-plagiarism affair. (Basically, the journalist Jonah Lehrer copied some things he had published elsewhere and posted them on his New Yorker blog.) Felix uses the controversy surrounding this event to write up four useful blogging rules for print journalists who also blog.

I think these rules are probably also useful for academic researchers who also blog. To paraphrase Salmon’s rules:

“Hey Look at This”: Blogging is about reading rather than writing. Point others to something interesting you read.

This can serve a very important function in academia where, without blogs and the like, most research is confined to low-readership journals and conference presentations.

Link, Do Not Repeat: Because any content on the internet is just a link away, you never have to repeat it.

Not needing to repeat content frees researchers to build on others’ work. It also fits well with an established culture of citing others’ work.

Blogs are Interactions, Not Just Primary Sources: Read, generously link to, summarise, and fill in the gaps of other blogs and web content.

This is another key part of the same process as rule 2.

A Blog Post is the Beginning, Not the End: Use a blog to develop ideas.

This is the one I like the most. Even if no one read my blog I would still write it. Writing the blog has become part of my learning process. Writing a post makes me sharpen my ideas and, in the case of technical posts, my skills. I also regularly go back to previous posts to remember how to do something.

Of course the social aspect of blogging about research ideas further increases its benefits for researchers. A recent example: when I posted about using GitHub to host data, someone wrote to me with a problem they were having with my code. Working through it, I learned how to solve the problem. Commenters on my solution post then shared even more information about the issue.

I now know how to solve this problem so that I can get my research done, other people can also find the solution, and I have a record if I forget what the solution was.

SUNDAY, 17 JUNE 2012

TMI

I just finished reading Lisa Anderson’s somewhat didactic article in the most recent issue of Perspectives on Politics. It’s called: Too Much Information? Political Science, the University, and the Public Sphere.

In the article she is concerned about the role of academic political scientists in a world where students and policymakers have free access to large amounts of information. To a certain extent I think her description of the present’s distinctiveness from the past is a bit overblown. Much as communication today is not drastically faster than in the past (at least since the laying of transoceanic telegraph cables), she probably overstates the primacy of political scientists as “central to the development, collection, and dissemination of knowledge”. For example, think tanks–which she and Stephen Walt identify as challengers to academic political science’s primacy as a source of information–are certainly not new.

At the same time, I think she overly downplays the contemporary importance of academic political scientists as gatherers of new information, i.e. as researchers. Much of the best political science research today uses new technologies–like web scraping and analysing Twitter feeds–to gather new information. Technological advancements that are driving the free access of greater information, similarly make it easier for these researchers to disseminate what they have found. Political science is already part of these new processes.

However, I really enjoyed her overall insights about how political scientists could reorient ourselves to a world of ‘too much information’. She argues that we

need to revitalize the spirit of playful, inventive, open excitement that is, or should be, the hallmark of genuine education–and to do that [we] will need to remember that, unlike the think tanks and policy shops of the world, [our] principal line of work is, well, education.

Initially, it was unclear to me why this is any more necessary for contemporary political scientists than for those in the past. On second thought, there is a new pressure created by technological change. Given the quickly improving quality of distance education courses offered by prestigious universities, and by people who do bring a ‘playful inventiveness’ to their teaching, it is unclear why people would accept anything less.

Somewhat more directly tied to recent technological advances, she argues that:

Presenting the finished, polished, completed findings from research conducted in a political science department to policymakers … is neither what today’s policymakers need–it takes too long to produce, it is not interactive or mobile, it precludes questions: in short, it does not reflect the requirements of the audience, any audience today–nor is it what a true political scientist is, or should be, really good at.

The conclusions she draws here are perhaps too strong. There is certainly a role for information developed to the quality required for scholarly publication. Nonetheless, this point particularly resonated with me, given my own recent blogging on preliminary research findings and work making data more accessible.

One of the central challenges for academics of my generation will likely be finding a balance where our research is timely, directly benefits our students and policymakers, and itself benefits from a wider community’s input, while also achieving high academic quality.

Finally, I used to be somewhat skeptical of the sort of buzzwords in the next quote. However, the more engaged I have become with new technologies and new ways of conducting research and teaching, the more I recognise their importance for the ways higher education and high-level research are changing–I think for the better:

… what we do in our professional lives should be about learning … “life-long learning” [is] erasing the categorical distinctions between student and teacher. The hierarchical relationship of deference once accorded those with privileged access to information is fast disappearing, replaced by collaborative learning, crowd-sourcing, social networks, and webs of reciprocity.

FRIDAY, 15 JUNE 2012

Update to Data on GitHub Post

A reader of my most recent post tried the R code I had written to download the data set of electoral disproportionality from the GitHub repository. However, it didn’t work for them. After entering disproportionality.data <- getURL(url) they got the error message:

Error in function (type, msg, asError = TRUE)  :
SSL certificate problem, verify that the CA cert is OK. Details:
error:14090086:SSL routines:SSL3_GET_SERVER_CERTIFICATE:certificate verify failed

The Solution

The problem seems to be that they didn’t have a certificate from an appropriate signing agent (see the RCurl FAQ page, near the bottom, for more information; if you are really interested in SSL verification, this page from Red Hat is a place to look).

The solution to this problem is pretty straightforward. As the RCurl FAQ page points out, you can use the argument ssl.verifypeer = FALSE to skip certificate verification (note that this leaves the connection open to man-in-the-middle attacks).

So, if you get the above error message just use this new code:

library(RCurl)

url <- "https://raw.github.com/christophergandrud/Disproportionality_Data/master/Disproportionality.csv"

disproportionality.data <- getURL(url, ssl.verifypeer = FALSE)

disproportionality.data <- read.csv(textConnection(disproportionality.data))

That should work.
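If you would rather keep certificate verification on, an alternative sketch (assuming the CA certificate bundle that ships with the RCurl package works on your system) is to point getURL at a certificate file with curl’s cainfo option:

library(RCurl)

# Use the CA certificate bundle included with RCurl
Cert <- system.file("CurlSSL", "cacert.pem", package = "RCurl")

url <- "https://raw.github.com/christophergandrud/Disproportionality_Data/master/Disproportionality.csv"

disproportionality.data <- getURL(url, cainfo = Cert)

disproportionality.data <- read.csv(textConnection(disproportionality.data))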

Question

I didn’t originally mention this issue because I didn’t encounter it when I ran the code on my Mac. When I tried the code on a Windows machine I was able to replicate the error.

Does any reader know why Windows computers (or any other types) lack certificates from an appropriate signing agent needed to download data from GitHub? How can you get one?

MONDAY, 11 JUNE 2012

Data on GitHub

Update (15 June 2012): See this post for instructions on how to download GitHub-based data into R if you are getting the error about an SSL certificate problem.

GitHub is designed for collaborating on coding projects. Nonetheless, it is also a potentially great resource for researchers to make their data publicly available. Specifically you can use it to:

• track changes,
• make data publicly available for replication,
• create a website to nicely present key information about the data,

and uniquely:

• benefit from error checking by the research community.

This is an example of a data set that I’ve put up on GitHub.

How?

Taking advantage of these things through GitHub is pretty easy. In this post I’m going to give a brief overview of how to set up a GitHub data repository.

Note: I’ll assume that you have already set up your GitHub account. If you haven’t done this, see the instructions here (for set up in the command line) or here (for the Mac GUI program) or here (for the Windows GUI program).

Store Data in the Cloud

A data set basically consists of two parts: the data itself and description files that explain what the data mean and how they were obtained. Both of these can be simple text files, easily hosted on GitHub:

1. Create a new repository on GitHub by clicking on the New Repository button on your GitHub home page. A repository is just a collection of files.

• Have GitHub create a README.md file.

2. Clone the repository to your computer.

• If you are using GUI GitHub, on your repository’s GitHub main page simply click the Clone to Mac or Clone to Windows button (depending on your operating system).

• If you are using command line git:

- First copy the repository’s URL. This is located on the repository’s GitHub home page near the top (it is slightly different from the page URL).

- In the command line just use the git clone [URL] command. To clone the example data repository I use for this post, type:

  $ git clone https://github.com/christophergandrud/Disproportionality_Data.git

- Of course you can choose which directory on your computer to put the repository in with the cd command before running git clone.

3. Fill the repository with your data and description file.

• Use the README.md file as the place to describe your data–e.g. where you got it from, what project you used it for, any notes. This file will be the first file people see when they visit your repository.

- To format the README.md file use Markdown syntax.

• Create a Data folder in the repository and save your data in it using some text format. I prefer .csv. You can upload other types of files to GitHub, but if you save them in a text-based format others can directly suggest changes and you can more easily track changes.

4. Commit your changes and push them to GitHub.

• In GUI GitHub click on your data repository, write a short commit summary, then click Commit & Sync.

• In command line git, first change your directory to the data repository with cd. Then add your changes with:

  $ git add .

  This adds your changed files to the “staging area” from where you can commit them. If you want to see which files were changed, type git status -s.

- Then commit the changes with:

  $ git commit -m 'a comment describing the changes'

- Then push the committed changes to GitHub with:

  $ git push origin master

5. Create a cover site with GitHub Pages. This creates a nice face for the data repository. To create the page:

• Click the Admin button next to your repository’s name on its GitHub main page.

• Under “GitHub Pages” click Automatic Page Generator. Then choose the layout you like, optionally add a tracking ID, and publish the page.

Track Changes

GitHub will now track every change you make to all files in the data repository each time you commit the changes. The GitHub website and GUI program have a nice interface for seeing these changes.

Replication Website

Once you set up the page described in Step 5, other researchers can easily download the whole data repository as either a .tar.gz or .zip file. They can also go through your main page to the GitHub repository.

Specific data files can be directly downloaded into R with the RCurl package (plus textConnection and read.csv from base R). To download my example data into R just type:

library(RCurl)

url <- "https://raw.github.com/christophergandrud/Disproportionality_Data/master/Disproportionality.csv"

disproportionality.data <- getURL(url)

disproportionality.data <- read.csv(textConnection(disproportionality.data))

Note: make sure you copy the file’s raw GitHub URL.

You can use this to load GitHub-based data directly into your Sweave or knitr file for replication.
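For example, in a knitr .Rnw file the download could live in its own code chunk (a minimal sketch; the chunk name is my own):

<<LoadDisproportionality, echo=FALSE>>=
library(RCurl)

url <- "https://raw.github.com/christophergandrud/Disproportionality_Data/master/Disproportionality.csv"

disproportionality.data <- read.csv(textConnection(getURL(url)))
@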

Improve your data through community error checking

GitHub has really made open source coding projects much easier. Anybody can view a project’s entire code and suggest improvements. This is done with a pull request. If the owner of the project’s repository likes the changes they can accept the request.

Researchers can use this same function to suggest changes to a data set. If other researchers notice an error in a data set they can suggest a change with a pull request. The owner of the data set can then decide whether or not to accept the change.

Hosting data on GitHub and using pull requests allows data to benefit from the kind of community-led error checking that has been common on wikis and open source coding projects for a while.

Git Resources

• Pro Git: a free book on how to use command line git.

• Git Reference: another good reference for command line git.

• github:help: GitHub’s reference pages.

MONDAY, 4 JUNE 2012

Slidify

Tools for using R/RStudio as a one-stop shop for research and presentation have been coming out quickly. I think this one has a good chance of being included in future releases of RStudio:

The other day I ran across a new R package called slidify by Ramnath Vaidyanathan. In previous posts I had been messing around with Pandoc and deck.rb to turn knitr Markdown files into HTML presentations.

Slidify has two key advantages over these approaches:

• it can directly convert .Rmd files in R into slideshows, i.e. no toggling between R and the Terminal,

• there are lots of slideshow options (deck.js, dzslides, html5slides, shower, and slidy).

It’s not on CRAN yet, but it worked pretty well for me.

The syntax is simple.

• In the Markdown document demarcate new slides with --- (it has to be three dashes and there can’t be spaces after the dashes).

• When you want to convert your .Rmd file into a presentation just type:

  library(slidify)
  slidify("presentation.Rmd")

The default style is html5slides. The package isn’t that well documented right now, but to change to a different style just use the framework argument. For example:

    slidify("presentation.Rmd", framework = "deck.js")

I used slidify to put together a slideshow that advertises an intro applied stats course I’m teaching next semester. The slideshow is here. (You can see that I’m trying to attract social science students who are reluctant to take a stats class).

I sloppily removed the default Slidify logo by deleting the images folder in the html5slides folder slidify creates.

PS

Oh, also you might notice that I’m using GitHub to host the course. I hope to blog about this in the near future.

SUNDAY, 3 JUNE 2012

knitr, Slidy, Dropbox

I just noticed that Markus Gesmann has a nice post on using RStudio, knitr, Pandoc, and Slidy to create slideshows. After my recent attempt to use deck.rb to turn a Markdown/knitr file into a deck.js presentation I caved in and also decided to go with Pandoc and Slidy.

For me, Slidy produced the cleanest slides of the three formats that Pandoc supports. The presentation is here and the source is here.

The only thing I really disliked was having to use <br /> or something similar to keep the text from bunching up at the top of the slides, which looked strange when projected onto a screen. You can customise Slidy CSS files, but I haven’t got around to that yet.

In this post I don’t want to duplicate what Markus Gesmann has already done. Instead, I wanted to mention two things that I noticed/thought about while making my presentation:

• The new MathJax syntax implemented in RStudio 0.96.227 doesn’t seem to work with Pandoc. It just renders latex as if it were part of the equation rather than the qualifier to the equation begin delimiter. To get around this I just used the regular old $ and $$ syntax.

• It’s pretty easy to host presentations with Dropbox. Just make sure all of your files are in the same folder in your Public folder. If you want output from knitr to go into and be retrieved from someplace else, you can set the desired base URL for these files by adding this code after the Pandoc title information:

```{r setup, echo=FALSE}
# Set up R so that figures save and load properly.
opts_knit$set(base.url = "")
```


• In base.url = "", put the URL of the folder you want the output stored in between the quotation marks.

• All items in a folder in Dropbox’s Public folder have the same base URL.
• I learned about base.url from Yihui Xie’s source code for his knitr/Markdown example on GitHub. He uses it to save and retrieve figures from other folders on GitHub.
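For example, the setup chunk line with a filled-in base URL might look like this (the Dropbox URL is hypothetical):

opts_knit$set(base.url = "https://dl.dropbox.com/u/12345678/presentation/")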

Extra: Pandoc Code

I used the following Pandoc command in the Terminal to convert the .md file to Slidy:

pandoc -t slidy leg_violence_present1.md -o leg_violence_present1.html -s -i -S --mathjax