psData: Political Science Panel-Series Data (v2)

Christopher Gandrud

csv,conf: 15 July 2014

Talk Aims


  • Describe political science/panel-series data problems.

  • Introduce psData, our first attempt at a solution.

  • Thoughts for future framework.

  • Get ideas for going forward/avoiding effort duplication.

What is panel-series data?


Country Year Var1 Var2 Var3 Var4 Var5 Var6 Var7 Var8
Angola 2000
Angola 2001
Brazil 2000
Brazil 2001
Cambodia 2000
Cambodia 2001
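
As a minimal sketch of the layout above, a panel-series data set can be held in an ordinary R data frame with one row per country-year. The variable name `var1` and its values are placeholders invented for illustration, not data from any real source.

```r
# Toy panel-series data: one row per country-year.
# Values are invented placeholders, not real data.
panel <- expand.grid(
  year = c(2000, 2001),
  country = c("Angola", "Brazil", "Cambodia"),
  stringsAsFactors = FALSE
)

# Put the panel identifier first, then the time identifier
panel <- panel[, c("country", "year")]
panel$var1 <- seq_len(nrow(panel))

# In a well-formed panel each country-year pair appears exactly once
is_unique <- !any(duplicated(panel[, c("country", "year")]))
```

The uniqueness check is the property that later makes merging on country and year safe.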

Good


Many political scientists are creating/making publicly available panel-series data sets!


Issues (1)


  • Data tied to particular studies.

  • Data posted and maintained haphazardly.

Issues (2)


  • Variety of formats (SPSS, Stata, Excel, CSV).
    • Excel data is often a mess.
  • Variety of panel/series identifiers, even for regularly used panels (especially countries).

  • Some variables suggested by the literature (e.g. winset) are composed of other variables, but aren’t regularly updated.
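
To illustrate the identifier problem, here is a small sketch in which two sources label the same countries differently and both are mapped to ISO two-letter codes. The data frames, column names, and lookup tables are hypothetical and written by hand for this example. A real workflow would use a maintained concordance rather than hard-coded vectors.

```r
# Two hypothetical sources that identify the same countries differently
source_a <- data.frame(country = c("Brazil", "Cambodia"),
                       year = 2000, x = c(1, 2),
                       stringsAsFactors = FALSE)
source_b <- data.frame(cname = c("BRA", "KHM"),
                       year = 2000, y = c(10, 20),
                       stringsAsFactors = FALSE)

# Hand-written lookup tables to ISO 3166-1 alpha-2 codes (illustration only)
name_to_iso2c <- c(Brazil = "BR", Cambodia = "KH")
iso3c_to_iso2c <- c(BRA = "BR", KHM = "KH")

# Add a common identifier column to both sources
source_a$iso2c <- unname(name_to_iso2c[source_a$country])
source_b$iso2c <- unname(iso3c_to_iso2c[source_b$cname])
```

Once both sources carry the same `iso2c` column, they can be matched without any further name cleaning.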

RR Screenshot

Consequences


  • Effort Duplication: Political scientists (or RAs) waste a lot of time downloading/cleaning/transforming/merging commonly used data sets.


  • Errors introduced by data import and transformation scripts that are written individually and never shared across researchers.
    • These scripts don’t benefit from code review.

Solution


R package psData

psData logo

Goals (1)


  • Avoid effort duplication
    • Standardised functions to download/clean/transform political science/panel-series data sets.
    • Easy to contribute new functions.

Goals (2)


  • Standardised output
    • Makes it easy to merge multiple data sets with the same panel-series structure.
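
A minimal sketch of why standardised output helps: when two data frames share the same identifier columns, base R's `merge()` combines them in one call. The two data frames below are invented stand-ins for getter output, and their values are illustrative only.

```r
# Invented stand-ins for two getter outputs with the same key columns
polity_like <- data.frame(iso2c = c("AO", "AO", "BR", "BR"),
                          year = c(2000, 2001, 2000, 2001),
                          polity2 = c(-3, -3, 8, 8),
                          stringsAsFactors = FALSE)
crisis_like <- data.frame(iso2c = c("AO", "BR", "BR"),
                          year = c(2000, 2000, 2001),
                          RR_CurrencyCrisis = c(1, 0, 0),
                          stringsAsFactors = FALSE)

# Merge on the shared panel and time identifiers.
# all = TRUE keeps country-years that appear in only one data set.
merged <- merge(polity_like, crisis_like,
                by = c("iso2c", "year"), all = TRUE)
```

Country-years missing from one source are kept and filled with `NA`, so gaps stay visible instead of being dropped silently.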

Goals (3)


  • Viral/quick error fixes
    • When updates and errors are (inevitably) made, they can be found and corrected by anyone.
    • Development hosted on GitHub.
    • Corrections automatically propagated via CRAN.

Master Build: Version 0.1.2


  • Master build on CRAN

  • Silos each data set into its own getter or creator function.
    • DpiGet
    • PolityGet
    • RRCrisisGet
    • WinsetCreator
  • Creator functions source getter functions for raw data.
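
The getter/creator split can be sketched as follows. `ToyGet` and `ToyCreator` are hypothetical functions written for this illustration and are not part of psData. A real getter would download and clean a file, whereas the data here are hard-coded.

```r
# Hypothetical getter: returns cleaned data in the standard layout.
# Hard-coded values stand in for a download-and-clean step.
ToyGet <- function(vars = c("a", "b")) {
  raw <- data.frame(iso2c = c("AO", "AO", "BR", "BR"),
                    year = c(2000, 2001, 2000, 2001),
                    a = c(1, 2, 3, 4),
                    b = c(10, 20, 30, 40),
                    stringsAsFactors = FALSE)
  # Keep the identifiers plus the requested variables
  raw[, c("iso2c", "year", vars), drop = FALSE]
}

# Hypothetical creator: sources the getter, then derives a new variable
ToyCreator <- function() {
  dat <- ToyGet(vars = c("a", "b"))
  dat$ratio <- dat$a / dat$b
  dat[, c("iso2c", "year", "ratio")]
}

toy <- ToyCreator()
```

Because the creator only ever calls the getter, a fix to the cleaning step reaches every derived variable built on it.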

Examples (getter)


# Load package
library(psData)

# Download/transform polity2 variable
PolityData <- PolityGet(vars = "polity2")

head(PolityData)
##  iso2c     country year polity2
##    AF Afghanistan 1800      -6
##    AF Afghanistan 1801      -6
##    AF Afghanistan 1802      -6
##    AF Afghanistan 1803      -6
##    AF Afghanistan 1804      -6
##    AF Afghanistan 1805      -6

Examples (getter)


# Download/transform Reinhart and Rogoff (2010)
RRData <- RRCrisisGet()

head(RRData)[1:5]
##  iso2c country year RR_Independence RR_CurrencyCrisis
##    AO  Angola 1800               0                 0
##    AO  Angola 1801               0                 0
##    AO  Angola 1802               0                 0
##    AO  Angola 1803               0                 0
##    AO  Angola 1804               0                 0
##    AO  Angola 1805               0                 0

Example (creator)


# Create winset and selector variables
WinsetData <- WinsetCreator()

head(WinsetData)
##  iso2c     country year    W ModS
##    AF Afghanistan 1975 0.25    0
##    AF Afghanistan 1976 0.25    0
##    AF Afghanistan 1977 0.25    0
##    AF Afghanistan 1989 0.50    0
##    AF Afghanistan 1990 0.50    0
##    AF Afghanistan 1991 0.50    0

Dev builds



  • Development at rOpenGov.

  • Ideas we are playing with:
    • Creating an S4 R method for handling the downloaded data in a standard format.
    • Homebrew-style recipes, specifying how to download the data and including meta-data (e.g. BibTeX citations).
    • Improve testing (e.g. testdat).
    • Integration with data version tracking (at minimum SHA hashes, maybe dat).
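
One way the S4 idea might look, as a rough sketch only: a class that stores the data alongside its identifiers and a citation, with a validity check on the panel structure. The class name `PanelData` and its slots are assumptions made for this example and do not describe the actual development code.

```r
library(methods)

# Hypothetical S4 class: data plus the names of its id and time columns
setClass("PanelData",
  representation(data = "data.frame",
                 id = "character",
                 time = "character",
                 citation = "character"),
  validity = function(object) {
    keys <- c(object@id, object@time)
    if (!all(keys %in% names(object@data))) {
      return("id and time must name columns of data")
    }
    if (any(duplicated(object@data[, keys]))) {
      return("each id-time pair must be unique")
    }
    TRUE
  }
)

# Construct a valid object from invented data
pd <- new("PanelData",
          data = data.frame(iso2c = c("AO", "AO"),
                            year = c(2000, 2001),
                            x = c(1, 2)),
          id = "iso2c", time = "year",
          citation = "Hypothetical source (2014)")
```

Carrying the citation inside the object keeps the meta-data attached to the data through later processing.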

Audience


Ideas for avoiding effort duplication with others building open data frameworks?

Contact


Twitter: @chrisgandrud


GitHub: rOpenGov/psData