Zelig and the Tidyverse

Built using Zelig version 5.1.0.90000

Many researchers integrate tools from the “tidyverse” into their data workflows. The tidyverse includes a number of packages to read data into R–readr–, transform it for analysis– tibble, tidyr, purr, and dplyr, and visualise the results with ggplot2. All of these packages can be loaded in R by loading the tidyverse package:

library(tidyverse)

Zelig can be easily slotted into this workflow for model estimation and identifying quantities of interest. Most Zelig estimation models require underlying data to be in the “tidy” format that the tidyverse creates. Zelig’s three step workflow can be organized very well using pipes (%>%) that the tidyverse advocates. Using pipes with Zelig will make your code more legible and computationally efficient.

What is “tidy data”

“Tidy” data is defined as having three qualities (Wickham 2014, 4):

Each variable forms a column.
Each observation forms a row.
Each type of observational unit forms a table.

Data formatted in this way is ideal for most Zelig models.

Analyzing tidy data with pipes

This example shows one way to integrate the Tidyverse and Zelig into one workflow.

Imagine we used the Tidyverse to create a data set in tidy format that we want to analyse with Zelig:

##              Fertility Agriculture Examination
## Courtelary        80.2        17.0          15
## Delemont          83.1        45.1           6
## Franches-Mnt      92.5        39.7           5
## Moutier           85.8        36.5          12
## Neuveville        76.9        43.5          17
## Porrentruy        76.1        35.3           9

We could then analyse it with Zelig and plot quantities of interest using Tidyverse’s pipe operator. Note the use of data = .. This tells the pipe to enter the swiss data object as the value of zelig’s data argument:

library(Zelig)
library(tidyverse)

swiss %>% 
    zelig(Fertility ~ Agriculture + Examination, model = 'ls', data = ., 
          cite = FALSE) %>%
    setx(Agriculture = seq(1, 90, by = 5)) %>%
    sim() %>% 
    plot()

We could also find first differences by including setx1 in the workflow:

swiss %>% 
    zelig(Fertility ~ Agriculture + Examination, model = 'ls', data = ., 
          cite = FALSE) %>%
    setx(Agriculture = 10) %>%
    setx1(Agriculture = 90) %>%
    sim() %>% 
    plot()

Note that piping does not work with Zelig 5’s reference class syntax.

Zelig and ggplot2

Zelig’s in-house plots are built using base R plotting functionality. You may want to take the quantities of interest simulated by Zelig and create custom plots with other tools, such as ggplot2.

The function zelig_qi_to_df extracts quantities of interest simulated by sim and returns them in a data frame formatted to follow tidy data principles. This output could be easily plotted with ggplot2.

Using the example analysis from before:

sims.full <- swiss %>% 
                zelig(Fertility ~ Agriculture + Examination, model = 'ls', data = ., 
                      cite = FALSE) %>%
                setx(Agriculture = seq(1, 90, by = 5)) %>%
                sim() %>%
                zelig_qi_to_df()

head(sims.full)

##   setx_value Agriculture Examination expected_value predicted_value
## 1          x           1    16.48936       78.13042        95.21521
## 2          x           1    16.48936       69.98811        67.26157
## 3          x           1    16.48936       86.69984        82.40750
## 4          x           1    16.48936       69.24372        86.90894
## 5          x           1    16.48936       74.54766        87.80328
## 6          x           1    16.48936       78.83522        82.21025

Each row contains information for an individual “observation”, i.e. a quantity of interest calculated from one draw of the model parameters from the multivariate normal distribution.

The first three columns of the object returned by zelig_qi_to_df in this case contain information that identifies the scenario that the quantity of interest is from. For example, the first row is from a scenario where Agriculture is fitted at 1 and Examination is fitted at the sample mean 16.4893617. The setx_value column identifies if the values were fitted by a setx call as x or x1–a contrasting scenario used to find first differences.

The final two columns contain the expected and predicted value of the quantity of interest, respectively.¹

Typically when we plot simulated quantities of interest, we would be interested in showing the central interval of the simulated distribution. This involves finding the central interval for each simulation scenario for some range, e.g. the central 95% of simulations. Zelig now includes a helper function qi_slimmer that takes the output of zelig_qi_to_df and finds the desired central interval for each scenario. For example:

sims.slimmed <- qi_slimmer(sims.full)

## Slimming Expected Values . . .

head(sims.slimmed)

##   setx_value Agriculture Examination qi_ci_min qi_ci_median qi_ci_max
## 1          x           1    16.48936  65.57709     74.39166  83.35835
## 2          x           6    16.48936  66.11056     73.98805  82.18708
## 3          x          11    16.48936  66.61200     73.54860  81.03832
## 4          x          16    16.48936  66.80842     73.12864  79.76726
## 5          x          21    16.48936  67.10545     72.71652  78.56796
## 6          x          26    16.48936  67.34976     72.30054  77.23264

Now we have one row for each scenario. This contains the columns qi_ci_min, qi_ci_median, and qi_ci_max for the central interval of the simulations from each scenario. By default qi_ci_min and qi_ci_max contain central interval bounds at the lower 2.5 and upper 97.5 percentiles, respectively. qi_ci_median contains the simulated distribution’s median. The interval can be changed with the ci argument. You can choose expected value (ev) or predicted value (pv) with the qi_ci_type argument.

Now we have everything to make a ggplot2 plot summarizing the simulated quantity of interest:

ggplot(sims.slimmed, aes(Agriculture, qi_ci_median)) +
    geom_ribbon(aes(ymin = qi_ci_min, ymax = qi_ci_max), alpha = 0.3) +
    geom_line() + 
    ylab('Expected Fertility') +
    theme_bw()

Get in touch

We are also considering in the future whether to generate Zelig’s default plots with ggplot2. We’d love to hear your thoughts: GitHub , Google Groups , or Gitter Chat .

“Depending on the issue being studied, the expected or mean value of the dependent variable may be more interesting than a predicted value. The difference is subtle but important. A predicted value contains both fundamental and estimation uncertainty, whereas an expected value averages over the fundamental variability arising from sheer randomness in the world, leaving only the estimation uncertainty caused by not having an infinite number of observations. Thus, predicted values have a larger variance than expected values, even though the average should be nearly the same in both cases.” From King, Tomz, and Wittenberg (2000, 350)↩

2017-05-04

What is “tidy data”

Analyzing tidy data with pipes

Zelig and ggplot2

Get in touch

Contents