Wednesday, April 2, 2014

Social Science Goes R: Weighted Survey Data

Social Science Goes R: Weighted Survey Data

Social Science Goes R: Weighted Survey Data

To get this blog started, I'll be rolling out a series of posts relating to the use of survey data in R. Most content comes from the ECPR Winter School in Methods and Techniques R course, that I had the pleasure of teaching this February. This post is going to be focused on some of the practical problems of sampling in social science contexts and on an introduction to the survey package by Thomas Lumley and on how to set up data with predefined sampling weights.

In the next few weeks, I'll be covering more topics related to survey data, like computing your own weights for a survey, doing statistical analyses and visualizing survey data. All posts will deal with R and survey and be example driven. Enough of a prelude, let's get to the meat.

Survey data

Much of social science quantitative methods still rely on survey data. Surveys are questionnaires that are distributed to interviewees in hopes of measuring their views on certain topics. Instead of asking everyone – which is too costly, usually – only a subset, the sample, is being interviewed. The findings from this subset are then generalized on the entire population, using statistical methods. Most famously perhaps, surveys are used to predict the outcomes of elections.

For this generalization from sample to population to work, everyone in the population has to have the same chances of ending up in the sample. We call this simple random sampling. This is easy to achieve for small populations: Let's say you want to estimate the age distribution of all students in a (large) class. For small classes, you could simply ask everyone. If you find a day where everyone is present, you can then simply pick people at random. If your aim is as bad as mine, you can just throw things at people. The person that catches it, is sampled. This should make for a simple random sample. Everyone had the same chance to be hit by my random shots. If you don't like throwing things at people, you can throw things at a list of people instead. This leads us to how simple random sampling is done in practice: taking a list of everyone in the population and selecting people at random from it. Alternatively but to the same effect if (and only if unfortunately) everyone has a unique number, we can just randomly create numbers, until we have enough people in the sample.

Sampling from larger groups becomes more complex. In theory we could use one of the simple random sampling approaches outlined above. But for larger groups, there are usually no lists. And even if there are complete and up-to-date lists, it might not guarantee that you can reach that person. Or that the person on the list wants to even talk to you.

Non response

In social science, we generally group non-response into two broad categories: Item and unit non-response. The former refers to certain questions not being answered by someone who was already sampled successfully. This effectively boils down to missing values and is another topic that is generally ignored in social sciences. There will be a series on missing value treatment later on.

Here, I want to focus on unit non-response. That is, people not ending up in the sample, for various reasons. One possibility is that they were not on the list from the example above. If our list is made up of legal telephone numbers, but someone in the population does not have a telephone, than that person cannot end up in the sample. Similarly, if we contact people by phone (and they indeed do have a phone), but they are too busy to participate in the survey, they won't end up in the sample either.

As long as this unit non-response is purely random, it is not a problem. It is also not a problem, if it is not random but unconnected to the phenomenon we want to investigate. For instance, if you investigate favorite colors, it is plausible to assume that people without phones or people that refuse to participate in the survey don't have systematically different color preferences. Unfortunately, this is about the most realistic example for the independence of a social science investigative topic and sample inclusion. Most questions in social science would be answered differently by those without phones and those that do not participate in surveys.

Representativeness

In order to be able to conclude from our sample to the entire population, that sample needs to be representative of the population. Here representative means that traits that can be found in the population are also found in the sample. And at the same proportions. So if the population contains 54% women, so should the sample. The odds that simple random sampling will produce a representative sample are very high. Whether a sample is representative or not is also context dependent. A sample that is representative for social sciences might not be representative in a medical context.

Because some people are very hard or even impossible to reach, the sample we end up with might not be representative. It might be biased. Typically, in a survey sample we end up with too many senior citizens, too many women and too few young people and too few workaholics. Using that sample as is permits conclusions only from the sample to the typical survey participating population. Usually, that population is not the one we are interested in. Rather, we would like for our results to apply to the general public.

So, because not every one in the population has the same probability of being included in our sample, we don't end up with a simple random sample. As it is not a simple random sample, we need to check if it is still representative. And usually, it isn't.

Post-stratification weights

The solution to this dilemma is quite easy. We assign everyone that did make it into the sample a weight \( w \). So people that turn out too often in the sample receive a weight of less than 1. And those that we were not able to reach enough of are upweighted with a weight larger than 1. Respondents that belong to groups that have been sampled perfectly receive a weight of 1. This solution is called post-stratification, because it computes weights based on group (or stratum) characteristics, like the distribution of age or gender proportions.

This method of course only works to correct minor biases. If we did not manage to include a single person from a group, no weight can correct for that. Even if we managed to sample only very few, the resulting large weights could be problematic. If that three young men suddenly count for 100, the results might not be representative either.

In the next weeks we'll go into more detail here and discuss how these weights can be computed. For now, we'll focus on data sets that come with weights supplied. This is typical for large social science data sets, like the European Social Survey.

Survey data in R

In R, working with survey data that requires post-stratification weights is made possible by the survey package. The survey package not only allows for adjusting the composition of a sample to the characteristics of the general population. Most base packages would allow you to do that by specifying a weights argument. The survey package goes further by correcting the design effect introduced by the application of post-stratification weights.

I've prepared a small data set to demonstrate the usage of survey. You can find it under: http://knutur.at/wsmt/R/RData/small.RData. It's basically just random data with some creative column names.

To start using this data set, we first need to load it into R and then attach the survey package.

load(url("http://knutur.at/wsmt/R/RData/small.RData"))

Let's look at the data briefly.

summary(small)
##  sex     edu        partDemo       nJobsYear        intell    
##  M:57   Lo :24   Min.   :0.000   Min.   :0.00   Min.   :40.5  
##  F:43   Mid:42   1st Qu.:0.000   1st Qu.:1.00   1st Qu.:46.9  
##         Hi :34   Median :0.500   Median :2.00   Median :50.6  
##                  Mean   :0.735   Mean   :2.25   Mean   :50.0  
##                  3rd Qu.:1.000   3rd Qu.:3.00   3rd Qu.:52.6  
##                  Max.   :4.000   Max.   :8.00   Max.   :58.7  
##                  NA's   :2                      NA's   :1     
##       emo             age           weight     
##  Min.   :-64.5   Min.   :19.0   Min.   :0.341  
##  1st Qu.: 11.2   1st Qu.:27.0   1st Qu.:1.046  
##  Median : 79.4   Median :44.0   Median :1.707  
##  Mean   : 54.2   Mean   :41.7   Mean   :1.716  
##  3rd Qu.:100.3   3rd Qu.:54.0   3rd Qu.:2.436  
##  Max.   :118.1   Max.   :63.0   Max.   :2.995  
##                  NA's   :1

There are a number of variables in this artificial data set. The one of most interest now is perhaps the weight variable. It contains the weight \( w \) per respondent.

Let's attach the survey package now:

library(survey)
## 
## Attaching package: 'survey'
## 
## The following object is masked from 'package:graphics':
## 
##     dotchart

The message that the dotchart function from the graphics package is now being masked is not really dramatic. It simply tells us that both packages provide a function by that name, and that after attaching survey the one from survey is going to be used. If you were dependent on using graphics' dotchart function, you would need to be careful here. But since we just demo the creation of weighted data sets, this message can simply be ignored.

Inside survey, there is the svydesign function that can be used to link a data frame with a weight. This function takes many arguments, three are of importance at this point: ids, data, and weights. The latter two are easily explained as they specify the data frame we seek to apply weights on and the vector containing the weights.

The first argument, ids is a little bit more complex. By default, survey assumes we use data generated in a multistage sampling procedure. Then the ids argument would be used to specify which variables encode sampling cluster membership. Clustered sampling is another quite frequent social science sampling method, but not one that is used (nor usable) for large general population surveys. Therefore, we set ids = ~ 1 to indicate that all respondents originated from the same cluster.

small.w <- svydesign(ids = ~1, data = small, weights = small$weight)

The resulting object is not a simple data frame anymore. Using summary on it produces different results:

summary(small.w)
## Independent Sampling design (with replacement)
## svydesign(ids = ~1, data = small, weights = small$weight)
## Probabilities:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.334   0.410   0.586   0.801   0.956   2.930 
## Data variables:
## [1] "sex"       "edu"       "partDemo"  "nJobsYear" "intell"   
## [6] "emo"       "age"       "weight"

To further use this object, we need to use survey's own set of analytical functions. They will be the topics of the next few posts. As a little teaser, here is a comparison of the sex ratios in the unweighted and the weighted data frames:

prop.table(table(small$sex))
## 
##    M    F 
## 0.57 0.43
prop.table(svytable(~sex, design = small.w))
## sex
##      M      F 
## 0.5857 0.4143

Not a dramatic change, but one that could prove decisive in the more complex analyses of the weeks to come. This concludes the first entry in our series on survey data in R. The next post in this series will deal with how we can compute post stratification survey weights in the first place. If you have any thoughts on survey data in R or would like to see something specific covered, please leave a comment.