Lab 2: Introduction to Inference

An NCRM Case Study in Pedagogy

Author

Adrian Millican

1. Introduction

This week we are going to explore some basic probability concepts, using R to help visualise some of the more theoretical elements.

To do so, we are going to explore a national survey of voters – a common type of quantitative data used in Politics and IR. Our dataset this week is a subset of the 2019 Cooperative Congressional Election Study (CCES). In election years (e.g. 2016, 2018, 2020) the CCES interviews over 50,000 individuals. In our 2019 file (a non-election year) the CCES interviewed 18,000 respondents.

This survey asks hundreds of questions about the U.S. political system, including how voters perceive the president or presidential candidates, their vote intention, their preferences over various items of policy, and a host of demographic questions (e.g., age, gender, race, occupation, religiosity). To make this information slightly easier to handle, we have chosen a subset of seven variables that illustrate the wider dataset:

  1. caseid – a unique identifier for each respondent in the survey
  2. birthyr – the year each respondent was born
  3. gender – the self-identified gender of each respondent
  4. region – what region of the US the respondent lives in
  5. trump_job_approval – the respondent’s approval of President Trump’s job performance, ranging from 1 (‘strongly approve’) to 4 (‘strongly disapprove’)
  6. vote_intention – who the respondent would have voted for if the election had been “today”
  7. regulate_co2 – whether respondents support (=1) or oppose (=0) the federal Environmental Protection Agency regulating the emission of carbon dioxide

1.1 New commands this week

  • read_csv() – a function to read in data if it is saved in .csv format
  • set.seed() – sets R’s random number generator
  • mean() – an alternative way to calculate the mean of a continuous variable
  • sample() – randomly select elements of a vector
  • filter() – keep only certain rows of data in a dataframe

1.2 Introduction to Probability in Political Science

Understanding probability is fundamental for analyzing data and drawing conclusions in political science. This chapter introduces key concepts of probability theory, explores the central limit theorem, and examines their practical applications in areas such as polling and survey design. Later, we will work through examples in R to solidify these concepts with hands-on practice.


1.3 The Theory of Probability

1.3.1 What is Probability?

Probability measures how likely an event is to occur, expressed as a number between 0 and 1. It provides a structured way to quantify uncertainty, which is essential in making decisions based on incomplete or variable information.

For instance, the probability of flipping a coin and landing on heads is 0.5, reflecting an equal likelihood for either heads or tails. Similarly, the probability of rolling a six on a fair six-sided die is 1/6 (approximately 0.17), since there are six equally likely outcomes.

Probability plays a central role in political science because so many of the phenomena we study involve uncertainty. Whether we are estimating voter preferences, predicting election outcomes, or analyzing the likelihood of policy adoption, probability provides the framework for making informed inferences.

1.3.2 Properties of Probability

Probabilities have a number of important properties that we need to take account of in order to understand how the basic rules of probability allow us to make claims about the world when we only have access to sample data. These are:

  1. Range of Probabilities: Probabilities are always between 0 and 1. A probability of 0 means the event is impossible, while a probability of 1 means the event is certain to occur.

  2. Total Probability: The sum of probabilities for all possible outcomes of a random experiment equals 1. For example, in flipping a fair coin, the probabilities of Heads (0.5) and Tails (0.5) sum to 1.

  3. Complement Rule: The probability of an event not occurring is 1 minus the probability that it does occur. For instance, if the probability of it raining tomorrow is 0.3, then the probability of it not raining is 1 − 0.3 = 0.7.
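If you want to see these properties in R, here is a quick illustrative sketch (not part of the lab exercises) using the rain example from above:

# Complement rule: if P(rain tomorrow) = 0.3, then P(no rain) = 1 - 0.3
p_rain <- 0.3
p_no_rain <- 1 - p_rain
p_no_rain              # 0.7

# Total probability: the probabilities of all possible outcomes sum to 1
p_rain + p_no_rain     # 1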

1.4 Random Variables

Another core element that we need to understand as we build our knowledge of the role of probability within political science is the idea of a random variable. A random variable assigns numerical values to the outcomes of a random process. Random variables are crucial in political science because they allow us to quantify and analyze phenomena such as voter turnout, approval ratings, and legislative outcomes.

1.4.1 Types of Random Variables

  1. Discrete Random Variables: These take on specific, countable values. For example, the result of rolling a die is a discrete random variable with possible outcomes 1, 2, 3, 4, 5, and 6.

  2. Continuous Random Variables: These can take on any value within a range. Examples include the time it takes to complete a survey or the percentage of voters who support a candidate.
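As a quick illustration in R (the specific values below, such as a survey time with a mean of 12 minutes, are made up for this example):

# A discrete random variable: the result of rolling a fair die (countable outcomes 1 to 6)
sample(1:6, size = 1)

# A continuous random variable: e.g. the time taken to complete a survey, in minutes,
# drawn here from a hypothetical normal distribution
rnorm(1, mean = 12, sd = 3)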

1.4.2 What Do We Mean by “Random”?

In everyday language, “random” often implies chaos or unpredictability. In statistics, however, randomness has a precise definition: it refers to processes governed by probabilities, where outcomes cannot be predicted with certainty but follow a predictable pattern in the long run.

1.4.3 Everyday Example: Coin Flipping

Flipping a coin is a simple and intuitive way to illustrate probability. A fair coin has two outcomes—Heads or Tails—each equally likely with a probability of 0.5. This example can also demonstrate key properties of probability:

  • The outcome of a single flip is uncertain, but we know the probabilities of each result.

  • Over many flips, the relative frequency of Heads and Tails will converge to 0.5, illustrating the law of large numbers.

Below is a table showing the proportion of Heads observed over an increasing number of flips:

Number of Flips   Heads (%)   Tails (%)
10                50          50
100               52          48
1,000             49          51
10,000            50          50

As the number of flips increases, the proportions approach the theoretical probability of 0.5 for each outcome.

If you wanted to illustrate this in R, you could do so using the following steps. Remember, to make this code work you need to place it within code chunks and run them. To view the output you can either use the sink option to send it to a text file or use render to compile the document into HTML; all instructions are contained in the lab below. set.seed(89) sets the starting point from which random numbers are generated.

coin_outcomes <- c('H','T') - this code creates an object which stores H and T to represent heads and tails.

# Set the seed
set.seed(89)

# Define coin outcomes
coin_outcomes <- c('H','T')
sample(coin_outcomes, size = 1)
[1] "H"

Next, we are going to take some samples using the sample command to illustrate how changes in sample size affect the outcomes we get.

The following code simulates coin flips and demonstrates the long-run probability of getting “Heads” or “Tails.” Here’s how the code works, line by line:

Define Possible Outcomes (coin_outcomes): Before running the sample function, you need a list of possible outcomes for the coin flip, which are typically “Heads” and “Tails.”

Simulate 10 Coin Flips (flips_10): The first sample() function simulates flipping a coin 10 times. The size = 10 argument specifies that 10 flips are performed, and replace = TRUE allows for repeating outcomes (Heads or Tails).

Simulate 100 Coin Flips (flips_100): The second sample() function increases the sample size to 100 flips. This will allow you to observe how the proportion of Heads and Tails begins to stabilize as the sample size increases.

Simulate 1 Million Coin Flips (flips_1m): The third sample() function simulates 1 million flips. With such a large number of flips, the proportion of Heads and Tails will closely approximate the true long-run probabilities (50% for each), according to the law of large numbers.

The idea is to observe how the proportion of Heads and Tails changes with increasing sample sizes, demonstrating how the long-run probability stabilizes over larger numbers of trials.

# Long-run probability demonstration
flips_10 <- sample(coin_outcomes, size = 10, replace = TRUE)

#this code then gives us the proportion of heads
sum(flips_10 == "H")/10
[1] 0.4
flips_100 <- sample(coin_outcomes, size = 100, replace = TRUE)

sum(flips_100 == "H")/100
[1] 0.53
flips_1m <- sample(coin_outcomes, size = 1000000, replace = TRUE)

sum(flips_1m == "H")/1000000
[1] 0.499895

Remember, if you run the code chunk you can either output this to an HTML file or, if you are using sink, simply run the chunk and check the output.txt file.

What we have seen here is an example of the long run probability of a random variable.

  • After 100 coin flips, 53/100 landed ‘Heads’

  • After 1 million flips, just under 50% landed Heads

    • So with an infinite number of flips, we would expect
      → P(Coin flip = H) = 0.5
  • Therefore, when talking about probabilities, we are referring to the long-run probability of an event

    • I.e. the probability assuming many “coin flips”

    • This smooths out our expectations



1.5 Probability Distributions

Probability distributions describe how the values of a random variable are distributed. They are essential tools for understanding and modeling uncertainty in data.

1.5.1 Discrete Probability Distributions

A discrete probability distribution lists all possible values of a random variable and their associated probabilities. For example, the probability distribution of rolling a six-sided die is:

Outcome   Probability
1         1/6
2         1/6
3         1/6
4         1/6
5         1/6
6         1/6

This distribution is uniform because all outcomes are equally likely.
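We can check this uniform distribution by simulation in R; this is a small illustrative sketch rather than one of the lab exercises:

# Simulate many die rolls and compare the observed proportions with the theoretical 1/6
set.seed(89)
rolls <- sample(1:6, size = 10000, replace = TRUE)
prop.table(table(rolls))   # each proportion should be close to 1/6 (about 0.167)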

1.5.2 Continuous Probability Distributions: The Normal Distribution

For continuous random variables, probabilities are represented by areas under a curve. The most well-known continuous distribution is the normal distribution, characterized by its bell shape. Key properties of the normal distribution include:

  • The mean, median, and mode are equal.

  • It is symmetric around the mean.

  • Approximately 68% of the data lies within one standard deviation of the mean, 95% within two, and 99.7% within three.

For example, voter approval ratings in a large population might be normally distributed, with most values clustering around the mean (e.g., 50%) and fewer values at the extremes (e.g., 20% or 80%). Understanding the mean and standard deviation of this distribution allows us to predict the likelihood of specific outcomes.
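R’s pnorm() function gives the area under the normal curve up to a given value, so we can check the 68–95–99.7 rule directly. The sketch below uses the approval example with a mean of 50 and an assumed standard deviation of 10 (the standard deviation is not given above, so this value is purely illustrative):

# Proportion of approval ratings within 1, 2 and 3 standard deviations of the mean
pnorm(60, mean = 50, sd = 10) - pnorm(40, mean = 50, sd = 10)  # about 0.68
pnorm(70, mean = 50, sd = 10) - pnorm(30, mean = 50, sd = 10)  # about 0.95
pnorm(80, mean = 50, sd = 10) - pnorm(20, mean = 50, sd = 10)  # about 0.997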

1.5.3 Why Do Probability Distributions Matter?

Understanding probability distributions helps us model real-world phenomena. For instance:

  • Polling results often follow a normal distribution, allowing us to calculate confidence intervals.

  • Voter turnout data might follow a skewed distribution, highlighting regional differences.

Probability distributions are also foundational for statistical inference, enabling us to estimate population parameters and test hypotheses.
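For instance, a common calculation built on the normal distribution is the 95% confidence interval for a polled proportion. The figures below (520 supporters out of 1,000 respondents) are hypothetical:

# Approximate 95% confidence interval for a polled proportion
p_hat <- 520 / 1000
se <- sqrt(p_hat * (1 - p_hat) / 1000)   # standard error of the proportion
c(p_hat - 1.96 * se, p_hat + 1.96 * se)  # roughly 0.49 to 0.55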

1.6 Sampling Techniques

Before we can start thinking about how we apply ideas of probability to statistical analysis, an important step is to understand how we collect our data.

1.6.1 Why Sampling Matters

In political science, we rarely have access to data on an entire population. Instead, we collect samples—subsets of the population—to make inferences. Sampling allows us to estimate characteristics of the population while saving time, effort, and resources.

1.6.2 Types of Sampling Techniques

  1. Simple Random Sampling: Every member of the population has an equal chance of being selected. For example, randomly selecting 500 voters from a voter registry ensures each voter has an equal probability of inclusion.

  2. Stratified Sampling: The population is divided into subgroups (strata) based on specific characteristics, and samples are taken from each stratum. For instance, dividing voters by age group and sampling proportionally from each ensures representation across age demographics.

  3. Cluster Sampling: The population is divided into clusters (e.g., neighborhoods), and entire clusters are randomly selected for sampling. This method is often used for logistical convenience.

  4. Systematic Sampling: A starting point is randomly chosen, and samples are taken at regular intervals. For example, selecting every 10th name on a voter list after a random start.
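To make these techniques concrete, here is a small R sketch using a made-up voter registry. The data frame and variable names (voters, age_group, neighbourhood) are invented for illustration and are not part of the lab dataset:

library(dplyr)
set.seed(89)
voters <- data.frame(id = 1:10000,
                     age_group = sample(c("18-34", "35-64", "65+"), 10000, replace = TRUE),
                     neighbourhood = sample(1:50, 10000, replace = TRUE))

# 1. Simple random sampling: 500 voters, each with an equal chance of selection
srs <- slice_sample(voters, n = 500)

# 2. Stratified sampling: 5% of voters from within each age group
strat <- voters %>% group_by(age_group) %>% slice_sample(prop = 0.05) %>% ungroup()

# 3. Cluster sampling: randomly choose 5 of the 50 neighbourhoods and keep everyone in them
chosen <- sample(1:50, 5)
clus <- filter(voters, neighbourhood %in% chosen)

# 4. Systematic sampling: every 10th voter after a random starting point
start <- sample(1:10, 1)
sys <- voters[seq(from = start, to = nrow(voters), by = 10), ]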

1.6.3 Importance of Random Sampling

Random sampling is critical to ensure the sample is representative of the population. Without randomness, samples can be biased, leading to inaccurate conclusions. For example, if a poll only surveys urban voters, it may not reflect the opinions of rural voters, leading to skewed results.

1.6.4 Why Sampling is Necessary:

  • Cost: Observing an entire population is expensive and time-consuming.
  • Observability: Not all populations are fully observable or accessible.

Given that we typically have only a sample, we can only estimate the population value. The goal is to make these estimates as accurate as possible. A key aspect of achieving this is random sampling.

1.6.5 Random Sampling to Avoid Bias:

Consider trying to estimate the average wage in the UK by asking only people in the lobby of Goldman Sachs. This is likely to lead to an upwardly biased estimate, as it doesn’t represent the broader population of wages.

A much better approach is to draw people at random from the population, which ensures that every member of the population has an equal chance of being chosen.

Practically, what is Random Sampling?

Random sampling means:

  • Equal probability: Every member of the population has the same chance of being selected.

  • Proportional representation: Different groups (e.g., bankers, service staff) should be selected at the rate they occur in the population.

This may sound simple, but achieving truly random sampling is often more difficult than it seems!

1.6.6 Sampling Error:

When we are using sample data, there will always be some level of sampling error. This refers to:

  1. Bias: How close the estimates are to the true population value (i.e., how accurate our estimates are).

  2. Precision: How dispersed the estimates are around the true value (i.e., how consistent they are).

In summary, random sampling is crucial for reducing bias and improving the accuracy and consistency of our estimates.
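The sketch below illustrates the difference a random sample makes, using an invented wage distribution (all of the numbers are made up for this example):

# A hypothetical population of 10,000 wages, where a small group earns far more
set.seed(89)
wages <- c(rnorm(9500, mean = 30000, sd = 8000),
           rnorm(500,  mean = 150000, sd = 30000))
mean(wages)                            # the true population mean

# Biased sampling: only asking the high earners (the "Goldman Sachs lobby")
mean(sample(wages[9501:10000], 100))   # this estimate will be far too high

# Random sampling: every member of the population has an equal chance
mean(sample(wages, 100))               # on average, close to the population mean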

1.7 Central Limit Theorem (CLT)

Our knowledge of sampling and importantly random sampling allows us to start to explore one of the most important concepts of statistical theory. The central limit theorem provides the foundation for making inferences about populations based on sample data. The theorem states that:

  1. The sampling distribution of the sample mean becomes approximately normal as the sample size increases, regardless of the population’s original distribution.

  2. The mean of the sampling distribution equals the population mean.

  3. The standard deviation of the sampling distribution (the standard error) is σ/√n, where σ is the population standard deviation and n is the sample size.
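A quick way to check point 3 is to simulate it. The sketch below uses the small 1-to-100 “population” that appears later in this lab, and samples with replacement so that the draws match the theorem’s assumptions:

# Standard error predicted by the CLT: sigma / sqrt(n)
set.seed(89)
population <- 1:100
n <- 10
sd(population) / sqrt(n)                 # about 9.2

# Simulated version: the standard deviation of many sample means
sample_means <- replicate(5000, mean(sample(population, n, replace = TRUE)))
sd(sample_means)                         # should be close to the value above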

1.7.1 Why Does the CLT Matter?

The CLT is particularly useful in political science because it justifies the use of normal distribution-based methods for hypothesis testing and confidence intervals, even when the underlying data are not normally distributed. For example:

  • In polling, we often work with proportions or averages derived from samples.

  • The CLT allows us to estimate the uncertainty around these statistics and make predictions about the broader population.

1.7.2 Illustration: Challenges in Sampling

When conducting a survey or poll, we only have access to a single sample rather than the entire population. The central limit theorem reassures us that, even with one sample, we can make reasonable inferences about the population. However, challenges arise because:

  • We don’t know which sample we have: Since each sample has its own mean, some samples will deviate from the true population mean by chance.

  • Sampling error varies with sample size: Larger samples result in a narrower sampling distribution, reducing variability and providing more precise estimates.

To visualize this, imagine sampling from a population with a mean voter approval rating of 60%. Small samples (e.g., 30 individuals) might yield sample means ranging from 55% to 65%, while larger samples (e.g., 1,000 individuals) would result in sample means much closer to 60%.

Below is a conceptual example:

Sample Size   Mean Approval Rating (%)   Variability (Standard Error)
30            58-62                      High
100           59-61                      Moderate
1,000         59.8-60.2                  Low

If we wanted to explore what sampling error looks like in R, we could do so using the following code:

First, we create some population data from which we draw samples. We then find the mean of the population:

population <- 1:100
mean(population)

We then take the means of two samples from our population, each with a sample size of 10.

mean(sample(population,10))
mean(sample(population,10))

# Repeat sampling
population <- 1:100
mean(population)
[1] 50.5
mean(sample(population,10))
[1] 40.4
mean(sample(population,10))
[1] 41.1

If you look at the output, we know that the population mean is 50.5, but the two sample means are 40.4 and 41.1 respectively. What this illustrates is that sample means do not always sit at the population mean. As our sample size goes up, our sample mean should be located closer to the population mean as the sampling error reduces.

2. Demonstrating Probability, Sampling Distributions and the Central Limit Theorem in R

Before we get started with loading our data into R and beginning the practical side of this lab, it is important to remind ourselves of the key commands we use to navigate between the different panes within RStudio.

2.1 Accessibility Commands Reminder

Here are the essential RStudio keyboard shortcuts to navigate between windows:

  • Ctrl + 1 (Windows/Linux) or Cmd + 1 (Mac): Focus on the Source pane (where you write and edit scripts).

  • Ctrl + 2 (Windows/Linux) or Cmd + 2 (Mac): Focus on the Console pane (where you run code and see output).

  • Ctrl + 3 (Windows/Linux) or Cmd + 3 (Mac): Focus on the Environment pane (where you see your variables).

  • Ctrl + 4 (Windows/Linux) or Cmd + 4 (Mac): Focus on the History pane (where you view previous commands).

  • Ctrl + 6 (Windows/Linux) or Cmd + 6 (Mac): Focus on the Plots pane (where graphical outputs are displayed).

  • Ctrl + 7 (Windows/Linux) or Cmd + 7 (Mac): Focus on the Packages pane (for loading and managing R packages).

2.2 Exercise 1 – loading the data and calculating descriptive statistics

First, download the data rpir_cces19.csv from the Blackboard folder and save it into your data folder. Then, open RStudio, create a new R script, and load the tidyverse, jmv and BrailleR packages.

2.2.1 Setting our Working Directory

Now that we have learnt how to find the location in which we wish to save our files, we need to tell R where this is located. In R, the working directory is the folder on your computer where R reads and saves files by default. Think of it as R’s “home base” for accessing and storing data files, scripts, and other resources. You can set or check your working directory using commands like setwd() or getwd(). Note: If you cannot remember how to find your working directory and get the filepath, please go back to section 1.8 in Lab 1.

We need to save our data files to the working directory we wish to use. I recommend that you create a folder in which to save all of your data files and output. Simply change the code below to reflect where your own folder is located.

To set our working directory, we need to use the setwd() command, open brackets and quotation marks, and paste the full filepath we copied above inside them. It should look something like this.

We again need to do this within a code chunk, so click Option+Command+I (Mac) or Ctrl+Alt+I (Windows/Linux) to open one and then write your code within it.

# Replace "path/to/your/directory" with the full path to your working folder
setwd("path/to/your/directory")

Run the code chunk using shift+control+enter

This week we are going to use a different type of data file. While last week we used an SPSS data file (.sav) this week we are going to access data saved in a CSV (comma separated values) format. CSV is a very common way to store data (each row is an observation, and variables are separated by commas). You could (if you wanted to) open this file in other software like Microsoft Excel. Because we are using a different format of data, we’ll need to open it using a different command: read_csv():

We again need to do this within a code chunk so click Option+Command+I to open one and then write your code within it.

Move the cursor into the code chunk and run the following commands:

  1. library(tidyverse)
  2. library(jmv)
  3. library(BrailleR)
  4. cces <- read_csv("data/rpir_cces19.csv")
library(tidyverse)
library(jmv)
library(BrailleR)
cces <- read_csv("data/rpir_cces19.csv")
  • What does the code do?

    • library(tidyverse) loads the tidyverse package, which provides functions for data manipulation.

    • library(jmv) loads the jmv package, which provides additional functionality for statistical analysis and reporting.

    • library(BrailleR) loads the BrailleR package, which provides written descriptions of graphs that work well with screen readers.

    • cces <- read_csv("data/rpir_cces19.csv") reads in the CSV file located in the “data” folder of your project directory and stores it in an object called cces.

Run the code chunk using shift+control+enter

2.3 Using Sink to Output Code Chunks

So far, we have used commands to execute code chunks. This is great as it allows us to run through elements of code as we go. However, at present, screen reader support for the output of code chunks is limited. As we will see later, typically the easiest way to work with R is by rendering the whole document. However, waiting until the end of the document to check whether our code works leads to challenges, particularly if there are errors in our code. An easy way to resolve this is to create a new code chunk that captures both standard output and error messages into a text file.

Setting Up Sink for Output Logging

  1. Create a new code chunk

    • Use Option+Command+I (Mac) or Ctrl+Alt+I (Windows/Linux) to insert a new chunk.

    • Navigate into the code chunk. The top row of the chunk will have the text {r}.

  2. Modify the chunk options

    • To the right-hand side of r, place a comma and write include = FALSE. You can also give the chunk a label (such as setup) and add eval = FALSE.

    • This should look like {r setup, include = FALSE, eval = FALSE}.

  3. Set up the sink function

    • Move to a fresh line inside the chunk and enter the following code:
# Open a file connection
output_file <- file("output.txt", open = "wt") 

# Redirect both standard output and error messages to the same file
sink(output_file, split = TRUE)  # Capture normal output
sink(output_file, type = "message")  # Capture errors and warnings
  4. Run each line of code in turn

    • Ensure your cursor is on each line and press Ctrl+Enter (Windows/Linux) or Cmd+Enter (Mac) to execute it.

How This Works

  • sink(output_file, split = TRUE) redirects normal output to output.txt while still displaying it in the console.

  • sink(output_file, type = "message") ensures that error messages and warnings are also captured.

  • The text file does not automatically open, but you can find it in your working directory.

Viewing the Output

  • After running a code chunk, reopen output.txt to see the updated output.

  • This allows you to track results and identify errors without waiting until the document is fully rendered.

2.4 Generating Random Sample Data

This week we are going to be asking R to sample data randomly. But our computers never really generate truly random numbers. Instead, they generate numbers that appear random, but which are generated from a fixed starting point (which we can set). Before we go any further, therefore, we should tell R to start generating random numbers from a specific point, using the set.seed() command. This will help make sure our results are the same when we discuss them in the lab.

Open a code chunk and click Option+Command+I and type in the code.

set.seed(31415)
  • What does the code do?

    • set.seed(31415) sets the starting point for generating random numbers. By using the same seed (31415), you ensure that the random numbers you generate will be reproducible.

Run the code chunk using shift+control+enter

For the purposes of this lab, let us assume that our data contains the entire population. This is clearly unrealistic, as there are many millions more individuals in the United States than in our data. However, just for the purpose of this lab, we are going to suppose that we have collected all the information we want for some smaller, hypothetical population – let’s call it the CCES 2019 population.

2.5 Exploring our Variables

To begin with we are going to look at the variable Birth Year (variable name birthyr), which tells us the birth year of each individual within the data we have. One useful starting point can be to explore the distribution of a variable; in this case we might want to know how many young or old people there are in our data. There are a few things we could do to achieve this: we could look at measures of dispersion as we did in Lab 1, using things like the minimum, maximum, range or standard deviation. We can also use a graph known as a histogram, which shows a visual overview of the distribution of the variable, and with the help of the BrailleR package we can get an excellent written summary of the data that helps us interpret the variable.

The line of code below creates an accessible visualisation of a standard histogram using BrailleR’s descriptive functionality. First, hist(cces$birthyr, main = "Histogram of Birth Years", xlab = "Birth Year") generates a histogram of the birthyr variable from the cces dataset, labeling the plot with a title using the main option and an x-axis description using xlab. Wrapping this inside the VI() command invokes BrailleR’s accessibility feature, which produces a detailed textual summary of the histogram—such as the number of bins, their ranges, and the frequency counts—so that screen readers can convey the information effectively. This approach ensures that the graphical output is supplemented with an interpretable description, making the visualisation inclusive for users with visual impairments.

VI(hist(cces$birthyr, main = "Histogram of Birth Years", xlab = "Birth Year"))

This is a histogram, with the title: with the title: Histogram of Birth Years
"cces$birthyr" is marked on the x-axis.
Tick marks for the x-axis are at: 1920, 1940, 1960, 1980, and 2000 
There are a total of 18000 elements for this variable.
Tick marks for the y-axis are at: 0, 500, 1000, 1500, and 2000 
It has 18 bins with equal widths, starting at 1915 and ending at 2005 .
The mids and counts for the bins are:
mid = 1917.5  count = 1 
mid = 1922.5  count = 2 
mid = 1927.5  count = 34 
mid = 1932.5  count = 166 
mid = 1937.5  count = 400 
mid = 1942.5  count = 727 
mid = 1947.5  count = 1180 
mid = 1952.5  count = 1634 
mid = 1957.5  count = 2208 
mid = 1962.5  count = 1890 
mid = 1967.5  count = 1393 
mid = 1972.5  count = 1095 
mid = 1977.5  count = 1451 
mid = 1982.5  count = 1556 
mid = 1987.5  count = 1617 
mid = 1992.5  count = 1551 
mid = 1997.5  count = 1084 
mid = 2002.5  count = 11

What does the printout of VI() tell us about our data? Well, it tells us the number of bins that the data is collapsed into and that they start at 1915 and end at 2005, which tells us the range of our data. It then tells us the midpoint of each bin and the number of cases within each bin. What we can see is that the earliest bins are quite low in counts, suggesting that there are very few people of advanced age, but as the histogram progresses the counts go up and tend to stay high. What does this suggest? It might tell us that our data is somewhat skewed towards younger individuals within the sample.

2.6 Calculating Averages

In this exercise we are assuming that our data is the population. The next step is to calculate some “population” means. We could use the jmv package discussed last week, but for a single statistic it is sometimes easier to use the built-in base R commands.

To calculate and store the average year of birth in our population, we simply identify the corresponding column in our data (using the $ operator), enter this into the mean() command, and set na.rm = TRUE. This latter option makes sure that R ignores any missing values:

Open a code chunk by clicking Option+Command+I and then write your code within it.

mean(cces$birthyr, na.rm = TRUE)
[1] 1969.731
  • What does the code do?

    • cces$birthyr selects the birthyr column from the cces data.

    • mean(cces$birthyr, na.rm = TRUE) calculates the mean of the birthyr column, ignoring any missing values (na.rm = TRUE).

Run the code chunk using shift+control+enter. Remember, you can view the output of the command by opening the output.txt file we created.

Now that we have calculated the population mean for birthyr, we can run similar code to inspect the other variables of interest.

  • Q1a. What are the population means for birthyr (code provided), trump_job_approval, and regulate_co2? Use the same code from above to explore these variables.

Remember, Option+Command+I opens a code chunk and shift+control+enter runs it.

mean(cces$birthyr, na.rm = TRUE)
[1] 1969.731
mean(cces$trump_job_approval, na.rm = TRUE)
[1] 2.782826
mean(cces$regulate_co2, na.rm = TRUE)
[1] 0.3249708
  • Q1b. Given the scale of regulate_co2, what does the population mean represent?
# Notice that this variable only contains 1s or 0s, i.e. it is binary
# These correspond to 'yes' and 'no' respectively

# When you take the mean of a binary variable, you get a PROPORTION
# i.e. the proportion of individuals that believe CO2 should be regulated.
  • Q1c. Try to calculate a population mean for the vote_intention variable – why does R return NA?
mean(cces$vote_intention, na.rm = TRUE)
Warning in mean.default(cces$vote_intention, na.rm = TRUE): argument is not
numeric or logical: returning NA
[1] NA
# It's a categorical (character) variable
# We can only take the mean of numeric variables

# You can inspect the data to see this:
str(cces$vote_intention)
 chr [1:18000] "Trump" "Democratic Nominee" "Democratic Nominee" ...

2.7 Exercise 2 – Sampling

Now let’s pretend that we wanted to run a survey on this hypothetical CCES 2019 population. And suppose that we only have sufficient resources to interview a measly 10 individuals (interviewing people is a time-intensive business!)

To simulate this random sampling strategy, we can use the sample() function in R. This function will randomly select elements of a vector1 (in our case a variable in our data), and you can specify how many to select.

First we will pass to this function the column of unique identifiers, and specify that we want size = 10 (i.e. 10 randomly chosen observations). Finally, we tell R not to replace our observations (replace = FALSE), i.e. once we have randomly selected an individual, they cannot be selected again.

Open a code chunk using Option+Command+I and write the code then run the chunk using shift+control+enter

sample_ids <- sample(x = cces$caseid, size = 10, replace = FALSE)
  • What does the code do?

    • The sample() function randomly selects 10 unique identifiers from the caseid column of the cces data. The replace = FALSE argument ensures that the same identifier is not selected more than once.

If you type sample_ids on the line below your code in the code chunk and run it, its output will appear in the output.txt document. It will show you a vector of unique identifiers.

Now we have a random sample of 10 unique identifiers, we can create our sample data. Here we use a new command filter() that lets us (you guessed it) filter certain rows from our data. The basic structure of this command is filter(DATA, CONDITION), where for this exercise our DATA is cces, and the CONDITION is a logical statement. Since we only want to keep those rows that are in our vector sample_ids, our logical statement is caseid %in% sample_ids. The full code is:

cces_sample <- filter(cces, caseid %in% sample_ids)
cces_sample
# A tibble: 10 × 7
     caseid birthyr gender region trump_job_approval vote_intention regulate_co2
      <dbl>   <dbl> <chr>  <chr>               <dbl> <chr>                 <dbl>
 1   1.03e9    1954 Male   South                   4 Democratic No…            0
 2   1.03e9    1962 Male   North…                  1 Note sure                 0
 3   1.03e9    1944 Male   West                    2 Trump                     1
 4   1.03e9    1984 Female South                  NA Note sure                 1
 5   1.03e9    1976 Female South                  NA Democratic No…            0
 6   1.03e9    1977 Female Midwe…                  4 Democratic No…            0
 7   1.03e9    1981 Female West                    3 Democratic No…            1
 8   1.04e9    1957 Male   North…                  4 Democratic No…            0
 9   1.04e9    1950 Female West                    4 Democratic No…            0
10   1.04e9    1986 Female North…                  1 Trump                     0
VI(cces_sample)

The summary of each variable is
caseid: Min. 1030540333   1st Qu. 1031612233   Median 1032641570   Mean 1034379709   3rd Qu. 1038291065   Max. 1039977795  
birthyr: Min. 1944   1st Qu. 1954.75   Median 1969   Mean 1967.1   3rd Qu. 1980   Max. 1986  
gender: Length 10   Class character   Mode character  
region: Length 10   Class character   Mode character  
trump_job_approval: Min. 1   1st Qu. 1.75   Median 3.5   Mean 2.875   3rd Qu. 4   Max. 4   NA's 2  
vote_intention: Length 10   Class character   Mode character  
regulate_co2: Min. 0   1st Qu. 0   Median 0   Mean 0.3   3rd Qu. 0.75   Max. 1  
  • What does the code do?

    • The filter() function selects only the rows in the cces data where the caseid is in the sample_ids vector. The result is stored in cces_sample, which is a smaller subset of the original data.

Finally, we can calculate the sample mean in a similar way to calculating the population mean. The only difference is that instead of using the cces data, we use the cces_sample data:

mean(cces_sample$birthyr, na.rm = TRUE)
[1] 1967.1
  • Q2a. Using the cces_sample object (i.e. you do not need to recalculate cces_sample each time), what are the sample means for birthyr, trump_job_approval, and regulate_co2?
mean(cces_sample$birthyr, na.rm = TRUE)
[1] 1967.1
mean(cces_sample$trump_job_approval, na.rm = TRUE)
[1] 2.875
mean(cces_sample$regulate_co2, na.rm = TRUE)
[1] 0.3
  • Q2b. Are the sample means the same as the respective population mean calculated in Exercise 1? Why would this be the case?
# No - they are different!

# It's due to sampling error -- we are estimating the population mean (i.e. Exercise 1)
#  using a much smaller set of observations

2.8 Exercise 3 – Increasing the sample size

Building on the previous exercise, we can start to explore what happens when we resample our data.

  • Q3a. Create three new sampled dataframes called cces_sample2, cces_sample3 and cces_sample4, with 25, 500, and 1,000 observations respectively. Hint: you’ll need to create new vectors of ids first, i.e. sample_ids2, sample_ids3, sample_ids4
sample_ids2 <- sample(x = cces$caseid, size = 25, replace = FALSE)
sample_ids3 <- sample(x = cces$caseid, size = 500, replace = FALSE)
sample_ids4 <- sample(x = cces$caseid, size = 1000, replace = FALSE)

cces_sample2 <- filter(cces, caseid %in% sample_ids2)
cces_sample3 <- filter(cces, caseid %in% sample_ids3)
cces_sample4 <- filter(cces, caseid %in% sample_ids4)
  • What does the code do?

    • The code generates new sample IDs for each sample size (25, 500, and 1000).

    • It then filters the original cces data to create new sampled data frames (cces_sample2, cces_sample3, and cces_sample4).

  • Q3b. For each new sample, calculate the sample mean for the trump_job_approval variable (as in Exercise 2). Compare these sample means to the population mean.
mean(cces_sample2$trump_job_approval, na.rm = TRUE)
[1] 3.166667
mean(cces_sample3$trump_job_approval, na.rm = TRUE)
[1] 2.735113
mean(cces_sample4$trump_job_approval, na.rm = TRUE)
[1] 2.779519
  • Q3c. What do the results above tell us about our sample estimate of the population mean as our sample size increases?
# Remind ourselves what the population mean is:
mean(cces$trump_job_approval, na.rm = TRUE)
[1] 2.782826
# As sample size increases, our estimate of the population mean becomes more accurate 
# i.e. it is closer to the population mean

2.9 Exercise 4 – the normality of sampling distributions

In this final exercise, we are going to simulate running multiple samples and plotting the sampling distribution. Suppose now we had a bigger budget and could interview 1,000 individuals from our population (and that we could do this multiple times!).

One warning is that at present there is no good way of making this work well with a screen reader. I’ve left it in so you can learn the code. What the simulation does is create 1,000 sample means and then plot them; what you see is a bell-shaped curve centred on the population mean. This is a practical way of demonstrating the central limit theorem.

To simulate this, we are going to use slightly more complicated R code (but you’ve seen lots of this before). Don’t worry about understanding exactly how this works, you may simply copy and paste this code into your R script:

s_means <- data.frame(iteration = as.numeric(),
                      sample_mean = as.numeric())

for (i in 1:1000) { 
  
  s_ids <- sample(cces$caseid, 1000, replace = FALSE)
  s_data <- filter(cces, caseid %in% s_ids)
  
  s_mean <- mean(s_data$regulate_co2, na.rm = TRUE)
  
  s_means <- add_row(s_means, iteration = i, sample_mean = s_mean)
  
}

ggplot(s_means, aes(x = sample_mean)) +
  geom_density() +
  labs(x = "Sample Mean", y = "Density") +
  geom_vline(xintercept = mean(s_means$sample_mean),
             linetype = "dashed")
  • What does the code do?

    • The code runs a loop 1000 times to simulate drawing a random sample of 1000 individuals each time, calculating the mean of regulate_co2, and storing the results.

    • After collecting the sample means, the ggplot() function is used to create a density plot of the sample means.

This may take a few seconds to run. (For those interested, an annotated version of this code is included at the end.)

  • Q4a. What value is the distribution “centred” on, and what is this very similar to? Hint: think back to Exercise 1.
# By inspecting the graph, the middle of the normal distribution is approx. 0.325
# That's very similar to our population mean:
mean(cces$regulate_co2, na.rm = TRUE)
[1] 0.3249708
  • Q4b. Describe the shape of the graph. Does it resemble anything discussed in the statistics lab lecture?
# 1. It is bell-shaped
# 2. It is (almost) symmetric

# It looks a lot like a normal distribution!
  • Q4c. Why might this be a useful property of sampling distributions?
# 1. Notice that, on average, across our separate samples our sample estimate equals
#    the population estimate. So, on average, we're going to infer the CORRECT population
#    value. This means our estimate of the population mean is unbiased!

# 2. Because the distribution of sample means is normally distributed, we know that
#    we can infer that 95% of our sample estimates are going to fall within a certain 
#    distance from the mean of the sample means! Therefore, we can quantify how 
#    confident we are that a specific estimate contains the population parameter.

# (More on this in future lectures)

3 Appendix – Understanding the sampling distribution code

This discussion is for interest only. You are not expected to be able to produce this code in any of your assignments.

# First, we create an empty dataframe to store our results
s_means <- data.frame(iteration = as.numeric(),
                      sample_mean = as.numeric())

# Next, we declare a for-loop
# For loops allow us to repeat a block of code multiple times
# Each time, varying the value of a specific variable
# This will first assign the value 1 to a variable named i
# It will then execute all the code in the curly brackets
# Once the code within the brackets is completed, it loops 
# back to the top, sets i = 2, then runs the same code again
# It will continue to do this until i = 1000
for (i in 1:1000) { 
  
  # Within each loop of the code, we sample a new set of ids and data
  s_ids <- sample(cces$caseid, 1000, replace = FALSE) # Randomly chosen ids
  s_data <- filter(cces, caseid %in% s_ids) # Corresponding sample of the data
  
  # Next we calculate the sample mean for the question on 
  #   the regulation CO2 emissions
  # Note we can save this value to a variable
  s_mean <- mean(s_data$regulate_co2, na.rm = TRUE)
  
  # Then we're going to add a row to the dataframe we made earlier
  s_means <- add_row(s_means, iteration = i, sample_mean = s_mean)
  
} # This curly bracket closes the for-loop

# Finally, we plot the results:
ggplot(s_means, aes(x = sample_mean)) + # Set the data
  geom_density() + # Plot a "density" line
  labs(x = "Sample Mean", y = "Density") + # Change the axis labels for presentation
  geom_vline(xintercept = mean(s_means$sample_mean),
             linetype = "dashed") # Draw a vertical, dashed line at the mean of sample means

Data

Ansolabehere, Stephen; Schaffner, Brian; Luks, Samantha, 2020, “CCES Common Content, 2019”, https://doi.org/10.7910/DVN/WOT7O8, Harvard Dataverse, V1, UNF:6:34vNKfe/vAMemliFcOkbvw== [fileUNF]

Footnotes

  1. A vector is just a list of values. We have actually defined vectors before – it is the result of wrapping a list of strings/numbers using the c() command, e.g. c("Lab17Vote", "HouseOwned")↩︎