Lab 5: Correlations

An NCRM Case Study in Pedagogy

Author

Fanni Toth

Published

March 7, 2025

1. Introduction

Moving on from Lab 4

Over the past few weeks, we have progressively built our understanding of statistical analysis, moving from describing individual variables to analyzing relationships between them.

In Lab 1, we focused on univariate (one-variable) analysis, using descriptive statistics such as means, medians, standard deviations, and frequency distributions. In Lab 2, we introduced the Central Limit Theorem (CLT) to explain the importance of sampling distributions and their role in statistical inference.

From Lab 3 onward, we shifted to bivariate (two-variable) analysis, which allows us to examine whether and how two variables are related. We also introduced hypothesis testing, a key tool for determining whether patterns in the data are meaningful or simply due to chance. The choice of statistical test depends on the levels of measurement of our independent and dependent variables.

  • Lab 3: t-tests – We introduced t-tests as a method to compare differences in means between two groups. T-tests are appropriate when we have a binary categorical independent variable (e.g., gender, voted or not) and a numeric dependent variable (e.g., income, turnout). This test helps determine whether the observed difference in means is statistically significant.

  • Lab 4: chi-square tests – We then introduced chi-square tests, which assess relationships between two categorical variables. Unlike t-tests, which analyze numerical differences, chi-square tests help us determine whether the distribution of one categorical variable depends on another. To do this, we used contingency tables (crosstabs) to display the joint distribution of two categorical variables and compared percentages to identify patterns. The chi-square test then allowed us to assess whether the differences in these percentages were statistically significant.

Plan for this Lab

In this lab session, we will introduce correlation analysis, a statistical method used to measure the strength and direction of relationships between variables. Correlations are typically applied to continuous (numeric) variables, but they can also be used with ordinal (rank-ordered) categorical variables. Note that there are other types of correlation measures, but we will focus on these two key methods.

This final analytical tool will equip you with the ability to explore most types of data from a bivariate perspective, enabling you to conduct basic inferential analysis across different variable types.

2. Understanding Correlations

What Are Correlations?

Correlation analysis is a statistical method used to measure the strength and direction of the relationship between two variables. Unlike previous tests that compared group differences (such as t-tests) or associations between categorical variables (such as chi-square tests), correlation helps us assess how two numeric variables change together.

For example, we might ask:

  • Is there a relationship between years of education and income?

  • Do higher levels of political trust correspond with higher voter turnout?

  • Does economic growth correlate with approval ratings of the government?

Using Scatterplots to Identify Relationships

Before calculating correlations, a good first step is to visualize the relationship between two variables using a scatterplot. In a scatterplot:

  • The x-axis represents the independent variable (the potential cause).

  • The y-axis represents the dependent variable (the possible effect).

  • Each data point on the graph represents a case, plotted according to its values on both variables.

Patterns in scatterplots help us recognize different types of relationships. A positive correlation means that as one variable increases, the other also increases. For example, an increase in years of education might be associated with an increase in income. A negative correlation means that as one variable increases, the other decreases. An example of this would be the relationship between unemployment rate and GDP growth, where higher unemployment rates might correspond with lower economic growth. If there is no clear pattern in the data points, it indicates that the two variables are not related in any meaningful way.
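
If you want to see what these three patterns look like in practice, the sketch below generates each one with simulated data (all variable names here are purely illustrative):

set.seed(42)
x <- rnorm(100)
pos  <- x + rnorm(100, sd = 0.5)    # positive: rises as x rises
neg  <- -x + rnorm(100, sd = 0.5)   # negative: falls as x rises
none <- rnorm(100)                  # unrelated to x

par(mfrow = c(1, 3))                # show three plots side by side
plot(x, pos,  main = "Positive correlation")
plot(x, neg,  main = "Negative correlation")
plot(x, none, main = "No correlation")
par(mfrow = c(1, 1))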

Measuring Correlation: Pearson’s r

The Pearson correlation coefficient, often denoted as r, is a statistic that quantifies the direction and strength of a linear relationship between two continuous numeric variables. The value of Pearson’s r always falls between -1 and +1.

  • A value of +1 indicates a perfect positive correlation, meaning that as one variable increases, the other increases in a perfectly linear fashion.

  • A value of -1 indicates a perfect negative correlation, meaning that as one variable increases, the other decreases in a perfectly linear way.

  • A value of 0 means there is no correlation, suggesting that the two variables do not move together in a predictable way.

The strength of the relationship can be categorized based on the value of r.

  • Anything lower than 0.2 we would consider so weak that it is practically nonexistent, i.e. no correlation.

  • A weak correlation typically falls between 0.2 and 0.3 in absolute value (i.e. either positive or negative), meaning there is some relationship but it is not strong.

  • A moderate correlation falls between 0.3 and 0.5, indicating a more noticeable association.

  • A moderately strong to a strong correlation is indicated by values greater than 0.5, meaning the two variables have a substantial linear relationship.

Note however that these thresholds are a general rule of thumb and may vary by discipline. Different fields may use slightly different conventions for interpreting correlation strength, depending on the nature of the data and research context.

Example: Stability of Electoral Support

Suppose we test whether Labour’s 2015 vote share in each constituency is correlated with its 2010 vote share. If Labour’s support remained stable across elections, we would expect a strong positive correlation (i.e., constituencies that voted heavily for Labour in 2010 likely did so in 2015 as well).

If we compute Pearson’s r and find a value of 0.85, this suggests a strong positive relationship, meaning that past election results are a strong predictor of future voting patterns.
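
As a quick illustration of how such a value might be computed, the sketch below simulates stable constituency vote shares and calculates Pearson's r with the base R cor() function. The data are invented for demonstration and are not the real election results:

set.seed(7)
share2010 <- runif(100, min = 10, max = 60)   # hypothetical 2010 vote shares (%)
share2015 <- share2010 + rnorm(100, sd = 6)   # 2015 shares track 2010 with noise
cor(share2010, share2015)                     # close to +1: support is stable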

Testing for Statistical Significance

Like other statistical tests, correlation results include a p-value, which tells us whether the observed relationship is statistically significant. A p-value lower than 0.05 suggests that the correlation is unlikely to have occurred by chance. This allows us to reject the null hypothesis, which states that there is no relationship between the two variables. If the p-value is greater than 0.05, we do not have sufficient evidence to conclude that there is a meaningful correlation, and therefore we fail to reject the null hypothesis.

Spearman’s Rank Correlation: When Pearson’s r Isn’t Appropriate

Pearson’s correlation assumes that the relationship between variables is linear, meaning it follows a straight-line pattern. However, some relationships do not follow this pattern but still show a consistent trend. These are known as monotonic relationships, meaning the values either always increase or always decrease, but not necessarily in a straight line.

For example, consider the relationship between income and happiness. Happiness may rise quickly with income at first and then level off at higher incomes: always increasing, but not in a straight line. Pearson’s r would understate the strength of such a curved relationship. In cases like this, Spearman’s Rank Correlation, also known as Spearman’s rho, is a more suitable alternative. (Be careful to distinguish this from a U-shaped relationship, such as happiness being high among young adults, dipping in middle age, and rising again in retirement: a U-shape is not monotonic, and neither Pearson’s r nor Spearman’s rho would capture it properly.)

Spearman’s correlation is useful when one or both variables are ordinal, meaning they have a natural ranking, such as survey responses ranging from “strongly disagree” to “strongly agree.” It is also helpful when the relationship is monotonic but not linear, or when there are extreme outliers in the data that could distort Pearson’s r. Spearman’s correlation works by ranking the values of both variables and measuring how closely the rankings match. Like Pearson’s r, it ranges from -1 to +1, with larger absolute values indicating stronger relationships.

Note: Choosing Between Pearson’s r and Spearman’s rho

Pearson’s r is the preferred method when the relationship between two variables is both linear and monotonic, as it provides a precise measure of how strongly they move together in a straight-line relationship. However, Pearson’s r is sensitive to outliers and requires the relationship to be truly linear. If the relationship is monotonic but not linear, or if the data contain extreme outliers that could distort Pearson’s r, then Spearman’s rho is the better choice, as it is more robust and does not assume linearity.
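
The contrast is easy to demonstrate with toy data. In the sketch below the relationship is strictly increasing but curved, so Spearman's rho is exactly 1 while Pearson's r falls short of it:

x <- 1:20
y <- exp(x / 4)                   # always increasing, but not in a straight line
cor(x, y, method = "pearson")     # below 1: the relationship is not linear
cor(x, y, method = "spearman")    # exactly 1: the rankings match perfectly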

Key Takeaways

  1. Correlation analysis helps us examine relationships between two numeric variables.

  2. Scatterplots are a useful first step to visualize these relationships before conducting statistical tests.

  3. Pearson’s r is used for linear relationships between continuous variables, while Spearman’s rho is used when data are ordinal or have a non-linear monotonic relationship.

  4. Finally, the p-value helps determine whether the correlation is statistically significant, ensuring that our findings are meaningful rather than due to random chance.

This lab will provide hands-on practice in calculating and interpreting correlation coefficients using R.

3. Correlations using R

In this lab, we will introduce the following new command:

  • corrMatrix (from the jmv package, for calculating correlations)

We will also reuse several familiar commands:

  • filter
  • mutate
  • ggplot
  • geom_point
  • geom_smooth
  • boxplot
  • case_when

Research Question: What Explains Cross-National Variation in Turnout?

It is well understood that turnout varies across countries. In this lab, we start exploring why by examining a range of explanations, both institutional and behavioral. We will continue this theme of analysis through the homework exercises.

In our first example, we will address the following research question: What is the relationship between the competitiveness of elections and turnout?

We will also explore the importance of case selection: thinking carefully about which cases you include and which you might exclude. In relation to this, we will discuss how to spot outliers.

Setting our Hypotheses

Let’s start by setting our hypotheses. Remember, it is standard practice that we set both our null and our alternative hypotheses. The null is very important as it is what we refer to when testing our hypotheses.

The null hypothesis states that there is no relationship between the two variables. While we can specify an alternative hypothesis to reflect our expectations about the relationship, we must always consider the null hypothesis first. Based on our findings, we will either reject the null hypothesis if we find evidence supporting the alternative hypothesis, or we fail to reject the null hypothesis if there is insufficient evidence to support the alternative hypothesis:

  • H₀ (Null Hypothesis): There is no association between competitiveness and turnout.

The alternative hypothesis can be either non-directional or directional. A non-directional alternative hypothesis simply states that a relationship exists without specifying its direction:

  • Hₐ (Alternative Hypothesis - Non-Directional): There is an association between competitiveness and turnout.

Alternatively, a directional alternative hypothesis specifies whether we expect a positive or negative relationship. A directional hypothesis is useful when there is a strong theoretical foundation to predict the nature of the relationship, typically based on prior research.

In this case, rational choice theories of political participation suggest that individuals are more likely to vote when they believe their vote has a greater chance of influencing the outcome. When an election is highly competitive, the perceived probability of casting a decisive ballot increases, making voting a more rational choice. Based on this reasoning, our alternative hypothesis is:

  • Hₐ (Alternative Hypothesis - Positive Direction): Higher competitiveness leads to higher turnout.

Getting started

As always, follow these steps to get started:

  1. load RStudio
  2. load a Quarto script and save it
  3. set our working directory
  4. set up an output file to sink the results (not necessary but recommended)
  5. load our packages
  6. load our data
  7. set as factor (we have lots of categorical data today!)
  8. off we go!

First, we need to open a Quarto script and a code chunk so we can set our working directory. Remember, we need to run this before proceeding with the lab.

# Replace "path/to/your/directory" with the full path to your working folder
setwd("path/to/your/directory")

Next, I would again recommend setting up an output file, output_filelab5.txt, to sink the results so that we can check the analysis and any error messages as we go.

# Open a file connection
output_file <- file("output_filelab5.txt", open = "wt") 

# Redirect both standard output and error messages to the same file
sink(output_file, split = TRUE)  # Capture normal output
sink(output_file, type = "message")  # Capture errors and warnings
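
Remember that a sink stays active until you turn it off. At the end of your script, close both sinks and the file connection, otherwise R will keep diverting output to the file:

# Run these at the very end of your session
sink(type = "message")   # stop capturing errors and warnings
sink()                   # stop capturing normal output
close(output_file)       # close the file connection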

We also need to load the necessary packages.

library(jmv)
library(haven)
library(tidyverse)
library(BrailleR)

The Pippa Norris Dataset

Pippa Norris is a renowned political scientist based at Harvard and the University of Sydney, specializing in a wide range of behavioural and institutional topics related to elections. As a comparativist, she has dedicated much of her career to making large datasets accessible for analysis and has worked extensively to integrate multiple data sources into a single, valuable resource for researchers.

In this lab, we will use one of her most comprehensive datasets, which merges numerous datasets into a unified source. This allows us to explore a wide array of international relations and comparative politics topics using just one file. It is a rich resource, and you may find it worthwhile to explore further in your own time.

Today, we will use this dataset to examine voter turnout at an aggregate level. This means we will focus on country-level factors that may help explain variations in turnout across different nations.

We will use the following variables:

  1. Van_Comp: a measure of electoral competitiveness
  2. Van_Part: a measure of turnout
  3. Cheibub2Type: a measure of regime type (dichotomous: democracy or dictatorship)
  4. ElecFam2004: type of electoral system

Loading data and setting as factors

Next we need to load in our data and set it as factors. We will do this by adding the droplevels function, the same as last week.

norris <- read_spss("data/Democracy Cross-National Data.sav")

norrisfac <- as_factor(norris) %>% droplevels(.)

Getting to Know our Data

Let’s start off by having a descriptive look at our variables. As per usual, the first steps of good analysis are:

  1. Identification of key variables
  2. Identification of types of variables
  3. Preliminary descriptive statistics
  4. Graphical interpretations of relationships

One key observation from the codebook is that there are multiple turnout variables. This is because the dataset combines several different sources, each of which has measured turnout independently. These datasets may cover different countries, time periods, or elections, leading to slight variations in how turnout is recorded. As a result, measures may not always align perfectly.

The general rule when selecting a turnout variable is to choose the one that corresponds to the specific study or time period you are analyzing. If an exact match is not available, ensure that the variable you use is at least reasonably time-appropriate for your research question.

We are going to start by looking at data collected by Vanhanen.

  1. Dependent variable: Van_part
  2. Independent variable: Van_Comp

We will start by examining our variables descriptively to identify any potential issues that may require recoding.

Independent Variable: Competitiveness of Elections

This variable measures how competitive elections are, giving us some insight into how close an election was.

Let’s start by looking at the data.

class(norrisfac$Van_Comp)
[1] "numeric"
attributes(norrisfac$Van_Comp)
$label
[1] "Vanhanen Competition (% seats largest party) 2000"

$format.spss
[1] "F10.2"

$display_width
[1] 10
summary(norrisfac$Van_Comp)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
   0.00   15.45   45.42   47.40   54.06 2000.00      11 

What does this tell us? The class function confirms that the variable is numeric, while the attributes function provides a descriptive label:

“Vanhanen Competition (% seats largest party) 2000”

The summary function further generates key summary statistics.

Do we notice any issues? Given that competition is measured as the percentage of seats held by the largest party, the fact that the summary shows a maximum value of 2000 raises concerns about a potential data issue.

What could be causing this?

  • A missing value coded incorrectly?

  • An error in the data entry or coding process?

While we cannot determine the exact cause, we do know that this value is problematic and requires further investigation or correction.

The next step is to look at a boxplot to help identify the data structure and examine whether there are any potential outliers in our data that we need to remove.

box1 = boxplot(norrisfac$Van_Comp, main = "Boxplot of Van_Comp", ylab = "Van_Comp")

VI(box1)
This graph has a boxplot printed vertically
With the title: Boxplot of Van_Comp
"" appears on the x-axis.
"" appears on the y-axis.
Tick marks for the y-axis are at: 0, 500, 1000, 1500, and 2000 
This variable  has 184 values.
An outlier is marked at: 2000 
The whiskers extend to 0 and 70 from the ends of the box, 
which are at 15.3 and 54.075 
The median, 45.425 is 78 % from the lower end of the box to the upper end.
The upper whisker is 1.04 times the length of the lower whisker.

What does this output tell us?

First, we can confirm that the VI description of the boxplot matches the values obtained using the summary function. The description indicates that the lower end of the box is 15.3 and the upper end is 54.075, which corresponds to the first quartile (Q1) and third quartile (Q3) in the summary statistics. The whiskers extend from 0, which is the minimum value recorded in the dataset.

However, the upper whisker extends to 70, and the boxplot identifies an outlier marked at 2000. This extreme value is highly unusual and does not represent the overall distribution of the data. If we leave it in the dataset, it will significantly distort our results, influencing measures like the mean and standard deviation. Therefore, it is important to remove this outlier before proceeding with the analysis.

We can do this using mutate and case_when.

[Note that in Lab 4, we used recode with mutate because we were modifying a factor (categorical) variable. In this case, we are modifying a numeric variable, so case_when is the appropriate function to use.]

The command below creates a new variable called Compelec, which is a modified version of Van_Comp. The condition specifies that we should only include values of Van_Comp that are less than or equal to 100; this threshold makes sense because the maximum possible percentage of seats held by the largest party is 100%. Any case that does not meet the condition (such as the 2000 outlier) is set to missing (NA).

norrisfac <- norrisfac %>%
  mutate(Compelec = case_when(Van_Comp <= 100 ~ Van_Comp))

Now that we have removed the outlier, we can rerun the boxplot and summary statistics to confirm that the data is cleaned.

box2 = boxplot(norrisfac$Compelec, main = "Boxplot of Compelec", ylab = "Compelec")

VI(box2)
This graph has a boxplot printed vertically
With the title: Boxplot of Compelec
"" appears on the x-axis.
"" appears on the y-axis.
Tick marks for the y-axis are at: 0, 10, 20, 30, 40, 50, 60, and 70 
This variable  has 183 values.
There are no outliers marked for this variable 
The whiskers extend to 0 and 70 from the ends of the box, 
which are at 15.3 and 53.625 
The median, 45.4 is 79 % from the lower end of the box to the upper end.
The upper whisker is 1.07 times the length of the lower whisker.
summary(norrisfac$Compelec)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
   0.00   15.30   45.40   36.73   53.62   70.00      12 

Upon reviewing the results, we can see that the outlier at 2000 has been removed, and our maximum value now is 70. There are no other extreme values that appear problematic. This means the variable Compelec is now cleaned and ready for further analysis!

Dependent Variable: Turnout

Now let’s focus on our turnout variable.

We will be using the variable Van_Part. As always, before conducting any analysis, it is important to first explore the variable to better understand its distribution and characteristics. To start, we will generate a summary of Van_Part to examine its key statistics.

summary(norrisfac$Van_Part)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
   0.00   22.07   35.20   33.12   46.91   70.16      11 

Are there any potential issues? One thing to note is that the minimum value is 0, which suggests that no one at all voted in at least one election. This could indicate a real case where turnout was genuinely 0%, or it could be the result of a coding issue in the dataset.

To investigate further, let’s generate a boxplot to visualize the distribution of Van_Part and check for any unusual values.

box3 = boxplot(norrisfac$Van_Part, main = "Boxplot of Van_Part", ylab = "Van_Part")

VI(box3)
This graph has a boxplot printed vertically
With the title: Boxplot of Van_Part
"" appears on the x-axis.
"" appears on the y-axis.
Tick marks for the y-axis are at: 0, 10, 20, 30, 40, 50, 60, and 70 
This variable  has 184 values.
There are no outliers marked for this variable 
The whiskers extend to 0 and 70.16 from the ends of the box, 
which are at 22.01 and 47.015 
The median, 35.205 is 53 % from the lower end of the box to the upper end.
The upper whisker is 1.05 times the length of the lower whisker.

The VI description confirms that there are no statistical outliers in the dataset. However, despite this, there is still a strong rationale for removing cases with very low turnout. Elections with exceptionally low turnout are rare events, yet they can have a disproportionate impact on our analysis, potentially skewing the results.

Since such cases are infrequent but can significantly alter statistical outcomes, let’s recode any election with a turnout of less than 10% to exclude these extreme values.

We will use the same approach as before, but this time, instruct R to retain only cases where turnout is at least 10%. This is done using the greater than or equal to (>=) operator.

norrisfac <- norrisfac %>%
  mutate(Vanpart1 = case_when(Van_Part >= 10 ~ Van_Part))

Let’s check that this has worked:

box4 = boxplot(norrisfac$Vanpart1, main = "Boxplot of Vanpart1", ylab = "Vanpart1")

VI(box4)
This graph has a boxplot printed vertically
With the title: Boxplot of Vanpart1
"" appears on the x-axis.
"" appears on the y-axis.
Tick marks for the y-axis are at: 10, 20, 30, 40, 50, 60, and 70 
This variable  has 156 values.
There are no outliers marked for this variable 
The whiskers extend to 10.63 and 70.16 from the ends of the box, 
which are at 28.9 and 49.095 
The median, 39.4 is 52 % from the lower end of the box to the upper end.
The upper whisker is 1.15 times the length of the lower whisker.
summary(norrisfac$Vanpart1)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
  10.63   28.95   39.40   39.01   48.86   70.16      39 

Both the boxplot’s whiskers and the summary statistics confirm that the lowest recorded value is now 10.63. This means that all cases with turnout below 10% have been successfully removed from the dataset.

Examining the correlation between competitiveness and turnout

Before calculating correlation coefficients, it is essential to determine whether a relationship is suitable for correlation analysis. Correlation assumes that there is some kind of systematic relationship between the two variables, but the type of correlation we use depends on whether the relationship is linear or monotonic.

Sighted individuals typically assess this by examining scatterplots with a fitted lowess (locally weighted smoothing) line. A lowess line is a smoothed curve that helps visualize the overall trend between two variables. If the line follows a straight path, the relationship is linear, making Pearson’s correlation appropriate. If the line curves but still moves consistently in one direction, the relationship is monotonic but not linear, meaning Spearman’s correlation is a better choice.

However, at present, there is no reliable way for BrailleR or any other R package to generate alt text that would accurately and sufficiently describe the shape of the lowess line in a way that allows a blind individual to independently assess linearity. As a workaround, we use a numerical approach based on comparing Pearson’s r and Spearman’s rho (more on this below).

Producing Scatterplots

Although blind users rely on numerical methods, it is still important to understand how to generate scatterplots when presenting research to others. Scatterplots provide a visual representation of relationships, allowing sighted colleagues or reviewers to assess patterns in the data. Below is an example of how to create a scatterplot in R using ggplot2, with geom_point() to plot individual data points and geom_smooth() to add a lowess (locally weighted smoothing) line to highlight trends:

plot1 = ggplot(norrisfac, aes(x = Compelec, y = Vanpart1)) +
  geom_point() +         
  geom_smooth() +
  labs(title = "Relationship Between Competitiveness and Election Turnout", x = "Competitiveness", y = "Turnout")

VI(plot1)
Warning: Removed 40 rows containing non-finite outside the scale range
(`stat_smooth()`).
This chart has title 'Relationship Between Competitiveness and Election Turnout'.
It has x-axis 'Competitiveness' with labels 0, 20, 40 and 60.
It has y-axis 'Turnout' with labels 20, 40 and 60.
It has 2 layers.
Layer 1 is a set of 155 big solid circle points of which about 99% can be seen.
Layer 2 is a 'lowess' smoothed curve with 95% confidence intervals covering 13% of the graph.

An alternative way to generate the same graph is the qplot() function (note that qplot() has been deprecated since ggplot2 3.4.0, as the warning below indicates):

plot2 = qplot(x=Compelec, y=Vanpart1, data = norrisfac, geom = c("point", "smooth"))  +
  labs(title = "Relationship Between Competitiveness and Election Turnout",
       x = "Turnout",
       y = "Competitiveness")
Warning: `qplot()` was deprecated in ggplot2 3.4.0.
VI(plot2)
Warning: Removed 40 rows containing non-finite outside the scale range
(`stat_smooth()`).
This chart has title 'Relationship Between Competitiveness and Election Turnout'.
It has x-axis 'Competitiveness' with labels 0, 20, 40 and 60.
It has y-axis 'Turnout' with labels 20, 40 and 60.
It has 2 layers.
Layer 1 is a set of 155 big solid circle points of which about 99% can be seen.
Layer 2 is a 'lowess' smoothed curve with 95% confidence intervals covering 13% of the graph.

As the VI description confirms, both graphs are identical in structure.

While the VI description provides some insight into the graph’s layout, it does not convey the distribution of the scatterplot points or the shape of the lowess curve, both of which are crucial for assessing linearity. It does, however, indicate that 40 rows were removed due to non-finite values, signalling missing data. The scatterplot contains 155 data points, of which about 99% are visible, meaning a few may be overlapping or outside the visible range.

The lowess smoothed curve is a statistical trend line that estimates the relationship between the two variables, surrounded by a shaded band: the 95% confidence interval (CI), which reflects uncertainty about where the underlying trend line lies. The description tells us this band covers 13% of the graph area, so it is relatively wide, indicating considerable uncertainty about the trend: the data points are widely spread and do not follow a tight pattern, making it difficult to judge the relationship from the lowess curve alone. This reinforces the importance of relying on numerical comparisons between Pearson’s r and Spearman’s rho, rather than visual assessment alone, for a more accurate interpretation of the relationship.
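
One possible numerical workaround, sketched below, is to fit the smoother directly and print its fitted values at a few evenly spaced points: if the printed trend rises (or falls) steadily, the relationship is monotonic. This assumes the cleaned variables Compelec and Vanpart1 already exist, and uses loess(), which is similar (though not identical) to the smoother geom_smooth() applies to small samples:

fit  <- loess(Vanpart1 ~ Compelec, data = norrisfac)   # fit the smoothed curve
grid <- data.frame(Compelec = seq(0, 70, by = 10))     # points at which to evaluate it
cbind(grid, trend = predict(fit, newdata = grid))      # print the trend numerically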

Choosing Between Pearson’s r and Spearman’s rho

To formally determine whether to use Pearson’s correlation (which assumes linearity) or Spearman’s correlation (which only assumes a monotonic trend), we compute both correlation coefficients and compare them.

After generating the correlation matrix, compare the Pearson’s r and Spearman’s rho values:

  • If the difference between them is less than 0.1, the relationship is likely linear, so Pearson’s correlation is preferred, because it tends to be more accurate.

  • If the difference is 0.1 or greater, the relationship is likely monotonic but not linear, meaning Spearman’s correlation is a better choice, because it tends to be more robust.

This method ensures that we select the appropriate correlation measure based on statistical comparisons rather than visual assessment, making the analysis fully accessible while still allowing for accurate interpretation of relationships between variables.
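
This comparison can be wrapped into a small helper function. The sketch below is purely illustrative: the function name and the 0.1 threshold are ours, not part of base R or jmv:

# Hypothetical helper implementing the rule of thumb above
choose_correlation <- function(x, y, threshold = 0.1) {
  r   <- cor(x, y, method = "pearson",  use = "complete.obs")
  rho <- cor(x, y, method = "spearman", use = "complete.obs")
  suggested <- if (abs(r - rho) < threshold) "pearson" else "spearman"
  list(pearson = r, spearman = rho, suggested = suggested)
}

# Example usage with the variables from this lab
choose_correlation(norrisfac$Vanpart1, norrisfac$Compelec)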

Computing Correlations in R

There are two methods for producing correlation matrices, each with different advantages.

The first method uses the corrMatrix function from the jmv package, which allows you to calculate both Pearson’s r and Spearman’s rho in a single line of code. However, the output is presented in a structured table, which may not be the most screen-reader-friendly option.

The code for this method is as follows:

corrMatrix(norrisfac, vars = vars(Vanpart1, Compelec), spearman = TRUE) 

 CORRELATION MATRIX

 Correlation Matrix                                       
 ──────────────────────────────────────────────────────── 
                                 Vanpart1     Compelec    
 ──────────────────────────────────────────────────────── 
   Vanpart1    Pearson's r               —                
               df                        —                
               p-value                   —                
               Spearman's rho            —                
               df                        —                
               p-value                   —                
                                                          
   Compelec    Pearson's r       0.3205272            —   
               df                      153            —   
               p-value           0.0000478            —   
               Spearman's rho    0.4060211            —   
               df                      153            —   
               p-value           0.0000002            —   
 ──────────────────────────────────────────────────────── 

Note that we only need to specify spearman = TRUE to include Spearman’s correlation, as corrMatrix prints Pearson’s correlation by default.

A more screen-reader-friendly alternative is to use the cor.test() function from base R. While this requires running two separate commands — one for Pearson’s and one for Spearman’s correlation — it provides output in a simpler, plain text format, making it easier to navigate with a screen reader.

cor.test(norrisfac$Vanpart1, norrisfac$Compelec, method = "pearson") 

    Pearson's product-moment correlation

data:  norrisfac$Vanpart1 and norrisfac$Compelec
t = 4.1855, df = 153, p-value = 4.782e-05
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.1715473 0.4551752
sample estimates:
      cor 
0.3205272 
cor.test(norrisfac$Vanpart1, norrisfac$Compelec, method = "spearman") 
Warning in cor.test.default(norrisfac$Vanpart1, norrisfac$Compelec, method =
"spearman"): Cannot compute exact p-value with ties

    Spearman's rank correlation rho

data:  norrisfac$Vanpart1 and norrisfac$Compelec
S = 368635, p-value = 1.592e-07
alternative hypothesis: true rho is not equal to 0
sample estimates:
      rho 
0.4060211 

[Note: when using cor.test() with Spearman’s correlation, you may see a warning about ties: ‘Cannot compute exact p-value with ties.’ This occurs because Spearman’s correlation is based on ranks, and tied values require an approximate p-value rather than an exact one. However, this does not affect the correlation coefficient itself and can generally be ignored in large datasets.]
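
If you would rather not see the warning at all, cor.test() accepts an exact argument; setting exact = FALSE requests the approximate p-value directly:

cor.test(norrisfac$Vanpart1, norrisfac$Compelec,
         method = "spearman", exact = FALSE)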

How Do We Interpret the Output?

To interpret the output, we follow these steps:

  1. Determine which correlation to use by comparing Pearson’s r and Spearman’s rho.

  2. Interpret the direction and strength of the correlation coefficient.

  3. Evaluate the p-value to decide whether to reject or fail to reject the null hypothesis.

First, to decide which correlation to use, we compare the absolute difference between Pearson’s r and Spearman’s rho. Pearson’s r is 0.32, while Spearman’s rho is 0.41. Since the absolute difference is less than 0.1, this suggests that the relationship is approximately linear, meaning Pearson’s correlation is appropriate. However, since the difference is close to 0.1, we might also consider Spearman’s rho for robustness, especially if there are concerns about outliers or non-linearity.

Second, we consider the strength and direction of the relationship. As mentioned earlier, correlation values range from -1 to +1:

  • -1 indicates a perfect negative relationship (as one variable increases, the other decreases).

  • 0 means no relationship.

  • +1 indicates a perfect positive relationship (as one variable increases, so does the other).

In this case, the Pearson’s r = 0.32 and Spearman’s rho = 0.41, both indicating a positive relationship. Based on standard interpretation:

  • A correlation between 0.2 and 0.4 suggests a weak to moderate positive relationship.

  • A correlation above 0.4 is considered moderate to increasingly strong.

Since Pearson’s r is 0.32, the relationship is weak to moderate, while Spearman’s rho at 0.41 suggests a slightly stronger, but still moderate, relationship. This means that as competitiveness increases, turnout also tends to increase, but the relationship is not very strong.

Third, we look at our p-values to determine whether we can reject our null hypothesis. In this case, the Pearson’s correlation p-value is 4.782e-05, and the Spearman’s correlation p-value is 1.592e-07. When you see e-numbers in p-values, these indicate very small values, meaning we can usually reject the null hypothesis. (For example, 4.782e-05 represents 4.782 × 10⁻⁵, or 0.00004782.)

Since both p-values are far below the conventional threshold of 0.05, we reject the null hypothesis. This means there is statistically significant evidence that competitiveness and turnout are positively associated, rather than the relationship occurring by chance.
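
If you prefer to read p-values in fixed notation, you can convert them with base R's format():

format(4.782e-05, scientific = FALSE)   # "0.00004782"
format(1.592e-07, scientific = FALSE)   # "0.0000001592"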

The Importance of Recoding Variables

What would have happened if we had not recoded our turnout variable by removing cases where turnout was below 10%? To explore this, we compare the difference in Pearson’s r values before and after recoding.

cor.test(norrisfac$Van_Part, norrisfac$Compelec, method = "pearson")

    Pearson's product-moment correlation

data:  norrisfac$Van_Part and norrisfac$Compelec
t = 11.317, df = 181, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.5500222 0.7214140
sample estimates:
      cor 
0.6437201 
cor.test(norrisfac$Vanpart1, norrisfac$Compelec, method = "pearson") 

    Pearson's product-moment correlation

data:  norrisfac$Vanpart1 and norrisfac$Compelec
t = 4.1855, df = 153, p-value = 4.782e-05
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.1715473 0.4551752
sample estimates:
      cor 
0.3205272 

Before recoding, Pearson’s r is 0.64. After recoding, Pearson’s r drops to 0.32, shifting from a moderate-strong correlation to a weak-moderate one. Although the p-value remains below 0.05, meaning we would still reject the null hypothesis, the relationship is now much weaker than it initially appeared.

Why does this matter?

Recoding is crucial! Including cases that are not representative or contain errors can significantly distort our interpretation of results. While both correlations are statistically significant, the strength of the relationship has changed dramatically, illustrating how poorly handled data can lead to misleading conclusions. Ensuring that variables are appropriately recoded helps us produce more accurate and reliable findings.

Analyzing Democracies Only

Thinking Carefully About Case Selection

So far, we have analyzed all countries in the dataset, including both democratic and authoritarian regimes. However, does it make sense to examine electoral turnout patterns in countries that are not democracies? Including non-democratic states may dramatically affect our findings because factors unique to authoritarian regimes—such as election rigging, political repression, or lack of electoral competition—could significantly alter voter behavior. If elections are seen as inconsequential or manipulated, this is likely to influence the decision to vote.

To account for this, we may want to remove authoritarian regimes and focus only on democratic states. This is a common approach when studying democratic institutions like elections. Fortunately, we can do this using familiar tools: we can filter our dataset to include only democracies.

In this dataset, we use the democracy scale from Cheibub et al., which classifies countries as either democracies or dictatorships. Let’s first examine this variable:

attributes(norrisfac$Cheibub2Type)
$levels
[1] "Democracy"    "Dictatorship"

$class
[1] "factor"

Now that we have identified the democracy classification variable, we can filter the dataset to include only democratic states. We use the filter() function to create a new dataset, norrisdem, which contains only democracies:

norrisdem <- filter(norrisfac, Cheibub2Type == "Democracy") 

Re-examining the Relationship in Democracies Only

First, let’s recall our original findings using the full dataset. We start by looking at the scatterplot:

plot1 = ggplot(norrisfac, aes(x = Compelec, y = Vanpart1)) +
  geom_point() +         
  geom_smooth() +
  labs(title = "Relationship Between Competitiveness and Election Turnout", x = "Competitiveness", y = "Turnout") 

VI(plot1)
Warning: Removed 40 rows containing non-finite outside the scale range
(`stat_smooth()`).
This chart has title 'Relationship Between Competitiveness and Election Turnout'.
It has x-axis 'Competitiveness' with labels 0, 20, 40 and 60.
It has y-axis 'Turnout' with labels 20, 40 and 60.
It has 2 layers.
Layer 1 is a set of 155 big solid circle points of which about 99% can be seen.
Layer 2 is a 'lowess' smoothed curve with 95% confidence intervals covering 13% of the graph.

Now, let’s compare this to our filtered dataset, which includes only democracies:

plot3 = ggplot(norrisdem, aes(x = Compelec, y = Vanpart1)) +
  geom_point() +         
  geom_smooth() +
  labs(title = "Relationship Between Competitiveness and Election Turnout in Democracies", x = "Competitiveness", y = "Turnout")

VI(plot3)
Warning: Removed 7 rows containing non-finite outside the scale range
(`stat_smooth()`).
This chart has title 'Relationship Between Competitiveness and Election Turnout in Democracies'.
It has x-axis 'Competitiveness' with labels 0, 20, 40 and 60.
It has y-axis 'Turnout' with labels 20, 40 and 60.
It has 2 layers.
Layer 1 is a set of 107 big solid circle points of which about 98% can be seen.
Layer 2 is a 'lowess' smoothed curve with 95% confidence intervals covering 21% of the graph.

What Changed?

After filtering out non-democracies, we see a notable reduction in the number of cases. The VI function highlights that instead of 155 data points, there are now only 107. This means nearly a third of the data has been removed, which could significantly impact our results.

This reduction in cases has key implications:

  1. The dataset now consists only of democracies, eliminating potential distortions from authoritarian states where turnout may be influenced by coercion or fraudulent practices.

  2. The correlation coefficient is expected to be stronger, as extreme cases that previously pulled the relationship in different directions have been removed.

Checking the Correlation Coefficients

Now, let’s formally test whether the correlation differs by comparing the Pearson’s r coefficient between the full dataset (norrisfac) and the subset containing democracies only (norrisdem):

cor.test(norrisfac$Vanpart1, norrisfac$Compelec, method = "pearson") 

    Pearson's product-moment correlation

data:  norrisfac$Vanpart1 and norrisfac$Compelec
t = 4.1855, df = 153, p-value = 4.782e-05
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.1715473 0.4551752
sample estimates:
      cor 
0.3205272 
cor.test(norrisdem$Vanpart1, norrisdem$Compelec, method = "pearson") 

    Pearson's product-moment correlation

data:  norrisdem$Vanpart1 and norrisdem$Compelec
t = 5.0065, df = 105, p-value = 2.245e-06
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.2717849 0.5804696
sample estimates:
      cor 
0.4389911 

We see a clear shift in Pearson’s r—it has increased from 0.32 to 0.44, a notable difference! This suggests that excluding non-democratic states has strengthened the observed relationship.

Key Takeaway: Case Selection Matters!

Case selection is critical in research. Think carefully about which cases you include and exclude, ensuring that any omissions are theoretically justified. In this case, we have a clear rationale for removing non-democratic states: authoritarian regimes fundamentally alter political behavior due to low trust, lack of engagement, or fear surrounding elections. By focusing only on democracies, we obtain results that better reflect the dynamics of electoral participation in genuinely competitive systems.

Controlling for a Third Variable

A final step in our analysis is to examine the relationship between competitiveness and turnout while controlling for electoral system, using the variable ElecFam2004. While correlation alone does not allow us to formally control for additional variables (for that, we would need regression analysis, which is beyond the scope of this course), we can still apply some useful workarounds to explore this relationship.

Using Facet Wrap to Visualize Electoral System Differences

To better understand how electoral system type affects the relationship between turnout and competitiveness, we can use facet_wrap() to create separate scatterplots for each electoral system type (Majoritarian, Combined, and Proportional). Let’s also include drop_na to get rid of missing values:

plot4 = norrisdem %>%
  drop_na(Compelec, Vanpart1, ElecFam2004) %>%
  ggplot(aes(x = Compelec, y = Vanpart1)) +
  geom_point() +         
  geom_smooth() +
  labs(title = "Relationship Between Competitiveness and Election Turnout in Democracies by Electoral System", x = "Competitiveness", y = "Turnout") +
  facet_wrap(~ElecFam2004)

VI(plot4)
This chart has title 'Relationship Between Competitiveness and Election Turnout in Democracies by Electoral System'.
The chart is comprised of 3 panels containing sub-charts, arranged horizontally.
The panels represent different values of ElecFam2004.
Each sub-chart has x-axis 'Competitiveness' with labels 0, 20, 40 and 60.
Each sub-chart has y-axis 'Turnout' with labels 20, 40 and 60.
Each sub-chart has 2 layers.
Panel 1 represents data for ElecFam2004 = Majoritarian COORD = 1.
Layer 1 of panel 1 is a set of 35 big solid circle points of which about 100% can be seen.
Layer 2 of panel 1 is a 'lowess' smoothed curve with 95% confidence intervals covering 34% of the graph.
Panel 2 represents data for ElecFam2004 = Combined COORD = 1.
Layer 1 of panel 2 is a set of 22 big solid circle points of which about 100% can be seen.
Layer 2 of panel 2 is a 'lowess' smoothed curve with 95% confidence intervals covering 34% of the graph.
Panel 3 represents data for ElecFam2004 = Proportional COORD = 1.
Layer 1 of panel 3 is a set of 50 big solid circle points of which about 98% can be seen.
Layer 2 of panel 3 is a 'lowess' smoothed curve with 95% confidence intervals covering 34% of the graph.

The VI function will summarize key aspects of the graph, helping us understand some of the structural differences across electoral systems without relying on visual inspection.

Running Correlations Separately for Each Electoral System

To further explore how electoral systems affect the relationship between turnout and competitiveness, we can filter the dataset to create subsets for each electoral system type and run separate Pearson correlations within each group:

group1 = norrisdem %>% filter(ElecFam2004 == "Combined")
group2 = norrisdem %>% filter(ElecFam2004 == "Majoritarian")
group3 = norrisdem %>% filter(ElecFam2004 == "Proportional")

cor.test(group1$Vanpart1, group1$Compelec, method = "pearson") 

    Pearson's product-moment correlation

data:  group1$Vanpart1 and group1$Compelec
t = 2.4316, df = 20, p-value = 0.02456
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.07020732 0.74853557
sample estimates:
      cor 
0.4776765 
cor.test(group2$Vanpart1, group2$Compelec, method = "pearson") 

    Pearson's product-moment correlation

data:  group2$Vanpart1 and group2$Compelec
t = 2.2907, df = 33, p-value = 0.0285
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.04238839 0.62633707
sample estimates:
      cor 
0.3704027 
cor.test(group3$Vanpart1, group3$Compelec, method = "pearson") 

    Pearson's product-moment correlation

data:  group3$Vanpart1 and group3$Compelec
t = 3.0148, df = 48, p-value = 0.0041
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.1357420 0.6096507
sample estimates:
      cor 
0.3990136 

We can see from the outputs that the correlation between turnout and competitiveness varies across electoral systems, showing a moderate positive relationship in all three cases.

In combined systems, Pearson’s r is 0.48 with a p-value of 0.025, indicating a moderate and statistically significant relationship. For majoritarian systems, Pearson’s r is 0.37 with a p-value of 0.029. The relationship remains moderate, but it is weaker than in combined systems. Since the p-value is below 0.05, we still reject the null hypothesis, confirming a significant association. In proportional representation systems, Pearson’s r is again moderate with a value of 0.40 and a p-value of 0.004, making it the most statistically significant of the three.

Overall, since the p-value is below 0.05 in all three cases, we can reject the null hypothesis of no association between turnout and competitiveness, even when controlling for electoral systems. (Remember, if even one of the relationships had been non-significant, we would have to fail to reject the null or, at best, only partially reject it.)

We find that while the relationship between turnout and competitiveness holds across all electoral systems, its strength varies: it is strongest in Combined systems, and weakest in Majoritarian systems. This reinforces the importance of controlling for institutional differences, as the effect of competitiveness on turnout is not uniform across different electoral contexts.

Conclusions

In this lab, we have demonstrated how to use R to compute and interpret Pearson’s and Spearman’s correlations, assess relationships between variables, and apply practical workarounds to account for additional factors.

What are the key things to take away?

  • Choosing the right correlation measure matters. Pearson’s r is appropriate for linear relationships, while Spearman’s rho is better for monotonic but non-linear relationships.

  • Statistical significance does not imply strength. A correlation can be significant but still weak, as seen in some cases here.

  • Recoding and case selection are crucial. Cleaning data and focusing on relevant cases (e.g., democracies only) can substantially change results.

  • Controlling for additional variables improves analysis. While correlation alone cannot formally control for third variables, filtering subsets of data can provide insights into how relationships differ across contexts.

Understanding these concepts ensures more accurate interpretations of correlations and reinforces best practices for data analysis in R.

Next, we move on to the homework exercises.

4. Homework Exercises

Introduction

In this homework, we will continue exploring cross-national variation in voter turnout. This assignment will require you to apply key data analysis skills, including recoding variables, generating box plots and scatterplots, and selecting the appropriate correlation tests to assess relationships between variables.

What affects turnout?

You will analyze the following variables:

  • Van_Part

  • Prop

  • freeandfairscore

To better understand these variables, refer to the Codebook, which provides definitions and descriptions. Additionally, use commands such as class() to check the data type, attributes() to see associated metadata, str() to examine the overall structure, and summary() to generate key descriptive statistics. These tools will help you interpret the variables and identify any necessary transformations or recoding steps.

For each variable, follow these steps:

  1. Formulate the hypotheses: State a null and alternative hypothesis and briefly explain why you expect a particular relationship.

  2. Examine the variables: Inspect them using summary statistics and visualizations.

  3. Perform any necessary recoding: Use boxplots to identify potential outliers and assess the distribution of the data. Additionally, apply informed judgment based on theoretical reasoning to determine if transformations or recoding are necessary.

  4. Generate scatterplots: Create visual representations of the relationship between the variables.

  5. Run both correlation tests: Compute both Pearson’s and Spearman’s correlation using cor.test() or corrMatrix. Compare the results to determine which correlation is more appropriate for interpretation.

  6. Interpret the results: After selecting the appropriate correlation test, analyze the strength, direction, and statistical significance (p-value) of the relationship. Based on the p-value, determine whether to reject or fail to reject the null hypothesis.

This structured approach will help you draw meaningful conclusions about the factors influencing voter turnout across different countries.

Getting our data ready

The following steps are provided in case you are starting from a fresh script. They guide you through setting up R for analysis and recoding the dependent variable. These steps mirror the beginning of the lab workbook, so if you have already completed the setup, you can skip ahead to the section “Exercise 1: Proportionality”.

  1. load RStudio
  2. load a Quarto script and save it
  3. set our working directory
  4. set up an output file to sink the results (not necessary but recommended)
  5. load our packages
  6. load our data
  7. set as factor (we have lots of categorical data today!)
  8. off we go!

Loading packages and data

After setting the working directory and configuring an output file to capture results, you will need to load the following packages: jmv (for running correlations), haven (for loading SPSS data), tidyverse (for recoding and creating graphs), and BrailleR (for accessibility support).

library(jmv)
library(haven)
library(tidyverse)
library(BrailleR)

Next, we need to load our data using read_spss as we have SPSS data, and convert relevant variables using as_factor while dropping unused levels in the new version.

norris <- read_spss("data/Democracy Cross-National Data.sav")

norrisfac <- as_factor(norris) %>% droplevels(.)

Recoding our dependent variable: Turnout

Let’s start by recoding our dependent variable, Van_Part, which measures turnout. Before making any changes, we need to examine it using class(), attributes(), summary(), and a boxplot() to understand its structure and distribution.

Note: An alternative to using class() and attributes() is str(), which provides the variable class, its label, and a preview of the first few observations. However, str() may be harder to interpret with a screen reader, so you might prefer using separate commands for clarity.

class(norrisfac$Van_Part)
[1] "numeric"
attributes(norrisfac$Van_Part)
$label
[1] "Vanhanen Participation (electoral turnout) 2000"

$format.spss
[1] "F10.2"

$display_width
[1] 10
summary(norrisfac$Van_Part)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
   0.00   22.07   35.20   33.12   46.91   70.16      11 
str(norrisfac$Van_Part)
 num [1:195] 0 34.5 33.8 NA 0 ...
 - attr(*, "label")= chr "Vanhanen Participation (electoral turnout) 2000"
 - attr(*, "format.spss")= chr "F10.2"
 - attr(*, "display_width")= int 10

There are clearly some unusual findings here, such as turnout as low as 0%, which may indicate missing data, errors, or extreme cases. To further investigate, let’s generate a boxplot to check for potential outliers and assess the distribution of turnout values.

box_vanpart = boxplot(norrisfac$Van_Part, main = "Boxplot of Van_Part", ylab = "Van_Part")

VI(box_vanpart)
This graph has a boxplot printed vertically
With the title: Boxplot of Van_Part
"" appears on the x-axis.
"" appears on the y-axis.
Tick marks for the y-axis are at: 0, 10, 20, 30, 40, 50, 60, and 70 
This variable  has 184 values.
There are no outliers marked for this variable 
The whiskers extend to 0 and 70.16 from the ends of the box, 
which are at 22.01 and 47.015 
The median, 35.205 is 53 % from the lower end of the box to the upper end.
The upper whisker is 1.05 times the length of the lower whisker.

The boxplot does not indicate any outliers, but as discussed in the workbook, a turnout of 0% is not meaningful in this context. These cases could represent missing data, errors, or exceptional circumstances that do not align with our analytical goals. To ensure more reliable results, we will remove cases where turnout is below 10%.

norrisfac <- norrisfac %>%
  mutate(Vanpart1 = case_when(Van_Part >= 10 ~ Van_Part))

Let’s check that this has worked:

box_Vanpart1 = boxplot(norrisfac$Vanpart1, main = "Boxplot of Vanpart1", ylab = "Vanpart1")

VI(box_Vanpart1)
This graph has a boxplot printed vertically
With the title: Boxplot of Vanpart1
"" appears on the x-axis.
"" appears on the y-axis.
Tick marks for the y-axis are at: 10, 20, 30, 40, 50, 60, and 70 
This variable  has 156 values.
There are no outliers marked for this variable 
The whiskers extend to 10.63 and 70.16 from the ends of the box, 
which are at 28.9 and 49.095 
The median, 39.4 is 52 % from the lower end of the box to the upper end.
The upper whisker is 1.15 times the length of the lower whisker.
summary(norrisfac$Vanpart1)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
  10.63   28.95   39.40   39.01   48.86   70.16      39 

The summary and boxplot look much better now.

Creating a filtered democracy-only dataset

Start by examining the variable Cheibub2Type:

levels(norrisfac$Cheibub2Type)
[1] "Democracy"    "Dictatorship"

This looks fine: a factor variable that clearly distinguishes between democracies and dictatorships.

We can use this variable to filter our new dataset.

norrisdem <- filter(norrisfac, Cheibub2Type == "Democracy")

Exercise 1: Proportionality

Let’s now crack on with the main analysis. You will be using the variable Prop (short for proportionality).

It is widely accepted that lower proportionality is associated with lower voter turnout. In systems where the distribution of seats more closely reflects the distribution of votes, individuals have more incentive to participate. Why? When votes translate more fairly into representation, voters are more likely to feel that their ballot has an impact, making turnout worthwhile regardless of where they live.

To explore this relationship, we will begin by examining the variable Prop, an index of proportionality developed by Richard Rose.

Question 1a. Formulate your hypotheses by stating both the null and alternative hypotheses. Provide both a non-directional and a directional alternative hypothesis.

# Null Hypothesis (H0): There is no relationship between proportionality and electoral turnout.
# Non-Directional Alternative Hypothesis (Ha): There is an association between proportionality and electoral turnout.
# Directional Alternative Hypothesis (Ha2): As proportionality increases, voter turnout is likely to increase.

Question 1b. Check the variable using the usual commands and a boxplot.

class(norrisdem$Prop)
[1] "numeric"
attributes(norrisdem$Prop)
$label
[1] "Index of proportionality (Richard Rose Encyclopedia of elections, CQ Press)"

$format.spss
[1] "F10.2"

$display_width
[1] 10
summary(norrisdem$Prop)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
  62.00   83.70   88.20   88.08   95.00   99.50      29 
box_Prop = boxplot(norrisdem$Prop, main = "Boxplot of Prop", ylab = "Prop")

VI(box_Prop)
This graph has a boxplot printed vertically
With the title: Boxplot of Prop
"" appears on the x-axis.
"" appears on the y-axis.
Tick marks for the y-axis are at: 70, 80, 90, and 100 
This variable  has 85 values.
An outlier is marked at: 62 
The whiskers extend to 66.9 and 99.5 from the ends of the box, 
which are at 83.7 and 95 
The median, 88.2 is 40 % from the lower end of the box to the upper end.
The upper whisker is 0.27 times the length of the lower whisker.

Question 1c. Is there any recoding you need to do on the basis of this? If recoding is necessary, create a new variable named Prop1.

Note: If you do any recoding, always make sure you explain why you recoded and to check the recoding has worked.

# We see one outlier at 62. Given that the lower whisker ends at 66.9, let's drop any cases below 65.

norrisdem <- norrisdem %>%
  mutate(Prop1 = case_when(Prop >= 65 ~ Prop))  # values below 65 become NA

#Let's check it has worked:
summary(norrisdem$Prop1)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
  66.90   83.92   88.30   88.39   95.00   99.50      30 
box_Prop1 = boxplot(norrisdem$Prop1, main = "Boxplot of Prop1", ylab = "Prop1")

VI(box_Prop1)
This graph has a boxplot printed vertically
With the title: Boxplot of Prop1
"" appears on the x-axis.
"" appears on the y-axis.
Tick marks for the y-axis are at: 70, 75, 80, 85, 90, 95, and 100 
This variable  has 84 values.
An outlier is marked at: 66.9 
The whiskers extend to 68.2 and 99.5 from the ends of the box, 
which are at 83.85 and 95 
The median, 88.3 is 40 % from the lower end of the box to the upper end.
The upper whisker is 0.29 times the length of the lower whisker.
# Note that after recoding Prop to Prop1 by removing values below 65, the boxplot now identifies 66.9 as an outlier. This occurs because removing the lowest values has shifted the distribution, causing the interquartile range (IQR) and whisker boundaries to be recalculated. Now, the whiskers extend from 68.2 to 99.5, making 66.9 fall outside the lower bound.

#The key question is whether to recode again to remove 66.9. Since it falls below the lower whisker (68.2), it is statistically considered an outlier. Additionally, the IQR ranges from 83.85 to 95, and the upper whisker is only 0.29 times the length of the lower whisker, indicating that most of the data is concentrated above 80. This suggests that values below 70 are unusually low relative to the rest of the dataset and may be distorting the analysis.

#Given this, I have decided to recode anything below 70, ensuring that the distribution reflects the main concentration of data while minimizing the influence of extreme low values. This decision is based on both statistical reasoning (creating a more balanced distribution) and substantive reasoning (focusing on meaningful variation in proportionality scores).

norrisdem <- norrisdem %>%
  mutate(Prop1 = case_when(Prop >= 70 ~ Prop))  # this time, values below 70 become NA

summary(norrisdem$Prop1)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
  70.30   84.00   89.00   89.15   95.00   99.50      33 
box_Prop1 = boxplot(norrisdem$Prop1, main = "Boxplot of Prop1", ylab = "Prop1")

VI(box_Prop1)
This graph has a boxplot printed vertically
With the title: Boxplot of Prop1
"" appears on the x-axis.
"" appears on the y-axis.
Tick marks for the y-axis are at: 70, 75, 80, 85, 90, 95, and 100 
This variable  has 81 values.
There are no outliers marked for this variable 
The whiskers extend to 70.3 and 99.5 from the ends of the box, 
which are at 84 and 95 
The median, 89 is 45 % from the lower end of the box to the upper end.
The upper whisker is 0.33 times the length of the lower whisker.
# Hooray, no more outliers!

Once you are happy with the variable, we can turn our attention to a scatterplot to check the relationship and consider what type of correlation we want to utilise.

Question 1d. Generate a scatterplot to examine the relationship between proportionality and turnout. What insights can you gather from the VI description?

plot1d = ggplot(norrisdem, aes(x = Prop1, y = Vanpart1)) +
  geom_point() +         
  geom_smooth() +
  labs(title = "Relationship Between Proportionality and Election Turnout", x = "Proportionality", y = "Turnout")

VI(plot1d)
Warning: Removed 34 rows containing non-finite outside the scale range
(`stat_smooth()`).
This chart has title 'Relationship Between Proportionality and Election Turnout'.
It has x-axis 'Proportionality' with labels 70, 80, 90 and 100.
It has y-axis 'Turnout' with labels 20, 40 and 60.
It has 2 layers.
Layer 1 is a set of 80 big solid circle points of which about 100% can be seen.
Layer 2 is a 'lowess' smoothed curve with 95% confidence intervals covering 23% of the graph.
# The VI description indicates that 34 rows were removed due to non-finite values, i.e. cases with missing data on one or both variables. The scatterplot contains 80 visible points, and the lowess smoothed curve with its 95% confidence interval covers only 23% of the graph, implying that the trend may not be well defined across the full range. While the VI provides structural details, it does not convey the distribution of points or the shape of the trend, so the numerical correlation coefficients should be used to assess the strength and direction of the relationship.

Question 1e. Now compute both Pearson’s and Spearman’s correlations. Which correlation is more appropriate for interpretation in this case?

cor.test(norrisdem$Vanpart1, norrisdem$Prop1, method = "pearson") 

    Pearson's product-moment correlation

data:  norrisdem$Vanpart1 and norrisdem$Prop1
t = 3.7435, df = 78, p-value = 0.0003453
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.1865386 0.5618044
sample estimates:
      cor 
0.3902604 
cor.test(norrisdem$Vanpart1, norrisdem$Prop1, method = "spearman")
Warning in cor.test.default(norrisdem$Vanpart1, norrisdem$Prop1, method =
"spearman"): Cannot compute exact p-value with ties

    Spearman's rank correlation rho

data:  norrisdem$Vanpart1 and norrisdem$Prop1
S = 55143, p-value = 0.001289
alternative hypothesis: true rho is not equal to 0
sample estimates:
      rho 
0.3536957 
# Pearson's r is 0.39 and Spearman's rho is 0.35. Since the difference between the two is only 0.04, which is less than 0.1, we choose to interpret Pearson's r.

Question 1f. What does the analysis show you? After selecting the appropriate correlation test, write a few sentences interpreting the correlation coefficient’s strength and direction, as well as the p-value. Do you reject or fail to reject the null?

# We chose to analyse the Pearson's r. The analysis shows a moderate positive correlation between proportionality and turnout, with a Pearson’s r of 0.39. This indicates that as proportionality increases, turnout tends to increase as well, though the relationship is not particularly strong. The p-value is 0.00035, which is well below the standard significance threshold of 0.05, meaning the result is statistically significant. Based on this, we reject the null hypothesis, confirming that there is a significant association between proportionality and turnout. However, the correlation is moderate, suggesting that while proportionality plays a role, other factors also influence voter turnout.

Exercise 2: Free and Fair Elections

We will examine how the fairness and freedom of elections influence voter turnout.

Why does this matter? From a rational perspective, if an election is not free and fair, the likelihood of casting a decisive vote is extremely low. In a system where elections are rigged or manipulated, voters may feel that their participation has little to no impact, reducing their incentive to turn out.

In this instance, let’s go back to the ‘norrisfac’ dataset, which includes both democracies and dictatorships. Since we are examining the impact of free and fair elections on voter turnout, restricting the analysis to only democracies (as in norrisdem) would limit variation in electoral fairness. Including both regime types allows us to capture the full range of election quality, making the analysis more meaningful.

Question 2a. Let’s start again by formulating your hypotheses: the null and alternative (both non-directional and directional).

# Null Hypothesis (H0): There is no relationship between levels of free and fair elections and electoral turnout.
# Non-Directional Alternative Hypothesis (Ha): There is an association between levels of free and fair elections and electoral turnout.
# Directional Alternative Hypothesis (Ha2): As the level of electoral freedom and fairness increases, voter turnout is likely to rise.

Question 2b. Check the independent variable (freeandfairscore) using the usual commands and a boxplot.

class(norrisfac$freeandfairscore)
[1] "numeric"
attributes(norrisfac$freeandfairscore)
$label
[1] "Free and Fair Score sum (Bishop Hoeffler)"

$format.spss
[1] "F10.2"

$display_width
[1] 10
summary(norrisfac$freeandfairscore)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
  0.000   3.000   6.000   5.419   8.000  10.000      32 
box_freefair = boxplot(norrisfac$freeandfairscore, main = "Boxplot of freeandfairscore", ylab = "freeandfairscore")

VI(box_freefair)
This graph has a boxplot printed vertically
With the title: Boxplot of freeandfairscore
"" appears on the x-axis.
"" appears on the y-axis.
Tick marks for the y-axis are at: 0, 2, 4, 6, 8, and 10 
This variable  has 163 values.
There are no outliers marked for this variable 
The whiskers extend to 0 and 10 from the ends of the box, 
which are at 3 and 8 
The median, 6 is 60 % from the lower end of the box to the upper end.
The upper whisker is 0.67 times the length of the lower whisker.

Question 2c. Is there any recoding you need to do on the basis of this?

#Based on the boxplot of the `freeandfairscore` variable, there does not appear to be a strong need for recoding. The variable ranges from 0 to 10, with no outliers detected, indicating that all values fall within an expected range. The median is 6, and the interquartile range (IQR) spans from 3 to 8, suggesting a fairly balanced distribution. However, one point worth noting is the asymmetry in the whisker lengths: the lower whisker is longer than the upper whisker, implying a potential left skew, i.e. the data seems to be concentrated at the higher end of the range (closer to 10).
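
One quick numeric check of the suspected skew (an optional sketch, not part of the original output) is to compare the mean and the median, since a mean below the median is consistent with a left skew:

mean(norrisfac$freeandfairscore, na.rm = TRUE)    # 5.42 in the summary above
median(norrisfac$freeandfairscore, na.rm = TRUE)  # 6.00

Here the mean (5.42) sits below the median (6), supporting the left-skew reading.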

Question 2d. Generate a scatterplot and explain what insights you gather from the VI description.

plot2d = ggplot(norrisfac, aes(x = freeandfairscore, y = Vanpart1)) +
  geom_point() +         
  geom_smooth() +
  labs(title = "Relationship Between Levels of Electoral Freedom and Fairness and Election Turnout", x = "Levels of Electoral Freedom and Fairness", y = "Turnout")

VI(plot2d)
Warning: Removed 47 rows containing non-finite outside the scale range
(`stat_smooth()`).
This chart has title 'Relationship Between Levels of Electoral Freedom and Fairness and Election Turnout'.
It has x-axis 'Levels of Electoral Freedom and Fairness' with labels 0.0, 2.5, 5.0, 7.5 and 10.0.
It has y-axis 'Turnout' with labels 20, 40 and 60.
It has 2 layers.
Layer 1 is a set of 148 big solid circle points of which about 92% can be seen.
Layer 2 is a 'lowess' smoothed curve with 95% confidence intervals covering 13% of the graph.
# The VI description indicates that 47 rows were removed due to non-finite values, i.e. cases with missing data. The scatterplot contains 148 visible points, and the lowess smoothed curve with its 95% confidence interval covers only 13% of the graph, suggesting that either the relationship is weak or the data are unevenly distributed, with large portions not contributing much to the smoothing. [Note that the low coverage could also reflect the fact that freeandfairscore is measured on a 0 to 10 scale, so many data points are stacked on the same x-values, leading to overplotting.]

Question 2e. Compute both correlation tests and decide which is more appropriate for interpretation in this case.

cor.test(norrisfac$Vanpart1, norrisfac$freeandfairscore, method = "pearson") 

    Pearson's product-moment correlation

data:  norrisfac$Vanpart1 and norrisfac$freeandfairscore
t = 6.3305, df = 146, p-value = 2.837e-09
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.3272366 0.5818550
sample estimates:
      cor 
0.4640783 
cor.test(norrisfac$Vanpart1, norrisfac$freeandfairscore, method = "spearman")
Warning in cor.test.default(norrisfac$Vanpart1, norrisfac$freeandfairscore, :
Cannot compute exact p-value with ties

    Spearman's rank correlation rho

data:  norrisfac$Vanpart1 and norrisfac$freeandfairscore
S = 297362, p-value = 9.9e-09
alternative hypothesis: true rho is not equal to 0
sample estimates:
      rho 
0.4496083 
# Pearson's r is 0.46 and Spearman's rho is 0.45. Since the difference between the two is only 0.01, which is less than 0.1, we choose to interpret Pearson's r.

Question 2f. What does your chosen correlation test tell you? Write a few sentences.

#The Pearson’s correlation test shows a statistically significant positive correlation between Vanpart1 (Voter Participation) and Free and Fair Elections Score (r = 0.464). The p-value (2.837e-09) is extremely small, indicating that the relationship is highly unlikely to have occurred by chance. This suggests that as elections become more free and fair, voter participation tends to increase, though the correlation is not strong enough to imply a direct causal relationship.

Although freeandfairscore is a numeric variable, it ranges only from 0 to 10, so it behaves more like an ordinal scale than a truly continuous one. This limited range could also contribute to the low share of the graph covered by the lowess smoothed curve, as there may not be enough variation for a strong trend to emerge. Given this, it can make sense to treat freeandfairscore as a categorical variable rather than a numeric one.

To better analyze differences in voter turnout, we will recode freeandfairscore into two categories:

  • “Not free and fair” for values 5.5 and below

  • “Free and fair” for values above 5.5

We can do this using the cut() function, which allows us to split a numeric variable into defined categories. The code below performs this recoding:

norrisfac <- norrisfac %>%
  mutate(freefair = cut(freeandfairscore, 
                        breaks = c(0, 5.5, 10), 
                        labels = c("Not free and fair", "Free and fair"),
                        include.lowest = TRUE))  # keep cases scoring exactly 0

This code is very similar to the recoding methods used in Lab 4, where we transformed categorical variables. Here, we once again use the mutate() function, but instead of recode(), we apply cut() to categorize the variable.

In this case, mutate() creates a new categorical variable called freefair, which recodes the original freeandfairscore variable into two groups: “Not free and fair” for values 5.5 and below, and “Free and fair” for values above 5.5. The cut() function handles this transformation: the breaks argument (c(0, 5.5, 10)) defines the cutoff points, and the labels argument (c("Not free and fair", "Free and fair")) assigns meaningful category names. Note that cut() builds intervals that are open on the left by default, so include.lowest = TRUE is needed to keep cases scoring exactly 0 (otherwise they would be set to NA).

Finally, we use the levels() function to verify that the new variable has been correctly created and categorized.

levels(norrisfac$freefair)
[1] "Not free and fair" "Free and fair"    

Question 2g. Now that we have a binary independent variable (freefair) and a numeric dependent variable (Vanpart1), identify the appropriate statistical test for analyzing this relationship. Run the test, report the results, and evaluate the null hypothesis, which states that there is no association between free and fair elections and voter participation.

# The appropriate statistical test for analysing a binary independent variable and a numeric dependent variable is a t-test. We can conduct this test using the jmv package (the R package behind jamovi):

ttestIS(norrisfac, "Vanpart1", group = "freefair",
        students = FALSE, welchs = TRUE)

 INDEPENDENT SAMPLES T-TEST

 Independent Samples T-Test                                      
 ─────────────────────────────────────────────────────────────── 
                            Statistic    df          p           
 ─────────────────────────────────────────────────────────────── 
   Vanpart1    Welch's t    -4.658244    138.4659    0.0000074   
 ─────────────────────────────────────────────────────────────── 
   Note. Hₐ: μ(Not free and fair) ≠ μ(Free and fair)
# Alternatively, we can perform the t-test using the Base R function `t.test`:

t.test(norrisfac$Vanpart1[norrisfac$freefair == "Not free and fair"],
       norrisfac$Vanpart1[norrisfac$freefair == "Free and fair"])

    Welch Two Sample t-test

data:  norrisfac$Vanpart1[norrisfac$freefair == "Not free and fair"] and norrisfac$Vanpart1[norrisfac$freefair == "Free and fair"]
t = -4.6582, df = 138.47, p-value = 7.403e-06
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -13.733904  -5.549039
sample estimates:
mean of x mean of y 
 33.73635  43.37782 
# Both tests indicate that the p-value is very small (0.0000074), meaning we can reject the null hypothesis that there is no relationship between free and fair elections and turnout. The `t.test` output provides additional insight, showing that the mean turnout for elections classified as "Not free and fair" is 33.7%, while the mean for "Free and fair" elections is 43.4%. Turnout is thus nearly 10 percentage points higher in free and fair elections, highlighting a significant relationship between electoral integrity and voter participation.

Finally, let’s examine whether the correlation between free and fair elections and turnout remains when we control for regime type. To do this, use the original numeric independent variable freeandfairscore and analyze its correlation with Vanpart1, while introducing Cheibub2Type as a control variable.

Question 2h. First, examine the control variable (Cheibub2Type) to determine whether it needs to be recoded before proceeding with the analysis.

class(norrisfac$Cheibub2Type)
[1] "factor"
attributes(norrisfac$Cheibub2Type)
$levels
[1] "Democracy"    "Dictatorship"

$class
[1] "factor"
levels(norrisfac$Cheibub2Type)
[1] "Democracy"    "Dictatorship"
#We can see that Cheibub2Type is a factor variable with only 2 levels, "Democracy" and "Dictatorship". Since both categories are relevant to our research question, no recoding is necessary.

Question 2i. Generate a scatterplot that includes the control variable Cheibub2Type as a facet_wrap and explain what insights you gather from the VI description. In this case, please also add drop_na to get rid of missing values.

plot2i = norrisfac %>%
  drop_na(freeandfairscore, Vanpart1, Cheibub2Type) %>%
  ggplot(aes(x = freeandfairscore, y = Vanpart1)) +
  geom_point() +         
  geom_smooth() +
  labs(title = "Relationship Between Levels of Electoral Freedom and Fairness and Election Turnout, Controlling for Regime Type", 
       x = "Levels of Electoral Freedom and Fairness", 
       y = "Turnout") +
  facet_wrap(~Cheibub2Type)
  
VI(plot2i)
This chart has title 'Relationship Between Levels of Electoral Freedom and Fairness and Election Turnout, Controlling for Regime Type'.
The chart is comprised of 2 panels containing sub-charts, arranged horizontally.
The panels represent different values of Cheibub2Type.
Each sub-chart has x-axis 'Levels of Electoral Freedom and Fairness' with labels 0.0, 2.5, 5.0, 7.5 and 10.0.
Each sub-chart has y-axis 'Turnout' with labels 20, 40 and 60.
Each sub-chart has 2 layers.
Panel 1 represents data for Cheibub2Type = Democracy COORD = 1.
Layer 1 of panel 1 is a set of 103 big solid circle points of which about 92% can be seen.
Layer 2 of panel 1 is a 'lowess' smoothed curve with 95% confidence intervals covering 20% of the graph.
Panel 2 represents data for Cheibub2Type = Dictatorship COORD = 1.
Layer 1 of panel 2 is a set of 45 big solid circle points of which about 100% can be seen.
Layer 2 of panel 2 is a 'lowess' smoothed curve with 95% confidence intervals covering 20% of the graph.
# The VI description indicates that the graph consists of two sub-charts, representing the two values of Cheibub2Type: Democracy and Dictatorship. The Democracy panel contains 103 data points, with the lowess smoothed curve and its 95% confidence interval covering 20% of the graph. The Dictatorship panel has only 45 observations, yet its lowess curve also covers 20% of the graph. This suggests that in both cases the relationship may be weak, or the data may be unevenly distributed, with large portions of the dataset contributing little to the smoothing.

Question 2j. Now compute the correlation test that you chose in Exercise 2e, but ensure you are controlling for regime type. What do the correlation tests tell you? Write a few sentences.

democracy = norrisfac %>% filter(Cheibub2Type == "Democracy")
dictatorship = norrisfac %>% filter(Cheibub2Type == "Dictatorship")

cor.test(democracy$Vanpart1, democracy$freeandfairscore, method = "pearson") 

    Pearson's product-moment correlation

data:  democracy$Vanpart1 and democracy$freeandfairscore
t = 5.7456, df = 101, p-value = 9.737e-08
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.3349682 0.6293909
sample estimates:
      cor 
0.4963192 
cor.test(dictatorship$Vanpart1, dictatorship$freeandfairscore, method = "pearson")

    Pearson's product-moment correlation

data:  dictatorship$Vanpart1 and dictatorship$freeandfairscore
t = -0.60113, df = 43, p-value = 0.5509
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 -0.3747798  0.2078140
sample estimates:
        cor 
-0.09128857 
# In democracies, Pearson’s r is 0.5, indicating a moderately strong positive correlation between free and fair elections and voter turnout. The p-value is 9.737e-08, which is well below 0.05, confirming that the relationship is statistically significant. Therefore, we reject the null hypothesis that there is no relationship between free and fair elections and turnout in democracies.

# In dictatorships, Pearson’s r is -0.09, suggesting a very weak, practically non-existent negative correlation. Furthermore, the p-value is 0.5509, meaning the relationship is not statistically significant.

# Overall, we partially reject the null hypothesis: there is a significant relationship between free and fair elections and turnout in democracies, but no meaningful or significant relationship in dictatorships.
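
As an optional aside (a sketch, not part of the workbook), the same grouped coefficients can be computed in a single dplyr pipeline rather than by creating two separate data frames; note that this reports only the coefficients, not the significance tests:

norrisfac %>%
  drop_na(Vanpart1, freeandfairscore, Cheibub2Type) %>%
  group_by(Cheibub2Type) %>%
  summarise(pearson_r = cor(Vanpart1, freeandfairscore))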

A Note of Thanks!

Adrian and I want to take a moment to thank you for your hard work and dedication in these labs. We have been very impressed with your approach to this module and the effort you have put into mastering the material, especially under challenging circumstances.

As you continue working on your assignments, please don’t hesitate to reach out if you have any questions. We’ve covered a lot of ground in this module, and you should be proud of your progress. Well done!

Appendix: How Pearson’s r and Spearman’s rho are Calculated

This appendix provides a basic explanation of how Pearson’s correlation coefficient (r) and Spearman’s rank correlation coefficient (rho, ρ) are calculated.

Pearson’s r: Measuring Linear Relationships

As discussed, Pearson’s r is a measure of how strongly two numerical (continuous) variables are related in a linear way. It tells us the direction (positive or negative) and strength (how closely the variables move together) of the relationship.

How is Pearson’s r calculated?

Pearson’s r is based on the following formula:

r = ∑[(xi - x̄)(yi - ȳ)] ÷ √[∑(xi - x̄)² × ∑(yi - ȳ)²]

The formula combines two pieces of information:

  1. How the two variables move together (the co-variance, captured by the numerator of the formula, before the division).

  2. How spread out each variable is (the variance, captured by the denominator, after the division).

To calculate Pearson’s r:

  1. Find the mean (average) of each variable.

  2. For each value, subtract the mean from that value.

  3. Multiply these differences together for both variables.

  4. Sum up these values. This tells us how much the two variables move together.

  5. Divide by the combined spread of the two variables: the square root of the product of each variable’s sum of squared differences from its mean.
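
To make these steps concrete, here is a minimal sketch in R using two small made-up vectors (any numbers would do); the hand calculation should match R’s built-in cor() function:

x <- c(2, 4, 6, 8, 10)
y <- c(1, 3, 5, 9, 12)

dx <- x - mean(x)                           # steps 1-2: deviations from each mean
dy <- y - mean(y)
numerator <- sum(dx * dy)                   # steps 3-4: how x and y move together
denominator <- sqrt(sum(dx^2) * sum(dy^2))  # step 5: combined spread
numerator / denominator                     # hand-calculated r
cor(x, y)                                   # matches R's built-in Pearson r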

As discussed, the final value of r always falls between -1 and 1, where +1 means a perfect positive correlation (as one variable increases, so does the other); -1 means a perfect negative correlation (as one variable increases, the other decreases); and 0 means no correlation (the two variables are unrelated).

Pearson’s r is most useful when the relationship between the two variables is linear, meaning they follow a straight-line pattern when plotted on a graph, as this ensures a more accurate measure of correlation.

Spearman’s rho: Measuring Monotonic Relationships

Spearman’s rho is similar to Pearson’s r but is used when the relationship between variables is not linear but still follows a consistent pattern (monotonic), or when at least one of the variables is ordinal (ranked categories).

How is Spearman’s rho calculated?

This test uses the following formula:

ρ = 1 − [6 × ∑D²] ÷ [n(n² − 1)]

Instead of using raw numerical values, Spearman’s rho works by ranking the data:

  1. Rank each value in both variables from lowest to highest.

  2. Compare the ranks rather than the actual values.

  3. Calculate the difference between the ranks for each pair of values.

  4. Square these differences and sum them up (the ∑D² in the formula).

  5. Plug the sum into a formula that adjusts for the number of values in the dataset.
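
Again, a minimal sketch in R with made-up data (chosen with no tied ranks, so the formula applies exactly); the hand calculation should match cor() with method = "spearman":

x <- c(10, 20, 30, 40, 50)
y <- c(12, 25, 22, 48, 60)

rx <- rank(x)                    # steps 1-2: replace values with ranks
ry <- rank(y)
d2 <- sum((rx - ry)^2)           # steps 3-4: squared rank differences, summed
n <- length(x)
1 - (6 * d2) / (n * (n^2 - 1))   # step 5: plug into the formula for rho
cor(x, y, method = "spearman")   # matches R's built-in Spearman rho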

Like Pearson’s r, Spearman’s rho ranges from -1 to 1, and the interpretation is the same: +1 means a perfect increasing pattern, -1 means a perfect decreasing pattern and 0 means no correlation.

Spearman’s rho is useful when the data does not follow a straight-line relationship but still moves in one direction (i.e. is monotonic), or the variable is ordinal (for example, ranking satisfaction from 1 to 5), as this ensures a more robust measure of correlation.
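
To see the difference between the two measures in practice, consider one last made-up illustration (unrelated to the lab data): the relationship below is perfectly monotonic but far from linear, so Spearman’s rho is exactly 1 while Pearson’s r is noticeably lower.

x <- 1:10
y <- exp(x)                      # always increases with x, but not in a straight line
cor(x, y)                        # Pearson's r: positive but well below 1
cor(x, y, method = "spearman")   # Spearman's rho: exactly 1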