A Screen Reader Friendly Introduction to R and Descriptive Statistics

An NCRM Case Study in Pedagogy

Author

Adrian Millican

1. Introduction

During this course we are going to explore statistics and think about how and why we might want to apply them within the field of political science. As part of this course, we will learn a new programming language called R, and we will learn how to write code and use R to create statistical models and tables through a graphical user interface called R Studio (N.b. the company did start calling it Posit but appears to have reverted to R Studio).

Learning R and R Studio with a visual impairment may seem like a daunting challenge; however, this course builds upon a wealth of knowledge created by other scholars who have themselves learnt how to use this programming language while visually impaired. We aim to provide resources that explain not only how to use the R language and the software associated with it, but also equip you to apply these skills to answer political questions.

In this first session, we are going to explore some of the basics of the R programming language and look at some of the packages that exist that help make it easier to use.

This course was originally written for use in an introduction to Political Science research course, but the types of tests run and data exploration tools covered are just as applicable to any other social science, and even to introductory statistics in STEM subjects. So while the data used to demonstrate commands and statistical tools here are political in nature, the lessons are highly transferable to a wide range of subjects and alternative data sources.

The commands in this document focus mostly upon those for Mac users, but a separate cheat sheet has full lists of commands for both Mac and Windows users.

1.1 Plan for Today

This sheet is going to go through some of the basic elements of R and RStudio including:

1. Basic R functions.

2. Factors and numeric variables.

3. Basic descriptive statistics.

4. Introduction to graphs and the ggplot package.

1.1.1 Packages Used

In this first session we will use a number of packages. Packages are collections of code written by users of R and made available publicly to make our lives easier. The packages we will use today include:

  1. BrailleR - this package contains a number of commands that make data outputs more friendly to screen readers.

  2. Tidyverse - this package contains a number of elements that help us recode variables and manage our data.

  3. DT - this package helps to make data tables that are screen-reader friendly

  4. Haven - this helps us load our data into R

  5. Jmv - this gives us the descriptives command we will use for descriptive statistics

1.1.2 Commands covered

  1. install.packages()

  2. library()

  3. as_factor()

  4. read_sav()

  5. class()

  6. summary()

1.1.3 Variables used

  • Region (Region of the country)

  • Lab17 (Labour vote share in 2017 General Election)

  • Winner17 (The party that won each constituency in the 2017 General Election)

  • leaveHanretty (estimate of percentage of leave voters within each constituency)

  • c11Degree (percentage of individuals within a constituency with a degree)

1.2 Installing R and R Studio

Before we begin working with the software, we need to install the programming language of R and then the user interface that we feed our commands into. It is important to download and install R before installing R Studio.

You can download the R language for Mac by clicking on this text

You can download R Studio for Mac by clicking on this link to Posit

1.3 Opening R Studio

The first thing we need to do is to open the R Studio programme. Use your Mac to open the software. Once we have it open, we need to open a document to store all of our commands in.

R Studio can be navigated using a screen reader and has built-in accessibility tools that I will explain as and when we come to them.

Firstly, we want to make sure that screen reader support is set up within R Studio. To do this you need to take the following steps:

  1. Click Control, Option and H together to move to the Help tab on the menu bar.

  2. Scroll down until you find the Accessibility option

  3. Click into this and make sure screen reader support is enabled

1.4 Using and Navigating R Studio with a Screen Reader

R Studio contains four main windows in a matrix style. Below is a summary of each element of this and a list of the keyboard shortcuts to allow you to navigate between them:

  • Top Left: This is known as the source window and is where we write our code. We can get our screen reader to move to the source window by clicking Control and 1

  • Top right: This is our environment and stores and lists all objects and graphs we create. We can navigate to the environment by clicking Control and 8

  • Bottom Left: This is the console where our code is run and output produced. We can get our screen reader to move to the console by clicking Control and 2

  • Bottom right: This is where our files, plots and packages are located, and most importantly our help files.

    • To access the Files tab click Control and 5

    • To access the Plots tab click Control and 6

    • If you want to access the help files select Control and 3.

  • If at any point you lose track of which window you are in, you can click control, option and 1 together and it will speak the location you are presently in.

  • A full list of keyboard shortcuts for R Studio can be found in the following link: Posit R Studio Keyboard Shortcuts

1.5 Installing Packages

Before we get onto the statistical analysis, we need to start by installing a few packages. As we mentioned, we are going to use packages that make the R language easier to use. We install these packages by moving to our console by clicking Control+2.

Once we are in the console, we need to use the install.packages command. The command works by typing install.packages("packagename") into the console. Make sure to replace packagename with the name of the package you are using. For us, that will be the following commands:

N.B. When inputting code with either brackets or quotation marks, R will automatically provide the closing bracket or quotation mark. You need to use the right arrow to move outside of it if you have further code to run.

install.packages("tidyverse")
install.packages("BrailleR")
install.packages("DT")
install.packages("haven")
install.packages("jmv")
  • Run the code chunk using shift+control+enter

Note: It is very important to run chunks as you go through your document. It makes it much easier to spot errors as your cursor will be located near the code chunk and if it doesn’t run it will open a dialog box below the code chunk which contains error information.

We only have to install packages once. The only time we would repeat this step is if we uninstall R and R Studio and then re-install it.

Once we have installed these packages, we are ready to open a document and get started.

1.5.1 A Few Important Things to Note

Below are a list of pieces of information that are important to remember to get R to work appropriately:

  1. R is case sensitive. When you enter commands they need to use the appropriate upper or lower case.

  2. You can annotate your code using a hashtag symbol at any point in it. This is particularly relevant when you are in R code chunks (explained below).
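For example, a short code chunk using comments might look like this (the object name my_age is invented for illustration):

```r
# This is a comment: R ignores everything after the hashtag
my_age <- 42  # a comment can also follow code on the same line

# Note that my_age and My_Age would be treated as different names,
# because R is case sensitive
```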

1.6 Opening a New Document

In order to get started, we need to open a document that we can write our code in and store our statistical output. There are many types of documents we can open to store our R code. We are going to use an option called a Quarto document as it is particularly friendly for Screen Readers.

In order to open a new Quarto document we need to access the File menu.

  • Start by clicking Control+Option+F keys together. Navigate Down until you hear “New File”. Once you hear this click the Right arrow to enter the submenu.

  • Use the Down arrow to navigate to an option “New Quarto Document” and press the Enter key.

  • There are three options for type of Quarto file. We want to select “Document”. There are dialogue windows to enter your name and also provide a title of the document. You can select the document to write as either HTML, PDF or Word. It is suggested that HTML or Word are best for screen readers.

Now we have our Quarto document open, we can start setting up our work.

1.6.1 Source and Visual windows

Quarto documents have two windows. One, called Visual, presents the document much like a Word document, with the ability to edit it using a series of dropdown menus. The other is the Source window, which allows you to do all of this but in a way that is friendlier to a screen reader.

Keyboard Shortcuts to Switch Modes

  1. Switch to Source Mode:

    • Shortcut: Ctrl + Option + 1
  2. Switch to Visual Mode:

    • Shortcut: Ctrl + Option + 2

1.6.2 Useful Commands in Source View

We can edit our documents and write notes and entire assignments within Quarto documents and print them as html or word documents. The main thing we need to understand is how and where we put code. We need to place our code in code chunks. We open one of these by clicking Option+Command+I.

Anything outside of the code chunks can be treated much like writing in a word document.

1.7 Running Packages

Before we conduct any analysis, we need to run our packages. While we have installed them, we now need to tell the software that we want to load them. We do this with the "library" command. To execute this code within a Quarto document we need to open a code chunk.

  • We do this by Option+Command+I

  • Write our commands within the code chunk

  • Run the code chunk using shift+control+enter

library(tidyverse)
library(BrailleR)
library(DT)
library(haven)
library(jmv)

Now we have got our packages loaded, we have one more step to complete before we can get started with loading data and running analyses. We need to set our working directory.

Remember, every time we want to write and run code we need to include it within a code chunk.

1.8 How to Find and Access the Working Directory on a Mac

In order to get R Studio to understand where our files are located and where it should save output, we need to find the filepath of our working directory.

When using Finder on a Mac, you can locate files and folders within your working directory and retrieve their full filepaths. Here's how to do it step by step:

Step 1: Access the Sidebar

1. Open Finder.

2. Locate the Sidebar on the left-hand side of the Finder window.

  • The Sidebar displays a list of parent directories, such as OneDrive, iCloud, Dropbox, Applications, and other commonly used locations.
  • If you are using a screen reader like VoiceOver, interact with the Sidebar:
  • Press VO+Shift+Down Arrow (VoiceOver command) to interact.

  • Use the Up Arrow or Down Arrow keys to navigate the list.

Step 2: Select the Parent Directory

1. Navigate through the Sidebar until you find the directory where your file is located (e.g., OneDrive).

2. Press Enter or Return to open the directory.

Step 3: Navigate to Files and Folders

1. Once the parent directory is selected, navigate to the File and Folder List on the right side of the Finder window.

  • With VoiceOver, press VO+Right Arrow to move focus to the next area (the file list).

  • Interact with the file list using VO+Shift+Down Arrow and navigate using the Arrow Keys to locate the file or folder you need.

Step 4: Get the Full Filepath

To retrieve the full filepath of your working directory:

1. Select the file or folder you want in Finder.

2. Press Command+Option+C to copy the full filepath to your clipboard.

  • This shortcut copies the absolute filepath of the selected item.

3. Paste the filepath into any text editor or document using Command+V to view it.

Notes for VoiceOver Users

- When interacting with Finder, ensure you’re aware of the distinction between areas (Sidebar, File List, Toolbar, etc.) and navigate accordingly using VO+Arrow Keys.

- The Sidebar is your guide to top-level directories, while the File List area lets you drill down into specific files and folders.

Quick Example

If you need to access a file in OneDrive:

1. Open Finder and interact with the Sidebar.

2. Navigate to OneDrive using the Arrow Keys.

3. Move focus to the File List area using VO+Right Arrow.

4. Interact with the File List and navigate to the desired file or folder.

5. Use Command+Option+C to copy the full filepath.

This method ensures you can easily locate your working directory and retrieve its full filepath in a straightforward and accessible manner.

1.9 Setting our Working Directory

Now that we have learnt how to find the location that we wish to save our files in, we need to tell R where this is located. In R, the working directory is the folder on your computer where R reads and saves files by default. Think of it as R’s “home base” for accessing and storing data files, scripts, and other resources. You can set or check your working directory using commands like setwd() or getwd().

We need to save our data files to the working directory we wish to use. I recommend that you create a folder to save all of your data files and output. The code below uses a placeholder path; simply change it to represent where your folder is located.

To set our working directory, we need to use the command "setwd", open brackets and quotation marks, and paste the full filepath we copied from above within them. It should look something like this.

We again need to do this within a code chunk so click Option+Command+I to open one and then write your code within it.

# Replace "path/to/your/directory" with the full path to your working folder
setwd("path/to/your/directory")

Run the code chunk using shift+control+enter
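If you want to double-check that the working directory has been set correctly, you can ask R to report it back. This is an optional check, not a required step:

```r
# Print the current working directory to the console
getwd()
```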

1.10 Loading Data

Now we have downloaded our data and set our working directory, we need to load our data files. In this lab we are going to use one dataset:

  1. The British Election Study Constituency Data. This file contains aggregate information about election results from constituencies and links it to Census 2011 data.

The data file is saved in the Learn Ultra SGIA1201 Research Politics and International Relations folder under Lab 1. Please download it and save it to the folder you have created to store all of your R scripts and outputs.

This dataset is written in a format called a .sav file. These files are created using the SPSS software package (similar to Excel).

To load SPSS data into R, we use the haven package (part of the Tidyverse family), which we installed earlier. We need to create something called an "object," which is just a way to store your data so R can work with it. Think of an object as a labeled box where your data goes. For example, to load the file BES17Constituency.sav into a data object called BES17con, you can use the following command:

(N.b. remember that our dataset needs to be placed within brackets and quotation marks):

Steps to understanding this code:

  1. We first write the data object name we want. You can call it anything but try to choose things that are representative of what they are. Here BES17con makes sense to me because it is British Election Study 17 constituency data.

  2. Next we have to tell R that we wish to store something in the object. This can be done using either an "equals" symbol or a "less than and a hyphen" (<-). These can be used interchangeably and serve the same purpose

  3. Next we write the command. In this case it is read_sav

  4. Finally, we open brackets and quotation marks (as a rule R will automatically provide closing brackets and quotation marks when you open them)

  5. We insert the name of the file within the brackets and quotation marks

BES17con = read_sav("BES17Constituency.sav")

Run the code chunk using shift+control+enter

Now we have got our data loaded we need to do one last thing. We need to ask the software to sort out which variables are which. Variables come in a variety of different types, however the important ones to use are as follows:

  • Numeric Variables: These are just raw numbers such as percentages or someone's age

  • Factor Variables: These are what R calls categorical variables. These are things that don't have a numeric value, such as gender, vote choice or religion.

  • Character Variables: These are also categorical variables in R. They have slightly different properties and we'll discuss them when relevant. However, the main two types are numeric and factor.
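As a minimal sketch of the difference between these types (the example vectors here are invented for illustration):

```r
# A numeric variable: raw numbers such as ages or percentages
ages <- c(25, 42, 31)

# A factor variable: categories with a fixed set of levels
votes <- factor(c("Labour", "Conservative", "Labour"))

# A character variable: plain text without defined levels
places <- c("North East", "Wales", "Scotland")
```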

When R loads data using the read_sav command it reads all data in as numeric variables. However, there are factor variables within it. In order to get R to sort these we need to use the as_factor command to create a new data object that stores our new data with numeric and factor variables correctly identified.

BESconfac = as_factor(BES17con)

This data object contains over 1000 variables. This is quite hard to manage when you are able to visualise the information but particularly tricky when navigating using a screen reader. In this instance we are going to create a new data object where we keep only certain variables that we are interested in today.

This line of R code modifies the BESconfac dataset by subsetting it to include only specific columns (variables). Here’s a breakdown of what the code does:

1. Original Dataset

  • BESconfac is the name of the dataset you’re working with. It contains multiple rows (observations) and columns (variables).

2. Subsetting with [, ]

  • The [,] syntax in R is used to subset a dataset.

    • The rows go before the comma.

    • The columns go after the comma.

    • Since nothing is specified before the comma, all rows are kept.

    • After the comma, a vector of column names (c(...)) specifies which columns to keep.

3. Selected Columns

  • The code retains only these columns from BESconfac:

    1. "Region": Likely a geographical region variable.

    2. "Lab17": Potentially data on Labour Party results in 2017.

    3. "Winner17": Information about the winning party or candidate in 2017.

    4. "Turnout17": Voter turnout in 2017.

    5. "leaveHanretty": Data on Brexit leave vote percentages, possibly from Hanretty’s estimates.

    6. "c11Degree": Census 2011 data on education (e.g., degree qualifications).

    7. "Lab15": Data on Labour Party results in 2015.

    8. "c11HouseOwned": Census 2011 data on house ownership.

    9. "c11Employed": Census 2011 data on employment status.

4. Assigning Back to BESconfac

  • By assigning the subsetted data back to BESconfac, the original dataset is replaced with the reduced version that contains only the specified columns. This helps focus the analysis on relevant variables while discarding unnecessary ones.

Below we have an example of the full code that we can run.

BESconfac = BESconfac[, c("Region", "Lab17", "Winner17", "Turnout17", "leaveHanretty", "c11Degree", "Lab15", "c11HouseOwned", "c11Employed")]

Run the code chunk using shift+control+enter

Success! We are finally ready to start exploring our data.

1.10.1 N.B. A Note on This Code

In this code, we have BESconfac = BESconfac. What this does is essentially say "create a new object that is a modified version of the original object", which overwrites the object we initially had. I do this for simplicity, as we don't need the old object again. However, should you wish to add more variables to this list, you simply need to add them and rerun the script. If you wish, you could also give the object a new name rather than overwriting. There is no right answer to this; it is just however you find it easier to work.

1.10.2 Why Only Columns?

In this case, we only need to select certain columns as we are looking to keep only certain variables, which appear on the column axis. We do not have to specify rows as we want to keep all rows within the data object. If we wished to keep only certain cases (in this case constituencies), we could specify a similar command that would instruct R to keep only certain rows of interest. We'll look at a variety of different ways to do this in the next few labs. If we don't specifically tell it to delete rows, it will automatically keep all rows within the data object.
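As an illustrative sketch of what keeping only certain rows might look like (this example is not part of today's lab), we could keep only the London constituencies while retaining every column:

```r
# Keep only the rows (constituencies) where Region is "London",
# leaving the columns untouched
londonOnly <- BESconfac[BESconfac$Region == "London", ]
```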

2: Understanding our Data Object

To start with, it can be very useful to get an overview of our data object. There are a few useful commands that help us to do this.

First, we can use the “datatable” command from the DT package to produce a screen readable table of all of the variables within our data object (N.b. this is most useful when we have reduced our object to fewer variables as opposed to a large data set).

We complete this by using the command datatable followed by the name of our object which we place in brackets.

datatable(BESconfac)

This command works much like the View() command that exists within base R. It is worth seeing how you get on with understanding the information in this format with a screen reader.

We can also use the VI command from the BrailleR package to view our data object. This provides a plain text summary of the variables in our data object with variable name and some statistics associated with it.

VI(BESconfac)

The summary of each variable is
Region: North East 29   North West 75   Yorkshire and The Humber 54   East Midlands 46   West Midlands 59   East of England 58   London 73   South East 84   South West 55   Wales 40   Scotland 59  
Lab17: Min. 8.13479623824451   1st Qu. 28.0037178497325   Median 39.4567328186425   Mean 41.8891165928358   3rd Qu. 56.4712917880729   Max. 85.7288432826978   NA's 1  
Winner17: Conservative 317   Labour 262   Liberal Democrat 12   Scottish National Party 35   Plaid Cymru 4   UKIP 0   Green 1   Speaker 1  
Turnout17: Min. 53.0193055346982   1st Qu. 65.4188663712498   Median 69.1631888391556   Mean 68.7499481104751   3rd Qu. 72.3894471199125   Max. 79.520644898155  
leaveHanretty: Min. 20.48078926   1st Qu. 45.3344030175   Median 53.68607141   Mean 52.058257622943   3rd Qu. 60.15470329   Max. 75.64987141  
c11Degree: Min. 5.09875721260542   1st Qu. 10.7859668844704   Median 14.6940379137609   Mean 16.7109261204563   3rd Qu. 19.5937362583671   Max. 51.0983234659437   NA's 59  
Lab15: Min. 4.50576128704979   1st Qu. 17.6996063515449   Median 31.2801682526548   Mean 32.3499306385183   3rd Qu. 44.3888615565076   Max. 81.3009400307268   NA's 1  
c11HouseOwned: Min. 20.484017184975   1st Qu. 59.2061386487193   Median 66.9110593501683   Mean 64.1110976960409   3rd Qu. 71.8440343200194   Max. 85.8417573296179  
c11Employed: Min. 42.0457765324434   1st Qu. 58.8809928021743   Median 62.2125111633333   Mean 61.7822072282222   3rd Qu. 65.826600914912   Max. 74.5566087778777  

A Note on Running and Interpreting Code: Rendering an HTML Document

So far in this lab we have used Shift + Command + Enter to run individual code chunks to make sure our packages load, our working directory is set and that our data is loading effectively.

However, we're now moving to a point where we might want to actually start interpreting our data and understanding something about our variables. The easiest way to engage with this via a screen reader is to convert our document into an HTML file. This is how this lab is produced: the Quarto file in essence works as a Word document with the built-in option to run our code and write our analysis all in one place.

To get our code and notes into an HTML document, we now need to "render" our document. This is the process of running all of the commands and code we have produced in order to produce one file with all of the output together.

We “render” our document by clicking Shift+Command+K. This will execute all code chunks in order and if it renders successfully will open up a new window with the html document we have created. This file will save to our working directory.

We can render a document as many times as we want, so you can render each time you produce a new piece of analysis and it will add it to the html file which will update.

Once you have viewed your analysis, all you need to do is use your voice commands to navigate back into R Studio. Rendering doesn't close R Studio; it simply opens the HTML file above it.

Using Sink to Output Code Chunks

So far we have used keyboard commands to execute code chunks. This is great as it allows us to run through elements of code as we go. However, at present screen reader support is not easy to use within the output of code chunks. As we will see later, typically the easiest way to work with R is rendering the whole document. However, waiting until the end of the document to check our code works leads to challenges, particularly if there are errors in our code. An easy way to resolve this is to create a new code chunk:

  1. We do this by Option+Command+I
  2. Navigate into the code chunk. The top row of the chunk will have the text {r}.
    1. To the right hand side of the r, place a comma and write the text include = FALSE

    2. This should look like {r, include = FALSE} (you can optionally give the chunk a label first, e.g. {r setup, include = FALSE})

  3. Use the down arrow to go to a fresh line of code. In this line of code write: sink("output.txt", append = TRUE)
    1. This asks R to create an output file in your working directory that will update each time you run a code chunk. It will provide the output that the code chunk creates.
  4. Finally, to set this up, Run the code chunk using shift+control+enter

This text file will not automatically open, but if you go to your working directory you will find a new file called output.txt.

Reopening this file after running a code chunk will update the file allowing you to read the results or understand any errors that have occurred.

# Redirect output to a text file
sink("output.txt", append = TRUE)

Closing Sink

There is one consideration when using this. It works perfectly to run individual chunks to allow you to get output into a text format that works with screen readers. However, while sink is open it will not allow you to render the document.

In order to render the document, you have to tell R to stop using sink. The easiest way to do this is to go back to the code chunk where you set up the sink and change the chunk options as follows (setting eval=FALSE stops the chunk from running when the document is rendered):

{r setup, include=TRUE, eval=FALSE}

Once you have done this, you can go ahead and render the whole document into an HTML file.
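Alternatively, calling sink() with no arguments closes the currently open diversion directly, so you could instead run a chunk like this before rendering:

```r
# Stop redirecting output to output.txt; results return to the console
sink()
```

Note that sink() will produce an error if no diversion is currently open, so only run it while the sink is active.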

2.1: Understanding Our Variables

Often, we don't need to see information about a whole dataset; rather, we want to see information about specific variables of interest within our dataset. We can use similar tools to help us understand basic information.

We need to understand one key thing about the variable before we know what to do next:

  1. Is it numeric or categorical?

On the basis of this information, we can choose appropriate descriptive statistics to understand our variables.

One very helpful command to get to know our data is the class command. This gives us information about what type of variable we have.

To call a specific variable within a data object we use a dollar sign. So, for example, to use the class command to learn about our Region variable we would use class(BESconfac$Region).

class(BESconfac$Region)
[1] "factor"

We can see from the output that this variable is designated as a factor meaning it is categorical. We might also be interested in knowing what categories exist within this variable.

2.2: Understanding and Descriptively Analysing a Categorical Variable

Now we know it is a factor variable, we have an understanding of what commands we can use to explore it. In this instance, a first question might be what categories exist within the variable. To check this we can use the "levels" command. It follows the same structure as above, simply changing class to levels.

levels(BESconfac$Region)
 [1] "North East"               "North West"              
 [3] "Yorkshire and The Humber" "East Midlands"           
 [5] "West Midlands"            "East of England"         
 [7] "London"                   "South East"              
 [9] "South West"               "Wales"                   
[11] "Scotland"                

N.b. the levels command only works with variables that are factors.

We can see that there are a number of regions that the UK is divided up into. If we want to know how many constituencies exist in each region we can use the “summary” command.

summary(BESconfac$Region)
              North East               North West Yorkshire and The Humber 
                      29                       75                       54 
           East Midlands            West Midlands          East of England 
                      46                       59                       58 
                  London               South East               South West 
                      73                       84                       55 
                   Wales                 Scotland 
                      40                       59 

A useful command that includes information from the levels and class command is the “attributes” command which also includes information on the variable label. You can try this again through using the following code:

attributes(BESconfac$Region)
$levels
 [1] "North East"               "North West"              
 [3] "Yorkshire and The Humber" "East Midlands"           
 [5] "West Midlands"            "East of England"         
 [7] "London"                   "South East"              
 [9] "South West"               "Wales"                   
[11] "Scotland"                

$class
[1] "factor"

$label
[1] "Region"

So far we have seen an example of a categorical variable. However, numeric variables have different descriptive statistics.


Question 1A. Using the code above, please try to recreate the levels, summary and attributes commands using the variable Winner17. Winner17 is a categorical variable that tells us who won each constituency, with the output being the total number of constituencies won by each party.

levels(BESconfac$Winner17)
[1] "Conservative"            "Labour"                 
[3] "Liberal Democrat"        "Scottish National Party"
[5] "Plaid Cymru"             "UKIP"                   
[7] "Green"                   "Speaker"                
summary(BESconfac$Winner17)
           Conservative                  Labour        Liberal Democrat 
                    317                     262                      12 
Scottish National Party             Plaid Cymru                    UKIP 
                     35                       4                       0 
                  Green                 Speaker 
                      1                       1 
attributes(BESconfac$Winner17)
$levels
[1] "Conservative"            "Labour"                 
[3] "Liberal Democrat"        "Scottish National Party"
[5] "Plaid Cymru"             "UKIP"                   
[7] "Green"                   "Speaker"                

$class
[1] "factor"

$label
[1] "2017 Winning party"

2.3 Descriptive Statistics for Categorical Variables

One thing that we do not get from our examples so far is a sense of what percentage of the time constituencies appear in each region, or how many times a party won a constituency. In order to do this we need to start considering descriptive statistics.

Descriptive statistics are ways of understanding our data. Typically we focus on measures of centrality such as mean, median and mode or measures of dispersion such as minimum/maximum values, range or standard deviation.

There are a variety of ways that we can produce these using R. The first way we will examine is the "descriptives" command from the jamovi package (jmv).

This command works by first naming the data object, then the variable of interest, and then listing the statistics we want displayed (TRUE) or suppressed (FALSE):

descriptives(BESconfac, Winner17, freq = TRUE, mean = FALSE, median = FALSE, sd = FALSE, min = FALSE, max = FALSE)

 DESCRIPTIVES

 Descriptives            
 ─────────────────────── 
              Winner17   
 ─────────────────────── 
   N               632   
   Missing           0   
 ─────────────────────── 


 FREQUENCIES

 Frequencies of Winner17                                             
 ─────────────────────────────────────────────────────────────────── 
   Winner17                   Counts    % of Total    Cumulative %   
 ─────────────────────────────────────────────────────────────────── 
   Conservative                  317      50.15823        50.15823   
   Labour                        262      41.45570        91.61392   
   Liberal Democrat               12       1.89873        93.51266   
   Scottish National Party        35       5.53797        99.05063   
   Plaid Cymru                     4       0.63291        99.68354   
   UKIP                            0       0.00000        99.68354   
   Green                           1       0.15823        99.84177   
   Speaker                         1       0.15823       100.00000   
 ─────────────────────────────────────────────────────────────────── 

We might want a table that includes percentages of each category. This is something that is not easily captured for screen readers through the descriptives command. One way we can attempt to mitigate this is to have the data presented in plain text.

This does take a few steps more, but the output appears to be much easier for screen readers to cope with.

Firstly, we need to calculate the counts and percentages of the variable categories. We do this by creating two new objects which contain this information. We do this using the code counts <- table(BESconfac$Region)

  • What this does:

    • The table() function counts how many times each unique value appears in the Region variable of the BESconfac dataset.

    • For example, if a variable has values like “Yes”, “No”, and “NA”, this command will count how many occurrences of “Yes”, “No”, and “NA” exist in that column.

  • What you’ll see:

    • The result will be a frequency table (or count) of each unique value in the Region variable.
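
As a hypothetical toy example (the values here are invented, not taken from the BES data), table() counts each unique value like this:

```r
# Invented survey-style responses
responses <- c("Yes", "No", "Yes", "Yes", "No")

counts <- table(responses)
counts
# Prints a named count for each unique value: No appears 2 times, Yes 3 times
```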

Next we need to calculate the percentages of each category. We do this by creating an object called percentages. The code is percentages <- prop.table(counts) * 100

  • What this does:

    • The prop.table() function calculates the proportion of each category by dividing each count by the total number of observations.

    • Multiplying by 100 converts these proportions into percentages.

  • What you’ll see:

    • This will give you the percentage of each category in the Region variable.
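
Continuing the same invented toy example, prop.table() turns those counts into proportions, and multiplying by 100 gives percentages:

```r
responses <- c("Yes", "No", "Yes", "Yes", "No")
counts <- table(responses)

# Each count divided by the total (5), then scaled to percentages
percentages <- prop.table(counts) * 100
percentages
# No is 40% and Yes is 60% of the five responses
```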

Next, we need to organise the results into a data frame. We need to combine this information to form the basis of a readable table.

counts = table(BESconfac$Region)
percentages = prop.table(counts) * 100

result = data.frame(
  Category = names(counts),
  Count = as.vector(counts),
  Percentage = round(as.vector(percentages), 2)
)
  • What this does:

    • This creates a data frame (a structured table) that contains:

      • Category: The unique values from Region (such as “North East”, “Wales” and “Scotland”).

      • Count: The number of occurrences of each category.

      • Percentage: The percentage of total observations for each category, rounded to 2 decimal places.

  • What you’ll see:

    • A neatly organized table with three columns: Category, Count, and Percentage.

Finally, we have the bit of the code that looks a little more complicated. In essence, it loops over the table and prints the results in a readable format.

for (i in 1:nrow(result)) {
  cat(paste(
    "Category:", result$Category[i],
    "- Count:", result$Count[i],
    "- Percentage:", result$Percentage[i], "%\n"
  ))
}
  • What this does:

    • The for loop goes through each row of the result data frame (where each row represents a different category in Region).

    • For each row, the cat() function is used to print a sentence that combines the category name, count, and percentage in a human-readable format.

    • paste() is used to combine the individual pieces of information into a single string of text for each row.

  • What you’ll see:

    • This will print the results one line per category, in the format shown in the complete example below.

If you are struggling to understand the loop function in the last bit of text do not worry! A separate document is available that talks this through.
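
As a minimal hypothetical sketch of the loop idea on its own (separate from the BES data), a for loop simply repeats its body once for each value of the counter i:

```r
# An invented vector to loop over
fruits <- c("apple", "banana", "cherry")

for (i in 1:length(fruits)) {
  # paste() builds the sentence; cat() prints it; "\n" starts a new line
  cat(paste("Item", i, "is", fruits[i], "\n"))
}
```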

counts = table(BESconfac$Region)
percentages = prop.table(counts) * 100


# Create a data frame for easy formatting
result = data.frame(
  Category = names(counts),
  Count = as.vector(counts),
  Percentage = round(as.vector(percentages), 2)
)

for (i in 1:nrow(result)) {
  cat(paste(
    "Category:", result$Category[i],
    "- Count:", result$Count[i],
    "- Percentage:", result$Percentage[i], "%\n"
  ))
}
Category: North East - Count: 29 - Percentage: 4.59 %
Category: North West - Count: 75 - Percentage: 11.87 %
Category: Yorkshire and The Humber - Count: 54 - Percentage: 8.54 %
Category: East Midlands - Count: 46 - Percentage: 7.28 %
Category: West Midlands - Count: 59 - Percentage: 9.34 %
Category: East of England - Count: 58 - Percentage: 9.18 %
Category: London - Count: 73 - Percentage: 11.55 %
Category: South East - Count: 84 - Percentage: 13.29 %
Category: South West - Count: 55 - Percentage: 8.7 %
Category: Wales - Count: 40 - Percentage: 6.33 %
Category: Scotland - Count: 59 - Percentage: 9.34 %

2.4: Graphing Single Categorical Variables

Understanding how to interpret graphs creates significant new challenges for those with visual impairments. However, packages exist that help make graphics more engaging and offer an insight into what they show and their use. In this section, we are going to explore some ways to create graphs and make them interpretable.

We are going to start off by trying to create a bar plot of our Region variable. A bar chart provides similar information to our table in that it has a column for each category, and the height of the column shows how many cases are in it. We can create graphs using a command called ggplot, which exists within the tidyverse package we loaded earlier.

The basic code requires us to create a new object to store our graph in. We then use the ggplot command and, within brackets, specify the data object we want to load; within the aesthetics we tell it what information we want displayed in the graph. The common locations of elements are the x axis, which is a horizontal line across the bottom of the page, the y axis, which runs along the left-hand vertical side of the page, and fill, which is where we add information within the graph space. Finally, we need to use an addition symbol “+” to link this bit of code to the type of graph we want. In ggplot these are called geoms. In this instance we want geom_bar().

The code should look like this:

Bar = ggplot(BESconfac, aes(x=Region))+
  geom_bar()

As it stands this is not at all useful to a visually impaired user. In fact, even someone with full use of their sight would not see anything at this stage, as the graph is stored as an object and not displayed. To make this interpretable for those who are visually impaired we use a command from the BrailleR package called “VI”. This command aims to make graphs interpretable by providing a description of what they show.

To use it, we state the command VI and the name of the data object within brackets. It should be something like this:

VI(Bar)
This is an untitled chart with no subtitle or caption.
It has x-axis 'Region' with labels North East, North West, Yorkshire and The Humber, East Midlands, West Midlands, East of England, London, South East, South West, Wales and Scotland.
It has y-axis 'count' with labels 0, 20, 40, 60 and 80.
The chart is a bar chart with 11 vertical bars.

If we run this chunk we should now get a description of the graph that a screen reader can engage with.

What does it show us? It tells us what is on the y and x axes. It also tells us what the different bars represent: each bar corresponds to a region, so Bar 1 = North East, Bar 2 = North West, and so on.

Question 1b: Have a go at creating a bar graph for the Winner17 variable. What do the results tell us?

3: Understanding and Descriptively Analysing a Numeric Variable

In this next part of the lab we are going to explore similar commands to those covered in Section 2. However, we are instead going to focus in on numeric variables instead of categorical. We are going to start by examining the variable leaveHanretty which provides us a percentage estimate of leave voters from the 2016 EU referendum in the UK.

We start by exploring its class to confirm the variable type.

class(BESconfac$leaveHanretty)
[1] "numeric"

As we can see, this is indeed a numeric variable. Now we have established this we might want to know further information about it. We can proceed by using the attributes command.

attributes(BESconfac$leaveHanretty)
$label
[1] "Hanretty (2017) Estimate of EU Referendum Leave %"

$format.spss
[1] "F16.15"

$display_width
[1] 16

We can see from these results that the variable is indeed numeric, it gives us the label which shows us that the variable is providing an overview of the Hanretty Estimate of EU Referendum Leave %.

We might also want to explore summary statistics for this variable to get a sense of what it is telling us. We can again use the summary command.

summary(BESconfac$leaveHanretty)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  20.48   45.33   53.69   52.06   60.15   75.65 

The output here tells us that the minimum leave vote is 20.48% and the maximum is 75.65%. We can see then that there were some constituencies where few people voted leave and others where they voted leave in large numbers. The mean is 52.06%, suggesting that on average in each constituency more people voted leave than voted remain. Somewhat unsurprising given we know the result!

However, there are other statistics here that might be useful, such as the standard deviation. To access this information we need to use another command. We can use the descriptives command we used earlier. We just need to remember to adjust our options and set freq = FALSE and the other options to TRUE. The code should look like this:

descriptives(BESconfac, leaveHanretty, freq = FALSE, mean = TRUE, median = TRUE, sd = TRUE, min = TRUE, max = TRUE, range = TRUE)

 DESCRIPTIVES

 Descriptives                            
 ─────────────────────────────────────── 
                         leaveHanretty   
 ─────────────────────────────────────── 
   N                               632   
   Missing                           0   
   Mean                       52.05826   
   Median                     53.68607   
   Standard deviation         11.43820   
   Range                      55.16908   
   Minimum                    20.48079   
   Maximum                    75.64987   
 ─────────────────────────────────────── 

As we can see, we now get a bit more information about our variable. We can see that the standard deviation for our variable is 11.44. What does this mean? The standard deviation tells us how far away, on average, each case is from the mean. A low standard deviation means data points are clustered closely around the mean; a high standard deviation tells us the data is very spread out.
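
To see what the standard deviation measures, we can compute it by hand on a small invented vector and compare the result with R’s built-in sd():

```r
# Invented values
x <- c(45, 50, 52, 55, 60)
n <- length(x)

deviations <- x - mean(x)                       # distance of each case from the mean
manual_sd <- sqrt(sum(deviations^2) / (n - 1))  # sample standard deviation formula

manual_sd
sd(x)  # the built-in command gives the same answer
```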

3.1.1 An Alternative Descriptives Code

If the code highlighted above does not interact well with your screen reader there are other ways that we can get descriptive statistics. We can use built-in base R commands to get all of these statistics. The commands are mean, median, sd, min, max and range. We then need to specify our data object and variable name using the dollar symbol as previously. There is one new bit of code to discuss as well. We have to tell the software what to do with missing values. Missing values are empty cells or cells with information we do not need, e.g. someone who says ‘don’t know’ in a survey or a constituency we don’t have information about. To tell R to ignore these, we use an option called na.rm = TRUE. This is useful when we are dealing with numeric variables. The code would look like this.

mean(BESconfac$leaveHanretty, na.rm = TRUE)
[1] 52.05826
median(BESconfac$leaveHanretty, na.rm = TRUE)
[1] 53.68607
sd(BESconfac$leaveHanretty, na.rm = TRUE)
[1] 11.4382
min(BESconfac$leaveHanretty, na.rm = TRUE)
[1] 20.48079
max(BESconfac$leaveHanretty, na.rm = TRUE)
[1] 75.64987
range(BESconfac$leaveHanretty, na.rm = TRUE)
[1] 20.48079 75.64987

This produces one line of output for each statistic. If we wanted to go further and draw it into a table, as we have done previously, we could also create an object and store all of this information in it as a list. The code to do this is as follows:

stats = list(
  Mean = mean(BESconfac$leaveHanretty, na.rm = TRUE),
  Median = median(BESconfac$leaveHanretty, na.rm = TRUE),
  Standard_Deviation = sd(BESconfac$leaveHanretty, na.rm = TRUE),
  Minimum = min(BESconfac$leaveHanretty, na.rm = TRUE),
  Maximum = max(BESconfac$leaveHanretty, na.rm = TRUE),
  Range = range(BESconfac$leaveHanretty, na.rm = TRUE)
)

print(stats)
$Mean
[1] 52.05826

$Median
[1] 53.68607

$Standard_Deviation
[1] 11.4382

$Minimum
[1] 20.48079

$Maximum
[1] 75.64987

$Range
[1] 20.48079 75.64987

We can use the cat() function to create a structured and readable output with descriptive labels for each statistic, which again makes this easier for a screen reader. Here’s a breakdown of the code:

  1. The information in the quotation marks is the label that each row will have. We then have the code we produced above for each statistic, and at the end we use a comma followed by “\n” to specify that we want to break to a new line. This makes a much better table that can be more effectively read by a screen reader.
cat("Statistics for leaveHanretty:\n",
    "Mean:", mean(BESconfac$leaveHanretty, na.rm = TRUE), "\n",
    "Median:", median(BESconfac$leaveHanretty, na.rm = TRUE), "\n",
    "Standard Deviation:", sd(BESconfac$leaveHanretty, na.rm = TRUE), "\n",
    "Minimum:", min(BESconfac$leaveHanretty, na.rm = TRUE), "\n",
    "Maximum:", max(BESconfac$leaveHanretty, na.rm = TRUE), "\n",
    "Range:", range(BESconfac$leaveHanretty, na.rm = TRUE), "\n")
Statistics for leaveHanretty:
 Mean: 52.05826 
 Median: 53.68607 
 Standard Deviation: 11.4382 
 Minimum: 20.48079 
 Maximum: 75.64987 
 Range: 20.48079 75.64987 

Question 2a. Try to create and analyse the following statistics for the c11HouseOwned variable and write a few lines about what these statistics tell you:

  • mean

  • median

  • range

  • standard deviation

  • minimum

  • maximum

3.1.2 Summary

We have reviewed a few different ways of exploring numeric variables. Ultimately, it does not matter which of these ways you use; whichever works for you is fine. The crucial thing is to make sure you are able to engage with and interpret the data through one of these approaches.

3.2 Graphing Single Numeric Variables

In this final section we are going to explore how we can graph individual numeric variables. We will do this first by looking at the leaveHanretty variable.

We have two different types of graphs we can use to do this: histograms and box plots.

3.2.1 Histograms

Histograms work in a similar way to a bar chart. However, instead of having one column for each category, they work by grouping numbers together into what are referred to as bins or breaks.
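
A minimal hypothetical sketch makes the idea of bins concrete: hist() with plot = FALSE returns the bin counts without drawing anything.

```r
# Invented values between 1 and 9
x <- c(1, 2, 2, 3, 7, 8, 8, 9)

# Group the values into two bins: (0, 5] and (5, 10]
h <- hist(x, breaks = c(0, 5, 10), plot = FALSE)
h$counts  # 4 values fall in the first bin and 4 in the second
```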

We can do this using ggplot similarly to what we did earlier. In fact, the only bit of code that needs to change is the geom which we need to make geom_histogram()

hist1 = ggplot(BESconfac, aes(x=leaveHanretty))+
  geom_histogram()

VI(hist1)
This is an untitled chart with no subtitle or caption.
It has x-axis 'leaveHanretty' with labels 20, 40 and 60.
It has y-axis 'count' with labels 0, 20 and 40.
The chart is a bar chart with 30 vertical bars.

This gives us a brief overview of the histogram. It tells us that the x-axis is leaveHanretty with labels at 20, 40 and 60, and that there are 30 bars, but it does not do a good job of describing the pattern of the data. This is a limitation of ggplot.

3.2.2 Histograms in Base R Using hist

We need to create an object to save our graph in. I’m going to call it Hist in this instance. This time rather than using ggplot we use the command hist.

The code works slightly differently but has similarities to code we have used earlier. The command is hist, then we state the object and variable separated by a dollar symbol. We can apply options such as breaks, which tells R how many columns will be created, main, which provides the option to title the graph, and xlab, which labels the x-axis.

Hist = hist(BESconfac$leaveHanretty, breaks = 20, main = "Histogram of leaveHanretty", xlab = "leaveHanretty")

We can view this graph by using VI(Hist).

VI(Hist)
This is a histogram, with the title: Histogram of leaveHanretty
"BESconfac$leaveHanretty" is marked on the x-axis.
Tick marks for the x-axis are at: 20, 30, 40, 50, 60, and 70 
There are a total of 632 elements for this variable.
Tick marks for the y-axis are at: 0, 10, 20, 30, 40, and 50 
It has 28 bins with equal widths, starting at 20 and ending at 76 .
The mids and counts for the bins are:
mid = 21  count = 6 
mid = 23  count = 7 
mid = 25  count = 6 
mid = 27  count = 8 
mid = 29  count = 9 
mid = 31  count = 8 
mid = 33  count = 9 
mid = 35  count = 14 
mid = 37  count = 11 
mid = 39  count = 21 
mid = 41  count = 18 
mid = 43  count = 26 
mid = 45  count = 23 
mid = 47  count = 36 
mid = 49  count = 27 
mid = 51  count = 46 
mid = 53  count = 46 
mid = 55  count = 45 
mid = 57  count = 53 
mid = 59  count = 52 
mid = 61  count = 52 
mid = 63  count = 31 
mid = 65  count = 19 
mid = 67  count = 21 
mid = 69  count = 17 
mid = 71  count = 14 
mid = 73  count = 5 
mid = 75  count = 2

This time, we get a great deal more detail. It tells us the number of cases in the graph (632). It tells us the number of counts within each of the breaks. It tells us that as we get to the middle of the percentage range around 41-61 we get a much greater count of cases in each column than we get at the bottom and top of the variable.

3.2.3 Box Plots in Base R

An alternative type of graph for a single numeric variable is a boxplot. It provides a different way of presenting information that makes the distribution of the variable clearer.

A boxplot is a visual summary of the distribution of a dataset that shows its central tendency, variability, and potential outliers. For a blind person, we can describe the key components of the boxplot in text form.

Components of a Boxplot and How to Describe Them

  1. Median (Middle Value):

    • The median is represented by the line inside the box. It divides the dataset into two equal halves.

    • Describe the median’s value and where it lies in relation to the dataset.

    Example: “The median is 25, meaning half of the values are below this and half are above.”

  2. Interquartile Range (IQR):

    • The box itself represents the middle 50% of the data (from the 25th to the 75th percentile).

    • The range of the box tells how spread out the middle half of the data is.

    • Provide the values of the 25th percentile (Q1) and 75th percentile (Q3), and explain the IQR as Q3 - Q1.

    Example: “The interquartile range is 10, spanning from 20 (25th percentile) to 30 (75th percentile).”

  3. Whiskers (Data Range):

    • Whiskers extend to the smallest and largest values that are not considered outliers.

    • Mention the minimum and maximum values within this range.

    Example: “The whiskers extend from 15 to 35, showing the range of most data points.”

  4. Outliers:

    • Outliers are individual data points beyond the whiskers.

    • Mention how many outliers there are and their approximate values.

    Example: “There are two outliers: one at 12 and another at 40.”

  5. Skewness:

    • If the median is closer to one side of the box, or the whiskers are uneven, the data may be skewed.

    • Describe the skewness (e.g., “Data is left-skewed because the left whisker is longer than the right”).
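
The components above can be computed directly in base R. Here is a minimal sketch on a small invented vector, using the base R functions quantile(), IQR() and boxplot.stats():

```r
# Invented values
x <- c(12, 15, 18, 20, 22, 25, 28, 30, 33, 35, 40)

quantile(x, c(0.25, 0.5, 0.75))  # Q1 = 19, median = 25, Q3 = 31.5
IQR(x)                           # interquartile range: Q3 - Q1 = 12.5
boxplot.stats(x)$out             # values flagged as outliers (none here)
```

boxplot.stats() also reports the whisker ends, so the same function can supply a full text description of a boxplot.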

We can produce a boxplot of our leaveHanretty variable by using the boxplot command. It builds upon similar code we saw to produce the histogram in base R; the only difference is that we specify boxplot instead of hist. The remainder of the code is very similar, as seen below:

BoxHanretty1 = boxplot(BESconfac$leaveHanretty, main = "Boxplot of leaveHanretty", ylab = "leaveHanretty")

We can also use the VI command to describe the graph to us.

VI(BoxHanretty1)
This graph has a boxplot printed vertically
With the title: Boxplot of leaveHanretty
"" appears on the x-axis.
"" appears on the y-axis.
Tick marks for the y-axis are at: 20, 30, 40, 50, 60, and 70 
This variable  has 632 values.
An outlier is marked at: 22.04674 20.71151 22.93742 21.79552 22.15339 21.62047 20.48079 22.78226 21.61112 20.53967 22.4284 
The whiskers extend to 23.72532 and 75.64987 from the ends of the box, 
which are at 45.33207 and 60.16252 
The median, 53.68607 is 56 % from the lower end of the box to the upper end.
The upper whisker is 0.72 times the length of the lower whisker.

What does the summary of the boxplot tell us? It identifies outliers, which tells us there are a number of cases where the leave vote was unusually low. It also tells us that the lower whisker is longer than the upper whisker, suggesting the spread of cases in the lower quarter of the data is greater than in the upper quarter.

Overall, we can see that the median is above 50%, so we know that the middle value represents a majority leave vote.

Question 3b. Try and recreate these graphs using the c11Employed variable and write a few sentences about what you find.

Appendix: Calculating Descriptive Statistics

Screen-Readable Equations and Explanations

1. Mean (Arithmetic Average)

Definition: The mean (or arithmetic average) is the sum of all values divided by the number of values.

Equation:

Mean (\(\bar{x}\)) = (Sum of all values) / (Number of values)

Mathematically:

\(\bar{x} = \frac{\sum x_i}{n}\)

Where:

\(\bar{x}\) is the mean,

\(x_i\) represents each individual value,

\(\sum x_i\) means summing all values,

\(n\) is the total number of values.

Explanation: The mean provides a central value for the dataset and is used in many statistical analyses. However, it is sensitive to extreme values (outliers), which can distort the average.
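
A small invented example shows this sensitivity:

```r
x <- c(10, 12, 14, 16, 18)
mean(x)            # 14: a good summary of the centre

# Replace the largest value with an extreme outlier
x_outlier <- c(10, 12, 14, 16, 100)
mean(x_outlier)    # 30.4: pulled upwards by the single extreme value
median(x_outlier)  # 14: unchanged, because the median ignores extreme magnitudes
```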

2. Median

Definition: The median is the middle value when the data is arranged in ascending order.

Calculation:

If the number of values (\(n\)) is odd, the median is the middle value.

If \(n\) is even, the median is the average of the two middle values.

Mathematically, with the values sorted in ascending order as \(x_{(1)}, x_{(2)}, \ldots, x_{(n)}\):

If \(n\) is odd: Median \(= x_{((n+1)/2)}\)

If \(n\) is even: Median \(= \frac{x_{(n/2)} + x_{(n/2+1)}}{2}\)

Explanation: The median is useful when the dataset contains outliers or is skewed because it represents the central tendency without being influenced by extreme values.

3. Mode

Definition: The mode is the most frequently occurring value(s) in the dataset.

Examples:

If a dataset is \({2, 3, 3, 5, 7, 3}\), the mode is 3, because it appears most frequently.

If a dataset is \({1, 1, 2, 2, 3, 3}\), it has multiple modes: 1, 2, and 3 (this is called a multimodal distribution).

If all values occur with equal frequency, the dataset has no mode.

Explanation: The mode is useful for categorical data or distributions where identifying the most common value is important. Unlike the mean and median, the mode does not necessarily provide a measure of central tendency for all data types.
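
R has no built-in function for the statistical mode (R’s own mode() reports a variable’s storage type instead), so a small helper based on table() is a common sketch. The function name find_modes is invented here:

```r
# Return the most frequently occurring value(s) in a vector
find_modes <- function(x) {
  counts <- table(x)
  names(counts)[counts == max(counts)]
}

find_modes(c(2, 3, 3, 5, 7, 3))  # "3"
find_modes(c(1, 1, 2, 2, 3, 3))  # "1" "2" "3" (multimodal)
```

Because table() stores its names as text, the result comes back as character values; wrap the call in as.numeric() if a numeric answer is needed.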

4. Standard Deviation

Definition: The standard deviation measures the spread of data points around the mean. A higher standard deviation indicates greater variability.

Equation for a sample standard deviation:

\(s = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}}\)

Where:

\(s\) is the sample standard deviation,

\(x_i\) represents each data point,

\(\bar{x}\) is the mean,

\(n\) is the number of values,

\(\sum (x_i - \bar{x})^2\) is the sum of squared differences from the mean.

Explanation:

A low standard deviation means the data points are close to the mean.

A high standard deviation means the data points are more spread out.

The denominator uses \(n - 1\) instead of \(n\) because it provides an unbiased estimate when working with a sample rather than the full population.
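
A short sketch on an invented vector shows the difference between the two denominators and confirms that R’s sd() uses the sample (n - 1) version:

```r
x <- c(4, 8, 6, 5, 3)
n <- length(x)

sum_sq <- sum((x - mean(x))^2)           # sum of squared differences from the mean

sample_sd     <- sqrt(sum_sq / (n - 1))  # divides by n - 1; matches sd(x)
population_sd <- sqrt(sum_sq / n)        # divides by n, so slightly smaller

sample_sd
population_sd
```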

For a population standard deviation, the formula is:

\(\sigma = \sqrt{\frac{\sum (x_i - \mu)^2}{N}}\)

where \(\sigma\) is the population standard deviation, \(\mu\) is the population mean, and \(N\) is the total number of values.

Summary

Mean: Average of all values, affected by outliers.

Median: Middle value, useful for skewed data.

Mode: Most frequently occurring value(s), useful for categorical data.

Standard Deviation: Measures variability, important for assessing data spread.

Each of these measures helps in understanding the distribution and characteristics of a dataset, with different strengths depending on the nature of the data and the research question.