Open Science in R
Now that you know what computational notebooks are and why we should care about them, let’s start using them! This section introduces you to using R for manipulating tabular data. Please read through it carefully and pay attention to how ideas about manipulating data are translated into code. For this part, you can read directly from the course website, although it is recommended you follow the section interactively by running the code on your own.
Once you have read through, jump to the Do-It-Yourself section, which will provide you with a challenge to complete on your own, allowing you to put what you have already learnt into practice.
Data wrangling
Real-world datasets tend to be messy. There is no way around it: datasets have “holes” (missing data), the number of formats in which data can be stored is endless, and the best structure to share data is not always the best one to analyse them, hence the need to wrangle (manipulate, transform and structure) them. As has been pointed out in many outlets (e.g.), much of the time spent in what is called (Geo-)Data Science goes not into sophisticated modelling and insight, but into more basic and less exotic tasks: obtaining data, processing them, turning them into a shape that makes analysis possible, and exploring them to get to know their basic properties.
In this session, you will use a few real-world datasets and learn how to process them in R so they can be transformed and manipulated, if necessary, and analysed. For this, we will introduce some of the fundamental tools of data analysis and scientific computing. We use a prepared dataset that spares us much of the more intricate processing that goes beyond the introductory level this session is aimed at.
In this notebook, we discuss several patterns to clean and structure data properly, including tidying, subsetting, and aggregating; and we finish with some basic visualization. An additional extension presents more advanced tricks to manipulate tabular data.
Before we get our hands data-dirty, let us import all the additional libraries we will need to run the code:
Loading packages
We will start by loading the core packages for working with tabular data.

library(tidyverse) # a collection of packages for data manipulation
library(data.table) # load the data.table package for fast handling of large tables
Datasets
We will be exploring some demographic characteristics in Liverpool. To do that, we will use a dataset that contains population counts, split by ethnic origin. These counts are aggregated at the Lower Layer Super Output Area (LSOA from now on) level. LSOAs are an official Census geography defined by the Office for National Statistics. You can think of them, more or less, as neighbourhoods. Many data products (Census, deprivation indices, etc.) use LSOAs as one of their main geographies.
To do this, we will download a data folder from GitHub called census2021_ethn. You should place it in a data folder you will use throughout the course.
Import the census data from the csv file
census2021 <- read.csv("data/census2021_ethn/liv_pop.csv", row.names = "GeographyCode")
Let us stop for a minute to learn how we have read the file. Here are the main aspects to keep in mind:
- We are using the function read.csv from base R; you could also use read_csv from library("readr") (see the sketch after this list).
- Here the csv comes from a local data file, but the source could also be a web address, and sometimes you will find data shipped inside packages.
- The argument row.names is not strictly necessary but allows us to choose one of the columns as the index of the table. More on indices below.
- We are using read.csv because the file we want to read is in the csv format. However, many more formats can be read into an R environment. A full list of supported formats may be found here.
- To ensure we can access the data we have read, we store it in an object that we call census2021. We will see more on what we can do with it below but, for now, just keep in mind that it allows us to save the result of read.csv.
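As a quick illustration of the readr alternative mentioned in the list above (not needed for the rest of the session; the file path assumes the folder structure described next):

library(readr)
census2021_tbl <- read_csv("data/census2021_ethn/liv_pop.csv") # returns a tibble
# Tibbles do not support row names, so GeographyCode stays as a regular column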
You need to store the data file on your computer and read it locally. To do that, you can follow these steps:

- Download the census2021_ethn folder by right-clicking on this link and saving the file.
- Place it in the data folder you have created, where you intend to read it from.
- Your folder should then have the following structure: a. a gds folder (where you will save your quarto .qmd documents); b. a data folder; c. the census2021_ethn folder inside your data folder.
If you need to download a single folder from GitHub, you can use the download-directory tool:

- First go to https://download-directory.github.io/
- Then copy the address of the folder you need today, for example: https://github.com/pietrostefani/gds/tree/main/data/London
- Paste it in the green box… give it a few minutes.
- Check your Downloads folder and unzip the result.
Data, sliced and diced
Now we are ready to start playing with and interrogating the dataset! What we have at our fingertips is a table that summarises, for each of the LSOAs in Liverpool, how many people live in each, by the region of the world where they were born. In R, we call these tables data.frame objects, and they have a lot of functionality built in to explore and manipulate the data they contain.
Structure
Let’s start by exploring the structure of a data.frame. We can print it by simply typing its name; alternatively, view() opens it in an interactive data viewer:
view(census2021) # open the table in the data viewer
Since they represent a table of data, data.frame objects have two dimensions: rows and columns. Each of these is automatically assigned a name in what we will call its index. When printing, the index of each dimension is rendered in bold, as opposed to the standard rendering for the content. In the example above, we can see how the column index is automatically picked up from the .csv file’s column names. For rows, we specified when reading the file that we wanted the column GeographyCode, so that is used. If we had not specified any, R would automatically generate a sequence of row names starting at 1 and going all the way to the number of rows. This is the standard structure of a data.frame object, so we will come back to it over and over. Importantly, even when we move to spatial data, our datasets will have a similar structure.
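We can query both indices directly; a short illustrative snippet:

dim(census2021)            # number of rows and columns
colnames(census2021)       # the column index, picked up from the csv header
head(rownames(census2021)) # the row index, taken from GeographyCode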
One further feature of these tables is that they can hold columns with different types of data. In our example, this is not used, as we have counts (integer types) for each column. But it is useful to keep in mind that we can combine this with columns that hold other types of data, such as categories (factors), text (character strings), dates or, as we will see later in the course, geographic features.
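To confirm what type each column holds, str() gives a compact overview (purely illustrative):

str(census2021)           # dimensions plus the type of every column
sapply(census2021, class) # the class of each column; here, all integer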
Inspecting
We can check the top (bottom) X lines of the table by passing X to the function head() (tail()). If we do not pass a number, they show six lines by default:

head(census2021) # show the first six rows (the default)
Europe Africa Middle.East.and.Asia The.Americas.and.the.Caribbean
E01006512 910 106 840 24
E01006513 2225 61 595 53
E01006514 1786 63 193 61
E01006515 974 29 185 18
E01006518 1531 69 73 19
E01006519 1238 7 24 14
Antarctica.and.Oceania
E01006512 0
E01006513 7
E01006514 5
E01006515 2
E01006518 4
E01006519 3
tail(census2021) # show the last six rows
Europe Africa Middle.East.and.Asia The.Americas.and.the.Caribbean
E01033763 1302 68 142 11
E01033764 2106 32 49 15
E01033765 1277 21 33 17
E01033766 1028 12 20 8
E01033767 1003 29 29 5
E01033768 1016 69 111 21
Antarctica.and.Oceania
E01033763 4
E01033764 0
E01033765 3
E01033766 7
E01033767 1
E01033768 6
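If we want a number of lines other than the default six, we can pass it explicitly through the n argument:

head(census2021, n = 3) # top three rows
tail(census2021, n = 3) # bottom three rows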
Summarise
We can get an overview of the values of the table:
summary(census2021)
Europe Africa Middle.East.and.Asia
Min. : 731 Min. : 0.00 Min. : 1.00
1st Qu.:1331 1st Qu.: 7.00 1st Qu.: 16.00
Median :1446 Median : 14.00 Median : 33.50
Mean :1462 Mean : 29.82 Mean : 62.91
3rd Qu.:1580 3rd Qu.: 30.00 3rd Qu.: 62.75
Max. :2551 Max. :484.00 Max. :840.00
The.Americas.and.the.Caribbean Antarctica.and.Oceania
Min. : 0.000 Min. : 0.00
1st Qu.: 2.000 1st Qu.: 0.00
Median : 5.000 Median : 1.00
Mean : 8.087 Mean : 1.95
3rd Qu.:10.000 3rd Qu.: 3.00
Max. :61.000 Max. :11.00
Note how the output is itself an R object (a table of summary statistics), so you can do with it many of the same things you would do with the original data (e.g. writing it to a file, as shown below).
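For instance, a minimal sketch (the output file name is just an assumption):

census_summary <- summary(census2021)
write.csv(census_summary, "data/census2021_summary.csv") # hypothetical output path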
In this case, the summary might be better presented if the table is “transposed”:
t(summary(census2021))
Europe Min. : 731 1st Qu.:1331
Africa Min. : 0.00 1st Qu.: 7.00
Middle.East.and.Asia Min. : 1.00 1st Qu.: 16.00
The.Americas.and.the.Caribbean Min. : 0.000 1st Qu.: 2.000
Antarctica.and.Oceania Min. : 0.00 1st Qu.: 0.00
Europe Median :1446 Mean :1462
Africa Median : 14.00 Mean : 29.82
Middle.East.and.Asia Median : 33.50 Mean : 62.91
The.Americas.and.the.Caribbean Median : 5.000 Mean : 8.087
Antarctica.and.Oceania Median : 1.00 Mean : 1.95
Europe 3rd Qu.:1580 Max. :2551
Africa 3rd Qu.: 30.00 Max. :484.00
Middle.East.and.Asia 3rd Qu.: 62.75 Max. :840.00
The.Americas.and.the.Caribbean 3rd Qu.:10.000 Max. :61.000
Antarctica.and.Oceania 3rd Qu.: 3.00 Max. :11.00
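If you prefer the tidyverse idiom, a similar (if less detailed) overview can be built with summarise() and across(); a sketch computing the means only:

census2021 %>%
  summarise(across(everything(), mean)) # mean of every column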
Columns
Create new columns
We can generate new variables by applying operations to existing ones. For example, we can calculate the total population by area. Here are a couple of ways to do it.

Using the package dplyr (more info can be found here; dplyr is part of the tidyverse!):
census2021 <- census2021 %>%
  mutate(Total_Pop = rowSums(select(., Africa, Middle.East.and.Asia, Europe, The.Americas.and.the.Caribbean, Antarctica.and.Oceania)))
head(census2021)
Europe Africa Middle.East.and.Asia The.Americas.and.the.Caribbean
E01006512 910 106 840 24
E01006513 2225 61 595 53
E01006514 1786 63 193 61
E01006515 974 29 185 18
E01006518 1531 69 73 19
E01006519 1238 7 24 14
Antarctica.and.Oceania Total_Pop
E01006512 0 1880
E01006513 7 2941
E01006514 5 2108
E01006515 2 1208
E01006518 4 1696
E01006519 3 1286
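For comparison, the same column can be created in base R (a sketch; it simply overwrites Total_Pop with identical values):

census2021$Total_Pop <- rowSums(
  census2021[, c("Africa", "Middle.East.and.Asia", "Europe",
                 "The.Americas.and.the.Caribbean", "Antarctica.and.Oceania")]
)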
A different spin on this is assigning new values: we can also generate a new variable from a scalar, giving every row the same value, and modify it later:
census2021$new_column <- 1
head(census2021)
Europe Africa Middle.East.and.Asia The.Americas.and.the.Caribbean
E01006512 910 106 840 24
E01006513 2225 61 595 53
E01006514 1786 63 193 61
E01006515 974 29 185 18
E01006518 1531 69 73 19
E01006519 1238 7 24 14
Antarctica.and.Oceania Total_Pop new_column
E01006512 0 1880 1
E01006513 7 2941 1
E01006514 5 2108 1
E01006515 2 1208 1
E01006518 4 1696 1
E01006519 3 1286 1
dplyr is an immensely useful package in R because it streamlines and simplifies the process of data manipulation and transformation. With its intuitive and consistent syntax, dplyr provides a set of powerful and efficient functions that make tasks like filtering, summarising, grouping, and joining datasets much more straightforward. Whether you are working with small or large datasets, dplyr’s optimised code execution ensures fast and efficient operations. Its ability to chain functions together using the pipe operator (%>%) allows for a clean and readable code structure, enhancing code reproducibility and collaboration. Overall, dplyr is an indispensable tool for data analysts and scientists working in R, enabling them to focus on their data insights rather than wrestling with complex data manipulation code.
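To illustrate that chaining, here is a small sketch that derives a new column and immediately trims the table (the column name share_europe is arbitrary):

census2021 %>%
  mutate(share_europe = Europe / Total_Pop) %>% # new column from existing ones
  select(Europe, Total_Pop, share_europe) %>%   # keep only the relevant columns
  head()                                        # peek at the result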
Delete columns
Permanently deleting variables is also within reach of one command. Using dplyr:

census2021 <- census2021 %>%
  mutate(new_column = 1)

census2021 <- census2021 %>%
  select(-new_column)
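The same deletion can be done in base R by assigning NULL to the column:

census2021$new_column <- 1    # re-create the column for the example
census2021$new_column <- NULL # assigning NULL removes it permanently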
Queries
Index-based queries
Here we explore how we can subset parts of a data.frame if we know exactly which bits we want. For example, if we want to extract the total and European population of the first four areas in the table:

We can select the rows we want with c(). If this structure is new to you, have a look here.
eu_tot_first4 <- census2021 %>%
  filter(rownames(census2021) %in% c('E01006512', 'E01006513', 'E01006514', 'E01006515')) %>%
  select(Total_Pop, Europe)
eu_tot_first4
Total_Pop Europe
E01006512 1880 910
E01006513 2941 2225
E01006514 2108 1786
E01006515 1208 974
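Since these happen to be the first four rows of the table, the same subset can also be taken by position with slice() (a sketch; note that dplyr verbs may not preserve row names):

census2021 %>%
  slice(1:4) %>%            # first four rows, by position
  select(Total_Pop, Europe)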
Condition-based queries
However, sometimes we do not know exactly which observations we want, but we do know what conditions they need to satisfy (e.g. areas with more than 2,000 inhabitants). For these cases, data.frames support selection based on conditions. Let us see a few examples. Suppose we want to select…
Areas with more than 900 people in total:
pop900 <- census2021 %>%
  filter(Total_Pop > 900)
Areas where there are fewer than 750 Europeans:
euro750 <- census2021 %>%
  filter(Europe < 750)
Areas with exactly ten people from Antarctica and Oceania:
oneOA <- census2021 %>%
  filter(`Antarctica.and.Oceania` == 10)
These queries can grow in sophistication with almost no limits.
Combining queries
Now all of these queries can be combined with each other for further flexibility. For example, imagine we want areas with more than 25 people from the Americas and the Caribbean, but fewer than 1,500 people in total:
ac25_l500 <- census2021 %>%
  filter(The.Americas.and.the.Caribbean > 25, Total_Pop < 1500)
ac25_l500
Europe Africa Middle.East.and.Asia The.Americas.and.the.Caribbean
E01033750 1235 53 129 26
E01033752 1024 19 114 33
E01033754 1262 37 112 32
E01033756 886 31 221 42
E01033757 731 39 223 29
E01033761 1138 52 138 33
Antarctica.and.Oceania Total_Pop
E01033750 5 1448
E01033752 6 1196
E01033754 9 1452
E01033756 5 1185
E01033757 3 1025
E01033761 11 1372
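Conditions can also be combined with OR (|) rather than AND; for example, a sketch with arbitrary thresholds:

census2021 %>%
  filter(The.Americas.and.the.Caribbean > 25 | Africa > 200) # either condition is enough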
Sorting
Among the many operations data.frame objects support, one of the most useful is sorting a table based on a given column. For example, imagine we want to sort the table by total population:
db_pop_sorted <- census2021 %>%
  arrange(desc(Total_Pop)) # sort the data frame by the "Total_Pop" column in descending order
head(db_pop_sorted)
Europe Africa Middle.East.and.Asia The.Americas.and.the.Caribbean
E01006747 2551 163 812 24
E01006513 2225 61 595 53
E01006751 1843 139 568 21
E01006524 2235 36 125 24
E01006787 2187 53 75 13
E01006537 2180 23 46 6
Antarctica.and.Oceania Total_Pop
E01006747 2 3552
E01006513 7 2941
E01006751 1 2572
E01006524 11 2431
E01006787 2 2330
E01006537 2 2257
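Note that arrange() sorts in ascending order by default and accepts several columns, which act as tie-breakers; a brief sketch:

census2021 %>%
  arrange(Total_Pop) %>% # ascending order is the default
  head()

census2021 %>%
  arrange(desc(Europe), desc(Africa)) %>% # sort by Europe, break ties by Africa
  head()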
Additional resources
A good introduction to data manipulation in R is the “Data wrangling” chapter in R for Data Science.
A good extension is Hadley Wickham’s “Tidy data” paper, which presents a very popular way of organising tabular data for efficient manipulation.