library(tidyverse)
Do-It-Yourself
This section is all about you taking charge of the steering wheel and choosing your own adventure. For this block, we are going to use what we’ve learnt before to take a look at a dataset of casualties in the war in Afghanistan. The data was originally released by Wikileaks, and the version we will use is published by The Guardian.
You can read a bit more about the data at The Guardian’s data blog
import pandas
Data preparation
Before you can set off on your data journey, the dataset needs to be read, and there’s a couple of details we will get out of the way so it is then easier for you to start working.
The data are published on a Google Sheet.
As you will see, each row includes casualties recorded month by month, split by Taliban, Civilians, Afghan forces, and NATO.
Let’s read it into an R or Python session:
# Specify the URL of the CSV file
<- "https://docs.google.com/spreadsheets/d/e/2PACX-1vRa7OIBiz7-yqmgwUEn4V5Wm1TO8rGow_wQVS1PWp--UTCAKqNUhtifECO5ZR9XrMd6Ddq9NxQwf1ll/pub?gid=0&single=true&output=csv"
url
# Read the data from the URL into a DataFrame
<- read.csv(url)
data
# see the data
head(data)
Year Month Taliban Civilians Afghan.forces Nato..detailed.in.spreadsheet.
1 2004 January 15 51 23
2 2004 February 7 4 5
3 2004 March 19 2 2
4 2004 April 5 3 19
5 2004 May 18 29 56 6
6 2004 June 163 32 14 2
Nato...official.figures
1 11
2 2
3 3
4 3
5 9
6 5
# Specify the URL of the CSV file
= ("https://docs.google.com/spreadsheets/d/1EAx8_ksSCmoWW_SlhFyq2QrRn0FNNhcg1TtDFJzZRgc/export?format=csv&gid=1")
url
# Read the data from the URL into a DataFrame
= pandas.read_csv(url, skiprows=[0, -1], thousands=",")
data
# see the data
data.head()
Year Month ... Nato (detailed in spreadsheet) Nato - official figures
0 2004.0 January ... NaN 11.0
1 2004.0 February ... 5.0 2.0
2 2004.0 March ... 2.0 3.0
3 2004.0 April ... NaN 3.0
4 2004.0 May ... 6.0 9.0
[5 rows x 7 columns]
This allows us to read the data straight into a DataFrame
, as we have done in the previous session.
Now we are good to go!
Tasks
Now, the challenge is to put to work what we have learnt in this block. For that, the suggestion is that you carry out an analysis of the Afghan Logs in a similar way as how we looked at population composition in Liverpool. These are of course very different datasets reflecting immensely different realities. Their structure, however, is relatively parallel: both capture counts aggregated by a spatial (neighbourhood) or temporal unit (month), and each count is split by a few categories.
Try to answer the following questions:
- Obtain the minimum number of civilian casualties (in what month was that?)
- How many NATO casualties were registered in August 2008?
- What is the month with the most total number of casualties?
Tip: You will need to first create a column with total counts