Do-It-Yourself

This section is all about you taking charge of the steering wheel and choosing your own adventure. For this block, we are going to use what we’ve learnt before to take a look at a dataset of casualties in the war in Afghanistan. The data was originally released by Wikileaks, and the version we will use is published by The Guardian.

You can read a bit more about the data at The Guardian’s data blog

library(tidyverse)
import pandas

Data preparation

Before you can set off on your data journey, the dataset needs to be read, and there’s a couple of details we will get out of the way so it is then easier for you to start working.

The data are published on a Google Sheet.

As you will see, each row includes casualties recorded month by month, split by Taliban, Civilians, Afghan forces, and NATO.

Let’s read it into an R or Python session:

# Specify the URL of the CSV file
url <- "https://docs.google.com/spreadsheets/d/e/2PACX-1vRa7OIBiz7-yqmgwUEn4V5Wm1TO8rGow_wQVS1PWp--UTCAKqNUhtifECO5ZR9XrMd6Ddq9NxQwf1ll/pub?gid=0&single=true&output=csv"

# Read the data from the URL into a DataFrame
data <- read.csv(url)

# see the data
head(data)
  Year    Month Taliban Civilians Afghan.forces Nato..detailed.in.spreadsheet.
1 2004  January      15        51            23                               
2 2004 February                 7             4                              5
3 2004    March      19         2                                            2
4 2004    April       5         3            19                               
5 2004      May      18        29            56                              6
6 2004     June     163        32            14                              2
  Nato...official.figures
1                      11
2                       2
3                       3
4                       3
5                       9
6                       5
# Specify the URL of the CSV file
url = ("https://docs.google.com/spreadsheets/d/1EAx8_ksSCmoWW_SlhFyq2QrRn0FNNhcg1TtDFJzZRgc/export?format=csv&gid=1")

# Read the data from the URL into a DataFrame
data = pandas.read_csv(url, skiprows=[0, -1], thousands=",")

# see the data
data.head()
     Year     Month  ...  Nato (detailed in spreadsheet)  Nato - official figures
0  2004.0   January  ...                             NaN                     11.0
1  2004.0  February  ...                             5.0                      2.0
2  2004.0     March  ...                             2.0                      3.0
3  2004.0     April  ...                             NaN                      3.0
4  2004.0       May  ...                             6.0                      9.0

[5 rows x 7 columns]

This allows us to read the data straight into a DataFrame, as we have done in the previous session.

Now we are good to go!

Tasks

Now, the challenge is to put to work what we have learnt in this block. For that, the suggestion is that you carry out an analysis of the Afghan Logs in a similar way as how we looked at population composition in Liverpool. These are of course very different datasets reflecting immensely different realities. Their structure, however, is relatively parallel: both capture counts aggregated by a spatial (neighbourhood) or temporal unit (month), and each count is split by a few categories.

Try to answer the following questions:

  1. Obtain the minimum number of civilian casualties (in what month was that?)
  2. How many NATO casualties were registered in August 2008?
  3. What is the month with the most total number of casualties?

Tip: You will need to first create a column with total counts