Data
This section of the webpage briefly outlines the datasets that should be used during the Hackathon and provides related resources and documentation. The Hackathon’s problem statement can be found here and should be referred to regarding the event’s purpose and objectives.
The human mobility datasets provided for the Hackathon fall into one of two categories: digital data (from Meta) or key informant survey data (from IOM). The table below outlines these, summarising their accessibility, processing and stability. In addition, contextual data on Pakistan is provided to help with analysis, as well as a variety of spatial boundaries that can be used to map the provided datasets. Each dataset is outlined below, alongside any available documentation and guidance for analysis.
Data | Provider | Spatial Detail | Source | Access | Processing | Stability |
---|---|---|---|---|---|---|
Facebook Population and Movement During Crisis | Meta | 800m tiles; Global admin 2 | Meta apps (mainly Facebook) | Free to researchers | Computational | Made available after a disaster event for 90 days |
Community Needs Identification Data | IOM | Province; District; Tehsil; Village/settlement | Key informants | Published online (collection is time-consuming and costly) | Manpower | Ad hoc survey rounds and reports after the disaster event |
1. Data Access
All the data mentioned in this document have been uploaded to Snowflake. As part of the Hackathon, Snowflake colleagues will provide a tutorial on how to access and analyse the data in Snowflake. They will be available throughout the event to answer any queries, and a series of guides on using Snowflake can be found in this section of the webpage.
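As a rough illustration of how the schemas and views named on this page might be explored once connected, the sketch below builds plain SQL strings that could be pasted into a Snowflake worksheet. It assumes the session's current database is the one holding the Hackathon data (the database name is not given on this page), and the helper name `preview_sql` is purely illustrative.

```python
def preview_sql(schema, view=None, limit=10):
    """Build a SQL string for exploring a schema or view named on this page.

    - No view given: list every view in the schema.
    - View ending in "*" (e.g. "GRDI_*"): list the views sharing that prefix.
    - Otherwise: preview the first `limit` rows of the view.
    Assumes the current database is already set in the session.
    """
    if view is None:
        return f"SHOW VIEWS IN SCHEMA {schema};"
    if view.endswith("*"):
        # In a Snowflake LIKE pattern, % matches any run of characters
        # (and _ matches any single character, which still matches itself).
        prefix = view[:-1]
        return f"SHOW VIEWS LIKE '{prefix}%' IN SCHEMA {schema};"
    return f"SELECT * FROM {schema}.{view} LIMIT {limit};"


if __name__ == "__main__":
    print(preview_sql("META_DATA_FOR_GOOD", "TILE_POPULATION"))
    print(preview_sql("CONTEXTUAL_DATA", "CLIMATE_PAKISTAN_*"))
    print(preview_sql("IOM_CNI_DATA"))
```

Listing the views first is a cheap way to see which members of a wildcard family (such as `POPULATION_PAKISTAN_2020_*`) actually exist before querying them.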
2. Datasets
2.1. Facebook Population and Movement During Crisis
Meta provides these data as part of their Data for Good programme in the aftermath of humanitarian disasters. Once uploaded, these data are available to researchers and policymakers for 90 days before being removed from the Data for Good platform. The data shows the number of Facebook users located within a given spatial unit at a given time.
For the Hackathon, we are using the data available in the immediate aftermath of the 2022 Pakistan floods. The data refer to the period 14 August 2022 – 7 September 2022 and are at two geographic scales: 800m tiles (known as quadkeys, based on the Bing Maps tile system) and aggregated to global administrative (GADM) level 2 geographies.
These data consist of two datasets, Population During Crisis and Movement During Crisis:
- Population – these are population stock data, showing the number of users in each spatial unit at three snapshots: 00:00, 08:00 and 16:00 (Pacific Time). Records with fewer than 10 observations are removed.
  - Snowflake Schema: META_DATA_FOR_GOOD
  - Snowflake Views: ADMIN_L2_AGGREGATED_POPULATION and TILE_POPULATION
- Movement – these are population flow data, showing the origin and destination of users between temporal points. Users’ origin and destination are chosen according to where they spent the most time within each 8-hour window. For example, data recorded at 16:00 show the flow between areas from 08:00 to 16:00. Flows with fewer than 10 observations are removed. For Pakistan, Movement data are available only for the 08:00 and 16:00 time periods, not for 00:00.
  - Snowflake Schema: META_DATA_FOR_GOOD
  - Snowflake Views: ADMIN_L2_AGGREGATED_MOVEMENT and TILE_MOVEMENT
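Because the snapshot times are given in Pacific Time, it can help to convert them to local time in Pakistan before interpreting daily patterns. The sketch below is a minimal example under two assumptions: that Pacific Daylight Time (UTC-7) applies throughout the 14 August – 7 September 2022 period, and that Pakistan Standard Time is UTC+5. The function names are illustrative and not part of any Meta tooling.

```python
from datetime import datetime, timedelta, timezone

# Fixed offsets (assumptions stated above); no time zone database is needed.
PDT = timezone(timedelta(hours=-7), "PDT")
PKT = timezone(timedelta(hours=5), "PKT")


def snapshot_to_pkt(date_str, time_str):
    """Convert a Pacific Time snapshot (YYYY-MM-DD, HH:MM) to Pakistan time."""
    naive = datetime.strptime(f"{date_str} {time_str}", "%Y-%m-%d %H:%M")
    return naive.replace(tzinfo=PDT).astimezone(PKT)


def movement_window_pkt(date_str, time_str):
    """Return the (start, end) of the 8-hour Movement window in Pakistan time.

    Follows the description above: data recorded at 16:00 cover 08:00-16:00.
    """
    end = snapshot_to_pkt(date_str, time_str)
    return end - timedelta(hours=8), end


if __name__ == "__main__":
    for t in ("00:00", "08:00", "16:00"):
        print(t, "Pacific ->", snapshot_to_pkt("2022-08-14", t).strftime("%Y-%m-%d %H:%M"), "PKT")
```

Under these assumptions the 16:00 snapshot falls in the early hours of the following day in Pakistan, which matters when matching records to dated flood or survey information.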
The data are generally comprehensive, but there are some gaps, largely due to collection issues at Meta’s end. For example, in both the tiled and aggregated Movement data, only the 08:00 and 16:00 time stamps are available, with the 00:00 period missing. The aggregated Population data for 20, 21 and 22 August are also missing.
Both datasets contain data from a baseline period before the disaster event, against which users’ stock or flow during the crisis can be compared. The raw and percentage differences are provided within the dataset, along with a z-score to assess the statistical significance of the change from the pre-crisis baseline to the crisis period.
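To make these three comparison measures concrete, the sketch below computes a raw difference, a percentage difference and a z-score using generic textbook definitions. The exact formulas Meta uses are not given on this page and may differ, so this is only an aid to interpretation; the function name and the example numbers are made up.

```python
from statistics import mean, stdev


def compare_to_baseline(crisis_count, baseline_counts):
    """Compare one crisis-period count with a list of baseline counts.

    Generic definitions (an assumption, not Meta's documented method):
      difference     = crisis - mean(baseline)
      percent_change = 100 * difference / mean(baseline)
      z_score        = difference / sample standard deviation of baseline
    Measures that cannot be computed (division by zero) are returned as None.
    """
    base_mean = mean(baseline_counts)
    base_sd = stdev(baseline_counts)  # needs at least two baseline values
    difference = crisis_count - base_mean
    percent_change = 100 * difference / base_mean if base_mean else None
    z_score = difference / base_sd if base_sd else None
    return {
        "difference": difference,
        "percent_change": percent_change,
        "z_score": z_score,
    }


if __name__ == "__main__":
    # Illustrative numbers only: four baseline snapshots and one crisis snapshot.
    print(compare_to_baseline(60, [100, 110, 90, 100]))
```

A large negative z-score, as in this made-up example, would indicate far fewer users than the baseline variability alone would explain.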
The data are available as a series of CSV files. Each file holds either the tiled or the aggregated data for one time stamp on one day of the Population or Movement data.
A guide on using these data in R can be found here.
2.2. IOM Data
These data are collated and published by the IOM’s Displacement Tracking Matrix (DTM) team. The primary data produced for Pakistan by the IOM in the context of the 2022 floods are the Flood Response Community Needs Identification (CNI) datasets. These data are collated and published in rounds; six rounds have been run so far, though data for Round 5 is not publicly available.
The data for all five publicly available rounds can be found in Snowflake Schema: IOM_CNI_DATA
Data from the CNI is derived from key informant interviews and direct observations. It provides estimates for the number of temporarily displaced persons (TDPs), the number of returnees and other variables related to displaced populations for each village surveyed. A more detailed methodology for these data can be found in the reports below. These data are often very challenging to collect, and, as a result, each round does not cover the exact same areas, or collect the same level of data.
The data are provided at the village/settlement level. Also included are variables for province, district and Tehsil. For the first two, codes are also provided to make them easier to join to spatial boundaries. These codes are p-codes, a geographic framework commonly used by the Office for the Coordination of Humanitarian Affairs (OCHA) to link data to administrative areas. These geographies and codes do not map onto those that the Facebook Population and Movement data are aggregated to. A short guide on the different geographies is provided at the end of this guidance page.
For the Hackathon, each data round is provided. The data format and variable names vary slightly between rounds, though each generally provides similar variables and data. Alongside each round of data, a report is also published. Reports can be found here for each round:
2.3. Contextual Data
Alongside mobility data, we have prepared a series of datasets that provide additional context on Pakistan. They consist of population data and socioeconomic data, each in both raster and aggregated form.
Raster population data:
Population estimates for Pakistan for 2020 from WorldPop. The data is at 100m resolution grids in a single raster file.
Population estimates for Pakistan by age and sex for 2020 from WorldPop. The data is at 100m resolution grids in a series of rasters. Files are structured like {iso} {gender} {age group} {year} {type} {resolution}.tif - gender fields are f (female), m (male), t (total); age group fields are 00 (0-12 months), 01 (1-4 years), 05 (5-9 years) and so on until 90 (age 90 and above).
- Snowflake Schema: CONTEXTUAL_DATA
- Snowflake View: POPULATION_BY_SEX_BY_AGE_GROUP_100MTR_RESOLUTION
Aggregated population data:
- Population estimates for Pakistan for 2020 from the WorldPop raster, aggregated (total_pop) to global administrative level 2, and the province and district polygons used by OCHA. The data is calculated by the project team using WorldPop data and spatial polygons provided by GADM and OCHA.
  - Snowflake Schema: CONTEXTUAL_DATA
  - Snowflake Views: POPULATION_PAKISTAN_2020_*
Socioeconomic data:
- Deprivation data from The Global Gridded Relative Deprivation Index (GRDI). The data is a raster file at 1km resolution, cropped to Pakistan. Data are on a 0-100 scale, with high values indicating higher relative levels of deprivation. Complete documentation for the data can be accessed here. These have also been aggregated to global administrative level 2 and the province and district polygons used by OCHA by computing the mean value (mean_rdi) within the polygon from the GRDI raster data.
  - Snowflake Schema: CONTEXTUAL_DATA
  - Snowflake Views: GRDI_*
Climate data:
Satellite detected water extents in Pakistan between 01 and 29 August 2022. The data is a series of shapefiles showing flood extent based on satellite imagery. Data is from the UN Operational Satellite Applications Programme (UNOSAT), found here.
Sentinel-1 based analysis of the Pakistan floods, capturing flood progression between 10 August and 23 September 2022. The data are sourced from TU Wien and comprise two raster layers at 20-metre resolution: first_flood_detection.tif, the day the flood was first detected, recorded as day-of-year (DOY), and flood_frequency.tif, the percentage of valid observations in which a flood was detected (a proxy for flood permanence).
Dekadal (10-day) rainfall data from 2021 onward at the district level (coded to OCHA administrative level 2), sourced from the World Food Programme (WFP). The data show a series of metrics for rainfall, listed here.
Dekadal (10-day) Normalized Difference Vegetation Index (NDVI) data from 2021 onward at the district level (coded to OCHA administrative level 2), sourced from the World Food Programme (WFP). The data show a series of metrics for vegetation greenness, listed here, normally used to quantify the health and density of vegetation.
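Two date conventions appear in the climate data above: day-of-year (DOY) values in first_flood_detection.tif and dekadal periods in the WFP data. The sketch below shows one way to translate between them and calendar dates. It assumes DOY is 1-based (1 January = 1) and refers to 2022, and that dekads follow the usual convention of days 1-10, 11-20 and 21 to month end; neither detail is confirmed on this page, and the function names are illustrative.

```python
from datetime import date, timedelta


def doy_to_date(doy, year=2022):
    """Convert a 1-based day-of-year value to a calendar date."""
    result = date(year, 1, 1) + timedelta(days=int(doy) - 1)
    if doy < 1 or result.year != year:
        raise ValueError(f"day-of-year {doy} is outside {year}")
    return result


def dekad_of(d):
    """Return (year, month, dekad), where dekad is 1, 2 or 3.

    Dekad 3 absorbs any days after the 30th, so it can last 8 to 11 days.
    """
    return d.year, d.month, min((d.day - 1) // 10 + 1, 3)


if __name__ == "__main__":
    first_flood = doy_to_date(222)
    print(first_flood, dekad_of(first_flood))
```

Mapping a first-detection date to its dekad gives a simple key for lining flood timing up with the rainfall and NDVI series.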
Snowflake Data Access:
- Snowflake Schema: CONTEXTUAL_DATA
- Snowflake Views: CLIMATE_PAKISTAN_*
2.4. Spatial Boundaries
Because of recent boundary changes and the differing administrative boundaries used by different organisations, joining spatial data for Pakistan is not a simple task. Much of the data listed above can, however, be joined to a spatial boundary for mapping and analysis. This section describes the spatial boundaries made available and includes a table detailing how the Meta, IOM and contextual datasets can be joined to them.
The boundaries available are:
- Spatial polygons for global administrative levels, sourced from GADM. The shapefile boundaries for global administrative levels 0, 1, 2 and 3 and the geopackage for Pakistan are included. These boundaries can be joined to the aggregated Facebook Population and Movement data, as well as the population and deprivation data aggregated to global administrative level 2.
  - Snowflake Schema: SPATIAL_BOUNDARIES
  - Snowflake Views: PAKISTAN_GADM_ADMIN_LEVEL_POLYGONS
- Spatial polygons used by OCHA for subnational boundaries of Pakistan. The data are shapefiles for provinces (adm1), districts (adm2) and tehsils (adm3), sourced from OCHA. These boundaries can be joined to the spatial variable codes in the IOM CNI data for provinces and districts, as well as the population and deprivation data aggregated to OCHA administrative levels 1 and 2, and to the WFP climate data.
  - Snowflake Schema: SPATIAL_BOUNDARIES
  - Snowflake Views: PAKISTAN_OCHA_ADMIN_LEVEL_POLYGONS
- Spatial polygons for the 800m tiles (also known as quadkeys) found in the Facebook Population and Movement tile data. These were created from the tile datasets using the R package quadkeyr.
  - Snowflake Schema: SPATIAL_BOUNDARIES
  - Snowflake Views: META_QUADKEY_POLYGONS
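For readers curious about how a quadkey relates to a location, the sketch below decodes a quadkey into its Bing Maps tile coordinates and a longitude/latitude bounding box using the standard Web Mercator tile arithmetic. It is an independent illustration, not the method used by the quadkeyr package, and the function names are made up. The zoom level is taken from the length of the quadkey string.

```python
import math


def quadkey_to_tile(quadkey):
    """Decode a quadkey string into (tile_x, tile_y, zoom)."""
    zoom = len(quadkey)
    x = y = 0
    for i, digit in enumerate(quadkey):
        mask = 1 << (zoom - i - 1)  # bit for this level of detail
        if digit == "1":
            x |= mask
        elif digit == "2":
            y |= mask
        elif digit == "3":
            x |= mask
            y |= mask
        elif digit != "0":
            raise ValueError(f"invalid quadkey digit: {digit!r}")
    return x, y, zoom


def tile_bounds(x, y, zoom):
    """Return (west, south, east, north) in degrees for a tile."""
    n = 2 ** zoom

    def lon(tx):
        return tx / n * 360.0 - 180.0

    def lat(ty):
        return math.degrees(math.atan(math.sinh(math.pi * (1 - 2 * ty / n))))

    return lon(x), lat(y + 1), lon(x + 1), lat(y)


def quadkey_to_bounds(quadkey):
    """Bounding box of the tile identified by a quadkey."""
    return tile_bounds(*quadkey_to_tile(quadkey))


if __name__ == "__main__":
    print(quadkey_to_tile("213"))
    print(quadkey_to_bounds("213"))
```

Quadkeys are best kept as strings throughout: if they are read as numbers, any leading zeros are lost and the decoded zoom level and position will be wrong.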
The table below provides an overview of the variables in each dataset and which boundary they relate to.
Dataset | Spatial Codes | Type | Boundary File and Corresponding Code |
---|---|---|---|
Facebook Data | | | |
Facebook Population (aggregated) | polygon_id | GADM level 2 Polygon | Global administrative level 2 polygons (GID_2) |
Facebook Population (tiled) | quadkey | 800m tile | Quadkey polygons (quadkey) |
Facebook Movement (aggregated) | start_polygon_id; end_polygon_id | Origin polygon; Destination polygon | Global administrative level 2 polygons (GID_2) |
Facebook Movement (tiled) | start_quadkey; end_quadkey | Origin 800m tile; Destination 800m tile | Quadkey polygons (quadkey) |
IOM Data | | | |
IOM CNI Data | ProvinceCode/Pcode/PCODE or Admin1 pcode; District Code/Pcode/PCODE or Admin2 pcode | Province (adm1) polygon; District (adm2) polygon | OCHA subnational polygons (ADM1_PCODE) (ADM2_PCODE) |
Contextual Data | | | |
Aggregated Population Data (gadm_2) | GID_2 | GADM level 2 Polygon | Global administrative level 2 polygons (GID_2) |
Aggregated Population Data (ocha_1 and ocha_2) | ADM1_PCODE; ADM2_PCODE | Province (adm1) polygon; District (adm2) polygon | OCHA subnational polygons (ADM1_PCODE) (ADM2_PCODE) |
Aggregated Global Relative Deprivation Index Data (GRDI) (gadm_2) | GID_2 | GADM level 2 Polygon | Global administrative level 2 polygons (GID_2) |
Aggregated Global Relative Deprivation Index Data (GRDI) (ocha_1 and ocha_2) | ADM1_PCODE; ADM2_PCODE | Province (adm1) polygon; District (adm2) polygon | OCHA subnational polygons (ADM1_PCODE) (ADM2_PCODE) |
WFP - Rainfall at Subnational Level | ADM2_PCODE | District (adm2) polygon | OCHA subnational polygons (ADM2_PCODE) |
WFP - NDVI at Subnational Level | ADM2_PCODE | District (adm2) polygon | OCHA subnational polygons (ADM2_PCODE) |
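Since the IOM code columns are named differently between rounds, one possible first step before joining to the OCHA boundaries is to rename them to the boundary codes listed in the table. The sketch below does this for rows held as plain dictionaries and then performs a simple left join. The alias list is an assumption based on the variants shown in the table and should be checked against the actual column names in each round; the example rows, the code PK999 and the field names `TDPs` and `district_name` are invented for illustration.

```python
# Assumed spellings of the code columns, compared after stripping
# spaces, punctuation and case (see _normalise below).
ALIASES = {
    "ADM1_PCODE": {"provincecode", "provincepcode", "admin1pcode", "adm1pcode"},
    "ADM2_PCODE": {"districtcode", "districtpcode", "admin2pcode", "adm2pcode"},
}


def _normalise(name):
    """Lower-case a column name and keep only letters and digits."""
    return "".join(ch for ch in name.lower() if ch.isalnum())


def harmonise_columns(row):
    """Rename any recognised code column to its OCHA boundary code name."""
    renamed = {}
    for key, value in row.items():
        canonical = next(
            (c for c, names in ALIASES.items() if _normalise(key) in names), key
        )
        renamed[canonical] = value
    return renamed


def left_join(rows, boundaries, key):
    """Attach boundary attributes to each row that shares the given code.

    Rows without a matching boundary are kept unchanged.
    """
    lookup = {b[key]: b for b in boundaries}
    return [{**lookup.get(r.get(key), {}), **r} for r in rows]


if __name__ == "__main__":
    survey = [harmonise_columns({"District Pcode": "PK999", "TDPs": 120})]
    districts = [{"ADM2_PCODE": "PK999", "district_name": "Example district"}]
    print(left_join(survey, districts, "ADM2_PCODE"))
```

The same pattern applies to the other rows of the table, for example joining the aggregated Facebook data on polygon_id to GID_2, provided the key columns are renamed to match first.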