R and data science for public policies

# R and data science for public policies
## Advanced session
### Joël Gombin, Datactivist
### 17 & 18 February 2020

---

<div class='my-footer'><span>Sciences Po coding week</span> <center><div class=logo><img src='https://github.com/datactivist/slides_datactivist/raw/master/inst/rmarkdown/templates/xaringan/resources/img/fond_noir_monochrome.png' width='100px'></center></span></div>

---

This slidedeck online : https://joelgombin.github.io/scpocodingweek/slides/

Sources : https://github.com/joelgombin/scpocodingweek

Datactivist's (and my) work is placed under a [Creative Commons 4.0 BY-SA](https://creativecommons.org/licenses/by-sa/4.0/legalcode.fr) licence.

![](https://mirrors.creativecommons.org/presskit/buttons/88x31/png/by-sa.png)

---
## Who are we?

[![](https://github.com/datactivist/slides_datactivist/raw/master/inst/rmarkdown/templates/xaringan/resources/img/logo.png)](https://datactivist.coop)

### We .red[open data], we make them .red[useful]

---
## Who are we?

![](https://www.dropbox.com/s/kicm8kmqal303yc/equipe.jpg?dl=1)

---
## Who are we?

- Datactivist is an open data pure player, created in 2016 by Samuel Goëta and Joël Gombin.

- Operating at all steps of data opening and reuse, Datactivist works with both data producers and data re-users.

- Our mission : to reduce information asymmetries, and allow everyone to seize data, without naivety, but without pessimism.

- a research-driven approach: see **https://datactivist.coop/these**.

- We created the [#TeamOpenData community](https://teamopendata.org).

---
class: inverse, center, middle

# The agenda

---
## The agenda

- using `git` and github to deal with code like a pro

- tidyverse: a refresher

- datavisualisation with `ggplot2`

- mapping data with `banR`, `sf`, `tmap`, `leaflet`...

- using APIs

- wrap-up, tips & tricks

---
## Where can I find help?

The R community is very welcoming and helpful. Don't be shy, and ask for help - all the pros do!

Good starting points are:

- Google (duh)
- twitter #Rstats or #RstatsFR
- The [Rstudio Community](https://community.rstudio.com/) forum
- [StackOverflow](https://stackoverflow.com/questions/tagged/r)
- Some Slack channels, for example https://r-grrr.slack.com/ (in French - let me know if you want an invite)
- the github repo of the package you're having troubles with

When you ask for help, it is a **great** idea to prepare a [minimal **rep**roducible **ex**ample](https://resources.rstudio.com/webinars/help-me-help-you-creating-reproducible-examples-jenny-bryan) (reprex)

---
class: inverse, center, middle

# Let's get started

---
## Checklist

- Are R and Rstudio correctly installed? (i.e., can you succesfully launch Rstudio?)

- Is git installed on your computer? Do you have a github account?

If this doesn't work, you can use https://frama.link/codingweek

---
## Let's git started

If you want to know everything about R and git/github, go visit [this site](https://happygitwithr.com/).

Git : a *version control system*. Think a "Track changes" features for code on steroids!

Github is a web interface, storage space and social network for code managed through git. It is to git what Google Docs is to MS Word.

Git and github are useful for documentation and collaboration.

---
## Let's git started

First and foremost : **please do use projects in Rstudio**. Then, when you create your project, initiate a git repo too. Using [Rmarkdown](http://rmarkdown.rstudio.com/) is a good idea, too.

---
## Let's git started

Then, whenever you have done some significant work, save your script and commit the changes.

---
## Let's git started

Finally, sync your changes with github for backup and potential collaboration.

---
class: inverse, center, middle

# Let's get tidy

---
## The datascience workflow

---
## Installing the tidyverse

- The `{tidyverse}` helps making the workflow working

* In order to install the tidyverse, type: `install.packages("tidyverse")` (or through Rstudio)

* `library(tidyverse)` => loads the package

---
## Importing data

- button "Import dataset"

- through the CLI: the `{readr}` package for CSV/text files, `{haven}` for SPSS / SAS / Stata, `{readxl}` for XLS(X), `{rio}` for (almost) everything else.

---
## Import your first dataset

- load the package: `library(readr)`

- remember to assign the result:

`monjeudedonnees <- read_csv("monfichier.csv")`

- let's import some data!

---
## Let's play with some data

Download the dataset from https://www.data.gouv.fr/fr/datasets/donnees-du-repertoire-national-des-elus/#resource-community-4e196bc2-6b0a-40e4-b251-4e9e505d55df

```r
suppressPackageStartupMessages(library(tidyverse))
rne <- read_csv("./data/Repertoire-national-des-elus.zip")
```

```
## Parsed with column specification:
## cols(
##   .default = col_logical(),
##   `Nom de l'élu` = col_character(),
##   `Prénom de l'élu` = col_character(),
##   `Code sexe` = col_character(),
##   `Date de naissance` = col_date(format = ""),
##   `Code profession` = col_character(),
##   `Libellé de la profession` = col_character(),
##   Age = col_double(),
##   `Nombre de mandats` = col_double(),
##   Fonctions = col_character(),
##   `Nombre de fonctions` = col_double(),
##   Identifiant = col_character()
## )
```

```
## See spec(...) for full column specifications.
```

```
## Warning: 6 parsing failures.
##    row         col           expected actual                                      file
## 118187 Identifiant delimiter or quote      Q './data/Repertoire-national-des-elus.zip'
## 118187 Identifiant delimiter or quote        './data/Repertoire-national-des-elus.zip'
## 234535 Identifiant delimiter or quote      G './data/Repertoire-national-des-elus.zip'
## 234535 Identifiant delimiter or quote        './data/Repertoire-national-des-elus.zip'
## 358097 Identifiant delimiter or quote      B './data/Repertoire-national-des-elus.zip'
## ...... ........... .................. ...... .........................................
## See problems(...) for more details.
```

## Let's play with some data

If needed, adjust the import process with arguments `col_types`, `locale` or `na`.

---
## Preparing tidy data

- `dplyr` & `tidyr` mainly

- the pipe: `%>%` (ctrl/cmd + shift + M)

- the verbs: 
  + `filter` (and `slice`)
  + `arrange`
  + `select` (and `rename`)
  + `distinct`
  + `mutate` 
  + `group_by` 
  + `summarise`
  + `sample_n` / `sample_frac`

- `stringr` might be a useful companion if you have to deal with characters/strings

---
## Pivoting

[`tidyr::pivot_longer` & `tidyr::pivot_wider`](https://tidyr.tidyverse.org/articles/pivot.html)

---
## Recoding data

- [`recode`](https://dplyr.tidyverse.org/reference/recode.html)

- [`case_when`](https://dplyr.tidyverse.org/reference/case_when.html)

---
## Merging datasets

- primary keys and foreign keys

- `left_join`
- `right_join`
- `full_join`
- `inner_join`
- `semi_join`
- `anti_join`

---
## Merging datasets

.center[![source: https://r4ds.had.co.nz/relational-data.html#mutating-joins](https://d33wubrfki0l68.cloudfront.net/108c0749d084c03103f8e1e8276c20e06357b124/5f113/diagrams/join-setup.png)]

.center[![source: https://r4ds.had.co.nz/relational-data.html#mutating-joins](https://d33wubrfki0l68.cloudfront.net/820b012580731f2134f90ee9c6388994c2343683/27703/diagrams/join-setup2.png)]

---
## Inner join

.center[![](https://d33wubrfki0l68.cloudfront.net/3abea0b730526c3f053a3838953c35a0ccbe8980/7f29b/diagrams/join-inner.png)]

---
## Outer joins

.reduite[.center[![](https://d33wubrfki0l68.cloudfront.net/9c12ca9e12ed26a7c5d2aa08e36d2ac4fb593f1e/79980/diagrams/join-outer.png)]]

---
## Semi join

.center[![](https://d33wubrfki0l68.cloudfront.net/028065a7f353a932d70d2dfc82bc5c5966f768ad/85a30/diagrams/join-semi.png)]

---
## Anti join

.center[![](https://d33wubrfki0l68.cloudfront.net/f29a85efd53a079cc84c14ba4ba6894e238c3759/c1408/diagrams/join-anti.png)]

---
## A different matter: binding

- `bind_rows`
- `bind_cols`

---
## dplyr tips & tricks

```r
rne %>% 
  group_by(`Libellé de la profession`) %>% 
  summarise(n = n()) %>% 
  arrange(desc(n))
```

```
## # A tibble: 66 x 2
##    `Libellé de la profession`                n
##    <chr>                                 <int>
##  1 Employés (secteur privé)              66163
##  2 Agriculteurs propriétaires exploit.   46481
##  3 Retraités salariés privés             42994
##  4 Autres cadres (secteur privé)         25133
##  5 Retraités fonct.publique (sf enseig.) 23050
##  6 Ouvriers (secteur privé)              19393
##  7 Fonctionnaires de catégorie C         19135
##  8 Agents technique et techniciens       17692
##  9 Sans profession déclarée              15562
## 10 Fonctionnaires de catégorie B         15162
## # … with 56 more rows
```

---
## dplyr tips & tricks

```r
rne %>% 
  count(`Libellé de la profession`, sort = TRUE) 
```

---
class: inverse, center, middle

# Visualising data with `ggplot2`

---
## The grammar of graphics

- data

- aesthetic mapping (`aes`)

- `layer`s : `geom` and `stat`

- `scale`s

- a `coord`inates system

- small multiples (`facet`)

- a `theme`

---
## The main `geom`s

- `geom_point`

- `geom_line`

- `geom_text`

- `geom_quantile`

- `geom_histogram`

- `geom_density`

- `geom_smooth`

- `geom_bar`

- and even `geom_sf` in conjonction with the `sf` package for spatial data

---
## the main `aes`thetics

- `x`

- `y`

- `colour`

- `fill`

- `size`

- line width `lwd`

- line type `lty`

---
class: inverse, center, middle

# Mapping data
## Let's indulge in some map porn

---
## An example of the spatial possibilities of R

https://joelgombin.github.io/makingof

---
## Spatial tidydata: the `sf` package

Simple features (sf) are:
- a dataframe
- one row per feature
- a list-column for the geometry (point, line, polygon...)
- geometries and attributes have the same status

---
## Import spatial data

Source: https://opendata.paris.fr/explore/dataset/velib-disponibilite-en-temps-reel/export/

```r
library(sf)
velib <- read_sf("./data/velib-disponibilite-en-temps-reel.geojson")
velib
```

```
## Simple feature collection with 1398 features and 12 fields
## geometry type:  POINT
## dimension:      XY
## bbox:           xmin: 2.165597 ymin: 48.76467 xmax: 2.538242 ymax: 48.94702
## epsg (SRID):    4326
## proj4string:    +proj=longlat +datum=WGS84 +no_defs
## # A tibble: 1,398 x 13
##    nbfreeedock station_state overflowactivat… station_code nbedock nbbikeoverflow nbbike nbebike station_name creditcard kioskstate
##          <int> <chr>         <chr>            <chr>          <int>          <int>  <int>   <int> <chr>        <chr>      <chr>     
##  1          29 Operative     no               16107             35              0      3       3 Benjamin Go… no         yes       
##  2          27 Operative     no               6015              52              0     17       7 André Mazet… yes        yes       
##  3          19 Operative     no               11104             20              0      1       0 Charonne - … no         yes       
##  4          16 Operative     no               9020              21              0      3       2 Toudouze - … yes        yes       
##  5          25 Operative     no               12109             30              0      1       4 Mairie du 1… no         yes       
##  6           5 Operative     no               5001               6              0     12       7 Harpe - Sai… yes        yes       
##  7          51 Operative     no               14014             60              0      2       6 Jourdan - S… yes        yes       
##  8           0 Operative     no               17026              0              0      4       7 Jouffroy d'… yes        yes       
##  9          20 Operative     no               17041             38              0     10       8 Guersant - … yes        yes       
## 10          21 Operative     no               10013             59              0     33       5 Alibert - J… yes        yes       
## # … with 1,388 more rows, and 2 more variables: geo <list>, geometry <POINT [°]>
```

---
## Previewing spatial data

```r
library(mapview)
mapview(velib, cex = "nbbike")
```

<div id="htmlwidget-90142964187052c772e5" style="width:100%;height:400px;" class="widgetframe html-widget"></div>
<script type="application/json" data-for="htmlwidget-90142964187052c772e5">{"x":{"url":"index_files/figure-html//widgets/widget_unnamed-chunk-8.html","options":{"xdomain":"*","allowfullscreen":false,"lazyload":false}},"evals":[],"jsHooks":[]}</script>

---
## Using spatial data

```r
library(leaflet)
leaflet() %>% addTiles() %>% setView(-93.65, 42.0285, zoom = 4) %>%
  addWMSTiles(
    "http://mesonet.agron.iastate.edu/cgi-bin/wms/nexrad/n0r.cgi",
    layers = "nexrad-n0r-900913",
    options = WMSTileOptions(format = "image/png", transparent = TRUE),
    attribution = "Weather data © 2012 IEM Nexrad"
  )
```

<div id="htmlwidget-f16cab7632969c8368d7" style="width:100%;height:250px;" class="widgetframe html-widget"></div>
<script type="application/json" data-for="htmlwidget-f16cab7632969c8368d7">{"x":{"url":"index_files/figure-html//widgets/widget_unnamed-chunk-10.html","options":{"xdomain":"*","allowfullscreen":false,"lazyload":false}},"evals":[],"jsHooks":[]}</script>

---
## Using `leaflet` for online maps

See https://rstudio.github.io/leaflet/

```r
velib %>% leaflet() %>% addTiles() %>% addCircleMarkers(radius = 2, color = "red")
```

<div id="htmlwidget-3e225c386f5b0e5b01dc" style="width:100%;height:300px;" class="widgetframe html-widget"></div>
<script type="application/json" data-for="htmlwidget-3e225c386f5b0e5b01dc">{"x":{"url":"index_files/figure-html//widgets/widget_unnamed-chunk-12.html","options":{"xdomain":"*","allowfullscreen":false,"lazyload":false}},"evals":[],"jsHooks":[]}</script>

---
## Using `tmap` for elaborate maps

```r
library(tmap)
# https://opendata.paris.fr/explore/dataset/voie/export/
streets <- read_sf("./data/voie.geojson")
# https://opendata.paris.fr/explore/dataset/arbresremarquablesparis/export/
trees <- read_sf("./data/arbresremarquablesparis.geojson")

streets %>% 
  tm_shape() +
  tm_lines(alpha = 0.2) +
tm_shape(toilets) +
  tm_dots(col = "red") +
tm_shape(trees) +
  tm_dots(col = "darkgreen") +
tm_scale_bar(position = c("left", "bottom")) +
tm_compass(position = c("left", "top"))
```

---
## Using `tmap` for elaborate maps

---
## Other nice spatial stuff

- `banR` for geocoding
- OpenStreetMap and OverpassTurbo
- ...

---
class: inverse, center, middle

# Using APIs

---
## Calling APIs with `httr`

https://httr.r-lib.org/

https://thinkr.fr/les-api-un-enfer/

---
class: inverse, center, middle

# Thank you!

Contact : [joel@datactivist.coop](mailto:joel@datactivist.coop)