class: center, middle, inverse, title-slide # R and data science for public policies ## Advanced session ### Joël Gombin, Datactivist ### 17 & 18 February 2020 --- layout: true <div class='my-footer'><span>Sciences Po coding week</span> <center><div class=logo><img src='https://github.com/datactivist/slides_datactivist/raw/master/inst/rmarkdown/templates/xaringan/resources/img/fond_noir_monochrome.png' width='100px'></center></span></div> --- class: center, middle This slidedeck online : https://joelgombin.github.io/scpocodingweek/slides/ Sources : https://github.com/joelgombin/scpocodingweek Datactivist's (and my) work is placed under a [Creative Commons 4.0 BY-SA](https://creativecommons.org/licenses/by-sa/4.0/legalcode.fr) licence. <BR> <BR>  --- ## Who are we? <BR><BR><BR><BR> [](https://datactivist.coop) ### We .red[open data], we make them .red[useful] --- ## Who are we?  --- ## Who are we? - Datactivist is an open data pure player, created in 2016 by Samuel Goëta and Joël Gombin. - Operating at all steps of data opening and reuse, Datactivist works with both data producers and data re-users. - Our mission : to reduce information asymmetries, and allow everyone to seize data, without naivety, but without pessimism. - a research-driven approach: see **https://datactivist.coop/these**. - We created the [#TeamOpenData community](https://teamopendata.org). --- class: inverse, center, middle # The agenda --- ## The agenda - using `git` and github to deal with code like a pro - tidyverse: a refresher - datavisualisation with `ggplot2` - mapping data with `banR`, `sf`, `tmap`, `leaflet`... - using APIs - wrap-up, tips & tricks --- ## Where can I find help? The R community is very welcoming and helpful. Don't be shy, and ask for help - all the pros do! Good starting points are: - Google (duh) - twitter #Rstats or #RstatsFR - The [Rstudio Community](https://community.rstudio.com/) forum - [StackOverflow](https://stackoverflow.com/questions/tagged/r) - Some Slack channels, for example https://r-grrr.slack.com/ (in French - let me know if you want an invite) - the github repo of the package you're having troubles with When you ask for help, it is a **great** idea to prepare a [minimal **rep**roducible **ex**ample](https://resources.rstudio.com/webinars/help-me-help-you-creating-reproducible-examples-jenny-bryan) (reprex) --- class: inverse, center, middle # Let's get started --- ## Checklist - Are R and Rstudio correctly installed? (i.e., can you succesfully launch Rstudio?) - Is git installed on your computer? Do you have a github account? If this doesn't work, you can use https://frama.link/codingweek --- ## Let's git started If you want to know everything about R and git/github, go visit [this site](https://happygitwithr.com/). Git : a *version control system*. Think a "Track changes" features for code on steroids! Github is a web interface, storage space and social network for code managed through git. It is to git what Google Docs is to MS Word. Git and github are useful for documentation and collaboration. --- ## Let's git started First and foremost : **please do use projects in Rstudio**. Then, when you create your project, initiate a git repo too. Using [Rmarkdown](http://rmarkdown.rstudio.com/) is a good idea, too. .center[.reduite[]] --- ## Let's git started Then, whenever you have done some significant work, save your script and commit the changes. .reduite[] --- ## Let's git started Finally, sync your changes with github for backup and potential collaboration. .reduite[] --- class: inverse, center, middle # Let's get tidy --- ## The datascience workflow .center[ [](https://r4ds.had.co.nz/introduction.html) ] --- ## Installing the tidyverse - The `{tidyverse}` helps making the workflow working * In order to install the tidyverse, type: `install.packages("tidyverse")` (or through Rstudio) * `library(tidyverse)` => loads the package --- ## Importing data - button "Import dataset" - through the CLI: the `{readr}` package for CSV/text files, `{haven}` for SPSS / SAS / Stata, `{readxl}` for XLS(X), `{rio}` for (almost) everything else. --- ## Import your first dataset - load the package: `library(readr)` - remember to assign the result: `monjeudedonnees <- read_csv("monfichier.csv")` - let's import some data! --- ## Let's play with some data Download the dataset from https://www.data.gouv.fr/fr/datasets/donnees-du-repertoire-national-des-elus/#resource-community-4e196bc2-6b0a-40e4-b251-4e9e505d55df ```r suppressPackageStartupMessages(library(tidyverse)) rne <- read_csv("./data/Repertoire-national-des-elus.zip") ``` ``` ## Parsed with column specification: ## cols( ## .default = col_logical(), ## `Nom de l'élu` = col_character(), ## `Prénom de l'élu` = col_character(), ## `Code sexe` = col_character(), ## `Date de naissance` = col_date(format = ""), ## `Code profession` = col_character(), ## `Libellé de la profession` = col_character(), ## Age = col_double(), ## `Nombre de mandats` = col_double(), ## Fonctions = col_character(), ## `Nombre de fonctions` = col_double(), ## Identifiant = col_character() ## ) ``` ``` ## See spec(...) for full column specifications. ``` ``` ## Warning: 6 parsing failures. ## row col expected actual file ## 118187 Identifiant delimiter or quote Q './data/Repertoire-national-des-elus.zip' ## 118187 Identifiant delimiter or quote './data/Repertoire-national-des-elus.zip' ## 234535 Identifiant delimiter or quote G './data/Repertoire-national-des-elus.zip' ## 234535 Identifiant delimiter or quote './data/Repertoire-national-des-elus.zip' ## 358097 Identifiant delimiter or quote B './data/Repertoire-national-des-elus.zip' ## ...... ........... .................. ...... ......................................... ## See problems(...) for more details. ``` ## Let's play with some data If needed, adjust the import process with arguments `col_types`, `locale` or `na`. --- ## Preparing tidy data - `dplyr` & `tidyr` mainly - the pipe: `%>%` (ctrl/cmd + shift + M) - the verbs: + `filter` (and `slice`) + `arrange` + `select` (and `rename`) + `distinct` + `mutate` + `group_by` + `summarise` + `sample_n` / `sample_frac` - `stringr` might be a useful companion if you have to deal with characters/strings --- ## Pivoting [`tidyr::pivot_longer` & `tidyr::pivot_wider`](https://tidyr.tidyverse.org/articles/pivot.html) <img src="http://r4ds.had.co.nz/images/tidy-9.png" width="90%" style="display: block; margin: auto;" /> --- ## Recoding data - [`recode`](https://dplyr.tidyverse.org/reference/recode.html) - [`case_when`](https://dplyr.tidyverse.org/reference/case_when.html) --- ## Merging datasets - primary keys and foreign keys - `left_join` - `right_join` - `full_join` - `inner_join` - `semi_join` - `anti_join` --- ## Merging datasets .center[] .center[] --- ## Inner join .center[] --- ## Outer joins .reduite[.center[]] --- ## Semi join .center[] --- ## Anti join .center[] --- ## A different matter: binding - `bind_rows` - `bind_cols` --- ## dplyr tips & tricks ```r rne %>% group_by(`Libellé de la profession`) %>% summarise(n = n()) %>% arrange(desc(n)) ``` ``` ## # A tibble: 66 x 2 ## `Libellé de la profession` n ## <chr> <int> ## 1 Employés (secteur privé) 66163 ## 2 Agriculteurs propriétaires exploit. 46481 ## 3 Retraités salariés privés 42994 ## 4 Autres cadres (secteur privé) 25133 ## 5 Retraités fonct.publique (sf enseig.) 23050 ## 6 Ouvriers (secteur privé) 19393 ## 7 Fonctionnaires de catégorie C 19135 ## 8 Agents technique et techniciens 17692 ## 9 Sans profession déclarée 15562 ## 10 Fonctionnaires de catégorie B 15162 ## # … with 56 more rows ``` --- ## dplyr tips & tricks ```r rne %>% count(`Libellé de la profession`, sort = TRUE) ``` ``` ## # A tibble: 66 x 2 ## `Libellé de la profession` n ## <chr> <int> ## 1 Employés (secteur privé) 66163 ## 2 Agriculteurs propriétaires exploit. 46481 ## 3 Retraités salariés privés 42994 ## 4 Autres cadres (secteur privé) 25133 ## 5 Retraités fonct.publique (sf enseig.) 23050 ## 6 Ouvriers (secteur privé) 19393 ## 7 Fonctionnaires de catégorie C 19135 ## 8 Agents technique et techniciens 17692 ## 9 Sans profession déclarée 15562 ## 10 Fonctionnaires de catégorie B 15162 ## # … with 56 more rows ``` --- class: inverse, center, middle # Visualising data with `ggplot2` --- ## The grammar of graphics - data - aesthetic mapping (`aes`) - `layer`s : `geom` and `stat` - `scale`s - a `coord`inates system - small multiples (`facet`) - a `theme` --- ## The main `geom`s - `geom_point` - `geom_line` - `geom_text` - `geom_quantile` - `geom_histogram` - `geom_density` - `geom_smooth` - `geom_bar` - and even `geom_sf` in conjonction with the `sf` package for spatial data --- ## the main `aes`thetics - `x` - `y` - `colour` - `fill` - `size` - line width `lwd` - line type `lty` --- class: inverse, center, middle # Mapping data ## Let's indulge in some map porn --- ## An example of the spatial possibilities of R https://joelgombin.github.io/makingof --- ## Spatial tidydata: the `sf` package Simple features (sf) are: - a dataframe - one row per feature - a list-column for the geometry (point, line, polygon...) - geometries and attributes have the same status --- ## Import spatial data Source: https://opendata.paris.fr/explore/dataset/velib-disponibilite-en-temps-reel/export/ ```r library(sf) velib <- read_sf("./data/velib-disponibilite-en-temps-reel.geojson") velib ``` ``` ## Simple feature collection with 1398 features and 12 fields ## geometry type: POINT ## dimension: XY ## bbox: xmin: 2.165597 ymin: 48.76467 xmax: 2.538242 ymax: 48.94702 ## epsg (SRID): 4326 ## proj4string: +proj=longlat +datum=WGS84 +no_defs ## # A tibble: 1,398 x 13 ## nbfreeedock station_state overflowactivat… station_code nbedock nbbikeoverflow nbbike nbebike station_name creditcard kioskstate ## <int> <chr> <chr> <chr> <int> <int> <int> <int> <chr> <chr> <chr> ## 1 29 Operative no 16107 35 0 3 3 Benjamin Go… no yes ## 2 27 Operative no 6015 52 0 17 7 André Mazet… yes yes ## 3 19 Operative no 11104 20 0 1 0 Charonne - … no yes ## 4 16 Operative no 9020 21 0 3 2 Toudouze - … yes yes ## 5 25 Operative no 12109 30 0 1 4 Mairie du 1… no yes ## 6 5 Operative no 5001 6 0 12 7 Harpe - Sai… yes yes ## 7 51 Operative no 14014 60 0 2 6 Jourdan - S… yes yes ## 8 0 Operative no 17026 0 0 4 7 Jouffroy d'… yes yes ## 9 20 Operative no 17041 38 0 10 8 Guersant - … yes yes ## 10 21 Operative no 10013 59 0 33 5 Alibert - J… yes yes ## # … with 1,388 more rows, and 2 more variables: geo <list>, geometry <POINT [°]> ``` --- ## Previewing spatial data ```r library(mapview) mapview(velib, cex = "nbbike") ```
--- ## Using spatial data ```r library(leaflet) leaflet() %>% addTiles() %>% setView(-93.65, 42.0285, zoom = 4) %>% addWMSTiles( "http://mesonet.agron.iastate.edu/cgi-bin/wms/nexrad/n0r.cgi", layers = "nexrad-n0r-900913", options = WMSTileOptions(format = "image/png", transparent = TRUE), attribution = "Weather data © 2012 IEM Nexrad" ) ```
--- ## Using `leaflet` for online maps See https://rstudio.github.io/leaflet/ ```r velib %>% leaflet() %>% addTiles() %>% addCircleMarkers(radius = 2, color = "red") ```
--- ## Using `tmap` for elaborate maps ```r library(tmap) # https://opendata.paris.fr/explore/dataset/voie/export/ streets <- read_sf("./data/voie.geojson") # https://opendata.paris.fr/explore/dataset/arbresremarquablesparis/export/ trees <- read_sf("./data/arbresremarquablesparis.geojson") streets %>% tm_shape() + tm_lines(alpha = 0.2) + tm_shape(toilets) + tm_dots(col = "red") + tm_shape(trees) + tm_dots(col = "darkgreen") + tm_scale_bar(position = c("left", "bottom")) + tm_compass(position = c("left", "top")) ``` --- ## Using `tmap` for elaborate maps <img src="index_files/figure-html/unnamed-chunk-14-1.png" style="display: block; margin: auto;" /> --- ## Other nice spatial stuff - `banR` for geocoding - OpenStreetMap and OverpassTurbo - ... --- class: inverse, center, middle # Using APIs --- ## Calling APIs with `httr` https://httr.r-lib.org/ https://thinkr.fr/les-api-un-enfer/ --- class: inverse, center, middle # Thank you! Contact : [joel@datactivist.coop](mailto:joel@datactivist.coop)