Reproducible Workflows

Dr Mine Dogucu

Goals

  • Version control
  • Naming files
  • README.md
  • .gitignore
  • Importing data

Version control

  • A reproducible workflow tracks changes with meaningful commit messages.
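  • Note that certain file types do not work well with version control. Git can store binary formats such as Word or Excel files, but it cannot show meaningful line-by-line changes for them. Instead of Word or Excel, you can use Markdown and CSV files respectively.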

Naming files

Three principles of naming files

  • machine readable
  • human readable
  • plays well with default ordering (e.g. alphabetical and numerical ordering)

(Jenny Bryan)

The workshop file and folder names follow

  • the tidyverse style (all lower-case letters, words separated by hyphens)
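
The naming principles above can be checked mechanically. The following is a minimal sketch in base R; the helper is_workshop_style() and the example file names are hypothetical and not part of the workshop materials.

```r
# Hypothetical helper: TRUE when a name is lower case, uses hyphens between
# words, and ends with a file extension (e.g. "01-import-data.R").
is_workshop_style <- function(file_name) {
  grepl("^[a-z0-9]+(-[a-z0-9]+)*\\.[A-Za-z0-9]+$", file_name)
}

# Made-up file names for illustration
file_names <- c(
  "01-import-data.R",
  "02-clean-data.R",
  "10-fit-model.R",
  "Final Report v2.docx"
)

style_ok <- is_workshop_style(file_names)
print(style_ok)

# Zero-padded numeric prefixes keep the default ordering meaningful:
# without padding, "10-..." would sort before "2-...".
print(sort(file_names[style_ok]))
```

The last name fails the check because it contains upper-case letters and spaces, which are harder for both machines and default sorting to handle.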

README.md

  • The README file is the first file users read. In our case a user might be our future self, a teammate, or (if open source) anyone.

  • There can be multiple README files within a single project: e.g. one for the general project folder and another for a data subfolder. A data folder README can contain a codebook (data dictionary).

  • It should be brief but detailed enough to help the user navigate the project.

  • A README should be kept up to date.

  • On GitHub we use Markdown for the README file (README.md). Good news: emojis are supported.

README examples

.gitignore

A .gitignore file contains the list of files which Git has been explicitly told to ignore.

For instance, README.html can be git ignored.

You may consider git ignoring confidential files (e.g. some datasets) so that they are not pushed to GitHub by mistake.

A file can be git ignored either by point-and-click using RStudio’s Git pane or by adding the file path to the .gitignore file. For instance, a weather.csv data file in a data folder needs to be added as data/weather.csv.

Files matching a certain pattern (e.g. all .log files, using the pattern *.log) can also be ignored. See git ignore patterns.
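
To make the rules above concrete, here is a minimal sketch in base R that writes a .gitignore file containing the three kinds of entries mentioned in the text. The folder name gitignore-demo is hypothetical, and a temporary folder is used only so that the example does not touch a real project.

```r
# Work in a temporary folder so the sketch does not modify a real project
project_dir <- file.path(tempdir(), "gitignore-demo")
dir.create(project_dir, showWarnings = FALSE)
gitignore_path <- file.path(project_dir, ".gitignore")

# One rule per line, as Git expects
ignore_rules <- c(
  "README.html",       # a single file in the project root
  "data/weather.csv",  # a single file inside the data folder
  "*.log"              # pattern: every file ending in .log
)

writeLines(ignore_rules, gitignore_path)
print(readLines(gitignore_path))
```

In a real project the file would sit in the project root, and the paths in it are interpreted relative to the location of the .gitignore file.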

Importing .csv Data
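
As an illustration of a .csv import, here is a minimal sketch using base R's read.csv(). The file contents and column names are made up, and the file is written to a temporary location only to keep the example self-contained. If the readr package is used instead, readr::read_csv() is called in much the same way.

```r
# Create a small, made-up CSV file in a temporary location
csv_path <- tempfile(fileext = ".csv")
writeLines(c("city,temp_c", "Irvine,21.5", "Ankara,12.0"), csv_path)

# Import the file as a data frame; the first line is used as column names
weather <- read.csv(csv_path)

# Inspect column names and types
str(weather)
```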

Importing Excel Data
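
One common way to import Excel files is the readxl package. The sketch below assumes readxl is installed and, to stay self-contained, reads an example workbook that ships with the package rather than a workshop dataset.

```r
# Assumes the readxl package is installed: install.packages("readxl")
library(readxl)

# Path to an example workbook bundled with readxl
xlsx_path <- readxl_example("datasets.xlsx")

# List the sheets, then import the first one as a tibble
print(excel_sheets(xlsx_path))
first_sheet <- read_excel(xlsx_path, sheet = 1)
print(head(first_sheet))
```

For your own data, replace xlsx_path with the path to your workbook and pick the sheet by position or by name.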

Importing SAS, SPSS, Stata Data
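
Files from SAS, SPSS, and Stata can be imported with the haven package. The sketch below assumes haven is installed and reads the small example files bundled with the package, since the workshop datasets are not shown here.

```r
# Assumes the haven package is installed: install.packages("haven")
library(haven)

# Example files that ship with haven (one per format)
sas_path   <- system.file("examples", "iris.sas7bdat", package = "haven")
spss_path  <- system.file("examples", "iris.sav", package = "haven")
stata_path <- system.file("examples", "iris.dta", package = "haven")

iris_sas   <- read_sas(sas_path)    # SAS   (.sas7bdat)
iris_spss  <- read_sav(spss_path)   # SPSS  (.sav)
iris_stata <- read_dta(stata_path)  # Stata (.dta)

print(head(iris_sas))
```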

Where is the dataset file?

A very important tweet to discuss.

How you import data depends on where the dataset file is on your computer. To avoid hard-coded absolute paths, we use the here::here() function. This function does not change the working directory; instead, it builds file paths relative to the project root (e.g. the folder where the .Rproj file is).
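
A minimal sketch of this idea, assuming the here package is installed and reusing the data/weather.csv example from the .gitignore section. The file may not exist in your project, so the import step is guarded.

```r
# Assumes the here package is installed: install.packages("here")
library(here)

# Build a path relative to the project root instead of hard-coding
# an absolute path that only works on one computer.
weather_path <- here("data", "weather.csv")
print(weather_path)

# Read the file only if it exists, so the sketch runs in any project
if (file.exists(weather_path)) {
  weather <- read.csv(weather_path)
}
```

Because the path is built from the project root, the same code works for a teammate who keeps the project in a different folder.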

It is also good practice to save session information. Because package versions change, reproducing the results of an analysis requires knowing the technical conditions under which the analysis was conducted.

sessionInfo()
R version 4.2.1 (2022-06-23)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Big Sur ... 10.16

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
 [1] compiler_4.2.1  magrittr_2.0.3  fastmap_1.1.0   cli_3.4.1      
 [5] tools_4.2.1     htmltools_0.5.3 rstudioapi_0.14 yaml_2.3.6     
 [9] stringi_1.7.8   rmarkdown_2.18  knitr_1.40      stringr_1.4.1  
[13] xfun_0.34       digest_0.6.30   jsonlite_1.8.3  rlang_1.0.6    
[17] evaluate_0.18  
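
One way to keep session information alongside an analysis is to write it to a text file. The following is a minimal sketch in base R; the file name session-info.txt is an arbitrary choice, and the temporary folder is used only so that the example does not write into a real project.

```r
# Choose where to store the session information (hypothetical file name)
session_file <- file.path(tempdir(), "session-info.txt")

# capture.output() turns the printed output of sessionInfo() into text lines
session_lines <- capture.output(sessionInfo())
writeLines(session_lines, session_file)

# Check what was saved
print(head(readLines(session_file)))
```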

A better way to keep track of package versions for a project is to use renv::snapshot(). This function records the R version and the versions of the packages the project uses in a renv.lock file, creating the file if it does not already exist.

An even better approach to reproducibility would be to use Docker.