Some notes on research-compendium

Last updated on Sep 1, 2019 DataScience, R

These are my notes while studying the research-compendium concept, which is essentially a bunch of guidelines to produce research that is ‘easily’ reproducible.

The notes are mostly based on [marwick-2018-packag-r], which is one canonical reading on the concept. Other references are mentioned throughout the text, and also collected separately. These notes were prepared a few weeks ago during a foray into Docker. They are neither complete nor comprehensive, but will serve as a good refresher of the principal concepts.

Landing page : contains several references explaining research-compendium.
Principles
- Stick with the prevailing conventions of your peers / scholarly community.
- Keep data, methods and outputs separate, but make sure to unambiguously express the connections between them. The result files should be treated as disposable (they can be regenerated).
- Specify the computational environment as clearly as possible. At a minimum, provide a text file specifying the version numbers of the software and other critical tools being used.
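As a rough illustration of the last principle, the minimal text file of version numbers can be generated by a script rather than typed by hand. The sketch below is in Python purely for illustration; in an R compendium, the output of sessionInfo() serves a similar purpose. The function names `environment_lines` and `write_environment`, and the "NOT INSTALLED" marker, are assumptions made for this example and are not part of any tool mentioned in these notes.

```python
import platform
from importlib import metadata


def environment_lines(packages):
    """Return one 'name==version' line per package, led by the interpreter version."""
    lines = [f"python=={platform.python_version()}"]
    for name in sorted(packages):
        try:
            version = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Record the gap explicitly instead of silently skipping the package.
            version = "NOT INSTALLED"
        lines.append(f"{name}=={version}")
    return lines


def write_environment(path, packages):
    """Write the version listing to a plain text file that ships with the project."""
    with open(path, "w", encoding="utf-8") as fh:
        fh.write("\n".join(environment_lines(packages)) + "\n")
```

Calling `write_environment("environment.txt", ["numpy", "pandas"])` (file and package names chosen only as an example) would leave a small text file that can be committed alongside the analysis.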
R’s package structure is well suited to organising and sharing a compendium, for any project.
Dynamic documents : essentially like org files or Rmarkdown files, i.e. literate programming. Sweave was originally introduced around 2002. However, by around 2015, knitr and rmarkdown had made substantial progress, and they are now generally preferred over Sweave.
Shipping data with the packages
- CRAN : data should generally be less than 5MB. A large percentage of the packages have some form of data. Data should be included if a methods package is being shipped with the analysis.
- Use the piggyback package for attaching large data files to GitHub repos.
  - It is convenient to be able to upload a new dataset to be associated with the package, and this can then be accessed with pb_download().
- “Medium” sized data files can be attached using arkdb.
Adding a Dockerfile to the compendium
- containerit : o2r/containerit
- repo to docker : jupyter/repo2docker
- Binder : https://mybinder.org
- Use the holepunch package to make the setup easier.
Summarising the folder structure of an R-package-esque compendium
- Readme file : self-explanatory and should be as detailed as possible, and preferably include a diagram showing how the various components connect.
- R/ : Script files with reusable functions go here. If roxygen is used to generate the documentation, then the man/ directory is automatically populated with it.
- analysis/ : analysis scripts and reports. Consider using ascending numbers in the file names to aid clarity and order, e.g. 001-load.R, 002 -… and so on.
- The above does not capture the dependencies. Therefore an .Rmd or Makefile (or Makefile.R) can be included to capture the full tree of dependencies. These files control the order of execution.
- DESCRIPTION file in the project root provides formally structured, machine- and human-readable information on authors, project license, software dependencies and other metadata.
  - When this file is included, the project becomes an installable R package.
- NAMESPACE: autogenerated file that exports R functions for repeated use.
- LICENSE : specifies the conditions for use / reuse.
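The layout listed above can be sketched as a small script that creates an empty skeleton. This is only an illustration in Python, not how rrtools or devtools set up a project; the names `create_skeleton` and `analysis_order`, and the choice of `README.md` as the readme file name, are assumptions. The second function shows why zero-padded ascending prefixes help: a plain alphabetical sort of the file names already gives the intended execution order.

```python
from pathlib import Path

# Directories and files taken from the list above; contents are left empty on purpose.
COMPENDIUM_DIRS = ["R", "man", "analysis"]
COMPENDIUM_FILES = ["README.md", "DESCRIPTION", "NAMESPACE", "LICENSE", "Makefile"]


def create_skeleton(root):
    """Create the empty compendium layout under `root` and return its top-level entries."""
    root = Path(root)
    for directory in COMPENDIUM_DIRS:
        (root / directory).mkdir(parents=True, exist_ok=True)
    for filename in COMPENDIUM_FILES:
        (root / filename).touch(exist_ok=True)
    return sorted(p.relative_to(root).as_posix() for p in root.iterdir())


def analysis_order(root):
    """List analysis scripts in the order implied by their zero-padded numeric prefix."""
    return sorted(p.name for p in (Path(root) / "analysis").glob("*.R"))
```

Note that the sort only works as intended while the prefixes are zero-padded to the same width (001, 002, 010); a file named `10-plot.R` would sort before `2-clean.R`.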
Drone : CI service that operates on Docker containers. This can be used as a check.
Makefiles
- uses the make language.
- specifies the relationship between data, the output and the code generating the output.
- Defines outputs (targets) in terms of inputs (dependencies) and the code necessary to produce them (recipes).
- Allows rebuilding only the parts that are out of date.
- The remake package enables writing Make-like instructions in R.
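To make the target / dependency / recipe idea concrete, here is a simplified model of the rebuild decision, written in Python. It is a sketch of the general principle (compare file modification times, rebuild only what is stale), not the actual implementation of make or the API of the remake package. The names `is_out_of_date`, `build` and the `rules` dictionary are hypothetical.

```python
import os


def is_out_of_date(target, dependencies):
    """A target needs rebuilding if it is missing or older than any dependency."""
    if not os.path.exists(target):
        return True
    target_mtime = os.path.getmtime(target)
    return any(os.path.getmtime(dep) > target_mtime for dep in dependencies)


def build(target, rules, log=None):
    """Bring `target` up to date, handling its dependencies first.

    `rules` maps a target path to (dependencies, recipe), where `recipe` is a
    callable that writes the target. Files without a rule are treated as
    source inputs that must already exist. Returns the list of rebuilt targets.
    """
    log = [] if log is None else log
    if target not in rules:
        if not os.path.exists(target):
            raise FileNotFoundError(f"no rule to make {target!r}")
        return log
    dependencies, recipe = rules[target]
    for dep in dependencies:
        build(dep, rules, log)
    if is_out_of_date(target, dependencies):
        recipe()
        log.append(target)
    return log
```

With rules such as "clean.csv depends on raw.csv" and "report.txt depends on clean.csv" (example file names), a second call to `build` does nothing, while touching raw.csv causes both downstream files to be regenerated.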
Principles to consider before sharing a research compendium
- Licensing, Version control, persistence, metadata : main aspects to consider.
- Archive a specific commit at a repository that issues persistent URLs, e.g. a DOI, which are designed to be more persistent than other URLs. Refer to re3data.org for discipline-specific DOI-issuing repositories. Using a DOI simplifies citations by allowing the transfer of basic metadata to a central registry (e.g. CrossRef and DataCite). Doing this ensures that a publicly available snapshot of the code exists that matches the published results.
- CRAN is generally not recommended for research-compendium packages, because it is strict about directory structures and contents of the R packages. It also has a 5MB limit for package data and documentation.
Tools and templates
- devtools
- rrtools : extends devtools

Reference list

https://ropensci.org/commcalls/2019-07-30/?eType=EmailBlastContent&eId=2d18a2f6-57ef-4d15-8c52-84be5c49e039 | rOpenSci | Reproducible Research with R
https://github.com/annakrystalli/rrtools-repro-research | annakrystalli/rrtools-repro-research: Tutorial on Reproducible Research in R with rrtools
https://karthik.github.io/holepunch/ | Configure Your R Project for binderhub • hole punch
https://github.com/karthik/holepunch | karthik/holepunch: Make your R project Binder ready
https://peerj.com/preprints/3192/ | Packaging data analytical work reproducibly using R (and friends) [PeerJ Preprints]
https://github.com/alan-turing-institute/the-turing-way/tree/master/workshops/build-a-binderhub | the-turing-way/workshops/build-a-binderhub at master · alan-turing-institute/the-turing-way
https://github.com/alan-turing-institute/the-turing-way/tree/master/workshops | the-turing-way/workshops at master · alan-turing-institute/the-turing-way
https://research-compendium.science/ | Research Compendium
http://inundata.org/talks/rstd19/#/0/33 | reproducible-data-analysis
https://github.com/benmarwick/rrtools | benmarwick/rrtools: rrtools: Tools for Writing Reproducible Research in R
https://github.com/shrysr/correlationfunnel | shrysr/correlationfunnel: Speed Up Exploratory Data Analysis (EDA)
https://github.com/cboettig/nonparametric-bayes | cboettig/nonparametric-bayes: Non-parametric Bayesian Inference for Conservation Decisions
https://lincolnmullen.com/blog/makefiles-for-writing-data-analysis-ocr-and-converting-shapefiles/#fnref2 | Makefiles for Writing, Data Analysis, OCR, and Converting Shapefiles | Lincoln Mullen
https://github.com/lmullen/civil-procedure-codes/blob/master/Makefile | civil-procedure-codes/Makefile at master · lmullen/civil-procedure-codes

Bibliography

[marwick-2018-packag-r] Ben Marwick, Carl Boettiger and Lincoln Mullen (2018). Packaging data analytical work reproducibly using R (and friends). PeerJ Preprints. doi: 10.7287/peerj.preprints.3192v2. https://doi.org/10.7287/peerj.preprints.3192v2
