Some notes on research-compendium
These are my notes while studying the research-compendium concept, which is essentially a bunch of guidelines to produce research that is ‘easily’ reproducible.
The notes are mostly based on marwick-2018-packag-r, which is one canonical reading on the concept. Other references are mentioned throughout the text, and also collected separately. These notes were prepared a few weeks ago during a foray into Docker. They are neither complete nor comprehensive, but will serve as a good refresher of the principal concepts.
- Landing page : contains several references explaining research-compendium.
- Principles
- Stick with the prevailing conventions of your peers / scholarly community.
- Keep data, methods and outputs separate, but make sure to unambiguously express the connections between them. The result files should be treated as disposable (they can be regenerated).
- Specify computational environment as clearly as possible. Minimally, a text file specifying the version numbers of the software and other critical tools being used.
- R’s package structure is conducive to organising and sharing a compendium, for any project.
- Dynamic documents : essentially like org files or R Markdown files, i.e. literate programming. Sweave was originally introduced around 2002. However, around 2015, knitr and rmarkdown made substantial progress and are now generally preferred over Sweave.
- Shipping data with the packages
- CRAN : package data should generally be less than 5MB. A large percentage of the packages have some form of data. Data should be included if a methods package is being shipped with the analysis.
- Use the piggyback package for attaching large data files to GitHub repos.
- It is convenient to be able to upload a new dataset to be associated with the package, and this can be accessed with pb_download().
- ‘Medium’ sized data files can be attached using arkdb.
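The 5MB guideline above can be checked mechanically before deciding whether data ships inside the package or is attached separately. The following is only a rough sketch in Python (not R, and not part of any of the packages mentioned); the function name, the constant and the reading of "5MB" as 5 × 1024 × 1024 bytes are all assumptions made for illustration.

```python
from pathlib import Path

# Assumption: "5MB" is interpreted as 5 * 1024 * 1024 bytes.
DATA_LIMIT_BYTES = 5 * 1024 * 1024


def data_size_report(data_dir, limit=DATA_LIMIT_BYTES):
    """Return (total_bytes, fits) for all regular files under data_dir.

    `fits` is True when the combined size does not exceed `limit`.
    """
    total = sum(
        path.stat().st_size
        for path in Path(data_dir).rglob("*")
        if path.is_file()
    )
    return total, total <= limit
```

Calling `data_size_report("data")` on a hypothetical `data/` directory returns the combined size in bytes and whether it stays within the limit; anything over it is a candidate for the external options listed above.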
- Adding a Dockerfile to the compendium
- containerit : o2r/containerit
- repo to docker : jupyter/repo2docker
- Binder : https://mybinder.org
- Use the holepunch package to make the setup easier.
- Summarising the R-package-esque folder structure
- Readme file : self-explanatory and should be as detailed as possible, and preferably include a graphical connection between various components.
- R/ : Script files with reusable functions go here. If roxygen is used to generate the documentation, then the man/ directory is automatically populated with this.
- analysis/ : analysis scripts and reports. Consider using ascending numbers in the file names to aid clarity and order, e.g. 001-load.R, 002-… and so on.
- The above does not capture the dependencies. Therefore an .Rmd or Makefile (or Makefile.R) can be included to capture the full tree of dependencies. These files control the order of execution.
- DESCRIPTION : file in the project root that provides formally structured, machine- and human-readable information on authors, project license, software dependencies and other metadata.
- When this file is included, the project becomes an installable R package.
- NAMESPACE : autogenerated file that exports R functions for repeated use.
- LICENSE : specifies the conditions for use / reuse.
- Drone : CI service that operates on Docker containers. This can be used as a check.
Makefiles
- uses the make language.
- specifies the relationship between data, the output and the code generating the output.
- Defines outputs (targets) in terms of inputs (dependencies) and the code necessary to produce them (recipes).
- Allows rebuilding only the parts that are out of date.
- The remake package enables writing Make-like instructions in R.
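To make the target / dependency / recipe idea concrete, here is a minimal sketch of the rebuild rule, written in Python rather than in make or R. It is not how make or remake are implemented; the names `is_out_of_date` and `build` and the shape of the `rules` dictionary are assumptions chosen for illustration. A target is treated as out of date when it is missing or older than any of its dependencies, and targets are visited so that dependencies come first.

```python
import os
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def is_out_of_date(target, dependencies):
    """Return True if target is missing or older than any dependency."""
    if not os.path.exists(target):
        return True
    target_mtime = os.path.getmtime(target)
    return any(os.path.getmtime(dep) > target_mtime for dep in dependencies)


def build(rules):
    """Run the recipes of out-of-date targets, dependencies first.

    `rules` maps each target path to a pair
    (list of dependency paths, recipe callable that writes the target).
    Returns the list of targets that were rebuilt, in build order.
    """
    graph = {target: deps for target, (deps, _recipe) in rules.items()}
    rebuilt = []
    for node in TopologicalSorter(graph).static_order():
        if node not in rules:
            continue  # a plain input file: nothing to build
        deps, recipe = rules[node]
        if is_out_of_date(node, deps):
            recipe()
            rebuilt.append(node)
    return rebuilt
```

With rules such as "the cleaned data depends on the raw data" and "the report depends on the cleaned data", calling `build(rules)` a second time without touching any input returns an empty list, which is the "rebuild only what is out of date" behaviour described above.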
- Principles to consider before sharing a research compendium
- Licensing, Version control, persistence, metadata : main aspects to consider.
- Archive a specific commit at a repository that issues persistent URLs, e.g. a DOI, which are designed to be more persistent than other URLs. Refer to re3data.org for discipline-specific DOI-issuing repositories. Using a DOI simplifies citations by allowing the transfer of basic metadata to a central registry (e.g. CrossRef and DataCite). Doing this ensures that a publicly available snapshot of the code exists that matches the published results.
- CRAN is generally not recommended for research-compendium packages, because it is strict about directory structures and contents of the R packages. It also has a 5MB limit for package data and documentation.
- Tools and templates
- devtools
- rrtools : extends devtools
Reference list
- https://ropensci.org/commcalls/2019-07-30/?eType=EmailBlastContent&eId=2d18a2f6-57ef-4d15-8c52-84be5c49e039 | rOpenSci | Reproducible Research with R
- https://github.com/annakrystalli/rrtools-repro-research | annakrystalli/rrtools-repro-research: Tutorial on Reproducible Research in R with rrtools
- https://karthik.github.io/holepunch/ | Configure Your R Project for binderhub • hole punch
- https://github.com/karthik/holepunch | karthik/holepunch: Make your R project Binder ready
- https://peerj.com/preprints/3192/ | Packaging data analytical work reproducibly using R (and friends) [PeerJ Preprints]
- https://github.com/alan-turing-institute/the-turing-way/tree/master/workshops/build-a-binderhub | the-turing-way/workshops/build-a-binderhub at master · alan-turing-institute/the-turing-way
- https://github.com/alan-turing-institute/the-turing-way/tree/master/workshops | the-turing-way/workshops at master · alan-turing-institute/the-turing-way
- https://research-compendium.science/ | Research Compendium
- http://inundata.org/talks/rstd19/#/0/33 | reproducible-data-analysis
- https://github.com/benmarwick/rrtools | benmarwick/rrtools: rrtools: Tools for Writing Reproducible Research in R
- https://github.com/shrysr/correlationfunnel | shrysr/correlationfunnel: Speed Up Exploratory Data Analysis (EDA)
- https://github.com/cboettig/nonparametric-bayes | cboettig/nonparametric-bayes: Non-parametric Bayesian Inference for Conservation Decisions
- https://lincolnmullen.com/blog/makefiles-for-writing-data-analysis-ocr-and-converting-shapefiles/#fnref2 | Makefiles for Writing, Data Analysis, OCR, and Converting Shapefiles | Lincoln Mullen
- https://github.com/lmullen/civil-procedure-codes/blob/master/Makefile | civil-procedure-codes/Makefile at master · lmullen/civil-procedure-codes
Bibliography
[marwick-2018-packag-r] Ben Marwick, Carl Boettiger and Lincoln Mullen (2018). Packaging data analytical work reproducibly using R (and friends). PeerJ Preprints. doi: 10.7287/peerj.preprints.3192v2. https://doi.org/10.7287/peerj.preprints.3192v2