Reproducible and scalable data analysis workflows with targets

Categories: presentation, targets

Dynamic Function-Oriented Make-Like Declarative Pipelines for R.

Matt Warkentin https://mattwarkentin.github.io/ (Lunenfeld-Tanenbaum Research Institute)
Nov. 12, 2020

R-centric project workflows

Writing good code is hard, and coordinating all of the code, data, and outputs for a project to ensure accurate and reproducible results is even harder. In theory, we should be able to return to a project years later and it should “just work”. Often this isn’t true even if we return to the project weeks or months later.

I was late to adopt drake for managing my R-based projects. For those of you who aren't familiar, drake is a mature R package that offers functionality similar to GNU Make or Snakemake, but is specifically geared towards projects that rely heavily on R.

For projects in R, the drake package can help. It analyzes your workflow, skips steps with up-to-date results, and orchestrates the rest with optional distributed computing. At the end, drake provides evidence that your results match the underlying code and data, which increases your ability to trust your research.
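To give a flavour of what this looks like in practice, here is a minimal sketch of a drake workflow. The data/raw.csv path and the y ~ x model are placeholder assumptions for illustration, not part of the original post.

library(drake)

# A plan declares each target and the command that builds it
plan <- drake_plan(
  raw_data = read.csv(file_in("data/raw.csv")),
  model = lm(y ~ x, data = raw_data),
  model_summary = summary(model)
)

make(plan)  # the first run builds everything
make(plan)  # later runs skip targets that are already up to date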

Adopting drake really changed how I think about developing and structuring my R projects. However, shortly after finding drake, I stumbled upon its heir apparent, the freshly public targets package, which was under rapid development and aimed to use the lessons learned from drake to create an API that is more consistent, friendly, and extensible than its predecessor.

The targets package is a Make-like pipeline toolkit for statistics and data science in R. With targets, you can maintain a reproducible workflow without repeating yourself. targets learns how your pipeline fits together, skips costly runtime for tasks that are already up to date, runs only the necessary computation, supports implicit parallel computing, abstracts files as R objects, and shows tangible evidence that the results match the underlying code and data.
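As a concrete illustration, a minimal _targets.R file might look like the sketch below. The get_data() and fit_model() helpers and the data/raw.csv path are hypothetical stand-ins I have added for this example.

# _targets.R
library(targets)

# Helper functions; in a real project these usually live in R/functions.R
get_data <- function(path) read.csv(path)
fit_model <- function(data) lm(y ~ x, data = data)

# The pipeline is the list of targets returned at the end of _targets.R
list(
  tar_target(raw_file, "data/raw.csv", format = "file"),
  tar_target(raw_data, get_data(raw_file)),
  tar_target(model, fit_model(raw_data))
)

Calling tar_make() builds the pipeline in a fresh R session, and running it again skips any target whose code and upstream dependencies are unchanged; tar_visnetwork() draws the dependency graph so you can see at a glance what is outdated.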

I was a very early adopter of targets, and in some small way helped push its development forward by helping its best-in-class author, Will Landau, work through bugs and consider new features.1 As of writing this post, targets is under review at rOpenSci and is working towards a CRAN release in the not-so-distant future.

After mentioning drake, and then targets, in many, many lab meetings, I was asked to present on the topic and demonstrate how I have been using these packages in my current work. The slide deck below was used for this internal presentation, but I thought I would make the slides publicly available in case others find them useful.

Slide deck


  1. I am listed as a formal contributor to targets, which also precluded me from accepting the opportunity to review the package for rOpenSci.

Reuse

Text and figures are licensed under Creative Commons Attribution CC BY 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".

Citation

For attribution, please cite this work as

Warkentin (2020, Nov. 12). Matt Warkentin: Reproducible and scalable data analysis workflows with targets. Retrieved from https://mattwarkentin.github.io/posts/2020-11-12-targets-demo/

BibTeX citation

@misc{warkentin2020reproducible,
  author = {Warkentin, Matt},
  title = {Matt Warkentin: Reproducible and scalable data analysis workflows with targets},
  url = {https://mattwarkentin.github.io/posts/2020-11-12-targets-demo/},
  year = {2020}
}