Dynamic Function-Oriented Make-Like Declarative Pipelines for R
Package: targets
Title: Dynamic Function-Oriented 'Make'-Like Declarative Workflows
Description: The 'targets' package is a pipeline toolkit...
Authors@R: c(
  person(
    given = c("William", "Michael"),
    family = "Landau",
    role = c("aut", "cre"),
    email = "will.landau@gmail.com",
    comment = c(ORCID = "0000-0003-1878-3253")
  ),
  person(
    given = c("Matthew", "T."),
    family = "Warkentin",
    role = "ctb"
  ),
  person(
    family = "Eli Lilly and Company",
    role = "cph"
  )
)
From: Lepore, Mauro
Subject: Would you be willing to review a package for rOpenSci?
To: warkentin@lunenfeld.ca
Dear Matthew,
Hi, this is Mauro. I hope you and your loved ones are safe. I'm writing to ask if you would be willing to review a package for rOpenSci. As you probably know, rOpenSci conducts peer review of R packages contributed to our collection in a manner similar to journals.
The package targets by Will Landau provides make-like pipelines for R. targets supersedes drake, and is submitted to rOpenSci jointly with the package tarchetypes. You can find targets and tarchetypes on GitHub here and here. We conduct our open review process via GitHub as well.
...
Thank you for your time.
Sincerely, Mauro
The targets package is a Make-like pipeline toolkit for statistics and data science in R. With targets, you can maintain a reproducible workflow without repeating yourself. targets learns how your pipeline fits together, skips costly runtime for tasks that are already up to date, runs only the necessary computation, supports implicit parallel computing, abstracts files as R objects, and shows tangible evidence that the results match the underlying code and data.
{targets} is a project workflow tool that is very R-centric, similar in spirit to {GNU make} and {snakemake}.
It allows you to effectively modularize your data analysis projects to create obvious and reproducible workflows.
You can easily extend your workflow to massively parallelize tasks.
The drake package is an older and more established R-focused pipeline toolkit. It has become a key piece of the R ecosystem, and development and support will continue. The targets package borrows from past learnings, user suggestions, discussions, complaints, success stories, and feature requests, and it improves the user experience in ways that will never be possible in drake.
targets is more...
Efficient
Reproducible
Maintainable
Portable
Domain specific
Organization
Modularity
Transparency and Reproducibility
Caching and History
Scalability and Parallel Computing
Infographic from https://docs.ropensci.org/drake/
All functions in {targets} are prefixed by tar_*, which makes it easy to work with the package due to low cognitive friction.

Your 80/20 functions...

tar_target() - The unit of interest; targets are the building blocks of your pipeline and represent meaningful components of your project
tar_pipeline() - Contains the complete set of targets to be included in the pipeline
tar_option_set() - Sets global configuration options, such as default storage formats, packages, memory allocation, deployment, etc.
tar_make() - Inspects your code/pipeline to understand the dependencies, and builds the pipeline in a separate clean R session
tar_target(
  name,
  command,
  pattern = NULL,
  tidy_eval = targets::tar_option_get("tidy_eval"),
  packages = targets::tar_option_get("packages"),
  library = targets::tar_option_get("library"),
  format = targets::tar_option_get("format"),
  iteration = targets::tar_option_get("iteration"),
  error = targets::tar_option_get("error"),
  memory = targets::tar_option_get("memory"),
  garbage_collection = targets::tar_option_get("garbage_collection"),
  deployment = targets::tar_option_get("deployment"),
  priority = targets::tar_option_get("priority"),
  resources = targets::tar_option_get("resources"),
  storage = targets::tar_option_get("storage"),
  retrieval = targets::tar_option_get("retrieval"),
  cue = targets::tar_option_get("cue")
)
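In practice you usually supply only the first two arguments; everything else falls back to the global options. A minimal sketch (the target and function names here are illustrative, not from the slides):

# A hypothetical target that cleans an upstream target named raw_data
tar_target(
  clean,                   # name of the target
  clean_data(raw_data),    # command; raw_data is another target in the pipeline
  format = "rds"           # storage format (the package default)
)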
tar_option_set(
  tidy_eval = NULL,
  packages = NULL,
  library = NULL,
  envir = NULL,
  format = NULL,
  iteration = NULL,
  error = NULL,
  memory = NULL,
  garbage_collection = NULL,
  deployment = NULL,
  priority = NULL,
  resources = NULL,
  storage = NULL,
  retrieval = NULL,
  cue = NULL,
  debug = NULL
)
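A typical use is to declare the packages every target needs and a default storage format once, near the top of _targets.R. The values below are illustrative (format = "qs" requires the qs package):

# _targets.R
tar_option_set(
  packages = c("dplyr", "readr"),  # loaded for every target
  format = "qs"                    # default storage format for all targets
)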
tar_pipeline(...)

tar_pipeline() simply accepts an arbitrary number of tar_target() objects, or a list thereof.

Example:

# _targets.R
tar_pipeline(
  tar_target(first, f1()),
  tar_target(second, f2()),
  tar_target(third, f3(first, second))
)

NOTE: The order of targets passed to tar_pipeline() does NOT matter. {targets} is smart enough to infer the topology and learn dependencies.

Compare this with a traditional numbered-script workflow:

R/
├── 01-data.R
├── 02-clean.R
├── 03-fit-model.R
├── 04-summarize-results.R
└── 05-tables-figs.R
run_scripts.R
Does not scale well to larger/complicated projects
You are in charge of storing/loading important objects
Everything needs to be re-run every time
Defining good targets is more of an art than a science, and it requires personal judgement and context specific to your use case.

Generally speaking, a good target is...

Long enough to eat up a decent chunk of runtime, and
Small enough that tar_make() frequently skips it, and
Meaningful to your project, and
A well-behaved R object that can be stored.
A {targets} pipeline is a directed acyclic graph (DAG) showing all of the tasks (nodes) and their interrelationships (edges).

A key design consideration when working with {targets} is to embrace functions.

Try to abstract important steps in your workflow into functions that do a single obvious task.
At first, this may seem like extra work, but the downstream payoff is huge.
find_outcomes <- function(data, icd_code) {
  # <<some R code>>
  return(data_with_outcomes)
}

find_outcomes(my_data, "C34")
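For instance, the body might flag subjects whose diagnosis code begins with the requested ICD-10 prefix. A minimal sketch, assuming the data has a column of diagnosis codes named icd10 (the column name is illustrative, not from the slides):

find_outcomes <- function(data, icd_code) {
  # Flag rows whose diagnosis code starts with the requested ICD prefix
  data_with_outcomes <- dplyr::mutate(
    data,
    outcome = startsWith(as.character(icd10), icd_code)
  )
  return(data_with_outcomes)
}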
These small, single-purpose functions become the building blocks of your targets pipeline.

There is no single required layout when using {targets} for building R-centric projects. However, _targets.R must exist at the root of the project.

├── R/
│   ├── functions.R
├── _targets.R
├── run.R
├── project-name.Rproj
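If you want a starting point, the targets package can scaffold an example _targets.R for you (the exact example it writes may differ by package version):

# Run once from the project root to write a minimal example _targets.R
targets::tar_script()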
A minimal skeleton for {targets} projects:

├── R/
├── _targets.R
├── run.R
├── project-name.Rproj

# _targets.R
library(targets)
# Load functions
source("functions.R")
# Set global options
tar_option_set(...)
# Define targets/pipeline
tar_pipeline(...)

# run.R
targets::tar_make()

├── R/
│   ├── clean-data.R
│   ├── cv-splits.R
│   ├── fit-model.R
│   ├── summarize-results.R
│   ├── build-report.R

I suggest having one script per function/task.
Name the script the same name as the function contained therein.

# clean-data.R
clean_data <- function(data) {
  # <<some R code...>>
  return(data_clean)
}
After tar_make() runs, results are cached in the _targets/ data store:

_targets/
│   ├── meta/
│   │   ├── meta
│   │   ├── progress
│   ├── objects/
│   │   ├── target_name_1
│   │   ├── target_name_2
├── R/
├── _targets.R
├── run.R
├── project-name.Rproj

You should rarely need to touch the _targets/ directory. Instead, inspect the data store and load objects using the suite of available helper functions.

# _targets.R
tar_pipeline(
  tar_target(
    data,
    palmerpenguins::penguins
  ),
  tar_target(
    model,
    lm(bill_length_mm ~ species, data = data)
  )
)
tar_make()

● run target data
● run target model

tar_make()

✓ skip target data
✓ skip target model
✓ Already up to date.
A look at the data store...
_targets/
├── meta
│   ├── meta
│   └── progress
└── objects
    ├── data
    └── model
tar_read(data) # compare with tar_load()

#> # A tibble: 344 x 8
#>    species island bill_length_mm bill_depth_mm flipper_length_…
#>    <fct>   <fct>           <dbl>         <dbl>            <int>
#>  1 Adelie  Torge…           39.1          18.7              181
#>  2 Adelie  Torge…           39.5          17.4              186
#>  3 Adelie  Torge…           40.3          18                195
#>  4 Adelie  Torge…           NA            NA                 NA
#>  5 Adelie  Torge…           36.7          19.3              193
#>  6 Adelie  Torge…           39.3          20.6              190
#>  7 Adelie  Torge…           38.9          17.8              181
#>  8 Adelie  Torge…           39.2          19.6              195
#>  9 Adelie  Torge…           34.1          18.1              193
#> 10 Adelie  Torge…           42            20.2              190
#> # … with 334 more rows, and 3 more variables: body_mass_g <int>,
#> #   sex <fct>, year <int>

NOTE: tar_read() reads objects into memory, but the user must assign the object into a variable for persistence; tar_load() reads and assigns objects into a variable of the same name.
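To make the distinction concrete, here is how the two helpers are typically used with the targets defined above (a usage sketch):

x <- tar_read(model)  # returns the cached object; assign it to keep it around
tar_load(model)       # creates an object named `model` in the calling environment
summary(model)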
tar_read(model) # compare with tar_load()

#> Call:
#> lm(formula = bill_length_mm ~ species, data = data)
#>
#> Coefficients:
#>      (Intercept)  speciesChinstrap     speciesGentoo
#>           38.791            10.042             8.713
# _targets.R
tar_pipeline(
  tar_target(
    data,
    palmerpenguins::penguins
  ),
  tar_target(
    model,
    lm(bill_length_mm ~ species, data = data)
  ),
  tar_target(
    summary,
    summary(model)
  )
)

tar_make()

✓ skip target data
✓ skip target model
● run target summary
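Before running, you can ask which targets are stale; tar_outdated() exists in the package for exactly this. For the pipeline above, the output would look roughly like this (illustrative):

tar_outdated()
#> [1] "summary"   # only the newly added target is out of date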
Reading a file inside a target does not track the file itself:

# _targets.R
tar_pipeline(
  tar_target(
    data,
    read_csv("path/to/data.csv")
  )
)

To watch the file for changes, declare it as its own target with format = "file":

# _targets.R
tar_pipeline(
  tar_target(
    data_file,
    "path/to/data.csv",
    format = "file"
  ),
  tar_target(
    data,
    read_csv(data_file)
  )
)
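The same pattern works for files a target produces: a format = "file" target must return the path(s) it creates. A minimal sketch (write_summary_csv() is a hypothetical helper, not from the slides):

tar_target(
  summary_file,
  write_summary_csv(data, path = "output/summary.csv"),  # must return "output/summary.csv"
  format = "file"
)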
The targets package supports shorthand to create large pipelines. Dynamic branching defines new targets (i.e. branches) while the pipeline is running, and those definitions can be based on prior results from upstream targets.
Patterns: create branches (i.e. sub-targets) by repeating a task over a set of arguments.

# _targets.R
# Draws from random Normal of various sizes
tar_pipeline(
  tar_target(size, seq(1, 1000, by = 100)),
  tar_target(draws, rnorm(size), pattern = map(size))
)

By default, targets will aggregate each of the sub-targets of draws using vctrs::vec_c(). In this example, this will combine all of our draws into one single vector. To keep the branches as separate list elements, use iteration = "list" instead.

Iteration: patterns repeat tasks and iterate over arguments (e.g. using map()), and there are two important aspects of iteration...
Branching: how does targets slice the data when creating branches?
Aggregation: how does targets combine the results after completing branches?

iteration = "vector": branches are sliced with vctrs::vec_slice(x, i) and aggregated with vctrs::vec_c(...)
iteration = "list": branches are sliced with x[[i]] and aggregated with list(...)
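Once the pipeline has run, you can read either the aggregate or individual branches; tar_read() has a branches argument for this. The comments describe the draws example above (illustrative):

tar_read(draws)               # with iteration = "vector": one combined numeric vector
tar_read(draws, branches = 1) # only the first branch (here, rnorm(1) since the first size is 1)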
We will set up a more advanced example using dynamic branching.

Let's fit a model for how life expectancy has changed over time for each country in the gapminder data set:

lifeExp = β0 + β1 · year + ε
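Each branch will fit this model to a single country's rows. In plain R, one branch's computation looks like this (Canada is just an illustrative choice):

library(gapminder)
canada <- subset(gapminder, country == "Canada")
lm(lifeExp ~ year, data = canada)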
# _targets.R
tar_pipeline(
  tar_target(
    gapminder_file,
    "/Users/matt/Library/R/4.0/library/gapminder/extdata/gapminder.tsv",
    format = "file"
  ),
  tar_target(
    gapminder,
    read_tsv(gapminder_file)
  ),
  tar_target(
    country,
    group_split(gapminder, country), # returns a list of data frames
    iteration = "list"
  ),
  tar_target(
    model,
    lm(lifeExp ~ year, data = country),
    pattern = map(country),
    iteration = "list"
  )
)
tar_visnetwork()
tar_make()

● run target gapminder_file
● run target gapminder
● run target country
● run branch model_55cab078
● run branch model_137fa27c
● run branch model_11168a7a
● run branch model_4fa278a7
● run branch model_f0b9128a
...
● run branch model_b1ee577e
● run branch model_af9237c1
● run branch model_ab0302f4
● run branch model_0b45f3c6
● run branch model_f4b18a5b
Now change the model formula to include gdpPercap:

# _targets.R
tar_pipeline(
  tar_target(
    gapminder_file,
    "/Users/matt/Library/R/4.0/library/gapminder/extdata/gapminder.tsv",
    format = "file"
  ),
  tar_target(
    gapminder,
    read_tsv(gapminder_file)
  ),
  tar_target(
    country,
    group_split(gapminder, country),
    iteration = "list"
  ),
  tar_target(
    model,
    lm(lifeExp ~ year + gdpPercap, data = country),
    pattern = map(country),
    iteration = "list"
  )
)
tar_visnetwork(label = c("branches", "time"))
# _targets.R
options(clustermq.scheduler = "multiprocess")

tar_make_clustermq(workers = 6)

✓ skip target gapminder_file
✓ skip target gapminder
✓ skip target country
● run branch model_137fa27c
● run branch model_11168a7a
● run branch model_4fa278a7
...
● run branch model_0b45f3c6
● run branch model_f4b18a5b
● run branch model_55cab078
Master: [11.1s 77.9% CPU]; Worker: [avg 20.9% CPU, max 8110571.0 Mb]
targets supports high-performance computing with the tar_make_clustermq() and tar_make_future() functions. These functions are like tar_make(), but they allow multiple targets to run simultaneously over parallel workers. These workers can be processes on your local machine, or they can be jobs on a computing cluster. The main process automatically sends a target to a worker as soon as:

- The worker is available, and
- All the target's upstream dependency targets have been checked or built.
Parallel workers are orchestrated by the {clustermq} R package... enabling parallel processing of targets is as simple as:

# Add this line to your _targets.R file
options(clustermq.scheduler = "multiprocess")

# Instead of tar_make()
targets::tar_make_clustermq(workers = 6)
With the "multiprocess" scheduler, the worker R processes are running locally, so a good rule of thumb is that the number of processes should always be less than or equal to the number of available cores.

On a cluster, set the scheduler to "slurm", and targets will run the worker processes as jobs on compute nodes:

# Add these lines to your _targets.R file
options(
  clustermq.scheduler = "slurm",
  clustermq.template = "/path/to/slurm_clustermq.tmpl"
)

You must also point clustermq toward a file which provides a template for submitting batch jobs.

targets::tar_make_clustermq(workers = 6)
(NOTE: clustermq requires ZeroMQ for socket communication, but this should be widely available after the next upgrade.)

# slurm_clustermq.tmpl
#SBATCH --job-name={{ job_name }}
#SBATCH --partition=default
#SBATCH --output={{ log_file | /dev/null }}
#SBATCH --error={{ log_file | /dev/null }}
#SBATCH --mem-per-cpu={{ memory | 4096 }}
#SBATCH --array=1-{{ n_jobs }}
ulimit -v $(( 1024 * {{ memory | 4096 }} ))
CMQ_AUTH={{ auth }} R --no-save --no-restore -e 'clustermq:::worker("{{ master }}")'
Values in double curly-braces will be automatically populated by clustermq (or fall back to default values, when available).
Worker resources (e.g. memory) can be customized via the resources argument to tar_option_set().
Other job schedulers are supported, including: LSF, SGE, PBS, Torque
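A hypothetical sketch of passing resources, assuming the resources argument accepts a named list whose fields match the template placeholders (this interface has evolved, so check the documentation of your installed targets version):

tar_option_set(
  resources = list(memory = 8192)  # assumed field name; would map to {{ memory }} in the template
)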
Seamless integration with AWS/cloud storage capabilities.
targets offers lots of different storage formats, some of which are faster and more memory efficient.
tarchetypes is a helper package that "is a collection of target and pipeline archetypes for the targets package".
Submitting HPC jobs using SSH - Develop locally and deploy remotely
Lots more: this package is in rapid development, though its current API is pretty stable as it prepares for peer review and a subsequent CRAN release.
Will Landau - targets and drake
Michael Schubert - clustermq
Yihui Xie, Garrick Aden-Buie