Dynamic Function-Oriented Make-Like Declarative Pipelines for R
Package: targets
Title: Dynamic Function-Oriented 'Make'-Like Declarative Workflows
Description: The 'targets' package is a pipeline toolkit...
Authors@R: c(
  person(
    given = c("William", "Michael"),
    family = "Landau",
    role = c("aut", "cre"),
    email = "will.landau@gmail.com",
    comment = c(ORCID = "0000-0003-1878-3253")
  ),
  person(
    given = c("Matthew", "T."),
    family = "Warkentin",
    role = "ctb"
  ),
  person(
    family = "Eli Lilly and Company",
    role = "cph"
  )
)
From: Lepore, Mauro
Subject: Would you be willing to review a package for rOpenSci?
To: warkentin@lunenfeld.ca
Dear Matthew,
Hi, this is Mauro. I hope you and your loved ones are safe. I'm writing to ask if you would be willing to review a package for rOpenSci. As you probably know, rOpenSci conducts peer review of R packages contributed to our collection in a manner similar to journals.
The package targets by Will Landau provides make-like pipelines for R. targets supersedes drake, and is submitted to rOpenSci jointly with the package tarchetypes. You can find targets and tarchetypes on GitHub here and here. We conduct our open review process via GitHub as well.
...
Thank you for your time.
Sincerely, Mauro
The targets package is a Make-like pipeline toolkit for statistics and data science in R. With targets, you can maintain a reproducible workflow without repeating yourself. targets learns how your pipeline fits together, skips costly runtime for tasks that are already up to date, runs only the necessary computation, supports implicit parallel computing, abstracts files as R objects, and shows tangible evidence that the results match the underlying code and data.
{targets} is a project workflow tool that is very R-centric, similar in spirit to {GNU make} and {snakemake}.
It allows you to effectively modularize your data analysis projects to create obvious and reproducible workflows.
You can easily extend your workflow to massively parallelize tasks.
The drake package is an older and more established R-focused pipeline toolkit. It has become a key piece of the R ecosystem, and development and support will continue. The targets package borrows from past learnings, user suggestions, discussions, complaints, success stories, and feature requests, and it improves the user experience in ways that will never be possible in drake.
targets is more...
Efficient
Reproducible
Maintainable
Portable
Domain specific
Organization
Modularity
Transparency and Reproducibility
Caching and History
Scalability and Parallel Computing
Infographic from https://docs.ropensci.org/drake/
All functions in {targets} are prefixed by tar_*, which makes it easy to work with the package due to low cognitive friction.

Your 80/20 functions...

tar_target() - The unit of interest; targets are the building blocks of your pipeline and represent meaningful components of your project
tar_pipeline() - Contains the complete set of targets to be included in the pipeline
tar_option_set() - Sets global configuration options, such as default storage formats, packages, memory allocation, deployment, etc.
tar_make() - Inspects your code/pipeline to understand the dependencies, and builds the pipeline in a separate clean R session
tar_target(
  name,
  command,
  pattern = NULL,
  tidy_eval = targets::tar_option_get("tidy_eval"),
  packages = targets::tar_option_get("packages"),
  library = targets::tar_option_get("library"),
  format = targets::tar_option_get("format"),
  iteration = targets::tar_option_get("iteration"),
  error = targets::tar_option_get("error"),
  memory = targets::tar_option_get("memory"),
  garbage_collection = targets::tar_option_get("garbage_collection"),
  deployment = targets::tar_option_get("deployment"),
  priority = targets::tar_option_get("priority"),
  resources = targets::tar_option_get("resources"),
  storage = targets::tar_option_get("storage"),
  retrieval = targets::tar_option_get("retrieval"),
  cue = targets::tar_option_get("cue")
)
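In practice you usually supply only the first two arguments; everything else falls back to the global options. A minimal sketch (the target and function names here are illustrative, not from the slides):

# A hypothetical target that cleans an upstream target named raw_data
tar_target(
  clean,                   # name of the target
  clean_data(raw_data),    # command; raw_data is another target in the pipeline
  format = "rds"           # storage format (the package default)
)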
tar_option_set(
  tidy_eval = NULL,
  packages = NULL,
  library = NULL,
  envir = NULL,
  format = NULL,
  iteration = NULL,
  error = NULL,
  memory = NULL,
  garbage_collection = NULL,
  deployment = NULL,
  priority = NULL,
  resources = NULL,
  storage = NULL,
  retrieval = NULL,
  cue = NULL,
  debug = NULL
)
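A typical use is to declare the packages every target needs and a default storage format once, near the top of _targets.R. The values below are illustrative (format = "qs" requires the qs package):

# _targets.R
tar_option_set(
  packages = c("dplyr", "readr"),  # loaded for every target
  format = "qs"                    # default storage format for all targets
)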
tar_pipeline(...)

tar_pipeline() simply accepts an arbitrary number of tar_target() objects, or a list thereof.

Example:

# _targets.R
tar_pipeline(
  tar_target(first, f1()),
  tar_target(second, f2()),
  tar_target(third, f3(first, second))
)

NOTE: The order of targets passed to tar_pipeline() does NOT matter. {targets} is smart enough to infer the topology and learn dependencies.

Compare this with a traditional numbered-script workflow:

R/
├── 01-data.R
├── 02-clean.R
├── 03-fit-model.R
├── 04-summarize-results.R
└── 05-tables-figs.R
run_scripts.R
Does not scale well to larger/complicated projects
You are in charge of storing/loading important objects
Everything needs to be re-run every time
Defining good targets is more of an art than a science, and it requires personal judgement and context specific to your use case.

Generally speaking, a good target is...

Long enough to eat up a decent chunk of runtime, and
Small enough that tar_make() frequently skips it, and
Meaningful to your project, and
A well-behaved R object that can be stored.
A {targets} pipeline is a directed acyclic graph (DAG) showing all of the tasks (nodes) and their interrelationships (edges).

A key design consideration when working with {targets} is to embrace functions.

Try to abstract important steps in your workflow into functions that do a single obvious task.
At first, this may seem like extra work, but the downstream payoff is huge.
find_outcomes <- function(data, icd_code) {
  # <<some R code>>
  return(data_with_outcomes)
}

find_outcomes(my_data, "C34")
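For instance, the body might flag subjects whose diagnosis code begins with the requested ICD-10 prefix. A minimal sketch, assuming the data has a column of diagnosis codes named icd10 (the column name is illustrative, not from the slides):

find_outcomes <- function(data, icd_code) {
  # Flag rows whose diagnosis code starts with the requested ICD prefix
  data_with_outcomes <- dplyr::mutate(
    data,
    outcome = startsWith(as.character(icd10), icd_code)
  )
  return(data_with_outcomes)
}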
These small, single-purpose functions become the building blocks of your targets pipeline.

There is no single required layout when using {targets} for building R-centric projects. However, _targets.R must exist at the root of the project.

├── R/
│   ├── functions.R
├── _targets.R
├── run.R
├── project-name.Rproj
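If you want a starting point, the targets package can scaffold an example _targets.R for you (the exact example it writes may differ by package version):

# Run once from the project root to write a minimal example _targets.R
targets::tar_script()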
A minimal skeleton for {targets} projects:

├── R/
├── _targets.R
├── run.R
├── project-name.Rproj

# _targets.R
library(targets)
# Load functions
source("functions.R")
# Set global options
tar_option_set(...)
# Define targets/pipeline
tar_pipeline(...)

# run.R
targets::tar_make()

├── R/
│   ├── clean-data.R
│   ├── cv-splits.R
│   ├── fit-model.R
│   ├── summarize-results.R
│   ├── build-report.R

I suggest having one script per function/task.
Name the script the same name as the function contained therein.

# clean-data.R
clean_data <- function(data) {
  # <<some R code...>>
  return(data_clean)
}
After tar_make() runs, results are cached in the _targets/ data store:

_targets/
│   ├── meta/
│   │   ├── meta
│   │   ├── progress
│   ├── objects/
│   │   ├── target_name_1
│   │   ├── target_name_2
├── R/
├── _targets.R
├── run.R
├── project-name.Rproj

You should rarely need to touch the _targets/ directory. Instead, inspect the data store and load objects using the suite of available helper functions.

# _targets.R
tar_pipeline(
  tar_target(
    data,
    palmerpenguins::penguins
  ),
  tar_target(
    model,
    lm(bill_length_mm ~ species, data = data)
  )
)
tar_make()

● run target data
● run target model

tar_make()

✓ skip target data
✓ skip target model
✓ Already up to date.
A look at the data store...
_targets/
├── meta
│   ├── meta
│   └── progress
└── objects
    ├── data
    └── model
tar_read(data) # compare with tar_load()

#> # A tibble: 344 x 8
#>    species island bill_length_mm bill_depth_mm flipper_length_…
#>    <fct>   <fct>           <dbl>         <dbl>            <int>
#>  1 Adelie  Torge…           39.1          18.7              181
#>  2 Adelie  Torge…           39.5          17.4              186
#>  3 Adelie  Torge…           40.3          18                195
#>  4 Adelie  Torge…           NA            NA                 NA
#>  5 Adelie  Torge…           36.7          19.3              193
#>  6 Adelie  Torge…           39.3          20.6              190
#>  7 Adelie  Torge…           38.9          17.8              181
#>  8 Adelie  Torge…           39.2          19.6              195
#>  9 Adelie  Torge…           34.1          18.1              193
#> 10 Adelie  Torge…           42            20.2              190
#> # … with 334 more rows, and 3 more variables: body_mass_g <int>,
#> #   sex <fct>, year <int>

NOTE: tar_read() reads objects into memory, but the user must assign the object into a variable for persistence; tar_load() reads and assigns objects into a variable of the same name.
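To make the distinction concrete, here is how the two helpers are typically used with the targets defined above (a usage sketch):

x <- tar_read(model)  # returns the cached object; assign it to keep it around
tar_load(model)       # creates an object named `model` in the calling environment
summary(model)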
tar_read(model) # compare with tar_load()

#> Call:
#> lm(formula = bill_length_mm ~ species, data = data)
#>
#> Coefficients:
#>      (Intercept)  speciesChinstrap     speciesGentoo
#>           38.791            10.042             8.713
# _targets.R
tar_pipeline(
  tar_target(
    data,
    palmerpenguins::penguins
  ),
  tar_target(
    model,
    lm(bill_length_mm ~ species, data = data)
  ),
  tar_target(
    summary,
    summary(model)
  )
)

tar_make()

✓ skip target data
✓ skip target model
● run target summary
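Before running, you can ask which targets are stale; tar_outdated() exists in the package for exactly this. For the pipeline above, the output would look roughly like this (illustrative):

tar_outdated()
#> [1] "summary"   # only the newly added target is out of date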
Reading a file inside a target does not track the file itself:

# _targets.R
tar_pipeline(
  tar_target(
    data,
    read_csv("path/to/data.csv")
  )
)

To watch the file for changes, declare it as its own target with format = "file":

# _targets.R
tar_pipeline(
  tar_target(
    data_file,
    "path/to/data.csv",
    format = "file"
  ),
  tar_target(
    data,
    read_csv(data_file)
  )
)
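The same pattern works for files a target produces: a format = "file" target must return the path(s) it creates. A minimal sketch (write_summary_csv() is a hypothetical helper, not from the slides):

tar_target(
  summary_file,
  write_summary_csv(data, path = "output/summary.csv"),  # must return "output/summary.csv"
  format = "file"
)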
The targets package supports shorthand to create large pipelines. Dynamic branching defines new targets (i.e. branches) while the pipeline is running, and those definitions can be based on prior results from upstream targets.
Patterns: create branches (i.e. sub-targets) by repeating a task over a set of arguments.

# _targets.R
# Draws from random Normal of various sizes
tar_pipeline(
  tar_target(size, seq(1, 1000, by = 100)),
  tar_target(draws, rnorm(size), pattern = map(size))
)

By default, targets will aggregate each of the sub-targets of draws using vctrs::vec_c(). In this example, this will combine all of our draws into one single vector. To keep the branches as separate list elements, use iteration = "list" instead.

Iteration: patterns repeat tasks and iterate over arguments (e.g. using map()), and there are two important aspects of iteration...
Branching: how does targets slice the data when creating branches?
Aggregation: how does targets combine the results after completing branches?

iteration = "vector": branches are sliced with vctrs::vec_slice(x, i) and aggregated with vctrs::vec_c(...)
iteration = "list": branches are sliced with x[[i]] and aggregated with list(...)
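Once the pipeline has run, you can read either the aggregate or individual branches; tar_read() has a branches argument for this. The comments describe the draws example above (illustrative):

tar_read(draws)               # with iteration = "vector": one combined numeric vector
tar_read(draws, branches = 1) # only the first branch (here, rnorm(1) since the first size is 1)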
We will set up a more advanced example using dynamic branching.

Let's fit a model for how life expectancy has changed over time for each country in the gapminder data set:

lifeExp = β0 + β1 · year + ε
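Each branch will fit this model to a single country's rows. In plain R, one branch's computation looks like this (Canada is just an illustrative choice):

library(gapminder)
canada <- subset(gapminder, country == "Canada")
lm(lifeExp ~ year, data = canada)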
# _targets.R
tar_pipeline(
  tar_target(
    gapminder_file,
    "/Users/matt/Library/R/4.0/library/gapminder/extdata/gapminder.tsv",
    format = "file"
  ),
  tar_target(
    gapminder,
    read_tsv(gapminder_file)
  ),
  tar_target(
    country,
    group_split(gapminder, country), # returns a list of data frames
    iteration = "list"
  ),
  tar_target(
    model,
    lm(lifeExp ~ year, data = country),
    pattern = map(country),
    iteration = "list"
  )
)
tar_visnetwork()
tar_make()

● run target gapminder_file
● run target gapminder
● run target country
● run branch model_55cab078
● run branch model_137fa27c
● run branch model_11168a7a
● run branch model_4fa278a7
● run branch model_f0b9128a
...
● run branch model_b1ee577e
● run branch model_af9237c1
● run branch model_ab0302f4
● run branch model_0b45f3c6
● run branch model_f4b18a5b
Now change the model formula to include gdpPercap:

# _targets.R
tar_pipeline(
  tar_target(
    gapminder_file,
    "/Users/matt/Library/R/4.0/library/gapminder/extdata/gapminder.tsv",
    format = "file"
  ),
  tar_target(
    gapminder,
    read_tsv(gapminder_file)
  ),
  tar_target(
    country,
    group_split(gapminder, country),
    iteration = "list"
  ),
  tar_target(
    model,
    lm(lifeExp ~ year + gdpPercap, data = country),
    pattern = map(country),
    iteration = "list"
  )
)
tar_visnetwork(label = c("branches", "time"))
# _targets.R
options(clustermq.scheduler = "multiprocess")

tar_make_clustermq(workers = 6)

✓ skip target gapminder_file
✓ skip target gapminder
✓ skip target country
● run branch model_137fa27c
● run branch model_11168a7a
● run branch model_4fa278a7
...
● run branch model_0b45f3c6
● run branch model_f4b18a5b
● run branch model_55cab078
Master: [11.1s 77.9% CPU]; Worker: [avg 20.9% CPU, max 8110571.0 Mb]
targets supports high-performance computing with the tar_make_clustermq() and tar_make_future() functions. These functions are like tar_make(), but they allow multiple targets to run simultaneously over parallel workers. These workers can be processes on your local machine, or they can be jobs on a computing cluster. The main process automatically sends a target to a worker as soon as:

- The worker is available, and
- All the target's upstream dependency targets have been checked or built.
Parallel workers are orchestrated by the {clustermq} R package... enabling parallel processing of targets is as simple as:

# Add this line to your _targets.R file
options(clustermq.scheduler = "multiprocess")

# Instead of tar_make()
targets::tar_make_clustermq(workers = 6)
With the "multiprocess" scheduler, the worker R processes are running locally, so a good rule of thumb is that the number of processes should always be less than or equal to the number of available cores.

On a cluster, set the scheduler to "slurm", and targets will run the worker processes as jobs on compute nodes:

# Add these lines to your _targets.R file
options(
  clustermq.scheduler = "slurm",
  clustermq.template = "/path/to/slurm_clustermq.tmpl"
)

You must also point clustermq toward a file which provides a template for submitting batch jobs.

targets::tar_make_clustermq(workers = 6)
(NOTE: clustermq requires ZeroMQ for socket communication, but this should be widely available after the next upgrade.)

# slurm_clustermq.tmpl
#SBATCH --job-name={{ job_name }}
#SBATCH --partition=default
#SBATCH --output={{ log_file | /dev/null }}
#SBATCH --error={{ log_file | /dev/null }}
#SBATCH --mem-per-cpu={{ memory | 4096 }}
#SBATCH --array=1-{{ n_jobs }}
ulimit -v $(( 1024 * {{ memory | 4096 }} ))
CMQ_AUTH={{ auth }} R --no-save --no-restore -e 'clustermq:::worker("{{ master }}")'
Values in double curly-braces will be automatically populated by clustermq (or fall back to default values, when available).
Worker resources (e.g. memory) can be customized via the resources argument to tar_option_set().
Other job schedulers are supported, including: LSF, SGE, PBS, Torque
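A hypothetical sketch of passing resources, assuming the resources argument accepts a named list whose fields match the template placeholders (this interface has evolved, so check the documentation of your installed targets version):

tar_option_set(
  resources = list(memory = 8192)  # assumed field name; would map to {{ memory }} in the template
)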
Seamless integration with AWS/cloud storage capabilities.
targets offers lots of different storage formats, some of which are faster and more memory efficient.
tarchetypes is a helper package that "is a collection of target and pipeline archetypes for the targets package".
Submitting HPC jobs using SSH - Develop locally and deploy remotely
Lots more: this package is in rapid development, though its current API is pretty stable as it prepares for peer review and a subsequent CRAN release.
Will Landau - targets and drake
Michael Schubert - clustermq
Yihui Xie, Garrick Aden-Buie