fully_featured_R_package_development
fully_featured_R_package_development.RmdUser setup
At an absolute minimum, read Hadley Wickam and Jenny Bryan’s R packages’ section The Whole Game.
You should also at the very least skim through the rest of that book. You’ll use it as a reference frequently and you need to know what is there.
After skimming the the Wickam/Bryan R packages book section on documentation, also skim the Roxygen2 documentation. Documenting your code is important, and easier done as you write than leaving it to later.
Finally, you need to be aware of the common code style standards. You can choose the break the rules, but you should do this for a reason, not due to ignorance of common conventions. I would suggest looking at either the tidyverse style guide or the Google R style guide. These two are almost exactly the same. For instance, they both reference the other as a source.
System Requirements:
Up to date R
up to date Rstudio
Up to date
usethis(the main R development tools meta-package)Up to date
renv(similar to pythonvenvin function, though not seemingly in its conventions or implementation)I make a directory called
codeandprojectsin my$HOME. I typically put projects which I intend to be packaged, or src code that I pull from the internet, incode, and projects which are more strictly focused on analysis and are not going to be formally packaged inprojects-
Set up your Rstudio environment
- Open Rstudio, go to Tools –> Global options
- Set your default working directory to a location thaty you
choose as opposed to the default. I set mine to
$HOME/projects. - Uncheck the boxes – probably right underneath the default directory setting box – that say anything along the lines of “Save session data on exit” or “save Rdata automatically”, … – anything like that, turn it off.
- Go to the Appearance tab. I prefer the
Ubuntu monoeditor font, and theTomorrow nighteditor theme. This also isn’t necessary, but I find the inverted color scheme to be a lot easier on my eyes. - You’re going to spend a good amount of time in Rstudio – look around the rest of the options, eg the window arrangement tab, and see what you can customize to your own preferences.
- Set your default working directory to a location thaty you
choose as opposed to the default. I set mine to
- Open Rstudio, go to Tools –> Global options
Exit Rstudio, and re-launch it – now if you do getwd(),
you’ll see that Rstudio launches by default in the location of your
choice, and looks better.
Create the package skeleton
Despite preferring $HOME/projects be my default launch
location, I’m going to make this project – since it is intended to be
packaged software – in my $HOME/code directory. Create the
project skeleton with usethis.
# note: the name of the package will be the basename of this path.
# I am putting this in my $HOME/code directory
usethis::create_package('~/code/brentlabModelPerfTesting')This will launch a new window where the
current working directory is set to the path you specified.
You can see this in the top right corner of your Rstudio session.
Set up a bunch of boilerplate
Next, we’re going to set up a bunch of ‘boilerplate’ package code
Add a README.md file
-
usethis::use_readme_md()(oruse_readme_rmdif you wish. I prefermdb/c it is simpler)
Also add an index.md
- This is for your pkgdown documentation – the README.md will be
displayed on github, but
pkgdownwill use theindex.mdfile preferentially for its own homepage – this way your documentation homepage can be different than your github homepage README. No need to do anything with this now, though you could put a ‘hello world’ line in (this is just a markdown document) if you want
Add a license
- Set up the license – you can read about licenses if you want. I
generally use the MIT license, which is intended to be as permissive as
possible. This is just considered good practice, and it is easily done:
usethis::use_mit_license(). You should look at the other licenses available, eguse_gnu_license(), if you care about these things.
Update the initial DESCRIPTION fields
- Update your author, title and description fields in the
DESCRIPTIONfile
Start your virtual environment
renv::init(). Note: renv is not nearly as intuitive and simple as pythonvenv. It is your responsibility to learn how to userenv, and you should expect that you’re going to need to spend time doing that. An alternative torenvispackrat, though Posit/Rstudio seems to have chosenrenvas the main virtual environment management platform.Frequently check
renv::status()and frequentlyrenv::snapshot()and if necessarilyrenv::install('some package'). Note that this happens in addition to usingusethis::use_package('package', type='<type>', min_version = <TRUE/<specific_version>>). It is critical that you figure out the difference between these two – read the documentation first, play around with it on your own, and if you have questions, post a question on the appropriate boards (see the documentation forusethisandrenv. It tells you where to ask questions).
Git/github
Note: everything you add to your
.gitignore, you may also want to add to
.Rbuildignore and eventually .dockerignore
In your package, in the console, do
usethis::use_git()-
go to this repo maintained by Github itself and copy/paste (or wget or whatever) this R project .gitignore template into your project
.gitignorefile- It doesn’t really matter, but just look through the
.gitignorenow and remove any duplicates from the items thatusethisincludes by default. I also at this point add atmpdirectory in my package workspace, in which I’ll store temporary development data, scripts, etc on my local, but that I don’t want version controlled or pushed to github, and add that to the.gitignore. - add a directory,
docs/to your .gitignore - Note: You can also add this .gitignore template when you create the git repo through their interface.
- It doesn’t really matter, but just look through the
-
Go to your github account and create a repo. I would name the repo the same name as the package, though this is not required.
- Follow the directions on git to add your local repo to github
You can, but I don’t yet make my first push – I do the following first
Set up your test environment
usethis::use_testthat()
Set up a basic function and test – there will be more in a later section on how to do this. But, we just want to make sure that all this boiler plate will actually run. Do the following:
usethis::use_r('init_function')this will open a fileR/init_function.R. With that as your active window, enterusethis::use_test(), which will create a test forinit_function. Note: You don’t need to write anything in theinit_function.Rscript, and once we have an actual function, we’re going to delete this. It just needs to be here so that the test suite runs and passes (the test suite fills a passing test by default)
Set up pkgdown and gh-pages
usethis::use_pkgdown_github_pages()will create the gh-pages branch on your github repo. This is where you documentation will get built and served by githubIf you didn’t already do it, add
docs/to your.gitignore– you don’t watch to VC thedocs/on your main branch.IMPORTANT In order to allow the github action to update the gh-pages branch, you need to change a branch security setting on github. On the github repo page, do the following:
Go to:
Settings –> Actions –> general –> Workflow permissions
and set the permissions to ‘Read and Write’
Set up the initial dockerfile
create a file called (by convention) Dockerfile and
(this is not convention – must be called this)
.dockerignore. Add anything to the
.dockerignore that you dont want included in the dockerfile
– use the .github, thoughtful sense, and maybe other
peoples’ .dockerignore files as guides.
important: add Dockerfile to your
.Rbuildignore. I’m going to try to note when you should do
things like this, but I might miss some. You should be thinking about
what should be ignored where, and, when you do the package
build check (explained in a couple steps from now), forgetting to do
things like this will be caught. You can fix it at the
check step, too.
Here is the one I use. This should not be used blindly. This guide is intended for sophisticated users. Read the dockerfile, read the docker docs, go ahead and try this out, but if something isn’t working for you, figure it out and debug now. If there is a typo or improvement to this template/example, please submit a pull request.
# get the base image, the rocker/verse has R, RStudio and pandoc
FROM r-base
# Get and install system dependencies
ENV RENV_VERSION 0.16.0
RUN R -e "install.packages(c('remotes'), repos = c(CRAN = 'https://cran.wustl.edu'))"
RUN R -e "remotes::install_github('rstudio/renv@${RENV_VERSION}')"
# it would be a good idea to figure out what
# each of these do. My method of dealing with this
# is to reduce it, try installing, and then add
# packages to deal with errors. I don't adjust this
# as frequently as I should.
RUN apt-get update && \
apt-get install -y --no-install-recommends \
software-properties-common \
dirmngr \
wget \
build-essential \
libssl-dev \
libxml2-dev \
libcurl4-openssl-dev \
libfontconfig1-dev \
libharfbuzz-dev \
libfribidi-dev \
libtiff-dev
# Clean up
RUN apt-get autoremove -y
WORKDIR /project
COPY renv.lock renv.lock
COPY renv/ renv/
COPY .Rprofile .Rprofile
COPY renv/activate.R renv/activate.R
COPY renv/settings.dcf renv/settings.dcf
# note: update this path as necessary based no the r-base r version
# and what you make your WORKDIR
ENV R_LIBS /project/renv/library/R-4.2/x86_64-pc-linux-gnu
# Using renv takes a really, really long time. I don't know
# renv terribly well, so it is possible that there are settings
# that could reduce the time it takes to install.
RUN R -e "renv::restore()"
WORKDIR /project/src
COPY . .
WORKDIR /project
RUN R -e "renv::activate();renv::install('./src', dependencies = TRUE)"
RUN rm -rf src
##### METHOD 2 install from github releases using remotes::install_github #####
#
# You can skip the renv portion and just use remotes::install_github(..., dependencies=TRUE)
Now check that this can actually build the image. Debug if necessary. You could do this from the Rstudio integrated terminal, but I do it in a terminal on my system
docker build -t <your dockerhub username>/brentlabmodelperftesting .If it works without error, great! No need to do anything else with this now.
Set up CI for the build check, documentation construction, and coverage
this
page is linked from usethis::use_github_actions, and
gives a full list of available pre-configured CI github actions jobs
that you can add to your project
-
R CMD check and build on multiple operating systems
- Read the
?usethis::use_github_actionsdocumentation. If you don’t know what CI is, go read about it and figure it out. If you don’t konw whatR CMD Checkmeans, look it up in the Wickam/Bryan R package book. Otherwise, choose your poison here.use_github_action_check_standard()is likely generally the one you want. In this case, I am only interested in testing the package on linux, so I’m going to chooseuse_github_action_check_release()
- Read the
Add the R CMD Check badge:
usethis::use_github_actions_badge()-
usethis::use_coverage('codecov')installscovrto track test coverage of your code. This will have some output as it installs, and the last bit is a badge that you should add to your README.md and index.md if you wish- After running the cmd above, run
usethis::use_github_action("test-coverage")so that the coverage report is also run on all pushes/pulls to main
- After running the cmd above, run
usethis::use_github_actions('pkgdown'): Automatically build the docs ongh-pages
Adjust the CI to work with renv
In general, the default CI will work. But, if you do end up using
renv to handle some more complicated dependencies (e.g.,
the precompiled-with-gpu capability linux based XGBoost
distribution that we’ll be using), you’ll want to use renv
rather than pak to build your environment in the CI.
Go to the .github/workflows directory. Here is what my
files look like when using renv – each gets a similar
adjustment.
Note: This guide is not for unsophisticated
users. Getting the virtual environment, the CI, the Dockerfile, etc work
are all tasks that require a level of understanding and ability that is
beyond the novice, and probably beyond the intermediate level. If this
doesn’t run for you, I will always consider a pull request to update and
fix the documentation. Otherwise, go figure it out on your own. There is
documentation for each one of these items, and there are help community
message boards. I am expecting that you don’t copy/paste this in
blindly, but that you actually look at this, and look at the
usethis default CI scripts, and understand what has been
changed.
R CMD Check
# Workflow derived from https://github.com/r-lib/actions/tree/v2/examples
# Need help debugging build failures? Start at https://github.com/r-lib/actions#where-to-find-help
on:
push:
branches: [main, master]
pull_request:
branches: [main, master]
name: R-CMD-check
jobs:
R-CMD-check:
runs-on: ${{ matrix.config.os }}
name: ${{ matrix.config.os }} (${{ matrix.config.r }})
# note: I removed the DEV ubuntu release b/c it failed -- I'm not
# interested in this case in ensuring that this package runs on
# a dev branch of ubuntu
strategy:
fail-fast: false
matrix:
config:
- {os: ubuntu-latest, r: 'release'}
- {os: ubuntu-latest, r: 'oldrel-1'}
env:
GITHUB_PAT: ${{ secrets.GITHUB_TOKEN }}
R_KEEP_PKG_SOURCE: yes
# this needs to be added
RENV_PATHS_ROOT: ~/.local/share/renv
steps:
- uses: actions/checkout@v3
- uses: r-lib/actions/setup-pandoc@v2
- uses: r-lib/actions/setup-r@v2
with:
r-version: ${{ matrix.config.r }}
http-user-agent: ${{ matrix.config.http-user-agent }}
use-public-rspm: true
# setup-r-dependencies is removed !
# instead, we're handling dependencies with renv
- name: Cache packages
uses: actions/cache@v1
with:
path: ${{ env.RENV_PATHS_ROOT }}
key: ${{ runner.os }}-renv-${{ hashFiles('**/renv.lock') }}
restore-keys: |
${{ runner.os }}-renv-
# note! have to add the rcmdcheck install
- name: Restore packages
shell: Rscript {0}
run: |
if (!requireNamespace("renv", quietly = TRUE)) install.packages("renv")
renv::restore()
renv::install('rcmdcheck')
- uses: r-lib/actions/check-r-package@v2
with:
upload-snapshots: true
pkgdown
# Workflow derived from https://github.com/r-lib/actions/tree/v2/examples
# Need help debugging build failures? Start at https://github.com/r-lib/actions#where-to-find-help
on:
push:
branches: [main, master]
pull_request:
branches: [main, master]
release:
types: [published]
workflow_dispatch:
name: pkgdown
jobs:
pkgdown:
runs-on: ubuntu-latest
# Only restrict concurrency for non-PR jobs
concurrency:
group: pkgdown-${{ github.event_name != 'pull_request' || github.run_id }}
env:
GITHUB_PAT: ${{ secrets.GITHUB_TOKEN }}
RENV_PATHS_ROOT: ~/.local/share/renv
steps:
- uses: actions/checkout@v3
- uses: r-lib/actions/setup-pandoc@v2
- uses: r-lib/actions/setup-r@v2
with:
use-public-rspm: true
- name: Cache packages
uses: actions/cache@v1
with:
path: ${{ env.RENV_PATHS_ROOT }}
key: ${{ runner.os }}-renv-${{ hashFiles('**/renv.lock') }}
restore-keys: |
${{ runner.os }}-renv-
- name: Restore packages
shell: Rscript {0}
run: |
if (!requireNamespace("renv", quietly = TRUE)) install.packages("renv")
renv::restore()
renv::install('.')
- name: Build site
run: pkgdown::build_site_github_pages(new_process = FALSE, install = FALSE)
shell: Rscript {0}
- name: Deploy to GitHub pages 🚀
if: github.event_name != 'pull_request'
uses: JamesIves/github-pages-deploy-action@v4.4.1
with:
clean: false
branch: gh-pages
folder: docs
coverage
# Workflow derived from https://github.com/r-lib/actions/tree/v2/examples
# Need help debugging build failures? Start at https://github.com/r-lib/actions#where-to-find-help
on:
push:
branches: [main, master]
pull_request:
branches: [main, master]
name: test-coverage
jobs:
test-coverage:
runs-on: ubuntu-latest
env:
GITHUB_PAT: ${{ secrets.GITHUB_TOKEN }}
RENV_PATHS_ROOT: ~/.local/share/renv
steps:
- uses: actions/checkout@v3
- uses: r-lib/actions/setup-r@v2
with:
use-public-rspm: true
- name: Cache packages
uses: actions/cache@v1
with:
path: ${{ env.RENV_PATHS_ROOT }}
key: ${{ runner.os }}-renv-${{ hashFiles('**/renv.lock') }}
restore-keys: |
${{ runner.os }}-renv-
- name: Restore packages
shell: Rscript {0}
run: |
if (!requireNamespace("renv", quietly = TRUE)) install.packages("renv")
renv::restore()
- name: Test coverage
run: |
covr::codecov(
quiet = FALSE,
clean = FALSE,
install_path = file.path(Sys.getenv("RUNNER_TEMP"), "package")
)
shell: Rscript {0}
- name: Show testthat output
if: always()
run: |
## --------------------------------------------------------------------
find ${{ runner.temp }}/package -name 'testthat.Rout*' -exec cat '{}' \; || true
shell: bash
- name: Upload test results
if: failure()
uses: actions/upload-artifact@v3
with:
name: coverage-test-failures
path: ${{ runner.temp }}/package
Check your main branch locally
Review your
.Rbuildignoreand make sure you have ignored what you need to ignore. Hidden files I believe are ignored by default – you should check that in the docs – but, if you made atmpdirectory, you should add that to your.Rbuildignore.-
Go to the “Build” tab in your “Environment” window in Rstudio. Click the “Check” button. This runs
devtools::check()(comes along withusethis), which under the hood runsR CMD check. This * should * run quickly and have NO errors or warnings, since we haven’t installed anything. If there are errors or warnings, you need to stop and fix that.- If you haven’t developed an R package before, or have never used this button, get used to it. You should be clicking this frequently as you develop, even though it does take some time to execute.
-
git add, commit, and push to main
-
Note: make sure you add the hidden files
which are not in your
.gitignore, such as.Rprofile,.github,.Rbuildignoreand.gitignore
-
Note: make sure you add the hidden files
which are not in your
go to github, click the “Actions” tab, and make sure that each of these CI actions actually runs. If it doens’t run now, with a completely empty package, it will be easier to fix now than it will be later.
Create a dev branch and develop
You could not do this and just develop and push to
main. However, it is really, really useful to keep
a main branch that actually works. For instance, right now,
the renv, dockerfile, CI and
pkgdown/gh-pages all work. You might do something that
breaks any one of these, and debugging them is not easy. Being able to
revert to a working version, and then slowly re-integrate changes to
figure out what is causing the bug is far easier than making a big push
to main, breaking things, and then rolling back the git
history in main in my opinion.
- On your local, create a
devbranch (git checkout -b dev). At the very least, you’ll do your work in thedevbranch, and push changes to the remotedev. You may choose to use other branches, also. But, importantly, this way you’ll be able to do your development indevand choose when to merge it intomain. It will keep yourmainbranch “clean”, meaning to the best of your ability, yourmainbranch will work.devmay, on the otherhand, be unstable as you develop.- When you’re ready to merge from
devtomain, this will be the point at which you bump the semantic versioning (you may choose to do this throughout yourdevbranch developments – lots of different methods here), rebuild the container(s), and create a ‘release’ on github. There should be an updatedCHANGELOGthat goes along with any version bump. - If these instructions aren’t immediately clear and obvious, then you need to stop and go read documentation and watch youtube videos on BASIC git/github usage. This is a fundamental skill in programming today, and has been for some time.
- When you’re ready to merge from
Writing code
First, if you aren’t writing tests along with your code, then you are only doing half of your job. If you have any preconceived notions about what tests are that make you think that you shouldn’t be writing tests, then you need to get rid of those preconceived notions. They are also wrong.
The point of tests is to create a reproducible debugging environment for you, and others, as you develop the codebase beyond your first function. It allows you, or another developer, to easily at least call your function, set a breakpoint at the first line, and then step through line by line so that instead of relying on your (probably shitty) documentation, the developer can actually see what the code explicitely does.
The tests also serve as integration checks – when you add a new function, it should not break an old function. The tests don’t guarantee that this is true, but it is a good check to have even if you shouldn’t trust blindly that passing a test is a bulletproof guarantee of quality. That is something that you really always need to think about.
Finally, because you do want a test suite that does provide some assurance that the code is correct and works, improving the tests over time is a lifetime-of-the-software goal – making the tests now serve as a starting point for something that can be iteratively improved in the future.
Before writing any code
Know what you’re doing
Before you have started this entire process, you knew what the goals of this specific software is. If you don’t, stop and go figure that out first.
If you know what the purpose of the software is, then you also know what the ‘basic most fundamental’ tasks of this software will be, and what that input data should look like because, as the developer, you decide what input and output are. It is not decided by anyone else, or the data. You decide.
Get some development data
Before you do anything else, you need to get some data that you can use for development. This should be as minimal as is reasonable to actually write the code you need to write.
Put the data and put it either in inst or
tmp – if you put it in inst, then you can make
this data available in the package when it is distributed. If you put it
in tmp, then if you added this to your various
ignore files, you can prevent sharing that data outside of
your own local development space. This is useful if you are too lazy to
subset down the development data initially and just want to take what is
expedient and stick it in initially.
Know basic style conventions
Read at least 1 of the style guides linked above in User setup. Follow the style guide there in terms
of organizing your code. My preference is many, short code files that
are more or less divided by function and/or object and have the same
name as the function or class it contains. There are some R conventions
surrounding filenames – eg, Clases.R,
methods.R, zzz.R – you can get this kind of
information from the CRAN documentation on packaging, and also from the
Hadley/Bryan book.
Use a linter The package styler is the most commonly used R linter. Install this with
Set up a test when you set up a class/function script
I am going to follow my convention of 1 file to 1 function, both with the same name.
- first, create a script file for the
prep_datafunction in theR/directory with `usethis::use_r(‘prep_data’)- This will open a file called
R/prep_data.R. - Now call
usethis::use_test()which will, if you current active window isprep_data.R, create a test for that file.
- This will open a file called
- In the test, at least set up the input. So before writing anything else, in the test, I do this:
test_that("prepare_data", {
# this is my development data, which I put in the `inst/` directory
# where it will be installed, along with the package, when the package
# is installed. This is how you access that data. Note that part of the
# test suite is doing `library(<your package>)` prior to this -- `usethis`
# handled this for you, so you don't see it in the actual specific test
test_data = readRDS(
system.file('testing_gene_data.rds',
package = 'brentlabModelPerfTesting'))
# I haven't written anything in the prep_data function! This will fail
# until I do, which is a good thing.
actual = prep_data(test_data, 10)
# just placeholder for now
expected = head(test_data)
expect_identical(expected, actual)
})In the function, I might now write something like this – Note that I
start writing the documentation now! Also, to handle importing packages,
we are using roxygen2 – that is what the
@importFrom statement is doing. You can read more about
this in the Wickam/Bryan Packaging R book, and in the
roxygen2 docs. If you don’t know how to manage dependencies
this way, then you need to stop and go read about it now.
#' prep data for the testing functions
#'
#' @importFrom assertthat assert_that
#'
#' @param gene_data a data.frame where the label_vector column is the response
#' and the rest of the columns are predictors
#' @param feature_num is the number of feature_num to select from the data. must be
#' $>=$ 1.If exceeds ncol, ncol-1 is used (1 col removed as the label).
#' Default is to use all columns
#' @param label_vector name of the column to use as the response
#'
#' @export
prep_data = function(
gene_data,
feature_num = Inf,
label_vector = 'ensg00000183117_19'){
assertthat::assert_that(is.numeric(feature_num))
assertthat::assert_that((label_vector) %in% colnames(gene_data))
# return this out -- this is just a placeholder for development
head(gene_data)
}And run the test via that same panel from which you ran the
check. Maybe it works, maybe it doesn’t, but at least now
you have a framework in which you can make your developing
reproducible.
Interactive development
In python, you can use pip install -e . to install a
development package from its src directory (the one where
you’re working) so that any changes you make are immediately present in
your PYTHONPATH. In R, you can do something similar with
devtools::load_all(). But this
needs to be re-done each time you make a change. Do this now.
Set a breakpoint
-
In
prep_data, set a breakpoint. Maybe set it on the first line, so that no code is executed inside of hte function, and you stop in an interactive environment in which you can execute the code/write/etc in the console with the actual function data.
Installing dependencies
To install dependencies, we’re going to use
usethis::use_package. Read the docs on this – you will be
able to decide what section the package gets added to
(Imports, Suggests, Depends). I
typically set min_version = TRUE, but put thought into
this, generally. Important remember to update
your virtual environment frequently – to see the changes, use
renv::status(). To update the lockfile, do
renv::snapshot(). Be aware of what you’re doing – don’t
fill your environment with a bunch of junk (though, if you do this,
renv does have some functions to help clean things up).
Installing from a remote other than CRAN or Bioconductor
For example, I want this package to depend on a pre-compiled version of XGBoost which is distributed by XGBoost, but not on CRAN.
this is how this is done:
- First, based on the Remotes
documentation, add the following to the
DESCRIPTIONfile:
Remotes:
xgboost=url::https://github.com/dmlc/xgboost/releases/download/v1.7.2/xgboost_r_gpu_linux_1.7.2.tar.gz
Then, use
renvto install, in this case,xgboost– it will respect theRemotesfield in DESCRIPTION.Next, use
usethisto addxgboost, in this case, to the DESCRIPTIONImports:usethis::use_package('xgboost', min_version=TRUE)Update the virtual environment:
renv::status()– just take a look at what changed, make sure you agree. Thenrenv::snapshot().
As a side note, this is what the renv.lock file entry
for xgboost looks like:
...
"xgboost": {
"Package": "xgboost",
"Version": "1.7.2.1",
"Source": "URL",
"RemoteType": "url",
"RemoteUrl": "https://github.com/dmlc/xgboost/releases/download/v1.7.2/xgboost_r_gpu_linux_1.7.2.tar.gz",
"Hash": "d6328ccd7dbb29c1c1c285b045083e02",
"Requirements": [
"Matrix",
"data.table",
"jsonlite"
]
},
...Iterate and develop
write, set a breakpoint, execute to the breakpoint, write, and critically simultaneously update the test
Result
By doing this process, I end up with a test that looks like this:
test_that("prepare_data", {
test_data = readRDS(
system.file('testing_gene_data.rds',
package = 'brentlabModelPerfTesting'))
suppressWarnings({prepped_data_subset = prep_data(test_data, 10)})
expect_equal(dim(prepped_data_subset$train), c(20*.8,10))
suppressWarnings({prepped_data_default = prep_data(test_data)})
expect_equal(dim(prepped_data_default$train), c(20*.8,19))
})And a function that looks like this:
#' prep data for the testing functions
#'
#' @importFrom assertthat assert_that
#' @importFrom rsample initial_split training testing
#' @importFrom xgboost xgb.DMatrix
#' @importFrom dplyr select all_of
#' @import futile.logger
#'
#' @param gene_data a data.frame where the label_vector column is the response
#' and the rest of the columns are predictors
#' @param feature_num is the number of feature_num to select from the data. must be
#' $>=$ 1.If exceeds ncol, ncol-1 is used (1 col removed as the label).
#' Default is to use all columns
#' @param label_vector name of the column to use as the response
#'
#' @return a list with xgboost data matrix objects, slots dtrain, dtest and wl
#' see the xgboost docs on wl. dtrain is the trainin data to
#'
#' @examples
#' test_data = readRDS(
#' system.file('testing_gene_data.rds',
#' package = 'brentlabModelPerfTesting'))
#' # note: suppressed warnings used here b/c the test data is too
#' # small to stratify. Typically, do not use supressWarnings.
#' suppressWarnings({prepped_data_subset = brentlabModelPerfTesting::prep_data(test_data, 10)})
#'
#' names(prepped_data_subset)
#'
#' @export
prep_data = function(
gene_data,
feature_num = Inf,
label_vector = 'ensg00000183117_19'){
assertthat::assert_that(is.numeric(feature_num))
assertthat::assert_that((label_vector) %in% colnames(gene_data))
gene_data_split <- rsample::initial_split(
gene_data,
prop = 0.8,
strata = rlang::sym(label_vector)
)
feature_num = max(1,feature_num)
flog.info(paste0('creating train test data with ',
as.character(min(ncol(gene_data)-1, feature_num)),
' predictor variables'))
# prepare the input data -- subset number of feature_num to
# test effect of feature_num on runtime/performance
train_dat = as.matrix(rsample::training(gene_data_split) %>%
dplyr::select(-all_of(label_vector)) %>%
.[,1:min(ncol(gene_data)-1, feature_num)])
test_dat = as.matrix(rsample::testing(gene_data_split) %>%
dplyr::select(-all_of(label_vector))) %>%
.[,1:min(ncol(gene_data_split)-1,feature_num)]
# common for all test iterations -- not testing effect of number of samples
train_labels = training(gene_data_split)[[label_vector]]
test_labels = testing(gene_data_split)[[label_vector]]
# return data for xgboost input
list(
train = xgboost::xgb.DMatrix(train_dat, label = train_labels),
test = xgboost::xgb.DMatrix(test_dat, label=test_labels)
)
}Check locally
In the same tab in the Rstudio pane where you did the
check, run theTesttab – make sure those pass first. Debug if necessary.Check your documentation format with
roxygen2::roxygenise(). Debug if necessary.Then, run the
checkthis locally (from the Environment/Build pane in Rstudio), as we did before with just the boilerplate code. Debug if necessary.
Conclusion
From this point, you can look at the repo and see what I’ve done
beyond this – more or less just adding a single additional function, and
writing the cmd line interface (which is in inst). You
should see the vignette Using_Package_Data for some
comments on including data in the project. See Usage for
instrutions on using both the cmd line script and the container. See
performance_testing_xgboost_on_htcf for results of performance testing
by varying the parameters which affect runtime/memory on both CPUs
(various numbers of cpus) the GPU. This also examines the scheduling
rate on both CPU and GPU batch runs on HTCF.