Using_Package_Data

library(brentlabModelPerfTesting)

#> Note that this is repeated in the inst/ directory so that it is shown
#> on the GitHub page, too. If you update this, make sure to also update
#> that copy.

Using inst/

After installing the package, the files in inst/ can be located with system.file. For example, to read one of the .rds files:


test_data = readRDS(
  system.file('<name of file>',
          package = 'brentlabModelPerfTesting'))
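
One thing to watch for with the pattern above: system.file returns an empty string when the file (or the package) cannot be found, and readRDS then fails with a less obvious error. A minimal sketch of a guard is below. The helper name `read_pkg_rds` is hypothetical and not part of brentlabModelPerfTesting.

```r
# Hypothetical helper (not part of brentlabModelPerfTesting): read an .rds
# file shipped with an installed package, failing loudly if it is missing.
read_pkg_rds <- function(file_name, package) {
  path <- system.file(file_name, package = package)
  # system.file() returns "" when nothing matches
  if (!nzchar(path)) {
    stop("'", file_name, "' not found in package '", package, "'",
         call. = FALSE)
  }
  readRDS(path)
}

# Intended usage, assuming the package is installed:
# test_data <- read_pkg_rds("testing_gene_data.rds", "brentlabModelPerfTesting")
```

Passing `mustWork = TRUE` to system.file is a built-in alternative that raises an error instead of returning an empty string.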

A concrete example

test_data = readRDS(
  system.file('testing_gene_data.rds',
          package = 'brentlabModelPerfTesting'))

head(test_data)
#> # A tibble: 6 × 20
#>   ensg00000183…¹ x1935…² x1935…³ x1935…⁴ x1935…⁵ x1935…⁶ x1935…⁷ x1935…⁸ x1935…⁹
#>            <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
#> 1           2.32       0       0       0       0       0       0       0       0
#> 2           2.63       0       0       0       0       0       0       0       0
#> 3          47.1        0       0       0       0       0       0       0       0
#> 4           2.02       0       0       0       0       0       0       0       0
#> 5           1.56       0       0       0       0       0       0       0       0
#> 6           4.70       0       0       0       0       0       0       0       0
#> # … with 11 more variables: x1935631 <dbl>, x1935696 <dbl>, x1935797 <dbl>,
#> #   x1935819 <dbl>, x1935887 <dbl>, x1935904 <dbl>, x1935955 <dbl>,
#> #   x1935992 <dbl>, x1935993 <dbl>, x1936035 <dbl>, x1936227 <dbl>, and
#> #   abbreviated variable names ¹ensg00000183117_19, ²x1935406, ³x1935415,
#> #   ⁴x1935433, ⁵x1935521, ⁶x1935546, ⁷x1935563, ⁸x1935574, ⁹x1935625

List all available files

Note: there is more here than just the contents of `inst`. This call actually lists the directory which stores the installed version of the package in your `.libPaths()`.

list.files(system.file(package = 'brentlabModelPerfTesting'))
#>  [1] "cpu_gpu_perf_results.rds" "DESCRIPTION"             
#>  [3] "gene_data_clean.rds"      "help"                    
#>  [5] "html"                     "INDEX"                   
#>  [7] "LICENSE"                  "Meta"                    
#>  [9] "NAMESPACE"                "R"                       
#> [11] "README.md"                "slurm-simple.tmpl"       
#> [13] "testing_gene_data.rds"    "xgboost_pref_testing.R"

All of the .rds files in this directory are data.frames (a tibble is a data.frame) – if you read them in with readRDS, you’ll have a data.frame in memory.
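
As a sketch of how one might list only the .rds files and confirm that each one reads in as a data.frame, the example below uses a temporary directory with made-up files as a stand-in for the installed package directory, so it runs without the package. The directory name and file contents are assumptions for illustration only.

```r
# Stand-in for the installed package directory (made-up contents).
pkg_dir <- file.path(tempdir(), "fake_pkg_dir")
dir.create(pkg_dir, showWarnings = FALSE)
saveRDS(data.frame(a = 1:3), file.path(pkg_dir, "a.rds"))
saveRDS(data.frame(b = c(0, 1)), file.path(pkg_dir, "b.rds"))
file.create(file.path(pkg_dir, "DESCRIPTION"))

# Keep only files whose names end in ".rds"
rds_paths <- list.files(pkg_dir, pattern = "\\.rds$", full.names = TRUE)

# Read each file and check that it is a data.frame
is_df <- vapply(rds_paths, function(p) is.data.frame(readRDS(p)), logical(1))

# With the real package installed, the directory would instead be:
# pkg_dir <- system.file(package = "brentlabModelPerfTesting")
```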

NOTE: our SNP matrices would be better stored and operated on as sparse matrices. These are essentially matrices whose column vectors (R is column-major by default) are stored in a compressed form that keeps only the non-zero entries, so the long runs of 0s take up almost no space. Some modeling software, XGBoost being one, does accept sparse matrices as input. I have not tested this yet.
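
A minimal sketch of what the conversion could look like, assuming the Matrix package (a recommended package that ships with most R installations). The toy matrix is made up and only mimics the mostly-zero SNP columns. This has not been tested against the real data.

```r
library(Matrix)

# Toy stand-in for a SNP matrix: 20 subjects x 10 SNPs, mostly 0 (REF).
set.seed(1)
dense <- matrix(0, nrow = 20, ncol = 10)
dense[sample(length(dense), 8)] <- 1  # mark 8 random entries as ALT

# Convert to a column-compressed sparse matrix
sparse <- Matrix(dense, sparse = TRUE)

# Only the non-zero entries are stored
nnzero(sparse)

# Converting back recovers the original dense matrix
all(as.matrix(sparse) == dense)
```

If this route were pursued, an object like `sparse` is the kind of input that xgboost::xgb.DMatrix is documented to accept, though that step is not shown or tested here.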

cpu_gpu_perf_results.rds

This is a data.frame containing performance results from tests run on various numbers of CPUs and on the GPU, while varying the number of features, the number of rounds, max_depth and max_bin.

gene_data_clean.rds

A 1300 subject by 81,822 feature matrix. The first column is ensg00000183117_19 and represents the expression of that gene in the 1300 subjects. The rest of the columns are SNP vectors with names like x1935887. They are mostly 0s, which represent the REF genotype at that SNP; a 1 represents ALT. (Note: R doesn’t like numeric column names since they are easy to confuse with a column index. Hence, a ‘clean’ R data.frame will add an x to numerically named columns.)
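
To make the layout concrete, here is a small sketch that splits a data.frame of this shape into a response vector and a feature matrix. The toy values are made up; only the column layout follows the description above.

```r
# Toy data with the same layout: expression in column 1, SNPs after it.
toy <- data.frame(
  ensg00000183117_19 = c(2.3, 2.6, 47.1, 2.0),
  x1935406 = c(0, 0, 1, 0),
  x1935415 = c(0, 1, 0, 0)
)

# Response: expression of the gene in each subject
y <- toy[[1]]

# Features: every remaining column, as a numeric matrix
X <- as.matrix(toy[, -1, drop = FALSE])

# Base R's make.names() prefixes a numeric name with an upper-case "X";
# the lower-case x in this dataset presumably comes from a separate
# cleaning step (an assumption, not confirmed by the package docs).
make.names("1935406")
```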

slurm-simple.tmpl

This is a simple Slurm template for use with future.batchtools.
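
A hedged sketch of how the template might be wired up, assuming both this package and future.batchtools are installed and that the code runs on a machine with access to a Slurm scheduler. The guard means the snippet does nothing if either piece is missing.

```r
# Locate the template inside the installed package ("" if not found)
tmpl <- system.file("slurm-simple.tmpl",
                    package = "brentlabModelPerfTesting")

# Only set the plan when the template and future.batchtools are available
if (nzchar(tmpl) && requireNamespace("future.batchtools", quietly = TRUE)) {
  future::plan(future.batchtools::batchtools_slurm, template = tmpl)
}
```

Setting the plan does not submit anything by itself; jobs are only sent to the scheduler when futures are created afterwards.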

testing_gene_data.rds

A 20 x 20 subset of gene_data_clean: its first 20 rows and first 20 columns.

xgboost_perf_testing.R

An executable command-line script intended for performance testing XGBoost (see ?brentlabModelPerfTesting::perf_test_xgboost or the reference docs for details). An example of using this script in a container is in the Usage section of the docs.