Using_Package_Data
Note: this page is repeated in the inst/ directory so that it is also
shown on the GitHub page. If you update this, make sure to update that
copy as well.
Using inst/
After installing the package, all of these files are available like so:
test_data = readRDS(
  system.file('<name of file>',
              package = 'brentlabModelPerfTesting'))
A concrete example
test_data = readRDS(
  system.file('testing_gene_data.rds',
              package = 'brentlabModelPerfTesting'))
head(test_data)
#> # A tibble: 6 × 20
#>   ensg00000183…¹ x1935…² x1935…³ x1935…⁴ x1935…⁵ x1935…⁶ x1935…⁷ x1935…⁸ x1935…⁹
#>            <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
#> 1           2.32       0       0       0       0       0       0       0       0
#> 2           2.63       0       0       0       0       0       0       0       0
#> 3          47.1        0       0       0       0       0       0       0       0
#> 4           2.02       0       0       0       0       0       0       0       0
#> 5           1.56       0       0       0       0       0       0       0       0
#> 6           4.70       0       0       0       0       0       0       0       0
#> # … with 11 more variables: x1935631 <dbl>, x1935696 <dbl>, x1935797 <dbl>,
#> #   x1935819 <dbl>, x1935887 <dbl>, x1935904 <dbl>, x1935955 <dbl>,
#> #   x1935992 <dbl>, x1935993 <dbl>, x1936035 <dbl>, x1936227 <dbl>, and
#> #   abbreviated variable names ¹ensg00000183117_19, ²x1935406, ³x1935415,
#> #   ⁴x1935433, ⁵x1935521, ⁶x1935546, ⁷x1935563, ⁸x1935574, ⁹x1935625
in your .libPaths()
list.files(system.file(package = 'brentlabModelPerfTesting'))
#>  [1] "cpu_gpu_perf_results.rds" "DESCRIPTION"
#>  [3] "gene_data_clean.rds"      "help"
#>  [5] "html"                     "INDEX"
#>  [7] "LICENSE"                  "Meta"
#>  [9] "NAMESPACE"                "R"
#> [11] "README.md"                "slurm-simple.tmpl"
#> [13] "testing_gene_data.rds"    "xgboost_pref_testing.R"
All of the .rds files in this directory are data.frames (a tibble is
a data.frame). If you read them in with readRDS, you'll have a
data.frame in memory.
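One caveat with the system.file() pattern: it returns an empty string, not an error, when the file or the package cannot be found, so readRDS() then fails with a less helpful message. A minimal sketch of a guard follows. The helper name read_pkg_rds is hypothetical and is not part of brentlabModelPerfTesting.

```r
# Hypothetical helper (not part of brentlabModelPerfTesting): read an .rds
# file shipped with an installed package, failing early if it is missing.
read_pkg_rds <- function(file, package) {
  path <- system.file(file, package = package)
  # system.file() returns "" when the file (or the package) cannot be found
  if (!nzchar(path)) {
    stop("'", file, "' not found in package '", package, "'", call. = FALSE)
  }
  readRDS(path)
}

# Usage, assuming brentlabModelPerfTesting is installed:
# test_data <- read_pkg_rds("testing_gene_data.rds", "brentlabModelPerfTesting")
# is.data.frame(test_data)
```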
NOTE: our SNP matrices would be better stored and operated on as sparse matrices. In the common compressed column format, only the non-zero entries of each column are stored (R is column-major by default), which compresses away the long runs of 0s. Some modeling software, XGBoost being one, accepts sparse matrices as input. I have not tested this yet.
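As a rough illustration of the sparse storage idea in the note above, the sketch below converts a small toy 0/1 matrix to a sparse matrix with the Matrix package (a recommended package shipped with R). The toy data, the variable names, and the optional xgboost step are assumptions for illustration only; this has not been run on the packaged data.

```r
library(Matrix)

# Toy stand-in for the SNP columns: 100 subjects by 50 SNPs, mostly 0 (REF)
set.seed(1)
snps <- matrix(rbinom(100 * 50, size = 1, prob = 0.02), nrow = 100)
colnames(snps) <- paste0("x", seq_len(50))
y <- rnorm(100)  # toy stand-in for the expression column

# Compressed, column-oriented sparse storage
snps_sparse <- Matrix(snps, sparse = TRUE)

# Only the non-zero entries are stored
nnzero(snps_sparse) == sum(snps != 0)

# Compare the memory footprint of the dense and sparse versions
object.size(snps)
object.size(snps_sparse)

# Optional: hand the sparse matrix to XGBoost, if it is installed
if (requireNamespace("xgboost", quietly = TRUE)) {
  dtrain <- xgboost::xgb.DMatrix(data = snps_sparse, label = y)
}
```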
cpu_gpu_perf_results.rds
This is a data.frame containing performance results from tests on various numbers of CPUs and on the GPU, while varying the number of features, number of rounds, max_depth and max_bin.
gene_data_clean.rds
A 1300 subject by 81,822 feature matrix. The first column is
ensg00000183117_19 and represents the expression of that
gene in the 1300 subjects. The remaining columns are SNP vectors with
names like x1935887. They are mostly 0s, which represent the REF
genotype at that SNP; a 1 represents ALT. (Note: R doesn’t like numeric
column names, since they are easy to confuse with a column index. Hence
a ‘clean’ R data.frame adds an x to
numerically named columns.)
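The naming convention and the column layout described above can be sketched on toy data. This is one way to get lower-case x-prefixed names using base R's make.names(); it is not necessarily how the packaged data were cleaned, and the toy values and variable names are illustrative assumptions.

```r
# Toy data: one expression column followed by two numerically named SNP columns
raw <- data.frame(c(2.32, 2.63), c(0, 1), c(0, 0))
names(raw) <- c("ENSG00000183117_19", "1935406", "1935415")

# make.names() prefixes names that start with a digit with "X";
# lower-casing gives names in the style used in this data
names(raw) <- tolower(make.names(names(raw)))
names(raw)
#> [1] "ensg00000183117_19" "x1935406"           "x1935415"

# Split into the response (gene expression) and the SNP feature matrix
expression <- raw[[1]]
snps <- as.matrix(raw[, -1])
dim(snps)
#> [1] 2 2
```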
slurm-simple.tmpl
This is a simple Slurm template for use with future.batchtools.
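A minimal sketch of how the template might be wired into a future plan follows. It assumes the future and future.batchtools packages are installed and that the code runs on a machine that can submit Slurm jobs. The wrapper name use_slurm_plan is hypothetical, and any resources the template expects are not documented here, so none are passed.

```r
# Hypothetical wrapper (not part of brentlabModelPerfTesting): point
# future.batchtools at the Slurm template shipped with the package.
use_slurm_plan <- function() {
  tmpl <- system.file("slurm-simple.tmpl",
                      package = "brentlabModelPerfTesting")
  # system.file() returns "" if the package or file is not installed
  if (!nzchar(tmpl)) {
    stop("slurm-simple.tmpl not found; is the package installed?",
         call. = FALSE)
  }
  # Futures created after this call are submitted as Slurm jobs
  future::plan(future.batchtools::batchtools_slurm, template = tmpl)
}

# Usage on a Slurm cluster:
# use_slurm_plan()
```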
xgboost_perf_testing.R
An executable command-line script intended for performance testing
XGBoost (see ?brentlabModelPerfTesting::perf_test_xgboost or the
reference docs for details). An example of using this script in a
container is in the Usage section of the docs.