RNA Variants are compared to the WGS data, and used to either confirm a given library’s label, or where possible find a better match, using the following procecure:
I use a modified version of the nf-core/rnavar RNA variant calling
pipeline, which is based on the GATK protocol and uses their tools. You
can find my modifications here. This
allows for bam input, rather than fastq, and allows us
to use the bams output by the RNAseq pipeline as opposed to starting
from the alignment step. I process these in batches of no more than 30 –
due to the way nextflow handles tracking intermediate data, and the
number of times the bam files have to be manipulated, and therefore
copied, the size of the work
directory gets very large with
smaller numbers of libraries.
bam
input librariesheet can be found here.
The rna variant calling work directory is large because the bams are
copied multiple times as they are prepared for variant calling. I submit
in sets of more more than 30, which ends up producing ~3 TB of
intermediate data. Here is an example of creating librariesheets:
library(tidyverse)
set = '20230727'
bam_list = list.files(paste0('/mnt/scratch/llfs_variant_calling/data/', set), '*bam$',
full.names = TRUE)
df = tibble(
library = str_remove(basename(bam_list), '.markdup.sorted.bam'),
fastq_1 = "",
fastq_2 = "",
bam = str_replace(bam_list, "/mnt/scratch", "/scratch/mblab/chasem")
)
file_path = paste0(
"/mnt/scratch/llfs_variant_calling/librariesheet/",
set,
".csv"
)
write_csv(df, file_path)
bam_input
branch.work
to results
, so when you move the
vcf
and tabix
files from scratch
to whatever long term storage you’re using (brentlab HTCF users, I use
LTOS), make sure you tell the copy program to follow the symlinks and
grab the actual file, for instance --copy-links
using
rsync
or -F
using s3cmd
.I first convert the RNA VCF files to GDS files and compare the
variants on chr21 to the variants on chr21 on the DNA library of the
same label. I use
the scripts
in this R project to convert the vcf
files to
gds
files, and then use the WGS dna gds
files
to extract and compare variants (also with scripts in that repo). The
sbatch scripts and R scripts that I use are in the repo linked
above.
For libraries with more than 100 high quality variant sites that can
be compared which have a ‘match ratio’, meaning total number of
matching sites
/ total variant
sites in the
RNA for which good data exists in the DNA, greater than 94%, the sample
is called confidently correctly labelled. All others are moved on to
comparison against all 4,556 WGS samples.
For those libraries which fail to be called confidently correctly labelled, we re-run the same comparison, but this time on chr6 and against all 4,556 WGS samples. Next, we summarize the results and determine if we can confidently relabel the sample using the following thresholds:
If the empirical best match is different than the label, then the sample is labeled a `mislabel’.
If empirical best match has:
match_ratio
of >=94%then we relabel with the empirical best match. See the following tables:
wgs_compare
: this stores the results of the same
label comparisons
full_wgs_compare
: this stores the summary statistics
of the all by all comparisons
filtered_wgs_compare
: a view of
full_wgs_compare
which includes the mislabel
and relabel
fields
library_relabel_summary
: a view which provides a
summary of all of the samples marked mislabel
and/or
relabel
The first step for either the same-same comparison or the all-by-all comparison is to read in the data. If this is the same-same comparison, then this dataframe should be added to the wgs_compare table in the database
library(tidyverse)
library(RSQLite)
library(here)
con = dbConnect(RSQLite::SQLite(),
here('llfs_rnaseq_data/llfs_rnaseq_database.sqlite'))
library_tbl = tbl(con,'library') %>%
collect()
collected_metrics = Sys.glob('/mnt/scratch/llfs_rna_dna_compare_test/all_by_all_batch_20231016/*/compiled_metrics.csv')
# note: these could be done in such a way that they are collected by 'batch'
# and then the batch_id could be used to join to library
#names(chr6_collected_metrics) = basename(dirname(chr6_collected_metrics))
collected_metrics = '/mnt/scratch/llfs_rna_dna_compare_test/makeup_libraries/compiled_metrics.csv'
collected_metrics_df_raw = map(collected_metrics,read_csv) %>%
bind_rows(.id='batch_id') %>%
mutate(batch_id = as.integer(batch_id))
collected_metrics_df = collected_metrics_df_raw %>%
select(batch_id,
rna_sample,
rna_visit,
dna_sample,
chr,
overlap_fltr,
n_match_fltr,
homo_expr_cand_fltr) %>%
dplyr::rename(visit=rna_visit) %>%
dplyr::rename(total_variants=overlap_fltr,
matching_variants=n_match_fltr,
homo_expr_cand=homo_expr_cand_fltr) %>%
mutate(match_ratio=matching_variants/total_variants,
rna_library = as.character(rna_library),
visit = as.character(visit)) %>%
left_join(select(whatdatall_tbl,id,subject) %>%
mutate(subject=as.character(subject)),
by=c('rna_library'='subject')) %>%
mutate(rna_library = ifelse(!is.na(id), id, rna_library)) %>%
left_join(select(library_tbl, pk,fastq_id,visit,batch_id),
by = c('rna_library' = 'fastq_id',
'visit',
'batch_id')) %>%
select(-id) %>%
dplyr::rename(library_id=pk,
dna_subject=dna_sample) %>%
select(library_id,dna_subject,chr,total_variants,
matching_variants,homo_expr_cand, match_ratio)
# dbApppend(con,'wgs_compare',collected_metrics)
At this point, we would filter for libraries with less than .94 match ratio or less than 100 total variants, removing libraries that have percent_intergenic greater than .08, and submit the all-by-all comparison. Once that is finished, the data is parsed into a dataframe just as above. It is then summarized like so:
whatdatall_tbl = tbl(con,'whatdatall') %>%
collect()
library_tbl = tbl(con,'library') %>%
collect() %>%
mutate(library_id = as.character(library_id))
all_by_all_summary = collected_metrics_df %>%
left_join(select(library_tbl,library_id,fastq_id)) %>%
left_join(select(whatdatall_tbl,id,subject) %>%
mutate(id = as.character(id)),
by = c('fastq_id'='id')) %>%
dplyr::rename(rna_subject=subject) %>%
parse_all_by_all_compiled_results()
# dbAppendTable(con,'full_wgs_compare', all_by_all_summary)
auditing the results
library(tidyverse)
library(RSQLite)
library(here)
con = dbConnect(RSQLite::SQLite(),
here('llfs_rnaseq_data/llfs_rnaseq_database.sqlite'))
wgs_subjects = read_csv('/mnt/scratch/llfs_rna_dna_compare_test/lookups/wgs_dna_subject_ids.txt',
col_names = FALSE)$X1
whatdatall_tbl =
tbl(con,'whatdatall') %>%
collect() %>%
mutate(id=as.character(id))
corrected_sample_tbl =
tbl(con, 'corrected_sample') %>%
collect() %>%
filter(reason %in% c('v1_transfer', 'phantom', 'manifest_typo')) %>%
select(-pk)
genomic_origin_tbl =
tbl(con,'genomic_origin_view') %>%
collect()
wgs_compare_tbl =
tbl(con,'wgs_compare') %>%
collect()
library_tbl =
tbl(con,'library') %>%
collect() %>%
left_join(select(corrected_sample_tbl,library_id,whatdatall_id,reason)) %>%
filter(is.na(reason) | reason %in% c('v1_transfer', 'manifest_typo')) %>%
mutate(fastq_id=ifelse(!is.na(whatdatall_id),whatdatall_id,fastq_id)) %>%
select(-c(whatdatall_id,reason)) %>%
left_join(select(whatdatall_tbl,id,subject), by = c('fastq_id'='id')) %>%
filter(str_detect(fastq_id,regex('pool',ignore_case=TRUE),negate=TRUE)) %>%
filter(subject %in% wgs_subjects) %>%
left_join(select(genomic_origin_tbl,library_id,percent_intergenic)) %>%
filter(percent_intergenic <0.08)
stopifnot(length(setdiff(library_tbl$library_id,
wgs_compare_tbl$library_id))==0)
wgs_compare_potential_mislabels =
wgs_compare_tbl %>%
filter(total_variants < 100 | match_ratio < 0.94) %>%
filter(library_id %in%
(genomic_origin_tbl %>%
filter(percent_intergenic < 0.08) %>%
pull(library_id))) %>%
filter(library_id %in% library_tbl$library_id)
full_wgs_compare_tbl =
tbl(con,'full_wgs_compare') %>%
collect()
setdiff(wgs_compare_potential_mislabels$library_id,
full_wgs_compare_tbl$library_id)
A work in progress is taking all of the gvcf files (the modified nf-core/rnavar pipeline outputs put vcf and gvcf files), combining them, and then using hte GATK joint variant caller. Doing this would allow the calculation of a kinship matrix, which may help further confirm the re-labelling, and also possibly identify and hopefully provide evidence for re-labelling mislabeled libraries which do not have WGS data.
The following are good papers on what to do with RNA variants. This is offered as hopefully a spur to pursue these methods: