rna-dna-variant-compare • llfsRnaseq

Introduction

RNA Variants are compared to the WGS data, and used to either confirm a given library’s label, or where possible find a better match, using the following procecure:

Pre-processing

I use a modified version of the nf-core/rnavar RNA variant calling pipeline, which is based on the GATK protocol and uses their tools. You can find my modifications here. This allows for bam input, rather than fastq, and allows us to use the bams output by the RNAseq pipeline as opposed to starting from the alignment step. I process these in batches of no more than 30 – due to the way nextflow handles tracking intermediate data, and the number of times the bam files have to be manipulated, and therefore copied, the size of the work directory gets very large with smaller numbers of libraries.

Create an input samplesheet. An example of the bam input librariesheet can be found here. The rna variant calling work directory is large because the bams are copied multiple times as they are prepared for variant calling. I submit in sets of more more than 30, which ends up producing ~3 TB of intermediate data. Here is an example of creating librariesheets:

library(tidyverse)

set = '20230727'

bam_list = list.files(paste0('/mnt/scratch/llfs_variant_calling/data/', set), '*bam$', 
                      full.names = TRUE)
df = tibble(
  library = str_remove(basename(bam_list), '.markdup.sorted.bam'),
  fastq_1 = "",
  fastq_2 = "",
  bam = str_replace(bam_list, "/mnt/scratch", "/scratch/mblab/chasem")
)

file_path = paste0(
  "/mnt/scratch/llfs_variant_calling/librariesheet/",
  set,
  ".csv"
)

write_csv(df, file_path)

Run the pipeline.
- First, clone the repo and make sure that you switch to the bam_input branch.
- This is an example of the script I use to run the pipeline on HTCF (a cluster which uses SLURM). I have included the config file for HTCF, also.
I have the pipeline to set to symlink rather than copy files from work to results, so when you move the vcf and tabix files from scratch to whatever long term storage you’re using (brentlab HTCF users, I use LTOS), make sure you tell the copy program to follow the symlinks and grab the actual file, for instance --copy-links using rsync or -F using s3cmd.
Unless you’re doing 30 or less libraries, you’ll need to repeat this procedure in batches of ~30.

Comparing Libraries Against Their Original Labels

I first convert the RNA VCF files to GDS files and compare the variants on chr21 to the variants on chr21 on the DNA library of the same label. I use
the scripts in this R project to convert the vcf files to gds files, and then use the WGS dna gds files to extract and compare variants (also with scripts in that repo). The sbatch scripts and R scripts that I use are in the repo linked above.

Choosing Potential Mislabels

For libraries with more than 100 high quality variant sites that can be compared which have a ‘match ratio’, meaning total number of matching sites / total variant sites in the RNA for which good data exists in the DNA, greater than 94%, the sample is called confidently correctly labelled. All others are moved on to comparison against all 4,556 WGS samples.

Searching for Better Matches

For those libraries which fail to be called confidently correctly labelled, we re-run the same comparison, but this time on chr6 and against all 4,556 WGS samples. Next, we summarize the results and determine if we can confidently relabel the sample using the following thresholds:

If the empirical best match is different than the label, then the sample is labeled a `mislabel’.

If empirical best match has:

a match_ratio of >=94%
the pvalue that results from a chi-square test between the labelled match and the empirical best match is significant
and the chi-square pvalue between the empirical best match and the second best match is significant

then we relabel with the empirical best match. See the following tables:

wgs_compare: this stores the results of the same label comparisons
full_wgs_compare: this stores the summary statistics of the all by all comparisons
filtered_wgs_compare: a view of full_wgs_compare which includes the mislabel and relabel fields
library_relabel_summary: a view which provides a summary of all of the samples marked mislabel and/or relabel

Examples of compiling and parsing the results of the variant comparisons

The first step for either the same-same comparison or the all-by-all comparison is to read in the data. If this is the same-same comparison, then this dataframe should be added to the wgs_compare table in the database

library(tidyverse)
library(RSQLite)
library(here)

con = dbConnect(RSQLite::SQLite(),
                         here('llfs_rnaseq_data/llfs_rnaseq_database.sqlite'))

library_tbl = tbl(con,'library') %>%
  collect()

collected_metrics = Sys.glob('/mnt/scratch/llfs_rna_dna_compare_test/all_by_all_batch_20231016/*/compiled_metrics.csv')

# note: these could be done in such a way that they are collected by 'batch'
# and then the batch_id could be used to join to library
#names(chr6_collected_metrics) = basename(dirname(chr6_collected_metrics))

collected_metrics = '/mnt/scratch/llfs_rna_dna_compare_test/makeup_libraries/compiled_metrics.csv'

collected_metrics_df_raw = map(collected_metrics,read_csv) %>%
  bind_rows(.id='batch_id') %>%
  mutate(batch_id = as.integer(batch_id))

collected_metrics_df = collected_metrics_df_raw %>%
    select(batch_id,
           rna_sample,
           rna_visit,
           dna_sample,
           chr,
           overlap_fltr,
           n_match_fltr,
           homo_expr_cand_fltr) %>%
    dplyr::rename(visit=rna_visit) %>%
    dplyr::rename(total_variants=overlap_fltr,
                  matching_variants=n_match_fltr,
                  homo_expr_cand=homo_expr_cand_fltr) %>%
    mutate(match_ratio=matching_variants/total_variants,
           rna_library = as.character(rna_library),
           visit = as.character(visit)) %>%
  left_join(select(whatdatall_tbl,id,subject) %>%
              mutate(subject=as.character(subject)), 
            by=c('rna_library'='subject')) %>%
  mutate(rna_library = ifelse(!is.na(id), id, rna_library)) %>%
    left_join(select(library_tbl, pk,fastq_id,visit,batch_id), 
              by = c('rna_library' = 'fastq_id', 
                     'visit', 
                     'batch_id')) %>%
  select(-id) %>%
  dplyr::rename(library_id=pk,
                dna_subject=dna_sample) %>%
  select(library_id,dna_subject,chr,total_variants,
         matching_variants,homo_expr_cand, match_ratio)

# dbApppend(con,'wgs_compare',collected_metrics)

At this point, we would filter for libraries with less than .94 match ratio or less than 100 total variants, removing libraries that have percent_intergenic greater than .08, and submit the all-by-all comparison. Once that is finished, the data is parsed into a dataframe just as above. It is then summarized like so:

whatdatall_tbl = tbl(con,'whatdatall') %>%
  collect()

library_tbl = tbl(con,'library') %>%
  collect() %>%
  mutate(library_id = as.character(library_id))

all_by_all_summary = collected_metrics_df %>%
  left_join(select(library_tbl,library_id,fastq_id)) %>%
  left_join(select(whatdatall_tbl,id,subject) %>%
              mutate(id = as.character(id)),
            by = c('fastq_id'='id')) %>%
  dplyr::rename(rna_subject=subject) %>% 
  parse_all_by_all_compiled_results()

# dbAppendTable(con,'full_wgs_compare', all_by_all_summary)

auditing the results

library(tidyverse)
library(RSQLite)
library(here)

con = dbConnect(RSQLite::SQLite(),
                here('llfs_rnaseq_data/llfs_rnaseq_database.sqlite'))

wgs_subjects = read_csv('/mnt/scratch/llfs_rna_dna_compare_test/lookups/wgs_dna_subject_ids.txt',
                       col_names = FALSE)$X1

whatdatall_tbl = 
  tbl(con,'whatdatall') %>%
  collect() %>%
  mutate(id=as.character(id))

corrected_sample_tbl = 
  tbl(con, 'corrected_sample') %>%
  collect() %>%
  filter(reason %in% c('v1_transfer', 'phantom', 'manifest_typo')) %>%
  select(-pk)

genomic_origin_tbl = 
  tbl(con,'genomic_origin_view') %>%
  collect()

wgs_compare_tbl = 
  tbl(con,'wgs_compare') %>%
  collect()

library_tbl =
  tbl(con,'library') %>% 
  collect() %>%
  left_join(select(corrected_sample_tbl,library_id,whatdatall_id,reason)) %>%
  filter(is.na(reason) | reason %in% c('v1_transfer', 'manifest_typo')) %>%
  mutate(fastq_id=ifelse(!is.na(whatdatall_id),whatdatall_id,fastq_id)) %>%
  select(-c(whatdatall_id,reason)) %>%
  left_join(select(whatdatall_tbl,id,subject), by = c('fastq_id'='id')) %>%
  filter(str_detect(fastq_id,regex('pool',ignore_case=TRUE),negate=TRUE)) %>%
  filter(subject %in% wgs_subjects) %>%
  left_join(select(genomic_origin_tbl,library_id,percent_intergenic)) %>%
  filter(percent_intergenic <0.08)

stopifnot(length(setdiff(library_tbl$library_id,
                         wgs_compare_tbl$library_id))==0)

wgs_compare_potential_mislabels = 
  wgs_compare_tbl %>%
  filter(total_variants < 100 | match_ratio < 0.94) %>%
  filter(library_id %in%
           (genomic_origin_tbl %>% 
           filter(percent_intergenic < 0.08) %>% 
           pull(library_id))) %>%
  filter(library_id %in% library_tbl$library_id)

full_wgs_compare_tbl = 
  tbl(con,'full_wgs_compare') %>%
  collect()

setdiff(wgs_compare_potential_mislabels$library_id,
        full_wgs_compare_tbl$library_id)

A work in progress is taking all of the gvcf files (the modified nf-core/rnavar pipeline outputs put vcf and gvcf files), combining them, and then using hte GATK joint variant caller. Doing this would allow the calculation of a kinship matrix, which may help further confirm the re-labelling, and also possibly identify and hopefully provide evidence for re-labelling mislabeled libraries which do not have WGS data.

References

The following are good papers on what to do with RNA variants. This is offered as hopefully a spur to pursue these methods:

Tools and best practices for data processing in allelic expression analysis
- This is from 2015, so it is worth checking against more recent papers. However, it seems still relevant, in particular the discussion on WASP and the alignment step in general.
Transcriptomic signatures across human tissues identify functional rare genetic variation
- That’s gold Jerry! Gold!
Assessing allele-specific expression across multiple tissues from RNA-seq read data
Leveraging allelic imbalance to refine fine-mapping for eQTL studies