This QC step looks at the expression of the topmost expressed gene(s)
in the data set, and the expression of hemoglobin genes. This data is
added to the database and to the dds
objects.
First, I connect to the database and load the genes dds
data.
Supplement 1 of this paper identifies 12 hemoglobin-related genes.
hemoglobin_genes = read_csv(
system.file(package='llfsRnaseq', 'hemoglobin_genes.csv'))
hemoglobin_gene_filter =
str_remove(rownames(dds), "\\.\\d+$") %in% hemoglobin_genes$`Ensemble ID`
stopifnot(sum(hemoglobin_gene_filter) == 12)
RN7SL1 and RN7SL2 can, troublingly, make up nearly 20% of some libraries. They are consistently the loci which are assigned the greatest number of reads.
rn7sl1 = 'ENSG00000276168'
rn7sl2 = 'ENSG00000274012'
rn7sl_filter =
str_remove(rownames(dds), "\\.\\d+$") %in% c(rn7sl1, rn7sl2)
stopifnot(sum(rn7sl_filter) == 2)
expression_qc_df = tibble(
library_id = dds$library_id,
library_size = colSums(counts(dds)),
protein_coding_total = colSums(counts(dds[rowRanges(dds)$gene_type=='protein_coding', ])),
hemoglobin_counts = colSums(counts(dds[hemoglobin_gene_filter,])),
rn7sl1_counts = colSums(counts(dds[rn7sl_filter,])))