Sex mislabeling implies sample mislabels, so here we’re looking for samples which group with the opposite of their sex label. Sorry libs, this is quite binary.
It happens to be the case that using the top two PCs calculated on the top 500 most variable genes very cleanly separates the samples by sex. We use this to do this analysis below:
pca_sex_full = suspicious_samples_data %>%
filter(purpose == 'experiment') %>%
ggplot(aes(PC1, PC2,
color = sex,
key = subject,
size = inferred_sex_mislabel,
alpha = inferred_sex_mislabel)) +
geom_point() +
geom_abline(slope=slope,
intercept = intercept) +
scale_alpha_manual(values = c('TRUE' = 1, 'FALSE' = 0.1)) +
scale_size_manual(values = c('TRUE' = 5, 'FALSE' = 1)) +
scale_color_manual(values = c("1" = "#B3589A",
"2"="#9BBF85",
"0" = "#F4A000"))
pca_sex_full
pca_sex_suspicious_labelled = suspicious_samples_data %>%
filter(inferred_sex_mislabel, sex != 0) %>%
ggplot(aes(PC1, PC2,
color = sex,
label = ifelse(inferred_sex_mislabel & sex != 0,
paste(as.character(subject),
as.character(visit),
sep="_"), NA))) +
geom_point(aes(size = inferred_sex_mislabel, alpha = inferred_sex_mislabel)) +
geom_text_repel(color='black') +
geom_abline(slope=slope,
intercept = intercept) +
scale_alpha_manual(values = c('TRUE' = 1, 'FALSE' = 0.1)) +
scale_size_manual(values = c('TRUE' = 5, 'FALSE' = 1)) +
scale_color_manual(values = c("1" = "#B3589A",
"2"="#9BBF85",
"control" = "#F4A000"))
pca_sex_suspicious_labelled
tl;dr: review the results, update the database, and
update the dds
object QC as appropriate.
Ideally, these sex mislabels are already captured in the database. Here, we check that these samples are all labelled “suspicious_sex”, that there are no “suspicious_sex” samples that aren’t here (eg, maybe we were able to positively identify a sample and forgot to remove the suspicious_sex label), and update the database if necessary.
Editing the database with this information can be a challenge, and requires that you understand the table structure. Please read the documentation on the database, and get comfortable with the key relationships between tables, and the llfs_rnaseq_metadata view.
Except for new samples which have not been reviewed, all old samples
which display unusual grouping should be accounted for in the
suspicious_sex
column of sample
(this is
TRUE
when WGS data is unavailable), in
mislabel
(this is TRUE
when the wgs data says
the sample isn’t what it says it is, but there is no positive match), or
in a sample
table notes
, which might state
that the sample was deemed to likely be correctly labelled for some
other reason.
If you notice something wonky here, it takes a good deal of
investigation – possibly you just need to update new sample info, most
likely the suspicious_sex
boolean value – but you might
also need to do some investigating to make sure you agree with how
samples are being labelled and re-labelled.