database-overview • llfsRnaseq

Introduction

This database will make re-labeling libraries less error prone, and will make it much easier to submit to NCBI when the time comes. Below, you will find a description of each table and view in the database. For each table and view, the full description of each column is hidden by default – just click the “view full details” to expand a table that provides the definitions. For each table (but not view, which doesn’t have a primary key per se), the primary and foreign key(s) are shown in the Description (not collapsed by default).

Note that one library might have up to six or more (should be even, of course) fastq files – the paired-end R1 and R2, and sometimes a single library has multiple R1 and R2 files over different lanes.

IMPORTANT

Please note the following:

All tables, except for library, whose primary key is library_id, and batch, whose primary key is batch_id, have an auto-incremented primary key called pk.

fastq_id is always a foreign key to the fastq table, but it has an additional meaning: the fastq_id is the id which the actual fastq file possesses. This fastq_id may be wrong – for example, if there was a typo, or that particular ID is in the visit_1 transfer set, or if it was determined to be mislabeled from the RNA variants vs WGS data comparison. In this case, a library should be thought of as having a fastq_id, derived from the filename of the fastq file, but also a whatdatall_id, which results from a correction (see the manual_library_relabel table and full_wgs_compare table). A given library’s subject will always be assigned from the library’s whatdatall_id, which, if you are using one of my views, is assigned after relabeling any known mislabels.

Views

A view is a saved SQL query. It provides a way to access joined and/or otherwise manipulated data as if it were a table.

llfs_rnaseq_subject

This is the main LLFS metadata and is what is used to create the metadata in the LLFS RNA Seq data releases.

NOTE: In the release data, the column count_headers is added, which corresponds to the corresponding release’s plain text count data headers. It is formed by appending the string ‘library_’ to the library_id.

LLFS RNAseq Metadata View

Field	Datatype	Description
library_id	INTEGER	The library ID (foreign key to the `library` table)
legal	TEXT	Boolean column. Legal status. 1 if legal
subject	INTEGER	The LLFS subject identifier (see `whatdatall`)
visit	TEXT	Despite it not being an officially defined variable, visit is what is used in the fastqfilename. This must be translated to visitcode in order to join the RNAseq data on (subject,visitcode) to the rest of the LLFS phenotype data
visitcode	TEXT	The whatdatall LLFS phenotype visitcode value. See the visitcode_map table for details on how this is derived from visit
batch	INTEGER	This is the name of the data directory on the DSG cluster and corresponds to the batch in which the sample was sequenced. See table `batch`
sex	INTEGER	The subject’s sex. Note that if a library is relabeled, the sex is determined after relabeling
suspicious_sex	INTEGER	see the `library` table for more details. This is this is a boolean (0,1 where 1 is TRUE) field which is TRUE if the library clearly shows sex chromosome expression opposite from its label. This column exists because not all RNAseq subjects have WGS data, so a TRUE here means that the subject did not exist in the WGS set AND it shows evidence of being mislabeled by sex chrom expression
purpose	TEXT	either ‘control_phantom’, ‘control_pool’
mislabel	INTEGER	See the `library` table for more details. this is a boolean (0,1 where 1 is TRUE) field which is TRUE if the sample was clearly mislabelled during the WGS comparison, but the actual label was not found
relabel	INTEGER	See the `library` table for more details. this is a boolean (0,1 where 1 is TRUE) field which is TRUE if the sample has been relabeled for any reason, including as a result of the WGS vs RNA variant comparison, because the sample is a phantom control, or due to manifest typos or v1_transfers. See `notes` below.
notes	INTEGER	notes on the samples, such as why it was labelled mislabel or relabel
y_x_ratio	REAL	Ratio of reads mapped to chromosome Y to chromosome X
library_total	REAL	total size of the library
percent_intergenic	REAL	Percentage of reads in intergenic regions
hemoglobin_total	REAL	total number of reads counted over 12 hemoglobin genes
percent_hemoglobin	REAL	total raw counts over 12 hemoglobin genes / library_size. see the expression_qc table for more details
protein_coding_total	REAL	raw reads over loci denoted “protein_coding” in the gtf (you can discover this by loading the `dds` object and doing this `rowRanges(dds)$type == 'protein_coding'`
percent_protein_coding	REAL	protein_coding_total / library_size. see the expression_qc table for more details
rn7sl_total	REAL	total number of reads over two rn7sl genes. These are highly over represented in nearly all of the libraries. see table `expression_qc` for more details
percent_rn7sl	REAL	total raw counts over two loci that are consistently, by far, the loci with the most counts. See expression_qc for more details

llfs_rnaseq_id

This should be treated as an intermediate table, though it is useful if you wish to associate the LLFS IDs with the samples.

LLFS metadata ID View

Field	Datatype	Description
library_id	INTEGER	The library ID (foreign key to the `library` table)
llfs_id	INTEGER	The LLFS ID (see `whatdatall`)
visit	TEXT	Despite it not being an officially defined variable, visit is what is used in the fastqfilename. This must be translated to visitcode in order to join the RNAseq data on (subject,visitcode) to the rest of the LLFS phenotype data
visitcode	TEXT	The whatdatall LLFS phenotype visitcode value. See the visitcode_map table for details on how this is derived from visit
data_dir	INTEGER	This is the name of the data directory on dsg. See table `batch`
purpose	TEXT	either ‘control_phantom’, ‘control_pool’
suspicious_sex	INTEGER	see the library table for more details. This is this is a boolean (0,1 where 1 is TRUE) field which is TRUE if the library clearly shows sex chromosome expression opposite from its label. Note that this is applied after all other re-labeling is completed
mislabel	INTEGER	See the `library` table for more details. this is a boolean (0,1 where 1 is TRUE) field which is TRUE if the library was clearly mislabeled during the WGS comparison, but the actual label was not found
relabel	INTEGER	See the `library` table for more details. this is a boolean (0,1 where 1 is TRUE) field which is TRUE if the library has been relabeled for any reason, including as a result of the WGS vs RNA variant comparison, because the library is a phantom control, or due to manifest typos or v1_transfers. See `notes` below.
notes	INTEGER	notes on the libraries, such as why it was labelled mislabel or relabel
legal	TEXT	Boolean column. Legal status. 1 if legal
y_x_ratio	REAL	Ratio of reads mapped to chromosome Y to chromosome X
library_size	REAL	total size of the library
percent_intergenic	REAL	Percentage of reads in intergenic regions
hemoglobin_total	REAL	total number of reads counted over 12 hemoglobin genes
percent_hemoglobin	REAL	total raw counts over 12 hemoglobin genes / library_size. see the expression_qc table for more details
protein_coding_total	REAL	raw reads over loci denoted “protein_coding” in the gtf (you can discover this by loading the `dds` object and doing this `rowRanges(dds)$type == 'protein_coding'`
percent_protein_coding	REAL	protein_coding_total / library_size. see the expression_qc table for more details
rn7sl_total	REAL	total number of reads over two rn7sl genes. These are highly over represented in nearly all of the libraries. see table `expression_qc` for more details
percent_rn7sl	REAL	total raw counts over two loci that are consistently, by far, the loci with the most counts. See expression_qc for more details

filtered_wgs_compare

This view summarizes the result of the all-by-all comparison between a given RNA Seq library and all 4,556 WGS samples. The all-by-all comparison candidates are chosen if the library is passing the percent_intergenic filter of 8%, has a match_ratio of less than .94 or has fewer than 100 total_variants.

Filtered WGS Compare

Field	Datatype	Description
empirical_best_id	INTEGER	The LLFS ID corresponding to the best empirical match between RNA variants and WGS data
library_id	INTEGER	`library_id` from the `library` table
labelled_dna_subject	INTEGER	the LLFS subject identifier corresponding to the LLFS ID which is contained in the library’s fastq filename
labelled_total_variants	INTEGER	The total number of high quality variant locations in the RNA library over which there high quality WGS genotype data for a given subject
labelled_match_variants	INTEGER	The total number of `total_variants` locations over which the genotypes in the two samples matched
labelled_match_ratio	INTEGER	`match_variants`/`total_variants`
empirical_best_subject	INTEGER	The LLFS subject corresponding to the best empirical match between RNA variants and WGS data
empirical_best_[total_variants,matching_variants,match_Ratio]	INTEGER	same as the `labelled_...` columns above, but for the best empirical match
empirical_next_subject	INTEGER	The LLFS subject corresponding to the second best empirical match in the all by all comparison
empirical_next_[total_variants,matching_variants,match_Ratio]	INTEGER	same as the `labelled_...` columns above, but for the second best empirical match
chisq_labelled	INTEGER	chisq pvalue for the labelled vs empirical best data
chisq_empirical	INTEGER	chisq pvalue for the best empirical match vs the next best empirical match
relabel	BOOLEAN	if the empirical best match is different than the original label, has a match ratio of >= .94 and both pvalues are significant, then the sample is relabeled (1). Else it is not (0)

library_relabel_summary

This provides an overview of all relabelled samples

Library Relabel Summary

Field	Datatype	Description
library_id	TEXT	`library_id` from the `library` table
proposed_relabel_id	INTEGER	The proposed relabel LLFS ID. Note that this is only applied in the metadata if `relabel` == 1
mislabel	BOOLEAN	0 if the sample is not a mislabel (ie, a phantom control), 1 if it is
relabel	BOOLEAN	0 if the data does not support relabeling. 1 if it does. See the `filtered_wgs_compare` view for details
notes	REAL	any additional notes from the underlying tables

genomic_origin_view

This view provides information about the genomic origin of reads based on the qualimap_genomic_origin table.

Qualimap Genomic Origin View

Field	Datatype	Description
library_id	TEXT	Foreign key to the library table
exonic	INTEGER	Number of reads in exonic regions
intronic	INTEGER	Number of reads in intronic regions
intergenic	INTEGER	Number of reads in intergenic regions
percent_exonic	REAL	Percentage of reads in exonic regions
percent_intronic	REAL	Percentage of reads in intronic regions
percent_intergenic	REAL	Percentage of reads in intergenic regions

xy_ratio

This view provides information about the XY ratio based on the samtools_idxstats_xy_plot table.

XY Ratio View

Field	Datatype	Description
library_id	TEXT	Foreign key to the library table
chromosome_x	INTEGER	Number of reads mapped to chromosome X
chromosome_y	INTEGER	Number of reads mapped to chromosome Y
y_x_ratio	REAL	Ratio of reads mapped to Y over reads mapped to X

expression_qc_view

This is a view of the expression_qc table with additional columns created from the hemoglobin, rn7sl and protein_coding_total / library size

Full Table Details

Field	Purpose	Datatype	Description
pk	primary key autoincrement	INTEGER	auto incremented primary key
library_id	foreign key	INTEGER	foreign key to the `library_id` in table `library`
library_size		INTEGER	Holds the library size data for each record
protein_coding_total		INTEGER	Represents the total count of protein coding for each record
hemoglobin_counts		INTEGER	Stores the count of hemoglobin for each record
rn7sl1_counts		INTEGER	Stores the count of rn7sl1 for each record
percent_hemoglobin		INTEGER	hemoglobin_counts / library_size
percent_rn7sl		INTEGER	rn7sl_counts / library_size
percent_protein_coding		INTEGER	protein_coding_total / library_size

Tables

Each heading below is a physical table (not a view) in the database. A brief description of each is provided, and the primary key and foreign key, if one exists, is shown to give you an overview of how these tables are related.

To see the full description of any one of these tables, click “Full Table Details”.

batch

This table stores information about batches.

Field	Purpose	Datatype	Description
pk	primary key	INTEGER	autoincremented primary key

Full Table Details

Field	Purpose	Datatype	Description
pk	primary key	INTEGER	autoincremented primary key
data_dir		TEXT	name of the directory which stores this data in the dsg cluster
batch_alias		TEXT	I was unfortunately inconsistent in how I refered to these in processing the data. this column provides a map from what I named the batch when processing to the name of the dsg directory
dsg_fastq_dirpath		TEXT	The subdirectory structure of each data_dir varies quite a bit. This column provides the absolute path to the actual directory which stores the fastq files

manual_library_relabel

This table stores information about batches.

Field	Purpose	Datatype	Description
pk	primary key	INTEGER	autoincremented primary key
library_id	foreign key	INTEGER	foreign key to the `library_id` field in the `library` table

Full Table Details

Field	Purpose	Datatype	Description
pk	primary key	INTEGER	autoincremented primary key
library_id	foreign key	INTEGER	foreign key to the `library_id` field in the `library` table
new_whatdatall_id		INTEGER	The correct LLFS ID for this library
reason		TEXT	The reason why this library was relabeled
notes		TEXT	Free entry for any additional information

expression_qc

This table stores count data which is extracted in the expression_qc step (see the articles in the served documentation, or the vignette in the code repo).

Field	Purpose	Datatype	Description
pk	primary key autoincrement	INTEGER	auto incremented primary key
library_id	foreign key	INTEGER	References the “library_id” field in the “library” table

Full Table Details

Field	Purpose	Datatype	Description
pk	primary key autoincrement	INTEGER	auto incremented primary key
library_id	foreign key	INTEGER	References the “library_id” field in the “library” table
library_size		INTEGER	Holds the library size data for each record
protein_coding_total		INTEGER	Represents the total count of protein coding for each record
hemoglobin_counts		INTEGER	Stores the count of hemoglobin for each record
rn7sl1_counts		INTEGER	Stores the count of rn7sl1 for each record

fastq

This table stores information about FASTQ files. Note that there may be more than 1 fastq record for each library_id since there are some (quite a few) libraries which have multiple lanes worth of paired-end libraries (so if 1 library is split across two lanes, there would be 2 R1 and 2 R2)

Field	Purpose	Datatype	Description
pk	primary key	INTEGER	auto incremented primary key
library_id	foreign key	TEXT	References the “library_id” field in the “library” table

Full Table Details

Field	Purpose	Datatype	Description
pk	primary key	INTEGER	autoincremented primary key
library_id	foreign key	INTEGER	foreign key to the library table
fastq_1		TEXT	name of the R1 fastq file
fastq_2		TEXT	name of the R2 fastq file
strandedness		TEXT	the strandedness of the library protocol
index_sequence		TEXT	index sequence extracted from the fastq filename
flow_cell_id	foreign key to flow_cell	TEXT	flow cell identifier
lane		TEXT	lane in the flowcell
notes		TEXT	any notes on the fastq. Defaults to ‘none’

multiqc_general

Note: parsing this data out of multiQC is a work in progress.

This is the ‘general’ table stats from MultiQC and includes both alignment and read level QC. As a result, there are at least 3 records for each library.

Field	Purpose	Datatype	Description
pk	primary key	INTEGER	autoincremented primary key
library_id	foreign key	INTEGER	References the `library_id` field in the `library` table

Full Table Details

Field	Purpose	Datatype	Description
pk	primary key	INTEGER	autoincremented primary key
library_id	foreign key	INTEGER	References the `library_id` field in the `library` table
general.dupradar_intercept		REAL	Add a brief explanation of the field here.
general.percent_rrna		REAL	Add a brief explanation of the field here.
general.library		TEXT	Add a brief explanation of the field here.
general.unpaired_reads_examined		INTEGER	Add a brief explanation of the field here.
general.read_pairs_examined		INTEGER	Add a brief explanation of the field here.
general.secondary_or_supplementary_rds		INTEGER	Add a brief explanation of the field here.
general.unmapped_reads		INTEGER	Add a brief explanation of the field here.
general.unpaired_read_duplicates		INTEGER	Add a brief explanation of the field here.
general.read_pair_duplicates		INTEGER	Add a brief explanation of the field here.
general.read_pair_optical_duplicates		INTEGER	Add a brief explanation of the field here.
general.percent_duplication		INTEGER	Add a brief explanation of the field here.
general.estimated_library_size		INTEGER	Add a brief explanation of the field here.
general.5_3_bias		REAL	Add a brief explanation of the field here.
general.reads_aligned		INTEGER	Add a brief explanation of the field here.
general.unalignable		INTEGER	Add a brief explanation of the field here.
general.alignable		INTEGER	Add a brief explanation of the field here.
general.filtered		INTEGER	Add a brief explanation of the field here.
general.total		INTEGER	Add a brief explanation of the field here.
general.alignable_percent		REAL	Add a brief explanation of the field here.
general.unique		INTEGER	Add a brief explanation of the field here.
general.multi		INTEGER	Add a brief explanation of the field here.
general.uncertain		INTEGER	Add a brief explanation of the field here.
general.total_records		INTEGER	Add a brief explanation of the field here.
general.qc_failed		INTEGER	Add a brief explanation of the field here.
general.optical_pcr_duplicate		INTEGER	Add a brief explanation of the field here.
general.non_primary_hits		INTEGER	Add a brief explanation of the field here.
general.mapq_lt_mapq_cut_non_unique		INTEGER	Add a brief explanation of the field here.
general.mapq_gte_mapq_cut_unique		INTEGER	Add a brief explanation of the field here.
general.read_1		INTEGER	Add a brief explanation of the field here.
general.read_2		INTEGER	Add a brief explanation of the field here.
general.reads_map_to_sense		INTEGER	Add a brief explanation of the field here.
general.reads_map_to_antisense		INTEGER	Add a brief explanation of the field here.
general.non_splice_reads		INTEGER	Add a brief explanation of the field here.
general.splice_reads		INTEGER	Add a brief explanation of the field here.
general.reads_mapped_in_proper_pairs		INTEGER	Add a brief explanation of the field here.
general.proper_paired_reads_map_to_different_chrom		INTEGER	Add a brief explanation of the field here.
general.unique_percent		REAL	Add a brief explanation of the field here.
general.proper_pairs_percent		REAL	Add a brief explanation of the field here.
general.raw_total_sequences		INTEGER	Add a brief explanation of the field here.
general.filtered_sequences		INTEGER	Add a brief explanation of the field here.
general.sequences		INTEGER	Add a brief explanation of the field here.
general.is_sorted		INTEGER	Add a brief explanation of the field here.
general.1st_fragments		INTEGER	Add a brief explanation of the field here.
general.last_fragments		INTEGER	Add a brief explanation of the field here.
general.reads_mapped		INTEGER	Add a brief explanation of the field here.
general.reads_mapped_and_paired		INTEGER	Add a brief explanation of the field here.
general.reads_unmapped		INTEGER	Add a brief explanation of the field here.
general.reads_properly_paired		INTEGER	Add a brief explanation of the field here.
general.reads_paired		INTEGER	Add a brief explanation of the field here.
general.reads_duplicated		INTEGER	Add a brief explanation of the field here.
general.reads_mq0		INTEGER	Add a brief explanation of the field here.
general.reads_qc_failed		INTEGER	Add a brief explanation of the field here.
general.non_primary_alignments		INTEGER	Add a brief explanation of the field here.
general.supplementary_alignments		INTEGER	Add a brief explanation of the field here.
general.total_length		INTEGER	Add a brief explanation of the field here.
general.total_first_fragment_length		INTEGER	Add a brief explanation of the field here.
general.total_last_fragment_length		INTEGER	Add a brief explanation of the field here.
general.bases_mapped		INTEGER	Add a brief explanation of the field here.
general.bases_mapped_cigar		INTEGER	Add a brief explanation of the field here.
general.bases_trimmed		INTEGER	Add a brief explanation of the field here.
general.bases_duplicated		INTEGER	Add a brief explanation of the field here.
general.mismatches		INTEGER	Add a brief explanation of the field here.
general.error_rate		REAL	Add a brief explanation of the field here.
general.average_length		INTEGER	Add a brief explanation of the field here.
general.average_first_fragment_length		INTEGER	Add a brief explanation of the field here.
general.average_last_fragment_length		INTEGER	Add a brief explanation of the field here.
general.maximum_length		INTEGER	Add a brief explanation of the field here.
general.maximum_first_fragment_length		INTEGER	Add a brief explanation of the field here.
general.maximum_last_fragment_length		INTEGER	Add a brief explanation of the field here.
general.average_quality		REAL	Add a brief explanation of the field here.
general.insert_size_average		REAL	Add a brief explanation of the field here.
general.insert_size_standard_deviation		REAL	Add a brief explanation of the field here.
general.inward_oriented_pairs		INTEGER	Add a brief explanation of the field here.
general.outward_oriented_pairs		INTEGER	Add a brief explanation of the field here.
general.pairs_with_other_orientation		INTEGER	Add a brief explanation of the field here.
general.pairs_on_different_chromosomes		INTEGER	Add a brief explanation of the field here.
general.percentage_of_properly_paired_reads_%		REAL	Add a brief explanation of the field here.
general.reads_mapped_percent		REAL	Add a brief explanation of the field here.
general.reads_mapped_and_paired_percent		REAL	Add a brief explanation of the field here.
general.reads_unmapped_percent		REAL	Add a brief explanation of the field here.
general.reads_properly_paired_percent		REAL	Add a brief explanation of the field here.
general.reads_paired_percent		REAL	Add a brief explanation of the field here.
general.reads_duplicated_percent		REAL	Add a brief explanation of the field here.
general.reads_mq0_percent		REAL	Add a brief explanation of the field here.
general.reads_qc_failed_percent		REAL	Add a brief explanation of the field here.
general.total_passed		INTEGER	Add a brief explanation of the field here.
general.total_failed		INTEGER	Add a brief explanation of the field here.
general.secondary_passed		INTEGER	Add a brief explanation of the field here.
general.secondary_failed		INTEGER	Add a brief explanation of the field here.
general.supplementary_passed		INTEGER	Add a brief explanation of the field here.
general.supplementary_failed		INTEGER	Add a brief explanation of the field here.
general.duplicates_passed		INTEGER	Add a brief explanation of the field here.
general.duplicates_failed		INTEGER	Add a brief explanation of the field here.
general.mapped_passed		INTEGER	Add a brief explanation of the field here.
general.mapped_failed		INTEGER	Add a brief explanation of the field here.
general.mapped_passed_pct		REAL	Add a brief explanation of the field here.
general.mapped_failed_pct		REAL	Add a brief explanation of the field here.
general.paired_in_sequencing_passed		INTEGER	Add a brief explanation of the field here.
general.paired_in_sequencing_failed		INTEGER	Add a brief explanation of the field here.
general.read1_passed		INTEGER	Add a brief explanation of the field here.
general.read1_failed		INTEGER	Add a brief explanation of the field here.
general.read2_passed		INTEGER	Add a brief explanation of the field here.
general.read2_failed		INTEGER	Add a brief explanation of the field here.
general.properly_paired_passed		INTEGER	Add a brief explanation of the field here.
general.properly_paired_failed		INTEGER	Add a brief explanation of the field here.
general.properly_paired_passed_pct		REAL	Add a brief explanation of the field here.
general.properly_paired_failed_pct		REAL	Add a brief explanation of the field here.
general.with_itself_and_mate_mapped_passed		INTEGER	Add a brief explanation of the field here.
general.with_itself_and_mate_mapped_failed		INTEGER	Add a brief explanation of the field here.
general.singletons_passed		INTEGER	Add a brief explanation of the field here.
general.singletons_failed		INTEGER	Add a brief explanation of the field here.
general.singletons_passed_pct		REAL	Add a brief explanation of the field here.
general.singletons_failed_pct		REAL	Add a brief explanation of the field here.
general.with_mate_mapped_to_a_different_chr_passed		INTEGER	Add a brief explanation of the field here.
general.with_mate_mapped_to_a_different_chr_failed		INTEGER	Add a brief explanation of the field here.
general.with_mate_mapped_to_a_different_chr_mapq__5_passed		INTEGER	Add a brief explanation of the field here.
general.with_mate_mapped_to_a_different_chr_mapq__5_failed		INTEGER	Add a brief explanation of the field here.
general.flagstat_total		INTEGER	Add a brief explanation of the field here.
general.percent_gc		REAL	Add a brief explanation of the field here.
general.avg_sequence_length		REAL	Add a brief explanation of the field here.
general.total_sequences		REAL	Add a brief explanation of the field here.
general.percent_duplicates		REAL	Add a brief explanation of the field here.
general.percent_fails		REAL	Add a brief explanation of the field here.
general.r_processed		INTEGER	Add a brief explanation of the field here.
general.r_with_adapters		INTEGER	Add a brief explanation of the field here.
general.r_written		INTEGER	Add a brief explanation of the field here.
general.bp_processed		REAL	Add a brief explanation of the field here.
general.quality_trimmed		INTEGER	Add a brief explanation of the field here.
general.bp_written		REAL	Add a brief explanation of the field here.
general.percent_trimmed		REAL	Add a brief explanation of the field here.

qualimap_genomic_origin

This table stores genomic origin information from QualiMap. See the QualiMap documentation for more details on how a read is assigned to each category.

Field	Purpose	Datatype	Description
pk	primary key	INTEGER	autoincremented primary key
library_id	foreign key	TEXT	References the `library_id` field in the `library` table

Full Table Details

Field	Purpose	Datatype	Description
pk	primary key	INTEGER	autoincremented primary key
library_id	foreign key	TEXT	References the `library_id` field in the `library` table
exonic		INTEGER	raw counts of reads over exonic regions
intronic		INTEGER	raw counts of reads over intronic regions
intergenic		INTEGER	raw counts of reads over intergenic regions

remove_ids

This table stores information about removed IDs. Note that if there is ever more than 1 reason (currently the only reason is legal_removal), then the reasons/sources should be extracted to a separate table like id_correction

Field	Purpose	Datatype	Description
pk	primary key	INTEGER	autoincremented primary key
id		INTEGER	References the “id” field in the “whatdatall” table

Full Table Details

Field	Purpose	Datatype	Description
pk	primary key	INTEGER	autoincremented primary key
id		INTEGER	References the “id” field in the “whatdatall” table
source		TEXT	What document, or what email, etc was used to communicate that this sample should be removed
reason		TEXT	Reason that the sample is being removed

rsem_assignment_plot

This table stores RSEM quantification QC information.

Field	Purpose	Datatype	Description
pk	primary key	INTEGER	autoincremented primary key
library_id	foreign key	TEXT	References the `library_id` field in the `library` table

Full Table Details

Field	Purpose	Datatype	Description
pk	primary key	INTEGER	autoincremented primary key
library_id	foreign key	TEXT	References the `library_id` field in the `library` table
aligned_uniquely_to_a_gene		INTEGER	RSEM determined unique alignment raw counts
aligned_to_multiple_genes		INTEGER	RSEM determined multimap raw counts
filtered_due_to_too_many_alignments		INTEGER	Reads not considered for the EM quantification due to ‘too many’ multimaps
unalignable_reads		INTEGER	raw count of reads which count not be aligned in the genome

library

This is a very important table. It essentially functions to assign a unique ID to each actual sample (remember a sample might have as many as 6 fastq files assigned to it because of the paired-end nature, and the fact that a single sample can be split over mulitple lanes). The utility of having a unique primary key for every sample is that it gives us a way to refer to an actual biological entity independent of the (subject,visit) indentifiers, which may be wrong. This primary key will never change, even if the sample is re-labelled with a different subject later on.

Field	Purpose	Datatype	Description
library_id	primary key	INTEGER	autoincremented primary key
batch_id	foreign key to batch	INTEGER	References “pk” of the batch table

Full Table Details

Field	Purpose	Datatype	Description
library_id	primary key	INTEGER	autoincremented primary key
fastq_id	foreign key to fastq	TEXT	the ORIGINAL LLFS ID which is extracted from the fastq filename
visit		TEXT	The visit, extracted from the fastqfilename
batch_id	foreign key to batch	INTEGER	References “pk” of the batch table
suspicious_sex		BOOLEAN	TRUE (1) if the sex expression is obviously opposite of what is labelled. NOTE that this is a property of the library, not the `fastq_id`. This is applied only after doing the WGS and other relabeling and represents a sample which is mislabeled by sex and for which a correct relabeing cannot be found.

samtools_idxstats_xy_plot

This table stores the X and Y chromosome expression information as extracted by samtools. Note that I have found that looking at the top 500 most variable gene’s PCs is an easier method of determining sex (and mislabels determined by sex) than these counts are. However, I use the counts to confirm the PCA results.

Field	Purpose	Datatype	Description
pk	primary key	INTEGER	autoincremented primary key
library_id	foreign key	TEXT	References the `library_id` field in the `library` table

Full Table Details

Field	Purpose	Datatype	Description
pk	primary key	INTEGER	autoincremented primary key
library_id	foreign key	TEXT	References the `library_id` field in the `library` table
chromosome_x		INTEGER	raw counts of reads over the X chromosome
chromosome_y		INTEGER	raw counts of reads over the Y chromosome

visitcode_map

This is a mapping from visit, which is used in the fastq filename to visitcode. (id/subject, visitcode) form the primary key of the LLFS phenotype data, while visit, though used in the fastq filename, is not defined. This is the mapping that is used to get the visitcode from the visit:

if visit == 1, then visitcode = 1
if visit == 2, then visitcode is either 3 or 4
if visit == 3, then visitcode is either 6 or 7

No subject can have visitcode 3 and 4, or visitcode 6 and 7, so this mapping is actually one-to-one.

Field	Purpose	Datatype	Description
pk	primary key autoincrement	INTEGER	Unique identifier for each record in the visitcode_map table
whatdatall_id	foreign key	TEXT	References the “id” field in the “whatdatall” table

Full Table Details

Field	Purpose	Datatype	Description
pk	primary key autoincrement	INTEGER	Unique identifier for each record in the visitcode_map table
whatdatall_id	foreign key	TEXT	References the “id” field in the “whatdatall” table
visit		TEXT	Represents a visit, only allows values ‘1’, ‘2’, ‘3’
visitcode		TEXT	Represents a visit code, only allows values from ‘1’ to ‘7’

wgs_compare

This table stores the results of comparing the RNA sample variants to WGS sample variants

Field	Purpose	Datatype	Description
pk	primary key	INTEGER	autoincremented primary key
library_id	foreign key	INTEGER	References “pk” of library table
dna_subject		INTEGER	References “subject” of whatdatall table

Full Table Details

Field	Purpose	Datatype	Description
pk	primary key	INTEGER	autoincremented primary key
library_id	foreign key	INTEGER	References “pk” of library table
dna_subject		INTEGER	References “subject” of whatdatall table
chr		TEXT	chromosome over which the comparison was conducted. Note that I frequently use chr21 b/c it is shorter. I have done comparisons and found that there is no loss of specificity by using a smaller chromosome
total_variants		INTEGER	Total number of overlapping variants between the RNA and DNA
matching_variants		INTEGER	Total number of overlapping variants which matched
homo_expr_cand		INTEGER	the total number of cases where the WGS is heterozygous but expression homozygous
match_ratio		REAL	matching_variants/total_variants

whatdatall

This table stores the ID, subject, and sex as extracted from the LLFS phenotype data. Note that this had to be augmented to include visit 3 samples, which are not yet included in the publically shared LLFS whatdatall.

Field	Purpose	Datatype	Description
pk	primary key	INTEGER	autoincremented primary key

Full Table Details

Field	Purpose	Datatype	Description
pk	primary key	INTEGER	autoincremented primary key
id		INTEGER	the LLFS ID. Note that this is what is used in the fastq filenames
subject		INTEGER	The LLFS subject identifier
sex		INTEGER	labelled sex of the subject with this identifier
source		TEXT	where this record comes from
notes		TEXT	any notes on this particular record