Introduction

This database will make re-labeling libraries less error prone, and will make it much easier to submit to NCBI when the time comes. Below, you will find a description of each table and view in the database. For each table and view, the full description of each column is hidden by default – just click the “view full details” to expand a table that provides the definitions. For each table (but not view, which doesn’t have a primary key per se), the primary and foreign key(s) are shown in the Description (not collapsed by default).

Note that one library might have up to six or more (should be even, of course) fastq files – the paired-end R1 and R2, and sometimes a single library has multiple R1 and R2 files over different lanes.

IMPORTANT

Please note the following:

All tables, except for library, whose primary key is library_id, and batch, whose primary key is batch_id, have an auto-incremented primary key called pk.

fastq_id is always a foreign key to the fastq table, but it has an additional meaning: the fastq_id is the id which the actual fastq file possesses. This fastq_id may be wrong – for example, if there was a typo, or that particular ID is in the visit_1 transfer set, or if it was determined to be mislabeled from the RNA variants vs WGS data comparison. In this case, a library should be thought of as having a fastq_id, derived from the filename of the fastq file, but also a whatdatall_id, which results from a correction (see the manual_library_relabel table and full_wgs_compare table). A given library’s subject will always be assigned from the library’s whatdatall_id, which, if you are using one of my views, is assigned after relabeling any known mislabels.

Views

A view is a saved SQL query. It provides a way to access joined and/or otherwise manipulated data as if it were a table.

llfs_rnaseq_subject

This is the main LLFS metadata and is what is used to create the metadata in the LLFS RNA Seq data releases.

NOTE: In the release data, the column count_headers is added, which corresponds to the corresponding release’s plain text count data headers. It is formed by appending the string ‘library_’ to the library_id.

LLFS RNAseq Metadata View
Field Purpose Datatype Description
library_id INTEGER The library ID (foreign key to the library table)
legal TEXT Boolean column. Legal status. 1 if legal
subject INTEGER The LLFS subject identifier (see whatdatall)
visit TEXT Despite it not being an officially defined variable, visit is what is used in the fastqfilename. This must be translated to visitcode in order to join the RNAseq data on (subject,visitcode) to the rest of the LLFS phenotype data
visitcode TEXT The whatdatall LLFS phenotype visitcode value. See the visitcode_map table for details on how this is derived from visit
batch INTEGER This is the name of the data directory on the DSG cluster and corresponds to the batch in which the sample was sequenced. See table batch
sex INTEGER The subject’s sex. Note that if a library is relabeled, the sex is determined after relabeling
suspicious_sex INTEGER see the library table for more details. This is this is a boolean (0,1 where 1 is TRUE) field which is TRUE if the library clearly shows sex chromosome expression opposite from its label. This column exists because not all RNAseq subjects have WGS data, so a TRUE here means that the subject did not exist in the WGS set AND it shows evidence of being mislabeled by sex chrom expression
purpose TEXT either ‘control_phantom’, ‘control_pool’
mislabel INTEGER See the library table for more details. this is a boolean (0,1 where 1 is TRUE) field which is TRUE if the sample was clearly mislabelled during the WGS comparison, but the actual label was not found
relabel INTEGER See the library table for more details. this is a boolean (0,1 where 1 is TRUE) field which is TRUE if the sample has been relabeled for any reason, including as a result of the WGS vs RNA variant comparison, because the sample is a phantom control, or due to manifest typos or v1_transfers. See notes below.
notes INTEGER notes on the samples, such as why it was labelled mislabel or relabel
y_x_ratio REAL Ratio of reads mapped to chromosome Y to chromosome X
library_total REAL total size of the library
percent_intergenic REAL Percentage of reads in intergenic regions
hemoglobin_total REAL total number of reads counted over 12 hemoglobin genes
percent_hemoglobin REAL total raw counts over 12 hemoglobin genes / library_size. see the expression_qc table for more details
protein_coding_total REAL raw reads over loci denoted “protein_coding” in the gtf (you can discover this by loading the dds object and doing this rowRanges(dds)$type == 'protein_coding'
percent_protein_coding REAL protein_coding_total / library_size. see the expression_qc table for more details
rn7sl_total REAL total number of reads over two rn7sl genes. These are highly over represented in nearly all of the libraries. see table expression_qc for more details
percent_rn7sl REAL total raw counts over two loci that are consistently, by far, the loci with the most counts. See expression_qc for more details

llfs_rnaseq_id

This should be treated as an intermediate table, though it is useful if you wish to associate the LLFS IDs with the samples.

LLFS metadata ID View
Field Purpose Datatype Description
library_id INTEGER The library ID (foreign key to the library table)
llfs_id INTEGER The LLFS ID (see whatdatall)
visit TEXT Despite it not being an officially defined variable, visit is what is used in the fastqfilename. This must be translated to visitcode in order to join the RNAseq data on (subject,visitcode) to the rest of the LLFS phenotype data
visitcode TEXT The whatdatall LLFS phenotype visitcode value. See the visitcode_map table for details on how this is derived from visit
data_dir INTEGER This is the name of the data directory on dsg. See table batch
purpose TEXT either ‘control_phantom’, ‘control_pool’
suspicious_sex INTEGER see the library table for more details. This is this is a boolean (0,1 where 1 is TRUE) field which is TRUE if the library clearly shows sex chromosome expression opposite from its label. Note that this is applied after all other re-labeling is completed
mislabel INTEGER See the library table for more details. this is a boolean (0,1 where 1 is TRUE) field which is TRUE if the library was clearly mislabeled during the WGS comparison, but the actual label was not found
relabel INTEGER See the library table for more details. this is a boolean (0,1 where 1 is TRUE) field which is TRUE if the library has been relabeled for any reason, including as a result of the WGS vs RNA variant comparison, because the library is a phantom control, or due to manifest typos or v1_transfers. See notes below.
notes INTEGER notes on the libraries, such as why it was labelled mislabel or relabel
legal TEXT Boolean column. Legal status. 1 if legal
y_x_ratio REAL Ratio of reads mapped to chromosome Y to chromosome X
library_size REAL total size of the library
percent_intergenic REAL Percentage of reads in intergenic regions
hemoglobin_total REAL total number of reads counted over 12 hemoglobin genes
percent_hemoglobin REAL total raw counts over 12 hemoglobin genes / library_size. see the expression_qc table for more details
protein_coding_total REAL raw reads over loci denoted “protein_coding” in the gtf (you can discover this by loading the dds object and doing this rowRanges(dds)$type == 'protein_coding'
percent_protein_coding REAL protein_coding_total / library_size. see the expression_qc table for more details
rn7sl_total REAL total number of reads over two rn7sl genes. These are highly over represented in nearly all of the libraries. see table expression_qc for more details
percent_rn7sl REAL total raw counts over two loci that are consistently, by far, the loci with the most counts. See expression_qc for more details

filtered_wgs_compare

This view summarizes the result of the all-by-all comparison between a given RNA Seq library and all 4,556 WGS samples. The all-by-all comparison candidates are chosen if the library is passing the percent_intergenic filter of 8%, has a match_ratio of less than .94 or has fewer than 100 total_variants.

Filtered WGS Compare
Field Purpose Datatype Description
empirical_best_id INTEGER The LLFS ID corresponding to the best empirical match between RNA variants and WGS data
library_id INTEGER library_id from the library table
labelled_dna_subject INTEGER the LLFS subject identifier corresponding to the LLFS ID which is contained in the library’s fastq filename
labelled_total_variants INTEGER The total number of high quality variant locations in the RNA library over which there high quality WGS genotype data for a given subject
labelled_match_variants INTEGER The total number of total_variants locations over which the genotypes in the two samples matched
labelled_match_ratio INTEGER match_variants/total_variants
empirical_best_subject INTEGER The LLFS subject corresponding to the best empirical match between RNA variants and WGS data
empirical_best_[total_variants,matching_variants,match_Ratio] INTEGER same as the labelled_... columns above, but for the best empirical match
empirical_next_subject INTEGER The LLFS subject corresponding to the second best empirical match in the all by all comparison
empirical_next_[total_variants,matching_variants,match_Ratio] INTEGER same as the labelled_... columns above, but for the second best empirical match
chisq_labelled INTEGER chisq pvalue for the labelled vs empirical best data
chisq_empirical INTEGER chisq pvalue for the best empirical match vs the next best empirical match
relabel BOOLEAN if the empirical best match is different than the original label, has a match ratio of >= .94 and both pvalues are significant, then the sample is relabeled (1). Else it is not (0)

library_relabel_summary

This provides an overview of all relabelled samples

Library Relabel Summary
Field Purpose Datatype Description
library_id TEXT library_id from the library table
proposed_relabel_id INTEGER The proposed relabel LLFS ID. Note that this is only applied in the metadata if relabel == 1
mislabel BOOLEAN 0 if the sample is not a mislabel (ie, a phantom control), 1 if it is
relabel BOOLEAN 0 if the data does not support relabeling. 1 if it does. See the filtered_wgs_compare view for details
notes REAL any additional notes from the underlying tables

genomic_origin_view

This view provides information about the genomic origin of reads based on the qualimap_genomic_origin table.

Qualimap Genomic Origin View
Field Purpose Datatype Description
library_id TEXT Foreign key to the library table
exonic INTEGER Number of reads in exonic regions
intronic INTEGER Number of reads in intronic regions
intergenic INTEGER Number of reads in intergenic regions
percent_exonic REAL Percentage of reads in exonic regions
percent_intronic REAL Percentage of reads in intronic regions
percent_intergenic REAL Percentage of reads in intergenic regions

xy_ratio

This view provides information about the XY ratio based on the samtools_idxstats_xy_plot table.

XY Ratio View
Field Purpose Datatype Description
library_id TEXT Foreign key to the library table
chromosome_x INTEGER Number of reads mapped to chromosome X
chromosome_y INTEGER Number of reads mapped to chromosome Y
y_x_ratio REAL Ratio of reads mapped to Y over reads mapped to X

expression_qc_view

This is a view of the expression_qc table with additional columns created from the hemoglobin, rn7sl and protein_coding_total / library size

Full Table Details
Field Purpose Datatype Description
pk primary key autoincrement INTEGER auto incremented primary key
library_id foreign key INTEGER foreign key to the library_id in table library
library_size INTEGER Holds the library size data for each record
protein_coding_total INTEGER Represents the total count of protein coding for each record
hemoglobin_counts INTEGER Stores the count of hemoglobin for each record
rn7sl1_counts INTEGER Stores the count of rn7sl1 for each record
percent_hemoglobin INTEGER hemoglobin_counts / library_size
percent_rn7sl INTEGER rn7sl_counts / library_size
percent_protein_coding INTEGER protein_coding_total / library_size

Tables

Each heading below is a physical table (not a view) in the database. A brief description of each is provided, and the primary key and foreign key, if one exists, is shown to give you an overview of how these tables are related.

To see the full description of any one of these tables, click “Full Table Details”.

batch

This table stores information about batches.

Field Purpose Datatype Description
pk primary key INTEGER autoincremented primary key
Full Table Details
Field Purpose Datatype Description
pk primary key INTEGER autoincremented primary key
data_dir TEXT name of the directory which stores this data in the dsg cluster
batch_alias TEXT I was unfortunately inconsistent in how I refered to these in processing the data. this column provides a map from what I named the batch when processing to the name of the dsg directory
dsg_fastq_dirpath TEXT The subdirectory structure of each data_dir varies quite a bit. This column provides the absolute path to the actual directory which stores the fastq files

manual_library_relabel

This table stores information about batches.

Field Purpose Datatype Description
pk primary key INTEGER autoincremented primary key
library_id foreign key INTEGER foreign key to the library_id field in the library table
Full Table Details
Field Purpose Datatype Description
pk primary key INTEGER autoincremented primary key
library_id foreign key INTEGER foreign key to the library_id field in the library table
new_whatdatall_id INTEGER The correct LLFS ID for this library
reason TEXT The reason why this library was relabeled
notes TEXT Free entry for any additional information

expression_qc

This table stores count data which is extracted in the expression_qc step (see the articles in the served documentation, or the vignette in the code repo).

Field Purpose Datatype Description
pk primary key autoincrement INTEGER auto incremented primary key
library_id foreign key INTEGER References the “library_id” field in the “library” table
Full Table Details
Field Purpose Datatype Description
pk primary key autoincrement INTEGER auto incremented primary key
library_id foreign key INTEGER References the “library_id” field in the “library” table
library_size INTEGER Holds the library size data for each record
protein_coding_total INTEGER Represents the total count of protein coding for each record
hemoglobin_counts INTEGER Stores the count of hemoglobin for each record
rn7sl1_counts INTEGER Stores the count of rn7sl1 for each record

fastq

This table stores information about FASTQ files. Note that there may be more than 1 fastq record for each library_id since there are some (quite a few) libraries which have multiple lanes worth of paired-end libraries (so if 1 library is split across two lanes, there would be 2 R1 and 2 R2)

Field Purpose Datatype Description
pk primary key INTEGER auto incremented primary key
library_id foreign key TEXT References the “library_id” field in the “library” table
Full Table Details
Field Purpose Datatype Description
pk primary key INTEGER autoincremented primary key
library_id foreign key INTEGER foreign key to the library table
fastq_1 TEXT name of the R1 fastq file
fastq_2 TEXT name of the R2 fastq file
strandedness TEXT the strandedness of the library protocol
index_sequence TEXT index sequence extracted from the fastq filename
flow_cell_id foreign key to flow_cell TEXT flow cell identifier
lane TEXT lane in the flowcell
notes TEXT any notes on the fastq. Defaults to ‘none’

multiqc_general

Note: parsing this data out of multiQC is a work in progress.

This is the ‘general’ table stats from MultiQC and includes both alignment and read level QC. As a result, there are at least 3 records for each library.

Field Purpose Datatype Description
pk primary key INTEGER autoincremented primary key
library_id foreign key INTEGER References the library_id field in the library table
Full Table Details
Field Purpose Datatype Description
pk primary key INTEGER autoincremented primary key
library_id foreign key INTEGER References the library_id field in the library table
general.dupradar_intercept REAL Add a brief explanation of the field here.
general.percent_rrna REAL Add a brief explanation of the field here.
general.library TEXT Add a brief explanation of the field here.
general.unpaired_reads_examined INTEGER Add a brief explanation of the field here.
general.read_pairs_examined INTEGER Add a brief explanation of the field here.
general.secondary_or_supplementary_rds INTEGER Add a brief explanation of the field here.
general.unmapped_reads INTEGER Add a brief explanation of the field here.
general.unpaired_read_duplicates INTEGER Add a brief explanation of the field here.
general.read_pair_duplicates INTEGER Add a brief explanation of the field here.
general.read_pair_optical_duplicates INTEGER Add a brief explanation of the field here.
general.percent_duplication INTEGER Add a brief explanation of the field here.
general.estimated_library_size INTEGER Add a brief explanation of the field here.
general.5_3_bias REAL Add a brief explanation of the field here.
general.reads_aligned INTEGER Add a brief explanation of the field here.
general.unalignable INTEGER Add a brief explanation of the field here.
general.alignable INTEGER Add a brief explanation of the field here.
general.filtered INTEGER Add a brief explanation of the field here.
general.total INTEGER Add a brief explanation of the field here.
general.alignable_percent REAL Add a brief explanation of the field here.
general.unique INTEGER Add a brief explanation of the field here.
general.multi INTEGER Add a brief explanation of the field here.
general.uncertain INTEGER Add a brief explanation of the field here.
general.total_records INTEGER Add a brief explanation of the field here.
general.qc_failed INTEGER Add a brief explanation of the field here.
general.optical_pcr_duplicate INTEGER Add a brief explanation of the field here.
general.non_primary_hits INTEGER Add a brief explanation of the field here.
general.mapq_lt_mapq_cut_non_unique INTEGER Add a brief explanation of the field here.
general.mapq_gte_mapq_cut_unique INTEGER Add a brief explanation of the field here.
general.read_1 INTEGER Add a brief explanation of the field here.
general.read_2 INTEGER Add a brief explanation of the field here.
general.reads_map_to_sense INTEGER Add a brief explanation of the field here.
general.reads_map_to_antisense INTEGER Add a brief explanation of the field here.
general.non_splice_reads INTEGER Add a brief explanation of the field here.
general.splice_reads INTEGER Add a brief explanation of the field here.
general.reads_mapped_in_proper_pairs INTEGER Add a brief explanation of the field here.
general.proper_paired_reads_map_to_different_chrom INTEGER Add a brief explanation of the field here.
general.unique_percent REAL Add a brief explanation of the field here.
general.proper_pairs_percent REAL Add a brief explanation of the field here.
general.raw_total_sequences INTEGER Add a brief explanation of the field here.
general.filtered_sequences INTEGER Add a brief explanation of the field here.
general.sequences INTEGER Add a brief explanation of the field here.
general.is_sorted INTEGER Add a brief explanation of the field here.
general.1st_fragments INTEGER Add a brief explanation of the field here.
general.last_fragments INTEGER Add a brief explanation of the field here.
general.reads_mapped INTEGER Add a brief explanation of the field here.
general.reads_mapped_and_paired INTEGER Add a brief explanation of the field here.
general.reads_unmapped INTEGER Add a brief explanation of the field here.
general.reads_properly_paired INTEGER Add a brief explanation of the field here.
general.reads_paired INTEGER Add a brief explanation of the field here.
general.reads_duplicated INTEGER Add a brief explanation of the field here.
general.reads_mq0 INTEGER Add a brief explanation of the field here.
general.reads_qc_failed INTEGER Add a brief explanation of the field here.
general.non_primary_alignments INTEGER Add a brief explanation of the field here.
general.supplementary_alignments INTEGER Add a brief explanation of the field here.
general.total_length INTEGER Add a brief explanation of the field here.
general.total_first_fragment_length INTEGER Add a brief explanation of the field here.
general.total_last_fragment_length INTEGER Add a brief explanation of the field here.
general.bases_mapped INTEGER Add a brief explanation of the field here.
general.bases_mapped_cigar INTEGER Add a brief explanation of the field here.
general.bases_trimmed INTEGER Add a brief explanation of the field here.
general.bases_duplicated INTEGER Add a brief explanation of the field here.
general.mismatches INTEGER Add a brief explanation of the field here.
general.error_rate REAL Add a brief explanation of the field here.
general.average_length INTEGER Add a brief explanation of the field here.
general.average_first_fragment_length INTEGER Add a brief explanation of the field here.
general.average_last_fragment_length INTEGER Add a brief explanation of the field here.
general.maximum_length INTEGER Add a brief explanation of the field here.
general.maximum_first_fragment_length INTEGER Add a brief explanation of the field here.
general.maximum_last_fragment_length INTEGER Add a brief explanation of the field here.
general.average_quality REAL Add a brief explanation of the field here.
general.insert_size_average REAL Add a brief explanation of the field here.
general.insert_size_standard_deviation REAL Add a brief explanation of the field here.
general.inward_oriented_pairs INTEGER Add a brief explanation of the field here.
general.outward_oriented_pairs INTEGER Add a brief explanation of the field here.
general.pairs_with_other_orientation INTEGER Add a brief explanation of the field here.
general.pairs_on_different_chromosomes INTEGER Add a brief explanation of the field here.
general.percentage_of_properly_paired_reads_% REAL Add a brief explanation of the field here.
general.reads_mapped_percent REAL Add a brief explanation of the field here.
general.reads_mapped_and_paired_percent REAL Add a brief explanation of the field here.
general.reads_unmapped_percent REAL Add a brief explanation of the field here.
general.reads_properly_paired_percent REAL Add a brief explanation of the field here.
general.reads_paired_percent REAL Add a brief explanation of the field here.
general.reads_duplicated_percent REAL Add a brief explanation of the field here.
general.reads_mq0_percent REAL Add a brief explanation of the field here.
general.reads_qc_failed_percent REAL Add a brief explanation of the field here.
general.total_passed INTEGER Add a brief explanation of the field here.
general.total_failed INTEGER Add a brief explanation of the field here.
general.secondary_passed INTEGER Add a brief explanation of the field here.
general.secondary_failed INTEGER Add a brief explanation of the field here.
general.supplementary_passed INTEGER Add a brief explanation of the field here.
general.supplementary_failed INTEGER Add a brief explanation of the field here.
general.duplicates_passed INTEGER Add a brief explanation of the field here.
general.duplicates_failed INTEGER Add a brief explanation of the field here.
general.mapped_passed INTEGER Add a brief explanation of the field here.
general.mapped_failed INTEGER Add a brief explanation of the field here.
general.mapped_passed_pct REAL Add a brief explanation of the field here.
general.mapped_failed_pct REAL Add a brief explanation of the field here.
general.paired_in_sequencing_passed INTEGER Add a brief explanation of the field here.
general.paired_in_sequencing_failed INTEGER Add a brief explanation of the field here.
general.read1_passed INTEGER Add a brief explanation of the field here.
general.read1_failed INTEGER Add a brief explanation of the field here.
general.read2_passed INTEGER Add a brief explanation of the field here.
general.read2_failed INTEGER Add a brief explanation of the field here.
general.properly_paired_passed INTEGER Add a brief explanation of the field here.
general.properly_paired_failed INTEGER Add a brief explanation of the field here.
general.properly_paired_passed_pct REAL Add a brief explanation of the field here.
general.properly_paired_failed_pct REAL Add a brief explanation of the field here.
general.with_itself_and_mate_mapped_passed INTEGER Add a brief explanation of the field here.
general.with_itself_and_mate_mapped_failed INTEGER Add a brief explanation of the field here.
general.singletons_passed INTEGER Add a brief explanation of the field here.
general.singletons_failed INTEGER Add a brief explanation of the field here.
general.singletons_passed_pct REAL Add a brief explanation of the field here.
general.singletons_failed_pct REAL Add a brief explanation of the field here.
general.with_mate_mapped_to_a_different_chr_passed INTEGER Add a brief explanation of the field here.
general.with_mate_mapped_to_a_different_chr_failed INTEGER Add a brief explanation of the field here.
general.with_mate_mapped_to_a_different_chr_mapq__5_passed INTEGER Add a brief explanation of the field here.
general.with_mate_mapped_to_a_different_chr_mapq__5_failed INTEGER Add a brief explanation of the field here.
general.flagstat_total INTEGER Add a brief explanation of the field here.
general.percent_gc REAL Add a brief explanation of the field here.
general.avg_sequence_length REAL Add a brief explanation of the field here.
general.total_sequences REAL Add a brief explanation of the field here.
general.percent_duplicates REAL Add a brief explanation of the field here.
general.percent_fails REAL Add a brief explanation of the field here.
general.r_processed INTEGER Add a brief explanation of the field here.
general.r_with_adapters INTEGER Add a brief explanation of the field here.
general.r_written INTEGER Add a brief explanation of the field here.
general.bp_processed REAL Add a brief explanation of the field here.
general.quality_trimmed INTEGER Add a brief explanation of the field here.
general.bp_written REAL Add a brief explanation of the field here.
general.percent_trimmed REAL Add a brief explanation of the field here.

qualimap_genomic_origin

This table stores genomic origin information from QualiMap. See the QualiMap documentation for more details on how a read is assigned to each category.

Field Purpose Datatype Description
pk primary key INTEGER autoincremented primary key
library_id foreign key TEXT References the library_id field in the library table
Full Table Details
Field Purpose Datatype Description
pk primary key INTEGER autoincremented primary key
library_id foreign key TEXT References the library_id field in the library table
exonic INTEGER raw counts of reads over exonic regions
intronic INTEGER raw counts of reads over intronic regions
intergenic INTEGER raw counts of reads over intergenic regions

remove_ids

This table stores information about removed IDs. Note that if there is ever more than 1 reason (currently the only reason is legal_removal), then the reasons/sources should be extracted to a separate table like id_correction

Field Purpose Datatype Description
pk primary key INTEGER autoincremented primary key
id INTEGER References the “id” field in the “whatdatall” table
Full Table Details
Field Purpose Datatype Description
pk primary key INTEGER autoincremented primary key
id INTEGER References the “id” field in the “whatdatall” table
source TEXT What document, or what email, etc was used to communicate that this sample should be removed
reason TEXT Reason that the sample is being removed

rsem_assignment_plot

This table stores RSEM quantification QC information.

Field Purpose Datatype Description
pk primary key INTEGER autoincremented primary key
library_id foreign key TEXT References the library_id field in the library table
Full Table Details
Field Purpose Datatype Description
pk primary key INTEGER autoincremented primary key
library_id foreign key TEXT References the library_id field in the library table
aligned_uniquely_to_a_gene INTEGER RSEM determined unique alignment raw counts
aligned_to_multiple_genes INTEGER RSEM determined multimap raw counts
filtered_due_to_too_many_alignments INTEGER Reads not considered for the EM quantification due to ‘too many’ multimaps
unalignable_reads INTEGER raw count of reads which count not be aligned in the genome

library

This is a very important table. It essentially functions to assign a unique ID to each actual sample (remember a sample might have as many as 6 fastq files assigned to it because of the paired-end nature, and the fact that a single sample can be split over mulitple lanes). The utility of having a unique primary key for every sample is that it gives us a way to refer to an actual biological entity independent of the (subject,visit) indentifiers, which may be wrong. This primary key will never change, even if the sample is re-labelled with a different subject later on.

Field Purpose Datatype Description
library_id primary key INTEGER autoincremented primary key
batch_id foreign key to batch INTEGER References “pk” of the batch table
Full Table Details
Field Purpose Datatype Description
library_id primary key INTEGER autoincremented primary key
fastq_id foreign key to fastq TEXT the ORIGINAL LLFS ID which is extracted from the fastq filename
visit TEXT The visit, extracted from the fastqfilename
batch_id foreign key to batch INTEGER References “pk” of the batch table
suspicious_sex BOOLEAN TRUE (1) if the sex expression is obviously opposite of what is labelled. NOTE that this is a property of the library, not the fastq_id. This is applied only after doing the WGS and other relabeling and represents a sample which is mislabeled by sex and for which a correct relabeing cannot be found.

samtools_idxstats_xy_plot

This table stores the X and Y chromosome expression information as extracted by samtools. Note that I have found that looking at the top 500 most variable gene’s PCs is an easier method of determining sex (and mislabels determined by sex) than these counts are. However, I use the counts to confirm the PCA results.

Field Purpose Datatype Description
pk primary key INTEGER autoincremented primary key
library_id foreign key TEXT References the library_id field in the library table
Full Table Details
Field Purpose Datatype Description
pk primary key INTEGER autoincremented primary key
library_id foreign key TEXT References the library_id field in the library table
chromosome_x INTEGER raw counts of reads over the X chromosome
chromosome_y INTEGER raw counts of reads over the Y chromosome

visitcode_map

This is a mapping from visit, which is used in the fastq filename to visitcode. (id/subject, visitcode) form the primary key of the LLFS phenotype data, while visit, though used in the fastq filename, is not defined. This is the mapping that is used to get the visitcode from the visit:

  • if visit == 1, then visitcode = 1
  • if visit == 2, then visitcode is either 3 or 4
  • if visit == 3, then visitcode is either 6 or 7

No subject can have visitcode 3 and 4, or visitcode 6 and 7, so this mapping is actually one-to-one.

Field Purpose Datatype Description
pk primary key autoincrement INTEGER Unique identifier for each record in the visitcode_map table
whatdatall_id foreign key TEXT References the “id” field in the “whatdatall” table

Full Table Details
Field Purpose Datatype Description
pk primary key autoincrement INTEGER Unique identifier for each record in the visitcode_map table
whatdatall_id foreign key TEXT References the “id” field in the “whatdatall” table
visit TEXT Represents a visit, only allows values ‘1’, ‘2’, ‘3’
visitcode TEXT Represents a visit code, only allows values from ‘1’ to ‘7’

wgs_compare

This table stores the results of comparing the RNA sample variants to WGS sample variants

Field Purpose Datatype Description
pk primary key INTEGER autoincremented primary key
library_id foreign key INTEGER References “pk” of library table
dna_subject INTEGER References “subject” of whatdatall table
Full Table Details
Field Purpose Datatype Description
pk primary key INTEGER autoincremented primary key
library_id foreign key INTEGER References “pk” of library table
dna_subject INTEGER References “subject” of whatdatall table
chr TEXT chromosome over which the comparison was conducted. Note that I frequently use chr21 b/c it is shorter. I have done comparisons and found that there is no loss of specificity by using a smaller chromosome
total_variants INTEGER Total number of overlapping variants between the RNA and DNA
matching_variants INTEGER Total number of overlapping variants which matched
homo_expr_cand INTEGER the total number of cases where the WGS is heterozygous but expression homozygous
match_ratio REAL matching_variants/total_variants

whatdatall

This table stores the ID, subject, and sex as extracted from the LLFS phenotype data. Note that this had to be augmented to include visit 3 samples, which are not yet included in the publically shared LLFS whatdatall.

Field Purpose Datatype Description
pk primary key INTEGER autoincremented primary key
Full Table Details
Field Purpose Datatype Description
pk primary key INTEGER autoincremented primary key
id INTEGER the LLFS ID. Note that this is what is used in the fastq filenames
subject INTEGER The LLFS subject identifier
sex INTEGER labelled sex of the subject with this identifier
source TEXT where this record comes from
notes TEXT any notes on this particular record