This database will make re-labeling libraries less error prone, and will make it much easier to submit to NCBI when the time comes. Below, you will find a description of each table and view in the database. For each table and view, the full description of each column is hidden by default – just click the “view full details” to expand a table that provides the definitions. For each table (but not view, which doesn’t have a primary key per se), the primary and foreign key(s) are shown in the Description (not collapsed by default).
Note that one library might have up to six or more (should be even, of course) fastq files – the paired-end R1 and R2, and sometimes a single library has multiple R1 and R2 files over different lanes.
Please note the following:
All tables, except for library, whose primary key is
library_id
, and batch
, whose primary key is
batch_id
, have an auto-incremented primary key called
pk
.
fastq_id
is always a foreign key to the
fastq
table, but it has an additional meaning: the
fastq_id
is the id
which the actual fastq file
possesses. This fastq_id
may be wrong – for example, if
there was a typo, or that particular ID is in the visit_1 transfer set,
or if it was determined to be mislabeled from the RNA variants vs WGS
data comparison. In this case, a library should be thought of as having
a fastq_id
, derived from the filename of the fastq file,
but also a whatdatall_id
, which results from a correction
(see the manual_library_relabel
table and full_wgs_compare table). A
given library’s subject
will always be assigned from the
library’s whatdatall_id
, which, if you are using one of my
views, is assigned after relabeling any known mislabels.
A view is a saved SQL query. It provides a way to access joined and/or otherwise manipulated data as if it were a table.
This is the main LLFS metadata and is what is used to create the metadata in the LLFS RNA Seq data releases.
NOTE: In the release data, the column
count_headers
is added, which corresponds to the
corresponding release’s plain text count data headers. It is formed by
appending the string ‘library_’ to the library_id
.
Field | Purpose | Datatype | Description |
---|---|---|---|
library_id | INTEGER | The library ID (foreign key to the library table) |
|
legal | TEXT | Boolean column. Legal status. 1 if legal | |
subject | INTEGER | The LLFS subject identifier (see whatdatall ) |
|
visit | TEXT | Despite it not being an officially defined variable, visit is what is used in the fastqfilename. This must be translated to visitcode in order to join the RNAseq data on (subject,visitcode) to the rest of the LLFS phenotype data | |
visitcode | TEXT | The whatdatall LLFS phenotype visitcode value. See the visitcode_map table for details on how this is derived from visit | |
batch | INTEGER | This is the name of the data directory on the DSG cluster and
corresponds to the batch in which the sample was sequenced. See table
batch
|
|
sex | INTEGER | The subject’s sex. Note that if a library is relabeled, the sex is determined after relabeling | |
suspicious_sex | INTEGER | see the library table for more details. This is this is
a boolean (0,1 where 1 is TRUE) field which is TRUE if the library
clearly shows sex chromosome expression opposite from its label. This
column exists because not all RNAseq subjects have WGS data, so a TRUE
here means that the subject did not exist in the WGS set AND it shows
evidence of being mislabeled by sex chrom expression |
|
purpose | TEXT | either ‘control_phantom’, ‘control_pool’ | |
mislabel | INTEGER | See the library table for more details. this is a
boolean (0,1 where 1 is TRUE) field which is TRUE if the sample was
clearly mislabelled during the WGS comparison, but the actual label was
not found |
|
relabel | INTEGER | See the library table for more details. this is a
boolean (0,1 where 1 is TRUE) field which is TRUE if the sample has been
relabeled for any reason, including as a result of the WGS vs RNA
variant comparison, because the sample is a phantom control, or due to
manifest typos or v1_transfers. See notes below. |
|
notes | INTEGER | notes on the samples, such as why it was labelled mislabel or relabel | |
y_x_ratio | REAL | Ratio of reads mapped to chromosome Y to chromosome X | |
library_total | REAL | total size of the library | |
percent_intergenic | REAL | Percentage of reads in intergenic regions | |
hemoglobin_total | REAL | total number of reads counted over 12 hemoglobin genes | |
percent_hemoglobin | REAL | total raw counts over 12 hemoglobin genes / library_size. see the expression_qc table for more details | |
protein_coding_total | REAL | raw reads over loci denoted “protein_coding” in the gtf (you can
discover this by loading the dds object and doing this
rowRanges(dds)$type == 'protein_coding'
|
|
percent_protein_coding | REAL | protein_coding_total / library_size. see the expression_qc table for more details | |
rn7sl_total | REAL | total number of reads over two rn7sl genes. These are highly over
represented in nearly all of the libraries. see table
expression_qc for more details |
|
percent_rn7sl | REAL | total raw counts over two loci that are consistently, by far, the loci with the most counts. See expression_qc for more details |
This should be treated as an intermediate table, though it is useful if you wish to associate the LLFS IDs with the samples.
Field | Purpose | Datatype | Description |
---|---|---|---|
library_id | INTEGER | The library ID (foreign key to the library table) |
|
llfs_id | INTEGER | The LLFS ID (see whatdatall ) |
|
visit | TEXT | Despite it not being an officially defined variable, visit is what is used in the fastqfilename. This must be translated to visitcode in order to join the RNAseq data on (subject,visitcode) to the rest of the LLFS phenotype data | |
visitcode | TEXT | The whatdatall LLFS phenotype visitcode value. See the visitcode_map table for details on how this is derived from visit | |
data_dir | INTEGER | This is the name of the data directory on dsg. See table
batch
|
|
purpose | TEXT | either ‘control_phantom’, ‘control_pool’ | |
suspicious_sex | INTEGER | see the library table for more details. This is this is a boolean (0,1 where 1 is TRUE) field which is TRUE if the library clearly shows sex chromosome expression opposite from its label. Note that this is applied after all other re-labeling is completed | |
mislabel | INTEGER | See the library table for more details. this is a
boolean (0,1 where 1 is TRUE) field which is TRUE if the library was
clearly mislabeled during the WGS comparison, but the actual label was
not found |
|
relabel | INTEGER | See the library table for more details. this is a
boolean (0,1 where 1 is TRUE) field which is TRUE if the library has
been relabeled for any reason, including as a result of the WGS vs RNA
variant comparison, because the library is a phantom control, or due to
manifest typos or v1_transfers. See notes below. |
|
notes | INTEGER | notes on the libraries, such as why it was labelled mislabel or relabel | |
legal | TEXT | Boolean column. Legal status. 1 if legal | |
y_x_ratio | REAL | Ratio of reads mapped to chromosome Y to chromosome X | |
library_size | REAL | total size of the library | |
percent_intergenic | REAL | Percentage of reads in intergenic regions | |
hemoglobin_total | REAL | total number of reads counted over 12 hemoglobin genes | |
percent_hemoglobin | REAL | total raw counts over 12 hemoglobin genes / library_size. see the expression_qc table for more details | |
protein_coding_total | REAL | raw reads over loci denoted “protein_coding” in the gtf (you can
discover this by loading the dds object and doing this
rowRanges(dds)$type == 'protein_coding'
|
|
percent_protein_coding | REAL | protein_coding_total / library_size. see the expression_qc table for more details | |
rn7sl_total | REAL | total number of reads over two rn7sl genes. These are highly over
represented in nearly all of the libraries. see table
expression_qc for more details |
|
percent_rn7sl | REAL | total raw counts over two loci that are consistently, by far, the loci with the most counts. See expression_qc for more details |
This view summarizes the result of the all-by-all comparison between
a given RNA Seq library and all 4,556 WGS samples. The all-by-all
comparison candidates are chosen if the library is passing the
percent_intergenic filter of 8%, has a match_ratio
of less
than .94 or has fewer than 100 total_variants
.
Field | Purpose | Datatype | Description |
---|---|---|---|
empirical_best_id | INTEGER | The LLFS ID corresponding to the best empirical match between RNA variants and WGS data | |
library_id | INTEGER |
library_id from the library table |
|
labelled_dna_subject | INTEGER | the LLFS subject identifier corresponding to the LLFS ID which is contained in the library’s fastq filename | |
labelled_total_variants | INTEGER | The total number of high quality variant locations in the RNA library over which there high quality WGS genotype data for a given subject | |
labelled_match_variants | INTEGER | The total number of total_variants locations over which
the genotypes in the two samples matched |
|
labelled_match_ratio | INTEGER |
match_variants /total_variants
|
|
empirical_best_subject | INTEGER | The LLFS subject corresponding to the best empirical match between RNA variants and WGS data | |
empirical_best_[total_variants,matching_variants,match_Ratio] | INTEGER | same as the labelled_... columns above, but for the
best empirical match |
|
empirical_next_subject | INTEGER | The LLFS subject corresponding to the second best empirical match in the all by all comparison | |
empirical_next_[total_variants,matching_variants,match_Ratio] | INTEGER | same as the labelled_... columns above, but for the
second best empirical match |
|
chisq_labelled | INTEGER | chisq pvalue for the labelled vs empirical best data | |
chisq_empirical | INTEGER | chisq pvalue for the best empirical match vs the next best empirical match | |
relabel | BOOLEAN | if the empirical best match is different than the original label, has a match ratio of >= .94 and both pvalues are significant, then the sample is relabeled (1). Else it is not (0) |
This provides an overview of all relabelled samples
Field | Purpose | Datatype | Description |
---|---|---|---|
library_id | TEXT |
library_id from the library table |
|
proposed_relabel_id | INTEGER | The proposed relabel LLFS ID. Note that this is only applied in the
metadata if relabel == 1 |
|
mislabel | BOOLEAN | 0 if the sample is not a mislabel (ie, a phantom control), 1 if it is | |
relabel | BOOLEAN | 0 if the data does not support relabeling. 1 if it does. See the
filtered_wgs_compare view for details |
|
notes | REAL | any additional notes from the underlying tables |
This view provides information about the genomic origin of reads
based on the qualimap_genomic_origin
table.
Field | Purpose | Datatype | Description |
---|---|---|---|
library_id | TEXT | Foreign key to the library table | |
exonic | INTEGER | Number of reads in exonic regions | |
intronic | INTEGER | Number of reads in intronic regions | |
intergenic | INTEGER | Number of reads in intergenic regions | |
percent_exonic | REAL | Percentage of reads in exonic regions | |
percent_intronic | REAL | Percentage of reads in intronic regions | |
percent_intergenic | REAL | Percentage of reads in intergenic regions |
This view provides information about the XY ratio based on the
samtools_idxstats_xy_plot
table.
Field | Purpose | Datatype | Description |
---|---|---|---|
library_id | TEXT | Foreign key to the library table | |
chromosome_x | INTEGER | Number of reads mapped to chromosome X | |
chromosome_y | INTEGER | Number of reads mapped to chromosome Y | |
y_x_ratio | REAL | Ratio of reads mapped to Y over reads mapped to X |
This is a view of the expression_qc table with additional columns created from the hemoglobin, rn7sl and protein_coding_total / library size
Field | Purpose | Datatype | Description |
---|---|---|---|
pk | primary key autoincrement | INTEGER | auto incremented primary key |
library_id | foreign key | INTEGER | foreign key to the library_id in table
library
|
library_size | INTEGER | Holds the library size data for each record | |
protein_coding_total | INTEGER | Represents the total count of protein coding for each record | |
hemoglobin_counts | INTEGER | Stores the count of hemoglobin for each record | |
rn7sl1_counts | INTEGER | Stores the count of rn7sl1 for each record | |
percent_hemoglobin | INTEGER | hemoglobin_counts / library_size | |
percent_rn7sl | INTEGER | rn7sl_counts / library_size | |
percent_protein_coding | INTEGER | protein_coding_total / library_size |
Each heading below is a physical table (not a view) in the database. A brief description of each is provided, and the primary key and foreign key, if one exists, is shown to give you an overview of how these tables are related.
To see the full description of any one of these tables, click “Full Table Details”.
This table stores information about batches.
Field | Purpose | Datatype | Description |
---|---|---|---|
pk | primary key | INTEGER | autoincremented primary key |
Field | Purpose | Datatype | Description |
---|---|---|---|
pk | primary key | INTEGER | autoincremented primary key |
data_dir | TEXT | name of the directory which stores this data in the dsg cluster | |
batch_alias | TEXT | I was unfortunately inconsistent in how I refered to these in processing the data. this column provides a map from what I named the batch when processing to the name of the dsg directory | |
dsg_fastq_dirpath | TEXT | The subdirectory structure of each data_dir varies quite a bit. This column provides the absolute path to the actual directory which stores the fastq files |
This table stores information about batches.
Field | Purpose | Datatype | Description |
---|---|---|---|
pk | primary key | INTEGER | autoincremented primary key |
library_id | foreign key | INTEGER | foreign key to the library_id field in the
library table |
Field | Purpose | Datatype | Description |
---|---|---|---|
pk | primary key | INTEGER | autoincremented primary key |
library_id | foreign key | INTEGER | foreign key to the library_id field in the
library table |
new_whatdatall_id | INTEGER | The correct LLFS ID for this library | |
reason | TEXT | The reason why this library was relabeled | |
notes | TEXT | Free entry for any additional information |
This table stores count data which is extracted in the expression_qc step (see the articles in the served documentation, or the vignette in the code repo).
Field | Purpose | Datatype | Description |
---|---|---|---|
pk | primary key autoincrement | INTEGER | auto incremented primary key |
library_id | foreign key | INTEGER | References the “library_id” field in the “library” table |
Field | Purpose | Datatype | Description |
---|---|---|---|
pk | primary key autoincrement | INTEGER | auto incremented primary key |
library_id | foreign key | INTEGER | References the “library_id” field in the “library” table |
library_size | INTEGER | Holds the library size data for each record | |
protein_coding_total | INTEGER | Represents the total count of protein coding for each record | |
hemoglobin_counts | INTEGER | Stores the count of hemoglobin for each record | |
rn7sl1_counts | INTEGER | Stores the count of rn7sl1 for each record |
This table stores information about FASTQ files. Note that there may be more than 1 fastq record for each library_id since there are some (quite a few) libraries which have multiple lanes worth of paired-end libraries (so if 1 library is split across two lanes, there would be 2 R1 and 2 R2)
Field | Purpose | Datatype | Description |
---|---|---|---|
pk | primary key | INTEGER | auto incremented primary key |
library_id | foreign key | TEXT | References the “library_id” field in the “library” table |
Field | Purpose | Datatype | Description |
---|---|---|---|
pk | primary key | INTEGER | autoincremented primary key |
library_id | foreign key | INTEGER | foreign key to the library table |
fastq_1 | TEXT | name of the R1 fastq file | |
fastq_2 | TEXT | name of the R2 fastq file | |
strandedness | TEXT | the strandedness of the library protocol | |
index_sequence | TEXT | index sequence extracted from the fastq filename | |
flow_cell_id | foreign key to flow_cell | TEXT | flow cell identifier |
lane | TEXT | lane in the flowcell | |
notes | TEXT | any notes on the fastq. Defaults to ‘none’ |
Note: parsing this data out of multiQC is a work in progress.
This is the ‘general’ table stats from MultiQC and includes both alignment and read level QC. As a result, there are at least 3 records for each library.
Field | Purpose | Datatype | Description |
---|---|---|---|
pk | primary key | INTEGER | autoincremented primary key |
library_id | foreign key | INTEGER | References the library_id field in the
library table |
Field | Purpose | Datatype | Description |
---|---|---|---|
pk | primary key | INTEGER | autoincremented primary key |
library_id | foreign key | INTEGER | References the library_id field in the
library table |
general.dupradar_intercept | REAL | Add a brief explanation of the field here. | |
general.percent_rrna | REAL | Add a brief explanation of the field here. | |
general.library | TEXT | Add a brief explanation of the field here. | |
general.unpaired_reads_examined | INTEGER | Add a brief explanation of the field here. | |
general.read_pairs_examined | INTEGER | Add a brief explanation of the field here. | |
general.secondary_or_supplementary_rds | INTEGER | Add a brief explanation of the field here. | |
general.unmapped_reads | INTEGER | Add a brief explanation of the field here. | |
general.unpaired_read_duplicates | INTEGER | Add a brief explanation of the field here. | |
general.read_pair_duplicates | INTEGER | Add a brief explanation of the field here. | |
general.read_pair_optical_duplicates | INTEGER | Add a brief explanation of the field here. | |
general.percent_duplication | INTEGER | Add a brief explanation of the field here. | |
general.estimated_library_size | INTEGER | Add a brief explanation of the field here. | |
general.5_3_bias | REAL | Add a brief explanation of the field here. | |
general.reads_aligned | INTEGER | Add a brief explanation of the field here. | |
general.unalignable | INTEGER | Add a brief explanation of the field here. | |
general.alignable | INTEGER | Add a brief explanation of the field here. | |
general.filtered | INTEGER | Add a brief explanation of the field here. | |
general.total | INTEGER | Add a brief explanation of the field here. | |
general.alignable_percent | REAL | Add a brief explanation of the field here. | |
general.unique | INTEGER | Add a brief explanation of the field here. | |
general.multi | INTEGER | Add a brief explanation of the field here. | |
general.uncertain | INTEGER | Add a brief explanation of the field here. | |
general.total_records | INTEGER | Add a brief explanation of the field here. | |
general.qc_failed | INTEGER | Add a brief explanation of the field here. | |
general.optical_pcr_duplicate | INTEGER | Add a brief explanation of the field here. | |
general.non_primary_hits | INTEGER | Add a brief explanation of the field here. | |
general.mapq_lt_mapq_cut_non_unique | INTEGER | Add a brief explanation of the field here. | |
general.mapq_gte_mapq_cut_unique | INTEGER | Add a brief explanation of the field here. | |
general.read_1 | INTEGER | Add a brief explanation of the field here. | |
general.read_2 | INTEGER | Add a brief explanation of the field here. | |
general.reads_map_to_sense | INTEGER | Add a brief explanation of the field here. | |
general.reads_map_to_antisense | INTEGER | Add a brief explanation of the field here. | |
general.non_splice_reads | INTEGER | Add a brief explanation of the field here. | |
general.splice_reads | INTEGER | Add a brief explanation of the field here. | |
general.reads_mapped_in_proper_pairs | INTEGER | Add a brief explanation of the field here. | |
general.proper_paired_reads_map_to_different_chrom | INTEGER | Add a brief explanation of the field here. | |
general.unique_percent | REAL | Add a brief explanation of the field here. | |
general.proper_pairs_percent | REAL | Add a brief explanation of the field here. | |
general.raw_total_sequences | INTEGER | Add a brief explanation of the field here. | |
general.filtered_sequences | INTEGER | Add a brief explanation of the field here. | |
general.sequences | INTEGER | Add a brief explanation of the field here. | |
general.is_sorted | INTEGER | Add a brief explanation of the field here. | |
general.1st_fragments | INTEGER | Add a brief explanation of the field here. | |
general.last_fragments | INTEGER | Add a brief explanation of the field here. | |
general.reads_mapped | INTEGER | Add a brief explanation of the field here. | |
general.reads_mapped_and_paired | INTEGER | Add a brief explanation of the field here. | |
general.reads_unmapped | INTEGER | Add a brief explanation of the field here. | |
general.reads_properly_paired | INTEGER | Add a brief explanation of the field here. | |
general.reads_paired | INTEGER | Add a brief explanation of the field here. | |
general.reads_duplicated | INTEGER | Add a brief explanation of the field here. | |
general.reads_mq0 | INTEGER | Add a brief explanation of the field here. | |
general.reads_qc_failed | INTEGER | Add a brief explanation of the field here. | |
general.non_primary_alignments | INTEGER | Add a brief explanation of the field here. | |
general.supplementary_alignments | INTEGER | Add a brief explanation of the field here. | |
general.total_length | INTEGER | Add a brief explanation of the field here. | |
general.total_first_fragment_length | INTEGER | Add a brief explanation of the field here. | |
general.total_last_fragment_length | INTEGER | Add a brief explanation of the field here. | |
general.bases_mapped | INTEGER | Add a brief explanation of the field here. | |
general.bases_mapped_cigar | INTEGER | Add a brief explanation of the field here. | |
general.bases_trimmed | INTEGER | Add a brief explanation of the field here. | |
general.bases_duplicated | INTEGER | Add a brief explanation of the field here. | |
general.mismatches | INTEGER | Add a brief explanation of the field here. | |
general.error_rate | REAL | Add a brief explanation of the field here. | |
general.average_length | INTEGER | Add a brief explanation of the field here. | |
general.average_first_fragment_length | INTEGER | Add a brief explanation of the field here. | |
general.average_last_fragment_length | INTEGER | Add a brief explanation of the field here. | |
general.maximum_length | INTEGER | Add a brief explanation of the field here. | |
general.maximum_first_fragment_length | INTEGER | Add a brief explanation of the field here. | |
general.maximum_last_fragment_length | INTEGER | Add a brief explanation of the field here. | |
general.average_quality | REAL | Add a brief explanation of the field here. | |
general.insert_size_average | REAL | Add a brief explanation of the field here. | |
general.insert_size_standard_deviation | REAL | Add a brief explanation of the field here. | |
general.inward_oriented_pairs | INTEGER | Add a brief explanation of the field here. | |
general.outward_oriented_pairs | INTEGER | Add a brief explanation of the field here. | |
general.pairs_with_other_orientation | INTEGER | Add a brief explanation of the field here. | |
general.pairs_on_different_chromosomes | INTEGER | Add a brief explanation of the field here. | |
general.percentage_of_properly_paired_reads_% | REAL | Add a brief explanation of the field here. | |
general.reads_mapped_percent | REAL | Add a brief explanation of the field here. | |
general.reads_mapped_and_paired_percent | REAL | Add a brief explanation of the field here. | |
general.reads_unmapped_percent | REAL | Add a brief explanation of the field here. | |
general.reads_properly_paired_percent | REAL | Add a brief explanation of the field here. | |
general.reads_paired_percent | REAL | Add a brief explanation of the field here. | |
general.reads_duplicated_percent | REAL | Add a brief explanation of the field here. | |
general.reads_mq0_percent | REAL | Add a brief explanation of the field here. | |
general.reads_qc_failed_percent | REAL | Add a brief explanation of the field here. | |
general.total_passed | INTEGER | Add a brief explanation of the field here. | |
general.total_failed | INTEGER | Add a brief explanation of the field here. | |
general.secondary_passed | INTEGER | Add a brief explanation of the field here. | |
general.secondary_failed | INTEGER | Add a brief explanation of the field here. | |
general.supplementary_passed | INTEGER | Add a brief explanation of the field here. | |
general.supplementary_failed | INTEGER | Add a brief explanation of the field here. | |
general.duplicates_passed | INTEGER | Add a brief explanation of the field here. | |
general.duplicates_failed | INTEGER | Add a brief explanation of the field here. | |
general.mapped_passed | INTEGER | Add a brief explanation of the field here. | |
general.mapped_failed | INTEGER | Add a brief explanation of the field here. | |
general.mapped_passed_pct | REAL | Add a brief explanation of the field here. | |
general.mapped_failed_pct | REAL | Add a brief explanation of the field here. | |
general.paired_in_sequencing_passed | INTEGER | Add a brief explanation of the field here. | |
general.paired_in_sequencing_failed | INTEGER | Add a brief explanation of the field here. | |
general.read1_passed | INTEGER | Add a brief explanation of the field here. | |
general.read1_failed | INTEGER | Add a brief explanation of the field here. | |
general.read2_passed | INTEGER | Add a brief explanation of the field here. | |
general.read2_failed | INTEGER | Add a brief explanation of the field here. | |
general.properly_paired_passed | INTEGER | Add a brief explanation of the field here. | |
general.properly_paired_failed | INTEGER | Add a brief explanation of the field here. | |
general.properly_paired_passed_pct | REAL | Add a brief explanation of the field here. | |
general.properly_paired_failed_pct | REAL | Add a brief explanation of the field here. | |
general.with_itself_and_mate_mapped_passed | INTEGER | Add a brief explanation of the field here. | |
general.with_itself_and_mate_mapped_failed | INTEGER | Add a brief explanation of the field here. | |
general.singletons_passed | INTEGER | Add a brief explanation of the field here. | |
general.singletons_failed | INTEGER | Add a brief explanation of the field here. | |
general.singletons_passed_pct | REAL | Add a brief explanation of the field here. | |
general.singletons_failed_pct | REAL | Add a brief explanation of the field here. | |
general.with_mate_mapped_to_a_different_chr_passed | INTEGER | Add a brief explanation of the field here. | |
general.with_mate_mapped_to_a_different_chr_failed | INTEGER | Add a brief explanation of the field here. | |
general.with_mate_mapped_to_a_different_chr_mapq__5_passed | INTEGER | Add a brief explanation of the field here. | |
general.with_mate_mapped_to_a_different_chr_mapq__5_failed | INTEGER | Add a brief explanation of the field here. | |
general.flagstat_total | INTEGER | Add a brief explanation of the field here. | |
general.percent_gc | REAL | Add a brief explanation of the field here. | |
general.avg_sequence_length | REAL | Add a brief explanation of the field here. | |
general.total_sequences | REAL | Add a brief explanation of the field here. | |
general.percent_duplicates | REAL | Add a brief explanation of the field here. | |
general.percent_fails | REAL | Add a brief explanation of the field here. | |
general.r_processed | INTEGER | Add a brief explanation of the field here. | |
general.r_with_adapters | INTEGER | Add a brief explanation of the field here. | |
general.r_written | INTEGER | Add a brief explanation of the field here. | |
general.bp_processed | REAL | Add a brief explanation of the field here. | |
general.quality_trimmed | INTEGER | Add a brief explanation of the field here. | |
general.bp_written | REAL | Add a brief explanation of the field here. | |
general.percent_trimmed | REAL | Add a brief explanation of the field here. |
This table stores genomic origin information from QualiMap. See the QualiMap documentation for more details on how a read is assigned to each category.
Field | Purpose | Datatype | Description |
---|---|---|---|
pk | primary key | INTEGER | autoincremented primary key |
library_id | foreign key | TEXT | References the library_id field in the
library table |
Field | Purpose | Datatype | Description |
---|---|---|---|
pk | primary key | INTEGER | autoincremented primary key |
library_id | foreign key | TEXT | References the library_id field in the
library table |
exonic | INTEGER | raw counts of reads over exonic regions | |
intronic | INTEGER | raw counts of reads over intronic regions | |
intergenic | INTEGER | raw counts of reads over intergenic regions |
This table stores information about removed IDs. Note that if there
is ever more than 1 reason (currently the only reason is
legal_removal
), then the reasons/sources should be
extracted to a separate table like id_correction
Field | Purpose | Datatype | Description |
---|---|---|---|
pk | primary key | INTEGER | autoincremented primary key |
id | INTEGER | References the “id” field in the “whatdatall” table |
Field | Purpose | Datatype | Description |
---|---|---|---|
pk | primary key | INTEGER | autoincremented primary key |
id | INTEGER | References the “id” field in the “whatdatall” table | |
source | TEXT | What document, or what email, etc was used to communicate that this sample should be removed | |
reason | TEXT | Reason that the sample is being removed |
This table stores RSEM quantification QC information.
Field | Purpose | Datatype | Description |
---|---|---|---|
pk | primary key | INTEGER | autoincremented primary key |
library_id | foreign key | TEXT | References the library_id field in the
library table |
Field | Purpose | Datatype | Description |
---|---|---|---|
pk | primary key | INTEGER | autoincremented primary key |
library_id | foreign key | TEXT | References the library_id field in the
library table |
aligned_uniquely_to_a_gene | INTEGER | RSEM determined unique alignment raw counts | |
aligned_to_multiple_genes | INTEGER | RSEM determined multimap raw counts | |
filtered_due_to_too_many_alignments | INTEGER | Reads not considered for the EM quantification due to ‘too many’ multimaps | |
unalignable_reads | INTEGER | raw count of reads which count not be aligned in the genome |
This is a very important table. It essentially functions to assign a unique ID to each actual sample (remember a sample might have as many as 6 fastq files assigned to it because of the paired-end nature, and the fact that a single sample can be split over mulitple lanes). The utility of having a unique primary key for every sample is that it gives us a way to refer to an actual biological entity independent of the (subject,visit) indentifiers, which may be wrong. This primary key will never change, even if the sample is re-labelled with a different subject later on.
Field | Purpose | Datatype | Description |
---|---|---|---|
library_id | primary key | INTEGER | autoincremented primary key |
batch_id | foreign key to batch | INTEGER | References “pk” of the batch table |
Field | Purpose | Datatype | Description |
---|---|---|---|
library_id | primary key | INTEGER | autoincremented primary key |
fastq_id | foreign key to fastq | TEXT | the ORIGINAL LLFS ID which is extracted from the fastq filename |
visit | TEXT | The visit, extracted from the fastqfilename | |
batch_id | foreign key to batch | INTEGER | References “pk” of the batch table |
suspicious_sex | BOOLEAN | TRUE (1) if the sex expression is obviously opposite of what is
labelled. NOTE that this is a property of the library, not the
fastq_id . This is applied only after doing
the WGS and other relabeling and represents a sample which is mislabeled
by sex and for which a correct relabeing cannot be found. |
This table stores the X and Y chromosome expression information as extracted by samtools. Note that I have found that looking at the top 500 most variable gene’s PCs is an easier method of determining sex (and mislabels determined by sex) than these counts are. However, I use the counts to confirm the PCA results.
Field | Purpose | Datatype | Description |
---|---|---|---|
pk | primary key | INTEGER | autoincremented primary key |
library_id | foreign key | TEXT | References the library_id field in the
library table |
Field | Purpose | Datatype | Description |
---|---|---|---|
pk | primary key | INTEGER | autoincremented primary key |
library_id | foreign key | TEXT | References the library_id field in the
library table |
chromosome_x | INTEGER | raw counts of reads over the X chromosome | |
chromosome_y | INTEGER | raw counts of reads over the Y chromosome |
This is a mapping from visit
, which is used in the
fastq filename
to visitcode
.
(id/subject
, visitcode
) form the primary key
of the LLFS phenotype data, while visit
, though used in the
fastq filename, is not defined. This is the mapping that is used to get
the visitcode
from the visit
:
No subject can have visitcode 3 and 4, or visitcode 6 and 7, so this mapping is actually one-to-one.
Field | Purpose | Datatype | Description |
---|---|---|---|
pk | primary key autoincrement | INTEGER | Unique identifier for each record in the visitcode_map table |
whatdatall_id | foreign key | TEXT | References the “id” field in the “whatdatall” table |
Field | Purpose | Datatype | Description |
---|---|---|---|
pk | primary key autoincrement | INTEGER | Unique identifier for each record in the visitcode_map table |
whatdatall_id | foreign key | TEXT | References the “id” field in the “whatdatall” table |
visit | TEXT | Represents a visit, only allows values ‘1’, ‘2’, ‘3’ | |
visitcode | TEXT | Represents a visit code, only allows values from ‘1’ to ‘7’ |
This table stores the results of comparing the RNA sample variants to WGS sample variants
Field | Purpose | Datatype | Description |
---|---|---|---|
pk | primary key | INTEGER | autoincremented primary key |
library_id | foreign key | INTEGER | References “pk” of library table |
dna_subject | INTEGER | References “subject” of whatdatall table |
Field | Purpose | Datatype | Description |
---|---|---|---|
pk | primary key | INTEGER | autoincremented primary key |
library_id | foreign key | INTEGER | References “pk” of library table |
dna_subject | INTEGER | References “subject” of whatdatall table | |
chr | TEXT | chromosome over which the comparison was conducted. Note that I frequently use chr21 b/c it is shorter. I have done comparisons and found that there is no loss of specificity by using a smaller chromosome | |
total_variants | INTEGER | Total number of overlapping variants between the RNA and DNA | |
matching_variants | INTEGER | Total number of overlapping variants which matched | |
homo_expr_cand | INTEGER | the total number of cases where the WGS is heterozygous but expression homozygous | |
match_ratio | REAL | matching_variants/total_variants |
This table stores the ID
, subject
, and
sex
as extracted from the LLFS phenotype data. Note that
this had to be augmented to include visit 3 samples, which are not yet
included in the publically shared LLFS whatdatall
.
Field | Purpose | Datatype | Description |
---|---|---|---|
pk | primary key | INTEGER | autoincremented primary key |
Field | Purpose | Datatype | Description |
---|---|---|---|
pk | primary key | INTEGER | autoincremented primary key |
id | INTEGER | the LLFS ID. Note that this is what is used in the fastq filenames | |
subject | INTEGER | The LLFS subject identifier | |
sex | INTEGER | labelled sex of the subject with this identifier | |
source | TEXT | where this record comes from | |
notes | TEXT | any notes on this particular record |