DataCard Tutorial: Exploring HuggingFace Dataset Metadata
The DataCard class provides an interface for exploring HuggingFace dataset metadata without loading the actual genomic data. This is particularly useful for:
- Understanding dataset structure and available configurations
- Exploring experimental conditions at all hierarchy levels
- Discovering metadata relationships
- Planning data analysis workflows and metadata table creation
In this tutorial, we'll explore the BrentLab/harbison_2004 dataset, which contains ChIP-chip data for transcription factor binding across 14 environmental conditions in yeast.
1. Instantiating a DataCard Object
from labretriever.datacard import DataCard
card = DataCard('BrentLab/harbison_2004')
print(f"Repository: {card.repo_id}")
Repository: BrentLab/harbison_2004
2. Repository Overview
card.info() returns repository-level metadata including license, citation, tags, and a summary of all configurations. Pass a configuration name to get dataset-level detail.
# Get repository information
repo_info = card.info()
print("Repository Information:")
print("=" * 40)
for key, value in repo_info.items():
    if key != "configs":
        print(f"{key:20}: {value}")
print("\nConfigurations:")
for cfg in repo_info["configs"]:
    default_mark = " (default)" if cfg["default"] else ""
    print(f" - {cfg['config_name']}: {cfg['dataset_type']}{default_mark}")
    print(f" {cfg['description']}")
Repository Information:
========================================
repo_id : BrentLab/harbison_2004
pretty_name : Harbison, 2004 ChIP-chip
license : mit
doi : https://doi.org/10.1038/nature02800
citation : Harbison CT, Gordon DB, Lee TI, Rinaldi NJ, Macisaac KD, et al. 2004. Transcriptional regulatory code of a eukaryotic genome. Nature.
tags : ['genomics', 'yeast', 'transcription', 'binding']
language : ['en']
size_categories : ['1M<n<10M']
num_configs : 1
total_files : 7
last_modified : 2026-04-22T00:18:32+00:00
has_default_config : True
Configurations:
- harbison_2004: annotated_features (default)
ChIP-chip data, primarily in standard YPD batch culture, but selected TFs were also tested in specific conditions where they were expected to be active.
3. Exploring Configurations
Datasets can have multiple configurations representing different types of data.
# List all configurations
print(f"Number of configurations: {len(card.configs)}")
print("\nConfiguration details:")
for config in card.configs:
    print(f"\n• {config.config_name}:")
    print(f" Type: {config.dataset_type.value}")
    print(f" Default: {config.default}")
    print(f" Description: {config.description}")
    print(f" Features: {len(config.dataset_info.features)}")
Number of configurations: 1

Configuration details:

• harbison_2004:
 Type: annotated_features
 Default: True
 Description: ChIP-chip data, primarily in standard YPD batch culture, but selected TFs were also tested in specific conditions where they were expected to be active.
 Features: 7
4. Understanding Experimental Conditions: The Three-Level Hierarchy
The labretriever system supports experimental conditions at three hierarchy levels:
- Top-level (repo-wide): Conditions common to all datasets/samples
- Config-level: Conditions specific to a dataset configuration
- Field-level: Conditions that vary per sample, defined in field definitions
Let's explore each level for the Harbison 2004 dataset.
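One way to picture how the three levels combine is a layered dictionary merge. The sketch below is not labretriever's implementation: it assumes that more specific levels override more general ones, and `merge_levels` is a hypothetical helper introduced only for illustration. The sample values come from this dataset (a repository-wide temperature of 30 and the HEAT definition's temperature of 37).

```python
# Illustrative sketch only. The precedence (field > config > top) is an
# assumption, and merge_levels is a hypothetical helper, not a labretriever API.
def merge_levels(*levels):
    """Shallow-merge condition dicts; later (more specific) levels win."""
    merged = {}
    for level in levels:
        if level:  # skip None or empty levels
            merged.update(level)
    return merged


top_level = {"temperature_celsius": 30}   # repository-wide condition
config_level = None                       # nothing defined for this config
heat_field = {"temperature_celsius": 37}  # subset of the HEAT definition
ypd_field = {"growth_phase_at_harvest": {"od600": 0.8}}  # subset of YPD

heat_conditions = merge_levels(top_level, config_level, heat_field)
ypd_conditions = merge_levels(top_level, config_level, ypd_field)

print(heat_conditions["temperature_celsius"])
print(ypd_conditions["temperature_celsius"])
```

Under this assumed precedence, HEAT resolves to its own temperature while YPD inherits the repository-wide value.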
Level 1: Top-Level Conditions
Top-level conditions apply to all experiments in the repository.
# Get top-level experimental conditions
top_conditions = card.get_experimental_conditions()
print("Top-Level Experimental Conditions:")
print("=" * 40)
if top_conditions:
    for key, value in top_conditions.items():
        print(f"{key}: {value}")
else:
    print("No top-level conditions defined for this repository")
    print("(All conditions are defined at config or field level)")
Top-Level Experimental Conditions:
========================================
temperature_celsius: 30
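Level 2: Config-Level Conditions
Config-level conditions apply to all samples in a specific configuration. Passing a configuration name to get_experimental_conditions returns them merged with the top-level conditions, which is why the output below repeats the repository-wide temperature_celsius value.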
# Get config-level conditions (merged with top-level)
config_conditions = card.get_experimental_conditions('harbison_2004')
print("Config-Level Experimental Conditions:")
print("=" * 40)
if config_conditions:
    for key, value in config_conditions.items():
        print(f"{key}: {value}")
else:
    print("No config-level conditions defined")
    print("(Conditions vary per sample at field level)")
Config-Level Experimental Conditions:
========================================
temperature_celsius: 30
Level 3: Field-Level Conditions
Field-level conditions vary per sample and are defined in field definitions.
# Get definitions for the 'condition' field
# This maps each condition value to its detailed specification
condition_defs = card.get_field_definitions('harbison_2004', 'condition')
print("Condition Field Definitions:")
print("=" * 40)
print(f"Found {len(condition_defs)} defined conditions:\n")
# Show all condition names
for cond_name in sorted(condition_defs.keys()):
    print(f" • {cond_name}")
Condition Field Definitions:
========================================
Found 14 defined conditions:

 • Acid
 • Alpha
 • BUT14
 • BUT90
 • GAL
 • H2O2Hi
 • H2O2Lo
 • HEAT
 • Pi-
 • RAFF
 • RAPA
 • SM
 • Thi-
 • YPD
# Explore a specific condition in detail
import json
# Let's look at the YPD baseline condition
ypd_def = condition_defs.get('YPD', {})
print("YPD Condition Definition:")
print("=" * 40)
print(json.dumps(ypd_def, indent=2))
YPD Condition Definition:
========================================
{
  "description": "Standard YPD rich medium; the baseline condition for nearly all 204 regulators.",
  "growth_phase_at_harvest": {
    "od600": 0.8
  },
  "media": {
    "name": "YPD",
    "carbon_source": [
      {
        "compound": "D-glucose",
        "concentration_percent": 2
      }
    ],
    "nitrogen_source": [
      {
        "compound": "yeast_extract",
        "concentration_percent": 1
      },
      {
        "compound": "peptone",
        "concentration_percent": 2
      }
    ]
  }
}
# Let's look at a treatment condition (HEAT shock)
heat_def = condition_defs.get('HEAT', {})
print("HEAT Condition Definition:")
print("=" * 40)
print(json.dumps(heat_def, indent=2))
HEAT Condition Definition:
========================================
{
  "description": "Temperature shift from 30\u00b0C to 37\u00b0C for 45 minutes.",
  "temperature_celsius": 37,
  "growth_phase_at_harvest": {
    "od600": 0.5
  },
  "media": {
    "name": "YPD",
    "carbon_source": [
      {
        "compound": "D-glucose",
        "concentration_percent": 2
      }
    ],
    "nitrogen_source": [
      {
        "compound": "yeast_extract",
        "concentration_percent": 1
      },
      {
        "compound": "peptone",
        "concentration_percent": 2
      }
    ]
  }
}
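To see at a glance what distinguishes a treatment from the baseline, the two definitions can be compared key by key. The following is a minimal sketch using plain dictionaries copied from the YPD and HEAT output above; `differing_keys` is a hypothetical helper, not part of labretriever.

```python
# Illustrative sketch: values copied from the YPD and HEAT definitions above.
ypd_media = {
    "name": "YPD",
    "carbon_source": [{"compound": "D-glucose", "concentration_percent": 2}],
    "nitrogen_source": [
        {"compound": "yeast_extract", "concentration_percent": 1},
        {"compound": "peptone", "concentration_percent": 2},
    ],
}
ypd = {
    "description": "Standard YPD rich medium; the baseline condition "
                   "for nearly all 204 regulators.",
    "growth_phase_at_harvest": {"od600": 0.8},
    "media": ypd_media,
}
heat = {
    "description": "Temperature shift from 30\u00b0C to 37\u00b0C for 45 minutes.",
    "temperature_celsius": 37,
    "growth_phase_at_harvest": {"od600": 0.5},
    "media": ypd_media,
}


def differing_keys(a, b):
    """Return sorted top-level keys that differ or exist in only one dict."""
    missing = object()  # sentinel so an absent key never equals a real value
    return sorted(
        key for key in a.keys() | b.keys()
        if a.get(key, missing) != b.get(key, missing)
    )


changed = differing_keys(ypd, heat)
print(changed)
```

Here the media block is identical, so only the description, the harvest OD600, and the temperature show up as differences.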
5. Working with Condition Definitions
Now let's see how to extract specific information from condition definitions.
# Extract growth media names for all conditions
print("Growth Media Across Conditions:")
print("=" * 40)
for cond_name, cond_def in sorted(condition_defs.items()):
    # Navigate the nested structure
    media = cond_def.get('media', {})
    media_name = media.get('name', 'unspecified')
    print(f" {cond_name:10}: {media_name}")
Growth Media Across Conditions:
========================================
 Acid      : YPD
 Alpha     : YPD
 BUT14     : YPD
 BUT90     : YPD
 GAL       : yeast_extract_peptone_galactose
 H2O2Hi    : YPD
 H2O2Lo    : YPD
 HEAT      : YPD
 Pi-       : synthetic_complete
 RAFF      : yeast_extract_peptone_raffinose
 RAPA      : YPD
 SM        : synthetic_complete
 Thi-      : synthetic_complete
 YPD       : YPD
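The same listing can be inverted to show which conditions share a medium, which is often the more useful view when planning comparisons. This is a minimal sketch that hard-codes the condition-to-media mapping printed above instead of calling the library; the variable names are illustrative.

```python
from collections import defaultdict

# Mapping copied from the "Growth Media Across Conditions" output above.
media_by_condition = {
    "Acid": "YPD", "Alpha": "YPD", "BUT14": "YPD", "BUT90": "YPD",
    "GAL": "yeast_extract_peptone_galactose",
    "H2O2Hi": "YPD", "H2O2Lo": "YPD", "HEAT": "YPD",
    "Pi-": "synthetic_complete",
    "RAFF": "yeast_extract_peptone_raffinose",
    "RAPA": "YPD", "SM": "synthetic_complete",
    "Thi-": "synthetic_complete", "YPD": "YPD",
}

# Invert the mapping: media name -> list of condition names (sorted).
conditions_by_media = defaultdict(list)
for cond_name, media_name in sorted(media_by_condition.items()):
    conditions_by_media[media_name].append(cond_name)

for media_name, cond_names in sorted(conditions_by_media.items()):
    print(f"{media_name}: {', '.join(cond_names)}")
```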
condition_defs.get("YPD")
{'description': 'Standard YPD rich medium; the baseline condition for nearly all 204 regulators.',
'growth_phase_at_harvest': {'od600': 0.8},
'media': {'name': 'YPD',
'carbon_source': [{'compound': 'D-glucose', 'concentration_percent': 2}],
'nitrogen_source': [{'compound': 'yeast_extract',
'concentration_percent': 1},
{'compound': 'peptone', 'concentration_percent': 2}]}}
# Extract temperature conditions
# Note: this looks under an 'environmental_conditions' key. The definitions
# printed above (e.g. HEAT) keep temperature_celsius at the top level of the
# definition instead, so the nested lookup reports 'not specified' here.
print("Temperature Across Conditions:")
print("=" * 40)
for cond_name, cond_def in sorted(condition_defs.items()):
    env_conds = cond_def.get('environmental_conditions', {})
    temp = env_conds.get('temperature_celsius', 'not specified')
    # Also check for temperature shifts
    temp_shift = env_conds.get('temperature_shift')
    if temp_shift:
        from_temp = temp_shift.get('from_celsius', '?')
        to_temp = temp_shift.get('to_celsius', '?')
        print(f" {cond_name:10}: {from_temp}°C → {to_temp}°C")
    else:
        print(f" {cond_name:10}: {temp}°C")
Temperature Across Conditions:
========================================
 Acid      : not specified°C
 Alpha     : not specified°C
 BUT14     : not specified°C
 BUT90     : not specified°C
 GAL       : not specified°C
 H2O2Hi    : not specified°C
 H2O2Lo    : not specified°C
 HEAT      : not specified°C
 Pi-       : not specified°C
 RAFF      : not specified°C
 RAPA      : not specified°C
 SM        : not specified°C
 Thi-      : not specified°C
 YPD       : not specified°C
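Every row reads "not specified" because the loop only looks under an environmental_conditions key, whereas the HEAT definition shown earlier stores temperature_celsius directly at the top level of the definition. A more tolerant lookup might check both locations. The sketch below is one possible approach; `find_temperature` is a hypothetical helper, and falling back to the repository-wide value for conditions that define no temperature is an assumption about how the levels should combine.

```python
# Illustrative sketch; find_temperature is a hypothetical helper, and the
# repo_default fallback is an assumption, not documented library behavior.
def find_temperature(cond_def, repo_default=None):
    """Look for temperature_celsius at the top level, then nested, then default."""
    if "temperature_celsius" in cond_def:
        return cond_def["temperature_celsius"]
    nested = cond_def.get("environmental_conditions", {})
    return nested.get("temperature_celsius", repo_default)


# Minimal subsets of the HEAT and YPD definitions shown earlier.
heat_subset = {"temperature_celsius": 37, "media": {"name": "YPD"}}
ypd_subset = {"media": {"name": "YPD"}}

heat_temp = find_temperature(heat_subset, repo_default=30)
ypd_temp = find_temperature(ypd_subset, repo_default=30)
print(heat_temp, ypd_temp)
```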
6. Using extract_metadata_schema for Metadata Table Planning
The extract_metadata_schema method provides all condition information in one call, which is useful for planning metadata table creation.
# Extract complete metadata schema
schema = card.extract_metadata_schema('harbison_2004')
print("Metadata Schema Summary:")
print("=" * 40)
print(f"Regulator fields: {schema['regulator_fields']}")
print(f"Target fields: {schema['target_fields']}")
print(f"Condition fields: {schema['condition_fields']}")
print(f"\nTop-level conditions: {schema['top_level_conditions']}")
print(f"Config-level conditions: {schema['config_level_conditions']}")
print(f"Field definitions available for: {list(schema['condition_definitions'].keys())}")
Metadata Schema Summary:
========================================
Regulator fields: ['regulator_locus_tag', 'regulator_symbol']
Target fields: ['target_locus_tag', 'target_symbol']
Condition fields: ['condition']
Top-level conditions: {'temperature_celsius': 30}
Config-level conditions: None
Field definitions available for: ['condition']
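When turning condition definitions into metadata table columns, the nested structure usually has to be flattened first. The following is a minimal sketch of one way to do that with dotted column names; `flatten` is a hypothetical helper, the naming scheme is an assumption, and the input is the YPD definition copied from earlier in this tutorial.

```python
# Illustrative sketch; flatten and the dotted naming scheme are assumptions,
# not part of labretriever.
def flatten(value, prefix=""):
    """Flatten nested dicts and lists into {dotted.path: leaf_value}."""
    if isinstance(value, dict):
        items = value.items()
    elif isinstance(value, list):
        items = enumerate(value)  # list positions become path segments
    else:
        return {prefix: value}
    flat = {}
    for key, child in items:
        child_prefix = f"{prefix}.{key}" if prefix else str(key)
        flat.update(flatten(child, child_prefix))
    return flat


ypd_def = {
    "description": "Standard YPD rich medium; the baseline condition "
                   "for nearly all 204 regulators.",
    "growth_phase_at_harvest": {"od600": 0.8},
    "media": {
        "name": "YPD",
        "carbon_source": [{"compound": "D-glucose", "concentration_percent": 2}],
        "nitrogen_source": [
            {"compound": "yeast_extract", "concentration_percent": 1},
            {"compound": "peptone", "concentration_percent": 2},
        ],
    },
}

row = flatten(ypd_def)
for column, cell in row.items():
    print(f"{column}: {cell}")
```

Each key of `row` is a candidate column name, so applying the same function to every condition definition gives a rough inventory of the columns a metadata table would need.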
7. DOI and Citation
Dataset cards carry publication metadata in two separate fields:
- doi: a URL or DOI string pointing to the primary publication.
- citation: a full bibliographic citation string.
Both are available at the repository level and can be overridden at the individual
config level. card.info() returns both fields. get_citation() handles the
config-level fallback to repository-level automatically.
# Repository-level doi and citation from info()
repo_info = card.info()
print("DOI: ", repo_info["doi"])
print("Citation:", repo_info["citation"])
DOI: https://doi.org/10.1038/nature02800
Citation: Harbison CT, Gordon DB, Lee TI, Rinaldi NJ, Macisaac KD, et al. 2004. Transcriptional regulatory code of a eukaryotic genome. Nature.
# Dataset-level doi and citation from info(config_name)
# Falls back to repository-level when no config-level value is defined.
dataset_info = card.info("harbison_2004")
print("DOI: ", dataset_info["doi"])
print("Citation:", dataset_info["citation"])
DOI: https://doi.org/10.1038/nature02800
Citation: Harbison CT, Gordon DB, Lee TI, Rinaldi NJ, Macisaac KD, et al. 2004. Transcriptional regulatory code of a eukaryotic genome. Nature.
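The fallback described in this section can be pictured as a two-step lookup: use the config-level value when one is defined, otherwise use the repository-level value. The sketch below only illustrates that idea with plain dictionaries; `resolve_with_fallback` is a hypothetical helper, not how labretriever implements it, and the override DOI is a made-up placeholder.

```python
# Illustrative sketch; resolve_with_fallback is a hypothetical helper.
def resolve_with_fallback(key, config_meta, repo_meta):
    """Prefer the config-level value; fall back to the repository level."""
    value = (config_meta or {}).get(key)
    return value if value is not None else repo_meta.get(key)


repo_meta = {
    "doi": "https://doi.org/10.1038/nature02800",
    "citation": (
        "Harbison CT, Gordon DB, Lee TI, Rinaldi NJ, Macisaac KD, et al. 2004. "
        "Transcriptional regulatory code of a eukaryotic genome. Nature."
    ),
}

# Assumed: harbison_2004 defines no config-level doi, so the repo value is used.
no_override = {}
inherited_doi = resolve_with_fallback("doi", no_override, repo_meta)

# Hypothetical config with its own doi (placeholder value, not a real DOI).
with_override = {"doi": "https://doi.org/10.0000/placeholder"}
overridden_doi = resolve_with_fallback("doi", with_override, repo_meta)

print(inherited_doi)
print(overridden_doi)
```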