labretriever

A Python package for querying and managing genomic and transcriptomic datasets hosted on HuggingFace Hub. It provides a unified SQL interface (via DuckDB) across heterogeneous datasets, with local caching and structured metadata exploration.

See the documentation for full usage guides and API reference. The BrentLab yeast resources collection is an example of datasets designed to work with this package.

Installation

Install the latest release from PyPI:

pip install labretriever

To get the most recent changes ahead of a PyPI release, install directly from the main branch on GitHub:

pip install git+https://github.com/cmatKhan/labretriever.git@main

If you need to access private datasets, set your HuggingFace token in the environment:

export HF_TOKEN=your_token_here
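
To confirm from Python that the token is visible before opening any datasets, a small check like the one below can help. This is only a sketch: `get_hf_token` is a hypothetical helper written for this example and is not part of labretriever. It reads the `HF_TOKEN` environment variable and nothing else.

```python
import os


def get_hf_token(env=None):
    """Return the HF_TOKEN value, or None if it is unset or blank.

    `env` defaults to os.environ; passing a plain dict makes the helper
    easy to exercise without touching the real environment.
    """
    env = os.environ if env is None else env
    token = env.get("HF_TOKEN", "").strip()
    return token or None


if get_hf_token() is None:
    # Hypothetical message; adapt to your own logging conventions.
    print("HF_TOKEN is not set in this environment; private datasets may be inaccessible.")
```

The check only looks at the environment variable, so it will not detect credentials stored by other means.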

Usage

from labretriever import VirtualDB

vdb = VirtualDB("config.yaml")

# Discover available views
vdb.tables()
vdb.describe("harbison")

# Query with SQL
df = vdb.query("SELECT * FROM harbison_meta WHERE carbon_source = $cs", cs="glucose")

VirtualDB downloads datasets from HuggingFace Hub and caches them locally, builds DuckDB views over the Parquet files, and exposes both metadata views and full-data views for SQL querying. See the docs for how to write a config.yaml and how to structure your HuggingFace dataset cards.

Development

git clone https://github.com/cmatKhan/labretriever
cd labretriever
poetry install
poetry run pre-commit install
poetry run pytest