MCP Server Setup
labretriever ships two MCP servers:
- labretriever (labretriever-mcp): exposes a VirtualDB instance as SQL query tools. Requires a VirtualDB config file.
- labretriever-repo (labretriever-mcp-repo): provides DataCard scaffold and collection audit tools. Requires no config file.
Quick Install (Claude Code Plugin)
First, install labretriever so that labretriever-mcp
and labretriever-mcp-repo are available on your PATH.
Then add the marketplace and install the plugin:
When you enable the plugin, it prompts for a VirtualDB config file path and an
optional HuggingFace token. Both MCP servers are registered automatically.
If labretriever-mcp is not found on PATH when a session starts, Claude will
display installation instructions.
For the BrentLab yeast resources collection, download the ready-to-use config from:
https://github.com/BrentLab/tfbpshiny/blob/main/tfbpshiny/brentlab_yeast_collection.yaml
Save it to a stable path and provide that path when the plugin prompts you.
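Before enabling the plugin, it can help to confirm that both executables are discoverable and that the config file exists where you saved it. The sketch below uses only the Python standard library. The preflight helper and the example path are hypothetical and are not part of labretriever.

```python
import shutil
from pathlib import Path

# Executable names taken from this page; the helper itself is illustrative only.
SERVER_COMMANDS = ("labretriever-mcp", "labretriever-mcp-repo")


def preflight(config_path, commands=SERVER_COMMANDS):
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems = []
    for command in commands:
        # shutil.which returns None when the command is not on PATH.
        if shutil.which(command) is None:
            problems.append(f"{command} not found on PATH")
    if not Path(config_path).expanduser().is_file():
        problems.append(f"config file not found: {config_path}")
    return problems


if __name__ == "__main__":
    # Hypothetical location; substitute the path where you saved the config.
    for problem in preflight("~/brentlab_yeast_collection.yaml"):
        print(problem)
```

An empty result means both commands resolve on PATH and the config file is present. It does not validate the contents of the config.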
Manual Configuration (without the plugin)
Install the package first — see Installation.
Add the following to .claude/settings.json (or ~/.claude/settings.json for
user-level):
{
  "mcpServers": {
    "labretriever": {
      "command": "labretriever-mcp",
      "type": "stdio",
      "env": {
        "LABRETRIEVER_CONFIG": "/absolute/path/to/brentlab_yeast_collection.yaml",
        "HF_TOKEN": "${HF_TOKEN}"
      }
    },
    "labretriever-repo": {
      "command": "labretriever-mcp-repo",
      "type": "stdio",
      "env": {
        "HF_TOKEN": "${HF_TOKEN}"
      }
    }
  }
}
HF_TOKEN is required only for private or gated HuggingFace repositories. If it
is not set and a query touches such a repository, the server returns a clear
error naming the repository.
LABRETRIEVER_CONFIG is only required for the labretriever server.
labretriever-repo starts without it.
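If you prefer not to edit settings.json by hand, the same two entries can be merged in with a short script. This is a minimal sketch using only the standard library. build_mcp_servers and merge_settings are hypothetical helper names, and the script assumes the settings layout shown above.

```python
import json
from pathlib import Path


def build_mcp_servers(config_path):
    """Build the two server entries documented above as a plain dict."""
    # "${HF_TOKEN}" is copied verbatim from the example above.
    hf_env = {"HF_TOKEN": "${HF_TOKEN}"}
    return {
        "labretriever": {
            "command": "labretriever-mcp",
            "type": "stdio",
            "env": {"LABRETRIEVER_CONFIG": str(config_path), **hf_env},
        },
        "labretriever-repo": {
            "command": "labretriever-mcp-repo",
            "type": "stdio",
            "env": dict(hf_env),
        },
    }


def merge_settings(settings_path, config_path):
    """Add or overwrite the labretriever entries, keeping all other settings."""
    settings_path = Path(settings_path)
    settings = {}
    if settings_path.exists():
        settings = json.loads(settings_path.read_text(encoding="utf-8"))
    # Only the two labretriever keys are touched; other servers are preserved.
    settings.setdefault("mcpServers", {}).update(build_mcp_servers(config_path))
    settings_path.parent.mkdir(parents=True, exist_ok=True)
    settings_path.write_text(json.dumps(settings, indent=2) + "\n", encoding="utf-8")
    return settings
```

Calling merge_settings(".claude/settings.json", "/absolute/path/to/brentlab_yeast_collection.yaml") would write the project-level file. Pass an absolute path for the config, as in the JSON example.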
Available Tools
labretriever (VirtualDB query tools)
Requires LABRETRIEVER_CONFIG.
| Tool | Description |
|---|---|
| list_datasets | List all registered dataset names (call this first). |
| describe_dataset | Return column names and types for a {name} or {name}_meta view. |
| get_column_metadata | Return semantic roles and condition-level definitions for each column. |
| get_tags | Return provenance tags (assay type, publication, etc.) for a dataset. |
| get_common_fields | Return column names shared across all _meta views. |
| query | Execute DuckDB SQL; returns shape by default, rows when return_data=True. |
| get_config_path | Return the config file path (call before writing Python snippets). |
labretriever-repo (DataCard tools)
No config file required.
| Tool | Description |
|---|---|
| scaffold_readme | Scaffold a minimal HF DataCard README from a repository’s file structure. |
| audit_collection | Audit a collection of repos for DataCard completeness, schema consistency, and consolidation opportunities. |
scaffold_readme
Inspects the files in a HuggingFace dataset repository and produces a skeleton README in valid HuggingFace DataCard YAML format. The skeleton follows only the official DataCard specification — no labretriever-specific extensions are added.
Parameters:
| Name | Type | Description |
|---|---|---|
| repo_id | str | HuggingFace dataset repository ID, e.g. "BrentLab/callingcards". |
Return value:
One of two shapes:
- {"readme": "<yaml_string>"}: a complete skeleton YAML ready to paste into the repository README. All description fields are empty ('').
- {"needs_input": [{"file": ..., "reason": ..., "question": ...}, ...]}: the tool encountered files or column dtypes it cannot resolve without user guidance. Address each question and call the tool again.
Behavior:
- .parquet files: schema read via DuckDB.
- .csv / .tsv files: schema read via pandas.
- Standard non-data files (.md, .py, .R, .sh, .yaml, .json, etc.) are silently skipped.
- Any other extension returns needs_input immediately.
- Columns with unrecognized dtypes return needs_input rather than guessing.
- Columns whose distinct-value count is <= 10 in a sample of >= 100 rows are suggested as class_label; the names list is left empty for the author to fill.
- Hive-partitioned directories are detected automatically; the partition column is appended to the feature list with dtype: string.
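The rules above can be illustrated with a small standalone sketch. This is not labretriever's implementation. The function names, the return labels, and the exact set of skipped extensions (the list above ends in "etc.") are assumptions made for illustration.

```python
from pathlib import PurePosixPath

# Extensions named in the list above; the real skip list may be longer ("etc.").
SKIPPED_EXTENSIONS = {".md", ".py", ".R", ".sh", ".yaml", ".json"}


def route_file(path):
    """Map a repository file path to the handling described in the list above."""
    suffix = PurePosixPath(path).suffix
    if suffix == ".parquet":
        return "duckdb"
    if suffix in {".csv", ".tsv"}:
        return "pandas"
    if suffix in SKIPPED_EXTENSIONS:
        return "skip"
    return "needs_input"


def suggest_class_label(sample, max_distinct=10, min_rows=100):
    """Return True when a sampled column looks categorical under the stated thresholds."""
    if len(sample) < min_rows:
        # Too few rows to judge; do not suggest class_label.
        return False
    return len(set(sample)) <= max_distinct


def hive_partition_columns(path):
    """Return partition column names from key=value directory segments."""
    directories = PurePosixPath(path).parts[:-1]  # drop the file name
    return [part.split("=", 1)[0] for part in directories if "=" in part]
```

For instance, a path such as data/regulator=GAL4/part-0.parquet (a made-up example) would be routed to DuckDB and would contribute a partition column named regulator.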
Example:
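from labretriever.mcp_server._repo_server import _scaffold_readme_impl

result = _scaffold_readme_impl("BrentLab/callingcards", token=None)
if "readme" in result:
    # Complete skeleton: print the YAML for pasting into the README.
    print(result["readme"])
else:
    # Unresolved files or dtypes: answer each question, then call again.
    for item in result["needs_input"]:
        print(item["question"])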
audit_collection
Audits every repo in a local directory or a HuggingFace collection URL against its DataCard specification. Reports completeness issues, schema inconsistencies, and generates ready-to-paste YAML for consolidating repeated field definitions.
Parameters:
| Name | Type | Description |
|---|---|---|
| source | str | Local directory path (e.g. "/home/user/code/hf") or HuggingFace collection URL (e.g. "https://huggingface.co/collections/BrentLab/yeastresources-..."). |
| collection_context | str \| None | Optional path to a collection context document. When supplied, field naming conventions and dataset type expectations from the document are used to add context-aware findings. |
Return value:
{
  "source": "<resolved source>",
  "schema_checks_performed": true,
  "repos": {
    "callingcards": {
      "findings": [
        {
          "severity": "warning",
          "check": "missing_feature_descriptions",
          "config": "annotated_features",
          "field": "experiment_hops",
          "message": "Field 'experiment_hops' in config 'annotated_features' has no description.",
          "doc_link": "https://cmatkhan.github.io/labretriever/huggingface_datacard/#feature-definitions"
        }
      ]
    }
  },
  "collection_suggestions": [
    {
      "type": "consolidate_to_shared_features",
      "scope": "same_repo",
      "repo": "callingcards",
      "field": "target_locus_tag",
      "configs": ["annotated_features", "annotated_features_combined"],
      "message": "...",
      "generated_yaml": "features:\n- applies_to:\n ...",
      "doc_link": "https://cmatkhan.github.io/labretriever/huggingface_datacard/#shared-feature-definitions"
    }
  ]
}
Checks performed:
| Check | Severity | Description |
|---|---|---|
| missing_description | warning | Config description is empty or absent. |
| missing_feature_descriptions | warning | A feature in dataset_info.features has no description. |
| undocumented_column | error | Column in data file not listed in dataset_info.features (local only). |
| phantom_column | warning | Feature in dataset_info.features not found in data file (local only). |
| dtype_mismatch | error | File column dtype is incompatible with declared HF dtype (local only, lenient). |
| missing_role | warning | Feature has no role field. |
| missing_doi | info | No doi at repo level. |
| missing_citation | info | No citation at repo level. |
| class_label_empty_names | warning | class_label dtype has an empty names list. |
| non_reserved_dataset_type | info | Config uses a collection-defined dataset_type (not an error). |
| partition_col_undocumented | warning | Hive partition column not listed in features (local only). |
Example:
from labretriever.mcp_server._repo_server import _audit_collection_impl

result = _audit_collection_impl(
    "/home/user/code/hf",
    collection_context="docs/brentlab_yeastresources_collection.md",
    token=None,
)

# Per-repo findings, prefixed with their severity.
for repo, data in result["repos"].items():
    for finding in data["findings"]:
        print(f"[{finding['severity'].upper()}] {repo}: {finding['message']}")

# Ready-to-paste YAML for consolidation suggestions, where provided.
for s in result["collection_suggestions"]:
    if s.get("generated_yaml"):
        print(s["generated_yaml"])
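For use in a script or CI job, the result can be reduced to a per-severity tally. The sketch below relies only on the return shape documented above. summarize_findings and has_errors are hypothetical helpers, and the sample dict is a trimmed stand-in in the documented shape, not real audit output.

```python
from collections import Counter


def summarize_findings(result):
    """Count findings per severity across all repos in an audit result."""
    counts = Counter()
    for data in result.get("repos", {}).values():
        for finding in data.get("findings", []):
            counts[finding["severity"]] += 1
    return dict(counts)


def has_errors(result):
    """Return True if any finding has severity 'error'."""
    return summarize_findings(result).get("error", 0) > 0


# Trimmed sample in the documented shape; not real audit output.
sample = {
    "repos": {
        "callingcards": {
            "findings": [
                {"severity": "warning", "check": "missing_feature_descriptions"},
                {"severity": "error", "check": "undocumented_column"},
            ]
        }
    },
    "collection_suggestions": [],
}

print(summarize_findings(sample))
print(has_errors(sample))
```

A CI step could exit non-zero when has_errors returns True, while leaving warning and info findings as advisory.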
Example Session
After connecting, a typical VirtualDB workflow in Claude Code looks like:
1. list_datasets - discover available views (harbison, callingcards, etc.)
2. describe_dataset("harbison_meta") - inspect sample-level columns
3. get_column_metadata("harbison") - understand condition values and measurement roles
4. query("SELECT * FROM harbison_meta WHERE condition = 'GAL'", return_data=True) - explore
5. query("SELECT regulator_symbol, COUNT(*) FROM harbison WHERE condition = 'GAL' AND pvalue < 0.001 GROUP BY 1 ORDER BY 2 DESC") - full analysis
See the VirtualDB tutorial for more query patterns.