HfCacheManager Tutorial: Intelligent Cache Management for HuggingFace Datasets
The HfCacheManager class manages the local cache for HuggingFace genomics datasets. It extends DataCard with metadata caching strategies and automated cache cleanup tools.
This tutorial covers:
- Setting up HfCacheManager for cache management
- Understanding the 3-case metadata caching strategy
- Automated cache cleanup by age, size, and revision
- Cache monitoring and diagnostics
- Best practices for efficient cache management
- Integration with data loading workflows
Prerequisites: Basic familiarity with DataCard (see datacard_tutorial.ipynb) and HuggingFace datasets.
1. Setting Up HfCacheManager
HfCacheManager extends DataCard with cache management capabilities. Unlike DataCard, which focuses on dataset exploration, HfCacheManager adds metadata caching and cache cleanup features.
import duckdb
import logging
from labretriever.hf_cache_manager import HfCacheManager
from huggingface_hub import scan_cache_dir
# Set up logging to see cache management activities
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)
# Create DuckDB connection for metadata caching
conn = duckdb.connect(':memory:')
# Initialize HfCacheManager
cache_manager = HfCacheManager(
    repo_id='BrentLab/mahendrawada_2025',
    duckdb_conn=conn,
    logger=logger
)
print(f"HfCacheManager initialized for: {cache_manager.repo_id}")
print(f"DuckDB connection: {'Active' if conn else 'None'}")
print(f"Logger configured: {'Yes' if logger else 'No'}")
# Show current cache status -- NOTE: this is from huggingface_hub,
# not from HfCacheManager
cache_info = scan_cache_dir()
print(f"Current HF cache size: {cache_info.size_on_disk_str}")
print(f"Cached repositories: {len(cache_info.repos)}")
HfCacheManager initialized for: BrentLab/mahendrawada_2025
DuckDB connection: Active
Logger configured: Yes
Current HF cache size: 1.2G
Cached repositories: 9
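The object returned by scan_cache_dir() is easier to work with once it is condensed to one row per repository. The following is a minimal sketch, not part of HfCacheManager: summarize_cache and _fake_repo are hypothetical names introduced here. The helper reads only attributes that huggingface_hub exposes on its cache objects (repos, repo_id, size_on_disk, nb_files, revisions, refs). The stand-in objects reuse the figures reported for two repositories in this tutorial's cache scan, so the example runs without a local cache.

```python
from types import SimpleNamespace


def summarize_cache(cache_info):
    """Return one summary row per cached repo, largest first.

    Hypothetical helper: works on any object shaped like the
    HFCacheInfo returned by huggingface_hub.scan_cache_dir().
    """
    rows = []
    for repo in cache_info.repos:
        # A revision with no refs is not pointed to by any branch or tag.
        detached = sum(1 for rev in repo.revisions if not rev.refs)
        rows.append(
            {
                "repo_id": repo.repo_id,
                "size_mb": round(repo.size_on_disk / 1e6, 1),
                "files": repo.nb_files,
                "revisions": len(repo.revisions),
                "detached": detached,
            }
        )
    return sorted(rows, key=lambda row: row["size_mb"], reverse=True)


def _fake_repo(repo_id, size_on_disk, nb_files, refs_per_revision):
    """Build a stand-in for CachedRepoInfo (illustration only)."""
    revisions = [
        SimpleNamespace(refs=frozenset(refs)) for refs in refs_per_revision
    ]
    return SimpleNamespace(
        repo_id=repo_id,
        size_on_disk=size_on_disk,
        nb_files=nb_files,
        revisions=revisions,
    )


# Stand-in data: sizes, file counts, and refs copied from the cache scan
# shown in this tutorial for two of the nine cached repositories.
demo_cache = SimpleNamespace(
    repos=[
        _fake_repo("BrentLab/callingcards", 7410679, 2, [[], [], ["main"], []]),
        _fake_repo("BrentLab/hackett_2020", 433272307, 2, [["main"]]),
    ]
)

rows = summarize_cache(demo_cache)
for row in rows:
    print(row)
```

With a real cache, the same helper should accept the scan result directly, for example summarize_cache(scan_cache_dir()). Revisions counted as detached are no longer referenced by any branch or tag, which makes them natural candidates for the cleanup steps covered later.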
cache_info
HFCacheInfo(size_on_disk=1238957028, repos=frozenset({CachedRepoInfo(repo_id='BrentLab/rossi_2021', repo_type='dataset', repo_path=PosixPath('/home/chase/.cache/huggingface/hub/datasets--BrentLab--rossi_2021'), size_on_disk=276463198, nb_files=3, revisions=frozenset({CachedRevisionInfo(commit_hash='8cbe9d508a668eecb0491d8babab1cfd431eada7', snapshot_path=PosixPath('/home/chase/.cache/huggingface/hub/datasets--BrentLab--rossi_2021/snapshots/8cbe9d508a668eecb0491d8babab1cfd431eada7'), size_on_disk=276463198, files=frozenset({CachedFileInfo(file_name='rossi_2021_metadata_sample.parquet', file_path=PosixPath('/home/chase/.cache/huggingface/hub/datasets--BrentLab--rossi_2021/snapshots/8cbe9d508a668eecb0491d8babab1cfd431eada7/rossi_2021_metadata_sample.parquet'), blob_path=PosixPath('/home/chase/.cache/huggingface/hub/datasets--BrentLab--rossi_2021/blobs/f5c55948ea44d6784e550b8bb32bd2e2d9a004884b18f6229cba60080a108e88'), size_on_disk=17020, blob_last_accessed=1776356347.484477, blob_last_modified=1773246771.4874742), CachedFileInfo(file_name='README.md', file_path=PosixPath('/home/chase/.cache/huggingface/hub/datasets--BrentLab--rossi_2021/snapshots/8cbe9d508a668eecb0491d8babab1cfd431eada7/README.md'), blob_path=PosixPath('/home/chase/.cache/huggingface/hub/datasets--BrentLab--rossi_2021/blobs/3ade3f43d1295ad75f994f3ed256165a5a3f7246'), size_on_disk=14943, blob_last_accessed=1776356343.4075005, blob_last_modified=1773246711.8048496), CachedFileInfo(file_name='rossi_2021_af_combined.parquet', file_path=PosixPath('/home/chase/.cache/huggingface/hub/datasets--BrentLab--rossi_2021/snapshots/8cbe9d508a668eecb0491d8babab1cfd431eada7/rossi_2021_af_combined.parquet'), blob_path=PosixPath('/home/chase/.cache/huggingface/hub/datasets--BrentLab--rossi_2021/blobs/7bf68b04e72863c5cacc4f3da774e572fa71e646fc29bda8dcab04e4fc545c56'), size_on_disk=276431235, blob_last_accessed=1776356347.4254775, blob_last_modified=1773246723.6477757)}), refs=frozenset({'main'}), 
last_modified=1773246771.4874742)}), last_accessed=1776356347.484477, last_modified=1773246771.4874742), CachedRepoInfo(repo_id='BrentLab/mahendrawada_2025', repo_type='dataset', repo_path=PosixPath('/home/chase/.cache/huggingface/hub/datasets--BrentLab--mahendrawada_2025'), size_on_disk=115821358, nb_files=6, revisions=frozenset({CachedRevisionInfo(commit_hash='feff7544889aee9c5dce47a4ffc282053d292817', snapshot_path=PosixPath('/home/chase/.cache/huggingface/hub/datasets--BrentLab--mahendrawada_2025/snapshots/feff7544889aee9c5dce47a4ffc282053d292817'), size_on_disk=115710469, files=frozenset({CachedFileInfo(file_name='chec_mahendrawada_m2025_af_combined.parquet', file_path=PosixPath('/home/chase/.cache/huggingface/hub/datasets--BrentLab--mahendrawada_2025/snapshots/feff7544889aee9c5dce47a4ffc282053d292817/chec_mahendrawada_m2025_af_combined.parquet'), blob_path=PosixPath('/home/chase/.cache/huggingface/hub/datasets--BrentLab--mahendrawada_2025/blobs/2593b0cd31b7c822b281ca76e469ef269574fd6369ad4aa025c767d7cdeb5327'), size_on_disk=74047926, blob_last_accessed=1776356347.4714773, blob_last_modified=1773246732.7167187), CachedFileInfo(file_name='README.md', file_path=PosixPath('/home/chase/.cache/huggingface/hub/datasets--BrentLab--mahendrawada_2025/snapshots/feff7544889aee9c5dce47a4ffc282053d292817/README.md'), blob_path=PosixPath('/home/chase/.cache/huggingface/hub/datasets--BrentLab--mahendrawada_2025/blobs/b60a55fd0f936b1dc717564c7403d02358acbfc3'), size_on_disk=33900, blob_last_accessed=1776091088.629696, blob_last_modified=1773246728.4097457), CachedFileInfo(file_name='rnaseq_reprocessed.parquet', file_path=PosixPath('/home/chase/.cache/huggingface/hub/datasets--BrentLab--mahendrawada_2025/snapshots/feff7544889aee9c5dce47a4ffc282053d292817/rnaseq_reprocessed.parquet'), blob_path=PosixPath('/home/chase/.cache/huggingface/hub/datasets--BrentLab--mahendrawada_2025/blobs/2233d75235f0309d8c03ef92e73ecc1c4e3fdb9cc622bd5b7b910cbf12766e69'), size_on_disk=41622007, 
blob_last_accessed=1776356347.4794772, blob_last_modified=1773246735.1977031), CachedFileInfo(file_name='chec_mahendrawada_m2025_af_combined_meta.parquet', file_path=PosixPath('/home/chase/.cache/huggingface/hub/datasets--BrentLab--mahendrawada_2025/snapshots/feff7544889aee9c5dce47a4ffc282053d292817/chec_mahendrawada_m2025_af_combined_meta.parquet'), blob_path=PosixPath('/home/chase/.cache/huggingface/hub/datasets--BrentLab--mahendrawada_2025/blobs/5d7a31f2711294a4abbce70f290a2998fab7325efba7da0e52d2669610bf4e15'), size_on_disk=6636, blob_last_accessed=1776356347.4854772, blob_last_modified=1773246772.653467)}), refs=frozenset(), last_modified=1773246772.653467), CachedRevisionInfo(commit_hash='31213daa96bea3f0a3406bc814e9e027836c4986', snapshot_path=PosixPath('/home/chase/.cache/huggingface/hub/datasets--BrentLab--mahendrawada_2025/snapshots/31213daa96bea3f0a3406bc814e9e027836c4986'), size_on_disk=115739195, files=frozenset({CachedFileInfo(file_name='chec_mahendrawada_m2025_af_combined_meta.parquet', file_path=PosixPath('/home/chase/.cache/huggingface/hub/datasets--BrentLab--mahendrawada_2025/snapshots/31213daa96bea3f0a3406bc814e9e027836c4986/chec_mahendrawada_m2025_af_combined_meta.parquet'), blob_path=PosixPath('/home/chase/.cache/huggingface/hub/datasets--BrentLab--mahendrawada_2025/blobs/5d7a31f2711294a4abbce70f290a2998fab7325efba7da0e52d2669610bf4e15'), size_on_disk=6636, blob_last_accessed=1776356347.4854772, blob_last_modified=1773246772.653467), CachedFileInfo(file_name='rnaseq_reprocessed.parquet', file_path=PosixPath('/home/chase/.cache/huggingface/hub/datasets--BrentLab--mahendrawada_2025/snapshots/31213daa96bea3f0a3406bc814e9e027836c4986/rnaseq_reprocessed.parquet'), blob_path=PosixPath('/home/chase/.cache/huggingface/hub/datasets--BrentLab--mahendrawada_2025/blobs/2233d75235f0309d8c03ef92e73ecc1c4e3fdb9cc622bd5b7b910cbf12766e69'), size_on_disk=41622007, blob_last_accessed=1776356347.4794772, blob_last_modified=1773246735.1977031), 
CachedFileInfo(file_name='README.md', file_path=PosixPath('/home/chase/.cache/huggingface/hub/datasets--BrentLab--mahendrawada_2025/snapshots/31213daa96bea3f0a3406bc814e9e027836c4986/README.md'), blob_path=PosixPath('/home/chase/.cache/huggingface/hub/datasets--BrentLab--mahendrawada_2025/blobs/27ec9c92d4fa6ae33a1ff810b80f2ded3c759d23'), size_on_disk=62626, blob_last_accessed=1776361245.0960524, blob_last_modified=1776272190.374475), CachedFileInfo(file_name='chec_mahendrawada_m2025_af_combined.parquet', file_path=PosixPath('/home/chase/.cache/huggingface/hub/datasets--BrentLab--mahendrawada_2025/snapshots/31213daa96bea3f0a3406bc814e9e027836c4986/chec_mahendrawada_m2025_af_combined.parquet'), blob_path=PosixPath('/home/chase/.cache/huggingface/hub/datasets--BrentLab--mahendrawada_2025/blobs/2593b0cd31b7c822b281ca76e469ef269574fd6369ad4aa025c767d7cdeb5327'), size_on_disk=74047926, blob_last_accessed=1776356347.4714773, blob_last_modified=1773246732.7167187)}), refs=frozenset({'main'}), last_modified=1776272190.374475), CachedRevisionInfo(commit_hash='13bb6037fdc878f3fee0b62d513257b684976649', snapshot_path=PosixPath('/home/chase/.cache/huggingface/hub/datasets--BrentLab--mahendrawada_2025/snapshots/13bb6037fdc878f3fee0b62d513257b684976649'), size_on_disk=115724832, files=frozenset({CachedFileInfo(file_name='chec_mahendrawada_m2025_af_combined.parquet', file_path=PosixPath('/home/chase/.cache/huggingface/hub/datasets--BrentLab--mahendrawada_2025/snapshots/13bb6037fdc878f3fee0b62d513257b684976649/chec_mahendrawada_m2025_af_combined.parquet'), blob_path=PosixPath('/home/chase/.cache/huggingface/hub/datasets--BrentLab--mahendrawada_2025/blobs/2593b0cd31b7c822b281ca76e469ef269574fd6369ad4aa025c767d7cdeb5327'), size_on_disk=74047926, blob_last_accessed=1776356347.4714773, blob_last_modified=1773246732.7167187), CachedFileInfo(file_name='README.md', 
file_path=PosixPath('/home/chase/.cache/huggingface/hub/datasets--BrentLab--mahendrawada_2025/snapshots/13bb6037fdc878f3fee0b62d513257b684976649/README.md'), blob_path=PosixPath('/home/chase/.cache/huggingface/hub/datasets--BrentLab--mahendrawada_2025/blobs/864e4da4b69ce2e7acba609b8a72e9ba3eddcbf2'), size_on_disk=48263, blob_last_accessed=1776267508.5302858, blob_last_modified=1776179563.7077723), CachedFileInfo(file_name='rnaseq_reprocessed.parquet', file_path=PosixPath('/home/chase/.cache/huggingface/hub/datasets--BrentLab--mahendrawada_2025/snapshots/13bb6037fdc878f3fee0b62d513257b684976649/rnaseq_reprocessed.parquet'), blob_path=PosixPath('/home/chase/.cache/huggingface/hub/datasets--BrentLab--mahendrawada_2025/blobs/2233d75235f0309d8c03ef92e73ecc1c4e3fdb9cc622bd5b7b910cbf12766e69'), size_on_disk=41622007, blob_last_accessed=1776356347.4794772, blob_last_modified=1773246735.1977031), CachedFileInfo(file_name='chec_mahendrawada_m2025_af_combined_meta.parquet', file_path=PosixPath('/home/chase/.cache/huggingface/hub/datasets--BrentLab--mahendrawada_2025/snapshots/13bb6037fdc878f3fee0b62d513257b684976649/chec_mahendrawada_m2025_af_combined_meta.parquet'), blob_path=PosixPath('/home/chase/.cache/huggingface/hub/datasets--BrentLab--mahendrawada_2025/blobs/5d7a31f2711294a4abbce70f290a2998fab7325efba7da0e52d2669610bf4e15'), size_on_disk=6636, blob_last_accessed=1776356347.4854772, blob_last_modified=1773246772.653467)}), refs=frozenset(), last_modified=1776179563.7077723)}), last_accessed=1776361245.0960524, last_modified=1776272190.374475), CachedRepoInfo(repo_id='BrentLab/kemmeren_2014', repo_type='dataset', repo_path=PosixPath('/home/chase/.cache/huggingface/hub/datasets--BrentLab--kemmeren_2014'), size_on_disk=301362532, nb_files=2, revisions=frozenset({CachedRevisionInfo(commit_hash='4585d9c7759e0a8f146cfaf2db0220d05ff3a1d0', 
snapshot_path=PosixPath('/home/chase/.cache/huggingface/hub/datasets--BrentLab--kemmeren_2014/snapshots/4585d9c7759e0a8f146cfaf2db0220d05ff3a1d0'), size_on_disk=301362532, files=frozenset({CachedFileInfo(file_name='kemmeren_2014.parquet', file_path=PosixPath('/home/chase/.cache/huggingface/hub/datasets--BrentLab--kemmeren_2014/snapshots/4585d9c7759e0a8f146cfaf2db0220d05ff3a1d0/kemmeren_2014.parquet'), blob_path=PosixPath('/home/chase/.cache/huggingface/hub/datasets--BrentLab--kemmeren_2014/blobs/e4003f0d49e17671ccf29f39e36b0e95091aba7988b3d3f6b371f2ecd1a4930f'), size_on_disk=301348699, blob_last_accessed=1776356347.481477, blob_last_modified=1773246750.6746056), CachedFileInfo(file_name='README.md', file_path=PosixPath('/home/chase/.cache/huggingface/hub/datasets--BrentLab--kemmeren_2014/snapshots/4585d9c7759e0a8f146cfaf2db0220d05ff3a1d0/README.md'), blob_path=PosixPath('/home/chase/.cache/huggingface/hub/datasets--BrentLab--kemmeren_2014/blobs/c39f8eb9051d5af2701ad0a83333f3fa66c09300'), size_on_disk=13833, blob_last_accessed=1776356343.8754978, blob_last_modified=1773246738.236684)}), refs=frozenset({'main'}), last_modified=1773246750.6746056)}), last_accessed=1776356347.481477, last_modified=1773246750.6746056), CachedRepoInfo(repo_id='BrentLab/hu_2007_reimand_2010', repo_type='dataset', repo_path=PosixPath('/home/chase/.cache/huggingface/hub/datasets--BrentLab--hu_2007_reimand_2010'), size_on_disk=43268342, nb_files=2, revisions=frozenset({CachedRevisionInfo(commit_hash='497f4d168197bfd84ad89a37dffc86403a87d6be', snapshot_path=PosixPath('/home/chase/.cache/huggingface/hub/datasets--BrentLab--hu_2007_reimand_2010/snapshots/497f4d168197bfd84ad89a37dffc86403a87d6be'), size_on_disk=43268342, files=frozenset({CachedFileInfo(file_name='README.md', file_path=PosixPath('/home/chase/.cache/huggingface/hub/datasets--BrentLab--hu_2007_reimand_2010/snapshots/497f4d168197bfd84ad89a37dffc86403a87d6be/README.md'), 
blob_path=PosixPath('/home/chase/.cache/huggingface/hub/datasets--BrentLab--hu_2007_reimand_2010/blobs/03285af34f0daec0b79005b4cebef3dfc2d57fd2'), size_on_disk=9509, blob_last_accessed=1776356343.5075, blob_last_modified=1773246723.9947734), CachedFileInfo(file_name='hu_2007_reimand_2010.parquet', file_path=PosixPath('/home/chase/.cache/huggingface/hub/datasets--BrentLab--hu_2007_reimand_2010/snapshots/497f4d168197bfd84ad89a37dffc86403a87d6be/hu_2007_reimand_2010.parquet'), blob_path=PosixPath('/home/chase/.cache/huggingface/hub/datasets--BrentLab--hu_2007_reimand_2010/blobs/d0b9f7e3f5ce689056e30bc9dee34694f4c3de724b81c4650743d69e4ef77d44'), size_on_disk=43258833, blob_last_accessed=1776356347.4634771, blob_last_modified=1773246727.8327494)}), refs=frozenset({'main'}), last_modified=1773246727.8327494)}), last_accessed=1776356347.4634771, last_modified=1773246727.8327494), CachedRepoInfo(repo_id='BrentLab/callingcards', repo_type='dataset', repo_path=PosixPath('/home/chase/.cache/huggingface/hub/datasets--BrentLab--callingcards'), size_on_disk=7410679, nb_files=2, revisions=frozenset({CachedRevisionInfo(commit_hash='15f15def43e9663f212482dfd50d903560940aa4', snapshot_path=PosixPath('/home/chase/.cache/huggingface/hub/datasets--BrentLab--callingcards/snapshots/15f15def43e9663f212482dfd50d903560940aa4'), size_on_disk=7410679, files=frozenset({CachedFileInfo(file_name='2026_analysis_set.parquet', file_path=PosixPath('/home/chase/.cache/huggingface/hub/datasets--BrentLab--callingcards/snapshots/15f15def43e9663f212482dfd50d903560940aa4/2026_analysis_set.parquet'), blob_path=PosixPath('/home/chase/.cache/huggingface/hub/datasets--BrentLab--callingcards/blobs/23d1243d81acd871b2be47b869c02282aa95bf8ac73b6dceb21b82a815acb108'), size_on_disk=7387958, blob_last_accessed=1776356347.4024775, blob_last_modified=1773246708.7888684), CachedFileInfo(file_name='README.md', 
file_path=PosixPath('/home/chase/.cache/huggingface/hub/datasets--BrentLab--callingcards/snapshots/15f15def43e9663f212482dfd50d903560940aa4/README.md'), blob_path=PosixPath('/home/chase/.cache/huggingface/hub/datasets--BrentLab--callingcards/blobs/7419a885229bb28acf5e387954bf5a978e6b08d7'), size_on_disk=22721, blob_last_accessed=1776356342.9005034, blob_last_modified=1773246707.6178758)}), refs=frozenset(), last_modified=1773246708.7888684), CachedRevisionInfo(commit_hash='860378b3e37a8c14acad2217d05666dc80c3167e', snapshot_path=PosixPath('/home/chase/.cache/huggingface/hub/datasets--BrentLab--callingcards/snapshots/860378b3e37a8c14acad2217d05666dc80c3167e'), size_on_disk=7410679, files=frozenset({CachedFileInfo(file_name='README.md', file_path=PosixPath('/home/chase/.cache/huggingface/hub/datasets--BrentLab--callingcards/snapshots/860378b3e37a8c14acad2217d05666dc80c3167e/README.md'), blob_path=PosixPath('/home/chase/.cache/huggingface/hub/datasets--BrentLab--callingcards/blobs/7419a885229bb28acf5e387954bf5a978e6b08d7'), size_on_disk=22721, blob_last_accessed=1776356342.9005034, blob_last_modified=1773246707.6178758), CachedFileInfo(file_name='2026_analysis_set.parquet', file_path=PosixPath('/home/chase/.cache/huggingface/hub/datasets--BrentLab--callingcards/snapshots/860378b3e37a8c14acad2217d05666dc80c3167e/2026_analysis_set.parquet'), blob_path=PosixPath('/home/chase/.cache/huggingface/hub/datasets--BrentLab--callingcards/blobs/23d1243d81acd871b2be47b869c02282aa95bf8ac73b6dceb21b82a815acb108'), size_on_disk=7387958, blob_last_accessed=1776356347.4024775, blob_last_modified=1773246708.7888684)}), refs=frozenset(), last_modified=1773246708.7888684), CachedRevisionInfo(commit_hash='7355bbf8d0ceea7592083ba5fc500f04a13c02e1', snapshot_path=PosixPath('/home/chase/.cache/huggingface/hub/datasets--BrentLab--callingcards/snapshots/7355bbf8d0ceea7592083ba5fc500f04a13c02e1'), size_on_disk=7410679, files=frozenset({CachedFileInfo(file_name='2026_analysis_set.parquet', 
file_path=PosixPath('/home/chase/.cache/huggingface/hub/datasets--BrentLab--callingcards/snapshots/7355bbf8d0ceea7592083ba5fc500f04a13c02e1/2026_analysis_set.parquet'), blob_path=PosixPath('/home/chase/.cache/huggingface/hub/datasets--BrentLab--callingcards/blobs/23d1243d81acd871b2be47b869c02282aa95bf8ac73b6dceb21b82a815acb108'), size_on_disk=7387958, blob_last_accessed=1776356347.4024775, blob_last_modified=1773246708.7888684), CachedFileInfo(file_name='README.md', file_path=PosixPath('/home/chase/.cache/huggingface/hub/datasets--BrentLab--callingcards/snapshots/7355bbf8d0ceea7592083ba5fc500f04a13c02e1/README.md'), blob_path=PosixPath('/home/chase/.cache/huggingface/hub/datasets--BrentLab--callingcards/blobs/7419a885229bb28acf5e387954bf5a978e6b08d7'), size_on_disk=22721, blob_last_accessed=1776356342.9005034, blob_last_modified=1773246707.6178758)}), refs=frozenset({'main'}), last_modified=1773246708.7888684), CachedRevisionInfo(commit_hash='849a2cfc72aaa403456fa7c5caa479795a897f73', snapshot_path=PosixPath('/home/chase/.cache/huggingface/hub/datasets--BrentLab--callingcards/snapshots/849a2cfc72aaa403456fa7c5caa479795a897f73'), size_on_disk=7410679, files=frozenset({CachedFileInfo(file_name='README.md', file_path=PosixPath('/home/chase/.cache/huggingface/hub/datasets--BrentLab--callingcards/snapshots/849a2cfc72aaa403456fa7c5caa479795a897f73/README.md'), blob_path=PosixPath('/home/chase/.cache/huggingface/hub/datasets--BrentLab--callingcards/blobs/7419a885229bb28acf5e387954bf5a978e6b08d7'), size_on_disk=22721, blob_last_accessed=1776356342.9005034, blob_last_modified=1773246707.6178758), CachedFileInfo(file_name='2026_analysis_set.parquet', file_path=PosixPath('/home/chase/.cache/huggingface/hub/datasets--BrentLab--callingcards/snapshots/849a2cfc72aaa403456fa7c5caa479795a897f73/2026_analysis_set.parquet'), 
blob_path=PosixPath('/home/chase/.cache/huggingface/hub/datasets--BrentLab--callingcards/blobs/23d1243d81acd871b2be47b869c02282aa95bf8ac73b6dceb21b82a815acb108'), size_on_disk=7387958, blob_last_accessed=1776356347.4024775, blob_last_modified=1773246708.7888684)}), refs=frozenset(), last_modified=1773246708.7888684)}), last_accessed=1776356347.4024775, last_modified=1773246708.7888684), CachedRepoInfo(repo_id='BrentLab/harbison_2004', repo_type='dataset', repo_path=PosixPath('/home/chase/.cache/huggingface/hub/datasets--BrentLab--harbison_2004'), size_on_disk=44910484, nb_files=3, revisions=frozenset({CachedRevisionInfo(commit_hash='a33c34b373e379dfa9bd4922d281790180bb1217', snapshot_path=PosixPath('/home/chase/.cache/huggingface/hub/datasets--BrentLab--harbison_2004/snapshots/a33c34b373e379dfa9bd4922d281790180bb1217'), size_on_disk=44899466, files=frozenset({CachedFileInfo(file_name='README.md', file_path=PosixPath('/home/chase/.cache/huggingface/hub/datasets--BrentLab--harbison_2004/snapshots/a33c34b373e379dfa9bd4922d281790180bb1217/README.md'), blob_path=PosixPath('/home/chase/.cache/huggingface/hub/datasets--BrentLab--harbison_2004/blobs/3cb3997d01aa1c91d0719f954e1cf207976c8a7d'), size_on_disk=13071, blob_last_accessed=1776267508.174287, blob_last_modified=1773246709.162866), CachedFileInfo(file_name='harbison_2004.parquet', file_path=PosixPath('/home/chase/.cache/huggingface/hub/datasets--BrentLab--harbison_2004/snapshots/a33c34b373e379dfa9bd4922d281790180bb1217/harbison_2004.parquet'), blob_path=PosixPath('/home/chase/.cache/huggingface/hub/datasets--BrentLab--harbison_2004/blobs/b89f865a471fbf3a8054871f8fe79507f6f4be5c2291dcd19030ec8fd4a5325c'), size_on_disk=44886395, blob_last_accessed=1776356347.4074776, blob_last_modified=1773246711.5658512)}), refs=frozenset(), last_modified=1773246711.5658512), CachedRevisionInfo(commit_hash='70a9dd6c061b2c3f99a24c9098aa6ebe3429bb7a', 
snapshot_path=PosixPath('/home/chase/.cache/huggingface/hub/datasets--BrentLab--harbison_2004/snapshots/70a9dd6c061b2c3f99a24c9098aa6ebe3429bb7a'), size_on_disk=44897413, files=frozenset({CachedFileInfo(file_name='README.md', file_path=PosixPath('/home/chase/.cache/huggingface/hub/datasets--BrentLab--harbison_2004/snapshots/70a9dd6c061b2c3f99a24c9098aa6ebe3429bb7a/README.md'), blob_path=PosixPath('/home/chase/.cache/huggingface/hub/datasets--BrentLab--harbison_2004/blobs/40354219fed00fee1eabb00dcf99aedb5d8d9e34'), size_on_disk=11018, blob_last_accessed=1776356343.3055012, blob_last_modified=1776356343.3025012), CachedFileInfo(file_name='harbison_2004.parquet', file_path=PosixPath('/home/chase/.cache/huggingface/hub/datasets--BrentLab--harbison_2004/snapshots/70a9dd6c061b2c3f99a24c9098aa6ebe3429bb7a/harbison_2004.parquet'), blob_path=PosixPath('/home/chase/.cache/huggingface/hub/datasets--BrentLab--harbison_2004/blobs/b89f865a471fbf3a8054871f8fe79507f6f4be5c2291dcd19030ec8fd4a5325c'), size_on_disk=44886395, blob_last_accessed=1776356347.4074776, blob_last_modified=1773246711.5658512)}), refs=frozenset({'main'}), last_modified=1776356343.3025012)}), last_accessed=1776356347.4074776, last_modified=1776356343.3025012), CachedRepoInfo(repo_id='BrentLab/hackett_2020', repo_type='dataset', repo_path=PosixPath('/home/chase/.cache/huggingface/hub/datasets--BrentLab--hackett_2020'), size_on_disk=433272307, nb_files=2, revisions=frozenset({CachedRevisionInfo(commit_hash='5f0db7c0a9e0baa0426b03353bb9cee2c5bb3f6a', snapshot_path=PosixPath('/home/chase/.cache/huggingface/hub/datasets--BrentLab--hackett_2020/snapshots/5f0db7c0a9e0baa0426b03353bb9cee2c5bb3f6a'), size_on_disk=433272307, files=frozenset({CachedFileInfo(file_name='hackett_2020.parquet', file_path=PosixPath('/home/chase/.cache/huggingface/hub/datasets--BrentLab--hackett_2020/snapshots/5f0db7c0a9e0baa0426b03353bb9cee2c5bb3f6a/hackett_2020.parquet'), 
blob_path=PosixPath('/home/chase/.cache/huggingface/hub/datasets--BrentLab--hackett_2020/blobs/2d1496f61358e24b7333b056e11742a265c2855f1cb68433878f4f89b18508fe'), size_on_disk=433257326, blob_last_accessed=1776356321.00763, blob_last_modified=1773246768.024496), CachedFileInfo(file_name='README.md', file_path=PosixPath('/home/chase/.cache/huggingface/hub/datasets--BrentLab--hackett_2020/snapshots/5f0db7c0a9e0baa0426b03353bb9cee2c5bb3f6a/README.md'), blob_path=PosixPath('/home/chase/.cache/huggingface/hub/datasets--BrentLab--hackett_2020/blobs/0c224f3bde79f463364bf28f35eb69f3f675195f'), size_on_disk=14981, blob_last_accessed=1776356343.9674973, blob_last_modified=1773246750.8806045)}), refs=frozenset({'main'}), last_modified=1773246768.024496)}), last_accessed=1776356343.9674973, last_modified=1773246768.024496), CachedRepoInfo(repo_id='BrentLab/hughes_2006', repo_type='dataset', repo_path=PosixPath('/home/chase/.cache/huggingface/hub/datasets--BrentLab--hughes_2006'), size_on_disk=15896148, nb_files=4, revisions=frozenset({CachedRevisionInfo(commit_hash='8cf3dc9d97c6c634b894c52bc6c8c99be37607cd', snapshot_path=PosixPath('/home/chase/.cache/huggingface/hub/datasets--BrentLab--hughes_2006/snapshots/8cf3dc9d97c6c634b894c52bc6c8c99be37607cd'), size_on_disk=15896148, files=frozenset({CachedFileInfo(file_name='metadata.parquet', file_path=PosixPath('/home/chase/.cache/huggingface/hub/datasets--BrentLab--hughes_2006/snapshots/8cf3dc9d97c6c634b894c52bc6c8c99be37607cd/metadata.parquet'), blob_path=PosixPath('/home/chase/.cache/huggingface/hub/datasets--BrentLab--hughes_2006/blobs/54173aa9fff98e5b077e2b24e139ad600a6d7cea74a47b0305cb0fbdf1c14d9b'), size_on_disk=8362, blob_last_accessed=1776356347.4854772, blob_last_modified=1773246773.3364625), CachedFileInfo(file_name='overexpression.parquet', file_path=PosixPath('/home/chase/.cache/huggingface/hub/datasets--BrentLab--hughes_2006/snapshots/8cf3dc9d97c6c634b894c52bc6c8c99be37607cd/overexpression.parquet'), 
blob_path=PosixPath('/home/chase/.cache/huggingface/hub/datasets--BrentLab--hughes_2006/blobs/ddfbe2f1c1d4fb77d75fcd6085f8a3f28881fea821e5b36288c7ea3f291d1c91'), size_on_disk=8403772, blob_last_accessed=1776356347.4794772, blob_last_modified=1773246736.5646944), CachedFileInfo(file_name='README.md', file_path=PosixPath('/home/chase/.cache/huggingface/hub/datasets--BrentLab--hughes_2006/snapshots/8cf3dc9d97c6c634b894c52bc6c8c99be37607cd/README.md'), blob_path=PosixPath('/home/chase/.cache/huggingface/hub/datasets--BrentLab--hughes_2006/blobs/06742f8853f036ebe26e517a7b7cb36541946a31'), size_on_disk=10696, blob_last_accessed=1776356343.7624986, blob_last_modified=1773246735.3627021), CachedFileInfo(file_name='knockout.parquet', file_path=PosixPath('/home/chase/.cache/huggingface/hub/datasets--BrentLab--hughes_2006/snapshots/8cf3dc9d97c6c634b894c52bc6c8c99be37607cd/knockout.parquet'), blob_path=PosixPath('/home/chase/.cache/huggingface/hub/datasets--BrentLab--hughes_2006/blobs/35b874d4fa0ef9dcb8b4447a616ca308e0505412bd9750513031d5397b6c54a0'), size_on_disk=7473318, blob_last_accessed=1776356347.480477, blob_last_modified=1773246737.911686)}), refs=frozenset({'main'}), last_modified=1773246773.3364625)}), last_accessed=1776356347.4854772, last_modified=1773246773.3364625), CachedRepoInfo(repo_id='BrentLab/yeast_comparative_analysis', repo_type='dataset', repo_path=PosixPath('/home/chase/.cache/huggingface/hub/datasets--BrentLab--yeast_comparative_analysis'), size_on_disk=551980, nb_files=37, revisions=frozenset({CachedRevisionInfo(commit_hash='e83926cb6208bfd6c230b82bf372aae14b3abf93', snapshot_path=PosixPath('/home/chase/.cache/huggingface/hub/datasets--BrentLab--yeast_comparative_analysis/snapshots/e83926cb6208bfd6c230b82bf372aae14b3abf93'), size_on_disk=551980, files=frozenset({CachedFileInfo(file_name='README.md', 
file_path=PosixPath('/home/chase/.cache/huggingface/hub/datasets--BrentLab--yeast_comparative_analysis/snapshots/e83926cb6208bfd6c230b82bf372aae14b3abf93/README.md'), blob_path=PosixPath('/home/chase/.cache/huggingface/hub/datasets--BrentLab--yeast_comparative_analysis/blobs/715d05146c1e6959618ca54c3024650df41189ec'), size_on_disk=3981, blob_last_accessed=1776356344.0614967, blob_last_modified=1773246768.4444935), CachedFileInfo(file_name='part-0.parquet', file_path=PosixPath('/home/chase/.cache/huggingface/hub/datasets--BrentLab--yeast_comparative_analysis/snapshots/e83926cb6208bfd6c230b82bf372aae14b3abf93/dto/binding_repo_dataset=rossi_2021-rossi_2021_af_combined/perturbation_repo_dataset=hughes_2006-overexpression/part-0.parquet'), blob_path=PosixPath('/home/chase/.cache/huggingface/hub/datasets--BrentLab--yeast_comparative_analysis/blobs/fe950be88f6a4f8d0ced92ca3ee261c1ddc970bc1feeb8fb2b3f874bf99a8308'), size_on_disk=5445, blob_last_accessed=1776196987.382157, blob_last_modified=1773246770.5024805), CachedFileInfo(file_name='part-0.parquet', file_path=PosixPath('/home/chase/.cache/huggingface/hub/datasets--BrentLab--yeast_comparative_analysis/snapshots/e83926cb6208bfd6c230b82bf372aae14b3abf93/dto/binding_repo_dataset=callingcards-2026_analysis_set/perturbation_repo_dataset=kemmeren_2014-kemmeren_2014/part-0.parquet'), blob_path=PosixPath('/home/chase/.cache/huggingface/hub/datasets--BrentLab--yeast_comparative_analysis/blobs/660c86535183987b8b8599eaff2f3b2907444299f84df5d767b189e67d21370b'), size_on_disk=12014, blob_last_accessed=1776196987.383157, blob_last_modified=1773246769.145489), CachedFileInfo(file_name='part-0.parquet', file_path=PosixPath('/home/chase/.cache/huggingface/hub/datasets--BrentLab--yeast_comparative_analysis/snapshots/e83926cb6208bfd6c230b82bf372aae14b3abf93/dto/binding_repo_dataset=mahendrawada_2025-chec_mahendrawada_m2025_af_combined/perturbation_repo_dataset=kemmeren_2014-kemmeren_2014/part-0.parquet'), 
blob_path=PosixPath('/home/chase/.cache/huggingface/hub/datasets--BrentLab--yeast_comparative_analysis/blobs/cb97579fd26ab25bed0342c6991fcda5ee6d2d79993393b59114b9180271d406'), size_on_disk=10177, blob_last_accessed=1776196987.382157, blob_last_modified=1773246770.3654814), CachedFileInfo(file_name='part-0.parquet', file_path=PosixPath('/home/chase/.cache/huggingface/hub/datasets--BrentLab--yeast_comparative_analysis/snapshots/e83926cb6208bfd6c230b82bf372aae14b3abf93/dto/binding_repo_dataset=harbison_2004-harbison_2004/perturbation_repo_dataset=hughes_2006-knockout/part-0.parquet'), blob_path=PosixPath('/home/chase/.cache/huggingface/hub/datasets--BrentLab--yeast_comparative_analysis/blobs/f402bd8ee2f3fa4e79b061955d45a06a16f9d9b77706cfcf4791d649731cbf9e'), size_on_disk=6687, blob_last_accessed=1776196987.382157, blob_last_modified=1773246770.0644832), CachedFileInfo(file_name='part-0.parquet', file_path=PosixPath('/home/chase/.cache/huggingface/hub/datasets--BrentLab--yeast_comparative_analysis/snapshots/e83926cb6208bfd6c230b82bf372aae14b3abf93/dto/binding_repo_dataset=callingcards-2026_analysis_set/perturbation_repo_dataset=hughes_2006-knockout/part-0.parquet'), blob_path=PosixPath('/home/chase/.cache/huggingface/hub/datasets--BrentLab--yeast_comparative_analysis/blobs/8ef2a8df412abbd3ca7fadaaae6d58e487f88c0ea1443c6768a0bca0fb3e9b56'), size_on_disk=5766, blob_last_accessed=1776196987.383157, blob_last_modified=1773246768.9844902), CachedFileInfo(file_name='part-0.parquet', file_path=PosixPath('/home/chase/.cache/huggingface/hub/datasets--BrentLab--yeast_comparative_analysis/snapshots/e83926cb6208bfd6c230b82bf372aae14b3abf93/dto/binding_repo_dataset=mahendrawada_2025-chec_mahendrawada_m2025_af_combined/perturbation_repo_dataset=mahendrawada_2025-rnaseq_reprocessed/part-0.parquet'), blob_path=PosixPath('/home/chase/.cache/huggingface/hub/datasets--BrentLab--yeast_comparative_analysis/blobs/3b74f14d706dac6c9201c87b76d6e3fadfb61ef7c22f2147a9288970172c55c2'), 
size_on_disk=9306, blob_last_accessed=1776196987.382157, blob_last_modified=1773246770.4844806), CachedFileInfo(file_name='part-0.parquet', file_path=PosixPath('/home/chase/.cache/huggingface/hub/datasets--BrentLab--yeast_comparative_analysis/snapshots/e83926cb6208bfd6c230b82bf372aae14b3abf93/dto/binding_repo_dataset=callingcards-2026_analysis_set/perturbation_repo_dataset=hu_2007_reimand_2010-hu_2007_reimand_2010/part-0.parquet'), blob_path=PosixPath('/home/chase/.cache/huggingface/hub/datasets--BrentLab--yeast_comparative_analysis/blobs/5eb768ad11a375f27740d1477f9a0cbcebc3ce5f2fa0ebf4f7f4a525a983af3f'), size_on_disk=11438, blob_last_accessed=1776196987.383157, blob_last_modified=1773246769.2154887), CachedFileInfo(file_name='part-0.parquet', file_path=PosixPath('/home/chase/.cache/huggingface/hub/datasets--BrentLab--yeast_comparative_analysis/snapshots/e83926cb6208bfd6c230b82bf372aae14b3abf93/dto/binding_repo_dataset=callingcards-annotated_features/perturbation_repo_dataset=hughes_2006-overexpression/part-0.parquet'), blob_path=PosixPath('/home/chase/.cache/huggingface/hub/datasets--BrentLab--yeast_comparative_analysis/blobs/91fd61b26f487db796eed5d3a0bcc46f8366f7b7536618c5fa54bf1b35cb4d7d'), size_on_disk=8996, blob_last_accessed=1776196987.382157, blob_last_modified=1773246769.3984876), CachedFileInfo(file_name='part-0.parquet', file_path=PosixPath('/home/chase/.cache/huggingface/hub/datasets--BrentLab--yeast_comparative_analysis/snapshots/e83926cb6208bfd6c230b82bf372aae14b3abf93/dto/binding_repo_dataset=rossi_2021-rossi_2021_af_combined/perturbation_repo_dataset=hu_2007_reimand_2010-hu_2007_reimand_2010/part-0.parquet'), blob_path=PosixPath('/home/chase/.cache/huggingface/hub/datasets--BrentLab--yeast_comparative_analysis/blobs/c82e1d86414d5559fe1ac2629a64046f51e434de77bbb0437df2eec0692ab1ea'), size_on_disk=12706, blob_last_accessed=1776196987.382157, blob_last_modified=1773246770.429481), CachedFileInfo(file_name='part-0.parquet', 
file_path=PosixPath('/home/chase/.cache/huggingface/hub/datasets--BrentLab--yeast_comparative_analysis/snapshots/e83926cb6208bfd6c230b82bf372aae14b3abf93/dto/binding_repo_dataset=harbison_2004-harbison_2004/perturbation_repo_dataset=hughes_2006-overexpression/part-0.parquet'), blob_path=PosixPath('/home/chase/.cache/huggingface/hub/datasets--BrentLab--yeast_comparative_analysis/blobs/00cd725d15af2398fb20fefb6d6ba0b0e75b264139dcdd592d3e4981b13c4c36'), size_on_disk=6797, blob_last_accessed=1776356347.484477, blob_last_modified=1773246770.101483), CachedFileInfo(file_name='part-0.parquet', file_path=PosixPath('/home/chase/.cache/huggingface/hub/datasets--BrentLab--yeast_comparative_analysis/snapshots/e83926cb6208bfd6c230b82bf372aae14b3abf93/dto/binding_repo_dataset=mahendrawada_2025-chec_mahendrawada_m2025_af_combined/perturbation_repo_dataset=hughes_2006-overexpression/part-0.parquet'), blob_path=PosixPath('/home/chase/.cache/huggingface/hub/datasets--BrentLab--yeast_comparative_analysis/blobs/5e72ba29908b8f03317e749f90610a1f58e4339900e6305d7f9ebb6ca8eb3c1d'), size_on_disk=5981, blob_last_accessed=1776196987.382157, blob_last_modified=1773246770.3704813), CachedFileInfo(file_name='part-0.parquet', file_path=PosixPath('/home/chase/.cache/huggingface/hub/datasets--BrentLab--yeast_comparative_analysis/snapshots/e83926cb6208bfd6c230b82bf372aae14b3abf93/dto/binding_repo_dataset=rossi_2021-rossi_2021_af_combined/perturbation_repo_dataset=hackett_2020-hackett_2020/part-0.parquet'), blob_path=PosixPath('/home/chase/.cache/huggingface/hub/datasets--BrentLab--yeast_comparative_analysis/blobs/7a12bd4496d20416c1697efb9ba0dd7a0008ede03924d465b7380c6054717d68'), size_on_disk=30244, blob_last_accessed=1776196987.382157, blob_last_modified=1773246770.437481), CachedFileInfo(file_name='part-0.parquet', 
file_path=PosixPath('/home/chase/.cache/huggingface/hub/datasets--BrentLab--yeast_comparative_analysis/snapshots/e83926cb6208bfd6c230b82bf372aae14b3abf93/dto/binding_repo_dataset=callingcards-annotated_features/perturbation_repo_dataset=hu_2007_reimand_2010-hu_2007_reimand_2010/part-0.parquet'), blob_path=PosixPath('/home/chase/.cache/huggingface/hub/datasets--BrentLab--yeast_comparative_analysis/blobs/cd6f715e0dc9bfd3af3da7ebc06d4a0103cd68f27b34467496bfe613b74ea67e'), size_on_disk=26416, blob_last_accessed=1776196987.382157, blob_last_modified=1773246769.303488), CachedFileInfo(file_name='part-0.parquet', file_path=PosixPath('/home/chase/.cache/huggingface/hub/datasets--BrentLab--yeast_comparative_analysis/snapshots/e83926cb6208bfd6c230b82bf372aae14b3abf93/dto/binding_repo_dataset=harbison_2004-harbison_2004/perturbation_repo_dataset=kemmeren_2014-kemmeren_2014/part-0.parquet'), blob_path=PosixPath('/home/chase/.cache/huggingface/hub/datasets--BrentLab--yeast_comparative_analysis/blobs/af6979f8295019071e8603f860c33271341a68270370e404206f9e9230903d95'), size_on_disk=16022, blob_last_accessed=1776196987.382157, blob_last_modified=1773246770.108483), CachedFileInfo(file_name='part-0.parquet', file_path=PosixPath('/home/chase/.cache/huggingface/hub/datasets--BrentLab--yeast_comparative_analysis/snapshots/e83926cb6208bfd6c230b82bf372aae14b3abf93/dto/binding_repo_dataset=callingcards-annotated_features/perturbation_repo_dataset=kemmeren_2014-kemmeren_2014/part-0.parquet'), blob_path=PosixPath('/home/chase/.cache/huggingface/hub/datasets--BrentLab--yeast_comparative_analysis/blobs/c236a75849014375d8e473447c4692a06f58980867021bf4e896892f83562904'), size_on_disk=26994, blob_last_accessed=1776196987.382157, blob_last_modified=1773246769.616486), CachedFileInfo(file_name='part-0.parquet', 
file_path=PosixPath('/home/chase/.cache/huggingface/hub/datasets--BrentLab--yeast_comparative_analysis/snapshots/e83926cb6208bfd6c230b82bf372aae14b3abf93/dto/binding_repo_dataset=mahendrawada_2025-chec_mahendrawada_m2025_af_combined/perturbation_repo_dataset=hu_2007_reimand_2010-hu_2007_reimand_2010/part-0.parquet'), blob_path=PosixPath('/home/chase/.cache/huggingface/hub/datasets--BrentLab--yeast_comparative_analysis/blobs/7e8c390d2d7dbb0f58ba1743a0f11774ddd7761f602e18e2c612a91d7df99a06'), size_on_disk=10021, blob_last_accessed=1776196987.382157, blob_last_modified=1773246770.2294822), CachedFileInfo(file_name='part-0.parquet', file_path=PosixPath('/home/chase/.cache/huggingface/hub/datasets--BrentLab--yeast_comparative_analysis/snapshots/e83926cb6208bfd6c230b82bf372aae14b3abf93/dto/binding_repo_dataset=callingcards-annotated_features/perturbation_repo_dataset=mahendrawada_2025-rnaseq_reprocessed/part-0.parquet'), blob_path=PosixPath('/home/chase/.cache/huggingface/hub/datasets--BrentLab--yeast_comparative_analysis/blobs/9e1c92ac9b3e3f827ecfd063de60ecb76a0c4515e21ea28dfb40b2298eb6d6d9'), size_on_disk=19723, blob_last_accessed=1776196987.382157, blob_last_modified=1773246769.7294853), CachedFileInfo(file_name='part-0.parquet', file_path=PosixPath('/home/chase/.cache/huggingface/hub/datasets--BrentLab--yeast_comparative_analysis/snapshots/e83926cb6208bfd6c230b82bf372aae14b3abf93/dto/binding_repo_dataset=callingcards-annotated_features/perturbation_repo_dataset=hackett_2020-hackett_2020/part-0.parquet'), blob_path=PosixPath('/home/chase/.cache/huggingface/hub/datasets--BrentLab--yeast_comparative_analysis/blobs/f69ad109b917a6dddf07261619964c760cbbf55bebc733efd7b5e98ba77876d7'), size_on_disk=84943, blob_last_accessed=1776196987.382157, blob_last_modified=1773246769.316488), CachedFileInfo(file_name='part-0.parquet', 
file_path=PosixPath('/home/chase/.cache/huggingface/hub/datasets--BrentLab--yeast_comparative_analysis/snapshots/e83926cb6208bfd6c230b82bf372aae14b3abf93/dto/binding_repo_dataset=rossi_2021-rossi_2021_af_combined/perturbation_repo_dataset=hughes_2006-knockout/part-0.parquet'), blob_path=PosixPath('/home/chase/.cache/huggingface/hub/datasets--BrentLab--yeast_comparative_analysis/blobs/5232d971fcdc5c9318a2bd1de601d28412ecaec1e81384d316ff66a3d8d6f30c'), size_on_disk=5177, blob_last_accessed=1776196987.382157, blob_last_modified=1773246770.5284803), CachedFileInfo(file_name='part-0.parquet', file_path=PosixPath('/home/chase/.cache/huggingface/hub/datasets--BrentLab--yeast_comparative_analysis/snapshots/e83926cb6208bfd6c230b82bf372aae14b3abf93/dto/binding_repo_dataset=harbison_2004-harbison_2004/perturbation_repo_dataset=mahendrawada_2025-rnaseq_reprocessed/part-0.parquet'), blob_path=PosixPath('/home/chase/.cache/huggingface/hub/datasets--BrentLab--yeast_comparative_analysis/blobs/945b4e1ed52baaf89b1629e44b90c8791db2cba2bd35208a16e2ad7f51bfd47e'), size_on_disk=12434, blob_last_accessed=1776196987.382157, blob_last_modified=1773246770.102483), CachedFileInfo(file_name='part-0.parquet', file_path=PosixPath('/home/chase/.cache/huggingface/hub/datasets--BrentLab--yeast_comparative_analysis/snapshots/e83926cb6208bfd6c230b82bf372aae14b3abf93/dto/binding_repo_dataset=mahendrawada_2025-chec_mahendrawada_m2025_af_combined/perturbation_repo_dataset=hughes_2006-knockout/part-0.parquet'), blob_path=PosixPath('/home/chase/.cache/huggingface/hub/datasets--BrentLab--yeast_comparative_analysis/blobs/ae6c44441444d136dbaecd29669fec6c5c292fcf315bc27d19b2f295048ef984'), size_on_disk=5885, blob_last_accessed=1776196987.382157, blob_last_modified=1773246770.3374815), CachedFileInfo(file_name='part-0.parquet', 
file_path=PosixPath('/home/chase/.cache/huggingface/hub/datasets--BrentLab--yeast_comparative_analysis/snapshots/e83926cb6208bfd6c230b82bf372aae14b3abf93/dto/binding_repo_dataset=callingcards-2026_analysis_set/perturbation_repo_dataset=mahendrawada_2025-rnaseq_reprocessed/part-0.parquet'), blob_path=PosixPath('/home/chase/.cache/huggingface/hub/datasets--BrentLab--yeast_comparative_analysis/blobs/13f9230bd654bd40e8829c97a5cbefb546452379aab967a46fe3e2254eaf6891'), size_on_disk=9737, blob_last_accessed=1776196987.383157, blob_last_modified=1773246769.142489), CachedFileInfo(file_name='part-0.parquet', file_path=PosixPath('/home/chase/.cache/huggingface/hub/datasets--BrentLab--yeast_comparative_analysis/snapshots/e83926cb6208bfd6c230b82bf372aae14b3abf93/dto/binding_repo_dataset=callingcards-annotated_features_combined/perturbation_repo_dataset=mahendrawada_2025-rnaseq_reprocessed/part-0.parquet'), blob_path=PosixPath('/home/chase/.cache/huggingface/hub/datasets--BrentLab--yeast_comparative_analysis/blobs/38df481250bc5c3359b5ed9ad8df56bc01ea4049a88f6a807312fee0d872c5f4'), size_on_disk=6572, blob_last_accessed=1776196987.382157, blob_last_modified=1773246769.796485), CachedFileInfo(file_name='part-0.parquet', file_path=PosixPath('/home/chase/.cache/huggingface/hub/datasets--BrentLab--yeast_comparative_analysis/snapshots/e83926cb6208bfd6c230b82bf372aae14b3abf93/dto/binding_repo_dataset=callingcards-annotated_features_combined/perturbation_repo_dataset=hu_2007_reimand_2010-hu_2007_reimand_2010/part-0.parquet'), blob_path=PosixPath('/home/chase/.cache/huggingface/hub/datasets--BrentLab--yeast_comparative_analysis/blobs/b6f60901256e249bab7c9c2620347302af09e501626a5773e5c1b5c168f9684f'), size_on_disk=7297, blob_last_accessed=1776196987.382157, blob_last_modified=1773246769.7414854), CachedFileInfo(file_name='part-0.parquet', 
file_path=PosixPath('/home/chase/.cache/huggingface/hub/datasets--BrentLab--yeast_comparative_analysis/snapshots/e83926cb6208bfd6c230b82bf372aae14b3abf93/dto/binding_repo_dataset=callingcards-annotated_features_combined/perturbation_repo_dataset=hughes_2006-knockout/part-0.parquet'), blob_path=PosixPath('/home/chase/.cache/huggingface/hub/datasets--BrentLab--yeast_comparative_analysis/blobs/594255c0193073dc185fb35a71438c0d2392d6bce5546046cf9f5dc973872081'), size_on_disk=4685, blob_last_accessed=1776196987.382157, blob_last_modified=1773246769.7774851), CachedFileInfo(file_name='part-0.parquet', file_path=PosixPath('/home/chase/.cache/huggingface/hub/datasets--BrentLab--yeast_comparative_analysis/snapshots/e83926cb6208bfd6c230b82bf372aae14b3abf93/dto/binding_repo_dataset=harbison_2004-harbison_2004/perturbation_repo_dataset=hu_2007_reimand_2010-hu_2007_reimand_2010/part-0.parquet'), blob_path=PosixPath('/home/chase/.cache/huggingface/hub/datasets--BrentLab--yeast_comparative_analysis/blobs/f7e168c8232768d8935d4c4173297a6f4376fc5e8389d5bbb75e522092c72b34'), size_on_disk=16882, blob_last_accessed=1776196987.382157, blob_last_modified=1773246770.0734832), CachedFileInfo(file_name='part-0.parquet', file_path=PosixPath('/home/chase/.cache/huggingface/hub/datasets--BrentLab--yeast_comparative_analysis/snapshots/e83926cb6208bfd6c230b82bf372aae14b3abf93/dto/binding_repo_dataset=callingcards-2026_analysis_set/perturbation_repo_dataset=hackett_2020-hackett_2020/part-0.parquet'), blob_path=PosixPath('/home/chase/.cache/huggingface/hub/datasets--BrentLab--yeast_comparative_analysis/blobs/36419922ad4b1bd61c46beba31cd9f805440f5ebde090fcd4a10c56a087c5a52'), size_on_disk=27758, blob_last_accessed=1776196987.383157, blob_last_modified=1773246768.9364903), CachedFileInfo(file_name='part-0.parquet', 
file_path=PosixPath('/home/chase/.cache/huggingface/hub/datasets--BrentLab--yeast_comparative_analysis/snapshots/e83926cb6208bfd6c230b82bf372aae14b3abf93/dto/binding_repo_dataset=rossi_2021-rossi_2021_af_combined/perturbation_repo_dataset=mahendrawada_2025-rnaseq_reprocessed/part-0.parquet'), blob_path=PosixPath('/home/chase/.cache/huggingface/hub/datasets--BrentLab--yeast_comparative_analysis/blobs/1d0b78dc5a31adfebb502be14b9c1ac61c7cd78b0e5e8a2a8d109140943a6f86'), size_on_disk=8373, blob_last_accessed=1776196987.382157, blob_last_modified=1773246770.8404784), CachedFileInfo(file_name='part-0.parquet', file_path=PosixPath('/home/chase/.cache/huggingface/hub/datasets--BrentLab--yeast_comparative_analysis/snapshots/e83926cb6208bfd6c230b82bf372aae14b3abf93/dto/binding_repo_dataset=rossi_2021-rossi_2021_af_combined/perturbation_repo_dataset=kemmeren_2014-kemmeren_2014/part-0.parquet'), blob_path=PosixPath('/home/chase/.cache/huggingface/hub/datasets--BrentLab--yeast_comparative_analysis/blobs/a43c76e6c9915ba4e92abb57a2fcb15a56b58062e395f70b7e3fbcdc164091a8'), size_on_disk=23419, blob_last_accessed=1776196987.382157, blob_last_modified=1773246770.6794794), CachedFileInfo(file_name='part-0.parquet', file_path=PosixPath('/home/chase/.cache/huggingface/hub/datasets--BrentLab--yeast_comparative_analysis/snapshots/e83926cb6208bfd6c230b82bf372aae14b3abf93/dto/binding_repo_dataset=callingcards-annotated_features_combined/perturbation_repo_dataset=hughes_2006-overexpression/part-0.parquet'), blob_path=PosixPath('/home/chase/.cache/huggingface/hub/datasets--BrentLab--yeast_comparative_analysis/blobs/f0857925cc90e8be5020e60bd4f131ea303f98862216d286e3fdad51e3b1d4a0'), size_on_disk=4772, blob_last_accessed=1776196987.382157, blob_last_modified=1773246769.6114862), CachedFileInfo(file_name='part-0.parquet', 
file_path=PosixPath('/home/chase/.cache/huggingface/hub/datasets--BrentLab--yeast_comparative_analysis/snapshots/e83926cb6208bfd6c230b82bf372aae14b3abf93/dto/binding_repo_dataset=mahendrawada_2025-chec_mahendrawada_m2025_af_combined/perturbation_repo_dataset=hackett_2020-hackett_2020/part-0.parquet'), blob_path=PosixPath('/home/chase/.cache/huggingface/hub/datasets--BrentLab--yeast_comparative_analysis/blobs/808ae079ab71f57554f9ef90896182bfaf21f228dd71b9eb104a555ea598797b'), size_on_disk=27334, blob_last_accessed=1776196987.382157, blob_last_modified=1773246770.1694825), CachedFileInfo(file_name='part-0.parquet', file_path=PosixPath('/home/chase/.cache/huggingface/hub/datasets--BrentLab--yeast_comparative_analysis/snapshots/e83926cb6208bfd6c230b82bf372aae14b3abf93/dto/binding_repo_dataset=callingcards-annotated_features/perturbation_repo_dataset=hughes_2006-knockout/part-0.parquet'), blob_path=PosixPath('/home/chase/.cache/huggingface/hub/datasets--BrentLab--yeast_comparative_analysis/blobs/c9a0e24c5c92268fd48f976ba956202a24ed8c1383c15127396edaac156c6148'), size_on_disk=8686, blob_last_accessed=1776196987.382157, blob_last_modified=1773246769.3664877), CachedFileInfo(file_name='part-0.parquet', file_path=PosixPath('/home/chase/.cache/huggingface/hub/datasets--BrentLab--yeast_comparative_analysis/snapshots/e83926cb6208bfd6c230b82bf372aae14b3abf93/dto/binding_repo_dataset=callingcards-annotated_features_combined/perturbation_repo_dataset=kemmeren_2014-kemmeren_2014/part-0.parquet'), blob_path=PosixPath('/home/chase/.cache/huggingface/hub/datasets--BrentLab--yeast_comparative_analysis/blobs/50a28ba9cef82e34b14abcdde26e2e45cf025628727fe88b63da16e4c4c22f41'), size_on_disk=7397, blob_last_accessed=1776196987.383157, blob_last_modified=1773246769.7304854), CachedFileInfo(file_name='part-0.parquet', 
file_path=PosixPath('/home/chase/.cache/huggingface/hub/datasets--BrentLab--yeast_comparative_analysis/snapshots/e83926cb6208bfd6c230b82bf372aae14b3abf93/dto/binding_repo_dataset=callingcards-annotated_features_combined/perturbation_repo_dataset=hackett_2020-hackett_2020/part-0.parquet'), blob_path=PosixPath('/home/chase/.cache/huggingface/hub/datasets--BrentLab--yeast_comparative_analysis/blobs/9e04eb7f702af75823f80138c40d7d1b6ff9687125cdc844daf73a675ea4a713'), size_on_disk=15328, blob_last_accessed=1776196987.383157, blob_last_modified=1773246769.8264847), CachedFileInfo(file_name='part-0.parquet', file_path=PosixPath('/home/chase/.cache/huggingface/hub/datasets--BrentLab--yeast_comparative_analysis/snapshots/e83926cb6208bfd6c230b82bf372aae14b3abf93/dto/binding_repo_dataset=callingcards-2026_analysis_set/perturbation_repo_dataset=hughes_2006-overexpression/part-0.parquet'), blob_path=PosixPath('/home/chase/.cache/huggingface/hub/datasets--BrentLab--yeast_comparative_analysis/blobs/b30341eaf1b91cc487e7b58b5ac45fb8315f593b9165e5cd6e6a743a87a5b45d'), size_on_disk=5881, blob_last_accessed=1776196987.384157, blob_last_modified=1773246769.2134886), CachedFileInfo(file_name='part-0.parquet', file_path=PosixPath('/home/chase/.cache/huggingface/hub/datasets--BrentLab--yeast_comparative_analysis/snapshots/e83926cb6208bfd6c230b82bf372aae14b3abf93/dto/binding_repo_dataset=harbison_2004-harbison_2004/perturbation_repo_dataset=hackett_2020-hackett_2020/part-0.parquet'), blob_path=PosixPath('/home/chase/.cache/huggingface/hub/datasets--BrentLab--yeast_comparative_analysis/blobs/786f9e6c63a40ebd49fccc4fc66d09e59c38431d27640083517b9d5e6d6ca7fd'), size_on_disk=40706, blob_last_accessed=1776196987.382157, blob_last_modified=1773246769.9294841)}), refs=frozenset({'main'}), last_modified=1773246770.8404784)}), last_accessed=1776356347.484477, last_modified=1773246770.8404784)}), warnings=[])
2. Understanding the 3-Case Metadata Caching Strategy¶
HfCacheManager resolves metadata through a 3-case fallback strategy that avoids unnecessary downloads. It tries each source in order and stops at the first one that succeeds:
- Case 1 (DuckDB check): Check whether the metadata already exists in the DuckDB database
- Case 2 (cache load): If it is not in DuckDB, try to load it from the local HuggingFace cache
- Case 3 (download): If it is not cached, download it from the HuggingFace Hub
This strategy is implemented in the internal _get_metadata_for_config() method and is used automatically when loading data with HfQueryAPI.
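The fallback order can be sketched with plain Python stand-ins. This is a minimal illustration, not the library's _get_metadata_for_config() implementation: the function resolve_metadata, the three dictionaries, and the config name sample_metadata are hypothetical placeholders for DuckDB, the local HuggingFace cache, and the Hub.

```python
from typing import Dict, Tuple


def resolve_metadata(
    config: str,
    duckdb_tables: Dict[str, list],
    local_cache: Dict[str, list],
    hub: Dict[str, list],
) -> Tuple[str, list]:
    """Return (strategy_used, rows), trying the three cases in order."""
    # Case 1: already registered in the (stand-in) DuckDB catalog
    if config in duckdb_tables:
        return "duckdb", duckdb_tables[config]
    # Case 2: present in the (stand-in) local cache; register it for next time
    if config in local_cache:
        duckdb_tables[config] = local_cache[config]
        return "cache", duckdb_tables[config]
    # Case 3: fetch from the (stand-in) Hub, then populate both layers
    if config in hub:
        local_cache[config] = hub[config]
        duckdb_tables[config] = hub[config]
        return "download", duckdb_tables[config]
    raise KeyError(f"No metadata found for config {config!r}")


# Hypothetical data: nothing is loaded or cached yet, the Hub has one config
tables, cache, hub = {}, {}, {"sample_metadata": [("s1", "control")]}
first = resolve_metadata("sample_metadata", tables, cache, hub)
second = resolve_metadata("sample_metadata", tables, cache, hub)
print(first[0], second[0])  # download duckdb
```

The second call is served by Case 1 because the first call populated the faster layers, which is the property that keeps repeated access cheap.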
Demonstrating the 3-Case Strategy¶
Let's see how the caching strategy works by examining what metadata tables would be created and checking the DuckDB state.
# Check current DuckDB state (Case 1 check)
tables = conn.execute(
"SELECT table_name FROM information_schema.tables WHERE table_name LIKE 'metadata_%'"
).fetchall()
print("DuckDB Metadata Tables (Case 1):")
print("=" * 35)
if tables:
    for table in tables:
        count = conn.execute(f"SELECT COUNT(*) FROM {table[0]}").fetchone()[0]
        print(f" • {table[0]}: {count} rows")
else:
    print(" No metadata tables found in DuckDB")
    print(" → Would proceed to Case 2 (check HF cache) or Case 3 (download)")
print(f"\nThe 3-case strategy ensures:")
print("• Fast access: DuckDB queries are nearly instantaneous")
print("• Minimal downloads: Reuse locally cached files when possible")
print("• Automatic fallback: Download only when necessary")
print("• Transparent operation: Works automatically with HfQueryAPI")
DuckDB Metadata Tables (Case 1):
===================================
 No metadata tables found in DuckDB
 → Would proceed to Case 2 (check HF cache) or Case 3 (download)

The 3-case strategy ensures:
• Fast access: DuckDB queries are nearly instantaneous
• Minimal downloads: Reuse locally cached files when possible
• Automatic fallback: Download only when necessary
• Transparent operation: Works automatically with HfQueryAPI
Checking HuggingFace Cache Status (Case 2)¶
The second case checks if files are already cached locally by HuggingFace. Let's examine the cache state for our target repository.
# Check if target repository is in HuggingFace cache
cache_info = scan_cache_dir()
target_repo = None
for repo in cache_info.repos:
    if repo.repo_id == cache_manager.repo_id:
        target_repo = repo
        break
print("HuggingFace Cache Status (Case 2):")
print("=" * 40)
if target_repo:
    print(f"✓ Repository {cache_manager.repo_id} found in cache")
    print(f" Size: {target_repo.size_on_disk_str}")
    print(f" Revisions: {len(target_repo.revisions)}")
    print(f" Files: {target_repo.nb_files}")
    # Show latest revision info
    if target_repo.revisions:
        latest_rev = max(target_repo.revisions, key=lambda r: r.last_modified)
        print(f" Latest revision: {latest_rev.commit_hash[:8]}")
        # The value printed is last_modified, so label it accordingly
        print(f" Last modified: {latest_rev.last_modified}")
    print("\n → Case 2 would succeed: Load from local cache")
else:
    print(f"✗ Repository {cache_manager.repo_id} not found in cache")
    print(" → Would proceed to Case 3: Download from HuggingFace Hub")
print(f"\nCache efficiency: Using local files avoids re-downloading {target_repo.size_on_disk_str if target_repo else 'unknown size'}")
HuggingFace Cache Status (Case 2):
========================================
✓ Repository BrentLab/mahendrawada_2025 found in cache
 Size: 115.8M
 Revisions: 3
 Files: 6
 Latest revision: 31213daa
 Last modified: 1776272190.374475

 → Case 2 would succeed: Load from local cache

Cache efficiency: Using local files avoids re-downloading 115.8M
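The revision timestamp printed by the cell above is a POSIX epoch value in seconds, which is hard to read at a glance. A small helper such as the following can render it as a date. The name format_epoch is a hypothetical helper for this tutorial, not part of HfCacheManager or huggingface_hub.

```python
from datetime import datetime, timezone


def format_epoch(ts: float) -> str:
    """Render a POSIX timestamp (seconds) as a UTC date-time string."""
    return datetime.fromtimestamp(ts, tz=timezone.utc).strftime(
        "%Y-%m-%d %H:%M:%S UTC"
    )


# The timestamp shown in the cell output above
print(format_epoch(1776272190.374475))
```

Using an explicit UTC timezone keeps the rendered value independent of the machine's local settings.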
3. Cache Management and Cleanup¶
HfCacheManager's primary value is in providing sophisticated cache management. Let's explore the different cleanup strategies available.
Cache Overview and Current Status¶
Before cleaning, let's understand what we're working with.
# Get comprehensive cache overview
cache_info = scan_cache_dir()
print("Current HuggingFace Cache Overview:")
print("=" * 40)
print(f"Total cache size: {cache_info.size_on_disk_str}")
print(f"Number of repositories: {len(cache_info.repos)}")
# Analyze cache by repository size
repo_sizes = []
for repo in cache_info.repos:
    repo_sizes.append((repo.repo_id, repo.size_on_disk, repo.size_on_disk_str, len(repo.revisions)))
# Sort by size (largest first)
repo_sizes.sort(key=lambda x: x[1], reverse=True)
print(f"\nLargest repositories (top 5):")
for repo_id, size_bytes, size_str, revisions in repo_sizes[:5]:
    print(f" • {repo_id}: {size_str} ({revisions} revisions)")
if len(repo_sizes) > 5:
    print(f" ... and {len(repo_sizes) - 5} more repositories")
# Calculate total revisions
total_revisions = sum(len(repo.revisions) for repo in cache_info.repos)
print(f"\nTotal revisions across all repos: {total_revisions}")
# Show age distribution
from datetime import datetime
now = datetime.now().timestamp()
old_revisions = 0
for repo in cache_info.repos:
    for rev in repo.revisions:
        age_days = (now - rev.last_modified) / (24 * 3600)
        if age_days > 30:
            old_revisions += 1
print(f"Revisions older than 30 days: {old_revisions}")
print(f"Recent revisions (≤30 days): {total_revisions - old_revisions}")
Current HuggingFace Cache Overview:
========================================
Total cache size: 1.2G
Number of repositories: 9

Largest repositories (top 5):
 • BrentLab/hackett_2020: 433.3M (1 revisions)
 • BrentLab/kemmeren_2014: 301.4M (1 revisions)
 • BrentLab/rossi_2021: 276.5M (1 revisions)
 • BrentLab/mahendrawada_2025: 115.8M (3 revisions)
 • BrentLab/harbison_2004: 44.9M (2 revisions)
 ... and 4 more repositories

Total revisions across all repos: 15
Revisions older than 30 days: 12
Recent revisions (≤30 days): 3
Querying Loaded Metadata¶
Once metadata is loaded into DuckDB, we can query it using SQL.
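As a minimal sketch of such a query, the snippet below counts samples per treatment in a small metadata table. To keep it runnable without extra dependencies it uses Python's built-in sqlite3 as a stand-in for the DuckDB connection. The table name metadata_demo and its columns are hypothetical. The SELECT uses only standard SQL, so it is expected to work unchanged when passed to conn.execute(...) against a real metadata table.

```python
import sqlite3

# sqlite3 stands in for the DuckDB connection so the snippet runs anywhere
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE metadata_demo (sample_id TEXT, treatment TEXT)")
demo.executemany(
    "INSERT INTO metadata_demo VALUES (?, ?)",
    [("s1", "control"), ("s2", "treatment_A"), ("s3", "control")],
)

# Count samples per treatment, most frequent first
rows = demo.execute(
    """
    SELECT treatment, COUNT(*) AS n
    FROM metadata_demo
    GROUP BY treatment
    ORDER BY n DESC, treatment
    """
).fetchall()

for treatment, n in rows:
    print(f"{treatment}: {n}")
```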
Internal Cache Management Methods¶
HfCacheManager provides several internal methods that work behind the scenes. Let's explore what these methods do and how they integrate with the caching strategy.
4. Working with Specific Metadata Configurations¶
You can also retrieve metadata for specific configurations rather than all at once.
# Demonstrate understanding of internal cache methods
print("HfCacheManager Internal Methods:")
print("=" * 35)
print("\n1. _get_metadata_for_config(config)")
print(" → Implements the 3-case strategy for a specific configuration")
print(" → Returns detailed result with strategy used and success status")
print("\n2. _check_metadata_exists_in_duckdb(table_name)")
print(" → Case 1: Checks if metadata table already exists in DuckDB")
print(" → Fast check using information_schema.tables")
print("\n3. _load_metadata_from_cache(config, table_name)")
print(" → Case 2: Attempts to load from local HuggingFace cache")
print(" → Uses try_to_load_from_cache() to find cached files")
print("\n4. _download_and_load_metadata(config, table_name)")
print(" → Case 3: Downloads from HuggingFace Hub if not cached")
print(" → Uses snapshot_download() for efficient file retrieval")
print("\n5. _create_duckdb_table_from_files(file_paths, table_name)")
print(" → Creates DuckDB views from parquet files")
print(" → Handles both single files and multiple files efficiently")
print("\n6. _extract_embedded_metadata_field(data_table, field, metadata_table)")
print(" → Extracts metadata fields from data tables")
print(" → Creates separate queryable metadata views")
print("\nThese methods work together to provide:")
print("• Transparent caching that 'just works'")
print("• Minimal network usage through intelligent fallbacks")
print("• Fast metadata access via DuckDB views")
print("• Automatic handling of different file structures")
HfCacheManager Internal Methods:
===================================

1. _get_metadata_for_config(config)
 → Implements the 3-case strategy for a specific configuration
 → Returns detailed result with strategy used and success status

2. _check_metadata_exists_in_duckdb(table_name)
 → Case 1: Checks if metadata table already exists in DuckDB
 → Fast check using information_schema.tables

3. _load_metadata_from_cache(config, table_name)
 → Case 2: Attempts to load from local HuggingFace cache
 → Uses try_to_load_from_cache() to find cached files

4. _download_and_load_metadata(config, table_name)
 → Case 3: Downloads from HuggingFace Hub if not cached
 → Uses snapshot_download() for efficient file retrieval

5. _create_duckdb_table_from_files(file_paths, table_name)
 → Creates DuckDB views from parquet files
 → Handles both single files and multiple files efficiently

6. _extract_embedded_metadata_field(data_table, field, metadata_table)
 → Extracts metadata fields from data tables
 → Creates separate queryable metadata views

These methods work together to provide:
• Transparent caching that 'just works'
• Minimal network usage through intelligent fallbacks
• Fast metadata access via DuckDB views
• Automatic handling of different file structures
5. Extracting Embedded Metadata¶
Some datasets embed metadata fields in their data files. HfCacheManager can extract these fields from a data table into separate, queryable metadata tables.
# Demonstrate embedded metadata extraction concept
print("Embedded Metadata Extraction:")
print("=" * 35)
print("\nScenario: You have a data table with embedded metadata fields")
print("Example: genomics data with 'experimental_condition' field")
# Create sample data to demonstrate the concept
conn.execute("""
    CREATE TABLE sample_genomics_data AS
    SELECT
        'gene_' || (row_number() OVER()) as gene_id,
        random() * 1000 as expression_value,
        CASE
            WHEN (row_number() OVER()) % 4 = 0 THEN 'control'
            WHEN (row_number() OVER()) % 4 = 1 THEN 'treatment_A'
            WHEN (row_number() OVER()) % 4 = 2 THEN 'treatment_B'
            ELSE 'stress_condition'
        END as experimental_condition,
        CASE
            WHEN (row_number() OVER()) % 3 = 0 THEN 'timepoint_0h'
            WHEN (row_number() OVER()) % 3 = 1 THEN 'timepoint_6h'
            ELSE 'timepoint_24h'
        END as timepoint
    FROM range(100)
""")
print("✓ Created sample genomics data with embedded metadata fields")
# Show the data structure
sample_data = conn.execute(
    "SELECT * FROM sample_genomics_data LIMIT 5"
).fetchall()
print(f"\nSample data structure:")
print("gene_id | expression_value | experimental_condition | timepoint")
print("-" * 65)
for row in sample_data:
    print(f"{row[0]:8} | {row[1]:15.1f} | {row[2]:20} | {row[3]}")
print(f"\nEmbedded metadata fields identified:")
print("• experimental_condition: Contains treatment/control information")
print("• timepoint: Contains temporal sampling information")
# Use HfCacheManager to extract embedded metadata
print("Using HfCacheManager for Metadata Extraction:")
print("=" * 50)
# Extract experimental_condition metadata
success1 = cache_manager._extract_embedded_metadata_field(
    'sample_genomics_data',
    'experimental_condition',
    'metadata_experimental_conditions'
)
# Extract timepoint metadata
success2 = cache_manager._extract_embedded_metadata_field(
    'sample_genomics_data',
    'timepoint',
    'metadata_timepoints'
)
print(f"Experimental condition extraction: {'✓ Success' if success1 else '✗ Failed'}")
print(f"Timepoint extraction: {'✓ Success' if success2 else '✗ Failed'}")
# Show extracted metadata tables
if success1:
    print(f"\nExtracted experimental conditions:")
    conditions = conn.execute(
        "SELECT value, count FROM metadata_experimental_conditions ORDER BY count DESC"
    ).fetchall()
    for condition, count in conditions:
        print(f" • {condition}: {count} samples")
if success2:
    print(f"\nExtracted timepoints:")
    timepoints = conn.execute(
        "SELECT value, count FROM metadata_timepoints ORDER BY count DESC"
    ).fetchall()
    for timepoint, count in timepoints:
        print(f" • {timepoint}: {count} samples")
print(f"\nBenefits of extraction:")
print("• Separate queryable metadata tables")
print("• Fast metadata-based filtering and analysis")
print("• Clear separation of data and metadata concerns")
print("• Reusable metadata across different analyses")
from huggingface_hub import scan_cache_dir
# Get current cache information
cache_info = scan_cache_dir()
print("Current HuggingFace Cache Status:")
print("=" * 35)
print(f"Total size: {cache_info.size_on_disk_str}")
print(f"Number of repositories: {len(cache_info.repos)}")
print("\nRepository breakdown:")
for repo in list(cache_info.repos)[:5]: # Show first 5 repos
    print(f" • {repo.repo_id}: {repo.size_on_disk_str} ({len(repo.revisions)} revisions)")
if len(cache_info.repos) > 5:
    print(f" ... and {len(cache_info.repos) - 5} more repositories")
# Show target repository if it exists in cache
target_repo = None
for repo in cache_info.repos:
    if repo.repo_id == cache_manager.repo_id:
        target_repo = repo
        break
if target_repo:
    print(f"\nTarget repository ({cache_manager.repo_id}) cache info:")
    print(f" Size: {target_repo.size_on_disk_str}")
    print(f" Revisions: {len(target_repo.revisions)}")
    if target_repo.revisions:
        latest_rev = max(target_repo.revisions, key=lambda r: r.last_modified)
        print(f" Latest revision: {latest_rev.commit_hash[:8]}")
        print(f" Last modified: {latest_rev.last_modified}")
else:
    print(f"\nTarget repository ({cache_manager.repo_id}) not found in cache.")
    print("It may need to be downloaded first.")
Current HuggingFace Cache Status:
===================================
Total size: 1.2G
Number of repositories: 9

Repository breakdown:
 • BrentLab/rossi_2021: 276.5M (1 revisions)
 • BrentLab/mahendrawada_2025: 115.8M (3 revisions)
 • BrentLab/kemmeren_2014: 301.4M (1 revisions)
 • BrentLab/hu_2007_reimand_2010: 43.3M (1 revisions)
 • BrentLab/callingcards: 7.4M (4 revisions)
 ... and 4 more repositories

Target repository (BrentLab/mahendrawada_2025) cache info:
 Size: 115.8M
 Revisions: 3
 Latest revision: 31213daa
 Last modified: 1776272190.374475
Cache Cleanup by Age¶
# Clean cache entries older than 30 days (dry run)
print("Cleaning cache by age (30+ days old):")
print("=" * 40)
age_cleanup = cache_manager.clean_cache_by_age(
    max_age_days=30,
    dry_run=True # Set to False to actually execute
)
print(f"\nCleanup strategy created:")
print(f"Expected space freed: {age_cleanup.expected_freed_size_str}")
# Count total items to delete across all categories
total_items = len(age_cleanup.blobs) + len(age_cleanup.refs) + len(age_cleanup.repos) + len(age_cleanup.snapshots)
print(f"Items to delete: {total_items}")
# Show breakdown of what would be deleted
if total_items > 0:
    print(f"\nBreakdown of items to delete:")
    print(f" • Blob files: {len(age_cleanup.blobs)}")
    print(f" • Reference files: {len(age_cleanup.refs)}")
    print(f" • Repository directories: {len(age_cleanup.repos)}")
    print(f" • Snapshot directories: {len(age_cleanup.snapshots)}")
    # Show some example items
    if age_cleanup.blobs:
        print(f"\nSample blob files to delete:")
        for item in list(age_cleanup.blobs)[:3]:
            print(f" • {item}")
        if len(age_cleanup.blobs) > 3:
            print(f" ... and {len(age_cleanup.blobs) - 3} more blob files")
else:
    print("No old files found for cleanup.")
INFO:__main__:Found 12 old revisions. Will free 1.1G
INFO:__main__:Dry run completed. Use dry_run=False to execute deletion
Cleaning cache by age (30+ days old):
========================================

Cleanup strategy created:
Expected space freed: 1.1G
Items to delete: 11

Breakdown of items to delete:
 • Blob files: 2
 • Reference files: 0
 • Repository directories: 7
 • Snapshot directories: 2

Sample blob files to delete:
 • /home/chase/.cache/huggingface/hub/datasets--BrentLab--mahendrawada_2025/blobs/b60a55fd0f936b1dc717564c7403d02358acbfc3
 • /home/chase/.cache/huggingface/hub/datasets--BrentLab--harbison_2004/blobs/3cb3997d01aa1c91d0719f954e1cf207976c8a7d
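The idea behind age-based cleanup can be sketched in a few lines: compute a cutoff timestamp and select every revision last modified before it. This is an illustration under stated assumptions, not the code behind clean_cache_by_age(). The Revision tuple and select_older_than function are hypothetical, and the sample data is made up.

```python
import time
from typing import List, NamedTuple, Optional


class Revision(NamedTuple):
    """Hypothetical stand-in for a cached revision record."""
    commit_hash: str
    last_modified: float  # POSIX timestamp in seconds
    size_on_disk: int     # bytes


def select_older_than(
    revisions: List[Revision],
    max_age_days: float,
    now: Optional[float] = None,
) -> List[Revision]:
    """Return the revisions last modified more than max_age_days ago."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 24 * 3600
    return [rev for rev in revisions if rev.last_modified < cutoff]


DAY = 24 * 3600
now = 1_000 * DAY  # fixed reference time so the example is deterministic
revisions = [
    Revision("aaaa1111", now - 45 * DAY, 2_000_000),
    Revision("bbbb2222", now - 10 * DAY, 5_000_000),
    Revision("cccc3333", now - 31 * DAY, 1_000_000),
]
old = select_older_than(revisions, max_age_days=30, now=now)
freed = sum(rev.size_on_disk for rev in old)
print([rev.commit_hash for rev in old], freed)
# ['aaaa1111', 'cccc3333'] 3000000
```

In huggingface_hub itself, a list of commit hashes selected this way can be passed to scan_cache_dir().delete_revisions(...), which returns a strategy object exposing expected_freed_size_str and an execute() method. Whether HfCacheManager delegates to that call internally is an assumption here.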
Cache Cleanup by Size¶
# Clean cache to target size (dry run)
target_size = "1GB"
print(f"Cleaning cache to target size: {target_size}")
print("=" * 40)
size_cleanup = cache_manager.clean_cache_by_size(
target_size=target_size,
strategy="oldest_first", # Can be: oldest_first, largest_first, least_used
dry_run=True
)
print(f"\nSize-based cleanup strategy:")
print(f"Expected space freed: {size_cleanup.expected_freed_size_str}")
# Count total items to delete across all categories
total_items = len(size_cleanup.blobs) + len(size_cleanup.refs) + len(size_cleanup.repos) + len(size_cleanup.snapshots)
print(f"Items to delete: {total_items}")
# Compare different strategies
strategies = ["oldest_first", "largest_first", "least_used"]
print(f"\nComparing cleanup strategies for {target_size}:")
for strategy in strategies:
    try:
        strategy_result = cache_manager.clean_cache_by_size(
            target_size=target_size,
            strategy=strategy,
            dry_run=True
        )
        strategy_total = (len(strategy_result.blobs) + len(strategy_result.refs) +
                          len(strategy_result.repos) + len(strategy_result.snapshots))
        print(f" • {strategy:15}: {strategy_result.expected_freed_size_str:>8} "
              f"({strategy_total} items)")
    except Exception as e:
        print(f" • {strategy:15}: Error - {e}")
INFO:__main__:Selected 7 revisions for deletion. Will free 352.1M
INFO:__main__:Dry run completed. Use dry_run=False to execute deletion
INFO:__main__:Selected 7 revisions for deletion. Will free 352.1M
INFO:__main__:Dry run completed. Use dry_run=False to execute deletion
INFO:__main__:Selected 1 revisions for deletion. Will free 433.3M
INFO:__main__:Dry run completed. Use dry_run=False to execute deletion
INFO:__main__:Selected 7 revisions for deletion. Will free 352.1M
INFO:__main__:Dry run completed. Use dry_run=False to execute deletion
Cleaning cache to target size: 1GB
========================================
Size-based cleanup strategy:
Expected space freed: 352.1M
Items to delete: 5
Comparing cleanup strategies for 1GB:
• oldest_first : 352.1M (5 items)
• largest_first : 433.3M (1 items)
• least_used : 352.1M (5 items)
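The comparison above shows that `largest_first` can reach the target by deleting fewer items, while `oldest_first` removes more, smaller revisions. A minimal sketch of why, assuming a simple greedy selection; this is not the library's implementation, `Revision` and `select_for_target` are hypothetical names, and `least_used` is omitted because it would need access-time data:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Revision:
    """Hypothetical cached revision with a size and modification time."""
    name: str
    size_bytes: int
    last_modified: float


def select_for_target(revisions, current_size, target_size, strategy="oldest_first"):
    """Greedily pick revisions until the cache would fit within target_size."""
    if strategy == "oldest_first":
        ordered = sorted(revisions, key=lambda rev: rev.last_modified)
    elif strategy == "largest_first":
        ordered = sorted(revisions, key=lambda rev: rev.size_bytes, reverse=True)
    else:
        raise ValueError(f"Unsupported strategy in this sketch: {strategy}")

    bytes_to_free = current_size - target_size
    selected, freed = [], 0
    for rev in ordered:
        if freed >= bytes_to_free:
            break
        selected.append(rev)
        freed += rev.size_bytes
    return selected, freed


# Toy data: sizes in arbitrary units, timestamps as small integers
toy = [Revision("a", 100, 1.0), Revision("b", 400, 2.0), Revision("c", 50, 3.0)]
oldest, oldest_freed = select_for_target(toy, 550, 300, "oldest_first")
largest, largest_freed = select_for_target(toy, 550, 300, "largest_first")
print([r.name for r in oldest], oldest_freed)
print([r.name for r in largest], largest_freed)
```

Both strategies may overshoot the target, because whole revisions are removed rather than partial ones.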
Cleaning Unused Revisions¶
# Clean unused revisions, keeping only the latest 2 per repository
print("Cleaning unused revisions (keep latest 2 per repo):")
print("=" * 50)
revision_cleanup = cache_manager.clean_unused_revisions(
keep_latest=2,
dry_run=True
)
print(f"\nRevision cleanup strategy:")
print(f"Expected space freed: {revision_cleanup.expected_freed_size_str}")
# Count total items to delete across all categories
total_items = len(revision_cleanup.blobs) + len(revision_cleanup.refs) + len(revision_cleanup.repos) + len(revision_cleanup.snapshots)
print(f"Items to delete: {total_items}")
# Show breakdown
if total_items > 0:
    print(f"\nBreakdown of cleanup:")
    print(f" • Blob files: {len(revision_cleanup.blobs)}")
    print(f" • Reference files: {len(revision_cleanup.refs)}")
    print(f" • Repository directories: {len(revision_cleanup.repos)}")
    print(f" • Snapshot directories: {len(revision_cleanup.snapshots)}")
# Show repository-specific breakdown
cache_info = scan_cache_dir()
if cache_info.repos:
    print("\nPer-repository revision analysis:")
    for repo in list(cache_info.repos)[:3]:
        print(f"\n • {repo.repo_id}:")
        print(f" Total revisions: {len(repo.revisions)}")
        print(f" Would keep: {min(2, len(repo.revisions))}")
        print(f" Would delete: {max(0, len(repo.revisions) - 2)}")
        # Show revision details, newest first
        sorted_revisions = sorted(repo.revisions, key=lambda r: r.last_modified, reverse=True)
        for rev in sorted_revisions[:2]:
            print(f" Keep: {rev.commit_hash[:8]} (modified: {rev.last_modified})")
        for rev in sorted_revisions[2:3]:  # Show one that would be deleted
            print(f" Delete: {rev.commit_hash[:8]} (modified: {rev.last_modified})")
INFO:__main__:Found 3 unused revisions. Will free 33.9K INFO:__main__:Dry run completed. Use dry_run=False to execute deletion
Cleaning unused revisions (keep latest 2 per repo):
==================================================
Revision cleanup strategy:
Expected space freed: 33.9K
Items to delete: 5
Breakdown of cleanup:
• Blob files: 1
• Reference files: 1
• Repository directories: 0
• Snapshot directories: 3
Per-repository revision analysis:
• BrentLab/rossi_2021:
Total revisions: 1
Would keep: 1
Would delete: 0
Keep: 8cbe9d50 (modified: 1773246771.4874742)
• BrentLab/mahendrawada_2025:
Total revisions: 3
Would keep: 2
Would delete: 1
Keep: 31213daa (modified: 1776272190.374475)
Keep: 13bb6037 (modified: 1776179563.7077723)
Delete: feff7544 (modified: 1773246772.653467)
• BrentLab/kemmeren_2014:
Total revisions: 1
Would keep: 1
Would delete: 0
Keep: 4585d9c7 (modified: 1773246750.6746056)
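The keep/delete split shown for each repository follows a simple rule: sort revisions newest first, keep the first `keep_latest`, and mark the rest for deletion. A minimal standalone sketch of that rule, using the `mahendrawada_2025` values from the output above; `split_revisions` is a hypothetical helper, not a method of HfCacheManager:

```python
def split_revisions(revisions, keep_latest=2):
    """Split (commit_hash, last_modified) pairs into (keep, delete) lists.

    Revisions are ordered newest first; the first keep_latest are kept.
    """
    newest_first = sorted(revisions, key=lambda rev: rev[1], reverse=True)
    return newest_first[:keep_latest], newest_first[keep_latest:]


mahendrawada = [
    ("feff7544", 1773246772.653467),
    ("31213daa", 1776272190.374475),
    ("13bb6037", 1776179563.7077723),
]

keep, delete = split_revisions(mahendrawada, keep_latest=2)
print("Keep:  ", [commit for commit, _ in keep])
print("Delete:", [commit for commit, _ in delete])
```

A repository with `keep_latest` or fewer revisions yields an empty delete list, which matches the single-revision repositories in the output.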
Automated Cache Management¶
# Automated cache cleanup with multiple strategies
print("Automated cache cleanup (comprehensive):")
print("=" * 40)
auto_cleanup = cache_manager.auto_clean_cache(
max_age_days=30, # Remove anything older than 30 days
max_total_size="5GB", # Target maximum cache size
keep_latest_per_repo=2, # Keep 2 latest revisions per repo
dry_run=True # Dry run for safety
)
print(f"\nAutomated cleanup executed {len(auto_cleanup)} strategies:")
total_freed = 0
for i, strategy in enumerate(auto_cleanup, 1):
    print(f" {i}. Strategy freed: {strategy.expected_freed_size_str}")
    total_freed += strategy.expected_freed_size
print(f"\nTotal space that would be freed: {cache_manager._format_bytes(total_freed)}")
# Calculate final cache size
current_cache = scan_cache_dir()
final_size = current_cache.size_on_disk - total_freed
print(f"Cache size after cleanup: {cache_manager._format_bytes(max(0, final_size))}")
INFO:__main__:Starting automated cache cleanup...
INFO:__main__:Found 12 old revisions. Will free 1.1G
INFO:__main__:Dry run completed. Use dry_run=False to execute deletion
Automated cache cleanup (comprehensive):
========================================
INFO:__main__:Found 3 unused revisions. Will free 33.9K
INFO:__main__:Dry run completed. Use dry_run=False to execute deletion
INFO:__main__:Automated cleanup complete. Total freed: 1.0GB
Automated cleanup executed 2 strategies:
1. Strategy freed: 1.1G
2. Strategy freed: 33.9K
Total space that would be freed: 1.0GB
Cache size after cleanup: 153.2MB
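The cell above formats byte counts with `cache_manager._format_bytes`, which is a private (underscore-prefixed) helper and may change without notice. A minimal sketch of a standalone equivalent, assuming 1024-based units and one decimal place to mimic the `1.0GB` style; the real helper's exact behavior is not documented here, so treat `format_bytes` as a hypothetical replacement:

```python
def format_bytes(num_bytes: int) -> str:
    """Format a byte count with 1024-based units, e.g. 1536 -> '1.5KB'."""
    units = ["B", "KB", "MB", "GB", "TB"]
    value = float(num_bytes)
    for unit in units:
        # Stop at the first unit that keeps the value below 1024,
        # or at the largest unit available.
        if value < 1024 or unit == units[-1]:
            return f"{value:.1f}{unit}"
        value /= 1024


def remaining_after_cleanup(current_size: int, total_freed: int) -> int:
    """Estimated cache size after cleanup, never below zero."""
    return max(0, current_size - total_freed)


print(format_bytes(1536))
print(format_bytes(remaining_after_cleanup(2 * 1024**3, 1024**3)))
```

When summing per-strategy estimates from dry runs, keep in mind that two strategies may select the same revision, so the summed total is best read as an upper bound.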
7. Best Practices and Performance Tips¶
Here are some best practices for using HfCacheManager effectively:
Performance Best Practices¶
import time
print("Performance Demonstration: Cache Management Benefits")
print("=" * 55)
print("\nDemonstrating cache cleanup performance...")
# Show performance of cache scanning and cleanup strategy creation
print("\n1. Cache scanning performance:")
start_time = time.time()
cache_info = scan_cache_dir()
scan_time = time.time() - start_time
print(f" Time to scan cache: {scan_time:.3f} seconds")
print(f" Repositories found: {len(cache_info.repos)}")
print(f" Total cache size: {cache_info.size_on_disk_str}")
# Show performance of cleanup strategy creation
print("\n2. Cleanup strategy creation performance:")
start_time = time.time()
age_strategy = cache_manager.clean_cache_by_age(max_age_days=30, dry_run=True)
age_time = time.time() - start_time
print(f" Age cleanup strategy: {age_time:.3f} seconds")
start_time = time.time()
size_strategy = cache_manager.clean_cache_by_size(target_size="1GB", dry_run=True)
size_time = time.time() - start_time
print(f" Size cleanup strategy: {size_time:.3f} seconds")
start_time = time.time()
revision_strategy = cache_manager.clean_unused_revisions(keep_latest=2, dry_run=True)
revision_time = time.time() - start_time
print(f" Revision cleanup strategy: {revision_time:.3f} seconds")
print(f"\nPerformance insights:")
print(f"• Cache scanning is fast: {scan_time:.3f}s for {len(cache_info.repos)} repos")
print(f"• Cleanup strategy creation is efficient")
print(f"• Dry runs allow safe preview of cleanup operations")
print(f"• Multiple strategies can be compared quickly")
Performance Demonstration: Cache Management Benefits
=======================================================
Demonstrating cache cleanup performance...
1. Cache scanning performance:
INFO:__main__:Found 12 old revisions. Will free 1.1G
INFO:__main__:Dry run completed. Use dry_run=False to execute deletion
Time to scan cache: 0.008 seconds
Repositories found: 9
Total cache size: 1.2G
2. Cleanup strategy creation performance:
Age cleanup strategy: 0.009 seconds
INFO:__main__:Selected 7 revisions for deletion. Will free 352.1M
INFO:__main__:Dry run completed. Use dry_run=False to execute deletion
INFO:__main__:Found 3 unused revisions. Will free 33.9K
Size cleanup strategy: 0.009 seconds
INFO:__main__:Dry run completed. Use dry_run=False to execute deletion
Revision cleanup strategy: 0.012 seconds
Performance insights:
• Cache scanning is fast: 0.008s for 9 repos
• Cleanup strategy creation is efficient
• Dry runs allow safe preview of cleanup operations
• Multiple strategies can be compared quickly
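The timing cell repeats the same start/stop pattern with `time.time()`. For intervals this short, `time.perf_counter()` is generally the better clock, since it is monotonic and has higher resolution. A minimal sketch of a reusable timer; `timed` is a hypothetical helper, and the summed range is only a placeholder for calls such as `scan_cache_dir()` or a dry-run cleanup:

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results):
    """Record the wall-clock duration of the enclosed block in results[label]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[label] = time.perf_counter() - start


timings = {}
with timed("sum", timings):
    # Placeholder workload; substitute the operation you want to measure
    total = sum(range(100_000))

for label, seconds in timings.items():
    print(f" {label}: {seconds:.3f} seconds")
```

Collecting the durations in a dictionary makes it easy to compare several strategies in a single summary at the end.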
Memory and Storage Optimization¶
print("Memory and Storage Optimization Tips:")
print("=" * 40)
print("\n1. DuckDB Views vs Tables:")
print(" • HfCacheManager creates VIEWS by default (not tables)")
print(" • Views reference original parquet files without duplication")
print(" • This saves storage space while enabling fast SQL queries")
print("\n2. Metadata-First Workflow:")
print(" • Load metadata first to understand data structure")
print(" • Use metadata to filter and select specific data subsets")
print(" • Avoid loading entire datasets when only portions are needed")
print("\n3. Cache Management Strategy:")
print(" • Run automated cleanup regularly")
print(" • Keep cache size reasonable for your system")
print(" • Prioritize keeping recent and frequently-used datasets")
# Demonstrate DuckDB view benefits
tables_info = conn.execute(
"SELECT table_name, table_type FROM information_schema.tables WHERE table_name LIKE 'metadata_%'"
).fetchall()
if tables_info:
    print(f"\nCurrent DuckDB objects ({len(tables_info)} total):")
    for table_name, table_type in tables_info:
        print(f" • {table_name}: {table_type}")
    view_count = sum(1 for _, table_type in tables_info if table_type == 'VIEW')
    print(f"\n {view_count} views created (space-efficient!)")
Memory and Storage Optimization Tips:
========================================
1. DuckDB Views vs Tables:
• HfCacheManager creates VIEWS by default (not tables)
• Views reference original parquet files without duplication
• This saves storage space while enabling fast SQL queries
2. Metadata-First Workflow:
• Load metadata first to understand data structure
• Use metadata to filter and select specific data subsets
• Avoid loading entire datasets when only portions are needed
3. Cache Management Strategy:
• Run automated cleanup regularly
• Keep cache size reasonable for your system
• Prioritize keeping recent and frequently-used datasets
8. Integration with Other Components¶
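The HfCacheManager is designed to be used alongside other components in the labretriever ecosystem, such as HfQueryAPI. The cell below prints an outline of a typical workflow and then reports the current session state.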
print("HfCacheManager Integration Workflow:")
print("=" * 40)
print("\n1. Cache Management Setup:")
print(" from labretriever.hf_cache_manager import HfCacheManager")
print(" cache_mgr = HfCacheManager(repo_id, duckdb_conn)")
print(" # Inherits all DataCard functionality + cache management")
print("\n2. Proactive Cache Cleanup:")
print(" # Clean before large operations")
print(" cache_mgr.auto_clean_cache(max_total_size='5GB', dry_run=False)")
print(" # Or use specific strategies")
print(" cache_mgr.clean_cache_by_age(max_age_days=30)")
print("\n3. Data Loading with Cache Awareness:")
print(" # The 3-case strategy works automatically with HfQueryAPI")
print(" from labretriever import HfQueryAPI")
print(" query_api = HfQueryAPI(repo_id, duckdb_conn)")
print(" # Metadata loading uses cache manager's strategy")
print(" data_df = query_api.get_pandas('config_name')")
print("\n4. Embedded Metadata Extraction:")
print(" # Extract metadata fields after data loading")
print(" cache_mgr._extract_embedded_metadata_field(")
print(" 'data_table_name', 'metadata_field', 'metadata_table_name')")
print("\n5. Regular Cache Maintenance:")
print(" # Schedule regular cleanup")
print(" cache_mgr.clean_unused_revisions(keep_latest=2)")
print(" cache_mgr.clean_cache_by_size('10GB', strategy='oldest_first')")
# Show current state
print(f"\nCurrent Session State:")
print(f"Repository: {cache_manager.repo_id}")
print(f"DuckDB tables: {len(conn.execute('SELECT table_name FROM information_schema.tables').fetchall())}")
cache_info = scan_cache_dir()
print(f"HF cache size: {cache_info.size_on_disk_str}")
print(f"Cache repositories: {len(cache_info.repos)}")
HfCacheManager Integration Workflow:
========================================
1. Cache Management Setup:
from labretriever.hf_cache_manager import HfCacheManager
cache_mgr = HfCacheManager(repo_id, duckdb_conn)
# Inherits all DataCard functionality + cache management
2. Proactive Cache Cleanup:
# Clean before large operations
cache_mgr.auto_clean_cache(max_total_size='5GB', dry_run=False)
# Or use specific strategies
cache_mgr.clean_cache_by_age(max_age_days=30)
3. Data Loading with Cache Awareness:
# The 3-case strategy works automatically with HfQueryAPI
from labretriever import HfQueryAPI
query_api = HfQueryAPI(repo_id, duckdb_conn)
# Metadata loading uses cache manager's strategy
data_df = query_api.get_pandas('config_name')
4. Embedded Metadata Extraction:
# Extract metadata fields after data loading
cache_mgr._extract_embedded_metadata_field(
'data_table_name', 'metadata_field', 'metadata_table_name')
5. Regular Cache Maintenance:
# Schedule regular cleanup
cache_mgr.clean_unused_revisions(keep_latest=2)
cache_mgr.clean_cache_by_size('10GB', strategy='oldest_first')
Current Session State:
Repository: BrentLab/mahendrawada_2025
DuckDB tables: 0
HF cache size: 1.2G
Cache repositories: 9
9. Troubleshooting and Error Handling¶
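The cell below lists common problem areas (imports and setup, cache space, cleanup, and DuckDB integration) and then runs a short health check against the DuckDB connection, the HuggingFace cache, and the cache manager's cleanup methods.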
print("Cache Management Troubleshooting:")
print("=" * 35)
print("\n1. Import and Setup Issues:")
print(" • Ensure correct import: from labretriever.hf_cache_manager import HfCacheManager")
print(" • Verify DuckDB connection: conn = duckdb.connect(':memory:')")
print(" • Check repository access permissions")
print("\n2. Cache Space and Performance Issues:")
try:
    cache_info = scan_cache_dir()
    print(f" Current cache size: {cache_info.size_on_disk_str}")
    print(" • Use auto_clean_cache() for automated management")
    print(" • Monitor cache growth with scan_cache_dir()")
    print(" • Set appropriate size limits for your system")
    # Show if cache is getting large
    total_gb = cache_info.size_on_disk / (1024**3)
    if total_gb > 10:
        print(f" ⚠️ Large cache detected ({total_gb:.1f}GB) - consider cleanup")
except Exception as e:
    print(f" Cache scan error: {e}")
print("\n3. Cache Cleanup Issues:")
print(" • Use dry_run=True first to preview changes")
print(" • Check disk permissions for cache directory")
print(" • Verify no active processes are using cached files")
print("\n4. DuckDB Integration Issues:")
print(" • Ensure DuckDB connection is active")
print(" • Check memory limits for in-memory databases")
print(" • Verify table names don't conflict")
# Perform health checks
print(f"\nCache Health Check:")
# Test DuckDB
try:
    test_result = conn.execute("SELECT 'DuckDB OK' as status").fetchone()
    print(f"✓ DuckDB connection: {test_result[0]}")
except Exception as e:
    print(f"✗ DuckDB connection: {e}")
# Test cache access
try:
    cache_info = scan_cache_dir()
    print(f"✓ Cache access: {len(cache_info.repos)} repositories found")
except Exception as e:
    print(f"✗ Cache access: {e}")
# Test cache manager methods
try:
    test_cleanup = cache_manager.clean_cache_by_age(max_age_days=999, dry_run=True)
    print(f"✓ Cache cleanup methods: Working")
except Exception as e:
    print(f"✗ Cache cleanup methods: {e}")
print(f"\nCurrent Status:")
print(f"Repository: {cache_manager.repo_id}")
print(f"Logger configured: {cache_manager.logger is not None}")
print(f"Cache management ready: ✓")
Cache Management Troubleshooting:
===================================
1. Import and Setup Issues:
• Ensure correct import: from labretriever.hf_cache_manager import HfCacheManager
• Verify DuckDB connection: conn = duckdb.connect(':memory:')
• Check repository access permissions
2. Cache Space and Performance Issues:
Current cache size: 1.2G
• Use auto_clean_cache() for automated management
• Monitor cache growth with scan_cache_dir()
• Set appropriate size limits for your system
3. Cache Cleanup Issues:
• Use dry_run=True first to preview changes
• Check disk permissions for cache directory
• Verify no active processes are using cached files
4. DuckDB Integration Issues:
• Ensure DuckDB connection is active
• Check memory limits for in-memory databases
• Verify table names don't conflict
Cache Health Check:
✓ DuckDB connection: DuckDB OK
✓ Cache access: 9 repositories found
INFO:__main__:No old revisions found to delete
INFO:__main__:Found 0 old revisions. Will free 0.0
INFO:__main__:Dry run completed. Use dry_run=False to execute deletion
✓ Cache cleanup methods: Working
Current Status:
Repository: BrentLab/mahendrawada_2025
Logger configured: True
Cache management ready: ✓
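The health check above repeats the same try/except pattern for each component. A minimal sketch of factoring that pattern into a reusable function; `run_health_checks` and the two example checks are hypothetical, and in practice the callables would wrap `conn.execute(...)`, `scan_cache_dir()`, or a dry-run cleanup:

```python
def run_health_checks(checks):
    """Run each named check and collect (passed, detail) results.

    checks maps a label to a zero-argument callable. A check passes if it
    returns without raising; its return value or error message is the detail.
    """
    results = {}
    for name, check in checks.items():
        try:
            results[name] = (True, str(check()))
        except Exception as exc:
            results[name] = (False, str(exc))
    return results


def always_fails():
    # Stand-in for a check that hits an error, such as an unreadable cache
    raise RuntimeError("cache directory not readable")


results = run_health_checks({"ok": lambda: "fine", "bad": always_fails})
for name, (passed, detail) in results.items():
    print(f"{'✓' if passed else '✗'} {name}: {detail}")
```

Returning the results instead of only printing them lets a script or scheduled job act on failures, for example by skipping cleanup when the cache cannot be scanned.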