Unified HDF5 Torch Cache Schema
This page documents the versioned on-disk cache schema used by
HDF5StructureDataset.build_database(..., persist_features=...,
persist_force_derivatives=...).
The user-facing training and dataset guides describe when to enable these
cache sections and how they interact with cache_features=True at runtime.
This page focuses on the on-disk schema, metadata contract, and compatibility
rules behind that workflow.
Scope
Schema version 2 introduces a unified /torch_cache container for optional
persisted payload sections:
raw unnormalized descriptor features
sparse local derivative payloads for force-labeled structures
New cache-writing builds use schema version 2 whenever either optional
payload is requested. Legacy derivative-only schema version 1 files stored
under /force_derivatives remain readable.
Compatibility Contract
Persisted cache compatibility is keyed to the descriptor settings that change the raw geometry-dependent payloads:
descriptor class
species order
radial order and cutoff
angular order and cutoff
minimum cutoff
whether multi-species/typespin weighting is active
Storage dtype is recorded as metadata, but it is not part of the compatibility signature. A cache may therefore be written in one floating-point dtype and loaded through another compatible descriptor dtype, with values cast on load.
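As a rough illustration of how a signature of this kind can be derived, the sketch below serializes a settings dictionary to canonical JSON and hashes it with SHA-256. The key names, the descriptor class name, and the exact canonicalization rules (sorted keys, compact separators) are assumptions made for the example and may differ from what the library writes. Storage dtype is left out of the signed settings, in line with the contract above.

```python
import hashlib
import json


def descriptor_signature(settings: dict) -> tuple[str, str]:
    """Return (canonical_json, sha256_hex) for compatibility-relevant settings."""
    # Sorted keys and fixed separators make the serialization independent of
    # dict insertion order, so equal settings always produce the same hash.
    payload = json.dumps(settings, sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()
    return payload, digest


# Hypothetical key names and values, chosen only for illustration.
settings = {
    "descriptor_class": "ExampleDescriptor",
    "species": ["Ti", "O"],
    "rad_order": 10,
    "rad_cutoff": 6.0,
    "ang_order": 4,
    "ang_cutoff": 4.0,
    "min_cutoff": 0.55,
    "multi": True,
}
payload, digest = descriptor_signature(settings)

# Key order in the input dict does not affect the signature.
reordered = dict(reversed(list(settings.items())))
assert descriptor_signature(reordered) == (payload, digest)

# Species order is part of the signature, so swapping it changes the hash.
swapped = {**settings, "species": ["O", "Ti"]}
assert descriptor_signature(swapped)[1] != digest
```

Hashing a canonical serialization gives a cheap equality check, while keeping the JSON itself alongside the hash makes a mismatch easy to diagnose.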
Schema Version 2 Layout
The root group is /torch_cache.
Root attributes:
schema_version: integer schema version, currently 2
cache_format: format identifier string, "aenet.torch_training.cache.v2"
descriptor_compat_json: canonical JSON serialization of the compatibility-relevant descriptor settings
descriptor_compat_sha256: SHA-256 hash of that JSON payload
storage_dtype: floating-point dtype used for stored arrays
contains_features: whether the /torch_cache/features section exists
contains_force_derivatives: whether the /torch_cache/force_derivatives section exists
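A reader of the file might sanity-check these root attributes before trusting a cache. The following is a minimal sketch, not the library's own validation: validate_root_attrs is a hypothetical helper, it accepts any mapping (a plain dict here), and it assumes string attributes have already been decoded to str. Whether the real loader re-hashes the JSON payload in this way is also an assumption.

```python
import hashlib
from collections.abc import Mapping

EXPECTED_FORMAT = "aenet.torch_training.cache.v2"


def validate_root_attrs(attrs: Mapping) -> None:
    """Raise ValueError if the root attributes are inconsistent (hypothetical helper)."""
    if int(attrs["schema_version"]) != 2:
        raise ValueError(f"unsupported schema_version: {attrs['schema_version']}")
    if attrs["cache_format"] != EXPECTED_FORMAT:
        raise ValueError(f"unexpected cache_format: {attrs['cache_format']}")
    # The stored hash should match a fresh hash of the stored JSON payload.
    digest = hashlib.sha256(
        attrs["descriptor_compat_json"].encode("utf-8")
    ).hexdigest()
    if digest != attrs["descriptor_compat_sha256"]:
        raise ValueError("descriptor_compat_sha256 does not match descriptor_compat_json")


# Placeholder attribute values standing in for what a real file would hold.
compat_json = '{"species":["Ti","O"]}'
attrs = {
    "schema_version": 2,
    "cache_format": EXPECTED_FORMAT,
    "descriptor_compat_json": compat_json,
    "descriptor_compat_sha256": hashlib.sha256(compat_json.encode("utf-8")).hexdigest(),
    "storage_dtype": "float64",
    "contains_features": True,
    "contains_force_derivatives": False,
}
validate_root_attrs(attrs)  # consistent attributes pass without raising
```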
Feature Section
Feature payloads live under /torch_cache/features.
Nodes:
/torch_cache/features/index
/torch_cache/features/values
Index columns:
entry_idx: dataset entry index in /entries/structures
cache_row: row number used by values
n_atoms: atom count for the structure
n_features: raw feature width F
Payload semantics:
one flattened raw (N, F) tensor per cached entry in values
features are stored pre-normalization
load-time helpers reshape back to (N, F) and cast to the active descriptor dtype
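The flatten-on-write, reshape-on-load round trip can be sketched with NumPy arrays standing in for torch tensors. Row-major flattening, the float32 storage dtype, and the restore helper are assumptions made for the example; the library's own load-time helpers may be organized differently.

```python
import numpy as np

n_atoms, n_features = 3, 4  # N and F, as recorded in the index row

# Raw (pre-normalization) features as a descriptor might produce them.
features = np.arange(n_atoms * n_features, dtype=np.float64).reshape(
    n_atoms, n_features
)

# Write side: flatten to 1-D (row-major) and cast to the storage dtype.
stored = features.reshape(-1).astype(np.float32)


def restore(flat, n_atoms, n_features, dtype):
    """Reshape a flat payload back to (N, F) and cast to the active dtype."""
    if flat.size != n_atoms * n_features:
        raise ValueError("payload length does not match n_atoms * n_features")
    return flat.reshape(n_atoms, n_features).astype(dtype)


# Load side: the index row supplies N and F; the descriptor supplies the dtype.
restored = restore(stored, n_atoms, n_features, np.float64)
assert restored.shape == (n_atoms, n_features)
assert restored.dtype == np.float64
```

Storing n_atoms and n_features in the index is what makes the flat payload self-describing, and checking their product against the payload length catches a mismatched row early.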
Force-Derivative Section
Derivative payloads live under /torch_cache/force_derivatives.
Section attributes:
schema_version: derivative payload schema version, currently 1
payload_format: format identifier string, "aenet.torch_training.local_derivatives.v1"
descriptor_compat_json
descriptor_compat_sha256
storage_dtype
n_radial_features
n_angular_features
multi
contains_features: currently False within the derivative subsection
contains_positions: currently False
Index table:
/torch_cache/force_derivatives/index
one row per cached force-labeled structure
columns: entry_idx, cache_row, n_atoms, n_radial_edges, n_angular_triplets
Radial payload nodes:
/torch_cache/force_derivatives/radial/center_idx
/torch_cache/force_derivatives/radial/neighbor_idx
/torch_cache/force_derivatives/radial/dG_drij
/torch_cache/force_derivatives/radial/neighbor_typespin
Angular payload nodes:
/torch_cache/force_derivatives/angular/center_idx
/torch_cache/force_derivatives/angular/neighbor_j_idx
/torch_cache/force_derivatives/angular/neighbor_k_idx
/torch_cache/force_derivatives/angular/grads_i
/torch_cache/force_derivatives/angular/grads_j
/torch_cache/force_derivatives/angular/grads_k
/torch_cache/force_derivatives/angular/triplet_typespin
The logical tensor shapes are unchanged from the original derivative cache design. The v2 schema only relocates the derivative section under the shared cache root.
Loading Semantics
The persistence layer exposes the cache through explicit dataset helpers:
has_persisted_features()
get_persisted_feature_cache_info()
load_persisted_features(idx)
has_persisted_force_derivatives()
get_force_derivative_cache_info()
load_persisted_force_derivatives(idx)
Runtime sample materialization now uses the persisted cache lazily when the payload is present and descriptor-compatible:
energy-view materialization checks the trainer-owned runtime cache_features=True cache first, then falls back to persisted HDF5 features, then finally recomputes features on demand
force-view materialization reuses persisted raw features when available
when both persisted raw features and persisted local derivatives are available for a force-supervised entry, HDF5StructureDataset can serve the force sample without rebuilding graph/triplet payloads
This keeps feature normalization as a runtime training concern and preserves on-the-fly fallback behavior when a persisted section is absent.
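The energy-view lookup order can be sketched as a three-tier fallback. The function below is illustrative only: materialize_features, its arguments, and the convention that a missing persisted entry is reported as None are assumptions, and plain dictionaries and lists stand in for the runtime cache, the HDF5 section, and the feature tensors.

```python
from typing import Callable, Optional


def materialize_features(
    idx: int,
    runtime_cache: dict,
    load_persisted: Callable[[int], Optional[list]],
    compute: Callable[[int], list],
) -> tuple[list, str]:
    """Return (features, source) following the three-tier lookup order."""
    # 1. Runtime cache, as populated when cache_features=True.
    if idx in runtime_cache:
        return runtime_cache[idx], "runtime"
    # 2. Persisted HDF5 features, if present for this entry.
    persisted = load_persisted(idx)
    if persisted is not None:
        return persisted, "persisted"
    # 3. Recompute on demand as the final fallback.
    return compute(idx), "computed"


# Stand-ins: entry 0 is in the runtime cache, entry 1 only on disk,
# and entry 2 is in neither, so it has to be recomputed.
runtime_cache = {0: [[1.0]]}
persisted = {1: [[2.0]]}
sources = [
    materialize_features(i, runtime_cache, persisted.get, lambda i: [[0.0]])[1]
    for i in range(3)
]
assert sources == ["runtime", "persisted", "computed"]
```

Normalization does not appear anywhere in this path, which matches the point above: every tier yields raw features, and normalization is applied afterwards at training time.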
Legacy Version 1 Compatibility
Legacy derivative-only files with a root /force_derivatives group remain
supported for read access.
Version 1 characteristics:
derivative-only layout
schema_version = 1
no unified /torch_cache root
no persisted raw feature section
New builds do not write schema version 1. They standardize on schema version 2 whenever persisted cache payloads are requested.
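A reader that has to handle both layouts could distinguish them by which root group is present. The sketch below is a hypothetical helper, not part of the documented API. It takes any container that supports membership tests on top-level group names (a plain set here), and preferring the version 2 root when both groups exist is an assumption.

```python
from typing import Optional


def detect_cache_schema(root) -> Optional[int]:
    """Return 2 for a /torch_cache root, 1 for legacy /force_derivatives, else None."""
    # `root` only needs to support `in` on top-level group names.
    if "torch_cache" in root:
        return 2
    if "force_derivatives" in root:
        return 1
    return None


# Sets of top-level group names standing in for open files.
assert detect_cache_schema({"entries", "torch_cache"}) == 2
assert detect_cache_schema({"entries", "force_derivatives"}) == 1
assert detect_cache_schema({"entries"}) is None
```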