Unified HDF5 Torch Cache Schema

This page documents the versioned on-disk cache schema used by HDF5StructureDataset.build_database(..., persist_features=..., persist_force_derivatives=...).

The user-facing training and dataset guides describe when to enable these cache sections and how they interact with cache_features=True at runtime. This page focuses on the on-disk schema, metadata contract, and compatibility rules behind that workflow.

Scope

Schema version 2 introduces a unified /torch_cache container for optional persisted payload sections:

raw unnormalized descriptor features
sparse local derivative payloads for force-labeled structures

New cache-writing builds use schema version 2 whenever either optional payload is requested. Legacy derivative-only schema version 1 files stored under /force_derivatives remain readable.

Compatibility Contract

Persisted cache compatibility is keyed to the descriptor settings that change the raw geometry-dependent payloads:

descriptor class
species order
radial order and cutoff
angular order and cutoff
minimum cutoff
whether multi-species/typespin weighting is active

Storage dtype is recorded as metadata, but it is not part of the compatibility signature. A cache written in one floating-point dtype can therefore be loaded by a descriptor that uses a different floating-point dtype; values are cast on load.
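
As a minimal sketch of how a signature of this kind could be derived, the following builds a canonical JSON string from the compatibility-relevant settings and hashes it with SHA-256. The function name, the field names, the example values, and the canonicalization choices (sorted keys, compact separators) are assumptions made for illustration; the serialization actually used by the library may differ.

```python
import hashlib
import json


def compat_signature(settings: dict) -> tuple[str, str]:
    """Return (canonical_json, sha256_hex) for a settings mapping."""
    # Sorted keys and fixed separators make the JSON text independent
    # of dict insertion order (an assumed canonicalization).
    canonical = json.dumps(settings, sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    return canonical, digest


# Hypothetical settings; field names and values are illustrative only.
settings = {
    "descriptor_class": "ExampleDescriptor",
    "species": ["Ti", "O"],  # list order is preserved, so species order matters
    "rad_order": 10,
    "rad_cutoff": 4.0,
    "ang_order": 3,
    "ang_cutoff": 1.5,
    "min_cutoff": 0.55,
    "multi": True,
}

json_a, sha_a = compat_signature(settings)

# The same settings inserted in a different order give the same signature.
json_b, sha_b = compat_signature(dict(reversed(list(settings.items()))))

# Storage dtype is deliberately absent from the mapping, so changing it
# cannot change the hash.
```

Because the species list is serialized in order, swapping two species produces a different hash, which matches the role of species order in the list above.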

Schema Version 2 Layout

The root group is /torch_cache.

Root attributes:

schema_version: integer schema version, currently 2
cache_format: format identifier string, "aenet.torch_training.cache.v2"
descriptor_compat_json: canonical JSON serialization of the compatibility-relevant descriptor settings
descriptor_compat_sha256: SHA-256 hash of that JSON payload
storage_dtype: floating-point dtype used for stored arrays
contains_features: whether the /torch_cache/features section exists
contains_force_derivatives: whether the /torch_cache/force_derivatives section exists
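
The root attributes can be checked before any payload is touched. The sketch below works on a plain dict standing in for the HDF5 attribute set; the helper name usable_sections, the placeholder attribute values, and the choice to treat a hash mismatch as "no usable sections" rather than as an error are assumptions for illustration, not the library's documented behavior.

```python
EXPECTED_FORMAT = "aenet.torch_training.cache.v2"


def usable_sections(attrs: dict, expected_sha256: str) -> list[str]:
    """List the payload sections a reader could use from these root attributes."""
    if int(attrs["schema_version"]) != 2:
        raise ValueError(f"unsupported schema_version: {attrs['schema_version']}")
    if attrs["cache_format"] != EXPECTED_FORMAT:
        raise ValueError(f"unexpected cache_format: {attrs['cache_format']}")
    # Assumed policy: an incompatible descriptor means nothing is reusable,
    # so the caller recomputes on the fly.
    if attrs["descriptor_compat_sha256"] != expected_sha256:
        return []
    sections = []
    if bool(attrs["contains_features"]):
        sections.append("features")
    if bool(attrs["contains_force_derivatives"]):
        sections.append("force_derivatives")
    return sections


# Placeholder attribute values for demonstration only.
attrs = {
    "schema_version": 2,
    "cache_format": "aenet.torch_training.cache.v2",
    "descriptor_compat_json": "{}",
    "descriptor_compat_sha256": "abc123",
    "storage_dtype": "float32",
    "contains_features": True,
    "contains_force_derivatives": False,
}

usable = usable_sections(attrs, expected_sha256="abc123")
mismatch = usable_sections(attrs, expected_sha256="other")
```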

Feature Section

Feature payloads live under /torch_cache/features.

Nodes:

/torch_cache/features/index
/torch_cache/features/values

Index columns:

entry_idx: dataset entry index in /entries/structures
cache_row: row number used by values
n_atoms: atom count for the structure
n_features: raw feature width F

Payload semantics:

one flattened raw (N, F) tensor per cached entry in values
features are stored pre-normalization
load-time helpers reshape back to (N, F) and cast to the active descriptor dtype
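
The reshape step can be illustrated without any array library. This sketch assumes row-major flattening (all F features of atom 0, then atom 1, and so on), which the schema text does not state explicitly; plain Python lists stand in for the tensors the real helpers return, and unflatten_features is a hypothetical name.

```python
def unflatten_features(flat, n_atoms, n_features):
    """Reshape a flat sequence into n_atoms rows of n_features floats."""
    if len(flat) != n_atoms * n_features:
        raise ValueError(
            f"expected {n_atoms * n_features} values, got {len(flat)}"
        )
    # float() plays the role of the cast to the active descriptor dtype.
    return [
        [float(v) for v in flat[i * n_features:(i + 1) * n_features]]
        for i in range(n_atoms)
    ]


# One index row and its flattened payload (illustrative values).
index_row = {"entry_idx": 7, "cache_row": 0, "n_atoms": 2, "n_features": 3}
flat = [0.1, 0.2, 0.3, 1.1, 1.2, 1.3]

features = unflatten_features(
    flat, index_row["n_atoms"], index_row["n_features"]
)
```

Storing n_atoms and n_features in the index is what makes the flat payload self-describing: the reader needs no descriptor object to recover the (N, F) shape.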

Force-Derivative Section

Derivative payloads live under /torch_cache/force_derivatives.

Section attributes:

schema_version: derivative payload schema version, currently 1
payload_format: format identifier string, "aenet.torch_training.local_derivatives.v1"
descriptor_compat_json
descriptor_compat_sha256
storage_dtype
n_radial_features
n_angular_features
multi
contains_features: currently False within the derivative subsection
contains_positions: currently False

Index table:

/torch_cache/force_derivatives/index
one row per cached force-labeled structure
columns: entry_idx, cache_row, n_atoms, n_radial_edges, n_angular_triplets

Radial payload nodes:

/torch_cache/force_derivatives/radial/center_idx
/torch_cache/force_derivatives/radial/neighbor_idx
/torch_cache/force_derivatives/radial/dG_drij
/torch_cache/force_derivatives/radial/neighbor_typespin

Angular payload nodes:

/torch_cache/force_derivatives/angular/center_idx
/torch_cache/force_derivatives/angular/neighbor_j_idx
/torch_cache/force_derivatives/angular/neighbor_k_idx
/torch_cache/force_derivatives/angular/grads_i
/torch_cache/force_derivatives/angular/grads_j
/torch_cache/force_derivatives/angular/grads_k
/torch_cache/force_derivatives/angular/triplet_typespin

The logical tensor shapes are unchanged from the original derivative cache design. The v2 schema only relocates the derivative section under the shared cache root.

Loading Semantics

The persistence layer exposes the cache through explicit dataset helpers:

has_persisted_features()
get_persisted_feature_cache_info()
load_persisted_features(idx)
has_persisted_force_derivatives()
get_force_derivative_cache_info()
load_persisted_force_derivatives(idx)

Runtime sample materialization uses the persisted cache lazily when the payload is present and descriptor-compatible:

energy-view materialization checks the trainer-owned runtime cache (cache_features=True) first, then falls back to persisted HDF5 features, and finally recomputes features on demand
force-view materialization reuses persisted raw features when available
when both persisted raw features and persisted local derivatives are available for a force-supervised entry, HDF5StructureDataset can serve the force sample without rebuilding graph/triplet payloads
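
The energy-view lookup order described above can be sketched as a three-step fallback. The function materialize_features and its callable arguments are hypothetical stand-ins; in particular, the convention that the persisted loader returns None for a missing entry is an assumption of this sketch, not the documented contract of load_persisted_features(idx).

```python
from typing import Callable, Optional, Sequence

Features = Sequence[Sequence[float]]


def materialize_features(
    idx: int,
    runtime_cache: dict,
    load_persisted: Callable[[int], Optional[Features]],
    compute: Callable[[int], Features],
) -> tuple[Features, str]:
    """Return (features, source) using runtime, persisted, then computed."""
    # 1. In-memory cache owned by the trainer.
    if idx in runtime_cache:
        return runtime_cache[idx], "runtime"
    # 2. Persisted section, if present and compatible (None means unavailable).
    persisted = load_persisted(idx)
    if persisted is not None:
        return persisted, "persisted"
    # 3. Recompute on demand.
    return compute(idx), "computed"


# Toy stand-ins for the three sources.
runtime_cache = {0: [[1.0, 2.0]]}
persisted_store = {1: [[3.0, 4.0]]}


def load_persisted(idx):
    return persisted_store.get(idx)


def compute(idx):
    return [[float(idx), float(idx)]]


from_runtime = materialize_features(0, runtime_cache, load_persisted, compute)
from_disk = materialize_features(1, runtime_cache, load_persisted, compute)
recomputed = materialize_features(2, runtime_cache, load_persisted, compute)
```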

This keeps feature normalization as a runtime training concern and preserves on-the-fly fallback behavior when a persisted section is absent.

Legacy Version 1 Compatibility

Legacy derivative-only files with a root /force_derivatives group remain supported for read access.

Version 1 characteristics:

derivative-only layout
schema_version = 1
no unified /torch_cache root
no persisted raw feature section

New builds do not write schema version 1. They standardize on schema version 2 whenever persisted cache payloads are requested.
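
A reader that supports both layouts has to decide which one a file uses. The following sketch makes that decision from the names of the top-level groups; detect_cache_layout is a hypothetical helper, and preferring /torch_cache when both roots exist is an assumption, since the text above does not cover that case.

```python
from typing import Optional


def detect_cache_layout(root_groups: set) -> Optional[int]:
    """Return 2, 1, or None from the names of a file's top-level groups."""
    if "torch_cache" in root_groups:
        return 2  # unified container
    if "force_derivatives" in root_groups:
        return 1  # legacy derivative-only layout
    return None  # no persisted cache payloads


v2_layout = detect_cache_layout({"entries", "torch_cache"})
v1_layout = detect_cache_layout({"entries", "force_derivatives"})
no_cache = detect_cache_layout({"entries"})
```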