Unified HDF5 Torch Cache Schema
===============================

This page documents the versioned on-disk cache schema used by
``HDF5StructureDataset.build_database(..., persist_features=...,``
``persist_force_derivatives=...)``.

The user-facing training and dataset guides describe when to enable these
cache sections and how they interact with ``cache_features=True`` at runtime.
This page focuses on the on-disk schema, metadata contract, and compatibility
rules behind that workflow.

Scope
-----

Schema version 2 introduces a unified ``/torch_cache`` container for optional
persisted payload sections:

- raw unnormalized descriptor features
- sparse local derivative payloads for force-labeled structures

New cache-writing builds use schema version 2 whenever either optional
payload is requested. Legacy schema version 1 files, which store only
derivative payloads under a root ``/force_derivatives`` group, remain readable.

Compatibility Contract
----------------------

Persisted cache compatibility is keyed to the descriptor settings that change
the raw geometry-dependent payloads:

- descriptor class
- species order
- radial order and cutoff
- angular order and cutoff
- minimum cutoff
- whether multi-species/typespin weighting is active

Storage dtype is recorded as metadata, but it is not part of the compatibility
signature. A cache may therefore be written in one floating-point dtype and
loaded through another compatible descriptor dtype, with values cast on load.
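
As a minimal sketch of how such a signature can be derived, the snippet below
serializes a settings dictionary to canonical JSON and hashes it with SHA-256.
The function name, the key names, and the canonicalization choices (sorted
keys, compact separators, UTF-8 encoding) are assumptions made for
illustration; the serialization actually used by the library may differ.
Storage dtype is deliberately left out of the dictionary, matching the rule
above.

.. code-block:: python

   import hashlib
   import json


   def compat_signature(settings):
       """Return (canonical_json, sha256_hex) for a descriptor settings dict."""
       # Sorted keys and fixed separators make the JSON text independent of
       # dict insertion order, so equal settings always hash identically.
       canonical = json.dumps(settings, sort_keys=True, separators=(",", ":"))
       digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
       return canonical, digest


   # Hypothetical settings; every key name and value here is illustrative.
   settings = {
       "descriptor_class": "ChebyshevDescriptor",
       "species": ["Ti", "O"],  # list order is preserved, so species order matters
       "rad_order": 10,
       "rad_cutoff": 4.0,
       "ang_order": 3,
       "ang_cutoff": 1.5,
       "min_cutoff": 0.55,
       "multi": True,
   }
   canonical_json, sha256_hex = compat_signature(settings)

Under these assumptions, reordering the dictionary keys leaves the hash
unchanged, while swapping the species order produces a different signature.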

Schema Version 2 Layout
-----------------------

The root group is ``/torch_cache``.

Root attributes:

- ``schema_version``: integer schema version, currently ``2``
- ``cache_format``: format identifier string,
  ``"aenet.torch_training.cache.v2"``
- ``descriptor_compat_json``: canonical JSON serialization of the
  compatibility-relevant descriptor settings
- ``descriptor_compat_sha256``: SHA-256 hash of that JSON payload
- ``storage_dtype``: floating-point dtype used for stored arrays
- ``contains_features``: whether the ``/torch_cache/features`` section exists
- ``contains_force_derivatives``: whether the
  ``/torch_cache/force_derivatives`` section exists
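
The following sketch shows one way a reader might sanity-check these root
attributes before trusting a cache. It operates on a plain dictionary standing
in for attributes already read from the file, so it does not depend on any
particular HDF5 library. The helper name ``check_root_attrs`` is hypothetical,
and the assumption that the hash covers the UTF-8 bytes of the stored JSON
string is not confirmed by this page.

.. code-block:: python

   import hashlib

   EXPECTED_FORMAT = "aenet.torch_training.cache.v2"


   def check_root_attrs(attrs):
       """Return a list of problems found in the /torch_cache root attributes."""
       problems = []
       if int(attrs.get("schema_version", -1)) != 2:
           problems.append("unexpected schema_version")
       if attrs.get("cache_format") != EXPECTED_FORMAT:
           problems.append("unexpected cache_format")
       # Assumption: the stored hash is SHA-256 over the UTF-8 JSON text.
       payload = attrs.get("descriptor_compat_json", "")
       digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()
       if digest != attrs.get("descriptor_compat_sha256"):
           problems.append("descriptor_compat_sha256 does not match JSON")
       return problems


   # Illustrative attribute values, not taken from a real file.
   payload = '{"multi":true}'
   attrs = {
       "schema_version": 2,
       "cache_format": EXPECTED_FORMAT,
       "descriptor_compat_json": payload,
       "descriptor_compat_sha256": hashlib.sha256(payload.encode("utf-8")).hexdigest(),
       "storage_dtype": "float64",
       "contains_features": True,
       "contains_force_derivatives": False,
   }
   problems = check_root_attrs(attrs)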

Feature Section
---------------

Feature payloads live under ``/torch_cache/features``.

Nodes:

- ``/torch_cache/features/index``
- ``/torch_cache/features/values``

Index columns:

- ``entry_idx``: dataset entry index in ``/entries/structures``
- ``cache_row``: row number used by ``values``
- ``n_atoms``: atom count for the structure
- ``n_features``: raw feature width ``F``

Payload semantics:

- one flattened raw ``(N, F)`` tensor per cached entry in ``values``
- features are stored pre-normalization
- load-time helpers reshape back to ``(N, F)`` and cast to the active
  descriptor dtype
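
To illustrate the reshape step, the sketch below rebuilds an ``(N, F)`` nested
list from one flattened row using the ``n_atoms`` and ``n_features`` index
columns. It assumes row-major flattening (all features of atom 0, then atom 1,
and so on), which this page does not state explicitly, and it uses plain
Python lists rather than tensors. The helper name ``unflatten_features`` is
hypothetical and is not the library's load helper.

.. code-block:: python

   def unflatten_features(flat_values, n_atoms, n_features):
       """Reshape one flattened cache row into n_atoms rows of n_features."""
       if len(flat_values) != n_atoms * n_features:
           raise ValueError(
               f"expected {n_atoms * n_features} values, got {len(flat_values)}"
           )
       # Assumption: values are stored row-major, one atom after another.
       # float() stands in for the cast to the active descriptor dtype.
       return [
           [float(v) for v in flat_values[i * n_features:(i + 1) * n_features]]
           for i in range(n_atoms)
       ]


   # Illustrative index row and values table with a single cached entry.
   index_row = {"entry_idx": 7, "cache_row": 0, "n_atoms": 2, "n_features": 3}
   values = [[0.1, 0.2, 0.3, 0.4, 0.5, 0.6]]
   features = unflatten_features(
       values[index_row["cache_row"]],
       index_row["n_atoms"],
       index_row["n_features"],
   )

The explicit length check makes a mismatch between the index row and the
stored payload fail loudly instead of silently producing a misaligned tensor.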

Force-Derivative Section
------------------------

Derivative payloads live under ``/torch_cache/force_derivatives``.

Section attributes:

- ``schema_version``: derivative payload schema version, currently ``1``
- ``payload_format``: format identifier string,
  ``"aenet.torch_training.local_derivatives.v1"``
- ``descriptor_compat_json``
- ``descriptor_compat_sha256``
- ``storage_dtype``
- ``n_radial_features``
- ``n_angular_features``
- ``multi``
- ``contains_features``: currently ``False`` within the derivative subsection
- ``contains_positions``: currently ``False``

Index table:

- ``/torch_cache/force_derivatives/index``
- one row per cached force-labeled structure
- columns:

  - ``entry_idx``
  - ``cache_row``
  - ``n_atoms``
  - ``n_radial_edges``
  - ``n_angular_triplets``

Radial payload nodes:

- ``/torch_cache/force_derivatives/radial/center_idx``
- ``/torch_cache/force_derivatives/radial/neighbor_idx``
- ``/torch_cache/force_derivatives/radial/dG_drij``
- ``/torch_cache/force_derivatives/radial/neighbor_typespin``

Angular payload nodes:

- ``/torch_cache/force_derivatives/angular/center_idx``
- ``/torch_cache/force_derivatives/angular/neighbor_j_idx``
- ``/torch_cache/force_derivatives/angular/neighbor_k_idx``
- ``/torch_cache/force_derivatives/angular/grads_i``
- ``/torch_cache/force_derivatives/angular/grads_j``
- ``/torch_cache/force_derivatives/angular/grads_k``
- ``/torch_cache/force_derivatives/angular/triplet_typespin``

The logical tensor shapes are unchanged from the original derivative cache
design. The v2 schema only relocates the derivative section under the shared
cache root.

Loading Semantics
-----------------

The persistence layer exposes the cache through explicit dataset helpers:

- ``has_persisted_features()``
- ``get_persisted_feature_cache_info()``
- ``load_persisted_features(idx)``
- ``has_persisted_force_derivatives()``
- ``get_force_derivative_cache_info()``
- ``load_persisted_force_derivatives(idx)``

Runtime sample materialization reads the persisted cache lazily, and only when
the payload is present and compatible with the active descriptor:

- energy-view materialization checks the trainer-owned runtime
  ``cache_features=True`` cache first, then falls back to persisted HDF5
  features, then finally recomputes features on demand
- force-view materialization reuses persisted raw features when available
- when both persisted raw features and persisted local derivatives are
  available for a force-supervised entry, ``HDF5StructureDataset`` can serve
  the force sample without rebuilding graph/triplet payloads

This keeps feature normalization as a runtime training concern and preserves
on-the-fly fallback behavior when a persisted section is absent.
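
The lookup order for energy-view features described above can be sketched as
a three-tier resolver. All names here (``materialize_features``, the
``load_persisted`` and ``compute`` callables, and the returned source labels)
are hypothetical and exist only to show the order of the fallbacks. The sketch
also assumes that a missing persisted entry is signalled by ``None``.

.. code-block:: python

   def materialize_features(idx, runtime_cache, load_persisted, compute):
       """Resolve features for entry idx and report where they came from."""
       # 1. Runtime cache (the cache_features=True path) takes priority.
       if idx in runtime_cache:
           return runtime_cache[idx], "runtime"
       # 2. Persisted HDF5 features, if present for this entry.
       persisted = load_persisted(idx)
       if persisted is not None:
           return persisted, "persisted"
       # 3. Recompute on demand as the final fallback.
       return compute(idx), "computed"


   # Stand-ins for the three feature sources; values are illustrative.
   runtime_cache = {0: [[1.0, 2.0]]}
   persisted_store = {1: [[3.0, 4.0]]}

   sources = [
       materialize_features(
           idx,
           runtime_cache,
           persisted_store.get,           # returns None when idx is absent
           lambda i: [[float(i), 0.0]],   # placeholder for featurization
       )[1]
       for idx in (0, 1, 2)
   ]

Every tier returns raw features, so normalization can be applied afterwards
in one place regardless of which source served the entry.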

Legacy Version 1 Compatibility
------------------------------

Legacy derivative-only files with a root ``/force_derivatives`` group remain
supported for read access.

Version 1 characteristics:

- derivative-only layout
- ``schema_version = 1``
- no unified ``/torch_cache`` root
- no persisted raw feature section

New builds do not write schema version 1. They standardize on schema version
2 whenever persisted cache payloads are requested.
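
A reader that supports both layouts needs to decide which one a file uses. The
sketch below does this from the names of the file's root-level groups, passed
in as plain strings so that it stays independent of the HDF5 library. The
helper name ``detect_cache_layout`` is hypothetical, and preferring version 2
when both groups are present is an assumption, not documented behavior.

.. code-block:: python

   def detect_cache_layout(root_groups):
       """Return 2, 1, or None depending on which cache root is present."""
       names = {name.strip("/") for name in root_groups}
       # Assumption: the unified root wins if both layouts are present.
       if "torch_cache" in names:
           return 2
       if "force_derivatives" in names:
           return 1
       return None


   # Illustrative root-level group listings.
   v2_layout = detect_cache_layout(["/entries", "/torch_cache"])
   v1_layout = detect_cache_layout(["/entries", "/force_derivatives"])
   no_cache = detect_cache_layout(["/entries"])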

Related Descriptor Manifest
---------------------------

When ``persist_descriptor=True`` is requested explicitly, or implicitly via
``persist_features=True`` or ``persist_force_derivatives=True``, the HDF5 file
also stores a versioned descriptor manifest under ``/descriptor_manifest``.

That manifest remains distinct from the cache payload schema and exists only
to reconstruct supported descriptor objects safely when a dataset is reopened.