PyTorch-based Training
======================

This page covers training machine learning interatomic potentials (MLIPs)
using the PyTorch-based implementation in ``aenet-python``. The PyTorch
implementation provides a pure-Python workflow with GPU acceleration and
automatic differentiation for forces.

.. note::

   Training as described here makes use of PyTorch.  Make sure to
   install core torch support as described in :doc:`installation`.  Most
   descriptor-based training workflows also require the matching
   ``torch-scatter`` and ``torch-cluster`` wheels.

.. note::

   **Alternative**: For training using ænet's Fortran-based tools,
   see :doc:`training`.

Overview
--------

The PyTorch training workflow consists of three main steps:

1. **Prepare structures**: Load atomic structures with energies (and optionally forces)
2. **Configure training**: Set up the model architecture and training parameters
3. **Train the model**: Run the training loop and save the trained potential

This tutorial demonstrates both **energy-only** training and training on
**energies and forces**.


Example notebooks
-----------------

Jupyter notebooks with examples can be found in the `notebooks
<https://github.com/atomisticnet/aenet-python/tree/master/notebooks>`_
directory within the repository.

For the maintained PyTorch training walkthrough, including the file-backed TiO2
workflow, explicit ``CachedStructureDataset`` usage, fixed train/test splits,
dataset-backed prediction, and committee training, see
`example-05-torch-training.ipynb
<https://github.com/atomisticnet/aenet-python/blob/master/notebooks/example-05-torch-training.ipynb>`_.

If you need to construct ``atomic_energies`` programmatically before
training or before building a large HDF5 dataset, see
:class:`aenet.reference_energies.ReferenceEnergies`. Its regression helper
accepts lazy ``(composition, energy)`` samples directly, and its
reference-compound helper selects the lowest-energy sample for each requested
composition before solving the constrained system. The module also provides a
file-path iterator backed by ``aenet.io.structure`` for streaming-friendly
preprocessing.


Energy-Only Training
--------------------

Here's a compact CPU-only example that keeps the full setup in memory. The
notebook linked above remains the maintained home for the file-backed TiO2
workflow, checkpoint rotation, explicit ``CachedStructureDataset`` usage,
fixed train/test splits, dataset-backed prediction, and plotting.

.. code-block:: python

   import numpy as np
   import torch

   from aenet.torch_featurize import ChebyshevDescriptor
   from aenet.torch_training import (
       Adam,
       Structure,
       TorchANNPotential,
       TorchTrainingConfig,
   )

   structures = [
       Structure(
           positions=np.array(
               [
                   [0.0, 0.0, 0.0],
                   [0.9, 0.0, 0.0],
                   [0.0, 0.9, 0.0],
               ]
           ),
           species=["H", "H", "H"],
           energy=0.0,
       ),
       Structure(
           positions=np.array(
               [
                   [0.1, 0.0, 0.0],
                   [1.0, 0.0, 0.0],
                   [0.0, 1.0, 0.0],
               ]
           ),
           species=["H", "H", "H"],
           energy=0.5,
       ),
   ]

   descriptor = ChebyshevDescriptor(
       species=["H"],
       rad_order=1,
       rad_cutoff=2.0,
       ang_order=0,
       ang_cutoff=2.0,
       min_cutoff=0.1,
       device="cpu",
       dtype=torch.float64,
   )
   arch = {"H": [(4, "tanh")]}

   mlp = TorchANNPotential(arch, descriptor=descriptor)

   config = TorchTrainingConfig(
       iterations=1,
       method=Adam(mu=0.001, batchsize=1),
       testpercent=50,
       force_weight=0.0,
       atomic_energies={"H": 0.0},
       normalize_features=False,
       normalize_energy=False,
       memory_mode="cpu",
       device="cpu",
       checkpoint_dir=None,
       checkpoint_interval=0,
       max_checkpoints=None,
       save_best=False,
       use_scheduler=False,
   )

   results = mlp.train(structures=structures, config=config)
   print(results.errors[["RMSE_train", "RMSE_test"]].tail(1))

This trains a neural network potential using energies only, with 50% of the
structures held out for validation. The
:meth:`~aenet.torch_training.TorchANNPotential.train` method returns a
:class:`~aenet.io.train.TrainOut` object containing training history,
statistics, and plotting helpers.

.. note::

   Setting ``testpercent > 0`` does more than hold out structures. It also
   enables any validation-driven controls in your configuration, such as
   ``use_scheduler=True`` and ``save_best=True``. On very small validation
   splits, these controls can react to noisy metrics and change the training
   behavior qualitatively.


Reproducibility Controls
------------------------

The PyTorch trainer separates run-level stochastic behavior from split
selection:

.. doctest::

   >>> from aenet.torch_training import TorchTrainingConfig
   >>> config = TorchTrainingConfig(seed=11, split_seed=7)
   >>> (config.seed, config.split_seed)
   (11, 7)

Use ``split_seed`` when you want the trainer-owned train/validation partition
to stay fixed across runs. Use ``seed`` when you want model initialization,
training-shuffle order, weighted sampling, and random force-subset selection
to be reproducible. Committee workflows typically keep ``split_seed`` shared
across members while varying ``seed`` per member.
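
The separation between the two seeds can be pictured with a minimal
pure-Python sketch. The helper names ``make_split`` and ``epoch_order`` are
hypothetical and are not part of ``aenet-python``; the sketch only illustrates
how one seed can fix the partition while another controls run-level
randomness, and it does not reproduce the trainer's actual random streams.

.. code-block:: python

   import random


   def make_split(n_structures, test_fraction, split_seed):
       """Partition indices into (train, validation) using only split_seed."""
       rng = random.Random(split_seed)
       indices = list(range(n_structures))
       rng.shuffle(indices)
       n_test = int(round(n_structures * test_fraction))
       return sorted(indices[n_test:]), sorted(indices[:n_test])


   def epoch_order(train_indices, seed, epoch):
       """Shuffle the training indices for one epoch using only the run seed."""
       rng = random.Random(seed * 100003 + epoch)
       order = list(train_indices)
       rng.shuffle(order)
       return order


   # Two committee members share the split seed but use different run seeds.
   train_a, val_a = make_split(10, 0.2, split_seed=7)
   train_b, val_b = make_split(10, 0.2, split_seed=7)
   order_member_0 = epoch_order(train_a, seed=11, epoch=0)
   order_member_1 = epoch_order(train_b, seed=12, epoch=0)

Both members see the same train/validation partition, while the order in
which training structures are visited depends only on the per-member seed.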


Committee Training
------------------

Committee support adds a trainer-side orchestration layer on top of
the single-member ``TorchANNPotential`` workflow:

.. code-block:: python

   from pathlib import Path

   from aenet.torch_training import (
       Adam,
       TorchCommitteeConfig,
       TorchCommitteePotential,
       TorchTrainingConfig,
   )

   # Reuses `arch`, `descriptor`, and `structures` from the energy-only
   # example above.
   committee = TorchCommitteePotential(arch, descriptor=descriptor)
   train_config = TorchTrainingConfig(
       iterations=1,
       method=Adam(mu=0.001, batchsize=1),
       testpercent=50,
       split_seed=7,
       atomic_energies={"H": 0.0},
       normalize_features=False,
       normalize_energy=False,
       memory_mode="cpu",
       device="cpu",
       checkpoint_dir=None,
       checkpoint_interval=0,
       max_checkpoints=None,
       save_best=False,
       use_scheduler=False,
   )
   committee_config = TorchCommitteeConfig(
       num_members=2,
       base_seed=11,
       max_parallel=1,
       output_dir=Path("committee_run"),
   )

   result = committee.train(
       structures=structures,
       config=train_config,
       committee_config=committee_config,
   )
   print(result.metadata_path)
   print([member.seed for member in result.members])
   print(result)

   member_results = result.trainouts
   member_0_errors = member_results[0].errors
   committee_table = result.to_dataframe()
   committee_stats = result.stats

Committee runs materialize a stable output layout:

.. code-block:: text

   committee_run/
     committee_metadata.json
     member_000/
       model.pt
       history.json
       history.csv
       summary.json
     member_001/
       model.pt
       history.json
       history.csv
       summary.json

The committee layer computes any trainer-owned train/validation split once in
the parent process and reuses that split across all members. In the first
committee implementation, the main reproducibility pattern is a shared
``split_seed`` with distinct per-member ``seed`` values derived from
``base_seed`` or from ``member_seeds``.

``TorchCommitteeTrainResult`` mirrors the single-network ``TrainOut`` summary
style where possible. ``print(result)`` reports the mean and standard
deviation of each available final metric across completed committee members.
``result.stats`` exposes the same aggregate values programmatically,
``result.to_dataframe()`` returns one row per member, and ``member.trainout``
or ``result.trainouts`` rebuilds the familiar per-member ``TrainOut`` objects
from the persisted ``history.json`` files.
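
The aggregation reported by ``print(result)`` can be sketched in plain Python
as follows. The metric values are made up for illustration, the helper
``aggregate`` is hypothetical, and the use of the sample standard deviation is
an assumption; the library may use a different convention.

.. code-block:: python

   import statistics

   # Hypothetical final metrics for three completed committee members.
   final_metrics = [
       {"RMSE_train": 0.010, "RMSE_test": 0.020},
       {"RMSE_train": 0.012, "RMSE_test": 0.026},
       {"RMSE_train": 0.014, "RMSE_test": 0.023},
   ]


   def aggregate(members):
       """Return {metric: (mean, std)} over members that report the metric."""
       names = sorted({name for member in members for name in member})
       summary = {}
       for name in names:
           values = [member[name] for member in members if name in member]
           # Assumption: sample standard deviation; 0.0 for a single member.
           std = statistics.stdev(values) if len(values) > 1 else 0.0
           summary[name] = (statistics.mean(values), std)
       return summary


   committee_summary = aggregate(final_metrics)
   for name, (mean, std) in committee_summary.items():
       print(f"{name}: {mean:.4f} +/- {std:.4f}")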

Committee Inference and ASCII Export
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Committee support also includes committee-level loading, aggregated
prediction, and a committee-wide ASCII export helper:

.. code-block:: python

   # `result` is the committee training result from the previous example.
   reloaded = TorchCommitteePotential.from_directory(result.output_dir)
   predictions = reloaded.predict(structures, eval_forces=False)
   first_result = predictions[0]

   print(first_result.energy_mean, first_result.energy_std)
   print(first_result.member_energies)
   print(predictions.member_outputs[0].total_energy)

   uncertainty_table = predictions.to_dataframe()
   most_uncertain = predictions.top_uncertain(n=10)

   # `test_dataset` is a dataset object prepared elsewhere; it is not
   # defined on this page.
   dataset_predictions = reloaded.predict_dataset(test_dataset)
   dataset_uncertainty_table = dataset_predictions.to_dataframe()

   members = reloaded.to_aenet_ascii(
       Path("ascii_committee"),
       prefix="committee",
       structures=structures,
   )
   print(members[0])

``predict()`` and the dataset-backed ``predict_dataset()`` return a list-like
``TorchCommitteePredictResult``. Iterating over it or indexing it returns one
:class:`aenet.mlip.ensemble.AenetEnsembleResult` per input structure, so
existing list-style code remains valid. The result also keeps the per-member
:class:`aenet.io.predict.PredictOut` objects in ``member_outputs`` and
provides ``to_dataframe()``, ``sort_by()``, and ``top_uncertain()`` helpers
for uncertainty-driven structure selection. Dataset-backed prediction tracks
both split-local ``index`` and root-dataset ``source_index`` where possible.
When ``eval_forces=False``, it follows the cached-feature
``TorchANNPotential.predict_dataset()`` path. When ``eval_forces=True``, each
member falls back to materialized structures, so the dataset must expose raw
structures through ``get_structure()``, ``structures``, or a supported
``Subset`` wrapper.
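
The idea behind the uncertainty helpers can be illustrated without the
library. In this sketch the energies are invented, the functions
``summarize`` and ``rank_by_uncertainty`` are hypothetical stand-ins rather
than the ``TorchCommitteePredictResult`` API, and the sample standard
deviation is an assumed convention.

.. code-block:: python

   import statistics

   # Hypothetical total energies predicted by three committee members.
   member_energies = {
       "structure_0": [-10.00, -10.02, -9.98],
       "structure_1": [-5.00, -5.30, -4.70],
       "structure_2": [-7.50, -7.50, -7.50],
   }


   def summarize(energies):
       """Build one row per structure with the committee mean and spread."""
       rows = []
       for label, values in energies.items():
           rows.append(
               {
                   "label": label,
                   "energy_mean": statistics.mean(values),
                   "energy_std": statistics.stdev(values),
               }
           )
       return rows


   def rank_by_uncertainty(rows, n):
       """Return the n rows with the largest committee disagreement."""
       ordered = sorted(rows, key=lambda row: row["energy_std"], reverse=True)
       return ordered[:n]


   ranked = rank_by_uncertainty(summarize(member_energies), n=2)
   print([row["label"] for row in ranked])

Structures on which the members disagree most are natural candidates for
additional reference calculations.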

The maintained notebook ``notebooks/example-05-torch-training.ipynb`` now
includes a TiO2 committee-training example that trains a small committee,
reloads it from ``committee_metadata.json``, inspects aggregated uncertainty,
and exports the member manifest for later Fortran-backed ensemble inference.

``to_aenet_ascii()`` exports each committee member into a stable layout and
returns the member manifest expected by ``AenetEnsembleInterface`` and
``AenetEnsembleCalculator``. Pass ``structures=...`` or explicit
``descriptor_stats=...`` when exact descriptor statistics must be written
into the ASCII files. For the example above, the exported layout is:

.. code-block:: text

   ascii_committee/
     member_000/
       committee.H.nn.ascii
     member_001/
       committee.H.nn.ascii


Structure Sampling Policies
---------------------------

The PyTorch trainer distinguishes three separate concepts that all affect
training behavior:

* ``use_scheduler`` controls the learning-rate scheduler
* ``force_sampling`` controls which force-labeled structures contribute force
  loss in a given epoch window
* ``sampling_policy`` controls how structures in the training split are drawn
  into training batches

The default structure-sampling policy is uniform shuffled batching:

.. doctest::

   >>> from aenet.torch_training import TorchTrainingConfig
   >>> config = TorchTrainingConfig(sampling_policy="uniform")
   >>> config.sampling_policy
   'uniform'

Epoch semantics are different for uniform and non-uniform policies:

* ``sampling_policy="uniform"`` uses shuffled batching without replacement,
  so each training structure appears exactly once per epoch.
* Non-uniform policies use weighted sampling with replacement and draw
  ``len(train_split)`` structures per epoch. Some structures may appear
  multiple times in one epoch and some may not appear at all.
* ``iterations`` still means training epochs. Under non-uniform sampling,
  one epoch is not guaranteed to be a full pass over distinct training
  structures.
* Validation sampling remains uniform and deterministic.
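
The difference in epoch semantics can be seen in a short standard-library
sketch. It only mimics the two sampling styles with ``random.sample`` and
``random.choices``; the weights are arbitrary and the trainer's own sampler
implementation is not shown here.

.. code-block:: python

   import random

   train_split = ["s0", "s1", "s2", "s3", "s4"]
   rng = random.Random(0)

   # Uniform policy: shuffle without replacement, so every structure
   # appears exactly once in the epoch.
   uniform_epoch = rng.sample(train_split, k=len(train_split))

   # Non-uniform policy: draw len(train_split) structures with replacement.
   # Repeats and omissions are both possible within one epoch.
   weights = [5.0, 1.0, 1.0, 1.0, 1.0]
   weighted_epoch = rng.choices(
       train_split, weights=weights, k=len(train_split)
   )

   print(uniform_epoch)
   print(weighted_epoch)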

The static non-uniform option ``sampling_policy="energy_weighted"`` biases
sampling toward lower cohesive or referenced formation energy per atom:

.. doctest::

   >>> config = TorchTrainingConfig(
   ...     sampling_policy="energy_weighted",
   ...     atomic_energies={"H": 0.0},
   ... )
   >>> config.sampling_policy
   'energy_weighted'

The weighting always uses the same atomic-reference convention as the
training targets. When the trainer builds datasets from raw
``structures=...`` input, that convention comes from
``TorchTrainingConfig.atomic_energies``. When you pass a prebuilt dataset,
the dataset owns ``atomic_energies`` and the trainer uses those instead.
If no atomic references are provided in either path, training still proceeds
with all-zero atomic references; in that case, the energy-weighted policy
uses the provided per-atom labels as-is and emits a warning so the fallback
is explicit.

The exact per-draw sampling probability is determined as follows for a
training split with ``N`` structures. For structure ``i``:

.. math::

   e_i = \frac{E_i - \sum_{a \in i} E^{\mathrm{atom}}_a}{n_i}

where :math:`E_i` is the stored total energy, :math:`E^{\mathrm{atom}}_a`
comes from the resolved atomic-reference convention, and :math:`n_i` is the
atom count. Then:

.. math::

   \Delta_i = e_i - \min_j e_j

.. math::

   \Delta_{\max} = \max_j \Delta_j

If :math:`\Delta_{\max} \le 0`, all structures receive equal weight:

.. math::

   w_i = 1

Otherwise:

.. math::

   w_i = \frac{1}{1 + \Delta_i / \Delta_{\max}}

The trainer draws with replacement using ``num_samples = N`` per epoch, so
the probability that a single draw selects structure ``i`` is:

.. math::

   p_i = \frac{w_i}{\sum_j w_j}

This is the full implementation. Lower referenced per-atom energy means
larger ``w_i`` and therefore larger sampling probability ``p_i``.
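
As a worked illustration, the formulas above translate directly into a few
lines of Python. The function name and its argument layout (compositions as
``{species: count}`` dictionaries) are choices made for this sketch and are
not the trainer's internal interface.

.. code-block:: python

   def energy_weighted_probabilities(total_energies, compositions,
                                     atomic_energies):
       """Per-draw probabilities p_i for the energy-weighted policy."""
       per_atom = []
       for energy, composition in zip(total_energies, compositions):
           n_atoms = sum(composition.values())
           reference = sum(
               atomic_energies.get(species, 0.0) * count
               for species, count in composition.items()
           )
           per_atom.append((energy - reference) / n_atoms)  # e_i

       e_min = min(per_atom)
       deltas = [e - e_min for e in per_atom]  # Delta_i
       delta_max = max(deltas)
       if delta_max <= 0:
           weights = [1.0] * len(deltas)
       else:
           weights = [1.0 / (1.0 + d / delta_max) for d in deltas]

       total = sum(weights)
       return [w / total for w in weights]


   probabilities = energy_weighted_probabilities(
       total_energies=[-6.0, -4.5, -3.0],
       compositions=[{"H": 3}, {"H": 3}, {"H": 3}],
       atomic_energies={"H": 0.0},
   )
   print(probabilities)

Because :math:`0 \le \Delta_i / \Delta_{\max} \le 1`, the weights always lie
between 0.5 and 1, so the highest-energy structure is drawn half as often as
the lowest-energy one. In this example the probabilities are 6/13, 4/13, and
3/13.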

The adaptive non-uniform option ``sampling_policy="error_weighted"`` starts
with uniform epoch-0 sampling and then increases the sampling frequency of
structures with higher recently observed training loss:

.. doctest::

   >>> config = TorchTrainingConfig(
   ...     sampling_policy="error_weighted",
   ... )
   >>> config.sampling_policy
   'error_weighted'

Its behavior is:

* Epoch 0 uses uniform weights because no per-structure error history exists
  yet.
* After each training epoch, the trainer computes a structure-level score
  from the sampled training structures and normalizes the next epoch's
  weights so the mean weight is 1.
* Those structure-level scores are measured in the same training target space
  used for the energy loss. If training uses referenced cohesive or formation
  energies, adaptive sampling uses the same references; if training uses raw
  total energies, adaptive sampling follows that convention instead.
* Force losses do not contribute to these adaptive structure scores, even
  during force training. ``error_weighted`` always uses energy error only.
* If a structure is sampled multiple times in an epoch, its next score uses
  the mean of those sampled occurrences.
* If a structure is not sampled in an epoch, it keeps its previous score.
* Resume currently does not persist adaptive sampler state; resumed
  ``error_weighted`` training therefore bootstraps from uniform sampling
  again.

The structure-level score used by ``error_weighted`` is:

* all training modes: absolute energy error per atom for that structure
* force-training settings such as ``force_weight``, ``force_fraction``, and
  ``force_sampling`` do not change the adaptive structure score definition

The exact adaptive-sampling update is:

1. Epoch 0 starts with uniform structure scores:

   .. math::

      s_i^{(0)} = 1

2. After epoch :math:`t`, each sampled structure gets an observed score
   :math:`\hat{s}_i^{(t)}` equal to the mean absolute energy error per atom
   of that structure's sampled occurrences during the epoch. If a structure
   is not sampled in epoch :math:`t`, it keeps its previous score:

   .. math::

      s_i^{(t+1)} =
      \begin{cases}
      \hat{s}_i^{(t)} & \text{if structure } i \text{ was sampled in epoch } t \\
      s_i^{(t)} & \text{otherwise}
      \end{cases}

3. The trainer converts scores into positive sampler weights by first
   replacing non-finite values with ``0`` and clamping negative values to
   ``0``:

   .. math::

      u_i = \max(0, s_i)

4. If all ``u_i`` are zero, the sampler falls back to uniform weights:

   .. math::

      w_i = 1

5. Otherwise, the trainer clamps each nonzero weight to at least
   :math:`10^{-12}` and normalizes weights to unit mean:

   .. math::

      \tilde{u}_i = \max(u_i, 10^{-12})

   .. math::

      w_i = \frac{\tilde{u}_i}{\frac{1}{N}\sum_j \tilde{u}_j}

6. As with ``energy_weighted``, sampling is with replacement and
   ``num_samples = N`` per epoch, so each individual draw uses:

   .. math::

      p_i = \frac{w_i}{\sum_j w_j}

Because dividing by the mean does not change normalized probabilities,
``error_weighted`` is equivalent to drawing with probability proportional to
the latest clamped per-structure score. The unit-mean normalization only
keeps the raw weight magnitudes numerically well scaled.
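
The update steps listed above can be sketched as two small functions. The
names ``update_scores`` and ``scores_to_weights`` and the example numbers are
illustrative only. The sketch applies the :math:`10^{-12}` floor to every
entry, as written in the formula for :math:`\tilde{u}_i`.

.. code-block:: python

   import math


   def update_scores(previous_scores, sampled_errors):
       """Step 2: sampled structures take the mean of their observed errors.

       ``sampled_errors`` maps a structure index to the list of absolute
       energy errors per atom observed for it during the epoch.
       """
       updated = list(previous_scores)
       for index, errors in sampled_errors.items():
           if errors:
               updated[index] = sum(errors) / len(errors)
       return updated


   def scores_to_weights(scores):
       """Steps 3 to 5: clamp, fall back to uniform, normalize to unit mean."""
       clamped = [max(0.0, s) if math.isfinite(s) else 0.0 for s in scores]
       if all(u == 0.0 for u in clamped):
           return [1.0] * len(clamped)
       floored = [max(u, 1e-12) for u in clamped]
       mean = sum(floored) / len(floored)
       return [u / mean for u in floored]


   # Scores carried over from an earlier epoch (made-up values).
   scores = [0.05, 0.02, 0.03, 0.04]
   # Structures 0 and 2 were sampled twice; 1 and 3 were not sampled.
   scores = update_scores(scores, {0: [0.02, 0.04], 2: [0.01, 0.03]})
   weights = scores_to_weights(scores)
   probabilities = [w / sum(weights) for w in weights]
   print(scores, probabilities)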

For both non-uniform policies, ``force_sampling`` remains a separate control.
It determines whether a sampled force-labeled structure contributes force
loss; it does not define how often that structure is drawn into batches.


Force Training
--------------

To include force supervision, add force arrays to the structures and set
``force_weight > 0.0``:

.. doctest::

   >>> from aenet.torch_training import Adam, TorchTrainingConfig

   >>> config = TorchTrainingConfig(
   ...     iterations=2,
   ...     method=Adam(mu=0.001, batchsize=1),
   ...     testpercent=50,
   ...     force_weight=0.1,
   ...     force_fraction=0.5,
   ...     force_sampling="fixed",
   ... )
   >>> config.force_weight
   0.1
   >>> config.force_fraction
   0.5
   >>> config.force_sampling
   'fixed'

The ``force_weight`` parameter (α) balances energy and force contributions:

.. math::

   \text{Loss} = (1 - \alpha) \cdot \text{RMSE}_{\text{energy}} + \alpha \cdot \text{RMSE}_{\text{forces}}

Common values:

* ``force_weight=0.0``: Energy-only (fastest training)
* ``force_weight=0.1``: Primarily energy, with force regularization
* ``force_weight=0.5``: Equal weighting
* ``force_weight=1.0``: Force-only (rarely used)

.. note::

   Force training requires structures with force data. Structures without
   forces will only contribute to the energy loss term.
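
A small numerical sketch of the weighted loss above, using plain Python. The
helper names are hypothetical, and details such as per-atom energy
normalization or how force components are flattened are assumptions of this
sketch, not statements about the trainer.

.. code-block:: python

   import math


   def rmse(predicted, reference):
       """Root-mean-square error over two equally long sequences."""
       squared = [(p - r) ** 2 for p, r in zip(predicted, reference)]
       return math.sqrt(sum(squared) / len(squared))


   def combined_loss(rmse_energy, rmse_forces, force_weight):
       """(1 - alpha) * RMSE_energy + alpha * RMSE_forces."""
       return (1.0 - force_weight) * rmse_energy + force_weight * rmse_forces


   # Made-up predictions and reference values.
   rmse_energy = rmse([1.0, 2.0], [1.0, 2.2])
   rmse_forces = rmse([0.0, 0.5, -0.5], [0.0, 0.3, -0.3])
   loss = combined_loss(rmse_energy, rmse_forces, force_weight=0.1)
   print(rmse_energy, rmse_forces, loss)

With ``force_weight=0.0`` the expression reduces to the energy RMSE, and with
``force_weight=1.0`` it reduces to the force RMSE.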

The notebook linked above remains the maintained home for the longer
force-training workflow, including checkpoint output and plotting.


Dataset Options
---------------

The PyTorch training workflow supports flexible dataset options, from simple
structure lists to advanced HDF5-backed lazy-loading for large-scale
training.

For detailed information about dataset classes, input formats, and performance
optimization, see :doc:`torch_datasets`.

The longer file-backed dataset workflow is intentionally kept in the training
notebook above so the ``torch_datasets`` page can stay focused on compact
API-facing examples.

Execution Model
~~~~~~~~~~~~~~~~

The current trainer has two distinct runtime stages:

1. Sample preparation happens in the main process when ``num_workers=0``, or
   in ``DataLoader`` workers when ``num_workers > 0``. Structures are
   converted to tensors on ``descriptor.device``, and descriptor
   featurization, neighbor reuse, graph/triplet construction, and lazy HDF5
   cache reads happen there.
2. The collated batch is then moved onto ``config.device`` inside the
   training loop. Model forward passes, normalization, loss computation, and
   optimizer steps run on that device.

In practice, GPU training with ``num_workers > 0`` is best understood as
worker-side data preparation feeding a training loop on the selected device.
It is not currently a separate mixed CPU/GPU execution pipeline.

If ``descriptor.device`` and ``config.device`` match, featurization and model
compute happen on the same device. If they differ, samples are materialized on
``descriptor.device`` and transferred before the forward pass. The compact
examples on this page create the descriptor on CPU, so later
``device='cuda'`` examples describe CPU-side sample preparation feeding GPU
training unless you also move the descriptor to CUDA.

For HDF5-backed datasets, each worker reopens its own read-only file handle
and keeps its own bounded ``in_memory_cache_size`` LRU cache. Trainer-owned
runtime caches (``cache_features``, ``cache_neighbors``,
``cache_force_triplets``) are also per process/worker, so
``cache_warmup=True`` is skipped automatically when ``num_workers > 0``. See
:doc:`torch_datasets` for persisted HDF5 cache precedence and for the
distinction between build-time ``build_workers`` and training-time
``num_workers``.

``memory_mode='mixed'`` is reserved for a future real mixed-memory mode and
currently raises ``NotImplementedError`` if requested. Today, the supported
execution modes remain ``'cpu'`` and ``'gpu'``.

Performance Optimization Tips
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

**For Energy-Only Training**

.. doctest::

   >>> from aenet.torch_training import TorchTrainingConfig
   >>> config = TorchTrainingConfig(
   ...     force_weight=0.0,
   ...     cache_features=True,
   ...     num_workers=4,
   ...     prefetch_factor=4,
   ...     persistent_workers=True,
   ... )
   >>> (config.cache_features, config.num_workers, config.prefetch_factor)
   (True, 4, 4)

**For Force Training**

.. doctest::

   >>> config = TorchTrainingConfig(
   ...     force_weight=0.1,
   ...     force_fraction=0.3,
   ...     force_sampling="random",
   ...     cache_features=True,
   ...     cache_neighbors=True,
   ...     num_workers=4,
   ...     prefetch_factor=4,
   ... )
   >>> (config.cache_neighbors, config.cache_force_triplets)
   (True, False)

**Caching Strategies**

* **cache_features**: For energy-only structure-list workflows, this can
  precompute features eagerly. For force training, it caches energy-view
  features for structures not selected for force supervision in the current
  epoch.
* **cache_neighbors**: Reuse neighbor search results for energy-view reuse
  and legacy non-graph paths
* **cache_force_triplets**: Cache CSR graphs and triplets for the default sparse
  force-training path instead of rebuilding them on demand
* **cache_*_max_entries**: Bound the trainer-owned runtime caches per split
  and per process/worker instead of letting them grow without limit
* **cache_warmup**: Optional single-process prefill of trainer-owned runtime
  caches before epoch 0; skipped automatically when ``num_workers > 0``

These runtime caches are distinct from the on-disk HDF5 persisted cache
sections created with ``HDF5StructureDataset.build_database(...)``. For HDF5
datasets, ``cache_features=True`` is still only a per-run in-memory layer; it
does not replace ``persist_features=True`` or
``persist_force_derivatives=True``, which are the build-time options for
reusing raw features or sparse local derivatives across sessions. See
:doc:`torch_datasets` for the full cache-precedence workflow. HDF5 energy
filtering is also a build-time concern: set ``max_energy`` and
``atomic_energies`` on the ``HDF5StructureDataset`` before calling
``build_database()`` rather than relying on
``TorchTrainingConfig.max_energy`` at runtime.
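
To make the ``cache_*_max_entries`` semantics concrete, here is a minimal
sketch of a bounded in-memory cache. The class ``BoundedCache`` is
hypothetical, and the least-recently-used eviction order is an assumption;
this page does not state which eviction policy the trainer-owned caches use.

.. code-block:: python

   from collections import OrderedDict


   class BoundedCache:
       """Cache with ``None`` = unbounded, ``0`` = no storage, N = at most N."""

       def __init__(self, max_entries):
           self.max_entries = max_entries
           self._data = OrderedDict()

       def get_or_compute(self, key, compute):
           if key in self._data:
               self._data.move_to_end(key)  # mark as most recently used
               return self._data[key]
           value = compute(key)
           if self.max_entries == 0:
               return value  # storage suppressed
           self._data[key] = value
           if (self.max_entries is not None
                   and len(self._data) > self.max_entries):
               self._data.popitem(last=False)  # evict the oldest entry
           return value

       def __len__(self):
           return len(self._data)


   calls = []


   def featurize(index):
       """Stand-in for an expensive featurization call."""
       calls.append(index)
       return index * index


   cache = BoundedCache(max_entries=2)
   for index in [0, 1, 0, 2, 1]:
       cache.get_or_compute(index, featurize)
   print(calls, len(cache))

Each worker process would hold its own instance of such a cache, which is why
a warmup performed in the parent process would not benefit the workers.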

Common Pitfalls
~~~~~~~~~~~~~~~

1. **Descriptor mismatch**: Ensure descriptor species order matches your dataset.
   Datasets use ``descriptor.species_to_idx`` for species indexing.

Training Configuration
----------------------

The :class:`~aenet.torch_training.TorchTrainingConfig` class provides extensive
control over the training process. Here are the most commonly used parameters:

Basic Settings
~~~~~~~~~~~~~~

.. doctest::

   >>> from aenet.torch_training import TorchTrainingConfig
   >>> config = TorchTrainingConfig(
   ...     iterations=100,
   ...     testpercent=10,
   ...     device="cpu",
   ...     show_progress=True,
   ... )
   >>> (config.iterations, config.device, config.show_progress)
   (100, 'cpu', True)

Optimizer Selection
~~~~~~~~~~~~~~~~~~~

Choose and configure the optimization algorithm:

.. doctest::

   >>> from aenet.torch_training import Adam, SGD, TorchTrainingConfig

   >>> method = Adam(
   ...     mu=0.001,
   ...     batchsize=32,
   ...     beta1=0.9,
   ...     beta2=0.999,
   ...     weight_decay=0.0,
   ... )
   >>> (method.method_name, method.batchsize)
   ('adam', 32)

   >>> method = SGD(
   ...     lr=0.01,
   ...     batchsize=32,
   ...     momentum=0.9,
   ...     weight_decay=0.0,
   ... )
   >>> TorchTrainingConfig(iterations=100, method=method).method.method_name
   'sgd'

**Adam** is recommended for most applications due to its adaptive learning rates
and robust convergence properties.


Common Training Patterns
-------------------------

Small Dataset (< 100 structures)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: python

   from aenet.torch_training import Adam, TorchTrainingConfig

   config = TorchTrainingConfig(
       iterations=200,  # More epochs for small data
       method=Adam(mu=0.001, batchsize=16),  # Smaller batches
       testpercent=10,
       force_weight=0.1,
       device='cpu'  # CPU fine for small datasets
   )

Large Dataset (> 500 structures)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: python

   config = TorchTrainingConfig(
       iterations=50,   # Fewer epochs needed
       method=Adam(mu=0.001, batchsize=64),  # Larger batches
       testpercent=10,
       force_weight=0.1,
       device='cuda',  # Model/loss on GPU
       # Performance optimizations
       cache_features=True,  # Runtime in-memory feature cache
       cache_feature_max_entries=1024,
       num_workers=8,         # Parallel CPU-side sample preparation
       prefetch_factor=4
   )

Energy-Only with Maximum Speed
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: python

   config = TorchTrainingConfig(
       iterations=100,
       method=Adam(mu=0.001, batchsize=32),
       testpercent=10,
       force_weight=0.0,  # Energy-only
       cache_features=True,  # Bounded runtime feature cache for this run
       cache_warmup=True,    # Optional single-process prefill
       device='cuda'
   )

Force Training with Optimizations
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: python

   config = TorchTrainingConfig(
       iterations=100,
       method=Adam(mu=0.001, batchsize=32),
       testpercent=10,
       force_weight=0.1,
       force_fraction=0.3,  # Force supervision on 30% of force-labeled structures
       cache_neighbors=True,  # Cache worker-local neighbor lists
       num_workers=4,         # Parallel CPU-side sample preparation
       device='cuda'
   )


Advanced Configuration Reference
---------------------------------

This section documents all configuration parameters available in
:class:`~aenet.torch_training.TorchTrainingConfig`.

Checkpointing & Model Saving
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

**checkpoint_dir** : str or None (default: 'checkpoints')
   Directory to save checkpoint files. Set to ``None`` to disable checkpointing.

**checkpoint_interval** : int (default: 1)
   Save a checkpoint every N epochs. Set to 0 to disable periodic checkpoints.

**max_checkpoints** : int or None (default: None)
   Maximum number of checkpoint files to keep. Older checkpoints are
   deleted automatically. ``None`` keeps all checkpoints.

**save_best** : bool (default: True)
   Save the model with the best validation loss as ``best_model.pt``.
   Requires ``testpercent > 0`` to compute validation loss.

   For very small validation sets, the selected checkpoint can be unstable.
   In that case prefer ``save_best=False`` or supply a larger or explicit
   validation split.

**Resuming Training**

To resume training from a checkpoint, pass the checkpoint path to
``train(..., resume_from="checkpoints/checkpoint_epoch_0050.pt")``. The
notebook above contains the maintained checkpoint workflow.

When ``resume_from`` is provided, ``config.iterations`` means the number of
additional epochs to run in that ``train()`` call. For example, resuming a
checkpoint with ``iterations=10`` runs 10 more epochs after the saved
checkpoint epoch, regardless of how many epochs were completed in the
original run. This applies to numbered checkpoints and ``best_model.pt``
alike.

The trainer will automatically:

* Load model and optimizer state
* Restore training history and normalization statistics
* Continue from the next epoch

.. note::

   Checkpoint files are NOT interchangeable with model files created by ``save()``.
   Checkpoints include additional training state (optimizer, history) needed for
   resuming, while model files are optimized for deployment and inference.


Learning Rate Scheduling
~~~~~~~~~~~~~~~~~~~~~~~~~

**use_scheduler** : bool (default: False)
   Enable learning rate scheduler. Uses ReduceLROnPlateau, which reduces the
   learning rate when validation loss plateaus.

**scheduler_patience** : int (default: 10)
   Number of epochs with no improvement before reducing learning rate.

**scheduler_factor** : float (default: 0.5)
   Factor by which to reduce learning rate. New LR = current LR × factor.

**scheduler_min_lr** : float (default: 1e-6)
   Minimum allowed learning rate. Scheduler stops reducing below this value.

**Example Usage**

.. doctest::

   >>> from aenet.torch_training import Adam, TorchTrainingConfig
   >>> config = TorchTrainingConfig(
   ...     iterations=200,
   ...     method=Adam(mu=0.001, batchsize=32),
   ...     testpercent=10,
   ...     use_scheduler=True,
   ...     scheduler_patience=10,
   ...     scheduler_factor=0.5,
   ...     scheduler_min_lr=1e-6,
   ... )
   >>> (config.use_scheduler, config.scheduler_patience)
   (True, 10)

The scheduler lowers the learning rate when the validation loss stops
improving, which can help training converge after progress stalls.

.. note::

   The scheduler requires ``testpercent > 0`` to monitor validation loss.
   With only a few validation structures, the monitored loss can be too noisy
   for stable plateau detection. In that case prefer ``use_scheduler=False``
   or a larger or explicit validation split.
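
The plateau logic controlled by the three ``scheduler_*`` parameters can be
sketched in a few lines. This is a simplified stand-in written for
illustration: it omits the improvement threshold and cooldown options of
PyTorch's ``ReduceLROnPlateau``, and the class name ``PlateauScheduler`` is
hypothetical.

.. code-block:: python

   class PlateauScheduler:
       """Reduce the learning rate after ``patience`` epochs without gain."""

       def __init__(self, lr, patience=10, factor=0.5, min_lr=1e-6):
           self.lr = lr
           self.patience = patience
           self.factor = factor
           self.min_lr = min_lr
           self.best = float("inf")
           self.bad_epochs = 0

       def step(self, validation_loss):
           if validation_loss < self.best:
               self.best = validation_loss
               self.bad_epochs = 0
           else:
               self.bad_epochs += 1
           if self.bad_epochs > self.patience:
               # New LR = current LR * factor, never below min_lr.
               self.lr = max(self.lr * self.factor, self.min_lr)
               self.bad_epochs = 0
           return self.lr


   scheduler = PlateauScheduler(lr=1e-3, patience=2, factor=0.5)
   losses = [1.0, 0.9, 0.9, 0.9, 0.9, 0.9]
   history = [scheduler.step(loss) for loss in losses]
   print(history)

With a noisy validation loss, spurious "improvements" reset the counter and
spurious plateaus trigger reductions, which is why very small validation
splits make this control unreliable.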


Force Training Parameters
~~~~~~~~~~~~~~~~~~~~~~~~~~

**force_fraction** : float (default: 1.0)
   Fraction of force-labeled structures (0.0-1.0) to use for force training.
   Using a subset can speed up training considerably, often with little loss
   of accuracy.
   Example: ``force_fraction=0.3`` uses 30% of force-labeled structures.

**force_sampling** : str (default: 'random')
   Sampling strategy for the force subset: ``'random'`` (random subset,
   resampled according to ``force_resample_num_epochs``) or ``'fixed'``
   (static subset). Periodic random resampling lets more structures
   contribute force information over a run, which can improve generalization.

**force_resample_num_epochs** : int (default: 0)
   Number of epochs between resampling the force-trained subset when
   ``force_sampling='random'``. Controls the resampling frequency:

   * ``0`` = No resampling (use fixed subset for entire training)
   * ``1`` = Resample every epoch (maximum variety, highest computational cost)
   * ``N > 1`` = Resample every N epochs (balance between variety and efficiency)

   .. note::
      The default value of 0 (no resampling) represents a conservative choice
      that maintains consistent training dynamics and reduces computational
      overhead. Set to 1 or higher for dynamic resampling.

**force_min_structures_per_epoch** : int (default: 1)
   Minimum number of force-labeled structures per epoch, regardless of
   ``force_fraction``. Ensures force gradient signal is not lost.

**force_scale_unbiased** : bool (default: False)
   Apply sqrt(1/f) scaling to force RMSE where f is the supervised fraction,
   approximating constant scale under sub-sampling.
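
How these parameters interact can be sketched with the standard library. All
function names here are hypothetical, and details such as rounding the target
count to the nearest integer are assumptions of the sketch rather than
documented trainer behavior.

.. code-block:: python

   import math
   import random


   def select_force_subset(force_labeled, force_fraction, min_structures,
                           rng):
       """Pick the structures that receive force supervision."""
       # Assumption: the target count is rounded to the nearest integer.
       target = int(round(force_fraction * len(force_labeled)))
       count = min(len(force_labeled), max(min_structures, target))
       return sorted(rng.sample(force_labeled, count))


   def force_subsets_by_epoch(force_labeled, n_epochs, force_fraction,
                              resample_num_epochs, min_structures=1, seed=0):
       """Return the force-supervised subset used in each epoch."""
       rng = random.Random(seed)
       subsets = []
       current = None
       for epoch in range(n_epochs):
           due = (resample_num_epochs > 0
                  and epoch % resample_num_epochs == 0)
           if current is None or due:
               current = select_force_subset(
                   force_labeled, force_fraction, min_structures, rng
               )
           subsets.append(current)
       return subsets


   def unbiased_scale(force_fraction):
       """sqrt(1/f) factor described for ``force_scale_unbiased``."""
       return math.sqrt(1.0 / force_fraction)


   labeled = list(range(10))
   static = force_subsets_by_epoch(labeled, 4, 0.3, resample_num_epochs=0)
   every_two = force_subsets_by_epoch(labeled, 4, 0.3, resample_num_epochs=2)
   print(static)
   print(every_two)

With ``resample_num_epochs=0`` the subset chosen before the first epoch is
reused for the whole run, even when ``force_sampling='random'``.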


Performance & Caching
~~~~~~~~~~~~~~~~~~~~~~

**cache_features** : bool (default: False)
   Enable feature caching. Behavior depends on training mode:

   * For energy-only training (``force_weight=0``): Pre-computes all features once,
     providing ~100× speedup
   * For force training (``force_weight > 0``): Caches features for structures not
     selected for force supervision in current epoch (useful with ``force_fraction < 1.0``)

**cache_feature_max_entries** : int or None (default: 1024)
   Maximum number of trainer-owned energy-view feature entries to retain per
   split and per process/worker when ``cache_features=True``. Use ``None`` for
   an explicit unbounded cache or ``0`` to suppress storage.

**cache_neighbors** : bool (default: False)
   Cache per-structure neighbor graphs (indices, displacement vectors) across
   epochs. Avoids repeated neighbor searches for fixed geometries on
   energy-view reuse and legacy non-graph paths. The supported force-training
   path does not require this option.

**cache_neighbor_max_entries** : int or None (default: 512)
   Maximum number of trainer-owned neighbor payload entries to retain per
   split and per process/worker when ``cache_neighbors=True``. Use ``None`` for
   an explicit unbounded cache or ``0`` to suppress storage.

**cache_force_triplets** : bool (default: False)
   Cache CSR neighbor graphs and precompute angular triplet indices for the
   default sparse force-training path. Leaving this disabled still uses the
   sparse graph/triplet path, but rebuilds those graph payloads on demand.

**cache_force_triplet_max_entries** : int or None (default: 256)
   Maximum number of trainer-owned graph/triplet payload entries to retain per
   split and per process/worker when ``cache_force_triplets=True``. Use
   ``None`` for an explicit unbounded cache or ``0`` to suppress storage.
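
   The three ``*_max_entries`` options share the same ``None`` / ``0`` / ``N``
   convention. The toy class below sketches that convention only. The class
   name ``BoundedCache`` is hypothetical, and the choice to stop storing new
   entries once the limit is reached (rather than evicting old ones) is an
   assumption of this sketch; the retention policy of the real trainer caches
   is not described here.

   .. code-block:: python

      class BoundedCache:
          """Toy cache illustrating the ``None`` / ``0`` / ``N`` size semantics."""

          def __init__(self, max_entries):
              # None = unbounded, 0 = store nothing, N = at most N entries.
              self.max_entries = max_entries
              self._data = {}

          def put(self, key, value):
              """Store ``value`` if there is room; return whether it was stored."""
              # Assumption: once the limit is reached, new entries are skipped.
              if (self.max_entries is not None
                      and len(self._data) >= self.max_entries):
                  return False
              self._data[key] = value
              return True

          def get(self, key):
              """Return the cached value, or None on a cache miss."""
              return self._data.get(key)

          def __len__(self):
              return len(self._data)


      cache = BoundedCache(max_entries=2)
      for structure_index in range(5):
          cache.put(structure_index, f"payload-{structure_index}")
      print(len(cache))  # 2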

**cache_persist_dir** : str (default: None)
   Directory for persisting graph/triplet caches to disk for reuse across runs.

**cache_scope** : str (default: 'all')
   Which dataset splits to cache: ``'train'``, ``'val'``, or ``'all'``.

**cache_warmup** : bool (default: False)
   If True, pre-populate trainer-owned runtime caches before the first epoch
   in single-process training. When all enabled caches have finite entry
   limits, warmup stops once those limits are filled. Warmup is skipped
   automatically when ``num_workers > 0`` because workers own their own cache
   instances and the main-process warmup would not populate those worker-local
   caches.

**num_workers** : int (default: 0)
   Number of parallel ``DataLoader`` workers for structure loading, HDF5
   reads, and on-the-fly featurization. ``0`` keeps sample preparation in the
   main process. Values ``>0`` parallelize worker-side sample preparation; they
   do not parallelize model compute.

**prefetch_factor** : int (default: 2)
   Number of batches to prefetch per worker when ``num_workers > 0``.

**persistent_workers** : bool (default: True)
   Keep DataLoader workers alive between epochs for faster iteration.
   During training, this is disabled automatically when
   ``force_sampling='random'`` uses epoch-level resampling, because worker
   copies would otherwise keep a stale force-supervision subset. Trainer-owned
   runtime caches and HDF5 ``in_memory_cache_size`` state are also
   worker-local when ``num_workers > 0``. For HDF5-backed datasets, worker
   handles are opened lazily per worker and closed explicitly when that worker
   exits.
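
   The following sketch shows how ``num_workers``, ``prefetch_factor``, and
   ``persistent_workers`` could be translated into keyword arguments for
   ``torch.utils.data.DataLoader``. The helper ``dataloader_kwargs`` and its
   ``epoch_resampling`` flag are hypothetical and only mirror the rules
   described above; this is not the trainer's actual code. The worker-only
   options are omitted for ``num_workers=0`` because PyTorch rejects them
   without worker processes.

   .. code-block:: python

      def dataloader_kwargs(num_workers=0, prefetch_factor=2,
                            persistent_workers=True, epoch_resampling=False):
          """Build keyword arguments for ``torch.utils.data.DataLoader``."""
          kwargs = {"num_workers": num_workers}
          if num_workers > 0:
              # These two options are only meaningful with worker processes.
              kwargs["prefetch_factor"] = prefetch_factor
              # Mirrors the documented rule: persistent workers are disabled
              # when the force-supervised subset is resampled between epochs.
              kwargs["persistent_workers"] = (
                  persistent_workers and not epoch_resampling
              )
          return kwargs


      print(dataloader_kwargs())
      print(dataloader_kwargs(num_workers=4))
      print(dataloader_kwargs(num_workers=4, epoch_resampling=True))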


Data Filtering & Quality Control
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

**max_energy** : float (default: None)
   Exclude structures with referenced cohesive or formation energy per atom
   above this threshold when the trainer constructs datasets from raw
   ``structures=...`` input. If ``atomic_energies`` is omitted, the filter
   falls back to all-zero atomic references and uses the provided per-atom
   labels as-is. When you pass a prebuilt ``dataset=...`` or explicit
   ``train_dataset=...``/``test_dataset=...``, this option is ignored and the
   trainer emits a warning.

**max_forces** : float (default: None)
   Exclude structures with maximum atomic force magnitude above this threshold.
   Units: eV/Å.
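
   A minimal sketch of the ``max_forces`` criterion, assuming forces are given
   as per-atom ``(fx, fy, fz)`` tuples in eV/Å. The helper names are
   hypothetical, and treating structures without force labels as passing the
   filter is an assumption of this sketch.

   .. code-block:: python

      import math


      def max_force_magnitude(forces):
          """Largest Euclidean norm among per-atom force vectors (eV/Å)."""
          return max(math.sqrt(fx * fx + fy * fy + fz * fz)
                     for fx, fy, fz in forces)


      def keep_structure(forces, max_forces=None):
          """Return True if the structure passes the force filter."""
          # No threshold or no force labels: nothing to filter on (assumption).
          if max_forces is None or not forces:
              return True
          # Structures are excluded only when the maximum is above the threshold.
          return max_force_magnitude(forces) <= max_forces


      forces = [(0.0, 0.0, 0.1), (3.0, 4.0, 0.0)]  # second atom: |F| = 5.0 eV/Å
      print(keep_structure(forces, max_forces=10.0))  # True
      print(keep_structure(forces, max_forces=2.0))   # False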


Energy Configuration
~~~~~~~~~~~~~~~~~~~~

**atomic_energies** : dict (default: None)
   Optional atomic reference energies used to convert total energies to
   cohesive-energy targets during training when the trainer constructs
   datasets from raw ``structures=...`` input.
   Format: ``{'H': -13.6, 'O': -432.0, ...}`` in eV.
   If omitted, the training target remains the total energy because all atomic
   reference energies default to 0.0. When you pass a prebuilt
   ``dataset=...`` or explicit ``train_dataset=...``/``test_dataset=...``,
   the dataset owns ``atomic_energies`` instead; matching config values are
   allowed, but mismatched values raise an error.
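
   The conversion from a total energy to a referenced per-atom energy, which
   is also the quantity compared against ``max_energy``, can be sketched as
   follows. The function name is hypothetical, the total energy is a made-up
   label, and treating elements without an entry as ``0.0`` is an assumption
   that extends the documented all-zero default for an omitted dictionary.

   .. code-block:: python

      from collections import Counter


      def referenced_energy_per_atom(total_energy, species, atomic_energies=None):
          """Subtract atomic references from a total energy, then divide by N."""
          refs = atomic_energies or {}
          counts = Counter(species)
          # Assumption: elements without a reference entry contribute 0.0.
          reference = sum(n * refs.get(el, 0.0) for el, n in counts.items())
          return (total_energy - reference) / len(species)


      atomic_energies = {"H": -13.6, "O": -432.0}  # eV, from the format example
      species = ["H", "H", "O"]
      total_energy = -470.0                        # eV, illustrative value only

      # Reference sum: 2 * (-13.6) + (-432.0) = -459.2 eV, so the referenced
      # energy is about -10.8 eV, or about -3.6 eV/atom.
      print(referenced_energy_per_atom(total_energy, species, atomic_energies))

      # Without references, the per-atom total energy is used as-is.
      print(referenced_energy_per_atom(total_energy, species))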

**normalize_features** : bool (default: True)
   Normalize features to zero mean and unit variance. Improves training
   stability and convergence.

**normalize_energy** : bool (default: True)
   Normalize energies by shifting and scaling. Applied after the
   cohesive-energy conversion when ``atomic_energies`` is provided.

**E_shift** : float (default: None)
   Override per-atom energy shift for normalization. Auto-computed from
   training set if None.

**E_scaling** : float (default: None)
   Override energy scaling factor. Auto-computed from training set if None.

**feature_stats** : dict (default: None)
   Override feature normalization statistics.
   Format: ``{'mean': np.ndarray, 'std': np.ndarray}``.
   Auto-computed from training set if None.
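
   A sketch of zero-mean, unit-variance feature normalization using a
   ``{'mean': ..., 'std': ...}`` dictionary. Plain Python lists stand in for
   the ``np.ndarray`` values to keep the example dependency-free. The helper
   names, the use of the population standard deviation, and the ``eps`` guard
   for constant features are assumptions of this sketch.

   .. code-block:: python

      import statistics


      def compute_feature_stats(rows):
          """Per-column mean and standard deviation of feature vectors."""
          columns = list(zip(*rows))
          return {
              "mean": [statistics.fmean(c) for c in columns],
              # Population standard deviation; an assumption of this sketch.
              "std": [statistics.pstdev(c) for c in columns],
          }


      def normalize(row, stats, eps=1e-12):
          """Shift to zero mean and scale to unit variance per feature."""
          # ``eps`` avoids division by zero for features that never vary.
          return [(x - m) / max(s, eps)
                  for x, m, s in zip(row, stats["mean"], stats["std"])]


      train_features = [[1.0, 10.0], [3.0, 10.0]]
      stats = compute_feature_stats(train_features)
      print(stats)
      print(normalize(train_features[0], stats))

   Statistics are computed from the training rows only and then reused
   unchanged for any other data, which matches the "auto-computed from
   training set" behavior described above.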


Output & Diagnostics
~~~~~~~~~~~~~~~~~~~~

**save_energies** : bool (default: False)
   Save predicted energies for train/test sets to disk. The
   ``Path-of-input-file`` column preserves the original structure path or
   name when available; otherwise it uses a stable ``structure_XXXXXX``
   identifier from the pre-split input order. For HDF5-backed datasets,
   the identifier is synthesized from persisted source metadata as
   ``display_name#frame=N`` when a display name is available,
   ``source_id#frame=N`` otherwise, then ``name#frame=N`` when only the
   persisted structure name is available, and
   ``structure_XXXXXX#frame=N`` as the final fallback. Source metadata is
   validated at HDF5 build time so these identifiers are not silently
   truncated on write.
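
   The fallback order for HDF5-backed identifiers can be summarized with a
   short sketch. The function ``structure_identifier`` is hypothetical, and
   reading ``XXXXXX`` as a zero-padded six-digit index is an assumption.

   .. code-block:: python

      def structure_identifier(index, frame=None, display_name=None,
                               source_id=None, name=None):
          """Pick the first available label and append the frame suffix."""
          # Order of preference: display name, source id, structure name,
          # then a synthetic label (assumed to be a six-digit padded index).
          base = display_name or source_id or name or f"structure_{index:06d}"
          return base if frame is None else f"{base}#frame={frame}"


      print(structure_identifier(7, frame=3, display_name="rutile", source_id="run1"))
      print(structure_identifier(7, frame=3, source_id="run1"))
      print(structure_identifier(7, frame=3))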

**save_forces** : bool (default: False)
   Save predicted forces for train/test sets to disk.

**timing** : bool (default: False)
   Enable detailed timing output for performance profiling.

**show_progress** : bool (default: True)
   Display epoch-level progress bar. The reported training errors depend on
   the active sampling strategy: with ``sampling_policy="uniform"``, the
   epoch training error is computed from one full pass over the training
   split without replacement; with non-uniform sampling, the displayed
   training error is computed from that epoch's sampled-with-replacement
   training draws and may therefore include repeated structures and omit
   others. The final metrics returned by ``train()`` are recomputed
   afterwards from a deterministic full pass over the train/test splits.
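
   The difference between the two kinds of epoch-level training error comes
   down to how structure indices are drawn. The sketch below contrasts a full
   pass without replacement with weighted draws with replacement. The
   function ``epoch_draws``, its ``weights`` argument, and the choice of one
   draw per structure are hypothetical and do not reflect the trainer's
   sampler implementation.

   .. code-block:: python

      import random


      def epoch_draws(n_structures, weights=None, rng=None):
          """Return the structure indices visited in one epoch."""
          rng = rng or random.Random()
          if weights is None:
              # Uniform policy: one full pass, every index exactly once.
              order = list(range(n_structures))
              rng.shuffle(order)
              return order
          # Non-uniform policy: weighted draws with replacement, so some
          # indices may repeat and others may be missing.
          return rng.choices(range(n_structures), weights=weights,
                             k=n_structures)


      rng = random.Random(0)
      print(sorted(epoch_draws(5, rng=rng)))              # [0, 1, 2, 3, 4]
      print(epoch_draws(5, weights=[1, 0, 0, 0, 0], rng=rng))  # [0, 0, 0, 0, 0]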

**show_batch_progress** : bool (default: False)
   Display batch-level progress bar within each epoch. Verbose for large
   datasets.


Advanced Options
~~~~~~~~~~~~~~~~

**precision** : str (default: 'auto')
   Numeric precision: ``'auto'`` (match descriptor dtype), ``'float32'``, or
   ``'float64'``. Higher precision improves accuracy but increases memory usage.

**memory_mode** : str (default: 'gpu')
   Memory management strategy: ``'cpu'``, ``'gpu'``, or ``'mixed'``.
   ``'mixed'`` is reserved for a future real mixed-memory implementation and
   currently raises ``NotImplementedError``. Use ``'cpu'`` or ``'gpu'`` with
   ``descriptor.device`` and ``device`` set explicitly to control the current
   execution path.

**device** : str (default: None)
   PyTorch device: ``'cpu'``, ``'cuda'``, or ``'cuda:0'``. Auto-detected if
   None. This selects the model/training-loop device. ``descriptor.device``
   separately controls where structures are featurized. When the two differ,
   samples are prepared on ``descriptor.device`` and moved to ``device``
   before the forward pass.


Monitoring Training Progress
-----------------------------

The :class:`~aenet.io.train.TrainOut` object returned by ``train()`` provides
built-in visualization and analysis tools. Common entry points are:

* ``results.plot_training_summary(outfile="training_summary.png")`` for a
  combined energy/force plot
* ``results.plot_training_errors(outfile="energy_errors.png")`` for
  energy-only training curves
* ``results.plot_force_errors(outfile="force_errors.png")`` when force data
  are present
* ``results.errors`` for direct access to the underlying pandas DataFrame used
  for custom plotting

The ``example-05-torch-training.ipynb`` notebook linked in the *Example
notebooks* section demonstrates these plotting helpers in a full training
workflow.

Signs of good training:

* Steady decrease in both train and test RMSE
* Test RMSE follows train RMSE (no overfitting)
* Convergence to acceptable error levels (< 0.01 eV/atom for energy)

Signs of problems:

* Test RMSE increases while train RMSE decreases (overfitting)
* Both RMSEs plateau at high values (underfitting, poor architecture)
* Divergence or oscillation (learning rate too high)
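
The signs listed above can be turned into a rough automated check on two
RMSE curves. The sketch below is a heuristic only: the function ``diagnose``,
the window length, and the plateau tolerance are hypothetical choices, and
the input lists are assumed to hold per-epoch energy RMSE values in eV/atom
(for example, taken from ``results.errors``, whose column names are not
assumed here).

.. code-block:: python

   def diagnose(train_rmse, test_rmse, window=3, target=0.01):
       """Classify the tail of two RMSE curves with simple heuristics."""
       if len(train_rmse) < window + 1 or len(test_rmse) < window + 1:
           return "too few epochs"
       # Change over the last ``window`` epochs for each curve.
       train_delta = train_rmse[-1] - train_rmse[-1 - window]
       test_delta = test_rmse[-1] - test_rmse[-1 - window]
       if test_delta > 0 and train_delta < 0:
           return "possible overfitting"
       if train_delta > 0 and test_delta > 0:
           return "possible divergence"
       # Flat test curve (less than 1% change) that is still above the target.
       if test_rmse[-1] > target and abs(test_delta) < 0.01 * test_rmse[-1]:
           return "plateau above target"
       return "ok"


   print(diagnose([0.5, 0.3, 0.2, 0.1], [0.5, 0.4, 0.45, 0.6]))
   print(diagnose([0.2, 0.2, 0.2, 0.2], [0.2, 0.2, 0.2, 0.2]))
   print(diagnose([0.5, 0.1, 0.05, 0.02], [0.5, 0.12, 0.06, 0.008]))

A check like this does not replace looking at the plots; oscillating curves
in particular can fall into any of the categories depending on the window.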


See Also
--------

* :doc:`torch_featurization` - PyTorch-based structure featurization
* :doc:`choosing_implementation` - Fortran vs PyTorch comparison