.. _usage-training:

Training ANN Potentials (Fortran)
==================================

.. note::

   Training as described here makes use of ænet's compiled ``train.x`` tool.
   Make sure to install ænet and configure the paths as described
   in :doc:`installation`.

.. note::

   **Alternative**: For a pure Python/PyTorch implementation that does not
   require Fortran, see :doc:`torch_training`.

``aenet-python`` provides tools to facilitate the training of ænet
potentials directly from Python scripts.  This workflow is managed
primarily by the :class:`~aenet.mlip.ANNPotential` class.


Example notebooks
-----------------

Jupyter notebooks with examples of how to train potentials can
be found in the `notebooks
<https://github.com/atomisticnet/aenet-python/tree/master/notebooks>`_
directory of the repository.


Defining the Network Architecture
---------------------------------

Before training, you need to define the architecture of the ANN for each
atomic species involved. This is done using a Python dictionary where
keys are the element symbols (e.g., "Si", "O") and values are lists of
tuples. Each tuple represents a layer in the network, specifying the
number of nodes and the activation function for that layer.

Supported activation functions are:
``'tanh'``, ``'linear'``, and ``'sigmoid'``.

The final ANN layer is always a linear layer with one node, which
outputs the energy for the corresponding atomic species.  This layer
does not need to be defined.

Example architecture for a silicon potential:

.. code-block:: python

    from aenet.mlip import ANNPotential

    # Define architecture: Si with two hidden layers
    # (10 nodes, tanh activation)
    arch = {
        "Si": [(10, 'tanh'), (10, 'tanh')]
    }

    # Create the potential object
    potential = ANNPotential(arch)
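
Because the architecture is a plain dictionary, it can be sanity-checked
before it is passed to ``ANNPotential``. The following is a minimal sketch
and not part of ``aenet-python``: ``check_architecture`` and
``count_weights`` are hypothetical helpers, the weight count assumes fully
connected layers with one bias per node, and the descriptor length
``n_input=20`` is an arbitrary value chosen for illustration.

.. code-block:: python

    SUPPORTED_ACTIVATIONS = {"tanh", "linear", "sigmoid"}


    def check_architecture(arch):
        """Raise ValueError unless every layer is (nodes, activation)."""
        for species, layers in arch.items():
            for nodes, activation in layers:
                if not isinstance(nodes, int) or nodes < 1:
                    raise ValueError(f"{species}: invalid node count {nodes!r}")
                if activation not in SUPPORTED_ACTIVATIONS:
                    raise ValueError(
                        f"{species}: unknown activation {activation!r}")


    def count_weights(layers, n_input):
        """Count weights and biases, including the implicit output node."""
        # The final single-node linear layer is appended automatically.
        sizes = [n_input] + [nodes for nodes, _ in layers] + [1]
        return sum((n_in + 1) * n_out
                   for n_in, n_out in zip(sizes, sizes[1:]))


    arch = {"Si": [(10, "tanh"), (10, "tanh")]}
    check_architecture(arch)
    # Assumed descriptor length of 20 input features (illustration only).
    n_weights = count_weights(arch["Si"], n_input=20)

Comparing the weight count with the number of reference structures gives a
quick indication of whether a network is oversized for the available data.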


Training Configuration
----------------------

Training parameters are managed through the :class:`~aenet.mlip.TrainingConfig`
class, which centralizes all configuration options with built-in validation.
This ensures type safety and prevents invalid parameter combinations.

The ``TrainingConfig`` class includes:

*   ``iterations`` (int): Maximum number of training iterations. Default: ``0``
*   ``method`` (TrainingMethod): The optimization algorithm to use. Default: ``Adam()``
*   ``testpercent`` (int): Percentage of data for test set (0-100). Default: ``0``
*   ``max_energy`` (float, optional): Exclude structures with referenced cohesive or formation energy per atom above this threshold when the trainer builds datasets from raw ``structures=...`` input. If ``atomic_energies`` is omitted, the filter falls back to all-zero atomic references and uses the provided per-atom labels as-is. Prebuilt datasets must be filtered when they are constructed. Default: ``None``
*   ``sampling`` (str, optional): Sampling method ('sequential', 'random', 'weighted', 'energy'). Default: ``None``
*   ``timing`` (bool): Enable detailed timing output. Default: ``False``
*   ``save_energies`` (bool): Save predicted energies for training/test sets. Default: ``False``

The configuration validates its parameters at creation time and raises
``ValueError`` for invalid input, for example a ``testpercent`` outside the
0-100 range or an unknown ``sampling`` method.
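
Validation at creation time can be pictured with a small dataclass. This is
a sketch under stated assumptions, not the actual ``TrainingConfig``
implementation: ``ConfigSketch`` is a hypothetical stand-in that covers only
three of the fields listed above, and its error messages are invented for
the example.

.. code-block:: python

    from dataclasses import dataclass
    from typing import Optional

    VALID_SAMPLING = ("sequential", "random", "weighted", "energy")


    @dataclass
    class ConfigSketch:
        """Hypothetical stand-in for TrainingConfig (illustration only)."""

        iterations: int = 0
        testpercent: int = 0
        sampling: Optional[str] = None

        def __post_init__(self):
            # Runs right after __init__, so bad values never reach training.
            if not 0 <= self.testpercent <= 100:
                raise ValueError("testpercent must be between 0 and 100")
            if (self.sampling is not None
                    and self.sampling not in VALID_SAMPLING):
                raise ValueError(
                    f"unknown sampling method: {self.sampling!r}")


    try:
        ConfigSketch(iterations=1000, testpercent=150)
    except ValueError as err:
        message = str(err)

Failing early in this way surfaces a typo in the configuration before any
time is spent launching a training run.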


Training Methods
----------------

The training process uses optimization methods to adjust the neural network
weights. Each method has specific parameters with sensible defaults.
The available training methods are provided as typed classes:

*   :class:`~aenet.mlip.Adam` - ADAM optimizer (default)
*   :class:`~aenet.mlip.BFGS` - L-BFGS-B optimizer (no parameters)
*   :class:`~aenet.mlip.EKF` - Extended Kalman filter
*   :class:`~aenet.mlip.LM` - Levenberg-Marquardt
*   :class:`~aenet.mlip.OnlineSD` - Online steepest descent

Each training method class encodes both the algorithm name and its parameters
with appropriate defaults based on the ænet Fortran implementation.
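
The idea of a typed class that carries an algorithm name together with its
parameters can be sketched as follows. ``OnlineSDSketch`` is a hypothetical
stand-in, not the real :class:`~aenet.mlip.OnlineSD`; the keyword
``online_sd`` and the ``key=value`` rendering are assumptions made for this
illustration.

.. code-block:: python

    from dataclasses import asdict, dataclass
    from typing import ClassVar


    @dataclass
    class OnlineSDSketch:
        """Hypothetical stand-in for OnlineSD (illustration only)."""

        # Algorithm keyword; "online_sd" is an assumption for this sketch.
        name: ClassVar[str] = "online_sd"
        # Defaults as documented for OnlineSD on this page.
        gamma: float = 1.0e-5
        alpha: float = 0.25

        def to_line(self):
            """Render the keyword followed by key=value pairs."""
            params = " ".join(
                f"{key}={value}" for key, value in asdict(self).items())
            return f"{self.name} {params}"


    default_line = OnlineSDSketch().to_line()
    custom_line = OnlineSDSketch(gamma=1e-6, alpha=0.3).to_line()

Keeping the defaults in the class definition means that a call such as
``OnlineSDSketch()`` is always a complete, valid method specification.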


Training the Potential
----------------------

Once the architecture is defined, you can train the potential using
the :meth:`~aenet.mlip.ANNPotential.train` method. This method automates
several steps:

1.  Checks if the provided training set file exists and is compatible
    with the defined architecture.
2.  Creates a temporary working directory (or uses a specified one).
3.  Generates the necessary ``train.in`` file based on the architecture
    and training parameters.
4.  Calls the ``train.x`` executable from the configured ænet installation.
5.  Monitors the training progress with a progress bar.
6.  Collects the resulting potential files (``.nn`` files), energy files,
    and timing information into the current directory upon completion.
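
Step 3 can be pictured with the sketch below. It is an assumption-laden
illustration: ``render_train_in`` is a hypothetical helper, the keyword
layout is modeled on the ``train.in`` examples in the ænet documentation,
and the file that ``aenet-python`` actually writes may differ in keywords,
ordering, and network file names.

.. code-block:: python

    def render_train_in(trnset_file, arch, iterations, testpercent,
                        method_line):
        """Return the text of a train.in-style input file (sketch)."""
        lines = [
            f"TRAININGSET {trnset_file}",
            f"TESTPERCENT {testpercent}",
            f"ITERATIONS {iterations}",
            "",
            "METHOD",
            method_line,
            "",
            "NETWORKS",
        ]
        for species, layers in arch.items():
            # One line per species: file name, layer count, nodes:activation.
            spec = " ".join(f"{nodes}:{act}" for nodes, act in layers)
            lines.append(f"  {species}  {species}.nn  {len(layers)}  {spec}")
        return "\n".join(lines) + "\n"


    text = render_train_in(
        "data.train",
        {"Si": [(10, "tanh"), (10, "tanh")]},
        iterations=1000,
        testpercent=10,
        method_line="bfgs",
    )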

Basic Training Example:

.. code-block:: python

    from aenet.mlip import ANNPotential, TrainingConfig

    # Assuming 'potential' is an ANNPotential object defined as above
    # and 'data.train' is your training set file.

    # Simple training with defaults (uses Adam optimizer)
    potential.train('data.train')

    # Or customize parameters using TrainingConfig
    config = TrainingConfig(iterations=1000, testpercent=10)
    potential.train('data.train', config=config)

    # Inline configuration also works
    potential.train('data.train',
                    config=TrainingConfig(iterations=1000, testpercent=10))
    print("Training completed successfully.")

Using Different Training Methods:

.. code-block:: python

    from aenet.mlip import ANNPotential, TrainingConfig
    from aenet.mlip import BFGS, Adam, LM, EKF, OnlineSD

    # Use BFGS optimizer
    config = TrainingConfig(iterations=1000, method=BFGS())
    potential.train('data.train', config=config)

    # Customize Adam parameters
    config = TrainingConfig(
        iterations=1000,
        method=Adam(mu=0.005, batchsize=200),
        testpercent=10
    )
    potential.train('data.train', config=config)

    # Use Levenberg-Marquardt with additional options
    config = TrainingConfig(
        iterations=500,
        method=LM(batchsize=128, learnrate=0.05),
        sampling='random',
        max_energy=100.0
    )
    potential.train('data.train', config=config)

    # Use Extended Kalman filter
    config = TrainingConfig(
        iterations=500,
        method=EKF(lambda_=0.95, P=150.0),
        timing=True
    )
    potential.train('data.train', config=config)

    # Use Online steepest descent
    config = TrainingConfig(
        iterations=10000,
        method=OnlineSD(gamma=1e-6, alpha=0.3),
        save_energies=True
    )
    potential.train('data.train', config=config)


Key Parameters for ``train()``:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

*   ``trnset_file`` (str or Path, optional): Path to the training set file. Defaults to ``'data.train'``.
*   ``config`` (TrainingConfig, optional): Training configuration object containing all training parameters (iterations, method, testpercent, max_energy, sampling, timing, save_energies). If not provided, uses default ``TrainingConfig()`` with Adam optimizer. Defaults to ``None``.
*   ``workdir`` (str or Path, optional): A directory to store temporary files during training. If not provided, a temporary directory is created and removed afterwards. Defaults to ``None``.
*   ``output_file`` (str or Path, optional): File path to save the standard output of the ``train.x`` executable. Defaults to ``'train.out'``.

See the ``TrainingConfig`` class documentation above for all available configuration parameters.
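
The ``workdir`` behavior described above (use the given directory, or create
a temporary one and remove it afterwards) follows a common pattern. The
sketch below shows that pattern with the standard library only;
``working_directory`` is a hypothetical helper and does not reflect how
``aenet-python`` implements it internally.

.. code-block:: python

    import contextlib
    import tempfile
    from pathlib import Path


    @contextlib.contextmanager
    def working_directory(workdir=None):
        """Yield `workdir` if given, else a self-cleaning temp directory."""
        if workdir is not None:
            path = Path(workdir)
            path.mkdir(parents=True, exist_ok=True)
            yield path  # a user-supplied directory is kept afterwards
        else:
            with tempfile.TemporaryDirectory() as tmp:
                yield Path(tmp)  # removed when the block exits


    with working_directory() as path:
        existed_during = path.is_dir()
        remembered = path
    existed_after = remembered.exists()

Passing an explicit ``workdir`` is therefore the option to choose when the
intermediate files should remain available for inspection.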

Training Method Parameters
~~~~~~~~~~~~~~~~~~~~~~~~~~

**Adam** (default method)

*   ``mu`` (float): Learning rate. Default: ``0.001``
*   ``b1`` (float): Exponential decay rate for first moment estimates. Default: ``0.9``
*   ``b2`` (float): Exponential decay rate for second moment estimates. Default: ``0.999``
*   ``eps`` (float): Small constant for numerical stability. Default: ``1.0e-8``
*   ``batchsize`` (int): Number of structures per batch. Default: ``16``
*   ``samplesize`` (int): Number of structures to sample per epoch. Default: ``100``

**BFGS**

*   No configurable parameters.
*   Note: Not supported on ARM-based Macs.

**EKF** (Extended Kalman Filter)

*   ``lambda_`` (float): Forgetting factor. The trailing underscore is needed because ``lambda`` is a reserved word in Python. Default: ``0.99``
*   ``lambda0`` (float): Initial forgetting factor. Default: ``0.999``
*   ``P`` (float): Initial covariance. Default: ``100.0``
*   ``mnoise`` (float): Measurement noise. Default: ``0.0``
*   ``pnoise`` (float): Process noise. Default: ``1.0e-5``
*   ``wgmax`` (int): Maximum weight change. Default: ``500``

**LM** (Levenberg-Marquardt)

*   ``batchsize`` (int): Number of structures per batch. Default: ``256``
*   ``learnrate`` (float): Learning rate. Default: ``0.1``
*   ``iter`` (int): Number of iterations per epoch. Default: ``3``
*   ``conv`` (float): Convergence criterion. Default: ``1e-3``
*   ``adjust`` (int): Adjustment parameter. Default: ``5``

**OnlineSD** (Online Steepest Descent)

*   ``gamma`` (float): Learning rate. Default: ``1.0e-5``
*   ``alpha`` (float): Momentum parameter. Default: ``0.25``

The :meth:`~aenet.mlip.ANNPotential.train` method requires a configured
ænet installation, whichever training method is selected. Use
``aenet config`` on the command line to set the paths to the ænet
executables.

MPI Parallelization
-------------------

Training can be accelerated using MPI parallelization if the ``train.x``
executable is built with MPI support. This allows training to run across
multiple CPU cores or nodes on HPC systems.

Prerequisites
~~~~~~~~~~~~~

1. The ``train.x`` executable must be compiled with MPI support.
2. MPI must be enabled in the aenet-python configuration:

   .. code-block:: bash

       $ aenet config --enable-mpi

3. (Optional) Customize the MPI launcher for your system:

   .. code-block:: bash

       # For SLURM systems
       $ aenet config --set-mpi-launcher "srun -n {num_proc} {exec}"

       # Default is "mpirun -np {num_proc} {exec}"

Using MPI in Training
~~~~~~~~~~~~~~~~~~~~~

To enable MPI parallelization, pass the ``num_processes`` parameter to the
``train()`` method:

.. code-block:: python

    from aenet.mlip import ANNPotential, TrainingConfig, Adam

    # Define architecture
    arch = {"Si": [(10, 'tanh'), (10, 'tanh')]}
    potential = ANNPotential(arch)

    # Standard training (sequential, no MPI)
    config = TrainingConfig(iterations=1000)
    potential.train('data.train', config=config)

    # MPI training with 8 processes
    config = TrainingConfig(iterations=1000)
    potential.train('data.train', config=config, num_processes=8)

    # MPI training with custom configuration
    config = TrainingConfig(
        iterations=1000,
        method=Adam(mu=0.005, batchsize=32),
        testpercent=10
    )
    potential.train('data.train', config=config, num_processes=16)

The ``num_processes`` parameter specifies how many MPI processes to use.
The actual command executed will be based on the configured MPI launcher.
For example, with the default launcher and ``num_processes=8``, the
command would be:

.. code-block:: bash

    mpirun -np 8 /path/to/train.x train.in
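
How a launcher template turns into a command line can be sketched with
plain string formatting. ``build_command`` is a hypothetical helper written
for this illustration; only the placeholders ``{num_proc}`` and ``{exec}``
and the resulting command are taken from the text above, and the actual
expansion inside ``aenet-python`` may differ.

.. code-block:: python

    import shlex


    def build_command(launcher, num_proc, executable, input_file="train.in"):
        """Expand a launcher template into an argument list (sketch)."""
        # Quote the executable so that a path with spaces stays one argument.
        prefix = launcher.format(num_proc=num_proc,
                                 exec=shlex.quote(str(executable)))
        return shlex.split(prefix) + [input_file]


    default_cmd = build_command("mpirun -np {num_proc} {exec}",
                                8, "/path/to/train.x")
    slurm_cmd = build_command("srun -n {num_proc} {exec}",
                              16, "/path/to/train.x")

An argument list of this form can be passed directly to
``subprocess.run`` without invoking a shell.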

Inference with Trained Potentials
---------------------------------

Once you have trained ANN potentials, you can use them to make predictions
(inference) on new atomic structures. The prediction functionality is
integrated into the :class:`~aenet.mlip.ANNPotential` class, providing a
unified interface for both training and inference.