Training Set Management

Classes for interacting with aenet training set files.

Currently, only training set files in ASCII format are supported.

Classes

FeaturizedAtomicStructure

class aenet.trainset.FeaturizedAtomicStructure(path: str, energy: float, atom_types: List[str], atoms: List[dict], neighbor_info: dict = None, cell: ndarray = None, pbc: bool = None)[source]

Bases: Serializable

Class to hold all information of an atomic structure.

path

Path to the original structure file.

Type:

str

energy

Total energy of the structure.

Type:

float

atom_types

List of atom types (chemical symbols).

Type:

list[str]

atoms

Atomic information per atom with keys: {‘type’: atom_type, ‘fingerprint’: fingerprint, ‘coords’: coords, ‘forces’: forces}.

Type:

list[dict]

neighbor_info

Optional neighbor information for force training. If present, contains the keys ‘neighbor_counts’ (n_atoms,) array of neighbor counts, ‘neighbor_lists’ list of (nnb,) arrays with neighbor indices, and ‘neighbor_vectors’ list of (nnb, 3) arrays with displacement vectors.

Type:

dict or None

cell

Unit cell lattice vectors as (3, 3) array where rows are lattice vectors.

Type:

numpy.ndarray or None

pbc

True for 3D-periodic structures; False for isolated structures.

Type:

bool

Notes

Properties like has_neighbor_info, has_cell, is_periodic, num_atoms, max_descriptor_length, composition, atom_weights, and atom_features are documented on their respective properties.

property has_neighbor_info

Returns True if neighbor information is available for force training.

Returns:

Whether the structure contains neighbor information

Return type:

bool
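The neighbor_info layout described above can be illustrated with a small NumPy sketch. This is not how aenet generates the data: the helper name build_neighbor_info, the brute-force search, the cutoff handling, and the displacement convention (neighbor position minus central atom position) are all assumptions made for illustration, and periodic images are ignored.

```python
import numpy as np


def build_neighbor_info(coords, cutoff):
    """Brute-force neighbor search for an isolated (non-periodic) structure.

    Hypothetical helper: the convention r_j - r_i for displacement
    vectors and the inclusive cutoff are assumptions, not aenet behavior.
    """
    coords = np.asarray(coords, dtype=float)
    n_atoms = len(coords)
    neighbor_lists, neighbor_vectors = [], []
    for i in range(n_atoms):
        disp = coords - coords[i]                  # (n_atoms, 3) displacements from atom i
        dist = np.linalg.norm(disp, axis=1)
        mask = (dist <= cutoff) & (np.arange(n_atoms) != i)
        neighbor_lists.append(np.nonzero(mask)[0])  # (nnb,) neighbor indices
        neighbor_vectors.append(disp[mask])         # (nnb, 3) displacement vectors
    return {
        "neighbor_counts": np.array([len(nl) for nl in neighbor_lists]),
        "neighbor_lists": neighbor_lists,
        "neighbor_vectors": neighbor_vectors,
    }


# Three atoms on a line; only the first two are within the cutoff of each other.
coords = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [5.0, 0.0, 0.0]]
info = build_neighbor_info(coords, cutoff=2.0)
print(info["neighbor_counts"])
```

Note that an atom without neighbors still gets an entry in each list: an empty index array and a (0, 3) array of vectors, so all three entries stay aligned with the atom order.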

property atom_features
atom_features_for_type(atom_type: str)[source]

Return only the features for atoms of a selected type.

Parameters:

atom_type (str) – Chemical symbol.

property atom_weights
property avec

Get unit cell lattice vectors.

Returns the cell stored in the HDF5 file if available; otherwise falls back to reading it from the XSF file, to support the legacy binary format.
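Because the lattice vectors are stored as rows of a (3, 3) array, Cartesian and fractional coordinates are related by cart = frac @ cell. The following sketch uses plain NumPy arrays with made-up values in place of a real structure's cell and coordinates.

```python
import numpy as np

# Hypothetical cell: each row is one lattice vector (a, b, c).
cell = np.array([
    [4.0, 0.0, 0.0],
    [1.0, 5.0, 0.0],
    [0.0, 0.0, 6.0],
])
frac = np.array([[0.5, 0.5, 0.5], [0.25, 0.0, 0.75]])

# Row-vector convention: Cartesian coordinates are frac @ cell.
cart = frac @ cell

# Invert the relation by solving frac @ cell = cart for frac.
frac_back = np.linalg.solve(cell.T, cart.T).T

# The cell volume is the absolute determinant of the lattice matrix.
volume = abs(np.linalg.det(cell))
```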

property composition
property coords
property forces
global_moment_fingerprint(outer_moment: int = 1, inner_moment: int = 1, weighted: bool = False, weights: dict = None, append_weighted: bool = False, stack_type_features: bool = False, exclude_zero_atoms: bool = False, atom_types: List[str] = None)[source]

Calculate a global fingerprint from local atomic fingerprints using a moment expansion.

This implementation assumes that atomic descriptors for each species have the same length.

Parameters:
  • outer_moment (int, default=1) – Up to which outer moment to compute. Must be >= 1. Not used when stack_type_features is True.

  • inner_moment (int, default=1) – Up to which inner moment to compute. Must be >= 0 (0 = no moment, 1 = mean).

  • weighted (bool, default=False) – Whether to apply species weights to the type fingerprints.

  • weights (dict[str, float] or None, default=None) – Mapping of atom symbol to weight. Defaults to self.atom_weights when weighted is True.

  • append_weighted (bool, default=False) – If True and weighted is True, append the weighted features to the unweighted features; otherwise only return weighted features.

  • stack_type_features (bool, default=False) – If True, concatenate per-type feature vectors instead of performing the outer moment expansion.

  • exclude_zero_atoms (bool, default=False) – If True, skip species with zero count in the structure.

  • atom_types (list[str] or None, default=None) – Subset of chemical symbols to consider. Defaults to all self.atom_types.

Returns:

Global fingerprint vector.

Return type:

numpy.ndarray

Notes

The global fingerprint can be conceived as:

F_global = outer_moments(w_A * inner_moments(F_A)
                         U w_B * inner_moments(F_B) U ...)

where

  • F_global is the global fingerprint

  • F_s is the union of atomic fingerprints for species s (F_s = F_s(1) U F_s(2) U ...)

  • F_s(i) is the atomic fingerprint for species s at site i

  • w_s is the weight for species s

The dimension is len(type_fingerprint) * inner_moment * outer_moment (or len(type_fingerprint) * outer_moment if inner_moment is 0).
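The moment expansion above can be sketched with NumPy as follows. This is an illustration under stated assumptions, not the library's implementation: the function names are hypothetical, moment 1 is taken to be the mean, and moment k >= 2 is taken to be the k-th central moment, which may differ from the convention aenet uses. The inner_moment = 0 case and the stacking and appending options are not covered.

```python
import numpy as np


def moments(x, order):
    """Concatenate moments 1..order of x taken along axis 0.

    Assumption: moment 1 is the mean, moment k >= 2 is the k-th
    central moment.
    """
    x = np.asarray(x, dtype=float)
    mean = x.mean(axis=0)
    parts = [mean]
    for k in range(2, order + 1):
        parts.append(((x - mean) ** k).mean(axis=0))
    return np.concatenate(parts)


def global_fingerprint_sketch(features_by_type, weights=None,
                              inner_moment=1, outer_moment=1):
    """Inner moments over the atoms of each species, then outer moments over species."""
    type_vectors = []
    for symbol, feats in features_by_type.items():
        w = 1.0 if weights is None else weights[symbol]
        type_vectors.append(w * moments(feats, inner_moment))
    # All per-species vectors must have equal length to be stacked.
    return moments(np.vstack(type_vectors), outer_moment)


# Toy fingerprints of length 2 for two species (values are made up).
features = {
    "Ti": np.array([[1.0, 2.0], [3.0, 4.0]]),
    "O": np.array([[0.0, 2.0], [2.0, 2.0], [4.0, 2.0]]),
}
fp = global_fingerprint_sketch(features, inner_moment=2, outer_moment=2)
print(fp.shape)  # (8,) = 2 features * 2 inner moments * 2 outer moments
```

With inner_moment=1 and outer_moment=1 the sketch reduces to the mean over species of the per-species mean fingerprints.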

property has_cell

Returns True if unit cell information is available.

property is_periodic

Returns True if the structure is 3D-periodic.

property max_descriptor_length

Length of the longest fingerprint among all atoms of the structure.

property num_atoms
property structure
property types

TrnSet

class aenet.trainset.TrnSet(name: str, normalized: bool, scale: float, shift: float, atom_types: List[str], atomic_energy: List[float], num_atoms_tot: int, num_structures: int, E_min: float, E_max: float, E_av: float, filename: PathLike = None, fileformat: str = 'ascii', schema: str = None, origin: PathLike = None, has_persisted_features: bool = False, **kwargs)[source]

Bases: object

Class for parsing aenet training set files.

Attention: atom type indices start at zero internally here, whereas they start at 1 in Fortran.
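The index offset can be made explicit with a small mapping. The species list below is a made-up example, and the dictionary names are hypothetical rather than attributes of TrnSet.

```python
# Hypothetical species list, in the order stored in the training set.
atom_types = ["Ti", "O"]

# Zero-based indices, as used on the Python side.
type_index = {symbol: i for i, symbol in enumerate(atom_types)}

# One-based indices, as the same types are numbered in Fortran.
fortran_index = {symbol: i + 1 for symbol, i in type_index.items()}
```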

has_neighbor_info() -> bool[source]

Check if the training set file contains neighbor information.

Neighbor information is only available for HDF5 format files that were generated with the include_neighbor_info=True option. This information is required for force training with PyTorch autograd.

Returns:

True if neighbor information is available (only for HDF5 format), False otherwise

Return type:

bool

Example:

from aenet.trainset import TrnSet

with TrnSet.from_file("features_with_neighbors.h5") as trnset:
    if trnset.has_neighbor_info():
        struct = trnset.read_structure(
            0,
            read_coords=True,
            read_forces=True,
        )

        if struct.has_neighbor_info:
            print(struct.neighbor_info["neighbor_counts"][0])

Use trnset.has_neighbor_info() to check whether the file stores neighbor-information tables at all, and struct.has_neighbor_info to check whether a particular returned structure exposes per-atom neighbor arrays.

close()[source]
classmethod from_ascii_file(ascii_file: PathLike, **kwargs)[source]

Load training set from aenet ASCII file.

Parameters:

ascii_file – path to an aenet training set file in ASCII format

classmethod from_file(filename: PathLike, file_format: str = 'guess', **kwargs)[source]
classmethod from_fortran_binary_file(binary_file: PathLike, ascii_file: PathLike = None, **kwargs)[source]

Convert a training set file in Fortran binary format to ASCII format, then open the result. This requires the tool ‘trnset2ASCII.x’.

classmethod from_hdf5_file(hdf5_file: PathLike, **kwargs)[source]
iter_structures(read_coords=False, read_forces=False)[source]
property num_types
open()[source]

Open training set file for reading.

read_next_structure(read_coords=False, read_forces=False)[source]
read_structure(idx: int, read_coords=False, read_forces=False)[source]
rewind()[source]
to_hdf5(filename: PathLike, complevel: int = 1)[source]

Save data set to HDF5 file.

Example Notebook

For the maintained end-to-end featurization workflows, including HDF5 export, PyTorch-backed HDF5 compatibility, optional GPU execution, and longer neighbor-information generation examples, see example-01-featurization.ipynb.

Usage Examples

Inspecting an Existing Training Set

This page focuses on inspecting already-generated training sets. For file-backed featurization or generation workflows, prefer the notebook linked above.

from aenet.trainset import TrnSet

with TrnSet.from_file("sample.h5") as trnset:
    print(trnset.schema)
    print(trnset.num_structures)
    print(trnset.atom_types)

    struct = trnset[0]
    print(struct.num_atoms)
    print(struct.atom_features.shape)

For HDF5 inputs, trnset.schema reports which on-disk schema was read. Current values are "trnset_hdf5" for featurizer-generated training sets and "torch_training_hdf5" for HDF5StructureDataset files.

Comparing HDF5 and ASCII Readers

Both backends expose the same high-level TrnSet API for inspection:

from aenet.trainset import TrnSet

with TrnSet.from_file("sample.h5") as trnset_h5, \
        TrnSet.from_file("sample.train.ascii") as trnset_ascii:
    struct_h5 = trnset_h5.read_structure(0, read_coords=True, read_forces=True)
    struct_ascii = trnset_ascii.read_structure(
        0,
        read_coords=True,
        read_forces=True,
    )

    assert trnset_h5.num_structures == trnset_ascii.num_structures
    assert trnset_h5.atom_types == trnset_ascii.atom_types
    assert struct_h5.atom_features.shape == struct_ascii.atom_features.shape
    assert struct_h5.coords.shape == struct_ascii.coords.shape

Checking Optional Neighbor Information

Use both the dataset-level and structure-level checks before consuming stored neighbor arrays:

from aenet.trainset import TrnSet

with TrnSet.from_file("features_with_neighbors.h5") as trnset:
    if trnset.has_neighbor_info():
        struct = trnset.read_structure(0, read_coords=True, read_forces=True)

        if struct.has_neighbor_info:
            print(struct.neighbor_info["neighbor_counts"][0])
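The two-level check can also be wrapped in a small generator when looping over a whole file. This is a sketch only: the helper name is hypothetical, and it assumes that iter_structures() yields structure objects exposing has_neighbor_info, which the signatures above suggest but do not state.

```python
def iter_structures_with_neighbors(trnset):
    """Yield only structures that expose per-atom neighbor arrays.

    Assumes `trnset` behaves like an open TrnSet and that
    iter_structures() yields objects with a has_neighbor_info property.
    """
    if not trnset.has_neighbor_info():
        return  # the file stores no neighbor tables at all
    for struct in trnset.iter_structures(read_coords=True, read_forces=True):
        if struct.has_neighbor_info:
            yield struct
```

Inside a `with TrnSet.from_file(...) as trnset:` block, the generator would be consumed with `for struct in iter_structures_with_neighbors(trnset): ...`.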

Backward Compatibility

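Training sets without stored neighbor information remain readable. In that case struct.has_neighbor_info is False and struct.neighbor_info is None, and ASCII training sets report no neighbor information at the dataset level: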

from aenet.trainset import TrnSet

with TrnSet.from_file("sample.h5") as trnset_h5:
    struct = trnset_h5.read_structure(0)
    assert not struct.has_neighbor_info
    assert struct.neighbor_info is None

with TrnSet.from_file("sample.train.ascii") as trnset_ascii:
    assert not trnset_ascii.has_neighbor_info()

See Also