Training Set Management

Classes for interacting with aenet training set files.

Currently, only training set files in ASCII format are supported.

Classes

FeaturizedAtomicStructure

class aenet.trainset.FeaturizedAtomicStructure(path: str, energy: float, atom_types: List[str], atoms: List[dict], neighbor_info: dict = None, cell: ndarray = None, pbc: bool = None)[source]

Bases: Serializable

Class to hold all information of an atomic structure.

path

Path to the original structure file.

Type:: str

energy

Total energy of the structure.

Type:: float

atom_types

List of atom types (chemical symbols).

Type:: list[str]

atoms

Atomic information per atom with keys: {‘type’: atom_type, ‘fingerprint’: fingerprint, ‘coords’: coords, ‘forces’: forces}.

Type:: list[dict]
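As a minimal sketch of this layout, the snippet below builds an atoms list by hand and groups the fingerprints by type. It assumes the 'type' entry holds a chemical symbol; the helper name fingerprints_by_type and all sample values are illustrative and not part of aenet.

```python
import numpy as np

# Hand-built records following the documented key layout (values are made up).
atoms = [
    {"type": "Ti", "fingerprint": np.array([0.1, 0.2]),
     "coords": np.array([0.0, 0.0, 0.0]), "forces": np.zeros(3)},
    {"type": "O", "fingerprint": np.array([0.3, 0.4]),
     "coords": np.array([1.0, 0.0, 0.0]), "forces": np.zeros(3)},
    {"type": "O", "fingerprint": np.array([0.5, 0.6]),
     "coords": np.array([0.0, 1.0, 0.0]), "forces": np.zeros(3)},
]


def fingerprints_by_type(atoms):
    """Hypothetical helper: stack fingerprints into one 2D array per type."""
    grouped = {}
    for atom in atoms:
        grouped.setdefault(atom["type"], []).append(atom["fingerprint"])
    return {symbol: np.vstack(fps) for symbol, fps in grouped.items()}


grouped = fingerprints_by_type(atoms)
```

Here grouped maps each symbol to an array with one row per atom of that type, which is the kind of input a per-type feature lookup would return.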

neighbor_info

Optional neighbor information for force training. If present, contains the keys ‘neighbor_counts’ (n_atoms,) array of neighbor counts, ‘neighbor_lists’ list of (nnb,) arrays with neighbor indices, and ‘neighbor_vectors’ list of (nnb, 3) arrays with displacement vectors.

Type:: dict or None

cell

Unit cell lattice vectors as (3, 3) array where rows are lattice vectors.

Type:: numpy.ndarray or None
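Because the rows of cell are the lattice vectors, fractional coordinates convert to Cartesian coordinates by right-multiplying with the cell matrix. The following sketch uses made-up lattice vectors and plain NumPy only; it does not call any aenet function.

```python
import numpy as np

# Rows are lattice vectors a, b, c (illustrative, non-orthogonal cell).
cell = np.array([
    [4.0, 0.0, 0.0],
    [2.0, 5.0, 0.0],
    [0.0, 0.0, 6.0],
])

# Fractional coordinates, one row per atom.
frac = np.array([
    [0.0, 1.0, 0.0],   # sits exactly on lattice vector b
    [0.5, 0.5, 0.5],   # cell center
])

# Row convention: r = f_a * a + f_b * b + f_c * c, i.e. frac @ cell.
cart = frac @ cell

# Cell volume is the absolute determinant of the lattice matrix.
volume = abs(np.linalg.det(cell))
```

With a column-vector convention the product would instead be cell.T-based, so mixing the two conventions silently yields wrong coordinates for non-orthogonal cells.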

pbc

True for 3D-periodic structures; False for isolated structures.

Type:: bool

Notes

The properties has_neighbor_info, has_cell, is_periodic, num_atoms, max_descriptor_length, composition, atom_weights, and atom_features are documented in their own entries below.

neighbor_info

Optional dictionary containing neighbor information for force training. If present, contains:

neighbor_counts: (n_atoms,) numpy array of neighbor counts per atom
neighbor_lists: List of (nnb,) numpy arrays with neighbor atom indices
neighbor_vectors: List of (nnb, 3) numpy arrays with displacement vectors

This information is used for computing force derivatives during training.

Type:: dict or None
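The documented layout can be illustrated with a hand-built dictionary and a small shape check. This is a sketch under stated assumptions: the function check_neighbor_info is hypothetical, and the three-atom data is invented for illustration.

```python
import numpy as np

# Invented three-atom example: atom 0 has two neighbors, atoms 1 and 2 have one.
neighbor_info = {
    "neighbor_counts": np.array([2, 1, 1]),
    "neighbor_lists": [np.array([1, 2]), np.array([0]), np.array([0])],
    "neighbor_vectors": [
        np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]),
        np.array([[-1.0, 0.0, 0.0]]),
        np.array([[0.0, -1.0, 0.0]]),
    ],
}


def check_neighbor_info(info):
    """Hypothetical check: per-atom array shapes must match the counts."""
    counts = info["neighbor_counts"]
    lists = info["neighbor_lists"]
    vectors = info["neighbor_vectors"]
    if not (len(counts) == len(lists) == len(vectors)):
        return False
    return all(
        idx.shape == (int(n),) and vec.shape == (int(n), 3)
        for n, idx, vec in zip(counts, lists, vectors)
    )


is_consistent = check_neighbor_info(neighbor_info)

# Neighbor distances follow directly from the displacement vectors.
distances = [np.linalg.norm(v, axis=1) for v in neighbor_info["neighbor_vectors"]]
```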

property has_neighbor_info

Returns True if neighbor information is available for force training.

Returns:: Whether the structure contains neighbor information
Return type:: bool

property atom_features

atom_features_for_type(atom_type: str)[source]

Return only the features for atoms of a selected type.

Parameters:: atom_type (str) – Chemical symbol.

property atom_weights

property avec

Get unit cell lattice vectors.

Returns the cell stored in the HDF5 file if available; otherwise falls back to reading it from the XSF file, which keeps the legacy binary format supported.

property composition

property coords

property forces

global_moment_fingerprint(outer_moment: int = 1, inner_moment: int = 1, weighted: bool = False, weights: dict = None, append_weighted: bool = False, stack_type_features: bool = False, exclude_zero_atoms: bool = False, atom_types: List[str] = None)[source]

Calculate a global fingerprint from local atomic fingerprints using a moment expansion.

This implementation assumes that atomic descriptors for each species have the same length.

Parameters:

outer_moment (int, default=1) – Up to which outer moment to compute. Must be >= 1. Not used when stack_type_features is True.
inner_moment (int, default=1) – Up to which inner moment to compute. Must be >= 0 (0 = no moment, 1 = mean).
weighted (bool, default=False) – Whether to apply species weights to the type fingerprints.
weights (dict[str, float] or None, default=None) – Mapping of atom symbol to weight. Defaults to self.atom_weights when weighted is True.
append_weighted (bool, default=False) – If True and weighted is True, append the weighted features to the unweighted features; otherwise only return weighted features.
stack_type_features (bool, default=False) – If True, concatenate per-type feature vectors instead of performing the outer moment expansion.
exclude_zero_atoms (bool, default=False) – If True, skip species with zero count in the structure.
atom_types (list[str] or None, default=None) – Subset of chemical symbols to consider. Defaults to all self.atom_types.

Returns:

Global fingerprint vector.

Return type:

numpy.ndarray

Notes

The global fingerprint can be conceived as:

F_global = outer_moments(w_A * inner_moments(F_A)
                         U w_B * inner_moments(F_B) U ...)

where

F_global is the global fingerprint

F_s is the union of atomic fingerprints for species s (F_s = F_s(1) U F_s(2) U ...)

F_s(i) is the atomic fingerprint for species s at site i

w_s is the weight for species s

The dimension is len(type_fingerprint) * inner_moment * outer_moment (or len(type_fingerprint) * outer_moment if inner_moment is 0).
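The nested structure of the expansion can be sketched with plain NumPy. The snippet below is an illustration only and makes several assumptions that the text above does not confirm: moments are taken as raw moments (the mean of the k-th elementwise power), inner_moment is at least 1, and the function names raw_moments and global_fingerprint_sketch are hypothetical. The actual method may define its moments differently.

```python
import numpy as np


def raw_moments(vectors, order):
    """Concatenate elementwise raw moments 1..order taken over the rows."""
    vectors = np.asarray(vectors, dtype=float)
    return np.concatenate(
        [np.mean(vectors ** k, axis=0) for k in range(1, order + 1)]
    )


def global_fingerprint_sketch(features_by_type, inner_moment=1,
                              outer_moment=1, weights=None):
    """Inner moments per species, then outer moments across species."""
    type_vectors = []
    for symbol, feats in features_by_type.items():
        w = 1.0 if weights is None else weights[symbol]
        type_vectors.append(w * raw_moments(feats, inner_moment))
    return raw_moments(np.stack(type_vectors), outer_moment)


# Invented fingerprints: two Ti atoms and one O atom, descriptor length 2.
features = {
    "Ti": np.array([[1.0, 2.0], [3.0, 4.0]]),
    "O": np.array([[0.0, 2.0]]),
}
fp = global_fingerprint_sketch(features, inner_moment=2, outer_moment=1)
```

In this sketch the result has length 2 * 2 * 1 = 4, matching the dimension formula of descriptor length times inner_moment times outer_moment.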

property has_cell: Returns True if unit cell information is available.

property has_neighbor_info: Returns True if neighbor information is available for force training.

property is_periodic: Returns True if structure is periodic (3D-periodic).

property max_descriptor_length: Length of the longest fingerprint among all atoms of the atomic structure.

property num_atoms

property structure

property types

TrnSet

class aenet.trainset.TrnSet(name: str, normalized: bool, scale: float, shift: float, atom_types: List[str], atomic_energy: List[float], num_atoms_tot: int, num_structures: int, E_min: float, E_max: float, E_av: float, filename: PathLike = None, fileformat: str = 'ascii', schema: str = None, origin: PathLike = None, has_persisted_features: bool = False, **kwargs)[source]

Bases: object

Class for parsing aenet training set files.

Attention: atom type indices start at zero internally, whereas they start at 1 in Fortran.
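The offset matters when type indices taken from Fortran-side output are used to look up symbols in a Python list. A minimal sketch with invented values (the variable names are illustrative and not part of the TrnSet API):

```python
# Chemical symbols in the order the training set defines them (made up).
atom_types = ["Ti", "O"]

# One-based type indices, as they would appear on the Fortran side.
fortran_type_ids = [1, 2, 2]

# Shift to zero-based indices before indexing a Python list.
python_type_ids = [i - 1 for i in fortran_type_ids]
symbols = [atom_types[i] for i in python_type_ids]
```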

has_neighbor_info() → bool[source]

Check if the training set file contains neighbor information.

Neighbor information is only available for HDF5 format files that were generated with the include_neighbor_info=True option. This information is required for force training with PyTorch autograd.

Returns:: True if neighbor information is available (only for HDF5 format), False otherwise
Return type:: bool

Example:

from aenet.trainset import TrnSet

with TrnSet.from_file("features_with_neighbors.h5") as trnset:
    if trnset.has_neighbor_info():
        struct = trnset.read_structure(
            0,
            read_coords=True,
            read_forces=True,
        )

        if struct.has_neighbor_info:
            print(struct.neighbor_info["neighbor_counts"][0])

Use trnset.has_neighbor_info() to check whether the file stores neighbor-information tables at all, and struct.has_neighbor_info to check whether a particular returned structure exposes per-atom neighbor arrays.

close()[source]

classmethod from_ascii_file(ascii_file: PathLike, **kwargs)[source]

Load training set from aenet ASCII file.

Parameters:: ascii_file – path to an aenet training set file in ASCII format

classmethod from_file(filename: PathLike, file_format: str = 'guess', **kwargs)[source]

classmethod from_fortran_binary_file(binary_file: PathLike, ascii_file: PathLike = None, **kwargs)[source]: First convert training set file in Fortran binary format to ASCII format, then open it. This requires the tool ‘trnset2ASCII.x’.

classmethod from_hdf5_file(hdf5_file: PathLike, **kwargs)[source]

iter_structures(read_coords=False, read_forces=False)[source]

property num_types

open()[source]: Open training set file for reading.

read_next_structure(read_coords=False, read_forces=False)[source]

read_structure(idx: int, read_coords=False, read_forces=False)[source]

rewind()[source]

to_hdf5(filename: PathLike, complevel: int = 1)[source]: Save data set to HDF5 file.

Example Notebook

For the maintained end-to-end featurization workflows, including HDF5 export, PyTorch-backed HDF5 compatibility, optional GPU execution, and longer neighbor-information generation examples, see example-01-featurization.ipynb.

Usage Examples

Inspecting an Existing Training Set

This page focuses on inspecting already-generated training sets. For file-backed featurization or generation workflows, see the notebook linked above.

from aenet.trainset import TrnSet

with TrnSet.from_file("sample.h5") as trnset:
    print(trnset.schema)
    print(trnset.num_structures)
    print(trnset.atom_types)

    struct = trnset[0]
    print(struct.num_atoms)
    print(struct.atom_features.shape)

For HDF5 inputs, trnset.schema reports which on-disk schema was read. Current values are "trnset_hdf5" for featurizer-generated training sets and "torch_training_hdf5" for HDF5StructureDataset files.

Comparing HDF5 and ASCII Readers

Both backends expose the same high-level TrnSet API for inspection:

from aenet.trainset import TrnSet

with TrnSet.from_file("sample.h5") as trnset_h5, \
        TrnSet.from_file("sample.train.ascii") as trnset_ascii:
    struct_h5 = trnset_h5.read_structure(0, read_coords=True, read_forces=True)
    struct_ascii = trnset_ascii.read_structure(
        0,
        read_coords=True,
        read_forces=True,
    )

    assert trnset_h5.num_structures == trnset_ascii.num_structures
    assert trnset_h5.atom_types == trnset_ascii.atom_types
    assert struct_h5.atom_features.shape == struct_ascii.atom_features.shape
    assert struct_h5.coords.shape == struct_ascii.coords.shape

Checking Optional Neighbor Information

Use both the dataset-level and structure-level checks before consuming stored neighbor arrays:

from aenet.trainset import TrnSet

with TrnSet.from_file("features_with_neighbors.h5") as trnset:
    if trnset.has_neighbor_info():
        struct = trnset.read_structure(0, read_coords=True, read_forces=True)

        if struct.has_neighbor_info:
            print(struct.neighbor_info["neighbor_counts"][0])

Backward Compatibility

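Training sets without stored neighbor information remain readable. Structures read from such files report has_neighbor_info as False and neighbor_info as None, and ASCII training sets report no neighbor information at the dataset level: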

from aenet.trainset import TrnSet

with TrnSet.from_file("sample.h5") as trnset_h5:
    struct = trnset_h5.read_structure(0)
    assert not struct.has_neighbor_info
    assert struct.neighbor_info is None

with TrnSet.from_file("sample.train.ascii") as trnset_ascii:
    assert not trnset_ascii.has_neighbor_info()
