Training Set Management
Classes for interacting with aenet training set files.
Currently, only training set files in ASCII format are supported.
Classes
FeaturizedAtomicStructure
- class aenet.trainset.FeaturizedAtomicStructure(path: str, energy: float, atom_types: List[str], atoms: List[dict], neighbor_info: dict = None, cell: ndarray = None, pbc: bool = None)[source]
Bases: Serializable
Class to hold all information of an atomic structure.
- path
Path to the original structure file.
- Type:
str
- energy
Total energy of the structure.
- Type:
float
- atom_types
List of atom types (chemical symbols).
- Type:
list[str]
- atoms
Atomic information per atom with keys: {‘type’: atom_type, ‘fingerprint’: fingerprint, ‘coords’: coords, ‘forces’: forces}.
- Type:
list[dict]
- neighbor_info
Optional neighbor information for force training. If present, contains the keys ‘neighbor_counts’ (n_atoms,) array of neighbor counts, ‘neighbor_lists’ list of (nnb,) arrays with neighbor indices, and ‘neighbor_vectors’ list of (nnb, 3) arrays with displacement vectors.
- Type:
dict or None
- cell
Unit cell lattice vectors as (3, 3) array where rows are lattice vectors.
- Type:
numpy.ndarray or None
- pbc
True for 3D-periodic structures; False for isolated structures.
- Type:
bool
Notes
The properties has_neighbor_info, has_cell, is_periodic, num_atoms, max_descriptor_length, composition, atom_weights, and atom_features are documented in their own entries below.
- neighbor_info
Optional dictionary containing neighbor information for force training. If present, contains:
neighbor_counts: (n_atoms,) numpy array of neighbor counts per atom
neighbor_lists: List of (nnb,) numpy arrays with neighbor atom indices
neighbor_vectors: List of (nnb, 3) numpy arrays with displacement vectors
This information is used for computing force derivatives during training.
- Type:
dict or None
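The layout of the three arrays can be illustrated with a small NumPy sketch. This is not the aenet featurizer: build_neighbor_info is a hypothetical helper, the brute-force search only handles isolated (non-periodic) structures, and the sign convention of the displacement vectors (from the central atom to the neighbor) is an assumption.

```python
import numpy as np


def build_neighbor_info(coords, cutoff):
    """Hypothetical helper: brute-force neighbor search without periodicity."""
    coords = np.asarray(coords, dtype=float)
    n_atoms = len(coords)
    neighbor_lists, neighbor_vectors = [], []
    for i in range(n_atoms):
        # Displacements from atom i to every atom (assumed sign convention).
        disp = coords - coords[i]
        dist = np.linalg.norm(disp, axis=1)
        # Keep atoms inside the cutoff, excluding atom i itself.
        mask = (dist < cutoff) & (np.arange(n_atoms) != i)
        neighbor_lists.append(np.nonzero(mask)[0])  # shape (nnb,)
        neighbor_vectors.append(disp[mask])         # shape (nnb, 3)
    neighbor_counts = np.array([len(nl) for nl in neighbor_lists])
    return {
        "neighbor_counts": neighbor_counts,    # shape (n_atoms,)
        "neighbor_lists": neighbor_lists,
        "neighbor_vectors": neighbor_vectors,
    }


coords = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 3.0, 0.0]]
info = build_neighbor_info(coords, cutoff=1.5)
print(info["neighbor_counts"])
```

The third atom has no neighbor within the cutoff, so its entries in neighbor_lists and neighbor_vectors are empty arrays of shape (0,) and (0, 3).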
- property has_neighbor_info
Returns True if neighbor information is available for force training.
- Returns:
Whether the structure contains neighbor information
- Return type:
bool
- property atom_features
- atom_features_for_type(atom_type: str)[source]
Return only the features for atoms of a selected type.
- Parameters:
atom_type (str) – Chemical symbol.
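As a rough illustration of what selecting features by type means, the sketch below filters a list of per-atom dictionaries with the 'type' and 'fingerprint' keys described for the atoms attribute. The function features_for_type is a hypothetical stand-in, not the library method, and the return value for an absent type is an assumption.

```python
import numpy as np


def features_for_type(atoms, atom_type):
    """Hypothetical stand-in: stack the fingerprints of one atom type."""
    selected = [
        np.asarray(atom["fingerprint"], dtype=float)
        for atom in atoms
        if atom["type"] == atom_type
    ]
    if not selected:
        # Assumption: return an empty array when the type is absent.
        return np.empty((0, 0))
    return np.vstack(selected)  # shape (n_atoms_of_type, n_features)


atoms = [
    {"type": "Ti", "fingerprint": [0.1, 0.2]},
    {"type": "O", "fingerprint": [0.3, 0.4]},
    {"type": "O", "fingerprint": [0.5, 0.6]},
]
oxygen_features = features_for_type(atoms, "O")
print(oxygen_features.shape)
```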
- property atom_weights
- property avec
Get unit cell lattice vectors.
Returns the cell stored in the HDF5 file if available; otherwise reads it from the XSF file as a fallback for legacy binary format support.
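Because the lattice vectors are stored as the rows of a (3, 3) array, fractional and Cartesian coordinates are related by a single matrix product. The sketch below uses made-up cell and coordinate values for illustration only.

```python
import numpy as np

# Illustrative cell: each row is one lattice vector.
cell = np.array([
    [4.0, 0.0, 0.0],
    [0.0, 5.0, 0.0],
    [1.0, 0.0, 6.0],
])

# Fractional coordinates of two atoms (illustrative values).
frac = np.array([
    [0.5, 0.5, 0.0],
    [0.0, 0.0, 0.5],
])

# Row convention: Cartesian = fractional @ cell.
cart = frac @ cell

# Inverse mapping: solve cell.T @ frac.T = cart.T for the fractional coordinates.
frac_back = np.linalg.solve(cell.T, cart.T).T
print(cart)
```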
- property composition
- property coords
- property forces
- global_moment_fingerprint(outer_moment: int = 1, inner_moment: int = 1, weighted: bool = False, weights: dict = None, append_weighted: bool = False, stack_type_features: bool = False, exclude_zero_atoms: bool = False, atom_types: List[str] = None)[source]
Calculate a global fingerprint from local atomic fingerprints using a moment expansion.
This implementation assumes that atomic descriptors for each species have the same length.
- Parameters:
outer_moment (int, default=1) – Up to which outer moment to compute. Must be >= 1. Not used when stack_type_features is True.
inner_moment (int, default=1) – Up to which inner moment to compute. Must be >= 0 (0 = no moment, 1 = mean).
weighted (bool, default=False) – Whether to apply species weights to the type fingerprints.
weights (dict[str, float] or None, default=None) – Mapping of atom symbol to weight. Defaults to self.atom_weights when weighted is True.
append_weighted (bool, default=False) – If True and weighted is True, append the weighted features to the unweighted features; otherwise only return weighted features.
stack_type_features (bool, default=False) – If True, concatenate per-type feature vectors instead of performing the outer moment expansion.
exclude_zero_atoms (bool, default=False) – If True, skip species with zero count in the structure.
atom_types (list[str] or None, default=None) – Subset of chemical symbols to consider. Defaults to all self.atom_types.
- Returns:
Global fingerprint vector.
- Return type:
numpy.ndarray
Notes
The global fingerprint can be conceived as:
F_global = outer_moments(w_A * inner_moments(F_A) U w_B * inner_moments(F_B) U ...)
where
F_global is the global fingerprint
F_s is the union of atomic fingerprints for species s (F_s = F_s(1) U F_s(2) U ...)
F_s(i) is the atomic fingerprint for species s at site i
w_s is the weight for species s
The dimension is len(type_fingerprint) * inner_moment * outer_moment (or len(type_fingerprint) * outer_moment if inner_moment is 0).
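The two-level moment expansion can be sketched in NumPy as follows. This is an illustration under stated assumptions, not the library implementation: the function names are hypothetical, moment 1 is taken to be the mean and moment k >= 2 the k-th central moment, and only inner_moment >= 1 without the append_weighted, stack_type_features, and exclude_zero_atoms options is covered.

```python
import numpy as np


def moments(vectors, max_moment):
    """Concatenate moments 1..max_moment taken over the first axis.

    Assumption: moment 1 is the mean; moment k >= 2 is the k-th central moment.
    """
    vectors = np.asarray(vectors, dtype=float)
    mean = vectors.mean(axis=0)
    parts = [mean]
    for k in range(2, max_moment + 1):
        parts.append(((vectors - mean) ** k).mean(axis=0))
    return np.concatenate(parts)


def global_fingerprint_sketch(features_by_type, inner_moment=1,
                              outer_moment=1, weights=None):
    """Hypothetical sketch: inner moments per species, outer moments across species."""
    type_vectors = []
    for symbol, feats in features_by_type.items():
        w = 1.0 if weights is None else weights[symbol]
        # Inner expansion over the atoms of one species, scaled by w_s.
        type_vectors.append(w * moments(feats, inner_moment))
    # Outer expansion over the per-species vectors.
    return moments(np.vstack(type_vectors), outer_moment)


features_by_type = {
    "Ti": np.array([[1.0, 2.0, 3.0], [3.0, 4.0, 5.0]]),
    "O": np.zeros((4, 3)),
}
fp_mean = global_fingerprint_sketch(features_by_type)
fp_full = global_fingerprint_sketch(features_by_type, inner_moment=2, outer_moment=2)
print(fp_mean, fp_full.shape)
```

With three features per atom, inner_moment=2, and outer_moment=2, the sketch returns a vector of length 3 * 2 * 2 = 12, which matches the dimension formula above.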
- property has_cell
Returns True if unit cell information is available.
- property is_periodic
Returns True if structure is periodic (3D-periodic).
- property max_descriptor_length
Dimension of the longest fingerprint among all atoms of the atomic structure.
- property num_atoms
- property structure
- property types
TrnSet
- class aenet.trainset.TrnSet(name: str, normalized: bool, scale: float, shift: float, atom_types: List[str], atomic_energy: List[float], num_atoms_tot: int, num_structures: int, E_min: float, E_max: float, E_av: float, filename: PathLike = None, fileformat: str = 'ascii', schema: str = None, origin: PathLike = None, has_persisted_features: bool = False, **kwargs)[source]
Bases: object
Class for parsing aenet training set files.
- Attention: atom type indices here internally start with zero
(whereas they start with 1 in Fortran)
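A minimal illustration of the index shift, assuming type indices arrive 1-based from a Fortran-side source. The variable names are hypothetical and not part of the TrnSet API.

```python
atom_types = ["Ti", "O"]

# Hypothetical 1-based type indices as a Fortran code would number them.
fortran_type_indices = [1, 2, 2]

# Shift to the zero-based convention before indexing Python lists.
python_type_indices = [i - 1 for i in fortran_type_indices]
symbols = [atom_types[i] for i in python_type_indices]
print(symbols)
```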
- has_neighbor_info() bool[source]
Check if the training set file contains neighbor information.
Neighbor information is only available for HDF5 format files that were generated with the include_neighbor_info=True option. This information is required for force training with PyTorch autograd.
- Returns:
True if neighbor information is available (only for HDF5 format), False otherwise
- Return type:
bool
Example:
from aenet.trainset import TrnSet

with TrnSet.from_file("features_with_neighbors.h5") as trnset:
    if trnset.has_neighbor_info():
        struct = trnset.read_structure(
            0,
            read_coords=True,
            read_forces=True,
        )
        if struct.has_neighbor_info:
            print(struct.neighbor_info["neighbor_counts"][0])
Use trnset.has_neighbor_info() to check whether the file stores neighbor-information tables at all, and struct.has_neighbor_info to check whether a particular returned structure exposes per-atom neighbor arrays.
- classmethod from_ascii_file(ascii_file: PathLike, **kwargs)[source]
Load training set from aenet ASCII file.
- Parameters:
ascii_file – path to an aenet training set file in ASCII format
- classmethod from_fortran_binary_file(binary_file: PathLike, ascii_file: PathLike = None, **kwargs)[source]
Convert a training set file from Fortran binary format to ASCII format, then load it. This requires the tool ‘trnset2ASCII.x’.
- property num_types
Example Notebook
For the maintained end-to-end featurization workflows, including HDF5 export, PyTorch-backed HDF5 compatibility, optional GPU execution, and longer neighbor-information generation examples, see example-01-featurization.ipynb.
Usage Examples
Inspecting an Existing Training Set
Keep this page focused on inspecting already-generated training sets. Prefer the notebook linked above for file-backed featurization or generation workflows.
from aenet.trainset import TrnSet
with TrnSet.from_file("sample.h5") as trnset:
    print(trnset.schema)
    print(trnset.num_structures)
    print(trnset.atom_types)
    struct = trnset[0]
    print(struct.num_atoms)
    print(struct.atom_features.shape)
For HDF5 inputs, trnset.schema reports which on-disk schema was read.
Current values are "trnset_hdf5" for featurizer-generated training sets
and "torch_training_hdf5" for HDF5StructureDataset files.
Comparing HDF5 and ASCII Readers
Both backends expose the same high-level TrnSet API for inspection:
from aenet.trainset import TrnSet
with TrnSet.from_file("sample.h5") as trnset_h5, \
        TrnSet.from_file("sample.train.ascii") as trnset_ascii:
    struct_h5 = trnset_h5.read_structure(0, read_coords=True, read_forces=True)
    struct_ascii = trnset_ascii.read_structure(
        0,
        read_coords=True,
        read_forces=True,
    )
    assert trnset_h5.num_structures == trnset_ascii.num_structures
    assert trnset_h5.atom_types == trnset_ascii.atom_types
    assert struct_h5.atom_features.shape == struct_ascii.atom_features.shape
    assert struct_h5.coords.shape == struct_ascii.coords.shape
Checking Optional Neighbor Information
Use both the dataset-level and structure-level checks before consuming stored neighbor arrays:
from aenet.trainset import TrnSet
with TrnSet.from_file("features_with_neighbors.h5") as trnset:
    if trnset.has_neighbor_info():
        struct = trnset.read_structure(0, read_coords=True, read_forces=True)
        if struct.has_neighbor_info:
            print(struct.neighbor_info["neighbor_counts"][0])
Backward Compatibility
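Training sets without neighbor information continue to load as before. Their structures report has_neighbor_info as False and neighbor_info as None, and ASCII training sets never report neighbor information: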
from aenet.trainset import TrnSet
with TrnSet.from_file("sample.h5") as trnset_h5:
    struct = trnset_h5.read_structure(0)
    assert not struct.has_neighbor_info
    assert struct.neighbor_info is None
with TrnSet.from_file("sample.train.ascii") as trnset_ascii:
    assert not trnset_ascii.has_neighbor_info()
See Also
PyTorch-Based Featurization - PyTorch-based featurization APIs