Advanced Structure Transformations

This guide covers advanced features including transformation chains, stochastic sampling, reproducibility, and iterator composition patterns.

See Basic Structure Transformations for an introduction to individual transformations.

Transformation Chains

TransformationChain allows sequential application of multiple transformations using depth-first streaming. This is useful for generating complex structural variations by composing simple transformations.

Basic Chain Example

from aenet.geometry import AtomicStructure
from aenet.geometry.transformations import (
    AtomDisplacementTransformation,
    CellVolumeTransformation,
    TransformationChain
)

structure = AtomicStructure.from_file('structure.xsf')

# Create a chain of transformations
chain = TransformationChain([
    AtomDisplacementTransformation(displacement=0.05),
    CellVolumeTransformation(min_percent=-2, max_percent=2, steps=3)
])

# Process structures lazily
for s in chain.apply_transformation(structure):
    process(s)

# Or get all structures at once
all_structures = list(chain.apply_transformation(structure))
print(f"Generated {len(all_structures)} structures")
# Output: Generated 72 structures (24 displaced × 3 volumes)

How chains work:

The chain applies transformations sequentially using depth-first streaming:

  1. First transformation generates structures from input

  2. Each generated structure is immediately passed to the second transformation

  3. Results flow through the entire chain before the next structure is processed

This depth-first approach ensures correct multiplicative behavior and memory efficiency.

Controlling Output Size

Since chains can produce many structures, use standard Python tools to limit output:

Using itertools.islice():

import itertools

chain = TransformationChain([
    AtomDisplacementTransformation(displacement=0.05),
    CellVolumeTransformation(min_percent=-5, max_percent=5, steps=10)
])

# Get only first 100 structures
first_100 = list(itertools.islice(
    chain.apply_transformation(structure), 100
))

Using a counter:

results = []
for i, s in enumerate(chain.apply_transformation(structure)):
    if i >= 1000:
        break
    results.append(s)

Filtering with conditions:

# Only keep structures meeting criteria
good_structures = [
    s for s in itertools.islice(
        chain.apply_transformation(structure), 10000
    )
    if meets_criteria(s)
][:100]  # Keep first 100 that meet criteria

Iterator Composition

Chains are iterators, so they compose naturally with Python’s itertools:

Interleaving multiple chains:

import itertools

chain1 = TransformationChain([transform1, transform2])
chain2 = TransformationChain([transform3, transform4])

# Alternate between chains
combined = itertools.chain(
    chain1.apply_transformation(structure),
    chain2.apply_transformation(structure)
)

Processing in batches:

def batch_iter(iterable, batch_size):
    """Yield batches of structures."""
    iterator = iter(iterable)
    while True:
        batch = list(itertools.islice(iterator, batch_size))
        if not batch:
            break
        yield batch

# Process 10 structures at a time
for batch in batch_iter(chain.apply_transformation(structure), 10):
    process_batch(batch)

Parallel processing:

from multiprocessing import Pool

def process_structure(s):
    # Expensive computation
    return result

# Generate structures lazily, process in parallel
with Pool(4) as pool:
    results = pool.map(
        process_structure,
        itertools.islice(chain.apply_transformation(structure), 1000)
    )

Stochastic Transformations

RandomDisplacementTransformation generates random atomic displacement vectors. The displacements can be optionally orthonormalized to create an independent basis of perturbations. Unlike deterministic transformations, it uses randomness and requires careful attention to reproducibility.

Basic Usage

from aenet.geometry.transformations import RandomDisplacementTransformation

# Generate random structures with 0.1 Å RMS displacement
# By default, generates 3N-3 orthonormal structures
transform = RandomDisplacementTransformation(
    rms=0.1,
    random_state=42,  # Seed for reproducibility
    orthonormalize=True,  # Generate orthonormal basis (default)
    remove_translations=True  # Remove translational modes (default)
)

# Get all random structures
random_structures = list(transform.apply_transformation(structure))
print(f"Generated {len(random_structures)} random structures")

# Generate non-orthonormalized random samples
transform_random = RandomDisplacementTransformation(
    rms=0.1,
    max_structures=50,  # Specify number of samples
    orthonormalize=False,  # Just random perturbations
    random_state=42
)

Parameters:

  • rms: Target root-mean-square displacement in Angstroms

  • max_structures: Maximum number of structures to generate. If None, defaults to 3N-3 (default: None)

  • random_state: Integer seed or numpy Generator for reproducibility

  • orthonormalize: If True, generate orthonormal basis via QR decomposition (default: True)

  • remove_translations: If True, remove 3 uniform translation modes (default: True)

Algorithm Details

When orthonormalize=True (default):

The transformation generates orthonormal displacement vectors in 3N-dimensional space (N atoms, 3 coordinates per atom):

  1. Generate random 3N×M matrix (M = max_structures)

  2. Apply QR decomposition to obtain orthonormal columns

  3. Optionally project out 3 translational modes

  4. Re-orthonormalize remaining vectors via second QR decomposition

  5. Normalize each vector to target RMS

When orthonormalize=False:

Random displacement samples are generated independently:

  1. Generate random 3N-dimensional vector from normal distribution

  2. Optionally remove center-of-mass translation

  3. Normalize to target RMS

  4. Repeat for each structure (no orthogonality constraint)

RMS displacement is defined as:

\[\text{RMS} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \|\mathbf{d}_i\|^2}\]

where \(\mathbf{d}_i\) is the displacement vector for atom \(i\).

Orthogonality ensures displacement vectors are mutually independent:

\[\mathbf{v}_i \cdot \mathbf{v}_j = \delta_{ij}\]

where \(\delta_{ij}\) is the Kronecker delta (1 if i=j, 0 otherwise).

Translation Mode Removal

When remove_translations=True, the three uniform translational modes are removed:

\[\mathbf{t}_x = (\underbrace{1, 0, 0, 1, 0, 0, \ldots}_{\text{N atoms}}), \quad \mathbf{t}_y = (0, 1, 0, 0, 1, 0, \ldots), \quad \mathbf{t}_z = (0, 0, 1, 0, 0, 1, \ldots)\]

After projection, 3N-3 orthonormal vectors remain. For a single atom (3 degrees of freedom), this results in zero vectors, so the transformation yields no structures.

Re-orthonormalization after projection is critical to maintain mutual orthogonality of the remaining vectors.

Reproducibility

For scientific reproducibility, always control randomness in stochastic transformations.

Using Integer Seeds

# Same seed → identical results
transform1 = RandomDisplacementTransformation(rms=0.1, random_state=42)
transform2 = RandomDisplacementTransformation(rms=0.1, random_state=42)

results1 = list(transform1.apply_transformation(structure))
results2 = list(transform2.apply_transformation(structure))

# results1 and results2 are identical

Using numpy Generator

For more control, use a numpy Generator:

import numpy as np

# Create explicit generator
rng = np.random.default_rng(seed=12345)

transform = RandomDisplacementTransformation(
    rms=0.1,
    max_structures=50,
    random_state=rng
)

# Generator state is advanced after use
results = list(transform.apply_transformation(structure))

# Reusing same generator produces different results (state advanced)
more_results = list(transform.apply_transformation(structure))

Reproducibility Checklist

For fully reproducible workflows:

  1. Pin package versions in requirements.txt or environment.yml

  2. Always set random_state for stochastic transformations

  3. Document seeds in scripts and logs

  4. Version control transformation parameters

  5. Record numpy version (RNG implementation may vary)

  6. Log all generated structures with metadata

Example:

import json
import numpy as np
from aenet.geometry.transformations import RandomDisplacementTransformation

# Configuration
config = {
    'rms': 0.1,
    'max_structures': 100,
    'seed': 42,
    'orthonormalize': True,
    'remove_translations': True,
    'numpy_version': np.__version__
}

# Save configuration
with open('transform_config.json', 'w') as f:
    json.dump(config, f, indent=2)

# Apply transformation
transform = RandomDisplacementTransformation(
    rms=config['rms'],
    max_structures=config['max_structures'],
    random_state=config['seed'],
    orthonormalize=config['orthonormalize'],
    remove_translations=config['remove_translations']
)

structures = list(transform.apply_transformation(structure))

Complete Workflow Example

Generate training data for a machine learning potential:

from aenet.geometry import AtomicStructure
from aenet.geometry.transformations import (
    RandomDisplacementTransformation,
    CellVolumeTransformation,
    IsovolumetricStrainTransformation,
    TransformationChain
)
import itertools

# Load reference structure
structure = AtomicStructure.from_file('reference.xsf')

# Define workflow: random displacements + volume + strain
chain = TransformationChain([
    RandomDisplacementTransformation(
        rms=0.08,
        max_structures=20,
        random_state=42
    ),
    CellVolumeTransformation(
        min_percent=-3,
        max_percent=3,
        steps=3
    ),
    IsovolumetricStrainTransformation(
        direction=1,
        len_min=0.95,
        len_max=1.05,
        steps=3
    )
])

# Generate up to 500 structures
print("Generating training structures...")
training_structures = list(itertools.islice(
    chain.apply_transformation(structure), 500
))
print(f"Generated {len(training_structures)} structures")

# Save for training
for i, s in enumerate(training_structures):
    s.to_file(f'training_data/struct_{i:04d}.xsf')

print("Training data generation complete")

This workflow:

  1. Generates 20 random displacements (orthonormal, RMS=0.08 Å)

  2. Applies 3 volume scalings to each

  3. Applies 3 isovolumetric strains to each

  4. Limits total output to 500 structures

  5. Saves all structures for training

Expected output: 20 × 3 × 3 = 180 structures (or 500 if limited)

Performance Considerations

Computational Complexity:

  • AtomDisplacementTransformation: O(N) per structure, yields 3N structures

  • CellVolumeTransformation: O(1) per structure, yields M structures

  • IsovolumetricStrainTransformation: O(1) per structure, yields M structures

  • ShearStrainTransformation: O(1) per structure, yields M structures

  • RandomDisplacementTransformation: O(N²M) for QR decomposition (when orthonormalize=True)

where N = number of atoms, M = number of steps/structures.

Memory Usage:

  • Iterator-based design: Only one structure in memory at a time

  • Use list() only when you need all structures at once

  • Process structures as they’re generated for maximum efficiency

Recommendations:

  • For small systems (< 100 atoms): No special considerations needed

  • For medium systems (100-1000 atoms): Use itertools.islice() to limit output

  • For large systems (> 1000 atoms): Process structures one at a time

  • QR decomposition for N=100 atoms, M=100 structures: < 1 second

Memory-Efficient Patterns

Process and discard:

# Don't store all structures
for s in chain.apply_transformation(structure):
    result = expensive_calculation(s)
    save_result(result)
    # Structure is discarded after processing

Accumulate results, not structures:

# Store computed properties, not structures
energies = []
for s in chain.apply_transformation(structure):
    energy = compute_energy(s)
    energies.append(energy)

Stream to disk:

# Write structures as they're generated
for i, s in enumerate(chain.apply_transformation(structure)):
    if i >= 1000:
        break
    s.to_file(f'output_{i:04d}.xsf')

Combining Deterministic and Stochastic

Mix deterministic and stochastic transformations for comprehensive sampling:

import itertools

# Deterministic baseline
deterministic_chain = TransformationChain([
    CellVolumeTransformation(min_percent=-5, max_percent=5, steps=5),
    IsovolumetricStrainTransformation(direction=1, len_min=0.9, len_max=1.1, steps=5)
])

# Stochastic perturbations
stochastic = RandomDisplacementTransformation(rms=0.05, max_structures=10, random_state=42)

# Combine: deterministic structures + random displacements for each
all_structures = []
for base_structure in deterministic_chain.apply_transformation(structure):
    # Original deterministic structure
    all_structures.append(base_structure)
    # Plus random variations
    for displaced in stochastic.apply_transformation(base_structure):
        all_structures.append(displaced)

print(f"Total: {len(all_structures)} structures")
# 25 deterministic + 25×10 random = 275 structures

Troubleshooting

Problem: Too many structures generated

Solution: Use itertools.islice() to limit output

Problem: Out of memory

Solution: Process structures one at a time, don’t use list()

Problem: Non-reproducible results

Solution: Always set random_state for stochastic transformations

Problem: Structures too similar

Solution: Increase displacement/strain magnitudes or reduce steps

Problem: QR decomposition slow

Solution: Reduce max_structures for RandomDisplacementTransformation, or use orthonormalize=False for faster random sampling

Problem: Not enough structures for single atom

Solution: With remove_translations=True, single atoms have 3N-3=0 vectors; use remove_translations=False to get 3 displacement vectors

Further Reading