Reproducibility in Albumentations
Reproducibility is crucial for scientific experiments, debugging, and production deployments. This guide covers everything you need to know about creating reproducible augmentation pipelines in Albumentations.
Quick Start
To make your augmentations reproducible, set the seed parameter in Compose:
import albumentations as A
# This pipeline will produce the same augmentations
# every time it's instantiated with the same seed
transform = A.Compose([
    A.RandomCrop(height=256, width=256),
    A.HorizontalFlip(p=0.5),
    A.RandomBrightnessContrast(p=0.2),
], seed=137)
Key Concepts
1. Independent Random State
Albumentations maintains its own internal random state, completely independent of global random seeds. This design ensures:
- Pipeline reproducibility is not affected by external code
- Multiple pipelines can coexist without interfering with each other
- Your augmentations remain consistent regardless of other random operations in your code
import random
import numpy as np
import albumentations as A
# These global seeds DO NOT affect Albumentations
np.random.seed(137)
random.seed(137)
# Only the seed parameter in Compose controls reproducibility
transform = A.Compose([
    A.RandomRotate90(p=0.5),
    A.RandomBrightnessContrast(p=0.2),
], seed=137)  # This is what matters
2. Seed Behavior
When you set a seed in Compose:
- Two instances with the same seed produce identical sequences:
transform1 = A.Compose([...], seed=137)
transform2 = A.Compose([...], seed=137)
# transform1 and transform2 will apply the same random parameters
- Each call still produces random augmentations:
transform = A.Compose([...], seed=137)
# Different random augmentations for each call
result1 = transform(image=image1)
result2 = transform(image=image2)
# But the sequence is reproducible when recreating the pipeline
- No seed means truly random behavior:
transform = A.Compose([...])  # seed=None by default
# Different random sequence every time you create the pipeline
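To verify the first point end to end, here is a minimal, self-contained sketch (the two-transform pipeline and random test image are illustrative, not from this guide):
import numpy as np
import albumentations as A

def make_pipeline(seed):
    return A.Compose([
        A.HorizontalFlip(p=0.5),
        A.RandomBrightnessContrast(p=0.5),
    ], seed=seed)

image = np.random.randint(0, 256, (64, 64, 3), dtype=np.uint8)
# Two fresh instances with the same seed produce the same first output
out1 = make_pipeline(137)(image=image)['image']
out2 = make_pipeline(137)(image=image)['image']
assert np.array_equal(out1, out2)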
Common Use Cases
1. Reproducible Training Experiments
def create_train_transform(seed=None):
    """Create a training augmentation pipeline with optional seed."""
    return A.Compose([
        A.RandomResizedCrop(size=(224, 224), scale=(0.8, 1.0)),
        A.HorizontalFlip(p=0.5),
        A.ColorJitter(
            brightness=0.2,
            contrast=0.2,
            saturation=0.2,
            hue=0.1,
            p=0.8
        ),
        A.GaussNoise(p=0.2),
        A.Normalize(
            mean=[0.485, 0.456, 0.406],
            std=[0.229, 0.224, 0.225]
        ),
    ], seed=seed)
# For reproducible experiments
train_transform = create_train_transform(seed=137)
# For production (no fixed seed, different augmentations each run)
train_transform = create_train_transform(seed=None)
2. Debugging with Fixed Seeds
When debugging augmentation issues, use a fixed seed to ensure consistent behavior:
# Debug mode - same augmentations every run
debug_transform = A.Compose([
    A.RandomCrop(height=256, width=256),
    A.ShiftScaleRotate(
        shift_limit=0.1,
        scale_limit=0.2,
        rotate_limit=30,
        p=1.0  # Always apply for debugging
    ),
], seed=137)
# Test with the same image multiple times
for i in range(3):
    result = debug_transform(image=test_image)
    # Each call draws new parameters, but the three-result sequence
    # is identical every time this script runs
3. A/B Testing Augmentation Strategies
Compare different augmentation strategies with controlled randomness:
# Strategy A with fixed seed
strategy_a = A.Compose([
    A.RandomRotate90(p=0.5),
    A.RandomBrightnessContrast(p=0.3),
], seed=100)
# Strategy B with the same seed for fair comparison
strategy_b = A.Compose([
    A.HorizontalFlip(p=0.5),
    A.ColorJitter(p=0.3),
], seed=100)
# Both will use the same random sequence for probability checks
4. Multi-Stage Pipelines
When using multiple Compose instances in sequence, each can have its own seed:
# Stage 1: Geometric transforms
geometric = A.Compose([
    A.RandomRotate90(p=0.5),
    A.HorizontalFlip(p=0.5),
], seed=137)
# Stage 2: Color transforms
color = A.Compose([
    A.RandomBrightnessContrast(p=0.5),
    A.HueSaturationValue(p=0.5),
], seed=137)
# Apply stages sequentially
image = geometric(image=image)['image']
image = color(image=image)['image']
Resetting Seeds for Existing Pipelines
You can reset the random seed of an existing pipeline without recreating it:
import albumentations as A
# Create a pipeline
transform = A.Compose([
    A.RandomCrop(height=256, width=256),
    A.HorizontalFlip(p=0.5),
], seed=137)
# Apply some augmentations
result1 = transform(image=image)
# Reset to a new seed
transform.set_random_seed(200)
# Now uses the new seed
result2 = transform(image=image)
# Reset to original seed
transform.set_random_seed(137)
# You can also set random state directly from generators
import numpy as np
import random
rng = np.random.default_rng(100)
py_rng = random.Random(100)
transform.set_random_state(rng, py_rng)
DataLoader Workers and Reproducibility
Key Concept: In AlbumentationsX, the augmentation sequence depends on BOTH the seed AND the number of workers. Using seed=137 with num_workers=4 produces different results than seed=137 with num_workers=8. This is by design to maximize augmentation diversity in parallel processing.
Automatic Worker Seed Handling
AlbumentationsX automatically handles seed synchronization when used with PyTorch DataLoader workers:
import torch
from torch.utils.data import Dataset, DataLoader
import albumentations as A
class MyDataset(Dataset):
    def __init__(self, data, transform=None):
        self.data = data
        self.transform = transform

    def __getitem__(self, idx):
        image = self.data[idx]
        if self.transform:
            image = self.transform(image=image)['image']
        return image

    def __len__(self):
        return len(self.data)
# Create transform with seed
transform = A.Compose([
    A.RandomCrop(height=256, width=256),
    A.HorizontalFlip(p=0.5),
], seed=137)
dataset = MyDataset(images, transform=transform)
# Each worker gets a unique, reproducible seed automatically
dataloader = DataLoader(
    dataset,
    batch_size=32,
    num_workers=4,  # Multiple workers
    shuffle=True
)
How Worker Seeds Work
- Base Seed: When you set seed=137 in Compose, this becomes the base seed
- Worker Differentiation: Each worker automatically gets a unique seed based on:
  - The base seed (137)
  - PyTorch's worker-specific torch.initial_seed()
- Reproducibility: The same worker ID always gets the same effective seed across runs
- Respawn Handling: Seeds update correctly when workers are respawned
The effective seed formula:
effective_seed = (base_seed + torch.initial_seed()) % (2**32)
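As a rough illustration of this formula (log_effective_seed is a hypothetical helper, not a library API; it reuses the dataset from the example above), you can print each worker's effective seed from a worker_init_fn:
import torch
from torch.utils.data import DataLoader

def log_effective_seed(worker_id):
    # Mirrors the formula above, for inspection only
    base_seed = 137  # the seed passed to A.Compose
    # torch.initial_seed() is already worker-specific inside a DataLoader worker
    effective_seed = (base_seed + torch.initial_seed()) % (2**32)
    print(f"worker {worker_id}: effective seed {effective_seed}")

loader = DataLoader(dataset, num_workers=4, worker_init_fn=log_effective_seed)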
Important: Same Seed, Different num_workers = Different Augmentations
Critical Note: Using the same seed with different num_workers settings will produce different augmentation sequences:
# Same seed=137, but different num_workers -> Different results!
transform = A.Compose([...], seed=137)
# With 1 worker
loader1 = DataLoader(dataset, num_workers=1)
# Worker 0 gets: effective_seed = 137 + torch_seed_0
# With 4 workers
loader2 = DataLoader(dataset, num_workers=4)
# Worker 0 gets: effective_seed = 137 + torch_seed_0
# Worker 1 gets: effective_seed = 137 + torch_seed_1
# Worker 2 gets: effective_seed = 137 + torch_seed_2
# Worker 3 gets: effective_seed = 137 + torch_seed_3
# Different data distribution across workers = different overall results!
# With 8 workers
loader3 = DataLoader(dataset, num_workers=8)
# 8 different effective seeds = yet another different result!
This is by design to ensure:
- Each worker produces unique augmentations (no duplicates)
- Maximum augmentation diversity in parallel processing
- Reproducibility when using the SAME num_workers configuration
Key insight: The augmentation sequence depends on BOTH the seed AND num_workers. To get identical results, you must use the same seed AND the same num_workers.
Manual Worker Seed Management
If you need custom worker seed logic:
def worker_init_fn(worker_id):
    # Custom per-worker setup; torch.initial_seed() is worker-specific here
    worker_seed = torch.initial_seed() % 2**32
    # Use worker_seed to seed any other libraries you rely on; the
    # Albumentations transform differentiates workers automatically

dataloader = DataLoader(
    dataset,
    num_workers=4,
    worker_init_fn=worker_init_fn
)
Single Process vs Multi-Process
# Single process (num_workers=0)
# Uses base seed directly
transform = A.Compose([...], seed=137)
loader = DataLoader(dataset, num_workers=0)
# Always produces the same sequence
# Multi-process (num_workers>0)
# Each worker gets unique seed automatically
loader = DataLoader(dataset, num_workers=4)
# Each worker produces different sequences
# But sequences are reproducible across runs
Making Augmentations Identical Across Different num_workers
Important: By design, different num_workers values produce different augmentation sequences even with the same seed, because each worker gets a unique effective seed. If you need identical augmentations regardless of num_workers (an unlikely but possible use case), here are some workarounds.
Note: When using persistent_workers=True, the difference becomes more pronounced because worker seed state may not reset properly between epochs.
# Option 1: Force same seed for all workers (ignores worker ID)
class IdenticalAugmentDataset(Dataset):
    def __init__(self, data):
        self.data = data
        # Create fixed random generators shared across all workers
        self.rng = np.random.default_rng(137)
        self.py_rng = random.Random(137)
        self.transform = A.Compose([
            A.RandomCrop(height=256, width=256),
            A.HorizontalFlip(p=0.5),
        ])  # No seed here!

    def __getitem__(self, idx):
        # Force the same random state regardless of worker
        self.transform.set_random_state(self.rng, self.py_rng)
        image = self.data[idx]
        return self.transform(image=image)['image']
# Now num_workers=4 and num_workers=8 produce identical sequences
# Option 2: Use a fixed seed that ignores worker differentiation
def worker_init_fn(worker_id):
    # Override the automatic worker seed differentiation
    worker_info = torch.utils.data.get_worker_info()
    if worker_info is not None:
        dataset = worker_info.dataset
        # Use the same seed for all workers (not recommended for training!)
        dataset.transform = A.Compose([
            A.RandomCrop(height=256, width=256),
            A.HorizontalFlip(p=0.5),
        ], seed=137)  # Same seed, ignoring worker_id
Warning: Making augmentations identical across different num_workers defeats the purpose of parallel data loading and reduces augmentation diversity. This is typically only useful for debugging or specific reproducibility requirements.
This behavior is discussed in AlbumentationsX Issue #81.
Custom Transforms and Reproducibility
When creating custom transforms, use the provided random generators to maintain reproducibility:
from albumentations.core.transforms_interface import DualTransform
class MyCustomTransform(DualTransform):
    def get_params_dependent_on_data(self, params, data):
        # CORRECT: Use self.py_random for Python's random operations
        random_value = self.py_random.uniform(0, 1)
        # CORRECT: Use self.random_generator for NumPy operations
        random_array = self.random_generator.uniform(0, 1, size=(3, 3))
        # WRONG: Don't use global random functions
        # bad_value = random.random()  # This ignores the seed!
        # bad_array = np.random.rand(3, 3)  # This also ignores the seed!
        return {"value": random_value, "array": random_array}

    def apply(self, img, value, array, **params):
        # Sketch: consume the sampled parameters deterministically here
        return img
See the Creating Custom Transforms Guide for more details.
Saving and Loading Pipelines
For perfect reproducibility across different environments or time, save your pipeline configuration:
# Save pipeline configuration
A.save(transform, 'augmentation_pipeline.json')
# Load the exact same pipeline later
transform = A.load('augmentation_pipeline.json')
Note: The loaded pipeline will have the same seed as the original. See the Serialization Guide for more details.
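As a quick sanity check, a sketch assuming the seed-preservation note above (the random test image is arbitrary): two fresh copies loaded from the same file should start with identical outputs.
import numpy as np

image = np.random.randint(0, 256, (256, 256, 3), dtype=np.uint8)
copy1 = A.load('augmentation_pipeline.json')
copy2 = A.load('augmentation_pipeline.json')
# Both copies carry the original seed, so their first outputs should match
assert np.array_equal(copy1(image=image)['image'], copy2(image=image)['image'])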
Tracking Applied Augmentations
To debug or analyze which augmentations were actually applied, use save_applied_params:
transform = A.Compose([
    A.RandomCrop(height=256, width=256),
    A.HorizontalFlip(p=0.5),
    A.RandomBrightnessContrast(p=0.5),
], save_applied_params=True, seed=137)
result = transform(image=image)
print(transform.applied_transforms)
# Shows exactly which transforms were applied and their parameters
Best Practices
- Development vs Production:
  - Use fixed seeds during development and debugging
  - Remove seeds (or use different seeds per epoch) in production training
  - Always use fixed seeds for validation/test transforms if you need comparable results
- Experiment Tracking (see the sketch after this list):
  - Log the seed value in your experiment tracking system
  - Save the complete pipeline configuration using A.save()
  - Document the Albumentations version used
  - Track the num_workers setting, as it affects augmentation sequences
- Testing:
  - Unit tests should always use fixed seeds
  - Integration tests may use random seeds to test robustness
  - Create separate test cases for both scenarios
  - Test with both single and multi-worker configurations
- Distributed Training:
  - AlbumentationsX automatically handles worker differentiation
  - Each worker gets a unique, reproducible seed based on base_seed + torch.initial_seed()
  - No need for manual seed = base_seed + worker_id logic
  - Seeds are automatically updated on worker respawn
- DataLoader Configuration:
  - Be aware that changing num_workers changes augmentation sequences
  - Document your num_workers setting for reproducibility
  - Use consistent num_workers across experiments for comparable results
  - Avoid persistent_workers=True if exact reproducibility is critical (see the known issue above)
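For the Experiment Tracking points, here is a lightweight sketch (log_augmentation_config is a hypothetical helper, not a library function; adapt it to your tracking system):
import json
import albumentations as A

def log_augmentation_config(transform, seed, num_workers, path='aug_run.json'):
    # Record everything that influences the augmentation sequence
    metadata = {
        'seed': seed,
        'num_workers': num_workers,
        'albumentations_version': A.__version__,
    }
    with open(path, 'w') as f:
        json.dump(metadata, f, indent=2)
    # Save the full pipeline definition alongside the metadata
    A.save(transform, path.replace('.json', '_pipeline.json'))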
Common Pitfalls
❌ Don't rely on global seeds
# This WILL NOT make Albumentations reproducible
np.random.seed(137)
random.seed(137)
transform = A.Compose([...]) # Still random!
❌ Don't forget that each Compose call is still random
transform = A.Compose([...], seed=137)
# These will be different (but reproducible sequence)
aug1 = transform(image=img)
aug2 = transform(image=img) # Different augmentation!
✅ Do create new instances for identical augmentations
# If you need the exact same augmentation
transform1 = A.Compose([...], seed=137)
transform2 = A.Compose([...], seed=137)
# Now these will be identical
aug1 = transform1(image=img)
aug2 = transform2(image=img) # Same augmentation!
Summary
This guide covered:
- Setting and resetting seeds for reproducible augmentations
- Automatic worker seed handling in PyTorch DataLoaders
- How different num_workers settings affect augmentation sequences
- Best practices for reproducible experiments
- Common pitfalls and how to avoid them
Related Topics
- Pipelines and Compose - Understanding pipeline configuration
- Probabilities - How probabilities interact with seeds
- Creating Custom Transforms - Making custom transforms reproducible
- Serialization - Saving and loading reproducible pipelines