Reproducibility in Albumentations
Reproducibility is crucial for scientific experiments, debugging, and production deployments. This guide covers everything you need to know about creating reproducible augmentation pipelines in Albumentations.
Quick Start
To make your augmentations reproducible, set the seed parameter in Compose:
import albumentations as A
# This pipeline will produce the same augmentations
# every time it's instantiated with the same seed
transform = A.Compose([
A.RandomCrop(height=256, width=256),
A.HorizontalFlip(p=0.5),
A.RandomBrightnessContrast(p=0.2),
], seed=137)
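As a quick sanity check (a minimal sketch using a random dummy image), two pipelines created with the same seed produce identical output for the same input:
import numpy as np
import albumentations as A

image = np.random.randint(0, 256, (512, 512, 3), dtype=np.uint8)  # dummy image

t1 = A.Compose([
    A.RandomCrop(height=256, width=256),
    A.HorizontalFlip(p=0.5),
], seed=137)
t2 = A.Compose([
    A.RandomCrop(height=256, width=256),
    A.HorizontalFlip(p=0.5),
], seed=137)

# Same seed, same input -> identical result
assert np.array_equal(t1(image=image)["image"], t2(image=image)["image"])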
Key Concepts
1. Independent Random State
Albumentations maintains its own internal random state that is completely independent of global random seeds. This design choice ensures:
- Pipeline reproducibility is not affected by external code
- Multiple pipelines can coexist without interfering with each other
- Your augmentations remain consistent regardless of other random operations in your code
import random
import numpy as np
import albumentations as A
# These global seeds DO NOT affect Albumentations
np.random.seed(137)
random.seed(137)
# Only the seed parameter in Compose controls reproducibility
transform = A.Compose([
A.RandomRotate90(p=0.5),
A.RandomBrightnessContrast(p=0.2),
], seed=137) # This is what matters
2. Seed Behavior
When you set a seed in Compose:
- Two instances with the same seed produce identical sequences:
transform1 = A.Compose([...], seed=137)
transform2 = A.Compose([...], seed=137)
# transform1 and transform2 will apply the same random parameters
- Each call still produces random augmentations:
transform = A.Compose([...], seed=137)
# Different random augmentations for each call
result1 = transform(image=image1)
result2 = transform(image=image2)
# But the sequence is reproducible when recreating the pipeline
- No seed means truly random behavior:
transform = A.Compose([...])  # seed=None by default
# Different random sequence every time you create the pipeline
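Continuing the sketch from the Quick Start (reusing the dummy image), consecutive calls differ, while recreating the pipeline replays the same sequence:
def make_pipeline():
    return A.Compose([
        A.RandomCrop(height=256, width=256),
        A.HorizontalFlip(p=0.5),
    ], seed=137)

t = make_pipeline()
seq_a = [t(image=image)["image"] for _ in range(3)]  # three different results

t = make_pipeline()  # recreate with the same seed
seq_b = [t(image=image)["image"] for _ in range(3)]

# The two sequences match element-wise
assert all(np.array_equal(a, b) for a, b in zip(seq_a, seq_b))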
Common Use Cases
1. Reproducible Training Experiments
def create_train_transform(seed=None):
"""Create a training augmentation pipeline with optional seed."""
return A.Compose([
A.RandomResizedCrop(size=(224, 224), scale=(0.8, 1.0)),
A.HorizontalFlip(p=0.5),
A.ColorJitter(
brightness=0.2,
contrast=0.2,
saturation=0.2,
hue=0.1,
p=0.8
),
A.GaussNoise(p=0.2),
A.Normalize(
mean=[0.485, 0.456, 0.406],
std=[0.229, 0.224, 0.225]
),
], seed=seed)
# For reproducible experiments
train_transform = create_train_transform(seed=137)
# For production (no fixed seed, different augmentations each run)
train_transform = create_train_transform(seed=None)
2. Debugging with Fixed Seeds
When debugging augmentation issues, use a fixed seed to ensure consistent behavior:
# Debug mode - same augmentations every run
debug_transform = A.Compose([
A.RandomCrop(height=256, width=256),
A.ShiftScaleRotate(
shift_limit=0.1,
scale_limit=0.2,
rotate_limit=30,
p=1.0 # Always apply for debugging
),
], seed=137)
# Run the same image through the pipeline several times
for i in range(3):
    result = debug_transform(image=test_image)
    # Each iteration applies different random parameters, but the full
    # sequence of results is identical every time the script runs
3. A/B Testing Augmentation Strategies
Compare different augmentation strategies with controlled randomness:
# Strategy A with fixed seed
strategy_a = A.Compose([
A.RandomRotate90(p=0.5),
A.RandomBrightnessContrast(p=0.3),
], seed=100)
# Strategy B with the same seed for fair comparison
strategy_b = A.Compose([
A.HorizontalFlip(p=0.5),
A.ColorJitter(p=0.3),
], seed=100)
# Both will use the same random sequence for probability checks
4. Multi-Stage Pipelines
When using multiple Compose instances in sequence, each can have its own seed:
# Stage 1: Geometric transforms
geometric = A.Compose([
A.RandomRotate90(p=0.5),
A.HorizontalFlip(p=0.5),
], seed=137)
# Stage 2: Color transforms
color = A.Compose([
A.RandomBrightnessContrast(p=0.5),
A.HueSaturationValue(p=0.5),
], seed=137)
# Apply stages sequentially
image = geometric(image=image)['image']
image = color(image=image)['image']
Resetting Seeds for Existing Pipelines
You can reset the random seed of an existing pipeline without recreating it:
import albumentations as A
# Create a pipeline
transform = A.Compose([
A.RandomCrop(height=256, width=256),
A.HorizontalFlip(p=0.5),
], seed=137)
# Apply some augmentations
result1 = transform(image=image)
# Reset to a new seed
transform.set_random_seed(200)
# Now uses the new seed
result2 = transform(image=image)
# Reset to original seed
transform.set_random_seed(137)
# You can also set random state directly from generators
import numpy as np
import random
rng = np.random.default_rng(100)
py_rng = random.Random(100)
transform.set_random_state(rng, py_rng)
DataLoader Workers and Reproducibility
Key Concept: In AlbumentationsX, the augmentation sequence depends on BOTH the seed AND the number of workers. Using seed=137 with num_workers=4 produces different results than seed=137 with num_workers=8. This is by design to maximize augmentation diversity in parallel processing.
Automatic Worker Seed Handling
AlbumentationsX automatically handles seed synchronization when used with PyTorch DataLoader workers:
import torch
from torch.utils.data import Dataset, DataLoader
import albumentations as A
class MyDataset(Dataset):
def __init__(self, data, transform=None):
self.data = data
self.transform = transform
def __getitem__(self, idx):
image = self.data[idx]
if self.transform:
image = self.transform(image=image)['image']
return image
def __len__(self):
return len(self.data)
# Create transform with seed
transform = A.Compose([
A.RandomCrop(height=256, width=256),
A.HorizontalFlip(p=0.5),
], seed=137)
dataset = MyDataset(images, transform=transform)
# Each worker gets a unique, reproducible seed automatically
dataloader = DataLoader(
dataset,
batch_size=32,
num_workers=4, # Multiple workers
shuffle=True
)
How Worker Seeds Work
- Base Seed: When you set seed=137 in Compose, this becomes the base seed
- Worker Differentiation: Each worker automatically gets a unique seed based on:
  - The base seed (137)
  - PyTorch's worker-specific torch.initial_seed()
- Reproducibility: The same worker ID always gets the same effective seed across runs
- Respawn Handling: Seeds update correctly when workers are respawned
The effective seed formula:
effective_seed = (base_seed + torch.initial_seed()) % (2**32)
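To observe this, here is a small sketch that logs each worker's effective seed (dataset stands in for your own Dataset instance):
import torch
from torch.utils.data import DataLoader

def log_worker_seed(worker_id):
    # torch.initial_seed() is worker-specific, so each worker
    # reports a different effective seed
    effective_seed = (137 + torch.initial_seed()) % (2**32)
    print(f"worker {worker_id}: effective seed = {effective_seed}")

loader = DataLoader(dataset, batch_size=32, num_workers=4, worker_init_fn=log_worker_seed)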
Important: Same Seed, Different num_workers = Different Augmentations
Critical Note: Using the same seed with different num_workers settings will produce different augmentation sequences:
# Same seed=137, but different num_workers -> Different results!
transform = A.Compose([...], seed=137)
# With 1 worker
loader1 = DataLoader(dataset, num_workers=1)
# Worker 0 gets: effective_seed = 137 + torch_seed_0
# With 4 workers
loader2 = DataLoader(dataset, num_workers=4)
# Worker 0 gets: effective_seed = 137 + torch_seed_0
# Worker 1 gets: effective_seed = 137 + torch_seed_1
# Worker 2 gets: effective_seed = 137 + torch_seed_2
# Worker 3 gets: effective_seed = 137 + torch_seed_3
# Different data distribution across workers = different overall results!
# With 8 workers
loader3 = DataLoader(dataset, num_workers=8)
# 8 different effective seeds = yet another different result!
This is by design to ensure:
- Each worker produces unique augmentations (no duplicates)
- Maximum augmentation diversity in parallel processing
- Reproducibility when using the SAME num_workers configuration
Key insight: The augmentation sequence depends on BOTH the seed AND num_workers. To get identical results, you must use the same seed AND the same num_workers.
Manual Worker Seed Management
If you need custom worker seed logic:
def worker_init_fn(worker_id):
    # Custom seed logic: the per-worker seed is available as
    worker_seed = torch.initial_seed() % 2**32
    # AlbumentationsX derives its own worker differentiation automatically,
    # so worker_seed is only needed for code outside the pipeline
dataloader = DataLoader(
dataset,
num_workers=4,
worker_init_fn=worker_init_fn
)
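For example (a standard PyTorch pattern, not specific to AlbumentationsX), worker_seed can seed libraries outside the pipeline while AlbumentationsX manages its own state:
import random
import numpy as np
import torch

def worker_init_fn(worker_id):
    worker_seed = torch.initial_seed() % 2**32
    # Seed Python and NumPy for any non-Albumentations randomness;
    # the Compose pipeline differentiates workers on its own
    np.random.seed(worker_seed)
    random.seed(worker_seed)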
Single Process vs Multi-Process
# Single process (num_workers=0)
# Uses base seed directly
transform = A.Compose([...], seed=137)
loader = DataLoader(dataset, num_workers=0)
# Always produces the same sequence
# Multi-process (num_workers>0)
# Each worker gets unique seed automatically
loader = DataLoader(dataset, num_workers=4)
# Each worker produces different sequences
# But sequences are reproducible across runs
Making Augmentations Identical Across Different num_workers
Important: By design, different num_workers values produce different augmentation sequences, even with the same seed, because each worker gets a unique effective seed. If you need identical augmentations regardless of num_workers (an unlikely but possible requirement), here are some workarounds:
Note: When using persistent_workers=True, the difference becomes more pronounced as the worker seed state may not reset properly between epochs.
# Option 1: Force same seed for all workers (ignores worker ID)
import random
import numpy as np
import albumentations as A
from torch.utils.data import Dataset

class IdenticalAugmentDataset(Dataset):
def __init__(self, data):
self.data = data
# Create fixed random generators shared across all workers
self.rng = np.random.default_rng(137)
self.py_rng = random.Random(137)
self.transform = A.Compose([
A.RandomCrop(height=256, width=256),
A.HorizontalFlip(p=0.5),
]) # No seed here!
def __getitem__(self, idx):
# Force the same random state regardless of worker
self.transform.set_random_state(self.rng, self.py_rng)
image = self.data[idx]
return self.transform(image=image)['image']
# Now num_workers=4 and num_workers=8 produce identical sequences
# Option 2: Use a fixed seed that ignores worker differentiation
def worker_init_fn(worker_id):
# Override the automatic worker seed differentiation
worker_info = torch.utils.data.get_worker_info()
if worker_info is not None:
dataset = worker_info.dataset
# Use the same seed for all workers (not recommended for training!)
dataset.transform = A.Compose([
A.RandomCrop(height=256, width=256),
A.HorizontalFlip(p=0.5),
], seed=137) # Same seed, ignoring worker_id
Warning: Making augmentations identical across different num_workers defeats the purpose of parallel data loading and reduces augmentation diversity. This is typically only useful for debugging or specific reproducibility requirements.
This behavior is discussed in AlbumentationsX Issue #81.
Custom Transforms and Reproducibility
When creating custom transforms, use the provided random generators to maintain reproducibility:
from albumentations.core.transforms_interface import DualTransform
class MyCustomTransform(DualTransform):
    def get_params_dependent_on_data(self, params, data):
        # CORRECT: Use self.py_random for Python's random operations
        random_value = self.py_random.uniform(0, 1)
        # CORRECT: Use self.random_generator for NumPy operations
        random_array = self.random_generator.uniform(0, 1, size=(3, 3))
        # WRONG: Don't use global random functions
        # bad_value = random.random()  # This ignores the seed!
        # bad_array = np.random.rand(3, 3)  # This also ignores the seed!
        return {"value": random_value, "array": random_array}

    def apply(self, img, value, array, **params):
        # The sampled parameters arrive here by name (no-op for brevity)
        return img
See the Creating Custom Transforms Guide for more details.
Saving and Loading Pipelines
For perfect reproducibility across different environments or time, save your pipeline configuration:
# Save pipeline configuration
A.save(transform, 'augmentation_pipeline.json')
# Load the exact same pipeline later
transform = A.load('augmentation_pipeline.json')
Note: The loaded pipeline will have the same seed as the original. See the Serialization Guide for more details.
Tracking Applied Augmentations
To debug or analyze which augmentations were actually applied, use save_applied_params:
transform = A.Compose([
    A.RandomCrop(height=256, width=256),
    A.HorizontalFlip(p=0.5),
    A.RandomBrightnessContrast(p=0.5),
], save_applied_params=True, seed=137)
result = transform(image=image)
print(result["applied_transforms"])
# Shows exactly which transforms were applied and their parameters
Best Practices
1. Development vs Production:
   - Use fixed seeds during development and debugging
   - Remove seeds (or use different seeds per epoch) in production training
   - Always use fixed seeds for validation/test transforms if you need comparable results
2. Experiment Tracking:
   - Log the seed value in your experiment tracking system
   - Save the complete pipeline configuration using A.save()
   - Document the Albumentations version used
   - Track the num_workers setting, as it affects augmentation sequences
3. Testing:
   - Unit tests should always use fixed seeds
   - Integration tests may use random seeds to test robustness
   - Create separate test cases for both scenarios
   - Test with both single and multi-worker configurations
4. Distributed Training:
   - AlbumentationsX automatically handles worker differentiation
   - Each worker gets a unique, reproducible seed based on base_seed + torch.initial_seed()
   - No need for manual seed = base_seed + worker_id logic
   - Seeds are automatically updated on worker respawn
5. DataLoader Configuration:
   - Be aware that changing num_workers changes augmentation sequences
   - Document your num_workers setting for reproducibility
   - Use consistent num_workers across experiments for comparable results
   - Avoid persistent_workers=True if exact reproducibility is critical (see the note on persistent_workers above)
Common Pitfalls
❌ Don't rely on global seeds
# This WILL NOT make Albumentations reproducible
np.random.seed(137)
random.seed(137)
transform = A.Compose([...]) # Still random!
❌ Don't forget that each Compose call is still random
transform = A.Compose([...], seed=137)
# These will be different (but the sequence is reproducible)
aug1 = transform(image=img)
aug2 = transform(image=img) # Different augmentation!
✅ Do create new instances for identical augmentations
# If you need the exact same augmentation
transform1 = A.Compose([...], seed=137)
transform2 = A.Compose([...], seed=137)
# Now these will be identical
aug1 = transform1(image=img)
aug2 = transform2(image=img) # Same augmentation!
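✅ Alternatively, reset the seed on the existing pipeline to replay the same augmentation (see set_random_seed above):
transform = A.Compose([...], seed=137)
aug1 = transform(image=img)

transform.set_random_seed(137)  # rewind to the initial state
aug2 = transform(image=img)  # same augmentation as aug1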
Summary
This guide covered:
- Setting and resetting seeds for reproducible augmentations
- Automatic worker seed handling in PyTorch DataLoaders
- How different num_workers settings affect augmentation sequences
- Best practices for reproducible experiments
- Common pitfalls and how to avoid them
Related Topics
- Pipelines and Compose - Understanding pipeline configuration
- Probabilities - How probabilities interact with seeds
- Creating Custom Transforms - Making custom transforms reproducible
- Serialization - Saving and loading reproducible pipelines