# Albumentations 2.0.20 Release Notes
This release brings faster video/batch processing for two widely used transforms and a new `user_data` target that lets you pass arbitrary custom data through any augmentation pipeline.
## Performance Improvements

Both improvements by @Dipet.
### Perspective — up to 2.7× faster on video batches

Perspective is now significantly faster when applied to videos or image batches (the `images=` key / `apply_to_images` path).
Benchmark results (uint8, config is frames × height × width, comparing 2.0.19 → 2.0.20):
| Config | 2.0.19 | 2.0.20 | Speedup |
|---|---|---|---|
| 16×256×256 grayscale | 0.0027s | 0.0010s | 2.7× |
| 32×256×256 grayscale | 0.0056s | 0.0023s | 2.4× |
| 16×1024×1024 RGB | 0.0139s | 0.0095s | 1.5× |
| 32×1024×1024 RGB | 0.0274s | 0.0223s | 1.2× |
| 16×1024×1024 5ch | 0.0259s | 0.0186s | 1.4× |
| 32×1024×1024 5ch | 0.0555s | 0.0381s | 1.5× |
The gain is largest for grayscale inputs (depth maps, medical slices), where the whole batch can be warped in a single C++ call instead of one call per frame.
### HueSaturationValue — up to 1.2× faster on video batches
HueSaturationValue now avoids redundant memory allocations when processing batches. Gains are most visible at large resolutions.
| Config | 2.0.19 | 2.0.20 | Speedup |
|---|---|---|---|
| 8×1024×1024 RGB | 0.0325s | 0.0291s | 1.1× |
| 16×1024×1024 RGB | 0.0940s | 0.0767s | 1.2× |
| 32×1024×1024 RGB | 0.1456s | 0.1216s | 1.2× |
## New Feature: `user_data` Target

You can now pass arbitrary Python objects through an Albumentations pipeline alongside images, masks, bboxes, and keypoints. By default, `user_data` passes through unchanged. To update it in response to a transform, subclass the transform and override `apply_to_user_data`.
### Basic usage — passthrough

```python
import numpy as np
import albumentations as A

image = np.random.randint(0, 256, (100, 100, 3), dtype=np.uint8)

transform = A.Compose([A.HorizontalFlip(p=1.0)])
result = transform(image=image, user_data={"caption": "a cat on the left"})

print(result["user_data"])  # {"caption": "a cat on the left"} — unchanged
```
### Custom transform that updates user_data

by @ternaus

```python
class FlipAwareHorizontalFlip(A.HorizontalFlip):
    def apply_to_user_data(self, data: dict, **params) -> dict:
        # Naive swap — assumes the caption only mentions "left".
        # See the caption-aware example below for a symmetric swap.
        caption = data["caption"].replace("left", "right")
        return {**data, "caption": caption}

transform = A.Compose([FlipAwareHorizontalFlip(p=1.0)])
result = transform(image=image, user_data={"caption": "a cat on the left"})
print(result["user_data"])  # {"caption": "a cat on the right"}
```
### Use case: camera intrinsics for robotics / autonomous driving

When you crop or zoom an image, you must update the camera intrinsics matrix K so that downstream 3D reasoning stays correct.
```python
import numpy as np
import albumentations as A

class CropAwareCenterCrop(A.CenterCrop):
    def apply_to_user_data(self, data: dict, crop_coords: tuple, **params) -> dict:
        x1, y1, x2, y2 = crop_coords
        K = data["K"].copy()
        K[0, 2] -= x1  # shift principal point cx
        K[1, 2] -= y1  # shift principal point cy
        return {**data, "K": K}

K = np.array([[500, 0, 320],
              [0, 500, 240],
              [0, 0, 1]], dtype=np.float64)
image = np.random.randint(0, 256, (480, 640, 3), dtype=np.uint8)

transform = A.Compose([CropAwareCenterCrop(height=256, width=256, p=1.0)])
result = transform(image=image, user_data={"K": K})
print(result["user_data"]["K"])
# [[500.   0. 128.]   ← cx shifted by x1 = (640 − 256) / 2 = 192
#  [  0. 500. 128.]   ← cy shifted by y1 = (480 − 256) / 2 = 112
#  [  0.   0.   1.]]
```
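To see why this update matters, project a 3D point with both matrices: the pixel predicted by the shifted K in the cropped image is exactly the original pixel minus the crop offset. The point and values below are illustrative, reusing the intrinsics from the example above:

```python
import numpy as np

K_orig = np.array([[500., 0., 320.],
                   [0., 500., 240.],
                   [0., 0., 1.]])
x1, y1 = 192, 112            # top-left corner of the 256×256 center crop
K_crop = K_orig.copy()
K_crop[0, 2] -= x1
K_crop[1, 2] -= y1

X = np.array([1.0, 0.0, 5.0])  # hypothetical point in the camera frame

u, v, w = K_orig @ X
px_orig = np.array([u / w, v / w])  # pixel in the original image

u, v, w = K_crop @ X
px_crop = np.array([u / w, v / w])  # pixel in the cropped image

# The two projections agree up to the crop offset:
assert np.allclose(px_orig - np.array([x1, y1]), px_crop)
print(px_orig, px_crop)  # [420. 240.] [228. 128.]
```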
### Use case: vision-language models (caption-aware augmentation)
Pass an image + text caption through the same pipeline and keep the caption semantically consistent with the augmented image.
```python
class CaptionAwareFlip(A.HorizontalFlip):
    def apply_to_user_data(self, data: dict, **params) -> dict:
        # Symmetric left/right swap via a placeholder token.
        caption = (
            data["caption"]
            .replace("left", "__L__")
            .replace("right", "left")
            .replace("__L__", "right")
        )
        return {**data, "caption": caption}

transform = A.Compose([CaptionAwareFlip(p=1.0)])
result = transform(
    image=image,
    user_data={"caption": "a dog running to the left of the frame"},
)
print(result["user_data"]["caption"])
# "a dog running to the right of the frame"
```
### Other strong use cases
- LiDAR / BEV: pass a point cloud dict alongside the image; update 3D coordinates when the image is flipped or rotated
- Temporal metadata: pass frame timestamps when cropping video clips; update the time range to match the spatial crop
- Soft labels / confidence maps: pass pixel-wise annotation confidence; update spatial positions to match geometric transforms
- Multi-sensor fusion: pass IMU or GPS data that needs coordinate-frame corrections when the image is spatially transformed