# AlbumentationsX vs Torchvision
Compare AlbumentationsX with Torchvision transforms: API differences, RGB image benchmark results, and a PyTorch migration guide.
## What Is Different?
Torchvision is the native augmentation layer for the PyTorch ecosystem. AlbumentationsX is framework-independent and optimized for NumPy/OpenCV image augmentation that runs before tensors enter the model.
- Torchvision commonly operates on PIL images or tensors; AlbumentationsX operates on NumPy arrays and can emit PyTorch tensors with `ToTensorV2`.
- Torchvision integrates tightly with PyTorch datasets and model examples; AlbumentationsX focuses on faster CPU augmentation and richer computer-vision target handling.
- AlbumentationsX pipelines pass named targets such as `image`, `mask`, `bboxes`, and `keypoints`; Torchvision v1-style pipelines are mostly image-first unless you use the newer `tv_tensors` stack.
- AlbumentationsX tends to be easier when geometric transforms must stay consistent across masks, boxes, and keypoints.
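To see why unified target handling matters, here is a plain-NumPy illustration (not AlbumentationsX code): a horizontal flip must mirror the image, the mask, and every box x-coordinate in lockstep, which is what a single pipeline call does for you behind the scenes.

```python
import numpy as np

# Illustration only (plain NumPy, not the AlbumentationsX API): a horizontal
# flip must be applied identically to every target to keep them aligned.
def hflip_targets(image, mask, bboxes):
    """Flip an HxWxC image, an HxW mask, and pascal_voc-style boxes together."""
    h, w = image.shape[:2]
    flipped_image = image[:, ::-1]            # mirror the columns
    flipped_mask = mask[:, ::-1]              # same geometric op on the mask
    flipped_bboxes = [
        (w - x_max, y_min, w - x_min, y_max)  # mirror the x-coordinates
        for (x_min, y_min, x_max, y_max) in bboxes
    ]
    return flipped_image, flipped_mask, flipped_bboxes

image = np.arange(2 * 4 * 3, dtype=np.uint8).reshape(2, 4, 3)
mask = np.array([[0, 0, 1, 1], [0, 0, 1, 1]], dtype=np.uint8)
img_f, mask_f, boxes_f = hflip_targets(image, mask, [(2, 0, 4, 2)])
```

Forgetting any one of the three updates silently desynchronizes the targets, which is the class of bug the named-target API is designed to prevent.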
## Benchmark vs Torchvision
The benchmark below compares single-threaded RGB image augmentation throughput. Torchvision receives PIL images; AlbumentationsX receives OpenCV-loaded RGB NumPy arrays. Higher images/second is better.
| Transform | AlbumentationsX 2.2.5 (CPU, macOS arm64), img/s | Torchvision 0.26.0 (CPU, macOS arm64), img/s | Speedup AlbumentationsX / Torchvision (range) |
|---|---|---|---|
| Elastic | 453 ± 2 | 8 ± 0 | 53-55x |
| Resize | 3542 ± 11 | 288 ± 2 | 12-12x |
| Solarize | 13505 ± 442 | 1297 ± 2 | 10-11x |
| ColorJitter | 1221 ± 10 | 131 ± 2 | 9.1-9.5x |
| ColorJiggle | 1208 ± 16 | 133 ± 1 | 8.9-9.2x |
| PhotoMetricDistort | 1070 ± 19 | 129 ± 0 | 8.1-8.5x |
| Contrast | 10045 ± 119 | 1228 ± 4 | 8.1-8.3x |
| HorizontalFlip | 13200 ± 430 | 1678 ± 15 | 7.5-8.2x |
| GaussianBlur | 2462 ± 11 | 336 ± 4 | 7.2-7.4x |
| Pad | 34979 ± 3274 | 5072 ± 68 | 6.2-7.6x |
| Erasing | 27849 ± 4028 | 4175 ± 10 | 5.7-7.7x |
| Rotate | 2996 ± 12 | 451 ± 9 | 6.5-6.8x |
| Invert | 31753 ± 1327 | 5195 ± 84 | 5.8-6.5x |
| VerticalFlip | 29169 ± 2657 | 5023 ± 29 | 5.2-6.4x |
| Grayscale | 19593 ± 350 | 3409 ± 8 | 5.6-5.9x |
| Posterize | 28724 ± 3259 | 5162 ± 6 | 4.9-6.2x |
| Brightness | 9849 ± 99 | 1989 ± 9 | 4.9-5.0x |
| RandomResizedCrop | 4354 ± 22 | 939 ± 23 | 4.5-4.8x |
| Perspective | 1185 ± 9 | 268 ± 6 | 4.3-4.6x |
| Affine | 1456 ± 23 | 331 ± 7 | 4.2-4.6x |
| Sharpen | 2221 ± 35 | 525 ± 1 | 4.2-4.3x |
| AutoContrast | 1619 ± 44 | 399 ± 2 | 3.9-4.2x |
| RandomCrop128 | 93574 ± 1964 | 31208 ± 564 | 2.9-3.1x |
| CenterCrop128 | 95346 ± 1281 | 34279 ± 307 | 2.7-2.8x |
| ChannelShuffle | 8235 ± 86 | 5383 ± 12 | 1.5-1.5x |
| Normalize | 1642 ± 26 | 1091 ± 6 | 1.5-1.5x |
| JpegCompression | 1351 ± 11 | 925 ± 3 | 1.4-1.5x |
| Equalize | 1086 ± 12 | 946 ± 5 | 1.1-1.2x |
| Blur | 7544 ± 134 | — | — |
| CLAHE | 644 ± 5 | — | — |
| ChannelDropout | 11971 ± 434 | — | — |
| Colorize | 3858 ± 11 | — | — |
| CornerIllumination | 866 ± 28 | — | — |
| Dithering | 6 ± 0 | — | — |
| GaussianIllumination | 773 ± 21 | — | — |
| GaussianNoise | 328 ± 20 | — | — |
| Hue | 1908 ± 18 | — | — |
| LinearIllumination | 557 ± 18 | — | — |
| LongestMaxSize | 3847 ± 62 | — | — |
| MedianBlur | 1546 ± 16 | — | — |
| MotionBlur | 3847 ± 49 | — | — |
| OpticalDistortion | 395 ± 4 | — | — |
| PlankianJitter | 3278 ± 13 | — | — |
| PlasmaBrightness | 394 ± 9 | — | — |
| PlasmaContrast | 250 ± 6 | — | — |
| PlasmaShadow | 526 ± 8 | — | — |
| RGBShift | 5025 ± 48 | — | — |
| Rain | 2169 ± 27 | — | — |
| RandomGamma | 14482 ± 424 | — | — |
| RandomJigsaw | 9413 ± 136 | — | — |
| RandomRotate90 | 8652 ± 167 | — | — |
| SaltAndPepper | 946 ± 4 | — | — |
| Saturation | 1389 ± 27 | — | — |
| Shear | 1322 ± 7 | — | — |
| SmallestMaxSize | 2676 ± 7 | — | — |
| Snow | 754 ± 4 | — | — |
| ThinPlateSpline | 92 ± 1 | — | — |
| Transpose | 8184 ± 199 | — | — |
| UnsharpMask | 3063 ± 37 | — | — |
See the aggregate image benchmark or inspect the benchmark source code.
## Conversion Guide
In PyTorch projects, the usual migration is to keep your `Dataset` and `DataLoader`, replace `torchvision.transforms` with AlbumentationsX, then finish with `ToTensorV2`.
- Read images as NumPy arrays, usually with OpenCV plus BGR-to-RGB conversion.
- Replace transform lists with `A.Compose`.
- For classification, return `transformed['image']` after `ToTensorV2`.
- For detection or segmentation, pass `masks`, `bboxes`, labels, and keypoint params through `Compose` instead of updating them manually.
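The first step above is the one that most often trips people up: OpenCV loads images in BGR channel order, while AlbumentationsX expects RGB. A minimal sketch of the conversion (NumPy only; in real code `cv2.imread` does the loading and `cv2.cvtColor` is the equivalent conversion):

```python
import numpy as np

# cv2.imread returns images in BGR order; AlbumentationsX expects RGB uint8
# NumPy arrays. Reversing the last axis converts BGR -> RGB, equivalent to
# cv2.cvtColor(img, cv2.COLOR_BGR2RGB).
bgr = np.zeros((2, 2, 3), dtype=np.uint8)
bgr[..., 0] = 255  # pure blue in BGR order (channel 0 is B)

rgb = bgr[..., ::-1]  # the blue value now sits in the last channel
```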
**Torchvision:**

```python
from torchvision import transforms

transform = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.ColorJitter(brightness=0.2, contrast=0.2),
    transforms.ToTensor(),
])

image = transform(pil_image)
```

**AlbumentationsX:**

```python
import albumentations as A
from albumentations.pytorch import ToTensorV2

transform = A.Compose([
    A.HorizontalFlip(p=0.5),
    A.RandomBrightnessContrast(brightness_limit=0.2, contrast_limit=0.2, p=0.5),
    ToTensorV2(),
])

image = transform(image=image_np)["image"]
```

## Use AlbumentationsX When
- PyTorch training pipelines where CPU augmentation speed matters.
- Detection, segmentation, keypoints, and multi-input augmentation where targets must stay aligned.
- Projects that want the same augmentation library across PyTorch, TensorFlow, Keras, and custom training loops.
## Use Torchvision When
- Simple PyTorch classification baselines that already use torchvision examples.
- Tensor-native workflows that rely on torchvision transforms, tv_tensors, or PyTorch-only deployment assumptions.