Albumentations vs Torchvision
Albumentations is the default augmentation library for most computer vision users who need broad policies, target-aware augmentation, replay, serialization, rich array workflows, and benchmarked DataLoader performance. It is routinely used with PyTorch, TensorFlow, JAX, CUDA, and GPU training pipelines. In PyTorch training, it commonly runs in Dataset.__getitem__ or in DataLoader workers, before the batch is transferred to the device for the model step.
Torchvision is excellent PyTorch infrastructure. Keep it for datasets, models, pretrained weights, image and video IO, TVTensors, tensor conversion, simple v2 transform policies, same-parameter batched tensor transforms, and post-batch PyTorch work. Moving augmentation from Torchvision to Albumentations is not a framework migration; PyTorch stays PyTorch. In many PyTorch projects, the best pipeline is both libraries: Albumentations for CPU-side per-sample augmentation, then Torchvision or PyTorch tensor code for batch-level operations such as MixUp, CutMix, or GPU-side normalization.
Use Case Fit
| User need | Better fit |
|---|---|
| Broad training augmentation policy for classification, detection, segmentation, pose, OCR, restoration, medical, remote-sensing, or non-RGB arrays | Albumentations |
| Test-time augmentation, validation-time stress tests, and reproducible preprocessing experiments | Albumentations |
| One policy that updates images, masks, boxes, keypoints, oriented bounding boxes (OBB), labels, and related arrays together | Albumentations |
| Replay, serialization, readable experiment configs, and inspection of sampled augmentation parameters | Albumentations |
| Fast CPU-side preprocessing inside PyTorch Dataset / DataLoader workers | Albumentations, based on the measured benchmark regime |
| PyTorch-native datasets, reference models, pretrained weights, image/video IO, and tensor utilities | Torchvision |
| Dependency-light PyTorch-only transform stack for a compact classification recipe | Torchvision |
| Existing TVTensor-based sample structure with a small v2 policy that already covers the task | Torchvision |
| Batch-level MixUp/CutMix, same-parameter batched tensor transforms, CUDA tensor transforms, or deterministic post-batch operations inside PyTorch code | Torchvision or PyTorch tensor code, combined with Albumentations for the main augmentation policy |
Supported Targets
Torchvision v2 is not the old PIL-only transform stack. Current official docs list support for PIL images, pure tensors, videos, masks, keypoints, and axis-aligned or oriented bounding boxes (OBB, also called rotated bounding boxes) through TVTensors. Tensor inputs can be on CPU or CUDA. OBB and keypoints are documented as beta features in Torchvision 0.26/0.24-era docs, so use them deliberately and test your edge cases.
Read the table as a decision matrix: Supported, Limited, or Not supported. Details are handled in prose below the table.
| Target / data type | Albumentations | Torchvision v2 |
|---|---|---|
| Images | Supported | Supported |
| Masks | Supported | Supported |
| Axis-aligned bounding boxes | Supported | Supported |
| Oriented bounding boxes (OBB) | Supported | Supported |
| Keypoints | Supported | Supported |
| Classification labels | Supported | Supported |
| Multiple related targets | Supported | Supported |
| Video | Supported | Supported |
| Volumes / 3D | Supported | Not supported |
| Arbitrary-channel arrays | Supported | Supported |
Torchvision target support for masks, boxes, OBB, keypoints, and video is through TVTensors and tensor inputs. OBB and keypoints are beta features. Torchvision arbitrary-channel support depends on transform semantics: tensor geometry can fit many channel counts, while RGB/color transforms remain channel-semantic.
Transform Coverage
Torchvision v2 covers common geometric, color, tensor conversion, bbox/keypoint sanitation, video transforms, and batch-level policies such as CutMix/MixUp. That is useful, but it should not be read as “choose Torchvision for the whole augmentation stack.” For many training pipelines, the cleaner design is Albumentations before batching and Torchvision or PyTorch tensor code after batching.
Albumentations has the broader augmentation catalog. The practical difference shows up when the policy needs bbox-safe crops, non-rigid distortions, camera/weather/illumination corruptions, object-aware dropout, per-sample multi-image policies such as Mosaic and CopyAndPaste, 3D volume transforms, or domain-specific image transforms. The Torchvision transform mapping separates direct Torchvision v2 mappings from unsupported rows.
Speed and Pipeline Efficiency
Use the generated Torchvision benchmark page for performance comparison. It separates micro CPU, micro GPU, DataLoader CPU, and DataLoader GPU regimes, with scope and provenance shown together:
- micro benchmarks measure isolated transform execution
- DataLoader benchmarks measure the training input path more closely
- GPU regimes must account for materialization, transfer scope, supported transforms, early stops, and memory cost
- unsupported transforms are part of the result, because dropping a policy is not equivalent to running it faster
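To make the "micro" regime in the list above concrete, here is a rough timing sketch using only the standard library. It is not the methodology of the benchmark page; `measure_throughput` and `fake_transform` are hypothetical names, and the stand-in transform only reverses a list so the snippet runs without any augmentation library installed.

```python
import time


def measure_throughput(fn, sample, warmup=5, iterations=50):
    """Return calls per second for fn(sample), timed in isolation."""
    for _ in range(warmup):  # warm caches before timing
        fn(sample)
    start = time.perf_counter()
    for _ in range(iterations):
        fn(sample)
    elapsed = time.perf_counter() - start
    return iterations / max(elapsed, 1e-12)  # guard against a zero interval


def fake_transform(sample):
    """Hypothetical stand-in for a real augmentation callable."""
    return sample[::-1]


sample = list(range(10_000))
rate = measure_throughput(fake_transform, sample)
print(f"{rate:.1f} calls/sec")
```

A loop like this only measures the transform itself. Decode, worker scheduling, collation, and host-to-device transfer are outside it, which is why DataLoader-level numbers are reported separately.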
For the common PyTorch training pattern where augmentation happens before batching in DataLoader workers, Albumentations is the stronger default when the benchmarked transform set and target contract match the workload.
Integration Cost
Torchvision uses PyTorch tensors, PIL images, and TVTensors. If the project already stores samples as TVTensors and the v2 transform catalog covers the policy, integration cost is low.
Albumentations commonly receives NumPy arrays in OpenCV-style H,W,C channel-last layout and returns augmented arrays plus targets. In a PyTorch project, the normal integration point is the dataset: decode or load the sample, run Albumentations, then convert to tensors for the model. In TensorFlow, JAX, or custom training loops, the same array-first boundary is often the simplest place to keep augmentation policy independent from the framework.
For annotated data, Albumentations makes target semantics explicit in A.Compose: bbox formats, label fields, keypoint formats, visibility/filtering rules, mask interpolation, replay, and serialization are part of the augmentation policy instead of being scattered around dataset code.
For classification recipes that use MixUp or CutMix, do not treat that as a reason to move the whole pipeline to Torchvision. A practical PyTorch setup is:
- run Albumentations in the dataset or DataLoader workers for per-sample spatial, color, corruption, and target-aware augmentation
- convert and collate the batch
- run normalization, MixUp, CutMix, or other batch-level tensor policies after collation, often on GPU if profiling shows it helps
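The last step can be sketched in plain PyTorch. The `mixup` function below is a hand-written illustration, not a library API, and the random batch stands in for a collated batch that was already augmented per sample. The mean and std values are the commonly used ImageNet statistics and are an assumption here. Torchvision's v2 MixUp and CutMix transforms could fill the same slot.

```python
import torch
import torch.nn.functional as F


def mixup(images, labels, num_classes, alpha=0.2):
    """Blend each sample with a shuffled partner and return soft labels."""
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    perm = torch.randperm(images.size(0), device=images.device)
    one_hot = F.one_hot(labels, num_classes).float()
    mixed_images = lam * images + (1.0 - lam) * images[perm]
    mixed_labels = lam * one_hot + (1.0 - lam) * one_hot[perm]
    return mixed_images, mixed_labels


device = "cuda" if torch.cuda.is_available() else "cpu"

# Stand-in for a collated batch of already-augmented samples in [0, 1].
images = torch.rand(4, 3, 32, 48).to(device)
labels = torch.tensor([0, 1, 2, 3]).to(device)

# Batch-level normalization after transfer (assumed ImageNet statistics).
mean = torch.tensor([0.485, 0.456, 0.406], device=device).view(1, 3, 1, 1)
std = torch.tensor([0.229, 0.224, 0.225], device=device).view(1, 3, 1, 1)
images = (images - mean) / std

mixed_images, mixed_labels = mixup(images, labels, num_classes=4)
print(mixed_images.shape, mixed_labels.sum(dim=1))
```

Because MixUp needs a partner sample from the same batch, it naturally belongs after collation, whereas the per-sample policy belongs before it.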
GPU Memory
Torchvision can run tensor transforms on CUDA tensors, so it can be useful for deterministic post-batch work or for applying the same sampled transform parameters to every item in a batch. If each sample in the batch needs independently sampled random parameters, Torchvision does not provide a native Kornia-style batch augmentation API; you usually run the transform per sample in a loop, which changes the performance profile.
That does not make every augmentation policy a good GPU workload. GPU augmentation competes with the model for memory and compute, and it can add transfer or materialization costs depending on where decode, collation, conversion, and normalization happen.
Albumentations usually keeps augmentation in CPU-side dataset or DataLoader code before the batch reaches GPU training. That is a normal GPU training pipeline: the model trains on GPU, while augmentation runs before transfer. The benchmark page should report GPU memory separately for GPU regimes and keep unsupported or early-stopped rows visible.
What You Gain Moving from Torchvision
- A larger augmentation catalog, especially for detection, segmentation, pose, OCR, restoration, medical, remote-sensing, non-RGB, and 3D workflows.
- A single augmentation policy that updates images and supported targets together.
- Explicit bbox/keypoint configuration, label fields, filtering, visibility rules, and mask interpolation choices.
- Replay and serialization for debugging and reproducible experiments.
- Array-first integration that works across PyTorch, TensorFlow, JAX, and custom pipelines.
- Benchmark coverage for the full DataLoader path, not only isolated transforms.
What You Lose Moving from Torchvision
- Pure PyTorch-only dependency simplicity for small pipelines.
- Native TVTensor sample flow when the whole dataset and policy already live in Torchvision v2.
- Direct use of Torchvision transform classes inside torch.nn.Module-style tensor pipelines.
- A ready-made place for batch-level MixUp/CutMix and some convenience for same-parameter batched tensor transforms, CUDA tensor transforms, and deterministic post-batch PyTorch operations.
- Tight coupling with Torchvision datasets, model references, image/video IO, and PyTorch examples.
Bottom Line
Use Albumentations when augmentation policy is a meaningful part of the experiment, especially with annotations, broad transform coverage, non-RGB arrays, replay, serialization, or measured DataLoader performance needs.
Use Torchvision transforms when the policy is small, PyTorch-native, already covered by v2, or belongs after batching as deterministic, same-parameter, MixUp, or CutMix tensor work. For most non-trivial augmentation stacks, use Albumentations for the main CPU-side augmentation policy and add Torchvision/PyTorch batch-level tensor operations where they fit.
Evidence Sources
- Torchvision capability source: Torchvision documentation, transforms v2, TVTensors, BoundingBoxes, and KeyPoints
- Benchmark source: albumentations-team/benchmark
- Generated benchmark route: Torchvision benchmarks
- Mapping route: Torchvision transform mapping