YOLO-style training pipeline with instance_binding
This notebook builds an Ultralytics-flavored training augmentation stack on top of AlbumentationsX 2.2.2 — the kind of pipeline you'd use to train YOLO11-seg on a custom instance-segmentation dataset, but assembled out of pure Albumentations primitives so you can swap pieces in and out.
Pipeline:
| Ultralytics op | Albumentations equivalent |
|---|---|
| Mosaic | `A.Mosaic` |
| CopyPaste | `A.CopyAndPaste` (new in 2.2.0) |
| RandomPerspective | `A.Perspective` + `A.Affine` |
| Fliplr | `A.HorizontalFlip` |
| HSV jitter | `A.HueSaturationValue` |
| brightness / contrast | `A.RandomBrightnessContrast` |
Everything runs under `Compose(instance_binding=["masks", "bboxes"])`, so the per-instance dict format you feed in is exactly what comes back out — including pasted instances and instances that survive Mosaic cropping.
For an introduction to instance_binding itself, see example_instance_binding.ipynb.
Heads-up: 2.2.0+ API conventions
- All sampling-range constructor args end in `_range` (`Rotate.angle_range`, `HueSaturationValue.hue_shift_range`, `Blur.blur_range`, …). The old `*_limit` names are removed.
- Range args are tuples only — pass `(-15, 15)`, not `15`.
- Per-instance `bbox_labels` and `keypoint_labels` are dicts keyed by `BboxParams.label_fields` / `KeypointParams.label_fields`. Bare lists raise `TypeError`.
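To make the tuple-only convention concrete, here is a minimal numpy sketch of what sampling from a `(low, high)` range looks like. This is not AlbumentationsX internals — `sample_range` is a hypothetical helper that just illustrates the shape of the contract:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_range(value_range: tuple[float, float]) -> float:
    """Draw one value uniformly from a (low, high) range tuple.

    Mirrors the 2.2.0+ convention: ranges are always 2-tuples.
    Passing a bare scalar fails the unpack below, much like the
    library rejecting non-tuple range args.
    """
    low, high = value_range
    return float(rng.uniform(low, high))

hue_shift = sample_range((-15.0, 15.0))  # ok: a 2-tuple
assert -15.0 <= hue_shift <= 15.0       # sample_range(15) would raise TypeError
```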
Version notes
- 2.2.2 — fixes the last `instance_binding` ordering bug (`Mosaic → Perspective → CopyAndPaste` no longer crashes the mask resync). All transform orderings used in this notebook work.
- 2.2.1 — fixed `Affine`/`ShiftScaleRotate` after `Mosaic`, and `Mosaic + CopyAndPaste + Perspective`.
- 2.2.0 — initial release of `instance_binding` and `CopyAndPaste`.

Pin `albumentationsx>=2.2.2` if you want this exact pipeline.
!pip install -q 'albumentationsx>=2.2.2' matplotlib
from __future__ import annotations
import io
import urllib.request
import zipfile
from pathlib import Path
import albumentations as A
import cv2
import matplotlib.patches as mpatches
import matplotlib.pyplot as plt
import numpy as np
print("albumentations", A.__version__)
albumentations 2.2.2
Load coco128-seg
Same dataset and loader as example_instance_binding.ipynb: 128 COCO val images with YOLO-format polygon annotations, ~7 MB. We rasterize each polygon into a binary instance mask.
DATA_ROOT = Path("../data/coco128-seg")
DATASET_URL = "https://www.ultralytics.com/assets/coco128-seg.zip"
COCO_NAMES = [
"person", "bicycle", "car", "motorcycle", "airplane", "bus", "train", "truck", "boat", "traffic light",
"fire hydrant", "stop sign", "parking meter", "bench", "bird", "cat", "dog", "horse", "sheep", "cow",
"elephant", "bear", "zebra", "giraffe", "backpack", "umbrella", "handbag", "tie", "suitcase", "frisbee",
"skis", "snowboard", "sports ball", "kite", "baseball bat", "baseball glove", "skateboard", "surfboard",
"tennis racket", "bottle", "wine glass", "cup", "fork", "knife", "spoon", "bowl", "banana", "apple",
"sandwich", "orange", "broccoli", "carrot", "hot dog", "pizza", "donut", "cake", "chair", "couch",
"potted plant", "bed", "dining table", "toilet", "tv", "laptop", "mouse", "remote", "keyboard",
"cell phone", "microwave", "oven", "toaster", "sink", "refrigerator", "book", "clock", "vase",
"scissors", "teddy bear", "hair drier", "toothbrush",
]
def ensure_dataset() -> Path:
if DATA_ROOT.exists() and any(DATA_ROOT.glob("images/train2017/*.jpg")):
return DATA_ROOT
DATA_ROOT.parent.mkdir(parents=True, exist_ok=True)
with urllib.request.urlopen(DATASET_URL) as resp:
buf = io.BytesIO(resp.read())
with zipfile.ZipFile(buf) as zf:
zf.extractall(DATA_ROOT.parent)
return DATA_ROOT
def load_sample(image_path: Path) -> tuple[np.ndarray, list[dict]]:
image = cv2.cvtColor(cv2.imread(str(image_path)), cv2.COLOR_BGR2RGB)
h, w = image.shape[:2]
label_path = DATA_ROOT / "labels" / "train2017" / (image_path.stem + ".txt")
instances: list[dict] = []
if not label_path.exists():
return image, instances
for line in label_path.read_text().splitlines():
parts = line.split()
if len(parts) < 7:
continue
class_idx = int(parts[0])
coords = np.array(parts[1:], dtype=np.float32).reshape(-1, 2)
coords[:, 0] *= w
coords[:, 1] *= h
polygon = coords.round().astype(np.int32)
mask = np.zeros((h, w), dtype=np.uint8)
cv2.fillPoly(mask, [polygon], 1)
if mask.sum() == 0:
continue
ys, xs = np.where(mask > 0)
bbox = np.array([xs.min(), ys.min(), xs.max() + 1, ys.max() + 1], dtype=np.float32)
instances.append({
"mask": mask,
"bbox": bbox,
"bbox_labels": {"class_name": COCO_NAMES[class_idx]},
})
return image, instances
ensure_dataset()
image_paths = sorted(DATA_ROOT.glob("images/train2017/*.jpg"))
print(f"{len(image_paths)} images")
128 images
Build a 16-image pool
We pre-load 16 samples once. Each call to the augmentation pipeline samples a primary, three Mosaic neighbors, and a few standalone donor instances for CopyAndPaste.
POOL_SIZE = 16
pool: list[tuple[np.ndarray, list[dict]]] = []
for p in image_paths:
image, instances = load_sample(p)
if len(instances) >= 1:
pool.append((image, instances))
if len(pool) >= POOL_SIZE:
break
print(f"pool: {len(pool)} samples, {sum(len(i) for _, i in pool)} total instances")
pool: 16 samples, 72 total instances
Pipeline + per-call data assembly
Mosaic and CopyAndPaste consume their own metadata keys (`mosaic_metadata`, `copy_paste_metadata`). The shape of those metadata entries is independent of `instance_binding`: Mosaic wants per-image dicts (`masks` `(N, H, W)`, `bboxes` `(N, 4)`, `bbox_labels` `{field: [list]}`); CopyAndPaste wants per-object dicts (`mask`, `bbox_labels` `{field: value}`).
Ultralytics' RandomPerspective is one fused affine + perspective warp; here we split it into A.Affine (rotate / shear / scale / translate) and A.Perspective (small projective warp), both placed after Mosaic. As of 2.2.2 every ordering in this pipeline is safe under instance_binding.
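Before wiring up the real pipeline, here is a shapes-only sketch of the two metadata formats, built from toy numpy arrays. It contains no AlbumentationsX calls — it just shows the per-image vs. per-object dict layouts described above:

```python
import numpy as np

# Two toy instances in the notebook's per-object format.
h, w = 64, 64
instances = []
for x in (5, 30):
    mask = np.zeros((h, w), dtype=np.uint8)
    mask[10:20, x:x + 10] = 1
    instances.append({
        "mask": mask,
        "bbox": np.array([x, 10, x + 10, 20], dtype=np.float32),
        "bbox_labels": {"class_name": "person"},
    })

image = np.zeros((h, w, 3), dtype=np.uint8)

# Mosaic: one dict per neighbor image, arrays stacked over instances,
# labels as {field: [list of values]}.
mosaic_entry = {
    "image": image,
    "masks": np.stack([i["mask"] for i in instances]),    # (N, H, W)
    "bboxes": np.stack([i["bbox"] for i in instances]),   # (N, 4)
    "bbox_labels": {"class_name": [i["bbox_labels"]["class_name"] for i in instances]},
}

# CopyAndPaste: one dict per donor object, single mask,
# labels as {field: scalar value}.
paste_entry = {
    "image": image,
    "mask": instances[0]["mask"],                         # (H, W)
    "bbox_labels": {"class_name": instances[0]["bbox_labels"]["class_name"]},
}

assert mosaic_entry["masks"].shape == (2, h, w)
assert mosaic_entry["bboxes"].shape == (2, 4)
assert paste_entry["mask"].shape == (h, w)
```

The `make_call_data` helper below assembles exactly these two layouts from the image pool.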
transform = A.Compose(
[
A.Mosaic(
grid_yx=(2, 2),
target_size=(640, 640),
cell_shape=(640, 640),
fit_mode="cover",
p=1.0,
),
A.Affine(
scale=(0.9, 1.1),
rotate=(-10, 10),
shear=(-2, 2),
translate_percent=(-0.05, 0.05),
p=0.5,
),
A.Perspective(
scale=(0.02, 0.05),
keep_size=True,
fit_output=False,
p=0.3,
),
A.CopyAndPaste(
scale_range=(0.4, 1.0),
blend_mode="gaussian",
blend_sigma_range=(1.0, 2.0),
min_visibility_after_paste=0.1,
p=0.7,
),
A.HorizontalFlip(p=0.5),
A.HueSaturationValue(
hue_shift_range=(-15, 15),
sat_shift_range=(-40, 40),
val_shift_range=(-25, 25),
p=1.0,
),
A.RandomBrightnessContrast(
brightness_range=(-0.2, 0.2),
contrast_range=(-0.2, 0.2),
p=0.5,
),
],
bbox_params=A.BboxParams(
coord_format="pascal_voc",
label_fields=["class_name"],
min_visibility=0.1,
min_area=8.0,
),
instance_binding=["masks", "bboxes"],
seed=42,
)
def make_call_data(rng: np.random.Generator, n_mosaic: int = 3, n_paste: int = 4) -> dict:
"""Sample a primary + Mosaic neighbors + CopyAndPaste donors from the pool."""
indices = rng.choice(len(pool), size=1 + n_mosaic, replace=False)
primary_image, primary_instances = pool[int(indices[0])]
mosaic_metadata = []
for idx in indices[1:]:
img, instances = pool[int(idx)]
if not instances:
continue
mosaic_metadata.append({
"image": img,
"masks": np.stack([i["mask"] for i in instances]),
"bboxes": np.stack([i["bbox"] for i in instances]),
"bbox_labels": {"class_name": [i["bbox_labels"]["class_name"] for i in instances]},
})
flat = [(img, inst) for img, instances in pool for inst in instances]
paste_indices = rng.choice(len(flat), size=min(n_paste, len(flat)), replace=False)
copy_paste_metadata = [
{
"image": flat[int(k)][0],
"mask": flat[int(k)][1]["mask"],
"bbox_labels": {"class_name": flat[int(k)][1]["bbox_labels"]["class_name"]},
}
for k in paste_indices
]
return {
"image": primary_image,
"instances": primary_instances,
"mosaic_metadata": mosaic_metadata,
"copy_paste_metadata": copy_paste_metadata,
}
Visualize 6 augmented samples
Each call assembles a fresh primary + Mosaic neighbors + paste donors and runs the full pipeline. out["instances"] is the per-object list — same shape you fed in, with surviving primaries + freshly pasted instances.
_RNG = np.random.default_rng(0)
_COLOR_CACHE: dict[int, tuple[float, float, float]] = {}
def _color_for(idx: int) -> tuple[float, float, float]:
if idx not in _COLOR_CACHE:
_COLOR_CACHE[idx] = tuple(_RNG.uniform(0.3, 1.0, size=3))
return _COLOR_CACHE[idx]
def draw_instances(ax: plt.Axes, image: np.ndarray, instances: list[dict], title: str = "") -> None:
overlay = image.astype(np.float32).copy()
for i, inst in enumerate(instances):
color = np.array(_color_for(i)) * 255
mask = inst.get("mask")
if mask is not None:
blend = (mask > 0)[..., None]
overlay = np.where(blend, 0.5 * overlay + 0.5 * color, overlay)
ax.imshow(np.clip(overlay, 0, 255).astype(np.uint8))
for i, inst in enumerate(instances):
x1, y1, x2, y2 = inst["bbox"]
color = _color_for(i)
ax.add_patch(mpatches.Rectangle((x1, y1), x2 - x1, y2 - y1, fill=False, edgecolor=color, linewidth=1.5))
label = inst.get("bbox_labels", {}).get("class_name", "?")
ax.text(x1, max(0, y1 - 4), label, color="white", fontsize=8,
bbox=dict(facecolor=color, alpha=0.8, edgecolor="none", pad=1))
ax.set_title(title)
ax.axis("off")
rng = np.random.default_rng(2026)
fig, axes = plt.subplots(2, 3, figsize=(18, 12))
axes = axes.ravel()
for i, ax in enumerate(axes):
transform.set_random_seed(int(rng.integers(0, 1_000_000)))
data = make_call_data(rng)
out = transform(**data)
draw_instances(
ax,
out["image"],
out["instances"],
title=f"sample #{i}: {len(out['instances'])} instances",
)
plt.tight_layout()
plt.show()
Sanity check: instance dicts round-trip
Confirm the output `instances` is the same list-of-dicts shape you'd hand to a Mask-RCNN-style head — every entry has `mask` (`(H, W)` binary), `bbox` (`(4,)` pascal_voc), and `bbox_labels` (`{"class_name": str}`).
rng = np.random.default_rng(0)
transform.set_random_seed(0)
data = make_call_data(rng)
out = transform(**data)
print(f"image: {out['image'].shape} dtype={out['image'].dtype}")
print(f"instances: {len(out['instances'])}")
for i, inst in enumerate(out["instances"]):
print(
f" [{i}] keys={sorted(inst.keys())} "
f"mask={inst['mask'].shape} bbox={inst['bbox'].round(1).tolist()} "
f"label={inst['bbox_labels']}",
)
image: (640, 640, 3) dtype=uint8
instances: 10
[0] keys=['bbox', 'bbox_labels', 'mask'] mask=(640, 640) bbox=[0.0, 0.0, 374.8, 353.9] label={'class_name': 'umbrella'}
[1] keys=['bbox', 'bbox_labels', 'mask'] mask=(640, 640) bbox=[0.0, 0.0, 133.7, 353.9] label={'class_name': 'person'}
[2] keys=['bbox', 'bbox_labels', 'mask'] mask=(640, 640) bbox=[397.9, 0.0, 521.9, 52.3] label={'class_name': 'elephant'}
[3] keys=['bbox', 'bbox_labels', 'mask'] mask=(640, 640) bbox=[522.5, 0.0, 632.0, 87.4] label={'class_name': 'elephant'}
[4] keys=['bbox', 'bbox_labels', 'mask'] mask=(640, 640) bbox=[0.0, 353.8, 383.1, 640.0] label={'class_name': 'motorcycle'}
[5] keys=['bbox', 'bbox_labels', 'mask'] mask=(640, 640) bbox=[44.0, 328.4, 396.5, 640.0] label={'class_name': 'motorcycle'}
[6] keys=['bbox', 'bbox_labels', 'mask'] mask=(640, 640) bbox=[29.0, 361.0, 332.0, 579.0] label={'class_name': 'bowl'}
[7] keys=['bbox', 'bbox_labels', 'mask'] mask=(640, 640) bbox=[310.0, 150.0, 323.0, 176.0] label={'class_name': 'person'}
[8] keys=['bbox', 'bbox_labels', 'mask'] mask=(640, 640) bbox=[36.0, 50.0, 454.0, 409.0] label={'class_name': 'zebra'}
[9] keys=['bbox', 'bbox_labels', 'mask'] mask=(640, 640) bbox=[283.0, 75.0, 302.0, 84.0] label={'class_name': 'car'}
Wiring this into a training loop
import torch

class CocoSegDataset(torch.utils.data.Dataset):
def __init__(self, image_paths, transform, pool):
self.image_paths = image_paths
self.transform = transform
self.pool = pool
self.rng = np.random.default_rng()
    def __len__(self):
        return len(self.image_paths)
def __getitem__(self, idx):
image, instances = load_sample(self.image_paths[idx])
# sample 3 Mosaic neighbors + a few donors from the same training pool
mosaic_metadata, copy_paste_metadata = sample_metadata(self.pool, self.rng)
out = self.transform(
image=image,
instances=instances,
mosaic_metadata=mosaic_metadata,
copy_paste_metadata=copy_paste_metadata,
)
return out["image"], out["instances"]
From there, model-specific collate (Mask-RCNN expects targets[i] = {boxes, masks, labels}; YOLO-seg expects per-image polygon tensors) is a one-liner over out["instances"].
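That one-liner can be sketched in plain numpy. The target keys below (`boxes`, `masks`, `labels`) follow torchvision's Mask-RCNN convention, and `CLASS_TO_ID` is a hypothetical name-to-index map; a real loop would wrap these arrays in `torch.as_tensor`:

```python
import numpy as np

CLASS_TO_ID = {"person": 0, "car": 2}  # hypothetical name -> class-id map

def collate_instances(instances: list[dict]) -> dict:
    """Stack a per-object instance list into Mask-RCNN-style target arrays.

    numpy stand-in for the torch version. Returns empty arrays when no
    instance survived augmentation, so downstream code never special-cases.
    """
    if not instances:
        return {
            "boxes": np.zeros((0, 4), dtype=np.float32),
            "masks": np.zeros((0, 0, 0), dtype=np.uint8),
            "labels": np.zeros((0,), dtype=np.int64),
        }
    return {
        "boxes": np.stack([i["bbox"] for i in instances]).astype(np.float32),
        "masks": np.stack([i["mask"] for i in instances]).astype(np.uint8),
        "labels": np.array(
            [CLASS_TO_ID[i["bbox_labels"]["class_name"]] for i in instances],
            dtype=np.int64,
        ),
    }
```

Feeding it the `out["instances"]` list from the pipeline yields `(N, 4)` boxes, `(N, H, W)` masks, and `(N,)` integer labels per image.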
What this notebook deliberately doesn't replicate
- MixUp — not yet a first-class transform. Roll your own with `image = alpha * image_a + (1 - alpha) * image_b` and concatenate the instance lists.
- Auto-anchor / letterbox-resize — model-side concerns, not augmentation.
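A minimal sketch of that roll-your-own MixUp, assuming both samples have already been resized to the same shape. The `alpha_range` knob and the function name are ours, not an Ultralytics or AlbumentationsX API:

```python
import numpy as np

def mixup(image_a, instances_a, image_b, instances_b,
          alpha_range=(0.3, 0.7), rng=None):
    """Blend two same-shape samples and keep both instance lists.

    Hand-rolled stand-in for Ultralytics MixUp: the pixels are a convex
    combination, while masks/bboxes/labels pass through untouched.
    """
    rng = rng if rng is not None else np.random.default_rng()
    alpha = float(rng.uniform(*alpha_range))
    blended = (alpha * image_a.astype(np.float32)
               + (1.0 - alpha) * image_b.astype(np.float32))
    return blended.clip(0, 255).astype(np.uint8), instances_a + instances_b
```

Run it after the `Compose` pipeline on two augmented outputs, then collate the concatenated instance list exactly as you would a single sample.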
If you need the actual Ultralytics training loop with custom Albumentations injected, see example_ultralytics.ipynb.