Instance Segmentation


Instance segmentation predicts a separate mask, bbox, and (optionally) keypoints for each individual object in the image. Unlike semantic segmentation, where every pixel simply gets a class id, each mask pixel here belongs to a specific object instance, and the model has to keep object identity stable across the augmentation pipeline.

Instance segmentation sample with per-object masks, bboxes, and labels

The hard part is not the augmentations themselves. The hard part is keeping the targets aligned: mask plane N must stay tied to bbox N, and the bbox label fields, keypoints, and keypoint label fields for object N must stay tied to that same object across cropping, mixing, and filtering. Get this wrong and your model silently trains on shuffled targets.

The object-row invariant

Instance binding treats every object as one row in the sample:

object i = bbox i + bbox label fields i + mask i + keypoints i + keypoint label fields i

That row is the unit of survival. If object i survives a transform, all of its bound targets survive together and stay attached to the same object. If object i is removed by a crop, min_area, min_visibility, or other bbox filtering rule, every bound target for that row is removed together.
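A minimal numpy sketch of the invariant (illustrative only, not library internals): one boolean survival mask is applied to every bound target, so row i in one array always matches row i in the others.

```python
import numpy as np

# Three objects: stacked masks (N, H, W), bboxes (N, 4), labels (N,).
masks = np.zeros((3, 4, 4), dtype=np.uint8)
bboxes = np.array([[0, 0, 2, 2], [1, 1, 3, 3], [2, 2, 4, 4]], dtype=np.float32)
labels = np.array(["cat", "dog", "bowl"])

# Suppose a crop drops object 1. The same keep mask is applied to every
# bound target at once, so surviving rows stay attached to the same object.
keep = np.array([True, False, True])
masks, bboxes, labels = masks[keep], bboxes[keep], labels[keep]

assert masks.shape[0] == bboxes.shape[0] == labels.shape[0] == 2
```

The point of `instance_binding` is that this single-keep-mask discipline is enforced for you across every transform in the pipeline.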

This is different from synchronizing geometry. Normal target-aware augmentation can apply the same crop to an image, mask, boxes, and keypoints. Instance binding adds the object identity contract on top: the 7th surviving bbox still belongs to the 7th surviving mask, bbox label fields, keypoint set, and keypoint label fields.

What you give up if you don't bind instances

If you pass masks or object keypoint groups separately from bboxes, they can drift apart during filtering. Bbox label fields declared through BboxParams.label_fields stay synchronized with the filtered bbox rows, but independent masks and per-object keypoint groups do not. A RandomCrop that pushes one object below the visibility threshold removes that bbox row and its bbox label fields; the corresponding mask plane or keypoint group can stay behind, attached to the next object in the array. The bug is silent. It surfaces months later as "the model is bad at small objects" or "pose collapses on the edge of the frame".
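To make the failure concrete, here is a toy numpy reconstruction of the drift (my own sketch, not library code): bbox rows are filtered by identity, while a separately passed mask stack is only truncated by position.

```python
import numpy as np

# Object 0 occupies the top-left corner, object 1 the bottom-right.
masks = np.zeros((2, 4, 4), dtype=np.uint8)
masks[0, :2, :2] = 1
masks[1, 2:, 2:] = 1
bboxes = np.array([[0, 0, 2, 2], [2, 2, 4, 4]], dtype=np.float32)

# A crop drops object 0: the bbox processor removes its row, but the
# unbound mask stack gets trimmed by position, not by object identity.
bboxes = bboxes[[1]]
masks = masks[: len(bboxes)]

# bboxes[0] is now object 1's box, yet masks[0] is still object 0's plane.
assert bboxes[0].tolist() == [2.0, 2.0, 4.0, 4.0]
assert masks[0, 0, 0] == 1  # top-left pixel set: this plane belongs to object 0
```

Nothing errors out, the shapes agree, and training proceeds on mismatched targets.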

Without instance_binding masks drift to wrong objects after a crop; with binding alignment is preserved

Compose(instance_binding=[...]) fixes this by routing every per-instance target through the bbox processor. When an instance is dropped, its mask, bbox, bbox label fields, keypoints, and keypoint label fields are dropped together. When it survives, all bound targets move together. The output is repacked in the same per-object shape you fed in — out["instances"] is a list of dicts, one per surviving object.

You stop hand-rolling per-object index tracking. Mosaic, CopyAndPaste, crops, and filters compose without alignment bugs.

The instances input format

Pass instances as a list of per-object dicts to the instances keyword:

instances = [
    {
        "mask": mask_uint8,                   # (H, W) binary
        "bbox": np.array([x1, y1, x2, y2]),   # pascal_voc, pixel coords
        "bbox_labels": {"class_name": "bowl"},
    },
    ...
]

Two rules to know up front:

  • bbox_labels and keypoint_labels inside instance dicts are reserved keys. They must be dicts keyed by BboxParams.label_fields / KeypointParams.label_fields (e.g. {"class_name": "bowl"}). A bare list raises TypeError.
  • mask and masks are mutually exclusive in instance_binding — pick one.

Three supported bindings cover the common cases.

Format A — masks (N, H, W)

instance_binding=["masks", "bboxes"] packs the per-instance masks into a stacked (N, H, W) tensor before transforms run, then unpacks them back into per-object dicts. This is the layout Mask R-CNN-style heads and panoptic decoders want.

import albumentations as A

transform = A.Compose(
    [
        A.HorizontalFlip(p=1.0),
        A.RandomCrop(height=400, width=500, p=1.0),
        A.Affine(rotate=(-30, 30), p=1.0),
    ],
    bbox_params=A.BboxParams(coord_format="pascal_voc", label_fields=["class_name"]),
    instance_binding=["masks", "bboxes"],
    seed=137,
)

out = transform(image=image, instances=instances)
# out["instances"] is the same list-of-dicts shape you fed in,
# with dropped objects removed.

Format B — multi-channel mask (H, W, N)

Same call, one-line change: instance_binding=["mask", "bboxes"]. The per-instance masks are stacked along the last axis instead of the first, producing a single mask tensor of shape (H, W, N). This is what older Keras / TF segmentation pipelines often expect.

transform = A.Compose(
    [A.HorizontalFlip(p=1.0), A.RandomCrop(height=400, width=500, p=1.0)],
    bbox_params=A.BboxParams(coord_format="pascal_voc", label_fields=["class_name"]),
    instance_binding=["mask", "bboxes"],
    seed=137,
)
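The two mask layouts are a transpose apart; a quick standalone numpy check (illustrative, independent of the library):

```python
import numpy as np

per_instance = [np.random.randint(0, 2, (8, 8), dtype=np.uint8) for _ in range(3)]

masks_nhw = np.stack(per_instance, axis=0)   # Format A: (N, H, W)
mask_hwn = np.stack(per_instance, axis=-1)   # Format B: (H, W, N)

assert masks_nhw.shape == (3, 8, 8)
assert mask_hwn.shape == (8, 8, 3)
assert np.array_equal(mask_hwn, masks_nhw.transpose(1, 2, 0))
```

Pick whichever layout your model's data loader consumes; the binding and survival behavior is identical.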

Format C — bboxes + keypoints (no masks)

For pose tasks, bind bboxes and keypoints and skip masks entirely. Each instance dict gets a keypoints array and a keypoint_labels dict.

transform = A.Compose(
    [A.HorizontalFlip(p=1.0), A.Affine(rotate=(-20, 20), p=1.0), A.RandomCrop(height=400, width=500, p=1.0)],
    bbox_params=A.BboxParams(coord_format="pascal_voc", label_fields=["class_name"]),
    keypoint_params=A.KeypointParams(coord_format="xy", label_fields=["kp_kind"]),
    instance_binding=["bboxes", "keypoints"],
    seed=137,
)

instances = [
    {
        "bbox": np.array([x1, y1, x2, y2]),
        "keypoints": np.array([[cx, cy], [x1, y1]], dtype=np.float32),
        "bbox_labels": {"class_name": "person"},
        "keypoint_labels": {"kp_kind": ["center", "tl"]},
    },
    ...
]

When keypoints are bound, KeypointParams.remove_invisible and check_each_transform are forced to False. Out-of-frame keypoints stay in the dict — only instance survival (driven by the bbox processor) decides what gets dropped. Each surviving instance keeps a constant keypoint count, which is what pose models want.
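Because out-of-frame keypoints are retained, visibility becomes your downstream code's concern. A sketch of deriving per-keypoint validity flags at collate time (the `in_frame` flag is an assumption of your training code, not something the library emits):

```python
import numpy as np

H, W = 400, 500  # post-crop frame size
keypoints = np.array([[250.0, 100.0], [-12.0, 380.0]], dtype=np.float32)

# Flag which keypoints landed inside the frame; the array length never changes.
in_frame = (
    (keypoints[:, 0] >= 0) & (keypoints[:, 0] < W)
    & (keypoints[:, 1] >= 0) & (keypoints[:, 1] < H)
)

assert keypoints.shape == (2, 2)          # constant keypoint count per instance
assert in_frame.tolist() == [True, False]  # second keypoint is out of frame
```

This matches the usual pose-model convention of a fixed keypoint tensor plus a visibility channel.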

Instance survival under cropping and filtering

Survival is decided by the bbox processor — min_visibility, min_area, filter_invalid_bboxes, and related options on BboxParams. Crops and geometric transforms can shrink, clip, or move a bbox out of frame. When the bbox processor drops that row, the bound mask, bbox label fields, keypoints, and keypoint label fields disappear with it; surviving instances keep their object identity intact.

transform = A.Compose(
    [A.RandomCrop(height=180, width=180, p=1.0)],
    bbox_params=A.BboxParams(
        coord_format="pascal_voc",
        label_fields=["class_name"],
        min_visibility=0.1,
    ),
    instance_binding=["masks", "bboxes"],
    seed=137,
)

out = transform(image=image, instances=instances)
# len(out["instances"]) <= len(instances), and every dict in the
# output corresponds to exactly one surviving input object.
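The min_visibility rule compares the clipped bbox area against its pre-crop area. A standalone sketch of that check (my own helper, not the library's internals):

```python
def visibility_after_crop(bbox, crop):
    """Fraction of the original bbox area left after clipping to the crop window.

    Both arguments are (x1, y1, x2, y2) in pixel coordinates.
    """
    x1, y1, x2, y2 = bbox
    cx1, cy1, cx2, cy2 = crop
    inter_w = max(0.0, min(x2, cx2) - max(x1, cx1))
    inter_h = max(0.0, min(y2, cy2) - max(y1, cy1))
    orig = (x2 - x1) * (y2 - y1)
    return (inter_w * inter_h) / orig if orig > 0 else 0.0

# A 100x100 object mostly outside a 180x180 crop: only a 30x30 sliver remains.
vis = visibility_after_crop((150, 150, 250, 250), (0, 0, 180, 180))
assert vis == (30 * 30) / (100 * 100)  # 0.09 < min_visibility=0.1 -> row dropped
```

When that fraction falls below min_visibility, the whole object row, mask plane included, is removed together.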

Six random crops with surviving instance counts; masks and bboxes track the same objects

Mixing transforms: Mosaic and CopyAndPaste

The two staple instance-segmentation mixers — popularized by YOLO — are first-class transforms, both wired into instance_binding:

  • Mosaic tiles your primary image with neighbors into a single output canvas. Without per-instance binding, mosaic'd masks and bboxes drift apart at the cell seams; with it, every surviving instance — primary or neighbor — appears in out["instances"] in the same per-object shape.
  • CopyAndPaste pastes donor objects onto the primary image with mask-tight cropping, optional shrink-fit, scale jitter, and Gaussian blending at the seam. Pasted instances get fresh ids and slot in alongside the surviving primaries. Without it, you write the cut/paste compositing, the mask and bbox propagation, the visibility recompute, and the id assignment yourself — this is the canonical place where in-house augmentation stacks have bugs.

Metadata shapes are independent of instance_binding:

  • mosaic_metadata is a list of per-image dicts with stacked masks/bboxes (one entry per neighbor image).
  • copy_paste_metadata is a list of per-object dicts (one entry per donor instance).

Both feed back into the per-object output format.

Primary image, donor pool, and result of CopyAndPaste with three pasted instances

Mosaic input example

Each mosaic_metadata item is another image-level sample. Use the stacked target layout there; instance_binding converts the final result back to per-object dicts.

mosaic_metadata = [
    {
        "image": neighbor_image,
        "masks": (
            np.stack([inst["mask"] for inst in neighbor_instances])
            if neighbor_instances
            else np.empty((0, *neighbor_image.shape[:2]), dtype=np.uint8)
        ),
        "bboxes": (
            np.stack([inst["bbox"] for inst in neighbor_instances])
            if neighbor_instances
            else np.empty((0, 4), dtype=np.float32)
        ),
        "bbox_labels": {"class_name": [inst["bbox_labels"]["class_name"] for inst in neighbor_instances]},
    }
    for neighbor_image, neighbor_instances in neighbor_samples
]

out = transform(
    image=primary_image,
    instances=primary_instances,
    mosaic_metadata=mosaic_metadata,
)

Objects can be clipped by cell placement or filtered after the mosaic canvas is built. The object-row invariant still holds: every surviving neighbor object keeps its bbox, mask, and bbox label fields together.

CopyAndPaste input example

Each copy_paste_metadata item is one donor object. The transform pastes donor instances into the primary sample, then runs the same survival and filtering rules on the combined instance set.

copy_paste_metadata = [
    {
        "image": donor_image,
        "mask": donor_instance["mask"],
        "bbox_labels": {"class_name": donor_instance["bbox_labels"]["class_name"]},
    }
    for donor_image, donor_instance in donor_objects
]

out = transform(
    image=primary_image,
    instances=primary_instances,
    copy_paste_metadata=copy_paste_metadata,
)

Pasted objects enter out["instances"] in the same format as primary objects. If a pasted object is too small, outside the paste region, or hidden below min_visibility_after_paste, the whole donor row is removed.

A YOLO-style training pipeline

End-to-end pipeline equivalent to what Ultralytics ships for YOLO11-seg, assembled out of pure Albumentations primitives:

transform = A.Compose(
    [
        A.Mosaic(
            grid_yx=(2, 2),
            target_size=(640, 640),
            cell_shape=(640, 640),
            fit_mode="cover",
            p=1.0,
        ),
        A.Affine(
            scale=(0.9, 1.1),
            rotate=(-10, 10),
            shear=(-2, 2),
            translate_percent=(-0.05, 0.05),
            p=0.5,
        ),
        A.Perspective(scale=(0.02, 0.05), keep_size=True, p=0.3),
        A.CopyAndPaste(
            scale_range=(0.4, 1.0),
            blend_mode="gaussian",
            blend_sigma_range=(1.0, 2.0),
            min_visibility_after_paste=0.1,
            p=0.7,
        ),
        A.HorizontalFlip(p=0.5),
        A.HueSaturationValue(
            hue_shift_range=(-15, 15),
            sat_shift_range=(-40, 40),
            val_shift_range=(-25, 25),
            p=1.0,
        ),
        A.RandomBrightnessContrast(
            brightness_range=(-0.2, 0.2),
            contrast_range=(-0.2, 0.2),
            p=0.5,
        ),
    ],
    bbox_params=A.BboxParams(
        coord_format="pascal_voc",
        label_fields=["class_name"],
        min_visibility=0.1,
        min_area=8.0,
    ),
    instance_binding=["masks", "bboxes"],
    seed=137,
)

Per-call data assembly, sampling neighbors from a pool:

mosaic_metadata = [
    {
        "image": img,
        "masks": (
            np.stack([i["mask"] for i in instances])
            if instances
            else np.empty((0, *img.shape[:2]), dtype=np.uint8)
        ),
        "bboxes": (
            np.stack([i["bbox"] for i in instances])
            if instances
            else np.empty((0, 4), dtype=np.float32)
        ),
        "bbox_labels": {"class_name": [i["bbox_labels"]["class_name"] for i in instances]},
    }
    for img, instances in neighbor_samples
]

copy_paste_metadata = [
    {
        "image": img,
        "mask": inst["mask"],
        "bbox_labels": {"class_name": inst["bbox_labels"]["class_name"]},
    }
    for img, inst in donor_objects
]

out = transform(
    image=primary_image,
    instances=primary_instances,
    mosaic_metadata=mosaic_metadata,
    copy_paste_metadata=copy_paste_metadata,
)

out["instances"] is the surviving primaries plus pasted donors, in the same per-object dict format you fed in.

Six samples from the YOLO-style pipeline showing mosaic tiling, copy-paste, and color jitter

API conventions to know

Two breaking changes you'll hit if you copy code from older snippets:

  • Every sampling-range constructor argument ends in _range (e.g. Rotate.angle_range, Blur.blur_range, HueSaturationValue.hue_shift_range). The old *_limit names are gone.
  • Range arguments are tuples only: A.Rotate(angle_range=(-30, 30)), not A.Rotate(angle_range=30).

Pin albumentationsx>=2.2.2 for the pipeline above — earlier 2.2.x versions had ordering bugs around Mosaic + CopyAndPaste + Perspective under instance_binding.

Where to Go Next