Instance Segmentation

Instance segmentation predicts a separate mask, bbox, and (optionally) keypoints for each individual object in the image. Unlike semantic segmentation — where every pixel just gets a class id — every pixel here belongs to a specific instance, and the model has to keep object identity stable across the augmentation pipeline.

Instance segmentation sample with per-object masks, bboxes, and labels

The hard part is not the augmentations themselves. The hard part is keeping the targets aligned: mask plane N must stay tied to bbox N and to the keypoints of object N across cropping, mixing, and filtering. Get this wrong and your model silently trains on shuffled targets.

What you give up if you don't bind instances

If you pass masks, bboxes, and keypoints as three independent arrays — the historical Albumentations API — they are filtered independently. A RandomCrop that pushes one object below the visibility threshold removes only that bbox row. The corresponding mask plane and keypoints stay behind, attached to the next object in the array. The bug is silent. It surfaces months later as "the model is bad at small objects" or "pose collapses on the edge of the frame".
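The misalignment is easy to reproduce with plain NumPy. This is a toy sketch of the failure mode described above, not real library code — three objects as independent arrays, with a filter that drops one bbox row but leaves the mask stack untouched:

```python
import numpy as np

# Three toy objects: one (H, W) mask plane and one bbox row each.
masks = np.zeros((3, 4, 4), dtype=np.uint8)
for i in range(3):
    masks[i, i, i] = 1  # give each object a unique pixel
bboxes = np.array([[0, 0, 1, 1], [1, 1, 2, 2], [2, 2, 3, 3]])

# A visibility filter drops object 1's bbox row -- but only the bbox row:
keep = [0, 2]
bboxes = bboxes[keep]

# Now bboxes[1] describes object 2, while masks[1] still belongs to object 1.
assert bboxes[1].tolist() == [2, 2, 3, 3]  # object 2's box...
assert masks[1, 1, 1] == 1                 # ...paired with object 1's mask
```

Nothing crashes — the arrays still have compatible-looking shapes, which is exactly why the bug survives review.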

Without instance_binding, masks drift to the wrong objects after a crop; with binding, alignment is preserved

Compose(instance_binding=[...]) fixes this by routing every per-instance target through the bbox processor. When an instance is dropped, its mask, bbox, and keypoints are dropped together. When it survives, all three move together. The output is repacked in the same per-object shape you fed in — out["instances"] is a list of dicts, one per surviving object.

You stop hand-rolling per-object index tracking. Mosaic, CopyAndPaste, crops, and filters compose without alignment bugs.

The instances input format

Pass instances as a list of per-object dicts to the instances keyword:

instances = [
    {
        "mask": mask_uint8,                   # (H, W) binary
        "bbox": np.array([x1, y1, x2, y2]),   # pascal_voc, pixel coords
        "bbox_labels": {"class_name": "bowl"},
    },
    ...
]

Two rules to know up front:

  • bbox_labels and keypoint_labels inside instance dicts are reserved keys. They must be dicts keyed by BboxParams.label_fields / KeypointParams.label_fields (e.g. {"class_name": "bowl"}). A bare list raises TypeError.
  • mask and masks are mutually exclusive in instance_binding — pick one.
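Both rules can be checked before the first training run. The helper below is a hypothetical pre-flight check that mirrors the two rules, not part of the library API:

```python
def validate_inputs(instance_binding, instances):
    """Hypothetical sanity check mirroring the two rules above."""
    # Rule 2: 'mask' (H, W, N) and 'masks' (N, H, W) are alternative layouts.
    if "mask" in instance_binding and "masks" in instance_binding:
        raise ValueError("instance_binding accepts 'mask' or 'masks', not both")
    # Rule 1: reserved label keys must be dicts keyed by label_fields.
    for inst in instances:
        for key in ("bbox_labels", "keypoint_labels"):
            if key in inst and not isinstance(inst[key], dict):
                raise TypeError(f"{key} must be a dict, got {type(inst[key]).__name__}")

validate_inputs(["masks", "bboxes"], [{"bbox_labels": {"class_name": "bowl"}}])  # ok
try:
    validate_inputs(["masks", "bboxes"], [{"bbox_labels": ["bowl"]}])  # bare list
except TypeError:
    pass  # rejected, as documented
```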

Three supported bindings cover the common cases.

Format A — masks (N, H, W)

instance_binding=["masks", "bboxes"] packs the per-instance masks into a stacked (N, H, W) tensor before transforms run, then unpacks them back into per-object dicts. This is the layout Mask R-CNN-style heads and panoptic decoders want.

import albumentations as A

transform = A.Compose(
    [
        A.HorizontalFlip(p=1.0),
        A.RandomCrop(height=400, width=500, p=1.0),
        A.Affine(rotate=(-30, 30), p=1.0),
    ],
    bbox_params=A.BboxParams(coord_format="pascal_voc", label_fields=["class_name"]),
    instance_binding=["masks", "bboxes"],
    seed=137,
)

out = transform(image=image, instances=instances)
# out["instances"] is the same list-of-dicts shape you fed in,
# with dropped objects removed.
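To feed a Mask R-CNN-style head, the surviving list of dicts is typically collated back into stacked tensors. A minimal sketch, assuming each output dict has the `mask`, `bbox`, and `bbox_labels` keys described above:

```python
import numpy as np

def stack_instances(instances):
    """Collate per-object dicts into the stacked tensors a
    Mask R-CNN-style head consumes."""
    masks = np.stack([inst["mask"] for inst in instances])    # (N, H, W)
    bboxes = np.stack([inst["bbox"] for inst in instances])   # (N, 4)
    labels = [inst["bbox_labels"]["class_name"] for inst in instances]
    return masks, bboxes, labels

# Toy data in the documented per-object shape:
instances = [
    {"mask": np.ones((8, 8), np.uint8), "bbox": np.array([0, 0, 4, 4]),
     "bbox_labels": {"class_name": "bowl"}},
    {"mask": np.zeros((8, 8), np.uint8), "bbox": np.array([2, 2, 6, 6]),
     "bbox_labels": {"class_name": "cup"}},
]
masks, bboxes, labels = stack_instances(instances)
assert masks.shape == (2, 8, 8) and bboxes.shape == (2, 4)
```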

Format B — multi-channel mask (H, W, N)

Same call, one-line change: instance_binding=["mask", "bboxes"]. The per-instance masks are stacked along the last axis instead of the first, producing a single mask tensor of shape (H, W, N). This is what older Keras / TF segmentation pipelines often expect.

transform = A.Compose(
    [A.HorizontalFlip(p=1.0), A.RandomCrop(height=400, width=500, p=1.0)],
    bbox_params=A.BboxParams(coord_format="pascal_voc", label_fields=["class_name"]),
    instance_binding=["mask", "bboxes"],
    seed=137,
)

Format C — bboxes + keypoints (no masks)

For pose tasks, bind bboxes and keypoints and skip masks entirely. Each instance dict gets a keypoints array and a keypoint_labels dict.

transform = A.Compose(
    [A.HorizontalFlip(p=1.0), A.Affine(rotate=(-20, 20), p=1.0), A.RandomCrop(height=400, width=500, p=1.0)],
    bbox_params=A.BboxParams(coord_format="pascal_voc", label_fields=["class_name"]),
    keypoint_params=A.KeypointParams(coord_format="xy", label_fields=["kp_kind"]),
    instance_binding=["bboxes", "keypoints"],
    seed=137,
)

instances = [
    {
        "bbox": np.array([x1, y1, x2, y2]),
        "keypoints": np.array([[cx, cy], [x1, y1]], dtype=np.float32),
        "bbox_labels": {"class_name": "person"},
        "keypoint_labels": {"kp_kind": ["center", "tl"]},
    },
    ...
]

When keypoints are bound, KeypointParams.remove_invisible and check_each_transform are forced to False. Out-of-frame keypoints stay in the dict — only instance survival (driven by the bbox processor) decides what gets dropped. Each surviving instance keeps a constant keypoint count, which is what pose models want.
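Since out-of-frame points are kept rather than removed, a common post-processing step is deriving a per-keypoint visibility flag yourself. A sketch of that step (a hypothetical helper, not a library call):

```python
import numpy as np

def keypoint_visibility(keypoints, height, width):
    """Flag which of the retained keypoints are still inside the frame.
    keypoints: (K, 2) array of (x, y) pixel coordinates."""
    x, y = keypoints[:, 0], keypoints[:, 1]
    return (x >= 0) & (x < width) & (y >= 0) & (y < height)

kps = np.array([[10.0, 20.0], [-5.0, 40.0], [100.0, 300.0]])
assert keypoint_visibility(kps, height=256, width=256).tolist() == [True, False, False]
```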

Instance survival under cropping

Survival is decided by the bbox processor — min_visibility, min_area, etc. on BboxParams. When an instance fails the threshold, its mask and keypoints disappear with it; surviving instances keep their object identity intact.

transform = A.Compose(
    [A.RandomCrop(height=180, width=180, p=1.0)],
    bbox_params=A.BboxParams(
        coord_format="pascal_voc",
        label_fields=["class_name"],
        min_visibility=0.1,
    ),
    instance_binding=["masks", "bboxes"],
    seed=137,
)

out = transform(image=image, instances=instances)
# len(out["instances"]) <= len(instances), and every dict in the
# output corresponds to exactly one input object.
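The quantity min_visibility is compared against is the fraction of the original box area that survives the crop. A sketch of that computation under the assumption that both boxes are pascal_voc [x1, y1, x2, y2] in pixels:

```python
def bbox_visibility(bbox, crop):
    """Fraction of bbox area remaining inside the crop window."""
    x1 = max(bbox[0], crop[0]); y1 = max(bbox[1], crop[1])
    x2 = min(bbox[2], crop[2]); y2 = min(bbox[3], crop[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area = (bbox[2] - bbox[0]) * (bbox[3] - bbox[1])
    return inter / area if area > 0 else 0.0

# A 180x180 crop at the origin keeps only a corner of this box,
# which falls below min_visibility=0.1 and drops the instance:
assert bbox_visibility([150, 150, 250, 250], [0, 0, 180, 180]) == 0.09
```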

Six random crops with surviving instance counts; masks and bboxes track the same objects

Mixing transforms: Mosaic and CopyAndPaste

The two staple instance-segmentation mixers — popularized by YOLO — are first-class transforms, both wired into instance_binding:

  • Mosaic tiles your primary image with neighbors into a single output canvas. Without per-instance binding, mosaic'd masks and bboxes drift apart at the cell seams; with it, every surviving instance — primary or neighbor — appears in out["instances"] in the same per-object shape.
  • CopyAndPaste pastes donor objects onto the primary image with mask-tight cropping, optional shrink-fit, scale jitter, and Gaussian blending at the seam. Pasted instances get fresh ids and slot in alongside the surviving primaries. Without it, you write the cut/paste compositing, the mask and bbox propagation, the visibility recompute, and the id assignment yourself — this is the canonical place where in-house augmentation stacks have bugs.
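The "mask-tight cropping" step can be sketched in a few lines of NumPy — shrink the donor patch to the smallest box containing the mask before pasting. This is an illustration of the idea, not the library's implementation:

```python
import numpy as np

def mask_tight_crop(image, mask):
    """Crop image and mask to the tight bounding box of the binary mask."""
    ys, xs = np.nonzero(mask)
    if ys.size == 0:
        return None  # empty mask: nothing to paste
    y1, y2 = ys.min(), ys.max() + 1
    x1, x2 = xs.min(), xs.max() + 1
    return image[y1:y2, x1:x2], mask[y1:y2, x1:x2]

img = np.arange(64, dtype=np.uint8).reshape(8, 8)
m = np.zeros((8, 8), np.uint8)
m[2:5, 3:6] = 1
patch, patch_mask = mask_tight_crop(img, m)
assert patch.shape == (3, 3) and patch_mask.all()
```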

Metadata shapes are independent of instance_binding:

  • mosaic_metadata is a list of per-image dicts with stacked masks/bboxes (one entry per neighbor image).
  • copy_paste_metadata is a list of per-object dicts (one entry per donor instance).

Both feed back into the per-object output format.

Primary image, donor pool, and result of CopyAndPaste with three pasted instances

A YOLO-style training pipeline

End-to-end pipeline equivalent to what Ultralytics ships for YOLO11-seg, assembled out of pure Albumentations primitives:

transform = A.Compose(
    [
        A.Mosaic(
            grid_yx=(2, 2),
            target_size=(640, 640),
            cell_shape=(640, 640),
            fit_mode="cover",
            p=1.0,
        ),
        A.Affine(
            scale=(0.9, 1.1),
            rotate=(-10, 10),
            shear=(-2, 2),
            translate_percent=(-0.05, 0.05),
            p=0.5,
        ),
        A.Perspective(scale=(0.02, 0.05), keep_size=True, p=0.3),
        A.CopyAndPaste(
            scale_range=(0.4, 1.0),
            blend_mode="gaussian",
            blend_sigma_range=(1.0, 2.0),
            min_visibility_after_paste=0.1,
            p=0.7,
        ),
        A.HorizontalFlip(p=0.5),
        A.HueSaturationValue(
            hue_shift_range=(-15, 15),
            sat_shift_range=(-40, 40),
            val_shift_range=(-25, 25),
            p=1.0,
        ),
        A.RandomBrightnessContrast(
            brightness_range=(-0.2, 0.2),
            contrast_range=(-0.2, 0.2),
            p=0.5,
        ),
    ],
    bbox_params=A.BboxParams(
        coord_format="pascal_voc",
        label_fields=["class_name"],
        min_visibility=0.1,
        min_area=8.0,
    ),
    instance_binding=["masks", "bboxes"],
    seed=137,
)

Per-call data assembly, sampling neighbors from a pool:

mosaic_metadata = [
    {
        "image": img,
        "masks": np.stack([i["mask"] for i in instances]),
        "bboxes": np.stack([i["bbox"] for i in instances]),
        "bbox_labels": {"class_name": [i["bbox_labels"]["class_name"] for i in instances]},
    }
    for img, instances in neighbor_samples
]

copy_paste_metadata = [
    {
        "image": img,
        "mask": inst["mask"],
        "bbox_labels": {"class_name": inst["bbox_labels"]["class_name"]},
    }
    for img, inst in donor_objects
]

out = transform(
    image=primary_image,
    instances=primary_instances,
    mosaic_metadata=mosaic_metadata,
    copy_paste_metadata=copy_paste_metadata,
)

out["instances"] contains the surviving primaries plus the pasted donors, in the same per-object dict format you fed in.

Six samples from the YOLO-style pipeline showing mosaic tiling, copy-paste, and color jitter

API conventions to know

Two breaking changes you'll hit if you copy code from older snippets:

  • Every sampling-range constructor argument ends in _range (e.g. Rotate.angle_range, Blur.blur_range, HueSaturationValue.hue_shift_range). The old *_limit names are gone.
  • Range arguments are tuples only: A.Rotate(angle_range=(-30, 30)), not A.Rotate(angle_range=30).

Pin albumentationsx>=2.2.2 for the pipeline above — earlier 2.2.x versions had ordering bugs around Mosaic + CopyAndPaste + Perspective under instance_binding.

Where to Go Next