Instance Segmentation
On this page
- What you give up if you don't bind instances
- The instances input format
- Instance survival under cropping
- Mixing transforms: Mosaic and CopyAndPaste
- API conventions to know
- Where to Go Next
Instance segmentation predicts a separate mask, bbox, and (optionally) keypoints for each individual object in the image. Unlike semantic segmentation — where every pixel just gets a class id — every pixel here belongs to a specific instance, and the model has to keep object identity stable across the augmentation pipeline.
The hard part is not the augmentations themselves. The hard part is keeping the targets aligned: mask plane N must stay tied to bbox N and to the keypoints of object N across cropping, mixing, and filtering. Get this wrong silently and your model trains on shuffled targets.
What you give up if you don't bind instances
If you pass masks, bboxes, and keypoints as three independent arrays — the historical Albumentations API — they are filtered independently. A RandomCrop that pushes one object below the visibility threshold removes only that bbox row. The corresponding mask plane and keypoints stay behind, attached to the next object in the array. The bug is silent. It surfaces months later as "the model is bad at small objects" or "pose collapses on the edge of the frame".
Compose(instance_binding=[...]) fixes this by routing every per-instance target through the bbox processor. When an instance is dropped, its mask, bbox, and keypoints are dropped together. When it survives, all three move together. The output is repacked in the same per-object shape you fed in — out["instances"] is a list of dicts, one per surviving object.
You stop hand-rolling per-object index tracking. Mosaic, CopyAndPaste, crops, and filters compose without alignment bugs.
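The failure mode is easy to reproduce with plain numpy. This toy sketch (synthetic data, independent of the library API) shows how filtering the bbox array alone silently shuffles targets, and how dropping mask and bbox together keeps them aligned:

```python
import numpy as np

# Three objects: mask plane i is filled with value i, bbox row i belongs to object i.
masks = np.stack([np.full((4, 4), i, dtype=np.uint8) for i in range(3)])       # (3, 4, 4)
bboxes = np.array([[0, 0, 2, 2], [1, 1, 3, 3], [2, 2, 4, 4]], dtype=float)     # (3, 4)

# A crop drops object 1 from the bbox array only -- the independent-arrays bug.
surviving = [0, 2]
bboxes_after = bboxes[surviving]
# masks untouched: masks[1] is still object 1, but bboxes_after[1] is object 2.
assert masks[1].max() == 1 and bboxes_after[1][0] == 2   # silently misaligned

# With instance binding, the mask planes are dropped by the same indices:
masks_after = masks[surviving]
assert masks_after[1].max() == 2 and bboxes_after[1][0] == 2   # aligned again
```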
The instances input format
Pass instances as a list of per-object dicts to the instances keyword:
instances = [
{
"mask": mask_uint8, # (H, W) binary
"bbox": np.array([x1, y1, x2, y2]), # pascal_voc, pixel coords
"bbox_labels": {"class_name": "bowl"},
},
...
]
Two rules to know up front:
- `bbox_labels` and `keypoint_labels` inside instance dicts are reserved keys. They must be dicts keyed by `BboxParams.label_fields` / `KeypointParams.label_fields` (e.g. `{"class_name": "bowl"}`). A bare list raises `TypeError`.
- `mask` and `masks` are mutually exclusive in `instance_binding`; pick one.
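A small pre-flight check can catch both mistakes before they reach `Compose`. This is our own helper sketch, not a library function; it mirrors the errors described above:

```python
def validate_instance(inst, binding):
    """Sketch of a pre-flight check for the two rules above (not a library API)."""
    if "mask" in binding and "masks" in binding:
        raise ValueError("'mask' and 'masks' are mutually exclusive in instance_binding")
    for key in ("bbox_labels", "keypoint_labels"):
        if key in inst and not isinstance(inst[key], dict):
            # mirrors the TypeError the library raises for a bare list
            raise TypeError(
                f"{key} must be a dict keyed by label_fields, "
                f"got {type(inst[key]).__name__}"
            )

validate_instance({"bbox_labels": {"class_name": "bowl"}}, ["masks", "bboxes"])  # ok
try:
    validate_instance({"bbox_labels": ["bowl"]}, ["masks", "bboxes"])
except TypeError as e:
    print(e)  # bbox_labels must be a dict keyed by label_fields, got list
```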
Three supported bindings cover the common cases.
Format A — masks (N, H, W)
instance_binding=["masks", "bboxes"] packs the per-instance masks into a stacked (N, H, W) tensor before transforms run, then unpacks them back into per-object dicts. This is the layout Mask R-CNN-style heads and panoptic decoders want.
import albumentations as A
transform = A.Compose(
[
A.HorizontalFlip(p=1.0),
A.RandomCrop(height=400, width=500, p=1.0),
A.Affine(rotate=(-30, 30), p=1.0),
],
bbox_params=A.BboxParams(coord_format="pascal_voc", label_fields=["class_name"]),
instance_binding=["masks", "bboxes"],
seed=137,
)
out = transform(image=image, instances=instances)
# out["instances"] is the same list-of-dicts shape you fed in,
# with dropped objects removed.
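Since `out["instances"]` comes back as per-object dicts, a typical last step is collating it into the stacked tensors a Mask R-CNN-style head consumes. A minimal sketch, with `surviving` standing in for `out["instances"]` (synthetic data; the collation code is ours, not a library call):

```python
import numpy as np

# Stand-in for out["instances"] after the transform above: per-object dicts.
surviving = [
    {"mask": np.ones((400, 500), np.uint8),
     "bbox": np.array([10.0, 20.0, 120.0, 220.0]),
     "bbox_labels": {"class_name": "bowl"}},
    {"mask": np.zeros((400, 500), np.uint8),
     "bbox": np.array([50.0, 60.0, 90.0, 160.0]),
     "bbox_labels": {"class_name": "cup"}},
]

# Collate back into the stacked layout a Mask R-CNN-style head expects.
masks = np.stack([inst["mask"] for inst in surviving])    # (N, H, W)
boxes = np.stack([inst["bbox"] for inst in surviving])    # (N, 4), pascal_voc
labels = [inst["bbox_labels"]["class_name"] for inst in surviving]

assert masks.shape == (2, 400, 500) and boxes.shape == (2, 4)
```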
Format B — multi-channel mask (H, W, N)
Same call, one-line change: instance_binding=["mask", "bboxes"]. The per-instance masks are stacked along the last axis instead of the first, producing a single mask tensor of shape (H, W, N). This is what older Keras / TF segmentation pipelines often expect.
transform = A.Compose(
[A.HorizontalFlip(p=1.0), A.RandomCrop(height=400, width=500, p=1.0)],
bbox_params=A.BboxParams(coord_format="pascal_voc", label_fields=["class_name"]),
instance_binding=["mask", "bboxes"],
seed=137,
)
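The two layouts differ only in which axis indexes the instance, so converting a Format B output to Format A (or back) is a single transpose. A small numpy sketch, independent of the library:

```python
import numpy as np

mask_hwn = np.zeros((400, 500, 3), dtype=np.uint8)   # Format B: (H, W, N)
masks_nhw = np.moveaxis(mask_hwn, -1, 0)             # Format A: (N, H, W)
assert masks_nhw.shape == (3, 400, 500)

# And back again:
assert np.moveaxis(masks_nhw, 0, -1).shape == (400, 500, 3)
```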
Format C — bboxes + keypoints (no masks)
For pose tasks, bind bboxes and keypoints and skip masks entirely. Each instance dict gets a keypoints array and a keypoint_labels dict.
transform = A.Compose(
[A.HorizontalFlip(p=1.0), A.Affine(rotate=(-20, 20), p=1.0), A.RandomCrop(height=400, width=500, p=1.0)],
bbox_params=A.BboxParams(coord_format="pascal_voc", label_fields=["class_name"]),
keypoint_params=A.KeypointParams(coord_format="xy", label_fields=["kp_kind"]),
instance_binding=["bboxes", "keypoints"],
seed=137,
)
instances = [
{
"bbox": np.array([x1, y1, x2, y2]),
"keypoints": np.array([[cx, cy], [x1, y1]], dtype=np.float32),
"bbox_labels": {"class_name": "person"},
"keypoint_labels": {"kp_kind": ["center", "tl"]},
},
...
]
When keypoints are bound, KeypointParams.remove_invisible and check_each_transform are forced to False. Out-of-frame keypoints stay in the dict — only instance survival (driven by the bbox processor) decides what gets dropped. Each surviving instance keeps a constant keypoint count, which is what pose models want.
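Because out-of-frame keypoints are kept rather than removed, a pose head that wants COCO-style visibility flags can derive them from the frame bounds after the transform. A sketch under that assumption (the helper is ours, not a library call):

```python
import numpy as np

def visibility_flags(keypoints, height, width):
    """Derive COCO-style flags: 2 = inside the frame, 0 = out of frame."""
    x, y = keypoints[:, 0], keypoints[:, 1]
    inside = (x >= 0) & (x < width) & (y >= 0) & (y < height)
    return inside.astype(np.int64) * 2

kps = np.array([[250.0, 100.0], [-12.0, 40.0]], dtype=np.float32)
print(visibility_flags(kps, height=400, width=500))  # [2 0]
```

The keypoint count per instance stays constant either way, so the flags array always lines up with the model's fixed keypoint layout.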
Instance survival under cropping
Survival is decided by the bbox processor — min_visibility, min_area, etc. on BboxParams. When an instance fails the threshold, its mask and keypoints disappear with it; surviving instances keep their object identity intact.
transform = A.Compose(
[A.RandomCrop(height=180, width=180, p=1.0)],
bbox_params=A.BboxParams(
coord_format="pascal_voc",
label_fields=["class_name"],
min_visibility=0.1,
),
instance_binding=["masks", "bboxes"],
seed=137,
)
out = transform(image=image, instances=instances)
# len(out["instances"]) <= len(instances), and every dict in the
# output corresponds to exactly one input object.
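What `min_visibility` measures can be written out as a standalone sketch: the fraction of the original bbox area that remains inside the crop window (our illustration of the rule, not the library's internal code):

```python
def visibility_after_crop(bbox, crop_x1, crop_y1, crop_x2, crop_y2):
    """Fraction of the original pascal_voc bbox area left inside the crop."""
    x1, y1, x2, y2 = bbox
    ix1, iy1 = max(x1, crop_x1), max(y1, crop_y1)
    ix2, iy2 = min(x2, crop_x2), min(y2, crop_y2)
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    return inter / ((x2 - x1) * (y2 - y1))

# A box partially clipped by a 180x180 crop keeps 80% of its area,
# so it survives min_visibility=0.1:
print(visibility_after_crop([100, 50, 200, 150], 0, 0, 180, 180))  # 0.8
```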
Mixing transforms: Mosaic and CopyAndPaste
The two staple instance-segmentation mixers — popularized by YOLO — are first-class transforms, both wired into instance_binding:
- `Mosaic` tiles your primary image with neighbors into a single output canvas. Without per-instance binding, mosaic'd masks and bboxes drift apart at the cell seams; with it, every surviving instance, primary or neighbor, appears in `out["instances"]` in the same per-object shape.
- `CopyAndPaste` pastes donor objects onto the primary image with mask-tight cropping, optional shrink-fit, scale jitter, and Gaussian blending at the seam. Pasted instances get fresh ids and slot in alongside the surviving primaries. Without it, you write the cut/paste compositing, the mask and bbox propagation, the visibility recompute, and the id assignment yourself; this is the canonical place where in-house augmentation stacks have bugs.
Metadata shapes are independent of instance_binding:
- `mosaic_metadata` is a list of per-image dicts with stacked masks/bboxes (one entry per neighbor image).
- `copy_paste_metadata` is a list of per-object dicts (one entry per donor instance).
Both feed back into the per-object output format.
A YOLO-style training pipeline
An end-to-end pipeline equivalent to what Ultralytics ships for YOLO11-seg, assembled from pure Albumentations primitives:
transform = A.Compose(
[
A.Mosaic(
grid_yx=(2, 2),
target_size=(640, 640),
cell_shape=(640, 640),
fit_mode="cover",
p=1.0,
),
A.Affine(
scale=(0.9, 1.1),
rotate=(-10, 10),
shear=(-2, 2),
translate_percent=(-0.05, 0.05),
p=0.5,
),
A.Perspective(scale=(0.02, 0.05), keep_size=True, p=0.3),
A.CopyAndPaste(
scale_range=(0.4, 1.0),
blend_mode="gaussian",
blend_sigma_range=(1.0, 2.0),
min_visibility_after_paste=0.1,
p=0.7,
),
A.HorizontalFlip(p=0.5),
A.HueSaturationValue(
hue_shift_range=(-15, 15),
sat_shift_range=(-40, 40),
val_shift_range=(-25, 25),
p=1.0,
),
A.RandomBrightnessContrast(
brightness_range=(-0.2, 0.2),
contrast_range=(-0.2, 0.2),
p=0.5,
),
],
bbox_params=A.BboxParams(
coord_format="pascal_voc",
label_fields=["class_name"],
min_visibility=0.1,
min_area=8.0,
),
instance_binding=["masks", "bboxes"],
seed=137,
)
Per-call data assembly, sampling neighbors from a pool:
mosaic_metadata = [
{
"image": img,
"masks": np.stack([i["mask"] for i in instances]),
"bboxes": np.stack([i["bbox"] for i in instances]),
"bbox_labels": {"class_name": [i["bbox_labels"]["class_name"] for i in instances]},
}
for img, instances in neighbor_samples
]
copy_paste_metadata = [
{
"image": img,
"mask": inst["mask"],
"bbox_labels": {"class_name": inst["bbox_labels"]["class_name"]},
}
for img, inst in donor_objects
]
out = transform(
image=primary_image,
instances=primary_instances,
mosaic_metadata=mosaic_metadata,
copy_paste_metadata=copy_paste_metadata,
)
`out["instances"]` is the surviving primaries plus pasted donors, in the same per-object dict format you fed in.
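The last mile is collating that per-object output into YOLO-style training targets: normalized cxcywh boxes plus a stacked mask tensor. A sketch with our own helper name and a stand-in for `out["instances"]` (the collation is ours, not a library call):

```python
import numpy as np

def to_yolo_targets(out_instances, height, width):
    """Collate per-object dicts into normalized cxcywh boxes and stacked masks."""
    boxes = np.stack([inst["bbox"] for inst in out_instances]).astype(np.float32)
    x1, y1, x2, y2 = boxes.T
    cxcywh = np.stack([(x1 + x2) / 2 / width, (y1 + y2) / 2 / height,
                       (x2 - x1) / width, (y2 - y1) / height], axis=1)
    masks = np.stack([inst["mask"] for inst in out_instances])   # (N, H, W)
    return cxcywh, masks

# Stand-in for out["instances"]: one object on a 640x640 canvas.
inst = [{"bbox": np.array([0.0, 0.0, 320.0, 320.0]),
         "mask": np.ones((640, 640), np.uint8)}]
cxcywh, masks = to_yolo_targets(inst, height=640, width=640)
# cxcywh is [[0.25, 0.25, 0.5, 0.5]]: center at a quarter of the canvas,
# box spanning half of it in each dimension.
```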
API conventions to know
Two breaking changes you'll hit if you copy code from older snippets:
- Every sampling-range constructor argument ends in `_range` (e.g. `Rotate.angle_range`, `Blur.blur_range`, `HueSaturationValue.hue_shift_range`). The old `*_limit` names are gone.
- Range arguments are tuples only: `A.Rotate(angle_range=(-30, 30))`, not `A.Rotate(angle_range=30)`.
Pin `albumentationsx>=2.2.2` for the pipeline above; earlier 2.2.x versions had ordering bugs around Mosaic + CopyAndPaste + Perspective under `instance_binding`.
Where to Go Next
- Object Detection (Bounding Boxes): Coordinate formats, label fields, and bbox-only pipelines.
- Semantic Segmentation: Pixel-class pipelines without per-object identity.
- Keypoint Augmentation: Standalone keypoint pipelines and label semantics.
- Runnable notebooks:
  - `example_instance_binding.ipynb` walks through all three binding formats on `coco128-seg`.
  - `example_yolo_style_pipeline.ipynb` builds the full YOLO-style training pipeline end-to-end.
- Visually Explore Transforms: Live previews of `Mosaic`, `CopyAndPaste`, and the rest.