
Commit cf57674

perf(aws): Batch ECR image layer relationship loads (#2698)
### Type of change

- [x] Bug fix (non-breaking change that fixes an issue)

### Summary

Split ECR image layer relationship writes out of high-fanout node ingestion queries and load them as scoped matchlinks in smaller batches. This keeps the existing graph model intact while reducing transaction memory pressure for repositories with many image layers.

### Related issues or links

N/A

### Breaking changes

None.

### How was this tested?

- Tested locally

```
2026-04-26 16:54:21,267 INFO cartography.intel.aws.ecr Loading 1 ECR repositories for region us-east-1 into graph.
2026-04-26 16:54:21,382 INFO cartography.client.core.tx Loaded 1 ECRRepository nodes
2026-04-26 16:54:21,382 INFO cartography.intel.aws.ecr Loading 1 ECR images and 1 ECR repository images in us-east-1 into graph.
2026-04-26 16:54:21,790 INFO cartography.client.core.tx Loaded 1 ECRImage nodes
2026-04-26 16:54:21,904 INFO cartography.client.core.tx Loaded 1 ECRRepositoryImage nodes
2026-04-26 16:54:21,904 INFO cartography.intel.aws.ecr_image_layers Syncing ECR image layers for region 'us-east-1' in account '<AWS_ACCOUNT_ID>'.
2026-04-26 16:54:21,944 INFO cartography.intel.aws.ecr_image_layers Found 1 distinct ECR image digests in graph for region us-east-1
2026-04-26 16:54:21,944 INFO cartography.intel.aws.ecr_image_layers Starting to fetch layers for 1 images...
2026-04-26 16:54:21,960 INFO cartography.intel.aws.ecr_image_layers Fetching layers for 1 images with 200 concurrent connections...
2026-04-26 16:54:22,738 INFO cartography.intel.aws.ecr_image_layers Fetched layer metadata for 1/1 images (100.0%)
2026-04-26 16:54:22,738 INFO cartography.intel.aws.ecr_image_layers Successfully fetched layers for 1/1 images
2026-04-26 16:54:22,738 INFO cartography.intel.aws.ecr_image_layers Extracted history commands for 10 layers
2026-04-26 16:54:22,739 INFO cartography.intel.aws.ecr_image_layers Successfully fetched layers for 1 images
2026-04-26 16:54:22,739 INFO cartography.intel.aws.ecr_image_layers Loading 10 image layers for region us-east-1 into graph.
2026-04-26 16:54:22,820 INFO cartography.client.core.tx Loaded 10 ECRImageLayer nodes
2026-04-26 16:54:22,820 INFO cartography.intel.aws.ecr_image_layers Loading 9 ECR image layer NEXT relationships for region us-east-1 into graph.
2026-04-26 16:54:22,857 INFO cartography.client.core.tx Loaded 9 (ECRImageLayer)-[NEXT]->(ECRImageLayer) relationships
2026-04-26 16:54:22,857 INFO cartography.intel.aws.ecr_image_layers Loading 1 ECR image HEAD relationships for region us-east-1 into graph.
2026-04-26 16:54:22,888 INFO cartography.client.core.tx Loaded 1 (ECRImage)-[HEAD]->(ECRImageLayer) relationships
2026-04-26 16:54:22,888 INFO cartography.intel.aws.ecr_image_layers Loading 1 ECR image TAIL relationships for region us-east-1 into graph.
2026-04-26 16:54:22,918 INFO cartography.client.core.tx Loaded 1 (ECRImage)-[TAIL]->(ECRImageLayer) relationships
2026-04-26 16:54:23,150 INFO cartography.client.core.tx Loaded 1 ECRImage nodes
2026-04-26 16:54:23,150 INFO cartography.intel.aws.ecr_image_layers Loading 10 ECR image HAS_LAYER relationships for region us-east-1 into graph.
2026-04-26 16:54:23,184 INFO cartography.client.core.tx Loaded 10 (ECRImage)-[HAS_LAYER]->(ECRImageLayer) relationships
2026-04-26 16:54:23,206 INFO cartography.graph.statement Completed ECRImageLayer statement #1
2026-04-26 16:54:23,225 INFO cartography.graph.statement Completed ECRImageLayer statement #2
2026-04-26 16:54:23,246 INFO cartography.graph.statement Completed ECRImageLayer statement #3
2026-04-26 16:54:23,266 INFO cartography.graph.statement Completed ECRImageLayer statement #4
2026-04-26 16:54:23,286 INFO cartography.graph.statement Completed ECRImageLayer statement #5
2026-04-26 16:54:23,286 INFO cartography.graph.job Finished job ECRImageLayer
2026-04-26 16:54:23,307 INFO cartography.graph.statement Completed NEXT statement #1
2026-04-26 16:54:23,307 INFO cartography.graph.job Finished job NEXT
2026-04-26 16:54:23,326 INFO cartography.graph.statement Completed HEAD statement #1
2026-04-26 16:54:23,326 INFO cartography.graph.job Finished job HEAD
2026-04-26 16:54:23,345 INFO cartography.graph.statement Completed TAIL statement #1
2026-04-26 16:54:23,345 INFO cartography.graph.job Finished job TAIL
2026-04-26 16:54:23,365 INFO cartography.graph.statement Completed HAS_LAYER statement #1
2026-04-26 16:54:23,365 INFO cartography.graph.job Finished job HAS_LAYER
```

### Checklist

#### General

- [x] I have read the [contributing guidelines](https://cartography-cncf.github.io/cartography/dev/developer-guide.html).
- [x] The linter passes locally (`make lint`).
- [x] I have added/updated tests that prove my fix is effective or my feature works.

#### Proof of functionality

- [ ] Screenshot showing the graph before and after changes.
- [x] New or updated unit/integration tests.

#### If you are adding or modifying a synced entity

N/A

#### If you are changing a node or relationship

N/A. This changes the load strategy for existing ECR image layer relationships; it does not add or rename graph nodes or relationships.

#### If you are implementing a new intel module

N/A

### Notes for reviewers

The main behavior change is that `NEXT`, `HEAD`, `TAIL`, and `HAS_LAYER` are loaded as flattened matchlink rows instead of as one-to-many relationship arrays during node ingestion.

---------

Signed-off-by: Kunaal Sikka <[email protected]>
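
The flattening described in the notes for reviewers can be sketched in plain Python. This is a minimal illustration rather than the code in this commit: `flatten_next_rows` is a hypothetical helper modelled on `_build_next_relationships`, and the `diff_id` values are made up. The sample chain has 10 layers, mirroring the 10 layers and 9 `NEXT` relationships reported in the test log. Each output row keeps a one-element list, presumably so that relationship definitions written for the array form can be reused unchanged.

```python
from typing import Any


def flatten_next_rows(image_layers: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Emit one row per (layer, next layer) pair instead of one row per layer."""
    return [
        {"diff_id": layer["diff_id"], "next_diff_ids": [next_diff_id]}
        for layer in image_layers
        for next_diff_id in layer.get("next_diff_ids", [])
    ]


# Hypothetical linear chain of 10 layers: layer-0 -> layer-1 -> ... -> layer-9.
layers: list[dict[str, Any]] = [
    {"diff_id": f"sha256:layer-{i}", "next_diff_ids": [f"sha256:layer-{i + 1}"]}
    for i in range(9)
]
layers.append({"diff_id": "sha256:layer-9"})  # the last layer has no successor

rows = flatten_next_rows(layers)
print(len(rows))  # 9
```

Because every row is a single source/target pair, the size of a relationship batch depends only on the number of rows, not on how many targets any one layer happens to have.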
1 parent 306249b commit cf57674

4 files changed

Lines changed: 323 additions & 10 deletions


cartography/intel/aws/ecr_image_layers.py

Lines changed: 114 additions & 10 deletions
@@ -24,8 +24,13 @@
 from cartography.intel.supply_chain import extract_container_parent_image
 from cartography.intel.supply_chain import extract_image_source_provenance
 from cartography.intel.supply_chain import extract_workflow_path_from_ref
-from cartography.models.aws.ecr.image import ECRImageSchema
+from cartography.models.aws.ecr.image import ECRImageHasLayerRelSchema
+from cartography.models.aws.ecr.image import ECRImageLayerEnrichmentSchema
+from cartography.models.aws.ecr.image_layer import ECRImageLayerHeadRelSchema
+from cartography.models.aws.ecr.image_layer import ECRImageLayerNextRelSchema
+from cartography.models.aws.ecr.image_layer import ECRImageLayerNodeSchema
 from cartography.models.aws.ecr.image_layer import ECRImageLayerSchema
+from cartography.models.aws.ecr.image_layer import ECRImageLayerTailRelSchema
 from cartography.util import timeit
 
 logger = logging.getLogger(__name__)
@@ -39,8 +44,11 @@ class ECRLayerFetchTransientError(Exception):
     "sha256:5f70bf18a086007016e948b04aed3b82103a36bea41755b6cddfaf10ace3c6ef"
 )
 
-# Keep per-transaction memory low; each record fan-outs to many relationships.
+# Keep per-transaction memory low; each node record can carry large layer metadata.
 ECR_LAYER_BATCH_SIZE = 200
+# Matchlink rows are simple source/target pairs, so use larger batches while
+# keeping relationship transactions bounded and predictable.
+ECR_LAYER_REL_BATCH_SIZE = 1000
 
 # ECR manifest media types
 ECR_DOCKER_INDEX_MT = "application/vnd.docker.distribution.manifest.list.v2+json"
@@ -639,6 +647,42 @@ def transform_ecr_image_layers(
     return layers, memberships
 
 
+def _build_next_relationships(image_layers: list[dict]) -> list[dict[str, Any]]:
+    return [
+        {
+            "diff_id": layer["diff_id"],
+            "next_diff_ids": [next_diff_id],
+        }
+        for layer in image_layers
+        for next_diff_id in layer.get("next_diff_ids", [])
+    ]
+
+
+def _build_image_to_layer_relationships(
+    image_layers: list[dict],
+    image_id_field: str,
+) -> list[dict[str, Any]]:
+    return [
+        {
+            "diff_id": layer["diff_id"],
+            image_id_field: [image_id],
+        }
+        for layer in image_layers
+        for image_id in layer.get(image_id_field, [])
+    ]
+
+
+def _build_has_layer_relationships(memberships: list[dict]) -> list[dict[str, Any]]:
+    return [
+        {
+            "imageDigest": membership["imageDigest"],
+            "layer_diff_ids": [diff_id],
+        }
+        for membership in memberships
+        for diff_id in membership.get("layer_diff_ids", [])
+    ]
+
+
 @timeit
 def load_ecr_image_layers(
     neo4j_session: neo4j.Session,
@@ -650,23 +694,70 @@ def load_ecr_image_layers(
     """
     Load image layers into Neo4j.
 
-    Uses a conservative batch size (ECR_LAYER_LOAD_BATCH_SIZE) to avoid Neo4j
-    transaction memory limits, since layer objects can contain large arrays of
-    relationships.
+    Load layer nodes separately from NEXT/HEAD/TAIL relationships so each
+    transaction handles a bounded amount of node or relationship data.
     """
     logger.info(
         f"Loading {len(image_layers)} image layers for region {region} into graph.",
     )
 
     load(
         neo4j_session,
-        ECRImageLayerSchema(),
+        ECRImageLayerNodeSchema(),
         image_layers,
         batch_size=ECR_LAYER_BATCH_SIZE,
         lastupdated=aws_update_tag,
         AWS_ID=current_aws_account_id,
     )
 
+    next_relationships = _build_next_relationships(image_layers)
+    logger.info(
+        "Loading %d ECR image layer NEXT relationships for region %s into graph.",
+        len(next_relationships),
+        region,
+    )
+    load(
+        neo4j_session,
+        ECRImageLayerNextRelSchema(),
+        next_relationships,
+        batch_size=ECR_LAYER_REL_BATCH_SIZE,
+        lastupdated=aws_update_tag,
+    )
+
+    head_relationships = _build_image_to_layer_relationships(
+        image_layers,
+        "head_image_ids",
+    )
+    logger.info(
+        "Loading %d ECR image HEAD relationships for region %s into graph.",
+        len(head_relationships),
+        region,
+    )
+    load(
+        neo4j_session,
+        ECRImageLayerHeadRelSchema(),
+        head_relationships,
+        batch_size=ECR_LAYER_REL_BATCH_SIZE,
+        lastupdated=aws_update_tag,
+    )
+
+    tail_relationships = _build_image_to_layer_relationships(
+        image_layers,
+        "tail_image_ids",
+    )
+    logger.info(
+        "Loading %d ECR image TAIL relationships for region %s into graph.",
+        len(tail_relationships),
+        region,
+    )
+    load(
+        neo4j_session,
+        ECRImageLayerTailRelSchema(),
+        tail_relationships,
+        batch_size=ECR_LAYER_REL_BATCH_SIZE,
+        lastupdated=aws_update_tag,
+    )
+
 
 @timeit
 def load_ecr_image_layer_memberships(
@@ -679,20 +770,33 @@ def load_ecr_image_layer_memberships(
     """
     Load image layer memberships into Neo4j.
 
-    Uses a conservative batch size (ECR_LAYER_MEMBERSHIP_BATCH_SIZE) to avoid
-    Neo4j transaction memory limits, since membership objects can contain large
-    arrays of layer diff_ids.
+    Load ECRImage layer metadata separately from HAS_LAYER relationships so
+    each transaction handles a bounded amount of node or relationship data.
     """
     load(
         neo4j_session,
-        ECRImageSchema(),
+        ECRImageLayerEnrichmentSchema(),
         memberships,
         batch_size=ECR_LAYER_BATCH_SIZE,
         lastupdated=aws_update_tag,
         Region=region,
         AWS_ID=current_aws_account_id,
     )
 
+    has_layer_relationships = _build_has_layer_relationships(memberships)
+    logger.info(
+        "Loading %d ECR image HAS_LAYER relationships for region %s into graph.",
+        len(has_layer_relationships),
+        region,
+    )
+    load(
+        neo4j_session,
+        ECRImageHasLayerRelSchema(),
+        has_layer_relationships,
+        batch_size=ECR_LAYER_REL_BATCH_SIZE,
+        lastupdated=aws_update_tag,
+    )
+
 
 async def fetch_image_layers_async(
     ecr_client: ECRClient,
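
How `ECR_LAYER_REL_BATCH_SIZE` bounds each relationship write can be sketched as follows. This is an illustration only: it assumes `load()` writes each `batch_size`-sized chunk in its own transaction, which this diff does not show, and `iter_batches` is a hypothetical helper, not a cartography function. The 2,500 sample rows are made up.

```python
from typing import Any, Iterator

ECR_LAYER_REL_BATCH_SIZE = 1000  # value taken from the diff above


def iter_batches(
    rows: list[dict[str, Any]],
    batch_size: int,
) -> Iterator[list[dict[str, Any]]]:
    """Yield consecutive slices of at most batch_size rows."""
    for start in range(0, len(rows), batch_size):
        yield rows[start:start + batch_size]


# Hypothetical flattened NEXT rows: one source/target pair per row.
rows = [
    {"diff_id": f"sha256:layer-{i}", "next_diff_ids": [f"sha256:layer-{i + 1}"]}
    for i in range(2500)
]

batch_sizes = [len(batch) for batch in iter_batches(rows, ECR_LAYER_REL_BATCH_SIZE)]
print(batch_sizes)  # [1000, 1000, 500]
```

With flattened rows, a batch of 1000 rows means at most 1000 relationships per write. With the earlier array form, a batch of 200 records could expand to an unbounded number of relationships, depending on the length of each array.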

cartography/models/aws/ecr/image.py

Lines changed: 50 additions & 0 deletions
@@ -235,3 +235,53 @@ class ECRImageSchema(CartographyNodeSchema):
             ),
         ],
     )
+
+
+@dataclass(frozen=True)
+class ECRImageLayerEnrichmentSchema(CartographyNodeSchema):
+    """Load ECRImage layer/provenance properties without fan-out HAS_LAYER edges."""
+
+    label: str = "ECRImage"
+    properties: ECRImageNodeProperties = ECRImageNodeProperties()
+    sub_resource_relationship: ECRImageToAWSAccountRel = ECRImageToAWSAccountRel()
+    other_relationships: OtherRelationships = OtherRelationships(
+        [
+            ECRImageToParentImageRel(),
+            ECRImageContainsImageRel(),
+            ECRImageAttestsRel(),
+        ],
+    )
+    extra_node_labels: ExtraNodeLabels = ExtraNodeLabels(
+        [
+            ConditionalNodeLabel(
+                label="Image",
+                conditions={"type": "image"},
+            ),
+            ConditionalNodeLabel(
+                label="ImageAttestation",
+                conditions={"type": "attestation"},
+            ),
+            ConditionalNodeLabel(
+                label="ImageManifestList",
+                conditions={"type": "manifest_list"},
+            ),
+        ],
+    )
+
+
+@dataclass(frozen=True)
+class ECRImageHasLayerRelLoadProperties(CartographyNodeProperties):
+    id: PropertyRef = PropertyRef("imageDigest")
+    digest: PropertyRef = PropertyRef("imageDigest")
+    lastupdated: PropertyRef = PropertyRef("lastupdated", set_in_kwargs=True)
+
+
+@dataclass(frozen=True)
+class ECRImageHasLayerRelSchema(CartographyNodeSchema):
+    """Load bounded HAS_LAYER relationship rows without reloading image metadata."""
+
+    label: str = "ECRImage"
+    properties: ECRImageHasLayerRelLoadProperties = ECRImageHasLayerRelLoadProperties()
+    other_relationships: OtherRelationships = OtherRelationships(
+        [ECRImageHasLayerRel()],
+    )

cartography/models/aws/ecr/image_layer.py

Lines changed: 52 additions & 0 deletions
@@ -106,3 +106,55 @@ class ECRImageLayerSchema(CartographyNodeSchema):
         ]
     )
     extra_node_labels: ExtraNodeLabels = ExtraNodeLabels(["ImageLayer"])
+
+
+@dataclass(frozen=True)
+class ECRImageLayerNodeSchema(CartographyNodeSchema):
+    """Load ECRImageLayer nodes without high-fanout one-to-many relationships."""
+
+    label: str = "ECRImageLayer"
+    properties: ECRImageLayerNodeProperties = ECRImageLayerNodeProperties()
+    sub_resource_relationship: ECRImageLayerToAWSAccountRel = (
+        ECRImageLayerToAWSAccountRel()
+    )
+    extra_node_labels: ExtraNodeLabels = ExtraNodeLabels(["ImageLayer"])
+
+
+@dataclass(frozen=True)
+class ECRImageLayerRelLoadProperties(CartographyNodeProperties):
+    id: PropertyRef = PropertyRef("diff_id")
+    diff_id: PropertyRef = PropertyRef("diff_id")
+    lastupdated: PropertyRef = PropertyRef("lastupdated", set_in_kwargs=True)
+
+
+@dataclass(frozen=True)
+class ECRImageLayerNextRelSchema(CartographyNodeSchema):
+    """Load bounded NEXT relationship rows without reloading layer metadata."""
+
+    label: str = "ECRImageLayer"
+    properties: ECRImageLayerRelLoadProperties = ECRImageLayerRelLoadProperties()
+    other_relationships: OtherRelationships = OtherRelationships(
+        [ECRImageLayerToNextRel()],
+    )
+
+
+@dataclass(frozen=True)
+class ECRImageLayerHeadRelSchema(CartographyNodeSchema):
+    """Load bounded HEAD relationship rows without reloading layer metadata."""
+
+    label: str = "ECRImageLayer"
+    properties: ECRImageLayerRelLoadProperties = ECRImageLayerRelLoadProperties()
+    other_relationships: OtherRelationships = OtherRelationships(
+        [ECRImageLayerHeadOfImageRel()],
+    )
+
+
+@dataclass(frozen=True)
+class ECRImageLayerTailRelSchema(CartographyNodeSchema):
+    """Load bounded TAIL relationship rows without reloading layer metadata."""
+
+    label: str = "ECRImageLayer"
+    properties: ECRImageLayerRelLoadProperties = ECRImageLayerRelLoadProperties()
+    other_relationships: OtherRelationships = OtherRelationships(
+        [ECRImageLayerTailOfImageRel()],
+    )
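
One way to sanity-check that the relationship-only schemas leave the graph model intact is to compare the edges implied by the original one-to-many arrays with the edges implied by the flattened rows. The sketch below does this for `HEAD` in plain Python, without a Neo4j session. `flatten` and `edge_set` are hypothetical helpers and the digests are made up; the only detail taken from the diff is the `diff_id` / `head_image_ids` row shape.

```python
from typing import Any

Record = dict[str, Any]


def flatten(records: list[Record], key: str, list_field: str) -> list[Record]:
    """One row per (key, target) pair; each row keeps a one-element list."""
    return [
        {key: record[key], list_field: [target]}
        for record in records
        for target in record.get(list_field, [])
    ]


def edge_set(records: list[Record], key: str, list_field: str) -> set[tuple[str, str]]:
    """Collect the (key, target) pairs that a list of records would link."""
    return {
        (record[key], target)
        for record in records
        for target in record.get(list_field, [])
    }


# Hypothetical data: one base layer that is the HEAD of three images.
layers: list[Record] = [
    {
        "diff_id": "sha256:base",
        "head_image_ids": ["sha256:img-a", "sha256:img-b", "sha256:img-c"],
    },
    {"diff_id": "sha256:app"},  # not the head of any image
]

rows = flatten(layers, "diff_id", "head_image_ids")
same_edges = (
    edge_set(rows, "diff_id", "head_image_ids")
    == edge_set(layers, "diff_id", "head_image_ids")
)
print(len(rows), same_edges)  # 3 True
```

The same comparison applies to `TAIL`, `NEXT`, and `HAS_LAYER` by changing the key and list field names.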
