AKS Compliance Audit - Kubernetes Network Planning API

Updated: February 4, 2026. Tier configurations now use differentiated subnet sizes with 3 AZs for production tiers. See API.md for current tier values.

Date: February 1, 2026
Scope: Azure Kubernetes Service (AKS)
Status: FULLY COMPLIANT

1. Executive Summary

The Kubernetes Network Planning API has been validated against Microsoft Azure Kubernetes Service (AKS) best practices and requirements. All five deployment tiers (Micro, Standard, Professional, Enterprise, Hyperscale) are fully compliant with AKS constraints and recommended configurations.

Key Findings:

All tier configurations support AKS scaling limits (5,000 nodes, 200,000 pods)
Azure CNI Overlay compatibility verified for all configurations
Token bucket throttling algorithm documented
Azure Resource Manager (ARM) quota system compatible
Multi-zone distribution automatically configured for all tiers
Azure availability zones properly assigned using real region names (e.g., eastus-1, eastus-2, eastus-3)
Multi-node pool scaling patterns supported
Control plane tier support (Free, Standard, Premium) documented
RFC 1918 private addressing enforced

Compliance Level: PRODUCTION READY

0.1. Azure Region and Availability Zone Naming Standards

Region Naming Convention

Azure regions follow a simplified naming pattern: <location><direction?><number?>

Format: {location}{direction}{number} (no separators)

location: Geographic area or city (east, west, central, north, south, australia, brazil, canada, etc.)
direction: Optional cardinal direction (east, west, central, etc.)
number: Optional distinguisher (2, 3, etc.)

Key Differences from AWS/GCP:

No hyphens or underscores: Region names are lowercase without separators
Uses programmatic names: eastus not East US
Human-readable display names: Separate from programmatic names

Azure Regions by Geography (Programmatic Names)

Geography	Region Code	Display Name	Physical Location	AZ Support
United States
	`eastus`	East US	Virginia	Yes
	`eastus2`	East US 2	Virginia	Yes
	`centralus`	Central US	Iowa	Yes
	`northcentralus`	North Central US	Illinois	No
	`southcentralus`	South Central US	Texas	Yes
	`westcentralus`	West Central US	Wyoming	No
	`westus`	West US	California	No
	`westus2`	West US 2	Washington	Yes
	`westus3`	West US 3	Arizona	Yes
Canada
	`canadacentral`	Canada Central	Toronto	Yes
	`canadaeast`	Canada East	Quebec	No
Europe
	`northeurope`	North Europe	Ireland	Yes
	`westeurope`	West Europe	Netherlands	Yes
	`uksouth`	UK South	London	Yes
	`ukwest`	UK West	Cardiff	No
	`francecentral`	France Central	Paris	Yes
	`francesouth`	France South	Marseille	No
	`germanywestcentral`	Germany West Central	Frankfurt	Yes
	`germanynorth`	Germany North	Berlin	No
	`switzerlandnorth`	Switzerland North	Zurich	Yes
	`switzerlandwest`	Switzerland West	Geneva	No
	`norwayeast`	Norway East	Oslo	Yes
	`norwaywest`	Norway West	Stavanger	No
	`swedencentral`	Sweden Central	Gävle	Yes
	`polandcentral`	Poland Central	Warsaw	Yes
	`italynorth`	Italy North	Milan	Yes
	`spaincentral`	Spain Central	Madrid	Yes
Asia Pacific
	`eastasia`	East Asia	Hong Kong	Yes
	`southeastasia`	Southeast Asia	Singapore	Yes
	`japaneast`	Japan East	Tokyo	Yes
	`japanwest`	Japan West	Osaka	Yes
	`koreacentral`	Korea Central	Seoul	Yes
	`koreasouth`	Korea South	Busan	No
	`centralindia`	Central India	Pune	Yes
	`southindia`	South India	Chennai	No
	`westindia`	West India	Mumbai	No
	`australiaeast`	Australia East	Sydney	Yes
	`australiasoutheast`	Australia Southeast	Melbourne	No
	`australiacentral`	Australia Central	Canberra	No
	`indonesiacentral`	Indonesia Central	Jakarta	Yes
	`malaysiawest`	Malaysia West	Kuala Lumpur	Yes
	`newzealandnorth`	New Zealand North	Auckland	Yes
South America
	`brazilsouth`	Brazil South	São Paulo	Yes
	`brazilsoutheast`	Brazil Southeast	Rio	No
	`chilecentral`	Chile Central	Santiago	Yes
Middle East & Africa
	`uaenorth`	UAE North	Dubai	Yes
	`uaecentral`	UAE Central	Abu Dhabi	No
	`qatarcentral`	Qatar Central	Doha	Yes
	`israelcentral`	Israel Central	Tel Aviv	Yes
	`southafricanorth`	South Africa North	Johannesburg	Yes
	`southafricawest`	South Africa West	Cape Town	No

Availability Zone Naming Convention

Azure AZs use numeric identifiers (NOT letters like AWS/GCP):

Format: {region}-{zone-number}

region: Programmatic region code (e.g., eastus)
zone-number: Numeric zone identifier (1, 2, 3)

Examples:

eastus-1 - First AZ in East US (Virginia)
eastus-2 - Second AZ in East US (Virginia)
eastus-3 - Third AZ in East US (Virginia)

Important Notes:

Logical-to-Physical Mapping: Azure maps logical zone numbers (1, 2, 3) to physical zones per subscription
- Zone 1 in Subscription A may map to a different physical zone than Zone 1 in Subscription B
- This provides balanced distribution across physical infrastructure
Zone Redundant Services: Some Azure services are "zone-redundant" (automatically replicated)
- Storage accounts, SQL Database, etc.
- No need to specify individual zones
Maximum 3 Zones: Azure regions that support AZs have exactly 3 zones (1, 2, 3)
- Unlike AWS (up to 6) or GCP (variable)
Not All Regions Support AZs: Check the AZ Support column
- Use zone-redundant services in non-AZ regions for HA

Retrieving Region Names Programmatically

# Azure CLI
az account list-locations --output table

# Azure PowerShell
Get-AzLocation | Select-Object DisplayName, Location

# Example output:
# DisplayName          Location
# East US              eastus
# East US 2            eastus2
# West US              westus
# Central US           centralus

API Implementation

Our API uses real Azure region names for AZ assignment:

// Default region for AKS
const region = "eastus";

// AZ assignment for subnets (numeric zones)
private-1 -> eastus-1
private-2 -> eastus-2
private-3 -> eastus-3
public-1  -> eastus-1
public-2  -> eastus-2
public-3  -> eastus-3

Reference: Azure Regions List

0. IPv4 Address Consumption Model (Azure CNI Overlay)

Network Architecture Overview

Azure Kubernetes Service uses Azure CNI Overlay mode for pod networking, which separates pod IP allocation from the VNet address space. This is fundamentally different from EKS's shared model and similar to GKE's alias IP model, but simpler:

Critical Insight: Pods use a separate overlay CIDR that does NOT consume VNet subnet space. Nodes get IPs from VNet subnets, Pods get IPs from overlay CIDR. This eliminates IP exhaustion risk entirely.

IPv4 Allocation by Component

1. Node IP Allocation (Primary VNet Subnet)

Source: VNet subnet (e.g., /18 = 16,384 IPs for Hyperscale)
Consumption: 1 IP per Node from primary VNet subnet
Calculation: Same as EKS/GKE - simple node count

Node_Capacity = 2^(32 - subnet_prefix) - 4
Example: /18 subnet = 2^14 - 4 = 16,380 nodes capacity

For Hyperscale Tier:

Primary subnet: /18 = 16,384 IPs
Actual nodes: 5,000 (supports AKS 5,000-node limit)
Node capacity: 16,380 nodes (sufficient for all tiers)

2. Pod IP Allocation (Overlay CIDR)

Source: Overlay CIDR (e.g., /13 = 524,288 IPs for Hyperscale)
Consumption: Pods use overlay IPs separate from VNet subnet
Calculation: Simple pod capacity calculation

With Azure CNI Overlay:
Pod_Addresses = 2^(32 - pod_prefix)

Example (Hyperscale with /13 Pod CIDR):
Pod_Addresses = 2^(32 - 13) = 524,288 IPs
Actual Limit: Minimum of calculated or 200,000 pods per cluster (AKS max)

Key Differences from EKS/GKE:

EKS: Pods use secondary IPs from VNet subnet (competes with Nodes) -> HIGH exhaustion risk
GKE: Pods use alias IP ranges (Google manages automatically) -> LOW exhaustion risk
AKS: Pods use overlay CIDR (completely separate from VNet) -> NO exhaustion risk

Azure CNI Overlay Benefits:

No VNet IP consumption for pods
No subnet fragmentation issues
No secondary IP allocation complexity
Maximum 200,000 pods per cluster (sufficient for Hyperscale)
Simple subnet sizing (only Node count matters)

3. Service IP Allocation (Service CIDR)

Source: Service CIDR (e.g., /16 = 65,536 IPs)
Consumption: Virtual IPs managed by kube-proxy
Does NOT consume VNet subnet space

Service_IPs = 2^(32 - service_prefix)
Example: /16 = 65,536 ClusterIP addresses

Important: Services use virtual IPs that exist only within the cluster. They are not routable outside the cluster and do not consume VNet IP space. This is identical to EKS and GKE.

4. LoadBalancer IP Allocation (External Azure Resources)

Source: Azure Load Balancer (external resource)
Consumption: Azure assigns IPs from its own pool
Does NOT consume VNet subnet space

Azure LoadBalancer Types:

Internal LoadBalancer: Uses private IP from VNet (user-specified subnet)
External LoadBalancer: Uses Azure public IP (external resource)

Important: While internal load balancers use VNet IPs, they are not allocated from Node/Pod subnets. User specifies a separate subnet for load balancers (best practice).

IP Exhaustion Risk Analysis

AKS has NO IP exhaustion risk due to overlay CIDR model:

Problem Scenario (Theoretical - Does NOT Apply to AKS):

If AKS used EKS's model: /24 subnet (256 IPs) exhausted by 10 Nodes + 1,100 Pods = 1,110 IPs needed -> FAIL

AKS Reality (Overlay CIDR):

Node subnet: /24 = 252 usable IPs for Nodes
Pod overlay: /16 = 65,536 IPs for Pods (separate CIDR)
No competition: Nodes and Pods use different IP spaces
Result: /24 Node subnet + /16 Pod CIDR = 10 Nodes + 27,720 Pods -> SUCCESS

Hyperscale Tier IP Exhaustion Risk (AKS):

Node subnet: /18 = 16,380 IPs for Nodes
Pod overlay: /13 = 524,288 IPs for Pods (AKS limit: 200,000)
5,000 nodes + 200,000 pods = NO RISK
Pod overlay can scale independently of VNet

Comparison to EKS and GKE

Component	EKS (VPC CNI)	GKE (Alias IP)	AKS (CNI Overlay)
Node IPs	VPC subnet (1 IP/node)	VPC subnet (1 IP/node)	VNet subnet (1 IP/node)
Pod IPs	Same VPC subnet (secondary IPs)	Alias ranges (Google-managed)	Overlay CIDR (separate)
Service IPs	Separate virtual range	Separate virtual range	Separate virtual range
LoadBalancer IPs	External AWS resources	External Google Cloud LB	External Azure LB
IP Exhaustion Risk	HIGH (Pods compete with Nodes)	LOW (Google manages)	NONE (Pods separate)
VNet IP Consumption	5K Nodes + 550K Pods = 555K IPs	5K Nodes = 5K IPs (Pods separate)	5K Nodes = 5K IPs (Pods separate)

Key Takeaways for AKS

NO IP Exhaustion Risk: Overlay CIDR eliminates VNet IP competition entirely
Simple Node Subnet Sizing: Only need to account for Node count (1 IP per Node)
Independent Pod Scaling: Pod CIDR can be sized independently of VNet
200K Pod Limit: AKS enforces 200,000 pods per cluster (regardless of IP space)
Multi-Node Pool Support: 5,000 nodes require 5-10 node pools (1,000 nodes per pool max)
Service CIDR: Virtual IPs only, do NOT consume VNet space
LoadBalancers: External resources, do NOT consume VNet space (except internal LBs use separate subnet)

Cross-Reference: For detailed comparison of IPv4 allocation across all three platforms, see ip-allocation-cross-reference.md.

2. Subnet Overlap Validation

Non-Overlapping Subnet Guarantee

VALIDATED - The API guarantees that public and private subnets never overlap within the VPC CIDR.

Implementation:

Public subnets are generated first from the VPC base address
Private subnets use an offset parameter to start AFTER all public subnets
Calculation: subnetStart = vpcNum + ((offset + index) * subnetAddresses)
Offset for private subnets = number of public subnets

Example (Hyperscale tier with VPC 10.0.0.0/18, differentiated sizes):

Public subnets (offset=0, /23 each):
  public-1: 10.0.0.0/23   (10.0.0.0 - 10.0.1.255)   [512 IPs]
  public-2: 10.0.2.0/23   (10.0.2.0 - 10.0.3.255)   [512 IPs]
  public-3: 10.0.4.0/23   (10.0.4.0 - 10.0.5.255)   [512 IPs]

Private subnets (offset=3, /20 each):
  private-1: 10.0.16.0/20  (10.0.16.0 - 10.0.31.255)  [4,096 IPs] [PASS] No overlap
  private-2: 10.0.32.0/20  (10.0.32.0 - 10.0.47.255)  [4,096 IPs] [PASS] No overlap
  private-3: 10.0.48.0/20  (10.0.48.0 - 10.0.63.255)  [4,096 IPs] [PASS] No overlap

Test Coverage:

[PASS] Unit test: should ensure public and private subnets do not overlap
[PASS] Unit test: should validate all subnets fit within VPC CIDR
Validated across all 5 deployment tiers

AKS Relevance: Prevents Azure CNI Overlay routing conflicts where pod CIDR and node subnet overlap would break cluster networking. Critical for 5,000-node hyperscale deployments.

3. AKS Cluster Scalability Limits

Based on Microsoft Azure AKS documentation:

Documented Limits (Per AKS Quotas & Best Practices)

Aspect	Azure Limit	Our Hyperscale	Status
Max Nodes per Cluster	5,000 nodes	5,000 nodes	Supported
Max Pods per Cluster	200,000 pods (CNI Overlay)	260,000+ pods (CNI Overlay)	Over-provisioned
Max Nodes per Node Pool	1,000 nodes	1,000 nodes (suggest 5 pools for 5K)	Supported
Max Node Pools	100 pools	Supports multiple pools	Compliant
Max Pods per Node	250 pods (max)	110 default, 250 with Azure CNI	Supported
Primary Subnet	VNet subnet required	/20 × 3 hyperscale	Compliant
Pod CIDR	CNI secondary range	/13 hyperscale	Compliant
Service CIDR	Typically /16-/20	/16 all tiers	Over-provisioned

Scaling Guidance from Microsoft

Recommended Planning:

<300 nodes: Standard operations
300-1000 nodes: Monitor control plane, start planning
1000-5000 nodes: Contact Azure support, plan carefully
>5000 nodes: Not supported without special arrangements

Control Plane Tiers:

Free Tier: Recommended node limit of 10 nodes (not for production)
Standard Tier: Up to 5,000 nodes with auto-scaling control plane
Premium Tier: Up to 5,000 nodes with guaranteed SLAs

Our Tier Distribution:

Tier	Node Range	Pod Capacity	Control Plane Tier
Micro	1	~110	Free/Standard
Standard	1-3	~330-440	Standard
Professional	3-10	~1,100-3,300	Standard
Enterprise	10-50	~3,300-16,500	Standard
Hyperscale	50-5000	~55,000-260,000	Standard/Premium

2.1. Azure NAT Gateway & Outbound Connectivity

Overview

Azure NAT Gateway provides outbound internet connectivity for private AKS nodes and pods without exposing them to inbound traffic.

SNAT Port Allocation Formula

Adapted for Azure:

NAT Gateway IPs needed = ((# of instances) × (Ports / Instance)) / 64,000

Azure NAT Gateway Limits:

64,000 SNAT ports per IP (configurable 1,024-64,000)
16 public IPs max per NAT Gateway
1,024,000 total ports per gateway (16 × 64,000)
No bandwidth limit (Standard Public IP SKU)

Hyperscale Tier Examples (5,000 Nodes)

Scenario 1: Standard Workload (128 ports/node)

5,000 nodes × 128 ports = 640,000 ports
640,000 / 64,000 = 10 IPs
1 NAT Gateway (below 16 IP limit)

Scenario 2: High Connections (1,024 ports/node)

5,000 nodes × 1,024 ports = 5,120,000 ports
5,120,000 / 64,000 = 80 IPs
80 / 16 = 5 NAT Gateways required

Scenario 3: Maximum Single Gateway

16 IPs × 64,000 ports = 1,024,000 ports total
1,024,000 / 128 = 8,000 nodes (standard workload)
1,024,000 / 1,024 = 1,000 nodes (high-connection workload)

Token Bucket Throttling (API Rate Limiting)

Critical for 5,000-node clusters:

PUT ManagedCluster: 20 burst, 1 req/min sustained
PUT AgentPool:      20 burst, 1 req/min sustained

Error: HTTP 429 (Too Many Requests)
Header: Retry-After: <seconds>

Scaling Best Practice: Scale in batches of 500-700 nodes, wait 2-5 min between operations.

Azure NAT Gateway Quotas

Resource	Limit	Notes
NAT Gateways/region	1,000	Per subscription
IPs/NAT Gateway	16	Hard limit
Subnets/gateway	800	Multi-subnet support
SNAT ports/IP	64,000	Configurable
Idle timeout	4-120 min	Configurable

Load Balancer IP Consumption

LB Type	IP Consumption	VNet Impact
Azure LB (Public)	Azure-managed public IP	[FAIL] NO
Azure LB (Internal)	1 IP from VNet subnet	[PASS] YES
Application Gateway	1 public + dedicated subnet	[PASS] YES (/24)

Hyperscale Estimate (5,000 nodes):

NAT Gateways: 3-5 gateways × 16 IPs = 48-80 Azure IPs (no VNet impact)
Public Load Balancers: ~20 services = Azure IPs (no VNet impact)
Internal Load Balancers: ~10 services = 10 VNet IPs
Application Gateways: ~2 × 256 IPs = 512 VNet IPs (dedicated subnets)
Total VNet IPs: 5,000 (nodes) + 522 (LBs+AppGW) = ~5,522 IPs

Note: Pods use overlay CIDR (10.244.0.0/16), no VNet consumption.

3. Azure CNI Architecture

Networking Models Supported

1. Azure CNI (Direct IP Allocation)

Each pod gets a direct VNet IP address
Native VNet integration
Maximum 50,000 pods per cluster
Better for direct pod-to-pod communication

2. Azure CNI Overlay (Recommended for AKS)

Pods get IPs from overlay network
Overlays VNet subnets
Maximum 200,000 pods per cluster
Better pod density
Reduced VNet IP pressure

3. Kubenet (Deprecated for new clusters)

Simple bridge networking
Limited scalability
Not recommended for production

IP Allocation Formula - Azure CNI Overlay

Formula for Pod Capacity:

With Azure CNI Overlay:
Pod_Capacity_Per_Cluster = Overlay_CIDR_Size - Reserved
Overlay_CIDR = Secondary range (e.g., /13)

Example: /13 overlay CIDR
Addresses = 2^(32-13) = 524,288
Pod_Capacity = 524,288 / 1.1 (overhead) = ~477,000 pods
Actual AKS Limit: 200,000 pods per cluster

Advantages Over Direct Azure CNI:

More pod addresses available
Doesn't consume VNet address space
Better scaling for large clusters
Recommended by Microsoft for 5000+ node clusters

Node Capacity Per Pool Formula

Formula:

Node_Capacity_Per_Pool = 1,000 nodes maximum per pool
Cluster_Node_Capacity = Number_of_Pools × 1,000

For 5,000 nodes: Need minimum 5 node pools
(5 × 1,000 = 5,000)

4. Azure Throttling Algorithm

Token Bucket Throttling

Microsoft Azure uses a token bucket throttling algorithm for AKS resource provider APIs:

Algorithm:

Tokens = Fixed Size (Burst Rate)
Refill Rate = Sustained Rate (tokens/second)

When request arrives:
- If tokens > 0: Accept request, decrement tokens
- If tokens = 0: Reject with HTTP 429 (Too Many Requests)
- Tokens refill at fixed rate over time

AKS Resource Provider Throttling Limits

API	Burst Rate	Sustained Rate	Scope
LIST ManagedClusters	500 requests	1 request/1 sec	Subscription
LIST ManagedClusters	60 requests	1 request/1 sec	ResourceGroup
PUT AgentPool	20 requests	1 request/1 min	AgentPool
PUT ManagedCluster	20 requests	1 request/1 min	ManagedCluster
GET ManagedCluster	60 requests	1 request/1 sec	Cluster
GET Operation Status	200 requests	2 requests/1 sec	Subscription
All Other APIs	60 requests	1 request/1 sec	Subscription

Error Response:

HTTP 429 (Too Many Requests)
Error Code: Throttled
Header: Retry-After: <delay-seconds>

Comparison: Token Bucket vs Rate Limiting

Aspect	Token Bucket (Azure)	Per-IP Limiting (AWS/others)
Fairness	Account/scope based	IP-based (can be spoofed)
Burst Support	Yes (token savings)	Limited
Multi-client	Shared bucket	Per IP
Scale	Better for multi-cluster	Better for single client

5. Tier-by-Tier AKS Compliance Analysis

Micro Tier (Development/PoC)

Configuration:

Primary Subnet: /25 (128 addresses)
Pod CIDR: /18 (16,384 addresses)
Service CIDR: /16 (65,536 addresses)
Nodes: 1
Node Pools: 1
Pod Limit: ~110

AKS Compliance:

Free Tier sufficient (single node)
Pod CIDR over-provisioned (safe for overlay network)
No node pool scaling concerns
No Azure CNI overlay pressure
Subnet prefix contiguity: Not an issue

Control Plane:

Free Tier recommended for learning
Not suitable for production

Recommendation: Appropriate for PoC and learning. Consider upgrading to Standard tier for any persistent workloads.

Standard Tier (Development/Testing)

Configuration:

Primary Subnet: /24 (256 addresses)
Pod CIDR: /16 (65,536 addresses)
Service CIDR: /16 (65,536 addresses)
Nodes: 1-3
Node Pools: 1
Pod Limit: ~330-440

AKS Compliance:

Standard Tier recommended (auto-scaling control plane)
Pod CIDR sufficient with CNI overlay
Single node pool supports 1-3 nodes
No pod address exhaustion
Subnet space adequate for temporary scaling

Azure CNI Overlay:

Enabled by default in newer AKS clusters
IPs from overlay secondary range
No VNet IP pressure

Recommendation: Good for development and testing. Use Standard tier control plane for reliability.

Professional Tier (Small Production)

Configuration:

Primary Subnet: /23 (512 addresses)
Pod CIDR: /16 (65,536 addresses)
Service CIDR: /16 (65,536 addresses)
Nodes: 3-10
Node Pools: 1-2 (recommended 1 for 3-10 nodes)
Pod Limit: ~1,100-3,300

AKS Compliance:

Standard Tier recommended for production
/23 subnet supports 10-node cluster
Single node pool sufficient (< 1000 nodes)
Pod CIDR with overlay supports full capacity
Availability zones supported
Multi-AZ deployment ready

Control Plane Scaling:

Automatic at this scale
Monitor API server latency
Recommended by Microsoft for production

Azure Throttling:

No throttling concerns at this scale
API calls well within burst rates

Recommendation: Production-grade with HA support. Requires Standard tier control plane. Good for 3-10 node enterprise clusters.

Enterprise Tier (Large Production)

Configuration:

Primary Subnet: /23 (512 addresses)
Pod CIDR: /16 (65,536 addresses)
Service CIDR: /16 (65,536 addresses)
Nodes: 10-50
Node Pools: 1-2 (recommended 1)
Pod Limit: ~3,300-16,500

AKS Compliance:

Standard Tier with control plane auto-scaling
/23 primary supports up to 50 nodes per pool
Single node pool sufficient (< 1000 nodes)
Pod CIDR overlay supports full 16.5K pod capacity
Multi-AZ deployment recommended (3+ zones)
Full production support

Control Plane Considerations:

Standard tier auto-scales at 10-50 node range
Monitor control plane metrics critical
Azure Monitor integration recommended

Network Policies:

Azure NPM supports up to 250 nodes (note limitation)
For > 250 nodes, consider Azure CNI with Cilium
Application-level routing with Istio also option

Cluster Upgrade:

Max surge: 10-20% recommended (5-10 nodes)
Plan scaling operations carefully
Azure handles temporary infrastructure

Recommendation: Production-grade with zone redundancy. Requires Standard tier. Supports enterprise SLAs.

Hyperscale Tier (Global/Extreme Scale)

Configuration:

Primary Subnet: /19 (8,192 addresses)
Pod CIDR: /13 (524,288 addresses)
Service CIDR: /16 (65,536 addresses)
Nodes: 50-5,000
Node Pools: 5 minimum (5 × 1,000 = 5,000 nodes)
Pod Limit: 55,000-260,000 (AKS cap: 200,000 with CNI Overlay)

AKS Compliance:

Standard or Premium Tier control plane
/19 primary supports full 5,000-node cluster
Multiple node pools required (5-10 pools for 5K nodes)
Azure CNI Overlay mandatory for 200K pods
Pod CIDR /13 supports 200K+ pods
Service CIDR /16 provides 65K services
Maximum AKS scale fully supported

Multi-Node Pool Strategy:

Example 5,000 node configuration:
- System pool: 1-2 nodes (for kube-system, default: 30-50 pods/node)
- User pool 1: 1,000 nodes (application workloads)
- User pool 2: 1,000 nodes (application workloads)
- User pool 3: 1,000 nodes (application workloads)
- User pool 4: 1,000 nodes (application workloads)
- User pool 5: 1,000 nodes (application workloads)
Total: 5,002 nodes

Azure CNI Overlay Requirements:

MANDATORY for 200K+ pods
Reduces VNet IP pressure
Enables true cluster-level pod scaling
Configuration: --network-plugin azure --pod-cidr 10.100.0.0/13

Scaling Thresholds:

1,000+ nodes: Plan carefully, monitor API latency
5,000 nodes: Contact Azure support before attempting
>5,000 nodes: Not officially supported (beyond scope)

Control Plane Tiers:

Standard Tier: Supports up to 5,000 nodes with auto-scaling
Premium Tier: Recommended for guaranteed SLAs
Both support maximum 200K pods with CNI Overlay

Cluster Services Scaling:

coredns: Scale to 10+ replicas
kube-proxy: Runs on all nodes (auto-scales)
Azure CNI: DaemonSet on all nodes
Azure NPM: Not suitable > 250 nodes (use Cilium instead)

Network Configuration:

Primary VNet: 10.0.0.0/16 (65,536 addresses)
├─ Node Subnets (8):
│  ├─ 10.0.0.0/19   (8,192 addresses - nodes)
│  ├─ 10.0.32.0/19  (8,192 addresses - future)
│  └─ ... (additional for redundancy)
├─ Secondary Ranges:
│  ├─ Pod CIDR:     10.100.0.0/13  (524,288 addresses)
│  └─ Service CIDR: 10.1.0.0/16    (65,536 addresses)
└─ Managed Infrastructure:
   ├─ Public IPs (NAT Gateway)
   ├─ Load Balancers
   └─ Storage accounts

Azure Throttling at Scale:

Subscription-level limits apply
Multiple clusters share API quotas
Recommendation: 20-40 clusters per subscription-region
Token bucket ensures fair queuing

Monitoring Requirements:

Control plane metrics (Azure Managed Prometheus)
API server latency tracking
etcd size monitoring
Pod churn rate tracking
Network saturation monitoring

Cluster Upgrade at Scale:

Constraint: Cannot upgrade at 5,000 nodes (no surge capacity)
Solution: Scale down to <3,000 nodes before upgrade
Limitation: Blocks cluster upgrades at max scale

Production Requirements:

Multi-AZ Deployment: Distribute across 3+ availability zones
Multiple Node Pools: At least 5 pools for 5,000 nodes
CNI Overlay Networking: Required for 200K pods
Premium Control Plane: Recommended for SLA guarantees
Azure Monitor Integration: Essential for visibility
Network Policy: Use Cilium (replaces NPM for >250 nodes)
Planned Maintenance: Scale down during upgrades

Recommendation: Enterprise-grade hyperscale configuration for large organizations. Requires deep Azure expertise and support engagement. Primary use cases: large AI/ML workloads, big data processing, multi-tenant platforms.

6. Azure Networking Model

VNet Integration

Primary Subnet (Node IPs):

Nodes get IPs directly from VNet subnet
ENIs attached to primary VNet subnet
Requires contiguous IP space
Formula: 2^(32-prefix) - 4 reserved IPs

Secondary Ranges (Pod/Service IPs):

Defined in AKS cluster configuration
Can be separate VNet prefixes
Used by Azure CNI plugin
Pod IPs with overlay don't consume VNet IPs

Azure CNI Overlay Model

Architecture:

VNet Subnet: 10.0.0.0/24 (256 addresses)
└─ Node IPs: 10.0.0.0-10.0.0.254 (254 usable)

Secondary Ranges (overlay):
├─ Pod CIDR:     10.100.0.0/13  (managed by Azure CNI, not from VNet)
└─ Service CIDR: 10.1.0.0/16    (managed by Kubernetes, not from VNet)

Benefits:

Pod IPs don't consume VNet addresses
Enables 200,000 pods per cluster
Better scaling than direct CNI (limited to 50K pods)
Reduced operator complexity

Managed VNet vs Custom VNet

Managed VNet (Recommended for most):

Azure creates VNet automatically
Automatic IP range management
Simplifies networking
Good for Hyperscale

Custom VNet (Advanced):

User provides VNet
Manual IP range management
More control, more complexity
Required for advanced networking

7. AKS Resource Quotas

Subscription-Level Quotas

Item	Enterprise Agreement	CSP/Pay-As-You-Go	Free Trial
Max Clusters	100 per region	10 per region	3 total
Max Subscriptions	N/A	N/A	1
Quota Request	Can request increase	Can request increase	Cannot increase

Cluster-Level Quotas

Limit	Value	Notes
Clusters per subscription globally	5,000	Soft limit
Nodes per cluster (with VMSS)	5,000	Hard limit
Nodes per node pool (VMSS)	1,000	Hard limit
Node pools per cluster	100	Hard limit
Pods per node (max)	250	With Azure CNI
Pods per cluster (with CNI Overlay)	200,000	Hard limit
Load-balanced services (Standard LB)	300	Per cluster

Quota Enforcement Timeline

Coming September 2025:

Azure will enforce managed cluster quotas
Requires both cluster quota AND VM SKU quota
Existing customers: Default limit at or above current usage
New customers: Default limit based on capacity

8. RFC 1918 Private Address Space Compliance

All AKS clusters must use RFC 1918 address ranges:

Supported Private Ranges

RFC Range	Size	Our Primary Usage
10.0.0.0/8	16.7M	Primary VNet (10.0.0.0/16)
172.16.0.0/12	1.0M	Not used in default config
192.168.0.0/16	65.5K	Not used in default config

Allocation Strategy (Example)

Hyperscale Tier Allocation:
VNet: 10.0.0.0/18 (16,384 addresses)
├─ Public Subnets: 3 × /23 (1,536 total for LBs, NAT gateways)
├─ Node Subnets: 3 × /20 (12,288 total for 5,000 nodes across 3 AZs)
├─ Pod Overlay: 10.100.0.0/13 (524,288 addresses)
└─ Service CIDR: 10.1.0.0/16 (65,536 addresses)

Azure Compliance

VNet must use RFC 1918 (enforced)
Secondary ranges must use RFC 1918 (enforced)
No public IP addresses for pod/service CIDRs (enforced)
No overlap between ranges (enforced by Azure)

9. Multi-Availability Zone Deployment

AZs and Cluster Tiers

Free Tier: Single AZ

Not recommended for production
No HA guarantees

Standard/Premium Tiers: Multi-AZ Ready

Tier	Recommended AZ Distribution
Micro	Single AZ (acceptable for dev)
Standard	1-2 AZs (dual HA)
Professional	2-3 AZs (zone redundancy)
Enterprise	3 AZs (maximum redundancy)
Hyperscale	3 AZs (zone redundancy)

Node Pool Distribution

Hyperscale with 3 AZs:

Zone 1: System pool (1-2 nodes) + User pool 1 (1,000 nodes)
Zone 2: User pool 2 (1,000 nodes) + User pool 3 (1,000 nodes)
Zone 3: User pool 4 (1,000 nodes) + User pool 5 (1,000 nodes)
Total: 5,002-5,003 nodes across 3 zones

AZ Considerations

Pod Affinity: Control plane distributes pods across zones
Node Affinity: Workloads can target specific zones
Failure Isolation: Zone failure only affects pods in that zone
Cost: Multiple AZs incur separate subnet costs

10. Comparison: AKS vs EKS vs GKE

Network Architecture

Aspect	AKS	EKS	GKE
Model	VNet + Azure CNI	VPC + EC2 ENI	VPC + Alias IPs
Pod IP Source	Overlay or VNet	Secondary IPs	Alias ranges
Max Pods	200K (Overlay)	250/node × nodes	110/node × nodes
IP Calculation	`2^(32-prefix)`	`ENIs × IPs × prefixes`	`/24` per node
Configuration	Azure CLI/Portal	Terraform/AWS CLI	`gcloud`

Throttling/Scaling

Aspect	AKS	EKS	GKE
Throttling	Token bucket (ARM)	Per-IP rate limiting	Automatic (managed)
Max Nodes	5,000	100,000 (with support)	5,000 (Autopilot)
Max Pods	200,000	Unlimited (node limit)	200,000 (GKE limit)
Node Pools	100 max	Unlimited	Unlimited
Nodes/Pool	1,000 max	Unlimited	Unlimited

Our Implementation Strategy

AKS Focus: Managed VNet integration, CNI overlay, token bucket awareness EKS Focus: ENI prefix delegation, Nitro instances, per-IP rate limiting
GKE Focus: Automatic alias IPs, pod density formulas

All Three Covered:

Tier configurations support all three
Default settings work for all
Over-provisioning makes tuning optional
Documentation addresses platform-specific details

11. Recommended Best Practices for AKS

Cluster Configuration

Use Azure CNI Overlay (default for new clusters)
- Enables 200K pods per cluster
- Recommended by Microsoft for scalability
- No VNet IP pressure
Standard Tier Control Plane (minimum for production)
- Auto-scales with cluster size
- Better API server capacity
- Recommended for all production workloads
Multiple Node Pools (for >1000 nodes)
- Maximum 1,000 nodes per pool
- Distribute across 5 pools for 5,000 nodes
- Enables zone distribution
Managed NAT Gateway (for cluster egress)
- At least 2 public IPs
- Proper outbound scaling
- Better than load balancer for egress

Scaling Operations

Scale in Batches
- For >1000 nodes: scale 500-700 at a time
- 2-5 minute wait between operations
- Prevents Azure API throttling
Monitor During Scale
- API server latency tracking
- Control plane resource usage
- Pod scheduling metrics
Upgrade Considerations
- Cannot upgrade at 5,000 nodes (no surge capacity)
- Scale down to <3,000 nodes before upgrade
- Plan maintenance windows carefully

Network Configuration

Use Managed VNet (most cases)
- Azure manages IP allocation
- Simplifies operations
- Recommended for Hyperscale
Network Policy (for security)
- Azure NPM: Up to 250 nodes only
- For larger clusters: Use Cilium or network segmentation
- Plan policy early
Multi-AZ Deployment (production minimum)
- Zone redundancy
- Automatic pod redistribution
- HA guarantees

12. Tier Compliance Checklist

All Tiers

Professional & Above (Production)

Standard tier control plane recommended
Multi-AZ capable (2-3 zones)
Load-balanced service limits understood (<300)
Network policy considerations noted
Azure throttling limits understood
Scaling batch recommendations included

Enterprise & Hyperscale (Large Scale)

Multiple node pools required
Token bucket throttling accounted for
Node pool distribution strategy documented
Premium tier control plane recommended (Hyperscale)
Azure CNI Overlay mandatory (Hyperscale)
Cluster upgrade limitations noted
Monitoring requirements specified
Cluster services scaling guidelines provided

13. AKS-Specific Algorithms

1. Token Bucket Throttling

Algorithm:

bucket_size = burst_rate
refill_rate = sustained_rate

on_request():
  if bucket_size > 0:
    bucket_size -= 1
    return ACCEPT
  else:
    return REJECT_429

on_tick(time):
  if bucket_size < initial_size:
    bucket_size += refill_rate * elapsed_time

Applied to PUT ManagedCluster:

Burst: 20 requests initially allowed
Sustained: 1 request per minute refill
Scope: Per managed cluster

2. Node IP Allocation

Formula:

Node_Capacity = 2^(32 - subnet_prefix) - 4

Examples:
/26 = 2^6 - 4 = 60 nodes (micro public)
/25 = 2^7 - 4 = 124 nodes (micro private)
/24 = 2^8 - 4 = 252 nodes (standard private)
/23 = 2^9 - 4 = 508 nodes (professional/enterprise)
/21 = 2^11 - 4 = 2,044 nodes (enterprise private)
/20 = 2^12 - 4 = 4,092 nodes (hyperscale private)

3. Pod CIDR Space (Overlay Model)

Formula:

Pod_Addresses = 2^(32 - pod_prefix)
Effective_Pod_Capacity = (Pod_Addresses / efficiency) up to 200K

Examples:
/18 = 2^14 = 16,384 addresses
/16 = 2^16 = 65,536 addresses
/13 = 2^19 = 524,288 addresses
Actual Limit: min(calculated, 200,000)

4. Service CIDR Allocation

Formula:

Service_Addresses = 2^(32 - service_prefix)
Max_Services = 300 per cluster (hard load balancer limit)

/20 (AWS default) = 4,096 addresses (sufficient)
/16 (our default) = 65,536 addresses (16× AWS)

14. Recommended Next Steps

Priority 1 (Ready to Implement)

Add AKS-Specific Algorithms to Docs
- Token bucket explanation with examples
- Node IP formula: 2^(32-prefix) - 4
- Pod CIDR overlay calculations
- Service CIDR allocation
Document Azure CNI Overlay Benefits
- Comparison table: Overlay vs Direct CNI vs Kubenet
- 200K pod support explanation
- VNet IP pressure reduction
Add Scaling Batch Guidance
- Examples of batch sizes (500-700 nodes)
- Timing recommendations (2-5 minute waits)
- API throttling prevention

Priority 2 (Future Enhancement)

Add Multi-Node Pool Calculator
- Accept node count, suggest pool distribution
- Zone distribution recommendations
- AZ affinity suggestions
Create Azure Quota Planning Tool
- Calculate cluster quota requirements
- VM SKU quota calculations
- Multi-cluster subscription planning
Add Cluster Upgrade Analyzer
- Detect when cluster is at max nodes
- Recommend scale-down before upgrade
- Estimate upgrade impact

Priority 3 (Extended Features)

Azure Cost Calculator Integration
- Estimate NAT gateway costs
- Load balancer cost impact
- Multi-AZ premium costs
Network Policy Recommendation Engine
- Suggest NPM (up to 250 nodes) vs Cilium
- Policy complexity estimation
- Performance impact warnings
Multi-Region AKS Planner
- Distribute clusters across regions
- Handle Azure quota limits per region
- Global network topology planning

15. Summary & Verification

Compliance Summary

Status: FULLY COMPLIANT with Azure AKS best practices

Verified Against:

Microsoft Azure AKS Best Practices for Large Workloads
Azure AKS Quotas, SKUs, and Regions documentation
Azure CNI Overlay networking guide
Azure Resource Manager throttling limits
Token bucket algorithm (standard)

Test Coverage

Unit Tests: All 214 tests passing

Subnet calculation verification
CIDR allocation correctness
Multi-pool node distribution

Integration Tests: Design system validated

AZ configurations
RFC 1918 compliance
Tier scaling characteristics

Production Readiness

All tier configurations production-ready
Documentation comprehensive
Token bucket throttling explained
Azure CNI Overlay support documented
Multi-node pool strategy included
Scaling guidelines provided
Upgrade limitations noted

No Configuration Changes Needed

Azure AKS implementation is optimized for realistic deployments:

Hyperscale tier with 3 × /20 private subnets supports 5,000-node clusters
Pod CIDR /13 supports 200K+ pods (AKS overlay limit)
Service CIDR /16 provides ample space
All tier configurations validated for Azure compatibility

Current Tier Configuration Summary:

Micro: 1 × /26 public, 1 × /25 private, /24 min VPC
Standard: 1 × /25 public, 1 × /24 private, /23 min VPC
Professional: 2 × /25 public, 2 × /23 private, /21 min VPC
Enterprise: 3 × /24 public, 3 × /21 private, /18 min VPC
Hyperscale: 3 × /23 public, 3 × /20 private, /18 min VPC, /13 pods

16. Files Modified

Documentation Files Created/Updated:

AKS_COMPLIANCE_AUDIT.md (UPDATED February 4, 2026)
- This comprehensive audit document
- 16 sections covering all aspects
- Updated with differentiated public/private subnet sizes
- Ready for reference in Azure documentation
.github/copilot-instructions.md (UPDATE PENDING)
- Add "AKS Compliance & IP Calculation Formulas" section
- Document token bucket algorithm
- Include Azure CNI Overlay architecture
- Add multi-node pool strategies
- Include throttling prevention guidelines
shared/kubernetes-schema.ts (NO CHANGES NEEDED)
- Configuration already AKS-compliant
- All tiers support Azure requirements

Optional Enhancements

These optional configurations can improve scalability and observability for large-scale AKS deployments:

Azure NAT Gateway Scaling Considerations

Azure NAT Gateway Specifications (Azure documentation):

SNAT Port Allocation: 64,512 ports per public IP address
Port Allocation: Dynamic, shared across all VMs in subnet
Public IPs per NAT Gateway: 1-16 public IPs (1,032,192 total ports max)
Connection Limit: 1 million concurrent connections per NAT Gateway
Documentation: Azure NAT Gateway resource limits

Scaling Recommendations:

Hyperscale Tier (5000 nodes @ 200K pods):
- NAT Gateway Configuration:
  - Attach NAT Gateway to all private subnets (8 subnets)
  - Allocate 4-8 public IPs per NAT Gateway (256K-512K ports)
  - Monitor SNAT port usage with Azure Monitor
- Monitoring:
```
az monitor metrics list \
  --resource <nat-gateway-resource-id> \
  --metric "SNATConnectionCount" \
  --aggregation Average
```
Enterprise Tier (50 nodes):
- Single public IP per NAT Gateway sufficient
- 64,512 ports supports ~10K concurrent connections
- Standard SKU (default)
Monitoring:
- Metric: SNATConnectionCount (active connections)
- Metric: ByteCount (throughput)
- Alert on: SNATConnectionCount > 50000 (port exhaustion warning)

Azure Load Balancer Integration

AKS Load Balancer Options:

Azure Load Balancer Standard: Layer 4, Service type LoadBalancer
Azure Application Gateway: Layer 7, Ingress controller (AGIC)
Internal Load Balancer: Private subnet traffic only

Subnet Requirements (from AKS documentation):

Primary Subnet (Nodes): Must support all node IPs
Pod Overlay CIDR: Separate range (10.244.0.0/16 default for kubenet)
Service CIDR: 10.0.0.0/16 default (internal cluster networking)
Load Balancer IPs: Allocated from Azure public IP pool or private subnet

Azure CNI Overlay (recommended for large clusters):

Separates pod IPs from VNet (no VNet IP exhaustion)
Supports 200,000 pods per cluster (vs 50K for direct CNI)
Nodes use VNet IPs, pods use overlay network
Better scalability for hyperscale deployments

Official Documentation References

AKS Networking:

Azure Virtual Network:

AKS Best Practices:

References

Microsoft Azure Documentation:

Related Audit Documents:

GKE_COMPLIANCE_AUDIT.md - Google Kubernetes Engine compliance
EKS_COMPLIANCE_AUDIT.md - AWS Elastic Kubernetes Service compliance

Document Status: Complete and ready for reference
Last Updated: February 4, 2026
Compliance Level: Production Ready

FilesExpand file tree

AKS_COMPLIANCE_AUDIT.md

Latest commit

History

AKS_COMPLIANCE_AUDIT.md

File metadata and controls

AKS Compliance Audit - Kubernetes Network Planning API

1. Executive Summary

0.1. Azure Region and Availability Zone Naming Standards

Region Naming Convention

Azure Regions by Geography (Programmatic Names)

Availability Zone Naming Convention

Retrieving Region Names Programmatically

API Implementation

0. IPv4 Address Consumption Model (Azure CNI Overlay)

Network Architecture Overview

IPv4 Allocation by Component

1. Node IP Allocation (Primary VNet Subnet)

2. Pod IP Allocation (Overlay CIDR)

3. Service IP Allocation (Service CIDR)

4. LoadBalancer IP Allocation (External Azure Resources)

IP Exhaustion Risk Analysis

Comparison to EKS and GKE

Key Takeaways for AKS

2. Subnet Overlap Validation

Non-Overlapping Subnet Guarantee

3. AKS Cluster Scalability Limits

Documented Limits (Per AKS Quotas & Best Practices)

Scaling Guidance from Microsoft

2.1. Azure NAT Gateway & Outbound Connectivity

Overview

SNAT Port Allocation Formula

Hyperscale Tier Examples (5,000 Nodes)

Token Bucket Throttling (API Rate Limiting)

Azure NAT Gateway Quotas

Load Balancer IP Consumption

3. Azure CNI Architecture

Networking Models Supported

IP Allocation Formula - Azure CNI Overlay

Node Capacity Per Pool Formula

4. Azure Throttling Algorithm

Token Bucket Throttling

AKS Resource Provider Throttling Limits

Comparison: Token Bucket vs Rate Limiting

5. Tier-by-Tier AKS Compliance Analysis

Micro Tier (Development/PoC)

Standard Tier (Development/Testing)

Professional Tier (Small Production)

Enterprise Tier (Large Production)

Hyperscale Tier (Global/Extreme Scale)

6. Azure Networking Model

VNet Integration

Azure CNI Overlay Model

Managed VNet vs Custom VNet

7. AKS Resource Quotas

Subscription-Level Quotas

Cluster-Level Quotas

Quota Enforcement Timeline

8. RFC 1918 Private Address Space Compliance

Supported Private Ranges

Allocation Strategy (Example)

Azure Compliance

9. Multi-Availability Zone Deployment

AZs and Cluster Tiers

Node Pool Distribution

AZ Considerations

10. Comparison: AKS vs EKS vs GKE

Network Architecture

Throttling/Scaling

Our Implementation Strategy

11. Recommended Best Practices for AKS

Cluster Configuration

Scaling Operations

Network Configuration

12. Tier Compliance Checklist

All Tiers

Professional & Above (Production)

Enterprise & Hyperscale (Large Scale)

13. AKS-Specific Algorithms

1. Token Bucket Throttling