Merged
24 commits
2f6a1f8
feat(server,k8s): implement pause/resume with rootfs snapshot support
fengcone Apr 8, 2026
c90d969
fix(server,k8s): clean up field naming and lint issues
fengcone Apr 8, 2026
b4930b5
refactor(server,k8s): redesign SandboxSnapshot spec/status boundary a…
fengcone Apr 12, 2026
3b98183
docs(server,k8s): update doc for pause and resume feature
fengcone Apr 12, 2026
3a13c3a
Merge branch 'main' into feature/public-k8s-pause-resume
fengcone Apr 12, 2026
486658b
fix(server): return workload state when snapshot fails but sandbox st…
fengcone Apr 12, 2026
6e302bc
fix(server,k8s): add sandboxsnapshots RBAC and surface resume failures
fengcone Apr 12, 2026
17d7008
fix(image-committer): replace crictl with nerdctl for container disco…
fengcone Apr 12, 2026
7fa689d
fix(k8s): surface resume failures and fix paused sandbox image URI
fengcone Apr 12, 2026
dd1876a
fix(controller): requeue on transient API errors in validatePauseSpec
fengcone Apr 12, 2026
f86fd2a
fix(server): include full pause config in re-pause snapshot patch
fengcone Apr 12, 2026
7474fa0
fix(server,k8s): copy user labels to snapshot and verify full label i…
fengcone Apr 13, 2026
9efe77c
docs(k8s): complete pause/resume state machine and add Chinese docume…
fengcone Apr 13, 2026
6f6c231
feat(server): expose intermediate pause/resume states with detailed r…
fengcone Apr 13, 2026
38bee80
fix(kubernetes): replace crictl with nerdctl
fengcone Apr 13, 2026
dc5902e
feat(kubernetes): implement snapshot-based pause/resume lifecycle
fengcone Apr 23, 2026
40f9230
Merge branch 'main' into feature/public-k8s-pause-resume
fengcone Apr 23, 2026
eebfd43
fix(kubernetes): stabilize Kubernetes pooled pause-resume flow
fengcone Apr 24, 2026
c00d9f6
Merge branch 'main' into feature/public-k8s-pause-resume
fengcone Apr 24, 2026
89e1743
fix(kubernetes): stabilize Kubernetes pooled pause-resume flow
fengcone Apr 24, 2026
1806da1
fix(kubernetes): harden snapshot commit jobs
fengcone Apr 24, 2026
77e137e
fix(kubernetes): harden pause/resume snapshot lifecycle
fengcone Apr 27, 2026
7e33f90
fix(kubernetes): harden pause/resume snapshot flow
fengcone Apr 28, 2026
8d0d3fa
Merge branch 'main' into feature/public-k8s-pause-resume
fengcone Apr 28, 2026
436 changes: 436 additions & 0 deletions docs/pause-resume.md

Large diffs are not rendered by default.

71 changes: 71 additions & 0 deletions kubernetes/AGENTS.md
@@ -0,0 +1,71 @@
# Kubernetes Operator

## Overview

Kubernetes operator managing sandbox environments via custom resources. Provides BatchSandbox (O(1) batch delivery), Pool (resource pooling for fast provisioning), and optional task orchestration. Built with controller-runtime (Kubebuilder).

## Structure

```
kubernetes/
├── apis/sandbox/v1alpha1/ # CRD type definitions
│ ├── batchsandbox_types.go # BatchSandbox spec + status
│ ├── pool_types.go # Pool spec + status
│ └── sandboxsnapshot_types.go
├── cmd/
│ ├── controller/main.go # Controller manager entry point
│ ├── image-committer/main.go # Image committer binary (runs as commit Job)
│ └── task-executor/main.go # Task executor binary (runs as sidecar)
├── internal/
│ ├── controller/ # Reconciliation loops
│ ├── scheduler/ # Pool allocation logic (bufferMin/Max, poolMax)
│ └── utils/ # Utility functions
├── config/
│ ├── crd/bases/ # Generated CRD YAML manifests
│ ├── rbac/ # ClusterRole, ClusterRoleBinding
│ ├── manager/ # Controller deployment manifest
│ └── samples/ # Example CRD instances
├── charts/ # Helm charts (opensandbox-controller, opensandbox-server, opensandbox)
├── test/e2e/ # End-to-end tests + testdata
├── Dockerfile # Controller image build
└── Dockerfile.image-committer # Image-committer image build
```

## Where to Look

| Task | File | Notes |
|------|------|-------|
| Add CRD field | `apis/sandbox/v1alpha1/*_types.go` | Run `make install` to update CRDs |
| Controller logic | `internal/controller/` | BatchSandbox + Pool reconciliation |
| Pool allocation | `internal/scheduler/` | Buffer management, sandbox→pool assignment |
| Task execution | `cmd/task-executor/`, `internal/task-executor/` | Process-based tasks in sandboxes |
| Helm values | `charts/opensandbox-controller/values.yaml` | Controller + task-executor image refs |
| RBAC permissions | `config/rbac/` | ClusterRole rules |
| E2E tests | `test/e2e/` | Ginkgo/Gomega test framework |

## Conventions

- **Framework**: Kubebuilder with `controller-runtime` v0.21.
- **Go version**: 1.24. Own `go.mod` (`github.com/alibaba/opensandbox/sandbox-k8s`).
- **Concurrency**: BatchSandbox controller concurrency=32, Pool controller concurrency=1.
- **CRD version**: `v1alpha1` under group `sandbox.opensandbox.io`.
- **Helm charts**: Umbrella chart (`opensandbox`) wraps controller + server subcharts.
- **Logging**: `klog/v2` + `zap`. Log level configurable via `--zap-log-level` flag.

## Anti-Patterns

- `pause`/`resume` lifecycle uses SandboxSnapshot CRD + image-committer Job to snapshot and restore containers.
- BatchSandbox deletion waits for running tasks to terminate before removing the resource.
- Task-executor requires `shareProcessNamespace: true` and `SYS_PTRACE` capability in pod spec.
- Pool template changes do not affect already-allocated sandboxes.

## Commands

```bash
make install # install CRDs into cluster
make deploy CONTROLLER_IMG=... TASK_EXECUTOR_IMG=... # deploy controller
make docker-build # build controller image
make docker-build-task-executor # build task-executor image
make docker-build-image-committer # build image-committer image
make test # run tests
```
56 changes: 56 additions & 0 deletions kubernetes/Dockerfile.image-committer
@@ -0,0 +1,56 @@
# Copyright 2025 Alibaba Group Holding Ltd.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Build stage
FROM golang:1.24-alpine AS builder

# Use Aliyun mirror for faster downloads in China
RUN sed -i 's/dl-cdn.alpinelinux.org/mirrors.aliyun.com/g' /etc/apk/repositories

WORKDIR /workspace

# Copy go mod files
COPY go.mod go.sum ./
RUN GOPROXY=https://goproxy.cn,direct go mod download

# Copy source code
COPY cmd/image-committer/ cmd/image-committer/

# Build binary
RUN CGO_ENABLED=0 GOOS=linux go build -o /usr/local/bin/image-committer ./cmd/image-committer/

# Runtime stage
FROM alpine:3.19

# Use Aliyun mirror for faster downloads in China
RUN sed -i 's/dl-cdn.alpinelinux.org/mirrors.aliyun.com/g' /etc/apk/repositories

# Install containerd CLI tools
RUN apk add --no-cache \
containerd-ctr \
cri-tools \
curl \
jq \
nerdctl

# Create directories for socket mounts
RUN mkdir -p /var/run/containerd /run/k8s/containerd

# Copy the built binary from builder stage
COPY --from=builder /usr/local/bin/image-committer /usr/local/bin/image-committer
RUN chmod +x /usr/local/bin/image-committer

WORKDIR /workspace

ENTRYPOINT ["/usr/local/bin/image-committer"]
12 changes: 11 additions & 1 deletion kubernetes/Makefile
@@ -54,6 +54,8 @@ OPERATOR_SDK_VERSION ?= v1.42.0
CONTROLLER_IMG ?= controller:dev
# TASK_EXECUTOR_IMG defines the image for the task-executor service.
TASK_EXECUTOR_IMG ?= task-executor:dev
# IMAGE_COMMITTER_IMG defines the image for the image-committer service.
IMAGE_COMMITTER_IMG ?= image-committer:dev

# Get the currently used golang install path (in GOPATH/bin, unless GOBIN is set)
ifeq (,$(shell go env GOBIN))
@@ -122,7 +124,7 @@ test: manifests generate fmt vet setup-envtest ## Run tests.
# To use a different vendor for e2e tests, modify the setup under 'tests/e2e'.
# The default setup assumes Kind is pre-installed and builds/loads the Manager Docker image locally.
KIND_CLUSTER ?= sandbox-k8s-test-e2e
-KIND_K8S_VERSION ?= v1.22.4
+KIND_K8S_VERSION ?= v1.27.3
GINKGO_ARGS ?=
E2E_TIMEOUT ?= 30m

@@ -279,6 +281,14 @@ docker-build-controller: ## Build docker image with the manager.
docker-build-task-executor: ## Build docker image with task-executor.
	$(CONTAINER_TOOL) build $(DOCKER_BUILD_ARGS) --build-arg PACKAGE=cmd/task-executor/main.go --build-arg USERID=0 -t ${TASK_EXECUTOR_IMG} .

.PHONY: docker-build-image-committer
docker-build-image-committer: ## Build docker image for image commit operations.
	$(CONTAINER_TOOL) build $(DOCKER_BUILD_ARGS) -f Dockerfile.image-committer -t ${IMAGE_COMMITTER_IMG} .

.PHONY: docker-push-image-committer
docker-push-image-committer: ## Push docker image for image-committer.
	$(CONTAINER_TOOL) push ${IMAGE_COMMITTER_IMG}

.PHONY: docker-push
# docker-push: ## Push docker image with the manager.
# $(CONTAINER_TOOL) push ${CONTROLLER_IMG}
111 changes: 108 additions & 3 deletions kubernetes/README.md
@@ -10,6 +10,7 @@ OpenSandbox Kubernetes Controller is a Kubernetes operator that manages sandbox
- **Batch and Individual Delivery**: Support both single sandbox (for real-user interactions) and batch sandbox delivery (for high-throughput agentic-RL scenarios)
- **Optional Task Scheduling**: Integrated task orchestration with optional shard task templates for heterogeneous task distribution and customized sandbox delivery (e.g., process injection)
- **Resource Pooling**: Maintain pre-warmed resource pools for rapid sandbox provisioning
- **Pause and Resume**: Persist sandbox filesystem state via rootfs snapshots, releasing cluster resources between sessions
- **Comprehensive Monitoring**: Real-time status tracking of sandboxes and tasks

## Features
@@ -57,11 +58,115 @@ Intelligent resource management features:
- Pool-wide capacity limits to prevent resource exhaustion
- Automatic scaling based on demand

## Pause and Resume (Rootfs Snapshot)

OpenSandbox supports **pause and resume** for Kubernetes sandboxes by persisting the container root filesystem as an OCI image.

```text
Time ---------------------------------------------------------------->

Sandbox lifecycle: [Running]--[Pausing]--[Paused]--[Resuming]--[Running]
                                  |                    |
                            commit rootfs      create new BatchSandbox
                            push to registry   from snapshot image
                            delete BatchSandbox
```

### How it works

1. **Pause**: The server creates a `SandboxSnapshot` CR. The controller creates a commit Job on the same node, commits the container rootfs, and pushes it to the configured OCI registry. After the snapshot is ready, the source `BatchSandbox` (and its Pod) is deleted to release cluster resources.
2. **Resume**: The server sets `action: Resume` on the `SandboxSnapshot`. The controller creates a new `BatchSandbox` from the snapshot image, restoring the filesystem state. The public `sandboxId` remains stable across pause/resume cycles.
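
The lifecycle in the diagram above can be read as a small state machine. The following is a minimal sketch, not the controller's actual code: the `State` type, the `next` table, and `CanTransition` are hypothetical names introduced for illustration, and failure paths are left out.

```go
package main

import "fmt"

// State is a sandbox lifecycle state as drawn in the diagram (hypothetical type).
type State string

const (
	Running  State = "Running"
	Pausing  State = "Pausing"
	Paused   State = "Paused"
	Resuming State = "Resuming"
)

// next holds the single forward transition the diagram shows for each state.
// Failure states are deliberately omitted from this sketch.
var next = map[State]State{
	Running:  Pausing,  // pause requested: SandboxSnapshot CR created
	Pausing:  Paused,   // rootfs committed and pushed, BatchSandbox deleted
	Paused:   Resuming, // resume requested on the SandboxSnapshot
	Resuming: Running,  // new BatchSandbox created from the snapshot image
}

// CanTransition reports whether moving from one state to another follows the diagram.
func CanTransition(from, to State) bool {
	want, ok := next[from]
	return ok && want == to
}

func main() {
	s := Running
	for i := 0; i < len(next); i++ {
		fmt.Printf("%s -> %s\n", s, next[s])
		s = next[s]
	}
	// Skipping the Pausing step is not a valid move in this model.
	fmt.Println("Running -> Paused allowed:", CanTransition(Running, Paused))
}
```

A table of allowed transitions keeps the check in one place, so a caller can reject an out-of-order request (for example resuming a sandbox that is still running) before touching any cluster resource.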

### The SandboxSnapshot CRD

The `SandboxSnapshot` CR is the central resource for the pause/resume lifecycle:

| Field | Location | Description |
|-------|----------|-------------|
| `spec.sandboxId` | Spec | Target sandbox identifier |
| `spec.sourceBatchSandboxName` | Spec | Source BatchSandbox to snapshot |
| `spec.action` | Spec | `Pause` or `Resume` |
| `spec.snapshotRegistry` | Spec | OCI registry prefix (filled by Server from `[pause]` config) |
| `spec.snapshotPushSecret` | Spec | Secret name for pushing (filled by Server) |
| `spec.resumeImagePullSecret` | Spec | Secret name for pulling on resume (filled by Server) |
| `status.phase` | Status | `Pending` → `Committing` → `Ready` / `Failed` |
| `status.containerSnapshots` | Status | Committed image URIs per container |
| `status.sourcePodName` | Status | Pod name resolved by controller |
| `status.history` | Status | Audit log of pause/resume actions (last 10) |

### Prerequisites

1. **OCI Registry**: An accessible container registry for storing snapshot images.
2. **Kubernetes Secrets**: Docker config secrets for push and pull access.
3. **Server configuration**: Set the `[pause]` section in `~/.sandbox.toml` (see [Server configuration](../server/configuration.md#pause--kubernetes-only)).
4. **Controller RBAC**: The controller requires `secrets: get` permission (included in the Helm chart and `make manifests` output).

### Controller Configuration

The snapshot controller supports the following command-line flags:

| Flag | Default | Description |
|------|---------|-------------|
| `--image-committer-image` | `image-committer:dev` | Image used for commit operations (must contain container runtime CLI tools such as `ctr` and `nerdctl`) |
| `--commit-job-timeout` | `10m` | Timeout duration for commit jobs |

These flags are configured at controller startup. The `image-committer-image` must be a container image with container runtime tools (e.g., `ctr`, `nerdctl`) to perform rootfs commit and push operations.

**Helm configuration:**

```sh
helm install opensandbox-controller ./charts/opensandbox-controller \
--set controller.snapshot.imageCommitterImage=<your-registry>/image-committer:v1.0.0 \
--set controller.snapshot.commitJobTimeout=15m
```

**Kustomize configuration:**

```sh
make deploy CONTROLLER_IMG=<controller-image> \
IMAGE_COMMITTER_IMAGE=<your-registry>/image-committer:v1.0.0 \
COMMIT_JOB_TIMEOUT=15m
```

### Quick setup

```bash
# Create push secret
kubectl create secret docker-registry registry-push-secret \
--docker-server=<your-registry> \
--docker-username=<user> \
--docker-password=<token>

# Create pull secret (can reuse push secret)
kubectl create secret docker-registry registry-pull-secret \
--docker-server=<your-registry> \
--docker-username=<user> \
--docker-password=<token>
```

Server config (`~/.sandbox.toml`):

```toml
[pause]
snapshot_registry = "<your-registry>/sandboxes"
snapshot_push_secret = "registry-push-secret"
resume_pull_secret = "registry-pull-secret"
```

### CRD cleanup

To remove SandboxSnapshot CRDs when uninstalling:

```bash
kubectl delete crd sandboxsnapshots.sandbox.opensandbox.io
```

For a complete guide including troubleshooting and failure scenarios, see [`docs/pause-resume.md`](../docs/pause-resume.md).

## Runtime API Support Notes

-- `pause` / `resume` lifecycle APIs are currently **NOT SUPPORTED** by the Kubernetes runtime.
-- Calling these APIs against Kubernetes runtime returns `501 Not Implemented`.
-- Pause/resume semantics in OpenSandbox mean preserving in-memory process state (container-level suspend/resume). Kubernetes provider currently focuses on create/get/list/delete/renew workflows.
+- `pause` / `resume` lifecycle APIs are supported on Kubernetes runtime via rootfs snapshot. See [Pause and Resume](#pause-and-resume-rootfs-snapshot) above.
+- Docker runtime supports cgroup-level freeze (`pause`/`resume`) but does not persist filesystem state across restarts.


## Relationship with [kubernetes-sigs/agent-sandbox](https://github.com/kubernetes-sigs/agent-sandbox)