Commit cb57660

feat(crash): rewrite crash harness as controller/worker with failpoint hooks (#4)
Replace the simple kill-loop orchestrator/writer with a structured controller/worker crash harness. The controller pre-generates a deterministic workload from a seed, feeds operations one at a time over a binary-safe NDJSON protocol, kills the worker at controlled points, and verifies recovered state against acknowledged operations.

Add internal/crashhook package for crash and fault injection hooks activated by environment variables. Wire seven hook sites into the engine's WAL and flush paths, each marked with a grep-able // FAILPOINT: tag.

Key changes:
- cmd/crash: new run/replay/worker subcommands, artifact model, oracle-based verification, deterministic workload generator
- internal/crashhook: CrashIfArmed and MaybeFault framework
- engine/db.go: integrate failpoints at wal_after_append, wal_sync_error, wal_after_sync, sst_write_error, sst_publish_error, flush_after_publish, flush_after_file_sync
- engine/crashhook_test.go: engine-level subprocess tests for hooks
- Makefile: rename crash-test to crash-check, use new subcommand
- docs/testing.md: document harness and crashhook layer
1 parent c621cf9 commit cb57660

17 files changed

Lines changed: 2361 additions & 262 deletions

Makefile

Lines changed: 18 additions & 18 deletions
@@ -1,4 +1,4 @@
-.PHONY: all build test coverage lint fmt-check fmt clean check examples crash-test help
+.PHONY: all build test coverage lint fmt-check fmt clean check examples crash-check help

 CYCLES ?= 100

@@ -48,25 +48,25 @@ fmt:
 ## check: Runs fmt-check, lint and test
 check: fmt-check lint test

-## crash-test: Run crash harness in writer+orchestrator modes ($(CYCLES) cycles) using /tmp workspace
-crash-test:
+## crash-check: Run the controller/worker crash harness ($(CYCLES) cycles) with a temporary workspace
+crash-check:
 	@set -eu; \
 	tmpdir=$$(mktemp -d /tmp/beachdb-crash.XXXXXX); \
-	trap 'rm -rf "$$tmpdir"' EXIT INT TERM; \
-	writer_db="$$tmpdir/writer-db"; \
-	orchestrator_db="$$tmpdir/orchestrator-db"; \
-	mkdir -p "$$writer_db" "$$orchestrator_db"; \
-	echo "Running writer mode crash loop ($(CYCLES) cycles) in $$writer_db"; \
-	for cycle in $$(seq 1 $(CYCLES)); do \
-		state_file="$$tmpdir/writer-state-$$cycle.txt"; \
-		go run ./cmd/crash --mode=writer --dbdir="$$writer_db" --state="$$state_file" >/dev/null 2>&1 & \
-		pid=$$!; \
-		sleep 0.05; \
-		kill -9 $$pid >/dev/null 2>&1 || true; \
-		wait $$pid >/dev/null 2>&1 || true; \
-	done; \
-	echo "Running orchestrator mode crash loop ($(CYCLES) cycles) in $$orchestrator_db"; \
-	go run ./cmd/crash --mode=orchestrator --dbdir="$$orchestrator_db" --cycles=$(CYCLES)
+	dbdir="$$tmpdir/db"; \
+	artdir="$$tmpdir/artifacts"; \
+	echo "Running crash harness ($(CYCLES) cycles) in $$dbdir"; \
+	echo ""; \
+	go run ./cmd/crash run \
+		--dbdir="$$dbdir" \
+		--artifact-dir="$$artdir" \
+		--cycles=$(CYCLES) \
+		--seed=777 \
+		--ops=64 \
+		--min-delay-ms=10 \
+		--max-delay-ms=30 \
+		--verify-every-cycle=true; \
+	echo ""; \
+	echo "Crash run data kept in "$$tmpdir". Delete it when done: 'rm -rf $$tmpdir'";

 ## clean: Remove build artifacts and test output
 clean:

cmd/crash/README.md

Lines changed: 90 additions & 16 deletions
@@ -1,25 +1,99 @@
 # crash

-Tests database durability by spawning writer subprocesses and killing them with SIGKILL at random intervals.
+`cmd/crash` is BeachDB's controller/worker crash harness.

-## Usage
+It generates a deterministic workload, feeds one operation at a time to a
+worker subprocess, kills that subprocess at controlled points, reopens the
+database after every cycle, and verifies that recovered state is consistent
+with the set of acknowledged operations.
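
Reviewer note: the deterministic workload described above can be illustrated with a seeded generator. This is a minimal sketch, not the harness's generator: the `op` type, the `generate` function, the 25% delete ratio, and the 16-key space are all assumptions made for illustration. The only point it demonstrates is that a workload derived purely from a seed is reproducible.

```go
package main

import (
	"fmt"
	"math/rand"
)

// op is a hypothetical, simplified operation. The harness's real
// operation type is not shown in this part of the diff.
type op struct {
	ID   int
	Kind string // "put" or "delete"
	Key  string
}

// generate derives n operations from the seed alone, so the same seed
// always yields the same sequence.
func generate(seed uint64, n int) []op {
	rng := rand.New(rand.NewSource(int64(seed)))
	ops := make([]op, n)
	for i := range ops {
		kind := "put"
		if rng.Intn(100) < 25 { // assumed 25% delete ratio
			kind = "delete"
		}
		ops[i] = op{ID: i + 1, Kind: kind, Key: fmt.Sprintf("key-%02d", rng.Intn(16))}
	}
	return ops
}

func main() {
	a, b := generate(777, 64), generate(777, 64)
	same := len(a) == len(b)
	for i := range a {
		same = same && a[i] == b[i]
	}
	fmt.Println("identical workloads:", same)
}
```

Because the generator owns a private `rand.Rand` rather than using the package-level source, nothing else in the process can perturb the sequence.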

-```bash
-# Run crash test with 50 cycles (default)
-./bin/crash --dbdir=/tmp/crashtest
+## Subcommands

-# Customize cycle count and kill timing
-./bin/crash --dbdir=/tmp/crashtest --cycles=100 --min-delay=5 --max-delay=50
+### `crash run`

-# Run writer subprocess manually (for debugging)
-./bin/crash --mode=writer --dbdir=/tmp/testdb --state=/tmp/state.txt
-```
+Runs a new crash harness session.

-Example output:
+```bash
+./bin/crash run \
+  --dbdir=/tmp/beachdb-crash-db \
+  --artifact-dir=/tmp/beachdb-crash-artifacts \
+  --cycles=100 \
+  --seed=777 \
+  --ops=500
 ```
-2026/02/08 15:03:32 Starting crash orchestrator: 30 cycles
-2026/02/08 15:03:32 Cycle 0: spawning writer subprocess
-2026/02/08 15:03:32 Cycle 0: killing subprocess with SIGKILL after 47ms
-...
-2026/02/08 15:03:47 Results: 70 recovered, 10 lost out of 80 total
+
+Important flags:
+
+- `--dbdir`: required, must be missing or empty
+- `--artifact-dir`: required, stores the run artifact JSON
+- `--cycles`: number of worker crash/reopen cycles
+- `--seed`: deterministic workload and kill-schedule seed
+- `--ops`: number of logical operations in the generated workload
+- `--profile=ci|full`: deterministic preset profiles
+- `--crash-point`: internal crash point to arm for phase-2 testing
+- `--fault-point`: internal fault point to arm for phase-2 testing
+
+### `crash replay`
+
+Replays a recorded artifact with the same workload and deterministic schedule.
+
+```bash
+./bin/crash replay \
+  --artifact=/tmp/beachdb-crash-artifacts/crash-20260419T213015.123Z.json \
+  --dbdir=/tmp/beachdb-crash-replay-db
 ```
+
+### `crash worker`
+
+Internal subprocess used by `run` and `replay`. It is not intended to be used
+manually.
+
+## Artifact model
+
+Each run writes a JSON artifact containing:
+
+- the run configuration
+- the deterministic seed
+- the full generated workload
+- the ordered worker event stream
+- per-cycle worker metadata and verification results
+- the last acknowledged op ID
+- the first verification failure, if any
+
+Keys and values are base64-encoded so binary payloads survive round-trip
+without newline or text parsing issues.
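
Reviewer note: to make the base64 point concrete, here is a small sketch of how a binary key and value can travel as one NDJSON line. The `wireOp` struct and its field names are hypothetical; the actual `operationMessage` layout is not part of this excerpt. Keys containing `\n` or `\x00` would otherwise be awkward to carry in a newline-delimited text protocol.

```go
package main

import (
	"encoding/base64"
	"encoding/json"
	"fmt"
)

// wireOp is a hypothetical wire shape used only for this illustration.
type wireOp struct {
	ID    int    `json:"id"`
	Kind  string `json:"kind"`
	Key   string `json:"key"`             // base64 of the raw key bytes
	Value string `json:"value,omitempty"` // base64 of the raw value bytes
}

// encodeLine renders one operation as a single newline-terminated JSON line.
func encodeLine(id int, kind string, key, value []byte) ([]byte, error) {
	line, err := json.Marshal(wireOp{
		ID:    id,
		Kind:  kind,
		Key:   base64.StdEncoding.EncodeToString(key),
		Value: base64.StdEncoding.EncodeToString(value),
	})
	if err != nil {
		return nil, err
	}
	return append(line, '\n'), nil
}

// decodeLine recovers the raw key and value bytes from one line.
func decodeLine(line []byte) (key, value []byte, err error) {
	var msg wireOp
	if err = json.Unmarshal(line, &msg); err != nil {
		return nil, nil, err
	}
	if key, err = base64.StdEncoding.DecodeString(msg.Key); err != nil {
		return nil, nil, err
	}
	if value, err = base64.StdEncoding.DecodeString(msg.Value); err != nil {
		return nil, nil, err
	}
	return key, value, nil
}

func main() {
	line, err := encodeLine(1, "put", []byte("k\n\x00"), []byte{0xff, '\n'})
	if err != nil {
		panic(err)
	}
	fmt.Print(string(line))
	key, value, err := decodeLine(line)
	fmt.Printf("%q %q %v\n", key, value, err)
}
```

Go's `encoding/json` already base64-encodes `[]byte` fields on its own; the explicit calls here only make the encoding step visible.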
+
+## Durability contract
+
+The worker emits:
+
+- `ready` after opening the DB
+- `start` immediately before executing an operation
+- `ack` only after the DB call succeeds
+- `fail` if the DB call returns an error
+
+The controller treats:
+
+- `ack`ed operations as required after recovery
+- the single `start`ed-but-not-`ack`ed operation as indeterminate
+- never-started operations as irrelevant to that cycle
+
+Per-cycle verification therefore checks that recovered state matches either:
+
+1. all acknowledged operations only, or
+2. all acknowledged operations plus the one in-flight idempotent operation
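
Reviewer note: the two accepted outcomes above can be sketched as a tiny oracle. This is an illustration under simplifying assumptions (string keys and values, only put and delete, state rebuilt from an empty map), and the names `op`, `apply`, and `verify` are hypothetical rather than the harness's own.

```go
package main

import "fmt"

// op is a hypothetical simplified operation: a put, or a delete when Delete is set.
type op struct {
	Delete bool
	Key    string
	Value  string
}

// apply mutates state the way the database is expected to.
func apply(state map[string]string, o op) {
	if o.Delete {
		delete(state, o.Key)
		return
	}
	state[o.Key] = o.Value
}

// sameState reports whether two key/value maps hold identical contents.
func sameState(a, b map[string]string) bool {
	if len(a) != len(b) {
		return false
	}
	for k, v := range a {
		if got, ok := b[k]; !ok || got != v {
			return false
		}
	}
	return true
}

// verify accepts recovered state if it equals the acked operations alone,
// or the acked operations followed by the single in-flight operation.
func verify(recovered map[string]string, acked []op, inFlight *op) bool {
	expected := make(map[string]string)
	for _, o := range acked {
		apply(expected, o)
	}
	if sameState(recovered, expected) {
		return true
	}
	if inFlight == nil {
		return false
	}
	apply(expected, *inFlight)
	return sameState(recovered, expected)
}

func main() {
	acked := []op{{Key: "a", Value: "1"}, {Key: "b", Value: "2"}, {Delete: true, Key: "a"}}
	inFlight := &op{Key: "c", Value: "3"}
	fmt.Println(verify(map[string]string{"b": "2"}, acked, inFlight))           // acked only
	fmt.Println(verify(map[string]string{"b": "2", "c": "3"}, acked, inFlight)) // acked plus in-flight
	fmt.Println(verify(map[string]string{"c": "3"}, acked, inFlight))           // an acked write was lost
}
```

The in-flight operation is allowed to be either fully present or fully absent, but an acknowledged operation is never allowed to be missing.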
+
+## Internal crash and fault points
+
+Phase-2 hook points are internal-only and activated by environment variables
+that the controller passes to the worker:
+
+- crash points:
+  - `wal_after_append`
+  - `wal_after_sync`
+  - `flush_after_file_sync`
+  - `flush_after_publish`
+- fault points:
+  - `wal_sync_error`
+  - `sst_write_error`
+  - `sst_publish_error`

cmd/crash/artifact.go

Lines changed: 157 additions & 0 deletions
@@ -0,0 +1,157 @@
+package main
+
+import (
+	"encoding/json"
+	"fmt"
+	"os"
+	"path/filepath"
+	"time"
+)
+
+// artifactVersion is the JSON schema version for persisted crash artifacts.
+const artifactVersion = 1
+
+// artifact captures a full run or replay transcript for deterministic analysis.
+type artifact struct {
+	Version       int                  `json:"version"`
+	Config        artifactConfig       `json:"config"`
+	Seed          uint64               `json:"seed"`
+	Ops           []operationMessage   `json:"ops"`
+	Cycles        []artifactCycle      `json:"cycles"`
+	Events        []artifactEvent      `json:"events"`
+	LastAckedOpID int                  `json:"last_acked_op_id,omitempty"`
+	Failure       *verificationFailure `json:"failure,omitempty"`
+}
+
+// artifactConfig records the effective run configuration embedded in an artifact.
+type artifactConfig struct {
+	DBDir            string `json:"dbdir"`
+	Cycles           int    `json:"cycles"`
+	MinDelayMS       int    `json:"min_delay_ms"`
+	MaxDelayMS       int    `json:"max_delay_ms"`
+	Ops              int    `json:"ops"`
+	PutRatio         int    `json:"put_ratio"`
+	DeleteRatio      int    `json:"delete_ratio"`
+	BatchRatio       int    `json:"batch_ratio"`
+	MaxKeyLen        int    `json:"max_key_len"`
+	MaxValueLen      int    `json:"max_value_len"`
+	HotKeyRatio      int    `json:"hot_key_ratio"`
+	VerifyEveryCycle bool   `json:"verify_every_cycle"`
+	Profile          string `json:"profile"`
+	CrashPoint       string `json:"crash_point,omitempty"`
+	FaultPoint       string `json:"fault_point,omitempty"`
+}
+
+// artifactCycle stores cycle-level worker execution and verification metadata.
+type artifactCycle struct {
+	Index              int                  `json:"index"`
+	WorkerPID          int                  `json:"worker_pid"`
+	PlannedKillDelayMS int                  `json:"planned_kill_delay_ms"`
+	ActualEndUnixMilli int64                `json:"actual_end_unix_ms"`
+	ExitCode           int                  `json:"exit_code"`
+	LastEvent          *artifactEventRef    `json:"last_event,omitempty"`
+	Verification       artifactVerification `json:"verification"`
+	CrashPoint         string               `json:"crash_point,omitempty"`
+	FaultPoint         string               `json:"fault_point,omitempty"`
+}
+
+// artifactVerification stores verification summary data for one cycle.
+type artifactVerification struct {
+	CheckedKeys int   `json:"checked_keys"`
+	Allowed     []int `json:"allowed_optional_ops,omitempty"`
+}
+
+// artifactEventRef points to the last event seen for a cycle.
+type artifactEventRef struct {
+	Kind eventKind `json:"kind"`
+	OpID int       `json:"op_id,omitempty"`
+}
+
+// artifactEvent is one timestamped worker/controller protocol event.
+type artifactEvent struct {
+	Cycle int          `json:"cycle"`
+	Time  int64        `json:"time_unix_ms"`
+	Event eventMessage `json:"event"`
+}
+
+// newArtifact constructs an artifact with encoded operations and config metadata.
+func newArtifact(cfg runConfig, ops []operation) *artifact {
+	encodedOps := make([]operationMessage, len(ops))
+	for i, op := range ops {
+		encodedOps[i] = op.toMessage()
+	}
+
+	return &artifact{
+		Version: artifactVersion,
+		Config: artifactConfig{
+			DBDir:            cfg.DBDir,
+			Cycles:           cfg.Cycles,
+			MinDelayMS:       cfg.MinDelayMS,
+			MaxDelayMS:       cfg.MaxDelayMS,
+			Ops:              cfg.Ops,
+			PutRatio:         cfg.PutRatio,
+			DeleteRatio:      cfg.DeleteRatio,
+			BatchRatio:       cfg.BatchRatio,
+			MaxKeyLen:        cfg.MaxKeyLen,
+			MaxValueLen:      cfg.MaxValueLen,
+			HotKeyRatio:      cfg.HotKeyRatio,
+			VerifyEveryCycle: cfg.VerifyEveryCycle,
+			Profile:          cfg.Profile,
+			CrashPoint:       cfg.CrashPoint,
+			FaultPoint:       cfg.FaultPoint,
+		},
+		Seed: cfg.Seed,
+		Ops:  encodedOps,
+	}
+}
+
+// createArtifactPath returns a timestamped artifact filename in the given directory.
+func createArtifactPath(dir string) string {
+	return filepath.Join(dir, fmt.Sprintf("crash-%s.json", time.Now().UTC().Format("20060102T150405.000Z")))
+}
+
+// loadArtifact reads and validates an artifact file and decodes its operations.
+func loadArtifact(path string) (*artifact, []operation, error) {
+	//nolint:gosec // artifact path is an explicit user-selected local file
+	data, err := os.ReadFile(path)
+	if err != nil {
+		return nil, nil, fmt.Errorf("reading artifact: %w", err)
+	}
+
+	var art artifact
+	if err := json.Unmarshal(data, &art); err != nil {
+		return nil, nil, fmt.Errorf("decoding artifact: %w", err)
+	}
+	if art.Version != artifactVersion {
+		return nil, nil, fmt.Errorf("unsupported artifact version %d", art.Version)
+	}
+
+	ops := make([]operation, len(art.Ops))
+	for i, msg := range art.Ops {
+		op, err := msg.toOperation()
+		if err != nil {
+			return nil, nil, fmt.Errorf("decoding artifact operation %d: %w", i, err)
+		}
+		ops[i] = op
+	}
+
+	return &art, ops, nil
+}
+
+// saveArtifact atomically writes an artifact to disk via temp-file rename.
+func saveArtifact(path string, art *artifact) error {
+	data, err := json.MarshalIndent(art, "", " ")
+	if err != nil {
+		return fmt.Errorf("encoding artifact: %w", err)
+	}
+
+	tmpPath := path + ".tmp"
+	if err := os.WriteFile(tmpPath, append(data, '\n'), 0o600); err != nil {
+		return fmt.Errorf("writing artifact temp file: %w", err)
+	}
+	if err := os.Rename(tmpPath, path); err != nil {
+		return fmt.Errorf("renaming artifact temp file: %w", err)
+	}
+
+	return nil
+}
