Data race in service-mirror RemoteClusterServiceWatcher #15096

@dryrun-dev

Description

What is the issue?

RemoteClusterServiceWatcher in multicluster/service-mirror/cluster_watcher.go has no synchronization on its shared receiver. Multiple goroutines launched by Start() concurrently access and mutate rcsw fields without a mutex.

Affected Code

File: multicluster/service-mirror/cluster_watcher.go
Component: service-mirror (multicluster)
Severity: Low-medium (event queue serializes most work, but direct field writes race)

The Race

Start() launches two goroutines that access rcsw:

Goroutine 1: processEvents (line 1459)

go rcsw.processEvents(ctx)

Processes events from eventsQueue and calls handlers such as createOrUpdateService, handleCreateOrUpdateEndpoints, and repairEndpoints, all of which read and write rcsw fields.
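
For orientation, here is a minimal sketch of the event-loop shape (the dispatch details are assumptions for illustration, not the actual implementation): a single consumer goroutine dequeues and handles events, so handlers never race with each other, but they still race with any goroutine that writes rcsw fields directly.

// Illustrative sketch only; assumes eventsQueue is a thread-safe client-go workqueue.
func (rcsw *RemoteClusterServiceWatcher) processEvents(ctx context.Context) {
    for {
        event, shutdown := rcsw.eventsQueue.Get() // blocks until an event arrives
        if shutdown {
            return
        }
        switch event.(type) {
        case *RepairEndpoints:
            // Reads rcsw.gatewayAlive with no lock; races with the
            // liveness write in the repair goroutine shown below.
            rcsw.repairEndpoints(ctx)
        // ... other event types dispatch to createOrUpdateService, etc.
        }
        rcsw.eventsQueue.Done(event)
    }
}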

Goroutine 2: repair ticker (line 1471)

go func() {
    ticker := time.NewTicker(rcsw.repairPeriod)
    for {
        select {
        case <-ticker.C:
            ev := RepairEndpoints{}
            rcsw.eventsQueue.Add(&ev)
        case alive := <-rcsw.liveness:
            rcsw.gatewayAlive = alive  // ← DIRECT FIELD WRITE, NO LOCK
            ev := RepairEndpoints{}
            rcsw.eventsQueue.Add(&ev)
        case <-rcsw.stopper:
            return
        }
    }
}()

Line 1479: rcsw.gatewayAlive = alive is a direct field write from the repair goroutine while processEvents may be reading rcsw.gatewayAlive in repairEndpoints() or other handlers.

Affected methods (both files)

All methods on RemoteClusterServiceWatcher share the unprotected receiver across goroutines. This includes methods in both cluster_watcher.go and cluster_watcher_headless.go:

  • cluster_watcher.go: repairEndpoints, createOrUpdateGatewayEndpoints, handleCreateOrUpdateEndpoints, cleanupOrphanedServices, cleanupMirroredResources
  • cluster_watcher_headless.go: createOrUpdateHeadlessEndpoints, createHeadlessMirrorEndpoints, createEndpointMirrorService

Additionally

Informer callbacks registered in Start() (lines 1374-1458) run on Kubernetes informer goroutines. They enqueue events to eventsQueue (which is thread-safe), but the informer goroutines are nonetheless a third concurrent accessor of the rcsw struct; a sketch of the pattern follows.
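
A sketch of that registration using the standard client-go API (the handler bodies and event type names are illustrative assumptions, not necessarily the real ones):

// Runs on the informer's goroutine. Enqueueing alone is safe because the
// workqueue is thread-safe; the hazard is any direct rcsw field access
// added inside callbacks like these.
svcInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
    AddFunc: func(obj interface{}) {
        svc := obj.(*corev1.Service)
        rcsw.eventsQueue.Add(&RemoteServiceCreated{service: svc})
    },
    DeleteFunc: func(obj interface{}) {
        // enqueue a corresponding delete event; never touch rcsw state here
    },
})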

Impact

  • rcsw.gatewayAlive read/write race between repair goroutine and processEvents
  • Stale gatewayAlive value → endpoints repaired against wrong gateway state → traffic routed to dead gateways in multicluster setups
  • Under gateway flapping (rapid liveness changes), the race window grows
  • Potential panic from nil pointer if handlers assume consistent state during concurrent mutation

Precedent: #11163 (same pattern, different component)

This is the same race pattern that caused production panics in the Destination endpoint watcher (#11163, reported via #11135). In that case:

  • pp.updateServer() called Add on listeners with no mutex
  • pp.unsubscribe() set listeners to nil concurrently
  • Result: nil pointer panic in production under high endpoint churn
  • Fix: added mutex protection to all functions that modify listeners

The service-mirror RemoteClusterServiceWatcher has the same architectural gap — shared struct accessed from multiple goroutines with zero synchronization. The Destination component was fixed; the service-mirror component was not.

Detection

Found by DryRun, confirmed by go test -race.

Verification

How can it be reproduced?

The race is confirmed by go test -race. Add this test to cluster_watcher_race_test.go:

func TestGatewayAliveRace(t *testing.T) {
	liveness := make(chan bool, 100)
	stopper := make(chan struct{})

	watcher := &RemoteClusterServiceWatcher{
		gatewayAlive: true,
		liveness:     liveness,
		stopper:      stopper,
	}

	// Goroutine 1: reads gatewayAlive (simulates processEvents/repairEndpoints)
	go func() {
		for {
			select {
			case <-stopper:
				return
			default:
				_ = watcher.gatewayAlive
				time.Sleep(time.Millisecond)
			}
		}
	}()

	// Goroutine 2: writes gatewayAlive (simulates repair ticker liveness handler)
	go func() {
		for {
			select {
			case alive := <-liveness:
				watcher.gatewayAlive = alive
			case <-stopper:
				return
			}
		}
	}()

	for i := 0; i < 200; i++ {
		liveness <- (i%2 == 0)
	}

	time.Sleep(50 * time.Millisecond)
	close(stopper)
}

Logs, error output, etc

$ go test -race -run TestGatewayAliveRace ./multicluster/service-mirror/
WARNING: DATA RACE
Write at cluster_watcher_race_test.go:41 by goroutine 60
Previous read at cluster_watcher_race_test.go:30 by goroutine 59
--- FAIL: TestGatewayAliveRace (0.05s)
FAIL

output of linkerd check -o short

N/A — static analysis, no cluster.

Environment

N/A — found via static analysis of source code, not runtime testing.

Possible solution

Add a sync.Mutex to RemoteClusterServiceWatcher:

type RemoteClusterServiceWatcher struct {
    mu           sync.Mutex  // protects gatewayAlive and other shared fields
    // ... existing fields
}

Protect the direct write:

case alive := <-rcsw.liveness:
    rcsw.mu.Lock()
    rcsw.gatewayAlive = alive
    rcsw.mu.Unlock()
    // ...

And reads in handlers:

func (rcsw *RemoteClusterServiceWatcher) repairEndpoints(ctx context.Context) error {
    rcsw.mu.Lock()
    alive := rcsw.gatewayAlive
    rcsw.mu.Unlock()
    // use alive...
}

Alternatively, use atomic.Bool for gatewayAlive since it's a single boolean field.
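
A sketch of the atomic variant (atomic.Bool requires Go 1.19+; these snippets mirror the mutex examples above):

type RemoteClusterServiceWatcher struct {
    gatewayAlive atomic.Bool // safe for concurrent Load/Store, no lock needed
    // ... existing fields
}

// Writer (repair goroutine):
case alive := <-rcsw.liveness:
    rcsw.gatewayAlive.Store(alive)

// Reader (repairEndpoints and other handlers):
alive := rcsw.gatewayAlive.Load()

The atomic covers only this one field, though; a mutex is the safer choice if other rcsw fields turn out to be shared across the same goroutines.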

Additional context

No response

Would you like to work on fixing this bug?

yes
