What is the issue?
RemoteClusterServiceWatcher in multicluster/service-mirror/cluster_watcher.go has no synchronization on its shared receiver. Multiple goroutines launched by Start() concurrently access and mutate rcsw fields without a mutex.
Affected Code
File: multicluster/service-mirror/cluster_watcher.go
Component: service-mirror (multicluster)
Severity: Low-medium (the event queue serializes most handler work, but direct field writes bypass it and race)
The Race
Start() launches two goroutines that access rcsw:
Goroutine 1: processEvents (line 1459)
go rcsw.processEvents(ctx)
Processes events from eventsQueue, calls handlers like createOrUpdateService, handleCreateOrUpdateEndpoints, repairEndpoints — all reading/writing rcsw fields.
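For context on why the queue serializes most work: a single consumer goroutine drains eventsQueue and dispatches each event to its handler. A minimal sketch of that loop, assuming a client-go-style workqueue (names and dispatch simplified; not the exact linkerd code):

func (rcsw *RemoteClusterServiceWatcher) processEvents(ctx context.Context) {
    for {
        event, shutdown := rcsw.eventsQueue.Get() // blocks; the queue itself is thread-safe
        if shutdown {
            return
        }
        switch event.(type) {
        case *RepairEndpoints:
            // Reads rcsw.gatewayAlive — races with the repair goroutine's direct write.
            _ = rcsw.repairEndpoints(ctx)
            // ... other event types dispatch to their handlers
        }
        rcsw.eventsQueue.Done(event)
    }
}

Because only this goroutine invokes the handlers, queued work cannot race with itself; the race comes from the direct field write outside this loop, shown next.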
Goroutine 2: repair ticker (line 1471)
go func() {
    ticker := time.NewTicker(rcsw.repairPeriod)
    for {
        select {
        case <-ticker.C:
            ev := RepairEndpoints{}
            rcsw.eventsQueue.Add(&ev)
        case alive := <-rcsw.liveness:
            rcsw.gatewayAlive = alive // ← DIRECT FIELD WRITE, NO LOCK
            ev := RepairEndpoints{}
            rcsw.eventsQueue.Add(&ev)
        case <-rcsw.stopper:
            return
        }
    }
}()
Line 1479: rcsw.gatewayAlive = alive is a direct field write from the repair goroutine while processEvents may be reading rcsw.gatewayAlive in repairEndpoints() or other handlers.
Affected methods (both files)
All methods on RemoteClusterServiceWatcher share the unprotected receiver across goroutines. This includes methods in both cluster_watcher.go and cluster_watcher_headless.go:
cluster_watcher.go: repairEndpoints, createOrUpdateGatewayEndpoints, handleCreateOrUpdateEndpoints, cleanupOrphanedServices, cleanupMirroredResources
cluster_watcher_headless.go: createOrUpdateHeadlessEndpoints, createHeadlessMirrorEndpoints, createEndpointMirrorService
Additionally
Informer callbacks registered in Start() (lines 1374-1458) run on Kubernetes informer goroutines. They enqueue events to eventsQueue (which is thread-safe), but the informer goroutines are a third concurrent accessor of the rcsw struct.
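As an illustration, informer registration typically looks like the following — a simplified sketch using client-go's cache.ResourceEventHandlerFuncs (the informer variable and event type name here are illustrative, not the exact linkerd code):

svcInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
    AddFunc: func(obj interface{}) {
        // Runs on a client-go informer goroutine.
        // Enqueueing is safe because the workqueue is thread-safe,
        // but any direct read or write of other rcsw fields here
        // would add a third racing accessor.
        rcsw.eventsQueue.Add(&RemoteServiceCreated{service: obj.(*corev1.Service)})
    },
})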
Impact
- rcsw.gatewayAlive read/write race between the repair goroutine and processEvents
- Stale gatewayAlive value → endpoints repaired against wrong gateway state → traffic routed to dead gateways in multicluster setups
- Under gateway flapping (rapid liveness changes), the race window grows
- Potential panic from nil pointer if handlers assume consistent state during concurrent mutation
Precedent: #11163 (same pattern, different component)
This is the same race pattern that caused production panics in the Destination endpoint watcher (#11163, reported via #11135). In that case:
- pp.updateServer() called Add on listeners with no mutex
- pp.unsubscribe() set listeners to nil concurrently
- Result: nil pointer panic in production under high endpoint churn
- Fix: added mutex protection to all functions that modify listeners
The service-mirror RemoteClusterServiceWatcher has the same architectural gap — shared struct accessed from multiple goroutines with zero synchronization. The Destination component was fixed; the service-mirror component was not.
Detection
Found by DryRun, confirmed by go test -race.
How can it be reproduced?
The race is confirmed by go test -race. Add this test to cluster_watcher_race_test.go:
func TestGatewayAliveRace(t *testing.T) {
    liveness := make(chan bool, 100)
    stopper := make(chan struct{})
    watcher := &RemoteClusterServiceWatcher{
        gatewayAlive: true,
        liveness:     liveness,
        stopper:      stopper,
    }
    // Goroutine 1: reads gatewayAlive (simulates processEvents/repairEndpoints)
    go func() {
        for {
            select {
            case <-stopper:
                return
            default:
                _ = watcher.gatewayAlive
                time.Sleep(time.Millisecond)
            }
        }
    }()
    // Goroutine 2: writes gatewayAlive (simulates repair ticker liveness handler)
    go func() {
        for {
            select {
            case alive := <-liveness:
                watcher.gatewayAlive = alive
            case <-stopper:
                return
            }
        }
    }()
    for i := 0; i < 200; i++ {
        liveness <- (i%2 == 0)
    }
    time.Sleep(50 * time.Millisecond)
    close(stopper)
}
Logs, error output, etc
$ go test -race -run TestGatewayAliveRace ./multicluster/service-mirror/
WARNING: DATA RACE
Write at cluster_watcher_race_test.go:41 by goroutine 60
Previous read at cluster_watcher_race_test.go:30 by goroutine 59
--- FAIL: TestGatewayAliveRace (0.05s)
FAIL
output of linkerd check -o short
N/A — static analysis, no cluster.
Environment
N/A — found via static analysis of source code, not runtime testing.
Possible solution
Add a sync.Mutex to RemoteClusterServiceWatcher:
type RemoteClusterServiceWatcher struct {
    mu sync.Mutex // protects gatewayAlive and other shared fields
    // ... existing fields
}
Protect the direct write:
case alive := <-rcsw.liveness:
    rcsw.mu.Lock()
    rcsw.gatewayAlive = alive
    rcsw.mu.Unlock()
    // ...
And reads in handlers:
func (rcsw *RemoteClusterServiceWatcher) repairEndpoints(ctx context.Context) error {
    rcsw.mu.Lock()
    alive := rcsw.gatewayAlive
    rcsw.mu.Unlock()
    // use alive...
}
Alternatively, use atomic.Bool for gatewayAlive since it's a single boolean field.
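A minimal sketch of the atomic.Bool approach (requires Go 1.19+ and "sync/atomic"; the accessor helpers below are illustrative, not existing methods):

type RemoteClusterServiceWatcher struct {
    gatewayAlive atomic.Bool // lock-free; safe for concurrent Load/Store
    // ... existing fields
}

// Writer side (repair ticker goroutine):
func (rcsw *RemoteClusterServiceWatcher) setGatewayAlive(alive bool) {
    rcsw.gatewayAlive.Store(alive)
}

// Reader side (event handlers such as repairEndpoints):
func (rcsw *RemoteClusterServiceWatcher) isGatewayAlive() bool {
    return rcsw.gatewayAlive.Load()
}

This avoids lock contention entirely, but it protects only the one field; if handlers ever need a consistent view across several rcsw fields, the mutex approach is the safer choice.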
Additional context
No response
Would you like to work on fixing this bug?
yes