fix(discov): prevent duplicate watch goroutines and suppress etcd logger noise#5588
Open
kevwan wants to merge 3 commits into
Open
fix(discov): prevent duplicate watch goroutines and suppress etcd logger noise#5588kevwan wants to merge 3 commits into
kevwan wants to merge 3 commits into
Conversation
…ger noise When Registry.Monitor() is called concurrently for the same key, two goroutines can both observe ok==false before the watcher entry exists (TOCTOU race), causing each to call c.monitor() independently. The second goroutine then launches its own background watch goroutine, issues a duplicate etcd Get, and appends to the watcher values map without holding the lock. Fix: pre-create the watchValue entry under lock at the start of c.monitor(). The second concurrent caller now finds the entry and simply appends its listener plus replays already-loaded KVs — no extra goroutine or etcd RPC is issued. Also suppresses etcd's internal zap logger in DialClient() by setting Logger: zap.NewNop(). Without this, etcd writes 'context deadline exceeded' errors via its own logger regardless of go-zero's log level, generating spurious noise in production logs. Adds TestCluster_monitor_Idempotent to guard regressions: Times(1) on Get and Watch mock calls verifies that a second c.monitor() call for the same key never spawns a duplicate goroutine.
Contributor
There was a problem hiding this comment.
Pull request overview
Fixes a TOCTOU race in core/discov/internal/registry.go where concurrent Registry.Monitor() calls for the same key could spawn duplicate etcd watch goroutines (causing extra Get/Watch RPCs and unsynchronized writes to watcher.values), and silences the etcd client's internal zap logger.
Changes:
- In
cluster.monitor(), atomically check-or-create thewatchValueentry under the cluster lock; if it already exists, append the listener and replay already-loaded KVs instead of spawning a second watch goroutine. - In
DialClient(), setclientv3.Config.Logger = zap.NewNop()to suppress etcd's internal logger output. - Add
TestCluster_monitor_Idempotentasserting that a secondc.monitor()call for the same key triggers exactly oneGetand oneWatch, with both listeners registered and replayed.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| core/discov/internal/registry.go | Pre-creates the per-key watchValue under lock inside monitor() to dedupe concurrent watch setup; sets etcd client Logger to zap.NewNop() |
| core/discov/internal/registry_test.go | Adds TestCluster_monitor_Idempotent (with Times(1) mock expectations) plus a nopCloser helper for connManager cleanup |
Comment on lines
+502
to
+505
| // Suppress the etcd client's internal zap logger; its messages (e.g. | ||
| // context deadline exceeded on watch reconnects) are redundant with | ||
| // go-zero's own error handling and would bypass logx's level setting. | ||
| Logger: zap.NewNop(), |
Contributor
Author
|
Addressed the logger review comment in commit e14ca3e. What changed:
So default behavior still suppresses retry noise, but users can opt-in to etcd diagnostics when needed. |
Comment on lines
+26
to
+36
| // Default to NOP to avoid noisy logs in production, and let callers opt in. | ||
| EtcdClientLogger = zap.NewNop() | ||
| // NewClient is used to create etcd clients. | ||
| NewClient = DialClient | ||
| ) | ||
|
|
||
| // SetEtcdClientLogger sets the etcd client's internal logger. | ||
| // Passing nil resets it to a no-op logger. | ||
| func SetEtcdClientLogger(logger *zap.Logger) { | ||
| if logger == nil { | ||
| EtcdClientLogger = zap.NewNop() |
Comment on lines
+32
to
+41
| // SetEtcdClientLogger sets the etcd client's internal logger. | ||
| // Passing nil resets it to a no-op logger. | ||
| func SetEtcdClientLogger(logger *zap.Logger) { | ||
| if logger == nil { | ||
| EtcdClientLogger = zap.NewNop() | ||
| return | ||
| } | ||
|
|
||
| EtcdClientLogger = logger | ||
| } |
Keep EtcdClientLogger configuration internal only. Advanced users can configure it directly via internal.SetEtcdClientLogger if needed, but the default (no-op) behavior is suitable for most cases without exposing a zap-specific public API.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Problem
Two separate issues in
core/discov/internal/registry.go:1. TOCTOU race: duplicate watch goroutines per key
When
Registry.Monitor()is called concurrently for the same key, two goroutines can both observeok==falsebefore the watcher entry is created:Inside
c.monitor(), the same race repeats onc.watchers[key]. Each caller ends up:watchgoroutineGetfor initial KV loadingwatcher.valueswithout a lock, causing data races2. etcd internal logger bypasses go-zero's logx
DialClient()createdclientv3.New(cfg)without settingLogger, so etcd writescontext deadline exceededand other internal errors through its own zap logger regardless of go-zero's configured log level. This generates spurious noise in production logs that cannot be silenced.Fix
For the race: pre-create the
watchValueentry under lock at the start ofc.monitor(). The second concurrent caller finds the entry already present and simply appends its listener and replays already-loaded KVs — no extra goroutine or etcd RPC is ever issued.For the logger: set
Logger: zap.NewNop()inDialClient()to suppress etcd's internal log output entirely, deferring all observability to go-zero's own logx.Test
Added
TestCluster_monitor_Idempotentwhich:Times(1)assertions on both theGetandWatchmock callsc.monitor()call would issue a secondGet+Watch, failing the mock expectations