
fix(discov): prevent duplicate watch goroutines and suppress etcd logger noise #5588

Open
kevwan wants to merge 3 commits into zeromicro:master from kevwan:fix/discov-monitor-goroutine-race

Conversation

Contributor

@kevwan kevwan commented May 16, 2026

Problem

Two separate issues in core/discov/internal/registry.go:

1. TOCTOU race: duplicate watch goroutines per key

When Registry.Monitor() is called concurrently for the same key, two goroutines can both observe ok==false before the watcher entry is created:

// Registry.Monitor() — before fix
c.lock.RLock()
_, ok := c.clusters[key]
c.lock.RUnlock()             // ← both callers pass here with ok==false
if !ok {
    c.clusters[key] = newCluster(endpoints)   // both callers create a cluster
}
// both callers fall through to c.monitor()

Inside c.monitor(), the same race repeats on c.watchers[key]. Each caller ends up:

  • Spawning its own background watch goroutine
  • Issuing a duplicate etcd Get for initial KV loading
  • Appending to watcher.values without a lock, causing data races

2. etcd internal logger bypasses go-zero's logx

DialClient() called clientv3.New(cfg) without setting Logger, so etcd writes "context deadline exceeded" and other internal errors through its own zap logger regardless of go-zero's configured log level. This generates spurious noise in production logs that cannot be silenced.

Fix

For the race: pre-create the watchValue entry under lock at the start of c.monitor(). The second concurrent caller finds the entry already present and simply appends its listener and replays already-loaded KVs — no extra goroutine or etcd RPC is ever issued.

For the logger: set Logger: zap.NewNop() in DialClient() to suppress etcd's internal log output entirely, deferring all observability to go-zero's own logx.

Test

Added TestCluster_monitor_Idempotent which:

  • Uses Times(1) assertions on both the Get and Watch mock calls
  • If the bug were present, a second c.monitor() call would issue a second Get + Watch, failing the mock expectations
  • Verifies both listeners are correctly registered
ok  github.com/zeromicro/go-zero/core/discov         (race)
ok  github.com/zeromicro/go-zero/core/discov/internal (race)

…ger noise

When Registry.Monitor() is called concurrently for the same key, two goroutines
can both observe ok==false before the watcher entry exists (TOCTOU race), causing
each to call c.monitor() independently. The second goroutine then launches its own
background watch goroutine, issues a duplicate etcd Get, and appends to the watcher
values map without holding the lock.

Fix: pre-create the watchValue entry under lock at the start of c.monitor(). The
second concurrent caller now finds the entry and simply appends its listener plus
replays already-loaded KVs — no extra goroutine or etcd RPC is issued.

Also suppresses etcd's internal zap logger in DialClient() by setting Logger:
zap.NewNop(). Without this, etcd writes 'context deadline exceeded' errors via its
own logger regardless of go-zero's log level, generating spurious noise in
production logs.

Adds TestCluster_monitor_Idempotent to guard regressions: Times(1) on Get and
Watch mock calls verifies that a second c.monitor() call for the same key never
spawns a duplicate goroutine.
Copilot AI review requested due to automatic review settings May 16, 2026 05:04
Contributor

Copilot AI left a comment


Pull request overview

Fixes a TOCTOU race in core/discov/internal/registry.go where concurrent Registry.Monitor() calls for the same key could spawn duplicate etcd watch goroutines (causing extra Get/Watch RPCs and unsynchronized writes to watcher.values), and silences the etcd client's internal zap logger.

Changes:

  • In cluster.monitor(), atomically check-or-create the watchValue entry under the cluster lock; if it already exists, append the listener and replay already-loaded KVs instead of spawning a second watch goroutine.
  • In DialClient(), set clientv3.Config.Logger = zap.NewNop() to suppress etcd's internal logger output.
  • Add TestCluster_monitor_Idempotent asserting that a second c.monitor() call for the same key triggers exactly one Get and one Watch, with both listeners registered and replayed.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

File | Description
core/discov/internal/registry.go | Pre-creates the per-key watchValue under lock inside monitor() to dedupe concurrent watch setup; sets etcd client Logger to zap.NewNop()
core/discov/internal/registry_test.go | Adds TestCluster_monitor_Idempotent (with Times(1) mock expectations) plus a nopCloser helper for connManager cleanup

Comment thread core/discov/internal/registry.go Outdated
Comment on lines +502 to +505
// Suppress the etcd client's internal zap logger; its messages (e.g.
// context deadline exceeded on watch reconnects) are redundant with
// go-zero's own error handling and would bypass logx's level setting.
Logger: zap.NewNop(),
Contributor Author

kevwan commented May 16, 2026

Addressed the logger review comment in commit e14ca3e.

What changed:

  1. Made etcd internal logger configurable instead of hard-coded zap.NewNop().
  2. Added internal.EtcdClientLogger (default remains no-op) and internal.SetEtcdClientLogger(*zap.Logger).
  3. DialClient() now uses Logger: EtcdClientLogger.
  4. Exposed this via discov.SetEtcdClientLogger(*zap.Logger) for callers.
  5. Added tests in both internal and public wrapper layers.

So the default behavior still suppresses retry noise, but users can opt in to etcd diagnostics when needed.

Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 2 comments.

Comment on lines +26 to +36
	// Default to NOP to avoid noisy logs in production, and let callers opt in.
	EtcdClientLogger = zap.NewNop()
	// NewClient is used to create etcd clients.
	NewClient = DialClient
)

// SetEtcdClientLogger sets the etcd client's internal logger.
// Passing nil resets it to a no-op logger.
func SetEtcdClientLogger(logger *zap.Logger) {
	if logger == nil {
		EtcdClientLogger = zap.NewNop()
Comment on lines +32 to +41
// SetEtcdClientLogger sets the etcd client's internal logger.
// Passing nil resets it to a no-op logger.
func SetEtcdClientLogger(logger *zap.Logger) {
	if logger == nil {
		EtcdClientLogger = zap.NewNop()
		return
	}

	EtcdClientLogger = logger
}
Keep EtcdClientLogger configuration internal only. Advanced users can
configure it directly via internal.SetEtcdClientLogger if needed, but the
default (no-op) behavior is suitable for most cases without exposing a
zap-specific public API.
2 participants