Skip to content

ALEC-297: Make DBScan no-path spatial distance configurable#155

Open
joseanesONMS wants to merge 2 commits intoOpenNMS-Plugins:developfrom
joseanesONMS:ja/jira/ALEC-297
Open

ALEC-297: Make DBScan no-path spatial distance configurable#155
joseanesONMS wants to merge 2 commits intoOpenNMS-Plugins:developfrom
joseanesONMS:ja/jira/ALEC-297

Conversation

@joseanesONMS
Copy link
Copy Markdown

@joseanesONMS joseanesONMS commented May 5, 2026

Summary

Fixes the cross-device correlation bug surfaced as a @Ignore'd reproducer (CrossDeviceCorrelationIT) in ALEC-296 (PR #154).

AlarmInSpaceTimeDistanceMeasure.compute() previously substituted Integer.MAX_VALUE for the spatial distance whenever no topology path existed between two alarm vertices. With the default DBScan parameters that produces inter-alarm distances of ~10⁹–10¹⁰ — so no realistic ε ever clusters alarms on different devices, regardless of how the user tunes α / β / ε.

This PR replaces the hardcoded sentinel with a configurable noPathDistance parameter. Default 100 (same scale as InventoryObject.DEFAULT_WEIGHT), so two alarms on different devices ~1 minute apart now produce a distance of ~66 — within the default ε of 100. Operators who depended on the old "never cluster cross-device" behaviour can opt into it by setting noPathDistance to a large value (e.g. Integer.MAX_VALUE).

Behavioural change

Day-one behaviour for existing installs changes: cross-device alarms that did not cluster before will start clustering at default ε. This is the whole point of the fix, but worth calling out:

  • If an operator wants the old behaviour, set noPathDistance (via REST POST /alec/engine/configuration or etc/org.opennms.alec.engine.dbscan.cfg) to something large.
  • The pre-fix invariant ("disconnected vertices ⇒ never cluster") is verified by CrossDeviceCorrelationIT.largeNoPathDistance_preservesPreFixBehaviour.

Plumbing

  • AlarmInSpaceTimeDistanceMeasure — new noPathDistance constructor param + DEFAULT_NO_PATH_DISTANCE = 100. Substituted when spatialDistance == 0 or >= Integer.MAX_VALUE (covers both legacy "no path = 0" callers and AbstractClusterEngine.spatialDistances, which returns Integer.MAX_VALUE directly for disconnected vertices — that was the path actually triggering the bug).
  • AlarmInSpaceAndTimeDistanceMeasureFactory — holds the field + getter/setter; passes it to created measures.
  • DBScanEngineFactorysetNoPathDistance / getNoPathDistance delegating to the inner factory; included in getNameConf() and getParameters().
  • engine/dbscan blueprint — new <cm:property name="noPathDistance" value="100"/> bound to the factory bean.
  • EngineParameter interface + EngineParameterImpl — new getNoPathDistance() with default fallback when null + dbscan engine.
  • EngineRestImpl.configureAndStoreDBScan — applies and persists the new value alongside α / β / ε.

Tests

  • CrossDeviceCorrelationIT un-@Ignore'd, two cases:
    • crossDeviceAlarmsClusterAtDefaultParameters — verifies the fix at default config.
    • largeNoPathDistance_preservesPreFixBehaviour — verifies the opt-in escape hatch.
  • AlarmInSpaceTimeDistanceMeasureTest — direct unit tests for the new param, asserting the math (~66 at default config; far above default ε at large penalty).
  • EngineRestImplTest — round-trip of noPathDistance through configureAndStoreDBScan to the factory.
  • EngineParameterImplTest.serializeSimpleString — JSON expectation updated for the new field.

Note on overlap with PR #154

CrossDeviceCorrelationIT.java exists on both this branch and ja/jira/ALEC-296 (the @Ignore'd reproducer there is the documentation that motivated this ticket). Whichever PR merges first, the other will need a small textual conflict resolution on this single file — trivial in either direction.

Ticket: ALEC-297

Test plan

  • mvn -pl engine/dbscan,engine/itest,features/ui -am test — full module + dependency test pass (verified locally; 21 itests including the two new ones, 27 features/ui tests, 3 dbscan unit tests).
  • mvn clean install -DskipTests — KAR rebuilds.
  • Smoke-tested on a local OpenNMS Horizon 36 dev install:
    • GET /opennms/rest/alec/engine/configuration returns the new noPathDistance field.
    • POST .../configuration round-trips a custom value (verified set→250, GET→250).

Replace the hardcoded Integer.MAX_VALUE no-path penalty in
AlarmInSpaceTimeDistanceMeasure with a configurable noPathDistance
parameter (default 100, comparable to typical graph diameters).

With the pre-fix value, alarms on different devices with no topology
path between them required epsilon >= ~1e10 to cluster (verified by
CrossDeviceCorrelationIT) — unusable in practice. The new default
lets cross-device alarms cluster at default epsilon while still
allowing operators who depend on the old behaviour to opt back in
by setting noPathDistance to a large value.

Plumbing:
- AlarmInSpaceTimeDistanceMeasure: new constructor param + default;
  treats both spatialDistance == 0 (caller convention) and
  spatialDistance >= Integer.MAX_VALUE (the value AbstractClusterEngine's
  spatialDistances cache returns for disconnected vertices) as
  "no path" and substitutes noPathDistance.
- AlarmInSpaceAndTimeDistanceMeasureFactory: holds field + getter/setter,
  passes value to created measures.
- DBScanEngineFactory: setNoPathDistance/getNoPathDistance delegating
  to the inner factory; included in getNameConf() and getParameters().
- engine/dbscan blueprint: new <cm:property name="noPathDistance"
  value="100"/> bound to the factory bean.
- EngineParameter / EngineParameterImpl: getNoPathDistance() with a
  default fallback when null + dbscan engine.
- EngineRestImpl.configureAndStoreDBScan: applies and persists the
  new value alongside alpha/beta/epsilon.

Tests:
- CrossDeviceCorrelationIT: un-@ignore'd. Two cases — cluster at
  default config; preserve old behaviour at noPathDistance=Integer.MAX_VALUE.
- AlarmInSpaceTimeDistanceMeasureTest: direct unit tests for the new
  param, asserting the math (~66 at default config, far above default
  epsilon at large penalty).
- EngineRestImplTest: round-trip of noPathDistance to factory.
- EngineParameterImplTest.serializeSimpleString: JSON expectation
  updated to include the new field.
The CircleCI build for PR OpenNMS-Plugins#155 failed in DBScanEnginePerfTest because
the new default noPathDistance lets disconnected-IO alarms cluster,
which exercised two pre-existing latent bugs that were previously
unreachable when no situations formed:

1. AbstractClusterEngine called
       Thread.currentThread().getName().substring(20)
   to extract the engine-config from the driver thread name, which
   it sets as "ALEC Driver Tick -- <params>" (a 20-char prefix).
   In any other thread (test code, default JUnit "main" thread, etc.)
   the name is shorter than 20 characters and the call threw
   StringIndexOutOfBoundsException. Made the substring defensive:
   only strip the prefix when it is actually present.

2. DBScanEnginePerfTest.canRunDBScanOnLargeGraphs never registered
   a SituationHandler. Other tests in the same file already do
   (registerSituationHandler(mock(SituationHandler.class))) — added
   the same line to this one so the engine's onSituation callback
   has somewhere to go now that situations actually form.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant