ALEC-297: Make DBScan no-path spatial distance configurable#155
Open
joseanesONMS wants to merge 2 commits intoOpenNMS-Plugins:developfrom
Open
ALEC-297: Make DBScan no-path spatial distance configurable#155joseanesONMS wants to merge 2 commits intoOpenNMS-Plugins:developfrom
joseanesONMS wants to merge 2 commits intoOpenNMS-Plugins:developfrom
Conversation
Replace the hardcoded Integer.MAX_VALUE no-path penalty in AlarmInSpaceTimeDistanceMeasure with a configurable noPathDistance parameter (default 100, comparable to typical graph diameters). With the pre-fix value, alarms on different devices with no topology path between them required epsilon >= ~1e10 to cluster (verified by CrossDeviceCorrelationIT) — unusable in practice. The new default lets cross-device alarms cluster at default epsilon while still allowing operators who depend on the old behaviour to opt back in by setting noPathDistance to a large value. Plumbing: - AlarmInSpaceTimeDistanceMeasure: new constructor param + default; treats both spatialDistance == 0 (caller convention) and spatialDistance >= Integer.MAX_VALUE (the value AbstractClusterEngine's spatialDistances cache returns for disconnected vertices) as "no path" and substitutes noPathDistance. - AlarmInSpaceAndTimeDistanceMeasureFactory: holds field + getter/setter, passes value to created measures. - DBScanEngineFactory: setNoPathDistance/getNoPathDistance delegating to the inner factory; included in getNameConf() and getParameters(). - engine/dbscan blueprint: new <cm:property name="noPathDistance" value="100"/> bound to the factory bean. - EngineParameter / EngineParameterImpl: getNoPathDistance() with a default fallback when null + dbscan engine. - EngineRestImpl.configureAndStoreDBScan: applies and persists the new value alongside alpha/beta/epsilon. Tests: - CrossDeviceCorrelationIT: un-@ignore'd. Two cases — cluster at default config; preserve old behaviour at noPathDistance=Integer.MAX_VALUE. - AlarmInSpaceTimeDistanceMeasureTest: direct unit tests for the new param, asserting the math (~66 at default config, far above default epsilon at large penalty). - EngineRestImplTest: round-trip of noPathDistance to factory. - EngineParameterImplTest.serializeSimpleString: JSON expectation updated to include the new field.
The CircleCI build for PR OpenNMS-Plugins#155 failed in DBScanEnginePerfTest because the new default noPathDistance lets disconnected-IO alarms cluster, which exercised two pre-existing latent bugs that were previously unreachable when no situations formed: 1. AbstractClusterEngine called Thread.currentThread().getName().substring(20) to extract the engine-config from the driver thread name, which it sets as "ALEC Driver Tick -- <params>" (a 20-char prefix). In any other thread (test code, default JUnit "main" thread, etc.) the name is shorter than 20 characters and the call threw StringIndexOutOfBoundsException. Made the substring defensive: only strip the prefix when it is actually present. 2. DBScanEnginePerfTest.canRunDBScanOnLargeGraphs never registered a SituationHandler. Other tests in the same file already do (registerSituationHandler(mock(SituationHandler.class))) — added the same line to this one so the engine's onSituation callback has somewhere to go now that situations actually form.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Fixes the cross-device correlation bug surfaced as a
@Ignore'd reproducer (CrossDeviceCorrelationIT) in ALEC-296 (PR #154).AlarmInSpaceTimeDistanceMeasure.compute()previously substitutedInteger.MAX_VALUEfor the spatial distance whenever no topology path existed between two alarm vertices. With the default DBScan parameters that produces inter-alarm distances of ~10⁹–10¹⁰ — so no realistic ε ever clusters alarms on different devices, regardless of how the user tunes α / β / ε.This PR replaces the hardcoded sentinel with a configurable
noPathDistanceparameter. Default 100 (same scale asInventoryObject.DEFAULT_WEIGHT), so two alarms on different devices ~1 minute apart now produce a distance of ~66 — within the default ε of 100. Operators who depended on the old "never cluster cross-device" behaviour can opt into it by settingnoPathDistanceto a large value (e.g.Integer.MAX_VALUE).Behavioural change
Day-one behaviour for existing installs changes: cross-device alarms that did not cluster before will start clustering at default ε. This is the whole point of the fix, but worth calling out:
noPathDistance(via RESTPOST /alec/engine/configurationoretc/org.opennms.alec.engine.dbscan.cfg) to something large.CrossDeviceCorrelationIT.largeNoPathDistance_preservesPreFixBehaviour.Plumbing
AlarmInSpaceTimeDistanceMeasure— newnoPathDistanceconstructor param +DEFAULT_NO_PATH_DISTANCE = 100. Substituted whenspatialDistance == 0or>= Integer.MAX_VALUE(covers both legacy "no path = 0" callers andAbstractClusterEngine.spatialDistances, which returnsInteger.MAX_VALUEdirectly for disconnected vertices — that was the path actually triggering the bug).AlarmInSpaceAndTimeDistanceMeasureFactory— holds the field + getter/setter; passes it to created measures.DBScanEngineFactory—setNoPathDistance/getNoPathDistancedelegating to the inner factory; included ingetNameConf()andgetParameters().engine/dbscanblueprint — new<cm:property name="noPathDistance" value="100"/>bound to the factory bean.EngineParameterinterface +EngineParameterImpl— newgetNoPathDistance()with default fallback when null + dbscan engine.EngineRestImpl.configureAndStoreDBScan— applies and persists the new value alongside α / β / ε.Tests
CrossDeviceCorrelationITun-@Ignore'd, two cases:crossDeviceAlarmsClusterAtDefaultParameters— verifies the fix at default config.largeNoPathDistance_preservesPreFixBehaviour— verifies the opt-in escape hatch.AlarmInSpaceTimeDistanceMeasureTest— direct unit tests for the new param, asserting the math (~66 at default config; far above default ε at large penalty).EngineRestImplTest— round-trip ofnoPathDistancethroughconfigureAndStoreDBScanto the factory.EngineParameterImplTest.serializeSimpleString— JSON expectation updated for the new field.Note on overlap with PR #154
CrossDeviceCorrelationIT.javaexists on both this branch andja/jira/ALEC-296(the@Ignore'd reproducer there is the documentation that motivated this ticket). Whichever PR merges first, the other will need a small textual conflict resolution on this single file — trivial in either direction.Ticket: ALEC-297
Test plan
mvn -pl engine/dbscan,engine/itest,features/ui -am test— full module + dependency test pass (verified locally; 21 itests including the two new ones, 27 features/ui tests, 3 dbscan unit tests).mvn clean install -DskipTests— KAR rebuilds.GET /opennms/rest/alec/engine/configurationreturns the newnoPathDistancefield.POST .../configurationround-trips a custom value (verified set→250, GET→250).