Spotted in the wild: a workflow stuck in an infinite loop, unable to continue, save progress to the DB, or shut down.
The workflow was getting stuck in increment_graph_window, inside the for location in locations loop, where it was operating on "blank" (empty) locations.
Reproducible Example
The exact example is extremely complex and involves a DB dump.
However, I was able to reduce the example to something which doesn't produce the infinite loop problem but does produce the "blank" location problem.
Apply this diff:
diff --git a/cylc/flow/data_store_mgr.py b/cylc/flow/data_store_mgr.py
index e5171760f..b6b894480 100644
--- a/cylc/flow/data_store_mgr.py
+++ b/cylc/flow/data_store_mgr.py
@@ -901,6 +901,7 @@ class DataStoreMgr:
itask:
Active/Other task proxy, passed in with pool invocation.
"""
+ LOG.warning(f'# increment_graph_window({source_tokens.relative_id})')
# common refrences
active_id = source_tokens.id
@@ -1040,6 +1041,10 @@ class DataStoreMgr:
locations = ['']
# Explore/walk locations
for location in locations:
+ LOG.info(f'# {len(locations)} locations - {location}')
+ if not location:
+ LOG.error('NULL LOCATION')
+ # continue
walk_incomplete = True
if not location:
loc_nodes = {active_id}
And run this workflow:
[scheduler]
UTC mode = True
allow implicit tasks = True
[scheduling]
initial cycle point = 20260407T1300Z
final cycle point = 20270331T1800Z
runahead limit = P5
[[special tasks]]
sequential = housekeep_cycle
clock-trigger = check_hall(PT1H45M)
[[queues]]
[[[default]]]
limit = 150
[[graph]]
R1 = """
install_cold
install_cold_mirror
fcm_make
fcm_make_mirror
"""
R1/20260407T1300Z = """
install_cold[^] => check_hall
"""
R1/20260407T1400Z = """
install_cold[^] => check_hall
"""
R1/20260407T1500Z = """
install_cold[^] => check_hall
"""
R1/20260407T1600Z = """
install_cold[^] => check_hall
"""
R1/20260407T1700Z = """
install_cold[^] => check_hall
"""
R1/20260407T1800Z = """
install_cold[^] => check_hall
"""
R1/20260408T1300Z = """
install_cold[^] => check_hall
"""
R1/20260408T1400Z = """
install_cold[^] => check_hall
"""
R1/20260408T1500Z = """
install_cold[^] => check_hall
"""
R1/20260408T1600Z = """
install_cold[^] => check_hall
"""
R1/20260408T1700Z = """
install_cold[^] => check_hall
"""
R1/20260408T1800Z = """
install_cold[^] => check_hall
"""
R1/20260409T1300Z = """
install_cold[^] => check_hall
"""
R1/20260409T1400Z = """
install_cold[^] => check_hall
"""
R1/20260409T1500Z = """
install_cold[^] => check_hall
"""
R1/20260409T1600Z = """
install_cold[^] => check_hall
"""
R1/20260409T1700Z = """
install_cold[^] => check_hall
"""
R1/20260409T1800Z = """
install_cold[^] => check_hall
"""
R1/20260410T1300Z = """
install_cold[^] => check_hall
"""
R1/20260410T1400Z = """
install_cold[^] => check_hall
"""
R1/20260410T1500Z = """
install_cold[^] => check_hall
"""
R1/20260410T1600Z = """
install_cold[^] => check_hall
"""
R1/20260410T1700Z = """
install_cold[^] => check_hall
"""
R1/20260410T1800Z = """
install_cold[^] => check_hall
"""
R1/20260411T1300Z = """
install_cold[^] => check_hall
"""
R1/20260411T1400Z = """
install_cold[^] => check_hall
"""
R1/20260411T1500Z = """
install_cold[^] => check_hall
"""
R1/20260411T1600Z = """
install_cold[^] => check_hall
"""
R1/20260411T1700Z = """
install_cold[^] => check_hall
"""
R1/20260411T1800Z = """
install_cold[^] => check_hall
"""
R1/20260412T1300Z = """
install_cold[^] => check_hall
"""
R1/20260412T1400Z = """
install_cold[^] => check_hall
"""
R1/20260412T1500Z = """
install_cold[^] => check_hall
"""
R1/20260412T1600Z = """
install_cold[^] => check_hall
"""
R1/20260412T1700Z = """
install_cold[^] => check_hall
"""
For the real example, uncommenting the continue in the diff above allows the scheduler to proceed.
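The dated R1 blocks in the workflow above are highly repetitive, so for anyone who wants to vary the reproducer, here is a minimal sketch that regenerates them. The function names (recurrences, graph_lines) and their parameters are hypothetical helpers written for this illustration and are not part of Cylc. The defaults assume the pattern seen above: hourly points from 1300Z to 1800Z on each day from 20260407, stopping at 20260412T1700Z.

```python
from datetime import datetime, timedelta

# Cycle point format used in the example workflow, e.g. 20260407T1300Z.
FMT = "%Y%m%dT%H%MZ"


def recurrences(start_day: str = "20260407", n_days: int = 6,
                hours: range = range(13, 19),
                last: str = "20260412T1700Z") -> list[str]:
    """Return the cycle points for the dated R1 blocks, in order.

    Hypothetical helper: the defaults mirror the pattern in the
    example workflow, they are not derived from Cylc itself.
    """
    day0 = datetime.strptime(start_day, "%Y%m%d")
    end = datetime.strptime(last, FMT)
    points = []
    for day in range(n_days):
        for hour in hours:
            point = day0 + timedelta(days=day, hours=hour)
            # Skip anything after the last point used in the example.
            if point <= end:
                points.append(point.strftime(FMT))
    return points


def graph_lines(points: list[str]) -> str:
    """Format one R1/<point> graph block per cycle point."""
    lines = []
    for point in points:
        lines += [
            f'R1/{point} = """',
            "install_cold[^] => check_hall",
            '"""',
        ]
    return "\n".join(lines)


if __name__ == "__main__":
    print(graph_lines(recurrences()))
```

The output covers only the dated blocks. The leading R1 block (install_cold, install_cold_mirror, fcm_make, fcm_make_mirror) and the rest of the configuration still need to be copied from the workflow above.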