data store: infinite loop in increment_graph_window? #7267

@oliver-sanders

Description

Spotted in the wild: a workflow stuck in an infinite loop, unable to continue, save progress to the DB, or shut down.

The workflow was getting stuck in increment_graph_window, inside the "for location in locations" loop, where it was operating on "blank" (empty-string) locations.

Reproducible Example

The exact example is extremely complex and involves a DB dump.

However, I was able to reduce the example to something which doesn't produce the infinite loop problem but does produce the "blank" location problem.

Apply this diff:

diff --git a/cylc/flow/data_store_mgr.py b/cylc/flow/data_store_mgr.py
index e5171760f..b6b894480 100644
--- a/cylc/flow/data_store_mgr.py
+++ b/cylc/flow/data_store_mgr.py
@@ -901,6 +901,7 @@ class DataStoreMgr:
             itask:
                 Active/Other task proxy, passed in with pool invocation.
         """
+        LOG.warning(f'# increment_graph_window({source_tokens.relative_id})')
 
         # common refrences
         active_id = source_tokens.id
@@ -1040,6 +1041,10 @@ class DataStoreMgr:
                 locations = ['']
             # Explore/walk locations
             for location in locations:
+                LOG.info(f'# {len(locations)} locations  -  {location}')
+                if not location:
+                    LOG.error('NULL LOCATION')
+                    # continue
                 walk_incomplete = True
                 if not location:
                     loc_nodes = {active_id}

And run this workflow:

[scheduler]
    UTC mode = True
    allow implicit tasks = True

[scheduling]
    initial cycle point = 20260407T1300Z
    final cycle point = 20270331T1800Z
    runahead limit = P5
    [[special tasks]]
        sequential = housekeep_cycle
        clock-trigger = check_hall(PT1H45M)
    [[queues]]
        [[[default]]]
            limit = 150
    [[graph]]
        R1 = """
                    install_cold
                    install_cold_mirror
                    fcm_make
                    fcm_make_mirror
                    """
        R1/20260407T1300Z = """
                    install_cold[^] => check_hall
                    """
        R1/20260407T1400Z = """
                    install_cold[^] => check_hall
                    """
        R1/20260407T1500Z = """
                    install_cold[^] => check_hall
                    """
        R1/20260407T1600Z = """
                    install_cold[^] => check_hall
                    """
        R1/20260407T1700Z = """
                    install_cold[^] => check_hall
                    """
        R1/20260407T1800Z = """
                    install_cold[^] => check_hall
                    """
        R1/20260408T1300Z = """
                    install_cold[^] => check_hall
                    """
        R1/20260408T1400Z = """
                    install_cold[^] => check_hall
                    """
        R1/20260408T1500Z = """
                    install_cold[^] => check_hall
                    """
        R1/20260408T1600Z = """
                    install_cold[^] => check_hall
                    """
        R1/20260408T1700Z = """
                    install_cold[^] => check_hall
                    """
        R1/20260408T1800Z = """
                    install_cold[^] => check_hall
                    """
        R1/20260409T1300Z = """
                    install_cold[^] => check_hall
                    """
        R1/20260409T1400Z = """
                    install_cold[^] => check_hall
                    """
        R1/20260409T1500Z = """
                    install_cold[^] => check_hall
                    """
        R1/20260409T1600Z = """
                    install_cold[^] => check_hall
                    """
        R1/20260409T1700Z = """
                    install_cold[^] => check_hall
                    """
        R1/20260409T1800Z = """
                    install_cold[^] => check_hall
                    """
        R1/20260410T1300Z = """
                    install_cold[^] => check_hall
                    """
        R1/20260410T1400Z = """
                    install_cold[^] => check_hall
                    """
        R1/20260410T1500Z = """
                    install_cold[^] => check_hall
                    """
        R1/20260410T1600Z = """
                    install_cold[^] => check_hall
                    """
        R1/20260410T1700Z = """
                    install_cold[^] => check_hall
                    """
        R1/20260410T1800Z = """
                    install_cold[^] => check_hall
                    """
        R1/20260411T1300Z = """
                    install_cold[^] => check_hall
                    """
        R1/20260411T1400Z = """
                    install_cold[^] => check_hall
                    """
        R1/20260411T1500Z = """
                    install_cold[^] => check_hall
                    """
        R1/20260411T1600Z = """
                    install_cold[^] => check_hall
                    """
        R1/20260411T1700Z = """
                    install_cold[^] => check_hall
                    """
        R1/20260411T1800Z = """
                    install_cold[^] => check_hall
                    """
        R1/20260412T1300Z = """
                    install_cold[^] => check_hall
                    """
        R1/20260412T1400Z = """
                    install_cold[^] => check_hall
                    """
        R1/20260412T1500Z = """
                    install_cold[^] => check_hall
                    """
        R1/20260412T1600Z = """
                    install_cold[^] => check_hall
                    """
        R1/20260412T1700Z = """
                    install_cold[^] => check_hall
                    """
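
The repeated R1 sections above follow a regular pattern (hourly points from 13:00 to 18:00 on each day, ending at 20260412T1700Z), so they can be generated rather than typed out. A minimal sketch, assuming plain Python run outside Cylc; the helper name r1_sections and the indent constants are illustrative and not part of Cylc:

```python
from datetime import datetime, timedelta

HEAD_INDENT = " " * 8    # indentation of the "R1/... =" lines in the workflow above
BODY_INDENT = " " * 20   # indentation of the graph string contents above
EDGE = "install_cold[^] => check_hall"


def r1_sections(first, last, hours):
    """Return config lines with one R1 section per listed hour per day.

    Only points between `first` and `last` (inclusive) are emitted.
    """
    lines = []
    day = first.replace(hour=0, minute=0)
    while day <= last:
        for hour in hours:
            point = day + timedelta(hours=hour)
            if first <= point <= last:
                stamp = point.strftime("%Y%m%dT%H%MZ")
                lines.append(f'{HEAD_INDENT}R1/{stamp} = """')
                lines.append(f"{BODY_INDENT}{EDGE}")
                lines.append(f'{BODY_INDENT}"""')
        day += timedelta(days=1)
    return lines


sections = r1_sections(
    first=datetime(2026, 4, 7, 13, 0),
    last=datetime(2026, 4, 12, 17, 0),
    hours=range(13, 19),
)
print("\n".join(sections))
```

The printed lines can be pasted under [[graph]] after the initial R1 section; changing `last` or `hours` gives a quick way to test how many sections are needed to trigger the "blank" location log messages.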

For the real example, uncommenting the continue in the diff above allows the scheduler to proceed.
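
To make the effect of that toggle concrete, here is a toy stand-in for the location loop. It is not the Cylc implementation: the function walk_locations, its skip_blank flag, and the placeholder location "loc1" are all hypothetical. The only behaviour taken from the diff context is that a blank location falls back to the active node.

```python
import logging

LOG = logging.getLogger("blank_location_sketch")


def walk_locations(locations, active_id, skip_blank=False):
    """Toy model of the loop; returns the start points that would be walked.

    NOT the real increment_graph_window logic, only an illustration.
    """
    starts = []
    for location in locations:
        if not location:
            LOG.error("NULL LOCATION")
            if skip_blank:
                # Equivalent to uncommenting the `continue` in the diff.
                continue
            # As in the diff context: a blank location starts from the active node.
            starts.append(active_id)
        else:
            # Assumption: a non-blank location is walked from itself.
            starts.append(location)
    return starts


# "loc1" is a made-up placeholder, not a real Cylc location string.
print(walk_locations(["", "loc1", ""], "active"))
print(walk_locations(["", "loc1", ""], "active", skip_blank=True))
print(walk_locations([""], "active", skip_blank=True))
```

In this toy model, skipping blank entries also skips the locations = [''] case shown in the diff context, leaving nothing to walk. That suggests the continue is a diagnostic workaround rather than a fix in itself.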

Labels: bug