Intermittent test failures related to message broker #428

@ehpor

Description

We've had intermittent message failures in the message broker. At first, these were caused by race conditions where the subscription was started after the initial request had been submitted, meaning that the response could have been sent before the subscription was even active. Those bugs have now been fixed, but we're still seeing intermittent test failures.
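To make that (already fixed) ordering problem concrete, here is a minimal sketch. `ToyBroker` and `RequestSeesReply` are hypothetical names made up for illustration, not our broker API, and the sketch assumes a broker that does not retain messages for late subscribers.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical in-process broker: a message is delivered only to the
// subscribers that exist at publish time (nothing is retained).
class ToyBroker
{
public:
    using Callback = std::function<void(const std::string &)>;

    void Subscribe(const std::string &topic, Callback callback)
    {
        m_Subscribers[topic].push_back(std::move(callback));
    }

    // Returns the number of subscribers that received the message.
    std::size_t Publish(const std::string &topic, const std::string &message)
    {
        auto it = m_Subscribers.find(topic);
        if (it == m_Subscribers.end())
            return 0;
        for (auto &callback : it->second)
            callback(message);
        return it->second.size();
    }

private:
    std::map<std::string, std::vector<Callback>> m_Subscribers;
};

// Simulates a request whose reply is published immediately.
// Returns true if the requester saw the reply.
bool RequestSeesReply(bool subscribe_first)
{
    ToyBroker broker;
    bool got_reply = false;
    auto on_reply = [&got_reply](const std::string &) { got_reply = true; };

    if (subscribe_first)
        broker.Subscribe("reply", on_reply);

    // The "server" answers right away.
    std::size_t delivered = broker.Publish("reply", "pong");
    assert(delivered == (subscribe_first ? 1u : 0u));

    // Subscribing only now is too late: the reply is already gone.
    if (!subscribe_first)
        broker.Subscribe("reply", on_reply);

    return got_reply;
}
```

Calling `RequestSeesReply(true)` sees the reply, while `RequestSeesReply(false)` loses it. That is why the subscription has to be active before the request goes out.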

The only potential cause I can think of right now is a bug in the retry logic of the events. The events wait in retry cycles, and they might miss a wakeup while they are checking flags between waits. An example:

    inline void EventFutex::Wait(double timeout_in_sec, std::function<bool()> condition, void (*error_check)())
    {
        Timer timer;
        int expected = m_SharedState->m_Futex.load(std::memory_order_acquire);
        while (!condition())
        {
            // Wait for a maximum of 20ms to perform periodic error checking.
            double time_remaining = timeout_in_sec - timer.GetTime();
            double timeout_wait = std::min(0.020, time_remaining);
            if (timeout_wait <= 0)
            {
                // The timeout expired.
                throw std::runtime_error("Waiting time has expired.");
            }
            struct timespec timeout;
            timeout.tv_sec = static_cast<time_t>(timeout_wait);
            timeout.tv_nsec = 1'000'000'000 * (timeout_wait - static_cast<time_t>(timeout_wait));
            if (timeout.tv_nsec >= 1'000'000'000)
            {
                timeout.tv_sec += 1;
                timeout.tv_nsec -= 1'000'000'000;
            }
            if (futex_wait(&m_SharedState->m_Futex, expected, &timeout) < 0)
            {
                if (errno == EAGAIN)
                {
                    // The value was not equal to the expected value. This usually means that
                    // the futex was triggered in between us getting the expected value and
                    // the futex_wait() call. So we need to reset the expected value and check
                    // the condition before calling futex_wait() again.
                    expected = m_SharedState->m_Futex.load(std::memory_order_acquire);
                    continue;
                }
                if (errno == ETIMEDOUT)
                {
                    // The futex timed out. We should check the condition and futex_wait() again.
                    continue;
                }
                // Otherwise, an error occurred.
                throw std::runtime_error("Futex wait failed: " + std::to_string(errno));
            }
            if (error_check != nullptr)
                error_check();
        }
    }

This code waits for the entire duration of the timeout in chunks of 20ms. Each time it wakes up, it checks the Python exception flags and whether the condition has been satisfied (the wakeup could be spurious). While the thread is not sleeping inside the futex, the event might be triggered, and the thread could then go back to sleep without having noticed the notification. Note that in this case, futex_wait() only sleeps if the variable still equals the expected value, which closes this specific race condition, but there might be bugs in the other implementations of Event. Note that this is just a suspicion, not a confirmed bug.
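The expected-value check can be illustrated without futexes. The following is a minimal sketch, assuming a hypothetical `GenerationEvent` built on `std::condition_variable`; it is not one of our Event implementations. The waiter only sleeps while the counter still equals the value it saw earlier, so a signal that arrives between the snapshot and the wait is not lost.

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <mutex>

// Hypothetical event with a generation counter, mirroring the
// "sleep only if the value still equals `expected`" rule of futex_wait().
class GenerationEvent
{
public:
    // Read the current generation, like loading the futex word.
    int Snapshot() const
    {
        std::lock_guard<std::mutex> lock(m_Mutex);
        return m_Generation;
    }

    // Bump the generation and wake all waiters.
    void Signal()
    {
        {
            std::lock_guard<std::mutex> lock(m_Mutex);
            ++m_Generation;
        }
        m_Condition.notify_all();
    }

    // Returns true if the generation differs from `expected` before the
    // timeout. The predicate is evaluated under the lock before sleeping,
    // so a Signal() that happened after Snapshot() is seen immediately.
    bool WaitFor(int expected, std::chrono::milliseconds timeout)
    {
        assert(timeout.count() >= 0);
        std::unique_lock<std::mutex> lock(m_Mutex);
        return m_Condition.wait_for(lock, timeout,
                                    [this, expected] { return m_Generation != expected; });
    }

private:
    mutable std::mutex m_Mutex;
    std::condition_variable m_Condition;
    int m_Generation = 0;
};
```

An Event implementation that waits without comparing against a previously observed value (or without a predicate checked under the lock) would be the place to look for a lost wakeup.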

Scratch all of that. Even if there were a race condition, the condition is checked at least every 20ms, so the wait would not miss the message entirely; it would just notice it late, by up to one 20ms chunk. Now I don't have a clue.
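The bounded-delay argument can be sketched as follows. `WaitChunked` is a hypothetical stand-in for the loop above in which `std::this_thread::sleep_for` replaces `futex_wait()`, so every wakeup is "missed" on purpose. The condition is still re-evaluated once per chunk, which means the wait either notices it within about one chunk or throws at the overall timeout.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <functional>
#include <stdexcept>
#include <thread>

// Polls `condition` at least once per `chunk` until it holds or `timeout`
// expires. Returns the number of sleeps that were needed.
int WaitChunked(std::chrono::milliseconds timeout,
                std::chrono::milliseconds chunk,
                const std::function<bool()> &condition)
{
    using Clock = std::chrono::steady_clock;
    assert(chunk.count() > 0);

    const auto deadline = Clock::now() + timeout;
    int sleeps = 0;
    while (!condition())
    {
        const Clock::duration remaining = deadline - Clock::now();
        if (remaining <= Clock::duration::zero())
            throw std::runtime_error("Waiting time has expired.");

        // Stand-in for a timed futex_wait(): sleep for at most one chunk,
        // then loop around and re-check the condition.
        std::this_thread::sleep_for(std::min<Clock::duration>(chunk, remaining));
        ++sleeps;
    }
    return sleeps;
}
```

With a 20ms chunk, the worst case for a completely missed notification is roughly 20ms of extra latency, not a lost message, which matches the reasoning above.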

I'm opening this issue so we have one place to track these failures and identify their root cause.

Metadata

Assignees

No one assigned

    Labels

    bug (Something isn't working)
