fix: prepend text message to content blocks in multimodal agent loop #1044

Open
LupoGrigi0 wants to merge 1 commit into RightNow-AI:main from LupoGrigi0:fix/multimodal-message-text-dropped
Conversation

@LupoGrigi0
Summary

Fixes #1043 — When image attachments are present, the agent loop drops the user's text message. The LLM receives images without any context about what the user asked.

Changes

File: crates/openfang-runtime/src/agent_loop.rs (both streaming and non-streaming paths)

The fix inserts the text message as a ContentBlock::Text at the front of the image blocks vector, so the LLM receives both text and images in a single multimodal turn.

Before (broken)

if let Some(blocks) = user_content_blocks {
    // blocks = images ONLY — text message silently dropped
    session.messages.push(Message::user_with_blocks(blocks));
} else {
    session.messages.push(Message::user(user_message));
}

After (fixed)

if let Some(mut blocks) = user_content_blocks {
    if !user_message.is_empty() {
        blocks.insert(0, ContentBlock::Text {
            text: user_message.to_string(),
            provider_metadata: None,
        });
    }
    session.messages.push(Message::user_with_blocks(blocks));
} else {
    session.messages.push(Message::user(user_message));
}
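
For readers who want to try the behavior in isolation, here is a minimal, self-contained sketch of the same logic. The `ContentBlock` and `Message` types below are simplified stand-ins, not the real definitions from openfang-runtime; the `Image` fields, the `Option<String>` type for `provider_metadata`, and the helper name `build_user_message` are assumptions made only for illustration.

```rust
// Simplified, hypothetical stand-ins for the crate's types. The real
// ContentBlock and Message in openfang-runtime likely have more variants
// and different field types.
#[derive(Debug, Clone, PartialEq)]
enum ContentBlock {
    Text {
        text: String,
        provider_metadata: Option<String>, // placeholder type (assumption)
    },
    Image {
        media_type: String, // hypothetical field
        data: String,       // hypothetical field (e.g. base64 payload)
    },
}

#[derive(Debug, Clone, PartialEq)]
enum Message {
    UserText(String),
    UserBlocks(Vec<ContentBlock>),
}

/// Hypothetical helper mirroring the fixed branch: when blocks are present,
/// put the user's text in front of them; otherwise send a plain text message.
fn build_user_message(
    user_message: &str,
    user_content_blocks: Option<Vec<ContentBlock>>,
) -> Message {
    match user_content_blocks {
        Some(mut blocks) => {
            // Skip the insert for an empty caption so no stray empty
            // Text block reaches the model.
            if !user_message.is_empty() {
                blocks.insert(
                    0,
                    ContentBlock::Text {
                        text: user_message.to_string(),
                        provider_metadata: None,
                    },
                );
            }
            Message::UserBlocks(blocks)
        }
        None => Message::UserText(user_message.to_string()),
    }
}
```

Calling `build_user_message("What color is this?", Some(vec![image]))` yields a two-block message with the text first, while an empty string with the same blocks yields the image blocks unchanged.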

Testing

| Test | Before | After |
| --- | --- | --- |
| 100x100 blue square + "What color?" | "I can't see the image" | "Blue" |
| 388KB screenshot + "Describe this" | Hallucinated response | Accurate description |
| 1.3MB bird illustration | 81K tokens consumed, hallucinated inability | "Stippled illustration of a bird" |
  • Tested with Qwen 3.5 Plus and Gemini 2.5 Flash via OpenRouter
  • Images up to 1.3MB (1.8MB base64) confirmed working
  • Direct OpenRouter API calls verified that both models support vision — the issue was purely in the agent loop's message construction
  • Fix applied to both non-streaming (run_agent_loop) and streaming (run_agent_loop_streaming) paths
  • Running in production across 3 OpenFang instances for the HACS coordination system

Submitted by Cairn-2001 (Cairn-2001@smoothcurves.nexus), OpenFang maintainer for HACS at smoothcurves.nexus

When a user sends a message with image attachments via the upload API,
the agent loop receives both `user_message` (text) and
`user_content_blocks` (images). Previously, when content blocks were
present, only the blocks were pushed to the session — the text message
was silently dropped. The LLM received the images but not the user's
question or context.

This fix prepends the text message as a ContentBlock::Text into the
blocks vector before pushing to the session, so the LLM sees both
the user's text AND any attached images in a single turn.

Both the non-streaming and streaming agent loop paths are fixed.

Before:
  User: "What color is this?" + [image of blue square]
  LLM receives: [image only, no text]
  Response: "I can't see the image directly"

After:
  User: "What color is this?" + [image of blue square]
  LLM receives: [text: "What color is this?", image: blue square]
  Response: "Blue"

Tested with Qwen 3.5 Plus and Gemini 2.5 Flash via OpenRouter.
Images up to 1.3MB confirmed working through the full pipeline.

Signed-off-by: Cairn-2001 <Cairn-2001@smoothcurves.nexus>
@jaberjaber23
Member

Clean, targeted fix for #1043. Inserting the text block at index 0 with the !user_message.is_empty() guard is the right call (it avoids a stray empty Text block when the channel bridge sends images without a caption).

Same rebase-needed note: CI isn't registered on this branch. Rebase on latest main to trigger checks and we'll merge once green.

Development

Successfully merging this pull request may close these issues.

Multimodal messages drop user text when image attachments are present