Skip to content

rubric_based_final_response_quality_v1 is hard to use for factual evaluation of google_search agents because its judge prompt requires tool_response evidence #5831

@ftnext

Description

@ftnext

🔴 Required Information

Describe the Bug:
rubric_based_final_response_quality_v1 is difficult to use for factual evaluation of agents that use the built-in google_search tool.

The agents-cli evaluation guide recommends using only rubric_based_final_response_quality_v1 for agents with google_search or model-internal tools, because those tools do not appear in trajectory.
ref: https://google.github.io/agents-cli/guide/evaluation/#writing-eval-cases-and-choosing-metrics

However, the current ADK Python judge prompt for rubric_based_final_response_quality_v1 appears to require trusted evidence from response_steps / tool_response.
ref: https://github.com/google/adk-python/blob/v2.1.0/src/google/adk/evaluation/rubric_based_final_response_quality_v1.py
Since google_search is model-internal, its search results / grounding metadata do not appear as normal function-call tool responses. As a result, factual rubrics such as "the response describes recent AI news and includes announcement timing" are judged as unsupported, even when the final response was generated after Google Search grounding.

This makes the recommended metric usable mainly for indirect visible-text checks, not direct factual evaluation of google_search-grounded answers.

Steps to Reproduce:

  1. Install ADK with eval dependencies:
    uv add "google-adk[eval]"
  2. Create an agent that uses the built-in google_search tool:
    from google.adk.agents.llm_agent import Agent
    from google.adk.apps import App
    from google.adk.tools import google_search
    
    search_agent = Agent(
        model="gemini-3.1-pro-preview",
        name="search_agent",
        description="A helpful assistant that can search Google.",
        instruction="""\
    You are a helpful assistant with access to Google Search.
    
    If the user asks a question that requires current information or facts, use the 'google_search' tool with English query.
    Provide the answer clearly based on the search results and always cite your sources by including URLs from the search results.
    """,
        tools=[google_search],
    )
    
    root_agent = search_agent
  3. Create an eval case such as:
    {
      "eval_set_id": "search_agent_should_use_search_tool_eval_set",
      "eval_cases": [
        {
          "eval_id": "search_agent_should_use_search_tool",
          "conversation": [
            {
              "user_content": {
                "parts": [
                  {
                    "text": "最新のAIニュースにはどんなものがある?\n発表時期と合わせて教えて"
                  }
                ],
                "role": "user"
              }
            }
          ]
        }
      ]
    }
  4. Use rubric_based_final_response_quality_v1 with factual rubrics:
    {
      "criteria": {
        "rubric_based_final_response_quality_v1": {
          "threshold": 0.6,
          "judge_model_options": {
            "judge_model": "gemini-3.1-pro-preview",
            "num_samples": 1
          },
          "rubrics": [
            {
              "rubric_id": "freshness",
              "rubric_content": {
                "text_property": "The response describes recent AI news and includes announcement timing."
              }
            },
            {
              "rubric_id": "grounded_with_sources",
              "rubric_content": {
                "text_property": "The response is grounded in Google Search results and includes source URLs."
              }
            }
          ]
        }
      }
    }
  5. Run the ADK eval, for example with AgentEvaluator.

Expected Behavior:
rubric_based_final_response_quality_v1 should be usable for google_search agents in a way that can evaluate factual final-answer quality, or the documentation should clarify that factual rubrics are not appropriate unless grounding metadata is available to the evaluator.

Ideally, the evaluator would receive google_search grounding metadata / grounding chunks / source URLs as trusted evidence, or there would be a google_search-aware final response quality metric or prompt.

Observed Behavior:
The metric is recommended for google_search agents, but factual rubrics are hard to satisfy because the judge prompt asks the evaluator to rely on trusted evidence from response_steps / tool_response.

For built-in google_search, the search result evidence does not appear as normal tool responses. This causes factual rubrics to be judged as unsupported.

The only reliable workaround I found was to use indirect visible-text rubrics such as:

{
  "rubric_id": "visible_date_expression",
  "rubric_content": {
    "text_property": "The visible response text includes at least one date-like expression such as a four-digit year, a month, a date, an event date, or release timing. This property evaluates only visible text format, not factual correctness."
  }
}

This passes as a smoke test, but no longer directly evaluates whether the answer is factually grounded in Google Search results.

Environment Details:

  • ADK Library Version (pip show google-adk): 1.34.0
  • Desktop OS: macOS
  • Python Version (python -V): Python 3.13.12

Model Information:

  • Are you using LiteLLM: No
  • Which model is being used: gemini-3.1-pro-preview

🟡 Optional Information

Regression:
N/A. I do not know whether this worked in a previous ADK version.

Logs:
Example failure with a factual rubric:

Summary: `EvalStatus.FAILED` for Metric: `rubric_based_final_response_quality_v1`.
Expected threshold: `0.6`, actual value: `0.0`.

rubric_based_final_response_quality_v1 for None Failed. Expected 0.6, but got 0.0.

Screenshots / Video:
N/A

Additional Context:
The agents-cli evaluation guide says:

Agents with google_search or model-internal tools — use only rubric_based_final_response_quality_v1 (model-internal tools don't appear in trajectory).

That recommendation makes sense for avoiding tool_trajectory_avg_score.

The issue is that the current rubric_based_final_response_quality_v1 prompt appears to be designed for function-based tools whose outputs are available as tool_response evidence. For model-internal tools such as google_search, the judge cannot see the search result evidence through normal response_steps.

Minimal Reproduction Code:

See "Steps to Reproduce".
Full version: https://github.com/ftnext/agent-practice/tree/d7d0b3b26d453e27a4dec87a730faede8c4d5fb4/agentengine

How often has this issue occurred?:

  • Always (100%) for factual rubrics that require verifying claims against Google Search evidence

Metadata

Metadata

Assignees

Labels

eval[Component] This issue is related to evaluation

Type

No fields configured for Bug.

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions