🔴 Required Information
Describe the Bug:
rubric_based_final_response_quality_v1 is difficult to use for factual evaluation of agents that use the built-in google_search tool.
The agents-cli evaluation guide recommends using only rubric_based_final_response_quality_v1 for agents with google_search or other model-internal tools, because those tools do not appear in the trajectory.
ref: https://google.github.io/agents-cli/guide/evaluation/#writing-eval-cases-and-choosing-metrics
However, the current ADK Python judge prompt for rubric_based_final_response_quality_v1 appears to require trusted evidence from response_steps / tool_response.
ref: https://github.com/google/adk-python/blob/v2.1.0/src/google/adk/evaluation/rubric_based_final_response_quality_v1.py
Since google_search is model-internal, its search results / grounding metadata do not appear as normal function-call tool responses. As a result, factual rubrics such as "the response describes recent AI news and includes announcement timing" are judged as unsupported, even when the final response was generated after Google Search grounding.
This makes the recommended metric usable mainly for indirect visible-text checks, not direct factual evaluation of google_search-grounded answers.
Steps to Reproduce:
- Install ADK with eval dependencies:
uv add "google-adk[eval]"
- Create an agent that uses the built-in google_search tool:
from google.adk.agents.llm_agent import Agent
from google.adk.apps import App
from google.adk.tools import google_search

search_agent = Agent(
    model="gemini-3.1-pro-preview",
    name="search_agent",
    description="A helpful assistant that can search Google.",
    instruction="""\
You are a helpful assistant with access to Google Search.
If the user asks a question that requires current information or facts, use the 'google_search' tool with English query.
Provide the answer clearly based on the search results and always cite your sources by including URLs from the search results.
""",
    tools=[google_search],
)
root_agent = search_agent
- Create an eval case such as:
{
  "eval_set_id": "search_agent_should_use_search_tool_eval_set",
  "eval_cases": [
    {
      "eval_id": "search_agent_should_use_search_tool",
      "conversation": [
        {
          "user_content": {
            "parts": [
              {
                "text": "最新のAIニュースにはどんなものがある?\n発表時期と合わせて教えて"
              }
            ],
            "role": "user"
          }
        }
      ]
    }
  ]
}
- Use rubric_based_final_response_quality_v1 with factual rubrics:
{
  "criteria": {
    "rubric_based_final_response_quality_v1": {
      "threshold": 0.6,
      "judge_model_options": {
        "judge_model": "gemini-3.1-pro-preview",
        "num_samples": 1
      },
      "rubrics": [
        {
          "rubric_id": "freshness",
          "rubric_content": {
            "text_property": "The response describes recent AI news and includes announcement timing."
          }
        },
        {
          "rubric_id": "grounded_with_sources",
          "rubric_content": {
            "text_property": "The response is grounded in Google Search results and includes source URLs."
          }
        }
      ]
    }
  }
}
- Run the ADK eval, for example with AgentEvaluator.
Expected Behavior:
rubric_based_final_response_quality_v1 should be usable for google_search agents in a way that can evaluate factual final-answer quality, or the documentation should clarify that factual rubrics are not appropriate unless grounding metadata is available to the evaluator.
Ideally, the evaluator would receive google_search grounding metadata / grounding chunks / source URLs as trusted evidence, or there would be a google_search-aware final response quality metric or prompt.
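As a rough sketch of what "grounding metadata as trusted evidence" could look like: the function below flattens grounding chunks into tool_response-like records that a judge prompt could be shown. The input dict layout (`grounding_chunks`, `web.uri`, `web.title`) is my understanding of the Gemini grounding metadata shape; `grounding_to_evidence` and the output record keys are hypothetical and are not an existing ADK API.

```python
from typing import Any


def grounding_to_evidence(grounding_metadata: dict[str, Any]) -> list[dict[str, Any]]:
    """Flatten grounding chunks into tool_response-like records (hypothetical format)."""
    evidence = []
    for chunk in grounding_metadata.get("grounding_chunks") or []:
        web = chunk.get("web") or {}
        uri = web.get("uri")
        if not uri:
            continue  # skip chunks that carry no web source URL
        evidence.append(
            {
                "tool_name": "google_search",
                "tool_response": {"title": web.get("title", ""), "uri": uri},
            }
        )
    return evidence


# Illustrative input only; the values are made up.
sample = {
    "web_search_queries": ["latest AI news"],
    "grounding_chunks": [
        {"web": {"uri": "https://example.com/a", "title": "Example A"}},
        {"web": {"title": "Chunk without a URI"}},
    ],
}
print(grounding_to_evidence(sample))
```

Whether such records should be injected into response_steps or passed to the judge through a separate prompt section is a design question for the maintainers; the sketch only illustrates the data that is currently invisible to the evaluator.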
Observed Behavior:
The metric is recommended for google_search agents, but factual rubrics are hard to satisfy because the judge prompt asks the evaluator to rely on trusted evidence from response_steps / tool_response.
For built-in google_search, the search result evidence does not appear as normal tool responses. This causes factual rubrics to be judged as unsupported.
The only reliable workaround I found was to use indirect visible-text rubrics such as:
{
  "rubric_id": "visible_date_expression",
  "rubric_content": {
    "text_property": "The visible response text includes at least one date-like expression such as a four-digit year, a month, a date, an event date, or release timing. This property evaluates only visible text format, not factual correctness."
  }
}
This passes as a smoke test, but no longer directly evaluates whether the answer is factually grounded in Google Search results.
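For comparison, the same visible-text property can be approximated without an LLM judge. The sketch below is a deterministic check I would consider roughly equivalent to the workaround rubric; the patterns (four-digit year, English month name, Japanese 年/月/日 date parts) and the name `has_date_like_expression` are my own assumptions and are not part of ADK. Like the rubric, it checks format only, not factual correctness.

```python
import re

# Assumed notion of "date-like": a four-digit year, an English month name,
# or a number followed by a Japanese date suffix (年, 月, 日).
DATE_LIKE = re.compile(
    r"\b(?:19|20)\d{2}\b"
    r"|\b(?:January|February|March|April|May|June|July|August"
    r"|September|October|November|December)\b"
    r"|\d{1,4}\s*[年月日]"
)


def has_date_like_expression(text: str) -> bool:
    """Return True if the visible text contains at least one date-like expression."""
    return DATE_LIKE.search(text) is not None


print(has_date_like_expression("The model was announced in 2025."))
print(has_date_like_expression("発表は5月でした"))
print(has_date_like_expression("No timing information here."))
```

That this check is so easy to reproduce with a regular expression underlines the limitation: the workaround rubric verifies surface form, not grounding.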
Environment Details:
- ADK Library Version (pip show google-adk): 1.34.0
- Desktop OS: macOS
- Python Version (python -V): Python 3.13.12
Model Information:
- Are you using LiteLLM: No
- Which model is being used: gemini-3.1-pro-preview
🟡 Optional Information
Regression:
N/A. I do not know whether this worked in a previous ADK version.
Logs:
Example failure with a factual rubric:
Summary: `EvalStatus.FAILED` for Metric: `rubric_based_final_response_quality_v1`.
Expected threshold: `0.6`, actual value: `0.0`.
rubric_based_final_response_quality_v1 for None Failed. Expected 0.6, but got 0.0.
Screenshots / Video:
N/A
Additional Context:
The agents-cli evaluation guide says:
Agents with google_search or model-internal tools — use only rubric_based_final_response_quality_v1 (model-internal tools don't appear in trajectory).
That recommendation makes sense for avoiding tool_trajectory_avg_score.
The issue is that the current rubric_based_final_response_quality_v1 prompt appears to be designed for function-based tools whose outputs are available as tool_response evidence. For model-internal tools such as google_search, the judge cannot see the search result evidence through normal response_steps.
Minimal Reproduction Code:
See "Steps to Reproduce".
Full version: https://github.com/ftnext/agent-practice/tree/d7d0b3b26d453e27a4dec87a730faede8c4d5fb4/agentengine
How often has this issue occurred?:
- Always (100%) for factual rubrics that require verifying claims against Google Search evidence