rubric_based_final_response_quality_v1 is hard to use for factual evaluation of google_search agents because its judge prompt requires tool_response evidence

## 🔴 Required Information

**Describe the Bug:**
`rubric_based_final_response_quality_v1` is difficult to use for factual evaluation of agents that use the built-in `google_search` tool.

The agents-cli evaluation guide recommends using only `rubric_based_final_response_quality_v1` for agents with `google_search` or model-internal tools, because those tools do not appear in trajectory.
ref: https://google.github.io/agents-cli/guide/evaluation/#writing-eval-cases-and-choosing-metrics

However, the current ADK Python judge prompt for `rubric_based_final_response_quality_v1` appears to require trusted evidence from `response_steps` / `tool_response`.
ref: https://github.com/google/adk-python/blob/v2.1.0/src/google/adk/evaluation/rubric_based_final_response_quality_v1.py
Since `google_search` is model-internal, its search results / grounding metadata do not appear as normal function-call tool responses. As a result, factual rubrics such as "the response describes recent AI news and includes announcement timing" are judged as unsupported, even when the final response was generated after Google Search grounding.

This makes the recommended metric usable mainly for indirect visible-text checks, not direct factual evaluation of `google_search`-grounded answers.

**Steps to Reproduce:**
1. Install ADK with eval dependencies:
   ```bash
   uv add "google-adk[eval]"
   ```
2. Create an agent that uses the built-in `google_search` tool:
   ```python
   from google.adk.agents.llm_agent import Agent
   from google.adk.apps import App
   from google.adk.tools import google_search

   search_agent = Agent(
       model="gemini-3.1-pro-preview",
       name="search_agent",
       description="A helpful assistant that can search Google.",
       instruction="""\
   You are a helpful assistant with access to Google Search.

   If the user asks a question that requires current information or facts, use the 'google_search' tool with English query.
   Provide the answer clearly based on the search results and always cite your sources by including URLs from the search results.
   """,
       tools=[google_search],
   )

   root_agent = search_agent
   ```
3. Create an eval case such as:
   ```json
   {
     "eval_set_id": "search_agent_should_use_search_tool_eval_set",
     "eval_cases": [
       {
         "eval_id": "search_agent_should_use_search_tool",
         "conversation": [
           {
             "user_content": {
               "parts": [
                 {
                   "text": "最新のAIニュースにはどんなものがある？\n発表時期と合わせて教えて"
                 }
               ],
               "role": "user"
             }
           }
         ]
       }
     ]
   }
   ```
4. Use `rubric_based_final_response_quality_v1` with factual rubrics:
   ```json
   {
     "criteria": {
       "rubric_based_final_response_quality_v1": {
         "threshold": 0.6,
         "judge_model_options": {
           "judge_model": "gemini-3.1-pro-preview",
           "num_samples": 1
         },
         "rubrics": [
           {
             "rubric_id": "freshness",
             "rubric_content": {
               "text_property": "The response describes recent AI news and includes announcement timing."
             }
           },
           {
             "rubric_id": "grounded_with_sources",
             "rubric_content": {
               "text_property": "The response is grounded in Google Search results and includes source URLs."
             }
           }
         ]
       }
     }
   }
   ```
5. Run the ADK eval, for example with `AgentEvaluator`.

**Expected Behavior:**
`rubric_based_final_response_quality_v1` should be usable for `google_search` agents in a way that can evaluate factual final-answer quality, or the documentation should clarify that factual rubrics are not appropriate unless grounding metadata is available to the evaluator.

Ideally, the evaluator would receive `google_search` grounding metadata / grounding chunks / source URLs as trusted evidence, or there would be a `google_search`-aware final response quality metric or prompt.

**Observed Behavior:**
The metric is recommended for `google_search` agents, but factual rubrics are hard to satisfy because the judge prompt asks the evaluator to rely on trusted evidence from `response_steps` / `tool_response`.

For built-in `google_search`, the search result evidence does not appear as normal tool responses. This causes factual rubrics to be judged as unsupported.

The only reliable workaround I found was to use indirect visible-text rubrics such as:

```json
{
  "rubric_id": "visible_date_expression",
  "rubric_content": {
    "text_property": "The visible response text includes at least one date-like expression such as a four-digit year, a month, a date, an event date, or release timing. This property evaluates only visible text format, not factual correctness."
  }
}
```

This passes as a smoke test, but no longer directly evaluates whether the answer is factually grounded in Google Search results.

**Environment Details:**

 - ADK Library Version (pip show google-adk): `1.34.0`
 - Desktop OS: macOS
 - Python Version (python -V): `Python 3.13.12`

**Model Information:**

 - Are you using LiteLLM: No
 - Which model is being used: `gemini-3.1-pro-preview`

---

## 🟡 Optional Information
**Regression:**
N/A. I do not know whether this worked in a previous ADK version.

**Logs:**
Example failure with a factual rubric:

```text
Summary: `EvalStatus.FAILED` for Metric: `rubric_based_final_response_quality_v1`.
Expected threshold: `0.6`, actual value: `0.0`.

rubric_based_final_response_quality_v1 for None Failed. Expected 0.6, but got 0.0.
```

**Screenshots / Video:**
N/A

**Additional Context:**
The agents-cli evaluation guide says:

> Agents with `google_search` or model-internal tools — use only `rubric_based_final_response_quality_v1` (model-internal tools don't appear in trajectory).

That recommendation makes sense for avoiding `tool_trajectory_avg_score`.

The issue is that the current `rubric_based_final_response_quality_v1` prompt appears to be designed for function-based tools whose outputs are available as `tool_response` evidence. For model-internal tools such as `google_search`, the judge cannot see the search result evidence through normal `response_steps`.

**Minimal Reproduction Code:**

See "Steps to Reproduce".
Full version: https://github.com/ftnext/agent-practice/tree/d7d0b3b26d453e27a4dec87a730faede8c4d5fb4/agentengine

**How often has this issue occurred?:**

 - Always (100%) for factual rubrics that require verifying claims against Google Search evidence


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

rubric_based_final_response_quality_v1 is hard to use for factual evaluation of google_search agents because its judge prompt requires tool_response evidence #5831

🔴 Required Information

🟡 Optional Information

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

rubric_based_final_response_quality_v1 is hard to use for factual evaluation of google_search agents because its judge prompt requires tool_response evidence #5831

Description

🔴 Required Information

🟡 Optional Information

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions