Skip to content

Re-land remote cold start workflow headers#2224

Open
hansent wants to merge 3 commits intomainfrom
codex/reland-remote-cold-start-headers
Open

Re-land remote cold start workflow headers#2224
hansent wants to merge 3 commits intomainfrom
codex/reland-remote-cold-start-headers

Conversation

@hansent
Copy link
Copy Markdown
Collaborator

@hansent hansent commented Apr 10, 2026

Summary

  • re-land remote cold start model metadata aggregation in workflow response headers
  • restore the dedicated request_metrics module and its focused tests
  • fix the stale http_api.py middleware copy by using the shared request_metrics implementation

Root Cause

PR #2209 merged into main and was then reverted in #2222 because inference/core/interfaces/http/http_api.py still contained an older local copy of GCPServerlessMiddleware after the shared request_metrics.py extraction. That stale copy referenced symbols that were no longer imported there, which triggered the CI flake8 F821 undefined name failures.

Why

Workflow responses should include remote model IDs, cold start counts, load times, and detailed load metadata when remote execution happens. Without this reland, top-level workflow headers only reflect local execution and under-report remote cold starts.

Validation

  • python3.10 -m py_compile inference/core/interfaces/http/http_api.py inference/core/interfaces/http/request_metrics.py inference_sdk/http/utils/executors.py inference_sdk/config.py inference/core/managers/model_load_collector.py
  • /Users/hansent/.venv/bin/python -m pytest tests/inference_sdk/unit_tests/test_config.py tests/inference_sdk/unit_tests/http/utils/test_remote_processing_time_collection.py tests/inference/unit_tests/core/interfaces/http/test_remote_processing_time_middleware.py tests/inference/unit_tests/core/interfaces/http/test_model_response_headers.py

Context

This PR re-lands the intended behavior from #2209 after the revert in #2222, with the missing import/middleware cleanup included.

@hansent hansent marked this pull request as ready for review April 10, 2026 16:50
@hansent hansent requested a review from rafel-roboflow as a code owner April 14, 2026 14:40
@rafel-roboflow
Copy link
Copy Markdown
Contributor

rafel-roboflow commented Apr 16, 2026

Hello, Claude found two possible bugs, could you confirm if they are relevant?

Bug 1 — Single-response SDK calls miss new model-load headers

_collect_processing_time_from_response (client.py L118) only reads X-Processing-Time and calls collector.add(). It does not handle X-Model-Id, X-Model-Cold-Start, X-Model-Load-Time, or X-Model-Load-Details.

The new header parsing was added in _collect_remote_processing_times (executors.py L164), which is used by the batch/parallel sync path. But these three single-response methods still call the old function:

get_clip_text_embeddings → client.py L1373
clip_compare → client.py L1478
get_perception_encoder_text_embeddings → client.py L1590

Example: The inference server cold-starts the CLIP model and returns:

X-Processing-Time: 1.5
X-Model-Cold-Start: true
X-Model-Id: clip/ViT-B-32
X-Model-Load-Time: 3.2

_collect_processing_time_from_response records processing_time=1.5 but silently drops everything else. The collector ends up with _model_ids = {}, _cold_start_count = 0, _cold_start_total_load_time = 0.0. If the same request went through execute_requests_packages → _collect_remote_processing_times, all four fields would be populated.

These methods are called from workflow blocks (clip/v1.py L216, perception_encoder/v1.py L189), so cold-start metadata from remote CLIP/Perception Encoder text embeddings never surfaces in the outer response.

Bug 2 — Async SDK path drops all headers before any collection

make_request_async (executors.py L393) returns (response.status, response_data) inside an async with block — headers are destroyed when the context manager exits. make_parallel_requests_async (executors.py L354) then strips even the status code: return [r[1] for r in responses], yielding bare dict | bytes bodies.

Unlike its sync counterpart execute_requests_packages (which calls _collect_remote_processing_times at L81), execute_requests_packages_async (executors.py L304) has no call to any header-collection function. Headers are structurally gone before any collector could see them.

Affected call sites (all receive body-only, no headers):

  • infer_from_api_v0_async — client.py L671
  • infer_from_api_v1_async — client.py L824
  • OCR async — client.py L1226
  • SAM3 3D async — client.py L2013
  • YOLO World async — client.py L2541
  • Generic async inference — client.py L2889

Example: An async object-detection call triggers a cold start. The server returns X-Processing-Time: 0.8, X-Model-Cold-Start: true, X-Model-Id: my-project/1, X-Model-Load-Time: 5.1. The RemoteProcessingTimeCollector records nothing — all headers were discarded at make_request_async and further reduced to body-only at make_parallel_requests_async. The cold-start/model-load feature is effectively sync-only.

@hansent hansent changed the title [codex] Re-land remote cold start workflow headers Re-land remote cold start workflow headers Apr 16, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants