Conversation
|
Hello, Claude found two possible bugs, could you confirm if they are relevant? Bug 1 — Single-response SDK calls miss new model-load headers _collect_processing_time_from_response (client.py L118) only reads X-Processing-Time and calls collector.add(). It does not handle X-Model-Id, X-Model-Cold-Start, X-Model-Load-Time, or X-Model-Load-Details. The new header parsing was added in _collect_remote_processing_times (executors.py L164), which is used by the batch/parallel sync path. But these three single-response methods still call the old function: Example: The inference server cold-starts the CLIP model and returns: _collect_processing_time_from_response records processing_time=1.5 but silently drops everything else. The collector ends up with _model_ids = {}, _cold_start_count = 0, _cold_start_total_load_time = 0.0. If the same request went through execute_requests_packages → _collect_remote_processing_times, all four fields would be populated. These methods are called from workflow blocks (clip/v1.py L216, perception_encoder/v1.py L189), so cold-start metadata from remote CLIP/Perception Encoder text embeddings never surfaces in the outer response. Bug 2 — Async SDK path drops all headers before any collection make_request_async (executors.py L393) returns (response.status, response_data) inside an async with block — headers are destroyed when the context manager exits. make_parallel_requests_async (executors.py L354) then strips even the status code: return [r[1] for r in responses], yielding bare dict | bytes bodies. Unlike its sync counterpart execute_requests_packages (which calls _collect_remote_processing_times at L81), execute_requests_packages_async (executors.py L304) has no call to any header-collection function. Headers are structurally gone before any collector could see them. Affected call sites (all receive body-only, no headers):
Example: An async object-detection call triggers a cold start. The server returns X-Processing-Time: 0.8, X-Model-Cold-Start: true, X-Model-Id: my-project/1, X-Model-Load-Time: 5.1. The RemoteProcessingTimeCollector records nothing — all headers were discarded at make_request_async and further reduced to body-only at make_parallel_requests_async. The cold-start/model-load feature is effectively sync-only. |
Summary
request_metricsmodule and its focused testshttp_api.pymiddleware copy by using the sharedrequest_metricsimplementationRoot Cause
PR #2209 merged into
mainand was then reverted in #2222 becauseinference/core/interfaces/http/http_api.pystill contained an older local copy ofGCPServerlessMiddlewareafter the sharedrequest_metrics.pyextraction. That stale copy referenced symbols that were no longer imported there, which triggered the CIflake8F821 undefined namefailures.Why
Workflow responses should include remote model IDs, cold start counts, load times, and detailed load metadata when remote execution happens. Without this reland, top-level workflow headers only reflect local execution and under-report remote cold starts.
Validation
python3.10 -m py_compile inference/core/interfaces/http/http_api.py inference/core/interfaces/http/request_metrics.py inference_sdk/http/utils/executors.py inference_sdk/config.py inference/core/managers/model_load_collector.py/Users/hansent/.venv/bin/python -m pytest tests/inference_sdk/unit_tests/test_config.py tests/inference_sdk/unit_tests/http/utils/test_remote_processing_time_collection.py tests/inference/unit_tests/core/interfaces/http/test_remote_processing_time_middleware.py tests/inference/unit_tests/core/interfaces/http/test_model_response_headers.pyContext
This PR re-lands the intended behavior from #2209 after the revert in #2222, with the missing import/middleware cleanup included.