πŸ”„ daily merge: master β†’ main 2026-04-08#763

Open
antfin-oss wants to merge 583 commits into main from
create-pull-request/patch-42e69df4d0

Conversation


@antfin-oss antfin-oss commented Feb 4, 2026

This Pull Request was created automatically to merge the latest changes from master into the main branch.

πŸ“… Created: 2026-04-08
πŸ”€ Merge direction: master β†’ main
πŸ€– Triggered by: Scheduled

Please review and merge if everything looks good.

aslonnie and others added 30 commits January 14, 2026 16:57
…ct#60146)

so that generic changes to ray python code will not trigger expensive
GPU tests

also separates out min-build as its own tag.

Signed-off-by: Lonnie Liu <[email protected]>
Uses existing auth, allows sourcing+testing crane via existing Bazel
libraries

Signed-off-by: andrew <[email protected]>
Co-authored-by: Lonnie Liu <[email protected]>
…#60155)

updating incorrect path for ray llm requirements file

Signed-off-by: elliot-barn <[email protected]>
## Description
Add redundant shuffle fusion rules by dropping the first shuffle (a pipeline sketch follows this list):
- Repartition -> Aggregate
- StreamingRepartition -> Repartition
- Repartition -> StreamingRepartition
- Sort -> Sort
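
A minimal, hedged sketch of a pipeline where the first rule could apply; the dataset and column names are illustrative and not taken from this PR:

```python
import ray

ds = ray.data.range(1_000)

# Repartition -> Aggregate: the optimizer can drop the redundant repartition,
# because the aggregation performs its own shuffle anyway.
result = ds.repartition(100).groupby("id").count()
print(result.take_all())
```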

## Related issues
None

## Additional information
None

---------

Signed-off-by: iamjustinhsu <[email protected]>
…project#60175)

Was recently debugging something related to this behavior and found
these logs useful.

---------

Signed-off-by: Edward Oakes <[email protected]>
## Why are these changes needed?

This PR adds gRPC-based inter-deployment communication for Ray Serve,
allowing deployments to communicate with each other using gRPC transport
instead of Ray actor calls. This can provide performance benefits in
certain scenarios.

### Key Changes

1. **gRPC Server on Replicas**: Each replica now starts a gRPC server
that can handle requests from other deployments.

2. **gRPC Replica Wrapper**: A new `gRPCReplicaWrapper` class handles
sending requests via gRPC and processing responses.

3. **Handle Options**: The `_by_reference` option on handles controls
whether to use Ray actor calls (`True`) or gRPC transport (`False`).

4. **New Environment Variables** (see the sketch after this list):
- `RAY_SERVE_USE_GRPC_BY_DEFAULT`: Master flag to enable gRPC transport
by default for all inter-deployment communication
- `RAY_SERVE_PROXY_USE_GRPC`: Controls whether the proxy uses gRPC
transport (defaults to the master flag value)
- `RAY_SERVE_GRPC_MAX_MESSAGE_SIZE`: Configures the maximum gRPC message
size (default: 2GB-1)
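
A rough sketch of how these knobs might be exercised; the environment variable names come from this PR, while the deployment code and the exact `_by_reference` call site are illustrative:

```python
import os

# Opt all inter-deployment traffic into gRPC transport by default.
os.environ["RAY_SERVE_USE_GRPC_BY_DEFAULT"] = "1"
# Allow larger gRPC payloads (here ~100 MB).
os.environ["RAY_SERVE_GRPC_MAX_MESSAGE_SIZE"] = str(100 * 1024 * 1024)

from ray import serve


@serve.deployment
class Downstream:
    def __call__(self, x: int) -> int:
        return x * 2


@serve.deployment
class Upstream:
    def __init__(self, downstream):
        # _by_reference=False selects gRPC transport for this handle;
        # True falls back to Ray actor calls.
        self._handle = downstream.options(_by_reference=False)

    async def __call__(self, x: int) -> int:
        return await self._handle.remote(x)


app = Upstream.bind(Downstream.bind())
serve.run(app)
```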

## Related issue number

N/A

## Checks

- [x] I've signed all my commits
- [x] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for
https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I
added a
        temporary testing hook, I've added it under the API Reference
        (Experimental) page.
- [x] I've made sure the tests are passing. Note that there might be a
few flaky tests.

## Test Plan

- `python/ray/serve/tests/test_grpc_e2e.py`
- `python/ray/serve/tests/test_grpc_replica_wrapper.py`
- `python/ray/serve/tests/unit/test_grpc_replica_result.py`

## Benchmarks
Script available
[here](https://gist.github.com/eicherseiji/02808c32d0e377803888671da64524d1)

Results show throughput/latency improvements w/ gRPC for message size <
~1MB.

<img width="2229" height="740" alt="benchmark_plot"
src="https://github.com/user-attachments/assets/e7e25f94-00b4-434d-9eff-10cd36047356"
/>

```
  ⎿  ==============================================================================
       gRPC vs Plasma Benchmark Results
     ==============================================================================

       Payload  Metric               Plasma         gRPC          Ξ”     Winner
       ----------------------------------------------------------------------
       1 KB     Latency p50          2.63ms       1.89ms       +28%       gRPC
                Chain p50            4.11ms       3.02ms       +26%       gRPC
                Throughput            160/s        190/s       +16%       gRPC
       ----------------------------------------------------------------------
       10 KB    Latency p50          2.68ms       1.68ms       +37%       gRPC
                Chain p50            3.91ms       2.94ms       +25%       gRPC
                Throughput            167/s        185/s       +10%       gRPC
       ----------------------------------------------------------------------
       100 KB   Latency p50          2.74ms       2.02ms       +26%       gRPC
                Chain p50            4.28ms       3.06ms       +28%       gRPC
                Throughput            157/s        182/s       +13%       gRPC
       ----------------------------------------------------------------------
       500 KB   Latency p50          5.78ms       3.52ms       +39%       gRPC
                Chain p50            5.65ms       4.82ms       +15%       gRPC
                Throughput            114/s        144/s       +21%       gRPC
       ----------------------------------------------------------------------
       1 MB     Latency p50          6.31ms       5.18ms       +18%       gRPC
                Chain p50            5.96ms       6.20ms        -4%     Plasma
                Throughput            130/s        165/s       +21%       gRPC
       ----------------------------------------------------------------------
       2 MB     Latency p50          8.82ms       9.57ms        -9%     Plasma
                Chain p50            7.20ms      10.69ms       -48%     Plasma
                Throughput            123/s        106/s       -16%     Plasma
       ----------------------------------------------------------------------
       5 MB     Latency p50         15.20ms      23.72ms       -56%     Plasma
                Chain p50            8.90ms      23.25ms      -161%     Plasma
                Throughput             78/s         49/s       -58%     Plasma
       ----------------------------------------------------------------------
       10 MB    Latency p50         25.02ms      34.34ms       -37%     Plasma
                Chain p50            9.72ms      34.71ms      -257%     Plasma
                Throughput             38/s         31/s       -24%     Plasma
       ----------------------------------------------------------------------
```

Compared to parity implementation:
```
==============================================================================
  gRPC Transport: OSS 3.0.0.dev0 vs Parity 2.53.0
==============================================================================

  Payload  Metric              OSS 3.0.0.dev0  Parity 2.53.0
  ----------------------------------------------------------------------
  1 KB     Latency p50              1.82ms           2.27ms
           Chain p50                2.95ms           2.99ms
           Throughput                268/s            272/s
  ----------------------------------------------------------------------
  10 KB    Latency p50              1.82ms           2.05ms
           Chain p50                2.85ms           2.80ms
           Throughput                246/s            293/s
  ----------------------------------------------------------------------
  100 KB   Latency p50              2.04ms           2.35ms
           Chain p50                3.27ms           3.12ms
           Throughput                262/s            257/s
  ----------------------------------------------------------------------
  500 KB   Latency p50              3.67ms           3.78ms
           Chain p50                5.77ms           4.91ms
           Throughput                186/s            192/s
  ----------------------------------------------------------------------
  1 MB     Latency p50              4.99ms           5.39ms
           Chain p50                5.95ms           6.56ms
           Throughput                177/s            156/s
  ----------------------------------------------------------------------
  2 MB     Latency p50              7.91ms           7.37ms
           Chain p50                8.26ms          12.16ms
           Throughput                117/s            129/s
  ----------------------------------------------------------------------
  5 MB     Latency p50             17.86ms          19.53ms
           Chain p50               22.65ms          23.85ms
           Throughput                 87/s             54/s
  ----------------------------------------------------------------------
  10 MB    Latency p50             23.79ms          27.78ms
           Chain p50               35.67ms          31.06ms
           Throughput                 48/s             27/s
  ----------------------------------------------------------------------

  Cluster: 2 worker nodes (48 CPU, 4 GPU, 192GB RAM, 54.5GB object store each)
  3-trial average
```
Note: OSS 3.0.0.dev0 includes a token auth optimization (ray-project#59500) that
reduces per-RPC overhead by caching auth tokens and avoiding object
construction on each call. This likely explains the improved latency and
throughput at larger payload sizes.

---------

Signed-off-by: Seiji Eicher <[email protected]>
…ject#59788)

> Thank you for contributing to Ray! πŸš€
> Please review the [Ray Contribution
Guide](https://docs.ray.io/en/master/ray-contribute/getting-involved.html)
before opening a pull request.

> ⚠️ Remove these instructions before submitting your PR.

> πŸ’‘ Tip: Mark as draft if you want early feedback, or ready for review
when it's complete.

## Description
> Briefly describe what this PR accomplishes and why it's needed.


### [Data] Disable ConcurrencyCap Backpressure policy by default

- With DownstreamCapacityBackpressurePolicy now enabled by default,
disable ConcurrencyCapBackpressurePolicy by default.

## Related issues
> Link related issues: "Fixes ray-project#1234", "Closes ray-project#1234", or "Related to
ray-project#1234".

## Additional information
> Optional: Add implementation details, API changes, usage examples,
screenshots, etc.

Signed-off-by: Srinath Krishnamachari <[email protected]>
…roject#60179)

`ray-docs` has become a bottleneck for review. No longer requiring their
approval for library documentation changes, but leaving it as a
catch-all for other docs changes.

Flyby: removing code ownership for removed Ray Workflows library
directories.

Signed-off-by: Edward Oakes <[email protected]>
…stage config refactor (ray-project#59214)

Signed-off-by: Nikhil Ghosh <[email protected]>
Signed-off-by: Nikhil G <[email protected]>
## Description
In ray-project#59544, we added many optimisations for the IMPALA / APPO learner.
One of these optimisations is to minimise the time a thread spends blocked on the queue.
`queue.get_nowait()` raises an exception if the queue has no data, so we wrap the call in a try/except. Currently, when this exception occurs we log a warning, but reviewing the training logs shows that this causes a massive amount of spam.

This PR therefore removes the warning and adds a comment explaining why we use a `pass` instead.
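
A minimal sketch of the resulting pattern; the function and queue names are illustrative, not the exact learner code:

```python
import queue


def poll_once(q: queue.Queue):
    """Fetch one item without blocking; return None if the queue is empty."""
    try:
        return q.get_nowait()
    except queue.Empty:
        # An empty queue is expected on most polls; logging a warning here
        # floods the training logs, so we silently skip instead.
        return None
```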

---------

Signed-off-by: Mark Towers <[email protected]>
Co-authored-by: Mark Towers <[email protected]>
Before:

<img width="616" height="210" alt="Screenshot 2026-01-07 at 11 03 26β€―AM"
src="https://github.com/user-attachments/assets/139496f4-136d-4ade-9ec5-ad788cc4f8f9"
/>

After:

<img width="5000" height="2812" alt="ray-job-diagram"
src="https://github.com/user-attachments/assets/e9c653be-3191-458d-9fa1-30329c395496"
/>

---------

Signed-off-by: Edward Oakes <[email protected]>
> Thank you for contributing to Ray! πŸš€
> Please review the [Ray Contribution
Guide](https://docs.ray.io/en/master/ray-contribute/getting-involved.html)
before opening a pull request.

> ⚠️ Remove these instructions before submitting your PR.

> πŸ’‘ Tip: Mark as draft if you want early feedback, or ready for review
when it's complete.

## Description
> Briefly describe what this PR accomplishes and why it's needed.

## Related issues
> Link related issues: "Fixes ray-project#1234", "Closes ray-project#1234", or "Related to
ray-project#1234".

## Additional information
> Optional: Add implementation details, API changes, usage examples,
screenshots, etc.

---------

Signed-off-by: akshay-anyscale <[email protected]>
…ay-project#60176)

No behavior changes, pure refactoring to retain my sanity.

- No longer inherit from `SchedulingQueue` in `NormalTaskExecutionQueue`
- Remove unnecessary methods from `SchedulingQueue` interface
- `NormalSchedulingQueue` -> `NormalTaskExecutionQueue`
- `ActorSchedulingQueue` -> `OrderedActorTaskExecutionQueue`
- `OutOfOrderActorSchedulingQueue` -> `UnorderedActorTaskExecutionQueue`
- `SchedulingQueue` -> `ActorTaskExecutionQueueInterface`
- A few method/field renamings.

---------

Signed-off-by: Edward Oakes <[email protected]>
Co-authored-by: Ibrahim Rabbani <[email protected]>
…re flag, GCS, and Raylet code. (ray-project#59979)

This is 1/N in a series of PRs to remove Centralized Actor Scheduling by
the GCS (introduced in ray-project#15943). The feature is off by default and no
longer in use or supported.

In this PR, I remove the feature flag to turn the feature on and remove
related code and tests in the GCS and the Raylet.

---------

Signed-off-by: irabbani <[email protected]>
Signed-off-by: Ibrahim Rabbani <[email protected]>
…registry example (ray-project#60071)

Noticed ray-project#59917 but it didn't fix
it
- Linking to notebook instead of README.md
- Removing notebook from exclude_patterns

The notebook should be the single source of truth since it's what's tested and validated, so we should link to it rather than the README.md. The README.md is generated from the notebook (jupyter nbconvert) and exists only for display in the console when converting the example into an Anyscale template.

Also fixing the 404 error of the MLflow registry example by linking to the proper doc.

---------

Signed-off-by: Aydin Abiar <[email protected]>
Signed-off-by: Aydin Abiar <[email protected]>
Co-authored-by: Aydin Abiar <[email protected]>
…y-project#60161)

so that it is clear that these functions are not meant to be used by
other files

Signed-off-by: Lonnie Liu <[email protected]>
not used anywhere any more

Signed-off-by: Lonnie Liu <[email protected]>
…oject#57555)

<!-- Thank you for your contribution! Please review
https://github.com/ray-project/ray/blob/master/CONTRIBUTING.rst before
opening a pull request. -->

<!-- Please add a reviewer to the assignee section when you create a PR.
If you don't have the access to it, we will shortly find a reviewer and
assign them to your PR. -->

## Why are these changes needed?
Since `0.26.0`, `uvicorn`
[changed](https://uvicorn.dev/release-notes/?utm_source=chatgpt.com#0260-january-16-2024)
how it processes `root_path`. To support all `uvicorn` versions, we now inject `root_path` into the ASGI app instead of passing it to `uvicorn.Config` for versions `0.26.0` and later.
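
A minimal sketch of the injection approach, not the exact Serve code:

```python
# Wrap the ASGI app so that root_path is set on every request scope,
# instead of passing root_path= to uvicorn.Config.
def with_root_path(app, root_path: str):
    async def wrapper(scope, receive, send):
        if scope["type"] in ("http", "websocket"):
            scope = dict(scope)
            scope["root_path"] = root_path
        return await app(scope, receive, send)

    return wrapper
```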

Before the change:
```
# uvicorn==0.22.0
pytest -s -v python/ray/serve/tests/test_standalone.py::test_http_root_path - pass
# uvicorn==0.40.0 - latest
pytest -s -v python/ray/serve/tests/test_standalone.py::test_http_root_path - failed
# FAILED python/ray/serve/tests/test_standalone.py::test_http_root_path - assert 404 == 200
```

After the change:
```
# uvicorn==0.22.0
pytest -s -v python/ray/serve/tests/test_standalone.py::test_http_root_path - pass
# uvicorn==0.40.0 - latest
pytest -s -v python/ray/serve/tests/test_standalone.py::test_http_root_path - pass
```

<!-- Please give a short summary of the change and the problem this
solves. -->

## Related issue number

<!-- For example: "Closes ray-project#1234" -->

Closes ray-project#55776.

## Checks

- [x] I've signed off every commit(by using the -s flag, i.e., `git
commit -s`) in this PR.
- [x] I've run pre-commit jobs to lint the changes in this PR.
([pre-commit
setup](https://docs.ray.io/en/latest/ray-contribute/getting-involved.html#lint-and-formatting))
- [x] I've included any doc changes needed for
https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I
added a
method in Tune, I've added it in `doc/source/tune/api/` under the
           corresponding `.rst` file.
- [x] I've made sure the tests are passing. Note that there might be a
few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [x] Unit tests
   - [ ] Release tests
   - [ ] This PR is not tested :(

---------

Signed-off-by: axreldable <[email protected]>
…nd Model composition for recsys examples (ray-project#59166)

## Description
Adding two examples for Ray Serve as part of our workload based series:
- Model multiplexing with forecasting models βœ…
- Model composition for recsys (recommendation systems) βœ…

Will later be published as templates in the anyscale console

Lots of added/modified files but the contents to review are under the
`content/` folder, everything else is related to the publishing workflow
in ray docs + setting up testing in the CI

author: @Aydin-ab

---------

Signed-off-by: Aydin Abiar <[email protected]>
Signed-off-by: Aydin Abiar <[email protected]>
Co-authored-by: Aydin Abiar <[email protected]>
Minor cleanups on the task execution path as I muddle through it. No
behavior changes.

- `DependencyWaiter` -> `ActorTaskExecutionArgWaiter`. Previously, I found myself continually confused about whether (1) this was only for actor tasks or also normal tasks and (2) whether it was on the submission or execution path.
- Added `ActorTaskExecutionArgWaiterInterface` instead of having an
`Impl`.
- Added header comments for what the `ActorTaskExecutionArgWaiter` is
doing.
- `HandleTask` -> `QueueTaskForExecution`.

---------

Signed-off-by: Edward Oakes <[email protected]>
…ect#60149)

## Description
The `python redis test` is failing with high probability because `test_network_partial_failures` catches a "subprocess is still running" resource warning when it expects no warnings to be thrown.

Investigation showed that the existing kill-redis-server logic during test cleanup does not wait for the redis server process to die before moving on to the next test, causing the resource warning observed above. This PR adds a wait to ensure the redis server is fully cleaned up before the next test starts.
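
A sketch of the cleanup pattern under these assumptions (the process handle and helper names are illustrative):

```python
import subprocess


def stop_redis(redis_proc: subprocess.Popen, timeout: float = 10.0) -> None:
    """Kill the redis-server subprocess and wait for it to fully exit, so the
    next test does not see a "subprocess is still running" resource warning."""
    redis_proc.kill()
    redis_proc.wait(timeout=timeout)
```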

## Related issues
Fixes failing `test_network_partial_failures` in CI automated tests.

## Additional information
Example of the fix passing `test_network_partial_failures` in post
merge: https://buildkite.com/ray-project/postmerge/builds/15416#_

---------

Signed-off-by: davik <[email protected]>
Co-authored-by: davik <[email protected]>
…project#60173)

## Description


Due to Python's operator precedence (+ binds tighter than if-else):
```python
result = [1, 2, 3] + [4] if False else []
print(f"result = {result}")
#result = []
```

As a result, `PSUTIL_PROCESS_ATTRS` on Windows evaluates to an empty list, when the intention was only to exclude `num_fds`. This PR fixes it.
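
A sketch of the corrected expression; the base attribute list is illustrative, and the point is the parentheses around the conditional:

```python
import sys

BASE_ATTRS = ["pid", "name", "cpu_percent"]  # illustrative subset

# Buggy: parses as (BASE_ATTRS + ["num_fds"]) if <cond> else [],
# so the whole list collapses to [] on Windows.
attrs_buggy = BASE_ATTRS + ["num_fds"] if sys.platform != "win32" else []

# Fixed: only the optional element is conditional.
attrs_fixed = BASE_ATTRS + (["num_fds"] if sys.platform != "win32" else [])
```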




## Related issues
> Link related issues: "Fixes ray-project#1234", "Closes ray-project#1234", or "Related to
ray-project#1234".

## Additional information
> Optional: Add implementation details, API changes, usage examples,
screenshots, etc.

Signed-off-by: yicheng <[email protected]>
Co-authored-by: yicheng <[email protected]>
…c callback during shutdown (ray-project#60048)

## Description
When a Ray worker process shuts down (e.g., during `ray.shutdown()` or
node termination), the OpenTelemetry `PeriodicExportingMetricReader`'s
background thread may still be invoking the gauge callback
(`_DoubleGaugeCallback`), which then accesses already-destroyed member
data, resulting in a use-after-free crash.

The error message:
```
(bundle_reservation_check_func pid=1543823) pure virtual method called
(bundle_reservation_check_func pid=1543823) __cxa_deleted_virtual
```


I looked further into this, and ideally, at the OpenTelemetry code
level, shutdown should be handled correctly.

[PeriodicExportingMetricReader's
shutdown](https://github.com/open-telemetry/opentelemetry-cpp/blob/f33dcc07c56c7e3b18fd18e13986f0eda965d116/sdk/src/metrics/export/periodic_exporting_metric_reader.cc#L292-L299)
waits for `worker_thread_` to finish.
```cpp
bool PeriodicExportingMetricReader::OnShutDown(std::chrono::microseconds timeout) noexcept
{
  if (worker_thread_.joinable())
  {
    cv_.notify_all();
    worker_thread_.join();
  }
  return exporter_->Shutdown(timeout);
}
```

And the callback (run by `worker_thread_`) sits in a [while (IsShutdown() !=
true)](https://github.com/open-telemetry/opentelemetry-cpp/blob/main/sdk/src/metrics/export/periodic_exporting_metric_reader.cc#L147)
loop.

Therefore, there should be no use-after-free race condition at the
OpenTelemetry code level, and it should be safe to call
`meter_provider_->Shutdown()`.

However, the issue is that the last callback appears to access member
data that has already been destroyed during ForceFlush, which is called
before Shutdown. This member data belongs to the OpenTelemetry SDK
itself.

The more I look into it, the more it feels like this is actually a bug
in the OpenTelemetry SDK.

Digging further, I found this: [[SDK] Use shared_ptr internally for AttributesProcessor to prevent use-after-free](open-telemetry/opentelemetry-cpp#3457)

Which is exactly the issue I encountered!

This PR upgrades the OpenTelemetry C++ SDK version to include this fix.



## Related issues
> Link related issues: "Fixes ray-project#1234", "Closes ray-project#1234", or "Related to
ray-project#1234".

## Additional information
It is quite easy to reproduce. For example, manually run `test_placement_group_reschedule_node_dead` in `python/ray/autoscaler/v2/tests/test_e2e.py`:
```
(docs) ubuntu@devbox:~/ray$ pkill -9 -f raylet 2>/dev/null || true; pkill -9 -f gcs_server 2>/dev/null || true; ray stop --force 2>/dev/null || true; sleep 2
Did not find any active Ray processes.
(docs) ubuntu@devbox:~/ray$ timeout 180 python -m pytest python/ray/autoscaler/v2/tests/test_e2e.py::test_placement_group_reschedule_node_dead -xvs 2>&1 | tee /tmp/test_otel.txt; echo "EXIT CODE: $?"

............

__cxa_deleted_virtual
opentelemetry::v1::sdk::metrics::FilteredOrderedAttributeMap::FilteredOrderedAttributeMap()::{lambda()#1}::operator()()
opentelemetry::v1::nostd::function_ref<>::BindTo<>()::{lambda()#1}::operator()()
opentelemetry::v1::sdk::metrics::ObserverResultT<>::Observe()
opentelemetry::v1::metrics::ObserverResultT<>::Observe<>()
ray::observability::OpenTelemetryMetricRecorder::CollectGaugeMetricValues()
(anonymous namespace)::_DoubleGaugeCallback()
opentelemetry::v1::sdk::metrics::ObservableRegistry::Observe()
opentelemetry::v1::sdk::metrics::Meter::Collect()
opentelemetry::v1::sdk::metrics::MetricCollector::Produce()
opentelemetry::v1::sdk::metrics::MetricReader::Collect()
opentelemetry::v1::sdk::metrics::PeriodicExportingMetricReader::CollectAndExportOnce()
std::thread::_State_impl<>::_M_run()


............
```

After this PR, there is no such error message:
```
(docs) ubuntu@devbox:~/ray$ timeout 180 python -m pytest python/ray/autoscaler/v2/tests/test_e2e.py::test_placement_group_reschedule_node_dead -xvs 2>&1 | tee /tmp/test_otel.txt; echo "EXIT CODE: $?"
============================= test session starts ==============================
platform linux -- Python 3.12.12, pytest-9.0.2, pluggy-1.6.0 -- /home/ubuntu/.conda/envs/docs/bin/python
cachedir: .pytest_cache
rootdir: /home/ubuntu/ray
configfile: pytest.ini
plugins: asyncio-1.3.0, anyio-4.11.0
asyncio: mode=Mode.STRICT, debug=False, asyncio_default_fixture_loop_scope=None, asyncio_default_test_loop_scope=function
collecting ... collected 2 items

python/ray/autoscaler/v2/tests/test_e2e.py::test_placement_group_reschedule_node_dead[v1] Did not find any active Ray processes.
Usage stats collection is enabled. To disable this, add `--disable-usage-stats` to the command that starts the cluster, or run the following command: `ray disable-usage-stats` before starting the cluster. See https://docs.ray.io/en/master/cluster/usage-stats.html for more details.

Local node IP: 172.31.5.171

--------------------
Ray runtime started.
--------------------

Next steps
  To add another node to this Ray cluster, run
    ray start --address='172.31.5.171:6379'
  
  To connect to this Ray cluster:
    import ray
    ray.init()
  
  To submit a Ray job using the Ray Jobs CLI:
    RAY_API_SERVER_ADDRESS='http://127.0.0.1:8265' ray job submit --working-dir . -- python my_script.py
  
  See https://docs.ray.io/en/latest/cluster/running-applications/job-submission/index.html 
  for more information on submitting Ray jobs to the Ray cluster.
  
  To terminate the Ray runtime, run
    ray stop
  
  To view the status of the cluster, use
    ray status
  
  To monitor and debug Ray, view the dashboard at 
    127.0.0.1:8265
  
  If connection to the dashboard fails, check your firewall settings and network configuration.
2026-01-12 12:30:00,347 INFO worker.py:1826 -- Connecting to existing Ray cluster at address: 172.31.5.171:6379...
2026-01-12 12:30:00,385 INFO worker.py:2006 -- Connected to Ray cluster. View the dashboard at http://127.0.0.1:8265 
(autoscaler +11s) Tip: use `ray status` to view detailed cluster status. To disable these messages, set RAY_SCHEDULER_EVENTS=0.
(autoscaler +11s) Resized to 0 CPUs.
(autoscaler +12s) Resized to 0 CPUs.
(autoscaler +14s) Resized to 0 CPUs.
(autoscaler +15s) Resized to 0 CPUs.
(autoscaler +15s) Adding 1 node(s) of type type-1.
(autoscaler +15s) Adding 1 node(s) of type type-2.
(autoscaler +15s) Adding 1 node(s) of type type-3.
(autoscaler +15s) Adding 1 node(s) of type type-1.
(autoscaler +15s) Adding 1 node(s) of type type-2.
(autoscaler +15s) Adding 1 node(s) of type type-3.
(autoscaler +15s) Adding 1 node(s) of type type-1.
(autoscaler +15s) Adding 1 node(s) of type type-2.
(autoscaler +15s) Adding 1 node(s) of type type-3.
(autoscaler +15s) Adding 1 node(s) of type type-1.
(autoscaler +15s) Adding 1 node(s) of type type-2.
(autoscaler +15s) Adding 1 node(s) of type type-3.
(autoscaler +16s) Resized to 0 CPUs.
(autoscaler +16s) Adding 1 node(s) of type type-1.
(autoscaler +16s) Adding 1 node(s) of type type-2.
(autoscaler +16s) Adding 1 node(s) of type type-3.
Killing pids 1566233
(raylet) Raylet is terminated. Termination is unexpected. Possible reasons include: (1) SIGKILL by the user or system OOM killer, (2) Invalid memory access from Raylet causing SIGSEGV or SIGBUS, (3) Other termination signals. Last 20 lines of the Raylet logs:
    [state-dump]        ray::rpc::InternalKVGcsService.grpc_client.GetInternalConfig.OnReplyReceived - 1 total (0 active), Execution time: mean = 880.39ms, total = 880.39ms, Queueing time: mean = 0.06ms, max = 0.06ms, min = 0.06ms, total = 0.06ms
    [state-dump]        ClusterResourceManager.ResetRemoteNodeView - 1 total (1 active), Execution time: mean = 0.00ms, total = 0.00ms, Queueing time: mean = 0.00ms, max = -0.00ms, min = 9223372036854.78ms, total = 0.00ms
    [state-dump] DebugString() time ms: 1
    [state-dump] 
    [state-dump] 
    [2026-01-12 12:29:59,875 I 1565894 1565894] (raylet) accessor.cc:436: Received address and liveness notification for node, IsAlive = 1 node_id=fffffffffffffffffffffffffffffffffffffffffffffffffff00000
    [2026-01-12 12:30:00,447 I 1565894 1565917] (raylet) object_store.cc:37: Object store current usage 8e-09 / 27.3914 GB.
    [2026-01-12 12:30:00,453 I 1565894 1565894] (raylet) worker_pool.cc:733: Job 01000000 already started in worker pool.
    [2026-01-12 12:30:02,834 I 1565894 1565894] (raylet) accessor.cc:436: Received address and liveness notification for node, IsAlive = 1 node_id=fffffffffffffffffffffffffffffffffffffffffffffffffff00001
    [2026-01-12 12:30:02,851 I 1565894 1565894] (raylet) ray_syncer_bidi_reactor_base.h:76: Dropping sync message with stale version. latest version: 0, dropped message version: 0
    [2026-01-12 12:30:03,995 I 1565894 1565894] (raylet) accessor.cc:436: Received address and liveness notification for node, IsAlive = 1 node_id=fffffffffffffffffffffffffffffffffffffffffffffffffff00002
    [2026-01-12 12:30:04,012 I 1565894 1565894] (raylet) ray_syncer_bidi_reactor_base.h:76: Dropping sync message with stale version. latest version: 0, dropped message version: 0
    [2026-01-12 12:30:05,178 I 1565894 1565894] (raylet) accessor.cc:436: Received address and liveness notification for node, IsAlive = 1 node_id=fffffffffffffffffffffffffffffffffffffffffffffffffff00003
    [2026-01-12 12:30:05,197 I 1565894 1565894] (raylet) ray_syncer_bidi_reactor_base.h:76: Dropping sync message with stale version. latest version: 1, dropped message version: 1
    [2026-01-12 12:30:05,215 I 1565894 1565894] (raylet) ray_syncer_bidi_reactor_base.h:76: Dropping sync message with stale version. latest version: 5, dropped message version: 5
    [2026-01-12 12:30:05,254 I 1565894 1565894] (raylet) ray_syncer_bidi_reactor_base.h:76: Dropping sync message with stale version. latest version: 5, dropped message version: 5
    [2026-01-12 12:30:05,297 I 1565894 1565894] (raylet) ray_syncer_bidi_reactor_base.h:76: Dropping sync message with stale version. latest version: 5, dropped message version: 5
    [2026-01-12 12:30:05,315 I 1565894 1565894] (raylet) ray_syncer_bidi_reactor_base.h:76: Dropping sync message with stale version. latest version: 6, dropped message version: 6
    [2026-01-12 12:30:05,716 I 1565894 1565894] (raylet) ray_syncer_bidi_reactor_base.h:76: Dropping sync message with stale version. latest version: 7, dropped message version: 7
    [2026-01-12 12:30:05,817 I 1565894 1565894] (raylet) ray_syncer_bidi_reactor_base.h:76: Dropping sync message with stale version. latest version: 9, dropped message version: 9

(autoscaler +17s) Adding 1 node(s) of type type-3.
(autoscaler +17s) Adding 1 node(s) of type type-3.
(autoscaler +17s) Adding 1 node(s) of type type-3.
(autoscaler +17s) Adding 1 node(s) of type type-3.
(autoscaler +17s) Adding 1 node(s) of type type-3.
(autoscaler +24s) Removing 1 nodes of type type-3 (idle).
(autoscaler +24s) Removing 1 nodes of type type-3 (idle).
(autoscaler +24s) Removing 1 nodes of type type-3 (idle).
(autoscaler +24s) Removing 1 nodes of type type-3 (idle).
(raylet) The node with node id: fffffffffffffffffffffffffffffffffffffffffffffffffff00001 and address: 172.31.5.171 and node name: 172.31.5.171 has been marked dead because the detector has missed too many heartbeats from it. This can happen when a     (1) raylet crashes unexpectedly (OOM, etc.) 
        (2) raylet has lagging heartbeats due to slow network or busy workload.
(raylet) Raylet is terminated. Termination is unexpected. Possible reasons include: (1) SIGKILL by the user or system OOM killer, (2) Invalid memory access from Raylet causing SIGSEGV or SIGBUS, (3) Other termination signals. Last 20 lines of the Raylet logs:
    [state-dump]        ray::rpc::InternalKVGcsService.grpc_client.GetInternalConfig.OnReplyReceived - 1 total (0 active), Execution time: mean = 880.39ms, total = 880.39ms, Queueing time: mean = 0.06ms, max = 0.06ms, min = 0.06ms, total = 0.06ms
    [state-dump]        ClusterResourceManager.ResetRemoteNodeView - 1 total (1 active), Execution time: mean = 0.00ms, total = 0.00ms, Queueing time: mean = 0.00ms, max = -0.00ms, min = 9223372036854.78ms, total = 0.00ms
    [state-dump] DebugString() time ms: 1
    [state-dump] 
    [state-dump] 
    [2026-01-12 12:29:59,875 I 1565894 1565894] (raylet) accessor.cc:436: Received address and liveness notification for node, IsAlive = 1 node_id=fffffffffffffffffffffffffffffffffffffffffffffffffff00000
    [2026-01-12 12:30:00,447 I 1565894 1565917] (raylet) object_store.cc:37: Object store current usage 8e-09 / 27.3914 GB.
    [2026-01-12 12:30:00,453 I 1565894 1565894] (raylet) worker_pool.cc:733: Job 01000000 already started in worker pool.
    [2026-01-12 12:30:02,834 I 1565894 1565894] (raylet) accessor.cc:436: Received address and liveness notification for node, IsAlive = 1 node_id=fffffffffffffffffffffffffffffffffffffffffffffffffff00001
    [2026-01-12 12:30:02,851 I 1565894 1565894] (raylet) ray_syncer_bidi_reactor_base.h:76: Dropping sync message with stale version. latest version: 0, dropped message version: 0
    [2026-01-12 12:30:03,995 I 1565894 1565894] (raylet) accessor.cc:436: Received address and liveness notification for node, IsAlive = 1 node_id=fffffffffffffffffffffffffffffffffffffffffffffffffff00002
    [2026-01-12 12:30:04,012 I 1565894 1565894] (raylet) ray_syncer_bidi_reactor_base.h:76: Dropping sync message with stale version. latest version: 0, dropped message version: 0
    [2026-01-12 12:30:05,178 I 1565894 1565894] (raylet) accessor.cc:436: Received address and liveness notification for node, IsAlive = 1 node_id=fffffffffffffffffffffffffffffffffffffffffffffffffff00003
    [2026-01-12 12:30:05,197 I 1565894 1565894] (raylet) ray_syncer_bidi_reactor_base.h:76: Dropping sync message with stale version. latest version: 1, dropped message version: 1
    [2026-01-12 12:30:05,215 I 1565894 1565894] (raylet) ray_syncer_bidi_reactor_base.h:76: Dropping sync message with stale version. latest version: 5, dropped message version: 5
    [2026-01-12 12:30:05,254 I 1565894 1565894] (raylet) ray_syncer_bidi_reactor_base.h:76: Dropping sync message with stale version. latest version: 5, dropped message version: 5
    [2026-01-12 12:30:05,297 I 1565894 1565894] (raylet) ray_syncer_bidi_reactor_base.h:76: Dropping sync message with stale version. latest version: 5, dropped message version: 5
    [2026-01-12 12:30:05,315 I 1565894 1565894] (raylet) ray_syncer_bidi_reactor_base.h:76: Dropping sync message with stale version. latest version: 6, dropped message version: 6
    [2026-01-12 12:30:05,716 I 1565894 1565894] (raylet) ray_syncer_bidi_reactor_base.h:76: Dropping sync message with stale version. latest version: 7, dropped message version: 7
    [2026-01-12 12:30:05,817 I 1565894 1565894] (raylet) ray_syncer_bidi_reactor_base.h:76: Dropping sync message with stale version. latest version: 9, dropped message version: 9
Stopped all 10 Ray processes.

(autoscaler +32s) Resized to 0 CPUs.
(autoscaler +32s) Adding 1 node(s) of type type-1.
(autoscaler +32s) Adding 1 node(s) of type type-2.
(autoscaler +32s) Adding 1 node(s) of type type-3.
(autoscaler +32s) Adding 1 node(s) of type type-3.
(autoscaler +32s) Removing 1 nodes of type type-3 (idle).
PASSED
python/ray/autoscaler/v2/tests/test_e2e.py::test_placement_group_reschedule_node_dead[v2] Did not find any active Ray processes.
Usage stats collection is enabled. To disable this, add `--disable-usage-stats` to the command that starts the cluster, or run the following command: `ray disable-usage-stats` before starting the cluster. See https://docs.ray.io/en/master/cluster/usage-stats.html for more details.

Local node IP: 172.31.5.171

--------------------
Ray runtime started.
--------------------

Next steps
  To add another node to this Ray cluster, run
    ray start --address='172.31.5.171:6379'
  
  To connect to this Ray cluster:
    import ray
    ray.init()
  
  To submit a Ray job using the Ray Jobs CLI:
    RAY_API_SERVER_ADDRESS='http://127.0.0.1:8265' ray job submit --working-dir . -- python my_script.py
  
  See https://docs.ray.io/en/latest/cluster/running-applications/job-submission/index.html 
  for more information on submitting Ray jobs to the Ray cluster.
  
  To terminate the Ray runtime, run
    ray stop
  
  To view the status of the cluster, use
    ray status
  
  To monitor and debug Ray, view the dashboard at 
    127.0.0.1:8265
  
  If connection to the dashboard fails, check your firewall settings and network configuration.
2026-01-12 12:30:40,170 INFO worker.py:1826 -- Connecting to existing Ray cluster at address: 172.31.5.171:6379...
2026-01-12 12:30:40,202 INFO worker.py:2006 -- Connected to Ray cluster. View the dashboard at http://127.0.0.1:8265 
Stopped only 9 out of 12 Ray processes within the grace period 16 seconds. Set `-v` to see more details. Remaining processes [psutil.Process(pid=1569612, name='raylet', status='terminated'), psutil.Process(pid=1569160, name='raylet', status='terminated'), psutil.Process(pid=1568952, name='raylet', status='terminated')] will be forcefully terminated.
You can also use `--force` to forcefully terminate processes or set higher `--grace-period` to wait longer time for proper termination.
Killing pids 1568744
(raylet) Raylet is terminated. Termination is unexpected. Possible reasons include: (1) SIGKILL by the user or system OOM killer, (2) Invalid memory access from Raylet causing SIGSEGV or SIGBUS, (3) Other termination signals. Last 20 lines of the Raylet logs:
    [state-dump]        NodeManager.deadline_timer.spill_objects_when_over_threshold - 1 total (1 active), Execution time: mean = 0.00ms, total = 0.00ms, Queueing time: mean = 0.00ms, max = -0.00ms, min = 9223372036854.78ms, total = 0.00ms
    [state-dump] DebugString() time ms: 0
    [state-dump] 
    [state-dump] 
    [2026-01-12 12:30:39,701 I 1568506 1568506] (raylet) accessor.cc:436: Received address and liveness notification for node, IsAlive = 1 node_id=fffffffffffffffffffffffffffffffffffffffffffffffffff00000
    [2026-01-12 12:30:40,257 I 1568506 1568529] (raylet) object_store.cc:37: Object store current usage 8e-09 / 27.3852 GB.
    [2026-01-12 12:30:40,262 I 1568506 1568506] (raylet) worker_pool.cc:733: Job 01000000 already started in worker pool.
    [2026-01-12 12:30:41,697 I 1568506 1568506] (raylet) accessor.cc:436: Received address and liveness notification for node, IsAlive = 1 node_id=fffffffffffffffffffffffffffffffffffffffffffffffffff00001
    [2026-01-12 12:30:41,714 I 1568506 1568506] (raylet) ray_syncer_bidi_reactor_base.h:76: Dropping sync message with stale version. latest version: 0, dropped message version: 0
    [2026-01-12 12:30:42,858 I 1568506 1568506] (raylet) accessor.cc:436: Received address and liveness notification for node, IsAlive = 1 node_id=fffffffffffffffffffffffffffffffffffffffffffffffffff00002
    [2026-01-12 12:30:42,876 I 1568506 1568506] (raylet) ray_syncer_bidi_reactor_base.h:76: Dropping sync message with stale version. latest version: 0, dropped message version: 0
    [2026-01-12 12:30:44,050 I 1568506 1568506] (raylet) accessor.cc:436: Received address and liveness notification for node, IsAlive = 1 node_id=fffffffffffffffffffffffffffffffffffffffffffffffffff00003
    [2026-01-12 12:30:44,073 I 1568506 1568506] (raylet) ray_syncer_bidi_reactor_base.h:76: Dropping sync message with stale version. latest version: 0, dropped message version: 0
    [2026-01-12 12:30:45,018 I 1568506 1568506] (raylet) ray_syncer_bidi_reactor_base.h:76: Dropping sync message with stale version. latest version: 1, dropped message version: 1
    [2026-01-12 12:30:45,076 I 1568506 1568506] (raylet) ray_syncer_bidi_reactor_base.h:76: Dropping sync message with stale version. latest version: 5, dropped message version: 5
    [2026-01-12 12:30:45,079 I 1568506 1568506] (raylet) ray_syncer_bidi_reactor_base.h:76: Dropping sync message with stale version. latest version: 5, dropped message version: 5
    [2026-01-12 12:30:45,119 I 1568506 1568506] (raylet) ray_syncer_bidi_reactor_base.h:76: Dropping sync message with stale version. latest version: 5, dropped message version: 5
    [2026-01-12 12:30:45,177 I 1568506 1568506] (raylet) ray_syncer_bidi_reactor_base.h:76: Dropping sync message with stale version. latest version: 6, dropped message version: 6
    [2026-01-12 12:30:45,578 I 1568506 1568506] (raylet) ray_syncer_bidi_reactor_base.h:76: Dropping sync message with stale version. latest version: 7, dropped message version: 7
    [2026-01-12 12:30:45,679 I 1568506 1568506] (raylet) ray_syncer_bidi_reactor_base.h:76: Dropping sync message with stale version. latest version: 9, dropped message version: 9

(raylet) The node with node id: fffffffffffffffffffffffffffffffffffffffffffffffffff00001 and address: 172.31.5.171 and node name: 172.31.5.171 has been marked dead because the detector has missed too many heartbeats from it. This can happen when a  (1) raylet crashes unexpectedly (OOM, etc.) 
        (2) raylet has lagging heartbeats due to slow network or busy workload.
(raylet) Raylet is terminated. Termination is unexpected. Possible reasons include: (1) SIGKILL by the user or system OOM killer, (2) Invalid memory access from Raylet causing SIGSEGV or SIGBUS, (3) Other termination signals. Last 20 lines of the Raylet logs:
    [state-dump]        NodeManager.deadline_timer.spill_objects_when_over_threshold - 1 total (1 active), Execution time: mean = 0.00ms, total = 0.00ms, Queueing time: mean = 0.00ms, max = -0.00ms, min = 9223372036854.78ms, total = 0.00ms
    [state-dump] DebugString() time ms: 0
    [state-dump] 
    [state-dump] 
    [2026-01-12 12:30:39,701 I 1568506 1568506] (raylet) accessor.cc:436: Received address and liveness notification for node, IsAlive = 1 node_id=fffffffffffffffffffffffffffffffffffffffffffffffffff00000
    [2026-01-12 12:30:40,257 I 1568506 1568529] (raylet) object_store.cc:37: Object store current usage 8e-09 / 27.3852 GB.
    [2026-01-12 12:30:40,262 I 1568506 1568506] (raylet) worker_pool.cc:733: Job 01000000 already started in worker pool.
    [2026-01-12 12:30:41,697 I 1568506 1568506] (raylet) accessor.cc:436: Received address and liveness notification for node, IsAlive = 1 node_id=fffffffffffffffffffffffffffffffffffffffffffffffffff00001
    [2026-01-12 12:30:41,714 I 1568506 1568506] (raylet) ray_syncer_bidi_reactor_base.h:76: Dropping sync message with stale version. latest version: 0, dropped message version: 0
    [2026-01-12 12:30:42,858 I 1568506 1568506] (raylet) accessor.cc:436: Received address and liveness notification for node, IsAlive = 1 node_id=fffffffffffffffffffffffffffffffffffffffffffffffffff00002
    [2026-01-12 12:30:42,876 I 1568506 1568506] (raylet) ray_syncer_bidi_reactor_base.h:76: Dropping sync message with stale version. latest version: 0, dropped message version: 0
    [2026-01-12 12:30:44,050 I 1568506 1568506] (raylet) accessor.cc:436: Received address and liveness notification for node, IsAlive = 1 node_id=fffffffffffffffffffffffffffffffffffffffffffffffffff00003
    [2026-01-12 12:30:44,073 I 1568506 1568506] (raylet) ray_syncer_bidi_reactor_base.h:76: Dropping sync message with stale version. latest version: 0, dropped message version: 0
    [2026-01-12 12:30:45,018 I 1568506 1568506] (raylet) ray_syncer_bidi_reactor_base.h:76: Dropping sync message with stale version. latest version: 1, dropped message version: 1
    [2026-01-12 12:30:45,076 I 1568506 1568506] (raylet) ray_syncer_bidi_reactor_base.h:76: Dropping sync message with stale version. latest version: 5, dropped message version: 5
    [2026-01-12 12:30:45,079 I 1568506 1568506] (raylet) ray_syncer_bidi_reactor_base.h:76: Dropping sync message with stale version. latest version: 5, dropped message version: 5
    [2026-01-12 12:30:45,119 I 1568506 1568506] (raylet) ray_syncer_bidi_reactor_base.h:76: Dropping sync message with stale version. latest version: 5, dropped message version: 5
    [2026-01-12 12:30:45,177 I 1568506 1568506] (raylet) ray_syncer_bidi_reactor_base.h:76: Dropping sync message with stale version. latest version: 6, dropped message version: 6
    [2026-01-12 12:30:45,578 I 1568506 1568506] (raylet) ray_syncer_bidi_reactor_base.h:76: Dropping sync message with stale version. latest version: 7, dropped message version: 7
    [2026-01-12 12:30:45,679 I 1568506 1568506] (raylet) ray_syncer_bidi_reactor_base.h:76: Dropping sync message with stale version. latest version: 9, dropped message version: 9

PASSED

========================= 2 passed in 80.90s (0:01:20) =========================
EXIT CODE: 0
(docs) ubuntu@devbox:~/ray$ 
```

Signed-off-by: yicheng <[email protected]>
Co-authored-by: yicheng <[email protected]>
…eManager from GCS (ray-project#60121)

This PR stacks on ray-project#60019.

This is 3/N in a series of PRs to remove Centralized Actor Scheduling by
the GCS (introduced in ray-project#15943).
The feature is off by default and no longer in use or supported.

In this PR, I've removed the GCS's dependency on the LocalLeaseManager. I've also moved LocalLeaseManager to the raylet/scheduling package and made its visibility private to the package. Also deleted the NoopLocalLeaseManager.

The LocalLeaseManager is used by the ClusterLeaseManager to see if a task can be scheduled locally by a Raylet. The GCS used only the Noop implementation.

---------

Signed-off-by: irabbani <[email protected]>
Signed-off-by: Ibrahim Rabbani <[email protected]>
…ject#60145)

## Why are these changes needed?

When `EngineDeadError` occurs, the vLLM engine subprocess is dead but
the Ray actor process is still alive. Previously, we re-raised the
exception, but this causes **task retries to go to the SAME actor**
(actor methods are bound to specific instances), creating an infinite
retry loop on the broken actor.

### The Problem

```
vLLM engine subprocess crashes
       ↓
EngineDeadError raised
       ↓
Exception re-raised (actor stays ALIVE)
       ↓
Ray: actor_task_retry_on_errors triggers retry
       ↓
Retry goes to SAME actor (actor methods are bound)
       ↓
Same actor, engine still dead β†’ EngineDeadError
       ↓
Infinite loop (with max_task_retries=-1)
```

### The Fix

Call `os._exit(1)` to exit the actor. This triggers Ray to:
1. Mark the actor as `RESTARTING`
2. Create a replacement actor with a fresh vLLM engine
3. Route task retries to healthy actors (Ray Data excludes `RESTARTING`
actors from dispatch)

This leverages Ray Data's existing fault tolerance infrastructure:
- `max_restarts=-1` (default) enables actor replacement
- `max_task_retries=-1` (default) enables task retry

### Why `os._exit(1)` instead of `ray.actor.exit_actor()`?

We must use `os._exit(1)` rather than `ray.actor.exit_actor()` because
they produce different exit types with different retry behavior:

| Exit Method | Exit Type | Exception Raised | Retried? |
|-------------|-----------|------------------|----------|
| `os._exit(1)` | `SYSTEM_ERROR` | `RaySystemError` | Yes |
| `ray.actor.exit_actor()` | `INTENDED_USER_EXIT` | `ActorDiedError` | No |

The root cause is that Ray Data only adds `RaySystemError` to its
`retry_exceptions` list (in `_add_system_error_to_retry_exceptions()`).
Since `ActorDiedError` is NOT a subclass of `RaySystemError` (they're
siblings in the exception hierarchy), tasks that fail due to
`ray.actor.exit_actor()` are not retried.

**The semantic gap**: Ray currently lacks a "fatal application error"
concept - an error where the actor should be restarted AND pending tasks
retried. The available options are:
- Clean exit (`exit_actor`) = "I'm intentionally done" β†’ no retry
- Crash (`os._exit`) = "Something broke unexpectedly" β†’ retry

We need the "crash" semantics even though this is a deliberate decision,
so `os._exit(1)` is the correct workaround until Ray Core adds explicit
support for fatal application errors.
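
A rough sketch of the pattern described above; the `EngineDeadError` stand-in and the method name are illustrative, not the exact Ray Data LLM code:

```python
import os


class EngineDeadError(Exception):
    """Stand-in for vLLM's EngineDeadError; the real import path may differ."""


async def generate_with_fault_tolerance(engine, request):
    try:
        return await engine.generate(request)
    except EngineDeadError:
        # The vLLM engine subprocess is dead but this actor process is alive.
        # Re-raising would route retries back to this same broken actor, so we
        # exit the process instead: Ray marks the actor RESTARTING, creates a
        # replacement, and routes task retries to healthy actors.
        os._exit(1)
```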

See: ray-project#60150

cc @goutamvenkat-anyscale 

### Validation

We created a minimal reproduction script demonstrating:
1. **The problem**: All retries go to the same broken actor (same PID)
2. **The fix**: Actor exits β†’ replacement created β†’ job succeeds with
multiple PIDs

```python
# Demo output showing fix works:
[SUCCESS] Processed 20 rows!
PIDs that processed batches: {708593, 708820}
-> Multiple PIDs = replacement actor joined and processed work
```

Full reproduction:
https://gist.github.com/nrghosh/c18e514a975144a238511012774bab8b

## Related issue number

Fixes ray-project#59522

## Checks

- [x] I've signed off every commit
- [x] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for
https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I
added a
temporary file handling method, I've added it in
`doc/source/ray-core/api/doc/ray.util.temp_files.rst`.
- [ ] I've made sure the tests are passing. Note that there might be a
few flaky tests.

---------

Signed-off-by: Nikhil Ghosh <[email protected]>
we already know the cloud provider / type via test definition, so there
is no need to fetch it via sdk.

Signed-off-by: Lonnie Liu <[email protected]>
not used anywhere in JobFileManager; all files are transferred via shared blob storage access.

Signed-off-by: Lonnie Liu <[email protected]>
no test definition is using this field

---------

Signed-off-by: Lonnie Liu <[email protected]>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
**Problem**
All versions of setproctitle after 1.2.3 seem to introduce significant expense when forking. This in turn results in very slow or crashing runs of Ray jobs on macOS, particularly when spawning many jobs.

**Solution**
Looking at the code, it seems as if versions after 1.2.3 make quite a lot of changes on Darwin in order to make process renames work properly with Activity Monitor and other macOS utilities. This is not especially important to Ray, so downgrading doesn't cost us too much. It would of course have been preferable to rely on a non-vendored latest version, but the latest versions have this issue, so we downgrade for now and may revisit later.
## Related issues
fixes ray-project#59663

**Historic Context**
ray-project#53471 vendored the dependency
and made a slight logic tweak in the cython binding in setproctitle.pxi.
This had the benefit of fixing the cmdline parse issue described in the
PR but had the downside of upgrading the library version (which now
included a set of Darwin tweaks which leads to the slowdown). After this
PR, the state will be that the vendored version is now old enough to not
contain the activity monitor tweaks for Darwin, as well as having the
changes in setproctitle.pxi.

Signed-off-by: ZacAttack <[email protected]>
@antfin-oss antfin-oss changed the title πŸ”„ daily merge: master β†’ main 2026-02-25 πŸ”„ daily merge: master β†’ main 2026-02-26 Feb 26, 2026
@antfin-oss antfin-oss changed the title πŸ”„ daily merge: master β†’ main 2026-02-26 πŸ”„ daily merge: master β†’ main 2026-02-27 Feb 27, 2026
@antfin-oss antfin-oss changed the title πŸ”„ daily merge: master β†’ main 2026-02-27 πŸ”„ daily merge: master β†’ main 2026-03-02 Mar 2, 2026
@antfin-oss antfin-oss changed the title πŸ”„ daily merge: master β†’ main 2026-03-02 πŸ”„ daily merge: master β†’ main 2026-03-03 Mar 3, 2026
@antfin-oss antfin-oss changed the title πŸ”„ daily merge: master β†’ main 2026-03-03 πŸ”„ daily merge: master β†’ main 2026-03-04 Mar 4, 2026
@antfin-oss antfin-oss changed the title πŸ”„ daily merge: master β†’ main 2026-03-04 πŸ”„ daily merge: master β†’ main 2026-03-05 Mar 5, 2026
@antfin-oss antfin-oss changed the title πŸ”„ daily merge: master β†’ main 2026-03-05 πŸ”„ daily merge: master β†’ main 2026-03-06 Mar 6, 2026
@antfin-oss antfin-oss changed the title πŸ”„ daily merge: master β†’ main 2026-03-06 πŸ”„ daily merge: master β†’ main 2026-03-09 Mar 9, 2026
@antfin-oss antfin-oss changed the title πŸ”„ daily merge: master β†’ main 2026-03-09 πŸ”„ daily merge: master β†’ main 2026-03-10 Mar 10, 2026
@antfin-oss antfin-oss changed the title πŸ”„ daily merge: master β†’ main 2026-03-10 πŸ”„ daily merge: master β†’ main 2026-03-11 Mar 11, 2026
@antfin-oss antfin-oss changed the title πŸ”„ daily merge: master β†’ main 2026-03-11 πŸ”„ daily merge: master β†’ main 2026-03-12 Mar 12, 2026
@antfin-oss antfin-oss changed the title πŸ”„ daily merge: master β†’ main 2026-03-12 πŸ”„ daily merge: master β†’ main 2026-03-13 Mar 13, 2026
@antfin-oss antfin-oss changed the title πŸ”„ daily merge: master β†’ main 2026-03-13 πŸ”„ daily merge: master β†’ main 2026-03-16 Mar 16, 2026
@antfin-oss antfin-oss changed the title πŸ”„ daily merge: master β†’ main 2026-03-16 πŸ”„ daily merge: master β†’ main 2026-03-17 Mar 17, 2026
@antfin-oss antfin-oss changed the title πŸ”„ daily merge: master β†’ main 2026-03-17 πŸ”„ daily merge: master β†’ main 2026-03-18 Mar 18, 2026
@antfin-oss antfin-oss changed the title πŸ”„ daily merge: master β†’ main 2026-03-18 πŸ”„ daily merge: master β†’ main 2026-03-19 Mar 19, 2026
@antfin-oss antfin-oss changed the title πŸ”„ daily merge: master β†’ main 2026-03-19 πŸ”„ daily merge: master β†’ main 2026-03-20 Mar 20, 2026
@antfin-oss antfin-oss changed the title πŸ”„ daily merge: master β†’ main 2026-03-20 πŸ”„ daily merge: master β†’ main 2026-03-23 Mar 23, 2026
@antfin-oss antfin-oss changed the title πŸ”„ daily merge: master β†’ main 2026-03-23 πŸ”„ daily merge: master β†’ main 2026-03-24 Mar 24, 2026
@antfin-oss antfin-oss changed the title πŸ”„ daily merge: master β†’ main 2026-03-24 πŸ”„ daily merge: master β†’ main 2026-03-25 Mar 25, 2026
@antfin-oss antfin-oss changed the title πŸ”„ daily merge: master β†’ main 2026-03-25 πŸ”„ daily merge: master β†’ main 2026-03-26 Mar 26, 2026
@antfin-oss antfin-oss changed the title πŸ”„ daily merge: master β†’ main 2026-03-26 πŸ”„ daily merge: master β†’ main 2026-03-27 Mar 27, 2026
@antfin-oss antfin-oss changed the title πŸ”„ daily merge: master β†’ main 2026-03-27 πŸ”„ daily merge: master β†’ main 2026-03-30 Mar 30, 2026
@antfin-oss antfin-oss changed the title πŸ”„ daily merge: master β†’ main 2026-03-30 πŸ”„ daily merge: master β†’ main 2026-03-31 Mar 31, 2026
@antfin-oss antfin-oss changed the title πŸ”„ daily merge: master β†’ main 2026-03-31 πŸ”„ daily merge: master β†’ main 2026-04-01 Apr 1, 2026
@antfin-oss antfin-oss changed the title πŸ”„ daily merge: master β†’ main 2026-04-01 πŸ”„ daily merge: master β†’ main 2026-04-02 Apr 2, 2026
@antfin-oss antfin-oss changed the title πŸ”„ daily merge: master β†’ main 2026-04-02 πŸ”„ daily merge: master β†’ main 2026-04-03 Apr 3, 2026
@antfin-oss antfin-oss changed the title πŸ”„ daily merge: master β†’ main 2026-04-03 πŸ”„ daily merge: master β†’ main 2026-04-06 Apr 6, 2026
@antfin-oss antfin-oss changed the title πŸ”„ daily merge: master β†’ main 2026-04-06 πŸ”„ daily merge: master β†’ main 2026-04-07 Apr 7, 2026
@antfin-oss antfin-oss changed the title πŸ”„ daily merge: master β†’ main 2026-04-07 πŸ”„ daily merge: master β†’ main 2026-04-08 Apr 8, 2026