
Commit f9150d3

zxqfd555 (Manul from Pathway) authored and committed
support running pipelines on multiple machines (#10076)
GitOrigin-RevId: a09118be92790cde34aa793a3c567b42d0f86d28
1 parent 7ba8cdc commit f9150d3

13 files changed

Lines changed: 597 additions & 36 deletions


CHANGELOG.md

Lines changed: 1 addition & 0 deletions
@@ -7,6 +7,7 @@ This project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.htm
### Added
- `pw.io.milvus.write` connector, which writes a Pathway table to a Milvus collection. Row additions are sent as upserts and row deletions are sent as deletes keyed on the configured primary key column. Requires a Pathway Scale license.
- `pathway spawn` now supports the `--addresses` and `--process-id` flags for multi-machine deployments. Pass a comma-separated list of `host:port` addresses for all processes and the index of the local process; Pathway will connect the cluster over TCP without requiring all processes to run on the same machine.

## [0.30.0] - 2026-03-24

docs/2.developers/4.user-guide/80.advanced/60.worker_count_scaling.md

Lines changed: 4 additions & 0 deletions
@@ -45,6 +45,10 @@ In any case, you can't have less than one worker. Therefore, even if the pipelin
The scaling process scales only by increasing or decreasing the number of **processes**. Threads are **not used for dynamic scaling** in this mechanism. So if your initial configuration uses thread workers, or a mix of threads and processes, scaling will only change the number of processes. For example, if you launched the computation with one process containing two workers, upscaling will lead to two processes with two workers each. Downscaling from that initial configuration, on the other hand, won't be possible, since the number of processes is already equal to one.

### Fixed Address Pool

Dynamic scaling is not available when the worker pool is defined via the `--addresses` flag. In that mode, the set of processes is fixed for the entire run: Pathway cannot add or remove machines at runtime. If your pipeline is launched with `--addresses`, scaling signals from workers are ignored and a warning is emitted to the logs. To use dynamic scaling, let Pathway manage the processes itself by using `--processes` instead.
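
If you want dynamic scaling, launch the pool the managed way, for example (a sketch; `pipeline.py` is a placeholder for your script):

```bash
# Pathway starts and manages two processes itself, so the pool can grow or shrink.
pathway spawn --processes 2 python pipeline.py
```
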
### License Limitations

You need a Pathway License in order for the scaling to work. You can obtain your free Pathway Scale license [here](/get-license). The page contains instructions for getting the license and using it in the pipeline.
Lines changed: 137 additions & 0 deletions
@@ -0,0 +1,137 @@
---
title: 'Running on Multiple Machines'
description: 'This page describes how to distribute a Pathway pipeline across several machines'
---

# Running on Multiple Machines

Pathway pipelines can be distributed across multiple machines. Each machine runs a process, and together they form a single logical computation. Workers on different machines communicate over TCP, exchanging data and progress information the same way co-located processes do.

This is useful when:
- The dataset or working state does not fit in the memory of a single machine.
- The computation is CPU-bound enough to saturate all cores on one host.
- You want to co-locate workers with partitioned data sources (e.g., Kafka brokers) to reduce network transfer.

## How It Works

Every Pathway worker — regardless of which machine it runs on — executes the same dataflow on a different shard of the data. The workers discover each other through a fixed list of `host:port` addresses provided at startup. All processes must be started before any of them begins processing data: the pipeline waits until the full cluster is assembled. While it waits, you will see a "Preparing Pathway computation" log message.

This is different from the default single-machine multi-process mode (`pathway spawn -n N`), where Pathway automatically assigns ports on `127.0.0.1` and launches all processes itself. In multi-machine mode, you are responsible for starting one process per machine and telling each process where all the others are.
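
For a quick side-by-side, the two launch styles look like this (a sketch; `pipeline.py` stands in for your script):

```bash
# Default single-machine mode: Pathway launches both processes itself
# and assigns their ports on 127.0.0.1 automatically.
pathway spawn -n 2 python pipeline.py

# Multi-machine mode: you launch each process yourself, passing the full
# address list and the index of the local process.
pathway spawn --addresses 192.168.1.10:9000,192.168.1.11:9000 --process-id 0 python pipeline.py
```
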
## Setting Up

### 1. Decide on addresses

Choose a `host:port` pair for each process. The port must be reachable from all other machines in the cluster. For example, with two machines:

| Process | Address             |
|---------|---------------------|
| 0       | `192.168.1.10:9000` |
| 1       | `192.168.1.11:9000` |

### 2. Start the process on each machine

On **machine 0**:

```bash
pathway spawn \
    --addresses 192.168.1.10:9000,192.168.1.11:9000 \
    --process-id 0 \
    python pipeline.py
```

On **machine 1**:

```bash
pathway spawn \
    --addresses 192.168.1.10:9000,192.168.1.11:9000 \
    --process-id 1 \
    python pipeline.py
```

Both commands receive the same `--addresses` list. The `--process-id` flag tells each machine which entry in that list belongs to it — process 0 binds to `192.168.1.10:9000`, process 1 binds to `192.168.1.11:9000`.

The two commands can be started in any order. The process that starts first will wait for the others to connect before beginning computation.

Note that a single machine can host more than one process. In that case, use the same host with different ports for each process on that machine:

```bash
pathway spawn \
    --addresses 192.168.1.10:9000,192.168.1.10:9001,192.168.1.11:9000 \
    --process-id 0 \
    python pipeline.py
```

```bash
pathway spawn \
    --addresses 192.168.1.10:9000,192.168.1.10:9001,192.168.1.11:9000 \
    --process-id 1 \
    python pipeline.py
```

```bash
pathway spawn \
    --addresses 192.168.1.10:9000,192.168.1.10:9001,192.168.1.11:9000 \
    --process-id 2 \
    python pipeline.py
```

Here processes 0 and 1 both run on `192.168.1.10`, listening on ports `9000` and `9001` respectively, while process 2 runs on `192.168.1.11`.

Please keep in mind that, due to how the communication works internally, **the address list must appear in the same order in all of the launched commands**. Only the **`--process-id`** parameter varies, taking every value from **`0`** through **the length of the list minus one**.
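
One way to keep the list identical and identically ordered everywhere is to define it once and reuse it verbatim on every machine (a sketch using an ordinary shell variable; the name is arbitrary):

```bash
# Same value on every machine; only --process-id changes per process.
export PW_CLUSTER_ADDRESSES=192.168.1.10:9000,192.168.1.10:9001,192.168.1.11:9000

pathway spawn --addresses "$PW_CLUSTER_ADDRESSES" --process-id 0 python pipeline.py
```
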
### 3. Use threads for intra-machine parallelism

The `--threads` flag works independently of `--addresses`. To run two threads per machine with the two-machine setup above, add `--threads 2` to both commands. This gives four total workers: two on each machine.

```bash
pathway spawn \
    --addresses 192.168.1.10:9000,192.168.1.11:9000 \
    --process-id 0 \
    --threads 2 \
    python pipeline.py
```

### 4. Add persistence (recommended)

When running across machines, data persistence is strongly recommended. If any process crashes, the whole cluster must be restarted. Persistence ensures the pipeline resumes from the last checkpoint rather than replaying from the beginning:

```python
persistence_config = pw.persistence.Config(
    backend=pw.persistence.Backend.s3(
        bucket_name="my-bucket",
        root_path="pathway-state/",
    ),
)

pw.run(persistence_config=persistence_config)
```

It is important to use shared storage (S3, GCS, Azure Blob, NFS) so that all machines can read and write the same state.
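
If your machines share an NFS mount rather than an object store, a filesystem backend pointed at the shared path is a minimal alternative (a sketch, assuming a `pw.persistence.Backend.filesystem` backend and that `/mnt/shared` is mounted on every machine):

```python
# Hypothetical shared NFS mount; every machine must see the same path.
persistence_config = pw.persistence.Config(
    backend=pw.persistence.Backend.filesystem("/mnt/shared/pathway-state"),
)

pw.run(persistence_config=persistence_config)
```
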
## License

Running Pathway on multiple machines requires a Pathway Scale or Pathway Enterprise license. You can obtain a free Pathway Scale license [here](/get-license). The page contains instructions for getting the license and using it in your pipeline.

## Limitations

**No dynamic scaling.** The `--addresses` flag defines a fixed worker pool. Pathway's autoscaling mechanism (described in [Dynamic Worker Scaling](/developers/user-guide/advanced/worker-count-scaling/)) is not available when a fixed address list is used. The number of processes is determined by the length of the `--addresses` list and cannot change at runtime.

**All processes must start for the pipeline to begin.** If one machine fails to start or takes too long, the others will wait indefinitely. There is no partial startup or degraded mode.

**At-least-once delivery.** As with all Pathway deployments, recovery after a crash replays data from the last committed checkpoint. Records written after the last checkpoint but before the crash may be processed again. Exactly-once semantics are available in the enterprise edition.

**Same binary on all machines.** All machines must run the same version of Pathway and the same pipeline code. Mismatched versions will cause a connection failure or undefined behavior.

**Firewall and networking.** Each machine must be able to reach all others on the specified ports. Pathway does not support NAT traversal or proxies between workers.
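
Once a process has started and is waiting for the rest of the cluster, you can check from another machine that its port is actually reachable, for example with a plain TCP probe (a generic sketch using netcat; any equivalent tool works):

```bash
# From machine 1: verify that process 0's port on machine 0 accepts connections.
nc -vz 192.168.1.10 9000
```
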
## Conclusion

To run a Pathway pipeline across multiple machines:

1. **Choose one `host:port` per process** and ensure the ports are mutually reachable.
2. **Start each process independently** using `pathway spawn --addresses <list> --process-id <N>`.
3. **Use shared persistent storage** to enable fast recovery after restarts.
4. **Do not mix `--addresses` with `--processes`** — the process count is derived from the address list.

If you have any questions, feel free to reach out on [Discord](http://discord.com/invite/pathway) or open an issue on our [GitHub](https://github.com/pathwaycom/pathway/issues/).
Lines changed: 10 additions & 0 deletions
@@ -0,0 +1,10 @@

```python
# Identity pipeline: reads plaintext input and writes it back out as JSON lines.
import sys

import pathway as pw

input_path = sys.argv[1]
output_path = sys.argv[2]

t = pw.io.plaintext.read(input_path, mode="static")
pw.io.jsonlines.write(t, output_path)
pw.run()
```

integration_tests/common/test_cli.py

Lines changed: 102 additions & 0 deletions
@@ -9,6 +9,7 @@

```python
REPOSITORY_URL = "https://github.com/pathway-labs/airbyte-to-deltalake"
TRACKED_REPOSITORY_URL = "https://github.com/pathwaycom/pathway/"
IDENTITY_PROGRAM = os.path.join(os.path.dirname(__file__), "identity.py")


def count_commits_in_pathway_repository(tmp_path):
```
@@ -42,3 +43,104 @@ def test_repository_url_feature(tmp_path):

```python
    expected_n_commits = count_commits_in_pathway_repository(tmp_path)

    assert actual_n_commits == expected_n_commits


def invoke_spawn(runner, args):
    return runner.invoke(
        cli.spawn, args + ["python", IDENTITY_PROGRAM, "input.txt", "output.jsonl"]
    )


def test_processes_and_addresses_are_mutually_exclusive(runner):
    result = invoke_spawn(
        runner, ["--processes", "2", "--addresses", "host0:10000,host1:10000"]
    )
    assert result.exit_code != 0
    assert "--processes and --addresses are mutually exclusive" in result.output


def test_process_id_requires_addresses(runner):
    result = invoke_spawn(runner, ["--process-id", "1"])
    assert result.exit_code != 0
    assert "--process-id requires --addresses" in result.output


def test_addresses_requires_process_id(runner):
    result = invoke_spawn(runner, ["--addresses", "host0:10000,host1:10000"])
    assert result.exit_code != 0
    assert "--process-id is required when --addresses is set" in result.output


def test_address_invalid_format_no_colon(runner):
    result = invoke_spawn(runner, ["--addresses", "host010000", "--process-id", "0"])
    assert result.exit_code != 0
    assert "expected host:port format" in result.output


def test_address_invalid_format_non_numeric_port(runner):
    result = invoke_spawn(runner, ["--addresses", "host0:abc", "--process-id", "0"])
    assert result.exit_code != 0
    assert "expected host:port format" in result.output


def test_address_port_zero(runner):
    result = invoke_spawn(runner, ["--addresses", "host0:0", "--process-id", "0"])
    assert result.exit_code != 0
    assert "must be in range" in result.output


def test_address_port_too_large(runner):
    result = invoke_spawn(runner, ["--addresses", "host0:99999", "--process-id", "0"])
    assert result.exit_code != 0
    assert "must be in range" in result.output


def test_addresses_duplicate_entries(runner):
    result = invoke_spawn(
        runner,
        ["--addresses", "host0:10000,host0:10000", "--process-id", "0"],
    )
    assert result.exit_code != 0
    assert "duplicate entries" in result.output


def test_process_id_out_of_range(runner):
    result = invoke_spawn(
        runner,
        ["--addresses", "host0:10000,host1:10000", "--process-id", "5"],
    )
    assert result.exit_code != 0
    assert "--process-id 5 is out of range" in result.output


def test_process_id_negative(runner):
    result = invoke_spawn(
        runner,
        ["--addresses", "host0:10000,host1:10000", "--process-id", "-1"],
    )
    assert result.exit_code != 0
    assert "--process-id -1 is out of range" in result.output


def test_threads_zero(runner):
    result = invoke_spawn(runner, ["--threads", "0"])
    assert result.exit_code != 0
    assert "--threads must be at least 1" in result.output


def test_threads_negative(runner):
    result = invoke_spawn(runner, ["--threads", "-4"])
    assert result.exit_code != 0
    assert "--threads must be at least 1" in result.output


def test_processes_zero(runner):
    result = invoke_spawn(runner, ["--processes", "0"])
    assert result.exit_code != 0
    assert "--processes must be at least 1" in result.output


def test_first_port_overflow(runner):
    result = invoke_spawn(runner, ["--processes", "3", "--first-port", "65534"])
    assert result.exit_code != 0
    assert "exceeds the maximum" in result.output
```
Lines changed: 72 additions & 0 deletions
@@ -0,0 +1,72 @@

```python
import json
import os
import time
import uuid

import pytest

from pathway.cli import create_process_handles, terminate_process_handles

IDENTITY_PROGRAM = os.path.join(os.path.dirname(__file__), "identity.py")


def test_two_machine_identity(tmp_path, two_free_ports):
    port1, port2 = two_free_ports
    addresses = f"127.0.0.1:{port1},127.0.0.1:{port2}"
    input_path = tmp_path / "input.txt"
    output_path = tmp_path / "output.jsonl"
    input_path.write_text("hello world\n")

    env_base = os.environ.copy()
    common_args = dict(
        processes=2,
        threads=1,
        first_port=port1,
        addresses=addresses,
        env_base=env_base,
        program="python",
        arguments=[IDENTITY_PROGRAM, str(input_path), str(output_path)],
    )

    process0 = create_process_handles(
        **common_args, process_id=0, run_id=str(uuid.uuid4())
    )[0]
    try:
        # Process 0 alone must block on cluster assembly: no exit, no output.
        time.sleep(15)
        assert process0.poll() is None, "Process 0 exited before process 1 was launched"
        assert not output_path.exists(), "Output appeared before process 1 was launched"

        # Check again after another 15 seconds to rule out a slow start.
        time.sleep(15)
        assert process0.poll() is None, "Process 0 exited before process 1 was launched"
        assert not output_path.exists(), "Output appeared before process 1 was launched"

        process1 = create_process_handles(
            **common_args, process_id=1, run_id=str(uuid.uuid4())
        )[0]
        try:
            # Once both processes are up, the pipeline should finish quickly.
            deadline = time.time() + 60
            while time.time() < deadline:
                if process0.poll() is not None and process1.poll() is not None:
                    break
                time.sleep(0.5)
            else:
                pytest.fail("Processes did not complete within 60 seconds")

            assert (
                process0.returncode == 0
            ), f"Process 0 exited with code {process0.returncode}"
            assert (
                process1.returncode == 0
            ), f"Process 1 exited with code {process1.returncode}"
        finally:
            terminate_process_handles([process1])
    finally:
        terminate_process_handles([process0])

    assert output_path.exists(), "Output file was not created"
    rows = [
        json.loads(line)
        for line in output_path.read_text().splitlines()
        if line.strip()
    ]
    assert any(row.get("data") == "hello world" for row in rows)
```
