Commit 34705b2 (parent cf0837d)

lammps: multi plan

Signed-off-by: vsoch <[email protected]>

2 files changed: 42 additions & 2 deletions

aws-autoscale/README.md: 3 additions & 2 deletions
````diff
@@ -110,9 +110,10 @@ helm install efa eks/aws-efa-k8s-device-plugin -n kube-system
 ## 1. AMG2023
 
 ```bash
-outdir=./results/amg2023-4-nodes
+outdir=./results/amg2023-4-nodes-deploy
+outdir=./results/amg2023-4-nodes-build
 mkdir -p $outdir
-for i in $(seq 1 10)
+for i in $(seq 1 3)
 do
 fractale agent --plan ./plans/amg2023-4-nodes-build.yaml --results $outdir-build --incremental
 fractale agent --plan ./plans/amg2023-4-nodes-deploy.yaml --results $outdir-deploy --incremental
````
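For reference, the updated loop can be exercised without `fractale` installed; a minimal dry-run sketch (assuming a single `outdir` base, with the `fractale` calls echoed rather than executed):

```shell
# Dry-run sketch of the updated README loop: the fractale agent calls
# are echoed instead of run, and a single outdir base is assumed.
outdir=./results/amg2023-4-nodes
mkdir -p "$outdir-build" "$outdir-deploy"
for i in $(seq 1 3)
do
  echo fractale agent --plan ./plans/amg2023-4-nodes-build.yaml --results "$outdir-build" --incremental
  echo fractale agent --plan ./plans/amg2023-4-nodes-deploy.yaml --results "$outdir-deploy" --incremental
done
```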
New file: 39 additions & 0 deletions

```yaml
name: Build and Deploy LAMMPS
description: Build a Docker container and deploy it as a Kubernetes Job.
plan:
  - agent: build
    context:
      environment: "AWS CPU instance in Kubernetes to run across nodes"
      application: lammps-reax
      platforms: linux/amd64,linux/arm64
      container: ghcr.io/converged-computing/fractale-agent-experiments:lammps-reax
      push: true
      max_attempts: 10
      details: |
        Ensure all globbed files from examples/reaxff/HNS from the root of the lammps codebase are in the WORKDIR. Clone the latest branch of LAMMPS. You MUST put lmp on the PATH. You MUST install libgomp1.
        This will be run with a workload manager that can bootstrap MPI. You MUST install MPI but you do not need ssh.
        You MUST install OpenMPI 4.1.2 with libfabric --with-efa for AWS.
        You MUST build the container for a multi-node MPI environment.

  - agent: minicluster
    context:
      environment: "AWS CPU instance in Kubernetes"
      container: ghcr.io/converged-computing/fractale-agent-experiments:lammps-reax
      max_attempts: 10
      max_runtime: 300
      optimize: |
        You MUST maximize the LAMMPS FOM, k-atom or m-atom steps per second.
        You MUST choose the problem size to maximize the FOM. You MUST start with 1, then 2 nodes. You MUST set environment variables for MPI to use EFA with libfabric.
      resources: |
        The resource spec you got earlier is for an autoscaling cluster, so the possible nodes are not there. You must add the nodeSelector to use:
        m7g.16xlarge, 64 CPU, 256 GiB Memory, ARM (Graviton3)
        You are still limited to up to 4 nodes.
      testing: |
        Run in.reaxff.hns in the pwd with lmp -v x 2 -v y 2 -v z 2 -in ./in.reaxff.hns -nocite for testing only.
      details: |
        You MUST run on up to 4 nodes and you MUST use only 1 node to first test.
        The Flux Operator uses flux run in the pwd with the tasks determined by the spec.tasks.
        You MUST set resource requests and limits to use vpc.amazonaws.com/efa: 1
        Since this is an ARM instance you MUST change the flux.container.image to be ghcr.io/converged-computing/flux-view-rocky:arm-9. Otherwise, do not change or set it.
        If you are using an ARM instance you MUST also set flux.arch: "arm".
```
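The minicluster step's resource instructions would land in a MiniCluster spec roughly as follows. This is an illustrative sketch only: the field placement (`pod.nodeSelector`, `containers[].resources`) is an assumption, not taken from the actual Flux Operator schema, and the MiniCluster name is hypothetical.

```yaml
# Illustrative fragment: where the nodeSelector, EFA resource, and ARM
# Flux view from the plan's instructions could appear in a MiniCluster.
# Field placement is an assumption, not the verified Flux Operator schema.
apiVersion: flux-framework.org/v1alpha2
kind: MiniCluster
metadata:
  name: lammps          # hypothetical name
spec:
  size: 1               # first test on 1 node, then scale up to 4
  flux:
    arch: arm
    container:
      image: ghcr.io/converged-computing/flux-view-rocky:arm-9
  containers:
    - image: ghcr.io/converged-computing/fractale-agent-experiments:lammps-reax
      command: lmp -v x 2 -v y 2 -v z 2 -in ./in.reaxff.hns -nocite
      resources:
        requests:
          vpc.amazonaws.com/efa: 1
        limits:
          vpc.amazonaws.com/efa: 1
  pod:
    nodeSelector:
      node.kubernetes.io/instance-type: m7g.16xlarge
```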
