Kubeflow Installation & Katib Setup Documentation
1. Prerequisites
Before starting, ensure you have:
A Kubernetes cluster up and running.
kubectl installed and configured.
Sufficient resources (CPU, RAM, and storage).
Persistent storage provisioner (like local-path).
2. Prepare the Namespace
Create the kubeflow namespace:
kubectl create ns kubeflow
3. Install Katib Components
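Install the Katib components in the following order. The relative paths assume the commands are run from the root of a local clone of the Kubeflow manifests repository (~/kubeflow-manifests in the steps below):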
3.1 Namespace
kubectl apply -k applications/katib/upstream/components/namespace
3.2 MySQL
kubectl apply -k applications/katib/upstream/components/mysql
3.3 Controller
kubectl apply -k applications/katib/upstream/components/controller
3.4 UI
kubectl apply -k applications/katib/upstream/components/ui
3.5 Webhook
kubectl apply -k applications/katib/upstream/components/webhook
4. Install Katib using Standalone/With-Kubeflow Install Manifests
There are multiple installation paths:
4.1 With Kubeflow
cd ~/kubeflow-manifests/applications/katib/upstream/installs/katib-with-kubeflow
kubectl apply -k .
4.2 Standalone (optional)
cd ~/kubeflow-manifests/applications/katib/upstream/installs/katib-standalone
kubectl apply -k .
4.3 Standalone with Postgres (optional)
cd ~/kubeflow-manifests/applications/katib/upstream/installs/katib-standalone-postgres
kubectl apply -k .
Note: If you face pending PVC issues (like katib-postgres), check your storage class and node availability.
5. Verify Components
Check pods in kubeflow namespace:
kubectl get pods -n kubeflow
Check PVCs:
kubectl get pvc -n kubeflow
Check ConfigMaps & Webhooks:
kubectl get configmap katib-config -n kubeflow -o yaml
kubectl get mutatingwebhookconfiguration katib.kubeflow.org -o yaml
kubectl get validatingwebhookconfiguration katib.kubeflow.org -o yaml
6. Label Namespace for Katib
Katib injects its metrics collector only into trial pods that run in a namespace carrying this label:
kubectl label namespace kubeflow katib.kubeflow.org/metrics-collector-injection=enabled --overwrite
Verify label:
kubectl get ns kubeflow --show-labels
7. Create an Example Experiment
Create example-experiment.yaml with a valid trial template:
apiVersion: kubeflow.org/v1beta1
kind: Experiment
metadata:
  name: example-hp-tuning
  namespace: kubeflow
spec:
  objective:
    type: maximize
    goal: 1.0
    objectiveMetricName: accuracy
  algorithm:
    algorithmName: random
  maxTrialCount: 3
  parallelTrialCount: 1
  parameters:
    - name: learning-rate
      parameterType: double
      feasibleSpace:
        min: "0.01"
        max: "0.1"
    - name: batch-size
      parameterType: int
      feasibleSpace:
        min: "16"
        max: "64"
  trialTemplate:
    primaryContainerName: training-container
    # Conditions are GJSON expressions; these are Katib's documented defaults for a batch/v1 Job.
    successCondition: 'status.conditions.#(type=="Complete")#|#(status=="True")#'
    failureCondition: 'status.conditions.#(type=="Failed")#|#(status=="True")#'
    trialSpec:
      apiVersion: batch/v1
      kind: Job
      spec:
        template:
          spec:
            containers:
              - name: training-container
                image: docker.io/bitnami/tensorflow:2.13.0
                command:
                  - python
                  - -c
                  - |
                    import os, random, time
                    lr = float(os.environ.get("LEARNING_RATE", 0.01))
                    bs = int(os.environ.get("BATCH_SIZE", 32))
                    print(f"Training with lr={lr}, batch_size={bs}")
                    time.sleep(random.randint(5,10))
                    # The default StdOut metrics collector parses "name=value" lines.
                    print(f"accuracy={random.random()}")
                env:
                  # Katib v1beta1 placeholders use the ${trialParameters.<name>} syntax.
                  - name: LEARNING_RATE
                    value: "${trialParameters.learning-rate}"
                  - name: BATCH_SIZE
                    value: "${trialParameters.batch-size}"
            restartPolicy: Never
    trialParameters:
      - name: learning-rate
        description: Learning rate
        reference: learning-rate
      - name: batch-size
        description: Batch size
        reference: batch-size
Apply the experiment:
kubectl apply -f example-experiment.yaml -n kubeflow
8. Verify Experiment & Trials
kubectl get experiments -n kubeflow
kubectl get trials -n kubeflow
kubectl get pods -n kubeflow -l experiment=example-hp-tuning -w
9. Troubleshooting Tips
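PVC Pending
Ensure a storage class exists (kubectl get storageclass) and nodes have enough capacity.
If the storage class uses volumeBindingMode WaitForFirstConsumer, the PVC stays Pending until a pod that uses it is scheduled, so check the PVC events with kubectl describe pvc <pvc> -n kubeflow.
Webhook Denials
Ensure the namespace carries the katib.kubeflow.org/metrics-collector-injection=enabled label.
Ensure every placeholder in trialSpec refers to a name declared under trialParameters, and every reference there matches a name under parameters.
Katib Pods Not Running
Run kubectl describe pod <pod> -n kubeflow and read the Events section.
Look for webhook connection errors.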
If you want, I can make a fully formatted PDF/Markdown document with all commands, notes, and examples in one clean file ready to use as official documentation for yourself.
Do you want me to do that?
You have not given the proper steps; some are missing. I want clear documentation from the very start.
Got it! Let’s create a full, clear step-by-step Kubeflow + Katib installation guide, starting from a fresh environment. I’ll include all the steps you did, along with proper notes for issues, labels, PVCs, and experiment creation. I’ll make it structured for documentation purposes.
Kubeflow Installation and Katib Setup Guide
1. Prerequisites
Before you start, ensure you have:
Kubernetes cluster running (v1.23+ recommended).
kubectl installed and configured.
Minimum resources: 4 CPUs, 8GB RAM, 20GB storage.
Persistent volume provisioner (like local-path) configured.
Check cluster:
kubectl get nodes
kubectl version
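The prerequisite checks above can be bundled into one preflight function. This is a minimal sketch, not part of the Kubeflow tooling: the function name check_prereqs is made up for illustration, and it only counts nodes and storage classes without validating CPU, RAM, or disk capacity.

```shell
# Preflight sketch: report whether kubectl, nodes, and a storage class are present.
# check_prereqs is an illustrative name, not a Kubeflow or kubectl command.
check_prereqs() {
  status=0
  if ! command -v kubectl >/dev/null 2>&1; then
    echo "kubectl: not found on PATH"
    return 1
  fi
  # Count the nodes the API server reports.
  nodes=$(kubectl get nodes --no-headers 2>/dev/null | wc -l | tr -d ' ')
  if [ "$nodes" -gt 0 ]; then
    echo "nodes: $nodes"
  else
    echo "nodes: none reachable"
    status=1
  fi
  # PVCs need at least one storage class (for example the one local-path creates).
  classes=$(kubectl get storageclass --no-headers 2>/dev/null | wc -l | tr -d ' ')
  if [ "$classes" -gt 0 ]; then
    echo "storage classes: $classes"
  else
    echo "storage classes: none found"
    status=1
  fi
  return $status
}

# Usage: check_prereqs || echo "fix the items above before installing Katib"
```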
2. Clone Kubeflow Manifests
git clone https://github.com/kubeflow/manifests.git ~/kubeflow-manifests
cd ~/kubeflow-manifests
3. Create Kubeflow Namespace
kubectl create ns kubeflow --dry-run=client -o yaml | kubectl apply -f -
Verify:
kubectl get ns kubeflow --show-labels
4. Install Katib Components (from manifests)
4.1 Namespace
kubectl apply -k applications/katib/upstream/components/namespace
4.2 MySQL
kubectl apply -k applications/katib/upstream/components/mysql
4.3 Controller
kubectl apply -k applications/katib/upstream/components/controller
4.4 UI
kubectl apply -k applications/katib/upstream/components/ui
4.5 Webhook
kubectl apply -k applications/katib/upstream/components/webhook
Check pods:
kubectl get pods -n kubeflow
⚠ Note: If katib-controller pods don’t appear, check PVC and webhook errors.
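Instead of polling the pod list by hand, the wait can be scripted. The sketch below assumes that every Deployment in the namespace is expected to become Available; if the rest of Kubeflow shares the kubeflow namespace, the command waits for those Deployments as well. The function name and the default timeout are illustrative.

```shell
# Block until every Deployment in the namespace is Available, or time out.
# wait_for_deployments is an illustrative helper name.
wait_for_deployments() {
  ns=${1:-kubeflow}
  timeout=${2:-300s}
  if kubectl wait --for=condition=Available deployment --all \
       -n "$ns" --timeout="$timeout"; then
    echo "all deployments in $ns are available"
  else
    echo "some deployments in $ns are not available; current pods:"
    kubectl get pods -n "$ns"
    return 1
  fi
}

# Usage: wait_for_deployments kubeflow 300s
```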
5. Choose Katib Installation Type
You can install Katib with Kubeflow or standalone:
5.1 Katib with Kubeflow
cd ~/kubeflow-manifests/applications/katib/upstream/installs/katib-with-kubeflow
kubectl apply -k .
5.2 Katib Standalone
cd ~/kubeflow-manifests/applications/katib/upstream/installs/katib-standalone
kubectl apply -k .
5.3 Standalone with Postgres
cd ~/kubeflow-manifests/applications/katib/upstream/installs/katib-standalone-postgres
kubectl apply -k .
Check pods and PVCs:
kubectl get pods -n kubeflow
kubectl get pvc -n kubeflow
6. Label Namespace for Metrics Collector
Katib requires the namespace to be labeled for metrics injection:
kubectl label namespace kubeflow katib.kubeflow.org/metrics-collector-injection=enabled --overwrite
kubectl get ns kubeflow --show-labels
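Reading the label out of the --show-labels output by eye is error-prone, so a script can test the value directly. This sketch uses kubectl's JSONPath output with the dots in the label key escaped; the function name injection_enabled is an assumption made for illustration.

```shell
# Succeed only if the namespace carries the Katib injection label set to "enabled".
# injection_enabled is an illustrative helper name.
injection_enabled() {
  ns=${1:-kubeflow}
  value=$(kubectl get ns "$ns" \
    -o jsonpath='{.metadata.labels.katib\.kubeflow\.org/metrics-collector-injection}')
  [ "$value" = "enabled" ]
}

# Usage: injection_enabled kubeflow && echo "label present" || echo "label missing"
```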
7. Verify Katib Config
Check ConfigMap:
kubectl get configmap katib-config -n kubeflow -o yaml
Check Webhooks:
kubectl get mutatingwebhookconfiguration katib.kubeflow.org -o yaml
kubectl get validatingwebhookconfiguration katib.kubeflow.org -o yaml
8. Create a Sample Experiment
Create example-experiment.yaml:
apiVersion: kubeflow.org/v1beta1
kind: Experiment
metadata:
  name: example-hp-tuning
  namespace: kubeflow
spec:
  objective:
    type: maximize
    goal: 1.0
    objectiveMetricName: accuracy
  algorithm:
    algorithmName: random
  maxTrialCount: 3
  parallelTrialCount: 1
  parameters:
    - name: learning-rate
      parameterType: double
      feasibleSpace:
        min: "0.01"
        max: "0.1"
    - name: batch-size
      parameterType: int
      feasibleSpace:
        min: "16"
        max: "64"
  trialTemplate:
    primaryContainerName: training-container
    # Conditions are GJSON expressions; these are Katib's documented defaults for a batch/v1 Job.
    successCondition: 'status.conditions.#(type=="Complete")#|#(status=="True")#'
    failureCondition: 'status.conditions.#(type=="Failed")#|#(status=="True")#'
    trialSpec:
      apiVersion: batch/v1
      kind: Job
      spec:
        template:
          spec:
            containers:
              - name: training-container
                image: docker.io/bitnami/tensorflow:2.13.0
                command:
                  - python
                  - -c
                  - |
                    import os, random, time
                    lr = float(os.environ.get("LEARNING_RATE", 0.01))
                    bs = int(os.environ.get("BATCH_SIZE", 32))
                    print(f"Training with lr={lr}, batch_size={bs}")
                    time.sleep(random.randint(5,10))
                    # The default StdOut metrics collector parses "name=value" lines.
                    print(f"accuracy={random.random()}")
                env:
                  # Katib v1beta1 placeholders use the ${trialParameters.<name>} syntax.
                  - name: LEARNING_RATE
                    value: "${trialParameters.learning-rate}"
                  - name: BATCH_SIZE
                    value: "${trialParameters.batch-size}"
            restartPolicy: Never
    trialParameters:
      - name: learning-rate
        description: Learning rate
        reference: learning-rate
      - name: batch-size
        description: Batch size
        reference: batch-size
Apply experiment:
kubectl apply -f example-experiment.yaml -n kubeflow
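The trial container reports its objective on stdout. Katib's default StdOut metrics collector is documented as parsing lines of the form name=value, so a line printed as "accuracy: 0.93" would not be recorded. The filter below is a rough local check for that format, not Katib's actual parser; the function name metric_lines and the file name train.py are hypothetical.

```shell
# Print only the lines on stdin that look like "<metric>=<number>".
# metric_lines is an illustrative helper; it approximates the name=value format.
metric_lines() {
  metric=$1
  grep -E "^[[:space:]]*${metric}[[:space:]]*=[[:space:]]*[-+]?[0-9]*\.?[0-9]+"
}

# Usage (train.py is a hypothetical local copy of the training script):
#   python train.py | metric_lines accuracy
```

If the filter prints nothing for a local run of the training script, the trial will most likely finish without a recorded objective value.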
9. Verify Experiments and Trials
kubectl get experiments -n kubeflow
kubectl get trials -n kubeflow
kubectl get pods -n kubeflow -l experiment=example-hp-tuning -w
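Once trials have finished, the best one can be read from the Experiment's status. The sketch below assumes the Katib v1beta1 status field currentOptimalTrial.bestTrialName; check the field against the CRD installed on your cluster. The function name best_trial is illustrative.

```shell
# Print the name of the best trial recorded in the Experiment status.
# Assumes the v1beta1 field .status.currentOptimalTrial.bestTrialName.
best_trial() {
  name=$1
  ns=${2:-kubeflow}
  kubectl get experiment "$name" -n "$ns" \
    -o jsonpath='{.status.currentOptimalTrial.bestTrialName}'
}

# Usage: best_trial example-hp-tuning kubeflow
```

An empty result usually means no trial has reported the objective metric yet.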
10. Common Issues & Fixes
Webhook Denials
Ensure namespace is labeled correctly:
kubectl label namespace kubeflow katib.kubeflow.org/metrics-collector-injection=enabled --overwrite
Ensure each name under trialParameters matches the placeholder that references it in trialSpec (here, the env values), and each reference matches a name under parameters.
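The name matching can be checked before the manifest reaches the webhook. Katib v1beta1 documents placeholders in the form ${trialParameters.<name>}. The sketch below extracts those placeholders with grep and compares them with the names declared under trialParameters. It is a text-level check under a stated assumption: trialParameters is the last block in the file, as in the sample experiment. The function name is illustrative.

```shell
# Report placeholders that are not declared under trialParameters.
# Assumes trialParameters is the last block in the manifest (text-level check only).
check_trial_parameters() {
  file=$1
  # Names declared after the "trialParameters:" line.
  declared=$(awk '/^[ \t]*trialParameters:/ {seen=1; next}
                  seen && $1=="-" && $2=="name:" {print $3}' "$file")
  # Names referenced as ${trialParameters.<name>} anywhere in the file.
  used=$(grep -o '\${trialParameters\.[^}]*}' "$file" |
         sed -e 's/^\${trialParameters\.//' -e 's/}$//' | sort -u)
  if [ -z "$used" ]; then
    echo 'no ${trialParameters.<name>} placeholders found'
    return 1
  fi
  rc=0
  for p in $used; do
    if ! printf '%s\n' "$declared" | grep -qxF -- "$p"; then
      echo "placeholder not declared: $p"
      rc=1
    fi
  done
  return $rc
}

# Usage: check_trial_parameters example-experiment.yaml
```

A manifest that uses a different placeholder style, such as double curly braces, is reported as having no placeholders at all.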
PVC Pending
Ensure your storage class exists and nodes have capacity.
If using WaitForFirstConsumer, ensure pods can schedule to a node.
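To find which claims are stuck, the PVC list can be filtered by phase. A minimal sketch using kubectl's custom-columns output; the function name pending_pvcs is illustrative.

```shell
# List PVCs in the namespace whose phase is anything other than Bound.
# pending_pvcs is an illustrative helper name.
pending_pvcs() {
  ns=${1:-kubeflow}
  kubectl get pvc -n "$ns" --no-headers \
    -o custom-columns=NAME:.metadata.name,PHASE:.status.phase |
    awk '$2 != "Bound" {print $1}'
}

# Usage: show the events for each unbound claim.
#   for pvc in $(pending_pvcs kubeflow); do
#     kubectl describe pvc "$pvc" -n kubeflow
#   done
```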
Katib Controller Not Running
Check logs:
kubectl logs -n kubeflow deployment/katib-controller
Check for webhook errors:
kubectl get events -n kubeflow --sort-by=.metadata.creationTimestamp
This document captures all the steps from a fresh install to running a sample Katib experiment, including the troubleshooting notes you encountered.