This runbook covers issues where the Wake Lambda or /predict API call fails due to the SageMaker endpoint being unavailable, cold, or stuck in a non-InService state.
- The UI shows errors during prediction.
- The API returns 500, 502, or "Endpoint is not in service".
- Terraform apply hangs or fails on the SageMaker update step.
- Lambda logs show ModelError, InvocationException, or timeout.
- API Gateway returns 502 Bad Gateway.
- SageMaker endpoint status is not "InService".
- Terraform reports "Resource is updating".
- SageMaker endpoint stuck in Updating/Creating.
- Cold start takes too long because the model is large or the endpoint has insufficient memory.
- Past update attempt failed and left endpoint in partial state.
aws sagemaker describe-endpoint \
--endpoint-name "$ENDPOINT_NAME" \
--region "$AWS_REGION" \
--output table

Expected value: InService

If not InService → continue to Step 2.
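
The table output is easy to read but awkward to script against. The following is a minimal sketch, assuming the same ENDPOINT_NAME and AWS_REGION variables used above, that prints only the status and, for a failed endpoint, the failure reason. The helper name `check_endpoint` is hypothetical and not part of any existing tooling.

```shell
# check_endpoint is a hypothetical helper name used for this sketch.
# Prints the endpoint status and returns 0 only when it is InService.
check_endpoint() {
  local name="$1" region="$2" status
  status=$(aws sagemaker describe-endpoint \
    --endpoint-name "$name" \
    --region "$region" \
    --query 'EndpointStatus' \
    --output text) || return 2   # the describe call itself failed
  echo "EndpointStatus: $status"
  if [ "$status" = "Failed" ]; then
    # FailureReason is reported by SageMaker when the endpoint has failed.
    aws sagemaker describe-endpoint \
      --endpoint-name "$name" \
      --region "$region" \
      --query 'FailureReason' \
      --output text
  fi
  [ "$status" = "InService" ]
}
```

It could be called as `check_endpoint "$ENDPOINT_NAME" "$AWS_REGION" || echo "continue to Step 2"`.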
aws sagemaker describe-endpoint-config \
--endpoint-config-name "$CONFIG_NAME" \
--region "$AWS_REGION" \
--output table

aws sagemaker delete-endpoint \
--endpoint-name "$ENDPOINT_NAME" \
--region "$AWS_REGION"
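
Deletion is asynchronous: delete-endpoint returns while the endpoint is still in the Deleting state, and creating an endpoint with the same name is likely to fail until the old one is gone. A minimal sketch of blocking until deletion finishes, using the AWS CLI's `wait endpoint-deleted` waiter; the wrapper name `wait_until_deleted` is hypothetical.

```shell
# wait_until_deleted is a hypothetical wrapper name used for this sketch.
wait_until_deleted() {
  local name="$1" region="$2"
  # The waiter polls the endpoint until it no longer exists and
  # exits non-zero if it gives up first.
  if aws sagemaker wait endpoint-deleted \
       --endpoint-name "$name" \
       --region "$region"; then
    echo "Endpoint $name deleted"
  else
    echo "Waiter failed or timed out for $name" >&2
    return 1
  fi
}
```

Run `wait_until_deleted "$ENDPOINT_NAME" "$AWS_REGION"` before recreating the endpoint. The matching waiter `aws sagemaker wait endpoint-in-service` can be used the same way after create-endpoint.
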
aws sagemaker create-endpoint \
--endpoint-name "$ENDPOINT_NAME" \
--endpoint-config-name "$CONFIG_NAME" \
--region "$AWS_REGION"

curl -X POST "$API_URL/predict" \
-H "Content-Type: application/json" \
-d '{"ping": true}'

- Endpoint status = InService
- Lambda logs show successful invocation
- UI prediction works
- Terraform shows no pending changes
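
The API part of this checklist can be scripted. The sketch below sends the same ping payload as the curl command above and succeeds only on an HTTP 2xx response. It assumes that /predict answers a ping with a 2xx code, which this runbook does not state explicitly; the function name `predict_smoke_test` is hypothetical.

```shell
# predict_smoke_test is a hypothetical helper name used for this sketch.
predict_smoke_test() {
  local api_url="$1" code
  # -w '%{http_code}' prints only the HTTP status; the body is discarded.
  code=$(curl -s -o /dev/null -w '%{http_code}' \
    -X POST "$api_url/predict" \
    -H "Content-Type: application/json" \
    -d '{"ping": true}')
  echo "POST /predict returned HTTP $code"
  case "$code" in
    2??) return 0 ;;   # any 2xx counts as success (assumption)
    *)   return 1 ;;   # 5xx, or 000 when the connection failed
  esac
}
```

Example: `predict_smoke_test "$API_URL" && echo "prediction path OK"`.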
- Increase serverless memory for faster cold starts.
- Add CloudWatch alarm for endpoint update delays.
- Add retry logic in the Wake Lambda.
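
The retry item can be illustrated with a generic exponential-backoff wrapper. This is a shell sketch of the pattern only, not the Wake Lambda's code (the runbook does not say what language the Lambda is written in); the function name `retry_with_backoff` and its arguments are hypothetical.

```shell
# retry_with_backoff MAX_ATTEMPTS INITIAL_DELAY_SECONDS COMMAND [ARGS...]
# Hypothetical helper: reruns COMMAND until it succeeds or attempts run out,
# doubling the delay after every failure.
retry_with_backoff() {
  local max_attempts="$1" delay="$2" attempt=1
  shift 2
  until "$@"; do
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "Giving up after $attempt attempts" >&2
      return 1
    fi
    echo "Attempt $attempt failed; retrying in ${delay}s" >&2
    sleep "$delay"
    attempt=$((attempt + 1))
    delay=$((delay * 2))
  done
}
```

For example, `retry_with_backoff 5 2 curl -sf -X POST "$API_URL/predict" -H "Content-Type: application/json" -d '{"ping": true}'` makes up to five attempts with waits of 2, 4, 8, and 16 seconds between them, which gives a cold endpoint time to warm up.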