Add juicefs mode for storage backend#1964
Conversation
✅ Deploy Preview for transformerlab canceled.
|
Codecov Report❌ Patch coverage is 📢 Thoughts on this report? Let us know! |
aliasaria
left a comment
There was a problem hiding this comment.
Review Summary
This PR adds a new juicefs storage provider that mounts a shared filesystem on the API host and on remote pods, scoped per-team via --subdir orgs/<team_id>. The architecture is reasonable — a dedicated launch_juicefs helper module, validated env-var configuration, and a mount-point pre-existence check at startup. The main concerns are a leftover dev-only branch-specific snippet that must come out before merge, an unverified binary download, and some defensive hygiene around path scoping and retry logging.
Must Fix (blocks merge)
[Maintainability] Remove leftover dev/branch hack in launch_template.py
Lines 504-507 contain a commented-out git clone … git checkout add/juicefs-mounting block. This was clearly a debugging artifact for installing the SDK from this feature branch. It pins the snippet to the branch name and will be confusing/risky if uncommented later by mistake.
# setup_commands.insert(
# 0,
# "export CURRENT_DIR=$(pwd); git clone https://github.com/transformerlab/transformerlab-app; cd transformerlab-app/lab-sdk; git checkout add/juicefs-mounting; uv pip install -e .; cd $CURRENT_DIR",
# )Fix: Delete those four lines. Once the SDK bump to 0.1.32 is published, pip install -q transformerlab (already on the line above) does the right thing.
[Security] JuiceFS binary download has no integrity verification or version pinning
build_juicefs_install_command() does:
"curl -fsSL https://juicefs.com/static/juicefs -o /tmp/juicefs; "
"chmod +x /tmp/juicefs; "
"mv /tmp/juicefs /usr/local/bin/juicefs ..."No SHA256 verification and no version pin. Every pod fetches whatever lives at that URL today, making your remote workload trust upstream's HTTPS endpoint as an integrity boundary. If juicefs.com is compromised or that path mutates, you push it to every JuiceFS-mounted pod.
Fix: Pin a specific version (e.g. https://github.com/juicedata/juicefs/releases/download/v1.2.3/juicefs-1.2.3-linux-amd64.tar.gz), record a known-good SHA256, and verify with sha256sum -c before chmod +x. At minimum, pin the version in the URL.
Should Fix (important but not blocking)
[Security] Validate team_id against path traversal in _create_juicefs_directory
org_path = os.path.join(mount_point, "orgs", team_id) followed by os.makedirs(org_path, exist_ok=True). If team_id ever contained .. segments or an absolute path, the directory could land outside the mount. Today team IDs come from the DB so this is theoretical, but it's the exact spot where defense-in-depth costs nothing.
if not re.fullmatch(r"[A-Za-z0-9_-]+", team_id):
print(f"❌ Invalid team_id for JuiceFS: {team_id}")
return FalseSame applies to the f-string f'orgs/{team_id}' passed into build_juicefs_pod_config's --subdir. shlex.quote blocks shell injection but not path traversal — --subdir will happily resolve ../.
[Correctness] Validate TFL_JUICEFS_STORAGE_BACKEND against the known set
_validate_juicefs_config only checks the env var is non-empty. But in build_juicefs_backend_credentials_setup, anything other than "aws" | "gcp" | "azure" silently produces empty setup_commands and env_vars, leading to a confusing juicefs auth failure on the pod. A typo (s3, AWS, awss) reaches launch.
backend = os.getenv("TFL_JUICEFS_STORAGE_BACKEND", "")
if backend not in ("aws", "gcp", "azure"):
print(f"❌ Invalid TFL_JUICEFS_STORAGE_BACKEND: {backend!r}", file=sys.stderr)
sys.exit(1)[Maintainability] The retry loop in experiment.create_job masks the underlying issue without a signal
Retrying a UUID4 collision 5 times silently is fine as a JuiceFS-metadata-race workaround, but if it's ever firing for a different reason (e.g. a buggy Job.create raising FileExistsError from the wrong file), nobody will know. Add a logger.warning on retry so the symptom shows up in logs.
except FileExistsError:
logger.warning("Job.create raised FileExistsError; retrying with new UUID", exc_info=True)
continue[Maintainability] Fragile .replace() for console-url injection in mount_cmd
In launch_juicefs.py:
if juicefs_console_url:
auth_cmd = auth_cmd.replace(
'--token "$TFL_JUICEFS_TOKEN"',
'--token "$TFL_JUICEFS_TOKEN" --console-url "$TFL_JUICEFS_CONSOLE_URL"',
)The string match here happens to occur twice (once per branch of the if/else), so the replace double-injects the flag — that's actually what you want, but it's coincidental and easy to break by editing the auth string elsewhere. Build the snippet directly instead:
console_flag = ' --console-url "$TFL_JUICEFS_CONSOLE_URL"' if juicefs_console_url else ""
auth_cmd = (
'if [ -n "$ACCESS_KEY" ] && [ -n "$SECRET_KEY" ]; then '
f'juicefs auth {shlex.quote(volume_name)} --token "$TFL_JUICEFS_TOKEN"{console_flag} '
'--access-key "$ACCESS_KEY" --secret-key "$SECRET_KEY"; '
'else '
f'juicefs auth {shlex.quote(volume_name)} --token "$TFL_JUICEFS_TOKEN"{console_flag}; '
'fi'
)Consider Improving (optional / nice-to-have)
- The JuiceFS credential-injection block is duplicated near-verbatim in
launch_template.py(lines 267-269) andlaunch_sweep.py(lines 290-296). Pull it into a tiny helper alongsidebuild_juicefs_backend_credentials_setup. _JUICEFS_MOUNT_POINTinlab/storage.pyis captured at import time, while siblings likeTFL_REMOTE_STORAGE_ENABLEDare read on every call. Reading it dynamically insideset_organization_idwould be more consistent and avoid stale-value surprises in tests/long-lived processes.- Consider logging (rather than
print) for the_create_juicefs_directoryand_validate_juicefs_configfailure paths so they integrate with the rest of the API server's structured logs.
What's Working Well
- Strong test coverage across both modes (API host and pod) —
_validate_juicefs_configis exercised for each missing env var, the mount-command builder covers the non-token, hosted-token, and console-URL variants, anddirs.set_organization_idhas tests for both API-host scoping and pod subdir-mount paths. - Correct use of
shlex.quoteon every interpolated path/volume in the generated shell snippets — this is the right defense for the command-construction layer. - Clear architectural separation: JuiceFS-specific install/mount/auth logic lives in a dedicated
launch_juicefs.pyrather than being smeared acrosslaunch_template.pyandlaunch_sweep.py. The reuse of existinggenerate_aws_credentials_setup/generate_gcp_credentials_setup/generate_azure_credentials_setuphelpers is exactly right. - The "JuiceFS bypasses
TFL_REMOTE_STORAGE_ENABLED" decision is well-documented in inline comments at every site that branches on it (api.py,launch_template.py,launch_sweep.py,remote_workspace.py). That's the kind of subtle invariant that would otherwise rot fast.
I'll fix the Correctness and Maintainability points |
|
Pushed the fixes |
JuiceFS Setup: Required Env Vars & Pre-startup Mount
To run with JuiceFS storage, set the following env vars before starting the API server:
Required:
TFL_STORAGE_PROVIDER=juicefsTFL_JUICEFS_METADATA_URLredis://localhost:6379/1) (NOT REQUIRED IF USING ENTERPRISE HOSTED BY JUICEFS)TFL_JUICEFS_VOLUME_NAMETFL_JUICEFS_STORAGE_BACKENDaws,gcp, orazureOptional:
TFL_JUICEFS_MOUNT_POINT/mnt/juicefsTFL_JUICEFS_TOKENTFL_JUICEFS_CONSOLE_URLTFL_JUICEFS_REMOTE_MOUNT_POINT/tmp/tlab-juicefsPlus standard credentials for the backing store (
AWS_PROFILE,TFL_GCP_SERVICE_ACCOUNT_JSON_PATH,AZURE_STORAGE_CONNECTION_STRING, etc.) depending on which backend you set.Important: The JuiceFS volume must be mounted at
TFL_JUICEFS_MOUNT_POINTbefore starting the API server. On startup, the server validates that the mount point is active and will exit with an error if it isn't. Runjuicefs authandjuicefs mountas part of your host provisioning/boot sequence beforeinvoking
api/run.sh.