Skip to content

Add juicefs mode for storage backend#1964

Open
deep1401 wants to merge 16 commits into
mainfrom
add/juicefs-mounting
Open

Add juicefs mode for storage backend#1964
deep1401 wants to merge 16 commits into
mainfrom
add/juicefs-mounting

Conversation

@deep1401
Copy link
Copy Markdown
Member

@deep1401 deep1401 commented May 4, 2026

JuiceFS Setup: Required Env Vars & Pre-startup Mount

To run with JuiceFS storage, set the following env vars before starting the API server:

Required:

Variable Description
TFL_STORAGE_PROVIDER=juicefs Enables JuiceFS mode
TFL_JUICEFS_METADATA_URL Metadata engine URL (e.g. redis://localhost:6379/1) (NOT REQUIRED IF USING ENTERPRISE HOSTED BY JUICEFS)
TFL_JUICEFS_VOLUME_NAME Name of the JuiceFS volume to mount
TFL_JUICEFS_STORAGE_BACKEND Backing object store: aws, gcp, or azure

Optional:

Variable Default Description
TFL_JUICEFS_MOUNT_POINT /mnt/juicefs Where JuiceFS is mounted on the API host
TFL_JUICEFS_TOKEN Auth token for hosted JuiceFS (juicefs.com) (REQUIRED IF USING HOSTED)
TFL_JUICEFS_CONSOLE_URL Console URL for self-hosted JuiceFS
TFL_JUICEFS_REMOTE_MOUNT_POINT /tmp/tlab-juicefs Mount point inside remote pods (not the API host)

Plus standard credentials for the backing store (AWS_PROFILE, TFL_GCP_SERVICE_ACCOUNT_JSON_PATH, AZURE_STORAGE_CONNECTION_STRING, etc.) depending on which backend you set.

Important: The JuiceFS volume must be mounted at TFL_JUICEFS_MOUNT_POINT before starting the API server. On startup, the server validates that the mount point is active and will exit with an error if it isn't. Run juicefs auth and juicefs mount as part of your host provisioning/boot sequence before
invoking api/run.sh.

@netlify
Copy link
Copy Markdown

netlify Bot commented May 4, 2026

Deploy Preview for transformerlab canceled.

Name Link
🔨 Latest commit 44eef73
🔍 Latest deploy log https://app.netlify.com/projects/transformerlab/deploys/69fa538bbe0fec0008d589f8

@sentry
Copy link
Copy Markdown

sentry Bot commented May 4, 2026

@deep1401 deep1401 marked this pull request as ready for review May 5, 2026 16:57
Copy link
Copy Markdown
Member

@aliasaria aliasaria left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review Summary

This PR adds a new juicefs storage provider that mounts a shared filesystem on the API host and on remote pods, scoped per-team via --subdir orgs/<team_id>. The architecture is reasonable — a dedicated launch_juicefs helper module, validated env-var configuration, and a mount-point pre-existence check at startup. The main concerns are a leftover dev-only branch-specific snippet that must come out before merge, an unverified binary download, and some defensive hygiene around path scoping and retry logging.


Must Fix (blocks merge)

[Maintainability] Remove leftover dev/branch hack in launch_template.py

Lines 504-507 contain a commented-out git clone … git checkout add/juicefs-mounting block. This was clearly a debugging artifact for installing the SDK from this feature branch. It pins the snippet to the branch name and will be confusing/risky if uncommented later by mistake.

# setup_commands.insert(
#     0,
#     "export CURRENT_DIR=$(pwd); git clone https://github.com/transformerlab/transformerlab-app; cd transformerlab-app/lab-sdk; git checkout add/juicefs-mounting; uv pip install -e .; cd $CURRENT_DIR",
# )

Fix: Delete those four lines. Once the SDK bump to 0.1.32 is published, pip install -q transformerlab (already on the line above) does the right thing.

[Security] JuiceFS binary download has no integrity verification or version pinning

build_juicefs_install_command() does:

"curl -fsSL https://juicefs.com/static/juicefs -o /tmp/juicefs; "
"chmod +x /tmp/juicefs; "
"mv /tmp/juicefs /usr/local/bin/juicefs ..."

No SHA256 verification and no version pin. Every pod fetches whatever lives at that URL today, making your remote workload trust upstream's HTTPS endpoint as an integrity boundary. If juicefs.com is compromised or that path mutates, you push it to every JuiceFS-mounted pod.

Fix: Pin a specific version (e.g. https://github.com/juicedata/juicefs/releases/download/v1.2.3/juicefs-1.2.3-linux-amd64.tar.gz), record a known-good SHA256, and verify with sha256sum -c before chmod +x. At minimum, pin the version in the URL.


Should Fix (important but not blocking)

[Security] Validate team_id against path traversal in _create_juicefs_directory

org_path = os.path.join(mount_point, "orgs", team_id) followed by os.makedirs(org_path, exist_ok=True). If team_id ever contained .. segments or an absolute path, the directory could land outside the mount. Today team IDs come from the DB so this is theoretical, but it's the exact spot where defense-in-depth costs nothing.

if not re.fullmatch(r"[A-Za-z0-9_-]+", team_id):
    print(f"❌ Invalid team_id for JuiceFS: {team_id}")
    return False

Same applies to the f-string f'orgs/{team_id}' passed into build_juicefs_pod_config's --subdir. shlex.quote blocks shell injection but not path traversal — --subdir will happily resolve ../.

[Correctness] Validate TFL_JUICEFS_STORAGE_BACKEND against the known set

_validate_juicefs_config only checks the env var is non-empty. But in build_juicefs_backend_credentials_setup, anything other than "aws" | "gcp" | "azure" silently produces empty setup_commands and env_vars, leading to a confusing juicefs auth failure on the pod. A typo (s3, AWS, awss) reaches launch.

backend = os.getenv("TFL_JUICEFS_STORAGE_BACKEND", "")
if backend not in ("aws", "gcp", "azure"):
    print(f"❌ Invalid TFL_JUICEFS_STORAGE_BACKEND: {backend!r}", file=sys.stderr)
    sys.exit(1)

[Maintainability] The retry loop in experiment.create_job masks the underlying issue without a signal

Retrying a UUID4 collision 5 times silently is fine as a JuiceFS-metadata-race workaround, but if it's ever firing for a different reason (e.g. a buggy Job.create raising FileExistsError from the wrong file), nobody will know. Add a logger.warning on retry so the symptom shows up in logs.

except FileExistsError:
    logger.warning("Job.create raised FileExistsError; retrying with new UUID", exc_info=True)
    continue

[Maintainability] Fragile .replace() for console-url injection in mount_cmd

In launch_juicefs.py:

if juicefs_console_url:
    auth_cmd = auth_cmd.replace(
        '--token "$TFL_JUICEFS_TOKEN"',
        '--token "$TFL_JUICEFS_TOKEN" --console-url "$TFL_JUICEFS_CONSOLE_URL"',
    )

The string match here happens to occur twice (once per branch of the if/else), so the replace double-injects the flag — that's actually what you want, but it's coincidental and easy to break by editing the auth string elsewhere. Build the snippet directly instead:

console_flag = ' --console-url "$TFL_JUICEFS_CONSOLE_URL"' if juicefs_console_url else ""
auth_cmd = (
    'if [ -n "$ACCESS_KEY" ] && [ -n "$SECRET_KEY" ]; then '
    f'juicefs auth {shlex.quote(volume_name)} --token "$TFL_JUICEFS_TOKEN"{console_flag} '
    '--access-key "$ACCESS_KEY" --secret-key "$SECRET_KEY"; '
    'else '
    f'juicefs auth {shlex.quote(volume_name)} --token "$TFL_JUICEFS_TOKEN"{console_flag}; '
    'fi'
)

Consider Improving (optional / nice-to-have)

  • The JuiceFS credential-injection block is duplicated near-verbatim in launch_template.py (lines 267-269) and launch_sweep.py (lines 290-296). Pull it into a tiny helper alongside build_juicefs_backend_credentials_setup.
  • _JUICEFS_MOUNT_POINT in lab/storage.py is captured at import time, while siblings like TFL_REMOTE_STORAGE_ENABLED are read on every call. Reading it dynamically inside set_organization_id would be more consistent and avoid stale-value surprises in tests/long-lived processes.
  • Consider logging (rather than print) for the _create_juicefs_directory and _validate_juicefs_config failure paths so they integrate with the rest of the API server's structured logs.

What's Working Well

  • Strong test coverage across both modes (API host and pod) — _validate_juicefs_config is exercised for each missing env var, the mount-command builder covers the non-token, hosted-token, and console-URL variants, and dirs.set_organization_id has tests for both API-host scoping and pod subdir-mount paths.
  • Correct use of shlex.quote on every interpolated path/volume in the generated shell snippets — this is the right defense for the command-construction layer.
  • Clear architectural separation: JuiceFS-specific install/mount/auth logic lives in a dedicated launch_juicefs.py rather than being smeared across launch_template.py and launch_sweep.py. The reuse of existing generate_aws_credentials_setup / generate_gcp_credentials_setup / generate_azure_credentials_setup helpers is exactly right.
  • The "JuiceFS bypasses TFL_REMOTE_STORAGE_ENABLED" decision is well-documented in inline comments at every site that branches on it (api.py, launch_template.py, launch_sweep.py, remote_workspace.py). That's the kind of subtle invariant that would otherwise rot fast.

@deep1401
Copy link
Copy Markdown
Member Author

deep1401 commented May 5, 2026

@aliasaria

  1. I've left that for whoever tests this PR. I will remove it once we approve the PR.
  2. The binary link is what the juicefs team officially recommends so I preferred this instead of getting it from elsewhere

I'll fix the Correctness and Maintainability points

@deep1401
Copy link
Copy Markdown
Member Author

deep1401 commented May 5, 2026

Pushed the fixes

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants