RunAnywhere Web SDK Reference

Complete guide for integrating RunAnywhere on-device AI into web applications via WebAssembly.

Installation

The Web SDK uses a three-package architecture:

npm install @runanywhere/web @runanywhere/web-llamacpp @runanywhere/web-onnx
| Package | Purpose | Exports |
| --- | --- | --- |
| @runanywhere/web | Core TypeScript API (no WASM) | RunAnywhere, ModelManager, ModelCategory, LLMFramework, SDKEnvironment, EventBus, VideoCapture, AudioCapture, AudioPlayback, OPFSStorage, detectCapabilities, CompactModelDef |
| @runanywhere/web-llamacpp | LLM/VLM backend via llama.cpp WASM | LlamaCPP, TextGeneration, VLMWorkerBridge, startVLMWorkerRuntime |
| @runanywhere/web-onnx | STT/TTS/VAD backend via sherpa-onnx WASM | ONNX, STT, TTS, VAD |

Requirements

| Component | Minimum | Recommended |
| --- | --- | --- |
| Browser | Chrome 96+ / Edge 96+ | Chrome 120+ / Edge 120+ |
| WebAssembly | Required | Required |
| SharedArrayBuffer | For multi-threaded WASM | Requires Cross-Origin Isolation headers |
| WebGPU | Optional (GPU acceleration) | Chrome 120+ |
| RAM | 2GB | 4GB+ for larger models |

Setup

Bundler Configuration

Vite (recommended):

// vite.config.ts
import { defineConfig, Plugin } from 'vite';
import react from '@vitejs/plugin-react';
import path from 'path';
import fs from 'fs';

// Custom plugin to copy WASM files from node_modules to dist
function copyWasmPlugin(): Plugin {
  return {
    name: 'copy-wasm',
    writeBundle() {
      const outDir = path.resolve(__dirname, 'dist/assets');
      fs.mkdirSync(outDir, { recursive: true });

      // Copy llama.cpp WASM files
      const llamaWasm = path.resolve(__dirname, 'node_modules/@runanywhere/web-llamacpp/wasm');
      for (const f of fs.readdirSync(llamaWasm)) {
        fs.copyFileSync(path.join(llamaWasm, f), path.join(outDir, f));
      }

      // Copy sherpa-onnx WASM files
      const onnxWasm = path.resolve(__dirname, 'node_modules/@runanywhere/web-onnx/wasm');
      for (const f of fs.readdirSync(onnxWasm)) {
        if (f === 'sherpa') {
          // Copy sherpa subdirectory
          const sherpaDir = path.join(outDir, 'sherpa');
          fs.mkdirSync(sherpaDir, { recursive: true });
          for (const sf of fs.readdirSync(path.join(onnxWasm, 'sherpa'))) {
            fs.copyFileSync(path.join(onnxWasm, 'sherpa', sf), path.join(sherpaDir, sf));
          }
        } else {
          fs.copyFileSync(path.join(onnxWasm, f), path.join(outDir, f));
        }
      }
    },
  };
}

export default defineConfig({
  plugins: [react(), copyWasmPlugin()],
  assetsInclude: ['**/*.wasm'],
  optimizeDeps: {
    exclude: ['@runanywhere/web-llamacpp', '@runanywhere/web-onnx'],
  },
  worker: { format: 'es' },
  server: {
    headers: {
      'Cross-Origin-Opener-Policy': 'same-origin',
      'Cross-Origin-Embedder-Policy': 'credentialless',
    },
  },
});

Webpack:

// webpack.config.js
module.exports = {
  module: {
    rules: [
      { test: /\.wasm$/, type: 'asset/resource' },
    ],
  },
};
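
For local development, webpack-dev-server needs the Cross-Origin Isolation headers described in the next section; a sketch, assuming webpack-dev-server 4+ (mirrors the Vite server.headers setup above):

// webpack.config.js (continued)
module.exports = {
  // ...module rules from above
  devServer: {
    headers: {
      'Cross-Origin-Opener-Policy': 'same-origin',
      'Cross-Origin-Embedder-Policy': 'credentialless',
    },
  },
};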

Cross-Origin Isolation Headers

Required for SharedArrayBuffer support (multi-threaded WASM):

Cross-Origin-Opener-Policy: same-origin
Cross-Origin-Embedder-Policy: credentialless

Without these headers, the SDK falls back to single-threaded mode with degraded performance.
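
In production, send the same headers from your server or CDN. A minimal sketch using Express (a hypothetical server file; any static host that supports custom headers works the same way):

// server.ts — hypothetical static server for the built app
import express from 'express';

const app = express();
app.use((_req, res, next) => {
  res.setHeader('Cross-Origin-Opener-Policy', 'same-origin');
  res.setHeader('Cross-Origin-Embedder-Policy', 'credentialless');
  next();
});
app.use(express.static('dist'));
app.listen(8080);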

Quick Start

1. Initialize SDK

import {
  RunAnywhere,
  SDKEnvironment,
  ModelManager,
  ModelCategory,
  LLMFramework,
  type CompactModelDef,
} from '@runanywhere/web';

import { LlamaCPP } from '@runanywhere/web-llamacpp';
import { ONNX } from '@runanywhere/web-onnx';

// Step 1: Initialize core SDK
await RunAnywhere.initialize({
  environment: SDKEnvironment.Development,
  debug: true,
});

// Step 2: Register WASM backends
await LlamaCPP.register();
await ONNX.register();

// Step 3: Register model catalog
const MODELS: CompactModelDef[] = [
  {
    id: 'lfm2-350m-q4_k_m',
    name: 'LFM2 350M Q4_K_M',
    repo: 'LiquidAI/LFM2-350M-GGUF',
    files: ['LFM2-350M-Q4_K_M.gguf'],
    framework: LLMFramework.LlamaCpp,
    modality: ModelCategory.Language,
    memoryRequirement: 250_000_000,
  },
];
RunAnywhere.registerModels(MODELS);

2. Text Generation (LLM)

import { TextGeneration } from '@runanywhere/web-llamacpp';

// Generate (non-streaming)
const result = await TextGeneration.generate('Explain quantum computing briefly.');
console.log(result.text);
console.log(`${result.tokensUsed} tokens in ${result.latencyMs}ms`);
console.log(`Speed: ${result.tokensPerSecond} tok/sec`);

// Stream tokens
const { stream, result: resultPromise, cancel } = TextGeneration.generateStream(
  'Write a haiku about code.',
  { maxTokens: 512, temperature: 0.7 }
);

for await (const token of stream) {
  outputEl.textContent += token; // real-time output; outputEl is your own DOM element
}

const finalResult = await resultPromise;
console.log(`${finalResult.tokensPerSecond} tok/sec`);

// Cancel mid-stream if needed
// cancel();

3. Speech-to-Text (STT)

import { STT, STTModelType } from '@runanywhere/web-onnx';

await STT.loadModel({
  modelId: 'whisper-tiny',
  type: STTModelType.Whisper,
  modelFiles: {
    encoder: '/models/encoder.onnx',
    decoder: '/models/decoder.onnx',
    tokens: '/models/tokens.txt'
  },
  sampleRate: 16000,
});

const result = await STT.transcribe(audioFloat32Array);
console.log(result.text);
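
transcribe expects mono Float32Array PCM at the configured sample rate. One way to produce that from an audio file with standard Web Audio APIs (a sketch; relying on the AudioContext to resample on decode is an assumption about your target browsers, not SDK behavior):

// Decode an audio file to 16 kHz mono Float32Array using Web Audio
async function fileToSamples(file: File): Promise<Float32Array> {
  const ctx = new AudioContext({ sampleRate: 16000 }); // decodeAudioData resamples to the context rate
  const buf = await ctx.decodeAudioData(await file.arrayBuffer());
  const samples = buf.getChannelData(0); // first channel as mono
  await ctx.close();
  return samples;
}

const audioFloat32Array = await fileToSamples(myFile); // myFile: a File from e.g. an <input type="file">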

4. Text-to-Speech (TTS)

import { TTS } from '@runanywhere/web-onnx';

await TTS.loadVoice({
  modelId: 'piper-us-english',
  modelPath: '/models/piper-model.onnx',
  tokensPath: '/models/tokens.txt'
});

const { audioData, sampleRate } = await TTS.synthesize('Hello from RunAnywhere!');
// audioData is Float32Array PCM audio
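
To play the synthesized PCM you can use the SDK's AudioPlayback helper or plain Web Audio; a minimal Web Audio sketch:

// Play Float32Array PCM with standard Web Audio
const ctx = new AudioContext();
const buffer = ctx.createBuffer(1, audioData.length, sampleRate);
buffer.copyToChannel(audioData, 0);
const source = ctx.createBufferSource();
source.buffer = buffer;
source.connect(ctx.destination);
source.start();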

Advanced Usage

Vision Language Models (VLM)

VLM enables on-device visual understanding in the browser. It runs in a dedicated Web Worker to keep the UI responsive.

Step 1: Create VLM Web Worker

// src/workers/vlm-worker.ts
import { startVLMWorkerRuntime } from '@runanywhere/web-llamacpp';
startVLMWorkerRuntime();

Step 2: Wire up VLM during SDK initialization

import { RunAnywhere, SDKEnvironment, ModelCategory, LLMFramework, type CompactModelDef } from '@runanywhere/web';
import { LlamaCPP, VLMWorkerBridge } from '@runanywhere/web-llamacpp';

// Import worker URL (Vite syntax)
import vlmWorkerUrl from './workers/vlm-worker?worker&url';

await RunAnywhere.initialize({ environment: SDKEnvironment.Development });
await LlamaCPP.register();

// Register VLM model (requires model + mmproj files)
RunAnywhere.registerModels([
  {
    id: 'lfm2-vl-450m-q4_0',
    name: 'LFM2-VL 450M Q4_0',
    repo: 'runanywhere/LFM2-VL-450M-GGUF',
    files: ['LFM2-VL-450M-Q4_0.gguf', 'mmproj-LFM2-VL-450M-Q8_0.gguf'],
    framework: LLMFramework.LlamaCpp,
    modality: ModelCategory.Multimodal,
    memoryRequirement: 500_000_000,
  },
]);

// Wire VLM worker bridge
VLMWorkerBridge.shared.workerUrl = vlmWorkerUrl;
RunAnywhere.setVLMLoader({
  get isInitialized() { return VLMWorkerBridge.shared.isInitialized; },
  init: () => VLMWorkerBridge.shared.init(),
  loadModel: (params) => VLMWorkerBridge.shared.loadModel(params),
  unloadModel: () => VLMWorkerBridge.shared.unloadModel(),
});

Step 3: Capture camera frames and process with VLM

import { VideoCapture } from '@runanywhere/web';
import { VLMWorkerBridge } from '@runanywhere/web-llamacpp';

// Start camera
const camera = new VideoCapture({ facingMode: 'environment' });
await camera.start();

// Capture a frame (256px — CLIP resizes internally)
const frame = camera.captureFrame(256);

// Process with VLM
const result = await VLMWorkerBridge.shared.process(
  frame.rgbPixels,     // RGB pixel data
  frame.width,
  frame.height,
  'What do you see in this image? Describe the scene.',
  { maxTokens: 80, temperature: 0.7 }
);
console.log(result.text);

// Stop camera when done
camera.stop();

Live mode (continuous VLM processing):

// Poll every 2.5 seconds for live descriptions
const interval = setInterval(async () => {
  if (!camera.isCapturing) return;
  const frame = camera.captureFrame(256);
  const result = await VLMWorkerBridge.shared.process(
    frame.rgbPixels, frame.width, frame.height,
    'Briefly describe what you see.',
    { maxTokens: 30, temperature: 0.7 }
  );
  console.log(result.text);
}, 2500);
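// Stop polling when leaving live mode:
// clearInterval(interval);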

Voice Activity Detection (VAD)

import { VAD } from '@runanywhere/web-onnx';

await VAD.initialize();

// Process audio samples
const result = VAD.processSamples(audioFloat32Array);

// Get speech segments
const segment = VAD.popSpeechSegment();
if (segment) {
  console.log('Speech detected:', segment);
}

// Callback-based detection
VAD.onSpeechActivity((isSpeaking) => {
  console.log(isSpeaking ? 'Speech started' : 'Speech ended');
});
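
VAD consumes raw Float32Array chunks, so live microphone audio can be fed in with standard Web Audio APIs. A sketch, assuming the VAD expects 16 kHz input (ScriptProcessorNode is deprecated but broadly supported; an AudioWorklet is the modern equivalent):

// Feed live microphone audio into VAD
const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
const ctx = new AudioContext({ sampleRate: 16000 });
const source = ctx.createMediaStreamSource(stream);
const processor = ctx.createScriptProcessor(4096, 1, 1);
processor.onaudioprocess = (e) => {
  // Copy the chunk — the underlying buffer is reused between callbacks
  VAD.processSamples(new Float32Array(e.inputBuffer.getChannelData(0)));
};
source.connect(processor);
processor.connect(ctx.destination);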

Tool Calling

import { TextGeneration } from '@runanywhere/web-llamacpp';

const tools = [
  {
    name: 'get_weather',
    description: 'Get current weather for a location',
    parameters: {
      type: 'object',
      properties: {
        location: { type: 'string', description: 'City name' }
      },
      required: ['location']
    }
  }
];

const result = await TextGeneration.generateWithTools(
  'What\'s the weather in Paris?',
  tools
);

if (result.toolCall) {
  console.log('Tool:', result.toolCall.name);
  console.log('Args:', result.toolCall.arguments);
}

Note: Use a tool-calling model like LFM2 1.2B Tool for best results.
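
To execute the returned call locally, a small dispatcher works. A sketch (the shape of toolCall.arguments — plain object vs. JSON string — is an assumption; verify against your model's output):

// Hypothetical local dispatcher for tool calls
const toolImpls: Record<string, (args: any) => Promise<string>> = {
  get_weather: async ({ location }) => `18°C and sunny in ${location}`, // stub implementation
};

if (result.toolCall) {
  const raw = result.toolCall.arguments;
  const args = typeof raw === 'string' ? JSON.parse(raw) : raw; // handle both shapes
  console.log('Tool result:', await toolImpls[result.toolCall.name]?.(args));
}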

Structured Output

import { TextGeneration } from '@runanywhere/web-llamacpp';

const schema = {
  type: 'object',
  properties: {
    name: { type: 'string' },
    age: { type: 'number' },
    hobbies: {
      type: 'array',
      items: { type: 'string' }
    }
  },
  required: ['name', 'age']
};

const result = await TextGeneration.generateStructured(
  'Generate a profile for a person who likes coding',
  schema
);
const data = JSON.parse(result.text);
console.log(data);  // { name: "...", age: ..., hobbies: [...] }

Embeddings

import { Embeddings } from '@runanywhere/web';

await Embeddings.loadModel('/models/all-minilm-l6-v2.gguf');

// Single text embedding
const embedding = await Embeddings.embed('Hello world');
console.log(embedding);  // Float32Array

// Batch embeddings
const embeddings = await Embeddings.embedBatch([
  'First text',
  'Second text',
  'Third text'
]);
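
Embeddings are typically compared with cosine similarity; a small helper:

// Cosine similarity between two embedding vectors
function cosineSimilarity(a: Float32Array, b: Float32Array): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

const score = cosineSimilarity(embeddings[0], embeddings[1]);
console.log('Similarity:', score);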

Voice Pipeline

Complete VAD → STT → LLM → TTS orchestration:

import { VoicePipeline } from '@runanywhere/web';

await VoicePipeline.initialize({
  vadModel: '/models/silero-vad.onnx',
  sttModel: 'whisper-tiny',
  llmModel: 'lfm2-350m',
  ttsModel: 'piper-us-english'
});

// Callbacks
VoicePipeline.onTranscription = (text) => {
  console.log('User said:', text);
};

VoicePipeline.onResponse = (text) => {
  console.log('AI responded:', text);
};

VoicePipeline.onAudio = (audioData) => {
  // Play audio (e.g., with the Web Audio sketch in the TTS section)
};

// Start listening
await VoicePipeline.start();

// Stop
await VoicePipeline.stop();

Model Management

Model Registration (CompactModelDef)

Register models with their download source, framework, and memory requirements:

import { ModelCategory, LLMFramework, type CompactModelDef } from '@runanywhere/web';

const MODELS: CompactModelDef[] = [
  // GGUF model from Hugging Face repo
  {
    id: 'lfm2-350m-q4_k_m',
    name: 'LFM2 350M Q4_K_M',
    repo: 'LiquidAI/LFM2-350M-GGUF',           // HF repo
    files: ['LFM2-350M-Q4_K_M.gguf'],           // Files to download
    framework: LLMFramework.LlamaCpp,
    modality: ModelCategory.Language,
    memoryRequirement: 250_000_000,              // bytes
  },
  // VLM model (requires model + mmproj)
  {
    id: 'lfm2-vl-450m-q4_0',
    name: 'LFM2-VL 450M Q4_0',
    repo: 'runanywhere/LFM2-VL-450M-GGUF',
    files: ['LFM2-VL-450M-Q4_0.gguf', 'mmproj-LFM2-VL-450M-Q8_0.gguf'],
    framework: LLMFramework.LlamaCpp,
    modality: ModelCategory.Multimodal,
    memoryRequirement: 500_000_000,
  },
  // Tool-calling model
  {
    id: 'lfm2-1.2b-tool-q4_k_m',
    name: 'LFM2 1.2B Tool Q4_K_M',
    repo: 'LiquidAI/LFM2-1.2B-Tool-GGUF',
    files: ['LFM2-1.2B-Tool-Q4_K_M.gguf'],
    framework: LLMFramework.LlamaCpp,
    modality: ModelCategory.Language,
    memoryRequirement: 800_000_000,
  },
  // ONNX archive from direct URL
  {
    id: 'sherpa-onnx-whisper-tiny.en',
    name: 'Whisper Tiny English (ONNX)',
    url: 'https://huggingface.co/runanywhere/sherpa-onnx-whisper-tiny.en/resolve/main/sherpa-onnx-whisper-tiny.en.tar.gz',
    framework: LLMFramework.ONNX,
    modality: ModelCategory.SpeechRecognition,
    memoryRequirement: 105_000_000,
    artifactType: 'archive',
  },
  // TTS archive
  {
    id: 'vits-piper-en_US-lessac-medium',
    name: 'Piper TTS US English (Lessac)',
    url: 'https://huggingface.co/runanywhere/vits-piper-en_US-lessac-medium/resolve/main/vits-piper-en_US-lessac-medium.tar.gz',
    framework: LLMFramework.ONNX,
    modality: ModelCategory.SpeechSynthesis,
    memoryRequirement: 65_000_000,
    artifactType: 'archive',
  },
  // Single ONNX file
  {
    id: 'silero-vad-v5',
    name: 'Silero VAD v5',
    url: 'https://huggingface.co/runanywhere/silero-vad-v5/resolve/main/silero_vad.onnx',
    files: ['silero_vad.onnx'],
    framework: LLMFramework.ONNX,
    modality: ModelCategory.Audio,
    memoryRequirement: 5_000_000,
  },
];

RunAnywhere.registerModels(MODELS);

Download & Load Models

import { ModelManager, EventBus } from '@runanywhere/web';

// Track download progress
EventBus.shared.on('model.downloadProgress', (evt) => {
  console.log(`${evt.modelId}: ${(evt.progress * 100).toFixed(1)}%`);
});

// Download model
await ModelManager.downloadModel('lfm2-350m-q4_k_m');

// Load model (set coexist: true to keep models from multiple categories loaded simultaneously)
await ModelManager.loadModel('lfm2-350m-q4_k_m', { coexist: false });

// Check loaded model
const loaded = ModelManager.getLoadedModel(ModelCategory.Language);
console.log('Loaded:', loaded?.id);

// List all registered models
const allModels = ModelManager.getModels();

React Hook for Model Loading

import { useState, useCallback } from 'react';
import { ModelManager, ModelCategory, EventBus } from '@runanywhere/web';

type LoaderState = 'idle' | 'downloading' | 'loading' | 'ready' | 'error';

function useModelLoader(category: ModelCategory, coexist = false) {
  const [state, setState] = useState<LoaderState>('idle');
  const [progress, setProgress] = useState(0);
  const [error, setError] = useState<string | null>(null);

  const ensure = useCallback(async () => {
    // Already loaded?
    if (ModelManager.getLoadedModel(category)) {
      setState('ready');
      return true;
    }

    const model = ModelManager.getModels().find(m => m.modality === category);
    if (!model) { setError('No model registered'); return false; }

    try {
      if (model.status !== 'downloaded') {
        setState('downloading');
        EventBus.shared.on('model.downloadProgress', (evt) => {
          if (evt.modelId === model.id) setProgress(evt.progress);
        });
        await ModelManager.downloadModel(model.id);
      }

      setState('loading');
      await ModelManager.loadModel(model.id, { coexist });
      setState('ready');
      return true;
    } catch (e) {
      setState('error');
      setError(String(e));
      return false;
    }
  }, [category, coexist]);

  return { state, progress, error, ensure };
}
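
Using the hook in a component (a sketch; ChatGate and its UI are hypothetical):

import type { ReactNode } from 'react';

function ChatGate({ children }: { children: ReactNode }) {
  const { state, progress, error, ensure } = useModelLoader(ModelCategory.Language);

  if (state === 'ready') return <>{children}</>;
  return (
    <div>
      {state === 'downloading' && <p>Downloading {(progress * 100).toFixed(0)}%</p>}
      {state === 'error' && <p>{error}</p>}
      <button onClick={() => ensure()}>Load model</button>
    </div>
  );
}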

Persistent Storage (OPFS)

Models are automatically stored in Origin Private File System:

import { OPFSStorage } from '@runanywhere/web';

// Models persist across sessions in OPFS automatically
// Use ModelManager for high-level management
// Use OPFSStorage for low-level access if needed
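
To inspect what is stored, the standard OPFS API also works (this is the raw browser API, not part of the SDK; async iteration is Chromium-supported and may need @types/wicg-file-system-access for TypeScript):

// List top-level OPFS entries with the standard API
const root = await navigator.storage.getDirectory();
for await (const [name, handle] of root.entries()) {
  console.log(`${handle.kind}: ${name}`);
}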

Configuration

Generation Options

const options = {
  maxTokens: 512,
  temperature: 0.7,
  topP: 1.0,
  topK: 40,
  stopSequences: ['END'],
  systemPrompt: 'You are a helpful assistant.',
};

const result = await TextGeneration.generate(prompt, options);

Performance Optimization

// VLM inference must run in a dedicated Web Worker (see Advanced Usage)
// TextGeneration runs in the main thread's WASM module, so no worker is needed there

// Enable WebGPU acceleration (if available)
// LlamaCPP auto-detects WebGPU at registration time
const mode = LlamaCPP.accelerationMode; // 'webgpu' | 'wasm' | null

Browser Compatibility Check

import { detectCapabilities } from '@runanywhere/web';

const caps = await detectCapabilities();

console.log('Cross-Origin Isolated:', caps.isCrossOriginIsolated);
console.log('SharedArrayBuffer:', caps.hasSharedArrayBuffer);
console.log('WebGPU:', caps.hasWebGPU);
console.log('OPFS:', caps.hasOPFS);

if (!caps.isCrossOriginIsolated) {
  console.warn('Add COOP/COEP headers for multi-threaded performance');
}

Error Handling

import { SDKError } from '@runanywhere/web';

try {
  await ModelManager.loadModel(modelId);
} catch (error) {
  const message = error instanceof Error ? error.message : String(error);
  if (error instanceof SDKError) {
    console.error('SDK Error:', message);
  } else if (message.includes('quota')) {
    console.error('Storage quota exceeded');
  } else if (message.includes('SharedArrayBuffer')) {
    console.error('Missing Cross-Origin headers');
  } else {
    console.error('Model load failed:', error);
  }
}

Debugging

Enable Logging

import { SDKLogger, LogLevel } from '@runanywhere/web';

SDKLogger.enabled = true;
SDKLogger.level = LogLevel.Debug;

await RunAnywhere.initialize({
  environment: SDKEnvironment.Development,
  debug: true,
});

Performance Monitoring

const { stream, result: resultPromise } = TextGeneration.generateStream(prompt);
for await (const token of stream) { /* display */ }
const result = await resultPromise;

console.log('Tokens/sec:', result.tokensPerSecond);
console.log('Latency:', result.latencyMs);
console.log('Tokens used:', result.tokensUsed);

Common Issues

SharedArrayBuffer Not Available

Add Cross-Origin headers to your server:

Cross-Origin-Opener-Policy: same-origin
Cross-Origin-Embedder-Policy: credentialless

In Vite, add to server.headers in vite.config.ts.

WASM Files Not Found

Ensure WASM files are copied to your output directory. Use the copyWasmPlugin() pattern shown in Setup, and exclude the backend packages from pre-bundling:

optimizeDeps: {
  exclude: ['@runanywhere/web-llamacpp', '@runanywhere/web-onnx'],
},

Model Load Fails

// Check storage quota
const estimate = await navigator.storage.estimate();
console.log('Used:', estimate.usage);
console.log('Quota:', estimate.quota);

if (estimate.usage / estimate.quota > 0.9) {
  // Clear old models via ModelManager
}

Slow Generation

// Use smaller quantized models (Q4_0 for web)
// Check if WebGPU is active:
console.log('Acceleration:', LlamaCPP.accelerationMode);

// Use streaming for better perceived performance
const { stream } = TextGeneration.generateStream(prompt);
for await (const token of stream) {
  // Display immediately
}

VLM Worker Not Initializing

Ensure the worker is imported with the correct Vite syntax and worker.format is set:

// vite.config.ts
worker: { format: 'es' }

// In your code
import vlmWorkerUrl from './workers/vlm-worker?worker&url';
VLMWorkerBridge.shared.workerUrl = vlmWorkerUrl;

Resources