---
title: "API Client: Deep Integration with the Anthropic API"
slug: 30-api-client
part: 8
order: 30
description: "Deep dive into Claude Code's API client — streaming response handling, model selection and fallback, Beta APIs, token counting and cost tracking, retry logic"
tags:
  - api
  - anthropic
  - streaming
  - cost-tracking
readTime: 10
moduleCount: 3
modules:
  - services/api/
  - constants/
  - utils/
---

The Problem

The moment you type a question in Claude Code and press Enter, a precisely orchestrated chain of operations begins: the system builds the message array, selects the appropriate model, adds Beta headers, receives the response via SSE streaming, parses token usage in real time, calculates costs, handles potential 429/529 errors, and falls back to an alternative model if necessary. All of this completes within 1-2 seconds — the user simply sees text start flowing.

Claude Code's API client is not a simple HTTP wrapper — it's a complex system encompassing retry logic, fallback, caching, cost tracking, and multi-provider adaptation. This article provides an in-depth analysis of every layer of this system.


API Client Layer Architecture

graph TD
  subgraph "Application Layer"
    QUERY["query.ts (main loop)"]
    QUERY --> CLAUDE["claude.ts (API orchestration)"]
  end

  subgraph "Retry Layer"
    CLAUDE --> RETRY["withRetry.ts"]
    RETRY --> FALLBACK["FallbackTriggeredError"]
    RETRY --> CANNOT["CannotRetryError"]
  end

  subgraph "SDK Layer"
    RETRY --> CLIENT["client.ts (SDK initialization)"]
    CLIENT --> ANTHROPIC["Anthropic SDK"]
    CLIENT --> BEDROCK["AWS Bedrock"]
    CLIENT --> VERTEX["GCP Vertex AI"]
    CLIENT --> FOUNDRY["Azure Foundry"]
  end

  subgraph "Support Layer"
    CLAUDE --> COST["cost-tracker.ts"]
    CLAUDE --> CACHE["promptCacheBreakDetection.ts"]
    CLAUDE --> BOOT["bootstrap.ts"]
  end

  style QUERY fill:#1e3a5f,color:#e0e0e0
  style CLAUDE fill:#1e3a5f,color:#e0e0e0
  style RETRY fill:#3d2e1a,color:#e0e0e0
  style FALLBACK fill:#3d2e1a,color:#e0e0e0
  style CANNOT fill:#3d2e1a,color:#e0e0e0
  style CLIENT fill:#1a3d1e,color:#e0e0e0
  style ANTHROPIC fill:#1a3d1e,color:#e0e0e0
  style BEDROCK fill:#1a3d1e,color:#e0e0e0
  style VERTEX fill:#1a3d1e,color:#e0e0e0
  style FOUNDRY fill:#1a3d1e,color:#e0e0e0
  style COST fill:#3d1a1e,color:#e0e0e0
  style CACHE fill:#3d1a1e,color:#e0e0e0
  style BOOT fill:#3d1a1e,color:#e0e0e0

Multi-Provider Client

Claude Code supports four API providers, each with different authentication and configuration methods:

// src/services/api/client.ts (lines 36-71, comment excerpts)
// Direct API:
//   ANTHROPIC_API_KEY: Required for direct API access
//
// AWS Bedrock:
//   AWS credentials configured via aws-sdk defaults
//   AWS_REGION or AWS_DEFAULT_REGION
//   ANTHROPIC_SMALL_FAST_MODEL_AWS_REGION: Optional override for Haiku
//
// Foundry (Azure):
//   ANTHROPIC_FOUNDRY_RESOURCE: Azure resource name
//   ANTHROPIC_FOUNDRY_BASE_URL: Alternative full base URL
//
// Vertex AI:
//   Model-specific region variables (VERTEX_REGION_CLAUDE_*)
//   CLOUD_ML_REGION: Default GCP region
//   ANTHROPIC_VERTEX_PROJECT_ID: Required GCP project ID

Client initialization also accounts for debugging: when debug mode is active, SDK log output is redirected to stderr:

// src/services/api/client.ts (lines 73-86)
function createStderrLogger(): ClientOptions['logger'] {
  return {
    error: (msg, ...args) =>
      console.error('[Anthropic SDK ERROR]', msg, ...args),
    warn: (msg, ...args) =>
      console.error('[Anthropic SDK WARN]', msg, ...args),
    info: (msg, ...args) =>
      console.error('[Anthropic SDK INFO]', msg, ...args),
    debug: (msg, ...args) =>
      console.error('[Anthropic SDK DEBUG]', msg, ...args),
  }
}

Beta Headers Management

Claude Code uses numerous Beta API features, declared via the anthropic-beta header:

// src/services/api/claude.ts (lines 134-143)
import {
  AFK_MODE_BETA_HEADER,
  CONTEXT_1M_BETA_HEADER,
  CONTEXT_MANAGEMENT_BETA_HEADER,
  EFFORT_BETA_HEADER,
  FAST_MODE_BETA_HEADER,
  PROMPT_CACHING_SCOPE_BETA_HEADER,
  REDACT_THINKING_BETA_HEADER,
  STRUCTURED_OUTPUTS_BETA_HEADER,
  TASK_BUDGETS_BETA_HEADER,
} from 'src/constants/betas.js'

These Beta features include:

| Beta Header | Feature |
| --- | --- |
| CONTEXT_1M_BETA_HEADER | 1M token context window |
| CONTEXT_MANAGEMENT_BETA_HEADER | Server-side context management |
| FAST_MODE_BETA_HEADER | Fast mode (reduced latency) |
| EFFORT_BETA_HEADER | Effort control (adjusts reasoning depth) |
| PROMPT_CACHING_SCOPE_BETA_HEADER | Prompt caching scope |
| REDACT_THINKING_BETA_HEADER | Thinking content redaction |
| STRUCTURED_OUTPUTS_BETA_HEADER | Structured outputs |
| TASK_BUDGETS_BETA_HEADER | Task budget control |
| AFK_MODE_BETA_HEADER | Away mode (background execution optimization) |
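The Anthropic API accepts multiple beta flags as a single comma-separated `anthropic-beta` header value. A minimal helper for assembling it (the comma-separated convention is documented API behavior; the function itself is a sketch, not the code from claude.ts):

```typescript
// Join enabled beta flags into one `anthropic-beta` header value,
// de-duplicating while preserving order. Returns {} when no betas apply.
function buildBetaHeader(flags: string[]): Record<string, string> {
  const unique = [...new Set(flags)]
  return unique.length > 0 ? { 'anthropic-beta': unique.join(',') } : {}
}
```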

Extra Body Parameters

Users can inject additional API parameters via the CLAUDE_CODE_EXTRA_BODY environment variable:

// src/services/api/claude.ts (lines 272-331)
export function getExtraBodyParams(betaHeaders?: string[]): JsonObject {
  const extraBodyStr = process.env.CLAUDE_CODE_EXTRA_BODY
  let result: JsonObject = {}

  if (extraBodyStr) {
    try {
      const parsed = safeParseJSON(extraBodyStr)
      if (parsed && typeof parsed === 'object' && !Array.isArray(parsed)) {
        // Shallow clone — safeParseJSON is LRU-cached and returns the
        // same object reference. Mutating result would poison the cache.
        result = { ...(parsed as JsonObject) }
      }
    } catch (error) {
      logForDebugging(`Error parsing CLAUDE_CODE_EXTRA_BODY: ${errorMessage(error)}`)
    }
  }

  // Anti-distillation: send fake_tools opt-in for 1P CLI only
  if (feature('ANTI_DISTILLATION_CC') /* && first-party gate check (elided) */) {
    result.anti_distillation = ['fake_tools']
  }

  return result
}

Note the shallow clone — safeParseJSON uses an LRU cache, so directly mutating the return value would poison the cache, causing subsequent calls to see the modified value.
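The hazard is easy to reproduce with any memoizing parser. In this toy model (a plain Map standing in for the LRU cache), mutating the cached object would be visible to every later caller, while the spread copy is safe:

```typescript
// Toy memoized parser: the same input string returns the same object reference.
const parseCache = new Map<string, unknown>()
function cachedParse(s: string): unknown {
  if (!parseCache.has(s)) parseCache.set(s, JSON.parse(s))
  return parseCache.get(s)
}

const first = cachedParse('{"x":1}') as Record<string, unknown>
const copy = { ...first }                // shallow clone, as getExtraBodyParams does
copy.anti_distillation = ['fake_tools']  // mutate only the clone

const second = cachedParse('{"x":1}') as Record<string, unknown>
// second === first, and neither carries the mutation made on the clone
```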


Prompt Cache Control

Prompt caching can be controlled at per-model granularity:

// src/services/api/claude.ts (line 333 onwards)
export function getPromptCachingEnabled(model: string): boolean {
  if (isEnvTruthy(process.env.DISABLE_PROMPT_CACHING)) return false
  if (isEnvTruthy(process.env.DISABLE_PROMPT_CACHING_HAIKU)) {
    if (model === getSmallFastModel()) return false
  }
  if (isEnvTruthy(process.env.DISABLE_PROMPT_CACHING_SONNET)) {
    if (model === getDefaultSonnetModel()) return false
  }
  // ...
}

This per-model disabling design stems from practical needs — the cache creation cost for some models may not be worthwhile (for instance, Haiku is already inexpensive, and the cache creation fee can actually exceed the savings).
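The `isEnvTruthy` helper is not shown in the excerpt; a common convention for such helpers (an assumption about its behavior, not the verbatim implementation) is:

```typescript
// Treat "1", "true", "yes", "on" (any case, surrounding whitespace ignored)
// as enabling the flag; unset or anything else is falsy.
function isEnvTruthy(value: string | undefined): boolean {
  if (!value) return false
  return ['1', 'true', 'yes', 'on'].includes(value.trim().toLowerCase())
}
```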


Retry System

The retry logic is the most complex part of the API client, defined in withRetry.ts.

Retry Configuration

// src/services/api/withRetry.ts (lines 53-56)
const DEFAULT_MAX_RETRIES = 10
const FLOOR_OUTPUT_TOKENS = 3000
const MAX_529_RETRIES = 3
export const BASE_DELAY_MS = 500
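BASE_DELAY_MS seeds the backoff schedule. A typical exponential-backoff-with-jitter shape built on it (illustrative only; the real withRetry.ts also honors server retry-after headers, as the state machine below shows):

```typescript
const BASE_DELAY_MS = 500

// Exponential backoff with "equal jitter": half deterministic, half random,
// capped so a long retry run cannot produce unbounded waits.
function backoffDelayMs(attempt: number, capMs = 32_000): number {
  const exp = Math.min(BASE_DELAY_MS * 2 ** attempt, capMs)
  return exp / 2 + Math.random() * (exp / 2)
}
```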

Foreground vs. Background Query Sources

Not all queries should be retried. Background queries (summaries, title generation, classifiers) give up immediately on 529 errors — they are not what the user is waiting for, and retrying them would only add load while the service is already over capacity:

// src/services/api/withRetry.ts (lines 62-82)
const FOREGROUND_529_RETRY_SOURCES = new Set<QuerySource>([
  'repl_main_thread',
  'repl_main_thread:outputStyle:custom',
  'repl_main_thread:outputStyle:Explanatory',
  'repl_main_thread:outputStyle:Learning',
  'sdk',
  'agent:custom',
  'agent:default',
  'agent:builtin',
  'compact',
  'hook_agent',
  'hook_prompt',
  'verification_agent',
  'side_question',
  'auto_mode',
])

Retry State Machine

stateDiagram-v2
  [*] --> Attempt: Start request
  Attempt --> Success: 200 OK
  Attempt --> Error: API error

  Error --> CheckType: Analyze error type

  CheckType --> FastModeFallback: 429/529 + Fast Mode
  CheckType --> Auth401: 401 Authentication failure
  CheckType --> Overloaded529: 529 Overloaded
  CheckType --> RateLimit429: 429 Rate limited
  CheckType --> StaleConn: ECONNRESET/EPIPE
  CheckType --> CannotRetry: Other errors

  FastModeFallback --> ShortWait: retry-after < threshold
  FastModeFallback --> Cooldown: retry-after >= threshold
  ShortWait --> Attempt
  Cooldown --> Attempt: Standard speed retry

  Auth401 --> RefreshToken: Refresh OAuth Token
  RefreshToken --> NewClient: Rebuild client
  NewClient --> Attempt

  Overloaded529 --> Count529: Count consecutive 529s
  Count529 --> ModelFallback: >= MAX_529_RETRIES
  Count529 --> BackoffWait: < MAX_529_RETRIES
  BackoffWait --> Attempt

  RateLimit429 --> ReadRetryAfter: Read retry-after
  ReadRetryAfter --> BackoffWait

  StaleConn --> DisableKeepAlive: Disable connection pool
  DisableKeepAlive --> NewClient

  ModelFallback --> FallbackError: Throw FallbackTriggeredError
  CannotRetry --> [*]: Throw CannotRetryError

  Success --> [*]
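The diagram's first branch point, mapping an error to a handling strategy, can be sketched as a pure classifier (a simplification; the real code inspects APIError instances, response headers, and Node socket error codes):

```typescript
type ErrorClass = 'auth' | 'overloaded' | 'rate-limit' | 'stale-conn' | 'cannot-retry'

// Classify by HTTP status or socket error code, mirroring the CheckType
// state above. Anything unrecognized is treated as non-retryable.
function classifyError(status: number | null, code?: string): ErrorClass {
  if (code === 'ECONNRESET' || code === 'EPIPE') return 'stale-conn'
  if (status === 401) return 'auth'
  if (status === 429) return 'rate-limit'
  if (status === 529) return 'overloaded'
  return 'cannot-retry'
}
```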

Fast Mode Fallback

Fast Mode is a low-latency mode. When rate-limited, the system must decide whether to wait (preserving cache hits) or fall back (switching to standard speed):

// src/services/api/withRetry.ts (lines 267-305)
if (wasFastModeActive && !isPersistentRetryEnabled() &&
    error instanceof APIError &&
    (error.status === 429 || is529Error(error))) {
  // Overage limit — permanently disable fast mode
  const overageReason = error.headers?.get(
    'anthropic-ratelimit-unified-overage-disabled-reason',
  )
  if (overageReason !== null && overageReason !== undefined) {
    handleFastModeOverageRejection(overageReason)
    retryContext.fastMode = false
    continue
  }

  const retryAfterMs = getRetryAfterMs(error)
  if (retryAfterMs !== null && retryAfterMs < SHORT_RETRY_THRESHOLD_MS) {
    // Short wait — keep fast mode to protect prompt cache
    await sleep(retryAfterMs, options.signal, { abortError })
    continue
  }

  // Long wait or unknown — enter cooldown period (switch to standard speed)
  const cooldownMs = Math.max(
    retryAfterMs ?? DEFAULT_FAST_MODE_FALLBACK_HOLD_MS,
    MIN_COOLDOWN_MS,
  )
  triggerFastModeCooldown(Date.now() + cooldownMs, cooldownReason)
  retryContext.fastMode = false
  continue
}

The decision logic:

  • retry-after < threshold — short wait, keep fast mode (protects the prompt cache from invalidation)
  • retry-after >= threshold, or unknown — enter a cooldown period, switch to standard speed
  • Overage limit — permanently disable fast mode
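The three-way decision can be written as a pure function. The constants below are illustrative placeholders, since the excerpt does not show the real values of SHORT_RETRY_THRESHOLD_MS, DEFAULT_FAST_MODE_FALLBACK_HOLD_MS, or MIN_COOLDOWN_MS:

```typescript
type FastModeDecision =
  | { kind: 'short-wait'; waitMs: number }
  | { kind: 'cooldown'; untilMs: number }
  | { kind: 'disable' }

// Assumed constants, for illustration only.
const SHORT_RETRY_THRESHOLD_MS = 10_000
const DEFAULT_FAST_MODE_FALLBACK_HOLD_MS = 60_000
const MIN_COOLDOWN_MS = 30_000

function decideFastMode(
  retryAfterMs: number | null,
  overageReason: string | null,
  nowMs: number,
): FastModeDecision {
  if (overageReason !== null) return { kind: 'disable' }  // permanent
  if (retryAfterMs !== null && retryAfterMs < SHORT_RETRY_THRESHOLD_MS) {
    return { kind: 'short-wait', waitMs: retryAfterMs }   // keep the cache warm
  }
  const cooldownMs = Math.max(
    retryAfterMs ?? DEFAULT_FAST_MODE_FALLBACK_HOLD_MS,
    MIN_COOLDOWN_MS,
  )
  return { kind: 'cooldown', untilMs: nowMs + cooldownMs } // standard speed
}
```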

Authentication Error Recovery

// src/services/api/withRetry.ts (lines 218-251)
const isStaleConnection = isStaleConnectionError(lastError)
if (isStaleConnection && getFeatureValue_CACHED_MAY_BE_STALE(...)) {
  disableKeepAlive()  // Disable connection pool, rebuild connection
}

if (
  client === null ||
  (lastError instanceof APIError && lastError.status === 401) ||
  isOAuthTokenRevokedError(lastError) ||
  isBedrockAuthError(lastError) ||
  isVertexAuthError(lastError) ||
  isStaleConnection
) {
  if ((lastError instanceof APIError && lastError.status === 401) ||
      isOAuthTokenRevokedError(lastError)) {
    const failedAccessToken = getClaudeAIOAuthTokens()?.accessToken
    if (failedAccessToken) {
      await handleOAuth401Error(failedAccessToken)
    }
  }
  client = await getClient()  // Rebuild client
}

Authentication recovery covers special cases for all providers:

  • Anthropic 1P — refresh OAuth token on 401
  • AWS Bedrock — 403 or CredentialsProviderError
  • GCP Vertex — credential refresh failure
  • Connection reset — disable keep-alive and reconnect on ECONNRESET/EPIPE

Consecutive 529 Errors and Model Fallback

// src/services/api/withRetry.ts (lines 327-348)
if (is529Error(error) &&
    (process.env.FALLBACK_FOR_ALL_PRIMARY_MODELS ||
     (!isClaudeAISubscriber() && isNonCustomOpusModel(options.model)))) {
  consecutive529Errors++
  if (consecutive529Errors >= MAX_529_RETRIES) {
    if (options.fallbackModel) {
      throw new FallbackTriggeredError(
        options.model,
        options.fallbackModel,
      )
    }
  }
}

After 3 consecutive 529 errors, a model fallback is triggered (e.g., Opus to Sonnet). FallbackTriggeredError is caught and handled by query.ts — existing assistant messages are cleared, the model is switched, and the entire request is retried.
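A minimal model of this counter (FallbackTriggeredError's real constructor is not shown in the excerpt; the two-argument form here matches the call site above):

```typescript
const MAX_529_RETRIES = 3

class FallbackTriggeredError extends Error {
  constructor(readonly originalModel: string, readonly fallbackModel: string) {
    super(`falling back from ${originalModel} to ${fallbackModel}`)
  }
}

// Returns the updated consecutive-529 count; throws once the limit is hit
// and a fallback model is configured. Any non-529 outcome resets the count.
function track529(count: number, is529: boolean, model: string, fallbackModel?: string): number {
  if (!is529) return 0
  const next = count + 1
  if (next >= MAX_529_RETRIES && fallbackModel) {
    throw new FallbackTriggeredError(model, fallbackModel)
  }
  return next
}
```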

Persistent Retry (Unattended Mode)

For automation scenarios (CI/CD, cron jobs), the system supports unlimited retries:

// src/services/api/withRetry.ts (lines 96-104)
const PERSISTENT_MAX_BACKOFF_MS = 5 * 60 * 1000    // 5 minute max backoff
const PERSISTENT_RESET_CAP_MS = 6 * 60 * 60 * 1000 // 6 hour timeout
const HEARTBEAT_INTERVAL_MS = 30_000                 // 30 second heartbeat

function isPersistentRetryEnabled(): boolean {
  return feature('UNATTENDED_RETRY')
    ? isEnvTruthy(process.env.CLAUDE_CODE_UNATTENDED_RETRY)
    : false
}

Persistent retry sends heartbeats via SystemAPIErrorMessage, preventing the host environment (such as a container orchestration system) from marking the session as idle.
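Under these constants, the per-attempt delay grows but never exceeds five minutes, and the loop stops retrying once the six-hour window has elapsed. A sketch using the quoted constants:

```typescript
const PERSISTENT_MAX_BACKOFF_MS = 5 * 60 * 1000
const PERSISTENT_RESET_CAP_MS = 6 * 60 * 60 * 1000

// Per-attempt delay: exponential, capped at five minutes.
function persistentDelayMs(attempt: number, baseMs = 500): number {
  return Math.min(baseMs * 2 ** attempt, PERSISTENT_MAX_BACKOFF_MS)
}

// Whether the persistent-retry loop should keep going at all.
function withinPersistentWindow(startedAtMs: number, nowMs: number): boolean {
  return nowMs - startedAtMs < PERSISTENT_RESET_CAP_MS
}
```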


Cost Tracking

Every API response updates the cost state:

// src/cost-tracker.ts (lines 71-79)
type StoredCostState = {
  totalCostUSD: number
  totalAPIDuration: number
  totalAPIDurationWithoutRetries: number
  totalToolDuration: number
  totalLinesAdded: number
  totalLinesRemoved: number
  lastDuration: number | undefined
  modelUsage: { [modelName: string]: ModelUsage } | undefined
}

Cost calculation uses the calculateUSDCost function based on per-model pricing tables:

// src/services/api/claude.ts (line 146)
import { addToTotalSessionCost } from 'src/cost-tracker.js'
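A sketch of what such a per-model price lookup might look like. The prices and model name are placeholders, not the real tables; the actual calculation also distinguishes cache-creation and cache-read tokens:

```typescript
type TokenUsage = { inputTokens: number; outputTokens: number }

// Placeholder prices in USD per million tokens.
const PRICE_PER_MTOK: Record<string, { input: number; output: number }> = {
  'example-sonnet': { input: 3, output: 15 },
}

function usdCost(model: string, usage: TokenUsage): number {
  const price = PRICE_PER_MTOK[model]
  if (!price) return 0
  return (
    (usage.inputTokens * price.input + usage.outputTokens * price.output) /
    1_000_000
  )
}
```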

The cost state is not just for display — it is saved to the project configuration during session switches and read back on resume:

// src/cost-tracker.ts (lines 143-158)
export function saveCurrentSessionCosts(fpsMetrics?: FpsMetrics): void {
  saveCurrentProjectConfig(current => ({
    ...current,
    lastCost: getTotalCostUSD(),
    lastAPIDuration: getTotalAPIDuration(),
    lastAPIDurationWithoutRetries: getTotalAPIDurationWithoutRetries(),
    lastToolDuration: getTotalToolDuration(),
    lastDuration: getTotalDuration(),
    // ...
  }))
}

Bootstrap API

At startup, the system fetches server-side configuration via the Bootstrap API:

// src/services/api/bootstrap.ts (lines 42-100)
async function fetchBootstrapAPI(): Promise<BootstrapResponse | null> {
  if (isEssentialTrafficOnly()) return null  // Skip in privacy mode
  if (getAPIProvider() !== 'firstParty') return null  // Skip for third-party providers

  // OAuth preferred, API Key fallback
  const hasUsableOAuth =
    getClaudeAIOAuthTokens()?.accessToken && hasProfileScope()
  if (!hasUsableOAuth && !apiKey) return null

  const endpoint = `${getOauthConfig().BASE_API_URL}/api/claude_cli/bootstrap`

  return await withOAuth401Retry(async () => {
    const token = getClaudeAIOAuthTokens()?.accessToken
    // Re-read OAuth token each time (retry may have refreshed it)
    let authHeaders: Record<string, string>
    if (token && hasProfileScope()) {
      authHeaders = { Authorization: `Bearer ${token}`, ... }
    } else if (apiKey) {
      authHeaders = { 'x-api-key': apiKey }
    } else {
      return null
    }

    const response = await axios.get(endpoint, {
      headers: { ...authHeaders },
      timeout: 5000,
    })
    return bootstrapResponseSchema().safeParse(response.data)
  })
}

The data returned by Bootstrap includes:

  • client_data — client configuration
  • additional_model_options — list of additional available models

The 5-second timeout ensures startup doesn't hang due to network issues.


Streaming Response Handling

The main loop in query.ts consumes streaming responses via for await...of. Key processing logic includes:

Fallback Handling

When model fallback is triggered during streaming, partially received messages need to be discarded:

// src/query.ts (lines 709-741)
if (streamingFallbackOccured) {
  // Generate tombstones for already-emitted messages
  for (const msg of assistantMessages) {
    yield { type: 'tombstone' as const, message: msg }
  }

  assistantMessages.length = 0
  toolResults.length = 0
  toolUseBlocks.length = 0
  needsFollowUp = false

  // Discard pending results from the streaming tool executor
  if (streamingToolExecutor) {
    streamingToolExecutor.discard()
    streamingToolExecutor = new StreamingToolExecutor(
      toolUseContext.options.tools,
      canUseTool,
      toolUseContext,
    )
  }
}

Tombstone messages tell the UI and transcript to remove these partial messages — it is particularly important to remove incomplete thinking blocks, as they carry model-specific signatures that would cause API errors after falling back to a different model.

Error Suppression and Recovery

Certain API errors are recoverable — the system suppresses them within the streaming loop and attempts recovery after the stream ends:

// src/query.ts (lines 800-825)
let withheld = false
if (feature('CONTEXT_COLLAPSE')) {
  if (contextCollapse?.isWithheldPromptTooLong(message, ...)) {
    withheld = true
  }
}
if (reactiveCompact?.isWithheldPromptTooLong(message)) {
  withheld = true
}
if (mediaRecoveryEnabled && reactiveCompact?.isWithheldMediaSizeError(message)) {
  withheld = true
}
if (isWithheldMaxOutputTokens(message)) {
  withheld = true
}
if (!withheld) {
  yield yieldMessage
}

Suppressed messages still join the assistantMessages array — the recovery logic needs to inspect them. However, they are not sent to SDK consumers, as those consumers (such as desktop applications) might terminate the session upon seeing an error.
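The pattern, reduced to its essentials: every message is recorded for the recovery logic, but withheld ones are never yielded to the consumer. A toy generator (the Msg shape and error strings are illustrative):

```typescript
type Msg = { id: number; apiError?: string }

const WITHHELD_ERRORS = new Set(['prompt_too_long', 'media_too_large'])

// Records every message in `recorded` (so recovery logic can inspect it)
// but only yields messages that are safe to show to SDK consumers.
function* filterWithheld(messages: Msg[], recorded: Msg[]): Generator<Msg> {
  for (const msg of messages) {
    recorded.push(msg)
    if (msg.apiError && WITHHELD_ERRORS.has(msg.apiError)) continue
    yield msg
  }
}
```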


Request Construction Details

Tool Schema Conversion

Each tool definition needs to be converted to an API-compatible format, including handling of deferred tools:

// Reference: src/services/api/claude.ts
import {
  formatDeferredToolLine,
  isDeferredTool,
  TOOL_SEARCH_TOOL_NAME,
} from '../../tools/ToolSearchTool/prompt.js'

Advisor Mode

When Advisor is enabled, an additional model (such as Opus advising Sonnet) participates in decision-making:

// src/services/api/claude.ts (lines 150-155)
import {
  ADVISOR_TOOL_INSTRUCTIONS,
  getExperimentAdvisorModels,
  isAdvisorEnabled,
  isValidAdvisorModel,
  modelSupportsAdvisor,
} from 'src/utils/advisor.js'

Session Activity Tracking

During API requests the session is marked as active, which remote environments use for resource management:

// src/services/api/claude.ts (lines 208-210)
import {
  startSessionActivity,
  stopSessionActivity,
} from '../../utils/sessionActivity.js'

Summary

Claude Code's API client is a multi-layered defense system:

  • Multi-provider abstraction — unified interface for Anthropic/Bedrock/Vertex/Foundry, configured via environment variables
  • Layered retry — different strategies for different error types (authentication/rate-limiting/overload/connection reset)
  • Intelligent fallback — Fast Mode to standard speed to alternative model, with sound decision logic at each step
  • Streaming error suppression — recoverable errors are not immediately exposed to consumers, giving the system a chance to recover
  • Full-chain cost tracking — from API response to project configuration persistence, with support for session resumption
  • Operational knobs — prompt caching, Fast Mode, retry strategies, and more are all controllable via environment variables and feature flags

The complexity of this system is not accidental — it reflects the reality that production AI applications face: networks are unreliable, services get overloaded, credentials expire, and users need an uninterrupted experience. Every layer of protection corresponds to a real-world failure mode.