Live API Guide

The Live API enables low-latency, real-time voice and video interactions with Gemini. It processes continuous streams of audio, video, or text to deliver immediate, human-like spoken responses, creating a natural conversational experience.

Table of Contents

  1. Overview
  2. Implementation Approaches
  3. WebSocket Connection
  4. Supported Modalities
  5. Models and Response Modalities
  6. Establishing a Session
  7. Sending Content
  8. Receiving Responses
  9. Voice Activity Detection
  10. Native Audio Features
  11. Tool Use and Function Calling
  12. Session Management
  13. Ephemeral Tokens
  14. Limitations
  15. Examples

Overview

The Live API is a stateful, bidirectional streaming API built on WebSockets. Unlike the standard generateContent API, the Live API maintains a persistent connection where you can:

  • Send text, audio, or video continuously to the Gemini server
  • Receive audio, text, or function call requests from the Gemini server
  • Interrupt model responses mid-generation
  • Resume sessions after disconnection
  • Use automatic voice activity detection for hands-free conversations

Key Features

  • Voice Activity Detection (VAD): Automatic detection of when users start and stop speaking
  • Tool Use and Function Calling: Execute functions during real-time conversations
  • Session Management: Resume sessions, compress context windows, handle graceful disconnections
  • Ephemeral Tokens: Secure client-side authentication for browser/mobile applications
  • Native Audio: Natural speech output with affective dialog and proactive responses (v1alpha)

Implementation Approaches

When integrating with the Live API, choose between:

Server-to-Server

Your backend connects to the Live API using WebSockets. The client sends stream data (audio, video, text) to your server, which then forwards it to the Live API.

Client App -> Your Backend -> Live API

Client-to-Server

Your frontend connects directly to the Live API using WebSockets, bypassing your backend.

Client App -> Live API

Client-to-server offers better performance for streaming audio and video since it eliminates the hop through your backend. However, for production environments, use ephemeral tokens instead of standard API keys to mitigate security risks.

WebSocket Connection

Endpoint

The Live API uses WebSocket connections to the following endpoints:

Gemini API (AI Studio):

wss://generativelanguage.googleapis.com/ws/google.ai.generativelanguage.v1beta.GenerativeService.BidiGenerateContent

Vertex AI:

wss://{location}-aiplatform.googleapis.com/ws/google.cloud.aiplatform.v1.LlmBidiService/BidiGenerateContent

The Vertex Live API requires billing to be enabled on the target GCP project. Without billing, the server rejects setup and closes the connection with policy error 1008.

API Version

For Gemini API (auth: :gemini) connections, the standard API version is v1beta. Some features require v1alpha:

  • Affective dialog
  • Proactive audio
  • Ephemeral tokens

Set the API version per session:

alias Gemini.Live.Models
alias Gemini.Live.Session

{:ok, session} = Session.start_link(
  model: Models.resolve(:audio),
  auth: :gemini,
  api_version: "v1alpha",      # required for native audio extras
  generation_config: %{response_modalities: ["AUDIO"]}
)

For Vertex AI (auth: :vertex_ai) connections, gemini_ex uses the Vertex Live v1 endpoint.

This library abstracts the WebSocket connection details. You interact through the Gemini.Live.Session module.

Backend Schema Differences

Gemini Live and Vertex Live use slightly different wire fields for response usage metadata:

  • Gemini Live sends responseTokenCount and responseTokensDetails
  • Vertex Live v1 sends candidatesTokenCount and candidatesTokensDetails
  • Vertex Live may also include turnCompleteReason on serverContent

gemini_ex normalizes both backends into the same Live types:

  • Gemini.Types.Live.UsageMetadata.candidates_token_count and candidates_tokens_details are the canonical output-token fields
  • response_token_count and response_tokens_details are retained as backwards-compatible aliases
  • Gemini.Types.Live.UsageMetadata.output_token_count/1 and output_tokens_details/1 return the normalized output view
  • Gemini.Types.Live.ServerContent.turn_complete_reason is parsed as a Gemini.Types.Live.Enums.TurnCompleteReason value when present

Example callback code that works across both backends:

on_message: fn
  %{server_content: content, usage_metadata: usage} ->
    output_tokens = Gemini.Types.Live.UsageMetadata.output_token_count(usage)
    reason = if content, do: content.turn_complete_reason

    IO.inspect(%{
      output_tokens: output_tokens,
      turn_complete_reason: reason
    })

  _ ->
    :ok
end

Session Configuration

The initial message after establishing the WebSocket connection sets the session configuration:

alias Gemini.Live.Models

%{
  model: Models.resolve(:audio),
  generation_config: %{
    response_modalities: ["AUDIO"],
    temperature: 0.7,
    speech_config: %{voice_config: %{prebuilt_voice_config: %{voice_name: "Kore"}}}
  },
  system_instruction: "You are a helpful assistant.",
  tools: [%{function_declarations: [...]}]
}

Configuration cannot be updated while the connection is open. However, you can change parameters (except the model) when resuming via session resumption.
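
For example, a minimal sketch of adjusting temperature across a resumed connection; the resumption options used here are described in detail under Session Management below:

alias Gemini.Live.Models
alias Gemini.Live.Session

# First connection: enable resumption so a handle is issued
{:ok, s1} = Session.start_link(
  model: Models.resolve(:audio),
  auth: :gemini,
  generation_config: %{response_modalities: ["AUDIO"], temperature: 0.7},
  session_resumption: %{}
)
:ok = Session.connect(s1)
handle = Session.get_session_handle(s1)
Session.close(s1)

# Resumed connection: same model, adjusted temperature
{:ok, s2} = Session.start_link(
  model: Models.resolve(:audio),
  auth: :gemini,
  generation_config: %{response_modalities: ["AUDIO"], temperature: 0.3},
  resume_handle: handle,
  session_resumption: %{}
)
:ok = Session.connect(s2)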

Supported Modalities

Input Modalities

| Modality | Format | Notes |
| --- | --- | --- |
| Audio | 16-bit PCM, little-endian | Input natively at 16kHz; the API resamples other rates. MIME type: audio/pcm;rate=16000 |
| Video | JPEG/PNG frames | Sent as base64-encoded blobs |
| Text | UTF-8 string | Via clientContent or realtimeInput |

Output Modalities

| Modality | Format | Notes |
| --- | --- | --- |
| Audio | 16-bit PCM, 24kHz | Native audio output models only |
| Text | UTF-8 string | Model-dependent; validate against the selected model |

Important: You can only set one response modality per session. Support is model-specific, and Session.connect/1 rejects unsupported combinations before opening the WebSocket.
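
A sketch of what that local validation looks like in practice. Only the fact that connect/1 fails before any WebSocket traffic is documented above; the exact error shape shown here is an assumption:

alias Gemini.Live.Models
alias Gemini.Live.Session

{:ok, session} = Session.start_link(
  model: Models.resolve(:audio),
  auth: :gemini,
  generation_config: %{response_modalities: ["TEXT"]}   # a modality the selected model may not support
)

case Session.connect(session) do
  :ok -> :ready
  {:error, reason} -> IO.inspect(reason, label: "Rejected before the WebSocket opened")
end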

Models and Response Modalities

Native Audio Models (Recommended for Voice)

Native audio output provides natural, realistic-sounding speech with improved multilingual performance. Use these models when you need audio responses:

alias Gemini.Live.Models

# Resolve a Live audio model available for this key
model = Models.resolve(:audio)

Native audio models support:

  • 128k token context window
  • Affective (emotion-aware) dialogue (v1alpha)
  • Proactive audio responses (v1alpha)
  • Thinking capabilities

Text UX over Live Audio Sessions

Current Gemini Live models are audio-first. To build a text-oriented terminal or chat UI, use an audio session and enable output transcription. Start by resolving the audio model:

alias Gemini.Live.Models

# Resolve a Live audio model available for this key
model = Models.resolve(:audio)
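
A minimal sketch of that pattern: keep the response modality as AUDIO, enable output transcription, and render the transcription stream as the chat text (every option and callback used here appears elsewhere in this guide):

alias Gemini.Live.Session

{:ok, session} = Session.start_link(
  model: model,   # the audio model resolved above
  auth: :gemini,
  generation_config: %{response_modalities: ["AUDIO"]},
  output_audio_transcription: %{},
  on_transcription: fn
    {:output, %{"text" => text}} -> IO.write(text)   # model text for the terminal/chat UI
    _ -> :ok
  end
)

:ok = Session.connect(session)
:ok = Session.send_text(session, "Explain the Live API in one sentence.")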

Model Availability and Rollout Variability

Live API model availability can vary by project and rollout. The canonical Live docs may list newer models that are not yet enabled for your API key. When that happens, the Live API returns a 1008 close error like:

Publisher Model `projects/.../publishers/google/models/<model>` was not found
or is not supported for bidiGenerateContent

To make this robust, this library resolves a Live model at runtime based on your key's list_models results.

Use the resolver:

alias Gemini.Live.Models

audio_model = Models.resolve(:audio)

The resolver uses the model registry plus runtime list_models results for your credentials. For current Gemini Live usage, prefer Models.resolve(:audio). This library validates the session configuration locally so incompatible response modalities fail before the WebSocket opens.

If the audio model is not present in your Live-capable model list, audio sessions will not work for that key yet.

You can inspect what your key supports:

GEMINI_API_KEY=YOUR_KEY mix run -e 'alias Gemini.APIs.Coordinator; {:ok, resp}=Coordinator.list_models(); resp.models |> Enum.filter(&Enum.member?(&1.supported_generation_methods, "bidiGenerateContent")) |> Enum.each(fn m -> IO.puts(m.name) end)'

If you need to hardcode a model, prefer one of the resolver's fallback choices, since newer Live models may not yet appear in your key's list.

Session Limits

| Configuration | Duration Limit |
| --- | --- |
| Audio only | 15 minutes |
| Audio + Video | 2 minutes |

Use context window compression or session resumption to extend beyond these limits.
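
For example, a session that opts into both; these options are covered in detail under Session Management, so this is just a configuration sketch:

alias Gemini.Live.Models
alias Gemini.Live.Session

{:ok, session} = Session.start_link(
  model: Models.resolve(:audio),
  auth: :gemini,
  generation_config: %{response_modalities: ["AUDIO"]},
  session_resumption: %{},                             # reconnect with a handle after the limit
  context_window_compression: %{sliding_window: %{}}   # compress older turns automatically
)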

Establishing a Session

Basic Setup

alias Gemini.Live.Models
alias Gemini.Live.Session

# Resolve a model that is available for this API key
model = Models.resolve(:audio)

{:ok, session} = Session.start_link(
  model: model,
  auth: :gemini,  # or :vertex_ai
  generation_config: %{
    response_modalities: ["AUDIO"]
  },
  output_audio_transcription: %{},
  on_message: fn message ->
    IO.inspect(message, label: "Received")
  end
)

# Connect to the Live API
:ok = Session.connect(session)

# Session is now ready for messages

For Gemini sessions, you can also pass api_key: directly to Session.start_link/1. When api_key: is present and auth: is omitted, the session uses Gemini auth for that connection only.

{:ok, session} = Gemini.Live.Session.start_link(
  model: Gemini.Live.Models.resolve(:audio),
  api_key: "session-specific-key",
  generation_config: %{response_modalities: ["AUDIO"]},
  output_audio_transcription: %{}
)

Full Configuration Options

alias Gemini.Live.Models
alias Gemini.Live.Session

{:ok, session} = Session.start_link(
  # Required
  model: Models.resolve(:audio),

  # Authentication
  auth: :gemini,  # or :vertex_ai
  api_key: "session-specific-key",  # optional per-session Gemini override when using Gemini auth
  project_id: "your-project",  # required for :vertex_ai
  location: "us-central1",     # optional, default: "us-central1"
  api_version: "v1alpha",

  # Generation configuration
  generation_config: %{
    response_modalities: ["AUDIO"],
    temperature: 0.7,
    top_p: 0.95,
    speech_config: %{
      voice_config: %{
        prebuilt_voice_config: %{voice_name: "Kore"}
      }
    }
  },

  # System instruction
  system_instruction: "You are a helpful voice assistant.",

  # Tools for function calling
  tools: [%{function_declarations: [...]}],

  # Realtime input configuration
  realtime_input_config: %{
    automatic_activity_detection: %{
      disabled: false,  # true for manual VAD
      start_of_speech_sensitivity: "START_SENSITIVITY_HIGH",
      end_of_speech_sensitivity: "END_SENSITIVITY_HIGH"
    }
  },

  # Session management
  session_resumption: %{},           # Enable session resumption
  resume_handle: "previous-handle",  # Resume from previous session
  context_window_compression: %{sliding_window: %{}},

  # Audio transcription
  input_audio_transcription: %{},
  output_audio_transcription: %{},

  # Callbacks
  on_message: &handle_message/1,
  on_error: &handle_error/1,
  on_close: &handle_close/1,
  on_tool_call: &handle_tool_call/1,
  on_tool_call_cancellation: &handle_cancellation/1,
  on_transcription: &handle_transcription/1,
  on_voice_activity: &handle_voice_activity/1,
  on_session_resumption: &handle_resumption/1,
  on_go_away: &handle_go_away/1
)

Sending Content

The Live API provides two wire-level methods for sending content, each with different semantics, plus a helper that selects between them:

Preferred helper

Use Session.send_text/3 for text turns in portable code. It selects the correct transport for the connected model (clientContent for older models, realtimeInput.text for Gemini 3.1 Live).

Session.send_text(session, "What is the capital of France?")

clientContent (Ordered, Explicit Turns)

Use send_client_content/3 only when you specifically need the clientContent wire format. This method:

  • Adds content to the conversation history in order
  • Interrupts any current model generation
  • Requires explicit turn completion signal

Note: Gemini 3.1 Flash Live only supports clientContent for initial-history seeding. For ongoing text turns, use Session.send_text/3 or Session.send_realtime_input/2.

# Seed initial history before ongoing realtime turns
Session.send_client_content(session, [
  %{role: "user", parts: [%{text: "What is the capital of France?"}]},
  %{role: "model", parts: [%{text: "Paris"}]},
  %{role: "user", parts: [%{text: "What about Germany?"}]}
], turn_complete: true)

realtimeInput (Streaming, Optimized for Speed)

Use send_realtime_input/2 for continuous streaming data (audio, video, text). This method:

  • Streams data without interrupting model generation
  • Optimizes for low latency at the expense of deterministic ordering
  • Derives turn boundaries from activity detection (VAD)
  • Processes data incrementally before turn completion

# Send audio chunk (16-bit PCM, 16kHz mono)
Session.send_realtime_input(session, audio: %{
  data: pcm_data,  # binary data, will be Base64 encoded
  mime_type: "audio/pcm;rate=16000"
})

# Send video frame
Session.send_realtime_input(session, video: %{
  data: jpeg_data,
  mime_type: "image/jpeg"
})

# Send text via realtime input
Session.send_realtime_input(session, text: "Hello")

# Manual activity signaling (when automatic VAD is disabled)
Session.send_realtime_input(session, activity_start: true)
# ... send audio chunks ...
Session.send_realtime_input(session, activity_end: true)

# Signal audio stream pause (for automatic VAD)
Session.send_realtime_input(session, audio_stream_end: true)

Ordering Considerations

  • clientContent messages are added to context in order
  • realtimeInput is optimized for responsiveness; ordering across modalities is not guaranteed
  • If you mix clientContent and realtimeInput, the server attempts to optimize but provides no ordering guarantees
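
A sketch of the pattern these rules suggest: seed ordered history once with clientContent, then keep ongoing turns on the realtime path:

alias Gemini.Live.Models
alias Gemini.Live.Session

{:ok, session} = Session.start_link(
  model: Models.resolve(:audio),
  auth: :gemini,
  generation_config: %{response_modalities: ["AUDIO"]},
  output_audio_transcription: %{}
)
:ok = Session.connect(session)

# One ordered seed of prior history plus the current question
Session.send_client_content(session, [
  %{role: "user", parts: [%{text: "My name is Alice."}]},
  %{role: "model", parts: [%{text: "Nice to meet you, Alice."}]},
  %{role: "user", parts: [%{text: "What's my name?"}]}
], turn_complete: true)

# Subsequent turns stay on the low-latency realtime path
Session.send_realtime_input(session, text: "And what did I just ask you?")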

Receiving Responses

Responses are delivered through the on_message callback. The server sends BidiGenerateContentServerMessage which may contain:

Message Types

| Field | Description |
| --- | --- |
| setup_complete | Session setup successful |
| server_content | Model response content |
| tool_call | Function call request |
| tool_call_cancellation | Cancelled tool calls (due to interruption) |
| go_away | Session ending soon notice |
| session_resumption_update | New resumption handle |
| voice_activity | Voice activity signals |
| usage_metadata | Token usage information |

Server Content

on_message: fn message ->
  case message do
    %{server_content: content} when not is_nil(content) ->
      # Extract text
      if text = Gemini.Types.Live.ServerContent.extract_text(content) do
        IO.write(text)
      end

      # Handle audio output
      if content.model_turn && content.model_turn.parts do
        for part <- content.model_turn.parts do
          if audio_data = part[:inline_data] do
            # Process audio (24kHz PCM)
            play_audio(audio_data.data)
          end
        end
      end

      # Turn completion signals
      if content.turn_complete do
        IO.puts("\n[Turn complete]")
      end

      # Generation complete (before turn_complete when streaming)
      if content.generation_complete do
        IO.puts("[Generation complete]")
      end

      # Handle interruption
      if content.interrupted do
        IO.puts("[Interrupted by user]")
        clear_audio_queue()
      end

    _ -> :ok
  end
end

Transcription

When transcription is enabled, you receive transcriptions separately from content:

on_transcription: fn
  {:input, %{"text" => text}} ->
    IO.puts("User said: #{text}")

  {:output, %{"text" => text}} ->
    IO.puts("Model said: #{text}")
end

Voice Activity Detection

VAD allows the model to recognize when a person is speaking, enabling natural interruptions.

Automatic VAD (Default)

When automatic VAD is enabled, the model automatically detects speech and triggers responses:

alias Gemini.Live.Models
alias Gemini.Live.Session

{:ok, session} = Session.start_link(
  model: Models.resolve(:audio),
  auth: :gemini,
  generation_config: %{response_modalities: ["AUDIO"]},
  # VAD is enabled by default
  on_message: fn message ->
    case message do
      %{server_content: %{interrupted: true}} ->
        # User interrupted - clear playback queue
        clear_audio_playback()
      _ -> :ok
    end
  end
)

When the audio stream is paused (e.g., microphone turned off), send audio_stream_end to flush cached audio:

Session.send_realtime_input(session, audio_stream_end: true)

VAD Configuration

Fine-tune VAD behavior:

realtime_input_config: %{
  automatic_activity_detection: %{
    disabled: false,
    start_of_speech_sensitivity: "START_SENSITIVITY_LOW",  # or HIGH
    end_of_speech_sensitivity: "END_SENSITIVITY_LOW",      # or HIGH
    prefix_padding_ms: 20,      # Audio to keep before speech detection
    silence_duration_ms: 100    # Silence required for end-of-speech
  }
}

Manual VAD

For push-to-talk or custom VAD implementations:

alias Gemini.Live.Models
alias Gemini.Live.Session

{:ok, session} = Session.start_link(
  model: Models.resolve(:audio),
  auth: :gemini,
  generation_config: %{response_modalities: ["AUDIO"]},
  realtime_input_config: %{
    automatic_activity_detection: %{disabled: true}
  }
)

# When user presses talk button
Session.send_realtime_input(session, activity_start: true)

# Stream audio while talking
for chunk <- audio_chunks do
  Session.send_realtime_input(session, audio: %{
    data: chunk,
    mime_type: "audio/pcm;rate=16000"
  })
end

# When user releases talk button
Session.send_realtime_input(session, activity_end: true)

Native Audio Features

Native audio models support advanced features (some require the v1alpha API version).

Voice Selection

generation_config: %{
  response_modalities: ["AUDIO"],
  speech_config: %{
    voice_config: %{
      prebuilt_voice_config: %{voice_name: "Kore"}
    }
  }
}

Available voices include: Kore, Puck, Charon, Fenrir, Aoede, and others. Listen to voices in AI Studio.

Affective Dialog (v1alpha)

Adapts response style to input expression and tone:

alias Gemini.Live.Models
alias Gemini.Live.Session

# Note: Requires v1alpha API version
{:ok, session} = Session.start_link(
  model: Models.resolve(:audio),
  auth: :gemini,
  api_version: "v1alpha",
  generation_config: %{response_modalities: ["AUDIO"]},
  enable_affective_dialog: true
)

Proactive Audio (v1alpha)

Allows the model to decide not to respond if content is irrelevant:

alias Gemini.Live.Models
alias Gemini.Live.Session

# Note: Requires v1alpha API version
{:ok, session} = Session.start_link(
  model: Models.resolve(:audio),
  auth: :gemini,
  api_version: "v1alpha",
  generation_config: %{response_modalities: ["AUDIO"]},
  proactivity: %{proactive_audio: true}
)

Thinking

Native audio models support thinking capabilities:

alias Gemini.Live.Models
alias Gemini.Live.Session

{:ok, session} = Session.start_link(
  model: Models.resolve(:audio),
  auth: :gemini,
  api_version: "v1alpha",
  generation_config: %{
    response_modalities: ["AUDIO"],
    thinking_config: %{
      thinking_budget: 1024,     # Token budget for thinking
      include_thoughts: true     # Include thought summaries
    }
  }
)

Tool Use and Function Calling

The Live API supports function calling, but unlike generateContent, you must handle tool responses manually.

Defining Tools

tools = [
  %{
    function_declarations: [
      %{
        name: "get_weather",
        description: "Get current weather for a location",
        parameters: %{
          type: "object",
          properties: %{
            location: %{type: "string", description: "City name"}
          },
          required: ["location"]
        }
      }
    ]
  }
]

Handling Tool Calls

alias Gemini.Live.Models
alias Gemini.Live.Session

{:ok, session} = Session.start_link(
  model: Models.resolve(:audio),
  auth: :gemini,
  generation_config: %{response_modalities: ["AUDIO"]},
  output_audio_transcription: %{},
  tools: tools,

  on_tool_call: fn %{function_calls: calls} ->
    responses = Enum.map(calls, fn call ->
      result = case call.name do
        "get_weather" ->
          location = call.args["location"]
          get_weather_data(location)  # Your implementation
        _ ->
          %{error: "Unknown function"}
      end

      %{id: call.id, name: call.name, response: result}
    end)

    # Return responses to send automatically
    {:tool_response, responses}
  end
)

Alternatively, send tool responses manually:

Session.send_tool_response(session, [
  %{id: "call_123", name: "get_weather", response: %{temp: 72}}
])

Asynchronous Function Calling

For non-blocking function execution on legacy 2.5 native-audio models:

tools = [
  %{
    function_declarations: [
      %{
        name: "long_running_task",
        behavior: "NON_BLOCKING"  # Execute asynchronously
      }
    ]
  }
]

# Control response timing with scheduling
Session.send_tool_response(session, [
  %{
    id: "call_123",
    name: "long_running_task",
    response: %{result: "done"},
    scheduling: :interrupt   # or :when_idle, :silent
  }
])

Scheduling options:

  • :interrupt - Interrupt current generation immediately
  • :when_idle - Wait until current turn completes
  • :silent - Don't generate a response
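
A sketch of how the non-blocking flow typically fits together: start the work without blocking tool-call handling, then deliver the result later with a scheduling hint. Here run_long_task/1 is a placeholder, and it is assumed you captured session, call_id, and args from the incoming tool call:

# Kick off the long-running work asynchronously
{:ok, _pid} = Task.start(fn ->
  result = run_long_task(args)   # placeholder for your own implementation

  # Deliver the result once it is ready, without interrupting the model
  Session.send_tool_response(session, [
    %{id: call_id, name: "long_running_task", response: result, scheduling: :when_idle}
  ])
end)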

Tool Call Cancellation

When the user interrupts during function execution, the server sends cancellation:

on_tool_call_cancellation: fn cancelled_ids ->
  IO.puts("Cancelled: #{inspect(cancelled_ids)}")
  # Attempt to undo side effects if possible
end

Session Management

Session Resumption

Resume sessions after disconnection to preserve conversation context:

alias Gemini.Live.Models
alias Gemini.Live.Session

# First session - enable resumption
{:ok, session1} = Session.start_link(
  model: Models.resolve(:audio),
  auth: :gemini,
  generation_config: %{response_modalities: ["AUDIO"]},
  output_audio_transcription: %{},
  session_resumption: %{},
  on_session_resumption: fn %{handle: handle, resumable: true} ->
    # Store handle for later use
    save_handle(handle)
  end
)

:ok = Session.connect(session1)
:ok = Session.send_text(session1, "Remember: my name is Alice.")
Process.sleep(3000)

# Get handle before closing
handle = Session.get_session_handle(session1)
Session.close(session1)

# Later - resume with saved handle
{:ok, session2} = Session.start_link(
  model: Models.resolve(:audio),
  auth: :gemini,
  generation_config: %{response_modalities: ["AUDIO"]},
  output_audio_transcription: %{},
  resume_handle: handle,
  session_resumption: %{}
)

:ok = Session.connect(session2)
:ok = Session.send_text(session2, "What's my name?")
# Model should remember: Alice

Resumption tokens are valid for 2 hours after the last session termination.

Context Window Compression

Enable sliding window compression for long sessions:

alias Gemini.Live.Models
alias Gemini.Live.Session

{:ok, session} = Session.start_link(
  model: Models.resolve(:audio),
  auth: :gemini,
  generation_config: %{response_modalities: ["AUDIO"]},
  output_audio_transcription: %{},
  context_window_compression: %{
    sliding_window: %{
      target_tokens: 16000  # Target after compression
    },
    trigger_tokens: 24000   # When to trigger compression
  }
)

Compression extends session duration indefinitely but may affect response quality as older context is discarded.

GoAway Notice

The server sends a GoAway message before disconnecting:

on_go_away: fn %{time_left_ms: time_left, handle: handle} ->
  IO.puts("Session ending in #{time_left}ms")

  # Save handle for resumption
  if handle, do: save_handle(handle)

  # Prepare for reconnection
  schedule_reconnect()
end

Generation Complete

The server sends generation_complete when the model finishes generating (before turn_complete):

on_message: fn message ->
  case message do
    %{server_content: %{generation_complete: true}} ->
      IO.puts("[Model finished generating]")

    %{server_content: %{turn_complete: true}} ->
      IO.puts("[Turn complete - ready for next input]")

    _ -> :ok
  end
end

Ephemeral Tokens

Ephemeral tokens are short-lived authentication tokens for client-to-server implementations. They enhance security by:

  • Expiring quickly (default: 30 minutes)
  • Limiting the number of sessions they can create
  • Optionally constraining configuration options

Token Constraints

Ephemeral tokens require the v1alpha API version and are only compatible with the Live API.

Token Properties:

  • expire_time: When messages will be rejected (default: 30 minutes)
  • new_session_expire_time: When new sessions will be rejected (default: 1 minute)
  • uses: Number of sessions the token can start (default: 1)

Creating Tokens (Server-Side)

Create tokens on your backend and pass them to clients:

# This would typically be done via the REST API on your backend
# The token is then passed to the client application

# Example token structure returned from API:
%{
  "name" => "ephemeral-token-string",  # Use this as the API key
  "expireTime" => "2025-01-23T12:00:00Z",
  "newSessionExpireTime" => "2025-01-23T11:31:00Z"
}

Using Tokens (Client-Side)

The client uses the token as if it were an API key:

// In browser/mobile client
const session = await ai.live.connect({
  model: 'gemini-3.1-flash-live-preview',
  apiKey: ephemeralToken.name,  // Use token instead of API key
  config: { responseModalities: ['AUDIO'] }
});
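
If the client is itself an Elixir app using this library, the same idea applies: pass the token string where an API key would go. A sketch, assuming your backend delivered the token's name field along with the model it was constrained to:

alias Gemini.Live.Session

{:ok, session} = Session.start_link(
  model: model_from_backend,          # e.g. the model locked into the token constraints
  api_key: ephemeral_token_name,      # the "name" field returned when the token was created
  api_version: "v1alpha",             # ephemeral tokens require v1alpha
  generation_config: %{response_modalities: ["AUDIO"]},
  output_audio_transcription: %{}
)

:ok = Session.connect(session)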

Token with Configuration Constraints

Lock tokens to specific configurations for additional security:

# Server-side token creation with constraints
token_config = %{
  uses: 1,
  live_connect_constraints: %{
    model: "gemini-3.1-flash-live-preview",
    config: %{
      session_resumption: %{},
      temperature: 0.7,
      response_modalities: ["AUDIO"]
    }
  }
}

Best Practices

  1. Set short expiration times
  2. Verify secure authentication on your backend before issuing tokens
  3. Don't use ephemeral tokens for server-to-server connections (unnecessary overhead)
  4. Use sessionResumption within a token's expireTime to reconnect without consuming additional uses

Limitations

Response Modalities

Configure only one response modality per session. For current Gemini Live models, use ["AUDIO"] and enable output transcription when you need text UX.

Session Duration

Without compression:

  • Audio-only: 15 minutes
  • Audio + Video: 2 minutes

Context Window

Context window limits vary by Live model generation and rollout. Check the selected model's current documentation or registry entry instead of assuming the older native-audio versus text-live split.

Authentication

Standard API keys should not be used in client-side code. Use ephemeral tokens for client-to-server implementations.

Supported Languages

Native audio models automatically detect language and don't support explicit language codes. See the canonical documentation for the full list of supported languages.

Examples

Text Chat Session

alias Gemini.Live.Session
alias Gemini.Live.Models

model = Models.resolve(:audio)

{:ok, session} = Session.start_link(
  model: model,
  auth: :gemini,
  generation_config: %{response_modalities: ["AUDIO"]},
  output_audio_transcription: %{},
  system_instruction: "You are a helpful assistant.",
  on_message: fn
    %{server_content: content} when not is_nil(content) ->
      if text = Gemini.Types.Live.ServerContent.extract_text(content) do
        IO.write(text)
      end
      if content.turn_complete, do: IO.puts("\n")
    _ -> :ok
  end
)

:ok = Session.connect(session)

Session.send_text(session, "What is machine learning?")
Process.sleep(5000)

Session.close(session)

Audio Streaming

alias Gemini.Live.Session
alias Gemini.Live.Models

model = Models.resolve(:audio)

{:ok, session} = Session.start_link(
  model: model,
  auth: :gemini,
  api_version: "v1alpha",
  generation_config: %{
    response_modalities: ["AUDIO"],
    speech_config: %{voice_config: %{prebuilt_voice_config: %{voice_name: "Kore"}}}
  },
  input_audio_transcription: %{},
  output_audio_transcription: %{},
  on_message: fn
    %{server_content: content} when not is_nil(content) ->
      # Handle audio output
      if content.model_turn && content.model_turn.parts do
        for part <- content.model_turn.parts do
          if part[:inline_data], do: play_audio(part.inline_data.data)
        end
      end
    _ -> :ok
  end,
  on_transcription: fn
    {:input, t} -> IO.puts("User: #{t["text"]}")
    {:output, t} -> IO.puts("Model: #{t["text"]}")
  end
)

:ok = Session.connect(session)

# Send audio chunks (16kHz PCM)
for chunk <- audio_chunks do
  Session.send_realtime_input(session, audio: %{
    data: chunk,
    mime_type: "audio/pcm;rate=16000"
  })
end

Process.sleep(5000)
Session.close(session)

Function Calling

alias Gemini.Live.Session
alias Gemini.Live.Models

tools = [
  %{
    function_declarations: [
      %{
        name: "get_stock_price",
        description: "Get current stock price",
        parameters: %{
          type: "object",
          properties: %{symbol: %{type: "string"}},
          required: ["symbol"]
        }
      }
    ]
  }
]

{:ok, session} = Session.start_link(
  model: Models.resolve(:audio),
  auth: :gemini,
  generation_config: %{response_modalities: ["AUDIO"]},
  output_audio_transcription: %{},
  tools: tools,
  on_tool_call: fn %{function_calls: calls} ->
    responses = Enum.map(calls, fn call ->
      result = case call.name do
        "get_stock_price" -> %{price: 178.50, currency: "USD"}
        _ -> %{error: "Unknown function"}
      end
      %{id: call.id, name: call.name, response: result}
    end)
    {:tool_response, responses}
  end,
  on_message: fn
    %{server_content: c} when not is_nil(c) ->
      if text = Gemini.Types.Live.ServerContent.extract_text(c), do: IO.write(text)
      if c.turn_complete, do: IO.puts("\n")
    _ -> :ok
  end
)

:ok = Session.connect(session)
Session.send_text(session, "What's Apple's stock price?")
Process.sleep(10000)
Session.close(session)

Session Resumption

See examples/13_live_session_resumption.exs for a complete example.

Testing Live Sessions

When your environment variables are already exported, run the Live integration tests directly:

# Gemini Live session coverage
mix test --only live_gemini test/gemini/live/session_live_test.exs

# Gemini Live advanced features
mix test --only live_gemini test/gemini/live/features_live_test.exs

# Vertex Live coverage (billed; requires explicit opt-in)
RUN_BILLED_VERTEX_LIVE_TESTS=1 mix test --only live_vertex_ai test/gemini/live/session_vertex_live_test.exs

The default test suite excludes :live_gemini and :live_vertex_ai, so these targeted commands are the intended manual verification path for real credentials.

If your Vertex project does not expose a compatible Live audio model, the Vertex session tests will skip instead of failing.

Further Reading