Audio Understanding

curl -X POST "https://api.example.com/v1beta/models/gemini-2.5-pro:generateContent?key=YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "contents": [
      {
        "role": "user",
        "parts": [
          {
            "text": "Transcribe this audio"
          },
          {
            "inline_data": {
              "mime_type": "audio/mp3",
              "data": "BASE64_ENCODED_AUDIO_DATA"
            }
          }
        ]
      }
    ]
  }'

{
  "candidates": [
    {
      "content": {
        "parts": [
          {
            "text": "Transcription: Hello, this is a test audio recording. The speaker is discussing the benefits of artificial intelligence in modern technology."
          }
        ],
        "role": "model"
      },
      "finishReason": "STOP"
    }
  ]
}

POST /v1beta/models/{model}:generateContent

Official documentation: https://ai.google.dev/gemini-api/docs/audio

Analyze and understand audio content using Google Gemini models. The model can transcribe speech, answer questions about audio, and extract information from audio files.

Request Parameters

key (string, required): API key, passed in the URL query string.

contents (array, required): Content array containing text and audio data. Each content object contains:

- role (string): Role (user or model)
- parts (array): Content parts array, which can include:
  - text (string): Text prompt or question about the audio
  - inline_data (object): Audio data
    - mime_type (string): Audio MIME type (e.g., "audio/mp3", "audio/wav")
    - data (string): Base64-encoded audio data
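
The structure of contents described above can be sketched in Python as follows. This is a minimal illustration, not an official client: the helper name build_audio_payload and the placeholder bytes are assumptions made for the example, and only the field names (role, parts, text, inline_data, mime_type, data) come from this page.

```python
import base64
import json


def build_audio_payload(audio_bytes: bytes, prompt: str, mime_type: str = "audio/mp3") -> dict:
    """Build a generateContent request body carrying a prompt and inline audio.

    Hypothetical helper for illustration; only the JSON field names are
    taken from the parameter description.
    """
    # The data field must be a Base64 string, not raw bytes.
    encoded = base64.b64encode(audio_bytes).decode("ascii")
    return {
        "contents": [
            {
                "role": "user",
                "parts": [
                    {"text": prompt},
                    {"inline_data": {"mime_type": mime_type, "data": encoded}},
                ],
            }
        ]
    }


if __name__ == "__main__":
    # Placeholder bytes stand in for a real file, e.g. open("clip.mp3", "rb").read().
    payload = build_audio_payload(b"fake-audio-bytes", "Transcribe this audio")
    print(json.dumps(payload, indent=2))
```

The resulting dictionary serializes to the same JSON shape as the curl request body, so it can be sent with any HTTP client.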

generationConfig (object): Generation configuration.

- temperature (number): Sampling temperature
- topP (number): Nucleus sampling parameter
- maxOutputTokens (integer): Maximum output tokens
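
As a rough sketch of how generationConfig might be attached to a request body, the Python helper below adds only the options that were explicitly set. The function name with_generation_config and the sample values are assumptions for illustration; the keys temperature, topP, and maxOutputTokens are the ones listed above.

```python
def with_generation_config(body: dict, temperature=None, top_p=None, max_output_tokens=None) -> dict:
    """Return a copy of body with a generationConfig holding only the options that were set.

    Hypothetical helper; it does not validate value ranges.
    """
    config = {}
    if temperature is not None:
        config["temperature"] = temperature
    if top_p is not None:
        config["topP"] = top_p
    if max_output_tokens is not None:
        config["maxOutputTokens"] = max_output_tokens

    merged = dict(body)  # shallow copy so the caller's dict is not mutated
    if config:
        merged["generationConfig"] = config
    return merged


if __name__ == "__main__":
    # Illustrative values only; suitable settings depend on the task.
    body = with_generation_config({"contents": []}, temperature=0.2, max_output_tokens=1024)
    print(body)
```

Leaving an option unset keeps it out of the request entirely, so whatever default the server applies stays in effect.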

Response

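Returns a candidates array. Each candidate carries the generated text (for example, a transcription or analysis of the provided audio) in content.parts, along with a finishReason such as STOP.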
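
A minimal Python sketch for reading such a response is shown below. It assumes the response has already been parsed from JSON into a dictionary; the helper names extract_text and finished_normally are hypothetical, and the sample dictionary mirrors the example response on this page.

```python
def extract_text(response: dict) -> str:
    """Concatenate the text parts of the first candidate; return "" if there is none."""
    candidates = response.get("candidates") or []
    if not candidates:
        return ""
    parts = candidates[0].get("content", {}).get("parts", [])
    return "".join(part.get("text", "") for part in parts)


def finished_normally(response: dict) -> bool:
    """True when the first candidate reports finishReason == "STOP"."""
    candidates = response.get("candidates") or []
    return bool(candidates) and candidates[0].get("finishReason") == "STOP"


if __name__ == "__main__":
    sample = {
        "candidates": [
            {
                "content": {
                    "parts": [{"text": "Transcription: Hello, this is a test audio recording."}],
                    "role": "model",
                },
                "finishReason": "STOP",
            }
        ]
    }
    print(extract_text(sample))
    print(finished_normally(sample))
```

Checking finishReason before using the text helps catch output that was cut short, for instance by a low maxOutputTokens setting.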

