Media tools
Generate images and videos using 40+ AI models. Always call get_models first to see available models, costs, and whether a model needs an input image.
REST HTTP clients: the same limits are documented in one place in REST API model requirements (and returned per model from GET /v1/models).
#generate_media
Starts an image or video generation.
Parameters:
| Parameter | Type | Required | Description |
|---|---|---|---|
| prompt | string | Yes | What to generate (e.g. “A red car on a mountain road”). |
| model | string | Yes | Model ID (from get_models). Examples: nano-banana, kling-2-6-image-to-video. |
| generation_type | string | No | text-to-image, text-to-video, image-to-video, or image-to-image. Default: text-to-image. |
| negative_prompt | string | No | What to avoid in the output. |
| source_media_urls | string or array | No | Required for image-to-video and image-to-image. URL(s) to image(s), or for some models (e.g. Kling 2.6 Motion) image + video. See input limits below. Omit for text-to-image and text-to-video. |
| aspect_ratio | string | No | e.g. 1:1, 16:9, 9:16, 4:5, 21:9. Default: 1:1. Note: each model only accepts a subset — get_models returns the allowed list. |
| duration | string | No | Video length. Only certain video models use this. See below. |
| quality | string | No | e.g. fast, standard, pro, ultra. Default: standard. |
| resolution | string | No | Output resolution tier. Only certain image models use this — gpt-image-2 (1K/2K/4K), nano-banana-pro/nano-banana-2 (1K/2K/4K), flux-2 (1K/2K). Each tier is a separate pricing SKU; get_models returns the per-tier credit cost. Ignored by models where resolution is encoded in the variant model_id (Seedance, Kling, P-Video). See the Resolution tiers table below for per-model constraints. |
| sound | boolean | No | When true, request video with generated audio. Only certain video models use this. Default: false. See below. |
| seed | number | No | Seed for reproducible results. |
Example (text-to-image):
{
"prompt": "A futuristic city at sunset with flying cars",
"model": "nano-banana",
"generation_type": "text-to-image",
"aspect_ratio": "16:9",
"quality": "pro"
}
Example (image-to-video, one input image required):
{
"prompt": "Gentle motion and subtle movement",
"model": "kling-2-6-image-to-video",
"generation_type": "image-to-video",
"source_media_urls": ["https://example.com/your-image.jpg"],
"aspect_ratio": "16:9",
"duration": "5s"
}
Response: Includes generation_id, status (e.g. pending), and often estimated_time_seconds and estimated_cost_credits. Poll with get_generation_status until status is completed or failed.
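The polling step can be sketched as a small loop. This is a minimal illustration, not the client's actual implementation: wait_for_generation, the get_status callable (standing in for a get_generation_status call), the 5-second interval, and the 10-minute timeout are all assumptions made for the example.

```python
import time
from typing import Any, Callable, Dict

# Statuses after which polling should stop (from the Response description above).
TERMINAL_STATUSES = {"completed", "failed"}


def wait_for_generation(
    generation_id: str,
    get_status: Callable[[str], Dict[str, Any]],
    poll_interval_s: float = 5.0,   # assumed interval, not a documented value
    timeout_s: float = 600.0,       # assumed timeout, not a documented value
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, Any]:
    """Poll get_status until the job is completed or failed, or the timeout elapses."""
    waited = 0.0
    while True:
        result = get_status(generation_id)
        if result.get("status") in TERMINAL_STATUSES:
            return result
        if waited >= timeout_s:
            raise TimeoutError(
                f"generation {generation_id} still {result.get('status')!r} "
                f"after {timeout_s}s"
            )
        sleep(poll_interval_s)
        waited += poll_interval_s
```

Passing get_status and sleep as arguments keeps the loop independent of any particular MCP or HTTP client and makes it easy to exercise with a fake status source.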
Models that support duration:
| Model(s) | Supported values | Notes |
|---|---|---|
| kling-2-6-text-to-video, kling-2-6-image-to-video | 5s, 10s | Optional with/without audio (model variant). |
| wan-2-5 (text-to-video, image-to-video) | 5s, 10s | |
| wan-2-7-720p, wan-2-7-1080p (text-to-video, image-to-video) | 2s–15s (integer, per second) | PER-SECOND pricing; resolution is encoded in the model_id. Pass duration (2–15) + generation_type. |
| grok-imagine-video-1-5-preview-480p, grok-imagine-video-1-5-preview-720p (image-to-video) | 2s–15s (integer, per second) | PER-SECOND pricing; resolution is encoded in the model_id. Image-to-video only — requires exactly 1 image URL in source_media_urls. Pass duration (2–15); optional aspect_ratio (auto by default). |
| v1-pro-fast-i2v | 5s, 10s | |
| seedance-1-5-pro | 4s, 8s, 12s | Supports both text-to-video (0–1 image optional) and image-to-video (2 images required). |
| seedance-2 (Standard) / seedance-2-fast (Fast) | 4s–15s (integer) | ByteDance Seedance 2. The tier is the model family itself — seedance-2-fast for the cheaper/faster tier, seedance-2 for the higher-quality tier. Each tier has concrete variant model_ids per resolution / reference-video combo (e.g. seedance-2-fast-480p, seedance-2-480p-video-ref). Pass the full variant id to generate_media; a family label alone returns a variant_required error with the choices. Multimodal text-to-video references (up to 9 images + 3 videos + 3 audios); image-to-video takes 1 required image + optional end frame + up to 3 reference audios. With a reference video the billing formula becomes credits = (ref_video_s + output_s) × rate/s. |
| gemini-omni-video (Google) | 4, 6, 8, 10 (baked into the variant id) | Discrete durations only — the duration parameter is ignored; resolution is also encoded in the variant id (gemini-omni-video-720p-6s, gemini-omni-video-1080p-6s, gemini-omni-video-4k-10s, etc.). Three video-ref variants (gemini-omni-video-720p-video-ref, gemini-omni-video-1080p-video-ref, gemini-omni-video-4k-video-ref) take their output duration from the trimmed source clip (≤ 10s). Pass a bare gemini-omni-video family label and the API returns a variant_required error with the 15 choices. |
| grok-text-to-video-6s | Fixed 6s | Duration parameter ignored. |
| kling-3-0-std, kling-3-0-pro | 3s–15s | Single-shot mode. Max 2500 chars; supports @element_name references. |
| grok-image-to-video, kling-2-5-image-to-video-pro, veo3-1 | Not configurable | Duration not set via this parameter. |
For image-only models, duration is ignored.
Models that support negative_prompt:
| Model(s) | Notes |
|---|---|
| imagen-4, imagen-4-fast, imagen-4-ultra | Text-to-image. |
| wan-2-5 (text-to-video, image-to-video) | |
| wan-2-7-720p, wan-2-7-1080p (text-to-video, image-to-video) | Up to 500 chars. |
| kling-2-5-image-to-video-pro | |
All other models ignore negative_prompt.
Models that support quality (or equivalent):
| Model(s) | How it works | Values |
|---|---|---|
| imagen-4 variants | Mapped to model_variant. | standard, fast, ultra (use quality: standard / fast / ultra). |
| seedream-v4, seedream-v4-edit | Resolution via quality param. | 1K (default), 2K, 4K. |
| seedream-v4-5, seedream-v4-5-edit | Uses quality directly. | basic (2K, default), high (4K). |
| 5-lite-text-to-image, 5-lite-image-to-image | Uses quality directly. | basic (2K, default), high (4K). |
| veo3-1 vs veo3-1-fast | Different model IDs, not a single quality param. | Use model veo3-1 (quality) or veo3-1-fast (speed). |
| flux-2, nano-banana-pro, nano-banana-2 | Resolution (1K/2K/4K; flux-2 supports 1K/2K only), not a generic “quality” string. | Pass via the dedicated resolution parameter (see Resolution tiers below). |
| gpt-image-2 (t2i + i2i) | Resolution via the resolution parameter. | See Resolution tiers below. |
For other models, quality is ignored.
<a id="resolution-tiers"></a>
Resolution tiers (the resolution parameter):
| Model | Values | Pricing | Constraint |
|---|---|---|---|
| gpt-image-2 (t2i + i2i) | 1K (default), 2K, 4K | 11 / 15 / 21 credits | 2K and 4K require an explicit non-square, non-auto aspect_ratio — one of 9:16, 16:9, 4:3, 3:4. Passing auto or 1:1 at 2K/4K returns HTTP 400 with error: "aspect_ratio_incompatible_with_high_res" and no credits are held. 1K accepts every supported aspect including auto/1:1. |
| nano-banana-2 | 1K (default), 2K, 4K | See get_models | Each tier is a separate pricing SKU. Aspect ratio list unchanged across tiers. |
| nano-banana-pro | 1K (default), 2K, 4K | See get_models | Same pattern as nano-banana-2. |
| flux-2, flux-2-edit | 1K (default), 2K | See get_models | Two tiers only. |
When to pick which tier (GPT Image 2):
- 1K: default. Use for social posts, thumbnails, concepting, in-app previews, anything ≤ 1024 × 1024. Cheapest; no aspect-ratio gotchas.
- 2K: use when the client needs a crisp web hero, newsletter cover, or in-product illustration at retina density. Must pick a directional aspect (landscape or portrait).
- 4K: use for print, out-of-home, banners, or any case where the user explicitly asks for the highest output size. Confirm the aspect with the user first; the 1:1 / auto pill won't work.
Models not listed ignore resolution. For video families (Seedance, Kling, P-Video, Gemini Omni) the resolution is part of the concrete variant model_id — pass the variant (e.g. seedance-2-fast-480p, p-video-1080p, gemini-omni-video-1080p-6s), not this param.
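The GPT Image 2 constraint from the table can be checked before sending a request. The sketch below is only a client-side pre-check under the rules stated above; the function name check_gpt_image_2_request is hypothetical, and the server remains the authority on which aspect ratios a tier accepts.

```python
from typing import Optional

GPT_IMAGE_2_TIERS = {"1K", "2K", "4K"}
HIGH_RES_TIERS = {"2K", "4K"}
# Aspect ratios the table lists as valid at 2K and 4K.
HIGH_RES_ASPECTS = {"9:16", "16:9", "4:3", "3:4"}


def check_gpt_image_2_request(resolution: str = "1K", aspect_ratio: str = "1:1") -> Optional[str]:
    """Return the documented error code if the combination would be rejected, else None."""
    if resolution not in GPT_IMAGE_2_TIERS:
        raise ValueError(f"unknown resolution tier: {resolution!r}")
    if resolution in HIGH_RES_TIERS and aspect_ratio not in HIGH_RES_ASPECTS:
        return "aspect_ratio_incompatible_with_high_res"
    return None
```

For example, check_gpt_image_2_request("2K", "auto") returns the error code, while check_gpt_image_2_request("4K", "16:9") returns None.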
Prompt character limits:
Some models enforce a maximum prompt length. Exceeding it may return an error or truncation.
| Model(s) | Max characters |
|---|---|
| wan-2-5 | 800 |
| wan-2-7-720p, wan-2-7-1080p | 5,000 |
| kling-2-6 (text-to-video, image-to-video) | 2,500 |
| seedance-2 (Fast + Standard) | 2,500 |
| kling-3-0-std, kling-3-0-pro | 2,500 |
| kling-2-5-image-to-video-pro | 2,500 |
| seedream-v4, seedream-v4-edit | 2,500 |
| seedream-v4-5, seedream-v4-5-edit | 3,000 |
| 5-lite-text-to-image, 5-lite-image-to-image | 2,995 |
| gpt-1.5-image-medium, gpt-1.5-image-high | 3,000 |
| nano-banana, imagen-4, flux-2, veo3-1, v1-pro-fast-i2v, grok (image/video), p-image-edit | 5,000 |
| nano-banana-pro (all variants) | 20,000 |
| nano-banana-2 (all variants) | 20,000 |
Others may have no documented limit or use server defaults.
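A caller can guard against over-long prompts before spending a request. The sketch below copies a few rows of the table into a dictionary; the names PROMPT_LIMITS and prompt_within_limit are hypothetical, the dictionary is deliberately partial, and models missing from it are treated as having no documented limit.

```python
from typing import Dict

# Partial copy of the table above (max prompt characters per model).
PROMPT_LIMITS: Dict[str, int] = {
    "wan-2-5": 800,
    "kling-3-0-std": 2500,
    "kling-3-0-pro": 2500,
    "seedream-v4-5": 3000,
    "nano-banana": 5000,
    "nano-banana-pro": 20000,
}


def prompt_within_limit(model: str, prompt: str) -> bool:
    """True when the prompt fits the documented limit, or when no limit is documented."""
    limit = PROMPT_LIMITS.get(model)
    return limit is None or len(prompt) <= limit
```

Since exceeding a limit may produce either an error or silent truncation, checking up front avoids relying on whichever behavior a given model has.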
Input file (image and video) limits:
For image-to-video and image-to-image, source_media_urls is a list of URLs. Most models accept images only (JPEG, PNG, WebP, typically 10 MB max per file). Some models also accept video inputs; when they do, format and size limits apply (e.g. MP4, max duration).
| Model(s) | Input type | Limit | Notes |
|---|---|---|---|
| kling-2-6-motion-control-720p, kling-2-6-motion-control-1080p | Image + video | 1 image + 1 video | Motion Control: reference video drives motion. Video max 30 s; video file typically up to 100 MB (MP4/WebM). |
| kling-3-0-motion-control-720p, kling-3-0-motion-control-1080p | Image + video | 1 image + 1 video | Kling 3.0 Motion Control: same as Kling 2.6. Per-second billing — see the Motion Control pricing table below. Video max 30 s; video file typically up to 100 MB (MP4/WebM). |
| kling-2-6-image-to-video, wan-2-5 (image-to-video), grok-image-to-video, v1-pro-fast-i2v | Images only | 1 image | Exactly one input image. |
| wan-2-7-720p, wan-2-7-1080p (image-to-video) | Images only | 1 image | First frame in source_media_urls; optional last_frame_url for a first+last-frame transition. |
| grok-imagine-video-1-5-preview-480p, grok-imagine-video-1-5-preview-720p | Images only | 1 image | Exactly one input image (the starting frame). Aspect ratio: auto (default, derives from the image), 1:1, 16:9, 9:16, 4:3, 3:4, 3:2, 2:3. Output includes audio. |
| kling-2-5-image-to-video-pro | Images only | 2 images | Start frame and end frame. |
| seedance-1-5-pro | Images only | Mode-dependent | Text-to-video (generation_type: "text-to-video"): 0–1 images optional. Image-to-video (generation_type: "image-to-video"): exactly 2 images required (start + end frame). |
| seedance-2 (Fast + Standard) | Image + video + audio | Mode-dependent | Text-to-video: up to 9 images, 3 videos (combined ≤ 15s), and 3 audio clips (combined ≤ 15s) — all optional. Image-to-video: 1 required image (first frame) + optional end frame (2 images max) + up to 3 audio clips; no reference videos allowed in this mode. Pass every URL (image, video, audio) in source_media_urls — the backend classifies by file extension (.jpg/.png/.webp → image, .mp4/.mov/.webm → video, .mp3/.wav/.m4a → audio) and routes each to the right bucket. |
| gemini-omni-video (all variants) | Images + video | 7 reference slots | A single reference video consumes 2 slots; images fill the rest. Maximum 1 video reference per request; by default the video's first 10 seconds are used as the driving clip (override with video_trim_start_s / video_trim_end_s). A video reference requires a -video-ref variant; calling a duration variant with a video URL returns variant_mismatch. Audio file URLs are not accepted; voice output is controlled by the voice_id parameter (one of 29 built-in voices). |
| kling-3-0-std, kling-3-0-pro | Images only | 1–2 images | Start frame, or start + end frame. PNG/JPG/JPEG. Supports elements (see below). |
| seedream-v4-edit | Images only | 10 | For editing. |
| 5-lite-text-to-image, 5-lite-image-to-image | Images only | 10 | For editing (image-to-image). |
| nano-banana, nano-banana-edit | Images only | 10 | |
| nano-banana-pro (all variants) | Images only | 8 | |
| nano-banana-2 (all variants) | Images only | 8 | |
| p-image-edit | Images only | 1–8 | Pruna AI P Image Edit. Image-to-image only — set generation_type: "image-to-image". Pass 1–8 URLs in source_media_urls. aspect_ratio: auto matches the first input image, or use 1:1, 16:9, 9:16, 4:3, 3:4, 3:2, 2:3. Optional turbo (default on). Default disable_safety_checker: true (moderation off); set disable_safety_checker: false to enable the safety checker. Optional seed. |
| flux-2-edit (image-to-image) | Images only | 8 | |
| gpt-1.5-image (image-to-image) | Images only | 16 | |
| veo3-1 (image-to-video / reference modes) | Images only | 1-3 | Depends on mode (1 text-to-video optional ref; 2 first+last frame; 3 reference). |
Use get_models to confirm input_media_types and capabilities for a given model. See Account tools for model list and pricing.
Kling 3.0 – elements (optional):
Elements let you reference images or videos in your prompt using @element_name. Pass kling_elements as an array of objects with name, description, and either element_input_urls (2–4 image URLs) or element_input_video_urls (1 video URL). Reference images for elements come from each element’s element_input_urls; main image_urls may be empty for text-to-video or hold optional start/end frames for image-to-video. Each element requires a title (name) and description. Image elements: JPG/PNG, min 300×300px, max 10MB each. Video elements: MP4/MOV, max 50MB.
Seedance 1.5 Pro – two modes (check generation_type before using images):
| Mode | generation_type | source_media_urls | Can use images? |
|---|---|---|---|
| Text-to-video | "text-to-video" | Empty or 1 URL | Optional: 0–1 images. Omit for text-only; include 1 URL to animate that image. |
| Image-to-video | "image-to-video" | Exactly 2 URLs | Required: exactly 2 images (start frame + end frame). |
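The two modes reduce to a simple count check on source_media_urls. A minimal sketch, assuming a hypothetical helper name seedance_1_5_pro_inputs_ok; it only counts URLs and does not verify that they point at images.

```python
from typing import Sequence


def seedance_1_5_pro_inputs_ok(generation_type: str, source_media_urls: Sequence[str]) -> bool:
    """Check the image count rules for seedance-1-5-pro from the table above."""
    count = len(source_media_urls)
    if generation_type == "text-to-video":
        return count <= 1          # 0 or 1 optional image
    if generation_type == "image-to-video":
        return count == 2          # start frame + end frame
    return False                   # other generation types are not listed for this model
```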
Seedance 2 – two tiers, two modes, mixed references:
ByteDance Seedance 2 ships as two separate model families: seedance-2-fast (cheaper, faster) and seedance-2 (standard, higher quality). Each family exposes concrete variant model_ids per resolution and reference-video combo; pass the full variant (e.g. seedance-2-fast-480p, seedance-2-720p-video-ref). A bare family label returns a variant_required error listing the options. Resolutions are 480p or 720p; the live rate table below additionally lists 1080p variants for seedance-2 only, so confirm availability with get_models. Duration is an integer 4–15 seconds. Supported aspect ratios: 1:1, 4:3, 3:4, 16:9, 9:16, 21:9, adaptive. Prompt max 2,500 characters. Audio is a free toggle via generate_audio (defaults to true); unlike Kling 3.0 there is no surcharge for audio.
| Mode | generation_type | Required inputs | Optional references | Not allowed |
|---|---|---|---|---|
| Text-to-video | "text-to-video" | None (prompt only) | Up to 9 images, up to 3 videos (combined ≤ 15s), up to 3 audio clips (combined ≤ 15s) | — |
| Image-to-video | "image-to-video" | 1 image (first frame) | Optional 2nd image (last frame) · up to 3 audio clips (combined ≤ 15s) | Reference videos (rejected with a clear error) |
Put every reference URL (images, videos, audio) into source_media_urls. The backend classifies each URL by file extension — .jpg / .png / .webp → image, .mp4 / .mov / .webm → video, .mp3 / .wav / .m4a → audio — and routes it to the correct bucket automatically.
Billing (two paths):
- Without a reference video: credits = output_seconds × base_credits_per_second
- With a reference video (upstream provider path): credits = (ref_video_seconds + output_seconds) × base_credits_per_second, where ref_video_seconds is the sum of all reference video durations (capped at 15s per request).
Live per-second rates (pulled from the ai_models_config catalog — always current):
| Model | Name | Rate | Unit |
|---|---|---|---|
| seedance-2-1080p | Seedance 2 (1080p) | 75 | credits / sec |
| seedance-2-1080p-video-ref | Seedance 2 (1080p, video ref) | 48 | credits / sec |
| seedance-2-480p | Seedance 2 (480p) | 18 | credits / sec |
| seedance-2-480p-video-ref | Seedance 2 (480p, video ref) | 13 | credits / sec |
| seedance-2-720p | Seedance 2 (720p) | 35 | credits / sec |
| seedance-2-720p-video-ref | Seedance 2 (720p, video ref) | 25 | credits / sec |
| seedance-2-fast-480p | Seedance 2.0 Fast | 16 | credits / sec |
| seedance-2-fast-480p-video-ref | Seedance 2.0 Fast (video ref) | 12 | credits / sec |
| seedance-2-fast-720p | Seedance 2.0 Fast | 29 | credits / sec |
| seedance-2-fast-720p-video-ref | Seedance 2.0 Fast (video ref) | 19 | credits / sec |
MCP / REST API note: the backend cannot probe remote video durations from a URL, so reference-video requests from the MCP and REST API are billed the pessimistic worst case of 15s of reference video. The web UI probes each video locally and bills the exact sum. For cost-sensitive workflows with short reference videos, prefer the web UI.
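The two billing paths, plus the MCP / REST worst-case rule, can be worked through in a few lines. This is an illustrative estimate only: seedance_2_credits is a hypothetical helper, the rate must come from get_models (the table above is a snapshot), and the actual charge is whatever the API reports.

```python
from typing import Optional

MAX_REF_VIDEO_SECONDS = 15  # cap on billed reference-video time per request


def seedance_2_credits(
    rate_per_second: float,
    output_seconds: int,
    ref_video_seconds: Optional[float] = None,
    billed_via_api: bool = False,
) -> float:
    """Estimate credits for a Seedance 2 request using the two formulas above."""
    if not 4 <= output_seconds <= 15:
        raise ValueError("Seedance 2 duration must be an integer from 4 to 15 seconds")
    if ref_video_seconds is None:
        # No reference video: output seconds only.
        return output_seconds * rate_per_second
    # MCP / REST requests are billed the 15s worst case; the web UI bills the real sum.
    billed_ref = MAX_REF_VIDEO_SECONDS if billed_via_api else min(ref_video_seconds, MAX_REF_VIDEO_SECONDS)
    return (billed_ref + output_seconds) * rate_per_second
```

With the table's snapshot rates, a 5s seedance-2-fast-480p clip is 5 × 16 = 80 credits. A 5s seedance-2-fast-480p-video-ref clip with a 4s reference is (4 + 5) × 12 = 108 credits from the web UI, but (15 + 5) × 12 = 240 credits through MCP or REST.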
Hard limits (enforced with 400 errors):
- More than 3 videos → too_many_videos
- More than 3 audios → too_many_audios
- More than 9 images → too_many_images
- Combined reference video duration > 15s → rejected
- Combined reference audio duration > 15s → rejected
- Any single reference video or audio file longer than 15s → rejected
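The extension-based routing and the count limits can be mirrored client-side to catch mistakes before a 400 comes back. The sketch below covers text-to-video counts only and is an approximation of the described behavior, not the backend's code; classify_url, check_seedance_2_references, and the unrecognized_extension label are hypothetical names introduced for the example.

```python
import os
from typing import Dict, Optional, Sequence
from urllib.parse import urlparse

# Extension to bucket mapping as described for Seedance 2.
EXTENSION_BUCKETS: Dict[str, str] = {
    ".jpg": "image", ".png": "image", ".webp": "image",
    ".mp4": "video", ".mov": "video", ".webm": "video",
    ".mp3": "audio", ".wav": "audio", ".m4a": "audio",
}
MAX_PER_BUCKET = {"image": 9, "video": 3, "audio": 3}
ERROR_CODES = {"image": "too_many_images", "video": "too_many_videos", "audio": "too_many_audios"}


def classify_url(url: str) -> Optional[str]:
    """Return 'image', 'video', or 'audio' based on the URL path's extension."""
    extension = os.path.splitext(urlparse(url).path)[1].lower()
    return EXTENSION_BUCKETS.get(extension)


def check_seedance_2_references(urls: Sequence[str]) -> Optional[str]:
    """Return an error code when a count limit is exceeded, else None."""
    counts = {"image": 0, "video": 0, "audio": 0}
    for url in urls:
        bucket = classify_url(url)
        if bucket is None:
            return "unrecognized_extension"  # hypothetical label, not a documented code
        counts[bucket] += 1
    for bucket, limit in MAX_PER_BUCKET.items():
        if counts[bucket] > limit:
            return ERROR_CODES[bucket]
    return None
```

Duration limits (combined and per-file 15s caps) are not checked here because they cannot be read from a URL alone.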
Gemini Omni Video — Google's multi-modal video model:
Gemini Omni Video produces video output with built-in audio (29 named voices) and accepts multi-modal inputs (text, image references, an optional single video reference). The model ships as 15 concrete variants — pick the one that matches your target resolution and duration. 720p and 1080p are priced identically; 4K is a separate price tier.
Variant ids (pass to generate_media as model):
| Resolution | 4s | 6s | 8s | 10s | Video-ref (flat) |
|---|---|---|---|---|---|
| 720p | gemini-omni-video-720p-4s | gemini-omni-video-720p-6s | gemini-omni-video-720p-8s | gemini-omni-video-720p-10s | gemini-omni-video-720p-video-ref |
| 1080p | gemini-omni-video-1080p-4s | gemini-omni-video-1080p-6s | gemini-omni-video-1080p-8s | gemini-omni-video-1080p-10s | gemini-omni-video-1080p-video-ref |
| 4K | gemini-omni-video-4k-4s | gemini-omni-video-4k-6s | gemini-omni-video-4k-8s | gemini-omni-video-4k-10s | gemini-omni-video-4k-video-ref |
Video-ref variants take their output duration from the trimmed source clip (the first 10s by default; adjust the window with video_trim_start_s / video_trim_end_s). Duration variants accept 1–7 image references; video-ref variants accept 1 source clip (consumes 2 slots) + up to 5 image references.
Pricing is flat-per-task. Live rates (720p and 1080p share each row — Google bills them identically):
| Model | Name | Rate | Unit |
|---|---|---|---|
| gemini-omni-video-4k-10s | Gemini Omni Video (4K) | 320 | credits |
| gemini-omni-video-4k-4s | Gemini Omni Video (4K) | 230 | credits |
| gemini-omni-video-4k-6s | Gemini Omni Video (4K) | 260 | credits |
| gemini-omni-video-4k-8s | Gemini Omni Video (4K) | 290 | credits |
| gemini-omni-video-4k-video-ref | Gemini Omni Video (video ref, 4K) | 380 | credits |
| gemini-omni-video-hd-10s | Gemini Omni Video | 200 | credits |
| gemini-omni-video-hd-4s | Gemini Omni Video | 110 | credits |
| gemini-omni-video-hd-6s | Gemini Omni Video | 140 | credits |
| gemini-omni-video-hd-8s | Gemini Omni Video | 170 | credits |
| gemini-omni-video-hd-video-ref | Gemini Omni Video (video ref) | 260 | credits |
Parameters specific to Gemini Omni:
| Parameter | Type | Notes |
|---|---|---|
| resolution | n/a | Ignored; encoded in the variant id. |
| duration | n/a | Ignored; encoded in the variant id. Video-ref variants take their duration from the trimmed source clip. |
| aspect_ratio | string | Strictly 16:9 or 9:16. Any other value returns aspect_ratio_invalid_for_model. |
| voice_id | string (optional) | One of 29 built-in voices (e.g. kore, puck, achernar, zephyr). Discover the full list via the list_gemini_omni_voices tool; unknown ids are rejected client-side before any credits are charged. Omit to use the model's default voice. |
| character_ids | string[] (optional) | One or more saved characters from the user's library (created via manage_library(kind="character", action="create", ...), listed via list_gemini_omni_characters). Each consumes 1 of the 7 reference slots. Provider renders a consistent identity across multiple clips. |
| video_trim_start_s | number (optional) | Trim window start (seconds) for the reference video. Default 0. |
| video_trim_end_s | number (optional) | Trim window end (seconds) for the reference video. Default: min(source_duration, start+10) when the URL is in media_upload_metadata, else start+10. Provider hard limits: range ≤ 10s, ends ≤ 30s. |
| seed | number (optional) | Reproducible runs. |
| negative_prompt | n/a | Not supported. Silently stripped. |
| sound | n/a | Not supported; audio is intrinsic to every output. |
Source media (source_media_urls): 7 reference slots total. A single video reference consumes 2 slots; each character_id consumes 1 slot; images fill the rest. Maximum 1 video per request; passing a video URL requires a -video-ref variant (a duration variant + video returns variant_mismatch). The driving clip window defaults to the first 10 seconds; override with video_trim_start_s / video_trim_end_s. Audio file URLs are not accepted (returns unsupported_audio_url); use voice_id instead.
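The slot arithmetic is easy to get wrong when mixing images, a video, and characters, so a small pre-check can help. This sketch encodes only the rules stated above; the function names are hypothetical, and apart from variant_mismatch the returned strings are plain descriptions rather than documented error codes.

```python
from typing import List

MAX_REFERENCE_SLOTS = 7


def gemini_omni_slots_used(image_count: int, video_count: int = 0, character_count: int = 0) -> int:
    """A video reference costs 2 slots; each image and each character_id costs 1."""
    return image_count + 2 * video_count + character_count


def check_gemini_omni_references(
    model: str, image_count: int, video_count: int = 0, character_count: int = 0
) -> List[str]:
    """Return a list of problems; an empty list means the reference mix looks valid."""
    problems: List[str] = []
    if video_count > 1:
        problems.append("at most 1 video reference per request")
    if video_count and not model.endswith("-video-ref"):
        problems.append("variant_mismatch")
    slots = gemini_omni_slots_used(image_count, video_count, character_count)
    if slots > MAX_REFERENCE_SLOTS:
        problems.append(f"{slots} reference slots used, limit is {MAX_REFERENCE_SLOTS}")
    return problems
```

For instance, one video plus five images fills exactly 7 slots on a -video-ref variant, while the same video on gemini-omni-video-1080p-6s yields variant_mismatch.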
Companion tools (use these alongside Gemini Omni generations):
| Tool | Purpose |
|---|---|
| list_gemini_omni_voices | Static catalog of all 30 voices (id, label, gender, character, preview_url to a ~7s mp3 sample). Free to call. |
| play_gemini_omni_voice | Returns the preview_url for a single voice id, which is quicker than fetching the full catalog when you already know the voice. Surface the URL to the user so they can hear it before committing. |
| list_gemini_omni_characters | List the user's saved characters. Each row has a character_id to pass to generate_media and an internal id for delete. |
| manage_library(kind="character", action="create", ...) | Persist a new character from a reference image + description (+ optional character_name, voice_id). Returns the provider's opaque character_id. |
| manage_library(kind="character", action="delete", character_id=...) | Local delete (the provider has no delete endpoint). Past generations that used this character keep their outputs. |
| trim_video | Cut a long source video down to the ≤10s window Gemini Omni needs for -video-ref variants. Returns a hosted URL ready to use as source_media_urls. |
Example request (1080p, 6s, vertical, reusing a saved character + chosen voice):
{
"model": "gemini-omni-video-1080p-6s",
"prompt": "A barista pulls a perfect espresso shot, narrates the process",
"aspect_ratio": "9:16",
"voice_id": "kore",
"character_ids": ["char_4f2a9b7c"],
"source_media_urls": ["https://media.kubeez.com/cafe-shot.jpg"]
}
Family-name disambiguation:
Some model families expose multiple concrete variants per resolution / quality / reference-video combo: seedance-2, seedance-2-fast, gemini-omni-video, p-video, kling-3-0-motion-control, kling-2-6-motion-control. Passing just the family label as model returns:
{
"error": "variant_required",
"family": "seedance-2-fast",
"available_variants": [
"seedance-2-fast-480p",
"seedance-2-fast-720p",
"seedance-2-fast-480p-video-ref",
"seedance-2-fast-720p-video-ref"
],
"hint": "Ask the user which variant they want — resolution, draft vs standard, or with/without reference video. Pull pricing from get_models."
}
No credits are deducted. Re-call generate_media with a concrete model_id from available_variants (or from get_models).
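Once the user has answered the question in the hint, the retry can be automated. The sketch below picks a concrete id out of available_variants; pick_variant is a hypothetical helper, and matching on the resolution substring and the -video-ref suffix is an assumption based on the naming pattern shown above rather than a documented contract.

```python
from typing import Any, Dict, Optional


def pick_variant(
    error_response: Dict[str, Any], resolution: str, with_video_ref: bool = False
) -> Optional[str]:
    """Return the first variant matching the user's resolution and reference-video choice."""
    if error_response.get("error") != "variant_required":
        return None
    for variant in error_response.get("available_variants", []):
        has_video_ref = variant.endswith("-video-ref")
        if has_video_ref == with_video_ref and f"-{resolution}" in variant:
            return variant
    return None  # nothing matched; fall back to asking the user or get_models
```

With the example response above, pick_variant(response, "720p") selects seedance-2-fast-720p, which can then be passed as model in the follow-up generate_media call.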
Video audio (two different concepts):
capabilities.video_audio in get_models (how to know if the output has sound):
- included: The model’s output normally includes an audio track without using a sound parameter (e.g. Veo, Wan, Grok, Kling 2.5 image-to-video, Motion Control, Gemini Omni Video; the latter exposes a voice_id parameter to pick from 29 built-in voices).
- toggle_via_sound_param: Generated audio is turned on/off with sound: true/false (Kling 2.6, Kling 3.0, Seedance 1.5 Pro). Pricing may differ for audio vs silent. Kling 3.0 routes sound: true to dedicated -audio rows in the catalog: you keep using kling-3-0-std / kling-3-0-pro as the model id and just toggle sound, and the server picks the right priced row. Live rates:
| Model | Name | Rate | Unit |
|---|---|---|---|
| kling-3-0-pro | Kling 3.0 Pro | 21 | credits / sec |
| kling-3-0-pro-audio | Kling 3.0 Pro (with audio) | 30 | credits / sec |
| kling-3-0-std | Kling 3.0 | 17 | credits / sec |
| kling-3-0-std-audio | Kling 3.0 Std (with audio) | 23 | credits / sec |
Motion Control per-second rates (Kling 3.0 and Kling 2.6):
kling-3-0-motion-control, kling-2-6-motion-control (motion-control rates are shown as one block since the two families share the same pricing shape: 720p and 1080p variants).
Seedance 2 also exposes a sound toggle (via its own generate_audio flag, default true) but audio is free, with no surcharge vs silent.
- silent: No generated audio (Seedance 1.0 / v1-pro-fast-i2v only).
supports_sound: only means the API accepts a sound toggle for that model. It does not mean other models are silent; most video models use video_audio: included instead.
Image-only models ignore sound.
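To make the Kling 3.0 audio routing concrete, the following sketch estimates cost from the per-second rates listed above. It assumes the charge is simply rate × duration and that the server selects the -audio row when sound is true, as described; kling_3_0_credits is a hypothetical helper and the rates are a snapshot, so get_models remains the source of truth.

```python
# Snapshot of the per-second rates above (credits per second).
KLING_3_0_RATES = {
    "kling-3-0-std": 17,
    "kling-3-0-std-audio": 23,
    "kling-3-0-pro": 21,
    "kling-3-0-pro-audio": 30,
}


def kling_3_0_credits(model: str, duration_s: int, sound: bool = False) -> int:
    """Estimate credits for a Kling 3.0 clip; the model id stays the same with or without sound."""
    if model not in ("kling-3-0-std", "kling-3-0-pro"):
        raise ValueError("pass kling-3-0-std or kling-3-0-pro and toggle sound instead")
    if not 3 <= duration_s <= 15:
        raise ValueError("Kling 3.0 duration must be between 3 and 15 seconds")
    priced_row = f"{model}-audio" if sound else model
    return KLING_3_0_RATES[priced_row] * duration_s
```

Under these assumptions a silent 5s kling-3-0-std clip is 85 credits, and a 10s kling-3-0-pro clip with sound is 300 credits.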
#REST API: URLs for local or browser files
If you use the HTTP API (POST /v1/generate/media) and your inputs are files on disk or selected in the browser—not already public URLs—upload each file first with POST /v1/upload/media. Pass the returned urls values as source_media_urls.
<a id="trim_video"></a>
#trim_video
Cut a window out of any publicly-accessible video URL and host the trimmed clip. Kubeez handles the trim server-side and returns a public url you can pass straight into generate_media as a source_media_urls value.
When to use this tool:
- The user's source video is longer than a model's reference-clip limit and you want the result to be deterministic instead of relying on provider-side trim hints.
- You want to use a specific window of a longer asset (e.g. "use seconds 5–12 of this take, not the first 10").
- You're chaining multiple generations and want a single canonical trimmed URL to reuse across calls.
Per-model reference-clip caps (skip the trim when the source already fits):
| Model | Reference-clip cap | Notes |
|---|---|---|
| Gemini Omni Video (-video-ref variants) | ≤ 10s window from the first 30s of source | Requires a -video-ref variant; a duration variant + a video ref returns variant_mismatch. |
| Kling 2.6 / 3.0 Motion Control | ≤ 30s video reference | Provider truncates longer clips; trim explicitly for predictable framing. |
| Seedance 2 / 2 Fast | ≤ 15s combined reference video time | If you pass multiple ref videos, trim each so the sum fits. |
Parameters:
| Parameter | Type | Required | Description |
|---|---|---|---|
| source_url | string | Yes | Publicly accessible source video URL (Kubeez upload URL, R2/S3 URL, any public CDN). |
| end_s | number | Yes | Trim window end, in seconds. Must be > start_s; the resulting end_s − start_s range is capped at 60s. |
| start_s | number | No | Trim window start, in seconds. Default 0. |
Returns: { url, size_bytes, duration_s, start_s, end_s, elapsed_s }. The url is a Kubeez-hosted public link pointing at the media-inputs Supabase storage bucket, safe to use as a source_media_urls value. Trimmed clips live alongside user-uploaded inputs and share the same weekly auto-cleanup — treat them as ephemeral, not as a permanent asset.
Typical workflow (Gemini Omni Video with a long user source):
1. get_upload_url(model_id="gemini-omni-video-1080p-video-ref")
→ presigned upload page
2. (user uploads their 25s clip via the link)
3. get_upload_session(token) → { media_urls: ["https://.../source.mp4"] }
4. trim_video(source_url=media_urls[0], start_s=5, end_s=15)
→ { url: "https://.../trims/<uuid>.mp4", duration_s: 10 }
5. generate_media(
model="gemini-omni-video-1080p-video-ref",
prompt="...",
source_media_urls=[<trimmed url>]
)
Standalone trim (source URL already public — no upload step needed):
trim_video(source_url="https://media.kubeez.com/<asset>.mp4", start_s=0, end_s=8)
→ { url, duration_s: 8 }
generate_media(model="kling-3-0-motion-control-1080p", source_media_urls=[<image>, <trimmed url>], ...)
Errors:
| Error code | Cause |
|---|---|
| missing_source_url | source_url not provided. |
| unsafe_source_url | source_url uses a non-http(s) scheme, contains embedded credentials, or resolves to a private/loopback/metadata IP. Kubeez refuses to fetch internal hosts; pass a publicly-resolvable URL. |
| invalid_trim_range | end_s ≤ start_s, negative start_s, or non-numeric values. |
| trim_too_long | end_s − start_s > 60s. |
| source_too_large | Source video > 500 MB. |
| source_fetch_failed | Source URL returned 4xx, 5xx, was unreachable, or hit the redirect-hop cap. |
| ffmpeg_failed | ffmpeg rejected the file (unsupported codec / corrupt container). |
| processor_timeout | Trim service exceeded its download/processing budget. |
| processor_unavailable | Trim service is temporarily unreachable. Retry later or contact support if it persists. |
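The two range-related errors can be anticipated locally before calling trim_video. This is a sketch of the documented conditions only: check_trim_range is a hypothetical helper, and the remaining error codes depend on the source file or the service and cannot be predicted client-side.

```python
import math
from typing import Any, Optional

MAX_TRIM_WINDOW_S = 60.0  # documented cap on end_s - start_s


def check_trim_range(end_s: Any, start_s: Any = 0.0) -> Optional[str]:
    """Return the error code trim_video would be expected to raise, or None if the range is valid."""
    for value in (start_s, end_s):
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(value, bool) or not isinstance(value, (int, float)) or math.isnan(value):
            return "invalid_trim_range"
    if start_s < 0 or end_s <= start_s:
        return "invalid_trim_range"
    if end_s - start_s > MAX_TRIM_WINDOW_S:
        return "trim_too_long"
    return None
```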
Example request:
{
"source_url": "https://media.kubeez.com/u123/uploads/long-take.mp4",
"start_s": 5.2,
"end_s": 14.7
}
Example response:
{
"url": "https://<project>.supabase.co/storage/v1/object/public/media-inputs/u123/trims/abc.mp4",
"size_bytes": 1245678,
"duration_s": 9.5,
"start_s": 5.2,
"end_s": 14.7,
"elapsed_s": 1.83
}
Frame accuracy:
-c copy snaps the actual start to the nearest preceding keyframe (≤ 1 GOP shift, typically ≤ 1s). This matches the web UI's native MP4 stream-copy behavior. For frame-accurate cuts, re-encode the source first; the tradeoff is a slower, lossy operation we deliberately avoid here.
Don't keep the output: the URL points at the media-inputs bucket, which is cleared weekly. If you need the trimmed clip to outlive the next generation, use manage_library(kind="asset", action="add", ...) to copy it into the user's permanent Asset Library.
#get_generation_status
Check the status of a media generation and get output URLs when done.
Parameters:
| Parameter | Type | Required | Description |
|---|---|---|---|
| generation_id | string | Yes | ID returned from generate_media. |
Response: Includes status (pending, queued, processing, completed, failed), progress, and when completed an outputs array with url, thumbnail_url, optimized_url, media_type, dimensions, etc.
#get_generation_estimate
Get a parameter-specific estimated processing time for a given model and options (no job is started). For a per-model estimated duration in one call, use get_models; each model includes estimated_time_seconds. Use get_generation_estimate when you need an estimate that depends on prompt length, duration, or other parameters.
Parameters:
| Parameter | Type | Required | Description |
|---|---|---|---|
| model | string | Yes | Model ID. |
| generation_type | string | No | Same as in generate_media. Default: text-to-image. |
| prompt | string | No | Optional; can affect estimate. |
| negative_prompt | string | No | Optional. |
| parameters | object | No | Optional extra parameters. |
Response: Estimated time (and optionally confidence/sample size) so you can set user expectations before calling generate_media.
#Model rules
- Text-to-image and text-to-video: Do not send source_media_urls (unless the model supports an optional reference image). Exception: seedance-1-5-pro in text-to-video mode accepts 0–1 optional images.
- Image-to-video and image-to-image: Send image (and when supported, video) URL(s) in source_media_urls. Most models need images only; some (e.g. Kling 2.6 Motion Control) require 1 image + 1 video. seedance-1-5-pro in image-to-video mode requires exactly 2 images (start + end frame). Respect each model’s input limits above.
- Video audio: Use capabilities.video_audio from get_models. included: audio is part of typical output without a sound parameter (e.g. Veo, Wan, Grok). toggle_via_sound_param: use sound only when supports_sound is true (Kling 2.6, Kling 3.0, Seedance 1.5 Pro). silent: no generated audio (Seedance 1.0 only). Do not infer “no audio” from supports_sound: false alone.
- Use get_models to see which models support which generation types, input_media_types (e.g. image, video), and required input counts.
See Limitations for rate limits and credits. For a single table of API defaults (prompt max, inputs, duration, flags), see REST API model requirements.
