Media tools
Generate images and videos using 40+ AI models. Always call get_models first to see available models, costs, and whether a model needs an input image.
REST HTTP clients: the same limits are documented in one place in REST API model requirements (and returned per model from GET /v1/models).
#generate_media
Starts an image or video generation.
Parameters:
| Parameter | Type | Required | Description |
|---|---|---|---|
| prompt | string | Yes | What to generate (e.g. “A red car on a mountain road”). |
| model | string | Yes | Model ID (from get_models). Examples: nano-banana, sora-2, kling-2-6-image-to-video. |
| generation_type | string | No | text-to-image, text-to-video, image-to-video, or image-to-image. Default: text-to-image. |
| negative_prompt | string | No | What to avoid in the output. |
| source_media_urls | string or array | No | Required for image-to-video and image-to-image. URL(s) to image(s), or for some models (e.g. Kling 2.6 Motion) image + video. See input limits below. Omit for text-to-image and text-to-video. |
| aspect_ratio | string | No | e.g. 1:1, 16:9, 9:16, 4:5, 21:9. Default: 1:1. Note: each model only accepts a subset — get_models returns the allowed list. |
| duration | string | No | Video length. Only certain video models use this. See below. |
| quality | string | No | e.g. fast, standard, pro, ultra. Default: standard. |
| resolution | string | No | Output resolution tier. Only certain image models use this — gpt-image-2 (1K/2K/4K), nano-banana-pro/nano-banana-2 (1K/2K/4K), flux-2 (1K/2K). Each tier is a separate pricing SKU; get_models returns the per-tier credit cost. Ignored by models where resolution is encoded in the variant model_id (Seedance, Kling, Sora, P-Video). See the Resolution tiers table below for per-model constraints. |
| sound | boolean | No | When true, request video with generated audio. Only certain video models use this. Default: false. See below. |
| seed | number | No | Seed for reproducible results. |
Example (text-to-image):
{
"prompt": "A futuristic city at sunset with flying cars",
"model": "nano-banana",
"generation_type": "text-to-image",
"aspect_ratio": "16:9",
"quality": "pro"
}
Example (image-to-video, one input image required):
{
"prompt": "Gentle motion and subtle movement",
"model": "kling-2-6-image-to-video",
"generation_type": "image-to-video",
"source_media_urls": ["https://example.com/your-image.jpg"],
"aspect_ratio": "16:9",
"duration": "5s"
}
Response: Includes generation_id, status (e.g. pending), and often estimated_time_seconds and estimated_cost_credits. Poll with get_generation_status until status is completed or failed.
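The polling step can be sketched as below. This is a minimal sketch, not a documented client: `fetch_status` is a hypothetical callable standing in for however your client invokes get_generation_status, and the interval and timeout defaults are assumptions rather than documented values.

```python
import time
from typing import Callable, Dict

# Terminal states named in the response description above.
TERMINAL_STATUSES = {"completed", "failed"}


def wait_for_generation(
    generation_id: str,
    fetch_status: Callable[[str], Dict],  # hypothetical wrapper around get_generation_status
    interval_s: float = 5.0,              # assumed polling interval
    timeout_s: float = 600.0,             # assumed overall timeout
    sleep: Callable[[float], None] = time.sleep,
) -> Dict:
    """Poll until status is completed or failed; raise TimeoutError otherwise."""
    waited = 0.0
    while True:
        result = fetch_status(generation_id)
        if result.get("status") in TERMINAL_STATUSES:
            return result
        if waited >= timeout_s:
            raise TimeoutError(
                f"generation {generation_id} still {result.get('status')!r}"
            )
        sleep(interval_s)
        waited += interval_s
```

Passing `sleep` as a parameter keeps the loop testable without real delays.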
Models that support duration:
| Model(s) | Supported values | Notes |
|---|---|---|
| kling-2-6-text-to-video, kling-2-6-image-to-video | 5s, 10s | Optional with/without audio (model variant). |
| wan-2-5 (text-to-video, image-to-video) | 5s, 10s | |
| v1-pro-fast-i2v | 5s, 10s | |
| seedance-1-5-pro | 4s, 8s, 12s | Supports both text-to-video (0–1 image optional) and image-to-video (2 images required). |
| seedance-2 (Standard) / seedance-2-fast (Fast) | 4s–15s (integer) | ByteDance Seedance 2. The tier is the model family itself: seedance-2-fast for the cheaper/faster tier, seedance-2 for the higher-quality tier. Each tier has concrete variant model_ids per resolution / reference-video combo (e.g. seedance-2-fast-480p, seedance-2-480p-video-ref). Pass the full variant id to generate_media; a family label alone returns a variant_required error with the choices. Text-to-video accepts multimodal references (up to 9 images + 3 videos + 3 audios); image-to-video takes 1 required image + optional end frame + up to 3 reference audios. With a reference video, the billing formula becomes credits = (ref_video_s + output_s) × rate/s. |
| sora-2, sora-2-pro (text-to-video, image-to-video) | 10s, 15s | |
| sora-2-pro-storyboard | 10s, 15s, 25s | Scene-based; duration from shots. |
| grok-text-to-video-6s | Fixed 6s | Duration parameter ignored. |
| kling-3-0-std, kling-3-0-pro | 3s–15s | Single-shot mode. Max 2500 chars; supports @element_name references. |
| grok-image-to-video, kling-2-5-image-to-video-pro, veo3-1 | Not configurable | Duration not set via this parameter. |
For image-only models, duration is ignored.
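A client-side pre-check against the duration table might look like the sketch below. The dictionary copies only the fixed-value rows from the table above; the helper name and the choice to skip unknown models are assumptions, and get_models remains the authoritative source.

```python
from typing import Optional

# Fixed duration values copied from the table above (range-based models omitted).
ALLOWED_DURATIONS = {
    "kling-2-6-text-to-video": {"5s", "10s"},
    "kling-2-6-image-to-video": {"5s", "10s"},
    "v1-pro-fast-i2v": {"5s", "10s"},
    "seedance-1-5-pro": {"4s", "8s", "12s"},
    "sora-2": {"10s", "15s"},
    "sora-2-pro": {"10s", "15s"},
    "sora-2-pro-storyboard": {"10s", "15s", "25s"},
}


def check_duration(model: str, duration: Optional[str]) -> Optional[str]:
    """Return an error message, or None when the value looks acceptable."""
    allowed = ALLOWED_DURATIONS.get(model)
    if allowed is None or duration is None:
        return None  # model not in this local table, or parameter omitted
    if duration not in allowed:
        choices = sorted(allowed, key=lambda d: int(d[:-1]))
        return f"{model} accepts {choices}, got {duration!r}"
    return None
```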
Models that support negative_prompt:
| Model(s) | Notes |
|---|---|
| imagen-4, imagen-4-fast, imagen-4-ultra | Text-to-image. |
| wan-2-5 (text-to-video, image-to-video) | |
| kling-2-5-image-to-video-pro | |
All other models ignore negative_prompt.
Models that support quality (or equivalent):
| Model(s) | How it works | Values |
|---|---|---|
| sora-2-pro (text-to-video, image-to-video) | Mapped to size (standard vs HD). | standard, pro/high/hd (for HD). |
| imagen-4 variants | Mapped to model_variant. | standard, fast, ultra (use quality: standard / fast / ultra). |
| seedream-v4, seedream-v4-edit | Resolution via quality param. | 1K (default), 2K, 4K. |
| seedream-v4-5, seedream-v4-5-edit | Uses quality directly. | basic (2K, default), high (4K). |
| 5-lite-text-to-image, 5-lite-image-to-image | Uses quality directly. | basic (2K, default), high (4K). |
| veo3-1 vs veo3-1-fast | Different model IDs, not a single quality param. | Use model veo3-1 (quality) or veo3-1-fast (speed). |
| flux-2, nano-banana-pro, nano-banana-2 | Resolution (1K/2K/4K), not a generic “quality” string. | Pass via the dedicated resolution parameter — see below. |
| gpt-image-2 (t2i + i2i) | Resolution via the resolution parameter. | See Resolution tiers below. |
For other models, quality is ignored.
<a id="resolution-tiers"></a>
Resolution tiers (the resolution parameter):
| Model | Values | Pricing | Constraint |
|---|---|---|---|
| gpt-image-2 (t2i + i2i) | 1K (default), 2K, 4K | 11 / 15 / 21 credits | 2K and 4K require an explicit non-square, non-auto aspect_ratio — one of 9:16, 16:9, 4:3, 3:4. Passing auto or 1:1 at 2K/4K returns HTTP 400 with error: "aspect_ratio_incompatible_with_high_res" and no credits are held. 1K accepts every supported aspect including auto/1:1. |
| nano-banana-2 | 1K (default), 2K, 4K | See get_models | Each tier is a separate pricing SKU. Aspect ratio list unchanged across tiers. |
| nano-banana-pro | 1K (default), 2K, 4K | See get_models | Same pattern as nano-banana-2. |
| flux-2, flux-2-edit | 1K (default), 2K | See get_models | Two tiers only. |
When to pick which tier (GPT Image 2):
- 1K: default. Use for social posts, thumbnails, concepting, in-app previews, anything ≤ 1024 × 1024. Cheapest; no aspect-ratio gotchas.
- 2K: use when the client needs a crisp web hero, newsletter cover, or in-product illustration at retina density. Must pick a directional aspect (landscape or portrait).
- 4K: use for print, out-of-home, banners, or any case where the user explicitly asks for the highest output size. Confirm the aspect with the user first; 1:1 and auto won't work at this tier.
Models not listed ignore resolution. For video families (Seedance, Kling, Sora, P-Video) the resolution is part of the concrete variant model_id — pass the variant (e.g. seedance-2-fast-480p, p-video-1080p), not this param.
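The gpt-image-2 high-resolution constraint from the Resolution tiers table can be checked before sending a request. This is a sketch under the rules stated in that table; the function name is hypothetical, and the server remains the source of truth.

```python
from typing import Optional

HIGH_RES_TIERS = {"2K", "4K"}
# Aspect ratios the table lists as valid for 2K and 4K.
HIGH_RES_ASPECTS = {"9:16", "16:9", "4:3", "3:4"}


def check_gpt_image_2(resolution: str = "1K", aspect_ratio: str = "1:1") -> Optional[str]:
    """Return the documented error code if the combination would be rejected."""
    if resolution in HIGH_RES_TIERS and aspect_ratio not in HIGH_RES_ASPECTS:
        return "aspect_ratio_incompatible_with_high_res"
    return None  # 1K accepts every supported aspect, including auto and 1:1
```

Catching this locally avoids a round trip that would end in HTTP 400.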
Prompt character limits:
Some models enforce a maximum prompt length. Exceeding it may return an error or truncation.
| Model(s) | Max characters |
|---|---|
| wan-2-5 | 800 |
| kling-2-6 (text-to-video, image-to-video) | 2,500 |
| seedance-2 (Fast + Standard) | 2,500 |
| kling-3-0-std, kling-3-0-pro | 2,500 |
| kling-2-5-image-to-video-pro | 2,500 |
| seedream-v4, seedream-v4-edit | 2,500 |
| seedream-v4-5, seedream-v4-5-edit | 3,000 |
| 5-lite-text-to-image, 5-lite-image-to-image | 2,995 |
| gpt-1.5-image-medium, gpt-1.5-image-high | 3,000 |
| nano-banana, imagen-4, sora-2, flux-2, veo3-1, v1-pro-fast-i2v, grok (image/video), p-image-edit | 5,000 |
| nano-banana-pro (all variants) | 20,000 |
| nano-banana-2 (all variants) | 20,000 |
Others may have no documented limit or use server defaults.
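A prompt-length guard based on the table above could be sketched as follows. Only a few rows are copied for illustration, and counting with `len()` is an assumption: the documentation does not say how the server counts characters.

```python
from typing import Optional

# A subset of the limits table above; extend as needed.
PROMPT_LIMITS = {
    "wan-2-5": 800,
    "seedream-v4": 2500,
    "seedream-v4-5": 3000,
    "nano-banana": 5000,
    "nano-banana-pro": 20000,
}


def prompt_overflow(model: str, prompt: str) -> Optional[int]:
    """Return how many characters the prompt exceeds the limit by, or None."""
    limit = PROMPT_LIMITS.get(model)
    if limit is None:
        return None  # no limit recorded in this local table
    excess = len(prompt) - limit  # assumes the server counts like len()
    return excess if excess > 0 else None
```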
Input file (image and video) limits:
For image-to-video and image-to-image, source_media_urls is a list of URLs. Most models accept images only (JPEG, PNG, WebP, typically 10 MB max per file). Some models also accept video inputs; when they do, format and size limits apply (e.g. MP4, max duration).
| Model(s) | Input type | Limit | Notes |
|---|---|---|---|
| kling-2-6-motion-control-720p, kling-2-6-motion-control-1080p | Image + video | 1 image + 1 video | Motion Control: reference video drives motion. Video max 30 s; video file typically up to 100 MB (MP4/WebM). |
| kling-3-0-motion-control-720p, kling-3-0-motion-control-1080p | Image + video | 1 image + 1 video | Kling 3.0 Motion Control: same as Kling 2.6. Per-second billing — see the Motion Control pricing table below. Video max 30 s; video file typically up to 100 MB (MP4/WebM). |
| kling-2-6-image-to-video, sora-2 (image-to-video), wan-2-5 (image-to-video), grok-image-to-video, v1-pro-fast-i2v | Images only | 1 image | Exactly one input image. |
| kling-2-5-image-to-video-pro | Images only | 2 images | Start frame and end frame. |
| seedance-1-5-pro | Images only | Mode-dependent | Text-to-video (generation_type: "text-to-video"): 0–1 images optional. Image-to-video (generation_type: "image-to-video"): exactly 2 images required (start + end frame). |
| seedance-2 (Fast + Standard) | Image + video + audio | Mode-dependent | Text-to-video: up to 9 images, 3 videos (combined ≤ 15s), and 3 audio clips (combined ≤ 15s) — all optional. Image-to-video: 1 required image (first frame) + optional end frame (2 images max) + up to 3 audio clips; no reference videos allowed in this mode. Pass every URL (image, video, audio) in source_media_urls — the backend classifies by file extension (.jpg/.png/.webp → image, .mp4/.mov/.webm → video, .mp3/.wav/.m4a → audio) and routes each to the right bucket. |
| kling-3-0-std, kling-3-0-pro | Images only | 1–2 images | Start frame, or start + end frame. PNG/JPG/JPEG. Supports elements (see below). |
| seedream-v4-edit | Images only | 10 | For editing. |
| 5-lite-text-to-image, 5-lite-image-to-image | Images only | 10 | For editing (image-to-image). |
| nano-banana, nano-banana-edit | Images only | 10 | |
| nano-banana-pro (all variants) | Images only | 8 | |
| nano-banana-2 (all variants) | Images only | 8 | |
| p-image-edit | Images only | 1–8 | Pruna AI P Image Edit. Image-to-image only — set generation_type: "image-to-image". Pass 1–8 URLs in source_media_urls. aspect_ratio: auto matches the first input image, or use 1:1, 16:9, 9:16, 4:3, 3:4, 3:2, 2:3. Optional turbo (default on). Default disable_safety_checker: true (moderation off); set disable_safety_checker: false to enable the safety checker. Optional seed. |
| flux-2-edit (image-to-image) | Images only | 8 | |
| gpt-1.5-image (image-to-image) | Images only | 16 | |
| veo3-1 (image-to-video / reference modes) | Images only | 1–3 | Depends on mode: 1 optional reference image for text-to-video; 2 for first + last frame; 3 for reference mode. |
| sora-2-pro-storyboard | Images only | 1 | Optional. |
Use get_models to confirm input_media_types and capabilities for a given model. See Account tools for model list and pricing.
Kling 3.0 – elements (optional):
Elements let you reference images or videos in your prompt using @element_name. Pass kling_elements as an array of objects with name, description, and either element_input_urls (2–4 image URLs) or element_input_video_urls (1 video URL). Reference images for elements come from each element’s element_input_urls; main image_urls may be empty for text-to-video or hold optional start/end frames for image-to-video. Each element requires a title (name) and description. Image elements: JPG/PNG, min 300×300px, max 10MB each. Video elements: MP4/MOV, max 50MB.
Seedance 1.5 Pro – two modes (check generation_type before using images):
| Mode | generation_type | source_media_urls | Can use images? |
|---|---|---|---|
| Text-to-video | "text-to-video" | Empty or 1 URL | Optional: 0–1 images. Omit for text-only; include 1 URL to animate that image. |
| Image-to-video | "image-to-video" | Exactly 2 URLs | Required: exactly 2 images (start frame + end frame). |
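The two Seedance 1.5 Pro modes in the table above can be validated before calling generate_media. A minimal sketch; the function name and error messages are hypothetical.

```python
from typing import List, Optional, Union


def check_seedance_1_5_inputs(
    generation_type: str,
    source_media_urls: Union[str, List[str], None] = None,
) -> Optional[str]:
    """Return an error message, or None if the image count fits the mode."""
    if source_media_urls is None:
        urls: List[str] = []
    elif isinstance(source_media_urls, str):
        urls = [source_media_urls]  # the parameter accepts a string or an array
    else:
        urls = list(source_media_urls)

    if generation_type == "text-to-video" and len(urls) > 1:
        return "text-to-video accepts at most 1 image"
    if generation_type == "image-to-video" and len(urls) != 2:
        return "image-to-video requires exactly 2 images (start + end frame)"
    return None
```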
Seedance 2 – two tiers, two modes, mixed references:
ByteDance Seedance 2 ships as two separate model families: seedance-2-fast (cheaper, faster) and seedance-2 (standard, higher quality). Each family exposes concrete variant model_ids per resolution and reference-video combo; pass the full variant (e.g. seedance-2-fast-480p, seedance-2-720p-video-ref), because a bare family label returns a variant_required error listing the options. Resolutions are 480p or 720p for seedance-2-fast; the rate table below also lists 1080p variants for seedance-2, so confirm the currently offered variants with get_models. Duration is an integer 4–15 seconds. Supported aspect ratios: 1:1, 4:3, 3:4, 16:9, 9:16, 21:9, adaptive. Prompt max 2,500 characters. Audio is a free toggle via generate_audio (defaults to true); unlike Kling 3.0, there is no surcharge for audio.
| Mode | generation_type | Required inputs | Optional references | Not allowed |
|---|---|---|---|---|
| Text-to-video | "text-to-video" | None (prompt only) | Up to 9 images, up to 3 videos (combined ≤ 15s), up to 3 audio clips (combined ≤ 15s) | — |
| Image-to-video | "image-to-video" | 1 image (first frame) | Optional 2nd image (last frame) · up to 3 audio clips (combined ≤ 15s) | Reference videos (rejected with a clear error) |
Put every reference URL (images, videos, audio) into source_media_urls. The backend classifies each URL by file extension — .jpg / .png / .webp → image, .mp4 / .mov / .webm → video, .mp3 / .wav / .m4a → audio — and routes it to the correct bucket automatically.
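To preview how a list of URLs would be bucketed, the extension mapping described above can be mirrored locally. This is only an approximation: the backend's actual logic is not shown in this document, so handling of other extensions (for example .jpeg) or of URLs without an extension is an assumption.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Extension-to-bucket mapping as listed in the text above.
EXTENSION_KIND = {
    ".jpg": "image", ".png": "image", ".webp": "image",
    ".mp4": "video", ".mov": "video", ".webm": "video",
    ".mp3": "audio", ".wav": "audio", ".m4a": "audio",
}


def classify_url(url: str) -> str:
    """Return 'image', 'video', 'audio', or 'unknown' based on the URL path suffix."""
    # Query strings are ignored; only the path's final suffix is inspected.
    suffix = PurePosixPath(urlparse(url).path).suffix.lower()
    return EXTENSION_KIND.get(suffix, "unknown")
```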
Billing (two paths):
- Without a reference video: credits = output_seconds × base_credits_per_second
- With a reference video (upstream provider path): credits = (ref_video_seconds + output_seconds) × base_credits_per_second, where ref_video_seconds is the sum of all reference video durations (capped at 15s per request).
Live per-second rates (pulled from the ai_models_config catalog — always current):
| Model | Name | Rate | Unit |
|---|---|---|---|
| seedance-2-1080p | Seedance 2 (1080p) | 93 | credits / sec |
| seedance-2-1080p-video-ref | Seedance 2 (1080p, video ref) | 65 | credits / sec |
| seedance-2-480p | Seedance 2 (480p) | 18 | credits / sec |
| seedance-2-480p-video-ref | Seedance 2 (480p, video ref) | 13 | credits / sec |
| seedance-2-720p | Seedance 2 (720p) | 40 | credits / sec |
| seedance-2-720p-video-ref | Seedance 2 (720p, video ref) | 29 | credits / sec |
| seedance-2-fast-480p | Seedance 2.0 Fast | 16 | credits / sec |
| seedance-2-fast-480p-video-ref | Seedance 2.0 Fast (video ref) | 12 | credits / sec |
| seedance-2-fast-720p | Seedance 2.0 Fast | 34 | credits / sec |
| seedance-2-fast-720p-video-ref | Seedance 2.0 Fast (video ref) | 24 | credits / sec |
MCP / REST API note: the backend cannot probe remote video durations from a URL, so reference-video requests from the MCP and REST API are billed the pessimistic worst case of 15s of reference video. The web UI probes each video locally and bills the exact sum. For cost-sensitive workflows with short reference videos, prefer the web UI.
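The two billing paths and the 15s worst case for MCP / REST requests can be worked through as below. The rates are taken from the table above; treating the -video-ref variant's rate as the per-second rate on the reference-video path is an assumption, and the function name is hypothetical.

```python
MAX_REF_VIDEO_SECONDS = 15  # cap on combined reference video duration per request


def seedance_2_credits(output_seconds: int, rate_per_second: int,
                       ref_video_seconds: int = 0) -> int:
    """credits = (ref_video_seconds + output_seconds) x rate; ref is 0 without a reference video."""
    billed_ref = min(ref_video_seconds, MAX_REF_VIDEO_SECONDS)
    return (billed_ref + output_seconds) * rate_per_second


# 5 s output on seedance-2-fast-480p (16 credits/sec), no reference video.
no_ref = seedance_2_credits(5, 16)
# Same output with a reference video through MCP/REST: billed as 15 s of reference
# at the seedance-2-fast-480p-video-ref rate (12 credits/sec).
api_with_ref = seedance_2_credits(5, 12, ref_video_seconds=MAX_REF_VIDEO_SECONDS)
# Web UI with a 4 s reference video, billed on the exact duration.
web_with_ref = seedance_2_credits(5, 12, ref_video_seconds=4)
print(no_ref, api_with_ref, web_with_ref)  # 80 240 108
```

Under these assumptions, the same 5 s clip with a short reference video costs more than twice as much through the API as through the web UI.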
Hard limits (enforced with 400 errors):
- More than 3 videos → too_many_videos
- More than 3 audios → too_many_audios
- More than 9 images → too_many_images
- Combined reference video duration > 15s → rejected
- Combined reference audio duration > 15s → rejected
- Any single reference video or audio file longer than 15s → rejected
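The count-based limits above can be checked client-side once each URL has been labelled as image, video, or audio. A sketch only: the order in which the server evaluates the limits is an assumption, and the duration limits are not covered because they require probing the files.

```python
from collections import Counter
from typing import List, Optional

# (kind, maximum, documented error code)
REFERENCE_LIMITS = [
    ("video", 3, "too_many_videos"),
    ("audio", 3, "too_many_audios"),
    ("image", 9, "too_many_images"),
]


def check_reference_counts(kinds: List[str]) -> Optional[str]:
    """kinds holds one label per URL; returns the first violated error code, or None."""
    counts = Counter(kinds)
    for kind, maximum, error_code in REFERENCE_LIMITS:
        if counts[kind] > maximum:
            return error_code
    return None
```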
Family-name disambiguation:
Some model families expose multiple concrete variants per resolution / quality / reference-video combo: seedance-2, seedance-2-fast, p-video, kling-3-0-motion-control, kling-2-6-motion-control. Passing just the family label as model returns:
{
"error": "variant_required",
"family": "seedance-2-fast",
"available_variants": [
"seedance-2-fast-480p",
"seedance-2-fast-720p",
"seedance-2-fast-480p-video-ref",
"seedance-2-fast-720p-video-ref"
],
"hint": "Ask the user which variant they want — resolution, draft vs standard, or with/without reference video. Pull pricing from get_models."
}
No credits are deducted. Re-call generate_media with a concrete model_id from available_variants (or from get_models).
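One way to handle this response is to narrow available_variants before asking the user to choose. The helper below is a hypothetical sketch; it relies on the naming pattern visible in the example (a resolution token and an optional -video-ref suffix), which may not hold for every family.

```python
from typing import Dict, List


def filter_variants(response: Dict, resolution: str, with_video_ref: bool) -> List[str]:
    """Return variants matching the resolution and reference-video choice."""
    if response.get("error") != "variant_required":
        return []
    matches = []
    for variant in response.get("available_variants", []):
        has_ref = variant.endswith("-video-ref")  # naming pattern assumed from the example
        if f"-{resolution}" in variant and has_ref == with_video_ref:
            matches.append(variant)
    return matches


example_response = {
    "error": "variant_required",
    "family": "seedance-2-fast",
    "available_variants": [
        "seedance-2-fast-480p",
        "seedance-2-fast-720p",
        "seedance-2-fast-480p-video-ref",
        "seedance-2-fast-720p-video-ref",
    ],
}
print(filter_variants(example_response, "720p", with_video_ref=True))
# ['seedance-2-fast-720p-video-ref']
```

The hint in the response recommends asking the user, so a filter like this is best used to shorten the list of choices rather than to pick silently.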
Video audio (two different concepts):
- capabilities.video_audio in get_models tells you whether the output has sound:
  - included: The model’s output normally includes an audio track without using a sound parameter (e.g. Veo, Sora, Wan, Grok, Kling 2.5 image-to-video, Motion Control).
  - toggle_via_sound_param: Generated audio is turned on/off with sound: true/false (Kling 2.6, Kling 3.0, Seedance 1.5 Pro). Pricing may differ for audio vs silent. Kling 3.0 routes sound: true to dedicated -audio rows in the catalog: you keep using kling-3-0-std / kling-3-0-pro as the model id and just toggle sound, and the server picks the right priced row. Live rates:
| Model | Name | Rate | Unit |
|---|---|---|---|
| kling-3-0-pro | Kling 3.0 Pro | 21 | credits / sec |
| kling-3-0-pro-audio | Kling 3.0 Pro (with audio) | 30 | credits / sec |
| kling-3-0-std | Kling 3.0 | 17 | credits / sec |
| kling-3-0-std-audio | Kling 3.0 Std (with audio) | 23 | credits / sec |
Motion Control per-second rates (Kling 3.0 and Kling 2.6):
kling-3-0-motion-control, kling-2-6-motion-control (Motion-control rates are shown as one block since the two families share the same pricing shape: 720p and 1080p variants.)
Seedance 2 also exposes a sound toggle (via its own generate_audio flag, default true) but audio is free, with no surcharge vs silent.
  - silent: No generated audio (Seedance 1.0 / v1-pro-fast-i2v only).
- supports_sound only means the API accepts a sound toggle for that model. It does not mean other models are silent; most video models use video_audio: included instead.
Image-only models ignore sound.
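The effect of the sound toggle on Kling 3.0 pricing can be estimated from the live rates table above. This sketch assumes cost is simply duration × rate with no rounding or minimum charge, which this document does not confirm; get_models and estimated_cost_credits remain authoritative.

```python
# (model id, sound flag) -> credits per second, from the Kling 3.0 rates table above.
KLING_3_RATES = {
    ("kling-3-0-std", False): 17,
    ("kling-3-0-std", True): 23,
    ("kling-3-0-pro", False): 21,
    ("kling-3-0-pro", True): 30,
}


def kling_3_credits(model: str, duration_seconds: int, sound: bool = False) -> int:
    """Estimate credits; the model id stays the same and only the sound flag changes."""
    return duration_seconds * KLING_3_RATES[(model, sound)]


print(kling_3_credits("kling-3-0-std", 5))              # 85
print(kling_3_credits("kling-3-0-std", 5, sound=True))  # 115
```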
#REST API: URLs for local or browser files
If you use the HTTP API (POST /v1/generate/media) and your inputs are files on disk or selected in the browser—not already public URLs—upload each file first with POST /v1/upload/media. Pass the returned urls values as source_media_urls.
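The upload-then-generate flow can be sketched as below. The `upload` callable is a hypothetical wrapper around POST /v1/upload/media, and the assumption that its response is a dictionary with a `urls` list is inferred from the sentence above, not from a documented schema.

```python
from typing import Callable, Dict, List


def build_request_from_files(
    paths: List[str],
    upload: Callable[[str], Dict],  # hypothetical wrapper around POST /v1/upload/media
    prompt: str,
    model: str,
    generation_type: str = "image-to-video",
) -> Dict:
    """Upload each local file, then build a body for POST /v1/generate/media."""
    source_media_urls: List[str] = []
    for path in paths:
        response = upload(path)
        # Assumed response shape: {"urls": ["https://..."]}
        source_media_urls.extend(response["urls"])
    return {
        "prompt": prompt,
        "model": model,
        "generation_type": generation_type,
        "source_media_urls": source_media_urls,
    }
```

Keeping the HTTP call behind `upload` lets you plug in whichever HTTP client and authentication scheme you already use.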
#get_generation_status
Check the status of a media generation and get output URLs when done.
Parameters:
| Parameter | Type | Required | Description |
|---|---|---|---|
| generation_id | string | Yes | ID returned from generate_media. |
Response: Includes status (pending, queued, processing, completed, failed), progress, and when completed an outputs array with url, thumbnail_url, optimized_url, media_type, dimensions, etc.
#get_generation_estimate
Get a parameter-specific estimated processing time for a given model and options (no job is started). For a per-model estimated duration in one call, use get_models; each model includes estimated_time_seconds. Use get_generation_estimate when you need an estimate that depends on prompt length, duration, or other parameters.
Parameters:
| Parameter | Type | Required | Description |
|---|---|---|---|
| model | string | Yes | Model ID. |
| generation_type | string | No | Same as in generate_media. Default: text-to-image. |
| prompt | string | No | Optional; can affect estimate. |
| negative_prompt | string | No | Optional. |
| parameters | object | No | Optional extra parameters. |
Response: Estimated time (and optionally confidence/sample size) so you can set user expectations before calling generate_media.
#Model rules
- Text-to-image and text-to-video: Do not send source_media_urls (unless the model supports an optional reference image). Exception: seedance-1-5-pro in text-to-video mode accepts 0–1 optional images.
- Image-to-video and image-to-image: Send image (and when supported, video) URL(s) in source_media_urls. Most models need images only; some (e.g. Kling 2.6 Motion Control) require 1 image + 1 video. seedance-1-5-pro in image-to-video mode requires exactly 2 images (start + end frame). Respect each model’s input limits above.
- Video audio: Use capabilities.video_audio from get_models.
  - included: audio is part of typical output without a sound parameter (e.g. Veo, Sora, Wan, Grok).
  - toggle_via_sound_param: use sound only when supports_sound is true (Kling 2.6, Kling 3.0, Seedance 1.5 Pro).
  - silent: no generated audio (Seedance 1.0 only).
  - Do not infer “no audio” from supports_sound: false alone.
- Use get_models to see which models support which generation types, input_media_types (e.g. image, video), and required input counts.
See Limitations for rate limits and credits. For a single table of API defaults (prompt max, inputs, duration, flags), see REST API model requirements.
