Media tools
Generate images and videos using 40+ AI models. Always call get_models first to see available models, costs, and whether a model needs an input image.
#generate_media
Starts an image or video generation.
Parameters:
| Parameter | Type | Required | Description |
|---|---|---|---|
| prompt | string | Yes | What to generate (e.g. “A red car on a mountain road”). |
| model | string | Yes | Model ID (from get_models). Examples: nano-banana, sora-2, kling-2-6-image-to-video. |
| generation_type | string | No | text-to-image, text-to-video, image-to-video, or image-to-image. Default: text-to-image. |
| negative_prompt | string | No | What to avoid in the output. |
| source_media_urls | string or array | No | Required for image-to-video and image-to-image. One or more image URLs; some models (e.g. Kling 2.6 Motion Control) take one image plus one video. See input limits below. Omit for text-to-image and text-to-video unless the model accepts an optional reference image. |
| aspect_ratio | string | No | e.g. 1:1, 16:9, 9:16, 4:5, 21:9. Default: 1:1. |
| duration | string | No | Video length. Only certain video models use this. See below. |
| quality | string | No | e.g. fast, standard, pro, ultra. Default: standard. |
| sound | boolean | No | When true, request video with generated audio. Only certain video models use this. Default: false. See below. |
| seed | number | No | Seed for reproducible results. |
Example (text-to-image):
{
  "prompt": "A futuristic city at sunset with flying cars",
  "model": "nano-banana",
  "generation_type": "text-to-image",
  "aspect_ratio": "16:9",
  "quality": "pro"
}
Example (image-to-video, one input image required):
{
  "prompt": "Gentle motion and subtle movement",
  "model": "kling-2-6-image-to-video",
  "generation_type": "image-to-video",
  "source_media_urls": ["https://example.com/your-image.jpg"],
  "aspect_ratio": "16:9",
  "duration": "5s"
}
Response: Includes generation_id, status (e.g. pending), and often estimated_time_seconds and estimated_cost_credits. Poll with get_generation_status until status is completed or failed.
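The polling step described above can be sketched as follows. This is a minimal illustration, not the service's client: `call_tool` is a hypothetical callable standing in for however your environment invokes a tool by name, and the interval and timeout defaults are arbitrary assumptions.

```python
import time
from typing import Any, Callable, Dict

# Statuses after which polling should stop (from the Response description).
TERMINAL_STATUSES = {"completed", "failed"}


def wait_for_generation(
    call_tool: Callable[[str, Dict[str, Any]], Dict[str, Any]],  # hypothetical tool invoker
    generation_id: str,
    poll_interval_s: float = 5.0,  # assumed default, tune to the model's estimate
    timeout_s: float = 600.0,      # assumed default
) -> Dict[str, Any]:
    """Poll get_generation_status until the job completes, fails, or times out."""
    deadline = time.monotonic() + timeout_s
    while True:
        status = call_tool("get_generation_status", {"generation_id": generation_id})
        if status.get("status") in TERMINAL_STATUSES:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(
                f"generation {generation_id} still {status.get('status')!r} after {timeout_s}s"
            )
        time.sleep(poll_interval_s)
```

If generate_media returned estimated_time_seconds, it is a reasonable starting point for choosing the timeout.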
Models that support duration:
| Model(s) | Supported values | Notes |
|---|---|---|
| kling-2-6-text-to-video, kling-2-6-image-to-video | 5s, 10s | Available with or without audio (separate model variants). |
| wan-2-5 (text-to-video, image-to-video) | 5s, 10s | |
| v1-pro-fast-i2v | 5s, 10s | |
| seedance-1-5-pro | 4s, 8s, 12s | Supports both text-to-video (0–1 image optional) and image-to-video (2 images required). |
| sora-2, sora-2-pro (text-to-video, image-to-video) | 10s, 15s | |
| sora-2-pro-storyboard | 10s, 15s, 25s | Scene-based; duration from shots. |
| grok-text-to-video-6s | Fixed 6s | Duration parameter ignored. |
| kling-3-0-std, kling-3-0-pro | 3s–15s | Single-shot mode. Prompt max 2,500 characters; supports @element_name references. |
| grok-image-to-video, kling-2-5-image-to-video-pro, veo3-1 | Not configurable | Duration not set via this parameter. |
For image-only models, duration is ignored.
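A client can check a duration against the table above before sending a request. The sketch below copies the values from that table. Treating the listed names as exact model IDs is an assumption (confirm with get_models), and `check_duration` is a hypothetical helper, not part of the API.

```python
from typing import Dict, Optional, Set, Tuple

# Discrete duration values per model, copied from the table above.
DISCRETE_DURATIONS: Dict[str, Set[str]] = {
    "kling-2-6-text-to-video": {"5s", "10s"},
    "kling-2-6-image-to-video": {"5s", "10s"},
    "wan-2-5": {"5s", "10s"},
    "v1-pro-fast-i2v": {"5s", "10s"},
    "seedance-1-5-pro": {"4s", "8s", "12s"},
    "sora-2": {"10s", "15s"},
    "sora-2-pro": {"10s", "15s"},
    "sora-2-pro-storyboard": {"10s", "15s", "25s"},
}

# Models that accept any whole number of seconds in a range (inclusive).
RANGE_DURATIONS: Dict[str, Tuple[int, int]] = {
    "kling-3-0-std": (3, 15),
    "kling-3-0-pro": (3, 15),
}


def check_duration(model: str, duration: str) -> Optional[str]:
    """Return None if the duration looks valid, otherwise a problem message."""
    if model in DISCRETE_DURATIONS:
        allowed = DISCRETE_DURATIONS[model]
        if duration not in allowed:
            return f"{model} accepts {sorted(allowed)}, got {duration!r}"
        return None
    if model in RANGE_DURATIONS:
        low, high = RANGE_DURATIONS[model]
        digits = duration[:-1] if duration.endswith("s") else duration
        if not digits.isdigit() or not low <= int(digits) <= high:
            return f"{model} accepts {low}s to {high}s, got {duration!r}"
        return None
    # Models not listed here either ignore duration or are not covered by this sketch.
    return None
```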
Models that support negative_prompt:
| Model(s) | Notes |
|---|---|
| imagen-4, imagen-4-fast, imagen-4-ultra | Text-to-image. |
| wan-2-5 (text-to-video, image-to-video) | |
| kling-2-5-image-to-video-pro | |
All other models ignore negative_prompt.
Models that support quality (or equivalent):
| Model(s) | How it works | Values |
|---|---|---|
| sora-2-pro (text-to-video, image-to-video) | Mapped to size (standard vs HD). | standard, pro/high/hd (for HD). |
| imagen-4 variants | Mapped to model_variant. | standard, fast, ultra. |
| seedream-v4-5-edit (and seedream v4.5 text-to-image) | Uses quality directly. | e.g. basic and other API values. |
| 5-lite-text-to-image, 5-lite-image-to-image | Uses quality directly. | e.g. basic, high. |
| veo3-1 vs veo3-1-fast | Different model IDs, not a single quality param. | Use model veo3-1 (quality) or veo3-1-fast (speed). |
| flux-2, nano-banana-pro, nano-banana-2 | Resolution (1K/2K/4K), not a generic “quality” string. | Use model variant or resolution param when available. |
For other models, quality is ignored.
Prompt character limits:
Some models enforce a maximum prompt length. Exceeding it may return an error or truncation.
| Model(s) | Max characters |
|---|---|
| wan-2-5 | 800 |
| kling-2-6 (text-to-video, image-to-video) | 2,500 |
| kling-3-0-std, kling-3-0-pro | 2,500 |
| kling-2-5-image-to-video-pro | 2,500 |
| seedream-v4, seedream-v4-edit | 2,500 |
| seedream-v4-5, seedream-v4-5-edit | 3,000 |
| 5-lite-text-to-image, 5-lite-image-to-image | 2,995 |
| gpt-1.5-image-medium, gpt-1.5-image-high | 3,000 |
| nano-banana, imagen-4, sora-2, flux-2, veo3-1, v1-pro-fast-i2v, grok (image/video) | 5,000 |
| nano-banana-pro (all variants) | 20,000 |
| nano-banana-2 (all variants) | 20,000 |
Others may have no documented limit or use server defaults.
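Because an over-long prompt may be rejected or truncated, checking the length client-side can save a failed call. This sketch covers only a few rows of the table above. The dictionary keys are assumed to be exact model IDs, and `prompt_within_limit` is a hypothetical helper.

```python
from typing import Dict

# Maximum prompt length in characters for a subset of models (from the table above).
PROMPT_LIMITS: Dict[str, int] = {
    "wan-2-5": 800,
    "kling-2-6-image-to-video": 2500,
    "kling-3-0-std": 2500,
    "seedream-v4-5-edit": 3000,
    "nano-banana": 5000,
    "sora-2": 5000,
}


def prompt_within_limit(model: str, prompt: str) -> bool:
    """True if the prompt fits the documented limit, or if no limit is known."""
    limit = PROMPT_LIMITS.get(model)
    return limit is None or len(prompt) <= limit
```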
Input file (image and video) limits:
For image-to-video and image-to-image, source_media_urls is a list of URLs. Most models accept images only (JPEG, PNG, WebP, typically 10 MB max per file). Some models also accept video inputs; when they do, format and size limits apply (e.g. MP4, max duration).
| Model(s) | Input type | Limit | Notes |
|---|---|---|---|
| kling-2-6-motion-control-720p, kling-2-6-motion-control-1080p | Image + video | 1 image + 1 video | Motion Control: reference video drives motion. Video max 30 s; video file typically up to 100 MB (MP4/WebM). |
| kling-3-0-motion-control-720p, kling-3-0-motion-control-1080p | Image + video | 1 image + 1 video | Kling 3.0 Motion Control: same as Kling 2.6. 24 credits/s (720p), 32 credits/s (1080p). Video max 30 s; video file typically up to 100 MB (MP4/WebM). |
| kling-2-6-image-to-video, sora-2 (image-to-video), wan-2-5 (image-to-video), grok-image-to-video, v1-pro-fast-i2v | Images only | 1 image | Exactly one input image. |
| kling-2-5-image-to-video-pro | Images only | 2 images | Start frame and end frame. |
| seedance-1-5-pro | Images only | Mode-dependent | Text-to-video (generation_type: "text-to-video"): 0–1 images optional. Image-to-video (generation_type: "image-to-video"): exactly 2 images required (start + end frame). |
| kling-3-0-std, kling-3-0-pro | Images only | 1–2 images | Start frame, or start + end frame. PNG/JPG/JPEG. Supports elements (see below). |
| seedream-v4-edit | Images only | 10 | For editing. |
| 5-lite-text-to-image, 5-lite-image-to-image | Images only | 10 | For editing (image-to-image). |
| nano-banana, nano-banana-edit | Images only | 10 | |
| nano-banana-pro (all variants) | Images only | 8 | |
| nano-banana-2 (all variants) | Images only | 8 | |
| flux-2-edit (image-to-image) | Images only | 8 | |
| gpt-1.5-image (image-to-image) | Images only | 16 | |
| veo3-1 (image-to-video / reference modes) | Images only | 1–3 | Depends on mode: 1 optional reference image for text-to-video; 2 for first + last frame; 3 for reference mode. |
| sora-2-pro-storyboard | Images only | 1 | Optional. |
Use get_models to confirm input_media_types and capabilities for a given model. See Account tools for model list and pricing.
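Since source_media_urls may be a single string or an array, it helps to normalize it before counting inputs. The sketch below does that and checks the count for image-to-video requests on models whose table row gives an exact number. The helper names are hypothetical, and the model IDs are taken from the table on the assumption that they match get_models.

```python
from typing import Dict, List, Optional, Tuple, Union

# (min, max) input images for image-to-video requests, from the table above.
IMAGE_COUNTS: Dict[str, Tuple[int, int]] = {
    "kling-2-6-image-to-video": (1, 1),
    "grok-image-to-video": (1, 1),
    "v1-pro-fast-i2v": (1, 1),
    "kling-2-5-image-to-video-pro": (2, 2),  # start frame + end frame
    "kling-3-0-std": (1, 2),
    "kling-3-0-pro": (1, 2),
}


def normalize_urls(value: Union[str, List[str], None]) -> List[str]:
    """Accept a single URL, a list of URLs, or None, and return a list."""
    if value is None:
        return []
    if isinstance(value, str):
        return [value]
    return list(value)


def check_image_count(model: str, source_media_urls: Union[str, List[str], None]) -> Optional[str]:
    """Return None if the input count fits the table, otherwise a problem message."""
    urls = normalize_urls(source_media_urls)
    bounds = IMAGE_COUNTS.get(model)
    if bounds is None:
        return None  # not covered by this sketch; rely on get_models
    low, high = bounds
    if not low <= len(urls) <= high:
        expected = str(low) if low == high else f"{low} to {high}"
        return f"{model} expects {expected} image(s), got {len(urls)}"
    return None
```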
Kling 3.0 – elements (optional):
Elements let you reference images or videos in your prompt using @element_name. Pass kling_elements as an array of objects, each with a name, a description, and either element_input_urls (2–4 image URLs) or element_input_video_urls (1 video URL). Elements require at least one input image: text-to-video with 1 image, or image-to-video with first and last frame. Image elements: JPG/PNG, min 300×300 px, max 10 MB each. Video elements: MP4/MOV, max 50 MB.
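One way to assemble a kling_elements entry while enforcing the constraints just described is sketched below. The field names come from the text above; `make_kling_element` is a hypothetical helper, and file size and resolution checks are out of scope here.

```python
from typing import Any, Dict, List, Optional


def make_kling_element(
    name: str,
    description: str,
    image_urls: Optional[List[str]] = None,
    video_url: Optional[str] = None,
) -> Dict[str, Any]:
    """Build one kling_elements entry from 2-4 image URLs or exactly one video URL."""
    if not name or not description:
        raise ValueError("each element needs a name and a description")
    if (image_urls is None) == (video_url is None):
        raise ValueError("provide either image_urls or video_url, not both or neither")
    element: Dict[str, Any] = {"name": name, "description": description}
    if image_urls is not None:
        if not 2 <= len(image_urls) <= 4:
            raise ValueError("image elements need 2 to 4 image URLs")
        element["element_input_urls"] = list(image_urls)
    else:
        element["element_input_video_urls"] = [video_url]
    return element
```

An element named "hero" would then be referenced in the prompt as @hero, with the resulting dictionaries passed as the kling_elements array.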
Seedance 1.5 Pro – two modes (check generation_type before using images):
| Mode | generation_type | source_media_urls | Can use images? |
|---|---|---|---|
| Text-to-video | "text-to-video" | Empty or 1 URL | Optional: 0–1 images. Omit for text-only; include 1 URL to animate that image. |
| Image-to-video | "image-to-video" | Exactly 2 URLs | Required: exactly 2 images (start frame + end frame). |
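The two modes in the table above can be enforced when building the request. A minimal sketch, assuming a hypothetical `build_seedance_request` helper; the request keys are the generate_media parameters documented earlier, and the default duration is an arbitrary choice among the supported values.

```python
from typing import Any, Dict, List, Optional


def build_seedance_request(
    prompt: str,
    generation_type: str,
    source_media_urls: Optional[List[str]] = None,
    duration: str = "8s",  # assumed default; supported values are 4s, 8s, 12s
) -> Dict[str, Any]:
    """Build generate_media arguments for seedance-1-5-pro, checking image counts per mode."""
    urls = list(source_media_urls or [])
    if generation_type == "text-to-video":
        if len(urls) > 1:
            raise ValueError("text-to-video accepts at most 1 image")
    elif generation_type == "image-to-video":
        if len(urls) != 2:
            raise ValueError("image-to-video requires exactly 2 images (start + end frame)")
    else:
        raise ValueError(f"unsupported generation_type for this sketch: {generation_type!r}")
    request: Dict[str, Any] = {
        "prompt": prompt,
        "model": "seedance-1-5-pro",
        "generation_type": generation_type,
        "duration": duration,
    }
    if urls:
        request["source_media_urls"] = urls  # omitted entirely for text-only requests
    return request
```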
Models that support audio (sound parameter):
| Model(s) | Notes |
|---|---|
| kling-2-6-text-to-video, kling-2-6-image-to-video | Set sound: true to get video with generated audio. Different pricing for audio vs no-audio variants. |
| kling-3-0-std, kling-3-0-pro | Set sound: true to enable generated audio. |
| seedance-1-5-pro | Set sound: true to enable generated audio. Supports both text-to-video and image-to-video modes. |
All other video models do not support the sound parameter. Image-only models ignore sound.
#get_generation_status
Check the status of a media generation and get output URLs when done.
Parameters:
| Parameter | Type | Required | Description |
|---|---|---|---|
| generation_id | string | Yes | ID returned from generate_media. |
Response: Includes status (pending, queued, processing, completed, failed), progress, and when completed an outputs array with url, thumbnail_url, optimized_url, media_type, dimensions, etc.
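Once a status response arrives, the output URLs can be pulled out as sketched below. The field names (status, outputs, url) are those listed in the Response description; `output_urls` is a hypothetical helper, and raising on a failed job is a design choice of this sketch rather than documented behavior.

```python
from typing import Any, Dict, List


def output_urls(status_response: Dict[str, Any]) -> List[str]:
    """Return output URLs for a completed generation, or an empty list if still running."""
    status = status_response.get("status")
    if status == "failed":
        raise RuntimeError("generation failed")
    if status != "completed":
        return []  # pending, queued, or processing: nothing to collect yet
    return [item["url"] for item in status_response.get("outputs", []) if "url" in item]
```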
#get_generation_estimate
Get a parameter-specific estimated processing time for a given model and options (no job is started). For a per-model estimated duration in one call, use get_models; each model includes estimated_time_seconds. Use get_generation_estimate when you need an estimate that depends on prompt length, duration, or other parameters.
Parameters:
| Parameter | Type | Required | Description |
|---|---|---|---|
| model | string | Yes | Model ID. |
| generation_type | string | No | Same as in generate_media. Default: text-to-image. |
| prompt | string | No | Optional; can affect estimate. |
| negative_prompt | string | No | Optional. |
| parameters | object | No | Optional extra parameters. |
Response: Estimated time (and optionally confidence/sample size) so you can set user expectations before calling generate_media.
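A small helper can assemble the arguments for get_generation_estimate and leave out anything not provided. This is only a sketch: `build_estimate_args` is hypothetical, and the keys accepted inside parameters are not documented here, so the "duration" key in the usage note is an assumption.

```python
from typing import Any, Dict, Optional


def build_estimate_args(
    model: str,
    generation_type: str = "text-to-image",
    prompt: Optional[str] = None,
    negative_prompt: Optional[str] = None,
    parameters: Optional[Dict[str, Any]] = None,
) -> Dict[str, Any]:
    """Build get_generation_estimate arguments, omitting optional fields left as None."""
    args: Dict[str, Any] = {"model": model, "generation_type": generation_type}
    optional = {
        "prompt": prompt,
        "negative_prompt": negative_prompt,
        "parameters": parameters,
    }
    args.update({key: value for key, value in optional.items() if value is not None})
    return args
```

For example, `build_estimate_args("sora-2", "text-to-video", prompt="A red car on a mountain road", parameters={"duration": "10s"})` would request an estimate for a 10-second clip, assuming the service reads duration from parameters.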
#Model rules
- Text-to-image and text-to-video: Do not send source_media_urls unless the model supports an optional reference image. For example, seedance-1-5-pro in text-to-video mode accepts 0–1 optional images.
- Image-to-video and image-to-image: Send image (and when supported, video) URL(s) in source_media_urls. Most models need images only; some (e.g. Kling 2.6 Motion Control) require 1 image + 1 video. seedance-1-5-pro in image-to-video mode requires exactly 2 images (start + end frame). Respect each model’s input limits above.
- Audio: kling-2-6, kling-3-0, and seedance-1-5-pro support sound: true; other models ignore sound.
- Use get_models to see which models support which generation types, input_media_types (e.g. image, video), and required input counts.
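The rules above can be combined into a pre-flight check that runs before generate_media. This is a sketch under stated assumptions: `preflight` is a hypothetical helper, the model sets are copied from the tables in this section and may be incomplete, and get_models remains the authoritative source.

```python
from typing import Any, Dict, List

TEXT_TYPES = {"text-to-image", "text-to-video"}
IMAGE_TYPES = {"image-to-video", "image-to-image"}

# Models documented in this section as supporting sound: true.
SOUND_MODELS = {
    "kling-2-6-text-to-video",
    "kling-2-6-image-to-video",
    "kling-3-0-std",
    "kling-3-0-pro",
    "seedance-1-5-pro",
}

# Models documented as accepting an optional image in text-to-video mode.
OPTIONAL_REFERENCE_MODELS = {"seedance-1-5-pro", "veo3-1", "kling-3-0-std", "kling-3-0-pro"}


def preflight(request: Dict[str, Any]) -> List[str]:
    """Return a list of rule violations; an empty list means no problems were found."""
    problems: List[str] = []
    model = request.get("model")
    generation_type = request.get("generation_type", "text-to-image")
    urls = request.get("source_media_urls") or []
    if isinstance(urls, str):
        urls = [urls]
    if generation_type in IMAGE_TYPES and not urls:
        problems.append(f"{generation_type} requires source_media_urls")
    if generation_type in TEXT_TYPES and urls and model not in OPTIONAL_REFERENCE_MODELS:
        problems.append(f"{model} does not take source_media_urls for {generation_type}")
    if request.get("sound") and model not in SOUND_MODELS:
        problems.append(f"{model} does not support sound")
    return problems
```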
See Limitations for rate limits and credits.
