Media tools

Kubeez

Generate images and videos using 40+ AI models. Always call get_models first to see available models, costs, and whether a model needs an input image.

REST HTTP clients: the same limits are documented in one place in REST API model requirements (and returned per model from GET /v1/models).

#generate_media

Starts an image or video generation.

Parameters:

Parameter	Type	Required	Description
prompt	string	Yes	What to generate (e.g. “A red car on a mountain road”).
model	string	Yes	Model ID (from get_models). Examples: nano-banana, sora-2, kling-2-6-image-to-video.
generation_type	string	No	`text-to-image`, `text-to-video`, `image-to-video`, or `image-to-image`. Default: `text-to-image`.
negative_prompt	string	No	What to avoid in the output.
source_media_urls	string or array	No	Required for image-to-video and image-to-image. URL(s) to image(s) or, for some models (e.g. Kling 2.6 Motion Control), image + video. See input limits below. Omit for text-to-image and text-to-video unless the model accepts optional references.
aspect_ratio	string	No	e.g. `1:1`, `16:9`, `9:16`, `4:5`, `21:9`. Default: `1:1`. Note: each model only accepts a subset — `get_models` returns the allowed list.
duration	string	No	Video length. Only certain video models use this. See below.
quality	string	No	e.g. `fast`, `standard`, `pro`, `ultra`. Default: `standard`.
resolution	string	No	Output resolution tier. Only certain image models use this — `gpt-image-2` (`1K`/`2K`/`4K`), `nano-banana-pro`/`nano-banana-2` (`1K`/`2K`/`4K`), `flux-2` (`1K`/`2K`). Each tier is a separate pricing SKU; `get_models` returns the per-tier credit cost. Ignored by models where resolution is encoded in the variant model_id (Seedance, Kling, Sora, P-Video). See the Resolution tiers table below for per-model constraints.
sound	boolean	No	When `true`, request video with generated audio. Only certain video models use this. Default: `false`. See below.
seed	number	No	Seed for reproducible results.

Example (text-to-image):

{
  "prompt": "A futuristic city at sunset with flying cars",
  "model": "nano-banana",
  "generation_type": "text-to-image",
  "aspect_ratio": "16:9",
  "quality": "pro"
}

Example (image-to-video, one input image required):

{
  "prompt": "Gentle motion and subtle movement",
  "model": "kling-2-6-image-to-video",
  "generation_type": "image-to-video",
  "source_media_urls": ["https://example.com/your-image.jpg"],
  "aspect_ratio": "16:9",
  "duration": "5s"
}

Response: Includes generation_id, status (e.g. pending), and often estimated_time_seconds and estimated_cost_credits. Poll with get_generation_status until status is completed or failed.
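The polling step can be sketched as follows. This is a minimal sketch, not the official client: `fetch_status` is a hypothetical callable standing in for whatever invokes get_generation_status, and the interval and timeout values are assumptions rather than documented defaults.

```python
import time
from typing import Callable

TERMINAL_STATUSES = {"completed", "failed"}


def wait_for_generation(
    generation_id: str,
    fetch_status: Callable[[str], dict],  # hypothetical wrapper around get_generation_status
    interval_s: float = 5.0,              # assumed polling interval
    timeout_s: float = 600.0,             # assumed overall timeout
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Poll until the status is completed or failed, then return that response."""
    waited = 0.0
    while True:
        response = fetch_status(generation_id)
        if response.get("status") in TERMINAL_STATUSES:
            return response
        if waited >= timeout_s:
            raise TimeoutError(
                f"generation {generation_id} still {response.get('status')!r}"
            )
        sleep(interval_s)
        waited += interval_s
```

Injecting `sleep` keeps the loop testable without real waiting; a caller would normally leave it at the default.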

Models that support duration:

Model(s)	Supported values	Notes
kling-2-6-text-to-video, kling-2-6-image-to-video	`5s`, `10s`	Optional with/without audio (model variant).
wan-2-5 (text-to-video, image-to-video)	`5s`, `10s`
v1-pro-fast-i2v	`5s`, `10s`
seedance-1-5-pro	`4s`, `8s`, `12s`	Supports both text-to-video (0–1 image optional) and image-to-video (2 images required).
seedance-2 (Standard) / seedance-2-fast (Fast)	`4s`–`15s` (integer)	ByteDance Seedance 2. The tier is the model family itself: `seedance-2-fast` for the cheaper/faster tier, `seedance-2` for the higher-quality tier. Each tier has concrete variant model_ids per resolution / reference-video combo (e.g. `seedance-2-fast-480p`, `seedance-2-480p-video-ref`). Pass the full variant id to `generate_media`; a family label alone returns a `variant_required` error with the choices. Text-to-video accepts multimodal references (up to 9 images, 3 videos, and 3 audio clips); image-to-video takes 1 required image, an optional end frame, and up to 3 reference audio clips. With a reference video the billing formula becomes `credits = (ref_video_s + output_s) × rate/s`.
sora-2, sora-2-pro (text-to-video, image-to-video)	`10s`, `15s`
sora-2-pro-storyboard	`10s`, `15s`, `25s`	Scene-based; duration from shots.
grok-text-to-video-6s	Fixed 6s	Duration parameter ignored.
kling-3-0-std, kling-3-0-pro	`3s`–`15s`	Single-shot mode. Prompt max 2,500 characters; supports @element_name references.
grok-image-to-video, kling-2-5-image-to-video-pro, veo3-1	Not configurable	Duration not set via this parameter.

For image-only models, duration is ignored.
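A client can check a duration before submitting. The sketch below transcribes part of the table above into a lookup; it assumes durations are passed as strings such as `"5s"` (as in the image-to-video example) and that the Kling 3.0 range covers whole seconds only. The function name is hypothetical, and get_models remains the source of truth.

```python
from typing import Optional

# Values transcribed from the duration table; not fetched from the live catalog.
FIXED_DURATIONS = {
    "kling-2-6-text-to-video": {"5s", "10s"},
    "kling-2-6-image-to-video": {"5s", "10s"},
    "wan-2-5": {"5s", "10s"},
    "v1-pro-fast-i2v": {"5s", "10s"},
    "seedance-1-5-pro": {"4s", "8s", "12s"},
    "sora-2": {"10s", "15s"},
    "sora-2-pro": {"10s", "15s"},
    "sora-2-pro-storyboard": {"10s", "15s", "25s"},
}
# Inclusive ranges in seconds (assumed to mean whole seconds).
RANGE_DURATIONS = {
    "kling-3-0-std": (3, 15),
    "kling-3-0-pro": (3, 15),
}


def duration_is_supported(model: str, duration: str) -> Optional[bool]:
    """Return True/False for models in the lookup, None for models not listed."""
    if model in FIXED_DURATIONS:
        return duration in FIXED_DURATIONS[model]
    if model in RANGE_DURATIONS:
        low, high = RANGE_DURATIONS[model]
        digits = duration[:-1] if duration.endswith("s") else duration
        return digits.isdigit() and low <= int(digits) <= high
    return None
```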

Models that support negative_prompt:

Model(s)	Notes
imagen-4, imagen-4-fast, imagen-4-ultra	Text-to-image.
wan-2-5 (text-to-video, image-to-video)
kling-2-5-image-to-video-pro

All other models ignore negative_prompt.

Models that support quality (or equivalent):

Model(s)	How it works	Values
sora-2-pro (text-to-video, image-to-video)	Mapped to `size` (standard vs HD).	`standard`, `pro`/`high`/`hd` (for HD).
imagen-4 variants	Mapped to `model_variant`.	`standard`, `fast`, `ultra`.
seedream-v4, seedream-v4-edit	Resolution via `quality` param.	`1K` (default), `2K`, `4K`.
seedream-v4-5, seedream-v4-5-edit	Uses `quality` directly.	`basic` (2K, default), `high` (4K).
5-lite-text-to-image, 5-lite-image-to-image	Uses `quality` directly.	`basic` (2K, default), `high` (4K).
veo3-1 vs veo3-1-fast	Different model IDs, not a single quality param.	Use model `veo3-1` (quality) or `veo3-1-fast` (speed).
flux-2, nano-banana-pro, nano-banana-2	Resolution (1K/2K/4K), not a generic “quality” string.	Pass via the dedicated `resolution` parameter — see below.
gpt-image-2 (t2i + i2i)	Resolution via the `resolution` parameter.	See Resolution tiers below.

For other models, quality is ignored.

Resolution tiers (the resolution parameter):

Model	Values	Pricing	Constraint
gpt-image-2 (t2i + i2i)	`1K` (default), `2K`, `4K`	11 / 15 / 21 credits	2K and 4K require an explicit non-square, non-auto aspect_ratio — one of `9:16`, `16:9`, `4:3`, `3:4`. Passing `auto` or `1:1` at 2K/4K returns HTTP 400 with `error: "aspect_ratio_incompatible_with_high_res"` and no credits are held. 1K accepts every supported aspect including `auto`/`1:1`.
nano-banana-2	`1K` (default), `2K`, `4K`	See `get_models`	Each tier is a separate pricing SKU. Aspect ratio list unchanged across tiers.
nano-banana-pro	`1K` (default), `2K`, `4K`	See `get_models`	Same pattern as nano-banana-2.
flux-2, flux-2-edit	`1K` (default), `2K`	See `get_models`	Two tiers only.

When to pick which tier (GPT Image 2):

1K — default. Use for social posts, thumbnails, concepting, in-app previews, anything ≤ 1024 × 1024. Cheapest; no aspect-ratio gotchas.
2K — use when the client needs a crisp web hero, newsletter cover, in-product illustration at retina density. Must pick a directional aspect (landscape or portrait).
4K — use for print, out-of-home, banners, or any case where the user explicitly asks for the highest output size. Confirm the aspect with the user first; `1:1` and `auto` are rejected at this tier.

Models not listed ignore resolution. For video families (Seedance, Kling, Sora, P-Video) the resolution is part of the concrete variant model_id — pass the variant (e.g. seedance-2-fast-480p, p-video-1080p), not this param.
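The GPT Image 2 constraint above can be checked client-side before a request is sent. The following is a sketch under the rules stated in the Resolution tiers table; the function name is hypothetical, and it does not validate aspect ratios at 1K, where the allowed list comes from get_models.

```python
from typing import Optional

HIGH_RES_TIERS = {"2K", "4K"}
# Aspect ratios the table lists as valid for gpt-image-2 at 2K and 4K.
HIGH_RES_ASPECTS = {"9:16", "16:9", "4:3", "3:4"}


def gpt_image_2_aspect_error(resolution: str = "1K", aspect_ratio: str = "1:1") -> Optional[str]:
    """Return the documented error code if the server would reject the combination."""
    if resolution in HIGH_RES_TIERS and aspect_ratio not in HIGH_RES_ASPECTS:
        return "aspect_ratio_incompatible_with_high_res"
    return None
```

Catching this locally saves a round trip; the server would reject the request with HTTP 400 anyway, without holding credits.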

Prompt character limits:

Some models enforce a maximum prompt length. Exceeding it may return an error or truncation.

Model(s)	Max characters
wan-2-5	800
kling-2-6 (text-to-video, image-to-video)	2,500
seedance-2 (Fast + Standard)	2,500
kling-3-0-std, kling-3-0-pro	2,500
kling-2-5-image-to-video-pro	2,500
seedream-v4, seedream-v4-edit	2,500
seedream-v4-5, seedream-v4-5-edit	3,000
5-lite-text-to-image, 5-lite-image-to-image	2,995
gpt-1.5-image-medium, gpt-1.5-image-high	3,000
nano-banana, imagen-4, sora-2, flux-2, veo3-1, v1-pro-fast-i2v, grok (image/video), p-image-edit	5,000
nano-banana-pro (all variants)	20,000
nano-banana-2 (all variants)	20,000

Others may have no documented limit or use server defaults.
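Because exceeding a limit may cause an error or truncation, it can help to measure the prompt first. The sketch below covers a subset of the table; it assumes the server counts characters the way Python's `len()` does (Unicode code points), which the documentation does not confirm. The function name is hypothetical.

```python
from typing import Optional

# Subset of the prompt limits listed above, keyed by model id.
PROMPT_LIMITS = {
    "wan-2-5": 800,
    "kling-3-0-std": 2500,
    "kling-3-0-pro": 2500,
    "kling-2-5-image-to-video-pro": 2500,
    "seedream-v4": 2500,
    "seedream-v4-edit": 2500,
    "seedream-v4-5": 3000,
    "seedream-v4-5-edit": 3000,
    "5-lite-text-to-image": 2995,
    "5-lite-image-to-image": 2995,
    "nano-banana": 5000,
}


def prompt_overflow(model: str, prompt: str) -> Optional[int]:
    """Return how many characters the prompt is over the limit (0 if it fits).

    Returns None when the model has no entry in the lookup.
    """
    limit = PROMPT_LIMITS.get(model)
    if limit is None:
        return None
    return max(0, len(prompt) - limit)
```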

Input file (image and video) limits:

For image-to-video and image-to-image, source_media_urls is a list of URLs. Most models accept images only (JPEG, PNG, WebP, typically 10 MB max per file). Some models also accept video inputs; when they do, format and size limits apply (e.g. MP4, max duration).

Model(s)	Input type	Limit	Notes
kling-2-6-motion-control-720p, kling-2-6-motion-control-1080p	Image + video	1 image + 1 video	Motion Control: reference video drives motion. Video max 30 s; video file typically up to 100 MB (MP4/WebM).
kling-3-0-motion-control-720p, kling-3-0-motion-control-1080p	Image + video	1 image + 1 video	Kling 3.0 Motion Control: same as Kling 2.6. Per-second billing — see the Motion Control pricing table below. Video max 30 s; video file typically up to 100 MB (MP4/WebM).
kling-2-6-image-to-video, sora-2 (image-to-video), wan-2-5 (image-to-video), grok-image-to-video, v1-pro-fast-i2v	Images only	1 image	Exactly one input image.
kling-2-5-image-to-video-pro	Images only	2 images	Start frame and end frame.
seedance-1-5-pro	Images only	Mode-dependent	Text-to-video (`generation_type: "text-to-video"`): 0–1 images optional. Image-to-video (`generation_type: "image-to-video"`): exactly 2 images required (start + end frame).
seedance-2 (Fast + Standard)	Image + video + audio	Mode-dependent	Text-to-video: up to 9 images, 3 videos (combined ≤ 15s), and 3 audio clips (combined ≤ 15s) — all optional. Image-to-video: 1 required image (first frame) + optional end frame (2 images max) + up to 3 audio clips; no reference videos allowed in this mode. Pass every URL (image, video, audio) in `source_media_urls` — the backend classifies by file extension (`.jpg`/`.png`/`.webp` → image, `.mp4`/`.mov`/`.webm` → video, `.mp3`/`.wav`/`.m4a` → audio) and routes each to the right bucket.
kling-3-0-std, kling-3-0-pro	Images only	1–2 images	Start frame, or start + end frame. PNG/JPG/JPEG. Supports elements (see below).
seedream-v4-edit	Images only	10	For editing.
5-lite-text-to-image, 5-lite-image-to-image	Images only	10	For editing (image-to-image).
nano-banana, nano-banana-edit	Images only	10
nano-banana-pro (all variants)	Images only	8
nano-banana-2 (all variants)	Images only	8
p-image-edit	Images only	1–8	Pruna AI P Image Edit. Image-to-image only — set `generation_type: "image-to-image"`. Pass 1–8 URLs in `source_media_urls`. aspect_ratio: `auto` matches the first input image, or use `1:1`, `16:9`, `9:16`, `4:3`, `3:4`, `3:2`, `2:3`. Optional turbo (default on). Default disable_safety_checker: true (moderation off); set `disable_safety_checker: false` to enable the safety checker. Optional seed.
flux-2-edit (image-to-image)	Images only	8
gpt-1.5-image (image-to-image)	Images only	16
veo3-1 (image-to-video / reference modes)	Images only	1–3	Depends on mode: 1 optional reference image for text-to-video, 2 for first + last frame, 3 for reference mode.
sora-2-pro-storyboard	Images only	1	Optional.

Use get_models to confirm input_media_types and capabilities for a given model. See Account tools for model list and pricing.
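The per-model image counts lend themselves to a small pre-flight check. This sketch encodes only a few rows of the table above as (min, max) pairs keyed by model and generation_type; the function name and message wording are illustrative, not server output.

```python
from typing import List, Optional, Union

# (min, max) image URLs, transcribed from a few rows of the input-limits table.
IMAGE_COUNT_RULES = {
    ("kling-2-6-image-to-video", "image-to-video"): (1, 1),
    ("seedance-1-5-pro", "text-to-video"): (0, 1),
    ("seedance-1-5-pro", "image-to-video"): (2, 2),
    ("kling-3-0-std", "image-to-video"): (1, 2),
    ("kling-3-0-pro", "image-to-video"): (1, 2),
    ("p-image-edit", "image-to-image"): (1, 8),
}


def image_count_error(
    model: str,
    generation_type: str,
    source_media_urls: Union[str, List[str], None],
) -> Optional[str]:
    """Return a message if the URL count is outside the listed range, else None."""
    rule = IMAGE_COUNT_RULES.get((model, generation_type))
    if rule is None:
        return None  # not in this lookup; consult get_models
    if source_media_urls is None:
        urls = []
    elif isinstance(source_media_urls, str):
        urls = [source_media_urls]  # the parameter accepts a single string
    else:
        urls = list(source_media_urls)
    low, high = rule
    if low <= len(urls) <= high:
        return None
    expected = str(low) if low == high else f"{low}-{high}"
    return f"{model} ({generation_type}) expects {expected} image URL(s), got {len(urls)}"
```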

Kling 3.0 – elements (optional):

Elements let you reference images or videos in your prompt using @element_name. Pass kling_elements as an array of objects, each with a name, a description, and either element_input_urls (2–4 image URLs) or element_input_video_urls (1 video URL). The name and description are both required. Reference images for an element come from its own element_input_urls; the main image_urls may be empty for text-to-video or hold optional start/end frames for image-to-video. Image elements: JPG/PNG, min 300×300 px, max 10 MB each. Video elements: MP4/MOV, max 50 MB.

Seedance 1.5 Pro – two modes (check generation_type before using images):

Mode	`generation_type`	`source_media_urls`	Can use images?
Text-to-video	`"text-to-video"`	Empty or 1 URL	Optional: 0–1 images. Omit for text-only; include 1 URL to animate that image.
Image-to-video	`"image-to-video"`	Exactly 2 URLs	Required: exactly 2 images (start frame + end frame).

Seedance 2 – two tiers, two modes, mixed references:

ByteDance Seedance 2 ships as two separate model families: seedance-2-fast (cheaper, faster) and seedance-2 (standard, higher quality). Each family exposes concrete variant model_ids per resolution and reference-video combo; pass the full variant (e.g. seedance-2-fast-480p, seedance-2-720p-video-ref), because a bare family label returns a variant_required error listing the options. Resolutions are 480p or 720p; the rate table below also lists 1080p variants for seedance-2 only, so confirm availability with get_models. Duration is an integer 4–15 seconds. Supported aspect ratios: 1:1, 4:3, 3:4, 16:9, 9:16, 21:9, adaptive. Prompt max 2,500 characters. Audio is a free toggle via generate_audio (defaults to true); unlike Kling 3.0, there is no surcharge for audio.

Mode	`generation_type`	Required inputs	Optional references	Not allowed
Text-to-video	`"text-to-video"`	None (prompt only)	Up to 9 images, up to 3 videos (combined ≤ 15s), up to 3 audio clips (combined ≤ 15s)	—
Image-to-video	`"image-to-video"`	1 image (first frame)	Optional 2nd image (last frame) · up to 3 audio clips (combined ≤ 15s)	Reference videos (rejected with a clear error)

Put every reference URL (images, videos, audio) into source_media_urls. The backend classifies each URL by file extension — .jpg / .png / .webp → image, .mp4 / .mov / .webm → video, .mp3 / .wav / .m4a → audio — and routes it to the correct bucket automatically.
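To preview how a mixed list would be routed, the extension mapping above can be mirrored locally. This is a sketch of the described rule, not the backend's code: ignoring the query string and lowercasing the extension are assumptions, and extensions not listed above (for example `.jpeg`) are reported as unknown rather than guessed.

```python
from pathlib import PurePosixPath
from typing import Dict, List, Optional
from urllib.parse import urlparse

EXTENSION_BUCKETS = {
    ".jpg": "image", ".png": "image", ".webp": "image",
    ".mp4": "video", ".mov": "video", ".webm": "video",
    ".mp3": "audio", ".wav": "audio", ".m4a": "audio",
}


def classify_reference(url: str) -> Optional[str]:
    """Return 'image', 'video' or 'audio', or None for an unlisted extension."""
    suffix = PurePosixPath(urlparse(url).path).suffix.lower()
    return EXTENSION_BUCKETS.get(suffix)


def bucket_references(urls: List[str]) -> Dict[str, List[str]]:
    """Group URLs by media type, keeping unrecognised ones under 'unknown'."""
    buckets: Dict[str, List[str]] = {"image": [], "video": [], "audio": [], "unknown": []}
    for url in urls:
        buckets[classify_reference(url) or "unknown"].append(url)
    return buckets
```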

Billing (two paths):

Without a reference video: credits = output_seconds × base_credits_per_second
With a reference video (upstream provider path): credits = (ref_video_seconds + output_seconds) × base_credits_per_second, where ref_video_seconds is the sum of all reference video durations (capped at 15s per request).

Live per-second rates (pulled from the ai_models_config catalog — always current):

Model	Name	Rate	Unit
seedance-2-1080p	Seedance 2 (1080p)	93	credits / sec
seedance-2-1080p-video-ref	Seedance 2 (1080p, video ref)	65	credits / sec
seedance-2-480p	Seedance 2 (480p)	18	credits / sec
seedance-2-480p-video-ref	Seedance 2 (480p, video ref)	13	credits / sec
seedance-2-720p	Seedance 2 (720p)	40	credits / sec
seedance-2-720p-video-ref	Seedance 2 (720p, video ref)	29	credits / sec
seedance-2-fast-480p	Seedance 2.0 Fast	16	credits / sec
seedance-2-fast-480p-video-ref	Seedance 2.0 Fast (video ref)	12	credits / sec
seedance-2-fast-720p	Seedance 2.0 Fast	34	credits / sec
seedance-2-fast-720p-video-ref	Seedance 2.0 Fast (video ref)	24	credits / sec

MCP / REST API note: the backend cannot probe remote video durations from a URL, so reference-video requests from the MCP and REST API are billed the pessimistic worst case of 15s of reference video. The web UI probes each video locally and bills the exact sum. For cost-sensitive workflows with short reference videos, prefer the web UI.
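The two billing paths and the API worst-case rule can be worked through in code. This sketch copies the rates from the table above (they may drift from the live catalog) and makes one assumption that the text does not state outright: that a request with a reference video uses a `-video-ref` variant and is billed at that variant's rate. The function name is hypothetical.

```python
from typing import Optional

# Credits per second, copied from the rate table above.
RATES = {
    "seedance-2-1080p": 93, "seedance-2-1080p-video-ref": 65,
    "seedance-2-480p": 18, "seedance-2-480p-video-ref": 13,
    "seedance-2-720p": 40, "seedance-2-720p-video-ref": 29,
    "seedance-2-fast-480p": 16, "seedance-2-fast-480p-video-ref": 12,
    "seedance-2-fast-720p": 34, "seedance-2-fast-720p-video-ref": 24,
}
MAX_REF_VIDEO_S = 15  # cap on billed reference-video seconds per request


def seedance_2_credits(
    model: str,
    output_s: int,
    ref_video_s: Optional[float] = None,
    via_api: bool = True,
) -> float:
    """Estimate credits for one request under the formulas above."""
    rate = RATES[model]
    if not model.endswith("-video-ref"):
        return output_s * rate
    if via_api or ref_video_s is None:
        billed_ref = MAX_REF_VIDEO_S  # MCP/REST cannot probe durations: worst case
    else:
        billed_ref = min(ref_video_s, MAX_REF_VIDEO_S)  # web UI bills the exact sum
    return (billed_ref + output_s) * rate
```

Under these assumptions a 5-second output on seedance-2-fast-480p comes to 80 credits, while the same output with a 4-second reference video comes to 108 credits in the web UI and 240 through the API.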

Hard limits (enforced with 400 errors):

More than 3 videos → too_many_videos
More than 3 audios → too_many_audios
More than 9 images → too_many_images
Combined reference video duration > 15s → rejected
Combined reference audio duration > 15s → rejected
Any single reference video or audio file longer than 15s → rejected
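These limits can be pre-checked when the client already knows the clip durations (for example, from local files). The sketch below returns the documented error codes for the count limits; the documentation names no code for the duration rejections, so those strings are illustrative wording only. The order of checks is also an assumption.

```python
from typing import List

MAX_COMBINED_S = 15


def seedance_2_limit_errors(
    image_count: int,
    video_durations_s: List[float],
    audio_durations_s: List[float],
) -> List[str]:
    """Collect every hard limit that the given reference set would violate."""
    errors = []
    if len(video_durations_s) > 3:
        errors.append("too_many_videos")
    if len(audio_durations_s) > 3:
        errors.append("too_many_audios")
    if image_count > 9:
        errors.append("too_many_images")
    # A single clip over 15s also pushes the combined total over 15s,
    # so the combined check covers the per-file rule as well.
    if sum(video_durations_s) > MAX_COMBINED_S:
        errors.append("reference videos exceed 15s combined")
    if sum(audio_durations_s) > MAX_COMBINED_S:
        errors.append("reference audio exceeds 15s combined")
    return errors
```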

Family-name disambiguation:

Some model families expose multiple concrete variants per resolution / quality / reference-video combo: seedance-2, seedance-2-fast, p-video, kling-3-0-motion-control, kling-2-6-motion-control. Passing just the family label as model returns:

{
  "error": "variant_required",
  "family": "seedance-2-fast",
  "available_variants": [
    "seedance-2-fast-480p",
    "seedance-2-fast-720p",
    "seedance-2-fast-480p-video-ref",
    "seedance-2-fast-720p-video-ref"
  ],
  "hint": "Ask the user which variant they want — resolution, draft vs standard, or with/without reference video. Pull pricing from get_models."
}

No credits are deducted. Re-call generate_media with a concrete model_id from available_variants (or from get_models).
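Once the user has chosen a resolution, the retry can be sketched as below. The naming pattern `family-resolution[-video-ref]` is inferred from the example response and is an assumption; the function name is hypothetical, and it only returns ids that actually appear in available_variants.

```python
from typing import Optional


def pick_variant(
    error_response: dict,
    resolution: str,
    with_video_ref: bool = False,
) -> Optional[str]:
    """Map a user's choice onto a concrete model_id from a variant_required response."""
    if error_response.get("error") != "variant_required":
        return None
    wanted = f"{error_response['family']}-{resolution}"
    if with_video_ref:
        wanted += "-video-ref"
    variants = error_response.get("available_variants", [])
    return wanted if wanted in variants else None
```

A `None` result means the choice does not match any offered variant, in which case the options should be shown to the user again.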

Video audio (two different concepts):

capabilities.video_audio in get_models — how to know if the output has sound:
- included — The model’s output normally includes an audio track without using a sound parameter (e.g. Veo, Sora, Wan, Grok, Kling 2.5 image-to-video, Motion Control).
- toggle_via_sound_param — Generated audio is turned on/off with sound: true / false (Kling 2.6, Kling 3.0, Seedance 1.5 Pro). Pricing may differ for audio vs silent. Kling 3.0 routes sound: true to dedicated -audio rows in the catalog — you keep using kling-3-0-std / kling-3-0-pro as the model id and just toggle sound, the server picks the right priced row. Live rates:

Model	Name	Rate	Unit
kling-3-0-pro	Kling 3.0 Pro	21	credits / sec
kling-3-0-pro-audio	Kling 3.0 Pro (with audio)	30	credits / sec
kling-3-0-std	Kling 3.0	17	credits / sec
kling-3-0-std-audio	Kling 3.0 Std (with audio)	23	credits / sec

Motion Control per-second rates (Kling 3.0 and Kling 2.6):

Per-second rates for the kling-3-0-motion-control and kling-2-6-motion-control families are not listed here; get_models returns the current rate for each variant. The two families share the same pricing shape: 720p and 1080p variants.

Seedance 2 also exposes a sound toggle (via its own generate_audio flag, default true), but audio is free: there is no surcharge compared with silent output.

- silent — No generated audio (Seedance 1.0 / v1-pro-fast-i2v only).

supports_sound — only means the API accepts a sound toggle for that model. It does not mean other models are silent; most video models use video_audio: included instead.

Image-only models ignore sound.
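The effect of the sound toggle on Kling 3.0 pricing can be estimated from the rates listed above. This sketch assumes the cost is simply duration multiplied by the per-second rate of the row the server selects; the rates are copied from the table and may drift from the live catalog, and the function name is hypothetical.

```python
# (model id, sound) -> credits per second, from the Kling 3.0 rate table above.
KLING_3_RATES = {
    ("kling-3-0-std", False): 17,
    ("kling-3-0-std", True): 23,
    ("kling-3-0-pro", False): 21,
    ("kling-3-0-pro", True): 30,
}


def kling_3_credits(model: str, duration_s: int, sound: bool = False) -> int:
    """Estimate credits for a Kling 3.0 clip with or without generated audio."""
    return duration_s * KLING_3_RATES[(model, sound)]
```

For a 5-second kling-3-0-pro clip, this gives 105 credits silent and 150 credits with audio.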

#REST API: URLs for local or browser files

If you use the HTTP API (POST /v1/generate/media) and your inputs are files on disk or selected in the browser—not already public URLs—upload each file first with POST /v1/upload/media. Pass the returned urls values as source_media_urls.
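The upload-then-generate flow can be sketched as a small helper. Here `upload` is a hypothetical callable wrapping POST /v1/upload/media; the sketch assumes it returns a dict whose `urls` key holds a list of strings, which the text implies but does not spell out.

```python
from typing import Callable, List


def resolve_source_urls(inputs: List[str], upload: Callable[[str], dict]) -> List[str]:
    """Upload local paths and return a list usable as source_media_urls.

    Items that are already http(s) URLs are passed through unchanged.
    """
    resolved: List[str] = []
    for item in inputs:
        if item.startswith(("http://", "https://")):
            resolved.append(item)
        else:
            # Assumed response shape: {"urls": ["https://...", ...]}
            resolved.extend(upload(item)["urls"])
    return resolved
```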

#get_generation_status

Check the status of a media generation and get output URLs when done.

Parameters:

Parameter	Type	Required	Description
generation_id	string	Yes	ID returned from generate_media.

Response: Includes status (pending, queued, processing, completed, failed), progress, and when completed an outputs array with url, thumbnail_url, optimized_url, media_type, dimensions, etc.
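Reading the result can be sketched as follows. The field names `status`, `outputs`, `url`, and `optimized_url` come from the response description above; the function itself is illustrative, and it assumes `optimized_url` may be absent on some outputs.

```python
from typing import List


def output_urls(status_response: dict, prefer_optimized: bool = False) -> List[str]:
    """Return output URLs from a completed generation, or [] if it is not completed."""
    if status_response.get("status") != "completed":
        return []
    urls = []
    for output in status_response.get("outputs", []):
        chosen = output.get("optimized_url") if prefer_optimized else None
        chosen = chosen or output.get("url")  # fall back to the primary URL
        if chosen:
            urls.append(chosen)
    return urls
```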

#get_generation_estimate

Get a parameter-specific estimated processing time for a given model and options (no job is started). For a per-model estimated duration in one call, use get_models; each model includes estimated_time_seconds. Use get_generation_estimate when you need an estimate that depends on prompt length, duration, or other parameters.

Parameters:

Parameter	Type	Required	Description
model	string	Yes	Model ID.
generation_type	string	No	Same as in generate_media. Default: `text-to-image`.
prompt	string	No	Optional; can affect estimate.
negative_prompt	string	No	Optional.
parameters	object	No	Optional extra parameters.

Response: Estimated time (and optionally confidence/sample size) so you can set user expectations before calling generate_media.

#Model rules

Text-to-image and text-to-video: Do not send source_media_urls (unless the model supports an optional reference image). Exception: seedance-1-5-pro in text-to-video mode accepts 0–1 optional images.
Image-to-video and image-to-image: Send image (and when supported, video) URL(s) in source_media_urls. Most models need images only; some (e.g. Kling 2.6 Motion Control) require 1 image + 1 video. seedance-1-5-pro in image-to-video mode requires exactly 2 images (start + end frame). Respect each model’s input limits above.
Video audio: Use capabilities.video_audio from get_models. included — audio is part of typical output without a sound parameter (e.g. Veo, Sora, Wan, Grok). toggle_via_sound_param — use sound only when supports_sound is true (Kling 2.6, Kling 3.0, Seedance 1.5 Pro). silent — no generated audio (Seedance 1.0 only). Do not infer “no audio” from supports_sound: false alone.
Use get_models to see which models support which generation types, input_media_types (e.g. image, video), and required input counts.
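The first two rules can be applied while assembling the arguments for generate_media. This is a minimal sketch with a hypothetical helper name: it only enforces that image-to-video and image-to-image carry at least one URL, and it leaves per-model counts and optional references to the checks described earlier and to get_models.

```python
from typing import List, Optional, Union

NEEDS_SOURCE = {"image-to-video", "image-to-image"}


def build_generate_media_args(
    prompt: str,
    model: str,
    generation_type: str = "text-to-image",
    source_media_urls: Union[str, List[str], None] = None,
    **options: Optional[object],
) -> dict:
    """Assemble generate_media arguments, omitting unset optional fields."""
    if isinstance(source_media_urls, str):
        source_media_urls = [source_media_urls]
    if generation_type in NEEDS_SOURCE and not source_media_urls:
        raise ValueError(f"{generation_type} requires source_media_urls")
    args = {"prompt": prompt, "model": model, "generation_type": generation_type}
    if source_media_urls:
        args["source_media_urls"] = list(source_media_urls)
    # Extra keyword options (aspect_ratio, duration, sound, ...) pass through as given.
    args.update({key: value for key, value in options.items() if value is not None})
    return args
```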

See Limitations for rate limits and credits. For a single table of API defaults (prompt max, inputs, duration, flags), see REST API model requirements.