Media tools

    Generate images and videos using 40+ AI models. Always call get_models first to see available models, costs, and whether a model needs an input image.

    For REST HTTP clients, the same limits are collected in one place in REST API model requirements (and returned per model by GET /v1/models).

    #generate_media

    Starts an image or video generation.

    Parameters:

    | Parameter | Type | Required | Description |
    | --- | --- | --- | --- |
    | prompt | string | Yes | What to generate (e.g. "A red car on a mountain road"). |
    | model | string | Yes | Model ID (from get_models). Examples: nano-banana, sora-2, kling-2-6-image-to-video. |
    | generation_type | string | No | text-to-image, text-to-video, image-to-video, or image-to-image. Default: text-to-image. |
    | negative_prompt | string | No | What to avoid in the output. |
    | source_media_urls | string or array | No | Required for image-to-video and image-to-image. URL(s) to image(s), or for some models (e.g. Kling 2.6 Motion Control) image + video. See input limits below. Omit for text-to-image and text-to-video. |
    | aspect_ratio | string | No | e.g. 1:1, 16:9, 9:16, 4:5, 21:9. Default: 1:1. Each model accepts only a subset; get_models returns the allowed list. |
    | duration | string | No | Video length. Only certain video models use this. See below. |
    | quality | string | No | e.g. fast, standard, pro, ultra. Default: standard. |
    | resolution | string | No | Output resolution tier. Only certain image models use this: gpt-image-2 (1K/2K/4K), nano-banana-pro/nano-banana-2 (1K/2K/4K), flux-2 (1K/2K). Each tier is a separate pricing SKU; get_models returns the per-tier credit cost. Ignored by models where resolution is encoded in the variant model_id (Seedance, Kling, Sora, P-Video). See the Resolution tiers table below for per-model constraints. |
    | sound | boolean | No | When true, request video with generated audio. Only certain video models use this. Default: false. See below. |
    | seed | number | No | Seed for reproducible results. |

    Example (text-to-image):

    {
      "prompt": "A futuristic city at sunset with flying cars",
      "model": "nano-banana",
      "generation_type": "text-to-image",
      "aspect_ratio": "16:9",
      "quality": "pro"
    }
    

    Example (image-to-video, one input image required):

    {
      "prompt": "Gentle motion and subtle movement",
      "model": "kling-2-6-image-to-video",
      "generation_type": "image-to-video",
      "source_media_urls": ["https://example.com/your-image.jpg"],
      "aspect_ratio": "16:9",
      "duration": "5s"
    }
    

    Response: Includes generation_id, status (e.g. pending), and often estimated_time_seconds and estimated_cost_credits. Poll with get_generation_status until status is completed or failed.
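    The poll-until-terminal pattern above can be sketched as a small helper. This is a minimal sketch, not part of the tool API: `fetch_status` is a hypothetical callable you supply (e.g. a wrapper around get_generation_status or GET on the REST API), and `wait_for_generation` only interprets the documented status values.

```python
import time

def wait_for_generation(generation_id, fetch_status, interval_s=5, timeout_s=600):
    """Poll until the generation completes or fails.

    fetch_status(generation_id) must return a dict shaped like the
    get_generation_status response, e.g. {"status": "processing", ...}.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        result = fetch_status(generation_id)
        status = result.get("status")
        if status == "completed":
            return result                  # outputs[] is available on this dict
        if status == "failed":
            raise RuntimeError(f"generation {generation_id} failed: {result}")
        time.sleep(interval_s)             # pending / queued / processing: keep waiting
    raise TimeoutError(f"generation {generation_id} still running after {timeout_s}s")
```

    Use estimated_time_seconds from the generate_media response to pick a sensible polling interval instead of the default.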

    Models that support duration:

    | Model(s) | Supported values | Notes |
    | --- | --- | --- |
    | kling-2-6-text-to-video, kling-2-6-image-to-video | 5s, 10s | Optional with/without audio (model variant). |
    | wan-2-5 (text-to-video, image-to-video) | 5s, 10s | |
    | v1-pro-fast-i2v | 5s, 10s | |
    | seedance-1-5-pro | 4s, 8s, 12s | Supports both text-to-video (0–1 image optional) and image-to-video (2 images required). |
    | seedance-2 (Standard) / seedance-2-fast (Fast) | 4s–15s (integer) | ByteDance Seedance 2. The tier is the model family itself: seedance-2-fast for the cheaper/faster tier, seedance-2 for the higher-quality tier. Each tier has concrete variant model_ids per resolution / reference-video combo (e.g. seedance-2-fast-480p, seedance-2-480p-video-ref). Pass the full variant id to generate_media; a family label alone returns a variant_required error with the choices. Multimodal text-to-video references (up to 9 images + 3 videos + 3 audios); image-to-video takes 1 required image + optional end frame + up to 3 reference audios. With a reference video the billing formula becomes credits = (ref_video_s + output_s) × rate/s. |
    | sora-2, sora-2-pro (text-to-video, image-to-video) | 10s, 15s | |
    | sora-2-pro-storyboard | 10s, 15s, 25s | Scene-based; duration comes from shots. |
    | grok-text-to-video-6s | Fixed 6s | duration parameter ignored. |
    | kling-3-0-std, kling-3-0-pro | 3s–15s | Single-shot mode. Max 2,500 chars; supports @element_name references. |
    | grok-image-to-video, kling-2-5-image-to-video-pro, veo3-1 | Not configurable | Duration not set via this parameter. |

    For image-only models, duration is ignored.

    Models that support negative_prompt:

    | Model(s) | Notes |
    | --- | --- |
    | imagen-4, imagen-4-fast, imagen-4-ultra | Text-to-image. |
    | wan-2-5 (text-to-video, image-to-video) | |
    | kling-2-5-image-to-video-pro | |

    All other models ignore negative_prompt.

    Models that support quality (or equivalent):

    | Model(s) | How it works | Values |
    | --- | --- | --- |
    | sora-2-pro (text-to-video, image-to-video) | Mapped to size (standard vs HD). | standard; pro/high/hd (for HD). |
    | imagen-4 variants | Mapped to model_variant. | standard, fast, ultra (use quality: standard / fast / ultra). |
    | seedream-v4, seedream-v4-edit | Resolution via quality param. | 1K (default), 2K, 4K. |
    | seedream-v4-5, seedream-v4-5-edit | Uses quality directly. | basic (2K, default), high (4K). |
    | 5-lite-text-to-image, 5-lite-image-to-image | Uses quality directly. | basic (2K, default), high (4K). |
    | veo3-1 vs veo3-1-fast | Different model IDs, not a single quality param. | Use model veo3-1 (quality) or veo3-1-fast (speed). |
    | flux-2, nano-banana-pro, nano-banana-2 | Resolution (1K/2K/4K), not a generic "quality" string. | Pass via the dedicated resolution parameter; see below. |
    | gpt-image-2 (t2i + i2i) | Resolution via the resolution parameter. | See Resolution tiers below. |

    For other models, quality is ignored.

    <a id="resolution-tiers"></a>

    Resolution tiers (the resolution parameter):

    | Model | Values | Pricing | Constraint |
    | --- | --- | --- | --- |
    | gpt-image-2 (t2i + i2i) | 1K (default), 2K, 4K | 11 / 15 / 21 credits | 2K and 4K require an explicit non-square, non-auto aspect_ratio: one of 9:16, 16:9, 4:3, 3:4. Passing auto or 1:1 at 2K/4K returns HTTP 400 with error "aspect_ratio_incompatible_with_high_res" and no credits are held. 1K accepts every supported aspect ratio, including auto and 1:1. |
    | nano-banana-2 | 1K (default), 2K, 4K | See get_models | Each tier is a separate pricing SKU. Aspect ratio list unchanged across tiers. |
    | nano-banana-pro | 1K (default), 2K, 4K | See get_models | Same pattern as nano-banana-2. |
    | flux-2, flux-2-edit | 1K (default), 2K | See get_models | Two tiers only. |

    When to pick which tier (GPT Image 2):

    • 1K — default. Use for social posts, thumbnails, concepting, in-app previews, anything ≤ 1024 × 1024. Cheapest; no aspect-ratio gotchas.
    • 2K — use when the client needs a crisp web hero, newsletter cover, in-product illustration at retina density. Must pick a directional aspect (landscape or portrait).
    • 4K — use for print, out-of-home, banners, or any case where the user explicitly asks for the highest output size. Confirm the aspect ratio with the user first; 1:1 and auto are rejected at 4K.

    Models not listed ignore resolution. For video families (Seedance, Kling, Sora, P-Video) the resolution is part of the concrete variant model_id — pass the variant (e.g. seedance-2-fast-480p, p-video-1080p), not this param.
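    The gpt-image-2 resolution/aspect constraint can be checked client-side before spending a request. A minimal sketch; the function name and return convention are illustrative, but the allowed-aspect list and error code come from the table above.

```python
# Aspect ratios accepted at 2K/4K for gpt-image-2, per the Resolution tiers table.
HIGH_RES_ASPECTS = {"9:16", "16:9", "4:3", "3:4"}

def check_gpt_image_2_request(resolution="1K", aspect_ratio="auto"):
    """Return None if the combination is acceptable, else the expected error code."""
    if resolution in ("2K", "4K") and aspect_ratio not in HIGH_RES_ASPECTS:
        return "aspect_ratio_incompatible_with_high_res"
    return None                       # 1K accepts every supported aspect ratio
```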

    Prompt character limits:

    Some models enforce a maximum prompt length. Exceeding it may return an error or truncation.

    | Model(s) | Max characters |
    | --- | --- |
    | wan-2-5 | 800 |
    | kling-2-6 (text-to-video, image-to-video) | 2,500 |
    | seedance-2 (Fast + Standard) | 2,500 |
    | kling-3-0-std, kling-3-0-pro | 2,500 |
    | kling-2-5-image-to-video-pro | 2,500 |
    | seedream-v4, seedream-v4-edit | 2,500 |
    | seedream-v4-5, seedream-v4-5-edit | 3,000 |
    | 5-lite-text-to-image, 5-lite-image-to-image | 2,995 |
    | gpt-1.5-image-medium, gpt-1.5-image-high | 3,000 |
    | nano-banana, imagen-4, sora-2, flux-2, veo3-1, v1-pro-fast-i2v, grok (image/video), p-image-edit | 5,000 |
    | nano-banana-pro (all variants) | 20,000 |
    | nano-banana-2 (all variants) | 20,000 |

    Others may have no documented limit or use server defaults.

    Input file (image and video) limits:

    For image-to-video and image-to-image, source_media_urls is a list of URLs. Most models accept images only (JPEG, PNG, WebP, typically 10 MB max per file). Some models also accept video inputs; when they do, format and size limits apply (e.g. MP4, max duration).

    | Model(s) | Input type | Limit | Notes |
    | --- | --- | --- | --- |
    | kling-2-6-motion-control-720p, kling-2-6-motion-control-1080p | Image + video | 1 image + 1 video | Motion Control: reference video drives motion. Video max 30 s; video file typically up to 100 MB (MP4/WebM). |
    | kling-3-0-motion-control-720p, kling-3-0-motion-control-1080p | Image + video | 1 image + 1 video | Kling 3.0 Motion Control: same as Kling 2.6. Per-second billing (see Motion Control rates below). Video max 30 s; video file typically up to 100 MB (MP4/WebM). |
    | kling-2-6-image-to-video, sora-2 (image-to-video), wan-2-5 (image-to-video), grok-image-to-video, v1-pro-fast-i2v | Images only | 1 image | Exactly one input image. |
    | kling-2-5-image-to-video-pro | Images only | 2 images | Start frame and end frame. |
    | seedance-1-5-pro | Images only | Mode-dependent | Text-to-video (generation_type: "text-to-video"): 0–1 images optional. Image-to-video (generation_type: "image-to-video"): exactly 2 images required (start + end frame). |
    | seedance-2 (Fast + Standard) | Image + video + audio | Mode-dependent | Text-to-video: up to 9 images, 3 videos (combined ≤ 15s), and 3 audio clips (combined ≤ 15s), all optional. Image-to-video: 1 required image (first frame) + optional end frame (2 images max) + up to 3 audio clips; no reference videos allowed in this mode. Pass every URL (image, video, audio) in source_media_urls; the backend classifies by file extension (.jpg/.png/.webp → image, .mp4/.mov/.webm → video, .mp3/.wav/.m4a → audio) and routes each to the right bucket. |
    | kling-3-0-std, kling-3-0-pro | Images only | 1–2 images | Start frame, or start + end frame. PNG/JPG/JPEG. Supports elements (see below). |
    | seedream-v4-edit | Images only | 10 | For editing. |
    | 5-lite-text-to-image, 5-lite-image-to-image | Images only | 10 | For editing (image-to-image). |
    | nano-banana, nano-banana-edit | Images only | 10 | |
    | nano-banana-pro (all variants) | Images only | 8 | |
    | nano-banana-2 (all variants) | Images only | 8 | |
    | p-image-edit | Images only | 1–8 | Pruna AI P Image Edit. Image-to-image only: set generation_type: "image-to-image". Pass 1–8 URLs in source_media_urls. aspect_ratio: auto matches the first input image, or use 1:1, 16:9, 9:16, 4:3, 3:4, 3:2, 2:3. Optional turbo (default on). Default disable_safety_checker: true (moderation off); set disable_safety_checker: false to enable the safety checker. Optional seed. |
    | flux-2-edit (image-to-image) | Images only | 8 | |
    | gpt-1.5-image (image-to-image) | Images only | 16 | |
    | veo3-1 (image-to-video / reference modes) | Images only | 1–3 | Depends on mode (1: optional text-to-video reference; 2: first + last frame; 3: reference images). |
    | sora-2-pro-storyboard | Images only | 1 | Optional. |

    Use get_models to confirm input_media_types and capabilities for a given model. See Account tools for model list and pricing.

    Kling 3.0 – elements (optional):

    Elements let you reference images or videos in your prompt using @element_name. Pass kling_elements as an array of objects, each with a name, a description, and either element_input_urls (2–4 image URLs) or element_input_video_urls (1 video URL). Reference images for elements come from each element's element_input_urls; the main image_urls may be empty for text-to-video, or hold optional start/end frames for image-to-video. Image elements: JPG/PNG, min 300×300 px, max 10 MB each. Video elements: MP4/MOV, max 50 MB.

    Seedance 1.5 Pro – two modes (check generation_type before using images):

    | Mode | generation_type | source_media_urls | Can use images? |
    | --- | --- | --- | --- |
    | Text-to-video | "text-to-video" | Empty or 1 URL | Optional: 0–1 images. Omit for text-only; include 1 URL to animate that image. |
    | Image-to-video | "image-to-video" | Exactly 2 URLs | Required: exactly 2 images (start frame + end frame). |
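    The two modes above are easy to get wrong, so a pre-flight check is worth having. A minimal sketch; the function name and boolean return are illustrative, but the counts encode the table above.

```python
def check_seedance_1_5_pro_inputs(generation_type, source_media_urls):
    """True if the input count is valid for seedance-1-5-pro in this mode."""
    n = len(source_media_urls)
    if generation_type == "text-to-video":
        return n <= 1           # 0 images = pure text; 1 image = animate it
    if generation_type == "image-to-video":
        return n == 2           # start frame + end frame, both required
    return False                # other generation types are not this model's modes
```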

    Seedance 2 – two tiers, two modes, mixed references:

    ByteDance Seedance 2 ships as two separate model families: seedance-2-fast (cheaper, faster) and seedance-2 (standard, higher quality). Each family exposes concrete variant model_ids per resolution and reference-video combo; pass the full variant (e.g. seedance-2-fast-480p, seedance-2-720p-video-ref), since a bare family label returns a variant_required error listing the options. seedance-2-fast offers 480p and 720p; seedance-2 also offers 1080p. Duration is an integer 4–15 seconds. Supported aspect ratios: 1:1, 4:3, 3:4, 16:9, 9:16, 21:9, adaptive. Prompt max 2,500 characters. Audio is a free toggle via generate_audio (defaults to true); unlike Kling 3.0 there is no surcharge for audio.

    | Mode | generation_type | Required inputs | Optional references | Not allowed |
    | --- | --- | --- | --- | --- |
    | Text-to-video | "text-to-video" | None (prompt only) | Up to 9 images, up to 3 videos (combined ≤ 15s), up to 3 audio clips (combined ≤ 15s) | (none) |
    | Image-to-video | "image-to-video" | 1 image (first frame) | Optional 2nd image (last frame); up to 3 audio clips (combined ≤ 15s) | Reference videos (rejected with a clear error) |

    Put every reference URL (images, videos, audio) into source_media_urls. The backend classifies each URL by file extension — .jpg / .png / .webp → image, .mp4 / .mov / .webm → video, .mp3 / .wav / .m4a → audio — and routes it to the correct bucket automatically.
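    The extension-based routing can be mirrored client-side to sanity-check a reference list before submitting. A minimal sketch; the function name and bucket dict are illustrative, and the extension sets are exactly the ones documented above.

```python
import os
from urllib.parse import urlparse

IMAGE_EXTS = {".jpg", ".png", ".webp"}
VIDEO_EXTS = {".mp4", ".mov", ".webm"}
AUDIO_EXTS = {".mp3", ".wav", ".m4a"}

def classify_reference_urls(urls):
    """Bucket each URL the way the backend does: by file extension of the path."""
    buckets = {"image": [], "video": [], "audio": [], "unknown": []}
    for url in urls:
        # Use only the URL path so query strings don't confuse the extension.
        ext = os.path.splitext(urlparse(url).path)[1].lower()
        if ext in IMAGE_EXTS:
            buckets["image"].append(url)
        elif ext in VIDEO_EXTS:
            buckets["video"].append(url)
        elif ext in AUDIO_EXTS:
            buckets["audio"].append(url)
        else:
            buckets["unknown"].append(url)
    return buckets
```

    Anything landing in the "unknown" bucket is a hint to rename or re-host the file with a recognized extension before calling generate_media.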

    Billing (two paths):

    • Without a reference video: credits = output_seconds × base_credits_per_second
    • With a reference video (upstream provider path): credits = (ref_video_seconds + output_seconds) × base_credits_per_second, where ref_video_seconds is the sum of all reference video durations (capped at 15s per request).
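    The two billing paths reduce to one formula once the reference-video seconds are capped. A minimal sketch; the function name is illustrative, and the rates in the comment come from the catalog table below.

```python
def seedance_2_credits(rate_per_second, output_seconds, ref_video_seconds=0):
    """Credits for one Seedance 2 request, per the two billing paths above.

    ref_video_seconds is the summed duration of all reference videos;
    the billable amount is capped at 15 s per request.
    """
    billable_ref = min(ref_video_seconds, 15)
    return (billable_ref + output_seconds) * rate_per_second

# E.g. seedance-2-480p (18 credits/s, no ref video) for 8 s of output,
# or seedance-2-480p-video-ref (13 credits/s) with a 7 s reference video.
```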

    Live per-second rates (pulled from the ai_models_config catalog — always current):

    | Model | Name | Rate | Unit |
    | --- | --- | --- | --- |
    | seedance-2-1080p | Seedance 2 (1080p) | 93 | credits / sec |
    | seedance-2-1080p-video-ref | Seedance 2 (1080p, video ref) | 65 | credits / sec |
    | seedance-2-480p | Seedance 2 (480p) | 18 | credits / sec |
    | seedance-2-480p-video-ref | Seedance 2 (480p, video ref) | 13 | credits / sec |
    | seedance-2-720p | Seedance 2 (720p) | 40 | credits / sec |
    | seedance-2-720p-video-ref | Seedance 2 (720p, video ref) | 29 | credits / sec |
    | seedance-2-fast-480p | Seedance 2.0 Fast | 16 | credits / sec |
    | seedance-2-fast-480p-video-ref | Seedance 2.0 Fast (video ref) | 12 | credits / sec |
    | seedance-2-fast-720p | Seedance 2.0 Fast | 34 | credits / sec |
    | seedance-2-fast-720p-video-ref | Seedance 2.0 Fast (video ref) | 24 | credits / sec |

    MCP / REST API note: the backend cannot probe remote video durations from a URL, so reference-video requests from the MCP and REST API are billed the pessimistic worst case of 15s of reference video. The web UI probes each video locally and bills the exact sum. For cost-sensitive workflows with short reference videos, prefer the web UI.

    Hard limits (enforced with 400 errors):

    • More than 3 videos → too_many_videos
    • More than 3 audio clips → too_many_audios
    • More than 9 images → too_many_images
    • Combined reference video duration > 15s → rejected
    • Combined reference audio duration > 15s → rejected
    • Any single reference video or audio file longer than 15s → rejected
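    These limits can be enforced before submitting. A minimal sketch; the function name is illustrative, and the last two error strings are hypothetical placeholders (the docs give explicit codes only for the count limits, and just say the duration violations are rejected).

```python
def validate_seedance_2_references(video_durations_s, audio_durations_s, image_count):
    """Return the first applicable error code, or None if the request is within limits.

    video_durations_s / audio_durations_s are lists of per-file durations in seconds.
    """
    if len(video_durations_s) > 3:
        return "too_many_videos"
    if len(audio_durations_s) > 3:
        return "too_many_audios"
    if image_count > 9:
        return "too_many_images"
    if any(d > 15 for d in video_durations_s + audio_durations_s):
        return "reference_too_long"          # hypothetical code; docs say "rejected"
    if sum(video_durations_s) > 15 or sum(audio_durations_s) > 15:
        return "combined_duration_exceeded"  # hypothetical code; docs say "rejected"
    return None
```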

    Family-name disambiguation:

    Some model families expose multiple concrete variants per resolution / quality / reference-video combo: seedance-2, seedance-2-fast, p-video, kling-3-0-motion-control, kling-2-6-motion-control. Passing just the family label as model returns:

    {
      "error": "variant_required",
      "family": "seedance-2-fast",
      "available_variants": [
        "seedance-2-fast-480p",
        "seedance-2-fast-720p",
        "seedance-2-fast-480p-video-ref",
        "seedance-2-fast-720p-video-ref"
      ],
      "hint": "Ask the user which variant they want — resolution, draft vs standard, or with/without reference video. Pull pricing from get_models."
    }
    

    No credits are deducted. Re-call generate_media with a concrete model_id from available_variants (or from get_models).
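    A client can handle this error generically for every family. A minimal sketch; `resolve_variant` and the `pick` callback are illustrative names, but the response fields match the example above.

```python
def resolve_variant(response, pick):
    """If generate_media returned variant_required, choose a concrete model_id.

    pick(variants) is your selection policy: ask the user, or match a
    substring like "480p". Returns None when no resolution is needed.
    """
    if response.get("error") != "variant_required":
        return None                                  # not a variant error
    variants = response["available_variants"]
    choice = pick(variants)
    if choice not in variants:
        raise ValueError(f"{choice!r} is not one of {variants}")
    return choice                                    # re-call generate_media with this
```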

    Video audio (two different concepts):

    1. capabilities.video_audio in get_models — how to know if the output has sound:
      • included — The model’s output normally includes an audio track without using a sound parameter (e.g. Veo, Sora, Wan, Grok, Kling 2.5 image-to-video, Motion Control).
      • toggle_via_sound_param — Generated audio is turned on/off with sound: true / false (Kling 2.6, Kling 3.0, Seedance 1.5 Pro). Pricing may differ for audio vs silent. Kling 3.0 routes sound: true to dedicated -audio rows in the catalog: keep using kling-3-0-std / kling-3-0-pro as the model id and just toggle sound; the server picks the right priced row. Live rates:

    | Model | Name | Rate | Unit |
    | --- | --- | --- | --- |
    | kling-3-0-pro | Kling 3.0 Pro | 21 | credits / sec |
    | kling-3-0-pro-audio | Kling 3.0 Pro (with audio) | 30 | credits / sec |
    | kling-3-0-std | Kling 3.0 | 17 | credits / sec |
    | kling-3-0-std-audio | Kling 3.0 Std (with audio) | 23 | credits / sec |

    Motion Control per-second rates (Kling 3.0 and Kling 2.6): both families share the same pricing shape, with 720p and 1080p variants billed per second; pull the current rates from get_models. Seedance 2 also exposes a sound toggle (via its own generate_audio flag, default true), but audio is free: there is no surcharge vs silent.

      • silent — No generated audio (Seedance 1.0 / v1-pro-fast-i2v only).
    2. supports_sound — only means the API accepts a sound toggle for that model. It does not mean other models are silent; most video models use video_audio: included instead.

    Image-only models ignore sound.
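    The two concepts combine into one decision rule. A minimal sketch; the function name and None-for-unknown convention are illustrative, but the three capability values are the documented ones.

```python
def output_has_audio(video_audio, sound_param=False):
    """Interpret capabilities.video_audio from get_models for a planned request."""
    if video_audio == "included":
        return True                  # audio ships with output, no sound param needed
    if video_audio == "toggle_via_sound_param":
        return bool(sound_param)     # caller controls it via sound: true / false
    if video_audio == "silent":
        return False                 # model never generates audio
    return None                      # unrecognized capability value: don't guess
```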

    #REST API: URLs for local or browser files

    If you use the HTTP API (POST /v1/generate/media) and your inputs are files on disk or selected in the browser—not already public URLs—upload each file first with POST /v1/upload/media. Pass the returned urls values as source_media_urls.


    #get_generation_status

    Check the status of a media generation and get output URLs when done.

    Parameters:

    ParameterTypeRequiredDescription
    generation_idstringYesID returned from generate_media.

    Response: Includes status (pending, queued, processing, completed, failed), progress, and when completed an outputs array with url, thumbnail_url, optimized_url, media_type, dimensions, etc.


    #get_generation_estimate

    Get a parameter-specific estimated processing time for a given model and options (no job is started). For a per-model estimated duration in one call, use get_models; each model includes estimated_time_seconds. Use get_generation_estimate when you need an estimate that depends on prompt length, duration, or other parameters.

    Parameters:

    ParameterTypeRequiredDescription
    modelstringYesModel ID.
    generation_typestringNoSame as in generate_media. Default: text-to-image.
    promptstringNoOptional; can affect estimate.
    negative_promptstringNoOptional.
    parametersobjectNoOptional extra parameters.

    Response: Estimated time (and optionally confidence/sample size) so you can set user expectations before calling generate_media.


    #Model rules

    • Text-to-image and text-to-video: Do not send source_media_urls (unless the model supports an optional reference image). Exception: seedance-1-5-pro in text-to-video mode accepts 0–1 optional images.
    • Image-to-video and image-to-image: Send image (and when supported, video) URL(s) in source_media_urls. Most models need images only; some (e.g. Kling 2.6 Motion Control) require 1 image + 1 video. seedance-1-5-pro in image-to-video mode requires exactly 2 images (start + end frame). Respect each model’s input limits above.
    • Video audio: Use capabilities.video_audio from get_models. included — audio is part of typical output without a sound parameter (e.g. Veo, Sora, Wan, Grok). toggle_via_sound_param — use sound only when supports_sound is true (Kling 2.6, Kling 3.0, Seedance 1.5 Pro). silent — no generated audio (Seedance 1.0 only). Do not infer “no audio” from supports_sound: false alone.
    • Use get_models to see which models support which generation types, input_media_types (e.g. image, video), and required input counts.

    See Limitations for rate limits and credits. For a single table of API defaults (prompt max, inputs, duration, flags), see REST API model requirements.