Media tools

    Generate images and videos using 40+ AI models. Always call get_models first to see available models, costs, and whether a model needs an input image.

    REST HTTP clients: the same limits are documented in one place in REST API model requirements (and returned per model from GET /v1/models).

    #generate_media

    Starts an image or video generation.

    Parameters:

    | Parameter | Type | Required | Description |
    |---|---|---|---|
    | prompt | string | Yes | What to generate (e.g. “A red car on a mountain road”). |
    | model | string | Yes | Model ID (from get_models). Examples: nano-banana, kling-2-6-image-to-video. |
    | generation_type | string | No | text-to-image, text-to-video, image-to-video, or image-to-image. Default: text-to-image. |
    | negative_prompt | string | No | What to avoid in the output. |
    | source_media_urls | string or array | No | Required for image-to-video and image-to-image. URL(s) to image(s), or for some models (e.g. Kling 2.6 Motion) image + video. See input limits below. Omit for text-to-image and text-to-video. |
    | aspect_ratio | string | No | e.g. 1:1, 16:9, 9:16, 4:5, 21:9. Default: 1:1. Note: each model only accepts a subset; get_models returns the allowed list. |
    | duration | string | No | Video length. Only certain video models use this. See below. |
    | quality | string | No | e.g. fast, standard, pro, ultra. Default: standard. |
    | resolution | string | No | Output resolution tier. Only certain image models use this: gpt-image-2 (1K/2K/4K), nano-banana-pro/nano-banana-2 (1K/2K/4K), flux-2 (1K/2K). Each tier is a separate pricing SKU; get_models returns the per-tier credit cost. Ignored by models where resolution is encoded in the variant model_id (Seedance, Kling, P-Video). See the Resolution tiers table below for per-model constraints. |
    | sound | boolean | No | When true, request video with generated audio. Only certain video models use this. Default: false. See below. |
    | seed | number | No | Seed for reproducible results. |

    Example (text-to-image):

    {
      "prompt": "A futuristic city at sunset with flying cars",
      "model": "nano-banana",
      "generation_type": "text-to-image",
      "aspect_ratio": "16:9",
      "quality": "pro"
    }
    

    Example (image-to-video, one input image required):

    {
      "prompt": "Gentle motion and subtle movement",
      "model": "kling-2-6-image-to-video",
      "generation_type": "image-to-video",
      "source_media_urls": ["https://example.com/your-image.jpg"],
      "aspect_ratio": "16:9",
      "duration": "5s"
    }
    

    Response: Includes generation_id, status (e.g. pending), and often estimated_time_seconds and estimated_cost_credits. Poll with get_generation_status until status is completed or failed.
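
    The polling step above can be sketched as a small loop. This is a minimal illustration, not the official client: fetch_status is a hypothetical callable standing in for however you invoke get_generation_status, and the interval and timeout defaults are arbitrary assumptions.

```python
import time
from typing import Callable, Dict

# Statuses that end polling, per the response description above.
TERMINAL_STATUSES = {"completed", "failed"}


def wait_for_generation(
    generation_id: str,
    fetch_status: Callable[[str], Dict],  # hypothetical wrapper around get_generation_status
    interval_s: float = 5.0,              # assumed polling interval
    timeout_s: float = 600.0,             # assumed overall budget
    sleep: Callable[[float], None] = time.sleep,
) -> Dict:
    """Poll fetch_status until the job is completed or failed."""
    waited = 0.0
    while True:
        result = fetch_status(generation_id)
        if result.get("status") in TERMINAL_STATUSES:
            return result
        if waited >= timeout_s:
            raise TimeoutError(
                f"{generation_id} still {result.get('status')!r} after {waited:.0f}s"
            )
        sleep(interval_s)
        waited += interval_s
```

    If the response carries estimated_time_seconds, it is a reasonable basis for choosing interval_s and timeout_s.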

    Models that support duration:

    | Model(s) | Supported values | Notes |
    |---|---|---|
    | kling-2-6-text-to-video, kling-2-6-image-to-video | 5s, 10s | Optional with/without audio (model variant). |
    | wan-2-5 (text-to-video, image-to-video) | 5s, 10s |  |
    | wan-2-7-720p, wan-2-7-1080p (text-to-video, image-to-video) | 2s–15s (integer, per second) | PER-SECOND pricing; resolution is encoded in the model_id. Pass duration (2–15) + generation_type. |
    | grok-imagine-video-1-5-preview-480p, grok-imagine-video-1-5-preview-720p (image-to-video) | 2s–15s (integer, per second) | PER-SECOND pricing; resolution is encoded in the model_id. Image-to-video only; requires exactly 1 image URL in source_media_urls. Pass duration (2–15); optional aspect_ratio (auto by default). |
    | v1-pro-fast-i2v | 5s, 10s |  |
    | seedance-1-5-pro | 4s, 8s, 12s | Supports both text-to-video (0–1 image optional) and image-to-video (2 images required). |
    | seedance-2 (Standard) / seedance-2-fast (Fast) | 4s–15s (integer) | ByteDance Seedance 2. The tier is the model family itself: seedance-2-fast for the cheaper/faster tier, seedance-2 for the higher-quality tier. Each tier has concrete variant model_ids per resolution / reference-video combo (e.g. seedance-2-fast-480p, seedance-2-480p-video-ref). Pass the full variant id to generate_media; a family label alone returns a variant_required error with the choices. Multimodal text-to-video references (up to 9 images + 3 videos + 3 audios); image-to-video takes 1 required image + optional end frame + up to 3 reference audios. With a reference video the billing formula becomes credits = (ref_video_s + output_s) × rate/s. |
    | gemini-omni-video (Google) | 4, 6, 8, 10 (baked into the variant id) | Discrete durations only: the duration parameter is ignored; resolution is also encoded in the variant id (gemini-omni-video-720p-6s, gemini-omni-video-1080p-6s, gemini-omni-video-4k-10s, etc.). Three video-ref variants (gemini-omni-video-720p-video-ref, gemini-omni-video-1080p-video-ref, gemini-omni-video-4k-video-ref) take their output duration from the trimmed source clip (≤ 10s). Passing a bare gemini-omni-video family label returns a variant_required error with the 15 choices. |
    | grok-text-to-video-6s | Fixed 6s | Duration parameter ignored. |
    | kling-3-0-std, kling-3-0-pro | 3s–15s | Single-shot mode. Max 2500 chars; supports @element_name references. |
    | grok-image-to-video, kling-2-5-image-to-video-pro, veo3-1 | Not configurable | Duration not set via this parameter. |

    For image-only models, duration is ignored.
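
    As a rough illustration of the duration rules above, the sketch below checks a requested duration before calling generate_media. The table is a hand-copied subset of the rows listed here, the helper names are hypothetical, and treating the 3s–15s Kling 3.0 range as whole seconds is an assumption; get_models remains the source of truth.

```python
from typing import Optional, Union

# Allowed durations in seconds, copied from the table above (subset only).
ALLOWED_SECONDS = {
    "kling-2-6-text-to-video": {5, 10},
    "kling-2-6-image-to-video": {5, 10},
    "wan-2-5": {5, 10},
    "v1-pro-fast-i2v": {5, 10},
    "seedance-1-5-pro": {4, 8, 12},
    "wan-2-7-720p": set(range(2, 16)),    # 2-15, integer
    "wan-2-7-1080p": set(range(2, 16)),
    "kling-3-0-std": set(range(3, 16)),   # 3-15, assumed integer
    "kling-3-0-pro": set(range(3, 16)),
}


def duration_ok(model: str, duration: Union[str, int]) -> Optional[bool]:
    """True/False for models in the table; None when the table has no rule.

    None covers models that ignore duration or encode it in the variant id.
    """
    allowed = ALLOWED_SECONDS.get(model)
    if allowed is None:
        return None
    try:
        seconds = int(str(duration).strip().lower().rstrip("s"))  # "5s" -> 5
    except ValueError:
        return False
    return seconds in allowed
```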

    Models that support negative_prompt:

    | Model(s) | Notes |
    |---|---|
    | imagen-4, imagen-4-fast, imagen-4-ultra | Text-to-image. |
    | wan-2-5 (text-to-video, image-to-video) |  |
    | wan-2-7-720p, wan-2-7-1080p (text-to-video, image-to-video) | Up to 500 chars. |
    | kling-2-5-image-to-video-pro |  |

    All other models ignore negative_prompt.

    Models that support quality (or equivalent):

    | Model(s) | How it works | Values |
    |---|---|---|
    | imagen-4 variants | Mapped to model_variant. | standard, fast, ultra (use quality: standard / fast / ultra). |
    | seedream-v4, seedream-v4-edit | Resolution via quality param. | 1K (default), 2K, 4K. |
    | seedream-v4-5, seedream-v4-5-edit | Uses quality directly. | basic (2K, default), high (4K). |
    | 5-lite-text-to-image, 5-lite-image-to-image | Uses quality directly. | basic (2K, default), high (4K). |
    | veo3-1 vs veo3-1-fast | Different model IDs, not a single quality param. | Use model veo3-1 (quality) or veo3-1-fast (speed). |
    | flux-2, nano-banana-pro, nano-banana-2 | Resolution (1K/2K/4K), not a generic “quality” string. | Pass via the dedicated resolution parameter; see below. |
    | gpt-image-2 (t2i + i2i) | Resolution via the resolution parameter. | See Resolution tiers below. |

    For other models, quality is ignored.

    <a id="resolution-tiers"></a>

    Resolution tiers (the resolution parameter):

    | Model | Values | Pricing | Constraint |
    |---|---|---|---|
    | gpt-image-2 (t2i + i2i) | 1K (default), 2K, 4K | 11 / 15 / 21 credits | 2K and 4K require an explicit non-square, non-auto aspect_ratio: one of 9:16, 16:9, 4:3, 3:4. Passing auto or 1:1 at 2K/4K returns HTTP 400 with error: "aspect_ratio_incompatible_with_high_res" and no credits are held. 1K accepts every supported aspect including auto/1:1. |
    | nano-banana-2 | 1K (default), 2K, 4K | See get_models | Each tier is a separate pricing SKU. Aspect ratio list unchanged across tiers. |
    | nano-banana-pro | 1K (default), 2K, 4K | See get_models | Same pattern as nano-banana-2. |
    | flux-2, flux-2-edit | 1K (default), 2K | See get_models | Two tiers only. |

    When to pick which tier (GPT Image 2):

    • 1K: the default. Use for social posts, thumbnails, concepting, in-app previews, anything ≤ 1024 × 1024. Cheapest; no aspect-ratio gotchas.
    • 2K: use when the client needs a crisp web hero, newsletter cover, or in-product illustration at retina density. You must pick a directional aspect (landscape or portrait).
    • 4K: use for print, out-of-home, banners, or any case where the user explicitly asks for the highest output size. Confirm the aspect ratio with the user first; 1:1 and auto are rejected at this tier.
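
    The GPT Image 2 constraint can be checked client-side before spending a request. The sketch below is an assumption-labelled illustration: the function name and the unknown_resolution label are hypothetical, while the credit figures and the aspect_ratio_incompatible_with_high_res code come from the table above.

```python
from typing import Optional, Tuple

# Aspect ratios accepted at 2K and 4K, per the Resolution tiers table.
HIGH_RES_ASPECTS = {"9:16", "16:9", "4:3", "3:4"}
CREDITS_BY_TIER = {"1K": 11, "2K": 15, "4K": 21}


def check_gpt_image_2(
    resolution: str = "1K", aspect_ratio: str = "1:1"
) -> Tuple[Optional[int], Optional[str]]:
    """Return (credits, None) when accepted, or (None, error_code) when rejected."""
    if resolution not in CREDITS_BY_TIER:
        return None, "unknown_resolution"  # hypothetical label, not from the docs
    if resolution in ("2K", "4K") and aspect_ratio not in HIGH_RES_ASPECTS:
        return None, "aspect_ratio_incompatible_with_high_res"
    # 1K accepts every supported aspect; this sketch does not validate that list.
    return CREDITS_BY_TIER[resolution], None
```

    Because aspect_ratio defaults to 1:1, a 2K or 4K request that omits it would fail this check, so the aspect ratio has to be set explicitly at those tiers.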

    Models not listed ignore resolution. For video families (Seedance, Kling, P-Video, Gemini Omni) the resolution is part of the concrete variant model_id — pass the variant (e.g. seedance-2-fast-480p, p-video-1080p, gemini-omni-video-1080p-6s), not this param.

    Prompt character limits:

    Some models enforce a maximum prompt length. Exceeding it may return an error or truncation.

    | Model(s) | Max characters |
    |---|---|
    | wan-2-5 | 800 |
    | wan-2-7-720p, wan-2-7-1080p | 5,000 |
    | kling-2-6 (text-to-video, image-to-video) | 2,500 |
    | seedance-2 (Fast + Standard) | 2,500 |
    | kling-3-0-std, kling-3-0-pro | 2,500 |
    | kling-2-5-image-to-video-pro | 2,500 |
    | seedream-v4, seedream-v4-edit | 2,500 |
    | seedream-v4-5, seedream-v4-5-edit | 3,000 |
    | 5-lite-text-to-image, 5-lite-image-to-image | 2,995 |
    | gpt-1.5-image-medium, gpt-1.5-image-high | 3,000 |
    | nano-banana, imagen-4, flux-2, veo3-1, v1-pro-fast-i2v, grok (image/video), p-image-edit | 5,000 |
    | nano-banana-pro (all variants) | 20,000 |
    | nano-banana-2 (all variants) | 20,000 |

    Others may have no documented limit or use server defaults.
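
    A client can guard against over-long prompts with a lookup like the one below. This is only a sketch: the dictionary is a hand-copied subset of the table above, prompt_overflow is a hypothetical helper, and counting with len() assumes the server counts characters the same way Python does.

```python
# Maximum prompt length per model id (subset of the table above).
PROMPT_LIMITS = {
    "wan-2-5": 800,
    "wan-2-7-720p": 5000,
    "wan-2-7-1080p": 5000,
    "kling-3-0-std": 2500,
    "kling-3-0-pro": 2500,
    "seedream-v4": 2500,
    "seedream-v4-5": 3000,
    "nano-banana": 5000,
    "nano-banana-pro": 20000,
    "nano-banana-2": 20000,
}


def prompt_overflow(model: str, prompt: str) -> int:
    """Return how many characters the prompt exceeds the limit by.

    Returns 0 when the prompt fits or when the model has no entry here.
    """
    limit = PROMPT_LIMITS.get(model)
    if limit is None:
        return 0
    return max(0, len(prompt) - limit)
```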

    Input file (image and video) limits:

    For image-to-video and image-to-image, source_media_urls is a list of URLs. Most models accept images only (JPEG, PNG, WebP, typically 10 MB max per file). Some models also accept video inputs; when they do, format and size limits apply (e.g. MP4, max duration).

    | Model(s) | Input type | Limit | Notes |
    |---|---|---|---|
    | kling-2-6-motion-control-720p, kling-2-6-motion-control-1080p | Image + video | 1 image + 1 video | Motion Control: reference video drives motion. Video max 30 s; video file typically up to 100 MB (MP4/WebM). |
    | kling-3-0-motion-control-720p, kling-3-0-motion-control-1080p | Image + video | 1 image + 1 video | Kling 3.0 Motion Control: same as Kling 2.6. Per-second billing; see the Motion Control pricing table below. Video max 30 s; video file typically up to 100 MB (MP4/WebM). |
    | kling-2-6-image-to-video, wan-2-5 (image-to-video), grok-image-to-video, v1-pro-fast-i2v | Images only | 1 image | Exactly one input image. |
    | wan-2-7-720p, wan-2-7-1080p (image-to-video) | Images only | 1 image | First frame in source_media_urls; optional last_frame_url for a first+last-frame transition. |
    | grok-imagine-video-1-5-preview-480p, grok-imagine-video-1-5-preview-720p | Images only | 1 image | Exactly one input image (the starting frame). Aspect ratio: auto (default, derives from the image), 1:1, 16:9, 9:16, 4:3, 3:4, 3:2, 2:3. Output includes audio. |
    | kling-2-5-image-to-video-pro | Images only | 2 images | Start frame and end frame. |
    | seedance-1-5-pro | Images only | Mode-dependent | Text-to-video (generation_type: "text-to-video"): 0–1 images optional. Image-to-video (generation_type: "image-to-video"): exactly 2 images required (start + end frame). |
    | seedance-2 (Fast + Standard) | Image + video + audio | Mode-dependent | Text-to-video: up to 9 images, 3 videos (combined ≤ 15s), and 3 audio clips (combined ≤ 15s), all optional. Image-to-video: 1 required image (first frame) + optional end frame (2 images max) + up to 3 audio clips; no reference videos allowed in this mode. Pass every URL (image, video, audio) in source_media_urls; the backend classifies by file extension (.jpg/.png/.webp → image, .mp4/.mov/.webm → video, .mp3/.wav/.m4a → audio) and routes each to the right bucket. |
    | gemini-omni-video (all variants) | Images + video | 7 reference slots | A single reference video consumes 2 slots; images fill the rest. Maximum 1 video reference per request; by default the video's first 10 seconds are used as the driving clip (override with video_trim_start_s / video_trim_end_s). A video reference requires a -video-ref variant; calling a duration variant with a video URL returns variant_mismatch. Audio file URLs are not accepted; voice output is controlled by the voice_id parameter (one of 29 built-in voices). |
    | kling-3-0-std, kling-3-0-pro | Images only | 1–2 images | Start frame, or start + end frame. PNG/JPG/JPEG. Supports elements (see below). |
    | seedream-v4-edit | Images only | 10 | For editing. |
    | 5-lite-text-to-image, 5-lite-image-to-image | Images only | 10 | For editing (image-to-image). |
    | nano-banana, nano-banana-edit | Images only | 10 |  |
    | nano-banana-pro (all variants) | Images only | 8 |  |
    | nano-banana-2 (all variants) | Images only | 8 |  |
    | p-image-edit | Images only | 1–8 | Pruna AI P Image Edit. Image-to-image only; set generation_type: "image-to-image". Pass 1–8 URLs in source_media_urls. aspect_ratio: auto matches the first input image, or use 1:1, 16:9, 9:16, 4:3, 3:4, 3:2, 2:3. Optional turbo (default on). Default disable_safety_checker: true (moderation off); set disable_safety_checker: false to enable the safety checker. Optional seed. |
    | flux-2-edit (image-to-image) | Images only | 8 |  |
    | gpt-1.5-image (image-to-image) | Images only | 16 |  |
    | veo3-1 (image-to-video / reference modes) | Images only | 1–3 | Depends on mode (1 text-to-video optional ref; 2 first+last frame; 3 reference). |

    Use get_models to confirm input_media_types and capabilities for a given model. See Account tools for model list and pricing.

    Kling 3.0 – elements (optional):

    Elements let you reference images or videos in your prompt using @element_name. Pass kling_elements as an array of objects with name, description, and either element_input_urls (2–4 image URLs) or element_input_video_urls (1 video URL). Reference images for elements come from each element’s element_input_urls; main image_urls may be empty for text-to-video or hold optional start/end frames for image-to-video. Each element requires a title (name) and description. Image elements: JPG/PNG, min 300×300px, max 10MB each. Video elements: MP4/MOV, max 50MB.

    Seedance 1.5 Pro – two modes (check generation_type before using images):

    | Mode | generation_type | source_media_urls | Can use images? |
    |---|---|---|---|
    | Text-to-video | "text-to-video" | Empty or 1 URL | Optional: 0–1 images. Omit for text-only; include 1 URL to animate that image. |
    | Image-to-video | "image-to-video" | Exactly 2 URLs | Required: exactly 2 images (start frame + end frame). |

    Seedance 2 – two tiers, two modes, mixed references:

    ByteDance Seedance 2 ships as two separate model families: seedance-2-fast (cheaper, faster) and seedance-2 (standard, higher quality). Each family exposes concrete variant model_ids per resolution and reference-video combo; pass the full variant (e.g. seedance-2-fast-480p, seedance-2-720p-video-ref), because a bare family label returns a variant_required error listing the options. Resolutions are 480p or 720p; the live rate table below also lists 1080p variants for seedance-2, so confirm availability with get_models. Duration is an integer 4–15 seconds. Supported aspect ratios: 1:1, 4:3, 3:4, 16:9, 9:16, 21:9, adaptive. Prompt max 2,500 characters. Audio is a free toggle via generate_audio (defaults to true); unlike Kling 3.0 there is no surcharge for audio.

    | Mode | generation_type | Required inputs | Optional references | Not allowed |
    |---|---|---|---|---|
    | Text-to-video | "text-to-video" | None (prompt only) | Up to 9 images, up to 3 videos (combined ≤ 15s), up to 3 audio clips (combined ≤ 15s) |  |
    | Image-to-video | "image-to-video" | 1 image (first frame) | Optional 2nd image (last frame) · up to 3 audio clips (combined ≤ 15s) | Reference videos (rejected with a clear error) |

    Put every reference URL (images, videos, audio) into source_media_urls. The backend classifies each URL by file extension — .jpg / .png / .webp → image, .mp4 / .mov / .webm → video, .mp3 / .wav / .m4a → audio — and routes it to the correct bucket automatically.
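
    To preview how a mixed list of URLs would be bucketed, the extension rule above can be mirrored locally. This is a sketch of the described behaviour, not the backend's code: classify_reference is a hypothetical helper, it only knows the extensions listed above, and ignoring the query string is an assumption.

```python
import posixpath
from typing import Optional
from urllib.parse import urlparse

# Extension-to-bucket mapping, exactly as listed in the text above.
EXTENSION_BUCKETS = {
    ".jpg": "image", ".png": "image", ".webp": "image",
    ".mp4": "video", ".mov": "video", ".webm": "video",
    ".mp3": "audio", ".wav": "audio", ".m4a": "audio",
}


def classify_reference(url: str) -> Optional[str]:
    """Return 'image', 'video' or 'audio', or None for an unlisted extension."""
    path = urlparse(url).path                        # drops query string and fragment
    extension = posixpath.splitext(path)[1].lower()  # ".MP4" -> ".mp4"
    return EXTENSION_BUCKETS.get(extension)
```

    A None result means the URL would not match any bucket under the rule as documented, which is worth catching before the request is sent.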

    Billing (two paths):

    • Without a reference video: credits = output_seconds × base_credits_per_second
    • With a reference video (upstream provider path): credits = (ref_video_seconds + output_seconds) × base_credits_per_second, where ref_video_seconds is the sum of all reference video durations (capped at 15s per request).

    Live per-second rates (pulled from the ai_models_config catalog — always current):

    | Model | Name | Rate | Unit |
    |---|---|---|---|
    | seedance-2-1080p | Seedance 2 (1080p) | 75 | credits / sec |
    | seedance-2-1080p-video-ref | Seedance 2 (1080p, video ref) | 48 | credits / sec |
    | seedance-2-480p | Seedance 2 (480p) | 18 | credits / sec |
    | seedance-2-480p-video-ref | Seedance 2 (480p, video ref) | 13 | credits / sec |
    | seedance-2-720p | Seedance 2 (720p) | 35 | credits / sec |
    | seedance-2-720p-video-ref | Seedance 2 (720p, video ref) | 25 | credits / sec |
    | seedance-2-fast-480p | Seedance 2.0 Fast | 16 | credits / sec |
    | seedance-2-fast-480p-video-ref | Seedance 2.0 Fast (video ref) | 12 | credits / sec |
    | seedance-2-fast-720p | Seedance 2.0 Fast | 29 | credits / sec |
    | seedance-2-fast-720p-video-ref | Seedance 2.0 Fast (video ref) | 19 | credits / sec |

    MCP / REST API note: the backend cannot probe remote video durations from a URL, so reference-video requests from the MCP and REST API are billed at the pessimistic worst case of 15s of reference video. The web UI probes each video locally and bills the exact sum. For cost-sensitive workflows with short reference videos, prefer the web UI.
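
    The two billing paths and the 15s worst case can be worked through with a small calculator. It is a sketch under stated assumptions: the function and parameter names are hypothetical, the rates are passed in by the caller (for example from get_models), and probed=True stands for the web UI path where real clip durations are known.

```python
from typing import Optional, Sequence

MAX_REF_VIDEO_SECONDS = 15  # cap on billed reference video per request


def seedance_2_credits(
    output_seconds: float,
    rate_per_second: float,
    ref_video_seconds: Optional[Sequence[float]] = None,
    probed: bool = False,
) -> float:
    """Estimate credits for one Seedance 2 request.

    ref_video_seconds=None means no reference video. With a reference video,
    probed=False models MCP / REST billing (flat 15s worst case) and
    probed=True models the web UI (exact sum, capped at 15s).
    """
    if ref_video_seconds is None:
        return output_seconds * rate_per_second
    if probed:
        billed_ref = min(sum(ref_video_seconds), MAX_REF_VIDEO_SECONDS)
    else:
        billed_ref = MAX_REF_VIDEO_SECONDS
    return (billed_ref + output_seconds) * rate_per_second
```

    With the rates listed above, a 5s clip on seedance-2-fast-480p costs 5 × 16 = 80 credits. The same 5s output on seedance-2-fast-480p-video-ref with a 4s reference clip costs (15 + 5) × 12 = 240 credits through the API, versus (4 + 5) × 12 = 108 credits through the web UI.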

    Hard limits (enforced with 400 errors):

    • More than 3 videos → too_many_videos
    • More than 3 audios → too_many_audios
    • More than 9 images → too_many_images
    • Combined reference video duration > 15s → rejected
    • Combined reference audio duration > 15s → rejected
    • Any single reference video or audio file longer than 15s → rejected
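
    These limits can be pre-checked on the client when clip durations are known. In the sketch below only too_many_videos, too_many_audios and too_many_images are documented codes; the two duration labels are hypothetical placeholders, since the text above says only that such requests are rejected.

```python
from typing import List, Sequence

MAX_VIDEOS = 3
MAX_AUDIOS = 3
MAX_IMAGES = 9
MAX_COMBINED_SECONDS = 15


def seedance_2_limit_errors(
    image_count: int,
    video_seconds: Sequence[float] = (),
    audio_seconds: Sequence[float] = (),
) -> List[str]:
    """Return the limit violations for one set of Seedance 2 references."""
    errors = []
    if len(video_seconds) > MAX_VIDEOS:
        errors.append("too_many_videos")
    if len(audio_seconds) > MAX_AUDIOS:
        errors.append("too_many_audios")
    if image_count > MAX_IMAGES:
        errors.append("too_many_images")
    # A single clip longer than 15s necessarily exceeds the combined cap too.
    if sum(video_seconds) > MAX_COMBINED_SECONDS:
        errors.append("reference_video_too_long")  # hypothetical label
    if sum(audio_seconds) > MAX_COMBINED_SECONDS:
        errors.append("reference_audio_too_long")  # hypothetical label
    return errors
```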

    Gemini Omni Video — Google's multi-modal video model:

    Gemini Omni Video produces video output with built-in audio (29 named voices) and accepts multi-modal inputs (text, image references, an optional single video reference). The model ships as 15 concrete variants — pick the one that matches your target resolution and duration. 720p and 1080p are priced identically; 4K is a separate price tier.

    Variant ids (pass to generate_media as model):

    | Resolution | 4s | 6s | 8s | 10s | Video-ref (flat) |
    |---|---|---|---|---|---|
    | 720p | gemini-omni-video-720p-4s | gemini-omni-video-720p-6s | gemini-omni-video-720p-8s | gemini-omni-video-720p-10s | gemini-omni-video-720p-video-ref |
    | 1080p | gemini-omni-video-1080p-4s | gemini-omni-video-1080p-6s | gemini-omni-video-1080p-8s | gemini-omni-video-1080p-10s | gemini-omni-video-1080p-video-ref |
    | 4K | gemini-omni-video-4k-4s | gemini-omni-video-4k-6s | gemini-omni-video-4k-8s | gemini-omni-video-4k-10s | gemini-omni-video-4k-video-ref |

    Video-ref variants take their output duration from the trimmed source clip (the first 10s by default; override with video_trim_start_s / video_trim_end_s). Duration variants accept 1–7 image references; video-ref variants accept 1 source clip (consumes 2 slots) + up to 5 image references.
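
    The variant ids above follow a regular pattern, so they can be composed rather than typed by hand. The helper below is a hypothetical convenience that reproduces the ids in the table; it does not confirm that a given variant is currently enabled, which get_models does.

```python
from typing import Optional

RESOLUTIONS = ("720p", "1080p", "4k")
DURATIONS = (4, 6, 8, 10)


def gemini_omni_variant(
    resolution: str, duration_s: Optional[int] = None, video_ref: bool = False
) -> str:
    """Build a Gemini Omni Video variant id from resolution and duration."""
    res = resolution.lower()  # the table shows "4K" but the ids use "4k"
    if res not in RESOLUTIONS:
        raise ValueError(f"unsupported resolution: {resolution!r}")
    if video_ref:
        return f"gemini-omni-video-{res}-video-ref"
    if duration_s not in DURATIONS:
        raise ValueError(f"duration must be one of {DURATIONS}, got {duration_s!r}")
    return f"gemini-omni-video-{res}-{duration_s}s"


# All 15 ids: 3 resolutions x (4 durations + 1 video-ref variant).
ALL_VARIANTS = [
    gemini_omni_variant(res, d) for res in RESOLUTIONS for d in DURATIONS
] + [gemini_omni_variant(res, video_ref=True) for res in RESOLUTIONS]
```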

    Pricing is flat-per-task. Live rates (720p and 1080p share each row — Google bills them identically):

    | Model | Name | Rate | Unit |
    |---|---|---|---|
    | gemini-omni-video-4k-10s | Gemini Omni Video (4K) | 320 | credits |
    | gemini-omni-video-4k-4s | Gemini Omni Video (4K) | 230 | credits |
    | gemini-omni-video-4k-6s | Gemini Omni Video (4K) | 260 | credits |
    | gemini-omni-video-4k-8s | Gemini Omni Video (4K) | 290 | credits |
    | gemini-omni-video-4k-video-ref | Gemini Omni Video (video ref, 4K) | 380 | credits |
    | gemini-omni-video-hd-10s | Gemini Omni Video | 200 | credits |
    | gemini-omni-video-hd-4s | Gemini Omni Video | 110 | credits |
    | gemini-omni-video-hd-6s | Gemini Omni Video | 140 | credits |
    | gemini-omni-video-hd-8s | Gemini Omni Video | 170 | credits |
    | gemini-omni-video-hd-video-ref | Gemini Omni Video (video ref) | 260 | credits |

    Parameters specific to Gemini Omni:

    | Parameter | Type | Notes |
    |---|---|---|
    | resolution |  | Ignored; encoded in the variant id. |
    | duration |  | Ignored; encoded in the variant id. Video-ref variants take their duration from the trimmed source clip. |
    | aspect_ratio | string | Strictly 16:9 or 9:16. Any other value returns aspect_ratio_invalid_for_model. |
    | voice_id | string (optional) | One of 29 built-in voices (e.g. kore, puck, achernar, zephyr). Discover the full list via the list_gemini_omni_voices tool; unknown ids are rejected client-side before any credits are charged. Omit to use the model's default voice. |
    | character_ids | string[] (optional) | One or more saved characters from the user's library (created via manage_library(kind="character", action="create", ...), listed via list_gemini_omni_characters). Each consumes 1 of the 7 reference slots. Provider renders a consistent identity across multiple clips. |
    | video_trim_start_s | number (optional) | Trim window start (seconds) for the reference video. Default 0. |
    | video_trim_end_s | number (optional) | Trim window end (seconds) for the reference video. Default: min(source_duration, start+10) when the URL is in media_upload_metadata, else start+10. Provider hard limits: range ≤ 10s, ends ≤ 30s. |
    | seed | number (optional) | Reproducible runs. |
    | negative_prompt |  | Not supported. Silently stripped. |
    | sound |  | Not supported; audio is intrinsic to every output. |
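
    The default for video_trim_end_s and the provider limits can be illustrated with a short function. This is a sketch, assuming that "the URL is in media_upload_metadata" simply means the source duration is known; resolve_trim_window is a hypothetical name, and raising ValueError is this sketch's choice, not documented server behaviour.

```python
from typing import Optional, Tuple

MAX_WINDOW_S = 10.0  # provider limit: range <= 10s
MAX_END_S = 30.0     # provider limit: window ends <= 30s


def resolve_trim_window(
    start_s: float = 0.0,
    end_s: Optional[float] = None,
    source_duration_s: Optional[float] = None,
) -> Tuple[float, float]:
    """Apply the documented default for the window end, then check the limits."""
    if end_s is None:
        end_s = start_s + MAX_WINDOW_S
        if source_duration_s is not None:
            # Known source duration: min(source_duration, start + 10).
            end_s = min(source_duration_s, end_s)
    if start_s < 0 or end_s <= start_s:
        raise ValueError("window must satisfy 0 <= start < end")
    if end_s - start_s > MAX_WINDOW_S:
        raise ValueError("window is longer than 10s")
    if end_s > MAX_END_S:
        raise ValueError("window ends after the 30s mark")
    return start_s, end_s
```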

    Source media (source_media_urls): 7 reference slots total. A single video reference consumes 2 slots; each character_id consumes 1 slot; images fill the rest. Maximum 1 video per request; passing a video URL requires a -video-ref variant (a duration variant + video returns variant_mismatch). The driving clip window defaults to the first 10 seconds; override with video_trim_start_s / video_trim_end_s. Audio file URLs are not accepted (returns unsupported_audio_url); use voice_id instead.
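
    The slot accounting described above is easy to get wrong when characters, images and a video are combined, so a local check can help. The sketch below is illustrative only: the function name and the plain-text messages are hypothetical, while variant_mismatch and the slot weights come from the text above.

```python
from typing import List

TOTAL_SLOTS = 7
VIDEO_SLOT_COST = 2  # one reference video consumes 2 slots


def gemini_omni_reference_errors(
    model: str, image_count: int = 0, character_count: int = 0, video_count: int = 0
) -> List[str]:
    """Return problems with a Gemini Omni reference set; empty list if it fits."""
    errors = []
    if video_count > 1:
        errors.append("at most 1 video reference per request")
    if video_count and not model.endswith("-video-ref"):
        errors.append("variant_mismatch")
    used = image_count + character_count + VIDEO_SLOT_COST * video_count
    if used > TOTAL_SLOTS:
        errors.append(f"{used} reference slots used, {TOTAL_SLOTS} available")
    return errors
```

    For example, one video plus two characters leaves room for three images: 2 + 2 + 3 = 7 slots.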

    Companion tools (use these alongside Gemini Omni generations):

    | Tool | Purpose |
    |---|---|
    | list_gemini_omni_voices | Static catalog of all 29 voices (id, label, gender, character, preview_url to a ~7s mp3 sample). Free to call. |
    | play_gemini_omni_voice | Returns the preview_url for a single voice id; quicker than fetching the full catalog when you already know the voice. Surface the URL to the user so they can hear it before committing. |
    | list_gemini_omni_characters | List the user's saved characters. Each row has a character_id to pass to generate_media and an internal id for delete. |
    | manage_library(kind="character", action="create", ...) | Persist a new character from a reference image + description (+ optional character_name, voice_id). Returns the provider's opaque character_id. |
    | manage_library(kind="character", action="delete", character_id=...) | Local delete (the provider has no delete endpoint). Past generations that used this character keep their outputs. |
    | trim_video | Cut a long source video down to the ≤10s window Gemini Omni needs for -video-ref variants. Returns a hosted URL ready to use as a source_media_urls value. |

    Example request (1080p, 6s, vertical, reusing a saved character + chosen voice):

    {
      "model": "gemini-omni-video-1080p-6s",
      "prompt": "A barista pulls a perfect espresso shot, narrates the process",
      "aspect_ratio": "9:16",
      "voice_id": "kore",
      "character_ids": ["char_4f2a9b7c"],
      "source_media_urls": ["https://media.kubeez.com/cafe-shot.jpg"]
    }
    

    Family-name disambiguation:

    Some model families expose multiple concrete variants per resolution / quality / reference-video combo: seedance-2, seedance-2-fast, gemini-omni-video, p-video, kling-3-0-motion-control, kling-2-6-motion-control. Passing just the family label as model returns:

    {
      "error": "variant_required",
      "family": "seedance-2-fast",
      "available_variants": [
        "seedance-2-fast-480p",
        "seedance-2-fast-720p",
        "seedance-2-fast-480p-video-ref",
        "seedance-2-fast-720p-video-ref"
      ],
      "hint": "Ask the user which variant they want — resolution, draft vs standard, or with/without reference video. Pull pricing from get_models."
    }
    

    No credits are deducted. Re-call generate_media with a concrete model_id from available_variants (or from get_models).
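
    Once the user has stated a resolution and whether they want a reference video, the matching id can be picked out of available_variants. The helper below is a hypothetical sketch that relies on the Seedance-style naming shown in the example response; it is not a substitute for asking the user, as the hint field advises.

```python
from typing import Dict, Optional


def pick_variant(
    error_response: Dict, resolution: str, with_video_ref: bool = False
) -> Optional[str]:
    """Select a variant id from a variant_required error, or None if none fits."""
    if error_response.get("error") != "variant_required":
        raise ValueError("not a variant_required response")
    for variant in error_response.get("available_variants", []):
        has_ref = variant.endswith("-video-ref")
        # Match the resolution as a whole dash-separated token, e.g. "720p".
        if resolution in variant.split("-") and has_ref == with_video_ref:
            return variant
    return None
```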

    Video audio (two different concepts):

    1. capabilities.video_audio in get_models — how to know if the output has sound:
      • included — The model’s output normally includes an audio track without using a sound parameter (e.g. Veo, Wan, Grok, Kling 2.5 image-to-video, Motion Control, Gemini Omni Video — the latter exposes a voice_id parameter to pick from 29 built-in voices).
      • toggle_via_sound_param — Generated audio is turned on/off with sound: true / false (Kling 2.6, Kling 3.0, Seedance 1.5 Pro). Pricing may differ for audio vs silent. Kling 3.0 routes sound: true to dedicated -audio rows in the catalog — you keep using kling-3-0-std / kling-3-0-pro as the model id and just toggle sound, the server picks the right priced row. Live rates:
    | Model | Name | Rate | Unit |
    |---|---|---|---|
    | kling-3-0-pro | Kling 3.0 Pro | 21 | credits / sec |
    | kling-3-0-pro-audio | Kling 3.0 Pro (with audio) | 30 | credits / sec |
    | kling-3-0-std | Kling 3.0 | 17 | credits / sec |
    | kling-3-0-std-audio | Kling 3.0 Std (with audio) | 23 | credits / sec |

    Motion Control per-second rates (Kling 3.0 and Kling 2.6):

    Rates for the kling-3-0-motion-control and kling-2-6-motion-control families are not listed on this page; get_models returns the current per-second rate for each variant. The two families share the same pricing shape: 720p and 1080p variants.

    Seedance 2 also exposes a sound toggle (via its own generate_audio flag, default true), but audio is free: there is no surcharge vs silent.

    • silent — No generated audio (Seedance 1.0 / v1-pro-fast-i2v only).
    2. supports_sound: only means the API accepts a sound toggle for that model. It does not mean other models are silent; most video models use video_audio: included instead.

    Image-only models ignore sound.

    #REST API: URLs for local or browser files

    If you use the HTTP API (POST /v1/generate/media) and your inputs are files on disk or selected in the browser—not already public URLs—upload each file first with POST /v1/upload/media. Pass the returned urls values as source_media_urls.


    <a id="trim_video"></a>

    #trim_video

    Cut a window out of any publicly-accessible video URL and host the trimmed clip. Kubeez handles the trim server-side and returns a public url you can pass straight into generate_media as a source_media_urls value.

    When to use this tool:

    • The user's source video is longer than a model's reference-clip limit and you want the result to be deterministic instead of relying on provider-side trim hints.
    • You want to use a specific window of a longer asset (e.g. "use seconds 5–12 of this take, not the first 10").
    • You're chaining multiple generations and want a single canonical trimmed URL to reuse across calls.

    Per-model reference-clip caps (skip the trim when the source already fits):

    | Model | Reference-clip cap | Notes |
    |---|---|---|
    | Gemini Omni Video (-video-ref variants) | ≤ 10s window from the first 30s of source | A -video-ref variant is required: a duration variant combined with a video ref returns variant_mismatch. |
    | Kling 2.6 / 3.0 Motion Control | ≤ 30s video reference | Provider truncates longer clips; trim explicitly for predictable framing. |
    | Seedance 2 / 2 Fast | ≤ 15s combined reference video time | If you pass multiple ref videos, trim each so the sum fits. |

    Parameters:

    | Parameter | Type | Required | Description |
    |---|---|---|---|
    | source_url | string | Yes | Publicly accessible source video URL (Kubeez upload URL, R2/S3 URL, any public CDN). |
    | end_s | number | Yes | Trim window end, in seconds. Must be > start_s; the resulting end_s − start_s range is capped at 60s. |
    | start_s | number | No | Trim window start, in seconds. Default 0. |

    Returns: { url, size_bytes, duration_s, start_s, end_s, elapsed_s }. The url is a Kubeez-hosted public link pointing at the media-inputs Supabase storage bucket, safe to use as a source_media_urls value. Trimmed clips live alongside user-uploaded inputs and share the same weekly auto-cleanup — treat them as ephemeral, not as a permanent asset.

    Typical workflow (Gemini Omni Video with a long user source):

    1. get_upload_url(model_id="gemini-omni-video-1080p-video-ref")
           → presigned upload page
    2. (user uploads their 25s clip via the link)
    3. get_upload_session(token) → { media_urls: ["https://.../source.mp4"] }
    4. trim_video(source_url=media_urls[0], start_s=5, end_s=15)
           → { url: "https://.../trims/<uuid>.mp4", duration_s: 10 }
    5. generate_media(
           model="gemini-omni-video-1080p-video-ref",
           prompt="...",
           source_media_urls=[<trimmed url>]
       )
    

    Standalone trim (source URL already public — no upload step needed):

    trim_video(source_url="https://media.kubeez.com/<asset>.mp4", start_s=0, end_s=8)
           → { url, duration_s: 8 }
    generate_media(model="kling-3-0-motion-control-1080p", source_media_urls=[<image>, <trimmed url>], ...)
    

    Errors:

    | Error code | Cause |
    |---|---|
    | missing_source_url | source_url not provided. |
    | unsafe_source_url | source_url uses a non-http(s) scheme, contains embedded credentials, or resolves to a private/loopback/metadata IP. Kubeez refuses to fetch internal hosts; pass a publicly-resolvable URL. |
    | invalid_trim_range | end_s ≤ start_s, negative start_s, or non-numeric values. |
    | trim_too_long | end_s − start_s > 60s. |
    | source_too_large | Source video > 500 MB. |
    | source_fetch_failed | Source URL returned 4xx, 5xx, was unreachable, or hit the redirect-hop cap. |
    | ffmpeg_failed | ffmpeg rejected the file (unsupported codec / corrupt container). |
    | processor_timeout | Trim service exceeded its download/processing budget. |
    | processor_unavailable | Trim service is temporarily unreachable. Retry later or contact support if it persists. |
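
    Several of these errors depend only on the request itself and can be caught before calling trim_video. The sketch below is a client-side approximation with a hypothetical function name: it covers the argument checks and the scheme and credentials parts of unsafe_source_url, and deliberately omits the private-IP check, which needs DNS resolution.

```python
from typing import Optional
from urllib.parse import urlparse

MAX_TRIM_SECONDS = 60


def _is_number(value) -> bool:
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def trim_request_error(source_url, start_s=0, end_s=None) -> Optional[str]:
    """Return the error code a request would likely hit, or None if it looks valid."""
    if not source_url:
        return "missing_source_url"
    parsed = urlparse(source_url)
    if parsed.scheme not in ("http", "https") or parsed.username or parsed.password:
        return "unsafe_source_url"
    if not _is_number(start_s) or not _is_number(end_s):
        return "invalid_trim_range"
    if start_s < 0 or end_s <= start_s:
        return "invalid_trim_range"
    if end_s - start_s > MAX_TRIM_SECONDS:
        return "trim_too_long"
    return None
```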

    Example request:

    {
      "source_url": "https://media.kubeez.com/u123/uploads/long-take.mp4",
      "start_s": 5.2,
      "end_s": 14.7
    }
    

    Example response:

    {
      "url": "https://<project>.supabase.co/storage/v1/object/public/media-inputs/u123/trims/abc.mp4",
      "size_bytes": 1245678,
      "duration_s": 9.5,
      "start_s": 5.2,
      "end_s": 14.7,
      "elapsed_s": 1.83
    }
    

    Frame accuracy: -c copy snaps the actual start to the nearest preceding keyframe (≤ 1 GOP shift, typically ≤ 1s). This matches the web UI's native MP4 stream-copy behavior. For frame-accurate cuts, re-encode the source first — the tradeoff is a slower, lossy operation we deliberately avoid here.

    Output is ephemeral: the URL points at the media-inputs bucket, which is cleared weekly. If you need the trimmed clip to outlive the next generation, use manage_library(kind="asset", action="add", ...) to copy it into the user's permanent Asset Library.


    #get_generation_status

    Check the status of a media generation and get output URLs when done.

    Parameters:

    | Parameter | Type | Required | Description |
    |---|---|---|---|
    | generation_id | string | Yes | ID returned from generate_media. |

    Response: Includes status (pending, queued, processing, completed, failed), progress, and when completed an outputs array with url, thumbnail_url, optimized_url, media_type, dimensions, etc.


    #get_generation_estimate

    Get a parameter-specific estimated processing time for a given model and options (no job is started). For a per-model estimated duration in one call, use get_models; each model includes estimated_time_seconds. Use get_generation_estimate when you need an estimate that depends on prompt length, duration, or other parameters.

    Parameters:

    | Parameter | Type | Required | Description |
    |---|---|---|---|
    | model | string | Yes | Model ID. |
    | generation_type | string | No | Same as in generate_media. Default: text-to-image. |
    | prompt | string | No | Optional; can affect estimate. |
    | negative_prompt | string | No | Optional. |
    | parameters | object | No | Optional extra parameters. |

    Response: Estimated time (and optionally confidence/sample size) so you can set user expectations before calling generate_media.


    #Model rules

    • Text-to-image and text-to-video: Do not send source_media_urls (unless the model supports an optional reference image). Exception: seedance-1-5-pro in text-to-video mode accepts 0–1 optional images.
    • Image-to-video and image-to-image: Send image (and when supported, video) URL(s) in source_media_urls. Most models need images only; some (e.g. Kling 2.6 Motion Control) require 1 image + 1 video. seedance-1-5-pro in image-to-video mode requires exactly 2 images (start + end frame). Respect each model’s input limits above.
    • Video audio: Use capabilities.video_audio from get_models. included — audio is part of typical output without a sound parameter (e.g. Veo, Wan, Grok). toggle_via_sound_param — use sound only when supports_sound is true (Kling 2.6, Kling 3.0, Seedance 1.5 Pro). silent — no generated audio (Seedance 1.0 only). Do not infer “no audio” from supports_sound: false alone.
    • Use get_models to see which models support which generation types, input_media_types (e.g. image, video), and required input counts.
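
    The video audio rule above reduces to a three-way decision on capabilities.video_audio. A minimal sketch follows, with a hypothetical function name; the three values are the ones documented for get_models.

```python
def output_has_audio(video_audio: str, sound: bool = False) -> bool:
    """Predict whether a video output carries audio.

    video_audio is capabilities.video_audio from get_models; sound is the
    value you plan to send in generate_media.
    """
    if video_audio == "included":
        return True          # audio is part of the output regardless of sound
    if video_audio == "toggle_via_sound_param":
        return bool(sound)   # audio only when sound is true
    if video_audio == "silent":
        return False
    raise ValueError(f"unknown video_audio value: {video_audio!r}")
```

    Note that the decision never looks at supports_sound, which matches the warning not to infer "no audio" from supports_sound: false.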

    See Limitations for rate limits and credits. For a single table of API defaults (prompt max, inputs, duration, flags), see REST API model requirements.