Canonical media flow

    Five steps from a user file to a finished generation

    1. get_upload_url(model_id)

      Returns a temp browser link. Send it to the user verbatim.

    2. User uploads in browser

      Kubeez probes duration / dimensions client-side and writes them to media_upload_metadata.

    3. get_upload_session(token)

      Returns public URLs + per-file duration_seconds + a billing_readiness flag.

    4. estimate_generation_cost(...)

      Exact price preview, no credits deducted. Quote it to the user.

    5. generate_*(...)

      Only after the user confirms — generation_id returned for polling.

    Flat-rate models (Nano Banana, Flux, Imagen, Z-Image, Seedream, Logo Maker, ad-copy, music, speech) skip step 4 and quote cost_per_generation directly.

    How it works

    The MCP server runs every generation as an async job: you start it, then poll until it's done. Your client handles polling automatically — this page is for anyone building one or debugging the flow.

    #The canonical media flow

    For any tool that takes a file from the user (image edit, motion control, lip-sync, captions, separation, ad-copy):

    1. get_upload_url(model_id=X) — returns a temp browser link. Send it to the user verbatim.
    2. User uploads in their browser. Kubeez probes duration / dimensions client-side and writes them to the metadata table.
    3. get_upload_session(token) — returns the public URLs plus per-file duration_seconds and a billing_readiness flag (exact / pessimistic_fallback).
    4. estimate_generation_cost(model, reference_*_seconds=…) — exact price preview, no credits deducted. Quote it to the user.
    5. generate_*(...) — only after the user confirms.

    Flat-rate models (Nano Banana, Flux, Imagen, Z-Image, Seedream, Logo Maker, ad-copy, music, speech) skip step 4 — quote cost_per_generation from get_models directly.
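
    The five steps above, including the flat-rate shortcut, can be sketched as a single orchestration function. This is a minimal sketch under stated assumptions, not the Kubeez client: `call_tool` is a hypothetical stand-in for however your MCP client invokes a tool, the response keys `url` and `token` are assumed names for the upload link and session token, and the argument builders are supplied by the caller because the exact `estimate_generation_cost` and `generate_*` parameters depend on the model.

```python
from typing import Any, Callable, Dict, Optional

# Hypothetical transport: call_tool(tool_name, arguments) -> response dict.
ToolCaller = Callable[[str, Dict[str, Any]], Dict[str, Any]]
ArgBuilder = Callable[[Dict[str, Any]], Dict[str, Any]]


def run_media_flow(
    call_tool: ToolCaller,
    model_id: str,
    generate_tool: str,
    build_generate_args: ArgBuilder,
    wait_for_upload: Callable[[str], None],
    confirm: Callable[[Any], bool],
    build_estimate_args: Optional[ArgBuilder] = None,
    flat_rate_cost: Any = None,
) -> Optional[Dict[str, Any]]:
    """Run steps 1-5; return the generate_* response, or None if declined."""
    # Step 1: request the temporary browser upload link.
    upload = call_tool("get_upload_url", {"model_id": model_id})

    # Step 2: show the link to the user and block until they finish uploading.
    # "url" and "token" are assumed response keys, not documented ones.
    wait_for_upload(upload["url"])

    # Step 3: fetch public URLs, durations, and the billing_readiness flag.
    session = call_tool("get_upload_session", {"token": upload["token"]})

    # Step 4: per-duration models get an estimate; flat-rate models skip it
    # and quote the cost_per_generation value the caller read from get_models.
    if build_estimate_args is not None:
        quote = call_tool("estimate_generation_cost", build_estimate_args(session))
    else:
        quote = flat_rate_cost

    # Step 5: generate only after the user confirms the quoted price.
    if not confirm(quote):
        return None
    return call_tool(generate_tool, build_generate_args(session))
```

    Passing `build_estimate_args=None` together with `flat_rate_cost` models the flat-rate path, so the same function covers both cases without a separate code path for confirmation.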

    #Per-tool flows

    #Images & video

    1. Optional: get_models → pick a model_id and read its capabilities.
    2. Optional: get_balance to check credits.
    3. generate_media(prompt, model, …) → returns generation_id.
    4. Poll get_generation_status(id) every 5s after a brief initial wait (~10s for image, ~30s for short video).
    5. When status is completed, read outputs[] — each item has url, thumbnail_url, optional optimized_url.
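
    For anyone building a client, the start-then-poll pattern in steps 3 to 5 might look like the sketch below. It assumes a hypothetical `get_status` callable that wraps get_generation_status and returns a dict with `status` and `outputs` keys. The `max_polls` cap is an addition of this sketch, and failure statuses are not described here, so the loop simply times out if the job never reaches completed.

```python
import time
from typing import Any, Callable, Dict, List


def poll_generation(
    get_status: Callable[[str], Dict[str, Any]],
    generation_id: str,
    first_wait: float = 10.0,   # ~10s for image, ~30s for short video
    interval: float = 5.0,      # steady-state polling cadence
    max_polls: int = 120,       # assumption: give up after ~10 minutes
    sleep: Callable[[float], None] = time.sleep,
) -> List[Dict[str, Any]]:
    """Wait, then poll until status is 'completed' and return outputs[]."""
    sleep(first_wait)
    for _ in range(max_polls):
        result = get_status(generation_id)
        if result.get("status") == "completed":
            # Each output item is expected to carry url, thumbnail_url,
            # and optionally optimized_url.
            return result.get("outputs", [])
        sleep(interval)
    raise TimeoutError(f"generation {generation_id} did not complete in time")
```

    Injecting `sleep` keeps the function testable without real delays.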

    #Music

    1. generate_music(prompt, …) → returns generation_id.
    2. Poll get_music_status(id) after ~30s, then every 5s.
    3. When done, read songs[] — each has audio_url, stream_url, cover_image_url.

    #Captions

    1. generate_captions(media_url, …) is synchronous — returns word-level captions in the response (1–5 minute wait).

    #Audio separation

    1. generate_separation(media_url) → returns separation_id.
    2. Poll get_separation_status(id) after ~20s, then every 5s.
    3. When done, read vocals_url + instrumental_url.

    #Ad creatives

    1. create_ad_copy(reference_ad_url, …, variant_count) → returns generation_ids: [...] (one per variant).
    2. Poll each id with get_generation_status until completed.
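
    Because create_ad_copy returns one id per variant, a client has several jobs to track at once. One possible approach, sketched here under assumptions, is to poll all pending ids in rounds and drop each id once it completes. `get_status` is a hypothetical wrapper around get_generation_status, and `max_rounds` is a safety cap added for this example.

```python
import time
from typing import Any, Callable, Dict, Iterable


def poll_all(
    get_status: Callable[[str], Dict[str, Any]],
    generation_ids: Iterable[str],
    interval: float = 5.0,
    max_rounds: int = 120,  # assumption: cap on polling rounds
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, Dict[str, Any]]:
    """Poll every id until all are completed; return {id: final response}."""
    pending = list(generation_ids)
    done: Dict[str, Dict[str, Any]] = {}
    for _ in range(max_rounds):
        still_pending = []
        for gid in pending:
            result = get_status(gid)
            if result.get("status") == "completed":
                done[gid] = result
            else:
                still_pending.append(gid)
        pending = still_pending
        if not pending:
            return done
        # One sleep per round, not per id, so finished variants cost nothing.
        sleep(interval)
    raise TimeoutError(f"still pending after {max_rounds} rounds: {pending}")
```

    Completed ids are never polled again, which keeps the request count proportional to the work still outstanding.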

    #Polling cadence

    Every get_models row carries an estimated_time_seconds based on real historical completion times. Use it as the first-poll delay, then poll every 5s. Don't poll every second — you'll burn rate-limit budget for the same result.

    Family                 Typical first-poll wait
    Image                  ~10s (z-image often <3s)
    Short video (5–10s)    ~30s
    Long video (15s+)      ~60s
    Music                  ~30s
    Speech / TTS           ~5s
    Audio separation       ~20s
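
    Choosing the first-poll delay can be reduced to a small lookup: prefer estimated_time_seconds from the model's get_models row, and fall back to the typical waits in the table when it is missing. The sketch below is illustrative only. The family keys (`image`, `short_video`, and so on) and the `default` value are names invented for this example, not identifiers from the API.

```python
from typing import Any, Dict

# Typical first-poll waits in seconds, copied from the table above.
# The dictionary keys are hypothetical labels chosen for this sketch.
FALLBACK_FIRST_POLL_SECONDS = {
    "image": 10,
    "short_video": 30,
    "long_video": 60,
    "music": 30,
    "speech": 5,
    "audio_separation": 20,
}


def first_poll_delay(model_row: Dict[str, Any], family: str, default: float = 10.0) -> float:
    """Return how long to wait before the first status poll, in seconds."""
    estimate = model_row.get("estimated_time_seconds")
    # Trust the per-model estimate only when it is a positive number.
    if isinstance(estimate, (int, float)) and not isinstance(estimate, bool) and estimate > 0:
        return float(estimate)
    return float(FALLBACK_FIRST_POLL_SECONDS.get(family, default))
```

    After this first delay, the client polls every 5s as described above.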

    #Best practices

    • Quote cost before generating for any per-duration model — estimate_generation_cost is read-only.
    • Always start with get_models — never hardcode model_ids; capabilities drive the request shape.
    • Respect input requirements: get_models tells you requires_input_media, max_input_images, etc.
    • Handle 429 (rate limit) — back off and retry. Don't loop tightly on errors.
    • Surface clear errors to the user on insufficient credits or invalid inputs — don't auto-retry without their action.
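
    The 429 guidance above says to back off and retry without naming a strategy. One common choice is exponential backoff with a capped delay, sketched below as an assumption rather than a documented requirement. `RateLimited` is a hypothetical exception that your transport layer would raise on a 429 response. Any other error propagates immediately, in line with the advice not to auto-retry on insufficient credits or invalid inputs.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class RateLimited(Exception):
    """Hypothetical error raised by the transport layer on HTTP 429."""


def call_with_backoff(
    fn: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
    jitter: Callable[[], float] = random.random,
) -> T:
    """Call fn, retrying only on RateLimited with exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except RateLimited:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Add up to 10% jitter so parallel clients do not retry in lockstep.
            sleep(delay + jitter() * delay * 0.1)
    raise RuntimeError("max_attempts must be at least 1")
```

    A status poll could be wrapped as `call_with_backoff(lambda: get_status(generation_id))`, where `get_status` is whatever function your client uses to call get_generation_status.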

    See Limitations for rate limits and credit refund behavior.