How it works
The MCP server runs every generation as an async job: you start it, then poll until it's done. Your client handles polling automatically — this page is for anyone building one or debugging the flow.
#The canonical media flow
For any tool that takes a file from the user (image edit, motion control, lip-sync, captions, separation, ad-copy):
1. get_upload_url(model_id=X): returns a temporary browser link. Send it to the user verbatim.
2. The user uploads in their browser. Kubeez probes duration and dimensions client-side and writes them to the metadata table.
3. get_upload_session(token): returns the public URLs plus per-file duration_seconds and a billing_readiness flag (exact / pessimistic_fallback).
4. estimate_generation_cost(model, reference_*_seconds=…): exact price preview, no credits deducted. Quote it to the user.
5. generate_*(...): only after the user confirms.
Flat-rate models (Nano Banana, Flux, Imagen, Z-Image, Seedream, Logo Maker, ad-copy, music, speech) skip step 4 — quote cost_per_generation from get_models directly.
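For a per-duration model, the steps above can be sketched roughly as follows. This is a minimal illustration, not the Kubeez client: `call_tool` stands in for whatever MCP transport you use, and the response keys `upload_url`, `token`, and `files` are assumptions about the payload shape. The caller passes in `duration_kwarg`, the name of the `reference_*_seconds` parameter that applies to the chosen model.

```python
from typing import Any, Callable, Dict, Optional

# Hypothetical transport: call_tool(name, arguments) -> result dict.
ToolCaller = Callable[[str, Dict[str, Any]], Dict[str, Any]]


def run_media_flow(
    call_tool: ToolCaller,
    model_id: str,
    duration_kwarg: str,
    generate_tool: str,
    build_generate_args: Callable[[Dict[str, Any]], Dict[str, Any]],
    show_link: Callable[[str], None],
    wait_for_upload: Callable[[], None],
    confirm: Callable[[Dict[str, Any]], bool],
) -> Optional[Dict[str, Any]]:
    """Walk the five steps; return the generate_* result, or None if declined."""
    # Step 1: fetch the temporary upload link and pass it on unchanged.
    upload = call_tool("get_upload_url", {"model_id": model_id})
    show_link(upload["upload_url"])  # assumed response key

    # Step 2: the user uploads in their browser; wait until they are done.
    wait_for_upload()

    # Step 3: read back the public URLs and the probed duration.
    session = call_tool("get_upload_session", {"token": upload["token"]})  # assumed key
    first_file = session["files"][0]  # assumed shape: a list of per-file dicts

    # Step 4: price preview. No credits are deducted by this call.
    estimate = call_tool(
        "estimate_generation_cost",
        {"model": model_id, duration_kwarg: first_file["duration_seconds"]},
    )
    quote = {
        "estimate": estimate,
        # Assumed to be reported once per session rather than per file.
        "billing_readiness": session.get("billing_readiness"),
    }

    # Step 5: generate only after explicit confirmation.
    if not confirm(quote):
        return None
    return call_tool(generate_tool, build_generate_args(session))
```

Passing `confirm` as a callback keeps the "only after the user confirms" rule in one place: if it returns False, no generate call is ever issued.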
#Per-tool flows
#Images & video
1. Optional: get_models → pick a model_id and read its capabilities.
2. Optional: get_balance to check credits.
3. generate_media(prompt, model, …) → returns generation_id.
4. Poll get_generation_status(id) every 5s after a brief initial wait (~10s for image, ~30s for short video).
5. When status is completed, read outputs[]: each item has url, thumbnail_url, and an optional optimized_url.
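The wait-then-poll loop for a single generation might look like the sketch below. `fetch_status` is a placeholder for a call to get_generation_status with your generation_id. The "failed" status and the overall timeout are assumptions added for safety; this page only documents the completed status.

```python
import time
from typing import Any, Callable, Dict, List, Sequence


def poll_until_completed(
    fetch_status: Callable[[], Dict[str, Any]],
    first_wait: float = 10.0,   # ~10s for image, ~30s for short video
    interval: float = 5.0,
    timeout: float = 600.0,     # assumed safety cap, not a documented limit
    failure_statuses: Sequence[str] = ("failed",),  # assumed status name
    sleep: Callable[[float], None] = time.sleep,
) -> List[Dict[str, Any]]:
    """Wait once, then poll at a fixed interval until outputs are ready."""
    sleep(first_wait)
    waited = first_wait
    while True:
        result = fetch_status()
        status = result.get("status")
        if status == "completed":
            return result.get("outputs", [])
        if status in failure_statuses:
            raise RuntimeError(f"generation ended with status {status!r}")
        if waited >= timeout:
            raise TimeoutError(f"still {status!r} after {waited:.0f}s")
        sleep(interval)
        waited += interval
```

The `sleep` parameter exists only so the loop can be exercised in tests without real delays.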
#Music
1. generate_music(prompt, …) → generation_id.
2. Poll get_music_status(id) after ~30s, then every 5s.
3. When done, read songs[]: each has audio_url, stream_url, cover_image_url.
#Captions
generate_captions(media_url, …) is synchronous: it returns word-level captions directly in the response, so there is nothing to poll. Expect the call to take 1–5 minutes.
#Audio separation
1. generate_separation(media_url) → separation_id.
2. Poll get_separation_status(id) after ~20s, then every 5s.
3. When done, read vocals_url + instrumental_url.
#Ad creatives
1. create_ad_copy(reference_ad_url, …, variant_count) → generation_ids: [...] (one per variant).
2. Poll each id with get_generation_status until completed.
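Because create_ad_copy returns several ids, one way to track them is to keep a pending list and drop each id as it completes, so finished variants are not polled again. The sketch below assumes a `call_tool(name, arguments)` transport and an `id` argument key, both hypothetical. It treats every status other than completed as still pending.

```python
import time
from typing import Any, Callable, Dict, Iterable


def poll_variants(
    call_tool: Callable[[str, Dict[str, Any]], Dict[str, Any]],
    generation_ids: Iterable[str],
    interval: float = 5.0,
    max_rounds: int = 120,  # assumed cap so the loop cannot run forever
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, Dict[str, Any]]:
    """Poll every pending id once per round; return results keyed by id."""
    pending = list(generation_ids)
    done: Dict[str, Dict[str, Any]] = {}
    for _ in range(max_rounds):
        still_pending = []
        for gid in pending:
            result = call_tool("get_generation_status", {"id": gid})
            if result.get("status") == "completed":
                done[gid] = result
            else:
                still_pending.append(gid)
        pending = still_pending
        if not pending:
            return done
        sleep(interval)
    raise TimeoutError(f"variants still pending: {pending}")
```

A production client would likely also stop polling ids that report a terminal error; that status is not described on this page, so it is left out here.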
#Polling cadence
Every get_models row carries an estimated_time_seconds based on real historical completion times. Use it as the first-poll delay, then poll every 5s. Don't poll every second — you'll burn rate-limit budget for the same result.
| Family | Typical first-poll wait |
|---|---|
| Image | ~10s (z-image often <3s) |
| Short video (5–10s) | ~30s |
| Long video (15s+) | ~60s |
| Music | ~30s |
| Speech / TTS | ~5s |
| Audio separation | ~20s |
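A small helper can pick the first-poll delay: prefer estimated_time_seconds from the get_models row and fall back to the table above when it is missing. The family keys (`image`, `short_video`, and so on) are labels invented for this sketch, not identifiers returned by the API.

```python
from typing import Any, Dict, List

# Fallbacks copied from the table above; the keys are illustrative labels.
FALLBACK_FIRST_WAIT = {
    "image": 10,
    "short_video": 30,
    "long_video": 60,
    "music": 30,
    "speech": 5,
    "separation": 20,
}


def first_poll_delay(model_row: Dict[str, Any], family: str) -> float:
    """Seconds to wait before the first status poll."""
    estimate = model_row.get("estimated_time_seconds")
    if isinstance(estimate, (int, float)) and estimate > 0:
        return float(estimate)
    return float(FALLBACK_FIRST_WAIT[family])


def poll_offsets(first_wait: float, interval: float = 5.0, count: int = 5) -> List[float]:
    """Seconds after job start at which the first `count` polls happen."""
    return [first_wait + i * interval for i in range(count)]
```

For example, `poll_offsets(first_poll_delay({}, "music"), count=3)` gives polls at 30, 35, and 40 seconds after the job starts.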
#Best practices
- Quote cost before generating for any per-duration model: estimate_generation_cost is read-only.
- Always start with get_models: never hardcode model_ids; capabilities drive the request shape.
- Respect input requirements: get_models tells you requires_input_media, max_input_images, etc.
- Handle 429 (rate limit): back off and retry. Don't loop tightly on errors.
- Surface clear errors to the user on insufficient credits or invalid inputs — don't auto-retry without their action.
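The 429 guidance could be implemented with capped exponential backoff, as in the sketch below. How a rate-limit response surfaces depends on your client, so `RateLimited` is a hypothetical exception you would raise from your own transport layer. Any other error propagates immediately, matching the advice not to auto-retry on insufficient credits or invalid inputs.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class RateLimited(Exception):
    """Hypothetical: raised by your transport when the server answers 429."""


def call_with_backoff(
    do_call: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Retry only on RateLimited, doubling the delay up to max_delay."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return do_call()
        except RateLimited:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            sleep(min(max_delay, base_delay * 2 ** attempt))
    raise AssertionError("unreachable")
```

The delay values are placeholders; tune them against the limits listed on the Limitations page.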
See Limitations for rate limits and credit refund behavior.
