Interview with Kruno Sulić, Founder & Product Architect, Cliprise

Connectively

Connectively connects subject-matter experts with top publishers to increase their exposure and create Q & A content.




This interview is with Kruno Sulić, Founder & Product Architect, Cliprise.

For readers at Connectively, how do you describe your role as Founder & Product Architect at Cliprise and the specific problems you’re solving in AI-driven image, video, and voice creation?

My role sits at the intersection of founder, product strategist, and systems architect. I define what Cliprise should solve, translate that into product architecture, and work across the full system — model integrations, user experience, web and mobile applications, credit economics, reliability, APIs, and the workflows that connect everything.

The core problem we are solving is fragmentation. AI image, video, and voice creation has advanced rapidly, but the user experience remains unnecessarily complicated. Different models excel at different tasks; each provider has its own interface, pricing structure, controls, limitations, and subscription. Creators and businesses often spend more time comparing platforms, managing accounts, and troubleshooting workflows than actually producing content.

Cliprise brings 47+ image, video, voice, and editing models into one platform with a unified interface and credit system. The goal is not simply to offer a large model catalog; it is to help users move from an idea to a usable creative asset without needing to understand every provider, API, or technical tradeoff behind the process. The platform is available across the web, mobile apps, and developer workflows.

My broader background in building mobile apps, SaaS products, performance-marketing systems, and content platforms strongly influences how I approach the product. I evaluate features from both sides: whether they are technically sound and whether they solve a real problem clearly enough that users will repeatedly use and pay for them.

The principle guiding Cliprise is simple: the complexity should remain inside the platform, not be transferred to the user. My job is to make a fast-moving and technically fragmented market feel practical, understandable, and dependable for creators, marketers, teams, and developers.

What path led you to build Cliprise and shape your current approach to multimodal AI?

My path to Cliprise was shaped by more than a decade of building and operating digital businesses across affiliate marketing, content platforms, mobile apps, SaaS, paid acquisition, SEO, and automation. That background taught me that strong technology is only one part of a successful product. It also needs clear positioning, reliable infrastructure, sustainable unit economics, intuitive onboarding, and a reason for users to return.

As generative AI accelerated, I began working more deeply with image, video, voice, and editing models. The technical progress was remarkable, but the experience for ordinary users was fragmented. Each provider had different interfaces, subscriptions, credit systems, prompt requirements, output formats, and limitations. Producing one finished asset could require moving between several platforms and repeatedly rebuilding the same workflow.

That was the origin of Cliprise. I wanted to create the platform I would have wanted as an operator: one place where creators, marketers, founders, teams, and developers could access different forms of generative media without first becoming experts in every underlying model.

My current approach to multimodal AI is therefore model-agnostic and workflow-first. I do not believe any single model will be the permanent winner across every creative task. A model that excels at cinematic video may not be the best choice for product imagery, character consistency, voice, restoration, or fast social content. The product should help people use the right capability for the job while keeping the underlying complexity manageable.

Building Cliprise has also reinforced an important lesson: multimodal AI becomes valuable when separate capabilities are connected into a dependable workflow, not when they are presented as an impressive collection of demos. My focus is on turning rapid model innovation into something practical, accessible, and commercially useful across web, mobile, and API-based experiences.

Staying with model choices, as a product architect, what decision rule do you use to assign specific models to steps in Cliprise’s pipeline for image, video, and voice work?

My decision rule is outcome-first rather than model-first: assign each model only to the step where its specific strength creates measurable value for the final asset. I do not choose a model because it is new, popular, or impressive in a demo. I evaluate it against the job that step must perform.

For every candidate, I look at six factors:

  • Output quality
  • Prompt adherence
  • Controllability
  • Consistency
  • Latency
  • Cost per usable result

The important metric is not the price of one generation. It is the total cost of reaching an output the user can actually publish. A cheaper model that requires four retries may be more expensive than a premium model that succeeds on the first attempt.
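
As a rough illustration of that arithmetic, here is a minimal Python sketch. The function name, the credit prices, and the attempt counts are all hypothetical and chosen only to show the comparison; they are not Cliprise or provider figures.

```python
def cost_per_usable_result(price_per_generation: int, attempts_needed: int) -> int:
    """Total credits spent to reach one output the user can publish."""
    if attempts_needed < 1:
        raise ValueError("attempts_needed must be at least 1")
    return price_per_generation * attempts_needed


# Hypothetical numbers, in credits; not real Cliprise or provider pricing.
# Budget model: first try plus four retries before a usable result.
budget_model = cost_per_usable_result(price_per_generation=5, attempts_needed=5)
# Premium model: usable on the first attempt.
premium_model = cost_per_usable_result(price_per_generation=20, attempts_needed=1)

print(budget_model, premium_model)  # 25 20
```

Under these assumed numbers, the model that is four times cheaper per generation ends up costing more per publishable asset.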

In image workflows, the assignment depends on whether the user needs rapid ideation, photorealism, typography, reference-image fidelity, character consistency, editing, restoration, or upscaling. Those are different technical problems and should not automatically be routed to the same model.

For video, I separate visual generation from motion requirements. I evaluate temporal consistency, camera movement, subject stability, prompt adherence, image-to-video fidelity, first- and last-frame control, duration, resolution, and generation time. A cinematic model may be excellent for a hero sequence but inefficient for fast social variations.

For voice, I prioritize naturalness, pronunciation, emotional control, pacing, multilingual quality, speaker consistency, and synchronization with the intended video length. The most expressive voice is not always the best operational choice if timing or pronunciation is unreliable.

I also test models inside the complete workflow, not in isolation. A strong image model may produce outputs that another video model cannot animate reliably. A voice may sound excellent alone but fail once synchronized with the final scene. Pipeline compatibility therefore matters as much as individual benchmark quality.

Finally, every production route needs fallback logic. Models change, providers experience outages, moderation differs, and performance can vary by prompt type. Cliprise should be able to route users toward the best available path without exposing that infrastructure complexity.
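
One simple way to picture fallback logic is an ordered list of routes that is tried until one succeeds. The sketch below is an assumption for illustration only: `Route`, `ProviderError`, `generate_with_fallback`, and the stub providers are hypothetical names, not Cliprise's actual routing code or any provider's API.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


class ProviderError(Exception):
    """Hypothetical error for a failed provider call (outage, moderation block, timeout)."""


@dataclass
class Route:
    name: str
    generate: Callable[[str], str]  # takes a prompt, returns an asset reference


def generate_with_fallback(prompt: str, routes: Sequence[Route]) -> tuple[str, str]:
    """Try each route in priority order; return (route name, asset) from the first success."""
    errors = []
    for route in routes:
        try:
            return route.name, route.generate(prompt)
        except ProviderError as exc:
            errors.append(f"{route.name}: {exc}")
    raise ProviderError("all routes failed: " + "; ".join(errors))


# Stub providers standing in for real model integrations.
def primary(prompt: str) -> str:
    raise ProviderError("provider outage")


def backup(prompt: str) -> str:
    return f"asset for: {prompt}"


used, asset = generate_with_fallback(
    "product shot", [Route("primary-model", primary), Route("backup-model", backup)]
)
print(used, "|", asset)  # backup-model | asset for: product shot
```

The caller only sees the final asset and which route produced it; the failed attempt stays inside the routing layer.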

The principle is simple: use the strongest model for each constraint, but optimize the pipeline for the best dependable result, not the most impressive individual generation.

When a generation workflow has blown up cost or quality at Cliprise, what happened and what concrete change did you make to fix it with before/after numbers?

One cost problem we identified was not a single runaway generation but the hidden expense created by fragmented AI providers. A creator working with images, video, and voice may need several separate accounts, subscriptions, and prepaid credit balances. Even when an individual provider appears inexpensive, the total workflow becomes costly because credits are split across platforms and cannot be reused elsewhere.

We addressed this by securing substantial credit allocations and commercial terms from several large AI providers and consolidating access into a single Cliprise credit system. Instead of applying the same markup to every model, we price each one according to its actual supplier economics and the value it provides inside the wider workflow.

The before-and-after difference is straightforward. Before, users had to maintain separate provider accounts and balances for different creative tasks. After, they could use one account and one shared balance across 47+ image, video, voice, and editing models.

On some models, Cliprise can price generations below the provider’s direct retail price. On others, the price is equal or slightly higher, depending on the commercial terms and underlying generation cost. Even when a model is slightly more expensive, the user gains one interface, one billing system, and the ability to move between multiple models without purchasing another subscription.

The broader lesson was that the cheapest individual API call is not always the cheapest completed workflow. Separate subscriptions, minimum purchases, unused credits, and duplicated tooling all increase the real cost of producing a finished asset. By pooling supplier credits and placing multiple capabilities under one roof, we can reduce those inefficiencies while keeping pricing competitive and transparent.

On the safety side, how do you operationalize consent, licensing, and guardrails for voice synthesis and generated visuals so teams move fast without risking compliance?

We do not position Cliprise as an authority that grants legal permission to use a person’s face, voice, likeness, trademark, or copyrighted material. No platform policy can legalize conduct that applicable law prohibits.

Cliprise provides access to third-party image, video, and voice models through APIs. Each provider applies its own moderation, licensing conditions, and technical guardrails before a generation is completed. On top of that, we maintain an additional platform-level screening layer for clearly prohibited categories, including sexual content involving minors, extreme graphic violence, massacre-related material, and explicit sexual content.

Our operating principle is to separate three issues that are often incorrectly combined:

  • Access to a tool does not equal permission to use any input.
  • Ownership or commercial usage rights for an output do not override another person’s privacy, publicity, copyright, trademark, or consent rights.
  • Passing automated moderation does not mean that the final use is legally or ethically acceptable in every country or context.

Users remain responsible for ensuring they have the necessary consent and rights for any faces, voices, reference images, brands, music, or other protected material they upload or reproduce. They must also follow the laws that apply to their location, audience, and intended use.

From a product perspective, the goal is not to replace legal judgment with a checkbox. It is to create several practical layers of protection:

  • Clear terms
  • Provider-level moderation
  • Cliprise-level blocking of the most serious prohibited content
  • Explicit user responsibility for consent and lawful use

The rule we follow is simple: we can define what Cliprise permits on the platform, but we cannot grant rights that belong to another person or override national and international law.

Turning to quality, what evaluation method do you rely on to judge image, short-form video, and voice outputs before they ship that other teams could copy?

We separate platform readiness from creative judgment. Cliprise does not manually approve every image, video, or voice output before a user receives it, because quality is highly dependent on the prompt, reference material, model choice, and intended use. A result that is excellent for rapid social content may be unsuitable for a premium advertisement.

Before releasing a model or workflow, we evaluate whether it performs reliably for its stated use cases. That includes whether the generation completes correctly, follows the requested format and aspect ratio, handles reference inputs as expected, and produces outputs that are reasonably consistent with the prompt. For video, motion stability and subject consistency matter. For voice, pronunciation, pacing, clarity, and naturalness matter. We are evaluating whether the tool is ready and accurately represented, not claiming that every possible generation will be publication-ready.

The most important quality control happens before generation through model-specific prompting. Cliprise includes a dedicated prompt enhancer that rewrites a user’s idea according to the requirements and strengths of the selected model. This matters because the same prompt can perform very differently across image, video, and voice systems. A universal prompt template often removes the details a particular model needs.

The enhancer is one of our strongest tools, although many users still skip it and submit very short prompts. In practice, output quality is often limited less by the model than by the clarity of the instruction, the suitability of the model for the task, and the quality of any reference media.

We therefore support generation quality in three layers:

  1. Verify that the model and workflow function reliably for their stated use cases.
  2. Improve the input with model-specific prompt enhancement.
  3. Educate users through practical articles and guides on prompting, references, model selection, and common failure patterns.

The method other teams can copy is simple: do not judge AI output with one universal quality score. Test operational reliability separately, optimize instructions for each model, and let the final evaluation reflect the user’s actual purpose.

Under the hood, which specific prompt and context management practices have proven most effective at keeping multimodal jobs fast and affordable at Cliprise?

The most effective practice has been to reduce wasted generations rather than trying to make every request carry more context. In multimodal AI, speed and affordability usually improve when the model receives the right information, not the maximum amount of information.

Cliprise uses a model-specific prompt enhancer because image, video, and voice models interpret instructions differently. A strong prompt for one model can be inefficient or even counterproductive for another. The enhancer restructures the user’s idea around the selected model’s strengths, expected syntax, and controllable parameters before the generation begins.

That improves the workflow in two ways:

  1. The model receives a clearer instruction, which reduces avoidable retries.
  2. The user does not need to learn a different prompting style for every provider inside the platform.

The feature is still used less than it should be, even though it can materially improve the result from the same initial idea.

For context management, we follow several practical rules:

  • Keep only context that directly affects the output.
  • Separate creative intent from technical parameters such as aspect ratio, duration, camera motion, style, or voice characteristics.
  • Use reference images or media when visual consistency matters instead of trying to describe everything repeatedly in text.
  • Avoid asking one generation to solve too many independent creative problems at once.
  • Preserve essential subject, brand, and scene details while removing repetitive instructions that add latency without improving quality.
  • Route the job to a model suited to the task rather than forcing a cheaper or more popular model to handle the wrong workflow.
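
Several of these rules can be pictured as a structured request rather than one long prompt string. The following is a minimal sketch under that assumption; `GenerationRequest`, its field names, and `build_prompt` are hypothetical and do not describe Cliprise's internal data model.

```python
from dataclasses import dataclass, field


@dataclass
class GenerationRequest:
    creative_intent: str                                       # what the asset should communicate
    must_keep: list[str] = field(default_factory=list)         # essential subject, brand, scene details
    technical: dict[str, str] = field(default_factory=dict)    # aspect ratio, duration, camera motion...
    reference_media: list[str] = field(default_factory=list)   # references used instead of long descriptions


def build_prompt(request: GenerationRequest) -> str:
    """Build the text prompt from intent and de-duplicated details only.

    Technical parameters are deliberately left out of the text so they can be
    passed to the model as separate, structured settings.
    """
    seen = set()
    details = []
    for detail in request.must_keep:
        key = detail.strip().lower()
        if key and key not in seen:  # drop repeated instructions
            seen.add(key)
            details.append(detail.strip())
    parts = [request.creative_intent.strip()]
    if details:
        parts.append("Keep: " + "; ".join(details))
    return " ".join(parts)


request = GenerationRequest(
    creative_intent="A barista pours latte art.",
    must_keep=["red ceramic cup", "Red ceramic cup ", "brand logo on apron"],
    technical={"aspect_ratio": "16:9", "duration_seconds": "6"},
    reference_media=["cup_reference.png"],
)
print(build_prompt(request))
# A barista pours latte art. Keep: red ceramic cup; brand logo on apron
```

Keeping the technical settings out of the prose is a design choice in this sketch: it makes the text shorter and lets each model adapter map the same settings to its own parameters.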

For video especially, a staged process is often more efficient than one oversized prompt. Establish the subject and visual direction first, then add motion, timing, or voice in the appropriate step. This gives the user more control and makes failures easier to identify without repeating the entire production process.
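
A staged process of this kind can be sketched as a list of steps where only the failing step is retried. The example below is illustrative: `run_staged`, the stage names, and the stub steps are assumptions, and a step returning `None` is simply this sketch's convention for a failed generation.

```python
from typing import Callable, Optional

Step = Callable[[dict], Optional[str]]


def run_staged(stages: list[tuple[str, Step]], state: dict, max_attempts: int = 2):
    """Run stages in order; retry only the failing stage, keeping earlier results."""
    log = []
    for name, step in stages:
        for attempt in range(1, max_attempts + 1):
            result = step(state)
            log.append((name, attempt, result is not None))
            if result is not None:
                state = {**state, name: result}
                break
        else:
            # No attempt succeeded: stop here so the failure is easy to locate.
            raise RuntimeError(f"stage {name!r} failed after {max_attempts} attempts")
    return state, log


calls = {"motion": 0}


def subject(state: dict) -> Optional[str]:
    return "keyframe.png"


def motion(state: dict) -> Optional[str]:
    calls["motion"] += 1
    # Simulate one failed generation before a usable clip.
    return None if calls["motion"] == 1 else f"clip from {state['subject']}"


def voice(state: dict) -> Optional[str]:
    return "voiceover.wav"


final, log = run_staged([("subject", subject), ("motion", motion), ("voice", voice)], {})
print(log)
# [('subject', 1, True), ('motion', 1, False), ('motion', 2, True), ('voice', 1, True)]
```

In this run the motion stage is repeated once, while the already approved subject image is reused rather than regenerated.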

The underlying principle is simple: good context is selective, structured, and model-specific. The most affordable generation is not always the one with the lowest API price. It is the one that reaches a usable result with the fewest unnecessary attempts.

Bringing it together, could you walk us through one end-to-end Cliprise workflow—from raw recording to publish-ready short video with captions and voiceover—highlighting where humans review versus where AI acts autonomously?

A representative Cliprise-assisted workflow begins with the human, not the model. The user uploads the raw recording and defines the purpose, audience, platform, target duration, and core message. That editorial brief determines everything that follows.

The first human decision is selecting the strongest section of the recording and determining what should become the hook. AI can accelerate processing, but it should not decide which statement best represents the speaker or the brand without review.

Next, the workflow produces a draft transcript and caption structure. The user checks names, numbers, technical terms, and any wording where a transcription error could change the meaning. Captions are then adjusted for readability, timing, line length, and the safe areas of the intended platform.

For voiceover, the user provides or approves the script, chooses the voice and delivery style, and confirms that the necessary rights and consent exist. AI generates the audio, but a person reviews pronunciation, pacing, emphasis, and whether the voice actually matches the message. A technically clean voice can still feel wrong for the content.

Cliprise can then be used to generate supporting images, short video inserts, or other creative elements. Our model-specific prompt enhancer helps structure those requests for the selected model rather than sending the same generic prompt everywhere. The user compares the outputs and selects only the material that supports the story.

The final stage is human-led: review the edit from beginning to end, verify factual accuracy, check captions and audio balance, remove weak or misleading material, confirm branding and rights, and approve the export. Publishing remains a deliberate user action.

The division is simple:

  • AI handles repetitive production work, draft generation, model execution, and rendering.
  • Humans control intent, truth, consent, taste, and final approval.

We do not treat a successful generation or a passed moderation check as proof that a video is ready to publish. The final editorial responsibility stays with the person or team using the platform.

Looking 12 months ahead, which capability in multimodal models will most change how you architect Cliprise and why?

The capability most likely to change how I architect Cliprise is reliable agentic orchestration with persistent multimodal context.

Today, most creative AI workflows are still directed step-by-step by the user. A person selects a model, writes the prompt, generates an asset, evaluates it, moves it into another tool, and repeats the process for video, voice, editing, or refinement.

The more significant shift will happen when an agent can understand the user’s objective, maintain the relevant creative context, and coordinate several specialized models across an entire job.

For example, it could:

  • interpret a campaign brief,
  • prepare model-specific prompts,
  • generate several visual directions,
  • preserve the chosen character or brand details,
  • create motion and voice assets,
  • identify failed stages, and
  • ask for human approval only at meaningful decision points.

That would change the architecture from a collection of model integrations into a stateful orchestration system. The platform would need:

  • persistent job memory,
  • provider-agnostic routing,
  • cost and latency budgets,
  • asset lineage,
  • permission controls,
  • fallback models, and
  • clear human approval checkpoints.

The difficult part is not giving an agent access to more tools. It is making sure it understands when to act, when to retry, when to switch models, and when to stop and ask a person. An autonomous workflow that creates ten unnecessary generations is not intelligent simply because it completed the task.
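
To make the "when to retry, when to stop" question concrete, here is a small Python sketch of a budget guard an orchestrator might consult after each generation. It is a thought experiment, not a description of any existing or planned Cliprise feature; `JobBudget`, `next_action`, and the action labels are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class JobBudget:
    max_credits: int
    max_generations: int
    spent_credits: int = 0
    generations: int = 0

    def can_afford(self, cost: int) -> bool:
        """True if one more generation fits both the credit and the attempt budget."""
        return (
            self.spent_credits + cost <= self.max_credits
            and self.generations < self.max_generations
        )

    def record(self, cost: int) -> None:
        self.spent_credits += cost
        self.generations += 1


def next_action(budget: JobBudget, cost: int, last_ok: bool, needs_approval: bool) -> str:
    """Decide what the agent does after a generation attempt."""
    if last_ok and needs_approval:
        return "ask_human"      # meaningful decision point
    if last_ok:
        return "continue"
    if budget.can_afford(cost):
        return "retry"          # failure, but still within budget
    return "stop_and_ask"       # budget exhausted: hand control back to a person


budget = JobBudget(max_credits=30, max_generations=3)
budget.record(10)
budget.record(10)
print(next_action(budget, cost=10, last_ok=False, needs_approval=False))  # retry
budget.record(10)
print(next_action(budget, cost=10, last_ok=False, needs_approval=False))  # stop_and_ask
```

The point of the guard is that the agent's freedom to retry is bounded by an explicit budget, so a long chain of unnecessary generations ends in a request for human input rather than more spend.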

My view is that humans should continue to control intent, rights, taste, and final approval. Agents can increasingly manage the repetitive coordination between those decisions.

I see this as an important architectural direction for multimodal platforms, rather than a commitment to any specific future Cliprise feature.

Thanks for sharing your knowledge and expertise. Is there anything else you'd like to add?

The most important lesson I would add is that the future of multimodal AI will not be decided only by which model produces the most impressive demo. It will be decided by which products make advanced capabilities genuinely useful, affordable, understandable, and dependable for ordinary people and businesses.

The market is moving extremely fast, but users do not want to manage dozens of providers, subscriptions, prompt formats, pricing systems, and technical limitations. They want to turn an idea into a finished result without needing to understand the infrastructure behind it.

That is the opportunity we are pursuing with Cliprise. We are building a single environment where people can access leading image, video, voice, and editing models, compare different creative approaches, and move between them through one account and one shared credit system.

At the same time, more automation should not mean less human responsibility. AI can accelerate production, suggest directions, and coordinate repetitive work, but people should remain responsible for intent, consent, accuracy, taste, and final approval.

My broader view is that the strongest AI companies will not simply provide access to models. They will remove friction, earn trust, and help users consistently reach outcomes that are commercially and creatively useful.

The real breakthrough is not when AI can generate almost anything. It is when almost anyone can use it well.

Cliprise is being built around that principle.
