AI Can Set Up Experiments and Close Them Now.

By Versia Team

Coding agents are getting good at setting up experiments. Give one a brief (“test three headlines on the signup page, track conversion”), point it at your flag provider’s API, and it generates the variant code, creates the experiment, and opens a PR. The setup work that used to take an engineer a morning now takes an agent a few minutes.

This is a real improvement. But it only solves the setup.

What happens after the experiment starts running? Someone still needs to watch the dashboard, wait for statistical significance, decide when to stop, and ship a follow-up change to hardcode the winner. That “someone” is still a human, and they’re still the bottleneck.

We built Versia because we think the whole loop should close without that human step. Not just the setup. The optimization itself.

The gap between setup and resolution

Here’s how most AI-assisted experimentation works today:

  1. Agent writes variant code and creates a feature flag
  2. Agent opens a PR, human reviews and merges
  3. Experiment runs with a fixed traffic split (usually 50/50)
  4. Human checks results periodically over days or weeks
  5. Human decides the experiment has reached significance
  6. Human (or agent, with another prompt) ships a PR to remove the flag and keep the winner

Steps 1 and 2 are now fast. Steps 3 through 6 are exactly as slow as they were before agents existed. The experiment still takes weeks of calendar time. Traffic is still wasted on the losing variant for the entire duration. And the resolution still depends on someone remembering to check.

This matters because the whole point of using agents for experimentation is velocity. If you can set up ten experiments per day but each one still takes three weeks to resolve, you haven’t actually increased your experimentation throughput. You’ve just created a backlog of running experiments that nobody’s monitoring.

What closing the loop actually requires

For an experiment to resolve itself without human intervention, three things need to be true about the underlying system:

Traffic allocation can’t be static. A 50/50 split is fine for the first hour. After that, if variant B is clearly outperforming variant A, you’re burning conversions by continuing to send half your traffic to the loser. The system needs to continuously reallocate traffic based on observed performance. This is what contextual bandits do: they explore initially, then exploit what they learn. An agent doesn’t need to “declare a winner” because the system converges on one automatically.
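
To make that concrete, here's a minimal sketch of the idea: Bernoulli Thompson sampling over a set of variants. This is an illustration of the technique, not Versia's implementation; a production contextual bandit also conditions on user attributes, but the core mechanic is the same.

```typescript
// Minimal Bernoulli Thompson sampling over N variants.
// Illustrative only — a real contextual bandit is more involved.

interface VariantStats {
  successes: number; // observed conversions
  failures: number;  // observed non-conversions
}

// Box-Muller standard normal, used by the gamma sampler below.
function sampleNormal(): number {
  let u = 0, v = 0;
  while (u === 0) u = Math.random();
  while (v === 0) v = Math.random();
  return Math.sqrt(-2 * Math.log(u)) * Math.cos(2 * Math.PI * v);
}

// Marsaglia-Tsang gamma sampler (shape k, scale 1).
function sampleGamma(k: number): number {
  if (k < 1) return sampleGamma(k + 1) * Math.pow(Math.random(), 1 / k);
  const d = k - 1 / 3;
  const c = 1 / Math.sqrt(9 * d);
  for (;;) {
    let x: number, v: number;
    do { x = sampleNormal(); v = 1 + c * x; } while (v <= 0);
    v = v * v * v;
    const u = Math.random();
    if (u < 1 - 0.0331 * x ** 4) return d * v;
    if (Math.log(u) < 0.5 * x * x + d * (1 - v + Math.log(v))) return d * v;
  }
}

function sampleBeta(a: number, b: number): number {
  const x = sampleGamma(a);
  return x / (x + sampleGamma(b));
}

// Pick a variant: sample each one's posterior conversion rate, serve the max.
// Early on the posteriors are wide (exploration); as evidence accumulates
// they tighten, and traffic concentrates on the winner automatically.
function chooseVariant(stats: VariantStats[]): number {
  let best = 0, bestDraw = -Infinity;
  stats.forEach((s, i) => {
    const draw = sampleBeta(s.successes + 1, s.failures + 1); // Beta(1,1) prior
    if (draw > bestDraw) { bestDraw = draw; best = i; }
  });
  return best;
}

// Update on a reward event. No human "declares a winner" anywhere in here.
function recordOutcome(stats: VariantStats[], variant: number, converted: boolean): void {
  if (converted) stats[variant].successes++;
  else stats[variant].failures++;
}
```

Each request samples from every variant's posterior and serves the highest draw, so exploration fades on its own as the evidence piles up.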

The reward signal needs to be programmatic. The system can’t learn which variant is better by waiting for a human to look at a chart. It needs a structured signal, delivered via API, at the moment a conversion happens. A user signs up; the app sends a reward event referencing the variant they saw. The system updates its model immediately. No batch processing, no nightly ETL, no dashboard refresh cycle.
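
In code, the conversion path boils down to a single call. Everything in this sketch — the endpoint, the auth scheme, the payload fields — is an assumption for illustration, not Versia's actual API (the real spec is at versia.dev/llms.txt):

```typescript
// Hypothetical reward call. Endpoint URL, headers, and payload shape
// are illustrative assumptions, not Versia's actual API.
async function sendReward(flagKey: string, variantShown: string, userId: string): Promise<void> {
  await fetch("https://api.example.com/v1/rewards", { // placeholder URL
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${process.env.EXPERIMENT_API_KEY}`, // assumed env var
    },
    body: JSON.stringify({
      flagKey,                // which experiment this reward belongs to
      variant: variantShown,  // the variant this user was served
      subject: userId,        // ties the reward back to the evaluation
      value: 1,               // binary conversion signal
    }),
  });
}

// Called inline at the moment of conversion, e.g. right after user creation:
// await sendReward("signup-headline", "variant-b", user.id);
```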

The evaluation protocol needs to be vendor-neutral. If an agent wires a flag evaluation into your code, that code shouldn’t be married to one vendor’s proprietary SDK. Standards exist for this (OpenFeature, OFREP). If the agent creates lock-in as a side effect of setting up experiments, you’ve traded one problem for another.
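
Here's what vendor-neutral evaluation looks like with the OpenFeature server SDK for TypeScript. The flag key, default value, and context attributes are illustrative:

```typescript
import { OpenFeature } from "@openfeature/server-sdk";

// The provider is the only vendor-specific line; any OFREP-compatible
// backend slots in here without touching the call sites below.
// OpenFeature.setProvider(new SomeOfrepProvider(...)); // provider choice is an assumption

const client = OpenFeature.getClient();

async function headlineFor(userId: string, device: string): Promise<string> {
  return client.getStringValue(
    "signup-headline",               // flag key (illustrative)
    "control",                       // safe default if evaluation fails
    { targetingKey: userId, device } // evaluation context the bandit can learn from
  );
}
```

Swapping vendors means changing the provider registration, not rewriting every evaluation the agent wired in.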

What this changes about the agent workflow

When the experimentation system closes its own loop, the agent’s job ends at setup:

  1. Agent writes variant code, creates a reward-driven flag via API
  2. Agent wires flag evaluation into the handler, wires reward into the conversion event
  3. Agent opens a PR, human reviews and merges
  4. System runs the experiment, shifting traffic toward the winner continuously
  5. There is no step 5

The flag is live and optimizing from the moment it ships. The total human involvement is one PR review. Everything else is API calls between the agent and the system. If you want to eventually remove the flag wrapper and dead code (you should), that’s a cleanup task the agent can do later by checking which variant the flag converged to.
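
Concretely, the PR the agent opens comes down to two touchpoints: one flag evaluation where the variant is rendered, one reward event where the conversion happens. The framework, route names, and helper stubs below are illustrative, combining the earlier sketches:

```typescript
import express from "express";
import { OpenFeature } from "@openfeature/server-sdk";

// Stubs standing in for your app's existing code:
declare function renderSignupPage(variant: string): string;
declare function createUser(body: unknown): Promise<{ id: string }>;
declare function sendReward(flagKey: string, variant: string, userId: string): Promise<void>;

const app = express();
app.use(express.json());
const flags = OpenFeature.getClient();

// Touchpoint 1: one flag evaluation decides what the user sees.
app.get("/signup", async (req, res) => {
  const variant = await flags.getStringValue("signup-headline", "control", {
    targetingKey: req.ip ?? "anonymous",
  });
  res.send(renderSignupPage(variant)); // variant travels with the form/session
});

// Touchpoint 2: one reward event at the moment of conversion. Nothing
// else follows — no dashboard check, no "declare the winner" PR.
app.post("/signup", async (req, res) => {
  const user = await createUser(req.body);
  await sendReward("signup-headline", String(req.body.variant), user.id);
  res.redirect(303, "/welcome");
});

app.listen(3000);
```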

The experiment brief is still the bottleneck

None of this changes the most important input: deciding what to test and what counts as success. A badly designed experiment produces a confident answer to the wrong question, whether a human or a bandit is running it.

In practice, an agent needs three things from you to set up a useful experiment:

  • What varies. “Three headline options on the signup page.” The agent can generate variant copy, but you should review it: agents are poor judges of whether a claim is factually accurate or on-brand.
  • What success looks like. “A signup event fires.” The clearer and more binary the reward signal, the faster the system converges. “Time on page” is fuzzier and takes longer to learn from than “clicked the button.”
  • What context matters. “Mobile vs desktop users might prefer different variants.” If you pass device type as a context attribute, the system can learn different preferences per segment. If you don’t, it treats everyone the same. More context means more to learn, which means slower convergence. Start simple.

The rest (flag creation, SDK wiring, reward plumbing) is mechanical work an agent handles well.

Next steps

If you’re building with coding agents and want the experiment lifecycle to be fully programmatic, that’s what Versia is for. The API spec lives at versia.dev/llms.txt and your agent can read it directly.