A/B testing (also called split testing) enables comparing two or more prompt versions in production with real users and use cases. Rather than choosing between prompts based on intuition or small-scale testing, A/B testing provides statistical evidence about which prompt performs better under real-world conditions.

When to Use A/B Testing

A/B testing is powerful but not appropriate for every situation. It works well in scenarios like the following:
Consumer applications with high volume:
  • Applications with thousands of daily users (sufficient sample size)
  • Use cases where small quality variations are acceptable
  • Scenarios where you can collect quality signals (user feedback, automated scores)
Canary deployments:
  • You’ve validated improvements on test datasets
  • You want to verify production performance before full rollout
  • You can monitor metrics in real-time to catch issues early
Optimization iterations:
  • Incremental prompt improvements where directional changes are clear
  • Testing hypotheses about what drives quality (tone, length, structure)
  • Comparing prompts with similar expected performance
Examples: Chatbot greeting messages, content summarization, code completion suggestions, product recommendations
Avoid A/B testing for:
Mission-critical applications:
  • Healthcare decisions (potential patient harm)
  • Financial transactions (regulatory requirements)
  • Legal advice (liability concerns)
  • Safety-critical systems (autonomous vehicles, industrial controls)
Low-volume applications:
  • Fewer than 100 daily users (insufficient statistical power)
  • Use cases with long feedback cycles (weeks between samples)
  • Scenarios where each request is unique (no aggregate patterns)
High-stakes accuracy requirements:
  • Applications where any error is unacceptable
  • Regulated industries with strict compliance requirements
  • Use cases requiring deterministic outputs
Alternative: For these scenarios, use comprehensive offline evaluation on datasets before deploying to production, then monitor with 100% production traffic rather than split testing.

Prerequisites

Before starting A/B testing, ensure you have:
  1. Measurable success metrics: Quality scores, user feedback, task completion rates, or business outcomes
  2. Sufficient traffic volume: At least 100-200 samples per variant for statistical significance
  3. Prompt linking infrastructure: Ability to link prompts to traces for metric aggregation
  4. Monitoring dashboards: Real-time visibility into quality metrics by prompt version
  5. Rollback capability: Ability to stop the test and revert if issues arise
  6. Statistical analysis skills: Understanding of significance testing, confidence intervals, and statistical power
Without these prerequisites, A/B testing becomes guesswork rather than scientific experimentation.

How A/B Testing Works

The A/B testing lifecycle covers the complete workflow from setup to decision: Create variants → Implement random assignment → Collect sufficient data → Analyze for statistical significance → Make deployment decision → Monitor results.

Create prompt variants and assign labels

Create two (or more) prompt versions with different content, structure, or parameters.
Via ABV UI:
  1. Navigate to your prompt in the ABV dashboard
  2. Create a new version with variant A content
  3. Assign label variant-a (or prod-a)
  4. Create another version with variant B content
  5. Assign label variant-b (or prod-b)
Via SDK:
# Create variant A
abv.create_prompt(
    name="movie-critic",
    prompt="As a {{criticlevel}} movie critic, provide a detailed review of {{movie}}.",
    labels=["variant-a"],
    config={"temperature": 0.7}
)

# Create variant B
abv.create_prompt(
    name="movie-critic",
    prompt="You're a {{criticlevel}} film critic. Share your thoughts on {{movie}}.",
    labels=["variant-b"],
    config={"temperature": 0.8}
)
Version numbers: ABV automatically assigns incremental version numbers (e.g., versions 3 and 4), but you’ll reference them by label in your code.

Implement randomized assignment in application code

Modify your application to randomly select between variants for each request.
Python implementation:
from abvdev import ABV
from openai import OpenAI
import random

abv = ABV(api_key="sk-abv-...", host="https://app.abv.dev")
openai_client = OpenAI(api_key="sk-proj-...")

# Fetch both variants
prompt_a = abv.get_prompt("movie-critic", label="variant-a")
prompt_b = abv.get_prompt("movie-critic", label="variant-b")

# Randomly select variant (50/50 split)
selected_prompt = random.choice([prompt_a, prompt_b])

# Compile and use
compiled_prompt = selected_prompt.compile(
    criticlevel="expert",
    movie="Dune 2"
)

# Link prompt to trace for metric tracking
with abv.start_as_current_observation(
    as_type="generation",
    name="movie-review",
    prompt=selected_prompt  # Crucial: link for metrics
) as generation:
    response = openai_client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": compiled_prompt}]
    )
    generation.update(output=response.choices[0].message.content)

abv.flush()  # For short-lived applications
TypeScript/JavaScript implementation:
import { ABVClient } from "@abvdev/client";
import { startObservation } from "@abvdev/tracing";
import OpenAI from "openai";

const abv = new ABVClient();
const openai = new OpenAI();

async function main() {
  // Fetch both variants
  const promptA = await abv.prompt.get("movie-critic", {
    label: "variant-a",
  });

  const promptB = await abv.prompt.get("movie-critic", {
    label: "variant-b",
  });

  // Randomly select variant (50/50 split)
  const selectedPrompt = Math.random() < 0.5 ? promptA : promptB;

  // Compile once and reuse for both tracing and the LLM call
  const compiledPrompt = selectedPrompt.compile({
    criticlevel: "expert",
    movie: "Dune 2",
  });

  // Create generation span with linked prompt
  const generation = startObservation(
    "movie-review",
    {
      model: "gpt-4o",
      input: compiledPrompt,
      prompt: selectedPrompt, // Link for metrics
    },
    { asType: "generation" }
  );

  const completion = await openai.chat.completions.create({
    model: "gpt-4o",
    messages: [{ role: "user", content: compiledPrompt }],
  });

  generation.update({
    output: { content: completion.choices[0].message.content },
  });

  generation.end();
}

main();
Traffic split ratios: Use 50/50 for equal comparison, or adjust ratios (e.g., 90/10 for cautious canary deployment).
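The examples above assign a variant independently for each request. If you want a given user to consistently see the same variant across requests, one option is to bucket deterministically on a stable identifier. A minimal Python sketch; the assign_variant helper and user_id value are illustrative, not part of the ABV SDK:
import hashlib

def assign_variant(user_id: str, variant_b_share: float = 0.5) -> str:
    """Deterministically bucket a user so repeat requests get the same variant."""
    # Hash the user id to a number in [0, 1) and compare against the desired split
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 2**32
    return "variant-b" if bucket < variant_b_share else "variant-a"

# 90/10 split: ~10% of users get variant-b, and each user always gets the same label
label = assign_variant("user-1234", variant_b_share=0.1)
selected_prompt = abv.get_prompt("movie-critic", label=label)  # abv client from the example above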

Collect data over sufficient time period

Run the A/B test until you’ve collected enough data for statistical significance.
Minimum sample size:
  • At least 100-200 generations per variant
  • More samples for smaller expected differences
  • Use online sample size calculators for precise requirements
Time period:
  • Run for multiple days to account for day-of-week effects
  • Include weekdays and weekends if usage patterns differ
  • Ensure you capture diverse user segments and use cases
Monitor during collection:
  • Watch dashboards for unexpected issues
  • Check that traffic is splitting as expected (see the sketch after this list)
  • Verify metrics are being collected for both variants
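To verify the split quantitatively, you can compare the observed counts per variant against the intended ratio with a goodness-of-fit test; a minimal sketch using scipy, with made-up counts:
from scipy import stats

# Observed generations per variant (made-up numbers)
observed = [612, 571]          # variant-a, variant-b
expected_ratio = [0.5, 0.5]    # intended 50/50 split

total = sum(observed)
expected = [r * total for r in expected_ratio]

chi2, p_value = stats.chisquare(observed, f_exp=expected)
if p_value < 0.05:
    print("Traffic split deviates significantly from the intended ratio")
else:
    print("Observed split is consistent with the intended ratio")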
Early stopping criteria: Stop the test early if:
  • One variant shows severe quality degradation
  • Error rates spike for one variant
  • A pre-planned interim analysis shows a clear winner (naively peeking at p-values inflates false positives; see Common Pitfalls)

Analyze results and calculate significance

Navigate to the prompt in the ABV dashboard and compare metrics by version; you can also aggregate the data yourself, as in the sketch after the following list.
Key metrics to compare:
  • Quality scores: Median score, score distribution by variant
  • Latency: Median, p95, p99 response times
  • Token usage: Input tokens, output tokens (affects cost)
  • Cost: Median cost per generation
  • User feedback: Thumbs up/down ratios, satisfaction ratings
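If you want to analyze the raw data outside the dashboard, you can aggregate generation-level metrics by prompt label yourself. A minimal pandas sketch; the DataFrame columns are illustrative, not a fixed ABV export schema:
import pandas as pd

# Illustrative export of generation-level data; column names are hypothetical
df = pd.DataFrame({
    "prompt_label": ["variant-a", "variant-b", "variant-a", "variant-b"],
    "quality_score": [4.2, 4.6, 4.0, 4.4],
    "latency_ms": [430, 495, 470, 465],
    "cost_usd": [0.003, 0.004, 0.003, 0.004],
})

# Median of each metric per variant
summary = df.groupby("prompt_label")[["quality_score", "latency_ms", "cost_usd"]].median()
print(summary)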
Statistical significance:
  • Use significance tests (t-test, Mann-Whitney U test) to determine if differences are real
  • Calculate confidence intervals (95% CI recommended)
  • Consider practical significance: Is the improvement meaningful even if statistically significant?
Example analysis:
Variant A:
- Median quality score: 4.2/5
- Median latency: 450ms
- Median cost: $0.003
- Samples: 1,250

Variant B:
- Median quality score: 4.5/5 (7% improvement)
- Median latency: 480ms (~7% slower)
- Median cost: $0.004 (33% more expensive)
- Samples: 1,238

Statistical significance: p < 0.05 (quality improvement is significant)
Decision: Variant B improves quality but at higher cost. Evaluate tradeoff.
Tools for analysis: Use Python (scipy, statsmodels), R, or online calculators for significance testing.

Make decision and deploy winner

Based on the analysis, choose the winning variant.
Clear winner:
  • Variant significantly better on primary metric (quality)
  • No significant degradation on secondary metrics (cost, latency)
  • Action: Promote winner to production by reassigning production label
Mixed results:
  • Variant better on quality but worse on cost
  • Small improvement with high uncertainty
  • Action: Evaluate tradeoffs, possibly run longer test, or choose based on business priorities
No significant difference:
  • Variants perform similarly across all metrics
  • Action: Keep existing version (simpler) or choose based on maintenance/cost
Deployment:
# After deciding variant-b is the winner, promote via UI or SDK:
abv.update_prompt(
    name="movie-critic",
    version=4,  # variant-b version number
    new_labels=["production"]  # Assign production label
)
Post-deployment monitoring: Continue monitoring quality after full rollout to ensure results hold at 100% traffic.

Implementation Examples

Complete examples for both SDKs:
Python (complete A/B testing implementation):
from abvdev import ABV
from openai import OpenAI
import random
import os

# Initialize clients
abv = ABV(
    api_key=os.getenv("ABV_API_KEY"),
    host="https://app.abv.dev",
)
openai_client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

def run_ab_test(user_input: dict):
    """
    Run A/B test for movie critic prompt.

    Args:
        user_input: Dict with 'criticlevel' and 'movie' keys

    Returns:
        LLM response
    """
    # Fetch both variants
    prompt_a = abv.get_prompt("movie-critic", label="variant-a")
    prompt_b = abv.get_prompt("movie-critic", label="variant-b")

    # Randomly assign user to variant (50/50 split)
    selected_prompt = random.choice([prompt_a, prompt_b])

    # Compile prompt with user input
    compiled_prompt = selected_prompt.compile(
        criticlevel=user_input["criticlevel"],
        movie=user_input["movie"]
    )

    # Create generation with linked prompt
    with abv.start_as_current_observation(
        as_type="generation",
        name="movie-review-ab-test",
        prompt=selected_prompt  # Link for tracking by version
    ) as generation:
        # Call LLM
        response = openai_client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": compiled_prompt}]
        )

        result = response.choices[0].message.content

        # Update generation with output
        generation.update(output=result)

        return result

# Usage
if __name__ == "__main__":
    result = run_ab_test({
        "criticlevel": "expert",
        "movie": "The Lord of the Rings"
    })
    print(result)

    # Flush events for short-lived applications
    abv.flush()
Weighted traffic split (90% control, 10% variant):
# Weighted random selection
selected_prompt = random.choices(
    [prompt_a, prompt_b],
    weights=[0.9, 0.1],  # 90% variant-a, 10% variant-b
    k=1
)[0]
TypeScript (complete A/B testing implementation):
Setup (instrumentation.ts):
import dotenv from "dotenv";
dotenv.config();

import { NodeSDK } from "@opentelemetry/sdk-node";
import { ABVSpanProcessor } from "@abvdev/otel";

const sdk = new NodeSDK({
  spanProcessors: [
    new ABVSpanProcessor({
      apiKey: process.env.ABV_API_KEY,
      baseUrl: process.env.ABV_BASE_URL,
      exportMode: "immediate",
      flushAt: 1,
      flushInterval: 1,
    })
  ],
});

sdk.start();
A/B test implementation (index.ts):
import "./instrumentation"; // Must be first import
import { ABVClient } from "@abvdev/client";
import { startObservation } from "@abvdev/tracing";
import OpenAI from "openai";
import dotenv from "dotenv";
dotenv.config();

const abv = new ABVClient();
const openai = new OpenAI({
  apiKey: process.env.OPENAI_API_KEY,
});

async function runABTest(userInput: {
  criticlevel: string;
  movie: string;
}) {
  // Fetch both variants
  const promptA = await abv.prompt.get("movie-critic", {
    label: "variant-a",
  });

  const promptB = await abv.prompt.get("movie-critic", {
    label: "variant-b",
  });

  // Randomly assign user to variant (50/50 split)
  const selectedPrompt = Math.random() < 0.5 ? promptA : promptB;

  // Compile prompt with user input
  const compiledPrompt = selectedPrompt.compile(userInput);

  // Create generation with linked prompt
  const generation = startObservation(
    "movie-review-ab-test",
    {
      model: "gpt-4o",
      input: compiledPrompt,
      prompt: selectedPrompt, // Link for tracking
    },
    { asType: "generation" }
  );

  // Call LLM
  const completion = await openai.chat.completions.create({
    model: "gpt-4o",
    messages: [{ role: "user", content: compiledPrompt }],
  });

  const result = completion.choices[0].message.content;

  // Update generation
  generation.update({
    output: { content: result },
  });

  generation.end();

  return result;
}

// Usage
async function main() {
  const result = await runABTest({
    criticlevel: "expert",
    movie: "The Lord of the Rings"
  });
  console.log(result);
}

main();
Weighted traffic split (90% control, 10% variant):
const selectedPrompt = Math.random() < 0.9 ? promptA : promptB;
// 90% get promptA, 10% get promptB

Statistical Analysis

Understanding statistical concepts for A/B testing:
P-values
Definition: The probability of observing a difference at least as large as the one measured, assuming there is no real difference between variants (i.e., that it occurred by random chance).
Interpretation:
  • p < 0.05: Less than 5% chance results are due to randomness (commonly used threshold)
  • p < 0.01: Less than 1% chance (stronger evidence)
  • p > 0.05: Difference not statistically significant (could be random)
Example:
  • Variant A: median score 4.2
  • Variant B: median score 4.5
  • p-value: 0.03
  • Conclusion: The 0.3 point improvement is statistically significant (p < 0.05)
Caution: Significance doesn’t guarantee practical importance. Always consider effect size.
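A simple effect-size measure for score comparisons is Cohen’s d (the standardized mean difference); a minimal sketch with numpy, using made-up scores:
import numpy as np

variant_a = np.array([4.2, 4.0, 4.5, 4.1, 3.9, 4.3])  # made-up quality scores
variant_b = np.array([4.5, 4.3, 4.7, 4.4, 4.6, 4.2])

# Cohen's d: mean difference divided by the pooled standard deviation
pooled_std = np.sqrt((variant_a.var(ddof=1) + variant_b.var(ddof=1)) / 2)
cohens_d = (variant_b.mean() - variant_a.mean()) / pooled_std

print(f"Cohen's d: {cohens_d:.2f}")  # rough guide: ~0.2 small, ~0.5 medium, ~0.8 large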
Confidence intervals
Definition: The range in which the true value likely falls.
Interpretation:
  • 95% CI: We’re 95% confident the true value is in this range
  • Wider intervals indicate more uncertainty
  • Non-overlapping intervals suggest significant difference
Example:
  • Variant A: median score 4.2, 95% CI [4.0, 4.4]
  • Variant B: median score 4.5, 95% CI [4.3, 4.7]
  • Conclusion: Intervals don’t overlap—variant B is likely better
Use: Provides intuition about uncertainty in results, complements p-values.
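One distribution-free way to get a confidence interval for a median is bootstrapping; a minimal sketch with numpy, using made-up scores:
import numpy as np

rng = np.random.default_rng(42)
scores = np.array([4.2, 4.0, 4.5, 4.1, 3.9, 4.3, 4.4, 4.0])  # made-up variant scores

# Resample with replacement many times and take the median of each resample
boot_medians = [
    np.median(rng.choice(scores, size=len(scores), replace=True))
    for _ in range(10_000)
]

# 95% CI from the 2.5th and 97.5th percentiles of the bootstrap distribution
low, high = np.percentile(boot_medians, [2.5, 97.5])
print(f"Median: {np.median(scores):.2f}, 95% CI: [{low:.2f}, {high:.2f}]")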
Sample size and statistical power
Statistical power: The probability of detecting a real difference if one exists.
Factors affecting required sample size:
  • Effect size: Smaller differences need more samples
  • Baseline variance: Higher variance needs more samples
  • Desired power: Higher power (80-90% recommended) needs more samples
  • Significance level: Stricter thresholds (p < 0.01) need more samples
Example calculation (simplified):
  • Baseline score: 4.0 (std dev 1.0)
  • Expected improvement: 10% (0.4 points)
  • Desired power: 80%
  • Significance: 0.05
  • Required samples: ~100 per variant
Tools: Use online calculators (Evan’s Awesome A/B Tools, Optimizely Sample Size Calculator) for precise calculations.
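If you prefer computing it in code, statsmodels can solve for the required sample size given an effect size; a minimal sketch mirroring the numbers in the example above (it assumes a two-sample t-test):
from statsmodels.stats.power import TTestIndPower

# Effect size (Cohen's d) = expected improvement / standard deviation
effect_size = 0.4 / 1.0

n_per_variant = TTestIndPower().solve_power(
    effect_size=effect_size,
    alpha=0.05,              # significance level
    power=0.8,               # desired statistical power
    alternative="two-sided",
)
print(f"Required samples per variant: {n_per_variant:.0f}")  # ~99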
Choosing a statistical test
For continuous metrics (quality scores, latency):
  • t-test: Compares means, assumes normal distribution
  • Mann-Whitney U test: Compares medians, no distribution assumption (recommended for scores)
For binary metrics (thumbs up/down, success/failure):
  • Chi-square test: Compares proportions
  • Fisher’s exact test: For small sample sizes
For count data (errors, conversions):
  • Poisson test: Compares event rates
Python example (Mann-Whitney U test):
from scipy import stats

variant_a_scores = [4.2, 4.0, 4.5, 4.1, ...]  # 400 scores
variant_b_scores = [4.5, 4.3, 4.7, 4.4, ...]  # 400 scores

statistic, p_value = stats.mannwhitneyu(
    variant_a_scores,
    variant_b_scores,
    alternative='two-sided'
)

print(f"p-value: {p_value}")
if p_value < 0.05:
    print("Statistically significant difference")
else:
    print("No significant difference")

Common Pitfalls to Avoid

Stopping the test too early
Problem: Declaring a winner after 50 samples because variant B looks better.
Why it’s wrong: Small samples have high variance. Early results often don’t hold with more data.
Solution: Pre-commit to a minimum sample size (100-200+ per variant) before looking at results. Use sequential testing methods if you must peek early.
P-hacking (testing until something is significant)
Problem: Running multiple tests on the same data until you find statistical significance.
Example: Testing 20 different metrics, finding that 1 is significant at p < 0.05 (expected by chance).
Solution: Pre-register your primary metric before starting the test. Treat secondary metrics as exploratory only.
Ignoring practical significance
Problem: Deploying a variant because it’s statistically better, even though the improvement is tiny.
Example: p < 0.01, but quality improves only 0.5% while cost increases 30%.
Solution: Set minimum thresholds for practical significance before the test. Consider cost-benefit tradeoffs.
Forgetting to link prompts to traces
Problem: Implementing the A/B test but forgetting to link prompts to generation spans.
Result: ABV can’t aggregate metrics by prompt version, so you have no way to compare variants.
Solution: Always pass prompt=selected_prompt when creating generation spans:
with abv.start_as_current_observation(
    as_type="generation",
    prompt=selected_prompt  # Don't forget this!
) as generation:
    ...
Running variants at different times
Problem: Running variant A during weekdays and variant B during weekends, then concluding B is better.
Why it’s wrong: Weekend traffic might differ from weekday traffic, so you can’t tell if the difference is due to the prompt or the day of week.
Solution: Run variants concurrently with randomized assignment to ensure comparable populations.

Next Steps

  • Link Prompts to Traces: Essential setup for tracking metrics by prompt version
  • Version Control: Manage prompt versions and labels for A/B testing
  • Get Started with Prompts: Create and fetch prompts with the ABV SDK
  • Prompt Experiments: Offline evaluation as a complement to A/B testing
  • Scores Data Model: Understand quality scores used in A/B test analysis
  • Metrics Dashboard: Analyze and visualize A/B test results