Playgrounds Overview
Test prompts with evidence instead of guessing. Compare versions, models, and settings side-by-side with identical inputs.
The Problem Playgrounds Solves
Here's what usually happens when you're improving a prompt:
You edit the prompt. You test it once or twice in chat. It seems better. You deploy it. Later, someone tells you it's actually worse now. You have no idea what changed or why.
Or you're choosing between AI models. You try GPT-4. It's okay. You try Claude. Also okay. Which one is actually better for your use case? You're just guessing based on vibes.
Playgrounds fixes this by letting you test multiple versions side-by-side, with the same inputs, and compare real outputs. Not guesses. Actual results.
How It Works
Playgrounds gives you a column-based workspace. Each column holds its own prompt and configuration, and you run them all at once.
The key is "side-by-side." You're not testing one thing, then later testing another thing and trying to remember if the first one was better. You're looking at both results right next to each other.
When you save a playground, everything is preserved—columns, inputs, and outputs. Share the link with your team and they see the exact same results you're seeing.
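It can help to picture what gets preserved. As a rough sketch only, a saved playground amounts to something like the structure below; the class and field names are illustrative assumptions, not the product's actual data model:

```python
# Illustrative only: a saved playground conceptually captures each column's
# configuration together with the inputs it ran with and the output it produced.
# These names are assumptions for explanation, not the real schema.
from dataclasses import dataclass, field


@dataclass
class Column:
    prompt: str             # the prompt text or version loaded into this column
    model: str              # e.g. "gpt-4" or "claude"
    temperature: float      # sampling setting used for the run
    inputs: dict[str, str]  # the input values this column was run with
    output: str = ""        # the generated result, kept when you save


@dataclass
class Playground:
    name: str
    columns: list[Column] = field(default_factory=list)
```

Because the outputs are stored alongside the columns and inputs, anyone opening the shared link sees the same evidence you saw when you ran the comparison.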
Multi-Column Testing
You can use multiple columns in two ways:
Different inputs, same prompt - Load one prompt in multiple columns, but use different input values in each. See how your prompt handles various edge cases, tones, or content types. Does it work for short feedback and long feedback? Angry customers and happy ones?
Same inputs, different configurations - Load identical inputs across all columns, then change what you're testing. Switch models (GPT-4 vs Claude), adjust temperature, or compare prompt versions. Which setup produces better output?
This flexibility means you can test both "does my prompt work for different scenarios?" and "which model/settings work best?" in the same workspace.
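Playgrounds runs these comparisons for you in the UI, but it can help to see the "same inputs, different configurations" idea spelled out. Below is a minimal sketch, assuming the OpenAI Python SDK (v1+) and an API key in the environment; the prompt, input, and configuration values are made up for illustration and are not tied to the product:

```python
# A minimal sketch of "same inputs, different configurations",
# assuming the OpenAI Python SDK (>= 1.0) and OPENAI_API_KEY set in the environment.
from openai import OpenAI

client = OpenAI()

# One fixed input, reused across every "column".
feedback = "The checkout flow kept timing out and support never replied."
prompt = "Summarize this customer feedback in one sentence: {feedback}"

# Each entry plays the role of a column: same input, different configuration.
configurations = [
    {"label": "gpt-4o, temp 0.3", "model": "gpt-4o", "temperature": 0.3},
    {"label": "gpt-4o, temp 0.7", "model": "gpt-4o", "temperature": 0.7},
]

for config in configurations:
    response = client.chat.completions.create(
        model=config["model"],
        temperature=config["temperature"],
        messages=[{"role": "user", "content": prompt.format(feedback=feedback)}],
    )
    print(f"--- {config['label']} ---")
    print(response.choices[0].message.content)
```

The same loop shape covers the settings comparisons described below: swap in other temperature values, change the model per entry, or add a max token limit to each configuration.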
When to Use This
You're changing a prompt - Load version 12 (current) and version 13 (proposed). Run both. See if your changes actually improve things.
You're choosing a model - Test the same prompt with GPT-4, Claude, and Gemini. Pick based on actual outputs, not assumptions.
You need team buy-in - Set up the comparison, save it, share the link. Your manager sees real outputs instead of hearing your description of what changed.
Something broke - Load the working version from last week and the broken current version. See exactly what's different.
You're testing settings - Try temperature 0.3 vs 0.7. Try different max token limits. See what works.
In every case, you're looking at actual outputs rather than assumptions. When you save a playground, the results are preserved permanently: you can return months later and the outputs are still there, and anyone you share the link with reviews concrete evidence instead of an abstract proposal.
Everything you test in a playground is isolated. Changes don't affect your saved prompts. Experiment freely.
Next Steps
For detailed usage instructions, see Using Playgrounds.
For information about preserving and sharing your test results, see Saving and Sharing Playgrounds.