Using Playgrounds
How to set up comparisons, test different configurations, and use playgrounds effectively.
The Interface
Open Playgrounds from the main navigation. You'll see a column-based interface where each column can hold a different prompt configuration.

You can:
- Load existing prompts from your library
- Create new prompts from scratch
- Test the same prompt with different models
- Adjust settings like temperature or max tokens
- Run all columns with the same inputs
- Compare outputs side-by-side
Setting Up Comparisons
Testing Prompt Versions
Let's say you have a customer feedback analyzer that's currently using version 12. You want to test your proposed changes (version 13).
- Add a column
- Load your prompt from the library
- Select version 12
- Add another column
- Load the same prompt
- Select version 13
- Fill in the input variables (same customer feedback in both)
- Click run
- Compare results
Now you can see, with actual evidence, whether your changes improve the output.
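The playground handles this comparison for you in the UI, but it can help to see what the comparison amounts to. The sketch below is illustrative only: it uses the OpenAI Python SDK with a placeholder model name, and the two template strings stand in for versions 12 and 13 of your saved prompt. None of it is part of Aisle itself.
```python
# Illustrative sketch: two hypothetical prompt versions, same input, same model.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

PROMPT_V12 = "Summarize the customer feedback below and label its sentiment:\n\n{feedback}"
PROMPT_V13 = (
    "Summarize the customer feedback below, label its sentiment, "
    "and list any requested features:\n\n{feedback}"
)

feedback = "The new dashboard is great, but exports still time out on large reports."

for label, template in [("v12", PROMPT_V12), ("v13", PROMPT_V13)]:
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder model name
        messages=[{"role": "user", "content": template.format(feedback=feedback)}],
    )
    print(f"--- {label} ---")
    print(response.choices[0].message.content)
```
Whether you do it in code or in the playground, the point is the same: both versions see the identical input, so any difference in the outputs comes from your edits.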
Comparing Models
Same prompt, different models:
- Load your prompt into column 1
- Duplicate it into columns 2, 3, and maybe 4
- Click the model selector in each column
- Choose different models (GPT-4, Claude Sonnet, etc.)
- Run them all with identical inputs
You'll see immediately which model:
- Understands your instructions best
- Produces the output format you need
- Handles edge cases best
- Works well enough at lower cost
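If you're curious what the playground is doing per column, a rough hand-rolled equivalent looks like the sketch below. It's illustrative only: it assumes the OpenAI Python SDK with placeholder model names, and comparing models from other providers (Claude, for example) would go through those providers' own SDKs.
```python
# Illustrative sketch: one prompt, one input, several models, outputs printed in turn.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

prompt = "Classify this support ticket as billing, bug, or feature request:\n\n{ticket}"
ticket = "I was charged twice for my subscription this month."

for model in ("gpt-4o", "gpt-4o-mini"):  # placeholder model names
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt.format(ticket=ticket)}],
    )
    print(f"--- {model} ---")
    print(response.choices[0].message.content)
```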
Testing Settings
Click the gear icon in any column to adjust configuration.
Temperature controls randomness:
- Lower (0.0-0.3): Consistent, predictable outputs. Good for data extraction, classification.
- Higher (0.7-1.0): Creative, varied outputs. Good for brainstorming, writing.
Max Tokens caps how long the response can be. Set it to keep output length and cost under control.
Test different settings in different columns to find what works best for your use case.
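These are the same knobs an LLM API exposes, so a column's settings map directly onto request parameters. The sketch below is illustrative only (OpenAI Python SDK, placeholder model name), not how Aisle implements it.
```python
# Illustrative sketch: the gear-icon settings correspond to request parameters like these.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

def run(temperature: float, max_tokens: int) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder model name
        messages=[{"role": "user", "content": "Suggest three names for a budgeting app."}],
        temperature=temperature,  # 0.0-0.3 for consistency, 0.7-1.0 for variety
        max_tokens=max_tokens,    # caps response length, and therefore cost
    )
    return response.choices[0].message.content

print(run(temperature=0.2, max_tokens=60))   # consistent, short
print(run(temperature=0.9, max_tokens=200))  # varied, longer
```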
Versions
When you load a prompt into a playground, you choose which version to use.
Every time you save changes to a prompt in Aisle, it creates a new version. If you edited your "Customer Feedback Analyzer" three times this week, you have at least three versions.
Load version 12 (current production) alongside version 13 (latest changes) to see if your changes actually improve outputs. Or go back to version 8 when things were working well and compare it to version 12 to see what changed.
The power move: Load the same prompt, different versions, in different columns. Run them both. Now you're not guessing whether your edits improved things—you're looking at the actual difference.
Common Scenarios
Testing iterations against a baseline
You've been improving a prompt. You've made 5 changes this week. Is it actually better now?
Load the current production version in column 1. Load your latest changes in column 2. Run both with the same inputs. Look at the outputs side-by-side.
If the new version is clearly better, deploy it. If it's worse, you know immediately and can either revert or keep iterating. If it's about the same, maybe the changes don't matter.
Choosing between models
You're building a new prompt. You know what you want it to do, but you don't know which model to use.
Load your prompt into 3-4 columns. Switch each one to a different model. Run them all.
Sometimes a smaller, faster, cheaper model works just as well as the flagship model. Sometimes you need the advanced reasoning of the best model. You won't know until you test.
Getting team approval
You want to change a critical business prompt. Your manager needs to approve it.
Set up a playground with the current version and the proposed new version. Run both with realistic inputs. Save it. Share the link.
Now your manager reviews actual outputs, not your description of what changed. They can see exactly why the new version is better (or isn't). The approval conversation is concrete instead of abstract.
Diagnosing regressions
Something went wrong. The prompt that was working great last week is producing garbage now.
Load the current version and last week's stable version into a playground. Run them both. Document exactly what broke.
This testing record helps prevent the same regression in the future. You can see what changed and why it matters.
Proposing different approaches
You need to build something new. You have three completely different strategies for how the prompt should work.
Build all three in playground columns. Run them all. Share with your team. The saved results make the decision meeting productive because everyone is looking at real outputs, not hypothetical descriptions.
Testing Tips
Use actual examples of what the prompt will process in production, not toy examples. Don't judge based on one input—try several different examples to see consistency.
Name playgrounds descriptively ("Customer Feedback v2 vs v3 Testing" not "Playground 5"). Get feedback early. Don't wait until you've decided.
Change one thing at a time. If you change model AND temperature AND rewrite instructions all at once, you won't know which change caused what effect.