AI Blind Test: Can You Tell Which Model Wrote This?

I hid the model names and asked people to pick the best AI response. Most couldn't tell free models from paid ones. Here's what happened — and what it means.

AI Prompt Race Team

I ran an experiment last week that made me question everything I thought I knew about AI quality.

The setup was simple. I took the same prompt, ran it through four different AI models, stripped the names, and asked people: which response is best?

No brands. No logos. No “powered by OpenAI” badges influencing the vote. Just raw text, labeled Model A through Model D.

The results broke my brain a little.

The Bias We Don’t Talk About

Here’s something nobody in the AI space wants to admit: brand matters more than output.

Tell someone a response came from GPT-4o and they’ll rate it higher. Show them the same text labeled “open-source model” and suddenly they notice flaws. It’s not intentional. It’s just how brains work. We trust names we recognize.

I know this because I did it too. For months, I assumed paid models were categorically better than free ones. I never actually tested that assumption blind. I just… believed it.

So I built a blind testing system into AI Prompt Race. We call it Blind Battle. You run a prompt through multiple models, the names get hidden, and you vote for whichever response you genuinely prefer. Only after voting do the names reveal themselves.
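
For the curious, here's roughly what that flow looks like in code. This is a minimal sketch, not AI Prompt Race's actual implementation. It just shows the core mechanic: shuffle the responses, hide the names behind neutral labels, and reveal them only after the vote is locked in.

```python
import random
import string

def blind_battle(responses: dict[str, str]) -> str:
    """Show responses under anonymous labels, collect a vote,
    then reveal which model wrote the winner.

    `responses` maps model name -> generated text.
    """
    entries = list(responses.items())
    random.shuffle(entries)  # break any ordering that could hint at the model

    labels = list(string.ascii_uppercase[:len(entries)])  # "A", "B", "C", ...
    for label, (_, text) in zip(labels, entries):
        print(f"--- Model {label} ---\n{text}\n")

    choice = ""
    while choice not in labels:
        choice = input(f"Best response ({'/'.join(labels)})? ").strip().upper()

    winner = entries[labels.index(choice)][0]
    print(f"You picked Model {choice}. That was: {winner}")
    return winner
```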

What happened next was interesting.

The Experiment

Over the past week, users on AI Prompt Race ran hundreds of blind comparisons. Here’s what the voting data actually shows.

Test 1: Explain quantum computing to a 10-year-old

Four models. Names hidden. One response used a perfect analogy about a maze where you can walk every path at once. Clear, fun, exactly the right level for a kid.

72% of voters picked that one.

It was Gemma 3 12B. Google’s smallest model in our lineup. Free.

The response most people expected to win — the longest, most detailed, most “impressive-looking” one — came in third. People don’t want impressive. They want clear.

Test 2: Write a professional email declining a meeting

This one was closer. The top two responses were separated by just 4 percentage points. Both were concise, professional, appropriately warm without being sycophantic.

Winner: Nemotron 30B (NVIDIA). Free.

Runner-up: Trinity Large 400B (Arcee AI). Also free.

Both beat the response that tried hardest to sound professional — which ended up sounding stiff and corporate. The lesson: the best AI email is the one that sounds like a real person wrote it quickly, not the one that sounds like it was carefully crafted by a machine.

Test 3: Debug this Python code

I gave all four models a function with three bugs: an off-by-one error, a missing null check, and a variable shadowing issue.
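
The article doesn't include the original snippet, so here's a hypothetical reconstruction. The function name and logic are my own invention; only the three bug classes match the test.

```python
def average_scores(readings):
    """Average the numeric readings, skipping missing entries.

    Deliberately broken: contains the three bug classes described above.
    """
    sum = 0                              # bug 1: shadows the built-in sum()
    count = 0
    for i in range(len(readings) - 1):   # bug 2: off-by-one, silently drops the last reading
        sum += readings[i]               # bug 3: no None check, so a None entry crashes here
        count += 1
    return sum / count
```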

This is where things got genuinely surprising. Three models found two bugs each. Only one found all three.

Nemotron 30B found all three, explained each one clearly, and provided the corrected code with comments showing what changed and why.
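
Continuing the hypothetical snippet above, a fix in the style the article describes, corrected code plus comments on what changed and why, might look like this:

```python
def average_scores(readings):
    """Average the numeric readings, skipping missing entries."""
    total = 0                   # was `sum`: renamed so the built-in is no longer shadowed
    count = 0
    for reading in readings:    # was range(len(readings) - 1), which dropped the last reading
        if reading is None:     # was missing: None entries are now skipped instead of crashing
            continue
        total += reading
        count += 1
    return total / count if count else 0.0   # also guards against an empty list
```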

And the longest, most verbose response? It came from one of the models that missed a bug. It spent so many tokens describing what the code was supposed to do that it apparently lost focus on what was actually wrong.

Test 4: Summarize the pros and cons of remote work

Every model handled this competently. The differences were in style, not substance.

But here’s what voters consistently preferred: shorter responses that made a clear argument, rather than balanced lists that tried to cover everything.

The winning response took a stance. It said remote work is better for most knowledge workers, acknowledged the real downsides (isolation, blurred boundaries), and ended with a specific recommendation. Voters overwhelmingly preferred that over the “on one hand, on the other hand” responses.

Winner: GLM 4.5 Air (Zhipu AI). The model most people had never heard of.

What the Data Actually Tells Us

After analyzing all the blind votes, three patterns emerged:

1. Name recognition inflates perceived quality by 15-25%

When the same responses were shown with and without model names, known brands consistently scored higher — even when the actual text was identical. This isn’t a flaw in the models. It’s a flaw in how we evaluate them.
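
The article doesn't spell out how that figure was computed. One plausible way to measure it, assuming paired ratings of identical text with and without model names, would be:

```python
def brand_inflation(labeled_scores: list[float], blind_scores: list[float]) -> float:
    """Percent by which attaching a brand name inflated the average rating.

    Assumes the same responses were rated on the same numeric scale by
    comparable audiences, once with model names shown and once blind.
    """
    labeled_avg = sum(labeled_scores) / len(labeled_scores)
    blind_avg = sum(blind_scores) / len(blind_scores)
    return (labeled_avg - blind_avg) / blind_avg * 100

# A response averaging 4.0/5 blind but 4.8/5 when labeled with a known brand:
# brand_inflation([4.8], [4.0]) -> 20.0, inside the 15-25% range above
```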

2. Shorter, clearer responses win blind tests

In 68% of blind comparisons, the winning response was NOT the longest one. Voters consistently rewarded clarity over comprehensiveness. This makes sense — when you can’t see the brand, you focus on whether the response actually helped you. And help usually means answering the question, not writing an essay about it.

3. Free models win blind tests more often than paid models

This is the uncomfortable one. In our blind comparisons, free open-source models were selected as the best response roughly 60% of the time when competing against paid model outputs.

That doesn’t mean free models are “better.” It means the gap between free and paid is far smaller than most people assume, and for many common tasks, it effectively doesn’t exist.

Why This Matters For Your Wallet

ChatGPT Plus costs $20 a month, or $240 a year. Claude Pro is the same. So is Google's Gemini Advanced.

These are all good products. Genuinely. If you need the absolute ceiling of AI capability — complex reasoning chains, massive context windows, specialized features — paid models still have an edge.

But “absolute ceiling” isn’t what most people need most of the time.

Most AI use falls into a few categories: writing help, code assistance, Q&A, summarization. For these tasks, our blind test data suggests that voters can’t reliably distinguish free from paid.

That’s not a sales pitch. That’s just what the votes say.

Try It Yourself

This is the part where I’d normally tell you to trust me. But the whole point of blind testing is that you shouldn’t trust anyone — including me.

So here’s what I’d suggest instead:

  1. Go to AI Prompt Race
  2. Type a prompt you actually use in your work
  3. Select 3-4 models
  4. Hit Race
  5. When Blind Battle appears below the results, vote for your favorite WITHOUT looking at the model names above
  6. See if your pick matches what you expected

Most people are surprised. Some are genuinely shocked. A few get angry — usually the ones paying $20/month for something they could get for free.

Whatever you feel, at least you’ll know. And knowing based on evidence beats assuming based on brand.

Run your own blind test now →


Data collected from AI Prompt Race blind voting feature, March 2026. Sample sizes vary by prompt. Individual results depend on prompt type, complexity, and the specific models compared. We don’t claim free models are universally better — we claim blind testing reveals the gap is smaller than most people think.
