AI Model Rankings
RAW TESTS ONLY
March 14, 2026
Direct model evaluation: no augmentation, no external tools, pure baseline capability
Test Methodology
What is "RAW" Testing?
RAW tests evaluate models with direct prompts: no augmentation, no prompt engineering, no external tools. This measures each model's baseline capability across coding, reasoning, and planning tasks when given a pure task description.
Test Pipeline
1. Task Prompt: Send a coding, reasoning, or planning challenge directly to the model with clear requirements.
2. Model Response: The model returns implementation code plus unit tests.
3. Vitest Run: Execute the unit tests to verify correctness.
4. TSC Check: The TypeScript compiler checks type safety.
5. Wilson Score: Calculate the adjusted ranking score.
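The Vitest and TSC steps of the pipeline can be sketched as a small harness. This is an illustration only: the source does not publish its harness, so `CommandRunner`, `checkResponse`, the file name `async-retry.test.ts`, and the choice of `npx vitest run` and `npx tsc --noEmit` as the invoked commands are all assumptions. The runner is injected so the logic can be exercised without installing either tool.

```typescript
// Hypothetical harness sketch; names and commands are assumptions, not the site's code.

/** Runs a command and returns its exit code (0 means success). */
type CommandRunner = (command: string, args: string[]) => number;

interface CheckResult {
  testsPassed: boolean; // Vitest exit code was 0
  typesOk: boolean; // tsc exit code was 0
}

function checkResponse(testFile: string, run: CommandRunner): CheckResult {
  // `vitest run` executes the suite once instead of starting watch mode.
  const testsPassed = run("npx", ["vitest", "run", testFile]) === 0;
  // `tsc --noEmit` type-checks without writing output files.
  const typesOk = run("npx", ["tsc", "--noEmit"]) === 0;
  return { testsPassed, typesOk };
}

// Usage with a fake runner that simulates passing tests and a type error.
const fakeRunner: CommandRunner = (_command, args) =>
  args[0] === "tsc" ? 2 : 0;

const pipelineResult = checkResponse("async-retry.test.ts", fakeRunner);
console.log(pipelineResult); // { testsPassed: true, typesOk: false }
```

In a real harness the runner could wrap `spawnSync` from `node:child_process` and return its `status`. How a type error affects the pass or fail tally is not stated in the source, so the sketch reports both outcomes separately.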
How Rankings Are Calculated
The 100% Problem
Simple pass rate (passes / total) is misleading:
- 1 pass, 0 fail = 100% (but only 1 test!)
- 10 pass, 1 fail = 90.9% (but 11 tests!)
A model that passes 1/1 looks better than one that passes 10/11. That's wrong.
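Solution: Adjusted Wilson Score
The Wilson score lower bound accounts for both pass rate and sample size:
$$
\text{lower bound} = \frac{p + \frac{z^2}{2n} - z\sqrt{\frac{p(1-p)}{n} + \frac{z^2}{4n^2}}}{1 + \frac{z^2}{n}}
$$
Where:
- p = pass rate (passes / total)
- n = total tests attempted
- z = 1.96 (95% confidence)
Result: A model with 10/11 passes gets a higher Wilson score than one with 1/1 passes (a lower bound of about 0.62 versus about 0.21). More tests mean more confidence and a fairer ranking.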
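The calculation can be sketched in a few lines. `wilsonLowerBound` is the standard Wilson score lower bound for a binomial proportion. `adjustedWilson` is an assumption: the source does not say how the score is "adjusted", but multiplying the lower bound by the number of tests reproduces several published table entries (for example, 47 passes and 6 fails gives 41.03), so treat that scaling as an inference rather than the site's documented method. Both function names are illustrative.

```typescript
// Sketch of the Wilson score lower bound; function names are illustrative.

/** Wilson lower bound for `passes` successes out of `total` trials. */
function wilsonLowerBound(passes: number, total: number, z = 1.96): number {
  if (total === 0) return 0; // no tests attempted, so no evidence
  const p = passes / total;
  const z2 = z * z;
  const centre = p + z2 / (2 * total);
  const margin = z * Math.sqrt((p * (1 - p)) / total + z2 / (4 * total * total));
  return (centre - margin) / (1 + z2 / total);
}

/** Assumed scaling (not stated by the source): lower bound times test count. */
function adjustedWilson(passes: number, fails: number): number {
  const total = passes + fails;
  return total * wilsonLowerBound(passes, total);
}

console.log(wilsonLowerBound(1, 1).toFixed(3)); // 0.207
console.log(wilsonLowerBound(10, 11).toFixed(3)); // 0.623
console.log(adjustedWilson(47, 6).toFixed(2)); // 41.03
console.log(adjustedWilson(34, 0).toFixed(2)); // 30.55
```

Under this scaling a perfect record is still discounted: 34 passes with no failures score 30.55, not 34, because a small sample leaves the true pass rate uncertain.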
Score Components
| Metric | Description |
|---|---|
| Pass | All unit tests passed |
| Fail | Tests ran but failed (logic/implementation errors) |
| Wilson | Adjusted Wilson Score (primary ranking) |
| Pass Rate | For reference only (not used for ranking) |
Overall Rankings (All Categories Combined)
Ranked by Adjusted Wilson Score (higher is better)
| Rank | Model | Pass | Fail | Wilson | Pass Rate |
|---|---|---|---|---|---|
| 1 | deepseek-v3.2 | 47 | 6 | 41.03 | 88.7% |
| 2 | qwen3-coder-next | 43 | 9 | 36.54 | 82.7% |
| 3 | grok-4 | 40 | 3 | 35.00 | 93.0% |
| 4 | minimax-m2.5 | 41 | 8 | 34.77 | 83.7% |
| 5 | glm-4.7 | 39 | 10 | 32.52 | 79.6% |
| 6 | hunter | 34 | 0 | 30.55 | 100% |
| 7 | glm-5 | 34 | 1 | 29.91 | 97.1% |
| 7 | healer | 34 | 1 | 29.91 | 97.1% |
| 9 | qwen3-coder | 31 | 3 | 26.19 | 91.2% |
| 10 | haiku-4.5 | 27 | 1 | 23.04 | 96.4% |
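The table gives two models with the same score the same rank (7) and then skips to 9. A minimal sketch of that tie handling follows. The tie rule is inferred from the table, not stated by the source, and `ScoredModel`, `RankedModel`, and `rankByWilson` are hypothetical names.

```typescript
// Sketch of ranking with shared ranks for ties; the rule is inferred from the table.

interface ScoredModel {
  model: string;
  wilson: number;
}

interface RankedModel extends ScoredModel {
  rank: number;
}

function rankByWilson(entries: ScoredModel[]): RankedModel[] {
  // Sort a copy, highest score first, so the caller's array is left untouched.
  const sorted = entries.slice().sort((a, b) => b.wilson - a.wilson);
  const ranked: RankedModel[] = [];
  sorted.forEach((entry, index) => {
    const previous = ranked[index - 1];
    // Equal scores share a rank; otherwise the rank is the 1-based position.
    const rank =
      previous !== undefined && previous.wilson === entry.wilson
        ? previous.rank
        : index + 1;
    ranked.push({ model: entry.model, wilson: entry.wilson, rank });
  });
  return ranked;
}

// Four rows taken from the table above, in shuffled order.
const rankedSample = rankByWilson([
  { model: "qwen3-coder", wilson: 26.19 },
  { model: "glm-5", wilson: 29.91 },
  { model: "hunter", wilson: 30.55 },
  { model: "healer", wilson: 29.91 },
]);
console.log(rankedSample.map((r) => `${r.rank} ${r.model}`));
// ranks come out as 1, 2, 2, 4
```

The comparison uses exact equality, which works for scores already rounded to two decimals. Unrounded scores would need rounding first or a tolerance.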
Rust Rankings
Ranked by Adjusted Wilson Score (raw tests only, 9 tasks)
| Rank | Model | Pass | Fail | Wilson |
|---|---|---|---|---|
| 1 | opus-4.5 | 96 | 1 | 91.56 |
| 2 | opus-4.6 | 86 | 0 | 82.32 |
| 3 | gpt-5.3 | 77 | 1 | 72.61 |
| 4 | gemini-3-pro | 63 | 0 | 59.38 |
| 5 | gemini-3.1-pro | 60 | 0 | 56.39 |
| 6 | gpt-5.4 | 56 | 0 | 52.41 |
| 7 | gpt-5.2 | 52 | 0 | 48.42 |
| 8 | gpt-5.2-codex | 48 | 0 | 44.44 |
| 9 | haiku-4.5 | 45 | 1 | 40.79 |
| 10 | healer | 40 | 1 | 35.84 |
TypeScript Rankings
Ranked by Adjusted Wilson Score (raw tests only, 3 tasks)
| Rank | Model | Pass | Fail | Wilson |
|---|---|---|---|---|
| 1 | minimax-m2.5 | 52 | 9 | 45.31 |
| 2 | opus-4.5 | 34 | 1 | 29.91 |
| 3 | grok-4.1-fast | 35 | 4 | 29.80 |
| 4 | opus-4.6 | 33 | 1 | 28.93 |
| 5 | qwen3-coder | 33 | 3 | 28.14 |
| 6 | haiku-4.5 | 32 | 3 | 27.17 |
| 7 | grok-4 | 31 | 1 | 26.96 |
| 8 | qwen3-coder-next | 28 | 4 | 23.02 |
| 9 | glm-4.7 | 25 | 4 | 20.14 |
| 10 | gpt-5.3 | 22 | 0 | 18.73 |
Reasoning Rankings
Non-Coding Rankings
Test Tasks by Category
TypeScript
async-retry: Exponential backoff with jitter, configurable retry predicate
typed-emitter: Generic type-safe event emitter
rate-limiter: Token bucket algorithm
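To give a sense of what one of these tasks asks for, here is a minimal token bucket sketch for the rate-limiter task. The benchmark's actual task specification is not given in the source, so the class name, method names, and parameters below are hypothetical. Timestamps are passed in explicitly so the behaviour is deterministic.

```typescript
// Illustrative token bucket; not the benchmark's reference solution.

class TokenBucket {
  private readonly capacity: number; // maximum burst size
  private readonly refillPerSecond: number; // steady-state rate
  private tokens: number;
  private lastRefillMs: number;

  constructor(capacity: number, refillPerSecond: number, nowMs: number = Date.now()) {
    this.capacity = capacity;
    this.refillPerSecond = refillPerSecond;
    this.tokens = capacity; // start full so an initial burst is allowed
    this.lastRefillMs = nowMs;
  }

  /** Tries to take `cost` tokens; returns false when too few are available. */
  tryRemove(cost: number = 1, nowMs: number = Date.now()): boolean {
    this.refill(nowMs);
    if (this.tokens < cost) return false;
    this.tokens -= cost;
    return true;
  }

  private refill(nowMs: number): void {
    if (nowMs <= this.lastRefillMs) return; // no time has passed
    const elapsedSeconds = (nowMs - this.lastRefillMs) / 1000;
    // Add tokens for the elapsed time, never exceeding capacity.
    this.tokens = Math.min(this.capacity, this.tokens + elapsedSeconds * this.refillPerSecond);
    this.lastRefillMs = nowMs;
  }
}

const bucket = new TokenBucket(2, 1, 0); // capacity 2, refills 1 token per second
const attempts = [
  bucket.tryRemove(1, 0), // true: bucket starts full
  bucket.tryRemove(1, 0), // true: second token of the burst
  bucket.tryRemove(1, 0), // false: bucket is empty
  bucket.tryRemove(1, 1000), // true: one token refilled after one second
];
console.log(attempts);
```

The capacity bounds the burst size and the refill rate bounds sustained throughput, which is why the third call fails while the fourth succeeds a second later.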
Rust
builder-pattern: Builder creational pattern
channel-mpmc: Multi-producer, multi-consumer channel
functional-pipeline: Iterator combinators
generic-cache: Generic cache implementation
state-machine: State machine pattern
Reasoning
Logic puzzles, cryptarithmetic, spatial reasoning, tournament brackets
Non-Coding
Writing tasks, analysis, structured output, JSON extraction
Abbreviations
| Abbreviation | Meaning |
|---|---|
| RAW | Direct model calls without any augmentation |
| VITEST | JavaScript/TypeScript test runner |
| TSC | TypeScript Compiler (type checking) |
| Wilson | Adjusted Wilson Score (confidence-weighted ranking) |
| TS | TypeScript |
| Pass Rate | passes / (passes + failures), for reference only |