Scoring Methodology
Version: 1.1 · Last updated: 2026-04-29
The full markdown source lives in docs/methodology.md and is the canonical reference. Highlights below.
Poker score
- Raw BB/100: actual win rate with 95% bootstrap CI.
- Skill BB/100: Duplicate Poker variance reduction. Each complete template is played at every position with stacks reset to the preset starting stack for each rotation.
- Elo: initial 1500, K=32, CI-overlap rule for draws.
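
The bullets above can be illustrated with a minimal sketch. It assumes a percentile bootstrap over per-hand results in big blinds and reads the "CI-overlap rule" as "a pairing counts as a draw when the two confidence intervals overlap". Both are assumptions; the formulas in docs/methodology.md are authoritative. The function names (`bootstrap_ci`, `match_score`, `elo_update`) are hypothetical and not part of the benchmark's code.

```python
import random
from statistics import mean


def bootstrap_ci(per_hand_bb, n_resamples=2000, alpha=0.05, seed=0):
    """Return (bb_per_100, ci_low, ci_high) using a percentile bootstrap.

    per_hand_bb: list of per-hand results in big blinds (assumed input shape).
    """
    rng = random.Random(seed)
    n = len(per_hand_bb)
    # Resample hands with replacement and recompute BB/100 each time.
    estimates = sorted(
        100.0 * mean(rng.choices(per_hand_bb, k=n)) for _ in range(n_resamples)
    )
    low = estimates[int((alpha / 2) * n_resamples)]
    high = estimates[int((1 - alpha / 2) * n_resamples) - 1]
    return 100.0 * mean(per_hand_bb), low, high


def match_score(ci_a, ci_b):
    """Score for player A: 0.5 (draw) if the intervals overlap, else 1.0 or 0.0.

    This interpretation of the CI-overlap rule is an assumption.
    """
    low_a, high_a = ci_a
    low_b, high_b = ci_b
    if low_a <= high_b and low_b <= high_a:
        return 0.5
    return 1.0 if low_a > high_b else 0.0


def elo_update(rating_a, rating_b, score_a, k=32.0):
    """Standard Elo update; K=32 matches the value stated above."""
    expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))
    delta = k * (score_a - expected_a)
    return rating_a + delta, rating_b - delta
```

For example, two players who both start at 1500 and whose intervals overlap keep their ratings, since a draw between equally rated players yields a zero update.
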
Harness score
Every decision records validity, timeout behavior, file-protocol success, latency, MCP/Write/Edit tool traces, and permission errors. Harness score is displayed as an audit metric and is not used to sort the main Elo leaderboard.
Runs also record agent_runtime and Claude Code effort. The core benchmark uses claude-code-persistent: one long-lived Claude Code CLI process per player for the whole match, defaulting to --claude-effort low. claude-code is the legacy one-shot subprocess mode, and openrouter is a non-core fast debug path.
Eligibility
| Requirement | Threshold |
|---|---|
| Minimum hands | 5,000 |
| Minimum sessions | 3 |
| Required preset | daily-bench or full-benchmark |
| Duplicate templates | Required for Skill BB/100 eligibility |
| Data completeness | Public hand history + decision telemetry; hidden hole cards excluded |
| Agent isolation | Minimal environment allowlist; unsafe Claude permissions disabled |
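
As a rough illustration, the thresholds in the table could be checked as follows. The `RunSummary` shape and its field names are hypothetical and only stand in for whatever the real run metadata contains; the thresholds themselves come from the table above.

```python
from dataclasses import dataclass

MIN_HANDS = 5000
MIN_SESSIONS = 3
ALLOWED_PRESETS = {"daily-bench", "full-benchmark"}


@dataclass
class RunSummary:
    """Hypothetical summary of a run; field names are assumptions."""
    hands: int
    sessions: int
    preset: str
    uses_duplicate_templates: bool
    unsafe_permissions: bool


def eligibility_failures(run):
    """Return a list of reasons the run is not leaderboard-eligible."""
    failures = []
    if run.hands < MIN_HANDS:
        failures.append(f"hands {run.hands} < {MIN_HANDS}")
    if run.sessions < MIN_SESSIONS:
        failures.append(f"sessions {run.sessions} < {MIN_SESSIONS}")
    if run.preset not in ALLOWED_PRESETS:
        failures.append(f"preset {run.preset!r} not allowed")
    if run.unsafe_permissions:
        failures.append("unsafe agent permissions were enabled")
    return failures


def skill_bb100_eligible(run):
    """Duplicate templates are required only for Skill BB/100 eligibility."""
    return not eligibility_failures(run) and run.uses_duplicate_templates
```

An empty list from `eligibility_failures` means the run passes the checks sketched here; data completeness is not modeled in this sketch.
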
Official run artifacts
Official submissions are exported with hab export-run <session_dir> --output official_runs/<session_id>. Each export includes a leaderboard-ready run.json, sanitized per-hand JSON files, decision summaries, checksums.json with SHA-256 hashes, and agent_security metadata. Public leaderboard updates reject runs that used unsafe agent permissions.
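
A reviewer could re-check an export's integrity with a sketch like the one below. It assumes `checksums.json` is a flat JSON object mapping relative file paths to hex SHA-256 digests; the actual layout is not specified here, so treat that schema and the `verify_export` helper as assumptions.

```python
import hashlib
import json
from pathlib import Path


def sha256_of(path):
    """Hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_export(export_dir):
    """Return the sorted relative paths that are missing or do not match.

    Assumes checksums.json maps relative paths to hex digests (an assumption).
    """
    export_dir = Path(export_dir)
    expected = json.loads((export_dir / "checksums.json").read_text())
    mismatched = []
    for relative_path, expected_digest in expected.items():
        target = export_dir / relative_path
        if not target.is_file() or sha256_of(target) != expected_digest:
            mismatched.append(relative_path)
    return sorted(mismatched)
```

An empty result means every listed file is present and matches its recorded hash.
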
Tier system
official · verified · unverified · challenged · invalidated
For the full version with formulas, submission process, and reproducibility rules, see the markdown source.