Scoring Methodology

Version: 1.1 · Last updated: 2026-04-29

The canonical reference is the full Markdown source in docs/methodology.md. This page summarizes the highlights.

Poker score

Raw BB/100: the actual win rate in big blinds per 100 hands, reported with a 95% bootstrap confidence interval (CI).
Skill BB/100: the win rate after Duplicate Poker variance reduction. Each complete template is played at every position, with stacks reset to the preset starting stack for each rotation.
Elo: initial rating 1500, K = 32, with a CI-overlap rule for draws.
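As a rough illustration of how these three numbers could fit together, the sketch below computes a percentile bootstrap CI for BB/100 and applies a standard Elo update with K = 32. The function names, the resample count, and the exact draw rule (overlapping CIs count as a draw, otherwise the higher interval wins) are assumptions made for this example; the formulas in docs/methodology.md take precedence.

```python
import random

def bootstrap_ci(per_hand_bb, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for BB/100, given per-hand results in big blinds.

    Hypothetical helper: the resample count and percentile method are assumptions.
    """
    rng = random.Random(seed)
    n = len(per_hand_bb)
    means = []
    for _ in range(n_resamples):
        sample = rng.choices(per_hand_bb, k=n)  # resample hands with replacement
        means.append(100.0 * sum(sample) / n)   # scale mean per hand to per 100 hands
    means.sort()
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi

def match_score(ci_a, ci_b):
    """Assumed CI-overlap rule: overlapping intervals are scored as a draw."""
    lo_a, hi_a = ci_a
    lo_b, hi_b = ci_b
    if lo_a > hi_b:
        return 1.0  # A's interval lies entirely above B's
    if lo_b > hi_a:
        return 0.0  # B's interval lies entirely above A's
    return 0.5      # intervals overlap: draw

def elo_update(r_a, r_b, score_a, k=32):
    """Standard Elo update; score_a is 1.0 (win), 0.5 (draw), or 0.0 (loss)."""
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))
    delta = k * (score_a - expected_a)
    return r_a + delta, r_b - delta
```

For two agents that both start at 1500, a decisive result moves each rating by 16 points, while a draw leaves both unchanged.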

Harness score

Every decision records validity, timeout behavior, file-protocol success, latency, MCP/Write/Edit tool traces, and permission errors. Harness score is displayed as an audit metric and is not used to sort the main Elo leaderboard.

Runs also record `agent_runtime` and Claude Code effort. The core benchmark uses `claude-code-persistent`: one long-lived Claude Code CLI process per player for the whole match, defaulting to `--claude-effort low`. `claude-code` is the legacy one-shot subprocess mode, and `openrouter` is a non-core fast debug path.

Eligibility

Requirement	Threshold
Minimum hands	5,000
Minimum sessions	3
Required preset	`daily-bench` or `full-benchmark`
Duplicate templates	Required for Skill BB/100 eligibility
Data completeness	Public hand history + decision telemetry; hidden hole cards excluded
Agent isolation	Minimal environment allowlist; unsafe Claude permissions disabled
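The thresholds in the table can be read as a simple checklist. The sketch below is one way to express it; `RunSummary` and its field names are hypothetical and do not reflect the actual schema of `run.json`, and it assumes hands and sessions have already been aggregated for the agent being checked.

```python
from dataclasses import dataclass

MIN_HANDS = 5000
MIN_SESSIONS = 3
ALLOWED_PRESETS = {"daily-bench", "full-benchmark"}

@dataclass
class RunSummary:
    """Hypothetical summary record; field names are assumptions for illustration."""
    hands: int
    sessions: int
    preset: str
    uses_duplicate_templates: bool
    unsafe_permissions: bool

def eligibility_failures(run):
    """Return a list of human-readable reasons the run fails the thresholds."""
    failures = []
    if run.hands < MIN_HANDS:
        failures.append(f"needs at least {MIN_HANDS} hands, has {run.hands}")
    if run.sessions < MIN_SESSIONS:
        failures.append(f"needs at least {MIN_SESSIONS} sessions, has {run.sessions}")
    if run.preset not in ALLOWED_PRESETS:
        failures.append(f"preset {run.preset!r} is not an allowed preset")
    if run.unsafe_permissions:
        failures.append("unsafe agent permissions were enabled")
    return failures

def skill_bb100_eligible(run):
    """Skill BB/100 additionally requires duplicate templates."""
    return not eligibility_failures(run) and run.uses_duplicate_templates
```

Returning the list of reasons rather than a bare boolean makes it easier to tell a submitter which requirement was missed.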

Official run artifacts

Official submissions are exported with `hab export-run <session_dir> --output official_runs/<session_id>`. Each export includes a leaderboard-ready `run.json`, sanitized per-hand JSON files, decision summaries, `checksums.json` with SHA-256 hashes, and `agent_security` metadata. Public leaderboard updates reject runs that used unsafe agent permissions.
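A reviewer could re-check an exported run against its SHA-256 hashes along the following lines. This is only a sketch: it assumes `checksums.json` is a flat JSON object mapping relative file paths to hex digests, which the summary above does not specify, and `verify_export` is a hypothetical helper rather than part of the `hab` CLI.

```python
import hashlib
import json
from pathlib import Path

def sha256_of(path, chunk_size=65536):
    """Hash a file in chunks so large hand histories need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify_export(export_dir):
    """Return the relative paths whose contents are missing or do not match.

    Assumes checksums.json looks like {"run.json": "<hex digest>", ...}.
    """
    export_dir = Path(export_dir)
    expected = json.loads((export_dir / "checksums.json").read_text())
    mismatches = []
    for rel_path, hex_digest in expected.items():
        target = export_dir / rel_path
        if not target.is_file() or sha256_of(target) != hex_digest:
            mismatches.append(rel_path)
    return mismatches
```

An empty list means every listed file matches its recorded hash. Files present in the export but absent from `checksums.json` are not flagged by this sketch.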

Tier system

🏅 official · ✅ verified · ⚠️ unverified · 🚩 challenged · ❌ invalidated

For the full version, including formulas, the submission process, and reproducibility rules, see docs/methodology.md.