SWE-Atlas README
SWE-Atlas is a benchmark for evaluating AI coding agents across a spectrum of professional software engineering tasks. Rather than measuring a single skill in isolation, SWE-Atlas consists of multiple leaderboards that target distinct and complementary capabilities across the software development lifecycle.
This repository contains the data and instructions for running SWE Atlas - Codebase QnA (https://labs.scale.com/leaderboard/sweatlas-qna) and SWE Atlas - Test Writing (https://labs.scale.com/leaderboard/sweatlas-tw).
Repository Structure
- data/qa/ — 496 Codebase QnA tasks
- data/tw/ — 360 Test Writing tasks
- run_config/ — Example configs for running benchmarks
Requirements
- harbor (https://github.com/laude-institute/harbor) — task runner framework
- Modal — sandbox environment execution
- ANTHROPIC_API_KEY — for the agent under evaluation
- OPENAI_API_KEY + OPENAI_API_BASE — for the LLM judge (Claude Opus 4.5 used as judge)
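Before launching a run, it can help to confirm that the credentials listed above are present. The following is a minimal sketch, not part of this repository: the helper name missing_env_vars is hypothetical, and it only checks the three variables named in this list (Modal and harbor may need their own setup, which is not covered here).

```python
import os

# Variable names taken from the Requirements list above.
REQUIRED_ENV_VARS = ("ANTHROPIC_API_KEY", "OPENAI_API_KEY", "OPENAI_API_BASE")


def missing_env_vars(environ=None):
    """Return the required variables that are unset or empty (hypothetical helper)."""
    env = os.environ if environ is None else environ
    return [name for name in REQUIRED_ENV_VARS if not env.get(name)]


# Example with an explicit mapping instead of the real environment.
example_env = {"ANTHROPIC_API_KEY": "placeholder", "OPENAI_API_KEY": ""}
print(missing_env_vars(example_env))  # ['OPENAI_API_KEY', 'OPENAI_API_BASE']
```

Calling missing_env_vars() with no argument checks the current process environment.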
Task Structure (QnA example)
Each task folder contains:
- instruction.md — the question/task given to the agent
- task.toml — metadata (repo, base_commit, Docker image, resource limits, timeouts)
- environment/Dockerfile — Docker image spec
- solution/answer.txt + solve.sh — reference solution
- tests/evaluate_answer.py — evaluator script
- tests/rubrics.json — rubric for LLM judge
- tests/prompt.txt / system_prompt.txt / user_prompt_template.txt — judge prompts
- tests/test.sh — test runner
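The layout above can be checked mechanically for a given task folder. This is a rough sketch under stated assumptions: the helper missing_files is hypothetical, solve.sh is assumed to sit next to answer.txt in solution/, and the judge prompt files are left out because the list does not say whether every task ships all three.

```python
from pathlib import Path

# Relative paths taken from the list above.
# Assumption: solve.sh lives in solution/ alongside answer.txt.
# The judge prompt files are omitted on purpose (see the lead-in).
EXPECTED_FILES = (
    "instruction.md",
    "task.toml",
    "environment/Dockerfile",
    "solution/answer.txt",
    "solution/solve.sh",
    "tests/evaluate_answer.py",
    "tests/rubrics.json",
    "tests/test.sh",
)


def missing_files(task_dir):
    """Return the expected relative paths that are not regular files under task_dir."""
    root = Path(task_dir)
    return [rel for rel in EXPECTED_FILES if not (root / rel).is_file()]
```

An empty list means the folder matches the QnA layout described here. Test Writing tasks may use a different layout.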
Task Categories (QnA sample)
- category: “Code Onboarding” — agents answer deep codebase comprehension questions
- Repos covered: Automattic/wp-calypso, among many others
- Agent timeout: 10800s (3 hours) per task
- Verifier timeout: 900s
- Resources: 16 CPUs, 16GB RAM, 20GB storage, no GPU, internet allowed
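The timeouts above give a rough upper bound on wall-clock time for a full QnA run. The sketch below is only illustrative arithmetic: it assumes every one of the 496 QnA tasks uses the sample timeouts shown here, that each task exhausts both timeouts, and that tasks run in fixed-size waves. The concurrency parameter is hypothetical, and image build time is ignored.

```python
import math

AGENT_TIMEOUT_S = 10_800    # agent timeout from the sample task above
VERIFIER_TIMEOUT_S = 900    # verifier timeout from the sample task above
QNA_TASKS = 496             # number of Codebase QnA tasks


def worst_case_hours(num_tasks, concurrency=1):
    """Upper bound in hours if every task uses its full agent and verifier timeout."""
    per_task_s = AGENT_TIMEOUT_S + VERIFIER_TIMEOUT_S
    waves = math.ceil(num_tasks / concurrency)  # simplification: tasks run in waves
    return waves * per_task_s / 3600


print(worst_case_hours(QNA_TASKS))      # 1612.0
print(worst_case_hours(QNA_TASKS, 32))  # 52.0
```

Real runs will usually finish well under this bound, since most tasks are unlikely to hit the timeout.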
Evaluation
- The agent writes its answer to /logs/agent/answer.txt, wrapped in <<FINAL_ANSWER>> tags
- An LLM judge (Claude Opus 4.5) grades the answer against tests/rubrics.json
- Scores are aggregated per leaderboard
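To illustrate the answer format, here is a minimal sketch of how the wrapped answer could be pulled out of the agent's output. It is not the logic in tests/evaluate_answer.py: the helper name extract_final_answer is hypothetical, and it assumes the same <<FINAL_ANSWER>> marker both opens and closes the answer, which this README does not confirm.

```python
# Marker named in the Evaluation list above.
# Assumption: the same marker is used to open and to close the answer.
MARKER = "<<FINAL_ANSWER>>"


def extract_final_answer(text):
    """Return the text between the first two markers, or None if it is not wrapped."""
    parts = text.split(MARKER)
    if len(parts) < 3:
        return None  # fewer than two markers: no wrapped answer found
    return parts[1].strip()


sample = "Notes...\n<<FINAL_ANSWER>>\nThe cache is invalidated in save().\n<<FINAL_ANSWER>>\n"
print(extract_final_answer(sample))  # The cache is invalidated in save().
```

Returning None for unwrapped output makes a formatting failure easy to tell apart from an empty answer.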
Run Command (Test Writing example)
bash run_config/tw/opus-4p6_claude-code.sh
Dataset on Harbor
- QnA dataset: scale-ai/swe-atlas-qna (496 tasks)
- TW dataset: scale-ai/swe-atlas-tw (360 tasks)
- Published via the harbor CLI tool from laude-institute/harbor