Hallway is a continuous improvement engine. Define what “better” means, and it runs experiments, measures results, and ships improvements — automatically.
Powered by Claude · Daytona · Autohuman
Three steps. No configuration files. No CI pipelines to set up.
Type a directive like "a snake game with wrap-around edges" or connect an existing repo and say what to improve.
It creates deterministic tests and quality checks tailored to your goals. Each feature gets its own measurable eval.
Hallway writes code, runs evals, keeps improvements, reverts regressions. The score only goes up.
Every experiment follows the same rigorous cycle.
Creates an isolated git branch for each experiment
Writes targeted changes with an LLM, guided by the directive and eval signals
Runs every eval: build checks, feature tests, LLM judges, human testers
Computes a weighted composite score. Gate evals must pass or it's an instant revert.
Score went up? Merge to main. Score went down? Revert. No regressions, ever.
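The keep-or-revert cycle above can be sketched as a simple loop. This is a simplification, not Hallway's actual API: `propose` and `score_fn` stand in for the LLM change step and the eval run.

```python
def improvement_loop(experiments, baseline=0.0):
    """Run each experiment in turn; the baseline score is monotone:
    a candidate merges only when it strictly beats the current best."""
    merged = []
    for propose, score_fn in experiments:
        candidate = propose()        # LLM writes a targeted change on its own branch
        score = score_fn(candidate)  # run every eval, compute the composite score
        if score > baseline:         # score went up: merge to main
            baseline = score
            merged.append((candidate, score))
        # score went down or tied: revert, baseline untouched
    return baseline, merged
```

Because a candidate only replaces the baseline on a strict improvement, the reported score never regresses.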
Not a toy demo. Hallway runs in sandboxed containers with real git, real tests, and real deployments.
Exit-code checks, grep assertions, build verification. One feature = one eval. No vague LLM-only judging.
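Two of the deterministic check styles named above can be sketched as plain functions returning a pass/fail signal in [0, 1]. The names are illustrative, not Hallway's API.

```python
import re
import subprocess

def exit_code_check(cmd):
    """Return 1.0 iff the command exits 0 (e.g. a build or test runner)."""
    return 1.0 if subprocess.run(cmd, capture_output=True).returncode == 0 else 0.0

def grep_assertion(path, pattern):
    """Return 1.0 iff the file matches the pattern, a cheap probe that a feature exists."""
    with open(path) as f:
        return 1.0 if re.search(pattern, f.read()) else 0.0
```

Both checks are fully deterministic: the same repo state always yields the same score.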
Every experiment runs in an isolated Daytona container. Full Ubuntu, real package managers, real builds.
Real people playtest your app via Instahuman. Their feedback becomes signals the loop optimizes for.
Every project gets a live preview URL. Watch changes appear in real time as experiments land.
Mark any eval as a gate. If the build breaks, the composite score drops to zero. No regressions ship.
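A minimal sketch of that gating rule, assuming each eval result is normalized to [0, 1]. The eval names and weighting scheme here are hypothetical.

```python
def composite_score(results, weights, gates):
    """Weighted average of eval results; any failing gate zeroes the score."""
    if any(results[name] == 0.0 for name in gates):
        return 0.0  # a gate failed (e.g. broken build): instant zero, instant revert
    total = sum(weights.values())
    return sum(weights[name] * results[name] for name in weights) / total
```

With the build eval marked as a gate, a broken build scores 0.0 no matter how well every other eval did, so the experiment can never merge.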
Set a directive, walk away. Come back to a repo full of tested, scored improvements.
Hallway combines automated checks with human judgment to measure what actually matters.
Describe what you want. Hallway takes it from there.