Ship software that improves itself
Hallway is a continuous improvement engine. Define what “better” means, and it runs experiments, measures results, and ships improvements — automatically.
Powered by Claude · Daytona · Instahuman
How it works
Three steps. No configuration files. No CI pipelines to set up.
Describe what you want
Type a directive like “a snake game with wrap-around edges” or connect an existing repo and say what to improve.
Hallway generates evals
It creates deterministic tests and quality checks tailored to your goals. Each feature gets its own measurable eval.
Experiments run automatically
Hallway writes code, runs evals, keeps improvements, and reverts regressions, so the score on main never goes down.
The improvement loop
Every experiment follows the same rigorous cycle.
Branch
Creates an isolated git branch for each experiment
Code
An LLM writes targeted changes based on the directive and eval signals
Eval
Runs every eval: build checks, feature tests, LLM judges, human testers
Score
Computes a weighted composite score. Gate evals must pass or it's an instant revert.
Keep or revert
Score went up? Merge to main. Score went down? Revert. No regressions, ever.
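The cycle above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Hallway's implementation: `Repo`, `run_experiment`, `propose_change`, and `score` are hypothetical names, the repo is modeled as a plain list rather than real git, and treating a tied score as a revert is an assumption the text does not state.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Repo:
    """Hypothetical stand-in for a git repo: main is a list of applied changes."""
    main: List[str] = field(default_factory=list)


def run_experiment(
    repo: Repo,
    propose_change: Callable[[List[str]], str],
    score: Callable[[List[str]], float],
) -> bool:
    """Run one branch -> code -> eval -> score -> keep-or-revert cycle.

    Returns True if the change was merged into main, False if it was reverted.
    """
    baseline = score(repo.main)

    # Branch: work on an isolated copy so main is untouched until a merge.
    branch = list(repo.main)

    # Code: in the real system an LLM writes the change; here it is a callback.
    branch.append(propose_change(branch))

    # Eval + Score: collapse all eval results into one number.
    candidate = score(branch)

    # Keep or revert: merge only on a strict improvement (assumption: ties revert).
    if candidate > baseline:
        repo.main = branch
        return True
    return False  # the branch is simply discarded
```

Because a change reaches `main` only when it strictly improves the score, the score of `main` cannot decrease from one experiment to the next in this sketch.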
Built for real software
Not a toy demo. Hallway runs in sandboxed containers with real git, real tests, and real deployments.
Deterministic evals
Exit-code checks, grep assertions, build verification. One feature = one eval. These checks never depend on a vague LLM judgment.
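As a rough illustration of what an exit-code check and a grep assertion could look like, here is a minimal Python sketch. The function names `exit_code_eval` and `grep_eval` are hypothetical and are not taken from Hallway. The sketch uses only the standard library.

```python
import re
import subprocess
import sys
from pathlib import Path
from typing import List

# Interpreter path, handy for building demo commands that behave the same everywhere.
PYTHON = sys.executable


def exit_code_eval(command: List[str]) -> bool:
    """Pass if the command exits with status 0 (for example, a build or test run)."""
    result = subprocess.run(command, capture_output=True)
    return result.returncode == 0


def grep_eval(path: Path, pattern: str) -> bool:
    """Pass if the regular expression matches anywhere in the file."""
    text = path.read_text(encoding="utf-8")
    return re.search(pattern, text, flags=re.MULTILINE) is not None
```

Both checks return a plain boolean, so the same input always gives the same result, provided the command being run is itself deterministic.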
Sandboxed execution
Every experiment runs in an isolated Daytona container. Full Ubuntu, real package managers, real builds.
Human-in-the-loop
Real people playtest your app via Instahuman. Their feedback becomes signals the loop optimizes for.
Live preview
Every project gets a live preview URL. Watch changes appear in real time as experiments land.
Gate evals
Mark any eval as a gate. If a gate eval fails (for example, the build breaks), the composite score drops to zero and the experiment is reverted. No regressions ship.
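One way to picture a weighted composite score with gates is the sketch below. It is an assumption-laden illustration, not Hallway's actual formula: `EvalResult` and `composite_score` are hypothetical names, scores are assumed to be normalized to the range 0 to 1, the composite is assumed to be a weighted average, and a gate is assumed to pass only at a score of 1.0.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class EvalResult:
    name: str
    score: float         # assumed normalized: 0.0 (fail) to 1.0 (pass)
    weight: float = 1.0  # relative importance in the composite
    gate: bool = False   # a failing gate zeroes the whole composite


def composite_score(results: List[EvalResult]) -> float:
    """Weighted average of eval scores, or 0.0 if any gate eval fails."""
    # Gate check first: one failed gate overrides every other signal.
    if any(r.gate and r.score < 1.0 for r in results):
        return 0.0

    total_weight = sum(r.weight for r in results)
    if total_weight == 0:
        return 0.0  # nothing to average over

    return sum(r.score * r.weight for r in results) / total_weight
```

With a passing build gate (weight 1) and a UI polish score of 0.5 (weight 3), this gives (1.0 + 1.5) / 4 = 0.625. If the build gate fails, the result is 0.0 regardless of the other scores.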
Runs while you sleep
Set a directive, walk away. Come back to a repo full of tested, scored improvements.
Every kind of signal
Hallway combines automated checks with human judgment to measure what actually matters.
Deterministic checks
- Build passes
- Feature exists in code
- Tests pass
- No lint errors
Subjective quality
- UI polish score
- Code readability
- UX coherence
- Design consistency
Real human testers
- Playtest feedback
- Usability rating
- Bug reports
- Feature requests
Performance metrics
- Response time
- Bundle size
- Memory usage
- Throughput
Start building
Describe what you want. Hallway takes it from there.