Now in beta

Ship software that improves itself

Hallway is a continuous improvement engine. Define what “better” means, and it runs experiments, measures results, and ships improvements — automatically.

Powered by Claude · Daytona · Instahuman

How it works

Three steps. No configuration files. No CI pipelines to set up.

01

Describe what you want

Type a directive like “a snake game with wrap-around edges” or connect an existing repo and say what to improve.

02

Hallway generates evals

It creates deterministic tests and quality checks tailored to your goals. Each feature gets its own measurable eval.

03

Experiments run automatically

Hallway writes code, runs evals, keeps improvements, reverts regressions. The score only goes up.

The improvement loop

Every experiment follows the same rigorous cycle.

Branch

Creates an isolated git branch for each experiment

Code

An LLM writes targeted changes based on the directive and eval signals

Eval

Runs every eval: build checks, feature tests, LLM judges, human testers

Score

Computes a weighted composite score. Gate evals must pass or it's an instant revert.

Keep or revert

Score went up? Merge to main. Score went down? Revert. No regressions, ever.
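
As a rough illustration of the cycle above, one experiment might be sketched in Python like this. Everything here is hypothetical: `run_cycle`, `Outcome`, and the callables passed in are stand-ins for illustration, not Hallway's actual API, and the sketch assumes that only a strictly higher score counts as an improvement.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Outcome:
    branch: str
    score: float
    kept: bool


def run_cycle(
    name: str,
    baseline: float,
    write_changes: Callable[[str], None],  # hypothetical: an LLM edits code on the branch
    run_evals: Callable[[str], float],     # hypothetical: returns the composite score
    merge: Callable[[str], None],          # hypothetical: merges the branch to main
    revert: Callable[[str], None],         # hypothetical: discards the branch
) -> Outcome:
    """One experiment: branch, code, eval, score, then keep or revert."""
    # Only the branch name is computed here; creating the actual git branch
    # is left to the injected callables.
    branch = f"experiment/{name}"
    write_changes(branch)
    score = run_evals(branch)
    # Assumption: only a strict improvement over the baseline is merged.
    kept = score > baseline
    (merge if kept else revert)(branch)
    return Outcome(branch, score, kept)
```

Passing the steps in as callables keeps the sketch independent of any particular git or LLM tooling.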

Built for real software

Not a toy demo. Hallway runs in sandboxed containers with real git, real tests, and real deployments.

Deterministic evals

Exit-code checks, grep assertions, build verification. One feature = one eval. Whether a feature works is never left to a vague LLM judge.
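
To make the idea concrete, here is a minimal Python sketch of two deterministic checks, using only the standard library. The function names `exit_code_eval` and `grep_eval` are assumptions made for illustration and may not match how Hallway defines its evals.

```python
import re
import subprocess
import sys
from pathlib import Path


def exit_code_eval(command: list[str], cwd: str = ".") -> bool:
    """Pass when the command exits with status 0."""
    result = subprocess.run(command, cwd=cwd, capture_output=True)
    return result.returncode == 0


def grep_eval(pattern: str, path: str) -> bool:
    """Pass when the regex matches somewhere in the file."""
    text = Path(path).read_text(encoding="utf-8")
    return re.search(pattern, text) is not None


if __name__ == "__main__":
    # A trivial command that always succeeds, standing in for a real build step.
    print(exit_code_eval([sys.executable, "-c", "raise SystemExit(0)"]))
```

Both checks return a plain pass or fail, so the same code state always produces the same result.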

Sandboxed execution

Every experiment runs in an isolated Daytona container. Full Ubuntu, real package managers, real builds.

Human-in-the-loop

Real people playtest your app via Instahuman. Their feedback becomes signals the loop optimizes for.

Live preview

Every project gets a live preview URL. Watch changes appear in real time as experiments land.

Gate evals

Mark any eval as a gate. If a gate eval fails (a broken build, for example), the composite score drops to zero and the experiment is reverted. No regressions ship.
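
One way to picture how gates interact with the weighted composite score is the Python sketch below. The `EvalResult` type, the 0.0 to 1.0 score range, and the rule that a gate fails below 1.0 are all assumptions made for illustration; the actual weighting scheme is not specified here.

```python
from dataclasses import dataclass


@dataclass
class EvalResult:
    name: str
    score: float         # assumed to be normalized to the range 0.0 to 1.0
    weight: float = 1.0
    gate: bool = False


def composite_score(results: list[EvalResult]) -> float:
    """Weighted mean of eval scores; any failed gate forces the result to 0."""
    # Assumption: a gate "fails" when its score is below 1.0.
    if any(r.gate and r.score < 1.0 for r in results):
        return 0.0
    total_weight = sum(r.weight for r in results)
    if total_weight == 0:
        return 0.0
    return sum(r.score * r.weight for r in results) / total_weight


if __name__ == "__main__":
    passing = [
        EvalResult("build", 1.0, gate=True),
        EvalResult("wrap-around edges", 0.5, weight=3.0),
    ]
    print(composite_score(passing))  # (1.0 * 1 + 0.5 * 3) / 4 = 0.625
```

Because a failed gate short-circuits to zero, no combination of high scores elsewhere can outweigh a broken build.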

Runs while you sleep

Set a directive, walk away. Come back to a repo full of tested, scored improvements.

Every kind of signal

Hallway combines automated checks with human judgment to measure what actually matters.

exit_code

Deterministic checks

  • Build passes
  • Feature exists in code
  • Tests pass
  • No lint errors

llm-judge

Subjective quality

  • UI polish score
  • Code readability
  • UX coherence
  • Design consistency

human-eval

Real human testers

  • Playtest feedback
  • Usability rating
  • Bug reports
  • Feature requests

benchmark

Performance metrics

  • Response time
  • Bundle size
  • Memory usage
  • Throughput

Start building

Describe what you want. Hallway takes it from there.