VeriEnv: Safe and Scalable Web Agent Learning via Recreated Websites

Abstract

Training autonomous web agents is fundamentally limited by the environments they learn from: real-world websites are unsafe to explore, hard to reset, and rarely provide verifiable feedback. We propose VeriEnv, a framework that treats language models as environment creators, automatically cloning real-world websites into fully executable, verifiable synthetic environments. By exposing controlled internal access via a Python SDK, VeriEnv enables agents to self-generate tasks with deterministic, programmatically verifiable rewards — eliminating reliance on heuristic or LLM-based judges. This design decouples agent learning from unsafe real-world interaction while enabling scalable self-evolution through environment expansion. Through experiments on web agent benchmarks, we show that agents trained with VeriEnv generalize to unseen websites, achieve site-specific mastery through self-evolving training, and benefit from scaling the number of training environments.

Why Verifiable Environments?

Existing self-evolving web agents face two compounding problems: they explore live websites that are unsafe and cannot be reset, and they are rewarded by unreliable LLM-as-a-judge evaluation. This feedback loop compounds errors and produces agents that overfit to the judge's biases rather than the true task. VeriEnv breaks the loop by giving agents a controlled, reproducible training ground with deterministic rewards.

**Figure 1.** Comparison between traditional self-evolution and VeriEnv's verifiable environment framework. Traditional self-evolution (left) relies on unsafe exploration of live websites with unreliable LLM-based rewards. VeriEnv (right) trains agents in cloned environments where every task comes with an executable validator.

**Table 1.** Comparison with existing web agent datasets and benchmarks. VeriEnv uniquely enables verifiable evaluation *and* scalable task generation through executable synthetic environments.
Dataset / Benchmark	# Websites	# Tasks	Browser Interaction	Verifiable Judge	Scalable Task Gen.
WebArena	5	812	✓	✓	✗
WorkArena	1	33	✓	✓	✗
WebVoyager	15	643	✓	✗	✗
Mind2Web	137	2,350	✗	✓	✗
Mind2Web-Online	136	300	✓	✗	✗
Mind2Web-Live	137	542	✓	✗	✗
VeriEnv (Ours)	149	7,400	✓ w/ synthetic websites	✓	✓

The VeriEnv Pipeline

VeriEnv automates three stages: (1) a coding agent clones the target website end-to-end (frontend, backend, and database) from screenshots; (2) bug-report passes stabilize the clone; and (3) task instructions are generated alongside executable Python SDK validation programs, so every task ships with a programmatic judge.

**Figure 2.** Overview of the VeriEnv framework: website cloning, task generation, and agent training. The coding agent produces the environment *and* the verifiable tasks that run against it.

A verifiable task

Each generated task is paired with a ground-truth Python SDK call whose return value is the reference answer. At evaluation time, the agent's answer is checked against this reference with deterministic operators (exact_match, must_include, fuzzy_match, etc.).
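The comparison step can be pictured with a small sketch. The source names the operators but does not say how they are implemented, so everything below is an assumption made for illustration: the normalization rule, the function signatures, and in particular the use of `difflib` string similarity with a 0.8 threshold for `fuzzy_match`. The sample answers and references are invented.

```python
from difflib import SequenceMatcher
from typing import Iterable


def _norm(text: str) -> str:
    # Assumed normalization: lowercase and collapse runs of whitespace.
    return " ".join(text.lower().split())


def exact_match(answer: str, reference: str) -> bool:
    """True when the normalized answer equals the normalized reference."""
    return _norm(answer) == _norm(reference)


def must_include(answer: str, required: Iterable[str]) -> bool:
    """True when every required phrase appears in the normalized answer."""
    normalized = _norm(answer)
    return all(_norm(phrase) in normalized for phrase in required)


def fuzzy_match(answer: str, reference: str, threshold: float = 0.8) -> bool:
    """True when string similarity reaches the (assumed) threshold."""
    ratio = SequenceMatcher(None, _norm(answer), _norm(reference)).ratio()
    return ratio >= threshold


# Hypothetical agent answers checked against hypothetical reference values.
checks = [
    exact_match("  42 Items", "42 items"),
    must_include("The cheapest flight is $120 to Boston", ["$120", "boston"]),
    fuzzy_match("Pittsburgh, PA", "Pittsburgh PA"),
    fuzzy_match("Boston", "Chicago"),
]
print(checks)
```

Each operator is a pure function of the answer and the reference, so running the same check twice always yields the same verdict, which is the property the text calls deterministic.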

Site-specific self-evolution

**Figure 4.** Site-specific self-evolving training within the cloned environment. Because cloned environments are resettable and verifiable, an agent can iterate thousands of trajectories on a single site and train on its own successes.
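One way to picture this loop is as rejection sampling: roll out the agent, score each attempt with the task's validator, and keep only the verified successes as training data. The sketch below is a toy illustration under that assumption. `Task`, `Trajectory`, `collect_successes`, and `stub_policy` are hypothetical names rather than part of VeriEnv's SDK, the tasks are invented, and the source does not specify the actual training algorithm.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Task:
    instruction: str
    reference: str  # hypothetical: value returned by the ground-truth SDK call


@dataclass(frozen=True)
class Trajectory:
    task: Task
    attempt: int
    answer: str


def collect_successes(
    tasks: List[Task],
    policy: Callable[[Task, int], str],
    validate: Callable[[str, str], bool],
    rollouts_per_task: int = 4,
) -> List[Trajectory]:
    """Keep only rollouts whose answer passes the programmatic validator."""
    kept: List[Trajectory] = []
    for task in tasks:
        for attempt in range(rollouts_per_task):
            # A real loop would reset the cloned site to a clean state here.
            answer = policy(task, attempt)
            if validate(answer, task.reference):
                kept.append(Trajectory(task, attempt, answer))
    return kept


def stub_policy(task: Task, attempt: int) -> str:
    # Stand-in for an agent: answers correctly only on odd-numbered attempts.
    return task.reference if attempt % 2 == 1 else "unknown"


# Invented tasks for illustration only.
tasks = [
    Task("How many items are in the cart?", "3"),
    Task("What is the name on the account?", "Alice"),
]
successes = collect_successes(
    tasks, stub_policy, lambda answer, ref: answer.strip() == ref.strip()
)
print(len(successes))
```

The kept trajectories would then serve as the fine-tuning set for the next round; because the filter is a program rather than an LLM judge, a failed attempt cannot be admitted by a lenient grader.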

Cloned Websites

VeriEnv clones websites spanning a wide range of domains — travel, commerce, education, healthcare, entertainment, government, and more. The five examples below are highlighted from our full set, with each pair showing the original website that our coding agent was given as input next to the VeriEnv clone it produced.

**Table 2.** Statistics of constructed synthetic environments and generated tasks.
Number of websites	149
Number of tasks per website	49.5
Total number of tasks	7,400
Easy tasks	2,972 (40.2%)
Medium tasks	2,900 (39.2%)
Hard tasks	1,528 (20.6%)

**Table 3.** Human evaluation of the generated websites and tasks.
Environment quality
Functional correctness (avg.)	90%
Signup	94%
Login	95%
Search	81%
Filter	88%
Navigation	100%
Forms	100%
Visual rating (Likert 1–5)	4.7
Task validity
Task executability	90%
Judge correctness	76%

Sample Generated Tasks

Tasks are generated automatically and filtered by executable validation. Five representative tasks are shown per site, mixing information-seeking queries with authenticated multi-step actions.

Benchmark Results

Agents trained on VeriEnv-cloned environments transfer to standard web-agent benchmarks. Tables 4 and 5 show consistent gains in the Total column over backbone models and prior self-evolution baselines (Synatra, ADP), evaluated on WebArena-Lite (5 real websites) and Mind2Web-Online (held-out tasks across difficulty levels). VeriEnv delivers the largest absolute improvement in Total on both benchmarks for both backbones tested, although individual categories vary: for example, Qwen3-4B drops from 11.63 to 6.98 on Hard tasks in Table 5.

**Table 4.** WebArena-Lite evaluation across five websites. Methods marked with ^† use numbers reported by Chae et al. (2025).
Method	Shopping	CMS	Reddit	GitLab	Map	Total	Δ
Closed-source baselines
GPT-4o-mini^†	21.74	22.86	19.05	34.38	19.35	23.64	—
GPT-4o^†	23.91	31.43	28.57	56.25	19.35	31.52	—
Backbone: Qwen3-4B
Qwen3-4B	3.77	6.67	4.17	13.89	14.29	7.88	—
+Synatra	0.00	0.00	12.50	8.33	0.00	3.64	−4.24
+ADP	4.35	5.71	9.52	3.13	9.68	6.06	−1.82
+VeriEnv (Ours)	4.35	20.00	23.81	12.50	16.13	13.94	+6.06
Backbone: LLaMA-3.2-3B-Instruct
LLaMA-3.2-3B-Instruct	0.00	2.86	9.52	3.13	3.23	3.03	—
+Synatra	2.17	2.86	14.29	9.38	6.45	6.06	+3.03
+ADP	4.35	11.43	14.29	12.50	6.45	9.09	+6.06
+VeriEnv (Ours)	4.35	17.14	19.05	15.63	12.90	12.73	+9.70
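
The Δ column is not defined in the text, but the numbers are consistent with reading it as each method's Total minus the Total of its untrained backbone. A short check under that assumption, using the Total values from Table 4 (the dictionary layout and the `delta` helper are illustrative only):

```python
# Total column from Table 4, grouped by backbone.
totals = {
    "Qwen3-4B": {"backbone": 7.88, "+Synatra": 3.64, "+ADP": 6.06, "+VeriEnv": 13.94},
    "LLaMA-3.2-3B-Instruct": {"backbone": 3.03, "+Synatra": 6.06, "+ADP": 9.09, "+VeriEnv": 12.73},
}


def delta(method_total: float, backbone_total: float) -> float:
    # Assumed definition: absolute difference from the backbone, to two decimals.
    return round(method_total - backbone_total, 2)


deltas = {
    backbone: {
        method: delta(total, rows["backbone"])
        for method, total in rows.items()
        if method != "backbone"
    }
    for backbone, rows in totals.items()
}
print(deltas)
```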

**Table 5.** Mind2Web-Online results across difficulty levels. Methods marked with ^† use numbers reported by Xue et al. (2025).
Method	Easy	Medium	Hard	Total	Δ
Closed-source baselines
Browser-Use-GPT-4o^†	55.40	26.60	8.10	30.00	—
Claude-3.5-Sonnet^†	56.60	26.60	6.80	28.80	—
Backbone: Qwen3-4B
Qwen3-4B	26.32	9.41	11.63	13.18	—
+Synatra	35.09	5.88	9.30	14.55	+1.37
+ADP	26.32	7.06	6.98	11.36	−1.82
+VeriEnv (Ours)	29.82	23.53	6.98	20.45	+7.27
Backbone: LLaMA-3.2-3B-Instruct
LLaMA-3.2-3B-Instruct	19.30	12.94	0.00	11.36	—
+Synatra	24.56	15.29	6.98	14.55	+3.19
+ADP	42.11	24.71	11.63	24.09	+12.73
+VeriEnv (Ours)	40.35	29.41	13.95	24.55	+13.19

Scaling & Reliability

Increasing the number of cloned training environments consistently improves transfer to held-out websites — VeriEnv turns the cost of environment construction into a controllable axis of scale.

**Figure 5.** Effect of scaling the number of websites. Agent performance on unseen sites improves as we scale the number of VeriEnv training environments.

Compared to PAE-style LLM-as-a-judge pipelines, VeriEnv produces substantially less ambiguous tasks and more reliable evaluations.

**Figure 6.** Comparison of task ambiguity and evaluation reliability (PAE vs. VeriEnv). Executable validation sharply reduces task ambiguity and evaluation noise relative to LLM-judge baselines.

What Fails, and How Long Does It Take?

**Figure 7.** Primary failure reasons and bug types observed during cloning. Most failures during cloning fall into a small set of bug categories, which the bug-report pass is designed to address.

**Figure 8.** Distribution of website implementation time. Per-site implementation cost is bounded and predictable, making large-scale environment construction practical.

Citation

If you build on this work, please cite:

@article{chae2026verienv,
  title   = {Safe and Scalable Web Agent Learning via Recreated Websites},
  author  = {Chae, Hyungjoo and Park, Jungsoo and Ritter, Alan},
  journal = {arXiv preprint arXiv:2603.10505},
  year    = {2026}
}