Safe and Scalable Web Agent Learning via Recreated Websites

Georgia Institute of Technology
Preprint  ·  2026

VeriEnv clones diverse real-world websites — e-commerce, education, real estate, professional networking, information portals, and more — into fully executable, verifiable synthetic environments.

Abstract

Training autonomous web agents is fundamentally limited by the environments they learn from: real-world websites are unsafe to explore, hard to reset, and rarely provide verifiable feedback. We propose VeriEnv, a framework that treats language models as environment creators, automatically cloning real-world websites into fully executable, verifiable synthetic environments. By exposing controlled internal access via a Python SDK, VeriEnv enables agents to self-generate tasks with deterministic, programmatically verifiable rewards — eliminating reliance on heuristic or LLM-based judges. This design decouples agent learning from unsafe real-world interaction while enabling scalable self-evolution through environment expansion. Through experiments on web agent benchmarks, we show that agents trained with VeriEnv generalize to unseen websites, achieve site-specific mastery through self-evolving training, and benefit from scaling the number of training environments.

Why Verifiable Environments?

Existing self-evolving web agents face a fundamental tension: they learn from unsafe, unresettable real websites using unreliable LLM-as-a-judge evaluation. This feedback loop compounds errors and produces agents that overfit to the judge's biases rather than the true task. VeriEnv breaks the loop by giving agents a controlled, reproducible training ground with deterministic rewards.

Figure 1. Traditional self-evolution (left) relies on unsafe exploration of live websites with unreliable LLM-based rewards. VeriEnv (right) trains agents in cloned environments where every task comes with an executable validator.

Table 1. Comparison with existing web agent datasets and benchmarks. VeriEnv uniquely enables verifiable evaluation and scalable task generation through executable synthetic environments.
Dataset / Benchmark   # Websites   # Tasks   Browser Interaction     Verifiable Judge   Scalable Task Gen.
WebArena              5            812
WorkArena             1            33
WebVoyager            15           643
Mind2Web              137          2,350
Mind2Web-Online       136          300
Mind2Web-Live         137          542
VeriEnv (Ours)        149          7,400     w/ synthetic websites

The VeriEnv Pipeline

VeriEnv automates three stages: (1) a coding agent clones the target website end-to-end (frontend, backend, and database) from screenshots; (2) bug-report passes stabilize the clone; and (3) task instructions are generated alongside executable Python SDK validation programs, so every task ships with a programmatic judge.

Figure 2. Overview of the VeriEnv framework. The coding agent produces the environment and the verifiable tasks that run against it.

A verifiable task

Each generated task is paired with a ground-truth Python SDK call whose return value is the reference answer. At evaluation time, the agent's answer is checked against this reference with deterministic operators (exact_match, must_include, fuzzy_match, etc.).

Figure 3. A task is valid only if its SDK call produces a well-defined result. Tasks that would require an LLM to judge are automatically filtered out.
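The pairing of a task with a ground-truth call and a deterministic operator can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the `VerifiableTask` class, the `catalog` dictionary standing in for a cloned site's database, and the operator implementations (in particular, `fuzzy_match` is approximated with a string-similarity threshold, which the source does not specify). It is not the VeriEnv SDK.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import Callable, Dict, Optional


def exact_match(answer: str, reference: str) -> bool:
    # Case- and whitespace-insensitive equality.
    return answer.strip().lower() == reference.strip().lower()


def must_include(answer: str, reference: str) -> bool:
    # The reference must appear somewhere inside the answer.
    return reference.strip().lower() in answer.lower()


def fuzzy_match(answer: str, reference: str, threshold: float = 0.8) -> bool:
    # Assumption: similarity-ratio threshold; the real operator may differ.
    a, b = answer.strip().lower(), reference.strip().lower()
    return SequenceMatcher(None, a, b).ratio() >= threshold


OPERATORS: Dict[str, Callable[[str, str], bool]] = {
    "exact_match": exact_match,
    "must_include": must_include,
    "fuzzy_match": fuzzy_match,
}


@dataclass
class VerifiableTask:
    instruction: str
    sdk_call: Callable[[], Optional[str]]  # hypothetical ground-truth call
    operator: str

    def reference(self) -> Optional[str]:
        # Run the ground-truth call; any failure or empty result means
        # the task has no well-defined reference answer.
        try:
            result = self.sdk_call()
        except Exception:
            return None
        if result is None or str(result).strip() == "":
            return None
        return str(result)

    def is_valid(self) -> bool:
        # Keep only tasks with a known operator and a well-defined reference.
        return self.operator in OPERATORS and self.reference() is not None

    def verify(self, agent_answer: str) -> bool:
        reference = self.reference()
        if reference is None or self.operator not in OPERATORS:
            return False
        return OPERATORS[self.operator](agent_answer, reference)


# Hypothetical data standing in for a cloned site's database.
catalog = {"Trail Runner 2": "89.99"}

task = VerifiableTask(
    instruction="What is the price of Trail Runner 2?",
    sdk_call=lambda: catalog.get("Trail Runner 2"),
    operator="must_include",
)
broken = VerifiableTask(
    instruction="What is the price of the Missing Item?",
    sdk_call=lambda: catalog.get("Missing Item"),
    operator="exact_match",
)
```

In this sketch `task.verify("The price is $89.99")` passes because the reference string is contained in the answer, while `broken.is_valid()` is false because its ground-truth call returns nothing, so a filter built on `is_valid()` would drop it.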

Site-specific self-evolution

Figure 4. Because cloned environments are resettable and verifiable, an agent can run thousands of trajectories on a single site and train on its own successful ones.
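One way to picture this loop is as reset, roll out, verify, and keep only the verified successes. The sketch below assumes a toy environment and a toy policy: `ToyEnv`, `collect_successes`, and the `add:<item>` action format are hypothetical names invented for illustration, and the actual training procedure may differ.

```python
import random
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class ToyEnv:
    """Hypothetical stand-in for a cloned site with a resettable state."""
    initial_cart: Tuple[str, ...] = ()
    cart: List[str] = field(default_factory=list)

    def reset(self) -> None:
        # Restore the initial state so every rollout starts identically.
        self.cart = list(self.initial_cart)

    def step(self, action: str) -> None:
        # Only one toy action type: "add:<item>" puts an item in the cart.
        if action.startswith("add:"):
            self.cart.append(action[len("add:"):])


def collect_successes(
    env: ToyEnv,
    policy: Callable[[ToyEnv], str],
    validator: Callable[[ToyEnv], bool],
    num_rollouts: int,
    max_steps: int = 3,
) -> List[List[str]]:
    """Roll out the policy repeatedly and keep trajectories the validator accepts."""
    successes: List[List[str]] = []
    for _ in range(num_rollouts):
        env.reset()
        trajectory: List[str] = []
        for _ in range(max_steps):
            action = policy(env)
            env.step(action)
            trajectory.append(action)
        # Deterministic check on the final environment state, no LLM judge.
        if validator(env):
            successes.append(trajectory)
    return successes


rng = random.Random(0)


def random_policy(env: ToyEnv) -> str:
    # Placeholder for an agent; picks one of three toy actions at random.
    return rng.choice(["add:mug", "add:pen", "add:book"])


def cart_has_mug(env: ToyEnv) -> bool:
    # Toy validator for the task "add a mug to the cart".
    return "mug" in env.cart


training_data = collect_successes(ToyEnv(), random_policy, cart_has_mug, num_rollouts=100)
```

The trajectories in `training_data` would then serve as supervised examples for the next round of training. The key property the sketch relies on is that `reset()` makes each rollout independent and the validator needs only the environment state.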

Cloned Websites

VeriEnv clones websites spanning a wide range of domains — travel, commerce, education, healthcare, entertainment, government, and more. The five examples below are highlighted from our full set, with each pair showing the original website that our coding agent was given as input next to the VeriEnv clone it produced.

Table 2. Statistics of constructed synthetic environments and generated tasks.
Number of websites             149
Number of tasks per website    49.5
Total number of tasks          7,400
Easy tasks                     2,972 (40.2%)
Medium tasks                   2,900 (39.2%)
Hard tasks                     1,528 (20.6%)

Table 3. Human evaluation of the generated websites and tasks.
Environment quality
  Functional correctness (avg.)    90%
    Signup                         94%
    Login                          95%
    Search                         81%
    Filter                         88%
    Navigation                     100%
    Forms                          100%
  Visual rating (Likert 1–5)       4.7
Task validity
  Task executability               90%
  Judge correctness                76%

Sample Generated Tasks

Tasks are generated automatically and filtered by executable validation. Each site is represented by five sample tasks that mix information-seeking queries with authenticated multi-step actions.


Benchmark Results

Agents trained on VeriEnv-cloned environments transfer to standard web-agent benchmarks. Tables 4 and 5 show consistent gains over backbone models and prior self-evolution baselines (Synatra, ADP), evaluated on WebArena-Lite (5 real websites) and Mind2Web-Online (held-out tasks across difficulty levels). VeriEnv delivers the largest absolute improvement on both benchmarks for both backbones tested.

Table 4. WebArena-Lite evaluation across five websites. Marked methods use numbers reported by Chae et al. (2025).
Method                    Shopping  CMS       Reddit    GitLab    Map       Total     Δ
Closed-source baselines
GPT-4o-mini               21.74     22.86     19.05     34.38     19.35     23.64
GPT-4o                    23.91     31.43     28.57     56.25     19.35     31.52
Backbone: Qwen3-4B
Qwen3-4B                  3.77      6.67      4.17      13.89     14.29     7.88
+Synatra                  0.00      0.00      12.50     8.33      0.00      3.64      −4.24
+ADP                      4.35      5.71      9.52      3.13      9.68      6.06      −1.82
+VeriEnv (Ours)           4.35      20.00     23.81     12.50     16.13     13.94     +6.06
Backbone: LLaMA-3.2-3B-Instruct
LLaMA-3.2-3B-Instruct     0.00      2.86      9.52      3.13      3.23      3.03
+Synatra                  2.17      2.86      14.29     9.38      6.45      6.06      +3.03
+ADP                      4.35      11.43     14.29     12.50     6.45      9.09      +6.06
+VeriEnv (Ours)           4.35      17.14     19.05     15.63     12.90     12.73     +9.70

Table 5. Mind2Web-Online results across difficulty levels. Marked methods use numbers reported by Xue et al. (2025).
Method                    Easy      Medium    Hard      Total     Δ
Closed-source baselines
Browser-Use-GPT-4o        55.40     26.60     8.10      30.00
Claude-3.5-Sonnet         56.60     26.60     6.80      28.80
Backbone: Qwen3-4B
Qwen3-4B                  26.32     9.41      11.63     13.18
+Synatra                  35.09     5.88      9.30      14.55     +1.37
+ADP                      26.32     7.06      6.98      11.36     −1.82
+VeriEnv (Ours)           29.82     23.53     6.98      20.45     +7.27
Backbone: LLaMA-3.2-3B-Instruct
LLaMA-3.2-3B-Instruct     19.30     12.94     0.00      11.36
+Synatra                  24.56     15.29     6.98      14.55     +3.19
+ADP                      42.11     24.71     11.63     24.09     +12.73
+VeriEnv (Ours)           40.35     29.41     13.95     24.55     +13.19
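
The Δ column is not defined explicitly, but the reported values are consistent with reading it as each method's Total minus the Total of its untrained backbone. The short sketch below recomputes Δ for Table 5 under that assumption; the dictionary layout and the `delta` helper are illustrative names, and the numbers are copied from the table.

```python
def delta(method_total: float, backbone_total: float) -> float:
    # Assumed definition: absolute change in Total relative to the backbone.
    return round(method_total - backbone_total, 2)


# Total column of Table 5 (Mind2Web-Online), grouped by backbone.
MIND2WEB_ONLINE_TOTALS = {
    "Qwen3-4B": {
        "backbone": 13.18,
        "+Synatra": 14.55,
        "+ADP": 11.36,
        "+VeriEnv": 20.45,
    },
    "LLaMA-3.2-3B-Instruct": {
        "backbone": 11.36,
        "+Synatra": 14.55,
        "+ADP": 24.09,
        "+VeriEnv": 24.55,
    },
}

deltas = {
    backbone: {
        method: delta(total, totals["backbone"])
        for method, total in totals.items()
        if method != "backbone"
    }
    for backbone, totals in MIND2WEB_ONLINE_TOTALS.items()
}

for backbone, per_method in deltas.items():
    for method, value in per_method.items():
        print(f"{backbone} {method}: {value:+.2f}")
```

Under this reading the recomputed values agree with the Δ column, for example +7.27 for Qwen3-4B with VeriEnv and +13.19 for LLaMA-3.2-3B-Instruct with VeriEnv.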

Scaling & Reliability

Increasing the number of cloned training environments consistently improves transfer to held-out websites — VeriEnv turns the cost of environment construction into a controllable axis of scale.

Figure 5. Agent performance on unseen sites improves as we scale the number of VeriEnv training environments.

Compared to PAE-style LLM-as-a-judge pipelines, VeriEnv produces substantially less ambiguous tasks and more reliable evaluations.

Figure 6. Executable validation sharply reduces task ambiguity and evaluation noise relative to LLM-judge baselines.

What Fails, and How Long Does It Take?

Figure 7. Most failures during cloning fall into a small set of bug categories, which the bug-report pass is designed to address.

Figure 8. Per-site implementation cost is bounded and predictable, making large-scale environment construction practical.

Citation

If you build on this work, please cite:

@article{chae2026verienv,
  title   = {Safe and Scalable Web Agent Learning via Recreated Websites},
  author  = {Chae, Hyungjoo and Park, Jungsoo and Ritter, Alan},
  journal = {arXiv preprint arXiv:2603.10505},
  year    = {2026}
}