This post summarizes four runs of the same task (search → first product → add to cart → checkout on Amazon). The key comparison is Demo 0 (cloud baseline) vs Demo 3 (local autonomy); Demos 1–2 are intermediate controls.
More technical detail (architecture, code excerpts, additional log snippets):
https://www.sentienceapi.com/blog/verification-layer-amazon-...
Demo 0 vs Demo 3:
Demo 0 (cloud, GLM‑4.6 + structured snapshots)
  success: 1/1 run
  tokens: 19,956 (~43% reduction vs ~35k estimate)
  time: ~60,000 ms
  cost: cloud API (varies)
  vision: not required

Demo 3 (local, DeepSeek R1 planner + Qwen ~3B executor)
  success: 7/7 steps (re-run)
  tokens: 11,114
  time: 405,740 ms
  cost: $0.00 incremental (local inference)
  vision: not required
Latency note: the local stack is slower end-to-end here largely because inference runs on local hardware (Mac Studio with M4); the cloud baseline benefits from hosted inference, but has per-token API cost.
Architecture
This worked because we changed the control plane and added a verification loop.
1) Constrain what the model sees (DOM pruning). We don’t feed the entire DOM or screenshots. We collect raw elements, then run a WASM pass to produce a compact “semantic snapshot” (roles/text/geometry) and prune the rest (often on the order of ~95% of nodes).
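As a rough illustration of the pruning pass (a Python sketch; the real pass runs in WASM, and the field names here are ours, not the runtime's):

    from dataclasses import dataclass

    # Roles kept in the compact snapshot; everything else is pruned.
    SEMANTIC_ROLES = {"button", "link", "textbox", "combobox", "checkbox", "heading"}

    @dataclass
    class SnapshotNode:
        id: int      # stable id the executor can target, e.g. CLICK(1022)
        role: str    # accessibility role
        text: str    # trimmed visible text
        bbox: tuple  # (x, y, width, height) geometry

    def prune(raw_elements: list[dict]) -> list[SnapshotNode]:
        """Reduce a raw element dump to a compact semantic snapshot.
        In practice this discards the large majority (~95%) of nodes."""
        snapshot = []
        for el in raw_elements:
            if el.get("role") not in SEMANTIC_ROLES:
                continue  # non-semantic node: prune
            x, y, w, h = el.get("bbox", (0, 0, 0, 0))
            if w <= 0 or h <= 0:
                continue  # zero-area node: not actionable
            snapshot.append(SnapshotNode(
                id=el["id"],
                role=el["role"],
                text=(el.get("text") or "").strip()[:80],  # cap text to save tokens
                bbox=(x, y, w, h),
            ))
        return snapshot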
2) Split reasoning from acting (planner vs executor).
Planner (reasoning): DeepSeek R1 (local) generates step intent + what must be true afterward.
Executor (action): Qwen ~3B (local) selects concrete DOM actions like CLICK(id) / TYPE(text).

3) Gate every step with Jest‑style verification. After each action, we assert state changes (URL changed, element exists/doesn’t exist, modal/drawer appeared). If a required assertion fails, the step fails with artifacts and bounded retries.
Minimal shape:
    ok = await runtime.check(
        exists("role=textbox"),
        label="search_box_visible",
        required=True,
    ).eventually(timeout_s=10.0, poll_s=0.25, max_snapshot_attempts=3)
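To make step 2 concrete as well, here is how one planner -> executor -> verifier iteration might be wired on top of that gate (a sketch; Step, StepFailed, planner, executor, and runtime.snapshot/apply are illustrative stand-ins, not the actual interfaces):

    from dataclasses import dataclass

    @dataclass
    class Step:
        intent: str           # e.g. "open first search result"
        postconditions: list  # predicate objects, e.g. exists("role=heading")
        action: str = ""      # filled in by the executor, e.g. "CLICK(1022)"

    class StepFailed(Exception):
        """Raised when a required post-step assertion never passes."""

    async def run_step(runtime, planner, executor, goal: str) -> Step:
        snapshot = await runtime.snapshot()               # pruned semantic snapshot
        step = await planner.plan(goal, snapshot)         # R1: intent + postconditions
        step.action = await executor.act(step, snapshot)  # Qwen: one bounded action
        await runtime.apply(step.action)

        # Gate: every postcondition must eventually hold, or the run stops.
        for cond in step.postconditions:
            ok = await runtime.check(cond, label=step.intent, required=True) \
                              .eventually(timeout_s=10.0, poll_s=0.25)
            if not ok:
                raise StepFailed(f"postcondition failed after: {step.intent}")
        return step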
What changed between “agents that look smart” and agents that work

Two examples from the logs:
Deterministic override to enforce “first result” intent: “Executor decision … [override] first_product_link -> CLICK(1022)”
Drawer handling that verifies and forces the correct branch: “result: PASS | add_to_cart_verified_after_drawer”
The important point is that these are not post‑hoc analytics. They are inline gates: the system either proves it made progress or it stops and recovers.
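Mechanically, a gate like .eventually(...) reduces to a bounded poll over fresh snapshots. A minimal sketch of those semantics (not the actual implementation):

    import asyncio
    import time

    async def eventually(predicate, take_snapshot, *, timeout_s=10.0,
                         poll_s=0.25, max_snapshot_attempts=3) -> bool:
        """Re-snapshot and re-check until the assertion passes or the
        budget runs out: True = progress proven, False = stop and recover."""
        deadline = time.monotonic() + timeout_s
        attempts = 0
        while time.monotonic() < deadline and attempts < max_snapshot_attempts:
            attempts += 1
            snapshot = await take_snapshot()  # fresh pruned snapshot
            if predicate(snapshot):           # e.g. exists("role=textbox")
                return True
            await asyncio.sleep(poll_s)
        return False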
Takeaway

If you’re trying to make browser agents reliable, the highest‑leverage move isn’t a bigger model. It’s constraining the state space and making success/failure explicit with per-step assertions.
Reliability in agents comes from verification (assertions on structured snapshots), not just scaling model size.
That's a prerequisite for Show HN.
I'm removing the Show HN prefix for now, until we get clarity. Then we can consider re-upping the post once we know exactly how to present it.
> How is this different from building a scraper script that does it traditionally?
A traditional scraper/script hard-codes selectors and control flow up front. When the layout changes, it usually breaks at an arbitrary line and you debug it manually.
In this setup, the agent chooses actions at *runtime* from a bounded action space, and the system verifies outcomes with built-in predicates (e.g. url_changes, drawer_appeared). When it fails, it fails at a specific semantic assertion with artifacts, not at a missing selector.
So it’s less “replace scripts” and more “apply test-style verification and recovery to AI-driven decisions instead of static code.”
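For a flavor of what those predicates could look like, assuming a snapshot object that exposes url and nodes (a sketch, not the runtime's actual API):

    def url_changes(before_url: str):
        """Predicate factory: passes once the page URL differs from before_url."""
        def check(snapshot) -> bool:
            return snapshot.url != before_url
        return check

    def drawer_appeared(role: str = "dialog"):
        """Predicate factory: passes once a drawer/modal node shows up
        in the pruned snapshot."""
        def check(snapshot) -> bool:
            return any(node.role == role for node in snapshot.nodes)
        return check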
> Show HN is for something you've made that other people can play with.
> Off topic: blog posts, sign-up pages, newsletters, lists, and other reading material. Those can't be tried out, so can't be Show HNs. Make a regular submission instead.
We used Amazon because it’s a dynamic, hostile UI, but the design goal is a site-agnostic control plane. That’s why the runtime avoids selectors and screenshots and instead operates on pruned semantic snapshots + verification gates.
If the layout changes, the system doesn’t “half-work” — it fails deterministically with artifacts. That’s the behavior we’re optimizing for.
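“Fails with artifacts” concretely means persisting the failing assertion label, the snapshot, and the action history before stopping. A minimal sketch (the real artifact format is the runtime's own):

    import json
    import pathlib
    import time

    def fail_with_artifacts(label: str, snapshot: dict, action_log: list,
                            out_dir: str = "artifacts"):
        """Dump what's needed to debug a failed required assertion,
        then raise so the run stops deterministically."""
        path = pathlib.Path(out_dir) / f"{int(time.time())}_{label}.json"
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(json.dumps({
            "failed_assertion": label,
            "snapshot": snapshot,
            "actions": action_log,
        }, indent=2))
        raise RuntimeError(f"step failed: {label} (artifacts at {path})")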