CAV-RFC-001: Specification

The formal Core Agent Vitals specification — definitions, deterministic methodology, thresholds, and reference agent profiles. The canonical spec, hosted on-site.

CAV-RFC-001: Core Agent Vitals

Status: Draft · Version: 0.1.0 · Author: Rafael Lopes

§1 Abstract

Core Agent Vitals (CAV) is a measurable, gated framework for the experience an AI agent has consuming a web page — the measurement standard for AI-agent web consumption. It defines three gated pillars (CRR, SSD, ARR), two supporting signals (TC, TTFUT), and one north-star outcome metric (AF), each with a formula, a deterministic measurement methodology, reference agent profiles, thresholds, and a CI gate. v0.1 thresholds are seeds to baseline-and-tighten.

§2 Problem Statement

  • The agent-web gap. A growing share of page “visits” are LLMs/agents extracting and recombining content to answer a query the user never clicks through on. They don’t execute JS well, don’t see images, and operate under a token budget. Nothing measures this.
  • Why existing metrics don’t cover it. Today’s page metrics measure a browser paint and human interaction that, on the agent path, never happen. A heavily client-rendered page can look perfect to a person while scoring near-zero for agents.
  • Revenue at risk. If a shopping/answer agent can’t parse your price, availability, or actions, it can’t cite, recommend, or transact — you’re the slow/illegible supplier dropped from its context.

§2.3 Why CAV Matters (Forcing Functions)

Human-performance metrics got adopted because Search made them a ranking input — one centralized lever. CAV has no equivalent single lever; its forcing functions are emerging and distributed, and this section states them as hypotheses, not as a measured market. Three channels:

  • (a) Agent commerce. ChatGPT Shopping, Perplexity, and AI Overviews with product cards consume pages to assemble a buying answer. A page with low CRR (content lost behind JS) or low SSD (no entity scaffolding) is unreadable to the buying agent — it can’t extract a price/availability it never recovered, so the page can’t be surfaced or transacted against. The selection happens upstream of any human click.
  • (b) AI-Overview citation. Extractable, structured content is what gets quoted and linked in AI summaries. CRR/SSD are direct proxies for “can this be lifted into an answer at all.” Illegible pages are silently omitted.
  • (c) Answer-fidelity risk (the asymmetric one). A human who hits a bad page bounces — a lost session. An agent that hits a bad page produces a wrong answer about your business: a stale price, a hallucinated “in stock,” an action it claims exists. That is a reputational/correctness cost a bounce never carries, and it scales with every query that touches the page. This is why AF is the north star (§3.6), not a vanity metric.

A human-side analogy (not a CAV measurement). Published research links faster, more stable pages to roughly 24% lower abandonment. We cite this only as the human mechanism: paint/interaction latency driving a bounce. The agent mechanism is different: failure shows up as truncation or hallucination, not a bounce, and the cost lands as a wrong answer rather than a lost session. No CAV-measured equivalent of that number exists yet; producing one is field work, not a claim we make here. Treat all of §2.3 as the emerging case for CAV, to be replaced by measured evidence as adoption data accrues.

§3 Metric Definitions

§3.1 CRR — Content Recovery Ratio (gated)

Definition. CRR = min(token_ratio, content_overlap), where token_ratio = tokens(extract(raw_pre_JS_HTML)) / tokens(extract(fully_rendered_HTML)), content_overlap = |distinct_words(raw) ∩ distinct_words(rendered)| / |distinct_words(rendered)|, extract = Mozilla Readability main-content extraction, and tokens = cl100k_base count. Why the overlap term. A token ratio alone fake-passes when raw HTML has enough tokens but the wrong ones (a server-rendered “Loading…” / “enable JavaScript” placeholder with more tokens than the real content scores 1.0). The overlap term requires the raw page to contain the actual rendered words, not just a comparable token volume — so CRR needs enough tokens and the right ones. No LLM required; pure function. Methodology. Fetch A (raw HTTP, no JS) and Fetch B (headless, networkidle0). Readability-parse both; tokenize; take the min of the token ratio and the word-overlap. Edge cases. count_B==0 → CRR=0 (error state). count_A>count_B → token_ratio caps at 1.0 (SSR delivers ≥ client), but content_overlap still governs if the surplus tokens are the wrong content. Thresholds. Good ≥0.95 · Needs Work ≥0.80 · Poor <0.80. Known limitations. Readability can miss framework-specific content containers; SPA shells may need a custom selector. Word-overlap is set-based (ignores order/frequency) — sufficient for a recovery ratio, not a prose-similarity measure.
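The definition above can be sketched in a few lines. This is a minimal illustration, not the reference engine: it assumes the two inputs are already main-content extracts, and it swaps in a whitespace tokenizer where the spec counts cl100k_base tokens. The names computeCrr, gradeCrr, countTokens, and distinctWords are hypothetical.

```javascript
// Stand-in tokenizer (assumption): whitespace split instead of cl100k_base.
const countTokens = (text) => text.split(/\s+/).filter(Boolean).length;

// Distinct lowercase words, for the set-based overlap term.
const distinctWords = (text) =>
  new Set(text.toLowerCase().match(/[\p{L}\p{N}]+/gu) ?? []);

// rawText and renderedText are assumed to be main-content extracts already
// (the spec performs that step with Mozilla Readability).
function computeCrr(rawText, renderedText) {
  const countA = countTokens(rawText);
  const countB = countTokens(renderedText);
  if (countB === 0) return 0; // error state, per the edge-case rule

  const tokenRatio = Math.min(1, countA / countB); // caps at 1.0

  const rawWords = distinctWords(rawText);
  const renderedWords = distinctWords(renderedText);
  if (renderedWords.size === 0) return 0; // assumption: also treated as an error state
  let shared = 0;
  for (const word of renderedWords) {
    if (rawWords.has(word)) shared += 1;
  }
  const contentOverlap = shared / renderedWords.size;

  return Math.min(tokenRatio, contentOverlap);
}

function gradeCrr(crr) {
  if (crr >= 0.95) return "Good";
  if (crr >= 0.8) return "Needs Work";
  return "Poor";
}

const renderedExample = "Blue widget price 29 dollars in stock ships today";
const placeholderExample =
  "Loading please enable JavaScript to view this page content now";

console.log(gradeCrr(computeCrr(renderedExample, renderedExample)));    // Good
console.log(gradeCrr(computeCrr(placeholderExample, renderedExample))); // Poor
```

The placeholder shell carries more tokens than the real content, so its token ratio caps at 1.0, yet it shares no words with the rendered page, so the overlap term pulls CRR to 0.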

§3.2 SSD — Semantic Signal Density (gated)

Definition. SSD = 0.5*signal_ratio + 0.5*structured_coverage.

  • signal_ratio = tokens(main_content_markdown) / tokens(full_page_markdown) after stripping nav/header/footer/sidebar/ads/cookie-banners.
  • structured_coverage = required_jsonld_fields_present_and_nonempty / required_fields. The required-field set is auto-detected from the page’s own declared JSON-LD @type (zero-config: the page says what it is, and coverage scores how completely it described that thing). An explicit --preset/--schema overrides the auto-detected set. A page with no JSON-LD scores coverage 0, never a free 1.0: an agent gets prose but no entity scaffolding.

Thresholds. Good ≥0.60 · Needs Work ≥0.40 · Poor <0.40. Known limitations. “Main content” detection is heuristic; structured coverage rewards JSON-LD presence, not correctness (AF covers correctness). A type CAV ships no profile for falls back to a generic floor (the entity must at least carry a name).
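A small sketch of the SSD arithmetic, assuming the token counts have already been measured and the JSON-LD block has been parsed. The required-field list for Product below is hypothetical (the real per-type profiles are not listed in this section), as are the function names.

```javascript
// Hypothetical required-field profile; the real per-type sets are not given here.
const REQUIRED_FIELDS = { Product: ["name", "offers", "sku"] };
// Generic floor for types without a profile: the entity must carry a name.
const GENERIC_FLOOR = ["name"];

function isNonEmpty(value) {
  if (value === undefined || value === null) return false;
  if (typeof value === "string") return value.trim() !== "";
  if (Array.isArray(value)) return value.length > 0;
  return true;
}

// jsonLd is a parsed JSON-LD object, or null when the page declares none.
function structuredCoverage(jsonLd) {
  if (!jsonLd) return 0; // no JSON-LD scores 0, never a free 1.0
  const required = REQUIRED_FIELDS[jsonLd["@type"]] ?? GENERIC_FLOOR;
  const present = required.filter((field) => isNonEmpty(jsonLd[field])).length;
  return present / required.length;
}

function computeSsd(mainTokens, fullTokens, jsonLd) {
  const signalRatio = fullTokens > 0 ? mainTokens / fullTokens : 0;
  return 0.5 * signalRatio + 0.5 * structuredCoverage(jsonLd);
}

function gradeSsd(ssd) {
  if (ssd >= 0.6) return "Good";
  if (ssd >= 0.4) return "Needs Work";
  return "Poor";
}

const productLd = {
  "@type": "Product",
  name: "Blue widget",
  offers: { price: "29.99" }, // sku is missing, so coverage is 2/3
};
console.log(gradeSsd(computeSsd(600, 1000, productLd))); // Good
console.log(gradeSsd(computeSsd(600, 1000, null)));      // Poor
```

With a signal ratio of 0.6, the same page moves from Poor to Good purely on structured coverage, which is the behaviour the 50/50 weighting is meant to produce.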

§3.3 ARR — Action Resolution Rate (gated)

Definition. ARR = resolved_actions / total_declared_actions, where an action resolves if any of its ordered locators (role+name | aria-label | test-id) matches a node in the accessibility-tree snapshot. Methodology. Headless load → accessibility.snapshot({interestingOnly:true}) → flatten → resolve each declared action → compare to a committed golden file for drift (BROKEN = critical action lost → FAIL; MOVED = path changed → INFO). Thresholds. Good =1.0 · Needs Work ≥0.90 · Poor <0.90. Known limitations. Hydration timing (wait for readyState===complete + hydration signal); dynamic content needs a declared prerequisite; accessible-name instability handled via regex locators.
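The resolution rule can be sketched against an already-flattened accessibility snapshot. This is an illustration under assumptions: the node and locator field names (role, name, ariaLabel, testId), the action shape, and the golden-file format (a list of action ids that resolved previously) are all hypothetical, and only the BROKEN check is shown.

```javascript
// A locator matches when every field it specifies matches the node.
// RegExp values cover the unstable accessible-name case.
const locatorMatches = (locator, node) =>
  Object.entries(locator).every(([key, value]) =>
    value instanceof RegExp ? value.test(node[key] ?? "") : node[key] === value);

// An action resolves if any of its ordered locators matches any node.
function resolves(action, nodes) {
  return action.locators.some((locator) =>
    nodes.some((node) => locatorMatches(locator, node)));
}

function computeArr(actions, nodes) {
  if (actions.length === 0) return null; // assumption: nothing declared, nothing to score
  return actions.filter((action) => resolves(action, nodes)).length / actions.length;
}

// BROKEN: a critical action that resolved in the golden run no longer resolves.
function brokenCriticalActions(actions, nodes, goldenResolvedIds) {
  return actions
    .filter((a) => a.critical && goldenResolvedIds.includes(a.id) && !resolves(a, nodes))
    .map((a) => a.id);
}

const snapshotNodes = [
  { role: "button", name: "Add to cart" },
  { role: "searchbox", name: "Search products" },
];
const declaredActions = [
  { id: "add-to-cart", critical: true,
    locators: [{ role: "button", name: /^add to (cart|bag)$/i }] },
  { id: "checkout", critical: true,
    locators: [{ role: "link", name: "Checkout" }, { testId: "checkout" }] },
];

console.log(computeArr(declaredActions, snapshotNodes)); // 0.5
console.log(brokenCriticalActions(declaredActions, snapshotNodes,
  ["add-to-cart", "checkout"])); // [ 'checkout' ]
```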

§3.4 TC — Token Cost (supporting · tracked, not gated by default)

Definition. TC = tiktoken_cl100k.encode(agent_representation).length. Budget tiers. Good <4,000 · Needs Work <8,000 · Poor ≥8,000. Report truncation risk vs a 128K window.

§3.5 TTFUT — Time to First Useful Token (supporting)

For streaming agents: wall-clock to the first chunk of meaningful (post-boilerplate) content.

§3.6 AF — Answer Fidelity (north star · eval-gated)

Definition. Feed the page’s agent representation to an LLM; ask canonical per-template questions; score answers against expected truth. Weighted average over questions, majority agreement over N runs. Match types. exact · semantic (acceptable set) · set_contains · not_hallucinated (must refuse when absent). Judge requirements. Constrained prompt (“answer ONLY from the page; say not-available otherwise; never infer”), temperature 0, ≥3 runs, agreement ≥ threshold. Model configurable (default cheap; local supported). Thresholds. Good ≥0.95 · Needs Work ≥0.80 · Poor <0.80.
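The aggregation step (weighted average over questions, majority agreement over N runs) can be sketched as below, assuming the judge's answers have already been collected. The question shape, the "not-available" refusal sentinel, the strict-majority rule, and the reduction of semantic matching to membership in an acceptable set are assumptions made for the example.

```javascript
// Hypothetical refusal sentinel the judge is asked to return when a fact is absent.
const NOT_AVAILABLE = "not-available";

// One checker per match type named in the definition.
const matchers = {
  exact: (answer, q) => answer === q.expected,
  semantic: (answer, q) => q.acceptable.includes(answer),
  set_contains: (answer, q) => q.expected.every((item) => answer.includes(item)),
  not_hallucinated: (answer) => answer === NOT_AVAILABLE,
};

// A question passes when a strict majority of runs match (assumed agreement rule).
function questionPasses(question, runAnswers) {
  const hits = runAnswers.filter((a) => matchers[question.match](a, question)).length;
  return hits / runAnswers.length > 0.5;
}

// Weighted share of questions that passed.
function computeAf(questions, answersById) {
  let earned = 0;
  let total = 0;
  for (const q of questions) {
    total += q.weight;
    if (questionPasses(q, answersById[q.id])) earned += q.weight;
  }
  return total === 0 ? 0 : earned / total;
}

const afQuestions = [
  { id: "price", match: "exact", weight: 2, expected: "29.99" },
  { id: "stock", match: "semantic", weight: 1, acceptable: ["InStock", "in stock"] },
  { id: "warranty", match: "not_hallucinated", weight: 1 },
];
const afAnswers = {
  price: ["29.99", "29.99", "30"],                // 2 of 3 runs match: passes
  stock: ["InStock", "in stock", "InStock"],      // passes
  warranty: ["2 years", NOT_AVAILABLE, "2 years"] // hallucinated in 2 of 3 runs: fails
};

console.log(computeAf(afQuestions, afAnswers)); // 0.75
```

Here the page scores 0.75, below the 0.80 line, because the judge invented a warranty the page never states.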

§3.7 Provenance and Auditability

Principle. Measurement without provenance is assertion without evidence. This is a design principle drawn from trustworthy-AI practice (reproducibility, auditability of automated judgments) — not an endorsement of CAV by any person or organization. A score with no record of what produced it is unfalsifiable; CAV scores are gated decisions, so they must be re-derivable.

Record. Every CAV measurement emits a provenance record containing: content hashes (raw pre-JS and rendered HTML), the thresholds hash actually applied, the spec version, timestamps (fetch + score), the profile id, the scores, and the verdict + fail_reasons.

Three properties typical lab numbers lack:

  1. Third-party reproducibility. Re-fetch the same content + re-score with the same thresholds → the same result. CRR/SSD/ARR/TC are deterministic, so a record is a falsifiable claim, not a snapshot to be trusted.
  2. Temporal comparison. Two records make the cause of a change explicit — content hash moved (the page changed), thresholds hash moved (the bar changed), or profile id moved (a different consumer) — instead of an unattributed score delta.
  3. Chain of custody. A timestamped content hash proves what content was actually available when an agent consumed or cited the page — evidence for a citation/availability dispute, which a transient lab number cannot give.
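The temporal-comparison property can be illustrated with a small diff over two provenance records. The spec lists what a record contains but not its key names, so the field names below (rawHash, renderedHash, thresholdsHash, profileId) are hypothetical.

```javascript
// Attribute a score change to what actually moved between two records.
// Field names are assumptions; the spec names the contents, not the keys.
function explainChange(before, after) {
  const causes = [];
  if (before.rawHash !== after.rawHash || before.renderedHash !== after.renderedHash) {
    causes.push("content hash moved (the page changed)");
  }
  if (before.thresholdsHash !== after.thresholdsHash) {
    causes.push("thresholds hash moved (the bar changed)");
  }
  if (before.profileId !== after.profileId) {
    causes.push("profile id moved (a different consumer)");
  }
  return causes;
}

const recordMonday = {
  rawHash: "aaa1", renderedHash: "bbb1", thresholdsHash: "ttt1",
  profileId: "A", scores: { crr: 0.97 },
};
const recordFriday = {
  rawHash: "aaa2", renderedHash: "bbb1", thresholdsHash: "ttt1",
  profileId: "A", scores: { crr: 0.41 },
};

console.log(explainChange(recordMonday, recordFriday));
// [ 'content hash moved (the page changed)' ]
```

An empty result alongside a changed score would indicate a non-deterministic step somewhere in the pipeline, which is itself worth flagging.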

Regulated industries. Where pricing/availability claims carry compliance weight, a hashed, timestamped, re-scorable record of what an agent could read is the difference between an audit trail and an anecdote.

The CLI emits this record via --provenance.

§3.8 TCR — Tool-Calling Readiness (supporting · tracked, not gated)

Principle. CRR/SSD/AF measure reading; TCR measures acting. As the agent web moves from answering questions to completing tasks (add-to-cart, search, subscribe, book), a page an agent can read but not transact on is half a page. Definition. TCR = discoverable_actions / total_actions, computed from the raw, no-JS HTML (an agent that can’t run a browser must act from the markup). An action is a user-invocable, state-mutating affordance.

  • discoverable (method + endpoint + parameter schema derivable from markup): a <form> (action + method + named inputs), or a schema.org potentialAction with a target URL/urlTemplate, or a control carrying an explicit data-url/formaction (+ optional data-method).
  • not discoverable (JS-only): a labelled <button>/[role=button]/<input type=button> not wired to a form — its method/endpoint/params live in opaque JavaScript an agent can’t introspect.
  • excluded (no agent-invocable intent → keep false positives near zero): plain <a href> navigation (a GET page move is not a state-mutating tool-call), and unlabelled / disabled / hidden controls.

Methodology. Parse raw HTML (jsdom). Enumerate forms + potentialAction (discoverable) and unwired labelled controls (JS-only); TCR = discoverable / total. A page with no actions scores N/A (not 1.0 or 0: there is nothing to be illegible). No LLM; deterministic. Thresholds. Good ≥0.60 · Needs Improvement ≥0.25 · Poor <0.25. Supporting, not gated: TCR is a capability signal, not a correctness one (a failure means the agent can’t act, not that it got wrong data), and today’s median SPA scores ~0.15–0.30; gating now would fail most of the web and be bypassed. Gate only once ecosystem median crosses ~0.50. Trend it across deploys.

Validation (GET-path proven). TCR predicts ACR (Action Completion Rate): an agent reconstructs an action from markup alone and it actually completes (HTTP 2xx; deterministic, no LLM). Proven on the GET-safe path (scripts/tcr-proof.mjs; report docs/tcr-proof-report.md): across 10 real pages, 5 exposed a discoverable GET form, and 100% of them completed, meaning the reconstructed-from-markup request returned 2xx with a non-trivial body (Wikipedia, PyPI, StackOverflow, GitHub, HN). The other 5 had no discoverable GET form (CSR/JS-only search) and TCR correctly scored them low, so the classification predicts real invocability, not mere presence. Honest boundary: GET/search actions only. POST/mutating ACR (add-to-cart, checkout), the bulk of TCR’s value, cannot be tested at scale without firing side-effecting requests at third parties; a controlled-fixture POST-ACR harness is the remaining work. TCR ships as supporting.

Known limitations. Raw-HTML parse sees inline-style/attribute hiding but not computed-style/0×0 layout hiding (a headless-render enhancement); same-origin <iframe>/shadow-DOM actions are not yet walked (cross-origin iframes counted conservatively as opaque). See §9 A7.
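The three-way classification can be sketched over action descriptors that are assumed to have been extracted from the raw HTML already (the spec does that step with jsdom). The descriptor shape and the rule that a form without an action or named inputs counts as JS-only are assumptions for the example.

```javascript
// Classify one hypothetical action descriptor.
function classifyAction(a) {
  if (a.kind === "link" || a.disabled || a.hidden) return "excluded";
  if (a.kind === "form") {
    // Assumption: a form needs an action and at least one named input.
    return a.action && (a.inputs ?? []).length > 0 ? "discoverable" : "js-only";
  }
  if (a.kind === "potentialAction") {
    return a.target ? "discoverable" : "js-only";
  }
  // Remaining kind: a button-like control.
  if (!a.label) return "excluded";
  return a.dataUrl || a.formaction ? "discoverable" : "js-only";
}

// TCR = discoverable / total, where total ignores excluded items.
// Returns null for the N/A case (no actions on the page).
function computeTcr(actions) {
  const labels = actions.map(classifyAction);
  const discoverable = labels.filter((l) => l === "discoverable").length;
  const jsOnly = labels.filter((l) => l === "js-only").length;
  const total = discoverable + jsOnly;
  return total === 0 ? null : discoverable / total;
}

const pageActions = [
  { kind: "form", action: "/search", method: "get", inputs: ["q"] },
  { kind: "control", label: "Subscribe", dataUrl: "/api/subscribe" },
  { kind: "control", label: "Add to cart" },              // JS-only
  { kind: "link", label: "About" },                       // excluded
  { kind: "control", label: "Buy now", disabled: true },  // excluded
];

console.log(computeTcr(pageActions)); // two of three actions are discoverable
```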

§3.9 CEB / AUEC — Context Economy (supporting per-page · eval-gated benchmark)

Principle. Agents pay per token. The same page can be handed to an agent at very different fidelities and costs; most agents naively feed raw HTML, burning 10–200× the tokens of a structured extract for often-identical answer fidelity. CEB measures the answer-fidelity-per-token efficiency frontier across representation layers and scores it as a single number, AUEC. Layers. L0 raw no-JS HTML (markup minus script/style) · L1 Readability main · L2 structured-only (JSON-LD + OpenGraph + meta, pruned to recognized fields + capped) · L3 rendered + Readability (the fidelity ceiling). AUEC definition. Per page, take (token_cost, AF) for each layer; AF is the §3.6/§5.1 canary-seeded, corpus-free measure (priors-immune). Normalize the token axis by T_max. At any token budget x, an agent uses the best layer that fits, so the affordability frontier efficiency(x) = max{ AF_i : tokens_i ≤ x·T_max } is monotone non-decreasing; AUEC = ∫₀¹ efficiency(x) dx ∈ [0,1]. This rewards reaching high AF at a low token fraction and ignores dominated layers (a costlier layer with lower AF never raises the cumulative max). Token axis is linear (tokens map linearly to $; saving 10k tokens at the costly end matters more than 100 at the cheap end). Token cost = full cl100k count; AF is measured within a fixed agent context window, so a token-bloated layer that buries facts past the window scores lower AF — the intended efficiency signal. (An earlier convex-hull variant was falsified by the CEB benchmark — it assumed raw HTML retains high AF, which it does not; see the report.) Engine: metrics/auec.js + extractors/representations.js; benchmark scripts/ceb.mjs. Thresholds. Per-page AUEC is supporting (display + an optimal-layer recommendation: the cheapest layer within 0.9× of the best observed AF) — never gated per-page, because some pages genuinely need full HTML (no structured data) and that’s a site property, not a bug. 
The cross-corpus benchmark is gated at mean AUEC ≥ 0.60 (an always-raw-HTML pipeline scores ≈0.5; 0.60 requires benefiting from a cheaper layer on most pages). Validation (how it’s proven). AUEC must predict real token savings. Measured basis in docs/context-economy-report.md: for each page, find the cheapest layer reaching AF ≥ 0.9×AF(L0); report mean token savings (pass ≥50% with mean AF drop ≤0.05) and Spearman(AUEC, savings) (pass ≥0.65), on a CRR-spectrum sample, corpus-free judge. Deterministic for tokens, priors-immune for AF. Known limitations. Single-page single-query (no multi-turn context accumulation); cl100k normalization (actual cost varies ~15% by model tokenizer); ignores wall-clock latency (a cheap L2 that needs headless rendering is slower than L1 from raw HTTP). See §9 A8.
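Because the affordability frontier is a step function, the AUEC integral reduces to a sum of rectangles. The sketch below assumes T_max is supplied by the caller (this section does not define it; the example uses the costliest layer's token count) and that efficiency is 0 before the cheapest layer fits. The layer numbers are invented for illustration.

```javascript
// AUEC = integral over x in [0,1] of max{ AF_i : tokens_i <= x * tMax }.
function computeAuec(layers, tMax) {
  const sorted = layers
    .filter((layer) => layer.tokens <= tMax)
    .sort((a, b) => a.tokens - b.tokens);
  let area = 0;
  let best = 0;  // assumption: efficiency is 0 until the cheapest layer fits
  let prevX = 0;
  for (const layer of sorted) {
    const x = layer.tokens / tMax;
    area += best * (x - prevX);       // frontier is flat between breakpoints
    best = Math.max(best, layer.af);  // dominated layers never raise the max
    prevX = x;
  }
  return area + best * (1 - prevX);
}

// Cheapest layer whose AF is within 0.9x of the best observed AF.
function recommendLayer(layers) {
  const bestAf = Math.max(...layers.map((layer) => layer.af));
  return layers
    .filter((layer) => layer.af >= 0.9 * bestAf)
    .sort((a, b) => a.tokens - b.tokens)[0].name;
}

// Invented example values, not measured data.
const exampleLayers = [
  { name: "L0", tokens: 10000, af: 0.6 },
  { name: "L1", tokens: 1000, af: 0.95 },
  { name: "L2", tokens: 200, af: 0.9 },
  { name: "L3", tokens: 5000, af: 1.0 },
];

console.log(computeAuec(exampleLayers, 10000).toFixed(3)); // 0.952
console.log(recommendLayer(exampleLayers));                // L2
```

In this example the raw-HTML layer L0 is dominated (highest cost, lowest AF), so it contributes nothing to the frontier, which matches the "ignores dominated layers" property stated above.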

§3.10 DRI — Declared-Rendered Integrity (supporting score · gaming-flag gated)

Principle. CRR/SSD prove structured fields are present; §9 A3 proves a declared value appears somewhere in the rendered text. DRI proves they are correct: it compares each declared value to the value actually rendered at that field and classifies the direction of any mismatch. This is the §2.3(c) failure made measurable — an agent reads JSON-LD price $29.99, the page actually shows $34.99, and the agent confidently quotes the wrong price. Definition. From the rendered HTML, for each comparable field (offers.price, offers.availability, aggregateRating.ratingValue), extract the declared (JSON-LD) value and the rendered value (itemprop/ selector + canonical text), and classify: MATCH (numeric within 1% / enum equal), DRIFT (differs, unfavorable to the site), or FAVORABLE_DRIFT (declared is better than reality — cheaper price, more available, higher rating). DRI = matched / comparable; favorable_drift_ratio = favorable / drifted. Deterministic; no LLM. N/A when no JSON-LD / no comparable fields. Engine: extractors/dri.js. Thresholds. DRI score is supporting (reported, not gated) — a legitimate sale drifts the unfavorable direction (declared ≥ rendered) and must not fail a page. The bait-and-switch signal is gated: gaming_flag = (drifted ≥ 2 AND favorable_drift_ratio ≥ 0.5) → high severity, because a page systematically declaring better-than-rendered values to agents is actively misinforming them. This split is false-positive-averse: sales and one-off staleness never trip the gate; only a pattern of self-favoring drift does. Validation (proven — controlled experiment). A causal claim (“declared-rendered drift makes an agent misinform”) needs the counterfactual, so it is proven on 20 matched clean/drifted page pairs (scripts/dri-proof.mjs; report docs/dri-proof-report.md): each drifted variant declares better-than-rendered values (price −20%, InStock over a rendered pre-order, rating +0.6).

  • Detection rate = 100%: gaming_flag fires on every drifted page.
  • False-positive rate = 0%: it never fires on the matched clean page (direction-awareness; a sale would drift the other way).
  • Agent-misled rate = 100%: an agent fed the structured representation (the cheap layer it prefers, CEB §3.9), judged corpus-free by qwen3:8b, returned the drifted (false) price on all 20 pages. The harm DRI prevents is real, and it lives in exactly the representation efficient agents use.

Caveat: synthetic, controlled drift (required for the counterfactual), not a natural-drift field corpus. This proves the detector’s true/false-positive behaviour and the causal harm link, not the base rate of drift on the live web (field work toward §5). Known limitations. Point-in-time snapshot (a page that drifts off-hours and is fixed by crawl time is missed; continuous monitoring is infra, not the metric); geo/A/B-personalized pricing shows one variant (measure from a clean, neutral profile = the agent’s view); text fields only. See §9 A9.
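The direction-aware classification and the gaming flag from the DRI definition can be sketched as follows, assuming declared/rendered value pairs have already been extracted. Measuring the 1% tolerance relative to the rendered value and the availability ordering used to decide "more available" are assumptions, as are all names.

```javascript
// Direction in which a declared value flatters the site, per numeric field.
const FAVORABLE_WHEN_DECLARED = {
  "offers.price": "lower",
  "aggregateRating.ratingValue": "higher",
};
// Hypothetical ordering used to decide "more available".
const AVAILABILITY_RANK = { OutOfStock: 0, PreOrder: 1, InStock: 2 };

function classifyPair({ field, declared, rendered }) {
  if (field === "offers.availability") {
    if (declared === rendered) return "MATCH";
    return AVAILABILITY_RANK[declared] > AVAILABILITY_RANK[rendered]
      ? "FAVORABLE_DRIFT" : "DRIFT";
  }
  // Numeric fields: MATCH within 1% (assumed relative to the rendered value).
  if (Math.abs(declared - rendered) <= 0.01 * Math.abs(rendered)) return "MATCH";
  const declaredIsLower = declared < rendered;
  const favorable = FAVORABLE_WHEN_DECLARED[field] === "lower"
    ? declaredIsLower : !declaredIsLower;
  return favorable ? "FAVORABLE_DRIFT" : "DRIFT";
}

// Returns null for the N/A case (no comparable fields).
function computeDri(pairs) {
  if (pairs.length === 0) return null;
  const labels = pairs.map(classifyPair);
  const matched = labels.filter((l) => l === "MATCH").length;
  const favorable = labels.filter((l) => l === "FAVORABLE_DRIFT").length;
  const drifted = labels.length - matched;
  const favorableDriftRatio = drifted === 0 ? 0 : favorable / drifted;
  return {
    dri: matched / labels.length,
    favorableDriftRatio,
    gamingFlag: drifted >= 2 && favorableDriftRatio >= 0.5,
  };
}

const baitAndSwitch = [
  { field: "offers.price", declared: 29.99, rendered: 34.99 },
  { field: "offers.availability", declared: "InStock", rendered: "PreOrder" },
  { field: "aggregateRating.ratingValue", declared: 4.8, rendered: 4.8 },
];
const legitimateSale = [
  { field: "offers.price", declared: 34.99, rendered: 29.99 },
  { field: "offers.availability", declared: "InStock", rendered: "InStock" },
];

console.log(computeDri(baitAndSwitch).gamingFlag);  // true
console.log(computeDri(legitimateSale).gamingFlag); // false
```

The sale lowers DRI (one field drifts) but never trips the flag, which is the false-positive-averse split described in the Thresholds text.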

§3.11 ERI — Entity Resolution Integrity (supporting · seeds toward a gate)

Principle. Every other CAV metric assumes the agent already knows which entity the page is about. CRR reads the facts, DRI verifies they’re honest, AF checks the answer — none ask “facts about what?” An agent that extracts a price perfectly (CRR=1, DRI=1) but attributes it to the wrong product variant has failed catastrophically, and the rest of the stack scores it a pass. ERI measures whether a page gives enough deterministic signal to uniquely resolve its primary entity in a global namespace — from markup alone, no web search, no LLM. Definition. ERI = 0.4·coverage + 0.3·consistency + 0.3·disambiguation, on the page’s primary entity (most-specific JSON-LD @type, identity-bearing block):

  • coverage = type-expected globally-resolvable identifiers present / expected (Product → gtin/sku/mpn/canonical/@id; Person → orcid/wikidata/canonical/@id; Org → lei/wikidata/canonical/@id; Article → doi/canonical/@id/isPartOf; default → canonical/@id/sameAs).
  • consistency = cross-validation of canonical / og:url / JSON-LD @id (conflicts lower it; a cross-origin canonical is penalized, see §9 A13).
  • disambiguation = strongest identity tier present: authority ID (Wikidata QID / ORCID / DOI / check-digit-valid GTIN / LEI) = 1.0 · namespaced @id or sku+canonical = 0.7 · canonical only = 0.4 · nothing = 0.0.

Deterministic; no LLM. Engine: extractors/eri.js (GTIN mod-10, ORCID/DOI/Wikidata format checks inline). Thresholds. Grade A ≥0.70 (authority-grade identity) · B ≥0.30 (canonical + a typed id) · F <0.30 (unresolvable). Supporting at v0.1 (reported, not gated): content pages legitimately carry only a canonical; the metric seeds the gate (the §5 N≥500 bar) rather than failing the current web on day one. Validation (proven: authority cross-check + A11). ERI’s authority signal must actually resolve to the right entity. Proven on 10 real Wikidata entities (scripts/eri-proof.mjs; report docs/eri-proof-report.md), each as a genuine page (name = the QID’s own label, declaring that QID) and a spoofed page (same name, a different entity’s QID):
  • Genuine resolution rate = 100% — the declared QID’s label matches the page name (the identity externally resolves).
  • Spoof detection rate = 100% — a mismatched QID is caught by the §9 A11 cross-check.
  • ERI’s synchronous disambiguation score is identical (1.0) on genuine and spoofed: markup alone cannot tell them apart, which is precisely why the external label cross-check (A11) is the necessary complement.

Priors-immune: a Wikidata QID’s label either matches or it doesn’t; no embeddings, no judge. Caveat: English labels + overlap-coefficient name match (robust for org/person, not multilingual edge cases); A11 is an offline/opt-in network check, not in the deterministic live analyzer. A larger labeled confusable corpus (Pearson r vs ERI score) remains §5 field work. Known limitations. Validates identifier format + cross-consistency, not real-world truth (a format-valid GTIN can still be the wrong product; that’s AF’s job); primary-entity only (a 10-product comparison page scores its primary); point-in-time; live authority cross-fetch (Wikidata-label vs schema:name, §9 A11) is a documented follow-up, not in v0.1. See §9 A10.
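A sketch of the ERI composite together with the GTIN mod-10 check-digit test that backs the "check-digit-valid GTIN" tier. The function names and the flag-style input to the tier function are hypothetical; coverage and consistency are passed in as already-computed ratios.

```javascript
// GS1 mod-10 check: weights alternate 3,1,3,... starting from the digit
// immediately left of the check digit.
function isValidGtin(gtin) {
  if (!/^(\d{8}|\d{12}|\d{13}|\d{14})$/.test(gtin)) return false;
  const digits = [...gtin].map(Number);
  const check = digits.pop();
  const sum = digits
    .reverse()
    .reduce((acc, digit, i) => acc + digit * (i % 2 === 0 ? 3 : 1), 0);
  return (10 - (sum % 10)) % 10 === check;
}

// Strongest identity tier present (input flags are a hypothetical shape).
function disambiguationTier(ids) {
  if (ids.authorityId) return 1.0;
  if (ids.namespacedId || (ids.sku && ids.canonical)) return 0.7;
  if (ids.canonical) return 0.4;
  return 0.0;
}

function computeEri(coverage, consistency, disambiguation) {
  return 0.4 * coverage + 0.3 * consistency + 0.3 * disambiguation;
}

function gradeEri(eri) {
  if (eri >= 0.7) return "A";
  if (eri >= 0.3) return "B";
  return "F";
}

// Product page: valid GTIN, sku, canonical and @id present, mpn missing (4 of 5).
const productEri = computeEri(
  4 / 5,
  1.0,
  disambiguationTier({ authorityId: isValidGtin("4006381333931") })
);
// Content page: canonical only (1 of the 3 default identifiers).
const articleEri = computeEri(1 / 3, 1.0, disambiguationTier({ canonical: true }));

console.log(gradeEri(productEri)); // A
console.log(gradeEri(articleEri)); // B
```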

§4 Reference Agent Profiles

Profile | Consumer | Sees
A · raw-http | curl-like, no JS | raw HTML only
B · headless | Puppeteer/Chromium | fully rendered DOM
C · a11y-tree | computer-use agents | accessibility tree
D · markdown | Jina-like readers | converted markdown

CRR/SSD scores vary by profile — report which profile produced a score. Profile definitions live in spec/profiles/*.json.

§4.1 Agent Classes & the Collection Event

A profile (above) is how CAV measures. An agent class is who is consuming the page in the wild — a separate axis, used by field collection. The canonical taxonomy is the single source of truth in spec/agent-classes.json (loaded by @cav/agent-vitals and @cav/cav-collector so the two cannot drift): llm-crawler · search-crawler · agent-browser · generic-fetcher · human. Only the first four are collectable; human is never reported (the middleware passes humans through).

Field telemetry flows as a collection event with two fidelities, distinguished by a source field so they are never conflated in aggregation:

  • middleware-estimate: @cav/agent-vitals server middleware. Privacy-first: it emits an already-hashed origin_hash (never a URL), the agent class, and a heuristic crr_estimate + confidence (the server knows if it shipped SSR or a CSR shell, but not the true rendered DOM). These estimates converge at scale and are never averaged into measured percentiles.
  • cli-lab / platform-scan: real deterministic crr/ssd/tc from @cav/cli or a scheduled scan.

The collector stores origin-only (like CrUX): it drops any URL and the path_pattern on ingest, keeping the hashed origin, agent class, source, low-cardinality page_type, the metric(s), month, and cav_version. The origin hash is sha256(origin) with a published algorithm so anyone can reproduce their own.
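The origin-only storage rule amounts to an allowlist applied on ingest. The sketch below assumes a flat event object; apart from origin_hash, path_pattern, page_type, and cav_version, which appear in the text, the key names (agent_class, source, metrics, month, url) are hypothetical.

```javascript
// Only the four collectable classes; human is never reported.
const COLLECTABLE_CLASSES = new Set([
  "llm-crawler", "search-crawler", "agent-browser", "generic-fetcher",
]);

// Allowlist of stored fields. Anything else (url, path_pattern, ...) is dropped.
const KEPT_FIELDS = [
  "origin_hash", "agent_class", "source", "page_type",
  "metrics", "month", "cav_version",
];

function ingest(event) {
  if (!COLLECTABLE_CLASSES.has(event.agent_class)) return null;
  const stored = {};
  for (const field of KEPT_FIELDS) {
    if (field in event) stored[field] = event[field];
  }
  return stored;
}

const incomingEvent = {
  origin_hash: "3f2a...",  // placeholder; the middleware sends an already-hashed origin
  url: "https://shop.example/products/42?ref=abc",
  path_pattern: "/products/:id",
  agent_class: "llm-crawler",
  source: "middleware-estimate",
  page_type: "product",
  metrics: { crr_estimate: 0.4, confidence: 0.7 },
  month: "2026-06",
  cav_version: "0.1.0",
};

console.log(Object.keys(ingest(incomingEvent)));
// url and path_pattern are absent from the stored record
```

An allowlist (rather than deleting known-sensitive keys) means a new, unexpected field cannot leak into storage by default.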

§5 Threshold Rationale

v0.1 thresholds were educated defaults; they are now anchored to a first measured distribution (see docs/field-data-report.md): a 2026-06 scan of 299 pages across two corpora — Tranco top-120 homepages (n=89 scored) and a 210-page content corpus across six diverse sitemaps (n=210, 0 errors).

  • CRR 0.95 — validated by the data, not just by analogy. CRR is bimodal on both corpora: a population near 0 (client-rendered / JS-walled) and a population near 1.0 (SSR-complete), with a near-empty middle (homepage p75 1.0, content p75 1.0; ~14–27% of pages score <0.1). The 0.95 cutoff sits in the empty valley, cleanly separating “an agent gets the page” from “it doesn’t.” Keep.
  • SSD 0.60 — deliberately aspirational, above the current web’s top quartile. Measured p75 is ~0.51 on both corpora (homepage 0.52, content 0.51). A “good = p75” rule would put GOOD at ~0.52; we hold it at 0.60 on purpose — SSD is a brand-new signal and today’s structured-data hygiene is poor, so anchoring “good” to the current p75 would canonise that poverty. Revisit toward p75 only if adoption data shows 0.60 is unreachable.
  • TC 4K — content pages run higher than homepages (content median 224, p90 2,722; homepage p90 864) but the long tail stays under 4K; TC is supporting/ungated, so leniency is intended (a budget warning, not a gate).
  • ARR 1.0: a missing critical action is a broken integration.
  • AF 0.95: below it the page misinforms agents.

Machine-readable in spec/thresholds.json; values unchanged at v0.1 (measured N=299 < the N≥500 bar we set for moving a canonical threshold); SemVer-bumped with recorded basis in CHANGELOG.md as more field data accrues.

Field data for the newer metrics (TCR/DRI/ERI), descriptive — docs/field-data-v2-report.md, N=103 and accruing. The original N=299 scan predated TCR §3.8 / DRI §3.10 / ERI §3.11, so their thresholds were educated guesses; this is their measured web distribution (deterministic, one pass per URL; dataset persisted at data/field-data.ndjson, grown idempotently toward the N≥500 bar). Observations stable across N=40→103, not threshold moves:

  • TCR median 0.04 (p75 0.29, p90 1.00; n=87 action-bearing pages) — most pages expose almost no agent-invocable actions (button+fetch SPAs). Empirically confirms TCR must stay supporting, not gated (§3.8): a gate would fail the majority of the web.
  • ERI median 0.55, grade mix A/B/F = 19/61/23 of 103 — most pages reach grade B (canonical + a typed id); authority identity (grade A: Wikidata/GTIN/ORCID) is a ~18% minority. Confirms ERI supporting (§3.11) at v0.1.
  • CRR p50 1.00 (bimodal, p25 0.85) and SSD p75 0.65 — consistent with the original N=299 scan.
  • DRI had no comparable-field (price/availability/rating JSON-LD) pages in this general corpus — DRI bites on commerce, as designed; a product-page corpus is the right field-data source for it (noted gap; its base rate is adoption-gated, while its detector is already proven, §3.10). Descriptive increments toward the N≥500 bar; no canonical threshold moves until N≥500 + reference-judge agreement.

Reference-judge agreement (AF judge-independence) — docs/af-interrater-report.md. AF underpins PVB/CEB/DRI, so the outcome must not be an artifact of one judge model. Two corpus-free judges of different size and family (qwen3 8B vs qwen2.5-coder 14B) answered the same canary cloze on the same content (N=12): 95% per-question agreement (57/60), mean per-page AF difference 0.05 — strong judge-independence. (The §5 ρ≥0.8 rank-form of the bar is degenerate on this corpus — AF is at a ceiling, ~all pages 1.0, so there is no rank variance; per-question agreement is the correct measure for binary canary outcomes.) On a variance-bearing corpus (raw repr, mixed CRR, AF spanning 0–1, N=20): per-question agreement 91%, but Spearman ρ = 0.64 — moderate, below the ρ≥0.8 bar (the partial-AF disagreements add rank noise). Verdict: judge-independence holds at the per-question level (91–95%); the strict ρ≥0.8 rank-form is not yet met, so the reference-judge prerequisite is only partially satisfied and a canonical threshold must not move on this evidence — tightening (more pages / a third judge) is the remaining step. Judges are corpus-free by requirement; a retrieval model is excluded (it answers from its corpus, not the page).

§5.1 Proxy→Outcome Validation (the proxies predict the outcome — measured, not asserted)

§2.3 and §3.6 assert that the cheap deterministic proxies (CRR/SSD) predict the expensive outcome (AF — can an agent recover the page’s facts). This subsection replaces that assertion with a measurement. Method: Canary-Seeded Extraction (CSE) — for each page, extract facts that demonstrably appear in the rendered content, replace each with a synthetic canary value the judge cannot know a priori, inject the same canaries into the rendered and the raw (no-JS) extraction, and ask a corpus-free local judge (no retrieval) for each canary via a context-anchored cloze. Returning the canary means it read the page; returning the real value is priors leakage. This makes the ground truth deterministic and immune to both RAG-corpus and parametric-prior contamination — the LLM only demonstrates the read. Harness: scripts/pvb.mjs; engine: extractors/canary-facts.js + af/canary.js; report: docs/proxy-outcome-report.md.

Result (N=27 valid pages of 33 fetched, CRR spectrum SSR/static → CSR SPA, judge qwen3:8b, 2026-06):

  • CRR → AF_raw: Spearman ρ = 0.70 (t≈4.9, df=25, p<0.001) — CRR predicts the absolute agent answer-fidelity on a page’s facts. Strengthened from ρ=0.53 at the N=13 seed as N grew.
  • CRR → recovery (AF_raw/AF_rendered): ρ = 0.68 (p<0.001) — predicts how much of the content an agent recovers from no-JS HTML.
  • CRR predicts AF_raw≥0.8: ROC AUC = 1.00 — perfect separation of answerable-from-raw vs not, on this sample.
  • Priors-leak control: mean 0.01 — canaries hold at larger N; the outcome is not the judge’s world-knowledge.
  • Honest finding — the composite is worse than CRR alone. composite (0.6·CRR+0.4·SSD) → AF_raw: ρ = 0.41, below CRR’s 0.70. SSD adds noise, not signal, to AF prediction — CRR is the dominant predictor of fact recovery. (SSD measures entity scaffolding, a different axis; this argues against blending it into an AF-predicting composite, not against SSD itself.) Recorded as evidence, not hidden.
  • Known divergence (reported, not hidden): a small number of prerendered pages score moderate CRR yet recovery≈1.0 — raw HTML carries the facts while the rendered tree adds tokens, deflating the CRR ratio. CRR can understate recovery for prerendered-but-token-heavy pages; a known limitation of the ratio, not a measurement error.

Power: at the observed ρ≈0.68–0.70, N=27 yields p<0.001 — past the §5 significance bar (the rule-of-thumb N≥28 for 80% power at ρ≈0.7 is essentially met; the measured p is already well below 0.01).
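The t statistic quoted for ρ = 0.70 can be checked by hand. The worked step below assumes the usual t approximation for a rank correlation with N − 2 degrees of freedom; the report does not state which test it used, so treat this as a consistency check rather than a description of the harness.

```latex
t \;=\; \rho\,\sqrt{\frac{N-2}{1-\rho^{2}}}
  \;=\; 0.70\,\sqrt{\frac{25}{1-0.49}}
  \;\approx\; 0.70 \times 7.00
  \;\approx\; 4.9,
\qquad df = N - 2 = 27 - 2 = 25 .
```

This reproduces the reported t≈4.9 at df=25.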

Replication (N=46 valid of 59, broader/more-diverse corpus, 2026-07). Re-running the same canary-seeded, corpus-free method on a larger page set holds the classification result — CRR predicts AF_raw≥0.8 at ROC AUC = 0.95, priors-leak 0.00 — while the rank correlation moderates to Spearman ρ ≈ 0.5. This is the expected shape of a bimodal corpus: as the sample fills in the extremes, “whether a page is legible” separates cleanly (high AUC), while “how legible, ranked” gets noisier near the ceiling (lower ρ). The headline claim is therefore the AUC, not the ρ — CRR reliably predicts whether an agent recovers a page’s facts.

Status. Direction is now measured evidence, not assertion, at a publishable significance. Still a single corpus-free judge + page-derived canary ground truth — not a basis for moving a canonical threshold, which requires the §5 N≥500 bar with reference-judge agreement (ρ≥0.8 inter-rater) in CHANGELOG.md. SemVer: a clarification (no metric redefinition) → minor bump.

§6 CI Gate Integration

Sequence per money-template: (1) render raw + headless → agent representation; (2) compute CRR/SSD/TC (cheap, every commit); (3) run ARR vs golden; (4) AF eval (expensive — sample: full on template/layout change, skip on content tweak). PASS/WARN/FAIL with an override-and-audit policy (a documented override is allowed but logged).
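One way to read the PASS/WARN/FAIL step is sketched below. The mapping from grades to verdicts (any Poor gated metric fails, any Needs Work warns) and the shape of the override record are assumptions, not rules stated in this section; the threshold values are the ones given in §3.

```javascript
// Gated thresholds from §3; AF is included only when the eval was run.
const GATE_THRESHOLDS = {
  crr: { good: 0.95, needsWork: 0.8 },
  ssd: { good: 0.6, needsWork: 0.4 },
  arr: { good: 1.0, needsWork: 0.9 },
  af: { good: 0.95, needsWork: 0.8 },
};

// Assumed mapping: Poor on any gated metric is FAIL, Needs Work is WARN.
function gate(scores, override = null) {
  const failReasons = [];
  const warnings = [];
  for (const [metric, value] of Object.entries(scores)) {
    const limits = GATE_THRESHOLDS[metric];
    if (!limits) continue; // supporting metrics (e.g. tc) are tracked, not gated
    if (value < limits.needsWork) failReasons.push(`${metric}=${value} is Poor`);
    else if (value < limits.good) warnings.push(`${metric}=${value} Needs Work`);
  }
  const verdict = failReasons.length > 0 ? "FAIL"
    : warnings.length > 0 ? "WARN" : "PASS";

  // A documented override is allowed but logged; the verdict itself is kept.
  const auditLog = [];
  const overridden = verdict === "FAIL" && override !== null;
  if (overridden) {
    auditLog.push({ by: override.by, reason: override.reason, failReasons });
  }
  return { verdict, failReasons, warnings, overridden, auditLog };
}

console.log(gate({ crr: 0.97, ssd: 0.45, arr: 1.0, tc: 3000 }).verdict); // WARN
console.log(gate({ crr: 0.5, ssd: 0.7, arr: 1.0 }).verdict);             // FAIL
```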

§7 Versioning Policy

SemVer. MAJOR = metric removed/redefined; MINOR = metric added or threshold loosened with rationale; PATCH = docs/typos. Every threshold change records the field-data basis in CHANGELOG.md.

§8 Prior Art & Differentiation

A mature web-performance program — field data + a lab tool + a CI gate — is the structural model. Fragmented prior art — Googlebot/Vercel crawlability (guidance, no metric), schema.org/Rich Results (opt-in, no S/N metric), CDP/computer-use (a11y as agent API, unmeasured), Jina/Firecrawl (markdown conversion, measures nothing), RAGAS/DeepEval (generic evals, not page fidelity), MCP/A2A (protocol not measurement), GEO/AI-SEO (marketing checklists). None is a named, measurable standard with metrics + thresholds + a CI gate. That’s the gap CAV fills. Architecture follows the familiar scorecard + CI-gate model.

§9 Adversarial Robustness & Known Gaming Vectors

A score is only credible if a high one cannot be faked without genuinely serving agents. We enumerate the gaming vectors and CAV’s defense for each — published with the metrics, the way mature standards publish “how this can mislead” alongside the definitions. Self-disclosure is the trust signal; a silent metric invites the gaming it hides.

#   | Attack                                          | Targets | Status
A1  | Server-rendered placeholder text                | CRR     | Closed
A2  | Preset omission for free structured-data credit | SSD     | Closed
A3  | Present-but-false JSON-LD values                | SSD     | Largely closed (v0.1.1)
A4  | Cosmetic aria-label affordances                 | ARR     | Partially closed (v0.1.1)
A5  | Cloaking (UA/IP-dependent content)              | all     | Detected (v0.1.1)
A6  | Content minimization for TC                     | TC      | By design
A7  | Fake affordances inflating TCR                  | TCR     | Mitigated (v0.1.x), deterministic
A8  | Layer padding / invisible characters            | CEB     | Mitigated (v0.1.x), deterministic
A9  | Bait-and-switch structured data                 | DRI     | Detected + gated (v0.1.x), deterministic (§3.10)
A10 | Authority-ID stuffing / spoofing                | ERI     | Mitigated (v0.1.x), deterministic (§3.11)

§9.1 Defenses in detail

A1 — Server-rendered placeholder text

Attack: Server-render placeholder text (“Loading…”, “enable JavaScript”) with token volume ≈ real content → CRR → fake 1.0.

Closed. §3.1 content_overlap term: raw must contain the rendered words, not just a similar token count.
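A word-overlap term of this kind might look like the sketch below. The exact §3.1 formula is not reproduced in this section, so the definition here (share of rendered words that also appear in the raw HTML text) is an assumption, as are the function names.

```python
import re


def words(text):
    """Lowercased set of word tokens."""
    return set(re.findall(r"\w+", text.lower()))


def content_overlap(raw_text, rendered_text):
    """Share of rendered words that the raw (pre-JS) text already contains."""
    rendered = words(rendered_text)
    if not rendered:
        return 0.0
    return len(rendered & words(raw_text)) / len(rendered)
```

A placeholder such as "Loading, please enable JavaScript" can match the rendered page's token volume yet shares none of its words, so the overlap collapses toward 0 regardless of length.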

A2 — Preset omission for free structured-data credit

Attack: No --preset → free structured-data credit → SSD → fake 1.0.

Closed. §3.2 auto-detects required fields from the page’s own @type; no JSON-LD ⇒ coverage 0.
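The "required fields come from the page's own @type" rule can be sketched as follows. The `REQUIRED_BY_TYPE` map is a hypothetical stand-in (the real per-type field lists belong to §3.2), and the sketch only handles top-level nodes whose `@type` is a single string.

```python
import json

# Hypothetical required-field map, for illustration only.
REQUIRED_BY_TYPE = {
    "Product": ["name", "offers"],
    "Article": ["headline", "datePublished"],
}


def structured_coverage(jsonld_blocks):
    """Fraction of required fields present, keyed off each node's own @type."""
    required = present = 0
    for block in jsonld_blocks:
        try:
            node = json.loads(block)
        except json.JSONDecodeError:
            continue  # unparseable markup earns no credit
        if not isinstance(node, dict):
            continue
        node_type = node.get("@type")
        if not isinstance(node_type, str):
            continue  # list-valued or missing @type is out of scope for this sketch
        fields = REQUIRED_BY_TYPE.get(node_type, [])
        required += len(fields)
        present += sum(1 for f in fields if node.get(f) not in (None, "", [], {}))
    # No JSON-LD (or nothing scorable) means coverage 0, never free credit.
    return present / required if required else 0.0
```

Because the denominator is derived from the page itself, omitting a preset can no longer shrink the set of fields being checked.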

A3 — Present-but-false JSON-LD values

Attack: Stuff JSON-LD with fields that are present but false (price $10 in markup, $100 on page) → SSD → fake 1.0.

Largely closed (v0.1.1). A deterministic structured-data-consistency check now verifies each cited JSON-LD value (price, availability, name, rating, sku…) actually appears in the rendered page text; a declared-but-absent value is flagged (SSD/AF risk) on every scan — no LLM, reproducible. AF remains the backstop for values that are present-but-semantically-wrong. Residual gap: a value present on the page but in a different role (e.g. a $10 shipping fee matching a declared $10 price) can still pass the presence check — AF covers that.
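A presence check of this kind can be sketched with a boundary-aware text search. This is an assumption-laden illustration, not the shipped check: `absent_values` is a hypothetical name, and the boundary rule is one simple way to stop a declared "10" from matching inside "100.00".

```python
import re


def absent_values(declared, rendered_text):
    """Return the declared JSON-LD fields whose value does not appear in the rendered text."""
    text = " ".join(rendered_text.lower().split())
    missing = []
    for field, value in declared.items():
        needle = re.escape(str(value).lower())
        # Reject matches embedded in a longer number or word (e.g. "10" inside "100.00").
        pattern = r"(?<![\w.,])" + needle + r"(?![.,]?\w)"
        if not re.search(pattern, text):
            missing.append(field)
    return missing
```

The residual gap described above is visible here too: `absent_values({"price": "10"}, "Price $100, shipping $10")` returns an empty list, because the shipping fee satisfies the presence check even though the price does not match.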

A4 — Cosmetic aria-label affordances

Attack: aria-label cosmetic elements so locators resolve, though clicking does nothing → ARR → fake 1.0.

Partially closed (v0.1.1). A deterministic actionability check reads the platform’s own verdict (no click simulation): a genuine control (interactive role or native control/link) that is visible but has pointer-events:none is flagged — it looks clickable but clicks pass through. Deterministic, reproducible, near-zero false-positive (landmark aria-labels and honest disabled/hidden states are intentionally NOT flagged — flagging them eroded trust in testing). Residual gap: a control with aria-label but a no-op/missing handler that isn’t CSS-blocked needs runtime behavior observation — out of deterministic scope; golden-file drift + AF cover it.
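The flagging rule can be sketched over a pre-extracted element snapshot. The dictionary shape (`tag`, `role`, `visible`, `disabled`, `pointer_events`) and the role list are hypothetical; a real analyzer would read these values from the browser rather than from a dict.

```python
INTERACTIVE_ROLES = {"button", "link", "checkbox", "menuitem", "tab", "switch"}
NATIVE_CONTROLS = {"a", "button", "input", "select", "textarea"}


def is_blocked_control(el):
    """True when a genuine, visible control has pointer-events:none (clicks pass through)."""
    genuine = el.get("role") in INTERACTIVE_ROLES or el.get("tag") in NATIVE_CONTROLS
    # Honest disabled/hidden states are intentionally not flagged.
    honest_state = el.get("disabled", False) or not el.get("visible", False)
    return genuine and not honest_state and el.get("pointer_events") == "none"
```

A labelled landmark such as a `nav` with `role="navigation"` never qualifies as a genuine control in this sketch, which mirrors the low-false-positive intent described above.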

A5 — Cloaking (UA/IP-dependent content)

Attack: Cloaking — serve rich content to cav’s UA/IP, thin content to real agents (or vice-versa) → all.

Detected (v0.1.1). A deterministic dual-UA check fetches the same URL with a browser UA and an LLM-crawler UA (GPTBot), extracts main content (Readability), and compares the two via the CRR word-overlap. Parity <0.90 triggers an A/B noise gate that re-fetches both: random A/B, geo, or personalization variance flips to high parity on the re-fetch and PASSes, while deterministic UA-cloaking stays low across both rounds → FLAG. The report names which UA was served thin content. This is detection, not prevention, but it is now a first-class signal rather than just a recommendation; IP-based cloaking still needs ≥2 IPs (documented limit).
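The dual-UA comparison with its noise gate can be sketched with an injected fetcher, so no network or Readability dependency is needed. Everything named here is an assumption for illustration: `fetch_text(url, ua)` is a hypothetical callable returning extracted main content, the UA constants are placeholders, and the overlap definition is a simple stand-in for the CRR word-overlap.

```python
import re

BROWSER_UA = "browser"   # placeholder label, not a real UA string
CRAWLER_UA = "GPTBot"    # placeholder label for the LLM-crawler UA


def word_overlap(a, b):
    """Shared words divided by the larger vocabulary (stand-in for the CRR word-overlap)."""
    wa, wb = set(re.findall(r"\w+", a.lower())), set(re.findall(r"\w+", b.lower()))
    if not wa or not wb:
        return 0.0
    return len(wa & wb) / max(len(wa), len(wb))


def cloaking_check(fetch_text, url, threshold=0.90):
    """Return ("PASS", None) or ("FLAG", ua_served_thin) after at most two rounds."""
    for _round in range(2):  # round 2 is the A/B noise gate
        browser = fetch_text(url, BROWSER_UA)
        crawler = fetch_text(url, CRAWLER_UA)
        if word_overlap(browser, crawler) >= threshold:
            return "PASS", None  # parity reached: variance was noise
    thin = "crawler" if len(crawler.split()) < len(browser.split()) else "browser"
    return "FLAG", thin
```

Random variance is unlikely to produce low parity twice in a row, whereas UA-keyed cloaking does so on every fetch, which is what the second round exploits.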

A6 — Content minimization for TC

Attack: Minimize page content so TC looks “GOOD” while the page says nothing → TC → fake GOOD.

By design. TC is supporting, not gated (§3.4) precisely so volume can’t be gamed into a pass; an empty page tanks SSD signal and AF answers.

A7 — Fake affordances inflating TCR

Attack: Inflate TCR with fake affordances — hidden forms (display:none), duplicate form + potentialAction for one action, or unlabelled decoy buttons → TCR → fake GOOD.

Mitigated (v0.1.x), deterministic. Hidden controls (inline display:none/visibility:hidden, [hidden], aria-hidden on the element or an ancestor) are excluded from both numerator and denominator; discoverable actions are deduped by normalized endpoint URL (a form and a potentialAction at the same target count once); unlabelled controls are excluded (no agent-invocable intent). TCR is also supporting, not gated (§3.8), so inflating it wins nothing. Residual: computed-style/0×0 hiding and endpoint-liveness (HEAD) checks need the headless pass (documented limit).
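The three filters (hidden, unlabelled, duplicate endpoint) can be sketched over a list of affordance records. The record shape (`endpoint`, `label`, `hidden`) and the normalization rule (lowercased scheme and host, trailing slash dropped, query and fragment ignored) are assumptions made for this example.

```python
from urllib.parse import urlsplit


def normalize_endpoint(url):
    """Collapse trivially different spellings of the same target (assumed rule)."""
    parts = urlsplit(url)
    return (parts.scheme.lower(), parts.netloc.lower(), parts.path.rstrip("/") or "/")


def countable_actions(affordances):
    """Drop hidden and unlabelled controls, then dedupe by normalized endpoint."""
    seen = set()
    kept = []
    for action in affordances:
        if action.get("hidden") or not action.get("label"):
            continue  # excluded from both numerator and denominator
        key = normalize_endpoint(action["endpoint"])
        if key in seen:
            continue  # a form and a potentialAction at one target count once
        seen.add(key)
        kept.append(action)
    return kept
```

Under these rules a visible form plus a `potentialAction` pointing at the same URL contributes one action, and a `display:none` decoy form contributes none.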

A8 — Layer padding / invisible characters

Attack: Inflate AUEC by padding the cheap layer (junk JSON-LD to bloat L2) or gaming token counts with zero-width/invisible characters → CEB → fake GOOD.

Mitigated (v0.1.x), deterministic. L2 is pruned to recognized structured fields and capped (a token budget), so junk JSON-LD can’t both stay in L2 and stay cheap; all layer content is NFC-normalized with zero-width characters stripped before tokenizing, so invisible codepoints can’t move the token count. AUEC is also supporting per-page (§3.9). Canary AF is unaffected: canaries replace real facts in the page content, so padding the page doesn’t plant answers.
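The normalization step can be sketched with the standard library. The set of code points treated as zero-width below is an assumption (the spec does not list them here), and whitespace splitting is only a stand-in for a real tokenizer.

```python
import unicodedata

# Assumed set of invisible code points: ZWSP, ZWNJ, ZWJ, word joiner, BOM.
ZERO_WIDTH = {"\u200b", "\u200c", "\u200d", "\u2060", "\ufeff"}


def clean(text):
    """NFC-normalize, then strip zero-width characters."""
    text = unicodedata.normalize("NFC", text)
    return "".join(ch for ch in text if ch not in ZERO_WIDTH)


def token_count(text):
    """Count whitespace-delimited tokens after cleaning (tokenizer stand-in)."""
    return len(clean(text).split())
```

With this in place, `token_count("a \u200b b")` is 2, the same as for `"a b"`, so inserting invisible characters no longer changes the count.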

A9 — Bait-and-switch structured data

Attack: Bait-and-switch via structured data — declare a better-than-rendered value to agents (lower price, InStock over a pre-order, inflated rating) so an agent over-promises to the user → DRI → fake GOOD; agent misinforms.

Detected + gated (v0.1.x), deterministic (§3.10). DRI compares each declared JSON-LD value to the rendered value and classifies drift direction; a pattern of self-favoring drift (drifted ≥ 2 AND favorable_drift_ratio ≥ 0.5) sets gaming_flag (high severity). Direction-awareness makes it false-positive-averse: a legitimate sale drifts the opposite way (declared ≥ rendered) and never trips the gate. Residual: computed-style/Shadow-DOM price hiding and currency-without-priceCurrency need the headless DOM-eval / currency cross-check (documented follow-ups).
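The gate condition (drifted ≥ 2 AND favorable_drift_ratio ≥ 0.5) can be sketched for numeric fields. The input tuple shape and the `lower_is_better` flag are assumptions for this example; categorical fields such as availability would need their own ordering and are left out.

```python
def dri_gaming_flag(observations):
    """observations: (field, declared, rendered, lower_is_better) numeric tuples."""
    drifted = favorable = 0
    for _field, declared, rendered, lower_is_better in observations:
        if declared == rendered:
            continue  # no drift on this field
        drifted += 1
        # Self-favoring drift: the declared value looks better to the user than the rendered one.
        if lower_is_better:
            favorable += declared < rendered
        else:
            favorable += declared > rendered
    ratio = favorable / drifted if drifted else 0.0
    return {
        "drifted": drifted,
        "favorable_drift_ratio": ratio,
        "gaming_flag": drifted >= 2 and ratio >= 0.5,
    }
```

A declared price of 10 against a rendered 100, combined with a declared rating of 4.9 against a rendered 3.2, trips the flag; two sale prices where the declared value is higher than the rendered one drift the other way and do not.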

A10 — Authority-ID stuffing / spoofing

Attack: Inflate ERI by stuffing dozens of sameAs URIs, or claiming an authority ID (Wikidata QID / GTIN) that isn’t really the entity → ERI → fake GOOD.

Mitigated (v0.1.x), deterministic (§3.11). sameAs coverage counts only the first 3 distinct authority domains, and coverage is scored against type-specific expected identifiers, so dumping generic URIs adds nothing; GTINs must pass the mod-10 check digit to count; a cross-origin canonical is penalized (can’t inherit another domain’s identity). Semantic authority-spoofing (a valid-format Wikidata QID for the wrong entity) is addressed by the A11 cross-check — fetch the declared QID’s label, compare to schema:name — now proven offline (scripts/eri-proof.mjs: 100% spoof detection / 100% genuine resolution on 10 entities); it ships as an opt-in/offline check (network), not in the deterministic live analyzer. Residual: like DRI, format-valid-but-wrong on non-authority IDs is AF’s job, not a page-level deterministic check.
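Two of the deterministic defenses above, the GTIN mod-10 check digit and the cap on distinct authority domains, can be sketched as follows. The check-digit arithmetic is the standard GS1 scheme; the function names, the accepted lengths, and the `authority_domains` argument are assumptions for this example.

```python
from urllib.parse import urlsplit


def gtin_is_valid(gtin):
    """Validate a GTIN-8/12/13/14 string with the GS1 mod-10 check digit."""
    if not (gtin.isascii() and gtin.isdigit()) or len(gtin) not in (8, 12, 13, 14):
        return False
    digits = [int(c) for c in gtin]
    body, check = digits[:-1], digits[-1]
    # Weights alternate 3, 1, 3, ... starting from the digit next to the check digit.
    total = sum(d * (3 if i % 2 == 0 else 1) for i, d in enumerate(reversed(body)))
    return (10 - total % 10) % 10 == check


def sameas_credit(uris, authority_domains, cap=3):
    """Count distinct recognized authority domains, capped (cap of 3 per the text above)."""
    domains = {urlsplit(u).netloc.lower() for u in uris}
    return min(len(domains & set(authority_domains)), cap)
```

Because credit is counted per distinct domain, listing many URIs on the same authority earns exactly one unit, and malformed GTINs earn nothing.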

Principle. No single deterministic metric is un-gameable in isolation; the gate’s robustness comes from the combination — a page gaming CRR or SSD alone still fails AF, and AF is the outcome the proxies must predict (§5, proxy→outcome validation). Each new gaming vector found in the field is added here with its defense and status.