CAV-RFC-001: Specification
The formal Core Agent Vitals specification — definitions, deterministic methodology, thresholds, and reference agent profiles. The canonical spec, hosted on-site.
CAV-RFC-001: Core Agent Vitals
Status: Draft · Version: 0.1.0 · Author: Rafael Lopes
§1 Abstract
Core Agent Vitals (CAV) is a measurable, gated framework for the experience an AI agent has consuming a web page — the measurement standard for AI-agent web consumption. It defines three gated pillars (CRR, SSD, ARR), two supporting signals (TC, TTFUT), and one north-star outcome metric (AF), each with a formula, a deterministic measurement methodology, reference agent profiles, thresholds, and a CI gate. v0.1 thresholds are seeds to baseline-and-tighten.
§2 Problem Statement
- The agent-web gap. A growing share of page “visits” are LLMs/agents extracting and recombining content to answer a query the user never clicks through on. They don’t execute JS well, don’t see images, and operate under a token budget. Nothing measures this.
- Why existing metrics don’t cover it. Today’s page metrics measure a browser paint and human interaction that, on the agent path, never happen. A heavily client-rendered page can look perfect to a person while scoring near-zero for agents.
- Revenue at risk. If a shopping/answer agent can’t parse your price, availability, or actions, it can’t cite, recommend, or transact — you’re the slow/illegible supplier dropped from its context.
§2.3 Why CAV Matters (Forcing Functions)
Human-performance metrics got adopted because Search made them a ranking input — one centralized lever. CAV has no equivalent single lever; its forcing functions are emerging and distributed, and this section states them as hypotheses, not as a measured market. Three channels:
- (a) Agent commerce. ChatGPT Shopping, Perplexity, and AI Overviews with product cards consume pages to assemble a buying answer. A page with low CRR (content lost behind JS) or low SSD (no entity scaffolding) is unreadable to the buying agent — it can’t extract a price/availability it never recovered, so the page can’t be surfaced or transacted against. The selection happens upstream of any human click.
- (b) AI-Overview citation. Extractable, structured content is what gets quoted and linked in AI summaries. CRR/SSD are direct proxies for “can this be lifted into an answer at all.” Illegible pages are silently omitted.
- (c) Answer-fidelity risk (the asymmetric one). A human who hits a bad page bounces — a lost session. An agent that hits a bad page produces a wrong answer about your business: a stale price, a hallucinated “in stock,” an action it claims exists. That is a reputational/correctness cost a bounce never carries, and it scales with every query that touches the page. This is why AF is the north star (§3.6), not a vanity metric.
A human-side analogy (not a CAV measurement). Published research links faster, more stable pages to roughly 24% lower abandonment. We cite this only as the human mechanism: paint/interaction latency driving a bounce. The agent mechanism is different: failure shows up as truncation or hallucination, not a bounce, and the cost lands as a wrong answer rather than a lost session. No CAV-measured equivalent of that number exists yet; producing one is field work, not a claim we make here. Treat all of §2.3 as the emerging case for CAV, to be replaced by measured evidence as adoption data accrues.
§3 Metric Definitions
§3.1 CRR — Content Recovery Ratio (gated)
Definition. CRR = min(token_ratio, content_overlap), where
token_ratio = tokens(extract(raw_pre_JS_HTML)) / tokens(extract(fully_rendered_HTML)),
content_overlap = |distinct_words(raw) ∩ distinct_words(rendered)| / |distinct_words(rendered)|,
extract = Mozilla Readability main-content extraction, and tokens = cl100k_base count.
Why the overlap term. A token ratio alone fake-passes when raw HTML has enough tokens but the wrong ones
(a server-rendered “Loading…” / “enable JavaScript” placeholder with more tokens than the real content scores 1.0).
The overlap term requires the raw page to contain the actual rendered words, not just a comparable token volume —
so CRR needs enough tokens and the right ones. No LLM required; pure function.
Methodology. Fetch A (raw HTTP, no JS) and Fetch B (headless, networkidle0). Readability-parse both; tokenize;
take the min of the token ratio and the word-overlap.
Edge cases. count_B == 0 (no tokens extracted from the rendered page) → CRR = 0 (error state). count_A > count_B (raw yields more tokens than rendered) → token_ratio caps at 1.0 (SSR delivers ≥ client),
but content_overlap still governs if the surplus tokens are the wrong content.
Thresholds. Good ≥0.95 · Needs Work ≥0.80 · Poor <0.80.
Known limitations. Readability can miss framework-specific content containers; SPA shells may need a custom selector.
Word-overlap is set-based (ignores order/frequency) — sufficient for a recovery ratio, not a prose-similarity measure.
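The definition above can be sketched as a pure function. This is a minimal illustration, not the reference engine: it assumes the main content has already been extracted from both fetches, and it uses a whitespace split as a stand-in for the cl100k_base count (the names `whitespace_tokens`, `distinct_words`, and `crr` are hypothetical).

```python
import re
from typing import Callable, Set


def whitespace_tokens(text: str) -> int:
    # Stand-in tokenizer (assumption): the spec counts cl100k_base tokens.
    return len(text.split())


def distinct_words(text: str) -> Set[str]:
    # Set-based word bag: ignores order and frequency, as noted in the limitations.
    return set(re.findall(r"\w+", text.lower()))


def crr(raw_text: str, rendered_text: str,
        tokens: Callable[[str], int] = whitespace_tokens) -> float:
    """CRR = min(token_ratio, content_overlap) over extracted main content."""
    count_a, count_b = tokens(raw_text), tokens(rendered_text)
    rendered_words = distinct_words(rendered_text)
    if count_b == 0 or not rendered_words:
        return 0.0  # error state: nothing was rendered
    token_ratio = min(count_a / count_b, 1.0)  # SSR surplus caps at 1.0
    content_overlap = len(distinct_words(raw_text) & rendered_words) / len(rendered_words)
    return min(token_ratio, content_overlap)


rendered = "Blue widget, $29.99, in stock. Add to cart."
placeholder = "Loading... please enable JavaScript and reload this page"
print(crr(rendered, rendered))      # identical content
print(crr(placeholder, rendered))   # enough tokens, wrong words
```

The placeholder case has a capped token ratio of 1.0 but shares no words with the rendered content, so the overlap term drives the score to zero.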
§3.2 SSD — Semantic Signal Density (gated)
Definition. SSD = 0.5*signal_ratio + 0.5*structured_coverage.
signal_ratio = tokens(main_content_markdown) / tokens(full_page_markdown) after stripping nav/header/footer/sidebar/ads/cookie-banners.
structured_coverage = required_jsonld_fields_present_and_nonempty / required_fields.
The required-field set is auto-detected from the page’s own declared JSON-LD @type (zero-config: the page says what it is, and coverage scores how completely it described that thing). An explicit --preset / --schema overrides. A page with no JSON-LD scores coverage 0 (an agent gets prose but no entity scaffolding), never a free 1.0.
Thresholds. Good ≥0.60 · Needs Work ≥0.40 · Poor <0.40.
Known limitations. “Main content” detection is heuristic; structured coverage rewards JSON-LD presence, not correctness (AF covers correctness). A type for which CAV ships no profile falls back to a generic floor (the entity must at least carry a name).
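As a rough sketch of the SSD arithmetic, assuming token counts are already available and the required-field list has already been chosen for the page's @type. The field names in the example and the function names are illustrative assumptions, not the shipped profiles.

```python
from typing import Any, Mapping, Optional, Sequence


def structured_coverage(entity: Optional[Mapping[str, Any]],
                        required: Sequence[str]) -> float:
    # No JSON-LD entity means coverage 0, never a free 1.0.
    if not entity or not required:
        return 0.0
    present = sum(1 for field in required
                  if entity.get(field) not in (None, "", [], {}))
    return present / len(required)


def ssd(main_tokens: int, full_tokens: int,
        entity: Optional[Mapping[str, Any]],
        required: Sequence[str]) -> float:
    """SSD = 0.5 * signal_ratio + 0.5 * structured_coverage."""
    signal_ratio = min(main_tokens / full_tokens, 1.0) if full_tokens else 0.0
    return 0.5 * signal_ratio + 0.5 * structured_coverage(entity, required)


# Hypothetical required set for a product page (assumption for illustration).
required_fields = ["name", "offers", "image", "description"]
product = {"name": "Blue widget", "offers": {"price": "29.99"}, "image": ""}

print(ssd(600, 1000, product, required_fields))  # 2 of 4 fields present and non-empty
print(ssd(600, 1000, None, ["name"]))            # prose only, no scaffolding
```

With these made-up numbers the first page lands just under the Good threshold, and the page without JSON-LD is limited to half of its signal ratio.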
§3.3 ARR — Action Resolution Rate (gated)
Definition. ARR = resolved_actions / total_declared_actions, where an action resolves if any of its ordered
locators (role+name | aria-label | test-id) matches a node in the accessibility-tree snapshot.
Methodology. Headless load → accessibility.snapshot({interestingOnly:true}) → flatten → resolve each declared
action → compare to a committed golden file for drift (BROKEN = critical action lost → FAIL; MOVED = path changed → INFO).
Thresholds. Good =1.0 · Needs Work ≥0.90 · Poor <0.90.
Known limitations. Hydration timing (wait for readyState===complete + hydration signal); dynamic content needs a
declared prerequisite; accessible-name instability handled via regex locators.
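A minimal sketch of the resolution step, assuming the accessibility snapshot has already been flattened into a list of plain dictionaries. The node keys (`role`, `name`, `test_id`), the locator shape, and the `None` result for a page with no declared actions are all assumptions made for illustration; the spec does not define them here.

```python
import re
from typing import Dict, List, Optional

Node = Dict[str, str]
Locator = Dict[str, str]  # attribute name -> regex that must fully match


def locator_matches(locator: Locator, node: Node) -> bool:
    return all(re.fullmatch(pattern, node.get(key, "")) is not None
               for key, pattern in locator.items())


def resolves(locators: List[Locator], nodes: List[Node]) -> bool:
    # Ordered locators: the action resolves as soon as any one matches a node.
    for locator in locators:
        if any(locator_matches(locator, node) for node in nodes):
            return True
    return False


def arr(actions: Dict[str, List[Locator]], nodes: List[Node]) -> Optional[float]:
    """ARR = resolved_actions / total_declared_actions."""
    if not actions:
        return None  # assumption: undefined when nothing is declared
    resolved = sum(1 for locators in actions.values() if resolves(locators, nodes))
    return resolved / len(actions)


flattened = [{"role": "button", "name": "Add to cart"},
             {"role": "link", "name": "Home"}]
declared = {
    "add_to_cart": [{"role": "button", "name": r"Add to (cart|bag)"}],
    "checkout": [{"role": "button", "name": "Checkout"},
                 {"test_id": "checkout-btn"}],
}
print(arr(declared, flattened))
```

The regex locator on `add_to_cart` illustrates the accessible-name instability workaround mentioned under known limitations.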
§3.4 TC — Token Cost (supporting · tracked, not gated by default)
Definition. TC = tiktoken_cl100k.encode(agent_representation).length.
Budget tiers. Good <4,000 · Needs Work <8,000 · Poor ≥8,000. Report truncation risk vs a 128K window.
§3.5 TTFUT — Time to First Useful Token (supporting)
For streaming agents: wall-clock to the first chunk of meaningful (post-boilerplate) content.
§3.6 AF — Answer Fidelity (north star · eval-gated)
Definition. Feed the page’s agent representation to an LLM; ask canonical per-template questions; score answers
against expected truth. Weighted average over questions, majority agreement over N runs.
Match types. exact · semantic (acceptable set) · set_contains · not_hallucinated (must refuse when absent).
Judge requirements. Constrained prompt (“answer ONLY from the page; say not-available otherwise; never infer”),
temperature 0, ≥3 runs, agreement ≥ threshold. Model configurable (default cheap; local supported).
Thresholds. Good ≥0.95 · Needs Work ≥0.80 · Poor <0.80.
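One possible reading of "weighted average over questions, majority agreement over N runs" is sketched below: a question counts as answered correctly when a strict majority of its runs match the expected truth, and AF is the weight of correct questions over total weight. This interpretation, the data shape, and the function name are assumptions; the judge call and match types are out of scope.

```python
from typing import List, TypedDict


class QuestionResult(TypedDict):
    weight: float
    runs: List[bool]  # one entry per judge run: did the answer match?


def answer_fidelity(results: List[QuestionResult]) -> float:
    total_weight = sum(q["weight"] for q in results)
    if total_weight <= 0:
        raise ValueError("at least one positively weighted question is required")
    passed_weight = sum(q["weight"] for q in results
                        if sum(q["runs"]) * 2 > len(q["runs"]))  # strict majority
    return passed_weight / total_weight


sample: List[QuestionResult] = [
    {"weight": 3.0, "runs": [True, True, False]},   # price: majority correct
    {"weight": 2.0, "runs": [True, True, True]},    # availability: unanimous
    {"weight": 1.0, "runs": [False, False, True]},  # returns policy: majority wrong
]
print(round(answer_fidelity(sample), 3))
```

Here 5 of 6 weight units pass, which would fall in the Needs Work band under the thresholds above.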
§3.7 Provenance and Auditability
Principle. Measurement without provenance is assertion without evidence. This is a design principle drawn from trustworthy-AI practice (reproducibility, auditability of automated judgments) — not an endorsement of CAV by any person or organization. A score with no record of what produced it is unfalsifiable; CAV scores are gated decisions, so they must be re-derivable.
Record. Every CAV measurement emits a provenance record containing: content hashes (raw pre-JS and rendered HTML), the thresholds hash actually applied, the spec version, timestamps (fetch + score), the profile id, the scores, and the verdict + fail_reasons.
Three properties typical lab numbers lack:
- Third-party reproducibility. Re-fetch the same content + re-score with the same thresholds → the same result. CRR/SSD/ARR/TC are deterministic, so a record is a falsifiable claim, not a snapshot to be trusted.
- Temporal comparison. Two records make the cause of a change explicit — content hash moved (the page changed), thresholds hash moved (the bar changed), or profile id moved (a different consumer) — instead of an unattributed score delta.
- Chain of custody. A timestamped content hash proves what content was actually available when an agent consumed or cited the page — evidence for a citation/availability dispute, which a transient lab number cannot give.
Regulated industries. Where pricing/availability claims carry compliance weight, a hashed, timestamped, re-scorable record of what an agent could read is the difference between an audit trail and an anecdote.
The CLI emits this record via --provenance.
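To make the record and the temporal-comparison property concrete, here is a minimal sketch. The key names, the canonical-JSON hashing of thresholds, and the helper `explain_change` are assumptions for illustration and are not the CLI's actual output format.

```python
import hashlib
import json
from typing import Any, Dict, List


def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def provenance_record(*, raw_html: bytes, rendered_html: bytes,
                      thresholds: Dict[str, Any], spec_version: str,
                      profile_id: str, scores: Dict[str, float], verdict: str,
                      fail_reasons: List[str], fetched_at: str,
                      scored_at: str) -> Dict[str, Any]:
    # Canonical JSON (sorted keys) so equal thresholds always hash equally.
    canonical = json.dumps(thresholds, sort_keys=True, separators=(",", ":"))
    return {
        "content_hash_raw": sha256_hex(raw_html),
        "content_hash_rendered": sha256_hex(rendered_html),
        "thresholds_hash": sha256_hex(canonical.encode("utf-8")),
        "spec_version": spec_version,
        "profile_id": profile_id,
        "fetched_at": fetched_at,
        "scored_at": scored_at,
        "scores": scores,
        "verdict": verdict,
        "fail_reasons": fail_reasons,
    }


def explain_change(old: Dict[str, Any], new: Dict[str, Any]) -> List[str]:
    # Attribute a score delta to what moved between two records.
    causes = []
    if (old["content_hash_raw"], old["content_hash_rendered"]) != \
            (new["content_hash_raw"], new["content_hash_rendered"]):
        causes.append("content")
    if old["thresholds_hash"] != new["thresholds_hash"]:
        causes.append("thresholds")
    if old["profile_id"] != new["profile_id"]:
        causes.append("profile")
    return causes


common = dict(raw_html=b"<html>raw</html>", rendered_html=b"<html>full</html>",
              spec_version="0.1.0", profile_id="A", scores={"crr": 0.97},
              verdict="PASS", fail_reasons=[], fetched_at="2026-06-01T00:00:00Z",
              scored_at="2026-06-01T00:00:05Z")
before = provenance_record(thresholds={"crr": 0.95}, **common)
after = provenance_record(thresholds={"crr": 0.97}, **common)
print(explain_change(before, after))
```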
§3.8 TCR — Tool-Calling Readiness (supporting · tracked, not gated)
Principle. CRR/SSD/AF measure reading; TCR measures acting. As the agent web moves from answering questions to
completing tasks (add-to-cart, search, subscribe, book), a page an agent can read but not transact on is half a page.
Definition. TCR = discoverable_actions / total_actions, computed from the raw, no-JS HTML (an agent that
can’t run a browser must act from the markup). An action is a user-invocable, state-mutating affordance.
- discoverable (method + endpoint + parameter schema derivable from markup): a <form> (action + method + named inputs), or a schema.org potentialAction with a target URL/urlTemplate, or a control carrying an explicit data-url/formaction (+ optional data-method).
- not discoverable (JS-only): a labelled <button> / [role=button] / <input type=button> not wired to a form; its method/endpoint/params live in opaque JavaScript an agent can’t introspect.
- excluded (no agent-invocable intent → keep false positives near zero): plain <a href> navigation (a GET page move is not a state-mutating tool-call), and unlabelled / disabled / hidden controls.
Methodology. Parse raw HTML (jsdom). Enumerate forms + potentialAction (discoverable) and unwired labelled controls (JS-only); TCR = discoverable / total. A page with no actions scores N/A (not 1.0 or 0: there is nothing to be illegible). No LLM; deterministic.
Thresholds. Good ≥0.60 · Needs Improvement ≥0.25 · Poor <0.25. Supporting, not gated: TCR is a capability signal, not a correctness one (a failure means the agent can’t act, not that it got wrong data), and today’s median SPA scores ~0.15–0.30; gating now would fail most of the web and be bypassed. Gate only once ecosystem median crosses ~0.50. Trend it across deploys.
Validation (GET-path proven). TCR predicts ACR (Action Completion Rate): an agent reconstructs an action from markup alone and it actually completes (HTTP 2xx; deterministic, no LLM). Proven on the GET-safe path (scripts/tcr-proof.mjs; report docs/tcr-proof-report.md): across 10 real pages, 5 exposed a discoverable GET form, and 100% of them completed: the reconstructed-from-markup request returned 2xx with a non-trivial body (Wikipedia, PyPI, StackOverflow, GitHub, HN). The other 5 had no discoverable GET form (CSR/JS-only search) and TCR correctly scored them low, so the classification predicts real invocability, not mere presence. Honest boundary: GET/search actions only. POST/mutating ACR (add-to-cart, checkout), the bulk of TCR’s value, cannot be tested at scale without firing side-effecting requests at third parties; a controlled-fixture POST-ACR harness is the remaining work. TCR ships as supporting.
Known limitations. Raw-HTML parse sees inline-style/attribute hiding but not computed-style/0×0 layout hiding (a headless-render enhancement); same-origin <iframe>/shadow-DOM actions are not yet walked (cross-origin iframes counted conservatively as opaque). See §9 A7.
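The classification can be illustrated with a deliberately reduced sketch using Python's standard `html.parser` (the spec's engine parses with jsdom). It only recognises forms with an `action` attribute, and `<button>` elements outside a form, with or without `data-url`/`formaction`. It does not check labels, named inputs, hidden controls, `[role=button]`, or potentialAction, so treat it as a shape of the computation under those stated simplifications.

```python
from html.parser import HTMLParser
from typing import Optional


class ActionCounter(HTMLParser):
    def __init__(self) -> None:
        super().__init__()
        self.discoverable = 0
        self.js_only = 0
        self._form_depth = 0

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "form":
            self._form_depth += 1
            if attributes.get("action"):
                self.discoverable += 1  # endpoint is readable from markup
        elif tag == "button" and self._form_depth == 0 and "disabled" not in attributes:
            if attributes.get("data-url") or attributes.get("formaction"):
                self.discoverable += 1  # explicit endpoint on the control
            else:
                self.js_only += 1       # wiring lives in opaque JavaScript

    def handle_endtag(self, tag):
        if tag == "form" and self._form_depth:
            self._form_depth -= 1


def tcr(raw_html: str) -> Optional[float]:
    counter = ActionCounter()
    counter.feed(raw_html)
    total = counter.discoverable + counter.js_only
    return None if total == 0 else counter.discoverable / total  # None = N/A


page = (
    '<form action="/search" method="get"><input name="q"><button>Go</button></form>'
    '<button>Add to cart</button>'
    '<button data-url="/subscribe" data-method="post">Subscribe</button>'
    '<a href="/about">About</a>'
)
print(tcr(page))
print(tcr("<p>No actions here</p>"))
```

The plain link is excluded from the denominator, and the page with no actions returns `None` rather than 0 or 1.0, matching the N/A rule.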
§3.9 CEB / AUEC — Context Economy (supporting per-page · eval-gated benchmark)
Principle. Agents pay per token. The same page can be handed to an agent at very different
fidelities and costs; most agents naively feed raw HTML, burning 10–200× the tokens of a structured
extract for often-identical answer fidelity. CEB measures the answer-fidelity-per-token efficiency
frontier across representation layers and scores it as a single number, AUEC.
Layers. L0 raw no-JS HTML (markup minus script/style) · L1 Readability main · L2
structured-only (JSON-LD + OpenGraph + meta, pruned to recognized fields + capped) · L3
rendered + Readability (the fidelity ceiling).
AUEC definition. Per page, take (token_cost, AF) for each layer; AF is the §3.6/§5.1
canary-seeded, corpus-free measure (priors-immune). Normalize the token axis by T_max. At any token
budget x, an agent uses the best layer that fits, so the affordability frontier
efficiency(x) = max{ AF_i : tokens_i ≤ x·T_max } is monotone non-decreasing; AUEC = ∫₀¹ efficiency(x) dx ∈ [0,1].
This rewards reaching high AF at a low token fraction and ignores dominated layers (a costlier layer
with lower AF never raises the cumulative max). Token axis is linear (tokens map linearly to $;
saving 10k tokens at the costly end matters more than 100 at the cheap end). Token cost = full cl100k
count; AF is measured within a fixed agent context window, so a token-bloated layer that buries facts
past the window scores lower AF — the intended efficiency signal. (An earlier convex-hull variant was
falsified by the CEB benchmark — it assumed raw HTML retains high AF, which it does not; see the report.) Engine: metrics/auec.js + extractors/representations.js; benchmark scripts/ceb.mjs.
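Because efficiency(x) is a step function, the integral reduces to a sum of rectangles. The sketch below assumes T_max is supplied by the caller (the text above does not define how it is chosen) and that efficiency is 0 before the cheapest layer fits; the layer numbers are made up for illustration.

```python
from typing import List, Tuple


def auec(layers: List[Tuple[int, float]], t_max: int) -> float:
    """Area under efficiency(x) = max{AF_i : tokens_i <= x * t_max} for x in [0, 1]."""
    if t_max <= 0:
        raise ValueError("t_max must be positive")
    area, best_af, prev_x = 0.0, 0.0, 0.0
    for tokens, af in sorted(layers):       # cheapest layer first
        x = tokens / t_max
        if x > 1.0:
            break                            # layer never fits the budget axis
        area += best_af * (x - prev_x)       # rectangle up to this breakpoint
        best_af = max(best_af, af)           # dominated layers never raise the max
        prev_x = x
    return area + best_af * (1.0 - prev_x)   # final rectangle out to x = 1


# (token_cost, AF) per layer; values are illustrative only.
with_dominated = [(200, 0.9), (1000, 1.0), (8000, 0.5)]
without_dominated = [(200, 0.9), (1000, 1.0)]
print(round(auec(with_dominated, 10_000), 3))
print(round(auec(without_dominated, 10_000), 3))
```

Both calls give the same area, which shows the "ignores dominated layers" property: the costly layer with lower AF does not change the cumulative maximum.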
Thresholds. Per-page AUEC is supporting (display + an optimal-layer recommendation: the
cheapest layer whose AF is at least 0.9× the best observed AF) and never gated per-page, because some pages
genuinely need full HTML (no structured data) and that’s a site property, not a bug. The
cross-corpus benchmark is gated at mean AUEC ≥ 0.60 (an always-raw-HTML pipeline scores ≈0.5;
0.60 requires benefiting from a cheaper layer on most pages).
Validation (how it’s proven). AUEC must predict real token savings. Measured basis in
docs/context-economy-report.md: for each page, find the cheapest layer reaching AF ≥ 0.9×AF(L0);
report mean token savings (pass ≥50% with mean AF drop ≤0.05) and Spearman(AUEC, savings) (pass
≥0.65), on a CRR-spectrum sample, corpus-free judge. Deterministic for tokens, priors-immune for AF.
Known limitations. Single-page single-query (no multi-turn context accumulation); cl100k
normalization (actual cost varies ~15% by model tokenizer); ignores wall-clock latency (a cheap L2
that needs headless rendering is slower than L1 from raw HTTP). See §9 A8.
§3.10 DRI — Declared-Rendered Integrity (supporting score · gaming-flag gated)
Principle. CRR/SSD prove structured fields are present; §9 A3 proves a declared value appears
somewhere in the rendered text. DRI proves they are correct: it compares each declared value to
the value actually rendered at that field and classifies the direction of any mismatch. This is
the §2.3(c) failure made measurable — an agent reads JSON-LD price $29.99, the page actually shows
$34.99, and the agent confidently quotes the wrong price.
Definition. From the rendered HTML, for each comparable field (offers.price, offers.availability,
aggregateRating.ratingValue), extract the declared (JSON-LD) value and the rendered value (itemprop/
selector + canonical text), and classify: MATCH (numeric within 1% / enum equal), DRIFT
(differs, unfavorable to the site), or FAVORABLE_DRIFT (declared is better than reality —
cheaper price, more available, higher rating). DRI = matched / comparable;
favorable_drift_ratio = favorable / drifted. Deterministic; no LLM. N/A when no JSON-LD / no
comparable fields. Engine: extractors/dri.js.
Thresholds. DRI score is supporting (reported, not gated): a legitimate sale drifts in the
unfavorable direction (declared ≥ rendered) and must not fail a page. The bait-and-switch signal
is gated: gaming_flag = (drifted ≥ 2 AND favorable_drift_ratio ≥ 0.5) → high severity, because a
page systematically declaring better-than-rendered values to agents is actively misinforming them.
This split is false-positive-averse: sales and one-off staleness never trip the gate; only a
pattern of self-favoring drift does.
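A minimal sketch of the classification and the gate for numeric fields follows. It assumes the 1% tolerance is taken relative to the rendered value and that "drifted" counts mismatches in both directions; enum fields such as availability are omitted, and all function names are hypothetical.

```python
from typing import Dict, List, Optional, Union

MATCH, DRIFT, FAVORABLE_DRIFT = "MATCH", "DRIFT", "FAVORABLE_DRIFT"


def classify_numeric(declared: float, rendered: float,
                     lower_is_favorable: bool) -> str:
    if abs(declared - rendered) <= 0.01 * abs(rendered):
        return MATCH
    if lower_is_favorable:                   # e.g. price: cheaper looks better
        declared_is_better = declared < rendered
    else:                                    # e.g. rating: higher looks better
        declared_is_better = declared > rendered
    return FAVORABLE_DRIFT if declared_is_better else DRIFT


def summarize(labels: List[str]) -> Dict[str, Union[Optional[float], bool]]:
    comparable = len(labels)
    if comparable == 0:
        return {"dri": None, "favorable_drift_ratio": None, "gaming_flag": False}
    matched = labels.count(MATCH)
    drifted = comparable - matched
    favorable = labels.count(FAVORABLE_DRIFT)
    ratio = favorable / drifted if drifted else 0.0
    return {"dri": matched / comparable,
            "favorable_drift_ratio": ratio,
            "gaming_flag": drifted >= 2 and ratio >= 0.5}


bait = [classify_numeric(29.99, 34.99, lower_is_favorable=True),   # declared cheaper
        classify_numeric(4.8, 4.2, lower_is_favorable=False),      # declared rating higher
        classify_numeric(10.0, 10.05, lower_is_favorable=True)]    # within 1%
sale = [classify_numeric(34.99, 29.99, lower_is_favorable=True)]   # page is cheaper than declared

print(summarize(bait))
print(summarize(sale))
```

The sale case lowers the DRI score but does not trip the flag, which is the false-positive-averse split described above.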
Validation (proven — controlled experiment). A causal claim (“declared-rendered drift makes an
agent misinform”) needs the counterfactual, so it is proven on 20 matched clean/drifted page
pairs (scripts/dri-proof.mjs; report docs/dri-proof-report.md): each drifted variant declares
better-than-rendered values (price −20%, InStock over a rendered pre-order, rating +0.6).
- Detection rate = 100%: gaming_flag fires on every drifted page.
- False-positive rate = 0%: it never fires on the matched clean page (direction-awareness; a sale would drift the other way).
- Agent-misled rate = 100%: an agent fed the structured representation (the cheap layer it prefers, CEB §3.9), judged corpus-free by qwen3:8b, returned the drifted (false) price on all 20 pages. The harm DRI prevents is real, and it lives in exactly the representation efficient agents use.
Caveat: synthetic, controlled drift (required for the counterfactual), not a natural-drift field corpus. This proves the detector’s true/false-positive behaviour and the causal harm link, not the base rate of drift on the live web (field work toward §5).
Known limitations. Point-in-time snapshot (a page that drifts off-hours and is fixed by crawl time is missed; continuous monitoring is infra, not the metric); geo/A/B-personalized pricing shows one variant (measure from a clean, neutral profile = the agent’s view); text fields only. See §9 A9.
§3.11 ERI — Entity Resolution Integrity (supporting · seeds toward a gate)
Principle. Every other CAV metric assumes the agent already knows which entity the page is
about. CRR reads the facts, DRI verifies they’re honest, AF checks the answer — none ask “facts about
what?” An agent that extracts a price perfectly (CRR=1, DRI=1) but attributes it to the wrong
product variant has failed catastrophically, and the rest of the stack scores it a pass. ERI measures
whether a page gives enough deterministic signal to uniquely resolve its primary entity in a
global namespace — from markup alone, no web search, no LLM.
Definition. ERI = 0.4·coverage + 0.3·consistency + 0.3·disambiguation, on the page’s primary
entity (most-specific JSON-LD @type, identity-bearing block):
- coverage = type-expected globally-resolvable identifiers present / expected (Product → gtin/sku/mpn/canonical/@id; Person → orcid/wikidata/canonical/@id; Org → lei/wikidata/canonical/@id; Article → doi/canonical/@id/isPartOf; default → canonical/@id/sameAs).
- consistency = cross-validation of canonical / og:url / JSON-LD @id (conflicts lower it; a cross-origin canonical is penalized, §9 A13).
- disambiguation = strongest identity tier present: authority ID (Wikidata QID / ORCID / DOI / check-digit-valid GTIN / LEI) = 1.0 · namespaced @id or sku+canonical = 0.7 · canonical only = 0.4 · nothing = 0.0.
Deterministic; no LLM. Engine: extractors/eri.js (GTIN mod-10, ORCID/DOI/Wikidata format checks inline).
Thresholds. Grade A ≥0.70 (authority-grade identity) · B ≥0.30 (canonical + a typed id) · F <0.30 (unresolvable). Supporting at v0.1 (reported, not gated): content pages legitimately carry only a canonical; the metric seeds the gate (the §5 N≥500 bar) rather than failing the current web on day one.
Validation (proven: authority cross-check + A11). ERI’s authority signal must actually resolve to the right entity. Proven on 10 real Wikidata entities (scripts/eri-proof.mjs; report docs/eri-proof-report.md), each as a genuine page (name = the QID’s own label, declaring that QID) and a spoofed page (same name, a different entity’s QID):
- Genuine resolution rate = 100%: the declared QID’s label matches the page name (the identity externally resolves).
- Spoof detection rate = 100%: a mismatched QID is caught by the §9 A11 cross-check.
- ERI’s sync disambig is identical (1.0) on genuine and spoofed: markup alone cannot tell them apart, which is precisely why the external label cross-check (A11) is the necessary complement.
Priors-immune: a Wikidata QID’s label either matches or it doesn’t; no embeddings, no judge. Caveat: English labels + overlap-coefficient name match (robust for org/person, not multilingual edge cases); A11 is an offline/opt-in network check, not in the deterministic live analyzer. A larger labeled confusable corpus (Pearson r vs ERI score) remains §5 field work.
Known limitations. Validates identifier format + cross-consistency, not real-world truth (a format-valid GTIN can still be the wrong product; that’s AF’s job); primary-entity only (a 10-product comparison page scores its primary); point-in-time; live authority cross-fetch (Wikidata-label vs schema:name, §9 A11) is a documented follow-up, not in v0.1. See §9 A10.
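The weighted sum, the grade bands, and the GTIN mod-10 check can be sketched as follows. The check-digit routine follows the standard GS1 weighting (3 and 1 alternating from the right); the function names are hypothetical and the three component scores are taken as given inputs rather than extracted from markup.

```python
def gtin_check_digit_valid(gtin: str) -> bool:
    # GS1 mod-10: weights 3,1,3,1... applied right-to-left over the body digits.
    if not (gtin.isascii() and gtin.isdigit()) or len(gtin) not in (8, 12, 13, 14):
        return False
    digits = [int(ch) for ch in gtin]
    body, check = digits[:-1], digits[-1]
    total = sum(d * (3 if i % 2 == 0 else 1) for i, d in enumerate(reversed(body)))
    return (10 - total % 10) % 10 == check


def eri_score(coverage: float, consistency: float, disambiguation: float) -> float:
    """ERI = 0.4*coverage + 0.3*consistency + 0.3*disambiguation."""
    return 0.4 * coverage + 0.3 * consistency + 0.3 * disambiguation


def eri_grade(score: float) -> str:
    if score >= 0.70:
        return "A"
    if score >= 0.30:
        return "B"
    return "F"


print(gtin_check_digit_valid("4006381333931"))        # well-formed EAN-13
print(gtin_check_digit_valid("4006381333932"))        # last digit altered
print(eri_grade(eri_score(0.6, 1.0, 1.0)))            # authority ID present
print(eri_grade(eri_score(0.2, 1.0, 0.4)))            # canonical only
```

A passing check digit only raises the disambiguation tier; as the limitations note, it says nothing about whether the GTIN belongs to the product on the page.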
§4 Reference Agent Profiles
| Profile | Consumer | Sees |
|---|---|---|
| A · raw-http | curl-like, no JS | raw HTML only |
| B · headless | Puppeteer/Chromium | fully rendered DOM |
| C · a11y-tree | computer-use agents | accessibility tree |
| D · markdown | Jina-like readers | converted markdown |
CRR/SSD scores vary by profile — report which profile produced a score. Profile definitions live in spec/profiles/*.json.
§4.1 Agent Classes & the Collection Event
A profile (above) is how CAV measures. An agent class is who is consuming the page in the wild — a
separate axis, used by field collection. The canonical taxonomy is the single source of truth in
spec/agent-classes.json (loaded by @cav/agent-vitals and @cav/cav-collector so the two cannot drift):
llm-crawler · search-crawler · agent-browser · generic-fetcher · human. Only the first four are
collectable; human is never reported (the middleware passes humans through).
Field telemetry flows as a collection event with two fidelities, distinguished by a source field so they are
never conflated in aggregation:
- middleware-estimate: @cav/agent-vitals server middleware. Privacy-first: it emits an already-hashed origin_hash (never a URL), the agent class, and a heuristic crr_estimate + confidence (the server knows if it shipped SSR or a CSR shell, but not the true rendered DOM). These estimates converge at scale and are never averaged into measured percentiles.
- cli-lab / platform-scan: real deterministic crr/ssd/tc from @cav/cli or a scheduled scan.
The collector stores origin-only (like CrUX): it drops any URL and the path_pattern on ingest, keeping the
hashed origin, agent class, source, low-cardinality page_type, the metric(s), month, and cav_version. The origin
hash is sha256(origin) with a published algorithm so anyone can reproduce their own.
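A small sketch of how a site owner might reproduce their own origin hash. The exact serialization of the origin before hashing is an assumption here (lowercased scheme and host, port only when explicit, no trailing slash); the published algorithm is the authority on that detail.

```python
import hashlib
from urllib.parse import urlsplit


def origin_of(url: str) -> str:
    parts = urlsplit(url)
    if not parts.scheme or not parts.hostname:
        raise ValueError(f"not an absolute URL: {url!r}")
    origin = f"{parts.scheme}://{parts.hostname}"  # urlsplit lowercases both
    if parts.port is not None:
        origin += f":{parts.port}"
    return origin


def origin_hash(url: str) -> str:
    # Path and query are dropped before hashing, mirroring origin-only storage.
    return hashlib.sha256(origin_of(url).encode("utf-8")).hexdigest()


print(origin_of("https://Example.com/products/42?ref=agent"))
print(origin_hash("https://Example.com/products/42?ref=agent")
      == origin_hash("https://example.com/"))
```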
§5 Threshold Rationale
v0.1 thresholds were educated defaults; they are now anchored to a first measured distribution (see
docs/field-data-report.md): a 2026-06 scan of 299 pages across two corpora — Tranco top-120 homepages (n=89
scored) and a 210-page content corpus across six diverse sitemaps (n=210, 0 errors).
- CRR 0.95 — validated by the data, not just by analogy. CRR is bimodal on both corpora: a population near 0 (client-rendered / JS-walled) and a population near 1.0 (SSR-complete), with a near-empty middle (homepage p75 1.0, content p75 1.0; ~14–27% of pages score <0.1). The 0.95 cutoff sits in the empty valley, cleanly separating “an agent gets the page” from “it doesn’t.” Keep.
- SSD 0.60 — deliberately aspirational, above the current web’s top quartile. Measured p75 is ~0.51 on both corpora (homepage 0.52, content 0.51). A “good = p75” rule would put GOOD at ~0.52; we hold it at 0.60 on purpose — SSD is a brand-new signal and today’s structured-data hygiene is poor, so anchoring “good” to the current p75 would canonise that poverty. Revisit toward p75 only if adoption data shows 0.60 is unreachable.
- TC 4K — content pages run higher than homepages (content median 224, p90 2,722; homepage p90 864) but the long tail stays under 4K; TC is supporting/ungated, so leniency is intended (a budget warning, not a gate).
- ARR 1.0 — a missing critical action is a broken integration. AF 0.95 — below it the page misinforms agents.
Machine-readable in spec/thresholds.json; values unchanged at v0.1 (measured N=299 < the N≥500 bar we set for moving a canonical threshold); SemVer-bumped with recorded basis in CHANGELOG.md as more field data accrues.
Field data for the newer metrics (TCR/DRI/ERI), descriptive — docs/field-data-v2-report.md, N=103 and accruing.
The original N=299 scan predated TCR §3.8 / DRI §3.10 / ERI §3.11, so their thresholds were educated guesses; this is
their measured web distribution (deterministic, one pass per URL; dataset persisted at data/field-data.ndjson, grown
idempotently toward the N≥500 bar). Observations stable across N=40→103, not threshold moves:
- TCR median 0.04 (p75 0.29, p90 1.00; n=87 action-bearing pages) — most pages expose almost no agent-invocable actions (button+fetch SPAs). Empirically confirms TCR must stay supporting, not gated (§3.8): a gate would fail the majority of the web.
- ERI median 0.55, grade mix A/B/F = 19/61/23 of 103 — most pages reach grade B (canonical + a typed id); authority identity (grade A: Wikidata/GTIN/ORCID) is a ~18% minority. Confirms ERI supporting (§3.11) at v0.1.
- CRR p50 1.00 (bimodal, p25 0.85) and SSD p75 0.65 — consistent with the original N=299 scan.
- DRI had no comparable-field (price/availability/rating JSON-LD) pages in this general corpus — DRI bites on commerce, as designed; a product-page corpus is the right field-data source for it (noted gap; its base rate is adoption-gated, while its detector is already proven, §3.10). Descriptive increments toward the N≥500 bar; no canonical threshold moves until N≥500 + reference-judge agreement.
Reference-judge agreement (AF judge-independence) — docs/af-interrater-report.md. AF underpins PVB/CEB/DRI,
so the outcome must not be an artifact of one judge model. Two corpus-free judges of different size and family
(qwen3 8B vs qwen2.5-coder 14B) answered the same canary cloze on the same content (N=12): 95% per-question
agreement (57/60), mean per-page AF difference 0.05 — strong judge-independence. (The §5 ρ≥0.8 rank-form of
the bar is degenerate on this corpus — AF is at a ceiling, ~all pages 1.0, so there is no rank variance; per-question
agreement is the correct measure for binary canary outcomes.) On a variance-bearing corpus (raw repr, mixed CRR,
AF spanning 0–1, N=20): per-question agreement 91%, but Spearman ρ = 0.64 — moderate, below the ρ≥0.8 bar (the
partial-AF disagreements add rank noise). Verdict: judge-independence holds at the per-question level (91–95%); the
strict ρ≥0.8 rank-form is not yet met, so the reference-judge prerequisite is only partially satisfied and a
canonical threshold must not move on this evidence — tightening (more pages / a third judge) is the remaining step.
Judges are corpus-free by requirement; a retrieval model is excluded (it answers from its corpus, not the page).
§5.1 Proxy→Outcome Validation (the proxies predict the outcome — measured, not asserted)
§2.3 and §3.6 assert that the cheap deterministic proxies (CRR/SSD) predict the expensive outcome (AF — can an
agent recover the page’s facts). This subsection replaces that assertion with a measurement. Method:
Canary-Seeded Extraction (CSE) — for each page, extract facts that demonstrably appear in the rendered content,
replace each with a synthetic canary value the judge cannot know a priori, inject the same canaries into the
rendered and the raw (no-JS) extraction, and ask a corpus-free local judge (no retrieval) for each canary via a
context-anchored cloze. Returning the canary means it read the page; returning the real value is priors leakage.
This makes the ground truth deterministic and immune to both RAG-corpus and parametric-prior contamination — the LLM
only demonstrates the read. Harness: scripts/pvb.mjs; engine: extractors/canary-facts.js + af/canary.js;
report: docs/proxy-outcome-report.md.
Result (N=27 valid pages of 33 fetched, CRR spectrum SSR/static → CSR SPA, judge qwen3:8b, 2026-06):
- CRR → AF_raw: Spearman ρ = 0.70 (t≈4.9, df=25, p<0.001) — CRR predicts the absolute agent answer-fidelity on a page’s facts. Strengthened from ρ=0.53 at the N=13 seed as N grew.
- CRR → recovery (AF_raw/AF_rendered): ρ = 0.68 (p<0.001) — predicts how much of the content an agent recovers from no-JS HTML.
- CRR predicts AF_raw≥0.8: ROC AUC = 1.00 — perfect separation of answerable-from-raw vs not, on this sample.
- Priors-leak control: mean 0.01 — canaries hold at larger N; the outcome is not the judge’s world-knowledge.
- Honest finding: the composite is worse than CRR alone. composite (0.6·CRR+0.4·SSD) → AF_raw: ρ = 0.41, below CRR’s 0.70. SSD adds noise, not signal, to AF prediction; CRR is the dominant predictor of fact recovery. (SSD measures entity scaffolding, a different axis; this argues against blending it into an AF-predicting composite, not against SSD itself.) Recorded as evidence, not hidden.
- Known divergence (reported, not hidden): a small number of prerendered pages score moderate CRR yet recovery≈1.0. Raw HTML carries the facts while the rendered tree adds tokens, deflating the CRR ratio. CRR can understate recovery for prerendered-but-token-heavy pages; a known limitation of the ratio, not a measurement error.
Power: at the observed ρ≈0.68–0.70, N=27 yields p<0.001 — past the §5 significance bar (the rule-of-thumb N≥28 for 80% power at ρ≈0.7 is essentially met; the measured p is already well below 0.01).
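The reported t and df for the CRR → AF_raw correlation can be re-derived from ρ and N using the usual t approximation for a rank correlation, t = ρ·sqrt((N−2)/(1−ρ²)). The sketch below only checks that arithmetic; it does not recompute ρ from page data.

```python
import math
from typing import Tuple


def spearman_t(rho: float, n: int) -> Tuple[float, int]:
    """t statistic and degrees of freedom for a correlation rho over n pairs."""
    if n < 3 or not -1.0 < rho < 1.0:
        raise ValueError("need n >= 3 and |rho| < 1")
    df = n - 2
    return rho * math.sqrt(df / (1.0 - rho ** 2)), df


t_stat, df = spearman_t(0.70, 27)
print(round(t_stat, 2), df)
```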
Replication (N=46 valid of 59, broader/more-diverse corpus, 2026-07). Re-running the same canary-seeded, corpus-free method on a larger page set holds the classification result — CRR predicts AF_raw≥0.8 at ROC AUC = 0.95, priors-leak 0.00 — while the rank correlation moderates to Spearman ρ ≈ 0.5. This is the expected shape of a bimodal corpus: as the sample fills in the extremes, “whether a page is legible” separates cleanly (high AUC), while “how legible, ranked” gets noisier near the ceiling (lower ρ). The headline claim is therefore the AUC, not the ρ — CRR reliably predicts whether an agent recovers a page’s facts.
Status. Direction is now measured evidence, not assertion, at a publishable significance. Still a single
corpus-free judge + page-derived canary ground truth, so not a basis for moving a canonical threshold, which
requires the §5 N≥500 bar with reference-judge agreement (ρ≥0.8 inter-rater), recorded in CHANGELOG.md. SemVer: a clarification
(no metric redefinition) → minor bump.
§6 CI Gate Integration
Sequence per money-template: (1) render raw + headless → agent representation; (2) compute CRR/SSD/TC (cheap, every commit); (3) run ARR vs golden; (4) AF eval (expensive — sample: full on template/layout change, skip on content tweak). PASS/WARN/FAIL with an override-and-audit policy (a documented override is allowed but logged).
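One way the cheap, every-commit part of the gate could map scores to a verdict is sketched below, using the §3 thresholds for the three gated pillars. The mapping itself (any Poor → FAIL, any Needs Work → WARN, otherwise PASS) is an assumption for illustration; the text above does not spell it out, and AF sampling and the override-and-audit path are left out.

```python
from typing import Dict, List, Tuple

# (good, needs_work) lower bounds from §3.1, §3.2, §3.3.
GATED_THRESHOLDS: Dict[str, Tuple[float, float]] = {
    "crr": (0.95, 0.80),
    "ssd": (0.60, 0.40),
    "arr": (1.00, 0.90),
}


def grade(metric: str, value: float) -> str:
    good, needs_work = GATED_THRESHOLDS[metric]
    if value >= good:
        return "Good"
    if value >= needs_work:
        return "Needs Work"
    return "Poor"


def gate(scores: Dict[str, float]) -> Tuple[str, List[str]]:
    grades = {m: grade(m, v) for m, v in scores.items() if m in GATED_THRESHOLDS}
    fail_reasons = sorted(m for m, g in grades.items() if g == "Poor")
    if fail_reasons:
        return "FAIL", fail_reasons
    if any(g == "Needs Work" for g in grades.values()):
        return "WARN", []
    return "PASS", []


print(gate({"crr": 0.97, "ssd": 0.62, "arr": 1.0}))
print(gate({"crr": 0.97, "ssd": 0.45, "arr": 1.0}))
print(gate({"crr": 0.50, "ssd": 0.62, "arr": 1.0}))
```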
§7 Versioning Policy
SemVer. MAJOR = metric removed/redefined; MINOR = metric added or threshold loosened with rationale; PATCH = docs/typos.
Every threshold change records the field-data basis in CHANGELOG.md.
§8 Prior Art & Differentiation
A mature web-performance program — field data + a lab tool + a CI gate — is the structural model. Fragmented prior art — Googlebot/Vercel crawlability (guidance, no metric), schema.org/Rich Results (opt-in, no S/N metric), CDP/computer-use (a11y as agent API, unmeasured), Jina/Firecrawl (markdown conversion, measures nothing), RAGAS/DeepEval (generic evals, not page fidelity), MCP/A2A (protocol not measurement), GEO/AI-SEO (marketing checklists). None is a named, measurable standard with metrics + thresholds + a CI gate. That’s the gap CAV fills. Architecture follows the familiar scorecard + CI-gate model.
§9 Adversarial Robustness & Known Gaming Vectors
A score is only credible if a high one cannot be faked without genuinely serving agents. We enumerate the gaming vectors and CAV’s defense for each — published with the metrics, the way mature standards publish “how this can mislead” alongside the definitions. Self-disclosure is the trust signal; a silent metric invites the gaming it hides.
| # | Attack | Targets | Status |
|---|---|---|---|
| A1 | Server-rendered placeholder text | CRR | Closed |
| A2 | Preset omission for free structured-data credit | SSD | Closed |
| A3 | Present-but-false JSON-LD values | SSD | Largely closed (v0.1.1) |
| A4 | Cosmetic aria-label affordances | ARR | Partially closed (v0.1.1) |
| A5 | Cloaking (UA/IP-dependent content) | all | Detected (v0.1.1) |
| A6 | Content minimization for TC | TC | By design |
| A7 | Fake affordances inflating TCR | TCR | Mitigated (v0.1.x), deterministic |
| A8 | Layer padding / invisible characters | CEB | Mitigated (v0.1.x), deterministic |
| A9 | Bait-and-switch structured data | DRI | Detected + gated (v0.1.x), deterministic (§3.10) |
| A10 | Authority-ID stuffing / spoofing | ERI | Mitigated (v0.1.x), deterministic (§3.11) |
§9.1 Defenses in detail
A1 — Server-rendered placeholder text
Attack: Server-render placeholder text (“Loading…”, “enable JavaScript”) with token volume ≈ real content → CRR → fake 1.0.
Closed. §3.1 content_overlap term: raw must contain the rendered words, not just a similar token count.
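To see why a word-overlap term defeats this attack, compare a count-only ratio with an overlap ratio on the same pair of texts. Both functions below are illustrative stand-ins written for this example; they are not the §3.1 formula, and the regex tokenizer is an assumption.

```python
import re


def words(text: str) -> list[str]:
    """Lowercased word tokens (a simple stand-in tokenizer)."""
    return re.findall(r"\w+", text.lower())


def token_ratio(raw: str, rendered: str) -> float:
    """Count-only comparison: how much token volume the raw HTML has."""
    rendered_count = len(words(rendered))
    if rendered_count == 0:
        return 1.0
    return min(len(words(raw)), rendered_count) / rendered_count


def content_overlap(raw: str, rendered: str) -> float:
    """Share of distinct rendered words that also occur in the raw text."""
    rendered_words = set(words(rendered))
    if not rendered_words:
        return 1.0
    return len(rendered_words & set(words(raw))) / len(rendered_words)


raw = "loading please enable javascript to view"
rendered = "blue kettle price 49 in stock"
print(token_ratio(raw, rendered))      # 1.0: the volumes match
print(content_overlap(raw, rendered))  # 0.0: none of the words do
```

The placeholder matches the rendered page on volume but shares none of its words, so any score that includes the overlap cannot reach 1.0 this way.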
A2 — Preset omission for free structured-data credit
Attack: No --preset → free structured-data credit → SSD → fake 1.0.
Closed. §3.2 auto-detects required fields from the page’s own @type; no JSON-LD ⇒ coverage 0.
A3 — Present-but-false JSON-LD values
Attack: Stuff JSON-LD with fields that are present but false (price $10 in markup, $100 on page) → SSD → fake 1.0.
Largely closed (v0.1.1). A deterministic structured-data-consistency check now verifies each cited JSON-LD value (price, availability, name, rating, sku…) actually appears in the rendered page text; a declared-but-absent value is flagged (SSD/AF risk) on every scan — no LLM, reproducible. AF remains the backstop for values that are present-but-semantically-wrong. Residual gap: a value present on the page but in a different role (e.g. a $10 shipping fee matching a declared $10 price) can still pass the presence check — AF covers that.
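A presence check of this kind might look like the sketch below, which also reproduces the residual gap. The function names, the whitespace/case normalization, and the boundary rule (a value must not be embedded in a longer number or word) are assumptions made for illustration.

```python
import re


def _normalize(text: str) -> str:
    return re.sub(r"\s+", " ", text).strip().lower()


def _appears(value: str, page: str) -> bool:
    # Require boundaries so "10" does not match inside "100" or "110.00".
    pattern = r"(?<![\w.])" + re.escape(value) + r"(?!\w)"
    return re.search(pattern, page) is not None


def absent_declared_values(declared: dict[str, str], rendered_text: str) -> list[str]:
    """Return the declared fields whose values never appear in the page text."""
    page = _normalize(rendered_text)
    return sorted(
        name for name, value in declared.items()
        if not _appears(_normalize(str(value)), page)
    )


declared = {"name": "Blue Kettle", "price": "10"}
print(absent_declared_values(declared, "Blue Kettle Price: $100"))
# ['price']
print(absent_declared_values(declared, "Blue Kettle Price: $100 Shipping: $10"))
# [] (the $10 shipping fee satisfies the presence check: the residual gap)
```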
A4 — Cosmetic aria-label affordances
Attack: aria-label cosmetic elements so locators resolve, though clicking does nothing → ARR → fake 1.0.
Partially closed (v0.1.1). A deterministic actionability check reads the platform’s own verdict (no click simulation): a genuine control (interactive role or native control/link) that is visible but has pointer-events:none is flagged — it looks clickable but clicks pass through. Deterministic, reproducible, near-zero false-positive (landmark aria-labels and honest disabled/hidden states are intentionally NOT flagged — flagging them eroded trust in testing). Residual gap: a control with aria-label but a no-op/missing handler that isn’t CSS-blocked needs runtime behavior observation — out of deterministic scope; golden-file drift + AF cover it.
A5 — Cloaking (UA/IP-dependent content)
Attack: Cloaking — serve rich content to cav’s UA/IP, thin content to real agents (or vice-versa) → all.
Detected (v0.1.1). A deterministic dual-UA check fetches the same URL with a browser UA and an LLM-crawler UA (GPTBot), extracts main content (Readability), and compares the two via the CRR word-overlap. Parity <0.90 triggers an A/B noise gate that re-fetches both: random A/B, geo, or personalization variance tends to flip high on the second round and PASSes, while deterministic UA-cloaking stays low across both rounds → FLAG. The check reports which UA was served thin. This is detection, not prevention, but it is now a first-class signal rather than just a recommendation; IP-based cloaking still needs ≥2 IPs (documented limit).
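The control flow of the dual-UA check and its noise gate can be sketched with the fetcher injected as a function, so the logic is testable offline. The 0.90 threshold and the two rounds come from the text; the `parity` formula (shared words over the larger word set), the `check_cloaking` name, and the result shape are assumptions standing in for the real CRR word-overlap and Readability extraction.

```python
import re
from typing import Callable

PARITY_THRESHOLD = 0.90  # from the spec text


def word_set(text: str) -> set[str]:
    return set(re.findall(r"\w+", text.lower()))


def parity(a: str, b: str) -> float:
    """Stand-in overlap score: shared words over the larger word set."""
    wa, wb = word_set(a), word_set(b)
    if not wa and not wb:
        return 1.0
    return len(wa & wb) / max(len(wa), len(wb))


def check_cloaking(fetch_main_content: Callable[[str], str], rounds: int = 2) -> dict:
    """fetch_main_content(ua_name) returns extracted main content for that UA."""
    scores: list[float] = []
    thin_ua = None
    for _ in range(rounds):
        browser = fetch_main_content("browser")
        crawler = fetch_main_content("crawler")
        score = parity(browser, crawler)
        scores.append(score)
        if score >= PARITY_THRESHOLD:
            # Parity on any round means the earlier gap was noise.
            return {"verdict": "PASS", "scores": scores, "thin_ua": None}
        fewer_for_crawler = len(word_set(crawler)) <= len(word_set(browser))
        thin_ua = "crawler" if fewer_for_crawler else "browser"
    # Low on every round: deterministic, UA-dependent difference.
    return {"verdict": "FLAG", "scores": scores, "thin_ua": thin_ua}
```

A fetcher that always serves the crawler a stub is flagged after the second round, while one that differs only on the first fetch passes.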
A6 — Content minimization for TC
Attack: Minimize page content so TC looks “GOOD” while the page says nothing → TC → fake GOOD.
By design. TC is supporting, not gated (§3.4) precisely so volume can’t be gamed into a pass; an empty page tanks SSD signal and AF answers.
A7 — Fake affordances inflating TCR
Attack: Inflate TCR with fake affordances — hidden forms (display:none), duplicate form + potentialAction for one action, or unlabelled decoy buttons → TCR → fake GOOD.
Mitigated (v0.1.x), deterministic. Hidden controls (inline display:none/visibility:hidden, [hidden], aria-hidden on the element or an ancestor) are excluded from both numerator and denominator; discoverable actions are deduped by normalized endpoint URL (a form and a potentialAction at the same target count once); unlabelled controls are excluded (no agent-invocable intent). TCR is also supporting, not gated (§3.8), so inflating it wins nothing. Residual: computed-style/0×0 hiding and endpoint-liveness (HEAD) checks need the headless pass (documented limit).
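The three filters (hidden, unlabelled, duplicate endpoint) can be sketched as follows. The `Control` record, the `countable_actions` helper, and the specific URL normalization (lowercased scheme and host, trailing slash and fragment dropped) are assumptions for illustration; the spec text only says endpoints are normalized.

```python
from dataclasses import dataclass
from urllib.parse import urlsplit, urlunsplit


@dataclass(frozen=True)
class Control:
    label: str       # accessible name; empty means unlabelled
    endpoint: str    # form action or potentialAction target
    hidden: bool = False


def normalize_endpoint(url: str) -> str:
    parts = urlsplit(url)
    path = parts.path.rstrip("/") or "/"
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, parts.query, ""))


def countable_actions(controls: list[Control]) -> list[str]:
    """Endpoints that count as discoverable actions, in first-seen order."""
    seen: list[str] = []
    for control in controls:
        if control.hidden or not control.label.strip():
            continue  # hidden or unlabelled: excluded entirely
        key = normalize_endpoint(control.endpoint)
        if key not in seen:
            seen.append(key)  # a form and a potentialAction at one target count once
    return seen


controls = [
    Control("Add to cart", "https://Shop.example/cart/"),
    Control("BuyAction", "https://shop.example/cart#buy"),
    Control("", "https://shop.example/decoy"),
    Control("Subscribe", "https://shop.example/sub", hidden=True),
]
print(countable_actions(controls))  # ['https://shop.example/cart']
```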
A8 — Layer padding / invisible characters
Attack: Inflate AUEC by padding the cheap layer (junk JSON-LD to bloat L2) or gaming token counts with zero-width/invisible characters → CEB → fake GOOD.
Mitigated (v0.1.x), deterministic. L2 is pruned to recognized structured fields and capped (a token budget), so junk JSON-LD can’t both stay in L2 and stay cheap; all layer content is NFC-normalized with zero-width characters stripped before tokenizing, so invisible codepoints can’t move the token count. AUEC is also supporting per-page (§3.9). Canary AF is unaffected: canaries replace real facts in the page content, so padding the page doesn’t plant answers.
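The normalization step can be illustrated with the standard library. The set of zero-width code points and the whitespace tokenizer below are assumptions made for the example; the spec text does not list the exact characters or name the tokenizer.

```python
import unicodedata

# Assumed set of invisible code points to strip (not taken from the spec).
ZERO_WIDTH = {"\u200b", "\u200c", "\u200d", "\u2060", "\ufeff"}


def clean(text: str) -> str:
    """NFC-normalize, then drop zero-width characters."""
    normalized = unicodedata.normalize("NFC", text)
    return "".join(ch for ch in normalized if ch not in ZERO_WIDTH)


def count_tokens(text: str) -> int:
    """Whitespace token count after cleaning (stand-in for a real tokenizer)."""
    return len(clean(text).split())


padded = "a\u200b b \u200b c"
print(len(padded.split()))   # 4: the lone zero-width space counts as a token
print(count_tokens(padded))  # 3: same as the unpadded "a b c"
```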
A9 — Bait-and-switch structured data
Attack: Bait-and-switch via structured data — declare a better-than-rendered value to agents (lower price, InStock over a pre-order, inflated rating) so an agent over-promises to the user → DRI → fake GOOD; agent misinforms.
Detected + gated (v0.1.x), deterministic (§3.10). DRI compares each declared JSON-LD value to the rendered value and classifies drift direction; a pattern of self-favoring drift (drifted ≥ 2 AND favorable_drift_ratio ≥ 0.5) sets gaming_flag (high severity). Direction-awareness makes it false-positive-averse: a legitimate sale drifts the opposite way (declared ≥ rendered) and never trips the gate. Residual: computed-style/Shadow-DOM price hiding and currency-without-priceCurrency need the headless DOM-eval / currency cross-check (documented follow-ups).
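The gate condition can be sketched for numeric fields. The thresholds (drifted ≥ 2, favorable_drift_ratio ≥ 0.5) come from the text; the `DeclaredField` record, the `lower_is_better` flag, and the choice of drifted fields as the ratio's denominator are assumptions, since §3.10 is not reproduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DeclaredField:
    name: str
    declared: float        # value in JSON-LD
    rendered: float        # value shown on the page
    lower_is_better: bool  # True for price, False for rating


def drift_direction(f: DeclaredField) -> str:
    if f.declared == f.rendered:
        return "none"
    if f.lower_is_better:
        declared_looks_better = f.declared < f.rendered
    else:
        declared_looks_better = f.declared > f.rendered
    return "favorable" if declared_looks_better else "unfavorable"


def gaming_flag(fields: list[DeclaredField]) -> bool:
    directions = [drift_direction(f) for f in fields]
    drifted = sum(d != "none" for d in directions)
    favorable = directions.count("favorable")
    ratio = favorable / drifted if drifted else 0.0
    return drifted >= 2 and ratio >= 0.5


bait = [
    DeclaredField("price", declared=10.0, rendered=100.0, lower_is_better=True),
    DeclaredField("rating", declared=4.9, rendered=3.2, lower_is_better=False),
]
sale = [DeclaredField("price", declared=100.0, rendered=80.0, lower_is_better=True)]
print(gaming_flag(bait))  # True
print(gaming_flag(sale))  # False
```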
A10 — Authority-ID stuffing / spoofing
Attack: Inflate ERI by stuffing dozens of sameAs URIs, or claiming an authority ID (Wikidata QID / GTIN) that isn’t really the entity → ERI → fake GOOD.
Mitigated (v0.1.x), deterministic (§3.11). sameAs coverage counts only the first 3 distinct authority domains, and coverage is scored against type-specific expected identifiers, so dumping generic URIs adds nothing; GTINs must pass the mod-10 check digit to count; a cross-origin canonical is penalized (can’t inherit another domain’s identity). Semantic authority-spoofing (a valid-format Wikidata QID for the wrong entity) is addressed by the A11 cross-check — fetch the declared QID’s label, compare to schema:name — now proven offline (scripts/eri-proof.mjs: 100% spoof detection / 100% genuine resolution on 10 entities); it ships as an opt-in/offline check (network), not in the deterministic live analyzer. Residual: like DRI, format-valid-but-wrong on non-authority IDs is AF’s job, not a page-level deterministic check.
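The GTIN check-digit test is the standard GS1 mod-10 calculation: weight the digits 3, 1, 3, 1, ... starting from the digit next to the check digit, sum them, and compare the check digit with the amount needed to reach the next multiple of 10. The function name and the accepted lengths (8, 12, 13, 14) are choices made for this sketch.

```python
def gtin_check_digit_ok(gtin: str) -> bool:
    """Validate a GTIN-8/12/13/14 string with the GS1 mod-10 check digit."""
    if not (gtin.isascii() and gtin.isdigit()) or len(gtin) not in (8, 12, 13, 14):
        return False
    digits = [int(ch) for ch in gtin]
    body, check = digits[:-1], digits[-1]
    total = 0
    # Walk right-to-left from the digit beside the check digit: weights 3, 1, 3, ...
    for index, digit in enumerate(reversed(body)):
        total += digit * (3 if index % 2 == 0 else 1)
    return (10 - total % 10) % 10 == check


print(gtin_check_digit_ok("4006381333931"))  # True
print(gtin_check_digit_ok("4006381333932"))  # False
```

A format-valid GTIN still says nothing about whether it identifies this entity, which is why the semantic cross-check is handled separately.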
Principle. No single deterministic metric is un-gameable in isolation; the gate’s robustness comes from the combination — a page gaming CRR or SSD alone still fails AF, and AF is the outcome the proxies must predict (§5, proxy→outcome validation). Each new gaming vector found in the field is added here with its defense and status.