Every agent observability tool in production today has the same blind spot: it watches individual agents but misses what happens in the delegation chain between them. A misconfigured agent that routes EU citizen data to a US server does so in milliseconds. Detecting it fifteen seconds later on a dashboard is not governance. It is a postmortem.
That gap is the problem GAAT, a research paper from Anshul Pathak et al. submitted to arXiv on April 6, was built to close. GAAT stands for Governance-Aware Agent Telemetry, and the paper describes a five-layer reference architecture that combines cryptographic provenance chains with sub-200ms policy enforcement to catch and stop cross-agent violations at runtime rather than flagging them after the fact.
The architectural problem
The tools teams reach for when they want visibility into agent behavior are OpenTelemetry, Langfuse, and NeMo Guardrails. Each does something useful. None of them solves the core issue, which is that violations happen in the handoff between agents, not inside any single agent's execution.
NeMo Guardrails, which NVIDIA positions as an enterprise guardrailing solution, achieved 78.8% on GAAT's Violation Prevention Rate metric. GAAT itself achieved 98.3% on the same test, using synthetic injection flows across a five-agent e-commerce system, and 99.7% on 12,000 empirical production-realistic traces gathered over 48 hours. The gap is not a benchmark artifact. It reflects a structural difference in what each approach can see.
The observe-but-do-not-act problem is architectural. Per-agent boundary checks can only evaluate what happens inside one agent's execution context. Cross-agent lineage tracking requires maintaining a provenance chain across multiple agents and their delegation relationships. The paper gives a concrete example: an OrderAgent sending EU PII to a ShippingAgent, which forwards it to an AnalyticsAgent hosted in a US region. Each individual handoff looks legitimate under per-agent inspection. The violation is in the cumulative effect, which no single boundary check can see.
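The handoff sequence above can be made concrete. Here is a minimal sketch of the blind spot, assuming a simple in-process lineage record (the `Handoff` and `Lineage` structures, agent names, and field names are illustrative, not the paper's implementation):

```python
from dataclasses import dataclass, field

@dataclass
class Handoff:
    """One delegation step: which agent sent data to which, and where the receiver runs."""
    sender: str
    receiver: str
    receiver_region: str

@dataclass
class Lineage:
    """Provenance chain for one piece of tagged data across a delegation sequence."""
    data_tag: str
    hops: list = field(default_factory=list)

    def record(self, hop: Handoff):
        self.hops.append(hop)

def per_hop_ok(hop: Handoff) -> bool:
    # A per-agent boundary check evaluates one handoff at a time.
    # Each individual transfer looks legitimate in isolation.
    return True

def chain_violates_residency(lineage: Lineage) -> bool:
    # A cross-agent check sees the cumulative path: EU-tagged data
    # must never reach a non-EU region anywhere along the chain.
    return lineage.data_tag == "EU-PII" and any(
        hop.receiver_region != "EU" for hop in lineage.hops
    )

# The paper's example: OrderAgent -> ShippingAgent -> AnalyticsAgent (US-hosted).
lineage = Lineage("EU-PII")
lineage.record(Handoff("OrderAgent", "ShippingAgent", "EU"))
lineage.record(Handoff("ShippingAgent", "AnalyticsAgent", "US"))

print(all(per_hop_ok(h) for h in lineage.hops))  # every hop passes in isolation
print(chain_violates_residency(lineage))          # the cumulative path is a violation
```

The point of the sketch is structural: the violation predicate takes the whole chain as input, so no check scoped to a single hop can express it.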
Dashboard-only observability fares worse. The paper's OpenTelemetry and Langfuse baseline achieved 27.1% VPR, because governance treated as visualization is not governance at all. The paper does not say this charitably.
What GAAT actually does
The GAAT architecture separates agent execution from governance instrumentation through five distinct layers. Agent Execution runs the agents themselves, built in Python 3.11 using LangChain. Governance Instrumentation embeds telemetry collection into agent operations without modifying agent code. A Trusted Telemetry Plane, using Kafka 3.6 as the event transport, aggregates events across agents. Policy Evaluation runs OPA v0.60 with 25 declarative policies, evaluating in parallel with max-action semantics. Graduated Enforcement, implemented in a Go component called the Governance Enforcement Bridge (GEB, roughly 1,800 lines of code), executes the actual intervention.
The total system is approximately 4,200 lines of code across Python agents, the Go enforcement bridge, and supporting infrastructure running on a three-node Kubernetes k3s cluster.
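What flows between the Governance Instrumentation layer and the Trusted Telemetry Plane is just a stream of signed, hashed event records. A rough sketch of one such record, assuming a JSON-serializable shape (every field name here is a guess for illustration, not the paper's schema):

```python
import json
import time
import uuid

def make_telemetry_event(agent_id: str, action: str, payload_hash: str) -> dict:
    """Build one governance telemetry event, roughly as the instrumentation
    layer might emit it onto the Kafka transport for policy evaluation.
    The payload itself stays with the agent; only its hash travels."""
    return {
        "event_id": str(uuid.uuid4()),
        "agent_id": agent_id,
        "action": action,
        "payload_sha256": payload_hash,
        "ts_ms": int(time.time() * 1000),
    }

event = make_telemetry_event("OrderAgent", "delegate:ShippingAgent", "ab" * 32)
print(json.dumps(event, indent=2))
```

Keeping the payload out of the event and shipping only its hash is what lets the telemetry plane stay trusted without also becoming a second copy of every sensitive payload.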
Policy decisions happen fast enough not to be the bottleneck. OPA evaluated the 25 policies at 127 milliseconds median latency and 192 milliseconds at P99, composed in parallel with max-action semantics. That is not a research prototype number. That is a number you can build around.
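Max-action semantics is simple to state: evaluate every policy independently, then act on the most severe verdict. A sketch, with the severity ordering taken from the paper's L0-L4 ladder (the two policy functions are invented for illustration):

```python
from concurrent.futures import ThreadPoolExecutor

# Enforcement levels ordered by severity (L0..L4 in the paper).
ALLOW, ALERT, FLAG, REDIRECT, QUARANTINE = range(5)

def residency_policy(event: dict) -> int:
    # Hypothetical rule: EU-tagged data reaching a non-EU region is quarantined.
    if event.get("data_tag") == "EU-PII" and event.get("region") != "EU":
        return QUARANTINE
    return ALLOW

def scope_policy(event: dict) -> int:
    # Hypothetical rule: use of an undeclared tool is flagged for review.
    if event.get("tool") not in event.get("declared_tools", []):
        return FLAG
    return ALLOW

POLICIES = [residency_policy, scope_policy]

def evaluate(event: dict) -> int:
    """Evaluate all policies in parallel; max-action composition means the
    final decision is the most severe individual verdict."""
    with ThreadPoolExecutor() as pool:
        verdicts = list(pool.map(lambda p: p(event), POLICIES))
    return max(verdicts)

event = {"data_tag": "EU-PII", "region": "US",
         "tool": "http_post", "declared_tools": ["http_post"]}
print(evaluate(event))  # 4 (QUARANTINE): the residency verdict dominates
```

The composition rule is what makes parallel evaluation safe: policies never need to know about each other, because severity, not ordering, decides the outcome.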
The cryptographic layer handles provenance. Every agent action gets an ECDSA P-256 signature, content gets hashed with SHA-256, and sensitive payloads use AES-256-GCM encryption. An omission detector, built on a Hidden Markov Model, achieves 92.3% detection of dropped telemetry events. This is the supply chain integrity layer: you cannot fake a provenance chain you never submitted.
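The chain property can be sketched with content hashing alone. The paper signs each action with ECDSA P-256; the sketch below uses bare SHA-256 linking to show why a dropped or reordered step is detectable (a structural illustration, not the paper's implementation):

```python
import hashlib

def link(prev_digest: str, action: bytes) -> str:
    """Each entry commits to the previous digest, so omitting or reordering
    an action changes every digest after it. In GAAT the entry would also
    carry an ECDSA P-256 signature binding it to a specific agent."""
    return hashlib.sha256(prev_digest.encode() + action).hexdigest()

GENESIS = "0" * 64
chain = [GENESIS]
for action in [b"order:create", b"delegate:shipping", b"delegate:analytics"]:
    chain.append(link(chain[-1], action))

def verify(actions, digests) -> bool:
    """Recompute the chain from the claimed actions; any tampered or
    omitted step breaks the match at that point."""
    d = GENESIS
    for action, expected in zip(actions, digests[1:]):
        d = link(d, action)
        if d != expected:
            return False
    return True

print(verify([b"order:create", b"delegate:shipping", b"delegate:analytics"], chain))  # True
print(verify([b"order:create", b"delegate:analytics"], chain))  # False: omitted step detected
```

This is the "cannot fake a provenance chain you never submitted" property in miniature: an agent that skips a telemetry event produces a chain that no longer verifies, which is exactly the signal the omission detector consumes.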
Graduated enforcement
The enforcement model has five levels. L0 ALLOW passes traffic through. L1 ALERT generates a notification. L2 FLAG marks the interaction for review. L3 REDIRECT reroutes the action through a compliance pathway. L4 QUARANTINE revokes the offending agent's Kafka consumer group membership, invalidates its tool-access tokens, and applies a Kubernetes NetworkPolicy deny rule. The escalation ladder matters because it lets operators define policy that matches their risk tolerance rather than an all-or-nothing binary.
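In code, the ladder is an ordered mapping from level to intervention; the operationally interesting part is that operators decide which policy outcomes map to which rung. A sketch (the intervention strings are stand-ins for the real Kafka, token-service, and Kubernetes API calls the GEB would make):

```python
from enum import IntEnum

class Enforcement(IntEnum):
    """The paper's five graduated levels, ordered by severity."""
    ALLOW = 0       # pass traffic through
    ALERT = 1       # generate a notification
    FLAG = 2        # mark the interaction for review
    REDIRECT = 3    # reroute through a compliance pathway
    QUARANTINE = 4  # revoke Kafka membership, invalidate tokens, network deny

def enforce(level: Enforcement, agent_id: str) -> str:
    # Stand-in actions; the real enforcement bridge talks to Kafka,
    # the token service, and the Kubernetes API.
    actions = {
        Enforcement.ALLOW: "pass",
        Enforcement.ALERT: f"notify operators about {agent_id}",
        Enforcement.FLAG: f"queue {agent_id} interaction for review",
        Enforcement.REDIRECT: f"reroute {agent_id} via compliance pathway",
        Enforcement.QUARANTINE: f"isolate {agent_id}: revoke consumer group, "
                                f"invalidate tokens, apply NetworkPolicy deny",
    }
    return actions[level]

print(enforce(Enforcement.QUARANTINE, "ShippingAgent"))
```

Because the levels are an ordered enum, "never auto-quarantine, cap at REDIRECT" is a one-line operator policy (`min(level, Enforcement.REDIRECT)`), which is the risk-tolerance knob the escalation ladder exists to provide.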
This is where GAAT diverges most sharply from the observability-first tools. Alerting is not enforcement. A dashboard that tells you an EU PII violation occurred is documenting an incident, not preventing one.
The regulatory angle
The paper frames its motivation partly in regulatory terms. The EU AI Act classifies many agent-based systems as high-risk and requires continuous monitoring with enforcement capabilities. The NIST AI Risk Management Framework calls for the same. These are not future requirements for companies operating in regulated industries: the EU AI Act's high-risk provisions are in force, and the enforcement obligations attach to systems, not to humans reviewing dashboards the next morning.
What GAAT provides is a technical mechanism for the enforcement piece that the regulations require and that dashboard-only approaches cannot satisfy. The paper's authors note this explicitly, positioning the system as infrastructure for compliance rather than just ops tooling.
This framing is probably intentional and probably correct. Companies that need to demonstrate continuous monitoring with enforcement to a regulator need something that produces an audit trail with cryptographic integrity and runtime blocking capability. GAAT is designed for that exact requirement. Whether it will see production adoption versus remain a research reference implementation is a separate question.
The cross-agent lineage question
The most technically novel contribution in the paper is cross-agent lineage tracking. Existing per-agent tools reason about an individual agent's inputs and outputs. GAAT maintains a provenance chain that tracks how data and authority flow across a delegation sequence. The notation the paper uses is something like ⟨EU-PII, a1→a4→a5, jurisdiction=US⟩: a piece of EU PII data that passed through agents a1, a4, and a5, with the final agent hosted in a US region.
That notation is not just academic. The policy rules that catch data residency violations, scope violations, and tool usage chain violations all depend on being able to reconstruct the full delegation path. Without lineage tracking, you cannot write those rules. Without cryptographic provenance, you cannot trust the lineage data you have.
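A residency rule over the paper's lineage tuple can be written almost directly from the notation. Assuming the tuple carries the data class, the delegation path, and the terminal jurisdiction (the `LineageTuple` type and its field names are illustrative):

```python
from typing import NamedTuple

class LineageTuple(NamedTuple):
    """Mirrors the paper's ⟨data-class, delegation-path, jurisdiction⟩ notation."""
    data_class: str    # e.g. "EU-PII"
    path: tuple        # e.g. ("a1", "a4", "a5")
    jurisdiction: str  # region hosting the terminal agent

def residency_rule(t: LineageTuple) -> bool:
    """Data-residency policy: EU PII must not terminate outside the EU.
    Returns True if compliant. The rule is only expressible because the
    delegation path and terminal jurisdiction travel with the data."""
    return not (t.data_class == "EU-PII" and t.jurisdiction != "EU")

t = LineageTuple("EU-PII", ("a1", "a4", "a5"), "US")
print(residency_rule(t))       # False: violation, should escalate
print(" -> ".join(t.path))     # a1 -> a4 -> a5: the reconstructed delegation chain
```

A per-agent tool evaluating a5 in isolation has no `path` field to consult, which is the structural point: the rule is trivial to write once the lineage data exists, and impossible to write without it.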
NeMo Guardrails and similar per-agent tools cannot write those rules because they do not have the delegation chain data. They see a1, then a5, with no record that the data came from the same origin. This is the 19.5 percentage point VPR gap in structural terms: it is not that GAAT evaluates policies more carefully, it is that GAAT evaluates a different problem.
What this means for the infrastructure stack
GAAT is a research paper, not a shipped product. The authors do not release a production-ready implementation in the paper, and the evaluation is based on a controlled test environment rather than production deployment at scale. The 48-hour trace collection and the e-commerce simulation are realistic but not real-world.
The gap this paper identifies is, nonetheless, real and well-documented. Every team that has deployed agent systems in regulated environments has encountered the dashboard problem: you can see what happened, but you cannot stop it from happening, and your compliance team wants the second thing.
The OPA-based approach is notable because OPA is already deployed widely. Organizations that already use OPA for Kubernetes policy or service mesh authorization have the operational expertise to run GAAT-style policy evaluation. The graduated enforcement model maps onto existing incident response workflows. The cryptographic provenance layer is the novel piece that requires new tooling, but it is also the piece that makes the audit trail regulator-ready.
The irony is not lost: the paper describes building infrastructure to govern agents, which is exactly the kind of meta-infrastructure that the agent ecosystem will eventually need. Whether that need is met by research papers, by framework vendors adding governance layers, or by the OpenTelemetry community extending their data model is an open question. The problem is real. The solution is not yet standard.