I Tested the Tool That Claims to Expose AI Agents' Hidden Graph. Here Is What It Actually Sees.
A tool promising to expose the hidden architecture of AI agent systems — the web of tool calls, memory pipelines, and decision paths that make multi-agent setups hard to debug — produced a revealing set of results when I ran it against a live CrewAI codebase cloned from GitHub. The top-level inspect command, run from the workspace root, returned NoFrameworkDetected. The project's agents.py contains from crewai import Agent — an explicit, unambiguous import. From that vantage, the tool found nothing.
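The test does not show why the root-level scan came back empty, and the following is not AgentLantern's implementation. As a minimal sketch of what import-based detection over a whole directory tree can look like, assuming a hypothetical KNOWN_FRAMEWORKS table and hypothetical function names, the standard-library ast module is enough to find a line such as from crewai import Agent in any nested file:

```python
import ast
from pathlib import Path

# Hypothetical mapping from top-level package name to a framework label.
KNOWN_FRAMEWORKS = {"crewai": "CrewAI", "langgraph": "LangGraph"}


def imported_packages(source: str) -> set[str]:
    """Return the top-level package names imported by Python source text."""
    packages = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            packages.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            # level == 0 skips relative imports such as "from . import x".
            packages.add(node.module.split(".")[0])
    return packages


def detect_frameworks(root: str) -> dict[str, list[str]]:
    """Map each known framework to the files under root that import it."""
    hits: dict[str, list[str]] = {}
    for path in sorted(Path(root).rglob("*.py")):
        try:
            found = imported_packages(path.read_text(encoding="utf-8"))
        except (SyntaxError, UnicodeDecodeError):
            continue  # Skip files that cannot be read or parsed.
        for pkg in found.intersection(KNOWN_FRAMEWORKS):
            hits.setdefault(KNOWN_FRAMEWORKS[pkg], []).append(str(path))
    return hits
```

Because the walk is recursive, a scan started at a workspace root would reach a file in a nested directory such as integrations/CrewAI-LangGraph. A tool that only inspects the starting directory, or expects a particular layout, would not.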
From the correct subdirectory, the results were different but not fully reassuring. lantern docs and lantern lint, run from integrations/CrewAI-LangGraph, located the agent code and generated architecture documentation: five agents with roles and goals, five tasks with descriptions, and a Mermaid flowchart showing task-to-agent assignments. Lint found five errors — tasks declared without assigned agents — and six warnings. The agent-specific fields (role, goal, tools, LLM) and the agent-task relationship graph are not what a generic dependency listing or tree view produces. In that narrow sense, the tool does surface agent graph structure that standard static analysis misses.
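To make the docs and lint output described above concrete, here is a rough sketch of how a static pass could pull agents and tasks out of CrewAI-style source, flag tasks with no assigned agent, and emit a Mermaid flowchart. It is an illustration under stated assumptions, not AgentLantern's code: the SAMPLE source, the function names, and the error wording are all made up, and it only handles top-level assignments that pass agent= as a plain variable name.

```python
import ast

# Illustrative CrewAI-style source. It is parsed, never executed,
# so crewai does not need to be installed. All names are made up.
SAMPLE = '''
from crewai import Agent, Task

researcher = Agent(role="Researcher", goal="Find sources", backstory="Analyst")
writer = Agent(role="Writer", goal="Draft the report", backstory="Editor")

research = Task(description="Collect sources", expected_output="Links", agent=researcher)
draft = Task(description="Write the draft", expected_output="Text", agent=writer)
review = Task(description="Review the draft", expected_output="Notes")
'''


def collect(source):
    """Return (agents, tasks): variable -> role, and variable -> agent variable."""
    agents, tasks = {}, {}
    for node in ast.parse(source).body:
        if not (isinstance(node, ast.Assign) and isinstance(node.value, ast.Call)):
            continue
        target, call = node.targets[0], node.value
        if not (isinstance(target, ast.Name) and isinstance(call.func, ast.Name)):
            continue
        kwargs = {kw.arg: kw.value for kw in call.keywords if kw.arg}
        if call.func.id == "Agent":
            role = kwargs.get("role")
            agents[target.id] = role.value if isinstance(role, ast.Constant) else None
        elif call.func.id == "Task":
            agent = kwargs.get("agent")
            # None means the task has no agent= that resolves to a simple name.
            tasks[target.id] = agent.id if isinstance(agent, ast.Name) else None
    return agents, tasks


def lint(tasks):
    """One error message per task that has no assigned agent."""
    return [f"error: task '{name}' has no assigned agent"
            for name, agent in tasks.items() if agent is None]


def to_mermaid(tasks):
    """Render task-to-agent assignments as a Mermaid flowchart."""
    lines = ["flowchart LR"]
    for name, agent in tasks.items():
        if agent is not None:
            lines.append(f"    {name} --> {agent}")
    return "\n".join(lines)


agents, tasks = collect(SAMPLE)
print(lint(tasks))
print(to_mermaid(tasks))
```

On this sample the lint pass reports the review task and the diagram links the other two tasks to their agents. Real projects often wire agents through decorated methods or configuration files instead of plain assignments, which a pass this simple would not follow.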
But the limitations are as informative as the detections. The architecture diagram labels every agent as unknown_agent and marks the crew class as not detected — lantern found the agents but failed to recognize them as a CrewAI crew. The task-to-agent wiring in the Mermaid diagram connects everything to unknown_agent, which is wrong. It found the pieces but missed the structure. The detection failure is not an absence of output — it is incorrect output about the one thing that matters most: how the agents actually connect.
That failure is the story. Not because one tool failed one test, but because it exposes a structural gap in the observability layer for agentic systems — and because that gap has consequences that extend beyond developer inconvenience.
The tool is real. AgentLantern installs via pip at version 0.1.22, ships with inspect, lint, docs, web, play, and replay commands, and its creator, Brell Sanwouo, describes it as a personal research project from INRIA, France's national computer science institute. The GitHub repository has 23 commits and 3 stars. The PyPI download count barely registers. There is no Discord community, no filed issues, no mailing list — Sanwouo's Reddit posts to r/MachineLearning, r/LangChain, r/AI_Agents, and r/OpenSourceAI are the project's entire distribution channel.
The theory behind AgentLantern is coherent: a static analyzer focused on project structure should, in principle, be able to surface agent graph information (the relationships between agents, tasks, and tools defined in code) that runtime tracers miss. Runtime tracers observe what happens when code executes; static analysis sees how code is wired before it runs. That distinction is real. For projects that conform to AgentLantern's expected directory layout and import patterns, the approach may deliver on that theory. For the rest, as this test showed, it either returns NoFrameworkDetected instead of an architecture map or finds the pieces but misidentifies the structure.
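The static-versus-runtime distinction can be shown with a toy example that involves no agent framework at all. In this sketch, where SOURCE and both helper functions are hypothetical, a static pass lists every function a router could call, while a runtime trace records only the one a given input actually reached:

```python
import ast

# A toy "router" with two possible tool calls. Names are illustrative.
SOURCE = '''
def search(q):
    return f"results for {q}"

def calculate(q):
    return "42"

def route(q):
    if q.startswith("calc:"):
        return calculate(q)
    return search(q)
'''


def static_calls(source, func_name):
    """Names of plain functions called inside func_name, found without running it."""
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.FunctionDef) and node.name == func_name:
            return {
                call.func.id
                for call in ast.walk(node)
                if isinstance(call, ast.Call) and isinstance(call.func, ast.Name)
            }
    return set()


def runtime_calls(source, func_name, arg, watched):
    """Names of watched functions actually invoked when func_name(arg) runs."""
    namespace = {}
    exec(source, namespace)
    seen = set()

    def wrap(name, fn):
        def wrapper(*args, **kwargs):
            seen.add(name)  # Record the call, then delegate.
            return fn(*args, **kwargs)
        return wrapper

    for name in watched:
        namespace[name] = wrap(name, namespace[name])
    namespace[func_name](arg)
    return seen


print(static_calls(SOURCE, "route"))
print(runtime_calls(SOURCE, "route", "weather", ["search", "calculate"]))
```

The static pass reports both search and calculate; the trace for the input "weather" reports only search. Neither view is complete on its own: the trace misses wiring that did not execute, and the static pass cannot say which path a real input takes.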
The observability problem AgentLantern identifies is real. An April Dev.to survey found developers rate trace-level visibility, tool-call logging, and state inspection as table-stakes requirements for production agentic systems — baseline expectations, not future aspirations. Multiple commercial products address parts of this space: LangSmith offers trace visualization across agent steps, Datadog has AI agent monitoring, and LangGraph and CrewAI both ship built-in debugging tools. Augment Code's tool registry, published this month, lists over a dozen options in the same problem space.
But the coverage is uneven. No single tool comprehensively addresses structural graph analysis, the question of which agents are defined, how they are connected, and which tools they call, across all common agent frameworks. AgentLantern aimed at that specific gap. Its partial failure on this test, in which it found the agents but misidentified the crew structure, is a data point suggesting the problem is harder than it looks.
The consequences of this gap are not hypothetical. Agentic systems are increasingly deployed in domains where their decisions carry real weight: credit, healthcare, legal process, infrastructure operations. When something goes wrong in production — an agent recommends a wrong action, makes an unexplained decision, behaves unexpectedly under edge conditions — the ability to reconstruct what the agent did and why is not a developer convenience. It is a precondition for accountability in consequential deployments. Whether current tools adequately support that reconstruction is an open question — this test did not evaluate whether LangSmith, Datadog, or other alternatives actually solve the structural graph problem for production systems. The gap claim is about the category, not a verdict on any specific competitor.
What to watch: whether AgentLantern's framework detection improves in a future release, whether the structural observability gap gets filled by the existing commercial players, or whether it remains a gap that production deployments work around rather than solve. The tool is early infrastructure with a specific technical claim that doesn't yet cover the full range of how agent code is written. One test is one data point — instructive, but not architectural proof. The hidden graph surfaced partially this time, and what it showed was incomplete. That may change.