Independent Test of Grok Skills Memory Persistence Still Missing
The persistent-memory claim is the whole point. Everything else is noise.
xAI launched Grok Skills on May 18th with a one-line pitch: teach Grok something once and it remembers across every conversation. That sounds obvious until you remember what "remember" actually means in AI infrastructure — and how many times vendors have sold context windows as memory.
Here's the thing worth testing before anyone builds a product on this claim: is Grok Skills' persistence a genuine cross-session state mechanism, or a session-ID wrapper that reconstructs context from whatever you paste back in?
Because those are two completely different engineering problems, with completely different reliability profiles, and only one of them is worth betting your workflow on.
What Grok Skills Actually Ships
The announcement is straightforward. Every Grok 4.3 account gets built-in skills for Word documents, PowerPoint decks, Excel spreadsheets, and PDFs — formatted output that preserves headings, tables, and styles. Users can create custom skills by describing them in natural language, uploading a file, or writing from scratch. Skills live at the account level, invoke via slash command, and override defaults when you add your own version.
On the developer side, the Grok 4.3 Responses API handles tool calling with an OpenAI-compatible format. Grok 4.3 supports up to 128 parallel tool calls per request, maintains a 1 million token context window, and routes built-in tools (web search, X search, code execution) server-side while custom functions run client-side. Grok Build — xAI's coding agent, separate from Skills — is in beta at $1.00 per million input tokens and $2.00 per million output tokens.
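To make the server-side versus client-side split concrete, here is a minimal sketch of what running custom functions on the client could look like. It assumes tool calls arrive in the OpenAI-style shape of a function name plus a JSON string of arguments. The `word_count` tool, the `LOCAL_TOOLS` registry, and the plain-dict call format are hypothetical illustrations, not xAI's documented schema, and no network request is made.

```python
import json
from typing import Any, Callable

# Hypothetical local tool; the name and signature are illustrative only.
def word_count(text: str) -> int:
    return len(text.split())

# Registry of functions the client is willing to execute on the model's behalf.
LOCAL_TOOLS: dict[str, Callable[..., Any]] = {"word_count": word_count}

def run_tool_call(call: dict[str, str]) -> str:
    """Run one model-requested function call locally and return a JSON result string."""
    func = LOCAL_TOOLS.get(call["name"])
    if func is None:
        return json.dumps({"error": f"unknown tool: {call['name']}"})
    kwargs = json.loads(call["arguments"])  # assumed: arguments arrive as a JSON string
    return json.dumps({"result": func(**kwargs)})

def run_parallel_calls(calls: list[dict[str, str]]) -> list[str]:
    """Handle a batch of calls from one response, preserving their order."""
    return [run_tool_call(c) for c in calls]

if __name__ == "__main__":
    batch = [
        {"name": "word_count", "arguments": json.dumps({"text": "teach Grok once"})},
        {"name": "not_registered", "arguments": "{}"},
    ]
    for output in run_parallel_calls(batch):
        print(output)
```

The point of the registry is that the client, not the model, decides which functions can run; an unknown name produces an error payload rather than an exception.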
That's the product. None of it is novel. OpenAI has skills. Anthropic has projects with context. The question is what xAI specifically built that others haven't.
The Number Worth Sitting With
A third-party benchmark from AI engineer Chew Loong Nian, published May 2nd on Towards AI, ran Grok 4.3 against Claude Opus 4.7 across 18 long-horizon agent tasks. The results: Grok 4.3 won 13 of 18, finishing in roughly half the wall-clock time. The cost for the identical task battery was $7.84 on Grok 4.3 versus $71.50 on Claude Opus 4.7, roughly a 9x cost advantage in favor of xAI.
On the Artificial Analysis Intelligence Index, Grok 4.3 scores 53, placing it above Claude Sonnet 4.6 and four points clear of Grok 4.20. The Elo jump from Grok 4.20 to Grok 4.3 on agentic benchmarks was +321 points, xAI's largest single-release agentic improvement to date. Grok 4.3's API pricing landed at $1.25 per million input tokens and $2.50 per million output tokens, a reduction from Grok 4.20 pricing.
The catch: this is a single benchmark from a single tester with a small sample size. The full methodology is paywalled. xAI has not published a formal model card for Grok 4.3. Take the numbers seriously, not as gospel.
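To put a number on "small sample size", the sketch below computes a 95% Wilson score interval for a 13-of-18 win rate, plus the cost ratio implied by the two reported totals. The Wilson interval is my choice of method, not something the benchmark author used, and it treats the 18 tasks as independent trials, which is an assumption.

```python
import math

def wilson_interval(wins: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 gives about 95%)."""
    p = wins / trials
    denom = 1 + z**2 / trials
    center = (p + z**2 / (2 * trials)) / denom
    margin = z * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2)) / denom
    return center - margin, center + margin

# Figures reported in the benchmark: 13 wins out of 18 tasks, $7.84 vs $71.50.
low, high = wilson_interval(13, 18)
cost_ratio = 71.50 / 7.84

if __name__ == "__main__":
    print(f"win rate: {13 / 18:.2f}, 95% interval: {low:.2f} to {high:.2f}")
    print(f"cost ratio: {cost_ratio:.1f}x")
```

Under these assumptions the lower bound of the interval sits close to a coin flip, which is why 13 of 18 is suggestive rather than conclusive.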
The Real Test Nobody Has Done
The persistence question is the load-bearing issue for anyone evaluating Grok Skills as an agent platform layer. "Remembering" across sessions is the core promise — and the core claim that deserves actual verification.
A genuine persistent memory mechanism means server-side state that survives session termination. A session wrapper means the model receives whatever context you re-inject on next launch, with no stateful continuity underneath. The behavioral difference is testable: create a Grok Skill with a complex, multi-step workflow, terminate the session, resume days later, and observe whether Grok actually resumes the interrupted task or requires full re-explanation.
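The test described above can be sketched as a small record-and-classify harness. Because Skills are not exposed through the API, the observations would be entered by hand after each manual trial. Every name here (`ResumeTrial`, `Verdict`, `classify`) and the decision rule itself are hypothetical, one reasonable way to score the outcome rather than an established protocol.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional

class Verdict(Enum):
    STATE_SURVIVED = "consistent with server-side state"
    NO_STATE_SURVIVED = "consistent with a session wrapper"
    INCONCLUSIVE = "inconclusive"

@dataclass
class ResumeTrial:
    days_between_sessions: int
    context_pasted_on_resume: bool     # did the tester re-supply any prior context?
    steps_completed_before_exit: int   # last workflow step finished in session one
    resumed_at_step: Optional[int]     # step Grok picked up at; None if it asked for re-explanation

def classify(trial: ResumeTrial) -> Verdict:
    """Score one manual trial of the terminate-and-resume test."""
    if trial.context_pasted_on_resume:
        # Re-injected context contaminates the trial: either mechanism would appear to work.
        return Verdict.INCONCLUSIVE
    if trial.resumed_at_step is None:
        return Verdict.NO_STATE_SURVIVED
    if trial.resumed_at_step == trial.steps_completed_before_exit + 1:
        return Verdict.STATE_SURVIVED
    # Resumed, but at the wrong step: partial or lossy recall, needs more trials.
    return Verdict.INCONCLUSIVE

if __name__ == "__main__":
    # Made-up example values, not real observations.
    example = ResumeTrial(days_between_sessions=3, context_pasted_on_resume=False,
                          steps_completed_before_exit=4, resumed_at_step=None)
    print(classify(example).value)
```

The key design choice is treating any trial where the tester pasted context back in as inconclusive, since that is exactly the condition under which a wrapper and a genuine state mechanism look identical.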
Per InfoQ's coverage, community reaction on X highlighted practical value for workflows — one developer connected Grok to their GitHub account to read and commit, adding a context.md file to maintain state across development conversations. Another noted that custom skills and workflow automation have become the default in other AI tools and Grok was catching up.
The terminate-and-resume test, the one that would show whether the persistence is architectural or cosmetic, has not been published by any independent reviewer.
Until someone runs it, the "persistent" in Grok Skills remains an adjective, not a technical description.
Why the Cost Story Is Real Even If the Benchmark Isn't
The benchmark needs independent confirmation. The cost dynamics do not.
Grok 4.3 at $1.25/$2.50 per million input/output tokens versus Claude Opus 4.7 at $3/$15 is public pricing. At those rates Grok 4.3 is 2.4x cheaper on input and 6x cheaper on output, so the gap on any real agentic workload falls between those two figures depending on the input/output mix, regardless of which specific benchmark you trust. If you're running agent loops that consume 50 million input tokens and 50 million output tokens per month, the difference between xAI and Anthropic pricing at current rates is the difference between a $187.50/month API bill and a $900/month one. Scale that to an enterprise deployment and the number is serious money.
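The monthly bill depends heavily on the split between input and output tokens, so it helps to make that split explicit. The sketch below uses the per-million-token prices quoted in this article. The model keys, the function name, and the example token mixes are illustrative assumptions, not measured workloads.

```python
# (input price, output price) in USD per million tokens, as quoted in this article.
PRICES = {
    "grok-4.3": (1.25, 2.50),
    "claude-opus-4.7": (3.00, 15.00),
}

def monthly_cost(input_tokens: int, output_tokens: int, model: str) -> float:
    """Return the monthly API bill in USD for the given token volumes."""
    in_price, out_price = PRICES[model]
    return input_tokens / 1_000_000 * in_price + output_tokens / 1_000_000 * out_price

if __name__ == "__main__":
    # Two assumed mixes: an even split, and an input-heavy agent loop.
    for in_tok, out_tok in [(50_000_000, 50_000_000), (45_000_000, 5_000_000)]:
        grok = monthly_cost(in_tok, out_tok, "grok-4.3")
        claude = monthly_cost(in_tok, out_tok, "claude-opus-4.7")
        print(f"{in_tok:,} in / {out_tok:,} out: "
              f"${grok:,.2f} vs ${claude:,.2f} ({claude / grok:.1f}x)")
```

Input-heavy loops pull the ratio toward the 2.4x input gap and output-heavy ones toward the 6x output gap, so the mix is worth measuring before quoting a single multiplier.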
The implication is not that Claude Opus 4.7 is a bad model. It's that the unit economics of agentic pipelines are now a legitimate procurement criterion, not just a performance discussion. Any organization running autonomous agents at scale should be running the math on Grok 4.3 right now, with the caveat that the API does not yet expose Grok Skills — so the persistent workflow layer is currently consumer-facing only.
What Still Needs Answering
Three questions that matter more than the benchmark scores:
First: does Grok Skills persistence actually hold across sessions with real workflows — not demo scenarios? The answer determines whether this is platform infrastructure or a consumer convenience feature.
Second: when does Grok Skills reach the API? xAI confirmed Skills are live on Grok 4.3 across web, iOS, and Android. Enterprise adoption is gated on API availability.
Third: does Anthropic respond? Claude Opus 4.7's pricing was set for a world where the performance delta justified it. If Grok 4.3's agentic numbers hold up under independent scrutiny, the pressure on Anthropic to adjust is immediate and material.
The Grok Skills announcement is real. The persistent memory claim is testable. The cost advantage is documentable. Run the test before building on it — but do not ignore the signal while waiting.