SAP model matches tuned machine learning models for classification, trails on numeric forecasts.
A Microsoft researcher downloaded SAP's tabular AI model, ran it, and found it works — but only up to a point.

SAP built a model that it says can replace an entire category of enterprise machine learning tools, and it published the code for anyone to check. So why is a Microsoft researcher still the only person who has run it against the real-world problems it targets?
Amit Lal, a researcher at Microsoft, published what appears to be the first independent evaluation of RPT-1 in February 2026, benchmarking it against tuned gradient-boosted decision tree models on three enterprise scenarios: demand forecasting, data integrity classification, and financial risk prediction, according to his paper on arXiv. The results largely confirm what SAP claims while revealing a sharp asymmetry the company's own benchmarks did not emphasize.

RPT-1 reached 91 to 96 percent of tuned-model accuracy without any task-specific training. On classification tasks, which ask questions like whether a payment will be delayed or a transaction is fraudulent, the gap between RPT-1 and a well-tuned model was modest: 3.6 to 4.1 percentage points on AUC-ROC, a standard classification metric. But on regression tasks, which ask how many days late a payment will be, the gap widened to 8.9 to 11.1 percentage points on R-squared, a common measure of prediction accuracy for continuous values.
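For readers who want to see what those two metrics actually measure, here is a minimal pure-Python sketch with toy numbers (not Lal's data). AUC-ROC is the probability that a model ranks a random positive example above a random negative one; R-squared is the share of variance in the target that the predictions explain.

```python
def auc_roc(labels, scores):
    """AUC-ROC: fraction of (positive, negative) pairs the model ranks
    correctly, counting ties as half. 1.0 is perfect, 0.5 is chance."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def r_squared(actual, predicted):
    """R-squared: 1 minus (residual error / variance around the mean).
    1.0 is perfect; 0.0 is no better than predicting the mean."""
    mean = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot

# Classification: was the payment late (1) or on time (0)?
print(auc_roc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75

# Regression: how many days late was it?
print(r_squared([0, 2, 5, 10], [1, 2, 4, 9]))
```

On this scale, a four-point AUC-ROC gap means scoring, say, 0.92 where the tuned model scores 0.96.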
The more practically significant finding was a crossover effect at roughly 75 to 100 context rows. In this regime, where the model is given a small number of labeled examples rather than trained from scratch, RPT-1 actually outperformed XGBoost, a widely used algorithm for structured data prediction. This matters because most real-world labeled datasets are small. Building a hundred-row training set is realistic; building a ten-thousand-row training set is not.
Philipp Herzig, SAP's chief technology officer, described the stakes on the No Priors podcast. A pharmaceutical company running operations across 90 countries, he said, would under the traditional approach need 180 separate models for payment-delay prediction alone, one per country, each requiring its own data preparation, feature engineering, and maintenance cycle. "You end up with 180 models you need to train," Herzig said. RPT-1, which SAP calls a Relational Pretrained Transformer, is designed to collapse that overhead into a single model that needs no task-specific training: give it a handful of labeled examples in context and let it predict.
SAP's own benchmarks are favorable: up to 2X prediction quality compared to narrow AI models, 3.5X compared to language models, and 100,000 times fewer floating point operations than an LLM performing the same task, according to the SAP Community Blog and the SAP product page. The model is compact at 64.6 megabytes, trained on 1.34 terabytes of structured data across 3.1 million tables. But enterprise buyers evaluating the technology face a familiar gap in AI: vendor benchmarks reflect vendor data environments and vendor-selected problem configurations. Independent benchmarks matter.
Lal's evaluation suggests RPT-1 works, but unevenly. Classification tasks with limited labeled data look like a genuine win. Regression tasks requiring precise numeric predictions still show a meaningful accuracy gap, and teams that cannot tolerate it will need traditional models anyway. Lal's proposed hybrid workflow reflects this: use RPT-1 for rapid screening where speed and zero-training overhead outweigh marginal accuracy loss, then selectively train gradient-boosted decision trees where the stakes justify the effort.
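Lal's hybrid workflow amounts to a routing decision, which can be sketched in a few lines of Python. The thresholds and return labels below are illustrative assumptions drawn from the figures in this article (the roughly 100-row crossover and the wider regression gap), not an implementation from Lal's paper:

```python
def route_task(task_type: str, labeled_rows: int, accuracy_critical: bool) -> str:
    """Pick a modeling approach per the hybrid workflow described above.

    task_type: "classification" or "regression"
    labeled_rows: how many labeled examples exist for this task
    accuracy_critical: whether a few points of accuracy justify
        building and maintaining a task-specific model
    """
    # Small labeled sets are where RPT-1 beat XGBoost in Lal's tests
    # (assumed cutoff: the ~100-row crossover reported above).
    if labeled_rows <= 100:
        return "rpt-1 in-context"
    # Regression showed an 8.9-11.1 point R-squared gap, so precise
    # numeric forecasts go to a tuned gradient-boosted tree model.
    if task_type == "regression" and accuracy_critical:
        return "tuned gbdt"
    # Everything else: screen with RPT-1 first, train only if needed.
    return "rpt-1 screening, then gbdt if the gap matters"

print(route_task("classification", 80, False))  # rpt-1 in-context
print(route_task("regression", 50_000, True))   # tuned gbdt
```

The point of the sketch is that the choice hinges on task type and label count, not on which model is "better" in the abstract.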
A separate academic benchmarking project at the University of Washington is running RPT-1 against a broader set of 70-plus public datasets, applying statistical rigor including Friedman tests and Nemenyi post-hoc analysis. That study, still in progress as of early 2026, will provide the most comprehensive independent look at how the model holds up outside SAP's own data environments. Towards Data Science covered Lal's evaluation in March 2026.
The practical question for enterprises is not whether RPT-1 works. It demonstrably does. The question is where. For classification tasks with small labeled sets, the case is strong. For regression tasks requiring precise continuous predictions, the gap remains meaningful, and the hybrid approach may be the realistic path for teams that cannot sacrifice accuracy.
SAP's larger argument is that the era of training a bespoke model for every structured dataset may be ending. Whether that turns out to be true depends on what the University of Washington study finds when it scales beyond Lal's three scenarios, and on whether enterprises actually deploy the model in production rather than in pilots. Herzig's pharmaceutical example is illustrative, but it is also unverifiable hearsay from a company executive. The real test will be names, use cases, and outcomes that independent reporters can verify.
Artificial Intelligence · 6h 29m ago · 3 min read