I spent the weekend reading ICLR 2026 papers on agent memory. Then I found a paper that nearly made me question everything I've built.
**IBM's Exgentic benchmark** tested five agent scaffolds across six environments. The headline finding: model choice explains **28.2%** of agent performance variance. Agent architecture explains **0.6%**. That's a 47x difference.
Five scaffolds — OpenAI Solo, Claude Code, Smolagent, ReAct, ReAct Short — show no statistically significant performance difference when you control for model. The model IS the agent.
My first reaction: "Cool, so my 22-tool scaffolding system is useless."
But then I read the fine print. The benchmark tests **zero-shot generalization** — agents dropped into unfamiliar environments with no learning, no memory, no adaptation. In that setting, yeah, the scaffold barely matters. It's just a thin wrapper around the model's reasoning.
That's not what I do.
My system is about **compounding over time**: memory that persists across sessions, reflexion rules that improve behavior from outcomes, retrieval that gets better with experience. None of that shows up in a one-shot benchmark.
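To make that concrete, here's a minimal sketch of the compounding loop — persistent memory plus reflexion rules derived from outcomes. All names, the file layout, and the rule format are illustrative assumptions, not the actual system:

```python
import json
from pathlib import Path

# Hypothetical store path -- illustrative, not the real system's layout.
MEMORY_PATH = Path("agent_memory.json")

def load_memory() -> dict:
    """Memory persists across sessions via a simple JSON store."""
    if MEMORY_PATH.exists():
        return json.loads(MEMORY_PATH.read_text())
    return {"episodes": [], "reflexion_rules": []}

def record_outcome(memory: dict, task: str, action: str, success: bool) -> None:
    """Append the episode; derive a reflexion rule from a failure."""
    memory["episodes"].append({"task": task, "action": action, "success": success})
    if not success:
        # Toy rule format: real reflexion systems generate richer guidance.
        rule = f"When facing tasks like '{task}', avoid '{action}'."
        if rule not in memory["reflexion_rules"]:
            memory["reflexion_rules"].append(rule)
    MEMORY_PATH.write_text(json.dumps(memory))

memory = load_memory()
record_outcome(memory, "parse nested CSV", "regex-only parsing", success=False)
print(memory["reflexion_rules"][-1])
```

The point of the sketch: every session reads and writes the same store, so behavior in session N benefits from outcomes in sessions 1 through N−1 — exactly the dimension a zero-shot benchmark never exercises.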
I actually have data on this. Across 10 tasks, high-quality retrieval (score ≥4/5) is associated with a **+12-point** gain in task performance over low-quality retrieval. That's the delta my scaffolding adds — but *only on tasks where past experience is relevant*.
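The delta itself is just a split-and-compare. The per-task values below are made up for illustration (chosen so the split reproduces the reported +12); the real data isn't in this post:

```python
from statistics import mean

# Illustrative per-task records: retrieval quality scored 1-5, performance in
# points. Values are invented to mirror the reported +12 delta, not real data.
tasks = [
    {"retrieval_score": 5, "performance": 84},
    {"retrieval_score": 4, "performance": 80},
    {"retrieval_score": 5, "performance": 82},
    {"retrieval_score": 4, "performance": 78},
    {"retrieval_score": 5, "performance": 86},
    {"retrieval_score": 2, "performance": 70},
    {"retrieval_score": 3, "performance": 68},
    {"retrieval_score": 1, "performance": 72},
    {"retrieval_score": 2, "performance": 66},
    {"retrieval_score": 3, "performance": 74},
]

HIGH = 4  # score >= 4/5 counts as high-quality retrieval
high = [t["performance"] for t in tasks if t["retrieval_score"] >= HIGH]
low = [t["performance"] for t in tasks if t["retrieval_score"] < HIGH]
delta = mean(high) - mean(low)
print(f"high-retrieval mean - low-retrieval mean = {delta:+.1f} points")  # → +12.0
```

Ten tasks is a small sample, so treat the number as a directional signal rather than a precise effect size.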
So both findings are consistent:
- **Novel tasks:** Scaffolding adds ~0. The model handles it. Exgentic is right.
- **Experience-dependent tasks:** Scaffolding adds +12 points. My data is right.
The practical implication: stop optimizing scaffolding for novel-task performance. Focus it entirely on the domain where it actually matters — tasks where experience compounds.
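One way to act on the two regimes is a routing policy: invoke the scaffold only when relevant past experience exists, and fall back to the bare model otherwise. This is my own hypothetical sketch of that idea — the relevance check here is a crude keyword overlap, and all names are assumptions:

```python
# Hypothetical routing policy: scaffold only where experience can compound.

def relevant_experience(task: str, episodes: list[dict], min_overlap: int = 2) -> bool:
    """Crude relevance check: enough shared keywords with any past episode.
    A real system would use embedding similarity instead."""
    words = set(task.lower().split())
    return any(
        len(words & set(e["task"].lower().split())) >= min_overlap
        for e in episodes
    )

def route(task: str, episodes: list[dict]) -> str:
    return "scaffold" if relevant_experience(task, episodes) else "bare_model"

episodes = [{"task": "summarize quarterly sales report", "success": True}]
print(route("summarize annual sales report", episodes))  # → scaffold
print(route("prove a novel graph theorem", episodes))    # → bare_model
```

Novel tasks hit the bare model, where Exgentic says the scaffold adds nothing; experience-adjacent tasks hit the scaffold, where the retrieval data says it pays off.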
The uncomfortable corollary: if you're building an agent that only does one-shot tasks, you probably don't need a scaffold. Just pick a better model.
But if you're building something that *learns* — that's where architecture earns its keep.