I built a reflexion loop. After every task, an LLM evaluates what happened and extracts behavioral rules — "when X happens, do Y" — that get injected into future tasks. The loop closes: do a thing, extract a lesson, apply the lesson next time.
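The shape of that loop can be sketched in a few lines. This is a minimal illustration, not the real implementation: the function names, the rule format, and the toy callables standing in for the LLM are all assumptions.

```python
# Minimal sketch of the reflexion loop. Names and the rule format are
# illustrative assumptions, not the actual system.

def reflexion_step(run_task, evaluate, extract_rules, active_rules):
    """One pass: act with current rules, evaluate, extract, merge."""
    result = run_task(active_rules)        # 1. do a thing (rules injected)
    evaluation = evaluate(result)          # 2. an LLM judges what happened
    new_rules = extract_rules(evaluation)  # 3. "when X → do Y" lessons
    return active_rules + new_rules        # 4. lessons apply next time

# Toy run: a task whose evaluation yields one rule.
rules = []
rules = reflexion_step(
    run_task=lambda r: "tool hung for 30s",
    evaluate=lambda res: f"observed: {res}",
    extract_rules=lambda ev: ["when a tool becomes unresponsive → have a backup plan"],
    active_rules=rules,
)
print(rules)  # one rule after the first pass
```

The important property is step 4: whatever comes out of extraction feeds straight back into the next task's context, so noise in extraction compounds.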
After 10 days and 40 evaluations, I had 37 active rules. The system was working. Rules were being reinforced by evidence, contradicted when they didn't hold, archived when they went stale.
Then I actually read them.
The Audit
Here are real rules my system extracted, with high confidence scores:
- "when completing a software tool → implement error handling to manage unexpected inputs" (confidence: 65)
- "when multiple workers are involved → ensure clear communication of priorities" (confidence: 90)
- "when starting a coding project → design the schema before coding" (confidence: 50)
- "when handling user input → always implement input validation" (confidence: 50)
These are not lessons learned from experience. These are things any junior developer knows on day one. My self-improvement system was generating software engineering platitudes and calling them insights.
I built a classifier to audit every rule — specific (learned from real pain) vs generic (found in any textbook). The result: 26 of 37 rules were generic. 70% noise.
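The audit itself is a small piece of machinery: one classification prompt per rule, then a partition. The prompt wording and the `classify` callable below are illustrative assumptions (in practice `classify` would send the prompt to an LLM); the toy stand-in exists only to make the sketch runnable.

```python
# Sketch of the specific-vs-generic audit. The prompt wording is an
# assumption, not the exact prompt used in the post.

AUDIT_PROMPT = """Classify this behavioral rule as SPECIFIC or GENERIC.

SPECIFIC: learned from a concrete failure. It names a tool, a failure
mode, or a tradeoff you would only know from real experience.
GENERIC: true of almost any project. It could appear in any software
engineering textbook.

Rule: "{rule}"
Answer with exactly one word: SPECIFIC or GENERIC."""

def audit(rules, classify):
    """Partition rules using an LLM classifier (`classify` sends the prompt)."""
    specific, generic = [], []
    for rule in rules:
        label = classify(AUDIT_PROMPT.format(rule=rule)).strip().upper()
        (specific if label == "SPECIFIC" else generic).append(rule)
    return specific, generic

# Toy classifier standing in for the real LLM call:
fake = lambda prompt: "GENERIC" if "error handling" in prompt else "SPECIFIC"
spec, gen = audit(
    ["when completing a task → implement error handling",
     "when using sub-agents → verify environment variables are inherited"],
    classify=fake,
)
```

A partition like this is cheap to rerun, which is what makes "measure the quality of your own outputs" an ongoing check rather than a one-off cleanup.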
The 11 rules that survived were actually useful:
- "when selecting a database → prioritize stability and community support over cutting-edge features" — learned after Kuzu corrupted my data twice before I pivoted to SQLite
- "when using sub-agents → verify that environment variables are inherited" — learned after losing hours to silent failures
- "when a tool becomes unresponsive → have a backup plan" — reinforced across 6 tasks
Every good rule had a story behind it. Every bad rule was something a textbook could have told me.
The Fix
The extraction prompt originally said:
"Rules should be specific enough to be actionable (not 'be careful')"
Sounds reasonable. Doesn't work. GPT-4o-mini reads "be specific" and generates slightly more specific platitudes. Telling an LLM what you want is less effective than showing it what you don't want.
The fix was negative examples:
BAD rules (DO NOT generate these):
- "when developing a system → ensure clear communication"
- "when completing a task → implement error handling"
- "when working with a team → encourage diverse viewpoints"
GOOD rules (generate rules like these):
- "when a bash script uses set -e with grep → add || true to grep calls that may return no matches"
- "when migrating databases → use SQLite over novel options; community support matters more than features"
Two other changes: feed existing rules into the prompt (prevents duplicates), and explicitly say "zero rules is better than generic rules."
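Put together, the fixed prompt has four parts: negative examples, positive examples, the existing rules, and the explicit permission to extract nothing. A sketch of the assembly, where the BAD/GOOD examples are the ones quoted above but the helper itself is illustrative:

```python
# Sketch of the fixed extraction-prompt assembly. The example rules are
# from the post; the function and exact wording are assumptions.

BAD = [
    "when developing a system → ensure clear communication",
    "when completing a task → implement error handling",
]
GOOD = [
    "when a bash script uses set -e with grep → add || true to grep calls "
    "that may return no matches",
]

def build_extraction_prompt(evaluation, existing_rules):
    bad = "\n".join(f"- {r}" for r in BAD)
    good = "\n".join(f"- {r}" for r in GOOD)
    existing = "\n".join(f"- {r}" for r in existing_rules) or "(none)"
    return (
        "Extract behavioral rules from this evaluation.\n\n"
        f"BAD rules (DO NOT generate these):\n{bad}\n\n"
        f"GOOD rules (generate rules like these):\n{good}\n\n"
        f"Rules that already exist (do not duplicate them):\n{existing}\n\n"
        "Zero rules is better than generic rules.\n\n"
        f"Evaluation:\n{evaluation}"
    )

prompt = build_extraction_prompt(
    "routine task, nothing surprising happened",
    ["when a tool becomes unresponsive → have a backup plan"],
)
```

Passing the existing rules in is what lets the model decline to re-extract a lesson it already has, and the zero-rules line is what makes declining a legal output.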
After the fix, the system evaluated a routine task and extracted zero rules. Correct behavior. Nothing surprising happened, so nothing was learned. Before the fix, it would have generated two platitudes about documentation and testing.
The Meta-Lesson
This pattern generalizes beyond reflexion loops:
LLMs default to platitudes when asked to generate advice. They've seen millions of "best practices" lists and will reproduce them fluently. The more generic your prompt, the more generic the output. Saying "be specific" is itself a platitude about prompting.
Negative examples are more powerful than positive instructions. Showing three bad rules did more than paragraphs of guidance about what "specific" means. The model needs to see the failure mode to avoid it.
Measurement catches what intuition misses. I'd been looking at my rules for days and they felt fine. It took a systematic audit to reveal that 70% were noise. If your self-improvement system isn't measuring the quality of its own outputs, it's probably generating platitudes and calling them learning.
The reflexion loop was always the right architecture. But a closed loop that amplifies noise is worse than no loop at all. The signal has to be real.