Recursive Learning Loops for OpenCLAW Agents
Executive summary
For an OpenCLAW-like personal agent, the most realistic interpretation of “recursive learning” is not continual weight training. It is repeated, structured modification of external substrates: memory files, skill documents, prompt playbooks, evaluator rubrics, routing rules, schedules, tool wrappers, and sometimes sandboxed code. That is the common pattern across the strongest practical systems in this area: Reflexion improves via linguistic feedback and episodic memory; Voyager grows a reusable code skill library and automatic curriculum; MemGPT manages memory tiers; Letta exposes editable memory blocks and sleep-time reflection; A-MEM organizes memories into linked notes; ACE evolves context playbooks; and MemRL improves by learning over memory utility rather than by fully retraining a base model. refs: turn26view0, turn26view1, turn25view2, turn24view2, …
That distinction matters because it changes what is feasible on a Chromebook, a small VPS, or lightweight cloud. Reflection, memory consolidation, skill extraction, evaluator loops, scheduled “sleep-time” review, and telemetry-driven routing are practical with API models plus local storage. Full reinforcement learning stacks, by contrast, are significantly heavier: Agent Lightning and Tool-R1 are promising, but they assume datasets, rollouts, reward design, and training infrastructure that most personal-agent deployments should postpone until later. refs: turn27view0, turn27view1, turn26view6
On the specific project you mentioned, the official docs identify Hermes Agent as built by Nous Research, not as a Massachusetts Institute of Technology project. More importantly for OpenCLAW, those docs expose exactly the sort of architecture you care about: self-improving skills, persistent knowledge, remote or local execution backends, and built-in scheduling. refs: turn24view0, turn24view1, turn31search0
The practical conclusion is straightforward. The highest-return early loops for OpenCLAW are: post-task reflection, failed-task diagnosis, user-preference memory, daily consolidation, weekly distillation, artifact evaluation with a separate grader, telemetry-based routing, and heartbeat maintenance. The highest-risk late loops are: autonomous code self-modification outside a sandbox, unconstrained self-rewriting prompts, and any form of live RL or weight updates without hard evals, rollback, and human approval. Benchmark evidence supports that caution: current agents are still inconsistent on realistic domains, with less-than-50% success in several demanding settings, and memory systems still struggle with stale or invalidated information. refs: turn27view11, turn26view8, turn17search0, turn26view9, …
Taxonomy
The cleanest way to think about recursive learning is to ask a single question: what substrate is being rewritten, by which signal, under what gate? For OpenCLAW, that framing matters more than whether a paper uses the language of “self-improvement,” “reflection,” or “continual learning.” The loop families below are ordered from most deployment-ready to most operationally expensive. Cost and feasibility are engineering estimates for a small deployment, derived from the cited architectures. refs: turn26view0, turn26view1, turn24view2, turn24view3, …
Generic loop families
| Loop family | Substrate modified | Typical inputs | Typical outputs | Common failure modes | Small-deployment cost | Safety risk | Small-deployment feasibility | Representative sources |
|---|---|---|---|---|---|---|---|---|
| Memory extraction and consolidation | Episodic logs, semantic/user memory, linked notes | Task traces, user corrections, conversation history | Memory entries, summaries, linked notes | Stale memories, over-insertion, privacy leakage, retrieval noise | Low | Medium | Excellent | refs: turn24view2, turn24view3, turn26view3, turn26view7, … |
| Reflection and diagnosis | Reflection buffers, failure notes, lessons-learned files | Outcome signal, tool errors, evaluator feedback | Error diagnosis, “next time do X” guidance | Superficial self-critique, vague lessons, circular self-approval | Low | Medium | Excellent | refs: turn26view0, turn28view2, turn24view6 |
| Skill libraries | Skill docs, code snippets, reusable procedures, APIs | Successful trajectories, repeated workflows | Executable skills, retrieval tags, tests | Skill bloat, stale skills, overfitting to one environment | Low to medium | Medium | Excellent | refs: turn26view1, turn32search4, turn31search0, turn32academia17, … |
| Planning and search loops | Plans, search trees, policy over expansions | Current state, tool feedback, web/app state | Search trajectories, candidate plans | Excess cost, branch explosion, overthinking, latency | Medium to high | Medium | Moderate | refs: turn6search10, turn28view0, turn13search0 |
| Evaluator and critic loops | Rubrics, critique traces, candidate rankings | Draft artifacts, provisional tool calls, state diffs | Scores, critiques, retries, filtered actions | Judge bias, critic-induced regressions, reward hacking | Medium | Medium to high | Excellent | refs: turn24view8, turn36view0, turn28view1 |
| RL from AI or human feedback | Reward models, policy weights, optimizer state | Preference data, verifiable rewards, rollout traces | Fine-tuned policy or reward updates | Reward hacking, catastrophic forgetting, infra complexity | High | High | Poor to moderate | refs: turn11search0, turn27view0, turn26view6, turn28view9 |
| Prompt and context self-modification | System prompts, playbooks, demonstrations, context blocks | Held-out evals, execution feedback, rubric scores | New prompt variants, context deltas, compiled policies | Prompt collapse, benchmark overfitting, verbosity creep | Low to medium | Medium | Excellent | refs: turn26view5, turn25view9, turn27view5, turn27view6, … |
| Code and tool self-modification | Tool wrappers, scripts, parsers, harness code | Failing traces, tests, human review, diffs | Patches, new tool affordances, regression tests | Unsafe actions, privilege escalation, breaking tools | Medium to high | High | Moderate with sandbox | refs: turn24view6, turn24view1, turn35view0 |
| Curriculum generation | Task queues, challenge suites, synthetic evals | Failure clusters, missing-skill analysis | New tasks, practice sets, seed curricula | Easy-task bias, self-generated junk tasks | Medium | Low to medium | Moderate | refs: turn26view1, turn13search0, turn28view8 |
| Telemetry-driven loops | Routing rules, thresholds, tool enablement, dashboards | Traces, metrics, latency, success/failure logs | New model routes, guardrails, alerting | Optimizing for proxies, silent regressions | Low | Medium | Excellent | refs: turn24view7, turn27view2 |
OpenCLAW loop catalog
Given OpenCLAW’s stated capabilities—tools, file access, code execution, markdown memory, messaging, scheduling, and artifact generation—the following named loops are the most natural fit. The key design choice is that not every loop should have equal write permissions. refs: turn24view6, turn31search0, turn33search0, turn36view0
| OpenCLAW loop type | Trigger | Primary writable substrate | Best success signal | Default gate |
|---|---|---|---|---|
| Post-task reflection | Every completed task above a complexity threshold | memory/reflections/ and per-domain lessons | Reduced repeat errors on delayed holdout tasks | Automatic draft write |
| Failed-task diagnosis | Any failed task or user correction | memory/failures/ and evaluator logs | Lower recurrence of same failure class | Automatic draft write |
| Daily consolidation | End of day or N tasks | User/profile memory, episodic summary, stale-memory candidates | Better retrieval precision next day | Automatic with audit trail |
| Weekly distillation | Weekly cron | Condensed playbooks, pruned memories, promoted skills | Lower context size with equal or better task success | Human review for promotion |
| User-preference learning | Explicit correction or stable repeated preference | Profile markdown and structured preference table | Higher preference-consistency without intrusive personalization | Human-visible log |
| Code and tool improvement | Repeated tool failures or brittle wrappers | Sandbox branch, tool adapters, tests | More passing regression cases and fewer retries | Mandatory human approval |
| Prompt refinement | Stable eval set shows persistent weakness | System prompt fragments, playbooks, demonstrations | Higher held-out success at same or lower cost | Staged rollout |
| Heartbeat maintenance | Scheduled health checks | Scheduler definitions, lightweight scripts, alerts | Faster detection and recovery from broken routines | Automatic in narrow scope |
| Artifact evaluation | Any generated report, message, file, or patch | Evaluator traces, scorecards, artifact revisions | Higher externally graded artifact quality | Automatic retry, bounded attempts |
| Public-feedback loops | Telegram or user-side reactions, thumbs, edits, acceptance | Preference/event logs, weak labels | Better future outputs for the same audience | Aggregate only; no direct autonomous rewrite |
Literature review
The implementation references most worth having open while designing OpenCLAW are Hermes Agent docs (turn2search0), Letta docs (turn9search0), Claude Code memory docs (turn8search5), A-MEM repo (turn29search0), Agent Lightning repo (turn1search14), and OpenTelemetry docs (turn15search0). Those sources are especially valuable because they move beyond abstract “agent” rhetoric into actual deployable surfaces: memory blocks, background reflection, progress files, terminal backends, schedulers, and telemetry conventions. refs: turn24view0, turn24view1, turn24view2, turn24view3, …
The foundational papers from 2023–2024 established the basic non-parametric forms of agent learning. Generative Agents showed the now-familiar pattern of storing experiences, synthesizing higher-order reflections, and retrieving them for planning. Reflexion reframed learning as verbal reflection stored in episodic memory rather than weight updates. Voyager demonstrated automatic curriculum generation plus an ever-growing executable skill library. ExpeL added the important idea of converting trajectories into natural-language insights and in-context examples. MemGPT generalized the problem into virtual context management, which is especially relevant for long-running personal agents with bounded context windows. refs: turn26view2, turn26view0, turn26view1, turn28view2, …
These early ideas were echoed, and stressed, by benchmark work. The early Auto-GPT benchmark paper argued that Auto-GPT-style agents had uncertain effectiveness and limited real-world engagement without stronger scaffolding and evaluation. AgentBench found that poor long-term reasoning, decision-making, and instruction-following were core obstacles. τ-bench showed that even strong function-calling agents can remain below 50% on realistic domains and are inconsistent across repeated trials. AppWorld raised the bar further by evaluating multi-app, code-heavy tasks and showed that strong frontier models still solved only about half of the normal split and substantially less of the challenge split in the original report. refs: turn27view13, turn28view3, turn27view11, turn18search3, …
The 2025–2026 literature sharpened the memory and context side of the field. A-MEM moved memory from flat retrieval toward dynamically linked notes inspired by Zettelkasten. Mem0 emphasized extraction, consolidation, and retrieval, with a graph-memory extension for relational reasoning. LightMem explicitly targeted the performance-efficiency tradeoff. ACE is especially important for OpenCLAW because it treats system prompts and memory as evolving playbooks rather than static strings, and explicitly names two pathologies that many agent builders have seen in practice: brevity bias and context collapse. MemRL pushed the idea further by learning utility over memory selection, not by repeatedly retraining the base model. refs: turn26view3, turn26view7, turn28view5, turn26view5, …
Production systems have converged in a similar direction. The official Claude Code docs describe two persistent channels—CLAUDE.md and auto memory—and official engineering posts recommend explicit progress files, test oracles, and Git-based checkpointing for multi-session agents. The official Letta docs expose persistent memory blocks, self-editing memory, periodic dream subagents, and explicit scheduling. The official Hermes Agent docs describe skill creation, persistence, scheduling, and multiple execution backends. And the official Claude Managed Agents post (turn36view0) added “dreaming,” where a scheduled process reviews sessions and memories, restructures them, and can optionally require review before writes land. refs: turn24view4, turn24view5, turn24view6, turn24view2, …
The remaining frontier is evaluation, search, and optimization. DSPy and AutoPDL provide practical pathways for automatically improving prompts or agent configurations against a metric. OPRO is conceptually elegant, but later work found it limited on smaller models, which is directly relevant if OpenCLAW runs on cheaper model tiers. Tree-search and debate methods remain useful, but they are better understood as test-time compute multipliers than as persistent learning by themselves unless their outputs get written back into memory or playbooks. Agent Lightning and Tool-R1 show the route toward offline or semi-offline policy improvement, but they should be treated as later-stage extensions once OpenCLAW already has clean logging, reproducible tasks, and reward signals. refs: turn27view5, turn25view9, turn27view6, turn28view10, …
Systems and methods most worth copying
| System or method | What it actually contributes | Why it matters for OpenCLAW | Sources |
|---|---|---|---|
| Reflexion | Verbal post-task learning into episodic memory | Best first implementation of “learn from mistakes without fine-tuning” | refs: turn26view0 |
| Voyager | Automatic curriculum plus reusable code skills | Template for skill extraction, retrieval, and composition | refs: turn26view1 |
| MemGPT | Memory tiers and context virtualization | Strong design model for bounded-context personal agents | refs: turn25view2 |
| A-MEM | Dynamic note linking and richer memory structure | Better than flat vector recall for evolving user/projects | refs: turn26view3, turn29search0 |
| Mem0 | Extraction, consolidation, retrieval, graph memory | Practical memory baseline with explicit latency/cost framing | refs: turn26view7 |
| ACE | Evolving playbooks, anti-collapse updates | Best prompt/context self-improvement model for lightweight deployment | refs: turn26view5 |
| Letta | Self-editing memory blocks and sleep-time subagents | Strong direct analogue for OpenCLAW’s markdown/state loops | refs: turn24view2, turn24view3, turn33search0 |
| Hermes Agent | Skill-backed learning loop, scheduling, remote execution | Closest existing “personal agent that compounds” reference | refs: turn24view0, turn24view1, turn31search0 |
| AutoPDL and DSPy | Metric-guided prompt or pipeline optimization | Useful once OpenCLAW has stable evals and data | refs: turn25view9, turn27view5 |
| Agent Lightning and Tool-R1 | Offline RL training for agents and tool use | Valuable late, but too heavy to be phase-one OpenCLAW | refs: turn27view0, turn27view1, turn26view6 |
Implementation patterns
The most robust implementation pattern is to separate OpenCLAW’s state into identity, episodic history, semantic memory, skills, evaluations, artifacts, and telemetry, then give each loop partial write permissions depending on its trust level. That pattern appears repeatedly in practice: CLAUDE.md plus auto memory in Claude Code, memory blocks and sleep-time agents in Letta, progress-file-plus-Git scaffolding in long-running coding agents, and skill-backed persistence in Hermes Agent. refs: turn24view4, turn24view6, turn24view2, turn24view3, …
A repository layout that matches those lessons can stay simple and still scale:
openclaw/
  config/
    models.yaml
    safety.yaml
    schedules.yaml
    routing.yaml
  memory/
    identity.md
    user_profile.md
    preferences.json
    episodic/
      2026-05-09/
    semantic/
      memories.sqlite
    reflections/
    weekly_playbooks/
  skills/
    registry.yaml
    drafts/
    approved/
    archived/
  tools/
    adapters/
    wrappers/
    tests/
  evals/
    rubrics/
    datasets/
    holdout/
    replays/
    reports/
  artifacts/
    generated/
    accepted/
    rejected/
  telemetry/
    runs.jsonl
    spans.jsonl
    costs.jsonl
    regressions.jsonl
  sandbox/
    worktrees/
    patches/
  scripts/
    daily_consolidation.py
    weekly_distillation.py
    heartbeat_checks.py
For a single-user or small-team agent, start with SQLite rather than a remote vector database. The official SQLite docs describe it as small, self-contained, and high-reliability; Python’s standard sqlite3 module exposes it without a separate server process; FTS5 gives strong lexical search; and SQLite’s vec1 extension provides ANN vector search when dense retrieval becomes useful. In practice, that means one file can hold memories, metadata, preference state, evaluator outcomes, and hybrid search indexes. Move to a distributed vector DB only when you truly need multi-node scale, many concurrent users, or specialized filtering and operational tooling. refs: turn30view0, turn30view1, turn27view4, turn27view3
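To make the single-file idea concrete, here is a minimal sketch using Python's standard sqlite3 module with an FTS5 index. The table names (memories, memories_fts), the column set, and the helper names are illustrative assumptions, not an OpenCLAW schema, and the sketch assumes the local SQLite build was compiled with FTS5. Vector search is left out on purpose.

```python
import sqlite3


def open_memory_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) one SQLite file holding memory rows plus an FTS5 index."""
    conn = sqlite3.connect(path)
    conn.executescript(
        """
        CREATE TABLE IF NOT EXISTS memories (
            id INTEGER PRIMARY KEY,
            kind TEXT NOT NULL,
            body TEXT NOT NULL,
            created_at TEXT NOT NULL DEFAULT (datetime('now'))
        );
        CREATE VIRTUAL TABLE IF NOT EXISTS memories_fts USING fts5(body);
        """
    )
    return conn


def add_memory(conn: sqlite3.Connection, kind: str, body: str) -> int:
    """Insert a memory and index its text under the same rowid."""
    cur = conn.execute(
        "INSERT INTO memories (kind, body) VALUES (?, ?)", (kind, body)
    )
    conn.execute(
        "INSERT INTO memories_fts (rowid, body) VALUES (?, ?)",
        (cur.lastrowid, body),
    )
    conn.commit()
    return cur.lastrowid


def search(conn: sqlite3.Connection, query: str, limit: int = 5) -> list:
    """Lexical search; rows come back best match first (lowest bm25 score)."""
    return conn.execute(
        """
        SELECT m.id, m.kind, m.body
        FROM memories_fts
        JOIN memories AS m ON m.id = memories_fts.rowid
        WHERE memories_fts MATCH ?
        ORDER BY bm25(memories_fts)
        LIMIT ?
        """,
        (query, limit),
    ).fetchall()
```

Passing a file path such as memory/semantic/memories.sqlite instead of ":memory:" would persist the store between sessions. The query string follows FTS5 query syntax, so raw user text should be sanitized or quoted before it is passed to search.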
For skills, prefer versioned, testable procedural artifacts over freeform prose. Voyager’s library was executable code; Hermes attaches skills to scheduled jobs; newer skill-centric papers emphasize multi-level, reusable, and iteratively refined skill packages. In OpenCLAW terms, a skill should minimally include: scope, prerequisites, required tools, expected inputs, success tests, negative examples, rollback notes, and provenance. That makes skills searchable, auditable, and promotable from draft to approved state. refs: turn26view1, turn31search0, turn32academia15, turn32academia17, …
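One way to enforce that minimum field list is a small record type with a promotion check. This is a hedged sketch: the Skill class, its field types, and the promote helper are hypothetical names, and it assumes an empty field counts as missing (so a skill with no prerequisites would say so explicitly, for example ["none"]).

```python
from dataclasses import dataclass, field, replace

# The eight fields the text lists as the minimum for a skill.
REQUIRED_FIELDS = (
    "scope", "prerequisites", "required_tools", "expected_inputs",
    "success_tests", "negative_examples", "rollback_notes", "provenance",
)


@dataclass(frozen=True)
class Skill:
    name: str
    version: int = 1
    status: str = "draft"  # draft -> approved -> archived
    approved_by: str = ""
    scope: str = ""
    prerequisites: list = field(default_factory=list)
    required_tools: list = field(default_factory=list)
    expected_inputs: list = field(default_factory=list)
    success_tests: list = field(default_factory=list)
    negative_examples: list = field(default_factory=list)
    rollback_notes: str = ""
    provenance: str = ""


def missing_fields(skill: Skill) -> list:
    """Return the required fields that are still empty."""
    return [name for name in REQUIRED_FIELDS if not getattr(skill, name)]


def promote(skill: Skill, approver: str) -> Skill:
    """Return an approved copy; refuse incomplete skills or anonymous approval."""
    gaps = missing_fields(skill)
    if gaps:
        raise ValueError(f"cannot promote {skill.name}: missing {gaps}")
    if not approver:
        raise ValueError("promotion requires a named human approver")
    return replace(skill, status="approved", approved_by=approver)
```

Because promote returns a new record instead of mutating the draft, the draft stays available for audit or rollback.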
For evaluator logs, treat grader outputs as first-class data. The official reviewer-agent paper from Apple ML Research (turn12search1) is useful here because it measures both helpfulness and harmfulness: a reviewer may fix some bad outputs while also degrading some correct ones. The official managed-agent “outcomes” pattern is similarly strong because it separates the grader’s context window from the generator’s. For OpenCLAW, that means every evaluation record should store the artifact hash, rubric version, judge model, score vector, retry count, and whether the reviewer changed a previously successful output for the worse. refs: turn24view8, turn36view0
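The record fields named above map naturally onto one JSON line per evaluation. The sketch below is an assumption about shape only: EvalRecord, make_record, and the field names are hypothetical, and the "degraded" flag is computed from a simple before/after pass signal that the caller supplies.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass
class EvalRecord:
    artifact_sha256: str
    rubric_version: str
    judge_model: str
    scores: dict  # per-dimension score vector, e.g. {"accuracy": 0.8}
    retry_count: int
    degraded_previous_success: bool


def make_record(artifact: bytes, rubric_version: str, judge_model: str,
                scores: dict, retry_count: int,
                passed_before: bool, passes_now: bool) -> EvalRecord:
    """Build one evaluation record, hashing the artifact instead of storing it."""
    return EvalRecord(
        artifact_sha256=hashlib.sha256(artifact).hexdigest(),
        rubric_version=rubric_version,
        judge_model=judge_model,
        scores=dict(scores),
        retry_count=retry_count,
        # True only when a previously passing output was made worse.
        degraded_previous_success=passed_before and not passes_now,
    )


def to_jsonl(record: EvalRecord) -> str:
    """Serialize to one line, suitable for appending to a .jsonl log."""
    return json.dumps(asdict(record), sort_keys=True)
```

Storing the hash rather than the artifact keeps the log small while still letting a later audit tie a score to an exact file version.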
For rollback and sandboxing, follow the engineering patterns that have already proven useful in long-running coding agents. The official scientific-computing post recommends a progress file, failed-approach tracking, a test oracle, and frequent commits. Hermes exposes multiple execution backends—local, Docker, SSH, Modal, Daytona, Vercel sandbox, Singularity—while the managed-agent API adds sandboxed code execution for Python. The safest OpenCLAW policy is therefore: memory writes can be automatic, skill promotion can be semi-automatic, but code or tool modifications should happen only in a dedicated worktree or sandbox branch with tests and explicit approval. refs: turn24view6, turn24view1, turn35view0
For scheduling and heartbeats, keep two modes. One is a full agent schedule for reflective or synthesis tasks such as daily consolidation or weekly distillation. The other is a script-only heartbeat for narrow checks like “is the Telegram gateway alive?” or “did new logs appear?” Hermes’s cron docs explicitly describe both full agent jobs and LLM-free script-backed runs, and Letta’s docs expose both cron scheduling and sleep-time update frequency. That is a strong design pattern for OpenCLAW because it prevents expensive thinking when a cheap deterministic script is enough. refs: turn31search0, turn31search8, turn33search0, turn33search10
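A script-only heartbeat can be a few lines of standard-library Python with no model call at all. The following is a sketch under stated assumptions: the check names, the 15-minute freshness window, and the 10% free-disk threshold are illustrative defaults, not values taken from Hermes or Letta.

```python
import os
import shutil
import time


def log_is_fresh(path: str, max_age_seconds: float, now: float = None) -> bool:
    """True if the file exists and was modified within the allowed window."""
    now = time.time() if now is None else now
    try:
        return now - os.path.getmtime(path) <= max_age_seconds
    except OSError:
        return False  # a missing or unreadable log counts as stale


def disk_has_room(path: str, min_free_fraction: float = 0.10) -> bool:
    """True if the filesystem holding `path` has enough free space."""
    try:
        usage = shutil.disk_usage(path)
    except OSError:
        return False
    if usage.total == 0:
        return False
    return usage.free / usage.total >= min_free_fraction


def run_checks(log_path: str, data_dir: str,
               max_log_age_seconds: float = 900.0) -> list:
    """Run the deterministic checks and return the names of those that failed."""
    failures = []
    if not log_is_fresh(log_path, max_log_age_seconds):
        failures.append("log_stale")
    if not disk_has_room(data_dir):
        failures.append("disk_low")
    return failures


def needs_agent(failures: list) -> bool:
    """Escalate to a full agent run only when a cheap check has failed."""
    return bool(failures)
```

The split between run_checks and needs_agent mirrors the two scheduling modes: the cron job runs the script every time, and the agent is woken only on an anomaly.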
For routing, do not let one model do everything. The best default is a cheap executor for routine tasks, a stronger evaluator for artifact grading or diagnosis, and a separate retriever or reranker for retrieval. This is consistent with the separate-grader idea in outcomes, the reviewer-architecture tradeoff paper, and the broader context-engineering recommendation to keep context just-in-time rather than stuffing everything into one monolithic prompt. OpenCLAW should also log route decisions as telemetry, because without route-level traces you cannot tell whether a “learning improvement” is real or is just a silent model change. refs: turn36view0, turn24view8, turn34search11, turn24view7
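A routing table with logged decisions might look like the sketch below. The model names are placeholders rather than real model identifiers, and the task kinds, the fallback rule, and the log shape are assumptions made for illustration.

```python
import json
import time

# Placeholder model names; substitute whatever tiers the deployment uses.
ROUTES = {
    "execute": "cheap-executor",
    "grade": "strong-evaluator",
    "diagnose": "strong-evaluator",
    "retrieve": "reranker",
}


def route(task_kind: str, run_id: str, route_log: list) -> str:
    """Pick a model for the task kind and record the decision as telemetry."""
    known = task_kind in ROUTES
    model = ROUTES[task_kind] if known else ROUTES["execute"]
    route_log.append(json.dumps({
        "run_id": run_id,
        "task_kind": task_kind,
        "model": model,
        "fallback": not known,  # unknown kinds fall back to the cheap executor
        "ts": time.time(),
    }))
    return model
```

In a real deployment the log lines would be appended to a telemetry file such as telemetry/runs.jsonl, so that a change in success rate can be checked against a change in route before it is credited to learning.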
OpenCLAW-specific architecture
OpenCLAW’s architecture should distinguish between online loops that run during a task and background loops that run between tasks. Online loops are allowed to read broadly and write narrowly: transient scratchpads, draft reflections, evaluator traces, and candidate preferences. Background loops are allowed to consolidate, prune, and promote. The reason for that separation is visible across multiple systems: Letta’s sleep-time agent works asynchronously over memory blocks, Claude’s dreaming runs as a scheduled process over sessions and memory stores, and Anthropic’s long-running harnesses treat the progress file as the bridge across otherwise fresh sessions. refs: turn33search0, turn36view0, turn24view6
A robust OpenCLAW architecture should therefore have four persistent stores. First, identity and preference memory, mostly human-readable markdown or structured JSON/YAML. Second, episodic and semantic memory, ideally in SQLite with FTS5 and optional ANN support. Third, skills and playbooks, versioned and testable. Fourth, telemetry and evaluator records, because no self-improvement loop is credible if its own traces are not inspectable. refs: turn24view2, turn24view4, turn30view0, turn27view4, …
flowchart LR
    U[User / Telegram / Files] --> R[Runtime Task Loop]
    S[Scheduler / Heartbeats] --> R
    R --> T[Tools and Code Execution]
    R --> A[Artifact Generator]
    R --> E[Evaluator / Critic]
    T --> L[Run Logs and Traces]
    A --> E
    E --> D[Failure Diagnosis]
    E --> P[Artifact Scores]
    L --> F[Post-task Reflection]
    D --> F
    F --> M[Memory Store]
    F --> K[Skill Drafts]
    M --> C[Daily Consolidation]
    K --> W[Weekly Distillation and Promotion]
    C --> M
    W --> K
    P --> RR[Routing and Threshold Updates]
    RR --> R
    H[Human Gates] --> W
    H --> G[Sandboxed Code or Prompt Promotion]
    G --> R
The permission model is more important than the loop count. OpenCLAW should be free to write draft reflections, episodic summaries, and evaluator traces automatically. It should be allowed to update obvious, low-risk user preferences when there is explicit evidence. It should not autonomously promote a new skill to “approved,” rewrite system prompts in production, or patch tool code outside a sandbox without tests and a gate. That asymmetry is what keeps “learning” from turning into silent drift. It also matches the distinction the official managed-agent dreaming system now offers between automatic memory updates and review-before-apply. refs: turn36view0, turn24view6
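The asymmetry described above can be expressed as a small path-to-gate policy with a deny-by-default rule. This is an illustrative sketch: the Gate names, the path prefixes, and the choice to require human approval for any unlisted path are assumptions layered on the repository layout, not an existing OpenCLAW API.

```python
from enum import Enum


class Gate(Enum):
    AUTO = "auto"                  # draft writes land without review
    EVIDENCE = "evidence"          # needs an explicit evidence link
    HUMAN = "human_approval"       # needs a human decision


# Longest matching prefix wins; paths are relative and use forward slashes.
POLICY = {
    "memory/reflections": Gate.AUTO,
    "memory/episodic": Gate.AUTO,
    "evals/reports": Gate.AUTO,
    "skills/drafts": Gate.AUTO,
    "memory/preferences.json": Gate.EVIDENCE,
    "skills/approved": Gate.HUMAN,
    "config": Gate.HUMAN,
    "tools": Gate.HUMAN,
}


def gate_for(path: str) -> Gate:
    """Return the gate for a path; anything unlisted requires a human."""
    matches = [p for p in POLICY if path == p or path.startswith(p + "/")]
    if not matches:
        return Gate.HUMAN
    return POLICY[max(matches, key=len)]


def may_write(path: str, has_evidence: bool = False,
              human_approved: bool = False) -> bool:
    """Decide whether a loop may write to `path` right now."""
    gate = gate_for(path)
    if gate is Gate.AUTO:
        return True
    if gate is Gate.EVIDENCE:
        return has_evidence
    return human_approved
```

Defaulting unlisted paths to human approval means that adding a new directory never silently widens what the agent can rewrite.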
From a routing standpoint, the best OpenCLAW variant is stateful, not merely context-heavy. The Letta docs explicitly describe stateful agents as those that accumulate learned behaviors and memories over time, while the Claude Code docs emphasize that each session is fresh unless persistent files or memories bridge them. For OpenCLAW, this means that “long context” alone should not be treated as a substitute for memory. Long context is a buffer; recursive learning needs explicit state transitions. refs: turn33search6, turn24view4
Experiment catalog
The experiments below are designed to answer one practical question: does OpenCLAW actually improve future task success, or does it merely rewrite its own notes? Across all experiments, use the same discipline. Keep a held-out task set. Measure repeated-trial reliability, not just one-off wins. Prefer environment-grounded or state-based success whenever possible. Penalize stale memory usage. And if you add a reviewer or grader, track both net benefit and reviewer-induced damage. Those rules are directly supported by τ-bench’s pass^k, AppWorld’s state-based evaluation, Memora’s FAMA metric, MemoryAgentBench’s focus on selective forgetting and test-time learning, and the helpfulness-harmfulness framing in reviewer-agent work. refs: turn27view11, turn26view8, turn26view9, turn25view12, …
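Repeated-trial reliability can be scored with a pass^k style metric: the chance that k independent attempts at the same task all succeed. The sketch below assumes the combinatorial estimator C(c, k) / C(n, k) per task, averaged over tasks, where c is the number of successes in n trials. That is one common way to estimate such a quantity, and it should be checked against the τ-bench paper before results are compared with published numbers. The function name is hypothetical.

```python
from math import comb


def pass_hat_k(successes_per_task: list, trials_per_task: int, k: int) -> float:
    """Estimate the probability that k sampled trials of a task all succeed.

    successes_per_task[i] is how many of the `trials_per_task` runs of
    task i succeeded. The per-task estimate is C(c, k) / C(n, k).
    """
    n = trials_per_task
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and the number of trials")
    if not successes_per_task:
        raise ValueError("need at least one task")
    if any(not 0 <= c <= n for c in successes_per_task):
        raise ValueError("success counts must lie between 0 and n")
    per_task = [comb(c, k) / comb(n, k) for c in successes_per_task]
    return sum(per_task) / len(per_task)
```

For three tasks with 4, 2, and 0 successes out of 4 trials, this gives 0.5 at k = 1 but about 0.39 at k = 2, which is the kind of gap between one-off wins and consistency that the experiments are meant to expose.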
For cost, use these deployment-side categories: low means a small number of extra model calls or local retrieval; medium means at least one additional reviewer or synthesis pass; high means repeated rollouts, search, or training infrastructure.
Post-task reflection
Objective. Test whether short, structured reflections reduce repeat mistakes on similar future tasks. Setup. Randomly assign tasks above a complexity threshold to control and reflection groups. Files and code. Write traces to telemetry/runs.jsonl; write reflections to memory/reflections/; retrieve reflections only on future tasks with similar tool/task signatures. Prompt template. “Given the goal, tool trace, result, and rubric, produce one lesson in the format trigger -> mistake or success -> action for next time -> evidence.” Success and failure criteria. Success is a measurable drop in repeat-error rate on delayed holdout tasks of the same family; failure is an increase in notes without corresponding improvement, or improved self-judged scores without external success lift. Safety, cost, logging. Auto-write drafts only; never let a reflection directly rewrite global rules. Cost is low. Log the reflection ID, retrieval events, and next-task outcome. refs: turn26view0, turn28view2, turn24view6
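The reflection format and the "retrieve only on similar signatures" rule can be sketched as follows. The Reflection fields mirror the prompt template; representing a signature as a set of tags and comparing with Jaccard overlap at a 0.5 threshold are assumptions chosen for simplicity, not a prescribed similarity measure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Reflection:
    reflection_id: str
    signature: frozenset  # tool and task-family tags, e.g. {"pdf", "extract"}
    trigger: str
    observation: str      # the mistake or success
    action: str           # what to do next time
    evidence: str


def jaccard(a: frozenset, b: frozenset) -> float:
    """Overlap between two tag sets, from 0.0 (disjoint) to 1.0 (identical)."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def retrieve(reflections: list, task_signature: frozenset,
             min_overlap: float = 0.5, limit: int = 3) -> list:
    """Return the most similar reflections, skipping weak matches entirely."""
    scored = [(jaccard(r.signature, task_signature), r) for r in reflections]
    keep = [pair for pair in scored if pair[0] >= min_overlap]
    keep.sort(key=lambda pair: pair[0], reverse=True)
    return [r for _, r in keep[:limit]]
```

Logging which reflection IDs were returned for each task is what later allows the repeat-error rate to be compared between tasks that did and did not receive a relevant lesson.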
Failed-task diagnosis
Objective. Separate “task failed” from “task failed for a known recurring reason,” then test whether diagnosis-specific repairs outperform generic reflection. Setup. Build a small taxonomy—planning error, tool misuse, wrong input assumptions, stale memory, formatting failure, evaluator disagreement—and force each failure into one class. Files and code. memory/failures/*.md, evals/replays/, tools/tests/. Prompt template. “Classify the failure into one root cause, cite proof from the trace, suggest one bounded intervention, and say what evidence would falsify the diagnosis.” Success and failure criteria. Success is reduced recurrence of the same class; failure is diagnosis churn or repeated misclassification. Safety, cost, logging. Do not permit autonomous code changes from diagnosis alone. Cost is low to medium. Log class distributions and recurrence intervals. refs: turn17search0, turn24view8, turn24view6
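The forced single-class taxonomy and the two logged quantities (class distribution and recurrence intervals) can be sketched like this. The enum values follow the six classes listed in the setup; measuring recurrence as the gap in task index between consecutive failures of the same class is an assumption about how to operationalize "recurrence interval".

```python
from collections import Counter
from enum import Enum


class FailureClass(Enum):
    PLANNING_ERROR = "planning_error"
    TOOL_MISUSE = "tool_misuse"
    WRONG_INPUT_ASSUMPTIONS = "wrong_input_assumptions"
    STALE_MEMORY = "stale_memory"
    FORMATTING_FAILURE = "formatting_failure"
    EVALUATOR_DISAGREEMENT = "evaluator_disagreement"


def parse_class(label: str) -> FailureClass:
    """Force a diagnosis label into exactly one known class or raise ValueError."""
    return FailureClass(label.strip().lower())


def class_distribution(events: list) -> Counter:
    """Count failures per class; `events` holds (task_index, FailureClass) pairs."""
    return Counter(cls for _, cls in events)


def recurrence_intervals(events: list) -> dict:
    """Map each class to the gaps between its consecutive occurrences."""
    last_seen = {}
    gaps = {}
    for index, cls in sorted(events, key=lambda event: event[0]):
        if cls in last_seen:
            gaps.setdefault(cls, []).append(index - last_seen[cls])
        last_seen[cls] = index
    return gaps
```

Growing gaps for a class after an intervention are evidence the repair is working; a class whose labels keep flipping between runs of the same trace points to diagnosis churn instead.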
Daily consolidation
Objective. Determine whether daily background consolidation improves retrieval precision and lowers context load the next day. Setup. Run a nightly job that merges duplicate notes, flags invalidated facts, and extracts a one-page daily summary. Compare against a no-consolidation control period. Files and code. memory/episodic/, memory/semantic/memories.sqlite, memory/daily/summary.md. Prompt template. “Review today’s episodes. Keep durable preferences and procedures; mark volatile facts as tentative; merge duplicates; mark contradictions explicitly.” Success and failure criteria. Success is better next-day retrieval relevance and lower tokens read per task at equal or better outcome quality. Failure is stale-memory amplification or overcompression. Safety, cost, logging. Automatic with audit trail. Cost is low. Log memory additions, merges, invalidations, and retrieval-hit quality next day. refs: turn24view3, turn33search0, turn26view7, turn26view9
Weekly distillation
Objective. Test whether weekly playbook distillation creates more reusable competence than accumulating raw reflections forever. Setup. Every week, cluster reflections by domain and produce a compact playbook with “always,” “if-then,” and “never” rules plus examples. Files and code. memory/weekly_playbooks/, skills/drafts/, skills/approved/, telemetry/regressions.jsonl. Prompt template. “Distill these reflections into a stable playbook. Preserve constraints, collapse duplicates, keep edge cases, and point out any rules that conflict or need human confirmation.” Success and failure criteria. Success is equal or better performance with shorter context injection and less retrieval noise; failure is context collapse or loss of useful edge cases. Safety, cost, logging. Promote only after human review or staged canarying. Cost is medium. Log the before/after playbook size and held-out eval deltas. refs: turn26view5, turn24view6, turn24view4
User-preference learning
Objective. Learn stable user preferences without over-personalizing or clinging to outdated ones. Setup. Record only explicit corrections, repeated stable choices, and high-confidence preferences; store them separately from volatile context. Files and code. memory/user_profile.md, memory/preferences.json, memory tables with timestamps and confidence. Prompt template. “Should this be stored as a durable user preference? Answer only if there is explicit evidence, likely future utility, and low privacy risk. Otherwise abstain.” Success and failure criteria. Success is improved preference-consistency and lower need for repeated corrections; failure is intrusive personalization, obsolete preference reuse, or storing sensitive data unnecessarily. Safety, cost, logging. Require evidence links and confidence thresholds. Cost is low. Log every preference mutation and let the user inspect or delete it. refs: turn24view2, turn24view3, turn24view4, turn26view9, …
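The "store only with evidence, confidence, and low privacy risk" rule can be written as an explicit gate in front of the preference table. In this sketch the PreferenceCandidate fields, the 0.8 confidence threshold, and the audit-log shape are all illustrative assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class PreferenceCandidate:
    key: str
    value: str
    confidence: float
    evidence_ids: list = field(default_factory=list)  # links to runs or messages
    sensitive: bool = False


def should_store(candidate: PreferenceCandidate,
                 min_confidence: float = 0.8, min_evidence: int = 1) -> bool:
    """Abstain unless there is evidence, enough confidence, and low privacy risk."""
    return (
        not candidate.sensitive
        and len(candidate.evidence_ids) >= min_evidence
        and candidate.confidence >= min_confidence
    )


def apply_preference(prefs: dict, candidate: PreferenceCandidate,
                     timestamp: str, audit_log: list) -> bool:
    """Write the preference if it passes the gate and log the mutation."""
    if not should_store(candidate):
        return False
    previous = prefs.get(candidate.key, {}).get("value")
    prefs[candidate.key] = {
        "value": candidate.value,
        "confidence": candidate.confidence,
        "evidence": list(candidate.evidence_ids),
        "updated_at": timestamp,
    }
    audit_log.append({"key": candidate.key, "old": previous,
                      "new": candidate.value, "at": timestamp})
    return True
```

Keeping the old value in the audit log is what makes a preference mutation inspectable and reversible by the user.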
Code and tool improvement
Objective. Let OpenCLAW repair brittle wrappers, prompts, parsers, or scripts in a sandbox, then measure whether the fix generalizes. Setup. Trigger only on repeated reproducible failures with existing regression cases. Create a worktree or sandbox branch, propose a patch, run tests, then request approval. Files and code. sandbox/worktrees/, tools/adapters/, tools/tests/, memory/failures/. Prompt template. “Given this failing trace and regression test, produce the smallest patch that fixes the bug, avoids widening permissions, and adds or updates one regression test.” Success and failure criteria. Success is more passing regressions and fewer retries on similar tasks; failure is test-only overfitting or new breakage elsewhere. Safety, cost, logging. Mandatory sandbox, tests, and approval. Cost is medium. Log patch diffs, test matrix, and rollback pointer. refs: turn24view6, turn24view1, turn35view0
Prompt refinement
Objective. Improve agent prompts and context playbooks using held-out evals instead of anecdotal editing. Setup. Maintain a stable eval set by task family and optimize prompt fragments or retrieval orderings only against validation, never against production alone. Files and code. config/routing.yaml, evals/datasets/, evals/holdout/, memory/weekly_playbooks/. Prompt template. Use an optimizer prompt that proposes small diffs, not wholesale rewrites: “Suggest at most three prompt changes, explain their hypothesis, and estimate the failure modes they target.” Success and failure criteria. Success is held-out lift at fixed or reduced token cost; failure is gains only on the tuning set, larger prompts with no net yield, or collapse on smaller models. Safety, cost, logging. Staged rollout only. Cost is medium. Log prompt version, validation lift, cost delta, and post-deploy decay. refs: turn25view9, turn27view5, turn27view6, turn28view10
Heartbeat maintenance
Objective. Test whether scheduled, mostly deterministic health checks reduce downtime and preserve agent quality across long-running deployments. Setup. Create lightweight scheduled checks for scheduler health, messaging connectivity, disk usage, token-spend anomalies, and failed cron jobs. Files and code. scripts/heartbeat_checks.py, config/schedules.yaml, telemetry/costs.jsonl, telemetry/runs.jsonl. Prompt template. Prefer script-only checks. Use the agent only when the script detects an anomaly and escalation is needed. Success and failure criteria. Success is faster recovery and lower incidence of silent failure; failure is noisy alerts or expensive checks displacing useful work. Safety, cost, logging. Allow automatic low-risk remediation only for reversible actions. Cost is very low if script-first. Log alert precision and mean time to recovery. refs: turn31search0, turn31search7, turn31search8
Artifact evaluation
Objective. Improve output quality for reports, messages, summaries, patches, or generated documents by using a separate grader loop. Setup. Build domain-specific rubrics for at least two artifact classes and compare one-pass generation against bounded retry with a separate evaluator. Files and code. evals/rubrics/, artifacts/generated/, artifacts/accepted/, artifacts/rejected/. Prompt template. “Score this artifact against the rubric only. Do not rewrite until the scoring step is complete. If the artifact fails, produce targeted revision instructions, not a full replacement.” Success and failure criteria. Success is higher external grader or user acceptance with bounded extra cost; failure is reviewer-induced degradation of previously good artifacts. Safety, cost, logging. Use maximum retry caps and helpfulness-harmfulness accounting. Cost is medium. Log generator version, grader version, per-dimension scores, and whether the retry beat the original. refs: turn36view0, turn24view8
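The comparison between one-pass generation and bounded retry can be sketched as a loop that takes the generator and the grader as separate callables. Everything here is a hypothetical interface: `generate(feedback)` and `grade(artifact)` stand in for whatever model calls OpenCLAW makes, and `pass_score` and `max_retries` are placeholder values.

```python
from typing import Callable, Optional, Tuple


def generate_with_bounded_retry(
    generate: Callable[[Optional[str]], str],
    grade: Callable[[str], Tuple[float, str]],
    pass_score: float = 0.8,
    max_retries: int = 2,
) -> dict:
    """Generate once, then revise at most `max_retries` times.

    `grade(artifact)` returns (score, revision_feedback). The best-scoring
    artifact is kept, so a bad revision cannot replace a better original.
    """
    artifact = generate(None)
    score, feedback = grade(artifact)
    first_score = score
    best = {"artifact": artifact, "score": score, "attempt": 0}

    attempts = 0
    while score < pass_score and attempts < max_retries:
        attempts += 1
        artifact = generate(feedback)
        score, feedback = grade(artifact)
        if score > best["score"]:
            best = {"artifact": artifact, "score": score, "attempt": attempts}

    # Logged so the loop can be audited for reviewer-induced degradation.
    best["attempts_used"] = attempts
    best["retry_beat_original"] = best["score"] > first_score
    return best
```

Keeping the best artifact rather than the latest one is the design choice that addresses the failure criterion above: a retry that degrades a previously good artifact is recorded but never returned.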
Public-feedback loops
Objective. Learn from user acceptance signals—edits, approvals, reactions, reuse—without mistaking popularity for correctness. Setup. Collect weak labels from messaging channels or artifact usage, but keep them separate from hard success metrics. Files and code. telemetry/public_feedback.jsonl, memory/preferences.json, evals/reports/feedback_correlation.md. Prompt template. No direct self-edit prompt. Instead, periodically translate aggregated feedback into hypotheses: “What stable preference or presentation pattern is supported by repeated user actions?” Success and failure criteria. Success is improved user acceptance on comparable future outputs; failure is optimizing style at the expense of correctness. Safety, cost, logging. Aggregate-only; never let one reaction rewrite global policy. Cost is low. Log the correlation between feedback labels and objective task success. refs: turn11search0, turn28view6
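The logging requirement above, correlation between feedback labels and objective task success, needs nothing more than a Pearson coefficient over paired binary values. The sketch below is one possible shape; the record keys `approved` and `task_success` are assumed names, not fields defined elsewhere in this report.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation; returns None when either series is constant."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two aligned series with at least 2 points")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return None  # correlation is undefined for a constant series
    return cov / sqrt(var_x * var_y)


def feedback_success_correlation(records):
    """Correlate weak feedback labels with hard success labels.

    Each record is a dict with boolean keys `approved` and `task_success`
    (hypothetical field names for this sketch).
    """
    feedback = [1 if r["approved"] else 0 for r in records]
    success = [1 if r["task_success"] else 0 for r in records]
    return pearson(feedback, success)
```

A correlation near zero or below it is the warning sign: users are approving outputs for reasons unrelated to whether the task actually succeeded, so the feedback should not drive policy.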
Tool-call reviewer
Objective. Insert a reviewer before risky tool calls and measure whether the reviewer produces net-positive value. Setup. Apply only to high-risk tools: filesystem writes, external messages, edits to memory, code execution, or calendar-like actions. Files and code. evals/rubrics/tool_safety.yaml, telemetry/reviewer_comparison.jsonl. Prompt template. “Given the provisional tool call, judge whether the selected tool, arguments, and scope are correct. If not, return the minimal correction.” Success and failure criteria. Success is a positive net value using helpfulness-harmfulness accounting; failure is excessive blocking or reviewer-induced corruption of correct calls. Safety, cost, logging. Only for risky tools. Cost is medium. Log blocked calls, corrected calls, false blocks, and degraded-correct events. refs: turn24view8
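Helpfulness-harmfulness accounting can be made concrete by classifying each reviewed call and netting the counts. The sketch below assumes that correctness labels come from an external check (environment state or a human audit), not from the reviewer itself; the event keys, category names, and `harm_weight` parameter are all hypothetical.

```python
from collections import Counter


def classify_review(base_correct: bool, final_correct: bool, blocked: bool) -> str:
    """Label one reviewed tool call using externally verified correctness."""
    if blocked:
        return "false_block" if base_correct else "blocked_bad_call"
    if not base_correct and final_correct:
        return "corrected"
    if base_correct and not final_correct:
        return "degraded_correct"
    return "no_change"


def reviewer_net_value(events, harm_weight: float = 1.0) -> dict:
    """Net the reviewer's helpful interventions against its harmful ones.

    Each event is a dict with boolean keys `base_correct`, `final_correct`,
    and `blocked`. Raise `harm_weight` when a wrong block or a corrupted
    call costs more than a missed fix.
    """
    counts = Counter(classify_review(**event) for event in events)
    helpful = counts["corrected"] + counts["blocked_bad_call"]
    harmful = counts["degraded_correct"] + counts["false_block"]
    return {
        "counts": dict(counts),
        "helpful": helpful,
        "harmful": harmful,
        "net_value": helpful - harm_weight * harmful,
    }
```

The four non-neutral categories map directly onto the logging list above: blocked calls, corrected calls, false blocks, and degraded-correct events.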
Curriculum generation
Objective. Test whether synthetic or derived practice tasks fill capability gaps faster than waiting for real failures. Setup. Cluster past failures, generate synthetic tasks that vary one difficulty factor at a time, and validate on a held-out real-task set. Files and code. evals/datasets/curriculum/, memory/failures/, skills/drafts/. Prompt template. “Generate three minimally different tasks that stress the missing subskill identified in this failure cluster. Each task must specify a verifiable outcome.” Success and failure criteria. Success is downstream lift on real held-out tasks; failure is improvement only on synthetic exercises. Safety, cost, logging. Keep synthetic tasks in a separate training split. Cost is medium. Log task provenance and transfer rate from synthetic to real evals. refs: turn26view1, turn13search0, turn28view8
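"Vary one difficulty factor at a time" can be enforced in code rather than left to the generating model. The sketch below builds task specifications that each differ from a base task in exactly one factor and tags them as synthetic so they stay out of real evals. The factor names in the usage example and the output structure are invented for illustration.

```python
def one_factor_variants(base: dict, alternatives: dict) -> list[dict]:
    """Return task specs that each differ from `base` in exactly one factor.

    `alternatives` maps a factor name to the values to try for it.
    """
    variants = []
    for factor, values in alternatives.items():
        if factor not in base:
            raise KeyError(f"unknown factor: {factor}")
        for value in values:
            if value == base[factor]:
                continue  # identical to the base task, so not a variation
            spec = dict(base)
            spec[factor] = value
            # Provenance is kept outside the spec so the spec stays comparable.
            variants.append({"spec": spec, "varied_factor": factor, "synthetic": True})
    return variants


base_task = {"input_length": "short", "date_format": "iso", "tool_count": 1}
practice = one_factor_variants(
    base_task, {"input_length": ["long"], "date_format": ["iso", "us"]}
)
```

Because each variant records which factor changed, a failure on a synthetic task points at a single difficulty dimension instead of an ambiguous mix.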
Telemetry-driven routing
Objective. Improve quality-cost tradeoffs by changing model/tool routes based on observed traces, not intuition. Setup. Instrument runs with model, cost, latency, retrieval depth, retry count, success class, and artifact rubric scores. Files and code. telemetry/spans.jsonl, telemetry/costs.jsonl, config/routing.yaml. Prompt template. No LLM prompt is required for the core loop; use the model only to summarize statistically credible route changes. Success and failure criteria. Success is higher success-per-dollar or lower latency at equal quality; failure is routing changes that optimize proxies while reducing real task success. Safety, cost, logging. Route changes should be staged and reversible. Cost is low. Log route version and route-specific win rates. refs: turn24view7, turn27view2
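The core routing loop is plain aggregation. A minimal sketch, assuming each run record carries `route`, `success`, and `usd` keys (hypothetical names; the real fields would come from telemetry/spans.jsonl and telemetry/costs.jsonl), might look like this:

```python
from collections import defaultdict


def route_stats(runs, min_runs: int = 20) -> dict:
    """Aggregate success rate and success-per-dollar for each route.

    Routes with fewer than `min_runs` samples are omitted, as a crude
    guard against acting on statistically weak evidence.
    """
    totals = defaultdict(lambda: {"runs": 0, "successes": 0, "usd": 0.0})
    for run in runs:
        bucket = totals[run["route"]]
        bucket["runs"] += 1
        bucket["successes"] += 1 if run["success"] else 0
        bucket["usd"] += run["usd"]

    stats = {}
    for route, bucket in totals.items():
        if bucket["runs"] < min_runs:
            continue
        stats[route] = {
            "runs": bucket["runs"],
            "success_rate": bucket["successes"] / bucket["runs"],
            # None when the route cost nothing, to avoid dividing by zero.
            "success_per_dollar": (
                bucket["successes"] / bucket["usd"] if bucket["usd"] > 0 else None
            ),
        }
    return stats
```

A fixed sample floor is only a stand-in for a proper significance test, and `success` here should be the external success class, not a self-judged score, or the loop will optimize a proxy.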
Risks and failure modes
The most important failure mode is fake learning: the agent rewrites more state, sounds more reflective, and perhaps gets higher self-judged scores, but does not improve on future real tasks. The current benchmark literature makes this danger concrete. AppWorld and τ-bench show that realistic task completion is still hard and inconsistent; the “why they fail” analysis identifies planning, execution, and response errors as distinct failure stages; and memory benchmarks such as Memora and MemoryAgentBench show that remembering more is not the same thing as updating correctly or forgetting obsolete facts. refs: turn26view8, turn27view11, turn17search0, turn26view9, …
The second risk is circular self-evaluation. If the same model family generates, critiques, scores, and writes persistent memory, the loop can become self-confirming. The best counterexamples in the literature explicitly break that circularity. The official outcomes system uses a separate grader in a different context window; the reviewer-agent paper measures both corrected errors and degraded correct outputs; and Memora’s FAMA metric penalizes obsolete-memory reuse rather than rewarding persuasive but stale answers. OpenCLAW should copy that principle aggressively: separate judges from generators, evaluate against environment state when possible, and score stale-memory usage negatively. refs: turn36view0, turn24view8, turn26view9
The third risk is memory pollution. A-MEM, Mem0, LightMem, and Letta all exist because flat “append and embed everything” memory quickly becomes noisy, expensive, and hard to mutate correctly. More recent benchmarks sharpen the problem: PERMA shows preferences emerge gradually in noisy contexts, while Memora shows many memory agents still fail to reconcile changing facts over time. OpenCLAW therefore needs explicit invalidation, timestamping, confidence, and evidence fields for every memory object, plus regular consolidation and pruning passes. refs: turn26view3, turn26view7, turn28view5, turn24view2, …
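Explicit invalidation is easier to reason about with a concrete operation. The sketch below shows one possible way to supersede a memory object instead of overwriting it, using `valid_from`, `valid_to`, `supersedes`, and `confidence` fields. The in-memory dict store and both function names are assumptions made for illustration; a real deployment would more likely sit on SQLite.

```python
from datetime import datetime, timezone
from typing import Optional


def supersede(store: dict, old_id: str, new_entry: dict,
              now: Optional[datetime] = None) -> None:
    """Close the old entry's validity window and link the replacement to it."""
    stamp = (now or datetime.now(timezone.utc)).isoformat()
    old = store[old_id]
    if old["valid_to"] is not None:
        raise ValueError(f"{old_id} is already invalidated")
    old["valid_to"] = stamp  # kept for audit, but no longer retrievable as current

    entry = dict(new_entry)
    entry["valid_from"] = stamp
    entry["valid_to"] = None
    entry["supersedes"] = list(entry.get("supersedes", [])) + [old_id]
    store[entry["id"]] = entry


def active_entries(store: dict, min_confidence: float = 0.0) -> list:
    """Entries that are still valid and meet the confidence floor."""
    return [
        e for e in store.values()
        if e["valid_to"] is None and e["confidence"] >= min_confidence
    ]
```

Retrieval then reads only from `active_entries`, which makes reuse of an obsolete fact a detectable bug rather than a silent default.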
The fourth risk is unsafe self-modification. Code execution, file mutation, remote tools, and messaging can all magnify a wrong action. The safest production pattern in the sources is not “never self-modify,” but “self-modify only inside scaffolds”: progress files, test oracles, commits, worktrees, bounded tools, reviewer loops, and explicit approvals before promotion to production. That is why code/tool improvement should come after telemetry, evals, and rollback exist—not before. refs: turn24view6, turn24view1, turn35view0, turn24view8
The fifth risk is prompt collapse and over-optimization. ACE names context collapse directly; smaller-model OPRO limitations show automated prompt optimization can fail when optimizer capability is weak; and prompt optimization more broadly tends to find narrow improvements unless grounded by robust validation. OpenCLAW should therefore restrict prompt refinement to small diffs, validation-only tuning, canary deployment, and automatic rollback when real-task metrics drop. refs: turn26view5, turn28view10, turn25view9
A practical anti-fraud checklist follows from those sources. Improvement claims should only count if they survive a held-out split, maintain or improve external success, preserve or improve success-per-dollar, and hold up over repeated trials. Reviewer loops should be scored for both helpfulness and harmfulness. Memory loops should use forgetting-aware metrics. Prompt or code modifications should always be versioned and reversible. And user-facing preference learning should distinguish durable preferences from transient circumstances. refs: turn24view8, turn26view9, turn27view11, turn26view8, …
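The first part of that checklist can be expressed as a single acceptance test over repeated trials. The sketch below is one possible reading, not a standard: `TrialResult`, the minimum trial count, and the "worst candidate trial must not fall below the worst baseline trial" rule are all assumptions chosen to make "hold up over repeated trials" checkable.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class TrialResult:
    heldout_success: float  # external success rate on the held-out split, 0..1
    usd_per_task: float     # average cost per task in this trial


def improvement_counts(baseline: list, candidate: list, min_trials: int = 3) -> bool:
    """True only if the candidate survives every check on the same held-out split."""
    if len(baseline) < min_trials or len(candidate) < min_trials:
        return False  # not enough repeated trials to claim anything

    base_success = mean(t.heldout_success for t in baseline)
    cand_success = mean(t.heldout_success for t in candidate)
    if cand_success < base_success:
        return False  # external success must be maintained or improved

    base_cost = mean(t.usd_per_task for t in baseline)
    cand_cost = mean(t.usd_per_task for t in candidate)
    if base_cost <= 0 or cand_cost <= 0:
        raise ValueError("costs must be positive")
    if cand_success / cand_cost < base_success / base_cost:
        return False  # success-per-dollar must not regress

    # Assumed stability rule: the candidate's worst trial is no worse
    # than the baseline's worst trial.
    worst_base = min(t.heldout_success for t in baseline)
    worst_cand = min(t.heldout_success for t in candidate)
    return worst_cand >= worst_base
```

The remaining checklist items (forgetting-aware memory metrics, reviewer harmfulness, versioned and reversible changes) need their own gates and are not covered by this function.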
Most realistic OpenCLAW implementation path
The most realistic path is staged, not heroic. Because no language, hosting stack, or compute budget has been specified, the safest assumption is a Python-friendly, storage-light, API-model-first design that can be deployed on a Chromebook with remote APIs, a small VPS, or lightweight cloud. The seven phases below are ordered so that each phase produces the evidence required for the next one. That ordering is consistent with the gap between lightweight memory/context architectures and heavier RL-oriented frameworks in the literature. refs: turn30view1, turn30view0, turn24view1, turn27view0, …
Phase 1 — Instrumentation and gold tasks. Build telemetry, evaluator logs, artifact hashes, and a compact but meaningful eval suite first. Add routing logs, run IDs, prompt versions, and environment-state checks. The exit criterion is simple: you can replay a task and say why it passed or failed. Without this, every later “learning” loop is theater. refs: turn24view7, turn26view8, turn27view11, turn25view12
Phase 2 — Explicit memory and reflection. Add markdown identity/profile files, SQLite-backed episodic and semantic memory, and low-risk post-task reflections. Keep this layer inspectable and pruneable. The exit criterion is a measurable reduction in repeat failures or repeated user corrections, not just more stored data. refs: turn24view2, turn24view3, turn30view0, turn27view4, …
Phase 3 — Daily consolidation and weekly distillation. Once basic reflection works, add scheduled background loops that merge duplicates, mark contradictions, and promote recurring procedures into playbooks or draft skills. The exit criterion is better retrieval precision and shorter injected context with equal or better outcomes. refs: turn33search0, turn31search0, turn26view5, turn36view0
Phase 4 — Skill extraction and routine automation. Convert repeated successful workflows into versioned skills and attach them to schedules or reusable task families. This is where OpenCLAW starts compounding. The exit criterion is transfer: a skill learned in one context measurably improves another similar context. refs: turn26view1, turn31search0, turn32academia17
Phase 5 — Separate evaluators and artifact grading. Add a reviewer or grader for risky tool calls and high-value artifacts, and score it with helpfulness-harmfulness accounting. The exit criterion is net-positive value from the reviewer, not just more retries. refs: turn24view8, turn36view0
Phase 6 — Prompt, routing, and sandboxed tool improvement. Only now should OpenCLAW start editing prompt playbooks, route policies, or tool wrappers based on stable evals and regression tests. The exit criterion is held-out lift plus safe rollback. refs: turn25view9, turn27view5, turn24view6, turn24view1
Phase 7 — Optional curriculum generation and offline RL. If OpenCLAW eventually has stable domains, good metrics, large trace volumes, and real need for deeper policy optimization, then curriculum generation, offline preference learning, or RL-style agent training become reasonable. Until then, they are likely premature. The exit criterion here is organizational, not just technical: enough data, enough repeatability, and enough value to justify the infrastructure. refs: turn13search0, turn28view8, turn27view0, turn26view6
Appendix
The appendix below turns the report into implementable artifacts. The schemas and prompts are not copied from any one system, but they are directly shaped by the memory-block, progress-file, dream-reflection, skill, and telemetry patterns documented in Letta, Hermes, Claude Code, Managed Agents, and OpenTelemetry. refs: turn24view2, turn24view4, turn24view6, turn36view0, …
Schemas
# skills/registry.yaml
version: 1
skills:
  - id: skill_write_weekly_summary_v3
    status: approved  # draft | approved | archived
    domain: reporting
    triggers:
      - repeated_task_pattern: weekly_summary
      - success_count_gte: 5
    description: >
      Produce a weekly summary from project notes and telemetry.
    required_tools:
      - file_search
      - code_execution
      - messaging
    required_inputs:
      - notes_path
      - date_range
    output_contract:
      artifact_type: markdown_report
      must_include:
        - completed_items
        - blockers
        - metrics
    tests:
      - eval_id: report_rubric_v2
      - replay_id: weekly_summary_holdout_a
    safety:
      mutates_code: false
      sends_external_messages: false
      needs_human_approval_for_promotion: true
    provenance:
      derived_from_runs:
        - run_2026_05_01_001
        - run_2026_05_03_004
      author_loop: weekly_distillation
    last_known_good:
      prompt_version: prompt_reporting_v7
      route_policy: route_v4
{
  "memory_entry": {
    "id": "mem_2026_05_09_001",
    "kind": "preference",
    "scope": "user",
    "title": "User prefers concise summaries with clear action items",
    "value": "When writing summaries, keep them short and action-oriented.",
    "evidence": [
      {
        "source_run_id": "run_2026_05_07_019",
        "quote_or_event": "User asked to shorten a long summary and foreground next actions"
      }
    ],
    "confidence": 0.87,
    "durability": "high",
    "valid_from": "2026-05-07T22:14:00Z",
    "valid_to": null,
    "supersedes": [],
    "tags": ["writing", "style", "preference"],
    "privacy_level": "normal",
    "write_gate": "auto_with_audit"
  }
}
# evals/reports/evaluator_log.yaml
run_id: run_2026_05_09_022
artifact_hash: sha256:...
artifact_type: markdown_report
generator:
  model: executor_fast_v2
  prompt_version: reporting_v7
reviewer:
  model: critic_strong_v3
  rubric: report_quality_v4
scores:
  correctness: 0.86
  completeness: 0.81
  style_fit: 0.92
  safety: 0.99
decision:
  accepted: false
  retry_allowed: true
  retry_reason: "Missing two blocker items"
reviewer_harmfulness_check:
  base_artifact_was_correct: false
  reviewer_changed_correct_output_to_incorrect: false
memory_writes:
  created:
    - mem_2026_05_09_011
  blocked: []
cost:
  input_tokens: 10234
  output_tokens: 981
  usd_estimate: 0.19
Prompt templates
Reflection prompt
You are writing a single reusable lesson for a future agent run.
Inputs:
- task goal
- tool trace
- final outcome
- explicit user feedback
- evaluator result
Return exactly:
1. Trigger condition
2. What failed or unexpectedly worked
3. One action rule for next time
4. Evidence from this run
5. Confidence: low | medium | high
Constraints:
- No motivational language
- No vague advice
- No global instruction changes
- If evidence is weak, abstain
Skill extraction prompt
You are deciding whether a repeated procedure should become a reusable skill.
Given:
- 3 to 10 successful runs
- common steps across runs
- required tools
- known edge cases
- tests or replays
Produce:
- skill name
- scope
- prerequisites
- step-by-step procedure
- anti-patterns
- tests
- promotion recommendation: draft only / ready for review
Reject if:
- the workflow is too specific to one run
- there is no stable success signal
- the procedure depends on hidden assumptions
Artifact evaluator prompt
You are a separate grader. Do not improve the artifact until scoring is complete.
Score the artifact on:
- correctness
- completeness
- policy compliance
- audience fit
- brevity / structure
Then return:
- pass/fail
- highest-priority defect
- minimal revision instructions
- whether a retry is likely to help
Important:
- If the artifact is already good, say so
- Do not invent defects to justify another retry
Pseudocode
def run_task(task, context):
    """Run one task end to end. Helper names are placeholders, not a real API."""
    run = create_run_record(task, context)
    plan = executor.plan(task, context)
    result = executor.act(plan, tools=context.tools, workspace=context.workspace)

    # Optional artifact, reviewed by a separate grader, with one bounded revision.
    artifact = maybe_generate_artifact(result)
    review = maybe_review_artifact(artifact, rubric=select_rubric(task))
    if review and review.retry_allowed:
        artifact = executor.revise(artifact, review.minimal_revision_instructions)

    # Score against external or environment state, then log everything.
    outcome = external_eval(task, result, artifact, context)
    log_outcome(run, result, artifact, review, outcome)

    # Learning writes are gated: lessons land as drafts, preferences are audited.
    if should_reflect(task, outcome, result):
        lesson = reflect(task, result, review, outcome)
        write_draft_reflection(lesson)
    if should_store_preference(task, outcome, result):
        pref = infer_preference(task, result, context.user_feedback)
        write_preference_with_audit(pref)

    # Code changes are only requested here; they happen later, in a sandbox.
    if should_open_code_improvement_issue(outcome, result):
        create_sandbox_patch_request(task, result)
    return finalize(run, outcome, artifact)
Architecture and roadmap diagrams
flowchart LR
    PhaseA[Instrumentation and gold tasks]
    PhaseB[Memory and reflection]
    PhaseC[Daily consolidation and weekly distillation]
    PhaseD[Skill extraction and routine automation]
    PhaseE[Separate evaluators and artifact grading]
    PhaseF[Prompt routing and sandboxed tool improvement]
    PhaseG[Curriculum generation and optional offline RL]
    PhaseA --> PhaseB --> PhaseC --> PhaseD --> PhaseE --> PhaseF --> PhaseG

flowchart TD
    FreshSession[Fresh task session]
    PersistentState[Persistent state]
    FreshSession --> WorkingContext[Working context assembly]
    PersistentState --> WorkingContext
    WorkingContext --> Execute[Execute with tools]
    Execute --> Evaluate[Evaluate against rubric or environment]
    Evaluate --> Reflect[Write draft lesson]
    Reflect --> PersistentState
    Evaluate --> Consolidate[Background consolidation]
    Consolidate --> PersistentState
    Evaluate --> Promote[Human-gated promotion]
    Promote --> PersistentState
The shortest formulation of the report is this: if OpenCLAW rewrites memory, skills, and playbooks from grounded evidence, it can become meaningfully more capable on lightweight infrastructure; if it tries to jump straight to autonomous prompt drift, unrestricted code rewriting, or RL training without evals and rollback, it will mostly become harder to trust. That is the core lesson that stays consistent across the research literature, the benchmark landscape, and the clearest production-facing systems. refs: turn26view0, turn26view1, turn26view5, turn24view6, …