Recursive Learning Loops for OpenCLAW Agents

Executive summary

For an OpenCLAW-like personal agent, the most realistic interpretation of “recursive learning” is not continual weight training. It is repeated, structured modification of external substrates: memory files, skill documents, prompt playbooks, evaluator rubrics, routing rules, schedules, tool wrappers, and sometimes sandboxed code. That is the common pattern across the strongest practical systems in this area: Reflexion improves via linguistic feedback and episodic memory; Voyager grows a reusable code skill library and automatic curriculum; MemGPT manages memory tiers; Letta exposes editable memory blocks and sleep-time reflection; A-MEM organizes memories into linked notes; ACE evolves context playbooks; and MemRL improves by learning over memory utility rather than by fully retraining a base model. refs: turn26view0, turn26view1, turn25view2, turn24view2, …

That distinction matters because it changes what is feasible on a Chromebook, a small VPS, or a lightweight cloud instance. Reflection, memory consolidation, skill extraction, evaluator loops, scheduled “sleep-time” review, and telemetry-driven routing are practical with API models plus local storage. Full reinforcement learning stacks, by contrast, are significantly heavier: Agent Lightning and Tool-R1 are promising, but they assume datasets, rollouts, reward design, and training infrastructure that most personal-agent deployments should defer. refs: turn27view0, turn27view1, turn26view6

On the specific project you mentioned, the official docs identify Hermes Agent as built by Nous Research, not as a Massachusetts Institute of Technology project. More importantly for OpenCLAW, those docs expose exactly the sort of architecture you care about: self-improving skills, persistent knowledge, remote or local execution backends, and built-in scheduling. refs: turn24view0, turn24view1, turn31search0

The practical conclusion is straightforward. The highest-return early loops for OpenCLAW are: post-task reflection, failed-task diagnosis, user-preference memory, daily consolidation, weekly distillation, artifact evaluation with a separate grader, telemetry-based routing, and heartbeat maintenance. The highest-risk late loops are: autonomous code self-modification outside a sandbox, unconstrained self-rewriting prompts, and any form of live RL or weight updates without hard evals, rollback, and human approval. Benchmark evidence supports that caution: current agents are still inconsistent on realistic domains, with less-than-50% success in several demanding settings, and memory systems still struggle with stale or invalidated information. refs: turn27view11, turn26view8, turn17search0, turn26view9, …

Taxonomy

The cleanest way to think about recursive learning is to ask a single question: what substrate is being rewritten, by which signal, under what gate? For OpenCLAW, that framing matters more than whether a paper uses the language of “self-improvement,” “reflection,” or “continual learning.” The loop families below are ordered from most deployment-ready to most operationally expensive. Cost and feasibility are engineering estimates for a small deployment, derived from the cited architectures. refs: turn26view0, turn26view1, turn24view2, turn24view3, …

Generic loop families

Loop family	Substrate modified	Typical inputs	Typical outputs	Common failure modes	Small-deployment cost	Safety risk	Small-deployment feasibility	Representative sources
Memory extraction and consolidation	Episodic logs, semantic/user memory, linked notes	Task traces, user corrections, conversation history	Memory entries, summaries, linked notes	Stale memories, over-insertion, privacy leakage, retrieval noise	Low	Medium	Excellent	refs: turn24view2, turn24view3, turn26view3, turn26view7, …
Reflection and diagnosis	Reflection buffers, failure notes, lessons-learned files	Outcome signal, tool errors, evaluator feedback	Error diagnosis, “next time do X” guidance	Superficial self-critique, vague lessons, circular self-approval	Low	Medium	Excellent	refs: turn26view0, turn28view2, turn24view6
Skill libraries	Skill docs, code snippets, reusable procedures, APIs	Successful trajectories, repeated workflows	Executable skills, retrieval tags, tests	Skill bloat, stale skills, overfitting to one environment	Low to medium	Medium	Excellent	refs: turn26view1, turn32search4, turn31search0, turn32academia17, …
Planning and search loops	Plans, search trees, policy over expansions	Current state, tool feedback, web/app state	Search trajectories, candidate plans	Excess cost, branch explosion, overthinking, latency	Medium to high	Medium	Moderate	refs: turn6search10, turn28view0, turn13search0
Evaluator and critic loops	Rubrics, critique traces, candidate rankings	Draft artifacts, provisional tool calls, state diffs	Scores, critiques, retries, filtered actions	Judge bias, critic-induced regressions, reward hacking	Medium	Medium to high	Excellent	refs: turn24view8, turn36view0, turn28view1
RL from AI or human feedback	Reward models, policy weights, optimizer state	Preference data, verifiable rewards, rollout traces	Fine-tuned policy or reward updates	Reward hacking, catastrophic forgetting, infra complexity	High	High	Poor to moderate	refs: turn11search0, turn27view0, turn26view6, turn28view9
Prompt and context self-modification	System prompts, playbooks, demonstrations, context blocks	Held-out evals, execution feedback, rubric scores	New prompt variants, context deltas, compiled policies	Prompt collapse, benchmark overfitting, verbosity creep	Low to medium	Medium	Excellent	refs: turn26view5, turn25view9, turn27view5, turn27view6, …
Code and tool self-modification	Tool wrappers, scripts, parsers, harness code	Failing traces, tests, human review, diffs	Patches, new tool affordances, regression tests	Unsafe actions, privilege escalation, breaking tools	Medium to high	High	Moderate with sandbox	refs: turn24view6, turn24view1, turn35view0
Curriculum generation	Task queues, challenge suites, synthetic evals	Failure clusters, missing-skill analysis	New tasks, practice sets, seed curricula	Easy-task bias, self-generated junk tasks	Medium	Low to medium	Moderate	refs: turn26view1, turn13search0, turn28view8
Telemetry-driven loops	Routing rules, thresholds, tool enablement, dashboards	Traces, metrics, latency, success/failure logs	New model routes, guardrails, alerting	Optimizing for proxies, silent regressions	Low	Medium	Excellent	refs: turn24view7, turn27view2

OpenCLAW loop catalog

Given OpenCLAW’s stated capabilities—tools, file access, code execution, markdown memory, messaging, scheduling, and artifact generation—the following named loops are the most natural fit. The key design choice is that not every loop should have equal write permissions. refs: turn24view6, turn31search0, turn33search0, turn36view0

OpenCLAW loop type	Trigger	Primary writable substrate	Best success signal	Default gate
Post-task reflection	Every completed task above a complexity threshold	`memory/reflections/` and per-domain lessons	Reduced repeat errors on delayed holdout tasks	Automatic draft write
Failed-task diagnosis	Any failed task or user correction	`memory/failures/` and evaluator logs	Lower recurrence of same failure class	Automatic draft write
Daily consolidation	End of day or N tasks	User/profile memory, episodic summary, stale-memory candidates	Better retrieval precision next day	Automatic with audit trail
Weekly distillation	Weekly cron	Condensed playbooks, pruned memories, promoted skills	Lower context size with equal or better task success	Human review for promotion
User-preference learning	Explicit correction or stable repeated preference	Profile markdown and structured preference table	Higher preference-consistency without intrusive personalization	Human-visible log
Code and tool improvement	Repeated tool failures or brittle wrappers	Sandbox branch, tool adapters, tests	More passing regression cases and fewer retries	Mandatory human approval
Prompt refinement	Stable eval set shows persistent weakness	System prompt fragments, playbooks, demonstrations	Higher held-out success at same or lower cost	Staged rollout
Heartbeat maintenance	Scheduled health checks	Scheduler definitions, lightweight scripts, alerts	Faster detection and recovery from broken routines	Automatic in narrow scope
Artifact evaluation	Any generated report, message, file, or patch	Evaluator traces, scorecards, artifact revisions	Higher externally graded artifact quality	Automatic retry, bounded attempts
Public-feedback loops	Telegram or user-side reactions, thumbs, edits, acceptance	Preference/event logs, weak labels	Better future outputs for the same audience	Aggregate only; no direct autonomous rewrite

Literature review

The implementation references most worth having open while designing OpenCLAW are the Hermes Agent docs (turn2search0), the Letta docs (turn9search0), the Claude Code memory docs (turn8search5), the A-MEM repo (turn29search0), the Agent Lightning repo (turn1search14), and the OpenTelemetry docs (turn15search0). Those sources are especially valuable because they move beyond abstract “agent” rhetoric into deployable surfaces: memory blocks, background reflection, progress files, terminal backends, schedulers, and telemetry conventions. refs: turn24view0, turn24view1, turn24view2, turn24view3, …

The foundational papers from 2023–2024 established the basic non-parametric forms of agent learning. Generative Agents showed the now-familiar pattern of storing experiences, synthesizing higher-order reflections, and retrieving them for planning. Reflexion reframed learning as verbal reflection stored in episodic memory rather than weight updates. Voyager demonstrated automatic curriculum generation plus an ever-growing executable skill library. ExpeL added the important idea of converting trajectories into natural-language insights and in-context examples. MemGPT generalized the problem into virtual context management, which is especially relevant for long-running personal agents with bounded context windows. refs: turn26view2, turn26view0, turn26view1, turn28view2, …

These early ideas were echoed, and stressed, by benchmark work. The early Auto-GPT benchmark paper argued that Auto-GPT-style agents had uncertain effectiveness and limited real-world engagement without stronger scaffolding and evaluation. AgentBench found that poor long-term reasoning, decision-making, and instruction-following were core obstacles. τ-bench showed that even strong function-calling agents can remain below 50% on realistic domains and are inconsistent across repeated trials. AppWorld raised the bar further by evaluating multi-app, code-heavy tasks and showed that strong frontier models still solved only about half of the normal split and substantially less of the challenge split in the original report. refs: turn27view13, turn28view3, turn27view11, turn18search3, …

The 2025–2026 literature sharpened the memory and context side of the field. A-MEM moved memory from flat retrieval toward dynamically linked notes inspired by Zettelkasten. Mem0 emphasized extraction, consolidation, and retrieval, with a graph-memory extension for relational reasoning. LightMem explicitly targeted the performance-efficiency tradeoff. ACE is especially important for OpenCLAW because it treats system prompts and memory as evolving playbooks rather than static strings, and explicitly names two pathologies that many agent builders have seen in practice: brevity bias and context collapse. MemRL pushed the idea further by learning utility over memory selection, not by repeatedly retraining the base model. refs: turn26view3, turn26view7, turn28view5, turn26view5, …

Production systems have converged in a similar direction. The official Claude Code docs describe two persistent channels—CLAUDE.md and auto memory—and official engineering posts recommend explicit progress files, test oracles, and Git-based checkpointing for multi-session agents. The official Letta docs expose persistent memory blocks, self-editing memory, periodic dream subagents, and explicit scheduling. The official Hermes Agent docs describe skill creation, persistence, scheduling, and multiple execution backends. And the official Claude Managed Agents post (turn36view0) added “dreaming,” where a scheduled process reviews sessions and memories, restructures them, and can optionally require review before writes land. refs: turn24view4, turn24view5, turn24view6, turn24view2, …

The remaining frontier is evaluation, search, and optimization. DSPy and AutoPDL provide practical pathways for automatically improving prompts or agent configurations against a metric. OPRO is conceptually elegant, but later work found it limited on smaller models, which is directly relevant if OpenCLAW runs on cheaper model tiers. Tree-search and debate methods remain useful, but they are better understood as test-time compute multipliers than as persistent learning by themselves unless their outputs get written back into memory or playbooks. Agent Lightning and Tool-R1 show the route toward offline or semi-offline policy improvement, but they should be treated as later-stage extensions once OpenCLAW already has clean logging, reproducible tasks, and reward signals. refs: turn27view5, turn25view9, turn27view6, turn28view10, …

Systems and methods most worth copying

System or method	What it actually contributes	Why it matters for OpenCLAW	Sources
Reflexion	Verbal post-task learning into episodic memory	Best first implementation of “learn from mistakes without fine-tuning”	refs: turn26view0
Voyager	Automatic curriculum plus reusable code skills	Template for skill extraction, retrieval, and composition	refs: turn26view1
MemGPT	Memory tiers and context virtualization	Strong design model for bounded-context personal agents	refs: turn25view2
A-MEM	Dynamic note linking and richer memory structure	Better than flat vector recall for evolving user/projects	refs: turn26view3, turn29search0
Mem0	Extraction, consolidation, retrieval, graph memory	Practical memory baseline with explicit latency/cost framing	refs: turn26view7
ACE	Evolving playbooks, anti-collapse updates	Best prompt/context self-improvement model for lightweight deployment	refs: turn26view5
Letta	Self-editing memory blocks and sleep-time subagents	Strong direct analogue for OpenCLAW’s markdown/state loops	refs: turn24view2, turn24view3, turn33search0
Hermes Agent	Skill-backed learning loop, scheduling, remote execution	Closest existing “personal agent that compounds” reference	refs: turn24view0, turn24view1, turn31search0
AutoPDL and DSPy	Metric-guided prompt or pipeline optimization	Useful once OpenCLAW has stable evals and data	refs: turn25view9, turn27view5
Agent Lightning and Tool-R1	Offline RL training for agents and tool use	Valuable late, but too heavy to be phase-one OpenCLAW	refs: turn27view0, turn27view1, turn26view6

Implementation patterns

The most robust implementation pattern is to separate OpenCLAW’s state into identity, episodic history, semantic memory, skills, evaluations, artifacts, and telemetry, then give each loop partial write permissions depending on its trust level. That pattern appears repeatedly in practice: CLAUDE.md plus auto memory in Claude Code, memory blocks and sleep-time agents in Letta, progress-file-plus-Git scaffolding in long-running coding agents, and skill-backed persistence in Hermes Agent. refs: turn24view4, turn24view6, turn24view2, turn24view3, …

A repository layout that matches those lessons can stay simple and still scale:

```text
openclaw/
  config/
    models.yaml
    safety.yaml
    schedules.yaml
    routing.yaml
  memory/
    identity.md
    user_profile.md
    preferences.json
    episodic/
      2026-05-09/
    semantic/
      memories.sqlite
    reflections/
    weekly_playbooks/
  skills/
    registry.yaml
    drafts/
    approved/
    archived/
  tools/
    adapters/
    wrappers/
    tests/
  evals/
    rubrics/
    datasets/
    holdout/
    replays/
    reports/
  artifacts/
    generated/
    accepted/
    rejected/
  telemetry/
    runs.jsonl
    spans.jsonl
    costs.jsonl
    regressions.jsonl
  sandbox/
    worktrees/
    patches/
  scripts/
    daily_consolidation.py
    weekly_distillation.py
    heartbeat_checks.py
```

For a single-user or small-team agent, start with SQLite rather than a remote vector database. The official SQLite docs describe it as small, self-contained, and high-reliability; Python’s standard sqlite3 module exposes it without a separate server process; FTS5 gives strong lexical search; and SQLite’s vec1 extension provides ANN vector search when dense retrieval becomes useful. In practice, that means one file can hold memories, metadata, preference state, evaluator outcomes, and hybrid search indexes. Move to a distributed vector DB only when you truly need multi-node scale, many concurrent users, or specialized filtering and operational tooling. refs: turn30view0, turn30view1, turn27view4, turn27view3
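
The single-file approach can be sketched with Python's standard `sqlite3` module and an FTS5 virtual table. This is a minimal illustration, not a recommended schema: the table names (`memories`, `memories_fts`), the columns, and the helper functions are assumptions made for this example, and it assumes the local SQLite build was compiled with FTS5 enabled.

```python
import sqlite3


def open_memory_db(path=":memory:"):
    """Open (or create) a single-file memory store with a lexical index."""
    conn = sqlite3.connect(path)
    conn.executescript(
        """
        CREATE TABLE IF NOT EXISTS memories (
            id INTEGER PRIMARY KEY,
            kind TEXT NOT NULL,           -- e.g. 'preference', 'episodic', 'lesson'
            body TEXT NOT NULL,
            confidence REAL DEFAULT 0.5,
            created_at TEXT DEFAULT CURRENT_TIMESTAMP
        );
        -- Full-text index; rowid is kept equal to memories.id.
        CREATE VIRTUAL TABLE IF NOT EXISTS memories_fts USING fts5(body);
        """
    )
    return conn


def add_memory(conn, kind, body, confidence=0.5):
    """Insert one memory and index its text; returns the new row id."""
    cur = conn.execute(
        "INSERT INTO memories (kind, body, confidence) VALUES (?, ?, ?)",
        (kind, body, confidence),
    )
    conn.execute(
        "INSERT INTO memories_fts (rowid, body) VALUES (?, ?)",
        (cur.lastrowid, body),
    )
    conn.commit()
    return cur.lastrowid


def search_memories(conn, query, limit=5):
    """Lexical search, best match first. `query` uses FTS5 query syntax."""
    rows = conn.execute(
        """
        SELECT m.id, m.kind, m.body
        FROM memories_fts
        JOIN memories AS m ON m.id = memories_fts.rowid
        WHERE memories_fts MATCH ?
        ORDER BY memories_fts.rank
        LIMIT ?
        """,
        (query, limit),
    )
    return rows.fetchall()
```

Because `query` is passed through as FTS5 syntax, raw user text containing quotes or operators would need escaping before it reaches `search_memories`. Dense retrieval is deliberately left out of this sketch.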

For skills, prefer versioned, testable procedural artifacts over freeform prose. Voyager’s library was executable code; Hermes attaches skills to scheduled jobs; newer skill-centric papers emphasize multi-level, reusable, and iteratively refined skill packages. In OpenCLAW terms, a skill should minimally include: scope, prerequisites, required tools, expected inputs, success tests, negative examples, rollback notes, and provenance. That makes skills searchable, auditable, and promotable from draft to approved state. refs: turn26view1, turn31search0, turn32academia15, turn32academia17, …
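
One way to make that minimum field list enforceable is a small record type plus a promotion check. The sketch below is an assumption about how OpenCLAW might represent a skill: the `Skill` class, the `state` values, and the `promote` helper are hypothetical names, not an existing API.

```python
from dataclasses import dataclass, replace

# Fields the text lists as the minimum content of a skill.
REQUIRED = (
    "scope", "prerequisites", "required_tools", "expected_inputs",
    "success_tests", "negative_examples", "rollback_notes", "provenance",
)


@dataclass(frozen=True)
class Skill:
    name: str
    version: int
    scope: str = ""
    prerequisites: tuple = ()      # use ("none",) to state "no prerequisites"
    required_tools: tuple = ()
    expected_inputs: tuple = ()
    success_tests: tuple = ()
    negative_examples: tuple = ()
    rollback_notes: str = ""
    provenance: str = ""
    state: str = "draft"           # draft -> approved -> archived
    approved_by: str = ""


def missing_fields(skill):
    """Return the required fields that are still empty."""
    return [name for name in REQUIRED if not getattr(skill, name)]


def promote(skill, approved_by):
    """Return an approved copy; refuse incomplete or unreviewed drafts."""
    gaps = missing_fields(skill)
    if gaps:
        raise ValueError(f"cannot promote {skill.name}: missing {gaps}")
    if not approved_by:
        raise PermissionError("promotion needs a named human approver")
    return replace(skill, state="approved", approved_by=approved_by)
```

Keeping the record immutable means the draft and the approved version can both be stored, which is what makes rollback and audit straightforward.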

For evaluator logs, treat grader outputs as first-class data. The reviewer-agent paper from Apple ML Research (turn12search1) is useful here because it measures both helpfulness and harmfulness: a reviewer may fix some bad outputs while also degrading some correct ones. The managed-agent “outcomes” pattern is similarly strong because it separates the grader’s context window from the generator’s. For OpenCLAW, that means every evaluation record should store the artifact hash, rubric version, judge model, score vector, retry count, and whether the reviewer changed a previously successful output for the worse. refs: turn24view8, turn36view0
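
The evaluation record described above maps naturally onto one JSON line per graded artifact. The following is a minimal sketch under stated assumptions: the `EvalRecord` type, its field names, and the pass/fail flags used to derive `reviewer_degraded` are illustrative choices, not a format taken from any cited system.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass
class EvalRecord:
    artifact_sha256: str
    rubric_version: str
    judge_model: str
    scores: dict             # per-dimension score vector
    retry_count: int
    reviewer_degraded: bool  # True if review turned a passing output into a failing one


def make_record(artifact, rubric_version, judge_model, scores,
                retry_count, passed_before_review, passed_after_review):
    """Build one evaluation record for a text artifact."""
    digest = hashlib.sha256(artifact.encode("utf-8")).hexdigest()
    return EvalRecord(
        artifact_sha256=digest,
        rubric_version=rubric_version,
        judge_model=judge_model,
        scores=dict(scores),
        retry_count=retry_count,
        reviewer_degraded=passed_before_review and not passed_after_review,
    )


def to_jsonl(record):
    """Serialize a record as a single JSON line with stable key order."""
    return json.dumps(asdict(record), sort_keys=True)
```

Hashing the artifact rather than storing it keeps the log small while still letting a later audit confirm which exact output a score refers to.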

For rollback and sandboxing, follow the engineering patterns that have already proven useful in long-running coding agents. The official scientific-computing post recommends a progress file, failed-approach tracking, a test oracle, and frequent commits. Hermes exposes multiple execution backends—local, Docker, SSH, Modal, Daytona, Vercel sandbox, Singularity—while the managed-agent API adds sandboxed code execution for Python. The safest OpenCLAW policy is therefore: memory writes can be automatic, skill promotion can be semi-automatic, but code or tool modifications should happen only in a dedicated worktree or sandbox branch with tests and explicit approval. refs: turn24view6, turn24view1, turn35view0

For scheduling and heartbeats, keep two modes. One is a full agent schedule for reflective or synthesis tasks such as daily consolidation or weekly distillation. The other is a script-only heartbeat for narrow checks like “is the Telegram gateway alive?” or “did new logs appear?” Hermes’s cron docs explicitly describe both full agent jobs and LLM-free script-backed runs, and Letta’s docs expose both cron scheduling and sleep-time update frequency. That is a strong design pattern for OpenCLAW because it prevents expensive thinking when a cheap deterministic script is enough. refs: turn31search0, turn31search8, turn33search0, turn33search10
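
The script-only mode can be sketched as a handful of deterministic checks with a single escalation hook. Everything here is illustrative: `log_is_fresh`, `disk_has_room`, and `run_heartbeat` are hypothetical helpers, and `escalate` stands in for whatever would wake the full agent or send an alert.

```python
import os
import shutil
import time


def log_is_fresh(path, max_age_s, now=None):
    """Check that a log file was modified within the last max_age_s seconds."""
    now = time.time() if now is None else now
    try:
        age = now - os.path.getmtime(path)
    except OSError:
        return False, f"{path} missing or unreadable"
    return age <= max_age_s, f"age={age:.0f}s"


def disk_has_room(path="/", min_free_fraction=0.1):
    """Check that at least min_free_fraction of the disk is free."""
    usage = shutil.disk_usage(path)
    free = usage.free / usage.total
    return free >= min_free_fraction, f"free={free:.0%}"


def run_heartbeat(checks, escalate):
    """Run cheap checks; call escalate(failures) only if something failed.

    checks: dict mapping a check name to a zero-argument callable
            that returns (ok, detail).
    """
    failures = {}
    for name, check in checks.items():
        ok, detail = check()
        if not ok:
            failures[name] = detail
    if failures:
        escalate(failures)   # the only place a model call would be triggered
    return failures
```

A cron entry would call `run_heartbeat` with something like `{"logs": lambda: log_is_fresh("telemetry/runs.jsonl", 3600)}`. No model is invoked on the healthy path, which is the point of the two-mode design.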

For routing, do not let one model do everything. The best default is a cheap executor for routine tasks, a stronger evaluator for artifact grading or diagnosis, and a separate retriever or reranker for retrieval. This is consistent with the separate-grader idea in outcomes, the reviewer-architecture tradeoff paper, and the broader context-engineering recommendation to keep context just-in-time rather than stuffing everything into one monolithic prompt. OpenCLAW should also log route decisions as telemetry, because without route-level traces you cannot tell whether a “learning improvement” is real or is just a silent model change. refs: turn36view0, turn24view8, turn34search11, turn24view7
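
Route-level logging needs very little machinery. The sketch below assumes a three-role split (execute, evaluate, retrieve). The model identifiers are placeholders, and `route` and `model_changes` are hypothetical helpers written for this example.

```python
import time

# Placeholder model identifiers; substitute the tiers actually in use.
ROUTES = {
    "execute": "cheap-executor-v1",
    "evaluate": "strong-evaluator-v1",
    "retrieve": "reranker-v1",
}


def route(role, task_id, log, routes=ROUTES):
    """Pick the model for a role and record the decision as telemetry."""
    model = routes[role]   # KeyError on an unknown role is deliberate
    log.append({"ts": time.time(), "task_id": task_id, "role": role, "model": model})
    return model


def model_changes(log):
    """List (role, old_model, new_model) wherever a role's model changed."""
    last, changes = {}, []
    for entry in log:
        role, model = entry["role"], entry["model"]
        if role in last and last[role] != model:
            changes.append((role, last[role], model))
        last[role] = model
    return changes
```

Running `model_changes` over the route log before crediting a loop with an improvement is a cheap way to rule out the silent-model-change explanation.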

OpenCLAW-specific architecture

OpenCLAW’s architecture should distinguish between online loops that run during a task and background loops that run between tasks. Online loops are allowed to read broadly and write narrowly: transient scratchpads, draft reflections, evaluator traces, and candidate preferences. Background loops are allowed to consolidate, prune, and promote. The reason for that separation is visible across multiple systems: Letta’s sleep-time agent works asynchronously over memory blocks, Claude’s dreaming runs as a scheduled process over sessions and memory stores, and Anthropic’s long-running harnesses treat the progress file as the bridge across otherwise fresh sessions. refs: turn33search0, turn36view0, turn24view6

A robust OpenCLAW architecture should therefore have four persistent stores. First, identity and preference memory, mostly human-readable markdown or structured JSON/YAML. Second, episodic and semantic memory, ideally in SQLite with FTS5 and optional ANN support. Third, skills and playbooks, versioned and testable. Fourth, telemetry and evaluator records, because no self-improvement loop is credible if its own traces are not inspectable. refs: turn24view2, turn24view4, turn30view0, turn27view4, …

```mermaid
flowchart LR
    U[User / Telegram / Files] --> R[Runtime Task Loop]
    S[Scheduler / Heartbeats] --> R
    R --> T[Tools and Code Execution]
    R --> A[Artifact Generator]
    R --> E[Evaluator / Critic]
    T --> L[Run Logs and Traces]
    A --> E
    E --> D[Failure Diagnosis]
    E --> P[Artifact Scores]
    L --> F[Post-task Reflection]
    D --> F
    F --> M[Memory Store]
    F --> K[Skill Drafts]
    M --> C[Daily Consolidation]
    K --> W[Weekly Distillation and Promotion]
    C --> M
    W --> K
    P --> RR[Routing and Threshold Updates]
    RR --> R
    H[Human Gates] --> W
    H --> G[Sandboxed Code or Prompt Promotion]
    G --> R
```

The permission model is more important than the loop count. OpenCLAW should be free to write draft reflections, episodic summaries, and evaluator traces automatically. It should be allowed to update obvious, low-risk user preferences when there is explicit evidence. It should not autonomously promote a new skill to “approved,” rewrite system prompts in production, or patch tool code outside a sandbox without tests and a gate. That asymmetry is what keeps “learning” from turning into silent drift. It also matches the distinction the official managed-agent dreaming system now offers between automatic memory updates and review-before-apply. refs: turn36view0, turn24view6
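
The asymmetry in this paragraph can be written down as a default-deny policy table. The sketch is an interpretation, not a specification: the substrate names, the `Gate` levels, and `may_write` are assumptions. In particular, requiring human approval for production prompts is one conservative reading of “should not autonomously rewrite.”

```python
from enum import Enum


class Gate(Enum):
    AUTO = "automatic write"
    EVIDENCE = "automatic only with explicit evidence"
    HUMAN = "human approval"
    SANDBOX = "sandbox + passing tests + human approval"


# Hypothetical substrate names mapped to the gate each write must pass.
POLICY = {
    "draft_reflection": Gate.AUTO,
    "episodic_summary": Gate.AUTO,
    "evaluator_trace": Gate.AUTO,
    "user_preference": Gate.EVIDENCE,
    "skill_promotion": Gate.HUMAN,
    "production_prompt": Gate.HUMAN,
    "tool_code": Gate.SANDBOX,
}


def may_write(substrate, *, evidence=False, approved=False,
              sandboxed=False, tests_passed=False):
    """Return True only if the write satisfies the gate for its substrate."""
    gate = POLICY.get(substrate)
    if gate is None:
        return False                      # unknown substrates are denied
    if gate is Gate.AUTO:
        return True
    if gate is Gate.EVIDENCE:
        return evidence
    if gate is Gate.HUMAN:
        return approved
    return sandboxed and tests_passed and approved
```

Routing every loop's writes through one function like this also gives a single place to log denied attempts, which is useful evidence when tuning the gates.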

From an architectural standpoint, the best OpenCLAW variant is stateful, not merely context-heavy. The Letta docs explicitly describe stateful agents as those that accumulate learned behaviors and memories over time, while the Claude Code docs emphasize that each session starts fresh unless persistent files or memories bridge sessions. For OpenCLAW, this means that “long context” alone should not be treated as a substitute for memory. Long context is a buffer; recursive learning needs explicit state transitions. refs: turn33search6, turn24view4

Experiment catalog

The experiments below are designed to answer one practical question: does OpenCLAW actually improve future task success, or does it merely rewrite its own notes? Across all experiments, use the same discipline. Keep a held-out task set. Measure repeated-trial reliability, not just one-off wins. Prefer environment-grounded or state-based success whenever possible. Penalize stale memory usage. And if you add a reviewer or grader, track both net benefit and reviewer-induced damage. Those rules are directly supported by τ-bench’s pass^k, AppWorld’s state-based evaluation, Memora’s FAMA metric, MemoryAgentBench’s focus on selective forgetting and test-time learning, and the helpfulness-harmfulness framing in reviewer-agent work. refs: turn27view11, turn26view8, turn26view9, turn25view12, …
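
Repeated-trial reliability can be computed from nothing more than per-task success counts. The sketch below assumes the combinatorial estimator commonly associated with pass^k: for a task with c successes in n trials, the chance that k trials drawn without replacement all succeed is C(c, k) / C(n, k), averaged over tasks. The function names are illustrative.

```python
from math import comb


def pass_hat_k(successes, trials, k):
    """Estimate the probability that k trials of one task all succeed."""
    if not (1 <= k <= trials) or not (0 <= successes <= trials):
        raise ValueError("need 1 <= k <= trials and 0 <= successes <= trials")
    # comb(successes, k) is 0 when successes < k, so the estimate is 0.0 there.
    return comb(successes, k) / comb(trials, k)


def mean_pass_hat_k(per_task, k):
    """Average pass^k over tasks; per_task is a list of (successes, trials)."""
    scores = [pass_hat_k(c, n, k) for c, n in per_task]
    return sum(scores) / len(scores)
```

With k = 1 this reduces to the ordinary success rate. Larger k penalizes tasks that succeed only some of the time, which is the inconsistency the held-out discipline is meant to expose.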

For cost, use these deployment-side categories: low means a small number of extra model calls or local retrieval; medium means at least one additional reviewer or synthesis pass; high means repeated rollouts, search, or training infrastructure.

Post-task reflection

Objective. Test whether short, structured reflections reduce repeat mistakes on similar future tasks. Setup. Randomly assign tasks above a complexity threshold to control and reflection groups. Files and code. Write traces to telemetry/runs.jsonl; write reflections to memory/reflections/; retrieve reflections only on future tasks with similar tool/task signatures. Prompt template. “Given the goal, tool trace, result, and rubric, produce one lesson in the format trigger -> mistake or success -> action for next time -> evidence.” Success and failure criteria. Success is a measurable drop in repeat-error rate on delayed holdout tasks of the same family; failure is an increase in notes without corresponding improvement, or improved self-judged scores without external success lift. Safety, cost, logging. Auto-write drafts only; never let a reflection directly rewrite global rules. Cost is low. Log the reflection ID, retrieval events, and next-task outcome. refs: turn26view0, turn28view2, turn24view6
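
The four-part lesson format in the prompt template is easy to validate mechanically before anything is written to `memory/reflections/`. This is a minimal sketch: `Lesson`, `parse_lesson`, and `repeat_error_rate` are hypothetical helpers, and the parser assumes the model emits exactly four `->`-separated parts on one line.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Lesson:
    trigger: str
    outcome: str    # the mistake or the success
    action: str     # what to do next time
    evidence: str

    def render(self):
        """Render back into the one-line 'a -> b -> c -> d' format."""
        return " -> ".join((self.trigger, self.outcome, self.action, self.evidence))


def parse_lesson(line):
    """Parse one lesson line; reject anything that is not exactly four parts."""
    parts = [p.strip() for p in line.split("->")]
    if len(parts) != 4 or not all(parts):
        raise ValueError(f"expected 4 non-empty parts, got {len(parts)}")
    return Lesson(*parts)


def repeat_error_rate(repeated_flags):
    """Fraction of holdout tasks that repeated a previously seen error."""
    flags = list(repeated_flags)
    return sum(flags) / len(flags) if flags else 0.0
```

Comparing `repeat_error_rate` between the control and reflection groups on the delayed holdout set gives the success signal the experiment calls for. Rejected lines are worth logging too, since a high rejection rate suggests the prompt needs work.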

Failed-task diagnosis

Objective. Separate “task failed” from “task failed for a known recurring reason,” then test whether diagnosis-specific repairs outperform generic reflection. Setup. Build a small taxonomy—planning error, tool misuse, wrong input assumptions, stale memory, formatting failure, evaluator disagreement—and force each failure into one class. Files and code. memory/failures/*.md, evals/replays/, tools/tests/. Prompt template. “Classify the failure into one root cause, cite proof from the trace, suggest one bounded intervention, and say what evidence would falsify the diagnosis.” Success and failure criteria. Success is reduced recurrence of the same class; failure is diagnosis churn or repeated misclassification. Safety, cost, logging. Do not permit autonomous code changes from diagnosis alone. Cost is low to medium. Log class distributions and recurrence intervals. refs: turn17search0, turn24view8, turn24view6
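
The forced single-class taxonomy and the recurrence-interval log can be sketched as follows. The six class names come from the setup above. The enum values, `classify`, and `recurrence_intervals` are illustrative names, and timestamps are assumed to be plain seconds.

```python
from collections import defaultdict
from enum import Enum


class FailureClass(Enum):
    PLANNING_ERROR = "planning_error"
    TOOL_MISUSE = "tool_misuse"
    WRONG_INPUT_ASSUMPTIONS = "wrong_input_assumptions"
    STALE_MEMORY = "stale_memory"
    FORMATTING_FAILURE = "formatting_failure"
    EVALUATOR_DISAGREEMENT = "evaluator_disagreement"


def classify(label):
    """Force a diagnosis label into exactly one class; unknown labels raise ValueError."""
    return FailureClass(label.strip().lower())


def recurrence_intervals(events):
    """Gaps between consecutive failures of the same class.

    events: iterable of (timestamp_seconds, FailureClass), in any order.
    """
    by_class = defaultdict(list)
    for ts, cls in sorted(events, key=lambda e: e[0]):
        by_class[cls].append(ts)
    return {
        cls: [later - earlier for earlier, later in zip(times, times[1:])]
        for cls, times in by_class.items()
    }
```

Lengthening intervals for a class after a targeted repair are the reduced-recurrence signal. A class whose labels keep flipping between runs points to diagnosis churn instead.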

Daily consolidation

Objective. Determine whether daily background consolidation improves retrieval precision and lowers context load the next day. Setup. Run a nightly job that merges duplicate notes, flags invalidated facts, and extracts a one-page daily summary. Compare against a no-consolidation control period. Files and code. memory/episodic/, memory/semantic/memories.sqlite, memory/daily/summary.md. Prompt template. “Review today’s episodes. Keep durable preferences and procedures; mark volatile facts as tentative; merge duplicates; mark contradictions explicitly.” Success and failure criteria. Success is better next-day retrieval relevance and lower tokens read per task at equal or better outcome quality. Failure is stale-memory amplification or overcompression. Safety, cost, logging. Automatic with audit trail. Cost is low. Log memory additions, merges, invalidations, and retrieval-hit quality next day. refs: turn24view3, turn33search0, turn26view7, turn26view9

Weekly distillation

Objective. Test whether weekly playbook distillation creates more reusable competence than accumulating raw reflections forever. Setup. Every week, cluster reflections by domain and produce a compact playbook with “always,” “if-then,” and “never” rules plus examples. Files and code. memory/weekly_playbooks/, skills/drafts/, skills/approved/, telemetry/regressions.jsonl. Prompt template. “Distill these reflections into a stable playbook. Preserve constraints, collapse duplicates, keep edge cases, and point out any rules that conflict or need human confirmation.” Success and failure criteria. Success is equal or better performance with shorter context injection and less retrieval noise; failure is context collapse or loss of useful edge cases. Safety, cost, logging. Promote only after human review or staged canarying. Cost is medium. Log the before/after playbook size and held-out eval deltas. refs: turn26view5, turn24view6, turn24view4

User-preference learning

Objective. Learn stable user preferences without over-personalizing or clinging to outdated ones. Setup. Record only explicit corrections, repeated stable choices, and high-confidence preferences; store them separately from volatile context. Files and code. memory/user_profile.md, memory/preferences.json, memory tables with timestamps and confidence. Prompt template. “Should this be stored as a durable user preference? Answer only if there is explicit evidence, likely future utility, and low privacy risk. Otherwise abstain.” Success and failure criteria. Success is improved preference-consistency and lower need for repeated corrections; failure is intrusive personalization, obsolete preference reuse, or storing sensitive data unnecessarily. Safety, cost, logging. Require evidence links and confidence thresholds. Cost is low. Log every preference mutation and let the user inspect or delete it. refs: turn24view2, turn24view3, turn24view4, turn26view9, …
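
The abstain-by-default rule can be enforced in code rather than left to the prompt. In the sketch below, `propose_preference` and `forget_preference` are hypothetical helpers, the 0.8 threshold is an arbitrary placeholder, and `store` and `audit` are a plain dict and list standing in for `memory/preferences.json` and the mutation log.

```python
import time


def propose_preference(store, audit, key, value, evidence, confidence,
                       *, sensitive=False, min_confidence=0.8):
    """Write a durable preference only with evidence, confidence, and low privacy risk."""
    if not evidence or sensitive or confidence < min_confidence:
        audit.append({"op": "abstain", "key": key, "ts": time.time()})
        return False
    store[key] = {
        "value": value,
        "evidence": list(evidence),   # links back to the supporting episodes
        "confidence": confidence,
        "updated_at": time.time(),
    }
    audit.append({"op": "set", "key": key, "value": value, "ts": time.time()})
    return True


def forget_preference(store, audit, key):
    """User-initiated delete; the deletion itself is also logged."""
    removed = store.pop(key, None) is not None
    audit.append({"op": "delete", "key": key, "removed": removed, "ts": time.time()})
    return removed
```

Logging abstentions alongside writes makes over-cautious thresholds visible: many abstentions on preferences the user later restates suggest the bar is set too high.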

Code and tool improvement

Objective. Let OpenCLAW repair brittle wrappers, prompts, parsers, or scripts in a sandbox, then measure whether the fix generalizes. Setup. Trigger only on repeated reproducible failures with existing regression cases. Create a worktree or sandbox branch, propose a patch, run tests, then request approval. Files and code. sandbox/worktrees/, tools/adapters/, tools/tests/, memory/failures/. Prompt template. “Given this failing trace and regression test, produce the smallest patch that fixes the bug, avoids widening permissions, and adds or updates one regression test.” Success and failure criteria. Success is more passing regressions and fewer retries on similar tasks; failure is test-only overfitting or new breakage elsewhere. Safety, cost, logging. Mandatory sandbox, tests, and approval. Cost is medium. Log patch diffs, test matrix, and rollback pointer. refs: turn24view6, turn24view1, turn35view0

Prompt refinement

Objective. Improve agent prompts and context playbooks using held-out evals instead of anecdotal editing. Setup. Maintain a stable eval set by task family and optimize prompt fragments or retrieval orderings only against validation, never against production alone. Files and code. config/routing.yaml, evals/datasets/, evals/holdout/, memory/weekly_playbooks/. Prompt template. Use an optimizer prompt that proposes small diffs, not wholesale rewrites: “Suggest at most three prompt changes, explain their hypothesis, and estimate the failure modes they target.” Success and failure criteria. Success is held-out lift at fixed or reduced token cost; failure is gains only on the tuning set, larger prompts with no net yield, or collapse on smaller models. Safety, cost, logging. Staged rollout only. Cost is medium. Log prompt version, validation lift, cost delta, and post-deploy decay. refs: turn25view9, turn27view5, turn27view6, turn28view10

Heartbeat maintenance

Objective. Test whether scheduled, mostly deterministic health checks reduce downtime and preserve agent quality across long-running deployments. Setup. Create lightweight scheduled checks for scheduler health, messaging connectivity, disk usage, token-spend anomalies, and failed cron jobs. Files and code. scripts/heartbeat_checks.py, config/schedules.yaml, telemetry/costs.jsonl, telemetry/runs.jsonl. Prompt template. Prefer script-only checks. Use the agent only when the script detects an anomaly and escalation is needed. Success and failure criteria. Success is faster recovery and lower incidence of silent failure; failure is noisy alerts or expensive checks displacing useful work. Safety, cost, logging. Allow automatic low-risk remediation only for reversible actions. Cost is very low if script-first. Log alert precision and mean time to recovery. refs: turn31search0, turn31search7, turn31search8

Artifact evaluation

Objective. Improve output quality for reports, messages, summaries, patches, or generated documents by using a separate grader loop. Setup. Build domain-specific rubrics for at least two artifact classes and compare one-pass generation against bounded retry with a separate evaluator. Files and code. evals/rubrics/, artifacts/generated/, artifacts/accepted/, artifacts/rejected/. Prompt template. “Score this artifact against the rubric only. Do not rewrite until the scoring step is complete. If the artifact fails, produce targeted revision instructions, not a full replacement.” Success and failure criteria. Success is higher external-grader scores or higher user acceptance at bounded extra cost; failure is reviewer-induced degradation of previously good artifacts. Safety, cost, logging. Cap the number of retries and apply helpfulness-harmfulness accounting. Cost is medium. Log generator version, grader version, per-dimension scores, and whether the retry beat the original. refs: turn36view0, turn24view8
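
The bounded-retry comparison can be sketched as a loop that keeps the best-scoring artifact seen so far, so a poor revision can never replace a better original. This is an illustrative sketch: `generate`, `grade`, and `revise` are hypothetical callables standing in for the generator and the separate grader, and `max_retries` and `pass_score` are assumed parameters.

```python
def generate_with_review(generate, grade, revise, max_retries=2, pass_score=0.8):
    """Bounded generate-grade-revise loop that never returns a worse artifact.

    generate() -> artifact
    grade(artifact) -> (score in [0, 1], revision_instructions)
    revise(artifact, instructions) -> artifact
    """
    best = generate()
    best_score, instructions = grade(best)
    original_score = best_score
    retries = 0
    while best_score < pass_score and retries < max_retries:
        candidate = revise(best, instructions)
        cand_score, cand_instructions = grade(candidate)
        retries += 1
        # Keep a revision only if the grader scores it strictly higher.
        if cand_score > best_score:
            best, best_score, instructions = candidate, cand_score, cand_instructions
    return {
        "artifact": best,
        "score": best_score,
        "retries": retries,
        "retry_beat_original": best_score > original_score,
    }
```

The `retry_beat_original` flag corresponds to the logging requirement above; aggregated over many runs, it shows whether the retry loop earns its extra cost.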

Public-feedback loops

Objective. Learn from user acceptance signals—edits, approvals, reactions, reuse—without mistaking popularity for correctness. Setup. Collect weak labels from messaging channels or artifact usage, but keep them separate from hard success metrics. Files and code. telemetry/public_feedback.jsonl, memory/preferences.json, evals/reports/feedback_correlation.md. Prompt template. No direct self-edit prompt. Instead, periodically translate aggregated feedback into hypotheses: “What stable preference or presentation pattern is supported by repeated user actions?” Success and failure criteria. Success is improved user acceptance on comparable future outputs; failure is optimizing style at the expense of correctness. Safety, cost, logging. Aggregate-only; never let one reaction rewrite global policy. Cost is low. Log the correlation between feedback labels and objective task success. refs: turn11search0, turn28view6
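
The logging requirement (correlation between feedback labels and objective task success) can be computed with a plain Pearson coefficient over binary labels. The sketch below assumes each record has hypothetical boolean fields `approved` (the weak label) and `task_success` (the hard metric); neither name comes from the source.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation; returns None when either series is constant."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length series with at least 2 points")
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return None  # correlation is undefined without variation
    return cov / sqrt(vx * vy)


def feedback_correlation(records):
    """Correlate weak approval labels with hard task success."""
    approved = [1.0 if r["approved"] else 0.0 for r in records]
    success = [1.0 if r["task_success"] else 0.0 for r in records]
    return pearson(approved, success)
```

A value near zero or below suggests that approvals are tracking presentation rather than correctness, which is the failure mode this loop is meant to catch.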

Tool-call reviewer

Objective. Insert a reviewer before risky tool calls and measure whether the reviewer produces net-positive value. Setup. Apply only to high-risk tools: filesystem writes, external messages, edits to memory, code execution, or calendar-like actions. Files and code. evals/rubrics/tool_safety.yaml, telemetry/reviewer_comparison.jsonl. Prompt template. “Given the provisional tool call, judge whether the selected tool, arguments, and scope are correct. If not, return the minimal correction.” Success and failure criteria. Success is a positive net value using helpfulness-harmfulness accounting; failure is excessive blocking or reviewer-induced corruption of correct calls. Safety, cost, logging. Only for risky tools. Cost is medium. Log blocked calls, corrected calls, false blocks, and degraded-correct events. refs: turn24view8
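
One plausible way to operationalize helpfulness-harmfulness accounting is shown below. It is an assumed formulation for illustration, not necessarily the exact metric used in the cited reviewer-agent work: helpfulness is the share of initially wrong calls the reviewer fixed, and harmfulness is the share of initially correct calls it degraded. The event fields `base_correct` and `final_correct` are hypothetical.

```python
def reviewer_net_value(events):
    """Summarize reviewer impact from per-call before/after correctness.

    events: dicts with booleans 'base_correct' (before review) and
    'final_correct' (after review).
    """
    fixed = sum(1 for e in events if not e["base_correct"] and e["final_correct"])
    degraded = sum(1 for e in events if e["base_correct"] and not e["final_correct"])
    base_wrong = sum(1 for e in events if not e["base_correct"])
    base_right = len(events) - base_wrong
    return {
        "helpfulness": fixed / base_wrong if base_wrong else 0.0,
        "harmfulness": degraded / base_right if base_right else 0.0,
        "net_fixed_minus_degraded": fixed - degraded,
    }
```

Because correct calls usually far outnumber wrong ones, even a low harmfulness rate can outweigh a high helpfulness rate in absolute terms, so the absolute net count is worth reporting alongside the two rates.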

Curriculum generation

Objective. Test whether synthetic or derived practice tasks fill capability gaps faster than waiting for real failures. Setup. Cluster past failures, generate synthetic tasks that vary one difficulty factor at a time, and validate on a held-out real-task set. Files and code. evals/datasets/curriculum/, memory/failures/, skills/drafts/. Prompt template. “Generate three minimally different tasks that stress the missing subskill identified in this failure cluster. Each task must specify a verifiable outcome.” Success and failure criteria. Success is downstream lift on real held-out tasks; failure is improvement only on synthetic exercises. Safety, cost, logging. Keep synthetic tasks in a separate training split. Cost is medium. Log task provenance and transfer rate from synthetic to real evals. refs: turn26view1, turn13search0, turn28view8

Telemetry-driven routing

Objective. Improve quality-cost tradeoffs by changing model/tool routes based on observed traces, not intuition. Setup. Instrument runs with model, cost, latency, retrieval depth, retry count, success class, and artifact rubric scores. Files and code. telemetry/spans.jsonl, telemetry/costs.jsonl, config/routing.yaml. Prompt template. No LLM prompt is required for the core loop; use the model only to summarize statistically credible route changes. Success and failure criteria. Success is higher success-per-dollar or lower latency at equal quality; failure is routing changes that optimize proxies while reducing real task success. Safety, cost, logging. Route changes should be staged and reversible. Cost is low. Log route version and route-specific win rates. refs: turn24view7, turn27view2
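
The core loop needs no model call. A minimal sketch of the aggregation follows, assuming span records with hypothetical `route`, `success`, and `usd` fields, and an assumed minimum sample size before any route change is proposed.

```python
from collections import defaultdict


def route_stats(spans):
    """Aggregate per-route run count, success rate, and success-per-dollar."""
    agg = defaultdict(lambda: {"runs": 0, "successes": 0, "usd": 0.0})
    for s in spans:
        a = agg[s["route"]]
        a["runs"] += 1
        a["successes"] += 1 if s["success"] else 0
        a["usd"] += s["usd"]
    return {
        route: {
            "runs": a["runs"],
            "success_rate": a["successes"] / a["runs"],
            # None when no cost was recorded, so the ratio is undefined.
            "success_per_dollar": a["successes"] / a["usd"] if a["usd"] > 0 else None,
        }
        for route, a in agg.items()
    }


def propose_route(stats, current, min_runs=30):
    """Suggest a route with better success-per-dollar at equal or better quality."""
    base = stats[current]
    best, best_spd = current, base["success_per_dollar"] or 0.0
    for route, s in stats.items():
        if route == current or s["runs"] < min_runs:
            continue  # too little evidence to act on
        spd = s["success_per_dollar"] or 0.0  # undefined ratios are never preferred
        if s["success_rate"] >= base["success_rate"] and spd > best_spd:
            best, best_spd = route, spd
    return best
```

The proposal is only a candidate: per the safety note above, it should still go through a staged, reversible rollout with route-specific win rates logged.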

Risks and failure modes

The most important failure mode is fake learning: the agent rewrites more state, sounds more reflective, and perhaps gets higher self-judged scores, but does not improve on future real tasks. The current benchmark literature makes this danger concrete. AppWorld and τ-bench show that realistic task completion is still hard and inconsistent; the “why they fail” analysis identifies planning, execution, and response errors as distinct failure stages; and memory benchmarks such as Memora and MemoryAgentBench show that remembering more is not the same thing as updating correctly or forgetting obsolete facts. refs: turn26view8, turn27view11, turn17search0, turn26view9, …

The second risk is circular self-evaluation. If the same model family generates, critiques, scores, and writes persistent memory, the loop can become self-confirming. The best counterexamples in the literature explicitly break that circularity. The official outcomes system uses a separate grader in a different context window; the reviewer-agent paper measures both corrected errors and degraded correct outputs; and Memora’s FAMA metric penalizes obsolete-memory reuse rather than rewarding persuasive but stale answers. OpenCLAW should copy that principle aggressively: separate judges from generators, evaluate against environment state when possible, and score stale-memory usage negatively. refs: turn36view0, turn24view8, turn26view9

The third risk is memory pollution. A-MEM, Mem0, LightMem, and Letta all exist because flat “append and embed everything” memory quickly becomes noisy, expensive, and hard to mutate correctly. More recent benchmarks sharpen the problem: PERMA shows preferences emerge gradually in noisy contexts, while Memora shows many memory agents still fail to reconcile changing facts over time. OpenCLAW therefore needs explicit invalidation, timestamping, confidence, and evidence fields for every memory object, plus regular consolidation and pruning passes. refs: turn26view3, turn26view7, turn28view5, turn24view2, …
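
Explicit invalidation can be sketched with the `valid_from`, `valid_to`, and `supersedes` fields that appear in the memory-entry schema in the appendix. The in-memory dict store and the function names below are assumptions for illustration; a real deployment would apply the same logic to SQLite or files.

```python
from datetime import datetime


def parse_ts(ts):
    """Parse an ISO-8601 timestamp, accepting a trailing 'Z' for UTC."""
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))


def supersede(store, old_id, new_entry):
    """Close out an old memory and link the replacement to it.

    The old entry is invalidated, not deleted, so the audit trail survives.
    """
    store[old_id]["valid_to"] = new_entry["valid_from"]
    new_entry.setdefault("supersedes", []).append(old_id)
    store[new_entry["id"]] = new_entry


def active(store, at):
    """Return entries whose validity window contains the timestamp `at`."""
    now = parse_ts(at)
    return [
        e for e in store.values()
        if parse_ts(e["valid_from"]) <= now
        and (e.get("valid_to") is None or now < parse_ts(e["valid_to"]))
    ]
```

Retrieval would then read only from `active(...)`, so a superseded fact cannot be injected into context even though it remains on disk for auditing.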

The fourth risk is unsafe self-modification. Code execution, file mutation, remote tools, and messaging can all magnify a wrong action. The safest production pattern in the sources is not “never self-modify,” but “self-modify only inside scaffolds”: progress files, test oracles, commits, worktrees, bounded tools, reviewer loops, and explicit approvals before promotion to production. That is why code/tool improvement should come after telemetry, evals, and rollback exist—not before. refs: turn24view6, turn24view1, turn35view0, turn24view8

The fifth risk is prompt collapse and over-optimization. ACE names context collapse directly; the reported limitations of OPRO with smaller models show that automated prompt optimization can fail when the optimizer model is weak; and prompt optimization more broadly tends to find narrow improvements unless it is grounded in robust validation. OpenCLAW should therefore restrict prompt refinement to small diffs, validation-only tuning, canary deployment, and automatic rollback when real-task metrics drop. refs: turn26view5, turn28view10, turn25view9

A practical anti-fraud checklist follows from those sources. An improvement claim should count only if it survives a held-out split, maintains or improves external success, preserves or improves success-per-dollar, and holds up over repeated trials. Reviewer loops should be scored for both helpfulness and harmfulness. Memory loops should use forgetting-aware metrics. Prompt or code modifications should always be versioned and reversible. Finally, user-facing preference learning should distinguish durable preferences from transient circumstances. refs: turn24view8, turn26view9, turn27view11, turn26view8, …
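
The first part of that checklist can be expressed as a mechanical gate. The sketch below is illustrative only: the per-trial fields `heldout_success`, `external_success`, and `usd`, the `min_trials` default, and the choice to compare simple means are all assumptions, and a real gate would likely add a significance test.

```python
from statistics import mean


def improvement_counts(baseline, candidate, min_trials=3):
    """Decide whether a candidate change counts as a real improvement.

    baseline / candidate: lists of per-trial dicts with 'heldout_success'
    and 'external_success' in [0, 1] and 'usd' (cost of that trial).
    """
    if len(baseline) < min_trials or len(candidate) < min_trials:
        return {"counts": False, "reason": "too few trials"}

    def summarize(trials):
        usd = sum(t["usd"] for t in trials)
        external = [t["external_success"] for t in trials]
        return {
            "heldout": mean(t["heldout_success"] for t in trials),
            "external": mean(external),
            "per_dollar": sum(external) / usd if usd > 0 else 0.0,
        }

    b, c = summarize(baseline), summarize(candidate)
    checks = {
        "heldout_lift": c["heldout"] > b["heldout"],
        "external_not_worse": c["external"] >= b["external"],
        "per_dollar_not_worse": c["per_dollar"] >= b["per_dollar"],
    }
    return {"counts": all(checks.values()), **checks}
```

Returning the individual checks rather than a bare boolean makes a rejected claim explainable in the eval report.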

Most realistic OpenCLAW implementation path

The most realistic path is staged, not heroic. Because no language, hosting stack, or compute budget was specified, the safest assumption is a Python-friendly, storage-light, API-model-first stack that can be deployed on a Chromebook with remote APIs, a small VPS, or lightweight cloud. The seven phases below are ordered so that each phase produces the evidence required for the next one. That ordering is consistent with the gap between lightweight memory/context architectures and heavier RL-oriented frameworks in the literature. refs: turn30view1, turn30view0, turn24view1, turn27view0, …

Phase 1 — Instrumentation and gold tasks. Build telemetry, evaluator logs, artifact hashes, and a compact but meaningful eval suite first. Add routing logs, run IDs, prompt versions, and environment-state checks. The exit criterion is simple: you can replay a task and say why it passed or failed. Without this, every later “learning” loop is theater. refs: turn24view7, turn26view8, turn27view11, turn25view12
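
A replayable run record of the kind Phase 1 calls for can be sketched with the standard library alone. The field names and the JSONL layout below are assumptions for illustration; only the `sha256:` prefix mirrors the `artifact_hash` format used in the evaluator log in the appendix.

```python
import hashlib
import json
from datetime import datetime, timezone


def artifact_hash(data: bytes) -> str:
    """Content hash that lets a later replay confirm it saw the same artifact."""
    return "sha256:" + hashlib.sha256(data).hexdigest()


def make_run_record(run_id, task_family, prompt_version, route, artifact, passed, reason):
    """Build one telemetry record; all field names are illustrative."""
    return {
        "run_id": run_id,
        "ts": datetime.now(timezone.utc).isoformat(),
        "task_family": task_family,
        "prompt_version": prompt_version,
        "route": route,
        "artifact_hash": artifact_hash(artifact),
        "passed": passed,
        "reason": reason,  # the "why" that makes a replay explainable
    }


def append_jsonl(path, record):
    """Append one record per line so the log stays greppable and append-only."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record, sort_keys=True) + "\n")
```

Recording the prompt version and route next to the pass/fail reason is what later allows a regression to be attributed to a specific change rather than guessed at.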

Phase 2 — Explicit memory and reflection. Add markdown identity/profile files, SQLite-backed episodic and semantic memory, and low-risk post-task reflections. Keep this layer inspectable and prunable. The exit criterion is a measurable reduction in repeat failures or repeated user corrections, not just more stored data. refs: turn24view2, turn24view3, turn30view0, turn27view4, …

Phase 3 — Daily consolidation and weekly distillation. Once basic reflection works, add scheduled background loops that merge duplicates, mark contradictions, and promote recurring procedures into playbooks or draft skills. The exit criterion is better retrieval precision and shorter injected context with equal or better outcomes. refs: turn33search0, turn31search0, turn26view5, turn36view0

Phase 4 — Skill extraction and routine automation. Convert repeated successful workflows into versioned skills and attach them to schedules or reusable task families. This is where OpenCLAW starts compounding. The exit criterion is transfer: a skill learned in one context measurably improves another similar context. refs: turn26view1, turn31search0, turn32academia17

Phase 5 — Separate evaluators and artifact grading. Add a reviewer or grader for risky tool calls and high-value artifacts, and score it with helpfulness-harmfulness accounting. The exit criterion is net-positive value from the reviewer, not just more retries. refs: turn24view8, turn36view0

Phase 6 — Prompt, routing, and sandboxed tool improvement. Only now should OpenCLAW start editing prompt playbooks, route policies, or tool wrappers based on stable evals and regression tests. The exit criterion is held-out lift plus safe rollback. refs: turn25view9, turn27view5, turn24view6, turn24view1

Phase 7 — Optional curriculum generation and offline RL. If OpenCLAW eventually has stable domains, good metrics, large trace volumes, and real need for deeper policy optimization, then curriculum generation, offline preference learning, or RL-style agent training become reasonable. Until then, they are likely premature. The exit criterion here is organizational, not just technical: enough data, enough repeatability, and enough value to justify the infrastructure. refs: turn13search0, turn28view8, turn27view0, turn26view6

Appendix

The appendix below turns the report into implementable artifacts. The schemas and prompts are not copied from any one system, but they are directly shaped by the memory-block, progress-file, dream-reflection, skill, and telemetry patterns documented in Letta, Hermes, Claude Code, Managed Agents, and OpenTelemetry. refs: turn24view2, turn24view4, turn24view6, turn36view0, …

Schemas

# skills/registry.yaml
version: 1
skills:
  - id: skill_write_weekly_summary_v3
    status: approved          # draft | approved | archived
    domain: reporting
    triggers:
      - repeated_task_pattern: weekly_summary
      - success_count_gte: 5
    description: >
      Produce a weekly summary from project notes and telemetry.
    required_tools:
      - file_search
      - code_execution
      - messaging
    required_inputs:
      - notes_path
      - date_range
    output_contract:
      artifact_type: markdown_report
      must_include:
        - completed_items
        - blockers
        - metrics
    tests:
      - eval_id: report_rubric_v2
      - replay_id: weekly_summary_holdout_a
    safety:
      mutates_code: false
      sends_external_messages: false
      needs_human_approval_for_promotion: true
    provenance:
      derived_from_runs:
        - run_2026_05_01_001
        - run_2026_05_03_004
      author_loop: weekly_distillation
    last_known_good:
      prompt_version: prompt_reporting_v7
      route_policy: route_v4

{
  "memory_entry": {
    "id": "mem_2026_05_09_001",
    "kind": "preference",
    "scope": "user",
    "title": "User prefers concise summaries with clear action items",
    "value": "When writing summaries, keep them short and action-oriented.",
    "evidence": [
      {
        "source_run_id": "run_2026_05_07_019",
        "quote_or_event": "User asked to shorten a long summary and foreground next actions"
      }
    ],
    "confidence": 0.87,
    "durability": "high",
    "valid_from": "2026-05-07T22:14:00Z",
    "valid_to": null,
    "supersedes": [],
    "tags": ["writing", "style", "preference"],
    "privacy_level": "normal",
    "write_gate": "auto_with_audit"
  }
}

# evals/reports/evaluator_log.yaml
run_id: run_2026_05_09_022
artifact_hash: sha256:...
artifact_type: markdown_report
generator:
  model: executor_fast_v2
  prompt_version: reporting_v7
reviewer:
  model: critic_strong_v3
  rubric: report_quality_v4
scores:
  correctness: 0.86
  completeness: 0.81
  style_fit: 0.92
  safety: 0.99
decision:
  accepted: false
  retry_allowed: true
  retry_reason: "Missing two blocker items"
reviewer_harmfulness_check:
  base_artifact_was_correct: false
  reviewer_changed_correct_output_to_incorrect: false
memory_writes:
  created:
    - mem_2026_05_09_011
  blocked: []
cost:
  input_tokens: 10234
  output_tokens: 981
  usd_estimate: 0.19

Prompt templates

Reflection prompt

You are writing a single reusable lesson for a future agent run.

Inputs:
- task goal
- tool trace
- final outcome
- explicit user feedback
- evaluator result

Return exactly:
1. Trigger condition
2. What failed or unexpectedly worked
3. One action rule for next time
4. Evidence from this run
5. Confidence: low | medium | high

Constraints:
- No motivational language
- No vague advice
- No global instruction changes
- If evidence is weak, abstain

Skill extraction prompt

You are deciding whether a repeated procedure should become a reusable skill.

Given:
- 3 to 10 successful runs
- common steps across runs
- required tools
- known edge cases
- tests or replays

Produce:
- skill name
- scope
- prerequisites
- step-by-step procedure
- anti-patterns
- tests
- promotion recommendation: draft only / ready for review

Reject if:
- the workflow is too specific to one run
- there is no stable success signal
- the procedure depends on hidden assumptions

Artifact evaluator prompt

You are a separate grader. Do not improve the artifact until scoring is complete.

Score the artifact on:
- correctness
- completeness
- policy compliance
- audience fit
- brevity / structure

Then return:
- pass/fail
- highest-priority defect
- minimal revision instructions
- whether a retry is likely to help

Important:
- If the artifact is already good, say so
- Do not invent defects to justify another retry

Pseudocode

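MAX_REVISIONS = 1  # assumed cap on reviewer-driven revisions


def run_task(task, context):
    """Sketch of one task run; every helper called here is a placeholder."""
    run = create_run_record(task, context)

    plan = executor.plan(task, context)
    result = executor.act(plan, tools=context.tools, workspace=context.workspace)

    # Both helpers may return None when the task produces no gradable artifact.
    artifact = maybe_generate_artifact(result)
    rubric = select_rubric(task)
    review = maybe_review_artifact(artifact, rubric=rubric)

    # Revise only rejected artifacts, and re-grade each revision so the logged
    # review describes the artifact that is actually returned.
    revisions = 0
    while (review and not review.accepted and review.retry_allowed
           and revisions < MAX_REVISIONS):
        artifact = executor.revise(artifact, review.minimal_revision_instructions)
        review = maybe_review_artifact(artifact, rubric=rubric)
        revisions += 1

    # Score against the environment or an external check, not the reviewer alone.
    outcome = external_eval(task, result, artifact, context)
    log_outcome(run, result, artifact, review, outcome)

    # Learning writes stay low-privilege: drafts, audited preferences,
    # and sandbox requests rather than direct changes.
    if should_reflect(task, outcome, result):
        lesson = reflect(task, result, review, outcome)
        write_draft_reflection(lesson)

    if should_store_preference(task, outcome, result):
        pref = infer_preference(task, result, context.user_feedback)
        write_preference_with_audit(pref)

    if should_open_code_improvement_issue(outcome, result):
        create_sandbox_patch_request(task, result)

    return finalize(run, outcome, artifact)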

Architecture and roadmap diagrams

flowchart LR
    PhaseA[Instrumentation and gold tasks]
    PhaseB[Memory and reflection]
    PhaseC[Daily consolidation and weekly distillation]
    PhaseD[Skill extraction and routine automation]
    PhaseE[Separate evaluators and artifact grading]
    PhaseF[Prompt routing and sandboxed tool improvement]
    PhaseG[Curriculum generation and optional offline RL]

    PhaseA --> PhaseB --> PhaseC --> PhaseD --> PhaseE --> PhaseF --> PhaseG

flowchart TD
    FreshSession[Fresh task session]
    PersistentState[Persistent state]
    FreshSession --> WorkingContext[Working context assembly]
    PersistentState --> WorkingContext
    WorkingContext --> Execute[Execute with tools]
    Execute --> Evaluate[Evaluate against rubric or environment]
    Evaluate --> Reflect[Write draft lesson]
    Reflect --> PersistentState
    Evaluate --> Consolidate[Background consolidation]
    Consolidate --> PersistentState
    Evaluate --> Promote[Human-gated promotion]
    Promote --> PersistentState

The shortest formulation of the report is this: if OpenCLAW rewrites memory, skills, and playbooks from grounded evidence, it can become meaningfully more capable on lightweight infrastructure; if it tries to jump straight to autonomous prompt drift, unrestricted code rewriting, or RL training without evals and rollback, it will mostly become harder to trust. That is the core lesson that stays consistent across the research literature, the benchmark landscape, and the clearest production-facing systems. refs: turn26view0, turn26view1, turn26view5, turn24view6, …