AI agents that learn from their mistakes.
For real this time.

buildlog captures what your agent gets right and wrong, selects which rules to keep with Thompson Sampling, and backs each choice with posterior convergence data. Rules that help get promoted. Rules that don't are dropped. The loop runs itself.

$ pip install buildlog
$ buildlog init-mcp --global -y
v0.22 on PyPI
1,496 tests

Your CLAUDE.md is guessing. buildlog is measuring.

You've edited your agent's instruction file. Added rules from experience. Maybe even organized them by category. But do you know which ones work?

You don't. Nobody does. Every AI coding tool on the market stores rules, memories, or instructions without measuring whether they reduce mistakes. Mem0 stores. LangSmith observes. CodeRabbit reviews. None of them close the loop.

Your AI agent repeats the same mistakes every session. We can prove it. buildlog tracks Repeated Mistake Rate across sessions. The number is usually higher than you think. The good news: once you can measure it, you can fix it.
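
One plausible way to compute a Repeated Mistake Rate is sketched below. The definition used here (the share of logged mistakes whose key already appeared in an earlier session) is an assumption for illustration, and so are the function name, the input shape, and the mistake labels. buildlog's own formula may differ.

```python
from typing import Sequence


def repeated_mistake_rate(sessions: Sequence[Sequence[str]]) -> float:
    """Share of mistakes whose key was already seen in an earlier session.

    `sessions` is in chronological order; each session is a list of
    mistake keys (hypothetical labels such as "missing-null-check").
    Assumed definition for illustration, not buildlog's exact formula.
    """
    seen: set[str] = set()
    total = 0
    repeated = 0
    for session in sessions:
        for key in session:
            total += 1
            if key in seen:
                repeated += 1
        # Mark keys as seen only after the session ends, so a mistake made
        # twice within one session is not counted as a cross-session repeat.
        seen.update(session)
    return repeated / total if total else 0.0


history = [
    ["missing-null-check", "hardcoded-secret"],
    ["missing-null-check", "untested-branch"],
    ["missing-null-check", "hardcoded-secret"],
]
print(f"RMR: {repeated_mistake_rate(history):.2f}")  # RMR: 0.50
```

Three of the six logged mistakes repeat something from an earlier session, so this toy history scores 0.50.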

Stop editing CLAUDE.md and hoping for the best. Start measuring.

How it works

A closed feedback loop. Each stage exists because the previous one wasn't enough.

01 · CAPTURE: Commits + Entries via MCP
02 · REVIEW: Gauntlet Review → findings
03 · CREDIT: Rule Attribution → posteriors
04 · SELECT: Thompson Sampling → active rules (core)
05 · RENDER: Agent Instructions → CLAUDE.md
06 · MEASURE: RMR + Convergence → evidence

Evidence from stage 06 feeds back into the loop as a belief update.

Every stage works on its own. Capture without reviewing. Review without the bandit. Render without experiments. But the full loop is where compound improvement happens.

The Review Gauntlet

Three reviewer personas. 61 curated rules from OWASP, ACL, COLING, PNAS, and GPTZero research. Each finding cites the rule that caught it. Those citations are how buildlog knows which rules are working.

Security Karen
OWASP Top 10, injection, auth, secrets, path traversal, SSRF. Catches the stuff your linter misses because it requires context.
13 rules
Test Terrorist
Coverage gaps, missing edge cases, property-based testing, metamorphic relations, contract tests. Thinks your test suite is too thin.
21 rules
Bragi
LLM prose detection. 27 rules derived from peer-reviewed research: em-dash overuse, epistemic flatness, throat-clearing, sentence uniformity, nominalization, both-sides hedging.
27 rules

How credits close the loop

When the gauntlet finds an issue, it cites the rule that caught it. buildlog validates the citation against the active rule set, writes a gauntlet credit, and snapshots the rule's posterior (alpha, beta, mean). Over sessions, posteriors converge. You can see which rules are carrying their weight and which are noise.
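
A minimal sketch of that credit step, under stated assumptions: the names `RulePosterior` and `apply_credit`, the rule id, and the choice to count one credit as one Bernoulli success are all illustrative, not buildlog's actual API or schema.

```python
from dataclasses import dataclass, field


@dataclass
class RulePosterior:
    alpha: float = 1.0
    beta: float = 1.0
    # One (alpha, beta, mean) tuple per credit, oldest first.
    history: list[tuple[float, float, float]] = field(default_factory=list)

    @property
    def mean(self) -> float:
        return self.alpha / (self.alpha + self.beta)


def apply_credit(active_rules: dict[str, RulePosterior], rule_id: str) -> bool:
    """Credit a cited rule; return False if the citation is not valid."""
    posterior = active_rules.get(rule_id)
    if posterior is None:
        # The citation does not match an active rule: write nothing.
        return False
    # Assumption: one credit counts as one Bernoulli success.
    posterior.alpha += 1.0
    # Snapshot (alpha, beta, mean) so convergence can be inspected later.
    posterior.history.append((posterior.alpha, posterior.beta, posterior.mean))
    return True


rules = {"sec-001": RulePosterior(alpha=3.0, beta=1.0)}  # hypothetical rule id
print(apply_credit(rules, "sec-001"), rules["sec-001"].history)
print(apply_credit(rules, "not-a-rule"))
```

The snapshot list is what makes convergence visible: each entry records where the posterior stood at the moment of a credit.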

This is the part nobody else has. Not the review. Not the rules. The measurement of whether the rules actually help.

Real data from a real codebase

This is buildlog running on itself. 40 sessions, 100 logged mistakes, 81 review cycles. Not a demo.

The loop is working

The running reward mean is 0.817 across 81 gauntlet review cycles: 59% accepted on first pass, 41% revised, 0% rejected. The green line is the target. We're above it.

[Screenshot: buildlog dashboard, reward trend at 0.817 mean across 81 events]

81 patterns extracted, 12 learnings reinforced

Insights by category (architectural, workflow, tool usage, domain knowledge) extracted from journal entries. Review learnings tracked as reinforced or contradicted across sessions.

[Screenshot: buildlog dashboard, 81 insights across 4 categories, review learnings reinforced by evidence]

buildlog viz launches this dashboard locally. Your data, your machine.

The first AI coding tool that gets better at its job

Nine components, each independently useful.

36-Tool MCP Server

36 tools via Model Context Protocol. Your agent calls buildlog during sessions: log commits, run gauntlet reviews, query posteriors, close feedback loops.

Thompson Sampling

Beta-Bernoulli contextual bandit. Error classes partition the state space. Seed rules start at Beta(3,1), non-seed rules at Beta(1,1). Posteriors typically converge within about 20 observations.
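
The selection step can be sketched in a few lines. This is an illustration of Beta-Bernoulli Thompson Sampling, not buildlog's source: the function names, the error classes, the rule ids, and the top-k selection policy are assumptions. Only the two priors come from the description above.

```python
import random

SEED_PRIOR = (3.0, 1.0)     # Beta(3,1) for seed rules, as stated above
DEFAULT_PRIOR = (1.0, 1.0)  # Beta(1,1) for non-seed rules

# error_class -> {rule_id: (alpha, beta)}
Posteriors = dict[str, dict[str, tuple[float, float]]]


def select_rules(posteriors: Posteriors, error_class: str, k: int,
                 rng: random.Random) -> list[str]:
    """Draw one sample per rule from its Beta posterior and keep the top k."""
    arms = posteriors[error_class]
    draws = {rule: rng.betavariate(a, b) for rule, (a, b) in arms.items()}
    return sorted(draws, key=draws.__getitem__, reverse=True)[:k]


def record_outcome(posteriors: Posteriors, error_class: str, rule: str,
                   helped: bool) -> None:
    """Beta-Bernoulli update: a success bumps alpha, a failure bumps beta."""
    a, b = posteriors[error_class][rule]
    posteriors[error_class][rule] = (a + 1, b) if helped else (a, b + 1)


# Hypothetical error classes and rule ids, one set of arms per class.
state: Posteriors = {
    "security": {"seed-rule": SEED_PRIOR, "new-rule": DEFAULT_PRIOR},
    "testing": {"other-rule": DEFAULT_PRIOR},
}
rng = random.Random(0)
chosen = select_rules(state, "security", k=1, rng=rng)
record_outcome(state, "security", chosen[0], helped=True)
print(chosen, state["security"])
```

Sampling instead of ranking by mean is the point: a rule with little evidence has a wide posterior, so it still wins some draws and keeps getting tested until the data settles.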

Gauntlet Credits

When a persona cites a rule in a finding, the rule gets credited. Credits drive posterior updates. No credit assignment problem: the citation IS the attribution.

Bragi v3: LLM Detection

27 rules from ACL, COLING, PNAS, GPTZero research. Catches em-dash waterfalls, epistemic flatness, throat-clearing, sentence uniformity, nominalization, hedging. Your docs should sound like you.

Posterior History

Every gauntlet credit snapshots alpha, beta, and mean for the credited rule. Query convergence over time. Watch rules stabilize or stall. The receipts for "show me it works."

Extraction Pipeline

Regex (fast, free), Anthropic, OpenAI, or Ollama. Semantic dedup via sentence-transformers or OpenAI embeddings. Pluggable at every stage.

Multi-Agent Render

Same knowledge base renders to Claude Code, Cursor, Copilot, Windsurf, Continue.dev, VS Code. Switch agents, keep what you learned.

Global SQLite

Single database at ~/.buildlog/buildlog.db. WAL mode. Project isolation via SHA-256. Posterior snapshots, gauntlet credits, mistake history, bandit state. All local.
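
A rough sketch of how one WAL-mode database can keep projects apart. Everything specific here is an assumption for illustration: hashing the resolved project path to get the SHA-256 key, the `mistakes` table, and the helper names. The demo writes to a temporary directory, not to ~/.buildlog.

```python
import hashlib
import sqlite3
import tempfile
from pathlib import Path


def project_id(project_root: str) -> str:
    """Stable key from the resolved project path (assumed hashing scheme)."""
    resolved = str(Path(project_root).resolve())
    return hashlib.sha256(resolved.encode("utf-8")).hexdigest()


def open_db(db_path: Path) -> sqlite3.Connection:
    conn = sqlite3.connect(db_path)
    # WAL lets readers keep working while a writer appends.
    conn.execute("PRAGMA journal_mode=WAL")
    conn.execute(
        "CREATE TABLE IF NOT EXISTS mistakes ("  # hypothetical table
        "project_id TEXT NOT NULL, description TEXT NOT NULL)"
    )
    return conn


def demo() -> tuple[str, list[str]]:
    with tempfile.TemporaryDirectory() as tmp:
        conn = open_db(Path(tmp) / "buildlog.db")
        mode = conn.execute("PRAGMA journal_mode").fetchone()[0]
        a, b = project_id("/repos/alpha"), project_id("/repos/beta")
        conn.executemany(
            "INSERT INTO mistakes VALUES (?, ?)",
            [(a, "missing null check"), (b, "hardcoded secret")],
        )
        conn.commit()
        # Every query filters on the project key, so rows never mix.
        rows = conn.execute(
            "SELECT description FROM mistakes WHERE project_id = ?", (a,)
        ).fetchall()
        conn.close()
        return mode, [r[0] for r in rows]


print(demo())
```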

Live Dashboard (marimo)

buildlog viz launches a live browser dashboard: reward trends, bandit posteriors, session history, mistake analysis, RMR tracking. Interactive, not static.

Agent targets

One knowledge base. Every agent format. buildlog promote --target <agent> writes learned rules to the file your agent reads.

Claude Code: CLAUDE.md
Cursor: .cursor/rules/
GitHub Copilot: copilot-instructions.md
Windsurf: .windsurf/rules/
Continue.dev: .continue/rules/
VS Code: settings.json

Quickstart

Two commands. Works in every project.

$ pip install buildlog # or: pipx install buildlog
$ buildlog init-mcp --global -y # register MCP + write agent instructions

That's it. buildlog is now ambient across every repo. Your agent has all 36 tools and knows how to use them.

Per-project setup

$ buildlog init --defaults # scaffold buildlog/, register MCP, update CLAUDE.md

With extras

$ pip install "buildlog[all]" # embeddings + LLM extraction + qortex backend
# or pick: [embeddings] [openai] [anthropic] [ollama] [qortex]

Requires Python 3.11+.

FAQ

How is this different from editing CLAUDE.md by hand?

Manual editing is step 1. buildlog is steps 2 through 6: reviewing with curated personas, crediting rules when they catch issues, selecting which rules to include based on statistical evidence, rendering to your agent's format, and measuring whether mistakes actually decrease. You can edit CLAUDE.md forever and never know what's working.

How is this different from Mem0 / agent memory?

Mem0 stores memories. buildlog measures whether memories help. Storage without measurement is a filing cabinet. You need the filing cabinet, but you also need to know which files are worth keeping. That's what Thompson Sampling does.

Does this need an API key?

No. The base install uses regex extraction and local computation. LLM-backed extraction (Anthropic, OpenAI, Ollama) is available via extras for richer patterns. The core loop, bandit, gauntlet, and dashboard are entirely local.

What data leaves my machine?

Nothing, by default. Everything is in ~/.buildlog/buildlog.db (local SQLite). If you enable LLM extraction via API key, prompts go to the provider you chose. Bandit state, posteriors, gauntlet credits, and experiments are entirely local.

How do I know the posteriors are real?

Run buildlog posterior-history --rule-id <id> or open the marimo dashboard with buildlog viz. You'll see alpha, beta, and mean for every snapshot. The math is standard Beta-Bernoulli conjugate updates. If a rule is cited 20 times and the mean is 0.85, that's 20 real observations pulling the distribution toward 1.0.
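
To check the arithmetic yourself, here is a small worked example of a conjugate update. The numbers are made up for illustration (a Beta(1,1) prior, then 17 successes and 3 failures), and the helper names are not part of buildlog.

```python
from math import sqrt


def beta_update(alpha: float, beta: float,
                successes: int, failures: int) -> tuple[float, float]:
    """Conjugate update: Beta(a, b) plus Bernoulli counts gives Beta(a+s, b+f)."""
    return alpha + successes, beta + failures


def beta_mean(alpha: float, beta: float) -> float:
    return alpha / (alpha + beta)


def beta_std(alpha: float, beta: float) -> float:
    n = alpha + beta
    return sqrt(alpha * beta / (n * n * (n + 1)))


prior = (1.0, 1.0)
posterior = beta_update(*prior, successes=17, failures=3)  # Beta(18, 4)

print("prior    ", beta_mean(*prior), round(beta_std(*prior), 3))
print("posterior", round(beta_mean(*posterior), 3), round(beta_std(*posterior), 3))
```

Twenty observations move the mean from 0.5 to 18/22 and shrink the standard deviation to well under a third of the prior's. That narrowing is what convergence looks like in the snapshots.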

Can I use this without Claude Code?

Yes. The MCP server is Claude Code-native, but the CLI works standalone and rule rendering works for Cursor, Copilot, Windsurf, Continue.dev, and VS Code. buildlog promote --target cursor writes to .cursor/rules/buildlog-rules.mdc.

Is this production-ready?

Used in production daily across multiple projects. 1,496 tests. Stable API surface. The gauntlet, bandit, and measurement loop work. Extraction quality is the current bottleneck (see limits). Use it, measure it, report what breaks.

Open source. MIT licensed. On PyPI.

Two commands and your agent starts learning from its mistakes.