buildlog captures what your agent gets right and wrong, selects which rules to keep with Thompson Sampling, and proves it with posterior convergence data. Rules that help get promoted. Rules that don't get dropped. The loop runs itself.
You've edited your agent's instruction file. Added rules from experience. Maybe even organized them by category. But do you know which ones work?
You don't. Nobody does. Every AI coding tool on the market stores rules, memories, or instructions without measuring whether they reduce mistakes. Mem0 stores. LangSmith observes. CodeRabbit reviews. None of them close the loop.
Stop editing CLAUDE.md and hoping for the best. Start measuring.
A closed feedback loop. Each stage exists because the previous one wasn't enough.
Every stage works on its own. Capture without reviewing. Review without the bandit. Render without experiments. But the full loop is where compound improvement happens.
Three reviewer personas. 61 curated rules from OWASP, ACL, COLING, PNAS, and GPTZero research. Each finding cites the rule that caught it. Those citations are how buildlog knows which rules are working.
When the gauntlet finds an issue, it cites the rule that caught it. buildlog validates the citation against the active rule set, writes a gauntlet credit, and snapshots the rule's posterior (alpha, beta, mean). Over sessions, posteriors converge. You can see which rules are carrying their weight and which are noise.
This is the part nobody else has. Not the review. Not the rules. The measurement of whether the rules actually help.
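The credit flow described above (validate the citation, write a credit, snapshot the posterior) can be sketched in a few lines. This is a minimal illustration, not buildlog's actual code: the `Posterior` and `CreditLedger` classes, the rule id `sec-001`, and the choice to count each valid citation as one success are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Posterior:
    """Beta(alpha, beta) belief about how often a rule helps."""
    alpha: float = 1.0
    beta: float = 1.0

    @property
    def mean(self) -> float:
        return self.alpha / (self.alpha + self.beta)


@dataclass
class CreditLedger:
    """Hypothetical in-memory stand-in for the credit and snapshot storage."""
    active: dict[str, Posterior]
    snapshots: list[tuple[str, float, float, float]] = field(default_factory=list)

    def credit(self, rule_id: str) -> bool:
        """Credit a cited rule; ignore citations outside the active rule set."""
        posterior = self.active.get(rule_id)
        if posterior is None:
            return False  # citation failed validation, nothing is written
        posterior.alpha += 1.0  # assumption: a valid citation counts as one success
        self.snapshots.append(
            (rule_id, posterior.alpha, posterior.beta, posterior.mean)
        )
        return True


ledger = CreditLedger(active={"sec-001": Posterior(3.0, 1.0)})  # made-up rule id
ledger.credit("sec-001")     # valid citation: alpha goes 3 -> 4, snapshot recorded
ledger.credit("not-a-rule")  # unknown rule: rejected, no snapshot
```

Because the finding names the rule directly, the update needs no inference about which rule deserves the credit; the snapshot list is what a convergence query would read back.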
This is buildlog running on itself. 40 sessions, 100 logged mistakes, 81 review cycles. Not a demo.
Running reward mean at 0.817 across 81 gauntlet review cycles. 59% accepted on first pass, 41% revised, 0% rejected. The green line is the target. We're above it.
Insights by category (architectural, workflow, tool usage, domain knowledge) extracted from journal entries. Review learnings tracked as reinforced or contradicted across sessions.
buildlog viz launches this dashboard locally. Your data, your machine.
Nine components, each independently useful.
36 tools via Model Context Protocol. Your agent calls buildlog during sessions: log commits, run gauntlet reviews, query posteriors, close feedback loops.
Beta-Bernoulli contextual bandit. Error classes partition the state space. Seeds start at Beta(3,1), non-seeds at Beta(1,1). Posteriors converge after roughly 20 observations.
When a persona cites a rule in a finding, the rule gets credited. Credits drive posterior updates. No credit assignment problem: the citation IS the attribution.
27 rules from ACL, COLING, PNAS, GPTZero research. Catches em-dash waterfalls, epistemic flatness, throat-clearing, sentence uniformity, nominalization, hedging. Your docs should sound like you.
Every gauntlet credit snapshots alpha, beta, and mean for the credited rule. Query convergence over time. Watch rules stabilize or stall. The receipts for "show me it works."
Regex (fast, free), Anthropic, OpenAI, or Ollama. Semantic dedup via sentence-transformers or OpenAI embeddings. Pluggable at every stage.
Same knowledge base renders to Claude Code, Cursor, Copilot, Windsurf, Continue.dev, VS Code. Switch agents, keep what you learned.
Single database at ~/.buildlog/buildlog.db. WAL mode. Project isolation via SHA-256. Posterior snapshots, gauntlet credits, mistake history, bandit state. All local.
buildlog viz launches a live browser dashboard: reward trends, bandit posteriors, session history, mistake analysis, RMR tracking. Interactive, not static.
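For readers who want to see how the bandit component might select rules, here is a rough sketch of Thompson Sampling over Beta posteriors, using the seed and non-seed priors quoted above. The function names, the `type-errors` class, and the rule ids are hypothetical, and treating each error class as an independent set of posteriors is an assumption about what "partition the state space" means.

```python
import random

SEED_PRIOR = (3.0, 1.0)     # Beta(3,1) for seed rules, per the text
DEFAULT_PRIOR = (1.0, 1.0)  # Beta(1,1) for non-seed rules


def new_posterior(is_seed: bool) -> list[float]:
    """Return a fresh [alpha, beta] pair for a rule."""
    alpha, beta = SEED_PRIOR if is_seed else DEFAULT_PRIOR
    return [alpha, beta]


def select_rules(
    posteriors: dict[str, list[float]], k: int, rng: random.Random
) -> list[str]:
    """Thompson Sampling: draw once from each rule's Beta posterior, keep the top k."""
    draws = {rule: rng.betavariate(a, b) for rule, (a, b) in posteriors.items()}
    return sorted(draws, key=draws.get, reverse=True)[:k]


def update(posterior: list[float], helped: bool) -> None:
    """Conjugate update: a success increments alpha, a failure increments beta."""
    posterior[0 if helped else 1] += 1.0


# One independent set of posteriors per error class (the context).
state: dict[str, dict[str, list[float]]] = {
    "type-errors": {
        "rule-a": new_posterior(is_seed=True),
        "rule-b": new_posterior(is_seed=False),
    },
}

rng = random.Random(0)
chosen = select_rules(state["type-errors"], k=1, rng=rng)
update(state["type-errors"][chosen[0]], helped=True)
```

Sampling rather than ranking by mean is what lets an under-observed rule still get picked occasionally, so it can earn or lose its place as evidence accumulates.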
One knowledge base. Every agent format. buildlog promote --target <agent> writes learned rules to the file your agent reads.
Two commands. Works in every project.
That's it. buildlog is now ambient across every repo. Your agent has all 36 tools and knows how to use them.
Requires Python 3.11+.
Manual editing is step 1. buildlog is steps 2 through 6: reviewing with curated personas, crediting rules when they catch issues, selecting which rules to include based on statistical evidence, rendering to your agent's format, and measuring whether mistakes actually decrease. You can edit CLAUDE.md forever and never know what's working.
Mem0 stores memories. buildlog measures whether memories help. Storage without measurement is a filing cabinet. You need the filing cabinet, but you also need to know which files are worth keeping. That's what Thompson Sampling does.
No. The base install uses regex extraction and local computation. LLM-backed extraction (Anthropic, OpenAI, Ollama) is available via extras for richer patterns. The core loop, bandit, gauntlet, and dashboard are entirely local.
Nothing, by default. Everything is in ~/.buildlog/buildlog.db (local SQLite). If you enable LLM extraction via API key, prompts go to the provider you chose. Bandit state, posteriors, gauntlet credits, and experiments are entirely local.
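To make "local SQLite" concrete, the sketch below opens a database file in WAL mode and keys rows by a SHA-256 project id, the two storage details mentioned earlier. It is only an illustration: the table layout is invented, the example writes to a scratch directory rather than ~/.buildlog, and hashing the project path is an assumption about what buildlog feeds into SHA-256.

```python
import hashlib
import sqlite3
import tempfile
from pathlib import Path


def project_key(project_root: str) -> str:
    """Derive a stable per-project id (assumption: the project path is hashed)."""
    return hashlib.sha256(project_root.encode("utf-8")).hexdigest()


def open_store(db_path: Path) -> sqlite3.Connection:
    conn = sqlite3.connect(db_path)
    conn.execute("PRAGMA journal_mode=WAL")  # write-ahead logging
    # Hypothetical table; the real schema is not shown in the text.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS posterior_snapshots ("
        "project TEXT, rule_id TEXT, alpha REAL, beta REAL, mean REAL)"
    )
    return conn


db_path = Path(tempfile.mkdtemp()) / "buildlog.db"  # scratch file for the demo
conn = open_store(db_path)
key = project_key("/home/me/projects/demo")  # made-up project path
conn.execute(
    "INSERT INTO posterior_snapshots VALUES (?, ?, ?, ?, ?)",
    (key, "rule-a", 4.0, 1.0, 0.8),
)
conn.commit()
rows = conn.execute(
    "SELECT rule_id, mean FROM posterior_snapshots WHERE project = ?", (key,)
).fetchall()
```

Everything here happens on disk in one file, with no network call involved.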
Run buildlog posterior-history --rule-id <id> or open the marimo dashboard with buildlog viz. You'll see alpha, beta, and mean for every snapshot. The math is standard Beta-Bernoulli conjugate updates. If a rule is cited 20 times and the mean is 0.85, that's 20 real observations pulling the distribution toward 1.0.
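The conjugate update is simple enough to check by hand. The sketch below works through 20 observations under both priors; the split of 17 successes and 3 failures is an invented example, and `posterior_after` is a hypothetical helper, not a buildlog function.

```python
import math


def posterior_after(
    prior: tuple[float, float], successes: int, failures: int
) -> tuple[float, float, float, float]:
    """Beta-Bernoulli update: return (alpha, beta, mean, standard deviation)."""
    alpha = prior[0] + successes
    beta = prior[1] + failures
    total = alpha + beta
    mean = alpha / total
    std = math.sqrt(alpha * beta / (total * total * (total + 1)))
    return alpha, beta, mean, std


# 20 observations: 17 where the rule helped, 3 where it did not (made-up split).
seed = posterior_after((3.0, 1.0), successes=17, failures=3)   # Beta(20, 4)
fresh = posterior_after((1.0, 1.0), successes=17, failures=3)  # Beta(18, 4)

# The prior alone, for comparison: same formula, zero observations.
seed_prior_only = posterior_after((3.0, 1.0), successes=0, failures=0)
```

With these numbers the seed rule lands at a mean of 20/24 and the non-seed rule at 18/22, and both have a noticeably smaller standard deviation than the prior alone. That shrinking spread is what convergence looks like in the snapshot history.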
Yes. The MCP server is Claude Code-native, but the CLI works standalone and rule rendering works for Cursor, Copilot, Windsurf, Continue.dev, and VS Code. buildlog promote --target cursor writes to .cursor/rules/buildlog-rules.mdc.
Used in production daily across multiple projects. 1,496 tests. Stable API surface. The gauntlet, bandit, and measurement loop work. Extraction quality is the current bottleneck (see limits). Use it, measure it, report what breaks.
Two commands and your agent starts learning from its mistakes.