buildlog — Agentic learning for AI coding tools

Your CLAUDE.md is guessing. buildlog is measuring.

You've edited your agent's instruction file. Added rules from experience. Maybe even organized them by category. But do you know which ones work?

You don't. Nobody does. Every AI coding tool on the market stores rules, memories, or instructions without measuring whether they reduce mistakes. Mem0 stores. LangSmith observes. CodeRabbit reviews. None of them close the loop.

Your AI agent repeats the same mistakes every session. We can prove it. buildlog tracks Repeated Mistake Rate across sessions. The number is usually higher than you think. The good news: once you can measure it, you can fix it.

Stop editing CLAUDE.md and hoping for the best. Start measuring.

The Review Gauntlet

Three reviewer personas. 61 curated rules from OWASP, ACL, COLING, PNAS, and GPTZero research. Each finding cites the rule that caught it. Those citations are how buildlog knows which rules are working.

Security Karen

OWASP Top 10, injection, auth, secrets, path traversal, SSRF. Catches the stuff your linter misses because it requires context.

13 rules

Test Terrorist

Coverage gaps, missing edge cases, property-based testing, metamorphic relations, contract tests. Thinks your test suite is too thin.

21 rules

Bragi

LLM prose detection. 27 rules derived from peer-reviewed research: em-dash overuse, epistemic flatness, throat-clearing, sentence uniformity, nominalization, both-sides hedging.

27 rules

How credits close the loop

When the gauntlet finds an issue, it cites the rule that caught it. buildlog validates the citation against the active rule set, writes a gauntlet credit, and snapshots the rule's posterior (alpha, beta, mean). Over sessions, posteriors converge. You can see which rules are carrying their weight and which are noise.
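The credit step described above can be sketched in a few lines. This is a minimal illustration, not buildlog's actual code: the `RulePosterior` class, the `credit_rule` function, and the rule id `sec-001` are hypothetical names, and it assumes each valid citation counts as one success for the cited rule.

```python
from dataclasses import dataclass, field


@dataclass
class RulePosterior:
    """Beta posterior for one rule (hypothetical structure, not buildlog's)."""
    alpha: float = 1.0
    beta: float = 1.0
    history: list = field(default_factory=list)

    @property
    def mean(self) -> float:
        return self.alpha / (self.alpha + self.beta)


def credit_rule(active_rules: dict, rule_id: str) -> dict:
    """Validate a citation, credit the rule, and snapshot its posterior."""
    if rule_id not in active_rules:
        # A citation that does not match the active rule set earns no credit.
        raise KeyError(f"citation refers to unknown or inactive rule: {rule_id}")
    posterior = active_rules[rule_id]
    posterior.alpha += 1.0  # assumption: one credit counts as one success
    snapshot = {
        "rule_id": rule_id,
        "alpha": posterior.alpha,
        "beta": posterior.beta,
        "mean": posterior.mean,
    }
    posterior.history.append(snapshot)
    return snapshot


# Hypothetical rule id, starting from the seed prior Beta(3,1).
rules = {"sec-001": RulePosterior(alpha=3.0, beta=1.0)}
snap = credit_rule(rules, "sec-001")
print(snap)  # the mean moves from 0.75 to 0.8 after one credit
```

Keeping every snapshot, rather than only the latest values, is what makes it possible to plot convergence later.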

This is the part nobody else has. Not the review. Not the rules. The measurement of whether the rules actually help.

Real data from a real codebase

This is buildlog running on itself. 40 sessions, 100 logged mistakes, 81 review cycles. Not a demo.

The loop is working

Running reward mean at 0.817 across 81 gauntlet review cycles. 59% accepted on first pass, 41% revised, 0% rejected. The green line is the target. We're above it.

buildlog dashboard: reward trend at 0.817 mean across 81 events

81 patterns extracted, 12 learnings reinforced

Insights by category (architectural, workflow, tool usage, domain knowledge) extracted from journal entries. Review learnings tracked as reinforced or contradicted across sessions.

buildlog dashboard: 81 insights across 4 categories, review learnings reinforced by evidence

buildlog viz launches this dashboard locally. Your data, your machine.

The first AI coding tool that gets better at its job

Nine components, each independently useful.

36-Tool MCP Server

36 tools via Model Context Protocol. Your agent calls buildlog during sessions: log commits, run gauntlet reviews, query posteriors, close feedback loops.

Thompson Sampling

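Beta-Bernoulli contextual bandit. Error classes partition the state space, so each error class keeps its own posteriors. Seed rules start at Beta(3,1), a prior mean of 0.75; non-seed rules start at the uniform Beta(1,1). Posteriors converge after roughly 20 observations.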
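A minimal sketch of Thompson Sampling over Beta posteriors, using only the standard library. The function names, the rule ids, and the `security` error class are hypothetical; the priors are the ones stated above. It is an illustration of the technique, not buildlog's implementation.

```python
import random


def thompson_select(posteriors: dict, k: int, rng: random.Random) -> list:
    """Draw one sample per rule from its Beta posterior and keep the k highest."""
    draws = {
        rule_id: rng.betavariate(alpha, beta)
        for rule_id, (alpha, beta) in posteriors.items()
    }
    return sorted(draws, key=draws.get, reverse=True)[:k]


def update(posteriors: dict, rule_id: str, success: bool) -> None:
    """Conjugate Beta-Bernoulli update: a success bumps alpha, a failure bumps beta."""
    alpha, beta = posteriors[rule_id]
    posteriors[rule_id] = (alpha + 1, beta) if success else (alpha, beta + 1)


# One posterior table per error class (the context). Names are hypothetical.
state = {
    "security": {
        "seed-rule": (3.0, 1.0),     # seed prior from the text
        "learned-rule": (1.0, 1.0),  # non-seed prior from the text
    },
}

rng = random.Random(0)
chosen = thompson_select(state["security"], k=1, rng=rng)
update(state["security"], chosen[0], success=True)
print(chosen, state["security"])
```

Sampling, rather than always picking the highest mean, gives rules with little evidence a chance to be tried while well-supported rules still win most of the time.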

Gauntlet Credits

When a persona cites a rule in a finding, the rule gets credited. Credits drive posterior updates. No credit assignment problem: the citation IS the attribution.

Bragi v3: LLM Detection

27 rules from ACL, COLING, PNAS, and GPTZero research. Catches em-dash waterfalls, epistemic flatness, throat-clearing, sentence uniformity, nominalization, and hedging. Your docs should sound like you.

Posterior History

Every gauntlet credit snapshots alpha, beta, and mean for the credited rule. Query convergence over time. Watch rules stabilize or stall. The receipts for "show me it works."

Extraction Pipeline

Regex (fast, free), Anthropic, OpenAI, or Ollama. Semantic dedup via sentence-transformers or OpenAI embeddings. Pluggable at every stage.
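To make the two stages concrete, here is a rough sketch of regex extraction followed by similarity-based dedup with a pluggable embedding function. Everything in it is an assumption for illustration: the `Learned:` / `Mistake:` line format, the 0.9 threshold, and the function names are hypothetical, and `toy_embed` is a letter-count stand-in for a real embedding model such as sentence-transformers.

```python
import math
import re

# Hypothetical journal format: lines starting with "Learned:" or "Mistake:".
PATTERN = re.compile(r"^(?:Learned|Mistake):\s*(.+)$", re.MULTILINE | re.IGNORECASE)


def extract(text: str) -> list:
    """Regex stage: pull candidate patterns out of free-form journal text."""
    return [match.strip() for match in PATTERN.findall(text)]


def cosine(u: list, v: list) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def dedup(items: list, embed, threshold: float = 0.9) -> list:
    """Keep an item only if it is not too similar to anything already kept."""
    kept, vectors = [], []
    for item in items:
        vec = embed(item)
        if all(cosine(vec, seen) < threshold for seen in vectors):
            kept.append(item)
            vectors.append(vec)
    return kept


def toy_embed(text: str) -> list:
    """Stand-in embedding (letter counts). Swap in a real model here."""
    lowered = text.lower()
    return [lowered.count(ch) for ch in "abcdefghijklmnopqrstuvwxyz"]


journal = (
    "Learned: validate user input\n"
    "learned: Validate user input!\n"
    "Mistake: forgot to close the file\n"
)
unique = dedup(extract(journal), toy_embed)
print(unique)
```

Passing `embed` as a parameter is one way to keep the dedup stage independent of the embedding backend.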

Multi-Agent Render

The same knowledge base renders to Claude Code, Cursor, Copilot, Windsurf, Continue.dev, and VS Code. Switch agents, keep what you learned.

Global SQLite

Single database at ~/.buildlog/buildlog.db. WAL mode. Project isolation via SHA-256. Posterior snapshots, gauntlet credits, mistake history, bandit state. All local.
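A sketch of what a single WAL-mode database with hashed project keys could look like, using Python's built-in `sqlite3` and `hashlib`. The table name, columns, and the choice to hash the project path are assumptions made for illustration; the text above only says that isolation uses SHA-256, not what is hashed or how the schema is laid out.

```python
import hashlib
import sqlite3
import tempfile
from pathlib import Path


def project_id(project_root: str) -> str:
    """Assumption: a project is keyed by the SHA-256 hex digest of its path."""
    return hashlib.sha256(project_root.encode("utf-8")).hexdigest()


def open_db(db_path: str) -> sqlite3.Connection:
    conn = sqlite3.connect(db_path)
    conn.execute("PRAGMA journal_mode=WAL")  # readers do not block the writer
    conn.execute(
        "CREATE TABLE IF NOT EXISTS gauntlet_credits ("  # hypothetical schema
        "project_id TEXT NOT NULL, rule_id TEXT NOT NULL, "
        "alpha REAL NOT NULL, beta REAL NOT NULL, mean REAL NOT NULL)"
    )
    return conn


# Demo against a temporary file, not ~/.buildlog/buildlog.db.
with tempfile.TemporaryDirectory() as tmp:
    conn = open_db(str(Path(tmp) / "demo.db"))
    mode = conn.execute("PRAGMA journal_mode").fetchone()[0]
    proj_a, proj_b = project_id("/repos/alpha"), project_id("/repos/beta")
    conn.executemany(
        "INSERT INTO gauntlet_credits VALUES (?, ?, ?, ?, ?)",
        [(proj_a, "rule-1", 4.0, 1.0, 0.8), (proj_b, "rule-1", 2.0, 1.0, 2 / 3)],
    )
    conn.commit()
    # Every query filters on project_id, so projects never see each other's rows.
    rows_a = conn.execute(
        "SELECT rule_id, mean FROM gauntlet_credits WHERE project_id = ?",
        (proj_a,),
    ).fetchall()
    conn.close()

print(mode, rows_a)
```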

Live Dashboard (marimo)

buildlog viz launches a live browser dashboard: reward trends, bandit posteriors, session history, mistake analysis, RMR tracking. Interactive, not static.

Quickstart

Two commands. Works in every project.

        $ pip install buildlog              # or: pipx install buildlog

        $ buildlog init-mcp --global -y     # register MCP + write agent instructions

That's it. buildlog is now ambient across every repo. Your agent has all 36 tools and knows how to use them.

Per-project setup

        $ buildlog init --defaults          # scaffold buildlog/, register MCP, update CLAUDE.md
      

With extras

        $ pip install "buildlog[all]"       # embeddings + LLM extraction + qortex backend

        # or pick: [embeddings] [openai] [anthropic] [ollama] [qortex]

Requires Python 3.11+.

FAQ

How is this different from editing CLAUDE.md by hand?

Manual editing is step 1. buildlog is steps 2 through 6: reviewing with curated personas, crediting rules when they catch issues, selecting which rules to include based on statistical evidence, rendering to your agent's format, and measuring whether mistakes actually decrease. You can edit CLAUDE.md forever and never know what's working.

How is this different from Mem0 / agent memory?

Mem0 stores memories. buildlog measures whether memories help. Storage without measurement is a filing cabinet. You need the filing cabinet, but you also need to know which files are worth keeping. That's what Thompson Sampling does.

Does this need an API key?

No. The base install uses regex extraction and local computation. LLM-backed extraction (Anthropic, OpenAI, Ollama) is available via extras for richer patterns. The core loop, bandit, gauntlet, and dashboard are entirely local.

What data leaves my machine?

Nothing, by default. Everything is in ~/.buildlog/buildlog.db (local SQLite). If you enable LLM extraction via API key, prompts go to the provider you chose. Bandit state, posteriors, gauntlet credits, and experiments are entirely local.

How do I know the posteriors are real?

Run buildlog posterior-history --rule-id <id> or open the marimo dashboard with buildlog viz. You'll see alpha, beta, and mean for every snapshot. The math is standard Beta-Bernoulli conjugate updates. If a rule is cited 20 times and the mean is 0.85, that's 20 real observations pulling the distribution toward 1.0.
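For reference, the conjugate update is short enough to write out. With a prior Beta(α₀, β₀), s successes, and f failures:

```latex
\alpha = \alpha_0 + s, \qquad
\beta = \beta_0 + f, \qquad
\mathbb{E}[\theta] = \frac{\alpha}{\alpha + \beta}
                   = \frac{\alpha_0 + s}{\alpha_0 + \beta_0 + s + f}
```

As an illustrative case (the counts are made up, not measured): a seed rule starting at Beta(3,1) that records 17 successes and 3 failures over 20 observations ends at Beta(20,4), with mean 20/24 ≈ 0.833. Which events buildlog counts as failures is not specified here, so treat the split as an assumption.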

Can I use this without Claude Code?

Yes. The MCP server is Claude Code-native, but the CLI works standalone and rule rendering works for Cursor, Copilot, Windsurf, Continue.dev, and VS Code. buildlog promote --target cursor writes to .cursor/rules/buildlog-rules.mdc.

Is this production-ready?

Used in production daily across multiple projects. 1,496 tests. Stable API surface. The gauntlet, bandit, and measurement loop work. Extraction quality is the current bottleneck (see limits). Use it, measure it, report what breaks.

AI agents that learn from their mistakes. For real this time.

Open source. MIT licensed. On PyPI.