The Idea
Every LLM agent call has two parts: reasoning (hard) and execution (easy). Today both hit a frontier model at $1–10/M tokens. That's expensive for the easy part.
Merlin is not a general-purpose model that happens to write code. It is a highly specialized agentic coding model — trained end-to-end on code, bash, tool-call traces, and commit history, with a custom tokenizer that treats tool invocations as first-class syntax. It knows one domain and executes it without hesitation.
A smarter orchestrator (Claude, GPT-4) handles planning; Merlin is the brute-force execution layer beneath it — fast, parallel, and completely under your control.
Runs entirely on your machine. No API calls, no telemetry, no cloud. Your codebase never leaves your laptop — not even for a status ping.
No tokens, no subscriptions, no rate limits. Run it a thousand times in a pre-commit hook, a file watcher, a test suite. The marginal cost is electricity.
Two ways to run it:
- One worker on your MacBook: private, offline, instant. Always on, with no latency to a remote API and no outage risk.
- 100–1000 workers via GPU batching, pay-per-task. Spawn one per file and refactor an entire repo in seconds.
Design Choices
| Choice | What | Why |
|---|---|---|
| Pre-trained from scratch | 100B tokens, corpus weighted toward code + agentic traces | No distillation from proprietary models — weights are commercially clean |
| Custom tokenizer | 32K BPE vocab + 18 agent protocol special tokens | Tool-call protocol is first-class, not bolted on |
| 6K context window | block_size = 4096 training, 6144 packed chunks | Sized for one large Python file + agent overhead; not a general-purpose model |
| RL post-training | GRPO on verifiable bash/filesystem rewards | Ground truth without a judge model; scalable to millions of tasks |
| MLX inference | int4 quantization via MLX on Apple Silicon | ~1.5GB weights, >500 tok/s on M3 MacBook — no discrete GPU needed |
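The "verifiable rewards" row is the key to judge-free RL: a bash/filesystem task can be scored by executing it and checking the result. A minimal sketch of that idea (the actual task format and reward shaping are not specified in this document):

```python
import pathlib
import subprocess
import tempfile

def reward(command: str, check) -> float:
    """Run a proposed shell command in a scratch directory, then score
    a filesystem predicate against the result. Ground truth from
    execution, no judge model involved."""
    with tempfile.TemporaryDirectory() as scratch:
        subprocess.run(command, shell=True, cwd=scratch,
                       capture_output=True, timeout=10)
        return 1.0 if check(pathlib.Path(scratch)) else 0.0
```

For example, `reward("touch done.txt", lambda p: (p / "done.txt").exists())` scores 1.0, while a command that fails to create the file scores 0.0. Because the check is executable, it scales to millions of generated tasks.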
Agent Protocol
18 special tokens define the tool-call format. The model learns to emit structured tool calls and parse results — no prompt engineering required at inference time.
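The actual 18-token vocabulary is not listed here; the sketch below uses invented token names (`<|tool_call|>`, `<|tool_end|>`) purely to illustrate what "tool calls as first-class syntax" looks like at the text level, with emit and parse as inverse operations:

```python
import re

# Hypothetical special tokens, for illustration only. The real agent
# protocol vocabulary is not published in this document.
TOOL_CALL, TOOL_END = "<|tool_call|>", "<|tool_end|>"

def format_call(tool: str, arg: str) -> str:
    """Render a tool invocation the way the model would emit it."""
    return f"{TOOL_CALL}{tool}\n{arg}{TOOL_END}"

def parse_call(text: str):
    """Extract (tool, arg) from a completion, or None if absent."""
    pattern = re.escape(TOOL_CALL) + r"(\w+)\n(.*?)" + re.escape(TOOL_END)
    m = re.search(pattern, text, re.S)
    return (m.group(1), m.group(2)) if m else None
```

Because the delimiters are single tokens in the model's vocabulary rather than strings it must spell out, the format costs almost nothing in context and cannot be confused with ordinary code.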
Corpus
~100B tokens across 7 sources. Two-phase curriculum: 80B general mix → 20B upweighted traces + instruction data.
| Source | Share | Role |
|---|---|---|
| Stack v2 — Python | ~38% | Core coding ability |
| Stack v2 — Bash/Markdown | ~8% | Shell and docs |
| Agentic traces (synthetic) | ~15% | Tool-call protocol, task execution |
| GitHub commits + issues | ~11% | Code + natural language reasoning about code |
| Stack Overflow | ~10% | Q&A, debugging patterns |
| Math + instruction mix | ~12% | Reasoning, instruction following |
| tldr pages | <1% | Bash command reference |
The v0 corpus (1.19B tokens) is already published: tsuberim/merlin-corpus-v0
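The two-phase curriculum can be sketched as phase-dependent sampling weights. The per-phase weights below are assumptions for illustration; only the 80B/20B split and the direction ("upweight traces + instruction data late") come from the text:

```python
import random

# Assumed per-phase source weights -- illustrative, not the published mix.
PHASE1 = {"code": 0.55, "traces": 0.15, "nl_about_code": 0.20, "math_instruct": 0.10}
PHASE2 = {"code": 0.30, "traces": 0.40, "nl_about_code": 0.10, "math_instruct": 0.20}

def sample_source(tokens_seen: float, rng: random.Random,
                  phase_boundary: float = 80e9) -> str:
    """Pick the source of the next packed chunk for the current phase."""
    weights = PHASE1 if tokens_seen < phase_boundary else PHASE2
    names, probs = zip(*weights.items())
    return rng.choices(names, weights=probs, k=1)[0]
```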
Architecture
GPT-style decoder-only transformer. RMSNorm, SwiGLU, GQA (n_kv_head=8), no bias, weight tying, pre-norm.
| Config | Params | n_embd | n_head | n_layer | block_size |
|---|---|---|---|---|---|
| tiny | ~1.6M | 32 | 2 | 2 | 64 |
| medium | ~21M | 256 | 8 | 8 | 512 |
| base (330M) | ~330M | 1024 | 16 | 16 | 2048 |
| 3b (target) | ~3.17B | 3072 | 24 | 20 | 4096 |
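A rough parameter count follows from the components named above (tied embeddings, GQA, SwiGLU, RMSNorm, no bias). The MLP hidden width (4 × n_embd across three SwiGLU matrices) and exact vocab size are assumptions the document does not state, so this lands in the right ballpark rather than matching the table exactly:

```python
def approx_params(n_embd: int, n_head: int, n_layer: int,
                  n_kv_head: int = 8, vocab: int = 32_000) -> int:
    """Ballpark parameter count for the decoder described above.
    MLP width (4 * n_embd) is an assumption; vocab is the '32K' BPE."""
    n_kv = min(n_kv_head, n_head)            # tiny/medium have < 8 heads
    head_dim = n_embd // n_head
    attn = n_embd * n_embd                   # Q projection
    attn += 2 * n_embd * (n_kv * head_dim)   # grouped K and V
    attn += n_embd * n_embd                  # output projection
    mlp = 3 * n_embd * (4 * n_embd)          # gate, up, down (SwiGLU)
    norms = 2 * n_embd                       # two RMSNorms per block
    emb = vocab * n_embd                     # tied with the LM head
    return n_layer * (attn + mlp + norms) + emb + n_embd
```

For the base config, `approx_params(1024, 16, 16)` gives roughly 280M, the same order as the ~330M in the table; the gap is explained by the assumed MLP width and vocab size.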
Inference Performance
Measured on a ~117M-parameter prototype (M4 MacBook Pro). The int4 MLX path is the production target for the 3B model; the extrapolated weight size is ~1.5GB.
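The ~1.5GB extrapolation is simple arithmetic (ignoring quantization scale/zero-point overhead and any layers kept in fp16, which add a little on top):

```python
# 3.17B weights at 4 bits each: bits -> bytes -> GB.
params = 3.17e9
int4_gb = params * 4 / 8 / 1e9
# int4_gb is about 1.59 GB, consistent with the ~1.5GB figure above
```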
Build Status
| Milestone | Status |
|---|---|
| Agentic protocol + eval harness (49 tasks) | Done |
| Custom BPE tokenizer (32K vocab) | Done |
| Data pipeline (download → tokenize → pack) | Done |
| v0 corpus on HuggingFace (1.19B tokens) | Done |
| E2E training loop, 330M on H100 via Modal | Done |
| SFT infrastructure | Done |
| Repo scanning pipeline (clone + pytest → passing repos) | In progress |
| Agentic trace generation (target: 200K traces) | Planned |
| Full 100B token corpus | Planned |
| 3B pre-training run (100B tokens) | Planned |
| RL post-training (GRPO on verifiable rewards) | Planned |
| MLX int4 3B model release | Planned |