The Idea
Every LLM agent call has two parts: reasoning (hard) and execution (easy). Today both hit a frontier model at $1–10/M tokens. That's expensive for the easy part.
Merlin is not a general-purpose model that happens to write code. It is a highly specialized agentic coding model — trained end-to-end on code, bash, tool-call traces, and commit history, with a custom tokenizer that treats tool invocations as first-class syntax. It knows one domain and executes it without hesitation.
A smarter orchestrator (Claude, GPT-4) handles planning; Merlin is the brute-force execution layer beneath it — fast, parallel, and completely under your control.
Runs entirely on your machine. No API calls, no telemetry, no cloud. Your codebase never leaves your laptop — not even for a status ping.
No tokens, no subscriptions, no rate limits. Run it a thousand times in a pre-commit hook, a file watcher, a test suite. The marginal cost is electricity.
Two ways to run it:
- One worker on your MacBook: private, offline, instant. Always on, with no latency to a remote API and no outage risk.
- 100–1000 workers via GPU batching, pay-per-task. Spawn one per file and refactor an entire repo in seconds.
Design Choices
| Choice | What | Why |
|---|---|---|
| Pre-trained from scratch | 100B tokens, corpus weighted toward code + agentic traces | No distillation from proprietary models — weights are commercially clean |
| Custom tokenizer | 32K BPE vocab + 18 agent protocol special tokens | Tool-call protocol is first-class, not bolted on |
| 6K context window | block_size = 4096 training, 6144 packed chunks | Sized for one large Python file + agent overhead; not a general-purpose model |
| RL post-training | GRPO on verifiable bash/filesystem rewards | Ground truth without a judge model; scalable to millions of tasks |
| MLX inference | int4 quantization via MLX on Apple Silicon | ~1.5GB weights, >500 tok/s on M3 MacBook — no discrete GPU needed |
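The "verifiable rewards" row is the key to judge-free RL: a bash/filesystem task can be scored by executing it and checking the result. A minimal sketch of that idea (the actual task format and reward shaping are not specified in this document):

```python
import pathlib
import subprocess
import tempfile

def reward(command: str, check) -> float:
    """Run a proposed shell command in a scratch directory, then score
    a filesystem predicate against the result. Ground truth from
    execution, no judge model involved."""
    with tempfile.TemporaryDirectory() as scratch:
        subprocess.run(command, shell=True, cwd=scratch,
                       capture_output=True, timeout=10)
        return 1.0 if check(pathlib.Path(scratch)) else 0.0
```

For example, `reward("touch done.txt", lambda p: (p / "done.txt").exists())` scores 1.0, while a command that fails to create the file scores 0.0. Because the check is executable, it scales to millions of generated tasks.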
Agent Protocol
18 special tokens define the tool-call format. The model learns to emit structured tool calls and parse results — no prompt engineering required at inference time.
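The actual 18-token vocabulary is not listed here; the sketch below uses invented token names (`<|tool_call|>`, `<|tool_end|>`) purely to illustrate what "tool calls as first-class syntax" looks like at the text level, with emit and parse as inverse operations:

```python
import re

# Hypothetical special tokens, for illustration only. The real agent
# protocol vocabulary is not published in this document.
TOOL_CALL, TOOL_END = "<|tool_call|>", "<|tool_end|>"

def format_call(tool: str, arg: str) -> str:
    """Render a tool invocation the way the model would emit it."""
    return f"{TOOL_CALL}{tool}\n{arg}{TOOL_END}"

def parse_call(text: str):
    """Extract (tool, arg) from a completion, or None if absent."""
    pattern = re.escape(TOOL_CALL) + r"(\w+)\n(.*?)" + re.escape(TOOL_END)
    m = re.search(pattern, text, re.S)
    return (m.group(1), m.group(2)) if m else None
```

Because the delimiters are single tokens in the model's vocabulary rather than strings it must spell out, the format costs almost nothing in context and cannot be confused with ordinary code.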
Corpus
~100B tokens across 7 sources. Two-phase curriculum: 80B general mix → 20B upweighted traces + instruction data.
| Source | Share | Role |
|---|---|---|
| Stack v2 — Python | ~38% | Core coding ability |
| Stack v2 — Bash/Markdown | ~8% | Shell and docs |
| Agentic traces (synthetic) | ~15% | Tool-call protocol, task execution |
| GitHub commits + issues | ~11% | Code + natural language reasoning about code |
| Stack Overflow | ~10% | Q&A, debugging patterns |
| Math + instruction mix | ~12% | Reasoning, instruction following |
| tldr pages | <1% | Bash command reference |
The v0 corpus (1.19B tokens) is already published: tsuberim/merlin-corpus-v0
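The two-phase curriculum can be sketched as phase-dependent sampling weights. The per-phase weights below are assumptions for illustration; only the 80B/20B split and the direction ("upweight traces + instruction data late") come from the text:

```python
import random

# Assumed per-phase source weights -- illustrative, not the published mix.
PHASE1 = {"code": 0.55, "traces": 0.15, "nl_about_code": 0.20, "math_instruct": 0.10}
PHASE2 = {"code": 0.30, "traces": 0.40, "nl_about_code": 0.10, "math_instruct": 0.20}

def sample_source(tokens_seen: float, rng: random.Random,
                  phase_boundary: float = 80e9) -> str:
    """Pick the source of the next packed chunk for the current phase."""
    weights = PHASE1 if tokens_seen < phase_boundary else PHASE2
    names, probs = zip(*weights.items())
    return rng.choices(names, weights=probs, k=1)[0]
```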
Architecture
GPT-style decoder-only transformer. RMSNorm, SwiGLU, GQA (n_kv_head=8), no bias, weight tying, pre-norm.
| Config | Params | n_embd | n_head | n_layer | block_size |
|---|---|---|---|---|---|
| tiny | ~1.6M | 32 | 2 | 2 | 64 |
| medium | ~21M | 256 | 8 | 8 | 512 |
| base (330M) | ~330M | 1024 | 16 | 16 | 2048 |
| 3b (target) | ~3.17B | 3072 | 24 | 20 | 4096 |
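A rough parameter count follows from the components named above (tied embeddings, GQA, SwiGLU, RMSNorm, no bias). The MLP hidden width (4 × n_embd across three SwiGLU matrices) and exact vocab size are assumptions the document does not state, so this lands in the right ballpark rather than matching the table exactly:

```python
def approx_params(n_embd: int, n_head: int, n_layer: int,
                  n_kv_head: int = 8, vocab: int = 32_000) -> int:
    """Ballpark parameter count for the decoder described above.
    MLP width (4 * n_embd) is an assumption; vocab is the '32K' BPE."""
    n_kv = min(n_kv_head, n_head)            # tiny/medium have < 8 heads
    head_dim = n_embd // n_head
    attn = n_embd * n_embd                   # Q projection
    attn += 2 * n_embd * (n_kv * head_dim)   # grouped K and V
    attn += n_embd * n_embd                  # output projection
    mlp = 3 * n_embd * (4 * n_embd)          # gate, up, down (SwiGLU)
    norms = 2 * n_embd                       # two RMSNorms per block
    emb = vocab * n_embd                     # tied with the LM head
    return n_layer * (attn + mlp + norms) + emb + n_embd
```

For the base config, `approx_params(1024, 16, 16)` gives roughly 280M, the same order as the ~330M in the table; the gap is explained by the assumed MLP width and vocab size.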
Inference Performance
Measured on a ~117M-parameter prototype (M4 MacBook Pro). The int4 MLX path is the production target for the 3B model; the extrapolated weight size is ~1.5GB.
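The ~1.5GB extrapolation is simple arithmetic (ignoring quantization scale/zero-point overhead and any layers kept in fp16, which add a little on top):

```python
# 3.17B weights at 4 bits each: bits -> bytes -> GB.
params = 3.17e9
int4_gb = params * 4 / 8 / 1e9
# int4_gb is about 1.59 GB, consistent with the ~1.5GB figure above
```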
Build Status
| Milestone | Status |
|---|---|
| Agentic protocol + eval harness (49 tasks) | Done |
| Custom BPE tokenizer (32K vocab) | Done |
| Data pipeline (download → tokenize → pack) | Done |
| v0 corpus on HuggingFace (1.19B tokens) | Done |
| E2E training loop, 330M on H100 via Modal | Done |
| SFT infrastructure | Done |
| Repo scanning pipeline (clone + pytest → passing repos) | In progress |
| Agentic trace generation (target: 200K traces) | Planned |
| Full 100B token corpus | Planned |
| 3B pre-training run (100B tokens) | Planned |
| RL post-training (GRPO on verifiable rewards) | Planned |
| MLX int4 3B model release | Planned |