Specialized Agentic Coding Model · Pre-trained from scratch · CC BY-NC 4.0

Merlin / 3B

A language model built from scratch exclusively for agentic coding — not a general assistant, not a fine-tuned GPT. Every training token, every design decision, every protocol token is optimized for one job: executing code tasks fast, locally, and at scale.

specialized agentic coding · 100% local · zero API cost · your data stays on-device · 3.17B params · int4 ~1.5GB · RL post-training
View on GitHub · HuggingFace

The Idea

Every LLM agent call has two parts: reasoning (hard) and execution (easy). Today both hit a frontier model at $1–10/M tokens. That's expensive for the easy part.

Merlin is not a general-purpose model that happens to write code. It is a highly specialized agentic coding model — trained end-to-end on code, bash, tool-call traces, and commit history, with a custom tokenizer that treats tool invocations as first-class syntax. It knows one domain and executes it without hesitation.

A smarter orchestrator (Claude, GPT-4) handles planning; Merlin is the brute-force execution layer beneath it — fast, parallel, and completely under your control.
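The split can be sketched as a two-layer loop. Both `plan` and `execute` below are hypothetical stubs: the first stands in for the frontier-model planner, the second for a local Merlin worker.

```python
def plan(goal: str) -> list[str]:
    # Stand-in for the frontier planner (Claude, GPT-4): decompose the goal
    # into small, concrete code tasks. Hardcoded here for illustration.
    return [f"{goal}: {f}" for f in ("src/a.py", "src/b.py")]

def execute(task: str) -> str:
    # Stand-in for a local Merlin worker: cheap, fast, parallelizable.
    return f"done: {task}"

def run(goal: str) -> list[str]:
    # The orchestrator reasons once; the execution layer does per-file work.
    return [execute(t) for t in plan(goal)]
```

The point of the shape is the cost split: one expensive planning call, many free execution calls.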

Your code stays yours

Runs entirely on your machine. No API calls, no telemetry, no cloud. Your codebase never leaves your laptop — not even for a status ping.

Zero cost per call

No tokens, no subscriptions, no rate limits. Run it a thousand times in a pre-commit hook, a file watcher, a test suite. The marginal cost is electricity.
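As one concrete zero-cost loop, a pre-commit hook might hand every staged Python file to the local model. A minimal sketch: `staged_files` shells out to git, `select_python` is pure, and the actual model invocation (not shown) is assumed to be a local call.

```python
import subprocess

def select_python(paths: list[str]) -> list[str]:
    # Only hand Python files to the model.
    return [p for p in paths if p.endswith(".py")]

def staged_files() -> list[str]:
    # Files staged for the current commit (added/copied/modified only).
    out = subprocess.run(
        ["git", "diff", "--cached", "--name-only", "--diff-filter=ACM"],
        capture_output=True, text=True, check=True,
    ).stdout
    return out.splitlines()
```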

Local mode

One worker on your MacBook. Private, offline, instant. Always on — no latency to a remote API, no outage risk.

Hosted mode

100–1000 workers via GPU batching. Pay-per-task. Spawn one per file and refactor an entire repo in seconds.
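The fan-out shape is an ordinary parallel map. `refactor_file` below is a stub standing in for a call to a hosted worker; the real batched GPU endpoint is not part of this sketch.

```python
from concurrent.futures import ThreadPoolExecutor

def refactor_file(path: str) -> str:
    # Stub for a hosted worker call: in hosted mode this would submit the
    # task to a batched GPU endpoint and wait for the result.
    return f"refactored {path}"

def refactor_repo(paths: list[str], max_workers: int = 100) -> list[str]:
    # One task per file, executed in parallel; map preserves input order.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(refactor_file, paths))
```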

Design Choices

| Choice | What | Why |
| --- | --- | --- |
| Pre-trained from scratch | 100B tokens, corpus weighted toward code + agentic traces | No distillation from proprietary models — weights are commercially clean |
| Custom tokenizer | 32K BPE vocab + 18 agent protocol special tokens | Tool-call protocol is first-class, not bolted on |
| 6K context window | block_size = 4096 training, 6144 packed chunks | Sized for one large Python file + agent overhead; not a general-purpose model |
| RL post-training | GRPO on verifiable bash/filesystem rewards | Ground truth without a judge model; scalable to millions of tasks |
| MLX inference | int4 quantization via MLX on Apple Silicon | ~1.5GB weights, >500 tok/s on M3 MacBook — no GPU needed |
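A verifiable filesystem reward of the kind GRPO trains against can be sketched directly. This is an illustrative harness, not the project's actual reward code: run the agent's command in a scratch directory, then verify the outcome with a second command.

```python
import subprocess
import tempfile

def bash_reward(command: str, check: str) -> float:
    # Execute the agent's command in an isolated scratch dir, then run a
    # verification command there. Reward is 1.0 iff the check exits 0 —
    # ground truth with no judge model in the loop.
    with tempfile.TemporaryDirectory() as workdir:
        subprocess.run(command, shell=True, cwd=workdir, capture_output=True)
        result = subprocess.run(check, shell=True, cwd=workdir, capture_output=True)
        return 1.0 if result.returncode == 0 else 0.0
```

Because the check is an exit code, the same harness scales to any task whose success is observable in the filesystem: files created, tests passing, greps matching.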

Agent Protocol

18 special tokens define the tool-call format. The model learns to emit structured tool calls and parse results — no prompt engineering required at inference time.

```
<|task|>Read the file at src/main.py and return the function names.
<|think|>I need to read the file first.<|/think|>
<|tool_call|><|tool_name|>read_file<|tool_args|>{"path": "src/main.py"}<|/tool_call|>
<|tool_result|>def train(): ...\ndef evaluate(): ...<|/tool_result|>
<|answer|>train, evaluate
```
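On the consuming side, the structured format makes tool calls trivially machine-parseable. A minimal regex-based extractor (illustrative, not the project's actual parser):

```python
import json
import re

TOOL_CALL = re.compile(
    r"<\|tool_call\|><\|tool_name\|>(\w+)<\|tool_args\|>(.*?)<\|/tool_call\|>",
    re.S,
)

def parse_tool_call(text: str):
    # Extract (name, args) from a model emission; None if no call present.
    m = TOOL_CALL.search(text)
    if not m:
        return None
    return m.group(1), json.loads(m.group(2))
```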

Corpus

~100B tokens across 7 sources. Two-phase curriculum: 80B general mix → 20B upweighted traces + instruction data.

| Source | Share | Role |
| --- | --- | --- |
| Stack v2 — Python | ~38% | Core coding ability |
| Stack v2 — Bash/Markdown | ~8% | Shell and docs |
| Agentic traces (synthetic) | ~15% | Tool-call protocol, task execution |
| GitHub commits + issues | ~11% | Code + natural language reasoning about code |
| Stack Overflow | ~10% | Q&A, debugging patterns |
| Math + instruction mix | ~12% | Reasoning, instruction following |
| tldr pages | <1% | Bash command reference |

v0 corpus (1.19B tokens) already published: tsuberim/merlin-corpus-v0
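The mixture can be sketched as weighted source sampling. Shares are the approximate values from the table above; the source names are illustrative labels, not the pipeline's actual identifiers.

```python
import random

# Approximate corpus shares from the table above (tldr rounded to 1%).
SHARES = {
    "stack_v2_python": 0.38,
    "stack_v2_bash_md": 0.08,
    "agentic_traces": 0.15,
    "github_commits_issues": 0.11,
    "stack_overflow": 0.10,
    "math_instruction": 0.12,
    "tldr_pages": 0.01,
}

def sample_sources(n: int, seed: int = 0) -> list[str]:
    # Draw document sources in proportion to their corpus share.
    rng = random.Random(seed)
    names, weights = zip(*SHARES.items())
    return rng.choices(names, weights=weights, k=n)
```

A two-phase curriculum would simply swap in a second weight table for the final 20B tokens, upweighting traces and instruction data.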

Architecture

GPT-style decoder-only transformer. RMSNorm, SwiGLU, GQA (n_kv_head=8), no bias, weight tying, pre-norm.
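Of the listed components, RMSNorm is small enough to show inline. A pure-Python sketch on a single vector (the real layer operates on tensors with a learned weight):

```python
import math

def rms_norm(x: list[float], weight: list[float], eps: float = 1e-6) -> list[float]:
    # RMSNorm: divide by the root-mean-square of the vector, then scale by a
    # learned per-dimension weight. No mean subtraction and no bias term,
    # matching the "no bias" design choice above.
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [w * v / rms for w, v in zip(weight, x)]
```

After normalization the output has unit root-mean-square (before the learned scale), which is what stabilizes activations across depth.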

| Config | Params | n_embd | n_head | n_layer | block_size |
| --- | --- | --- | --- | --- | --- |
| tiny | ~1.6M | 32 | 2 | 2 | 64 |
| medium | ~21M | 256 | 8 | 8 | 512 |
| base (330M) | ~330M | 1024 | 16 | 16 | 2048 |
| 3b (target) | ~3.17B | 3072 | 24 | 20 | 4096 |

Inference Performance

Measured on prototype (~117M params, M4 MacBook Pro). The int4 MLX path is the production target for the 3B model — extrapolated weight size is ~1.5GB.

| Metric | Value | Note |
| --- | --- | --- |
| Peak TPS (prototype) | 625 | int4 + KV cache, M4 |
| Memory (int4, 3B) | ~1.5 GB | fits on any M-series Mac |
| Agentic eval | 47% | 49-task harness, 3B baseline |
| Tokenizer | 32K | BPE + 20 special tokens |
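The ~1.5 GB figure follows from simple arithmetic: int4 stores 4 bits (0.5 bytes) per weight. The helper below ignores quantization scale/bias overhead, which adds a little on top.

```python
def int4_weight_gb(n_params: float) -> float:
    # 4 bits per weight = 0.5 bytes per weight, in decimal gigabytes.
    # Excludes the per-group scale/bias overhead a real int4 format carries.
    return n_params * 0.5 / 1e9
```

For 3.17B parameters this gives about 1.59 GB of raw weight data, consistent with the ~1.5 GB headline figure.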

Build Status

| Milestone | Status |
| --- | --- |
| Agentic protocol + eval harness (49 tasks) | Done |
| Custom BPE tokenizer (32K vocab) | Done |
| Data pipeline (download → tokenize → pack) | Done |
| v0 corpus on HuggingFace (1.19B tokens) | Done |
| E2E training loop, 330M on H100 via Modal | Done |
| SFT infrastructure | Done |
| Repo scanning pipeline (clone + pytest → passing repos) | In progress |
| Agentic trace generation (target: 200K traces) | Planned |
| Full 100B token corpus | Planned |
| 3B pre-training run (100B tokens) | Planned |
| RL post-training (GRPO on verifiable rewards) | Planned |
| MLX int4 3B model release | Planned |

Stack

PyTorch (training, CUDA) · MLX (inference, Metal) · Modal (cloud compute) · HuggingFace tokenizers (BPE) · vLLM (trace generation) · W&B (training observability) · HuggingFace Hub (datasets + models)