
localcode

Note

This initial scaffolding was vibe-coded for quick ideation and prototyping. The repository is part of a personal experiment with a generative-AI development flow. More intentionally thought-out implementations will land later.

A coding assistant harness designed from the ground up for small language models (roughly 3B parameters). Not a thin wrapper around a chat API, but a thick orchestration layer that does the thinking so the model doesn't have to.

Large-model tools (Claude Code, Codex, OpenCode) give a capable model tools and get out of the way. That architecture fails with small models: they can't plan, they hallucinate tool calls, and they lose coherence beyond a few hundred tokens of reasoning. localcode inverts the design. The harness owns the control flow, the search, the planning, and the verification. The model does what 3B models are actually good at: short, focused generation with heavy context.

Design Principles

The model is a function, not an agent. It receives a tightly scoped prompt and returns a code fragment. It never decides what to do next — the harness does.

Context is a budget, not a dump. Every token of context is allocated deliberately. The model sees exactly what it needs for the current micro-task and nothing else.

Verify everything. Every generation is parsed, linted, and tested before it's accepted. Failures retry automatically with structured feedback. The user only sees results that pass.

Speed over generality. A 3B model on consumer hardware generates fast. Lean into that: run multiple candidates, vote on results, retry cheaply. Trade inference cycles for reliability.

Architecture

User
 │
 ▼
┌──────────────────────────────────────┐
│            Pipeline FSM              │
│  UNDERSTAND → LOCATE → PLAN →        │
│  GENERATE → VERIFY → (retry or done) │
│                                      │
│  Drives stages. Model never picks    │
│  the next step — the FSM does.       │
└──────┬───────────────────────────────┘
       │
       ├─► Codebase Index
       │   AST via tree-sitter, embeddings
       │   for semantic search. The harness
       │   finds relevant code, not the model.
       │
       ├─► Context Manager
       │   Hard token budgets per prompt section.
       │   RAG retrieval for code snippets.
       │   Lossy conversation summaries.
       │
       ├─► Inference Backend
       │   llama-cpp-python for local inference.
       │   Grammar-constrained decoding (GBNF).
       │   Persistent KV cache across turns.
       │   Parallel candidate generation.
       │
       ├─► Verification Gate
       │   tree-sitter parse check (every edit).
       │   Run tests, lint, type check.
       │   Diff self-review (cheap second call).
       │
       └─► Knowledge Store
           Project pattern library (few-shot examples
           extracted from the codebase).
           Error → fix cache (learned over time).
           Retrieved documentation fragments.

What 3B Models Are Good At

This harness leans into the strengths and avoids the weaknesses:

| Strength | How localcode uses it |
| --- | --- |
| Fill-in-the-middle completion | Primary generation mode for edits: provide prefix + suffix, the model fills the gap |
| Pattern following with examples | Few-shot prompts built from similar code in the same project |
| Short, focused generation | Every prompt targets a single function or block, never a whole file |
| Fast inference | Run 5 candidates in parallel, pick the one that parses and passes tests |
| Classification | Yes/no verification prompts ("does this diff match the intent?") |

| Weakness | How localcode compensates |
| --- | --- |
| Can't plan multi-step tasks | Pipeline FSM hard-codes the workflow |
| Unreliable tool use | No tool use; the harness calls tools directly |
| Limited context window | Aggressive RAG, token budgeting, compression |
| Hallucinated APIs | Documentation retrieval injected into context |
| Poor self-correction | Automatic verification with structured error feedback |
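
As an illustration of the fill-in-the-middle mode, a FIM prompt for a model using Qwen2.5-Coder's sentinel tokens could be assembled like this (other models use different sentinels; the example snippets are placeholders):

```python
def build_fim_prompt(prefix: str, suffix: str) -> str:
    """Assemble a fill-in-the-middle prompt using Qwen2.5-Coder-style
    FIM sentinel tokens; the model generates the missing middle."""
    return (
        f"<|fim_prefix|>{prefix}"
        f"<|fim_suffix|>{suffix}"
        f"<|fim_middle|>"
    )

# the harness supplies the code before and after the gap to fill
prompt = build_fim_prompt(
    prefix="def parse_config(raw: str) -> Config:\n",
    suffix="    data = json.loads(raw)\n",
)
```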

Components

Pipeline FSM (localcode/pipeline/)

The finite state machine that drives task execution. Each state runs a focused prompt template against the model and transitions based on the structured result.

States:

  • UNDERSTAND — Classify user intent (edit, explain, fix, generate). Uses the model as a classifier with constrained output.
  • LOCATE — Find relevant code. Primarily harness-driven (AST search, grep, embeddings). Model may narrow results via a ranking prompt.
  • PLAN — For non-trivial changes, decompose into single-function edits. The harness proposes a plan based on dependency analysis; the model confirms or adjusts via constrained choice.
  • GENERATE — Produce the code change. Fill-in-the-middle for edits, short generation for new code. Multiple candidates generated in parallel.
  • VERIFY — Parse, lint, test, type-check. Failures loop back to GENERATE with the error message injected. Hard cap on retries (default: 3).
  • COMPLETE — Present the diff to the user.
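
The state progression above can be sketched as a minimal FSM. This is illustrative only; the handler interface and result fields are assumptions, not the actual `localcode/pipeline/` API:

```python
from enum import Enum, auto

class State(Enum):
    UNDERSTAND = auto()
    LOCATE = auto()
    PLAN = auto()
    GENERATE = auto()
    VERIFY = auto()
    COMPLETE = auto()

def run_pipeline(task: dict, handlers: dict, max_retries: int = 3) -> dict:
    """Drive the fixed UNDERSTAND -> LOCATE -> PLAN -> GENERATE -> VERIFY
    workflow. `handlers` maps each state to a callable; the harness,
    not the model, decides every transition."""
    state, retries = State.UNDERSTAND, 0
    linear = {State.UNDERSTAND: State.LOCATE,
              State.LOCATE: State.PLAN,
              State.PLAN: State.GENERATE,
              State.GENERATE: State.VERIFY}
    while state is not State.COMPLETE:
        result = handlers[state](task)
        if state is State.VERIFY:
            if result["passed"]:
                state = State.COMPLETE
            elif retries < max_retries:
                retries += 1
                task["feedback"] = result["errors"]  # structured retry feedback
                state = State.GENERATE
            else:
                raise RuntimeError("retry budget exhausted")
        else:
            state = linear[state]  # fixed progression for non-branching states
    return task
```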

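For the constrained classifier in UNDERSTAND, a GBNF grammar (the grammar format llama.cpp uses for constrained decoding) might look like the following sketch; the label set mirrors the intents listed above:

```gbnf
# intent classification: the model can only emit one of these labels
root   ::= intent
intent ::= "edit" | "explain" | "fix" | "generate"
```
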
Codebase Index (localcode/index/)

Maintains a persistent, incremental index of the project:

  • AST index — tree-sitter parse trees for every file. Enables symbol lookup, scope analysis, and surgical context extraction (pull just the function body + its imports).
  • Embedding index — Sentence-level embeddings of code and comments for semantic search. Uses a small local embedding model (e.g., nomic-embed-text or similar).
  • Dependency graph — Import/call relationships so the harness knows what else to include when editing a function.

Context Manager (localcode/context/)

Builds prompts within a strict token budget:

  • Budget allocation — Configurable per-section limits. Default: 256 tokens system prompt, 1536 tokens retrieved code, 512 tokens task description, remainder for generation.
  • RAG retrieval — Pulls relevant code snippets from the index ranked by relevance to the current task.
  • Conversation compression — Maintains a rolling structured summary of the session, not raw history. Updated after each turn by the harness (not the model).
  • Result digestion — Tool outputs (test results, lint errors) are parsed into structured summaries before entering context.
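
As an illustration, budget enforcement can be as simple as truncating each section to its allowance before assembly. This is a sketch: it approximates tokens as whitespace-separated words, whereas a real harness would count with the model's tokenizer:

```python
# default per-section budgets from the configuration above
DEFAULT_BUDGETS = {"system": 256, "code": 1536, "task": 512}

def fit_to_budget(text: str, max_tokens: int) -> str:
    """Crude budgeting: treat whitespace-separated words as tokens
    and truncate. A real implementation would use the model tokenizer."""
    return " ".join(text.split()[:max_tokens])

def build_prompt(sections: dict[str, str],
                 budgets: dict[str, int] = DEFAULT_BUDGETS) -> str:
    """Assemble the prompt section by section, each within its budget."""
    parts = [fit_to_budget(sections.get(name, ""), limit)
             for name, limit in budgets.items()]
    return "\n\n".join(p for p in parts if p)
```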

Inference Backend (localcode/inference/)

Interface to the local model:

  • llama-cpp-python as the primary backend. Direct control over sampling, grammar constraints, and KV cache.
  • Grammar-constrained decoding — GBNF grammars for every structured output (intent classification, edit locations, yes/no verification). The model physically cannot produce malformed output.
  • Persistent KV cache — Project context is baked into the cache once and reused across turns. Only the task-specific portion changes.
  • Parallel generation — Generate N candidates for code edits. Configurable (default: 5 for edits, 1 for classification).
  • Best-of-N selection — Candidates are ranked by: parses > passes lint > passes tests > shortest diff. First candidate to clear all gates wins.

Verification Gate (localcode/verify/)

Runs after every generation:

  1. Syntax check — tree-sitter parse of the modified file. Instant, catches most garbage output.
  2. Lint — Project linter (ruff, eslint, etc.) on the modified file.
  3. Type check — If the project uses types (mypy, pyright, tsc), run incremental check.
  4. Test — Run the relevant test subset (detected via dependency graph or filename convention).
  5. Diff review — A second, cheap model call: "Does this change correctly address: {task}? YES/NO." Catches semantic errors that pass syntax checks.

Failures produce structured feedback injected into the retry prompt:

SYNTAX ERROR at line 12: unexpected indent
LINT: F841 local variable 'result' is assigned but never used
TEST FAIL: test_parse_input — AssertionError: expected 3, got None
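
Rendering verifier failures into lines like those above could look like this sketch (the failure dict fields are illustrative, not the actual `localcode/verify/` schema):

```python
def format_feedback(failures: list[dict]) -> str:
    """Render structured verification failures as compact lines for
    injection into the retry prompt."""
    lines = []
    for f in failures:
        if f["kind"] == "syntax":
            lines.append(f"SYNTAX ERROR at line {f['line']}: {f['message']}")
        elif f["kind"] == "lint":
            lines.append(f"LINT: {f['code']} {f['message']}")
        elif f["kind"] == "test":
            lines.append(f"TEST FAIL: {f['name']} — {f['message']}")
    return "\n".join(lines)
```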

Knowledge Store (localcode/knowledge/)

Augments the model's limited training knowledge:

  • Pattern library — Automatically extracted examples of common patterns in the project (API handlers, test structure, data models). Injected as few-shot examples during GENERATE.
  • Error cache — Maps error signatures to successful fixes. Populated over time from verification loops. Common errors get resolved without inference.
  • Doc fragments — Retrieved documentation for libraries detected in the project. Stored locally, refreshed periodically.

Usage

# start a session in your project directory
cd /path/to/your/project
localcode

# or point at a specific directory
localcode --project /path/to/your/project

# first run indexes the codebase (incremental after that)

Within a session:

you: add input validation to the parse_config function
 localcode: [LOCATE] Found parse_config at src/config.py:34-58
 localcode: [GENERATE] Producing 5 candidates...
 localcode: [VERIFY] Candidate 3 passes (parse ✓, lint ✓, tests ✓)
 localcode: Here's the diff:

   src/config.py
   @@ -34,6 +34,10 @@
    def parse_config(raw: str) -> Config:
   +    if not isinstance(raw, str):
   +        raise TypeError(f"expected str, got {type(raw).__name__}")
   +    if not raw.strip():
   +        raise ValueError("config string is empty")
        data = json.loads(raw)

 Apply? [y/n]

Configuration

# localcode.toml (in project root)

[model]
path = "models/qwen2.5-coder-3b-instruct.Q4_K_M.gguf"  # path to GGUF model
context_length = 8192     # model's context window
n_gpu_layers = -1         # offload all layers to GPU (-1 = all)

[generation]
candidates = 5            # parallel candidates for code edits
max_retries = 3           # verification retry limit
temperature = 0.6         # sampling temperature for code generation
temperature_classify = 0.1  # low temp for classification/verification

[context]
system_budget = 256       # tokens reserved for system prompt
code_budget = 1536        # tokens for retrieved code context
task_budget = 512         # tokens for task description
few_shot_examples = 3     # number of project patterns to include

[verification]
parse_check = true        # tree-sitter syntax validation
lint = true               # run project linter
type_check = false        # run type checker (slower, off by default)
test = true               # run relevant tests
diff_review = true        # second model call to verify semantics

[index]
embedding_model = "nomic-embed-text"  # local embedding model for RAG
update_on_save = true     # re-index changed files automatically

Supported Models

localcode is designed for and tested against 3B-class instruction-tuned models. Recommended:

  • Qwen2.5-Coder-3B-Instruct — Strong code generation, good FIM support
  • Stable Code 3B — Solid completion and infill
  • Phi-3.5-mini-instruct (3.8B) — Good instruction following for its size

Models are loaded via llama-cpp-python from GGUF files. Q4_K_M quantization is recommended for its balance of quality and speed.

Requirements

  • Python 3.11+
  • A GGUF model file (see supported models above)
  • For GPU acceleration: CUDA toolkit or Metal (macOS)

pip install localcode

Development

git clone https://github.com/coltco/localcode.git
cd localcode
python -m venv .venv
source .venv/bin/activate
pip install -e ".[dev]"

# run tests
pytest

# run linter
ruff check .

Project Structure

localcode/
├── __init__.py
├── __main__.py          # CLI entry point
├── config.py            # Configuration loading (localcode.toml)
├── session.py           # Interactive session management
├── pipeline/
│   ├── __init__.py
│   ├── fsm.py           # Pipeline state machine
│   ├── states.py        # State definitions and transitions
│   └── prompts.py       # Prompt templates per state
├── index/
│   ├── __init__.py
│   ├── ast_index.py     # tree-sitter AST indexing
│   ├── embeddings.py    # Embedding index for semantic search
│   └── deps.py          # Dependency graph construction
├── context/
│   ├── __init__.py
│   ├── budget.py        # Token budget allocation
│   ├── rag.py           # Retrieval-augmented context building
│   └── compress.py      # Conversation/result compression
├── inference/
│   ├── __init__.py
│   ├── backend.py       # llama-cpp-python interface
│   ├── grammar.py       # GBNF grammar definitions
│   └── candidates.py    # Parallel generation + best-of-N
├── verify/
│   ├── __init__.py
│   ├── gate.py          # Verification pipeline orchestration
│   ├── parse.py         # tree-sitter syntax checking
│   ├── lint.py          # Linter integration
│   └── test_runner.py   # Test subset detection and execution
├── knowledge/
│   ├── __init__.py
│   ├── patterns.py      # Project pattern extraction
│   ├── errors.py        # Error → fix cache
│   └── docs.py          # Documentation retrieval
└── tui/
    ├── __init__.py
    └── app.py           # Terminal UI (prompt_toolkit or textual)

License

MIT