Teaching a 7B Model to Follow Orders: A Local AI Agent Experiment

As part of Project OpenCoder under the Comfac Matrix Model initiative, I’ve been exploring whether a small, locally-hosted language model can reliably drive an agentic coding workflow — specifically using Qwen 2.5 Coder 7B (Q4_K_M) as the brain behind OpenCode, an open-source AI coding agent.

What started as a simple test — “make a directory called 260322-01_coder_test” — turned into a deep dive into model behavior, system prompt engineering, the hard limits of quantized models for tool dispatch, and ultimately a clearer picture of what we’re actually trying to build.


Why This Matters: The Business Case First

Before getting into the technical findings, it’s worth anchoring why this experiment exists at all.

Comfac Global Group is carrying four active sales pipelines that depend on demonstrating real, working AI-powered infrastructure:

  • 400 Nextcloud leads — prospects evaluating self-hosted document and collaboration platforms
  • 200 Frappe / Philippine Paperless Accounting leads — businesses exploring ERPNext as a BIR-compliant, locally-hosted alternative to cloud ERP
  • Secada Pipeline and Roadshow — the flagship CRM and sales pipeline product
  • Steward Engineering Pipeline — internal engineering project management tooling

Purchasing commercial AI seats (Claude API, GPT-4, Copilot licenses) for each of these deployments at scale is not viable. The only path that makes business sense is building proprietary, specialized AI models that run locally — models trained to understand Frappe DocTypes, Nextcloud administration, Linux configuration, and our internal product schemas.

Project OpenCoder is the initiative that makes that possible. This experiment is an early proof-of-concept for the core engine.


The Vision: Natural Language System Control

The long-term goal is more ambitious than just a coding assistant. Imagine this interaction:

User: “I need to set up automated backups for this server.”

AI: “Sure — what do you want to back up? Local files, databases, or both?”

User: “Local files, to a NAS.”

AI: “I can configure this a few ways — Syncthing for continuous sync, GNOME Disks for scheduled image backups, or rsync via cron for lightweight file sync. Which fits your needs?”

User: “Syncthing.”

AI: [installs Syncthing, configures the sync folder, adds the NAS as a remote device, enables the systemd service]

No terminal. No documentation lookup. No Stack Overflow. A simple conversation that ends with a configured system.

This isn’t science fiction: it’s what a correctly trained 4-bit quantized model with a grounded tool registry can deliver. The same architecture applies to:

  • Linux — natural language to bash commands, configuration files, systemd services
  • Frappe / ERPNext — describe a business process, get DocType customizations and workflows
  • Nextcloud — “add a user with these permissions” executed via Nextcloud’s OCC CLI
  • Netgate pfSense — firewall rule creation via natural language
  • TrueNAS — dataset creation, share configuration, snapshot schedules via the API
  • Secada — pipeline stage management, lead assignment, follow-up scheduling in plain English

Each of these requires a specialized LoRA package trained on that system’s command vocabulary. That’s exactly what the LoRA Package Registry in Project OpenCoder is designed to produce and distribute.


The Hardware Constraint: Why 4-Bit Quantization Is Non-Negotiable

Before discussing what the model did wrong, it’s critical to understand the hardware reality that governs all of these decisions.

The target deployment hardware is an 8GB VRAM GPU — the AMD ROCm infrastructure underpinning Project OpenCoder’s local inference stack. This is a real constraint that shapes every model choice.

VRAM Requirements by Model Size and Quantization

Model               | Quantization | VRAM Required | Fits in 8GB?
--------------------|--------------|---------------|--------------
Qwen 2.5 Coder 7B   | FP16 (full)  | ~14 GB        | No
Qwen 2.5 Coder 7B   | Q8           | ~8 GB         | Borderline
Qwen 2.5 Coder 7B   | Q4_K_M       | ~4.5 GB       | Yes ✓
Qwen 2.5 Coder 14B  | FP16 (full)  | ~28 GB        | No
Qwen 2.5 Coder 14B  | Q4_K_M       | ~8 GB         | Yes (ceiling) ✓
Qwen 2.5 Coder 32B  | Q4_K_M       | ~20 GB        | No

The practical rule for 8GB VRAM: Q4_K_M quantization is the standard, and 14B Q4_K_M is the ceiling. Anything larger requires either a larger GPU or CPU offloading (which kills inference speed).

What Quantization Actually Does to a Model

Quantization reduces the numerical precision of the model’s weights. Full precision (FP16) uses 16 bits per weight. Q4_K_M uses approximately 4 bits per weight — a 4x compression. The model fits in memory, but it pays a cost:

  • Instruction following degrades. The model becomes less precise about following exact output format requirements.
  • Schema adherence weakens. Strict JSON structure, exact key names, and consistent formatting become harder to maintain reliably.
  • Reasoning depth decreases. Multi-step logical chains are more likely to drift or collapse.
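The memory math behind the table above can be sketched in a few lines. This is a rough lower bound for weights alone (the ~10% overhead factor is an assumption; real usage also needs KV-cache memory that grows with context length):

```python
def approx_vram_gb(params_billion: float, bits_per_weight: float,
                   overhead: float = 1.1) -> float:
    """Rough VRAM estimate for model weights alone.

    bits_per_weight: 16 for FP16, ~4.5 effective for Q4_K_M
    (Q4_K_M is mixed-precision, so its effective rate is a bit above 4).
    overhead: assumed ~10% runtime overhead; KV cache is extra.
    """
    bytes_per_weight = bits_per_weight / 8
    return params_billion * bytes_per_weight * overhead

# 7B at FP16: ~15.4 GB -> does not fit in 8 GB VRAM
print(round(approx_vram_gb(7, 16), 1))
# 7B at Q4_K_M: ~4.3 GB -> fits with room for KV cache
print(round(approx_vram_gb(7, 4.5), 1))
```

The same arithmetic puts 14B Q4_K_M right at the 8 GB ceiling, which is why it is the practical maximum for this hardware.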

This is not a dealbreaker — it’s a constraint to engineer around. The Modelfile system prompt, few-shot examples, low temperature settings, and eventually LoRA fine-tuning all exist specifically to compensate for what quantization takes away.

The 14B Q4_K_M model is meaningfully better than 7B Q4_K_M for tool dispatch — the larger parameter count gives it more “room” to maintain format adherence even under quantization pressure. If reliable schema compliance is the goal, 14B Q4_K_M is the recommended baseline for production use on 8GB VRAM hardware.


The Experiment

The setup: run Qwen 2.5 Coder 7B Q4_K_M locally via Ollama, connect it to OpenCode, and ask it to perform a basic filesystem operation. The first task was as simple as it gets — create a directory.

The model understood the intent perfectly. The problem was everything else.


What Went Wrong: The Tool Dispatch Problem

OpenCode (like most agentic frameworks) dispatches actions by parsing structured output from the model. It expects a specific JSON schema. When the model outputs valid JSON in the right shape, the framework executes the action. When it doesn’t — nothing happens.
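A minimal sketch of that dispatch contract makes the failure modes concrete. This is a simplified stand-in for OpenCode's real parser, not its actual code; the tool names are the ones used later in this post's Modelfile:

```python
import json

# Grounded registry: the only names the framework will execute.
TOOL_REGISTRY = {"make_directory", "write_file", "read_file",
                 "list_directory", "run_command", "delete_file"}

def dispatch(raw_output: str):
    """Parse one model turn. Returns (tool, args) or None if unusable."""
    try:
        call = json.loads(raw_output)
    except json.JSONDecodeError:
        return None  # XML wrappers, prose, anything non-JSON dies here
    if not isinstance(call, dict) or "name" not in call:
        return None  # wrong shape, e.g. {"tool": ...} instead of {"name": ...}
    if call["name"] not in TOOL_REGISTRY:
        return None  # hallucinated tool name: valid JSON, still unexecutable
    return call["name"], call.get("arguments", {})

# An XML-wrapped response never survives the first check:
print(dispatch('<task><function_name>mkdir</function_name></task>'))
# A schema-correct call parses and passes registry validation:
print(dispatch('{"name": "make_directory", '
               '"arguments": {"path": "260322-01_coder_test"}}'))
```

Every failure in this experiment falls out at one of those three checkpoints: not JSON, wrong shape, or a name outside the registry.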

Here’s what the raw Bartowski build of Qwen 7B actually output when asked to create a directory:

<response>
  {"name": "skill", "arguments": {"name": "devops"}}
</response>

And then on a second attempt:

<task>
  <function_name>mkdir</function_name>
  <arguments>{ "directoryName": "260322-01_coder_test" }</arguments>
</task>

The model understood it needed to call a function. It even understood the concept of arguments. But it wrapped everything in XML tags and made up its own schema — neither of which OpenCode could parse.


Root Cause Analysis

1. No grounded tool registry. The model has never seen OpenCode’s actual tool names during training. It knows the concept of tool use but has to guess what the tools are called. A 70B model trained extensively on RLHF data will follow a schema more reliably. A 7B Q4 model guesses.

2. XML bleed from training data. Qwen models saw large volumes of XML-structured data during pretraining. Without explicit constraints, they fall back on XML-style output when generating structured responses.

3. Quantization degrades format precision. Q4_K_M compression reduces the model’s ability to maintain strict output schemas. This is the direct consequence of fitting a 14GB model into 4.5GB — something has to give, and exact format adherence is one of the first casualties.


Attempt Log: Five Tries at One Directory

Attempt | Model           | Output                                                  | Executed?
--------|-----------------|---------------------------------------------------------|-------------
1       | Bartowski raw   | XML tags wrapping JSON                                  | No
2       | qwen-agent-7b v1| {"name": "make_directory", ...}                         | Yes ✓
3       | qwen-agent-7b v1| {"name": "readFile", ...}                               | Wrong tool
4       | qwen-agent-7b v1| {"name": "skill", "arguments": {"name": "filesystem"}}  | Hallucinated
5       | qwen-agent-7b v1| {"name": "task", "arguments": {"todos": {...}}}         | Hallucinated

The first Modelfile intervention got the model outputting JSON and actually executing — a genuine step forward. But tool name hallucination became the new problem. Each run, the model invented a different tool name from its training data.


The Fix: Ollama Modelfiles

Ollama supports a Modelfile — a plain text configuration file that wraps a base model with a custom system prompt, few-shot examples, and sampling parameters. Think of it as a guardrail layer baked into the model at creation time, at zero storage cost — Ollama shares the underlying GGUF blob and only stores a new manifest.

nano ~/Modelfile
ollama create qwen-agent-7b -f ~/Modelfile

The Modelfile That Produced Results

FROM qwen2.5-coder:7b-instruct-q4_K_M

SYSTEM """
You are a precise coding agent running inside OpenCode. Your job is to
help the user write, edit, and manage code and files by calling tools.

## CRITICAL OUTPUT RULES

CORRECT:
{"name": "make_directory", "arguments": {"path": "my_project"}}

WRONG - never do these:
<task><function_name>mkdir</function_name></task>
{"tool": "mkdir", "arguments": {...}}

Never use "tool" as the key - always use "name".

## AVAILABLE TOOLS — use ONLY these, never invent others
make_directory, write_file, read_file, list_directory, run_command, delete_file

## EXAMPLES

User: create a folder called src
{"name": "make_directory", "arguments": {"path": "src"}}

User: write a python script called hello.py that prints hello world
{"name": "write_file", "arguments": {"path": "hello.py", "content": "print('hello world')\n"}}

User: list the files in the current directory
{"name": "list_directory", "arguments": {"path": "."}}

User: run hello.py
{"name": "run_command", "arguments": {"command": "python hello.py"}}
"""

PARAMETER temperature 0.1
PARAMETER top_p 0.9
PARAMETER repeat_penalty 1.1

The three most impactful elements: temperature 0.1 (kills creative drift), an explicit hardcoded tool list (eliminates hallucinated names), and WRONG examples (shows the model exactly what to avoid, not just what to do).
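One more lever worth pulling at inference time: Ollama's /api/generate endpoint accepts a format: "json" flag that constrains decoding to valid JSON, which directly attacks the XML-bleed failure mode. A minimal request payload, assuming a local Ollama server on its default port (the payload is only built here, not sent):

```python
import json

def build_generate_request(prompt: str) -> dict:
    """Request body for POST http://localhost:11434/api/generate.

    The options dict mirrors what the Modelfile already bakes in;
    passing them per-request overrides the baked-in values.
    """
    return {
        "model": "qwen-agent-7b",
        "prompt": prompt,
        "stream": False,
        "format": "json",  # constrain decoding to syntactically valid JSON
        "options": {
            "temperature": 0.1,
            "top_p": 0.9,
            "repeat_penalty": 1.1,
        },
    }

payload = build_generate_request("create a folder called src")
print(json.dumps(payload, indent=2))
```

Note that format: "json" guarantees syntax, not schema: the model can still emit valid JSON with a hallucinated tool name, so registry validation on the framework side remains necessary.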


The OpenCode Schema Problem

OpenCode was designed and tested against frontier API models (Claude, GPT-4, Gemini) that have internalized tool schemas through extensive RLHF training. A locally-hosted 7B Q4 model has never seen OpenCode’s registry — it has to guess, and it guesses differently every time.

The fix is to extract OpenCode’s actual canonical tool names and hardcode them in the Modelfile. Without a grounded registry, no amount of system prompt engineering fully prevents hallucination in a quantized small model.

For platforms that work better with small local models out of the box, Aider and Continue.dev both control tool dispatch at the framework level — the model only needs to write code, not guess schema. OpenCode shines when backed by a frontier model that already knows the tool vocabulary.


The Long-Term Solution: LoRA Fine-Tuning

Modelfiles are a patch. The permanent solution is a targeted LoRA fine-tune — a small set of additional weights trained on correct tool-call pairs that burn schema adherence in as a reflex.

Every hallucinated output from this experiment is a negative training example. Every correct execution is a positive one. This experiment generated the seed data for the first entry in the LoRA Package Registry: tool-dispatch-v1.
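Converting those logs into training data is mechanical. A per-example record might look like this (a chat-style JSONL sketch; the exact layout depends on the fine-tuning toolkit you feed it to):

```python
import json

def to_training_example(user_msg: str, correct_call: dict) -> str:
    """One JSONL record pairing a user request with the exact tool
    call the model should emit. The messages layout below is a common
    convention for chat fine-tuning, not a fixed requirement."""
    record = {
        "messages": [
            {"role": "user", "content": user_msg},
            {"role": "assistant", "content": json.dumps(correct_call)},
        ]
    }
    return json.dumps(record)

# The directory task from this experiment as a positive example:
line = to_training_example(
    "make a directory called 260322-01_coder_test",
    {"name": "make_directory",
     "arguments": {"path": "260322-01_coder_test"}},
)
print(line)
```

Hallucinated outputs from the attempt log become negative or corrective examples in the same format, which is how a few failed runs turn into seed data for tool-dispatch-v1.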

The same pattern scales to every system we want to control with natural language:

LoRA Package        | Target System     | What It Enables
--------------------|-------------------|------------------------------------------------------------------
frappe-erp-v1       | Frappe / ERPNext  | DocType creation, workflow config, report generation via conversation
nextcloud-admin-v1  | Nextcloud         | User management, share permissions, app config via OCC CLI
linux-instruct-v1   | Linux / bash      | Service config, backup setup, package management via natural language
pfsense-v1          | Netgate pfSense   | Firewall rules, VLANs, NAT config via conversation
truenas-v1          | TrueNAS           | Dataset, share, and snapshot management via API calls
secada-v1           | Secada CRM        | Pipeline management, lead assignment, follow-ups in plain English

Each package is a small, portable weight file layered on top of the base Qwen 2.5 Coder 14B Q4_K_M model. One GPU. One model. Swappable expertise modules loaded on demand.


Summary of Findings

  • 8GB VRAM = Q4_K_M quantization. This is non-negotiable. 7B Q4_K_M fits comfortably; 14B Q4_K_M is the practical ceiling and meaningfully better for tool dispatch reliability.
  • Quantization degrades format adherence. The 4x weight compression that makes local inference possible also makes strict schema compliance harder — this must be compensated for through Modelfiles, temperature settings, and ultimately LoRA fine-tuning.
  • Modelfiles work but don’t fully solve hallucination. Low temperature + hardcoded tool list + WRONG examples gets you most of the way there. Grounding the tool registry eliminates the remaining hallucination.
  • OpenCode expects frontier models. For reliable 7B local inference, Aider or Continue.dev are better platform fits today.
  • LoRA packages are the endgame. Each target system (Frappe, Linux, Nextcloud, pfSense, TrueNAS, Secada) becomes a swappable expertise module on a single base model. This is the architecture that makes natural language system control viable at scale without commercial API costs.

Project OpenCoder is Comfac Global Group’s initiative to build proprietary specialized AI models for enterprise infrastructure — replacing commercial AI seat costs with locally-hosted, fine-tuned models trained on our exact tooling vocabulary. Follow progress at wiki.gi7b.org/index.php/Project_OpenCoder.
