Agentic coding: A token efficiency guide for software engineers.

For years, the major LLM providers have absorbed enormous computing costs as they raced to grow their user bases, making LLMs feel like an almost infinite, cost-free code dispenser. But early signs are appearing that this free ride might be ending. Anthropic has experimented with removing Claude Code from its Pro plan, and GitHub Copilot is moving to a more expensive AI credit-based usage model.

This guide covers what to do about it: the tools, techniques and mental model that keep you productive without haemorrhaging credits.

How developer tool pricing is changing

The Shift to Usage-Based Billing

On 1 June 2026, GitHub Copilot moved to usage-based billing. Agentic workflows, chat and code reviews now consume GitHub AI Credits on top of your monthly plan. Background inline completions remain free, but usage patterns play a much bigger role in determining whether you stay within your included allowance.

Anthropic ran a similar test in April 2026, temporarily removing Claude Code from the Pro plan for a small percentage of new users. It wasn't permanent, but it did suggest that providers are continuing to explore different ways of pricing higher-intensity agentic workloads.

The Cost of Unconstrained Agent Loops

Unattended agent loops are one of the quickest ways to burn through credits. Because LLMs are stateless, conversation history is repeatedly sent back to the model as context. As conversations become longer and more complex, token consumption can grow surprisingly quickly.
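To see why the growth is faster than it first appears, here is a minimal sketch of the arithmetic. It assumes every turn adds the same number of tokens to the history, that the full history is re-sent as input on each turn, and that no caching or compaction is in play. The function name and the figures are illustrative, not measurements from any tool.

```python
def cumulative_input_tokens(turns: int, tokens_per_turn: int, system_tokens: int = 0) -> int:
    """Total input tokens sent over a session when the whole history is re-sent each turn."""
    total = 0
    history = system_tokens  # static prompt and tool definitions (assumed size)
    for _ in range(turns):
        history += tokens_per_turn  # new message and tool output join the history
        total += history            # the entire history is sent as input this turn
    return total


# Illustrative figures only: 2,000 new tokens per turn on top of a 5,000-token system prompt.
for turns in (10, 20, 40):
    print(turns, "turns:", cumulative_input_tokens(turns, 2_000, 5_000), "input tokens")
```

Under these assumptions, doubling the number of turns roughly quadruples the input tokens once the session is long enough, which is why unattended loops get expensive so quickly.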

Glean's evaluation data shows the scale of the problem: off-the-shelf MCP tools burned around 83,000 tokens to complete a task that an optimised context layer handled in 43,000, nearly double the tokens spent reaching the same outcome.

One practical way to reduce this is to keep a human in the loop, review progress at each stage, and avoid running agents without checkpoints.

Optimising your workflow

Clean architecture reduces token costs

A messy, tightly coupled codebase often forces an LLM to consider dozens of helper files and duplicate modules just to make a single change. Following DRY (Don't Repeat Yourself) and SOLID principles helps consolidate duplicate logic, break down oversized classes, and keep modules small and loosely coupled. The agent can then make precise, isolated edits, rather than spiralling through expensive reasoning loops.

When Not to Use AI: Let Existing Tools Do the Job

AI is overkill for routine, deterministic tasks that already have dedicated solutions. Pushing a commit, generating types from a schema, scaffolding a component, or running a linter are solved problems. Using an LLM for them wastes tokens and adds unnecessary latency.

Before you prompt an agent, ask yourself: Is there a built-in command, a framework CLI, or an editor shortcut that already does exactly this? If so, use that instead. Keep AI for the messy, ambiguous, exploratory work where its reasoning actually adds value.

Plan Mode: Think First, Execute Second

Plan Mode lets the agent read your codebase and map out a strategy before editing or running commands. It helps reduce expensive trial-and-error loops by reasoning upfront before any changes are made.

Use it for multi-file changes, ambiguous requirements, refactors, or new features. Skip it for one-line fixes or isolated edits.

As a rule of thumb, if a wrong implementation would be costly to undo, start in Plan Mode.

One effective pattern is to use a powerful (and more expensive) model for planning, then switch to a cheaper model for implementation. This approach, known as opusplan, was highlighted in a 2026 guide on efficient Claude Code usage: "How to Use the Secret Opusplan Model in Claude Code – Save Money & Quota Without Sacrificing Quality."

This separation of planning from execution can significantly reduce overall token consumption without sacrificing quality.

Spec-driven development

Individual features

Before building a feature, provide the agent with a technical specification in the prompt or use an MCP server, such as the Atlassian MCP Server for Jira or Linear's MCP server, to pull requirements directly from your project management tools.

Giving the full context upfront helps lock in intent before any code is written, reducing guesswork, failed implementation attempts, and costly rework loops that consume tokens.

Entire applications

For new projects, tools such as GitHub Spec Kit and OpenSpec provide structured, spec-driven workflows. You define the architecture, constraints and success criteria upfront, and the agent works from that shared source of truth throughout development.

This reduces trial-and-error execution, saves tokens and avoids repeated clarification cycles during delivery.

Practical context window techniques

Compact early, not late

Community best practice recommends manual compaction at around 50% context usage, with auto-compaction acting as a safety net at 80%. Compacting earlier typically produces better summaries because the model has more room to generate them. After compaction, re-state any critical context to ensure it remains available.

Once context usage passes roughly 60–70% of the available window, model performance can begin to degrade. Earlier instructions may be forgotten, reasoning can become less consistent, and contradictions become more common. Chroma's 2025 research tested 18 frontier models, including GPT-4.1, Claude 4 and Gemini 2.5, and found performance declined as input length increased, often significantly. GPT-4.1, for example, maintained near-perfect accuracy at shorter context lengths but dropped to around 60% accuracy at 128,000 tokens.

Compacting at 50% is a useful defensive habit that helps keep context fresh, reasoning focused and token usage under control.
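The thresholds above can be sketched as a small decision helper. This is an illustration only: the 50% and 80% defaults come from the community guidance quoted in this section, while the function name, the `Action` enum and the idea of polling token counts yourself are assumptions rather than features of any particular tool.

```python
from enum import Enum


class Action(Enum):
    CONTINUE = "continue working"
    COMPACT_NOW = "compact manually"
    AUTO_COMPACT = "auto-compaction safety net"


def compaction_action(used_tokens: int, window_tokens: int,
                      manual_at: float = 0.50, auto_at: float = 0.80) -> Action:
    """Map current context usage to the recommended action."""
    if window_tokens <= 0:
        raise ValueError("window_tokens must be positive")
    usage = used_tokens / window_tokens
    if usage >= auto_at:
        return Action.AUTO_COMPACT
    if usage >= manual_at:
        return Action.COMPACT_NOW
    return Action.CONTINUE


# Hypothetical 200,000-token window.
print(compaction_action(90_000, 200_000).value)
print(compaction_action(110_000, 200_000).value)
```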

Enable prompt caching

Prompt caching is a simple configuration change that can significantly reduce costs. When enabled, providers store and reuse system prompts, tool definitions and other static parts of a conversation. Subsequent requests can reuse those cached elements at a discounted rate.

Many tools enable prompt caching by default, but it's worth checking your configuration to ensure you're taking advantage of it.
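A rough cost model shows why caching matters. Everything numeric here is an assumption for illustration: the price per million tokens, the cached-read multiplier and the token counts are hypothetical, and the sketch ignores cache-write surcharges, cache expiry and conversation growth. Check your provider's pricing page for real figures.

```python
def session_input_cost(turns: int, static_tokens: int, dynamic_tokens_per_turn: int,
                       price_per_mtok: float, cached_read_multiplier: float = 1.0) -> float:
    """Input cost of a session; a multiplier of 1.0 means caching is off."""
    if turns < 1:
        return 0.0
    # First turn: nothing is cached yet, so the static prefix is billed in full.
    first_turn = static_tokens + dynamic_tokens_per_turn
    # Later turns: the static prefix is billed at the discounted cached rate.
    later_turn = static_tokens * cached_read_multiplier + dynamic_tokens_per_turn
    billed_tokens = first_turn + (turns - 1) * later_turn
    return billed_tokens * price_per_mtok / 1_000_000


# Hypothetical session: 20 turns, 10,000 static tokens, 1,000 new tokens per turn.
without_cache = session_input_cost(20, 10_000, 1_000, price_per_mtok=3.0)
with_cache = session_input_cost(20, 10_000, 1_000, price_per_mtok=3.0,
                                cached_read_multiplier=0.1)
print(f"without caching: {without_cache:.3f}, with caching: {with_cache:.3f}")
```

The larger the static prefix relative to the new content on each turn, the bigger the saving.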

Be Selective with MCP Integrations

MCP standardises tool connections, but some implementations load every available tool schema into context whether it is needed or not. This can result in tens of thousands of tokens being consumed before any useful work begins.

Only connect the MCP servers required for the task at hand. If you're working on frontend development, for example, you probably don't need a database server connected. Every unnecessary tool definition removed from context reduces token usage on every subsequent turn.
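One way to make this overhead visible is to estimate how many tokens your connected tool schemas add to each turn. The sketch below uses a crude rule of thumb of about four characters per token, which is an assumption and not a real tokenizer, and the example schema is hypothetical.

```python
import json


def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Very rough token estimate; real tokenizers will give different counts."""
    return round(len(text) / chars_per_token)


def schema_overhead(tool_schemas: list[dict], turns: int) -> int:
    """Estimated tokens spent on tool definitions across a session."""
    per_turn = sum(estimate_tokens(json.dumps(schema)) for schema in tool_schemas)
    return per_turn * turns


# Hypothetical tool definition, repeated to mimic a server exposing 40 similar tools.
example_tool = {
    "name": "query_database",
    "description": "Run a read-only SQL query against the analytics database.",
    "inputSchema": {"type": "object", "properties": {"sql": {"type": "string"}}},
}
print(schema_overhead([example_tool] * 40, turns=30), "estimated tokens over 30 turns")
```

Running the same estimate with and without a given server connected gives a quick sense of what that connection costs you.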

Use Code Mode Where Appropriate

Instead of loading dozens of tool schemas into context, Code Mode gives the agent a single capability: the ability to write and execute code. The context footprint remains largely fixed regardless of how many underlying tools are available.

The result is lower token consumption, faster responses and often more efficient handling of complex workflows.

Cloudflare notes that Code Mode cuts the token bloat of JSON schemas by executing TypeScript code, while Red Hat measured a 53% token overhead reduction across 38 tools using codemode-lite.

You can adopt a similar pattern today using tools such as codemode‑lite (Python), @tanstack/ai‑code‑mode (TypeScript), or the mcp‑server‑code‑execution‑mode available in Claude Code and Cline. The client sees a minimal tool list while the agent interacts with the underlying code execution layer behind the scenes.
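The pattern itself is easy to picture with a toy example. The sketch below is not the API of codemode-lite or any other library mentioned here: `run_code`, `list_issues` and the restricted built-ins are all hypothetical names. It is also not a sandbox, since real implementations run agent-written code in an isolated environment.

```python
import contextlib
import io


# Hypothetical underlying tool; in practice this would wrap an MCP server or an API.
def list_issues(project: str) -> list[dict]:
    return [
        {"id": 1, "project": project, "status": "open"},
        {"id": 2, "project": project, "status": "closed"},
    ]


TOOLS = {"list_issues": list_issues}
SAFE_BUILTINS = {"print": print, "len": len, "sum": sum}


def run_code(source: str) -> str:
    """The single tool the model sees: run agent-written code, return what it printed."""
    namespace = {"__builtins__": SAFE_BUILTINS, **TOOLS}
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        exec(source, namespace)  # illustration only: exec is NOT a security boundary
    return buffer.getvalue()


# Code the agent might write: the filtering happens here, not in the context window.
agent_code = '''
open_issues = [i for i in list_issues("web") if i["status"] == "open"]
print(len(open_issues), "open issues")
'''
print(run_code(agent_code))
```

Only the short printed result flows back into the conversation, while the tool functions never need their own schema in context.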

Efficient Model Usage

Minimise token spend by offloading routine, deterministic tasks to fast, inexpensive models. Reserve premium flagship models for complex, multi-file reasoning tasks.

Configuring Custom Agents

You can leverage strict system constraints to keep your context footprint small. Below is an example configuration for a highly focused custom agent file.

---
name: FileScope
description: A strict, token-efficient agent that operates only on explicitly provided files and context. Use for focused edits, analysis, or questions where the scope is fully defined.
argument-hint: A task or question plus the exact files or content allowed for use.
---

## Purpose
You are a highly constrained assistant optimised for minimal context usage, deterministic behaviour, and low token consumption.

You may only operate on files, content, or context explicitly provided by the user or attached to the current task.

## Core Rules

1. Explicit Scope Only
- Never access, infer, search, or assume information outside the provided context.
- Treat all non-provided files, modules, APIs, and implementations as unknown.

2. No Autonomous Exploration
- Do not browse directories.
- Do not discover related files automatically.
- Do not resolve imports or dependencies unless explicitly provided.

3. Minimal Context Usage
- Use the smallest amount of context necessary to complete the task.
- Avoid repeating existing code or large unchanged sections.
- Prefer diffs, targeted edits, or concise outputs.

4. Missing Context Handling
- If required information is unavailable, stop and request the exact missing file, snippet, or dependency.
- Never fabricate implementations, interfaces, types, or behaviour.

5. Deterministic Operation
- Perform only the explicitly requested task.
- Do not refactor, optimise, rename, or restructure unless directly instructed.

6. Output Discipline
- Return only the necessary result.
- Prefer concise patches, edits, or direct answers over explanations.
- Avoid unnecessary markdown formatting unless requested.

## Behavioural Constraints
- Scope is limited strictly to provided inputs.
- Accuracy is preferred over completeness when context is incomplete.
- Conservatism is preferred over assumption-making.

Smart Model Routing

LLM routing automatically examines each task and sends it to the cheapest model capable of handling it. Many common developer tasks (boilerplate, lint fixes, tests) are low-complexity, so routing them to cheaper models can cut token costs by 40-85%. In GitHub Copilot, select “Auto” in the model picker to route automatically, with a 10% discount on the model multiplier.

Other tools offer similar built‑in routing: Cline lets you auto‑route simple tasks to cheaper models; OpenCode assigns different models to different agent roles; Claude Code has opusplan (Opus for planning, Sonnet for execution); Gemini CLI offers an Auto (Gemini 3) setting; Aider uses separate models for architect and editor roles; and Cursor includes an Auto model selection that balances cost and reliability.
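If your tool has no built-in routing, the idea can be approximated with a simple rule-based router. This is a minimal sketch under stated assumptions: the model names are placeholders, the keyword list and the three-file threshold are arbitrary, and production routers typically use far richer signals than keyword matching.

```python
from dataclasses import dataclass

CHEAP_MODEL = "small-fast-model"   # placeholder name, not a real model ID
PREMIUM_MODEL = "flagship-model"   # placeholder name, not a real model ID

ROUTINE_KEYWORDS = ("boilerplate", "lint", "test", "rename", "docstring")


@dataclass(frozen=True)
class Task:
    description: str
    files_touched: int
    needs_design_decisions: bool = False


def route(task: Task) -> str:
    """Pick the cheapest model that is likely to cope with the task."""
    if task.needs_design_decisions or task.files_touched > 3:
        return PREMIUM_MODEL
    if any(keyword in task.description.lower() for keyword in ROUTINE_KEYWORDS):
        return CHEAP_MODEL
    return PREMIUM_MODEL  # when unsure, favour reliability over cost


print(route(Task("Fix lint warnings in utils.py", files_touched=1)))
print(route(Task("Redesign the auth flow", files_touched=8, needs_design_decisions=True)))
```

Defaulting to the premium model when nothing matches is a deliberate choice here: a failed cheap attempt followed by a retry can cost more than getting it right the first time.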

Thinking in Code, Not Context

Instead of reading fifty files into the window, have the agent write a script that does the analysis and returns only the result. This “think in code” approach, articulated by context-mode’s creator, can dramatically reduce token usage for analysis-heavy work.

Example: “Write a shell script that finds TypeScript files with unused imports in src/ and prints the filenames. Run it and report the results.”
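The script the agent writes for a prompt like this might look like the following. It is shown in Python rather than shell, and it is a deliberately naive heuristic: it only understands named imports of the form `import { a, b as c } from "..."` and treats a name as unused if it never appears elsewhere in the file. A real project would lean on the TypeScript compiler or a linter instead.

```python
import re
from pathlib import Path

# Matches named imports such as: import { a, b as c } from "./mod";
NAMED_IMPORT = re.compile(r"import\s*(?:type\s+)?\{([^}]*)\}\s*from\s*['\"][^'\"]+['\"];?")


def unused_imports(source: str) -> list[str]:
    """Return imported names that never appear outside the import statements."""
    names = []
    for match in NAMED_IMPORT.finditer(source):
        for part in match.group(1).split(","):
            part = part.strip()
            if part:
                # "a as b" binds the local name b
                names.append(part.split(" as ")[-1].strip())
    body = NAMED_IMPORT.sub("", source)
    return [name for name in names if not re.search(rf"\b{re.escape(name)}\b", body)]


def files_with_unused_imports(root: Path) -> list[str]:
    """Return relative paths of .ts files under root that have unused named imports."""
    hits = []
    for path in sorted(root.rglob("*.ts")):
        if unused_imports(path.read_text(encoding="utf-8")):
            hits.append(str(path.relative_to(root)))
    return hits


sample = 'import { used, helper as alias } from "./mod";\nconsole.log(used);\n'
print(unused_imports(sample))
```

Calling `files_with_unused_imports(Path("src"))` returns only a short list of filenames, so the agent never has to load the files themselves into context.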

Audit your instruction files

Every token in your CLAUDE.md, AGENTS.md, or similar instruction files is read on every single turn of the conversation. Research into SkillReducer found that, on average, over 60% of the content in standard instruction “skill” files is non‑actionable. Keep them as lean as possible.

Session Isolation and Output Verbosity

One task per session: Avoid cramming unrelated work into a single session. Start fresh for each distinct task to stop context bleed.
Limit tool output verbosity: Instead of asking the agent to “read the whole config file” (which could be hundreds of lines), ask for “the first 20 lines” or “only the lines containing ‘filename’”. Every line you cut from the output is a line that won’t be re-sent on every subsequent turn. Small, specific requests add up to big savings.
Watch the cost indicator: Claude Code shows a cost estimate in the prompt. Cursor and VS Code show context usage as a percentage. Make checking it a habit.
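The output-trimming advice amounts to filtering before anything reaches the context window. A minimal sketch, with hypothetical helper names and a made-up config file, might look like this:

```python
def first_lines(text: str, n: int = 20) -> str:
    """Keep only the first n lines of a tool output."""
    return "\n".join(text.splitlines()[:n])


def matching_lines(text: str, needle: str) -> str:
    """Keep only the lines that contain needle."""
    return "\n".join(line for line in text.splitlines() if needle in line)


# Made-up config: 300 option lines plus the one line we actually care about.
config = "\n".join(f"option_{i} = {i}" for i in range(300)) + "\nfilename = app.log"

print(len(config.splitlines()), "lines in the full file")
print(len(first_lines(config).splitlines()), "lines after first_lines")
print(matching_lines(config, "filename"))
```

Asking the agent for the equivalent of `matching_lines` instead of the whole file means one line, not hundreds, is carried through the rest of the session.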

Model selection: free models you can use agentically

Which free models can you actually plug into your tools today? Based on real-world compatibility and scores tracked on the SWE-bench Verified leaderboard, here are the best options.

Model (Open‑weight)	SWE‑bench Verified	Agentic tools (no API key required)
DeepSeek V4 Flash	79.0%	OpenCode (Zen), Cline, Aider, Claude Code (via proxy), Ollama
MiniMax M2.5	80.2%	OpenCode (Zen), Cline
GLM‑5 (MIT)	77.8%	OpenCode, Cline, Claude Code (via proxy)
Kimi K2.5	76.8%	OpenCode (Zen), Cline, Aider, Kilo Code
Qwen3‑Coder	70%+ (with SWE‑Agent)	OpenCode (Zen), Cline, Ollama (self‑hosted)
Nemotron 3 Super (NVIDIA MoE)	60.5%	OpenCode, Cline, Kilo Code (via OpenRouter)

Why free models are budget‑friendly

High-performing open-weight models such as MiniMax M2.5 and GLM-5 continue to narrow the gap to frontier models while costing nothing per token. That means you can offload high-volume, repetitive coding tasks - boilerplate, refactoring, documentation, test generation - without watching your credit balance shrink.

Run a model locally

Use Ollama or LM Studio to run open‑weight models on your own hardware. This cuts out API providers entirely - token costs drop to zero and your code stays private. Connect these local models to agentic tools like OpenCode for a completely free, private AI workflow.

Data protection: know where your code goes

Free models give you near-frontier capability at zero cost - but always verify where your code actually goes. If you're working with proprietary or customer-facing source code, self-hosting offers the highest level of control, as your code never leaves your own infrastructure. If you rely on third-party APIs, even for free tiers, check the provider’s data-retention policy before pasting sensitive content.

FAQ

Practical advice for engineers

How do I know if I’m burning too many tokens?

Watch for agents rereading the same files, tool outputs dominating the conversation, or the cost indicator climbing faster than expected. If you’re regularly hitting auto‑compaction thresholds, you’re burning more than you need to.

When should I start a fresh session versus compacting?

Start a fresh session when switching tasks, for example moving from a bug fix to a new feature. Compact when continuing related work but the context window is beginning to fill up. As a rule of thumb, compact at around 50% for the best summary quality.

Do subagents really save tokens?

Yes. A subagent works in its own clean window and returns a condensed summary (typically 1,000‑2,000 tokens) rather than forcing the main agent to see every intermediate file read and tool output.

Can I set a hard budget or spending limit?

Most IDE agents and CLI tools don’t offer native hard caps, but you can enforce limits at the proxy level. Services like OpenRouter or local routers (e.g., llm-router) allow you to set monthly or per‑session budgets. Alternatively, self‑hosting with local models completely eliminates variable costs.
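If you call models from your own scripts, a client-side guard is another option. The sketch below is hypothetical and is not the API of OpenRouter or any router mentioned above: the class names, the prices per million tokens and the token counts are all assumptions for illustration.

```python
class BudgetExceeded(RuntimeError):
    """Raised when a request would push spending past the configured limit."""


class BudgetGuard:
    def __init__(self, limit_usd: float):
        if limit_usd <= 0:
            raise ValueError("limit_usd must be positive")
        self.limit_usd = limit_usd
        self.spent_usd = 0.0

    def charge(self, input_tokens: int, output_tokens: int,
               input_price_per_mtok: float, output_price_per_mtok: float) -> float:
        """Record a request's cost and return the remaining budget."""
        cost = (input_tokens * input_price_per_mtok
                + output_tokens * output_price_per_mtok) / 1_000_000
        if self.spent_usd + cost > self.limit_usd:
            raise BudgetExceeded(
                f"request costing {cost:.2f} would exceed the {self.limit_usd:.2f} limit"
            )
        self.spent_usd += cost
        return self.limit_usd - self.spent_usd


# Hypothetical prices: 3.00 per million input tokens, 15.00 per million output tokens.
guard = BudgetGuard(limit_usd=1.0)
remaining = guard.charge(100_000, 10_000, 3.0, 15.0)
print(f"remaining budget: {remaining:.2f}")
```

Calling `charge` before each request, with an estimate of its size, lets a script stop cleanly instead of discovering the overspend on the invoice.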

Does switching models mid‑session reset my context?

No. In tools like Claude Code and Cline, using /model or the model picker keeps the conversation history intact. You can freely switch from an expensive model for planning to a cheaper one for execution without losing the thread.

What’s the difference between context window overflow and context rot?

Context window overflow occurs when you hit the model's hard limit and the agent can no longer continue. Context rot is the gradual loss of coherence that starts much earlier, often around 60-70% of the available context window, causing the model to forget instructions and contradict itself.

As covered earlier in this guide, compacting regularly helps prevent context rot long before overflow becomes a problem.

Which tool should I use for token‑efficient coding?

Pi is one of the leanest options currently available and is described as a minimal terminal agent harness. The idea is that you adapt the tool to your workflow, not the other way around.

Many engineers end up using a combination of tools, but in practice, token efficiency depends more on the model you choose and how you structure your workflow than on the tool itself.
