# Code Agent and Token Cost / Code Agent 和 Token Cost

> Published 2026-05-11 · By lawted (https://x.com/lawted2) · Published on HA7CH (https://ha7ch.com)
> Canonical: https://ha7ch.com/writing/code-agent-and-token-efficiency

## English

This one runs a bit long — about ten minutes. If you're a heavy VibeCoding user or curious about how LLM billing actually works under the hood, it's worth finishing.

VibeCoding is becoming infrastructure for a lot of engineers. When you hit a rate limit or context ceiling in the middle of a long task, the quickest fix is obvious: throw $200 at a ChatGPT Pro subscription, throw another $200 at Claude Max, and the problem is gone.

But suppose we don't just buy our way around this and instead ask the real question: where the hell do all the tokens go? Aren't you curious? I sure am.

---

Let's start with billing. Most mainstream LLM APIs, including Claude and OpenAI, charge by token count, with separate rates for input and output. Input is cheaper; output costs more. In the specific context of Code Agents, input tokens are the overwhelming majority — often over 80% of total consumption.
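To make the billing model concrete, here is a minimal sketch of per-token pricing. The function name and the prices ($3 and $15 per million tokens) are assumptions picked for illustration, not any provider's actual rates.

```python
def request_cost(input_tokens: int, output_tokens: int,
                 input_price_per_m: float, output_price_per_m: float) -> float:
    """Dollar cost of one request when input and output are billed separately."""
    return (input_tokens * input_price_per_m
            + output_tokens * output_price_per_m) / 1_000_000


# Hypothetical agent turn: a large accumulated context in, a short tool call out.
cost = request_cost(input_tokens=50_000, output_tokens=1_000,
                    input_price_per_m=3.0, output_price_per_m=15.0)
input_share = (50_000 * 3.0 / 1_000_000) / cost
print(f"cost=${cost:.3f}, input share={input_share:.0%}")
```

Even with output priced at five times the input rate in this made-up example, input dominates the bill because there is so much more of it.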

Some providers offer a KV Cache Discount (usually marketed as prompt caching): when the server detects that a new request shares a long prefix with a previous one, the shared portion hits the cache and is billed at a significant discount. The mechanism makes physical sense: the server reuses the attention key/value states it already computed for that prefix instead of recomputing them.
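A rough sketch of how a prefix-cache discount changes the input bill. Everything here is an assumption for illustration: the function names, the string "tokens", and the `cached_multiplier` of 0.1. Real providers have their own matching granularity and rates.

```python
def shared_prefix_len(prev: list[str], new: list[str]) -> int:
    """Number of leading tokens that are identical in both requests."""
    n = 0
    for a, b in zip(prev, new):
        if a != b:
            break
        n += 1
    return n


def input_cost(prev: list[str], new: list[str], price_per_token: float,
               cached_multiplier: float = 0.1) -> float:
    """Bill cached prefix tokens at a discount and the rest at full price."""
    hit = shared_prefix_len(prev, new)
    miss = len(new) - hit
    return hit * price_per_token * cached_multiplier + miss * price_per_token


prev_request = ["sys", "query", "call1", "obs1"]
next_request = prev_request + ["call2", "obs2"]  # append-only, so the prefix is intact
discounted = input_cost(prev_request, next_request, price_per_token=1.0)
full_price = len(next_request) * 1.0
```

In this toy case four of the six tokens hit the cache, so the request is billed at roughly 2.4 units instead of 6. The discount only applies from the start of the input up to the first token that differs.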

This will become a plot point. We'll come back to it.

---

Most Code Agents today, including Claude Code, still run on the classic ReAct framework. The original ReAct paper is over three years old. Its core loop: the model produces a Tool Call, receives an Observation, reasons through a CoT, then produces the next Tool Call.
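The loop can be sketched in a few lines. This is a toy under stated assumptions, not how Claude Code or any real agent is implemented: `react_loop`, the step dictionary format, and the scripted stand-in for the model are all hypothetical.

```python
from typing import Callable


def react_loop(query: str,
               model: Callable[[list[str]], dict],
               tools: dict[str, Callable[[str], str]],
               max_steps: int = 10):
    """Run a ReAct-style loop: think, call a tool, observe, repeat."""
    context = [f"Query: {query}"]
    for _ in range(max_steps):
        step = model(context)  # the model sees the whole context every round
        context.append(f"Thought: {step['thought']}")
        if "answer" in step:
            return step["answer"], context
        observation = tools[step["tool"]](step["args"])
        context.append(f"Tool Call: {step['tool']}({step['args']})")
        context.append(f"Observation: {observation}")
    return None, context


def scripted_model(context: list[str]) -> dict:
    """Stand-in for an LLM: read one file, then answer."""
    if not any(line.startswith("Observation:") for line in context):
        return {"thought": "I need to read the file.",
                "tool": "read_file", "args": "main.py"}
    return {"thought": "I have what I need.", "answer": "done"}


tools = {"read_file": lambda path: f"<contents of {path}>"}
answer, context = react_loop("fix the bug", scripted_model, tools)
```

Note that nothing ever leaves `context`: every thought, call, and observation stays in and is passed back to the model on the next round.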

It was an elegant design when it was proposed. But over the past three years, the agent research community has produced a lot of optimization paradigms — Plan Before Act, Hierarchical Planning, Task Decomposition with Memory… Code Agents have adopted basically none of them.

The arrogance isn't entirely unjustified — engineering stability will always outrank academic novelty at the product level. But that doesn't mean we can't look at the cost.

The cost lands on token consumption, and ultimately on the user. The worst offender is context management. The only way to describe it is: A Piece of SHIT.

Each loop iteration, the Agent appends the user Query, the current Tool Call, the Observation, and the model's own CoT to the Context, then feeds the entire thing back unchanged on the next round. Under this model the Context itself grows by a roughly fixed amount per round, but because the whole Context is resent as input every round, the cumulative input tokens you pay for grow quadratically with the number of rounds. Most of that is historical noise that's nearly useless for whatever the model is actually trying to do right now.
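A quick way to check where the quadratic term comes from. The numbers (a 1,000-token query, 2,000 tokens added per round, 30 rounds) are made up for illustration, and the function name is hypothetical.

```python
def context_sizes(query_tokens: int, tokens_per_round: int, rounds: int) -> list[int]:
    """Input size sent on each round when nothing is ever dropped."""
    return [query_tokens + tokens_per_round * i for i in range(rounds)]


sizes = context_sizes(query_tokens=1_000, tokens_per_round=2_000, rounds=30)
last_round_input = sizes[-1]   # size of a single request: linear in rounds
total_input = sum(sizes)       # what gets billed overall: quadratic in rounds

# Closed form: n*q + r*n*(n-1)/2 for n rounds, query size q, growth r per round.
n, q, r = 30, 1_000, 2_000
closed_form = n * q + r * n * (n - 1) // 2
print(last_round_input, total_input, closed_form)
```

The final request is only 59,000 tokens, but the 30 requests together send 900,000 input tokens. Doubling the number of rounds roughly quadruples that total.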

When Context approaches the model's limit, Claude Code triggers Auto Compact — an interesting mechanism in itself. It's not a semantic summarization pass; it's a rule-based structural pruning at the linguistic level. Codex takes a blunter approach and just terminates the session.

Claude's context window is around 200K tokens. That sounds large until you've done a few dozen tool calls on any reasonably-sized codebase — then it's gone fast.

---

Solutions to this problem fall into two camps: Harness Level and Model Level.

Harness Level means engineering the Agent's runtime framework without touching the model itself. Two main approaches:

First, fine-grained Context management — actively filtering and compressing historical information, keeping only what's genuinely relevant to the current task. Compressing Observations is the primary lever.

Second, introducing a Plan Before Code paradigm — having the Agent complete an explicit planning pass before execution, reducing aimless exploratory Tool Calls and cutting token consumption by reducing loop iterations.
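As an illustration of the first approach, here is about the simplest possible Observation compressor: keep the head and tail of a long tool output and drop the middle. This is a naive heuristic chosen for brevity, not the method of any particular paper or product, and the function name and defaults are assumptions.

```python
def compress_observation(text: str, head: int = 20, tail: int = 20) -> str:
    """Keep the first `head` and last `tail` lines of a long tool output."""
    lines = text.splitlines()
    if len(lines) <= head + tail:
        return text  # short enough, leave untouched
    omitted = len(lines) - head - tail
    marker = f"... [{omitted} lines omitted] ..."
    # Slice from an explicit index so tail=0 keeps nothing from the end.
    return "\n".join(lines[:head] + [marker] + lines[len(lines) - tail:])


long_output = "\n".join(f"line {i}" for i in range(100))
short = compress_observation(long_output, head=2, tail=2)
print(short)
```

The hard part that this sketch skips is deciding what is actually relevant. Blind truncation can throw away exactly the line the model needed, which is one reason this kind of logic adds risk to a harness.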

Papers along these lines have been coming out for about a year. Claude Code (CC), arrogant as ever, has shown zero interest. There is a reason, though: from an engineering standpoint, adding structural complexity always introduces potential side effects.

The standard academic benchmark for validating these methods is SWE-Bench. Hit good numbers there and you can publish. But SWE-Bench is fundamentally a closed evaluation with deterministic answers. Most real Code Agent usage is open-ended exploration — unfamiliar codebases, undefined requirements — far beyond what SWE-Bench covers. Academic proof doesn't translate cleanly to actual user experience.

Model Level is an entirely different angle: keep the Agent code unchanged, but use a better model. If it can solve your problem in 3 loops instead of 15, token consumption drops on its own.

The most striking news on this front came from DeepSeek. DeepSeek V4's API pricing is near-disruptive: after two rounds of discounts, it lands at roughly one-tenth of the baseline price. At that level, almost no token compression technique can match the savings. To match a 90% price cut you would have to cut token usage by 90%, and you can't algorithmically optimize your way to that kind of 10x gain.

Stranger still: because of KV Cache Discounts, some optimization approaches that reduce the raw token count actually end up costing more, because restructuring the input breaks the cache hit pattern. Counterintuitive, but completely logical once you understand the billing mechanics.
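The arithmetic behind that is short. The sketch below compares an append-only context that mostly hits the cache against a compressed context whose rewritten prefix misses it entirely. The token counts and the 0.1 cached-token multiplier are assumptions for illustration, not measured values.

```python
def billed_units(total_tokens: int, cached_tokens: int,
                 cached_multiplier: float = 0.1) -> float:
    """Input cost in full-price token equivalents."""
    uncached = total_tokens - cached_tokens
    return uncached + cached_tokens * cached_multiplier


# Append-only context: 100k tokens, of which the first 90k were already cached.
append_only = billed_units(total_tokens=100_000, cached_tokens=90_000)

# Compressed context: 40% fewer tokens, but the rewritten prefix misses the cache.
rewritten = billed_units(total_tokens=60_000, cached_tokens=0)

print(f"append-only: {append_only:.0f}, rewritten: {rewritten:.0f}")
```

Under these assumptions the compressed request is billed at about 60,000 token equivalents against roughly 19,000 for the longer but cache-friendly one. A compression scheme only wins if it keeps the prefix stable or cuts far more than the cache discount gives back.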

---

Worth noting: some Claude Code developers are pretty dismissive of Harness Level approaches. They don't want to introduce complex context management at the harness layer. Their position is that whatever CC can't solve today, the next model will handle.

Maybe that's principled engineering conservatism. Maybe it's passing the buck.

There's another take: on a long enough timeline, obsessing over token counts is just a phase. A mentor of mine compared tokens to mobile data. We might be in the 3G era right now, and when 5G arrives, nobody will care how much data a single request burns.

That analogy has some weight. Compute costs will keep falling. Context windows will keep growing — a 1M context window isn't unthinkable. What feels like a bottleneck today might genuinely be a historical footnote in the transition period.

---

My own take: I'm open on this field, but I lean Model Level right now. Partly hindsight — Harness Level approaches have been around for a year and none of them have made it into production at scale, which tells you something. Partly distribution — if Model Level solves the problem, it'll spread like DeepSeek did. People running low on tokens will find the better API on their own.

And from a vendor's perspective: savings in tokens are temporary. Savings in time and compute are forever.

If you have a take or a solution, reach out — happy to talk.

## 中文

这篇文章略长，大概需要十分钟。如果你是 VibeCoding 的重度用户，或者对 LLM 的计费机制感兴趣，值得读完。

VibeCoding 正在成为很多工程师日常开发的基础设施。当你在某次长任务中撞上了 rate limit 或者 context 上限的时候，最直接的解决办法当然是给 GPT 充一个 200 刀的 Pro、给 Claude 充一个 200 刀 Max——问题立刻消失。

但如果我们不从财力上绕开这个问题，而是从原理上真正问一句「token 到底去哪了」，你难道不好奇吗？反正我是很好奇。

---

先说计费机制。目前主流的 LLM API，包括 Claude 和 OpenAI，都按 token 数量计费，区分输入和输出。输入会便宜一点，输出会贵一些。在 Code Agent 这个具体的情景之下，Input 的 token 消耗占绝对大头，甚至达到 80% 以上。

部分厂家的 API 会提供一种 KV Cache Discount：当 Server 端检测到本次请求的输入前缀与历史请求高度重叠时，重叠部分的计算可以命中缓存，因此给出相当幅度的折扣。这个机制设计得很合理，背后的物理含义是避免了重复的注意力计算。

这个机制会成为伏笔，我们接下来会讲到。

---

当前绝大多数 Code Agent，包括 Claude Code，依旧运行在经典的 ReAct 框架上。ReAct 最早的论文距今已有三年多，其核心循环是：模型产生一个 Tool Call，收到 Observation，结合 CoT 思考下一步，再产生下一个 Tool Call。

这个框架在它提出的年代是相当优雅的设计。但三年以来，Agent 领域涌现出了大量优化范式——Plan Before Act、Hierarchical Planning、Task Decomposition with Memory……Code Agent 几乎一个都没有采用。

傲慢并非没有理由，毕竟工程稳定性的优先级在产品层面永远高于学术新颖性，但这不妨碍我们审视其代价。

代价落在 token 消耗上，最终落在消费者身上。Token 消耗的重灾区是上下文管理，这里只能用「A Piece of SHIT」来形容。

每一轮循环，Agent 会把当前的用户 Query、本轮的 Tool Call、Observation、以及模型自身的 CoT 全量追加进 Context，然后在下一轮把整个 Context 原封不动地喂回给模型。这种模式下，Context 以二次方速度膨胀，且大量内容是对模型当前任务几乎不再有用的历史噪声。

当 Context 逼近模型的上限时，Claude Code 会触发 Auto Compact——这个机制本身也颇有意思，它并非调用一次模型做语义层面的摘要压缩，而是从语言学结构角度做规则性的删减；Codex 则更为简单粗暴，直接终止本次对话。

Claude 的 Context 窗口大约在 200K token 量级，这个数字听起来很大，但在一个稍有规模的代码库上执行几十轮工具调用之后，很快就会见底。

---

目前针对这个问题的解决思路大致分两派：Harness Level 和 Model Level。

Harness Level 的核心是在不修改模型本身的前提下，对 Agent 的运行框架做工程改造。核心思路有两条：

一是精细化管理 Context，主动过滤和压缩历史信息，只保留对当前任务真正有价值的内容，以压缩 Observation 的思路为主力军；

二是引入 Plan Before Code 的范式，让 Agent 在实际执行之前先完成一次显式的任务规划，从而减少无效的探索性 Tool Call，从减少循环次数来减少 token。

这类论文已经陆续发表了约一年，傲慢的 CC 依旧没有任何反应。因为从工程实践角度来说，让结构变复杂必然带来潜在的 Side Effect。

学界验证这类方法的标准工具是 SWE-Bench，跑下来指标漂亮的话足以发表，但 SWE-Bench 本质上是一个有确定性答案的封闭评测。大多数人使用 Code Agent 的场景是开放性探索——面对一个陌生的代码库、一个未定义的需求——这类场景的复杂度远超 SWE-Bench 所能覆盖的边界，学界的证明因此很难直接转化为对用户实际体验的保证。

Model Level 则是一个视角完全不同的方向：构建 Agent 的代码保持不变，调用的模型更牛逼了，3 轮循环就能把你问题解决了，token 消耗自然下来了。

这个方向最牛逼的新闻来源于 DeepSeek。DeepSeek V4 的 API 定价策略是近乎颠覆性的——两轮打折后折扣力度达到了基准价格的一折。在这个价格体系下，几乎没有任何一种 token 压缩方法能够产生与之匹配的效益，因为你无法通过算法优化做到 1000% 的效率提升。

更吊诡的是，由于 KV Cache Discount 的存在，某些优化方案在减少了 token 消耗的同时，却因为改变了输入结构导致缓存命中率下降，实际扣费不减反增。这是一个反直觉的结果，但从计费机制的逻辑来看完全合理。

---

值得一提的是，Claude Code 的部分开发者对 Harness Level 的改造方案持相对消极的态度，不太倾向于在 Harness Level 引入复杂的上下文管理逻辑。他们的观点认为，现在 CC 解决不了的问题，等新模型出来之后就能解决了。

这或许是出于工程保守主义的考量，但也可能是在为自己的工作甩锅。

还有一种观点认为，从更长的时间轴来看，现在对 token 斤斤计较这件事本身就是阶段性的。我的一位导师曾把 token 比作流量：我们现在可能处于 3G 时代，当 5G 到来的时候，无人在意一次请求消耗了多少流量。

这个比喻有它的说服力。计算成本的下降是可以预期的，模型的 Context Window 也在持续提升（比如可能会实现的 1M 上下文），今天被视为瓶颈的问题，未来或许真的只是一个过渡期的历史注脚。

---

本人对这个 Field 持开放态度，目前来说站 Model Level。一个是马后炮唯结果论的原因，Harness Level 的方法现在工业界一个都没有用上，那总有它的原因在。还有一个是推广度上的看法，如果从 Model Level 解决了问题那么就会像这次 DeepSeek 一样，无需过多推广，缺 token 的人自然会使用你的 API。

更何况从厂商的角度来说，token 的减耗是暂时的，时间和算力的减耗是永远的。

任何观点和方法都可以联系我们进行探讨。
