The hidden cost driver
A team running 50 MCP tools at 200 tokens each sends 10,000 input tokens before the model starts thinking. At GPT-4o pricing ($2.50 per million input tokens), that is $0.025 per call just for tool definitions. At 10,000 calls per day, that is $250 per day or $7,500 per month -- and that is before the model generates a single output token.
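The arithmetic above can be checked in a few lines (figures taken straight from the text; the monthly number assumes 30 days):

```python
# Back-of-envelope cost of tool-definition overhead.
TOOLS = 50
TOKENS_PER_TOOL = 200
PRICE_PER_M_INPUT = 2.50      # GPT-4o input pricing, USD per million tokens
CALLS_PER_DAY = 10_000

schema_tokens = TOOLS * TOKENS_PER_TOOL                     # 10,000 tokens per call
cost_per_call = schema_tokens / 1_000_000 * PRICE_PER_M_INPUT
daily = cost_per_call * CALLS_PER_DAY
monthly = daily * 30

print(f"${cost_per_call:.3f} per call, ${daily:.0f}/day, ${monthly:,.0f}/month")
# → $0.025 per call, $250/day, $7,500/month
```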
Most cost optimization focuses on output tokens (completion length, model selection, caching). But for teams with large tool registries, input tokens from tool schemas are the dominant cost driver. They are sent in full on every request, they grow linearly with tool count, and they are invisible in most monitoring dashboards.
Four approaches compared
1. Prompt caching
Reduces repeated input costs but does not reduce input size. Effective when the same tools are called repeatedly. Limited ceiling: you still pay for the full schema on first call and cache misses. Most providers offer 50-75% discount on cached tokens.
2. Tool filtering
Only send relevant tools per request. Requires intent classification or routing logic upfront. Reduces tokens significantly but adds latency and engineering complexity. Risk: misrouted queries miss the right tool and produce worse results.
3. Schema redesign
Manually rewrite tool descriptions to be shorter. One-time improvement with a limited ceiling. Every new tool needs manual optimization. Does not scale with growing tool registries. Typical reduction: 30-50% with careful editing.
4. Token compression
Automated compression that preserves semantics. Highest reduction ceiling (85-97%). Works on any schema without manual rewriting. Adds minimal latency. No routing logic needed. Scales automatically as tools are added.
Tradeoffs
| Approach | Reduction ceiling | Implementation effort | Ongoing maintenance | Risk |
|---|---|---|---|---|
| Prompt caching | 20-40% | Low | None | Cache misses |
| Tool filtering | 50-80% | Medium | High | Misrouted queries |
| Schema redesign | 30-50% | High | High per tool | Semantic drift |
| Token compression | 85-97% | Low | None | Compression artifacts |
What this means in practice
NOVA Token Optimizer uses proprietary compression to achieve 85-97% reduction on tool definitions with less than 50ms latency. It works as a standalone REST API or as an MCP server that sits between your client and your tool servers. No changes to your existing tools required.
For teams that also need better tool selection, DeepNova handles grounded retrieval -- finding the right tools for each request before compression. The NOVA Platform bundles both at 20% off.