Do Not Use the Most Expensive Model for Every Agent Step

Many teams build AI agents by starting with the strongest model they can access and asking it to do everything: understand the task, read the materials, retrieve context, draft the output, check errors, and repair mistakes. That approach is simple, but the cost rises quickly. It can also reduce system control because every step depends on one expensive model call pattern.

Fireworks AI published a useful alternative on June 3, 2026: use an open-source worker for most of the work, and call a frontier advisor only at key moments.

For legal, coding, research, and operations agents, this is a practical shift. The core agent question is no longer only, "Which model is smartest?" It becomes, "Which step needs the strongest model, and which step can be handled by a lower-cost, more controllable worker?"

What Fireworks Tested

Fireworks and Harvey evaluated this idea on a 100-task slice of the Legal Agent Benchmark. LAB is designed for long legal-agent tasks involving documents, citations, deliverables, and strict scoring. It is closer to professional workflow execution than simple Q&A.

In the setup described by Fireworks, a GLM 5.1 open-source worker can call a Claude Opus 4.7 advisor when it decides help is needed. The advisor is not a fixed external router and does not write the entire answer for the worker. It is a tool the worker can use during the task trajectory.

The source article reports that GLM 5.1 plus an Opus 4.7 advisor reached 18 / 100 all-pass at roughly $368 total cost. By comparison, Claude Opus 4.7 run end-to-end reached 14 / 100 all-pass at roughly $954 total cost. Fireworks notes that the cost estimate uses current serverless rates and public API rates, and that the number will vary with token mix and pricing.

This should not be read as a blanket claim that open-source models have replaced frontier models. The more useful conclusion is narrower: on a specific long-task benchmark, orchestration changed the relationship between quality and cost.

Fireworks also discusses a second path: post-training. In its reported experiments, Kimi K2.6 improved from 11 / 100 to 15 / 100 all-pass after supervised fine-tuning on LAB trajectories, while reinforcement fine-tuning improved mean score to 0.886 after 46 rollout steps. The article is not about one trick. It points to two routes: change how models are called at inference time, and improve the worker model with domain trajectories after training.

Why the Advisor Pattern Works

Agent tasks are rarely single model calls. They are trajectories. A model reads material, builds a plan, calls tools, finds uncertain areas, retrieves more context, validates output, and then enters another step.

If every step uses the most expensive model, cost gets amplified by the length of the trajectory. If every step uses a cheaper model, key judgments can fail and create more human rework.

The open-source worker plus frontier advisor pattern separates those jobs:

The worker handles most reading, drafting, tool use, and process movement.
The advisor helps at high-risk moments with judgment, review, or deeper reasoning.
The system records where the advisor was called and whether it helped.
The product team can treat advisor call rate as a quality and cost control knob.

Fireworks reports that the advisor was called 0.83 times per task on average. That number matters. It suggests the frontier model did not disappear. It was used sparsely where it had more leverage.

Post-training addresses a different problem. If a task type appears repeatedly, the team may not need to rely forever on a stronger advisor to cover the worker's weak spots. High-quality trajectories can be turned into training data so the worker makes fewer domain-specific mistakes. Runtime routing and model-side improvement are separate levers, and teams can test them in stages.

What AI Product Teams Should Take From This

Teams building customer support agents, coding agents, research assistants, sales-ops automation, or enterprise knowledge workflows should avoid making the default model the only design decision.

Better questions include:

Which steps need higher reasoning quality?
Which steps are mostly extraction, formatting, summarization, or repetition?
Which failures create real business risk?
Did advisor calls actually improve completion rate?
Did the quality gain cover the extra cost?

These questions cannot be answered by a leaderboard alone. They require task-level evaluation.

For example, a coding agent could let a lower-cost worker read the issue, inspect related files, and propose a fix path. When it reaches an architectural boundary, a complex bug, or a security-sensitive change, it can call a stronger advisor for review. A research agent could use an open-source worker for source organization, then use an advisor to check logic gaps, missing citations, and conclusion risk.

This is a little more work than using the strongest model end to end, but it is closer to the economics of real production systems.

Where WisGate Fits

WisGate's API docs describe it as an AI inference API relay service with unified OpenAI-style REST access to multiple models. For agent teams, the value of that unified layer is not only writing less integration code. It also makes routing experiments easier to reuse.

A team can split an agent into worker, advisor, reviewer, and fallback roles, then connect different models through one request format. That gives the team three practical advantages.

First, evaluation is cleaner. Model changes do not get mixed with SDK changes, authentication changes, or logging changes.

Second, cost is easier to read. The team can separate call count, token usage, retry rate, and contribution to success for each role.

Third, routing is easier to adjust. If one task category needs a stronger advisor, the team can change a key step instead of rebuilding the entire agent.

The model combination in the Fireworks article is one experiment, not a universal recipe. The safer approach is to reuse the architecture pattern and validate it on a team's own tasks.

A Minimum Viable Test

A team can test the advisor pattern with 20 real tasks before making a large system change.

The test can be simple:

Select one long workflow, such as code repair, contract summary, competitor research, or multi-document reporting.
Run three configurations: one high-capability model, one lower-cost model, and lower-cost worker plus high-capability advisor.
Track all-pass or an equivalent completion standard, human rework, total tokens, total cost, and total time.
Label where the advisor was called: planning, retrieval, drafting, validation, or repair.
Expand the pattern only when advisor calls clearly improve completion rate or reduce rework.

Stop conditions should be written before the test starts. If the advisor pattern only adds calls without reducing rework, or if it makes latency unacceptable, the team should not scale it.

Conclusion

The value of open-source worker plus frontier advisor is not that one model is always better than another. It is that agent teams can redesign the cost structure of complex work.

In a complex workflow, a model is not one button. It is a set of roles. The worker moves the task forward. The advisor handles key judgment. The reviewer checks the output. The fallback protects reliability. Which model appears at which moment determines quality, cost, and control.

For teams building agents with multiple models, the next step is not chasing a new strongest model. It is building a repeatable task evaluation and comparing models inside the workflow.

If an advisor call consistently reduces rework, it is a specialist worth paying for. If it only makes the system more expensive, the team should return to worker quality, prompts, tool design, and evaluation criteria.

FAQ

What is a frontier advisor?

A frontier advisor is a high-capability model called on demand inside an agent workflow. It does not handle every step. It helps with complex judgment, review, validation, or higher-risk decisions.

Does the advisor pattern mean open-source models replace all closed models?

No. The better reading is that open-source models can take on more base work, while frontier models can be used more precisely at key steps. Whether that works depends on the task and evaluation result.

Why not compare only single-call model prices?

Agent cost comes from the full task trajectory: multiple calls, tool use, retries, human rework, and failure recovery. Single-call token pricing does not show task completion cost.

How can WisGate help teams test this pattern?

WisGate's unified OpenAI-style API interface helps teams place different models into the same worker, advisor, reviewer, and fallback workflow, then compare success rate, cost, and routing strategy.