How do I reduce AI API costs by choosing the right model?

Match model capability to task complexity. Use frontier models like Opus or GPT-5 only for complex reasoning and architecture decisions (about 20% of tasks). Use mid-tier models like Sonnet for general work (60%), and lightweight models like Haiku for basic tasks like monitoring and status checks (20%). This single lever can reduce model costs by 40-60%.

What is dynamic model switching for AI API optimization?

Dynamic model switching is a policy that automatically evaluates each task's complexity and assigns the appropriate AI model. Instead of manually choosing between Opus, Sonnet, or Haiku for every request, the system scopes the task and selects the most cost-effective model that can handle it. This prevents drift back to expensive defaults and saves an estimated $2-4 per month.

How do large workspace files increase AI API token costs?

AI coding assistants like Claude Code load workspace configuration files (CLAUDE.md, AGENTS.md, rules files) at the start of every turn. Oversized files mean thousands of unnecessary tokens consumed per interaction. Trimming a workspace from 21.2 KB to 16.0 KB by breaking monolithic files into smaller, purpose-built modules can save approximately 1,200 tokens per turn and $1-2 per month.

Can failing cron jobs waste AI API tokens?

Yes. When an AI-powered cron job fails, it often loads the full context and workspace before crashing, consuming tokens for zero useful output. Fixing broken cron jobs and enabling best-effort-deliver options prevents wasted context loads and can save $2-3 per month. Audit for silent failures that may be burning tokens without any visible error.

How much can prompt trimming save on API costs?

Significant amounts. In one real-world example, a morning briefing cron prompt was reduced from 588 characters to 191 characters (67% smaller) with no loss in output quality. This saved approximately $1 per month on that single prompt alone. Multiply across all prompts in your system and the savings compound quickly.

Why does response brevity matter for AI API token economics?

Output tokens cost roughly 5x more than input tokens on most AI APIs. A verbose 400-word response costs significantly more than a concise sub-100-word response delivering the same information. Setting a standing order for brief responses can reduce output tokens by approximately 60%, saving $5-10 per month. This is often the second-largest savings lever after model selection.

How do fewer API round-trips reduce AI token spend?

Each API turn reloads the full conversation context, so every unnecessary round-trip multiplies token consumption. Batching operations into fewer turns, using fail-fast patterns, and only escalating when needed can reduce turns by approximately 30%, saving $3-5 per month. The principle is the same as minimizing HTTP round-trips for web performance.

7 Ways to Cut Your AI API Costs Nearly in Half

I finally had to try Molty, the much-hyped always-on (persistent) chatbot everyone has been talking about online for the last week or so.

Everything worked as expected. I set up a Hetzner VPS for about $10/month, used Claude Code on my IPE (see my IDE-to-IPE explainer for what that means) to install OpenClaw, and then entered an Anthropic API token connected to my account so that Molty could come to life. Many AI enthusiasts and devs are using an Apple Mac Mini, but I like the idea of having a cloud instance that has nothing to do with my home network and that also saves up-front costs of $600 USD or more. In any case, with my Hetzner instance I was up and running quickly.

One thing I quickly learned: API calls are very expensive. Relative to tapping into my normal Claude Max plan ($100/month) it seemed to chug through tokens at an extreme rate. So Loni Stark and I immediately wanted to learn more about Token Economics, a concept Loni first explored during our Third Mind AI Summit.

How to optimize AI API Token spend

1. Pick the right model for the job

This is likely the most important factor. Frontier models are the most expensive. LLMs like GPT-5.3-Codex (OpenAI), Opus (Anthropic), and Gemini (Google) are well known and proven to be generally effective on large, complex technical problems and coding projects. But all of that reasoning requires massive compute. Someone has to pay the price, and while I believe (pure hunch) that subscriptions and API pricing are heavily subsidized as companies race to get out front in the early days, the bills can still bite you hard.

Let’s use Anthropic as the example — it’s what we use with my regular IPE (a Google VM with Cursor and VS Code and the Claude Code Extension), and now also with Molty. There are three models:

Opus — most powerful, good for solving complex problems (5x Haiku)
Sonnet — balanced for general reasoning (3x Haiku)
Haiku — the low cost option for basic tasks (1x baseline)

If a small pickup truck will do the job, why pay for an 18-wheeler?
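To put a rough number on the pickup-truck idea, here is a minimal sketch that uses the relative price multipliers from the list above (5x, 3x, 1x). It assumes every task uses about the same number of tokens regardless of model, which is a simplification, and the function name is mine, not anything from Anthropic.

```python
# Relative per-token price multipliers from the list above (Haiku = 1x baseline).
RELATIVE_PRICE = {"opus": 5.0, "sonnet": 3.0, "haiku": 1.0}


def blended_price(mix: dict[str, float]) -> float:
    """Weighted-average price multiplier for a given share of traffic per model.

    Assumes each task consumes a similar number of tokens whichever model runs it.
    """
    if abs(sum(mix.values()) - 1.0) > 1e-9:
        raise ValueError("traffic shares must sum to 1.0")
    return sum(RELATIVE_PRICE[model] * share for model, share in mix.items())


before = blended_price({"opus": 1.0})
after = blended_price({"opus": 0.2, "sonnet": 0.6, "haiku": 0.2})
reduction = 1 - after / before

print(f"before={before:.1f}x  after={after:.1f}x  reduction={reduction:.0%}")
```

With a 20/60/20 split the blended multiplier drops from 5.0x to 3.0x, a 40% reduction. Shift more work to the cheaper tiers and the number climbs from there.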

Our breakdown looks like this (as created by Claude Code Opus 4.6):

The Seven Levers — Before & After

Token Economics Optimization · Mulholland IPE

Lever	Before	After	Savings
1. Right Model for the Job: Opus for everything	100% Opus; $15/$75 per MTok (in/out)	Opus 20% / Sonnet 60% / Haiku 20%; matched to task complexity	~40–60% model cost reduction; ~$8–15/month
2. Dynamic Model Switching: ad-hoc human guesswork	Manual selection; no policy, defaulted to Opus	Automated policy; task scoping → model assignment	Prevents drift back to Opus; ~$2–4/month
3. Workspace Trim: AGENTS.md loaded every turn	7,869 bytes (21.2 KB total workspace)	1,639 bytes (16.0 KB total workspace)	~1,200 fewer tokens/turn; ~$1–2/month
4. Fix Broken Crons: 2 daily jobs erroring out	Both failing; loading full context then crashing	best-effort-deliver; jobs won’t fail on delivery issues	No more wasted context loads; ~$2–3/month
5. Prompt Trimming: morning briefing cron	588 characters; verbose instructions	191 characters; same result, less input	67% smaller prompt; ~$1/month
6. Response Brevity: output tokens = 5x input cost	~400 word responses; reports were verbose	Standing order: <100 words; Molty’s test reply: 1 word ✓	~60% output reduction; ~$5–10/month
7. Fewer Turns per Task: each turn reloads full context	HSTS check: 2 turns; blocker + workaround	Batch ops, fail-fast; escalate only when needed	~30% fewer turns; ~$3–5/month

Combined estimated savings across all seven levers: ~$22–40/month

Recommendation: review all your tasks and projects and ensure you pick the right model for each one.

2. Consider implementing a dynamic model switching “policy”

Based on our Token Economics review across our repos, I asked Claude Code to be creative (blue sky!), and implement a new policy that would require a proper evaluation for any given task or project. That is, to scope the task at hand and choose the appropriate model. Creating a WordPress image optimization plugin would likely require Opus (paired with Codex for code reviews), but a simple cron job verification or sysadmin task might be easily handled by Haiku.

Claude whipped up some markdown (.md) files that required this quick review. Further, given that Molty reported up through Claude Code (try it!), Claude would also ensure Molty wasn’t running wild with Opus, burning tokens, to do something like confirm that a Kopia or Restic backup script had succeeded in uploading a WordPress database backup to Backblaze B2. Pretty elementary stuff that Haiku can handle just fine.
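Our actual policy lives in markdown files that Claude reads, not in code. Still, the idea can be sketched as a simple rule-based router. Everything here is hypothetical: the `Task` fields, the rules, and the tier names are illustrative stand-ins, not real model IDs or anything OpenClaw ships with.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    description: str
    writes_code: bool = False           # produces or refactors code
    multi_step_reasoning: bool = False  # design, architecture, tricky debugging


def choose_model(task: Task) -> str:
    """Return the cheapest tier expected to handle the task (hypothetical rules)."""
    if task.writes_code and task.multi_step_reasoning:
        return "opus"
    if task.writes_code or task.multi_step_reasoning:
        return "sonnet"
    return "haiku"


plugin = Task("Build a WordPress image optimization plugin",
              writes_code=True, multi_step_reasoning=True)
backup = Task("Confirm last night's backup upload succeeded")

print(choose_model(plugin))  # opus
print(choose_model(backup))  # haiku
```

The point of writing the rules down, whether in markdown or code, is that the default becomes the cheap tier and the expensive one has to be justified.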

Recommendation: Automate model switching.

3. Trim workspace rules and markdown files

Another thing that caught me off guard was the size of some of the markdown files used to guide Claude. The biggest offender was CLAUDE.md. This is the baseline file Claude refers to at the beginning of a new session to quickly get up to speed. In my case, it explains that Stark Insider is a WordPress web site on a LEMP stack running on Ubuntu on a Google VM. The document then goes into (exhausting) detail about Kopia backups, cron jobs, and on and on. Basically, the once nimble default file had grown into the longest Wikipedia entry ever. It needed a haircut, because this was unnecessary token churn.

Of course, the answer is obvious: break large files into smaller, single-topic or purpose-built files.

That way, the LLM can efficiently (and likely more quickly) consume context without reviewing irrelevant information, which, again, burns tokens for no good reason.
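If you want to find your own offenders, a quick audit script helps. This is a minimal sketch that assumes roughly 4 characters per token, which is a common rule of thumb for English text and not a real tokenizer, so treat the output as a ballpark only.

```python
from pathlib import Path

CHARS_PER_TOKEN = 4  # rough rule of thumb for English text; an assumption, not a tokenizer


def estimate_tokens(num_bytes: int) -> int:
    """Ballpark token count for a file of the given size."""
    return num_bytes // CHARS_PER_TOKEN


def workspace_report(root: str, pattern: str = "*.md") -> list[tuple[str, int, int]]:
    """List (file, bytes, estimated tokens) for matching files, largest first."""
    rows = []
    for path in Path(root).rglob(pattern):
        size = path.stat().st_size
        rows.append((str(path), size, estimate_tokens(size)))
    return sorted(rows, key=lambda row: row[1], reverse=True)


# Total workspace size from the levers table: 21.2 KB before, 16.0 KB after.
saved = estimate_tokens(21_200) - estimate_tokens(16_000)
print(f"~{saved} fewer tokens per turn by this heuristic")
```

By this heuristic the trim works out to about 1,300 tokens per turn, in the same ballpark as the ~1,200 in the table. Point `workspace_report()` at your own workspace directory to see which files are the heaviest.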

Recommendation: trim the bloat, refactor key markdown files like CLAUDE.md, READMEs, rules, skills, etc.

4. Watch out for failing scripts and cron jobs

Claude spotted a failing cron job on my new Molty VM. I learned that the job was loading the full context and then crashing. A complete waste of tokens. In this case we implemented a basic best-effort-deliver option. That simple fix alone saved us an estimated $2-3/month.

While that doesn’t sound like much, these all add up and tell the story. Be vigilant about wasted token spend. This is low hanging fruit.
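A complementary pattern (not the best-effort-deliver fix described above, but a different technique in the same spirit) is to run cheap local checks before the job ever touches the API. Here is a minimal sketch. The check names and `call_model` are hypothetical stand-ins for whatever your job actually does.

```python
import shutil
import sys
from typing import Callable


def preflight(checks: dict[str, Callable[[], bool]]) -> list[str]:
    """Run cheap local checks and return the names of the ones that failed."""
    failed = []
    for name, check in checks.items():
        try:
            ok = check()
        except Exception:
            ok = False  # a check that blows up counts as a failure
        if not ok:
            failed.append(name)
    return failed


def run_job(checks: dict[str, Callable[[], bool]], call_model: Callable[[], str]) -> int:
    """Call the model only if every preflight check passes; return an exit code."""
    failed = preflight(checks)
    if failed:
        # Non-zero exit so cron or monitoring surfaces the failure; no tokens spent.
        print(f"skipping model call, failed checks: {', '.join(failed)}", file=sys.stderr)
        return 1
    print(call_model())
    return 0


if __name__ == "__main__":
    checks = {"disk space": lambda: shutil.disk_usage("/").free > 50 * 1024**2}
    # call_model is a placeholder; swap in your real API call.
    run_job(checks, call_model=lambda: "(model response would go here)")
```

The job still fails when something is broken, but it fails loudly and before any context is loaded.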

Recommendation: audit for failing cron jobs and scripts, and watch especially for the silent fails you may not know about. They may be churning tokens.

5. Prompt trimming; when less is more

Perhaps this one is obvious, but when I dug into my workspaces I realized this was yet another easy win.

I created a morning cron job that searches the web for recent SEO and GEO news, two topics that are important to WordPress web sites. Running Stark Insider means I need (to try) to keep up on all of this stuff. And with the speed at which AI is moving and the hidden world of machine-to-machine data (JSON, schema, etc.) I find it increasingly challenging.

At the heart of the cron was a basic prompt. The problem? It was too verbose. The prompt alone was 588 characters. Claude compacted it down to 191 characters.

Amazingly, the results were equally effective, even though the input was dramatically cut down in size (67%). $1/month saved, with no impact on quality.
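The arithmetic is simple enough to script, which is handy if you want to compare old and new versions of several prompts. A minimal sketch (the function name is mine):

```python
def reduction(before_chars: int, after_chars: int) -> float:
    """Fraction of the original prompt that was removed."""
    if before_chars <= 0:
        raise ValueError("before_chars must be positive")
    return 1 - after_chars / before_chars


# Character counts from the morning briefing prompt above.
print(f"{reduction(588, 191):.1%} smaller")  # 67.5% smaller
```

For real prompts, pass `len(old_prompt)` and `len(new_prompt)` instead of the hard-coded counts.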

Tip: I highly recommend using your IPE, or OpenClaw/Molty (has the name changed again?) for automating these sorts of research tasks. Set up postfix or your email server of choice and get it fired up. It’s incredible the amount of useful info your own server can send you thanks to AI bots doing all the heavy lifting and surfacing useful information.

Recommendation: massive, showy prompts are great for grandstanding on X, but they are ultimately counterproductive token potholes.

6. Response brevity can move the needle

Related to prompt trimming is response brevity. Why spend all that time talking about the weather when a job needs to get done? As Harvard recently confirmed, AI makes you work more, not less, so efficiency matters.

I asked for a standing order: be as efficient and brief as possible in the response to any given prompt. Specifically: less than 100 words.

Of course, the results are predictably impressive. These are LLMs, after all, and they excel at this sort of challenge. As humans we like to say please, thank you, see you later, and other friendly mannerisms as a courtesy and for just being, well… human. Machines can get straight to business and don’t require pleasantries (though, I must say I always treat Claude well, lest he rise up one day and decide to kill me).
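To see why output length matters, here is a rough per-reply cost calculation. It assumes about 4/3 tokens per English word (a rule of thumb, not a tokenizer) and uses the $75 per MTok Opus output price from the levers table. Swap in your own model's price.

```python
TOKENS_PER_WORD = 4 / 3  # rough English average; an assumption


def reply_cost_usd(words: int, output_price_per_mtok: float) -> float:
    """Approximate output cost of one reply of the given length."""
    tokens = words * TOKENS_PER_WORD
    return tokens * output_price_per_mtok / 1_000_000


verbose = reply_cost_usd(400, 75.0)
brief = reply_cost_usd(100, 75.0)

print(f"400 words: ${verbose:.4f}  100 words: ${brief:.4f}")
```

That is roughly 4 cents versus 1 cent per reply. Pennies, sure, but an always-on bot replies a lot.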

Recommendation: tell your AI bots to get to the point (please).

7. Fewer turns per task

This one reminds me of optimizing web site performance, something I struggle with day-in and day-out.

One key principle is to minimize round-trips to the origin server. The more back-and-forths between the host server (starkinsider.com) and the visitor (you) the longer pages take to load. That could possibly lead to a less than ideal user experience, and, worse, a potential reader giving up and moving on to another web site.

That same core concept applies here.

To reduce token spend, be sure to reduce round-trips between your models and the API endpoint.

The example I have here for starkinsider.com was pretty basic. Molty is now tasked with checking on our HSTS status. This is a Google Chrome thing that’s far too detailed to go into in this post, but it means we only want users to access the site via HTTPS (the secure version, as opposed to the legacy non-secure HTTP version). Essentially, we requested that this site be included in the HSTS preload list. Now we are awaiting confirmation that this has actually happened. Google said to expect the process to take several weeks and has a form where you can check status.

An HSTS status check is a perfect job for the always-on Molty. So this guy is routinely visiting hstspreload.org to see if we’ve been added. The problem is he was breaking the task into two steps. That was unnecessary, as he could instead batch them and run the task in one round-trip instead of two, saving tokens.
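Here is a minimal sketch of why the second turn costs more than it looks. It assumes every turn resends the base context plus everything said so far, and it ignores prompt caching, which some APIs offer and which would soften the effect. The token counts are made-up illustrative numbers, not measurements from Molty.

```python
def total_input_tokens(base_context: int, per_turn_added: int, turns: int) -> int:
    """Input tokens billed when each turn resends the base context plus all prior turns."""
    total = 0
    history = 0
    for _ in range(turns):
        total += base_context + history
        history += per_turn_added
    return total


# Hypothetical sizes: a 4,000-token workspace context, 500 tokens added per turn.
one_turn = total_input_tokens(base_context=4000, per_turn_added=500, turns=1)
two_turns = total_input_tokens(base_context=4000, per_turn_added=500, turns=2)

print(f"1 turn: {one_turn} input tokens, 2 turns: {two_turns} input tokens")
```

Under these assumptions, splitting the check into two turns more than doubles the input tokens (4,000 versus 8,500), because the second turn pays for the context all over again plus the history.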

Recommendation: batch where you can, minimize those round-trips

What It Actually Costs: Before and After

Here’s the bottom line. After applying all seven levers, we cut our projected monthly API spend nearly in half without sacrificing a single output. The exact numbers will vary depending on your workload, but the ratios hold. Model selection and response brevity alone account for roughly 60% of the savings ($13–25 of the $22–40 per month in the levers table). The rest is housekeeping you should be doing anyway.

Monthly Cost Projection — Before & After

Molty (OpenClaw) on Hetzner VPS · All 7 Levers Applied

Component	Before	After
API TOKEN SPEND
Daily cron jobs: briefings, monitoring, HSTS checks	$6–9; Opus, verbose, failing	$2–4; Haiku, trimmed, stable
Conversations & chat: WhatsApp, interactive queries	$15–25; Opus, ~400 word replies	$6–11; Sonnet, <100 word replies
Task assignments: server checks, research, reports	$5–8; multi-turn, full context	$2–4; batched, fewer turns
Heartbeat & background: keep-alive, idle context	$2–4; full workspace each ping	$1–2; trimmed workspace
API Subtotal	$28–46	$11–21
INFRASTRUCTURE
Hetzner VPS: CPX11 (2 vCPU, 2 GB RAM)	$8.49; fixed cost	$8.49; fixed cost
Total Monthly Cost	$36–54	$19–29

Estimated savings after applying all 7 levers: ~45–50%
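If you want to rerun the projection with your own estimates, the table boils down to a few sums. This sketch copies the low and high figures from the table above. The dictionary layout and function name are mine.

```python
# (low, high) monthly USD estimates copied from the projection table above.
API_SPEND = {
    "cron jobs": {"before": (6, 9), "after": (2, 4)},
    "conversations": {"before": (15, 25), "after": (6, 11)},
    "task assignments": {"before": (5, 8), "after": (2, 4)},
    "heartbeat": {"before": (2, 4), "after": (1, 2)},
}
VPS_MONTHLY = 8.49  # fixed cost, unchanged by the levers


def monthly_total(phase: str) -> tuple[float, float]:
    """Return the (low, high) total monthly cost for 'before' or 'after'."""
    low = sum(row[phase][0] for row in API_SPEND.values()) + VPS_MONTHLY
    high = sum(row[phase][1] for row in API_SPEND.values()) + VPS_MONTHLY
    return low, high


before = monthly_total("before")
after = monthly_total("after")
savings = tuple(1 - a / b for a, b in zip(after, before))

print(f"before ${before[0]:.2f}-{before[1]:.2f}, after ${after[0]:.2f}-{after[1]:.2f}")
print(f"savings {savings[0]:.1%} (low end) to {savings[1]:.1%} (high end)")
```

Both ends of the range land at about 46%, which is where the "nearly in half" comes from. Note the fixed VPS cost drags the percentage down. On API spend alone the cut is deeper.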

The Maverick Principle

Spending a few minutes tightening up your server, Cursor IDE, IPE, OpenClaw, Molty, or whatever AI environment you prefer can yield surprisingly large cost savings without compromising output quality.

That was the lesson I learned. Optimization had no material impact on any of the scripts or projects we had implemented or were in the process of rolling out. As a human, I guess I was accustomed to large chunks of text, including pretty executive summaries and conclusions. In fact, we’re often told to tell people what we’re going to tell them, then tell them, and then wrap up by telling them what we told them. You might be surprised to learn that machines find that rather curious… and woefully inefficient. Claude once accused me of “beaching” too much, because he suspected that’s just what humans do. Compared to the always-on, 24/7 Molty, he might have a point.

So try out any of the seven steps or levers to see if you can materially reduce your API spend. Even though my examples were Anthropic specific, the principles apply to any other LLMs including OpenAI (Codex, GPT) and Google (Gemini).

Because sometimes a Ford Maverick is all you really ever need.

As for the whole OpenClaw and Molty experiment, it was an interesting one. Perhaps not as dramatic as I had hoped. Where was that fearsome, out-of-control-animal-in-a-cage that everyone warned about?! But that’s for another post.
