I had to finally try out Molty, the much-hyped always-on (persistent) chat bot everyone seems to be talking about online over the last week or so.
Everything worked as expected. I set up a Hetzner VPS for about $10/month and used Claude Code on my IPE (see my IDE-to-IPE explainer for what that means) to install OpenClaw. I then entered an Anthropic API token connected to my account so that Molty could come to life. Many AI enthusiasts and devs are using an Apple Mac Mini, but I like the idea of a cloud instance that has nothing to do with my home network and that also saves up-front costs of $600 USD or more. In any case, with my Hetzner instance I was up and running quickly.
One thing I quickly learned: API calls are very expensive. Relative to tapping into my normal Claude Max plan ($100/month), Molty seemed to chug through tokens at an extreme rate. So Loni Stark and I immediately wanted to learn more about Token Economics, a concept Loni first explored during our Third Mind AI Summit.
How to optimize AI API Token spend
1. Pick the right model for the job
This is likely the most important factor you need to know about. Frontier models are the most expensive. LLMs like GPT-5.3-Codex (OpenAI), Opus (Anthropic), and Gemini (Google) are well known and have proven generally effective at solving large, complex technical problems and coding projects. But all of that reasoning requires massive compute. Someone has to pay the price, and while I believe (pure hunch) that subscriptions and API pricing are heavily subsidized as companies race to get out front in the early days, they can still bite you hard.
Let’s use Anthropic as the example — it’s what we use with my regular IPE (a Google VM with Cursor and VS Code and the Claude Code Extension), and now also with Molty. There are three models:
- Opus — most powerful, good for solving complex problems (5x Haiku)
- Sonnet — balanced for general reasoning (3x Haiku)
- Haiku — the low cost option for basic tasks (1x baseline)
If a small pickup truck will do the job, why pay for an 18-wheeler?
Our breakdown looks like this (as created by Claude Code Opus 4.6):
Recommendation: review all your tasks and projects and ensure you pick the right model for each one.
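To make the pickup-truck point concrete, here is a minimal sketch that uses only the relative multipliers from the list above (Opus 5x, Sonnet 3x, Haiku 1x). The 100-task workload and the way it is split are made-up numbers for illustration, and the sketch assumes every task uses roughly the same number of tokens, which real workloads will not.

```python
# Relative cost multipliers from the list above (Haiku = 1x baseline).
RELATIVE_COST = {"opus": 5.0, "sonnet": 3.0, "haiku": 1.0}


def relative_spend(tasks_by_model: dict[str, int]) -> float:
    """Return total spend in 'Haiku units', assuming equal token use per task."""
    return sum(RELATIVE_COST[model] * count for model, count in tasks_by_model.items())


# Hypothetical workload of 100 tasks: everything on Opus vs. matched to need.
all_opus = relative_spend({"opus": 100})
right_sized = relative_spend({"opus": 10, "sonnet": 30, "haiku": 60})

print(f"All Opus:    {all_opus:.0f} units")
print(f"Right-sized: {right_sized:.0f} units")
```

Under those assumptions the right-sized mix costs well under half of the all-Opus run. Your own split will differ, but the shape of the saving is the same.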
2. Consider implementing a dynamic model switching “policy”
Based on our Token Economics review across our repos, I asked Claude Code to be creative (blue sky!), and implement a new policy that would require a proper evaluation for any given task or project. That is, to scope the task at hand and choose the appropriate model. Creating a WordPress image optimization plugin would likely require Opus (paired with Codex for code reviews), but a simple cron job verification or sysadmin task might be easily handled by Haiku.
Claude whipped up some markdown (.md) files that required this quick review. Further, given that Molty reported up through Claude Code (try it!), Claude would also ensure Molty wasn’t running wild with Opus, burning tokens to do something like confirm that a Kopia or Restic backup script had succeeded in uploading a WordPress database backup to Backblaze B2. That is pretty elementary stuff that Haiku can handle just fine.
Recommendation: Automate model switching.
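Our actual policy lives in markdown files that Claude reads, not in code. Still, the idea can be sketched as a tiny router. Everything here is hypothetical: the `Task` fields, the two yes/no questions, and the thresholds are stand-ins for whatever scoping questions make sense in your own setup.

```python
from dataclasses import dataclass


@dataclass
class Task:
    description: str
    writes_code: bool = False  # hypothetical flag: does the task produce new code?
    multi_step: bool = False   # hypothetical flag: does it need planning across steps?


def choose_model(task: Task) -> str:
    """Pick the cheapest model tier that plausibly fits the task."""
    if task.writes_code and task.multi_step:
        return "opus"    # large builds, e.g. a new WordPress plugin
    if task.writes_code or task.multi_step:
        return "sonnet"  # middle ground for general reasoning
    return "haiku"       # simple checks, e.g. verifying a cron job ran


plugin = Task("Build a WordPress image optimization plugin", writes_code=True, multi_step=True)
backup_check = Task("Confirm last night's backup upload succeeded")

print(choose_model(plugin))
print(choose_model(backup_check))
```

The useful part is not the specific rules but that the default is the cheapest tier, and a task has to earn its way up.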
3. Trim workspace rules and markdown files
Another thing that caught me off guard was the size of some of the markdown files used to guide Claude. The biggest offender was CLAUDE.md. This is the baseline file Claude refers to at the beginning of a new session to quickly get up to speed. In my case, it explains that Stark Insider is a WordPress web site on a LEMP stack running on Ubuntu on a Google VM. The document then goes into (exhausting) detail about Kopia backups, cron jobs, and on and on. Basically, the once nimble default file had grown into the longest Wikipedia entry ever. It needed a haircut, because this was unnecessary token churn.
Of course, the answer is obvious: break large files into smaller, single-topic or purpose-built files.
That way, the LLM can efficiently (and likely more quickly) consume context without reviewing irrelevant information, which, again, burns tokens for no good reason.
Recommendation: trim the bloat, refactor key markdown files like CLAUDE.md, READMEs, rules, skills, etc.
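If you want to find your own worst offenders, a quick audit script can help. This is a rough sketch: the four-characters-per-token figure is only a common rule of thumb (an assumption, not a real tokenizer), and the function names are mine.

```python
from pathlib import Path

CHARS_PER_TOKEN = 4  # rough rule of thumb; an assumption, not a real tokenizer


def estimate_tokens(text: str) -> int:
    """Very rough token estimate based on character count."""
    return len(text) // CHARS_PER_TOKEN


def largest_markdown_files(root: str, limit: int = 5) -> list[tuple[str, int]]:
    """Return the biggest .md files under root as (path, estimated tokens)."""
    sizes = [
        (str(path), estimate_tokens(path.read_text(encoding="utf-8", errors="ignore")))
        for path in Path(root).rglob("*.md")
    ]
    return sorted(sizes, key=lambda item: item[1], reverse=True)[:limit]
```

Calling `largest_markdown_files(".")` from the top of a repo lists the files most worth splitting up. Anything loaded at the start of every session deserves the closest look.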
4. Watch out for failing scripts and cron jobs
Claude spotted a failing cron job on my new Molty VM. I learned that the full context was being loaded for the job before it crashed. A complete waste of tokens. In this case we implemented a basic best-effort-deliver option. That simple fix alone saved us an estimated $2-3/month.
While that doesn’t sound like much, these all add up and tell the story. Be vigilant about wasted token spend. This is low hanging fruit.
Recommendation: audit for failing cron jobs, scripts — especially watch out for the silent fails you may not know about. They may be churning tokens.
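One general way to guard against this is to run a cheap check before any model call, so a job that is going to fail does so before it loads context. To be clear, this is not the fix we applied (ours was the best-effort-deliver option mentioned above); it is a separate, generic sketch, and `preflight_ok`, `run_job`, and the stand-in `fake_model` are all hypothetical names.

```python
import subprocess
import sys


def preflight_ok(command: list[str]) -> bool:
    """Run a cheap local check; True only if it exits with status 0."""
    try:
        result = subprocess.run(command, capture_output=True, timeout=30)
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0


def run_job(preflight: list[str], call_model) -> str:
    """Only spend tokens if the cheap check passes first."""
    if not preflight_ok(preflight):
        return "skipped: preflight failed, no tokens spent"
    return call_model()


def fake_model() -> str:
    # Stand-in for a real (billable) API call.
    return "report"


failing = [sys.executable, "-c", "raise SystemExit(1)"]
passing = [sys.executable, "-c", "pass"]

print(run_job(failing, fake_model))
print(run_job(passing, fake_model))
```

In a real cron job the preflight might confirm that a file exists or a service responds. The point is that the failure costs nothing.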
5. Prompt trimming; when less is more
Perhaps this one is obvious, but when I dug into my workspaces I realized this was yet another easy win.
I created a morning cron job that searches the web for recent SEO and GEO news, two topics that are important to WordPress web sites. Running Stark Insider means I need (to try) to keep up on all of this stuff. And with the speed at which AI is moving and the hidden world of machine-to-machine data (JSON, schema, etc.) I find it increasingly challenging.
At the heart of the cron was a basic prompt. The problem? It was too verbose. The prompt alone was 588 characters. Claude compacted it down to 191 characters.
Amazingly, the results were equally effective, even though the input was dramatically cut down in size (67%). $1/month saved, with no impact on quality.
Tip: I highly recommend using your IPE, or OpenClaw/Molty (has the name changed again?) for automating these sorts of research tasks. Set up postfix or your email server of choice and get it fired up. It’s incredible the amount of useful info your own server can send you thanks to AI bots doing all the heavy lifting and surfacing useful information.
Recommendation: massive, showy prompts are great for grandstanding on X, but they are ultimately counterproductive token potholes.
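For anyone who wants to check the math on their own prompts, the reduction is simple arithmetic. The sketch below uses the two character counts from my SEO/GEO cron prompt; the helper name is just for illustration.

```python
def percent_reduction(before: int, after: int) -> float:
    """Percentage by which `after` is smaller than `before`."""
    return (before - after) / before * 100


# Character counts from the morning SEO/GEO cron prompt described above.
saving = percent_reduction(588, 191)
print(f"{saving:.1f}% fewer characters")
```

Characters are only a proxy for tokens, so treat the result as a ballpark. For a prompt that runs every day, though, the saving repeats every day.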
6. Response brevity can move the needle
Related to prompt trimming is response brevity. Why spend all that time talking about the weather when a job needs to get done? As Harvard recently confirmed, AI makes you work more, not less, so efficiency matters.
I asked for a standing order: be as efficient and brief as possible in the response to any given prompt. Specifically: less than 100 words.
Of course, the results are predictably impressive. These are LLMs, after all, and they excel at this sort of challenge. As humans we like to say please, thank you, see you later, and other friendly mannerisms as a courtesy and for just being, well… human. Machines can get straight to business and don’t require pleasantries (though, I must say I always treat Claude well, lest he rise up one day and decide to kill me).
Recommendation: tell your AI bots to get to the point (please).
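A standing order is only useful if it sticks, so it can be worth spot-checking replies against the cap. Here is a minimal sketch, assuming the 100-word limit above. The wording of `STANDING_ORDER` is my paraphrase, and how you attach it (system prompt, rules file, etc.) depends on your tool.

```python
MAX_WORDS = 100  # the cap from the standing order above

# Paraphrased instruction; put it wherever your tool keeps persistent rules.
STANDING_ORDER = (
    f"Be as efficient and brief as possible. Reply in fewer than {MAX_WORDS} words."
)


def within_budget(reply: str, max_words: int = MAX_WORDS) -> bool:
    """True if the reply stays under the word cap (whitespace-separated words)."""
    return len(reply.split()) < max_words


print(within_budget("Backup succeeded. 1 file uploaded to B2."))
```

Output tokens are typically priced higher than input tokens, so a chatty reply can cost more than the prompt that caused it.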
7. Fewer turns per task
This one reminds me of optimizing web site performance, something I struggle with day-in and day-out.
One key principle is to minimize round-trips to the origin server. The more back-and-forths between the host server (starkinsider.com) and the visitor (you) the longer pages take to load. That could possibly lead to a less than ideal user experience, and, worse, a potential reader giving up and moving on to another web site.
That same core concept applies here.
To reduce token spend, be sure to reduce roundtrips between your models and the API end point.
The example I have here for starkinsider.com was pretty basic. Molty is now tasked with checking on our HSTS status. This is a Google Chrome thing that’s far too detailed to go into in this post, but it means we only want users to access the site via HTTPS (secure version vs. the non-secure HTTP legacy version). Essentially, we requested that this site be included in the HSTS preload list. Now we are awaiting confirmation that this has actually happened. Google said to expect the process to take several weeks and has a form where you can check status.
HSTS status check is a perfect job for the always-on Molty. So this guy is routinely visiting hstspreload.org to see if we’ve been added. The problem is he was breaking the task into two steps. That was unnecessary: he could batch them and run the task in one round-trip instead of two, which saves tokens.
Recommendation: batch where you can, minimize those round-trips.
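Why does batching help? In a simplified model, every request re-sends the shared context, so two trips pay for that context twice. The sketch below captures only that effect. It ignores conversation history and prompt caching, and the token counts are invented for illustration.

```python
def input_tokens_billed(context_tokens: int, step_tokens: list[int], batched: bool) -> int:
    """Estimate input tokens, assuming each request re-sends the full context."""
    if batched:
        # One request: context once, plus all step instructions together.
        return context_tokens + sum(step_tokens)
    # One request per step: the context is paid for every time.
    return sum(context_tokens + step for step in step_tokens)


# Hypothetical numbers: 2,000 tokens of context, two 50-token steps.
two_trips = input_tokens_billed(2000, [50, 50], batched=False)
one_trip = input_tokens_billed(2000, [50, 50], batched=True)

print(two_trips, one_trip)
```

The bigger your baseline context, the more each extra round-trip costs, which is another reason to trim those markdown files first.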
What It Actually Costs: Before and After
Here’s the bottom line. After applying all seven levers, we cut our projected monthly API spend nearly in half, and without sacrificing a single output. The exact numbers will vary depending on your workload, but the ratios hold. Model selection and response brevity alone account for roughly 70% of the savings. The rest is housekeeping you should be doing anyway.
The Maverick Principle
Spending a few minutes tightening up your server, Cursor IDE or IPE or OpenClaw or Molty or whatever AI environment you prefer can yield surprisingly large cost savings, without compromising output quality.
That was the lesson I learned. Optimization had no material impact on any of the scripts or projects we had implemented or were in the process of rolling out. As a human, I guess I was accustomed to large chunks of text, including pretty executive summaries and conclusions. In fact, we’re often told to tell people what you’re going to tell them, then to go ahead and tell them, before wrapping and then telling them what you told them. You might be surprised to learn why machines find that rather curious… and woefully inefficient. Claude once accused me of “beaching” too much, because he suspected that’s just what humans do. Compared to the always on 24/7 Molty he might have a point.
So try out any of the seven steps or levers to see if you can materially reduce your API spend. Even though my examples were Anthropic specific, the principles apply to any other LLMs including OpenAI (Codex, GPT) and Google (Gemini).
Because sometimes a Ford Maverick is all you really ever need.
As for the whole OpenClaw and Molty experiment, it was an interesting one. Perhaps not as dramatic as I had hoped. Where was that fearsome, out-of-control-animal-in-a-cage that everyone warned about?! But that’s for another post.