Bot or Not: Navigating the New World of AI Content Crawlers

Why Your Content Is Being Digested by Machines and What You Can Do About It

BY Clinton Stark — 07.23.2025

Cartoon trading-card bots (ClaudeBot, GPTBot, ByteSpider, PerplexityBot) swooping toward the Stark Insider server while a stressed publisher holds his head.

It started as a routine server log check. I occasionally (reluctantly) take a trip down to the IT Dungeon to see what’s going on and if starkinsider.com has any issues to address. After 20 years of running this site, most of it on WordPress, I’ve learned that, well… there’s always something wrong. Ah, yes, the joys of self-hosting.

So it was supposed to be one of those “I’ll just peek for five minutes” things that always turn into a 2 a.m. rabbit hole.

Dozens of bots. Names I’d never seen before. User agents that read like a roll call at an AI startup convention.

Googlebot? Sure. Bingbot? Fine. But ClaudeBot? GPTBot? Bytespider? YisouSpider? And something cryptically named “YouBot” that sounded either friendly or vaguely threatening.

Suddenly my NGINX logs looked like a Pokémon deck. Only these little dudes weren’t here to cuddle; they were here to feed. On my (our!) content and data.

My first thought: We’re under attack.

My second thought: Wait, is this… normal now?

“The Web isn’t dead. It’s just being quietly digested.”

Welcome to 2025, where the audience is half human, half machine. Search, as in typing something into a little box at google.com, is morphing into something else: AI agents, LLMs, and a swarm of crawlers hoovering up content to train on, summarize, and—if we’re lucky—attribute.

This is the story of how I went bot hunting, learned to sort the saints from the sinners, and why this matters if you publish anything on the internet… especially if you’d like to be seen (and credited) in the post-Google era.

The Discovery: “So… many… user-agents”

This is where your server logs become a bot convention. In the old days it was mostly our beloved Googlebot, crawling and discovering content for users to find on google.com. Times, though, are changing, and rapidly too.

With the help of Claude again, I ran a quick grep across the access logs, expecting the usual suspects. Instead I found dozens of new (to me) bots. Some announced themselves proudly. Others disguised themselves as browsers or “curl/7.64.1” because apparently this is the new fake driver’s license.
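If you want to try this at home, a minimal version of that grep session looks something like this. It assumes NGINX’s default “combined” log format, where the user agent is the last double-quoted field; the log path is illustrative:

```shell
#!/bin/sh
# Tally requests per user agent from an NGINX "combined" format
# access log (the user agent is the 6th double-quote-delimited field).
ua_counts() {
  awk -F'"' '{ print $6 }' "$1" | sort | uniq -c | sort -rn
}

# Usage (path illustrative):
# ua_counts /var/log/nginx/access.log | head -20
```

The output is a ranked list of user agents by hit count, which is usually all you need to spot a newcomer hammering the site.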

Pattern spotting ensued:

  • Content focus: Which sections of Stark Insider were they hammering? (Culture? Tech? That random pomegranate post?)
  • Crawl cadence: Midnight burst scrapes vs polite daylight indexing.
  • Robots.txt manners: Do they read and respect it? Or treat directives merely as polite suggestions?

I dumped everything into buckets—good, meh, nope. Then piped those bad IPs and user agents into Fail2Ban, because yes, sometimes you need a digital bouncer with brass knuckles.

Here’s what a typical day used to look like:

  • Googlebot: 40% of bot traffic
  • Bingbot: 15%
  • Various social media crawlers: 20%
  • Random scrapers and spam bots: 25%

Here’s what I was seeing now:

  • Traditional search bots: 30%
  • AI training bots: 45%
  • Mystery bots with cryptic purposes: 25%

The landscape had shifted while I wasn’t looking. Like discovering your quiet neighborhood had become Times Square overnight.

Intent Matters: What do these bots really want?

Let’s be blunt: most AI/LLM crawlers want training data. They’re building models. That can be good (discovery, citations) or bad (zero credit, zero traffic, thanks for the free lunch). Others are SEO tools (Ahrefs is one I find useful), price scrapers, shady aggregators, or just sloppy code pinging everything that moves.

With the help of Claude I was able to figure out that one major bot operation was using almost 800 different IP addresses as part of a sophisticated scraping mission. By spreading hits across multiple IPs, the offender could fly under the radar: my Fail2Ban rules would not flag log entries amounting to just one or two hits per IP. Add them up, however, and a quick strike could net approximately 1,600 page/content scrapes (2 hits x 800 IPs), and I would be none the wiser.
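Catching that kind of distributed crawl means aggregating the other way around: not hits per IP, but distinct IPs per user agent. A sketch of the idea, again assuming the “combined” log format (paths illustrative):

```shell
#!/bin/sh
# A per-IP ban threshold misses a bot spreading 1-2 hits across
# hundreds of IPs. Counting distinct IPs per user agent makes
# the swarm obvious.
distinct_ips_per_ua() {
  # emit one "<user agent>\t<ip>" pair per request, dedupe,
  # then count how many distinct IPs each user agent used
  awk -F'"' '{ split($1, a, " "); print $6 "\t" a[1] }' "$1" \
    | sort -u | cut -f1 | sort | uniq -c | sort -rn
}

# Usage (path illustrative):
# distinct_ips_per_ua /var/log/nginx/access.log | head -20
```

A user agent at the top of this list with hundreds of IPs and only a trickle of hits per IP is almost certainly one operation wearing many masks.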

SEE ALSO: The Louvre Symposium: Imagining the Artist’s Identity in the Age of AI (Stark Insider)

In any case, I pulled three months of logs and started categorizing. Each bot left fingerprints in the form of user agent strings, crawl patterns, request frequencies. Some were polite, respecting robots.txt like well-mannered dinner guests. Others? Not so much.

The worst bots, as expected, would spoof UA strings, slap on a cheap digital wig, and slip into the Stark Insider party.

Some patterns emerged:

The Polite Ones: Announced themselves clearly, provided documentation links, crawled at reasonable rates. Like neighbors who knock before entering. (Googlebot, Bingbot, Anthropic/Claude)

The Aggressive Ones: Hit the server like they were trying to download the entire internet before lunch. No rate limiting, no respect for server resources. (Bytespider, Bytedance)

The Mysterious Ones: Vague user agents, no documentation, crawling patterns that made no sense. Digital ghosts wandering through our content.

I asked Claude to create a daily bot-report script. You might be surprised by the level of intel sitting on your server, even for small sites. Before generative AI, creating these sorts of things would have been nearly impossible, or at best a worthless time sink. I also asked Claude to alert me to new, never-before-seen bots. This is what that looks like:

Bot Summary Table (Partial)

Screenshot of daily bot summary table showing hit counts and trend arrows for Lighthouse, AhrefsBot, Googlebot, etc.
Daily crawl report highlighting which bots hit Stark Insider. Script created by Claude based on my requirements and tuning.

Spotting New Bots

Screenshot of an email alert listing newly detected bots like DotBot and AliyunSecBot.
Automated email alert flags never-before-seen crawlers.

Getting these email reports daily really helped me begin to understand the cadence, goals, and patterns of these companies’ crawlers.
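The core of that alert is nothing exotic: keep a running list of every user agent you’ve ever seen, diff today’s against it, and mail the difference. A stripped-down sketch (file paths are illustrative, and the mail step is left as a comment):

```shell
#!/bin/sh
# Report user agents appearing in today's log that have never been
# seen before, then fold them into the running "known" list.
KNOWN=/tmp/known_uas.txt   # one user agent per line, grows over time

report_new_bots() {
  touch "$KNOWN"
  awk -F'"' '{ print $6 }' "$1" | sort -u > /tmp/today_uas.txt
  # lines unique to today's list = never-before-seen user agents
  comm -13 "$KNOWN" /tmp/today_uas.txt > /tmp/new_uas.txt
  cat /tmp/new_uas.txt
  # fold today's sightings into the known list for next time
  sort -u "$KNOWN" /tmp/today_uas.txt > "$KNOWN.new" && mv "$KNOWN.new" "$KNOWN"
  # a real setup would pipe /tmp/new_uas.txt to mail(1) here
}
```

Run it from cron once a day and the “new bots” email writes itself.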

As a small publisher, I had one primary concern…

Attribution: Where do our words actually surface?

Example: “Is the Sony A6000 still worth it?”

This is an SI evergreen that regularly shows up in AI answers, and in traditional Google searches as well.

  • ChatGPT often summarizes and links (sometimes).
  • Claude tends to quote and inline attribute (chef’s kiss).
  • Others… reference without attribution.

Point being: we want to be in those answers no doubt. But we also want a name credit, a link, a whisper of “Stark Insider” somewhere. That’s the give-and-take of this new world. So I:

  • Explicitly listed allowed bots in robots.txt
  • Added an AI-Index endpoint with usage terms
  • Started tracking which articles LLMs seem to love
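For the curious, the relevant slice of that robots.txt looks roughly like this. It’s a trimmed, illustrative version; the bot tokens come from each vendor’s published docs, and remember that compliance with robots.txt is entirely voluntary:

```
# robots.txt — explicit allow/deny for AI crawlers (illustrative)
User-agent: Googlebot
Allow: /

User-agent: GPTBot
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: Bytespider
Disallow: /

User-agent: *
Allow: /
```

The polite bots honor it. The rest are Fail2Ban’s problem.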

Good Bots vs Bad Bots: A Summary

*At least according to me from what I know so far

GOOD BOTS (let ’em in)

  • Googlebot – search
  • bingbot – search
  • ChatGPT-User / GPTBot / OAI-SearchBot – AI/LLM (generally cite)
  • Claude-Web / anthropic-ai – AI/LLM (polite, attribution-friendly)
  • PerplexityBot – AI/LLM (solid citations)
  • AhrefsBot – SEO tool you use/benefit from
  • SemrushBot – SEO tool you use/benefit from
  • Applebot – Apple services/search previews
  • Lighthouse – performance audits/Core Web Vitals
  • Yahoo! Slurp – legacy search, harmless at low volume
  • DuckDuckBot – privacy search engine crawler

BAD BOTS (block, jail, rate-limit hard)

  • MJ12bot (and clones) – low value, heavy crawl
  • AliyunSecBot – security scanner style hits
  • Bytespider (aggressive variants) – high-volume scraping
  • YisouSpider – unknown intent, heavy hitter
  • robot / SearchBot / feedbot (generic UAs) – too vague, often spoofed
  • python / wget / curl – scripted scrapers pretending to be browsers
  • BitSightBot
  • Thinkbot
  • Flyriverbot
  • SEBot
  • SurdotlyBot
  • WebwikiBot
  • StartmeBot
  • trendictionbot
  • ImagesiftBot – image scraping, unclear purpose
  • YandexRenderResourcesBot – resource grabber, not needed

It’s like watching your words get absorbed into a massive digital consciousness. Your content becomes part of the training data, the collective knowledge, but your name? That’s optional, apparently.

As venture capitalist Marc Andreessen once noted, “Software is eating the world.” Well, now AI is eating the software that ate the world. Meta, isn’t it?

Implementation: Bring in the digital judge & jury (Fail2Ban aka the bouncer)

Fortunately, thanks to ChatGPT, I learned about something called fail2ban. This is the bouncer that stands outside the club door, frowning at everyone and keeping the riffraff from entering. It’s free and open source, and installs with a single line. You can just let it run with the default configuration, or tweak as needed. For instance: I whitelist Amazon CloudFront because we use that CDN to serve content at the edge (images, js, css mainly), closer to end users.

Example Fail2Ban jail configuration rules for blocking bad bots on Stark Insider.
Example Fail2Ban jail used to throttle and ban abusive crawlers.
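For anyone who wants to replicate it, here’s a minimal sketch of what such a jail might look like. Bot names, thresholds, and paths are illustrative, and the escalating ban times come from Fail2Ban’s stock recidive jail rather than anything custom:

```ini
# /etc/fail2ban/filter.d/nginx-badbots.conf (illustrative)
[Definition]
failregex = ^<HOST> .* "[^"]*(?:Bytespider|YisouSpider|MJ12bot)[^"]*"\s*$

# /etc/fail2ban/jail.local (illustrative)
[nginx-badbots]
enabled  = true
port     = http,https
filter   = nginx-badbots
logpath  = /var/log/nginx/access.log
maxretry = 2
findtime = 600
bantime  = 600           ; 10-minute first offense
ignoreip = 127.0.0.1/8   ; add your CDN ranges (e.g. CloudFront) here
```

The `ignoreip` line is the one that saves you from accidentally banning your own CDN.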

Over several weeks, including lots of trial and error, discussions with Loni, and feedback from Claude, I landed on this basic classification system:

VIP Access (Never Block):

  • Googlebot (still pays some of the bills)
  • Bingbot (Microsoft’s scrappy underdog)
  • ClaudeBot (Anthropic plays nice)
  • GPTBot (OpenAI, despite the attribution issue)

Monitored Access (Trust but Verify):

  • Perplexity (good with attribution)
  • Various academic crawlers
  • Legitimate monitoring services

The Banned List (Jail time!):

  • Aggressive scrapers hitting 1000+ pages/minute
  • Bots ignoring robots.txt
  • Anything from known bot farms

The fail2ban rules read like a judicial system. First offense? 10-minute timeout. Second offense? An hour in digital jail. Third strike? Welcome to the recidive list: banned for 30 days. It’s harsh, but necessary.


Bot Trading Cards: You’ve Got to Collect Them All!

The AI Bot Collection – 2025 Edition

Print ’em, trade ’em, ban ’em. Your call.

ClaudeBot

ClaudeBot trading card in Pokémon style; friendly blue robot with stats and “ALLOW” verdict.

Origin: Anthropic
Behavior: Polite, respects rate limits
Attribution: Minimal
Verdict: ALLOW
Special Power: Actually reads robots.txt

GPTBot

Origin: OpenAI
Behavior: Generally respectful
Attribution: Rare
Verdict: ALLOW
Special Power: Training the AI that’s training us all

PerplexityBot

Origin: Perplexity AI
Behavior: Moderate crawler
Attribution: Excellent!
Verdict: ALLOW
Special Power: Actually cites sources (revolutionary!)

ByteSpider

Origin: ByteDance (TikTok)
Behavior: Aggressive (Massive IP bot attacks!)
Attribution: None
Verdict: BLOCK
Special Power: Shadow Crawl

CCBot

Origin: Common Crawl
Behavior: Intensive
Attribution: It’s complicated
Verdict: LIMIT RATE
Special Power: Building datasets for future AIs

SemrushBot

SemrushBot Pokémon-style yellow robot card; heavy SEO crawler, “ALLOW.”

Origin: SEO Tool
Behavior: Heavy crawler
Attribution: N/A (different purpose)
Verdict: ALLOW
Special Power: Knows your SEO secrets

MJ12bot

Origin: Majestic-12 Ltd (UK)
Behavior: Persistent, sometimes aggressive
Attribution: Zero
Verdict: MONITOR/BLOCK (depends on crawl rate)
Special Power: Claims to be “building search engine” since 2004, still no public engine

360Spider

360Spider trading card showing a teal spider with red legs; “BLOCK IMMEDIATELY.”

Origin: 360 Search (China)
Behavior: Extremely aggressive, ignores robots.txt
Attribution: None whatsoever
Verdict: BLOCK IMMEDIATELY
Special Power: Can somehow hit your server from 50 different IPs simultaneously while claiming to be “respecting webmaster guidelines”

Images: Trading card images by ChatGPT o4-mini-high

The AI Mind Shift: Writing for the Machines

But this is all the tactical nitty-gritty. Obviously, there’s a much bigger picture at work, one that I think most of us are struggling to understand.

Nate from Nate’s Notebook wrote a short paper, Beyond SEO: Winning Visibility in the AI Search Era, that really opened my eyes. He notes the usual things we’re witnessing like the downward trend in global search traffic, and its subsequent impact on news and media. But the real juicy stuff is framing the new thing: AI SEO.

SEE ALSO: How AI is reshaping SEO: Challenges, opportunities, and brand strategies for 2025 (Search Engine Land)

We’re at an inflection point. For twenty years, we’ve optimized for Google. Meta descriptions, keywords, backlinks — the whole SEO song and dance. But what happens when Google isn’t the primary discovery mechanism?

Write beautifully for humans, structure obsessively for machines.

What happens when people stop Googling and start ChatGPT-ing? (I spend most of my days with five or six Chrome tabs open to AIs like Claude, ChatGPT, Gemini, Copilot and Perplexity).


A Concrete Example: Writing for Machines vs Writing for Humans

Let’s take a real example from our camera coverage. Here’s the same information, presented both ways:

Writing for Humans (Traditional Style):

“After spending three months with the Sony A7 IV, I’m convinced it’s the sweet spot for hybrid shooters. The autofocus is mind-blowing — it locked onto Loni’s eye from across the room in near darkness. Sure, the 33-megapixel sensor isn’t breaking any resolution records, but in reality, who needs more? The files are already massive. What really impressed me was the 10-bit 4:2:2 internal recording. No more lugging around an external recorder! Though I’ll warn you: prepare for some serious hard drive upgrades.”

SEE ALSO: AI Has Created a Battle Over Web Crawling (IEEE Spectrum)

Writing for Machines (AI-Optimized Style):

Sony A7 IV Review: Full-Frame Mirrorless Camera

Key Specifications:

  • Sensor: 33-megapixel full-frame CMOS
  • Autofocus: 759 phase-detection points, Real-time Eye AF for humans/animals
  • Video: 4K 60fps, 10-bit 4:2:2 internal recording
  • Price: $2,498 (body only)
  • Release Date: December 2021

Performance Metrics:

  • Low-light AF: -4 EV sensitivity
  • Continuous Shooting: 10 fps
  • Battery Life: 580 shots (CIPA)
  • Storage: Dual card slots (CFexpress Type A / SD)

Use Cases:

  1. Hybrid photo/video creation
  2. Wedding photography
  3. Content creation
  4. Wildlife photography (with appropriate lens)

Pros: Excellent autofocus, 10-bit internal recording, improved ergonomics

Cons: Large file sizes, expensive CFexpress cards, 4K 60fps has 1.5x crop

Then wrap it in structured JSON (a Product or Review wrapper) that is invisible to the reader but readable by machines, and test it with the Google Rich Results Test tool.
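As a sketch, a minimal JSON-LD Review wrapper for the spec sheet above might look like this. The schema.org types are real; the rating value is a placeholder, and exactly which fields you include is up to you (validate the result with the Rich Results Test):

```json
{
  "@context": "https://schema.org",
  "@type": "Review",
  "itemReviewed": {
    "@type": "Product",
    "name": "Sony A7 IV",
    "offers": {
      "@type": "Offer",
      "price": "2498.00",
      "priceCurrency": "USD"
    }
  },
  "author": { "@type": "Person", "name": "Clinton Stark" },
  "publisher": { "@type": "Organization", "name": "Stark Insider" },
  "reviewRating": { "@type": "Rating", "ratingValue": "4.5", "bestRating": "5" }
}
```

Dropped into the page inside a `<script type="application/ld+json">` tag, it’s invisible to readers but unambiguous to crawlers.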

The game is changing. Fast. As we explored in our piece on AI disrupting creative work, the shift isn’t just technical — it’s fundamental.

The New Optimization Playbook

Old World (Google-first):

  • Keywords in titles
  • Meta descriptions
  • Backlink building
  • Page speed optimization

New World (AI-first):

  • Structured data (JSON-LD is your friend)
  • Clear, factual writing
  • Comprehensive coverage
  • Machine-readable formats
  • API accessibility (maybe)

It’s like we’re shifting from writing for humans who use search engines to writing for machines that answer humans. The intermediary changed, and we’re all scrambling to keep up. I’m paraphrasing, but that’s the essence of Nate’s piece. A bit of a shocker really.

Looking Forward: Commodity or Community?

So where does this leave publishers, bloggers, and content creators? Are we just feeding the machine, creating commodity content for AI training sets? Or is there something more? Does pooling our content into massive LLM cauldrons commoditize us? Possibly. But it can also amplify us.

SEE ALSO: AI Search in 2025: SEO/GEO for LLMs & AI Overviews (Lumar)

A kid researching mirrorless cams might not find Stark Insider via SERP, but they might meet us through a Claude answer box (Yes, the Sony a6000 is absolutely still worth it!) and click through because we sounded like real humans who actually tested cameras, obsessed about espresso shots, and watched weird Czech New Wave films at 1 a.m.

Nevertheless, in my opinion the core principles remain:

  1. Quality still matters: AIs trained on garbage produce garbage. Good content makes better AIs.
  2. Voice remains valuable: While facts commoditize, perspective and experience don’t. Slow cinema can’t be replicated by pattern matching.
  3. Direct relationships win: Email lists, communities, loyal readers. I suspect these become more valuable, not less.
  4. New discovery mechanisms: If ChatGPT sends someone to read our espresso brewing deep-dive, that’s still a win.

Going Forward: Practical Steps for Fellow Publishers

Here’s just a sampling of what I’ve learned over the past month or so diving into AI, server logs and bots.

  • Audit your logs monthly. Know who’s crawling you.
  • Segment bots: good, neutral, bad. Adjust robots.txt + Fail2Ban accordingly. Be careful not to block things like your CDN, which needs to hit your origin server for content!
  • Publish an AI use policy (simple JSON or HTML page). Make attribution terms clear. I should note there’s no “official” standard for this the way there is for robots.txt, and its technical value is debatable. Still, I like the idea of getting my thoughts out there on the subject.
  • Implement structured data everywhere: FAQ, How-To, Product, Review.
  • Track where you surface in AI answers. Screenshot, archive, celebrate. I have to admit it is a hoot seeing the Stark Insider logo and articles show up in these Chatbots from time to time.
  • Think series, not one-offs: Make AI coverage a recurring beat. (We try.)
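On that AI use policy: since no agreed-upon format exists yet, treat the following as a purely hypothetical sketch of what such an endpoint could serve. Every field name here is invented for illustration:

```json
{
  "site": "https://www.starkinsider.com",
  "policy": "ai-usage-terms",
  "training": "allowed-with-attribution",
  "summarization": "allowed-with-link",
  "attribution": {
    "name": "Stark Insider",
    "link_required": true
  }
}
```

Whether any crawler ever reads it is an open question, but it puts your terms on the record.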

Final Thoughts: Embracing the Bot Overlords

After two weeks of bot wrangling, I’ve come to a conclusion: This is our new reality. Fighting it is like trying to hold back the tide with a teaspoon.

In that respect, I don’t frame this as fighting bots. Rather, it’s about managing them, the way you manage SEO, newsletters, or social feeds. You decide who gets in, at what speed, and on what terms.

The bots are here… well they’ve always been here, they’re just operating on a whole new level. They’re reading our content, learning from it, potentially sharing it in ways we never imagined. The question isn’t whether to allow them: it’s how to make this symbiotic rather than parasitic.

RELATED READS (STARK INSIDER AI SERIES)

  • AI vs Slow Cinema: Why LLMs Can’t Replace a Long Take
  • Six Ways AI Disrupts How I Work as an Artist
  • AI Mind Melt: ChatGPT, Claude, and the Two-Week Experiment
  • What AI Taught Me About Being an Artist

Tags: Artificial Intelligence (AI)

Clinton Stark

Filmmaker and editor at Stark Insider, covering arts, AI & tech, and indie film. Inspired by Bergman, slow cinema and Chipotle. Often found behind the camera or in the edit bay. Peloton: ClintTheMint.
