How We Detect AI Crawlers Hitting Your Site, and What We Caught
A behind-the-scenes look at the engine Vergrank built to detect AI crawlers like GPTBot and ClaudeBot, tell a real bot from a spoofer, and what we caught in the first two days of logging.
In the last post I explained how we track whether LLMs mention your site. This one is the other half of the same question: who is actually crawling your pages to feed those models, and how do you even see them?
So we built a small engine to log every AI crawler that touches a page, tell a real bot from a faker, and sort out what it came for. We turned it on, let it run, and the first day of data was already more interesting than I expected. Here’s how it works, and what showed up.
Why this is harder than it sounds
The obvious approach is to drop a JavaScript tag on the page, the way analytics does. It’s also the approach that cannot work for the bots you most want to see.
Declared AI crawlers don’t run JavaScript. GPTBot, ClaudeBot, CCBot and the rest
fetch your raw HTML and leave. They never execute a <script>, so an
analytics-style tag is blind to exactly the visitors the whole exercise is
about.
That single fact shapes the whole design. To see everything, you need two layers that catch different things: one on the server and one in the page.
Layer 1: catch the declared crawlers at the edge
The first layer is a piece of middleware that runs before your page’s HTML,
right at the edge. Every request passes through it. It reads the User-Agent
and checks it against a registry of known AI crawlers, and when it sees one (or
sees something that isn’t a normal browser) it logs the visit and waves it
through. It never blocks, it never slows the page, and if it errors it just gets
out of the way. Detection should never break the thing it’s watching.
Because this layer lives on the server, it’s the only one that reliably sees the no-JavaScript crawlers. Each match gets tagged with three things:
- Who: the provider (OpenAI, Anthropic, Perplexity, Google).
- What for: a purpose, either training, search, or agent.
- How sure: whether it looks like a real browser, a known bot, or something in between.
That purpose tag is the part worth dwelling on, because not all AI traffic means the same thing:
| Purpose | What it is | Examples |
|---|---|---|
| Training | Scraping pages to train a model | GPTBot, CCBot, ClaudeBot |
| Search | Indexing content for retrieval | PerplexityBot, OAI-SearchBot, Bingbot |
| Agent | A live user asking an LLM about your page right now | ChatGPT-User, Claude-User |
An agent hit is the most interesting of the three. It isn’t a faceless scrape;
it’s a real person who pointed an AI at your page this second.
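As a rough illustration of the Layer 1 lookup, here is a minimal Python sketch. It is not the engine's actual code: the names `CrawlerMatch`, `REGISTRY`, `classify`, and `observe` are made up for the example, the registry holds only the crawlers named in the table above, and the separate "isn't a normal browser" heuristic is left out. The User-Agent strings in the usage lines are simplified stand-ins, not the crawlers' exact strings.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class CrawlerMatch:
    name: str
    provider: str
    purpose: str  # "training", "search", or "agent"


# Hypothetical registry: lowercase token to look for in the User-Agent.
REGISTRY = {
    "gptbot": CrawlerMatch("GPTBot", "OpenAI", "training"),
    "ccbot": CrawlerMatch("CCBot", "Common Crawl", "training"),
    "claudebot": CrawlerMatch("ClaudeBot", "Anthropic", "training"),
    "perplexitybot": CrawlerMatch("PerplexityBot", "Perplexity", "search"),
    "oai-searchbot": CrawlerMatch("OAI-SearchBot", "OpenAI", "search"),
    "bingbot": CrawlerMatch("Bingbot", "Microsoft", "search"),
    "chatgpt-user": CrawlerMatch("ChatGPT-User", "OpenAI", "agent"),
    "claude-user": CrawlerMatch("Claude-User", "Anthropic", "agent"),
}


def classify(user_agent: Optional[str]) -> Optional[CrawlerMatch]:
    """Return the first registry entry whose token appears in the User-Agent."""
    ua = (user_agent or "").lower()
    for token, match in REGISTRY.items():
        if token in ua:
            return match
    return None


def observe(user_agent: Optional[str], log: Callable[[CrawlerMatch], None]) -> None:
    """Log a match if there is one, and never raise into the request path."""
    try:
        match = classify(user_agent)
        if match is not None:
            log(match)
    except Exception:
        # Fail open: detection should never break the page it is watching.
        pass


# Usage with simplified, made-up User-Agent strings.
seen = []
observe("Mozilla/5.0 (compatible; GPTBot/1.0)", seen.append)
observe("Mozilla/5.0 (Macintosh) Safari/605.1.15", seen.append)
print([m.name for m in seen])
```

The `try`/`except` around the whole lookup is the "gets out of the way" behavior: a bug in the registry or the logger costs a log line, not a page view.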
Layer 2: fingerprint whatever runs JavaScript
The second layer is the small in-page beacon, and its job is everything the first layer can’t see: real humans, plus the sneakier class of AI agents that do drive a real (headless) browser and therefore do run JavaScript.
A couple of seconds after load, it collects a lightweight fingerprint and sends it once. The interesting fields aren’t the obvious ones:
- webdriver: set to true by browsers under automation (Selenium, Puppeteer, Playwright). Real users are false.
- WebGL renderer: the GPU string. A real laptop reports something like Apple M1 Pro. A headless browser in a datacenter reports SwiftShader, a software renderer, because there’s no GPU.
- Interaction: did the visitor ever move the mouse, scroll, or press a key before the beacon fired? Humans almost always do. Automation usually doesn’t.
Put together, those three turn “is this a person?” from a guess into a fairly confident read.
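To make that concrete, here is one way the three fields could be combined once the beacon payload reaches the server. This is a sketch under assumptions, not the engine's scoring: the `Fingerprint` shape, the signal names, and the "two or more signals means automation" threshold are all invented for the example.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Fingerprint:
    webdriver: bool       # the browser's automation flag
    webgl_renderer: str   # GPU string reported by WebGL
    interacted: bool      # any mouse, scroll, or key event before the beacon fired


def automation_signals(fp: Fingerprint) -> List[str]:
    """List which of the three fields point toward automation."""
    signals = []
    if fp.webdriver:
        signals.append("webdriver")
    if "swiftshader" in fp.webgl_renderer.lower():
        signals.append("software-renderer")
    if not fp.interacted:
        signals.append("no-interaction")
    return signals


def verdict(fp: Fingerprint) -> str:
    """Assumed thresholds: 0 signals human, 1 uncertain, 2 or more automation."""
    count = len(automation_signals(fp))
    if count == 0:
        return "likely-human"
    if count == 1:
        return "uncertain"
    return "likely-automation"


laptop = Fingerprint(webdriver=False, webgl_renderer="Apple M1 Pro", interacted=True)
headless = Fingerprint(webdriver=True, webgl_renderer="Google SwiftShader", interacted=False)
print(verdict(laptop), verdict(headless))
```

No single field decides the outcome here. A person who loads a page and never touches it trips only one signal and lands in "uncertain" rather than being called a bot.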
Telling a real bot from a faker
Here’s the catch with Layer 1: a User-Agent is just a string anyone can type.
Run curl -A "GPTBot" and you’re “GPTBot.” So a bot label, on its own, proves
nothing. We confirm identity two ways.
Reverse-DNS, forward-confirmed. A real GPTBot connects from OpenAI’s own
network. So we take the visitor’s IP, ask DNS what hostname it belongs to, and
check that the hostname ends in the provider’s domain (.openai.com,
.anthropic.com, and so on). Then we resolve that hostname forward and confirm
it points back to the same IP, which is the step that makes it spoof-proof. You
can fake a User-Agent. You can’t fake control of OpenAI’s DNS. The verdict comes
out as verified, spoofed, or unknown.
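The two lookups can be sketched with Python's standard `socket` module. Treat this as an illustration of forward-confirmed reverse DNS rather than the engine's implementation: `verify_crawler` and its parameters are hypothetical, and the mapping of failures onto the three verdicts (a missing PTR record gives unknown, a wrong domain or a forward mismatch gives spoofed) is my assumption. It can only confirm providers whose crawler IPs carry matching PTR records; anything else falls to unknown in this sketch.

```python
import socket
from typing import Callable, Iterable, List


def reverse_lookup(ip: str) -> str:
    """PTR lookup: which hostname does this IP claim to be?"""
    return socket.gethostbyaddr(ip)[0]


def forward_lookup(hostname: str) -> List[str]:
    """Forward lookup: every address the hostname resolves to."""
    infos = socket.getaddrinfo(hostname, None)
    return sorted({info[4][0] for info in infos})


def verify_crawler(
    ip: str,
    allowed_suffixes: Iterable[str],
    reverse: Callable[[str], str] = reverse_lookup,
    forward: Callable[[str], List[str]] = forward_lookup,
) -> str:
    """Return "verified", "spoofed", or "unknown" for a claimed crawler IP."""
    try:
        hostname = reverse(ip).rstrip(".").lower()
    except OSError:
        return "unknown"  # no PTR record, so nothing to confirm against
    # Suffixes keep their leading dot so "evilopenai.com" cannot match.
    if not hostname.endswith(tuple(allowed_suffixes)):
        return "spoofed"
    try:
        addresses = forward(hostname)
    except OSError:
        return "unknown"
    # Plain string comparison; IPv6 addresses may need normalizing first.
    return "verified" if ip in addresses else "spoofed"


# Example call (performs real DNS lookups, so the result depends on the network):
# verify_crawler("192.0.2.10", [".openai.com", ".anthropic.com"])
```

The forward step matters because whoever controls an IP block also controls its PTR records and can make them say anything. Only the provider can make a name under its own domain resolve back to that IP.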
Paths a real crawler never asks for. The other tell is behavioral. A
legitimate AI crawler reads your content. It does not go looking for
/.env.production, /.git/config, /secrets.json, or /wp-config. Those are
secret-hunting requests, and any “bot” making them is a scanner wearing a
costume, no matter what its DNS says.
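The behavioral check is simple enough to sketch in a few lines. The prefix list below contains only the four paths named above, and the function names and the "scanner" label are assumptions made for the example. Matching on the start of the path is also a simplification, since it misses the same files requested from a subdirectory.

```python
# Prefixes taken from the paths named above; a real list would be longer.
SENSITIVE_PREFIXES = ("/.env", "/.git/", "/secrets.json", "/wp-config")


def is_secret_hunting(path: str) -> bool:
    """True if the request path starts with a known secret-hunting prefix."""
    clean = path.split("?", 1)[0].lower()  # ignore any query string
    return clean.startswith(SENSITIVE_PREFIXES)


def final_verdict(dns_verdict: str, path: str) -> str:
    """Behavior overrides identity: a secret-hunting request is a scanner."""
    return "scanner" if is_secret_hunting(path) else dns_verdict


print(final_verdict("verified", "/blog/ai-crawlers"))
print(final_verdict("verified", "/.env.production"))
```

The override runs in one direction only. A clean path never upgrades a spoofed or unknown visitor, but a secret-hunting path downgrades anyone.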
What we actually caught
We turned the logging on and watched. The sample is small so far, a little over a day across two sites, so treat this as early signal, not a census. About 140 hits in total, 46 of them identified AI bots. Even at this size it’s a pretty honest snapshot of who’s knocking.
The AI crawlers themselves. Anthropic’s ClaudeBot was the single most active declared crawler, ahead of OpenAI’s GPTBot. The surprise was second place, which went to Amazon’s crawler rather than anyone I’d have guessed.
| Crawler | Provider | Purpose | Hits |
|---|---|---|---|
| ClaudeBot | Anthropic | training | 15 |
| Amazonbot | Amazon | training | 10 |
| GPTBot | OpenAI | training | 8 |
| ChatGPT-User | OpenAI | agent | 6 |
| PerplexityBot | Perplexity | search | 2 |
| CCBot | Common Crawl | training | 2 |
| OAI-SearchBot | OpenAI | search | 2 |
| Bingbot | Microsoft | search | 1 |
Count by provider instead of by bot and OpenAI just edges ahead (16 hits across its three crawlers to Anthropic’s 15), but ClaudeBot was still the busiest single name on the list.
Sorted by purpose, the traffic skewed hard toward training: about three in
four identified hits (76%) were models scraping to learn. The rest split almost
evenly between live agent fetches and search indexing, and the part I didn’t
expect is that the agent hits actually edged out search. So the AI crawl economy
at our doorstep is mostly machines reading to train, but the humans pointing an
AI at a page in the moment already showed up more often than the search indexers.
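The provider and purpose figures fall straight out of the hit table above. As a quick check, this sketch re-adds them in Python; the `HITS` rows are copied from that table, and `tally` is just a helper name for the example.

```python
from collections import Counter

# (crawler, provider, purpose, hits), copied from the table above.
HITS = [
    ("ClaudeBot", "Anthropic", "training", 15),
    ("Amazonbot", "Amazon", "training", 10),
    ("GPTBot", "OpenAI", "training", 8),
    ("ChatGPT-User", "OpenAI", "agent", 6),
    ("PerplexityBot", "Perplexity", "search", 2),
    ("CCBot", "Common Crawl", "training", 2),
    ("OAI-SearchBot", "OpenAI", "search", 2),
    ("Bingbot", "Microsoft", "search", 1),
]


def tally(rows, column):
    """Sum the hit counts grouped by the given column index."""
    totals = Counter()
    for row in rows:
        totals[row[column]] += row[3]
    return totals


by_provider = tally(HITS, 1)
by_purpose = tally(HITS, 2)
total = sum(by_purpose.values())
training_share = round(100 * by_purpose["training"] / total)

print(total)             # 46
print(dict(by_purpose))  # {'training': 35, 'agent': 6, 'search': 5}
print(training_share)    # 76
```

That gives 35 training hits out of 46, with 6 agent hits against 5 for search, and 16 hits for OpenAI across its three crawlers.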
The headless agents the beacon caught. This was the part I found most fun.
Only eleven visitors ran JavaScript and got fingerprinted, but most of them came
back with a headless-automation signature: webdriver: true and a SwiftShader
software renderer sitting where a real GPU should be. A few looked like a genuine
machine instead (an Apple M1 Pro reporting webdriver: false). The headless ones
are the exact traffic the server layer is blind to: automation driving a real
browser engine, quietly running your JavaScript, that a User-Agent check would
wave through as “just Chrome.”
The fakers and the junk. Then there’s the bucket of things that are neither a real browser nor a known AI bot, and it filled up fast. A few standouts:
- The single loudest visitor was a request whose entire User-Agent was a URL: http://vergrank.com/wp-admin/install.php?step=1. That’s someone fishing for an unconfigured WordPress to take over, and it knocked a dozen times.
- The usual SEO crawlers showed up in volume. AhrefsBot was actually the busiest source of all, ahead of every AI bot, with SemrushBot close behind. Not AI, but not browsers either, so they land in the “unknown fetcher” bucket instead of getting mislabeled.
- Link-preview scrapers like facebookexternalhit were busy too, the bot that fetches a page whenever someone pastes its link into a chat or a feed.
- Plain scripts owning up to what they are: axios and Bun.
The one thing I expected and did not see yet: a scanner wearing a famous bot
name while fishing for /.env or /.git/config. The behavioral check is armed
for it, but in this short window nobody bothered. The disguises that did turn up
were lazier than that. The moment you start logging, the open web’s background
noise becomes visible, and you find out which costumes people actually wear.
Why we’re building this
Vergrank is about your visibility inside AI: whether models mention you, who they recommend instead, and now, who’s actually crawling you to build those answers in the first place. Seeing GPTBot and ClaudeBot read a page is the upstream signal to the mention-tracking I wrote about last time. The crawl is the cause and the citation is the effect, and it’s a lot easier to reason about the effect when you can watch the cause.
This is still early. The sample is tiny, the verification pass is just getting going, and the next step is letting clients install the same tracking on sites we don’t even host. But the engine is live, it’s honest about what it can and can’t see, and it’s already showing me things I’d never have known were happening.
More as the data grows.