How We Detect AI Crawlers Hitting Your Site, and What We Caught

A behind-the-scenes look at the engine Vergrank built to detect AI crawlers like GPTBot and ClaudeBot and tell a real bot from a spoofer, plus what we caught in a little over a day of logging.

In the last post I explained how we track whether LLMs mention your site. This one is the other half of the same question: who is actually crawling your pages to feed those models, and how do you even see them?

So we built a small engine to log every AI crawler that touches a page, tell a real bot from a faker, and sort out what it came for. We turned it on, let it run, and the first day of data was already more interesting than I expected. Here’s how it works, and what showed up.

Why this is harder than it sounds

The obvious approach is to drop a JavaScript tag on the page, the way analytics does. It’s also the approach that cannot work for the bots you most want to see.

Declared AI crawlers don’t run JavaScript. GPTBot, ClaudeBot, CCBot and the rest fetch your raw HTML and leave. They never execute a <script>, so an analytics-style tag is blind to exactly the visitors the whole exercise is about.

That single fact shapes the whole design. To see everything, you need two layers that catch different things:

[Figure: How Vergrank detects AI crawlers. Every request to a page is seen by two layers at once. The edge middleware (server) runs before the HTML, needs no JavaScript, and catches declared crawlers such as GPTBot, ClaudeBot, and PerplexityBot. The in-page JS beacon fingerprints whatever runs JavaScript (webdriver, GPU, and interaction signals) and catches humans and headless agents. Both write to one crawler_hits store, one row per visit, best-effort. An hourly enrichment job then reverse-DNS-verifies each bot, flags secret-probing paths (/.env, /.git…), and attributes the hit to a page and client, feeding the crawlers dashboard. The two layers are complementary: server-only bots never trip the beacon, and JS-only agents never trip the middleware.]
Two capture layers, one store: declared crawlers caught at the edge, JavaScript-running agents caught by the beacon, then verified and attributed hourly.

Layer 1: catch the declared crawlers at the edge

The first layer is a piece of middleware that runs before your page’s HTML, right at the edge. Every request passes through it. It reads the User-Agent and checks it against a registry of known AI crawlers, and when it sees one (or sees something that isn’t a normal browser) it logs the visit and waves it through. It never blocks, it never slows the page, and if it errors it just gets out of the way. Detection should never break the thing it’s watching.
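A minimal sketch of that fail-open behaviour, assuming a generic logging callback and not Vergrank's actual middleware. `Hit`, `LogFn`, and `observe` are hypothetical names introduced only for illustration.

```typescript
// Hypothetical shapes for illustration; not Vergrank's real types.
interface Hit {
  path: string;
  userAgent: string;
  at: number; // epoch milliseconds
}

type LogFn = (hit: Hit) => Promise<void>;

// Record the visit without ever blocking or failing the page.
// The log promise is deliberately not awaited, so a slow store cannot
// delay the response, and every error path is swallowed.
function observe(path: string, userAgent: string, log: LogFn): void {
  try {
    const hit: Hit = { path, userAgent, at: Date.now() };
    log(hit).catch(() => {
      // Best-effort: a failed write is simply dropped.
    });
  } catch {
    // A synchronous failure in the logger is ignored as well.
  }
}
```

The caller invokes `observe(...)` and then continues to serve the page as usual. Nothing in the function can throw back into the request path.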

Because this layer lives on the server, it’s the only one that reliably sees the no-JavaScript crawlers. Each match gets tagged with three things:

  • Who: the provider (OpenAI, Anthropic, Perplexity, Google).
  • What for: a purpose, either training, search, or agent.
  • How sure: whether it looks like a real browser, a known bot, or something in between.

That purpose tag is the part worth dwelling on, because not all AI traffic means the same thing:

| Purpose | What it is | Examples |
| --- | --- | --- |
| Training | Scraping pages to train a model | GPTBot, CCBot, ClaudeBot |
| Search | Indexing content for retrieval | PerplexityBot, OAI-SearchBot, Bingbot |
| Agent | A live user asking an LLM about your page right now | ChatGPT-User, Claude-User |

An agent hit is the most interesting of the three. It isn’t a faceless scrape; it’s a real person who pointed an AI at your page this second.
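The registry lookup described in this section could be sketched roughly as follows. The entries are only the crawlers named in this post, and the substring match, the `CrawlerEntry` shape, and the `classify` function are assumptions made for illustration, not the engine's real registry.

```typescript
type Purpose = "training" | "search" | "agent";

interface CrawlerEntry {
  token: string; // substring looked for in the User-Agent
  provider: string;
  purpose: Purpose;
}

// Illustrative entries only; a real registry would be longer.
const REGISTRY: CrawlerEntry[] = [
  { token: "GPTBot", provider: "OpenAI", purpose: "training" },
  { token: "OAI-SearchBot", provider: "OpenAI", purpose: "search" },
  { token: "ChatGPT-User", provider: "OpenAI", purpose: "agent" },
  { token: "ClaudeBot", provider: "Anthropic", purpose: "training" },
  { token: "Claude-User", provider: "Anthropic", purpose: "agent" },
  { token: "PerplexityBot", provider: "Perplexity", purpose: "search" },
  { token: "CCBot", provider: "Common Crawl", purpose: "training" },
];

// Return the first registry entry whose token appears in the User-Agent
// (case-insensitive), or null when nothing matches.
function classify(userAgent: string): CrawlerEntry | null {
  const ua = userAgent.toLowerCase();
  for (const entry of REGISTRY) {
    if (ua.indexOf(entry.token.toLowerCase()) !== -1) return entry;
  }
  return null;
}
```

A match only says what the visitor claims to be. Whether the claim holds is a separate check, since the User-Agent is supplied by the client.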

Layer 2: fingerprint whatever runs JavaScript

The second layer is the small in-page beacon, and its job is everything the first layer can’t see: real humans, plus the sneakier class of AI agents that do drive a real (headless) browser and therefore do run JavaScript.

A couple of seconds after load, it collects a lightweight fingerprint and sends it once. The interesting fields aren’t the obvious ones:

  • webdriver: set to true by browsers under automation (Selenium, Puppeteer, Playwright). For a real user it is false.
  • WebGL renderer: the GPU string. A real laptop reports something like Apple M1 Pro. A headless browser in a datacenter reports SwiftShader, a software renderer, because there’s no GPU.
  • Interaction: did the visitor ever move the mouse, scroll, or press a key before the beacon fired? Humans almost always do. Automation usually doesn’t.

Put together, those three turn “is this a person?” from a guess into a fairly confident read.
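One way to combine the three signals, sketched under assumptions: the `Fingerprint` field names, the verdict labels, and the thresholds below are all hypothetical. In a browser, the first two values would typically come from `navigator.webdriver` and the `UNMASKED_RENDERER_WEBGL` parameter of the `WEBGL_debug_renderer_info` extension. Only the scoring step is shown.

```typescript
// Hypothetical payload shape; field names are illustrative.
interface Fingerprint {
  webdriver: boolean; // automation flag reported by the browser
  webglRenderer: string; // GPU string, "" if unavailable
  interacted: boolean; // any mouse move, scroll, or key press before send
}

type Read = "likely_human" | "likely_automation" | "uncertain";

// Assumed rule: the webdriver flag alone is decisive; otherwise two of the
// three signals must point at automation. The thresholds are a guess.
function assess(fp: Fingerprint): Read {
  const softwareGpu = fp.webglRenderer.toLowerCase().indexOf("swiftshader") !== -1;
  let automationSignals = 0;
  if (fp.webdriver) automationSignals += 1;
  if (softwareGpu) automationSignals += 1;
  if (!fp.interacted) automationSignals += 1;

  if (fp.webdriver || automationSignals >= 2) return "likely_automation";
  if (automationSignals === 0) return "likely_human";
  return "uncertain";
}
```

A visitor who never touched the page but reports a hardware GPU lands in "uncertain" here, which reflects that a lack of interaction on its own is weak evidence.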

Telling a real bot from a faker

Here’s the catch with Layer 1: a User-Agent is just a string anyone can type. Run curl -A "GPTBot" and you’re “GPTBot.” So a bot label, on its own, proves nothing. We confirm identity two ways.

Reverse-DNS, forward-confirmed. A real GPTBot connects from OpenAI’s own network. So we take the visitor’s IP, ask DNS what hostname it belongs to, and check that the hostname ends in the provider’s domain (.openai.com, .anthropic.com, and so on). Then we resolve that hostname forward and confirm it points back to the same IP, which is the step that makes it spoof-proof. You can fake a User-Agent. You can’t fake control of OpenAI’s DNS. The verdict comes out as verified, spoofed, or unknown.
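The two-step check can be sketched like this. The `Resolver` interface is injected so the example has no runtime dependency; in Node.js, `dns.promises.reverse` and `dns.promises.resolve4` have matching shapes for IPv4. The rules for when a result counts as "unknown" versus "spoofed" are an assumption, since the post does not spell them out.

```typescript
type IdentityVerdict = "verified" | "spoofed" | "unknown";

// Injected DNS functions (hypothetical interface for this sketch).
interface Resolver {
  reverse(ip: string): Promise<string[]>; // IP -> hostnames (PTR)
  forward(hostname: string): Promise<string[]>; // hostname -> IPs
}

// Suffixes are expected with a leading dot (".openai.com"), so that
// "evilopenai.com" does not match.
function inDomain(hostname: string, suffixes: string[]): boolean {
  const host = hostname.toLowerCase().replace(/\.$/, "");
  for (const suffix of suffixes) {
    const s = suffix.toLowerCase();
    if (host.length > s.length && host.slice(-s.length) === s) return true;
  }
  return false;
}

async function verifyCrawler(
  ip: string,
  suffixes: string[],
  dns: Resolver
): Promise<IdentityVerdict> {
  let hostnames: string[];
  try {
    hostnames = await dns.reverse(ip);
  } catch {
    return "unknown"; // lookup failed: cannot decide either way
  }
  if (hostnames.length === 0) return "unknown";

  for (const hostname of hostnames) {
    if (!inDomain(hostname, suffixes)) continue;
    try {
      // Forward-confirm: the hostname must resolve back to the same IP.
      const ips = await dns.forward(hostname);
      if (ips.indexOf(ip) !== -1) return "verified";
    } catch {
      return "unknown"; // forward lookup failed: cannot confirm
    }
  }
  // Hostname outside the provider's domain, or the forward lookup
  // pointed somewhere else.
  return "spoofed";
}
```

The forward step matters because whoever controls an IP range also controls its reverse records and can make them say anything. Only the provider can make its own hostname resolve to that IP.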

Paths a real crawler never asks for. The other tell is behavioral. A legitimate AI crawler reads your content. It does not go looking for /.env.production, /.git/config, /secrets.json, or /wp-config. Those are secret-hunting requests, and any “bot” making them is a scanner wearing a costume, no matter what its DNS says.
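A minimal sketch of the behavioral check, assuming a simple prefix list built from the four examples above. `PROBE_PREFIXES` and `isSecretProbe` are hypothetical names, and a real list would be longer.

```typescript
// Prefixes taken from the examples in this post.
const PROBE_PREFIXES: string[] = ["/.env", "/.git/", "/secrets.json", "/wp-config"];

// True when the request path (query string ignored) starts with a known
// secret-hunting prefix.
function isSecretProbe(path: string): boolean {
  const q = path.indexOf("?");
  const p = (q === -1 ? path : path.slice(0, q)).toLowerCase();
  for (const prefix of PROBE_PREFIXES) {
    if (p.indexOf(prefix) === 0) return true;
  }
  return false;
}
```

Because this looks only at the path, it applies regardless of the identity verdict: a hit that trips it is suspect even if the visitor's DNS checks out.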

What we actually caught

We turned the logging on and watched. The sample is small so far, a little over a day across two sites, so treat this as early signal, not a census. About 140 hits in total, 46 of them identified AI bots. Even at this size it’s a pretty honest snapshot of who’s knocking.

The AI crawlers themselves. Anthropic’s ClaudeBot was the single most active declared crawler, ahead of OpenAI’s GPTBot. The surprise was second place, which went to Amazon’s crawler rather than anyone I’d have guessed:

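| Crawler | Provider | Purpose | Hits |
| --- | --- | --- | --- |
| ClaudeBot | Anthropic | training | 15 |
| Amazonbot | Amazon | training | 10 |
| GPTBot | OpenAI | training | 8 |
| ChatGPT-User | OpenAI | agent | 6 |
| PerplexityBot | Perplexity | search | 2 |
| CCBot | Common Crawl | training | 2 |
| OAI-SearchBot | OpenAI | search | 2 |
| Bingbot | Microsoft | search | 1 |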

Count by provider instead of by bot and OpenAI just edges ahead (16 hits across its three crawlers to Anthropic’s 15), but ClaudeBot was still the busiest single name on the list.

Sorted by purpose, the traffic skewed hard toward training: about three in four identified hits (76%) were models scraping to learn. The rest split almost evenly between live agent fetches and search indexing, and the part I didn’t expect is that the agent hits actually edged out search. So the AI crawl economy at our doorstep is mostly machines reading to train, but the humans pointing an AI at a page in the moment already showed up more often than the search indexers.
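The provider and purpose figures follow directly from the hit counts in the table. A short tally, with the rows copied from the table and a hypothetical `totalBy` helper, shows the arithmetic:

```typescript
type Purpose = "training" | "search" | "agent";

interface Row {
  crawler: string;
  provider: string;
  purpose: Purpose;
  hits: number;
}

// Rows copied from the table of identified AI bots.
const ROWS: Row[] = [
  { crawler: "ClaudeBot", provider: "Anthropic", purpose: "training", hits: 15 },
  { crawler: "Amazonbot", provider: "Amazon", purpose: "training", hits: 10 },
  { crawler: "GPTBot", provider: "OpenAI", purpose: "training", hits: 8 },
  { crawler: "ChatGPT-User", provider: "OpenAI", purpose: "agent", hits: 6 },
  { crawler: "PerplexityBot", provider: "Perplexity", purpose: "search", hits: 2 },
  { crawler: "CCBot", provider: "Common Crawl", purpose: "training", hits: 2 },
  { crawler: "OAI-SearchBot", provider: "OpenAI", purpose: "search", hits: 2 },
  { crawler: "Bingbot", provider: "Microsoft", purpose: "search", hits: 1 },
];

// Sum hits grouped by one of the two label columns.
function totalBy(rows: Row[], key: "provider" | "purpose"): Record<string, number> {
  const out: Record<string, number> = {};
  for (const r of rows) {
    const k = r[key];
    out[k] = (out[k] ?? 0) + r.hits;
  }
  return out;
}

const total = ROWS.reduce((sum, r) => sum + r.hits, 0); // 46
const byProvider = totalBy(ROWS, "provider"); // OpenAI 16, Anthropic 15
const byPurpose = totalBy(ROWS, "purpose"); // training 35, agent 6, search 5
const trainingShare = Math.round((100 * (byPurpose["training"] ?? 0)) / total); // 76
```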

The headless agents the beacon caught. This was the part I found most fun. Only eleven visitors ran JavaScript and got fingerprinted, but most of them came back with a headless-automation signature: webdriver: true and a SwiftShader software renderer sitting where a real GPU should be. A few looked like a genuine machine instead (an Apple M1 Pro reporting webdriver: false). That’s the exact traffic the server layer is blind to: automation driving a real browser engine, quietly running your JavaScript, that a User-Agent check would wave through as “just Chrome.”

The fakers and the junk. Then there’s the bucket of things that are neither a real browser nor a known AI bot, and it filled up fast. A few standouts:

  • The single loudest visitor was a request whose entire User-Agent was a URL: http://vergrank.com/wp-admin/install.php?step=1. That’s someone fishing for an unconfigured WordPress to take over, and it knocked a dozen times.
  • The usual SEO crawlers showed up in volume. AhrefsBot was actually the busiest source of all, ahead of every AI bot, with SemrushBot close behind. Not AI, but not browsers either, so they land in the “unknown fetcher” bucket instead of getting mislabeled.
  • Link-preview scrapers like facebookexternalhit were busy too, the bot that fetches a page whenever someone pastes its link into a chat or a feed.
  • Plain scripts owning up to what they are: axios and Bun.

The one thing I expected and did not see yet: a scanner wearing a famous bot name while fishing for /.env or /.git/config. The behavioral check is armed for it, but in this short window nobody bothered. The disguises that did turn up were lazier than that. The moment you start logging, the open web’s background noise becomes visible, and you find out which costumes people actually wear.

Why we’re building this

Vergrank is about your visibility inside AI: whether models mention you, who they recommend instead, and now, who’s actually crawling you to build those answers in the first place. Seeing GPTBot and ClaudeBot read a page is the upstream signal to the mention-tracking I wrote about last time. The crawl is the cause and the citation is the effect, and it’s a lot easier to reason about the effect when you can watch the cause.

This is still early. The sample is tiny, the verification pass is just getting going, and the next step is letting clients install the same tracking on sites we don’t even host. But the engine is live, it’s honest about what it can and can’t see, and it’s already showing me things I’d never have known were happening.

More as the data grows.

FAQ

How do you detect AI crawlers like GPTBot and ClaudeBot?

An edge middleware runs before the page HTML and inspects the User-Agent against a registry of known AI crawlers. Because declared crawlers don't run JavaScript, this server-side layer is the only thing that reliably sees them. A JS tag alone never would.

Can't anyone fake the GPTBot user-agent?

Yes, and people do. We confirm a bot's identity with forward-confirmed reverse DNS: we look up the hostname for the IP, check that it belongs to the provider's domain (such as .openai.com), then resolve that hostname forward and confirm it points back to the same IP. A "GPTBot" whose IP doesn't resolve to OpenAI, or that requests files like /.env, is a spoofer.

How do you catch AI agents that DO run JavaScript?

A small in-page beacon fingerprints anything that executes JS, reading the webdriver flag, the GPU/WebGL renderer string, and whether the visitor ever moved the mouse, scrolled, or pressed a key. Headless automation tends to report a software renderer (SwiftShader) and webdriver: true, which is how it stands out from a real browser.

What's the difference between a training, search, and agent crawler?

Training crawlers (GPTBot, CCBot) scrape pages to train models. Search crawlers (PerplexityBot, OAI-SearchBot) index content for retrieval. Agent fetchers (ChatGPT-User, Claude-User) mean a live person is asking an LLM about your page right now, arguably the most valuable visit of the three.
