The question every founder asks: when will AI find me?
Every new business owner I talk to has the same question: when will ChatGPT know my brand? When will Claude cite my work? When does Perplexity start listing my company as an option?
Every honest answer starts with “it depends.” Then comes a list of variables - existing citations, discoverability, training-data freshness per model, retrieval vs. training pipelines, content age. Useful, but not a number.
Test subject: this website
llmcartel.com went live a few days ago. The starting state is deliberately minimal in the ways that matter for this experiment.
No third-party citations that I know of. No press, no podcast spots, no guest posts, no SaaS directory listings, no Reddit threads.
No paid distribution. No ads, no PR push, no link buys.
A fully populated GEO foundation, because I run a GEO agency and that’s the floor:
- Fully semantic HTML
- Almost zero JavaScript
- Hand-written robots.txt with explicit Allow: directives
- /llms.txt and /llms-full.txt per the llmstxt.org standard
- sitemap.xml with lastmod per page
- JSON-LD structured data on every page
- Open Graph + Twitter Card + Apple touch icon + PWA manifest
- WCAG 2.2 AA accessibility
- No Google Analytics or any analytics tracking at all, at least for the first month
Almost screwed up by having Cloudflare Bot Fight Mode Turned On
Before I could honestly start the clock, I had to fix my own infrastructure.
llmcartel.com is hosted on Cloudflare Pages. By default, Cloudflare’s free tier turns on a feature called Bot Fight Mode. If you’ve never heard of it: Bot Fight Mode is a heuristic + JavaScript-challenge layer that sits in front of every request and tries to block automated traffic. It does this with three mechanisms:
- Heuristic scoring. Suspicious-looking requests get challenged or blocked at the edge based on user-agent patterns, IP reputation, and request fingerprinting.
- Browser Integrity Check. A lightweight check that looks for browser-like behavior.
- JavaScript Detection (JSD). Cloudflare injects a script (/cdn-cgi/challenge-platform/scripts/jsd/main.js) into every HTML response.
The first time I ran Lighthouse on the live site, my Best Practices score capped at 81 with a deprecation warning I couldn’t fix in my own code. I investigated and remembered:
robots.txt is a polite request, not a wall.
I had spent days writing the most explicit AI-bot-welcome robots.txt I could write. Allow: / for every named LLM crawler. Then Cloudflare, the network in front of the site, was challenging the same crawlers with a JavaScript check that most LLM bots don’t run.
I was turning away, at the network layer, the same crawlers I was inviting at the policy layer.
This is the double gotcha:
- Layer 1 (you): robots.txt says yes.
- Layer 2 (your CDN): challenges everything that isn’t a full browser.
AI crawlers typically issue plain GET requests without a JS runtime. Cloudflare’s JSD script was injecting into responses they did receive, but they couldn’t and wouldn’t execute it, and on the next visit the heuristic layer would have them flagged as “suspicious automated traffic.”
Most teams launching on Cloudflare Pages will have this configured exactly the way ours was: on by default, quiet, looking like protection but working against the very crawlers the site is trying to attract.
I know it’s kind of risky removing all of these from Cloudflare, but this is for science. I can flip it back any time if the abuse logs start screaming.
How I fixed Cloudflare Bot Fight Mode using the API
I disabled Bot Fight Mode programmatically through the Cloudflare API. My workflow for managing the site is:
- Cloudflare API token with zone-level write scope, stored locally outside the repo.
- Claude Code as the operator. I describe the change in plain English; Claude inspects the live state via the CF API, surfaces the tradeoff, and, with explicit approval, issues the change. Same loop for DNS, security headers, page rules, bot management, anything else exposed by the API.
- Diff-and-apply discipline. I always read current state before writing, and I always print the new state after. No silent changes.
- Cloudflare Pages preview branches. Because the site lives on Cloudflare Pages, every git branch and pull request gets its own staging subdomain (<branch>.agencysite-c71.pages.dev) so I can preview changes against a real edge before merging to main.
The actual change for Bot Fight Mode was a single PUT:
PUT /zones/{zone_id}/bot_management
{
"fight_mode": false,
"enable_js": false
}
Three things happened immediately:
- The JSD script stopped injecting into HTML responses on cache miss.
- AI bots arriving at the front door now actually get through to the site instead of being challenged.
- Lighthouse’s Best Practices score recovered (this is what bumped my Lighthouse speed score back to a 99 / 100).
If I’d left Bot Fight Mode on, every update of this case study would be measuring a site that was rejecting the very crawlers I’d told it to invite. The post would have been fiction.
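For anyone scripting the same change, here is a minimal sketch of the PUT shown earlier in this section. It only builds the request and does not send it. The helper name buildBotManagementUpdate and its parameters are hypothetical; the path and the two body fields are the ones quoted above, and the base URL is Cloudflare's v4 API root.

```typescript
// Sketch only: describes the request, does not perform it.
// `buildBotManagementUpdate` is a hypothetical helper, not part of any SDK.
interface BotManagementUpdate {
  url: string;
  method: "PUT";
  headers: Record<string, string>;
  body: string;
}

const CLOUDFLARE_API_BASE = "https://api.cloudflare.com/client/v4";

function buildBotManagementUpdate(zoneId: string, apiToken: string): BotManagementUpdate {
  return {
    url: `${CLOUDFLARE_API_BASE}/zones/${zoneId}/bot_management`,
    method: "PUT",
    headers: {
      Authorization: `Bearer ${apiToken}`,
      "Content-Type": "application/json",
    },
    // Same two fields as the PUT body quoted in this section.
    body: JSON.stringify({ fight_mode: false, enable_js: false }),
  };
}
```

In keeping with the diff-and-apply discipline, a GET on the same path before and after the PUT gives the current and resulting state to print.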
How I’m logging new LLM bots that hit my website
Free Cloudflare plans don’t include Logpush, the official “ship every HTTP log to your storage” feature. So I built my own observability layer in 50 lines of TypeScript:
- A Pages Function middleware (functions/_middleware.ts) runs at the edge on every request, before the static page is served.
- It inspects the user-agent against an explicit allow-list of known LLM crawlers (the same names I welcome in robots.txt).
- For every match, it builds a structured event and forwards it to a Discord webhook (real-time pings on bot arrivals) and to a generic JSON webhook (for queryable archive). Console logs go to Cloudflare Pages real-time logs as a backup.
- Each event also classifies the request by content kind: html, llms-text (/llms.txt), llms-full (/llms-full.txt), sitemap, robots, manifesto, or manifest. That tells me at a glance whether bots are crawling my polished HTML pages or my LLM-friendly text artifacts, which is itself one of the most interesting things to watch.
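The content-kind classification could look roughly like the sketch below. The kind names are the ones listed above; the path-to-kind mapping (for example, treating .webmanifest files as manifest and anything under /manifesto as manifesto) is my assumption for illustration, not the site's actual rules.

```typescript
type ContentKind =
  | "html"
  | "llms-text"
  | "llms-full"
  | "sitemap"
  | "robots"
  | "manifesto"
  | "manifest";

// Kind names come from the post; the path matching below is assumed.
function classifyContentKind(pathname: string): ContentKind {
  const path = pathname.toLowerCase();
  if (path === "/llms.txt") return "llms-text";
  if (path === "/llms-full.txt") return "llms-full";
  if (path === "/robots.txt") return "robots";
  if (path === "/sitemap.xml") return "sitemap";
  if (path.endsWith(".webmanifest") || path === "/manifest.json") return "manifest";
  if (path.startsWith("/manifesto")) return "manifesto";
  // Everything else that reaches the logger is treated as a page.
  return "html";
}
```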
A quick word on user-agents. Nearly every HTTP request includes a User-Agent header, a self-identifying string like ClaudeBot/1.0 ([email protected]) or GPTBot/1.3. It’s a name tag, not an ID card: voluntary, unauthenticated, and trivial to forge. I match the UA against a list of known LLM crawlers (the same list I welcome in robots.txt) and log every match. Treating the spoofability as a feature rather than a bug is deliberate: it surfaces self-identified bot traffic regardless of who’s actually behind it, which is its own piece of the GEO landscape. Day 3 is what happens when someone takes that loophole for a spin.
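As a rough illustration of UA-only matching, here is a minimal sketch. The list holds crawler names that appear in this post; the real allow-list in the middleware may be longer or ordered differently, and matchBot is a hypothetical name.

```typescript
// Crawler names mentioned in this post; the production list is an unknown.
const KNOWN_LLM_BOTS: string[] = [
  "GPTBot",
  "OAI-SearchBot",
  "ChatGPT-User",
  "ClaudeBot",
  "Claude-User",
  "PerplexityBot",
  "Perplexity-User",
  "GoogleOther",
  "Google-Extended",
  "Applebot",
  "Amazonbot",
  "Bytespider",
  "CCBot",
  "Meta-ExternalAgent",
];

// Returns the first known bot name found in the User-Agent, or null.
// This trusts the header completely, which is the point: it counts
// self-identified traffic, forged or not.
function matchBot(userAgent: string | null): string | null {
  if (!userAgent) return null;
  const ua = userAgent.toLowerCase();
  for (const name of KNOWN_LLM_BOTS) {
    if (ua.includes(name.toLowerCase())) return name;
  }
  return null;
}
```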
The middleware never slows the response. It passes the request straight through via context.next() and ships the webhook calls in the background via context.waitUntil(). Bot detection happens in-line; webhook delivery does not.
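The shape of that pattern, as a minimal sketch: detect in-line, hand delivery to waitUntil(), return next(). The interfaces here are small local stand-ins for the Pages Functions context so the example runs on its own, and makeBotLogger, isBot, and notify are hypothetical names rather than the real middleware.

```typescript
// Local stand-ins for the parts of the Pages Functions context used here.
interface RequestLike {
  url: string;
  headers: { get(name: string): string | null };
}

interface MiddlewareContext<R> {
  request: RequestLike;
  next: () => Promise<R>;
  waitUntil: (work: Promise<unknown>) => void;
}

interface BotPing {
  bot: string;
  url: string;
  at: string;
}

function makeBotLogger<R>(
  isBot: (userAgent: string) => string | null,
  notify: (event: BotPing) => Promise<void>,
) {
  return async function onRequest(context: MiddlewareContext<R>): Promise<R> {
    const ua = context.request.headers.get("user-agent") ?? "";
    const bot = isBot(ua); // detection happens in-line
    if (bot !== null) {
      const event: BotPing = { bot, url: context.request.url, at: new Date().toISOString() };
      // Delivery is handed to the runtime; errors are swallowed so a
      // failing webhook never affects the visitor's response.
      context.waitUntil(notify(event).catch(() => undefined));
    }
    // The response is returned without awaiting the webhook.
    return context.next();
  };
}
```

In functions/_middleware.ts the returned handler would be exported as onRequest.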
Asset requests (.css, .js, .png, fonts, images) are skipped to keep the signal clean. I log HTML, .txt, .xml, and JSON requests only, which is exactly what bots crawl when they’re trying to understand a site.
Here’s what a single bot ping looks like in my Discord channel:
[Screenshot: Discord bot pings for /robots.txt and /sitemap.xml.] Each ping carries the bot name, content-kind classification, country, ASN, Cloudflare ray ID, timestamp, and the full user-agent.
In parallel, I also pull data from Cloudflare’s AI Crawl Control dashboard as a cross-reference. AI Crawl Control identifies bots via Cloudflare’s own fingerprinting (network signals + ASN + UA + behavior), which is a different methodology than my middleware’s UA-only matching. I treat the middleware as the canonical record for this case study; AI Crawl Control is the second opinion. When the two diverge, the divergence itself becomes a data point worth writing about. See the Day 1 results section for the first-24h breakdown from both sources.
I’m going to update this post any time something major happens. I have a Discord webhook set up to ping me any time a new bot hits the site, so this will be a really fun case study. I love anything meta.
Things I’ll publish from these logs as they accumulate:
- The name of the bot / LLM (e.g. ChatGPT, ClaudeBot, PerplexityBot, GoogleOther, Bytespider).
- First-seen-by-bot dates (e.g. “ClaudeBot first crawled llmcartel.com on 2026-05-09 17:42 UTC, six hours after publish”).
- Crawl frequency per bot (e.g. “Bytespider hit 14 paths in 24 hours; GPTBot hit 2 paths in a week”).
- Most-crawled and least-crawled paths (e.g. “/llms-full.txt got 23 hits this week, /about got 1, /contact got 0”).
- Country and ASN distribution (e.g. “CCBot from Amazon US-East, Bytespider from Singapore, Applebot from Apple ASNs across CA and NL”).
- Response codes (200s vs anything else, e.g. “ClaudeBot all 200, but PerplexityBot got 4 404s on broken inbound links”).
- Anomalies and surprises, including bots with strange names that look like LLMs but aren’t (there is no shortage of those, and figuring out which is real is half the fun).
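Most of those numbers fall out of one pass over the archived events. A minimal sketch, assuming each archived event carries a bot name, a path, and an ISO 8601 UTC timestamp in a consistent format (the field names are my invention, not the real webhook schema):

```typescript
interface BotEvent {
  bot: string;
  path: string;
  at: string; // ISO 8601 UTC, e.g. "2026-05-09T17:42:00Z"
}

interface BotSummary {
  firstSeen: string;
  requests: number;
  paths: Map<string, number>;
}

// Builds first-seen time, request count, and per-path hits for each bot.
function summarize(events: BotEvent[]): Map<string, BotSummary> {
  const byBot = new Map<string, BotSummary>();
  for (const e of events) {
    let summary = byBot.get(e.bot);
    if (summary === undefined) {
      summary = { firstSeen: e.at, requests: 0, paths: new Map<string, number>() };
      byBot.set(e.bot, summary);
    }
    // Same-format ISO UTC strings sort chronologically as plain strings.
    if (e.at < summary.firstSeen) summary.firstSeen = e.at;
    summary.requests += 1;
    summary.paths.set(e.path, (summary.paths.get(e.path) ?? 0) + 1);
  }
  return byBot;
}
```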
When to expect AI bots to crawl a brand-new website
Educated guesses for the record, so I can be honest about how wrong I was when the data lands:
- Within 24 to 72 hours: the most prolific crawlers (Bytespider, Amazonbot, GoogleOther) will probably show up just from sitemap discovery. They aggressively poll fresh DNS records.
- Within 7 days: GPTBot, ClaudeBot, and PerplexityBot likely make a first pass if they’ve picked up my domain through any signal (Common Crawl seed, Cloudflare DNS resolution patterns, certificate transparency logs).
- Within 30 days: training-data crawlers (Google-Extended, Applebot-Extended, Common Crawl) probably do a baseline pass.
- Beyond 30 days: the slower, retrieval-grade re-crawls, the ones that decide whether AI answers actually cite me versus merely know I exist, start showing up. This is the lag I expect to be longest.
If none of this happens? That’s also data. The whole point of publishing predictions is to be measured against them.
Tracking llms.txt
One thread I’m watching closely across this run is which genuine crawlers actually fetch /llms.txt. Compiled across the day-by-day sections that follow, the genuine hits so far:
- Day 2 - ClaudeBot (Anthropic), inside a broad discovery sweep.
- Day 2 - GoogleOther (Google), inside a 16-path site walk.
- Day 3 - GoogleOther (Google), a single check-in with a mobile UA.
- Day 5 - Amazonbot (Amazon), as part of its first-ever pass on the site.
- Day 9 - GPTBot (OpenAI), inside its first wide crawl after eight days of only fetching /.
- Day 11 - GPTBot (OpenAI), inside its second wide crawl three days later.
For /llms-full.txt: Amazonbot on Day 5 was the first crawler in the whole run to fetch the long-form variant; GPTBot took it on Days 9 and 11.
That is enough to confirm the files are reachable and being fetched, but - per the longer treatment under Day 9 - not enough to claim the convention is honored above and beyond “any file linked in the site footer gets crawled.” To get a cleaner read I’ve spun up a sister study at llmcartelbottesting.com, a separate domain that exists only to test agent-ready surfaces - no marketing pages, no blog, nothing competing with /llms.txt for crawler attention. Findings from that study will get posted here as they land.
Day 1: the first crawls are already on the board
Window: first 24 hours
By bot, all 2xx unless noted:
- GPTBot - OpenAI - 5 requests
- OAI-SearchBot - OpenAI - 1 request
- Googlebot - Google - 6 requests
- BingBot - Microsoft - 2 successful, 6 4xx (sitemap-shape probing)
- ClaudeBot - Anthropic - 1 request
- PerplexityBot, Bytespider, Amazonbot, FacebookBot - 0
One nuance worth noting: Cloudflare’s AI Crawl Control aggregates by operator, so “OpenAI” in the dashboard rolls up GPTBot plus its sibling crawlers. Splitting the same window with the GraphQL Analytics API surfaced the OAI-SearchBot row above - OpenAI’s distinct answer-engine crawler. Two different OpenAI crawlers landing on the site inside 24 hours is itself a finding: one bot for training corpus collection, a separate bot for live answer retrieval. The BingBot 4xx column is the day’s most interesting data point. Full breakdown two sections down.
What stood out:
- GPTBot tied Googlebot for volume. OpenAI’s crawler hit the site as hard as Google’s flagship bot on day one - parity, for now.
- The training-data crawlers haven’t shown up yet. Bytespider, Amazonbot, Common Crawl, Applebot-Extended and Google-Extended are all at zero; the bots that did arrive are the live-retrieval / answer-engine ones.
- ClaudeBot appeared but only once. Anecdotally consistent with Anthropic running a lighter crawl rhythm than OpenAI, but n=1 means it is only anecdote so far.
Twenty-four hours is too short to draw conclusions about anything except “the site exists to crawlers now.” Treat this as week-zero noise. The point of writing it down today is to anchor against it later, when I have a month of data to argue about.
Day 2 update: ClaudeBot ramped up overnight, GoogleOther toured the whole site
Window: May 8th evening to May 9th afternoon (~17 hours after the Day 1 snapshot)
By bot, all US-origin, all 2xx (one ClaudeBot 308 redirect mixed in):
- ClaudeBot/1.0 - Anthropic - 22 requests. Full discovery sweep: /robots.txt (6), /sitemap.xml (6), the case-study post via both URL forms, /blog, /llms.txt, /llms-full.txt, /atom.xml, /site.webmanifest, /apple-touch-icon.png, plus the Discord screenshot embedded in this post.
- GoogleOther - Google - 16 requests. Broadest single-bot site walk recorded so far: home, about, services, faq, privacy, /ada, /blog, the blog post (both URL forms), /llms.txt, /llms-full.txt, and /atom.xml.
- OAI-SearchBot/1.3 - OpenAI - 1 request to /robots.txt. Quiet check-in.
What stood out:
- ClaudeBot went from one hit to 22 in under a day. Day 1 left open whether it would stay sparse; for at least one cycle the answer is clear - when ClaudeBot comes back, it comes back wide.
- New content surfaces get ingested fast. The Atom feed shipped yesterday, and ClaudeBot and GoogleOther both pulled it - and /llms-full.txt - within hours of those URLs existing.
- ClaudeBot fetched the case-study post twice, once at .html, once without. It follows internal links and the sitemap entry independently, so publishing both URL forms surfaces both.
- GoogleOther does a tour, not a sip. Where ClaudeBot walked the discovery surfaces, GoogleOther traversed the actual marketing site in one pass - a different shape entirely.
- Predictions check. ClaudeBot’s first pass was predicted within 7 days and landed inside 24 hours; Bytespider, Amazonbot, Common Crawl, Google-Extended and Applebot-Extended are all still at zero.
What BingBot does with a brand-new site
Of all the crawlers that touched me in the first 24 hours, BingBot was the most aggressive about discovery probing - and that turned out to be the most useful behavioral signal of the entire day. Filtering Cloudflare’s analytics down to BingBot’s 4xx requests in the window gives a clean picture of exactly what Microsoft’s crawler tries when it lands on an unknown domain:
- 2 hits on /atom.xml - Atom feed for blog/article discovery
- 2 hits on /sitemap.xml.gz - gzipped sitemap (used by some CMSes for size)
- 1 hit on /sitemap.txt - plain-text sitemap (one URL per line)
- 1 hit on /sitemap_index.xml - sitemap-of-sitemaps (Yoast/WordPress convention)
- 1 hit on /sitemaps.xml - plural-form sitemap convention
Alongside these probes, BingBot also pulled my actual /sitemap.xml and a handful of HTML pages (4 successful 200 responses on the same day). So BingBot’s first-day playbook on a new site looks like: read the advertised sitemap, then probe every plausible alternate sitemap convention, then probe for an Atom feed. It’s looking for a content firehose, in any shape I’ll give it.
What stood out:
- BingBot wants an Atom feed. It asked twice in 24 hours and I didn’t ship one, so I’m adding /atom.xml - free, generated at build time by Astro, and a retrieval-friendly shape for the LLM bots too.
- BingBot is far more aggressive about sitemap probing than the LLM crawlers. No LLM crawler probed alternate sitemap conventions in the first day - classic search has to be exhaustive about discovery shape, while LLM crawlers lean on other signals (user queries, citation graphs, training seeds).
- 404 noise is signal. If you only watch your 2xx logs you miss what crawlers wish you had - the 4xx paths are a free wishlist, especially on a brand-new domain.
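Turning the 4xx column into that wishlist is a small aggregation. A minimal sketch, assuming request logs have been exported as records with a bot name, path, and status code (the LoggedRequest shape and the wishlist name are hypothetical):

```typescript
interface LoggedRequest {
  bot: string;
  path: string;
  status: number;
}

// Counts 4xx paths for one bot, most-requested first (ties sorted by path).
function wishlist(requests: LoggedRequest[], bot: string): Array<[string, number]> {
  const counts = new Map<string, number>();
  for (const r of requests) {
    if (r.bot !== bot || r.status < 400 || r.status > 499) continue;
    counts.set(r.path, (counts.get(r.path) ?? 0) + 1);
  }
  return Array.from(counts.entries()).sort((a, b) => {
    if (b[1] !== a[1]) return b[1] - a[1];
    return a[0] < b[0] ? -1 : a[0] > b[0] ? 1 : 0;
  });
}
```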
This is exactly the kind of small finding the case study was set up to surface. I never would have noticed the Atom-feed gap from looking at a normal traffic dashboard; it only showed up because I was hunting for behavioral signal in the 4xx column.
Day 3 update: when 68% of overnight bot pings turn out to be forged
Window: May 9th to May 10th (burst at 9pm UTC on the 9th)
By ASN, because the spoof story is the lens that matters here:
- NextGenWebs, S.L. - NL - 28 requests. Budget VPS provider in Amsterdam that cycled through ten different bot identities in about ninety seconds: GoogleOther, ClaudeBot, ChatGPT-User, OAI-SearchBot, Bytespider, Applebot, PerplexityBot, Perplexity-User, Amazonbot, and Meta-ExternalAgent. None of those bots actually run from there.
- Anthropic, PBC - US - 10 requests. Genuine ClaudeBot. Every hit was a /robots.txt + /sitemap.xml pair - five pairs, spaced roughly two hours apart from mid-afternoon through late evening. This is where the steady beat the Day 4 section covers actually started.
- Vultr Holdings - US - 2 requests. Two synthetic UAs (one GPTBot string, one ClaudeBot string, both tagged (markdown-test)) hitting /llms.txt and the home-page markdown variant. Looked like an external markdown-content-negotiation probe, not a real crawl.
- Google LLC - US - 1 request. Genuine GoogleOther, pulling /llms.txt with a mobile user-agent.
The forged batch wasn’t just suspicious by ASN. It was suspicious by behavior. The paths they probed:
- /asset-manifest.json - Create React App build manifest
- /_next/build-manifest.json - Next.js build manifest
- /_nuxt/manifest.json - Nuxt build manifest
- /dist/.vite/manifest.json - Vite build manifest
- /static/manifest.json - generic static-asset manifest
- /build/manifest.json - generic build manifest
- /stats.json - webpack bundle stats
- /manifest.json - PWA manifest
Real LLM crawlers don’t probe for JavaScript-framework build artifacts. They read your sitemap, your robots, your HTML, and (lately) your llms.txt. The path list above is fingerprinting / recon traffic from a vulnerability scanner - looking for accidentally-exposed source maps, framework version strings, or internal build paths - dressed up in bot-friendly user-agent strings to slip past basic UA-deny rules. The fact that the same ASN cycled through ten different bot identities (right down to GoogleOther and ClaudeBot) inside about ninety seconds confirms it.
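The identity-cycling tell can be checked mechanically after the fact. A minimal sketch, assuming events have been exported with an ASN name, the claimed bot, and a millisecond timestamp; the field names, the function name, and whatever thresholds you pass in are all illustrative choices:

```typescript
interface AsnEvent {
  asn: string;
  bot: string;
  atMs: number; // epoch milliseconds
}

// Flags ASNs that claimed at least `minIdentities` distinct bot names
// inside any window of `windowMs` milliseconds.
function findIdentityCyclers(events: AsnEvent[], windowMs: number, minIdentities: number): string[] {
  const byAsn = new Map<string, AsnEvent[]>();
  for (const e of events) {
    const list = byAsn.get(e.asn) ?? [];
    list.push(e);
    byAsn.set(e.asn, list);
  }
  const flagged: string[] = [];
  byAsn.forEach((list, asn) => {
    const sorted = list.slice().sort((a, b) => a.atMs - b.atMs);
    // For each starting event, collect the distinct bots seen before the window closes.
    for (let i = 0; i < sorted.length; i++) {
      const bots = new Set<string>();
      for (let j = i; j < sorted.length && sorted[j].atMs - sorted[i].atMs <= windowMs; j++) {
        bots.add(sorted[j].bot);
      }
      if (bots.size >= minIdentities) {
        flagged.push(asn);
        break;
      }
    }
  });
  return flagged;
}
```

A genuine operator's ASN can legitimately carry more than one bot name (OpenAI's two crawlers share one), so the identity threshold should sit above that.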
What stood out:
- UA spoofing at scale is trivial. Anyone with a $5/month VPS can paint a request as PerplexityBot and the middleware will count it - which is exactly why AI Crawl Control runs as the fingerprinting cross-reference.
- The divergence between the two methodologies is the story. The middleware (UA-only) saw 41 events; AI Crawl Control’s fingerprinting count for the same window will be lower, and that delta is the spoof rate.
- 404 noise is signal, redux. When a “bot” probes for /dist/.vite/manifest.json, that path is a fingerprint of the requester, not of this site - real LLM crawlers have no reason to know it exists.
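For the record, the arithmetic behind the headline number, as I read it: the per-ASN counts in this section sum to the 41 middleware events, and the NextGenWebs batch is 28 of them. The sketch below just makes that division explicit; shareOf is a hypothetical helper.

```typescript
// Request counts per ASN for the Day 3 window, as listed in this section.
const day3ByAsn: Record<string, number> = {
  "NextGenWebs, S.L.": 28,
  "Anthropic, PBC": 10,
  "Vultr Holdings": 2,
  "Google LLC": 1,
};

// Fraction of all counted requests that came from one ASN.
function shareOf(counts: Record<string, number>, key: string): number {
  let total = 0;
  for (const k of Object.keys(counts)) total += counts[k] ?? 0;
  return total === 0 ? 0 : (counts[key] ?? 0) / total;
}

const forgedShare = shareOf(day3ByAsn, "NextGenWebs, S.L."); // 28 / 41
const forgedPercent = Math.round(forgedShare * 100); // 68
```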
What I’m not doing: I’m not adding ASN allow-lists or verified-bot-IP filters to the middleware. The whole point of running UA-only matching is to count self-identified bot traffic. If a request claims to be ClaudeBot, that’s a meaningful data point regardless of whether the IP belongs to Anthropic - at minimum it tells me someone wants their probes to look like ClaudeBot, and that’s its own piece of the GEO landscape worth measuring.
What I am doing: publishing this finding the same day I noticed it, because that’s the discipline a live case study earns.
Day 4 update: ClaudeBot’s steady beat holds, GPTBot eats the Atom feed
Window: May 10 00:11 UTC to May 11 00:09 UTC (~24 hours)
By bot, the full 24-hour slice (30 events, all 2xx, all US-origin):
- ClaudeBot/1.0 - Anthropic, PBC - 22 requests. Eleven paired cycles of /robots.txt + /sitemap.xml, no other paths. First cycle 01:44, last cycle 22:19. No HTML or llms.txt fetches at all this window.
- OAI-SearchBot/1.3 - Microsoft Limited (OpenAI on Azure) - 6 requests. Every one to /robots.txt. Never touched a page.
- GPTBot/1.3 - Microsoft Limited (OpenAI on Azure) - 2 requests in a single 00:30 burst: / (the home page) and /atom.xml.
What stood out:
- ClaudeBot runs like clockwork. Eleven robots-then-sitemap cycles in 22 hours at a ~2 hour cadence - counting the Day 3 pairs, unbroken for about thirty hours.
- Same volume as Day 2, completely different posture. Day 2’s 22 requests were one wide discovery sweep; Day 4’s 22 never left robots and sitemap - the same crawler runs a sweep mode and a heartbeat mode, and what flips it back is the open question.
- OAI-SearchBot is robots-only across the whole window. Six visits, every one to /robots.txt - the answer-engine bot checking whether it is still welcome before sending retrieval crawlers in.
- GPTBot consumed the Atom feed. We shipped /atom.xml for BingBot on Day 1, and GPTBot pulled it two days later - ship the surface one crawler asks for and the next one uses it too.
- All OpenAI traffic this window routed through “Microsoft Limited.” Anthropic comes from its own ASN, so a self-identified GPTBot off a non-Microsoft ASN - or ClaudeBot off a non-Anthropic one - is suspect until proven otherwise.
- No spoofs this window. The day after the NextGenWebs forged batch was 100% legitimate traffic by both UA and ASN - spoof traffic so far comes in bursts, not as a steady baseline.
- Predictions check. Bytespider, Amazonbot, CCBot, Google-Extended and Applebot-Extended are all still at zero, with three days left on the 7-day window.
The most useful thing in this slice isn’t the bot tally - it’s the posture flip from ClaudeBot. A crawler that does 22 wide-discovery hits one day and 22 narrow-heartbeat hits two days later is telling you something about its internal state machine that wouldn’t show up in a per-day request-count metric. The shape of the requests matters more than the count.
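The ~2 hour cadence can be sanity-checked from the three numbers given for this window: eleven cycles, first at 01:44, last at 22:19. A small sketch of that arithmetic, assuming the cycles are roughly evenly spaced and fall on the same UTC day (helper names are mine):

```typescript
// Converts "HH:MM" to minutes since midnight.
function toMinutes(hhmm: string): number {
  const parts = hhmm.split(":");
  return Number(parts[0]) * 60 + Number(parts[1]);
}

// Mean gap between consecutive cycles, given the first and last cycle
// times on the same day and the total number of cycles.
function meanGapMinutes(first: string, last: string, cycles: number): number {
  if (cycles < 2) throw new Error("need at least two cycles");
  return (toMinutes(last) - toMinutes(first)) / (cycles - 1);
}

// Day 4: 22:19 - 01:44 = 1235 minutes across 10 gaps.
const day4GapMinutes = meanGapMinutes("01:44", "22:19", 11); // 123.5
```

That works out to just over two hours per cycle, in line with the cadence described above.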
Day 5 update: Amazonbot finally arrives, ClaudeBot stays on robots-only
Window: May 12 00:00 UTC to May 13 11:48 UTC (~36 hours)
By bot, the full slice (48 events, all 2xx, all US-origin):
- ClaudeBot/1.0 - Anthropic, PBC - 32 requests. Sixteen paired /robots.txt + /sitemap.xml cycles, no other paths, no HTML, no llms.txt. Same steady beat as Day 4, longer window, more cycles.
- Amazonbot/0.1 - Amazon Technologies Inc. and Amazon Data Services Northern Virginia - 10 requests, all on May 12 between 07:50 and 20:06 UTC. First real Amazonbot activity on the site, four days after launch. Ten distinct paths, one path per request: /privacy, /contact, /services, /blog, /about, /llms-full.txt, /faq, /atom.xml, /ada, /llms.txt. Inter-request gaps ran from ~30 minutes to ~2 hours.
- OAI-SearchBot/1.3 - Microsoft Limited (OpenAI on Azure) - 5 requests. Every one to /robots.txt. No page fetches, same as Day 4.
- GPTBot/1.3 - Microsoft Limited (OpenAI on Azure) - 1 request. A single / fetch at 04:25:55 UTC, 11 milliseconds after an OAI-SearchBot robots check from the same ASN.
What stood out:
- Amazonbot skipped the robots-and-sitemap intro entirely. It opened straight on /privacy and walked the site one URL at a time, with no /robots.txt or /sitemap.xml anywhere in the ten-request sweep - either cached from another property or not checked at all.
- Amazonbot pulled /llms-full.txt. First crawler in the entire run to fetch the long-form agent-ready dump; the short-form /llms.txt has now been picked up twice.
- Amazonbot also asked for /atom.xml. Three different operators have now hit the feed we shipped in response to a single BingBot 404 probe.
- Two Amazon ASNs inside one sweep. Seven requests from Amazon Technologies Inc., three from Amazon Data Services Northern Virginia - both inside the Amazon AS family, so the spoof-detection heuristic still holds.
- ClaudeBot is still on robots-only. Sixteen paired cycles in 35 hours, no HTML or llms.txt - publishing this post and bumping the sitemap lastmod did not flip it back to a wide sweep.
- GPTBot ran the two-step again. A SearchBot robots hit, then a same-ASN GPTBot fetch of / 11ms later - the same fingerprint as Day 4, twice now.
- Predictions check. Amazonbot landed inside the 7-day cohort window; Bytespider, CCBot, Google-Extended and Applebot-Extended are still at zero with about a day and a half left.
The useful read of this slice is that the bot personalities from earlier days are holding shape. Anthropic does heartbeat polling, OpenAI does a robots-check-then-fetch two-step, and Amazon walks one page at a time without ever checking robots in the same session. None of them are running anything close to the same playbook.
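The OpenAI two-step is easy to pull out of an event log mechanically. A minimal sketch, assuming exported events with bot, ASN, path, and a millisecond timestamp; the Hit shape, the function name, and the gap threshold are hypothetical choices, not how the middleware works today:

```typescript
interface Hit {
  bot: string;
  asn: string;
  path: string;
  atMs: number; // epoch milliseconds
}

interface TwoStep {
  gapMs: number;
  path: string;
}

// Finds each OAI-SearchBot /robots.txt check that is followed by a
// same-ASN GPTBot fetch within `maxGapMs`.
function findTwoSteps(hits: Hit[], maxGapMs: number): TwoStep[] {
  const sorted = hits.slice().sort((a, b) => a.atMs - b.atMs);
  const pairs: TwoStep[] = [];
  for (let i = 0; i < sorted.length; i++) {
    const check = sorted[i];
    if (check.bot !== "OAI-SearchBot" || check.path !== "/robots.txt") continue;
    for (let j = i + 1; j < sorted.length; j++) {
      const next = sorted[j];
      const gapMs = next.atMs - check.atMs;
      if (gapMs > maxGapMs) break; // sorted by time, so nothing later can qualify
      if (next.bot === "GPTBot" && next.asn === check.asn) {
        pairs.push({ gapMs, path: next.path });
        break;
      }
    }
  }
  return pairs;
}
```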
Day 6 update: GoogleOther found the case-study post, then toured the rest of the site
Window: May 13 11:48 UTC to May 14 00:12 UTC (~12 hours)
By bot, the slice (25 events, all 2xx, all US-origin):
- ClaudeBot/1.0 - Anthropic, PBC - 12 requests. Six paired /robots.txt + /sitemap.xml cycles, no other paths. Same rhythm as Day 4 and Day 5, same ~2 hour cadence.
- GoogleOther - Google LLC - 9 requests. Wide site walk in two phases: a single hit on the case-study post (no .html), then a ~2 hour pause, then a sweep of /services, /, /about, /privacy, /blog, the case-study post again (the .html form this time), /atom.xml, and /faq.
- Amazonbot/0.1 - Amazon.com, Inc. and Amazon Data Services Northern Virginia - 2 requests. Both to the case-study post, one URL form per ASN, 149ms apart.
- OAI-SearchBot/1.3 - Microsoft Limited (OpenAI on Azure) - 1 request to /robots.txt.
- GPTBot/1.3 - Microsoft Limited (OpenAI on Azure) - 1 request to /, 206ms after the OAI-SearchBot robots check from the same ASN.
What stood out:
- GoogleOther found the case-study post. It hit the post alone at 18:52, paused two hours, then swept the marketing pages - the post is the door Google walked through, not a line buried in a sitemap walk.
- Amazonbot is back, but only for the case study. A two-request return visit to the post - both URL forms, two Amazon ASNs, 149ms apart - while the rest of the site got one pass and was done.
- The OpenAI two-step is now a confirmed pattern. An OAI-SearchBot robots check, then a GPTBot / fetch 206ms later, same Microsoft Limited ASN, three consecutive windows.
- ClaudeBot’s steady beat continues. Six pairs in ~12 hours, still ~2 hour cadence, still robots-only - no HTML, no llms.txt, no Atom feed.
- Three different return-visit shapes inside one window. GoogleOther came back for a full tour, Amazonbot for one post, OpenAI for its robots-then-fetch two-step - different cadences, different surfaces.
- Predictions check. The 7-day LLM-cohort window closes tomorrow: GPTBot and ClaudeBot landed inside 24 hours, PerplexityBot is still at zero.
Day 7 update: Claude-User shows up requesting markdown, and the 7-day predictions land
Window: May 14 00:00 UTC to May 15 09:52 UTC (~34 hours)
By bot, the slice (35 events, all 2xx, all US-origin):
- ClaudeBot/1.0 - Anthropic, PBC - 28 requests. Fourteen paired /robots.txt + /sitemap.xml cycles, no other paths. Same clockwork as Days 4 through 6, ~1.9 hour cadence across the slice.
- Claude-User - Vultr Holdings, LLC - 2 requests. /services at 17:07 UTC and / at 18:07 UTC, both classified as markdown kind by the middleware (i.e. the request carried Accept: text/markdown and was served the .md variant rather than HTML). UA: Claude-User (claude-code/2.1.141; +https://support.anthropic.com/). Update (Day 9): this is self-traffic, not an external bot. The Vultr ASN is the dev VPS this site is built and deployed from - see the revised reading below and the Day 9 section. It is excluded from the running tally.
- OAI-SearchBot/1.3 - Microsoft Limited (OpenAI on Azure) - 3 requests, all to /robots.txt. Robots-only posture continues unbroken.
- Amazonbot/0.1 - Amazon Technologies Inc. - 1 request to the case-study post (no .html). Third consecutive day Amazonbot has come back specifically for this post.
- GPTBot/1.3 - Microsoft Limited (OpenAI on Azure) - 1 request to /, 26ms after a same-ASN OAI-SearchBot robots check.
What stood out:
- Claude-User is a different bot than ClaudeBot. ClaudeBot is Anthropic’s training crawler on Anthropic’s own ASN and schedule; Claude-User fires when a person has Claude fetch a URL, originating wherever that person’s tooling runs.
- Correction (Day 9): this was us. The Vultr ASN (AS20473) is this project’s own dev box - public IP 155.138.210.13 - so this was a Claude Code session checking the live site, not an external client. It is the first self-attribution catch of the run.
- The methodology lesson. The ASN-versus-UA discipline that caught the spoofers has to point inward too: these two requests are excluded from the tally, and genuine external uptake of the markdown variant is still zero.
- The 7-day predictions land, and two missed. PerplexityBot (predicted within 7 days) and Bytespider (predicted within 24-72h) never showed up at all - the most-named “LLM crawler” bots were not the first ones here.
- The OpenAI two-step is now four-for-four. Day 4 11ms, Day 5 11ms, Day 6 206ms, Day 7 26ms - four consecutive windows, same robots-then-fetch fingerprint.
- Amazonbot has confirmed return-target behavior. Days 5, 6 and 7 were the same surgical move - come back, pull the case-study post, leave - and three days running to one URL is a pattern.
- ClaudeBot’s steady beat is six days unbroken. Fourteen pairs at a ~1.9hr cadence, with zero HTML, llms.txt or Atom hits in the whole stretch.
The original close of this section called Claude-User “a different category of signal” - an end user’s curiosity surfaced through Claude. That reading is retracted. It was the project’s own dev box, and the more useful lesson is the one about the case study auditing itself: instrumentation that logs everything will log you too, and a live case study only stays honest if it points the same spoof-versus-self discipline at its own traffic that it points at everyone else’s. The Day 9 section picks this up alongside a second, unrelated spoofer.
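Pointing that discipline inward can be as simple as splitting the log before tallying. A minimal sketch, assuming each logged event records the client IP (the TalliedEvent shape and function name are hypothetical; the IP is the dev box address given in the correction above):

```typescript
interface TalliedEvent {
  bot: string;
  ip: string;
}

// Known own-infrastructure addresses; extend as other sources turn up.
const OWN_IPS: string[] = ["155.138.210.13"];

// Separates the project's own traffic from external traffic so only
// the external set feeds the running tally.
function splitSelfTraffic(events: TalliedEvent[]): { external: TalliedEvent[]; own: TalliedEvent[] } {
  const external: TalliedEvent[] = [];
  const own: TalliedEvent[] = [];
  for (const e of events) {
    if (OWN_IPS.indexOf(e.ip) >= 0) {
      own.push(e);
    } else {
      external.push(e);
    }
  }
  return { external, own };
}
```

Keeping the own-traffic list rather than discarding it preserves the audit trail for exactly this kind of correction.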
Day 8 update: the NextGenWebs spoofer comes back, with a wider mask
Window: May 15 09:52 UTC to May 16 14:35 UTC (~28 hours)
The Day 3 forged batch read like a one-off at the time - a single budget VPS taking the UA-spoofing loophole for a spin, then gone. The Day 4 section flagged the obvious open question: was that a single event, or the first data point in a series? This window answers it. The same ASN came back seven days later and ran a second forged batch - wider, better disguised, and on what is starting to look like a cadence. Genuine ClaudeBot and OpenAI traffic ran their established shapes underneath it, so the spoofer is the part that earns the close read.
By ASN, because the spoof story is the lens that matters again:
- NextGenWebs, S.L. - NL - 23 requests. The same Amsterdam budget-VPS ASN as the Day 3 forged batch, back seven days later. In 52 seconds (14:09:32 to 14:10:24 UTC on the 16th) it cycled twelve bot identities: CCBot, ClaudeBot, PerplexityBot, Perplexity-User, OAI-SearchBot, GoogleOther, Bytespider, GPTBot, YouBot, Amazonbot, ChatGPT-User, and Applebot. None of those bots run from there.
- Anthropic, PBC - US - genuine ClaudeBot, the same /robots.txt + /sitemap.xml pairs on the steady beat (16:37 and 18:34 UTC on the 15th, same ~2 hour cadence as Days 4 through 7).
- Microsoft Limited (OpenAI on Azure) - US - genuine OpenAI: an OAI-SearchBot hit to /robots.txt, then a GPTBot fetch of / 147ms later, same ASN.
The forged batch was suspicious by behavior, not just by ASN. Inside the same burst it probed:
- /asset-manifest.json - Create React App build manifest
- /manifest.json - PWA manifest
- /_next/build-manifest.json - Next.js build manifest
- /build/manifest.json - generic build manifest
- /.vite/manifest.json - Vite build manifest
- /_nuxt/manifest.json - Nuxt build manifest
- /.astro/manifest.json - Astro build manifest (new this round)
What stood out:
- The spoof rate cycles, and it cycles from the same source. The exact ASN that ran the Day 3 forged batch ran another one seven days later - a recurring burst, roughly weekly so far, not a one-off.
- The mask got wider and the recon got better camouflaged. Day 8 interleaved the framework probes with plausible page GETs across twelve identities instead of ten - camouflage meant to read as ordinary multi-crawler traffic, with the ASN still giving it away.
- Real and forged GPTBot on the same calendar day. A forged GPTBot from NextGenWebs and the genuine one from Microsoft Limited landed 26 minutes apart, near-identical UA, only the ASN separating them.
- The ClaudeBot UA tell held across both bursts. The forged ClaudeBot still uses the short Mozilla/5.0 (compatible; ClaudeBot/1.0; [email protected]) form - a reliable secondary tell when an ASN check isn’t available.
- The probe still doesn’t fit the target. The scanner added /.astro/manifest.json this round and still learned nothing - a static Astro dist deploy ships no build manifest; the list fingerprints the scanner, not the site.
- The OpenAI two-step is five-for-five. Day 8’s gap was 147ms - five consecutive windows, Days 4 through 8, same pipeline fingerprint.
Two forged bursts from one ASN, seven days apart, with the second better disguised than the first, is enough to call this a pattern rather than an incident. The spoof rate isn’t background noise on a fresh GEO domain - it’s a recurring signal with its own cadence and its own slowly-improving tradecraft, and it belongs in the running tally math as a tracked number, not a footnote.
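The ASN-versus-UA tell used in this section can be written as a lookup. The sketch below is an illustration under stated assumptions: the table holds only ASN labels this log has attributed to genuine traffic up to Day 8, the substring matching is simplified, and `classifyRequest` is a hypothetical helper rather than the site's middleware.

```typescript
type Verdict = "genuine" | "forged" | "not-a-tracked-bot";

// ASN labels observed for genuine traffic in this log so far. Illustrative
// and incomplete: not an authoritative list of operator networks.
const EXPECTED_ASN_ORGS: Record<string, string[]> = {
  ClaudeBot: ["Anthropic, PBC"],
  GPTBot: ["Microsoft Limited"],
  "OAI-SearchBot": ["Microsoft Limited"],
};

// Compare the bot identity a user-agent claims against the ASN it came from.
function classifyRequest(userAgent: string, asnOrg: string): Verdict {
  const ua = userAgent.toLowerCase();
  // First tracked bot name that appears anywhere in the UA string.
  const claimed = Object.keys(EXPECTED_ASN_ORGS).filter(function (bot) {
    return ua.indexOf(bot.toLowerCase()) !== -1;
  })[0];
  if (claimed === undefined) return "not-a-tracked-bot";
  return EXPECTED_ASN_ORGS[claimed].indexOf(asnOrg) !== -1 ? "genuine" : "forged";
}
```

A GPTBot user-agent arriving from "NextGenWebs, S.L." comes back `"forged"`, and the same string from "Microsoft Limited" comes back `"genuine"`. The table is the weak point: it has to grow whenever a genuine operator shows up on a new network label, or real traffic gets misfiled as forged.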
Day 9 update: a second spoofer goes hunting for secrets, and GPTBot finally crawls wide
Window: May 16 14:35 UTC to May 17 14:11 UTC (~24 hours)
Day 8 established that the forged traffic was a recurring burst from one Dutch VPS. Day 9 breaks that framing in a way worth being precise about: it is not one recurring actor, it is a category. A second, unrelated network showed up the same afternoon as the Day 8 NextGenWebs burst, with a different and more hostile objective. Underneath the noise, genuine GPTBot did something it had not done once in the previous eight days.
By ASN, since the lens still matters:
- Limestone Networks, Inc. - US - 5 requests. A Dallas-area hosting provider, forging GPTBot and ClaudeBot user-agents at 18:51 UTC on the 16th. This is not the NextGenWebs ASN and not its behavior - the path list is the tell.
- Microsoft Limited (OpenAI on Azure) - US - genuine OpenAI. A robots-then-fetch pair on the afternoon of the 16th, and then the one that matters: overnight into the 17th, an OAI-SearchBot robots check followed by GPTBot’s first wide content crawl of the entire run. A genuine OAI-SearchBot robots check that afternoon closes the window.
- Anthropic, PBC - US - genuine ClaudeBot, still on the /robots.txt + /sitemap.xml steady beat (the hand-assembled slice for this window is partial, so this is a floor on the cycle count, not a census).
What the Limestone batch actually probed:
- /.env - environment / secrets file
- /.env.bak - backed-up environment file
- /.env.production - production environment file
- /api/config - application config endpoint
Real LLM crawlers do not request /.env. The Day 3 and Day 8 NextGenWebs bursts were fingerprinting - working out which JavaScript framework built the site. This is a different game: /.env, /.env.bak, /.env.production, and /api/config are a credential-harvesting sweep, hunting for an accidentally-deployed secrets file or an exposed config route, wearing GPTBot and ClaudeBot user-agents to slip past naive UA-deny rules. It found nothing: this is a static Astro dist deploy with no environment files and no API routes at the web root, so every probe returned a 404.
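The distinction between fingerprinting and credential harvesting can be made mechanical by labeling each batch by the paths it asks for. A minimal sketch, using only the paths listed in the Day 8 and Day 9 sections; the function names and the tie-breaking rule are assumptions for illustration.

```typescript
type ProbeKind = "framework-recon" | "credential-sweep" | "other";

// Build-manifest probes from the Day 8 list. Note that /manifest.json is
// also a legitimate PWA manifest path, so one hit alone proves little.
const FRAMEWORK_RECON_PATHS: string[] = [
  "/asset-manifest.json",
  "/manifest.json",
  "/_next/build-manifest.json",
  "/build/manifest.json",
  "/.vite/manifest.json",
  "/_nuxt/manifest.json",
  "/.astro/manifest.json",
];

// Secrets and config probes from the Day 9 Limestone batch.
const CREDENTIAL_SWEEP_PATHS: string[] = [
  "/.env",
  "/.env.bak",
  "/.env.production",
  "/api/config",
];

function classifyProbe(path: string): ProbeKind {
  if (CREDENTIAL_SWEEP_PATHS.indexOf(path) !== -1) return "credential-sweep";
  if (FRAMEWORK_RECON_PATHS.indexOf(path) !== -1) return "framework-recon";
  return "other";
}

// Label a whole batch by its dominant probe kind. Ties go to the more
// hostile label (an assumed policy, chosen to err on the side of caution).
function classifyBatch(paths: string[]): ProbeKind {
  let recon = 0;
  let creds = 0;
  paths.forEach(function (p) {
    const kind = classifyProbe(p);
    if (kind === "framework-recon") recon++;
    if (kind === "credential-sweep") creds++;
  });
  if (recon === 0 && creds === 0) return "other";
  return creds >= recon ? "credential-sweep" : "framework-recon";
}
```

Classifying by batch rather than by request is deliberate: the Day 8 burst interleaved ordinary page GETs with its probes, so individual requests look innocent while the batch does not.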
What stood out:
- Two spoofers, two objectives, one afternoon. The NextGenWebs framework-recon burst and a Limestone credential sweep landed about 4.5 hours apart - different networks running different playbooks, so the spoof rate has to track distinct ASNs, not just request volume.
- The forged ClaudeBot string is identical across both spoofers. Two unrelated networks producing the same wrong short-form UA points at a shared off-the-shelf scanning toolkit rather than two hand-rolled efforts.
- GPTBot finally crawled wide. After eight days of only fetching / it did its first full link-walk - the behavior change is the finding, but it reached /llms-full.txt by following a footer link, which proves reachability and nothing about content value.
- The markdown-for-agents surface still has zero genuine uptake. Every kind=markdown event in the whole run is the corrected Day 7 self-traffic - an honest negative result, not a broken feature.
- A control experiment now exists for the “crawlers fetch anything reachable” problem. There is now an orphan file reachable through no link, sitemap, robots or llms.txt entry; a paired llms.txt-only probe follows, and the contrast is the real test of whether the convention is honored.
- The OpenAI two-step held twice more. A 147ms gap on May 16 and a 151ms gap on May 17 - unbroken Days 4 through 9.
The spoof story has matured from “someone took the UA loophole for a spin” into a small taxonomy: recurring framework recon from one network, opportunistic credential harvesting from another, both masked as LLM crawlers. The genuine-crawler story moved the opposite direction - GPTBot widened out from a heartbeat into a real ingest. The discipline holding both halves together is the same one this section had to apply to its own Day 7 mistake: a fetch is an observation, not a conclusion, and the value claims stay narrow until the orphan-versus-llms.txt experiment earns the stronger one or honestly retires it.
Day 10 update: WTF, do LLM bots take Sunday off?
Window: May 17 14:11 UTC to May 18 13:48 UTC (~24 hours)
Here is the Day 10 finding stated plainly: across this entire ~24-hour window, the number of genuine third-party LLM crawler requests to llmcartel.com was zero. Not “down.” Zero. The last genuine LLM hit of the whole run was the OAI-SearchBot /robots.txt check at roughly 14:00 UTC on May 17 - the same request Day 9 used to close its window - and nothing genuine has landed since. That cutoff is around 10am US Eastern on a Sunday, which is where the section title comes from. Whether that timing means anything is a separate question from whether the pause is real, and the two have to be answered in that order.
The pause is real, and here is how that was established rather than assumed. The Day 7 lesson was that instrumentation which logs everything will eventually mislead you about itself, so a sudden silence on our own Discord feed is exactly the kind of result that has to be checked against a source that does not share the pipeline. It was. Cloudflare’s own request analytics (the GraphQL httpRequestsAdaptiveGroups dataset, completely independent of the _middleware.ts logger and the Discord webhook) was queried directly for the window. It agrees: one OAI-SearchBot robots check at the start, one self-originated ClaudeBot/1.0 (manual-test) request from a deliberate pipeline probe, and not a single genuine GPTBot, ClaudeBot, PerplexityBot, Applebot, or Google-Extended request in between. The middleware was not broken - the manual-test probe returned 200 and was logged correctly, and the middleware code was unchanged across the entire window. There was simply nothing for it to log.
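For readers who want to run the same cross-check, here is a rough sketch of what that independent query could look like. Treat it as an assumption-heavy outline: the dataset name comes from the text above, but the filter fields, dimension names and variable types are written from memory of Cloudflare's GraphQL schema and should be verified in the schema explorer, since availability varies by plan. The helper names and the `manual-test` marker handling are hypothetical.

```typescript
// JSON body that would be POSTed to Cloudflare's GraphQL endpoint
// (https://api.cloudflare.com/client/v4/graphql) with an API token.
interface GraphQLRequestBody {
  query: string;
  variables: { zoneTag: string; start: string; end: string };
}

// Assumed query shape; check every field name against the live schema.
const WINDOW_QUERY = `
query CrawlerWindow($zoneTag: string, $start: Time, $end: Time) {
  viewer {
    zones(filter: { zoneTag: $zoneTag }) {
      httpRequestsAdaptiveGroups(
        limit: 1000
        filter: { datetime_geq: $start, datetime_leq: $end }
        orderBy: [count_DESC]
      ) {
        count
        dimensions { userAgent clientRequestPath }
      }
    }
  }
}`;

function buildWindowQuery(zoneTag: string, start: string, end: string): GraphQLRequestBody {
  return { query: WINDOW_QUERY, variables: { zoneTag: zoneTag, start: start, end: end } };
}

// One grouped row as it would come back under httpRequestsAdaptiveGroups.
interface GroupRow {
  count: number;
  dimensions: { userAgent: string; clientRequestPath: string };
}

// Crawler names from the paragraph above, matched as UA substrings.
const LLM_BOT_TOKENS: string[] = [
  "GPTBot", "ClaudeBot", "PerplexityBot", "Applebot", "Google-Extended", "OAI-SearchBot",
];

// Sum requests that claim an LLM-bot UA, skipping rows tagged as self-traffic.
function countLlmBotRequests(rows: GroupRow[], selfMarkers: string[]): number {
  return rows.reduce(function (total, row) {
    const ua = row.dimensions.userAgent;
    const isBot = LLM_BOT_TOKENS.some(function (t) { return ua.indexOf(t) !== -1; });
    const isSelf = selfMarkers.some(function (m) { return ua.indexOf(m) !== -1; });
    return isBot && !isSelf ? total + row.count : total;
  }, 0);
}
```

This count is UA-only, so it sets an upper bound on genuine crawler traffic and does not verify it. Separating genuine from forged still needs the ASN check, which happened to be moot in a window with no spoofers.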
What did keep hitting the site, the whole time, by ASN:
- Monitoring and self-traffic - curl and uptime-style requests dominated the window by volume, including the project’s own dev box. Steady, unremarkable, and not LLM-identified, so correctly ignored by UA-only matching.
- Exploit and recon scanners - the usual background sewage: /wp-admin/install.php, /wp-config.php, InternetMeasurement/1.0, a Palo Alto Cortex Xpanse scan, BuiltWith, DotBot (Moz’s SEO crawler - not an LLM bot, correctly not matched). All of it kept arriving on schedule.
- Genuine LLM crawlers - none. Zero requests for the full window.
That breakdown matters because it rules out the boring explanations one at a time. The site was up and reachable the entire window - scanners and monitors got their normal responses, so “the site was down” is out. Cloudflare logged exactly two firewall blocks in 24 hours, both legitimate abuse (a curl hit on /wp-config.php and one exploit scanner on /), and zero LLM bots blocked; Bot Fight Mode is still off, so “Cloudflare started challenging the crawlers” is out. The spoofers were quiet too - the roughly-weekly NextGenWebs burst was not due in this window and no Limestone-style sweep appeared - so this is not forged traffic crowding out a signal either. The genuine LLM crawlers, specifically and only the genuine LLM crawlers, stopped.
What stood out:
- The steady beat stopped, and that is the headline. ClaudeBot’s ~2-hour heartbeat ran unbroken for nine days, then went silent after ~14:00 UTC May 17 and has not ticked since.
- The OpenAI two-step did not fire. Five-plus windows unbroken through Day 9, and in this window it did not happen once.
- GPTBot’s wide crawl did not recur. Day 9’s link-walk was reported as a one-off, not a baseline - Day 10 is the honest follow-through: one wide crawl in ten days.
- n=1, and the title is bait. The pause ran a full 24 hours from Sunday midday straight into Monday, which undercuts any tidy “they rest on weekends” story - the defensible finding is the independently verified cessation, not the day of the week.
- The more interesting hypothesis is discovery-versus-steady-state. Discovery was answered in the first hour; what steady-state recrawl looks like is a different question, and the nine-day rhythm may have been discovery-phase behavior that has now decayed.
The single most useful thing Day 10 does is convert a scare into a method. A live case study’s own feed going dark is not a finding until it is checked against a source that cannot fail the same way; once it was, the silence stopped being a bug report and became the data. The crawlers genuinely paused, the pause is independently confirmed, and the question of why stays open and explicitly under-claimed - the next few weekends are the experiment, not this one.
Day 11 update: the pause breaks, GPTBot crawls wide again, and a sister case study muddies the water
Window: May 18 13:48 UTC to May 19 15:00 UTC (~25 hours)
Day 10 closed on an open question: the genuine-crawler feed had been dark for 24 hours, and the two readings were “a temporary pause” versus “the discovery-phase steady beat has decayed for good.” Day 11 resolves it in favor of the first. Genuine third-party crawlers resumed roughly 24 hours after they stopped, and they came back selectively rather than all at once.
What stood out:
- The pause ran about 24 hours, then ended on its own. PerplexityBot arrived by mid-afternoon on May 18, then Applebot and Amazonbot - the silence was real, time-bounded, and broke with no change on our side.
- The ClaudeBot steady beat did not come back with the rest. Every other crawler resumed; the ~2-hour ClaudeBot robots+sitemap heartbeat did not, and there is no genuine ClaudeBot in this window at all.
- GPTBot crawled wide again. Around midday on May 19 it walked twelve paths in one pass - the second full ingest, three days after the first, so Day 10’s “one wide crawl in ten days” no longer holds.
- The OpenAI two-step resumed, twice. Two gaps of ~127ms and ~169ms, both Microsoft Limited - the second instance fed the wide crawl above.
- A third spoofer ASN, running the same recon playbook. The Infrastructure Group B.V. (NL) cycled roughly twelve forged identities through the same framework-fingerprinting sweep as the Day 3 and Day 8 bursts - a third distinct network on one shared toolkit.
- The toolkit added an Astro probe, and the ASN-versus-UA tell held again. The recon list now includes /.astro/manifest.json, and the forged Applebot used a short UA string while genuine Applebot the same day used the full Safari-based form - separated only by ASN.
The sister case study is now a confounder, and that is the honest headline of Day 11. This site published its second case-study post today - GEO Case Study: AI Chatbots Are Really Bad at Reading JavaScript - and within hours Applebot had crawled it twice (14:50 and 14:53 UTC May 19). That is a clean, fast mini-discovery in its own right, but it changes what this study can still claim. Up to Day 10 the premise was clean: one new entity, one domain, measure how crawlers find and re-crawl it. From Day 11 forward that premise no longer holds, because a fresh, link-worthy content drop just entered the same crawl surface. Any wide sweep from here - including GPTBot’s second one above and Applebot’s jump straight to the new post - is plausibly a response to new content rather than steady-state recrawl, and the two can no longer be cleanly separated on this domain.
That is worth stating as a hypothesis rather than burying as a caveat. The long-open question from Day 4 onward - what flips a bot out of heartbeat mode and back into a wide ingest sweep - now has a candidate answer. Publishing this very post did not do it (Day 5 established that), and sitemap lastmod bumps did not do it. A genuinely new, separately-themed post going live the same day GPTBot crawled wide and Applebot beelined to it is the strongest correlation the run has produced for “new publishable content is the wide-sweep trigger.” It is one co-occurrence, not a proven cause, and it is now structurally entangled with the second study - so the way to test it is no longer this post in isolation. The two case studies share a domain, a crawl log, and from here a single analysis seam. Future updates will read both together rather than pretending this one is still a single-entity experiment.
One smaller note for the record. Over about a minute on May 18, a ChatGPT-User request hit the case-study post and then / from a Microsoft Corporation ASN, and roughly a minute later a Claude-User request hit /robots.txt and the same post from a Google LLC ASN. Both are user-initiated fetch agents, not autonomous crawlers. The Claude-User one is off an ASN that is not this project’s dev box, so it is not auto-excluded as self-traffic the way the Day 7 Vultr hits were, but it is also not a crawler and stays out of the crawler tally. Two assistants fetching the post within a minute of each other, right as the genuine-crawler pause was ending, is the prompt-driven-fetch shape the next experiment in this series is built to isolate.
Day 12 update: OAI-SearchBot picks up the steady beat, Applebot circles back to the case study
Window: May 19 15:00 UTC to May 20 13:00 UTC (~22 hours)
Day 11 closed on an open question: now that the pause is over and the bots are back, would the resumption settle into a recognizable shape or stay sporadic? Day 12 starts to answer it. The shape that has emerged is not the old ClaudeBot heartbeat - it is OAI-SearchBot doing the same job, on a longer cadence.
What stood out:
- OAI-SearchBot is the new steady beat. Five robots-only hits to /robots.txt in roughly 24 hours at a ~5-6 hour cadence - the same posture ClaudeBot ran on Days 4 through 9, now executed by a different operator on a longer interval.
- ClaudeBot is still absent. Still zero genuine ClaudeBot, roughly 48 hours past the resumption that brought everything else back - no longer a window-of-one absence but a behavior change.
- Applebot circled back to the case-study post. A single hit at 12:52 UTC May 20 - Day 11’s newer sister post pulled the crawler in first, and this older post was picked up on a day-later lag.
- GoogleOther ran the same two-step on mobile. It hit / then the new sister post two hours later on the Nexus 5X mobile UA - the same new-content-pull shape as Applebot.
- No GPTBot wide sweep, consistent with the working hypothesis. Sweeps fire days apart near new-content events, and nothing was published this window, so nothing fired.
The throughline from Day 11 into Day 12 is that this study now reads as two interleaved questions on the same domain rather than one. Steady-state recrawl of an established entity is what the OAI-SearchBot rhythm and the ClaudeBot absence speak to. New-content discovery on a known entity is what Applebot and mobile GoogleOther both demonstrated by jumping to the sister post first and circling back to the older one later. Both halves are useful; neither would be visible if I had only published the original post and stopped.
Day 13 update: the OAI-SearchBot steady beat is two OpenAI fleets
Window: May 20 13:00 UTC to May 21 10:06 UTC (~21 hours)
Day 12 named OAI-SearchBot the new steady beat and left the cadence at “~5-6 hours.” Day 13 holds that cadence and then complicates the picture: the single “steady beat” is not one crawler on a timer. It is at least two distinct OpenAI deployment groups taking turns on /robots.txt, and once you log the user-agent string, the Cloudflare colo, and the ASN label together they separate cleanly.
Every genuine request in this window was OpenAI - no ClaudeBot, no Perplexity, Applebot or Amazonbot, and no spoofers in the feed. Six events:
- 14:26 UTC May 20 - OAI-SearchBot/1.0, DFW colo, ASN “Microsoft Corporation” - /robots.txt.
- 21:07 UTC May 20 - OAI-SearchBot/1.3, ASN “Microsoft Limited” - /robots.txt.
- 01:34 UTC May 21 - OAI-SearchBot/1.3, ATL colo, ASN “Microsoft Limited” - /robots.txt.
- 08:12:50.427 UTC May 21 - OAI-SearchBot/1.3, ATL colo, ASN “Microsoft Limited” - /robots.txt.
- 08:12:50.559 UTC May 21 - GPTBot/1.3, ATL colo, ASN “Microsoft Limited” - /.
- 10:06 UTC May 21 - OAI-SearchBot/1.0, DFW colo, ASN “Microsoft Corporation” - /robots.txt.
What stood out:
- The steady beat held a second straight window. Five OAI-SearchBot robots-only hits averaging just under five hours, inside the ~5-6 hour band Day 12 measured - a second window of the same posture confirms it as a stable shape.
- The steady beat is two fleets, not one. The two “Microsoft Corporation” hits ran OAI-SearchBot/1.0 from DFW while all three “Microsoft Limited” hits ran OAI-SearchBot/1.3 from ATL - version, colo and ASN label move together, and only the 1.3 string carries the robots.txt; purpose token.
- OpenAI’s own crawler fleet is heterogeneous. The genuine baseline is a set, not a single fingerprint - two ASN labels, two colos, two UA versions - so spoof detection has to check a self-identified OpenAI bot against the whole known set.
- The OpenAI two-step fired once, in its narrow form. An OAI-SearchBot robots check followed by a GPTBot / fetch 132ms later, both on the 1.3 / Microsoft Limited fleet - and no wide sweep, since nothing new was published this window.
- ClaudeBot is still absent. Zero genuine ClaudeBot again, roughly three days past the Day 11 resumption - Day 12 called it a behavior change, and Day 13 extends the count.
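The cadence and two-step figures above are plain arithmetic on the six logged events, and a short sketch makes them reproducible. Two assumptions to flag: the year is taken from the running tally table, and seconds are zeroed wherever the log gives only hour and minute. The type and function names are hypothetical.

```typescript
interface CrawlEvent {
  time: string;   // ISO 8601, UTC
  bot: string;    // UA product token
  asnOrg: string; // ASN label as logged
  path: string;
}

// The six Day 13 events. Zeroed seconds are an assumption where the log
// records only hour and minute.
const DAY13_EVENTS: CrawlEvent[] = [
  { time: "2026-05-20T14:26:00.000Z", bot: "OAI-SearchBot/1.0", asnOrg: "Microsoft Corporation", path: "/robots.txt" },
  { time: "2026-05-20T21:07:00.000Z", bot: "OAI-SearchBot/1.3", asnOrg: "Microsoft Limited", path: "/robots.txt" },
  { time: "2026-05-21T01:34:00.000Z", bot: "OAI-SearchBot/1.3", asnOrg: "Microsoft Limited", path: "/robots.txt" },
  { time: "2026-05-21T08:12:50.427Z", bot: "OAI-SearchBot/1.3", asnOrg: "Microsoft Limited", path: "/robots.txt" },
  { time: "2026-05-21T08:12:50.559Z", bot: "GPTBot/1.3", asnOrg: "Microsoft Limited", path: "/" },
  { time: "2026-05-21T10:06:00.000Z", bot: "OAI-SearchBot/1.0", asnOrg: "Microsoft Corporation", path: "/robots.txt" },
];

function isRobotsCheck(e: CrawlEvent): boolean {
  return e.bot.indexOf("OAI-SearchBot") === 0 && e.path === "/robots.txt";
}

// Mean spacing, in hours, between consecutive OAI-SearchBot robots checks.
function meanRobotsGapHours(events: CrawlEvent[]): number {
  const times = events
    .filter(isRobotsCheck)
    .map(function (e) { return Date.parse(e.time); })
    .sort(function (a, b) { return a - b; });
  if (times.length < 2) return NaN;
  return (times[times.length - 1] - times[0]) / (times.length - 1) / 3600000;
}

// Gaps in ms where a robots check is immediately followed by a GPTBot fetch
// of "/" from the same ASN label, no more than maxGapMs later.
function twoStepGapsMs(events: CrawlEvent[], maxGapMs: number): number[] {
  const sorted = events.slice().sort(function (a, b) {
    return Date.parse(a.time) - Date.parse(b.time);
  });
  const gaps: number[] = [];
  for (let i = 1; i < sorted.length; i++) {
    const prev = sorted[i - 1];
    const next = sorted[i];
    const gap = Date.parse(next.time) - Date.parse(prev.time);
    const isFetch = next.bot.indexOf("GPTBot") === 0 && next.path === "/";
    if (isRobotsCheck(prev) && isFetch && prev.asnOrg === next.asnOrg && gap <= maxGapMs) {
      gaps.push(gap);
    }
  }
  return gaps;
}
```

On these six events the mean robots-check gap comes out at roughly 4.9 hours, and `twoStepGapsMs(DAY13_EVENTS, 1000)` returns a single gap of 132ms, matching the figures in the list.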
The throughline of Day 13 is that “genuine OpenAI traffic” is not a single thing. The steady beat Day 12 handed to OAI-SearchBot is really two OpenAI deployment groups - an older 1.0 fleet and a current 1.3 fleet - alternating on the same robots check, and only the 1.3 fleet carries the GPTBot two-step. Whether that is a staged version rollout caught mid-flight or two permanently separate fleets is the open question; either way, the practical lesson is the one Day 4 started and Day 13 sharpens: spoof detection needs the genuine baseline mapped as a set of profiles, because the real crawler does not hold still.
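One way to picture "the genuine baseline as a set of profiles" is a three-way match instead of a yes/no. The sketch below is an assumption, not the site's detection logic: the profile list holds only what the Day 13 window showed, and the middle verdict is a hypothetical bucket for traffic that fits a known network but not a known fingerprint.

```typescript
interface FleetProfile {
  uaToken: string; // UA product token, e.g. "OAI-SearchBot/1.3"
  asnOrg: string;  // ASN label as logged
  colo: string;    // Cloudflare colo that served the request
}

// Profiles observed in the Day 13 window. Assumed incomplete.
const KNOWN_OPENAI_PROFILES: FleetProfile[] = [
  { uaToken: "OAI-SearchBot/1.0", asnOrg: "Microsoft Corporation", colo: "DFW" },
  { uaToken: "OAI-SearchBot/1.3", asnOrg: "Microsoft Limited", colo: "ATL" },
  { uaToken: "GPTBot/1.3", asnOrg: "Microsoft Limited", colo: "ATL" },
];

type ProfileMatch = "known-profile" | "known-network-new-profile" | "unknown-network";

// Check a self-identified OpenAI request against the whole known set.
function matchOpenAiProfile(uaToken: string, asnOrg: string, colo: string): ProfileMatch {
  const exact = KNOWN_OPENAI_PROFILES.some(function (p) {
    return p.uaToken === uaToken && p.asnOrg === asnOrg && p.colo === colo;
  });
  if (exact) return "known-profile";

  // Same network label but a new version or colo: log it for review
  // instead of rejecting it, since the real fleet changes over time.
  const sameNetwork = KNOWN_OPENAI_PROFILES.some(function (p) {
    return p.asnOrg === asnOrg;
  });
  return sameNetwork ? "known-network-new-profile" : "unknown-network";
}
```

The middle bucket is the design choice that matters. A new UA version on a known ASN label is more likely a rollout than a spoof, so it should be logged and reviewed and, if it checks out, added to the set. An OpenAI user-agent from a network outside the set is the spoof candidate.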
Day 14 update: PerplexityBot sweeps the site, then Perplexity-User follows
Window: May 21 15:53 UTC to May 22 03:51 UTC (~12 hours)
This is the last entry in the field log; the conclusion below explains why the study closes here. The window itself was busy - sixteen genuine events, no spoofers, and three findings worth recording.
What stood out:
- PerplexityBot ran its first wide crawl. Eight requests in 33 seconds - robots, /ada, /contact, /, /faq, and the case-study post three times across both URL forms. The bot the Day 7 predictions wrote off as a no-show is now doing full content sweeps.
- PerplexityBot and Perplexity-User landed two minutes apart. The crawl finished, then Perplexity-User fetched /about about two minutes later, same Amazon ASN and IAD colo - a crawler sweep tightly coupled to a user-agent fetch, which reads as one human query driving both.
- OAI-SearchBot bumped to 1.4. The last 1.3 hit landed on the afternoon of May 21; by the early hours of May 22 the same ATL / Microsoft Limited fleet was sending OAI-SearchBot/1.4 - the version rolling forward supports Day 13’s “staged rollout” reading over “two permanent fleets.”
- A genuine Meta-ExternalAgent, at last. meta-externalagent/1.1 hit / late on May 21 from Meta Platforms Ireland Limited - the first real one of the run, after three forged claims to the identity on Day 3.
- The familiar shapes held. The OpenAI two-step fired once more (OAI-SearchBot robots then GPTBot /, 201ms apart), and Applebot did a robots-then-post move straight to the case-study post.
Conclusion: closing the experiment
The question this study set out to answer - how long does it take an LLM bot to discover a new business entity - was answered in the first hour: GPTBot crawled the launch post about sixty minutes after it went live, and five separate bots reached the site within Day 1. The fuller picture filled in over the first week, and by Day 7 the predictions could be graded - two of the most-named “LLM crawlers,” PerplexityBot and Bytespider, had missed their windows entirely.
The experiment closes here because its premise has expired. A discovery study needs a clean slate - an entity nothing has found yet - and llmcartel.com is no longer that. The site now turns up in Google search, the sister case study published on Day 11 put fresh content on the same crawl surface, and Day 14 caught a PerplexityBot sweep and a Perplexity-User fetch two minutes apart. That last pair is the clearest sign: a bot visit can no longer be cleanly read as autonomous discovery, because demand-side traffic - real people asking answer engines about this business - is now interleaved with it. You cannot measure time-to-discovery on an entity that has already been discovered.
That is the natural end of a clean-slate study, not a failure of one. It began with a domain no crawler had touched and ends with an answer engine fetching the About page to answer a question about the company. That progression is the outcome generative engine optimization is meant to produce, and running the measurement in public on instrumented infrastructure was the point.
Several threads are still open and worth their own treatment: what steady-state recrawl looks like once an entity is established (the OAI-SearchBot steady beat and ClaudeBot’s unexplained disappearance both point at it), the spoof taxonomy that grew across the run, and the orphan-versus-llms.txt control experiment that has so far run only its control arm. Those are the material for future updates; the bot logger stays live. The full tally of every crawler that showed up - and the predicted ones that never did - is below.
Running tally: every bot that has touched llmcartel.com so far
Final tally, genuine third-party crawlers only. Request counts are deliberately omitted - on a UA-only hand-assembled record an exact total would be false precision, and first-seen, last-seen and posture are the honest cells. Forged and self-originated traffic is excluded throughout: the NextGenWebs, Limestone and Infrastructure Group spoof bursts (Days 3, 8, 9 and 11) and this project’s own dev-VPS requests.
| Bot / operator | First seen | Last seen | Status & posture |
|---|---|---|---|
| ClaudeBot / Anthropic | 2026-05-08 | 2026-05-17 | Heaviest crawler of the run. A Day 2 wide sweep, then a ~2hr robots+sitemap clockwork run that ran nine days unbroken, stopped on Day 10, and never resumed. |
| GPTBot / OpenAI | 2026-05-08 | 2026-05-21 | First bot in, ~60min after publish. Sparse home-page fetches via the OpenAI two-step, plus two wide ingest crawls (Days 9 and 11). |
| OAI-SearchBot / OpenAI | 2026-05-08 | 2026-05-22 | Robots-only throughout. Became the field’s steady beat after ClaudeBot went quiet; UA rolled 1.0 to 1.3 to 1.4 across the run. |
| Googlebot / Google | 2026-05-08 | 2026-05-08 | One Day-1 pass, no return. |
| BingBot / Microsoft | 2026-05-08 | 2026-05-08 | Day-1 sitemap-shape probing, mostly 4xx variant-sitemap probes. Quiet since. |
| GoogleOther / Google | 2026-05-09 | 2026-05-20 | Two desktop site tours (Days 2 and 6), then mobile-UA discovery hits on newer posts. |
| Amazonbot / Amazon | 2026-05-12 | 2026-05-18 | A Day 5 content sweep, first crawler to pull /llms-full.txt; afterwards a repeat visitor to the case-study post. |
| PerplexityBot / Perplexity | 2026-05-18 | 2026-05-21 | Missed both predicted windows; first appeared after the Day 10 pause and ran its first wide content sweep on Day 14. |
| Applebot / Apple | 2026-05-18 | 2026-05-21 | Arrived after the pause; a repeat visitor to both case-study posts. |
| Meta-ExternalAgent / Meta | 2026-05-21 | 2026-05-21 | One genuine hit, on Day 14, after three forged claims to the identity on Day 3. |
Never appeared as genuine crawlers: Bytespider, CCBot, Google-Extended, Applebot-Extended and FacebookBot - all predicted, none arrived. The user-fetch agents ChatGPT-User and Perplexity-User did show up (Days 11 and 14) but are demand-side, not crawlers, so they sit outside this tally.
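The first-seen and last-seen cells in the table can be derived mechanically once every event carries an origin label. A minimal sketch, assuming a hypothetical event shape in which the genuine / forged / self decision has already been made upstream (by the ASN checks described in the daily updates):

```typescript
interface TallyEvent {
  date: string; // ISO date, e.g. "2026-05-08"
  bot: string;  // bot identity claimed by the user-agent
  origin: "genuine" | "forged" | "self";
}

interface TallyRow {
  bot: string;
  firstSeen: string;
  lastSeen: string;
}

// Reduce an event log to one row per bot, counting genuine traffic only.
// ISO dates sort correctly as plain strings, so no date parsing is needed.
function buildTally(events: TallyEvent[]): TallyRow[] {
  const rows: Record<string, TallyRow> = {};
  events.forEach(function (e) {
    if (e.origin !== "genuine") return; // forged and self-originated traffic is excluded
    const row = rows[e.bot];
    if (row === undefined) {
      rows[e.bot] = { bot: e.bot, firstSeen: e.date, lastSeen: e.date };
      return;
    }
    if (e.date < row.firstSeen) row.firstSeen = e.date;
    if (e.date > row.lastSeen) row.lastSeen = e.date;
  });

  // Order by first appearance, then by name for a stable result.
  return Object.keys(rows)
    .map(function (k) { return rows[k]; })
    .sort(function (a, b) {
      if (a.firstSeen !== b.firstSeen) return a.firstSeen < b.firstSeen ? -1 : 1;
      return a.bot < b.bot ? -1 : a.bot > b.bot ? 1 : 0;
    });
}
```

Filtering on origin before aggregating is what keeps a forged claim from setting a bot's first-seen date. That is why Meta-ExternalAgent is dated to its one genuine hit and not to the earlier forged claims.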