Block or Allow AI Crawlers? A Per-Bot Decision Guide

Decision matrix for whether to block or allow AI crawlers, mapping GPTBot, ClaudeBot, PerplexityBot, and Google-Extended along a visibility-versus-control axis

Whether to block or allow AI crawlers is now a real line item in technical SEO, and a single site-wide rule will cost you. GPTBot, ClaudeBot, PerplexityBot, and Google-Extended do different jobs, and blocking each one carries a different price. Some feed model training you may not care about. Others decide whether your brand appears when a buyer asks ChatGPT or Perplexity for a shortlist. This guide replaces blanket advice with a per-bot framework: what each crawler powers, what you lose by blocking it, and how to confirm your robots.txt does what you think it does.

Most articles hand you a copy-paste block list and call it done. The real decision isn't block everything or allow everything—it's matching each bot to the visibility you'd lose and the control you'd gain.

Block or allow AI crawlers: the short answer

Allow the bots that put you in AI answers; treat training bots as a separate, lower-stakes decision. Search and retrieval crawlers—OAI-SearchBot, PerplexityBot, Claude-SearchBot—are what get you cited when someone asks an assistant for a recommendation. Block those and you remove yourself from the results.

Training crawlers (GPTBot, ClaudeBot, Google-Extended) are a separate, optional decision. Block them only for a specific reason: licensing leverage, genuinely proprietary content, or a brand-safety policy. Whatever you choose, don't block the wrong bot by accident. The most common self-inflicted wound here is cutting off your own visibility while believing you're protecting your data.

What's actually at stake: visibility versus control

Two different decisions hide inside "should I block AI crawlers," and conflating them produces most bad robots.txt files.

Decision one is about training. Bots like GPTBot and ClaudeBot collect public pages that may train future models. Blocking them affects whether your content shapes tomorrow's model weights. It does little to your visibility today—and since your content likely already sits in past training sets and third-party copies, the control you gain is partial at best.

Decision two is about live answers. Search and retrieval bots index or fetch your pages to answer questions right now. Block these and you vanish when ChatGPT runs a search or Perplexity builds a citation list. This is the reversible-but-expensive lever: the cost lands immediately, every time someone asks.

The four bots, decoded: training, search, and user fetches

AI crawlers are automated bots that fetch web pages for AI systems—to train models, build search indexes, or pull live answers for assistants like ChatGPT, Claude, and Perplexity. Before you decide anything, you need the map. Each operator now runs multiple bots with distinct user agents and distinct robots.txt behavior—so "block GPTBot" and "block OpenAI" are not the same instruction.

| User agent | Operator | Job | Obeys robots.txt? | What blocking costs you |
| --- | --- | --- | --- | --- |
| GPTBot | OpenAI | Train foundation models | Yes | Future training inclusion |
| OAI-SearchBot | OpenAI | Index for ChatGPT search | Yes | Your spot in ChatGPT search answers |
| ChatGPT-User | OpenAI | Live fetch when a user asks | No (user-initiated) | Little; robots.txt may not apply |
| ClaudeBot | Anthropic | Train Claude models | Yes | Future training inclusion |
| Claude-SearchBot | Anthropic | Index for Claude search | Yes | Visibility in Claude search answers |
| Claude-User | Anthropic | Live fetch for a user query | Yes | Real-time answers that cite you |
| PerplexityBot | Perplexity | Build the answer index | Yes (states it does) | Your place in Perplexity citations |
| Perplexity-User | Perplexity | Live user-triggered fetch | May ignore robots.txt | Little you can reliably block |
| Google-Extended | Google | Token: Gemini training/grounding | Token only | Gemini training use, not Search or AI Overviews |

The pattern that matters: every operator now separates training, search indexing, and user-initiated fetches. That separation is your lever. Use it surgically.

GPTBot, OAI-SearchBot, and ChatGPT-User: OpenAI's three controls

OpenAI runs three crawlers, and they are not interchangeable. GPTBot trains models. OAI-SearchBot indexes the web so ChatGPT can surface pages in search. ChatGPT-User fetches a page live when a person asks ChatGPT to look something up.

The practical upshot: blocking GPTBot keeps your pages out of training but does nothing to your presence in ChatGPT's search answers—that's a different bot. OpenAI's crawler documentation recommends allowing OAI-SearchBot and notes that because ChatGPT-User actions are user-initiated, "robots.txt rules may not apply." So if you want to get recommended by ChatGPT while still opting out of training, the move is simple: disallow GPTBot, allow OAI-SearchBot. The single most expensive mistake is blocking OAI-SearchBot by accident—a direct visibility loss with no upside.
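
The "disallow GPTBot, allow OAI-SearchBot" move can be sanity-checked before you deploy it. Below is a minimal sketch using Python's standard-library `urllib.robotparser`; the robots.txt text, the `allowed` helper, and the `example.com` URL are illustrative assumptions, not anything published by OpenAI. Python's parser only approximates how a given crawler reads the file, so treat the result as a lint check rather than proof of crawler behavior.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: opt out of training, stay in ChatGPT search.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: OAI-SearchBot
Allow: /
"""


def allowed(robots_txt: str, user_agent: str,
            url: str = "https://example.com/pricing") -> bool:
    """Return True if this robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


for bot in ("GPTBot", "OAI-SearchBot", "ChatGPT-User"):
    print(bot, "allowed" if allowed(ROBOTS_TXT, bot) else "blocked")
```

ChatGPT-User comes back as allowed here only because no rule names it, which is a reminder that each user agent needs its own decision.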

ClaudeBot, Claude-SearchBot, and Claude-User: Anthropic's split

Anthropic mirrors the same three-way split, with one notable difference. ClaudeBot trains models; Claude-SearchBot indexes pages to improve Claude's search answers; Claude-User fetches a page when a user's question requires it.

Anthropic states that all of its bots respect standard robots.txt directives—including Disallow and the non-standard Crawl-delay—and that you can control each independently. That's the meaningful contrast with OpenAI: where ChatGPT-User may ignore robots.txt, Anthropic's crawler documentation says Claude-User honors it. So for Claude, your robots.txt is a more reliable instrument across all three bots. The decision framework is identical: keep Claude-SearchBot and Claude-User open to stay in live answers, and decide ClaudeBot on its own merits.
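
Because Anthropic says its bots honor Crawl-delay, it can be worth confirming that the value you wrote is the value a parser reads back. This is a small sketch with `urllib.robotparser`; the 10-second delay and the `/drafts/` path are made-up values for illustration, and the check only shows how Python's parser interprets the file, not how ClaudeBot schedules requests.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: slow down the training bot, leave search unrestricted.
ROBOTS_TXT = """\
User-agent: ClaudeBot
Crawl-delay: 10
Disallow: /drafts/

User-agent: Claude-SearchBot
Allow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# crawl_delay returns the integer from the matching group, or None if unset.
training_delay = parser.crawl_delay("ClaudeBot")
search_delay = parser.crawl_delay("Claude-SearchBot")
drafts_open = parser.can_fetch("ClaudeBot", "https://example.com/drafts/q3")

print(training_delay, search_delay, drafts_open)
```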

PerplexityBot and Perplexity-User: when robots.txt isn't enough

Perplexity splits its crawling too: PerplexityBot builds the index behind every answer, and Perplexity-User fetches pages in real time when a person asks a current question. Per Perplexity's published crawler documentation, PerplexityBot won't index the full content of a site that disallows it—though it may still retain the domain, headline, and a brief factual summary.

Here's why Perplexity needs its own section. In August 2025, Cloudflare reported that Perplexity used undeclared crawlers—rotating user agents and source networks, sometimes impersonating a Chrome-on-macOS browser—to reach content on sites that had explicitly blocked its declared bots, generating an estimated 3–6 million requests a day before Cloudflare delisted it as a Verified Bot. Perplexity disputed the framing, arguing Perplexity-User acts only on live user requests.

The lesson isn't "Perplexity is the villain." It's structural: a Disallow line is a request, not a wall. If blocking genuinely matters for a given bot, robots.txt alone won't prove it worked. (If your problem is the opposite—Perplexity citing rivals instead of you—the fix is content and citations, not crawler config, which we cover in why Perplexity cites competitors instead of you.)

Google-Extended: the most misunderstood toggle

Google-Extended is the bot people block for the wrong reason. It isn't a crawler at all—it's a robots.txt control token. Google Search Central's documentation is explicit: "Google-Extended doesn't have a separate HTTP request user agent string," and it "does not impact a site's inclusion in Google Search nor is it used as a ranking signal."

So what does it do? It governs whether content Google already crawls may be used to train Gemini and ground generative-AI products. What it does not do is remove you from AI Overviews—those are part of Google Search and draw on the live Search index that Googlebot builds, not on the Google-Extended training pipeline. Blocking Google-Extended to "get out of AI Overviews" is a misfire; it changes nothing there. If you're worried about showing up in Google's AI answers at all, that's a visibility problem, not a crawler one—it starts with being citable, not with blocking a token.

Beyond the big four: other AI crawlers worth knowing

The four operators above dominate the decision, but several more bots show up in real server logs—and a few follow the same training-versus-search logic.

| User agent | Operator | Job | What to know |
| --- | --- | --- | --- |
| CCBot | Common Crawl | Builds the open crawl dataset many AI models train on | Obeys robots.txt; blocking removes you from a dataset reused across many models |
| Bytespider | ByteDance (TikTok) | Trains ByteDance's models | Reported to crawl aggressively and disregard robots.txt at times; verify with logs |
| Amazonbot | Amazon | Feeds Alexa and Amazon's AI answers | Obeys robots.txt |
| Meta-ExternalAgent | Meta | Trains Meta AI and Llama | Obeys robots.txt |
| Applebot-Extended | Apple | Opt-out token for Apple Intelligence training | Token only; like Google-Extended, it doesn't change Siri or Spotlight indexing |

The same three-question test applies to every one of them: is it search/retrieval (allow, because it feeds answers), a user-triggered fetch (often outside robots.txt's reach, so a block is hard to enforce), or training (decide on the merits)?

The per-bot decision framework

Now the part the ranking pages skip: a framework instead of a verdict. The goal isn't crawler control for its own sake—it's answer engine optimization, making sure that when someone asks an assistant for options, your brand is on the list. Score each bot against one question: does allowing it help me get recommended, or does blocking it protect something I genuinely need to protect?

| Bot group | Default | Block only if |
| --- | --- | --- |
| Search / retrieval (OAI-SearchBot, PerplexityBot, Claude-SearchBot) | Allow | Almost never; blocking is a direct AI visibility loss |
| User fetches (ChatGPT-User, Claude-User, Perplexity-User) | Allow | Rarely worth it; ChatGPT-User and Perplexity-User may not honor robots.txt, so a block there is largely symbolic (Claude-User is documented as honoring it) |
| Training (GPTBot, ClaudeBot, Google-Extended, CCBot) | Decide deliberately | You have licensing leverage to protect, paywalled or proprietary content, or a legal/brand-safety mandate |

Then layer your role on top, because the right answer changes with your business model:

  • B2B SaaS or a startup chasing AI shortlists: allow everything. You want to be in the training data and the live answers. Crawler-blocking works against your own growth.
  • Publisher with a licensing strategy: allow search and retrieval, block training to preserve negotiating leverage. Many major news publishers have adopted this posture toward GPTBot and similar training crawlers.
  • Agency managing many clients: set a documented per-client default (usually allow) and monitor outcomes rather than trusting the config. Deciding how AI systems are allowed to learn from and represent each client is part of the job.

This framework is the actual deliverable. A block list tells you what to type; the framework tells you why—the only thing that survives the next bot launch.
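
The framework above can be written down as a small lookup, which makes the per-bot default explicit and easy to review. This is a sketch only: the group labels, the `default_action` function, and the `protect_training` flag are names invented here to mirror the table, and a bot the table doesn't cover is deliberately returned as "review" rather than guessed.

```python
# Bot groups taken from the framework table; the labels are this sketch's own.
BOT_GROUPS = {
    "OAI-SearchBot": "search",
    "PerplexityBot": "search",
    "Claude-SearchBot": "search",
    "ChatGPT-User": "user_fetch",
    "Claude-User": "user_fetch",
    "Perplexity-User": "user_fetch",
    "GPTBot": "training",
    "ClaudeBot": "training",
    "Google-Extended": "training",
    "CCBot": "training",
}


def default_action(bot: str, protect_training: bool = False) -> str:
    """Suggest 'allow', 'block', or 'review' for a bot.

    protect_training=True models a site with a licensing, proprietary-content,
    or brand-safety reason to opt out of model training.
    """
    group = BOT_GROUPS.get(bot)
    if group is None:
        return "review"  # unknown bot: classify it before deciding
    if group == "training" and protect_training:
        return "block"
    return "allow"


# A B2B SaaS site versus a publisher with a licensing strategy.
print(default_action("GPTBot"))                         # allow
print(default_action("GPTBot", protect_training=True))  # block
print(default_action("OAI-SearchBot", protect_training=True))  # allow
```

Keeping search and user-fetch bots at "allow" regardless of the flag encodes the article's main point: the training decision should never leak into the visibility decision.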

Blocking is a request, not a wall: verify what's really happening

Your robots.txt is a statement of intent. Whether it's honored is an empirical question—so measure it. Two checks close the gap between what you typed and what's true.

Server log view comparing declared AI crawler user agents against the actual hits recorded, used to verify robots.txt rules

First, read your server logs. Filter by user agent and compare declared bots against actual hits. If you disallowed PerplexityBot but still see fetches from Perplexity IP ranges—or generic browser agents arriving from data-center IPs right after a Perplexity query—your rule isn't being enforced, and you've learned something a config file can't tell you.
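
As a rough illustration of that log check, the sketch below counts hits per declared AI user agent and flags any bot you disallowed that still fetched pages. It assumes the common "combined" access-log format; the sample lines, IP addresses, and user-agent strings are fabricated for the example, and the function names are this sketch's own. It cannot detect undeclared crawlers posing as browsers, which need IP-level analysis.

```python
import re
from collections import Counter

# Declared user-agent tokens. Google-Extended is omitted on purpose:
# it is a robots.txt token and never shows up as a user agent in logs.
AI_BOT_TOKENS = [
    "GPTBot", "OAI-SearchBot", "ChatGPT-User",
    "ClaudeBot", "Claude-SearchBot", "Claude-User",
    "PerplexityBot", "Perplexity-User", "CCBot", "Bytespider",
]

# Assumed combined log format:
#   "METHOD path protocol" status bytes "referer" "user-agent"
LOG_RE = re.compile(
    r'"[A-Z]+ (?P<path>\S+) [^"]*" \d{3} \S+ "[^"]*" "(?P<ua>[^"]*)"'
)


def count_ai_hits(log_lines):
    """Count page fetches per AI bot, ignoring requests for /robots.txt."""
    hits = Counter()
    for line in log_lines:
        match = LOG_RE.search(line)
        if match is None or match["path"] == "/robots.txt":
            continue  # unparseable line, or the bot just reading the rules
        ua = match["ua"].lower()
        for token in AI_BOT_TOKENS:
            if token.lower() in ua:
                hits[token] += 1
                break
    return hits


def unexpected_hits(hits, disallowed):
    """Return hits from bots that robots.txt was supposed to keep out."""
    return {bot: n for bot, n in hits.items() if bot in disallowed}


# Fabricated sample lines for illustration only.
PPLX = "Mozilla/5.0 (compatible; PerplexityBot/1.0; +https://perplexity.ai/perplexitybot)"
SAMPLE_LOG = [
    '203.0.113.5 - - [01/Sep/2025:10:00:00 +0000] "GET / HTTP/1.1" 200 512 "-" '
    '"Mozilla/5.0 (compatible; GPTBot/1.2; +https://openai.com/gptbot)"',
    f'203.0.113.6 - - [01/Sep/2025:10:01:00 +0000] "GET /robots.txt HTTP/1.1" 200 120 "-" "{PPLX}"',
    f'203.0.113.6 - - [01/Sep/2025:10:01:05 +0000] "GET /pricing HTTP/1.1" 200 900 "-" "{PPLX}"',
    f'203.0.113.6 - - [01/Sep/2025:10:01:09 +0000] "GET /docs HTTP/1.1" 200 700 "-" "{PPLX}"',
    '198.51.100.9 - - [01/Sep/2025:10:02:00 +0000] "GET /pricing HTTP/1.1" 200 900 "-" '
    '"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) Chrome/124.0.0.0 Safari/537.36"',
]

hits = count_ai_hits(SAMPLE_LOG)
print(dict(hits))
print(unexpected_hits(hits, disallowed={"PerplexityBot"}))
```

Requests for `/robots.txt` itself are skipped because a compliant bot has to read the file to obey it; only fetches of other paths count as evidence that a rule was ignored.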

Second, watch the outcome, not the config. The real question is never "did I block GPTBot." It's "does my brand still get cited?" Track your share of voice in AI answers—your brand mentions in ChatGPT and citations in Perplexity—over time, so you see the effect of a crawler decision instead of guessing. To compare tools for that, there's a tested roundup of brand-visibility trackers across AI search, and a breakdown of which AI visibility metrics actually matter.
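
One simple way to quantify that outcome is the fraction of sampled AI answers that mention your brand, compared before and after a crawler change. The sketch below assumes you have already collected, by whatever tool you use, a list of brands cited per answer; the brand names and sample data are invented, and this is one possible definition of share of voice rather than a standard metric.

```python
def share_of_voice(answers, brand):
    """Fraction of answers whose cited-brand list includes `brand`."""
    if not answers:
        return 0.0
    mentions = sum(1 for cited in answers if brand in cited)
    return mentions / len(answers)


# Invented samples: brands cited in four answers before and after a change.
before = [["Acme", "Rival"], ["Rival"], ["Acme"], ["Rival", "Other"]]
after = [["Rival"], ["Rival", "Other"], ["Acme"], ["Other"]]

sov_before = share_of_voice(before, "Acme")
sov_after = share_of_voice(after, "Acme")
print(f"before={sov_before:.2f} after={sov_after:.2f} change={sov_after - sov_before:+.2f}")
```

A real comparison needs the same prompts, the same assistants, and enough samples to separate a crawler decision from ordinary answer-to-answer variation.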

A common failure pattern: a B2B SaaS team blocks GPTBot for "data control," assumes the job is done, and never looks at OAI-SearchBot. Months later their visibility in ChatGPT is unchanged, because OAI-SearchBot was open the entire time and GPTBot was never the lever they thought they were pulling. The fix wasn't to block more. It was to measure the outcome and stop optimizing the wrong bot.

A copy-paste robots.txt starting point (not gospel)

Pick the posture that matches your goal, then verify it. These are starting points, not finished policies—adjust per the framework above.

Posture A — Visibility-first (recommended for brands that want AI recommendations): allow everything. Don't block what feeds the answers you're trying to win.

# Visibility-first: allow all major AI crawlers
User-agent: GPTBot
Allow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: Claude-SearchBot
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: Google-Extended
Allow: /

Posture B — Control-first (for publishers protecting training use): block training, keep search and retrieval open so you still appear in answers.

# Control-first: block model training, keep AI search visibility
User-agent: OAI-SearchBot
Allow: /

User-agent: Claude-SearchBot
Allow: /

User-agent: PerplexityBot
Allow: /

# Opt out of training
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Applebot-Extended
Disallow: /

Whichever posture you pick, pair it with structured data and a clean entity layer so the bots you do allow can actually understand you. And remember: blocking training doesn't erase your brand from models that already learned about you elsewhere—much of that knowledge arrives through off-site citations on Reddit, G2, and Wikipedia, which no robots.txt rule controls.
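
If you manage several sites, the postures above can be generated from a policy table instead of edited by hand, which keeps each per-bot decision in one reviewable place. The sketch below is an assumption-laden example: `POSTURE_B` mirrors the control-first file shown earlier, and `render_robots_txt` is a hypothetical helper that only emits site-wide Allow or Disallow groups.

```python
# True = allow the bot site-wide, False = disallow it site-wide.
POSTURE_B = {
    "OAI-SearchBot": True,
    "Claude-SearchBot": True,
    "PerplexityBot": True,
    "GPTBot": False,
    "ClaudeBot": False,
    "Google-Extended": False,
    "CCBot": False,
    "Applebot-Extended": False,
}


def render_robots_txt(policy):
    """Render one User-agent group per bot, separated by blank lines."""
    groups = []
    for bot, allow in policy.items():
        directive = "Allow" if allow else "Disallow"
        groups.append(f"User-agent: {bot}\n{directive}: /")
    return "\n\n".join(groups) + "\n"


robots_txt = render_robots_txt(POSTURE_B)
print(robots_txt)
```

Generating the file does not replace verification: the output is still only a request to each crawler, so the log and outcome checks described earlier apply to it unchanged.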

Common questions

Does blocking AI crawlers hurt my Google rankings?
No. GPTBot, ClaudeBot, and PerplexityBot are separate from Googlebot, so blocking them has no effect on Google Search. Even Google-Extended, per Google's documentation, "does not impact a site's inclusion in Google Search nor is it used as a ranking signal."

Do AI crawlers obey robots.txt?
Most declared training and search bots do—GPTBot, OAI-SearchBot, ClaudeBot, Claude-SearchBot, and PerplexityBot all state they honor it. User-triggered fetchers are the exception: ChatGPT-User and Perplexity-User may ignore robots.txt because a human initiated the request. And Cloudflare reported in 2025 that Perplexity reached blocked pages using undeclared crawlers, so "honored" never means "guaranteed."

Will blocking GPTBot remove me from ChatGPT?
No—GPTBot only handles training. ChatGPT's search answers are powered by OAI-SearchBot, a different bot. Block GPTBot, keep OAI-SearchBot allowed, and your pages can still surface in ChatGPT.

Does blocking Google-Extended stop AI Overviews?
No. AI Overviews are part of Google Search and draw on the live Search index built by Googlebot. Google-Extended only governs whether your content trains or grounds Gemini, so blocking it changes nothing about AI Overviews.

Can I actually block Perplexity?
Officially, yes for the index: PerplexityBot states it honors robots.txt. Perplexity-User is user-triggered and may ignore it. In practice, Cloudflare reported in 2025 that Perplexity reached blocked content using undeclared crawlers, so reliable enforcement may require network- or WAF-level blocking plus monitoring to confirm it worked.

If my content is already in training data, is blocking pointless?
Largely, for past data: blocking affects future crawls, not what models already learned. The higher-leverage decision is your visibility in live answers, governed by the search and retrieval bots, not the training ones.


Written by

Founder of MaxAEO. Helping brands get found in AI search across ChatGPT, Perplexity, Google AI Overviews, and more.
