Technical Setup

robots.txt and AI Crawlers — What Brands Need to Know

A complete guide to which AI crawlers exist, what they do, and how to configure robots.txt to control AI access to your website.

Your website's robots.txt file is the first thing AI crawlers read when they visit your domain. If it tells them to leave, they will — and your brand won't appear in AI-generated answers from ChatGPT, Perplexity, Gemini, or Claude. Most brands don't realise their robots.txt is blocking AI crawlers. And the ones using a WAF like Cloudflare may be blocking them at the network level without robots.txt even being involved.

This guide covers every AI crawler that matters in 2026, what each one does, and how to configure your robots.txt to control AI access to your site.

Why AI Crawlers Exist

AI crawlers serve three distinct purposes, and understanding the difference matters for how you configure access:

Training crawlers scrape content for future model training. GPTBot (OpenAI), ClaudeBot (Anthropic), and Google-Extended (Google) fall here. Blocking these prevents your content from being used for training — but also means the model won't "know" your brand from its training data.

Search crawlers index content for real-time AI search results. OAI-SearchBot indexes for ChatGPT Search, Claude-SearchBot for Claude. They work like Googlebot — building an index queried when users ask questions.

Browsing bots fetch pages in real time during live conversations. ChatGPT-User, Claude-User, and Perplexity-User operate this way — blocking them means AI can't pull live data from your pages.

The Complete AI Crawler Reference

Tier 1 — Core AI Training Crawlers

These are the most important crawlers to be aware of. They collect data used to train the foundation models behind major AI platforms.

| Bot | Company | Purpose | robots.txt Token |
| --- | --- | --- | --- |
| GPTBot | OpenAI | Training data for ChatGPT models | GPTBot |
| ClaudeBot | Anthropic | Training data for Claude | ClaudeBot |
| PerplexityBot | Perplexity | Real-time search answers | PerplexityBot |
| Google-Extended | Google | Gemini / AI Overviews training | Google-Extended |

Tier 2 — Search and Browsing Bots

These crawlers fetch content for real-time AI search and live browsing during conversations. Blocking them has a more immediate impact on your AI visibility than blocking training crawlers.

| Bot | Company | Purpose | robots.txt Token |
| --- | --- | --- | --- |
| OAI-SearchBot | OpenAI | ChatGPT Search index | OAI-SearchBot |
| ChatGPT-User | OpenAI | Real-time browsing in ChatGPT | ChatGPT-User |
| OAI-AdsBot | OpenAI | Ad landing page validation | OAI-AdsBot |
| Claude-User | Anthropic | Real-time browsing in Claude | Claude-User |
| Claude-SearchBot | Anthropic | Search result indexing for Claude | Claude-SearchBot |
| anthropic-ai | Anthropic | General-purpose crawler | anthropic-ai |
| Perplexity-User | Perplexity | Real-time fetching during queries | Perplexity-User |
| GoogleOther | Google | Supplementary AI training crawling | GoogleOther |
| Bytespider | ByteDance | TikTok AI / Doubao training | Bytespider |

Note on Bytespider: ByteDance does not publish official documentation for this crawler. Many site operators block it due to aggressive crawl volumes and lack of transparency.

Tier 3 — AI-Adjacent Crawlers

| Bot | Company | Purpose | robots.txt Token |
| --- | --- | --- | --- |
| CCBot | Common Crawl | Open dataset used by many LLMs | CCBot |
| Amazonbot | Amazon | Alexa / Amazon AI features | Amazonbot |
| Meta-ExternalAgent | Meta | Meta AI / Llama training | Meta-ExternalAgent |
| Applebot-Extended | Apple | Apple Intelligence / Siri | Applebot-Extended |

Other crawlers worth knowing about: Bingbot powers Microsoft Copilot — blocking Bingbot means blocking Copilot entirely. cohere-ai is Cohere's training crawler. Diffbot builds knowledge graphs used by multiple AI applications.

The Default Blocking Problem

Many brands are blocking AI crawlers without knowing it. This happens in three ways:

Platform defaults. Shopify, WordPress, and other CMS platforms ship with robots.txt configurations that don't explicitly allow AI crawlers. If you haven't edited your robots.txt since 2024, you likely haven't accounted for the wave of new AI crawlers.

Aggressive bot protection. Cloudflare's "Bot Fight Mode," AWS WAF rules, and similar services block automated traffic — and AI crawlers are automated traffic. Without explicit allow-lists, these services block legitimate AI bots alongside malicious scrapers.

Inherited configurations. Development teams copy robots.txt files from templates that predate the AI crawler era and frequently include blanket bot-blocking rules.

How to Check Your AI Crawler Access

Manual check: Visit yourdomain.com/robots.txt in a browser and search for the bot names listed above. If you see Disallow: / for GPTBot, ClaudeBot, or PerplexityBot, those crawlers are blocked.

Automated check: Crawl Radar tests your site against every major AI crawler and reports which ones can access your pages — including WAF-level blocks that robots.txt alone won't reveal.
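The manual check can also be scripted with Python's standard-library robotparser. This is a minimal sketch of the robots.txt side only (it won't detect WAF-level blocks); the sample robots.txt and the bot list are illustrative:

```python
from urllib import robotparser

# The Tier 1 training crawlers from the reference tables above.
AI_BOTS = ["GPTBot", "ClaudeBot", "PerplexityBot", "Google-Extended"]

def check_ai_access(robots_txt: str, url: str = "https://example.com/") -> dict:
    """Return {bot: True/False} for whether each bot may fetch `url`."""
    rp = robotparser.RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return {bot: rp.can_fetch(bot, url) for bot in AI_BOTS}

# Illustrative robots.txt: blocks GPTBot, allows everyone else.
sample = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""
print(check_ai_access(sample))
# → {'GPTBot': False, 'ClaudeBot': True, 'PerplexityBot': True, 'Google-Extended': True}
```

In practice you would fetch yourdomain.com/robots.txt and pass its text to check_ai_access instead of the sample string.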

Example robots.txt Configuration

Here's a configuration that allows all major AI crawlers to access your site:

# AI Crawlers — Allow
User-agent: GPTBot
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: Google-Extended
Allow: /

If you want to allow AI search and browsing but block training, you can be selective:

# Block training crawlers
User-agent: GPTBot
Disallow: /

# Allow search and browsing bots
User-agent: OAI-SearchBot
Allow: /

User-agent: ChatGPT-User
Allow: /

User-agent: PerplexityBot
Allow: /

This tells OpenAI not to use your content for model training while still allowing ChatGPT Search and Perplexity to cite your pages in real-time answers.
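Before deploying a split configuration like this, it's worth sanity-checking that each bot gets the access you intend. A quick sketch with Python's standard-library robotparser, parsing the selective example above (the test URL is arbitrary):

```python
from urllib import robotparser

# The selective configuration: block training, allow search and browsing.
selective = """\
User-agent: GPTBot
Disallow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: ChatGPT-User
Allow: /

User-agent: PerplexityBot
Allow: /
"""

rp = robotparser.RobotFileParser()
rp.parse(selective.splitlines())

for bot in ["GPTBot", "OAI-SearchBot", "ChatGPT-User", "PerplexityBot"]:
    verdict = "allowed" if rp.can_fetch(bot, "https://example.com/page") else "blocked"
    print(f"{bot}: {verdict}")
# GPTBot is blocked; the other three are allowed.
```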

WAF vs robots.txt — A Critical Distinction

robots.txt is advisory — it tells well-behaved crawlers what to do, but the crawler must first reach your server to read the file.

A Web Application Firewall (WAF) operates at the network level. It blocks requests before they reach your server. If your Cloudflare settings block GPTBot's IP range or user-agent string, your robots.txt is irrelevant — the crawler never gets to read it.

This is why testing actual access matters more than reading your robots.txt. Many brands have a robots.txt that says "allow" while their WAF silently blocks every AI crawler at the door.
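The distinction is easy to demonstrate locally. The sketch below stands up a throwaway HTTP server that mimics a user-agent-based WAF rule — the blocked-agent list and the 403 response are assumptions for illustration, not any vendor's actual behaviour. The same URL returns 200 or 403 depending purely on the User-Agent header, and robots.txt is never consulted:

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import Request, urlopen
from urllib.error import HTTPError

# Hypothetical WAF rule: reject known AI crawler user-agents outright.
BLOCKED_AGENTS = ("GPTBot", "ClaudeBot", "PerplexityBot")

class WafHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        ua = self.headers.get("User-Agent", "")
        # Network-level decision, made before robots.txt is ever served.
        self.send_response(403 if any(b in ua for b in BLOCKED_AGENTS) else 200)
        self.end_headers()

    def log_message(self, *args):  # silence per-request logging
        pass

server = HTTPServer(("127.0.0.1", 0), WafHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
url = f"http://127.0.0.1:{server.server_port}/robots.txt"

def status_for(user_agent: str) -> int:
    try:
        return urlopen(Request(url, headers={"User-Agent": user_agent})).status
    except HTTPError as err:
        return err.code

browser_status = status_for("Mozilla/5.0")  # looks human
crawler_status = status_for("GPTBot/1.1")   # AI crawler user-agent
print(browser_status, crawler_status)       # → 200 403
server.shutdown()
```

Note that the request being refused here is for robots.txt itself — the failure mode where your allow rules exist but no crawler ever gets to read them.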

The Agentic Browser Challenge

A new category of AI agents emerged in 2025-2026 that robots.txt cannot control at all.

OpenAI's Operator and Google's Project Mariner are agentic browsers — they control real browser instances with standard Chrome user-agent strings. They navigate websites the way a human would. From your server's perspective, they look identical to a human visitor. There is no robots.txt token to block them and no reliable way to detect them in server logs.

The practical implication: assume that AI agents will eventually access anything a human can access on the open web. Focus on making that content work for you — structured, accurate, and citation-ready — rather than trying to restrict it.

Key Takeaways

  • AI crawlers fall into three categories: training (GPTBot, ClaudeBot), search (OAI-SearchBot), and browsing (ChatGPT-User) — each has different visibility implications
  • Blocking training crawlers is a choice, but blocking search and browsing bots directly reduces your AI visibility
  • robots.txt is advisory — WAFs like Cloudflare can block AI crawlers at the network level before robots.txt is even read
  • Many CMS platforms and security tools block AI crawlers by default without explicit configuration
  • Agentic browsers like OpenAI Operator use standard Chrome user-agents and cannot be controlled via robots.txt
  • Test actual crawler access with Crawl Radar rather than relying on your robots.txt file alone

Frequently Asked Questions

Should I block any AI crawlers?
For most brands, the answer is no. Blocking AI crawlers means your content won't appear in AI-generated answers — and that's a visibility cost you'll feel as AI search grows. The exception is if you have proprietary content you don't want used for model training. In that case, you can block training crawlers (GPTBot, Google-Extended) while keeping search and browsing bots (OAI-SearchBot, ChatGPT-User, PerplexityBot) allowed. This lets AI platforms cite your content without using it for training.
Does blocking GPTBot affect Google rankings?
No. GPTBot is OpenAI's crawler, completely separate from Googlebot. Blocking GPTBot has zero impact on Google Search rankings. However, blocking Google-Extended can affect whether your content appears in Google AI Overviews and Gemini answers — which increasingly drive clicks. And blocking Bingbot affects Microsoft Copilot, which uses Bing's index.
How does Shopify handle AI crawlers by default?
Shopify's default robots.txt blocks several bot patterns and doesn't explicitly allow AI crawlers. Some Shopify stores also use aggressive bot protection via Cloudflare or Shopify's built-in security that can inadvertently block legitimate AI crawlers at the network level — before robots.txt is even read. If you're on Shopify, test your actual crawler accessibility with a tool like Crawl Radar rather than relying on your robots.txt file alone.
How do I know if my WAF is blocking AI bots?
Check your WAF or CDN dashboard (Cloudflare, AWS WAF, Akamai) for blocked requests from known AI crawler user-agent strings like GPTBot, ClaudeBot, or PerplexityBot. Many WAFs have a 'bot fight' mode that blocks automated traffic indiscriminately. You can also test directly — Cited's Crawl Radar tool simulates requests from each major AI crawler and reports which ones get through.

Test which AI crawlers can reach your site

Free Crawl Radar Scan →